Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild
Canjie Luo, Qingxiang Lin, Yuliang Liu, Lianwen Jin, Chunhua Shen
Abstract
Scene text recognition is an important task in computer vision. Despite tremendous progress achieved in the past few years, issues such as varying font styles, arbitrary shapes and complex backgrounds have made the problem very challenging. In this work, we propose to improve text recognition from a new perspective by separating the text content from complex backgrounds, thus making the recognition considerably easier and significantly improving recognition accuracy. To this end, we exploit generative adversarial networks (GANs) for removing backgrounds while retaining the text content. As vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images, we propose an adversarial learning framework for the generation and recognition of multiple characters in an image. The proposed framework consists of an attention-based recognizer and a generative adversarial architecture. Furthermore, to tackle the issue of lacking paired training samples, we design an interactive joint training scheme, which shares attention masks from the recognizer to the discriminator, and enables the discriminator to extract the features of each character for further adversarial training. Benefiting from the character-level adversarial training, our framework requires only unpaired simple data for style supervision. Each target style sample, containing only one randomly chosen character, can be simply synthesized online during the training. This is significant as the training does not require costly paired samples or character-level annotations. Thus, only the input images and corresponding text labels are needed. In addition to the style normalization of the backgrounds, we refine character patterns to ease the recognition task. A feedback mechanism is proposed to bridge the gap between the discriminator and the recognizer. Therefore, the discriminator can guide the generator according to the confusion of the recognizer, so that the generated patterns are clearer for recognition. Experiments on various benchmarks, including both regular and irregular text, demonstrate that our method significantly reduces the difficulty of recognition. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.

Lianwen Jin (E-mail: [email protected])
Chunhua Shen (E-mail: [email protected])
Canjie Luo (E-mail: [email protected])
Qingxiang Lin (E-mail: [email protected])
Yuliang Liu (E-mail: [email protected])
Affiliations: South China University of Technology; The University of Adelaide; Monash University; SCUT-Zhuhai Institute of Modern Industrial Innovation
Keywords
Text recognition · Attention mechanism · Generative adversarial network · Separation of content and style
1 Introduction

Recognizing text in the wild has attracted great interest in computer vision (Ye and Doermann, 2015; Zhu et al., 2016; Yang et al., 2017; Shi et al., 2018; Yang et al., 2019a). Recently, methods based on convolutional neural networks (CNNs) (Wang et al., 2012; Jaderberg et al., 2015, 2016) have significantly improved the accuracy of scene text recognition. Recurrent neural networks (RNNs) (He et al., 2016b; Shi et al., 2016, 2017) and attention mechanisms (Lee and Osindero, 2016; Cheng et al., 2017, 2018; Yang et al., 2017) are also beneficial for recognition.
Fig. 1
Examples of scene text with complex backgrounds, making recognition very challenging.
Nevertheless, recognizing text in natural images is still challenging and largely remains unsolved (Shi et al., 2018). As shown in Figure 1, text is found in various scenes, exhibiting complex backgrounds. The complex backgrounds cause difficulties for recognition. For instance, complicated images often lead to attention drift (Cheng et al., 2017) for attention networks. Thus, if the complex background style is normalized to a clean one, the recognition difficulty significantly decreases.

With the development of GANs (Johnson et al., 2016; Cheng et al., 2019; Jing et al., 2019) in recent years, it is possible to migrate the scene background from a complex style to a clean style in scene text images. However, vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images (Fang et al., 2019). As shown in Figure 2 (a), directly applying the off-the-shelf CycleGAN fails to retain some strokes of the characters. In addition, as reported by Liu et al. (Liu et al., 2018b), applying a similar idea of image recovery to normalize the backgrounds for sequence-like objects fails to generate clean images. As illustrated in Figure 2 (b), some characters in the generated images are corrupted, which leads to misclassification. One possible reason for this is that the discriminator is designed to focus on non-sequential objects with global coarse supervision (Zhang et al., 2019). Therefore, the generation of sequence-like characters requires more fine-grained supervision.

One potential solution is to employ pixel-wise supervision (Isola et al., 2017), which requires paired training samples aligned at the pixel level. However, it is impossible to collect paired training samples in the wild. Furthermore, annotating scene text images with pixel-wise labels can be intractably expensive. To address the lack of paired data, it is possible to synthesize a large number of paired training samples, because synthetic data is cheaper to obtain. This may be why most state-of-the-art scene text recognition methods (Cheng et al., 2018; Shi et al., 2018; Luo et al., 2019) only use synthetic samples (Jaderberg et al., 2014a; Gupta et al., 2016) for training, as tens of millions of training samples are immediately available. However, the experiments of Li et al. (Li et al., 2019) suggest that there is much room for improvement in synthesis engines. Typically, a recognizer trained using real data
Fig. 2
Text content extraction of (a) CycleGAN, (b) Liu et al. (Liu et al., 2018b) and (c) our method. Our method uses character-level adversarial training and thus better preserves the strokes of every character while removing complex backgrounds.

significantly outperforms one trained using synthetic data, owing to the domain gap between artificial and real data. Thus, to enable broader application, our goal here is to improve GANs to meet the requirements of text image generation and to address the unpaired data issue.

We propose an adversarial learning framework with an interactive joint training scheme, which succeeds in separating text content from background noise by using only source training images and the corresponding text labels. The framework consists of an attention-based recognizer and a generative adversarial architecture. We take advantage of the attention mechanism in the attention-based recognizer to extract the features of each character for further adversarial training. In contrast to global coarse supervision, character-level adversarial training provides guidance for the generator in a fine-grained manner, which is critical to the success of our approach.

Our proposed framework is a meta framework. Thus, recent mainstream recognizers (Cheng et al., 2018; Shi et al., 2018; Luo et al., 2019; Li et al., 2019) equipped with attention-based decoders (Bahdanau et al., 2015) can be integrated into our framework. As illustrated in Figure 3, the attention-based recognizer predicts a mask for each character, which is shared with the discriminator. Thus, the discriminator is able to focus on every character and guide the generator to filter out various background styles while retaining the character content. Benefiting from the attention mechanism, the interactive joint training scheme requires only the images and corresponding text labels, without character bounding box annotations. Simultaneously, the target style training samples can be simply synthesized online during the training. As shown in Figure 4, for each target style sample, we randomly choose one character and simply render it onto a clean background. Each sample contains a black character on a white background or a white character on a black background. The target style samples are character-level, whereas the input style samples are word-level. The unpaired training samples make our training process flexible.

Moreover, we take the interactive joint training scheme a step further. In addition to the sharing of attention masks, we propose a feedback mechanism, which bridges the gap between the recognizer and the discriminator.
Fig. 3
Interactive joint training of our framework. The attention-based recognizer shares the position and prediction of every character with the discriminator, whereas the discriminator learns from the confusion of the recognizer and guides the generator so that it generates clear text content and a clean background style to ease reading.

The discriminator guides the generator according to the confusion of the recognizer. Thus, erroneous character patterns in the generated images are corrected. For instance, the patterns of the characters "C" and "G" are similar, which can easily cause failed predictions by the recognizer. After training with our feedback mechanism, the generated patterns are more discriminative, and incorrect predictions on ambiguous characters are largely avoided.

To summarize, our main contributions are as follows.

- We propose a framework that separates text content from complex background styles to reduce recognition difficulty. The framework consists of an attention-based recognizer and a generative adversarial architecture. We devise an interactive joint training of the two, which is critical to the success of our approach.
- The shared attention mask enables character-level adversarial training. Thus, the unpaired target style samples can be simply synthesized online. The training of our framework requires only the images and corresponding text labels. Additional annotations such as bounding boxes or pixel-wise labels are unnecessary.
- We further propose a feedback mechanism to improve the robustness of the generator. The discriminator learns from the confusion of the recognizer and guides the generator so that it generates clear character patterns that facilitate reading.
- Our experiments demonstrate that mainstream recognizers can benefit from our method and achieve new state-of-the-art performance by extracting text content from complex background styles. This suggests that our framework is a meta-framework, which is flexible for integration with recognizers.
Fig. 4
Training samples and generations. Left: Widely used training datasets released by Jaderberg et al. (Jaderberg et al., 2014a) and Gupta et al. (Gupta et al., 2016). Middle: Unpaired target style samples, which are character-level and synthesized online. Right: Output of the generator.
2 Related Work

In this section, we review the previous methods that are most relevant to ours with respect to two categories: scene text recognition and generative adversarial networks.
Scene text recognition.
Overviews of the notable work in the field of scene text detection and recognition have been provided by Ye et al. (Ye and Doermann, 2015) and Zhu et al. (Zhu et al., 2016). The methods based on neural networks outperform the methods with handcrafted features, such as HOG descriptors (Dalal and Triggs, 2005), connected components (Neumann and Matas, 2012), strokelet generation (Yao et al., 2014), and label embedding (Rodriguez-Serrano et al., 2015), because a trainable neural network is able to adapt to various scene styles. For instance, Bissacco et al. (Bissacco et al., 2013) applied a network with five hidden layers for character classification, and Jaderberg et al. (Jaderberg et al., 2015) proposed a CNN for unconstrained recognition. The CNN-based methods significantly improve the performance of recognition.

Moreover, recognition models yield better robustness when they are integrated with RNNs (He et al., 2016b; Shi et al., 2016, 2017) and attention mechanisms (Lee and Osindero, 2016; Cheng et al., 2017, 2018; Yang et al., 2017). For example, Shi et al. (Shi et al., 2017) proposed an end-to-end trainable network using both CNNs and RNNs, namely CRNN. Lee et al. (Lee and Osindero, 2016) proposed a recursive recurrent network with attention modeling for scene text recognition. Cheng et al. (Cheng et al., 2017) used a focusing attention network to correct attention alignment shifts caused by the complexity or low quality of images.
These methods have made great progress in regular scene text recognition.

With respect to irregular text, the irregular shapes introduce more background noise into the images, which increases recognition difficulty. To tackle this problem, Yang et al. (Yang et al., 2017) and Li et al. (Li et al., 2019) used the two-dimensional (2D) attention mechanism for irregular text recognition. Liao et al. (Liao et al., 2019b) recognized irregular scene text from a 2D perspective with a semantic segmentation network. Additionally, Liu et al. (Liu et al., 2016), Shi et al. (Shi et al., 2016, 2018), and Luo et al. (Luo et al., 2019) proposed rectification networks to transform irregular text images into regular ones, which alleviates the interference of background noise, so that the rectified images become readable by a one-dimensional (1D) recognition network. Yang et al. (Yang et al., 2019a) used character-level annotations as supervision for a more accurate rectification. Despite these praiseworthy efforts, irregular scene text on complex backgrounds is still difficult to recognize in many cases.
Generative adversarial networks.
With the widespread application of GANs (Goodfellow et al., 2014; Mao et al., 2017; Odena et al., 2017; Zhu et al., 2017), font generation methods (Azadi et al., 2018; Yang et al., 2019b) using adversarial learning have been successful on document images. These methods focus on the style of a single character and achieve impressive visual effects.

However, our goal is to perform style normalization on the noisy background, rather than on the font, size or layout. A further challenge is to keep multiple characters for recognition. This means that style normalization of the complex backgrounds of scene text images requires accurate separation between the text content and background noise. Traditional binarization/segmentation methods (Casey and Lecolinet, 1996) typically work well on document images, but fail to handle the substantial variation in text appearance and the noise in natural images (Shi et al., 2018). Style normalization of the background in scene text images remains an open problem.

Recently, several attempts at scene text generation have taken a crucial step forward. Liu et al. (Liu et al., 2018b) guided the feature maps of an original image towards those of a clean image. The feature-level guidance reduces the recognition difficulty, whereas the image-level guidance does not result in a significant improvement in text recognition performance. Fang et al. (Fang et al., 2019) designed a two-stage architecture to generate repeated characters in images. An additional 10k synthetic images boost the performance, but more synthetic images do not improve accuracy linearly. Wu et al. (Wu et al., 2019) edited text in natural images using a set of corresponding synthetic training samples to preserve the style of both background and text.
Fig. 5
Attention decoder, which recurrently attends to informative regions and outputs predictions.

These methods provide abundant visualized examples. However, poor recognition performance on complex scene text remains a challenging problem.

We are interested in taking a further step to enable recognition performance to benefit from generation. Our method integrates the advantages of the attention mechanism and the GAN, and jointly optimizes them to achieve better performance. The text content is separated from various background styles, which are normalized for easier reading.
3 Methodology

We design a framework to separate text content from noisy background styles through an interactive joint training of an attention-based recognizer and a generative adversarial architecture. The shared attention masks from the attention-based recognizer enable character-level adversarial training. Then, the discriminator guides the generator to achieve background style normalization. In addition, a feedback mechanism bridges the gap between the discriminator and the recognizer. The discriminator guides the generator according to the confusion of the recognizer. Thus, the generator can generate clear character patterns that facilitate reading.

In this section, we first introduce the attention decoder used in mainstream recognizers. Then, we present a detailed description of the interactive joint training scheme.

3.1 Attention Decoder

To date, the attention decoder (Bahdanau et al., 2015) has become widely used in recent recognizers (Shi et al., 2018; Luo et al., 2019; Li et al., 2019; Yang et al., 2019a). As shown in Figure 5, the decoder sequentially outputs predictions $(y_1, y_2, \ldots, y_N)$ and stops processing when it predicts an end-of-sequence token "EOS" (Sutskever et al., 2014). At time step $t$, output $y_t$ is given by

$$y_t = \mathrm{softmax}(W_{out}\, s_t + b_{out}), \quad (1)$$

where $s_t$ is the hidden state at the $t$-th step. Then, we update $s_t$ by

$$s_t = \mathrm{GRU}(s_{t-1}, (y_{t-1}, g_t)), \quad (2)$$

where $g_t$ represents the glimpse vector

$$g_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i, \quad \alpha_t \in \mathbb{R}^n, \quad (3)$$

where $h_i$ denotes the sequential feature vectors. Vector $\alpha_t$ is the attention mask, expressed as follows:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}, \quad (4)$$

$$e_{t,i} = w^{T} \tanh(W_s\, s_{t-1} + W_h\, h_i + b). \quad (5)$$

Here, $W_{out}$, $b_{out}$, $w$, $W_s$, $W_h$ and $b$ are trainable parameters. Note that $y_{t-1}$ is the $(t-1)$-th character of the ground truth in the training phase, whereas it is the previously predicted output in the testing phase. The training set is denoted as $D = \{I_i, Y_i\}, i = 1 \ldots N$. The optimization minimizes the negative log-likelihood of the conditional probability of $D$ as follows:

$$\mathcal{L}_{reg} = -\sum_{i=1}^{N} \sum_{t=1}^{|Y_i|} \log p(Y_{i,t} \mid I_i; \theta), \quad (6)$$

where $Y_{i,t}$ is the ground truth of the $t$-th character in $I_i$ and $\theta$ denotes the parameters of the recognizer.
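To make Equations (1)-(5) concrete, below is a minimal PyTorch sketch of one step of such an attention decoder. It is illustrative only: the module names, the hidden size and the feature size are assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """One step of a Bahdanau-style attention decoder (Eqs. (1)-(5))."""

    def __init__(self, num_classes=37, hidden_size=256, feat_size=512):
        super().__init__()
        self.gru = nn.GRUCell(num_classes + feat_size, hidden_size)  # Eq. (2)
        self.W_s = nn.Linear(hidden_size, hidden_size, bias=False)   # W_s of Eq. (5)
        self.W_h = nn.Linear(feat_size, hidden_size)                 # W_h and b of Eq. (5)
        self.w = nn.Linear(hidden_size, 1, bias=False)               # w of Eq. (5)
        self.out = nn.Linear(hidden_size, num_classes)               # W_out, b_out of Eq. (1)

    def step(self, y_prev, s_prev, H):
        # y_prev: (B, num_classes) one-hot previous symbol; H: (B, n, feat_size) vectors h_i
        e = self.w(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(H)))  # Eq. (5)
        alpha = F.softmax(e.squeeze(-1), dim=1)                              # Eq. (4)
        g = (alpha.unsqueeze(-1) * H).sum(dim=1)                             # Eq. (3), glimpse
        s = self.gru(torch.cat([y_prev, g], dim=1), s_prev)                  # Eq. (2)
        y = F.softmax(self.out(s), dim=1)                                    # Eq. (1)
        return y, s, alpha  # alpha is the attention mask later shared with D
```

The returned `alpha` is exactly the per-step mask that Section 3.2 shares with the discriminator.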
Fig. 6
Interactive joint training. The recognizer shares attention masks with the discriminator, whereas the discriminator learns from the predictions of the recognizer and updates the generator using the ground truth. The shared attention masks work on feature maps, which we present on the generated images for better visualization.

3.2 Interactive Joint Training for Separating Text from Backgrounds

As the vanilla discriminator is designed for non-sequential objects with global coarse supervision, directly employing a discriminator fails to provide effective guidance for the generator. In contrast to applying a global discriminator, we supervise the generator in a fine-grained manner, namely, with character-level adversarial learning, by taking advantage of the attention mechanism. Training the framework at the character level also reduces the complexity of preparing the target style data. Every target style sample, containing one character, can be easily synthesized online.

Sharing of attention masks.
Given an image $I$ as input, the goal of our generator $G$ is to generate a clean image $I'$ without a complex background. The discriminator $D$ encodes the image $I'$ as

$$E = \mathrm{Encode}(I'). \quad (7)$$

With settings similar to those of the backbone in the recognizer (e.g., kernel size, stride size and padding size in the convolutional and pooling layers), the encoder in the discriminator is designed to output embedding vectors $E_i$ with the same size as that of $h_i$ in Equation (3), which enables the recognizer to share the attention mask $\alpha_t$ with the discriminator. After that, character-level features of the generation are extracted by

$$F_{gen,t} = \sum_{i=1}^{n} \alpha_{t,i}\, E_i, \quad \alpha_t \in \mathbb{R}^n. \quad (8)$$

The extracted character features are used for further adversarial training.
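Given masks from the decoder sketched above, the feature extraction of Equation (8) reduces to a batched matrix product; a sketch with the same illustrative shapes:

```python
import torch

def character_features(E, alphas):
    """Eq. (8): pool discriminator embeddings with the recognizer's attention masks.
    E: (B, n, c) embedding vectors E_i; alphas: (B, T, n) masks alpha_t for T steps.
    Returns (B, T, c), where entry [:, t] is F_gen,t = sum_i alpha_{t,i} * E_i."""
    return torch.bmm(alphas, E)
```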
Unpaired target style samples.
Benefiting from our character-level adversarial learning, target style samples can be simply synthesized online. As illustrated in Figures 4 and 6, every target style sample contains only a black character on a white background or a white character on a black background. The characters are randomly chosen. Following previous methods for data synthesis (Jaderberg et al., 2014a; Gupta et al., 2016), we collect fonts (from https://fonts.google.com) to synthesize the target style samples. The renderer is a simple, publicly available engine (the ImageDraw module of Pillow: https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html) that can efficiently synthesize samples online. Owing to the diversity of the fonts, the font sensitivity of the discriminator is thereby decreased, which enables the discriminator to focus on the background styles.

Because there is only one character in a target style image, we apply global average pooling to the embedding features of every target style sample $I_t$ as follows:

$$F_{tgt} = \mathrm{averagePooling}\big(\mathrm{Encode}(I_t)\big). \quad (9)$$

The features of the $t$-th character in the generated image, $F_{gen,t}$, and of the target style sample, $F_{tgt}$, are prepared for the following adversarial training.
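As a sketch of the online synthesis described above (one random character, black-on-white or white-on-black, in a random font), using the Pillow ImageDraw engine referenced above; the font directory and the output size are assumptions:

```python
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
FONT_DIR = Path("fonts")  # e.g., fonts collected from https://fonts.google.com

def synth_target_sample(size=32):
    """Render one randomly chosen character on a clean background (Fig. 4, middle)."""
    bg, fg = random.choice([(255, 0), (0, 255)])  # black-on-white or white-on-black
    img = Image.new("L", (size, size), color=bg)
    font_file = random.choice(sorted(FONT_DIR.glob("*.ttf")))
    font = ImageFont.truetype(str(font_file), size=int(size * 0.8))
    draw = ImageDraw.Draw(img)
    draw.text((size / 2, size / 2), random.choice(CHARS), fill=fg,
              font=font, anchor="mm")  # "mm" centres the glyph (Pillow >= 8.0)
    return img
```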
Adversarial training on style.
We use a style classifier in the discriminator to classify the style of characters in the generated images as fake and that of characters in the target style samples as real. Using the 0-1 coding scheme, we optimize the least-squares adversarial losses (Mao et al., 2017):

$$\min_D \mathcal{L}_s = \mathbb{E}_{I_t}\big[(1 - \mathrm{Style}(F_{tgt}))^2\big] + \mathbb{E}_{I',t}\big[\mathrm{Style}(F_{gen,t})^2\big],$$
$$\min_G \mathcal{L}_s = \mathbb{E}_{I',t}\big[(1 - \mathrm{Style}(F_{gen,t}))^2\big], \quad (10)$$

where $\mathrm{Style}(\cdot)$ denotes the style classifier.

The advantages of character-level adversarial training are threefold: 1) Because the background of a scene text image is complicated, the background noise varies substantially across different character regions. Considering the text string as a whole and supervising the training in a global manner may cause the generator to fail, as discussed previously in Section 1 and Figure 2. Thus, we encourage the discriminator to inspect the generation in a more fine-grained manner, namely, with character-level supervision, which contributes to effective learning. 2) Training at the character level eases the preparation of target style data. For the synthesis of a text string, it is necessary to consider the text shape, the space between neighboring characters and the rotation of every character (Jaderberg et al., 2014a; Gupta et al., 2016). In contrast, we can simply synthesize only one character on a clean background for every target style sample. Therefore, our target style samples can be simply synthesized online during the training. 3) The training is free of the need for paired data. Because the attention mechanism decomposes a text string into several characters and benefits the subsequent training, only input scene text images and the corresponding text labels are required. Hence, our framework is flexible enough to make full use of available data to gain robustness.
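In code, the two sides of Equation (10) could look as follows. The least-squares form follows the reconstruction above and should be treated as an assumption; `style_cls` stands for the one-layer style classifier:

```python
import torch

def style_loss_D(style_cls, F_tgt, F_gen):
    """Discriminator side of Eq. (10): target-style characters -> 1, generated -> 0."""
    real = style_cls(F_tgt)            # (N, 1) scores of target style characters
    fake = style_cls(F_gen.detach())   # detach: do not backpropagate into G here
    return ((1 - real) ** 2).mean() + (fake ** 2).mean()

def style_loss_G(style_cls, F_gen):
    """Generator side of Eq. (10): push generated characters towards the real style."""
    return ((1 - style_cls(F_gen)) ** 2).mean()
```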
Feedback mechanism.
As our goal is to improve recognition performance, we are interested not only in the styles of the backgrounds, but also in the quality of the generated content. Therefore, we use a content classifier in the discriminator to supervise content generation.

In contrast to the previous auxiliary classifier GAN (Odena et al., 2017), which used the ground truth to supervise the content classifier, our content classifier learns from the predictions of the recognizer. This bridges the gap between the recognizer and the discriminator. The discriminator can thus guide the generator according to the confusion of the recognizer. After training with this feedback mechanism, the generated patterns are more discriminative, which facilitates recognition. The details of the feedback mechanism are presented as follows.

The generator $G$ and discriminator $D$ are updated by alternately optimizing

$$\min_D \mathcal{L}_{c,D} = \mathbb{E}_{(I,P),(I_t,GT)}\Big[-\log \mathrm{Content}(GT \mid F_{tgt}) - \frac{1}{|P|}\sum_{t=1}^{|P|} \log \mathrm{Content}(P_t \mid F_{gen,t})\Big], \quad (11)$$

$$\min_G \mathcal{L}_{c,G} = \mathbb{E}_{I,GT}\Big[-\frac{1}{|GT|}\sum_{t=1}^{|GT|} \log \mathrm{Content}(GT_t \mid F_{gen,t})\Big], \quad (12)$$

where $GT$ denotes the ground truth of the input image $I$ and of the target style sample $I_t$. In addition, $\mathrm{Content}(\cdot)$ is the content classifier. Note that the discriminator learns from the predictions $P$ of the recognizer on $I$, whereas it uses the $GT$ of $I$ to update the generator. This is an adversarial process similar to that of GAN training (Goodfellow et al., 2014; Mao et al., 2017; Odena et al., 2017; Zhu et al., 2017): different labels are used for the discriminator and the generator, but the gradient is backpropagated through the same discriminator parameters. Alternately optimizing the discriminator and generator achieves adversarial learning.

There are some substitution errors in the predictions $P$ that differ from the $GT$. Therefore, the second term on the right side of Equation (11) can be formulated as content adversarial training:

$$-\frac{1}{|P|}\sum_{t=1}^{|P|} \log \mathrm{Content}(P_t \mid F_{gen,t}) = -\frac{1}{|P|}\Big[\sum_{i=1}^{|P_{real}|} \log \mathrm{Content}(P_{real,i} \mid F_{gen,i}) + \sum_{j=1}^{|P_{fake}|} \log \mathrm{Content}(P_{fake,j} \mid F_{gen,j})\Big], \quad (13)$$

where $P_{real}$ and $P_{fake}$ represent the correct and incorrect predictions of the recognizer, respectively. Note that $P_{real} \cup P_{fake} = P$.

Since the discriminator with the content classifier learns from the predictions of the recognizer, it guides the generator to correct erroneous character patterns in the generated images. For instance, similar patterns such as "C" and "G", or "O" and "Q", may cause failed predictions by the recognizer. If a "G" is transformed to look more like a "C" and the recognizer predicts it to be a "C", the discriminator will learn that the pattern is a "C" and guide the generator to generate a clearer "G". We show more examples and further discuss this issue in Section 4.
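The feedback mechanism of Equations (11) and (12) then differs from a standard auxiliary classifier only in which labels reach the content classifier; a sketch, with per-character features and labels flattened over the batch:

```python
import torch.nn.functional as F

def content_loss_D(content_cls, F_tgt, gt_tgt, F_gen, preds):
    """Eq. (11): the content classifier learns target characters from the ground
    truth, but learns generated characters from the recognizer's predictions."""
    loss_tgt = F.cross_entropy(content_cls(F_tgt), gt_tgt)
    loss_gen = F.cross_entropy(content_cls(F_gen.detach()), preds)
    return loss_tgt + loss_gen

def content_loss_G(content_cls, F_gen, gt):
    """Eq. (12): the generator is updated with the true labels, so patterns the
    recognizer confused (P_fake in Eq. (13)) are pushed back to their true class."""
    return F.cross_entropy(content_cls(F_gen), gt)
```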
Interactive joint training.
The pseudocode of the interactive joint training scheme is presented in Algorithm 1. During the training of our framework, we found that the discriminator often learns faster than the generator. A similar problem has also been reported by others (Berthelot et al., 2017; Heusel et al., 2017). The Wasserstein GAN (Arjovsky et al., 2017) balances the two players by using different numbers of update steps for the discriminator and the generator. We simply adjust the number of steps according to a balance factor $\beta \in (0, 1]$. If the discriminator learns faster than the generator, then the value of $\beta$ decreases, potentially resulting in a pause in the update steps for the discriminator.
Algorithm 1: Interactive joint training

Input: Discriminator D; Generator G; Batch size B; Balance factor β (initialized as 1.0)
while not at the end of training do
    Sample B training images as I, and generate I′ = G(I);
    Randomly synthesize B target style samples as I_t;
    Obtain the predictions P on I′;
    Use I′ and GT to update the recognizer (min L_reg), and obtain attention masks for D;
    I_chosen ← ∅; P_chosen ← ∅; GT_chosen ← ∅;
    for i in 1 ... B do
        if length(P_i) = length(GT_i) and editDistance(P_i, GT_i) ≤ δ (a small threshold) then
            I_chosen ← I_chosen ∪ {I′_i}; P_chosen ← P_chosen ∪ {P_i}; GT_chosen ← GT_chosen ∪ {GT_i};
        end
    end
    if I_chosen ≠ ∅ then
        Generate a random number k ∈ [0, 1];
        if k ≤ β then
            Use I_chosen, P_chosen to update D: min_D L_s, min_D L_{c,D};
        end
        Use I_chosen, GT_chosen to update G: min_G L_s, min_G L_{c,G};
        β ← (L_s^D + L_{c,D}) / (L_s^G + L_{c,G});
    end
end

Here, L_s^D and L_s^G denote the discriminator-side and generator-side style losses of Equation (10), respectively.
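The filtering step of Algorithm 1 can be sketched as follows. With equal string lengths, counting mismatched positions bounds the number of substitution errors, so only predictions with (at most a few) substitution errors survive; the threshold value is an assumption, as the exact value is not recoverable here.

```python
def select_aligned_samples(preds, gts, max_substitutions=1):
    """Filtering step of Algorithm 1: keep sample indices whose prediction has the
    same length as the ground truth and differs in at most a few positions."""
    chosen = []
    for i, (p, g) in enumerate(zip(preds, gts)):
        if len(p) == len(g) and sum(a != b for a, b in zip(p, g)) <= max_substitutions:
            chosen.append(i)  # only substitution errors remain in these predictions
    return chosen
```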
Generated images for (a) regular and (b) irregular text. Inputimages are on the left and the corresponding generated images are onthe right. The text content is separated by the generator from the noisybackground styles. In the generated images, the font style tends to bean average style. the update steps for the discriminator. In practice, this trickcontributes to the training stability of the generator.We first sample a set of input samples, and randomlysynthesize unpaired samples of target style. Then, therecognizer makes predictions on the generated imagesand shares its attention masks with the discriminator. Toavoid the effects of incorrect alignment between characterfeatures and labels (Bai et al., 2018), we filter out somepredictions using the metrics of edit distance and stringlength. The corresponding images are also filtered out.Only substitution errors exist in the remaining predictions.Finally, the discriminator and generator are alternatelyoptimized to achieve adversarial learning.After the adversarial training, the generator can separatetext content from complex background styles. The generatedpatterns are clearer and easier to read. As illustrated inFigure 7, the generator works well on both regular textand slanted/curved text. Because the irregular shapes ofthe text introduce more surrounding background noise, therecognition difficulty can be significantly reduced by usingour method.
4 Experiments

In this section, we provide the training details and report the results of extensive experiments on various benchmarks, including both regular and irregular text datasets, demonstrating the effectiveness and generality of our method.
As paired text images in the wild are not available, and there is great diversity in the number of characters and in image structure between the input images and our target style images, popular GAN metrics such as the inception score (Salimans et al., 2016) and the Fréchet inception distance (Heusel et al., 2017) cannot be directly applied in our evaluation. Instead, we use the word accuracy of recognition, which is a more straightforward metric and the one of interest for our target task, to measure the performance of all the methods. Recall that our goal here is to improve recognition accuracy.

4.1 Datasets

SynthData, which contains 6 million images released by Jaderberg et al. (Jaderberg et al., 2014a) and 6 million images released by Gupta et al. (Gupta et al., 2016), is a widely used training dataset. Following the most recent work, we select it as the training dataset for fair comparison. Only word-level labels are used; no extra annotation is necessary in our framework. The model is trained using only synthetic text images, without any fine-tuning for each specific dataset.

IIIT5K-Words (Mishra et al., 2012) (IIIT5K) contains 3,000 cropped word images for testing. Every image has a 50-word lexicon and a 1,000-word lexicon. Each lexicon consists of the ground truth and some randomly picked words.

Street View Text (Wang et al., 2011) (SVT) was collected from Google Street View and consists of 647 word images. Each image is associated with a 50-word lexicon. Many images are severely corrupted by noise and blur or have very low resolutions.

ICDAR 2003 (Lucas et al., 2003) (IC03) contains 251 scene images that are labeled with text bounding boxes. For fair comparison, we discarded images that contain non-alphanumeric characters or fewer than three characters, following Wang et al. (Wang et al., 2011). The filtered dataset contains 867 cropped images. Lexicons comprise a 50-word lexicon defined by Wang et al. (Wang et al., 2011) and a "full lexicon". The latter combines all lexicon words.

ICDAR 2013 (Karatzas et al., 2013) (IC13) inherits most of its samples from IC03. It contains 1,015 cropped text images. No lexicon is associated with this dataset.

SVT-Perspective (Quy Phan et al., 2013) (SVT-P) contains 645 cropped images for testing. The images were selected from side-view angle snapshots in Google Street View; therefore, most of them are perspective distorted. Each image is associated with a 50-word lexicon and a full lexicon.

CUTE80 (Risnumawan et al., 2014) (CUTE) contains 80 high-resolution images taken in natural scenes. It was specifically collected for evaluating the performance of curved text recognition. It contains 288 cropped natural images for testing. No lexicon is associated with this dataset.

ICDAR 2015 (Karatzas et al., 2015) (IC15) contains 2,077 images obtained by cropping the words using the ground truth word bounding boxes. Cheng et al. (Cheng et al., 2017) filtered out some extremely distorted images and used a smaller evaluation set (referred to as IC15-S) containing only 1,811 test images.

4.2 Implementation Details

As our proposed method is a meta-framework for recent attention-based recognition methods (Shi et al., 2018; Luo et al., 2019; Li et al., 2019; Yang et al., 2019a), recent recognizers can be readily integrated with our framework. Thus, the recognizer implementation follows their specific designs. Here we present the details of the generator, discriminator, and training.
Generator.
The generator is a feature pyramid network (FPN)-like (Lin et al., 2017) architecture that consists of eight residual units. Each residual unit comprises a 1 × 1 convolution followed by two 3 × 3 convolutions. Feature maps are downsampled by stride-2 convolutions in the first three residual units. The numbers of output channels of the first four residual units are 64, 128, 256, and 256, respectively. The last four units are symmetrical with the first four, but we upsample the feature maps by simple resizing. We apply element-wise addition to the outputs of the third and fifth units. At the top of the generator, there are two convolution layers that have 16 filters and one filter, respectively.
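A skeletal PyTorch sketch of such an FPN-like generator follows. The kernel sizes match the description as reconstructed above, and the exact residual-unit layout, padding and output activation are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResUnit(nn.Module):
    """Residual unit: a 1x1 convolution followed by two 3x3 convolutions."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=stride), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=stride)

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))

class Generator(nn.Module):
    """FPN-like generator: stride-2 downsampling in the first three units,
    resize-based upsampling in the symmetric half, and element-wise addition
    of the outputs of the third and fifth units."""
    def __init__(self):
        super().__init__()
        self.d1, self.d2 = ResUnit(1, 64, 2), ResUnit(64, 128, 2)
        self.d3, self.d4 = ResUnit(128, 256, 2), ResUnit(256, 256)
        self.u5, self.u6 = ResUnit(256, 256), ResUnit(256, 256)
        self.u7, self.u8 = ResUnit(256, 128), ResUnit(128, 64)
        self.head = nn.Sequential(
            nn.Conv2d(64, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1))  # 16 filters, then 1 filter

    def forward(self, x):
        f3 = self.d3(self.d2(self.d1(x)))
        f5 = self.u5(self.d4(f3))
        y = self.u6(f3 + f5)                           # lateral element-wise addition
        y = self.u7(F.interpolate(y, scale_factor=2))  # upsample by simple resizing
        y = self.u8(F.interpolate(y, scale_factor=2))
        y = F.interpolate(y, scale_factor=2)
        return torch.tanh(self.head(y))
```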
Discriminator.
The encoder in the discriminator consists of seven convolutional layers with 16, 64, 128, 128, 192 and 256 filters. Their kernel sizes are all 3 × 3, except for that of the last one, which is 2 × 2. The first, second, fourth and sixth convolutional layers are each followed by an average-pooling layer. Using settings similar to those of the backbone in the recognizer (e.g., kernel size, stride size and padding size in the convolutional and pooling layers), the output size of the encoder can be controlled to meet the requirements of the attention mask sharing with the recognizer. Both the style and content classifiers in the discriminator are one-layer fully connected networks.
Training.
We use Adam (Kingma et al., 2015) to optimize the GAN. The learning rate is set to 0.002 and is decreased by a factor of 0.1 at epochs 2 and 4. The interactive joint training utilizes the attention mechanism in the recognizer; therefore, an optimized attention decoder is necessary to enable the interaction. To accelerate the training process, we pre-trained the recognizer for three epochs.
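The optimizer setup described above can be written directly in PyTorch; `generator` and `discriminator` stand for the modules sketched earlier and are placeholders here:

```python
import torch

g_opt = torch.optim.Adam(generator.parameters(), lr=0.002)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=0.002)
# Decay the learning rate by a factor of 0.1 at epochs 2 and 4.
g_sched = torch.optim.lr_scheduler.MultiStepLR(g_opt, milestones=[2, 4], gamma=0.1)
d_sched = torch.optim.lr_scheduler.MultiStepLR(d_opt, milestones=[2, 4], gamma=0.1)
# g_sched.step() and d_sched.step() are called once at the end of each epoch.
```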
Table 1
Word accuracy on the testing datasets using different inputs. The recognizer is trained on the source images $I$ and the generated images $I'$, respectively. The first four benchmarks contain regular text; the last three contain irregular text.

Input Image | IIIT5K | SVT | IC03 | IC13 | SVT-P | CUTE | IC15
Source $I$ | … | … | … | … | … | … | …
$I'$ | … | … | … | … | … | … | …
Implementation.
We implement our method using the PyTorch framework (Paszke et al., 2017). The target style samples, the inputs of the generator and the recognizer, and the outputs of the generator are each resized to fixed resolutions. When the batch size is set to 64, the training speed is approximately 1.7 iterations/sec. Our method takes an average of 1.1 ms to generate an image using an NVIDIA GTX-1080Ti GPU.

4.3 Ablation Study

Experiment setup.
To investigate the effectiveness of separating text content from noisy background styles, we conduct an ablation analysis using a simple recognizer. The backbone of the recognizer is a 45-layer residual network (He et al., 2016a), which is a popular architecture (Shi et al., 2018). On top of the backbone, there is an attention-based decoder built on a GRU. The decoder outputs 37 classes, including 26 letters, 10 digits, and a symbol representing "EOS". The training data is SynthData. We evaluate the recognizer on seven benchmarks, including regular and irregular text.
Input of the recognizer.
We study the contribution of our method by replacing the generated image with the corresponding input image. The results are listed in Table 1. The recognizer trained using SynthData serves as a baseline. Compared to the baseline, the clean images generated by our method boost recognition performance. We observe that the improvement is more substantial on irregular text. One notable improvement is an accuracy increase of 6.6% on CUTE. One possible reason for this is that irregular text shapes introduce more background noise than regular ones. Because our method removes the surrounding noise and extracts the text content for recognition, the recognizer can focus on characters and avoid noisy interference. With respect to regular text, the baseline is much higher and there is less room for improvement, but our method also shows advantages in recognition performance. The performance gains on several kinds of scene text, including the low-quality images in SVT and the real scene images in IC03/IC13, suggest the generality of our method. To summarize, the clean images generated by our proposed method greatly decrease recognition difficulty.
Fig. 8
Visualization of background normalization weakly supervised by the content label.
Style supervision.
We study the necessity of style supervision by disabling the style classifier in the discriminator. Without style adversarial training, the background style normalization is only weakly supervised by the content label. As shown in Figure 8, the generated images suffer from severe image degradation, which leads to poor robustness of the recognizer. The quantitative recognition results without and with style supervision are presented in the second and third rows of Table 2. The significant gaps indicate that without style supervision, the quality of the generated images is insufficient for recognition training. Thus, the style adversarial training is necessary and is part of the basic design of our method.
Feedback mechanism.
We also study the effectiveness of the content classifier in the discriminator and the proposed feedback mechanism. In this experiment, we first disable the content classifier. Therefore, there is no content supervision; only the style adversarial loss supervises the generator. The result is shown in the first row of Table 2. The accuracy on the generated images decreases to nearly zero. We observe that the generator fails to retain the character patterns for recognition. As the content classifier is designed for assessing the discriminability and diversity of samples (Odena et al., 2017), it is important for guiding the generator to determine informative character patterns and retain them for recognition. When content supervision is not available, the generator is easily trapped in failure modes, namely mode collapse (Salimans et al., 2016). Therefore, the content supervision in the discriminator is necessary.

Then we enable the content classifier and replace the supervision in $\mathcal{L}_c$ with the ground truth. This setting is similar to that of the auxiliary classifier GAN (Odena et al., 2017), which uses content supervision for discriminability and diversity in the style adversarial training. After this process, the generated text images contain text content for recognition.
Word accuracy on generated images using variants of content supervision for the discriminator. Losses $\mathcal{L}_s$ and $\mathcal{L}_c$ denote the style loss and content loss, respectively.

Content Supervision | $\mathcal{L}_s$ | $\mathcal{L}_c$ | Feedback Mechanism | SVT-P | CUTE | IC15
None | ✓ | × | × | Failed | Failed | Failed
Ground Truth | × | ✓ | × | … | … | …
Ground Truth | ✓ | ✓ | × | … | … | …
Predictions | ✓ | ✓ | ✓ | … | … | …

Fig. 9
Predictions of challenging samples in the SVT-P testing set. Recognition errors are marked as red characters. Confusing and distinct patterns are marked by red and green bounding boxes, respectively.
Finally, we replace the content supervision with the predictions of the recognizer. The discriminator thus learns from the confusion of the recognizer and guides the generator to refine the character patterns to be easier to read. Therefore, the adversarial training is more relevant to the recognition performance. As shown in Table 2, the feedback mechanism further improves the robustness of the generator and benefits the recognition performance.

One interesting observation is that on the SVT-P testing set, the accuracy on the source images (75.7% in Table 1) is higher than that on the generated images with content supervision from the ground truth (75.0% in Table 2). Inspecting the source samples, we find that most images are severely corrupted by noise and blur, and some have low resolutions. The characters in the corresponding generated samples are then also difficult to distinguish. After training with the feedback mechanism, the generator is able to generate clear patterns that facilitate reading, which boosts the recognition accuracy from 75.0% to 79.2%. As illustrated in Figure 9, the predictions of "C" and "N" are corrected to "G" and "M", respectively. The clear characters in the generated images are easier to read.

4.4 Comparisons with Generation Methods

Recently, a large body of literature (Shi et al., 2018; Luo et al., 2019; Li et al., 2019; Yang et al., 2019a) has explored the use of stronger recognizers to tackle the complications in scene text recognition.
Fig. 10
Comparison between the OTSU method (Otsu, 1979) and our method.

However, there has been little consideration of the quality of the source images. The background noise in the source images has not been addressed intensively before. To the best of our knowledge, our method may be the first image generation network that removes background noise and retains text content to benefit recognition performance. Although little previous work has addressed this issue, we select several popular generation methods and perform comparisons under fair experimental conditions. The pre-trained recognizer used in the ablation study is adopted in the comparisons and is then fine-tuned on the different kinds of generations.

First, we use a popular binarization method, namely the OTSU method (Otsu, 1979), to separate the text content from the background noise by binarizing the source images. We visualize the binarized images in Figure 10 and find that a single threshold value is not sufficiently robust to separate the foreground and background in scene text images, because the background noise usually follows a multimodal distribution. Therefore, the recognition accuracy on the generations of the OTSU method falls behind ours in Table 3.

Then, we compare our method with generation methods. Considering the high demand for data (pixel-level paired samples) of pixel-to-pixel GANs (Isola et al., 2017), we treat this kind of method as a potential solution when there is no restriction on data.
Table 3
Word accuracy on the testing datasets using different transformation methods. The first four benchmarks contain regular text; the last three contain irregular text.

Transformation | Method | IIIT5K | SVT | IC03 | IC13 | SVT-P | CUTE | IC15
Style Normalization | OTSU (Otsu, 1979) | 70.3 | 65.4 | 76.0 | 76.0 | 46.5 | 50.3 | 49.3
Style Normalization | CycleGAN (Zhu et al., 2017) | 43.6 | 21.3 | 37.0 | 35.9 | 14.6 | 18.8 | 18.0
Style Normalization | Ours | 92.5 | 86.6 | 95.0 | 91.4 | 79.2 | 80.9 | 73.0
+ Rectification | Ours + ASTER (Shi et al., 2018) | 94.0 | 90.0 | 95.6 | 93.3 | 81.6 | 85.1 | 78.1
+ Rectification | Ours + ESIR (Zhan and Lu, 2019) | … | … | … | … | … | … | …
Fig. 11
Comparison between CycleGAN (Zhu et al., 2017) and our method.

Here, we study CycleGAN (Zhu et al., 2017), using the official implementation (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix). Before the training, we synthesize word-level clean images as target style samples. The results shown in Table 3 and Figure 11 suggest that modeling a text string with multiple characters as a whole leads to poor retention of character details. The last two rows in Figure 11 are failed generations, which indicate that the generator fails to model the relationships of the characters. In Table 3, the recognition accuracy on this kind of generation drops substantially.

Compared with previous methods, our method not only normalizes noisy backgrounds to a clean style, but also generates clear character patterns that tend towards an average style. The end-to-end training with the feedback mechanism benefits the recognition performance. We also show the effectiveness of image rectification by integrating our method with advanced rectification modules (Shi et al., 2018; Zhan and Lu, 2019), as shown in Table 3.
Fig. 12
Predictions corrected by our method.
Table 4
Word accuracy on regular benchmarks. "50", "1k" and "0" are lexicon sizes. "Full" indicates the combined lexicon of all images in the benchmarks. "Add." means the method uses extra annotations, such as character-level bounding boxes or pixel-level annotations. "Com." is the proposed ensemble method that outputs the prediction of either the source or the generated image, whichever has the higher confidence score.

Method | Add. | IIIT5K (50 / 1k / 0) | SVT (50 / 0) | IC03 (50 / Full / 0) | IC13 (0)
Yao et al. (2014) | | 80.2 / 69.3 / - | 75.9 / - | 88.5 / 80.3 / - | -
Jaderberg et al. (2014b) | | - / - / - | 86.1 / - | 96.2 / 91.5 / - | -
Su and Lu (2014) | | - / - / - | 83.0 / - | 92.0 / 82.0 / - | -
Rodriguez-Serrano et al. (2015) | | 76.1 / 57.4 / - | 70.0 / - | - / - / - | -
Gordo (2015) | | 93.3 / 86.6 / - | 91.8 / - | - / - / - | -
Jaderberg et al. (2015) | | 95.5 / 89.6 / - | 93.2 / 71.7 | 97.8 / 97.0 / 89.6 | 81.8
Jaderberg et al. (2016) | | 97.1 / 92.7 / - | 95.4 / 80.7 | 98.7 / 98.6 / 93.1 | 90.8
Shi et al. (2016) | | 96.2 / 93.8 / 81.9 | 95.5 / 81.9 | 98.3 / 96.2 / 90.1 | 88.6
Lee and Osindero (2016) | | 96.8 / 94.4 / 78.4 | 96.3 / 80.7 | 97.9 / 97.0 / 88.7 | 90.0
Liu et al. (2016) | | 97.7 / 94.5 / 83.3 | 95.5 / 83.6 | 96.9 / 95.3 / 89.9 | 89.1
Shi et al. (2017) | | 97.8 / 95.0 / 81.2 | 97.5 / 82.7 | 98.7 / 98.0 / 91.9 | 89.6
Yang et al. (2017) | ✓ | - / - / 83.6 | - / 84.4 | - / - / 91.5 | 90.8
Liu et al. (2018b) | | 97.3 / 96.1 / 89.4 | 96.8 / 87.1 | 98.1 / 97.5 / 94.7 | 94.0
Liu et al. (2018c) | ✓ | … | … | … | …
ASTER | | … | … | … | …
+ Ours | | … | … | … | …
+ Com. | | … | … | … | …

Note: The ASTER result was corrected by the authors on https://github.com/bgshih/aster.

4.5 Comparisons with State-of-the-Art Methods

We compare our method with previous state-of-the-art methods on regular benchmarks in Table 4. Although the baseline accuracy on these benchmarks is high, leaving little room for improvement, our method still achieves a notable improvement in lexicon-free prediction. For instance, it yields accuracy increases of 1.4% on SVT and 1.3% on IC13. Some predictions corrected using our generations are shown in Figure 12.

Then, we reveal the superiority of our method by applying it to irregular text recognition. As shown in Table 5, our method significantly boosts the performance of ASTER by generating clean images. ASTER integrated with our approach outperforms the baseline by a wide margin on SVT-P (3.9%), CUTE (5.2%) and IC15 (4.3%). This suggests that our generator removes the background noise introduced by irregular shapes and further reduces the difficulty of rectification and recognition. It is noteworthy that ASTER with our method outperforms ESIR (Zhan and Lu, 2019), which uses more rectification iterations (ASTER rectifies the image only once); this demonstrates the significant contribution of our method. The performance is even comparable with that of the state-of-the-art method (Yang et al., 2019a), which uses character-level geometric descriptors for supervision. Our method achieves a better trade-off between recognition performance and data requirements.

After that, our method is integrated with a different method, ESIR, to show its generalization. Based on this more advanced recognizer, our method achieves further gains. For instance, the improvement is still notable on CUTE (4.1%). As a result, the performance of ESIR is also significantly boosted by our method.
Upper bound of GAN.
We are further interested in the upper bound of our method. As our method is designed on the basis of adversarial training, the limitations of the GAN cause some failure cases. As illustrated in Figure 13, the well-trained generator fails to generate character patterns on difficult samples, particularly when the source image is of low quality and the curvature of the text shape is too high.
Table 5
Word accuracy on irregular benchmarks. "50" and "0" are lexicon sizes. "Full" indicates the combined lexicon of all images in the benchmarks. "Add." means the method uses extra annotations, such as character-level bounding boxes or pixel-level annotations. "Com." is the proposed ensemble method that outputs the prediction of either the source or the generated image, whichever has the higher confidence score.

Method | Add. | SVT-P (50 / Full / 0) | CUTE (0) | IC15-S (0) | IC15 (0)
Shi et al. (2016) | | 91.2 / 77.4 / 71.8 | 59.2 | - | -
Liu et al. (2016) | | 94.3 / 83.6 / 73.5 | - | - | -
Shi et al. (2017) | | 92.6 / 72.6 / 66.8 | 54.9 | - | -
Yang et al. (2017) | ✓ | - / - / - | - | - | 60.0
Liu et al. (2018b) | | - / - / 73.9 | 62.5 | - | -
Cheng et al. (2018) | | 94.0 / 83.7 / 73.0 | 76.8 | - | 68.2
Bai et al. (2018) | | - / - / - | - | 73.9 | -
Shi et al. (2018) | | - / - / 78.5 | 79.5 | 76.1 | -
Luo et al. (2019) | | 94.3 / 86.7 / 76.1 | 77.4 | - | 68.8
Liao et al. (2019b) | ✓ | - / - / - | 79.9 | - | -
Li et al. (2019) | | - / - / 76.4 | 83.3 | - | 69.2
Zhan and Lu (2019) | | - / - / 79.6 | 83.3 | - | 76.9
Yang et al. (2019a) | ✓ | - / - / 80.8 | 87.5 | - | 78.7
ASTER | | 94.3 / 87.3 / 77.7 | 79.9 | 75.8 | 74.0
+ Ours | | 95.0 / … / … | … | … | …
+ Com. | | 95.5 / … / … | … | … | …

Table 6
Word accuracy on the testing datasets when we use a little more real training data.

Method | Training data | IIIT5K | SVT | IC03 | IC13 | SVT-P | CUTE | IC15-S | IC15
ASTER | Millions of Synthetic Data | 93.5 | 88.6 | 94.7 | 92.0 | 77.7 | 79.9 | 75.8 | 74.0
ASTER | + 50k Real Data | 94.5 | 90.4 | 95.0 | 92.9 | 79.4 | 88.9 | 84.1 | 81.3
+ Ours | Millions of Synthetic Data | 95.4 | 92.7 | … | … | … | … | … | …
Fig. 13
Failure cases. Top: source images. Bottom: generated images.

One possible reason is the mode-dropping phenomenon studied by Bau et al. (Bau et al., 2019). Another is the lingering gap observed by Zhu et al. (Zhu et al., 2017) between training supervision with paired and unpaired samples. To break this ceiling, one possible solution is to improve the synthesis engine and integrate various paired lifelike samples for training. This may lead to substantially more powerful generators, but it depends heavily on the development of synthesis engines.

Inspired by recent work (Shi et al., 2018; Liao et al., 2019a), it is possible to integrate several outputs of the system and choose the most probable one to achieve a performance gain. Therefore, we propose a simple yet effective method to address the issue stated above. The source image and the corresponding generated image are concatenated as a batch for network inference. Then, we choose the prediction with the higher confidence. As shown in the last rows of Tables 4 and 5 (denoted "+ Com."), this ensemble mechanism greatly boosts the system performance, which indicates that the source and generated images are complementary to each other.
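A sketch of the "+ Com." ensemble described above; the recognizer interface (returning a prediction string and a confidence score per image) is an assumption:

```python
import torch

def ensemble_predict(recognizer, generator, image):
    """Run the recognizer on the source image and its generated counterpart in one
    batch, and keep whichever prediction has the higher confidence score."""
    with torch.no_grad():
        batch = torch.cat([image, generator(image)], dim=0)
        (pred_src, conf_src), (pred_gen, conf_gen) = recognizer(batch)
    return pred_src if conf_src >= conf_gen else pred_gen
```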
Table 7
Comparisons of generation in RGB space and in gray.

Space | IIIT5K | SVT | IC03 | IC13 | SVT-P | CUTE | IC15-S | IC15
Gray | … | … | … | … | … | … | … | …
RGB | 95.2 | 92.3 | … | … | … | … | … | …

4.6 More Accessible Data

In the experiments comparing the proposed method with previous recognition methods, we used only synthetic data for fair comparison. Here, we use ASTER (Shi et al., 2018) to explore whether there is room for improvement in synthesis engines.

Following Li et al. (Li et al., 2019), we collect publicly available real data for training. In contrast to synthetic data, real data is more costly to collect and annotate. Thus, there are only approximately 50k public real samples for training, whereas there are millions of synthetic samples. As shown in Table 6, after we add the small real training set to the large synthetic one, the generality of both the baseline ASTER and our method is further boosted. This suggests that synthetic data is not sufficiently realistic and the model is still data-hungry.

In summary, our approach is able to make full use of real samples in the wild to further gain robustness, because the training of our method requires only input images and the corresponding text labels. Note that our method trained using only synthetic data even outperforms the baseline trained using real data on most benchmarks, particularly on SVT-P.
Generation in RGB space or in gray.
The background noise and text content might be easier to separate in RGB color images. To study this, we conduct an experiment to evaluate the influence of the RGB color space. The target style samples are synthesized in random colors to guide the generation in RGB space. As shown in Table 7, we find that generation in RGB space does not outperform generation in gray. Therefore, the key issue of background normalization is not the color space, but the lack of pixel-level supervision. Without fine-grained guidance at the pixel level, the generation is guided only by the attention mechanism of the recognizer to focus on informative regions, while other, noisy regions in the generated image are neglected.
Alignment issue on long text.
To tackle the lack of paired training samples, we exploit the attention mechanism to extract every character for adversarial training. However, there exist misalignment problems in attention mechanisms (Cheng et al., 2017; Bai et al., 2018), especially on long text. Cong et al. (Cong et al., 2019) conducted a comprehensive study on the attention mechanism and found that attention-based recognizers perform poorly on text-sentence recognition. Thus, our method still has scope for performance gains on text-sentence recognition. This is a common issue of most attention mechanisms, which merits further study.
5 Conclusion

We have presented a novel framework for scene text recognition from the brand-new perspective of separating text content from noisy background styles. The proposed method can greatly reduce recognition difficulty and thus boost performance dramatically. Benefiting from the interactive joint training of an attention-based recognizer and a generative adversarial architecture, we extract character-level features for adversarial training. Thus, the discriminator focuses on informative regions and provides effective guidance for the generator. Moreover, the discriminator learns from the confusion of the recognizer and thereby guides the generator further, so that the generated patterns are clearer and easier to read. This feedback mechanism contributes to the generality of the generator. Our framework is end-to-end trainable, requiring only the text images and corresponding labels. Because of its elegant design, our method can be flexibly integrated with recent mainstream recognizers to achieve new state-of-the-art performance.

The proposed method is a successful attempt to solve the scene text recognition problem from the brand-new perspective of image generation and style normalization, which had not been addressed intensively before. In the future, we plan to extend the proposed method to end-to-end scene text recognition. How to extend our method to general multi-object recognition is also a topic of interest.
Acknowledgements
This research was supported in part by NSFC (Grant No. 61771199, 61936003), GD-NSF (No. 2017A030312006), the National Key Research and Development Program of China (No. 2016YFB1001405), Guangdong Intellectual Property Office Project (2018-10-1), and Guangzhou Science, Technology and Innovation Project (201704020134).