STEFANN: Scene Text Editor using Font Adaptive Neural Network
Prasun Roy*, Saumik Bhattacharya*, Subhankar Ghosh*, and Umapada Pal
Indian Statistical Institute, Kolkata, India | Indian Institute of Technology, Kharagpur, India
https://prasunroy.github.io/stefann

*These authors contributed equally to this work.
Abstract
Textual information in a captured scene plays an important role in scene interpretation and decision making. Though there exist methods that can successfully detect and interpret complex text regions present in a scene, to the best of our knowledge, there is no significant prior work that aims to modify the textual information in an image. The ability to edit text directly on images has several advantages including error correction, text restoration and image reusability. In this paper, we propose a method to modify text in an image at character-level. We approach the problem in two stages. At first, the unobserved character (target) is generated from an observed character (source) being modified. We propose two different neural network architectures – (a) FANnet to achieve structural consistency with the source font and (b) Colornet to preserve the source color. Next, we replace the source character with the generated character, maintaining both geometric and visual consistency with neighboring characters. Our method works as a unified platform for modifying text in images. We present the effectiveness of our method on the COCO-Text and ICDAR datasets, both qualitatively and quantitatively.
1. Introduction
Text is widely present in different design and scene images. It contains important contextual information for the reader. However, if any alteration is required in the text present in an image, it becomes extremely difficult for several reasons. For instance, a limited number of observed characters makes it difficult to generate unobserved characters with sufficient visual consistency. Also, different natural conditions, like brightness, contrast, shadow, perspective distortion, complex background, etc., make it harder to replace a character directly in an image. The main motivation of this work is to design an algorithm for editing textual information present in images in a convenient way, similar to conventional text editors.

Figure 1. Examples of text editing using STEFANN: (a) Original images from the ICDAR dataset; (b) Edited images. It can be observed that STEFANN can edit multiple characters in a word (top row) as well as an entire word (bottom row) in a text region.

Earlier, researchers proposed font synthesis algorithms based on different geometrical features of fonts [6, 24, 27]. These geometrical models neither generalize the wide variety of available fonts nor can be applied directly to an image for character synthesis. Later, researchers addressed the problem of generating unobserved characters of a particular font from some defined or random set of observations using deep learning algorithms [4, 7, 31]. With the emergence of Generative Adversarial Network (GAN) models, the problem of character synthesis has also been addressed using GAN-based algorithms [2, 19]. Though GAN-based font synthesis could be used to estimate the target character, several challenges make the direct implementation of font synthesis for scene images difficult. Firstly, most of the GAN-based font synthesis models require an explicit recognition of the source character. As recognition of text in scene images is itself a challenging problem, it is preferable if the target characters can be generated without a recognition step. Otherwise, any error in the recognition process would accumulate and make the entire text editing process unstable. Secondly, it is often observed that a particular word in an image may have a mixture of different font types, sizes, colors, etc. Even depending on the relative location of the camera and the texts in the scene, each character may experience a different amount of perspective distortion. Some GAN-based models [2, 19] require multiple observations of a font type to faithfully generate unobserved characters. A multiple observation-based generation strategy requires a rigorous distortion removal step before applying generative algorithms. Thus, rather than word-level generation, we follow a character-level generative model to accommodate maximum flexibility.

Contributions: To the best of our knowledge, this is the first work that attempts to modify texts in scene images. For this purpose, we design a generative network that adapts to the font features of a single character and generates the other necessary characters. We also propose a model to transfer the color of the source character to the target character. The entire process works without any explicit character recognition. To restrict the complexity of our problem, we limit our discussion to scene texts with upper-case non-overlapping characters. However, we demonstrate in Figs. 5 and 13 that the proposed method can also be applied to lower-case characters and numerals.
2. Related Works
Because of its large potential, character synthesis from a few examples is a well-known problem. Previously, several works tried to address the problem using geometrical modeling of fonts [6, 24, 27]. Different synthesis models have also been proposed explicitly for Chinese font generation [19, 37]. Along with statistical models [24] and bilinear factorization [30], machine learning algorithms have been used to transfer font features. Recently, deep learning techniques have also become popular for the font synthesis problem. Supervised [31] and definite samples [4] of observations have been used to generate unknown samples using deep neural architectures. Recently, Generative Adversarial Network (GAN) models have been found to be effective in different image synthesis problems. GANs can be used in image style transfer [10], structure generation [13] or both [2]. Some of these algorithms achieve promising results in generating font structures [7, 19], whereas some exhibit the potential to generate complex fonts with color [2]. To the best of our knowledge, these generative algorithms work with text images that are produced using design software, and their applicability to editing real scene images is unknown. Moreover, most of the algorithms [2, 4] require explicit recognition of the source characters to generate the unseen character set. This may create difficulty in our problem, as text recognition in scene images is itself a challenging problem [3, 11, 21] and any error in the recognition step may affect the entire generative process. Character generation from multiple observations is also challenging for scene images, as the observed characters may have distinctively different characteristics like font types, sizes, colors, perspective distortions, etc.

Convolutional Neural Networks (CNNs) have proved to be effective in style transfer with generative models [10, 17, 18]. Recently, CNN models have been used to generate style and structure with different visual features [9]. We propose a CNN-based character generation network that works without any explicit recognition of the source characters. For a natural-looking generation, it is also important to transfer the color and texture of the source character to the generated character. Color transfer is a widely explored topic in image processing [25, 28, 35]. Though these traditional approaches are good for transferring global colors in images, most of them are inappropriate for transferring colors in more localized character regions. Recently, GANs have also been employed in the color transfer problem [2, 16]. In this work, we introduce a CNN-based color transfer model that takes the color information present in the source character and transfers it to the generated target character. The proposed color transfer model not only transfers solid colors from the source to the target character, it can also transfer gradient colors while keeping subtle visual consistency.
3. Methodology
The proposed method is composed of the following steps: (1) selection of the source character to be replaced, (2) generation of the binary target character, (3) color transfer and (4) character placement. In the first step, we manually select the text area that requires modification. Then, the algorithm detects the bounding boxes of each character in the selected text region. Next, we manually select the bounding box around the character to be modified and also specify the target character. Based on these user inputs, the target character is generated, colorized and placed in the inpainted region of the source character.
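As a rough orientation before the detailed subsections, the four steps can be outlined as the following Python sketch. All helper names (`select_text_region`, `detect_characters`, `fannet_generate`, `colornet_transfer`, `place_character`) are hypothetical placeholders for the components described below, not the released API:

```python
# Structural sketch of the STEFANN editing pipeline (hypothetical helpers;
# see the subsections below for what each step actually does).

def edit_character(image, target_char):
    region = select_text_region(image)          # EAST proposal + manual corners
    masks, boxes = detect_characters(region)    # MSER + connected components
    idx = ask_user_for_character(boxes)         # user picks the source character
    binary_target = fannet_generate(masks[idx], target_char)           # FANnet
    color_target = colornet_transfer(image, boxes[idx], binary_target) # Colornet
    return place_character(image, boxes[idx], color_target)  # inpaint + seam carve
```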
Let us assume that I is an image that has multiple text regions, and Ω is the domain of a text region that requires modification. The region Ω can be selected using any text detection algorithm [5, 20, 36]. Alternatively, a user can select the corner points of a polygon that bounds a word to define Ω. In this work, we use EAST [38] to tentatively mark the text regions, followed by a manual quadrilateral corner selection to define Ω. After selecting the text region, we apply the MSER algorithm [8] to detect the binary masks of individual characters present in the region Ω. However, MSER alone cannot generate a sharp mask for most of the characters. Thus, we calculate the final binarized image I_c defined as

$$
I_c(p) = \begin{cases} I_M(p) \odot I_B(p) & \text{if } p \in \Omega \\ 0 & \text{otherwise} \end{cases}
$$

where I_M is the binarized output of the MSER algorithm [8] when applied on I, I_B is the binarized image of I, and ⊙ denotes the element-wise product of matrices. The image I_c contains the binarized characters in the selected region Ω. If the color of the source character is darker than its background, we apply inverse binarization on I to get I_B.

Figure 2. Architecture of FANnet and Colornet. At first, the target character ('N') is generated from the source character ('H') by FANnet keeping structural consistency. Then, the source color is transferred to the target by Colornet preserving visual consistency. Layer names in the figure are: conv = 2D convolution, FC = fully-connected, up-conv = upsampling + convolution.

Figure 3. Generation of target characters using FANnet. In each image block, the upper row shows the ground truth and the bottom row shows the generated characters when the network has observed one particular source character ('A') in each case.

Assuming the characters are non-overlapping, we apply a connected component analysis and compute the minimum bounding rectangle of each connected component. If there are N connected components present in a scene, C_n ⊆ Ω denotes the n-th connected area, where 0 < n ≤ N. Each bounding box B_n carries the same index as the connected area it bounds.
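A minimal OpenCV sketch of this binarization and component-analysis step is given below. The exact thresholds and post-processing of the released implementation may differ; `region_mask` (a binary mask of Ω, same size as the image) is an assumed input:

```python
import cv2
import numpy as np

def character_masks(image_bgr, region_mask, dark_text=True):
    """Sketch: combine an MSER mask (I_M) with an Otsu-binarized image (I_B)
    inside the selected region Omega, then find per-character boxes B_n."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # I_M: union of MSER regions rendered as a binary mask.
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    i_m = np.zeros_like(gray)
    for pts in regions:                      # pts is an (k, 2) array of (x, y)
        i_m[pts[:, 1], pts[:, 0]] = 255

    # I_B: Otsu binarization, inverted when the text is darker than background.
    flag = cv2.THRESH_BINARY_INV if dark_text else cv2.THRESH_BINARY
    _, i_b = cv2.threshold(gray, 0, 255, flag + cv2.THRESH_OTSU)

    # I_c = (I_M element-wise-product I_B) restricted to Omega, zero elsewhere.
    i_c = cv2.bitwise_and(i_m, i_b)
    i_c[region_mask == 0] = 0

    # Connected components give the per-character bounding boxes B_n.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(i_c)
    boxes = [tuple(stats[i, :4]) for i in range(1, n)]  # (x, y, w, h)
    return i_c, boxes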
Figure 4. Color transfer using Colornet: (a) Binary target character; (b) Color source character; (c) Ground truth; (d) Color transferred image. It can be observed that Colornet can successfully transfer solid color as well as gradient color.

The user specifies the indices that they wish to edit. We define Θ as the set of indices that require modification, such that |Θ| ≤ N, where |·| denotes the cardinality of a set. The binarized images I_{C_θ} associated with components C_θ, θ ∈ Θ, are the source characters, and with proper padding followed by scaling (discussed in Sec. 3.2), they individually act as the input of the font generation network. Each I_{C_θ} has the same dimension as the bounding box B_θ.

Conventionally, most neural networks take square images as input. However, as I_{C_θ} may have different aspect ratios depending on the source character, font type, font size, etc., a direct resizing of I_{C_θ} would distort the actual font features of the character. Instead, we pad I_{C_θ} maintaining its aspect ratio to generate a square binary image I_θ of size m_θ × m_θ such that m_θ = max(h_θ, w_θ), where h_θ and w_θ are the height and width of the bounding box B_θ respectively, and max(·) returns the maximum value. We pad both sides of I_{C_θ} along the x and y axes with p_x and p_y respectively to generate I_θ, such that

$$
p_x = \left\lceil \frac{m_\theta - w_\theta}{2} \right\rceil, \qquad p_y = \left\lceil \frac{m_\theta - h_\theta}{2} \right\rceil
$$

followed by reshaping I_θ to a square dimension of 64 × 64.

3.2.1 Font Adaptive Neural Network (FANnet)

Our generative font adaptive neural network (FANnet) takes two different inputs – an image of the source character of size 64 × 64 and a one-hot encoding v of length 26 of the target character. For example, if our target character is 'H', then v has the value 1 at index 7 and 0 at every other location. The input image passes through three convolution layers having 16, 16 and 1 filters respectively, followed by flattening and a fully-connected (FC) layer FC1. The encoded vector v also passes through an FC layer FC2. The outputs of FC1 and FC2 give 512-dimensional latent representations of the respective inputs. The outputs of FC1 and FC2 are concatenated and followed by two more FC layers, FC3 and FC4, having 1024 neurons each. The expanding part of the network reshapes this to a dimension of 8 × 8 × 16, followed by three 'up-conv' layers having 16, 16 and 1 filters respectively. Each 'up-conv' layer contains an upsampling followed by a 2D convolution. All the convolution layers have kernel size 3 × 3 and ReLU activation. The architecture of FANnet is shown in Fig. 2. The network minimizes the mean absolute error (MAE) while training with the Adam optimizer [14] with learning rate lr = 10^-3, momentum parameters β1 = 0.9, β2 = 0.999 and regularization parameter ε = 10^-7.

We train FANnet with 1000 fonts with all 26 upper-case character images as inputs and 26 different one-hot encoded vectors for each input. It implies that for 1000 fonts, we train the model to generate any of the 26 upper-case target character images from any of the 26 upper-case source character images. Thus, our training dataset has a total of 0.676 million input pairs. The validation set contains 0.2028 million input pairs generated from another 300 fonts. We select all the fonts from the Google Fonts database [12]. We apply the Otsu thresholding technique [22] on the grayscale output image of FANnet to get a binary target image.
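The padding scheme and the FANnet architecture described above can be sketched in Keras as follows. Layer sizes follow the text; details the paper leaves open (the activation of the final layer, the convolution padding mode) are assumptions here:

```python
import cv2
import numpy as np
from tensorflow.keras import layers, models, optimizers

def pad_to_square(char_img):
    """Center a character crop on an m x m canvas (m = max(h, w)), then
    resize to the 64 x 64 network input; any off-by-one between floor and
    ceiling padding is absorbed by the final resize."""
    h, w = char_img.shape
    m = max(h, w)
    py, px = (m - h) // 2, (m - w) // 2
    canvas = np.zeros((m, m), dtype=char_img.dtype)
    canvas[py:py + h, px:px + w] = char_img
    return cv2.resize(canvas, (64, 64))

def build_fannet():
    """Sketch of FANnet following the paper's description (Fig. 2)."""
    img_in = layers.Input(shape=(64, 64, 1))   # padded source character
    onehot_in = layers.Input(shape=(26,))      # one-hot target label v

    x = img_in
    for filters in (16, 16, 1):                # three contracting conv layers
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation='relu')(x)              # FC1

    y = layers.Dense(512, activation='relu')(onehot_in)      # FC2

    z = layers.Concatenate()([x, y])
    z = layers.Dense(1024, activation='relu')(z)             # FC3
    z = layers.Dense(1024, activation='relu')(z)             # FC4
    z = layers.Reshape((8, 8, 16))(z)

    for filters in (16, 16, 1):                # 'up-conv': upsample + conv
        z = layers.UpSampling2D(2)(z)          # 8 -> 16 -> 32 -> 64
        z = layers.Conv2D(filters, 3, padding='same', activation='relu')(z)

    model = models.Model([img_in, onehot_in], z)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss='mean_absolute_error')
    return model
```

Colornet, described next, can be sketched analogously, swapping ReLU for Leaky-ReLU and adding batch normalization after the first convolutions.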
It is important to have a faithful transfer of color from the source character for a visually consistent generation of the target character. We propose a CNN-based architecture, named Colornet, that takes two images as input – a colored source character image and a binary target character image. It generates the target character image with the color transferred from the source character image. Each input image goes through a 2D convolution layer, Conv1_col and Conv2_col for input1 and input2 respectively. The outputs of Conv1_col and Conv2_col are batch-normalized and concatenated, which is followed by three blocks of convolution layers with two max-pooling layers in between. The expanding part of Colornet contains two 'up-conv' layers followed by a 2D convolution. All the convolution layers have kernel size 3 × 3 and Leaky-ReLU activation with α = 0.2. The architecture of Colornet is shown in Fig. 2. The network minimizes the mean absolute error (MAE) while training with the Adam optimizer using the same parameter settings as mentioned in Sec. 3.2.1.

We train Colornet with synthetically generated image pairs. For each image pair, the color source image and the binary target image are both generated using the same font type, randomly selected from 1300 fonts. The source color images contain both solid and gradient colors so that the network can learn to transfer a wide range of color variations. We perform a bitwise AND between the output of Colornet and the binary target image to get the final colorized target character image.

Even after the generation of the target character, the placement requires several careful operations. First, we need to remove the source character from I so that the generated target character can be placed. We use image inpainting [29] with W(I_{C_θ}, ψ) as a mask to remove the source character, where W(I_b, ψ) is the dilation operation on any binary image I_b using the structuring element ψ. In our experiments, we consider ψ = 3 × 3. To begin the target character placement, the output of Colornet is first resized to the dimension of I_θ. We denote the resized color target character as R_θ, with minimum rectangular bounding box B_{R_θ}. If B_{R_θ} is smaller or larger than B_θ, we remove or add the region B_θ \ B_{R_θ} accordingly so that we have the space to position R_θ with proper inter-character spacing. We apply the content-aware seam carving technique [1] to manipulate the non-overlapping region. It is important to mention that if B_{R_θ} is smaller than B_θ, then after seam carving the entire text region Ω will shrink to a region Ω_s, and we also need to inpaint the region Ω \ Ω_s for consistency. However, both the regions B_θ \ B_{R_θ} and Ω \ Ω_s are considerably small and easy to inpaint for upper-case characters. Finally, we place the generated target character on the seam-carved image such that the centroid of B_{R_θ} overlaps with the centroid of B_θ.
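The erase-and-place step maps closely onto standard OpenCV primitives. A simplified sketch is shown below, omitting the seam-carving adjustment described above; `box` is the source bounding box B_θ and `source_mask` is the binary source-character mask:

```python
import cv2
import numpy as np

def place_character(image, source_mask, target_rgb, box):
    """Sketch of placement: erase the source character with fast-marching
    inpainting [29], then paste the colorized target so that the bounding-box
    centroids coincide. Seam carving of the surrounding region is omitted."""
    # Dilate the source-character mask with a 3x3 structuring element psi.
    psi = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(source_mask, psi)

    # Remove the source character by inpainting the dilated mask
    # (inpaint radius of 3 pixels is a chosen value, not from the paper).
    clean = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)

    # Resize the generated character to the source box and paste it.
    # Background pixels of the Colornet output are zero after the
    # bitwise AND with the binary target, so non-black pixels are foreground.
    x, y, w, h = box
    target = cv2.resize(target_rgb, (w, h))
    fg = target.sum(axis=2) > 0
    clean[y:y + h, x:x + w][fg] = target[fg]
    return clean
```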
4. Results
We tested our algorithm on the COCO-Text and ICDAR datasets. The images in the datasets are scene images with texts written in different unknown fonts. In Fig. 5, we show some of the images that are edited using STEFANN. In each image pair, the left image is the original image and the right image is the edited image. In some of the images, several characters are edited in a particular text region, whereas in some images, several text regions are edited in a single image. It can be observed that not only the font features and colors are transferred successfully to the target characters, but also the inter-character spacing is maintained in most of the cases. Though all the images are natural scene images and contain different lighting conditions, fonts, perspective distortions, backgrounds, etc., in all the cases STEFANN is able to edit the images without any significant visual inconsistency.

Code: https://github.com/prasunroy/stefann

Figure 5. Images edited using STEFANN. In each image pair, the left image is the original image and the right image is the edited image. It can be observed that STEFANN can faithfully edit texts even in the presence of specular reflection, shadow, perspective distortion, etc. It is also possible to edit lower-case characters and numerals in a scene image. STEFANN can easily edit multiple characters and multiple text regions in an image. More results are included in the supplementary materials.
Evaluation and ablation study of FANnet:
To evaluate the performance of the proposed FANnet model, we take one particular source character and generate all possible target characters. We repeat this process for every font in the test set. The outputs for some randomly selected fonts are shown in Fig. 3. Here, we only provide an image of the character 'A' as the source in each case and generate all 26 characters. To quantify the generation quality of FANnet, we select one source character at a time as input and measure the average structural similarity index (ASSIM) [34] of all 26 generated target characters against the respective ground truth images over a set of 300 test fonts. In Fig. 6, we show the average SSIM of the generated characters for each different source character. It can be seen from the ASSIM scores that some characters, like 'I' and 'L', are less informative as source characters.

Figure 6. Average SSIM of the generated characters for each different source character.

Table 1. Ablation study of FANnet architecture. Layer names are similar to Fig. 2.
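The ASSIM protocol above amounts to averaging per-character SSIM scores. A sketch using scikit-image, assuming outputs normalized to [0, 1] and a `fannet` model like the one sketched in Sec. 3.2.1 (the data layout of `gt_images` is an assumption):

```python
import numpy as np
from skimage.metrics import structural_similarity

def assim_for_source(fannet, source_img, gt_images):
    """Average SSIM over all 26 targets generated from one source character.
    `source_img` is a 64x64 float array in [0, 1]; `gt_images` maps target
    indices 0..25 to 64x64 ground-truth glyphs."""
    scores = []
    for t in range(26):
        onehot = np.zeros((1, 26), dtype=np.float32)
        onehot[0, t] = 1.0
        pred = fannet.predict([source_img[None, ..., None], onehot])[0, ..., 0]
        scores.append(structural_similarity(gt_images[t], pred, data_range=1.0))
    return np.mean(scores)
```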
Evaluation and ablation study of Colornet:
The performance of the proposed Colornet is shown in Fig. 4 for both solid and gradient colors. It can be observed that in both cases, Colornet can faithfully transfer the font color of source characters to target characters. As shown in Fig. 4, the model works equally well for all alphanumeric characters, including lower-case and upper-case letters. To understand the functionality of the layers in Colornet, we perform an ablation study and select the best model architecture that faithfully transfers the color. We compare the proposed Colornet architecture with two other variants – Colornet-L and Colornet-F. In Colornet-L, we remove the 'Block conv3' and 'up-conv1' layers to perform layer ablation. In Colornet-F, we reduce the number of convolution filters to 16 in both 'Conv1_col' and 'Conv2_col' layers to perform filter ablation. The results of color transfer for all three Colornet variants are shown in Fig. 8. It can be observed that Colornet-L produces visible color distortion in the generated images, whereas some color information is not present in the images generated by Colornet-F.

Figure 7. Additional generation results of target characters using FANnet. In each image block, the upper row shows the ground truth, and the bottom row shows the generated characters when the network observes only one particular character: (a) lower-case target characters generated from a lower-case source character ('a'); (b) upper-case target characters generated from a lower-case source character ('a'); (c) lower-case target characters generated from an upper-case source character ('A').

Figure 8. Color transfer results for different models: (a) Binary target character; (b) Color source character; (c) Ground truth; (d) Output of the proposed Colornet model; (e) Output of Colornet-L; (f) Output of Colornet-F. It can be observed that the Colornet architecture discussed in Sec. 3.3 transfers the color from source to target without any significant visual distortion.
Comparison with other methods:
To the best of our knowledge, there is no significant prior work that aims to edit textual information in natural scene images directly. MC-GAN [2] is a recent font synthesis algorithm, but applying it to scene text requires a robust recognition algorithm. Thus, on many occasions, it is not possible to apply and evaluate its performance on scene images. However, the generative performance of the proposed FANnet is compared with MC-GAN as shown in Fig. 9. We observe that given a single observation of a source character, FANnet outperforms MC-GAN, but as the number of observations increases, MC-GAN performs better than FANnet. This is also shown in Table 2, where we measure the quality of the generated characters using nRMSE and ASSIM scores. The comparison is done on the dataset [2] which was originally used to train MC-GAN with multiple observations; FANnet is not re-trained on this dataset, which also shows the adaptive capability of FANnet. In another experiment, we randomly select 1000 fonts for training and 300 fonts for testing from the MC-GAN dataset. When we re-train FANnet with this new dataset, we get an ASSIM score of 0.4836 over the test set. For MC-GAN, we get ASSIM scores of 0.3912 (single observation) and 0.5679 (3 random observations) over the same test set.

We also perform a comparison among MC-GAN [2], Project Naptha [15] and STEFANN assisted text editing schemes on scene images, as shown in Fig. 10. For the MC-GAN assisted editor, we replace FANnet and Colornet with MC-GAN cascaded with the Tesseract v4 OCR engine [26]. Project Naptha provides a web browser extension that allows users to manipulate texts in images. It can be observed that the generative capability of MC-GAN is directly affected by the recognition accuracy of the OCR and the variation of scale and color among source characters, whereas Project Naptha suffers from weak font adaptability and inpainting.

To understand the perceptual quality of the generated characters, we take opinions from 115 users for 50 different fancy fonts randomly taken from the MC-GAN dataset to evaluate the generation quality of MC-GAN and FANnet. For a single source character, 100% of users opine that the generation of FANnet is better than MC-GAN. For 3 random source characters, 67.5% of users suggest that the generation of FANnet is preferable over MC-GAN.

Figure 9. Comparison between MC-GAN and the proposed FANnet architecture. The green color indicates input to both models when only one observation is available. Yellow colors indicate input to MC-GAN and the red box indicates input to FANnet when 3 random observations are available. The evaluation is performed on the MC-GAN dataset.

Table 2. Comparison of synthetic character generation between MC-GAN and FANnet.

Model                            nRMSE    ASSIM
MC-GAN (1 observation)           0.4568   0.4098
MC-GAN (3 random observations)   0.3628   0.5485
FANnet (1 observation)           0.4504   0.4614
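For reference, the two scores in Table 2 can be computed along the following lines. The paper does not state its nRMSE normalization, so normalizing the RMSE by the ground-truth dynamic range is an assumption here:

```python
import numpy as np
from skimage.metrics import structural_similarity

def nrmse(gt, pred):
    """RMSE normalized by the ground-truth dynamic range (assumed convention)."""
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    return rmse / (gt.max() - gt.min() + 1e-8)

def scores(gt, pred):
    # Both images assumed to be float arrays normalized to [0, 1].
    return nrmse(gt, pred), structural_similarity(gt, pred, data_range=1.0)
```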
Figure 10. Comparison among MC-GAN, Project Naptha and STEFANN assisted text editing on scene images. Top row: Original images. Text regions to be edited are highlighted with green bounding boxes. OCR predictions for these text regions are shown in respective insets. Middle row: MC-GAN or Project Naptha assisted text editing. Bottom row: STEFANN assisted text editing.

Figure 11. Effectiveness of seam carving. In each column, the top row shows the original image, the middle row shows the edited image without seam carving, and the bottom row shows the edited image with seam carving.
Visual consistency during character placement:
Automatic seam carving of the text region is an important step to perform perceptually consistent modifications while maintaining the inter-character distance. Seam carving is particularly required when the target character is 'I'. It can be seen from Fig. 11 that the edited images with seam carving look visually more consistent than those without seam carving.
Evaluation of overall generation quality:
To evaluate the overall quality of editing, we perform two opinion-based evaluations with 136 viewers. First, they are asked to detect whether the displayed image is edited or not from a set of 50 unedited and 50 edited images shown at random. We get 15.6% true-positive (TP), 37.1% true-negative (TN), 12.8% false-positive (FP) and 34.4% false-negative (FN) responses; the overall accuracy (TP + TN) is 52.7%, close to the 50% chance level, which shows that the response is almost random. Next, the same viewers are asked to mark the edited character(s) from another set of 50 edited images. In this case, only 11.6% of edited characters are correctly identified by the viewers.
5. Discussion and Conclusion
The major objective of STEFANN is to perform image editing for error correction, text restoration, image reusability, etc. A few such use cases are shown in Fig. 12. Apart from these, with proper training, it can be used in font adaptive image-based machine translation and font synthesis. To the best of our knowledge, this is the first attempt to develop a unified platform to edit texts directly in images with minimal manual effort. STEFANN handles scene text editing efficiently for single or multiple characters while preserving visual consistency and maintaining inter-character spacing. Its performance is affected by extreme perspective distortion, high occlusion, large rotation, etc., which is expected as these effects are not present in the training data. Also, it should be noted that while training FANnet, we use Google Fonts [12], which contains a limited number of artistic fonts, and we train Colornet with only solid and gradient colors. Thus, at present, STEFANN does not accommodate editing complex artistic fonts or unusual texture patterns. The proposed FANnet architecture can also be used to generate lower-case target characters with a similar architecture as discussed in Sec. 3.2.1. However, in the case of lower-case characters, it is difficult to predict the size of the target character only from the size of the source character. This is mainly because lower-case characters are placed in different 'text zones' [23], and the source character may not be replaced directly if the target character falls into a different text zone. In Fig. 13, we show some images where we edit lower-case characters with STEFANN. In Fig. 14, we also show some cases where STEFANN fails to edit the text faithfully. The major reason behind the failure cases is an inappropriate generation of the target character. In some cases, the generated characters are not consistent with the same characters present in the scene [Fig. 14(a)], whereas in some cases the font features are not transferred properly [Fig. 14(b)]. We also demonstrate that STEFANN currently fails to work with extreme perspective distortion, high occlusion or large rotation [Fig. 14(c)]. In all the editing examples shown in this paper, the number of characters in a text region is not changed. One of the main limitations of the present methodology is that the font generative model FANnet generates images with dimension 64 × 64. While editing high-resolution text regions, a rigorous upsampling is often required to match the size of the source character. This may introduce severe distortion of the upsampled target character image due to interpolation. In the future, we plan to integrate super-resolution [32, 33] to generate very high-resolution character images that are necessary to edit any design or illustration. Also, we use MSER to extract text regions for further processing. So, if MSER fails to extract the character properly, the generation results will be poor. However, this can be rectified using better character segmentation algorithms. It is worth mentioning that robust image authentication and digital forensic techniques should be integrated with such software to minimize the risk of probable misuses of realistic text editing in images.
Figure 12. Application of STEFANN. Misspelled words (bounded in red) are corrected (bounded in green) in scene images.

Figure 13. Some images where lower-case characters are edited using STEFANN.
Figure 14. Some images where STEFANN fails to edit text with sufficient visual consistency.

Acknowledgements
We would like to thank NVIDIA Corporation for providing a TITAN X GPU through the GPU Grant Program.

References

[1] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH, 2007.
[2] Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. Multi-content GAN for few-shot font style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Edit probability for scene text recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] Shumeet Baluja. Learning typographic style: from discrimination to synthesis. Machine Vision and Applications, 2017.
[5] Michal Busta, Lukas Neumann, and Jiri Matas. Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[6] Neill DF Campbell and Jan Kautz. Learning a manifold of fonts. ACM Transactions on Graphics (TOG), 2014.
[7] Jie Chang and Yujun Gu. Chinese typography transfer. arXiv preprint arXiv:1707.04904, 2017.
[8] Huizhong Chen, Sam S Tsai, Georg Schroth, David M Chen, Radek Grzeszczuk, and Bernd Girod. Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In The IEEE International Conference on Image Processing (ICIP), 2011.
[9] Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Learning to read by spelling: Towards unsupervised text recognition. arXiv preprint arXiv:1809.08675, 2018.
[12] Google Inc. Google Fonts. https://fonts.google.com/, 2010.
[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[15] Kevin Kwok and Guillermo Webster. Project Naptha. https://projectnaptha.com/, 2013.
[16] Chongyi Li, Jichang Guo, and Chunle Guo. Emerging from water: Underwater image color correction based on weakly supervised color transfer. IEEE Signal Processing Letters, 2018.
[17] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. In ACM SIGGRAPH, 2017.
[19] Pengyuan Lyu, Xiang Bai, Cong Yao, Zhen Zhu, Tengteng Huang, and Wenyu Liu. Auto-encoder guided GAN for Chinese calligraphy synthesis. In International Conference on Document Analysis and Recognition (ICDAR), 2017.
[20] Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. Multi-oriented scene text detection via corner localization and region segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] Lukas Neumann. Scene text localization and recognition in images and videos. PhD thesis, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, 2017.
[22] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics (TSMC), 1979.
[23] U Pal and BB Chaudhuri. Automatic separation of machine-printed and hand-written text lines. In International Conference on Document Analysis and Recognition (ICDAR), 1999.
[24] Huy Quoc Phan, Hongbo Fu, and Antoni B Chan. FlexyFont: Learning transferring rules for flexible typeface synthesis. In Computer Graphics Forum, 2015.
[25] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 2001.
[26] Ray Smith. An overview of the Tesseract OCR engine. In International Conference on Document Analysis and Recognition (ICDAR), 2007.
[27] Rapee Suveeranont and Takeo Igarashi. Example-based automatic font generation. In International Symposium on Smart Graphics, 2010.
[28] Yu-Wing Tai, Jiaya Jia, and Chi-Keung Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[29] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of Graphics Tools, 2004.
[30] Joshua B Tenenbaum and William T Freeman. Separating style and content with bilinear models. Neural Computation, 2000.
[31] Paul Upchurch, Noah Snavely, and Kavita Bala. From A to Z: Supervised transfer of style and content using deep neural network generators. arXiv preprint arXiv:1603.02003, 2016.
[32] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[33] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), 2018.
[34] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 2004.
[35] Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to greyscale images. In ACM SIGGRAPH, 2002.
[36] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[37] Baoyao Zhou, Weihong Wang, and Zhanghui Chen. Easy generation of personal Chinese handwritten fonts. In The IEEE International Conference on Multimedia and Expo (ICME), 2011.
[38] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Supplementary Materials
Figure S1. Editing texts in road signboards with STEFANN. Left: Original images. Right: Edited images. Text regions are intentionally left unmarked to show the visual coherence of edited texts with original texts without creating passive attention.

Figure S2. Editing texts in movie posters with STEFANN. Left: Original images. Right: Edited images. Text regions are intentionally left unmarked to show the visual coherence of edited texts with original texts without creating passive attention.

Figure S3. Editing texts in scene images with STEFANN. Left: Original images with text regions marked in red. Right: Edited images with text regions marked in green.

Figure S4. Editing texts in scene images with STEFANN. Left: Original images with text regions marked in red. Right: Edited images with text regions marked in green.

Figure S5. Generation of all possible image pairs for a specific font with FANnet. In the first row, characters highlighted in green are the ground truth images. For each subsequent row, the character highlighted in red is the image input (source) and the characters highlighted in blue are the image outputs (targets) from FANnet, generated by varying the encoding input for each target character. This figure shows the structural consistency of FANnet on a specific font regardless of the source character.

Figure S6. Additional color transfer results (panels labeled 'Ground Truth' and 'Colornet').