Adversarial Segmentation Loss for Sketch Colorization
Samet Hicsonmez⋆    Nermin Samet†    Emre Akbas†    Pinar Duygulu⋆
⋆ Computer Engineering, Hacettepe University, Ankara, Turkey
† Computer Engineering, Middle East Technical University, Ankara, Turkey
ABSTRACT
We introduce a new method for generating color images from sketches or edge maps. Current methods either require some form of additional user guidance or are limited to the "paired" translation approach. We argue that segmentation information can provide valuable guidance for sketch colorization. To this end, we propose to leverage semantic image segmentation, as provided by a general-purpose panoptic segmentation network, to create an additional adversarial loss function. Our loss function can be integrated into any baseline GAN model. Our method is not limited to datasets that contain segmentation labels, and it can be trained for "unpaired" translation tasks. We show the effectiveness of our method on four different datasets spanning scene-level indoor, outdoor, and children's book illustration images using qualitative, quantitative and user study analysis. Our model improves its baseline by up to 35 points on the FID metric. Our code and pretrained models can be found at https://github.com/giddyyupp/AdvSegLoss.
Index Terms — sketch colorization, sketch to image translation, Generative Adversarial Networks (GAN), image segmentation, image to image translation
1. INTRODUCTION
Generating an image from an input sketch (i.e. edge map), a task known as "sketch to image translation", or "sketch colorization" for short, is an attractive task as sketches are easy to obtain and convey essential information about the content of an image. At the same time, it is a challenging task due to the large domain gap between single-channel edge maps and color images. In addition, sketches usually lack details for background objects, and sometimes even for foreground objects.

Sketch colorization has been explored in a variety of domains including faces [1, 2, 3], objects [4, 5, 6, 7], animes [8, 9, 10, 11, 12], art [13] and scenes [14, 15, 16]. Most of the methods in these studies require some form of user guidance, as additional input, in the form of, e.g., a reference color, patch or image. Without this guidance, these methods produce unrealistic colorizations. Another important observation is that, except for Liu et al.'s work [7], all methods follow the "paired" approach, which limits the method to datasets that have a ground-truth image per sketch.

In this work, we propose to leverage general-purpose semantic image segmentation to alleviate the two shortcomings mentioned above. Semantic segmentation methods have matured to a level where they produce useful results even for datasets on which they were not trained (Section 4). We hypothesize that a correctly colored sketch would yield a proper segmentation result, and we leverage this result in an extra adversarial loss in a GAN setting. By doing so, our method neither requires additional user guidance nor becomes limited to the "paired" domain.

We propose a new method which utilizes semantic segmentation for the sketch based image colorization problem. We introduce three models considering different levels of segmentation feedback in the sketch to image translation pipeline. Our models can be integrated into any paired or unpaired GAN model. We demonstrate the effectiveness of using segmentation cues through extensive evaluations. Our contributions in this paper can be summarized as follows. (i) We propose to use general-purpose semantic segmentation as an additional adversarial loss in a GAN model, for the sketch colorization problem. Ground-truth segmentation labels are not a requirement for our approach. (ii) Our method is neither specific to a domain (e.g. face, art, anime, etc.) nor limited to the "paired" approach. (iii) We conduct extensive evaluations on four distinct datasets in both paired and unpaired settings, and show the effectiveness of our method through qualitative, quantitative and user study analysis.
2. RELATED WORK
Even though the edge map and the sketch of an image are different concepts, in practice, XDoG [17] or HED [18] based edge maps are considered as sketches (e.g., [14, 10]). Moreover, some sketch based models [4] use edge maps for data augmentation. Hence, we refer to all these methods as sketch to image translation models. General-purpose image-to-image translation methods [19, 20, 21, 22, 23] could be used to solve sketch to image translation tasks. However, the results of these generic methods are usually not very satisfactory.

One widely used solution to improve colorization performance is to employ additional color [14, 10, 11, 9], patch [5], image [2, 12, 13, 8] or language guidance [16]. For instance, in color guidance, users specify their desired colors for regions in the sketch image, and the model utilizes this information to generate exact or similar colors for these regions. Some automatic methods also utilize user guidance to improve their performance, resulting in a hybrid approach. Most of the sketch to image translation methods are based on the "paired" training approach [14, 10, 4, 15]; however, unpaired methods have also been presented recently [7, 13].

In Scribbler [14], one of the very first paired and user-guided scene sketch colorization models, a total variation loss is used to encourage smoothness, in addition to pixel, perceptual and GAN losses. They use XDoG to generate sketch images of 200k bedroom photos, and produce colorized images. Zou et al. [16] use text inputs to progressively colorize an input sketch, in such a way that a novel text-guided sketch segmenter segments and locates the objects in the scene. EdgeGAN [15] maps edge images to a latent space during training using an edge encoder. During inference, the edge encoder encodes the input sketch to the latent space to subsequently generate a color image. They experimented with 14 foreground and 3 background objects from the COCO [24] dataset.

EdgeGAN [15] and Scribbler [14] use a supervised approach where input sketches and corresponding output images exist.
Fig. 1: Our proposed model with adversarial segmentation loss for sketch colorization.

However, it is hard to collect sketch-image pairs. Liu et al. [7] propose a two-stage method to convert object sketches to color images in an unsupervised (unpaired) way. They first convert sketches to grayscale images, and then to color images. Self-supervision is also used to complete deliberately deleted sketch parts and to clear the added noisy edges from sketch images. In Sketch-to-Art [13], an art image is generated using an input sketch and a target art style image. They encode the content of the input sketch and the style of the art image, then fuse both features to generate a stylized art image.
3. MODEL
Figure 1 shows the overall structure of our proposed model. The box with dashed yellow borders shows the inference stage of our model. The red border marks the GAN model used for sketch to image translation. In this work, we use Pix2Pix and CycleGAN as baselines for paired and unpaired training, respectively. This preference is based on the effectiveness of both methods across a variety of tasks and datasets. Our model could be integrated into any other GAN model.

Our model consists of a baseline GAN, a panoptic segmentation network (Seg) and two discriminators (D_M and D_B). The panoptic segmentation network is trained offline on the COCO-Stuff [25] dataset and its weights are frozen during the training of our model. Real and fake images are fed to the Seg network to obtain real and fake segmentation maps. These two segmentation maps are then given to the discriminators, which classify them as real or fake.

We designed three variants of our model to embed different levels of segmentation feedback into the sketch to image translation pipeline.

The first variant utilizes the full segmentation map of an image, where all foreground and background classes (a total of 135 classes) are considered. In this model, the ground-truth color image I_real and the generated color image I_fake are fed to Seg, which outputs full segmentation maps for both images. These two outputs are then given to a discriminator network D_M to discriminate between real and fake segmentation maps. We call this model Multi-class in the rest of the paper.

As a higher level of abstraction, grouping objects only as background and foreground may yield sufficient information. In the second variant of our model, we use only two classes (background and foreground) in the segmentation map by grouping all foreground classes into one class and all background classes into another. As with our Multi-class model, the binary segmentation outputs for real and fake images are fed to a discriminator network D_B to discriminate between real and fake ones. We refer to this model as Binary.
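To make the two feedback levels concrete, the sketch below shows one plausible way to derive the Multi-class and Binary inputs for the segmentation discriminators from the frozen Seg network. It is a minimal illustration under stated assumptions, not the released code: the helper names (seg_net, FOREGROUND_IDS), the split of the 135 classes into foreground/background ids, and the use of soft class probabilities (so that gradients can still reach the generator) are all assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 135                      # foreground + background classes, as stated in Section 3
FOREGROUND_IDS = list(range(80))       # assumption: COCO "thing" classes treated as foreground
BACKGROUND_IDS = list(range(80, NUM_CLASSES))

def segmentation_maps(seg_net: nn.Module, images: torch.Tensor):
    """Map images to the two segmentation representations fed to D_M and D_B.

    seg_net is a frozen network assumed to return per-pixel class logits of
    shape (B, NUM_CLASSES, H, W). Soft probabilities are used instead of hard
    argmax labels so the adversarial signal stays differentiable w.r.t. the
    generator (an assumption; the paper does not require this exact choice).
    """
    probs = seg_net(images).softmax(dim=1)                   # (B, 135, H, W)

    multi_class_map = probs                                   # input to D_M

    fg = probs[:, FOREGROUND_IDS].sum(dim=1, keepdim=True)    # foreground probability
    bg = probs[:, BACKGROUND_IDS].sum(dim=1, keepdim=True)    # background probability
    binary_map = torch.cat([fg, bg], dim=1)                   # (B, 2, H, W), input to D_B

    return multi_class_map, binary_map
```

The same two maps are computed for the ground-truth image I_real and the generated image I_fake; only the discriminators and the generator receive gradient updates, never Seg.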
Fig. 2: Sample segmentation results on different datasets.

Finally, our third variant is the union of the above two. It contains both discriminators, and is named Both. The overall loss function of our model is the summation of the loss of the baseline GAN model (L_G) and the losses of the two additional discriminators (L_B and L_M). Formally, the objective function is L = w_g L_G + w_b L_B + w_m L_M. We set w_g, w_b and w_m based on visual inspection on a small set of training images.

We used PyTorch [26] to implement our models. We use sketch images as the source domain, and color images as the target domain. The datasets are described in Section 4. All training images (i.e. color and sketch images) are resized to a fixed resolution. We train all models using the Adam optimizer [27]. We conducted all our experiments on an Nvidia Tesla V100 GPU.
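A compact view of how the adversarial segmentation terms could be combined with the baseline objective in one generator/discriminator update is sketched below. It is illustrative only: the least-squares GAN loss, the equal default weights, and the reuse of the segmentation_maps helper from the previous sketch are assumptions layered on top of L = w_g L_G + w_b L_B + w_m L_M; L_G stands for whatever loss the chosen backbone (Pix2Pix or CycleGAN) already computes.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()   # least-squares GAN loss; an assumption, the baseline may use BCE instead

def adversarial_seg_losses(seg_net, d_multi, d_binary, real_img, fake_img, w_b=1.0, w_m=1.0):
    """Generator-side and discriminator-side parts of the adversarial segmentation loss."""
    real_multi, real_bin = segmentation_maps(seg_net, real_img)   # helper from the previous sketch
    fake_multi, fake_bin = segmentation_maps(seg_net, fake_img)

    def adv(d, real, fake):
        # Discriminator tries to label real maps 1 and fake maps 0;
        # the generator tries to make its fake maps be labelled 1.
        pred_fake = d(fake)
        g_term = mse(pred_fake, torch.ones_like(pred_fake))
        pred_real, pred_fake_det = d(real), d(fake.detach())      # detach: no generator update here
        d_term = mse(pred_real, torch.ones_like(pred_real)) + \
                 mse(pred_fake_det, torch.zeros_like(pred_fake_det))
        return g_term, d_term

    g_m, d_m = adv(d_multi, real_multi, fake_multi)
    g_b, d_b = adv(d_binary, real_bin, fake_bin)
    return w_m * g_m + w_b * g_b, w_m * d_m + w_b * d_b

# Generator objective, following L = w_g * L_G + w_b * L_B + w_m * L_M:
#   loss_G = w_g * baseline_loss_G + g_seg_loss
# The d_seg_loss part updates only D_M and D_B; Seg stays frozen throughout.
```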
4. DATASET
We evaluated our models on four challenging datasets. The first dataset consists of bedroom images from the ADE20K indoor dataset [28], split into train and test sets. The second dataset contains children's book illustrations by Axel Scheffler [22], again split into train and test sets. The third and fourth datasets were curated by us from the COCO dataset. We collected images which contain elephant or sheep instances. Note that these images contain not only elephants and sheep but also objects/regions from other foreground and background classes such as person, animals, mountains, grass and sky. Example images from these datasets and their segmentation outputs are shown in Figure 2.

Fig. 3: Sample results from baselines (CycleGAN and Pix2Pix) and our model with different settings. Columns: input, GT, baseline, AdvSegLoss (Multi-class), AdvSegLoss (Binary), AdvSegLoss (Both). Input images on each row are from the bedroom, illustration, elephant and sheep datasets, respectively. The first two rows display results of unpaired training, and the last two rows show results of paired training. On the bedroom and elephant datasets the Binary setting, and on the illustration and sheep datasets the Both setting, gave the best results for both training schemes.

Edge images are extracted using the HED [18] method. In the first two columns of Figure 3, we present sample natural and edge images for all the datasets. It can be seen that the images contain a variety of foreground and background objects; it is also hard, even for the trained eye, to figure out the source dataset for some of the edge images.

Our code, pretrained models, the scripts to produce the "sheep" and "elephant" datasets, and the corresponding sketch images can be found at https://github.com/giddyyupp/AdvSegLoss.
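For readers who want a picture of the curation step before consulting the released scripts, a minimal pycocotools-based sketch is shown below. Only the category names ("elephant", "sheep") come from the text above; the annotation file paths, output layout and split policy are placeholders, not the authors' actual pipeline.

```python
import os
import shutil
from pycocotools.coco import COCO

def collect_category_images(ann_file, img_dir, out_dir, category="elephant"):
    """Copy every COCO image that contains at least one instance of `category`."""
    coco = COCO(ann_file)                        # e.g. "instances_train2017.json" (placeholder path)
    cat_ids = coco.getCatIds(catNms=[category])
    img_ids = coco.getImgIds(catIds=cat_ids)     # images with >= 1 instance of the category
    os.makedirs(out_dir, exist_ok=True)

    for info in coco.loadImgs(img_ids):
        src = os.path.join(img_dir, info["file_name"])
        shutil.copy(src, os.path.join(out_dir, info["file_name"]))
    return len(img_ids)

# Example usage (placeholder paths):
# n = collect_category_images("instances_train2017.json", "train2017/", "elephant_train/")
```

HED edge maps would then be extracted from the collected images to form the sketch domain, as described above.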
5. EXPERIMENTS
We compared our models with the baseline models CycleGAN [19] and Pix2Pix [21]. We used their official implementations, which are publicly available. Baseline models are trained for the same number of epochs as our models.
To quantitatively evaluate the quality of the generated images, we used the widely adopted Fréchet Inception Distance (FID) [29] metric. The FID score measures the distance between the distributions of the generated and real images; a lower FID score indicates higher similarity between the two image sets. We present FID scores for all the experiments in Table 1. The FID scores are in line with the visual inspections (see Figure 3): for all the datasets, at least one of the variants of our model performed better than the baseline.

First of all, when we compare the FID scores of the two training schemes and baseline models, paired training (Pix2Pix) performed better than unpaired training, as expected. However, our adversarial segmentation loss affected the results of the paired and unpaired cases differently. For instance, on the elephant dataset our models improved the baseline by a large margin in the unpaired case, but only slightly in the paired case.

Another crucial observation is that segmentation guidance closed the gap between unpaired and paired training results. The best FID scores of the unpaired models on the bedroom, illustration and elephant datasets become very close to, or even better than, those of paired training. For instance, on the elephant dataset, the initial FID gap between paired and unpaired training dropped considerably in the Binary setting. The only exception here is the sheep dataset: since the sheep dataset contains various complex objects, both unpaired and paired models failed to generate plausible images.
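As a reminder of what the metric computes, FID fits a Gaussian to the Inception-feature statistics (mean mu and covariance C) of each image set and evaluates FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)). The snippet below is a small NumPy/SciPy illustration of that formula operating on pre-extracted Inception features; it is not the evaluation code used for Table 1.

```python
import numpy as np
from scipy import linalg

def fid_from_features(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Compute FID between two sets of Inception features, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_fake, rowvar=False)

    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root of C_r * C_g
    covmean = covmean.real                                  # discard tiny imaginary parts from numerics

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```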
Table 1: Comparison with baseline methods in terms of FID scores; lower is better. Columns, from left to right: unpaired training (CycleGAN, +AdvSegLoss Multi-class, +AdvSegLoss Binary, +AdvSegLoss Both) and paired training (Pix2Pix, +AdvSegLoss Multi-class, +AdvSegLoss Binary, +AdvSegLoss Both).

Bedroom   113.1   111.7
Elephant  126.4   103.9
Table 2: User study results (percentage of times each model's output was preferred).

Dataset       CycleGAN   +AdvSegLoss
Bedroom       20.0
Illustration  27.0
Elephant      39.1
Sheep         19.1

When we look at the best performing settings on the different datasets, the structure of the dataset has an effect on the results. For instance, even though one is an indoor and the other an outdoor dataset, bedroom and elephant images are composed of similar structure. FG/BG ratios and placements are similar across all images in these datasets, i.e. walls, ceilings and floors in bedroom images are always positioned in the same places in different images. Elephant images also contain very few FG objects, i.e. only elephants most of the time, and large BG areas such as grass, trees and sky. On these two datasets, the Binary setting, which considers FG/BG classes only, gave the best FID score. On the other hand, illustration and sheep images contain a variety of FG objects and scenes. On such datasets, using only an FG/BG discriminator even degrades the performance.
We present visual results of sketch colorization for our model and the baseline models in Figure 3. On the bedroom and illustration datasets we show results of unpaired training, and on the elephant and sheep datasets we show paired training results.

On the bedroom dataset, the Binary setting generates better images compared to the baselines and the other settings. Colors are uniform across object parts in this setting. There are defective colors in the CycleGAN results, such as the bottom of the bed and the floor. On the illustration dataset, the baseline model performed poorly: objects are hard to recognize and, most importantly, the colors are not proper at all. On the other hand, the Multi-class and Both settings generate significantly better images, i.e. the generated objects and background get consistent colors.

Finally, on the elephant and sheep datasets, although the generated images are not very visually appealing for any of the methods, the segmentation-guided images are quite appealing compared to the baseline models'. On the elephant dataset the Binary setting, and on the sheep dataset the Both setting, performed the best.

We conducted a user study to measure the realism of the generated images. For each of the four datasets, we show two random images (at random positions, left or right), one generated by CycleGAN and one by our best setting (lowest FID score), and ask participants to select the more realistic one. Readers can reach our study at http://52.186.136.234:8080/.

We collected a total of 115 survey inputs from 39 different users. In Table 2, we present the results of the user study in terms of the preference percentage of each model. The user study results are in line with the FID scores: on all datasets, images generated by our model were preferred by the users most of the time.

Fig. 4: Sample results on elephant and sheep datasets for CycleGAN and our best model. Columns: input, GT, CycleGAN, +AdvSegLoss. The realism of both models is not satisfactory; however, the colors of BG areas in particular are better in our results.
6. CONCLUSION
In this paper, we presented a new method for the sketch colorization problem. Our method utilizes a general-purpose image segmentation network and adds an adversarial segmentation loss (AdvSegLoss) to the regular GAN loss. AdvSegLoss can be integrated into any GAN model, and works even if the dataset does not have segmentation labels. We used CycleGAN and Pix2Pix as baseline GAN models in this work. We conducted extensive evaluations on various datasets including bedroom, sheep, elephant and illustration images, and evaluated the performance both quantitatively (using the FID score) and qualitatively (through a user study). We showed that our model outperforms the baselines on all datasets on both the FID score and the user study analysis.

Regarding the limitations of our method, although we improve the baseline both qualitatively and quantitatively, the elephant and sheep results in particular lack realism. Even the paired training results are not visually appealing on these two datasets (last two rows of Figure 3), most probably because the baseline models are not very successful at generating complex scenes. Ablation studies are needed to set w_g, w_b and w_m to their optimal values.

7. ACKNOWLEDGEMENTS

The numerical calculations reported in this paper were fully performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
8. REFERENCES

[1] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu, "Deepfacedrawing: Deep generation of face images from sketches," ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 72–1, 2020.
[2] Junsoo Lee, Eungyeup Kim, Yunsung Lee, Dongjun Kim, Jaehyuk Chang, and Jaegul Choo, "Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence," in CVPR, 2020.
[3] Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha, "Linestofacephoto: Face photo generation from lines with conditional self-attention generative adversarial networks," in ACM MM, 2019.
[4] Wengling Chen and James Hays, "Sketchygan: Towards diverse and realistic sketch to image synthesis," in CVPR, 2018.
[5] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays, "Texturegan: Controlling deep image synthesis with texture patches," in CVPR, 2018.
[6] Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang, "Image generation from sketch constraint using contextual gan," in ECCV, 2018.
[7] Runtao Liu, Qian Yu, and Stella X. Yu, "Unsupervised sketch to photo synthesis," in ECCV, 2020.
[8] Chie Furusawa, Kazuyuki Hiroshiba, Keisuke Ogaki, and Yuri Odagiri, "Comicolorization: Semi-automatic manga colorization," in SIGGRAPH Asia 2017 Technical Briefs, 2017.
[9] Yuanzheng Ci, Xinzhu Ma, Zhihui Wang, Haojie Li, and Zhongxuan Luo, "User-guided deep anime line art colorization with conditional adversarial networks," in ACM MM, 2018.
[10] Yifan Liu, Zengchang Qin, Tao Wan, and Zhenbo Luo, "Auto-painter: Cartoon image generation from sketch by using conditional wasserstein generative adversarial networks," Neurocomputing, vol. 311, pp. 78–87, 2018.
[11] Lvmin Zhang, Chengze Li, Tien-Tsin Wong, Yi Ji, and Chunping Liu, "Two-stage sketch colorization," ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–14, 2018.
[12] Lvmin Zhang, Yi Ji, Xin Lin, and Chunping Liu, "Style transfer for anime sketches with enhanced residual u-net and auxiliary classifier gan," in Asian Conference on Pattern Recognition, 2017.
[13] Bingchen Liu, Kunpeng Song, Yizhe Zhu, and Ahmed Elgammal, "Sketch-to-art: Synthesizing stylized art images from sketches," in Asian Conference on Computer Vision, 2020.
[14] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays, "Scribbler: Controlling deep image synthesis with sketch and color," in CVPR, 2017.
[15] Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, and Changqing Zou, "Sketchycoco: Image generation from freehand scene sketches," in CVPR, 2020.
[16] Changqing Zou, Haoran Mo, Chengying Gao, Ruofei Du, and Hongbo Fu, "Language-based colorization of scene sketches," ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–16, 2019.
[17] Holger Winnemöller, Jan Eric Kyprianidis, and Sven C. Olsen, "Xdog: An extended difference-of-gaussians compendium including advanced image stylization," Computers & Graphics, vol. 36, no. 6, pp. 740–753, 2012.
[18] Saining Xie and Zhuowen Tu, "Holistically-nested edge detection," in ICCV, 2015.
[19] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[20] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, "Dualgan: Unsupervised dual learning for image-to-image translation," in ICCV, 2017.
[21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, 2017.
[22] Samet Hicsonmez, Nermin Samet, Emre Akbas, and Pinar Duygulu, "Ganilla: Generative adversarial networks for image to illustration translation," Image and Vision Computing, p. 103886, 2020.
[23] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, "Multimodal unsupervised image-to-image translation," in ECCV, 2018.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[25] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari, "Coco-stuff: Thing and stuff classes in context," in CVPR, 2018.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, "Automatic differentiation in pytorch," 2017.
[27] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," 2014.
[28] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba, "Semantic understanding of scenes through the ade20k dataset," arXiv preprint arXiv:1608.05442, 2016.
[29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, "Gans trained by a two time-scale update rule converge to a local nash equilibrium," in NIPS, 2017.
CVPR , 2018.[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan,Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmai-son, Luca Antiga, and Adam Lerer, “Automatic differentiationin pytorch,” 2017.[27] Diederik P. Kingma and Jimmy Ba, “Adam: A method forstochastic optimization,” 2014.[28] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, AdelaBarriuso, and Antonio Torralba, “Semantic understand-ing of scenes through the ade20k dataset,” arXiv preprintarXiv:1608.05442 , 2016.[29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-hard Nessler, and Sepp Hochreiter, “Gans trained by a twotime-scale update rule converge to a local nash equilibrium,”in