AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation
Bing Li*, Yuanlue Zhu*, Yitong Wang, Chia-Wen Lin, Bernard Ghanem, Linlin Shen
Visual Computing Center, KAUST, Thuwal, Saudi Arabia; ByteDance, Shenzhen, China; Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan; Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
*Equal contribution
Abstract
In this paper, we propose a novel framework to translate a portrait photo-face into an anime appearance. Our aim is to synthesize anime-faces which are style-consistent with a given reference anime-face. However, unlike typical translation tasks, such anime-face translation is challenging due to complex variations of appearances among anime-faces. Existing methods often fail to transfer the styles of reference anime-faces, or introduce noticeable artifacts/distortions in the local shapes of their generated faces. We propose AniGAN, a novel GAN-based translator that synthesizes high-quality anime-faces. Specifically, a new generator architecture is proposed to simultaneously transfer color/texture styles and transform local facial shapes into anime-like counterparts based on the style of a reference anime-face, while preserving the global structure of the source photo-face. We propose a double-branch discriminator to learn both domain-specific distributions and domain-shared distributions, helping generate visually pleasing anime-faces and effectively mitigate artifacts. Extensive experiments qualitatively and quantitatively demonstrate the superiority of our method over state-of-the-art methods.
1. Introduction
Animations play an important role in our daily life and have been widely used in entertainment, social, and educational applications. Recently, anime, aka Japanese animation, has become popular on social media platforms. Many people would like to transfer their profile photos into anime images whose styles are similar to those of the characters in their favorite animations, such as Cardcaptor Sakura and Sailor Moon. However, commercial image editing software fails to do this transfer, while manually producing an anime image in specific styles requires professional skills.

Figure 1: Illustration of some synthesized results with the proposed AniGAN for style-guided face-to-anime translation. The first row and the first column show reference anime-faces and source photo-faces, respectively. The remaining columns show the high-quality anime-faces with diverse styles generated by AniGAN, given source photo-faces with large pose variations and multiple reference anime-faces with different styles.
In this paper, we aim to automatically translate a photo-face into an anime-face based on the styles of a reference anime-face. We refer to such a task as Style-Guided Face-to-Anime Translation (StyleFAT). Inspired by the advances of generative adversarial networks (GANs) [13], many GAN-based methods (e.g., [19, 24, 45, 51]) have been proposed to automatically translate images between two domains. However, these methods [30, 51] focus on learning a one-to-one mapping between the source and target images, which does not transfer the information of the reference image into a generated image. Consequently, the styles of their generated anime-faces [30, 51] are usually dissimilar from those of the references. Recently, a few reference-guided methods [25, 35] were proposed for multi-modal translation, which generates diverse results by additionally taking reference images from the target domain as input. These methods, however, usually fail to fulfill the StyleFAT task and generate low-quality anime images.

Different from the image translation tasks targeted by reference-guided methods, StyleFAT poses new challenges in two aspects. First, an anime-face usually has large eyes, a tiny nose, and a small mouth, which are dissimilar from natural ones. The significant variations of shapes/appearances between anime-faces and photo-faces require translation methods to largely overdraw the local structures (e.g., eyes and mouth) of a photo-face, different from caricature translation [5] and makeup-face transfer [6, 20, 26], which preserve the identity of a photo-face. Since most reference-guided methods are designed to preserve the local structures/identity of a source image, these methods not only poorly transform the local shapes of facial parts into anime-like ones, but also fail to make these local shapes style-consistent with the reference anime-face. On the other hand, simultaneously transforming local shapes and transferring anime styles is challenging and has not yet been well explored. Second, anime-faces involve various appearances and styles (e.g., various hair textures and drawing styles). Such large intra-domain variations pose challenges in devising a generator that translates a photo-face into a specific-style anime-face, as well as in training a discriminator that captures the distributions of anime-faces.

To address the above problems, we propose a novel GAN-based model called AniGAN for StyleFAT. First, since it is difficult to collect pairs of photo-faces and anime-faces, we train AniGAN with unpaired data in an unsupervised manner. Second, we propose a new generator architecture that preserves the global information (e.g., pose) of a source photo-face, while transforming local facial shapes into anime-like ones and transferring colors/textures based on the style of a reference anime-face. The proposed generator does not rely on face landmark detection or face parsing. Our insight is that local shapes (e.g., large and round eyes) can be treated as a kind of style, like color/texture. In this way, transforming a face's local shapes can be achieved via style transfer. To transform local facial shapes via style transfer, we explore where to inject the style information into the generator. In particular, the multi-layer feature maps extracted by the decoder represent multi-level semantics (i.e., from high-level structural information to low-level textural information). Our generator therefore injects the style information into the multi-level feature maps of the decoder.
Guided by the injected style information and different levels of feature maps, our generator adaptively learns to transfer color/texture styles and transform local facial shapes. Furthermore, two normalization functions are proposed for the generator to further improve local shape transformation and color/texture style transfer. In addition to the generator, we propose a double-branch discriminator that explicitly considers the large appearance variations between photo-faces and anime-faces as well as the variations among anime images. The double-branch discriminator not only learns domain-specific distributions through two branches of convolutional layers, but also learns the distributions of a common space across domains through shared shallow layers, so as to mitigate artifacts in generated faces. Meanwhile, a domain-aware feature matching loss is proposed to reduce artifacts of generated images by exploiting domain information in the branches.

Our major contributions are summarized as follows:

1. To the best of our knowledge, this is the first study on the style-guided face-to-anime translation task.
2. We propose a new generator to simultaneously transfer color/texture styles and transform the local facial shapes of a source photo-face into their anime-like counterparts based on the style of a reference anime-face, while preserving the global structure of the source photo-face.
3. We devise a novel discriminator that helps synthesize high-quality anime-faces via learning domain-specific distributions, while effectively avoiding noticeable distortions in generated faces via learning cross-domain shared distributions between anime-faces and photo-faces.
4. Our new normalization functions improve the visual quality of generated anime-faces in terms of transforming local shapes and transferring anime styles.
2. Related Work
Generative Adversarial Networks.
Generative Adversarial Networks (GANs) [13] have achieved impressive performance for various image generation and translation tasks [4, 7, 22, 23, 27, 33, 37, 42, 46, 50]. The key to the success of GANs is the adversarial training between the generator and discriminator. In the training stage, the networks are trained with an adversarial loss, which constrains the distribution of the generated images to be similar to that of the real images in the training data. To better control the generation process, variants of GANs, such as conditional GANs (cGANs) [37] and multi-stage GANs [22, 23], have been proposed. In our work, we also utilize an adversarial loss to constrain the image generation. Our model uses GANs to learn the transformation from a source domain to a significantly different target domain, given unpaired training data.
Image-to-Image Translation.
With the popularization of GANs, GAN-based image-to-image translation techniques have been widely explored in recent years [9, 18, 19, 24, 30, 31, 45, 47, 51]. For example, trained with paired data, Pix2Pix [19] uses a cGAN framework with an L1 loss to learn a mapping function from input to output images. Wang et al. proposed an improved version of Pix2Pix [45] with a feature matching loss for high-resolution image-to-image translation.

For unpaired data, recent efforts [18, 24, 30, 31, 51] have greatly improved the quality of generated images. CycleGAN [51] proposes a cycle-consistency loss to remove the dependency on paired data. UNIT [30] maps source-domain and target-domain images into a shared latent space to learn the joint distribution between the source and target domains in an unsupervised manner. MUNIT [18] extends UNIT to multi-modal contexts by incorporating AdaIN [17] into a content and style decomposition structure. To focus on the most discriminative semantic parts of an image during translation, several works [24, 28] involve attention mechanisms. ContrastGAN [28] uses the object mask annotations from each dataset as extra input data. UGATIT [24] applies a new attention module and proposes an adaptive layer-instance normalization (AdaLIN) to flexibly control the amount of change in shapes and textures. However, the style controllability of the above methods is limited, because instance-level style features are not explicitly encoded. To overcome this, FUNIT [31] utilizes a few-shot image translation architecture for controlling the categories of output images, but its stability is still limited.

Neural Style Transfer.
StyleFAT can also be regarded as a kind of neural style transfer (NST) [11, 12, 21]. In the field of NST, many approaches have been developed to generate paintings with different styles. For example, CartoonGAN [8] devises several losses suitable for general photo cartoonization. ChipGAN [15] enforces voids, brush strokes, and ink wash tone constraints with a GAN loss for Chinese ink wash painting style transfer. APDrawingGAN [49] utilizes a hierarchical GAN to produce high-quality artistic portrait drawings. CariGANs [5] and WarpGAN [41] design special modules for geometric transformation to generate caricatures. Yaniv et al. [48] proposed a method for geometry-aware style transfer for portraits utilizing facial landmarks. However, the above methods either are designed for a specific art field which is completely different from animation, or rely on additional annotations (such as facial landmarks).
3. Proposed Approach
Problem formulation.
Let X and Y denote the source and target domains, respectively. Given a source image x ∈ X and a reference image y ∈ Y, our proposed AniGAN learns multimodal mapping functions G : (x, y) ↦ x̃ that transfer x into domain Y.

To generate high-quality anime-faces for the StyleFAT task, the goal is to generate an anime-face x̃ that well preserves the global information (e.g., face pose) of x and reflects the styles (e.g., colors and textures) of the reference anime-face y, while transforming the shapes of facial parts such as eyes and hair into anime-like ones.

To achieve the above goals, a question is posed: how can we simultaneously transform local shapes while transferring color/texture information? Existing methods focus on transferring styles while preserving both local and global structures/shapes of the source image, and thus cannot well address this problem. Differently, we explore where to inject style information into the generator, and a novel generator architecture is thereby proposed, as shown in Fig. 2. Different from existing methods which inject style information into the bottleneck of the generator, we introduce two new modules into the decoder of the generator for learning to interpret and translate style information. Furthermore, we also propose two normalization functions to control the style of generated images while transforming local shapes, inspired by recent work [17, 24].

In addition, anime-faces contain significant intra-domain variations, which poses a large challenge for generating high-quality images without artifacts. To further improve the stability of the generated results, a novel double-branch discriminator is devised to better utilize the distribution information of different domains.

Figs. 2 and 3 illustrate the architectures of our proposed generator and discriminator, respectively. The generator takes a source image and a reference image as inputs and then learns to synthesize an output image. The double-branch discriminator consists of two branches, where one branch discriminates real/fake images for domain X, and the other for Y. The generator of AniGAN consists of a content encoder E_c, a style encoder E_s, and a decoder F, as shown in Fig. 2.

Encoder.
The encoder includes a content encoder E_c and a style encoder E_s. Given a source photo-face x and a reference anime-face y, the content encoder E_c is used to encode the content of the source image x, and the style encoder is employed to extract the style information from the reference y. These functions can be formulated as follows:

α = E_c(x),  (1)
(γ_s, β_s) = E_s(y),  (2)

where α is the content code encoded by the content encoder E_c, and γ_s and β_s are the style parameters extracted from the reference anime-face y.
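To make the encoder interface concrete, here is a minimal PyTorch-style sketch of E_c and E_s following Eqs. (1)-(2). The layer widths, the number of downsampling stages, and the style-code dimensionality are our own assumptions; the paper does not specify them at this point.

```python
import torch.nn as nn

class ContentEncoder(nn.Module):
    """E_c: encodes a source photo-face x into a content code alpha (Eq. 1)."""
    def __init__(self, in_ch=3, base_ch=64, n_down=2):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base_ch, 7, 1, 3), nn.InstanceNorm2d(base_ch), nn.ReLU(inplace=True)]
        ch = base_ch
        for _ in range(n_down):  # downsample spatially while widening channels
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True)]
            ch *= 2
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # content code alpha


class StyleEncoder(nn.Module):
    """E_s: encodes a reference anime-face y into style codes (gamma_s, beta_s) (Eq. 2)."""
    def __init__(self, in_ch=3, base_ch=64, n_down=4, style_dim=256):
        super().__init__()
        layers, ch = [nn.Conv2d(in_ch, base_ch, 7, 1, 3), nn.ReLU(inplace=True)], base_ch
        for _ in range(n_down):
            layers += [nn.Conv2d(ch, min(ch * 2, 256), 4, 2, 1), nn.ReLU(inplace=True)]
            ch = min(ch * 2, 256)
        layers += [nn.AdaptiveAvgPool2d(1)]
        self.net = nn.Sequential(*layers)
        self.to_gamma = nn.Linear(ch, style_dim)  # produces gamma_s
        self.to_beta = nn.Linear(ch, style_dim)   # produces beta_s

    def forward(self, y):
        h = self.net(y).flatten(1)
        return self.to_gamma(h), self.to_beta(h)
```

In the decoder sketches below, gamma_s and beta_s are assumed to carry one value per feature channel of the layer they modulate.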
Decoder.

The decoder F constructs an image from a content code and style codes. However, different from typical image translation, which transfers styles while preserving both local and global structures of the source image, our decoder aims to transform the local shapes of facial parts and preserve the global structure of the source photo-face during style transfer.

Figure 2: Architecture of the proposed generator. It consists of a content encoder and a style encoder to translate the source image into an output image reflecting the style of the reference image. The grey trapezoids indicate typical convolution blocks. More detailed notations are described in the context of Sec. 3.
Figure 3: Architecture of the proposed double-branch discriminator, where D_U denotes the shared layers between the two branches for domains X and Y. The discriminator distinguishes real images from fake ones in the two individual branches.

Recent image translation methods [17, 18, 24, 31, 35] transfer the style of the target domain by equipping residual blocks (resblocks) in the bottleneck of the generator with style codes. However, we observe that such decoder architectures cannot well handle the StyleFAT task. Specifically, they are either insufficient to elaborately transfer an anime style, or they introduce visually annoying artifacts in generated faces. To decode high-quality anime-faces for StyleFAT, we propose an adaptive stack convolutional block (denoted as ASC block), a Fine-grained Style Transfer (FST) block, and two normalization functions for the decoder (see Fig. 2).
ASC block
Existing methods such as MUNIT [18], FUNIT [31], and EGSC-IT [35] transfer styles by injecting the style information of the target domain into the resblocks in the bottleneck of their decoders. However, we observe that resblocks may ignore some anime-style information, which degrades the translation performance of the decoder on the StyleFAT task. More specifically, although multiple reference images with different anime styles are given, a decoder with resblocks synthesizes similar styles for specific regions, especially the eyes. For example, a decoder with resblocks improperly renders the right eyes with the same color in the generated images (see Fig. 7). We argue that the decoder would “skip” some injected style information due to the residual operation. To address this issue, we propose an ASC block for the decoder. ASC stacks convolutional layers, activations, and our proposed normalization layers, instead of using resblocks, as shown in Fig. 2.
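A minimal sketch of such a stacked block is shown below, assuming two convolutional layers and leaving the choice of normalization to the caller (the paper uses the AdaPoLIN layer introduced under Normalization below); the depth and layer sizes are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ASCBlock(nn.Module):
    """Adaptive stack convolutional block: plain Conv -> norm -> activation layers,
    deliberately without residual connections, so injected style information
    cannot be 'skipped' by an identity path."""
    def __init__(self, channels, n_layers=2, make_norm=nn.InstanceNorm2d):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv2d(channels, channels, 3, 1, 1) for _ in range(n_layers)])
        self.norms = nn.ModuleList([make_norm(channels) for _ in range(n_layers)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, h):
        for conv, norm in zip(self.convs, self.norms):
            h = self.act(norm(conv(h)))  # no skip connection around the stack
        return h
```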
FST block
One aim of our decoder is to transform local facial features into anime-like ones, different from existing methods which preserve local structures of a source photo-face. We here explore how to transform local shapes in the decoder when transferring styles. One possible solution is to first employ face parsing or facial landmark detection to detect facial features and then transform local facial features via warping, as in [41]. However, since the local structures of anime-faces and photo-faces are significantly dissimilar to each other, warping often leads to artifacts in the generated anime-faces. For example, it is difficult to warp the mouth of a photo-face into the tiny mouth of a reference anime-face. Instead of employing face parsing or facial landmark detection, our insight is that local structures can be treated as a kind of style, like color/texture, and can be altered through style transfer.

Therefore, we propose an FST block to simultaneously transform the local shapes of the source image and transfer color/texture information from the reference image. In particular, as revealed in the literature, deep- and shallow-layer feature maps with low and high resolutions in the decoder carry different levels of semantic information, from high-level structures to low-level colors/textures. Motivated by this fact, we argue that the FST block can adaptively learn to transform local shapes and decode color/texture information by injecting the style information into feature maps with different resolutions. In other words, since a low-resolution feature map contains high-level structural information, FST can learn to transform local shapes into anime-like ones according to the specified style information. Thus, as shown in Fig. 2, FST consists of a stack of upsampling, convolutional, and normalization layers. Furthermore, style codes are injected after the upsampling layer in our FST block, different from prior works [18, 24, 31] that only inject style codes into the bottleneck of the generator. Consequently, the FST block enables the decoder to adaptively synthesize color/texture and transform local shapes, according to both the style codes and the different levels of feature maps.
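A hedged sketch of one FST block along these lines: upsampling first, then convolutions normalized by PoLIN and by the style-conditioned AdaPoLIN (both sketched after the Normalization paragraphs below), so the style codes are injected at this resolution of the decoder. The exact layer ordering inside the paper's block may differ.

```python
import torch.nn as nn

class FSTBlock(nn.Module):
    """Fine-grained Style Transfer block (sketch)."""
    def __init__(self, in_ch, out_ch, polin, adapolin):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')  # raise spatial resolution
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, 1, 1)
        self.norm1 = polin        # PoLIN(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1)
        self.norm2 = adapolin     # AdaPoLIN(out_ch), consumes the style codes
        self.act = nn.ReLU(inplace=True)

    def forward(self, h, gamma_s, beta_s):
        h = self.up(h)
        h = self.act(self.norm1(self.conv1(h)))                   # shape/texture mixing
        h = self.act(self.norm2(self.conv2(h), gamma_s, beta_s))  # style injection
        return h
```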
Normalization
Recent image translation methods [17, 18, 35] normalize the feature statistics of an image via Adaptive Instance Normalization (AdaIN) to adjust the color/texture styles of the image. However, we aim to devise normalization functions which not only transfer color/texture styles from the reference, but also transform local shapes of the source image based on the reference. Recently, it was shown in [24] that layer normalization (LN) [3] can transform the structure/shape of an image. Besides, AdaLIN was proposed in [24] to control the degree of changes in textures and shapes by adaptively combining AdaIN and LN. However, AdaLIN is insufficient to simultaneously transfer the color/texture information of a local region and its shape information from a reference image to a generated image. That is, since AdaLIN combines IN and LN in a per-channel manner, AdaLIN ignores the correlations among channels. For example, the shape styles of eyes and their color/texture styles may respectively dominate in different channels. In such a case, the features learned by AdaLIN often ignore shape styles or color/texture styles. In other words, the combination space of AdaLIN tends to be smaller than that of all-channel combinations of IN and LN.

To address the above issue, we propose two novel normalization functions, called point-wise layer instance normalization (PoLIN) and adaptive point-wise layer instance normalization (AdaPoLIN), for the generator. Our PoLIN and AdaPoLIN learn to combine all channels of IN and LN, different from AdaLIN [24].

To achieve an all-channel combination of instance normalization (IN) [44] and LN, PoLIN learns to combine IN and LN via a 1×1 convolutional layer, as defined below:

PoLIN(z) = Conv([ (z − μ_I(z)) / σ_I(z), (z − μ_L(z)) / σ_L(z) ]),  (3)

where Conv(·) denotes the 1×1 convolution operation, [·, ·] denotes channel-wise concatenation, z is the feature map of a network layer, and μ_I, μ_L and σ_I, σ_L denote the channel-wise and layer-wise means and standard deviations, respectively.

AdaPoLIN adaptively combines IN and LN, while employing the style codes from the reference anime-face to retain style information:

AdaPoLIN(z, γ_s, β_s) = γ_s · Conv([ (z − μ_I(z)) / σ_I(z), (z − μ_L(z)) / σ_L(z) ]) + β_s,  (4)

where γ_s and β_s are style codes, and the bias in Conv(·) is fixed to 0.

Thanks to their all-channel combination of IN and LN, the proposed PoLIN and AdaPoLIN lead to a larger combination space than AdaLIN, thereby making them beneficial for handling color/texture style transfer and local shape transformation in StyleFAT.
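A possible PyTorch implementation of Eqs. (3) and (4) is sketched below. The epsilon for numerical stability and the assumption that γ_s and β_s carry one value per feature channel are ours; the 1×1 convolution in AdaPoLIN has its bias fixed to zero as stated above.

```python
import torch
import torch.nn as nn

def _in_stats(z, eps=1e-5):
    # per-sample, per-channel mean/std (Instance Normalization statistics)
    return z.mean(dim=(2, 3), keepdim=True), z.var(dim=(2, 3), keepdim=True).add(eps).sqrt()

def _ln_stats(z, eps=1e-5):
    # per-sample mean/std over all channels and positions (Layer Normalization statistics)
    return z.mean(dim=(1, 2, 3), keepdim=True), z.var(dim=(1, 2, 3), keepdim=True).add(eps).sqrt()


class PoLIN(nn.Module):
    """Eq. (3): concatenate IN- and LN-normalized maps and fuse them with a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, z):
        mu_i, sig_i = _in_stats(z)
        mu_l, sig_l = _ln_stats(z)
        return self.fuse(torch.cat([(z - mu_i) / sig_i, (z - mu_l) / sig_l], dim=1))


class AdaPoLIN(nn.Module):
    """Eq. (4): the fused map is modulated by style codes from the reference anime-face."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False)  # bias fixed to 0

    def forward(self, z, gamma_s, beta_s):
        mu_i, sig_i = _in_stats(z)
        mu_l, sig_l = _ln_stats(z)
        out = self.fuse(torch.cat([(z - mu_i) / sig_i, (z - mu_l) / sig_l], dim=1))
        n, c = z.size(0), z.size(1)
        return gamma_s.view(n, c, 1, 1) * out + beta_s.view(n, c, 1, 1)
```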
Double-branch discriminator.

It is challenging to design a discriminator which effectively distinguishes real anime-faces from fake ones for StyleFAT. In particular, both the appearances and shapes vary largely among anime-faces, leading to significant intra-domain variations in the distribution of anime-faces. Thus, it is difficult for a typical discriminator (e.g., [51]) to well learn the distribution of anime-faces. As a result, the generated anime-faces may contain severely distorted facial parts and noticeable artifacts.

To address the above issues, we propose a double-branch discriminator. In particular, we assume that anime-faces and photo-faces partially share common distributions and that such cross-domain shared distributions constitute meaningful face information, since both domains depict human faces. In other words, by learning and utilizing the cross-domain shared distributions, the discriminator can help reduce distortions and artifacts in translated anime-faces. Therefore, as shown in Fig. 3, the proposed double-branch discriminator consists of shared shallow layers followed by two domain-specific output branches: one branch for distinguishing real/fake anime-faces and the other for distinguishing real/fake photo-faces. With the Siamese-like shallow layers shared by the photo-face and anime-face branches, the additional photo-face branch aims to assist the anime-face branch in learning domain-shared distributions. As a result, the anime-face branch learns to effectively discriminate generated anime-faces with distorted facial parts or noticeable face artifacts. On the other hand, each branch contains additional domain-specific deep layers with an extended receptive field to individually learn the distributions of anime-faces and photo-faces. (Two separate discriminators without shared shallow layers could also individually learn the distributions of anime-faces and photo-faces. However, we observe that such a design not only consumes more computational cost but also performs worse than our double-branch discriminator.)

We formulate the two-branch discriminator in terms of domains X and Y for generality. Let D_X and D_Y denote the discriminator branches corresponding to domains X and Y, respectively, and let D_U denote the shallow layers shared by D_X and D_Y. An input image h is discriminated either by D_X or by D_Y according to the domain that h belongs to. The discriminator function is formulated as follows:

D(h) = D_X(D_U(h)) if h ∈ X;  D_Y(D_U(h)) if h ∈ Y.  (5)

Our discriminator helps significantly improve the quality of generated images and the training stability of the generator, since it not only individually learns domain-specific distributions using separate branches, but also learns domain-shared distributions across domains using shared shallow layers. In addition, our discriminator is scalable and can easily be extended to multiple branches for tasks across multiple domains.
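The formulation in Eq. (5) translates into a small module with shared shallow layers and two domain-specific heads; the depths and widths below are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class DoubleBranchDiscriminator(nn.Module):
    """D_U (shared shallow layers) followed by the domain-specific branches D_X and D_Y (Eq. 5)."""
    def __init__(self, in_ch=3, base_ch=64):
        super().__init__()
        def down(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True))
        self.shared = nn.Sequential(down(in_ch, base_ch), down(base_ch, base_ch * 2))  # D_U
        self.branch_x = nn.Sequential(down(base_ch * 2, base_ch * 4),
                                      nn.Conv2d(base_ch * 4, 1, 3, 1, 1))  # photo-face branch D_X
        self.branch_y = nn.Sequential(down(base_ch * 2, base_ch * 4),
                                      nn.Conv2d(base_ch * 4, 1, 3, 1, 1))  # anime-face branch D_Y

    def forward(self, h, domain):
        feat = self.shared(h)  # cross-domain shared features
        return self.branch_x(feat) if domain == 'X' else self.branch_y(feat)
```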
Loss functions.

StyleFAT is different from typical image translation tasks which preserve the identity or the whole structure of the input photo-face. If we directly employed loss functions from existing methods [5, 35] that preserve both local and global structures or the identity of the source image, the quality of a generated anime-face would be negatively affected. Instead, besides an adversarial loss as in [26, 35], the objective of our model additionally involves a reconstruction loss, a feature matching loss, and a domain-aware feature matching loss.

Adversarial loss.
Given a photo-face x ∈ X and a reference anime-face y ∈ Y, the generator aims to synthesize from x an output image G(x, y) with a style transferred from y. To this end, we adopt an adversarial loss similar to [31, 37]:

L_adv = E_x[log D_X(x)] + E_{x,y}[log(1 − D_X(G(y, x)))] + E_y[log D_Y(y)] + E_{y,x}[log(1 − D_Y(G(x, y)))].  (6)
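A sketch of Eq. (6) in its non-saturating logistic form, assuming the discriminator returns logits and that G(src, ref) maps src into the domain of ref; the paper's implementation later replaces this with a hinge GAN loss, but the structure over the two branches is the same.

```python
import torch.nn.functional as F

def adversarial_losses(D, G, x, y):
    """x: real photo-faces (domain X), y: real anime-faces (domain Y)."""
    fake_x = G(y, x)  # anime-face translated into the photo domain, guided by x
    fake_y = G(x, y)  # photo-face translated into the anime domain, guided by y

    # discriminator objective: score real images high and generated images low, per branch
    d_loss = (F.softplus(-D(x, 'X')).mean() + F.softplus(D(fake_x.detach(), 'X')).mean()
              + F.softplus(-D(y, 'Y')).mean() + F.softplus(D(fake_y.detach(), 'Y')).mean())

    # generator objective: fool the branch of the target domain
    g_loss = F.softplus(-D(fake_x, 'X')).mean() + F.softplus(-D(fake_y, 'Y')).mean()
    return d_loss, g_loss
```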
Feature matching loss.

To encourage the model to produce natural statistics at various scales, the feature matching loss [39, 45] is utilized as a supervision for training the generator. Formally, let D_U^k(h) denote the feature map extracted from the k-th down-sampled level of the shared layers D_U of D_X or D_Y for input h, and let D̄_U^k(h) denote the global average pooling result of D_U^k(h). The feature matching loss L_fm is formulated as:

L_fm = E_h[ Σ_{k ∈ K} || D̄_U^k(h) − D̄_U^k(G(h, h)) || ],  (7)

where K denotes the set of selected layers in D_U used for feature extraction. In our work, K contains two layers chosen according to the architecture of our discriminator.

Domain-aware feature matching loss.
We further utilize the domain-specific information to optimize the generator. In particular, we extract features from the domain-specific branch D_X or D_Y, respectively. Similar to D_U^k(·), let D_X^k(·) denote the k-th down-sampled feature map extracted from the branch D_X in domain X, and let D̄_X^k(·) denote the average pooling result of D_X^k(·) (similar notations are used for domain Y). We then define the domain-aware feature matching loss L_dfm as follows:

L_dfm = E_h[ Σ_{k ∈ K} || D̄_X^k(D_U(h)) − D̄_X^k(D_U(G(h, h))) || ]  if h ∈ X,
L_dfm = E_h[ Σ_{k ∈ K} || D̄_Y^k(D_U(h)) − D̄_Y^k(D_U(G(h, h))) || ]  if h ∈ Y,  (8)

where K represents the set of selected layers in D_X and D_Y used for feature extraction, again chosen according to the architecture of our discriminator. With L_dfm, the artifacts of generated images can be largely mitigated thanks to the additional domain-specific features.
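A sketch of Eqs. (7) and (8), assuming helper callables that return the selected intermediate feature maps of D_U and of the domain-specific branch for the domain of h; an L1 distance is used here since the norm is not spelled out in the extracted text.

```python
import torch

def feature_matching_losses(shared_feats, branch_feats, h, h_rec):
    """h_rec = G(h, h): the self-reconstruction of h.
    shared_feats(img) -> list of feature maps from the selected layers of D_U (Eq. 7).
    branch_feats(img) -> list of feature maps from the selected layers of D_X or D_Y (Eq. 8)."""
    def gap(f):  # global average pooling over the spatial dimensions
        return f.mean(dim=(2, 3))

    l_fm = sum(torch.abs(gap(a) - gap(b)).mean()
               for a, b in zip(shared_feats(h), shared_feats(h_rec)))
    l_dfm = sum(torch.abs(gap(a) - gap(b)).mean()
                for a, b in zip(branch_feats(h), branch_feats(h_rec)))
    return l_fm, l_dfm
```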
Reconstruction loss.

We aim to preserve the global semantic structure of the source photo-face x. Without a well-designed loss on the discrepancy between the generated anime-face and the source photo-face, the global structure information of the source photo-face may be ignored or distorted. However, as discussed previously, we cannot directly employ an identity loss to preserve the identity as in [26], or a perceptual loss to preserve the structure of the image as in [35]. Different from existing methods [26, 35], we impose a reconstruction loss to preserve the global information of the photo-face. Specifically, given a source photo-face x, we also use x as the reference to generate a face G(x, x). If the generated face G(x, x) well reconstructs the source photo-face, we argue that the generator preserves the global structure information of the photo-face. Thus, we define the reconstruction loss as the dissimilarity between x and G(x, x):

L_rec = || G(x, x) − x ||.  (9)

This loss encourages the generator to effectively learn global structure information from the photo-face, such that the crucial information of x is preserved in the generated image.

Full Objective.
Consequently, we combine all the above loss functions into our full objective:

L_G = L_adv + λ_rec · L_rec + λ_fm · (L_fm + L_dfm),  (10)
L_D = −L_adv,  (11)

where λ_rec and λ_fm are hyper-parameters that balance the losses.
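Putting Eqs. (9)-(11) together in code; the default weights are placeholders (the experiments section sets λ_fm = 1 and a dataset-dependent λ_rec).

```python
import torch

def reconstruction_loss(G, x):
    """Eq. (9): use the photo-face itself as the reference and penalize the reconstruction error."""
    return torch.abs(G(x, x) - x).mean()

def full_objective(l_adv_g, l_adv_d, l_rec, l_fm, l_dfm, lambda_rec=1.0, lambda_fm=1.0):
    """Eq. (10) for the generator and Eq. (11) for the discriminator.
    l_adv_g / l_adv_d are the generator/discriminator adversarial terms computed above."""
    loss_G = l_adv_g + lambda_rec * l_rec + lambda_fm * (l_fm + l_dfm)
    loss_D = l_adv_d  # the discriminator maximizes the adversarial term
    return loss_G, loss_D
```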
4. Experimental Results
We first conduct qualitative and quantitative experiments to evaluate the performance of our approach. We then conduct user studies to evaluate the subjective visual quality of generated images. Finally, we evaluate the effectiveness of the proposed modules through ablation studies.

Figure 4: Example photo-faces and anime-faces from our face2anime dataset. From top to bottom: photo-faces and anime-faces.
Baselines
We compare our method with CycleGAN [51], UGATIT [24], and MUNIT [18], the state of the art in reference-free image translation. In addition, we compare our method with reference-guided image translation schemes including FUNIT [31], EGSC-IT [35], and DRIT++ [25], which are most relevant to our method. Note that MUNIT also allows additionally taking a reference image as input at the testing stage. We refer to this baseline as RG-MUNIT. However, instead of using a reference image as part of a training image pair, RG-MUNIT takes a random latent code from the target domain as style information during training (see its public code at https://github.com/NVlabs/MUNIT for details). Due to such inadequate reference information during training, when performing StyleFAT with reference faces, RG-MUNIT tends to fail for face images with significant inter-variations between domains or data with large intra-variations within a domain. That is, RG-MUNIT performs much worse than the original MUNIT, which takes a random latent code from the target domain at the test stage. We hence only show visual results for the original MUNIT in the qualitative comparisons, while evaluating quantitative results for both the original MUNIT and RG-MUNIT. We train all the baselines with the open-source implementations provided by the original papers; for UGATIT, we use the full-version implementation. As discussed in Section 2, since neural style transfer methods focus on specific art-style translation, we do not compare with these methods, for fairness.

Selfie2anime.

We follow the setup in UGATIT [24] and use the selfie2anime dataset to evaluate our method. For this dataset, only female character images are selected, and monochrome anime images are removed manually. Both the selfie and anime images are separated into a training set with 3,400 images and a test set with 100 images.
Face2anime.
We build an additional dataset called face2anime, which is larger and contains more diverse anime styles (e.g., face poses, drawing styles, colors, hairstyles, eye shapes, strokes, and facial contours) than selfie2anime, as illustrated in Fig. 4. The face2anime dataset contains 17,796 images in total, where the number of both anime-faces and natural photo-faces is 8,898. The anime-faces are collected from the Danbooru2019 [2] dataset, which contains many anime characters with various anime styles; images containing male characters are discarded, since Danbooru2019 only contains a small number of male characters. We employ a pretrained cartoon face detector [1] to select images containing anime-faces. For natural faces, we randomly select 8,898 female faces from the CelebA-HQ [22, 34] dataset. All images are aligned with facial landmarks and cropped to the same size. We separate the images from each domain into a training set with 8,000 images and a test set with 898 images.

We train and evaluate our approach on the face2anime and selfie2anime datasets. We use the network architecture described in Section 3 as our backbone. We set λ_fm = 1 for all experiments, λ_rec = 2 for the selfie2anime dataset, and a smaller λ_rec for the face2anime dataset. For fast training, the batch size is set to 4 and the model is trained for 100K iterations. The training time is less than 14 hours on a single Tesla V100 GPU with our implementation in PyTorch [40]. We use the RMSProp optimizer. To stabilize the training, we use the hinge version of the GAN loss [4, 29, 38, 50] and also adopt gradient penalty regularization on real images [14, 36]. The final generator is a historical average version [22] of the intermediate generators.

Given a source photo-face and a reference anime-face, a good translation result for the StyleFAT task should share similar/consistent anime styles (e.g., color and texture) with the reference without introducing noticeable artifacts, while its facial features are anime-like and the global information (e.g., the face pose) of the source is preserved.

Fig. 5 illustrates qualitative comparison results on the face2anime dataset, where the photo-faces involve various identities, expressions, illuminations, and poses. Compared with datasets such as cat2dog and horse2zebra, the face2anime dataset contains larger variations of shape/appearance among anime-faces, whose data distribution is much more complex and is challenging for image translation methods.

The results show that CycleGAN introduces visible artifacts into its generated anime-faces (see the forehead in the fourth row). MUNIT also leads to visible artifacts in some generated anime-faces, as shown in the third and fourth rows of Fig. 5. UGATIT performs better than CycleGAN and MUNIT. However, the anime styles of the anime-faces generated by CycleGAN, UGATIT, and MUNIT are dissimilar to those of the references. FUNIT is designed for few-shot reference-guided translation, and hence is not suitable for the StyleFAT task. Consequently, the styles of the anime-faces generated by FUNIT are much less consistent with the references compared with our method.

Although EGSC-IT usually well preserves the poses of photo-faces, it also attempts to preserve the local structures of photo-faces, which often conflicts with the transfer of anime styles, since the local shapes of facial parts like eyes and mouth in an anime-face are dissimilar to their counterparts in the corresponding photo-face. Consequently, EGSC-IT often leads to severe artifacts in the local structures of generated anime-faces (see the eyes and hair in the first to third rows). DRIT++ also introduces artifacts into generated faces when transferring the styles of the reference anime-faces. For example, DRIT++ generates two mouths in the generated face in the third row and distorts the eyes in the fifth row.

Figure 6: Comparison with image-to-image translation methods on the selfie2anime dataset. From left to right: source photo-face, reference anime-face, the results by CycleGAN [51], MUNIT [18], UGATIT [24], FUNIT [31], DRIT++ [25], EGSC-IT [35], and our AniGAN.

Outperforming the above state-of-the-art methods, our method generates the highest-quality anime-faces. First, compared with the reference-guided methods FUNIT, EGSC-IT, and DRIT++, the styles of the anime-faces generated by our method are the most consistent with those of the reference faces, thanks to our well-designed generator. Moreover, our method well preserves the poses of photo-faces, although it does not use a perceptual loss like EGSC-IT. Our method also well converts local structures like eyes and mouth into anime-like ones without introducing clearly visible artifacts.

Fig. 6 compares the results of various methods on the selfie2anime dataset. The results show that FUNIT introduces artifacts into some generated faces, as shown in the fourth and sixth rows of Fig. 6. Besides, EGSC-IT tends to generate anime-faces with similar styles, although the reference anime-faces have different styles (see the reference and generated images in the fourth to sixth rows of Fig. 6). Similarly, DRIT++ tends to synthesize eyes with similar styles in the generated faces. For example, the eyes synthesized by DRIT++ are orange ellipses in the first, second, fourth, and fifth rows of Fig. 6. In contrast, our method generates anime-faces reflecting the various styles of the reference images. In other words, our method achieves the most consistent styles with those of the reference anime-faces among all compared methods. In addition, our method generates high-quality faces which preserve the poses of the source photo-faces, even when a photo-face is partially occluded by other objects (e.g., the hand in the fifth row of Fig. 6).
In addition to the qualitative evaluation, we quantitatively evaluate the performance of our method in two aspects. One is the visual quality of generated images, and the other is the translation diversity.
Visual quality.
We evaluate the quality of our results with the Frechet Inception Distance (FID) metric [16], which has been widely used to evaluate the quality of synthetic images in image translation works, e.g., [10, 31, 49]. The FID score evaluates the distribution discrepancy between the real faces and the synthetic anime-faces. A lower FID score indicates that the distribution of generated images is more similar to that of real anime-faces; that is, generated images with lower FID scores are more plausible as real anime-faces. Following the steps in [16], we compute a feature vector with a pretrained network [43] for each real/generated anime-face, and then calculate the FID scores of the individual compared methods, as shown in Table 1. The FID scores in Table 1 demonstrate that our AniGAN achieves the best scores on both the face2anime and selfie2anime datasets, meaning that the anime-faces generated by our approach have the closest distribution to real anime-faces and thereby look visually similar to them.

Table 1: Comparison of FID scores on the face2anime and selfie2anime datasets (lower is better).
Method          Face2anime   Selfie2anime
CycleGAN        50.09        99.69
UGATIT          42.84        95.63
MUNIT           43.75        98.58
RG-MUNIT        185.23       305.33
FUNIT           56.81        117.28
DRIT++          70.59        104.49
EGSC-IT         67.57        104.70
AniGAN (Ours)

Table 2: Comparison of average LPIPS scores on the face2anime and selfie2anime datasets (higher is better).
Method          Face2anime   Selfie2anime
DRIT++          0.184        0.201
EGSC-IT         0.302        0.225
AniGAN (Ours)
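For reference, the FID protocol can be reproduced with an off-the-shelf implementation; the snippet below uses torchmetrics (our choice of library, not necessarily the authors') and hypothetical data loaders real_anime_loader / generated_anime_loader.

```python
from torchmetrics.image.fid import FrechetInceptionDistance

# Compare real anime-faces with generated anime-faces in Inception feature space (lower is better).
fid = FrechetInceptionDistance(feature=2048)

for real_batch in real_anime_loader:        # uint8 images in [0, 255], shape (N, 3, H, W)
    fid.update(real_batch, real=True)
for fake_batch in generated_anime_loader:   # generated anime-faces, same format
    fid.update(fake_batch, real=False)

print(float(fid.compute()))
```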
Translation diversity.
For the same photo-face, we evaluate whether our method can generate anime-faces with diverse styles, given multiple reference anime-faces with different styles. We adopt the learned perceptual image patch similarity (LPIPS) metric, a widely adopted metric for assessing translation methods on multimodal mapping [10, 25] in the perceptual domain, to evaluate translation diversity. Following [10], given each testing photo-face, we randomly sample 10 anime-faces as its reference images and then generate 10 outputs. For these 10 outputs, we evaluate the LPIPS scores between every two outputs (following [10], all generated images are uniformly scaled to the same size). Table 2 shows the average of the pairwise LPIPS scores over all testing photo-faces. A higher LPIPS score indicates that the translation method generates images with larger diversity.

Since CycleGAN and UGATIT focus on one-to-one mapping and cannot generate multiple outputs given a source image, we do not include them in the comparison of translation diversity. We also do not compare with MUNIT and FUNIT, since MUNIT does not take a reference as the input and FUNIT focuses on few-shot learning instead of translation diversity. Instead, we compare with DRIT++ and EGSC-IT, which are state-of-the-art reference-guided methods. DRIT++ uses a regularization loss which explicitly encourages the generator to generate diverse results. Although our method does not impose such a loss, the LPIPS scores in Table 2 show that our method outperforms DRIT++ and EGSC-IT on translation diversity, thanks to the generator and discriminator of our method.

Table 3: Preference percentages of 20 subjects for four methods in the user study (higher is better).
Method          Face2anime   Selfie2anime
FUNIT           17.50%       40.17%
DRIT++          68.17%       47.83%
EGSC-IT         27.33%       30.67%
AniGAN (Ours)

We conduct a subjective user study to further evaluate our method. 20 subjects, whose ages range from 22 to 35, are invited to participate in our experiments. Following [18, 45], we adopt a pairwise A/B test. For each subject, we show a source photo-face, a reference anime-face, and two anime-faces generated by two different translation methods. The generated anime-faces are presented in a random order, such that subjects are unable to infer which anime-face was generated by which translation method. We then ask each subject the following question:
Q: Which generated anime-face has better visual quality, considering the source photo-face and the anime styles of the reference anime-face?

We compare our method with three reference-guided methods: FUNIT, EGSC-IT, and DRIT++. Since reference-free translation methods (e.g., CycleGAN and UGATIT) do not transfer information from a reference image, we do not compare with them, for fairness. The subjective user study is conducted on the face2anime and selfie2anime datasets, respectively, where 10 pairs of photo-faces and anime-faces in each dataset are fed into the four translation methods to generate anime-faces. We receive answers from 20 subjects in total for each dataset, where each method is compared an equal number of times. As shown in Table 3, most subjects favor our results on both the face2anime and selfie2anime datasets, demonstrating that the anime-faces translated by our method are usually the most visually appealing to the subjects.

Table 4: Quantitative comparison for the ablation study using FID scores on the face2anime and selfie2anime datasets (lower is better), for variants that enable different combinations of the ASC block, FST block, double-branch discriminator (DB), PoLIN, AdaPoLIN, IN, LIN, AdaIN, and AdaLIN.
Table 5: Quantitative comparison for the ablation study using LPIPS scores on the face2anime and selfie2anime datasets (higher is better), for the same variants as in Table 4.

We conduct ablation experiments to validate the effectiveness of the individual components of our method: (1) the ASC block, (2) the FST block, (3) PoLIN and AdaPoLIN, and (4) the double-branch discriminator.
ASC block.
We seek to validate whether the ASC block effectively retains the style information of the reference image and helps the generator transfer the style characteristics. Note that the key differentiating factor in our ASC block is the removal of residual blocks from the bottleneck of our generator, different from state-of-the-art image translation methods (e.g., MUNIT, UGATIT, and FUNIT). We hence implement a baseline called “w/o ASC” which adds residual blocks to the bottleneck of our decoder. As shown in Fig. 7, “w/o ASC” tends to ignore certain style information due to the additional residual blocks. For example, “w/o ASC” ignores the styles of the right eyes in the references and renders brown right eyes in all the generated images. In contrast, our method consistently renders the styles of the left and right eyes well, although no face landmarks or face parsing are used. Clearly, our method outperforms “w/o ASC” in transferring the styles of the reference anime-faces, thanks to the ASC block.

Figure 7: Visual comparison of the contributions of ASC blocks, where “w/o ASC” improperly renders brown right eyes in all the generated images.
FST block.
Different from existing methods (e.g., MUNIT, FUNIT, EGSC-IT), which use typical upsampling blocks in the decoder, FST is additionally equipped with our normalization functions.
To validate the effectiveness of FST, we build a baseline (called “w/o FST”) which replaces the FST blocks with typical upsampling blocks (like MUNIT, FUNIT, and EGSC-IT) without our normalization functions. As shown in Fig. 9, “w/o FST” performs poorly in converting the shapes of local facial features and transferring styles. For example, “w/o FST” poorly converts the face shapes and eyes and introduces artifacts into the generated faces. In contrast, with FST, our method converts the local shapes into anime-like ones better than “w/o FST”. Moreover, our method also transfers the styles of the reference anime-faces to the generated faces better than “w/o FST” does. Similarly, the FID of “w/o FST” increases in Table 4, indicating that the anime-faces generated by “w/o FST” are less plausible than those generated by our method. In addition, the LPIPS scores of “w/o FST” and our method in Table 5 show that FST is helpful for generating diverse anime-faces.

Figure 9: Visual comparison of the contributions of FST blocks.

Figure 8: Visual comparison of the contributions of PoLIN and AdaPoLIN. From left to right: (a) reference anime-faces, (b) source photo-faces, and generated faces by (c) “w/o PoLIN w/ IN”, (d) “w/o PoLIN w/ LIN”, (e) “w/o AdaPoLIN w/ AdaIN”, (f) “w/o AdaPoLIN w/ AdaLIN”, and (g) our method.
PoLIN and AdaPoLIN.
We build four baselines to evaluate the effectiveness of PoLIN and AdaPoLIN. The first and second baselines are built to evaluate PoLIN, and the third and fourth to evaluate AdaPoLIN. The first baseline, named “w/o PoLIN w/ IN”, is constructed by replacing PoLIN with IN [44]. We build “w/o PoLIN w/ IN” because EGSC-IT, DRIT++, and FUNIT employ IN in the up-sampling convolutional layers of their decoders, different from our method with PoLIN. The second baseline, named “w/o PoLIN w/ LIN”, is constructed by replacing PoLIN with layer-instance normalization (LIN). The third baseline, called “w/o AdaPoLIN w/ AdaIN”, replaces AdaPoLIN with AdaIN [17], which is employed by many translation methods, e.g., [18, 31, 32, 35]. The fourth baseline, called “w/o AdaPoLIN w/ AdaLIN”, replaces AdaPoLIN with AdaLIN, which is used in UGATIT.

The FID scores in Table 4 show that our method outperforms the four baselines. Without PoLIN, the performance of transforming local shapes into anime-like ones is degraded, as shown in the results generated by “w/o PoLIN w/ IN” and “w/o PoLIN w/ LIN” in Fig. 8. For example, “w/o PoLIN w/ IN” introduces artifacts in the hair at the right boundary of the generated faces in the first row of Figs. 8(c) and (d). Similarly, without AdaPoLIN, both “w/o AdaPoLIN w/ AdaIN” and “w/o AdaPoLIN w/ AdaLIN” perform worse than our method. Table 5 shows that “w/o AdaPoLIN w/ AdaIN” and “w/o AdaPoLIN w/ AdaLIN” also degrade the performance in terms of translation diversity.

It is worth noting that all the above baselines, which replace our normalization functions with the normalization functions employed by DRIT++, EGSC-IT, UGATIT, etc., still achieve better FID and LPIPS scores than the state-of-the-art methods, as shown in Tables 1, 2, 4, and 5. This indicates that the architectures of our generator and discriminator are more advantageous for the StyleFAT task.
Double-branch discriminator.
We implement a baseline “w/o DB” that removes the photo-face branch (i.e., the branch that discriminates real/fake photo-faces) from our discriminator. Table 4 shows that it yields a poorer FID than our method, and Table 5 shows that the LPIPS scores of “w/o DB” are also worse than those of our method. More specifically, as shown in Fig. 10, “w/o DB” distorts local facial shapes, leading to low-quality generated faces, especially for challenging source photo-faces. This is because “w/o DB” mainly focuses on generating plausible anime images, rather than on generating plausible anime human faces. In contrast, our method generates high-quality anime-faces, thanks to the additional photo-face branch in the discriminator. With the photo-face branch, the photo-face and anime-face branches share the first few shallow layers, which helps the anime-face branch better learn real facial features from photo-faces so as to well discriminate low-quality anime-faces with distorted facial features.

Figure 10: Visual comparison of the contributions of the double-branch discriminator.
5. Conclusion
In this paper, we propose a novel GAN-based method, called AniGAN, for style-guided face-to-anime translation. A new generator architecture and two normalization functions are proposed, which effectively transfer styles from the reference anime-face, preserve global information from the source photo-face, and convert local facial shapes into anime-like ones. We also propose a double-branch discriminator to assist the generator in producing high-quality anime-faces. Extensive experiments demonstrate that our method achieves superior performance compared with state-of-the-art methods.
A. Appendix
A.1. Comparisons with StarGAN-v2
StarGAN-v2 [10] achieves impressive translation results on the CelebA-HQ and AFHQ datasets. We hence additionally compare our method with StarGAN-v2. As shown in Fig. 11, StarGAN-v2 generates plausible anime-faces. However, the poses and hair styles of the anime-faces generated by StarGAN-v2 are often inconsistent with those of the source photo-faces. For example, given the source photo-face with long hair in the second row of Fig. 11, StarGAN-v2 generates an anime-face with short hair, similar to that of the reference. In other words, StarGAN-v2 ignores the global information of the source photo-faces, which does not meet the requirements of the StyleFAT task. In contrast, our method well preserves the global information of the source photo-face and generates high-quality anime-faces.

Figure 11: Comparison results on the face2anime dataset. From left to right: source photo-face, reference anime-face, the results by StarGAN-v2 [10] and our AniGAN.
References

[1] AnimeFace2009. https://github.com/nagadomi/animeface-2009/
[2] Danbooru2019 dataset.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proc. Int. Conf. Learn. Rep., 2019.
[5] Kaidi Cao, Jing Liao, and Lu Yuan. CariGANs: Unpaired photo-to-caricature translation. ACM Trans. Graphics, 2018.
[6] Hung-Jen Chen, Ka-Ming Hui, Szu-Yu Wang, Li-Wu Tsao, Hong-Han Shuai, and Wen-Huang Cheng. BeautyGlow: On-demand makeup transfer framework with reversible generative network. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 10042–10050, 2019.
[7] Lei Chen, Le Wu, Zhenzhen Hu, and Meng Wang. Quality-aware unpaired image-to-image translation. IEEE Trans. on Multimedia, 21(10):2664–2674, 2019.
[8] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. CartoonGAN: Generative adversarial networks for photo cartoonization. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 9465–9474, 2018.
[9] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018.
[10] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020.
[11] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[12] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 2414–2423, 2016.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. Adv. Neural Inf. Process. Syst., pages 2672–2680, 2014.
[14] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Proc. Adv. Neural Inf. Process. Syst., pages 5767–5777, 2017.
[15] Bin He, Feng Gao, Daiqian Ma, Boxin Shi, and Ling-Yu Duan. ChipGAN: A generative adversarial network for Chinese ink wash painting style transfer. In Proc. ACM Int. Conf. Multimedia, pages 1172–1180, 2018.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. Adv. Neural Inf. Process. Syst., pages 6626–6637, 2017.
[17] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 1501–1510, 2017.
[18] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proc. European Conf. Comput. Vis., 2018.
[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017.
[20] Wentao Jiang, Si Liu, Chen Gao, Jie Cao, Ran He, Jiashi Feng, and Shuicheng Yan. PSGAN: Pose and expression robust spatial-aware GAN for customizable makeup transfer. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020.
[21] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. European Conf. Comput. Vis., pages 694–711, 2016.
[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proc. Int. Conf. Learn. Rep., 2018.
[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 4401–4410, 2019.
[24] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwang Hee Lee. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In Proc. Int. Conf. Learn. Rep., 2020.
[25] H. Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang. DRIT++: Diverse image-to-image translation via disentangled representations. Int. J. Comput. Vis., pages 1–16, 2020.
[26] T. Li, Ruihe Qian, C. Dong, Si Liu, Q. Yan, Wenwu Zhu, and L. Lin. BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network. In Proc. ACM Int. Conf. Multimedia, 2018.
[27] Zeyu Li, Cheng Deng, Erkun Yang, and Dacheng Tao. Staged sketch-to-image synthesis via semi-supervised generative adversarial networks. IEEE Trans. on Multimedia, 2020.
[28] Xiaodan Liang, Hao Zhang, and Eric P. Xing. Generative semantic manipulation with contrasting GAN. In Proc. European Conf. Comput. Vis., 2018.
[29] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
[30] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Proc. Adv. Neural Inf. Process. Syst., pages 700–708, 2017.
[31] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsupervised image-to-image translation. In Proc. IEEE/CVF Conf. Comput. Vis., pages 10551–10560, 2019.
[32] Runtao Liu, Qian Yu, and Stella X. Yu. Unsupervised sketch to photo synthesis. In Proc. European Conf. Comput. Vis., pages 36–52, 2020.
[33] Yu Liu, Wei Chen, Li Liu, and Michael S. Lew. SwapGAN: A multistage generative approach for person-to-person fashion style transfer. IEEE Trans. on Multimedia, 21(9):2209–2222, 2019.
[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. IEEE/CVF Conf. Comput. Vis., 2015.
[35] Liqian Ma, Xu Jia, Stamatios Georgoulis, Tinne Tuytelaars, and Luc Van Gool. Exemplar guided unsupervised image-to-image translation with semantic consistency. In Proc. Int. Conf. Learn. Rep., 2019.
[36] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Proc. Int. Conf. Mach. Learn., 2018.
[37] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[38] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In Proc. Int. Conf. Learn. Rep., 2018.
[39] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 2337–2346, 2019.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Proc. Adv. Neural Inf. Process. Syst., 2017.
[41] Yichun Shi, Debayan Deb, and Anil K. Jain. WarpGAN: Automatic caricature generation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 10762–10771, 2019.
[42] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C.-C. Jay Kuo. Contextual-based image inpainting: Infer, match, and translate. In Proc. European Conf. Comput. Vis., pages 3–19, 2018.
[43] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 2818–2826, 2016.
[44] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[45] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018.
[46] Chao Yang, Taehwan Kim, Ruizhe Wang, Hao Peng, and C.-C. Jay Kuo. Show, attend, and translate: Unsupervised image translation with self-regularization and attention. IEEE Trans. Image Process., 28(10):4845–4856, 2019.
[47] Chao Yang and Ser-Nam Lim. One-shot domain adaptation for face generation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 5921–5930, 2020.
[48] Jordan Yaniv, Yael Newman, and Ariel Shamir. The face of art: Landmark detection and geometric style in portraits. ACM Trans. Graphics, 38(4):60, 2019.
[49] Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L. Rosin. APDrawingGAN: Generating artistic portrait drawings from face photos with hierarchical GANs. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 10743–10752, 2019.
[50] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[51] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE/CVF Conf. Comput. Vis., 2017.