Texture Deformation Based Generative Adversarial Networks for Face Editing
WenTing Chen∗, Institute of Computer Vision, Shenzhen University, [email protected]
Xinpeng Xie∗, Institute of Computer Vision, Shenzhen University, [email protected]
Xi Jia, Institute of Computer Vision, Shenzhen University, [email protected]
Linlin Shen†, Institute of Computer Vision, Shenzhen University, [email protected]
Abstract
Despite the significant success of image-to-image translation and latent representation based facial attribute editing and expression synthesis, the existing approaches still have limitations in the sharpness of details, distinct image translation and identity preservation. To address these issues, we propose a Texture Deformation Based GAN, namely TDB-GAN, to disentangle texture from the original image and transfer domains based on the extracted texture. The approach utilizes the texture to transfer facial attributes and expressions without the consideration of the object pose. This leads to sharper details and a more distinct visual effect of the synthesized faces. In addition, it brings faster convergence during training. The effectiveness of the proposed method is validated through extensive ablation studies. We also evaluate our approach qualitatively and quantitatively on facial attribute and facial expression synthesis. The results on both the CelebA and RaFD datasets suggest that Texture Deformation Based GAN achieves better performance.
1. Introduction
Face editing aims to change or enhance facial attributes (e.g. hair color, expression, gender and age), and add virtual makeup to human faces. In recent years, face editing has attracted great interest in computer vision fields [1, 22, 16]. Several image-to-image translation methods [8, 27, 24] have achieved facial attribute and expression manipulation on single or multiple domains. Most methods are based on generative adversarial networks (GANs) [3], like CycleGAN [27], IcGAN [14], StarGAN [2], etc.

∗ Equal contribution   † Corresponding author
However, the generators of most image-to-image translation approaches are fed with the input image directly. While altering the facial expressions, the synthesized expressions are not genuine enough since they only modify the face moderately. Moreover, the task of face editing can also be tackled with the Encoder-Decoder architecture, through decoding the latent representation from the encoder conditioned on target attributes. This kind of architecture aims to figure out the relationship between the facial attributes and the latent representation, and to impose the latent representation on face editing [26, 21, 17, 13]. Commonly, most Encoder-Decoder architectures encode the input image into a low-dimension latent representation, which may lead to the loss of information and representation capability. Besides, most approaches fail to preserve the identity during face editing.

To solve the issues raised by both image-to-image translation and latent representation approaches, we first adopt the DAE model [15] to transfer the input image to three different physical image signals, including shading, albedo and deformation, with an Encoder-Decoder architecture, and then combine shading and albedo to generate a pure and well-aligned texture image that presents the illumination effects and the characteristic appearance of the face. Then, we feed both the generated texture and target domain labels into a GAN model to synthesize a new texture image with target attributes. Finally, we warp the generated texture with the spatial deformation to generate the ultimate result. We also employ an identity loss between the generated image and the input image to preserve identity. Overall, our main contributions are summarized as follows:

1. We propose the Texture Deformation Based GAN, a novel framework that learns the mappings among multiple domains based on disentangled texture and warps the generated texture spatially to generate the face image with target domain features.

2. We empirically demonstrate the effectiveness of our TDB-GAN through ablation studies on facial attribute editing and expression synthesis. We validate the superiority of texture-to-image translation over image-to-image translation. We also prove the effectiveness of the identity loss through face verification.

3. The proposed TDB-GAN is evaluated on facial attribute and expression synthesis both qualitatively and quantitatively. The results suggest TDB-GAN outperforms the existing methods.

Figure 1. TDB-GAN transfers attributes by texture based deformation. The first and second columns are the input images and the textures generated by DAE, respectively. In the remaining columns, while the even columns show the textures produced by the generator for the target attribute, the odd ones show the final synthesized images by warping the texture.
2. Related works
The popularity of generative models has a great effect on face editing. The Encoder-Decoder architecture and the Generative Adversarial Network (GAN) [3] are the two major categories of methods for this task.
Intrinsic Deforming Autoencoder (DAE) [15] is a novel generative model which decomposes the input image into texture and deformation. DAE follows the deformable template paradigm and models image generation through texture synthesis and spatial deformation. DAE can obtain the prototypical object by removing the deformation. Discarding variability due to deformations, the texture encoded from the original image is a purer representation. Moreover, by modeling the face image in terms of a low-dimensional latent code, we can more easily control the facial attributes and expression over the generative process.
Generative Adversarial Networks (GANs) [3] are a promising family of generative models and can be used to solve various computer vision tasks such as image generation [6, 23, 20], image translation [8, 27, 24], and face image editing [22, 2, 13]. The GAN model is mainly designed to learn a generator G to generate fake samples and a discriminator D to distinguish between real and fake samples. Besides leveraging the typical adversarial loss, a reconstruction loss is often employed [2, 4] to generate faces as realistic as possible. Additionally, an identity loss is proposed in our approach to ensure that the generated faces preserve the original identity.
Pix2Pix [8] is a typical image-to-image translation based method. The approach can learn the mapping between input and output domains and has achieved impressive results in several image translation tasks [27, 24, 11]. Pix2Pix combines an adversarial loss with an L1 loss to transfer images in a paired way. For unpaired images, several frameworks like MUNIT [7], CycleGAN [27], and Invertible Conditional GAN [14] have been proposed. However, all these frameworks try to learn the joint distribution between two domains, which prevents them from handling multiple domains at the same time.
StarGAN [2] is the first generative model to achieve multi-domain image-to-image translation across different datasets with only one single generator. It also consists of two modules, a discriminator D that distinguishes between real and fake images and classifies the real images to their corresponding domains, and a generator G that generates a fake image using both the image and the target domain label (binary or one-hot vector). One of the novelties of StarGAN is that its generator G is able to reconstruct the original image from the fake image given the original domain label. StarGAN also utilizes a mask vector with the domain label to enable joint training between domains of different datasets. However, StarGAN is an image-to-image model and does not involve any latent representation, so its capability of changing facial attributes is limited.
AttGAN [4] is a multiple facial attribute editing model that contains three components at training: the attribute classification constraint, the reconstruction learning and the adversarial learning. The content that a latent representation delivers is uncertain and limited; hence, imposing the attribute label on the latent representation might change other, unexpected parts. Similar to StarGAN, AttGAN applies an attribute classification constraint to guarantee the correct attribute manipulation on the generated image and a reconstruction learning to preserve the attribute-excluding details. AttGAN tries to free the attribute-independent constraint from the latent representation, while our approach encodes the input to different latent representations to generate the texture and employs an image-to-image translation to achieve face editing.
3. Texture Deformation Based GAN
In this section, we introduce the Texture Deformation Based GAN (TDB-GAN) framework for face attribute editing. As shown in Figure 2, TDB-GAN consists of two major modules, i.e. the intrinsic deforming autoencoder (DAE) and a GAN based image-to-image translation module.
The recent works [2] aim at translating an original face image to a new face image with different attributes. However, the pose and shape of the face might have influence on facial attribute and facial expression synthesis. Thus, we utilize the Intrinsic DAE [15] to separate a face image into texture and deformation to disentangle the variation. DAE adopts the intrinsic decomposition regularization loss to model the physical properties of shading and albedo. The shading and the albedo are then combined to generate the texture, which eliminates the geometric information and can represent the identity, illumination, face attributes and so on, whereas the deformation describes the spatial gradient of the warping field (spatial transformation).
In this module, we feed the encoder $E_{\theta_{enc}}$, a densely connected convolutional network, with an input image $I_{Input}$. It then generates a latent representation $Z$ for the following decoders. Particularly, the latent representation can be decomposed as $Z = [Z_S, Z_A, Z_D]$, where $Z_S$, $Z_A$ and $Z_D$ are the shading-related, albedo-related and deformation-related representations, respectively:

$$Z = E_{\theta_{enc}}(I_{Input}) \quad (1)$$

As visualized in Figure 2, we introduce three separate decoders for shading, albedo and deformation, namely $D_S$, $D_A$ and $D_D$. The inputs to these decoders are delivered by a joint encoder network. The shading-related, albedo-related and deformation-related decoders are fed with the latent representations $Z_S$, $Z_A$ and $Z_D$ respectively. The decoders provide us with a clear separation of shading, albedo and deformation:

$$S = D_S(Z_S) \quad (2)$$
$$A = D_A(Z_A) \quad (3)$$
$$De = D_D(Z_D) \quad (4)$$

where $S$, $A$ and $De$ denote shading, albedo and deformation. Then the texture $T$ of the input image can be computed from the shading $S$ and albedo $A$ with the Hadamard product:

$$T = S \odot A \quad (5)$$

Finally, the generated texture is warped spatially with the deformation to synthesize the ultimate image $I_{Output}$, where $W$ denotes the operation of spatial warping (a code sketch of this decomposition pipeline is given at the end of this subsection):

$$I_{Output} = W(T, De) \quad (6)$$

The objective function is composed of four losses, $\mathcal{L}_R$, $\mathcal{L}_{smooth}$, $\mathcal{L}_B$ and $\mathcal{L}_{Shading}$:

$$\mathcal{L}_{DAE} = \mathcal{L}_R + \mathcal{L}_{smooth} + \mathcal{L}_B + \mathcal{L}_{Shading}, \quad (7)$$

where the reconstruction loss is defined as

$$\mathcal{L}_R = \| I_{Output} - I_{Input} \|, \quad (8)$$

the smoothness cost is given by

$$\mathcal{L}_{smooth} = \lambda (\| \nabla W_x(x, y) \| + \| \nabla W_y(x, y) \|), \quad (9)$$

the bias reduce loss is formulated as

$$\mathcal{L}_B = \lambda \| S_A - S \| + \lambda' \| \bar{W} - W \|, \quad (10)$$

and the shading loss is written as

$$\mathcal{L}_{Shading} = \lambda \| \nabla S \|. \quad (11)$$

In the equations above, $I_{Input}$ and $I_{Output}$ represent the input images and reconstructed images respectively. $\nabla W_x(x, y)$ and $\nabla W_y(x, y)$ stand for the local warping field. $S$ and $S_A$ denote the identity and average affine transform within the minibatch. $W$ and $\bar{W}$ represent the identity grid and the average deformation grid within a minibatch.

Figure 2. Overview of Texture Deformation Based GAN.

Similar to StarGAN, our goal is to train a multi-domain texture-to-image translation network. We first feed the generator with the texture and target domain labels randomly sampled from the training data. Then, we warp the generated texture with the deformation to synthesize the fake face image. We also impose the domain classification loss to classify the domain of the fake face image. Furthermore, we use the reconstruction loss and identity loss to supervise the generator to synthesize more realistic and identity-preserved face images, respectively.
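To make the decomposition above concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(6). The module names, the even latent split, and a deformation decoder that directly outputs a sampling grid are our assumptions for illustration; the actual DAE architecture and warping follow Shu et al. [15].

```python
import torch
import torch.nn.functional as F

def dae_forward(encoder, dec_shading, dec_albedo, dec_deform, image):
    """Sketch of Eqs. (1)-(6); module names and the latent split are placeholders."""
    z = encoder(image)                         # Eq. (1): Z = E(I_Input)
    z_s, z_a, z_d = torch.chunk(z, 3, dim=1)   # Z = [Z_S, Z_A, Z_D]

    shading = dec_shading(z_s)                 # Eq. (2): S
    albedo = dec_albedo(z_a)                   # Eq. (3): A
    deform = dec_deform(z_d)                   # Eq. (4): De, assumed here to be an
                                               # (N, H, W, 2) sampling grid

    texture = shading * albedo                 # Eq. (5): Hadamard product T = S ⊙ A
    output = F.grid_sample(texture, deform,    # Eq. (6): spatial warping W(T, De)
                           align_corners=False)
    return output, texture, deform
```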
We utilize the adversarial loss to make the generated images as genuine as the real samples. The adversarial loss can be written as:

$$\mathcal{L}_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{t,c}[\log(1 - D_{src}(W(G(t, c), De)))] \quad (12)$$

In this loss function, $G$ generates a new texture $G(t, c)$ conditioned on both the face texture $t$ and the target domain label $c$, while $D$ strives to differentiate the real face texture from the generated face texture. In Eq. (12), $D_{src}(x)$ denotes a probability distribution over sources given by $D$. The discriminator tries to maximize this objective, whereas the generator tries to minimize it.

To enable the generator to generate a fake image of the target domain, we add a domain classifier on top of $D$. For the optimization of $D$ and $G$, we define the domain classification loss of the real image as follows:

$$\mathcal{L}^r_{cls} = \mathbb{E}_{t,c'}[-\log D_{cls}(c' | x)], \quad (13)$$

where $c'$ stands for the original domain label of the real face image. The term $D_{cls}(c' | x)$ represents a probability distribution over domain labels produced by $D$. In addition, the domain classification loss of the fake face texture is defined as

$$\mathcal{L}^f_{cls} = \mathbb{E}_{t,c}[-\log D_{cls}(c | W(G(t, c), De))]. \quad (14)$$

By optimizing the adversarial and classification losses, $G$ is able to generate realistic face textures with the proper attributes. Nonetheless, we cannot guarantee that the generated face texture preserves the content of the input face texture while changing its domain-related parts. Therefore, the reconstruction loss is imposed on the reconstructed texture and image, respectively. For the texture image, we apply a cycle consistency loss proposed by Zhu et al. [27] to our generator, which is defined as:

$$\mathcal{L}^t_{rec} = \mathbb{E}_{t,c,c'}[\| t - G(G(t, c), c') \|], \quad (15)$$

where $G$ takes the generated face texture $G(t, c)$ and the original domain label $c'$ as input and tries to reconstruct the original face texture. We utilize the L1 norm to compute our reconstruction loss.

For the reconstructed image, the generator synthesizes the new texture with the original texture $t$ and domain label $c'$. Then, the new texture is warped with the deformation to generate the output image. The L1 norm of the difference between the input and the generated image is defined as below:

$$\mathcal{L}^i_{rec} = \mathbb{E}_{t,c,c'}[\| x - W(G(t, c'), De) \|]. \quad (16)$$

$$\mathcal{L}_{rec} = \mathcal{L}^t_{rec} + \mathcal{L}^i_{rec} \quad (17)$$

Even though the reconstruction loss can preserve some unrelated content of the input face texture, the generator might still change the identity of the output face texture. The generator would not only learn the attribute-related parts but also learn the identity corresponding to the persons with label $c$ from the training set. For example, the majority of celebrity face images in CelebA [12] come from Europe or America, and only a few are from Asia. Therefore, when learning the attributes from European or American faces, an Asian face might not preserve its own particular facial features.

Therefore, we exploit an identity preserving network $F_{ip}$ to retain the identity discrimination of the synthesized face texture, and an identity loss $\mathcal{L}_{ip}$ to preserve personal facial features. This approach is derived from the work proposed by Huang et al. [5]. $F_{ip}$ denotes a feature extractor to extract the features of the synthesized face texture $\hat{t}$ and the real face texture $t$. We select LightCNN [19] as our feature extractor and fix its parameters in the training procedure.
Specifically, we apply the output of the second-to-last fully connected layer of $F_{ip}$ to the identity loss $\mathcal{L}_{ip}$:

$$\mathcal{L}_{ip} = \| F_{ip}(t) - F_{ip}(\hat{t}) \|, \quad (18)$$

where $\|\cdot\|$ denotes the L2 norm.

Overall, the final objective functions to optimize $G$ and $D$ are:

$$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls} \mathcal{L}^r_{cls} \quad (19)$$

$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls} \mathcal{L}^f_{cls} + \lambda_{rec} \mathcal{L}_{rec} + \lambda_{ip} \mathcal{L}_{ip}, \quad (20)$$

where $\lambda_{cls}$, $\lambda_{rec}$ and $\lambda_{ip}$ are hyper-parameters that control the weights of the domain classification, reconstruction and identity losses.
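As an illustration only, the sketch below assembles the generator objective of Eq. (20) in PyTorch. The helper names (`warp`, a discriminator returning separate source and domain logits) and the non-saturating form of the adversarial term are our assumptions, not the paper's exact implementation, and the default `lambda_ip` value is illustrative.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, F_ip, warp, texture, deform, real_img, c_trg, c_org,
                   lambda_cls=1.0, lambda_rec=10.0, lambda_ip=0.1):
    """Sketch of Eq. (20). `warp(t, de)` applies the spatial deformation;
    `D` is assumed to return (source logits, domain logits)."""
    # Adversarial term, Eq. (12), in its common non-saturating form: push D towards "real".
    fake_tex = G(texture, c_trg)
    fake_img = warp(fake_tex, deform)
    src_fake, cls_fake = D(fake_img)
    loss_adv = F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))

    # Domain classification of the fake image, Eq. (14).
    loss_cls = F.binary_cross_entropy_with_logits(cls_fake, c_trg)

    # Texture and image reconstruction, Eqs. (15)-(17), both with the L1 norm.
    loss_rec = torch.mean(torch.abs(texture - G(fake_tex, c_org))) \
             + torch.mean(torch.abs(real_img - warp(G(texture, c_org), deform)))

    # Identity preservation, Eq. (18): L2 distance between identity features.
    loss_ip = torch.norm(F_ip(texture) - F_ip(fake_tex), p=2, dim=1).mean()

    # Eq. (20): weighted sum of all generator terms.
    return loss_adv + lambda_cls * loss_cls + lambda_rec * loss_rec + lambda_ip * loss_ip
```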
4. Implementation
In this section, we describe how we stabilize the training process and give the details of the network architecture.
Since the proposed TDB-GAN consists of two major modules, we directly utilize the encoder and decoder architectures from DAE [15]. The generator and discriminator architectures in our framework are adopted from StarGAN [2]. We also leverage PatchGANs [8, 27] for the discriminator to distinguish real images from synthesized images.
In order to stabilize and accelerate the training procedure of TDB-GAN, we propose a multi-stage training strategy. In the first stage, we only optimize the DAE model, namely $\mathcal{L}_R$, $\mathcal{L}_{smooth}$, $\mathcal{L}_B$ and $\mathcal{L}_{Shading}$. Then, we fix the pre-trained weights of the DAE model while the generator $G$ and discriminator $D$ are trained with the $\mathcal{L}_G$ (with $\lambda_{ip} = 0$) and $\mathcal{L}_D$ losses, respectively. Finally, we jointly train $\mathcal{L}_{DAE}$, $\mathcal{L}_G$ and $\mathcal{L}_D$. Note that we impose the identity loss $\mathcal{L}_{ip}$ in the final training stage to ensure that the generated image preserves the identity.
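A compact sketch of this three-stage schedule is shown below. The loop structure follows the description above, while the loss helpers (`dae_loss`, `d_loss`, `g_loss`), data loaders and per-epoch bookkeeping are placeholders rather than the released training code.

```python
import torch

def train_tdbgan(dae, G, D, dae_loss, d_loss, g_loss, loaders, device="cuda"):
    """Three-stage training sketch; helper callables and loaders are assumed."""
    opt_dae = torch.optim.Adam(dae.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

    # Stage 1: optimize the DAE alone (L_R + L_smooth + L_B + L_Shading).
    for img in loaders["dae"]:
        opt_dae.zero_grad()
        dae_loss(dae, img.to(device)).backward()
        opt_dae.step()

    # Stage 2: freeze the DAE, train G and D with lambda_ip = 0,
    # performing one G update after every five D updates.
    for p in dae.parameters():
        p.requires_grad_(False)
    for step, (img, c_trg) in enumerate(loaders["gan"]):
        opt_d.zero_grad()
        d_loss(dae, G, D, img.to(device), c_trg.to(device)).backward()
        opt_d.step()
        if (step + 1) % 5 == 0:
            opt_g.zero_grad()
            g_loss(dae, G, D, img.to(device), c_trg.to(device), lambda_ip=0.0).backward()
            opt_g.step()

    # Stage 3: unfreeze the DAE and train L_DAE, L_G and L_D jointly,
    # now with the identity loss enabled (lambda_ip > 0).
    for p in dae.parameters():
        p.requires_grad_(True)
```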
5. Experiments
In this section, we first compare TDB-GAN with and without the DAE module on facial attribute transfer. In addition, we demonstrate with empirical results that TDB-GAN with the identity loss preserves more identity information than TDB-GAN without it.

5.1. Datasets
The CelebFaces Attributes (CelebA) dataset [12] contains 202,599 face images of 10,177 celebrities, each annotated with 40 binary attributes. We resize all aligned images from × to × . We randomly select 2,000 images as the test set and use the remaining images for training. We mainly test ten domains with the following attributes: expression (smiling/not smiling), skin color (pale skin/normal skin), accessory (eyeglasses/no eyeglasses), gender (male/female) and age (young/old).

The Radboud Faces Database (RaFD) [10] consists of 4,824 images collected from 67 subjects. Each subject has eight facial expressions in three different gaze directions, which are captured from three different angles. We first detect all face images with MTCNN [25], crop out the images at × resolution with the faces centered, and resize them to × .

All the models are optimized with Adam [9], where $\beta_1 = 0.$ and $\beta_2 = 0.$ . We flip the images horizontally with a probability of 0.5 to augment the training set. We perform one generator update after five discriminator updates as described in [2]. The batch size is set to 100 for all experiments. For the experiments on CelebA, we first train the DAE module for 5 epochs with a learning rate of 0.0002. Then, we train the generator and discriminator with a learning rate of 0.0001 for the first 100 epochs and linearly decay the learning rate to 0 over the next 100 epochs (a sketch of this decay schedule is given at the end of this subsection). Next, we impose the identity loss on the GAN module and train the GAN-related part for 29 epochs with a learning rate of 0.0001, applying the aforementioned decaying strategy over the next 29 epochs. The training strategy for RaFD is similar to that of CelebA. The weights in the training objective are set as $\lambda_{cls} = 1$ for $\mathcal{L}^r_{cls}$ and $\mathcal{L}^f_{cls}$, $\lambda_{rec} = 10$ for $\mathcal{L}_{rec}$, $\lambda_{ip} = 0.$ for $\mathcal{L}_{ip}$, $\lambda = 1e-$ for $\mathcal{L}_{smooth}$, $\lambda = \lambda' = 0.$ for $\mathcal{L}_B$ and $\lambda = 1e-$ for $\mathcal{L}_{Shading}$.

A unique advantage of TDB-GAN is its capability of disentangling the texture from the input image and editing the facial attributes and expression without the impact of the pose and shape of the face. We conduct an experiment on TDB-GAN with/without the DAE module. Additionally, we prove that the proposed identity loss helps to preserve more identity information through a verification result.
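The linear decay described above can be implemented as a small scheduler. This is a sketch under the stated CelebA settings (a constant rate of 0.0001 for 100 epochs, then linear decay to 0 over the next 100); the function name is ours.

```python
def linear_decay_lr(optimizer, epoch, base_lr=1e-4, constant_epochs=100, decay_epochs=100):
    """Keep the GAN learning rate constant, then decay it linearly to zero."""
    if epoch < constant_epochs:
        lr = base_lr
    else:
        lr = base_lr * max(0.0, 1.0 - (epoch - constant_epochs) / float(decay_epochs))
    for group in optimizer.param_groups:   # works with any torch.optim optimizer
        group["lr"] = lr
    return lr
```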
In TDB-GAN, we prefer to separate the texture and deformation from the input image. We propose that the information of deformation significantly affects the quality of face editing and the convergence of the domain classification loss of the fake face texture during training.

Figure 3. Facial attribute transfer results on the CelebA dataset. The first column demonstrates the input image, and the next five columns show the single attribute transfer results. The odd rows display the results generated by TDB-GAN without the DAE module, while the even rows show the results produced with DAE.

As illustrated in Figure 3, the eyeglasses generated by TDB-GAN with DAE are more obvious. For example, no glasses can be observed for the faces in rows C, E, G and I generated by TDB-GAN without DAE. The images generated by TDB-GAN without DAE (A, C) do not show the pale skin as realistically as those generated by TDB-GAN. While the faces of C and E generated by TDB-GAN without DAE are still smiling, TDB-GAN with DAE transfers the face image to smile or not smile correctly and naturally. Lastly, our proposed method produces more genuine changes of feminization, masculinity, aging and rejuvenation than TDB-GAN without the DAE module. The main reason is that DAE disentangles texture and deformation from the input image. The former preserves the main features and identity of the face, whereas the latter contains the information about the pose of the head, the shape of the face and so on. While we feed the generator with the well-aligned texture, face editing does not need to consider shape invariance. By contrast, TDB-GAN without DAE transfers the attributes under the constraint of shape invariance. It can also be observed from Figure 4 that TDB-GAN with DAE achieves a lower domain classification loss of fake face textures than TDB-GAN without DAE. There is a clear margin between the curves in the chart. The lower domain classification loss of fake face textures indicates better attribute transferring.

Figure 4. The domain classification loss of the fake face textures generated by the TDB-GAN with/without DAE module.
While transferring the domains, the network would strive to transfer more average features of the domain to decrease the domain classification loss of the fake face textures, even though the domain classification loss of the real face textures plays against it. Thus, we propose the identity loss to ensure identity invariance. We evaluate the effect of the identity loss on domain transferring in terms of face verification accuracy. In the following sections, we present the verification results for TDB-GAN with and without the identity loss.

In this experiment, we train our model on RaFD to synthesize facial expressions. There are eight different expressions in RaFD. We fix the input domain as the neutral expression and set the target domain to the seven remaining expressions. Thus, the proposed task aims to impose a particular expression on a neutral face.

We randomly split the RaFD dataset into training and testing sets with a 90%:10% ratio, namely 4,320 training images and 504 testing images including 63 neutral faces. For each of the neutral faces, we apply our network to generate seven facial expression images, i.e. in total 441 fake facial expression images were generated. Based on the 441 generated faces and 504 test images, we randomly generate 3,000 client accesses and 3,000 impostor accesses.
Figure 5. ROC curves on the test set of RaFD dataset.
Method          TDB-GAN with identity loss   TDB-GAN w/o identity loss
TPR@FPR=1%      11.07                        8.70
TPR@FPR=0.1%    1.60                         0.60
TPR@FPR=0%      0.23                         0.13
EER (%)         23.60                        24.50
AP (%)          81.89                        80.29
AUC (%)         83.73                        82.82
Table 1. Verification performance on the RaFD dataset.

The network proposed by Wen and Zhang [18] is employed to extract 512-dimensional identity features from the face images. The cosine distance is adopted to measure the similarity of two faces. The similarity is compared with a threshold (e.g. 0.5) to decide whether they are from the same person or not. In this work, TPR (True Positive Rate), FPR (False Positive Rate), EER (Equal Error Rate), AP (Average Precision) and AUC (Area Under Curve) are used to evaluate the performance of face verification. The higher the scores of these metrics, except EER, the better the results.

Figure 5 and Table 1 show the ROC curves and the verification results of TDB-GAN with/without the identity loss. From Table 1, while the TPR@FPR=1% for TDB-GAN without the identity loss is 8.70, the identity loss significantly increases the TPR@FPR=1% to as high as 11.07. The identity loss almost doubles the TPR@FPR=0.1% and TPR@FPR=0% of TDB-GAN. Table 1 also suggests that TDB-GAN with the identity loss achieves a lower EER and higher AP and AUC than TDB-GAN without the identity loss.
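As a rough illustration of this verification protocol, the sketch below scores pairs with cosine similarity and derives the reported metrics. It assumes pre-extracted 512-dimensional identity features and uses scikit-learn, which is our choice for illustration rather than the toolkit stated in the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score

def verification_metrics(feat_a, feat_b, labels):
    """feat_a, feat_b: (N, 512) identity features for each pair; labels: 1 for
    client (same person), 0 for impostor. Feature extraction [18] is not shown."""
    feat_a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    feat_b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    scores = np.sum(feat_a * feat_b, axis=1)          # cosine similarity per pair

    fpr, tpr, _ = roc_curve(labels, scores)
    tpr_at = {f: np.interp(f, fpr, tpr) for f in (0.01, 0.001, 0.0)}
    eer = fpr[np.argmin(np.abs(fpr - (1 - tpr)))]     # point where FPR ≈ FNR
    return {
        "TPR@FPR=1%": tpr_at[0.01],
        "TPR@FPR=0.1%": tpr_at[0.001],
        "TPR@FPR=0%": tpr_at[0.0],
        "EER": eer,
        "AP": average_precision_score(labels, scores),
        "AUC": roc_auc_score(labels, scores),
    }
```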
Figure 6. Facial expression synthesis results on CelebA dataset.

We first display qualitative results of facial attribute transfer on the CelebA dataset. Then, the quantitative results are evaluated with a user questionnaire.
Figure 6 shows the face images generated by IcGAN, CycleGAN, StarGAN and our TDB-GAN for attribute transfer in smiling, pale skin, eyeglasses, gender and age. As visualized in the figure, the images generated by the image-to-image translation approaches are better than those generated by IcGAN. Our approach retains more information than the low-dimension latent representation and also preserves the attribute-independent information, like hairstyle. The faces generated by TDB-GAN for gender and age transfer are better than those generated by StarGAN, and the eyeglasses added by TDB-GAN are more natural than those added by CycleGAN. Furthermore, our proposed method not only achieves higher visual quality but also preserves the identity of the input image due to the effect of the identity loss.
For the quantitative evaluation, we perform a user study on the visual effect of transferred facial attributes to assess IcGAN [14], CycleGAN [27], StarGAN [2] and TDB-GAN. Each of the four approaches was applied to transfer the five facial attributes, i.e. smile, pale skin, eyeglasses, gender and age, of faces from twenty individuals. For each of the five attributes transferred for the 20 subjects, four images synthesized by the different models were shown to volunteers and they were asked to select the best one, in terms of realism, preservation of identity and quality of the facial attribute synthesis. As 15 volunteers participated in the questionnaire, a maximum of 20 × 15 = 300 votes can be received for each approach and attribute. Table 2 lists the ratio of votes received for each model and attribute. While StarGAN received the highest number of votes for pale skin transfer, our TDB-GAN received the highest number of votes for four of the five attributes, i.e. smile, eyeglasses, gender and age.

Models       IcGAN    CycleGAN   StarGAN   TDB-GAN
Smile        2.33%    21.33%     19.00%    57.33%
Pale skin    2.00%    37.00%     36.67%    24.33%
Eyeglasses   0        28.00%     30.33%    41.67%
Gender       1.33%    35.00%     9.67%     54.00%
Age          0.33%    20.00%     17.67%    62.00%
Table 2. The perceptual evaluation of different models. Note that the sum of the probabilities in each row is not strictly equal to 100% due to numerical precision loss.
In the following sections, we demonstrate the qualitative and quantitative evaluation results on the RaFD dataset.

Figure 7. Facial expression synthesis results on RaFD dataset.
Figure 7 shows an example of seven facial expressions synthesized by IcGAN [14], CycleGAN [27], StarGAN [2] and our TDB-GAN. As shown in the figure, the images generated by StarGAN and our TDB-GAN have better visual quality than those generated by IcGAN and CycleGAN. IcGAN transfers the neutral expression to various expressions, but the generated fake images have the lowest quality. We believe that the latent vector extracted by IcGAN lacks effective representability. While the performance of CycleGAN is considerably better than that of IcGAN, the fake images generated by CycleGAN are still ambiguous. The fake faces synthesized by StarGAN have much more natural and more distinct expressions. Nonetheless, TDB-GAN is superior to StarGAN in terms of sharper details and more distinguishable expressions. For example, the faces generated by our TDB-GAN for angry, fearful and surprised are much more representative than those of StarGAN, especially in the eye regions. We believe that the ability of TDB-GAN to separate texture and deformation contributes most to the image quality, as it allows TDB-GAN to pay more attention to the facial expression editing, instead of the pose, shape and so on.
For a quantitative evaluation, we compute the classification error of facial expression recognition on the generated images. We first train a facial expression classifier with the 4,320 training images. We then train all the GAN models using the same training set. For testing, we first use the trained GANs to transfer all the neutral-expression testing images to the seven different expressions. Then we use the aforementioned classifier to classify these synthesized expressions (a sketch of this protocol is given after Table 3). Table 3 lists the accuracies of the facial expression classifier on the images synthesized by the different GAN models. As shown in Table 3, the images synthesized by the TDB-GAN model achieve the highest accuracy, which suggests that it synthesizes the most realistic facial expressions compared with the other methods.

Models      Accuracy (%)
IcGAN       91.61
CycleGAN    88.44
StarGAN     92.06
TDB-GAN     97.28
Table 3. The expression classification accuracies of images synthesized by different GAN models.
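For completeness, a minimal sketch of this evaluation step is shown below, assuming a pre-trained expression classifier and a data loader that yields synthesized images paired with their target expression labels; both are placeholders.

```python
import torch

@torch.no_grad()
def expression_accuracy(classifier, loader, device="cuda"):
    """Top-1 accuracy of a pre-trained expression classifier on synthesized faces."""
    correct, total = 0, 0
    for img, target in loader:   # synthesized images and their target expression labels
        pred = classifier(img.to(device)).argmax(dim=1)
        correct += (pred == target.to(device)).sum().item()
        total += target.numel()
    return 100.0 * correct / total
```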
6. Conclusion
In this paper, we proposed the Texture Deformation Based GAN to perform texture-to-image translation among multiple domains. Due to the disentangled texture and deformation, and the identity loss, the proposed TDB-GAN can generate images with higher quality and better preserved identity than the existing methods.
References

[1] Y.-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, Y. Ye, X. Shen, and J. Jia. Facelet-bank for fast portrait manipulation. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.
[2] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[4] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 2017.
[5] R. Huang, S. Zhang, T. Li, R. He, et al. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. ECCV, 2017.
[6] X. Huang, Y. Li, O. Poursaeed, J. E. Hopcroft, and S. J. Belongie. Stacked generative adversarial networks. In CVPR, volume 2, page 3, 2017.
[7] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388, 2010.
[11] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[12] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[13] R. Natsume, T. Yatagawa, and S. Morishima. RSGAN: Face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447, 2018.
[14] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. In NIPS Workshop on Adversarial Training, 2016.
[15] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. arXiv preprint arXiv:1806.06503, 2018.
[16] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5444–5453. IEEE, 2017.
[17] R. Sun, C. Huang, J. Shi, and L. Ma. Mask-aware photorealistic face attribute manipulation. arXiv preprint arXiv:1804.08882, 2018.
[18] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[19] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[20] W. Xian, P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. TextureGAN: Controlling deep image synthesis with texture patches. arXiv preprint, 2017.
[21] T. Xiao, J. Hong, and J. Ma. DNA-GAN: Learning disentangled representations from multi-attribute images. International Conference on Learning Representations, Workshop, 2018.
[22] T. Xiao, J. Hong, and J. Ma. ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–187, September 2018.
[23] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[24] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868–2876, 2017.
[25] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[26] S. Zhou, T. Xiao, Y. Yang, D. Feng, Q. He, and W. He. GeneGAN: Learning object transfiguration and attribute subspace from unpaired data. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
[27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.