3D Dense Geometry-Guided Facial Expression Synthesis by Adversarial Learning
Rumeysa Bodur, Binod Bhattarai, and Tae-Kyun Kim
Imperial College London, UK; KAIST, South Korea
{r.bodur18, b.bhattarai, tk.kim}@imperial.ac.uk

Abstract
Manipulating facial expressions is a challenging task due to fine-grained shape changes produced by facial muscles and the lack of input-output pairs for supervised learning. Unlike previous methods using Generative Adversarial Networks (GAN), which rely on cycle-consistency loss or sparse geometry (landmarks) loss for expression synthesis, we propose a novel GAN framework to exploit 3D dense (depth and surface normals) information for expression manipulation. However, a large-scale dataset containing RGB images with expression annotations and their corresponding depth maps is not available. To this end, we propose to use an off-the-shelf state-of-the-art 3D reconstruction model to estimate the depth and create a large-scale RGB-Depth dataset after a manual data clean-up process. We utilise this dataset to minimise the novel depth consistency loss via adversarial learning (note that we do not have ground-truth depth maps for generated face images) and the depth categorical loss of synthetic data on the discriminator. In addition, to improve the generalisation and lower the bias of the depth parameters, we propose to use a novel confidence regulariser on the discriminator side of the framework. We extensively performed both quantitative and qualitative evaluations on two publicly available challenging facial expression benchmarks: AffectNet and RaFD. Our experiments demonstrate that the proposed method outperforms the competitive baseline and existing arts by a large margin.
1. Introduction
Face expression manipulation is an active and challenging research problem [3, 24, 4], with applications in areas such as the movie industry and e-commerce. Unlike the manipulation of other facial attributes [3, 16, 1], e.g. hair/skin colour, glasses or beard, expression manipulation is relatively fine-grained in nature. The change in expressions is caused by geometrical actions of facial muscles in a detailed manner. Figure 1 shows images with two different expressions, sad and happy. From these images we can observe that different expressions bear different geometric information in a fine-grained manner. The task of expression manipulation becomes even more challenging when source and target expression pairs are not available to train the model, which is the case in this study.

Figure 1. Sparse Geometry vs Dense Geometry. Randomly selected images from AffectNet with their corresponding geometric information (image, landmarks, depth map, normal map). We can observe that sparse geometry, i.e. landmarks, is incapable of capturing expression-specific fine details, e.g. wrinkles in the cheek and jaw area, and nose and eyebrow shapes.

In one of the earliest studies [5], the Facial Action Coding System (FACS) was developed for describing facial expressions, popularly known as Action Units (AUs). These AUs are anatomically related to the contractions and relaxations of certain facial muscles. Although the number of AUs is only 30, more than 7000 different combinations of AUs that describe expressions have been observed [27]. This makes the task more challenging than other coarse attribute/image manipulation tasks, e.g. higher-order face attribute editing [3] or season translation [16]. These kinds of problems cannot be addressed only by optimising existing loss functions, such as cycle consistency [3], or by adding a target attribute classifier on the discriminator side [3, 8].

A recent study on GANs [24] proposes to use AUs to manipulate expressions and shows a performance improvement over prior arts, which include CycleGAN [35], DIAT [15] and IcGAN [22]. These methods do not explicitly address such details in their objective functions, but rather only deal with coarse characteristics, which ultimately results in a sub-optimal solution. The main drawback of notable existing methods exploiting geometry, such as GANimation [24] (action units) and GANnotation [26] (sparse landmarks), is the use of annotations which are sparse in nature. As presented in Figure 1, these sparse geometrical annotations fail to capture the geometries of different expressions precisely. To address these issues, we propose to introduce a dense 3D depth consistency loss for expression manipulation. The use of dense 3D geometry guides how different expressions should appear in image pixel space, and thereby the process of image synthesis when translating from one expression to another.

An RGB-Depth consistency loss has been successfully applied to the semantic segmentation task [2], which is formulated as a domain translation problem, i.e. from simulation to the real world. However, that task is less demanding than ours, since the label as well as the geometric information of the translated image remain the same as those of the source. In contrast, our objective is to manipulate an image from the source expression to a target expression, involving a faithful transformation of expression-relevant geometry without distorting the identity information. Also note that, unlike previous works [4, 26, 13, 30, 25] on expression manipulation utilising depth information, our work is a target-label conditioned framework where the target expression is a simple one-hot encoded vector.
Hence, it does not require source-target pairs at either training or testing time. Although constraining model parameters with depth information is effective, a reasonably large RGB-D dataset for expressions is not available to date. Among the existing datasets, the EURECOM Kinect face dataset [17] contains 3 expressions (neutral, smiling, open mouth) of 50 subjects, totalling 2500 images, and BU-3DFE [34] provides a database with 8 expressions of 100 subjects. These datasets are too small to train a generative model optimally.

To overcome the unavailability of RGB-D pairs at a scale sufficient to train a GAN, we propose to use an off-the-shelf 3D face reconstruction model [31] and extract the depth and surface normal information from the reconstructions. As discussed, since in our case the category of the image changes when translated from the source to the target domain, we propose to minimise the target attribute cross-entropy loss in the depth domain to ensure that the depth of the reconstructed image is also consistent with the target labels. As we are aware that the depth images extracted from the off-the-shelf 3D face reconstruction model are error-prone and sub-optimal for our purpose, we propose to penalise the depth estimator by employing a confidence penalty [23]. This regulariser is effective in improving the accuracy of classification models trained on noisy labels [10].

Our contributions can be summarised as follows:
• We propose a novel geometric consistency loss to guide the expression synthesis process using depth and surface normal information.
• We estimated the depth and surface normal parameters for two challenging expression benchmarks, namely AffectNet and RaFD, and will make these annotations public upon the acceptance of the paper.
• We propose a novel regularisation on the discriminator to penalise the confidence of the depth estimator in order to improve the generalisation of the cross-dataset model parameters.
• We evaluated the proposed method on two challenging benchmarks. Our experiments demonstrate that the proposed method is superior to the competitive baselines in both quantitative and qualitative metrics.
2. Related Work
Unpaired Image-to-Image Translation.
One of the applications of image-to-image translation by Generative Adversarial Networks (GANs) [35, 11] is the synthesis of attributes and facial expressions on given face images. Various methods have been proposed based on the GAN framework to accomplish this task. In general, the training sets used for image-to-image translation problems consist of aligned image pairs. However, since paired datasets for various facial expressions and attributes are quite small and constructed in controlled environments, most methods are designed for datasets with unpaired images. CycleGAN [35] tackles the problem of unpaired image-to-image translation by coupling the mapping learned by the generator with an inverse mapping and an additional cycle consistency loss. This approach is employed by most related studies in order to preserve key attributes between the input and output images, and thus the person's identity in the given source image. In StarGAN [3], the multi-domain image-to-image translation problem is approached by using a single generator instead of training a separate generator for each domain pair. In AttGAN [8], an attribute classification constraint is applied on the generated image rather than imposing an attribute-independence constraint on the latent encoding; the method is further extended to generate attributes of different styles by introducing a set of style controllers and a style predictor. Unlike other attribute generation methods, ELEGANT [33] aims to generate from exemplars, and thus uses latent encodings instead of labels. For multiple attribute generation, the attributes are encoded in a disentangled manner, and to increase quality the method adopts residual learning and multi-scale discriminators.
Geometry Guided Expression Synthesis.
Existing works incorporating geometric information yield better quality synthetic images. GP-GAN [4] generates faces from only sparse 2D landmarks. Similarly, GANnotation [26] is a facial-landmark guided model that simultaneously changes the pose and the expression; in addition, it minimises a triple consistency loss to bridge the gap between the distributions of the input and generated image domains. G2GAN [30] is another recent work on expression manipulation guided by 2D facial landmarks. In GAGAN [13], latent variables are sampled from a probability distribution of a statistical shape model, i.e. the mean and eigenvectors of landmark vectors, and the generated output is mapped into a canonical coordinate frame by using a differentiable piece-wise affine warping. GC-GAN [25] applies contrastive learning in a GAN for expression transfer. The network consists of two encoder-decoder networks and a discriminator network: one encodes the image whilst the other encodes the landmarks of the target and the source, disentangled from the image. GANimation [24] manages to generate a continuum of facial expressions by conditioning the GAN on a 1-dimensional vector that indicates the presence and magnitude of AUs rather than just category labels. It benefits from both a cycle loss and a geometric loss; however, the annotations are sparse. Whilst the studies mentioned above mostly rely on sparse geometric information, some recent studies go beyond landmarks and make use of surface normals. [28] exploits surface normal information to address the coarse attribute manipulation task, whereas in our work we deal with the more challenging expression manipulation task. Similarly, [19] feeds the surface normal maps of the target as a condition to the networks for modifying the expression of the given image. Unlike that work, we deal with an unpaired image translation problem, where the expression is given only as a categorical label.
3. Proposed Method
The schematic diagram of our method can be seen in Figure 2. As shown, our pipeline consists of two generative adversarial networks in a cascaded form, namely the RGB adversarial network and the depth adversarial network. One of our contributions lies in introducing a depth network alongside the RGB network and training the framework in an end-to-end fashion. As our RGB adversarial network, we choose StarGAN [3], one of the representative conditional GAN architectures among contemporary frameworks that has been widely used for unpaired multi-domain image translation. Since our method is generic, it can be employed with any other RGB adversarial network. In this section, we first explain our RGB-Depth dataset construction, followed by architectural details and the training process.
3.1. RGB-Depth Dataset Construction

To train a generator that accurately estimates depth from RGB images, a large-scale dataset consisting of RGB image and depth map pairs of faces with different expressions is needed. As discussed in Section 1, no such large-scale dataset that fits the requirements to train a GAN adequately is available. To fill this gap, we propose to create a large-scale RGB-Depth dataset. Hence, we utilise the publicly available model of [31] to reconstruct 3D meshes of a subset of images from the existing datasets AffectNet [18] and RaFD [14], and augment them with depth maps. Some of the 3D meshes were poor in quality, and we carefully discarded these manually. We projected the reconstructed meshes and interpolated missing parts of the projections in order to acquire the corresponding depth maps and normal maps. Samples obtained by this process are shown in Figure 3.

To validate the generalisation and statistical significance of the RGB image and depth map pairs, we use the constructed datasets to train a model in an adversarial manner that generates depth maps and normal maps from RGB images. As estimating depth from a single RGB image is an ill-posed problem, it is essential to take the global scene into account. Since we need local as well as global consistency in the predicted depth maps, we train the model in an adversarial manner [7]. Although recent CNNs are capable of understanding the global relationship too, their objective functions minimise combinations of per-pixel losses. To differentiate the predicted depth maps from the extracted ones, for convenience we refer to the extracted depth maps as ground truth depth maps. Figure 3 shows the predicted and ground truth depth maps and normal maps of some examples from the real test set. From these examples, we observe that the predicted depth maps and normal maps are quite similar to the ground truths. This demonstrates that the model trained with our dataset generalises well to unseen examples, and shows the potential of our dataset for future use cases as well. We will make our dataset publicly available for the research community upon the acceptance of this paper. We name the depth-augmented AffectNet and RaFD datasets AffectNet-D and RaFD-D, respectively.
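The paper does not detail how the surface normal maps are derived from the projected meshes. As a rough illustration, a common recipe is to compute per-pixel normals from the rendered depth map by finite differences; the sketch below follows that recipe (the function name and the use of NumPy are our own illustrative choices, not the authors' code).

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Estimate per-pixel surface normals from an (H, W) depth map.

    The normal at each pixel is the normalised vector
    (-dz/dx, -dz/dy, 1), computed with central finite differences.
    Pixels without valid depth are set to the zero vector.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    n = np.dstack((-dz_dx, -dz_dy, np.ones_like(dz_dx)))
    n /= np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)
    n[depth <= 0] = 0.0  # mask background / un-reconstructed regions
    return n

# Toy check: a tilted plane yields a single constant normal direction.
plane = np.tile(np.linspace(1.0, 2.0, 64), (64, 1))
normals = normals_from_depth(plane)
```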
Figure 2. Overview of the Proposed End-to-End Pipeline.
The pipeline consists of two generative adversarial networks, namely the RGB network and the depth network. G_rgb takes as input the RGB source image along with the target label and generates a synthetic image with the given target expression. This image is passed to G_depth to generate its estimated depth and thereby to calculate the depth consistency loss. This loss helps capture detailed 3D geometric information, which is then aligned in the generated RGB images.

We have a scenario in which an RGB input image and a target expression label in the form of a one-hot encoded vector are given. Our objective is to translate this image accurately to the given target label with high quality. To this end, we propose to train two types of adversarial networks, namely RGB and depth, in an end-to-end fashion. In the following, we discuss these networks in detail.
RGB Adversarial Network.
As seen in Figure 2, the first part of the pipeline is the RGB adversarial network, which consists of a generator G_rgb and a discriminator D_rgb. This part is StarGAN [3]; however, we would like to emphasise that our method is not constrained to one architecture. The discriminator serves two functions, D_rgb: {D_rgb^adv, D_rgb^cls}, with D_rgb^adv providing a probability distribution over real and fake domains, while D_rgb^cls provides a distribution over the expression classes. The network takes as input an RGB source image, I_s, along with a target expression label, y_t, as condition. G_rgb translates the source image under the guidance of the given target label, y_t, yielding a synthetic image I_t. Then D_rgb^adv, a discriminator trained to distinguish between real and synthetic RGB images, is applied on the generated image to ensure a realistic look in the target domain. The adversarial loss used for training D_rgb is given in Eqn. 1:

$$\mathcal{L}^{rgb}_{adv} = \mathbb{E}_{I_s}\big[\log D^{rgb}_{adv}(I_s)\big] + \mathbb{E}_{I_s, y_t}\big[\log\big(1 - D^{rgb}_{adv}(G_{rgb}(I_s, y_t))\big)\big] \quad (1)$$

The discriminator contains an auxiliary classifier, D_rgb^cls, which guides G_rgb to generate images that can be confidently classified as the target expression. We minimise the following categorical cross-entropy loss:

$$\mathcal{L}^{rgb}_{cls} = \mathbb{E}_{I_s, y_t}\big[-\log D^{rgb}_{cls}(y_t \mid G_{rgb}(I_s, y_t))\big] \quad (2)$$

To preserve details unrelated to the target attribute, a reconstruction loss is employed, which is a cycle consistency loss and is formulated in Eqn. 3, where y_s represents the original label of the source image I_s:

$$\mathcal{L}^{rgb}_{rec} = \mathbb{E}_{(I_s, y_s), y_t}\big[\lVert I_s - G_{rgb}(G_{rgb}(I_s, y_t), y_s)\rVert_1\big] \quad (3)$$

The overall objective to be minimised for the RGB adversarial network is:

$$\mathcal{L}_{rgb} = \lambda_{adv}\,\mathcal{L}^{rgb}_{adv} + \lambda_{cls}\,\mathcal{L}^{rgb}_{cls} + \lambda_{rec}\,\mathcal{L}^{rgb}_{rec} \quad (4)$$

In Eqn. 4, λ_adv, λ_cls and λ_rec are hyper-parameters that control the weights of the adversarial, classification and reconstruction losses, and are set by cross-validation. From Eqn. 4, we can see that none of these objectives explicitly models the fine-grained details of the expressions. Thus, the solution from the RGB adversarial network alone remains sub-optimal.
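To make the interplay of Eqns. 1-4 concrete, the sketch below computes the generator-side part of the RGB objective. The interfaces (G_rgb(image, label) returning an image, D_rgb(image) returning an adversarial logit and class logits) and the default loss weights are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_rgb_loss(G_rgb, D_rgb, I_s, y_s, y_t,
                       lam_adv=1.0, lam_cls=1.0, lam_rec=10.0):
    """Generator-side terms of the RGB objective (Eqns. 1-4, sketch)."""
    I_t = G_rgb(I_s, y_t)                  # translate source to target expression
    adv_logit, cls_logits = D_rgb(I_t)
    # Adversarial term: the generator tries to make D_adv label I_t as real.
    l_adv = F.binary_cross_entropy_with_logits(adv_logit,
                                               torch.ones_like(adv_logit))
    # Auxiliary classification term (Eqn. 2): I_t should be classified as y_t.
    l_cls = F.cross_entropy(cls_logits, y_t)
    # Cycle-consistency reconstruction (Eqn. 3): translate back to y_s.
    l_rec = (I_s - G_rgb(I_t, y_s)).abs().mean()
    return lam_adv * l_adv + lam_cls * l_cls + lam_rec * l_rec

# Smoke test with stand-in callables (identity generator, toy discriminator).
G = lambda x, y: x
D = lambda x: (x.mean(dim=(1, 2, 3)).view(-1, 1), torch.zeros(x.size(0), 8))
loss = generator_rgb_loss(G, D, torch.rand(2, 3, 128, 128),
                          torch.tensor([0, 1]), torch.tensor([2, 3]))
```

Note that minimising the second term of Eqn. 1 directly saturates early in training; like most GAN implementations, the sketch uses the non-saturating form in which the generator instead maximises log D(G(·)).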
Depth Adversarial Network. To address the limitations of the RGB adversarial network, we propose to append another network that operates in the depth domain and guides the RGB network, since depth maps are able to capture the geometric information of various expressions.
Figure 3.
RGB Images and Their Ground Truth/Predicted Depth and Surface Normal Maps.
Real test images with their corresponding ground truth depth maps extracted from the off-the-shelf 3D model and the depth maps predicted by our depth-estimator network. We can clearly see that the depth maps align well with the geometry of the real images, and that our network, trained on the training set of image-depth pairs, is able to preserve the geometry.

As depicted in the right half of Figure 2, similar to the RGB adversarial network, the depth network also consists of a generator G_depth and a discriminator D_depth: {D_depth^adv, D_depth^cls}. In this case the generator, G_depth, takes as input the synthetic image, I_t, generated by G_rgb, and generates the estimated depth map, d_t, which is then passed to the discriminator. D_depth is trained to distinguish between real and synthetic depth maps using the following adversarial loss, where d_s represents the real depth maps:

$$\mathcal{L}^{depth}_{adv} = \mathbb{E}_{d_s}\big[\log D^{depth}_{adv}(d_s)\big] + \mathbb{E}_{I_t}\big[\log\big(1 - D^{depth}_{adv}(G_{depth}(I_t))\big)\big] \quad (5)$$

Again, an auxiliary classifier, D_depth^cls, is added on top of the discriminator in order to generate depth maps that can be confidently classified as the target expression, y_t. The loss function used for this classifier is given in Eqn. 6:

$$\mathcal{L}^{depth}_{cls} = \mathbb{E}_{I_t, y_t}\big[-\log D^{depth}_{cls}(y_t \mid G_{depth}(I_t))\big] \quad (6)$$

The depth network is trained in a similar way to the multi-domain translation network in [3], where the domain is given as a condition. Thus, the network is able to translate from the RGB to the depth domain as well as from the depth to the RGB domain. We denote the latter translation G'_depth; it takes the estimated depth map and generates an RGB image. As in the RGB adversarial network, we employ the reconstruction loss given in Eqn. 7:

$$\mathcal{L}^{depth}_{rec} = \mathbb{E}_{I_t}\big[\lVert I_t - G'_{depth}(G_{depth}(I_t))\rVert_1\big] \quad (7)$$

The losses used for the depth network can be summed as follows:

$$\mathcal{L}_{depth} = \lambda_{adv}\,\mathcal{L}^{depth}_{adv} + \lambda_{cls}\,\mathcal{L}^{depth}_{cls} + \lambda_{rec}\,\mathcal{L}^{depth}_{rec} \quad (8)$$

where λ_adv, λ_cls and λ_rec are hyper-parameters for the adversarial, classification and reconstruction losses, whose values we set by cross-validation. Minimising this dense depth loss, which encodes fine-grained geometric information related to expressions, helps to synthesise high-quality expression-manipulated images.
Overall Training Objective. The overall objective function used for the end-to-end training of the networks is:

$$\min_{G_{rgb},\, G_{depth}} \; \max_{D_{rgb},\, D_{depth}} \; \big(\mathcal{L}_{rgb} + \mathcal{L}_{depth}\big)$$

From Eqns. 5, 6 and 7 we can see that the loss incurred by the synthetic image in the depth domain is propagated to the RGB adversarial network. This ultimately guides the RGB generator to synthesise images with the target attributes in such a way that their geometric information is also consistent.
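The following sketch (reusing generator_rgb_loss from the earlier sketch, with the same assumed interfaces) shows how a single generator update would wire the depth consistency terms into the RGB generator: the synthetic image I_t is pushed through G_depth, the depth-domain losses are computed, and gradients flow back through G_depth into G_rgb. Discriminator updates and the depth reconstruction term are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def generator_step(G_rgb, G_depth, D_rgb, D_depth, opt_g, I_s, y_s, y_t):
    """One end-to-end generator update (sketch of the min part of the
    min-max objective; only terms that reach the generators are shown)."""
    I_t = G_rgb(I_s, y_t)
    d_t = G_depth(I_t)              # estimated depth of the synthetic image
    adv_logit, cls_logits = D_depth(d_t)
    # Depth-domain adversarial and classification terms (Eqns. 5 and 6).
    l_adv = F.binary_cross_entropy_with_logits(adv_logit,
                                               torch.ones_like(adv_logit))
    l_cls = F.cross_entropy(cls_logits, y_t)
    loss = generator_rgb_loss(G_rgb, D_rgb, I_s, y_s, y_t) + l_adv + l_cls
    opt_g.zero_grad()
    loss.backward()                 # gradients flow into both generators
    opt_g.step()
    return loss.item()
```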
Pre-training the Depth Network.
The depth network employed in this framework is first trained offline on the dataset augmented with depth maps (see Section 3.1 for the creation and augmentation of the depth maps on AffectNet and RaFD). We pre-train the depth network with strong supervision using the RaFD-D and AffectNet-D datasets we constructed, so that it can generate reasonable depth maps for synthetic inputs during the end-to-end training. Employing such a pre-trained network to constrain GAN training has been useful in other downstream tasks such as identity preservation [6]. Apart from the depth network losses described in the previous section, since we have paired data we also employ a pixel loss and a perceptual loss. The pixel loss, given in Eqn. 9, is the L1 distance between the depth map generated by G_depth for the input image, I_s, and its ground truth depth map, d_s:

$$\mathcal{L}_{pix} = \mathbb{E}_{(I_s, d_s)}\big[\lVert d_s - G_{depth}(I_s)\rVert_1\big] \quad (9)$$

Instead of relying solely on the L1 or L2 distance, with the perceptual loss [12] defined in Eqn. 10 the model learns from the error between high-level feature representations extracted by a pre-trained CNN, in our case VGG19 [29]:

$$\mathcal{L}_{per} = \mathbb{E}_{(I_s, d_s)}\big[\lVert V(d_s) - V(G_{depth}(I_s))\rVert_1\big] \quad (10)$$

Finally, to improve the generalisation of the depth network, we introduce a regulariser called the confidence penalty, shown in Eqn. 11. This regulariser relaxes the confidence of the depth prediction model, similar to regularisers for learning from noisy labels [10], and has been successfully applied to image classification tasks [23]. As mentioned before, the depth maps are extracted from the off-the-shelf 3D reconstruction model, and the manual eradication of low-quality depth maps has already made our dataset ready for learning the model. Even then, this regulariser helps to further remove noise from the dataset and learn a robust model. In Eqn. 11, H stands for the entropy and β controls the strength of the penalty; we set β by cross-validation.

$$\mathcal{L}^{depth}_{cls} = \mathbb{E}_{I_s, y_t}\big[-\log D^{depth}_{cls}(y_t \mid G_{depth}(I_s))\big] - \beta\, H\big(D^{depth}_{cls}(y_t \mid G_{depth}(I_s))\big) \quad (11)$$
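A minimal sketch of the confidence-penalised classification loss of Eqn. 11 is given below; the function name and the default β are illustrative (the paper sets β by cross-validation).

```python
import torch
import torch.nn.functional as F

def confidence_penalised_ce(cls_logits: torch.Tensor,
                            y_t: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy minus beta times the entropy of the prediction
    (Eqn. 11). Subtracting the entropy penalises over-confident,
    low-entropy output distributions, as in Pereyra et al. [23]."""
    ce = F.cross_entropy(cls_logits, y_t)
    log_p = F.log_softmax(cls_logits, dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1).mean()
    return ce - beta * entropy

# Toy usage with 8 expression classes:
logits = torch.randn(4, 8)
labels = torch.randint(0, 8, (4,))
loss = confidence_penalised_ce(logits, labels)
```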
4. Experimental Results
Datasets.
We extended two popular benchmarks, AffectNet [18] and RaFD [14], to AffectNet-D and RaFD-D. For AffectNet, we took the down-sampled version, in which a maximum of 15,000 images are selected for each class; reconstructing these images yielded 87,106 meshes in total across the 8 expressions (per-class statistics are given in Table 4 in the Appendix). RaFD [14] contains images of 67 models taken with 3 different gazes and 5 different camera angles, where each model shows 8 expressions. The images are aligned based on their landmarks and cropped to a fixed square resolution. Implementation.
Our method is implemented in PyTorch [21]. We perform our experiments on workstations with an Intel i5-8500 3.0 GHz CPU, 32 GB of memory, and NVIDIA GTX 1060 and GTX 1080 GPUs. As for the training schedule, we first pre-train the depth network for 200k epochs, then train both GANs in an end-to-end fashion for 300k epochs, with batch sizes of 15 and 8, respectively. We start with a learning rate of 0.0001 and decay it every 1000 epochs. We update the discriminator 5 times per generator update. The models are trained with the Adam optimiser.
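For concreteness, a sketch of this schedule is shown below. The Adam betas, the decay factor, and the helper names are assumptions; the paper only states the initial learning rate, the decay interval, and the 5:1 discriminator-to-generator update ratio.

```python
import torch

N_CRITIC = 5  # discriminator updates per generator update

def make_optimisers(G, D, lr=1e-4):
    # betas=(0.5, 0.999) is a common GAN choice, assumed here.
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    # Decay the learning rate every 1000 epochs; the factor is assumed.
    sch_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=1000, gamma=0.95)
    sch_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=1000, gamma=0.95)
    return opt_g, opt_d, sch_g, sch_d

# Inside the training loop (pseudocode):
#   for step, batch in enumerate(loader):
#       discriminator_step(batch)           # every iteration
#       if step % N_CRITIC == 0:
#           generator_step(batch)           # once per N_CRITIC D updates
```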
Evaluation metrics.
We evaluate our method both quantitatively and qualitatively. To quantitatively evaluate the generated images, we compute the Fréchet Inception Distance (FID) [9], a commonly used metric for assessing the quality and diversity of synthetic images. We further evaluate our method by applying a pre-trained face recognition model to calculate an identity loss for the synthesised images. We also calculate the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [32] between the original and reconstructed images. In addition, similar to the attribute generation rate reported for STGAN [16], we report an expression generation rate in our experiments. To calculate the expression generation rate, we train a classifier, independent of all models, on the real training set and evaluate it on the synthetic test sets.
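As a reference for the reconstruction metrics, the PSNR computation is simple enough to state exactly; the sketch below assumes images normalised to [0, 1]. FID and SSIM require reference implementations and are omitted here.

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio: 10 * log10(max_val^2 / MSE).

    Higher is better; identical images give an infinite PSNR.
    """
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Toy usage: compare an image against a noisy copy of itself.
a = torch.rand(1, 3, 128, 128)
b = (a + 0.05 * torch.randn_like(a)).clamp(0, 1)
print(f"PSNR: {psnr(a, b).item():.2f} dB")
```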
Compared Methods.
We compare our method to state-of-the-art attribute manipulation baselines StarGAN [3], IcGAN [22] and CycleGAN [35]. Among these methods, we take StarGAN [3] as our baseline. We further compare our method to recent studies that utilise geometric information, namely GANimation [24] and GANnotation [26] (please refer to Table 1). For GANimation, we use the publicly available implementation and train it on our dataset.
We compare our method with existing arts on various quantitative metrics on both AffectNet and RaFD.
Expression generation rate:
On AffectNet, we report the performance on 6 expressions, excluding contempt and disgust, since these two classes are highly under-represented in this dataset. Apart from this accuracy score on AffectNet, all other quantitative results are based on all 8 expression classes for both datasets. As seen in Table 1, on AffectNet, applying our method on top of StarGAN increases the accuracy by 3.8, from 82.1 to 85.9. Introducing the confidence penalty to the depth network increases this rate further (see Table 1). Similarly, as seen in Table 2, our method also improves the accuracy on RaFD over the 53.1 achieved by StarGAN. FID:
To assess the quality of the synthesised images, we calculate the FID for both datasets. As seen in the tables, when our method is applied over StarGAN, the FID decreases from 16.33 on AffectNet and from 39.45 on RaFD (see Tables 1 and 2 for the full scores). Since lower FID values indicate better image quality and diversity, these results verify that our method improves the synthetic images in both aspects. PSNR/SSIM:
We further compute the PSNR/SSIM scores to evaluate the reconstruction performance of the networks. Our method improves both the PSNR and the SSIM scores over StarGAN on AffectNet and RaFD; the full scores are given in Tables 1 and 2. Identity Preservation:
Finally, to assess how well the identity is preserved, we employ the pre-trained VGG Face model [20] to extract the features of real and synthesised images. We calculate the L2 loss between the features of each synthesised image and the corresponding real image. Figure 4 illustrates the resulting distribution of this loss. As seen in these graphs, our method preserves the identity better than StarGAN on both datasets.

Method           | Geometry     | Conf. penalty | Accuracy (%) ↑ | FID ↓  | PSNR ↑ | SSIM ↑
StarGAN [3]      | N/A          | No            | 82.1           | 16.33  | 32.96  | 0.79
GANimation [24]  | Action Units | No            | 74.3           | 18.86  | 32.12  | 0.74
GANnotation [26] | Landmarks    | No            | 82.6           | 17.36  | 32.54  | 0.77
Our Method       | Depth map    | No            | 85.9 (+3.8)    | 14.03  | 33.88  | 0.82
Our Method       | Normal map   | Yes           | 83.9 (+1.8)    | 13.29  | 33.59  | 0.81
Our Method       | Depth map    | Yes           |                |        |        |
Table 1.
Performance Comparison on AffectNet.
Method      | Geometry  | Conf. penalty | Accuracy (%) ↑ | FID ↓  | PSNR ↑ | SSIM ↑
StarGAN [3] | N/A       | No            | 53.1           | 39.45  | 31.19  | 0.75
Our Method  | Depth map | Yes           |                |        |        |
Table 2.
Performance Comparison on RaFD.
Method     | Depth network weight (β) | Accuracy (%) ↑
StarGAN    | 0.0                      | 82.1
Our Method | 0.1                      | 85.9 (+3.8)
Our Method | 0.2                      | 86.1 (+4.0)
Our Method | 0.3                      |
Our Method | 0.4                      | 86.3 (+4.2)
Table 3.
Performance Comparison on AffectNet with Different Weights for the Depth Network.
Comparison to existing arts:
Our method obtains better scores than all three compared methods, StarGAN [3], GANimation [24] and GANnotation [26], on every metric. This verifies that depth and surface normal information captures expression-specific details that are missed by sparse geometric information, i.e. landmarks and action units.
Hyper-parameter Study.
The important additional hyper-parameter for us, in comparison to StarGAN, is the weight of the loss incurred by the depth network, which comprises the depth adversarial loss and the depth classification loss. We set different weights for this loss and evaluate the performance; Table 3 summarises this hyper-parameter study. When we set the weight of the depth loss to 0.0, which is the RGB network alone, i.e. StarGAN, the mean target expression classification accuracy is 82.1. As discussed before, without using the confidence penalty, setting the weight to 0.1 improves StarGAN by +3.8, yielding an accuracy of 85.9. Setting the weight to 0.2 improves the performance from 85.9 to 86.1, and slightly increasing the weight to 0.3 gives a further gain. As increasing the weight to 0.4 does not further improve the performance, we choose 0.3 as the optimal value for this parameter. These experimental results show that the performance of our method is influenced by the hyper-parameters, but is stable.

Figure 4.
ID Loss Distribution.
AffectNet (left) and RaFD (right). Please zoom in to see the details.
Figure 5 shows a comparison of samples synthesised by StarGAN and our method. We observe that incorporating the depth information with our method yields results of higher quality, while also maintaining the facial geometry and photometry, and hence the identity.

On the left-hand side of Figure 5, we present examples showing that our method is capable of providing photometric consistency. Images synthesised by StarGAN contain artifacts on the skin, and in some cases fail to preserve the shapes of the nose, mouth and eyes, as well as the original texture. In contrast, in almost all synthesised images our method yields results that look more realistic and natural. In particular, areas such as the mouth, nose and eyebrows are well synthesised in correspondence with the given target expression, even in cases where StarGAN fails.

The right-hand side of Figure 5 presents a comparison of samples generated by StarGAN and our method, along with their depth maps predicted by our method. For the first two and the last sample, when translating a neutral image to happy, our method synthesises the face with a realistic geometry, whereas StarGAN fails in the mouth area, resulting in artifacts in the inner mouth. In the translation from happy to neutral on the fourth and fifth samples, StarGAN yields results where the mouth is uncertain and not closed. Our method, on the other hand, results in a closed mouth with little or no artifacts. The predicted depth maps for StarGAN verify this inference, as the mouth is ambiguous in the fourth depth map and open in the fifth. For the third and sixth samples, when synthesising images with the target class 'sad', both methods generate open-mouth samples. However, in images generated by StarGAN the inner mouth has artifacts, the mouth is blurry, and the original mouth geometry is not preserved, which is reflected in the predicted depth maps. These results show that the guidance of the depth network leads the RGB network to synthesise images that yield more realistic depth map predictions, resulting in improved geometric and photometric consistency in the synthesised images. These qualitative examples further validate that the geometric consistency loss is essential when training GANs for high-quality expression translation.

Figure 5. Qualitative Results on AffectNet. (a) A comparison of images synthesised by StarGAN and our method; the results show that our method preserves photometric consistency. (b) Synthesised images and their predicted depth maps. The original images with the given expressions are translated to the target expressions by StarGAN and our method, and our depth map estimator network is applied to obtain the predicted depth maps.

Figure 6 shows a comparison of our method to existing work. In this figure we observe that our method produces outperforming or competitive results on RaFD as well. Results produced by CycleGAN and IcGAN are of lower quality, whilst StarGAN and GANimation produce high-quality results. However, we observe that the geometric guidance introduced by our method helps preserve the existing geometry while modifying expression-specific geometrical details. For instance, the contempt expression synthesised by GANimation only slightly modifies the mouth, whereas our method adds mouth wrinkles and lowers the eyelids.

Figure 6. Qualitative Results on RaFD. Facial expressions synthesised by CycleGAN [35], IcGAN [22], StarGAN [3], GANimation [24] and our method. Synthesised images for previous methods are taken from [24].
5. Conclusion
In this paper, we proposed a novel end-to-end deep network for manipulating facial expressions. The proposed method incorporates depth information through an additional network that guides the training process via a depth consistency loss. To train the proposed pipeline of networks, we constructed a dataset consisting of RGB images and their corresponding depth maps and surface normal maps using an off-the-shelf 3D reconstruction model. To improve generalisation and lower the bias of the depth parameters, a confidence regulariser is applied on the discriminator side of the GAN frameworks. We evaluated our method on two challenging benchmarks, AffectNet and RaFD, and showed that it synthesises samples that are both qualitatively and quantitatively superior to those generated by recent methods.
Acknowledgements. R. Bodur is funded by the Turkish Ministry of National Education. This work is partly supported by EPSRC Programme Grant FACER2VM (EP/N007743/1).
References

[1] Binod Bhattarai and Tae-Kyun Kim. Inducing optimal attribute representations for conditional GANs. In ECCV, 2020.
[2] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, 2019.
[3] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[4] Xing Di, Vishwanath A. Sindagi, and Vishal M. Patel. GP-GAN: Gender preserving GAN for synthesizing faces from landmarks. In ICPR, 2018.
[5] E. Friesen and Paul Ekman. Facial action coding system: A technique for the measurement of facial movement. Consulting Psychologists Press, 1978.
[6] Baris Gecer, Binod Bhattarai, Josef Kittler, and Tae-Kyun Kim. Semi-supervised adversarial learning to generate photorealistic face images of new identities from 3D morphable model. In ECCV, 2018.
[7] Rick Groenendijk, Sezer Karaoglu, Theo Gevers, and Thomas Mensink. On the benefit of adversarial training for monocular depth estimation. CVIU, 2020.
[8] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 2017.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS, 2015.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[13] Jean Kossaifi, Linh Tran, Yannis Panagakis, and Maja Pantic. GAGAN: Geometry-aware generative adversarial networks. In CVPR, 2018.
[14] Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel Wigboldus, Skyler Hawk, and Ad Knippenberg. Presentation and validation of the Radboud Faces Database. Cognition & Emotion, 2010.
[15] Mu Li, Wangmeng Zuo, and David Zhang. Deep identity-aware transfer of facial attributes. CoRR, 2016.
[16] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. STGAN: A unified selective transfer network for arbitrary image attribute editing. In CVPR, 2019.
[17] Rui Min, Neslihan Kose, and Jean-Luc Dugelay. KinectFaceDB: A Kinect database for face recognition. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2014.
[18] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10:18–31, 2019.
[19] Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. paGAN: Real-time avatars using dynamic textures. In SIGGRAPH Asia 2018 Technical Papers, page 258. ACM, 2018.
[20] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC, 2015.
[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS, 2017.
[22] Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[23] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
[24] Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In ECCV, 2018.
[25] Fengchun Qiao, Naiming Yao, Zirui Jiao, Zhihao Li, Hui Chen, and Harrison Wang. Geometry-contrastive GAN for facial expression transfer. CoRR, 2018.
[26] Enrique Sanchez and Michel Valstar. Triple consistency loss for pairing distributions in GAN-based face synthesis. arXiv preprint arXiv:1811.03492, 2018.
[27] Klaus R. Scherer. Emotion as a process: Function, origin and regulation, 1982.
[28] Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
[29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, and Tieniu Tan. Geometry guided adversarial facial expression synthesis. In ACM Multimedia, 2018.
[31] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard Medioni. Extreme 3D face reconstruction: Seeing through occlusions. In CVPR, 2018.
[32] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 2004.
[33] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. ELEGANT: Exchanging latent encodings with GAN for transferring multiple face attributes. In ECCV, 2018.
[34] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J. Rosato. A 3D facial expression database for facial behavior research. In FG, 2006.
[35] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

6. Appendix
In this supplementary material, we present additional results and provide further discussion of our method. First, we give more information about our process of constructing the RGB-Depth pair datasets, AffectNet-D and RaFD-D. Then, we present plots that illustrate the learning behaviour of our method. In Section 6.3, we provide further quantitative results. Finally, we visualise additional qualitative results on AffectNet with a comparison of images generated by our method and StarGAN.
6.1. Dataset Construction

As mentioned in the main paper, there is no large-scale dataset with RGB-Depth pairs for expression classification. Hence, we propose to augment the existing expression-annotated datasets, AffectNet and RaFD, with depth information. To this end, we use an existing state-of-the-art method to reconstruct the 3D models of faces. We carefully investigated the quality of the reconstructed 3D models and discarded the ones which were not fitted well. From these 3D models, we computed the corresponding depth maps and surface normal maps. Figure 7 shows the pipeline used to extract these depth and normal maps. Please see Table 4 for the statistics of RGB image and 3D mesh pairs for both constructed datasets, AffectNet-D and RaFD-D.
6.2. Learning Curves

Figure 9 shows the learning curves of the proposed method. From these plots, we can observe that even after introducing the depth adversarial and depth classification losses, the learning curves are stable and match the trends of existing standard adversarial learning frameworks. Our method has a lower reconstruction error than the compared baseline, StarGAN. This validates that our method disentangles the expressions in a better form and is also capable of reconstructing the images with better quality, and further supports that our method is superior to the compared baseline in various image quality metrics such as SSIM, PSNR and FID (please see the main paper). Similarly, the classification loss for synthetic data is, in general, lower than that of the baseline. This shows that data generated by our method is classified as the target class more confidently, an observation that parallels the results we obtained when applying an independent classifier to the synthetic images (please see the experiments section of the main paper).
6.3. Quantitative Results

As described in the main paper, we report the expression generation rate in our experiments, which is calculated by applying a classifier, independent of all models, on the synthetic test sets. Figure 8 shows a comparison of the confusion matrices of StarGAN and the proposed method with different weights for the depth network and with the confidence penalty.
6.4. Qualitative Results

Figure 10 shows a comparison of samples generated by StarGAN and our method. We can observe that, in general, our method outperforms StarGAN.

Figure 7.
Pipeline to extract depth maps from RGB images.
Dataset     | Anger  | Contempt | Disgust | Fear  | Happy  | Neutral | Sadness | Surprise | Total
AffectNet-D | 15,000 | 3,703    | 3,726   | 6,073 | 15,000 | 15,000  | 15,000  | 13,604   | 87,106
RaFD-D      | 564    | 580      | 576     | 548   | 557    | 585     | 569     | 515      | 4,494
Table 4.
RGB-Depth pairs statistics on AffectNet-D and RaFD-D.
Figure 8.
Confusion Matrices.
The confusion matrices show the performance of StarGAN and our method on AffectNet with different hyper-parameters. The first confusion matrix is for StarGAN, the second and third show the performance of our method with two different depth network weights, and the last one is obtained by our method with the confidence penalty added to the depth network. (Zoom in to view.)

Figure 9. Learning Curves of Our Method and the Baseline.
The learning curves on the left and in the middle compare StarGAN and our method on the reconstruction and expression classification losses of the generator, respectively. The adversarial loss throughout training is shown in the graph on the right-hand side.
Figure 10. Additional Qualitative Results on AffectNet. Images synthesised by StarGAN and our method from input images, for the target expressions anger, fear, happy, neutral, sad and surprise.