SliderGAN: Synthesizing Expressive Face Images by Sliding 3D Blendshape Parameters
Evangelos Ververas ⋆ · Stefanos Zafeiriou †
Abstract
Image-to-image (i2i) translation is the dense regression problem of learning how to transform an input image into an output image using aligned image pairs. Remarkable progress has been made in i2i translation with the advent of Deep Convolutional Neural Networks (DCNNs) and, in particular, with the learning paradigm of Generative Adversarial Networks (GANs). In the absence of paired images, i2i translation is tackled with one or multiple domain transformations (i.e., CycleGAN, StarGAN etc.). In this paper, we study a new problem, that of image-to-image translation under a set of continuous parameters that correspond to a model describing a physical process. In particular, we propose SliderGAN, which transforms an input face image into a new one according to the continuous values of a statistical blendshape model of facial motion. We show that it is possible to edit a facial image according to expression and speech blendshapes, using sliders that control the continuous values of the blendshape model. This provides much more flexibility in various tasks, including but not limited to face editing, expression transfer and face neutralisation, compared to models based on discrete expressions or action units.
Keywords
GAN, image translation, facial expression synthesis, speech synthesis, blendshape models, action units, 3DMM fitting, relativistic discriminator, EmotioNet, 4DFAB, LRW

⋆ [email protected] · † [email protected]
⋆,† Imperial College London, Queens Gate, London SW7 2AZ, UK
† Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
1 Introduction

Interactive editing of the expression of a face in an image has countless applications, including but not limited to movie post-production, computational photography, face recognition (e.g., expression neutralisation) etc. In computer graphics, facial motion editing is a popular field, which nevertheless mainly revolves around constructing person-specific models from many training samples [27]. Recently, the advent of machine learning, and especially of Deep Convolutional Neural Networks (DCNNs), has provided very exciting tools that have made the community re-think the problem. In particular, recent advances in Generative Adversarial Networks (GANs) provide very exciting solutions for image-to-image (i2i) translation.

i2i translation, i.e. the problem of learning how to transform aligned image pairs, has attracted a lot of attention during the last few years [18,35,12]. The so-called pix2pix model and its alternatives demonstrated excellent results in image completion etc. [18]. In order to perform i2i translation in the absence of image pairs, the so-called CycleGAN was proposed, which introduced a cycle-consistency loss [35]. CycleGAN could perform i2i translation between two domains only (i.e., in the presence of two discrete labels). The more recent StarGAN [12] extended this idea further to accommodate multiple domains (i.e., multiple discrete labels).

StarGAN can be used to transfer an expression to a given facial image by providing the discrete label of the target expression. Hence, it has quite limited capabilities in expression editing and arbitrary expression transfer. In the past year, quite a few deep learning methodologies have been proposed for transforming facial images [12,33,25]. The work most closely related to ours is the recent work [25] that proposed the GANimation model.
Fig. 1
Expressive faces generated by sliding a single or multiple blendshape parameters in the normalized range [−1, 1] of the blendshape model.

GANimation follows the same line of research as StarGAN, translating facial images according to the activation of certain facial Action Units (AUs) and their intensities. (AU coding is a system that taxonomizes the motion of the human facial muscles [15].) Even though AU coding is a quite comprehensive model for describing facial motion, detecting AUs is currently an open problem, both in controlled as well as in unconstrained recording conditions [7,6]. In particular, in unconstrained conditions the detection accuracy for certain AUs is not high enough yet [7,6], which affects the generation accuracy of GANimation. (The state-of-the-art AU detection techniques achieve around 50% F1 in the EmotioNet challenge and, in our experiments, OpenFace [2] achieves lower than 20-25%. The accuracy of the GANimation model is highly related to both AU detection and the estimation of AU intensities, since the generator is jointly trained with, and influenced by, a network that performs detection and intensity estimation.) One of the reasons for the low accuracy of automatic AU annotation is the lack of annotated data and the high cost of annotation, which has to be performed by highly trained experts. Finally, even though AUs 10-28 model mouth and lip motion, only 10 of them (10, 12, 14, 15, 17, 20, 23, 25, 26, 28) can be automatically recognized, and only with low accuracy; thus, they cannot describe all possible lip motion patterns produced during speech. Hence, the GANimation model cannot be used in a straightforward manner for transferring speech.

In this paper, we are motivated by the recent successes of 3D face reconstruction methodologies on in-the-wild images [26,28,29,9,8], which make use of a statistical model of 3D facial motion by means of a set of linear blendshapes, and we propose a methodology for facial image translation using GANs driven by the continuous parameters of the linear blendshapes. The linear blendshapes can describe both the motion that is produced by expression [11] and/or the motion that is produced by speech [30]. On the contrary, neither discrete emotions nor facial action units can describe the motion produced by speech, or the combination of motion from speech and expression. We demonstrate that it is possible to transform a facial image along the continuous axes of individual expression and speech blendshapes.

Moreover, contrary to StarGAN, which uses discrete labels regarding expression, and GANimation, which utilizes annotations with regards to action units, our methodology does not need any human annotations, as we operate using pseudo-annotations provided by fitting a 3D Morphable Model (3DMM) to images [9] (for expression deformations) or by aligning audio signals [30] (for speech deformations). Building on the automatic annotation process exploited by SliderGAN, a by-product of our training process is a very robust regression DCNN that estimates the blendshape parameters directly from images. This DCNN is extremely useful for expression and/or speech transfer, as it can automatically estimate the blendshape parameters of target images.

i2i translation models have achieved photo-realistic results by utilizing different GAN optimization methods in the literature. pix2pix employed the original GAN optimization technique proposed in [16].
However, the loss function of GAN may lead to the vanishing gradients problem during the learning process. Hence, more effective GAN frameworks emerged and were employed by i2i translation methods. CycleGAN uses LSGAN, which builds upon GAN by adopting a least squares loss function for the discriminator. StarGAN and GANimation use WGAN-GP [17], which enforces a gradient penalty as a measure to regularize the discriminator. WGAN-GP builds upon WGAN [3], which minimizes an approximation of the Wasserstein distance to stabilize the training of GANs.

A recent approach to efficient GAN optimization, which has been used to produce higher quality textures [32], is the Relativistic GAN (RGAN) [19]. RGAN was suggested in order to train the discriminator to simultaneously decrease the probability that real images are real, while increasing the probability that the generated images are real. In our work, we incorporate RGAN in the training process of SliderGAN and demonstrate that it improves the generator, which produces more detailed results in the task of i2i translation for expression and speech synthesis when compared to training with WGAN-GP. In particular, we employ the Relativistic average GAN (RaGAN), which decides whether an image is relatively more realistic than the others on average, rather than whether it is real or fake. More details, as well as the benefits of this mechanism, are presented in Section 3.1.

To summarize, the proposed method includes quite a few novelties. First of all, we showcase that SliderGAN is able to synthesize smooth deformations of expression and speech in images by utilizing 3D blendshape models of expression and speech respectively. Moreover, to the best of our knowledge, this is the first time that a direct comparison of blendshape and AU coding is presented for the task of expression and speech synthesis. In addition, our approach is annotation-free but offers much better accuracy than AU-based methods. Furthermore, this is the first time that the Relativistic GAN is employed for the task of expression and speech synthesis. We demonstrate in our results that SliderGAN trained with the RaGAN framework (SliderGAN-RaD) benefits towards producing more detailed textures than when trained with the standard WGAN-GP framework (SliderGAN-WGP). Finally, we enhance the training of our model with synthesized data, leveraging the reconstruction capabilities of statistical shape models.

2 3D Blendshape Models of Facial Motion

The blendshape models exploited in this work are learned from a matrix of 3D displacements D = [d_1, ..., d_m] ∈ ℝ^{n×m}, which includes m difference vectors d_i ∈ ℝ^n, produced by subtracting each expressive mesh from the neutral mesh of its corresponding sequence. The sparse blendshape components C ∈ ℝ^{h×m} were then recovered by the following minimization problem:

    argmin_{B,C} ‖D − BC‖²_F + Ω(C)   s.t.   V(B),   (1)

where the constraint V can either be max(|B_k|) = 1, ∀k, or max(B_k) = 1, B ≥ 0, ∀k, with B_k ∈ ℝ^{n×1} denoting the k-th component of the sparse weight matrix B = [B_1, ..., B_h]. According to [23], the selection of the constraints mainly controls whether face deformations will take place towards both the negative and the positive directions of the axes of the model's parameters or not, which is useful for describing shapes like muscle bulges. The regularization Ω(C) of the sparse components C was performed with an ℓ1/ℓ2 norm. To solve for C and B, an iterative alternating optimization was employed. The exact same approach was employed by [11] in the construction of the 4DFAB blendshape model exploited in this work.
The 5 most significant deformation components of the 4DFAB expression model are depicted in Fig. 2.

Fig. 2
Visualization of the 5 most significant components of the blendshape model S_exp. The 3D faces of this figure have been generated by adding the multiplied components to a mean face.

Fig. 3
Examples of the 3D representation of the expression of an image by the model S_exp. The 3D faces of this figure have been generated by 3DMM fitting on the corresponding images.

3DMM fitting optimizes the shape, texture and camera models in order to render a 2D instance as close as possible to the input image. To extract the expression parameters from an image we employ 3DMM fitting, and particularly the approach proposed in [9]. In our pipeline we employ the identity variation of LSFM [10], which was learned from 10,000 face scans of unique identity, as the shape model to be optimized. To incorporate expression variation in the shape model, we combine LSFM with the 4DFAB blendshape model [11], which was learned from 10,000 face scans of spontaneous and posed expression. The complete shape model can then be expressed as:

    S(p_id, p_exp) = s̄ + U_{s,id} p_id + U_{s,exp} p_exp = s̄ + [U_{s,id}, U_{s,exp}] [p_id^⊤, p_exp^⊤]^⊤,   (2)

where s̄ is the mean component of the 3D shape, U_{s,id} and U_{s,exp} are the identity and expression subspaces of LSFM and 4DFAB respectively, and p_id and p_exp are the identity and expression parameters which determine 3D shape instances.

Therefore, by fitting the 3DMM of [9] to an input image I, we can extract identity and expression parameters p_id and p_exp that instantiate the recovered 3D face mesh S(p_id, p_exp). Based on the independent shape parameters for identity and expression, we exploit the parameters p_exp to compose an annotated dataset of images and their corresponding vectors of expression parameters {I^i, p_exp^i}_{i=1}^K, with no manual annotation cost.

3 Proposed Methodology

In this section we develop the proposed methodology for continuous facial expression editing based on sliding the parameters of a 3D blendshape model.

3.1 Slider-based Generative Adversarial Network for continuous facial expression and speech editing
Problem Definition
Let us first formulate the problem under analysis and then describe our proposed approach to address it. We define an input image I_org ∈ ℝ^{H×W×3} which depicts a human face of arbitrary expression. We further assume that any facial deformation or grimace evident in image I_org can be encoded by a parameter vector p_org = [p_{org,1}, p_{org,2}, ..., p_{org,N}]^⊤ of N continuous scalar values p_{org,i}, normalized in the range [−1, 1]. Vector p_org constitutes the parameters of a linear 3D blendshape model S_exp that, as in Fig. 3, instantiate the 3D representation of the facial deformation of image I_org, which is given by the expression:

    S_exp(p_org) = s̄ + U_exp p_org,   (3)

where s̄ is a mean 3D face component and U_exp the expression eigenbasis of the 3D blendshape model.

Our goal is to develop a generative model which, given an input image I_org and a target expression parameter vector p_trg, will be able to generate a new version I_gen of the input image with simulated expression given by the 3D expression instance S_exp(p_trg).
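To make the slider parameterization concrete, the following minimal sketch instantiates a 3D mesh from a parameter vector as in Eq. (3); the basis is a random stand-in and all dimensions are illustrative assumptions, not the actual 4DFAB model.

```python
import numpy as np

n = 3 * 5023          # hypothetical vectorized mesh size (x, y, z per vertex)
N = 30                # number of expression blendshape parameters

s_mean = np.zeros(n)               # mean 3D face, vectorized
U_exp = np.random.randn(n, N)      # stand-in for the expression eigenbasis

def instantiate_expression(p):
    """Eq. (3): S_exp(p) = s_mean + U_exp @ p, with p normalized in [-1, 1]."""
    p = np.clip(np.asarray(p, dtype=float), -1.0, 1.0)   # sliders live in [-1, 1]
    return s_mean + U_exp @ p

# Slide only the first blendshape to 0.8, keeping all others neutral.
mesh = instantiate_expression([0.8] + [0.0] * (N - 1))
```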
Attention-Based Generator

To address the challenging problem described above, we propose to employ a Generative Adversarial Network architecture in order to train a generator network G that performs translation of an input image I_org, conditioned on a vector of 3D blendshape parameters p_trg; thus, learning the generator mapping G(I_org | p_trg) → I_gen. In addition, to better preserve the content and the colour of the original images, we employ an attention mechanism at the output of the generator as in [1,25]. That is, we employ a generator with two parallel output layers, one producing a smooth deformation mask G_m ∈ ℝ^{H×W} and the other a deformation image G_i ∈ ℝ^{H×W×3}. The values of G_m are restricted in the region [0, 1] by enforcing a sigmoid activation.
Then, G_m and G_i are combined with the original image I_org to produce the target expression I_gen as:

    I_gen = G_m G_i + (1 − G_m) I_org.   (4)
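A minimal PyTorch sketch of the two parallel output layers and the composition of Eq. (4) follows; the feature backbone, channel and kernel sizes are illustrative assumptions rather than the exact SliderGAN architecture.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Two parallel output layers producing a mask G_m and an image G_i."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.mask_layer = nn.Sequential(nn.Conv2d(feat_ch, 1, 7, padding=3),
                                        nn.Sigmoid())   # G_m restricted to [0, 1]
        self.image_layer = nn.Sequential(nn.Conv2d(feat_ch, 3, 7, padding=3),
                                         nn.Tanh())     # G_i, deformation image

    def forward(self, feats, I_org):
        G_m = self.mask_layer(feats)                    # (B, 1, H, W)
        G_i = self.image_layer(feats)                   # (B, 3, H, W)
        return G_m * G_i + (1.0 - G_m) * I_org          # Eq. (4), broadcast over RGB
```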
Relativistic Discriminator

We employ a discriminator network D that forces the generator G to produce realistic images of the desired deformation. Different from the standard discriminator in GANimation, which estimates the probability of an image being real, we employ the Relativistic Discriminator [19], which estimates the probability of an image being relatively more realistic than a generated one. That is, if D_img = σ(C(I_org)) is the activation of the standard discriminator, then D_RaD,img = σ(C(I_org) − C(I_gen)) is the activation of the Relativistic Discriminator. Particularly, we employ the Relativistic average Discriminator (RaD), which accounts for all the real and generated data in a mini-batch. The activation of the RaD is then:

    D_RaD,img(I) = σ( C(I) − E_{I_gen}[C(I_gen)] ),   if I is a real image,
    D_RaD,img(I) = σ( C(I) − E_{I_org}[C(I_org)] ),   if I is a generated image,   (5)

where E_{I_org} and E_{I_gen} denote the average activations of all real and generated images in a mini-batch, respectively.

We further extend D by adding a regression layer, parallel to D_img, that estimates a parameter vector p_est, to encourage the generator to produce accurate facial expressions: D(I) → D_p(I) = p_est. Finally, we aim to boost the ability of G to maintain face identity between the original and the generated images by incorporating a face recognition module F.
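The sketch below computes the RaD activations of Eq. (5) before the sigmoid; C(·) is the non-transformed critic output on a mini-batch, and the function name is ours, not from the paper.

```python
import torch

def rad_logits(C_real, C_fake):
    """Relativistic average logits of Eq. (5), before the sigmoid."""
    d_real = C_real - C_fake.mean()   # real image vs. the average generated image
    d_fake = C_fake - C_real.mean()   # generated image vs. the average real image
    return d_real, d_fake             # feed to sigmoid / BCE-with-logits as needed
```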
Semi-supervised training

We train our model in a semi-supervised manner, with both data with no image pairs of the same person under different expressions, {I_org^i, p_org^i, p_trg^i}_{i=1}^K, and data with image pairs that we automatically generate as described in detail in Section 4.1, {I_org^i, p_org^i, I_trg^i, p_trg^i}_{i=1}^L. The modules of our model, as well as the training process of SliderGAN, are presented in Fig. 4.

Adversarial Loss
To improve the photorealism of our synthesized images, we utilize the Wasserstein GAN adversarial objective with gradient penalty (WGAN-GP) [17]. The selected WGAN-GP adversarial objective with RaD is defined as:

    L_adv = E_{I_org}[ D_RaD,img(I_org) ] − E_{I_org, p_trg}[ D_RaD,img(G(I_org, p_trg)) ] − λ_gp E_{I_gen}[ (‖∇_{I_gen} D_img(I_gen)‖_2 − 1)² ].   (6)

Different from the standard discriminator, both real and generated images are included in the generator part of the objective of Eq. (6). This allows the generator to benefit from the gradients of both real and fake images, which, as we show in the experimental section, leads to generated images with sharper edges and more details, which also better represent the distribution of the real data.

Based on the original GAN rationale [16] and the Relativistic GAN [19], our generator G and discriminator D are involved in a min-max game, where G tries to maximize the objective of Eq. (6) by generating realistic images to fool the discriminator, while D tries to minimize it by correctly classifying real images as more realistic than fake, and generated images as less realistic than real.
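The following sketch is one plausible wiring of Eq. (6) and Eq. (14), combining the relativistic average differences with a WGAN-GP gradient penalty; `critic` maps an image batch to unnormalized scores and all names are assumptions.

```python
import torch

def gradient_penalty(critic, I_real, I_fake):
    # WGAN-GP term: push the critic's gradient norm towards 1 on interpolates.
    alpha = torch.rand(I_real.size(0), 1, 1, 1, device=I_real.device)
    I_hat = (alpha * I_real + (1 - alpha) * I_fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(I_hat).sum(), I_hat, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, I_real, I_fake, lambda_gp=10.0):
    # L_D = -L_adv + lambda_gp * GP (cf. Eq. (14)); fakes detached for the D step.
    c_real, c_fake = critic(I_real), critic(I_fake.detach())
    L_adv = (c_real - c_fake.mean()).mean() - (c_fake - c_real.mean()).mean()
    return -L_adv + lambda_gp * gradient_penalty(critic, I_real, I_fake.detach())

def generator_adv_loss(critic, I_real, I_fake):
    # Both real and generated scores enter the generator objective under RaD.
    c_real, c_fake = critic(I_real), critic(I_fake)
    return (c_real - c_fake.mean()).mean() - (c_fake - c_real.mean()).mean()
```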
Expression Loss

To make G consistent in accurately transferring target deformations S_exp(p_trg) to the generated images, we consider the discriminator D to have the role of an inspector. To this end, we back-propagate a mean squared loss between the estimated vector p_est of the regression layer of D and the actual vector of expression parameters of an image.

We apply the expression loss both for original images and for generated ones. Similarly to the classification loss of StarGAN [12], we construct separate losses for the two cases. For real images I_org we define the loss:

    L_exp,D = (1/N) ‖D_p(I_org) − p_org‖²_2,   (7)

between the estimated and real expression parameters of I_org, while for the generated images we define the loss:

    L_exp,G = (1/N) ‖D_p(G(I_org, p_trg)) − p_trg‖²_2,   (8)

between the estimated and target expression parameters of I_gen = G(I_org, p_trg). Consequently, D minimizes L_exp,D to accurately regress the expression parameters of real images, while G minimizes L_exp,G to generate images with accurate expression according to D.

Image Reconstruction Loss
The adversarial and expression losses of Eq. (6) and Eqs. (7), (8) respectively would be enough to generate random realistic expressive images, which however would not preserve the content of the input image I_org. To overcome this limitation we adopt a cycle consistency loss [35] for our generator G:

    L_rec = (1/(W×H)) ‖I_org − I_rec‖_1,   (9)

over the vectorized forms of the original image I_org and the reconstructed image I_rec = G(G(I_org, p_trg), p_org).

Fig. 4
Synopsis of the modules, losses and the training process of SliderGAN. An attention-based generator G is trained to generate realistic expressive faces from continuous parameters by employing a set of adversarial, generation, reconstruction, identity and attention losses. The performance of our model is significantly boosted by employing synthetic image pairs through the L_gen loss. Moreover, a relativistic discriminator D is trained to classify images as relatively more real or fake, as well as to regress the expression parameters of the input images, in order to increase the generation quality of G.

Note that we obtain image I_rec by using the generator twice, first to generate image I_gen = G(I_org, p_trg) and then to get the reconstructed I_rec = G(I_gen, p_org), conditioning I_gen on the parameters p_org of the original image.

Image Generation Loss
To further boost our generator towards accurately transferring the expression from a vector of parameters to the edited image, we introduce image pairs of the form {I_org^i, p_org^i, I_trg^i, p_trg^i}_{i=1}^L that we automatically generate from neutral images, as described in detail in Section 4.1. We exploit the synthetic pairs of images of the same individuals under different expressions by introducing an image generation loss:

    L_gen = (1/(W×H)) ‖I_trg − I_gen‖_1,   (10)

where I_trg and I_gen are images with either neutral or synthetic expression of the same individual. Here, we calculate the loss between the synthetic target image I_trg and the image generated by G, I_gen, aiming to boost our generator to accurately transfer the 3D expression S_exp(p_trg) to the edited image.

Identity Loss
The image reconstruction loss of Eq. (9) helps maintain the surroundings between the original and generated images. However, the faces' identity is not always maintained by this loss, as also shown by our ablation study in Section 4.7. To alleviate this issue, we introduce a face recognition loss adopted from ArcFace [14], which models face recognition confidence by an angular distance loss. Particularly, we introduce the loss:

    L_id = 1 − cos(e_gen, e_org) = 1 − (e_gen^⊤ e_org) / (‖e_gen‖ ‖e_org‖),   (11)

where e_gen = F(I_gen) and e_org = F(I_org) are embeddings of I_gen and I_org respectively, extracted by the face recognition module F. According to ArcFace, face verification confidence is higher as the cosine similarity cos(e_gen, e_org) grows. During training, G is optimized to maintain face identity between I_gen and I_org by minimizing Eq. (11).
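A short sketch of Eq. (11); `embed_fn` is a placeholder for any fixed face-recognition embedder (ArcFace in the paper).

```python
import torch
import torch.nn.functional as nnf

def identity_loss(embed_fn, I_gen, I_org):
    e_gen, e_org = embed_fn(I_gen), embed_fn(I_org)       # (B, d) face embeddings
    cos = nnf.cosine_similarity(e_gen, e_org, dim=1)      # cos(e_gen, e_org)
    return (1.0 - cos).mean()                             # Eq. (11), batch-averaged
```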
Attention Mask Loss

To encourage the generator to produce sparse attention masks G_m that focus on the deformation regions and do not saturate to 1, we employ a sparsity loss L_att. That is, we calculate and minimize the norm of the masks produced for both the generation and the reconstruction steps:

    L_att = (1/(W×H)) ( ‖G_m(I_org, p_trg)‖ + ‖G_m(I_gen, p_org)‖ ).   (12)

Total Training Loss
We combine the loss functions of Eq. (6) - Eq. (12) to form the loss functions L_G and L_D for separately training the generator G and the discriminator D of our model. We formulate the loss functions as:

    L_G = L_adv + λ_exp L_exp,G + λ_rec L_rec + λ_id L_id + λ_att L_att,
          for unpaired data {I_org^i, p_org^i, p_trg^i}_{i=1}^K;
    L_G = L_adv + λ_exp L_exp,G + λ_rec L_rec + λ_gen L_gen + λ_id L_id + λ_att L_att,
          for paired data {I_org^i, p_org^i, I_trg^i, p_trg^i}_{i=1}^L,   (13)

    L_D = −L_adv + λ_exp L_exp,D,   (14)

where λ_exp, λ_rec, λ_gen, λ_id and λ_att are parameters that regularize the importance of each term in the total loss function. We discuss the choice of those parameters in Section 3.2.

As can be noticed in Eq. (13), we employ different loss functions L_G depending on whether the training data are the real data with no image pairs or the synthetic data which include pairs. The only difference is that in the case of paired data we use the additional supervised loss term L_gen.
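As a sketch, assembling Eq. (13) reduces to a weighted sum with one extra term for the synthetic pairs; the loss terms are assumed to be precomputed scalars and the weights anticipate the values reported in Section 3.2.

```python
def generator_loss(terms, lambdas, paired):
    """terms: dict of precomputed loss tensors; lambdas: weights of Eq. (13)."""
    L = (lambdas["adv"] * terms["adv"]        # lambda_adv folded in, as in Sec. 3.2
         + lambdas["exp"] * terms["exp_G"]
         + lambdas["rec"] * terms["rec"]
         + lambdas["id"] * terms["id"]
         + lambdas["att"] * terms["att"])
    if paired:                                # synthetic pairs add the supervised L_gen
        L = L + lambdas["gen"] * terms["gen"]
    return L

lambdas = {"adv": 30, "exp": 1000, "rec": 10, "gen": 10, "id": 4, "att": 0.3}
```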
3.2 Implementation and training details

Having presented the architecture of our model, here we report further implementation and training details. For the generator module G of SliderGAN we adopted the architecture of CycleGAN [35], as it has proven to generate remarkable results in image-to-image translation problems, as for example in StarGAN [12]. We extended the generator by adding a parallel output layer to accommodate the attention mask mechanism. Moreover, for D we adopted the architecture of PatchGAN [18], which produces probability distributions of the multiple image patches being real or generated, D(I) → D_img. As described in Section 3.1, we extended this discriminator architecture by adding a parallel regression layer to estimate continuous expression parameters.

We trained our model with images of size 128 × 128, using the Adam optimizer [20], and set the loss weights to λ_adv = 30, λ_exp = 1000, λ_rec = 10, λ_gen = 10, λ_id = 4 and λ_att = 0.3. Larger values for λ_id significantly restrict G, driving it to generate images very close to the original ones, with no change in expression. Also, lower values for λ_att lead to mask saturation.

4 Experiments

In this section we present a series of experiments that we conducted in order to evaluate the performance of SliderGAN. First, we describe the datasets we utilized to train and test our model (Section 4.1). Then, we test the ability of SliderGAN to manipulate the expression in images by adjusting a single or multiple parameters of a 3D blendshape model (Section 4.2). Moreover, we present our results in direct expression transfer between an input and a target image (Section 4.3) and in discrete expression synthesis (Section 4.4). We examine the ability of SliderGAN to handle face deformations due to speech (Section 4.5) and test the regression accuracy of our model's discriminator (Section 4.6). We close the experimental section of our work by presenting an ablation study on the contribution of the different loss functions of our technique (Section 4.7).

4.1 Datasets
EmotioNet
For the training and validation phases of our algorithm we utilized a subset of 250,000 images of the EmotioNet database [5], which contains over one million images of facial expressions in the wild.
Fig. 5
Synthetic expressive faces, generated by fitting a 3DMM on the original images and rendering back with a randomly sampled expression. The images with a red frame are the original images.
3D Warped Images
One crucial problem of training with pseudo-annotations extracted by 3DMM fitting on images is that the parameter values are not always consistent, as small variations in expression can be mistakenly explained by the identity, texture or camera model of the 3DMM. To overcome this limitation, we augment the training dataset with expressive images that we render ourselves and for which, therefore, we know the exact blendshape parameter values. In more detail, we fit the same 3DMM to 10,000 images of EmotioNet in order to recover the identity and camera models for each image. A 3D texture can also be sampled by projecting the recovered mesh on the original image. Then, we combined the identity meshes with randomly generated expressions from the 4DFAB expression model and rendered them back on the original images. Rendering 20 different expressions from each image, we augmented the dataset by 200,000 accurately annotated images. Some of the generated images are displayed in Fig. 5.
4DFAB Rendered Images

A common problem in developing generative models of facial expression is the difficulty in accurately measuring the quality of the generated images. This is mainly due to the lack of databases with images of people of the same identity under arbitrary expressions. To overcome this issue and quantitatively measure the quality of images generated by SliderGAN, as well as compare with the baseline, we created a database with rendered images from 3D meshes and textures of 4DFAB. In more detail, we rendered 100 to 500 images with arbitrary expression from each of the 180 identities and for each of the 4 sessions of 4DFAB, thus rendering 300,000 images in total. To obtain expression parameters for each rendered image, we projected the blendshape model S_exp on each corresponding 3D mesh S, such that the obtained parameters are p = U_exp^⊤ (S − s̄).
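A one-line sketch of this projection; it assumes the PCA basis U_exp has orthonormal columns, so the least-squares fit reduces to a transpose.

```python
import numpy as np

def project_to_blendshapes(S, s_mean, U_exp):
    """p = U_exp^T (S - s_mean); S is a vectorized 3D mesh, U_exp an (n, N) basis."""
    return U_exp.T @ (S - s_mean)
```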
Lip Reading Words in 3D (LRW-3D)

The Lip Reading in the Wild (LRW) dataset [13] consists of videos of hundreds of speakers, including up to 1000 utterances of 500 different words. LRW-3D [30] provides speech blendshape parameters for the frames of LRW, which were recovered by mapping each frame of LRW that corresponds to one of the 500 words to instances of a 3D blendshape model of speech, by aligning the audio segments of the LRW videos with those of a 4D speech database. Moreover, to extract expression parameters for each word segment of the videos, we applied the 3DMM video fitting algorithm of [9], which accounts for the temporal dependency between frames. In Section 4.5, we utilize the annotations of LRW-3D, as well as the expression parameters, to perform expression and speech transfer.

4.2 3D Model-based Expression Editing
Sliding single expression parameters
In this experiment we demonstrate the capability of SliderGAN to edit the facial expression of images when single expression parameters are slid within the normalized range [−1, 1]. In Fig. 6 we provide results for 11 levels of activation of single parameters of the model (−1, −0.8, −0.6, −0.4, −0.2, 0, 0.2, 0.4, 0.6, 0.8, 1), while the rest of the parameters remain zero. As can be observed in Fig. 6, SliderGAN successfully learns to reproduce the behaviour of each blendshape separately, producing realistic facial expressions while maintaining the identity of the input image. Also, the transition between the generated expressions is smooth for successive values of the same parameter, and the intensity of the expressions depends on the magnitude of the parameter value. Note that when the zero vector is applied, SliderGAN produces the neutral expression, whatever the expression of the original image.
Fig. 6
Expressive faces generated by sliding single blendshape (b/s) parameters in the range [−1, 1].

Sliding multiple expression parameters
The main feature of SliderGAN is its ability to edit facial expressions in images by sliding multiple parameters of the model, similarly to sliding the parameters of a blendshape model to generate new expressions of a 3D face mesh. To test this characteristic of our model, we synthesize random expressions by conditioning the generator input on parameter vectors with elements randomly drawn from the standard normal distribution. Note that the model was trained with expression parameters normalized by the square roots of the eigenvalues e_i, i = 1, ..., N of the PCA blendshape model. This means that all combinations of expression parameters within the range [−1, 1] correspond to feasible facial expressions.

As illustrated by Fig. 7, SliderGAN is able to synthesize face images with a great variability of expressions, while maintaining identity. The generated expressions accurately resemble the 3D meshes' expressions when the same vector of parameters is used for the blendshape model. This fact makes our model ideal for facial expression editing in images. A target expression can first be chosen by utilizing the ease of perception of the 3D visualization of a 3D blendshape model, and then the target parameters can be employed by the generator to edit a face image accordingly.

4.3 Expression Transfer and Interpolation

A by-product of SliderGAN is that the discriminator D learns to map images to expression parameters D_p that represent their 3D expression through S_exp(D_p). We capitalize on this fact to perform direct expression transfer and interpolation between images, without any annotations about expression. Assuming a source image I_src with expression parameters p_src = D_p(I_src) and a target image I_trg with expression parameters p_trg = D_p(I_trg), we are able to transfer expression p_trg to image I_src by utilising the generator of SliderGAN, such that I_src→trg = G(I_src | p_trg). Note that no 3DMM fitting or manual annotation is required to extract the expression parameters and transfer the expression, as this is performed by the trained discriminator.

Additionally, by interpolating the expression parameters of the source and target images, we are able to generate expressive faces that demonstrate a smooth transition from expression p_src to expression p_trg. Interpolation of the expression parameters can be performed by sliding an interpolation factor a within the region [0, 1], such that the requested parameters are p_interp = a p_src + (1 − a) p_trg.
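A minimal sketch of both operations, where `G` and `D_p` stand for the trained generator and the discriminator's regression head:

```python
import numpy as np

def transfer_expression(G, D_p, I_src, I_trg):
    p_trg = D_p(I_trg)                    # expression regressed from the target image
    return G(I_src, p_trg)                # I_src -> trg, no 3DMM fitting required

def interpolate_expression(G, D_p, I_src, I_trg, steps=5):
    p_src, p_trg = D_p(I_src), D_p(I_trg)
    frames = []
    for a in np.linspace(1.0, 0.0, steps):        # a = 1 keeps the source expression
        p = a * p_src + (1.0 - a) * p_trg         # p_interp of Section 4.3
        frames.append(G(I_src, p))
    return frames
```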
Fig. 7
Expressive faces generated by sliding multiple blendshape (b/s) parameters in the range [−1, 1].

Qualitative Evaluation
Results of performing expression transfer and interpolation on images of the 4DFAB rendered database and EmotioNet are displayed in Fig. 8 and Fig. 9 respectively, where it can be seen that the expressions of the generated images clearly reproduce the target expressions. The smooth transition between expressions p_src and p_trg indicates that SliderGAN successfully learns to map images to expressions across the whole expression parameter space. Also, it is evident that D accurately regresses the blendshape parameters from images I_trg, as can be observed from the recovered 3D faces. The accuracy of the regressed parameters is also examined in Section 4.6.

To further validate the quality of our results, we trained GANimation on the same dataset, with AU annotations extracted with OpenFace [2] as suggested by the authors. We performed expression transfer between images and present results for SliderGAN-RaD, SliderGAN-WGP and GANimation. In Fig. 10, it is obvious that SliderGAN-RaD benefits from the Relativistic GAN training and produces higher quality textures than SliderGAN-WGP, while both SliderGAN implementations simulate the expressions of the target images better than GANimation.

Quantitative Evaluation
In this section we provide a quantitative evaluation of the performance of SliderGAN on arbitrary expression transfer. We employ the 4DFAB rendered images dataset, which allows us to calculate the Image Euclidean Distance [31] between ground truth rendered images of 4DFAB and images generated by SliderGAN. The Image Euclidean Distance (IED) is a robust alternative to the standard pixel loss for image distances, defined between two RGB images x and y, each with M × N pixels, as:

    IED(x, y) = (1/2π) Σ_{i=1}^{MN} Σ_{j=1}^{MN} exp{ −|P_i − P_j|² / 2 } (‖x_i − y_i‖)(‖x_j − y_j‖),   (15)

where P_i and P_j are the pixel locations on the 2D image plane and x_i, y_i, x_j, y_j are the RGB values of images x and y at the vectorized locations i and j.

We trained SliderGAN with the rendered images from 150 identities of 4DFAB, leaving 30 identities for testing. To allow direct comparison between generated and real images, we randomly created 10,000 pairs of images of the same session and identity (this ensures that the images were rendered with the same camera conditions) from the testing set and performed expression transfer within each pair. To compare our model against the baseline model GANimation, we trained and performed the same experiment using GANimation on the same dataset, with AU activations that we obtained with OpenFace. Also, to showcase the benefits of the relativistic discriminator in the image quality of the generated images, we repeated the experiment with SliderGAN-WGP.
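For reference, a direct (unoptimized) sketch of Eq. (15) follows; the dense pairwise weight matrix makes it O((MN)²) in memory, so it is feasible only on tiny images and serves purely to document the metric.

```python
import numpy as np

def image_euclidean_distance(x, y):
    """x, y: (M, N, 3) RGB images; returns the scalar IED of Eq. (15)."""
    M, N = x.shape[:2]
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(N), indexing="ij"), -1)
    P = coords.reshape(-1, 2).astype(float)                      # pixel locations
    diff = np.linalg.norm((x - y).reshape(-1, 3).astype(float), axis=1)
    G = np.exp(-np.sum((P[:, None] - P[None]) ** 2, -1) / 2.0)   # Gaussian weights
    return (1.0 / (2.0 * np.pi)) * diff @ G @ diff
```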
Fig. 8
Expression interpolation between images of 4DFAB. First, we employ D to recover the expression parameters from an input and the target images. Then, we capitalize on these parameter vectors to animate the expression of the input image towards multiple targets.
Fig. 9
Expression interpolation between images of EmotioNet. First, we employ D to recover the expression parameters from an input and the target images. Then, we capitalize on these parameter vectors to animate the expression of the input image towards multiple targets.
Fig. 10
Expression transfer between images of EmotioNet. First, we employ D to recover expression parameters from the target images. Then, we utilize these parameter vectors to transfer the target expressions to the input images. From the results, SliderGAN-RaD produces higher quality textures than either of the other two methods (mostly evident in the mouth and eye regions). Moreover, GANimation reproduces the target expressions with lower accuracy. (Please zoom in the images to notice the differences in texture quality.)

Table 1
Image Euclidean Distance (IED), calculated between ground truth images of 4DFAB and corresponding images generated by GANimation [25], SliderGAN-WGP and SliderGAN-RaD. SliderGAN-RaD produces the lowest IED of the three methods.
Method             IED
GANimation [25]    –
SliderGAN-WGP      –
SliderGAN-RaD      –

The results are presented in Table 1, where it can be seen that SliderGAN-RaD produces images with the lowest IED.

4.4 Synthesis of Discrete Expressions

Specific combinations of the 3D expression model parameters represent the discrete expressions anger, contempt, fear, disgust, happiness, sadness, surprise and neutral. We employ these parameter vectors to synthesize expressive face images of the aforementioned discrete expressions and test our results both qualitatively and quantitatively.

Qualitative Evaluation
To evaluate the performance of SliderGAN in this task, we visually compare our results against the results of five baseline models: DIAT [21], CycleGAN [35], IcGAN [24], StarGAN [12] and GANimation [25]. In Fig. 11 it is evident that SliderGAN generates results that resemble the queried expressions while maintaining the original face's identity and resolution. The results are close to those of GANimation; however, the Relativistic GAN training of SliderGAN allows for slightly higher quality of images.

The neutral expression can also be synthesized by SliderGAN, when all the elements of the target parameter vector are set to 0. In fact, the neutral expression of the 3D blendshape model is also synthesized by the same vector. Results of image neutralization on in-the-wild images of arbitrary expression are presented in Fig. 12, where it can be observed that the neutral expression is generated without significant loss in the faces' identity.
Quantitative Evaluation
We further evaluate the quality of the generated expressions by performing expression recognition with the off-the-shelf recognition system of [22]. In more detail, we randomly selected 10,000 images from the test set of EmotioNet, translated them to each of the discrete expressions anger, disgust, fear, happiness, sadness, surprise and neutral, and passed them to the expression recognition network. For comparison, we repeated the same experiment with SliderGAN-WGP and GANimation using the same image set. In Table 2 we report accuracy scores for each expression class separately, as well as the average accuracy score, for the three methods. The classification results are similar for the three models, with both implementations of SliderGAN producing slightly higher scores, which denotes that GANimation's results include more fail cases.

4.5 Combined Expression and Speech Synthesis and Transfer

Blendshape coding of facial deformations allows modelling arbitrary deformations (e.g. deformations due to identity, speech, non-human face morphing etc.) that are not limited to facial expressions, unlike AU coding, which is a system that taxonomizes the human facial muscles [15]. Even though AUs 10-28 model mouth and lip motion, not all the details of lip motion that take place during speech can be captured by these AUs. Moreover, only 10 (10, 12, 14, 15, 17, 20, 23, 25, 26, 28) out of these 18 AUs can be automatically recognized, which is achieved only with low accuracy. On the contrary, a blendshape model of the 3D motion of the human mouth and lips better captures motion during speech, while it allows the recovery of robust representations from images and videos of human speech.

We capitalize on this fact and employ the mouth and lips blendshape model of [30] to perform speech synthesis from a single image with SliderGAN. Particularly, we employ the LRW-3D database, which contains speech blendshape parameter annotations for the 500 words of LRW [13], to perform combined expression and speech synthesis and transfer, which we evaluate both qualitatively and quantitatively.
Qualitative Evaluation
LRW contains videos with both expression and speech. Thus, to completely capture the smooth face motion across frames, we employed 30 expression parameters recovered by 3DMM fitting and 10 speech parameters of LRW-3D, which correspond to the ten most significant components of the 3D speech model. We trained SliderGAN with 180,000 frames of LRW, without leveraging the temporal characteristics of the database; that is, we shuffled the frames and trained our model with random target vectors, to avoid learning person-specific deformations. Results of performing expression and speech synthesis from a video using a single image are presented in Fig. 13, where the parameters and the input frame belong to the same video (ground truth frames are available), and in Fig. 14, where the parameters and the input frame belong to different videos of LRW.
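A sketch of this re-animation loop, assuming per-frame parameter tracks as described above; `G` is the trained generator and the 30 + 10 split follows the text.

```python
import numpy as np

def animate_from_params(G, I_src, exp_params, speech_params):
    """exp_params: (T, 30) from 3DMM video fitting; speech_params: (T, 10) from LRW-3D."""
    frames = []
    for p_exp, p_sp in zip(exp_params, speech_params):
        p_trg = np.concatenate([p_exp, p_sp])   # 40-dim conditioning vector per frame
        frames.append(G(I_src, p_trg))          # one synthesized frame per target frame
    return frames
```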
Fig. 11
Generation of the 7 discrete expressions: a) anger, b) contempt, c) disgust, d) fear, e) happiness, f) sadness, g) surprise. By comparing SliderGAN against DIAT [21], CycleGAN [35], IcGAN [24], StarGAN [12] and GANimation [25], we observe that our model generates results of high texture quality that resemble the queried expressions. The results of the rest of the methods were taken from [25].
Table 2
Expression recognition results obtained by applying the off-the-shelf expression recognition system of [22] to images generated by GANimation [25], SliderGAN-WGP and SliderGAN-RaD. Accuracy scores from both SliderGAN models outperform those of GANimation, while SliderGAN-RaD achieves the highest accuracy in all expressions.
Method             Anger   Disgust   Fear    Happiness   Sadness   Surprise   Neutral   Average
GANimation [25]    0.552   0.446     0.517   0.658       0.632     0.622      0.631     0.579
SliderGAN-WGP      0.550   0.463     0.514   0.762       0.633     0.678      0.702     0.614
SliderGAN-RaD      –       –         –       –           –         –          –         –
Fig. 12
Neutralization of in-the-wild images of arbitrary expression. The neutralization takes place by setting all blendshape parameter values to zero.

For comparison, we trained GANimation on the same dataset, with AU activations obtained by OpenFace. As can be seen in Fig. 13 and Fig. 14, GANimation is not able to accurately simulate the lip motion of the target video. On the contrary, SliderGAN-WGP simulates mouth and lip motion well, but produces textures that look less realistic. SliderGAN-RaD produces higher quality results that look realistic in terms of both accurate deformation and texture.
Quantitative Evaluation
To measure the performance of our model, we employ the Image Euclidean Distance (IED) [31] to evaluate the results of expression and speech synthesis when the input frame and the target parameters belong to the same video sequence. Due to changes in pose in the target videos, we align all target frames with the corresponding output frames before calculating the IED.
Fig. 13
Combined expression and speech animation from a single input image. We utilize as targets the expression and speech blendshape parameters of consecutive frames of videos of LRW, to synthesize sequences of expression and speech from a single input image.
Table 3
Image Euclidean Distance (IED), calculated between ground truth images of LRW and corresponding images generated by GANimation [25], SliderGAN-WGP and SliderGAN-RaD. SliderGAN-RaD produces the lowest IED of the three methods, which indicates the robustness of the blendshape coding of speech utilized by SliderGAN.
Method             IED
GANimation [25]    –
SliderGAN-WGP      –
SliderGAN-RaD      –

The results are presented in Table 3, where it can be seen that SliderGAN-RaD achieves the lowest error.

4.6 3D Expression Reconstruction

As also described in Section 4.3, a by-product of SliderGAN is the discriminator's ability to map images to expression parameters D_p that reconstruct the 3D expression as S_exp(D_p). We test the accuracy of the regressed parameters on images of EmotioNet in two scenarios: a) we calculate the error between the parameters recovered by 3DMM fitting and those regressed by D on the same image (Table 4, row 1); and b) we test the consistency of our model by calculating the error between target parameters p_trg and those regressed by D on a manipulated image which was translated to expression p_trg by SliderGAN-RaD (Table 4, row 2).

For comparison, we repeated the same experiment with GANimation, for which we calculated the errors in AU activations. For both experiments we employed 10,000 images from our test set. The results demonstrate that the discriminator of SliderGAN-RaD extracts expression parameters from images with high accuracy, compared to 3DMM fitting. On the contrary, GANimation's discriminator is less consistent in recovering AU annotations when compared to those of OpenFace. This also illustrates that the robustness of blendshape coding of expression over AUs makes SliderGAN more suitable than GANimation for direct expression transfer.

4.7 Ablation Study

In this section we investigate the effect of the different losses that constitute the total loss functions L_G and L_D of our algorithm. As discussed in Section 3.1, both training in a semi-supervised manner with the loss L_gen and employing a face recognition loss L_id between the original and the generated images contribute significantly to the training process of the generator G.
Fig. 14
Comparison of combined expression and speech animation from a single input image between GANimation [25], SliderGAN-WGP and SliderGAN-RaD. We utilize as targets the expression and speech blendshape parameters of consecutive frames of a video of LRW. Then we reconstruct the expression and speech from a single input frame of the same video. Both SliderGAN implementations reconstruct face motion more accurately than GANimation. Also, the texture quality of the results is higher in SliderGAN-RaD than in SliderGAN-WGP, as expected. (Please zoom in the images to notice the differences in texture quality.)
Table 4
Expression representation results for SliderGAN-RaD (blendshape parameter coding) and GANimation (AU activation coding). SliderGAN is capable of accurately and robustly recovering expression representations, while GANimation fails to detect AU activations.

    Metric                                                SliderGAN    GANimation [25]
    (1/N) Σ_{i=1}^N ‖p_DMM,i − p_D,i‖ / ‖p_DMM,i‖             –              –
    (1/N) Σ_{i=1}^N ‖p_trg,i − p_D,i‖ / ‖p_trg,i‖             –              –

To explore the extent to which these losses improve or affect the performance of G, we consider three different models trained with variations of the loss function of SliderGAN, namely: a) L_G does not include L_id; b) L_G does not include L_gen; and c) L_G includes neither L_id nor L_gen. Fig. 15 depicts results for the same subject generated by the three models, as well as by SliderGAN. As can be observed, the absence of L_id affects the quality of the generated images the most, as more artifacts are produced. However, L_gen vitally supports L_id in accurately simulating the target expression and producing good quality textures. When both L_id and L_gen are omitted, both the identity preservation and the expression accuracy decrease drastically.

5 Conclusion

In this paper we presented SliderGAN, a new and very flexible way of manipulating the expression in facial images (i.e., expression transfer etc.), driven by a set of statistical blendshapes. To this end, a novel generator based on Deep Convolutional Neural Networks (DCNNs) was proposed, as well as a learning strategy that makes use of adversarial learning. A by-product of the learning process is a very powerful regression network that maps the image to a number of blendshape parameters, which can then be used for conditioning the inputs of the generator.
Fig. 15
Results from the ablation study on SliderGAN's loss function components. It is evident that both losses L_id and L_gen have a significant impact on the training of the model, with L_id being the most important for generating realistic images.

References
1. Alami Mejjati, Y., Richardt, C., Tompkin, J., Cosker, D., Kim, K.I.: Unsupervised attention-guided image-to-image translation. In: Advances in Neural Information Processing Systems 31, pp. 3693-3703 (2018). URL http://papers.nips.cc/paper/7627-unsupervised-attention-guided-image-to-image-translation.pdf
2. Amos, B., Ludwiczuk, B., Satyanarayanan, M.: OpenFace: A general-purpose face recognition library with mobile applications. Tech. rep., CMU-CS-16-118, CMU School of Computer Science (2016)
3. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 214-223 (2017). URL http://proceedings.mlr.press/v70/arjovsky17a.html
4. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4(1), 1-106 (2012). DOI 10.1561/2200000015. URL http://dx.doi.org/10.1561/2200000015
5. Benitez-Quiroz, C.F., Srinivasan, R., Martinez, A.M.: EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5562-5570 (2016). DOI 10.1109/CVPR.2016.600
6. Benitez-Quiroz, C.F., Wang, Y., Martinez, A.M.: Recognition of action units in the wild with deep nets and a new global-local loss. In: ICCV, pp. 3990-3999 (2017)
7. Benitez-Quiroz, F., Srinivasan, R., Martinez, A.M.: Discriminant functional learning of color features for the recognition of facial action units and their intensities. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
8. Booth, J., Antonakos, E., Ploumpis, S., Trigeorgis, G., Panagakis, Y., Zafeiriou, S., et al.: 3D face morphable models "in-the-wild". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
9. Booth, J., Roussos, A., Ververas, E., Antonakos, E., Ploumpis, S., Panagakis, Y., Zafeiriou, S.P.: 3D reconstruction of "in-the-wild" faces in images and videos. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
10. Booth, J., Roussos, A., Zafeiriou, S., Ponniah, A., Dunaway, D.: A 3D morphable model learnt from 10,000 faces. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5543-5552 (2016). DOI 10.1109/CVPR.2016.598
11. Cheng, S., Kotsia, I., Pantic, M., Zafeiriou, S.: 4DFAB: A large scale 4D database for facial expression analysis and biometric applications. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). Salt Lake City, Utah, US (2018)
12. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
13. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (2016)
14. Deng, J., Guo, J., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. arXiv:1801.07698 (2018)
15. Ekman, P.: Facial action coding system (FACS). A human face (2002)
16. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 27, pp. 2672-2680. Curran Associates, Inc. (2014). URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
17. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (eds.) Advances in Neural Information Processing Systems 30, pp. 5767-5777. Curran Associates, Inc. (2017). URL http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf
18. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. CVPR (2017)
19. Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from standard GAN. In: International Conference on Learning Representations (2019). URL https://openreview.net/forum?id=S1erHoR5t7
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014). URL http://arxiv.org/abs/1412.6980
21. Li, M., Zuo, W., Zhang, D.: Deep identity-aware transfer of facial attributes. CoRR abs/1610.05586 (2016). URL http://arxiv.org/abs/1610.05586
22. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 2584-2593. IEEE (2017)
23. Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M., Theobalt, C.: Sparse localized deformation components. ACM Transactions on Graphics (TOG) 32(6), 179 (2013)
24. Perarnau, G., van de Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional GANs for image editing. CoRR abs/1611.06355 (2016). URL http://arxiv.org/abs/1611.06355
25. Pumarola, A., Agudo, A., Martinez, A., Sanfeliu, A., Moreno-Noguer, F.: GANimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
26. Richardson, E., Sela, M., Or-El, R., Kimmel, R.: Learning detailed face reconstruction from a single image. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5553-5562. IEEE (2017)
27. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4), 95 (2017)
28. Tewari, A., Zollhöfer, M., Garrido, P., Bernard, F., Kim, H., Pérez, P., Theobalt, C.: Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. arXiv preprint arXiv:1712.02859 (2017)
29. Tran, L., Liu, X.: Nonlinear 3D face morphable model. arXiv preprint arXiv:1804.03786 (2018)
30. Tzirakis, P., Papaioannou, A., Lattas, A., Tarasiou, M., Schuller, B., Zafeiriou, S.: Synthesising 3D facial motion from "in-the-wild" speech (2019)
31. Wang, L., Zhang, Y., Feng, J.: On the Euclidean distance of images. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1334-1339 (2005). DOI 10.1109/TPAMI.2005.165. URL https://doi.org/10.1109/TPAMI.2005.165
32. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Loy, C.C.: ESRGAN: Enhanced super-resolution generative adversarial networks. In: The European Conference on Computer Vision Workshops (ECCVW) (2018)
33. Wiles, O., Koepke, A.S., Zisserman, A.: X2Face: A network for controlling face generation using images, audio, and pose codes. In: Proc. ECCV (2018)
34. Wright, S.J., Nowak, R.D., Figueiredo, M.A.T.: Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57(7), 2479-2493 (2009)
35. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)