FACEGAN: Facial Attribute Controllable rEenactment GAN
Soumya TripathyTampere UniversityFinland [email protected]
Juho KannalaAalto UniversityFinland [email protected]
Esa RahtuTampere UniversityFinland [email protected]
Abstract
Face reenactment is a popular facial animation method where the person's identity is taken from the source image and the facial motion from the driving image. Recent works have demonstrated high quality results by combining facial landmark based motion representations with generative adversarial networks. These models perform best if the source and driving images depict the same person or if the facial structures are otherwise very similar. However, if the identity differs, the driving facial structures leak to the output, distorting the reenactment result. We propose a novel Facial Attribute Controllable rEenactment GAN (FACEGAN), which transfers the facial motion from the driving face via the Action Unit (AU) representation. Unlike facial landmarks, the AUs are independent of the facial structure, preventing the identity leak. Moreover, AUs provide a human interpretable way to control the reenactment. FACEGAN processes the background and face regions separately for optimized output quality. Extensive quantitative and qualitative comparisons show a clear improvement over the state-of-the-art in the single source reenactment task. The results are best illustrated in the reenactment video provided in the supplementary material. The source code will be made available upon publication of the paper.
1. Introduction
Face reenactment is the process of animating a source face according to the motion (pose and expression) of a driving face. In general, the process involves three major steps: 1) creating a representation of the source face identity, 2) extracting and encoding the motion of the driving face, and 3) combining the identity and motion representations to produce a modified source face. Each part has a significant impact on the output quality. A number of algorithms, including traditional face models [3, 4], data driven neural networks [5, 6, 7, 1, 8, 9, 10, 11, 12], and their combinations [13], have been presented for creating photorealistic face animations.
Figure 1. Examples of the face reenactment results using the proposed FACEGAN and the recent baseline works FSGAN [1] and First-Order-Motion-Model (FOM) [2]. FACEGAN is able to preserve the source identity faithfully while imposing the driving pose and expression. The baselines suffer particularly in preserving the facial shape of the source.

In the face-model based approach [3], the identity and motion features are encoded with model parameters. The reenacted face is then rendered using the identity parameters of the source and the motion parameters of the driving face. Although this approach results in high quality outputs, it requires substantial effort to obtain faithful representations of the faces. Therefore, such approaches are often limited to a few source identities.

In recent years, data driven deep neural networks have gained popularity in generating and manipulating images. In particular, the models using adversarial loss functions [14] have obtained highly realistic image synthesis results [15, 16]. Similar models are also applied to face reenactment, as demonstrated in [5, 1, 12, 11, 6, 17, 10, 2, 18]. In most of these works, the first step is to represent the source and driving faces with deep networks. These representations can be either latent codes or human interpretable representations like facial landmarks. Subsequently, the facial feature representations are combined by a generator network to produce the final output. Although these models generalise well to multiple identities, they have numerous challenges discussed in the following.

Figure 2. An illustration of the facial structure leaking problem in the landmark based reenactment approach. The facial structure of the generated output clearly follows the shape of the driving face instead of the source face as intended.
Landmark based motion representation
Most recent works [5, 1] use facial landmark points as the representation of the facial motion (pose and expression). Although landmarks can provide strong supervision of the driving motion, they also contain the overall structure of the driving face in the form of facial contours. If the driving landmarks are directly used to animate the source face, the difference between the facial contours may lead to dramatic distortions in the output. Therefore, a gap in the facial shapes usually leads to either losing the source identity or low photorealism. Figure 2 illustrates a practical example of such distortions. For this reason, many works search for driving and source pairs with highly similar face structures. However, this greatly limits the applicability of such methods.
Learning based representation
Instead of landmarks, some works [11, 5, 18] use latent features learned in an unsupervised manner to represent the facial identity and motion. These representations are often not disentangled with respect to identity, structure and motion. Hence, the driving motion features may contain driving identity information, which ends up in the final reenacted output. This identity leakage problem is illustrated in Figures 1 and 5 in the case of X2face [11] and FOM [2].
Generalization and inference time training
In some works, the entire reenactment model is trained for a particular source identity [12], or the model is fine-tuned using a few-shot approach at inference time [5]. Although such a strategy may improve the output quality, it requires resources and time to adapt to a new identity.
Selective editing
It is often desirable to selectively edit the reenactment result, for example, to add an extra smile or to open the eyes. This type of editing is easier when directly manipulating the motion representation instead of searching for or producing an alternative driving image. To this end, the representation of the driving motion should be interpretable for a human editor. Although landmark points can be easily moved for editing, one needs sophisticated tools to preserve the structure of the source face while adjusting the landmarks. An alternative approach based on Action Units (AUs) is presented in [10], which allows editing without distorting the identity. However, that model was not able to produce high fidelity reenactment results.

In this paper, we propose a Facial Attribute Controllable rEenactment GAN (FACEGAN), which produces high quality source reenactment from various driving pairs, even with significant facial structure differences between them. Our model manipulates the source face landmarks with the driving facial attributes to generate a new set of landmarks representing the desired motion with the source identity and structure. This mitigates the identity leakage problem present in many recent works. Furthermore, by representing the motion cues using the action units, we provide a selective editing interface to the reenactment process. The proposed method combines the benefits of the action units (subject agnostic and interpretable) and the facial landmarks (strong supervision for image generation) into a single reenactment system. To the best of our knowledge, this is the first work that proposes such a combination for face reenactment. In addition, FACEGAN decouples and handles the facial region and the background in two separate branches, which helps in generating the source background realistically in the final output, unlike other models in the literature. Finally, we provide a detailed comparison of our method against the recent state-of-the-art works in single source image reenactment. The proposed FACEGAN model results in superior performance both in quantitative and qualitative measures.
2. Related work
Face reenactment has been actively studied over the years, and several different approaches have been presented. The proposed methods can be roughly grouped into the following three categories.
Morphable 3-D face models
Parametric 3D face models are popular tools for face animation, and several works [3, 19] have adopted them for face reenactment. These methods start by fitting a 3D morphable face model to the source and driving faces. Afterward, the motion and pose parameters of the driving model are transferred to the source model, and the output face with the source identity and the driving motion is rendered. The motion parameters can also be controlled using other modalities such as audio [20]. Although these methods can provide high output quality, they are not easily scaled to a large number of identities.
Deep generative models
The Variational Auto-Encoder (VAE) [21] is a popular model for image generation and manipulation. A VAE encodes the input image into a latent representation vector that is subsequently decoded back to the original input image. The VAE models can be utilised for face animation by manipulating the latent representation of the source face according to the representation of the driving face [22]. If the latent representation is disentangled with respect to identity and motion, the facial movements can be controlled by manipulating the corresponding vector elements of the encoded source face [23]. However, a disentangled representation is very difficult to obtain, which often leads to compromises in the output quality.

An alternative encoder-decoder based approach was presented in [11]. In this work, an encoder-decoder model is applied to warp the input face into an embedding image. Another encoder-decoder network converts the driving image into a warping field that maps the embedding image to the output face. This approach provides better quality compared to the VAE based models, but the driving identity can still easily leak to the reenacted output. This phenomenon is further emphasized if the source and driving faces have large structural or pose differences.

Recently, another warping based reenactment approach was proposed in [2]. The method learns a warping field using a set of keypoints from the source and driving images. This approach obtains higher output quality compared to [11], but it suffers from similar identity leakage problems. To this end, the authors of [2] proposed an alternative setup where the method is provided with an additional driving image with a similar pose and expression as the source. This additional driving image was utilised to obtain a better approximation of the identity independent motion of the driving face. Unfortunately, such a driving image would be challenging to obtain.
Deep Generative Adversarial Networks
Generative Adversarial Networks (GANs) are popular models for high quality image generation [15, 16] and manipulation [24, 25]. In [5, 1], a progressively growing GAN architecture was trained to animate a source face based on the landmarks of the driving face. As the landmarks preserve the facial structure and, up to some extent, the identity, these models suffer from quality degradation if the driving and source faces are not similar. However, this approach is well suited for applications like telepresence where the driving and source identities are the same.

Landmark transfer models were proposed in [12] and [6] to remove the identity features from the driving landmarks. In [12], the landmark transformer was trained separately for each identity pair, which limited the scalability of the method. A principal component analysis (PCA) based model was used in [6] to separate expression and shape parameters from the landmarks. They utilised the shape parameters from the source and the expression parameters from the driving face to reconstruct the transferred landmarks. Although the method generalizes better compared to [12], the proposed linear models are not sufficient to differentiate complex expressions from the shape information.

An action units (AUs) based face representation is used in [7] to manipulate facial expressions (but not pose). More recently, in [10], the authors proposed a model that uses AUs for full face reenactment (expression and pose). The AUs represent complex facial expressions by modeling specific muscle activities [26]. These activations are independent of the facial structure, which makes the action units a potential representation for disentangling the driving motion from the identity. Unfortunately, the model presented in [10] was not able to produce similar output quality as the facial landmark based alternatives.

In this paper, we combine the disentanglement properties of the action units with the strong supervision from the facial landmarks to produce high quality reenactment results without significant identity leakage problems.
3. Method
Given an image I_d with a driving face F_d and an image I_s with a source face F_s, the FACEGAN model aims to transfer the motion (pose and expression) from F_d to F_s while maintaining the identity of F_s. To do so, the model takes the landmarks of F_s and manipulates them to include the motion of F_d. The motion information is extracted from F_d in terms of action units (AUs), which correspond to various face muscle activations [27], and head pose angles. The overall shape features of the source landmarks are not altered during the transformation, which helps to preserve the source identity. The transformed landmarks are passed to the reenactment and background mixer modules, which together generate a face image depicting the source identity and background, but with the driving facial motion.

The overall architecture of FACEGAN contains three main components, which are illustrated in Figure 3. The first component, called the landmark transformer, consists of a fully connected network L_T. It takes the landmarks l_s from F_s and the action units AU_d of F_d as inputs and generates the transformed landmarks l_t as an output. l_t has the pose and expression from F_d, whereas the facial structure comes from F_s (as the input landmarks are from F_s).

The transformed landmarks l_t are converted to a gray scale image H_t, as shown in Figure 3, and provided to the next component, called the face reenactor. The face reenactor takes I_s and H_t as inputs for the generator G_r and produces the reenacted face I_fr as an output. G_r also contains a segmentation unit G_rs, which provides face, hair and background segmentation maps of I_fr.

By removing the background from I_fr and the face from I_s, we obtain two images called I_f and I_b. These images are provided to the third module, called the background mixer. I_f contains the reenacted face information, whereas I_b contains the source background. Unlike other contemporary approaches, our model processes the background separately from the face area. In this way, different components will specialize to face reenactment and background manipulation, leading to improved output quality. The background mixer contains a generator G_b producing I_r, which has the facial identity and background from I_s but the facial pose and expression from I_d.
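To make the data flow above concrete, the following minimal PyTorch-style sketch traces a single inference pass through the three components. It is an illustration only: the module and helper names (landmark_detector, au_pose_extractor, heatmap_renderer, segmenter, and so on) are hypothetical placeholders for the detectors and generators described in the text, not the released implementation.

```python
import torch

def facegan_inference(I_s, I_d, modules):
    """Sketch of one FACEGAN inference pass. `modules` is a dict of callables
    with hypothetical names; tensors follow the (B, C, H, W) convention."""
    # Source structure: flattened 2D facial landmarks of the source image.
    l_s = modules["landmark_detector"](I_s)                       # (B, 2K)
    # Driving motion: concatenated action-unit activations and head-pose angles.
    au_d = modules["au_pose_extractor"](I_d)                      # (B, N_au + N_pose)
    # Landmark transformer: predict a displacement and add it to the source landmarks.
    delta_l = modules["landmark_transformer"](torch.cat([l_s, au_d], dim=1))
    l_t = l_s + delta_l                                           # transformed landmarks
    # Face reenactor: source image + rasterized landmark heatmap -> reenacted face
    # and a face/hair/background segmentation map.
    H_t = modules["heatmap_renderer"](l_t, I_s.shape[-2:])        # (B, 1, H, W)
    I_fr, S_fr = modules["face_reenactor"](torch.cat([I_s, H_t], dim=1))
    # Background mixer: combine the reenacted face with the source background
    # (class 0 is assumed to be the background in both segmentation maps).
    face_mask = (S_fr.argmax(dim=1, keepdim=True) != 0).float()
    I_f = I_fr * face_mask                                        # reenacted face only
    src_mask = (modules["segmenter"](I_s).argmax(dim=1, keepdim=True) != 0).float()
    I_b = I_s * (1.0 - src_mask)                                  # source, face and hair removed
    I_r = modules["background_mixer"](torch.cat([I_f, I_b], dim=1))
    return I_r
```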
Figure 3. The overall architecture of the proposed FACEGAN contains three major blocks: 1. Landmark transformer, 2. Face reenactor, and 3. Background mixer. The Landmark transformer modifies the source landmarks to include the driving face's expression and pose, which are represented in terms of action units (AUs). The modified landmarks and the source image are provided to the Face reenactor module for producing the reenacted face. Finally, the reenacted face and the source background are combined into the output by the Background mixer module. In the training phase, we use the source and driving images from the same face track, which allows us to use the driving image as a pixel-wise ground truth of the output. However, during inference, the source and driving may have different identities.

The training process of our model requires image frames extracted from talking head videos. In the following, we assume that the facial landmarks of the source faces, and the head poses and the action units (AUs) of the driving faces, are already extracted and stored. This can be achieved using publicly available landmark [28] and pose-AU detectors [29, 30]. For simplicity, we denote the AU and pose combination as AUs throughout the text unless mentioned otherwise. In each step of the training procedure, we extract two frames I_s and I_d from the same video track as described in [11, 10, 5]. In this way, the exact ground truth of our reenacted source is available in terms of I_d. The detailed training steps and loss functions are discussed in the following sections.

The landmark transformer L_T modifies the source landmarks according to the driving motion, while maintaining the facial shape and identity of the source face. First, the facial landmarks are extracted from the source image I_s ∈ R^{3×H×W} and subsequently reshaped into a vector l_s. The facial motion of I_d ∈ R^{3×H×W} is extracted in terms of the action units and the head pose angles. These two parameters are concatenated into a single vector AU_d that describes the complete motion of F_d. The landmark transformer L_T takes the concatenated l_s and AU_d as an input and predicts the landmark movements δl_s. Finally, the transformed source landmarks are obtained as l_t = l_s + δl_s.

The landmark transformer is trained using several loss functions. The most straightforward loss is calculated between l_t and the landmarks of I_d, denoted as l_d. Since I_s and I_d are from the same video track during training, the landmarks l_d form the ground truth for l_t. To smooth the predictions δl_s, an l2-weighted penalty regularization is added along with the reconstruction loss. The objective function can be written as

L_{lr} = \| l_t - l_d \| + \lambda_{lr} \| \delta l_s \|        (1)

Although L_{lr} encourages the proper pose variations in l_t, it fails to capture the subtle movements due to the expression variations. In order to focus on expression, another fully connected network L_a is introduced to regress the AU parameters from the landmarks. L_a is trained simultaneously with l_t using the following loss function

L_{lau} = \| L_a(l_t) - AU_d \| + \| L_a(l_d) - AU_d \|        (2)
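For illustration, a small PyTorch sketch of the landmark transformer and the two losses above is given below. The layer sizes, the landmark count and the AU/pose vector length are assumptions made for the example, and `au_regressor` stands in for the auxiliary network L_a; none of these values reflect the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class LandmarkTransformer(nn.Module):
    """Illustrative MLP for L_T: maps (source landmarks, driving AUs + pose)
    to a landmark displacement. Sizes are assumptions, not the paper's values."""
    def __init__(self, n_landmarks=68, n_au_pose=20, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_landmarks + n_au_pose, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_landmarks),
        )

    def forward(self, l_s, au_d):
        delta = self.net(torch.cat([l_s, au_d], dim=1))   # predicted movement delta_l_s
        return l_s + delta, delta                          # l_t = l_s + delta_l_s


def landmark_losses(l_t, delta, l_d, au_d, au_regressor, lambda_lr=0.01):
    """Mean-squared variants of Eq. (1) and Eq. (2); au_regressor plays the
    role of the auxiliary network L_a."""
    L_lr = ((l_t - l_d) ** 2).mean() + lambda_lr * (delta ** 2).mean()
    L_lau = ((au_regressor(l_t) - au_d) ** 2).mean() + \
            ((au_regressor(l_d) - au_d) ** 2).mean()
    return L_lr, L_lau
```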
In order to preserve the facial shape in the landmark domain, we include a connectivity loss. The loss preserves the distances between the neighbouring landmark points in l_s and l_t. The loss function is mathematically expressed as

L_{lc} = \| D_t - D_d \|,        (3)

where D_t ∈ R^d and D_d ∈ R^d are vectors containing the differences between the connected landmark positions of l_t and l_d. Here d is the number of connected landmarks included in the loss function calculation. Finally, the full loss function for the landmark transformer is formed as a linear combination of Equations (1), (2), and (3):

L_l = \lambda_{l_1} L_{lr} + \lambda_{l_2} L_{lau} + \lambda_{l_3} L_{lc},        (4)

where \lambda_{l_1}, \lambda_{l_2}, and \lambda_{l_3} are weighting constants.

The transformed landmarks l_t are mapped to a single-channel image H_t ∈ R^{1×H×W} by placing a 2D Gaussian function at each keypoint location, as shown in Figure 3. H_t is channel-wise concatenated with the source image I_s to form the input for the face reenactor network G_r. G_r produces a reenacted RGB image I_fr ∈ R^{3×H×W} and a segmentation map S_fr ∈ R^{3×H×W} with face, hair and background classes. The segmentation maps are obtained using a CNN head added to the second last layer of G_r. For training, the pixel-wise reconstruction loss, the VGG perceptual loss [25], and the adversarial loss are applied as given in Equations (5), (6), and (7). The segmentation branch is trained with a standard cross entropy loss. The full loss function for G_r is provided in Equation (8).

L_{rr} = \| I'_{fr} - F_d \|        (5)

L_{rp} = \sum_i \frac{1}{C_i H_i W_i} \| V_i(I'_{fr}) - V_i(F_d) \|        (6)

L_{radv} = \min_{G_r} \max_{D_{r_1}, D_{r_2}, D_{r_3}} \sum_{k=1}^{3} ( E_{F_d}[\log D_{r_k}(F_d)] + E[\log(1 - D_{r_k}(I'_{fr}))] )        (7)

L_r = \lambda_{r_1} L_{rr} + \lambda_{r_2} L_{rp} + \lambda_{r_3} L_{radv} + \lambda_{r_4} L_{ce}        (8)

Here I'_{fr} and F_d denote I_fr and I_d with the background removed. Each D_{r_k} in Equation (7) stands for a discriminator, used for multiple image resolutions as described in [25]. L_{ce} is a standard cross entropy loss for S_fr. In order to obtain the ground truth for L_{ce}, a pretrained face segmentation network G_s is utilized as explained in [8].

The input to the background mixer network G_b is a channel-wise concatenation of I'_{fr} and I_b, where I_b is I_s with the face and hair removed. G_b generates an RGB image I_c and a single-channel mask M ∈ R^{H×W}. Finally, the reenacted output image I_r is obtained using I_c, I_b, and M as

I_r = M ∗ I_c + (1 - M) ∗ I_b.        (9)

In this way, G_b is encouraged to focus on producing a compatible background while directly copying as much information from I_b as possible. G_b is trained using a pixel loss and an adversarial loss on I_r, and an additional regularization for M. The regularization is obtained by imposing a weight penalty on M, smoothing the mixing process of Equation (9), and applying a total variation regularization. The full loss on M can be written as

L_{bm} = \lambda_{b_1} \sum_{x,y}^{H,W} [ (M_{x+1,y} - M_{x,y})^2 + (M_{x,y+1} - M_{x,y})^2 ] + \lambda_{b_2} \| M \|        (10)

Finally, the complete loss function for the background mixer with the regularization parameters can be written as

L_b = L_{bm} + \lambda_{b_3} L_{bp} + \lambda_{b_4} L_{badv} + \lambda_{b_5} L_{br},        (11)

where L_{bp} and L_{br} are the perceptual loss and the reconstruction loss between I_r and I_d, respectively. L_{badv} is the adversarial loss of Equation (7) for the discriminator D_b and the generator G_b.
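The landmark rasterization and the background mixing of Equations (9)-(10) can be sketched as follows. This is a minimal PyTorch illustration; the Gaussian width `sigma` and the regularization weights are assumed values, and the exact rasterization used by the authors may differ.

```python
import torch

def render_landmark_heatmap(l_t, height, width, sigma=2.0):
    """Rasterize the transformed landmarks into a single-channel image H_t by
    placing a 2D Gaussian at every keypoint (sigma is an assumed value)."""
    B, n2 = l_t.shape
    pts = l_t.view(B, n2 // 2, 2)                                   # (B, K, 2), pixel coords
    ys = torch.arange(height, device=l_t.device, dtype=l_t.dtype).view(1, 1, height, 1)
    xs = torch.arange(width, device=l_t.device, dtype=l_t.dtype).view(1, 1, 1, width)
    px = pts[..., 0].view(B, -1, 1, 1)
    py = pts[..., 1].view(B, -1, 1, 1)
    g = torch.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return g.max(dim=1, keepdim=True).values                        # (B, 1, H, W)

def mix_background(I_c, I_b, M):
    """Background mixing of Eq. (9): I_r = M * I_c + (1 - M) * I_b."""
    return M * I_c + (1.0 - M) * I_b

def mask_regularization(M, lambda_tv=1e-4, lambda_w=1e-4):
    """Total-variation smoothing plus a weight penalty on the mixing mask M,
    in the spirit of Eq. (10); the weights are placeholder values."""
    tv = ((M[..., 1:, :] - M[..., :-1, :]) ** 2).sum() + \
         ((M[..., :, 1:] - M[..., :, :-1]) ** 2).sum()
    return lambda_tv * tv + lambda_w * M.abs().sum()
```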
4. Training Details
We train FACEGAN using a dataset created from IJB-C videos [31]. Most of these videos contain celebrities talking in an unconstrained setup. The faces are first detected using a landmark detector and then tracked using a centroid tracker over the video frames. Around the centroids, a fixed image crop is used to extract the faces so that the middle position between the eyes always stays at a fixed location. Small faces are rejected based on the landmark height and width. In total, 400k good-quality face images are obtained. In addition, we used 400 videos from the FaceForensics++ dataset [32] to evaluate the performance of our system. FaceForensics++ was pre-processed in the same way as IJB-C, resulting in a total of k face images.

The networks G_r and G_b have U-Net like structures and are trained progressively as described in [25]. We first trained the networks to generate images at a lower resolution and then increased to the final output resolution. The landmark transformer L_T is a fully connected network predicting landmark locations in normalized coordinates; these can be converted to a single-channel image of any resolution. All networks are first trained separately for a few epochs and then together in an end-to-end manner. For L_T and for the other generator networks, the learning rate is kept at 0.0002, whereas the batch size is 32 for the former and 1 for the latter.
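A minimal sketch of the corresponding optimizer setup is shown below. Adam (and its betas) is an assumption, since the optimizer is not stated in the text, and the modules are placeholders standing in for the actual networks.

```python
import torch
import torch.nn as nn

# Illustrative setup for the reported hyper-parameters: learning rate 0.0002 for
# all networks, batch size 32 for the landmark transformer and 1 for the image
# generators. The modules below are placeholders, not the real architectures.
L_T = nn.Sequential(nn.Linear(156, 136))          # placeholder landmark transformer
G_r = nn.Conv2d(4, 3, kernel_size=3, padding=1)   # placeholder face reenactor
G_b = nn.Conv2d(6, 4, kernel_size=3, padding=1)   # placeholder background mixer

opt_lt = torch.optim.Adam(L_T.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(list(G_r.parameters()) + list(G_b.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
batch_size_lt, batch_size_g = 32, 1
```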
5. Experiments
We assess the proposed approach in a series of experiments and compare the results against recent state-of-the-art works [11, 1, 2, 10] in quantitative and qualitative terms. We start by evaluating our landmark transformer in isolation and then continue with the full face reenactment comparisons. All experiments are performed using the face images obtained from the FaceForensics++ dataset [32] as described in Section 4. We also note that none of the tested models, including ours, is trained with FaceForensics++, which emphasizes the generalization properties of the models. In all cases, we use one source image per identity and a fixed final output resolution.

We compare our approach with four recent works, namely X2face [11], FSGAN [1], First-Order-Motion-Model (FOM) [2] and ICface [10]. These state-of-the-art models are representatives of a broader set of reenactment frameworks. For instance, FSGAN [1] and ICface [10] utilise similar ideas of interpretable facial attributes like landmarks and AUs in reenactment. In contrast, X2face [11] and First-Order-Motion-Model [2] represent works that learn the motion and identity representations in unsupervised (or self-supervised) setups. By comparing FACEGAN with representatives from both categories, we obtain a better understanding of the drawbacks in the existing works and the contributions of FACEGAN. We used the original source codes for all comparison methods. We did not consider the few-shot learning models [5] and [6], since they require inference-time training and the corresponding implementations are not available.

Figure 4. Qualitative comparison of FACEGAN outputs with and without the landmark transformer. The facial shape and identity are clearly distorted if the landmark transformer is not applied.

In order to evaluate the importance of the landmark transformer (L_T) in our model, we performed self-reenactment and cross-reenactment with and without L_T. Self-reenactment refers to a case where the source and driving images are from the same identity, whereas in cross-reenactment these identities are different. In the former case, we take the source and driving images from the same face track, which allows us to use the driving image as a pixel-wise ground truth of the reenacted output. Here we evaluate the result using a pixel-wise Mean Squared Error (MSE). In the cross-reenactment case, we do not have access to the ground truth image and cannot use the MSE loss. Instead, inspired by [6, 5, 33], we calculate the following three measures (a code sketch of these measures is given below):

1. Cosine Similarity between IMage embeddings (CSIM): uses a pre-trained face recognition network [34] to obtain embeddings from the source and driving images. A higher score indicates that the source identity is better preserved in the reenactment process.
2. Pose Cosine Similarity between IMages (PSIM): is calculated between the head pose angles, estimated using OpenFace [29], of the driving and the reenacted faces. PSIM measures the ability of the model to retain the driving head pose in the final output.
3. Expression Difference (ED): is the Euclidean distance between the action units, calculated with OpenFace [29], of the driving and the reenacted images. It measures the ability of the model to retain the driving expressions in the final image.

The results of the landmark transformer experiments are presented in Table 1. In the self-reenactment case, the best performance is obtained without using the landmark transformer. The result is expected, since in the self-reenactment setup the driving landmarks represent the optimal output of the landmark transformer. Nevertheless, the MSE score is only slightly worse if the landmark transformer is applied, which indicates that the expression and pose are faithfully transferred from the driving face.

In the cross-reenactment case, we observe a significant difference in the CSIM score, while the PSIM and ED scores are similar. A high CSIM score indicates that the source identity is well preserved in the reenactment output. The large difference in terms of CSIM shows that the landmark transformer clearly improves the reenactment quality and reduces the identity leakage problem. The PSIM and ED scores measure pose and expression similarity between the output and the driving face.
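The three measures above can be computed, for example, as follows. The sketch assumes that the face-recognition embeddings, head-pose angles and AU vectors have already been extracted with the external tools mentioned in the text ([34] and OpenFace [29]).

```python
import torch
import torch.nn.functional as F

def csim(emb_source, emb_output):
    """CSIM: cosine similarity between face-recognition embeddings of the
    source and the reenacted output; higher means better identity preservation."""
    return F.cosine_similarity(emb_source, emb_output, dim=-1).mean()

def psim(pose_driving, pose_output):
    """PSIM: cosine similarity between the head-pose angle vectors of the
    driving and the reenacted faces; higher means better pose retention."""
    return F.cosine_similarity(pose_driving, pose_output, dim=-1).mean()

def expression_difference(au_driving, au_output):
    """ED: Euclidean distance between the driving and reenacted AU vectors;
    lower means the driving expression is better reproduced."""
    return torch.norm(au_driving - au_output, dim=-1).mean()
```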
Figure 5. Qualitative comparison of FACEGAN and four recent baseline methods: FSGAN [1], First-Order-Motion-Model (FOM) [2], ICface [10], and X2face [11]. FACEGAN is clearly able to retain the source face shape and identity better compared to the baselines. Moreover, the driving motion is faithfully reproduced.
Table 1. Quantitative evaluation of the impact of the landmark transformer (L_T), reporting MSE for self-reenactment and CSIM, PSIM and ED for cross-reenactment of FACEGAN with and without L_T. High CSIM and PSIM scores indicate the ability to preserve the source identity and the driving head pose, respectively. A low ED score reflects the ability to reproduce the driving expression. The MSE score measures the pixel-wise differences in the self-reenactment case.

These measures do not depend on the face identity, and the driving landmarks directly provide strong supervision in these terms. However, the landmark transformer results in similar or better scores also according to these measures, which indicates that the pose and expression are faithfully transferred despite the substantial improvement in preserving the source identity. Figure 4 shows a few examples of the cross-reenactment results for FACEGAN with and without the landmark transformer. These illustrate in particular how the face structure is preserved better by using the landmark transformer.

In this section, we assess the reenactment performance of the proposed model in qualitative terms and compare it with X2face [11], FSGAN [1], First-Order-Motion-Model (FOM) [2] and ICface [10]. The results for versatile driving-source pairs are illustrated in Figures 5 and 1. The source identity is well preserved in the FACEGAN results, and there is considerably less visible structure leakage compared to the baseline methods. The difference is particularly due to the landmark transformer, which helps to preserve the shape of the source face.

In addition, most models like X2face [11], FSGAN [1], and ICface [10] operate only on a tight crop around the face area. Such an approach completely ignores the non-trivial integration of the face, the background, and other body parts such as hair. These limitations greatly hamper the practical usability of the methods. In contrast, FACEGAN has a dedicated background mixer model that generates the context for the reenacted face by hallucinating high quality background pixels along with the ears and the upper body parts, as shown in Figures 4 and 6 (note that the background is cropped out from the FACEGAN results in Figures 5 and 1 to facilitate comparison). The background mixer enables the reenactment network to focus on the face and hair regions, improving the reenactment quality of these parts. In conclusion, our model produces sharper and better reenactment results from a single source and driving image of different identities in comparison to the state-of-the-art models.
In this section, we compare the FACEGAN results with the baseline models in quantitative terms. We use similar self- and cross-reenactment setups as in Section 5.1 and report the results using the MSE, CSIM, PSIM and ED scores in Table 2. In the self-reenactment case, all methods except ICface obtain similar MSE scores. The comparison to ICface demonstrates the benefits of using the facial landmark representation instead of pure action units. In the cross-reenactment setup, FACEGAN obtains the highest performance in terms of the CSIM score with a clear margin. This result further illustrates the ability of FACEGAN to preserve the source identity faithfully. In terms of PSIM, the direct landmark based FSGAN model results in the best pose retention capability. However, FACEGAN and FOM are not far behind FSGAN. ICface has the lowest PSIM score, which further illustrates the difficulty of producing a faithful pose using purely action unit based supervision. In terms of the ED metric, FACEGAN obtains the best results, followed by ICface.

While the CSIM score reflects the overall ability to preserve the identity, it may be inadequate to reflect small scale structural differences. The difficulty, however, would be to obtain a sufficient ground truth for measuring such details. For this purpose, we propose a new measure called the landmark similarity score (LSIM). To calculate the measure, we first randomly choose two source images from a single identity. Then we search the test database for a driving image with a different identity but with action units as similar as possible to those of the second source image. Finally, we use the motion information (AUs, landmarks, warping parameters) from the discovered driving image to reenact the face in the first source image. The output is compared with the second source image, which acts as a pseudo ground truth for the output. To accommodate small differences, we calculate LSIM as a mean squared error between the corresponding landmark locations instead of a pixel-level MSE. If the reenactment process preserves the facial shape and motions accurately, it should result in a low LSIM score. Table 2 contains the obtained results. The proposed FACEGAN model achieves clearly the highest performance among the compared approaches.

Model          MSE (self) ↓   CSIM ↑   PSIM ↑   ED ↓     LSIM ↓
X2Face [11]    0.018          0.564    0.659    0.025    0.040
FSGAN [1]      0.016          0.631
Table 2. Quantitative comparison of our model with the state-of-the-art works. High CSIM scores indicate a better ability to preserve the source identity, while a low LSIM signifies better landmark shape retention. The PSIM and ED measure the head pose and expression reproduction, respectively. FACEGAN obtains clearly the best CSIM and LSIM scores, while obtaining similar PSIM and ED scores. This indicates that our method is capable of preserving the source identity while faithfully reproducing the driving motion.
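For reference, the LSIM measure described above reduces to a mean squared error over corresponding landmark coordinates; a minimal sketch, assuming the landmarks of the output and the pseudo ground truth are given in the same coordinate frame:

```python
import torch

def lsim(landmarks_output, landmarks_pseudo_gt):
    """LSIM: mean squared error between corresponding landmark positions of the
    reenacted output and the pseudo ground-truth source image; lower means the
    facial shape and motion are retained more accurately."""
    return ((landmarks_output - landmarks_pseudo_gt) ** 2).mean()
```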
Figure 6. Demonstration of the selective editing capabilities of FACEGAN. Columns correspond to AU2, AU4, AU12, AU25, AU45, the pose parameters P0, P1, P2, and background manipulation; rows show the source and the manipulated results. The images in the first three rows are generated by increasing the source AUs directly from the minimum to the maximum value. The last column illustrates the capability of combining the reenacted face with a non-source background (zoom in to observe the quality better).
FACEGAN utilises human interpretable action units (AUs) to represent the pose and expression of the desired output face. The AUs can be obtained from a driving face, but this is not the only option. For instance, one can take the AUs from the source face, manipulate them, and feed them as driving information to the model. In this way, one can selectively edit the source image by, for example, changing the pose or adding an extra smile. Figure 6 contains a few examples of such a selective editing procedure (a small sketch of this type of edit follows below).

Moreover, the background mixer can combine the face and the background from two different sources. This results in a mixed reenactment result, where the face identity is from one source and the background from another. Figure 6 illustrates a few examples of this kind of edits. Such a feature would be very useful for relocating the source person into a desired environment. The proposed selective editing properties provide complete freedom and control to generate the desired reenactment video.
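As an illustration of such an edit, the sketch below overrides selected entries of the source's own AU/pose vector before it is passed to the landmark transformer. The index layout of the vector is a hypothetical assumption made only for this example.

```python
import torch

# Hypothetical layout of the extracted AU/pose vector, used only for this example.
AU_INDEX = {"AU02": 1, "AU04": 2, "AU12": 7, "AU25": 10, "AU45": 13,
            "pitch": 17, "yaw": 18, "roll": 19}

def edit_action_units(au_source, edits):
    """Selective editing: start from the source's own AU/pose vector and override
    selected entries before feeding it to the landmark transformer."""
    au_edit = au_source.clone()
    for name, value in edits.items():
        au_edit[..., AU_INDEX[name]] = value
    return au_edit

# Example: add a smile (AU12) and close the eyes (AU45) on the source face.
au_source = torch.zeros(1, 20)
au_driving = edit_action_units(au_source, {"AU12": 1.0, "AU45": 1.0})
```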
6. Conclusion
We proposed a facial animator called FACEGAN that is capable of performing high quality reenactment from a single source image. Unlike many previous works, our model does not pose any restriction on the compatibility of the source and driving pairs. The model combines the best properties of the action unit and facial landmark motion representations to reduce the identity leakage problem and to optimise the reenactment quality. Furthermore, FACEGAN handles the face and the background separately, which improves the output quality and gives additional control over choosing the desired background. We have compared our method with the state-of-the-art approaches and obtained superior results both quantitatively and qualitatively.

References

[1] Nirkin, Y., Keller, Y., Hassner, T.: FSGAN: Subject agnostic face swapping and reenactment. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 7184–7193
[2] Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Conference on Neural Information Processing Systems (NeurIPS). (2019)
[3] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Niessner, M.: Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
[4] Alexander, O., Rogers, M., Lambeth, W., Chiang, M., Debevec, P.: The Digital Emily project: Photoreal facial modeling and animation. In: ACM SIGGRAPH 2009 Courses. SIGGRAPH '09, New York, NY, USA, Association for Computing Machinery (2009)
[5] Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 9459–9468
[6] Ha, S., Kersner, M., Kim, B., Seo, S., Kim, D.: MarioNETte: Few-shot face reenactment preserving identity of unseen targets. In: Proceedings of the AAAI Conference on Artificial Intelligence. (2020)
[7] Pumarola, A., Agudo, A., Martinez, A.M., Sanfeliu, A., Moreno-Noguer, F.: GANimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 818–833
[8] Nirkin, Y., Masi, I., Tuan, A.T., Hassner, T., Medioni, G.: On face segmentation, face swapping, and face perception. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), IEEE (2018) 98–105
[9] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 8789–8797
[10] Tripathy, S., Kannala, J., Rahtu, E.: ICface: Interpretable and controllable face reenactment using GANs. In: The IEEE Winter Conference on Applications of Computer Vision. (2020) 3385–3394
[11] Wiles, O., Sophia Koepke, A., Zisserman, A.: X2Face: A network for controlling face generation using images, audio, and pose codes. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 670–686
[12] Wu, W., Zhang, Y., Li, C., Qian, C., Change Loy, C.: ReenactGAN: Learning to reenact faces via boundary transfer. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 603–619
[13] Yao, G., Yuan, Y., Shao, T., Zhou, K.: Mesh guided one-shot face reenactment using graph convolutional networks.
arXiv:2008.07783 [cs] (2020)
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. (2014) 2672–2680
[15] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 4401–4410
[16] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: Proceedings of the International Conference on Learning Representations (ICLR). (2018)
[17] Zhang, J., Zeng, X., Wang, M., Pan, Y., Liu, L., Liu, Y., Ding, Y., Fan, C.: FReeNet: Multi-identity face reenactment. In: CVPR. (2020) 5326–5335
[18] Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: Animating arbitrary objects via deep motion transfer. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019)
[19] Bouaziz, S., Wang, Y., Pauly, M.: Online modeling for realtime facial animation. ACM Trans. Graph. (2013)
[20] Karras, T., Aila, T., Laine, S., Herva, A., Lehtinen, J.: Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) (2017) 1–12
[21] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv:1312.6114 [cs, stat] (2014)
[22] Hou, X., Shen, L., Sun, K., Qiu, G.: Deep feature consistent variational autoencoder. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2017) 1133–1141
[23] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: β-VAE: Learning basic visual concepts with a constrained variational framework. In: Proceedings of the International Conference on Learning Representations (ICLR). (2017)