One-shot Face Reenactment Using Appearance Adaptive Normalization
Guangming Yao, Yi Yuan, Tianjia Shao, Shuang Li, Shanqi Liu, Yong Liu, Mengmeng Wang, Kun Zhou
NetEase Fuxi AI Lab; State Key Lab of CAD&CG, Zhejiang University; School of Computer Science and Technology, Beijing Institute of Technology; Institute of Cyber-Systems and Control, Zhejiang University
∗ Both authors contributed equally to this research.
Abstract
The paper proposes a novel generative adversarial network for one-shot face reenactment, which can animate a single face image to a different pose-and-expression (provided by a driving image) while keeping its original appearance. The core of our network is a novel mechanism called appearance adaptive normalization, which can effectively integrate the appearance information from the input image into our face generator by modulating the feature maps of the generator using the learned adaptive parameters. Furthermore, we specially design a local net to reenact the local facial components (i.e., eyes, nose and mouth) first, which is a much easier task for the network to learn and can in turn provide explicit anchors to guide our face generator to learn the global appearance and pose-and-expression. Extensive quantitative and qualitative experiments demonstrate the significant efficacy of our model compared with prior one-shot methods.
Introduction
In this paper we seek a one-shot face reenactment network, which can animate a single source image to a different pose-and-expression (provided by a driving image) while keeping the source appearance (i.e., identity). We start with the perspective that a face image can be divided into two parts, the pose-and-expression and the appearance, which is also adopted by previous work (Zhang et al. 2019). In face reenactment, the transferring of pose-and-expression is relatively easy because the training data can cover most possible poses and expressions. The main challenge of face reenactment is how to preserve the appearances of different identities. This insight motivates us to design a new architecture, which exploits a novel mechanism called appearance adaptive normalization, to better control the feature maps of the face generator for the awareness of the source appearance. In general, the appearance adaptive normalization can effectively integrate the specific appearance information from the source image into the synthesized image by modulating the feature maps of the face generator. Specifically, the appearance adaptive normalization learns specific adaptive parameters (i.e., mean and variance) from the source image, which are utilized to modulate the feature maps in the generator. In this way, the face generator can be better aware of the appearance of the source image and effectively preserve the source appearance.

The appearance adaptive normalization is inspired by recent adaptive normalization methods (Huang and Belongie 2017; Park et al. 2019), which perform cross-domain image generation without retraining for a specific domain. This attribute makes adaptive normalization potentially suitable for one-shot face reenactment, in which each identity can be seen as a domain. However, there exists a key challenge in applying these adaptive normalization methods to face reenactment: the existing adaptive normalization methods are all designed for pixel-aligned image-to-image translation problems. For example, Park et al. (2019) propose spatially-adaptive normalization for synthesizing photorealistic images given an input semantic layout. In the scenario of face reenactment, however, the source and driving images are not pixel-aligned. Such pixel misalignment makes it difficult to optimize the adaptive normalization layers during training in existing methods. Consequently, the existing methods yield distorted images after reenactment, as we show in the experiments. To tackle this challenge, one key insight of our work is that instead of learning individual adaptive parameters for different adaptive normalization layers using independent architectures, we can use a unified network to learn all the adaptive parameters from the source image in a global way. The benefit of this paradigm is that, by jointly learning the adaptive parameters, the different adaptive normalization layers can be modulated globally rather than locally. In this way, we can effectively optimize the adaptive normalization layers and control the feature maps of the face generator to keep the source appearance.
Specifically, we design a simple but effective skip-connected network to predict the adaptive parameters from the source image, which can explicitly promote the relations among the adaptive parameters for different adaptive normalization layers, and thus effectively propagate the appearance information throughout the network during reenacting.

Figure 1: Generated examples by our method. The source image provides the appearance and different driving images provide different expressions and head poses. The reenacted face has the same appearance as the source and the same pose-and-expression as the driving. Both the source and driving images are unseen in the training stage.

We make another key observation: compared with reenacting whole faces with largely varying appearances and expressions, reenacting the local facial components (i.e., eyes, nose, and mouth) is a much easier task for the network to learn, because the space of appearance and pose-and-expression is significantly reduced for these local regions. To this end, we learn the reenactment of these local regions first, which in turn provides explicit anchors to guide our generator to learn the global appearance and pose-and-expression. In particular, the landmarks are utilized to locate the source and target positions of each face component, so the network only needs to learn the reenactment of these components locally. After local reenacting, the synthesized face components are transformed to the target positions and scales with a similarity transformation and fed to the global generator for the global face synthesis.

In summary, we propose a novel framework for one-shot face reenactment, which utilizes appearance adaptive normalization to better preserve the appearance during reenacting and local facial region reenactment to guide the global synthesis of the final image. Our model only requires one source image to provide the appearance and one driving image to provide the pose-and-expression, both of which are unseen in the training data. The experiments on a variety of face images demonstrate that our method outperforms the state-of-the-art one-shot methods in both objective and subjective aspects (e.g., photo-realism and appearance preservation).

The main contributions of our work are:
1) We propose a novel method for one-shot face reenactment, which animates the source face to another pose-and-expression while preserving its original appearance using only one source image. In particular, we propose an appearance adaptive normalization mechanism to better retain the appearance.
2) We introduce the reenactment of local facial regions to guide the global synthesis of the final reenacted face.
3) Extensive experiments show that our method is able to synthesize reenacted images with both high photo-realism and appearance preservation.
Related Work
Face Reenactment
Face reenactment is a special conditional face synthesis task that aims to animate a source face image to the pose-and-expression of a driving face. Common approaches to face reenactment can be roughly divided into two categories: many-to-one and many-to-many. Many-to-one approaches perform face reenactment for a specific person. ReenactGAN (2018) utilizes CycleGAN (2017) to convert the facial boundary heatmaps between different persons, and hence improves the quality of the result synthesized by an identity-specific decoder. Face2Face (2016) animates the facial expression of a source video by swapping the source face with the rendered image. The method of Kim et al. (2018) can synthesize high-resolution and realistic facial images with a GAN. However, all these methods require a large number of images of the specific identity for training and can only reenact that specific identity. On the contrary, our method is capable of reenacting any identity given only a single image without the need for retraining or fine-tuning.

To extend face reenactment to unseen identities, some many-to-many methods have been proposed recently. Zakharov et al. (2019) adopt the architecture of BigGAN (2018) and meta-learning, which is capable of synthesizing a personalized talking head with several images, but it requires fine-tuning when a new person is introduced. Zhang et al. (2019) propose an unsupervised approach to face reenactment, which does not need multiple poses for the same identity. Yet, the face parsing map, an identity-specific feature, is utilized to guide the reenacting, which leads to distorted results when reenacting a different identity. Geng et al. (2018) introduce warp-guided GANs for single-photo facial animation. However, their method needs a photo with a frontal pose and neutral expression, while ours does not have this limitation. Pumarola et al. (2018) generate a face guided by action units (1978), which makes it difficult to handle pose changes. X2Face (2018) is able to animate a face under the guidance of pose, expression, and audio, but it cannot generate face regions that do not exist in the original images. MonkeyNet (2019a) provides a framework for animating general objects. However, its unsupervised keypoint detection may lead to distorted results in the one-shot case. MarioNETte (2019) proposes the landmark transformer to preserve the source shape during reenactment, but it does not consider how to retain the source appearance. Yao et al. (2020) introduce a graph convolutional network to learn better optical flow, which helps the method to yield better results. Different from previous many-to-many methods, our goal is to synthesize a high-quality face image by learning the appearance adaptive parameters to preserve the source appearance and utilizing the local component synthesis to guide the global face synthesis.

Figure 2: The architecture of the generator of our proposed method.
Adaptive normalization
The idea of adapting features to different distributions has been successfully applied in a variety of image synthesis tasks (Huang and Belongie 2017; Park et al. 2019). Adaptive normalization first normalizes the feature to zero mean and unit deviation, and the normalized feature is then denormalized by modulating it with the learned mean and standard deviation. In conditional BN (de Vries et al. 2017; Zhang et al. 2018), images of fixed categories are synthesized using different parameters of the normalization layers for different categories. However, unlike categorical image generation with fixed categories, the number of identities is unknown in one-shot face reenactment. AdaIN (Huang and Belongie 2017) predicts the adaptive parameters for style transfer, which are spatially shared. However, it is insufficient for controlling the global appearance, since the facial appearance is spatially varying. SPADE (Park et al. 2019) deploys a spatially varying normalization, which makes it suitable for spatially varying situations. However, SPADE (Park et al. 2019) is designed for the pixel-aligned image translation task and uses independent blocks to locally predict the adaptive parameters for different layers. In face reenactment, the source and driving images are not pixel-aligned, which makes it difficult to locally optimize the different adaptive normalization layers. Hence, we propose the appearance adaptive normalization mechanism to globally predict the adaptive parameters of different layers using a skip-connected network, which better promotes the relations among the adaptive parameters for different layers during transferring.
Methodology
For convenience, we denote the images in the dataset as $I_i^j$, $j = 1, \ldots, M$, $i = 1, \ldots, N_j$, where $j$ denotes the identity index and $i$ denotes the image index of identity $j$. $M$ is the number of identities and $N_j$ is the number of images of identity $j$. $S_i^j \in \mathbb{R}^{68 \times H \times W}$ denotes the corresponding heatmaps for the 68 facial landmarks of $I_i^j \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ are the image height and width.

Overview
Our method is a generative adversarial method. We adopt a self-supervised approach to train the network in an end-to-end way, where the driving image $I_d$ has the same identity as the source image $I_s$ in the training stage (i.e., two frames from a video). The landmark transformer (Ha et al. 2019) is utilized to improve the identity preservation. Fig. 2 shows the architecture of the proposed generator, which takes as input the source image $I_s$ and the driving image $I_d$. Our generator is composed of four sub-nets, and all the sub-nets are jointly trained in an end-to-end way. First, to preserve the source appearance, we send $I_s$ to the appearance extractor to learn the appearance adaptive parameters $\Theta$ as well as the encoded appearance feature $F_a$, as shown at the top of Fig. 2. Second, to estimate the facial movements from the source image to the driving pose-and-expression, the flow estimation module estimates the optical flow $F_{sd}$ from $I_s$ to $I_d$, which is then utilized to warp the encoded appearance feature, as shown in the middle of Fig. 2. Third, the local net is deployed to reenact the local facial regions, which provides essential anchors to guide the subsequent synthesis of the whole face, as shown at the bottom of Fig. 2. Finally, the fusion net fuses the adaptive parameters $\Theta$, the reenacted local face regions $\hat{I}_d^{local}$ and the warped appearance feature $\hat{F}_a$ to synthesize the reenacted face. By modulating the distribution of the feature maps in the fusion net using the appearance adaptive parameters, we let $F_{sd}$ determine the pose-and-expression, and $F_a$ and $\Theta$ retain the appearance.
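To make the data flow concrete, the following PyTorch-style sketch wires the four sub-nets together. The module interfaces, argument names, and the injected warping callable are illustrative assumptions, not our released implementation.

```python
import torch.nn as nn

class OneShotReenactor(nn.Module):
    """Illustrative wiring of the four sub-nets; only the data flow is meaningful.
    The concrete sub-modules and the warping function are supplied by the caller."""
    def __init__(self, appearance_extractor, flow_estimator, local_net, fusion_net, warp_fn):
        super().__init__()
        self.appearance_extractor = appearance_extractor  # I_s -> (Theta, F_a)
        self.flow_estimator = flow_estimator              # (S_s, S_d) -> F_sd
        self.local_net = local_net                        # local crops/heatmaps -> reenacted local regions
        self.fusion_net = fusion_net                      # (I_d^local, warped F_a, Theta) -> reenacted face
        self.warp_fn = warp_fn                            # e.g. bilinear warping by the estimated flow

    def forward(self, I_s, S_s, S_d, local_inputs):
        theta, F_a = self.appearance_extractor(I_s)   # appearance adaptive parameters and feature
        F_sd = self.flow_estimator(S_s, S_d)          # dense motion from source to driving
        F_a_hat = self.warp_fn(F_a, F_sd)             # warp the appearance feature with the flow
        I_local_d = self.local_net(*local_inputs)     # reenact eyes, nose and mouth locally
        return self.fusion_net(I_local_d, F_a_hat, theta)
```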
Flow Estimation Module
The procedure of the flow estimation module is illustrated in Fig. 3. First, we estimate landmarks for $I_s$ and $I_d$ to obtain the source heatmap $S_s$ and the driving heatmap $S_d$ respectively, using OpenFace (Amos, Ludwiczuk, and Satyanarayanan 2016). We then feed $S_s$ and $S_d$ into the flow estimation net (FEN) to produce an optical flow $F_{sd} \in \mathbb{R}^{2 \times H \times W}$, representing the motion of pose-and-expression. $F_{sd}$ is then utilized to warp the appearance feature $F_a$. Bilinear sampling is used to resample $F_{sd}$ to the spatial size of $F_a$. The warped $F_a$ is denoted as $\hat{F}_a$, which is subsequently fed into the fusion net to synthesize the final reenacted face. Besides, we also build the heatmaps of local regions for the source and driving images based on the landmarks, denoted as $S_s^{local}$ and $S_d^{local}$ respectively. The architecture of FEN is an hourglass net (Yang, Liu, and Zhang 2017), composed of several convolutional down-sampling and up-sampling layers. Notably, a large shape difference between the source identity and the driving identity will lead to severe degradation of the quality of the generated images, which is also mentioned by (Wu et al. 2018). To deal with this issue, we additionally adopt the landmark transformer (Ha et al. 2019), which edits the driving heatmap $S_d$ so that $S_d$ has a shape close to $S_s$. For more details, please refer to (Ha et al. 2019).

Figure 3: The procedure of the flow estimation module.
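As an illustration of the warping step, a minimal sketch of bilinear warping with `torch.nn.functional.grid_sample` is given below. It assumes the flow is expressed as per-pixel offsets in pixel units, which is our assumption rather than a detail stated above.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feature, flow):
    """Warp `feature` (N, C, H, W) with a dense flow `flow` (N, 2, H_f, W_f)
    by bilinear sampling; the flow is first resized to the feature resolution."""
    n, _, h, w = feature.shape
    flow = F.interpolate(flow, size=(h, w), mode='bilinear', align_corners=False)
    # Base sampling grid in normalized [-1, 1] coordinates (x along width, y along height).
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feature.device),
        torch.linspace(-1, 1, w, device=feature.device),
        indexing='ij')
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and displace the grid.
    offset = torch.stack((flow[:, 0] * 2.0 / max(w - 1, 1),
                          flow[:, 1] * 2.0 / max(h - 1, 1)), dim=-1)
    return F.grid_sample(feature, base_grid + offset, mode='bilinear', align_corners=False)
```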
Local Net

The local net $G_{local}$ is built with the U-Net structure (Ronneberger, Fischer, and Brox 2015). We reenact the left eye, right eye, nose and mouth with four independent networks $G_{eye_l}$, $G_{eye_r}$, $G_{nose}$, and $G_{mouth}$. Each of them is a U-Net with three down-convolution blocks and three up-convolution blocks. The inputs of each local generator are $I_s^{local}$, $S_s^{local}$ and $S_d^{local}$, where local refers to the corresponding part (i.e., left eye, right eye, nose or mouth) on the image and heatmap. The reenacted local face regions serve as anchor regions that effectively guide the fusion net to synthesize the whole reenacted face.
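The sketch below illustrates how one facial component could be cropped from the source image and the heatmaps using its landmark subset before being fed to the corresponding local generator. The bounding-box construction, the margin, and the index set (e.g., `MOUTH_IDX`) are hypothetical choices for illustration.

```python
import torch

def crop_component(image, heatmap_s, heatmap_d, landmarks, indices, margin=8):
    """Crop one facial component (e.g. the mouth) from the source image and the
    source/driving heatmaps, locating it with the landmark subset `indices`.
    Tensors are (C, H, W); `landmarks` is (68, 2) in (x, y) pixel coordinates."""
    pts = landmarks[indices]
    x0, y0 = (pts.min(dim=0).values - margin).clamp(min=0).long()
    x1, y1 = (pts.max(dim=0).values + margin).long()
    crop = lambda t: t[..., y0:y1, x0:x1]
    # Returns I_s^local, S_s^local and S_d^local for this component.
    return crop(image), crop(heatmap_s), crop(heatmap_d)

# Each component would then be reenacted by its own small U-Net, e.g. (hypothetical):
# I_mouth_hat = G_mouth(torch.cat(crop_component(I_s, S_s, S_d, lmk, MOUTH_IDX), dim=0))
```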
Appearance Extractor

The source image $I_s$ is fed into the appearance extractor $E_a(I_s)$ to predict the adaptive parameters $\Theta$ and the appearance feature $F_a$. Here $\Theta = \{\theta_i = (\gamma_i, \beta_i),\ i \in \{1, 2, \ldots, N_a\}\}$, where $i$ is the index of the adaptive normalization layer and $N_a$ denotes the number of adaptive normalization layers in the fusion net. For a feature map $F_i \in \mathbb{R}^{c \times h \times w}$ in the fusion net, we have the corresponding $\gamma_i, \beta_i \in \mathbb{R}^{c \times h \times w}$ to modulate it. The encoded source appearance feature $F_a$ is warped to $\hat{F}_a$ using the optical flow $F_{sd}$, and $\Theta$ and $\hat{F}_a$ are fed to the fusion net for face synthesis by controlling the distributions of its feature maps. We employ the U-Net (2015) architecture for the appearance extractor, because the skip-connections in the appearance extractor can effectively promote the relations between the adaptive parameters.
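A minimal sketch of such a skip-connected extractor is given below. The channel sizes, the number of levels, and the way each decoder level emits one pair $(\gamma_i, \beta_i)$ are illustrative assumptions; in particular, the predicted channel counts would have to match the fusion-net feature maps.

```python
import torch
import torch.nn as nn

class AppearanceExtractor(nn.Module):
    """U-Net-style extractor: the encoder bottleneck serves as F_a, and each
    decoder level jointly emits one (gamma_i, beta_i) pair for the fusion net."""
    def __init__(self, in_ch=3, base=64, n_layers=4):
        super().__init__()
        self.encoders = nn.ModuleList()
        ch = in_ch
        for i in range(n_layers):
            self.encoders.append(nn.Sequential(
                nn.Conv2d(ch, base * 2 ** i, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            ch = base * 2 ** i
        self.decoders, self.param_heads = nn.ModuleList(), nn.ModuleList()
        for i in reversed(range(n_layers)):
            out_ch = base * 2 ** max(i - 1, 0)
            self.decoders.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(ch + base * 2 ** i, out_ch, 3, padding=1), nn.ReLU(inplace=True)))
            # One head per adaptive-normalization layer: predicts gamma and beta maps.
            self.param_heads.append(nn.Conv2d(out_ch, 2 * out_ch, 3, padding=1))
            ch = out_ch

    def forward(self, I_s):
        skips, x = [], I_s
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        F_a = x                                    # encoded appearance feature
        theta = []
        for dec, head, skip in zip(self.decoders, self.param_heads, reversed(skips)):
            x = dec(torch.cat([x, skip], dim=1))   # skip connection to the encoder
            gamma, beta = head(x).chunk(2, dim=1)  # (gamma_i, beta_i) for one layer
            theta.append((gamma, beta))
        return theta, F_a
```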
Fusion Net
The fusion net $\hat{I}_d = G_f(\hat{I}_d^{local}, \hat{F}_a, \Theta)$ aims to decode the reenacted local regions $\hat{I}_d^{local}$ and the warped appearance feature $\hat{F}_a$ into a reenacted face image $\hat{I}_d$ under the control of the adaptive parameters $\Theta$. $G_f$ is a fully convolutional network, which performs decoding and up-sampling to synthesize the reenacted face. $G_f$ consists of several fusion blocks that adapt the source appearance, followed by several residual-connected convolution layers to produce the final result. The architecture of the fusion block is illustrated in Fig. 4. $F_i$ denotes the input feature map of the $i$-th fusion block, $\gamma_i$ and $\beta_i$ denote the $i$-th adaptive parameters, and $FB_i$ denotes the $i$-th fusion block. Before being fed into the fusion block, the reenacted local regions $\hat{I}_d^{local}$ are similarity-transformed to the target scale and position. In this way, the aligned face regions provide explicit anchors to the generator. These aligned $\hat{I}_d^{local}$ are then resized to the same spatial size as $F_i$ using bilinear interpolation. At last, $F_i$ and $\hat{I}_d^{local}$ are concatenated along the channel axis and fed into the next block of $G_f$. In this way, the fusion block can be written as
$$F_{i+1} = FB_i([F_i, \hat{I}_d^{local}], \gamma_i, \beta_i). \quad (1)$$
The core of our fusion net is the appearance adaptive normalization mechanism. Specifically, the feature map is channel-wise normalized by
$$\mu_c^i = \frac{1}{N H_i W_i} \sum_{n,h,w} F_{n,c,h,w}^i, \quad (2)$$
$$\sigma_c^i = \sqrt{\frac{1}{N H_i W_i} \sum_{n,h,w} \left[ (F_{n,c,h,w}^i)^2 - (\mu_c^i)^2 \right]}, \quad (3)$$
where $F_{n,c,h,w}^i$ is the feature map value before normalization, and $\mu_c^i$ and $\sigma_c^i$ are the mean and standard deviation of the feature map in channel $c$. The index of the normalized layer is denoted as $i$. Notably, the denormalization in adaptive normalization is element-wise, where the normalized feature map is denormalized by
$$\gamma_{c,h,w}^i \, \frac{F_{n,c,h,w}^i - \mu_c^i}{\sigma_c^i} + \beta_{c,h,w}^i. \quad (4)$$
Here $\gamma_{c,h,w}^i$ and $\beta_{c,h,w}^i$ are the scale and bias learned by the appearance extractor from $I_s$. Besides, instead of using a transposed convolutional layer or a bilinear up-sampling layer followed by a convolutional layer to expand the feature map (Isola et al. 2017; Wang et al. 2018), we adopt pixel-shuffle (Shi et al. 2016) to upscale the feature map.

Figure 4: The fusion block of the proposed method.
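The following sketch instantiates Eqs. (2)-(4) together with one fusion block in the spirit of Eq. (1). The internal layer ordering is simplified relative to Fig. 4, and the channel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def appearance_adaptive_norm(feat, gamma, beta, eps=1e-5):
    """Eqs. (2)-(4): normalize `feat` (N, C, H, W) per channel over batch and
    spatial dims, then denormalize element-wise with spatially varying gamma/beta."""
    mu = feat.mean(dim=(0, 2, 3), keepdim=True)                                     # Eq. (2)
    sigma = feat.var(dim=(0, 2, 3), keepdim=True, unbiased=False).add(eps).sqrt()   # Eq. (3)
    gamma = F.interpolate(gamma, size=feat.shape[2:], mode='bilinear', align_corners=False)
    beta = F.interpolate(beta, size=feat.shape[2:], mode='bilinear', align_corners=False)
    return gamma * (feat - mu) / sigma + beta                                       # Eq. (4)

class FusionBlock(nn.Module):
    """One fusion block (Eq. 1): concatenate the aligned local regions with the
    incoming feature map, apply conv + appearance adaptive norm, then upscale
    with pixel shuffle."""
    def __init__(self, in_ch, out_ch, local_ch=3):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch + local_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch * 4, 3, padding=1)  # 4x channels for 2x pixel shuffle
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, feat, local_regions, gamma, beta):
        local_regions = F.interpolate(local_regions, size=feat.shape[2:],
                                      mode='bilinear', align_corners=False)
        x = torch.cat([feat, local_regions], dim=1)   # [F_i, I_d^local]
        x = F.relu(self.conv1(x))
        x = appearance_adaptive_norm(x, gamma, beta)
        return self.shuffle(F.relu(self.conv2(x)))
```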
Discriminator

There are two discriminators in our method: a discriminator $D_L$ that discriminates whether the reenacted image and the driving heatmap are matched (pose-and-expression consistency), and a discriminator $D_I$ that discriminates whether the source and reenacted images share the same identity (appearance consistency). $D_L$ takes $\hat{I}_d$ and $S_d$ as input, while $D_I$ takes $\hat{I}_d$ and $I_s$ as input. $\hat{I}_d$ is concatenated with $S_d$ or $I_s$ along the channel axis before being fed into $D_L$ or $D_I$ respectively. To generate a sharp and realistic-looking image, the discriminators should have a large receptive field (Wang et al. 2018). In our method, instead of using a deeper network with larger convolutional kernels, we use a multi-scale discriminator (Wang et al. 2018), which improves the global consistency of the generated images at multiple scales.

Loss function
The total loss function is defined as
$$L_{total} = \arg\min_G \max_{D_L, D_I} \lambda_{GAN} L_{GAN} + \lambda_c L_c + \lambda_{local} L_{local}, \quad (5)$$
where $L_c$ denotes the content loss, $L_{GAN}$ denotes the adversarial loss, and $L_{local}$ denotes the local region loss. The adversarial loss is the GAN loss for $D_L$ and $D_I$:
$$L_{GAN} = \mathbb{E}_{I_s, \hat{I}_d, S_d}\big[\log D_L(I_d, S_d) + \log(1 - D_L(\hat{I}_d, S_d))\big] + \mathbb{E}_{I_s, \hat{I}_d, I_d}\big[\log D_I(I_s, I_d) + \log(1 - D_I(I_s, \hat{I}_d))\big]. \quad (6)$$
The content loss is defined as
$$L_c = L_1(I_d, \hat{I}_d) + L_{per}(I_d, \hat{I}_d), \quad (7)$$
where $L_1(I_d, \hat{I}_d)$ is the pixel-wise L1 loss, measuring the pixel distance between the generated image and the ground-truth image. $L_{per}(I_d, \hat{I}_d)$ is the perceptual loss (Johnson, Alahi, and Fei-Fei 2016), which has been shown to be useful for the task of image generation (Ledig et al. 2017). We make use of the pre-trained VGG (Simonyan and Zisserman 2014) to compute the perceptual loss, and $L_{per}$ is written as
$$L_{per}(I_d, \hat{I}_d) = \mathbb{E}_{i \in X}\big[\| \Phi_i(I_d) - \Phi_i(\hat{I}_d) \|\big], \quad (8)$$
where $X$ represents the layers we use in VGG and $\Phi_i(x)$ denotes the feature map of the $i$-th layer in $X$. The local region loss penalizes the perceptual differences between the reenacted local regions and the local regions on the ground-truth image and is defined as
$$L_{local} = L_{per}(I_{eye_l}, \hat{I}_{eye_l}) + L_{per}(I_{eye_r}, \hat{I}_{eye_r}) + L_{per}(I_{nose}, \hat{I}_{nose}) + L_{per}(I_{mouth}, \hat{I}_{mouth}). \quad (9)$$
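A sketch of assembling the generator-side objective of Eqs. (5), (7), (8) and (9) is given below. The VGG feature extractor and the adversarial term are passed in as callables, and the use of an L1 distance inside the perceptual loss is our assumption.

```python
import torch.nn.functional as F

def perceptual_loss(vgg_features, x, y):
    """Eq. (8): sum of feature-map distances over the chosen VGG layers.
    `vgg_features(img)` is assumed to return a list of feature maps Phi_i."""
    return sum(F.l1_loss(fx, fy) for fx, fy in zip(vgg_features(x), vgg_features(y)))

def generator_loss(I_d, I_d_hat, local_pairs, vgg_features, adv_loss,
                   lambda_gan=10.0, lambda_c=5.0, lambda_local=5.0):
    """Eqs. (5), (7), (9): total generator objective.
    `local_pairs` is a list of (ground-truth crop, reenacted crop) for the left
    eye, right eye, nose and mouth; `adv_loss` is the adversarial term from Eq. (6).
    The default weights follow the implementation details below."""
    l_content = F.l1_loss(I_d_hat, I_d) + perceptual_loss(vgg_features, I_d, I_d_hat)
    l_local = sum(perceptual_loss(vgg_features, gt, pred) for gt, pred in local_pairs)
    return lambda_gan * adv_loss + lambda_c * l_content + lambda_local * l_local
```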
Experiments

Implementation
The learning rates for the generator and the discriminators are set to different values, and we use Adam (Kingma and Ba 2014) as the optimizer. Spectral normalization (Miyato et al. 2018) is applied to each convolution layer in the generator. We set $\lambda_{GAN} = 10$, $\lambda_c = 5$ and $\lambda_{local} = 5$ in the loss function. The Gaussian kernel variance of the heatmaps is 3.
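The stated setup (Adam, spectral normalization on the generator convolutions) could be wired up as follows; the learning-rate values are left as arguments since the exact choices are implementation details.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def add_spectral_norm(module):
    """Apply spectral normalization to every convolution in the given network,
    as stated in the implementation details."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

def build_optimizers(generator, discriminators, lr_g, lr_d):
    """Separate Adam optimizers for the generator and the discriminators."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr_g)
    opt_d = torch.optim.Adam([p for d in discriminators for p in d.parameters()], lr=lr_d)
    return opt_g, opt_d
```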
Datasets and metrics

Both the FaceForensics++ (Rössler et al. 2019) and Celeb-DF (Li et al. 2019) datasets are used for quantitative and qualitative evaluation. OpenFace (Amos, Ludwiczuk, and Satyanarayanan 2016) is utilized to detect the face and extract facial landmarks. Following MarioNETte (2019), we adopt the following metrics to quantitatively evaluate the reenacted faces of different methods. Frechet Inception Distance (FID) (Heusel et al. 2017) and the structural similarity index (SSIM) (Wang et al. 2004) are utilized to measure the similarity between the reenacted images and the ground-truth images. These two metrics are only computed in the self-reenactment scenario, since the ground truth is inaccessible when reenacting a different person. We then evaluate the identity preservation by calculating the cosine similarity (CSIM) of identity vectors between the source image and the generated image. The identity vectors are extracted by a pre-trained state-of-the-art face recognition network (Deng et al. 2019). To inspect the model's capability of properly reenacting the pose and expression of the driving image, we calculate PRMSE (Ha et al. 2019) and AUCON (Ha et al. 2019) between the generated image and the driving image to measure the reenacted pose and expression respectively.

Figure 5: Qualitative comparison with state-of-the-art one-shot methods. Our proposed method generates more natural-looking and sharp results compared to previous methods.
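For reference, CSIM can be computed as the cosine similarity between identity embeddings; the sketch below assumes a pretrained ArcFace-style network is supplied as `identity_net` and is purely illustrative.

```python
import torch
import torch.nn.functional as F

def csim(identity_net, source_image, generated_image):
    """CSIM: cosine similarity between the identity vectors of the source image
    and the reenacted image, extracted by a pretrained face-recognition network."""
    with torch.no_grad():
        e_src = identity_net(source_image)
        e_gen = identity_net(generated_image)
    return F.cosine_similarity(e_src, e_gen, dim=-1).mean().item()
```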
Quantitative and qualitative comparison
Table 1: Quantitative comparison in the self-reenactment setting. Up/down arrows correspond to higher/lower values for better performance. Bold and underlined numbers represent the best and the second-best values of each metric respectively.

Model | CSIM ↑ | SSIM ↑ | FID ↓ | PRMSE ↓ | AUCON ↑
FaceForensics++ (2019)
X2face (2018) | 0.689 | 0.719 | 31.098 | 3.26 | 0.813
NeuralHead-FF (2019) | 0.229 | 0.635 | 38.844 | 3.76 | 0.791
MarioNETte (2019) | 0.755 | | | |
Ours | | | | |
Table 1 lists the quantitative comparisons with existing one-shot reenactment methods when reenacting the same identity, and Table 2 reports the evaluation results when reenacting a different identity. It is worth mentioning that, following (Ha et al. 2019), we re-implement (Zakharov et al. 2019) using only the feed-forward network in the one-shot setting (denoted NeuralHead-FF in the tables). Different from the other competitors, FirstOrder (2019b) requires two driving images to perform the reenactment.

Table 2: Quantitative comparison of reenacting a different identity.

Model | CSIM ↑ | PRMSE ↓ | AUCON ↑
FaceForensics++ (2019)
X2face (2018) | 0.604 | 9.80 | 0.697
NeuralHead-FF (2019) | 0.381 | 6.82 | 0.730
MarioNETte (2019) | 0.620 | 7.68 | 0.710
FirstOrder (2019b) | 0.614 | |
Ours | | |
Celeb-DF (2019)
FirstOrder (2019b) | 0.432 | 6.10 | 0.500
Ours | | |
Figure 6: Comparison of our method with FSGAN (2019). The source and driving images are cited from FSGAN (2019).
Figure 7: Comparison of our method with Zhang et al. (2019). The source and driving images are cited from Zhang et al. (2019).

We also qualitatively compare our method with the recently proposed methods of Zhang et al. (2019) and FSGAN (2019), demonstrated in Fig. 6 and Fig. 7. We can observe blurriness and color inconsistency in the results of FSGAN (2019). Also, the images synthesized by Zhang et al. (2019) have distorted face shapes and artifacts near the boundaries, because Zhang et al. (2019) utilize the face parsing map, which is an identity-specific feature, to guide the reenacting. On the contrary, with the help of appearance adaptive normalization and local region reenacting, our method achieves more detailed and natural-looking results.
Ablation study
To better evaluate the key components within our network, we perform an ablation study by evaluating the following variants of our method:
• -LocalNet: the local net is excluded from the full model.
• -AAN + SPADE: to validate the effectiveness of appearance adaptive normalization, we replace it with spatially-adaptive normalization, and all the other components remain the same as in our model.

Figure 8: Qualitative results of the ablation study. Our full model leads to better results than other variants.

Table 3: Quantitative ablation study for reenacting a different identity on the FaceForensics++ dataset (Rössler et al. 2019).

Model | CSIM ↑ | PRMSE ↓ | AUCON ↑
- local net | 0.615 | 7.293 | 0.698
- AAN + SPADE | 0.558 | 11.030 | 0.660
Ours | | |

The qualitative results are illustrated in Fig. 8 and the quantitative results are listed in Table 3. We can see that our full model presents the most realistic and natural-looking results. The local net helps reduce the pose-and-expression error, as it explicitly provides anchors for local face regions to guide the reenacting. The appearance adaptive normalization effectively improves image quality and reduces artifacts by globally modulating the appearance features. Compared to the spatially-adaptive normalization (2019), our appearance adaptive normalization better preserves the source appearance and leads to more realistic results. This validates that our appearance adaptive normalization is more suitable for face reenactment.
Conclusion and future work
In this paper, we propose a novel method to deal with the challenging problem of one-shot face reenactment. Our network deploys a novel mechanism called appearance adaptive normalization to effectively integrate the source appearance information into our face generator, so that the reenacted face image can better preserve the appearance of the source image. Besides, we design a local net to reenact the local facial components first, which in turn guides the global synthesis of face appearance and pose-and-expression. Compared to previous methods, our network exhibits superior performance on different metrics. In the future, we plan to explore temporal consistency in the network design to facilitate face reenactment in videos.
Acknowledgments
We thank the anonymous reviewers for their valuable comments. This work is supported by the National Key R&D Program of China (2018YFB1004300), NSF China (No. 61772462, No. U1736217) and the 100 Talents Program of Zhejiang University.
References
Amos, B.; Ludwiczuk, B.; and Satyanarayanan, M. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science.

Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large Scale GAN Training for High Fidelity Natural Image Synthesis.

de Vries, H.; Strub, F.; Mary, J.; Larochelle, H.; Pietquin, O.; and Courville, A. 2017. Modulating early visual processing by language.

Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4690–4699.

Friesen, E.; and Ekman, P. 1978. Facial action coding system: a technique for the measurement of facial movement. Palo Alto.

Geng, J.; Shao, T.; Zheng, Y.; Weng, Y.; and Zhou, K. 2018. Warp-guided GANs for single-photo facial animation. In SIGGRAPH Asia 2018 Technical Papers, 231. ACM.

Ha, S.; Kersner, M.; Kim, B.; Seo, S.; and Kim, D. 2019. MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets.

Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.

Huang, X.; and Belongie, S. 2017. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. doi:10.1109/iccv.2017.167. URL http://dx.doi.org/10.1109/iccv.2017.167.

Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-Image Translation with Conditional Adversarial Networks. doi:10.1109/cvpr.2017.632. URL http://dx.doi.org/10.1109/cvpr.2017.632.

Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 694–711. Springer.

Kim, H.; Carrido, P.; Tewari, A.; Xu, W.; Thies, J.; Niessner, M.; Pérez, P.; Richardt, C.; Zollhöfer, M.; and Theobalt, C. 2018. Deep video portraits. ACM Transactions on Graphics (TOG).

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4681–4690.

Li, Y.; Yang, X.; Sun, P.; Qi, H.; and Lyu, S. 2019. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics.

Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

Nirkin, Y.; Keller, Y.; and Hassner, T. 2019. FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE International Conference on Computer Vision, 7184–7193.

Park, T.; Liu, M.-Y.; Wang, T.-C.; and Zhu, J.-Y. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization.

Pumarola, A.; Agudo, A.; Martinez, A. M.; Sanfeliu, A.; and Moreno-Noguer, F. 2018. GANimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), 818–833.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv abs/1505.04597.

Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images. In International Conference on Computer Vision (ICCV).

Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A. P.; Bishop, R.; Rueckert, D.; and Wang, Z. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1874–1883.

Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2019a. Animating Arbitrary Objects via Deep Motion Transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Siarohin, A.; Lathuilière, S.; Tulyakov, S.; Ricci, E.; and Sebe, N. 2019b. First Order Motion Model for Image Animation. In Conference on Neural Information Processing Systems (NeurIPS).

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; and Nießner, M. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395.

Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; and Catanzaro, B. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. doi:10.1109/cvpr.2018.00917. URL http://dx.doi.org/10.1109/cvpr.2018.00917.

Wang, Z.; Bovik, A. C.; Sheikh, H. R.; Simoncelli, E. P.; et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing.

Wu, W.; Zhang, Y.; Li, C.; Qian, C.; and Loy, C. C. 2018. ReenactGAN: Learning to reenact faces via boundary transfer. In Proceedings of the European Conference on Computer Vision (ECCV), 603–619.

Yang, J.; Liu, Q.; and Zhang, K. 2017. Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 79–87.

Yao, G.; Yuan, Y.; Shao, T.; and Zhou, K. 2020. Mesh Guided One-shot Face Reenactment Using Graph Convolutional Networks. In Proceedings of the 28th ACM International Conference on Multimedia. doi:10.1145/3394171.3413865. URL http://dx.doi.org/10.1145/3394171.3413865.

Zakharov, E.; Shysheya, A.; Burkov, E.; and Lempitsky, V. 2019. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models.

Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2018. Self-Attention Generative Adversarial Networks.

Zhang, Y.; Zhang, S.; He, Y.; Li, C.; Loy, C. C.; and Liu, Z. 2019. One-shot Face Reenactment.

Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision.