Detecting Face2Face Facial Reenactment in Videos
Prabhat Kumar*, Mayank Vatsa*, and Richa Singh*
Indian Institute of Science Bangalore    Indian Institute of Technology Jodhpur
[email protected]    {mvatsa,richa}@iitj.ac.in

Abstract
Visual content has become the primary source of information, as evident in the billions of images and videos shared and uploaded on the Internet every single day. This has led to an increase in alterations in images and videos to make them more informative and eye-catching for viewers worldwide. Some of these alterations are simple, like copy-move, and are easily detectable, while other sophisticated alterations like reenactment based DeepFakes are hard to detect. Reenactment alterations allow the source to change the target expressions and create photo-realistic images and videos. While the technology can potentially be used for several applications, the malicious usage of automatic reenactment has very large social implications. It is therefore important to develop detection techniques to distinguish real images and videos from altered ones. This research proposes a learning-based algorithm for detecting reenactment based alterations. The proposed algorithm uses a multi-stream network that learns regional artifacts and provides robust performance at various compression levels. We also propose a loss function for the balanced learning of the streams of the proposed network. The performance is evaluated on the publicly available FaceForensics dataset. The results show state-of-the-art classification accuracy of 99.96%, 99.10%, and 91.20% for no, easy, and hard compression factors, respectively.
1. Introduction
Approximately 95 million photos and videos are uploaded daily on Instagram [1]. YouTube receives 300 hours of video uploads every minute, with about 5 billion views every single day [2]. These visual contents, on one hand, act as a medium to interact with individuals, share opinions and thoughts, and reach out to the public. On the other hand, they also serve as a source of information and entertainment. This two-way exchange makes videos and images an effective form of communication between creators and viewers. These images and videos are not always posted in their original form but, more often than not, are altered to make them more eye-pleasing for the viewer [7]. This is primarily done using filters available to the creator or editing software such as Photoshop. Such alterations include splicing and copy-move. However, some recent ones are more advanced and sophisticated, and lie under the category of "DeepFakes". DeepFakes, as the name suggests, are often the result of video synthesis, commonly done with deep learning networks. DeepFakes include alterations of two kinds: identity swap and reenactment.

*This study was performed while the authors were at IIIT-Delhi.

Figure 1. Effect of reenactment by Face2Face [24]: the source actor (left), the target actor (right), and the reenactment of the target actor based upon the source actor (bottom).

Reenactment is defined as the acting out of a past event; in other words, performing a past event, with modifications as required. Facial reenactment refers to the modifications brought to the target in the form of changes in the movement of the head, lips, and facial expression. Techniques allowing for reenactment have been devised with the intent of improving the viewing experience, specifically for the dubbing of target actors in movies [10, 22] and for teleconferencing [24, 25]. However, the malicious use of such techniques cannot be ruled out. Specifically, reenactment techniques are capable of synthesizing photo-realistic videos and images that are hard to detect with the human eye or even with existing forgery detection techniques. Data compression also adds to the challenge of the detection task as, often, the media in circulation are highly compressed and offer little evidence of being altered.

Despite the increased awareness about fake news, videos and images still remain one of the most trusted sources of information. A reenacted video, as seen in Figure 1, can be used to portray an individual saying things that he/she has not said in real life. Such videos, circulated to a billion uninformed viewers via the Internet, can lead to chaos and confusion at a large scale. With very limited prior work on detection, there is an urgent need for developing techniques that can detect such alterations.

This paper addresses the problem of detecting reenactment in videos. Our contributions are two-fold: (i) we propose a multi-stream deep learning network based on the extraction of localized features for the detection of frames reenacted by the Face2Face approach [24] in videos, and (ii) we propose a loss function for balanced training of the streams in the proposed network. The paper is organized as follows: Section 2 expands upon the generation and detection of reenactment videos in Subsections 2.1 and 2.2, respectively. In Section 3, we explain the pipeline, including the deep learning architecture for the detection of Face2Face reenactment [24]. In Section 4, we describe the dataset used for the experiments, and the results are discussed in Section 5.
2. Related Work
Attacking visual content using deep learning approaches, and defending against such attacks, is an important area of research [4, 11, 12, 13, 17]. The face reenactment literature can be categorized into two broad categories: (i) generation techniques, i.e., methods that pave the way for reenactment manipulation of videos or, in some cases, images, and (ii) detection techniques aimed at detecting such forgeries in videos and images.
For the past decade, there has been significant work on transforming a target video from an input (audio or video). These efforts have been aimed at different applications, ranging from expression transfer from one video footage to another [23, 24], to lip-syncing of the target from input audio [10, 22], and mimicking the movement of the source on the target [25]. The effect of these works can be considered as reenactment manipulation, as the resultant movement of the target is modified, or reenacted, in the process. Figure 2 depicts the effect of generation techniques upon the target actors.

Figure 2. Effect of reenactment on the target sequence by (a) Kim et al. [15], (b) Wu et al. [26], (c) Thies et al. [25], and (d) Suwajanakorn et al. [22].

Suwajanakorn et al. [22] proposed an approach for the generation of a photo-realistic video of a target (President Obama), lip-synced to an input audio. The authors suggest a simple Recurrent Neural Network-based approach to synthesize the mouth shape of the target from the input audio. Synthesis is primarily performed on the lower face regions, including the mouth, cheek, chin, and nose. Garrido et al. [10] presented a system based upon capturing 3D face models of both the dubbing and the target actors and then using audio analysis of the dubbing actor to create a photo-realistic 3D mouth model to be applied to the target actor.

Thies et al. [23] presented a method for real-time transfer of expression from one actor to another in a target sequence. Using RGB-D data as input, the proposed method keeps the non-face region unchanged while transferring the expressions. The authors presented a novel approach to represent facial identity and expression in a linear parametric model. The expressions are synthesized by changing the blendshape parameters of the target frames according to the source. Thies et al. [24] eliminated the need for the depth videos of [23], thus allowing the transfer of expression to generic RGB videos (e.g., YouTube videos). Kim et al. [15] presented a novel method allowing full reanimation of portrait videos by the actor, including head pose, facial expression, eye motion, and, in some cases, even the identity. The method employs a face reconstruction approach to get a parametric representation of the face and illumination of each video frame. This representation is fed into a render-to-video network based on a Conditional Generative Adversarial Network to generate the output frames. Wu et al. [26] proposed reenactment through the transfer of facial features to a boundary latent space and then adapting the target boundary according to the source with the use of a transformer. Thies et al. [25] extended the concept of reenactment to the transfer of the movement of the torso and head to the target video with the use of parametric models of the head, eyes, and torso. These are later used to project the captured motion from the source to the target in a photo-realistic fashion.
Work on DeepFake detection has been sparse, due to the relatively new nature of the manipulation. However, the sheer degree of realism in the videos created by reenactment should have attracted more detection work in the field. Afchar et al. [3] proposed two shallow architectures in an attempt to capture the mesoscopic properties of images or frames. The first architecture, Meso-4, comprises four layers of convolution and pooling followed by a single-layered dense network. The other architecture, MesoInception-4, modifies Meso-4 by replacing the first two convolution layers with a modified inception module. The authors also explored image aggregation on the proposed network in an attempt to better classify videos. Agarwal et al. [5] suggested learning the head and facial movements of specific people of interest and then differentiating these movements in DeepFake videos of the same individuals.

Face tampering detection techniques have also been shown to be useful for detecting manipulations by Face2Face. Zhou et al. [28] introduced a two-stream network, with one stream based on patch triplets with 5514-D steganalysis features and the other upon GoogleNet, followed by score fusion of the two streams. Raghavendra et al. [18] used feature-level fusion, where features extracted from fine-tuned VGG19 and AlexNet are concatenated as input to a Probabilistic Collaborative Representation Classifier. Bayar and Stamm [6] proposed a generic tampering detection algorithm, a shallow network of eight layers with a constrained CNN to suppress image content and adaptively learn manipulation features. XceptionNet [8], which is based upon depthwise separable convolution layers, has also been shown to perform well for the detection task [19].
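For reference, below is a minimal PyTorch sketch of a Meso-4-style detector as described above (four convolution-and-pooling blocks followed by a single dense layer). The channel widths and kernel sizes follow the published description but should be treated as illustrative, not as the exact configuration of [3].

```python
import torch
import torch.nn as nn


class Meso4Style(nn.Module):
    """Shallow CNN in the spirit of Meso-4 [3]: four conv+pool blocks
    followed by a single dense classifier. Widths and kernels are
    illustrative assumptions, not the authors' released configuration."""

    def __init__(self, num_classes: int = 2):
        super().__init__()

        def block(c_in, c_out, k):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding=k // 2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            block(3, 8, 3), block(8, 8, 5), block(8, 16, 5), block(16, 16, 5),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```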
3. Proposed Detection Algorithm
In this research, we propose a deep learning-based architecture for detecting reenacted frames generated using the Face2Face reenactment technique [24]. The proposed method uses RGB frames in conjunction with a multi-stream network for improved extraction of the localized facial artifacts and noise patterns introduced by the reenactment procedure. We also propose a loss function to facilitate balanced training of the proposed multi-stream network. The network captures local facial artifacts by the use of dedicated streams that learn their respective regional artifacts. A full-face stream then determines the dependency between the regions. By combined learning of regional and full-face artifacts, the proposed network can classify highly compressed frames with a relatively small drop in performance compared to existing methods.
Figure 3. Proposed pipeline: RGB frames are sampled from RGB videos. ROI extraction is done on frames by face detection, followed by local region extraction, which acts as input to the proposed classification network.

Figure 3 shows the schematic representation of the proposed pipeline. The frames are extracted from the RGB video as per the experimental protocol defined in Section 4; this is followed by face detection with the S3FD approach [27]. In cases where multiple faces are detected in a single frame, the face mask annotations provided by the dataset are used to identify the target face. Mask annotations are also used in case S3FD fails to detect the face in a frame. This can be seen as a region-of-interest extraction step, which is further streamlined by strict square cropping centered around the face to suppress the background information as much as possible. For each face, the local regions are extracted by dividing the face crop along a 2 × 2 grid. This segregates the fundamental facial features into four regions, which is then followed by re-sizing each of the four local images and the full face to 224 × 224.

As shown in Figure 3, the proposed multi-stream network consists of five parallel ResNet-18 models [14]: four are dedicated to learning the local, regional artifacts and one to the overall effect of the reenactment upon the face. For each of the ResNet-18 models, the classification layer has been mapped to two outputs by a fully connected layer. The outputs from these five parallel ResNet-18 models are concatenated to form a 10-dimensional vector, which is passed to a fully connected layer that learns a weighted fusion of the scores for the binary classification task.

The fundamental intuition is to make the network learn those features or artifacts that get suppressed when learning a model with only the full-face image. Training a model explicitly on a specific region of the image forces the network to learn low-level spatial features that are not learned by a model trained on the full face, as shown in Figure 6, and these can be used to improve the classification performance, specifically for highly compressed frames. This has been done keeping in mind the practicality of the problem: since, most of the time, manipulated videos or images in circulation are in a highly compressed format, the drop in performance due to compression should ideally be as low as possible. Also, the prior knowledge that Face2Face manipulations affect the whole facial region adds to the performance of the proposed network, as discussed in Section 5. The combination of four regional models with a model trained on the full image results in a setup similar to a Spatial Pyramid [16] structure. A two-level spatial pyramid has been chosen, keeping in mind the need to maintain a balance between model complexity and the information gained from the spatial features extracted by the model. The final fully connected layer learns a weighted mapping of the scores of the four models trained upon the local regions and the one model trained upon the full image.
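To make the architecture concrete, below is a minimal PyTorch sketch of the region splitting and the five-stream network described above. The names `split_regions` and `MultiStreamNet` are ours, and details such as the torchvision weight identifier and resizing each regional crop back to the full input resolution are our assumptions, not the authors' released code.

```python
from typing import List

import torch
import torch.nn as nn
from torchvision import models


def split_regions(face: torch.Tensor) -> List[torch.Tensor]:
    """Split a square face crop (B, C, H, W) along a 2 x 2 grid into the
    four regional crops used by the local streams."""
    _, _, h, w = face.shape
    return [
        face[:, :, : h // 2, : w // 2],  # top-left quadrant
        face[:, :, : h // 2, w // 2:],   # top-right quadrant
        face[:, :, h // 2:, : w // 2],   # bottom-left quadrant
        face[:, :, h // 2:, w // 2:],    # bottom-right quadrant
    ]


class MultiStreamNet(nn.Module):
    """Five parallel ResNet-18 streams (one full-face, four regional), each
    with a two-way output; the concatenated 10-D score vector is fused by
    a final fully connected layer for the binary decision."""

    def __init__(self) -> None:
        super().__init__()

        def make_stream() -> nn.Module:
            net = models.resnet18(weights="IMAGENET1K_V1")  # ImageNet pre-training, as in the paper
            net.fc = nn.Linear(net.fc.in_features, 2)       # two-way head per stream
            return net

        self.streams = nn.ModuleList(make_stream() for _ in range(5))
        self.fusion = nn.Linear(10, 2)  # learned weighted fusion of the 10-D score vector

    def forward(self, face: torch.Tensor):
        # face: (B, 3, 224, 224) square crop; each region is resized to 224 x 224
        regions = [
            nn.functional.interpolate(r, size=face.shape[-2:], mode="bilinear",
                                      align_corners=False)
            for r in split_regions(face)
        ]
        inputs = [face] + regions
        stream_scores = [net(x) for net, x in zip(self.streams, inputs)]
        fused_score = self.fusion(torch.cat(stream_scores, dim=1))
        return stream_scores, fused_score
```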
Let the input be represented as X and the corresponding output as Y for the binary classification task, i.e., classifying whether the input X has been manipulated or not. Each input X = {X_1, X_2, X_3, X_4, X_5} is a set of five images of size 224 × 224, where X_1 represents the cropped full facial image and X_2, X_3, X_4, X_5 represent the four locally extracted images for each frame, and Y is a binary value with 0 denoting original and 1 denoting altered frames. The following loss function is minimized during the training process:

L_{total} = \underbrace{L_{R_1}}_{\text{Full Face Loss}} + \underbrace{\sum_{i=2}^{5} L_{R_i}}_{\text{Local Regional Loss}} + \underbrace{\lambda \cdot L_{fusion}}_{\text{Fusion Loss}}   (1)

where L_{total} is the effective loss and L_{R_i} represents the cross-entropy loss, as per Equation 2,

L_{R_i} = -\sum_{c=0}^{1} Y_c \log f_c(X_i)   (2)

between the scores f(X_i) of the ResNet_i model and the true output Y_c. L_{fusion} denotes the cross-entropy loss between the output of the final linear layer of the proposed model and the true output Y. The weight of L_{fusion} is parameterized by λ. It is to be noted that during the calculation of the various cross-entropy losses, the output of each model is first normalized using softmax to the range [0, 1].

The loss function has been designed to keep the model from becoming biased towards a particular ResNet stream. Back-propagating only L_{fusion} as L_{total} was found to make the network biased towards the ResNet_1 model, thereby reducing the performance of the network. By incorporating the loss of each of the parallel ResNets into the loss function, we prevent the network from becoming biased towards ResNet_1. Consequently, this improves the performance of the overall multi-stream network.

Figure 4. Illustrating the effect of compression on video frames.

The proposed network has been implemented in Python 3.5 with the PyTorch deep learning framework. Optimization is performed using the Adam optimizer with default parameters (β_1 = 0.9 and β_2 = 0.999) and a batch size of 32. The initial learning rate is divided by 10 after every 10 epochs. The ResNet-18 models are pre-trained on the ImageNet dataset [9] and then retrained on the face reenactment dataset. Setting the loss parameter λ to 1 yields the optimal results.
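A sketch of the objective in Equation 1 under the same assumptions as the model above, with `lambda_fusion` playing the role of λ. Note that `nn.CrossEntropyLoss` applies the softmax normalization mentioned above internally, matching Equation 2; the commented training configuration reflects our reading of the schedule described in the text.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()  # softmax + negative log-likelihood, matching Equation 2


def total_loss(stream_scores, fused_score, labels, lambda_fusion: float = 1.0):
    """L_total = L_R1 + sum_{i=2..5} L_Ri + lambda * L_fusion (Equation 1).

    stream_scores: list of five (B, 2) score tensors (full face first),
    fused_score: (B, 2) output of the fusion layer, labels: (B,) in {0, 1}.
    """
    regional = sum(ce(s, labels) for s in stream_scores)    # full-face + four regional losses
    return regional + lambda_fusion * ce(fused_score, labels)


# Training configuration as described in the text (StepLR is our reading of
# "divided by 10 after every 10 epochs"):
# model = MultiStreamNet()
# optimizer = torch.optim.Adam(model.parameters())  # default betas 0.9 / 0.999, batch size 32
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```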
4. Dataset
In this research, we have proposed a novel algorithm for detecting alterations that occur due to reenactment in RGB frames. For testing the performance of the proposed algorithm, we have used the FaceForensics source-to-target reenactment dataset [19]. This is the only publicly available reenactment dataset for the task (FaceForensics++ [20] contains the same videos as FaceForensics for the reenactment detection task). The dataset consists of 1004 unique videos from YouTube. Each video sequence is at least 300 frames long at 30 fps. The videos have been modified using the Face2Face approach [24] to produce reenactment manipulations. Therefore, for each video, the dataset contains the original video, the reenacted video, and the face mask against which the reenactment has been done. The dataset has been divided into train, test, and validation splits as per Table 1. For training and testing, we have followed the protocols mentioned in [19], where 10 frames are randomly sampled from each video, i.e., from the 1004 original and the 1004 altered videos. Thus, for each unique video, 10 original and 10 reenacted frames are used.
Table 1. FaceForensics dataset composition.

Set          Number of Videos
Train        704
Validation   150
Test         150
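For concreteness, a minimal sketch of the 10-frames-per-video sampling protocol using OpenCV; the function name and the fixed seed are our assumptions, not part of the protocol in [19].

```python
import random

import cv2


def sample_frames(video_path: str, n: int = 10, seed: int = 0):
    """Randomly sample n frames from a video; returns a list of BGR arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    rng = random.Random(seed)
    frames = []
    for idx in sorted(rng.sample(range(total), min(n, total))):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```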
5. Results and Observations
The proposed algorithm has been compared and contrasted with its state-of-the-art counterparts on the given dataset across various compression schemes. We have also analyzed multiple components of the proposed algorithm and their effect on the detection performance. The results are compared against shallow network architectures such as MesoNet [3] and Bayar and Stamm [6], the state-of-the-art transfer learning architecture XceptionNet [8], and face tampering detection algorithms such as Zhou et al. [28] and Raghavendra et al. [18]. Baseline performance reported in [19] has been directly used for comparison. Table 2 summarizes the accuracy of various reenactment detection algorithms on the FaceForensics dataset across the three compression modes.

Table 2. Accuracy (%) of different algorithms on the FaceForensics dataset with different compression factors.
Methods                       no-c     easy-c   hard-c
MesoNet, Afchar et al. [3]    96.80    93.40    83.20
Bayar and Stamm [6]           99.53    86.10    73.63
Zhou et al. [28]              99.93    96.00    86.83
Raghavendra et al. [18]       97.70    93.50    82.13
XceptionNet [8]               99.93    98.13    87.81
Proposed Approach             99.96    99.10    91.20
Table 3. Classification performance (%) of ResNet and VGG models on the FaceForensics dataset.

Network             no-c     easy-c   hard-c
VGG16 [21]          99.50    96.90    85.20
ResNet-18 [14]      99.93    97.70    88.20
ResNet-50 [14]      99.93    97.40    86.40
ResNet-152 [14]     99.89    97.60    85.70
Proposed Approach   99.96    99.10    91.20

The classification accuracy has been calculated as the average of the class-wise classification accuracies. The proposed model yields the best classification performance on the test set, with a mean classification accuracy of 90.40% and a standard deviation of 0.30%. The results and analysis are discussed below.

• With an increase in the degree of compression, there is a significant drop in the performance of all methods, as shown in Table 2. However, the decline in performance is high for shallow networks, such as MesoNet [3] with four convolutional layers and a classification layer, and the universal manipulation detection algorithm of [6] with eight convolutional layers. In contrast, the drop in performance is comparatively lower for deep networks such as XceptionNet [8] and the two-stream network [28] with a GoogleNet classification stream and steganalysis features as the second stream.

• Most of the detection methods give high performance on images with no compression or easy compression. However, the performance reduces significantly in the case of images with hard compression. This may be because, with no compression, Face2Face manipulation tends to show edges around the corners of the chin and near the nostrils, allowing networks to quickly learn the difference between the original and the altered images. With compression, these details tend to vanish, and it becomes more and more challenging to learn the difference between the two classes. Figure 5 also showcases the effect of compression upon the proposed network.
Figure 5. ROC curves of the proposed network for different compression modes.

Table 4. Score fusion results (%) for hard-c compression.
Classifiers            Fusion       Accuracy
Regional Classifiers   SVM          85.80
Regional Classifiers   Proposed     88.26
All Classifiers        SVM          89.13
All Classifiers        Neural Net   89.80
All Classifiers        Proposed     91.20

• As can be seen from Table 3, increasing the number of layers does not necessarily improve the classification performance. The models give consistent, comparable performance in the case of no or easy compression, but a significant drop in performance is observed for networks with a large number of layers on frames compressed with high quantization factors. This may be due to the inability of ResNet-50 and ResNet-152 to learn their large number of model parameters optimally, as compared to ResNet-18, when there is a significant loss of information in the input, which is the resultant effect of severe compression.

• Table 4 showcases the performance of the streams under various fusion techniques and also the effectiveness of the proposed loss function. It is observed that the fusion of the output scores of independently trained region-based ResNet models by a linear SVM performs comparably to ResNet-50 and ResNet-152, thus presenting quantitative evidence of the discriminative features available in these regions. The fusion of the scores of all ResNet models gives a significant boost to the classification performance. It is also to be noted that a linear combination of scores, emulated by a single-layered neural network, slightly outperforms the support vector machine based score fusion. End-to-end training of the proposed architecture with the binary cross-entropy function gives a classification score similar to the score fusion by a neural network. However, end-to-end training with the proposed loss function further improves the performance of both networks, i.e., the one with only regional classifiers and the proposed architecture.

Figure 6. Class activation maps for the local and full-face ResNets of the proposed network.

Table 5. Classification performance (%) of the proposed network on cross-compression. The rows and columns depict the compression mode of the train and test set, respectively.
Network Trained On     Network Tested On
                       no-c     easy-c   hard-c
No-Compression         99.96    58.26    52.66
Easy-Compression       99.56    99.10    55.43
Hard-Compression       96.73    95.76    91.20

• Table 5 summarizes the performance of the proposed architecture on cross-compression. The network is trained upon frames of one compression mode and then tested against frames compressed with a different compression mode. It is observed that the performance of models trained upon low compression drops significantly for inputs of higher compression. However, models trained upon highly compressed frames generalize better across low-compressed inputs.

• We analyze the class activation maps, shown in Figure 6, for the regional and full-face classifiers across the compression schemes (a sketch of the CAM computation is given after this list).

– Class activation maps corresponding to the full-face trained ResNet, i.e., ResNet_1, indicate that the nose and mouth regions provide the fundamental differentiation between the original and altered frames. This is because, during the process of reenactment, a realistic portrayal of the movement of the mouth and nearby regions is hardest to create, as the transfer of static features is easier to perform than that of dynamic features. Also, the lower facial regions are more prone to movement than any other facial region in a video sequence.

– The drop in performance of the proposed multi-stream network with respect to compression factors can easily be inferred from the activation maps: the higher the compression factor, the smaller the activated region of the network trained for the classification task.

– Unlike the forehead regions, which are not prone to high movement, the full-face ResNet fails to detect the artifacts generated by movement in the eye region. This shows the need for local classifiers dedicated to alteration detection near the eye region, namely ResNet_2 and ResNet_3.

– The Face2Face approach incorporates blendshape detection followed by parametric transfer of facial expression, leading to the creation of edge artifacts near the face boundary due to errors in face tracking and the effective transfer of expression. These errors are again neglected by the full-face ResNet, specifically when compression is applied to the input video. ResNet_4 and ResNet_5 thus help in providing another indicative measure of falsification by exploiting the face tracking limitations of the Face2Face approach.

Table 6. Classification accuracy (%) of regional classifiers for frames with hard compression.
Stream                              Accuracy
Individual    Face (X_1)            88.20
              Left Eye (X_2)        78.95
              Left Cheek (X_4)      79.15
              Right Eye (X_3)       77.45
              Right Cheek (X_5)     74.20
Combination   Regional              88.26
              Face + Left Eye       88.60
              Face + Left Cheek     89.30
              Face + Right Eye      88.83
              Face + Right Cheek    88.40

• Table 6 summarizes the classification accuracy of the classifiers for the individual streams as well as combinations of streams in the proposed model. It is observed that the left-sided features perform better than the right-sided features, specifically in the case of the cheek region. It is also observed that the eye region gives a more consistent classification performance than the cheek region. This can be inferred from the class activation maps, as the regional classifiers are able to extract more prominent features in the eye region than in the cheek region across all the compression schemes. The regional classifiers combined have performance comparable to the full-face classifier. Also, the contribution of each stream in combination with the full-face classifier is proportional to the classification performance of that regional classifier.

• We analyzed the effect of the parameter λ upon the classification performance by training with values of λ smaller than, equal to, and greater than 1. A very small value of λ is equivalent to training the streams independently and then fusing the scores, whereas a high value of λ corresponds to end-to-end training with a standalone cross-entropy loss at the output layer of the network. The best classification performance was achieved with λ = 1, i.e., the weight of the fusion loss was equal to the weight of the individual streams.

• Figures 7 and 8 show some instances where the proposed network is not able to correctly classify the majority of the frames of an input subjected to hard compression.
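As referenced in the class activation map discussion above, here is a minimal sketch of the classic CAM computation for a single ResNet-18 stream. This is our illustration of how maps like those in Figure 6 can be produced, not the authors' visualization code.

```python
import torch
import torch.nn.functional as F


def class_activation_map(stream: torch.nn.Module, image: torch.Tensor,
                         target_class: int) -> torch.Tensor:
    """Classic CAM for a ResNet-18 stream: weight the final convolutional
    feature maps (layer4 output, 512 channels) by the classifier weights
    of target_class, then upsample to the input resolution."""
    feats = {}

    def hook(_module, _inputs, output):
        feats["maps"] = output  # (1, 512, h, w)

    handle = stream.layer4.register_forward_hook(hook)
    with torch.no_grad():
        stream(image.unsqueeze(0))  # image: (3, 224, 224)
    handle.remove()

    weights = stream.fc.weight[target_class]                  # (512,)
    cam = torch.einsum("c,chw->hw", weights, feats["maps"][0])
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return F.interpolate(cam[None, None], size=image.shape[1:],
                         mode="bilinear", align_corners=False)[0, 0]
```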
6. Conclusion and Future Work
In this research, we have addressed reenactment-based DeepFake detection in videos. The proposed detection algorithm outperforms state-of-the-art methods on the FaceForensics dataset and shows the smallest reduction in classification performance when the input video frame is subjected to adverse compression. In the proposed algorithm, we aim to find the local noise patterns and artifacts that are left behind when a video is altered with Face2Face reenactment. This allows the network to model the various noise patterns learned by the regional classifiers, further aided by the full-face classifier. We also propose an end-to-end training loss function to allow for balanced training of the regional classifiers relative to the full-face classifier. Such a loss function can find use in cases where the fusion of classifiers with different rates of convergence is needed. In order to develop a generalized detection approach, it is important to understand the DeepFake generation mechanism and try to leverage the limitations of the various modules used in the generation of reenacted videos. The proposed model contains five parallel streams, leading to high computational complexity. In the future, we plan to reduce the model complexity by using an attention mechanism that learns the dependency between the image regions and feature maps in a more computationally effective manner.

Figure 7. Frames misclassified as Altered due to compression.

Figure 8. Frames misclassified as Original due to compression.
7. Acknowledgement
M. Vatsa is supported through the Swarnajayanti Fellowship by the Government of India. R. Singh and M. Vatsa are partly supported by the Ministry of Electronics and Information Technology, Government of India.
References
[3] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. MesoNet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018.
[4] A. Agarwal, R. Singh, M. Vatsa, and N. Ratha. Are image-agnostic universal adversarial perturbations for face recognition difficult to detect? In IEEE 9th International Conference on Biometrics Theory, Applications and Systems, pages 1–7, 2018.
[5] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li. Protecting world leaders against deep fakes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–45, 2019.
[6] B. Bayar and M. C. Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the ACM Workshop on Information Hiding and Multimedia Security, pages 5–10, 2016.
[7] A. Bharati, R. Singh, M. Vatsa, and K. W. Bowyer. Detecting facial retouching using supervised deep learning. IEEE Transactions on Information Forensics and Security, 11(9):1903–1913, Sep. 2016.
[8] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[10] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Perez, and C. Theobalt. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum, volume 34, pages 193–204. Wiley Online Library, 2015.
[11] A. Goel, A. Singh, A. Agarwal, M. Vatsa, and R. Singh. SmartBox: Benchmarking adversarial detection and mitigation algorithms for face recognition. In IEEE International Conference on Biometrics Theory, Applications and Systems, pages 1–7, 2018.
[12] G. Goswami, A. Agarwal, N. Ratha, R. Singh, and M. Vatsa. Detecting and mitigating adversarial perturbations for robust face recognition. International Journal of Computer Vision, 127(6):719–742, Jun 2019.
[13] G. Goswami, N. Ratha, A. Agarwal, R. Singh, and M. Vatsa. Unravelling robustness of deep learning based face recognition against adversarial attacks. In AAAI Conference on Artificial Intelligence, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. ACM Transactions on Graphics, 37(4):163, 2018.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178, 2006.
[17] P. Majumdar, A. Agarwal, R. Singh, and M. Vatsa. Evading face recognition via partial tampering of faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[18] R. Raghavendra, K. B. Raja, S. Venkatesh, and C. Busch. Transferable deep-CNN features for detecting digital and print-scanned morphed face images. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1822–1830, 2017.
[19] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.
[20] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 36(4):95, 2017.
[23] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt. Real-time expression transfer for facial reenactment. ACM Transactions on Graphics, 34(6):183, 2015.
[24] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
[25] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Nießner. HeadOn: real-time reenactment of human portrait videos. ACM Transactions on Graphics, 37(4):164, 2018.
[26] W. Wu, Y. Zhang, C. Li, C. Qian, and C. Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In Proceedings of the European Conference on Computer Vision, pages 603–619, 2018.
[27] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3FD: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 192–201, 2017.
[28] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Two-stream neural networks for tampered face detection. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1831–1839, 2017.