Mutual Information Maximization on Disentangled Representations for Differential Morph Detection
Sobhan Soleymani, Ali Dabouei, Fariborz Taherkhani, Jeremy Dawson, Nasser M. Nasrabadi
West Virginia University
{ssoleyma, ad0046, ft0009}@mix.wvu.edu, {jeremy.dawson, nasser.nasrabadi}@mail.wvu.edu

Abstract
In this paper, we present a novel differential morph detection framework utilizing landmark and appearance disentanglement. In our framework, the face image is represented in the embedding domain using two disentangled but complementary representations. The network is trained by triplets of face images, in which the intermediate image inherits the landmarks from one image and the appearance from the other image. This initially trained network is further trained for each dataset using contrastive representations. We demonstrate that, by employing appearance and landmark disentanglement, the proposed framework can provide state-of-the-art differential morph detection performance. This functionality is achieved by using distances in the landmark, appearance, and ID domains. The performance of the proposed framework is evaluated using three morph datasets generated with different methodologies.
1. Introduction
The main goal of biometric systems is automated recognition of individuals based on their unique biological and behavioral characteristics [36]. The human face is widely accepted as a means of biometric authentication. Although the uniqueness of face images and the user convenience of face recognition systems have resulted in their popularity, morphed face images have been shown to pose a severe threat to them. This is because the main objective of morph attacks is to purposefully alter or obfuscate the unique correspondence between probe and gallery images [40]. The result of a morph attack is a face image which matches probe images corresponding to two different face images. Therefore, the detection of morph images plays a major role in providing reliable face recognition. The majority of morph generation frameworks focus on altering the position of the facial landmarks. These frameworks mainly utilize three steps: correspondence, warping,
Figure 1. Trusted probe image, x_i, and image in question, x_j, are disentangled into landmark and appearance representations, using the disentanglement network trained on triplets of face images. In these triplets, the constructed intermediate face image inherits the landmarks and the appearance from two different face images. Landmark, appearance, and ID representations are utilized to make the decision about the image in question.

and blending. The first step aims to detect the corresponding landmarks of both images. These sets of landmarks are then utilized to warp the images toward each other, e.g., considering the landmarks of the morph image as the pairwise average of the landmarks of the two face images. Finally, textures from the two images are combined either over the entire face image [41] or over face patches [29]. Another trend of morph generation considers Generative Adversarial Networks (GANs) to construct images that can be matched with the two source images, such as AliGAN [12, 10] and StyleGAN [23, 50]. Face morphing algorithms can affect the face image in two broad aspects. First, they alter the position of the landmarks. Second, they modify the appearance of the face image by either blending two source images or generating samples using generative models. Although appearance corresponds to the soft biometrics of a subject, which are not necessarily unique, such as ethnicity, hair color, and gender, it can still be interpreted to distinguish between face images with similar soft biometrics, e.g., through differences in the texture of the face images. However, deep differential morph detection frameworks focus on distinguishing the samples based on the ID information. Our proposed differential morph detection framework investigates both the locations of the landmarks and the appearance of the face image.
Therefore, this approach restricts the attacker's morphing capability by studying both the changes resulting from altering the landmarks as well as modifications in soft biometrics and texture information. As presented in Figure 1, our proposed framework learns disentangled representations for the landmarks and the appearance of a face image. While these representations are practically shown to be sufficient for face recognition [8], the proposed training setup ensures that the mutual information between representations of the real images from a subject is maximized. In this paper, we make the following contributions: i) we construct triplets of images in which an intermediate image inherits the landmarks from one image and the appearance from the other image, ii) these triplets are used to train a disentangling network which provides disentangled representations for landmarks and face appearance, and iii) we train specific networks for each morph dataset by learning contrastive representations through maximizing the mutual information between real images from each subject.
2. Related Works
Facial morphing studies the possibility of creating artificial biometric samples which resemble the biometric information of two or more individuals [40]. Morph images can be generated with little technical experience using tools available on the internet and on mobile platforms [40]. The overall purpose of face morphing is to generate a face image that will be verified against samples of two or more subjects in automated face recognition systems. One of the first efforts to study the generation of a morph image from two source images [13] concluded that geometric alterations and digital beautification can increase the possibility of fooling recognition systems. Morph generation techniques can roughly be categorized into landmark-based [29, 43, 44] and generative models [10, 45]. Landmark-based frameworks focus on detecting the landmarks in both images, translating these points toward each other, and blending the two face images. On the other hand, inspired by a learned inference model [12], MorGAN [10] presents a face morphing attack based on automatic image generation using a GAN framework.
Morph detection can be categorized into two main approaches [41]: single image morph detection and differential morph detection. Single image morph detection studies the possibility of detecting the morph image in the absence of a reference image. On the other hand, differential morph detection leverages the information extracted from a real image corresponding to the subject. Texture descriptors are the main feature extraction models for single image morph detection [51, 47, 38, 37, 34]. Recently, deep learning models have also been considered for this purpose [44, 43, 33]. The models mentioned can also be employed for differential morph detection when the features extracted from the two images are compared [35, 39, 9]. Another line of work for differential morph detection assumes that subtracting the trusted image from the image in question should increase the classification score of the resulting image for one of the probe subjects [14, 15, 32].
The geometry of landmarks and the visual appearance are the two main characteristics of the face that can be utilized for face recognition. Initially, the geometry of hand-crafted face landmarks was the basis for face recognition [17]. Neural network approaches have since provided state-of-the-art face recognition performance, with several deep models using the locations of landmarks for varying face recognition purposes [21, 7]. On the other hand, the effect of appearance on face recognition is widely studied, including soft biometrics such as gender, age, ethnicity, and hair color [18, 16]. Recently, an unsupervised approach using a coupled autoencoder model for disentangling the appearance and geometry of face images was developed [46]. In this framework, each autoencoder learns the geometry or appearance representation of the face, while the reconstruction loss is considered as the supervision for disentangling. Another similar work [55] incorporated variational autoencoders to improve the disentangling. Another recent generative model [24] presents an unsupervised algorithm for training GANs that learns disentangled style and content representations of the data.
Among the first works that studied the application of mutual information in deep learning, [31] showed that the GAN training loss can be recovered by minimizing the estimated divergence between the generated and true data distributions. The authors in [3] expanded mutual information maximization techniques to estimate the mutual information between two random variables via a neural network. The authors in [5] and [6] used mutual information to quantify the separation of the distributions of positive and negative pairings in learning binary hash codes. The authors in [25] introduced the RankMI algorithm, an information-theoretic loss function and a training algorithm for deep representation learning for image retrieval. The authors of contrastive representation distillation [48] proposed a contrastive-based objective function for transferring knowledge between deep networks. The authors in [2] proposed an approach to self-supervised representation learning based on maximizing the mutual information between features extracted from multiple views of a shared context.
3. Proposed Framework
Our proposed differential morph detection framework resonates with the morph generation frameworks in which the landmarks of the real image are translated to the landmarks of the target face image [29] or the image is generated by generative adversarial networks [10]. Disentangling appearance and landmark information has been shown to be a powerful tool for face recognition [8]. These two domains provide the majority of the information content for differential morph detection as well. We aim to study the possibility of detecting the morph image based on its differences from the trusted image in both the landmark and appearance domains. Therefore, to train our framework, we construct samples that inherit the appearance and landmarks from different samples. Then, we train a network that can disentangle these two types of information [8]. This framework is then trained for differential morph detection by maximizing the mutual information between representations of genuine pairs.
The first step in our proposed training consists of generating face images that inherit the appearance from one image and the landmarks from the other image. Then, these triplets of face images are used to train two deep networks. The first network aims to represent the appearance of the face image and the second network extracts the landmark information. The supervision for disentangling the appearance and landmarks of faces is provided by constructing triplets of face images. Each triplet consists of two real face images from two different IDs. For convenience, we denote these images as the appearance image, x_i, the landmark image, x'_i, and an intermediate face image generated using the appearance of the first face image and the landmarks of the second face image, x̂_i. To construct this intermediate face image, we translate the landmarks of the appearance image to those of the landmark image.

For this purpose, let x_i be a face image, noted as an appearance image, belonging to the class y_i, and let the set l_i describe the locations of its K landmarks. We find another face image x'_i from a different class corresponding to the closest set of landmarks l'_i as the landmark set. The distance between the sets of landmarks is calculated in terms of the L∞ norm, to assure that x_i and x'_i have similar structures in order to minimize the distortion caused by the spatial transformation in the next step. We use the thin plate spline (TPS) algorithm [4] to transfer the landmarks of the appearance face image to the landmarks of x'_i as:

x̂_i = TPS(x_i, l_i, l'_i + δ_l),   (1)

where TPS and x̂_i represent the spatial transformation and the deformed image, noted as the intermediate face image. This face image has the appearance of x_i and the landmarks of x'_i. The set δ_l accounts for small perturbations in localizing the landmarks in the morph generation framework.
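The TPS mapping in Equation 1 can be sketched as follows. This is a minimal NumPy illustration of fitting a thin plate spline on landmark pairs (l_i, l'_i + δ_l) and evaluating it at query points; the helper names `tps_fit` and `tps_apply` are ours, not from the paper, and a full implementation would additionally resample the image pixels through the fitted mapping.

```python
import numpy as np

def tps_fit(src, dst):
    # Fit a 2-D thin plate spline mapping src landmarks (K, 2) to dst (K, 2).
    # Kernel U(r) = r^2 log r, written here as 0.5 * d2 * log(d2) with d2 = r^2.
    K = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = np.where(d2 > 0, d2 * np.log(np.maximum(d2, 1e-12)), 0.0) * 0.5
    P = np.hstack([np.ones((K, 1)), src])          # affine part [1, x, y]
    A = np.zeros((K + 3, K + 3))
    A[:K, :K] = U
    A[:K, K:] = P
    A[K:, :K] = P.T
    b = np.zeros((K + 3, 2))
    b[:K] = dst
    return np.linalg.solve(A, b)                   # stacked [weights; affine]

def tps_apply(params, src, pts):
    # Evaluate the fitted spline at query points pts (M, 2).
    d2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = np.where(d2 > 0, d2 * np.log(np.maximum(d2, 1e-12)), 0.0) * 0.5
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ params[:len(src)] + P @ params[len(src):]
```

By construction the spline interpolates the landmarks exactly, so the warped appearance image inherits the landmark geometry of x'_i while the texture is carried along smoothly between landmarks.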
As presented in Figure 2, in our proposed framework, two networks are defined: the appearance network, a, and the landmark network, g. These networks map the input face image to the appearance and landmark representations as a(·): R^{w×h×3} → R^{d_a} and g(·): R^{w×h×3} → R^{d_g}. It is worth mentioning that landmarks can be defined as the salient points in the face image. Although the landmark representation aims to represent the landmarks in the face image, it is trained through a classification setup to preserve the information required to distinguish between the input images regarding their geometrical differences. We define a third network, f(·), that maps these two representations to a face ID representation as f(·): R^{d_a} × R^{d_g} → R^{d_f}, where d_a, d_g, and d_f are the dimensions of the appearance, landmark, and face ID representations, respectively. This representation enables us to train the framework in a classification setup.

To provide enough information to distinguish between real and morph images, these three representations should satisfy three conditions: i) the appearance representations of the appearance and intermediate images should be similar: a(x_i) ≈ a(x̂_i), ii) the landmark representations of the landmark and intermediate images should be similar: g(x'_i) ≈ g(x̂_i), and iii) for both of the non-manipulated images, x_i and x'_i, the face representations resulting from network f should preserve sufficient classification information. We address these three conditions in our initial training setup. The appearance-preserving loss function aims to enforce the first condition:

L_a(x_i, x̂_i) = − Σ_i^N Φ(a(x_i), a(x̂_i)),   (2)

where Φ(v_1, v_2) represents the cosine similarity between v_1 and v_2, as in [52, 28]: Φ(v_1, v_2) = v_1^T v_2 / (||v_1|| ||v_2||), and N is the number of samples. Similarly, the landmark-preserving loss is defined as:

L_g(x'_i, x̂_i) = − Σ_i^N Φ(g(x'_i), g(x̂_i)) + max(0, Φ(g(x'_i), g(x_i)) − α_g φ_g),   (3)

where φ_g = ||l_i − l'_i|| / ||l_i − l̄_i|| is a normalized measure of the distance between landmark locations, and l̄_i is the mean of the landmark locations along the two axes. α_g is a scaling coefficient, forming an angular loss which aims to maximize the cosine similarity of g(x_i) and g(x̂_i) and the dissimilarity of g(x_i) and g(x'_i).

Figure 2. Face image x̂_i is constructed by considering the appearance of x_i and the landmarks of x'_i. L_a enforces the appearance representations of x_i and x̂_i to be similar. Similarly, L_g ensures that g(x̂_i) and g(x'_i) are close to each other. A fully-connected layer fed with the concatenation of g and a provides the ID representation for the input image.

In addition to the discussed training loss functions, we should assure that the appearance and landmark representations provide sufficient information for the identification of the real images, x_i and x'_i:

L_id(x_i) = − Σ_i^N log ( e^{s(cos(m_1 θ_{y_i,i} + m_2) − m_3)} / ( e^{s(cos(m_1 θ_{y_i,i} + m_2) − m_3)} + Σ_{j ≠ y_i} e^{s cos(θ_{j,i})} ) ),   (4)

where f(x_i) = T(a(x_i), g(x_i)) is the ID representation [11] for the face image, cos(θ_{j,i}) = W_j^T f(x_i) / (||W_j|| ||f(x_i)||), and W_j is the weight vector assigned to the j-th class. In this angular loss function, m_1, m_2, and m_3 are the hyperparameters controlling the angular margin, and s is the magnitude of the angular representations. The training loss function is defined as:

L_t = Σ_i L_id(x_i) + L_id(x'_i) + λ_a L_a(x_i, x̂_i) + λ_g L_g(x'_i, x̂_i),   (5)

where λ_a and λ_g are hyper-parameters scaling the appearance- and landmark-preserving loss functions.
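The appearance- and landmark-preserving terms of Equations 2 and 3 reduce to cosine-similarity penalties over a batch of embeddings. A minimal NumPy sketch, assuming the embeddings have already been produced by the networks a and g (function names and inputs are ours, not from the paper):

```python
import numpy as np

def cos_sim(u, v):
    # Φ(v1, v2) = v1ᵀ v2 / (||v1|| ||v2||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def appearance_loss(a_x, a_xhat):
    # Eq. (2): pull a(x_i) toward a(x̂_i) by maximizing cosine similarity
    return -sum(cos_sim(u, v) for u, v in zip(a_x, a_xhat))

def landmark_loss(g_xp, g_xhat, g_x, phi_g, alpha_g):
    # Eq. (3): pull g(x'_i) toward g(x̂_i); push g(x'_i) away from g(x_i)
    # with a margin scaled by the normalized landmark distance φ_g
    pull = -sum(cos_sim(u, v) for u, v in zip(g_xp, g_xhat))
    push = sum(max(0.0, cos_sim(u, w) - alpha_g * p)
               for u, w, p in zip(g_xp, g_x, phi_g))
    return pull + push
```

Both terms are bounded below by the negative batch size, since each cosine similarity lies in [−1, 1].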
Our proposed differential morph detection framework builds upon recent information-theoretic approaches to deep representation learning [25, 48]. We aim to maximize the mutual information between the real images from the same subject and minimize the mutual information between the samples in an imposter pair during training, and to make the decision during testing considering the distance between the representations of the pair of images in the embedding. To this aim, as presented in Figure 3, the joint training of the disentanglement and auxiliary networks provides embedding representations distinguishable enough to detect morphed face images in a differential morph detection setup. Our framework benefits from transferring knowledge from the recognition task on a large face dataset to the disentanglement network, which provides faster training of both the disentanglement and auxiliary networks.

To maximize the mutual information between real samples from the same subject in the embedding space, we follow the notation proposed in [25, 3]. Let x_i be an input face image and z_{a_i} and z_{g_i} be its corresponding appearance and landmark representations as:

z_{a_i} = a(x_i),  z_{g_i} = g(x_i).   (6)

We aim to train a(x) and g(x) such that real images from the same subject are mapped closely in the embedding space. To this aim, we maximize the mutual information between the real images from the same subject in each embedding space using the functions T_a(·) and T_g(·). To construct our training samples, we define the genuine set as:

P = {(x_i, x_j) | c_i = c_j, r_i = r_j = 1},   (7)

where c_i and c_j represent the classes of the subjects and r_i = 1 represents real images. On the other hand, we define the imposter set as:

N = {(x_i, x_j) | c_i ≠ c_j or r_i = 0 or r_j = 0},   (8)

where r_i = 0 represents morphed images. It is worth mentioning that we define the above imposter set during training.
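The genuine and imposter sets in Equations 7 and 8 simply partition all image pairs by subject class and real/morph label. A short sketch of that partition (our own helper, assuming each sample carries an index, a subject class c, and a real flag r):

```python
from itertools import combinations

def build_pairs(samples):
    # samples: list of (index, subject_class c, real_flag r)
    # Genuine set P (Eq. 7): same class, both real.
    # Imposter set N (Eq. 8): different classes, or at least one morph.
    genuine, imposter = [], []
    for (i, ci, ri), (j, cj, rj) in combinations(samples, 2):
        if ci == cj and ri == 1 and rj == 1:
            genuine.append((i, j))
        else:
            imposter.append((i, j))
    return genuine, imposter
```

Note that during training any pair containing a morph lands in N, regardless of class, which matches the definition of Equation 8.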
During the test phase, the imposter set consists of pairs in which both samples belong to the same subject, while one of them is a real face image and the other is a morphed face image. In addition, for the genuine set, we can define the joint distribution of x_i and x_j as:

p(x_i, x_j) = Σ_{k ∈ C} p(x_i, x_j, c = k, r_i = r_j = 1).   (9)

Figure 3. A pair of one trusted probe image, x_i, and an image in question, x_j, are fed into the disentanglement network. This network, which is trained in combination with the auxiliary networks, T_a(·,·) and T_g(·,·), provides embedding representations that present high mutual information for genuine pairs, resulting in close representations for the samples in a genuine pair and distant representations for the samples in imposter pairs. Here, the morph image (red) is constructed by displacing the landmarks of a real image (green) toward the landmarks of a visually similar image (black). The genuine pair consists of two real images from the same subject (orange and green), while the imposter pair is constructed using a real image and its corresponding morph image (green and red).

Assuming the high entropy of p(c)p(r) for the imposter set, we can approximate the joint distribution of the samples as the product of their marginals:

p(x_i) p(x_j) ≈ Σ_{k ∈ C} Σ_{k' ≠ k} p(x_i | c_i = k) p(x_j | c_j = k') p(c_i = k) p(c_j = k')
             + Σ_{k ∈ C} p(x_i | c_i = k) p(x_j | c_j = k) p(c_i = k) p(c_j = k) p(r_j = 0)
             + Σ_{k ∈ C} p(x_i | c_i = k) p(x_j | c_j = k) p(c_i = k) p(c_j = k) p(r_i = 0) p(r_j = 1),   (10)

where r ∈ {0, 1} represents morphed and real images, respectively. Considering the genuine and imposter pairs defined in Equations 7 and 8, the appearance differential loss is defined to maximize the mutual appearance information between samples in a genuine pair as [3, 25]:

L_a = (1/|P|) Σ_{(x_i, x_j) ∈ P} T_a(z_{a_i}, z_{a_j}) − log (1/|N|) Σ_{(x_i, x_j) ∈ N} e^{T_a(z_{a_i}, z_{a_j})}.   (11)

A similar loss is defined over the genuine and imposter pairs to calculate L_g as the differential landmark information loss. Then, the differential loss is defined as:

L_t = λ_a L_a + λ_g L_g + L_id,   (12)

where L_id provides the training for the network T and subsequently f(x_i).
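Equation 11 is the Donsker-Varadhan form of the mutual information lower bound used in MINE [3]: the mean critic score over genuine pairs minus the log of the mean exponentiated score over imposter pairs. A NumPy sketch of the bound, assuming the critic outputs T_a(z_i, z_j) have already been computed as scalars:

```python
import numpy as np

def dv_bound(t_genuine, t_imposter):
    # Eq. (11): (1/|P|) Σ_P T(z_i, z_j) − log (1/|N|) Σ_N exp(T(z_i, z_j))
    # Larger values correspond to a larger mutual information estimate.
    t_genuine = np.asarray(t_genuine, dtype=float)
    t_imposter = np.asarray(t_imposter, dtype=float)
    return float(np.mean(t_genuine) - np.log(np.mean(np.exp(t_imposter))))
```

Training would maximize this bound (or minimize its negative) jointly for the appearance critic T_a and the landmark critic T_g, pushing genuine-pair scores up and imposter-pair scores down.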
4. Experiments
We study the performance of the proposed framework on three morph datasets. In our experiments, we follow the frameworks described in [41, 49]. The evaluation metrics for differential morph detection are defined as: Attack Presentation Classification Error Rate (APCER), the proportion of morph attack samples incorrectly classified as bona fide (non-morph) presentations, and Bona Fide Presentation Classification Error Rate (BPCER), the proportion of bona fide (non-morph) samples incorrectly classified as morphed samples.
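Concretely, given a scalar morph score per pair (higher meaning more likely morphed) and a decision threshold, the two error rates can be computed as below; the reporting convention BPCER@APCER=5%/10% fixes the threshold from the morph scores. This is a generic sketch of the standard metrics, not code from the paper:

```python
import numpy as np

def apcer_bpcer(morph_scores, bonafide_scores, threshold):
    # APCER: morphs wrongly accepted as bona fide (score below threshold)
    # BPCER: bona fide samples wrongly flagged as morphs (score at/above threshold)
    apcer = float(np.mean(np.asarray(morph_scores) < threshold))
    bpcer = float(np.mean(np.asarray(bonafide_scores) >= threshold))
    return apcer, bpcer

def bpcer_at_apcer(morph_scores, bonafide_scores, target=0.05):
    # Pick the threshold so that APCER equals the target rate,
    # then report the BPCER at that operating point.
    thr = np.quantile(morph_scores, target)
    return float(np.mean(np.asarray(bonafide_scores) >= thr))
```

The D-EER reported in the tables is the operating point at which APCER and BPCER are equal.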
For all the datasets, Dlib [26] is used to detect and align faces, as well as to extract landmarks. We train the model on the CASIA-WebFace [56] dataset. In the training set, for each image, the image from a different ID that provides the closest landmarks to its landmarks in terms of the L∞ norm is selected. The neighbor face is transformed spatially using Equation 1. This image is aligned again to compensate for the displacements caused by the spatial transformation. All images are resized to a fixed resolution and pixel values are scaled to a fixed range.

We adopt ResNet-64 [19] as the base network architecture. To reduce the size of the model, the convolutional networks for extracting the landmark representation, g(x), and the appearance representation, a(x), are combined. This network produces feature maps which are divided in depth into two sets, dedicated to the appearance and landmark representations, respectively. Each set of feature maps is reshaped to form a vector and passed to a dedicated fully-connected layer; these layers generate the final representations, a(x) and g(x). The ID representation is constructed by concatenating these two representations and feeding them to a fully-connected layer. The model is trained using Stochastic Gradient Descent (SGD) on two NVIDIA TITAN X GPUs. In Equation 4, following the ArcFace [11] framework, m_1, m_2, and m_3 are set to 0.9, 0.4, and 0.15, respectively. In Equation 1, δ_l is sampled from a zero-mean normal distribution. The learning rate is initialized and then decayed in intervals of five epochs until it reaches a minimum value. The hyper-parameters α_g, λ_a, and λ_g are selected on the validation set. For training the network using Equation 11, each fully-connected layer is fed to a smaller fully-connected layer, and then to a single unit.
Here, considering λ_a = λ_g = 1, the network is trained with a reduced learning rate, decayed with the same schedule mentioned above.

MorGAN is constructed using the generative framework described in [10]. For each bona fide image, two morph images are generated using the two most similar identities to the bona fide image. The database is randomly split into disjoint and equal train and test sets.

VISAPP17-Splicing-Selected is a subset of the VISAPP17-Splicing dataset [30], containing genuine neutral and smiling face images as well as morphed face images. For simplicity, we refer to this dataset as VISAPP17. This dataset is generated by warping and alpha-blending [53]. To construct it, facial landmarks are localized, the face image is triangulated based on these landmarks, the triangles are warped to an average position, and the resulting images are alpha-blended, where alpha is set to 0.5, making alpha-blending equivalent to averaging. Splicing morphs are designed to avoid the ghosting artifacts usually present in the hair region: warping and blending are applied only to the face region, and the blended face is inserted into one of the original face images, so the background, hair, and torso regions remain untouched. The VISAPP17-Splicing-Selected dataset is constructed by selecting morph images without any recognizable artifacts.

The AMSL Face Morph Image Dataset is created using the Face Research Lab London Set [1] and includes genuine neutral and smiling face images and morphed face images. The morphed face images are generated from pairs of genuine face images [30]. For all the morph images, the proportions of both faces in the morphed face are the same. While generating morphed faces, male, female, white, and Asian subjects are only morphed within their corresponding category. All images are down-scaled, and JPEG compression is applied to compress the images to 15kB [54]. This dataset includes 102 neutral and 102 smiling genuine face images and 2,175 morph images.

Figure 4. Samples from (a) MorGAN, (b) VISAPP17-Splicing-Selected, and (c) AMSL Face Morph Image datasets. For each dataset, the first and second faces are the gallery and probe bona fide images, and the third face is the morph image constructed from the first and fourth face images.

Differential Morph Detection:
For the MorGAN dataset, we follow the train and test split presented in [10]. For the other two datasets, we consider a disjoint train and test split in which a fixed portion of the subjects is used for training.

Dataset              MorGAN                  VISAPP17                AMSL
                 D-EER    5%     10%     D-EER    5%     10%     D-EER    5%     10%
LM-Dlib [9, 26]  12.53   20.71  10.17    17.88   26.64  22.71    14.45   20.67  18.55
BSIF+SVM [22]    10.17   14.22   8.64    16.42   28.77  25.37    12.75   20.71  16.26
LBP+SVM [27]     15.51   28.40  18.71    18.75   23.88  20.65    14.97   21.47  16.21
FaceNet [42]     16.14   38.38  26.67     9.51   29.82   6.91     8.43   25.74   5.68
ArcFace [11]     14.65   22.76  16.23     7.14   17.51   5.69     6.14   14.51   5.23
FaceNet+SVM      12.53   18.84  12.21     8.85   26.46   6.28     8.42   18.46   5.28
ArcFace+SVM [41] 10.82   15.47  12.43     5.38    7.45   4.78     3.87    6.12   3.28
Ours               -       -      -        -       -      -        -       -      -

Table 1. D-EER%, BPCER@APCER=5%, and BPCER@APCER=10% for differential morph detection.

The distance between face images x_i and x_j is defined as:

D = Φ(f(x_i), f(x_j)) + β_a Φ(a(x_i), a(x_j)) + β_g Φ(g(x_i), g(x_j)),   (13)

where β_a and β_g are the scaling parameters used for decision making. We employ classical texture descriptors, BSIF [22] and LBP [27], with an SVM classifier. The LBP feature descriptors are extracted according to the original LBP formulation on image patches. The resulting feature vector is a normalized histogram of size 256, which encompasses all potential values of the LBP binary code. BSIF features are computed with pre-learned Independent Component Analysis (ICA) filters [20], as used in the original BSIF paper, to construct a normalized histogram for each image. The feature vectors are then input to an SVM with an RBF kernel for classification. For all classical baseline models, the difference between the feature representation of the image in question and the feature representation of the trusted image is fed to an SVM classifier.

In addition, we employ LM-Dlib [9, 26] as a model for the landmark displacement measure. In this framework, the distances between landmarks extracted by Dlib [26] are fed to an SVM. For the deep models, the distance between the representations in the embedding domain is considered as the decision criterion. For all the models, the default parameters presented in the original papers are used. It is worth mentioning that in these experiments we do not consider prior knowledge of which of the images in the pair fed to the recognition framework is the trusted image. On the other hand, in Table 3, we assume that the differential morph detection framework is provided with the information regarding the trusted image. For each of the datasets, a portion of the training set is considered as the validation set.
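At test time, the fused similarity of Equation 13 can be computed directly from the three embeddings. A hypothetical sketch, where a pair would be flagged as a morph attack when D falls below a tuned threshold:

```python
import numpy as np

def cos_sim(u, v):
    # Φ(v1, v2) = v1ᵀ v2 / (||v1|| ||v2||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def decision_score(f_i, f_j, a_i, a_j, g_i, g_j, beta_a, beta_g):
    # Eq. (13): D = Φ(f(x_i), f(x_j)) + β_a Φ(a(x_i), a(x_j)) + β_g Φ(g(x_i), g(x_j))
    return (cos_sim(f_i, f_j)
            + beta_a * cos_sim(a_i, a_j)
            + beta_g * cos_sim(g_i, g_j))
```

For a genuine pair all three similarities are high, so D approaches its maximum of 1 + β_a + β_g; a morphed pair pulls one or more of the terms down.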
Then, the parameters to train the framework are selected based on the experiments described in Table 4 and Figure 5. Table 1 presents the performance of the proposed framework in comparison with four deep learning and three classical differential morph detection frameworks. In addition to outperforming the baseline models on all the datasets, the proposed framework outperforms the baselines by a wide margin on the MorGAN dataset, which can be attributed to the disentanglement of the landmark and appearance representations.

In Table 2, we study the performance of the networks trained on the training portion of one morph dataset and tested on the other datasets. As presented in this table, while outperforming the other models, the proposed framework provides high cross-dataset performance between VISAPP17 and AMSL. In addition, the proposed framework provides D-EERs of 8.55% and 7.95% for the network trained on MorGAN and tested on the VISAPP17 and AMSL datasets, respectively. On the other hand, BSIF+SVM outperforms the other algorithms when the network is trained on either of the other two datasets and tested on MorGAN, which illustrates the same trend as the results provided in [10].

Train      Test       Algorithm          D-EER    5%     10%
MorGAN     VISAPP17   LM-Dlib [9, 26]    23.74   51.42  38.67
                      BSIF+SVM [22]      19.21   51.25  39.41
                      ArcFace+SVM [41]   11.67   22.36  14.86
                      Ours                 -       -      -
           AMSL       LM-Dlib [9, 26]    20.67   44.28  32.15
                      BSIF+SVM [22]      17.27   38.54  24.71
                      ArcFace+SVM [41]   10.48   22.49  14.90
                      Ours                 -       -      -
VISAPP17   MorGAN     LM-Dlib [9, 26]    16.82   38.54  24.80
                      BSIF+SVM [22]        -       -      -
           AMSL       LM-Dlib [9, 26]    18.83   38.86  24.78
                      BSIF+SVM [22]      16.92   38.84  24.64
                      ArcFace+SVM [41]    8.27    9.63   5.28
                      Ours                 -       -      -
AMSL       MorGAN     LM-Dlib [9, 26]    16.24   30.94  19.28
                      BSIF+SVM [22]        -       -      -
                      ArcFace+SVM [41]   16.34   38.62  24.51
                      Ours               14.21   28.58  18.51
           VISAPP17   LM-Dlib [9, 26]    20.55   62.21  38.42
                      BSIF+SVM [22]      20.36   51.28  32.95
                      ArcFace+SVM [41]   10.65   14.36   9.81
                      Ours                 -       -      -

Table 2. Cross-dataset performance for differential morph detection: D-EER%, BPCER@APCER=5%, and BPCER@APCER=10%.

Dataset              MorGAN                  VISAPP17                AMSL
                 D-EER    5%     10%     D-EER    5%     10%     D-EER    5%     10%
LM-Dlib [9, 26]   8.14   10.67   7.83    15.67   22.87  20.32    11.67   16.98  14.63
BSIF+SVM [22]     6.07    9.15   4.63    13.87   23.53  20.12    10.53   16.53  13.86
LBP+SVM [27]      7.47    9.23   4.71    15.21   20.64  18.74    12.21   17.11  12.81
FaceNet [42]      8.11   14.52   7.59     7.32   24.54   5.21     7.46   22.12   5.17
ArcFace [11]      7.58    9.64   4.08     6.45   14.78   5.02     5.36   10.46   4.87
FaceNet+SVM       7.23   12.46   5.22     6.37   26.46   6.28     8.42   18.46   5.28
ArcFace+SVM [41]  5.35    6.71   3.50     4.52    5.98   4.05     3.27    5.56   2.69
Ours              4.71    5.32   3.85     3.74    4.91   2.17     2.82    4.97   2.82
Ours*               -       -      -        -       -      -        -       -      -

Table 3. The differential morph detection performance on the three datasets, when the trusted image is known to the detection framework: D-EER%, BPCER@APCER=5%, and BPCER@APCER=10%.

Table 3 studies the effect of the trusted image being known to the detection framework.
For the baseline models, rather than comparing the representations of the trusted image and the image in question, the representation of the image in question is subtracted from the representation of the trusted image before feeding the difference to the SVM. For the proposed framework, we consider an additional algorithm, denoted as "Ours*", in which two dedicated instances of the framework are constructed for trusted images and images in question. In this algorithm, which outperforms the variant for which only one instance of the network is considered, we only train the network dedicated to the images in question. Table 4 provides the performance of the proposed framework on the validation sets when the scaling parameters for decision making in Equation 13 vary. As presented in this table, morph images constructed using landmark displacement are better detected with higher weights given to g(x), while the MorGAN samples are best detected when g(x) and a(x) are given similar weights. In addition, Figure 5 provides the performance for the three datasets when the variance of the normal distribution used to generate the δ_l samples in Equation 1 varies from 0 to 6.
5. Conclusions
In this paper, we presented a novel differential morph detection framework which benefits from disentangling landmark and appearance representations in an embedding space. These two representations, which are disentangled but complementary, are constructed using a disentanglement network trained on triplets of face images. Each triplet consists of two real images and an intermediate image which inherits the landmarks from one image and the appearance from the other. We demonstrated that appearance and landmark disentanglement can be boosted using contrastive representations for each disentangled representation. This property enables accurate differential morph detection using distances in the landmark, appearance, and ID domains. The performance of the proposed framework is studied on three morph datasets constructed with different methodologies.

Figure 5. D-EER% for different variances of the δ_l values in Equation 1.

(β_a, β_g)   MorGAN   VISAPP17   AMSL
(4,1)        10.91    6.97       4.72
(3,1)        9.64     6.57       4.12
(2,2)
(1,4)        10.89    5.12       3.54

Table 4. The D-EER% for differential morph detection performance considering different scaling values (β_a and β_g) in Equation 13.

References

[1] Face Research Lab London Set. https://figshare.com/articles/face_research_lab_london_set/5047666.
[2] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pages 15535–15545, 2019.
[3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In
International Conference on Machine Learning, pages 531–540, 2018.
[4] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–585, 1989.
[5] Fatih Cakir, Kun He, Sarah Adel Bargal, and Stan Sclaroff. MIHash: Online hashing with mutual information. In Proceedings of the IEEE International Conference on Computer Vision, pages 437–445, 2017.
[6] Fatih Cakir, Kun He, Sarah Adel Bargal, and Stan Sclaroff. Hashing with mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(10):2424–2437, 2019.
[7] Ali Dabouei, Sobhan Soleymani, Jeremy Dawson, and Nasser Nasrabadi. Fast geometrically-perturbed adversarial faces. In , pages 1979–1988, 2019.
[8] Ali Dabouei, Fariborz Taherkhani, Sobhan Soleymani, Jeremy Dawson, and Nasser Nasrabadi. Boosting deep face recognition via disentangling appearance and geometry. In The IEEE Winter Conference on Applications of Computer Vision, pages 320–329, 2020.
[9] Naser Damer, Viola Boller, Yaza Wainakh, Fadi Boutros, Philipp Terhörst, Andreas Braun, and Arjan Kuijper. Detecting face morphing attacks by analyzing the directed distances of facial landmarks shifts. In German Conference on Pattern Recognition, pages 518–534, 2018.
[10] Naser Damer, Alexandra Mosegui Saladie, Andreas Braun, and Arjan Kuijper. MorGAN: Recognition vulnerability and attack detectability of face morphing attacks created by generative adversarial network. In , pages 1–10, 2018.
[11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[12] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[13] Matteo Ferrara, Annalisa Franco, and Davide Maltoni. The magic passport. In IEEE International Joint Conference on Biometrics, pages 1–7, 2014.
[14] Matteo Ferrara, Annalisa Franco, and Davide Maltoni. Face demorphing. IEEE Transactions on Information Forensics and Security, 13(4):1008–1017, 2017.
[15] Matteo Ferrara, Annalisa Franco, and Davide Maltoni. Face demorphing in the presence of facial appearance variations. In , pages 2365–2369. IEEE, 2018.
[16] Hiren J Galiyawala, Mehul S Raval, and Anand Laddha. Person retrieval in surveillance videos using deep soft biometrics. In Deep Biometrics, pages 191–214. 2020.
[17] Francis Galton. Personal identification and description. Journal of the Anthropological Institute of Great Britain and Ireland, pages 177–191, 1889.
[18] Ester Gonzalez-Sosa, Julian Fierrez, Ruben Vera-Rodriguez, and Fernando Alonso-Fernandez. Facial soft biometrics for recognition in the wild: Recent works, annotation, and COTS evaluation. IEEE Transactions on Information Forensics and Security, 13(8):2001–2014, 2018.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Aapo Hyvärinen, Jarmo Hurri, and Patrick O Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, volume 39. Springer Science & Business Media, 2009.
[21] Seyed Mehdi Iranmanesh, Ali Dabouei, Sobhan Soleymani, Hadi Kazemi, and Nasser Nasrabadi. Robust facial landmark detection via aggregation on geometrically manipulated faces. In The IEEE Winter Conference on Applications of Computer Vision, pages 330–340, 2020.
[22] Juho Kannala and Esa Rahtu. BSIF: Binarized statistical image features. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 1363–1366, 2012.
[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[24] Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi. Style and content disentanglement in generative adversarial networks. In , pages 848–856, 2019.
[25] Mete Kemertas, Leila Pishdad, Konstantinos G Derpanis, and Afsaneh Fazly. RankMI: A mutual information maximizing ranking loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14362–14371, 2020.
[26] Davis E King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755–1758, 2009.
[27] Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, and Stan Z Li. Learning multi-scale block local binary patterns for face recognition. In International Conference on Biometrics, pages 828–837, 2007.
[28] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[29] Andrey Makrushin, Tom Neubert, and Jana Dittmann. Automatic generation and detection of visually faultless facial morphs. In International Conference on Computer Vision Theory and Applications, volume 7, pages 39–50, 2017.
[30] Tom Neubert, Andrey Makrushin, Mario Hildebrandt, Christian Kraetzer, and Jana Dittmann. Extended StirTrace benchmarking of biometric and forensic qualities of morphed face images. IET Biometrics, 7(4):325–332, 2018.
[31] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[32] Fei Peng, Le-Bing Zhang, and Min Long. FD-GAN: Face de-morphing generative adversarial network for restoring accomplice's facial image. IEEE Access, 7:75122–75131, 2019.
[33] Kiran Raja, Sushma Venkatesh, RB Christoph Busch, et al. Transferable deep-CNN features for detecting digital and print-scanned morphed face images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 10–18, 2017.
[34] Raghavendra Ramachandra, Sushma Venkatesh, Kiran Raja, and Christoph Busch. Towards making morphing attack detection robust using hybrid scale-space colour texture features. In , pages 1–8, 2019.
[35] Ulrich Scherhag, Dhanesh Budhrani, Marta Gomez-Barrero, and Christoph Busch. Detecting morphed face images using facial landmarks. In International Conference on Image and Signal Processing, pages 444–452, 2018.
[36] Ulrich Scherhag, Andreas Nautsch, Christian Rathgeb, Marta Gomez-Barrero, Raymond NJ Veldhuis, Luuk Spreeuwers, Maikel Schils, Davide Maltoni, Patrick Grother, Sebastien Marcel, et al. Biometric systems under morphing attacks: Assessment of morphing techniques and vulnerability reporting. In , pages 1–7, 2017.
[37] Ulrich Scherhag, Ramachandra Raghavendra, Kiran B Raja, Marta Gomez-Barrero, Christian Rathgeb, and Christoph Busch. On the vulnerability of face recognition systems towards morphed face attacks. In , pages 1–6, 2017.
[38] Ulrich Scherhag, Christian Rathgeb, and Christoph Busch. Morph detection from single face image: A multi-algorithm fusion approach. In Proceedings of the 2018 2nd International Conference on Biometric Engineering and Applications, pages 6–12, 2018.
[39] Ulrich Scherhag, Christian Rathgeb, and Christoph Busch. Towards detection of morphed face images in electronic travel documents. In , pages 187–192, 2018.
[40] Ulrich Scherhag, Christian Rathgeb, Johannes Merkle, Ralph Breithaupt, and Christoph Busch. Face recognition systems under morphing attacks: A survey. IEEE Access, 7:23012–23026, 2019.
[41] Ulrich Scherhag, Christian Rathgeb, Johannes Merkle, and Christoph Busch. Deep face representations for differential morphing attack detection. arXiv preprint arXiv:2001.01202, 2020.
[42] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[43] Clemens Seibold, Wojciech Samek, Anna Hilsmann, and Peter Eisert. Detection of face morphing attacks by deep learning. In International Workshop on Digital Watermarking, pages 107–120, 2017.
[44] Clemens Seibold, Wojciech Samek, Anna Hilsmann, and Peter Eisert. Accurate and robust neural networks for security related applications exampled by face morphing attacks. arXiv preprint arXiv:1806.04265, 2018.
[45] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Adversarial generative nets: Neural network attacks on state-of-the-art face recognition. arXiv preprint arXiv:1801.00349, 2(3), 2017.
[46] Zhixin Shu, Mihir Sahasrabudhe, Riza Alp Guler, Dimitris Samaras, Nikos Paragios, and Iasonas Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 650–665, 2018.
[47] Luuk Spreeuwers, Maikel Schils, and Raymond Veldhuis. Towards robust evaluation of face morphing detection. In , pages 1027–1031. IEEE, 2018.
[48] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
[49] Sushma Venkatesh, Raghavendra Ramachandra, Kiran Raja, Luuk Spreeuwers, Raymond Veldhuis, and Christoph Busch. Detecting morphed face attacks using residual noise from deep multi-scale context aggregation network. In The IEEE Winter Conference on Applications of Computer Vision, pages 280–289, 2020.
[50] Sushma Venkatesh, Haoyu Zhang, Raghavendra Ramachandra, Kiran Raja, Naser Damer, and Christoph Busch. Can GAN generated morphs threaten face recognition systems equally as landmark based morphs? Vulnerability and detection. In , pages 1–6, 2020.
[51] Lukasz Wandzik, Gerald Kaeding, and Raul Vicente Garcia. Morphing detection using a general-purpose face recognition system. In , pages 1012–1016, 2018.
[52] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515, 2016.
[53] George Wolberg. Image morphing: a survey. The Visual Computer, 14(8-9):360–372, 1998.
[54] Andreas Wolf. Portrait quality (reference facial images for MRTD). Version: 0.06, ICAO, Published by authority of the Secretary General, 2016.
[55] Xianglei Xing, Ruiqi Gao, Tian Han, Song-Chun Zhu, and Ying Nian Wu. Deformable generator network: Unsupervised disentanglement of appearance and geometry. arXiv preprint arXiv:1806.06298, 2018.
[56] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.