Learning Transformation-Aware Embeddings for Image Forensics
Aparna Bharati, Daniel Moreira, Patrick Flynn, Anderson Rocha, Kevin Bowyer, Walter Scheirer
University of Notre Dame, IN, USA
University of Campinas, SP, Brazil
Abstract
A dramatic rise in the flow of manipulated image content on the Internet has led to an aggressive response from the media forensics research community. New efforts have incorporated increased usage of techniques from computer vision and machine learning to detect and profile the space of image manipulations. This paper addresses Image Provenance Analysis, which aims at discovering relationships among different manipulated image versions that share content. One of the main sub-problems for provenance analysis that has not yet been addressed directly is the edit ordering of images that share full content or are near-duplicates. The existing large networks that generate image descriptors for tasks such as object recognition may not encode the subtle differences between these image covariates. This paper introduces a novel deep learning-based approach to provide a plausible ordering to images that have been generated from a single image through transformations. Our approach learns transformation-aware descriptors using weak supervision via composited transformations and a rank-based quadruplet loss. To establish the efficacy of the proposed approach, comparisons with state-of-the-art handcrafted and deep learning-based descriptors, and image matching approaches are made. Further experimentation validates the proposed approach in the context of image provenance analysis.
1. Introduction
In the fight against the spread of disinformation [2, 17] through manipulated media content [1], understanding the story and intent behind the manipulated object is critical. Whether the manipulations are benign or malicious, a step-by-step analysis of how the current version of the manipulated image or video was generated helps us in answering more holistic and contextual questions than just whether a given image or video is real or fake. Unlike the early days of the web, a media object today does not exist in isolation. Most often there are multiple versions of a single image online at any given time (Fig. 1). Tracing the uploads, downloads and re-uploads of original image content with modifications can help assess the reach of that content. If an image is a composite, analyzing its different versions and other images that donated content to its creation can provide explanations for the types of manipulations and their complexity. Beyond images on social media, we can also consider other image-based applications such as autonomous driving, where it is important to know if the data being used to train a visual recognition system is the original version and distinguish between benign and malicious data attacks.

Figure 1. An image search on the web using keywords related to some fake photos that went viral (Google search keywords: most+viral+fake+photo) reveals many versions of the same photo in the related images section. Most of these images are transformed versions of the original. For example, some cropped and perspective transformed versions appear in the related images for query (a), a sheared image version appears for (b) and some color transformed images appear in the related section of query (c). Given multiple variants for any given image, an important question is: can we find the original? Not necessarily just for maliciously manipulated images, this question is valid for a number of circumstances where image manipulation is present. We propose a framework that helps determine the ordering of image transformations to trace the provenance of the content.
Image Provenance Analysis [34] aims to understand the evolution of a media object in question by establishing pairwise associations among images related in content to generate its provenance graph. An example of provenance analysis for a small set of images is shown in Figure 2. Provenance analysis usually involves two stages: (1) a specialized image retrieval stage to obtain images related to a query, called provenance filtering, and (2) a graph construction stage to model the relationships between the retrieved images. While both steps employ vision-based techniques (they rely on establishing image correspondences), there has been limited research on the latter problem.

Figure 2. An example of image ordering in the context of image provenance analysis. Given a set of near-duplicate images, an ordered set where the order explains the process of transforming one image to the other is desired as output. The figure includes a schematic representation of the proposed technique used to achieve this. The goal is to order relative distances between compositely-transformed images through the use of a ranking-based quadruplet loss function.

In the literature, provenance graph construction algorithms are based on image phylogeny algorithms [15, 14, 40]. Provenance graph construction uses generalized definitions of image relationships and circumvents the phylogeny constraints such as specific image formats and a two-donor limit for composite images [5]. These algorithms use points of interest to establish image correspondence and compute pairwise image dissimilarity by describing matched local regions. Upon computing a dissimilarity score for all possible pairs, a greedy algorithm is employed to create a minimally connected Directed Acyclic Graph (DAG) [34] of images. Depending on the type of manipulations performed on an image, the graph can be very complicated, with multiple donor images (images that share partial content) and long chains formed by near-duplicate images (images derived from a single image through a series of transformations). This work focuses on improving the fidelity of reconstructing the chains in the provenance graphs.

Greedy graph algorithms treat dissimilarity values as adjacency weights and rely heavily on invariant image matching. Image matching is a common task in many computer vision problems such as object recognition, 3D reconstruction, scene understanding and image retrieval. A desired property for representations used to solve these problems is invariance to view changes, compression and other image transformations. Optimizing this property while learning the mapping between the data X and label Y can easily miss understanding the fine differences between image transformations. Thus invariance becomes something of a misnomer in the context of forensics.

We propose to learn representations that are aware of different versions of images in the transformation space, and depending on the number of transformations, can encode appropriate distance among near-duplicate images. This approach can be useful for improving dissimilarity computation between different versions of an image. Better understanding of the subtle differences in the variants can lead to improved rank-based output in deducing a sequence of transformations for image forensics [18] and cultural analytics [33]. It can also be used to define acceptable standards of edited data for learning algorithms.
To our knowledge, this is the first work that focuses on solving the image ordering aspect of provenance analysis. This work also highlights the importance of awareness of transformation-based differences among near-duplicate images while learning image representations.
2. Related Work
Most of the recent research related to image provenance analysis and other media forensics problems [34, 25, 57, 16, 50, 42] is related to the DARPA Media Forensics program [12] through its challenges and datasets. The first end-to-end algorithm [34] proposed for provenance analysis coming out of that program followed a two-step strategy to obtain a graph-based relationship representation of images that are near-duplicates of or share partial content with the query image. Firstly, a specialized image retrieval algorithm that employed Speeded Up Robust Features (SURF) [3] for description, Optimized Product Quantization (OPQ) [20, 26] for efficient indexing, and iterative filtering is used to obtain a list of images related to the query. In the second step, a keypoint-based pairwise image comparison is performed to obtain a dissimilarity matrix. The most feasible provenance graph is created using this matrix through a hierarchical clustering-based graph expansion method. The latter step can also be interpreted as ordering pair similarities between multiple image pairs, and is therefore a natural extension to pairwise image comparison. For output, the ordered pairings are modeled as a graph, where each edge denotes a transformation-based correspondence, and the images in the pair are the vertices [5].

Image ordering can be defined as a specific task that outputs an ordered list of provided input images along with producing a matching score between two images. This can correspond to ranking images with respect to distance from a query image in retrieval algorithms or identification scenarios in recognition. Approaches in the existing literature which are a source of inspiration for the proposed work are discussed below. We categorize them broadly to explain how advances in techniques for these general tasks become useful in designing a pipeline to learn transformation-aware embeddings for ordering images in provenance analysis.

2.1. Image Matching
Ordering can be considered as an additional step to image matching, as it involves a measure of how well images match. Image matching algorithms use points of interest to learn whether two images have the same content with respect to structure [3, 51, 30]. Such algorithms are integral parts of correspondence tasks where one cares about the exact locations at which two image contents match. Due to the nature of the applications that require image correspondence (dense or sparse), these matches [52, 41] are intended to encode invariance across a range of transformations, i.e., they attempt to compute a feature space where different versions of an image region map to the same vicinity. In the case of handcrafted feature matching, the design of keypoint detectors and patch descriptors such as SIFT [32] and SURF [3] mostly imparts invariance to linear affine transformations such as scale, translation, rotation and others.

For learning-based correspondence, invariance to more transformations can be achieved through hierarchical convolutional architectures and by introducing transformed or augmented versions of data points during training [55, 19]. Encoding invariance in one way or the other helps to improve the robustness of matching and make it stable under extreme transformations or view-point changes. For applications such as provenance analysis, we require an understanding of the measure of invariance (i.e., how much invariance is present) in addition to mapping features of near-duplicate or similar images closer to each other than those of dissimilar images.
Learning an embedding space and similarity score with respect to a specific semantic label becomes useful in creating generalizable frameworks that can be used for classification as well as clustering tasks. They are more suited for open set scenarios, as the scores are learned on a match vs. non-match basis and do not compute a per-class probability forcing the score to be maximized for one of n classes. Classification-based losses for metric learning have shown promising results in applications such as face verification, but pairwise losses are more prevalent for image retrieval scenarios [53, 35]. Intuitively, networks trained with pairwise losses are intended to generalize better. Among the large pool of such loss functions including, but not limited to, contrastive [45], triplet [29] and N-pair [46], triplet loss is the most popular and is effective in numerous image comparison tasks. Most of these approaches learn similarity in a presumably metric space [44] but mostly for classification. Methods that use augmented triplet loss [55, 56] for detection and description in a transformation-aware space have inspired this work. The method proposed in this paper uses quadruplets that contain an extra weak positive sample for ordering distance between positive samples. A different quadruplet approach was employed by Chen et al. [10] for person re-identification and Zhang et al. [54] for image correspondence. Chen et al. claim better generalization for their test sets through the use of two references, one each for the positive and negative samples in the quadruplets (a set of two triplets), than triplet loss.

Image ordering involves comparing all possible image pairs and then ranking them based on the distance from the reference or query. Hence, this operation is similar to algorithms from the information retrieval domain such as RankNet [6] and ListNet [8] that aim towards learning a correct ranking among a set of objects. RankNet minimizes the number of inversions in the rank as its objective, whereas ListNet employs listwise loss functions. Chen et al. [11] explain the relationship between losses used for learning to rank and the evaluation metrics. They do so using essential loss to model ranking as a sequence of classification tasks and establish that the minimization of ranking losses leads to the maximization of the evaluation measure. Our quadruplets are lists of four images where three are related through image transformations.

So far, in the image domain, learning to rank methods have focused more on semantically similar images versus dissimilar images rather than near-duplicates [24, 7]. Semantically similar images are images that share content but have not been derived from the same image and come from different imaging pipelines. It makes sense for applications such as content-based image retrieval to focus more on such images, as they care about variety and relevance in the retrieved results. Retrieving near-duplicates would not be very useful in such scenarios, as a user would want to see different images (in terms of camera, view and context) of the same content. In contrast, in order to perform provenance analysis on any image, it becomes imperative to attempt to understand the near-duplicate variants at a finer granularity in terms of the image space.
3. Transformation-Aware Distance Learning
A fundamental task in provenance analysis, like other visual recognition problems, is computing the dissimilarity between two images. Techniques currently used in the literature, such as keypoint matching and color-wise content comparison with mutual information, present the drawback of not taking into account modifications other than affine and basic color transformations. They also do not consider complex transformations and their ordering. Aiming to increase the reliability of dissimilarity computation between two images that share content in a way where one gives origin to the other, after a number of transformations, we propose the use of deep distance learning.

Figure 3. Examples of quadruplets used for training in the proposed algorithm. Each row depicts one set of quadruplets, in the following column-wise order: the anchor, the positive (after one transformation), the weak positive (after two transformations), and the negative (unrelated) patch. The number of transformations is for illustration sake only; other configurations are possible (see Sec. 4). Anchors and negatives come from the COCO dataset [31].
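To make the compositing of such quadruplets concrete, the sketch below builds one quadruplet from an anchor patch using a few Pillow operations. It is an illustration only: the helper names and the small transformation pool are placeholders, the actual pool used for training is listed later in this section, and in the real pipeline the transformations are applied to full images with patches re-sampled around tracked keypoints (see Sec. 4).

```python
import io
import random

from PIL import Image, ImageEnhance, ImageFilter

# A small, illustrative pool of edits; the pool used for training also covers
# projective warps, shearing, gamma correction, grayscaling, and sharpening.
def _rescale(img):
    return img.resize((max(1, int(img.width * 0.8)), max(1, int(img.height * 0.8))))

def _rotate(img):
    return img.rotate(random.uniform(-15, 15))

def _brighten(img):
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))

def _blur(img):
    return img.filter(ImageFilter.GaussianBlur(radius=1.5))

def _jpeg(img):
    # Data-lossy compression: round-trip the patch through an in-memory JPEG.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(50, 90))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

POOL = [_rescale, _rotate, _brighten, _blur, _jpeg]

def compose(img, n_ops):
    """Apply n_ops randomly chosen edits, each on top of the result of the previous one."""
    for op in random.choices(POOL, k=n_ops):
        img = op(img)
    return img

def make_quadruplet(anchor, negative, M=1, N=1):
    """Build (anchor, positive, weak positive, negative) as in Figure 3."""
    positive = compose(anchor, M)          # positive = anchor after M transformations
    weak_positive = compose(positive, N)   # weak positive = positive after N more transformations
    return anchor, positive, weak_positive, negative
```

With M = N = 1 this reproduces the configuration shown in Figure 3 (one transformation for the positive, two for the weak positive); other values of M and N are equally valid.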
A key contribution of the proposed framework is a ranking-based network training approach that learns how to express the dissimilarity between such images as a function of the number of transformations. We describe the approach in detail in the following sections.
Similar to triplet-loss training, we want to encode images in a space where, given an image of reference (i.e., the anchor), positively related images lie closer to it than unrelated images. For our purpose, we have the additional requirement of ordering the positively related images, which limits the use of the conventional triplet sampling regime. Hence, we propose learning embeddings using quadruplets to better facilitate ordering among the related images.

When training the network, we provide sets of four patches, namely (i) the anchor patch, which represents the original content, (ii) the positive patch, which stores the anchor after M image processing transformations, (iii) the weak positive patch, which stores the positive patch after N transformations, and (iv) the negative patch, a patch that is unrelated to the others (see Figure 3). The idea is to train the embedding network to provide a distance score to a given pair of patches, where the output score between the anchor and the positive patch is smaller than the one between the anchor and the weak positive, which, in turn, is smaller than the score between the anchor and the negative patch.

To obtain the quadruplets of patches for training, we employ a specific set of image transformations that are of interest to provenance analysis. They are: (i) projective changes (e.g., content scaling, rotation, flipping, shear, and projection), (ii) color-space changes (e.g., changes in brightness, in contrast, gamma correction, and grayscaling), (iii) frequency-space changes (e.g., blurring and sharpening), and (iv) data-lossy compression. For each anchor patch, random transformations from this pool are sequentially applied, one on top of the result of the other, allowing us to generate positive and weak positive patches from the anchor, after M and M + N transformations, respectively. Figure 3 depicts a few examples of these quadruplets.

The network used to learn representations is a four-way Siamese structure with all four branches of the network sharing weights (see the training block of Figure 4). Each embedding module is a five-layer convolutional neural network (CNN) with one batch normalization layer and two fully connected layers, very similar to the one used in [9] to identify image processing operations. In contrast to them, we do not preprocess the images using high-pass filters and let the network learn useful noise patterns depending on the transformation. Our convolutional layers use the Rectified Linear Unit (ReLU) as the activation function for faster convergence. Max-pooling is applied to feature maps from convolutional layers. We use relatively large kernels in our convolutional layers to have a larger receptive field for our learned feature maps. Details of the network architecture are provided in Table 1.

Table 1. Details of the architecture of the embedding network. The convolutional layer structure is defined as C × H × W (Padding). The dimensionality of the learned feature for each patch is 256. The number of trainable parameters in the network is 7,564,800.

Layer        Structure      Output   Parameters
Conv1+bn     32x11x11 (9)   72x72    11,712
MaxPool      (2,2)          36x36    –
Conv2+bn     64x9x9 (7)     42x42    166,080
MaxPool      (2,2)          21x21    –
Conv3+bn     128x7x7 (5)    25x25    401,792
MaxPool      (2,2)          12x12    –
Conv4+bn     256x5x5 (3)    14x14    819,968
MaxPool      (2,2)          7x7      –
Conv5+bn     512x3x3 (1)    7x7      1,181,184
MaxPool      (2,2)          3x3      –
Linear       –              1024     4,719,616
BatchNorm1d  –              1024     2,048
Linear       –              256      262,400
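The architecture in Table 1 can be expressed compactly in code. The PyTorch sketch below reproduces the layer shapes and parameter counts of Table 1 for 3-channel 64×64 input patches (an input size consistent with Conv1 producing 72×72 maps with an 11×11 kernel and padding of 9); the exact placement of ReLU relative to batch normalization and the final L2 normalization of the embedding are assumptions of this sketch, not statements from the paper.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Sketch of the five-layer embedding CNN of Table 1 (3x64x64 patches -> 256-d features)."""

    def __init__(self, feat_dim=256):
        super().__init__()

        def block(cin, cout, k, pad):
            # Conv + batch norm + ReLU + 2x2 max pooling, one block per row pair of Table 1.
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=k, padding=pad),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
            )

        # Spatial sizes follow Table 1; total trainable parameters: 7,564,800.
        self.features = nn.Sequential(
            block(3, 32, 11, 9),    # 64x64 -> 72x72 -> 36x36
            block(32, 64, 9, 7),    # -> 42x42 -> 21x21
            block(64, 128, 7, 5),   # -> 25x25 -> 12x12
            block(128, 256, 5, 3),  # -> 14x14 -> 7x7
            block(256, 512, 3, 1),  # -> 7x7  -> 3x3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 3 * 3, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, x):
        # L2-normalizing the output keeps pairwise distances bounded (an assumption here,
        # consistent with the margin discussion below).
        return nn.functional.normalize(self.head(self.features(x)), dim=1)
```

Since the four branches of the Siamese structure share weights, a single instance of this module can be applied to the anchor, positive, weak positive, and negative patches of each quadruplet.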
The features extracted from the embedding units are used by the pairwise distance ranking unit to compute the loss. We employ an L2-distance-based pairwise margin ranking loss to learn image embeddings that match the order of generation of near-duplicate images through edits and transformations. Given a feature vector for an anchor image patch a and two transformed derivatives of the anchor patch p (positive) and p' (weak positive), where p = T_M(a) and p' = T_N(T_M(a)), and an unrelated image patch from a different image n, the objective is to learn embeddings such that d(a, p) < d(a, p') < d(a, n), a quadruplet-based similarity precision rank [48]. Here, d(.) is the L2-distance between two embeddings, T_M is the series of M transformations applied to a to generate p, and T_N is the series of N transformations further applied on p to create p'.

Figure 4. The left box (red) shows the proposed framework used to train a network that can provide transformation-aware embeddings to input regions. The right box (blue) shows steps for how the trained embedding network is used to construct provenance graphs. Once the feature extractor is trained, we describe image regions from the test set, use L2-distance to compute pairwise image dissimilarity matrices and use a greedy approach to connect images into a provenance graph.

In order to achieve the rank objective, we minimize:

L(a, p, p', n) = max(0, −y × (d(a, p') − d(a, n)) + µ1)
              + max(0, −y × (d(p, p') − d(p, n)) + µ2)
              + max(0, −y × (d(a, p) − d(a, p')) + µ3)    (1)

In the above loss function, y is the truth function (analogous to labels for classification) which determines the rank order [43], and µ1, µ2 and µ3 are margins corresponding to each pairwise distance term, treated as hyperparameters. The ranking operates in the score space rather than the embedding space. Also, µ1 > µ2 > µ3, in order to cluster stronger positives (less transformed versions) closer to the anchor image embedding. Values for these margins are chosen empirically over a small range of proportional values in [0.01, 0.1]. Here, the proportion is based on the fact that distances are L2 normalized and d(a, n) >> {d(a, p), d(a, p')} is desirable. Margin ranking loss determines the loss value based on the sign of y and the difference between the two distances. Both having the same sign implies the ordering is correct and the loss is zero. A positive loss is accumulated when the ordering is wrong and they are of opposite signs. The combination of these terms induces an ordering on the four units in the quadruplet.

We use three out of the four possible topologically ordered triplets from the quadruplets, i.e., (a, p, p'), (a, p', n) and (p, p', n), and claim that they are enough to rank the involved pairwise distances in an increasing order. The fourth triplet (a, p, n) is redundant, as the ordering of pairs (a, p) and (a, n) is already covered by the remaining terms of the loss through transitivity, i.e., if d(a, p) < d(a, p') and d(a, p') < d(a, n), then d(a, p) < d(a, n).
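As a minimal PyTorch sketch of Eq. 1, the function below applies torch's margin ranking loss to the three distance pairs; a target of −1 asks each first distance to be smaller than the second (one possible encoding of the truth function y), and the margin values shown are placeholders within the [0.01, 0.1] range discussed above.

```python
import torch
import torch.nn.functional as F

def quadruplet_ranking_loss(f_a, f_p, f_wp, f_n, margins=(0.1, 0.05, 0.01)):
    """Sketch of the rank-based quadruplet loss of Eq. 1.

    f_a, f_p, f_wp, f_n: embeddings of anchor, positive, weak positive, negative (B x D).
    margins: (mu1, mu2, mu3) with mu1 > mu2 > mu3; the values here are illustrative.
    """
    mu1, mu2, mu3 = margins
    d = F.pairwise_distance  # row-wise L2 distance between embeddings

    # target = -1 makes the ranking loss push the first distance below the second one.
    y = -torch.ones(f_a.size(0), device=f_a.device)

    t1 = F.margin_ranking_loss(d(f_a, f_wp), d(f_a, f_n), y, margin=mu1)  # d(a, p') < d(a, n)
    t2 = F.margin_ranking_loss(d(f_p, f_wp), d(f_p, f_n), y, margin=mu2)  # d(p, p') < d(p, n)
    t3 = F.margin_ranking_loss(d(f_a, f_p),  d(f_a, f_wp), y, margin=mu3) # d(a, p)  < d(a, p')
    return t1 + t2 + t3
```

In training, f_a, f_p, f_wp and f_n would be the 256-dimensional embeddings produced by the shared embedding module for the four patches of a quadruplet.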
Minimizing the proposed loss maximizes the similarity precision, as the first distance in each of the three terms of the loss is maximally bounded by the second distance in that term, leading to the desired quadruplet ordering d(a, p) < d(a, p') < d(a, n) (see Figure 4). The above loss is optimized using the Stochastic Gradient Descent method with Nesterov Momentum [47]. The learning rate is reduced by a fixed factor every time the validation loss plateaus. The network is trained until convergence (usually around 100 epochs). The model is saved for epochs where the average similarity precision for the validation set of quadruplets improves over the previous best model. The model corresponding to the best measure for validation is used for feature extraction from patches of test images.

Features learned with the proposed technique are used to perform image comparison for image provenance analysis. For this paper, we assume that the related set of images for a given image is provided to us through an accurate retrieval method similar to the oracle scenario described in [5, 34]. Once a set of k images related to a query image is obtained, provenance graph construction involves the following steps:
1. Creation of dissimilarity matrix. All possible image pairs from the retrieved image set are compared to obtain a similarity or distance score that is used to create a dissimilarity matrix D of size k × k. Depending on the properties of the dissimilarity measures, the matrix can be symmetric or asymmetric. In order to use the proposed framework for obtaining pairwise image dissimilarity scores, we extract patches of the same size as those in the quadruplets used for training from each of the k images. These patches are then described using the trained embedding module of the proposed framework. D[i, j] between image i and image j is computed by matching the set of features from one image to the other using an all-to-all brute force matching strategy. We only consider the bidirectionally consistent top match for each patch and compute the average L2-distance for all the patches with their best matches. This is computed for each of the (k choose 2) possible image pairs to get a symmetric dissimilarity matrix D.
2. Combining image pairs to construct a graph. Considering the matrix D as an adjacency matrix with the dissimilarity values as edge weights of a fully connected graph, Kruskal's optimum spanning tree algorithm [28] is employed to connect images in a greedy manner. Since we use L2-distance between embeddings as dissimilarity values, the algorithm chooses edges with lower weights first and adds (n − 1) such edges until all the n images are connected. The result is an undirected ordered graph with n related images as nodes, where each selected edge corresponds to the image transformation between the two connected vertices. A compact sketch of both steps is given at the end of this section.

We focus on the graph construction part of image provenance analysis as it entails reconstructing a contiguous sequence of image transformations (similar to the output chain in Figure 2). As the proposed approach is agnostic to the directionality of transformation between a given pair of images, for the purposes of this work, we create and evaluate undirected provenance graphs similar to Bharati et al. [5]. An asymmetric distance function that correctly infers whether an operation was performed from image 1 to image 2 or a reverse operation from image 2 to 1 based solely on visual content is hard to compute, as pointed out in the literature [4]. Since we employ a greedy approach to connect images for provenance graph building, the ordering between pairwise distances is still important as it governs the choice of image pairs that will share a direct vs. an indirect relationship in the graph.
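The sketch below illustrates the two steps above under simplifying assumptions: it keeps the mutual (bidirectionally consistent) top matches between the patch embeddings of two images, averages their L2 distances into D[i, j], and then extracts a minimum spanning tree, which for distinct edge weights is the same tree Kruskal's greedy selection would produce. The function names and the fallback for the no-mutual-match case are choices of this sketch, not of the released pipeline.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_dissimilarity(feats_i, feats_j):
    """Average L2 distance over mutually consistent top matches.

    feats_i, feats_j: (n_patches, 256) arrays of patch embeddings for two images.
    """
    dists = np.linalg.norm(feats_i[:, None, :] - feats_j[None, :, :], axis=-1)
    best_ij = dists.argmin(axis=1)   # best match in image j for each patch of image i
    best_ji = dists.argmin(axis=0)   # best match in image i for each patch of image j
    mutual = [(p, best_ij[p]) for p in range(len(feats_i)) if best_ji[best_ij[p]] == p]
    if not mutual:
        return float(dists.min())    # fallback when no mutual match exists (assumption)
    return float(np.mean([dists[p, q] for p, q in mutual]))

def build_provenance_graph(image_feats):
    """image_feats: list of (n_patches, 256) arrays, one per retrieved image.

    Returns the symmetric dissimilarity matrix D and the (k - 1) undirected spanning-tree edges.
    """
    k = len(image_feats)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            D[i, j] = D[j, i] = pairwise_dissimilarity(image_feats[i], image_feats[j])
    # Greedy selection of lowest-weight edges until all images are connected
    # (assumes all off-diagonal dissimilarities are strictly positive).
    mst = minimum_spanning_tree(D)
    edges = list(zip(*mst.nonzero()))
    return D, edges
```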
4. Experiments
The transformation-aware embedding model is trained using quadruplets of image patches. Patches are sampled centered on SURF keypoints that are previously described and tracked across the original and transformed images. Negative patches have the same size and are obtained from keypoints detected over unrelated images. During evaluation, patches are sampled from the images and described with the trained network, leading to 256-dimensional feature vectors whose pairwise L2-distances are used to create a dissimilarity matrix as explained in Sec. 3.3. Once the dissimilarity matrix is computed, undirected graphs are generated using Kruskal's minimum (or maximum, depending on whether the comparison score is a distance or a similarity) spanning tree algorithm. Code for the full pipeline to perform provenance graph construction using transformation-aware learned embeddings will be released to the community upon the acceptance of this paper for publication.

The DARPA Media Forensics program has curated and released multiple datasets for image provenance analysis as part of its annual challenges. These datasets contain good quality images of generic scenes, along with corresponding manipulated versions with a record of each step of manipulation and intermediate versions in the form of graph journals. The manipulations were performed using GIMP, an open source image editing tool, by professional artists [22]. The journals capture a vast array of image manipulation operations, from color-based transformations to anti-forensic operations. We evaluate our method on two DARPA datasets:
1. Nimble 2017 Challenge Dataset.
We use the NC2017-Dev1-Beta4 set [37], which contains provenance journals for 65 queries. Similar to the experimental setup described in [34], we also use the development partition of this dataset since it provides a full set of ground-truth graphs. The order of the ground-truth graphs starts at 2, with the average graph order being 13.6. The resolution of images in the dataset varies widely, with the average resolution being 5.9 megapixels.
2. Media Forensics 2018 Challenge Dataset.
We use the MFC18-Dev1-Ver1 [22] partition of the 2018 challenge dataset. This partition has provenance journals related to 258 query images. The average graph order is 14.3 and the average resolution of images is 10.1 megapixels. The distribution of manipulations in this dataset is different from those in NC2017-Dev1-Beta4 [22]. These provenance cases have a larger set of manipulations. For example, operations such as recapture and CGI-Fill are not present in the NC2017-Dev1-Beta4 dataset. Please check the supplemental material for more details on the available manipulations in the dataset.

The DARPA Media Forensics challenges and evaluation protocols for image provenance analysis contain two tasks: (1) an end-to-end analysis, and (2) an oracle framework. The first protocol includes retrieval of related images given a query and construction of final provenance graphs. This task includes distractors in the pool of world images along with those related to the query. The oracle framework provides a standalone estimate of the effectiveness of the graph construction solution, where errors from the previous step are not taken into account. For this work, we assume that we have access to all of the related image components for performing provenance analysis for a query image, because our focus is the ordering of near-duplicate images.
The first category of comparison methods incorporates detection of salient keypoints in the images and matches local regions around those points with those in other images. For experiments on NC17-Dev1-Beta4, we consider SURF [3], LIFT [51], DELF [39] and DeepMatching [41], a hierarchical dense local region matching technique. SURF and DeepMatching are handcrafted, while LIFT and DELF are learned local features. For the LIFT experiment, we use the published model trained on the Piccadilly Circus images [49]. The model used for DELF was trained on the Google Landmarks dataset [21]. After matching, we perform geometric consistency verification [5] for matches from 2000 keypoints for SURF, 500 for LIFT, and all detected regions for DELF. DeepMatching inherently performs this verification step by design, enabling us to use the matches directly. The number of consistent matches between two images is used as the score in the dissimilarity matrix, which is then used to create the graphs.

The second set of methods are popular data-driven learned descriptors using deep convolutional neural networks such as AlexNet [27] and ResNet (18 layers) [23], both trained on the ImageNet dataset [13]. For these, we describe patches extracted from the entire image. We then match pairs of patch sets between images as described in Section 3.3, and create the dissimilarity matrix. The graphs are created in the same manner for all the methods in consideration. We later choose the best performing methods from the different types to evaluate performance on MFC18-Dev1-Ver1 and report results using the same steps as used for NC17-Dev1-Beta4.

During training, the 4-way network (see the training panel in Figure 4) is evaluated using similarity precision for quadruplets, similar to its usage for triplets in [48]. This metric is defined as the percentage of quadruplets in the validation set for which the pairwise distances are correctly ordered as d(a, p) < d(a, p') < d(a, n). To evaluate the proposed framework for provenance graph construction, we generate graphs by creating dissimilarity matrices using distance scores between learned embeddings and Kruskal's optimum spanning tree algorithm. We compare them to the ground-truth graphs for the datasets using their metrics for the provenance task [38].

The metrics are computed by comparing the nodes and edges from both ground-truth and candidate graphs. The corresponding measures of Vertex Overlap (VO) and Edge Overlap (EO) are the harmonic mean of precision and recall (F1 score) for the graph vertices and edges retrieved by our method. In addition to these, a unified metric representing one score for the graph overlap, namely the Vertex Edge Overlap (VEO), is also reported. The VEO is the combined F1 score for vertices and edges. All the metrics are computed through the MediScore tool [36], with the undirected graph option. The values of these metrics lie in the range [0, 1], where higher values are better.
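For reference, a minimal sketch of these evaluation measures is shown below, assuming graphs are given as node sets and undirected edge sets; the exact MediScore implementation may differ in details, and the similarity_precision helper simply counts correctly ordered validation quadruplets.

```python
import numpy as np

def similarity_precision(d_ap, d_awp, d_an):
    """Fraction of validation quadruplets with d(a,p) < d(a,p') < d(a,n)."""
    d_ap, d_awp, d_an = map(np.asarray, (d_ap, d_awp, d_an))
    return float(np.mean((d_ap < d_awp) & (d_awp < d_an)))

def overlap_scores(gt_nodes, gt_edges, cand_nodes, cand_edges):
    """Vertex Overlap (VO), Edge Overlap (EO) and Vertex Edge Overlap (VEO) as F1 scores."""
    def f1(p, r):
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    gt_n, cd_n = set(gt_nodes), set(cand_nodes)
    gt_e = {frozenset(e) for e in gt_edges}    # undirected edges compared as unordered pairs
    cd_e = {frozenset(e) for e in cand_edges}

    n_hits, e_hits = len(gt_n & cd_n), len(gt_e & cd_e)
    vo = f1(n_hits / max(len(cd_n), 1), n_hits / max(len(gt_n), 1))
    eo = f1(e_hits / max(len(cd_e), 1), e_hits / max(len(gt_e), 1))
    # VEO: a single F1 computed jointly over vertices and edges.
    veo = f1((n_hits + e_hits) / max(len(cd_n) + len(cd_e), 1),
             (n_hits + e_hits) / max(len(gt_n) + len(gt_e), 1))
    return vo, eo, veo
```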
5. Results
Experiments for provenance graph construction show the efficacy of the proposed approach in comparison to existing alternatives. For each line of comparison presented, only the image correspondence measure is changed while the remainder of the provenance analysis pipeline is kept the same. Results presented in Tables 2 and 3 show that employing transformation-aware descriptors for image dissimilarity computation improves the average Vertex Overlap for provenance analysis. The CNN-based descriptors adapted from the area of object recognition have an advantage over keypoint-based approaches, as the images that do not share a sufficient number of keypoint matches with others may not be included in the final connected graph. This leads to better node overlap for learned techniques where an all-patch-to-all-patch matching strategy is employed. That said, one of the reasons behind the popularity of keypoint-based approaches for provenance-type applications is their efficiency and subsequent ability to scale to larger problems. In this regard, the handcrafted SURF approach has an edge over the learning-based keypoint approaches (DELF and LIFT): describing 2000 SURF keypoints for an image on average takes less than 1 second, whereas the learned keypoint descriptors take substantially longer.

Table 2. Provenance graph construction over the NC2017-Dev1-Beta4 dataset (oracle mode). We report the mean and the standard deviation of 65 cases for the metrics presented in Sec. 4. In bold: best results. TAE stands for Transformation Aware Embeddings learned using the proposed approach. NA stands for not applicable. Please see supplemental material for qualitative results.

Table 3. Provenance graph construction over the MFC18-Dev1-Ver1 dataset (oracle mode). Means and standard deviations of 258 cases are reported for the metrics presented in Sec. 4. Only the best deep learning-based keypoint and full-image approaches from Table 2 are reported here. In bold: best results.

The measures of edge overlap in our results reveal that using transformation-aware image descriptors improves ordering and selection of edges over general deep learning-based descriptors. Despite being trained with less image data by an order of magnitude, the network used in this paper is more effective for the provenance graph building task. In addition to being more effective for provenance-based vertex and edge selection, our method is more efficient in terms of the number of training parameters and disk space. In comparison to approximately 61M parameters for AlexNet [27] and 11M parameters for ResNet-18 [23], the embedding network used in the proposed approach contains only about 7.6M parameters (Table 1).
6. Discussion
Estimating the order of manipulations performed on an image to generate the final version is known to be a difficult and open problem. This paper provides a solution for ordering in the context of image provenance analysis for forensics with knowledge of a limited set of transformations during training, the most realistic scenario one can consider. All of our results have been evaluated for cross-dataset scenarios, and the overall design is meant for real-world operation. The evaluation set we made use of has a large set of transformations, most of which are unseen at training time. Thus we were able to reveal the limits and potentials of our approach and other possible solutions in a practical setting.

Our efforts also highlight the need for further improvements in vision-based provenance analysis. A better understanding of the image transformation space and datasets with varied types of complex manipulations will help in the development of more advanced algorithms. In this regard, it is important to note that due to the common usage of proprietary image editing tools and the possibility of large numbers of edits, creating ground-truth for provenance analysis is an arduous undertaking. However, through the recent efforts of the DARPA Media Forensics program [12], a viable regime for data collection, annotation, and evaluation has emerged. Finally, considering the nature of the problem, it is insufficient to train a good system solely based on the data from individual images in isolation. The transfer of knowledge gained from the general space of image transformations is still essential to solving the problem, and the work proposed in this paper aims to advance research activity in this direction.

Acknowledgement
This material is based on research sponsored by DARPA and the Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. Hardware support was generously provided by the NVIDIA Corporation. We also thank the financial support of FAPESP (Grant 2017/12646-3, DéjàVu Project), CAPES (DeepEyes Grant) and CNPq (Grant 304472/2015-8).
References

[1] The Big Loophole That Helped Russia Exploit Facebook: Doctored Photos. The Wall Street Journal, 2019. Accessed on 11-10-2019.
[2] Joshua Backer. Disinformation. 1999. Accessed 11-11-2019.
[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[4] Aparna Bharati, Daniel Moreira, Joel Brogan, Patricia Hale, Kevin Bowyer, Patrick Flynn, Anderson Rocha, and Walter Scheirer. Beyond pixels: Image provenance analysis leveraging metadata. In IEEE WACV, 2019.
[5] Aparna Bharati, Daniel Moreira, Allan Pinto, Joel Brogan, Kevin Bowyer, Patrick Flynn, Walter Scheirer, and Anderson Rocha. U-phylogeny: Undirected provenance graph construction in the wild. In IEEE ICIP, 2017.
[6] Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In ICML, 2005.
[7] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric learning to rank. In IEEE/CVF CVPR, 2019.
[8] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
[9] Bolin Chen, Haodong Li, and Weiqi Luo. Image processing operations identification via convolutional neural network. arXiv preprint arXiv:1709.02908, 2017.
[10] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In IEEE/CVF CVPR, 2017.
[11] Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems, 2009.
[12] Defense Advanced Research Projects Agency (DARPA). Media Forensics program. 2016. Accessed on 11-12-2019.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[14] Zanoni Dias, Siome Goldenstein, and Anderson Rocha. Toward image phylogeny forests: Automatically recovering semantically similar image relationships. Elsevier Forensic Science International, 231(1):178–189, 2013.
[15] Zanoni Dias, Anderson Rocha, and Siome Goldenstein. Image phylogeny by minimal spanning trees. IEEE Transactions on Information Forensics and Security, 7(2):774–788, 2012.
[16] Luca D'Amiano, Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. A patchmatch-based dense-field algorithm for video copy–move detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):669–682, 2018.
[17] Hany Farid. Don't be fooled by fake images and videos online. https://theconversation.com/dont-be-fooled-by-fake-images-and-videos-online-111873, 2019. Accessed 11-10-2019.
[18] Hany Farid. Image forgery detection. IEEE Signal Processing Magazine, 26(2):16–25, 2009.
[19] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769, 2014.
[20] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized Product Quantization for Approximate Nearest Neighbor Search. In IEEE CVPR, 2013.
[21] Google LLC. Google-Landmarks Dataset: Label famous (and not-so-famous) landmarks in images. https://bit.ly/34yDwvO, 2019. Accessed on 11-14-2019.
[22] Haiying Guan, Mark Kozak, Eric Robertson, Yooyoung Lee, Amy N. Yates, Andrew Delgado, Daniel Zhou, Timothee Kheyrkhah, Jeff Smith, and Jonathan Fiscus. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In IEEE WACV Workshops, 2019.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, 2016.
[24] Yang Hu, Mingjing Li, and Nenghai Yu. Multiple-instance ranking: Learning to rank images for image retrieval. In IEEE CVPR, 2008.
[25] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. Fighting fake news: Image splice detection via learned self-consistency. In ECCV, 2018.
[26] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. ArXiv e-prints, abs/1702.08734, 2017.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[28] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[29] B. Kumar, Gustavo Carneiro, Ian Reid, et al. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In IEEE CVPR, 2016.
[30] Chengcai Leng, Hai Zhang, Bo Li, Guorong Cai, Zhao Pei, and Li He. Local feature descriptor for image matching: A survey. IEEE Access, 7:6424–6434, 2018.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[32] David G. Lowe. Object recognition from local scale-invariant features. In IEEE ICCV, 1999.
[33] Lev Manovich. Cultural Analytics: Visualizing Cultural Patterns in the Era of "More Media". Domus, March 2009.
[34] Daniel Moreira, Aparna Bharati, Joel Brogan, Allan Pinto, Michael Parowski, Kevin W. Bowyer, Patrick J. Flynn, Anderson Rocha, and Walter J. Scheirer. Image provenance analysis at scale. IEEE Transactions on Image Processing, 27(12):6109–6123, 2018.
[35] Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In IEEE ICCV, 2017.
[36] National Institute of Standards and Technology. MediScore: Scoring tools for Media Forensics Evaluations. https://github.com/usnistgov/MediScore, 2017. Accessed on 11-12-2019.
[37] National Institute of Standards and Technology. Nimble Challenge 2017 Evaluation. 2017. Accessed 11-11-2019.
[38] National Institute of Standards and Technology. Nimble Challenge 2017 Evaluation Plan. https://w3auth.nist.gov/sites/default/files/documents/2017/09/07/nc2017evaluationplan_20170804.pdf, 2017. Accessed 11-11-2019.
[39] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In IEEE ICCV, 2017.
[40] Alberto Oliveira, Pasquale Ferrara, Alessia De Rosa, Alessandro Piva, Mauro Barni, Siome Goldenstein, Zanoni Dias, and Anderson Rocha. Multiple parenting identification in image phylogeny. In IEEE ICIP, 2014.
[41] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. DeepMatching: Hierarchical deformable dense matching. International Journal of Computer Vision, 120(3):300–323, 2016.
[42] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.
[43] Cynthia Rudin and Robert E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10(Oct):2193–2232, 2009.
[44] Walter J. Scheirer, Michael J. Wilber, Michael Eckmann, and Terrance E. Boult. Good recognition is non-metric. Pattern Recognition, 47(8):2721–2731, 2014.
[45] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In IEEE ICCV, 2015.
[46] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
[47] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[48] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In IEEE CVPR, 2014.
[49] Kyle Wilson and Noah Snavely. Robust global translations with 1DSfM. In ECCV, 2014.
[50] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. BusterNet: Detecting copy-move image forgery with source/target localization. In ECCV, 2018.
[51] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned invariant feature transform. In ECCV, 2016.
[52] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In IEEE CVPR, 2015.
[53] Andrew Zhai and Hao-Yu Wu. Classification is a strong baseline for deep metric learning. In BMVC, 2019.
[54] Dalong Zhang, Lei Zhao, Duanqing Xu, and Dongming Lu. Learning local feature descriptors with quadruplet ranking loss. In Chinese Conference on Computer Vision, pages 206–217. Springer, 2017.
[55] Xu Zhang, Felix X. Yu, Svebor Karaman, and Shih-Fu Chang. Learning discriminative and transformation covariant local feature detectors. In IEEE CVPR, 2017.
[56] Xu Zhang, Felix X. Yu, Sanjiv Kumar, and Shih-Fu Chang. Learning spread-out local feature descriptors. In IEEE ICCV, 2017.
[57] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. Learning rich features for image manipulation detection. In