A Full-Image Full-Resolution End-to-End-Trainable CNN Framework for Image Forgery Detection
Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, Giovanni Poggi
Abstract—Due to limited computational and memory resources, current deep learning models accept only rather small images in input, calling for preliminary image resizing. This is not a problem for high-level vision problems, where discriminative features are barely affected by resizing. On the contrary, in image forensics, resizing tends to destroy precious high-frequency details, impacting heavily on performance. One can avoid resizing by means of patch-wise processing, at the cost of renouncing whole-image analysis. In this work, we propose a CNN-based image forgery detection framework which makes decisions based on full-resolution information gathered from the whole image. Thanks to gradient checkpointing, the framework is trainable end-to-end with limited memory resources and weak (image-level) supervision, allowing for the joint optimization of all parameters. Experiments on widespread image forensics datasets prove the good performance of the proposed approach, which largely outperforms all baselines and all reference methods.
Index Terms—Digital image forensics, CNN, forgery detection.
I. INTRODUCTION
In this work, we propose a new framework for image forgery detection based on convolutional neural networks (CNN). This may not look particularly exciting: deep learning is by now common practice to solve all kinds of vision-related problems. However, image forensics has some peculiarities that set it apart from standard computer vision problems. We can summarize them in the need to look, at the same time, at the whole image but also at its tiniest details. Consider the example of Fig. 1. This well-crafted splicing does not show obvious artifacts that allow detection by visual inspection, but a suitable structural analysis reveals differences that may be due only to the insertion of alien material in the host image. Indeed, many state-of-the-art forensic tools rely on the statistical analysis of local micro-patterns. However, local analyses alone are necessarily suboptimal. Clues emerging from the whole image, and at multiple scales, should be combined and processed jointly to make a reliable decision. Therefore, our goal is to design CNN-based forensic tools that, overcoming current technological limitations, meet the contrasting requirements of full-resolution and full-image training and analysis.

It should be realized that this problem is indeed peculiar to multimedia forensics. Typical CNN classifiers for computer vision problems rely on macroscopic features, which bear high-level semantic clues on the scene.
F. Marra, D. Gragnaniello and G. Poggi are with the DIETI, L. Verdoliva is with the DII, Università degli Studi di Napoli Federico II, Naples, Italy. E-mail: {francesco.marra, diego.gragnaniello, verdoliv, poggi}@unina.it.

Fig. 1. Example of carefully crafted splicing. Visual inspection does not allow detection, but pixel-level analyses expose suspicious textural differences.

For example, a face detector may look for the presence of specific facial features with suitable spatial relationships. Such large-scale information persists nicely after resizing the image. And in fact, target images of wildly different sizes are routinely resized to match the input CNN layer. Actually, resizing is even used on purpose, during training, to gain robustness to scale changes. In the context of image forensics, instead, resizing may destroy the very same information classifiers rely upon, the pixel-level micro-patterns that characterize different digital histories. By analyzing such patterns one can identify camera models, individual devices, or discover the traces of out-camera processing. A huge scientific literature testifies to the importance of such high-frequency features. Hence, image resizing and resampling should definitely be avoided when performing forensic tasks.

So, one could naively think of using a network with an input size as large as the target image. Besides the lack of generality (images can be of any size), a more fundamental issue concerns computational and memory resources. Acquisition devices are continuously improving their resolution, with commercial smart-phone cameras delivering photos with many millions of pixels. Deep learning hardware capabilities do not increase at the same rate. Due to computation and memory limitations, state-of-the-art architectures accept only small images in input, especially when very deep networks are used. Therefore, the highly informative image samples cannot be directly fed to a network and analyzed as a whole.

Eventually, when high resolution must be preserved, a simple solution is to perform patch-wise feature extraction, followed by some form of feature aggregation to exploit the full-image information. This approach makes full sense, and largely predates deep learning. Yet, even with good CNN-based feature extractors and classifiers, it is inherently suboptimal for several reasons: i) poor feature extraction; ii) poor global decision; iii) need of over-detailed ground truth. First of all, since the patch-wise feature extractor is trained without taking into account whole-image information, the best it can do is to learn good features for local decisions, which are not necessarily the best ones in view of future aggregation. Then, the global classifier, trained after freezing the patch-level processing, operates only on intermediate features, hence is necessarily suboptimal with respect to a classifier trained end-to-end on the original data. Last, patch-wise training requires a detailed, handcrafted ground truth. Therefore, the large datasets necessary to train deep learning models require huge man-power and are inevitably affected by errors, with a sure impact on the eventual performance. All these considerations motivate our work, and allow us to define the final goal more clearly.
We want to design deep learning models for image forgery detection which are:
1) full-image: they make decisions based on information gathered from all over the image;
2) full-resolution: they do not perform any harmful image resizing;
3) end-to-end trainable: they optimize jointly all model parameters for image-level classification, based only on image-level (weak) supervision.

To achieve this goal, we propose a framework comprising three blocks in cascade performing, respectively, patch-wise feature extraction, image-wise feature aggregation, and global decision. All blocks are fully trainable, based on image-wise labels, allowing information to flow backward through the whole network. The global decision takes into account features extracted from the whole image, whatever its size, and based on local micro-patterns. Memory problems are solved by means of gradient checkpointing, with a very limited increase of computational costs.

With these solutions, the proposed framework allows one to optimize jointly the local information extraction, the global feature aggregation, and the whole-image classification, whatever the input image size. We implemented several versions of this general framework, through appropriate selection of the major architectural blocks. After training on suitable synthetic datasets, we performed extensive experiments on realistic datasets widespread in the image forensics community, focusing on local manipulations, such as splicings, copy-moves, and inpainting, likely indicators of malicious attacks. Results fully support our approach, which largely outperforms both baseline methods and state-of-the-art references, including methods requiring strong supervision.

In the following, we analyze related work (Section II), describe the proposed approach (Section III), report on the results of numerical experiments (Section IV), and finally draw conclusions (Section V).

II. RELATED WORK
Forgery detection is a central topic in image forensics, and there is a large bulk of relevant literature. In addition, it is necessary to consider both forgery detection and localization, since these tasks are tightly related. Indeed, detection methods can be used for localization through sliding-window analysis, and localization methods may allow detection by suitable post-processing. So, to limit the scope, in the following analysis we take a historical perspective, but focus especially on recent CNN-based methods. Moreover, we neglect global manipulations, such as histogram equalization or gamma correction, as they can hardly be regarded as malicious forgeries.

Early contributions were mostly model-based, looking for statistical anomalies related to the color filter array (CFA) [1], [2], double JPEG compression [3], [4], or sensor noise [5], [6]. Most of these methods assume a priori the presence of a forgery and pursue localization through pixel-level analysis, generating a heat-map. Then, a global score can be easily computed from the latter and used for detection. Model-based approaches are elegant and do not require extensive training, but work only under quite restrictive hypotheses.

The advent of data-driven solutions granted a quantum leap in performance and ensured higher generality. Methods based on machine learning extract suitable hand-crafted features from the image, both in the spatial domain [7], [8], [9], [10], [11] and in the transform (DCT, wavelet) domain [12], [13], [14], which are used to train a classifier. Extracting features from the whole image allows direct and reliable image forgery detection. Instead, localization can be obtained by working in sliding-window modality and using a suitable local score. The most discriminative features rely on high-order image statistics which help reveal spatial inconsistencies originated by the presence of forgeries. To this end, high-pass residual images are often used, obtained by means of derivative filters [15] or image denoisers.

In recent years, methods based on deep learning have become dominant. Some early papers, inspired by the success of residual-based machine learning methods, propose CNN architectures with a first layer of high-pass filters, either fixed [16], [17], or trainable [18], meant to extract residual feature maps. In [19] it is even shown that successful methods based on hand-crafted features can be recast as CNNs and fine-tuned for improved performance. In [20] these low-level features are augmented with high-level ones in a two-stream CNN architecture. Recent findings [21], [22], however, show that such a constrained first layer is only useful with small networks and datasets. Given a suitably large training set, general-purpose very deep architectures provide the same good results in favourable cases, but ensure higher robustness to compression and training/test misalignments.

Several papers, to begin with [16], followed more recently by [23] and [24], explicitly train the net to distinguish between homogeneous and heterogeneous patches, the latter characterized by the presence of both pristine and forged areas. The rationale is to catch the patterns that characterize transition regions, anomalous with respect to the background, so as to localize possible forgeries. This idea is followed also in [25], where a hybrid CNN-LSTM architecture is trained end-to-end to produce a binary mask for forgery localization.
These methods, however, require detailed ground truth maps to train the net, which may not be available or precise.

Due to architectural constraints, most of these methods carry out a patch-based analysis, working on relatively small patches, with further steps needed to compute a global score at image level.
Fig. 2. Strong image resizing corrupts the textural patterns used in forensics.

In [16], for example, the CNN extracts features patch-wise and later aggregates them in a global feature vector used to feed an SVM classifier. This may impact on detection performance. A more fundamental limit concerns the need of strongly aligned training and test sets. Some methods, e.g., [24], [25], carry out experiments on a single database split into training and test, others [20] require fine-tuning on target data. All this highlights the limited generalization ability of supervised learning, as also shown in [26].

A more promising line of research is to revisit the anomaly detection approach under a data-driven paradigm. Anomalies are detected by means of single-image analyses, with a sort of blind source identification. In [27] this was accomplished in a fully unsupervised fashion by using an autoencoder architecture. More recent proposals [28], [29], [30] use camera-model features, gathered off-line by dedicated CNNs, or leverage metadata information [31] for direct detection. A strong point of this approach is that training is performed only on pristine images, with no need of aligned datasets and ground truths, which ensures good robustness and adaptability to unseen manipulations. In [29] and [31], in particular, this is achieved by using a Siamese training on pairs of patches extracted from pristine images, with a suitable consistency metric.

Besides its technical content, this short review of ideas makes clear that there is high and growing interest for new solutions in this field, to face the threats posed by increasingly sophisticated fake multimedia tools.
III. PROPOSED METHOD
Our aim is to design a deep network to detect the presence of localized forgeries in a target image, irrespective of the image size and the forgery size. Of course, images can have wildly different sizes, depending also on the context, but the trend is towards higher and higher resolutions. Today's smartphones feature cameras with resolutions of 10 Mpixels and more. On the other hand, due to computation/memory bottlenecks, deep networks accept rather small images in input, for example 256 × 256 pixels.

A. The need for full-image full-resolution processing
The first solution, already mentioned in the Introduction, is to rescale the image to fit the network's first layer. However, this is not advisable when dealing with forgery detection. In some cases, the forged region could be so small as to become practically undetectable after strong downsampling. A more fundamental problem is that some sophisticated forgeries may only be detected based on the statistical analysis of micro-textures. These precious high-frequency components are strongly corrupted when the image is resized or resampled. Fig. 2 shows a clear example, in which the markedly different textures highlighted in Fig. 1, after resizing, become very similar to one another and basically useless for forensic analyses.

The second solution is to perform patch-level detection, with no resampling, followed by some form of information fusion to make a global decision. Indeed, given an ideal patch-level classifier, the fusion problem has an obvious solution, and the presence of a forgery can be declared if at least one forged patch is detected. However, real-world detectors are far from ideal: they always have non-zero missed-detection and false-alarm rates. For example, assuming a rather optimistic 1% patch-level false-alarm rate, and independent decisions, a 100-patch pristine image would present a false-alarm rate beyond 63%. Therefore, the fusion problem is not at all trivial with real-world detectors, as our experiments will confirm. In addition, the patch-level detector itself should be designed taking into account image-level performance.
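For concreteness, the image-level false-alarm probability follows from the independence assumption as

\[
P_{FA}^{\rm image} \;=\; 1 - \bigl(1 - P_{FA}^{\rm patch}\bigr)^{N_p} \;=\; 1 - (1-0.01)^{100} \;\approx\; 0.634 .
\]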
These considerations motivate the need for a full-image full-resolution detector. In this way, precious micro-textures can be preserved and, at the same time, information coming from all patches can be processed jointly to make a reliable decision. A naive implementation of this idea, with a CNN input size matching the image size, would require huge computational and memory resources, not to speak of the number of images needed for reliable training. Instead, we propose a suitable architecture that, through reasonable structural constraints, satisfies the needs of forensic detection with limited resources.

B. Proposed architecture
The proposed framework is represented pictorially in Fig. 3. It consists of three blocks performing, respectively, patch-level feature extraction, feature aggregation, and decision. Note that, although we propose a specific implementation of such blocks, this is not the core of our proposal, which is instead the whole framework.
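To fix ideas, the three-block cascade can be sketched as follows in PyTorch. This is only an illustrative skeleton, not the authors' code: the backbone stands for any patch-level CNN feature extractor, the patch size and stride are hypothetical, and the classifier sizes follow the FC1=512 / FC2=256 choice reported in the experimental section.

import torch
import torch.nn as nn

class E2EForgeryDetector(nn.Module):
    """Three-block cascade: patch-wise feature extraction,
    image-wise feature aggregation, global decision."""
    def __init__(self, backbone, feat_dim, patch_size=128, stride=128):
        super().__init__()
        self.backbone = backbone            # patch-level CNN feature extractor
        self.patch_size, self.stride = patch_size, stride
        self.classifier = nn.Sequential(    # two fully-connected layers
            nn.Linear(4 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 2))              # pristine vs. forged

    def forward(self, image):               # image: 1 x C x H x W, any H, W
        p, s = self.patch_size, self.stride
        tiles = image.unfold(2, p, s).unfold(3, p, s)            # tiling
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, image.size(1), p, p)
        feats = self.backbone(tiles)                             # N_p x feat_dim
        agg = torch.cat([feats.max(0).values, feats.min(0).values,
                         feats.mean(0), (feats ** 2).mean(0)])   # four poolings
        return self.classifier(agg.unsqueeze(0))                 # image-level logits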
1) Patch-level feature extraction: after dividing the image in overlapping patches, these are processed to extract discriminative features. As feature extractors, we adopt some state-of-the-art deep networks, taking the output of the penultimate layer as feature vector, and discarding the final class probabilities. However, considering the peculiarities of image forgery detection, we modify the input layer to accommodate some additional inputs, the image noiseprint [30], besides the image color bands. Noiseprints are high-pass image residuals, extracted through a dedicated network, in which camera-related artifacts are emphasized. Therefore, they highlight possible spatial anomalies and may help detecting local manipulations.
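A possible way to accommodate the extra noiseprint band is to widen the first convolutional layer while reusing the pretrained RGB filters. The sketch below uses a torchvision ResNet as a stand-in backbone (Xception is not in torchvision), and zero-initializes the new channel; both choices are our assumptions, not details from the paper.

import torch
import torch.nn as nn
from torchvision.models import resnet50

def widen_input_layer(model, extra_channels=1):
    """Replace the first conv so the net accepts RGB + noiseprint inputs."""
    old = model.conv1                       # Conv2d(3, 64, kernel_size=7, ...)
    new = nn.Conv2d(3 + extra_channels, old.out_channels,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight      # keep the pretrained RGB filters
        new.weight[:, 3:] = 0.0             # noiseprint channel starts at zero
    model.conv1 = new
    return model

model = widen_input_layer(resnet50(weights="IMAGENET1K_V1"))
logits = model(torch.randn(2, 4, 224, 224))  # RGB + one noiseprint band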
Fig. 3. Proposed end-to-end trainable framework for image forgery detection, comprising extraction, aggregation, and classification blocks.
2) Feature aggregation: the feature extractor produces a large number of features, which are aggregated image-wise to obtain a single descriptor for the classification task. To this end, we consider several forms of pooling: maximum, minimum, average, and average of squares:

\[
F_{\max} = \max_{i=1,\dots,N_p} F_i, \qquad
F_{\min} = \min_{i=1,\dots,N_p} F_i, \qquad
F_{\rm mean} = \frac{1}{N_p}\sum_{i=1}^{N_p} F_i, \qquad
F_{\rm msq} = \frac{1}{N_p}\sum_{i=1}^{N_p} F_i^2 \qquad (1)
\]

where $F_i = [F_{i,1}, \dots, F_{i,C}]$ is the $C$-component feature extracted from the $i$-th patch, $N_p$ is the number of (possibly overlapping) patches, and all operations on features are component-wise. The most appropriate type of pooling depends on the problem of interest. When the information is spread over the whole image, average pooling is reasonable, while min or max pooling are more appropriate when the discriminative information is concentrated in a localized region. In any case, we also use the combination of multiple types of pooling, leaving the final choice to experiments. After aggregation, all explicit spatial dependencies are discarded.

Note that the type of pooling impacts on how information back-propagates from the output to update the parameters of the feature extractor. In more detail, let $F_{\rm agg}$ denote the aggregated feature, $\mathcal{L}$ the loss function of the framework, and $\theta$ a generic parameter of the CNN. Then, the gradient of $\mathcal{L}$ with respect to $\theta$ reads

\[
\frac{\partial \mathcal{L}}{\partial \theta} \;=\; \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial F_{{\rm agg},c}}\,\frac{\partial F_{{\rm agg},c}}{\partial \theta} \qquad (2)
\]

with

\[
\frac{\partial F_{{\rm agg},c}}{\partial \theta} \;=\;
\begin{cases}
\displaystyle \sum_{i=1}^{N_p} \frac{\partial F_{i,c}}{\partial \theta}\,\delta_{i,\,i_{\max}(c)} & \text{max pooling} \\[2ex]
\displaystyle \sum_{i=1}^{N_p} \frac{\partial F_{i,c}}{\partial \theta}\,\delta_{i,\,i_{\min}(c)} & \text{min pooling} \\[2ex]
\displaystyle \frac{1}{N_p}\sum_{i=1}^{N_p} \frac{\partial F_{i,c}}{\partial \theta} & \text{average pooling} \\[2ex]
\displaystyle \frac{2}{N_p}\sum_{i=1}^{N_p} F_{i,c}\,\frac{\partial F_{i,c}}{\partial \theta} & \text{average-of-squares pooling}
\end{cases} \qquad (3)
\]

In the above equation, $\delta_{i,j}$ equals 1 when $i = j$ and 0 otherwise, while $i_{\max}(c)$ and $i_{\min}(c)$ point to the feature vectors with the largest, respectively smallest, $c$-th component. Therefore, with max or min pooling, only some "active" patches contribute to the gradient, and are updated during training. Instead, with mean and msq pooling all patches are involved. Of course, when multiple forms of pooling are used at the same time, the gradient is obtained as the weighted sum of the individual terms.
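The gradient routing in (3) can be checked directly with automatic differentiation; the toy example below (arbitrary sizes) shows that max pooling back-propagates only through the argmax patch of each component, while mean pooling reaches all patches.

import torch

N_p, C = 5, 3                                 # 5 patches, 3 feature components
F = torch.randn(N_p, C, requires_grad=True)   # stand-in for patch features

F.max(dim=0).values.sum().backward()          # max pooling
print((F.grad != 0).sum(dim=0))               # one active patch per component

F.grad = None
F.mean(dim=0).sum().backward()                # mean pooling
print(F.grad)                                 # every entry equals 1/N_p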
3) Decision: after aggregating the local information in a single descriptor F for the whole image, this is classified by means of a few fully-connected layers. This is the typical classifier used in deep networks, and usually two layers provide a good trade-off between complexity and accuracy.

C. End-to-end training
If we focus only on the post-training operations, the proposed architecture does not look much different from conventional approaches based on patch-wise feature extraction, pooling, and classification. Contrary to such approaches, however, our framework is trainable end-to-end. This means that we do not train the feature extractor on individual labeled patches and, afterwards, train the classifier on the features extracted by a fixed net. Instead, we train the whole framework, top to bottom, on full-size images, with a single label associated with each one: forged or pristine.

Fig. 4. Conventional CNN training with backpropagation.
The loss back-propagates through the net down to the individual patches, allowing the feature extractor to learn which information is the most discriminative for the final decision, and adapting the classifier jointly with the extractor itself.

To better underline the difference with respect to patch-wise training, consider that in a large image with a localized forgery most patches are actually pristine, and only a few ones truly forged. In our end-to-end training, all these patches share the same image-level label (forged). Therefore, the net is forced to learn how to manage such contrasting indications to make the correct decision. As a side benefit, there is no need to have a pixel-wise ground truth for training, since the only relevant label applies to the whole image. Also, images of any size can be used for training, with forgeries of any size (especially if max/min pooling is used).

Going into technical details, for each training batch of images the framework performs i) an inner loop on the patches of each image, computing the back-propagation at the end of the loop, and ii) an outer loop on the images of the batch, which sums up the gradients computed in each inner loop and finally updates the weights once at the end of the batch. Due to the arbitrary size of input images, each inner loop involves a different number of patches, impacting on the computational effort, which may vary significantly from batch to batch.
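In pseudo-PyTorch, the double loop may be sketched as follows (hypothetical names; the inner loop over patches is assumed to happen inside the model's forward pass, as in the framework of Fig. 3):

import torch

def train_one_batch(model, optimizer, criterion, batch):
    """Outer loop over the images of a batch; gradients accumulate
    across images and the weights are updated once at the end."""
    optimizer.zero_grad()
    for image, label in batch:                # images may differ in size
        logits = model(image.unsqueeze(0))    # inner loop over patches inside
        loss = criterion(logits, label.unsqueeze(0)) / len(batch)
        loss.backward()                       # add this image's gradient
    optimizer.step()                          # single update per batch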
This is a minor issue, though, with respect to memory requirements. In fact, to back-propagate the loss, gradients must be computed for all processed patches, causing an increase of the occupied memory, which grows linearly with the image size. For deep networks and large images, this memory is simply unavailable. The situation is depicted in Fig. 4, where a circle represents a layer, and a black dot at the center indicates that activations are stored. In the forward pass (a), all activations at each layer are computed and stored. Then, in the backward pass (b), they are used to propagate gradients from the last layer, where the loss is computed, to the input. After usage, they are erased (small dots). It should be realized that deep nets can include hundreds of layers, with several feature maps at each layer, whose size is typically proportional to the input size. Therefore, to process a large input image at once, a huge number of variables should be stored, exceeding the available memory.

To manage this problem we resort to the gradient checkpointing strategy, originally proposed in [32], which trades off memory for computation. This solution is depicted in Fig. 5.

Fig. 5. CNN training with gradient checkpoints. After the forward pass (a), activations are stored only at checkpoint layers (red). The backward pass (b)-(e) proceeds one group of layers at a time. Activations at intermediate layers must be recomputed each time a group is processed.

During the forward pass (a), all activations are deleted immediately after use, except for those in a few "checkpoint" layers (red dots). In the backward pass (b)-(e), gradients are computed one group at a time (in the figure we show two groups of 4 layers). Since activations are necessary to this end, they are recomputed, but only from the last checkpoint on (b). This allows backpropagating the gradient until the checkpoint layer itself (c). At this point all variables at layers beyond the checkpoint are deleted, and the process goes on with a new group of layers (d)-(e).

With a judicious choice of the number of checkpoints, memory occupation can be significantly reduced and become manageable. Of course, each activation is computed twice, but the computational overhead is limited, because the forward pass is lighter than the backward pass. Note that gradient checkpointing has recently been made available in PyTorch as well as in other platforms. With this solution, we were able to train our network end-to-end seamlessly, with an increase of the training time that never exceeded 20%.
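As a minimal PyTorch sketch of this mechanism (the layer stack and the number of segments are illustrative, not the values used in the paper):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# a deep stack standing in for the feature extractor
layers = nn.Sequential(*[nn.Sequential(nn.Conv2d(16, 16, 3, padding=1),
                                       nn.ReLU()) for _ in range(32)])
x = torch.randn(8, 16, 128, 128, requires_grad=True)  # a batch of patches

# keep activations only at 4 checkpoints; intermediate ones are
# recomputed during the backward pass, trading compute for memory
y = checkpoint_sequential(layers, segments=4, input=x)
y.sum().backward()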
IV. EXPERIMENTAL ANALYSIS
In this section we design and perform numerical experiments to assess the performance of the proposed approach. In the following subsections, we first describe the training procedure, then present the results of some preliminary experiments carried out to make key design choices, and finally compare the proposed method with both baselines and state-of-the-art references on several challenging datasets widespread in the community.
A. Training
In order to train our networks, we generated a suitable synthetic dataset. Background images are taken from the Vision dataset, originally proposed in [33] for camera model identification, which comprises images acquired by different devices with the native high-quality JPEG compression.
TABLE I
FEATURES OF DATASETS USED FOR TRAINING AND TESTING

To generate manipulated images, we spliced on them objects drawn from a set of 81 objects manually cropped from the uncompressed images of the UCID dataset [34]. Details on all datasets used in this work are reported in Tab. I. We used all images from 25 devices of the Vision dataset for training, and kept the others for validation, with an approximate 70%-30% split, so as to avoid any possible bias. For each pristine image, we created on the fly a manipulated image by inserting in a random position one of the UCID objects, selected at random, with random scaling and rotation. Scaling is such that the size of spliced objects goes from about 1% to about 10% of the image size. Eventually, both pristine and manipulated images are flipped or rotated, and JPEG compressed with QF going from 75 to 100, obtaining a significant augmentation. Fig. 6 shows a few examples of manipulated Vision/UCID images (without rotations).

In the training procedure we used the Adam optimizer with minibatches of 10+10 images and a learning rate of 0.001. Training took about three days with an Nvidia Tesla P100 GPU. With the same hardware, testing takes about half a second for a 3072 × image.

B. Preliminary experiments
The proposed framework aims at the detection of localized manipulations, such as splicing, copy-move, and object removal through content-aware inpainting. Towards this goal, we instantiated the proposed framework by means of some key design choices. In particular, we
• augmented the input RGB bands with the corresponding noiseprint bands;
• used Xception [35] as feature extractor;
• performed aggregation by including all types of pooling;
• used two fully connected layers, of size FC1=512 and FC2=256, to perform the final classification.

We arrived at these choices as a result of a large number of preliminary experiments, whose description would be dispersive and tedious. However, we can study experimentally the impact of each individual choice on the performance of the proposed architecture. To this end, we generated a new dataset, with the same modalities used for the training set, but completely separated from it. Background images were taken from the Dresden dataset, originally proposed in [36] for camera model identification, and manipulated images were created by splicing on them 13 objects taken from the FAU dataset [37] (see again Tab. I for details on the datasets). After performing the splicing, images were JPEG compressed at high, medium, or low quality.

Fig. 6. Examples from the synthetic Vision/UCID training set. Spliced objects are delimited by a red contour for the sake of clarity.
TABLE II
RESULTS OF ABLATION STUDIES ON THE DRESDEN/FAU DATASET

architecture                              AUC
RGB+NP / Xception / all poolings / 512    0.851
RGB → RGB+NP                              0.845
NP → RGB+NP                               0.849
Resnet101 → Xception                      0.750
Inception → Xception                      0.745
max-pooling → all poolings                0.800
avg-pooling → all poolings                0.808
FC1-size 256 → 512                        0.831
FC1-size 1024 → 512                       0.838

TABLE III
RESULTS ON SUBSETS FROM THE DRESDEN/FAU DATASET

sub-dataset                    AUC
global                         0.851
large-size objects             0.855
medium-size objects            0.860
small-size objects             0.875
high-QF JPEG compression       0.886
medium-QF JPEG compression     0.847
low-QF JPEG compression        0.855
original-size                  0.884
resized                        0.841
In Tab. II we report the results of our ablation study. Row 2 refers to the selected architecture, which uses Xception, takes in input both RGB and noiseprint bands, concatenates the vectors given by all pooling types, and uses a size-512 FC1 layer. In all other rows, we modified a single item of this reference architecture. A number of non-trivial results appear. First of all, Xception is a much better feature extractor than the two alternatives, Resnet101 [38] and InceptionV4 [39]. We had already observed a similar edge in other applications [22], although never so sharp. The likely reason is Xception's better use of resources, with a much smaller number of parameters to optimize for a given network depth. It also clearly emerges that using 4 types of pooling together ensures a significant improvement w.r.t. using only one of them. Using only max-pooling, as suggested by the nature of the problem, is even worse than using average pooling, probably because of its lower robustness to noise. As for the size of the first FC layer, 512 appears to be the best choice, although just slightly. The only controversial choice concerns the input. In fact, using only the RGB bands or only the noiseprint (NP) bands provides results very close to those of RGB+NP, with a statistically insignificant gap. Therefore, we refrain from sharp decisions on the input, and will keep testing several options in real-world cases.

We now study the impact of compression, resizing, and splicing size on the performance of the proposed method by collecting results for specific relevant subsets. A quick look at the numbers of Tab. III makes clear that only minor variations occur across such subsets, with all AUCs in the 0.84–0.89 range. The largest performance gap is observed between original-size and resized images. JPEG compression also affects the detection performance somewhat, although no significant difference emerges between the medium-QF and low-QF cases. The size of the spliced area, instead, seems to have a minor impact and, contrary to expectation, relatively small-size splicings are detected more easily than large-size ones. Note that, on average, the AUC on specific subsets is larger than the global AUC, but this is a consequence of the higher homogeneity of the tested images.
C. Comparative performance analysis
Having justified our design choices, we now move to compare the performance of the proposed framework with those of suitable baselines and state-of-the-art methods, using not only our relatively simple Dresden/FAU synthetic dataset, but also several realistic and challenging datasets widespread in the forensic community.
1) Reference methods: first of all, we consider two natural baselines, both relying on Xception, given its good performance. The first one, Xception-resize, consists simply in resizing the target image to fit the CNN input, with a straightforward training procedure. Xception-patchwise, instead, works by analyzing the image patch-by-patch, with no resizing and some spatial overlapping, and finally fusing results. Accordingly, the net is trained to perform binary patch classification. Since the detector will look for anomalies, we decided to label only boundary patches as forged, that is, patches including a significant fraction of both background and manipulated areas. Eventually, the output probabilities are collected in a heatmap, from which a suitable statistic is extracted (after some tests, we chose the max statistic) and compared with a threshold to make the image-level decision.

Just like our two baselines, methods proposed in the literature can be grouped in two classes. A few ones work at image level, while the majority work at patch level, as they pursue forgery localization, and are converted into image-level detectors through some simple post-processing.

For the first category, we selected the SPAM+SVM method [8], winner of the First IEEE Forensic Challenge and based on the SPAM steganalytic features [15], the CNN+SVM method of [16], which extracts features through a constrained CNN, and LSTM-EnDec [25], which uses a long short-term memory recurrent neural network to detect pristine/forged spatial transitions (contrary to the other supervised methods, we were not able to re-train this network on our data, and used the original model in the experiments). For the second category, we consider several forgery localization methods converted into image-level detectors. In particular, we selected the best performing methods resulting from the analysis carried out in [29], that is, CFA [2], which exploits features related to the color filter array, DCT [3], based on the analysis of double-quantized DCT coefficients, NOI [1], looking for spatial inconsistencies in the noise level, EXIF-SC [31], looking for anomalies in the image by leveraging the EXIF metadata during the training phase, and Noiseprint [29], which extracts and analyzes an image fingerprint where camera model-related artifacts are emphasized. All these methods compute a heatmap representing the probability that a certain patch has been manipulated. To make the image-level decision we extract several statistics from such heatmaps: mean, maximum, and q-quantiles for several values of q, selecting the best one in terms of AUC performance separately for each method. Note that all these latter methods are blind, that is, they require no training on forged images or patches.
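A sketch of this heatmap-to-score conversion is given below; the quantile grid is illustrative, since the exact q values are selected on validation data separately for each method.

import numpy as np

def heatmap_scores(heatmap, qs=(0.5, 0.75, 0.9, 0.95, 0.99)):
    """Collapse a per-patch manipulation-probability map into
    candidate image-level scores; the best statistic is then
    chosen by validation AUC, separately for each method."""
    h = heatmap.ravel()
    scores = {"mean": h.mean(), "max": h.max()}
    for q in qs:
        scores[f"q{q}"] = np.quantile(h, q)
    return scores

print(heatmap_scores(np.random.rand(32, 48)))   # toy heatmap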
TABLE IV
RESULTS OF ALL VERSIONS OF E2E AND ALL REFERENCE METHODS ON THE TEST DATASETS

Method              supervision  Dresden/FAU  DSO-1   Korus   NC2017  MFC2018  MFC2019  average
Xception-resize     weak         0.609        0.539   0.527   0.513   0.570    0.516    0.546
Xception-patchwise  strong       ˚0.721       0.643   0.533   0.729   ˚0.711   0.632    0.661
CFA [2]             –            0.507        0.584   ˚0.598  0.593   0.539    0.526    0.558
DCT [3]             –            0.505        0.614   0.501   0.683   0.523    0.509    0.556
NOI [1]             –            0.558        0.543   0.507   0.678   0.523    ˚0.726   0.589
NoisePrint [29]     –            0.611        ˚0.821  0.583   0.746   0.684    0.662    ˚0.684
EXIF-SC [31]        –            0.599        0.721   0.496   0.709   0.670    0.655    0.642
SPAM+SVM [8]        weak         0.506        0.768   0.502   0.767   0.631    0.634    0.635
CNN+SVM [16]        strong       0.593        0.728   0.568   ˚0.798  0.702    0.679    0.678
LSTM-EnDec [25]     strong       0.543        0.590   0.521   0.504   0.535    0.542    0.539
E2E-RGB             weak         0.958        0.596   0.607   0.774   0.760    0.737    0.739
E2E-NP              weak         0.874        0.924¯  0.665¯  0.766   0.776    0.741    0.791
E2E-RGB+NP          weak         0.914        0.790   0.619   0.762   0.765    0.765    0.769
E2E-Fusion          weak         0.993¯       0.824   0.655   0.846¯  0.838¯   0.787¯   0.824¯

Fig. 7. Examples from the NIST datasets.
2) Datasets: for performance assessment, besides our synthetic Dresden/FAU dataset, we consider several more datasets, widely used in the forensics community, with markedly different characteristics. DSO-1 [40] features only splicings, with little or no post-processing. In Korus [41], instead, both splicings and copy-moves are present. Both datasets include only large-size high-quality images, not even compressed in the case of Korus. A very different, and much more challenging, scenario is given by the NC2017, MFC2018, and the very recent MFC2019 datasets [42], developed by NIST in the context of the Medifor initiative. Images of these datasets have been manually doctored, often with multiple and possibly overlapping manipulations of various types. In addition, they have wildly different sizes and quality levels, and have been subject to several anti-forensics measures to prevent easy detection and localization of forgeries. For our tests, we kept all images with splicing, copy-move, inpainting, or computer-generated material. The reader is referred to Tab. I and to the original papers for more details, while some example images are shown in Fig. 7.
3) Numerical results: in Tab. IV we report the detection AUC for all reference and proposed methods on all test datasets. Next to each method, in column 2, we give the level of supervision it requires: strong (pixel-wise ground truth), weak (only image label), or – (none) for blind methods. In the upper part of the table we group all reference methods, including our two baselines, and in the lower part all versions of the proposed method with end-to-end (E2E) training. Best results are marked with ˚ for reference methods and with ¯ for our proposal. In Fig. 8 we also show ROC curves for a subset of methods (for readability) and datasets (for space) characterized by very different features.

On the Dresden/FAU dataset, disjoint from the training Vision/UCID dataset, but well-aligned with it, the proposed method (E2E-RGB+NP) largely outperforms all references, with a gain of almost 20 percent points over the best one, the strongly supervised Xception-patchwise. Guided by the outcomes of the preliminary experiments, together with the "best" version, with RGB+NP input, we consider also the versions with only RGB and only NP inputs. To our surprise, E2E-RGB provides a further significant performance improvement. Our explanation for this phenomenon is the strong heterogeneity of the input: since RGB bands and noiseprints have quite different statistics, the net may have a hard time processing them jointly. To confirm this hypothesis, we considered a further version of the proposed method, where the networks trained on RGB-only, NP-only, and RGB+NP inputs are fused afterwards by a trivial average of scores. This strategy proved successful, with the new version, E2E-Fusion, providing almost perfect detection (see also the top-left ROC in Fig. 8), thus confirming our conjecture.
Fig. 8. ROC curves on the Dresden/FAU (top-left), NC2017, MFC2018, and MFC2019 (bottom-right) datasets. For the sake of clarity, ROCs are shown only for selected methods: the best proposed (E2E-Fusion), the two baselines, and the best references (SPAM+SVM, CNN+SVM, Noiseprint, EXIF-SC). Only for Dresden/FAU we also show the other E2E versions. E2E-Fusion is always clearly, and almost uniformly, the best. The resizing-based baseline is always the worst.

Moving to the DSO-1 dataset, we observe again a large gain, more than 10 percent points, of the best E2E method over the best reference. On this dataset, Noiseprint provides an especially good performance, a phenomenon already observed in [29], and likely related to all images being JPEG compressed at high quality. Accordingly, E2E also works best with only noiseprints as input, with no fusion. Images of the Korus dataset, instead, are uncompressed. This removes a major source of forensic traces, which impacts all methods, some of which exhibit a 0.5 AUC, equivalent to coin tossing. CFA (relying on color filter array properties) and Noiseprint keep providing decent results; however, they trail all E2E versions, which feature AUCs between 0.60 and 0.66.

Turning to the more challenging NIST datasets, the general behavior does not change, with E2E working generally better than the reference methods. The best reference method is not always the same for all such datasets: CNN+SVM for NC2017, Xception-patchwise for MFC2018, NOI for MFC2019. On the contrary, E2E-Fusion is always the best version of the proposed method, and the best overall, with a significant performance gain over the best reference, going from 0.048 (NC2017) to 0.127 (MFC2018).

The final column shows the average over all datasets, which confirms all the above observations. We only underline, in passing, that the Xception-resize baseline, as expected, performs quite poorly due to the loss of precious high-frequency details, while the Xception-patchwise baseline is among the best references, although it is fair to recall that it requires strong supervision.

A general observation is that the performance of E2E is consistently good in all cases (with a small dip on Korus), including the NIST datasets, despite their great variety and the abundance of counter-forensic measures. This is all the more remarkable, considering that the network was trained on a dataset, Vision/UCID, lacking such diversity. Therefore, we carried out a further experiment on NC2017 and MFC2018, in which the E2E methods are fine-tuned on their respective development sets, provided by NIST together with the test sets. Results are reported in Tab. V and Tab. VI, while Fig. 9 shows the corresponding ROC curves. It is clear that fine-tuning on the development set, certainly more aligned with the test set than Vision/UCID, grants further performance gains. Over the whole dataset ("all" column) the best AUC, obtained always with E2E-Fusion, grows from 0.846 to 0.932 on NC2017, and from 0.838 to 0.902 on MFC2018. The larger improvement on NC2017 can be attributed to better development-test alignment and lighter counter-forensic actions.
TABLE V
RESULTS OF E2E METHODS ON NC2017 WITHOUT AND WITH FINE-TUNING

Method       f.t.  all     splicing  CM      inpaint.  CG
E2E-RGB      n     0.774   0.829     0.819   0.694     ˚0.949
E2E-NP       n     0.765   0.774     0.752   0.762     0.902
E2E-RGB+NP   n     0.762   0.816     0.832   0.693     0.921
E2E-Fusion   n     ˚0.846  ˚0.860    ˚0.870  ˚0.809    0.932
E2E-RGB      y     0.868   0.871     0.887   0.833     0.937¯
E2E-NP       y     0.879   0.799     0.849   0.914     0.880
E2E-RGB+NP   y     0.913   0.837     0.893   0.939     0.885
E2E-Fusion   y     0.932¯  0.884¯    0.911¯  0.950¯    0.935

TABLE VI
RESULTS OF E2E METHODS ON MFC2018 WITHOUT AND WITH FINE-TUNING

Method       f.t.  all     splicing  CM      inpaint.  CG
E2E-RGB      n     0.760   0.808     0.705   0.696     0.730
E2E-NP       n     0.775   0.805     0.750   0.744     ˚0.817
E2E-RGB+NP   n     0.765   0.795     0.733   0.734     0.786
E2E-Fusion   n     ˚0.838  ˚0.860    ˚0.811  ˚0.811    0.799
E2E-RGB      y     0.854   0.874     0.823   0.802     0.851
E2E-NP       y     0.844   0.868     0.819   0.811     0.864
E2E-RGB+NP   y     0.867   0.893     0.842   0.824     0.874
E2E-Fusion   y     0.902¯  0.925¯    0.877¯  0.856¯    0.910¯

In any case, results are extremely satisfactory for such challenging datasets. In the tables, taking advantage of the auxiliary information provided with these datasets, we also provide analytic results for each type of forgery. Although E2E is trained only on splicing, it works well also on all other localized manipulations. The most interesting phenomenon we could spot from these data is the performance drop on computer-generated fakes. In NC2017 the AUC for these manipulations was very high, above 0.93 without fine-tuning, lowering dramatically (0.80) in MFC2018. Probably, this is the effect of the fast pace of progress in the quality of such manipulations.
D. Towards forgery localization
The E2E framework was conceived and trained with the goal of making global decisions, leaving the problem of localization to other tools. However, just like localization tools can be used for detection through suitable fusion, the proposed detection framework can be recast to provide also some localization information. In the following subsections we provide some insight into how the system exploits and combines local information coming from all over the image, and how this can be exploited towards forgery localization.
Fig. 9. ROC curves of all E2E variants on NC2017 (top) and MFC2018 (bottom) without (dashed lines) and with (solid lines) fine-tuning on the NIST development sets. Fine-tuning provides a significant gain in all cases.

1) Activation Maps: first of all, we try to investigate the impact of each patch of the image on the final decision. To this end, we consider a simplified framework in which only max pooling is used. Given this hard selection rule, we can easily compute a spatial activation map which counts how many features each patch contributes to the overall feature vector. Such a map, however, would be extremely coarse, due to the low resolution of the patch-wise analysis. Therefore, we combine it with a high-resolution map, the Grad-CAM (guided gradient-weighted class activation map), obtained by backpropagating the loss gradient to the full-resolution input [43].
In Fig. 10 we show some results for images of the Dresden/FAU dataset (hand-made to look more realistic). For this synthetic dataset, we have the pristine version of each manipulated image, so we can analyze the network behavior in both circumstances. In all cases, the network focuses on high-activity regions, often corresponding to object boundaries. When there is no manipulation, the salient regions are scattered all over the image. On the contrary, when a splicing takes place, they tend to concentrate on the boundaries of the spliced object, proving that the system has learned to look at these patches to make its decisions. Therefore, when a forged image is detected, this activation provides hints about the possible site of the manipulation.

Fig. 10. Example images (top) and activation maps (bottom) from the Dresden/FAU dataset. Pristine images are on the odd columns, forged images (with hand-made splicings for higher visibility) on the even columns. Active patches are superimposed in cyan on the gray-scale/red-scale version of the images.
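With max pooling only, the contribution counts described above can be read directly off the argmax indices; a possible sketch (hypothetical sizes):

import torch

def activation_counts(feats):
    """feats: N_p x C patch features. For each patch, count how many of
    the C max-pooled components it contributes to the image descriptor."""
    winners = feats.argmax(dim=0)            # winning patch per component
    return torch.bincount(winners, minlength=feats.size(0))

counts = activation_counts(torch.randn(20, 512))  # 20 patches, 512-D features
# reshaped to the tiling grid, these counts give the coarse activation map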
2) ROI-based Analysis: moving towards forgery localization, we can obtain some interesting results by leveraging the flexibility of the proposed framework. Indeed, since the system can analyze images of any size, it can also analyze regions of interest (ROI) selected by the user based on the previous activation map or any other criterion. If the ROI contains manipulated material, the system will likely provide a large probability of manipulation (score, from now on).
Therefore, the system can be used in supervised modality to test suspicious objects. Also, it can be recast to perform automatic box-like localization. In fact, once features have been computed and stored for all patches, the aggregation and classification phases are extremely simple, with light-speed processing. Therefore, one can easily test a large number of boxes and select automatically as ROI those with the largest scores, obtaining a rough but effective form of localization (a sketch of this box-scoring procedure is given at the end of this subsection).

Fig. 11 shows some examples taken from the MFC2018 dataset. Together with the original images (top) and activation maps (middle), it also shows (bottom) the scores obtained over the whole image (white number in the top-left corner) and on selected boxes (colored numbers). The green boxes have been selected manually around possible subjects of interest, while the magenta boxes are selected by our automatic procedure around the local maxima of the score. In the first image, the man on the right has been spliced on the host background. Here, the activation map provides strong hints on the possible manipulation, confirmed by a large image-level score (0.935). However, an even larger score (1.000) is obtained when a ROI is correctly placed around the splicing. The automatic procedure also selects a ROI roughly covering the splicing, with unitary score. Another ROI is selected automatically in a pristine area, in correspondence of a local maximum, but it has a rather low score (0.428). In the second image, a further splicing has been added, the woman in the center. Neither the activation map nor the automatic ROI selection procedure highlight this new subject. So, we selected a ROI manually around this splicing, obtaining a rather low score. Exploiting the side information provided with the NIST datasets, we investigated this splicing, to discover that the inserted object had been acquired with the same camera model as the host image. This fact reduces the discriminating power of the noiseprint input, justifying such a result in hindsight. In the third image, the only manipulation is a tiny inpainted region. Here, a supervised selection makes no sense, since the manipulated region does not correspond to any semantic object. However, the manipulation is nicely localized through the automatic procedure, with unitary score, unlike other candidate ROIs characterized by low scores. The last image shows an opposite case, with many large, semantically relevant objects spliced on the host image. To avoid cluttering the image, we now show only the supervised ROIs and the corresponding scores, which are very large in all cases.

Fig. 11. Manipulated images from the NIST datasets (top), corresponding activation maps (middle), and ROI-based localization results (bottom) with hand-made (green) and automatic (magenta) box-shaped ROIs. Detection scores are shown on the top-left of each box.

To complete this visual inspection of results, it is fair to show, in Fig. 12, some counter-examples where the proposed framework fails to detect the manipulation. Reasons for failure are not always obvious. In these cases, they may be related to the absence of texture in the spliced object (left) or to the strongly textured host image (right), which may hide the discriminating information. Note that in the image on the right, a well-placed ROI would allow detection, but there is no semantic hint to select it.

Fig. 12. Examples of missed detection from the NIST datasets.
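The box-scoring procedure mentioned above could look as follows; this is a simplified sketch, assuming cached patch features on a regular tiling grid and the four-pooling classifier of Section III (names and shapes are ours, not from the paper).

import torch

def score_rois(patch_feats, grid_hw, classifier, rois):
    """patch_feats: (nH*nW) x C features, computed once and cached.
    rois: (top, left, bottom, right) boxes in patch-grid coordinates.
    Only aggregation and classification are re-run for each box."""
    nH, nW = grid_hw
    F = patch_feats.view(nH, nW, -1)
    scores = []
    for t, l, b, r in rois:
        roi = F[t:b, l:r].reshape(-1, F.size(-1))       # features in the box
        agg = torch.cat([roi.max(0).values, roi.min(0).values,
                         roi.mean(0), (roi ** 2).mean(0)])
        prob = classifier(agg.unsqueeze(0)).softmax(-1)[0, 1]
        scores.append(prob.item())                       # forgery score
    return scores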
V. CONCLUSION
We proposed a new CNN-based framework for image forgery detection. Thanks to suitable architectural solutions, it allows one to process jointly information gathered at full resolution from the whole image. Moreover, the framework can be trained end-to-end based only on weak (image-level) supervision. We proved the effectiveness of this solution by extensive performance analysis on forensic datasets widespread in the community. A large performance gain is observed in all cases with respect to all reference methods. In addition, the framework can also be recast to provide localization information, both in supervised and unsupervised modality.

Despite the very promising results, there is still much room for improvement. In particular, better forms of pooling should be considered to preserve long-range spatial relationships in the aggregation phase. Moreover, image and object semantics should be taken into account to complement the low-level information analyzed by the current framework. Work is already under way along these paths.
ACKNOWLEDGMENT
This material is based on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.
REFERENCES

[1] B. Mahdian and S. Saic, "Using noise inconsistencies for blind image forensics," Image and Vision Computing, vol. 27, no. 10, pp. 1497–1503, 2009.
[2] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, "Image forgery localization via fine-grained analysis of CFA artifacts," IEEE Trans. Inf. Forensics Security, vol. 7, no. 5, pp. 1566–1577, 2012.
[3] S. Ye, Q. Sun, and E.-C. Chang, "Detecting digital image forgeries by measuring inconsistencies of blocking artifact," in IEEE International Conference on Multimedia and Expo, 2007, pp. 12–15.
[4] T. Bianchi and A. Piva, "Image forgery localization via block-grained analysis of JPEG artifacts," IEEE Trans. Inf. Forensics Security, vol. 7, no. 3, pp. 1003–1017, 2012.
[5] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Determining image origin and integrity using sensor noise," IEEE Trans. Inf. Forensics Security, vol. 3, no. 4, pp. 74–90, 2008.
[6] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva, "A Bayesian-MRF approach for PRNU-based image forgery detection," IEEE Trans. Inf. Forensics Security, vol. 9, no. 4, pp. 554–567, 2014.
[7] M. Kirchner and J. Fridrich, "On detection of median filtering in digital images," in SPIE, Electronic Imaging, Media Forensics and Security XII, vol. 7541, 2010, pp. 101–112.
[8] D. Cozzolino, D. Gragnaniello, and L. Verdoliva, "Image forgery detection through residual-based local descriptors and block-matching," in IEEE International Conference on Image Processing, 2014, pp. 5297–5301.
[9] X. Zhao, S. Wang, S. Li, and J. Li, "Passive image-splicing detection by a 2-D noncausal Markov model," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 2, pp. 185–199, 2015.
[10] D. Cozzolino, G. Poggi, and L. Verdoliva, "Splicebuster: A new blind image splicing detector," in IEEE International Workshop on Information Forensics and Security, 2015, pp. 1–6.
[11] H. Li, W. Luo, X. Qiu, and J. Huang, "Identification of various image operations using residual-based features," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 1, pp. 31–45, 2018.
[12] S. Lyu and H. Farid, "How realistic is photorealistic?" IEEE Trans. Signal Process., vol. 53, no. 2, pp. 845–850, 2005.
[13] Y. Shi, C. Chen, and G. Xuan, "Steganalysis versus splicing detection," in International Workshop on Digital Watermarking, vol. 5041, 2008, pp. 158–172.
[14] Z. He, W. Lu, W. Sun, and J. Huang, "Digital image splicing detection based on Markov features in DCT and DWT domain," Pattern Recognition, vol. 45, pp. 4292–4299, 2012.
[15] J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Trans. Inf. Forensics Security, vol. 7, pp. 868–882, 2012.
[16] Y. Rao and J. Ni, "A deep learning approach to detection of splicing and copy-move forgeries in images," in IEEE International Workshop on Information Forensics and Security, 2016, pp. 1–6.
[17] Y. Liu, Q. Guan, X. Zhao, and Y. Cao, "Image forgery localization based on multi-scale convolutional neural networks," in ACM Workshop on Information Hiding and Multimedia Security, 2018.
[18] B. Bayar and M. Stamm, "A deep learning approach to universal image manipulation detection using a new convolutional layer," in ACM Workshop on Information Hiding and Multimedia Security, 2016.
[19] D. Cozzolino, G. Poggi, and L. Verdoliva, "Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection," in ACM Workshop on Information Hiding and Multimedia Security, 2017, pp. 1–6.
[20] P. Zhou, X. Han, V. Morariu, and L. Davis, "Learning rich features for image manipulation detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1053–1061.
[21] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, "Detection of GAN-generated fake images over social networks," in IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), April 2018.
[22] A. Roessler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to detect manipulated facial images," in International Conference on Computer Vision, 2019.
[23] R. Salloum, Y. Ren, and C.-C. J. Kuo, "Image splicing localization using a multi-task fully convolutional network (MFCN)," Journal of Visual Communication and Image Representation, vol. 51, pp. 201–209, 2018.
[24] Z. Zhang, Y. Zhang, Z. Zhou, and J. Luo, "Boundary-based image forgery detection by fast shallow CNN," in IEEE International Conference on Pattern Recognition, 2018, pp. 2658–2663.
[25] J. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. Roy-Chowdhury, "Hybrid LSTM and encoder-decoder architecture for detection of image forgeries," IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3286–3300, 2019.
[26] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva, "ForensicTransfer: Weakly-supervised domain adaptation for forgery detection," arXiv preprint arXiv:1812.02510, 2018.
[27] D. Cozzolino and L. Verdoliva, "Single-image splicing localization through autoencoder-based anomaly detection," in IEEE Workshop on Information Forensics and Security, 2016, pp. 1–6.
[28] L. Bondi, S. Lameri, D. Güera, P. Bestagini, E. Delp, and S. Tubaro, "Tampering detection and localization through clustering of camera-based CNN features," in IEEE CVPR Workshops, 2017.
[29] D. Cozzolino and L. Verdoliva, "Noiseprint: a CNN-based camera model fingerprint," IEEE Trans. Inf. Forensics Security, in press, 2019.
[30] ——, "Camera-based image forgery localization using convolutional neural networks," in European Signal Processing Conference, Sep. 2018.
[31] M. Huh, A. Liu, A. Owens, and A. Efros, "Fighting fake news: Image splice detection via learned self-consistency," in European Conference on Computer Vision, 2018, pp. 101–117.
[32] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," arXiv preprint arXiv:1604.06174, 2016.
[33] D. Shullani, M. Fontani, M. Iuliani, O. A. Shaya, and A. Piva, "VISION: a video and image dataset for source identification," EURASIP Journal on Information Security, pp. 1–16, 2017.
[34] G. Schaefer and M. Stich, "UCID: an uncompressed color image database," in Storage and Retrieval Methods and Applications for Multimedia 2004, vol. 5307, 2003, pp. 472–480.
[35] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] T. Gloe and R. Böhme, "The 'Dresden Image Database' for benchmarking digital image forensics," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC 2010), vol. 2, Sierre, Switzerland, Mar. 2010, pp. 1585–1591.
[37] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou, "An evaluation of popular copy-move forgery detection approaches," IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1841–1854, 2012.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[39] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[40] T. de Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. Rocha, "Exposing digital image forgeries by illumination color classification," IEEE Trans. Inf. Forensics Security, vol. 8, no. 7, pp. 1182–1194, 2013.
[41] P. Korus and J. Huang, "Evaluation of random field models in multi-modal unsupervised tampering localization," in IEEE International Workshop on Information Forensics and Security, Dec. 2017, pp. 1–6.
[42] H. Guan, M. Kozak, E. Robertson, Y. Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus, "MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation," in IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019, pp. 63–72.
[43] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in IEEE International Conference on Computer Vision, 2017.