Image Splicing Detection, Localization and Attribution via JPEG Primary Quantization Matrix Estimation and Clustering
Yakun Niu, Benedetta Tondi, Member, IEEE, Yao Zhao, Senior Member, IEEE, Rongrong Ni, and Mauro Barni, Fellow, IEEE
Abstract—Detection of inconsistencies of double JPEG artefacts across different image regions is often used to detect local image manipulations, like image splicing, and to localize them. In this paper, we move one step further, proposing an end-to-end system that, in addition to detecting and localizing spliced regions, can also distinguish regions coming from different donor images. We assume that both the spliced regions and the background image have undergone a double JPEG compression, and use a local estimate of the primary quantization matrix to distinguish between spliced regions taken from different sources. To do so, we cluster the image blocks according to the estimated primary quantization matrix and refine the result by means of morphological reconstruction. The proposed method can work in a wide variety of settings, including aligned and non-aligned double JPEG compression, and regardless of whether the second compression is stronger or weaker than the first one. We validated the proposed approach by means of extensive experiments showing its superior performance with respect to baseline methods working in similar conditions.
Index Terms—Image forensics, double JPEG compression, image forgery localization, deep learning based image forensics, primary quantization matrix, spectral clustering, normalized mutual information (NMI).
I. INTRODUCTION
Detection of double JPEG (DJPEG) compression plays a major role in image forensics, since double compression reveals important information about the past history of an image [1], [2]. This is the case of image splicing detection and localization. When part of an image is spliced from a donor JPEG image into a target JPEG image to create a composite forgery (which is eventually recompressed, so that the final image is double compressed), it is quite common that the compression setting used for the donor image is not equal to that used to compress the target image. As a consequence, the original and spliced regions of the forged image exhibit different (double) compression artefacts, thus providing the basis for the detection and localization of the spliced region. Most of the methods proposed so far to detect image splicing based on double compression artefacts work under the following simple assumptions:

1) the tampered region (or regions) comes from a single donor image. Very few attempts have been made to identify forgeries containing multiple spliced areas coming from different donor images. Yet, given an image with several copy-pasted regions, it is possible, at least in principle, to identify the different origins of the tampered areas by recognizing that spliced regions coming from different donor images probably underwent a different double compression history;

2) the spliced region (or regions) is taken from a non-compressed image and spliced into a JPEG image. After recompression, the forged area has undergone only a single JPEG compression (SJPEG), while the background has been compressed twice [3], [4], [5]. In the following, we refer to this situation as DJPEG vs SJPEG detection.
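The basic observable exploited by DJPEG detectors can be illustrated with a toy numerical experiment (not part of the paper's method): quantizing DCT-like coefficients with a step q1, dequantizing, and requantizing with a different step q2 leaves periodic empty bins in the histogram of the requantized values, whereas a single compression does not. The quantization steps and sample size below are illustrative assumptions.

```python
import numpy as np

def double_quantize(coeffs, q1, q2):
    """Simulate double JPEG quantization of one DCT coefficient:
    quantize with step q1, dequantize, then requantize with step q2."""
    first = np.round(coeffs / q1) * q1       # first compression + decompression
    return np.round(first / q2).astype(int)  # second compression (quantized indices)

rng = np.random.default_rng(0)
# Laplacian-like distribution, a common model for AC DCT coefficients
coeffs = rng.laplace(scale=15.0, size=200_000)

single = np.round(coeffs / 2).astype(int)     # single compression, step 2
double = double_quantize(coeffs, q1=5, q2=2)  # double compression, steps 5 then 2

# Single compression fills every histogram bin; double compression with
# q1 > q2 leaves periodic empty bins (the classic DJPEG artefact).
bins = np.arange(-20, 21)
empty_single = sum((single == b).sum() == 0 for b in bins)
empty_double = sum((double == b).sum() == 0 for b in bins)
print("empty bins, single:", empty_single, " double:", empty_double)
```

With these steps, only values of the form round(5m/2) can appear after double quantization, so bins such as ±1, ±3 and ±4 stay empty, while the singly compressed histogram has no gaps.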
In practical scenarios, however, it is more likely that both the donor and the target images have been JPEG compressed, thus calling for the development of techniques capable of working in a DJPEG vs DJPEG setting.

In this paper, we propose a general approach to simultaneously perform image splicing detection, localization and attribution of regions coming from different donors. The proposed method, which is specifically thought to work in the DJPEG vs DJPEG scenario (but can also work in the SJPEG vs DJPEG case), relies on the estimation of the quantization matrix used in the first compression step of a DJPEG image (in the following, we will refer to such a matrix as the primary quantization matrix and we will indicate it by Q1). Specifically, the proposed method works by providing a local blockwise estimate of the primary quantization matrix and then clustering the image blocks according to such an estimate. Splicing detection is achieved by recognizing the presence of more than one cluster, while the exact number of clusters identifies the number of donor images used to create the forgery. Eventually, by looking at blocks belonging to different clusters, we can localize the spliced areas and attribute them to different donor images. As an additional advantage, the proposed method also works when the second compression is stronger than the first one, e.g. when the quality factor used for the first compression (QF1) is larger than that used for the second one (QF2), that is, when QF1 > QF2. Moreover, it can also cope with non-standard quantization matrices.

For the estimation of the Q1 matrix, we adopt the CNN-based estimator proposed in [6], due to its capability to provide good estimation results also on small patches. To get the tampering map with the indication of the spliced regions, we associate to each 8×8 block of the image a vector with the estimated quantization steps, then we apply Spectral Clustering (SC) to such vectors [7].
In order to determine the number of clusters, we trained a Convolutional Neural Network (CNN) taking as input the estimated quantization steps. If only one cluster is found, the image is classified as a non-forged image and no further operation is carried out. In the presence of multiple clusters, we apply SC; then the largest cluster is associated to the image background, while the others are regarded as belonging to spliced regions, each cluster corresponding to a different donor image. The tampering map and the estimated number of spliced regions are finally refined by enforcing the spatial coherence and smoothness of the clusters through morphological reconstruction.

¹The DJPEG vs DJPEG scenario is often addressed indirectly by assuming that the second compression of the foreground is performed on a misaligned JPEG grid, while an aligned DJPEG is applied to the background (or vice versa). In this setting, many systems implicitly regard the background as SJPEG, hence reducing this case to a SJPEG vs DJPEG scenario.

The main contributions of our work can be stated as follows:

• We propose a new method to localize tampering in JPEG images, distinguishing between spliced regions coming from different donor images, under the assumption that the donor images have been compressed with different Q1 matrices. We do so in the most common DJPEG vs DJPEG scenario.

• To the best of our knowledge, this is the first method that performs identification and attribution of multiple spliced regions by relying on the analysis of compression traces. The proposed approach is a very general one, since it can be seamlessly applied in a wide variety of cases. First, it is one of the few approaches explicitly thought to work in a DJPEG vs DJPEG setting; secondly, it works also when QF1 > QF2. Eventually, it maintains good performance regardless of whether the first and second compression grids are aligned or not.
• We designed and trained a CNN to identify the number of clusters present in the image. Thanks to such a CNN, the proposed system is able to carry out splicing detection, localization and attribution simultaneously, hence providing an end-to-end system for image splicing forensics.

• The proposed method is a very general one and can be applied also when non-standard quantization matrices are used and hence the quality factor is not defined.

• In order to evaluate the performance of tampering localization in the case of spliced regions originating from multiple sources or donor images, we introduce a new metric based on the Normalized Mutual Information (NMI), commonly used in pattern recognition applications to assess the performance of clustering.

• We carried out an extensive experimental campaign to evaluate the effectiveness of the proposed method in a wide variety of settings.

The rest of this paper is organized as follows. After a brief review of related methods (Section II), in Section III we present the general tampering localization setup considered in this paper and introduce the main notations. The proposed method for image tampering localization is described in Section IV. Section V describes the methodology we followed for the experimental analysis, whose results are reported in Section VI. We conclude the paper with some final remarks in Section VII.

II. PRIOR ART
Several techniques for image tampering localization have been proposed in the forensic literature, relying on different manipulation traces, e.g. inconsistencies of resampling artifacts [8], [9], or the presence of different sensor noise patterns [10], [11]. More often, inconsistencies of JPEG compression artefacts are exploited. Due to the popularity of the JPEG compression standard, in fact, image editing software often re-saves the edited images in JPEG format, hence making it possible to detect and localize tampering based on the analysis of the traces of double JPEG compression and their inconsistencies across the tampered image. In the following, we briefly review the relevant literature about DJPEG detection for image tampering localization.

It is well known that double JPEG compression leaves peculiar artifacts in the DCT domain, in particular in the histograms of block-DCT coefficients [12]. Accordingly, many tampering detection algorithms rely on the statistical analysis of DCT coefficients [2], [13]. Some examples of methods for detecting double compression artefacts in the non-aligned DJPEG scenario, relying on handcrafted features computed in the pixel or the DCT domain, are described in [5], [14], [15], [16]. Early approaches were designed to work on the whole image, to detect whether the image has undergone a global single or double JPEG compression. Such methods are not applicable in a tampering detection scenario, where only part of the image has been manipulated, due to the difficulty of estimating the required statistics on small blocks. To cope with this problem, a number of other methods have been developed for DJPEG localization [3], [17], [18]. In general, these methods have low spatial resolution and their performance drops significantly when small regions are considered. More recently, a new class of CNN-based methods has been proposed.
They are able to improve the spatial resolution of DJPEG localization and can then be conveniently used for tampering localization (see, for instance, [19], [20] for both aligned and non-aligned DJPEG detection, and [4] for the aligned DJPEG case). All the above methods focus on the SJPEG vs DJPEG scenario, that is, they work under the assumption that the tampered areas are double compressed while the background is single compressed. In contrast, very little work has been done to specifically address the more challenging DJPEG vs DJPEG scenario considered in this paper. In principle, methods capable of estimating the primary quantization matrix of DJPEG images, e.g. [1] and [21], could be applied to this scenario. However, the methods in [1], [21] work on the full image, and hence are not suitable for localization. In [22], a technique is proposed to detect whether part of an image was formerly compressed with a lower JPEG quality than the rest of the image (QF1 < QF2), by means of exhaustive recompression with every quality factor.

The works that are most closely related to this paper are [23] and [24]. Both these methods estimate the Q1 matrix on a local basis and output a map with the probability that a DCT block has been double compressed. The method in [23] works under the assumption that the histograms of the unquantized DCT coefficients are locally uniform in the non-tampered region. Moreover, accurate detection can be achieved only when the analyzed regions are sufficiently large. Two approaches are proposed in [23] for the cases of aligned and non-aligned DJPEG. The method in [24] is designed for the case of aligned DJPEG compression.
Both methods work better when QF2 > QF1, while performance is significantly worse in the opposite case.

Being able to distinguish spliced regions coming from different donor images, our method can also be used for image phylogeny, where the identification of the donor images is a required step to identify the relationships between the spliced image and its parent images [25], [26], [27], and to use them to reconstruct the history of semantically similar images. From this perspective, the goal of the system described in this paper is not very different from that of image phylogeny applications, the main difference being that in the image phylogeny scenario the donor images are assumed to be available to the analyst.

III. PROBLEM STATEMENT
Let Q denote the 8×8 matrix with the quantization steps of the DCT coefficients, namely the quantization matrix, used for JPEG compression. The image tampering scenario considered in this paper is illustrated in Fig. 1. The spliced regions (referred to as foreground regions), possibly coming from different donor images, and the background are doubly JPEG compressed; however, different quantization matrices have been used for the former compression. In the figure, Q1 denotes the primary quantization matrix of the background, Q1′ and Q1′′ the primary quantization matrices of the spliced regions. The tampered image is finally JPEG compressed with another quantization matrix (Q2). The final image is then a double compressed JPEG image. The second compression can be either aligned or non-aligned with the first one, depending on the position of the 8×8 JPEG compression grid. A misalignment occurs in the background, for instance, when the image is cropped between the former and the second compression stage. With regard to the spliced area(s), when a region of a JPEG image is copy-pasted into another JPEG image, it is very likely that the alignment between the compression grids is not preserved, and the final JPEG compression will not be aligned with the grid of the spliced area(s). In the following, we denote the aligned DJPEG scenario, i.e. when no misalignment occurs between the two compressions, with the acronym A-DJPEG, and the non-aligned DJPEG scenario with NA-DJPEG.

In the scenario described above, spliced regions coming from different donor images can be distinguished by relying on the inconsistencies between the primary quantization matrices. Let k denote the total number of different sources present in the image, including the background. Accordingly, for a pristine image k = 1. For a tampered image, k corresponds to the number of donor images plus the background, the total number of donor images used to create the forgery being then k − 1.
In this setting, the system we have developed aims at solving three different problems: tampering detection, localization and attribution. The detection part outputs a binary decision on the presence or absence of tampering based on the estimated k. When tampering is detected (k̂ > 1), a tampering localization map is returned by the system. The tampering map is a coloured map, with different colours assigned to the background pixels and to the pixels of spliced regions coming from different donors. Source attribution is performed based on the colours of the spliced regions.

In the following, we introduce the main notations used throughout the paper. We denote by q the 64-dim vector with the elements of the 8×8 Q matrix, taken in zig-zag order [28]. The quantization steps corresponding to the medium-high frequencies are more difficult to estimate, since they are quantized more heavily. However, they are usually less important, since they tend to be similar for most quantization matrices. For this reason, as in most of the related literature, the estimation is restricted to the first N_c elements of q. Hereafter, we will use the symbol q to indicate only the first N_c quantization steps. With a slight abuse of notation, we denote with Q̂(·, ·, ·) the tensor with the estimated primary quantization steps for every 8×8 block of the image. Specifically, for a given (i, j), Q̂(i, j, k) corresponds to the estimate of the k-th element of q for the 8×8 block of pixels in the position indicated by (i, j).

Fig. 1. Image tampering setup considered in this paper.
Fig. 2. Scheme of the proposed method for splicing detection, localization and attribution of DJPEG images based on Q1 matrix inconsistencies.

IV. PROPOSED METHOD
The overall scheme of the proposed method is illustrated in Fig. 2. Given a possibly tampered DJPEG image, a blockwise estimate of the primary quantization matrix is first obtained (first block in Fig. 2), yielding Q̂; then splicing detection, localization and attribution are performed by clustering the image blocks according to the result of the estimation, followed by a map refinement step (dashed block in Fig. 2).

The number of clusters is first estimated from the Q̂ tensor via a CNN model, then a clustering algorithm is applied to Q̂ to obtain the tampering map. Specifically, the N_c-dim vectors with the estimated quantization steps of the image blocks are regarded as points in the clustering space. As a result of clustering, a label is associated to each block of the image. In this way, the clustering labels in the tampering map indicate the different donor images used to build the tampered image. A map refinement step is finally performed on the clustering map to improve the quality of the map based on spatial information.

In the above scheme, k̂ denotes the estimated value of k. Tampering detection is carried out on the basis of the estimated value of k. In particular, if k̂ = 1, the image is declared to be pristine and the process ends. If k̂ > 1, the clustering algorithm is applied to perform splicing localization and attribution, followed by map refinement. After map refinement, the number of clusters in the map might change. We let k̂_r be the number of clusters after the refinement step. Then, if k̂_r = 1, the image is declared to be pristine, while if k̂_r > 1, the image is judged to be tampered. By looking at the labels of the different clusters, we can localize the spliced areas and attribute them to different donor images.

Fig. 3. Example of the Q̂ tensor for various quantization steps. The location of spliced regions can be easily spotted from all the bands of the tensor.

A detailed description of each block of Fig.
2 is provided in the following subsections.
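The pipeline's first stage can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: the trained CNN of [6] is replaced by a placeholder function, while the geometry (64×64 patches, stride 8, N_c = 15, output size R/8 − 7 by C/8 − 7) follows the description in Section IV-A.

```python
import numpy as np

N_C = 15          # number of estimated quantization steps (low frequencies)
PATCH, STRIDE = 64, 8

def estimate_q(patch):
    """Placeholder for the CNN estimator of [6]: a real system would run the
    trained network here and return the first N_C primary quantization steps."""
    return np.ones(N_C)  # dummy values for illustration

def build_q_tensor(image, estimator=estimate_q):
    """Slide a 64x64 window with stride 8 over the image and collect the
    estimated quantization-step vectors into a tensor of shape
    (R/8 - 7, C/8 - 7, N_C), the paper's R' x C' x N_c."""
    R, C = image.shape[:2]
    r_out = (R - PATCH) // STRIDE + 1   # = R/8 - 7 when R is a multiple of 8
    c_out = (C - PATCH) // STRIDE + 1
    q_tensor = np.empty((r_out, c_out, N_C))
    for i in range(r_out):
        for j in range(c_out):
            patch = image[i*STRIDE:i*STRIDE+PATCH, j*STRIDE:j*STRIDE+PATCH]
            # the estimate is assigned to the central 8x8 block of the patch
            q_tensor[i, j] = estimator(patch)
    return q_tensor

qt = build_q_tensor(np.zeros((256, 256)))
print(qt.shape)   # (25, 25, 15): 256/8 - 7 = 25
```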
A. Patch-based estimation of the primary quantization matrix
The primary quantization matrix is estimated on a local window basis by means of a CNN estimator. In particular, we chose the CNN-based approach described in [6], due to its ability to work regardless of the alignment/misalignment of the first and second compression grids, and to the good performance obtained even when QF2 < QF1.

The network adopted in [6] for primary quantization matrix estimation has an input size of 64×64, and N_c final output nodes, each providing the estimated value of the quantization step of a DCT coefficient. As we said, we let N_c = 15. Given an input patch x, the CNN is trained to minimize the difference between the predicted values f(x) and the true vector q(x). Rounding is performed independently on each element of the output vector to get the final prediction, that is, q̂(x) = round(f(x)).

Since the CNN is applied patch-wise to the input image, the estimation step returns a tensor with the N_c-dim vectors of the primary quantization steps of each 8×8 block. Specifically, given an input image x of size R × C, the CNN estimator is run on 64×64 overlapping patches (each shifted by 8 pixels with respect to the previous one). The estimated vector, then, is assigned to the central 8×8 block of the patch.¹ At the end, a tensor Q̂ with the estimated quantization steps is obtained. More precisely, by assuming that the stride s used to slide the estimation window over the image is equal to 8, and by assuming, for simplicity, that R and C are multiples of 8, the estimated tensor Q̂ has size R′ × C′ × N_c, where R′ = R/8 − 7 and C′ = C/8 − 7. Note that, for simplicity, we are not considering blocks close to the border of the image.² In the following, we use the compact notation Q̂_t to denote the t-th estimated vector, that is, Q̂_t = q̂(x_t), where x_t is the t-th 64×64 patch fed into the CNN estimator in left-to-right, top-to-bottom scanning order. Fig.
3 shows an example of a tensor Q̂ estimated by the CNN. We can observe that the various components of the tensor corresponding to different quantization steps provide useful information regarding the position and the provenance of the spliced areas.

B. Localization and attribution of spliced regions
To localize and attribute multiple spliced regions to different donor images based on the estimated quantization matrix, we adopted the Spectral Clustering (SC) algorithm [7]. Spectral clustering has been used in recent years in related image forensics fields, including, for instance, camera identification [29] and mobile phone clustering [30]. Recently, SC has also been considered for splicing detection [31], to improve the performance of a deep-learning-based forensic approach for general forgery localization.

Based on some preliminary tests we carried out, SC provides better results compared to other clustering methods, like expectation maximization [32], hierarchical clustering [33], fuzzy clustering [34] and, in particular, K-means clustering. SC exploits graph theory to map points (in our case, the N_c-dim vectors Q̂_t = q̂(x_t)) to a low-dimensional space [7]. The problem of clustering is reformulated by using a similarity graph. More specifically, an undirected similarity graph G = (V, S), where V denotes the set of vertexes or nodes and S the set of edges, is associated to the Q̂ tensor, as shown in Fig. 4. Each element of the tensor, i.e. each Q̂_t vector, represents a node of the graph. The number of nodes N = |V| of the graph corresponds to the number of 8×8 blocks in the image. Then, S ∈ R^{N×N}.

¹Given that the patch contains an even number of blocks, rigorously speaking a central block does not exist. In the following, we denote the block in the fourth block-row and fourth block-column as the central block.
²If needed, we can incorporate such blocks into the analysis by mirror-padding the border blocks.

Fig. 4. Similarity graph associated to the Q̂ tensor.
The edge weights S_ij represent the similarity of the nodes i and j and, in our case, are defined as S_ij = exp(−||Q̂_i − Q̂_j||² / σ²). The choice of the scale parameter σ is not obvious, so we determined it experimentally.

The goal of SC is to find a partition of the graph such that the edges within a group (cluster) have high weights, i.e. the points within the same cluster are similar to each other, and the edges between different groups (clusters) have very low weights, i.e. the points belonging to different clusters are different from each other. To do so, the SC algorithm computes the Laplacian matrix associated to G, and then applies the K-means algorithm to the rows of the eigenvector matrix of the Laplacian.

C. CNN-based estimation of the number of clusters
In principle, the spectral clustering algorithm itself would provide a way to estimate the number of clusters k by relying on graph theory, that is, through the analysis of the eigenvalues (λ_i), i = 1, …, N, of the Laplacian matrix associated to the graph (see [7]). Specifically, the eigenvalues are listed in ascending order, and the index i corresponding to the maximum gap between two consecutive eigenvalues λ_{i+1} − λ_i is selected as k̂, that is, k̂ = arg max_i (λ_{i+1} − λ_i). Based on our experiments, however, estimating k in this way provides poor results.

By carefully analyzing the results of the preliminary experiments we ran to estimate k by means of SC, we concluded that the bad performance we got is due to the fact that SC does not exploit any spatial information. This prevents estimating k correctly even in cases that appear easy to solve by visual inspection. Fig. 5 shows an example in which there are several scattered areas with yellow colour in the estimated tensor Q̂. Such scattered areas are mistakenly regarded as one cluster by SC, even if their spatial incoherence makes it very unlikely that they correspond to a spliced region. As a result, the k estimated by SC is 3 while the true one is 2. By exploiting the spatial information, such spatially scattered areas can be identified as a noisy cluster and assigned to the background.

Fig. 5. Scattered clusters of Q̂ in the spatial domain.

In order to exploit the spatial information, we trained a CNN to estimate the number of clusters k directly from Q̂. The CNN has an input size of R′ × C′ × N_c and a number of output nodes equal to 4 (Fig. 6), since we have implicitly assumed that the spliced regions in the tampered image come from at most 3 different donor images (k ≤ 4). Specifically, we chose an architecture commonly and successfully used for pattern recognition applications, namely the VGG-16 network [35], which has 16 layers in total, of which 2 are Fully Connected (FC) layers.
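The similarity graph and the eigengap heuristic discussed above (which the authors found unreliable without spatial information) can be sketched with plain NumPy. This is a toy illustration under stated assumptions: σ and the synthetic data are arbitrary, and a minimal two-way split on the second eigenvector of the normalized Laplacian stands in for the full SC-plus-K-means pipeline.

```python
import numpy as np

def similarity_matrix(q_vectors, sigma=1.0):
    """Gaussian similarity S_ij = exp(-||q_i - q_j||^2 / sigma^2) between the
    estimated quantization-step vectors (sigma is tuned experimentally)."""
    d2 = ((q_vectors[:, None, :] - q_vectors[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def norm_laplacian(S):
    """Symmetric normalized Laplacian L = I - D^{-1/2} S D^{-1/2}."""
    d = S.sum(1)
    return np.eye(len(S)) - S / np.sqrt(np.outer(d, d))

def eigengap_estimate(S, k_max=4):
    """Eigengap heuristic: with the Laplacian eigenvalues in ascending order,
    pick the index of the largest gap between consecutive eigenvalues."""
    lam = np.sort(np.linalg.eigvalsh(norm_laplacian(S)))
    return int(np.argmax(np.diff(lam[:k_max + 1]))) + 1

def spectral_bipartition(S):
    """Minimal two-way spectral split: threshold the second-smallest
    eigenvector of the normalized Laplacian (a stand-in for SC + K-means)."""
    w, v = np.linalg.eigh(norm_laplacian(S))
    fiedler = v[:, np.argsort(w)[1]]
    return (fiedler > fiedler.mean()).astype(int)

# toy data: two well-separated groups of 15-dim quantization-step vectors
rng = np.random.default_rng(1)
qv = np.vstack([rng.normal(2.0, 0.05, (30, 15)),
                rng.normal(8.0, 0.05, (30, 15))])
S = similarity_matrix(qv, sigma=2.0)
print(eigengap_estimate(S))   # the eigengap correctly suggests 2 clusters here
labels = spectral_bipartition(S)
```

On such clean, well-separated data the eigengap works; the paper's point is that it breaks down on noisy, spatially scattered estimates, motivating the CNN-based count estimation.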
The CNN has been trained on Q̂ tensors computed from pristine images, for k = 1, and tampered images, for k = 2, 3, 4. The dataset creation process for the tampered images and the setup considered for network training and validation are described in Section V-B1, while the details of the training process are provided in Section V-D.

D. Tampering map refinement by means of morphological reconstruction
A visual analysis of the tampering maps obtained after the application of the SC algorithm often reveals the presence of spurious isolated regions that do not correspond to any spliced region. By referring to Fig. 7 as an example, we observe two types of spurious regions: small and usually scattered regions belonging to the same cluster as one big region, and ring-shaped regions along the boundary of (usually big) spliced areas. The first type of spurious regions is due to the scattered presence of image blocks for which the estimated quantization steps are similar to those of truly spliced regions. The presence of such regions is due to the noisiness of the Q1 estimation step and to the lack of spatial information during the clustering phase. Ring-shaped spurious regions are due to the window-based approach used for Q1 estimation. In the proximity of the boundary of spliced regions, in fact, the estimation window includes blocks from both the spliced and background regions, thus producing a somewhat mixed estimated vector. For large spliced regions, the number of blocks with the mixed estimate is large enough to represent a separate ring-shaped cluster (the mixed estimated quantization steps may also resemble the values of another truly spliced region, as with the brown ring-shaped region in Fig. 7). We also observed that spurious regions are more frequent when k̂ > k, that is, when the SC algorithm is run with a k̂ larger than the correct one.

Fig. 6. Input and output of the CNN used for estimating k from the Q̂ tensor.
Fig. 7. Ring-shaped and scattered spurious regions in the preliminary tampering map obtained after clustering.

Of course, the presence of spurious clusters reduces the performance of our algorithm in terms of localization (and attribution) accuracy. For this reason, we introduced a map refinement step aiming at improving the quality of the tampering map based on spatial information. We did so by resorting to morphological reconstruction (MR) [36].
In particular, the map refinement procedure consists of the following sequence of morphological operations:

1) for every cluster, we consider the region formed by the pixels belonging to the cluster;
2) we apply a predefined number of erosion iterations with a small structuring element;
3) we apply a conditional dilation procedure starting from the regions (referred to as marks or seeds, according to the terminology of morphological reconstruction theory [36]) obtained at the end of the erosion phase in 2).

The conditional dilation procedure works as follows: i) first, each seed is expanded by means of a conditional dilation, where the dilation is applied only to the pixels belonging to the cluster the region corresponds to; this procedure is carried out in parallel on the seeds of all clusters; ii) the regions obtained at the end of the previous step are further dilated, conditioned on all the pixels (if any) that do not belong to the background cluster and that have not been assigned yet; iii) if, at a given iteration, regions belonging to different clusters are expanded onto the same pixels, the disputed pixels are assigned randomly to one of the clusters.

The main goal of MR is to reassign pixels belonging to ring-shaped clusters. Fig. 8 shows the results of the process after each of the steps described above. In the erosion step, the ring-shaped cluster and the isolated cluster are removed, while the interior of the spliced regions is kept (see the erosion map). After the conditional dilation procedure, the pixels of the ring-shaped region are reassigned to the corresponding inner cluster (see the dilation map). Note that isolated (usually small) regions are completely eroded during the erosion step and are not reassigned during the conditional dilation.

Fig. 8. The sequence of morphological operations in MR.

The choice of the number of erosion iterations and the size of the structuring element is a crucial one, all the more so because the optimum setting depends on the size of the spliced areas.
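The erosion-plus-conditional-dilation refinement described above can be roughly sketched with SciPy morphology. This is a simplified illustration, not the paper's implementation: the structuring element and iteration count are defaults, the two conditional-dilation stages are merged into one, and disputed pixels simply end up with one of the competing labels rather than being assigned randomly.

```python
import numpy as np
from scipy import ndimage

def refine_map(label_map, background=0, n_erosions=2):
    """Sketch of the map-refinement step: erode each non-background cluster
    to a seed, then grow the seeds back by conditional dilation restricted to
    the original non-background pixels. Small isolated regions vanish during
    erosion and are therefore reassigned to the background."""
    refined = np.full_like(label_map, background)
    non_bg = label_map != background
    seeds = {}
    for lab in np.unique(label_map):
        if lab == background:
            continue
        seed = ndimage.binary_erosion(label_map == lab, iterations=n_erosions)
        if seed.any():                      # fully eroded clusters are dropped
            seeds[lab] = seed
    changed = True
    while changed:                          # grow all seeds in parallel
        changed = False
        taken = np.zeros_like(non_bg)
        for s in seeds.values():
            taken |= s
        for lab, s in seeds.items():
            # dilate within non-background pixels not claimed by other seeds
            grown = ndimage.binary_dilation(s) & non_bg & ~(taken & ~s)
            if (grown & ~s).any():
                seeds[lab] = grown | s
                changed = True
    for lab, s in seeds.items():
        refined[s] = lab
    return refined

# toy map: spliced region (1) with a spurious 1-pixel ring (2) around it
# and a small isolated spurious region (3)
m = np.zeros((20, 20), dtype=int)
m[4:16, 4:16] = 2
m[5:15, 5:15] = 1
m[0:2, 0:2] = 3
r = refine_map(m)
print(np.unique(r))   # the ring is reabsorbed by cluster 1, region 3 removed
```

In the toy example, erosion removes the width-1 ring and the 2×2 isolated region, and the surviving seed of cluster 1 grows back over the ring's pixels, mirroring the ring-reassignment behaviour described in the text.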
For a given structuring element, a too-large number of iterations may cause the removal of spliced regions, while if the number of iterations is too small, the risk is that small isolated clusters are not removed. Given the difficulty of determining the best setting on a theoretical basis, we tuned the system by means of experimental analysis.

After the application of the MR procedure, the number of clusters might change. In particular, the final number k̂_r of clusters may be lower than k̂, because isolated clusters have been removed or ring-shaped clusters have been reassigned to the internal clusters. This happens especially when k is large (see Section VI-A). If, after MR, the number of remaining clusters is equal to 1, the image is considered to be a pristine one. Experiments carried out on several tampered images reveal that MR can indeed help to remove noisy clusters and the undesired rings around compact regions from the maps. The benefits that can be obtained with the map refinement procedure are illustrated in the examples reported in Fig. 9. The number of erosion iterations considered in those examples is 2 and the size of the (disk-shaped) structuring element is 1. In the examples reported in the figure, at the end of the MR procedure, the ring-shaped clusters are reassigned to the internal clusters and thus k̂_r < k̂. We notice that the estimate of k can be worse after map refinement; however, especially when k is large, a better overall result is obtained when the value of k is underestimated. This is the case with the last example in Fig. 9, where we see that using fewer clusters than the estimated k̂ permits removing the two rings around one of the spliced regions. This apparently counterintuitive behaviour is due to the fact that, especially when k is large, the values of the quantization coefficients may be similar for different donor images, and hence it might not be easy to correctly identify the clusters using the true k.
In such cases, a better map is obtained by assigning the regions originating from the donor images with similar quantization matrices to the same cluster.

V. EXPERIMENTAL METHODOLOGY
In this section, we describe the methodology we followed to run the experiments whereby we validated the effectiveness of the proposed method. We first present the evaluation metrics
used to measure the performance of the proposed tool to detect and localize the tampered areas. Then we introduce a metric explicitly conceived to measure the effectiveness of the attribution part of the system. Afterwards, we pass to the description of the automatic procedure that we followed to generate the tampered contents in the DJPEG scenario, and introduce the datasets used for: i) training and testing the CNN model for the estimation of k; ii) assessing the detection, localization and attribution performance of the system. Finally, we describe the two most closely related state-of-the-art methods [23], [24] and describe how they are applied for a fair comparison with the results obtained by our system.

For simplicity and without loss of generality, in our experimental analysis we considered standard quantization matrices, so we refer to them by means of the Quality Factor (QF). Specifically, we denote by QF_2 the QF used for the second compression and by QF_1 that of the primary compression.

Fig. 9. Examples of tampered maps before and after the application of morphological reconstruction.

A. Evaluation metrics
In this section, we introduce the metrics used to measure the performance of our algorithm.
1) Tampering detection:
Tampering detection is a binary classification problem. For tampered images, we have a correct detection when ˆk_r > 1, while for pristine images the detection is correct when ˆk_r = 1; the decision is wrong in all the other cases. Adopting the common terminology of detection theory, in the following we assume that tampered images (k > 1) belong to the positive class and pristine images (k = 1) to the negative one. A Neyman-Pearson setup is considered for the decision. Accordingly, we fix the maximum admissible False Positive Rate, FPR = Pr(ˆk_r > 1 | k = 1), to a target value T, i.e., the percentage of pristine images wrongly detected as tampered, and we evaluate the True Positive Rate, TPR = Pr(ˆk_r > 1 | k > 1), namely, the percentage of correctly detected tampered images. The overall accuracy is given by the fraction of correct decisions on both tampered and pristine images over the total number of tested images.
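Operationally, fixing the FPR in a Neyman-Pearson setup amounts to choosing the decision threshold from the empirical distribution of the detection statistic on pristine images. A minimal sketch (function name ours, not from the paper):

```python
import numpy as np

def set_threshold(pristine_scores, target_fpr=0.05):
    """Return a threshold T such that the fraction of pristine scores
    strictly above T does not exceed target_fpr (empirical quantile)."""
    s = np.sort(np.asarray(pristine_scores, dtype=float))
    idx = int(np.ceil((1.0 - target_fpr) * s.size)) - 1
    return s[idx]
```

For example, with 100 pristine scores and target_fpr = 0.05, the threshold is placed so that at most 5 of the 100 scores exceed it.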
2) Tampering localization and attribution:
With regard to the metrics for assessing the localization and attribution performance, we observe that we should evaluate not only the capability of the system to localize the tampered areas, but also its capability to identify the regions spliced from different donor images as belonging to different clusters.

Let us first focus on the former task. Tampering localization can be regarded as a binary classification problem applied at the pixel level. Pixels belong to one of two classes, the background (the negative class) or the foreground (tampered, or positive, class). To measure the localization performance, we consider the Matthews correlation coefficient (MCC) [37], defined as:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),   (1)

where TP is the number of true positive pixels, i.e., the pixels correctly classified as tampered, TN the number of true negative pixels, i.e., the pixels correctly classified as non-tampered, FP the number of false positives and FN the number of false negatives. If any of the sums in brackets at the denominator is zero, the denominator is arbitrarily set to one. The MCC is particularly helpful in the case of unbalanced classes, as is almost always the case for tampering localization (with highly unbalanced classes, the overall accuracy is not a good indicator of the performance, given that errors on the minority class have virtually no impact on it).

The identification of spliced regions coming from different donor images is a new goal addressed in this paper, so no established metric exists to measure the performance with respect to this task. To fill this gap, we introduce a new metric borrowed from the clustering field, the Normalized Mutual Information (NMI) [38].

To start with, we observe that the output of the clustering algorithm is a map assigning to each pixel a label, ranging from 1 to ˆk_r, indicating the cluster the pixel belongs to. The ground truth map indicates for every pixel the true cluster, i.e., the corresponding donor image for foreground pixels, or the background cluster.
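Both metrics of this section are straightforward to compute from a localization map and a clustering map. The following is a minimal numpy sketch (function names are ours); the NMI implementation is label-permutation invariant, as required, since only the joint distribution of the two labelings matters:

```python
import math
import numpy as np

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; zero sums in the denominator
    are handled by setting the denominator to one."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / (den if den > 0 else 1.0)

def nmi(y, c):
    """Normalized mutual information between a ground-truth label map y
    and a clustering map c: NMI = 2 I(y;c) / (H(y) + H(c))."""
    y, c = np.ravel(y), np.ravel(c)
    # joint distribution p(y = i, c = j) as fractions of all pixels
    joint = np.array([[np.mean((y == i) & (c == j))
                       for j in np.unique(c)] for i in np.unique(y)])
    py, pc = joint.sum(1), joint.sum(0)          # marginals
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    nz = joint > 0
    I = np.sum(joint[nz] * np.log(joint[nz] / np.outer(py, pc)[nz]))
    return 2 * I / (H(py) + H(pc))
```

A perfect clustering with permuted labels, e.g. ground truth [1, 1, 2, 2] against output [3, 3, 1, 1], yields NMI = 1.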
Note that the exact label assigned to each cluster is irrelevant, as long as pixels coming from different donor images are assigned to different clusters; it is not necessary that the labels of the clustering map be identical to those of the ground truth. As an additional difficulty, we observe that the number of clusters contained in the map output by the tampering localization and attribution system does not need to be equal to the number of clusters in the ground truth map. In the following, we refer to the labels assigned to the regions of the ground truth map as pixel classes. Let y denote the class label (y = 1, ..., k) and c the cluster label (c = 1, ..., ˆk_r) in the output clustering map. The NMI index is defined as:

NMI(y, c) = 2 I(y; c) / (H(y) + H(c)),   (2)

where H(y) and H(c) denote, respectively, the empirical entropy of y and c, and I(y; c) the empirical mutual information between y and c. More precisely, let p_y(i) = p(y = i) = #{pixels in class i} / total no. of pixels, p_c(j) = p(c = j) = #{pixels in cluster j} / total no. of pixels, and

p_{y|c}(i|j) = p(y = i | c = j) = #{pixels of class i assigned to cluster j} / #{pixels in cluster j},   (3)

where the total number of pixels is R′ · C′. Then, H(y) = −Σ_{i=1}^{k} p_y(i) log p_y(i) (and similarly for H(c)), and

I(y; c) = Σ_{i=1}^{k} Σ_{j=1}^{ˆk_r} p_{y|c}(i|j) p_c(j) log( p_{y|c}(i|j) / p_y(i) ).   (4)

In case of perfect clustering, it is easy to see that H(y) = H(c) and I(y; c) = H(y), so that NMI = 1. Note that, being a normalized quantity, the NMI allows comparing cases with a different number of clusters.

B. Datasets
To build the datasets for our experiments, we started from the 8156 camera-native uncompressed large-size images in the RAISE8K dataset [39]. We divided these images into two sets: 7000 images to be used for training (and validation) and 1156 images for the tests. On average, about 5 non-overlapping patches are extracted from each RAISE image, for a total of 41000 patches; 35000 of them (coming from the set of 7000 original images), denoted by S_tr, were used to produce the pristine and tampered images for training the models, and the remaining 5780, denoted by S_ts, to produce the pristine and tampered images for the tests.

We considered two types of DJPEG pristine and tampered images, named Type I and Type II, respectively for the case of Aligned DJPEG (A-DJPEG) and the case of Non-Aligned DJPEG (NA-DJPEG).

• For Type I images: the pristine images are A-DJPEG images, that is, the grids of the two compressions are aligned. For the tampered images, a random grid shift (r, c), with 0 ≤ r, c ≤ 7 and (r, c) ≠ (0, 0), is considered for the grid of the former compression of the foreground.

• For Type II images: we assume that the images are first JPEG compressed using a DCT grid shifted by a quantity (r, c), randomly chosen with 0 ≤ r, c ≤ 7 and (r, c) ≠ (0, 0), with respect to the upper left corner, while for the second compression no grid misalignment is considered. The pristine images are then NA-DJPEG. For the tampered images, the JPEG grid of the background is non-aligned with the grid of the second compression, and the same holds for the foreground regions. Note that the misalignments of foreground and background are generally different.

The datasets we used for our experiments are available online, together with a report detailing the exact procedure we followed to build the pristine and tampered images for the various k and combinations of QFs (the document is made available at https://drive.google.com/drive/folders/1ck-Xm1G3dxgGN717BJVdKMpBaPCZ3Ap, along with the datasets). Below we provide a description of the datasets considered for the various tests. The size of the images in all the datasets is 512 × 512.
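The aligned vs non-aligned double compression underlying the two dataset types can be illustrated with a toy, single-channel sketch. Real JPEG uses per-frequency quantization tables, chroma subsampling and entropy coding; here, as an assumption for illustration only, compression is reduced to uniform quantization of 8×8 block-DCT coefficients with a single step q, and all function names are ours:

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis matrix."""
    n = np.arange(N)
    M = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2.0 / N)

D = dct_matrix()

def jpeg_like(img, q, shift=(0, 0)):
    """Quantize 8x8 block-DCT coefficients with step q, on a grid whose
    origin is offset by (r, c); unprocessed margins are left untouched."""
    r, c = shift
    x = img.astype(float)[r:, c:]
    H, W = (x.shape[0] // 8) * 8, (x.shape[1] // 8) * 8
    out = img.astype(float).copy()
    for i in range(0, H, 8):
        for j in range(0, W, 8):
            blk = D @ x[i:i+8, j:j+8] @ D.T       # forward 8x8 DCT
            blk = np.round(blk / q) * q           # uniform quantization
            out[r+i:r+i+8, c+j:c+j+8] = D.T @ blk @ D
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64))
# NA-DJPEG: first compression on a grid shifted by (r, c) != (0, 0),
# second compression aligned with the canonical grid
r, c = rng.integers(1, 8, 2)
single = jpeg_like(img, q=12, shift=(r, c))
double = jpeg_like(single, q=8, shift=(0, 0))
```

When the two grids are aligned and the step is unchanged, recompression is idempotent; the grid shift of the NA-DJPEG case is precisely what breaks this property and leaves detectable traces.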
1) Dataset for primary quantization matrix estimation (training and testing):
To build the datasets for training and testing the CNN for the estimation of Q, we followed exactly [6]. Training and testing were carried out on 64 × 64 patches, obtained from the sets of 7000 and 1156 images of RAISE. For DJPEG, a random grid shift (r, c), 0 ≤ r, c ≤ 7, is applied between the two compressions; then, as in [6], the A-DJPEG case occurs with probability 1/64.
2) Dataset for number of clusters estimation (training and testing):
The dataset used to train and test the CNN for the estimation of the number of clusters consists of:

• a set D_tr of 18000 images for each k (for a total of 72000 images), used for training. The set is obtained from 18000 (randomly chosen) images in S_tr;

• a set D_ts of 4000 images for k = 1 and 4000 images for k > 1, in equal proportions for k = 2, 3 and 4. The set is obtained from 4000 images in S_ts.

Let V denote the fixed set of eight QF values used in our experiments. To get the pristine images, QF_1 is randomly chosen in V and QF_2 = 90. For the tampered images, the primary quality factors QF_1,1, ..., QF_1,k of the background and of the spliced regions are drawn from V (QF_1,1 from a restricted four-element subset of V), with the constraint that they are all different from one another. The height h and width w of the bounding box of the tampered regions are randomly selected in {64, 96, 128, 156}. Misalignment is applied to the background with probability 0.5, so the dataset consists of both Type I and Type II images in similar proportions.
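The sampling procedure above can be sketched as follows. The set V below is a stand-in (the exact eight QF values are not reproduced here), and the function name and returned structure are ours:

```python
import random

# Assumed placeholder for the paper's eight-element QF set V
V = [60, 65, 70, 75, 80, 85, 90, 95]

def sample_image_config(k, rng=random):
    """Sample one tampered-image configuration: k distinct primary QFs,
    one bounding box per spliced region, and the grid-alignment type."""
    qf1 = rng.sample(V, k)                     # all QF_1,i distinct
    sizes = [(rng.choice([64, 96, 128, 156]),  # height of each region
              rng.choice([64, 96, 128, 156]))  # width of each region
             for _ in range(k - 1)]
    type_II = rng.random() < 0.5               # background misaligned?
    return {"QF1": qf1, "QF2": 90, "boxes": sizes, "type_II": type_II}
```

For k = 3, for instance, this draws three mutually different primary quality factors and two bounding boxes, and flags the image as Type I or Type II with equal probability.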
3) Dataset for detection, localization and attribution tests:
Detection performance is measured over the same dataset D_ts considered to test the CNN for k estimation, where we have 4000 images representative of the negative class (pristine) and 4000 of the positive class (tampered). The threshold T achieving the desired FPR is set on these 4000 pristine images. To better assess the localization performance, and to ease the comparison with state-of-the-art methods (see the next section), we additionally built two separate Type I and Type II datasets, named D_I and D_II, whose images are generated from S_ts under specific settings. Specifically, in both D_I and D_II, we considered 100 images for every combination of k and {QF_1,i}, i = 2, ..., k, for two different tampering sizes h × w.

A summary of the datasets used in our experiments is reported in Table I.

C. Baseline methods for comparison
The baselines we compared our method with are the methods described in [23] and [24] since, as we said in Section II, these are the methods most closely related to the system proposed in this paper. In [23], statistical models are used to build a map reporting the likelihood that image blocks have been double compressed; the cases of A-DJPEG and NA-DJPEG are treated separately. In [24], the authors propose a strategy to estimate the a posteriori probability that an image block has been tampered with, by minimizing a properly defined energy function. The approach works only for the case of A-DJPEG compression. Similarly to our system, both these methods are designed for tampering localization in double compressed images and can be applied to a scenario wherein both the background and the spliced regions are double compressed but exhibit different compression artefacts (i.e., they are compressed with a different quality factor), since they are based on the estimation of the DCT quantization steps used for the first compression. Both methods perform well when the quality of the former compression is lower than that of the second one, while they provide poor performance in the reverse case.

To turn the tampering localization map provided by the baseline methods into a tampering detection output, we followed the approach used in [24] and trained an SVM classifier with 3 features. The first feature is the perimeter-area ratio of the localized tampered region. The second one is the percentage of pixels detected as tampered. These two features characterize scattered and small regions, which are usually linked to false alarms. The third feature is a measure of the periodicity consistency between the DCT coefficient histograms of the localized region and the entire image. We refer to [24] for more details. To train the SVM, we considered 3000 pristine (k = 1) and 3000 tampered images (for k = 2, 3, 4, with each k represented in equal proportion), obtained from the images in S_tr as detailed in the previous section, with the difference that only Type I images are considered for [24], while Type I and Type II images are considered for the methods in [23], respectively for the A-DJPEG and NA-DJPEG cases (the same proportion of pristine and tampered images is considered in these cases, since the SVM is trained for the binary tampering detection task).

Given the trained model, the operating point is determined from the ROC curve by fixing the threshold T on the FPR and deriving the SVM decision threshold accordingly. Specifically, 2000 pristine images obtained from the images in S_ts, for the Type I and Type II settings respectively, are considered to set T.

Before concluding this section, we stress that the comparison with [23] and [24] is possible only when the considered setting satisfies the operative conditions such methods have been built for. In fact, an advantage of our method is that it is much more general than [23] and [24], so that it can work in a wider variety of situations. In addition, our method is able to distinguish between spliced regions coming from different donor images, which is something that neither [23] nor [24] can do. Such aspects must be taken into account for a fair judgement of the improvement allowed by the method proposed in this paper. The operative conditions, and the expected performance of the various methods under different settings, are summarized in Table II.

TABLE I
DATASETS OF IMAGES CONSIDERED IN OUR EXPERIMENTS.

Name:              D_tr               | D_ts               | D_I                  | D_II
Purpose:           Training           | Test               | Test                 | Test
Original dataset:  S_tr               | S_ts               | S_ts                 | S_ts
No. images:        18000 per k        | 4000 for k = 1,    | 100 per each (k,     | 100 per each (k,
                   (72000 total)      | 4000 for k > 1     | {QF_1,i}, h × w)     | {QF_1,i}, h × w)
DJPEG:             Type I and II      | Type I and II      | Type I               | Type II
                   (50% each)         | (50% each)         |                      |
Setting:           {QF_1,i}, h × w    | {QF_1,i}, h × w    | two fixed h × w      | two fixed h × w
                   randomly chosen    | randomly chosen    | sizes                | sizes

TABLE II
OPERATIVE CONDITIONS OF OUR SYSTEM AND THE METHODS IN [23] AND [24].

Method     | DJPEG grid setting | QF setting                        | Non-std Q matrix | Purpose
[23]-Al    | Aligned            | QF_1 < QF_2; QF_1 > QF_2 (poor)   | Yes              | Localization
[23]-NAl   | Non-Aligned        | QF_1 < QF_2; QF_1 > QF_2 (poor)   | Yes              | Localization
[24]       | Aligned            | QF_1 < QF_2; QF_1 > QF_2 (poor)   | Yes              | Localization
Our        | Both               | QF_1 ≶ QF_2                       | Yes              | Localization + Attribution
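The first two SVM features of the baseline detector can be computed directly from a binary localization map. The sketch below (function name ours) approximates the perimeter by counting 4-connected boundary pixels and, as a simplification, handles borders with wrap-around; the third, DCT-periodicity feature is omitted:

```python
import numpy as np

def detection_features(mask):
    """Two of the three SVM features used by the baseline detector:
    perimeter/area ratio of the localized region, and the fraction of
    pixels flagged as tampered."""
    mask = mask.astype(bool)
    area = mask.sum()
    if area == 0:
        return 0.0, 0.0
    # boundary pixels = pixels with at least one non-tampered 4-neighbour
    interior = (mask & np.roll(mask, 1, 0) & np.roll(mask, -1, 0)
                     & np.roll(mask, 1, 1) & np.roll(mask, -1, 1))
    perimeter = area - interior.sum()
    return perimeter / area, area / mask.size
```

A compact region has a small perimeter-area ratio, while the scattered false-alarm maps the features are meant to capture have a large one.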
D. Parameter setting
Below, we report the setting of the parameters used to implement the various steps of our system. To train the CNN for Q estimation, we followed exactly [6]: the network is trained on 64 × 64 patches obtained from the set of 7000 RAISE images used for training, as detailed in [6]. The number of coefficients N_c is set to 15. The network is trained for 60 epochs with batch size 32, using the Adam optimizer with the learning rate reported in [6]. The map of estimated Q matrices is obtained as described in Section IV-A. With regard to the CNN used to estimate k, as we said, we considered the VGG-16 network [35]. We started from the solution pre-trained for ImageNet classification, and re-trained the network on the dataset described in Section V-B2 for 50 epochs, with batch size 16 and data augmentation; the Adam optimizer was used as solver. Regarding the spectral clustering algorithm, the scale parameter σ depends on the estimated k: we use 0.15 if k = 3 or 4, and a different value if k = 2. Finally, for the morphological reconstruction, we considered 2 erosion iterations with a disk-shaped structuring element of radius 1. We found experimentally that this setting is a good choice for the range of tampering sizes we are considering.

VI. EXPERIMENTAL RESULTS
The results of our experiments are reported and discussed in this section for a wide variety of settings.
A. Accuracy of the number of clusters estimation
The estimation of k is a crucial step, since it directly affects the performance of tampering detection, localization and attribution. In particular, if k = 1 and the estimate is ˆk_r > 1, pristine images are mistakenly identified as tampered, and vice versa.

The results of the k estimation step are reported in the confusion matrix shown in Table III (left). The average accuracy of the estimation is 0.79. Somewhat expectedly, the estimation works better for smaller k, since the minimum difference between the QF values considered tends to be smaller in the presence of multiple spliced regions.

TABLE III
CONFUSION MATRIX OF THE CNN-BASED k ESTIMATION (LEFT) AND OF THE FINAL ˆk_r (RIGHT), COMPUTED OVER THE D_ts SET.

TABLE IV
THE DETECTION ACCURACY OF THE PROPOSED METHOD.

Table III (right) reports the number of clusters ˆk_r after spectral clustering and map refinement. The average accuracy of the estimation is now 0.75. We see that, especially for large k, ˆk_r is a worse estimate of k than ˆk. This is not surprising, since in many cases a better clustering result can be obtained using a k lower than the true value, as we have already observed (see the examples in Fig. 9). However, for the case k = 1, which is important for the detection, the estimation continues to work well also with ˆk_r.

B. Detection performance
With regard to the detection performance (k = 1 vs k > 1) of our system, it can be derived from Table III (right) by measuring the capability of distinguishing pristine (k = 1) from tampered (k > 1) images. The detection accuracies we got for k = 1, k > 1 and overall on the D_ts set (we remind that this set comprises both Type I and Type II images in equal percentages) are 0.95, 0.97 (see Table IV) and 0.96, respectively; specifically, we got TPR = 0.97 at FPR = 0.05. Table V (last row) reports the accuracy results split for the different settings. Specifically, we got TPR = 0.97 for Type I and 0.96 for Type II tamperings.

The comparison with [23] and [24], where detection is performed via SVM classification as described in Section V-C, is reported in Table V, where the TPR values corresponding to FPR = 0.05 are reported for both methods in the Type I and Type II scenarios. For the baselines, the table reports only the values obtained under the operative conditions the methods were designed for. As can be seen, the proposed method greatly outperforms the baselines in the various settings. The poor results of [23] and [24] are in line with those reported in the reference papers, and are mainly due to the weak performance achieved by these methods in the scenario QF_1 > QF_2 (while their performance when QF_1 < QF_2 is good, in their operative scenarios).

C. Localization performance
The results we obtained for tampering localization, averaged on the TP images only, that is, the images correctly detected as tampered, are reported in Table VI. Not surprisingly, the proposed method works better than the method in [23] in the Type II scenario, since the CNN-based Q estimator is designed to work particularly well in the NA-DJPEG case (the aligned case is assumed to occur with probability 1/64). The performance in the Type I scenario is also good, on the average slightly better than that achieved by the best performing method [24]. The capability to work both in the A-DJPEG and in the NA-DJPEG case is a noticeable property of the proposed method, since the information about the alignment of the compression grids is in general not available. Tables VII through IX show the localization results of the proposed method for the NA-DJPEG case in the various settings, for various combinations of the quality factors of the background and foreground regions. Specifically, Table VII reports the results for k = 2, for two different tampering sizes. Table VIII and Table IX report the results for k = 3 and k = 4, for two different values of the quality factor QF_1 of the background, i.e., QF_1 = 85 and 95, for a fixed tampering size. Similarly, Tables X through XII show the localization results for the A-DJPEG case, for k = 2, 3 and 4, respectively. The sets D_II and D_I are considered for these tests, respectively in the NA-DJPEG and A-DJPEG case.

Expectedly, better results are achieved when the tampering size is larger (see Tables VII and VIII) for k = 2; the average difference in the MCC between the two tampering sizes is of comparable magnitude for the other values of k, depending on the setting. We can observe that our method greatly outperforms [23] in all the settings for the NA-DJPEG case. Regarding the performance in the A-DJPEG scenario, the method in [24] always outperforms [23] (the A-DJPEG variant) when k > 1 and QF_1 < QF_2, while when QF_1 > QF_2 both methods cannot correctly localize the tampering.

TABLE V
COMPARISON OF THE DETECTION PERFORMANCE OF OUR METHOD WITH [23] AND [24]. THE TABLE REPORTS THE TPR AT FPR = 0.05 OVER D_ts, D_ts (TYPE I ONLY) AND D_ts (TYPE II ONLY). FOR [23]-AL (TYPE I): 0.31 OVERALL, 0.50 FOR QF_1 < QF_2, 0.14 FOR QF_1 > QF_2.

TABLE VI
AVERAGE LOCALIZATION PERFORMANCE (MCC), AVERAGED ON THE SET OF TP IMAGES FOR EACH METHOD, OVER D_ts, D_ts (TYPE I ONLY) AND D_ts (TYPE II ONLY). FOR [23]-AL (TYPE I): 0.63 OVERALL, 0.77 FOR QF_1 < QF_2, 0.09 FOR QF_1 > QF_2.
Compared to our method, the performance of [24] is superior in the cases where the background QF_1 is small, with a gain in the MCC of about 0.1 (these values are not directly comparable, though, since they are averaged on different image sets, which in the case of the proposed method are much larger). The performance loss in these cases is the price to pay for a general method that can work in all the settings of QF_1,i and QF_2, and that focuses on the more probable NA-DJPEG scenario (and hence is not specifically designed for the A-DJPEG scenario).

Fig. 10. Examples of the results provided by our system on some tampered images, for the case of A-DJPEG, with QF_1 < QF_2. The ground truth and the output maps produced by our method are reported, along with the MCC and NMI values.

D. Attribution performance
As opposed to state-of-the-art methods, the system introduced in this paper makes it possible to distinguish spliced regions coming from different donor images. Fig. 10 shows some examples of the output maps obtained on tampered images with different k. The corresponding NMI value, measuring the goodness of the clustering (see Section V-A), is also reported. We can see that, even if the NMI index can theoretically reach 1, satisfactory clustering results are already obtained with much lower NMI values.

The average NMI values obtained over the test set D_ts and over the subsets of Type I and Type II images in D_ts are reported in Table XIII. The NMI values of the state-of-the-art methods are not reported since they cannot distinguish spliced regions coming from different donor images. The average NMI in the Type I and Type II cases is very similar. Notice that an average NMI around 0.5 is satisfactory: the localization and clustering results are already good with an NMI around 0.6, while the NMI is close to zero when either the localization or the clustering result is poor, hence the average NMI is not expected to be very high. The clustering performance for some settings of k > 1, with background JPEG quality QF_1 = 85 (measured on 100 TP images each), is reported in Table XIV. Similar NMI values are obtained for the various combinations of quality factors of the foreground regions.

VII. CONCLUSION
TABLE VII
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 2, FOR TWO DIFFERENT TAMPERING SIZES (LEFT AND RIGHT). PERFORMANCE IS MEASURED ON D_II. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE VIII
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 3, FOR A FIXED TAMPERING SIZE, QF_1 = 85 (LEFT) AND QF_1 = 95 (RIGHT). PERFORMANCE IS MEASURED ON D_II. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE IX
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 4, FOR A FIXED TAMPERING SIZE, QF_1 = 85 (LEFT) AND QF_1 = 95 (RIGHT). QF_1,2 IS SET TO A FIXED VALUE. PERFORMANCE IS MEASURED ON D_II. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE X
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 2, FOR TWO DIFFERENT TAMPERING SIZES (LEFT AND RIGHT). PERFORMANCE IS MEASURED ON D_I. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE XI
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 3, FOR A FIXED TAMPERING SIZE, QF_1 = 85 (LEFT) AND QF_1 = 95 (RIGHT). PERFORMANCE IS MEASURED ON D_I. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE XII
LOCALIZATION PERFORMANCE (MCC) FOR THE CASE k = 4, FOR A FIXED TAMPERING SIZE, QF_1 = 85 (LEFT) AND QF_1 = 95 (RIGHT). QF_1,2 IS SET TO A FIXED VALUE. PERFORMANCE IS MEASURED ON D_I. THE NUMBER OF TP IMAGES IS REPORTED IN BRACKETS.

TABLE XIII
AVERAGE ATTRIBUTION PERFORMANCE (NMI) OF THE PROPOSED METHOD.

Test set   D_ts    D_ts (Type I)   D_ts (Type II)
NMI        0.475   0.481           0.468

VII. CONCLUSION

We have proposed an end-to-end system capable of detecting and localizing image splicing operations and, in addition, of distinguishing spliced regions taken from different donor images (a task referred to as spliced region attribution). The basic assumption behind the proposed system is that the spliced and background regions have been double JPEG-compressed, but the quantization matrices used for the first compression are different for the background and for spliced regions stemming from different sources. Estimating the primary quantization matrix and clustering image blocks according to the result of the estimation provides the basic mechanism underlying the detection, localization and attribution of the spliced regions. We instantiated the proposed system by adopting state-of-the-art solutions for each step the system consists of, including estimation of the primary quantization matrix, estimation of the number of clusters, clustering, and spatial refinement of the tampering map. The good performance of the resulting system has been proven by means of extensive experiments and compared with that of two baseline methods operating in similar conditions. The main strength of the proposed system is that it can be applied in a wide variety of scenarios, concerning the alignment of the double compression steps and the quantization steps used in the first and the second compression. Moreover, to the best of our knowledge, this is the first method that can distinguish multiple tampered regions taken from different donor images based on the analysis of DJPEG traces.

It goes without saying that any improvement of each of the steps the system consists of will result in a consequent improvement of the overall accuracy of the system. From this point of view, improving the accuracy of the estimation of the primary quantization matrix would play a crucial role, since the entire system relies on the accuracy of such a step. Clustering is also an area where improvements are possible, both with regard to the estimation of the number of clusters and to the subsequent clustering process. In particular, finding better ways to fuse spatial information with the information provided by the estimated quantization matrices could significantly improve the accuracy of the system.

VIII. ACKNOWLEDGMENTS
This work has been partially supported by the Italian Ministry of University and Research (MUR) under the PRIN 2017 2017Z595XS-001 program (PREMIER project).

REFERENCES

[1] T. Pevny and J. Fridrich, "Detection of double-compression in JPEG images for applications in steganography," IEEE Trans. Inf. Forensics Security, vol. 3, no. 2, pp. 247–258, June 2008.
[2] B. Li, Y. Q. Shi, and J. Huang, "Detecting doubly compressed JPEG images by using mode based first digit features," in Proc. IEEE MMSP, Oct. 2008, pp. 730–735.
[3] I. Amerini, R. Becarelli, R. Caldelli, and A. D. Mastio, "Splicing forgeries localization through the use of first digit features," in Proc. IEEE Int. Workshop Inf. Forensics Secur., 2014, pp. 143–148.
[4] C. Deng, Z. Li, X. Gao, and D. Tao, "Deep multi-scale discriminative networks for double JPEG compression forensics," ACM Trans. Intell. Syst. Technol., vol. 10, no. 2, pp. 1–20, 2019.
[5] Y.-L. Chen and C.-T. Hsu, "Detecting recompression of JPEG images via periodicity analysis of compression artifacts for tampering detection," IEEE Trans. Inf. Forensics Security, vol. 6, no. 2, pp. 396–406, 2011.
[6] Y. Niu, B. Tondi, Y. Zhao, and M. Barni, "Primary quantization matrix estimation of double compressed JPEG images via CNN," IEEE Signal Process. Lett., vol. 27, pp. 191–195, 2020.
[7] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[8] B. Mahdian and S. Saic, "Blind authentication using periodic properties of interpolation," IEEE Trans. Inf. Forensics Security, vol. 3, no. 3, pp. 529–538, 2008.
[9] J. Bunk, J. H. Bappy, T. M. Mohammed, L. Nataraj, A. Flenner, B. Manjunath, S. Chandrasekaran, A. K. Roy-Chowdhury, and L. Peterson, "Detection and localization of image forgeries using resampling features and deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 1881–1889.
[10] G. Chierchia, D. Cozzolino, G. Poggi, C. Sansone, and L. Verdoliva, "Guided filtering for PRNU-based localization of small-size image forgeries," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 6231–6235.
[11] P. Korus and J. Huang, "Multi-scale analysis strategies in PRNU-based tampering localization," IEEE Trans. Inf. Forensics Security, vol. 12, no. 4, pp. 809–824, 2017.
[12] A. C. Popescu and H. Farid, "Statistical tools for digital forensics," in Proc. 6th Int. Workshop on Inf. Hiding, 2004, pp. 128–147.
[13] C. Pasquini, G. Boato, and F. Pérez-González, "Multiple JPEG compression detection by means of Benford-Fourier coefficients," in Proc. IEEE Int. Workshop Inf. Forensics Secur., 2014, pp. 113–118.
[14] W. Luo, Z. Qu, J. Huang, and G. Qiu, "A novel method for detecting cropped and recompressed image block," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 2, 2007, pp. 217–220.
[15] Z. Qu, W. Luo, and J. Huang, "A convolutive mixing model for shifted double JPEG compression with application to passive image authentication," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2008, pp. 1661–1664.
[16] T. Bianchi and A. Piva, "Detection of nonaligned double JPEG compression based on integer periodicity maps," IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 842–848, 2012.
[17] Z. Lin, J. He, X. Tang, and C.-K. Tang, "Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis," Pattern Recognition, vol. 42, no. 11, pp. 2492–2501, 2009.
[18] T. Bianchi, A. D. Rosa, and A. Piva, "Improved DCT coefficient analysis for forgery localization in JPEG images," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 2444–2447.
[19] Q. Wang and R. Zhang, "Double JPEG compression forensics based on a convolutional neural network," EURASIP Journal on Information Security, vol. 2016, no. 1, 2016.
[20] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, "Aligned and non-aligned double JPEG detection using convolutional neural networks," Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.
[21] J. Lukáš and J. Fridrich, "Estimation of primary quantization matrix in double compressed JPEG images," Proc. Digital Forensic Research Workshop, 2003.
[22] H. Farid, "Exposing digital forgeries from JPEG ghosts," IEEE Trans. Inf. Forensics Security, vol. 4, no. 1, pp. 154–160, March 2009.
[23] T. Bianchi and A. Piva, "Image forgery localization via block-grained analysis of JPEG artifacts," IEEE Trans. Inf. Forensics Security, vol. 7, no. 3, pp. 1003–1017, 2012.
[24] W. Wang, J. Dong, and T. Tan, "Exploring DCT coefficient quantization effects for local tampering detection," IEEE Trans. Inf. Forensics Security, vol. 9, no. 10, pp. 1653–1666, 2014.
[25] Z. Dias, A. Rocha, and S. Goldenstein, "First steps toward image phylogeny," in Proc. IEEE Int. Workshop Inf. Forensics Secur., 2010, pp. 1–6.
[26] ——, "Image phylogeny by minimal spanning trees," IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 774–788, 2012.
[27] A. A. de Oliveira, P. Ferrara, A. De Rosa, A. Piva, M. Barni, S. Goldenstein, Z. Dias, and A. Rocha, "Multiple parenting phylogeny relationships in digital images," IEEE Trans. Inf. Forensics Security, vol. 11, no. 2, pp. 328–343, 2016.
[28] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard. Springer Science & Business Media, 1992.
[29] I. Amerini, R. Caldelli, P. Crescenzi, A. Del Mastio, and A. Marino, "Blind image clustering based on the normalized cuts criterion for camera identification," Signal Process., Image Commun., vol. 29, no. 8, pp. 831–843, 2014.
[30] Y. Li, X. Zhang, X. Li, Y. Zhang, J. Yang, and Q. He, "Mobile phone clustering from speech recordings using deep representation and spectral clustering," IEEE Trans. Inf. Forensics Security, vol. 13, no. 4, pp. 965–977, 2017.
[31] O. Mayer and M. C. Stamm, "Exposing fake images with forensic similarity graphs," IEEE J. Sel. Topics Signal Process., vol. 14, no. 5, pp. 1049–1064, 2020.
[32] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. M. Dekker, New York, 1988, vol. 84.
[33] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 586–600, 2000.
[34] Y. El-Sonbaty and M. Ismail, "Fuzzy clustering for symbolic data," IEEE Trans. Fuzzy Syst., vol. 6, no. 2, pp. 195–204, 1998.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[36] R. C. Gonzales and R. E. Woods, "Digital image processing," 2002.
[37] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.
[38] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
[39] D. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, "RAISE: A raw images dataset for digital image forensics," in Proc. 6th ACM Multimedia Systems Conference, 2015, pp. 219–224. [Online]. Available: http://doi.acm.org/10.1145/2713168.2713194

TABLE XIV
ATTRIBUTION PERFORMANCE (NMI) OF THE PROPOSED METHOD FOR THE CASES k = 3 ((a) AND (c)) AND k = 4 ((b) AND (d)), TAMPERING SIZE ×, QF = 85. (a) AND (b) ARE FOR THE NON-ALIGNED CASE (PERFORMANCE MEASURED ON D_II); (c) AND (d) REFER TO THE ALIGNED CASE (PERFORMANCE MEASURED ON D_I). FOR THE k = 4 SETTINGS, QF IS SET TO .

(a)
QF   65      75      95      98
65   —       0.559   0.497   0.532
75   0.547   —       0.463   0.482
95   0.516   0.467   —       0.488
98   0.508   0.475   0.476   —

(b)
QF   65      75      95      98
65   —       0.555   0.535   0.536
75   0.529   —       0.515   0.503
95   0.527   0.502   —       0.521
98   0.573   0.532   0.503   —

(c)
QF   65      75      95      98
65   —       0.518   0.524   0.531
75   0.507   —       0.496   0.682
95   0.536   0.467   —       0.553
98   0.501   0.468   0.553   —

(d)
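The NMI scores in Tables XIII and XIV measure the agreement between the estimated cluster labels and the true source assignment of the image regions, invariantly to label permutation [38]. The sketch below is illustrative only (the helpers are ours, and we assume the geometric-mean normalization NMI = I(A;B)/sqrt(H(A)H(B)); the paper's evaluation may use a different averaging):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (nats) of a labeling."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(labels_a, labels_b):
    """Normalized mutual information, I(A;B) / sqrt(H(A) * H(B))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = a.size
    mi = 0.0
    for x in np.unique(a):
        for y in np.unique(b):
            nxy = np.sum((a == x) & (b == y))
            if nxy > 0:
                # p(x,y) * log( p(x,y) / (p(x) p(y)) )
                mi += (nxy / n) * np.log(n * nxy / (np.sum(a == x) * np.sum(b == y)))
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:
        # Convention for trivial (single-cluster) partitions
        return 1.0 if ha == hb else 0.0
    return mi / np.sqrt(ha * hb)

# Identical clusterings up to label renaming score (approximately) 1.0;
# independent ones score 0.0.
print(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # ≈ 1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))              # 0.0
```

Because NMI compares partitions rather than individual labels, it directly reflects whether blocks spliced from the same donor image end up in the same cluster, regardless of which cluster index they receive.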