SPN-CNN: Boosting Sensor-Based Source Camera Attribution With Deep Learning
Matthias Kirchner and Cameron Johnson
Kitware, Inc.
Email: {matthias.kirchner, cameron.johnson}@kitware.com

Abstract—We explore means to advance source camera identification based on sensor noise in a data-driven framework. Our focus is on improving the sensor pattern noise (SPN) extraction from a single image at test time. Where existing works suppress nuisance content with denoising filters that are largely agnostic to the specific SPN signal of interest, we demonstrate that a deep learning approach can yield a more suitable extractor that leads to improved source attribution. A series of extensive experiments on various public datasets confirms the feasibility of our approach and its applicability to image manipulation localization and video source attribution. A critical discussion of potential pitfalls completes the text.
I. INTRODUCTION
Sensor noise fingerprints have been recognized and utilized as a cornerstone of media forensics [1] ever since Lukáš et al. first observed almost 15 years ago in their seminal work [2] that digital images can be traced back to their sensor based on unique noise characteristics. Minute manufacturing imperfections are believed to make every sensor physically unique, leading to the presence of a weak yet deterministic sensor pattern noise (SPN) in each and every image signal captured by the same sensor [3]. This fingerprint, commonly referred to as photo-response non-uniformity (PRNU), can be estimated from images captured by a specific camera for the purpose of source camera identification, in which a noise signal extracted from a probe image of unknown provenance is compared against pre-computed fingerprint estimates from a set of candidate cameras.

While extensive testing has already demonstrated the feasibility of highly reliable PRNU-based consumer camera identification at scale [4], recent research has been largely driven by enabling source attribution under ever more challenging conditions. Modern cameras, particularly those installed in smartphones, go to great lengths to produce visually appealing imagery, and techniques such as lens distortion correction, electronic image stabilization, or high dynamic range imaging have all been found to impede camera identification if not accounted for through spatial resynchronization [5]–[7]. Robustness to strong image and video compression is another major concern [8]–[10] that has been gaining more and more practical relevance due to the widespread sharing of visual media through online social networks [11]. Finally, the ability to
reliably establish camera identification also from very small image patches is crucial for image manipulation localization based on sensor noise [12]–[14].

The pertinent literature concludes that all these scenarios generally benefit from high-quality fingerprint estimates, and that images should undergo content suppression prior to analysis [3]. To date, most works still resort to the Wavelet-based denoiser adopted by Lukáš et al. [15] for that purpose. The maximum likelihood fingerprint estimator derived from a simplified multiplicative signal model [12] gives near-optimal results when fed with noise residuals from a set of full-resolution images of a homogeneously lit scene. The conditions at test time are less ideal, however. Probe images are generally of varying content, and an aggregation over multiple images is often not possible. It is accepted that noise residuals obtained from off-the-shelf denoisers are imperfect by nature and that they are contaminated non-trivially by remnants of image content. Salient textures or quantization noise exacerbate the issue. A number of studies have found that alternative denoising algorithms can lead to moderate improvements [13], [16], [17], yet there remains a considerable gap in the ability to reliably establish the presence of the camera fingerprint in (portions of) a probe image under more challenging conditions.

While the overall performance of camera identification is clearly governed by fundamental bounds imposed by, for instance, the available image resolution and the strength of compression, this paper sets out to demonstrate that there is still ample room for advances over prior art. Critically, the present work deviates from the common procedure of employing at test time the very same noise extractor that was also used for fingerprint estimation, which we explain with a subtle shift in perspective in the restatement of practical camera identification: we accept that the goal ultimately is to match a noise signal extracted from the probe image against a pre-computed fingerprint estimate, and that the quality of the match is assessed in terms of the similarity of the two signals. Adopting the DnCNN work by Zhang et al. [18], we let a convolutional neural network (CNN) learn how to extract a noise signal from a probe image that resembles the pre-computed sensor pattern noise (SPN) fingerprint from the camera of interest as closely as possible. By looking at fingerprint extraction through the lens of an optimization procedure, the resulting network, which we call SPN-CNN, can be expected to better adapt to the problem at hand than existing fingerprint-agnostic denoising algorithms. It is worth pointing out here that the proposed technique differs from the creative advancement of the DnCNN idea in Cozzolino and Verdoliva's Noiseprint approach [19] both in its objective and its design, in that the computed noise signal is expected to emphasize device characteristics instead of camera model characteristics through a training regime that explicitly utilizes a known camera-specific target signal (the camera fingerprint estimate) instead of a more general notion of pairwise patch similarity. Another related work is the image anonymization approach by Bonettini et al. [20], who adapted a DnCNN-like design to remove the camera fingerprint from an image.
We refer to Section III for a more detailed exposition of our approach, which follows after a brief overview of camera sensor noise forensics in Section II. Section IV presents experimental results, including applications to image manipulation localization and video source identification. Section V concludes the text.

II. CAMERA SENSOR NOISE FORENSICS
State-of-the-art sensor noise forensics assumes a simplified imaging model of the form

    x = x^(o) (1 + k) + θ ,    (1)

in which the multiplicative PRNU factor k modulates the noise-free image x^(o), while θ comprises a variety of additive noise components [3]. Substantial empirical evidence suggests that the PRNU is a unique and robust camera fingerprint [4] that can be estimated from a set of N images taken with the specific camera of interest. The standard procedure relies on a denoising filter F(·) to obtain the noise residual w_n = x_n − F(x_n) from the n-th image x_n, 1 ≤ n ≤ N. A modeling assumption

    w_n = k x_n + η_n    (2)

with i.i.d. Gaussian noise η_n then leads to a maximum likelihood estimate k̂ of the PRNU factor of the form [12]

    k̂ = ( Σ_{n=1}^{N} w_n x_n ) · ( Σ_{n=1}^{N} x_n² )^{−1} .    (3)

Practical applications warrant a post-processing step to clean the fingerprint estimate from non-unique artifacts [3], [21]. For a given probe image y of unknown provenance, camera identification is formulated as a hypothesis testing problem:

    H₀ : w = y − F(y) does not contain the fingerprint k
    H₁ : w does contain the fingerprint k ;

i.e., the probe is attributed to the tested camera if H₁ holds. In practice, the test can be decided by evaluating the similarity between the residual w and the fingerprint estimate k̂ for a suitably set threshold τ,

    ρ = sim(w, k̂) ≷_{H₀}^{H₁} τ .    (4)

It is assumed that the two signals are geometrically aligned except for possible translational displacements, which can be accounted for conveniently with normalized cross-correlation or peak-to-correlation energy (PCE) as similarity measures [3].

As inserting content from elsewhere into an image or other forms of image manipulation will remove or impair the camera fingerprint in the affected regions, many local image alterations can be detected by testing for the presence of the expected fingerprint in a sliding-window mode [12]. The localization of small manipulated regions warrants sufficiently small analysis windows, which generally impacts negatively the ability to reliably establish whether or not the expected fingerprint is present. The literature often recommends a window size of about … × … pixels as a reasonable trade-off between resolution and accuracy [13], [14], [22]. A core problem is that the measured local similarity scores under H₁ depend greatly on local image characteristics. Chen et al. [12] have proposed a correlation predictor ρ̂_i(x) as a remedy, which utilizes a set of simple intensity and texture features to predict how strongly the i-th local patch in a probe image would correlate with the purported camera fingerprint under H₁. The decision whether to declare a manipulation (i.e., the absence of the expected fingerprint) can then be made based on the deviation from the expected correlation.

As for video data, it can be advantageous to estimate reference sensor fingerprints from full-resolution still images when available [23]. At test time, it is usually recommended to aggregate noise residuals from multiple probe frames into a probe video fingerprint to cope with strong compression artifacts [24]. Special care has to be taken to account for geometric desynchronization due to video stabilization, as pointed out in many recent reports [6], [23], [25].
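To make Equations (3) and (4) concrete, the following minimal NumPy sketch implements the MLE fingerprint estimator and the correlation test. The denoiser F is passed in as a callable; the function names and the illustrative threshold value are ours and not part of the original formulation.

    import numpy as np

    def estimate_fingerprint(images, denoise):
        """MLE of the PRNU factor (Eq. 3) from N same-size grayscale images."""
        num = np.zeros(images[0].shape, dtype=np.float64)
        den = np.zeros(images[0].shape, dtype=np.float64)
        for x in images:
            x = x.astype(np.float64)
            w = x - denoise(x)            # noise residual w_n = x_n - F(x_n)
            num += w * x                  # accumulate w_n * x_n
            den += x * x                  # accumulate x_n^2
        return num / np.maximum(den, 1e-8)

    def normalized_correlation(w, k_hat):
        """sim(w, k_hat) as plain normalized correlation (cf. Eq. 4)."""
        a = (w - w.mean()).ravel()
        b = (k_hat - k_hat.mean()).ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def attribute(probe, k_hat, denoise, tau=0.01):
        """Decide H1 (probe stems from the tested camera) if rho > tau."""
        w = probe.astype(np.float64)
        w = w - denoise(w)
        rho = normalized_correlation(w, k_hat)
        return rho > tau, rho

In practice the threshold tau would be calibrated to a target false-alarm rate, and PCE would typically replace plain correlation when translational displacements are possible.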
III. SPN-CNN ESTIMATOR

While camera source attribution from sensor pattern noise arguably works extremely well iff pitfalls due to geometric misalignment between the fingerprint and the probe are avoided, there remains a huge discrepancy between a fingerprint estimate aggregated from several images, k̂, and an estimate from a single probe image (via the residual w). A simple explanation is that denoising algorithms are not designed with this specific application in mind, and more generally that content suppression is an ill-posed problem in the absence of viable image models (and possibly even of the noise characteristics itself [26]). Thanks to the central limit theorem, aggregation over multiple noise residuals mitigates these effects to some extent, but noise residuals from a single image will always suffer from significant distortion.

The recent success of data-driven methods suggests a way forward, however, especially considering that Equation (4) poses a clear objective function to be targeted. By understanding CNNs as flexible non-linear optimization tools, it seems reasonable to expect that a network should be able to learn how to extract a better approximation of k from a given probe than a conventional "blind" denoiser. Conceptually, the problem at hand aligns with the scenario considered by Zhang et al. [18], who trained a CNN to learn how to extract well-characterized noise signals from a given image. In our case, we train a network to extract a noise pattern k̃ to minimize ‖k̂ − k̃‖. Our premise here is that the maximum likelihood fingerprint estimator (MLE) gives the best approximation of the actual PRNU signal that we have under the given imaging model assumptions, so we consider it a viable proxy target in lieu of the unknown ground truth. Once trained, the network replaces the denoiser F(·) at test time, i.e., w = k̃ in Equation (4). Notably, this breaks with the tradition of employing the very same denoiser for both fingerprint estimation and detection, although further iterations in which the trained network informs a better fingerprint aggregation seem conceivable.

[Fig. 1. Proposed SPN-CNN training for learning to extract a camera fingerprint k̃ from a given probe image: noise residuals from N flat-field images yield the MLE fingerprint estimate k̂, and the network is updated to minimize the loss L = ‖k̂ − k̃‖.]

Figure 1 outlines the overall training setup, which operates, in its current instantiation, on single-channel grayscale patches of size … × … pixels. To facilitate the learning, the output from each training patch is paired with its corresponding portion of the fingerprint estimate k̂. The SPN-CNN itself is modeled after Zhang et al.'s DnCNN [18] and comprises 17 layers. Each layer implements 64 convolutional filters with a small 3 × 3 spatial support, except for the last layer with only a single-channel output. Batch normalization is implemented in between the convolutional layers, and ReLU activation follows after each but the last layer. No pooling is used at any point. In line with reports in the literature, any modern off-the-shelf denoiser F(·) can be expected to do reasonably well for the purpose of estimating a camera fingerprint from flat-field images. We decided to stick with the standard Wavelet formulation [15] in this work for its computational edge over alternative approaches. The trained network can be fed with inputs of any size within the memory constraints of the GPU.
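A PyTorch sketch of the described architecture may look as follows. The 17-layer layout with 64 filters of 3 × 3 support follows the text, while everything else (padding, bias handling, variable names) is an assumption on our part.

    import torch
    import torch.nn as nn

    class SPNCNN(nn.Module):
        """DnCNN-style noise extractor: 17 conv layers, no pooling."""
        def __init__(self, depth=17, width=64):
            super().__init__()
            layers = [nn.Conv2d(1, width, 3, padding=1), nn.ReLU(inplace=True)]
            for _ in range(depth - 2):      # intermediate conv + BN + ReLU blocks
                layers += [nn.Conv2d(width, width, 3, padding=1, bias=False),
                           nn.BatchNorm2d(width),
                           nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(width, 1, 3, padding=1))  # single-channel output
            self.body = nn.Sequential(*layers)

        def forward(self, x):               # x: batch of grayscale patches (B,1,H,W)
            return self.body(x)             # noise estimate k_tilde

    # Training pairs each output with the matching portion of the MLE
    # fingerprint estimate k_hat, minimizing ||k_hat - k_tilde|| (cf. Fig. 1):
    model, loss_fn = SPNCNN(), nn.MSELoss()
    # loss = loss_fn(model(patches), k_hat_patches)

Because the network is fully convolutional, the same parameters apply to inputs of arbitrary size at test time.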
In practice, we divide images into tiles of size … × … with an overlap of 20 pixels to accommodate possible boundary effects.
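A sketch of this tiled inference, under assumptions: the 20-pixel overlap follows the text, whereas the tile size is a placeholder, and overlapping strips are simply overwritten by later tiles rather than blended.

    import numpy as np
    import torch

    def extract_noise_tiled(model, image, tile=512, overlap=20):
        """Run the trained network over overlapping tiles of a full-size probe."""
        h, w = image.shape
        out = np.zeros((h, w), dtype=np.float32)
        step = tile - overlap
        for i in range(0, h, step):
            for j in range(0, w, step):
                y0 = min(i, max(h - tile, 0))   # clamp tiles to the image border
                x0 = min(j, max(w - tile, 0))
                block = image[y0:y0 + tile, x0:x0 + tile].astype(np.float32)
                with torch.no_grad():
                    t = torch.from_numpy(block)[None, None]   # shape (1,1,H,W)
                    pred = model(t)[0, 0].numpy()
                out[y0:y0 + pred.shape[0], x0:x0 + pred.shape[1]] = pred
        return out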
Figure 2 depicts exemplary results for two images from the VISION database [27] and compares the obtained noise signals, k̃, to the corresponding residuals from the Wavelet denoiser, w. Although the CNN-based estimator arguably does not succeed in suppressing the image content completely, the results suggest qualitatively that it does better than the Wavelet denoiser. A notable increase in correlation with the corresponding portion of the camera fingerprint estimate k̂ supports this impression.

[Fig. 2. Two probe images and the noise signals obtained with the SPN-CNN (k̃) and Wavelet (w) estimators. Each noise pattern has been scaled independently to cover the full intensity range. Numbers in the lower left corner indicate the correlation with the corresponding Wavelet-based camera fingerprint.]

IV. EXPERIMENTS

We work with various datasets for an experimental validation. Our baseline camera identification results in Section IV-B cover camera fingerprints from ten devices from the VISION database [27], and from six cameras from the Dresden Image Database [28], respectively. The former dataset comprises camera-native JPEG images from a variety of mobile devices, whereas we chose the Nikon cameras that provide uncompressed imagery from the latter (converted from raw format with Adobe Lightroom). For each camera, we reserve 100 randomly chosen images for training and divide the remaining images into non-overlapping crops of size … × … pixels, resulting in a total of 53,677 (VISION) plus 41,568 (Dresden) cropped test images under H₁. Under H₀, we supply for each non-overlapping … × … portion of the 16 camera fingerprints about 100 randomly chosen image crops from a different device, totaling 48,362 (VISION) plus 23,004 (Dresden) samples. The manipulation localization results in Section IV-C cover the 55 manipulated images from the Nikon D7000 camera in the Realistic Tampering Dataset [14]. These uncompressed images originate from the RAISE database [29], from which we source an additional 100 images for training. To gain some insight into the applicability to video source attribution in Section IV-D, we work with 18 camera-native indoor and outdoor videos from a Samsung Galaxy S3 Mini (D01 in the VISION database), where we reserve five videos for training. All databases provide flat-field images for the initial MLE camera fingerprint estimation with the Wavelet denoiser.

A. Training
We train a separate network for each camera fingerprint of interest in batches of size 128 for 100 epochs. Each epoch sees a selection of non-overlapping … × … patches from all training images. The patch selection is greedy and successively samples up to 1,000 random patches per image while rejecting those that overlap with previous selections from within the same image. The final number of selected patches thus depends on the image size, but it typically averages to over 50,000 per epoch. Saturated patches are excluded from training. We use the Adam optimizer with MSE loss for training, with a learning rate of 10^−… and a weight decay of 0.2 after every 30 epochs.
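The greedy sampling procedure can be sketched as follows; the patch size, the number of candidate draws per image, and the saturation test are stand-ins for details not fully specified above.

    import numpy as np

    def sample_patches(image, patch=64, candidates=1000, sat_level=250):
        """Greedily accept random patches that do not overlap earlier picks."""
        h, w = image.shape
        accepted, anchors = [], []
        for _ in range(candidates):        # up to 1,000 random draws per image
            i = np.random.randint(0, h - patch + 1)
            j = np.random.randint(0, w - patch + 1)
            if any(abs(i - a) < patch and abs(j - b) < patch for a, b in anchors):
                continue                   # overlaps a previous selection
            p = image[i:i + patch, j:j + patch]
            if p.min() >= sat_level:       # reject saturated patches
                continue
            anchors.append((i, j))
            accepted.append(p)
        return accepted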
[TABLE I. Median correlation under H₁ (… × … patches, VISION). Columns: SPN-CNN extractors trained for cameras D01, D02, D03, D06, D10, D15, D26, D27, D29, D34; Wavelet; DnCNN [18]. Rows: probe cameras.]

B. Baseline Results
For a test of the proposed noise extractor's baseline performance, we conduct basic camera identification by correlating the obtained noise signals against the corresponding matching camera fingerprint estimate. A key question is whether each camera fingerprint warrants its own set of learned parameters. Table I gives some insight by reporting the median correlation scores under H₁ for the VISION database, where we employed all ten trained networks (arranged along the columns) for each probe image. For comparison, the last two columns list the corresponding results obtained from the standard Wavelet denoiser (Wav) and from an off-the-shelf variant of the DnCNN [18] that we trained to extract Gaussian noise with standard deviation σ = 3 on the 400 images provided by its authors. Notably, the SPN-CNN approach yields a measurable boost over the fingerprint-agnostic techniques when a camera-specific variant of the network is employed. The median correlation under H₁ increases, on average, by a factor of 1.5 (at a median correlation under H₀ of about …, compared to … for the Wavelet denoiser), while camera-foreign network configurations yield H₁ correlations that are largely on par with prior art. The corresponding results for the six cameras from the Dresden Image Database in Table II give a slightly different picture, as the network training seems to have less influence. It is currently unclear to what extent this is an artifact of the much more homogeneous dataset (where each camera was set up to capture the same scenes [28] and all images were processed with the same Adobe software). Images in the VISION dataset are more heterogeneous, both in terms of content and camera-specific processing pipelines (for instance, devices D06 and D15 are both iPhone 6 cameras, but operate under different iOS versions). More tailored datasets and experiments will be necessary to disentangle the impact of training data and sensor specifics. For the time being, we recommend training a camera-specific network for each fingerprint.

[TABLE II. Median correlation under H₁ (… × … patches, Dresden). Columns: SPN-CNN extractors trained for cameras D200-0, D200-1, D70-0, D70-1, D70s-0, D70s-1; Wavelet; DnCNN [18]. Rows: probe cameras.]

Adopting this premise, Figure 3 gives a more comprehensive picture of camera identification performance by reporting ROC curves for a select set of ten cameras from both datasets. We focus here on smaller patches of size … × … and … × … (center-cropped from the bigger probes) to showcase the advantage of the targeted data-driven extractors over conventional methods. Not surprisingly, smaller patch sizes incur a drop in performance, but the proposed approach achieves notable improvements across all tested cameras, including those not depicted here.

C. Application: Manipulation Localization
The promising performance on camera identification from small image patches suggests improved image manipulation localization. We test this conjecture by computing from the aforementioned 55 manipulated images 64 × 64 sliding-window correlations with the Nikon D7000 fingerprint estimate. Following prior art [12], we also train a simple linear regression correlation predictor for both the SPN-CNN and the Wavelet noise extractors. A completely disjoint set of 20,000 patches from 20 images was used for this purpose. The pixel-level ROC curves (aggregated over all probe images) in the left panel of Figure 4 are obtained from thresholding the difference between measured and predicted correlation, Δ_i = ρ_i − ρ̂_i, i.e., a strongly negative difference is indicative of the camera fingerprint being absent in the respective local neighborhood. In line with the baseline results in Figure 3, the graphs suggest that the data-driven SPN-CNN extractor is beneficial over the standard Wavelet approach by a large margin. The overall area under the curve (AUC) increases from 0.75 to 0.82. As the results generally vary greatly from image to image, we include a scatter plot of per-image pixel-level AUC scores in the right panel of Figure 4. Observe that SPN-CNN performs better for all but seven probe images, which qualitatively seemed to exhibit content characteristics that were under-represented during training. In general, it is worth pointing out that the localization performance can be expected to increase further with one of the more sophisticated recent random field approaches [13], [14], [22].
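A condensed sketch of this sliding-window test follows, with the 64 × 64 window size from the text; the stride, the decision margin, and the representation of the correlation predictor's output as a per-position map rho_pred are illustrative choices of ours.

    import numpy as np

    def localize(noise, k_hat, rho_pred, win=64, stride=8, margin=0.02):
        """Flag windows whose measured correlation falls short of the prediction."""
        h, w = noise.shape
        heat = np.zeros((h, w), dtype=np.float32)
        for i in range(0, h - win + 1, stride):
            for j in range(0, w - win + 1, stride):
                a = noise[i:i + win, j:j + win]
                b = k_hat[i:i + win, j:j + win]
                a, b = a - a.mean(), b - b.mean()
                rho = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
                delta = rho - rho_pred[i, j]    # Delta_i = rho_i - rho_hat_i
                if delta < -margin:             # fingerprint weaker than expected
                    heat[i:i + win, j:j + win] = 1.0
        return heat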
D. Application: Video Source Attribution

Video source attribution is generally more challenging than camera identification from still images. The final experiment of this initial exploration thus puts emphasis on the impact of lossy video compression and reduced resolution. We do not consider video stabilization here to keep the discussion focused, but we anticipate that any performance boost in the absence of stabilization would ultimately also help techniques designed to reverse the effects of inter-frame geometric misalignment [6].
[Fig. 3. Camera identification baseline ROC performance for selected cameras from the VISION Database (top: D01, D02, D06, D10, D26) and the Dresden Image Database (bottom: D200-0, D200-1, D70-0, D70-1, D70s-0), based on the correlation between the Wavelet-based camera fingerprint MLE and noise signals extracted from probe patches of two sizes with SPN-CNN, and with off-the-shelf Wavelet and DnCNN [18] denoising, respectively.]

To address the specific characteristics of video data, we train a separate network on frames extracted from the five training videos, while the target fingerprint was computed from the available still images [23]. As a result, synchronizing the imagery to the camera fingerprint requires special attention. The available videos have a resolution of … × …, whereas the full sensor resolution of the still images is … × … pixels. We use the guidance in [27] to determine the geometric mapping from image to video space, but found that the cropping parameters had to be adjusted to reflect differences between landscape and portrait mode. Once camera fingerprint estimate and video frames are aligned, the training proceeds as before.

The results in Figure 5 offer two complementary perspectives. The left panel reflects the common procedure of aggregating noise estimates from multiple frames into a video fingerprint. The graphs depict the mean PCE score (averaged over all test videos) against the number of frames considered (starting from the first frame). The data-driven SPN-CNN approach again provides a measurable boost over the Wavelet denoiser, indicating the possibility of more reliable video source attribution.

[Fig. 4. Image manipulation localization pixel-level ROC curves from 55 manipulated images (left) and per-image AUC scores (right), from correlating noise signals extracted with SPN-CNN and Wavelet denoising in sliding windows of size 64 × 64 with the Wavelet-based camera fingerprint MLE.]
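For reference, a compact PCE sketch under common conventions (circular FFT cross-correlation, a small excluded neighborhood around the peak); the neighborhood size is an assumption, and frame-level noise estimates would simply be averaged into a video fingerprint before this test.

    import numpy as np

    def pce(w, k_hat, exclude=11):
        """Peak-to-correlation energy between a noise signal and a fingerprint."""
        a = w - w.mean()
        b = k_hat - k_hat.mean()
        xc = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))
        peak = np.unravel_index(np.argmax(np.abs(xc)), xc.shape)
        r = exclude // 2
        mask = np.ones(xc.shape, dtype=bool)    # exclude a square around the peak
        rows = np.arange(peak[0] - r, peak[0] + r + 1) % xc.shape[0]
        cols = np.arange(peak[1] - r, peak[1] + r + 1) % xc.shape[1]
        mask[np.ix_(rows, cols)] = False
        return float(xc[peak] ** 2 / np.mean(xc[mask] ** 2))

    # Aggregating frame-level noise estimates into a video fingerprint:
    # video_fp = np.mean([extract_noise_tiled(model, f) for f in frames], axis=0)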
For a closer look at video-specific results, the right panel of Figure 5 reports average frame-level PCE scores per video, distinguishing between I-frames and all remaining frames (the training did not make this distinction). I-frames are expectedly more beneficial for source attribution, and noise signals extracted with the SPN-CNN resemble the camera fingerprint more closely. The average frame-level PCE over all I-frames increases from 75 to 135.

Interestingly, there is a sharp drop in SPN-CNN performance for the last four videos, compared to a more graceful decline of the Wavelet approach. A closer inspection revealed that these are all "outdoor" videos (two each from the "move" and "panrot" categories), with content that was not well represented during training. Along with a number of the earlier observations, we thus find it appropriate to close with a note of caution. While a data-driven perspective clearly holds great promise for boosting sensor noise forensics under challenging conditions, these advances may not come cheap in practice. Proper and careful training is crucial, and a single "catch-all" framework will likely require substantial further experimentation.

[Fig. 5. Video source attribution from the first N probe video frames (left) and from individual frames (right): average PCE scores from 13 unstabilized videos (left) and per video (right), for SPN-CNN and Wavelet extractors, with and without I-frames.]
V. CONCLUDING REMARKS
We have demonstrated that digital camera identification from PRNU sensor pattern noise can benefit greatly from training a convolutional neural network to extract noise signals from probe images that resemble the expected camera fingerprint more closely than noise residuals obtained with standard fingerprint-agnostic denoising procedures. The discussed network features a clean end-to-end design that draws from the recent DnCNN residual learning approach [18]. In its current instantiation, it achieves its most favorable results with a dedicated set of parameters for each candidate camera fingerprint, but future research questions abound. More research is needed to understand to what extent the apparent dependence on sensor specifics is related to differences in the content of the data presented to the network. Along those lines, it is worth pointing out that we have not explicitly controlled for effects such as JPEG compression quality, and that practical applications may warrant a rigorous in-depth analysis of the estimators under H₀. Looking forward, it seems also reasonable to assume that the current approach may only be a stepping stone towards a fully data-driven camera identification framework. Fingerprint estimation and noise signal extraction may be learned jointly, likely leading to further performance boosts. This may address some of the recent concerns regarding the viability of the fundamental imaging model that is also the foundation of this present work [26]. Finally, questions pertaining to fingerprint-copy and removal attacks in the realm of counter-forensics [30] will also have to be reconsidered when moving to data-driven approaches [20].

ACKNOWLEDGMENT
This work was supported by AFRL and DARPA under Contract No. FA8750-16-C-0166. Any findings and conclusions or recommendations expressed in this material are solely the responsibility of the authors and do not necessarily represent the official views of AFRL, DARPA, or the U.S. Government.

REFERENCES
[1] R. Böhme and M. Kirchner, "Media forensics," in Information Hiding, S. Katzenbeisser and F. Petitcolas, Eds. Artech House, 2016, pp. 231–259.
[2] J. Lukáš, J. Fridrich, and M. Goljan, "Determining digital image origin using sensor imperfections," in Image and Video Communications and Processing, ser. Proceedings of SPIE, A. Said and J. G. Apostolopoulos, Eds., vol. 5685, 2005, pp. 249–260.
[3] J. Fridrich, "Sensor defects in digital image forensic," in Digital Image Forensics: There is More to a Picture Than Meets the Eye, H. T. Sencar and N. Memon, Eds. Springer-Verlag, 2013, pp. 179–218.
[4] M. Goljan, J. Fridrich, and T. Filler, "Large scale test of sensor fingerprint camera identification," in Media Forensics and Security, ser. Proceedings of SPIE, E. J. Delp, J. Dittmann, N. D. Memon, and P. W. Wong, Eds., vol. 7254, 2009.
[5] M. Goljan and J. Fridrich, "Sensor-fingerprint based identification of images corrected for lens distortion," in Media Watermarking, Security, and Forensics, ser. Proceedings of SPIE, N. D. Memon, A. M. Alattar, and E. J. Delp, Eds., vol. 8303, 2012.
[6] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro, "Facing device attribution problem for stabilized video sequences," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 14–27, 2020.
[7] M. D. M. Hosseini and M. Goljan, "Camera identification from HDR images," in ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2019.
[8] W. van Houten and Z. J. Geradts, "Source video camera identification for multiply compressed videos originating from YouTube," Digital Investigation, vol. 6, no. 1–2, pp. 48–60, 2009.
[9] W.-H. Chuang, H. Su, and M. Wu, "Exploring compression effects for improved source camera identification using strongly compressed video," in IEEE International Conference on Image Processing (ICIP), 2011, pp. 1953–1956.
[10] M. Goljan, M. Chen, P. Comesaña, and J. Fridrich, "Effect of compression on sensor-fingerprint based camera identification," in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics, 2016.
[11] I. Amerini, R. Caldelli, A. del Mastio, A. di Fuccia, C. Molinari, and A. P. Rizzo, "Dealing with video source identification in social networks," Signal Processing: Image Communication, vol. 57, 2017.
[12] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Determining image origin and integrity using sensor noise," IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 74–90, 2008.
[13] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva, "A Bayesian-MRF approach for PRNU-based image forgery detection," IEEE Transactions on Information Forensics and Security, vol. 9, no. 4, pp. 554–567, 2014.
[14] P. Korus and J. Huang, "Multi-scale analysis strategies in PRNU-based tampering localization," IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 809–824, 2017.
[15] J. Lukáš, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[16] I. Amerini, R. Caldelli, V. Cappellini, F. Picchioni, and A. Piva, "Analysis of denoising filters for photo response non uniformity noise extraction in source camera identification," in International Conference on Digital Signal Processing (DSP), 2009, pp. 511–517.
[17] A. Cortiana, V. Conotter, G. Boato, and F. G. B. De Natale, "Performance comparison of denoising filters for source camera identification," in Media Watermarking, Security, and Forensics III, ser. Proceedings of SPIE, N. D. Memon, J. Dittmann, A. M. Alattar, and E. J. Delp, Eds., vol. 7880, 2011.
[18] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[19] D. Cozzolino and L. Verdoliva, "Noiseprint: a CNN-based camera model fingerprint," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144–159, 2020.
[20] N. Bonettini, L. Bondi, S. Mandelli, P. Bestagini, S. Tubaro, and D. Güera, "Fooling PRNU-based detectors through convolutional neural networks," in European Signal Processing Conference (EUSIPCO), 2018, pp. 957–961.
[21] T. Gloe, S. Pfennig, and M. Kirchner, "Unexpected artefacts in PRNU-based camera identification: A 'Dresden Image Database' case-study," in ACM Multimedia and Security Workshop (MM&Sec), 2012, pp. 109–114.
[22] S. Chakraborty and M. Kirchner, "PRNU-based image manipulation localization with discriminative random fields," in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics, 2017.
[23] M. Iuliani, M. Fontani, D. Shullani, and A. Piva, "Hybrid reference-based video source identification," Sensors, vol. 19, no. 3, 2019.
[24] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Source digital camcorder identification using sensor photo-response non-uniformity," in Security and Watermarking of Multimedia Content IX, ser. Proceedings of SPIE, E. J. Delp and P. W. Wong, Eds., vol. 6505, 2007.
[25] S. Taspinar, M. Mohanty, and N. Memon, "Source camera attribution using stabilized video," in IEEE International Workshop on Information Forensics and Security (WIFS), 2016.
[26] M. Masciopinto and F. Pérez-González, "Putting the PRNU model in reverse gear: Findings with synthetic signals," in European Signal Processing Conference (EUSIPCO), 2018, pp. 1352–1356.
[27] D. Shullani, M. Fontani, M. Iuliani, O. A. Shaya, and A. Piva, "VISION: a video and image dataset for source identification," EURASIP Journal on Information Security, 2017.
[28] T. Gloe and R. Böhme, "The Dresden Image Database for benchmarking digital image forensics," Journal of Digital Forensic Practice, vol. 3, no. 2–4, pp. 150–159, 2010.
[29] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, "RAISE: a raw images dataset for digital image forensics," in ACM Multimedia Systems Conference (MMSys), 2015, pp. 219–224.
[30] R. Böhme and M. Kirchner, "Counter-forensics: Attacking image forensics," in Digital Image Forensics: There is More to a Picture Than Meets the Eye, H. T. Sencar and N. Memon, Eds. Springer-Verlag, 2013.