SPN-CNN: Boosting Sensor-Based Source Camera Attribution With Deep Learning
Matthias Kirchner and Cameron Johnson
Kitware, Inc.
Email: {matthias.kirchner, cameron.johnson}@kitware.com

Abstract—We explore means to advance source camera identification based on sensor noise in a data-driven framework. Our focus is on improving the sensor pattern noise (SPN) extraction from a single image at test time. Where existing works suppress nuisance content with denoising filters that are largely agnostic to the specific SPN signal of interest, we demonstrate that a deep learning approach can yield a more suitable extractor that leads to improved source attribution. A series of extensive experiments on various public datasets confirms the feasibility of our approach and its applicability to image manipulation localization and video source attribution. A critical discussion of potential pitfalls completes the text.
I. INTRODUCTION
Sensor noise fingerprints have been recognized and utilized as a cornerstone of media forensics [1] ever since Lukáš et al. first observed almost 15 years ago in their seminal work [2] that digital images can be traced back to their sensor based on unique noise characteristics. Minute manufacturing imperfections are believed to make every sensor physically unique, leading to the presence of a weak yet deterministic sensor pattern noise (SPN) in each and every image signal captured by the same sensor [3]. This fingerprint, commonly referred to as photo-response non-uniformity (PRNU), can be estimated from images captured by a specific camera for the purpose of source camera identification, in which a noise signal extracted from a probe image of unknown provenance is compared against pre-computed fingerprint estimates from a set of candidate cameras.

While extensive testing has already demonstrated the feasibility of highly reliable PRNU-based consumer camera identification at scale [4], recent research has been largely driven by enabling source attribution under ever more challenging conditions. Modern cameras, particularly those installed in smartphones, go to great lengths to produce visually appealing imagery, and techniques such as lens distortion correction, electronic image stabilization, or high dynamic range imaging have all been found to impede camera identification if not accounted for through spatial resynchronization [5]–[7]. Robustness to strong image and video compression is another major concern [8]–[10] that has been gaining more and more practical relevance due to the widespread sharing of visual media through online social networks [11]. Finally, the ability to
reliably establish camera identification also from very small image patches is crucial for image manipulation localization based on sensor noise [12]–[14].

The pertinent literature concludes that all these scenarios generally benefit from high-quality fingerprint estimates, and that images should undergo content suppression prior to analysis [3]. To date, most works still resort to the Wavelet-based denoiser adopted by Lukáš et al. [15] for that purpose. The maximum likelihood fingerprint estimator derived from a simplified multiplicative signal model [12] gives near-optimal results when fed with noise residuals from a set of full-resolution images of a homogeneously lit scene. The conditions at test time are less ideal, however. Probe images are generally of varying content, and an aggregation over multiple images is often not possible. It is accepted that noise residuals obtained from off-the-shelf denoisers are imperfect by nature and that they are contaminated non-trivially by remnants of image content. Salient textures or quantization noise exacerbate the issue. A number of studies have found that alternative denoising algorithms can lead to moderate improvements [13], [16], [17], yet there remains a considerable gap in the ability to reliably establish the presence of the camera fingerprint in (portions of) a probe image under more challenging conditions.

While the overall performance of camera identification is clearly governed by fundamental bounds imposed by, for instance, the available image resolution and the strength of compression, this paper sets out to demonstrate that there is still ample room for advances over prior art. Critically, the present work deviates from the common procedure of employing at test time the very same noise extractor that was also used for fingerprint estimation, which we explain with a subtle shift in perspective in the restatement of practical camera identification: we accept that the goal ultimately is to match a noise signal extracted from the probe image against a pre-computed fingerprint estimate, and that the quality of the match is assessed in terms of the similarity of the two signals. Adopting the DnCNN work by Zhang et al. [18], we let a convolutional neural network (CNN) learn how to extract a noise signal from a probe image that resembles the pre-computed sensor pattern noise (SPN) fingerprint from the camera of interest as closely as possible. By looking at fingerprint extraction through the lens of an optimization procedure, the resulting network, which we call SPN-CNN, can be expected to better adapt to the problem at hand than existing fingerprint-agnostic denoising algorithms. It is worth pointing out here that the proposed technique differs from the creative advancement of the DnCNN idea in Cozzolino and Verdoliva's Noiseprint approach [19] both in its objective and its design, in that the computed noise signal is expected to emphasize device characteristics instead of camera model characteristics through a training regime that explicitly utilizes a known camera-specific target signal (the camera fingerprint estimate) instead of a more general notion of pairwise patch similarity. Another related work is the image anonymization approach by Bonettini et al. [20], who adapted a DnCNN-like design to remove the camera fingerprint from an image.
We refer to Section III for a more detailed exposition of our approach, which follows after a brief overview of camera sensor noise forensics in Section II. Section IV presents experimental results, including applications to image manipulation localization and video source identification. Section V concludes the text.

II. CAMERA SENSOR NOISE FORENSICS
State-of-the-art sensor noise forensics assumes a simplified imaging model of the form

    x = x^(o) (1 + k) + θ ,    (1)

in which the multiplicative PRNU factor k modulates the noise-free image x^(o), while θ comprises a variety of additive noise components [3]. Substantial empirical evidence suggests that the PRNU is a unique and robust camera fingerprint [4] that can be estimated from a set of N images taken with the specific camera of interest. The standard procedure relies on a denoising filter F(·) to obtain the noise residual w_n = x_n − F(x_n) from the n-th image x_n, 1 ≤ n ≤ N. A modeling assumption

    w_n = k x_n + η_n    (2)

with i.i.d. Gaussian noise η_n then leads to a maximum likelihood estimate k̂ of the PRNU factor of the form [12]

    k̂ = ( Σ_{n=1}^{N} w_n x_n ) · ( Σ_{n=1}^{N} x_n² )^{−1} .    (3)

Practical applications warrant a post-processing step to clean the fingerprint estimate from non-unique artifacts [3], [21]. For a given probe image y of unknown provenance, camera identification is formulated as a hypothesis testing problem:

    H₀ : w = y − F(y) does not contain the fingerprint k
    H₁ : w does contain the fingerprint k ;

i.e., the probe is attributed to the tested camera if H₁ holds. In practice, the test can be decided by evaluating the similarity between the residual w and the fingerprint estimate k̂ for a suitably set threshold τ,

    ρ = sim(w, k̂) ≷_{H₀}^{H₁} τ .    (4)

It is assumed that the two signals are geometrically aligned except for possible translational displacements, which can be accounted for conveniently with normalized cross-correlation or peak-to-correlation energy (PCE) as similarity measures [3].

As inserting content from elsewhere into an image or other forms of image manipulation will remove or impair the camera fingerprint in the affected regions, many local image alterations can be detected by testing for the presence of the expected fingerprint in a sliding-window mode [12]. The localization of small manipulated regions warrants sufficiently small analysis windows, which generally impacts negatively the ability to reliably establish whether or not the expected fingerprint is present. The literature often recommends a window size of about … × … pixels as a reasonable trade-off between resolution and accuracy [13], [14], [22]. A core problem is that the measured local similarity scores under H₁ depend greatly on local image characteristics. Chen et al. [12] have proposed a correlation predictor ρ̂_i(x) as a remedy, which utilizes a set of simple intensity and texture features to predict how strongly the i-th local patch in a probe image would correlate with the purported camera fingerprint under H₁. The decision whether to declare a manipulation (i.e., the absence of the expected fingerprint) can then be made based on the deviation from the expected correlation.

As for video data, it can be advantageous to estimate reference sensor fingerprints from full-resolution still images when available [23]. At test time, it is usually recommended to aggregate noise residuals from multiple probe frames into a probe video fingerprint to cope with strong compression artifacts [24]. Special care has to be taken to account for geometric desynchronization due to video stabilization, as pointed out in many recent reports [6], [23], [25].
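To make Equations (3) and (4) concrete, the following minimal NumPy sketch implements the MLE fingerprint estimator and the correlation test. The denoiser F is passed in as a callable; the function names and the illustrative threshold value are ours and not part of the original formulation.

    import numpy as np

    def estimate_fingerprint(images, denoise):
        """MLE of the PRNU factor (Eq. 3) from N same-size grayscale images."""
        num = np.zeros(images[0].shape, dtype=np.float64)
        den = np.zeros(images[0].shape, dtype=np.float64)
        for x in images:
            x = x.astype(np.float64)
            w = x - denoise(x)            # noise residual w_n = x_n - F(x_n)
            num += w * x                  # accumulate w_n * x_n
            den += x * x                  # accumulate x_n^2
        return num / np.maximum(den, 1e-8)

    def normalized_correlation(w, k_hat):
        """sim(w, k_hat) as plain normalized correlation (cf. Eq. 4)."""
        a = (w - w.mean()).ravel()
        b = (k_hat - k_hat.mean()).ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def attribute(probe, k_hat, denoise, tau=0.01):
        """Decide H1 (probe stems from the tested camera) if rho > tau."""
        w = probe.astype(np.float64)
        w = w - denoise(w)
        rho = normalized_correlation(w, k_hat)
        return rho > tau, rho

In practice the threshold tau would be calibrated to a target false-alarm rate, and PCE would typically replace plain correlation when translational displacements are possible.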
III. SPN-CNN ESTIMATOR

While camera source attribution from sensor pattern noise arguably works extremely well iff pitfalls due to geometric misalignment between the fingerprint and the probe are avoided, there remains a huge discrepancy between a fingerprint estimate aggregated from several images, k̂, and an estimate from a single probe image (via the residual w). A simple explanation is that denoising algorithms are not designed with this specific application in mind, and more generally that content suppression is an ill-posed problem in the absence of viable image models (and possibly even of the noise characteristics itself [26]). Thanks to the central limit theorem, aggregation over multiple noise residuals mitigates these effects to some extent, but noise residuals from a single image will always suffer from significant distortion.

The recent success of data-driven methods suggests a way forward, however, especially considering that Equation (4) poses a clear objective function to be targeted. By understanding CNNs as flexible non-linear optimization tools, it seems reasonable to expect that a network should be able to learn how to extract a better approximation of k from a given probe than a conventional "blind" denoiser. Conceptually, the problem at hand aligns with the scenario considered by Zhang et al. [18], who trained a CNN to learn how to extract well-characterized noise signals from a given image. In our case, we train a network to extract a noise pattern k̃ to minimize ‖k̂ − k̃‖. Our premise here is that the maximum likelihood fingerprint estimator (MLE) gives the best approximation of the actual PRNU signal that we have under the given imaging model assumptions, so we consider it a viable proxy target in lieu of the unknown ground truth. Once trained, the network replaces the denoiser F(·) at test time, i.e., w = k̃ in Equation (4). Notably, this breaks with the tradition of employing the very same denoiser for both fingerprint estimation and detection, although further iterations in which the trained network informs a better fingerprint aggregation seem conceivable.

[Fig. 1. Proposed SPN-CNN training for learning to extract a camera fingerprint k̃ from a given probe image: noise residuals from N flat-field images yield the MLE fingerprint estimate k̂, and the network is updated to minimize the loss L = ‖k̂ − k̃‖.]

Figure 1 outlines the overall training setup, which operates, in its current instantiation, on single-channel grayscale patches of size … × … pixels. To facilitate the learning, the output from each training patch is paired with its corresponding portion of the fingerprint estimate k̂. The SPN-CNN itself is modeled after Zhang et al.'s DnCNN [18] and comprises 17 layers. Each layer implements 64 convolutional filters with a small 3 × 3 spatial support, except for the last layer with only a single-channel output. Batch normalization is implemented in between the convolutional layers, and ReLU activation follows after each but the last layer. No pooling is used at any point. In line with reports in the literature, any modern off-the-shelf denoiser F(·) can be expected to do reasonably well for the purpose of estimating a camera fingerprint from flat-field images. We decided to stick with the standard Wavelet formulation [15] in this work for its computational edge over alternative approaches. The trained network can be fed with inputs of any size within the memory constraints of the GPU.
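A PyTorch sketch of the described architecture may look as follows. The 17-layer layout with 64 filters of 3 × 3 support follows the text, while everything else (padding, bias handling, variable names) is an assumption on our part.

    import torch
    import torch.nn as nn

    class SPNCNN(nn.Module):
        """DnCNN-style noise extractor: 17 conv layers, no pooling."""
        def __init__(self, depth=17, width=64):
            super().__init__()
            layers = [nn.Conv2d(1, width, 3, padding=1), nn.ReLU(inplace=True)]
            for _ in range(depth - 2):      # intermediate conv + BN + ReLU blocks
                layers += [nn.Conv2d(width, width, 3, padding=1, bias=False),
                           nn.BatchNorm2d(width),
                           nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(width, 1, 3, padding=1))  # single-channel output
            self.body = nn.Sequential(*layers)

        def forward(self, x):               # x: batch of grayscale patches (B,1,H,W)
            return self.body(x)             # noise estimate k_tilde

    # Training pairs each output with the matching portion of the MLE
    # fingerprint estimate k_hat, minimizing ||k_hat - k_tilde|| (cf. Fig. 1):
    model, loss_fn = SPNCNN(), nn.MSELoss()
    # loss = loss_fn(model(patches), k_hat_patches)

Because the network is fully convolutional, the same parameters apply to inputs of arbitrary size at test time.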
In practice, we divide images into tiles of size … × … with an overlap of 20 pixels to accommodate possible boundary effects.
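A sketch of this tiled inference, under assumptions: the 20-pixel overlap follows the text, whereas the tile size is a placeholder, and overlapping strips are simply overwritten by later tiles rather than blended.

    import numpy as np
    import torch

    def extract_noise_tiled(model, image, tile=512, overlap=20):
        """Run the trained network over overlapping tiles of a full-size probe."""
        h, w = image.shape
        out = np.zeros((h, w), dtype=np.float32)
        step = tile - overlap
        for i in range(0, h, step):
            for j in range(0, w, step):
                y0 = min(i, max(h - tile, 0))   # clamp tiles to the image border
                x0 = min(j, max(w - tile, 0))
                block = image[y0:y0 + tile, x0:x0 + tile].astype(np.float32)
                with torch.no_grad():
                    t = torch.from_numpy(block)[None, None]   # shape (1,1,H,W)
                    pred = model(t)[0, 0].numpy()
                out[y0:y0 + pred.shape[0], x0:x0 + pred.shape[1]] = pred
        return out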
Figure 2 depicts exemplary results for two images from the VISION database [27] and compares the obtained noise signals, k̃, to the corresponding residuals from the Wavelet denoiser, w. Although the CNN-based estimator arguably does not succeed in suppressing the image content completely, the results suggest qualitatively that it does better than the Wavelet denoiser. A notable increase in correlation with the corresponding portion of the camera fingerprint estimate k̂ supports this impression.

[Fig. 2. Two probe images and the noise signals obtained with the SPN-CNN (k̃) and Wavelet (w) estimators. Each noise pattern has been scaled independently to cover the full intensity range. Numbers in the lower left corner indicate the correlation with the corresponding Wavelet-based camera fingerprint.]

IV. EXPERIMENTS

We work with various datasets for an experimental validation. Our baseline camera identification results in Section IV-B cover camera fingerprints from ten devices from the VISION database [27], and from six cameras from the Dresden Image Database [28], respectively. The former dataset comprises camera-native JPEG images from a variety of mobile devices, whereas we chose the Nikon cameras that provide uncompressed imagery from the latter (converted from raw format with Adobe Lightroom). For each camera, we reserve 100 randomly chosen images for training and divide the remaining images into non-overlapping crops of size … × … pixels, resulting in a total of 53,677 (VISION) plus 41,568 (Dresden) cropped test images under H₁. Under H₀, we supply for each non-overlapping … × … portion of the 16 camera fingerprints about 100 randomly chosen image crops from a different device, totaling 48,362 (VISION) plus 23,004 (Dresden) samples. The manipulation localization results in Section IV-C cover the 55 manipulated images from the Nikon D7000 camera in the Realistic Tampering Dataset [14]. These uncompressed images originate from the RAISE database [29], from which we source an additional 100 images for training. To gain some insight into the applicability to video source attribution in Section IV-D, we work with 18 camera-native indoor and outdoor videos from a Samsung Galaxy S3 Mini (D01 in the VISION database), where we reserve five videos for training. All databases provide flat-field images for the initial MLE camera fingerprint estimation with the Wavelet denoiser.

A. Training
We train a separate network for each camera fingerprint of interest in batches of size 128 for 100 epochs. Each epoch sees a selection of non-overlapping … × … patches from all training images. The patch selection is greedy and successively samples up to 1,000 random patches per image while rejecting those that overlap with previous selections from within the same image. The final number of selected patches thus depends on the image size, but it typically averages to over 50,000 per epoch. Saturated patches are excluded from training. We use the Adam optimizer with MSE loss for training, with a learning rate of 10^−… and a weight decay of 0.2 after every 30 epochs.
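The greedy sampling procedure can be sketched as follows; the patch size, the number of candidate draws per image, and the saturation test are stand-ins for details not fully specified above.

    import numpy as np

    def sample_patches(image, patch=64, candidates=1000, sat_level=250):
        """Greedily accept random patches that do not overlap earlier picks."""
        h, w = image.shape
        accepted, anchors = [], []
        for _ in range(candidates):        # up to 1,000 random draws per image
            i = np.random.randint(0, h - patch + 1)
            j = np.random.randint(0, w - patch + 1)
            if any(abs(i - a) < patch and abs(j - b) < patch for a, b in anchors):
                continue                   # overlaps a previous selection
            p = image[i:i + patch, j:j + patch]
            if p.min() >= sat_level:       # reject saturated patches
                continue
            anchors.append((i, j))
            accepted.append(p)
        return accepted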
[TABLE I. Median correlation under H₁ (… × … patches, VISION). Columns: SPN-CNN extractors trained for cameras D01, D02, D03, D06, D10, D15, D26, D27, D29, D34; Wavelet; DnCNN [18]. Rows: probe cameras.]

B. Baseline Results
For a test of the proposed noise extractor's baseline performance, we conduct basic camera identification by correlating the obtained noise signals against the corresponding matching camera fingerprint estimate. A key question is whether each camera fingerprint warrants its own set of learned parameters. Table I gives some insight by reporting the median correlation scores under H₁ for the VISION database, where we employed all ten trained networks (arranged along the columns) for each probe image. For comparison, the last two columns list the corresponding results obtained from the standard Wavelet denoiser (Wav) and from an off-the-shelf variant of the DnCNN [18] that we trained to extract Gaussian noise with standard deviation σ = 3 on the 400 images provided by its authors. Notably, the SPN-CNN approach yields a measurable boost over the fingerprint-agnostic techniques when a camera-specific variant of the network is employed. The median correlation under H₁ increases, on average, by a factor of 1.5 (at a median correlation under H₀ of about …, compared to … for the Wavelet denoiser), while camera-foreign network configurations yield H₁ correlations that are largely on par with prior art. The corresponding results for the six cameras from the Dresden Image Database in Table II give a slightly different picture, as the network training seems to have less influence. It is currently unclear to what extent this is an artifact of the much more homogeneous dataset (where each camera was set up to capture the same scenes [28] and all images were processed with the same Adobe software). Images in the VISION dataset are more heterogeneous, both in terms of content and camera-specific processing pipelines (for instance, devices D06 and D15 are both iPhone 6 cameras, but operate under different iOS versions). More tailored datasets and experiments will be necessary to disentangle the impact of training data and sensor specifics. For the time being, we recommend training a camera-specific network for each fingerprint.

[TABLE II. Median correlation under H₁ (… × … patches, Dresden). Columns: SPN-CNN extractors trained for cameras D200-0, D200-1, D70-0, D70-1, D70s-0, D70s-1; Wavelet; DnCNN [18]. Rows: probe cameras.]

Adopting this premise, Figure 3 gives a more comprehensive picture of camera identification performance by reporting ROC curves for a select set of ten cameras from both datasets. We focus here on smaller patches of size … × … and … × … (center-cropped from the bigger probes) to showcase the advantage of the targeted data-driven extractors over conventional methods. Not surprisingly, smaller patch sizes incur a drop in performance, but the proposed approach achieves notable improvements across all tested cameras, including those not depicted here.

C. Application: Manipulation Localization
The promising performance on camera identification from small image patches suggests improved image manipulation localization. We test this conjecture by computing from the aforementioned 55 manipulated images 64 × 64 sliding-window correlations with the Nikon D7000 fingerprint estimate. Following prior art [12], we also train a simple linear regression correlation predictor for both the SPN-CNN and the Wavelet noise extractors. A completely disjoint set of 20,000 patches from 20 images was used for this purpose. The pixel-level ROC curves (aggregated over all probe images) in the left panel of Figure 4 are obtained from thresholding the difference between measured and predicted correlation, Δ_i = ρ_i − ρ̂_i, i.e., a strongly negative difference is indicative of the camera fingerprint being absent in the respective local neighborhood. In line with the baseline results in Figure 3, the graphs suggest that the data-driven SPN-CNN extractor is beneficial over the standard Wavelet approach by a large margin. The overall area under the curve (AUC) increases from 0.75 to 0.82. As the results generally vary greatly from image to image, we include a scatter plot of per-image pixel-level AUC scores in the right panel of Figure 4. Observe that SPN-CNN performs better for all but seven probe images, which qualitatively seemed to exhibit content characteristics that were under-represented during training. In general, it is worth pointing out that the localization performance can be expected to increase further with one of the more sophisticated recent random field approaches [13], [14], [22].
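A condensed sketch of this sliding-window test follows, with the 64 × 64 window size from the text; the stride, the decision margin, and the representation of the correlation predictor's output as a per-position map rho_pred are illustrative choices of ours.

    import numpy as np

    def localize(noise, k_hat, rho_pred, win=64, stride=8, margin=0.02):
        """Flag windows whose measured correlation falls short of the prediction."""
        h, w = noise.shape
        heat = np.zeros((h, w), dtype=np.float32)
        for i in range(0, h - win + 1, stride):
            for j in range(0, w - win + 1, stride):
                a = noise[i:i + win, j:j + win]
                b = k_hat[i:i + win, j:j + win]
                a, b = a - a.mean(), b - b.mean()
                rho = (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
                delta = rho - rho_pred[i, j]    # Delta_i = rho_i - rho_hat_i
                if delta < -margin:             # fingerprint weaker than expected
                    heat[i:i + win, j:j + win] = 1.0
        return heat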
D. Application: Video Source Attribution

Video source attribution is generally more challenging than camera identification from still images. The final experiment of this initial exploration thus puts emphasis on the impact of lossy video compression and reduced resolution. We do not consider video stabilization here to keep the discussion focused, but we anticipate that any performance boost in the absence of stabilization would ultimately also help techniques designed to reverse the effects of inter-frame geometric misalignment [6].
[Fig. 3. Camera identification baseline ROC performance for selected cameras from the VISION Database (top: D01, D02, D06, D10, D26) and the Dresden Image Database (bottom: D200-0, D200-1, D70-0, D70-1, D70s-0), based on the correlation between the Wavelet-based camera fingerprint MLE and noise signals extracted from probe patches of two sizes with SPN-CNN, and with off-the-shelf Wavelet and DnCNN [18] denoising, respectively.]

To address the specific characteristics of video data, we train a separate network on frames extracted from the five training videos, while the target fingerprint was computed from the available still images [23]. As a result, synchronizing the imagery to the camera fingerprint requires special attention. The available videos have a resolution of … × …, whereas the full sensor resolution of the still images is … × … pixels. We use the guidance in [27] to determine the geometric mapping from image to video space, but found that the cropping parameters had to be adjusted to reflect differences between landscape and portrait mode. Once camera fingerprint estimate and video frames are aligned, the training proceeds as before.

The results in Figure 5 offer two complementary perspectives. The left panel reflects the common procedure of aggregating noise estimates from multiple frames into a video fingerprint. The graphs depict the mean PCE score (averaged over all test videos) against the number of frames considered (starting from the first frame). The data-driven SPN-CNN approach again provides a measurable boost over the Wavelet denoiser, indicating the possibility of more reliable video source attribution.

[Fig. 4. Image manipulation localization pixel-level ROC curves from 55 manipulated images (left) and per-image AUC scores (right), from correlating noise signals extracted with SPN-CNN and Wavelet denoising in sliding windows of size 64 × 64 with the Wavelet-based camera fingerprint MLE.]
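For reference, a compact PCE sketch under common conventions (circular FFT cross-correlation, a small excluded neighborhood around the peak); the neighborhood size is an assumption, and frame-level noise estimates would simply be averaged into a video fingerprint before this test.

    import numpy as np

    def pce(w, k_hat, exclude=11):
        """Peak-to-correlation energy between a noise signal and a fingerprint."""
        a = w - w.mean()
        b = k_hat - k_hat.mean()
        xc = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))
        peak = np.unravel_index(np.argmax(np.abs(xc)), xc.shape)
        r = exclude // 2
        mask = np.ones(xc.shape, dtype=bool)    # exclude a square around the peak
        rows = np.arange(peak[0] - r, peak[0] + r + 1) % xc.shape[0]
        cols = np.arange(peak[1] - r, peak[1] + r + 1) % xc.shape[1]
        mask[np.ix_(rows, cols)] = False
        return float(xc[peak] ** 2 / np.mean(xc[mask] ** 2))

    # Aggregating frame-level noise estimates into a video fingerprint:
    # video_fp = np.mean([extract_noise_tiled(model, f) for f in frames], axis=0)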
For a closer look at video-specific results, the right panel of Figure 5 reports average frame-level PCE scores per video, distinguishing between I-frames and all remaining frames (the training did not make this distinction). I-frames are expectedly more beneficial for source attribution, and noise signals extracted with the SPN-CNN resemble the camera fingerprint more closely. The average frame-level PCE over all I-frames increases from 75 to 135.

Interestingly, there is a sharp drop in SPN-CNN performance for the last four videos, compared to a more graceful decline of the Wavelet approach. A closer inspection revealed that these are all "outdoor" videos (two each from the "move" and "panrot" categories), with content that was not well represented during training. Along with a number of the earlier observations, we thus find it appropriate to close with a note of caution. While a data-driven perspective clearly holds great promise for boosting sensor noise forensics under challenging conditions, these advances may not come cheap in practice. Proper and careful training is crucial, and a single "catch-all" framework will likely require substantial further experimentation.

[Fig. 5. Video source attribution from the first N probe video frames (left) and from individual frames (right): average PCE scores from 13 unstabilized videos (left) and per video (right), for SPN-CNN and Wavelet extractors, with and without I-frames.]
V. CONCLUDING REMARKS
We have demonstrated that digital camera identification from PRNU sensor pattern noise can benefit greatly from training a convolutional neural network to extract noise signals from probe images that resemble the expected camera fingerprint more closely than noise residuals obtained with standard fingerprint-agnostic denoising procedures. The discussed network features a clean end-to-end design that draws from the recent DnCNN residual learning approach [18]. In its current instantiation, it achieves its most favorable results with a dedicated set of parameters for each candidate camera fingerprint, but future research questions abound. More research is needed to understand to what extent the apparent dependence on sensor specifics is related to differences in the content of the data presented to the network. Along those lines, it is worth pointing out that we have not explicitly controlled for effects such as JPEG compression quality, and that practical applications may warrant a rigorous in-depth analysis of the estimators under H₀. Looking forward, it seems also reasonable to assume that the current approach may only be a stepping stone towards a fully data-driven camera identification framework. Fingerprint estimation and noise signal extraction may be learned jointly, likely leading to further performance boosts. This may address some of the recent concerns regarding the viability of the fundamental imaging model that is also the foundation of this present work [26]. Finally, questions pertaining to fingerprint-copy and removal attacks in the realm of counter-forensics [30] will also have to be reconsidered when moving to data-driven approaches [20].

ACKNOWLEDGMENT
This work was supported by AFRL and DARPA under Contract No. FA8750-16-C-0166. Any findings and conclusions or recommendations expressed in this material are solely the responsibility of the authors and do not necessarily represent the official views of AFRL, DARPA, or the U.S. Government.

REFERENCES
[1] R. Böhme and M. Kirchner, "Media forensics," in Information Hiding, S. Katzenbeisser and F. Petitcolas, Eds. Artech House, 2016, pp. 231–259.
[2] J. Lukáš, J. Fridrich, and M. Goljan, "Determining digital image origin using sensor imperfections," in Image and Video Communications and Processing, ser. Proceedings of SPIE, A. Said and J. G. Apostolopoulos, Eds., vol. 5685, 2005, pp. 249–260.
[3] J. Fridrich, "Sensor defects in digital image forensic," in Digital Image Forensics: There is More to a Picture Than Meets the Eye, H. T. Sencar and N. Memon, Eds. Springer-Verlag, 2013, pp. 179–218.
[4] M. Goljan, J. Fridrich, and T. Filler, "Large scale test of sensor fingerprint camera identification," in Media Forensics and Security, ser. Proceedings of SPIE, E. J. Delp, J. Dittmann, N. D. Memon, and P. W. Wong, Eds., vol. 7254, 2009.
[5] M. Goljan and J. Fridrich, "Sensor-fingerprint based identification of images corrected for lens distortion," in Media Watermarking, Security, and Forensics, ser. Proceedings of SPIE, N. D. Memon, A. M. Alattar, and E. J. Delp, Eds., vol. 8303, 2012.
[6] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro, "Facing device attribution problem for stabilized video sequences," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 14–27, 2020.
[7] M. D. M. Hosseini and M. Goljan, "Camera identification from HDR images," in ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2019.
[8] W. van Houten and Z. J. Geradts, "Source video camera identification for multiply compressed videos originating from YouTube," Digital Investigation, vol. 6, no. 1–2, pp. 48–60, 2009.
[9] W.-H. Chuang, H. Su, and M. Wu, "Exploring compression effects for improved source camera identification using strongly compressed video," in IEEE International Conference on Image Processing (ICIP), 2011, pp. 1953–1956.
[10] M. Goljan, M. Chen, P. Comesaña, and J. Fridrich, "Effect of compression on sensor-fingerprint based camera identification," in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics, 2016.
[11] I. Amerini, R. Caldelli, A. del Mastio, A. di Fuccia, C. Molinari, and A. P. Rizzo, "Dealing with video source identification in social networks," Signal Processing: Image Communication, vol. 57, 2017.
[12] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Determining image origin and integrity using sensor noise," IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 74–90, 2008.
[13] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva, "A Bayesian-MRF approach for PRNU-based image forgery detection," IEEE Transactions on Information Forensics and Security, vol. 9, no. 4, pp. 554–567, 2014.
[14] P. Korus and J. Huang, "Multi-scale analysis strategies in PRNU-based tampering localization," IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 809–824, 2017.
[15] J. Lukáš, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[16] I. Amerini, R. Caldelli, V. Cappellini, F. Picchioni, and A. Piva, "Analysis of denoising filters for photo response non uniformity noise extraction in source camera identification," in International Conference on Digital Signal Processing (DSP), 2009, pp. 511–517.
[17] A. Cortiana, V. Conotter, G. Boato, and F. G. B. De Natale, "Performance comparison of denoising filters for source camera identification," in Media Watermarking, Security, and Forensics III, ser. Proceedings of SPIE, N. D. Memon, J. Dittmann, A. M. Alattar, and E. J. Delp, Eds., vol. 7880, 2011.
[18] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[19] D. Cozzolino and L. Verdoliva, "Noiseprint: a CNN-based camera model fingerprint," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144–159, 2020.
[20] N. Bonettini, L. Bondi, S. Mandelli, P. Bestagini, S. Tubaro, and D. Güera, "Fooling PRNU-based detectors through convolutional neural networks," in European Signal Processing Conference (EUSIPCO), 2018, pp. 957–961.
[21] T. Gloe, S. Pfennig, and M. Kirchner, "Unexpected artefacts in PRNU-based camera identification: A 'Dresden Image Database' case-study," in ACM Multimedia and Security Workshop (MM&Sec), 2012, pp. 109–114.
[22] S. Chakraborty and M. Kirchner, "PRNU-based image manipulation localization with discriminative random fields," in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics, 2017.
[23] M. Iuliani, M. Fontani, D. Shullani, and A. Piva, "Hybrid reference-based video source identification," Sensors, vol. 19, no. 3, 2019.
[24] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, "Source digital camcorder identification using sensor photo-response non-uniformity," in Security and Watermarking of Multimedia Content IX, ser. Proceedings of SPIE, E. J. Delp and P. W. Wong, Eds., vol. 6505, 2007.
[25] S. Taspinar, M. Mohanty, and N. Memon, "Source camera attribution using stabilized video," in IEEE International Workshop on Information Forensics and Security (WIFS), 2016.
[26] M. Masciopinto and F. Pérez-González, "Putting the PRNU model in reverse gear: Findings with synthetic signals," in European Signal Processing Conference (EUSIPCO), 2018, pp. 1352–1356.
[27] D. Shullani, M. Fontani, M. Iuliani, O. A. Shaya, and A. Piva, "VISION: a video and image dataset for source identification," EURASIP Journal on Information Security, 2017.
[28] T. Gloe and R. Böhme, "The Dresden Image Database for benchmarking digital image forensics," Journal of Digital Forensic Practice, vol. 3, no. 2–4, pp. 150–159, 2010.
[29] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, "RAISE: a raw images dataset for digital image forensics," in ACM Multimedia Systems Conference (MMSys), 2015, pp. 219–224.
[30] R. Böhme and M. Kirchner, "Counter-forensics: Attacking image forensics," in Digital Image Forensics: There is More to a Picture Than Meets the Eye, H. T. Sencar and N. Memon, Eds. Springer-Verlag, 2013.