Camera Bias in a Fine Grained Classification Task
Philip T. Jackson, Stephen Bonner, Ning Jia, Christopher Holder, Jon Stonehouse, Boguslaw Obara
Department of Computer Science, Durham University, Durham, UK
Procter and Gamble, Reading, UK
{philip.jackson,stephen.bonner,ning.jia,c.j.holder,boguslaw.obara}@durham.ac.uk, [email protected]

Abstract—We show that correlations between the camera used to acquire an image and the class label of that image can be exploited by convolutional neural networks (CNN), resulting in a model that "cheats" at an image classification task by recognizing which camera took the image and inferring the class label from the camera. We show that models trained on a dataset with camera/label correlations do not generalize well to images in which those correlations are absent, nor to images from unencountered cameras. Furthermore, we investigate which visual features they are exploiting for camera recognition. Our experiments present evidence against the importance of global color statistics, lens deformation and chromatic aberration, and in favor of high frequency features, which may be introduced by image processing algorithms built into the cameras.
I. INTRODUCTION
Convolutional neural networks (CNN) sometimes learn to satisfy their objective functions in ways we do not intend, typically by exploiting some subtle idiosyncrasy in the training data. For example, in [8] a CNN trained on ImageNet was found to be recognizing chocolate sauce pots by the presence of a spoon, because many of the chocolate sauce pots in the ImageNet dataset are indeed accompanied by a silver spoon. While effective at minimizing the loss function at training time, these clever exploits usually result in the model becoming brittle, as it is relying on characteristics that are specific to the training set and are not representative of the wider world. This tends to manifest as domain bias, whereby the model fails to generalize well to instances from other datasets with different idiosyncrasies.

We investigate a real world applied computer vision problem in which severe domain bias was caused by strong correlations between camera model and class label. Since the training dataset consists of two classes acquired with different cameras, the model learns to predict the class label by recognizing the camera that captured the image. Since the sets of cameras used to acquire the two classes are non-intersecting, this is sufficient to achieve perfect training accuracy, whilst learning nothing about the task itself. Our task has characteristics typical of industrial deep vision projects, and we believe the lessons learned will be useful to many deep learning practitioners working on similar projects. By illuminating the sometimes counterintuitive means by which CNNs can classify images, our work is also relevant to the ongoing quest for algorithmic transparency and accountability in machine learning.

The task itself is to discriminate between shampoo bottles from two different manufacturers, which are distinguished only by very small differences in the printing of a batch code on the underside of the bottle. These differences are caused by different industrial printers being used, are independent of the actual character string that is printed, and are subtle enough that detecting them by eye is difficult even for trained experts. This therefore constitutes a fine grained binary classification problem, in which the intra-class variance is high relative to the inter-class variance.

Fine grained classification is difficult, so one might intuitively expect a model to cheat more often on such tasks, if the correct decision function is more complex compared to a cheating rule. On the other hand, in this instance the exploit of recognizing cameras is also a fine grained classification task, and in general it is not obvious which tasks are "harder" for a CNN to learn. CNNs have been known to cheat by detecting patterns which are barely perceptible to humans, such as chromatic aberration [6].
CycleGAN even cheats its reconstruction loss by inserting steganographic codes into its converted images, which it then uses to reconstruct the originals [3].

In this paper, we closely examine an instance of a model cheating on a real world visual classification task, and attempt to answer the following questions:
1) Is it possible for a CNN to recognize camera types when explicitly trained to do so?
2) Can we prove that the same CNN cheats on the task of manufacturer classification by recognizing camera types?
3) Does the propensity toward cheating depend on model architecture?
4) How exactly does a CNN recognize camera types?

Section II reviews relevant literature in fine grained classification and overfitting, while Section III describes our dataset and classification task in detail. Section IV investigates the above questions systematically with a series of experiments, and in Section V we discuss our findings and draw conclusions.
II. PREVIOUS WORK
Two major branches of literature are relevant to our work: source camera identification from images, and understanding deep neural networks.

A. Camera / Image Sensor Pattern Identification

Because our work concerns accidental camera detection, a brief review of deliberate camera detection methods is warranted, as it may shed some light on how our model learns to cheat. Many techniques have been developed to trace digital photos back to their camera of origin, primarily by the digital forensics community [9]. Such techniques can be used to detect doctored images or videos, where images or frames from different cameras are spliced together [4], [5]. Most of these methods revolve around extracting a unique sensor noise fingerprint from the image, and matching it against the reference patterns of known cameras. Since sensor noise is a complex phenomenon with multiple sources (e.g. photonic noise, lens imperfections, dust particles, dark currents, non-uniform pixel sensitivity), there are many ways of doing this. Geradts et al. [10] identify cameras by their unique patterns of dead and hot pixels; however, not all cameras have dead pixels, and some remove them via post-processing. Kharrazi et al. [13] train an SVM to recognize five different cameras based on hand-engineered feature vectors extracted from images. This approach achieves reasonably high classification accuracy, but still too low for forensic purposes. Choi et al. [2] take a similar SVM based approach, additionally showing that radial lens distortion is a useful feature for identifying cameras. Unlike noise based approaches, lens distortion can identify models of camera but not individuals. Kurosawa et al. [15] recognize cameras by dark current noise, which is a small, constant signal emitted by a CCD, varying randomly from pixel to pixel. Although every digital camera has such a noise pattern and it will always be unique, it can only be acquired from dark frames where no light strikes the sensor, and is only a small component of sensor noise. Lukas et al. [16] propose a more robust method that exploits the non-uniform sensitivity to light among sensor pixels, which is a much stronger component and does not require dark frames to measure.

Another feature of consumer cameras that has thwarted a previous deep learning experiment [6] is chromatic aberration, in which different wavelengths of light are refracted by different amounts by the lens. This results in colored fringes around the edges of objects. This too has been used in digital forensics [12].

Recently, CNN-based methods have shown great potential in digital camera identification from images using standard supervised training [1], [25], [26], proving that CNNs are indeed able to infer which camera acquired a digital image.
B. Understanding Deep Convolutional Neural Networks
CNNs are often seen as something of a black box, with no clear consensus as to what information they are using to reach their decisions, how that information is represented internally, or what are the specific roles of their individual components. Attempts to answer these questions can be divided into two strands: feature visualization and attribution.

Feature visualization aims to clarify the function of neurons or channels, by synthesizing images that maximize their activation [20]. Simonyan et al. [22] investigate what patterns CNNs look for in each image class by performing gradient ascent in image space, to maximize the activation of an output class neuron. Yosinski et al. [27] do the same but with better regularization, producing more natural looking images. Mahendran et al. [17] treat intermediate CNN representations as functions which they can invert via gradient ascent in image space. This yields images that the CNN maps to the same representation as the original image, implying that they "look the same" to the CNN. Nguyen et al. [18] find natural looking images that maximally activate feature maps by searching the manifold learned by a generative adversarial network, rather than the full image space. Fong et al. [7] show evidence that, far from feature maps learning separate, well defined concepts, the relationship between feature maps and semantic concepts is many-to-many, with each feature map involved in the detection of several concepts and most concepts activating multiple feature maps.

Attribution investigates which parts of an image contribute most to a CNN's decision, often expressed as "where the model is looking". Zeiler and Fergus [28] propose two methods to this end: occlusion mapping, in which the importance of an image patch is measured as the reduction in class probability when it is obscured, and backpropagation of class probability gradients into image pixels. Both of these methods yield saliency maps showing which parts of the image have the greatest effect on the output when changed, corresponding to the notion of how much they contributed to the network's decision. Another popular approach is guided backpropagation [23], which refines the gradient saliency maps of [28] by zeroing out negative gradients at every backpropagation step, so as to focus only on image parts that contribute positively to a particular class. A much faster alternative to occlusion mapping (which must run a forward pass for each test patch) is class activation mapping (CAM) [29], which uses final layer feature maps as saliency maps, weighted and summed according to the weight of their connection to the class neuron in question. This approach requires that the output layer takes its input directly from mean pooled feature maps (as is the case with GoogLeNet and ResNet but not for networks with fully connected layers such as AlexNet). Selvaraju et al. [21] address this by using mean pooled gradients as a proxy for direct connection weights, allowing feature maps from any layer in any network to be used as saliency maps. Another technique by Fong et al. [8] learns a mask that causes a model to misclassify an image while obscuring the smallest area possible.
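To make the CAM computation concrete, the following is a minimal PyTorch sketch (the model choice, layer names and input are illustrative, assuming a ResNet-style network whose classifier directly follows global average pooling):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet34

# Class activation mapping (CAM): weight the final conv feature maps by their
# connection strength to a chosen class neuron, then sum and upsample.
model = resnet34(pretrained=True).eval()

features = {}
model.layer4.register_forward_hook(
    lambda module, inp, out: features.update(maps=out)  # maps: (1, C, H, W)
)

image = torch.randn(1, 3, 224, 224)  # placeholder input image
logits = model(image)
class_idx = logits.argmax(dim=1).item()

weights = model.fc.weight[class_idx]                         # (C,)
cam = (weights[:, None, None] * features["maps"][0]).sum(0)  # (H, W)
cam = F.relu(cam)                                            # keep positive evidence
cam = F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min())            # normalize for display
```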
III. DATASET
Our dataset consists of RGB images of the undersides of shampoo bottles. These images are cropped tightly around the bottle's batch code, which is a two-line alphanumeric serial number printed by a dot matrix printer (see Figure 1). The crops are from roughly the same area of the original images, so they should cover mostly the same region of the cameras' sensors. This means they should contain roughly the same sensor pattern noise, up to some random translation. The batch codes of the two manufacturers' products are expected to differ in some potentially very subtle ways, hence the relatively high resolution of our images.

Fig. 1. A pair of images from our dataset. The left was taken with an iPhone camera, while the right was taken with a Samsung.

Our images are captured with five different cameras: iPhone, Huawei, Samsung, Redmi and Vivo. In the base dataset these cameras occur at equal frequencies among the two manufacturers, but by excluding certain combinations of camera and label from the dataset, we can introduce correlations between camera type and class label. Since we were aware of the camera bias issue at the time our dataset was collected, care was taken to remove all sources of domain bias (e.g. different people photographing bottles from each manufacturer, and perhaps holding the bottles differently). To this end, the images were acquired by four different people who each photographed an equal number of images from each manufacturer and with each camera, all in the same room, under controlled lighting conditions. This means that domain bias should only exist if we deliberately induce it by excluding certain manufacturer / camera combinations. It also means that background distractors should be uncorrelated with manufacturer and camera type. The test set is a random subset of the samples, on which the model is never trained.

IV. EXPERIMENTS
To address the questions raised in Section I, we run a series of classification experiments on variations of our dataset with camera/label correlation artificially introduced. All experiments are trained to convergence with the Adam optimizer [14], using a fixed learning rate and weight decay, and the categorical cross entropy loss function. All accuracy numbers we report are averaged over four runs with different random number generator seeds. We perform all our experiments with five commonly used CNN architectures, all of which are pretrained on ImageNet and fine-tuned on our tasks with no layer freezing.
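A minimal sketch of this training setup follows; since the exact learning rate and weight decay are not reproduced above, the values below are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune an ImageNet-pretrained model on our binary task, no layer freezing.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # Manufacturer 1 vs. Manufacturer 2

# Adam with weight decay and categorical cross entropy, as described above.
# These hyperparameter values are illustrative, not the paper's exact settings.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```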
A. Camera Classification

As a basic sanity check, we verify here that state of the art vision models can very easily classify which camera took the image in this dataset. This corroborates the work of [1], [25] for our own datasets and cameras. Table I shows that very high test accuracies can be achieved on this task across a range of architectures. As Figure 2 shows, a pretrained ResNet34 not only achieves high accuracy at camera classification but does so very quickly. Camera recognition is learned faster than manufacturer recognition, suggesting that a model which can minimize its loss by recognizing manufacturers or by cheating by camera recognition will tend toward the latter, as it is somehow easier.

Fig. 2. A pretrained ResNet34 model learns to recognize manufacturers very quickly, and learns to recognize cameras even faster.
TABLE I
Accuracy on the test set when classifying which camera took an image.

Model        Test Accuracy
ResNet34     0.999
ResNet101    0.913
InceptionV3  0.998
AlexNet      0.945
VGG16        0.974

B. Manufacturer Classification
We investigate our primary task, manufacturer classification, under three settings, which we refer to as Balanced, Partial, and Disjoint. In the Balanced setting, we use the full training set and there are no correlations between camera type and class label. In Partial, we use the same training set but with only iPhone and Samsung cameras included. In Disjoint, we introduce correlations between camera and class label by including only Manufacturer 1 images taken with iPhone or Samsung cameras, and Manufacturer 2 images taken with Huawei or Redmi cameras. Our test set is the same in all cases, balanced across camera types with no camera/label correlations.

Table II shows the results of manufacturer classification experiments on these three datasets. It is immediately apparent that while respectable accuracy is achieved when training on the Balanced dataset, an accuracy drop of roughly 0.4 occurs when training on Disjoint. In fact, Disjoint accuracy is close to chance level, hardly better than random guessing, which is entirely expected if the model were basing its classifications on camera types, each of which has an equal number of images of each class in the test set.

TABLE II
Manufacturer classification test set accuracy of five models with different training setups.

Model        Balanced  Partial  Disjoint
ResNet34     0.974     0.957    0.505
ResNet101    0.969     0.921    0.505
InceptionV3  0.973     0.940    0.518
AlexNet      0.929     0.893    0.573
VGG16        0.979     0.945    0.556

Fig. 3. Test accuracy plot showing the distribution of predicted labels among correct outputs, for a ResNet34 trained on the Disjoint training set, in which all Manufacturer 1 images are iPhone or Samsung, and all Manufacturer 2 are Huawei or Redmi. For images from iPhone and Samsung cameras the model predicts only Manufacturer 1, while for Huawei and Redmi it predicts only Manufacturer 2; for the unseen Vivo images it appears to guess randomly, achieving roughly chance accuracy with a mostly even mix of both classes. Best viewed in color.

We can confirm that this drop in accuracy is due to camera bias by observing the model's behavior across camera types in the test set. As Figure 3 shows, a ResNet34 trained on the Disjoint dataset predicts Manufacturer 1 exclusively on test images acquired by iPhone or Samsung cameras, and Manufacturer 2 overwhelmingly on Huawei and Redmi images. Similar behavior is observed in the other models.

It is interesting to note that AlexNet and VGG16 both score higher test accuracy after training on Disjoint than their more modern counterparts, ResNet34, ResNet101 and InceptionV3. One possible explanation for this is that AlexNet and VGG16 both use fully connected layers to produce their final output, while the more recent networks are fully convolutional, i.e. using a global average pooling layer to convert the final feature maps into a fixed size vector that is then classified by a single fully connected output layer. Fully connected layers have one parameter per input unit and hence require fixed size input, whereas convolutional layers can process arbitrary sized input by using the same convolutional weights at every location in the input. AlexNet and VGG16 therefore require input images to be downsampled to a much smaller fixed size, whereas the fully convolutional networks receive the full resolution images. This suggests that camera identification exploits high frequency features (as opposed to geometric distortions caused by lens variations), which are partly destroyed during downsampling, thus preventing AlexNet and VGG16 from exploiting them.

Fig. 4. Test accuracy plot showing the distribution of predicted labels among correct outputs, for a ResNet34 trained on the Partial training set, in which camera type is uncorrelated with class label but only iPhone and Samsung images are present. Overall accuracy across all camera types is close to that achieved when trained on the full dataset, with little bias in favor of familiar camera types. This implies that in the absence of camera / label correlations, the model learns robust features for manufacturer classification, which generalize well to images from unseen cameras. Best viewed in color.

When training on the Partial dataset, where only iPhone and Samsung images are present but no camera/label correlation exists, test accuracy is broadly similar to training on the full (Balanced) dataset. Not only is accuracy high, but as Figure 4 shows, the model performs well on the unseen cameras. This implies that in the absence of camera/label correlation, the model learns a robust classification rule that is unaffected by camera type.
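For concreteness, the three settings can be constructed as below; this is a sketch with hypothetical field names, not our exact data pipeline:

```python
from collections import namedtuple

# Each sample carries its image path, manufacturer label (1 or 2) and camera.
Sample = namedtuple("Sample", ["path", "manufacturer", "camera"])

def make_split(samples, setting):
    if setting == "balanced":
        return samples  # all five cameras, both manufacturers, no correlation
    if setting == "partial":
        # Both manufacturers, but only two camera types; still no correlation.
        return [s for s in samples if s.camera in {"iPhone", "Samsung"}]
    if setting == "disjoint":
        # Camera type now predicts the label perfectly, so a model can "cheat".
        keep = {1: {"iPhone", "Samsung"}, 2: {"Huawei", "Redmi"}}
        return [s for s in samples if s.camera in keep[s.manufacturer]]
    raise ValueError(setting)
```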
C. Adversarial Attacks on Manufacturer Classifiers
To gain some insight into the effect that camera / label correlations have on a trained model in terms of the patterns it learns to recognize, we perform adversarial attacks on trained models and visualize the perturbations that flip a trained model's judgement of an image from Manufacturer 2 to 1. Adversarial attacks are small perturbations to input images, imperceptible to the human eye, which nonetheless are sufficient to fool a model into classifying that image as whatever the attacker wishes [24]. They are easily generated by gradient ascent in image space, backpropagating the log likelihood of the target label into the image pixels and taking small steps in the direction of the resulting image gradient until the model's prediction favors our target (e.g. see Nguyen et al. [19]). By performing this process using the same image but different models and comparing the resulting image perturbations, we can learn something about how those models differ.

Fig. 5. Adversarial perturbations applied to two images, classified by a ResNet34 model trained on the Balanced dataset (left) and the Disjoint dataset (right). The left image in each pair shows the input image with the perturbation amplified for visibility and overlaid on top, while the right image shows just the amplified perturbation itself. Strikingly different perturbations to the same image are observed depending on whether the model was trained without camera / label correlations (Balanced) or with them (Disjoint). Best viewed digitally, zoomed in.

Figure 5 shows that strikingly different adversarial perturbations are induced depending on which dataset the model was trained on. Perturbations that fool the Balanced model are focused around the batch code and other visible features of the bottle, such as the plastic seam, whereas those that fool the Disjoint model show a characteristic pink / green banding pattern in flat, featureless areas of the image. A distinct rainbow-like band of perturbation is also visible along the tops of images classified by the Disjoint model; these banding patterns at the tops and in featureless areas of images appear regardless of which input image the attack is performed on.

Adversarial perturbations, when amplified for visibility, usually look like uninterpretable noise bearing little apparent resemblance to the target image class (e.g. Goodfellow et al. [11]), so it is interesting to see so much structure in our case. The appearance of banding patterns in flat regions provides some evidence against chromatic aberration, which should manifest at the edges of objects.
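The targeted attack just described can be sketched as follows (step size and iteration budget are illustrative):

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, image, target_class, step=1e-3, n_steps=200):
    """Gradient ascent on the log likelihood of `target_class`: take small
    steps that decrease the cross entropy (negative log likelihood) of the
    target until the model predicts it. `image` is a (3, H, W) tensor in [0, 1]."""
    model.eval()
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        logits = model(adv.unsqueeze(0))
        if logits.argmax(dim=1).item() == target_class:
            break
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        loss.backward()
        with torch.no_grad():
            adv -= step * adv.grad.sign()  # move toward the target class
            adv.clamp_(0.0, 1.0)           # stay in the valid pixel range
        adv.grad.zero_()
    return adv.detach()
```

The visualized perturbation is then the difference between the adversarial and original images, amplified for visibility.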
D. Classification of Binary Masks
As discussed in Section II, there is a finite set of image features that may be used to infer the camera from which an image originates. Since most of these features relate to color distribution or high frequency detail (i.e. patterns detectable within a small window), it seems likely that removal of these features would render camera identification impossible, and hence resolve the domain bias issues. To do this while preserving features that are likely relevant for robust manufacturer detection, we apply local mean thresholding to the images. This yields a binary image that effectively segments the dots of the batch codes while removing all elements of color and texture (see Figure 6).

As Table III shows, training on binary segmented images does not yield usable results on the manufacturer detection or camera classification tasks; in both cases, the test accuracy is close to the level expected of random guessing. As expected, we also observed no significant correlation between manufacturer classification test accuracy and camera type when training on the Disjoint dataset with binary thresholding. This largely rules out lens distortion or other large scale geometric artifacts as the source of camera bias, since these distortions, which are typically most prevalent near the edges of images, would cause the dots to move and thus be visible in the binary thresholded images.

Fig. 6. A bottle image with local mean thresholding applied, segmenting the batch code dots. Origin camera classification does not work on such images, indicating that models use something other than the shape and position of the dots to classify cameras.

TABLE III
Manufacturer and camera classification accuracy on the test set when trained (and tested) on binary segmented images (see Figure 6).

Model        Manufacturers  Cameras
ResNet34     0.489          0.229
ResNet101    0.530          0.186
InceptionV3  0.525          0.270
AlexNet      0.499          0.214
VGG16        0.616          0.384
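A sketch of this preprocessing using OpenCV's local mean (adaptive) thresholding; the window size and offset below are illustrative, not our exact values:

```python
import cv2

# Segment the dark batch-code dots by comparing each pixel to its local mean.
img = cv2.imread("bottle.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.adaptiveThreshold(
    img, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,  # threshold = local mean minus an offset
    cv2.THRESH_BINARY_INV,       # dots (darker than background) become white
    blockSize=51,                # odd window size for computing the local mean
    C=10,                        # offset subtracted from the local mean
)
cv2.imwrite("bottle_binary.png", binary)
```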
E. Classification of Color Jittered Images
One hypothesis is that different cameras have subtly different color correction / white balancing settings, which a CNN could very easily detect and exploit, especially since the images were acquired in laboratory conditions with controlled lighting. We test this hypothesis by randomizing the hue, saturation, contrast and brightness of the images at training time, thus removing any correlation between camera type and global image color statistics. Table IV shows that even with the high level of color randomization used (see Figure 7), camera and manufacturer classification test accuracy remains high. Accuracy is somewhat diminished for the AlexNet and VGG16 architectures, which require downsampled images as input, suggesting that color features are still useful when high frequency features are less available. We also train models to recognize cameras from grayscale images, achieving test accuracies roughly identical to those in Table IV.

Fig. 7. Color jitter augmentations applied to a single image (original in top left). Augmenting our training images with basic color distortions removes any correlations that may exist between class label and white balance, saturation, or hue.

TABLE IV
Manufacturer and camera classification test set accuracy when trained on images with randomized hue, saturation, contrast and brightness. Robust camera classification accuracy implies that image color statistics are not necessary for camera inference.

Model        Manufacturers  Cameras
ResNet34     0.975          0.992
ResNet101    0.961          0.995
InceptionV3  0.974          0.998
AlexNet      0.923          0.768
VGG16        0.972          0.883
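Such randomization is readily implemented with standard augmentation tooling; a sketch using torchvision (jitter strengths are illustrative):

```python
from torchvision import transforms

# Randomize global color statistics at training time so that white balance,
# saturation, hue and exposure no longer identify the camera.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    transforms.ToTensor(),
])

# Variant for the grayscale control experiment (three output channels keep
# the input shape expected by ImageNet-pretrained models).
grayscale_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
```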
F. Classifying Cameras from Small Image Patches
With lens deformation and color statistics ruled out as camera identifying features, we turn our attention towards high frequency features. As discussed in Section II, such features could be introduced by various forms of fixed sensor pattern noise, dust particles stuck to the lens, and image processing / compression algorithms performed automatically by the camera. We investigate the role of high frequency features by training CNNs to classify cameras given only a small random crop of our original input images (upsampled to the required input size for AlexNet and VGG16). As Table V shows, camera identification accuracy remains surprisingly robust even when input is restricted to such a small window. This strongly implies that high frequency features are sufficient for camera identification, and confirms that lens distortion is not required. However, it remains unclear whether these features are localized to certain regions of the image or present uniformly. Figure 8 shows an accuracy heatmap, constructed by repeatedly sampling crops from our training set and drawing a white square at the location of each correctly classified crop. This shows that classification accuracy is independent of the location of the crop, at least when averaged over the whole dataset. This implies that whatever pattern is being exploited occurs uniformly across the images on average. Figure 9 shows how classification accuracy for crops varies across five individual images, one from each camera.

Fig. 8. Heatmap representing relative classification accuracy of crops at different locations in the image, averaged across images from the whole dataset. The lack of bias toward any particular part of the image implies that camera predictive patterns are present uniformly across the images.

Fig. 9. Heatmaps representing the camera identification accuracy on patches at different locations in single images. An image from each camera is shown, and the predictions are all from the same ResNet34 checkpoint. The model is able to correctly classify patches from most locations on most images, but some significant dark patches occur.

TABLE V
Camera classification test accuracy when trained only on random crops of the input data.

Model        Test Accuracy
ResNet34     0.665
ResNet101    0.681
InceptionV3  0.948
AlexNet      0.770
VGG16        0.872
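A sketch of how such a location-wise accuracy map can be accumulated; here we average correct and total counts per pixel rather than drawing squares only for correct crops, and the crop size and sample budget are illustrative:

```python
import torch

def patch_accuracy_heatmap(model, images, labels, patch=64, n_samples=10000):
    """Accumulate, at each spatial location, how often a random crop taken
    there is classified correctly. `images` is an (N, 3, H, W) tensor and
    `labels` a list of integer camera labels."""
    model.eval()
    _, _, H, W = images.shape
    hits = torch.zeros(H, W)
    counts = torch.zeros(H, W)
    with torch.no_grad():
        for _ in range(n_samples):
            i = torch.randint(0, len(images), (1,)).item()
            y = torch.randint(0, H - patch, (1,)).item()
            x = torch.randint(0, W - patch, (1,)).item()
            crop = images[i : i + 1, :, y : y + patch, x : x + patch]
            pred = model(crop).argmax(dim=1).item()
            hits[y : y + patch, x : x + patch] += float(pred == labels[i])
            counts[y : y + patch, x : x + patch] += 1
    return hits / counts.clamp(min=1)  # relative accuracy per location
```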
G. Generalizing from Left Field of View to Right

Pixel non-uniformity (PNU) noise, as described in [16], is a high frequency noise fingerprint, manifested as randomly varying sensitivities of individual sensor pixels to light. We would expect such a noise fingerprint to be non-repeating, that is, the noise pattern in one part of an image should be different to that in other parts. If the models are learning to recognize cameras by recognizing their PNU noise fingerprints, they should therefore be incapable of recognizing cameras from patches of noise fingerprint they have not encountered during training. We therefore test our models' reliance on PNU noise by training them on only the left halves of our Balanced training set images, and testing on the right halves. If they are reliant on PNU noise then generalization to the right halves of images should be poor. As Table VI shows, this is not the case, therefore PNU noise is unlikely to be the primary source of camera identifying information.

TABLE VI
Camera classification accuracy on right halves of images after training on the left halves. Strong generalization to an unseen area of the training images implies that PNU noise fingerprints of the sort discussed by Lukas et al. [16] are unlikely to be the mechanism by which CNNs are recognizing cameras, because the noise fingerprint on the right side of the images will be different to that on the left side.

Model        Test Accuracy
ResNet34     0.992
ResNet101    0.889
InceptionV3  0.999
AlexNet      0.818
VGG16        0.851
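The left/right protocol can be sketched as a simple dataset wrapper (a hypothetical helper, assuming images as (C, H, W) tensors):

```python
from torch.utils.data import Dataset

class HalfImageDataset(Dataset):
    """Wraps an image dataset, keeping only the left or right half of each
    image, so a model trained on one half can be tested on the other."""

    def __init__(self, base, side="left"):
        self.base = base
        self.side = side

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, label = self.base[idx]  # image: (C, H, W) tensor
        w = image.shape[-1]
        half = image[..., : w // 2] if self.side == "left" else image[..., w // 2 :]
        return half, label

# train_set = HalfImageDataset(balanced_train, side="left")
# test_set  = HalfImageDataset(balanced_test,  side="right")
```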
V. CONCLUSION
We have shown that CNNs learn to exploit camera / class label correlations in an image classification dataset in which such correlations are present. By recognizing the camera that acquired an image, CNNs are able to infer the class label without learning any features that are relevant to the task (in our case, manufacturer classification), as evidenced by poor generalization to images where the camera / label correlation is broken. This finding has relevance both to fine grained classification and to algorithmic transparency and accountability. We also show that CNNs are capable of learning to infer origin cameras when explicitly trained to do so, corroborating the results of Bondi et al. and Tuama et al. [1], [25]. We test these phenomena across five different CNN architectures and show that the effects are common to all of them, although lesser among AlexNet and VGG16, the two architectures whose inputs must be downsampled to a smaller size due to the use of fully connected layers. We have also performed a number of experiments to gain insight into how CNNs are recognizing cameras, the results of which require some discussion in this section. Section II outlines a number of potential sources of camera identifying information, and our experiments provide evidence for and against those hypotheses.

A simple explanation for camera bias would be differences in average color statistics among cameras, caused by differences in white balance and color correction settings. This hypothesis is largely ruled out by the fact that CNNs still recognize cameras easily even when hue, saturation, contrast and brightness are randomized (see Figure 7, Table IV). Another potential explanation was lens distortion; if different cameras have different shaped lenses then there may be slight differences in geometric distortion (e.g. radial lens distortion [2]). This hypothesis too is ruled out, by the fact that CNNs are incapable of inferring cameras from binary segmented images (see Figure 6, Table III). Geometric distortions would be visible in the spacing of the dots from the batch codes, which are the only features visible in these images. Chromatic aberration is also unlikely, since it should only be visible at the edges of objects, not in flat regions (Figures 5, 9), and should also be undetectable in grayscale images (Section IV-E). These results increase the likelihood that texture, which is absent in segmented images but preserved in color randomized images, plays an important role. High camera recognition accuracy on small random crops (Table V), including in empty patches of the image where it is hard to imagine what features besides faint, high frequency texture are available (Figure 9), increases this likelihood further.

There are two likely sources of camera correlated texture: pixel non-uniformity (PNU) noise, and the camera's on-board image processing, which typically includes algorithms such as kernel filtering, image sharpening and compression (both discussed by Lukas et al. [16]). PNU noise is dominated by a fixed multiplicative noise pattern that is introduced during manufacturing; as such, we would expect different noise patterns in different parts of the field of view, as opposed to a repeating pattern. The fact that CNNs trained on the left hand sides of images generalize well to the right hand sides of those images (Table VI) therefore implies that PNU noise is not crucial for camera recognition, since they should not be able to recognize unseen noise patterns on the right side of the images. The fact that AlexNet and VGG16 are also able to recognize cameras from downsampled images is also strong evidence against PNU noise, which should be undetectable after downsampling.

By a process of elimination, the most likely explanation therefore seems to be on-camera image processing algorithms. We do not consider these results to be conclusive; a conclusive answer would require full knowledge of the original cameras, which we do not have. Further research is required to ascertain exactly which textural features are exploited by CNNs to recognize cameras.
REFERENCES

[1] Luca Bondi, David Güera, Luca Baroffio, Paolo Bestagini, Edward J Delp, and Stefano Tubaro. A preliminary study on convolutional neural networks for camera model identification. Electronic Imaging, (7):67–76, 2017.
[2] Kai San Choi, Edmund Y Lam, and Kenneth KY Wong. Source camera identification using footprints from lens aberration. In Digital Photography II, volume 6069, pages 172–179, 2006.
[3] Casey Chu, Andrey Zhmoginov, and Mark Sandler. CycleGAN, a master of steganography. arXiv preprint arXiv:1712.02950, 2017.
[4] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Extracting camera-based fingerprints for video forensics. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 130–137, 2019.
[5] Davide Cozzolino and Luisa Verdoliva. Noiseprint: a CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security, 2019.
[6] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[7] Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8730–8738, 2018.
[8] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3429–3437, 2017.
[9] Jessica Fridrich. Digital image forensics. IEEE Signal Processing Magazine, 26(2):26–37, 2009.
[10] Zeno J Geradts, Jurrien Bijhold, Martijn Kieft, Kenji Kurosawa, Kenro Kuroki, and Naoki Saitoh. Methods for identification of images acquired with digital cameras. In Enabling Technologies for Law Enforcement and Security, volume 4232, pages 505–512. International Society for Optics and Photonics, 2001.
[11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[12] Micah K Johnson and Hany Farid. Exposing digital forgeries through chromatic aberration. In Proceedings of the 8th Workshop on Multimedia and Security, pages 48–55, 2006.
[13] Mehdi Kharrazi, Husrev T Sencar, and Nasir Memon. Blind source camera identification. In International Conference on Image Processing, volume 1, pages 709–712. IEEE, 2004.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Kenji Kurosawa, Kenro Kuroki, and Naoki Saitoh. CCD fingerprint method - identification of a video camera from videotaped images. In Proceedings 1999 International Conference on Image Processing, volume 3, pages 537–540. IEEE, 1999.
[16] Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security, 1(2):205–214, 2006.
[17] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
[18] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pages 3387–3395, 2016.
[19] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
[20] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[21] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[22] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[23] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[25] Amel Tuama, Frédéric Comby, and Marc Chaumont. Camera model identification with the use of deep convolutional neural networks. In IEEE International Workshop on Information Forensics and Security, pages 1–6, 2016.
[26] Ye Yao, Weitong Hu, Wei Zhang, Ting Wu, and Yun-Qing Shi. Distinguishing computer-generated graphics from natural images based on sensor pattern noise and deep learning. Sensors, 18(4):1296, 2018.
[27] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[28] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[29] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.