Analyzing the Performance of Multilayer Neural Networks for Object Recognition

Abstract

In the last two years, convolutional neural networks (CNNs) have achieved an impressive suite of results on standard recognition datasets and tasks. CNN-based features seem poised to quickly replace engineered representations, such as SIFT and HOG. However, compared to SIFT and HOG, we understand much less about the nature of the features learned by large CNNs. In this paper, we experimentally probe several aspects of CNN feature learning in an attempt to help practitioners gain useful, evidence-backed intuitions about how to apply CNNs to computer vision problems.


Pulkit Agrawal, Ross Girshick, Jitendra Malik
{pulkitag,rbg,malik}@eecs.berkeley.edu
University of California, Berkeley


Keywords: convolutional neural networks, object recognition, empirical analysis

Over the last two years, a sequence of results on benchmark visual recognition tasks has demonstrated that convolutional neural networks (CNNs) [7,16,20] will likely replace engineered features, such as SIFT [17] and HOG [3], for a wide variety of problems. This sequence started with the breakthrough ImageNet [4] classification results reported by Krizhevsky et al. [13]. Soon after, Donahue et al. [5] showed that the same network, trained for ImageNet classification, was an effective blackbox feature extractor. Using CNN features, they reported state-of-the-art results on several standard image classification datasets. At the same time, Girshick et al. [9] showed how the network could be applied to object detection. Their system, called R-CNN, classifies object proposals generated by a bottom-up grouping mechanism (e.g., selective search [25]). Since detection training data is limited, they proposed a transfer learning strategy in which the CNN is first pre-trained, with supervision, for ImageNet classification and then fine-tuned on the small PASCAL detection dataset [6]. Since this initial set of results, several other papers have reported similar findings on a wider range of tasks (see, for example, the outcomes reported by Razavian et al. in [19]).

Feature transforms such as SIFT and HOG afford an intuitive interpretation as histograms of oriented edge filter responses arranged in spatial blocks. However, we have little understanding of what visual features the different layers of a CNN encode. Given that rich feature hierarchies provided by CNNs are likely to emerge as the prominent feature extractor for computer vision models over the next few years, we believe that developing such an understanding is an interesting scientific pursuit and an essential exercise that will help guide the design of computer vision methods that use CNNs. Therefore, in this paper we study several aspects of CNNs through an empirical lens.

Girshick et al. [9] showed that supervised pre-training and fine-tuning are effective when training data is scarce. However, they did not investigate what happens when training data becomes more abundant. We show that it is possible to get good performance when training R-CNN from a random initialization (i.e., without ImageNet supervised pre-training) with a reasonably modest amount of detection training data (37k ground truth bounding boxes). However, we also show that in this data regime, supervised pre-training is still beneficial and leads to a large improvement in detection performance. We show similar results for image classification, as well.

ImageNet pre-training does not overfit.

One concern when using supervised pre-training is that achieving a better model fit to ImageNet, for example, might lead to higher generalization error when applying the learned features to another dataset and task. If this is the case, then some form of regularization during pre-training, such as early stopping, would be beneficial. We show the surprising result that pre-training for longer yields better results, with diminishing returns, but does not increase generalization error. This implies that fitting the CNN to ImageNet induces a general and portable feature representation. Moreover, the learning process is well behaved and does not require ad hoc regularization in the form of early stopping.

Grandmother cells and distributed codes.

We do not have a good understanding of mid-level feature representations in multilayer networks. Recent work on feature visualization (e.g., [15,28]) suggests that such networks might consist mainly of “grandmother” cells [1,18]. Our analysis shows that the representation in intermediate layers is more subtle. There are a small number of grandmother-cell-like features, but most of the feature code is distributed and several features must fire in concert to effectively discriminate between classes.

Importance of feature location and magnitude.

Our final set of experiments investigates what role a feature’s spatial location and magnitude play in image classification and object detection. Matching intuition, we find that spatial location is critical for object detection, but matters little for image classification. More surprisingly, we find that feature magnitude is largely unimportant. For example, binarizing features (at a threshold of 0) barely degrades performance. This shows that sparse binary features, which are useful for large-scale image retrieval [10,26], come “for free” from the CNN’s representation.

In this paper, we report experimental results using several standard datasets and tasks, which we summarize here.

Image classification.

For the task of image classification we consider two datasets, the first of which is PASCAL VOC 2007 [6]. We refer to this dataset and task by “PASCAL-CLS”. Results on PASCAL-CLS are reported using the standard average precision (AP) and mean average precision (mAP) metrics.

PASCAL-CLS is fairly small-scale with only 5k images for training, 5k images for testing, and 20 object classes. Therefore, we also consider the medium-scale SUN dataset [27], which has around 108k images and 397 classes. We refer to experiments on SUN by “SUN-CLS”. In these experiments, we use a non-standard train-test split since it was computationally infeasible to run all of our experiments on the 10 standard subsets proposed by [27]. Instead, we randomly split the dataset into three parts (train, val, and test) using 50%, 10% and 40% of the data, respectively. The distribution of classes was uniform across all three sets. We emphasize that results on these splits are only used to support investigations into properties of CNNs and not for comparing against other scene-classification methods in the literature. For SUN-CLS, we report 1-of-397 classification accuracy averaged over all classes, which is the standard metric for this dataset. For select experiments we report the error bars in performance as mean ± standard deviation in accuracy over 3 runs (it was computationally infeasible to compute error bars for all experiments). For each run, a different random split of train, val, and test sets was used.

Object detection.

For the task of object detection we use PASCAL VOC 2007. We train using the trainval set and test on the test set. We refer to this dataset and task by “PASCAL-DET”. PASCAL-DET uses the same set of images as PASCAL-CLS. We note that it is standard practice to use the 2007 version of PASCAL VOC for reporting results of ablation studies and hyperparameter sweeps. We report performance on PASCAL-DET using the standard AP and mAP metrics. In some of our experiments we use only the ground-truth PASCAL-DET bounding boxes, in which case we refer to the setup by “PASCAL-DET-GT”.

In order to provide a larger detection training set for certain experiments, we also make use of the “PASCAL-DET+DATA” dataset, which we define as including VOC 2007 trainval union with VOC 2012 trainval. The VOC 2007 test set is still used for evaluation. This dataset contains approximately 37k labeled bounding boxes, which is roughly three times the number contained in PASCAL-DET.

Footnote (SUN-CLS accuracy metric): The version of this paper published at ECCV’14 contained an error in our description of the accuracy metric. That version used overall accuracy, instead of class-averaged accuracy. This version contains corrected numbers for SUN-CLS to reflect the standard accuracy metric of class-averaged accuracy.
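The difference between overall accuracy and the class-averaged accuracy used for SUN-CLS can be made concrete with a small sketch. The function names and the toy labels below are illustrative assumptions, not part of the paper's code; the point is only that the two metrics diverge when classes are imbalanced.

```python
from collections import defaultdict


def overall_accuracy(labels, predictions):
    """Fraction of all examples that are classified correctly."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def class_averaged_accuracy(labels, predictions):
    """Mean of the per-class accuracies, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for y, p in zip(labels, predictions):
        totals[y] += 1
        hits[y] += int(y == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, imbalanced toy data: 8 images of class "a", 2 of class "b".
labels = ["a"] * 8 + ["b"] * 2
predictions = ["a"] * 8 + ["a", "b"]

print(overall_accuracy(labels, predictions))         # 9 of 10 images correct
print(class_averaged_accuracy(labels, predictions))  # mean of 8/8 and 1/2
```

On this toy data the frequent class dominates overall accuracy, while the class-averaged figure is pulled down by the poorly recognized rare class.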

All of our experiments use a single CNN architecture. This architecture is the Caffe [11] implementation of the network proposed by Krizhevsky et al. [13]. The layers of the CNN are organized as follows. The first two are subdivided into four sublayers each: convolution (conv), max(x, 0) rectifying non-linear units (ReLUs), max pooling, and local response normalization (LRN). Layers 3 and 4 are composed of convolutional units followed by ReLUs. Layer 5 consists of convolutional units, followed by ReLUs and max pooling. The last two layers are fully connected (fc). When we refer to conv-1, conv-2, and conv-5 we mean the output of the max pooling units following the convolution and ReLU operations (also following LRN when applicable). For layers conv-3, conv-4, fc-6, and fc-7 we mean the output of ReLU units.

Note that this nomenclature differs slightly from [9].

Training a large CNN on a small dataset often leads to catastrophic overfitting. The idea of supervised pre-training is to use a data-rich auxiliary dataset and task, such as ImageNet classification, to initialize the CNN parameters. The CNN can then be used on the small dataset, directly, as a feature extractor (as in [5]). Or, the network can be updated by continued training on the small dataset, a process called fine-tuning.

For fine-tuning, we follow the procedure described in [9]. First, we remove the CNN’s classification layer, which was specific to the pre-training task and is not reusable. Next, we append a new randomly initialized classification layer with the desired number of output units for the target task. Finally, we run stochastic gradient descent (SGD) on the target loss function, starting from a learning rate set to 0.001 (1/ […]

The results in [9] (R-CNN) show that supervised pre-training for ImageNet classification, followed by fine-tuning for PASCAL object detection, leads to large gains over directly using features from the pre-trained network (without fine-tuning). However, [9] did not investigate three important aspects of fine-tuning: (1) What happens if we train the network “from scratch” (i.e., from a random initialization) on the detection data? (2) How does the amount of fine-tuning data change the picture? and (3) How does fine-tuning alter the network’s parameters? In this section, we explore these questions on object detection and image classification datasets.

Table 1: Comparing the performance of CNNs trained from scratch, pre-trained on ImageNet, and fine-tuned. PASCAL-DET+DATA includes additional data from VOC 2012 trainval. (Bounding-box regression was not used for detection results.)

dataset            scratch      pre-train    fine-tune
SUN-CLS            35.… ± …     … ± …        … ± …
PASCAL-DET         40.7         45.5         54.1
PASCAL-DET+DATA    52.3         …            59.2

The main results of this section are presented in Table 1. First, we focus on the detection experiments, which we implemented using the open source R-CNN code. All results use features from layer fc-7.

Somewhat surprisingly, it’s possible to get reasonable results (40.7% mAP) when training the CNN from scratch using only the training data from VOC 2007 trainval (13k bounding box annotations). However, this is still worse than using the pre-trained network, directly, without fine-tuning (45.5%). Even more surprising is that when the VOC 2007 trainval data is augmented with VOC 2012 data (an additional 25k bounding box annotations), we are able to achieve a mAP of 52.3% from scratch. This result is almost as good as the performance achieved by pre-training on ImageNet and then fine-tuning on VOC 2007 trainval (54.1% mAP). These results can be compared to the 30.5% mAP obtained by DetectorNet [23], a recent detection system based on the same network architecture, which was trained from scratch on VOC 2012 trainval.

Next, we ask whether ImageNet pre-training is still useful in the PASCAL-DET+DATA setting. Here we see that even though it’s possible to get good performance when training from scratch, pre-training still helps considerably. The final mAP when fine-tuning with the additional detection data is 59.2%, which is 5 percentage points higher than the best result reported in [9] (both without bounding-box regression).
This result suggests that R-CNN performance is not data saturated and that simply adding more detection training data without any other changes may substantially improve results.

We also present results for SUN image classification. Here we observe a similar trend: reasonable performance is achievable when training from scratch; however, initializing from ImageNet and then fine-tuning yields significantly better performance.
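The fine-tuning recipe used for these experiments (drop the pre-training classification layer, append a new randomly initialized one sized for the target task, then continue SGD from a learning rate of 0.001) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Caffe or R-CNN implementation: the parameter dictionary, the layer name "classifier", the Gaussian initialization with standard deviation 0.01, and both function names are hypothetical.

```python
import random


def prepare_for_fine_tuning(pretrained, feature_dim, num_target_classes, seed=0):
    """Keep all pre-trained layers except the classifier, then append a fresh one.

    `pretrained` maps hypothetical layer names to weight matrices (lists of rows).
    The initialization scale (std 0.01) is an assumption made for illustration.
    """
    rng = random.Random(seed)
    params = {name: w for name, w in pretrained.items() if name != "classifier"}
    params["classifier"] = [
        [rng.gauss(0.0, 0.01) for _ in range(feature_dim)]
        for _ in range(num_target_classes)
    ]
    return params


def sgd_step(weights, grads, learning_rate=0.001):
    """One plain SGD update; 0.001 is the starting learning rate quoted in the text."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


# Toy usage: a 1000-way ImageNet classifier is replaced by a 397-way SUN-CLS one.
pretrained = {
    "fc-7": [[0.5, -0.5], [0.25, 0.75]],
    "classifier": [[0.1, 0.2] for _ in range(1000)],
}
params = prepare_for_fine_tuning(pretrained, feature_dim=2, num_target_classes=397)
updated = sgd_step([[1.0, 2.0]], [[10.0, -10.0]])
```

Only the classifier is re-initialized; every other layer starts from its pre-trained weights, which is what distinguishes fine-tuning from training from scratch.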


Fig. 1: PASCAL object class selectivity (y-axis: measure of class selectivity, based on entropy) plotted against the fraction of filters (x-axis), for each layer (conv-1, conv-2, conv-3, conv-4, conv-5, fc-6, fc-7), before fine-tuning (dash-dot line) and after fine-tuning (solid line). A lower value indicates greater class selectivity. Although layers become more discriminative as we go higher up in the network, fine-tuning on limited data (PASCAL-DET) only significantly affects the last two layers (fc-6 and fc-7).
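The entropy-based selectivity measure plotted in Fig. 1 is defined in the Appendix as the area under a filter's entropy curve (AuE). The sketch below follows that definition under explicit assumptions: entropy is measured in bits, the curve is sampled at a caller-supplied list of thresholds and integrated with the trapezoidal rule, and an empty histogram is given zero entropy. The paper does not state these choices, and all function names are illustrative.

```python
import math
from collections import Counter


def class_entropy(scores, labels, tau):
    """Entropy (bits) of the class-label histogram over entries with score >= tau."""
    counts = Counter(lab for s, lab in zip(scores, labels) if s >= tau)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no activations above tau contributes no entropy
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def area_under_entropy_curve(scores, labels, thresholds):
    """Trapezoidal area under the entropy curve sampled at the given thresholds."""
    taus = sorted(thresholds)
    ent = [class_entropy(scores, labels, t) for t in taus]
    return sum(
        0.5 * (ent[i] + ent[i + 1]) * (taus[i + 1] - taus[i])
        for i in range(len(taus) - 1)
    )


def mean_selectivity_of_top_k(aue_values):
    """Sort filters from most to least selective (lowest AuE first) and return
    the average AuE of the first k filters for every k, as plotted in Fig. 1."""
    running, means = 0.0, []
    for k, value in enumerate(sorted(aue_values), start=1):
        running += value
        means.append(running / k)
    return means


# Toy usage: one filter that fires only on cats, one that ignores class.
labels = ["cat", "cat", "dog", "dog"]
selective = area_under_entropy_curve([0.9, 0.8, 0.1, 0.2], labels, [0.0, 0.5, 1.0])
unselective = area_under_entropy_curve([0.9, 0.1, 0.8, 0.2], labels, [0.0, 0.5, 1.0])
curve = mean_selectivity_of_top_k([unselective, selective])
```

The class-selective filter receives the lower AuE, matching the convention that lower values mean greater selectivity.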

We have provided additional evidence that fine-tuning a discriminatively pre-trained network is very effective in terms of task performance. Now we look inside the network to see how fine-tuning changes its parameters.

To do this, we define a way to measure the class selectivity of a set of filters. Intuitively, we use the class-label entropy of a filter given its activations, above a threshold, on a set of images. Since this measure is entropy-based, a low value indicates that a filter is highly class selective, while a large value indicates that a filter fires regardless of class. The precise definition of this measure is given in the Appendix.

In order to summarize the class selectivity for a set of filters, we sort them from the most selective to least selective and plot the average selectivity of the first k filters while sweeping k down the sorted list. Figure 1 shows the class selectivity for the sets of filters in layers 1 to 7 before and after fine-tuning (on VOC 2007 trainval). Selectivity is measured using the ground truth boxes from PASCAL-DET-GT instead of a whole-image classification task to ensure that filter responses are a direct result of the presence of object categories of interest and not correlations with image background.

Figure 1 shows that class selectivity increases from layer 1 to 7 both with and without fine-tuning. It is interesting to note that entropy changes due to fine-tuning are only significant for layers 6 and 7. This observation indicates that fine-tuning only layers 6 and 7 may suffice for achieving good performance when fine-tuning data is limited. We tested this hypothesis on SUN-CLS and PASCAL-DET by comparing the performance of a fine-tuned network (ft) with a network which was fine-tuned by only updating the weights of fc-6 and fc-7 (fc-ft). These results, in Table 2, show that with small amounts of data, fine-tuning amounts to “rewiring” the fully connected layers. However, when more fine-tuning data is available (PASCAL-DET+DATA), there is still substantial benefit from fine-tuning all network parameters.

Table 2: Comparison in performance when fine-tuning the entire network (ft) versus only fine-tuning the fully-connected layers (fc-ft).

dataset            ft           fc-ft
SUN-CLS            52.… ± …     … ± …
PASCAL-DET         …            …
PASCAL-DET+DATA    …            …

There is no single image dataset that fully captures the variation in natural images. This means that all datasets, including ImageNet, are biased in some way. Thus, there is a possibility that pre-training may eventually cause the CNN to overfit and consequently hurt generalization performance [24]. To understand if this happens, in the specific case of ImageNet pre-training, we investigated the effect of pre-training time on generalization performance both with and without fine-tuning. We find that pre-training for longer improves performance. This is surprising, as it shows that fitting more to ImageNet leads to better performance when moving to the other datasets that we evaluated.

We report performance on PASCAL-CLS as a function of pre-training time, without fine-tuning, in Table 3. Notice that more pre-training leads to better performance. By 15k and 50k iterations all layers are close to 80% and 90% of their final performance (5k iterations is only ∼ […]

Table 3: Performance variation (% mAP) on PASCAL-CLS as a function of pre-training iterations on ImageNet. The error bars for all columns are similar to the one reported in the 305k column.

layer     5k     15k    25k    35k    50k    95k    105k   195k   205k   305k
conv-1    23.0   24.3   24.4   24.5   24.3   24.8   24.7   24.4   24.…   … ± …
…

(a) 5k Iterations (b) 15k Iterations (c) 305k Iterations

Fig. 2: Evolution of conv-1 filters with time. After just 15k iterations, these filters closely resemble their converged state.

Table 4: Performance variation on SUN-CLS and PASCAL-DET using features from a CNN pre-trained for different numbers of iterations and fine-tuned for a fixed number of iterations (40k for SUN-CLS and 70k for PASCAL-DET).

dataset       50k      105k     205k     305k
SUN-CLS       … ± …    … ± …    … ± …    … ± …
PASCAL-DET    …        …        …        …

Further, notice from Table 3 that conv-1 trains first and the higher the layer is the more time it takes to converge. This suggests that a CNN, trained with backpropagation, converges in a layer-by-layer fashion. Table 4 shows the interaction between varied amounts of pre-training time and fine-tuning on SUN-CLS and PASCAL-DET. Here we also see that more pre-training prior to fine-tuning leads to better performance.

Footnote: A network pre-trained from scratch, which was different from the one used in Section 3.1, was used to obtain these results. The difference in performance is not significant.

Neuroscientists have conjectured that cells in the human brain which only respond to very specific and complex visual stimuli (such as the face of one’s grandmother) are involved in object recognition. These neurons are often referred to as grandmother cells (GMC) [1,18]. Proponents of artificial neural networks have shown great interest in reporting the presence of GMC-like filters for specific object classes in their networks (see, for example, the cat filter reported in [15]).

The notion of GMC-like features is also related to standard feature encodings for image classification. Prior to the work of [13], the dominant approaches for image and scene classification were based on either representing images as a bag of local descriptors (BoW), such as SIFT (e.g., [14]), or by first finding a set of mid-level patches [12,22] and then encoding images in terms of them. The problem of finding good mid-level patches is often posed as a search for a set of high-recall discriminative templates. In this sense, mid-level patch discovery is the search for a set of GMC templates. The low-level BoW representation, in contrast, is a distributed code in the sense that a single feature by itself is not discriminative, but a group of features taken together is. This makes it interesting to investigate the nature of mid-level CNN features such as conv-5.

For understanding these feature representations in CNNs, [21,28] recently presented methods for finding locally optimal visual inputs for individual filters. However, these methods only find the best, or in some cases top-k, visual inputs that activate a filter, but do not characterize the distribution of images that cause an individual filter to fire above a certain threshold. For example, if it is found that the top-10 visual inputs for a particular filter are cats, it remains unclear what the response of the filter is to other images of cats. Thus, it is not possible to make claims about the presence of GMC-like filters for cats based on such analysis. A GMC filter for the cat class is one that fires strongly on all cats and nothing else. This criterion can be expressed as a filter that has high precision and high recall. That is, a GMC filter for class C is a filter that has a high average precision (AP) when tasked with classifying inputs from class C versus inputs from all other classes.

First, we address the question of finding GMC filters by computing the AP of individual filters (Section 4.1). Next, we measure how distributed the feature representations are (Section 4.2). For both experiments we use features from layer conv-5, which consists of responses of 256 filters in a 6 × 6 spatial map.

For each filter, its AP value is calculated for classifying images using class labels and filter responses to object bounding boxes from PASCAL-DET-GT. Then, for each class we sorted filters in decreasing order of their APs. If GMC filters for this class exist, they should be the top ranked filters in this sorted list. The precision-recall curves for the top-five conv-5 filters are shown in Figure 3. We find that GMC-like filters exist for only a few classes, such as bicycle, person, cars, and cats.
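The per-filter AP ranking described above can be sketched as follows. Two simplifications are assumptions of this sketch rather than details from the paper: each filter is taken to produce one scalar response per bounding box (the text does not say how the conv-5 spatial map is reduced to a score), and AP is computed as the non-interpolated mean of precision at each positive, which may differ from the interpolated AP used by the PASCAL VOC evaluation. The function names are illustrative.

```python
def average_precision(scores, is_positive):
    """Non-interpolated AP: mean precision at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    num_pos = sum(1 for p in is_positive if p)
    if num_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            hits += 1
            total += hits / rank
    return total / num_pos


def rank_filters_by_ap(responses, labels, target_class):
    """Return (AP, filter index) pairs sorted by decreasing AP.

    responses[f][n] is the scalar response of filter f on example n.
    """
    is_pos = [lab == target_class for lab in labels]
    scored = [(average_precision(r, is_pos), f) for f, r in enumerate(responses)]
    return sorted(scored, reverse=True)


# Toy usage: filter 0 ranks every cat above every dog, filter 1 does not.
labels = ["cat", "cat", "dog", "dog"]
responses = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.9, 0.8, 0.2],
]
ranking = rank_filters_by_ap(responses, labels, "cat")
```

A grandmother-cell-like filter for a class would appear at the top of this ranking with an AP close to 1, because it must fire on nearly all instances of the class (high recall) and on little else (high precision).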

In addition to visualizing the AP curves of individual filters, we measured the number of filters required to recognize objects of a particular class. Feature selection was performed to construct nested subsets of filters, ranging from a single filter to all filters, using the following greedy strategy. First, separate linear SVMs were trained to classify object bounding boxes from PASCAL-DET-GT using conv-5 responses. For a given class, the 256 dimensions of the learnt weight vector (w) are in direct correspondence with the 256 conv-5 filters. We used the magnitude of the i-th dimension of w to rank the importance of the i-th conv-5 filter for discriminating instances of this class. Next, all filters were sorted using these magnitude values. Each subset of filters was constructed by taking the top-k filters from this list. For each subset, a linear SVM was trained using only the responses of filters in that subset for classifying the class under consideration.

The variation in performance with the number of filters is shown in Figure 4. Table 5 lists the number of filters required to achieve 50% and 90% of the complete performance. For classes such as persons, cars, and cats relatively few filters are required, but for most classes around 30 to 40 filters are required to achieve at least 90% of the full performance.

Fig. 3: The precision-recall curves (x-axis: recall, y-axis: precision; one panel per PASCAL class) for the top five (based on AP) conv-5 filter responses on PASCAL-DET-GT. Curves in red and blue indicate AP for fine-tuned and pre-trained networks, respectively. The dashed black line is the performance of a random filter. For most classes, precision drops significantly even at modest recall values. There are GMC filters for classes such as bicycle, person, car, cat.

Table 5: Number of filters required to achieve 50% or 90% of the complete performance on PASCAL-DET-GT using a CNN pre-trained on ImageNet and fine-tuned for PASCAL-DET using conv-5 features.

class     pre-train 50%   fine-tune 50%   pre-train 90%   fine-tune 90%
aero      15              10              40              35
bike      3               1               35              30
bird      15              20              80              80
boat      15              15              80              80
bottle    10              5               35              30
bus       10              5               40              35
car       3               2               30              25
cat       2               2               20              20
chair     5               3               35              35
cow       15              10              100             50
table     15              15              80              80
dog       2               3               30              35
horse     10              15              45              30
mbike     3               10              40              40
person    1               1               15              10
plant     10              5               45              35
sheep     20              15              50              40
sofa      25              15              100             80
train     10              5               45              40
tv        2               2               25              20
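The greedy selection strategy described above (rank filters by the magnitude of their SVM weight, form nested top-k subsets, then find the smallest subset reaching a given fraction of the complete performance) can be sketched as below. Training the linear SVMs is outside the scope of this sketch; the weight vector and the per-subset performance numbers are hypothetical inputs, and the function names are illustrative.

```python
def nested_filter_subsets(weights, sizes):
    """Rank filters by |w_i| (largest first) and return the top-k index list per k."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return {k: ranked[:k] for k in sizes}


def filters_needed(performance_by_k, fraction):
    """Smallest subset size whose performance reaches `fraction` of the complete
    performance, where complete performance is that of the largest subset."""
    largest = max(performance_by_k)
    complete = performance_by_k[largest]
    for k in sorted(performance_by_k):
        if performance_by_k[k] >= fraction * complete:
            return k
    return largest


# Toy usage with a hypothetical 4-filter weight vector and made-up AP values.
weights = [0.1, -2.0, 0.5, 1.5]
subsets = nested_filter_subsets(weights, sizes=[1, 2, 4])
performance_by_k = {1: 0.35, 2: 0.50, 4: 0.60}  # hypothetical AP per subset size
k50 = filters_needed(performance_by_k, 0.5)
k90 = filters_needed(performance_by_k, 0.9)
```

Because the subsets are nested, each larger subset only adds filters to the previous one, so the entries of Table 5 can be read as the point at which this curve first crosses 50% or 90% of its final value.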
The filter counts in Table 5 indicate that the conv-5 feature representation is distributed and there are GMC-like filters for only a few classes. Results using layer fc-7 are presented in the supplementary material. We also find that after fine-tuning, slightly fewer filters are required to achieve performance levels similar to a pre-trained network.

Footnote: We used values of k ∈ { … }.

Fig. 4: The fraction of complete performance on PASCAL-DET-GT achieved by conv-5 filter subsets of different sizes (x-axis: number of conv-5 filters, y-axis: fraction of complete performance; one panel per PASCAL class). Complete performance is the AP computed by considering responses of all the filters. Notice that for a few classes such as person and bicycle only a few filters are required, but for most classes substantially more filters are needed, indicating a distributed code.

Next, we estimated the extent of overlap between the filters used for discriminating between different classes. For each class i, we selected the 50 most discriminative filters (out of 256) and stored the selected filter indices in the set S_i. The extent of overlap between class i and j was evaluated by |S_i ∩ S_j| / N, where N = |S_i| = |S_j| = 50. The results are visualized in Figure 5. It can be seen that different classes use different subsets of conv-5 filters and there is little overlap between classes. This further indicates that intermediate representations in the CNN are distributed.

The convolutional layers preserve the coarse spatial layout of the network’s input. By layer conv-5, the original 227 × 227 input image has been progressively downsampled to 6 × 6. This feature map is also sparse due to the max(x, 0) […] experimentally analyze the role of filter response magnitude and spatial location by looking at ablation studies on classification and detection tasks.

Fig. 5: The set overlap between the 50 most discriminative conv-5 filters for each class determined using PASCAL-DET-GT. Entry (i, j) of the matrix is the fraction of top-50 filters class i has in common with class j (Section 4.2). Chance is 0.195. There is little overlap, but related classes are more likely to share filters.

Fig. 6: Illustrations of ablations of feature activation spatial and magnitude information (panels (a) to (d)). See Sections 5.1 and 5.2 for details.
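The set-overlap statistic of Section 4.2, |S_i ∩ S_j| / N, and the chance level quoted in the caption of Fig. 5 can be checked with a short sketch. The chance value here is computed as the expected overlap of two independently and uniformly chosen subsets of size N out of 256 filters, which is N / 256; that interpretation of "chance" is an assumption of this sketch, as are the function names.

```python
def overlap(set_i, set_j):
    """Fraction of filters shared by two equally sized sets: |S_i & S_j| / N."""
    if len(set_i) != len(set_j):
        raise ValueError("both sets must contain the same number of filters")
    return len(set_i & set_j) / len(set_i)


def chance_overlap(n_selected=50, n_filters=256):
    """Expected overlap of two random size-N subsets of the filters.

    Each of the N filters in S_i lands in S_j with probability N / n_filters,
    so the expected shared count is N * N / n_filters and the fraction is
    N / n_filters.
    """
    return n_selected / n_filters


# Toy usage with hypothetical filter index sets for two classes.
s_cat = {3, 17, 42, 101}
s_dog = {17, 42, 200, 230}
shared = overlap(s_cat, s_dog)
chance = chance_overlap()
```

With N = 50 and 256 conv-5 filters this gives roughly 0.195, which agrees with the chance level stated for Fig. 5.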

Table 6: Percentage non-zeros (sparsity) in filter responses of CNN.

conv-1      conv-2   conv-3   conv-4   conv-5   fc-6     fc-7
87.… ± …    … ± …    … ± …    … ± …    … ± …    … ± …    … ± …

We can assess the importance of magnitude by setting each filter response x to 1 if x > 0 […]

Now we remove spatial information from filter responses while retaining information about their magnitudes. We consider two methods for ablating spatial information from features computed by the convolutional layers (the fully-connected layers do not contain explicit spatial information).

The first method (“sp-max”) simply collapses the p × p spatial map into a single value per feature channel by max pooling. The second method (“sp-shuffle”) retains the original distribution of feature activation values, but scrambles spatial correlations between columns of feature channels. To perform sp-shuffle, we permute the spatial locations in the p × p spatial map. This permutation is performed independently for each network input (i.e., different inputs undergo different permutations). Columns of filter responses in the same location move together, which preserves correlations between features within each (shuffled) spatial location. These transformations are illustrated in Figure 6.

Table 7: Effect of location and magnitude feature ablations on PASCAL-CLS.

layer     no ablation (mAP)   binarize (mAP)   sp-shuffle (mAP)   sp-max (mAP)
conv-1    25.… ± …            … ± …            … ± …              … ± …
…

Table 8: Effect of location and magnitude feature ablations on PASCAL-DET.

layer     no ablation (mAP)   binarize (mAP)   sp-max (mAP)
conv-5    47.6                45.7             25.4

For image classification, damaging spatial information leads to a large difference in performance between original and spatially-ablated conv-1 features, but with a gradually decreasing difference for higher layers (Table 7). In fact, the performance of conv-5 after sp-max is close to the original performance. This indicates that a lot of information important for classification is encoded in the activation of the filters and not necessarily in the spatial pattern of their activations. Note that this observation is not an artifact of the small number of classes in PASCAL-CLS. On ImageNet validation data, conv-5 features and conv-5 after sp-max result in accuracies of 43.2 and 41.5, respectively. However, for detection sp-max leads to a large drop in performance. This may not be surprising since detection requires spatial information for precise localization.
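The three ablations discussed in this section (binarization at a threshold of 0, sp-max, and sp-shuffle) can be sketched on a small feature map stored as nested lists indexed as fmap[channel][row][column]. This layout, the function names, and the use of Python's `random.Random` for the permutation are assumptions made for illustration; the paper's own implementation is not shown in the text.

```python
import random


def binarize(fmap):
    """Magnitude ablation: 1 where the response is positive, 0 otherwise."""
    return [[[1 if v > 0 else 0 for v in row] for row in ch] for ch in fmap]


def sp_max(fmap):
    """Spatial ablation: collapse each p x p channel map to its maximum value."""
    return [max(max(row) for row in ch) for ch in fmap]


def sp_shuffle(fmap, rng):
    """Spatial ablation: permute spatial locations, moving whole columns of
    channel responses together so within-location correlations are preserved.
    Calling this once per input gives each input its own permutation."""
    rows, cols = len(fmap[0]), len(fmap[0][0])
    locations = [(y, x) for y in range(rows) for x in range(cols)]
    sources = locations[:]
    rng.shuffle(sources)
    out = [[[0.0] * cols for _ in range(rows)] for _ in fmap]
    for (y, x), (sy, sx) in zip(locations, sources):
        for c, ch in enumerate(fmap):
            out[c][y][x] = ch[sy][sx]
    return out


# Toy usage: 2 channels on a 2 x 2 spatial map.
fmap = [
    [[0.0, 2.0], [0.5, 0.0]],
    [[1.0, 0.0], [0.0, 3.0]],
]
binary = binarize(fmap)
pooled = sp_max(fmap)
shuffled = sp_shuffle(fmap, random.Random(0))
```

Binarization keeps where filters fire but discards how strongly, sp-max keeps how strongly each filter fires but discards where, and sp-shuffle keeps the full set of per-location response columns while destroying their arrangement.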

To help researchers better understand CNNs, we investigated pre-training and fine-tuning behavior on three classification and detection datasets. We found that the large CNN used in this work can be trained from scratch using a surprisingly modest amount of data. But, importantly, pre-training significantly improves performance and pre-training for longer is better. We also found that some of the learnt CNN features are grandmother-cell-like, but for the most part they form a distributed code. This supports the recent set of empirical results showing that these features generalize well to other datasets and tasks.

Acknowledgments.

This work was supported by ONR MURI N000141010933. Pulkit Agrawal is partially supported by a Fulbright Science and Technology fellowship. We thank NVIDIA for GPU donations. We thank Bharath Hariharan, Saurabh Gupta and João Carreira for helpful suggestions.

Citing this paper.

Please cite the paper as:

@inproceedings{agrawal14analyzing,
  Author = {Pulkit Agrawal and Ross Girshick and Jitendra Malik},
  Title = {Analyzing the Performance of Multilayer Neural Networks for Object Recognition},
  Booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  Year = {2014}
}

Appendix: estimating a filter’s discriminative capacity

To measure the discriminative capacity of a filter, we collect filter responses from a set of N images. Each image, when passed through the CNN, produces a p × p heat map of scores for each filter in a given layer (e.g., p = 6 for a conv-5 filter and p = 1 for an fc-6 filter). This heat map is vectorized into a vector of scores of length p². With each element of this vector we associate the image’s class label. Thus, for every image we have a score vector and a label vector of length p² each. Next, the score vectors from all N images are concatenated into an Np²-length score vector. The same is done for the label vectors.

Now, for a given score threshold τ, we define the class entropy of a filter to be the entropy of the normalized histogram of class labels that have an associated score ≥ τ. A low class entropy means that at scores above τ, the filter is very class selective. As this threshold changes, the class entropy traces out a curve which we call the entropy curve. The area under the entropy curve (AuE) summarizes the class entropy at all thresholds and is used as a measure of discriminative capacity of the filter. The lower the AuE value, the more class selective the filter is. The AuE values are used to sort filters in Section 1.1.

References

1. Barlow, H.: Single units and sensations: A neuron doctrine for perceptual psychology? In: Perception (1972)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (Oct 2001), http://dx.doi.org/10.1023/A:1010933404324
3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. pp. 886–893 (2005)
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009)
5. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
6. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2) (2010)
7. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4), 193–202 (1980)
8. Geman, D., Amit, Y., Wilder, K.: Joint induction of shape features and tree classifiers (1997)
9. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
10. Gong, Y., Lazebnik, S.: Iterative quantization: A procrustean approach to learning binary codes. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 817–824. IEEE (2011)
11. Jia, Y.: Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/ (2013)
12. Juneja, M., Vedaldi, A., Jawahar, C.V., Zisserman, A.: Blocks that shout: Distinctive parts for scene classification. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2013)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
14. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. vol. 2, pp. 2169–2178. IEEE (2006)
15. Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., Ng, A.: Building high-level features using large scale unsupervised learning. In: International Conference in Machine Learning (2012)
16. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4) (1989)
17. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
18. Quiroga, R.Q., Reddy, L., Kreiman, G., Koch, C., Fried, I.: Invariant visual representation by single neurons in the human brain. Nature 435(7045), 1102–7 (2005)
19. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. CoRR abs/1403.6382 (2014)
20. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Parallel Distributed Processing 1, 318–362 (1986)
21. Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
22. Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: European Conference on Computer Vision (2012), http://arxiv.org/abs/1205.3137
23. Szegedy, C., Toshev, A., Erhan, D.: Deep neural networks for object detection. In: NIPS (2013)
24. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1521–1528. IEEE (2011)
25. Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recognition. IJCV (2013)
26. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Advances in Neural Information Processing Systems. pp. 1753–1760 (2009)
27. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. CVPR pp. 3485–3492 (2010)
28. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. CoRR abs/1311.2901 (2013)

Supplementary Material

In the main paper we provided evidence that fine-tuning a discriminatively pre-trained network is very effective in terms of task performance. We also provided insights into how fine-tuning changes its parameters. Here we describe and discuss in greater detail some metrics for determining the effect of fine-tuning.

The entropy of a filter is calculated to measure its discriminative capacity. The use of entropy is motivated by works such as [2], [8]. To compute the entropy of a filter, we start by collecting filter responses from a set of N images. Each image, when passed through the CNN, produces a p × p heat map of scores for each filter in a given layer (e.g., p = 6 for a conv-5 filter and p = 1 for an fc-6 filter). This heat map is vectorized (x(:) in MATLAB) into a vector of scores of length p^2. With each element of this vector we associate the class label of the image. Thus, for every image we have a score vector and a label vector of length p^2 each. Next, the score vectors from all N images are concatenated into a score vector of length N p^2. The same is done for the label vectors. We define the entropy of a filter in the following three ways.

Label Entropy.

For a given score threshold τ, we define the class entropy of a filter to be the entropy of the normalized histogram of class labels that have an associated score ≥ τ. A low class entropy means that at scores above τ, the filter is very class selective. As this threshold changes, the class entropy traces out a curve which we call the entropy curve. The area under the entropy curve (AuE) summarizes the class entropy at all thresholds and is used as a measure of the discriminative capacity of the filter. The lower the AuE value, the more class selective the filter is.

Weighted Label Entropy.

While computing the class label histogram, instead of the label count we use the sum of the scores associated with the labels to construct the histogram. (Note: Since we are using outputs of the rectified linear units, all scores are ≥ 0.)

Spatial-Max (spMax) Label Entropy.

Instead of vectorizing the heat map, the filter response obtained by max pooling the p × p filter output is associated with the class label of each image. Thus, for every image we have a score vector and a class label vector of length 1 each. Next, the score vectors from all N images are concatenated into an N-length score vector. We then proceed in the same way as for Label Entropy to compute the AuE of each filter.

The discriminative capacity of a layer is computed as follows. The filters are sorted in increasing order of their AuE. Next, the cumulative sum of AuE values in this sorted list is calculated. The resulting list of cumulative AuEs is referred to as CAuE. Note that the i-th entry of the CAuE list is the sum of the AuE scores of the top i most discriminative filters. The difference in the value of the i-th entry before and after fine-tuning measures the change in class selectivity of the top i most discriminative filters due to fine-tuning. For comparing results across different layers, the CAuE values are normalized to account for the different numbers of filters in each layer. Specifically, the i-th entry of the CAuE list is divided by i. This normalized CAuE is called the Mean Cumulative Area Under the Entropy Curve (MCAuE). A lower value of MCAuE indicates that the individual filters of the layer are more discriminative.

Table 9: Percentage decrease in MCAuE as a result of fine-tuning when only the 0.1, 0.25, 0.50 and 1.00 fractions of all the filters were used for computing MCAuE. A lower MCAuE indicates that the filters in a layer are more selective/class specific. The 0.1 fraction includes the top 10% most selective filters, and 0.25 the top 25%. Consequently, comparing MCAuE at different fractions of filters gives a better sense of how selective the "most" selective filters have become. A negative value in the table indicates an increase in entropy. Note that for all the metrics the maximum decrease in entropy takes place while moving from layer 5 to layer 7. Also note that for fc-6 and fc-7 the values for Label Entropy and spMax Label Entropy are the same, as these layers have spatial maps of size 1.

[Table 9 columns: Layer; Label Entropy at fractions 0.1, 0.25, 0.5, 1.0; Weighted Label Entropy at fractions 0.1, 0.25, 0.5, 1.0; spMax Label Entropy at fractions 0.1, 0.25, 0.5, 1.0. One row per layer, starting with conv-1. The numeric entries are garbled in the extracted text and are not reproduced here.]
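The three per-filter entropy measures defined above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `class_entropy` and `filter_aue`, the `mode` argument, the explicit threshold grid, the trapezoidal integration, and the convention that an empty histogram has entropy 0 are all assumptions made for illustration, since the text does not specify how thresholds are chosen or how the area is integrated.

```python
import numpy as np


def class_entropy(hist):
    """Entropy (in nats) of a non-negative histogram after normalisation."""
    total = hist.sum()
    if total <= 0:  # assumption: no score reached the threshold -> entropy 0
        return 0.0
    p = hist[hist > 0] / total
    return float(-(p * np.log(p)).sum())


def filter_aue(heat_maps, image_labels, num_classes, thresholds, mode="label"):
    """Area under the entropy curve (AuE) for one filter.

    heat_maps:    (N, p, p) responses of the filter on N images.
    image_labels: (N,) integer class label of each image.
    thresholds:   increasing score thresholds tau (hypothetical grid).
    mode:         "label", "weighted" or "spmax".
    """
    heat_maps = np.asarray(heat_maps, dtype=float)
    image_labels = np.asarray(image_labels, dtype=int)
    thresholds = np.asarray(thresholds, dtype=float)
    n = heat_maps.shape[0]
    per_image = heat_maps.reshape(n, -1)  # (N, p*p)

    if mode == "spmax":
        # One score per image: spatial max over the p x p map.
        scores = per_image.max(axis=1)
        labels = image_labels
    else:
        # N * p^2 scores, each paired with the label of its image.
        scores = per_image.reshape(-1)
        labels = np.repeat(image_labels, per_image.shape[1])

    # Weighted Label Entropy sums scores instead of counting labels.
    weights = scores if mode == "weighted" else np.ones_like(scores)

    curve = np.empty(thresholds.size)
    for i, tau in enumerate(thresholds):
        keep = scores >= tau
        hist = np.bincount(labels[keep], weights=weights[keep],
                           minlength=num_classes)
        curve[i] = class_entropy(hist)

    # Trapezoidal area under the entropy curve.
    return float(np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(thresholds)))
```

With p = 1, as for fc-6 and fc-7, the "label" and "spmax" modes operate on identical score vectors, which matches the remark in the Table 9 caption that the two measures coincide for those layers.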

[Figure 1 plots: x-axis, Fraction of Filters; y-axis, Measure of Class Selectivity (based on Entropy); curves for conv-1, conv-2, conv-3, conv-4, conv-5, fc-6 and fc-7. Panels: (a) Weighted Label Entropy, (b) Spatial-Max Label Entropy.]

Fig. 1: PASCAL object class selectivity (measured as MCAuE) plotted against the fraction of filters, for each layer, before fine-tuning (dash-dot line) and after fine-tuning (solid line). A lower value indicates greater class selectivity. (a) and (b) show MCAuE computed using the Weighted Label Entropy and Spatial-Max Label Entropy methods, respectively.

The MCAuE measure of layer selectivity before and after fine-tuning is shown in Figure 1 for Weighted Label Entropy and Spatial-Max Label Entropy. Results for the Label Entropy method are presented in the main paper. A quantitative measure of the change in entropy due to fine-tuning, computed as a percentage change, is defined as follows:

$$\text{Percent Decrease} = 100 \times \frac{\mathrm{MCAuE}_{\mathrm{pre}} - \mathrm{MCAuE}_{\mathrm{fine}}}{\mathrm{MCAuE}_{\mathrm{pre}}} \qquad (1)$$

where $\mathrm{MCAuE}_{\mathrm{fine}}$ is for the fine-tuned network and $\mathrm{MCAuE}_{\mathrm{pre}}$ is for the network trained on ImageNet only. The results are summarized in Table 9.

As measured by Label Entropy, layers 1 to 5 undergo negligible change in their discriminative capacity, whereas layers 6 and 7 become substantially more discriminative. In contrast, the Weighted Label Entropy and Spatial-Max Label Entropy measures indicate that only layers 1 to 4 undergo minimal changes, while the other layers become substantially more discriminative. These results confirm the intuition that the lower layers of the CNN compute more generic features, whereas fine-tuning mostly affects the top layers. Also note that these results hold for fine-tuning with the moderate amount of training data available as part of PASCAL-DET. It is yet to be determined how the lower convolutional layers would change due to fine-tuning when more training data is available.
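As an illustration of the layer-level measure and of Eq. (1), the following sketch computes MCAuE from a list of per-filter AuE values and the percent decrease at a given fraction of filters. The function names `mcaue` and `percent_decrease` and the rule used to turn a fraction into a filter count are assumptions for illustration; the sketch also assumes that the pre-trained and fine-tuned AuE lists are each sorted independently, which is one reading of the description.

```python
import numpy as np


def mcaue(aue_values):
    """Mean Cumulative AuE: entry i-1 is the mean AuE of the top i filters."""
    sorted_aue = np.sort(np.asarray(aue_values, dtype=float))  # most selective first
    cumulative = np.cumsum(sorted_aue)                         # CAuE
    return cumulative / np.arange(1, sorted_aue.size + 1)      # divide i-th entry by i


def percent_decrease(aue_pre, aue_fine, fraction=1.0):
    """Eq. (1) evaluated on the top `fraction` of filters of one layer.

    aue_pre:  per-filter AuE of the network trained on ImageNet only.
    aue_fine: per-filter AuE of the fine-tuned network.
    """
    # Assumption: the filter count is the fraction rounded down, at least 1.
    k = max(1, int(fraction * len(aue_pre)))
    pre = mcaue(aue_pre)[k - 1]
    fine = mcaue(aue_fine)[k - 1]
    return 100.0 * (pre - fine) / pre
```

A positive return value means the selected filters became more class selective after fine-tuning; a negative value corresponds to an increase in entropy.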

In the main paper we studied the nature of the mid-level CNN representations given by conv-5. Here, we address the same question for layer fc-7, which is the last layer of the CNN and whose features lead to the best performance. The number of filters required to achieve the same performance as all the filters taken together is presented in Figure 2. Table 10 reports the number of filters required per class to obtain 50% and 90% of the complete performance. It can be seen that, like conv-5, the feature representations in fc-7 are also distributed for a large number of classes. It is interesting to note that for most classes 50% of the performance can be reached using a single filter, but reaching 90% of the performance requires many more filters.

[Figure: y-axis, Fraction of complete performance; x-axis, Number of fc-7 filters; classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor.]
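To make the 50% and 90% entries concrete, here is a small sketch of how such a count could be read off a per-class performance curve. It assumes the curve is already available, with entry k-1 holding the performance obtained with k filters and the last entry holding the performance with all filters; how the filters are chosen at each step is described in the main paper and is not reproduced here. The function name `filters_needed` and the example numbers are hypothetical.

```python
import numpy as np


def filters_needed(performance_by_count, fraction):
    """Smallest number of filters reaching `fraction` of complete performance.

    performance_by_count[k - 1] is the performance using k filters;
    the last entry is the performance using all filters.
    """
    perf = np.asarray(performance_by_count, dtype=float)
    target = fraction * perf[-1]
    reached = np.nonzero(perf >= target)[0]  # indices meeting the target
    return int(reached[0]) + 1               # convert index to filter count


# Made-up curve for one class, for illustration only.
curve = [0.35, 0.40, 0.45, 0.55, 0.58, 0.60]
n50 = filters_needed(curve, 0.5)
n90 = filters_needed(curve, 0.9)
```

For this made-up curve a single filter already reaches half of the complete performance, while four filters are needed to reach 90%, which is the qualitative pattern described for fc-7.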