Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification
Ruobing Wu†  Baoyuan Wang‡  Wenping Wang†  Yizhou Yu†
†The University of Hong Kong  ‡Microsoft Technology and Research
Abstract
Recent work on scene classification still makes use of generic CNN features in a rudimentary manner. In this ICCV 2015 paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects, and apply a pre-trained CNN to extract generic deep features from these patches. Then we perform both unsupervised and weakly supervised learning to screen these patches and discover discriminative ones representing category-specific objects and parts. We further apply discriminative clustering enhanced with local CNN fine-tuning to aggregate similar objects and parts into groups, called meta objects. A scene image representation is constructed by pooling the feature response maps of all the learned meta objects at multiple spatial scales. We have confirmed that the scene image representation obtained using this new pipeline is capable of delivering state-of-the-art performance on two popular scene benchmark datasets, MIT Indoor 67 [22] and SUN397 [31].
1. Introduction
Deep convolutional neural networks (CNNs) have gained tremendous attention recently due to their great success in boosting the performance of image classification [14, 19], object detection [7, 26], action recognition [12] and many other visual computing tasks [23, 21]. In the context of scene classification, although a series of state-of-the-art results on popular benchmark datasets (MIT Indoor 67 [22], SUN397 [31]) have been achieved, CNN features are still used in a rudimentary manner. For example, recent work in [33] simply trains the classical Alex's net [14] on a scene-centric dataset ("Places") and directly extracts holistic CNN features from entire images. (This work was partially completed when the first author was an intern at Microsoft Research.)
The architecture of CNNs suggests that they might not be best suited for classifying images, including scene images, where local features follow a complex distribution. The reason is that spatial aggregation performed by pooling layers in a CNN is too simple, and does not retain much information about local feature distributions. When critical inference happens in the fully connected layers near the top of the CNN, aggregated features fed into these layers are in fact global features that neglect local feature distributions. It has been shown in [8] that in addition to the entire image, it is consistently better to extract CNN features from multiscale local patches arranged in regular grids.

In order to build a discriminative representation based on deep CNN features for scene image classification, we need to address two technical issues: (1) Objects within scene images could exhibit dramatically different appearances, shapes, and aspect ratios. To detect diverse local objects, one could in theory add many perturbations to the input image by warping and cropping at various aspect ratios, locations, and scales, and then feed all of them to the CNN. This is, however, not feasible in practice; (2) To distinguish one scene category from another, it is much desired to harvest discriminative and representative category-specific objects and object parts. For example, to tell a "city street" from a "highway", one needs to identify objects that can only belong to a "city street" but not a "highway" scene. Pandey and Lazebnik [20] adopt the standard DPM to adaptively infer potential object parts. It is however unclear how to initialize the parts and how to efficiently learn them using CNN features.

In this paper, we present a novel pipeline built upon deep CNN features for harvesting discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects [3]. We apply a pre-trained CNN to extract generic deep features from these patches. Then, for each scene category, we train a one-class SVM on all the patches generated from the images for this class as a discriminative classifier [25], which heavily prunes outliers and other non-representative patches. The remaining patches correspond to the objects and parts that frequently occur in the images for this scene category. To further harvest the most discriminative patches, we apply a non-parametric weakly supervised learning model to screen these remaining patches according to their discriminative power across different scene categories. Instead of directly using the chosen category-specific objects and parts, we further perform discriminative clustering to aggregate similar objects and parts into groups. Each resulting group is called a "meta object". Finally, a scene image representation is obtained by pooling the feature response maps of all the learned meta objects at multiple spatial scales to retain more information about their local spatial distribution. Locally aggregated CNN features are more discriminative than those global features fed into the fully connected layers in a single CNN.

There exists much recent work advocating the concept of middle-level objects and object parts for efficient scene image classification [16, 20, 27, 11, 4, 30]. Among them, the methods proposed in [4, 11] are most relevant. Nonetheless, there exist major differences between our method and theirs.
First, we use multiscale object proposals instead of grid-based sampling with multiple patch sizes, thus we can intrinsically obtain better discriminative object candidates. Second, we aggregate our meta objects through deep CNN features while previous methods primarily rely on low-level features (e.g., HOG). As demonstrated through experiments, deep features are more semantically meaningful when used for characterizing middle-level objects. Last but not least, there exist significantly different components along individual pipelines. For instance, we adopt unsupervised learning to prune outliers while Juneja et al. [11] train a large number of exemplar-SVMs, which is more computationally intensive. Furthermore, our discriminative clustering component also plays an important role in aggregating meta objects.

In summary, this paper has the following contributions: (1) We propose a novel pipeline for scene classification that is built on top of deep CNN features. The advantages of this pipeline are orthogonal to any category-independent region proposal method [29, 34, 3] and middle-level part learning algorithm [4, 20, 11]. (2) We propose a simple yet efficient method that integrates unsupervised and weakly supervised learning for harvesting discriminative and representative category-specific patches, which we further aggregate into a compact set of groups, called meta objects, via discriminative clustering. (3) Instead of global fine-tuning, we locally fine-tune the CNN using the meta objects discovered from the target dataset. We have confirmed through experiments that the scene image representation obtained using this pipeline is capable of delivering state-of-the-art performance on two popular scene benchmark datasets, MIT Indoor 67 [22] and SUN397 [31].
2. A New Pipeline for Scene Classification
In this section, we present the main components of our proposed new pipeline for scene classification. As illustrated in Figure 1, our pipeline is built on top of a pre-trained deep convolutional neural network, which is regarded as a generic feature extractor for image patches. In the context of scene classification, instead of directly transferring these features [33] or performing global fine-tuning on whole images using the ground-truth labels [6, 7], we perform local fine-tuning on discriminative yet representative local patches that correspond to visual objects or their parts. Since scene classification datasets provide neither bounding boxes nor segment masks for our desired local patches, we first adapt the latest algorithms to generate image regions potentially containing objects, expecting a high recall of all informative ones (Section 2.1). Then we apply an unsupervised learning technique, one-class SVMs, to prune those proposed regions that do not appear frequently in the images for a specific scene class. This is followed by a weakly supervised learning step to screen the remaining region proposals and discard those patches that are unlikely to be useful for differentiating a specific scene category from other categories (Section 2.2). To further improve the generality and representativeness of the remaining patches, we perform discriminative clustering to aggregate them into a set of meta objects (Section 2.3). Finally, our scene image representation is built on top of the probability distribution of the mined meta objects (Section 2.4).
Region Proposals

As discussed in Section 1, for arbitrary objects with varying size and aspect ratio, the traditional sliding-window based object detection paradigm requires multiresolution scanning using windows with different aspect ratios. For example, in pedestrian detection [5], at least two windows should be used to search for the full body and upper body of pedestrians. Recently, an alternative paradigm has been developed that performs perceptual grouping with the goal of proposing a limited number of high-quality regions that likely enclose objects. Tasks including object detection [7] and recognition [9] can then be built on top of these proposed regions only, without considering other non-object regions. There is a large body of literature along this new paradigm for efficiently generating region proposals with a high recall, including selective search [29], edge boxes [34], and multiscale combinatorial grouping (MCG) [3]. We empirically choose MCG as the first component in our pipeline for generating high-quality region proposals, but one can use other methods as well. Figure 3 shows a few examples of regions generated by MCG. We also use region proposals from hierarchical image segmentation [2] at the same time (see Section 2.5).

Figure 1. Flowchart of our pipeline. From left to right: (a) Training scene images are processed by MCG [3] and we obtain top-ranked region proposals (yellow boxes). (b) Patches are screened by our non-parametric scheme and only discriminative patches remain. (c) Discriminative clustering is performed to build meta objects. Three meta objects are shown here: 'computer screen', 'keyboard', 'computer chair' (from top to bottom). Note that these names are for demonstration only, not labels applicable to our pipeline. (d) Local fine-tuning is performed on the Hybrid CNN [33], which decides which meta object a testing region belongs to. (e) We train an image classifier on aggregated responses of our fine-tuned CNN. Here the response maps of two meta objects, 'computer screen' (second row) and 'keyboard' (bottom row), are shown. Gray-scale values in the response maps indicate confidence.
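The sketch below illustrates this proposal stage. Since MCG [3] is distributed as a MATLAB package, selective search [29] via OpenCV's contrib module stands in here purely for illustration; the pipeline is agnostic to the specific proposal method, and the budget of 192 top-ranked proposals matches the MIT Indoor 67 setting reported in Section 3.

```python
# A minimal sketch of the region-proposal stage. Selective search [29] from
# opencv-contrib is used as a stand-in for MCG [3], which ships as a MATLAB
# package; the pipeline works with any proposal method of comparable recall.
import cv2  # requires opencv-contrib-python

def top_region_proposals(image_path, num_proposals=192):
    """Return the top-ranked (x, y, w, h) region proposals for one image."""
    img = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchQuality()  # favor recall over speed
    rects = ss.process()                 # ranked (x, y, w, h) boxes
    return rects[:num_proposals]
```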
Feature Extraction
We use the CNN model pre-trained on the Places dataset [33] as our generic feature extractor for all the image regions generated by MCG. As this CNN model only takes input images with a fixed resolution, we follow the warping scheme described in R-CNN [7] and resample a patch with an arbitrary size and aspect ratio to the required resolution. Then each patch propagates through all the layers in the pre-trained CNN model, and we take the 4096-dimensional vector in the FC7 layer as the feature representation of the patch (see [14] and [33] for detailed information about the network architecture).
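As a minimal sketch of this step, the snippet below warps a patch and reads out its FC7 feature. A PyTorch environment is assumed, with torchvision's ImageNet-trained AlexNet standing in for the Places CNN, whose released weights target Caffe; both share the AlexNet architecture with a 4096-dimensional FC7 layer.

```python
# A minimal sketch of FC7 feature extraction. torchvision's AlexNet stands in
# for the Places CNN [33]; each proposal is warped to the fixed input
# resolution, following the R-CNN warping scheme [7].
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

def fc7_feature(patch):
    """Warp a PIL patch of arbitrary size and return its 4096-d FC7 vector."""
    x = TF.to_tensor(TF.resize(patch, (224, 224)))        # warp to fixed size
    x = TF.normalize(x, [0.485, 0.456, 0.406],
                     [0.229, 0.224, 0.225]).unsqueeze(0)  # ImageNet stats
    with torch.no_grad():
        x = model.avgpool(model.features(x)).flatten(1)   # conv trunk: 9216-d
        fc7 = model.classifier[:6](x)                     # stop after FC7 ReLU
    return fc7.squeeze(0)                                 # 4096-d feature
```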
Screening via One-Class SVMs
For each scene category, there typically exists a set of representative regions that frequently appear in the images for that category. For example, since regions with computer monitors frequently appear in the images for the "computer room" class, a region containing monitors should be a representative region. Meanwhile, there are other regions that might only appear in a few images. Such non-representative patches can be viewed as outliers for a certain scene category. On the basis of this observation, we adopt one-class SVMs [25] as discriminative models for removing non-representative patches. A one-class SVM separates all the data samples from the origin to achieve outlier detection. Let $x_1, x_2, \ldots, x_l$ ($x_i \in \mathbb{R}^d$) be the proposed regions from the same class, and $\Phi: \mathcal{X} \rightarrow \mathcal{H}$ be a kernel function that maps original region features into another feature space. Training a one-class SVM requires solving the following optimization:

$$\min_{w, \xi, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu l}\sum_{i=1}^{l} \xi_i - \rho \qquad (1)$$

subject to $(w \cdot \Phi(x_i)) \geq \rho - \xi_i$, $\xi_i \geq 0$, $i = 1, 2, \ldots, l$, where $\nu \in (0, 1]$ controls the ratio of outliers. The decision function

$$f(x) = \mathrm{sign}\big((w \cdot \Phi(x)) - \rho\big) \qquad (2)$$

should return the positive sign given the representative patches and the negative sign given the outliers. This is because the representative patches tend to stay in a local region in the feature space while the outliers are scattered around in this space. To further improve the performance, we train a series of cascaded classifiers, each of which labels a fraction of the input patches as outliers and prunes them. We typically use 3 cascaded classifiers.
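A minimal sketch of this cascaded screening, assuming scikit-learn and a feature matrix X holding the FC7 vectors of one scene category's proposed regions; the three stages of 15% filtering match the setting reported in Section 3.

```python
# A minimal sketch of the cascaded one-class-SVM screening, assuming an
# (l, 4096) array X of FC7 features from one scene category. nu plays the
# role of ν in Eq. (1).
import numpy as np
from sklearn.svm import OneClassSVM

def screen_outliers(X, nu=0.15, stages=3):
    """Return indices of patches that survive all cascade stages."""
    keep = np.arange(len(X))
    for _ in range(stages):
        svm = OneClassSVM(kernel="rbf", nu=nu).fit(X[keep])
        keep = keep[svm.predict(X[keep]) == 1]  # +1 = representative patch
    return keep
```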
Weakly Supervised Soft Screening

After the region proposal step and outlier removal, let us suppose that $m_i$ image patches have been generated for each image $I_i$, and these patches likely contain objects or object parts. Let us denote a patch from $I_i$ as $p_{ij}$ ($j \in \{1, \ldots, m_i\}$), and use $y_i$ to represent the scene category label of image $I_i$. We associate each image patch $p_{ij}$ with a weight $w_{ij} \in [0, 1]$ indicating the discriminative power of the patch among scene category labels. Our goal is to estimate this weight for every patch. Intuitively, a discriminative patch should have a high probability of appearing in one scene category and low probabilities of appearing in the other categories. That means, if we find the set of $K$ nearest neighbors $N_{ij}$ of $p_{ij}$ from all image patches generated from all training images except $I_i$, we can use the following class density estimator to set $w_{ij}$:

$$w_{ij} = P(y_i \mid p_{ij}) = \frac{P(p_{ij}, y_i)}{P(p_{ij})} \approx K_y / K, \qquad (3)$$

where $K_y$ is the number of patches among the $K$ nearest neighbors that share the same scene label with $p_{ij}$. By assuming that the $K$ nearest neighbors of $p_{ij}$ are almost identical to $p_{ij}$, we use $K_y$ to estimate the joint probability between a patch $p_{ij}$ and its label $y_i$. Empirically we set $K$ to 100 in all the experiments. It is worth noting that patches with large weights also have more representative power. As representative patches would occur frequently in the visual world [27], it is unlikely for non-representative patches to find similar ones (as their nearest neighbors) that share the same scene label. Fig. 2 shows the distribution of patch weights after our screening process.

Figure 2. Patch weight distribution after weakly supervised patch screening.
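The estimator in Eq. (3) amounts to a k-nearest-neighbor vote. Below is a minimal sketch assuming the patch features and their source-image and scene labels are stored in NumPy arrays (the array names are hypothetical); neighbors drawn from a patch's own image are excluded, as required above.

```python
# A minimal sketch of the soft-screening weights in Eq. (3), assuming arrays
# over all training patches: `features` (n, 4096), `patch_label` (the scene
# label y_i of each patch's source image), and `patch_image` (the id of that
# source image); all three names are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def patch_weights(features, patch_label, patch_image, K=100):
    """Return w_ij = K_y / K for every patch (Eq. 3)."""
    # Query extra neighbors so K remain after dropping same-image patches.
    nn = NearestNeighbors(n_neighbors=K + 50).fit(features)
    _, idx = nn.kneighbors(features)
    weights = np.empty(len(features))
    for j, neighbors in enumerate(idx):
        valid = neighbors[patch_image[neighbors] != patch_image[j]][:K]
        K_y = np.sum(patch_label[valid] == patch_label[j])
        weights[j] = K_y / K
    return weights
```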
Discriminative Clustering

Once we have identified the most discriminative image patches, the next step is grouping these patches into clusters such that, ideally, patches in the same cluster contain visual objects that belong to the same category and share the same semantic meaning. This is important for discovering the relationship between scene category labels and the labels of object clusters. Clustering also helps to show the internal variation of an object label. For example, desks facing a few different directions in a classroom might be grouped into several clusters. We call every patch cluster a meta object. Note that meta objects could correspond to visual objects but could also correspond to parts and patches that characterize the commonalities within a scene category.

We adopt the Regularized Information Maximization (RIM) algorithm [13] to perform discriminative clustering. RIM strikes a balance among cluster separation, cluster balance and cluster complexity. Fig. 3 shows a few clusters after applying RIM to the screened discriminative patches from the MIT Indoor 67 dataset [22]. As we can see, the patches within the same cluster have similar appearances and the same semantic meaning. Here we can also observe the discriminative power of such clusters. For example, the wine buckets (top row in Fig. 3) only show up in wine cellars, and the cribs (second row from the bottom in Fig. 3) only show up in nurseries.

Figure 3. Examples of patch clusters (meta objects) from the MIT Indoor 67 dataset [22]. Patches on the same row belong to the same meta object. The rightmost column shows the average image, namely the 'center', of the corresponding meta object.

Local Fine-Tuning for Patch Classification
Given the set of meta objects, we need a classifier to decide which meta object a patch from a testing image belongs to. There are various options for this classifier, including GMM-type probabilistic models, SVMs, and neural networks. We choose to fine-tune the pre-trained CNN on our meta objects, which include the collection of discriminative patches surviving the patch screening process. We perform stochastic gradient descent over the pre-trained CNN using the warped discriminative image patches and their corresponding meta object labels. Take MIT Indoor 67 [22] as an example. After weakly supervised patch screening (Section 2.2), around a million image patches remain, and 120 meta objects are discovered during the clustering step (Section 2.3). In the CNN, we replace the original output layer that performs ImageNet-specific 1000-way classification with a new output layer that does 121-way classification while leaving all other layers unchanged. Note that we need to add one extra class to represent those patches that are discarded during the screening step. The reason for local fine-tuning is to obtain an accurate meta object classifier that is also robust to noisy labels generated by the discriminative clustering algorithm used in Section 2.3.
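A minimal sketch of this output-layer replacement, again with torchvision's AlexNet standing in for the Hybrid CNN; only the final layer is swapped, and all layers are then fine-tuned by SGD on the warped patches.

```python
# A minimal sketch of the output-layer surgery for local fine-tuning. Only
# the final 1000-way layer is replaced by a 121-way layer (120 meta objects
# plus the extra "discarded" class); all other layers keep their pre-trained
# weights.
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

num_meta_objects = 120  # MIT Indoor 67, bottom level
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, num_meta_objects + 1)  # 121-way output

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# Fine-tuning loop over (warped_patch, meta_object_label) batches omitted.
```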
Image Representation with Meta Objects

Inspired by previous work such as object bank [16] and bag of parts [11], we hypothesize that any scene image can be represented as a bag of meta objects as well. Suppose N meta objects have been learned during discriminative clustering (Section 2.3). Given a testing image, we still perform MCG to obtain region proposals. Every region can be classified into one of the discriminative object clusters using our meta object classifier. Spatial aggregation of these meta objects can be performed using Spatial Pyramid Matching (SPM) [32]. In our implementation, we use three levels of SPM, and adaptively choose the centroid of all meta objects falling into a SPM region as the splitting center of its subregions. This strategy can better balance the number of meta objects that fall into each subregion. After applying SPM to the testing image, we obtain a hierarchical spatial histogram of meta object labels over the image, which can be used for determining the scene category of this testing image.

Another pooling method we consider is the Vector of Locally Aggregated Descriptors (VLAD) [10, 1]. We compute a modified version of VLAD that suits our framework. Specifically, we use our discriminative object clusters (meta objects) as the clusters for computing VLAD. That means we do not perform K-means clustering for VLAD. It is important to maintain such consistency because otherwise the recognition performance would degrade by 1.5% on the MIT Indoor 67 dataset (from 78.41% to 76.9%). Other steps are similar to the standard VLAD. Given region proposals of an image, we assign each region to its nearest cluster center, and aggregate the residuals of the region features, resulting in a 4096-d vector per cluster. Suppose there are k clusters. The dimension of this per-cluster vector is reduced to (4096/k)-d using PCA. Finally, these (4096/k)-d vectors are concatenated into a 4096-d VLAD descriptor.

The holistic Places CNN feature extracted from the whole image is also useful for training the scene image classifier since it encodes local as well as global information of the scene. We train a neural network with two fully-connected hidden layers (each with 200 nodes) using normalized VLAD (or SPM) features concatenated with the holistic Places CNN features. The relative weight between these two types of features is learned via cross validation on a small portion of the training data. We use the rectified linear function (ReLU) as the activation function of the neurons in the hidden layers.
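A minimal sketch of this modified VLAD, assuming the meta-object cluster centers are given (so no K-means is run) and the per-cluster PCA projections have been fit on training data beforehand; L2-normalizing the final descriptor is an assumption consistent with the normalized VLAD features mentioned above.

```python
# A minimal sketch of the modified VLAD, assuming `centers` is a (k, 4096)
# array of meta-object cluster centers, `regions` an (m, 4096) array of
# proposal features for one image, and `pca_list` k pre-fit sklearn PCA
# objects, each mapping 4096 -> 4096 // k dims.
import numpy as np

def modified_vlad(regions, centers, pca_list):
    k, d = centers.shape
    # Squared distances to every center, then hard-assign each region.
    d2 = ((regions ** 2).sum(1)[:, None] - 2.0 * regions @ centers.T
          + (centers ** 2).sum(1)[None, :])
    assign = d2.argmin(axis=1)
    # Aggregate residuals per meta object (zeros if a cluster is empty).
    residuals = np.zeros((k, d))
    for c in range(k):
        residuals[c] = (regions[assign == c] - centers[c]).sum(axis=0)
    # Per-cluster PCA to 4096 // k dims, concatenated into a 4096-d vector.
    parts = [pca_list[c].transform(residuals[c:c + 1])[0] for c in range(k)]
    v = np.concatenate(parts)
    return v / (np.linalg.norm(v) + 1e-12)  # assumed L2 normalization
```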
Multi-Level Representation

Our image representation with meta objects can be generalized to a multi-level representation. The insight here is that objects with different sizes and scales may supply complementary cues for scene classification. To achieve this, we switch to multi-level region proposals. The coarser levels deal with larger objects, while the finer levels deal with smaller objects and object parts. On each level, region proposals are generated and screened separately. Local fine-tuning for patch classification is also performed on each level separately. During the training stage of the final image classifier, the image representation is defined as the concatenation of the feature vectors from all levels. In practice, we find a 2-level representation sufficient. The bottom level includes relatively small regions from a finer level in a region hierarchy [2] to capture small objects and object parts in an image, while the top level includes region proposals generated by MCG as well as relatively large regions from a coarser level in the region hierarchy to capture large objects.

3. Experiments and Discussions
In this section, we evaluate the performance of our framework, named MetaObject-CNN, on the MIT Indoor 67 [22] and SUN397 [31] datasets, and analyze the effectiveness of the specific choices we made at every stage of the pipeline introduced in Section 2.
MIT Indoor 67
MIT Indoor 67 [22] is a challenging indoor scene dataset, which contains 67 scene categories and a total of 15,620 images. The number of images varies across categories (but there are at least 100 images per category). Indoor scenes tend to have more variations in terms of composition, and are better characterized by the objects they contain. This is consistent with the motivation of our framework.
SUN397
SUN397 [31] is a large-scale scene dataset, which contains 397 scene categories and a total of 108,754 images (also at least 100 images per category). The categories include different kinds of indoor and outdoor scenes which show tremendous object and alignment variance, thus bringing more complexity to learning a good classifier.
For MIT Indoor 67, we train our model on the commonly adopted benchmark, which contains 80 training images and 20 testing images per category. There are 192 top-ranked region proposals generated with MCG and 32 (96) regions from hierarchical image segmentation in the top (bottom) level for every training and testing image. The feature representation of a proposed region is set to the 4096-dimensional vector at the FC7 layer of the Hybrid CNN from [33]. After outlier removal (3 iterations of 15% filtering out), we further discard 16% of the patches, where the ratio is determined via cross validation on a small portion of the training data. Then we perform data augmentation (to 4 times larger) on the remaining patches using reflection, small rotation and random distortion. Discriminative clustering is performed on the augmented patches to produce 120 (40) meta objects for local fine-tuning in the bottom (top) level, which is performed on the Hybrid CNN by replacing the original output layer that performs ImageNet-specific 1000-way classification with a new output layer that does 121-way (41-way) classification while leaving all other layers unchanged. The pooling step (SPM and our modified VLAD) is discussed in Section 2.4. The image classification is done by a neural network with two fully-connected layers (200 nodes each) on the concatenated feature vector of VLAD pooling and the Hybrid CNN feature of the whole image.

For SUN397, we adopt the commonly used evaluation benchmark that contains 50 training images and 50 testing images per category for each split from [31]. There are 96 top-ranked regions generated with MCG and 32 (96) regions from hierarchical image segmentation in the top (bottom) level for every training and testing image. The feature representation of a proposed region is also set to the 4096-dimensional vector at the FC7 layer of the Hybrid CNN. After outlier removal (3 iterations of 15% filtering out), we further discard 24% of the patches. Data augmentation is also performed on the remaining patches involving reflection, small rotation and random distortion. Discriminative clustering results in 450 (150) meta objects in the bottom (top) level. Local fine-tuning is further performed on the Hybrid CNN by replacing the original output layer with a new output layer that does 451-way (151-way) classification while leaving all other layers unchanged. We also train a neural network with two fully-connected layers (200 nodes each) on the concatenated feature vector of VLAD pooling and the Hybrid CNN feature of the whole image to perform image-level classification.
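As an illustration of the augmentation step shared by both setups, the sketch below produces one perturbed copy of a patch with PIL; the flip probability, rotation range, and distortion parameters are illustrative assumptions rather than the paper's exact settings.

```python
# An illustrative sketch of the patch augmentation (reflection, small
# rotation, random distortion); the perturbation parameters are assumptions.
import random
from PIL import ImageOps

def augment(patch):
    """Return one randomly perturbed copy of a PIL patch."""
    out = ImageOps.mirror(patch) if random.random() < 0.5 else patch  # reflection
    out = out.rotate(random.uniform(-10, 10))     # small rotation
    w, h = out.size
    sx, sy = random.uniform(0.9, 1.1), random.uniform(0.9, 1.1)
    out = out.resize((int(w * sx), int(h * sy)))  # random aspect distortion
    return out
```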
In Table 1, we compare the recognition rate of our method (MetaObject-CNN) against published results achieved with existing state-of-the-art methods on MIT Indoor 67. Among the existing methods, oriented texture curves (OTC) [18], spatial pyramid matching (SPM) [15], and Fisher vectors (FV) with bag of parts [11] represent effective feature descriptors as well as their associated pooling schemes. Discriminative patches [27, 4] are focused on mid-level features and representations. More recently, deep learning and deep features have proven to be valuable for scene classification as well [8, 33]. The recognition accuracy of our method outperforms the state of the art by around 8.1%.
Table 1. Scene Classification Performance on MIT Indoor 67
Method                          Accuracy (%)
SPM [15]                        34.40
OTC [18]                        47.33
Discriminative Patches++ [27]   49.40
FV + Bag of parts [11]          63.18
Mid-level Elements [4]          66.87
MOP-CNN [8]                     68.88
Places-CNN [33]                 68.24
Hybrid-CNN [33]                 70.80
MetaObject-CNN                  78.90
Table 2 shows a comparison between the recognition rate achieved with our method (MetaObject-CNN) and those achieved with existing state-of-the-art methods on the SUN397 dataset. In addition to the methods introduced earlier, there exists additional representative work here. Xiao et al. [31], as the collectors of SUN397, integrated 14 types of distance kernels including bag of features and GIST. DeCAF [6] uses the global 4096-d feature from a CNN model pre-trained on ImageNet. OTC together with the HOG2x2 descriptor [18] outperforms dense Fisher vectors [24], both of which are effective feature descriptors for SUN397. And again, by applying deep learning techniques, MOP-CNN [8] and Places-CNN [33] (fine-tuned on SUN397) achieve state-of-the-art results (51.98% and 56.2%). With our MetaObject-CNN pipeline, we manage to achieve a higher recognition accuracy.
Table 2. Scene Classification Performance on SUN397
Method              Accuracy (%)
OTC [18]            34.56
Xiao et al. [31]    38.00
DeCAF [6]           40.94
FV [24]             47.20
OTC+HOG2x2 [18]     49.60
MOP-CNN [8]         51.98
Hybrid-CNN [33]     53.86
Places-CNN [33]     56.20
MetaObject-CNN      58.11
Ablation Study

In this section, we perform an ablation study to analyze the effectiveness of individual components in our pipeline. When validating each single component, we keep all the others fixed. Specifically, we treat the final result from our MetaObject-CNN as the baseline, and perform the analysis by altering only one component at a time. Table 3 shows a summary of the comparison results on MIT Indoor 67. A detailed explanation of these results is given in the rest of this section.
Table 3. Evaluation results on MIT Indoor 67 for varying pipeline configurations.
Configuration                                 Accuracy (%)
Global fine-tuning                            73.88
Mode-seeking [4] with Hybrid-CNN              69.70
Mode-seeking elements instead of MCG          76.34
Dense grid-based patches                      71.43
Without outlier removal and patch screening   75.12
Without outlier removal                       76.30
Without patch screening                       78.82
Without clustering                            72.81
Without local fine-tuning                     76.10
Cross-dataset evaluation                      76.52
MetaObject-CNN                                78.90

Global vs. Local Fine-Tuning
Most of the previous methods [33, 6, 12] using a pre-trained deep network primarily focus on global fine-tuning for domain adaptation tasks, which takes the entire image as input and relies on the network itself to learn all the informative structures embedded within a new dataset. However, in this work, we perform fine-tuning on local meta objects harvested in an explicit manner. To compare, we start with the Places CNN network [33], and fine-tune this network on MIT Indoor 67. The recognition rate after such global fine-tuning is 73.88% (top row in Table 3), which is around 5% lower than that of our pipeline. This indicates the advantages of our local approach of harvesting meta objects and performing recognition on top of them.
Choice of Region Proposal Method
In addition to choosing MCG [3] and hierarchical image segmentation for generating object proposals, one might directly use dense grid-based patches or mid-level discriminative patches discovered by the pioneering techniques in [4, 27] as local object proposals. To evaluate the effectiveness of MCG, we have conducted the following three internal comparisons. First, we compare our patch screening on top of region proposals with the patch discovery process in [4], which is a piece of representative work on learning mid-level patches in a supervised manner. For a fair comparison, we use the Places CNN feature (FC7) to represent the visual elements in this work. Similar to the configuration in [4], 1600 elements are learned per class and the 200 top elements per class are used for further classification. The resulting recognition rate is 69.70% (second row in Table 3), which is 9.2% lower than our result. This comparison demonstrates that region proposal plus patch screening is helpful in finding visual objects that characterize scenes. In a second experiment, we feed the top visual elements identified by [4] to our patch clustering step, and obtain 96 meta objects. The final recognition rate achieved with these meta objects is 76.34% (third row in Table 3), which is around 2.6% lower than our result. This second experiment shows that MCG works with our pipeline better than mode-seeking elements from [4]. Then in a third experiment, instead of taking region proposals, we have tried using all patches from a regular 8x8 grid; the result is 71.43% (fourth row in Table 3), which indicates patches sampled from a regular grid are not good candidates for meta objects.
Importance of Outlier Removal and Patch Screening
To see how important our outlier removal and patch screening stages are, one can directly feed all the object proposals without any screening into the subsequent components down the pipeline (discriminative clustering and local fine-tuning). During our patch screening step, as shown in Eq. (3), we rank all the patches according to their discriminative weights and discard those with lower weights. Here we define the total screening ratio as the percentage of discarded patches in both the outlier removal and patch soft screening steps. In Fig. 4 (top), we can see that when the total screening ratio is zero, the recognition accuracy is 75.12% (also shown in the fifth row of Table 3). This is because, although we have reasonable region proposals, there could still be many noisy ones among them. These noisy region proposals are either false positives or non-discriminative objects (as shown in Fig. 1) shared by multiple scene categories. On the other hand, an overly high screening ratio has also been found to hurt recognition performance, as shown in Fig. 4 (top). This is reasonably easy to understand because higher ratios could discard some discriminative meta objects that would otherwise contribute to the overall performance. We search for an optimal ratio through cross validation on a small subset of the training data. The outlier removal step is also important in filtering out regions that do not fit in a certain category, and brings a 2.6% improvement in final classification performance, as shown in the sixth row of Table 3.
Figure 4. Top: recognition accuracy vs. total screening ratio on MIT Indoor 67. Bottom: recognition accuracy vs. number of clusters in the bottom level on MIT Indoor 67.
Importance of Clustering
Next we justify the usefulness of clustering patches into meta objects. Without patch clustering, we can directly take the collection of screened patches as a large codebook, and treat every patch as a visual word. We then apply LSAQ [17] coding (with 100 nearest neighbors) and SPM pooling to build the image-level representation. The resulting recognition rate on MIT Indoor 67 is 72.81% (eighth row in Table 3), which is around 6.1% lower than the result of MetaObject-CNN. This controlled experiment demonstrates that patch clustering for meta object creation is crucial in our pipeline. Clustering patches into meta objects improves the generality and representativeness of those discovered discriminative patches because clustering emphasizes the common semantic meaning shared among similar patches while tolerating less important differences among them. Fig. 4 (bottom) shows the impact of the number of clusters in the bottom level on the final recognition rate. It is risky to group patches into an overly small number of clusters because that would assign patches with different semantic meanings to the same meta object. Creating too many clusters is also risky due to the poor generality of the semantic meanings of the resulting meta objects.
Importance of Local Fine-Tuning
Local fine-tuning has also proven to be effective in our pipeline. We tried using the responses from the RIM clustering model directly for pooling. On MIT Indoor 67, the recognition rate without local fine-tuning is 76.10% (ninth row in Table 3), which is around 2.8% lower than that with local fine-tuning. This demonstrates that local fine-tuning actually defines better separation boundaries between clusters, which is consistent with the common understanding of fine-tuning. We have also used the CNN locally fine-tuned on SUN397 to perform cross-dataset classification on MIT Indoor 67. The recognition rate is 76.52% (bottom row in Table 3), which indicates CNNs fine-tuned over one scene patch dataset have the potential to perform well on other scene datasets.
4. Conclusions
We have introduced a novel pipeline for scene classification, which is built on top of pre-trained CNNs via explicitly harvesting discriminative meta objects in a local manner. Through extensive comparisons in a series of controlled experiments, we have shown that our method generates state-of-the-art results on two popular yet challenging datasets, MIT Indoor 67 and SUN397. Recent studies on convolutional neural networks, such as GoogLeNet [28], indicate that using deeper models would improve recognition performance more substantially than shallow ones. Therefore training better generic CNNs would certainly improve their transfer learning capability as well. Nevertheless, our approach is intrinsically orthogonal to this line of effort. Exploring other local fine-tuning methods would be an interesting direction for future work.
Acknowledgment
This work was partially supported by the Hong Kong Research Grants Council under General Research Funds (HKU17209714).

References

[1] R. Arandjelovic and A. Zisserman. All about VLAD. In CVPR, pages 1578-1585. IEEE, 2013.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898-916, 2011.
[3] P. A. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, pages 328-335, 2014.
[4] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In NIPS, 2013.
[5] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 34(4):743-761, April 2012.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647-655, 2014.
[7] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[8] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, 2014.
[9] C. Gu, J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, pages 1030-1037, June 2009.
[10] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304-3311. IEEE, 2010.
[11] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In CVPR, 2013.
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725-1732, 2014.
[13] A. Krause, P. Perona, and R. G. Gomes. Discriminative clustering by regularized information maximization. In NIPS, 2010.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106-1114, 2012.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[16] L. Li, H. Su, Y. Lim, and F. Li. Object bank: An object-level image representation for high-level visual recognition. IJCV, 107(1):20-39, 2014.
[17] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV, 2011.
[18] R. Margolin, L. Zelnik-Manor, and A. Tal. OTC: A novel local descriptor for scene classification. In ECCV, pages 377-391. Springer, 2014.
[19] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[20] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.
[21] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. In ACCV, 2014.
[22] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, pages 413-420, 2009.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. CoRR, 2014.
[24] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3):222-245, 2013.
[25] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, July 2001.
[26] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, 2013.
[27] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, pages 73-86, 2012.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, 2014.
[29] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 104(2):154-171, 2013.
[30] X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu. Max-margin multiple-instance dictionary learning. In ICML, volume 28, May 2013.
[31] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, June 2010.
[32] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794-1801. IEEE, 2009.
[33] B. Zhou, J. Xiao, A. Lapedriza, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
[34] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV, 2014.