Exploit Bounding Box Annotations for Multi-label Object Recognition
Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, Jianfei Cai
SCE, Nanyang Technological University, [email protected], [email protected]
IHPC, A*STAR, [email protected]
Bioinformatics Institute, A*STAR, [email protected]
National Key Laboratory for Novel Software Technology, Nanjing University, China, [email protected], [email protected]

Abstract
Convolutional neural networks (CNNs) have shown great performance as general feature representations for object recognition applications. However, for multi-label images that contain multiple objects from different categories, scales and locations, global CNN features are not optimal. In this paper, we incorporate local information to enhance the feature discriminative power. In particular, we first extract object proposals from each image. With each image treated as a bag and the object proposals extracted from it treated as instances, we transform the multi-label recognition problem into a multi-class multi-instance learning problem. Then, in addition to extracting the typical CNN feature representation from each proposal, we propose to make use of ground-truth bounding box annotations (strong labels) to add another level of local information, using nearest-neighbor relationships of local regions to form a multi-view pipeline. The proposed multi-view multi-instance framework utilizes both weak and strong labels effectively, and more importantly it has the generalization ability to boost the performance even on unseen categories using partial strong labels from other categories. Our framework is extensively compared with state-of-the-art hand-crafted feature based methods and CNN based methods on two multi-label benchmark datasets. The experimental results validate the discriminative power and the generalization ability of the proposed framework. With strong labels, our framework is able to achieve state-of-the-art results on both datasets.
1. Introduction
Recently, the availability of large amounts of labeled data has greatly boosted the development of feature learning methods for classification. In particular, convolutional neural networks (CNNs) have achieved great success in visual recognition/classification tasks. Features extracted from CNNs can provide powerful global representations for the single-object recognition problem [15, 22, 23]. However, conventional CNN features may not generalize well to images containing multiple objects, as the objects can differ in location, scale, occlusion and category. Fig. 1 shows an example of such images. Since the multi-label recognition task is more general and practical in real-world applications, many CNN related methods [18, 19, 26] have been proposed to address the problem.

Figure 1. An example of a typical multi-label image, which contains several cows in different locations as well as a person.

It is well known that image-level labels can be utilized to fine-tune a pre-trained CNN model and produce good global representations [22, 23, 19]. However, due to the diversity and the complexity of multi-label images, classifiers trained from such global representations might not be optimal. For example, if we use images similar to Fig. 1 to train a classifier for "person", the classifier has to account for not only hundreds of different variations of "person" but also the other objects contained in the images. The complexity of multi-label images adds an extra level of difficulty for training appropriate classifiers with global image representations. Furthermore, due to the large intra-class variations of multi-label images, the global features extracted from training images are likely to be unevenly distributed in the feature space. A classifier trained with such features can be successful in regions that are densely populated with training instances, but may fail in poorly sampled areas of the feature space [31].

To address the problems with global CNN representations, following the recent works [19, 26, 18, 12], we incorporate local information by extracting object proposals with general object detection techniques such as selective search [25]. By decomposing an image into local regions that could potentially contain objects, we avoid the complex process of directly recognizing multiple objects in the whole image. Instead, we only need to identify whether target objects exist in the local regions. However, as the local regions are noisy and of large variation (see Fig. 3), the usual CNN representations might not be discriminative enough. Therefore, we add another level of locality by incorporating local nearest-neighbor relationships of these local regions, so that the resulting features are more evenly distributed in the feature space. Such relationships are not easy to obtain through weak supervision alone, i.e., image-level labels. Fortunately, for many multi-label applications, we can exploit strong supervision, i.e., ground-truth bounding boxes, which can be considered as local regions with strong labels. We can then exploit the relationships between object proposals and ground-truth bounding boxes (e.g., nearest-neighbor relationships) to help multi-label recognition.

We would like to point out that ground-truth bounding boxes have been utilized in two proposal-based methods [18, 12] for multi-label object recognition. In particular, [12] makes use of ground-truth bounding boxes to train category-specific classifiers to classify object proposals.
However, [12] requires ground-truth bounding boxes for each category of objects, which might not be available in practice. In contrast, for our proposed method, even with only partial strong labels (e.g., bounding boxes and labels for 10 of the 20 classes of Pascal VOC), the proposed local relationships generalize well and can help recognize all classes (e.g., improving recognition of all 20 classes in VOC). [18] directly uses ground-truth bounding boxes to fine-tune the CNN model, but its performance is not better than other proposal-based methods [19, 26] that do not utilize ground-truth bounding boxes. In other words, an effective way to exploit ground-truth bounding boxes for multi-label object recognition is still missing, which is what we aim to provide in this paper.

Fig. 2 gives an overview of our proposed framework. We utilize both strong and weak labels as two views, and propose a multi-view multi-instance framework to tackle the multi-label object recognition task. In particular, for any image, we first extract object proposals using general object detection techniques. The global image and its accompanying weak (image) label are used to fine-tune a standard CNN, which generates a feature-view representation for each proposal. Using the ground-truth bounding boxes and their strong labels, we design a large-margin nearest neighbor (LMNN) CNN architecture to learn a low-dimensional feature, with which we extract nearest-neighbor relationships between local regions and a candidate pool formed by ground-truth objects. These local NN features are used as the label view. By combining both views, we achieve a balance between global semantic abstraction and local similarity, hence enhancing the discriminative power of our framework (a minimal sketch of this data flow is given at the end of this section). More importantly, as the strong labels are indirectly utilized through LMNN to encode local neighborhood relationships among labelled local regions, the proposed framework generalizes well to the whole local region space, even with strong labels for only part of the object classes, making our framework more practical.

The main contribution of this research lies in the proposed multi-view multi-instance framework, which utilizes bounding box annotations (strong labels) to encode the label view and combines it with the typical CNN feature representation (feature view) for multi-label object recognition. Another novelty of our work is the proposed LMNN CNN, which effectively extracts local information from the strong labels.
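To make the data flow of Fig. 2 concrete, the sketch below traces one image through the pipeline just described. It is only an illustration under stated assumptions: every function here is a hypothetical placeholder (random features stand in for selective-search proposals and the two CNNs), not the authors' implementation; only the shapes and the view-fusion logic reflect the text.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 20            # number of categories (e.g., Pascal VOC)
K_NN = 50         # neighbors used for the label view at test time

# --- Hypothetical placeholders for the real components described above ---
def extract_proposals(image, n=50):
    """Stand-in for selective search [25]; returns n proposal crops."""
    return [rng.random((32, 32, 3)) for _ in range(n)]

def feature_view(proposal):
    """Stand-in for the layer-7 output (2048-d) of the standard CNN."""
    return rng.random(2048)

def lmnn_embed(proposal):
    """Stand-in for the 128-d output of the LMNN CNN."""
    return rng.random(128)

def label_view(embed, pool_embeds, pool_labels, k=K_NN):
    """Flattened binary labels of the k nearest ground-truth objects (kC-d)."""
    nn = np.argsort(((pool_embeds - embed) ** 2).sum(axis=1))[:k]
    return pool_labels[nn].ravel()

# Candidate pool: LMNN embeddings and one-hot labels of ground-truth boxes.
pool_embeds = rng.random((1000, 128))
pool_labels = np.eye(C)[rng.integers(0, C, 1000)]

lam = 0.5         # trade-off between the two views (Sec. 4.2)
image = rng.random((375, 500, 3))
bag = np.stack([
    np.concatenate([feature_view(p),
                    lam * label_view(lmnn_embed(p), pool_embeds, pool_labels)])
    for p in extract_proposals(image)
])
# `bag` holds one fused feature per proposal; Sec. 3 pools it into a Fisher
# vector, which is fed to one-vs-all linear classifiers.
print(bag.shape)  # (50, 2048 + 50 * 20)
```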
2. Related Works
Our paper mainly relates to the topics of CNN based multi-label object recognition, multi-view and multi-instance learning, and local and metric learning.
CNN based multi-label object recognition. Recently, CNN models have been adopted to solve the multi-label object recognition problem. Many works [12, 22, 18, 19, 26] have demonstrated that CNN models pre-trained on a large dataset such as ILSVRC can be used to extract features for other applications without enough training data. A typical approach ([22], [23] and [5]) is to directly apply a pre-trained CNN model to extract an off-the-shelf global feature for each image of a multi-label dataset, and use these features for classification. However, different from single-object images in ImageNet, multi-label images usually have multiple objects in different locations, scales and occlusions, and thus global representations are not optimal for solving the problem [26]. More recently, two proposal-based methods [18, 12] were proposed for multi-label recognition and detection tasks with the help of ground-truth bounding boxes. These methods achieve significant improvement over single global representations. On the other hand, [19] and [26] handle the problem in a weakly supervised manner by max-pooling image scores from the proposal scores. Moreover, [24] employs a very deep CNN, aggregates multiple features from different scales of the image and achieves state-of-the-art results.

Figure 2. Overview of the proposed multi-view multi-instance framework. We transform the multi-label object recognition problem into a multi-class multi-instance learning problem by first extracting object proposals from each image using selective search. Two types of features are then extracted for each proposal. One is a low-dimensional feature from a large-margin nearest neighbor (LMNN) CNN, which is used to generate the label view by encoding the label information of the k-NN from the candidate pool (containing ground-truth objects). The other is a standard CNN feature, which serves as the feature view. These two views are fused and then used to encode a Fisher vector for each image.

Multi-view and multi-instance learning. Multi-view learning deals with data from multiple sources or feature sets. The goal of multi-view learning is to exploit the relationships between views to improve performance or reduce model complexity. Multi-view learning is well studied in conjunction with semi-supervised learning or active learning. To combine information from multiple views for supervised learning, fusion techniques at the feature level or the classifier level can be employed [32]. Multi-instance learning aims at separating bags containing multiple instances. Over the years, many multi-instance learning algorithms have been proposed, including miBoosting [29], miSVM [1], MILES [7] and miGraph [33]. Several works have also studied the combination of multi-view and multi-instance learning and its application to computer vision tasks.
Local and metric learning. Existing local learning methods mainly vary in the way they utilize the labelled instances nearest to a test instance. One way is to use only a fixed number of nearest neighbors of the test point to train a model using a neural network, an SVM or just voting. The other way is to learn a transformation of the feature space (e.g., Linear Discriminant Analysis). Either way, the learned model can be better tailored to the neighborhood properties of the test instance [13]. Metric learning is closely related to local learning, as a good distance metric is crucial for the success of local learning. Generally, metric learning methods optimize a distance metric to best satisfy known similarity constraints between training data [3]. Some metric learning methods learn a single global metric [27]. Others learn local metrics that vary in different regions of the feature space [30].
3. Multi-Label as Multi-Instance
In this section, we introduce the first level of locality by formulating the multi-label object recognition problem as a multi-instance learning (MIL) problem. To be specific, given a set of n training images {X_i}_{i=1}^n, we extract n_i object proposals {x_{ij}, j = 1, ..., n_i} from each image X_i using general object detection techniques. By decomposing images into object proposals, each image X_i becomes a bag containing several positive instances, i.e., proposals with the target objects, and negative instances, i.e., proposals with background or other objects. The problem of classifying X_i is thus transformed from a multi-label classification problem into a multi-class MIL problem. The merit of such a transformation is that we do not need to deal with the complex process of directly recognizing multiple objects in multiple scales, locations and categories in a single image. Instead, we only need to identify whether target objects exist in the proposals, which has been proven to be the forte of CNN features [12].

MIL problems assume that every positive bag contains at least one positive instance. As extensively compared and evaluated in [14], state-of-the-art general object detection methods like BING [8], selective search [25], MCG [2] and EdgeBoxes [34] can reach reasonably good recall rates with several hundred proposals. Therefore, if we sample enough proposals from each image, we can safely assume that these proposals cover all objects (or at least all object categories) in an image, thus fulfilling the assumption of multi-instance learning.

In particular, we employ the unsupervised selective search method [25] for object proposal generation. Selective search has proven to achieve a balance between effectiveness and efficiency [14]. More importantly, as it is unsupervised, no extra training data or ground-truth bounding boxes are needed in this stage. Examples of proposals extracted by selective search can be found in Fig. 3.

Figure 3. An example of object proposals generated by selective search. We show randomly sampled proposals from the full set of proposals, which clearly cover two of the main objects: person and bike.

Traditionally, MIL is formulated as a max-margin classification problem with latent parameters optimized using alternating optimization. Typical examples include miSVM [1] and Latent SVM [11]. Although these methods can achieve satisfactory accuracies, their limited scalability hinders their applicability to current large-scale image classification tasks. For large-scale MIL problems, [28] shows that the Fisher vector [20, 21] (FV) can be used as an efficient and effective holistic representation for a bag. Thus, we choose to represent each bag X_i as an FV.

Assume we have a K-component Gaussian Mixture Model (GMM) with parameters θ = {ω_k, μ_k, Σ_k, k = 1, ..., K}, where ω_k, μ_k and Σ_k are the mixture weight, mean vector and covariance matrix of the k-th Gaussian, respectively. The covariance matrices Σ_k are assumed to be diagonal, with the corresponding standard deviations of the diagonal entries forming a vector σ_k. We have [21]:

$$f_{\mu_k}^{X_i} = \frac{1}{\sqrt{\omega_k}} \sum_{j=1}^{n_i} \gamma_j(k) \left( \frac{x_{ij} - \mu_k}{\sigma_k} \right), \quad (1)$$

$$f_{\sigma_k}^{X_i} = \frac{1}{\sqrt{\omega_k}} \sum_{j=1}^{n_i} \gamma_j(k) \frac{1}{\sqrt{2}} \left[ \frac{(x_{ij} - \mu_k)^2}{\sigma_k^2} - 1 \right], \quad (2)$$

where γ_j(k) is the soft assignment weight, i.e., the probability for x_{ij} to be generated by the k-th Gaussian:

$$\gamma_j(k) = p(k \mid x_{ij}, \theta). \quad (3)$$

We map all the proposals {x_{ij}, j = 1, ..., n_i} in an image X_i to an FV by concatenating f_{μ_k}^{X_i} and f_{σ_k}^{X_i} for all k = 1, ..., K, and denote it as F^{X_i}. F^{X_i} is used as the final feature to train the one-vs-all linear classifiers. Note that, for simplicity, we abuse the notation x_{ij} for both proposal j in image i and its corresponding feature representation. In the next section, we describe how to generate the feature representation x_{ij} for each proposal.
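As a sanity check of Eqs. (1)-(3), the following NumPy sketch encodes a bag of proposal features into a Fisher vector given fixed GMM parameters. It is a minimal illustration of the formulas only; the power and L2 normalization of the improved FV [20], as well as GMM training, are omitted.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Encode a bag X (n_i x d) of proposal features as a Fisher vector.

    w: (K,) mixture weights; mu: (K, d) means; sigma: (K, d) diagonal
    standard deviations. Implements Eqs. (1)-(3) directly.
    """
    n, d = X.shape
    K = w.shape[0]
    # Log-density of each proposal under each diagonal Gaussian, plus log w_k.
    log_p = np.empty((n, K))
    for k in range(K):
        z = (X - mu[k]) / sigma[k]
        log_p[:, k] = (np.log(w[k]) - 0.5 * (z ** 2).sum(axis=1)
                       - np.log(sigma[k]).sum() - 0.5 * d * np.log(2 * np.pi))
    # Eq. (3): soft assignment gamma_j(k) = p(k | x_ij, theta), via softmax.
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    fv = []
    for k in range(K):
        z = (X - mu[k]) / sigma[k]
        g = gamma[:, k:k + 1]
        fv.append((g * z).sum(axis=0) / np.sqrt(w[k]))                 # Eq. (1)
        fv.append((g * (z ** 2 - 1)).sum(axis=0) / np.sqrt(2 * w[k]))  # Eq. (2)
    return np.concatenate(fv)  # 2Kd dimensions

# Toy bag: 500 proposals with 64-d (PCA-reduced) features, K = 4 components.
rng = np.random.default_rng(0)
X = rng.random((500, 64))
w = np.full(4, 0.25)
mu, sigma = rng.random((4, 64)), np.full((4, 64), 0.5)
print(fisher_vector(X, w, mu, sigma).shape)  # (512,)
```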
4. From Global Representation to Local Similarity
Once we obtain object proposals for each image, we can naturally use CNN features to represent these proposals. Following common practice in the literature [22, 23], each proposal is fed into a pre-trained CNN, and the output of the second last fully connected layer (e.g., layer 7 in AlexNet [15]) is used as the feature representation of that particular proposal. We call this representation the feature view f_{ij} for proposal x_{ij} from image X_i. With the proposals represented by CNN features, one baseline is to encode each image (bag) at the feature view by a Fisher vector, as discussed in Sec. 3. Such a baseline achieves reasonably good results using the Fisher vector generated from the feature view only.

However, since the proposals contain different objects as well as random background, they exhibit large variances and imbalanced distributions. As a consequence, the global representation might not be accurate enough. Inspired by the idea of local learning, which addresses the data density and intra-class variation problems by focusing on a subset of the data that is more relevant to a particular instance [4], we propose the second level of locality by adding local spatial configuration information as the label view (cf. Fig. 2) to enhance the discriminative power of the feature.

To effectively encode the local spatial configuration of a proposal, we need to solve two key problems: how to form a good candidate pool for local learning, and how to determine which candidates are relevant to a particular new proposal. For the former, since we have ground-truth object bounding boxes from the strong labels, we can use them as the candidate pool, assuming all of the ground-truth objects are useful. For the latter, we follow the common assumption that the most relevant candidates are the nearest neighbors. The problem then becomes how to define "nearest". Many studies [3, 27] have shown that the distance metric is critical to the performance of local learning.

Metric learning studies the problem of learning a discriminative distance metric. Conventional Mahalanobis metric learning methods optimize the parameters of a distance function to best satisfy known similarity or dissimilarity constraints between training instances [3]. To be specific, given a set of n labelled training instances {x_i, y_i}_{i=1}^n, the goal of metric learning is to learn a square matrix M such that the distance between training data, represented as

$$D_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \quad (4)$$

satisfies certain constraints. Since M is symmetric and positive semi-definite, it can be decomposed as M = W^T W, and D_M(x_i, x_j) can be rewritten as

$$D_M(x_i, x_j) = \| W (x_i - x_j) \|^2. \quad (5)$$

We can see that learning a distance metric is equivalent to learning a linear projection W that maps the data from the input space to a transformed space. In this sense, the extraction of CNN features from the original raw pixel space can also be viewed as a form of metric learning, only that the process is highly nonlinear. However, the goal of a CNN is usually to minimize the classification error using loss functions such as the logistic loss, which may not be suitable for local encoding.

Our desired metric should be discriminative, so that all categories are well separated, as well as compact, so that we can find more accurate nearest neighbors. Specifically, we want the pairwise distance between instances from the same class to be smaller than that between instances from different classes.
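The equivalence between Eq. (4) and Eq. (5) is easy to verify numerically; the snippet below is a minimal check with an arbitrary projection W.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.random((8, d))   # arbitrary linear projection to an 8-d space
M = W.T @ W              # the induced symmetric PSD metric matrix

x_i, x_j = rng.random(d), rng.random(d)
diff = x_i - x_j
d_metric = diff @ M @ diff          # Eq. (4): Mahalanobis form
d_proj = ((W @ diff) ** 2).sum()    # Eq. (5): squared norm after projection
assert np.isclose(d_metric, d_proj)
```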
To achieve such a goal, [27] proposed the large-margin distance, which minimizes the following objective function:

$$\sum_{i,j} \eta_{ij} D(x_i, x_j) + \alpha \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \big[ 1 + D(x_i, x_j) - D(x_i, x_l) \big]_+ . \quad (6)$$

Here η encodes target nearest-neighbor information: η_{ij} = 1 if x_j is one of the k̂ positive nearest neighbors of x_i, and η_{ij} = 0 otherwise. y encodes label information: y_{il} = 1 if x_i and x_l are in the same class, and y_{il} = 0 otherwise. [·]_+ = max(·, 0) is the hinge loss function, and α is the trade-off parameter. The first term in Eq. 6 penalizes large distances between instances and their target neighbors, and the second term penalizes small distances between each instance and all other instances that do not share the same label. By employing such an objective function, we ensure that the k̂ nearest neighbors of an instance belong to the same class, while instances from different classes are separated by a large margin.

To learn a discriminative metric, we propose a large-margin nearest neighbor (LMNN) CNN. Specifically, we replace the logistic loss with the large-margin nearest neighbor loss and train a network with low-dimensional output utilizing the strong labels. Details of training and fine-tuning the LMNN CNN can be found in Section 4.3. The output of the proposed LMNN network is a low-dimensional feature that shares the good semantic abstraction power of conventional CNN features and the good neighborhood properties of large-margin metric learning. We then build the candidate pool from the LMNN CNN features extracted from ground-truth objects.

To effectively incorporate local label information around a local region, we encode its neighborhood as the label view. Specifically, we extract features from each proposal x_{ij} using the LMNN CNN, then find the k nearest neighbors of x_{ij} in the candidate feature pool as nn_{ij} = {nn¹_{ij}, ..., nnᵏ_{ij}} and record their labels l_{ij} = [l¹_{ij}, ..., lᵏ_{ij}]. The label information (e.g., lᵏ_{ij}) of a neighbor (e.g., nnᵏ_{ij}) is encoded as a C-dimensional binary vector, corresponding to C categories: the d-th dimension lᵏ_{ij}(d) = 1 (d = 1, ..., C) if the object is annotated as class d, and lᵏ_{ij}(d) = 0 otherwise. Therefore, l_{ij} is a 1 × kC vector, and it is used as the feature for the label view.

The merit of utilizing ground-truth bounding boxes indirectly as the label view is good generalization ability. As the label view is a form of local structure representation, even for unseen categories, i.e., categories without same-category strong labels, the encoding process can naturally exploit existing semantically or visually close categories to build local support. For example, suppose we do not have the bounding box annotations for "cat" and "train". A proposal containing "cat" might have nearest neighbors of "dog", "tiger" or other related animals, and a proposal containing "train" might have nearest neighbors of "car", "truck" or other vehicles. Although lacking the exact annotations of a certain category, the label view is still able to encode the local structure with semantically or visually similar objects. In this way, our framework can make use of existing strong supervision to boost the overall performance. The experimental results in Section 5.3 validate this argument.
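For reference, the snippet below evaluates the large-margin objective of Eq. (6) on precomputed embeddings, following the unit-margin convention of [27]. It is a sketch for inspecting the loss value, not the LMNN CNN training code of Section 4.3.

```python
import numpy as np

def lmnn_loss(Z, y, k_hat=3, alpha=1.0):
    """Evaluate Eq. (6) for embeddings Z (n x d) with integer labels y.

    eta[i, j] = 1 iff x_j is one of the k_hat same-class nearest neighbors
    of x_i; the hinge pushes differently labeled points a unit margin
    farther away than each target neighbor.
    """
    n = Z.shape[0]
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)  # squared dists
    same = y[:, None] == y[None, :]

    eta = np.zeros((n, n), dtype=bool)
    for i in range(n):
        cand = np.where(same[i] & (np.arange(n) != i))[0]
        eta[i, cand[np.argsort(D[i, cand])[:k_hat]]] = True

    pull = (D * eta).sum()  # first term: pull target neighbors close
    # second term: hinge over triplets (i, target j, impostor l with y_il = 0)
    hinge = np.maximum(0.0, 1.0 + D[:, :, None] - D[:, None, :])
    push = (eta[:, :, None] * (~same)[:, None, :] * hinge).sum()
    return pull + alpha * push

rng = np.random.default_rng(0)
Z = rng.random((60, 128))          # e.g., 128-d LMNN CNN outputs
y = rng.integers(0, 5, 60)         # class labels of ground-truth objects
print(lmnn_loss(Z, y))
```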
We directly concatenate the feature view and the label view to form the final representation of each proposal x_{ij} as [f_{ij}, λ l_{ij}], where λ is the trade-off parameter between the feature view and the label view.

Our framework consists of two networks, a large-margin nearest neighbor (LMNN) CNN and a standard CNN. Both network architectures are similar to [5], with convolutional layers followed by fully-connected layers, and the dimension of the layer-7 output is set to 2048. The main differences between the two networks lie in the loss function and the fine-tuning process. For the LMNN CNN, the layer-8 output is a 128-dimensional feature, on which we measure the pairwise distances for kNN, while the output of the standard CNN is a C-dimensional score vector, corresponding to the C categories.
Data pre-processing and pre-training. We use the ILSVRC 2012 dataset to pre-train both networks. Given an image, we resize its short side with bilinear interpolation and perform a center crop to generate the standard square input expected by the network. Each input is then pre-processed by subtracting the mean of the ILSVRC images.

Fine-tuning.
To better adapt the pre-trained networks to specific applications, we also fine-tune them using task-relevant data. Unlike [5], our current implementation does not involve any data augmentation in the fine-tuning stage.

For the standard CNN used for the feature view, we only fine-tune the network with weak labels on the whole image. As our task is multi-label recognition, following [26], we use the square loss instead of the logistic loss. To be specific, suppose we have a label vector y_i = [y_{i1}, y_{i2}, ..., y_{iC}] for the i-th image, where y_{ij} = 1 (j = 1, ..., C) if the image is annotated with class j, and y_{ij} = 0 otherwise. The ground-truth probability vector of the i-th image is defined as p_i = y_i / ||y_i|| and the predicted probability vector is p̂_i = [p̂_{i1}, p̂_{i2}, ..., p̂_{iC}]. Then, the cost function to be minimized is defined as

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{C} (p_{ij} - \hat{p}_{ij})^2. \quad (7)$$

During fine-tuning, the parameters of the first seven layers of the network are initialized with the pre-trained parameters, while the parameters of the last fully connected layer are initialized from a Gaussian distribution. We tune the network for a fixed number of epochs in total.

For the large-margin NN (LMNN) CNN used for the label view, we execute a three-step fine-tuning. The first step is image-level fine-tuning, similar to the process described above. The second step is ground-truth object fine-tuning, where we fine-tune the network using ground-truth objects with the logistic loss. The final step is large-margin nearest neighbor fine-tuning, where we fine-tune the network with the loss function of Eq. 6 described in Section 4.1. To accelerate this final step, we fix all parameters of the first seven layers and only fine-tune the parameters of the last fully connected layer.
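A minimal sketch of the square loss of Eq. (7) used for the image-level fine-tuning, assuming the ground-truth probability vector is the label vector normalized to sum to one (as in [26]); this only evaluates the cost, not the network training.

```python
import numpy as np

def square_loss(Y, P_hat):
    """Multi-label square loss of Eq. (7).

    Y: (n, C) binary image-level labels; P_hat: (n, C) predicted probability
    vectors. p_i = y_i / ||y_i|| normalizes each label vector to sum to 1.
    """
    P = Y / Y.sum(axis=1, keepdims=True)
    return ((P - P_hat) ** 2).sum() / Y.shape[0]

rng = np.random.default_rng(0)
Y = (rng.random((8, 20)) < 0.2).astype(float)
Y[Y.sum(axis=1) == 0, 0] = 1.0               # every image needs >= 1 label
P_hat = rng.dirichlet(np.ones(20), size=8)   # toy predicted probabilities
print(square_loss(Y, P_hat))
```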
5. Experimental Results
In this section, we present the experimental results of the proposed multi-view multi-instance framework on multi-label object recognition tasks.
We evaluate our method on the PASCAL Visual Object Classes Challenge (VOC) datasets [10], which are widely used as benchmarks for multi-label object recognition. Both datasets are divided into TRAIN, VAL and TEST sets. We use TRAIN and VAL for training and TEST for testing. The evaluation metrics are average precision (AP) and mean average precision (mAP).

Table 1. Dataset information.

We compare the proposed framework with the following state-of-the-art approaches:

• CNN-SVM [23]. This method employs OverFeat [22], pre-trained on ImageNet, to obtain CNN activations as off-the-shelf features. Specifically, CNN-SVM employs the 4096-d feature extracted from the 22nd layer of OverFeat and uses these features to train a linear SVM for the classification task.

• PRE [18]. [18] proposed to transfer image representations learned with CNNs on ImageNet to other visual recognition tasks with limited training data. The network has exactly the same architecture as that in [15]. The network is first pre-trained on ImageNet. The parameters of the first seven layers of the CNN are then fixed, and the last fully-connected layer is replaced by two adaptation layers. Finally, the adaptation layers are trained with images from the target dataset.

• HCP [26]. HCP solves the multi-label object recognition task by extracting object proposals from the images. Specifically, HCP has three main steps. The first step is to pre-train a CNN on ImageNet data. The second step is image-level fine-tuning, which uses image labels and the square loss to fine-tune the pre-trained CNN. The final step employs BING [8] to extract object proposals and fine-tunes the network with these proposals. The image-level scores are obtained by max-pooling over the scores of the proposals.

• [19]. [19] also handles the problem in a weakly supervised manner. In particular, multiple windows are extracted from different scales of the images in a dense-sampling fashion. The scores of these windows are combined by max-pooling within the same scale and then sum-pooling across different scales.

• VeryDeep [24]. [24] densely extracts multiple CNN features across multiple scales of the image with very deep networks (16-layer and 19-layer). The features from the same scale are combined by sum-pooling, and features from different scales are aggregated by stacking or sum-pooling. [24] also augments the test set by horizontal flipping of the images.
• Hand-crafted features. [6] presented an Ambiguity-guided Mixture Model (AMM) to integrate external context features and object features, and then used a contextualized SVM to iteratively boost the performance of object classification and detection. [9] proposed an Ambiguity-Guided Subcategory (AGS) mining approach to improve both detection and classification performance.
It is difficult to make a completely fair comparison among different CNN based methods, as the CNN configuration, the data augmentation and the pre-training can substantially influence the results. All CNN based methods can benefit from extra training data and more powerful networks, as shown in [18, 26, 19, 5, 24]. To fairly evaluate our proposed framework, we develop our system based on the common 8-layer CNN pre-trained on the ILSVRC 2012 dataset with 1000 categories. The details of the fine-tuning process are elaborated in Section 4.3. Once the LMNN CNN and the standard CNN (see Fig. 2) are trained, the system is applied to map each training image into a final FV feature. Finally, TopPush [16] is chosen to learn linear one-vs-all classifiers for each category, which produce the scores for each binary sub-problem of the multi-label datasets. The scores are then evaluated with the standard VOC evaluation package. All the experiments are run on a computer with an Intel i7-3930K CPU, 32G main memory and an NVIDIA Tesla K40 card.

For proposal extraction, we employ selective search [25], which typically generates a large pool of proposals from every image in the PASCAL VOC 2007 dataset using the parameters suggested in [25]. Considering the computational time and the hardware limitation, we randomly sample a subset of these proposals per image for training and testing.

For the parameters of the Fisher vector, we follow [20] and first employ PCA to reduce the dimension of the original features while preserving around 90% of the energy; this is applied to both the VOC 2007 and 2012 datasets. After PCA, we generate the GMM codewords and encode each image with the improved Fisher vector (IFV), similar to [20].

For fine-tuning the LMNN CNN, we set the trade-off parameter α = 1 (see (6)) and the nearest neighbor number k̂ = 10 (for training). For combining the feature-view and label-view features, we select the trade-off parameter λ (specified at the end of Section 4.2) by cross-validation. For the nearest neighbor number k used in testing, a bigger k generally leads to better accuracy, but we observe no performance gain for k > 50. Thus, we set k = 50 for testing. For faster NN search, we employ FLANN [17] with "autotuned" parameters.

Image classification on VOC 2007: Table 2 reports our experimental results compared with state-of-the-art methods on VOC 2007. In the upper part of the table we compare with the hand-crafted feature based methods and the CNN based methods pre-trained on ILSVRC 2012 using an 8-layer network. To demonstrate the effectiveness of individual components, we consider three variations of our proposed framework: 'FeV', 'FeV+LV-10' and 'FeV+LV-20', where 'FeV' uses only the feature view (i.e., without the label-view features), 'FeV+LV-10' uses both the feature view and the label view with 10 categories of ground-truth bounding boxes of the training set, and 'FeV+LV-20' is the one with all 20 categories of ground-truth bounding boxes.

From the upper part of Table 2, we can see that using just the feature view ('FeV'), we already outperform the state-of-the-art proposal-based method ('HCP-1000C'), which suggests that the Fisher vector as a holistic representation for bags is superior to max-pooling. With all 20 categories of ground-truth bounding boxes of the training set, our multi-view framework ('FeV+LV-20') achieves a further performance gain.
This significant performance gain validates the effectiveness of the label view. Our framework shows good performance especially for difficult categories such as BOTTLE, COW, TABLE, MOTOR and PLANT.

If we use only the ground-truth bounding boxes from the first 10 categories (PLANE to COW), our framework ('FeV+LV-10') still outperforms the single feature view ('FeV') by a clear margin. As expected, using the bounding boxes of the categories from PLANE to COW boosts the performance of these categories, as shown in the table. However, it is interesting to see that the label view also improves the accuracies of unseen categories such as HORSE, PERSON and TV. This is mainly because the proposed label-view encoding is a form of local similarity representation, which can generalize quite well to unseen categories.

In the lower part of Table 2, we list the results of 'HCP-2000C' [26], which uses 1,000 additional categories from ImageNet that are semantically close to the VOC 2007 categories for CNN pre-training, and 'VeryDeep' [24], which densely extracts multiple CNN features from multiple scales and combines two very deep CNN models (16-layer and 19-layer). Our framework ('FeV+LV-20') still outperforms 'HCP-2000C', but is inferior to 'VeryDeep', since our framework is based on the common 8-layer CNN.

To demonstrate the potential of our framework, we replace the 8-layer CNNs in our framework by the 16-layer CNN model of [24], denoted as 'FeV+LV-20-VD'. Unlike [24], we do not use any data augmentation or multi-scale dense sampling in the feature extraction stage. Our 'FeV+LV-20-VD' outperforms 'VeryDeep'. By further averaging the scores of 'VeryDeep' [24] and 'FeV+LV-20-VD' (denoted as 'Fusion'), we achieve a state-of-the-art mAP. This suggests that our proposal-based framework and the multi-scale CNN features extracted from the whole image complement each other.
Table 2. Comparisons of the classification results (in %) of state-of-the-art approaches on VOC 2007 (TRAINVAL/TEST). The upper part shows the results of the hand-crafted feature based methods and the CNN based methods trained with an 8-layer CNN and the ILSVRC 2012 dataset. The lower part shows the results of the methods trained with very deep CNNs or with additional training data. Per-class AP values are listed in the order: PLANE, BIKE, BIRD, BOAT, BOTTLE, BUS, CAR, CAT, CHAIR, COW, TABLE, DOG, HORSE, MOTOR, PERSON, PLANT, SHEEP, SOFA, TRAIN, TV, followed by mAP.

AGS [9]:        82.2 83.0 58.4 76.1 56.4 77.5 88.8 69.1 62.2 61.8 64.2 51.3 85.4 80.2 91.1 48.1 61.7 67.7 86.3 70.9 | mAP 71.1
AMM [6]:        84.5 81.5 65.0 71.4 52.2 76.2 87.2 68.5 63.8 55.8 65.8 55.6 84.8 77.0 91.1 55.2 60.0 69.7 83.6 77.0 | mAP 71.3
CNN-SVM [23]:   88.5 81.0 83.5 82.0 42.0 72.5 85.3 81.6 59.9 58.5 66.3 77.8 81.8 78.8 90.2 54.8 71.1 62.6 87.4 71.8 | mAP 73.9
PRE-1000C [18]: 88.5 81.5 87.9 82.0 47.5 75.5 90.1 87.2 61.6 75.7 67.3 85.5 83.5 80.0 95.6 60.8 76.8 58.0 90.4 77.9 | mAP 77.7
HCP-1000C [26]: 95.1 90.1 92.8 89.9 51.5 80.0 91.7 91.6 57.7 77.8 70.9 89.3 89.3 85.2 93.0 64.0 …
HCP-2000C [26]: 96.0 92.1 93.7 93.4 58.7 84.0 93.4 92.0 62.8 89.1 76.3 91.4 95.0 87.8 93.1 69.9 90.3 68.0 96.8 80.6 | mAP 85.2
VeryDeep [24]:  …
Table 3. Comparisons of the classification results (in %) of state-of-the-art approaches on VOC 2012 (TRAINVAL/TEST). The upper part shows the results of the hand-crafted feature based methods and the CNN based methods trained with an 8-layer CNN and the ILSVRC 2012 dataset. The lower part shows the results of the methods trained with very deep CNNs or with additional training data. Per-class AP values are listed in the same class order as in Table 2, followed by mAP.

NUS-PSL [26]:   97.3 84.2 80.8 85.3 60.8 89.9 …
PRE-1512C [18]: 94.6 82.9 88.2 84.1 60.3 89.0 84.4 90.7 72.1 86.8 69.0 92.1 93.4 88.6 96.1 64.3 86.6 62.3 91.1 79.8 | mAP 82.8
HCP-2000C [26]: 97.5 84.3 93.0 89.4 62.5 90.2 84.6 94.8 69.7 90.2 74.1 93.4 93.7 88.8 93.3 59.7 90.3 61.8 94.4 78.0 | mAP 84.2
[19]:           96.7 88.8 92.0 87.4 64.7 91.1 87.4 94.4 74.9 89.2 76.3 93.7 95.2 91.1 97.6 66.2 91.2 70.0 94.5 83.7 | mAP 86.3
VeryDeep [24]:  99.0 89.1 …

Image classification on VOC 2012: Table 3 reports our experimental results compared with those of the state-of-the-art methods on VOC 2012. Similar to Table 2, we compare with the hand-crafted feature based methods and the CNN based methods pre-trained on ILSVRC 2012 using the 8-layer CNN model in the upper part, and the methods trained with additional data or very deep CNN models in the lower part.

The results are consistent with those on VOC 2007. Our framework using only the feature view ('FeV') already outperforms the state-of-the-art hand-crafted feature method ('NUS-PSL') and the state-of-the-art proposal-based CNN method ('HCP-1000C'). With the aid of the label view, our 'FeV+LV-20' obtains an additional performance gain, even outperforming the two proposal-based methods pre-trained on 512 or 1,000 additional categories of image data ('PRE-1512C' and 'HCP-2000C'), and is comparable to [19]. Employing just 10 categories of bounding boxes, the mAP performance of our 'FeV+LV-10' does not degrade much.

When employed with the very deep 16-layer CNN model [24], our framework ('FeV+LV-20-VD') achieves similar performance to 'VeryDeep'. When we average the scores of [24] and our proposal-based representation, our method ('Fusion') achieves the state-of-the-art result, outperforming [26].
6. Conclusion
In this paper, we have proposed a multi-view multi-instance framework for solving the multi-label classification problem. Compared with existing works, our framework makes use of strong labels to provide another view of local information (the label view) and combines it with the typical feature-view information to boost the discriminative power of feature extraction for multi-label images. The experimental results validate the discriminative power and the generalization ability of the proposed framework.

For future directions, there are several possibilities to explore. First, we can improve the scalability, and possibly also the performance, of the framework by establishing a proposal selection criterion to filter out noisy proposals. Second, we may build a suitable candidate pool directly from the extracted proposals to eliminate the need for strong labels.
Acknowledgments
This research is supported by Singapore MoE AcRF Tier-1 Grant RG138/14 and is also partially supported by the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS 15, pages 561-568. MIT Press, 2003.
[2] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[3] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
[4] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4:888-900, 1992.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[6] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan. Contextualizing object detection and classification. IEEE TPAMI, 37(1):13-27, 2015.
[7] Y. Chen, J. Bi, and J. Z. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):1931-1947, Dec. 2006.
[8] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In IEEE CVPR, 2014.
[9] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan. Subcategory-aware object classification. In CVPR, pages 827-834. IEEE, 2013.
[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303-338, June 2010.
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627-1645, Sept. 2010.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell., 18(6):607-616, June 1996.
[14] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC, September 2014.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS 25, pages 1097-1105. Curran Associates, Inc., 2012.
[16] N. Li, R. Jin, and Z.-H. Zhou. Top rank optimization in linear time. In NIPS 27, pages 1502-1510, 2014.
[17] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE PAMI, 36, 2014.
[18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[19] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. Technical Report HAL-01015140, INRIA, 2014.
[20] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[21] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.
[22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[23] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, June 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[25] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[26] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. CoRR, abs/1406.5726, 2014.
[27] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207-244, June 2009.
[28] X.-S. Wei, J. Wu, and Z.-H. Zhou. Scalable multi-instance learning. In ICDM, 2014.
[29] X. Xu and E. Frank. Logistic regression and boosting for labeled bags of instances. In PAKDD, pages 272-281. Springer-Verlag, 2004.
[30] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In AAAI, pages 543-548, 2006.
[31] A. Yu and K. Grauman. Predicting useful neighborhoods for lazy local learning. In NIPS 27, pages 1916-1924, 2014.
[32] Q. Zhang, G. Hua, W. Liu, Z. Liu, and Z. Zhang. Can visual recognition benefit from auxiliary information in training? In ACCV, 2014.
[33] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-i.i.d. samples. In ICML, pages 1249-1256, 2009.
[34] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.