Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification
Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang

University of Science and Technology of China; University of Sydney; Department of Electronic Engineering, The Chinese University of Hong Kong

[email protected], {hsli,wlouyang,xgwang}@ee.cuhk.edu.hk, [email protected]

Abstract
Multi-label image classification is a fundamental but challenging task in computer vision. Great progress has been achieved by exploiting semantic relations between labels in recent years. However, conventional approaches are unable to model the underlying spatial relations between labels in multi-label images, because spatial annotations of the labels are generally not provided. In this paper, we propose a unified deep neural network that exploits both semantic and spatial relations between labels with only image-level supervisions. Given a multi-label image, our proposed Spatial Regularization Network (SRN) generates attention maps for all labels and captures the underlying relations between them via learnable convolutions. By aggregating the regularized classification results with the original results by a ResNet-101 network, the classification performance can be consistently improved. The whole deep neural network is trained end-to-end with only image-level annotations, and thus requires no additional efforts on image annotations. Extensive evaluations on 3 public datasets with different types of labels show that our approach significantly outperforms state-of-the-arts and has strong generalization capability. Analysis of the learned SRN model demonstrates that it can effectively capture both semantic and spatial relations of labels for improving classification performance.
1. Introduction
Multi-label image classification is an important task in computer vision with various applications, such as scene recognition [4, 30, 31], multi-object recognition [25, 19, 18], human attribute recognition [24], etc. Compared to single-label image classification [6, 7, 12], which has been extensively studied, the multi-label problem is more practical and challenging, as real-world images are usually associated with multiple labels, such as objects or attributes. The binary relevance method [34] is an easy way to extend single-label algorithms to multi-label classification: it simply trains one binary classifier for each label.
Figure 1. Illustration of using our proposed Spatial Regularization Net (SRN) for improving multi-label image classification. The SRN learns semantic and spatial label relations from label attention maps with only image-level supervisions.
Various loss functions have been investigated in [11]. To cope with the problem that labels may relate to different visual regions over the whole image, proposal-based approaches [38] have been proposed to transform the multi-label classification problem into multiple single-label classification tasks. However, these modifications of existing single-label algorithms ignore the semantic relations of labels.

Recent progress on multi-label image classification has mainly focused on capturing semantic relations between labels. Such relations or dependencies can be modeled by probabilistic graphical models [23, 22], structured inference neural networks [16], or Recurrent Neural Networks (RNNs) [36]. Despite the great improvements achieved by exploiting semantic relations, existing methods cannot capture spatial relations of labels, because their spatial locations are not annotated for training. In this paper, we propose to capture both semantic and spatial relations of labels with a Spatial Regularization Network in a unified framework (Figure 1), which can be trained end-to-end with only image-level supervisions and thus requires no additional annotations.

Deep Convolutional Neural Networks (CNNs) [21, 33, 32, 13] have achieved great success on single-label image classification in recent years. Because of their strong capability in learning discriminative features, deep CNN models pre-trained on large datasets can be easily transferred to other tasks to boost their performance. However, the feature representations might not be optimal for images with multiple labels, since a ground-truth label might semantically relate to only a small region of the image. The diverse and complex contents in multi-label images make it difficult to learn effective feature representations and classifiers.

Inspired by the recent success of attention mechanisms in many vision tasks [40, 43, 15], we propose a deep neural network for multi-label classification, which contains a sub-network, the Spatial Regularization Net (SRN), that learns spatial regularizations between labels with only image-level supervisions. The SRN learns an attention map for each label, which associates related image regions with the label. By performing learnable convolutions on the attention maps of all labels, the SRN captures the underlying semantic and spatial relations between labels and acts as a spatial regularization for multi-label classification.

The contributions of this paper are as follows. 1) We propose an end-to-end deep neural network for multi-label image classification, which exploits both semantic and spatial relations of labels by training learnable convolutions on the attention maps of labels. Such relations are learned with only image-level supervisions. Investigation and visualization of the learned models demonstrate that our model can effectively capture semantic and spatial relations of labels. 2) Our proposed algorithm has strong generalization capability and works well on data with different types of labels. We comprehensively evaluate our method on 3 publicly available datasets, NUS-WIDE [5] (81 concept labels), MS-COCO [25] (80 object labels), and WIDER-Attribute [24] (14 human attribute labels), showing significant improvements over state-of-the-art approaches.
2. Related Work
Multi-label classification has applications in many areas, such as document topic categorization [29, 10], music annotation and retrieval [35], scene recognition [4], and gene functional analysis [2]. Comprehensive reviews of general multi-label classification methods can be found in [34, 44]. In this work, we focus on multi-label image classification methods based on deep learning techniques.

A simple way of adapting existing single-label methods to the multi-label setting is to learn an independent classifier for each label [34]. The recent success of deeply-learned features [21] for single-label image classification has boosted the accuracy of multi-label classification. Based on such deep features, Gong et al. [11] evaluated various loss functions and found that a weighted approximate ranking loss worked best with CNNs. Proposal-based approaches showed promising performance in object detection [8], and similar ideas have also been explored for multi-label image classification. Wei et al. [38] converted multi-label problems into a set of multi-class problems over region proposals; classification results for whole images were obtained by max-pooling label confidences over all proposals. Yang et al. [42] treated images as bags of instances/proposals and solved a multi-instance learning problem. The above approaches ignore label relations in multi-label images.

Approaches that learn to capture label relations have also been proposed. Read et al. [28] extended the binary relevance method by training a chain of binary classifiers, where each classifier makes predictions based on both image features and previously predicted labels. A more common way of modeling label relations is to use probabilistic graphical models [20]. There are also methods for determining the structures of the label relation graphs. Xue et al. [41] directly thresholded the label correlation matrix to obtain the label structure. Li et al. [23] used a maximum spanning tree over the mutual information matrix of labels to create the graph. Li et al. [22] proposed to learn image-dependent conditional label structures based on the graphical lasso framework [27]. Recently, deep neural networks have also been explored for learning label relations. Hu et al. [16] proposed a structured inference neural network that transfers predictions across multiple concept layers. Wang et al. [36] treated multi-label classification as a sequential prediction problem and modeled label dependency with Recurrent Neural Networks (RNNs). Although classification accuracy has been greatly improved by learning semantic relations of labels, the above approaches fail to explore the underlying spatial relations between labels.

Attention mechanisms have proven to be beneficial in many vision tasks, such as visual tracking [3], object recognition [26, 1], image captioning [40], image question answering [43], and segmentation [15]. The spatial attention mechanism adaptively focuses on related regions of the image when the deep networks are trained with spatially-related labels. In this paper, we utilize the attention mechanism for improving multi-label image classification; it captures the underlying spatial relations of labels and provides spatial regularization for the final classification results.
3. Methodology
We propose a deep neural network for multi-label classification, which utilizes image-level supervisions for learning spatial regularizations on multiple labels. The overall framework of our approach is shown in Figure 2. The main net has the same network structure as ResNet-101 [13]. The proposed Spatial Regularization Net (SRN) takes visual features from the main net as inputs and learns to regularize spatial relations between labels. Such relations are exploited based on the learned attention maps for the multiple labels. Label confidences from both the main net and the SRN are aggregated to generate the final classification confidences. The whole network is a unified framework and is trained in an end-to-end manner.
Figure 2. Overall framework of our approach. (Top) The main net follows the structure of ResNet-101 and learns one independent classifier for each label. "Res-2048" stands for one ResNet building block with 2048 output channels. (Bottom) The proposed SRN captures spatial and semantic relations of labels with the attention mechanism. Dashed lines indicate weakly-supervised pre-training for attention maps.
The main net follows the structure of ResNet-101 [13], which is composed of repetitive building blocks with different output dimensions. Specifically, the block structure proposed in [14] is adopted. The 14×14 feature map (for 224×224 input images) from layer "res4b22_relu" of the main net is used as the input for the SRN, which is of sufficient resolution to learn spatial regularizations in our experiments.

Let I denote an input image with ground-truth labels y = [y^1, y^2, ..., y^C]^T, where y^l is a binary indicator: y^l = 1 if image I is tagged with label l, and y^l = 0 otherwise. C is the number of all possible labels in the dataset. The main net conducts binary classification for each of the C labels,

    X = f_cnn(I; θ_cnn),        X ∈ R^{14×14×1024},    (1)
    ŷ_cls = f_cls(X; θ_cls),    ŷ_cls ∈ R^C,           (2)

where X is the feature map from layer "res4b22_relu", and ŷ_cls = [ŷ^1_cls, ..., ŷ^C_cls]^T contains the label confidences predicted by the main net. Prediction errors of the main net are measured based on ŷ_cls and the ground-truth labels y.

The proposed SRN is composed of two successive sub-networks: the first sub-network f_att(X; θ_att) learns label attention maps with image-level supervisions (Section 3.2), and the second sub-network f_sr(U; θ_sr) captures spatial regularizations of labels (Section 3.3) based on the learned label attention maps.

A multi-label image is composed of multiple image regions that are semantically related to different labels. Although the region locations are generally not provided by the image-level supervisions, when predicting one label's existence, it is desirable that more attention is paid to the related regions. In our work, the neural network learns to predict such related image regions for each label with image-level supervisions using the attention mechanism. The learned attention maps can then be used to learn spatial regularizations for the labels.

Given input visual features X ∈ R^{14×14×1024} from layer "res4b22_relu" of the main net, we aim to automatically generate label attention values for each individual label,

    Z = f_att(X; θ_att),    Z ∈ R^{14×14×C},    (3)

where Z contains the unnormalized label attention values produced by f_att(·), with each channel corresponding to one label. Following [40], Z is spatially normalized with the softmax function to obtain the final label attention maps A,

    a^l_{i,j} = exp(z^l_{i,j}) / Σ_{i,j} exp(z^l_{i,j}),    A ∈ R^{14×14×C},    (4)

where z^l_{i,j} and a^l_{i,j} represent the unnormalized and normalized attention values at location (i, j) for label l. Intuitively, if label l is tagged to the input image, the image regions related to it should be assigned higher attention values. The attention estimator f_att(·) is modeled as 3 convolution layers with 512 kernels of 1×1, 512 kernels of 3×3, and C kernels of 1×1, respectively. ReLU nonlinearities are applied after the first two convolution layers.

Since ground-truth annotations of attention maps are not available, f_att(X; θ_att) is learned with only image-level multi-label supervisions. Let x_{i,j} ∈ R^{1024} denote the visual feature vector at location (i, j) of X. In the original ResNet, the visual features are averaged across all spatial locations for classification, i.e., (1/196) Σ_{i,j} x_{i,j}. Since we expect the attention map A^l for each label l to have higher values at the label-related regions, and Σ_{i,j} a^l_{i,j} = 1 for all l, the attention maps can be used to compute a weighted average of the visual features X for each label l,

    v^l = Σ_{i,j} x_{i,j} a^l_{i,j},    v^l ∈ R^{1024}.    (5)
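As a concrete illustration of Eqs. (3)-(5): the paper implements the network in Caffe, so the following PyTorch-style sketch is only meant to make the shapes and operations explicit; module and variable names are our own, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAttention(nn.Module):
    """Sketch of the attention estimator f_att (Eqs. 3-5): three convolutions
    (1x1x512, 3x3x512, 1x1xC) followed by a per-label spatial softmax."""
    def __init__(self, in_channels=1024, num_labels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, num_labels, kernel_size=1))

    def forward(self, x):                      # x: (B, 1024, 14, 14) from "res4b22_relu"
        z = self.conv(x)                       # unnormalized attention values Z: (B, C, 14, 14)
        b, c, h, w = z.shape
        # Eq. 4: softmax over the 14x14 locations, independently for each label channel
        a = F.softmax(z.view(b, c, h * w), dim=2).view(b, c, h, w)
        # Eq. 5: attention-weighted average of the visual features, one vector per label
        v = torch.einsum('bchw,bkhw->bck', a, x)   # (B, C, 1024)
        return a, v
```

Because the softmax is applied per label channel, each attention map A^l sums to 1 over the 14×14 locations, as required by Eq. (4).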
Compared with the original averaged visual features shared by all labels, the weighted-average visual feature vector v^l is more related to the image regions corresponding to label l. Each such feature vector is then used to learn a linear classifier for estimating label l's confidence,

    ŷ^l_att = W^l v^l + b^l,    (6)

where W^l and b^l are the classifier parameters for label l. For all labels, ŷ_att = [ŷ^1_att, ..., ŷ^C_att]^T. Using only the image-level supervisions y for training, the attention estimator parameters are learned by minimizing the cross-entropy loss between ŷ_att and y (see the dashed lines in Figure 2).

The attention estimator network f_att(·) can effectively learn attention maps for each label. Learned attention maps for an image are illustrated in Figure 3. They show that the weakly-supervised attention model can effectively capture the related visual regions for each label. For example, "sunglass" focuses on the face region, while "longPants" pays more attention to the legs. The negative labels also focus on reasonable regions; for example, "Hat" tries to find a hat in the region of the head.

Figure 3. Examples of learned attention maps from the WIDER-Attribute dataset. Labels in red are ground-truth labels. "Weighted Attention" is the attention map weighted by the corresponding label confidence (Eq. (8)).

For efficient learning of the attention maps, recall that Σ_{i,j} a^l_{i,j} = 1, so Eq. (6) can be rewritten as

    ŷ^l_att = Σ_{i,j} a^l_{i,j} (W^l x_{i,j} + b^l).    (7)

This equation can be viewed as applying a label-specific linear classifier at every location of the feature map X, and then spatially aggregating the label confidences based on the attention maps. In our implementation, the linear classifiers are modeled as a convolution layer with C kernels of size 1×1 ("conv1" in Figure 2). The output of this layer is a confidence map S ∈ R^{14×14×C}, where its l-th channel is S^l = W^l ∗ X + b^l, with ∗ denoting the convolution operation. The label attention map A and the confidence map S are element-wisely multiplied and then spatially sum-pooled to obtain the label confidence vector ŷ_att. This formulation leads to an easy-to-implement network for learning label attentions, and it generates confidence maps for weighting the attention maps in the SRN.

Label attention maps encode rich spatial information of labels and can be used to generate more robust spatial regularizations for labels. However, the attention map for each label always sums up to 1 (see Figure 3), even for labels that are not present in the image, which may highlight wrong locations. Learning from such label-not-existing attention maps might lead to wrong spatial regularizations. Therefore, we propose to learn spatial regularizations from weighted attention maps U ∈ R^{14×14×C},

    U = σ(S) ◦ A,    (8)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function that converts the label confidences S to the range [0, 1], and ◦ indicates element-wise multiplication. The weighted attention maps U encode both the local confidences of attention and the global visibility of each label, as shown in Figure 3.
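The efficient formulation of Eqs. (7)-(8) can be sketched in the same illustrative PyTorch style; again, this is our own re-implementation under assumed names, not the authors' Caffe definition.

```python
import torch
import torch.nn as nn

class AttentionConfidence(nn.Module):
    """Sketch of Eqs. (7)-(8): a 1x1 convolution ("conv1") applies the per-label
    linear classifiers at every location, giving a confidence map S; the attention
    confidences y_att come from sum-pooling A*S, and U = sigmoid(S) * A."""
    def __init__(self, in_channels=1024, num_labels=80):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, num_labels, kernel_size=1)  # W^l, b^l for all labels

    def forward(self, x, a):                   # x: (B, 1024, 14, 14), a: (B, C, 14, 14)
        s = self.conv1(x)                      # confidence map S: (B, C, 14, 14)
        y_att = (a * s).sum(dim=(2, 3))        # Eq. 7: element-wise product, then spatial sum-pooling
        u = torch.sigmoid(s) * a               # Eq. 8: confidence-weighted attention maps U
        return y_att, u
```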
Given the weighted attention maps U, a label regularization function is required to estimate the label confidences based on the label spatial information in U,

    ŷ_sr = f_sr(U; θ_sr),    ŷ_sr ∈ R^C,    (9)

where ŷ_sr = [ŷ^1_sr, ŷ^2_sr, ..., ŷ^C_sr]^T contains the label confidences predicted by the label regularization function.

Since the weighted attention maps for all labels are spatially aligned, it is easy to capture their relative relations with stacked convolution operations. The convolutions should have large enough receptive fields to capture the complex spatial relations between the labels. However, a naive implementation might be problematic. If we only use one convolution layer with 2048 filters of size 14×14 (each spanning all C input channels), the number of additional parameters would be about 0.4C million. For a dataset with 80 different labels, this amounts to roughly 32 million additional parameters, which is comparable to the total parameter count of the original ResNet-101. Such a large number of additional parameters would make the network difficult to train.

We propose to decouple semantic relation learning and spatial relation learning into different convolution layers. The intuition is that one label may only semantically relate to a small number of other labels, and measuring spatial relations with unrelated attention maps is unnecessary. f_sr(U; θ_sr) is implemented as three convolution layers with ReLU nonlinearity followed by one fully-connected layer, as shown in Figure 4. The first two layers capture semantic relations of labels with 1×1 convolutions, and the third layer explores spatial relations using 2048 kernels of size 14×14. The filters of the third convolution layer are grouped, with each group of 4 kernels corresponding to one feature channel of the input feature map. The 4 kernels in each group convolve the same feature channel independently, and different kernels in one group capture different spatial relations of semantically related labels. Experimental results show that the proposed SRN provides effective regularization of the classification results based on semantic and spatial relations of labels, with only about 6 million additional parameters.
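Under the layer sizes described above, f_sr could be sketched as follows (a PyTorch-style sketch under our own naming; whether a ReLU precedes the fully-connected layer is not fully specified in the text, so that detail is an assumption). The grouped convolution is what restricts each of the 2048 spatial filters to a single input channel.

```python
import torch
import torch.nn as nn

class SpatialRegularization(nn.Module):
    """Sketch of f_sr (Figure 4): two 1x1 convolutions mix semantically related
    labels, a grouped 14x14 convolution (4 single-channel filters per input
    channel) captures spatial patterns, and a fully-connected layer outputs y_sr."""
    def __init__(self, num_labels=80):
        super().__init__()
        self.conv2 = nn.Conv2d(num_labels, 512, kernel_size=1)
        self.conv3 = nn.Conv2d(512, 512, kernel_size=1)
        # 2048 output channels = 512 groups x 4 filters; each group sees one input channel
        self.conv4 = nn.Conv2d(512, 2048, kernel_size=14, groups=512)
        self.relu = nn.ReLU(inplace=True)
        self.fc = nn.Linear(2048, num_labels)

    def forward(self, u):                      # u: (B, C, 14, 14) weighted attention maps
        h = self.relu(self.conv2(u))
        h = self.relu(self.conv3(h))
        h = self.relu(self.conv4(h))           # (B, 2048, 1, 1) after the 14x14 convolution
        return self.fc(h.flatten(1))           # y_sr: (B, C)
```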
The final label confidences are an aggregation of the outputs of the main net and the SRN, ŷ = α ŷ_cls + (1 − α) ŷ_sr, where α is a weighting factor. Although this factor could also be learned, we fix α = 0.5 and do not observe a performance drop. The whole network is trained with the cross-entropy loss on the ground-truth labels y,

    F_loss(y, ŷ) = Σ_{l=1}^{C} y^l log σ(ŷ^l) + (1 − y^l) log(1 − σ(ŷ^l)).    (10)

We train the network in multiple steps. First, we fine-tune only the main net, pre-trained on the 1000-class ImageNet classification task [6], on the target dataset; both f_cnn(I; θ_cnn) and f_cls(X; θ_cls) are learned with the cross-entropy loss F_loss(y, ŷ_cls). Second, we fix f_cnn and f_cls and train f_att(X; θ_att) and "conv1" (see the dashed lines in Figure 2) with the loss F_loss(y, ŷ_att). Third, we train f_sr(U; θ_sr) with the cross-entropy loss F_loss(y, ŷ_sr), fixing all other sub-networks. Finally, the whole network is jointly fine-tuned with the loss F_loss(y, ŷ) + F_loss(y, ŷ_att).

Our deep neural network is implemented with the Caffe library [17]. To avoid over-fitting, we adopt the image augmentation strategies suggested in [37]: the input images are first resized and then cropped at the four corners and the center, with the crop width and height randomly chosen from a predefined set of scales, and the cropped patches are finally resized to 224×224. We employ the stochastic gradient descent algorithm for training, with a batch size of 96, a momentum of 0.9, and a weight decay of 0.0005. The learning rate is decreased whenever the validation loss saturates. We train our model with 4 NVIDIA Titan X GPUs; for MS-COCO, training takes about 16 hours for all steps. For testing, we simply resize all images to 224×224 and conduct single-crop evaluation.
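For completeness, the aggregation ŷ = αŷ_cls + (1 − α)ŷ_sr and the joint fine-tuning objective F_loss(y, ŷ) + F_loss(y, ŷ_att) can be sketched as below; function and argument names are ours. Note that the built-in binary cross-entropy averages over labels and samples and is written as a negative log-likelihood, whereas Eq. (10) is written as a sum, so the two differ only in sign and scaling conventions.

```python
import torch
import torch.nn.functional as F

def srn_joint_loss(y_cls, y_att, y_sr, targets, alpha=0.5):
    """Illustrative sketch of the joint fine-tuning objective: the aggregated
    confidences are supervised as in Eq. (10), plus the auxiliary attention loss.
    All inputs are (B, C) logits; `targets` is a (B, C) tensor of 0/1 labels."""
    y_hat = alpha * y_cls + (1.0 - alpha) * y_sr          # aggregated label confidences
    loss_main = F.binary_cross_entropy_with_logits(y_hat, targets)  # per-label sigmoid cross-entropy
    loss_att = F.binary_cross_entropy_with_logits(y_att, targets)   # auxiliary loss on y_att
    return loss_main + loss_att
```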
Figure 4. Detailed network structure of f_sr(·) for learning spatial regularizations from weighted attention maps. The first two layers ("conv2" and "conv3") are convolution layers with multi-channel filters, while "conv4" is composed of single-channel filters. Every 4 filters of "conv4" convolve with the same feature channel output by "conv3" to limit the parameter size.
4. Experiments
Our approach is evaluated with three benchmark datasets with different types of labels: NUS-WIDE [5] with 81 concept labels, MS-COCO [25] with 80 object labels, and WIDER-Attribute [24] with 14 human attribute labels. Experimental results show that our approach significantly outperforms state-of-the-arts on all three datasets and has strong generalization capability to different types of labels. Analysis of the learned deep models demonstrates that our proposed approach can effectively capture both semantic and spatial relations of labels. (Code and trained models are available at https://github.com/zhufengx/SRN_multilabel.)

Evaluation Metrics. A comprehensive study of evaluation metrics for multi-label classification is presented in [39]. We employ macro/micro precision, macro/micro recall, macro/micro F1-measure, and mean Average Precision (mAP) for performance comparison. For precision/recall/F1-measure, a label is predicted as positive if its estimated confidence is greater than 0.5. Macro precision (denoted as "P-C") is evaluated by averaging per-class precisions, while micro precision (denoted as "P-O") is an overall measure which counts true predictions for all images over all classes. Similarly, we evaluate macro/micro recall ("R-C"/"R-O") and macro/micro F1-measure ("F1-C"/"F1-O"). Mean Average Precision is the mean of the per-class average precisions. These metrics do not require a fixed number of labels for each image; generally, mAP, F1-C, and F1-O are the most important. To fairly compare with state-of-the-arts, we also evaluate precision, recall, and F1-measure under the constraint that each image is predicted with its top-3 labels. To obtain such top-3 labels in our approach, the 3 labels with the highest confidences are selected for each image even if their confidence values are lower than 0.5. However, we argue that outputting a variable number of labels for each image is more practical for real-world applications. Therefore, we report our results both with and without the top-3 label constraint.
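One common way to compute the per-class ("-C") and overall ("-O") precision/recall/F1 described above is sketched below. The array and function names are ours, and conventions can differ slightly (e.g., whether F1-C is derived from macro-averaged P/R, as here, or averaged over per-class F1 values), so this is only an illustrative reference, not the paper's evaluation code.

```python
import numpy as np

def multilabel_prf(scores, labels, threshold=0.5):
    """scores, labels: (num_images, num_classes) arrays; labels are 0/1.
    A label is predicted positive when its confidence exceeds the threshold."""
    pred = (scores > threshold).astype(np.float64)
    eps = 1e-12
    tp = (pred * labels).sum(axis=0)                       # true positives per class
    p_c = (tp / (pred.sum(axis=0) + eps)).mean()           # macro precision: average per-class
    r_c = (tp / (labels.sum(axis=0) + eps)).mean()         # macro recall
    f1_c = 2 * p_c * r_c / (p_c + r_c + eps)
    p_o = tp.sum() / (pred.sum() + eps)                    # micro precision: over all sample-label pairs
    r_o = tp.sum() / (labels.sum() + eps)                  # micro recall
    f1_o = 2 * p_o * r_o / (p_o + r_o + eps)
    return dict(P_C=p_c, R_C=r_c, F1_C=f1_c, P_O=p_o, R_O=r_o, F1_O=f1_o)
```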
Table 1. Quantitative results by our proposed ResNet-SRN and compared methods on the NUS-WIDE dataset. "mAP", "F1-C", "P-C", and "R-C" are evaluated for each class before averaging. "F1-O", "P-O", and "R-O" are averaged over all sample-label pairs.

Method               |                 All                 |             top-3
                     | mAP  F1-C P-C  R-C  F1-O P-O  R-O   | F1-C P-C  R-C  F1-O P-O  R-O
KNN [5]              |  -    -    -    -    -    -    -    | 24.3 32.6 19.3 47.6 42.9 53.4
WARP [11]            |  -    -    -    -    -    -    -    | 33.5 31.7 35.6 53.9 48.6 60.5
CNN-RNN [36]         |  -    -    -    -    -    -    -    | 34.7 40.5 30.4 55.2 49.9 61.7
ResNet-101 [13]      | 59.8 55.7 65.8 51.9 72.5 75.9 69.5  | 47.0 46.9 56.8 61.7 55.8 69.1
ResNet-107           | 59.5 55.6 65.4 52.2 72.6 75.5 70.0  | 46.9 46.7 56.8 61.8 55.9 69.2
ResNet-101-semantic  | 60.1 54.9                           |
ResNet-SRN           |                                     |
Compared Methods. For the NUS-WIDE and MS-COCO datasets, we compare with state-of-the-art methods on these datasets, including CNN-RNN [36], WARP [11], and KNN [5]. CNN-RNN explored semantic relations of labels, while the other methods did not. For the WIDER-Attribute dataset, R-CNN [8], R*CNN [9], and DHC [24] are compared; both R*CNN and DHC explored spatial context surrounding human bounding boxes. For our approach (denoted as "ResNet-SRN"), one variant is also explored, which learns spatial regularizations from the unweighted attention maps A instead of U to evaluate the necessity of weighting the attention maps; it is denoted as "ResNet-SRN-att".

We also design three baseline methods to further validate the effectiveness of the proposed Spatial Regularization Net. The first baseline is the original ResNet-101 (denoted as "ResNet-101") fine-tuned on each of the datasets. For the second baseline, since the proposed SRN has about 6 million additional parameters compared with ResNet-101, which is approximately equal to two ResNet building blocks with 2048 output feature channels, we add two such residual blocks following the last block of ResNet-101 (the layer "res5c_relu") to create a "ResNet-107" model. For the third baseline, we investigate learning semantic relations of labels based on the initial label confidences produced by ResNet-101: the initial confidences are concatenated with the visual features from the "pool5" layer, and two 2048-neuron fully-connected layers followed by one C-neuron fully-connected layer capture label semantic relations from the concatenated features to generate the final label confidences. We refer to this model as "ResNet-101-semantic" in our experiments.

NUS-WIDE [5]. This dataset contains 269,648 images and associated tags from Flickr. The dataset is manually annotated with 81 concepts, with 2.4 concept labels per image on average. The concepts include events/activities (e.g., "swimming", "running"), scenes/locations (e.g., "airport", "ocean"), and objects (e.g., "animal", "car"). We train our approach to predict the 81 concept labels. The official train/test split is used, i.e., 161,789 images for training/validation and 107,859 images for testing.

Experimental results on this dataset are shown in Table 1. Our proposed ResNet-SRN and its variant ResNet-SRN-att outperform all state-of-the-art and baseline models. With the advances of deep network structures, even our baseline ResNet-101 achieves better performance than existing state-of-the-arts, which mainly results from the learning capability of ResNet-101 with its deep learnable layers. When adding more layers to match the parameter size of our proposed SRN, ResNet-107 shows performance very close to ResNet-101, which suggests that the capacity of ResNet-101 is sufficient on NUS-WIDE and that adding more parameters does not lead to a performance increase. Utilizing predicted labels as context (ResNet-101-semantic) does not improve performance on this dataset. In contrast, by exploring spatial and semantic relations of labels, our proposed ResNet-SRN model outperforms all baseline methods. Comparing ResNet-SRN with ResNet-SRN-att also shows that the weighted attention map U is more informative for learning spatial regularizations than the unweighted attention map A.

Forcing the algorithm to predict a fixed number (k = 3, as proposed in state-of-the-art methods) of labels for each image may not fully reflect the algorithm's actual performance. When removing the constraint (Section 4.1), we observe significant performance improvements (e.g., from 48.9 to 58.5 for the F1-C metric of ResNet-SRN).
MS-COCO [25]. This dataset is primarily built for object recognition in the context of scene understanding. The training set is composed of 82,783 images, which contain common objects in scenes. The objects are categorized into 80 classes, with about 2.9 object labels per image. Since the ground-truth labels of the test set are not available, we evaluate all methods on the validation set (40,504 images). The number of labels per image varies considerably on MS-COCO. Following [36], when evaluating with top-3 label predictions, we filter out labels with probability lower than 0.5 for each image, so an image may return fewer than k = 3 labels.
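A minimal sketch of this top-3 protocol for a single image's score vector (function and variable names are ours):

```python
import numpy as np

def top3_predictions(scores, k=3, threshold=0.5):
    """Keep the k highest-scoring labels, but drop those whose confidence is
    below the threshold, so an image may return fewer than k labels."""
    top_k = np.argsort(scores)[::-1][:k]           # indices of the k highest-scoring labels
    return [int(l) for l in top_k if scores[l] >= threshold]
```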
Table 2. Quantitative results by our proposed ResNet-SRN and compared methods on the MS-COCO validation set. "mAP", "F1-C", "P-C", and "R-C" are evaluated for each class before averaging. "F1-O", "P-O", and "R-O" are averaged over all sample-label pairs.

Method               |                 All                 |             top-3
                     | mAP  F1-C P-C  R-C  F1-O P-O  R-O   | F1-C P-C  R-C  F1-O P-O  R-O
WARP [11]            |  -    -    -    -    -    -    -    | 55.7 59.3 52.5 60.7 59.8 61.4
CNN-RNN [36]         |  -    -    -    -    -    -    -    | 60.4 66.0 55.6 67.8 69.2
ResNet-101 [13]      | 75.2 69.5 80.8 63.4 74.4 82.2 68.0  | 65.9 84.3 57.4 71.7 86.5 61.3
ResNet-107           | 75.4 69.7 80.9 63.7 74.5 82.1 68.2  | 66.1 84.4 57.6 71.8 86.4 61.4
ResNet-101-semantic  | 75.5 69.9 81.1 63.8 74.8 82.1 68.6  | 66.2 84.3 57.7 72.0 86.3 61.8
ResNet-SRN-att       | 76.1 70.0 81.2 63.3 75.0            |
Quantitative results on MS-COCO are presented in Table 2. The comparison results are similar to those on NUS-WIDE. Based on the ResNet-101 network, all baseline models perform better than the state-of-the-art approaches. ResNet-107 shows a minor improvement over ResNet-101. Due to more labels per image (3.5 labels on MS-COCO compared with 2.4 labels on NUS-WIDE), exploring label semantic relations with ResNet-101-semantic is helpful, but the improvement is limited (e.g., from 75.2 to 75.5 in terms of mAP). Both ResNet-SRN and ResNet-SRN-att show superior performance over the baseline models, while the spatial regularizations learned from weighted attention maps perform better (e.g., ResNet-SRN boosts mAP to 77.1, compared with 76.1 for ResNet-SRN-att).

WIDER-Attribute [24]. This dataset contains 13,789 images and 57,524 human bounding boxes. The task is to predict the existence of 14 human attributes for each annotated person. Each image is also labeled with an event label from 30 event classes for context learning. For our approach, each person is cropped from the full image based on the bounding box annotations and then used for training and testing. The training/validation and testing sets contain 28,340 and 29,177 persons, respectively. WIDER-Attribute also contains unspecified labels; we treat these unspecified labels as negative labels during training, and they are excluded from evaluation in testing following the settings of [24].

Experimental results are shown in Table 3. All ResNet models outperform the state-of-the-arts, R-CNN [8], R*CNN [9], and DHC [24], and our proposed ResNet-SRN performs best. It is important to note that R*CNN and DHC explore visual context surrounding the target human by taking full images and human bounding boxes as input, and the event labels associated with each image are also used for training in DHC. In contrast, our approach and baselines only utilize the cropped image patches, without using the event labels. Nevertheless, ResNet-SRN and ResNet-SRN-att show consistent improvements over state-of-the-arts and baseline methods. This result indicates that the proposed SRN can capture the spatial relations of human attributes with image-level supervisions, and that these learned spatial regularizations help predict human attributes.
Table 3. Quantitative results by our proposed ResNet-SRN and compared methods on the WIDER-Attribute dataset. "mAP" and "F1-C" are evaluated for each class before averaging. "F1-O" is averaged over all sample-label pairs.

Method               | All: mAP  F1-C  F1-O
R-CNN [8]            |     80.0    -     -
R*CNN [9]            |     80.5    -     -
DHC [24]             |     81.3    -     -
ResNet-101 [13]      |     85.0  74.7  80.4
ResNet-107           |     85.0  74.8  80.6
ResNet-101-semantic  |     85.1  74.8  80.5
ResNet-SRN-att       |     85.4  74.9  80.8
ResNet-SRN           |
While the effectiveness of our approach has been quantitatively evaluated in Tables 1, 2, and 3, we also visualize and analyze the learned neurons from the "conv4" layer of our SRN to illustrate its capability of learning spatial regularizations for labels. We observe that the learned neurons capture two types of label spatial information: one type of neuron captures the spatial locations of individual labels, while the other type is only activated when several labels occur in specific relative position patterns.

We calculate the correlations between learned neuron responses and label locations in images, and find that some neurons highly correlate with individual labels' spatial locations. Figure 5 shows two such examples. In (a), the response of one "conv4" neuron in the SRN highly correlates with the vertical location of the label "longHair" in the WIDER-Attribute dataset. In (b), the activation of another "conv4" neuron highly correlates with the vertical location of the label "flag" in NUS-WIDE. This demonstrates that these two neurons focus on the spatial locations of certain labels.

In Figure 6, we show the three images from the WIDER-Attribute dataset that have the highest activations on one particular "conv4" neuron in the SRN. The images have common labels ("Male", "longSleeve", "formal", "longPants") with similar relative label positions. This suggests that the neuron is trained to capture semantic and spatial relations of the four labels, and favors specific relative positions between them.
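The text does not specify how the label locations or the correlation measure were obtained for this analysis; assuming per-image neuron activations and label vertical positions are available as arrays, the reported correlation could be computed with an ordinary Pearson correlation, e.g.:

```python
import numpy as np

def neuron_location_correlation(activations, label_y_centers):
    """Illustrative sketch: Pearson correlation between one conv4 neuron's
    activation and the vertical position of a given label across images.
    Both inputs are 1-D arrays of equal length (one entry per image)."""
    return float(np.corrcoef(activations, label_y_centers)[0, 1])
```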
Figure 5. Correlation between neuron activations and label locations. These two neurons are sensitive to the location variations of the corresponding labels.
Figure 6. Images with top-3 activations for neuron “conv4” from WIDER-Attribute dataset. True positive labels aremarked in red. Strong spatial and semantic relations between thefour labels ( “Male” , “longSleeve” , “formal” , “longPants” ) arecaptured by neuron. We also analyzed AP improvements for all classes inCOCO. As shown in Figure 7, our approach is more effec-tive for classes having more co-existing labels in the sameimages so that spatial relations can be better utilized to reg-ularize the results. For the class toaster , it was not improvedmuch because of its limited number of training samples.
5. Conclusion
In this paper, we aim to improve multi-label image classification by exploring spatial relations of labels. This is achieved by learning attention maps for all labels with only image-level supervisions, and then capturing both semantic and spatial relations of labels based on the weighted attention maps. Extensive evaluations on the NUS-WIDE, MS-COCO, and WIDER-Attribute datasets show that our proposed Spatial Regularization Net significantly outperforms state-of-the-arts. Visualization of the learned models also shows that our approach can effectively capture both semantic and spatial relations of labels.
Figure 7. Top: improvement in AP for each class on MS-COCO. Bottom: average number of concurrent labels for true positive images of each class. Both are sorted according to the improvement in AP.
6. Acknowledgment
This work is supported in part by the National Natural Science Foundation of China under Grant 61371192, in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14213616, CUHK14206114, CUHK14205615, CUHK419412, CUHK14203015, CUHK14239816 and CUHK14207814, in part by the Hong Kong Innovation and Technology Support Programme Grant ITS/121/15FX, in part by the Ph.D. Program Foundation of China under Grant 20130185120039, and in part by the China Postdoctoral Science Foundation under Grant 2014M552339.

References
[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.
[2] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830-836, 2006.
[3] L. Bazzani, H. Larochelle, V. Murino, J.-a. Ting, and N. D. Freitas. Learning attentional policies for tracking and recognition in video with deep networks. In ICML, 2011.
[4] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
[5] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.
[8] R. Girshick. Fast R-CNN. In ICCV, 2015.
[9] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In ICCV, 2015.
[10] S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2004.
[11] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. ICLR, 2014.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630-645, 2016.
[15] S. Hong, J. Oh, B. Han, and H. Lee. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. CVPR, 2016.
[16] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori. Learning structured inference neural networks with label relations. CVPR, 2016.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016.
[19] K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 817-825, 2016.
[20] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Q. Li, M. Qiao, W. Bian, and D. Tao. Conditional graphical lasso for multi-label image classification. In CVPR, 2016.
[23] X. Li, F. Zhao, and Y. Guo. Multi-label image classification with a probabilistic label enhancement model. Proceedings of Uncertainty in Artificial Intelligence, 2014.
[24] Y. Li, C. Huang, C. C. Loy, and X. Tang. Human attribute recognition by deep hierarchical contexts. In ECCV, 2016.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
[27] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.
[28] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[29] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135-168, 2000.
[30] J. Shao, K. Kang, C. Change Loy, and X. Wang. Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4657-4666, 2015.
[31] J. Shao, C.-C. Loy, K. Kang, and X. Wang. Slicing convolutional neural network for crowd video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5620-5628, 2016.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[34] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007.
[35] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):467-476, 2008.
[36] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNN-RNN: A unified framework for multi-label image classification. CVPR, 2016.
[37] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream convnets. CoRR, abs/1507.02159, 2015.
[38] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726, 2014.
[39] X.-Z. Wu and Z.-H. Zhou. A unified view of multi-label performance measures. arXiv preprint arXiv:1609.00288, 2016.
[40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[41] X. Xue, W. Zhang, J. Zhang, B. Wu, J. Fan, and Y. Lu. Correlative multi-label multi-instance image annotation. In ICCV, 2011.
[42] H. Yang, J. T. Zhou, Y. Zhang, B.-B. Gao, J. Wu, and J. Cai. Exploit bounding box annotations for multi-label object recognition. CVPR, 2016.
[43] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. CVPR, 2016.
[44] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2014.