Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification
Renchun You,∗ Zhiyao Guo,∗ Lei Cui,∗ Xiang Long, Yingze Bao, Shilei Wen†
Baidu VIS; Computer Science Department, Xiamen University, China; Department of Computer Science and Technology, Tsinghua University
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

∗ These authors share first authorship. † Corresponding author.
Abstract
Multi-label image and video classification are fundamental yet challenging tasks in computer vision. The main challenges lie in capturing spatial or temporal dependencies between labels and discovering the locations of discriminative features for each class. To overcome these challenges, we propose to use cross-modality attention with semantic graph embedding for multi-label classification. Based on the constructed label graph, we propose an adjacency-based similarity graph embedding method to learn semantic label embeddings, which explicitly exploit label relationships. Our novel cross-modality attention maps are then generated with the guidance of the learned label embeddings. Experiments on two multi-label image classification datasets (MS-COCO and NUS-WIDE) show that our method outperforms existing state-of-the-art methods. In addition, we validate our method on a large multi-label video classification dataset (YouTube-8M Segments), and the evaluation results demonstrate the generalization capability of our method.
Introduction

Multi-label image classification (MLIC) and multi-label video classification (MLVC) are important tasks in computer vision, where the goal is to predict the set of categories present in an image or a video. Compared with single-label classification (which assigns one label to an image or video), multi-label classification is more useful in many applications such as internet search, security surveillance, and robotics. Since MLIC and MLVC are very similar tasks, the following technical discussion mainly focuses on MLIC; the conclusions carry over to MLVC naturally.

Recently, single-label image classification has achieved great success thanks to the evolution of deep Convolutional Neural Networks (CNN) (He et al. 2016; Huang et al. 2017; Simonyan and Zisserman 2014; Szegedy et al. 2016). Single-label image classification can be naively extended to MLIC by treating the problem as a series of single-label classification tasks. However, such a naive extension usually performs poorly, since it ignores the semantic dependencies among multiple labels, which are especially important for multi-label classification. Therefore, a number of prior works aim to capture label relations with Recurrent Neural Networks (RNN). However, these methods do not model explicit relationships between semantic labels and image regions, and thus lack the capacity to sufficiently exploit the spatial dependencies in images.

An alternative solution for MLIC is to introduce object detection techniques. Some methods (Wei et al. 2014; Zhang et al. 2018; Hao et al. 2016) extract region proposals using extra bounding box annotations, which are much more expensive to obtain than simple image-level annotations. Many other methods (Wang et al. 2017; Zhu et al. 2017) apply attention mechanisms to automatically focus on regions of interest. However, the attentional regions are learned only with image-level supervision, which lacks explicit semantic guidance.

To address the above issues, we argue that an effective model for multi-label classification should have two capacities: (1) capturing semantic dependencies among multiple labels in terms of spatial context; (2) locating regions of interest with more semantic guidance.

In this paper, we propose a novel cross-modality attention network associated with graph embedding, so as to simultaneously search for discriminative regions and label spatial semantic dependencies. Firstly, we introduce a novel Adjacency-based Similarity Graph Embedding (ASGE) method which captures the rich semantic relations between labels. Secondly, the learned label embeddings guide the generation of attentional regions through cross-modality guidance, which we refer to as Cross-Modality Attention (CMA). Compared with traditional self-attention methods, our attention explicitly introduces the rich label semantic relations. Benefiting from the CMA mechanism, our attentional regions are more meaningful and discriminative; they therefore capture more useful information while suppressing noise and background information for classification. Furthermore, the spatial context dependencies of labels are captured, which further improves performance in MLIC.
Figure 1: The overall framework of our model for the MLIC task. The label embeddings are obtained by the ASGE module. The visual features are first extracted by the backbone network and then projected to the semantic space by the CMT module to obtain projected visual features. The learned label embeddings and projected visual features are fed together into the CMA module to generate category-wise attention maps, each of which is used to compute a weighted average of the visual features and produce a category-wise aggregated feature. Finally, a classifier makes the final prediction.

The major contributions of this paper are briefly summarized as follows:

• We propose an ASGE method to learn semantic label embeddings and exploit label correlations explicitly.

• We propose a novel attention paradigm, namely cross-modality attention, where the attention maps are generated by leveraging prior semantic information, resulting in more meaningful attention maps.

• A general framework combining the CMA and ASGE modules, as shown in Fig. 1 and Fig. 2, is proposed for multi-label classification. It captures dependencies between spatial and semantic space and discovers the locations of discriminative features effectively. We evaluate our framework on the MS-COCO and NUS-WIDE datasets for MLIC, and new state-of-the-art performances are achieved on both. We also evaluate the proposed method on the YouTube-8M Segments dataset for MLVC, where it also achieves remarkable performance.
Related Work

The task of MLIC has attracted increasing interest recently. The easiest way to address this problem is to treat each category independently, so that the task converts directly into a series of binary classification tasks (Chen et al. 2019). However, such techniques are limited by not considering the relationships between labels.

Several approaches have been applied to model the correlations between labels. Read et al. (2011) extend multi-label classification by training a chain of binary classifiers, introducing label correlations by feeding previously predicted labels as inputs. Some other works (Li, Zhao, and Guo 2014; Wang et al. 2016; Chen et al. 2018; Li et al. 2016) formulate the task as a structural inference problem based on probabilistic graphical models. Besides, a recent work (Chen et al. 2019) explores label dependencies with a graph convolutional network. However, none of the aforementioned methods considers the associations between semantic labels and image contents, and the spatial contexts of images have not been sufficiently exploited.
Figure 2: The overall framework of the model for the MLVC task. The input is pre-extracted features instead of raw video, and the pre-extracted features are processed by SNet to extract visual features. The other parts are quite similar to those of MLIC.

In the MLIC task, visual concepts are highly related to local image regions. To better explore the information in local regions, some works (Wei et al. 2014; Hao et al. 2016) introduce region proposal techniques to focus on informative regions. Wei et al. (2014) extract an arbitrary number of object hypotheses, feed them into a shared CNN, and aggregate the outputs with max pooling to obtain the final multi-label predictions. Hao et al. (2016) introduce local information provided by generated proposals to boost the discriminative power of feature extraction. Although the above methods use region proposals to enhance feature representation, they are still limited by requiring extra object-level annotations and by not considering the dependencies between objects.

Alternatively, Wang et al. (2017) discover the attentional regions corresponding to multiple semantic labels with a spatial transformer network and capture the spatial dependencies of the regions with a Long Short-Term Memory (LSTM) network. Analogously, Zhu et al. (2017) propose a spatial regularization network to generate label-related attention maps and implicitly capture latent relationships through the attention maps. The advantage of these attention approaches is that no additional region proposal step is needed. Nevertheless, the attentional regions are learned only with image-level supervision, which lacks explicit semantic guidance. In this paper, semantic guidance is introduced into the generation of attention maps by leveraging label semantic embeddings, which improves prediction performance significantly.

In this paper, the label semantic embeddings are learned by graph embedding, a technique that aims to learn representations of graph-structured data. Graph embedding approaches mainly include matrix factorization-based (Cao, Lu, and Xu 2015), random walk-based (Perozzi, Al-Rfou, and Skiena 2014), and neural network-based methods (Wang, Cui, and Zhu 2016; Zhu et al. 2018). A main assumption of these approaches is that the embeddings of adjacent nodes on the graph are similar; in our task, we additionally require the embeddings of non-adjacent nodes to be mutually exclusive. Therefore, we propose an ASGE method, which further separates the embeddings of non-adjacent nodes.

MLVC is similar to MLIC, but it involves additional temporal relationships (Gan et al. 2015; Long et al. 2018a; Campos et al. 2017; Arandjelovic et al. 2016; Wu, Ma, and Hu 2017; Long et al. 2018b). It has many applications, such as emotion recognition (Kahou et al. 2016), human activity understanding (Caba Heilbron et al. 2015), and event detection (Xu, Yang, and Hauptmann 2015). In this paper, we validate the proposed method on both MLIC and MLVC tasks and achieve remarkable performance.
Approach

The overall frameworks of our approach for MLIC and MLVC are shown in Fig. 1 and Fig. 2, respectively. The pipeline includes several stages. Firstly, the label graph is taken as the input of the ASGE module to learn label embeddings that encode the semantic relationships between labels. Secondly, the learned label embeddings and the visual features are fed together into the CMA module to obtain category-wise attention maps. Finally, the category-wise attention maps are used to compute weighted averages of the visual features for each category. We describe our two key components, ASGE and CMA, in detail below.
Adjacency-based Similarity Graph Embedding

The relationships between labels play a crucial role in multi-label classification, as discussed in the introduction. However, how to express such relationships is an open issue. Our intuition is that the co-occurrence properties of labels can be described by joint probabilities, which are suitable for modeling label relationships. Nevertheless, joint probabilities easily suffer from class imbalance. Instead, we use the conditional probability between labels, obtained by normalizing the joint probability by the marginal probability. Based on this, we construct a label graph in which the labels are nodes and the conditional probabilities between labels are edge weights. Inspired by the popular application of graph embedding methods in natural language processing (NLP), where learned embeddings are given to the network as additional information, we propose a novel ASGE method to encode the label relationships.

We formally define the graph as $G = (V, C)$, where $V = \{v_1, v_2, ..., v_N\}$ is the set of $N$ nodes and $C$ is the set of edges. The adjacency matrix $A = \{A_{ij}\}_{i,j=1}^{N}$ of graph $G$ contains the non-negative weight associated with each edge. Specifically, $V$ is the set of labels, $C$ is the set of connections between any two labels, and $A$ is the conditional probability matrix with $A_{ij} = P(v_i \mid v_j)$, where $P$ is estimated from the training set. Since $P(v_i \mid v_j) \neq P(v_j \mid v_i)$, i.e., $A_{ij} \neq A_{ji}$, we symmetrize $A$ to facilitate optimization:

$$A' = \frac{1}{2}\left(A + A^{\top}\right). \quad (1)$$

To capture the label correlations defined by the graph structure, we apply a neural network to map the one-hot encoding $o_i$ of each label into the semantic embedding space, producing the label embedding

$$e_i = \Phi(o_i), \quad (2)$$

where $\Phi$ denotes a neural network consisting of three fully-connected layers, each followed by Batch Normalization (BN) and a ReLU activation. Our goal is to obtain the optimal label embedding set $E = \{e_i\}_{i=1}^{N}$, with $e_i \in \mathbb{R}^{C_e}$, such that $\cos(e_i, e_j)$, the cosine similarity between $e_i$ and $e_j$, is close to $A'_{ij}$ for all $i, j$. The objective function is therefore

$$L_{ge} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \frac{e_i^{\top} e_j}{\|e_i\|\,\|e_j\|} - A'_{ij} \right)^{2}, \quad (3)$$

where $L_{ge}$ denotes the loss of our graph embedding.
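To make the ASGE objective concrete, below is a minimal PyTorch sketch of Eqs. 1-3, assuming the conditional-probability matrix has already been estimated from label co-occurrence counts on the training set. All names, layer widths, and details here are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASGE(nn.Module):
    """Sketch of Adjacency-based Similarity Graph Embedding (Eqs. 1-3)."""
    def __init__(self, num_labels, embed_dim=256, hidden_dim=256):
        super().__init__()
        # Three FC layers, each followed by BN and ReLU, mapping one-hot
        # label encodings into the semantic embedding space (Eq. 2).
        self.phi = nn.Sequential(
            nn.Linear(num_labels, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim), nn.BatchNorm1d(embed_dim), nn.ReLU(),
        )
        self.num_labels = num_labels

    def forward(self):
        # One-hot vectors o_i for every label, stacked as an identity matrix.
        one_hot = torch.eye(self.num_labels, device=next(self.parameters()).device)
        return self.phi(one_hot)                     # E = {e_i}, shape (N, Ce)

def asge_loss(embeddings, cond_prob):
    """Eq. 3: squared error between pairwise cosine similarity and A'."""
    a_sym = 0.5 * (cond_prob + cond_prob.t())        # Eq. 1: symmetrized adjacency
    e_norm = F.normalize(embeddings, dim=1)
    cos_sim = e_norm @ e_norm.t()                    # cos(e_i, e_j) for all pairs
    return ((cos_sim - a_sym) ** 2).sum()
```

Here `cond_prob[i, j]` plays the role of $A_{ij} = P(v_i \mid v_j)$, which can be estimated by dividing label co-occurrence counts by per-label counts on the training set.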
Optimization Relaxation. To optimize Eq. 3, the cosine similarity $\cos(e_i, e_j)$ is required to be close to the corresponding edge weight $A'_{ij}$ for all $i, j$. However, this strict constraint is hard to satisfy, especially when the graph is large and sparse. To address this problem, a hyperparameter $\alpha$ is introduced into Eq. 3 to relax the optimization. The new objective function is

$$L_{ge} = \sum_{i=1}^{N} \sum_{j=1}^{N} \sigma_{ij} \cdot \left( \frac{e_i^{\top} e_j}{\|e_i\|\,\|e_j\|} - A'_{ij} \right)^{2}, \quad (4)$$

where $\sigma_{ij}$ is an indicator function:

$$\sigma_{ij} = \begin{cases} 0, & A'_{ij} < \alpha \ \text{and}\ \frac{e_i^{\top} e_j}{\|e_i\|\,\|e_j\|} < \alpha, \\ 1, & \text{otherwise}. \end{cases} \quad (5)$$

With this relaxation, when $A'_{ij} < \alpha$ the embedding pair $(e_i, e_j)$ only needs to be pushed apart rather than strictly enforcing $\cos(e_i, e_j)$ to equal $A'_{ij}$, thus focusing more on the strong relationships between labels and reducing the difficulty of the optimization.

Cross-Modality Attention

We formally define the multi-label classification task as a mapping function $F: x \rightarrow y$, where $x$ denotes an input image or video, $y = [y_1, y_2, ..., y_N]$ denotes the corresponding labels, $N$ is the total number of categories, and $y_n \in \{0, 1\}$ denotes whether the label is assigned to the image or video.

For multi-label classification, we propose a novel attention mechanism, named cross-modality attention, which uses semantic embeddings to guide the spatial or temporal integration of visual features. The semantic embeddings here are the label embedding set $E = \{e_i\}_{i=1}^{N}$ obtained by ASGE, and the visual features $I = \psi(x)$ are extracted by a backbone neural network $\psi$. Note that for different tasks we only need different backbones to extract visual features; the rest of the framework is completely generic for both tasks.

Backbone. In the MLIC task, we apply a ResNet-101 network and take the last convolutional feature map as the visual features. Additionally, we use a 1×1 convolution to reduce the dimension, obtaining the final visual feature map $I \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the spatial resolution of the last feature map and $C$ is the number of channels.

In the MLVC task, frame-level features $x$ are pre-extracted by an Inception network and then processed by PCA with whitening. Considering that the meaningful and discriminative information of a video derives from a few pivotal frames while others may be redundant, we apply a Squeezing Network (SNet) to squeeze the temporal dimension. The SNet is built from 4 successive 1D convolution and pooling layers. We obtain the final visual features $I = f_{SNet}(x)$, where $I \in \mathbb{R}^{T \times C}$, $T$ is the temporal resolution of the final feature map, and $C$ is the number of channels.
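As an illustration of the squeezing step, here is a minimal sketch of an SNet-like module under our own assumptions; the paper only states "4 successive 1D convolution and pooling layers", so the kernel sizes, strides, and channel counts below are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SNet(nn.Module):
    """Sketch: squeeze the temporal axis of pre-extracted frame features
    with 4 successive 1D convolution + pooling stages (channel width and
    kernel/stride choices below are illustrative assumptions)."""
    def __init__(self, in_channels=1024, channels=1024):
        super().__init__()
        blocks = []
        for _ in range(4):
            blocks += [
                nn.Conv1d(in_channels, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2),   # halves the temporal resolution
            ]
            in_channels = channels
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        # x: (B, T, C) pre-extracted frame features -> (B, T', C) squeezed features
        x = x.transpose(1, 2)                  # Conv1d expects (B, C, T)
        return self.net(x).transpose(1, 2)
```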
Cross-Modality Attention. The label embeddings learned by ASGE compose a semantic embedding space, while the features extracted by the CNN backbone define a visual feature space. Our goal is to let the semantic embeddings guide the generation of attention maps. However, a semantic gap exists between the semantic embedding space and the visual feature space because of the modality difference. To measure the compatibility between the two modalities, we first learn a mapping from the visual feature space to the semantic embedding space; the compatibility can then be measured by the cosine similarity between a projected visual feature and a semantic embedding, which we call cross-modality attention. The formal definition follows.

Firstly, we project the visual features into the semantic space with a Cross-Modality Transformer (CMT) module, built from several 1×1 convolution layers each followed by BN and a ReLU activation:

$$I_s = f_{cmt}(I), \quad (6)$$

where $I_s \in \mathbb{R}^{M \times C_e}$ ($M = W \times H$ for images and $M = T$ for videos) and $f_{cmt}$ denotes the mapping function of the CMT module. The category-specific cross-modality attention score $z_{ik}$ is obtained by computing the cosine similarity between label embedding $e_k$ and the projected visual feature vector $I_s^i$ at location $i$ of $I_s$:

$$z_{ik} = \mathrm{ReLU}\left( \frac{{I_s^i}^{\top} e_k}{\|I_s^i\|\,\|e_k\|} \right). \quad (7)$$

The category-specific attention map is then normalized to

$$a_{ik} = \frac{z_{ik}}{\sum_{i=1}^{M} z_{ik}}. \quad (8)$$

For each location $i$, a high positive CMA score means that location $i$ is highly semantically related to label embedding $e_k$, or relatively more important than other locations, so the model should focus on location $i$ when considering category $k$. The category-specific cross-modality attention map is then used to compute a weighted average of the visual feature vectors for each category:

$$h_k = \sum_{i=1}^{M} a_{ik} I^i, \quad (9)$$
Figure 3: Illustration of the latent spatial dependency. Different colors indicate different categories. Solid arrows represent the learned label embeddings, denoted $e$, while dotted arrows represent the visual features projected through the CMT module, denoted $v'$. The angles between label embeddings and projected visual features, namely $\alpha$, $\beta$ and $\gamma$, represent the category-wise attention scores. For a detailed discussion, refer to Section 3.2.

where $h_k$ is the final feature vector for label $k$. Then $h_k$ is fed into a fully-connected layer to estimate the probability of category $k$:

$$y_k^{*} = \sigma(w_k^{\top} h_k + b), \quad (10)$$

where $w_k \in \mathbb{R}^{C}$ and $b$ are learnable parameters, and $y_k^{*}$ is the predicted probability for label $k$. For convenience, we denote the computation of the whole CMA module as $y_k^{*} = f_{cma}(I, E)$.

Compared with the common single-attention-map approach, where one attention map is shared by all categories, our CMA module benefits in two ways. Firstly, each category-wise attention map is tied to the image regions corresponding to category $k$, so category-related regions are learned better. Secondly, with the guidance of label semantic embeddings, the discovered attentional regions match the annotated semantic labels better.

A further advantage of our framework is its ability to capture latent spatial dependencies, which helps with visually ambiguous labels. As shown in Fig. 3, consider frisbee as an example. The ASGE module learns label embeddings from the label graph, which encodes the label relationships. Since dog and frisbee often co-occur while glasses does not, the label embeddings of dog and frisbee are close to each other and far from that of glasses, i.e., $e_d \approx e_f \neq e_g$. The optimization during training pushes the cosine similarity between each visual feature and its corresponding label embedding to be high; in other words, $\cos(e_d, v'_d)$, $\cos(e_g, v'_g)$ and $\cos(e_f, v'_f)$ will all be large. Since $e_d \approx e_f \neq e_g$, $\cos(e_f, v'_d)$ will also be large, while $\cos(e_f, v'_g)$ will be small. The final feature representation of frisbee is $h_f = \beta_1 v_g + \beta_2 v_f + \beta_3 v_d$, where $\beta_1 = \cos(e_f, v'_g)$, $\beta_2 = \cos(e_f, v'_f)$, and $\beta_3 = \cos(e_f, v'_d)$. Thus the recognition of frisbee depends on the semantically related label dog and not on the unrelated label glasses, indicating that our model is capable of capturing spatial dependencies. In particular, when the frisbee itself is a hard case to recognize, $\beta_2$ will be small; fortunately, $\beta_3$ may still be large, so the visual information of the dog serves as helpful context to aid the recognition of the label frisbee.
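Below is a minimal PyTorch sketch of the CMA computation (Eqs. 6-10), assuming spatial features from an image backbone; the CMT depth, channel sizes, and the handling of numerical edge cases are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMA(nn.Module):
    """Sketch of cross-modality attention over backbone features (Eqs. 6-10)."""
    def __init__(self, feat_dim, embed_dim, num_labels):
        super().__init__()
        # CMT: 1x1 convolutions (+ BN + ReLU) projecting visual features
        # into the semantic embedding space (Eq. 6).
        self.cmt = nn.Sequential(
            nn.Conv2d(feat_dim, embed_dim, kernel_size=1),
            nn.BatchNorm2d(embed_dim), nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=1),
            nn.BatchNorm2d(embed_dim), nn.ReLU(),
        )
        # One binary classifier (w_k, b_k) per category (Eq. 10).
        self.classifier = nn.Linear(feat_dim, num_labels)

    def forward(self, feats, label_emb):
        # feats: (B, C, H, W) visual features; label_emb: (N, Ce) from ASGE.
        B, C, H, W = feats.shape
        proj = self.cmt(feats).flatten(2).transpose(1, 2)   # (B, M, Ce), M = H*W
        # Eq. 7: ReLU of cosine similarity between each location and each label.
        z = torch.einsum('bmc,nc->bmn',
                         F.normalize(proj, dim=-1),
                         F.normalize(label_emb, dim=-1)).relu()
        # Eq. 8: normalize the attention over locations.
        a = z / z.sum(dim=1, keepdim=True).clamp_min(1e-6)
        # Eq. 9: category-wise weighted average of the visual features.
        v = feats.flatten(2).transpose(1, 2)                # (B, M, C)
        h = torch.einsum('bmn,bmc->bnc', a, v)              # (B, N, C)
        # Eq. 10: per-category logit w_k^T h_k + b_k (sigmoid applied at loss time).
        logits = (h * self.classifier.weight).sum(-1) + self.classifier.bias
        return logits                                       # (B, N)
```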
Multi-Scale CMA. A single-scale feature representation may not be sufficient for multiple objects at different scales. Note that the attention computation effectively slides each label embedding densely over all locations of the feature map; in other words, the spatial resolution of the feature map may affect the attention result. Our intuition is that low-resolution feature maps have more representational capacity for small objects, while high-resolution maps are the opposite. The design of the CMA mechanism allows it to be naturally applied to multi-scale feature maps via a score fusion strategy. Specifically, we extract a set of feature maps $\{I_1, I_2, ..., I_L\}$, and the final predicted probability of multi-scale CMA is

$$y_k^{*} = \frac{1}{L} \sum_{l=1}^{L} f_{cma}(I_l, E). \quad (11)$$
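A minimal sketch of the score fusion in Eq. 11, reusing the `CMA` module sketched above; whether the original implementation shares classifier weights across scales is not stated, so one module per scale is our assumption.

```python
import torch

def multi_scale_cma(feature_maps, label_emb, cma_modules):
    """Eq. 11: average the per-scale CMA probabilities.
    feature_maps: list of (B, C_l, H_l, W_l) tensors, one per scale.
    cma_modules:  one CMA module per scale (channel widths may differ)."""
    probs = [torch.sigmoid(cma(f, label_emb))
             for cma, f in zip(cma_modules, feature_maps)]
    return torch.stack(probs, dim=0).mean(dim=0)   # (B, N) fused probabilities
```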
Training Loss. Finally, we define our objective function for multi-label classification as

$$L(\theta) = -\sum_{k=1}^{N} w_k \left[ y_k \cdot \log(y_k^{*}) + (1 - y_k) \cdot \log(1 - y_k^{*}) \right],$$
$$w_k = y_k \cdot e^{\beta (1 - p_k)} + (1 - y_k) \cdot e^{\beta p_k}, \quad (12)$$

where $w_k$ is used to alleviate the class imbalance, $\beta$ is a hyperparameter, and $p_k$ is the ratio of label $k$ in the training set.
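A minimal sketch of the weighted binary cross-entropy of Eq. 12, assuming `p` holds the training-set frequency of each label; the clamping for numerical stability and the batch averaging are our additions.

```python
import torch

def weighted_bce_loss(probs, targets, p, beta=0.4):
    """Eq. 12: class-imbalance-weighted BCE.
    probs:   (B, N) predicted probabilities y*_k
    targets: (B, N) binary labels y_k
    p:       (N,)  ratio of each label in the training set"""
    w = targets * torch.exp(beta * (1.0 - p)) + (1.0 - targets) * torch.exp(beta * p)
    probs = probs.clamp(1e-6, 1.0 - 1e-6)          # numerical stability (our addition)
    bce = targets * torch.log(probs) + (1.0 - targets) * torch.log(1.0 - probs)
    return -(w * bce).sum(dim=1).mean()            # sum over labels, mean over batch
```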
Experiments

To assess our model, we perform experiments on two benchmark multi-label image recognition datasets, MS-COCO (Lin et al. 2014) and NUS-WIDE (Chua et al. 2009). We also validate the effectiveness of our model on one multi-label video recognition dataset, YouTube-8M Segments, and the results demonstrate the extensibility of our method. In this section, we report the results on MLIC and MLVC respectively.

Multi-Label Image Classification

Implementation Details.
In the ASGE module, the dimensions of the three hidden layers and of the label embeddings are all set to 256. The optimization relaxation is not applied here, since the label graph is relatively small. The optimizer is Stochastic Gradient Descent (SGD) with momentum 0.9, and the initial learning rate is 0.01. In the classification part, the input image is randomly cropped and resized, with random horizontal flipping for augmentation. The batch size is 64. The optimizer is SGD with momentum 0.9 and weight decay. The initial learning rate is 0.01 and decays by a factor of 10 every 30 epochs. The hyperparameter $\beta$ in Eq. 12 is 0 on MS-COCO and 0.4 on NUS-WIDE. Based on this setup, we implement two models: CMA and Multi-Scale CMA (MS-CMA). The MS-CMA model uses three scales of features: two feature maps from the ResNet-101 backbone and a third obtained by applying an extra residual block, while the CMA model uses only a single scale.

Evaluation Metrics.
We use the same evaluation metrics as prior works (Wang et al. 2017): the per-category and overall metrics precision (CP and OP), recall (CR and OR), and F1 (CF1 and OF1). In addition, we report the mean average precision (mAP), which is relatively more important than the other metrics, and we mainly focus on the mAP performance.
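For reference, a minimal sketch of how these metrics are commonly computed from binarized predictions (per-category CP/CR/CF1 average over classes; overall OP/OR/OF1 pool all decisions); the exact thresholding convention is our assumption.

```python
import numpy as np

def multi_label_f1(pred, target):
    """pred, target: (num_samples, num_labels) binary arrays.
    Returns (CP, CR, CF1, OP, OR, OF1) in the usual multi-label convention."""
    eps = 1e-9
    tp = (pred * target).sum(axis=0).astype(float)   # true positives per class
    cp = (tp / (pred.sum(axis=0) + eps)).mean()      # per-class precision, averaged
    cr = (tp / (target.sum(axis=0) + eps)).mean()    # per-class recall, averaged
    cf1 = 2 * cp * cr / (cp + cr + eps)
    op = tp.sum() / (pred.sum() + eps)               # precision pooled over all classes
    orr = tp.sum() / (target.sum() + eps)            # recall pooled over all classes
    of1 = 2 * op * orr / (op + orr + eps)
    return cp, cr, cf1, op, orr, of1
```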
Results on MS-COCO Dataset.
The MS-COCO dataset is widely used for the MLIC task. It contains 122,218 images with 80 labels and almost 2.9 labels per image. We divide the dataset into two parts following the officially provided split: 82,081 images for training and 40,137 images for testing.

We compare with currently published state-of-the-art methods, including CNN-RNN (Wang et al. 2016), RNN-Attention (Wang et al. 2017), Order-Free RNN (Chen et al. 2018), ML-ZSL (Lee et al. 2018), SRN (Zhu et al. 2017) and Multi-Evidence (Ge, Yang, and Yu 2018). Besides, we run the source code released by ML-GCN (Chen et al. 2019) to train the model and obtain results for comparison. The quantitative results of the CMA and MS-CMA models are shown in Table 1. Both of our models perform better than the state-of-the-art methods on almost all metrics. Notably, our MS-CMA model achieves better performance than the CMA model, demonstrating that multi-scale attention yields a performance improvement.
Results on NUS-WIDE Dataset.
NUS-WIDE is a web dataset of 269,648 images collected from Flickr with 5,018 labels. After removing noisy and rare labels, 1,000 categories are left. The images are further manually annotated with 81 concepts, with 2.4 concepts per image on average. We follow the split used in (Liu et al. 2018), i.e., 150,000 images for training and 59,347 for testing after removing the images without any labels.

On this dataset, we compare with the current state-of-the-art models, including CNN-RNN (Wang et al. 2016), CNN-SREL-RNN (Liu et al. 2017), CNN-LSEP (Li, Song, and Luo 2017), Order-Free RNN (Chen et al. 2018), ML-ZSL (Lee et al. 2018), S-CLs (Liu et al. 2018), Attention transfer (Zagoruyko and Komodakis 2016) and FitsNet (Romero et al. 2014).

The quantitative results are shown in Table 2. The comparison results are similar to those on MS-COCO: our CMA and MS-CMA perform better than the state-of-the-art methods on most metrics, and the mAP of our MS-CMA, the metric we care about most, exceeds the previous state-of-the-art result. We observe that the average edge weight per label of NUS-WIDE is 3.3, while that of MS-COCO is 3.9, which shows that the label graph of MS-COCO is denser than that of NUS-WIDE; correspondingly, the performance gain on NUS-WIDE is slightly less pronounced than that on MS-COCO. These observations indicate that richer label relationships may bring larger performance improvements.

Methods                    All                                     Top-3
                     mAP   CP   CR   CF1  OP   OR   OF1    CP   CR   CF1  OP   OR   OF1
CNN-RNN (2016)       61.2   -    -    -    -    -    -    66.0 55.6 60.4 69.2 66.4 67.8
CNN-LSEP (2017)       -   73.5 56.4 62.9 76.3 61.8 68.3    -    -    -    -    -    -
CNN-SREL-RNN (2017)   -   67.4 59.8 63.4 76.6 68.7 72.4    -    -    -    -    -    -
RNN-Attention (2017)  -     -    -    -    -    -    -    79.1 58.7 67.4 84.0 63.0 72.0
Order-Free RNN (2018) -     -    -    -    -    -    -    71.6 54.8 62.1 74.2 62.2 67.7
ML-ZSL (2018)         -     -    -    -    -    -    -    74.1 64.5 69.0   -    -    -
SRN (2017)           77.1 81.6 65.4 71.2 82.7 69.9 75.8  85.2 58.8 67.4 87.4 62.5 72.9
S-CLs (2018)         74.6   -    -  69.2   -    -  74.0    -    -  66.8   -    -  72.7
Multi-Evidence (2018) -   80.4 70.2 74.9 85.2 72.5 78.4  84.5 62.2 70.6 89.1 64.3 74.7
ML-GCN (2019)        82.4 82.1 73.1 77.3 83.7 76.3 79.9  87.2 64.6 74.2 89.1 66.7 76.3
CMA
MS-CMA
Table 1: Comparisons with state-of-the-art methods on the MS-COCO dataset. We report our two proposed models, CMA and MS-CMA. Bold numbers indicate the best results on each metric, and underlined numbers indicate the second-best results.
Methods                     All              Top-3
                      mAP   CF1   OF1    CF1   OF1
CNN-RNN (2016)        56.1    -     -   34.7  55.2
CNN-LSEP (2017)         -   52.9  70.8    -     -
CNN-SREL-RNN (2017)     -   52.7  70.9    -     -
Order-Free RNN (2018)   -     -     -   54.7  70.2
ML-ZSL (2018)           -     -     -   45.7    -
Attention transfer (2016) 57.6 55.2 70.3 51.7 68.8
FitsNet (2014)        57.4  54.9  70.4  51.4  68.6
S-CLs (2018)          60.1  58.7  73.7  53.8
MS-CMA                61.4  60.5  73.8  55.7
Table 2: Comparisons with state-of-the-art methods on the NUS-WIDE dataset.
Ablation Study.
In this section, we aim to answer the following questions:

• Compared with the backbone (ResNet-101) model, does our CMA model improve significantly?

• Does the proposed CMA mechanism have an advantage over general self-attention methods?

• Can CMA extend to multiple scales and bring a performance improvement?

• Is our ASGE more advantageous than other embedding methods, e.g., Word2vec?

To answer these questions, we conduct ablation studies on the MS-COCO dataset, as shown in Table 3. Firstly, we investigate how CMA contributes to mAP. The vanilla ResNet-101 achieves 79.9 mAP, which increases when the CMA module is added, showing the significant effectiveness of the CMA mechanism. Secondly, we implement a general self-attention method by replacing Eq. 7 with $z_i = \sigma(f_{conv}(I^i))$, where $f_{conv}$ denotes the mapping function of a 1×1 convolution layer. Our CMA mechanism performs better than this general self-attention mechanism, which indicates that the label semantic embeddings provide effective guidance for attention.

Methods              mAP
ResNet-101 (2016)    79.9
Self-attention       81.1
W2V-MS-CMA           82.5
CMA
MS-CMA

Table 3: Ablation studies on the MS-COCO dataset.
Thirdly, the MS-CMA model brings a further improvement over CMA, demonstrating that our attention mechanism is well adapted to multi-scale features. Finally, we compare our ASGE with other embedding methods, taking Word2vec as an example; Word2vec is a family of models used to produce word embeddings. Specifically, we view the label set of each image as a single sentence, and the window size in Word2vec is set to the length of the longest sentence to eliminate the influence of label order. The experimental results show that our ASGE-based MS-CMA performs better than the Word2vec-based MS-CMA (denoted W2V-MS-CMA). In our ASGE, label relationships are explicitly represented by the adjacency matrix, which is treated as a direct optimization target. In contrast, Word2vec implicitly encodes label relationships in a data-driven manner without directly optimizing them. Therefore, our ASGE captures label relationships much better.
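As an illustration of the W2V-MS-CMA baseline, here is a minimal sketch of training such label embeddings with gensim, under our assumptions about the setup (each image's label set treated as one sentence, window spanning the longest "sentence", embedding size matching ASGE's 256, and the skip-gram variant):

```python
from gensim.models import Word2Vec

def train_label_word2vec(label_sets, embed_dim=256):
    """label_sets: list of label-name lists, one per training image,
    e.g. [["dog", "frisbee"], ["person", "glasses"], ...]."""
    max_len = max(len(s) for s in label_sets)
    model = Word2Vec(
        sentences=label_sets,
        vector_size=embed_dim,   # embedding dimensionality
        window=max_len,          # cover the whole label set, ignoring label order
        min_count=1,             # keep every label, even rare ones
        sg=1,                    # skip-gram variant (our assumption)
    )
    # Collect the per-label vectors for use as CMA label embeddings.
    return {label: model.wv[label] for label in model.wv.index_to_key}
```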
Visualization and Analysis. In this section, we visualize the learned attention maps to illustrate the ability to exploit discriminative or meaningful regions and to capture spatial semantic dependencies. We show attention visualization examples in Fig. 4. The three rows show the category-wise attention maps generated by the CMA model and by general self-attention, respectively.
Figure 4: Visualization of attention maps. First column: original image; second and fourth columns: attention maps of MS-CMA and self-attention, respectively; third and fifth columns: the corresponding attention maps projected onto the original image.

It is observed that the CMA model concentrates more on semantic regions and produces stronger responses than general self-attention, and is thus capable of exploiting more discriminative and meaningful information. Besides, our CMA mechanism is able to capture spatial semantic dependencies, especially for indistinct or small objects in the image; e.g., the attention for sports ball also attends to tennis racket due to their semantic similarity. This is quite helpful, because such objects need richer contextual cues to aid recognition.
Multi-Label Video Classification

Implementation Details.
For the training of ASGE, we apply the optimization relaxation with threshold $\alpha$; other settings are the same as described for the MLIC task. For the training of the classifier, the initial learning rate is 0.0002 and decays by a factor of 0.8 periodically during training. The hyperparameter $\beta$ in Eq. 12 is 0. The optimizer is SGD with momentum 0.9, and the batch size is 256. In this task, we only implement a single-scale CMA model.

Evaluation Metrics.
In the MLVC task, we use several metrics to evaluate our model, including Global Average Precision (GAP) (Shin et al. 2018), Average Hit Rate (Avg Hit@1), Precision at Equal Recall Rate (PERR), and Mean Average Precision (mAP) (Abu-El-Haija et al. 2016).
Results on YouTube-8M Segments Dataset.
In the MLVC task, we verify the effectiveness of our model on the YouTube-8M Segments dataset, an extension of the YouTube-8M dataset (Abu-El-Haija et al. 2016). In our experiments, we only use frame-level image features, whereas the state-of-the-art methods use additional audio features and are mostly built on model ensembles, which makes a direct comparison unfair.
Methods                Avg Hit@1   Avg PERR   mAP    GAP
Self-attention-FC        85.1        79.1     49.6   81.8
Self-attention-SNet      86.1        80.2     53.2   83.3
CMA-FC
CMA-SNet                 86.7        81.0     55.8   84.1
Table 4: Comparison between the self-attention models and our CMA models on the YouTube-8M Segments dataset.
Figure 5: The attention scores for each frame of a video labeled skiing.

For this reason, we compare our CMA model with a general self-attention model to validate its effectiveness in the MLVC task. Besides, to explore the impact of the backbone network on the CMA mechanism, we implement SNet-based and FC-based (two fully connected layers) models. The quantitative results are shown in Table 4. All of our metrics are better than those of the self-attention model. The improvements over the self-attention model may seem less significant than those in MLIC, but considering that the input of our model is fixed pre-extracted features and that the models differ only in the attention mechanism, the performance gains are quite remarkable.
Visualization and Analysis.
We also present the visualization of attention scores for a video in Fig. 5. The label of the video is skiing, and our CMA model pays more attention to skiing-related frames while partly ignoring the redundant frames, suggesting that our attention mechanism is capable of locating attentional frames and demonstrating the effectiveness of our model more intuitively.
Conclusion

In this paper, we propose a novel cross-modality attention mechanism with semantic graph embedding for both MLIC and MLVC tasks. The proposed method can effectively discover semantic locations with rich discriminative features and capture the spatial or temporal dependencies between labels. Extensive evaluations on two MLIC datasets, MS-COCO and NUS-WIDE, show that our method outperforms the state of the art. In addition, we conducted experiments on the MLVC dataset YouTube-8M Segments and achieved excellent performance, which validates the strong generalization ability of our method.
Acknowledgments

This work was supported by the National Natural Science Foundation of China (61671397). We thank all anonymous reviewers for their constructive comments.

References

Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; and Vijayanarasimhan, S. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; and Sivic, J. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.

Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; and Carlos Niebles, J. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.

Campos, V.; Jou, B.; Giró-i-Nieto, X.; Torres, J.; and Chang, S.-F. 2017. Skip RNN: Learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834.

Cao, S.; Lu, W.; and Xu, Q. 2015. GraRep: Learning graph representations with global structural information. In CIKM. ACM.

Chen, S.-F.; Chen, Y.-C.; Yeh, C.-K.; and Wang, Y.-C. F. 2018. Order-free RNN with visual attention for multi-label classification. In AAAI.

Chen, Z.-M.; Wei, X.-S.; Wang, P.; and Guo, Y. 2019. Multi-label image recognition with graph convolutional networks. In CVPR.

Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from National University of Singapore. In ICIVR. ACM.

Gan, C.; Wang, N.; Yang, Y.; Yeung, D.-Y.; and Hauptmann, A. G. 2015. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2568-2577.

Ge, W.; Yang, S.; and Yu, Y. 2018. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR.

Hao, Y.; Zhou, J. T.; Yu, Z.; Gao, B. B.; Wu, J.; and Cai, J. 2016. Exploit bounding box annotations for multi-label object recognition. In CVPR.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.

Kahou, S. E.; Bouthillier, X.; Lamblin, P.; Gulcehre, C.; Michalski, V.; Konda, K.; Jean, S.; Froumenty, P.; Dauphin, Y.; Boulanger-Lewandowski, N.; et al. 2016. EmoNets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces.

Lee, C.-W.; Fang, W.; Yeh, C.-K.; and Frank Wang, Y.-C. 2018. Multi-label zero-shot learning with structured knowledge graphs. In CVPR.

Li, Q.; Qiao, M.; Bian, W.; and Tao, D. 2016. Conditional graphical lasso for multi-label image classification. In CVPR.

Li, Y.; Song, Y.; and Luo, J. 2017. Improving pairwise ranking for multi-label image classification. In CVPR.

Li, X.; Zhao, F.; and Guo, Y. 2014. Multi-label image classification with a probabilistic label enhancement model. In UAI.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer.

Liu, F.; Xiang, T.; Hospedales, T. M.; Yang, W.; and Sun, C. 2017. Semantic regularisation for recurrent image annotation. In CVPR.

Liu, Y.; Sheng, L.; Shao, J.; Yan, J.; Xiang, S.; and Pan, C. 2018. Multi-label image classification via knowledge distillation from weakly-supervised detection. In MM. ACM.

Long, X.; Gan, C.; De Melo, G.; Liu, X.; Li, Y.; Li, F.; and Wen, S. 2018a. Multimodal keyless attention fusion for video classification. In AAAI.

Long, X.; Gan, C.; De Melo, G.; Wu, J.; Liu, X.; and Wen, S. 2018b. Attention clusters: Purely attention based local feature integration for video classification. In CVPR.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In KDD. ACM.

Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011. Classifier chains for multi-label classification. Machine Learning.

Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.

Shin, K.; Jeon, J.; Lee, S.; Lim, B.; Jeong, M.; and Nang, J. 2018. Approach for video classification with multi-label on YouTube-8M dataset. In ECCV.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR.

Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; and Xu, W. 2016. CNN-RNN: A unified framework for multi-label image classification. In CVPR.

Wang, Z.; Chen, T.; Li, G.; Xu, R.; and Lin, L. 2017. Multi-label image recognition by recurrently discovering attentional regions. In ICCV.

Wang, D.; Cui, P.; and Zhu, W. 2016. Structural deep network embedding. In KDD. ACM.

Wei, Y.; Xia, W.; Huang, J.; Ni, B.; Dong, J.; Zhao, Y.; and Yan, S. 2014. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726.

Wu, J.; Ma, L.; and Hu, X. 2017. Delving deeper into convolutional neural networks for camera relocalization. In ICRA. IEEE.

Xu, Z.; Yang, Y.; and Hauptmann, A. G. 2015. A discriminative CNN video representation for event detection. In CVPR.

Zagoruyko, S., and Komodakis, N. 2016. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.

Zhang, J.; Wu, Q.; Shen, C.; Zhang, J.; and Lu, J. 2018. Multi-label image classification with regional latent semantic dependencies. IEEE Transactions on Multimedia.

Zhu, F.; Li, H.; Ouyang, W.; Yu, N.; and Wang, X. 2017. Learning spatial regularization with image-level supervisions for multi-label image classification. In CVPR.

Zhu, D.; Cui, P.; Wang, D.; and Zhu, W. 2018. Deep variational network embedding in Wasserstein space. In KDD. ACM.