Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities
Xin Wen, Zhizhong Han, Xinyu Yin, Yu-Shen Liu*
School of Software, Tsinghua University, Beijing 100084, China
Beijing National Research Center for Information Science and Technology (BNRist)
Department of Computer Science, University of Maryland, College Park
[email protected], [email protected], [email protected], [email protected]
*Corresponding author: Yu-Shen Liu
ABSTRACT
Cross-modal retrieval aims to retrieve relevant data across different modalities (e.g., texts vs. images). The common strategy is to apply element-wise constraints between manually labeled pair-wise items to guide the generators to learn the semantic relationships between the modalities, so that similar items can be projected close to each other in the common representation subspace. However, such constraints often fail to preserve the semantic structure between unpaired but semantically similar items (e.g., unpaired items with the same class label are more similar than items with different labels). To address this problem, we propose a novel cross-modal similarity transferring (CMST) method to learn and preserve the semantic relationships between unpaired items in an unsupervised way. The key idea is to learn quantitative similarities in the single-modal representation subspaces, and then transfer them to the common representation subspace to establish the semantic relationships between unpaired items across modalities. Experiments show that our method outperforms the state-of-the-art approaches in both class-based and pair-based retrieval tasks.
Index Terms — Cross-modal retrieval, transfer learning
1. INTRODUCTION
Cross-modal retrieval aims to retrieve relevant data across different modalities, which enables flexible search across multiple modalities. The common strategy is to apply element-wise constraints on manually labeled cross-modal pairs to bridge the semantic gaps between different modalities. However, such manually labeled pairs can only reflect a small part of the semantic structure in the common representation subspace, while the abundant semantic relationships between unpaired items often fail to be preserved. As demonstrated in Figure 1, only the similarity between the red car and its corresponding description is manually labeled (green solid line), as well as the similarity between the blue car and its description. However, the similarity between the red car and the blue car's description is missing (orange dotted line), although they are semantically similar to some extent because of the same label (car) they share.

Fig. 1: Demonstration of the missing semantic relationships. The green solid line is the manual label of paired items across modalities, and the dotted lines are the missing labels for inter-modal items (orange) and intra-modal items (blue). The proposed CMST considers learning the missing intra-modal similarity (blue dotted line) and transferring it to the inter-modal similarity (orange dotted line) based on the paired items (green solid line).

To solve the above problem, methods like DCML [1] and ACMR [2] try to utilize the intra-class and inter-class labels of the cross-modal items [3, 2] by directly assigning the highest similarity (e.g., the value of 1) to items with the same class label and the lowest similarity (e.g., the value of 0) to items with different class labels. The problem is that these labels cannot quantitatively reflect the semantic relationships between intra-class items, which is especially important for retrieval tasks, because the rank list should indicate the discriminative order of all the retrieved items, especially for items in the same class: the samples that match the query better should be ranked higher than other samples, even within the same class. On the other hand, using the intra-class and inter-class labels as the supervision information also makes cross-modal retrieval methods sensitive to dataset noise such as mislabeled samples.

In this paper, to address the above-mentioned issues, a novel cross-modal similarity transferring (CMST) method is proposed for cross-modal retrieval. The main idea is to employ an unsupervised strategy to learn the endogenous semantic relationships between unpaired items in each single-modal representation subspace, and then transfer the learned relationships to the common representation subspace to establish the semantic structure between unpaired cross-modal items. In detail, CMST first employs a similarity learning network to establish a finer similarity metric that captures the semantic structure of training items in each single-modal space. Then, three similarity transferring approaches are proposed to transfer the learnt single-modal similarities to the common representation subspace. CMST works in an adversarial framework to utilize the distribution generation ability of generative adversarial networks. Our main contributions can be summarized as follows.

• A novel CMST method is proposed to learn and preserve the semantic structure between unpaired items across different modalities in the common representation subspace in an unsupervised way.

• Three similarity transferring approaches are proposed based on the observation of how cross-modal relationships are built from the similarities in single modalities, and they are proven effective in the experiments.
2. RELATED WORK
Cross-modal retrieval methods can be roughly divided into joint representation learning methods [2] and cross-modal hashing methods [4, 5]. The proposed CMST model falls into the category of joint representation based methods. It aims to learn a real-valued common representation subspace of multimodal data, where cross-modal data can be directly compared to each other through a predefined similarity measurement. Cross-modal retrieval methods like CCA-based methods [6, 7], LDA-based methods [8] and neural network based methods [1, 2] also fall into this category. On the other hand, cross-modal hashing methods mainly focus on retrieval efficiency by mapping the items of different modalities into a common binary Hamming space.

Benefiting from the strong ability of distribution modeling and discriminative representation learning, some recent cross-modal retrieval methods have incorporated GAN models [9, 10, 2]. In this work, our method also follows a similar adversarial learning framework that uses the single-modal similarities to guide cross-modal representation learning.

More recently, methods that explore the semantic relationships between unpaired items have been proposed. The ACMR [2] method proposes a triplet loss and a modality classifier for preserving the modality-level semantic structures. MHTN [10] is proposed to minimize the maximum mean discrepancy between modalities, which preserves more flexibility for the generator to project vectors into a new space. The difference between CMST and the previous work is that CMST can learn the item-level semantic relationships between unpaired items in an unsupervised way.
3. CMST-BASED CROSS-MODAL RETRIEVAL

3.1. Problem Formulation
Without loss of generality, we consider image and text pairs in this paper. Let $V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n] \in \mathbb{R}^{d_v \times n}$ be a collection of image features, and $T = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_n] \in \mathbb{R}^{d_t \times n}$ be the corresponding collection of text features, in which $\mathbf{v}_i$ and $\mathbf{t}_i$ form a pair, and $d_v$ and $d_t$ denote the dimensions of the image features and the text features, respectively. Each sample pair is assigned a semantic label vector $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{ic}] \in \mathbb{R}^c$, where $c$ denotes the number of semantic classes. If the $i$-th sample pair in $V$ and $T$ belongs to the semantic class $j$, then $y_{ij} = 1$; otherwise, $y_{ij} = 0$. We denote the collection of semantic label vectors as $Y = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n] \in \mathbb{R}^{c \times n}$. The goal of our proposed CMST method is to learn a common semantic space $S = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n] \in \mathbb{R}^{d_s \times n}$, where the features from different modalities can be directly compared in terms of semantic similarity, and $d_s$ denotes the dimension of the common semantic space.

3.2. Intra-Modal Similarity Learning

Siamese networks [11] are adopted as similarity learning networks to learn the semantic relationships in each single-modal representation subspace. Siamese networks can learn similarity metrics discriminatively and effectively. In addition, they are robust to data noise because they consider not only the label relationship but also the distance between the features of each sample pair. For simplicity, we only detail the network in the image modality. Given two image features and their labels $(\mathbf{v}_i, \mathbf{y}_i)$ and $(\mathbf{v}_j, \mathbf{y}_j)$, let $u_{ij} = 1$ if $\mathbf{y}_i = \mathbf{y}_j$, and $u_{ij} = 0$ otherwise. The loss function of our similarity learning network can be formulated as

$\mathcal{L}_{sia} = \sum_{i,j} u_{ij} \cdot s(\mathbf{v}_i, \mathbf{v}_j) + (1 - u_{ij}) \cdot \max(C - s(\mathbf{v}_i, \mathbf{v}_j), 0),$  (1)

where $C$ is the margin. The similarity measurement in the single-modal representation subspace is given as

$s(\mathbf{v}_i, \mathbf{v}_j) = \| H_V(\mathbf{v}_i) - H_V(\mathbf{v}_j) \|_2,$  (2)

where $H_V$ denotes the image similarity learning network, which consists of feed-forward layers with ReLU activations. The text similarity learning network has the same structure as the image similarity learning network. The Siamese networks are illustrated in the bottom middle part of Figure 2.
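For concreteness, the following is a minimal PyTorch sketch of the intra-modal similarity learning network and the loss in Eqs. (1)-(2). The layer sizes, the margin value and all variable names are illustrative assumptions rather than the authors' reported configuration.

```python
import torch
import torch.nn as nn

class SimilarityNet(nn.Module):
    """Feed-forward network H_V (or H_T) with ReLU activations, used in Eq. (2)."""
    def __init__(self, in_dim=4096, hidden_dim=1024, out_dim=128):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        return self.net(x)

def intra_modal_similarity(h, x_i, x_j):
    """s(x_i, x_j) = ||H(x_i) - H(x_j)||_2 as in Eq. (2)."""
    return torch.norm(h(x_i) - h(x_j), dim=-1)

def siamese_loss(h, x_i, x_j, u_ij, margin=1.0):
    """Contrastive loss of Eq. (1), averaged over a mini-batch of sampled pairs.

    u_ij is a float tensor: 1 for pairs with the same class label, 0 otherwise.
    The margin C (here 1.0) is a hyper-parameter whose value is not given in the text.
    """
    s = intra_modal_similarity(h, x_i, x_j)
    return (u_ij * s + (1.0 - u_ij) * torch.clamp(margin - s, min=0.0)).mean()
```

In this sketch, both $H_V$ and $H_T$ would be separate instances of SimilarityNet, trained on image and text features respectively.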
Fig. 2: The overall structure of the CMST method, including two intra-modal similarity learning networks and an inter-modal similarity transferring network. The figure shows the procedure to learn the cross-modal similarity between $\mathbf{v}_i$ and $\mathbf{t}_j$. Different from existing methods, which simply assign the highest similarity to intra-class cross-modal samples, the CMST method uses the similarity value between $\mathbf{v}_i$ and $\mathbf{v}_j$, learnt by the intra-modal similarity learning network in the image modality, to guide cross-modal similarity learning.

3.3. Inter-Modal Similarity Transferring

A direct way of using the intra-modal similarity as a reference for cross-modal similarity learning is to use its value as the learning goal. In this case, we use $\mathbf{t}_j$ as the anchor sample; for clarity, only the intra-modal similarity from the other modality (the image-modality similarity) is used in the explanation, but the training process covers both directions. Also, the ranges of the original features and the projected features are not limited, as the fully connected networks are able to scale between the input and the expected output. The typical form of the similarity transferring loss based on value transferring is defined as

$\mathcal{L}_{val} = | s(\mathbf{v}_j, \mathbf{v}_j) - s(\mathbf{v}_j, \mathbf{t}_j) | + | s(\mathbf{v}_i, \mathbf{v}_j) - s(\mathbf{v}_i, \mathbf{t}_j) |,$  (3)

where the similarity between $\mathbf{v}_j$ and itself is set to the value 1.

Compared to the absolute value of the intra-modal similarity itself, it is the relationship between cross-modal samples that we are really interested in. In other words, the relation $(s(\mathbf{v}_i, \mathbf{t}_j) - s(\mathbf{v}_j, \mathbf{t}_j))$ can measure the difference between $\mathbf{v}_i$ to $\mathbf{t}_j$ and $\mathbf{v}_j$ to $\mathbf{t}_j$ in the common semantic space, where this difference is expected to be related to the difference between $\mathbf{v}_i$ and $\mathbf{v}_j$ in their original image space. By difference transferring, the range of the common semantic space and that of the original modality space can be decoupled, so the learnt joint subspace can be more flexible to modality divergence. The typical form of the similarity transferring loss based on difference transferring is defined as

$\mathcal{L}_{diff} = | (s(\mathbf{v}_j, \mathbf{t}_j) - s(\mathbf{v}_i, \mathbf{t}_j)) - (s(\mathbf{v}_j, \mathbf{v}_j) - s(\mathbf{v}_i, \mathbf{v}_j)) |,$  (4)

where we assume $s(\mathbf{v}_j, \mathbf{v}_j) = 1$.

As transitivity exists in the nature of similarity measurement, similarity transitivity across modalities can also be expected. Following this motivation, product transferring can be done by multiplication along the similarity chain: the similarity between $\mathbf{v}_i$ and $\mathbf{t}_j$ is generated by the chain from $\mathbf{v}_i$ to $\mathbf{v}_j$ and then from $\mathbf{v}_j$ to $\mathbf{t}_j$. In this case, the typical form of the similarity transferring loss based on product transferring is defined as

$\mathcal{L}_{prod} = | s(\mathbf{v}_i, \mathbf{t}_j) - s(\mathbf{v}_i, \mathbf{v}_j)\, s(\mathbf{v}_j, \mathbf{t}_j) |,$  (5)

where some implicit linear function is assumed to enable the direct arithmetical operation between inter-modal and intra-modal similarities; this linear function is absorbed into the function approximation ability of the neural networks.

The proposed similarity transferring approaches provide a reference value or reference relationship for the unpaired items of two different modalities, so the loss function can be defined as the difference between the similarity calculated from the learnt cross-modal features and the reference values. The similarity transferring task is applied together with other widely used tasks within a generative adversarial framework.
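The three losses in Eqs. (3)-(5) can be written down directly once the relevant similarities are available. In the sketch below (a hedged illustration, not the authors' code), sim_vi_tj and sim_vj_tj are cross-modal similarities computed in the common space from the generator outputs, sim_vi_vj is the intra-modal reference produced by the Siamese network, and s(v_j, v_j) is fixed to 1 as stated above.

```python
import torch

def transfer_losses(sim_vi_tj, sim_vj_tj, sim_vi_vj):
    """Similarity transferring losses of Eqs. (3)-(5), averaged over a batch.

    sim_vi_tj, sim_vj_tj: cross-modal similarities in the common space,
        e.g. computed with the same distance form as Eq. (2) between
        G_V(v_i) / G_V(v_j) and G_T(t_j).
    sim_vi_vj: intra-modal similarity from the (frozen) Siamese network,
        used only as a reference value.
    """
    sim_vj_vj = torch.ones_like(sim_vi_vj)                            # s(v_j, v_j) is set to 1
    l_val = (torch.abs(sim_vj_vj - sim_vj_tj)
             + torch.abs(sim_vi_vj - sim_vi_tj)).mean()               # Eq. (3)
    l_diff = torch.abs((sim_vj_tj - sim_vi_tj)
                       - (sim_vj_vj - sim_vi_vj)).mean()              # Eq. (4)
    l_prod = torch.abs(sim_vi_tj - sim_vi_vj * sim_vj_tj).mean()      # Eq. (5)
    return l_val, l_diff, l_prod
```

Only one of the three values would be used as the transferring loss in a given training run.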
3.4. Objective Function

Following the common practice of cross-modal retrieval, the proposed CMST introduces an adversarial learning task and a classification task for learning a better semantic structure in the common representation subspace. Let $\mathcal{L}_{sim}$ denote the similarity transferring loss, chosen as one of $\mathcal{L}_{val}$, $\mathcal{L}_{diff}$ and $\mathcal{L}_{prod}$. For the classification task, the cross-entropy loss is used in CMST and denoted as $\mathcal{L}_{lab}$. For adversarial learning, the discriminator, composed of feed-forward layers, takes the generated representations as input and predicts which modality each input comes from, using a sigmoid activation. Let $\mathcal{L}_V$ denote the cross-entropy loss of predicting the image inputs, and $\mathcal{L}_T$ the cross-entropy loss of predicting the text inputs. The total losses for the generator and the discriminator are given as

$\mathcal{L}_G = \mathcal{L}_{lab} + \mathcal{L}_{sim} + \mathcal{L}_V - \mathcal{L}_T,$  (6)

$\mathcal{L}_D = -\mathcal{L}_V + \mathcal{L}_T.$  (7)
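A minimal sketch of how the adversarial part of Eqs. (6)-(7) could be assembled is given below. The labelling convention (images as 1, texts as 0), the layer sizes and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Feed-forward modality classifier with a sigmoid output."""
    def __init__(self, dim=128, hidden=64):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return torch.sigmoid(self.net(s))

def adversarial_losses(disc, s_v, s_t):
    """L_V / L_T: cross-entropy of predicting the image / text inputs.
    Labelling images as 1 and texts as 0 is an assumed convention."""
    p_v, p_t = disc(s_v), disc(s_t)
    l_v = F.binary_cross_entropy(p_v, torch.ones_like(p_v))
    l_t = F.binary_cross_entropy(p_t, torch.zeros_like(p_t))
    return l_v, l_t

def total_losses(l_lab, l_sim, l_v, l_t):
    """Generator and discriminator objectives of Eqs. (6) and (7)."""
    l_g = l_lab + l_sim + l_v - l_t
    l_d = -l_v + l_t
    return l_g, l_d
```

In a typical adversarial setup, the generator parameters would be updated on $\mathcal{L}_G$ and the discriminator parameters on $\mathcal{L}_D$ in alternating steps.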
4. EXPERIMENTS

4.1. Experimental Setup
Three widely used datasets for cross-modal retrieval are used in the experiments: the Wikipedia dataset [17], the NUS-WIDE-10k dataset [18] and the Pascal Sentences dataset [19]. We follow the dataset partition of [2] for fair comparison. Image features are taken from the fc7 layer of a pre-trained VGGNet-19 model, while text features are the classic bag-of-words (BoW) features with tf-idf weighting. The image features for all datasets have 4096 dimensions, while the text BoW features have 5000 dimensions for the Wikipedia dataset and 1000 dimensions for the NUS-WIDE-10k and Pascal Sentences datasets.

For the class-based retrieval task, performance is measured in terms of mean average precision (mAP), and the measurement is applied in both directions, i.e., Img2txt and Txt2img. The larger the mAP value, the better the performance. For the pair-based retrieval task, we accept only the ground-truth paired sample of the input query as a correct retrieval result. The performance is evaluated by top-k accuracy, i.e., the fraction of queries in the test set for which the correct retrieval result appears within the top-k retrieved results.
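For concreteness, the two evaluation protocols can be sketched in NumPy as follows, given a query-by-gallery distance matrix computed in the common space. The optional top-50 cutoff and the assumption that the i-th query's ground-truth pair sits at gallery index i are illustrative choices, not the authors' exact evaluation code.

```python
import numpy as np

def average_precision(rel):
    """AP of one query, given a 0/1 relevance vector in ranked order."""
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    prec = hits / np.arange(1, len(rel) + 1)
    return float((prec * rel).sum() / rel.sum())

def mean_average_precision(dist, query_labels, gallery_labels, cutoff=None):
    """Class-based protocol: a retrieved item is relevant if it shares the
    query's class label.  dist has shape (num_queries, num_gallery)."""
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])          # smaller distance = better match
        if cutoff is not None:               # e.g. cutoff=50 for a top-50 setting
            order = order[:cutoff]
        rel = (gallery_labels[order] == query_labels[i]).astype(float)
        aps.append(average_precision(rel))
    return float(np.mean(aps))

def top_k_accuracy(dist, k):
    """Pair-based protocol: only the ground-truth paired sample counts,
    assumed here to sit at the same index in the gallery as the query."""
    hits = sum(int(i in np.argsort(dist[i])[:k]) for i in range(dist.shape[0]))
    return hits / dist.shape[0]
```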
4.2. Comparison with State-of-the-Art Methods

CMST is compared with three classes of cross-modal retrieval methods, namely traditional methods, DNN-based methods and GAN-based methods, as shown in Table 1. The reported results of CMST are based on difference transferring. For the results shown in the table, the intra-class samples of the query are counted among the top-50 retrieved documents. The results show that our CMST method outperforms the counterparts on all three datasets for cross-modal retrieval tasks. On the Pascal Sentences dataset, our method improves over the best competitor, ACMR, by 16.1% and 7.9% in the image-to-text and text-to-image retrieval tasks, respectively. On the NUS-WIDE-10k and Wikipedia datasets, the proposed CMST method achieves a relatively small but solid improvement over the state-of-the-art performance. The underlying cause why the method improves a lot on Pascal Sentences but not as much on the other two datasets is that Pascal Sentences contains more semantic classes. On Wikipedia and NUS-WIDE-10k, although the intra-class and inter-class labels provide only coarse supervision for cross-modal similarity learning, the small number of classes makes it easy to separate the classes and arrange the joint semantic distribution. For datasets with more classes, such as Pascal Sentences, the finer similarity structure benefits the generation of the common semantic space, as we expected.

Table 2 shows the experimental results for the pair-based retrieval task in terms of top-k accuracy on the Pascal Sentences dataset, where different k values are tested to examine the method's performance. For smaller values of k (k = 1, 5, 10), our proposed method outperforms the ACMR method by 43.4%, 18.9% and 10.1% on the average over the two retrieval directions. The results indicate that our proposed CMST method effectively improves the rank of the most related retrieval results for a given query.

4.3. Ablation Studies

In this subsection, several experiments are conducted to investigate the effectiveness of the important components of the CMST method.

The intra-modal similarity learnt by the Siamese networks plays an important role in the subsequent similarity transferring procedure. To examine its effectiveness, two traditional similarity measurements are employed as the source of the intra-modal similarity for comparison. Table 3 illustrates the performance of the traditional similarity metrics and the learnt Siamese similarity metric, showing that the Siamese network achieves the best performance among the three similarity measurements.

In order to compare the different similarity transferring approaches, we examine the three approaches on the Pascal Sentences dataset. The last row of Table 4 (no transfer) indicates that the similarity transferring task is not included in the training. Overall, all the similarity transferring approaches bring a noticeable performance improvement. The best similarity transferring approach contributes an improvement of 21.6% compared to CMST without similarity transferring.
Table 1: Comparison with existing cross-modal retrieval methods in terms of mAP.

Category     Method         Wikipedia                Pascal Sentences         NUS-WIDE-10k
                            Img2txt Txt2Img Avg.     Img2txt Txt2Img Avg.     Img2txt Txt2Img Avg.
Traditional  CCA-3V [7]     0.437   0.383   0.410    0.316   0.270   0.293    -       -       -
             LCFS [12]      0.455   0.398   0.427    0.442   0.357   0.400    0.383   0.346   0.365
             JRL [13]       0.453   0.400   0.426    0.504   0.489   0.496    0.426   0.376   0.401
             JFSSL [14]     0.428   0.396   0.412    -       -       -        -       -       -
DNN-based    Corr-AE [15]   0.402   0.395   0.398    0.489   0.444   0.467    0.366   0.417   0.392
             DCML [1]       0.554   0.538   0.546    -       -       -        0.385   0.405   0.395
             CMDN [16]      0.488   0.427   0.458    0.534   0.534   0.534    0.492   0.515   0.504
GAN-based    CM-GAN [9]     0.521   0.466   0.494    0.603   0.604   0.604    -       -       -
             MHTN [10]      0.514   0.444   0.479    0.496   0.500   0.498    0.520   0.534   0.527
             ACMR [2]       0.619   0.489   0.554    0.535   0.543   0.539    0.544   0.538   0.541
             CMST (Ours)
Table 2: Comparison with the ACMR method on the Pascal Sentences dataset in terms of top-k accuracy.

Methods       k = 1                    k = 5                    k = 10                   k = 50
              Img2txt Txt2Img Avg.     Img2txt Txt2Img Avg.     Img2txt Txt2Img Avg.     Img2txt Txt2Img Avg.
ACMR          0.140   0.145   0.143    0.445   0.400   0.423    0.675   0.665   0.670    0.910   0.915   0.913
CMST (Ours)
Table 3: Similarity transferring with different intra-modal similarity measurements.

Methods          mAP                  top-1 acc
                 Img2txt  Txt2Img    Img2txt  Txt2Img
Cosine           0.509    0.500      0.150    0.140
Euclidean        0.563    0.554      0.165    0.125
Siamese Network
Table 4: Effects of the similarity transferring approaches.

Methods       mAP                  top-1 acc
              Img2txt  Txt2Img    Img2txt  Txt2Img
Value         0.600    0.564      0.145    0.155
Difference
Product       0.510    0.514      0.120    0.140
No transfer   0.485    0.508      0.080    0.105

The difference transferring approach performs better than the value transferring approach because it keeps the comparative relationship between samples instead of assigning a fixed value to the similarity between samples; the generator is thus given more flexibility for the feature projection. The product transferring approach performs poorly because the distance measurement used in the experiments is the Euclidean distance and no explicit limitation or normalization is added to the distance calculation; the product of Euclidean distances between high-dimensional feature vectors varies sharply with minor changes.
Table 5: Effects of different training strategies.

Strategies    mAP                  top-1 acc
              Img2txt  Txt2Img    Img2txt  Txt2Img
Two-stage
Fine-tuning   0.596    0.571      0.180    0.180
End-to-end    0.585    0.561      0.140    0.165

A two-stage training strategy is employed by first learning the single-modal similarity metrics and then transferring the single-modal similarities into the cross-modal semantic space. Note that all previous results are based on the two-stage strategy. On the other hand, fine-tuning and end-to-end training strategies are also available for our proposed CMST method. To examine the influence of different training strategies, we conduct an additional experiment using the different training methods and compare the testing results at the same epoch (100). In the two-stage and fine-tuning strategies, the Siamese networks are trained for 50 epochs in advance. For two-stage training, the parameters of the Siamese networks are fixed after the 50th epoch, while for fine-tuning, the parameters remain learnable with a low learning rate (0.0001). Table 5 shows the effect of the different training methods, from which we can see that two-stage training yields the best performance. Fine-tuning does not provide additional performance improvement, and end-to-end training has a negative influence on cross-modal similarity learning because the single-modal similarity learning should not be heavily affected by cross-modal information.
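The difference between the strategies mostly amounts to whether the Siamese parameters are frozen or kept learnable after their pre-training. A hedged sketch of that bookkeeping is shown below; the network stand-ins and the generator learning rate are assumptions, while the 0.0001 fine-tuning rate follows the text above.

```python
import itertools
import torch
import torch.nn as nn

# Stand-ins for the Siamese networks and common-space generators from the
# earlier sketches; the sizes are illustrative only.
h_v, h_t = nn.Linear(4096, 128), nn.Linear(5000, 128)   # similarity networks
g_v, g_t = nn.Linear(4096, 128), nn.Linear(5000, 128)   # generators

# Two-stage: after their 50-epoch pre-training the Siamese networks are frozen
# and only the generators (plus discriminator and classifier) keep training.
for p in itertools.chain(h_v.parameters(), h_t.parameters()):
    p.requires_grad = False
optim_two_stage = torch.optim.Adam(
    itertools.chain(g_v.parameters(), g_t.parameters()), lr=1e-4)  # generator lr assumed

# Fine-tuning: the Siamese networks stay learnable, but at the low learning
# rate (0.0001) reported for this comparison.
for p in itertools.chain(h_v.parameters(), h_t.parameters()):
    p.requires_grad = True
optim_finetune = torch.optim.Adam([
    {"params": list(g_v.parameters()) + list(g_t.parameters()), "lr": 1e-4},
    {"params": list(h_v.parameters()) + list(h_t.parameters()), "lr": 1e-4},
])
```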
5. CONCLUSIONS

In this paper, a novel cross-modal retrieval method named CMST is proposed. The proposed method efficiently learns similarity metrics in each single-modal space with the intra-modal similarity learning networks, and then guides cross-modal similarity learning with the learnt single-modal similarity metrics. Our proposed similarity transferring approaches successfully transfer the finer similarity structure captured in the single-modal spaces to the cross-modal space. Experiments demonstrate that the CMST method outperforms state-of-the-art cross-modal retrieval methods in both class-based and pair-based retrieval tasks.
6. ACKNOWLEDGMENTS
This work was supported by the National Key R&D Program of China (2018YFB0505400) and in part by the Tsinghua-Kuaishou Institute of Future Media Data. We also thank Mr. Tao Li and Ms. Junhui Wu, R&D engineers at Kuaishou, for their assistance in the research.
7. REFERENCES

[1] Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou, "Deep coupled metric learning for cross-modal matching," IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1234–1244, 2017.
[2] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen, "Adversarial cross-modal retrieval," in ACM MM. ACM, 2017, pp. 154–162.
[3] Alexis Mignon and Frédéric Jurie, "CMML: A new metric learning approach for cross modal matching," in ACCV, 2012.
[4] Jingkuan Song, Tao He, Lianli Gao, Xing Xu, Alan Hanjalic, and Heng Tao Shen, "Binary generative adversarial networks for image retrieval," in AAAI, 2018.
[5] Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen, and Xuelong Li, "Learning discriminative binary codes for large-scale cross-modal retrieval," TIP, vol. 26, no. 5, pp. 2494–2507, 2017.
[6] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, "Deep canonical correlation analysis," in ICML, 2013, pp. 1247–1255.
[7] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik, "A multi-view embedding space for modeling internet images, tags, and their semantics," International Journal of Computer Vision, vol. 106, no. 2, pp. 210–233, 2014.
[8] Duangmanee Putthividhy, Hagai T. Attias, and Srikantan S. Nagarajan, "Topic regression multi-modal latent Dirichlet allocation for image annotation," in CVPR. IEEE, 2010, pp. 3408–3415.
[9] Yuxin Peng, Jinwei Qi, and Yuxin Yuan, "CM-GANs: Cross-modal generative adversarial networks for common representation learning," arXiv preprint arXiv:1710.05106, 2017.
[10] Xin Huang, Yuxin Peng, and Mingkuan Yuan, "MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval," arXiv preprint arXiv:1708.04308, 2017.
[11] Sumit Chopra, Raia Hadsell, and Yann LeCun, "Learning a similarity metric discriminatively, with application to face verification," in CVPR. IEEE, 2005, pp. 539–546.
[12] Kaiye Wang, Ran He, Wei Wang, Liang Wang, and Tieniu Tan, "Learning coupled feature spaces for cross-modal matching," in ICCV. IEEE, 2013, pp. 2088–2095.
[13] Xiaohua Zhai, Yuxin Peng, and Jianguo Xiao, "Learning cross-media joint representation with sparse and semi-supervised regularization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 965–978, 2014.
[14] Kaiye Wang, Ran He, Liang Wang, Wei Wang, and Tieniu Tan, "Joint feature selection and subspace learning for cross-modal retrieval," TPAMI, vol. 38, no. 10, pp. 2010–2023, 2016.
[15] Fangxiang Feng, Xiaojie Wang, and Ruifan Li, "Cross-modal retrieval with correspondence autoencoder," in ACM MM. ACM, 2014, pp. 7–16.
[16] Yuxin Peng, Xin Huang, and Jinwei Qi, "Cross-media shared representation by hierarchical learning with multiple deep networks," in IJCAI, 2016, pp. 3846–3853.
[17] Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Nikhil Rasiwasia, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos, "On the role of correlation and abstraction in cross-modal multimedia retrieval," TPAMI, vol. 36, no. 6, pp. 521–535, 2014.
[18] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009.
[19] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.