Do Neural Network Cross-Modal Mappings Really Bridge Modalities?
Guillem Collell
Department of Computer Science, KU Leuven
[email protected]
Marie-Francine Moens
Department of Computer Science, KU Leuven
[email protected]
Abstract
Feed-forward networks are widely used in cross-modal applications to bridge modalities by mapping distributed vectors of one modality to the other, or to a shared space. The predicted vectors are then used to perform e.g., retrieval or labeling. Thus, the success of the whole system relies on the ability of the mapping to make the neighborhood structure (i.e., the pairwise similarities) of the predicted vectors akin to that of the target vectors. However, whether this is achieved has not been investigated yet. Here, we propose a new similarity measure and two ad hoc experiments to shed light on this issue. In three cross-modal benchmarks we learn a large number of language-to-vision and vision-to-language neural network mappings (up to five layers) using a rich diversity of image and text features and loss functions. Our results reveal that, surprisingly, the neighborhood structure of the predicted vectors consistently resembles more that of the input vectors than that of the target vectors. In a second experiment, we further show that untrained nets do not significantly disrupt the neighborhood (i.e., semantic) structure of the input vectors.
Introduction

Neural network mappings are widely used to bridge modalities or spaces in cross-modal retrieval (Qiao et al., 2017; Wang et al., 2016; Zhang et al., 2016), zero-shot learning (Lazaridou et al., 2015b, 2014; Socher et al., 2013), in building multimodal representations (Collell et al., 2017) or in word translation (Lazaridou et al., 2015a), to name a few. Typically, a neural network is firstly trained to predict the distributed vectors of one modality (or space) from the other. At test time, some operation such as retrieval or labeling is performed based on the nearest neighbors of the predicted (mapped) vectors. For instance, in zero-shot image classification, image features are mapped to the text space and the label of the nearest neighbor word is assigned. Thus, the success of such systems relies entirely on the ability of the map to make the predicted vectors similar to the target vectors in terms of semantic or neighborhood structure. However, whether neural nets achieve this goal in general has not been investigated yet. In fact, recent work evidences that considerable information about the input modality propagates into the predicted modality (Collell et al., 2017; Lazaridou et al., 2015b; Frome et al., 2013).

We indistinctly use the terms semantic structure, neighborhood structure and similarity structure. They refer to all pairwise similarities of a set of N vectors, for some similarity measure (e.g., Euclidean or cosine).

To shed light on these questions, we first introduce the (to the best of our knowledge) first existing measure to quantify similarity between the neighborhood structures of two sets of vectors. Second, we perform extensive experiments in three benchmarks where we learn image-to-text and text-to-image neural net mappings using a rich variety of state-of-the-art text and image features and loss functions. Our results reveal that, contrary to expectation, the semantic structure of the mapped vectors consistently resembles more that of the input vectors than that of the target vectors of interest. In a second experiment, by using six concept similarity tasks we show that the semantic structure of the input vectors is preserved after mapping them with an untrained network, further evidencing that feed-forward nets naturally preserve semantic information about the input. Overall, we uncover and raise awareness of a largely ignored phenomenon relevant to a wide range of cross-modal / cross-space applications such as retrieval, zero-shot learning or image annotation. Ultimately, this paper aims at: (1) Encouraging the development of better architectures to bridge modalities / spaces; (2) Advocating for the use of semantic-based criteria to evaluate the quality of predicted vectors, such as the neighborhood-based measure proposed here, instead of purely geometric measures such as mean squared error (MSE).

Figure 1: Effect of applying a mapping f to a (disconnected) manifold M with three hypothetical classes (■, ▲ and •).

Related Work

Neural network and linear mappings are popular tools to bridge modalities in cross-modal retrieval systems. Lazaridou et al. (2015b) leverage a text-to-image linear mapping to retrieve images given text queries. Weston et al. (2011) map label and image features into a shared space with a linear mapping to perform image annotation. Alternatively, Frome et al. (2013), Lazaridou et al. (2014) and Socher et al. (2013) perform zero-shot image classification with an image-to-text neural network mapping.
Instead of mapping to latent features, Collell et al. (2018) use a 2-layer feed-forward network to map word embeddings directly to image pixels in order to visualize spatial arrangements of objects. Neural networks are also popular in other cross-space applications such as cross-lingual tasks. Lazaridou et al. (2015a) learn a linear map from language A to language B and then translate new words by returning the nearest neighbor of the mapped vector in the B space.

In the context of zero-shot learning, shortcomings of cross-space neural mappings have also been identified. For instance, "hubness" (Radovanović et al., 2010) and "pollution" (Lazaridou et al., 2015a) relate to the high dimensionality of the feature spaces and to overfitting, respectively. Crucially, we do not assume that our cross-modal problem has any class labels, and we study the similarity between input and mapped vectors and between output and mapped vectors.

Recent work evidences that the predicted vectors of cross-modal neural net mappings are still largely informative about the input vectors. Lazaridou et al. (2015b) qualitatively observe that abstract textual concepts are grounded with the visual input modality. Counterintuitively, Collell et al. (2017) find that the vectors "imagined" from a language-to-vision neural map outperform the original visual vectors in concept similarity tasks. The paper argued that the reconstructed visual vectors become grounded with language because the map preserves topological properties of the input. Here, we go one step further and show that the mapped vectors often resemble the input vectors more than the target vectors in semantic terms, which goes against the goal of a cross-modal map.

Well-known theoretical work shows that networks with as few as one hidden layer are able to approximate any function (Hornik et al., 1989). However, this result reveals little about either test performance or the semantic structure of the mapped vectors. Instead, the phenomenon described here is more closely tied to other properties of neural networks. In particular, continuity guarantees that topological properties of the input, such as connectedness, are preserved (Armstrong, 2013). Furthermore, continuity in a topology induced by a metric also ensures that points that are close together are mapped close together. As a toy example, Fig. 1 illustrates the distortion of a manifold after being mapped by a neural net whose parameters were generated at random.

In a noiseless world with fully statistically dependent modalities, the vectors of one modality could be perfectly predicted from those of the other. However, in real-world problems this is unrealistic given the noise of the features and the fact that modalities encode complementary information (Collell and Moens, 2016). Such unpredictability, combined with the continuity and topology-preserving properties of neural nets, propels the phenomenon identified, namely mapped vectors resembling the input vectors more than the target vectors, in nearest neighbor terms.
Proposed Approach
To bridge modalities X and Y, we consider two popular cross-modal mappings f : X → Y:

(i) Linear mapping (lin): f(x) = Wx + b, with W ∈ R^{d_y × d_x} and b ∈ R^{d_y}, where d_x and d_y are the input and output dimensions, respectively.

(ii) Feed-forward neural network (nn): f(x) = W_2 σ(W_1 x + b_1) + b_2, with W_2 ∈ R^{d_y × d_h}, W_1 ∈ R^{d_h × d_x}, b_1 ∈ R^{d_h} and b_2 ∈ R^{d_y}, where d_h is the number of hidden units and σ() is the non-linearity (e.g., tanh or sigmoid; we find that sigmoid and ReLU yield similar results).

Although single-hidden-layer networks are already universal approximators (Hornik et al., 1989), we explored whether deeper nets (of up to five layers) could improve the fit (see Supplement).
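For concreteness, a minimal sketch of the two mappings in Keras (the library used in our implementation, see Supplement); the dimensions below are placeholders for, e.g., a 300-d GloVe to 2,048-d ResNet map:

```python
# Minimal sketch of mappings (i) and (ii); dimensions are placeholders.
from keras.models import Sequential
from keras.layers import Dense

d_x, d_y, d_h = 300, 2048, 512  # e.g., GloVe (300-d) -> ResNet (2048-d)

# (i) Linear mapping (lin): f(x) = Wx + b
lin = Sequential([Dense(d_y, input_dim=d_x)])

# (ii) One-hidden-layer feed-forward net (nn):
#      f(x) = W2 * sigma(W1 * x + b1) + b2, with sigma = tanh
nn = Sequential([
    Dense(d_h, input_dim=d_x, activation='tanh'),
    Dense(d_y),
])
```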
Loss: Our primary choice is the MSE: ‖f(x) − y‖², where y is the target vector. We also tested other losses such as the cosine: −cos(f(x), y), and the max-margin: max{0, γ + ‖f(x) − y‖ − ‖f(x̃) − y‖}, where x̃ belongs to a different class than (x, y) and γ is the margin. As in Lazaridou et al. (2015a) and Weston et al. (2011), we choose the first x̃ that violates the constraint. Notice that losses that do not require class labels, such as MSE, are suitable for a wider, more general set of tasks than discriminative losses (e.g., cross-entropy). In fact, cross-modal retrieval tasks often do not exhibit any class labels. Additionally, our research question concerns the cross-space mapping problem in isolation (independently of class labels).
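These losses can be written as custom Keras objectives; a hedged sketch follows ('mse' is built in, and the max-margin version requires wiring the negative example x̃ through a second input branch of the model, which is omitted here):

```python
# Hedged sketches of the cosine and max-margin losses; MSE is built in ('mse').
from keras import backend as K

def cosine_loss(y_true, y_pred):
    # -cos(f(x), y): maximize cosine similarity to the target
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return -K.sum(y_true * y_pred, axis=-1)

def max_margin_loss(y_true, y_pred, y_pred_neg, gamma=0.5):
    # max{0, gamma + ||f(x) - y|| - ||f(x_neg) - y||}; gamma is a placeholder.
    # y_pred_neg = f(x_neg) comes from a second input branch in practice.
    pos = K.sqrt(K.sum(K.square(y_pred - y_true), axis=-1))
    neg = K.sqrt(K.sum(K.square(y_pred_neg - y_true), axis=-1))
    return K.maximum(0.0, gamma + pos - neg)

# usage: nn.compile(optimizer='rmsprop', loss=cosine_loss)  # or loss='mse'
```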
Let us denote a set of N input and output vectors by X ∈ R^{N × d_x} and Y ∈ R^{N × d_y}, respectively. Each input vector x_i is paired to the output vector y_i of the same index (i = 1, ..., N). Let us henceforth denote the mapped input vectors by f(X) ∈ R^{N × d_y}. In order to explore the similarity between f(X) and X, and between f(X) and Y, we propose two ad hoc settings below.

To measure the similarity between the neighborhood structure of two sets of paired vectors V and Z, we propose the mean nearest neighbor overlap measure (mNNO_K(V, Z)). We define the nearest neighbor overlap NNO_K(v_i, z_i) as the number of K nearest neighbors that two paired vectors v_i, z_i share in their respective spaces. E.g., if the 3 (= K) nearest neighbors of v_cat in V are {v_dog, v_tiger, v_lion} and those of z_cat in Z are {z_mouse, z_tiger, z_lion}, then NNO_3(v_cat, z_cat) is 2.

Definition 1. Let V = {v_i}_{i=1}^N and Z = {z_i}_{i=1}^N be two sets of N paired vectors. We define:

mNNO_K(V, Z) = (1 / (KN)) Σ_{i=1}^{N} NNO_K(v_i, z_i)    (1)

with NNO_K(v_i, z_i) = |NN_K(v_i) ∩ NN_K(z_i)|, where NN_K(v_i) and NN_K(z_i) are the indexes of the K nearest neighbors of v_i and z_i, respectively.

The normalizing constant K simply scales mNNO_K(V, Z) between 0 and 1, making it independent of the choice of K. Thus, mNNO_K(V, Z) = 0.7 means that the vectors in V and Z share, on average, 70% of their nearest neighbors. Notice that mNNO implicitly performs retrieval for some similarity measure (e.g., Euclidean or cosine), and quantifies how semantically similar two sets of paired vectors are.
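A straightforward reference implementation of Eq. (1), sketched here with scikit-learn for the neighbor search (the self-neighbor is excluded, as implied by the definition):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mnno(V, Z, K=10, metric='cosine'):
    """Mean nearest neighbor overlap of Eq. (1).

    V, Z: (N, d_v) and (N, d_z) arrays whose i-th rows are paired.
    Returns a value in [0, 1]; e.g., 0.7 means V and Z share on
    average 70% of their K nearest neighbors.
    """
    N = V.shape[0]
    # Query K+1 neighbors since each point is returned as its own neighbor.
    idx_v = (NearestNeighbors(n_neighbors=K + 1, metric=metric)
             .fit(V).kneighbors(V, return_distance=False)[:, 1:])
    idx_z = (NearestNeighbors(n_neighbors=K + 1, metric=metric)
             .fit(Z).kneighbors(Z, return_distance=False)[:, 1:])
    overlap = sum(len(set(idx_v[i]) & set(idx_z[i])) for i in range(N))
    return overlap / float(K * N)
```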
To complement the setting above (Sect. 3.1), it is instructive to consider the limit case of an untrained network. Concept similarity tasks provide a suitable setting to study the semantic structure of distributed representations (Pennington et al., 2014). That is, semantically similar concepts should ideally be close together. In particular, our interest is in comparing X with its projection f(X) through a mapping with random parameters, to understand the extent to which the mapping may disrupt or preserve the semantic structure of X.

Experiments

To test the generality of our claims, we select a rich diversity of cross-modal tasks involving texts at three levels: word level (ImageNet), sentence level (IAPR TC-12), and document level (Wiki).

ImageNet (Russakovsky et al., 2015). Consists of ~14M images, covering ~22K WordNet synsets (or meanings). Following Collell et al. (2017), we take the most relevant word for each synset and keep only synsets with more than 50 images. This yields 9,251 different words (or instances).
IAPR TC-12 (Grubinger et al., 2006). Contains 20K images (18K train / 2K test) annotated with 255 labels. Each image is accompanied with a short description of one to three sentences.
Wikipedia (Pereira et al., 2014). Has 2,866 samples (2,173 train / 693 test). Each sample is a section of a Wikipedia article paired with one image.
See the Supplement for details.
To ensure that results are independent of the choice of image and text features, we use 5 (2 image + 3 text) features of varied dimensionality (64-d, 128-d, 300-d, 2,048-d) and two directions, text-to-image (T→I) and image-to-text (I→T). We make our extracted features publicly available at http://liir.cs.kuleuven.be/software.html.

Text. In ImageNet we use 300-dimensional GloVe (Pennington et al., 2014; http://nlp.stanford.edu/projects/glove) and 300-d word2vec (Mikolov et al., 2013) word embeddings. In IAPR TC-12 and Wiki, we employ state-of-the-art bidirectional gated recurrent unit (biGRU) features (Cho et al., 2014) that we learn with a classification task (see Sect. 2 of Supplement).
Image. For ImageNet, we use the publicly available VGG-128 (Chatfield et al., 2014) and ResNet (He et al., 2015) visual features from Collell et al. (2017), where 128-dimensional VGG-128 and 2,048-d ResNet features were obtained from the last layer (before the softmax) of the forward pass of each image. The final representation for a word is the average feature vector (centroid) of all available images for this word. In IAPR TC-12 and Wiki, features for individual images are obtained similarly from the last layer of a ResNet and a VGG-128 model.
We include six benchmarks, comprising three types of concept similarity: (i) Semantic similarity: SemSim (Silberer and Lapata, 2014), SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016); (ii) Relatedness: MEN (Bruni et al., 2014) and WordSim-353 (Finkelstein et al., 2001); (iii) Visual similarity: VisSim (Silberer and Lapata, 2014), which includes the same word pairs as SemSim, rated for visual similarity instead of semantic similarity. All six test sets contain human ratings of similarity for word pairs, e.g., ('cat', 'dog').
The parameters in W_1, W_2 are drawn from a random uniform distribution and b_1, b_2 are set to zero. We use a tanh activation σ(). The output dimension d_y is set to 2,048 for all embeddings. Textual and visual features are the same as described in Sect. 4.1.3 for the ImageNet dataset.
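For illustration, a minimal NumPy sketch of such an untrained mapping (the uniform bound a is a placeholder, as the exact interval is not given above):

```python
# Untrained mappings with random weights and zero biases (sketch).
import numpy as np

def untrained_nn(X, d_h=512, d_y=2048, a=0.1, seed=0):
    # W1, W2 ~ U[-a, a] (a is a placeholder bound); b1, b2 = 0; sigma = tanh
    rng = np.random.RandomState(seed)
    W1 = rng.uniform(-a, a, size=(X.shape[1], d_h))
    W2 = rng.uniform(-a, a, size=(d_h, d_y))
    return np.tanh(X @ W1) @ W2

def untrained_lin(X, d_y=2048, a=0.1, seed=0):
    rng = np.random.RandomState(seed)
    W = rng.uniform(-a, a, size=(X.shape[1], d_y))
    return X @ W
```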
We compute the prediction of similarity between two vectors z_1, z_2 with both the cosine similarity z_1ᵀz_2 / (‖z_1‖‖z_2‖) and the Euclidean similarity, i.e., a monotonically decreasing transform of the distance ‖z_1 − z_2‖ (notice that papers generally use only cosine similarity; Lazaridou et al., 2015b; Pennington et al., 2014). As is common practice, we evaluate the predictions of similarity of the embeddings (Sect. 4.2.4) against the human similarity ratings with the Spearman correlation ρ. We report the average of 10 sets of randomly generated parameters.

We test statistical significance with a two-sided Wilcoxon rank sum test adjusted with Bonferroni. The null hypothesis is that a compared pair is equal. In Tab. 1, ∗ indicates that mNNO(X, f(X)) differs significantly from mNNO(Y, f(X)); in Tab. 2, ∗ indicates that the performance of mapped and input vectors differs significantly.
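A sketch of this evaluation protocol, with hypothetical helper names (emb maps a word to its original or mapped vector):

```python
# Spearman evaluation of predicted similarities against human ratings.
import numpy as np
from scipy.stats import spearmanr

def eval_similarity(emb, pairs, ratings, cosine=True):
    # emb: dict word -> vector; pairs: [(w1, w2), ...]; ratings: human scores
    preds = []
    for w1, w2 in pairs:
        z1, z2 = emb[w1], emb[w2]
        if cosine:
            preds.append(z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2)))
        else:
            # any decreasing transform of the distance gives the same Spearman
            preds.append(-np.linalg.norm(z1 - z2))
    return spearmanr(preds, ratings).correlation
```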
Results below are with cosine neighbors and K = 10. Euclidean neighbors yield similar results and are thus left to the Supplement. Similarly, results in ImageNet with GloVe embeddings are shown below and word2vec results in the Supplement. The choice of K had no visible effect on results. Results with 3- and 5-layer nets did not show big differences with respect to the results below (see Supplement). The cosine and max-margin losses performed slightly worse than MSE (see Supplement). Although Lazaridou et al. (2015a) and Weston et al. (2011) find that max-margin performs the best in their tasks, we do not find our result entirely surprising given that max-margin focuses on inter-class differences while we look also at intra-class neighbors (in fact, we do not require classes).

Figure 2: Learning a nn model in Wiki (left), IAPR TC-12 (middle) and ImageNet (right): train (tr) and test (ts) MSE, mNNO(X, f(X)) and mNNO(Y, f(X)) over training epochs.

Tab. 1 shows our core finding, namely that the semantic structure of f(X) resembles more that of X than that of Y, for both lin and nn maps.
Table 1: Test mean nearest neighbor overlap in ImageNet, IAPR TC-12 and Wikipedia, for both directions (I→T and T→I), both mappings (lin and nn) and both visual features (ResNet and VGG-128). Boldface indicates the largest score of each mNNO(X, f(X)) and mNNO(Y, f(X)) pair, which are abbreviated by X, f(X) and Y, f(X); ∗ marks statistically significant differences.

Fig. 2 is particularly revealing. If we would only look at train performance (and allow train MSE to reach 0), then f(X) = Y and clearly train mNNO(f(X), Y) = 1, while mNNO(f(X), X) can only be smaller than 1. However, the interest is always on test samples, and (near-)perfect test prediction is unrealistic. Notice in fact in Fig. 2 that, even if we look at train fit, MSE needs to be close to 0 for mNNO(f(X), Y) to be reasonably large. In all the combinations from Tab. 1, the test mNNO(f(X), Y) never surpasses the test mNNO(f(X), X) for any number of epochs, even with an oracle (not shown).

Tab. 2 shows that untrained linear (f_lin) and neural net (f_nn) mappings preserve the semantic structure of the input X, complementing thus the findings of Experiment 1. Experiment 1 concerns learning, while, by "ablating" the learning part and randomizing weights, Experiment 2 is revealing about the natural tendency of neural nets to preserve semantic information about the input, regardless of the choice of the target vectors and loss function.

Table 2: Spearman correlations between human ratings and the similarities (cosine or Euclidean) predicted from the embeddings, for six benchmarks (WS-353, MEN, SemSim, VisSim, SimLex, SimVerb) and for f_nn and f_lin applied to GloVe and ResNet inputs, together with the raw embeddings. Boldface denotes best performance per input embedding type.
Conclusions

Overall, we uncovered a phenomenon neglected so far, namely that neural net cross-modal mappings can produce mapped vectors more akin to the input vectors than to the target vectors, in terms of semantic structure. Such finding has been possible thanks to the proposed measure, which explicitly quantifies similarity between the neighborhood structures of two sets of vectors. While other measures such as mean squared error can be misleading, our measure provides a more realistic estimate of the semantic similarity between predicted and target vectors. In fact, it is the semantic structure (or pairwise similarities) that ultimately matters in cross-modal applications.

Acknowledgments
This work has been supported by the CHIST-ERA EU project MUSTER and by the KU Leuven grant RUN/15/005.

References
Mark Anthony Armstrong. 2013. Basic Topology. Springer Science & Business Media.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. JAIR 49:1–47.

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In BMVC.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

François Chollet et al. 2015. Keras. https://github.com/keras-team/keras.

Guillem Collell and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In COLING. ACL, pages 2807–2817.

Guillem Collell, Luc Van Gool, and Marie-Francine Moens. 2018. Acquiring common sense spatial knowledge through implicit spatial templates. In AAAI. AAAI.

Guillem Collell, Teddy Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In AAAI. AAAI, pages 4378–4384.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In WWW. ACM, pages 406–414.

Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.

Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. 2006. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, volume 5, page 10.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.

Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In ACL, pages 1403–1414.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015a. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In ACL, volume 1, pages 270–280.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015b. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Nikhil Rasiwasia, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos. 2014. On the role of correlation and abstraction in cross-modal multimedia retrieval. TPAMI.

Qiao et al. 2017. arXiv preprint arXiv:1707.05427.

Milos Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. On the existence of obstinate results in vector space models. In SIGIR. ACM, pages 186–193.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211–252.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL, pages 721–732.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943.

Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. 2016. A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764–2770.

Yang Zhang, Boqing Gong, and Mubarak Shah. 2016. Fast zero-shot image tagging. In CVPR. IEEE, pages 5985–5994.

Supplementary Material of:
Do Neural Network Cross-Modal Mappings Really BridgeModalities?
A Hyperparameters and Implementation
Hyperparameters (including the number of epochs) are chosen by 5-fold cross-validation (CV), optimizing for the test loss. Crucially, we ensure that all mappings are learned properly by verifying that the training loss steadily decreases. We search over a grid of learning rates and over numbers of hidden units (d_h) in {64, 128, 256, 512, 1024}. Using different numbers of hidden units (and selecting the best-performing one) is important in order to guarantee that our conclusions are not influenced by, or just a product of, underfitting or overfitting. Similarly, we learned the mappings at different levels of dropout, which did not yield any improvement w.r.t. zero dropout (shown in our results).

We use a ReLU activation, the RMSprop optimizer and a batch size of 64. We find that sigmoid and tanh yield results similar to ReLU. Our implementation is in Keras (Chollet et al., 2015).

Since ImageNet does not have any set of "test concepts", we employ 5-fold CV. Reported results are either averages over 5 folds (ImageNet) or over 5 runs with different model weight initializations (IAPR TC-12 and Wiki).

For the max-margin loss, we choose the margin γ by cross-validation over a small grid of values.
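For concreteness, a hedged sketch of this selection loop; build_nn is a hypothetical helper that constructs the nn mapping of the main paper for a given hidden size and learning rate, and the learning-rate grid shown is a placeholder:

```python
# Hyperparameter selection by 5-fold CV on the mapping loss (sketch).
import numpy as np
from sklearn.model_selection import KFold

def select_hyperparams(X, Y, build_nn, lrs=(1e-4, 1e-3, 1e-2),
                       hidden=(64, 128, 256, 512, 1024), epochs=30):
    best, best_loss = None, np.inf
    for lr in lrs:          # placeholder learning-rate grid
        for d_h in hidden:  # grid given in the text
            losses = []
            for tr, te in KFold(n_splits=5).split(X):
                model = build_nn(d_h=d_h, lr=lr)
                model.fit(X[tr], Y[tr], batch_size=64, epochs=epochs,
                          verbose=0)
                losses.append(model.evaluate(X[te], Y[te], verbose=0))
            if np.mean(losses) < best_loss:
                best, best_loss = (lr, d_h), np.mean(losses)
    return best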
B Textual Feature Extraction

Unlike ImageNet, where we associate a word embedding to each concept, the textual modality in IAPR TC-12 and Wiki consists of sentences. In order to extract state-of-the-art textual features in these datasets, we train the following, separate network (prior to the cross-modal mapping). First, the embedded input sentences are passed to a bidirectional GRU of 64 units, then fed into a fully-connected layer, followed by a cross-entropy loss on the vector of class labels. We collect the 64-d averaged GRU hidden states of both directions as features (see the sketch at the end of this section). The network is trained with the Adam optimizer.

In Wiki and IAPR TC-12 we verify that the extracted text and image features are indeed informative and useful by computing their mean average precision (mAP) in retrieval (considering that a document B is relevant for document A if A and B share at least one class label). In Wiki we find mAPs of: biGRU = 0.77, ResNet = 0.22 and VGG-128 = 0.21. In IAPR TC-12 we find mAPs of: biGRU = 0.77, ResNet = 0.49 and VGG-128 = 0.46. Notice that ImageNet has a single data point per class in our setting, and thus mAP cannot be computed. However, we employ standard GloVe, word2vec, VGG-128 and ResNet vectors in ImageNet, which are known to perform well.
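A hedged Keras sketch of this extractor, assuming placeholder vocabulary/sequence sizes and one plausible reading of "64-d averaged GRU hidden states of both directions" (average the two directions, then average over time). A softmax head is assumed for simplicity; the multi-label case of IAPR TC-12 would instead use a sigmoid head with binary cross-entropy:

```python
# biGRU text-feature extractor (sketch); sizes are placeholders.
from keras.models import Model
from keras.layers import (Input, Embedding, Bidirectional, GRU,
                          GlobalAveragePooling1D, Dense)

vocab_size, emb_dim, max_len, n_classes = 10000, 300, 50, 255

inp = Input(shape=(max_len,))
x = Embedding(vocab_size, emb_dim)(inp)
# 64 units per direction; averaging the directions keeps features 64-d
x = Bidirectional(GRU(64, return_sequences=True), merge_mode='ave')(x)
feats = GlobalAveragePooling1D()(x)           # 64-d averaged hidden states
out = Dense(n_classes, activation='softmax')(feats)  # cross-entropy head

clf = Model(inp, out)
clf.compile(optimizer='adam', loss='categorical_crossentropy')
extractor = Model(inp, feats)  # yields the 64-d text features
```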
C Additional Results
Results with mNNO(X, Y) (omitted in the main paper for space reasons): Interestingly, the similarity mNNO(X, Y) between the original input X and output Y vectors is generally low (between 1.5 and 2.3 shared neighbors out of K = 10), indicating that these spaces are originally quite different. However, mNNO(X, Y) always remains lower than mNNO(f(X), Y), indicating thus that the mapping makes a difference.

C.1 Experiment 1

C.1.1 Results with 3 and 5 layers
Table: Test mNNO with 3-layer (nn-3) and 5-layer (nn-5) networks, for each dataset (ImageNet, IAPR TC-12, Wiki), direction (I→T and T→I) and visual feature (ResNet and VGG-128). Boldface indicates the largest score of each mNNO(X, f(X)) and mNNO(Y, f(X)) pair, which are abbreviated by X, f(X) and Y, f(X).

It is interesting to notice that even though the difference between mNNO(X, f(X)) and mNNO(Y, f(X)) has narrowed down w.r.t. the linear and 1-hidden-layer models (in the main paper) in some cases (e.g., ImageNet), this does not seem to be caused by better predictions, i.e., an increase of mNNO(Y, f(X)), but rather by a decrease of mNNO(X, f(X)). This is expected since with more layers the information about the input is less preserved.

Table: mNNO with 3- and 5-layer networks (nn-3, nn-5; same layout as above).

C.1.2 Results with the max margin loss
Tables: mNNO(X, f(X)) and mNNO(Y, f(X)) with the max margin loss, in ImageNet, IAPR TC-12 and Wiki (I→T and T→I; ResNet and VGG-128 features).

C.1.3 Results with the cosine loss
Tables: mNNO(X, f(X)) and mNNO(Y, f(X)) with the cosine loss, in ImageNet, IAPR TC-12 and Wiki (I→T and T→I; ResNet and VGG-128 features).

C.1.4 Results with Euclidean neighbors (nn and lin models of the paper)
Table: Test mNNO with Euclidean-based neighbors (lin and nn models of the main paper; ImageNet, IAPR TC-12, Wiki; ResNet and VGG-128). Boldface indicates the largest score of each mNNO(X, f(X)) and mNNO(Y, f(X)) pair, which are abbreviated by X, f(X) and Y, f(X).

Table: mNNO with Euclidean-based neighbors in the ImageNet dataset, using word2vec word embeddings.
C.1.5 Results with word2vec in ImageNet (cosine-based neighbors)
Table: mNNO using cosine-based neighbors in ImageNet, using word2vec word embeddings.

C.2 Experiment 2