Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

Hadi Abdi Khojasteh (1), Ebrahim Ansari (1, 2), Parvin Razzaghi (1, 3), Akbar Karimi (4)

(1) Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
(2) Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Czechia
(3) Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
(4) IMP Lab, Department of Engineering and Architecture, University of Parma, Parma, Italy

{hkhojasteh, ansari, p.razzaghi}@iasbs.ac.ir, [email protected]

Abstract
This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging because the features and representations of text and image are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns both vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the way benchmark datasets are. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks in which the descriptions are well-organized. Experimental results on the MS-COCO benchmark dataset show that our model outperforms certain previously presented methods and is competitive with the state-of-the-art. The code and dataset have been made publicly available.
The advent of social networks has brought about a plethora of opportunities for everyone to share information online in the form of text, images, videos, and so forth. As a result, there is a vast amount of raw data on the Web that can help in dealing with many challenges in natural language processing and image recognition. Matching pictures with their textual descriptions is one of these challenges, and research interest in it has been growing (Wang and Chan, 2018; Eisenschtat and Wolf, 2017; Faghri et al., 2017; Lee et al., 2018).
Figure 1: Motivation/concept figure. Given an image (caption), the goal in image-text matching is to automatically retrieve the closest textual description (image). The tweets shown are examples from the collected dataset, e.g., "An African giraffe stares through the spiral-shaped horns of a Greater Kudu bull in this well-timed photo by Your Shot photographer Dries Alberts https://on.natgeo.com/2J0GRNO" and "Ireland have spoiled the party for England. England have been beaten, their 18 match winning run ended".
The goal in image-text matching is, given an image, to automatically retrieve a natural language description of that image. In addition, given a caption (a textual image description), we want to match it with the most related image found in our dataset, as shown in Fig. 1. The process involves modeling the relationship between images and the texts or captions used to describe them. This defines the semantics of a language by grounding it in the visual world.

Many studies have explored the task of cross-modal retrieval at the level of sentences and image regions (Wang and Chan, 2018; Niu et al., 2017; Karpathy and Fei-Fei, 2015; Liu et al., 2017). Karpathy et al. (2014) match image objects with phrases by using dependency tree relations for sentence fragments and finding a common space in which to represent the fragments. Huang et al. (2017) propose sm-LSTM, which utilizes a multimodal context-modulated global attention scheme and an LSTM to predict the salient instance pairs. Recently, many researchers (Huang et al., 2018; Yan and Mikolajczyk, 2015; Zheng et al., 2017; Donahue et al., 2015; Lev et al., 2016; Mao et al., 2014; Gu et al., 2018) have introduced neural network models for image-caption retrieval consisting of RNNs, CNNs, and additional multimodal layers. Practically, one of the reasons these deep learning approaches have been on the rise is the availability of abundant information on the Web. The next section describes the proposed model.

In this work, we introduce an end-to-end multimodal neural network for learning image and text representations simultaneously. The architecture is illustrated in Fig. 2. It consists of two main subnets: a CNN with an embedding layer for representing the input image and an LSTM that maps the captions into the new space. The purpose of the model is to find a mapping from text and image to a common space in order to represent them with similar embeddings. In this space, an image (text) has a representation similar to that of its text (image) but different from those of other texts (images). Once the model is trained, by feeding an image (text) to the network, we find the most similar text (image).
For our initial model, after removing the fully-connected layer from ResNet-50 (Xie et al., 2017), which has been pre-trained on ImageNet (Russakovsky et al., 2015), we treat the remaining layers as an image feature extractor. The network takes fixed-size input images and outputs a convolutional feature volume. Therefore, a dense layer with the size of the text domain is added to the end of the network. Together with the rest of the network, this layer, which is now part of the model, is trained to produce image representations. If we call this vector $I$, which is a representation of the input image, then $f_{img}$ is a visual descriptor obtained by a forward pass through the network. The forward pass is denoted by $F_{img}(\cdot)$, a non-linear function defined as $f_{img} = F_{img}(I)$. Conventionally, the image model is considered as one part of our network with its pre-trained weights, to avoid computing a large number of learnable parameters, which is a time-consuming process. Then, we add two fully-connected layers to transform $f_{img}$ into an image feature vector $v_{img}$, computed as $v_{img} = W_{img} f_{img} + b_{img}$.

Each input text $T$ is first represented by an $n \times d$ matrix, with $n$ being its length and $d$ being the size of the dictionary. To build the dictionary, stop words and punctuation marks are removed and all the words are stemmed with the Porter stemmer. In addition, special characters are removed and the remaining words are lowercased. Each word in the final dictionary is represented by a one-hot $d$-dimensional vector, and every word can find an index $l$ in the dictionary. Therefore, for an input sentence $T$ with $m$ words, there is a $d \times m$ matrix defined as

$$T(i, j) = \begin{cases} 1 & j = l_i \\ 0 & \text{otherwise} \end{cases}$$

where $1 \le i \le m$ and $1 \le j \le d$. Based on this definition, each text should have a fixed length. In this study, since two datasets with different distributions are employed, the length of every sentence is fixed to the same number. In order to meet this criterion, when there are several sentences for one image, we concatenate all the words and build one long description for that image. When the length grows beyond the expected length, the extra words are removed, and when there are fewer words, zero-padding is applied. Therefore, every sentence is represented in a fixed-size (length $\times$ $d$) space.

The input of the text representation model is a sequence of integer indices. In the next step, a word embedding is used to reduce the number of semantically similar words and to remove words with low frequency, which are absent from the dictionary, resulting in a new embedding space. Since the vocabulary is very large, this reduction helps increase the network's generalizability. The new embeddings are then fed into an LSTM (Gers et al., 2000) to learn a probability distribution over the above-mentioned sequence in order to predict the next word. The output of the LSTM is not used for word-level labeling; instead, only the last hidden state is utilized to represent the whole text. Therefore, for the input sentence $T$, its text descriptor, denoted by $f_{txt}$ and computed with the function $F_{txt}(\cdot)$, is $f_{txt} = F_{txt}(T)$. The final text feature vector $v_{txt}$ is obtained from $f_{txt}$.
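As a rough sketch of the two branches described above (not the authors' code), the image encoder below uses a frozen, ImageNet-pretrained ResNet-50 without its classification head followed by two fully-connected layers, and the text encoder embeds word indices and keeps only the last LSTM hidden state. The 224x224 input size, the 200-dimensional word embedding, and the EMBED_DIM, VOCAB_SIZE, and MAX_LEN values are illustrative assumptions rather than the paper's settings.

```python
import tensorflow as tf

EMBED_DIM = 1024    # dimensionality of the joint space (assumed)
VOCAB_SIZE = 20000  # dictionary size d (assumed)
MAX_LEN = 50        # fixed sentence length (assumed)

def build_image_encoder():
    # Pre-trained ResNet-50 with the classification head removed; kept
    # frozen and used purely as a feature extractor (global average
    # pooling yields one feature vector f_img per image).
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg")
    backbone.trainable = False
    image = tf.keras.Input(shape=(224, 224, 3))
    f_img = backbone(image)
    # Two trainable fully-connected layers: v_img = W_img * f_img + b_img
    hidden = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(f_img)
    v_img = tf.keras.layers.Dense(EMBED_DIM)(hidden)
    return tf.keras.Model(image, v_img, name="image_encoder")

def build_text_encoder():
    # The one-hot matrix T is represented implicitly as a sequence of
    # word indices; index 0 is reserved for zero-padding.
    words = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    embedded = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(words)
    # Only the last hidden state of the LSTM is kept as the text descriptor.
    f_txt = tf.keras.layers.LSTM(EMBED_DIM)(embedded)
    v_txt = tf.keras.layers.Dense(EMBED_DIM)(f_txt)
    return tf.keras.Model(words, v_txt, name="text_encoder")
```

In practice the Embedding layer's weights would be initialized from the pre-trained GloVe Twitter vectors mentioned in the implementation details below, rather than trained from scratch.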
Having an aligned collection of image-text pairs, the goal is to learn an image-text similarity score, denoted by $S(T, I)$, which is defined as

$$S(T, I) = -E(v_{txt}, v_{img})$$

where $v_{img}$ and $v_{txt}$ are the same-size image and text representations, which have been projected onto a partial-order visual-semantic embedding space.
Figure 2: Proposed end-to-end multimodal neural network architecture for learning the image and text representations. Image features are extracted by a CNN with 16 residual blocks and text features are extracted by a recurrent unit; fully-connected layers then join the two domains by feature transformation. The example input text shown in the figure is the tweet "Turn leftover rice into a delicious new meal with this crispy rice with shrimp, bacon and corn recipe https://nyti.ms/2NyEkrQ".
The penalty paid for every true pair of points that disagree is $E(x, y) = \|\max(0, y - x)\|$. To compute the training loss, the image and text output vectors ($v_{img}$, $v_{txt}$) are forced to lie in $\mathbb{R}^{+}$. By merging the image and text embedding models as illustrated in Fig. 2, we obtain the desired visual-semantic model. To learn an order-encoding function, we consider a hinge-based triplet loss function that encourages positive examples to have zero penalty and negative examples to have a penalty greater than a margin:

$$\sum_{(T, I)} \Big( \sum_{T'} \big( \max\{0, \alpha - S(T, I) + S(T', I)\} - \sigma(T') \big) + \sum_{I'} \big( \max\{0, \alpha - S(T, I) + S(T, I')\} - \sigma(I') \big) \Big)$$

where $S(T, I)$, the similarity score function, is as described above, while $T'$ and $I'$ are inferred from the ground truth by matching contrastive images with each caption and vice versa. $\sigma(x)$ is a discrete variance, written as $\sigma(x) = \frac{\sum_{n} (x - \mu_c)}{|n|}$. For computational efficiency, rather than summing over all the negative samples, we consider only the negatives within a mini-batch.

The proposed method has been implemented with TensorFlow (Abadi et al., 2016) in Python and run on a machine with a GeForce GTX 1080 Ti. For initialization, GloVe (Pennington et al., 2014) word embeddings trained on Twitter (1.2 million word vocabulary, 27 billion tokens, 2 billion tweets) are employed. The training phase starts with the Adam optimizer with a learning rate of 0.1 and a batch size of 16 and continues until the loss stops changing. When that happens, the learning rate is divided by 2. This continues until the learning rate reaches a small minimum value. Then, the batch size is doubled and the learning rate is reset to 0.1. We repeat this process to optimize the model. During training, a grid search over all the hyper-parameters is carried out for model selection. For efficiency, the training is performed in batches, which allows us to do real-time data augmentation on images on the CPU in parallel with training the model on the GPU.
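For concreteness, a minimal TensorFlow sketch of the order-violation penalty $E$, the similarity $S$, and the in-batch hinge triplet loss described above might look as follows. The margin value is a placeholder and the $\sigma$ variance terms are omitted, so this is an approximation of the loss rather than the authors' exact implementation.

```python
import tensorflow as tf

def order_violation(v_txt, v_img):
    # E(x, y) = ||max(0, y - x)||, the penalty for a pair that violates
    # the partial order (embeddings are assumed to be non-negative).
    return tf.norm(tf.maximum(0.0, v_img - v_txt), axis=-1)

def similarity(v_txt, v_img):
    # S(T, I) = -E(v_txt, v_img)
    return -order_violation(v_txt, v_img)

def triplet_loss(v_txt, v_img, margin=0.05):
    """Hinge-based triplet loss with in-batch negatives.
    v_txt, v_img: (batch, dim) embeddings of matching text-image pairs."""
    batch = tf.shape(v_txt)[0]
    # scores[i, j] = S(T_i, I_j) for every text/image combination in the batch.
    diff = tf.maximum(0.0, v_img[tf.newaxis, :, :] - v_txt[:, tf.newaxis, :])
    scores = -tf.norm(diff, axis=-1)
    positive = tf.linalg.diag_part(scores)  # S(T_i, I_i) for matching pairs
    # Contrastive images I' for each caption: max{0, alpha - S(T,I) + S(T,I')}
    cost_images = tf.maximum(0.0, margin - positive[:, tf.newaxis] + scores)
    # Contrastive captions T' for each image: max{0, alpha - S(T,I) + S(T',I)}
    cost_texts = tf.maximum(0.0, margin - positive[tf.newaxis, :] + scores)
    # Exclude the positive pairs themselves (the diagonal) from the sum.
    mask = 1.0 - tf.eye(batch)
    return tf.reduce_sum((cost_images + cost_texts) * mask)
```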
Given a sentence (image), all the images (captions) in the test set are retrieved and ranked by their penalty in increasing order. We then report the results using Recall and Median Rank. Recall is a metric for assessing how well a system retrieves information relevant to a query. It is computed by dividing the number of relevant retrieved results by the total number of instances. In R@K, the top K results are treated as the output and the Recall is computed accordingly.
Med r is the middle number in a sorted sequence of the retrieved instances. Beyond these, other metrics can be taken into account, since the existing measures can be intrinsically problematic (Bernardi et al., 2016). For instance, the retrieval of the exact image (text) is not guaranteed; in such cases, the result is scored as incorrect even though similar items have been matched.
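As a small illustration (plain NumPy, not the authors' evaluation script), R@K and Med r can be computed from a score matrix as follows, assuming the ground-truth item for query i is item i:

```python
import numpy as np

def recall_and_median_rank(score_matrix, ks=(1, 5, 10)):
    """score_matrix[i, j]: similarity between query i and candidate j.
    The ground-truth candidate for query i is assumed to be candidate i."""
    n = score_matrix.shape[0]
    order = np.argsort(-score_matrix, axis=1)  # candidates sorted best-first
    # Rank (1 = best) of the ground-truth candidate for every query.
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}  # R@K
    med_r = float(np.median(ranks))                        # Med r
    return recalls, med_r

# Toy example: ground truth is ranked 1st, 2nd and 3rd for the three queries,
# so R@1 = 0.33, R@5 = R@10 = 1.0 and Med r = 2.
scores = np.array([[0.9, 0.1, 0.2, 0.0],
                   [0.8, 0.7, 0.1, 0.0],
                   [0.9, 0.8, 0.7, 0.6]])
print(recall_and_median_rank(scores))
```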
                     Sentence Retrieval              Image Retrieval
Method               R@1   R@5   R@10  Med r         R@1   R@5   R@10  Med r
1K test images
Random Ranking       0.1   0.6   1.1   631           0.1   0.5   1.0   500
STV (2015)           33.8  67.7  82.1  3             25.9  60.0  74.6  4
DVSA (2015)          38.4  69.9  80.5  1             27.4  60.2  74.8  3
GMM-FV (2015)        39.0  67.0  80.3  3             24.2  59.3  76.0  4
MM-ENS (2015)        39.4  67.9  80.9  2             25.1  59.8  76.6  4
m-RNN (2014)         41.0  73.0  83.5  2             29.0  42.2  77.0  3
m-CNN (2015)         42.8  73.1  84.1  2             32.6  68.6  82.8  3
HM-LSTM (2017)       43.9  -     87.8  2             36.1  -     86.7  3
SPE (2016)           50.1  79.7  89.2  -             39.6  75.2  86.9  -
VQA-A (2016)         50.5  80.1  89.7  -             37.0  70.9  82.9  -
2WayNet (2017)       55.8  75.2  -     -             39.7  63.3  -     -
sm-LSTM (2017)       53.2  83.1  91.5  1             40.7  75.8  87.4  2
RRF-Net (2017)       56.4  85.3  91.5  -             43.9  78.1  88.6  -
VSE++ (2017)         64.6  90.0  95.7  1             52.0  84.3  92.0  1
SCAN (2018)          72.7  94.8  98.4  -             58.8  88.4  94.8  -
Ours                 47.5  81.0  91.0  2             48.4  84.3  91.5  2
5K test images
GMM-FV (2015)        17.3  39.0  50.2  10            10.8  28.3  40.1  17
DVSA (2015)          16.5  39.2  52.0  9             10.7  29.6  42.2  14
VQA-A (2016)         23.5  50.7  63.6  -             16.7  40.5  53.8  -
VSE++ (2017)         41.3  71.1  81.2  2             30.3  59.4  72.4  4
SCAN (2018)          50.4  82.2  90.0  -             38.6  69.3  80.4  -
Ours                 23.8  53.7  67.3  4             25.6  55.1  68.4  3

Table 1: Image and sentence retrieval results on MS-COCO. "Sentence Retrieval" denotes using an image as a query to search for the relevant sentences, and "Image Retrieval" denotes using a sentence to find the relevant image. R@K is Recall@K (higher is better). Med r is the median rank (lower is better).

Several datasets have been published for the image-sentence retrieval task (Rashtchian et al., 2010; Ordonez et al., 2011; Young et al., 2014; Hu et al., 2017; Farhadi et al., 2010). We collect a dataset, as a proof of concept, for evaluating and analyzing our method, both to better showcase its ability to generalize and to demonstrate the extensibility of this type of solution to conversational texts and unusual images. Moreover, we use MS-COCO (Lin et al., 2014) to train and test the proposed model. This dataset contains 123,287 images and 616,767 descriptions (Lin et al., 2014). Each image has 5 textual descriptions on average, which were collected by crowdsourcing on Amazon Mechanical Turk. The average caption length is 8.7 words after rare-word removal. We follow the protocol in (Karpathy et al., 2014) and use 5000 images each for validation and testing, and also report results on a subset of 1000 test images in Table 1.

We collected 13,751 tweets with 14,415 images using a crawler based on the Twitter API. To make sure that the collection is diverse, we first created a list of seed users. Then, the followers of the seed accounts were added to the list. Next, the latest tweets of the users in our list were extracted and saved in the dataset. To make the data appropriate for our task, we removed retweets, tweets with no images, non-English tweets, and tweets with fewer than three words. This led the dataset to have a relatively long description for each image and at least one image for every tweet. In the final step, the dataset was examined by two professionals, who removed unrelated content. Fig. 1 shows samples of the extracted dataset. This collection differs from currently existing ones due to its varied domains, informal texts, and high correlation between text and image. For instance, the tweets may contain abbreviations, initialisms, hashtags, or URLs. On the collected tweets, our model improves sentence retrieval by 14.3% and image retrieval by 16.4% in relative terms, based on R@1. The dataset has been made available.

We propose a multimodal image-text matching model that uses a convolutional neural network and a long short-term memory network together with fully-connected layers. They are employed to map image and text inputs into a shared feature space, where their representations can be compared to find the closest pairs. Additionally, a new dataset of images and tweets extracted from Twitter is introduced, with the aim of having a collection characteristically different from the benchmarks (the dataset, source code, and model will be made publicly available after the paper is published). Whereas the descriptions in the benchmarks are well-organized, our dataset has not been standardized and the image-text pairs can contain high semantic correlations. Also, because of the varied number of domains present in the extracted dataset, the task of image-text matching becomes even more challenging. Therefore, it can be used to carry out new research and to assess the robustness of proposed frameworks. Our experiments on MS-COCO yield improved results over some previously proposed architectures.
Acknowledgments
The research was partially supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16 027/0008495, International Mobility of Researchers at Charles University. This project was in part supported by a grant from the Institute for Research in Fundamental Sciences (IPM). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research, 55:409–442.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634.

Aviv Eisenschtat and Lior Wolf. 2017. Linking image and text with 2-way nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4601–4611.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.

Felix A. Gers, Jürgen A. Schmidhuber, and Fred A. Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.

Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. 2018. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7181–7189.

Yuting Hu, Liang Zheng, Yi Yang, and Yongfeng Huang. 2017. Twitter100k: A real-world dataset for weakly supervised cross-media retrieval. IEEE Transactions on Multimedia, 20(4):927–938.

Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-aware image and sentence matching with selective multimodal LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2310–2318.

Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. 2018. Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6163–6171.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image representations using Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4437–4446.

Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216.

Guy Lev, Gil Sadeh, Benjamin Klein, and Lior Wolf. 2016. RNN Fisher vectors for action recognition and image annotation. In European Conference on Computer Vision, pages 833–850. Springer.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Xiao Lin and Devi Parikh. 2016. Leveraging visual question answering for image-caption ranking. In European Conference on Computer Vision, pages 261–277. Springer.

Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 4107–4116.

Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, pages 2623–2631.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632.

Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical multimodal LSTM for dense visual-semantic embedding. In Proceedings of the IEEE International Conference on Computer Vision, pages 1881–1889.

Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147. Association for Computational Linguistics.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013.

Qingzhong Wang and Antoni B Chan. 2018. CNN+CNN: Convolutional decoders for image captioning. arXiv preprint arXiv:1805.09019.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500.

Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3441–3450.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, and Yi-Dong Shen. 2017. Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535.