Cross-Modal Retrieval with Implicit Concept Association
Yale Song
Microsoft AI & Research
[email protected]
Mohammad Soleymani
University of …
[email protected]
ABSTRACT
Traditional cross-modal retrieval assumes explicit association of concepts across modalities, where there is no ambiguity in how the concepts are linked to each other, e.g., when we do an image search with the query "dogs", we expect to see dog images. In this paper, we consider a different setting for cross-modal retrieval where data from different modalities are implicitly linked via concepts that must be inferred by high-level reasoning; we call this setting implicit concept association. To foster future research in this setting, we present a new dataset containing 47K pairs of animated GIFs and sentences crawled from the web, in which the GIFs depict physical or emotional reactions to the scenarios described in the text (called "reaction GIFs"). We report on a user study showing that, despite the presence of implicit concept association, humans are able to identify video-sentence pairs with matching concepts, suggesting the feasibility of our task. Furthermore, we propose a novel visual-semantic embedding network based on multiple instance learning. Unlike traditional approaches, we compute multiple embeddings from each modality, each representing different concepts, and measure their similarity by considering all possible combinations of visual-semantic embeddings in the framework of multiple instance learning. We evaluate our approach on two video-sentence datasets with explicit and implicit concept association and report competitive results compared to existing approaches on cross-modal retrieval.
ACM Reference Format:
Yale Song and Mohammad Soleymani. 2018. Cross-Modal Retrieval with Implicit Concept Association. In Proceedings of ACM Multimedia (MM'18). ACM, New York, NY, USA, 9 pages. https://doi.org/10.475/123_4
1 INTRODUCTION
Animated GIFs are becoming increasingly popular [3]; more people use them to tell stories, summarize events, express emotion, and enhance (or even replace) text-based communication. To reflect this trend, several social networks and messaging apps have recently incorporated GIF-related features into their systems, e.g., Facebook users can create posts and leave comments using GIFs, Instagram and Snapchat users can put "GIF stickers" into their personal videos, and Slack users can send messages using GIFs. This rapid increase in popularity and real-world demand necessitates more advanced and specialized systems for animated GIF search. The key to the success of GIF search lies in the ability to understand the complex semantic relationship between textual queries and animated GIFs, which, at times, can be implicit and even ambiguous. Consider the query "my ex texts me at 2am": How can we build a system that understands the intent behind such queries and finds relevant content, e.g., GIFs showing someone surprised or agitated?
[Figure 1 graphic: the query "My ex texts me at 2am", with concepts inferred from text (awkward, surprised, agitated) and concepts inferred from video (surprised, suspicious, sneaky), linked by implicit concept association.]
Figure 1: We consider a scenario for cross-modal retrieval where data from different modalities are implicitly linked via "concepts" (e.g., emotion, intent) that must be inferred by high-level reasoning. Each modality may manifest multiple different concepts; only a small subset of all visual-textual concept pairs could be related (green), while others are irrelevant (gray). This scenario poses new challenges to cross-modal retrieval.
In this paper, we focus on cross-modal retrieval when there is implicit association of concepts across modalities. Our work is in part motivated by the above scenario (see also Figure 1), where the semantic relationship between textual queries and visual content is only implied, without being directly expressed. In the example above, the two modalities are related by such concepts as "surprised" and "panicked"; these are abstract concepts that are not explicitly stated or depicted in either modality, and must rather be inferred from data by high-level reasoning. In this regard, we refer to our task as cross-modal retrieval with implicit concept association, where "implicit concepts" are implied and not plainly expressed in the data.

Previous work on cross-modal retrieval has mainly focused on the scenario of explicit concept association, in which there is no ambiguity in how concepts are linked to each other [28, 37]. For example, the most popular and widely accepted visual-textual retrieval datasets, such as NUS-WIDE [7], Flickr30K [42], and MS-COCO [24], contain user tags and sentences describing visual objects and scenes displayed explicitly in the corresponding images. Also, the concept vocabulary used in those datasets has clear image-to-text correspondence (a dog image is tagged with the "dog" category), leaving less room for ambiguity. Furthermore, most existing approaches to cross-modal retrieval focus on learning a shared embedding space where the distance between data from two modalities is minimized, e.g., DCCA [2], DeViSE [13], and Corr-AE [12]; all these methods are, however, based on the assumption that there is explicit association of concepts across modalities. To the best of our knowledge, there is no literature on the case with implicit concept association.

[Figure 2 panels: (a) physical reaction: "MRW a witty comment I wanted to make was already said"; (b) emotional reaction: "MFW I see a cute girl on Facebook change her status to single"; (c) animal reaction: "MFW I can't remember if I've locked my front door"; (d) lexical reaction (caption): "MRW a family member asks me why his computer isn't working".]
Figure 2: Our MRW dataset contains 47K GIF-sentence pairs, where the GIFs depict reactions to the sentences (often called "reaction GIFs"). Here we show the four most common types of reactions: (a) physical, (b) emotional, (c) animal, (d) lexical.
In this work, we make a step toward cross-modal retrieval with implicit concept association. To foster further research in this direction, we collect a new dataset containing 47K pairs of animated GIFs and sentences, where the GIFs depict physical or emotional reactions to certain situations described in text; such GIFs are often called "reaction GIFs". We name our dataset MRW (my reaction when) after the most popular hashtag for reaction GIFs. To highlight the difference between implicit and explicit concept association, we compare our dataset with the TGIF dataset [23], which also contains GIF-sentence pairs and was originally proposed for video captioning. In the TGIF dataset, sentences describe objects and actions explicitly displayed in the visual content; it can, therefore, be considered a case of explicit concept association. We report on a user study showing that, despite the presence of implicit concept association in our dataset, humans are still able to identify GIF-sentence pairs with matching concepts with high confidence, suggesting the feasibility of our task.

Furthermore, we propose a novel approach to visual-semantic embedding based on multiple instance learning; we call our model the Multiple-instance Visual-Semantic Embedding (MiViSE) network. To deal with implicit concept association, we formulate our learning problem as a many-to-many mapping, in which we compute several different representations of data from each modality and learn their association in the framework of multiple instance learning. By considering all possible combinations of different representations between the two modalities (i.e., many-to-many), we provide more flexibility to our model even in the presence of implicit concept association. We employ recurrent neural networks [6] with self-attention [25] to compute multiple representations, each attending to different parts of a sequence, and optimize the model with triplet ranking so that the distance between true visual-textual pairs is minimized compared to negative pairs. We evaluate our approach on the sentence-to-video retrieval task using the TGIF and our MRW datasets, and report competitive performance over several baselines used frequently in the cross-modal retrieval literature.

In summary, we make the following contributions:
(1) We focus on cross-modal retrieval with implicit concept association, which has not been studied in the literature.
(2) We release a new dataset, MRW (my reaction when), that contains 47K pairs of animated GIFs and sentences.
(3) We propose a novel approach to cross-modal retrieval based on multiple instance learning, called MiViSE.

2 RELATED WORK
Datasets: Existing datasets in cross-modal retrieval contain different types of visual-textual data pairs, categorized into: image-tag (e.g., NUS-WIDE [7], Pascal VOC [11], COCO [24]), image-sentence (e.g., Flickr 8K [30], Flickr 30K [42], and COCO [24]), and image-document (e.g., Wikipedia [31] and Websearch [22]). Unlike existing datasets, ours contains video-sentence pairs; it is therefore more closely related to video captioning datasets, e.g., MSR-VTT [39], TGIF [23], and LSMDC [32]. All those datasets contain visual-textual pairs with explicit concept association (sentences describe visual content displayed explicitly in videos), whereas ours contains visual-textual pairs with implicit concept association (videos contain physical or emotional reactions to sentences).
Visual-semantic embedding: The fundamental task in cross-modal retrieval is finding an embedding space shared among different modalities such that samples with similar semantic meanings are close to each other. One popular approach is based on maximizing correlation between modalities [2, 9, 12, 14, 31, 41]. Rasiwasia et al. [31] use canonical correlation analysis (CCA) to maximize correlation between images and text, while Gong et al. [14] propose an extension of CCA to a three-view scenario, e.g., images, tags, and their semantics. Following the tremendous success of deep learning, several works incorporate deep neural networks into their approaches. Andrew et al. [2] propose deep CCA (DCCA), and Yan et al. [41] apply it to image-to-sentence and sentence-to-image retrieval. Based on the idea of an autoencoder, Feng et al. [12] propose a correspondence autoencoder (Corr-AE), and Eisenschtat and Wolf [9] propose 2-way networks. All these methods share one commonality: they aim to find an embedding space where the correlation between pairs of cross-modal data is maximized.

Besides the pairwise correlation approaches, another popular approach is based on triplet ranking [13, 38]. The basic idea is to impose a constraint that encourages the distance between "positive pairs" (e.g., ground-truth pairs) to be smaller than that between "negative pairs" (typically sampled by randomly shuffling the positive pairs). Frome et al. [13] propose a deep visual-semantic embedding (DeViSE) model, using a hinge loss to implement triplet ranking. Similar to DeViSE, we also train our model using triplet ranking. However, instead of using the hinge loss of [13], which is non-differentiable at the margin, we use the pseudo-Huber loss form [4], which is fully differentiable and therefore more suitable for deep learning. Through ablation studies, we show improved performance just by replacing the hinge loss with the pseudo-Huber loss.
Figure 3: Distributions of nouns and verbs in our MRW and the TGIF [23] datasets. Compared to TGIF, words in our dataset depict more abstract concepts (e.g., post, time, day, start, realize, think, try), suggesting the implicit nature of our dataset.

Recent approaches to cross-modal retrieval apply techniques from machine learning such as metric learning and adversarial training [35, 36]. Tsai et al. [35] train deep autoencoders with a maximum mean discrepancy (MMD) loss, while Wang et al. [36] combine the triplet ranking-based method with adversarial training. Unlike all previous approaches, our work focuses specifically on the scenario of implicit concept association, and we train our model in the framework of multiple instance learning (MIL). To the best of our knowledge, our work is the first to use MIL in the cross-modal retrieval setting.
Animated GIF: There is increasing interest in conducting research around animated GIFs. Bakhshi et al. [3] studied what makes animated GIFs engaging on social networks and identified a number of factors that contribute to it: the animation, lack of sound, immediacy of consumption, low bandwidth and minimal time demands, the storytelling capabilities, and utility for expressing emotions.

Several works use animated GIFs for various tasks in video understanding. Jou et al. [19] propose a method to predict viewer-perceived emotions for animated GIFs. Gygli et al. [16] propose the Video2GIF dataset for video highlighting, and further extend it to emotion recognition [15]. Chen et al. [5] propose the GIFGIF+ dataset for emotion recognition. Zhou et al. [44] propose the Image2GIF dataset for video prediction, along with a method to generate cinemagraphs from a single image by predicting future frames.

Recent work uses animated GIFs to tackle vision & language problems. Li et al. [23] propose the TGIF dataset for video captioning; Jang et al. [18] propose the TGIF-QA dataset for video visual question answering. Similar to the TGIF dataset [23], our dataset includes video-sentence pairs. However, our sentences are created by real users from Internet communities rather than study participants, thus posing real-world challenges. More importantly, our dataset has implicit concept association between videos and sentences (videos contain physical or emotional reactions to sentences), while the TGIF dataset has explicit concept association (sentences describe visual content in videos). We discuss this in more detail in Section 3.
3 MRW DATASET
Our dataset consists of 47K pairs of GIFs and sentences collected from popular social media websites including reddit, Imgur, and Tumblr. We crawl the data using the GIPHY API (https://developers.giphy.com) with query terms
mrw, mfw, hifw, reaction, and reactiongif; we crawled the data from August 2016 to January 2018. Figure 2 shows examples of GIF-sentence pairs in our dataset.

Total number of video-sentence pairs: 47,172
Median number of frames in a video: 48
Median number of words in a sentence: 10
Word vocabulary size: 3,706
Average term frequency: 18.47
Median term frequency: 1
Table 1: Descriptive statistics of the MRW dataset.

Our dataset is unique among existing video-sentence datasets. Most existing ones are designed for video captioning [23, 32, 39] and assume explicit association of concepts between videos and sentences: sentences describe visual content in videos. Unlike existing video-sentence datasets, ours focuses on a unique type of video-sentence pairs, i.e., reaction GIFs. According to a popular subreddit channel, r/reactiongif:

A reaction GIF is a physical or emotional response that is captured in an animated GIF which you can link in response to someone or something on the Internet. The reaction must not be in response to something that happens within the GIF, or it is considered a "scene".
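To make the collection procedure concrete, below is a minimal sketch of how such a crawl could look. This is not the authors' actual pipeline: it assumes the public GIPHY search endpoint, a hypothetical API key, and illustrative JSON field names that may differ from GIPHY's current schema.

```python
import requests

GIPHY_SEARCH_URL = "https://api.giphy.com/v1/gifs/search"
API_KEY = "YOUR_GIPHY_API_KEY"  # hypothetical placeholder
QUERY_TERMS = ["mrw", "mfw", "hifw", "reaction", "reactiongif"]

def crawl_query(term, pages=5, page_size=50):
    """Collect candidate (sentence, gif_url) pairs for one query term."""
    pairs = []
    for page in range(pages):
        resp = requests.get(GIPHY_SEARCH_URL, params={
            "api_key": API_KEY,
            "q": term,
            "limit": page_size,
            "offset": page * page_size,
        })
        resp.raise_for_status()
        for item in resp.json().get("data", []):
            # The posting title typically carries the "MRW ..." sentence;
            # field names here are assumptions, not a documented schema.
            sentence = item.get("title", "")
            gif_url = item.get("images", {}).get("original", {}).get("url", "")
            if sentence and gif_url:
                pairs.append((sentence, gif_url))
    return pairs

if __name__ == "__main__":
    dataset = []
    for term in QUERY_TERMS:
        dataset.extend(crawl_query(term))
    print(f"Collected {len(dataset)} candidate GIF-sentence pairs")
```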
This definition clearly differentiates ours from existing video-sentence datasets: ours assumes implicit association of concepts between videos and sentences, i.e., videos contain reactions to sentences. We call our dataset MRW (my reaction when) to emphasize this unique characteristic of our data; MRW is the most popular hashtag for reaction GIFs.

Table 1 shows descriptive statistics of our dataset, and Figure 3 shows word clouds of nouns and verbs extracted from our MRW dataset and the TGIF dataset [23]. Sentences in the TGIF dataset are constructed by crowdworkers to describe the visual content explicitly displayed in GIFs. Therefore, its nouns and verbs mainly describe physical objects, people, and actions that can be visualized, e.g., cat, shirt, stand, dance. In contrast, MRW sentences are constructed by Internet users, typically from subcommunities in social networks that focus on reaction GIFs.

[Figure 4 data: average expression intensities are joy 9.1%, disgust 7.4%, surprise 3.0%, fear 1.6%, anger 1.2%, and sadness 1.0%.]
Figure 4: Histograms of the intensity of facial expressions. The horizontal axis represents the intensity of the detected expression, while the vertical axis is the sample count over frames with faces. We clip the y-axis at 1000 for visualization. Overall, joy, with an average intensity of 9.1%, and disgust (7.4%) are the most common facial expressions in our dataset.

As can be seen from Figure 3, verbs and nouns in our MRW dataset additionally include abstract terms that cannot necessarily be visualized, e.g., time, day, realize, think.

Facial expressions play an important role in our dataset: 6,380 samples contain the hashtag MFW (my face when), indicating that those GIFs contain emotional reactions manifested by facial expressions. Therefore, we apply automatic facial expression recognition to analyze the types of facial expressions contained in our dataset. First, we count the number of faces appearing in the GIFs. To do this, we applied the dlib CNN face detector [21] on five frames sampled from each GIF at equal intervals. The results show that there are, on average, … faces in a given frame of an animated GIF; 32,620 GIFs contain at least one face. Next, we use Affectiva Affdex [26] to analyze facial expressions depicted in GIFs, detecting the intensity of expressions from two frames per second in each GIF. We looked at six expressions of basic emotions [10], namely joy, fear, sadness, disgust, surprise, and anger. We analyzed only the frames that contain a face whose bounding box covers more than 15% of the image. Figure 4 shows the results. Overall, joy, with an average intensity of 9.1%, and disgust (7.4%) are the most common facial expressions in our dataset.

Image and video captioning often involves describing objects and actions depicted explicitly in visual content [23, 24]. For reaction GIFs, however, visual-textual association is not always explicit. For example, objects and actions depicted in visual content might be a physical or emotional reaction to the scenario posed in the sentence. To study how clear these associations are for humans, we conducted a user study in which we asked six participants to verify the association between sentences and GIFs. We randomly sampled 100 GIFs from the test sets of both our dataset and the TGIF dataset [23]. We chose the test sets due to the higher quality of the data. We paired each GIF with both its associated sentence and a randomly selected sentence, resulting in 200 GIF-sentence pairs per dataset.

The results show that, in the case of our dataset (MRW), 80.4% of the associated pairs are positively marked as being relevant, suggesting that humans are able to distinguish true vs. fake pairs despite implicit concept association. On the other hand, 50.7% of the randomly assigned sentences are also marked as matching sentences. The high false positive rate shows the ambiguous nature of GIF-sentence association in our dataset. In contrast, for the TGIF dataset with clear explicit association, 95.2% of the positive pairs are correctly marked as relevant and only 2.6% of the irrelevant pairs are marked as being relevant. This human baseline demonstrates the challenging nature of GIF-sentence association in our dataset, due to its implicit rather than explicit association. This motivates our multiple-instance-learning-based approach, described next.
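As an illustration of the face analysis step (not the authors' exact code), the following sketch samples five frames from a GIF at equal intervals, runs the dlib CNN face detector, and keeps only faces whose bounding box covers at least 15% of the frame; the Affdex expression analysis is left as a placeholder since it relies on a proprietary SDK.

```python
import dlib
import imageio.v2 as imageio
import numpy as np

# Pretrained dlib CNN face detector weights (mmod_human_face_detector.dat)
# must be downloaded separately; the path here is a placeholder.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def sample_frames(gif_path, num_frames=5):
    """Read a GIF and sample num_frames frames at equal intervals."""
    frames = imageio.mimread(gif_path, memtest=False)
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    sampled = []
    for i in idx:
        f = np.asarray(frames[i])
        if f.ndim == 3:
            f = f[..., :3]  # drop alpha channel if present
        sampled.append(np.ascontiguousarray(f, dtype=np.uint8))
    return sampled

def large_faces(frame, min_area_ratio=0.15):
    """Return face boxes covering at least min_area_ratio of the frame."""
    h, w = frame.shape[:2]
    boxes = []
    for det in detector(frame, 1):  # 1 = upsample once to catch small faces
        r = det.rect
        area = (r.right() - r.left()) * (r.bottom() - r.top())
        if area >= min_area_ratio * h * w:
            boxes.append((r.left(), r.top(), r.right(), r.bottom()))
    return boxes

def analyze_gif(gif_path):
    face_counts = []
    for frame in sample_frames(gif_path):
        faces = large_faces(frame)
        face_counts.append(len(faces))
        # Expression intensities (joy, fear, disgust, sadness, anger, surprise)
        # would be obtained here with the Affectiva Affdex SDK (not shown).
    return face_counts
```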
4 APPROACH
Our goal is to learn embeddings of videos and sentences whose concepts are implicitly associated. We formalize this in the framework of multiple instance learning with triplet ranking.

We assume a training dataset $\mathcal{D} = \{(V_n, S_n^+, S_n^-)\}_{n=1}^{N}$, where $V_n = [v_{n,1}, \cdots, v_{n,T_v}]$ is a video of length $T_v$ and $S_n^+ = [s_{n,1}^+, \cdots, s_{n,T_s}^+]$ is a sentence of length $T_s$; the pair $(V_n, S_n^+)$ has matching concepts. For each pair, we randomly sample a negative sentence $S_n^-$ from the training set for the triplet ranking setup. For notational simplicity, we drop the subscript $n$ and use $S$ to refer to both $S^+$ and $S^-$ unless a distinction between the two is necessary.

Our model (see Figure 5) includes a video encoder and a sentence encoder. Each encoder outputs $K$ representations, each computed by attending to different parts of a video or a sentence. We use these representations to train our model with a triplet ranking loss in the multiple instance learning framework. Below we describe each component of our model in detail.

Video and sentence encoders: Our video and sentence encoders operate in a similar fashion: once we compute image and word embeddings from the video and sentence input, we use a bidirectional RNN with self-attention [25] to obtain $K$ embeddings of a video and a sentence, respectively.

We compute an image embedding of each video frame $v_t \in V$ using a convolutional neural network (CNN); we use the penultimate layer of ResNet-50 [17] pretrained on ImageNet [33], and denote its output by $x_t^v$. Similarly, we compute a word embedding of each word $s_t \in S$ using the GloVe [29] model pretrained on the Twitter dataset, and denote its output by $x_t^s$. Next, we process the sequence of image embeddings using bidirectional Gated Recurrent Units (GRU) [6],

$$\overrightarrow{h}_t^v = \mathrm{GRU}(x_t^v, \overrightarrow{h}_{t-1}^v), \qquad \overleftarrow{h}_t^v = \mathrm{GRU}(x_t^v, \overleftarrow{h}_{t+1}^v), \quad (1)$$

and concatenate the forward and backward hidden states to obtain $h_t^v = [\overrightarrow{h}_t^v, \overleftarrow{h}_t^v]$. We denote the resulting sequence of hidden states by

$$H^v = [h_1^v, \cdots, h_{T_v}^v] \quad (2)$$

which is a matrix of size $d$-by-$T_v$, where $d$ is the total number of GRU hidden units (forward and backward combined). We follow the same steps to process the sequence of word embeddings and obtain a matrix $H^s = [h_1^s, \cdots, h_{T_s}^s]$ of dimension $d$-by-$T_s$.
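A minimal sketch of the two encoders up to this point (precomputed frame and word features passed through a bidirectional GRU, Eqns. 1-2). The paper reports a TensorFlow implementation; this illustration uses PyTorch, and the feature and hidden dimensions are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Bidirectional GRU over precomputed frame (ResNet-50) or word (GloVe) features.

    Returns H with shape (batch, T, d), where d = 2 * hidden_size is the
    concatenation of forward and backward hidden states (Eqns. 1-2).
    """
    def __init__(self, feat_dim, hidden_size=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_size, batch_first=True,
                          bidirectional=True)

    def forward(self, x):           # x: (batch, T, feat_dim)
        H, _ = self.gru(x)          # H: (batch, T, 2 * hidden_size)
        return H

# Example: a batch of 8 videos with 32 frames of 2048-d CNN features, and
# 8 sentences with 32 words of 200-d GloVe features (dimensions assumed).
video_encoder = SequenceEncoder(feat_dim=2048)
text_encoder = SequenceEncoder(feat_dim=200)
H_v = video_encoder(torch.randn(8, 32, 2048))
H_s = text_encoder(torch.randn(8, 32, 200))
```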
Figure 5: Our model consists of a video encoder (blue) and a sentence encoder (green). Each encoder computes K embeddings, each attending to different parts of a sequence of hidden states inferred from a bidirectional GRU. We train our model with our novel MIL-based triplet ranking formulation.

Finally, we compute a linear combination of the hidden states to obtain $K$ video embeddings $\phi(V)_i$ and $K$ sentence embeddings $\psi(S)_j$, $i, j = 1 \ldots K$, where each embedding selectively attends to different parts of the sequence using self-attention [25]. This is done by computing a self-attention map, e.g., for the video input:

$$A^v = \mathrm{softmax}\big(W_2^v \tanh(W_1^v H^v)\big) \quad (3)$$

where $W_1^v$ is a weight matrix of size $u$-by-$d$ and $W_2^v$ is of dimension $K$-by-$u$; we set $u = d$ per empirical observation. The softmax is applied row-wise to obtain an attention map $A^v$ of dimension $K$-by-$T_v$, where each of the $K$ rows sums up to 1. We then obtain $K$ video embeddings as a linear combination

$$\Phi(V) = A^v (H^v)^\top \quad (4)$$

where $\Phi(V) = [\phi(V)_1, \cdots, \phi(V)_K]$ is a matrix of size $K$-by-$d$ with each row serving as an embedding. We use the same method to obtain $K$ sentence embeddings $\Psi(S) = [\psi(S)_1, \cdots, \psi(S)_K]$. Each of the $K$ embeddings selectively attends to different parts of a sequence, and thus represents a different concept contained in a video or a sentence. As we shall see later, having multiple embeddings allows us to formulate our problem in the MIL framework.

Triplet ranking loss: We train our model with a triplet ranking loss, using $V$ as the anchor. The standard choice for triplet ranking is the hinge loss:

$$L_{hinge}(V, S^+, S^-) = \max\big(0,\ \rho - \Delta(V, S^+, S^-)\big) \quad (5)$$

where $\rho$ sets the margin and $\Delta(\cdot)$ measures the difference between $(V, S^+)$ and $(V, S^-)$ with respect to a suitable metric; we define it as

$$\Delta(V, S^+, S^-) = f(V, S^+) - f(V, S^-) \quad (6)$$

where $f(V, S)$ measures the similarity between $V$ and $S$. This hinge loss form is non-differentiable at $\rho$, making optimization more difficult. Therefore, we adopt the pseudo-Huber loss formulation [4]:

$$L_{huber}(V, S^+, S^-) = \delta^2 \left( \sqrt{1 + \big((\rho - \Delta(V, S^+, S^-))/\delta\big)^2} - 1 \right) \quad (7)$$

where $\delta$ determines the slope of the loss function; we set $\delta = \rho$. The loss is differentiable because its derivatives of all orders are continuous; this makes it especially attractive for learning with deep neural networks because gradient signals are more robust.

One way to compute the similarity $f(V, S)$ is to treat the embeddings $\Phi(V)$ and $\Psi(S)$ as $Kd$-dimensional vectors and compute

$$f(V, S) = \frac{\Phi(V) \cdot \Psi(S)}{\|\Phi(V)\|\,\|\Psi(S)\|} \quad (8)$$

where $\cdot$ denotes the dot product between two vectors.

Note that this way of computing the similarity requires each of the $K$ video-sentence embedding pairs to be "aligned" in the shared embedding space; any misaligned dimension will negatively impact the similarity measure. This is problematic for our case because we assume the concepts in video-sentence pairs are implicitly associated with each other: it is natural to expect that only a few pairs of the embeddings will be aligned while others are severely misaligned. The form above, however, computes the dot product of the two embeddings as a whole, and thus has no ability to deal with potentially misaligned embedding pairs (the same reasoning applies even if $K = 1$). This motivates us to develop a MIL-based strategy, described next.
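Before moving on, here is a compact sketch of the structured self-attention (Eqns. 3-4) and the pseudo-Huber triplet loss (Eqns. 5-7), continuing the PyTorch illustration above; the default hyperparameter values are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionEmbedding(nn.Module):
    """Structured self-attention in the style of Lin et al.:
    A = softmax(W2 tanh(W1 H)) over time, then Phi = A H, producing
    K embeddings of dimension d per sample (Eqns. 3-4)."""
    def __init__(self, d, K, u=None):
        super().__init__()
        u = u or d                      # the paper sets u = d
        self.W1 = nn.Linear(d, u, bias=False)
        self.W2 = nn.Linear(u, K, bias=False)

    def forward(self, H):               # H: (batch, T, d)
        A = F.softmax(self.W2(torch.tanh(self.W1(H))), dim=1)  # (batch, T, K)
        A = A.transpose(1, 2)           # (batch, K, T), each row sums to 1
        Phi = torch.bmm(A, H)           # (batch, K, d)
        return Phi, A

def pseudo_huber_triplet(sim_pos, sim_neg, rho=1.0, delta=1.0):
    """Pseudo-Huber triplet loss of Eqn. 7; sim_* are similarities f(V, S)."""
    delta_vs = sim_pos - sim_neg        # Eqn. 6
    a = (rho - delta_vs) / delta
    return (delta ** 2) * (torch.sqrt(1.0 + a ** 2) - 1.0)
```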
Multiple instance learning: In multiple instance learning (MIL), individual instances are assumed to be unlabeled; rather, the learner receives a set of labeled bags, each containing multiple instances with unknown (possibly different) labels [8]. For example, in binary classification all the instances in a negative bag are assumed to be negative, but only a few instances in a positive bag need to be positive. This provides flexibility in learning from weakly-supervised data and has been shown to be successful in many real-world tasks [1].

To formulate our problem in the MIL framework, we start by assuming that each video and sentence contains multiple "concepts" that can be interpreted in different ways; consequently, a video-sentence pair has different combinations of visual-textual concepts (see Figure 1). Under this assumption, if a pair has explicit concept association, there should be no ambiguity in the interpretation of the pair, and thus every possible combination of visual-textual concepts should be valid. However, if the association is implicit, as in our scenario, we assume ambiguity in the interpretation of the pair, and thus only a few visual-textual concept pairs need to be valid.

We define our "bag of instances" to contain all possible combinations of video-sentence embeddings from a pair $(V, S)$. Formally, we define a bag of instances $F(V, S)$ as

$$F(V, S) = \left\{ f_{i,j}(V, S) = \frac{\phi(V)_i \cdot \psi(S)_j}{\|\phi(V)_i\|\,\|\psi(S)_j\|} \right\}, \quad \forall i, j \in [1, \cdots, K] \quad (9)$$

Note that we have $K^2$ combinations of embeddings instead of $K$; this improves sample efficiency because any of $\Phi(V)$ can be matched up with any of $\Psi(S)$. We then modify Eqn. (6) to

$$\Delta_{mil}(V, S^+, S^-) = \max F(V, S^+) - \max F(V, S^-) \quad (10)$$
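The MIL bag and its max-based triplet difference (Eqns. 9-10) reduce to a K x K cosine-similarity matrix per pair; a small sketch in the same PyTorch style (illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

def bag_similarity(Phi, Psi):
    """Max over the K x K cosine similarities between video embeddings Phi
    (batch, K, d) and sentence embeddings Psi (batch, K, d) -- Eqn. 9."""
    Phi = F.normalize(Phi, dim=-1)
    Psi = F.normalize(Psi, dim=-1)
    sims = torch.bmm(Phi, Psi.transpose(1, 2))   # (batch, K, K) cosine scores
    return sims.flatten(1).max(dim=1).values     # max over the bag

def delta_mil(Phi_v, Psi_pos, Psi_neg):
    """Eqn. 10: max over the positive bag minus max over the negative bag."""
    return bag_similarity(Phi_v, Psi_pos) - bag_similarity(Phi_v, Psi_neg)
```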
Figure 6: In our scenario, each sample (video or sentence) is represented by K embeddings, each capturing different concepts. Conventional approaches measure the visual-semantic distance by considering all K embeddings at once (e.g., by concatenation; see Eqn. (8)), and would therefore suffer when not all concepts are related across modalities. Our MIL-based approach, on the other hand, measures the distance from the most related concept combination (squares) by treating each embedding individually.

We plug Eqn. (10) into our loss function of Eqn. (7) to train our model. Note that we use the max operator for both the positive and the negative bags. Similar to the case of MIL for binary classification [8], we require only a few instances in a positive bag to be valid. However, unlike [8], we also encourage only a few (rather than all) instances to be "invalid". This is intuitive and important: even if $V$ is paired with a randomly sampled $S^-$, some concept combinations of the two can still be valid. Our preliminary experiments showed that the performance is significantly worse if we do not use the max operator for the negative bags, which suggests that providing this "leeway" to negative bags is crucial in training our model.

Training: We train our model by minimizing the following objective function:

$$\min \sum_{n=1}^{N} L_{mil\text{-}huber}(V_n, S_n^+, S_n^-) + \alpha\, R(A_n) \quad (11)$$

where $L_{mil\text{-}huber}(\cdot)$ is the pseudo-Huber loss (Eqn. (7)) with our MIL triplet ranking formulation (Eqn. (10)) and $R(A_n) = R(A_n^v) + R(A_n^{s^+}) + R(A_n^{s^-})$ is a regularization term that imposes certain constraints on the attention maps for the $n$-th input triplet. The weight parameter $\alpha$ balances the influence of the two terms.

Following [25], we design $R(\cdot)$ to encourage the $K$ attention maps for a given sequence to be diverse, i.e., each map attends to different parts of the sequence. Specifically, we define it as

$$R(A) = \|A A^\top - \beta I\|_F \quad (12)$$

where $\beta$ is a weight parameter, $I$ is a $K$-by-$K$ identity matrix, and $\|\cdot\|_F$ is the Frobenius norm. The off-diagonal part promotes the $K$ attention maps to be orthogonal to each other, encouraging diversity. The diagonal part drives each diagonal term of $A A^\top$ to be close to $\beta \in [0, 1]$. In the extreme case of $\beta = 1$, each attention map is encouraged to be sparse and attend to a single component in the sequence (because its $\ell_2$ norm, the diagonal term, needs to be 1). Lin et al. [25] suggest $\beta = 1$; however, we found that setting $\beta$ to a smaller value leads to better performance.

Inference: At test time, we find the best matching video (or sentence) given a query sentence (or video) by computing cosine similarities between all $K^2$ combinations of embeddings (see Eqn. (9)) and selecting the item with the maximum similarity. This has a time complexity of $O(K^2 N)$ with $N$ samples in the database.
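Putting the pieces together, the overall objective (Eqns. 11-12) can be sketched as follows, again in PyTorch, reusing delta_mil() from the sketch above; the values for alpha, beta, rho, and delta are assumptions, not the paper's exact settings.

```python
import torch

def attention_penalty(A, beta=0.5):
    """R(A) = || A A^T - beta * I ||_F for an attention map A of shape
    (batch, K, T); encourages the K maps to be diverse (Eqn. 12)."""
    K = A.size(1)
    gram = torch.bmm(A, A.transpose(1, 2))                  # (batch, K, K)
    target = beta * torch.eye(K, device=A.device).expand_as(gram)
    return torch.linalg.matrix_norm(gram - target, ord='fro')

def mil_huber_objective(Phi_v, A_v, Psi_pos, A_pos, Psi_neg, A_neg,
                        alpha=1e-4, rho=1.0, delta=1.0, beta=0.5):
    """Eqn. 11: MIL pseudo-Huber triplet loss plus attention regularization,
    averaged over the batch."""
    d = delta_mil(Phi_v, Psi_pos, Psi_neg)                  # Eqn. 10
    a = (rho - d) / delta
    loss = (delta ** 2) * (torch.sqrt(1.0 + a ** 2) - 1.0)  # Eqn. 7
    reg = (attention_penalty(A_v, beta) + attention_penalty(A_pos, beta)
           + attention_penalty(A_neg, beta))
    return (loss + alpha * reg).mean()
```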
5 EXPERIMENTS
Datasets: We evaluate our approach on the sentence-to-video retrieval task using two datasets: TGIF [23] and our MRW dataset. The former contains sentences describing visual content in videos, while the latter contains videos showing physical or emotional responses to sentences; therefore, the former involves explicit video-sentence concept association, while the latter involves implicit concept association. We split both datasets following the method of [23], which uses 80% / 10% / 10% of the data as the train / validation / test splits.

Metrics: We report our results using the median rank (MR) and recall@k with k = {1, 5, 10} (R@1, R@5, R@10). Both are widely used metrics for evaluating cross-modal retrieval systems [20, 37, 41]: the former measures the median position of the correct item in a ranked list of retrieved results, while the latter measures how often the correct item appears in the top k of a ranked list. We also include the normalized MR (nMR), which is the median rank divided by the total number of samples, so that it is independent of the dataset size. As neither dataset comes with categorical labels, we do not use mean average precision (mAP) as a metric.

Baselines: We compare our method with DCCA [2], Corr-AE [12], and DeViSE [13]; these are well-established methods used frequently in cross-modal retrieval. For fair comparison, we use the same image and word embedding models (ResNet-50 [17] and GloVe [29], respectively) as well as the same sequence encoder (bidirectional GRU [6]) for all models, including the baselines.
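For concreteness, median rank, nMR, and recall@k can be computed from a query-by-database similarity matrix as in the following sketch; this is an illustrative evaluation helper, not the authors' script, and it assumes the ground-truth match of query i is database item i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity between query i and database item j.
    Returns median rank (MR), normalized MR (nMR), and recall@k."""
    n = sim.shape[0]
    # Rank of the ground-truth item for each query (1 = retrieved first).
    order = np.argsort(-sim, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {
        "MR": float(np.median(ranks)),
        "nMR": float(np.median(ranks)) / n * 100.0,  # expressed in percent
        **{f"R@{k}": float(np.mean(ranks <= k) * 100.0) for k in ks},
    }
    return metrics

# Example with random scores for 1000 queries against 1000 items.
print(retrieval_metrics(np.random.randn(1000, 1000)))
```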
Implementation details: We set the maximum length of both videos and sentences to 32 and zero-pad shorter sequences. For sequences longer than the maximum length, we sample multiple subsequences with random starting points and random sampling rates. We apply dropout to the linear transformation of the input to the GRU, with a rate of 20% as suggested by [43]. We vary the number of GRU hidden units d, the number of embeddings K, and the regularization weight λ over grids of values, and report the best results based on cross-validation; we choose the best model based on the nMR metric. We use the ADAM optimizer with a learning rate of 2e-4 and train our model for 500 epochs with a batch size of 100. Our model is implemented in TensorFlow.

Results: Table 2 summarizes the results, from which we make the following observations: (1) Our approach outperforms all three baselines on both datasets, suggesting its superiority. (2) The overall performance on the TGIF dataset is better than on our dataset. This suggests the difficulty of our task with implicit concept association. (3) DeViSE and MiViSE perform better than DCCA and Corr-AE; the former are triplet ranking approaches, while the latter are correlation-based approaches.
Method          TGIF Dataset [23]                     MRW Dataset (Ours)
                MR (nMR)       R@1    R@5    R@10     MR (nMR)       R@1    R@5    R@10
DCCA [2]        3722 (33.84)   0.02   0.08   0.17     2231 (44.62)   0.02   0.12   0.20
Corr-AE [12]    1667 (15.02)   0.05   0.26   0.53     2181 (43.62)   0.01   0.12   0.28
DeViSE [13]      861 (7.69)    0.16   1.04   1.72     1806 (36.13)   0.08   0.28   0.50
MiViSE (Ours)    426 (3.77)    0.58   2.00   3.74     1578 (31.57)   0.14   0.38   0.64

Table 2: Sentence-to-video retrieval results on the TGIF and MRW datasets. Metrics: median rank (MR), normalized MR (nMR), recall@k for k = {1, 5, 10} (R@1, R@5, R@10). For MR and nMR, lower is better; for R@k, higher is better.

Row                MIL   SA   ME    MR (nMR)       R@1    R@5    R@10
1  DeViSE [13]      -     -    -    1806 (36.13)   0.08   0.28   0.50
2  Ours             ✗     ✗    ✗
3  Ours             ✗     ✓    ✗
4  Ours             ✗     ✓    ✓
5  Ours             ✓     ✓    ✓
Table 3: Detailed evaluation results on the MRW dataset. Ablative factors are: multiple instance learning (MIL), self-attention (SA), and multiple embeddings (ME). See the text for a detailed explanation of the different settings.

This suggests the superiority of the triplet ranking approach on sentence-to-video retrieval tasks. (4) DCCA performs poorly on both datasets. We believe this is due to the difficulty of estimating the covariance matrix required for DCCA.

Ablation study: Both our approach and DeViSE are trained using a triplet ranking loss, but there are three major differences: (i) we replace the hinge loss (Eqn. (5)) with the pseudo-Huber loss (Eqn. (7)), (ii) we compute multiple embeddings per modality using self-attention (Eqn. (4)), and (iii) we train our model in the framework of multiple instance learning (Eqn. (10)). To tease apart the influence of each of these factors, we conduct an ablation study adding one feature at a time; the results are shown in Table 3. We also include the DeViSE results from Table 2 for easy comparison.

From the results we make the following observations:
(1) Rows 1 vs. 2: The only difference between DeViSE and our model without the three features is the loss function: the hinge loss vs. the pseudo-Huber loss. We can see that just by replacing the loss function we obtain a relative 3.29% improvement in terms of nMR. This shows the importance of having a fully differentiable loss function when training deep neural networks.
(2) Rows 2 vs. 3: Adding self-attention provides a relative 3.09% improvement over the base model in nMR. Note that the two methods produce embeddings with the same feature dimensionality, $\mathbb{R}^d$; the former concatenates the two last hidden states of the bidirectional GRU, while the latter computes a linear combination of the hidden states using a single attention map (i.e., $K = 1$). Our result conforms to the literature showing the superiority of attention-based representations [27, 40]. (3) Rows 3 vs. 4: Computing multiple embeddings and treating them as a single representation to compute the similarity (i.e., Eqn. (8)) hurts the performance. This is because each of the $K$ embeddings is required to be aligned in the embedding space, which may be too strict an assumption for the case of implicit concept association. (4) Rows 4 vs. 5: Training our model with MIL
provides a relative 7.93% performance improvement in nMR. Note that the two methods produce embeddings with the same feature dimensionality, $\mathbb{R}^{Kd}$; the only difference is in how we compute the similarity between two embeddings, i.e., Eqn. (8) vs. Eqn. (9). This suggests the importance of MIL in our framework.

Figure 7: Performance on the MRW dataset with different numbers of embeddings (K). The best results are obtained at K = ….

Sensitivity to K: Figure 7 shows the sensitivity of our MiViSE model to different numbers of embeddings; the results are based on our MRW dataset. We divide the nMR values by 100 for better visualization. We analyze the results in terms of the nMR metric because we choose our best model based on that metric. At the smallest K, the model performs worse (nMR = 34.37) than our ablative settings (rows 3 and 4 in Table 3). The performance improves as we increase K and then plateaus (nMR = 31.57). We can see that, within a reasonable range of K, our model is insensitive to the number of embeddings: the relative difference between the best (nMR = 31.57) and the worst (nMR = 33.4) settings is 5.6%.

Qualitative results: Finally, we visualize some examples of visual-textual attention maps in Figure 8. We see that attention maps for sentences change depending on the corresponding video; e.g., in the first row, the highest attention is given to the word "witty" for the predicted best matching video, while for the ground truth video it is given to the word "already". We also see evidence of multiple embeddings capturing different concepts. In the second row, the word "guy" is highlighted for the video with a male hockey player; when combined with "how he feels about me", this may capture concepts such as awkwardness. On the other hand, for the ground truth video, only the phrase "feels about me" is highlighted, which may capture different concepts, such as fondness.
[Figure 8 examples (one sentence per row): "MRW a witty comment I wanted to make was already said"; "MFW a guy sends me a song saying its how he feels about me"; "MFW my boss told me I couldn't work extra hours this week"; "MRW my friend's roommates arrive and he tells us to act natural".]
Figure 8: Attention maps for (a) the predicted best matching pairs and (b) the ground truth pairs; note that the sentence is the same in each row. We show four frames uniformly sampled from each GIF; visual attention maps are shown below the frames. See the text for discussion.
6 DISCUSSION
Using extra information: We demonstrated our proposed MiViSE network on sentence-to-GIF retrieval, which is motivated by the real-world scenario of GIF search. For simplicity, we used generic video and sentence representations (a CNN and GloVe, respectively, followed by a bidirectional RNN); we did not leverage the full potential of our MRW dataset for this task. We believe that considering other sources of information available in our dataset, such as facial expressions and captions in GIFs, would further improve the performance.
Multiple embeddings: Our MIL-based approach requires multiple embeddings from each modality, each capturing different concepts. We used self-attention to compute multiple embeddings, each attending to different parts of a sequence. While each embedding is encouraged to attend to different parts of a sequence, this does not guarantee that they capture different "concepts" (even though we see some evidence of this in Figure 8). We believe that explicitly modeling concepts via external resources could further improve the performance, e.g., using extra datasets focused on emotion and sentiment.
Evaluation metric: Our user study shows a high false positive rate of 50.7% when there is implicit concept association. This means there could be many "correct" videos for a sentence query. This calls for a different metric that measures the perceptual similarity between queries and retrieved results, rather than exact match. There has been some progress on perceptual metrics in the image synthesis literature (e.g., the Inception Score [34]). We are not aware of a suitable perceptual metric for cross-modal retrieval, and this could be a promising direction for future research.
Data size: Our dataset is half the size of the TGIF dataset [23]. This is due to a real-world limitation: our data must be crawled, rather than generated by paid crowdworkers. We are continuously crawling the data and plan to release updated versions on a regular basis.
7 CONCLUSION
In this paper, we addressed the problem of cross-modal multimedia retrieval with implicit concept association. For this purpose, we collected a dataset of reaction GIFs, which we call MRW, together with their associated sentences from the Internet. We compared the characteristics of the MRW dataset with the closest existing dataset, developed for video captioning, i.e., TGIF. We proposed a new visual-semantic embedding network based on multiple instance learning and showed its effectiveness for cross-modal retrieval. By using a diverse set of attention maps in addition to multiple instance learning, our method is better able to deal with the alignment problem that arises from implicit association in reaction GIF retrieval. Our results demonstrated that the proposed method outperforms correlation-based methods, i.e., Corr-AE [12] and DCCA [2], as well as an existing triplet ranking method, i.e., DeViSE [13]. We carefully compared different variations of our model and identified the factors that contribute to the improved performance. Finally, we discussed limitations and directions for future work, some of which we have already started to pursue.
REFERENCES
[1] Jaume Amores. 2013. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence (2013).
[2] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep canonical correlation analysis. In ICML.
[3] Saeideh Bakhshi, David Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, and Joseph 'Jofish' Kaye. 2016. Fast, Cheap, and Good: Why Animated GIFs Engage Us. In CHI.
[4] Pierre Charbonnier, Laure Blanc-Féraud, Gilles Aubert, and Michel Barlaud. 1997. Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on Image Processing (1997).
[5] Weixuan Chen, Ognjen Rudovic, and Rosalind W. Picard. 2017. GIFGIF+: Collecting Emotional Animated GIFs with Clustered Multi-Task Learning. In ACII.
[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[7] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: a real-world web image database from National University of Singapore. In CIVR.
[8] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence (1997).
[9] Aviv Eisenschtat and Lior Wolf. 2017. Linking image and text with 2-way nets. In CVPR.
[10] Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion 6, 3-4 (1992), 169–200.
[11] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. IJCV (2010).
[12] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014. Cross-modal retrieval with correspondence autoencoder. In ACM Multimedia.
[13] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, and others. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.
[14] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV (2014).
[15] Michael Gygli and Mohammad Soleymani. 2016. Analyzing and predicting GIF interestingness. In ACM Multimedia.
[16] Michael Gygli, Yale Song, and Liangliang Cao. 2016. Video2GIF: Automatic generation of animated GIFs from video. In CVPR.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
[18] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR.
[19] Brendan Jou, Subhabrata Bhattacharya, and Shih-Fu Chang. 2014. Predicting viewer perceived emotions in animated GIFs. In ACM Multimedia.
[20] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
[21] Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755–1758.
[22] Josip Krapac, Moray Allan, Jakob Verbeek, and Frédéric Jurie. 2010. Improving web image search results using query-relative classifiers. In CVPR.
[23] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. 2016. TGIF: A new dataset and benchmark on animated GIF description. In CVPR.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.
[25] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In ICLR.
[26] Daniel McDuff, Abdelrahman Mahmoud, Mohammad Mavadati, May Amr, Jay Turcot, and Rana el Kaliouby. 2016. AFFDEX SDK: A Cross-Platform Real-Time Multi-Face Expression Recognition Toolkit. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '16). ACM, New York, NY, USA, 3723–3726. DOI: http://dx.doi.org/10.1145/2851581.2890247
[27] Volodymyr Mnih, Nicolas Heess, Alex Graves, and others. 2014. Recurrent models of visual attention. In NIPS.
[28] Yuxin Peng, Xin Huang, and Yunzhen Zhao. 2017. An overview of cross-media retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Transactions on Circuits and Systems for Video Technology (2017).
[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
[30] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In CSLDAMT.
[31] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In ACM Multimedia.
[32] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. IJCV (2017).
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. ImageNet large scale visual recognition challenge. IJCV (2015).
[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In NIPS.
[35] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. 2017. Learning robust visual-semantic embeddings. In ICCV.
[36] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. 2017. Adversarial Cross-Modal Retrieval. In ACM Multimedia.
[37] Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. 2016. A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215 (2016).
[38] Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In CVPR.
[39] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In CVPR.
[40] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[41] Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In CVPR.
[42] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the ACL (2014).
[43] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In ICLR.
[44] Yipin Zhou, Yale Song, and Tamara L Berg. 2018. Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks. In