Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval
Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
Google Research
schangpi, jponttuset, vittoferrari, [email protected]
Abstract
Existing image retrieval systems use text queries to provide a natural and practical way for users to express what they are looking for. However, fine-grained image retrieval often requires the ability to also express where in the image the content they are looking for is. The textual modality can only cumbersomely express such localization preferences, whereas pointing would be a natural fit. In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where") to express the characteristics of the desired target image. To this end, we learn an image retrieval model using the Localized Narratives dataset, which is capable of performing early fusion between text descriptions and synchronized mouse traces. Qualitative and quantitative experiments show that our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
1. Introduction
Gargantuan amounts of pictures are taken and shared every day, at an ever-accelerating pace. Building effective image retrieval systems for finding specific images among large collections is, therefore, of paramount importance and presents opportunities for high-impact work. Finding the picture that one has in mind should be easier and faster than painfully scrolling through hundreds of pictures in a digital-camera roll. To speed the search up, Content-Based Image Retrieval (CBIR) systems build an index that represents a collection of images by automatically analyzing their content [41, 24, 8, 32, 21, 29, 36, 37, 6, 17, 22, 19, 3].

A query is a description of what a user is looking for in an image, a translation of their mental model of the target image into a concrete form that can be understood by CBIR systems. At a very coarse level, a query can be a list of specific classes of objects (e.g., cars or people) that the user wants the target image to contain [33].
[Figure 1 example queries: (a) "A horse in a city, occluding a bike and a car."; (b) the same text extended with "The horse is on the left side of the image, in a very close shot, cut below its neck..."; (c) the same text accompanied by a synchronized mouse trace.]
Figure 1:
Different types of textual queries to represent the what and the where in the target image: (a) spatial information is usually lacking in textual descriptions and (b) it is cumbersome to express in written form, while (c) it is very natural using mouse traces synchronized with the text.

At a finer-grained level, a query can be a natural language description of the contents of the target image [36, 37, 6, 17, 22, 19, 3]. The latter is the most common paradigm in the recent literature, partly due to the availability of captioning datasets that can be used as training and testing data [20, 2, 39, 25]. These types of queries generally focus on what is present in the image, but fall short of expressing where in the image the user expects this content.

As an example, consider a user having the image in Figure 1 in mind. A potential textual query would be "A horse in a city, occluding a bike and a car" (Fig. 1a). The image returned, while not the one the user had in mind, is a perfect match for this description: the "what" in the image is very similar to the intended target. However, expressing the "where" part using the textual query is not only cumbersome for the user to write, but also hard for the CBIR system to process (Fig. 1b), and, we argue here, not the best way to do it.

In this paper, we propose a new query modality where the user describes the characteristics of the desired target image simultaneously using spoken natural language, the "what", and mouse traces over an empty canvas, the "where" (Fig. 1c).
Figure 2:
Qualitative results: Querying with (a) text and mouse traces, versus (b) only text. The target image is marked in green. Adding mouse traces to express the spatial location of the image content allows us to get a better retrieval result even given the same textual query. In this particular case, notice that the exact positions of the dog and the seashells allow the model to detect the correct target image. (Query text: "In the image there is a dog running in the water. At the bottom of the image there are a few seashells.")

Roughly pointing at an object's location comes naturally to humans [7, 4] and is a very effective way of communicating the image layout the user has in mind. When the localization information is also temporally aligned with the natural language query, it becomes a natural grounding signal that can be exploited to make retrieval more precise.

We propose an image retrieval model that takes this new type of multimodal query as input. We start from an image-to-text matching model that is repurposed as an image retriever by ranking image-text pairs according to their affinity, as is common in previous literature [13, 6, 17, 40]. We then augment the text input to also take into account the rough position on the blank canvas of each of the words (Fig. 3).

The data for training and evaluating such a model comes from Localized Narratives [26], a captioning dataset where annotators describe the images with their voice while simultaneously moving their mouse over the objects they are describing. The mouse traces effectively ground each word of the caption in the image. To use this data in an image retrieval scenario, we take the caption and corresponding mouse trace as the input query, and the image on which the annotation was generated as the target image.

Our experimental evaluation shows that this query modality provides a clear absolute improvement in top-1 recall compared to the model that uses only text-based queries. As we show in Figure 2, having the rough location of the objects mentioned in the input restricts the space of plausible target images and thus allows for a more effective retrieval result.

The remainder of the paper is organized as follows. In Section 2, we describe the approach we take to perform image retrieval from our newly proposed query modality. Section 3 explains the experimental setup used to validate our approach and Section 4 analyzes the results. Section 5 discusses related work and Section 6 concludes the paper.
2. Approach
In this section we first describe our base image retrieval system, based on an image-text matching model (Sec. 2.1). We then propose a modification of this model to incorporate the extra supervision in the form of bounding boxes (Sec. 2.2), and show how we derive them from mouse trace segments (Sec. 2.3).
As in much of the previous work [13, 6, 17, 40], we turn the standard text-based image retrieval problem into learning image-text matching. Let us denote by x = (x_1, ..., x_N) a set of feature vectors representing the image (e.g., the output of a CNN or an object detector run on the image) and by y = (y_1, ..., y_K) a set of feature vectors representing the text (e.g., random or pre-trained character/subword/wordpiece/word embeddings of text tokens). We fix both N and K in our experiments and use padding and masking as necessary. Our base model learns a similarity function

s(x, y) = p(f(x), g(y)),   (1)

where f, g, and p are an image tower, a text tower, and an image-text fuser, respectively. Each tower reduces a set of feature vectors into a fixed-length vector, and the fuser combines them to produce the final score.
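For concreteness, the following is a minimal sketch of Eq. (1), assuming PyTorch, with the dot-product fuser and mean pooling specified below; the layer sizes are illustrative placeholders rather than the values used in our experiments, and the Transformer contextualizer is reduced to mean pooling for brevity.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Reduces a set of feature vectors (B, N, d_in) to one fixed-length vector per example."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)

    def forward(self, feats):                    # feats: (B, N, d_in)
        return self.proj(feats).mean(dim=1)      # (B, d_out), mean pooling over the set

class ImageTextMatcher(nn.Module):
    def __init__(self, d_img=2048, d_txt=512, d_joint=512):
        super().__init__()
        self.f = Tower(d_img, d_joint)           # image tower f
        self.g = Tower(d_txt, d_joint)           # text tower g

    def forward(self, x, y):
        # Fuser p: dot product between the pooled image and text embeddings (Eq. 1).
        return (self.f(x) * self.g(y)).sum(dim=-1)   # (B,) similarity scores

# Usage: score a batch of image feature sets against text token embeddings.
scores = ImageTextMatcher()(torch.randn(4, 17, 2048), torch.randn(4, 32, 512))
```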
Figure 3:
Model: Our model performs early fusion of text token representations (blue) and the box representations derived from traces (orange) with the Transformers on the right. Similarly, the model embeds the global and regional image embeddings (yellow) with the Transformers on the left. During the late fusion, the model combines the two streams and computes the similarity score between the image embedding and the text+traces embedding.

In this paper, we choose the dot product as the image-text fuser p. At training time, we learn the parameters of f, g, and p from a collection of positive image-text pairs. At test time, given a query text y', we use the learned model to compute a similarity score between y' and each of the images x in the database. We then output a ranking of all database images sorted by their score, which represents our retrieval result.

Figure 3 (without the trace inputs and the Trace Box Embedder, in orange) illustrates our base model. We adopt a two-stream model in which the image tower f and the text tower g do not share weights. Each tower consists of three components: (i) an embedder, (ii) a contextualizer, and (iii) a pooler. Both towers use a multi-layer Transformer architecture [34] for (ii) and mean pooling for (iii). We use the vanilla architecture, where each Transformer layer consists of multi-head self-attention and a feed-forward fully-connected network. We refer the reader to [34] for details about the Transformer architecture. Below, we describe the first component of each tower.
The Image Region Embedder (IRE). The input of the IRE is a fixed-length feature vector representing the whole image (CNN output) or a region of the image (one of an object detector's region outputs). The IRE transforms each of these feature vectors into an embedded semantic feature vector, and their corresponding 5D geometric feature of box coordinates (x_min, x_max, y_min, y_max) and box area into an embedded location feature. Adding the two together gives a location-aware semantic feature vector of the region, which goes through a 2-layer Multi-Layer Perceptron (MLP) before it is used as input to the image Transformer.
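As an illustration, the following sketch shows the IRE computation described above, assuming PyTorch; the feature and model dimensions are placeholders, and the per-coordinate projections of Fig. 3 are collapsed into a single linear layer for brevity.

```python
import torch
import torch.nn as nn

class ImageRegionEmbedder(nn.Module):
    def __init__(self, d_feat=2048, d_model=512):
        super().__init__()
        self.semantic = nn.Linear(d_feat, d_model)   # embedded semantic feature
        self.location = nn.Linear(5, d_model)        # embedded 5D location feature
        self.mlp = nn.Sequential(                    # 2-layer MLP before the image Transformer
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, region_feats, boxes):
        # boxes holds (x_min, x_max, y_min, y_max, area) per region, in [0, 1].
        fused = self.semantic(region_feats) + self.location(boxes)
        return self.mlp(fused)                       # location-aware semantic features

ire = ImageRegionEmbedder()
out = ire(torch.randn(4, 16, 2048), torch.rand(4, 16, 5))   # (4, 16, 512)
```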
The Text Token Embedder (TTE). Given a fixed-length vector representing a text token (a character, a subword, a word, etc.), the TTE applies a 2-layer MLP and adds a position embedding to the output, resulting in a token embedding that is position-aware.
Given mouse traces t as an additional input, we modify our similarity function in (1) by injecting them into the text stream of the model:

s(x, y, t) = p(f(x), h(y, t)),   (2)

where h is a text-trace fuser/embedder, and f and p are the same as in (1). Similarly to the setting in Section 2.1, at training time we learn the parameters of f, h, and p from a collection of positive image-text-trace triplets. At test time, given a query text y' and its corresponding query trace t', we use the learned model to compute a similarity score between (y', t') and each of the images in the database, and output a ranking of the images. Note that our setting assumes the existence of traces both during training and testing, as we envisage these new "text+trace" queries to be cast by users through an interface analogous to the one used for the Localized Narratives annotation [26].

Figure 3 depicts our full model, with the components described in Section 2.1 unchanged. The extra component, the mouse trace input t, is encoded in the form of a sequence of boxes by the Trace Box Embedder (TBE, bottom right of Fig. 3), which we describe below, and then fused with the text query.
The Trace Box Embedder (TBE). Analogous to the location input of the IRE, each of the trace boxes is represented using a 5D vector consisting of coordinates and area (x_min, x_max, y_min, y_max, area). Since these boxes correspond to parts of the text query, they also have a notion of 1D time-location "position" in the query. Thus, we add a position embedding to the transformed trace embedding vector, resulting in a trace embedding vector that is both location-aware and position-aware.
Fusing texts and traces. We concatenate all the outputs of the TTE (Sec. 2.1) and the TBE, and use the result as input to the text-trace Transformer. We believe this is both simple and powerful, as the Transformer self-attention layers allow text tokens and trace boxes to attend to each other freely. Note that it is this early fusion of text and traces that is capable of modeling where in the image certain parts of the query are expected to be relevant.
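The sketch below illustrates this early fusion h(y, t), assuming PyTorch; one trace box per token, a single shared position lookup table, and the dimensions and layer counts are all illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class TextTraceFuser(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=2, max_len=256):
        super().__init__()
        self.token_mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))   # TTE
        self.trace_proj = nn.Linear(5, d_model)                        # TBE (5D box -> embedding)
        self.pos = nn.Embedding(max_len, d_model)   # shared 1D position lookup table
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens, trace_boxes):
        # tokens: (B, K, d_model); trace_boxes: (B, K, 5), one box per token.
        positions = torch.arange(tokens.size(1), device=tokens.device)
        y = self.token_mlp(tokens) + self.pos(positions)      # position-aware tokens
        t = self.trace_proj(trace_boxes) + self.pos(positions)  # location- and position-aware traces
        fused = torch.cat([y, t], dim=1)            # early fusion along the sequence axis
        return self.encoder(fused).mean(dim=1)      # pooled text+trace embedding

h = TextTraceFuser()
emb = h(torch.randn(2, 40, 512), torch.rand(2, 40, 5))   # (2, 512)
```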
Figure 4:
From a mouse trace segment to its box: We first prolong the mouse trace segment along the temporal dimension (green), and then we add spatial 2D padding (blue).
A Localized Narrative annotation has each utterance in the caption associated with a mouse trace segment, which grounds the utterance on the image. In other words, it defines the rough position in the image where the semantic content of the utterance (the what) is located (the where). The mouse trace segment for a certain utterance corresponds to the sequence of image points the mouse traversed during the time interval (t_0, t_1) in which the annotator spoke the utterance. We observe that the mouse traces around the time when an utterance was spoken can still refer to the same utterance, so we explore adding a temporal padding t_p to better define the trace segment. That is, we consider the trace segment in the time interval (t_0 - t_p, t_1 + t_p). Figure 4 shows an example image with the full mouse trace overlaid (gray) and the mouse trace segment corresponding to "refrigerator" highlighted in red.

Our model consumes bounding boxes as input to locate the query in the image (Fig. 3), so we convert the mouse trace segments to bounding boxes as follows. We start from the tightest bounding box (Fig. 4, yellow box) that fully contains the trace segment defined by the time interval (t_0 - t_p, t_1 + t_p), and we enlarge it in all dimensions by a certain spatial padding s_p (Fig. 4, blue box).
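For concreteness, a minimal sketch of this conversion is given below (plain Python); the point format, the helper name, and the clamping to the unit canvas are illustrative assumptions rather than the exact implementation.

```python
def trace_segment_to_box(trace, t_start, t_end, t_pad, s_pad):
    """trace: list of (t, x, y) points with x, y in [0, 1]; returns a padded 5D box."""
    # Keep the points traversed within the temporally padded utterance interval.
    pts = [(x, y) for (t, x, y) in trace if t_start - t_pad <= t <= t_end + t_pad]
    if not pts:
        return (0.0, 1.0, 0.0, 1.0, 1.0)   # fall back to the whole canvas
    xs, ys = zip(*pts)
    # Tightest box around the segment, enlarged by the spatial padding and clamped.
    x_min = max(0.0, min(xs) - s_pad)
    x_max = min(1.0, max(xs) + s_pad)
    y_min = max(0.0, min(ys) - s_pad)
    y_max = min(1.0, max(ys) + s_pad)
    area = (x_max - x_min) * (y_max - y_min)
    return (x_min, x_max, y_min, y_max, area)

# Example: a short trace segment spoken between t=2.0s and t=2.8s.
trace = [(1.9, 0.40, 0.55), (2.3, 0.45, 0.60), (2.9, 0.50, 0.62)]
print(trace_segment_to_box(trace, 2.0, 2.8, t_pad=0.2, s_pad=0.05))
```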
3. Experimental Setup
Localized Narratives as query-image pairs
Localized Narratives [26] are image captions where each word is grounded on the image by a mouse trace segment. They were obtained by annotators describing the content of the images with their voice while simultaneously moving their mouse over the objects they were describing (Fig. 5 left).
[Figure 5 panels: Localized Narrative (left); Query and Target Image (right). Example narrative: "A woman sitting on the grass besides a plant with the basket. She wears a cap. On the background we can see many trees. And this is the sky with heavy clouds."]
Figure 5:
Localized Narratives annotations (left) can be transformed into training and testing data for image retrieval (right) by using the mouse traces as if they were drawn on top of a blank canvas and used as part of the query.
We transform the original Localized Narratives into useful annotations for image retrieval by forming a query-image pair for each Localized Narrative as follows. We first strip away the image and keep only the caption and synchronized mouse trace, as if it were drawn on an empty canvas, as the input query. Then we place the underlying image in our database, as the intended target for that query (Fig. 5 right). These paired queries and target images allow us to train and evaluate our system. In particular, the image database used for evaluation consists of all the images in the test split (which are disjoint from the images used in training).
Main Task
Flickr30K [39, 25] is a dataset of images annotated with multiple captions each. It is commonly used to evaluate text-to-image retrieval systems. Since our queries are a combination of text and synchronized mouse traces, we use the publicly available Localized Narrative annotations [38] on these images and disregard the original captions. We refer to the result as Flickr30K Localized Narratives. We fine-tune our models on the training set of Flickr30K Localized Narratives and report our quantitative results on its test set (1,000 images). For each image in the test set, we input the corresponding text+trace query to our model and check whether the target image is within the first k images in the output ranking. As our quantitative metric, we report the overall percentage of target images that fall within the first k images in the rankings, which is known as Recall@k. We use k = 1, 5, and 10, denoted as R@1, R@5, and R@10.
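The following is a minimal sketch of this metric, assuming NumPy and a precomputed query-image score matrix; the variable names are illustrative.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, target_ids: np.ndarray, k: int) -> float:
    """scores: (num_queries, num_images) similarity matrix; target_ids: (num_queries,)."""
    # Indices of the k highest-scoring database images for each query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == target_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Example with 3 queries over a 5-image database.
scores = np.random.rand(3, 5)
targets = np.array([0, 3, 4])
print({f"R@{k}": recall_at_k(scores, targets, k) for k in (1, 5)})
```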
Pre-training. Vision-and-language pre-training has recently been proven to benefit downstream image retrieval tasks [22, 23, 19, 3]. We explore whether it also benefits our models (Sec. 2.1 and Sec. 2.2). In particular, we explore two different datasets and types of data for pre-training. First,
Conceptual Captions [31], which contains over 3.3 million images harvested from the web. The alt-text HTML attribute associated with these images is used as their natural-language caption. Second, Open Images [16, 14], a collection of over 9 million images, a subset of which is annotated with Localized Narratives. We will refer to this dataset as Open Images Localized Narratives. We only use the training splits of both datasets.

The image-text pairs in these two datasets come with different strengths. Conceptual Captions is larger-scale with more semantically specific terms (e.g., croissant vs. food), but the style of text in Open Images Localized Narratives is more similar to our target task, which uses Flickr30K Localized Narratives. We explore three pre-training settings: (i) pre-training on Conceptual Captions only, (ii) pre-training on Open Images Localized Narratives only, and (iii) pre-training on Conceptual Captions followed by Open Images Localized Narratives. The last setting is based on our intuition (which will be verified in the experiments) that the domain of Open Images Localized Narratives is closer to that of Flickr30K Localized Narratives.

We use these annotations to pre-train both the image and the language branches of our model (Fig. 3). Further, since the traces are also available in the case of Open Images Localized Narratives, we also explore incorporating traces during pre-training, that is, using Open Images Localized Narratives for pre-training the model in Sec. 2.2.
Summary
The main goal of our experiments is to test the hypothesis that trace supervision (i.e., incorporating the mouse traces) improves the accuracy of our image retrieval model. We test this hypothesis in two main scenarios: with and without pre-training. When there is no pre-training, we simply compare the retrieval performance of the model in Sec. 2.1 and the one in Sec. 2.2, both trained and tested on Flickr30K Localized Narratives.

In the scenarios where pre-training is involved, we make use of any available pre-trained weights and randomly initialize the rest, if needed. For instance, if we use both the captions and the traces of the Open Images Localized Narratives for pre-training, then we pre-train the parameters of every component in Sec. 2.2. If we only use the captions, then the TBE parameters are randomly initialized during the fine-tuning stage with Flickr30K Localized Narratives.
Table 2:
Detailed results.
Image retrieval performance (Recall@K) on the Flickr30K Localized Narratives 1K test set. The best numbers of Rows 3-5 and of Rows 6-10 are reported in Table 1, in italics. The best numbers overall in each column are in bold.
Table 1:
Main results.
The image retrieval performance (Recall@K) on the Flickr30K Localized Narratives 1K test set. The best performance is reported for each setting.
Image and text representations
We use subtokens to represent text units (e.g., "standing" is subtokenized into "stand" and "ing") and use a random embedding to represent each subtoken, with a fixed-size vocabulary. We represent an image with two types of features: a global feature vector from ResNet-152 [9], and the set of the top 16 region proposals from a Faster R-CNN [27] object detector trained on Visual Genome [15], with a ResNet-101 backbone [9] trained on JFT [10] and fine-tuned on ImageNet [28]. The box coordinates and area of a region are represented as relative numbers between 0 and 1, such that the 5D location information x_min, x_max, y_min, y_max, and area of the whole image is 0.0, 1.0, 0.0, 1.0, and 1.0, respectively. We concatenate the two sets of features and permute the 16 regional feature vectors during training.
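As a small illustration, the relative 5D location feature can be computed as below (plain Python; the helper name is illustrative), so that the whole image maps to (0.0, 1.0, 0.0, 1.0, 1.0).

```python
def location_feature(x_min, y_min, x_max, y_max, img_w, img_h):
    # Normalize box coordinates by the image size and compute the relative area.
    rx_min, rx_max = x_min / img_w, x_max / img_w
    ry_min, ry_max = y_min / img_h, y_max / img_h
    area = (rx_max - rx_min) * (ry_max - ry_min)
    return (rx_min, rx_max, ry_min, ry_max, area)

print(location_feature(0, 0, 640, 480, 640, 480))  # (0.0, 1.0, 0.0, 1.0, 1.0)
```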
Learning. We use contrastive learning, treating all other image-text pairs in each batch as negatives. We use the Adam optimizer [12] with a linear learning rate warm-up followed by a step-wise decay of the learning rate. More details are in the Appendix.
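A common instantiation of this objective is the in-batch softmax cross-entropy sketched below (assuming PyTorch); this illustrates the idea of using all other pairs in the batch as negatives rather than reproducing our exact loss.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """img_emb, txt_emb: (B, d) pooled embeddings from the two towers."""
    scores = img_emb @ txt_emb.t()                       # (B, B) dot-product similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    # Matched pairs sit on the diagonal; cross-entropy in both retrieval directions.
    return F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels)

loss = in_batch_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```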
4. Experimental Results
Main results: improvements from mouse trace supervision.
Table 1 reports Recall@1, 5, 10 on the Flickr30K Localized Narratives test set in four settings, as described in the previous section. We observe that both text pre-training (which improves the "what") and mouse trace supervision (which improves the "where") significantly help (Row 1 vs. Rows 2 and 3). The best result is obtained when we combine the two (Row 4): compared to the baseline model (no pre-training, no traces, Row 1), we improve R@1, R@5, and R@10 by clear absolute margins. We note that trace supervision is useful regardless of whether or not we use text pre-training; it leads to an absolute improvement in R@1 both without text pre-training (Row 1 vs. Row 2) and with text pre-training (Row 3 vs. Row 4).
Qualitative results. Figure 6 shows two qualitative results, comparing our best method, (a) in Row 4, to that without trace supervision, (b) in Row 3, and to that with neither trace supervision nor pre-training, (c) in Row 1.

Next, we expand Table 1 and report our detailed results in Table 2, summarized below.
When does mouse trace supervision help the most?
The benefit of trace supervision is most dramatic when (i) we do not perform text pre-training, and when (ii) we leverage trace supervision in both pre-training and fine-tuning. For instance, with Open Images Localized Narratives (OID LocNarr) pre-training as a baseline (Row 4), trace supervision at both the pre-training and fine-tuning stages pushes R@1 markedly higher than trace supervision at the fine-tuning stage only.

Nevertheless, we do observe slight improvements in R@1 with trace supervision in the case of text-only pre-training in all cases (Row 6 vs. Row 3, Row 7 vs. Row 4, and Row 8 vs. Row 5), i.e., with Conceptual Captions (CC) text, OID LocNarr text, and CC -> OID LocNarr text, respectively.

Table 3:
Benefit of 1D position (pos) and 2D location (loc) features.

We attribute this to two potential factors. First, learning to effectively incorporate the traces with our current approach requires more training data than the 30K instances in Flickr30K Localized Narratives. Second, pre-training is also capable of helping the model better ground word tokens to their image regions, though arguably in a more implicit manner (i.e., via data augmentation) than trace supervision.
Which text pre-training data sources help the most?
In general, we observe the trend that Open Images Localized Narratives is superior to Conceptual Captions as a pre-training data source for this task (Row 4 vs. Row 3, Row 7 vs. Row 6, and Row 10 vs. Row 9), supporting our intuition that the domain of Open Images Localized Narratives is closer to that of Flickr30K Localized Narratives. Nevertheless, both can be complementary, especially if we only use the captions, not the traces, during pre-training (Row 5 vs. Rows 3-4, and Row 8 vs. Rows 6-7).
How and how much do position and location embeddings contribute?
Table 3 investigates the benefits of the 1D word position (TTE & TBE in Fig. 3) and 2D image region location (IRE & TBE in Fig. 3) embeddings, on top of the semantic ones. We find that each component contributes to refining the top retrieved image (Row 1 vs. others).
What is the relative performance contribution of a single-source modality?
We quantify the relative performance contributions of the text and mouse-trace elements of the query. The trace-only query achieves a much lower R@1 than the text-only query, which is in turn below that of the joint text-trace query (Rows 1 and 2 of Table 1). Hence, text plays the major role, but both elements are important in achieving strong overall performance.

We find the best values for the temporal padding t_p and the spatial padding s_p (Sec. 2.3) by leveraging the Localized Narratives annotations for the COCO [20, 2] dataset, where all instances of objects of 80 classes are annotated with a bounding box. We consider all occurrences in the Localized Narrative captions mentioning any of these classes. For each occurrence, we extract the bounding box of the corresponding mouse trace segment as introduced above.

Table 4:
From traces to bounding boxes: Average IoU between the extracted boxes and ground-truth boxes in COCO with respect to the temporal and spatial padding values (higher is better).

We then compute the overlap (Intersection over Union, IoU) between each mouse trace segment box and all ground-truth object bounding boxes for that class in the image, and take the best. Table 4 shows the mean values of the best overlap over the training set of COCO, with respect to the values of t_p (in seconds) and s_p (as a percentage of the image height or width). We select the configuration of t_p and s_p with the highest mean IoU and use these values in all our experiments. Notice that the COCO dataset is disjoint from our evaluation dataset, the Flickr30K test set, and is hence suitable for such parameter tuning.
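A minimal sketch of this selection procedure, in plain Python with illustrative data, is given below; the trace boxes are assumed to be extracted with the Sec. 2.3 conversion for a given (t_p, s_p) pair.

```python
def iou(a, b):
    """Boxes as (x_min, x_max, y_min, y_max) with relative coordinates in [0, 1]."""
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    inter = ix * iy
    union = (a[1] - a[0]) * (a[3] - a[2]) + (b[1] - b[0]) * (b[3] - b[2]) - inter
    return inter / (union + 1e-9)

def mean_best_iou(pairs):
    """pairs: list of (trace_box, ground_truth_boxes) for one (t_p, s_p) setting."""
    return sum(max(iou(tb, gt) for gt in gts) for tb, gts in pairs) / len(pairs)

# Toy example: one trace-derived box against two ground-truth boxes of its class.
pairs = [((0.1, 0.5, 0.2, 0.6), [(0.15, 0.55, 0.25, 0.65), (0.7, 0.9, 0.7, 0.9)])]
print(mean_best_iou(pairs))
# The (t_p, s_p) pair with the highest mean_best_iou over COCO would be selected.
```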
5. Related Work
Query Modality for Image Retrieval
The closest line of work to ours is text-based image retrieval [36, 37, 6, 17, 22, 19, 3], in which a natural language description of an image's content is used as input to an image retrieval system. Our main contribution is the augmentation of this input with mouse traces drawn on an empty canvas to express where in the image the content should appear.

Retrieval from scene graphs [11] is related to our approach in that both the what and the where are incorporated in the query. However, the what lacks expressiveness, as it is restricted to a closed-vocabulary form, and the where is not as user-friendly to provide as our mouse traces: the user needs to draw bounding boxes and edges linking them (the scene graph). Deriving scene graphs from natural language descriptions is a challenging, unsolved task in itself [15, 18, 30].

Drawing on an empty canvas was also explored in image retrieval from sketches [32, 21, 29]. In all works on this topic thus far, the drawings represent an abstraction of a single object. One could envision drawing multiple objects on the canvas to express the what and the where. We argue, though, that expressing the what in natural language is significantly more intuitive and faster than drawing a sketch (e.g., compare using the term "horse" versus drawing one to the level of detail that differentiates it from a zebra or a donkey).

Figure 6:
Qualitative examples: Comparison between our best method (a), the same without trace supervision (b), and without trace supervision nor pre-training (c). In green, the target image that corresponds to the query on the left.

In instance-based image retrieval [41, 24, 8], the query is an image representing an object or place, and the target image depicts the same, typically from another point of view, at another time of the day, etc. One can also add natural language text that describes desired modifications to the input image [35]. However, querying by image is a rather inflexible way to express what the user has in mind. The query image already has its content (both the what and the where) fixed within it. Often the user wants a different image, e.g., with certain objects in a different position, or with other objects altogether. Making a collage of images as a query representing different objects in different parts of the image would become impractical: the user interface would be complicated, and the user would need to find an example image for each of the objects of interest. Our proposed way of querying is much more flexible in this sense.

When considering all the possible ways in which a query can be specified, we believe that our proposal makes the most efficient use of both natural language and mouse traces: the former to express a fine-grained what naturally and fast, and the latter the where effectively and intuitively.
Image Retrieval Models. The literature on image retrieval models is vast [42, 1], so we focus on the works most relevant to ours. The most common approach to image retrieval is learning to fuse and score the representations of image-text pairs. We adopt a late-fusion-style image-text matching model, similar to [13, 6], due to its simplicity and scalability. We strengthen this baseline by using Transformer-based architectures, stronger Faster R-CNN features, and vision-and-language pre-training, following recent advances [22, 3, 19]. We make our model inject the traces by introducing the TBE module, whose encoded 1D text positions and 2D image locations act as a glue between text tokens and image regions (orange boxes in Fig. 3), largely inspired by the position/location embeddings used extensively in recent work from both the vision and NLP communities [34, 5, 22].
6. Conclusions
In this paper, we propose a new query modality for content-based image retrieval systems where the user describes the characteristics of the desired target image simultaneously using spoken natural language (the "what") and mouse traces over an empty canvas (the "where"). We present an image retrieval model that takes this new type of multimodal query as input, based on recent advances in image-to-text matching models. We train and evaluate our model using Localized Narratives, where the caption and corresponding mouse trace are used as the input query, and the corresponding image as the target. Our experimental evaluation shows that this query modality provides a clear absolute improvement in top-1 recall compared to the model that uses only text-based queries.

Acknowledgments
References
[1] Wei Chen, Yu Liu, Weiping Wang, Erwin Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, and Michael S. Lew. Deep image retrieval: A survey. arXiv preprint arXiv:2101.11282, 2021.
[2] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv, 2015.
[3] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. In ECCV, 2020.
[4] Herbert Clark. Coordinating with each other in a material world. Discourse Studies, 7:507-525, 2005.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[6] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. In BMVC, 2017.
[7] Chaz Firestone and Brian J. Scholl. "Please tap the shape, anywhere you like": Shape skeletons in human vision revealed by an exceedingly simple measure. Psychological Science, 25(2):377-386, 2014.
[8] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Proceedings of NeurIPS Workshop, 2015.
[11] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Image retrieval using scene graphs. In CVPR, 2015.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[14] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://g.co/dataset/openimages, 2017.
[15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32-73, 2017.
[16] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 128(7):1956-1981, 2020.
[17] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, 2018.
[18] Ang Li, Jin Sun, Joe Yue-Hei Ng, Ruichi Yu, Vlad I. Morariu, and Larry S. Davis. Generating holistic 3D scene abstractions for text-based image retrieval. In CVPR, 2017.
[19] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context. In ECCV, 2014.
[21] Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In CVPR, 2017.
[22] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[23] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
[24] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In ICCV, 2017.
[25] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 123(1):74-93, 2017.
[26] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.
[29] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy Database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics, 35(4):1-12, 2016.
[30] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language, 2015.
[31] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[32] Jifei Song, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, 2017.
[33] Lorenzo Torresani, Martin Szummer, and Andrew Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[35] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - An empirical odyssey. In CVPR, 2019.
[36] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[37] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[38] Website. Localized Narratives data and visualization. https://google.github.io/localized-narratives, 2020.
[39] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67-78, 2014.
[40] Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, and Fei Sha. Learning to represent image and text with denotation graph. In EMNLP, 2020.
[41] L. Zheng, Y. Yang, and Q. Tian. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. on PAMI, 40(5):1224-1244, 2018.
[42] Wengang Zhou, Houqiang Li, and Qi Tian. Recent advance in content-based image retrieval: A literature survey. arXiv preprint arXiv:1706.06064, 2017.
A. Additional Details
A.1. Model
We expand the details of our "embedder," "contextualizer," and "pooler," described in the main text. Please refer to Fig. 3 of the paper for a high-level overview.

We use the same hyperparameters across the Image-Region Embedder (IRE), Text-Token Embedder (TTE), and Trace-Box Embedder (TBE) described in the main text. Our feed-forward network (FFN) is a 2-layer MLP with the ReLU activation function and dropout applied during training only. A position embedding of fixed dimension is added in the TTE and TBE. To obtain a vector from a visual box or a trace box of (x_min, x_max, y_min, y_max, area) (IRE, TBE), we linearly project the components of this 5D vector and concatenate the results before feeding the resulting vector into the FFN.

Our image (or text) Transformer encoder is a multi-layer, multi-head self-attention encoder with a vocabulary embedding, a hidden embedding, and a filter (feed-forward) layer.

Our image (or text) pooler is a mean pooling layer, followed by a 2-layer MLP with the ReLU activation function and dropout applied during training only.

A.2. Learning

We expand the description of our learning procedure in the main text. For all experiments, we use Adam [12] with default hyperparameters. We train on Google Cloud TPUs with a fixed batch size per core. The FRCNN features are permuted during training.

For all pre-training experiments, we use a linear warm-up of the learning rate, followed by a step-wise decay. The number of training steps is on the order of millions for Conceptual Captions pre-training and smaller for Open Images Localized Narratives.

For from-scratch experiments (with or without the mouse trace supervision), we use the same learning rate schedule as in the pre-training experiments, but also consider additional initial learning rates, and we reduce the number of training steps.

For fine-tuning experiments on Flickr30K Localized Narratives, there are two cases. When both the (latest) pre-training and the fine-tuning stages use the same inputs, i.e., with or without the mouse trace supervision in both stages, we use an initial learning rate an order of magnitude smaller than in pre-training and a reduced number of training steps. When the mouse trace supervision is only added during the fine-tuning stage, we observe better performance with a slightly higher initial learning rate, and we adjust the number of training steps accordingly.