Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez
Fuwen Tan, University of Virginia, [email protected]
Paola Cascante-Bonilla, University of Virginia, [email protected]
Xiaoxiao Guo, IBM Research AI, [email protected]
Hui Wu, IBM Research AI, [email protected]
Song Feng, IBM Research AI, [email protected]
Vicente Ordonez, University of Virginia, [email protected]
Abstract
This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state representation that significantly extends current methods for single-round image retrieval. We show that using multiple rounds of natural language queries as input can be surprisingly effective for finding arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide a surprisingly effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval in a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.
Introduction

Retrieving images from text-based queries has been an active area of research that requires some level of visual and textual understanding. Significant improvement has been achieved over the past years with advances in representation learning, but finding very specific images with detailed specifications remains challenging. A common way of specification is through natural language queries, where a user inputs a description of the image and obtains a set of results. We focus on a common scenario where a user is trying to find an exact image, or similarly where the user has a very specific idea of a target image, or is deciding on-the-fly while querying. We present empirical evidence that users are much more successful if they are allowed to refine their search results with subsequent textual queries. Users might start with a general query about the "concept" of the image they have in mind and then "drill down" onto more specific descriptions of objects or attributes in the image to refine the results.

Among previous efforts in image retrieval, a promising paradigm is to learn a visual-semantic embedding by minimizing the distance between a target image and an input textual query in a joint feature space. Pioneering approaches such as [17, 34, 9, 21, 36, 33] have demonstrated remarkable performance on large scale datasets such as Flickr30K [26] and COCO [23], and domain-specific tasks such as outfit composition [12]. However, we find that these methods are limited in their capacity for retrieving highly specific images, because it is either difficult for users to be specific enough with a single query or users may not have the full picture in mind beforehand. We show an example of this type of interaction in Figure 1. While single-query retrieval might be better suited for domains such as product search, where images typically contain only one object, requiring users to describe a whole scene in one sentence might be too demanding. More recently, dialog based search has been proposed to overcome some of the limitations of single-query retrieval [22, 31, 10, 7].

[Figure 1: An example of interactive image retrieval with our Drill-down model, where a user generated query (U_t) progressively refines the search results (S_t) until the target image is among the top search results.]

In this paper, we propose Drill-down, an interactive image search framework for retrieving complex scenes, which learns to capture the fine-grained alignments between images and multiple text queries. Our work is inspired by two observations: (1) user queries at each turn may not exhaustively describe all the details of the target image, but instead focus on some local regions, which provide a natural decomposition of the whole scene. Therefore, we explicitly represent images as a list of object/stuff level features extracted from a pre-trained object detector [27]. This is also in line with recent research [21, 36] on learning region-phrase alignments for single-query methods; (2) complex scenes contain multiple objects that might share the same feature subspace. In particular, existing state representations of sequential text queries, such as the hidden states of an RNN, condense all image properties in a single state vector, which makes it difficult to distinguish entities sharing the same feature subspace, such as multiple person instances. To address this, we propose to maintain a set of state vectors, encouraging each vector to encode text queries corresponding to a distinct image region.
Figure 2 shows an overview of our approach: images are represented with local feature representations, and the query state is represented by a fixed set of vectors that are selectively updated with each subsequent query. We demonstrate the effectiveness of our approach on the Visual Genome dataset [20] in two scenarios: automatic image retrieval using region captions as queries, and interactive image retrieval with real queries from human evaluators. In both cases, our experimental results show that the proposed model outperforms existing methods, such as a hierarchical recurrent encoder model [29], while using a smaller computational budget.

Our main contributions can be summarized as follows:
• We propose Drill-down, an interactive image search approach with multiple rounds of queries, which leverages region captions as a form of weak supervision during training.
• We conduct experiments on a large-scale natural image dataset, Visual Genome [20], and demonstrate superior performance of our model on both simulated and real user queries.
• We show that our model, while producing a compact representation, outperforms competing baseline methods by a significant margin.
Related Work

Text-based image retrieval has been an active research topic for decades [5, 4, 28]. Prominent more contemporary works have recognized the need for richer user interactions in order to obtain higher quality results. Our state representation also relates to approaches with external memory [35, 32], which perform query and possibly update operations on a predefined memory space. In contrast to this line of research, we explore a more challenging scenario where the model needs to create and update the memory (i.e., the state vectors) on-the-fly so as to maintain the states of the queries.

Code is available at https://github.com/uvavision/DrillDown
Method

Retrieving images with multi-round refinements offers the potential benefit of reducing the ambiguity of each query, but it also raises the challenge of how to integrate user queries from multiple rounds. Our model is inspired by the observation that users naturally underspecify their queries by referring to local regions of the target image. We aim to capture these region level alignments by learning to map the text queries $\{s_t\}_{t=1}^{T}$ and the target image $I$ into two sets of latent vectors $\{x_i\}_{i=1}^{M}$ and $\{v_j\}_{j=1}^{N}$ respectively, and computing the matching score of $\{s_t\}_{t=1}^{T}$ and $I$ by measuring and aggregating fine-grained similarities between $\{x_i\}_{i=1}^{M}$ and $\{v_j\}_{j=1}^{N}$. Figure 2 provides an overview of our model.

To identify the candidate regions referred to in the queries, we follow [1, 21]. For each image $I$, we first detect the potential objects and salient stuff using the Faster R-CNN detector [27].
[Figure 2: Overview of our model. Drill-down maintains a fixed set of state vectors $X$, modeling the historical context of the user queries. Given a new query $q_t$, our model selects and updates one of the state vectors. The updated state vectors $X^t$ and the image region features are then projected to a cross-modal embedding space to measure the fine-grained alignment between each region-state pair.]

Corresponding features $\{c_j\}$ are extracted from the ROI pooling layer of the detector. In practice, we leverage the object detector provided by [1], which is pre-trained on Visual Genome [20] with 1600 predefined object and stuff classes. A linear projection $v_j = W_I c_j + b_I$ is applied to reduce $\{c_j\}$ to $D$-dimensional latent vectors $V = \{v_j\}_{j=1}^{N}$, $v_j \in \mathbb{R}^D$. Here $N$ is the number of regions in each image. The learnable parameters for the image representation, $\{W_I, b_I\}$, are denoted as $\theta_I$.

Supporting multi-round retrieval requires a state representation for integrating the queries from multiple turns. Solutions adopted by existing methods include applying a single recurrent network to the concatenation of all queries [9], or a hierarchical recurrent network [7, 31, 22, 10] modeling the individual query and the historical context in separate recurrent modules. These approaches produce a single latent vector which aggregates all queries. While state-of-the-art models [22, 10] show remarkable performance on domains such as fashion product search, we demonstrate that the currently used single-vector representations are not the most effective for capturing complex scenes with multiple objects. Specifically, as the image features used in existing methods are typically extracted from the penultimate layer of a pre-trained image classification or object detection model, input instances of the same or very similar categories activate the same feature units in the extracted feature space. Therefore, it is nontrivial for these latent representations to encode and distinguish multiple entities from the same or very similar categories (e.g., multiple person instances).

We propose to maintain a set of latent representations $X = \{x_i\}_{i=1}^{M}$, $x_i \in \mathbb{R}^D$, for multi-turn queries. Here $M$ is the number of latent vectors. This parameter represents the computational budget, since retrieval time depends on the compactness of this representation. While users might provide a general image description in the first round of querying, subsequent queries typically describe more specific regions. We aim at finding a good alignment between queries and the image region representations $\{v_j\}_{j=1}^{N}$. An ideal set of $\{x_i\}_{i=1}^{M}$ should learn to group and encode the input queries into visually discriminative representations referring to distinct image regions. In the remainder of the section, we first introduce the cross modal similarity formulation used in our model. We then explain how to update the state representations $\{x_i\}_{i=1}^{M}$ from the queries $\{s_t\}_{t=1}^{T}$ so as to optimize their matching score with the target image.

To measure the similarity of $X = \{x_i\}_{i=1}^{M}$ and $V = \{v_j\}_{j=1}^{N}$, we first compute the cosine similarity of each possible state-region pair $(x_i, v_j)$: $s(x_i, v_j) = x_i^{T} v_j / (\|x_i\| \|v_j\|)$, where $\|\cdot\|$ denotes the $\ell_2$ norm.
Given $s(x_i, v_j)$, we define the similarity $s(x_i, I)$ between a state vector $x_i$ and the target image $I$ as

$$s(x_i, I) = \frac{1}{N} \sum_{k=1}^{N} \alpha_{ik}\, s(x_i, v_k), \qquad \alpha_{ik} = \frac{\exp(s(x_i, v_k)/\sigma)}{\sum_{j=1}^{N} \exp(s(x_i, v_j)/\sigma)} \quad (1)$$

Here $\sigma$ is a temperature hyper-parameter. Note that this formulation is similar to measuring the cosine similarity of $x_i$ and a context vector $\sum_{k=1}^{N} \alpha_{ik} v_k$ from an attention module [24, 21]. The cross modal similarity between the state vectors $X = \{x_i\}_{i=1}^{M}$ and the target image $I$ is defined as $s(X, I) = \frac{1}{M} \sum_{k=1}^{M} s(x_k, I)$.

Given a query input $s_t$ at time $t$, our model maps each word token $w_k$ in $s_t$ to an $E$-dimensional vector via a linear projection, $e_k = W_E w_k$, $e_k \in \mathbb{R}^E$, $k = 1, \cdots, K$, then generates the sentence embedding via a uni-directional recurrent network $\phi$ with gated recurrent units (GRU): $h_k = \phi(e_k, h_{k-1})$, $h_k \in \mathbb{R}^D$. The first hidden state of $\phi$ is initialized as a zero vector, while the last hidden state is treated as the sentence representation: $q_t = h_K$. We also explored a bidirectional encoder but found no improvement. Under the assumption that each text query describes a sub-region of the image, each $q_t$ only updates a subset of the state vectors. In this work, we focus on a simplified scenario where each $q_t$ updates a single state vector $x_k^{t-1} \in X^{t-1}$. In detail, given the text query $q_t$ at time step $t$, our model samples $x_k^{t-1}$ from the previous state vector set $X^{t-1} = \{x_i^{t-1}\}_{i=1}^{M}$ according to the probability

$$\pi(x_k^{t-1} \mid X^{t-1}, q_t) = \begin{cases} \dfrac{\mathbb{1}(x_k^{t-1} = \emptyset)}{\sum_j \mathbb{1}(x_j^{t-1} = \emptyset)} & \text{if } X^{t-1} \text{ has an empty vector} \\ \dfrac{\exp(f(x_k^{t-1}, q_t))}{\sum_j \exp(f(x_j^{t-1}, q_t))} & \text{otherwise} \end{cases} \quad (2)$$

$$f(x_k^{t-1}, q_t) = W_{\pi_3}\big(\delta\big(W_{\pi_2}\big(\delta\big(W_{\pi_1}[x_k^{t-1}; q_t] + b_{\pi_1}\big)\big) + b_{\pi_2}\big)\big) + b_{\pi_3} \quad (3)$$

where $\mathbb{1}(x_j^{t-1} = \emptyset)$ is an indicator function which returns 1 if $x_j^{t-1}$ is an empty vector and 0 otherwise, and $f(\cdot)$ is a multilayer perceptron mapping the concatenation of $x_k^{t-1}$ and $q_t$ to a scalar value. Here $\delta$ is the ReLU activation function, and $W_{\pi_1} \in \mathbb{R}^{D \times 2D}$, $W_{\pi_2} \in \mathbb{R}^{D \times D}$, $W_{\pi_3} \in \mathbb{R}^{1 \times D}$, $b_{\pi_1}, b_{\pi_2} \in \mathbb{R}^D$, $b_{\pi_3} \in \mathbb{R}$ are model parameters. An empty state vector is initialized with zero values. Ideally, an expressive sampling policy should learn to allocate a new state vector when necessary; however, we empirically find it beneficial to route $q_t$ to an empty state vector whenever possible. Once $x_k^{t-1}$ is sampled, we update this state vector using a single uni-directional gated recurrent unit cell (GRU cell) $\tau$: $x_k^{t} = \tau(q_t, x_k^{t-1})$. Note that our formulation is similar to a hard attention module [37]. Leveraging soft attention is possible, but it is more computationally expensive as it would need to update all state vectors. Our state vector update mechanism is inspired by knowledge base methods with external memory [22]. Our method can be interpreted as building a knowledge base memory online from scratch, only from the query context, which can be trained end-to-end with the other modules. We denote the learnable parameters of the state vector update policy $\pi(\cdot)$ as $\theta_\pi = \{W_{\pi_1}, W_{\pi_2}, W_{\pi_3}, b_{\pi_1}, b_{\pi_2}, b_{\pi_3}\}$, and those of the remaining modules as $\theta_q = \{W_E, \phi, \tau\}$.

Our model is trained to optimize $\theta_I$, $\theta_\pi$ and $\theta_q$ so as to achieve a high similarity score between the queries $\{s_t\}_{t=1}^{T}$ and the target image $I$.
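To make these steps concrete, below is a minimal PyTorch-style sketch of the scoring function in Eq. (1) and the slot-selection/update step of Eqs. (2)-(3). It is an illustration written from the definitions above, not the authors' released code; the class and variable names, tensor shapes, and the use of `torch.nn.GRUCell` for $\tau$ are our assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity(X, V, sigma=9.0):
    """Eq. (1): similarity between state vectors X (M x D) and an image given
    by its region features V (N x D); returns the scalar s(X, I)."""
    X_n = F.normalize(X, dim=-1)               # L2-normalize state vectors
    V_n = F.normalize(V, dim=-1)               # L2-normalize region features
    s_pair = X_n @ V_n.t()                     # (M, N) cosine similarities s(x_i, v_j)
    alpha = F.softmax(s_pair / sigma, dim=-1)  # attention over regions for each state
    s_state = (alpha * s_pair).mean(dim=-1)    # s(x_i, I), averaged over the N regions
    return s_state.mean()                      # s(X, I): average over the M states

class StateUpdater(torch.nn.Module):
    """Eqs. (2)-(3): route the new query to one state vector and update it."""
    def __init__(self, D):
        super().__init__()
        self.score = torch.nn.Sequential(      # f(x_k, q_t): MLP mapping to a scalar
            torch.nn.Linear(2 * D, D), torch.nn.ReLU(),
            torch.nn.Linear(D, D), torch.nn.ReLU(),
            torch.nn.Linear(D, 1))
        self.cell = torch.nn.GRUCell(D, D)     # tau: fuses q_t into the selected slot

    def forward(self, X, q):
        # X: (M, D) previous state vectors; q: (D,) embedded query q_t
        empty = (X.abs().sum(dim=-1) == 0)     # empty slots are all-zero vectors
        if empty.any():                        # route to an empty slot whenever possible
            probs = empty.float() / empty.sum()
        else:
            logits = self.score(torch.cat([X, q.expand_as(X)], dim=-1)).squeeze(-1)
            probs = F.softmax(logits, dim=-1)  # pi(x_k | X, q_t), Eq. (2)
        k = int(torch.multinomial(probs, 1))   # sample the slot to update
        X_new = X.clone()
        X_new[k] = self.cell(q.unsqueeze(0), X[k].unsqueeze(0)).squeeze(0)
        return X_new, k
```

Routing each query to a single slot and updating it with one GRU cell keeps the per-query cost at a single cell evaluation, which is what makes the hard (rather than soft) attention choice attractive here.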
To this end, we follow [9, 21] and adopt a triplet loss on $s(X, I)$ with hard negatives:

$$L_e = \operatorname*{argmin}_{\theta_I, \theta_q} \sum_{X, I} \ell(X, I), \qquad \ell(X, I) = \max_{I'}\big[\alpha + s(X, I') - s(X, I)\big]_+ + \max_{X'}\big[\alpha + s(X', I) - s(X, I)\big]_+ \quad (4)$$

Here, $\alpha$ is a margin parameter and $[\cdot]_+ \equiv \max(\cdot, 0)$. $I'$ and $X'$ are decoy images and state vectors within the same mini-batch as the ground-truth pair $(X, I)$ during training. Note that $L_e$ only optimizes the parameters $\theta_I$ and $\theta_q$. Directly optimizing $\theta_\pi$ is difficult as sampling from Equation 2 is non-differentiable. We propose to train the policy parameters via Reinforcement Learning (RL). Formally, the state in our RL formulation is the set of state vectors $X^t = \{x_i^t\}_{i=1}^{M}$, and the action $k \in \{1, \ldots, M\}$ is to select the state vector $x_k^t$ from $X^t$ when fusing information from the embedded query vector $q_{t+1}$. The RL objective is to maximize the expected cumulative discounted reward, so in our case we define the reward function as the similarity between the state vectors $X^t$ and the image $I$, i.e., $s(X^t, I)$. Note that our reward function evaluates the similarity at all future time steps instead of only the last step $T$, encouraging the model to find the target image with fewer turns.
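For reference, here is a minimal sketch of the hard-negative triplet loss of Eq. (4), computed from an in-batch similarity matrix; the batching convention (one matched pair per row/column) and the function name are our own assumptions rather than the paper's implementation.

```python
import torch

def triplet_loss_hard_negatives(sim, margin=0.2):
    """sim: (B, B) matrix where sim[i, j] = s(X_i, I_j); diagonal entries are the
    matched (state-vectors, image) pairs. Implements Eq. (4) with the hardest
    in-batch negative in both directions."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                       # s(X, I) for ground-truth pairs
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    cost_img = (margin + sim - pos).clamp(min=0)      # decoy images I'
    cost_txt = (margin + sim - pos.t()).clamp(min=0)  # decoy state vectors X'
    cost_img = cost_img.masked_fill(mask, 0)          # ignore the positive pair itself
    cost_txt = cost_txt.masked_fill(mask, 0)
    # hardest negative per query (rows) and per image (columns), summed over the batch
    return cost_img.max(dim=1)[0].sum() + cost_txt.max(dim=0)[0].sum()
```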
Supervised pre-training

As optimizing the sampling policy requires reward signals from the retrieval environment, we pre-train the model by optimizing $L_e$ with a fixed policy: $\pi(x_k^{t-1} \mid X^{t-1}, q_t) = \mathbb{1}(k \equiv t \pmod{M})$, where $\mathbb{1}(\cdot)$ is an indicator function and $M$ is the number of state vectors. Intuitively, this policy updates the state vectors cyclically, in order.
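For illustration, the fixed pre-training policy amounts to a round-robin slot assignment; a tiny sketch of the rule above, with 0-indexed turns and slots as our assumption:

```python
def fixed_policy_slot(t, M):
    """Supervised pre-training: the query at turn t updates slot k with k = t (mod M),
    i.e. the M state vectors are filled and refreshed in a fixed cyclic order."""
    return t % M

# With M = 5 state vectors, turns t = 0..9 update slots 0, 1, 2, 3, 4, 0, 1, 2, 3, 4.
```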
Joint optimization

Given the pre-trained environment, we then jointly optimize the sampling policy and the other modules (i.e., $\theta_I$, $\theta_q$ and $\theta_\pi$). Because the next state $X^{t+1}$ is a deterministic function of the current state $X^t$ and the action $k$, we adopt the policy improvement strategy from [10] to update the policy. Specifically, we estimate the state-action value $Q(X^t, k) = \sum_{t'=t}^{T-1} \gamma^{t'-t} s(X^{t'+1}, I)$ for each state vector selection action $k$ by sampling one look-ahead trajectory, where $\gamma$ is the discount factor. The policy is then optimized to predict the most rewarding action $k^* = \operatorname*{argmax}_k Q(X^t, k)$ via a cross entropy loss:

$$L_\pi = \operatorname*{argmin}_{\theta_\pi} \sum_{X^t, q_{t+1}} -\log\big(\pi(x_{k^*}^{t} \mid X^t, q_{t+1}; \theta_\pi)\big) \quad (5)$$

We also jointly finetune $\theta_I$ and $\theta_q$ by applying $L_e$ to the rollout state vectors $X^*$: $L_e^* = \operatorname*{argmin}_{\theta_I, \theta_q} \sum_{X^*, I} \ell(X^*, I)$. The model is trained with the multi-task loss $L = L_e^* + \mu L_\pi$, where $\mu$ is a scalar factor determining the trade-off between the two terms.
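A sketch of the look-ahead used to pick the supervision target $k^*$ for Eq. (5), reusing `cross_modal_similarity` and `StateUpdater` from the earlier sketch; the rollout loop and argument names are our own rendering of the procedure above, not the released code.

```python
import torch

def best_slot_by_lookahead(X_t, next_queries, V, updater, gamma=1.0):
    """Estimate Q(X^t, k) for every slot k with one look-ahead trajectory:
    fuse the next query q_{t+1} into slot k, then roll out the remaining queries
    with the current policy, accumulating discounted similarities to the target."""
    M = X_t.size(0)
    q_values = []
    for k in range(M):
        X = X_t.clone()
        X[k] = updater.cell(next_queries[0].unsqueeze(0),
                            X_t[k].unsqueeze(0)).squeeze(0)
        ret = cross_modal_similarity(X, V)          # reward term for t' = t
        for step, q in enumerate(next_queries[1:], start=1):
            X, _ = updater(X, q)                    # later slots chosen by the policy
            ret = ret + (gamma ** step) * cross_modal_similarity(X, V)
        q_values.append(ret)
    k_star = int(torch.stack(q_values).argmax())    # most rewarding action k*
    return k_star                                   # cross-entropy target for L_pi
```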
Experiments

Dataset

We evaluate the performance of our method on the Visual Genome dataset [20]. Each image in Visual Genome is annotated with multiple region captions. We preprocess the data by removing duplicate region captions (i.e., multiple captions that are exactly the same) and images with fewer than 10 region captions. This preprocessing results in 105,414 image samples, which are further split into 92,105/5,000/9,896 for training/validation/testing. We also ensure that the images in the test split are not used for training the object detector [1]. All evaluations, including the human subject study, are performed on the test split, which contains 9,896 images. We use region captions as queries to train our model, thus bypassing the challenging issue of data collection for this task. The vocabulary of the queries is built from the words that appear more than 10 times in all region captions, resulting in a vocabulary size of 14,284. During training, queries and their orders are randomly sampled. During validation and testing, the queries and their orders are kept fixed.
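A minimal sketch of this preprocessing (duplicate-caption removal, the 10-caption image filter, and the frequency-thresholded vocabulary); the record schema and field names are assumptions for illustration only.

```python
from collections import Counter

MIN_CAPTIONS_PER_IMAGE = 10
MIN_WORD_FREQ = 10  # vocabulary keeps words appearing more than 10 times

def preprocess(images):
    """images: list of dicts, each with a 'region_captions' list of strings (assumed schema)."""
    kept, word_counts = [], Counter()
    for img in images:
        captions = list(dict.fromkeys(img["region_captions"]))  # drop exact duplicates
        if len(captions) < MIN_CAPTIONS_PER_IMAGE:
            continue                                             # discard sparse images
        kept.append({**img, "region_captions": captions})
        for cap in captions:
            word_counts.update(cap.lower().split())
    vocab = [w for w, c in word_counts.items() if c > MIN_WORD_FREQ]
    return kept, vocab
```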
Baselines
We compare our method with four baseline models: (1) HRE: a hierarchical recurrent encoder network, which is commonly adopted by recent dialog based approaches [31, 22, 10]. We consider the variant using text queries as context, which consists of a sentence encoder, a context encoder and an image encoder. The sentence encoder has the same word embedding (i.e., the linear projection $W_E$) and sentence embedding (i.e., the $\phi$ function) as the proposed model. The context encoder is a uni-directional GRU network $\psi$ that sequentially integrates the sentence features $q_t$ from $\phi$ and generates the final query feature $\bar{x}^t$: $\bar{x}^t = \psi(q_t, \bar{x}^{t-1})$, with $\bar{x}^0$ initialized as a zero vector. The image encoder maps the mean-pooled features of ResNet152 [13] into a one-dimensional feature vector $\bar{v}$ via a linear projection. The ResNet model is pre-trained on ImageNet [8]. The model is trained to optimize the cosine similarity between $\bar{x}^t$ and $\bar{v}$ by a triplet loss with hard negatives as in [9]. (2) R-HRE: a model similar to baseline (1), but trained with the region features $\{v_j\}_{j=1}^{N}$, as in the proposed method. Specifically, the model learns to optimize the similarity term $s(\bar{x}^t, I)$ defined in Eq. (1) by a triplet loss with hard negatives, similar to $L_e$ on one state vector. (3) R-RE: a model similar to baseline (2), but instead of using a hierarchical text encoder, this baseline uses a single uni-directional GRU network which encodes the concatenation of the queries. (4) R-RankFusion: a model that ranks all images separately for each turn using the region features $\{v_j\}_{j=1}^{N}$; the final ranks of the images are the averages of the per-turn ranks.

[Figure 3: Quantitative evaluation of our models and the baselines. (A) Comparison of models using query representations of the same memory size; (B) Comparison of models using query representations of different memory sizes. The horizontal axis represents the query turn.]

[Table 1: Sizes of the query/image representations and the number of parameters for our models and the baselines.]

Implementation details
We try to keep consistent configurations for all models in our experiments to better evaluate the contribution of each component. In particular, all models are trained with 10-turn queries ($T = 10$). We use ten turns as we would like to track the performance of all methods in both short-term and long-term scenarios. For each image, we extract the top 36 regions ($N = 36$) detected by a pre-trained Faster R-CNN model, following [1]. Each embedded word vector has a dimension of 300 ($E = 300$). In all our experiments, we set the temperature parameter $\sigma$ to 9, the margin parameter $\alpha$ to 0.2, the discount factor $\gamma$ to 1.0, and the trade-off factor $\mu$ to 0.1. For optimization, we use Adam [16] with an initial learning rate of e− and a batch size of 128. We clip the gradients in back-propagation such that their norm is not larger than 10. All models are trained for at most 300 epochs and validated after each epoch. The models that perform best on the validation set are used for evaluation.

Evaluation metrics
To measure retrieval performance, we use the common R@K metric, i.e., recall at K: the ratio of queries for which the target image is among the top-K retrieved images. The R@1, R@5 and R@10 scores at each turn are reported as shown in Fig. 3.
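For completeness, a small sketch of the R@K computation as defined above; it assumes one similarity row per query over the whole test index, with names of our choosing.

```python
import torch

def recall_at_k(sim, target_ids, k=10):
    """sim: (Q, num_images) similarity of each query's state vectors to every candidate
    image; target_ids: (Q,) index of the ground-truth image for each query.
    Returns the fraction of queries whose target appears in the top-k results."""
    topk = sim.topk(k, dim=1).indices                   # (Q, k) retrieved image ids
    hits = (topk == target_ids.view(-1, 1)).any(dim=1)  # target found in top-k?
    return hits.float().mean().item()
```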
Due to the lack of existing benchmarks for multi-turn image retrieval, we use the annotated region captions in Visual Genome to mimic user queries. As region captions focus on invariant information, such as image content, and convey fewer irrelevant signals, such as different speaking/writing styles, they can be seen as common "abstracts" of real queries in different forms. While we agree that strong supervisory signals such as real user queries could bridge the domain gap, and we would like to explore this direction further, at this stage we choose to use only "weak but free" signals and investigate their potential to generalize to real scenarios.

First, we compare our method against the baseline models when using query representations of the same memory size. In particular, we use 5 state vectors in our model ($M = 5$), each with a dimension of 256. Accordingly, the baseline models use a 1280-d query vector. Figure 3(A) shows the per-turn performance of the models on the test set. Here Drill-down (FP) indicates the supervised pre-trained model with the fixed policy, and Drill-down indicates the jointly optimized model with a learned policy. Both the R-RE and R-HRE baselines perform better than the HRE model, demonstrating the benefit of incorporating region features. R-HRE is superior to R-RE, demonstrating the benefit of hierarchical context encoding. R-RankFusion performs worse than all other models; note that it also requires more memory, to store the ranks of all images at each turn. Our models outperform all baselines by a large margin. On the other hand, we observe that the performance of our model degrades when different queries have to share the same state vector. For example, after the 5th turn, the Drill-down (FP) model gains less improvement from each new query. Drill-down further improves over Drill-down (FP) by learning to distribute the queries into the most rewarding state vectors.

[Figure 4: Qualitative examples of Drill-down. The sequential queries and the corresponding state vectors used to integrate them are shown on the left; the top-3 regions of the target images attended by each state vector are shown on the right, in the same color as the corresponding state vector. All of these target images rank top-1 given the input queries.]

To investigate the design space of the query representation, we further explore variants of our model with different numbers of state vectors and feature dimensions. Table 1 shows the sizes of the query/image representations and the parameters used in our models and the baselines. Note that the R-RankFusion and R-RE models have the same size of query/image representations and parameters. Here Drill-down M×D indicates the model with M state vectors, each with a dimension of D. As shown in Figure 3(B), while both Drill-down and the R-HRE baseline can be improved by increasing the feature dimension, using more state vectors yields significantly more improvement with the same, or even less, memory budget. For example, Drill-down significantly outperforms R-HRE with 3 times fewer query features, 10 times fewer region features and 4 times fewer parameters. The highest performance is achieved by the model that stores each query in a distinct state vector: 10 state vectors for 10-turn queries.
Integrating multiple queries into the same state vector could make the model "forget" responses from earlier turns, especially when they activate the same semantic space as the new query.

Figure 4 provides qualitative examples of the Drill-down model. Here the arrows indicate the predicted state vectors used to incorporate the queries. We show the top-3 regions of the target images that have the highest similarity scores with each state vector (illustrated in the same color). We observe that the model tends to group queries about entities that potentially coincide with each other. However, this can also lead to "forgetting" of earlier queries. For instance, in the first example, when aggregating the queries "child in a stroller" and "woman in a dress" in order, the model tends to focus on "woman" while forgetting information about "child", as "woman" and "child" potentially activate the same semantic subspace.

[Figure 5: Examples of real user queries and the top-1 images from Drill-down.]

We also evaluate our method with queries from crowdsourced human users via a multi-round interactive system adapted from [3]. Given a target image, a user is asked to search for it by providing descriptions of the image content. The system shows the top-5 retrieved images to the user at each turn as context, so that the user can improve the results by providing additional descriptions. This process is repeated until the image is found or it reaches 5 turns. We sample 80 random images from the test set and evaluate HRE, R-HRE and Drill-down on these images respectively. Each image is viewed by 3 different users. For each model, the best result on each image is selected across users to ensure high quality responses. As shown in Figure 6, most users successfully find the target image within 5 turns, demonstrating the effectiveness of the multi-round search paradigm and the quality of using region captions for training. In particular, Drill-down consistently outperforms HRE and R-HRE on all evaluation metrics. On the other hand, as real user queries have more flexible forms, e.g., longer sentences and repeated descriptions of the same region, we also observe smaller performance gaps between our method and the baselines. We believe further efforts, such as real query data collection, are needed to systematically close this domain gap. Figure 9 shows example real user queries and the retrieval sequences produced by Drill-down.

[Figure 6: Human subject evaluation of the HRE and R-HRE baselines and our Drill-down model.]

Conclusion

We present Drill-down, a framework that is efficient and effective for interactive retrieval of specific images of complex scenes. Our method explores in depth and addresses several challenges in multi-round retrieval with natural language queries, such as the compactness of query state representations and the need for region-aware features. It also demonstrates the effectiveness of training a retrieval model with region captions as queries for interactive image search under human evaluation.
Acknowledgements
We thank our anonymous reviewers for helpful feedback. This work was funded by a research grant and generous gift funding from SAP Research. We thank Tassilo Klein and Moin Nabi from SAP Research for their support.

Examples of Real User Queries
[Figure 7: Examples of real user queries collected in the human subject study and the top-3 retrieved images from the Drill-down model at each turn. The ranks of the target image (A) at each turn are 121, 32, 14, 1. The ranks of the target image (B) at each turn are 42, 9, 3, 1.]

[Figure 8: Examples of real user queries collected in the human subject study and the top-3 retrieved images from the Drill-down model at each turn. The ranks of the target image (A) at each turn are 51, 24, 11, 7, 3. The ranks of the target image (B) at each turn are 182, 23, 7, 2, 1.]

[Figure 9: Examples of real user queries collected in the human subject study and the top-3 retrieved images from the Drill-down model at each turn. The ranks of the target image (A) at each turn are 33, 10, 10, 1. The ranks of the target image (B) at each turn are 826, 62, 24, 4.]

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] Relja Arandjelovic and Andrew Zisserman. Multiple queries for large scale specific object retrieval. In British Machine Vision Conference (BMVC), pages 1–11, 2012.
[3] Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez, and Song Feng. Chat-crowd: A dialog-based platform for visual layout composition. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019.
[4] Ning-San Chang and King-Sun Fu. Query-by-pictorial-example. IEEE Trans. Softw. Eng., 6(6):519–524, November 1980.
[5] Ning-San Chang and King-Sun Fu. A relational database system for images. In Pictorial Information Systems, 1980.
[6] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[7] Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In IEEE International Conference on Computer Vision (ICCV), October 2017.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[9] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. 2018.
[10] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based interactive image retrieval. In Advances in Neural Information Processing Systems (NeurIPS), pages 676–686, 2018.
[11] Tanmay Gupta, Kevin J. Shih, Saurabh Singh, and Derek Hoiem. Aligned image-word representations improve inductive transfer across vision-language tasks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[12] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S Davis. Learning fashion compatibility with bidirectional LSTMs. In ACM Multimedia, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (NeurIPS), pages 1889–1897, 2014.
[15] Chloé Kiddon, Luke S. Zettlemoyer, and Yejin Choi. Globally coherent text generation with neural checklist models. In Empirical Methods in Natural Language Processing (EMNLP), 2016.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[17] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[18] Adriana Kovashka and Kristen Grauman. Attribute pivots for guiding relevance feedback in image search. In IEEE International Conference on Computer Vision (ICCV), December 2013.
[19] Adriana Kovashka, Devi Parikh, and Kristen Grauman. WhittleSearch: Interactive image search with relative attribute feedback. International Journal of Computer Vision (IJCV), 115(2):185–210, 2015.
[20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
[21] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In European Conference on Computer Vision (ECCV), 2018.
[22] Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-Seng Chua. Knowledge-aware multimodal dialogue systems. In ACM International Conference on Multimedia (ACM MM), pages 801–809, 2018.
[23] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, 2015.
[25] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Hierarchical multimodal LSTM for dense visual-semantic embedding. In IEEE International Conference on Computer Vision (ICCV), 2017.
[26] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE International Conference on Computer Vision (ICCV), pages 2641–2649, 2015.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[28] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 10(1):39–62, 1999.
[29] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI Conference on Artificial Intelligence (AAAI), pages 3776–3783, 2016.
[30] Behjat Siddiquie, Rogerio S Feris, and Larry S Davis. Image ranking and retrieval based on multi-attribute queries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 801–808, 2011.
[31] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In ACM International Conference on Information and Knowledge Management (CIKM), pages 553–562, 2015.
[32] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[33] Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, and David Forsyth. Learning type-aware embeddings for fashion compatibility. In European Conference on Computer Vision (ECCV), 2018.
[34] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013, 2016.
[35] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on Learning Representations (ICLR), 2015.
[36] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[37] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), volume 37, pages 2048–2057, 2015.