Image-to-Image Retrieval by Learning Similarity between Scene Graphs
Sangwoong Yoon, Woo Young Kang, Sungwook Jeon, SeongEun Lee, Changjin Han, Jonghun Park, Eun-Sol Kim
Seoul National University Robotics Lab, Kakao Brain, Seoul National University Information Management Lab
[email protected], [email protected], {wookee3,ryuha96,changjin9653,jonghun}@snu.ac.kr, [email protected]
* Work done during an internship at Kakao Brain.

Abstract
As a scene graph compactly summarizes the high-level content of an image in a structured and symbolic manner, the similarity between the scene graphs of two images reflects the relevance of their contents. Based on this idea, we propose a novel approach for image-to-image retrieval using scene graph similarity measured by graph neural networks. In our approach, graph neural networks are trained to predict a proxy image relevance measure, computed from human-annotated captions using a pre-trained sentence similarity model. We collect and publish a dataset of image relevance measured by human annotators to evaluate retrieval algorithms. On the collected dataset, our method agrees with the human perception of image similarity better than other competitive baselines.
Introduction
Image-to-image retrieval, the task of finding images similar to a query image in a database, is one of the fundamental problems in computer vision and the core technology of visual search engines. The application of image retrieval systems has been most successful in problems where each image has a clear representative object, such as landmark detection and instance-based retrieval (Gordo et al. 2016; Mohedano et al. 2016; Radenović, Tolias, and Chum 2016), or has explicit tag labels (Gong et al. 2014).

However, performing image retrieval with complex images that contain multiple objects and various relationships between them remains challenging for two reasons. First, deep convolutional neural networks (CNNs), on which most image retrieval methods rely heavily, tend to be overly sensitive to low-level and local visual features (Zheng, Yang, and Tian 2017; Zeiler and Fergus 2014; Chen et al. 2018). As shown in Figure 1, a nearest-neighbor search in the ResNet-152 penultimate-layer feature space returns images that are superficially similar but have completely different content. Second, there is no publicly available labeled data to train and evaluate image retrieval systems for complex images, partly because quantifying the similarity between images with multiple objects as label information is difficult. Furthermore, a similarity measure for such complex images is desired to reflect
the semantics of images, i.e., the context and relationships of entities in images.

Figure 1: Image retrieval examples from ResNet and IRSGS. ResNet retrieves images with superficial similarity, e.g., grayscale or vertical lines, while IRSGS successfully returns images with the correct context, such as playing tennis or skateboarding.

In this paper, we address these challenges and build an image retrieval system capable of finding semantically similar images to a query from a database of complex scene images. First of all, we propose a novel image retrieval framework,
Image Retrieval with Scene Graph Similarity (IRSGS), which retrieves images whose scene graphs are similar to the scene graph of the query. A scene graph represents an image as a set of objects, attributes, and relationships, summarizing the content of a complex image. Therefore, scene graph similarity can be an effective tool for measuring semantic similarity between images. IRSGS utilizes a graph neural network to compute the similarity between two scene graphs, making the retrieval more robust to confounding low-level features (Figure 1).

Also, we conduct a human experiment to collect human decisions on image similarity. In the experiment, annotators are given a query image along with two candidate images and asked to select which candidate image is more similar to the query. With 29 annotators, we collect more than 10,000 annotations over more than 1,700 image triplets. Thanks to the collected dataset, we can quantitatively evaluate the performance of image retrieval methods. Our dataset is available online (https://github.com/swyoon/aaai2021-scene-graph-img-retr).

However, it is costly to collect enough ground-truth annotations from humans to supervise an image retrieval algorithm for a large image dataset, because the number of pairwise relationships to be labeled grows in O(N^2) for N data points. Instead, we utilize human-annotated captions of images to define a proxy image similarity, inspired by Gordo and Larlus (2017), which used term frequencies of captions to measure image similarity. As a caption tends to cover the important objects, attributes, and relationships between objects in an image, the similarity between captions is likely to reflect the contextual similarity between two images. Also, obtaining captions is more feasible, as the number of required captions grows in O(N). We use a state-of-the-art sentence embedding method (Reimers and Gurevych 2019) to compute the similarity between captions. The computed similarity is used to train the graph neural network in IRSGS and to evaluate the retrieval results.

Tested on real-world complex scene images, IRSGS shows higher agreement with human judgment than other competitive baselines. The main contributions of this paper can be summarized as follows:
• We propose IRSGS, a novel image retrieval framework that utilizes the similarity between scene graphs computed from a graph neural network to retrieve semantically similar images;
• We collect more than 10,000 human annotations for semantic-based image retrieval and release the dataset to the public;
• We propose to train the proposed retrieval framework with a surrogate relevance measure obtained from image captions and a pre-trained language model;
• We empirically evaluate the proposed method and demonstrate its effectiveness over other baselines.

Related Work
Image Retrieval
Conventional image retrieval methods use visual feature representations, object categories, or text descriptions (Zheng, Yang, and Tian 2017; Babenko et al. 2014; Chen, Davis, and Lim 2019; Wei et al. 2016; Zhen et al. 2019; Gu et al. 2018; Vo et al. 2019; Gordo et al. 2017). The activations of intermediate CNN layers have been shown to be effective image representations for retrieval tasks. However, as shown in Figure 1, CNNs often fail to capture the semantic content of images and are confounded by low-level visual features.

Image retrieval methods that reflect more of the semantic content of images are investigated in Gordo and Larlus (2017) and Johnson et al. (2015). Gordo and Larlus (2017) used term frequencies in regional captions to supervise a CNN for image retrieval, but they did not utilize scene graphs. Johnson et al. (2015) proposed an algorithm for retrieving images given a scene graph query. However, their approach does not employ graph-to-graph comparison and is not scalable.

Scene Graphs

A scene graph (Johnson et al. 2015) represents the content of an image in the form of a graph whose nodes represent objects, their attributes, and the relationships between them. After a large-scale set of real-world scene graphs manually annotated by humans was published in the Visual Genome dataset (Krishna et al. 2017), a number of applications such as image captioning (Wu et al. 2017; Lu et al. 2018; Milewski, Moens, and Calixto 2020), visual question answering (Teney, Liu, and van den Hengel 2017), and image-grounded dialog (Das et al. 2017) have shown the effectiveness of scene graphs. Furthermore, various works, such as GQA (Hudson and Manning 2019), VRD (Lu et al. 2016), and VrR-VG (Liang et al. 2019), have provided human-annotated scene graph datasets. Also, recent studies (Yang et al. 2018; Xu et al. 2017; Li et al. 2017) have suggested methods to generate scene graphs automatically. A detailed discussion of scene graph generation is given in the Experimental Setup Section.

Graph Similarity Learning
Many algorithms have been proposed for the isomorphism test or the (sub-)graph matching task between two graphs. However, such methods are often not scalable to huge graphs or not applicable when node features are provided. Here, we review several state-of-the-art algorithms related to our application, image retrieval by graph matching. From the graph pooling perspective, we focus on two recent algorithms, the Graph Convolutional Network (GCN; Kipf and Welling (2016)) and the Graph Isomorphism Network (GIN; Xu et al. (2018)). GCN uses neural network-based spectral convolutions in the Fourier domain to perform the convolution operation on a graph. GIN uses injective aggregation and graph-level readout functions. The learned graph representations can then be used to compute the similarity of two graphs. Both networks transform a graph into a fixed-length vector, enabling distance computation between two graphs in the vector space. Other studies viewed graph similarity learning as an optimal transport problem (Solomon et al. 2016; Maretic et al. 2019; Alvarez-Melis and Jaakkola 2018; Xu, Luo, and Carin 2019; Xu et al. 2019; Titouan et al. 2019). In particular, in Gromov-Wasserstein Learning (GWL; Xu et al. (2019)), node embeddings are learned from associated node labels, so the method can reflect not only the graph structure but also node features at the same time. The Graph Matching Network (GMN; Li et al. (2019)) uses a cross-graph attention mechanism, which yields different node representations for different pairs of graphs.
Image Retrieval with Scene Graph Similarity
In this section, we describe our framework, Image Retrieval with Scene Graph Similarity (IRSGS). Given a query image, IRSGS first generates a query scene graph from the image and then retrieves images with a scene graph highly similar to the query scene graph. Figure 2 illustrates the retrieval process. The similarity between scene graphs is computed through a graph neural network trained with the surrogate relevance measure as a supervision signal.

Figure 2: An overview of IRSGS. Images I_1, I_2 are converted into vector representations φ(S_1), φ(S_2) through scene graph generation (SGG) and graph embedding. The graph embedding function is learned to minimize the mean squared error to the surrogate relevance, i.e., the similarity between captions. The bold red bidirectional arrows indicate trainable parts. For retrieval, the learned scene graph similarity function is used to rank relevant images.

Scene Graphs and Their Generation
Formally, a scene graph S = {O, A, R} of an image I is defined as a set of objects O, attributes of objects A, and relations on pairs of objects R. All objects, attributes, and relations are associated with a word label, for example, "car", "red", and "in front of". We represent a scene graph as a set of nodes and edges, i.e., in the form of a conventional graph. All objects, attributes, and relations are treated as nodes, and associations among them are represented as undirected edges. Word labels are converted into 300-dimensional GloVe vectors (Pennington, Socher, and Manning 2014) and treated as node features.

Generating a scene graph from an image is equivalent to detecting objects, attributes, and relationships in the image. We employ a recently proposed method (Anderson et al. 2018) in our IRSGS framework to generate scene graphs. While end-to-end training of the scene graph generation module is possible in principle, a fixed pre-trained algorithm is used in our experiments to reduce the computational burden. We provide details of our generation process in the Experimental Setup Section. Note that IRSGS is compatible with any scene graph generation algorithm and is not bound to the specific one used in this paper.

Retrieval via Scene Graph Similarity
Given a query image I_q, an image retrieval system ranks candidate images {I_i}_{i=1}^N according to their similarity to the query image, sim(I_i, I_q). IRSGS casts this image retrieval task into a graph retrieval problem by defining the similarity between images as the similarity between the corresponding scene graphs. Formally,

    sim(I_i, I_j) = f(S_i, S_j),    (1)

where S_i and S_j are the scene graphs of I_i and I_j, respectively. We shall refer to f(S_i, S_j) as the scene graph similarity.

We compute the scene graph similarity from the inner product of two representation vectors of scene graphs. Given a scene graph, a graph neural network is applied, and the resulting node representations are pooled to generate a unit d-dimensional vector φ = φ(S) ∈ R^d. The scene graph similarity is then given as follows:

    f(S_1, S_2) = φ(S_1)^⊤ φ(S_2).    (2)

We construct φ by computing the forward pass of a graph neural network to obtain node representations and then applying average pooling. We implement φ with either GCN or GIN, yielding two versions, IRSGS-GCN and IRSGS-GIN, respectively.
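To make Eq. (2) concrete, the following is a minimal PyTorch sketch of the pooling-and-inner-product computation; the node representation tensors are assumed to come from a GNN such as the one sketched later, and all variable names are illustrative rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def scene_graph_similarity(node_emb_1: torch.Tensor, node_emb_2: torch.Tensor) -> torch.Tensor:
    # node_emb_i: (num_nodes_i, d) node representations produced by a GNN (GCN or GIN).
    phi_1 = F.normalize(node_emb_1.mean(dim=0), dim=0)  # phi(S_1): mean-pooled, unit-norm
    phi_2 = F.normalize(node_emb_2.mean(dim=0), dim=0)  # phi(S_2)
    return phi_1 @ phi_2                                # f(S_1, S_2) = phi(S_1)^T phi(S_2)
```

Because each φ(S) is a fixed-length vector, candidate embeddings can be precomputed and compared to a query with a single matrix-vector product at retrieval time.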
Learning to Predict Surrogate Relevance

We define the surrogate relevance measure between two images as the similarity between their captions. Let c_i and c_j be the captions of images I_i and I_j. To compute the similarity between the captions, we first apply Sentence-BERT (SBERT; Reimers and Gurevych (2019)) and project the output onto the surface of the unit sphere to obtain representation vectors ψ(c_i) and ψ(c_j). (We use the code and the pre-trained model (bert-large-nli-mean-tokens) provided at https://github.com/UKPLab/sentence-transformers.) The surrogate relevance measure s(c_i, c_j) is then given by their inner product: s(c_i, c_j) = ψ(c_i)^⊤ ψ(c_j). When there is more than one caption per image, we compute the surrogate relevance of all caption pairs and take the average. With the surrogate relevance, we are able to compute a proxy score for any pair of images in the training set, given their human-annotated captions. To validate the proposed surrogate relevance measure, we collect human judgments of semantic similarity between images by conducting a human experiment (details in the Human Annotation Collection Section).

We train the scene graph similarity f by directly minimizing the mean squared error to the surrogate relevance measure, formulating the learning as a regression problem. The loss function for the i-th and j-th images is given as L_ij = ||f(S_i, S_j) − s(c_i, c_j)||^2. Other losses, such as the triplet loss or the contrastive loss, could be employed as well. However, we did not find clear performance gains with those losses and therefore adhere to the simplest solution.
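A minimal sketch of how the surrogate relevance could be computed with the sentence-transformers package and the pre-trained model named above; the function and variable names are ours and only illustrate the description in the text.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

sbert = SentenceTransformer("bert-large-nli-mean-tokens")  # pre-trained model cited in the text

def surrogate_relevance(captions_i, captions_j):
    """s(c_i, c_j) averaged over all caption pairs of two images."""
    emb_i = sbert.encode(captions_i)                              # (n_i, dim) caption embeddings
    emb_j = sbert.encode(captions_j)                              # (n_j, dim)
    emb_i = emb_i / np.linalg.norm(emb_i, axis=1, keepdims=True)  # project onto the unit sphere
    emb_j = emb_j / np.linalg.norm(emb_j, axis=1, keepdims=True)
    return float((emb_i @ emb_j.T).mean())                        # average pairwise inner product
```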
Human Annotation Collection

We collect semantic similarity annotations from humans to validate the proposed surrogate relevance measure and to evaluate image retrieval methods. Through our web-based annotation system, a human labeler is asked whether two candidate images are semantically similar to a given query image. The labeler may choose one of four answers: either of the two candidate images is more similar than the other, the images in the triplet are semantically identical, or neither of the candidate images is relevant to the query. We collect 10,712 human annotations from 29 human labelers for 1,752 image triplets constructed from the test set of VG-COCO, the dataset we shall define in the Experimental Setup Section.

The query image of a triplet is randomly selected from the query set defined in the following section. Two candidate images are randomly selected from the rest of the test set, subject to two constraints. First, the rank of a candidate image should be less than or equal to 100 when the whole test set is sorted according to the cosine similarity of the ResNet-152 representation to the query image. Second, the surrogate relevance of one query-candidate pair in a triplet should be larger than that of the other, and the difference should be greater than 0.1. This selection criterion produces visually close yet semantically different image triplets.

We define the human agreement score to measure the agreement between the decisions of an algorithm and those of the human annotators, in a manner similar to Gordo and Larlus (2017). The score is the average portion of human annotators who made the same decision per triplet. Formally, given a triplet, let s_1 (or s_2) be the number of human annotators who chose the first (or the second) candidate image as more semantically similar to the query, s_3 be the number of annotators who answered that all three images are identical, and s_4 be the number of annotators who marked the candidates as irrelevant. If an algorithm chooses one of the candidate images as more relevant, the human agreement score for the triplet is (s_i + 0.5 s_3) / (s_1 + s_2 + s_3 + s_4), where i = 1 if the algorithm determines that the first image is semantically closer and i = 2 otherwise. The score is averaged over triplets with a sufficiently large s_1 + s_2. Randomly selecting one of the two candidate images produces an average human agreement of 0.472 with a standard deviation of 0.01. Note that the agreement of the random decision is lower than 0.5 due to the existence of the human choices "both" (s_3) and "neither" (s_4).

The alignment between labelers is also measured with the human agreement score in a leave-one-out fashion. If a labeler answers that both candidate images are relevant, the score for the triplet is (0.5 s_1 + 0.5 s_2 + s_3) / (s_1 + s_2 + s_3 + s_4), where s_1, ..., s_4 are computed from the rest of the annotators. If a labeler marks that neither of the candidates is relevant for a triplet, the triplet is not counted in the human agreement score. The mean human agreement score among the annotators is 0.727, and the standard deviation is 0.05. We will make the human annotation dataset public after the review.
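The extracted formula above lost its subscripts; assuming the natural reading with counts s_1, s_2 (first/second candidate), s_3 (identical), and s_4 (irrelevant), the per-triplet score of an algorithm could be computed as in the following sketch.

```python
def human_agreement(choice: int, s1: int, s2: int, s3: int, s4: int) -> float:
    """Per-triplet agreement between an algorithm and human annotators.

    choice: 1 or 2, the candidate the algorithm judged more relevant.
    s1, s2: annotators preferring candidate 1 / candidate 2.
    s3: annotators answering that all three images are semantically identical.
    s4: annotators answering that neither candidate is relevant.
    (Subscripts are assumed; the extracted formula dropped them.)
    """
    chosen = s1 if choice == 1 else s2
    return (chosen + 0.5 * s3) / (s1 + s2 + s3 + s4)
```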
Experimental Setup

Data
In experiments, we use two image datasets involving diverse semantics. The first dataset is the intersection of Visual Genome (Krishna et al. 2017) and MS-COCO (Lin et al. 2014), which we refer to as VG-COCO. In VG-COCO, each image has a scene graph annotation provided by Visual Genome and five captions provided by MS-COCO. We utilize the refined version of the scene graphs provided by Xu et al. (2017) and their train-test split. After removing images with empty scene graphs, we obtain 35,017 fully annotated training images and 13,203 test images. We randomly select a fixed set of 1,000 images from the test set and define them as the query set. For each query image, a retrieval algorithm is asked to rank the other 13,202 images in the test set according to semantic similarity. Besides the annotated scene graphs, we automatically generate scene graphs for all images and experiment with our approach on both human-labeled and machine-generated scene graphs.

The second dataset is Flickr30K (Plummer et al. 2017), where five captions are provided per image. Flickr30K contains 30,000 training images, 1,000 validation images, and 1,000 test images. For Flickr30K, the whole test set is the query set. During evaluation, an algorithm ranks the other 999 images given a query image in the test set. Scene graphs are generated in the same manner as for the VG-COCO dataset.
Scene Graph Generation Detail
Since we focus on learning graph embeddings given two scene graphs for the image-to-image retrieval task, we use a conventional scene graph generation process. Following Anderson et al. (2018), objects in images are detected with the Faster R-CNN method, and the names and attributes of the objects are predicted based on ResNet-101 features from the detected bounding boxes. We keep up to 100 objects with a confidence threshold of 0.3. To predict relation labels between objects after extracting information about the objects, we use the frequency prior knowledge constructed from the GQA dataset, which covers 309 kinds of relations. For each pair of detected objects, relationships are predicted based on the frequency prior with a confidence threshold of 0.2. To give position-specific information, the coordinates of the detected bounding boxes are used. We also experimented with predicting relation labels using recently proposed SGG algorithms (Yang et al. 2018; Xu et al. 2017; Li et al. 2017); even though the frequency-prior method described here is much simpler, it outperformed all of them in our experiments.

Table 1: Image retrieval results on VG-COCO with human-annotated scene graphs (nDCG at 5, 10, 20, 30, 40, and 50 and human agreement; inter-human agreement is 0.730). The Data column indicates which data modalities are used. Cap(HA): human-annotated captions. Cap(Gen): machine-generated captions. I: image. SG: scene graphs.
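Returning to the relation-labeling step described above, a rough sketch of frequency-prior relation prediction is given below; the freq_prior lookup table and all names are hypothetical stand-ins for whatever data structure actually stores the GQA statistics.

```python
def predict_relations(objects, freq_prior, conf_threshold=0.2):
    """objects: list of (class_label, bbox) pairs from the object detector.

    freq_prior: hypothetical table mapping (subject_class, object_class) to a
    probability distribution over the 309 GQA relation labels.
    """
    relations = []
    for i, (cls_i, _box_i) in enumerate(objects):
        for j, (cls_j, _box_j) in enumerate(objects):
            if i == j:
                continue
            dist = freq_prior.get((cls_i, cls_j), {})
            if not dist:
                continue
            rel, conf = max(dist.items(), key=lambda kv: kv[1])  # most frequent relation
            if conf >= conf_threshold:                           # keep only confident relations
                relations.append((i, rel, j))
    return relations
```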
Two-Step Retrieval using Visual Features
In information retrieval, it is common practice to take a two-step approach (Wang et al. 2019; Bai and Bai 2016): first retrieving roughly relevant items and then sorting (or "re-ranking") the retrieved items according to relevance. We also employ this approach in our experiments. For a query image, we first retrieve the K images that are closest to the query in the ResNet-152 feature representation space formed by the 2048-dimensional activation vector of the last hidden layer. The distance is measured by cosine similarity. This procedure generates a set of good candidate images which have a high probability of strong semantic similarity to the query. This approximate retrieval step can be further accelerated by an approximate nearest neighbor engine such as Faiss (Johnson, Douze, and Jégou 2017) and is critical if the subsequent re-ranking step is computationally involved. We use this approximate pre-ranking with K = 100 for all experiments unless otherwise mentioned. Although there is large flexibility in designing this step, we leave other possibilities for future exploration, as the re-ranking step is our focus.
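A minimal sketch of such a pre-ranking stage using Faiss over L2-normalized ResNet-152 features. An exact inner-product index is used here for clarity; an approximate index could be substituted, and the function names are illustrative.

```python
import faiss        # pip install faiss-cpu
import numpy as np

def build_index(resnet_feats: np.ndarray) -> faiss.IndexFlatIP:
    """resnet_feats: (N, 2048) ResNet-152 last-hidden-layer features of candidate images."""
    feats = resnet_feats / np.linalg.norm(resnet_feats, axis=1, keepdims=True)
    index = faiss.IndexFlatIP(feats.shape[1])   # inner product == cosine after L2 normalization
    index.add(feats.astype(np.float32))
    return index

def candidate_images(index: faiss.IndexFlatIP, query_feat: np.ndarray, k: int = 100):
    q = (query_feat / np.linalg.norm(query_feat)).astype(np.float32)[None, :]
    scores, ids = index.search(q, k)            # top-K pre-ranking before re-ranking
    return ids[0], scores[0]
```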
Training Details

We use the Adam optimizer with an initial learning rate of 0.0001 and multiply the learning rate by 0.9 every epoch. We set the batch size to 32, and models are trained for 25 epochs. In each training step, a mini-batch of pairs is formed by randomly drawing samples. When drawing the second sample of a pair, we employ an oversampling scheme to reinforce the learning of pairs with large similarity values. With a probability of 0.5, the second sample of a pair is drawn from the 100 samples with the largest surrogate relevance score to the first sample; otherwise, we select the second sample from the whole training set. Oversampling improves both quantitative and qualitative results and is applied identically for all methods except for GWL, where the scheme is not applicable.

Table 2: Image retrieval results on VG-COCO with machine-generated scene graphs. Baselines that do not use scene graphs are identical to the corresponding rows of Table 1.
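A sketch of the pair-drawing scheme with oversampling described in Training Details, assuming a precomputed relevance_ranking table holding the 100 most relevant training images per anchor (all names are illustrative).

```python
import random

def draw_pair(anchor: int, relevance_ranking: dict, num_train: int, p_hard: float = 0.5) -> int:
    """Draw the second element of a training pair for anchor image `anchor`.

    relevance_ranking[anchor]: list of the 100 training images with the largest
    surrogate relevance to `anchor` (assumed to be precomputed).
    """
    if random.random() < p_hard:
        return random.choice(relevance_ranking[anchor])  # oversample highly relevant pairs
    return random.randrange(num_train)                   # otherwise sample uniformly
```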
Experiments
Evaluation
We benchmark IRSGS and other baselines on VG-COCO and Flickr30K. Images in the query set are presented as queries, and the relevance of the images ranked by an image retrieval algorithm is evaluated with two metrics. First, we compute the normalized discounted cumulative gain (nDCG) with the surrogate relevance as the gain. A larger nDCG value indicates a stronger enrichment of relevant images in the retrieval result. In the nDCG computation, the surrogate relevance is clipped at zero to ensure its positivity. Second, the agreement between a retrieval algorithm and the decisions of human annotators is measured as described in the Human Annotation Collection Section.
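For reference, nDCG with the zero-clipped surrogate relevance as gain can be computed as in the following sketch (our formulation of the standard metric, not the authors' evaluation script).

```python
import numpy as np

def ndcg_at_k(retrieved_relevance, all_relevance, k: int) -> float:
    """nDCG@k with zero-clipped surrogate relevance as the gain.

    retrieved_relevance: relevance of the images returned by the algorithm, in rank order.
    all_relevance: relevance of every candidate image (used for the ideal ordering).
    """
    gains = np.clip(np.asarray(retrieved_relevance, dtype=float)[:k], 0.0, None)
    ideal = np.sort(np.clip(np.asarray(all_relevance, dtype=float), 0.0, None))[::-1][:k]

    def dcg(g):
        return float(np.sum(g / np.log2(np.arange(2, len(g) + 2))))  # rank i discounted by log2(i+1)

    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0
```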
Baseline Methods

ResNet-152 Features
Image retrieval is performed based on the cosine similarity of the last hidden representation of ResNet-152 pre-trained on ImageNet.
Generated Caption
To test whether machine-generated captions can be an effective means for semantic image retrieval, we generate captions of images with the soft attention model (Xu et al. 2015) pretrained on the Flickr30K dataset (Plummer et al. 2017). We obtain SBERT representations of the generated captions, and their cosine similarity is used to perform image retrieval.
Object Count (OC)
Ignoring the relation information given in a scene graph, we transform a scene graph into a vector of object counts. Then, we compute the cosine similarity of the object count vectors to perform image retrieval.
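A minimal sketch of this baseline, assuming object word labels and a fixed object vocabulary (names are illustrative).

```python
from collections import Counter
import numpy as np

def object_count_similarity(objects_1, objects_2, vocab):
    """Cosine similarity between the object-count vectors of two scene graphs."""
    c1, c2 = Counter(objects_1), Counter(objects_2)
    v1 = np.array([c1[w] for w in vocab], dtype=float)
    v2 = np.array([c2[w] for w in vocab], dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom > 0 else 0.0
```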
ResNet Finetune (ResNet-FT)
We test whether ResNet-152 can be fine-tuned to capture semantic similarity. Similarly to the Siamese network (Bromley et al. 1994), the ResNet feature extractor is trained so that the cosine similarity between images is close to their surrogate relevance measure.
Gromov-Wasserstein Learning (GWL)
Based on the Gromov-Wasserstein Learning (GWL) framework (Xu et al. 2019), we obtain a transport map using a proximal gradient method (Xie et al. 2018). A transport cost, the sum of the Gromov-Wasserstein discrepancy and the Wasserstein discrepancy, is calculated from the transport map and the cost matrix and is used for retrieval. The method is computationally demanding, and we only test it in the VG-COCO with generated scene graphs setting in Table 2.
Graph Matching Networks (GMN)
GMNs are implemented based on the publicly available code (https://github.com/deepmind/deepmind-research/tree/master/graph_matching_networks). We use four propagation layers with shared weights. Propagation in the reverse direction is allowed, and the propagated representation is updated using a gated recurrent unit. Final node representations are aggregated by summation, resulting in a 128-dimensional vector, which is then fed to a multi-layer perceptron to produce the final scalar output. As GMN is capable of handling edge features, we leave relations as edges instead of transforming them into nodes. To indicate object-attribute connections, we append an additional dimension to the edge feature vectors and define the feature vector of an edge between an object and an attribute as a one-hot vector in which only the last dimension is non-zero.

Graph Embedding Methods in IRSGS
Here, we describe implementation details of graph neuralnetworks used in IRSGS.
IRSGS-GCN
A GCN is applied to the scene graph, and the final node representations are aggregated via mean pooling and scaled to the unit norm, yielding a representation vector φ(S). We use three graph convolution layers with 300 hidden neurons in each layer. The first two layers are followed by a ReLU nonlinearity. Stacking more layers does not introduce a clear improvement. We always symmetrize the adjacency matrix before applying the GCN.

Method             nDCG@5   nDCG@10   nDCG@20   nDCG@40
Captions SBERT     1        1         1         1
Random             0.195    0.209     0.223     0.245
Gen. Cap. SBERT    0.556    0.576     0.610     0.659
ResNet             0.539    0.541     0.541     0.542
ResNet-FT          0.368    0.393     0.433     0.502
Object Count       0.511    0.530     0.560     0.615
IRSGS-GIN          0.564    0.584     0.618
IRSGS-GCN

Table 3: Image retrieval results on Flickr30K with machine-generated scene graphs.
IRSGS-GIN
Similarly to the GCN case, we stack three GIN convolution layers with 300 hidden neurons in each layer. For the multi-layer perceptron required in each layer, we use one hidden layer with 512 neurons and a ReLU nonlinearity. Other details are the same as in the GCN case.
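A sketch of the IRSGS-GCN graph embedding described above, written with PyTorch Geometric for illustration; the authors' implementation may differ, but the layer sizes, pooling, and normalization follow the text.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class SceneGraphEmbedding(torch.nn.Module):
    """Three-layer GCN producing a unit-norm graph embedding phi(S), as in IRSGS-GCN."""

    def __init__(self, in_dim: int = 300, hidden_dim: int = 300):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, batch):
        # x: (num_nodes, 300) GloVe node features; edge_index: symmetrized adjacency.
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        h = self.conv3(h, edge_index)          # last layer without nonlinearity
        phi = global_mean_pool(h, batch)       # mean pooling over nodes per graph
        return F.normalize(phi, dim=-1)        # scale to unit norm
```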
Quantitative Results
From Table 1, Table 2, and Table 3, IRSGS shows higher nDCG scores than the baselines across datasets (VG-COCO and Flickr30K) and methods of obtaining scene graphs (human-annotated and machine-generated). IRSGS also achieves the best agreement with human annotators' perception of semantic similarity, as can be seen from Table 1 and Table 2.

Comparing Table 1 and Table 2, we find that using machine-generated scene graphs instead of human-annotated ones does not deteriorate the retrieval performance. This result shows that IRSGS does not need human-annotated scene graphs to perform successful retrieval and can be applied to a dataset without scene graph annotation. In fact, Flickr30K is such a dataset, and IRSGS still achieves excellent retrieval performance on Flickr30K with machine-generated scene graphs.

On the other hand, using machine-generated captions for retrieval results in significantly lower nDCG and human agreement scores. Unlike human-annotated captions, machine-generated captions are crude in quality and tend to miss important details of an image. We suspect that scene graph generation is more stable than caption generation since it can be done in a systematic manner, i.e., by predicting objects, attributes, and relations sequentially.

While not showing the best performance, GWL and GMN are also competitive compared to the methods based on generated captions and ResNet. This overall competence of graph-based methods is interesting and implies the effectiveness of scene graphs in capturing semantic similarity between images.

Note that in Caption SBERT, retrieval is performed with the surrogate relevance itself, and its human agreement score indicates the agreement between the surrogate relevance and human annotations. With a higher human agreement score than any other algorithm, this result assures that the proposed surrogate relevance reflects the human perception of semantic similarity well.

Figure 3: Four most similar images retrieved by six algorithms. OC: Object Count, GIN: IRSGS-GIN, GCN: IRSGS-GCN. The Visual Genome ids for the query images are 2323522 and 2316427.
Qualitative Results
Figure 1 and Figure 3 show example images retrieved by the methods we test. Pitfalls of the baseline methods that are not based on scene graphs can be noted. As mentioned in the Introduction, retrieval with ResNet features often neglects the semantics and focuses on the superficial visual characteristics of images. On the contrary, OC only accounts for the presence of objects, yielding images with a misleading context. For example, in the left panel of Figure 3, OC simply returns images with many windows. IRSGS retrieves images containing similar objects with similar relations to the query image, for example, an airplane on the ground or a person riding a horse.
Discussion
Ablation Study
We also perform an ablation experiment to assess the effectiveness of each scene graph component (Table 4). In this experiment, we either ignore attributes or randomize the relation information in the IRSGS-GCN framework. In both cases, the nDCG and human agreement scores are higher than those of Object Count, which uses only object information. This indicates that both attribute and relation information are useful for improving the image retrieval performance of the graph matching-based algorithm. Further, randomizing relations drops performance more than ignoring attribute information, which suggests that relations are important for capturing the human perception of semantic similarity.
Comparison to Johnson et al. (2015)
We exclude Johnson et al. (2015) from our experiments because the CRF-based algorithm of Johnson et al. (2015) is not feasible for a large-scale image retrieval problem. One of our goals is to tackle a large-scale retrieval problem where a query is compared against more than ten thousand images. Thus, we mainly consider methods that generate a compact vector representation of an image or a scene graph (Eq. (2)). However, the method in Johnson et al. (2015) requires object detection results to be additionally stored and extra computation to be done for all query-candidate pairs in the retrieval phase. Note that Johnson et al. (2015) only tested their algorithm on 1,000 test images, while we benchmark algorithms using 13,203 candidate images.
Method             nDCG@5   nDCG@10   nDCG@20   nDCG@40   Human Agreement
IRSGS-GCN          0.771    0.784     0.805     0.836     0.611
No Attribute       0.767    0.782     0.803     0.834     0.606
Random Relation    0.764    0.777     0.797     0.828     0.604
Object Count       0.730    0.743     0.761     0.794     0.581

Table 4: Scene graph component ablation experiment results on VG-COCO. Machine-generated scene graphs are used.
Effectiveness of Mean Pooling and Inner Product
One possible explanation for the competitive performance of IRSGS-GCN and IRSGS-GIN is that mean pooling and the inner product are particularly effective in capturing the similarity between two sets. Given two sets of node representations {a_1, ..., a_N} and {b_1, ..., b_M}, the inner product of their means is given as Σ_{i,j} a_i^⊤ b_j / (NM), i.e., the sum of the inner products between all pairs divided by NM. This expression is proportional to the number of common elements in the two sets, especially when a_i^⊤ b_j is 1 if a_i = b_j and 0 otherwise, thus measuring the similarity between the two sets. If the inner product values are not binary, the expression measures the set similarity in a "soft" way.
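The identity can be checked numerically; the following short NumPy snippet verifies that the inner product of the mean-pooled embeddings equals the average of all pairwise inner products (random matrices are used purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 300))    # node representations {a_1, ..., a_N}
B = rng.normal(size=(7, 300))    # node representations {b_1, ..., b_M}

lhs = A.mean(axis=0) @ B.mean(axis=0)   # inner product of mean-pooled embeddings
rhs = (A @ B.T).mean()                  # sum of all pairwise inner products divided by N*M
assert np.allclose(lhs, rhs)            # the two quantities are identical
```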
Conclusion

In this paper, we tackle the image retrieval problem for complex scene images where multiple objects are present in various contexts. We propose IRSGS, a novel image retrieval framework which leverages scene graph generation and a graph neural network to capture the semantic similarity between complex images. IRSGS is trained to approximate the surrogate relevance measure, which we define as the similarity between captions. By collecting real human annotations, we show that both the surrogate relevance and IRSGS show high agreement with the human perception of semantic similarity. Our results show that an effective image retrieval system can be built by using scene graphs with graph neural networks. As both scene graph generation and graph neural networks are rapidly advancing techniques, we believe that the proposed approach is a promising research direction to pursue.

Acknowledgements
Sangwoong Yoon is partly supported by the National Research Foundation of Korea Grant (NRF/MSIT 2017R1E1A1A03070945) and MSIT-IITP (No. 2019-0-01367, BabyMind).
References
Alvarez-Melis, D.; and Jaakkola, T. 2018. Gromov-Wasserstein Alignment of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1881–1890.
Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision, 382–398. Springer.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6077–6086.
Babenko, A.; Slesarev, A.; Chigorin, A.; and Lempitsky, V. 2014. Neural Codes for Image Retrieval. In European Conference on Computer Vision, 584–599. Springer.
Bai, S.; and Bai, X. 2016. Sparse Contextual Activation for Efficient Visual Re-ranking. IEEE Transactions on Image Processing.
Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1994. Signature Verification Using a "Siamese" Time Delay Neural Network. In Advances in Neural Information Processing Systems, 737–744.
Chen, B.-C.; Davis, L. S.; and Lim, S.-N. 2019. An Analysis of Object Embeddings for Image Retrieval. arXiv preprint arXiv:1905.11903.
Chen, X.; Li, L.-J.; Fei-Fei, L.; and Gupta, A. 2018. Iterative Visual Reasoning Beyond Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7239–7248.
Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J. M.; Parikh, D.; and Batra, D. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 326–335.
Gong, Y.; Ke, Q.; Isard, M.; and Lazebnik, S. 2014. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics. International Journal of Computer Vision.
Gordo, A.; Almazán, J.; Revaud, J.; and Larlus, D. 2016. Deep Image Retrieval: Learning Global Representations for Image Search. In European Conference on Computer Vision, 241–257. Springer.
Gordo, A.; Almazán, J.; Revaud, J.; and Larlus, D. 2017. End-to-End Learning of Deep Visual Representations for Image Retrieval. International Journal of Computer Vision.
Gordo, A.; and Larlus, D. 2017. Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6589–6598.
Gu, J.; Cai, J.; Joty, S.; Niu, L.; and Wang, G. 2018. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. In CVPR.
Hudson, D. A.; and Manning, C. D. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
Johnson, J.; Douze, M.; and Jégou, H. 2017. Billion-Scale Similarity Search with GPUs. arXiv preprint arXiv:1702.08734.
Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.; Bernstein, M.; and Fei-Fei, L. 2015. Image Retrieval Using Scene Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3668–3678.
Kipf, T. N.; and Welling, M. 2016. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision.
Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; and Kohli, P. 2019. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. arXiv preprint arXiv:1904.12787.
Li, Y.; Ouyang, W.; Zhou, B.; Wang, K.; and Wang, X. 2017. Scene Graph Generation from Objects, Phrases and Region Captions. In Proceedings of the IEEE International Conference on Computer Vision, 1261–1270.
Liang, Y.; Bai, Y.; Zhang, W.; Qian, X.; Zhu, L.; and Mei, T. 2019. VrR-VG: Refocusing Visually-Relevant Relationships. In Proceedings of the IEEE International Conference on Computer Vision, 10403–10412.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, 740–755. Springer.
Lu, C.; Krishna, R.; Bernstein, M.; and Fei-Fei, L. 2016. Visual Relationship Detection with Language Priors. In European Conference on Computer Vision, 852–869. Springer.
Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7219–7228.
Maretic, H. P.; Gheche, M. E.; Chierchia, G.; and Frossard, P. 2019. GOT: An Optimal Transport Framework for Graph Comparison. arXiv preprint arXiv:1906.02085.
Milewski, V.; Moens, M.-F.; and Calixto, I. 2020. Are Scene Graphs Good Enough to Improve Image Captioning? arXiv preprint arXiv:2009.12313.
Mohedano, E.; McGuinness, K.; O'Connor, N. E.; Salvador, A.; Marques, F.; and Giro-i-Nieto, X. 2016. Bags of Local Convolutional Features for Scalable Instance Search. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 327–331.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2017. Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. International Journal of Computer Vision.
Radenović, F.; Tolias, G.; and Chum, O. 2016. CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. In European Conference on Computer Vision, 3–20. Springer.
Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
Solomon, J.; Peyré, G.; Kim, V. G.; and Sra, S. 2016. Entropic Metric Alignment for Correspondence Problems. ACM Transactions on Graphics (TOG).
Teney, D.; Liu, L.; and van den Hengel, A. 2017. Graph-Structured Representations for Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Titouan, V.; Courty, N.; Tavenard, R.; Laetitia, C.; and Flamary, R. 2019. Optimal Transport for Structured Data with Application on Graphs. In International Conference on Machine Learning, 6275–6284.
Vo, N.; Jiang, L.; Sun, C.; Murphy, K.; Li, L.-J.; Fei-Fei, L.; and Hays, J. 2019. Composing Text and Image for Image Retrieval - An Empirical Odyssey. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6439–6448.
Wang, L.; Qian, X.; Zhang, Y.; Shen, J.; and Cao, X. 2019. Enhancing Sketch-Based Image Retrieval by CNN Semantic Re-ranking. IEEE Transactions on Cybernetics.
Wei, Y.; Zhao, Y.; Lu, C.; Wei, S.; Liu, L.; Zhu, Z.; and Yan, S. 2016. Cross-Modal Retrieval with CNN Visual Features: A New Baseline. IEEE Transactions on Cybernetics.
Wu, Q.; Shen, C.; Wang, P.; Dick, A.; and van den Hengel, A. 2017. Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Xie, Y.; Wang, X.; Wang, R.; and Zha, H. 2018. A Fast Proximal Point Method for Computing Exact Wasserstein Distance. arXiv preprint arXiv:1802.04307.
Xu, D.; Zhu, Y.; Choy, C. B.; and Fei-Fei, L. 2017. Scene Graph Generation by Iterative Message Passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5410–5419.
Xu, H.; Luo, D.; and Carin, L. 2019. Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching. arXiv preprint arXiv:1905.07645.
Xu, H.; Luo, D.; Zha, H.; and Carin, L. 2019. Gromov-Wasserstein Learning for Graph Matching and Node Embedding. arXiv preprint arXiv:1901.06003.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning, 2048–2057.
Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018. How Powerful Are Graph Neural Networks? arXiv preprint arXiv:1810.00826.
Yang, J.; Lu, J.; Lee, S.; Batra, D.; and Parikh, D. 2018. Graph R-CNN for Scene Graph Generation. In Proceedings of the European Conference on Computer Vision (ECCV), 670–685.
Zeiler, M. D.; and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision, 818–833. Springer.
Zhen, L.; Hu, P.; Wang, X.; and Peng, D. 2019. Deep Supervised Cross-Modal Retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zheng, L.; Yang, Y.; and Tian, Q. 2017. SIFT Meets CNN: A Decade Survey of Instance Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Appendix
Computational Property

IRSGS is scalable in terms of both computing time and memory, adding only marginal overhead over a conventional image retrieval system. For the candidate images in a database, their graph embeddings and ResNet features are pre-computed and stored. Generating a scene graph for a query image is mainly based on object detection, which can be run almost in real time. Searching over the database is essentially a nearest neighbor search, which is fast for databases of the moderate size considered here.

Two-Stage Retrieval
The initial retrieval using ResNet is beneficial in two aspects: retrieval quality and speed. ResNet-based retrieval indeed introduces a bias, but in a good way; the ResNet-based stage increases human agreement for all retrieval methods, possibly by excluding visually irrelevant images. Some baselines, such as graph matching networks, are not computationally feasible without the initial retrieval. However, IRSGS is computationally feasible without ResNet-based retrieval because the representations of images can be pre-computed and indexed. We empirically found that K = 100 offers a good trade-off between computational cost and performance.
Comparison to SPICE
We initially excluded SPICE (Anderson et al. 2016) from the experiments not because of its computational properties but because of the exact matching mechanism that SPICE is based on. By definition, SPICE would consider two semantically similar yet distinct words as different. Meanwhile, IRSGS is able to match similar words since it utilizes continuous word embeddings. Still, SPICE can be an interesting baseline, and we will consider adding it for comparison.
Full Resolution Figures
Here, we provide the figures presented in the main manuscript in their full scale.

Figure 5: An overview of IRSGS. Images I_1, I_2 are converted into vector representations φ(S_1), φ(S_2).