Referring Relationships
Ranjay Krishna†, Ines Chami†, Michael Bernstein, Li Fei-Fei
Stanford University
{ranjaykrishna, chami, msb, feifeili}@cs.stanford.edu

Abstract
Images are not simply sets of objects: each image represents a web of interconnected relationships. These relationships between entities carry semantic meaning and help a viewer differentiate between instances of an entity. For example, in an image of a soccer match, there may be multiple persons present, but each participates in different relationships: one is kicking the ball, and the other is guarding the goal. In this paper, we formulate the task of utilizing these "referring relationships" to disambiguate between entities of the same category. We introduce an iterative model that localizes the two entities in the referring relationship, conditioned on one another. We formulate the cyclic condition between the entities in a relationship by modelling predicates that connect the entities as shifts in attention from one entity to another. We demonstrate that our model can not only outperform existing approaches on three datasets (CLEVR, VRD and Visual Genome) but also that it produces visually meaningful predicate shifts, as an instance of interpretable neural networks. Finally, we show that by modelling predicates as attention shifts, we can even localize entities in the absence of their category, allowing our model to find completely unseen categories.
1. Introduction
Referring expressions in everyday discourse help identify and locate entities in our surroundings. For instance, we might point to the "person kicking the ball" to differentiate from the "person guarding the goal" (Figure 1). In both these examples, we disambiguate between the two persons by their respective relationships with other entities [23]. While one person is kicking the ball, the other is guarding the goal. The eventual goal is to build computational models that can identify which entities others are referring to [34].

† = equal contribution. We use the term "entities" for what is commonly referred to as "objects", to differentiate from the term object in <subject - predicate - object> relationships.

Figure 1: Referring relationships disambiguate between instances of the same category by using their relative relationships with other entities. Given the relationship <person - kicking - ball>, the task requires our model to correctly identify which person in the image is kicking the ball by understanding the predicate kicking.

To enable such interactions, we introduce referring relationships: a task where, given a relationship, models should know which entities in a scene are being referred to by the relationship. Formally, the task expects an input image along with a relationship, which is of the form <subject - predicate - object>, and outputs localizations of both the subject and the object. For example, we can express the above examples as <person - kicking - ball> and <person - guarding - goal> (Figure 1). Previous work has attempted to disambiguate entities of the same category in the context of referring expression comprehension [28, 24, 41, 42, 11]. Their task expects a natural language input, such as "a person guarding the goal", resulting in evaluations that require both natural language as well as computer vision components.
It can be challenging to pinpoint whether errors made by these models come from the language or the visual components. By interfacing with a structured relationship input, our task is a special case of referring expressions that alleviates the need to model language.

Referring relationships retain and refine the algorithmic challenges at the core of prior tasks. In the object localization literature, some entities such as zebra and person are highly discriminative and can be easily detected, while others such as glass and ball tend to be harder to localize [29]. These difficulties arise due to, for example, small size and non-discriminative composition. This difference in difficulty translates over to the referring relationships task. To tackle this challenge, we use the intuition that detecting one entity becomes easier if we know where the other one is. In other words, we can find the ball conditioned on the person who is kicking it, and vice versa. We train this cyclic dependency by rolling out our model and iteratively passing messages between the subject and the object through an operator defined by the predicate. We describe this operator in more detail in Section 3.

However, modelling this predicate operator is not straightforward, which leads us to our second challenge. Traditionally, previous visual relationship papers have learned an appearance-based model for each predicate [20, 23, 26]. Unfortunately, the drastic appearance variance of predicates, depending on the entities involved, makes learning predicate appearance models challenging. For example, the appearance of the predicate carrying can vary significantly between the following two relationships: <person - carrying - phone> and <truck - carrying - hay>. Instead, inspired by the moving spotlight theory in psychology [18, 35], we bypass this challenge by using predicates as a visual attention shift operation from one entity to the other.
While one shift operation learns to move attention from the subject to the object, an inverse predicate shift similarly moves attention from the object back to the subject. Over multiple iterations, we operationalize these asymmetric attention shifts between the subject and the object as different types of message operations for each predicate [37, 9].

In summary, we introduce the task of referring relationships, whose structured relationship input allows us to evaluate how well we can unambiguously identify entities of the same category in an image. We evaluate our model on three vision datasets that contain visual relationships: CLEVR [12], VRD [23] and Visual Genome [17]. In all three datasets, a substantial fraction of relationships refer to ambiguous entities, i.e. entities that have multiple instances of the same category. We extend our model to perform attention saccades [36] using relationships belonging to a scene graph [14]. Finally, we demonstrate that in the absence of the subject or the object, our model can still disambiguate between entities while also localizing entities from new categories that it has never seen before. Our model was implemented in Keras with a TensorFlow backend and is available at https://github.com/StanfordVL/ReferringRelationships.
2. Related Work
To properly situate the task of referring relationships, we explore the evolution of visual relationships as a representation. Next, we survey the inception of referring expression comprehension as a similar task, summarize how attention has been used in the deep learning literature, and survey other technical approaches that are similar to ours.

There is a long history of vision papers moving beyond simple object detection and modelling the context around the entities [27, 31], or even studying object co-occurrences [8, 19, 25], to improve classification and detection itself. Our task of referring relationships was motivated by such papers. Unlike these models, we utilize a formal definition for context in the form of a visual relationship.

Pushing along this thread, visual relationships were initially limited to spatial relationships: above, below, inside and around [8]. Relationships were then extended to include human interactions, such as holding and carrying [40]. Extending the definition further, the task of visual relationship detection was introduced along with a dataset of spatial, comparative, action and verb predicates [23]. More recently, relationships were formalized as part of an explicit formal representation for images called scene graphs [14, 17], along with a dataset of scene graphs called Visual Genome [17]. These scene graphs encode the entities in a scene as nodes in a graph, connected by directed edges representing their relative relationships. Scene graphs have been shown to improve a number of computer vision tasks, including semantic image retrieval [33], image captioning [1] and object detection [30]. Newer work has extended models for relationship detection to use co-occurrence statistics [26, 32, 37] and has even formulated the problem in a reinforcement learning framework [21]. These papers focused primarily on detecting visual relationships categorically: they output relationships given an input image.
In contrast, we focus on the inverse problem of localizing the entities that take part in an input relationship. We disambiguate entities in a query relationship from other entities of the same category in the image. Moreover, while all previous work has attempted to learn visual features of predicates, we propose that the visual appearances of predicates are too varied and can be more effectively learnt as an attention shift, conditioned on the entities in the relationship.

Such an inverse task of disambiguating between different regions in an image has been studied under the task of referring expression comprehension [24]. This task uses an input language description to find the referred entities. The work has been motivated by human-robot interaction, where the robot would have to disambiguate which entities the human user is referring to [34]. Models for this task have been extended to include global image contrasts [41], visual relationships [11] and reward-based reinforcement systems that encourage the generation of unique expressions for different image regions [41]. Unfortunately, all these models require the ability to process both natural language as well as visual constructs. This requirement makes it difficult to disentangle whether mistakes result from poor language modelling or poor visual understanding.

Figure 2: Referring relationships' inference pipeline begins by extracting image features, which are then used to generate an initial grounding of the subject and object independently. Next, these estimates are used to shift the attention using the predicate from the subject to where we expect the object to be. We modify the image features by focusing our attention on the shifted area when refining our new estimate of the object. Simultaneously, we learn an inverse shift from the initial object to the subject. We iteratively pass messages between the subject and object through the two predicate shift modules to finally localize the two entities.
In an effort to ameliorate these limitations, we propose the referring relationships task: it simplifies referring expressions by replacing the language inputs with a structured relationship. We focus solely on the visual component of the model, avoiding confounding errors from language processing.

One key observation about predicates is their large variance in visual appearance [23]. For example, consider these two relationships: <person - carrying - phone> and <truck - carrying - hay>. We use an insight from psychology [18, 35], specifically the moving spotlight theory, which suggests that visual attention can be modelled as a spotlight that can be conditioned on and directed towards specific targets. The use of attention has been explored to improve image captioning [38, 2] and has even been stacked to improve question answering [13, 39]. In comparison, we model two discriminative attention shifting operations for each unique predicate: one conditioned on the subject to localize the object, and an inverse predicate shift conditioned on the object to find the subject. Each predicate utilizes both the current estimate of the entities as well as image features to learn how to shift, allowing it to use both spatial and semantic features.

Our work also has similarities to knowledge bases, where predicates are often projections in a defined semantic space [3, 6, 22]. Such a method was recently used for visual relationship detection [43]. While these methods have seen success in knowledge base completion tasks, they have only led to marginal gains for modelling visual relationships. Unlike these methods, we do not model predicates as a projection in semantic space but as a shift in attention conditioned on an entity in a relationship. Our method can be thought of as a special case of a deformable parts model [7] with two deformable parts, one for each entity.
Finally, our message passing algorithm can be thought of as a domain-specific specialized version of the message passing in graph convolution approximation methods [9, 15].
3. Referring relationships model
Recall that our aim is to use the input referring relationship to disambiguate entities in an image by localizing the entities involved in the relationship. Formally, the input is an image I with a referring relationship, R = <S - P - O>, which are the subject, predicate and object categories, respectively. The model is expected to localize both the subject and the object.

We begin by using a pretrained convolutional neural network (CNN) to extract an L × L × C dimensional feature map from the image, µ = CNN(I). That is, for each image, we extract a 3-dimensional tensor of shape L × L × C, where L is the spatial size of the feature map and C is the number of feature channels. Our goal is to decide whether each of the L × L image regions belongs to the subject, the object, or neither. We model this problem by representing the image with two random variables X, Y. For i = 1 . . . L × L, X_i > τ implies that the subject occupies region i, and Y_i > τ implies that the object occupies that region, for some hyperparameter threshold τ. We now define a graph G = (V_X ∪ V_Y, E), where V_X = {x_i}, V_Y = {y_i} are the nodes of the graph, represented by the image regions, and E = {(x_i, y_j)} contains an edge from every x_i to every y_j. Given the image and relationship, we want to assign x* and y* with

x*, y* = argmax_{x, y} Pr(X = x, Y = y | µ, R).

This optimization problem can be reduced to inference on a densely connected graph, which can be very expensive. As shown in previous work [44, 16], dense graph inference can be approximated by mean field in Conditional Random Fields (CRFs). Such papers allow fully differentiable inference, assuming weighted Gaussians as pairwise potentials [44]. To achieve greater flexibility in a more principled training framework, we design a general model where the message passing during inference is a series of learnt convolutions. More specifically, we design our model with two types of modules: attention and predicate shift modules.
While attention models attempt to locate a specific category in an image, the predicate shift modules learn to move attention from one entity to another. Before we specify our attention and shift operators, let's revisit the challenges in referring relationships to motivate our design decisions. The two challenges are (1) the difference in difficulty in object detection and (2) the drastic appearance variance of predicates. First, the difference in difficulty arises because some objects like zebra and person are highly discriminative and can be easily detected, while others like glass and ball tend to be harder to localize. We can overcome this problem by conditioning the localization of one entity on the other: if we know where the person is, we should be able to estimate the location of the ball that they are kicking.

Second, predicates tend to vary in appearance depending on the objects involved in the relationship. To deal with this wide appearance variance, we move away from how previous work [23] attempted to learn appearance features of predicates and instead treat predicates as a mechanism for shifting attention from one object to another. Relationships like above should learn to focus attention down from the subject when locating the object, and the predicate left of should focus attention to the right of the subject. Inversely, once we locate the object, the model should use left of to focus attention to the left to confirm its initial estimate of the subject. Note that not all predicates are spatial, so we also ensure that we can model their visual appearances by conditioning the shifts on the image features as well.
Attention modules.
With these design goals in mind, we formulate the attention module as an initial estimate of the subject and object localizations by approximating the maximizers x*, y* with the soft attention Att(·):

x̂ = Att(µ, S) = ReLU(µ · Emb(S))   (1)
ŷ = Att(µ, O) = ReLU(µ · Emb(O))   (2)

where Emb(·) embeds the entity into a C-dimensional semantic space and ReLU(·) is the Rectified Linear Unit operator. x̂, ŷ denote the initial attention over the subject and object, which are not conditioned on the predicate at all and only use the entities.

Predicate shift modules.
Inspired by the message passing protocol in CRFs [44], we design a more general message passing function to transfer information between the two entities. Each message is passed from the subject's estimate to localize the object, and vice versa. In practice, we want the message passed from the subject to the object to be different from the one passed from the object back to the subject. So, we learn two asymmetric attention shifts: one that shifts the location from the subject to its estimate of where it thinks the object is, and another that does the inverse from the object to the subject. We denote these shift operations as
Sh(·) and Sh⁻¹(·), respectively, and define them as n convolutions applied in series to the initial estimated assignments:

x̂_shift = Sh⁻¹(ŷ, P) = ◯_{l=1}^{n} ReLU(ŷ ∗ F⁻¹_l(P))   (3)
ŷ_shift = Sh(x̂, P) = ◯_{l=1}^{n} ReLU(x̂ ∗ F_l(P))   (4)

where ◯_{l=1}^{n} implies that we perform the operation n times in series, each parametrized by F⁻¹_l(P) and F_l(P), which correspond to learned convolution filters for the inverse predicate and the predicate operations, respectively. The ∗ operator indicates a convolution with kernels F⁻¹_l(P) and F_l(P) of size k_l = k with c_l channels. We set c_n = 1 for the last convolution to ensure that x̂_shift and ŷ_shift have dimension L × L × 1. While we do not enforce the two shift operators to be inverses of one another, for most predicates we empirically find that Sh⁻¹(·) in fact learns the inverse attention shift of Sh(·). Note that we do not provide any supervision for these shifts; the model is tasked with learning them to improve its entity localizations. The outputs of these two predicate shift operators are a new estimated attention mask over where our model expects to find the object, ŷ_shift, conditioned on its initial estimate of the subject, x̂, and vice versa from ŷ to x̂_shift.

Each predicate learns its own set of shift and inverse shift functions. By allowing multiple channels c_l for each set of kernels, our model can formulate shifts as a mixture. For example, carrying might want to focus on the top of the object when the relationship is <person - carrying - phone>, while focusing towards the bottom when the relationship is <person - carrying - bag>.

Since we want every image region X_i to pass a message to every other region Y_j, we enforce that n > L/k, i.e. we need a minimum of L/k convolutions in series. We arrive at this restriction because the maximum spatial distance that a message needs to travel is √L and the furthest image region it can send a message to in each iteration is √k, where L is the image feature size and k is the kernel size of each predicate shift convolution.

Running iterative inference.
Once we have these estimates, we can modify our image features using an element-wise multiplication across the C channels in the feature map. We can then pass them back to the subject and object attention modules to update their locations:

x̂ = Att(x̂_shift × µ, S)   (5)
ŷ = Att(ŷ_shift × µ, O)   (6)

We can continuously update these locations, conditioned on one another. This amounts to running maximum a posteriori inference on one entity while using the other entity's previous location. We finally output x̂_t and ŷ_t, where t is a hyperparameter that determines the number of iterations for which we run inference.

Image Encoding.
We extract image features from the last activation layer of an ImageNet-pretrained [29] ResNet50 [10] convolutional block, which yields the L × L × C dimensional representation µ, and finetune the features. We find that our model performs best with predicate convolution filters of kernel size k × k with c channels.

Training details.
We use RMSProp as our optimizer, decaying the initial learning rate whenever the validation loss does not decrease for several consecutive epochs. We train for a fixed number of epochs and embed all of our objects and predicates in a shared semantic space.
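The full pipeline of this section (Eqs. 1-6) can be sketched end-to-end in pure Python. This is a minimal, single-channel sketch rather than the released implementation (which uses Keras with a TensorFlow backend): the learned filters F_l(P) are replaced by hand-set kernels, Emb(·) by fixed vectors, and `ssas` is our own name for the inference loop.

```python
def relu(v):
    return max(0.0, v)

def att(fmap, emb):
    """Eqs. 1-2: soft attention = ReLU of the dot product between each
    cell's C-dim feature vector and the entity embedding."""
    L = len(fmap)
    return [[relu(sum(f * e for f, e in zip(fmap[i][j], emb)))
             for j in range(L)] for i in range(L)]

def conv_shift(att_map, kernel, n):
    """Eqs. 3-4: a predicate shift as n ReLU-convolutions in series
    (single channel; the learned filters are hand-set here)."""
    L, k = len(att_map), len(kernel)
    pad = k // 2
    for _ in range(n):
        out = [[0.0] * L for _ in range(L)]
        for i in range(L):
            for j in range(L):
                s = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < L and 0 <= jj < L:
                            s += att_map[ii][jj] * kernel[di][dj]
                out[i][j] = relu(s)  # ReLU after each convolution
        att_map = out
    return att_map

def modulate(fmap, att_map):
    """Scale every C-dim feature vector by its cell's attention value."""
    L = len(fmap)
    return [[[f * att_map[i][j] for f in fmap[i][j]] for j in range(L)]
            for i in range(L)]

def ssas(fmap, emb_s, emb_o, kern, kern_inv, n, t):
    """Eqs. 5-6: alternate attention and predicate shifts for t iterations.
    Both shifts are computed from the previous round's estimates."""
    x = att(fmap, emb_s)  # initial subject estimate
    y = att(fmap, emb_o)  # initial object estimate
    for _ in range(t):
        y_shift = conv_shift(x, kern, n)      # Sh: subject -> object
        x_shift = conv_shift(y, kern_inv, n)  # Sh^-1: object -> subject
        x = att(modulate(fmap, x_shift), emb_s)
        y = att(modulate(fmap, y_shift), emb_o)
    return x, y
```

On a toy 3 × 3 feature map with a subject on the left, the true object to its right, and a distractor object elsewhere, a rightward kernel and its leftward inverse suppress the distractor after one iteration, which is exactly the qualitative behaviour the model is trained to exhibit.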
4. Experiments
We start our experiments by evaluating our model’s per-formance on referring relationships across three datasets,where each dataset provides a unique set of characteristicsthat complement our experiments. Next, we evaluate howto improve our model in the absence of one of the entitiesin the input referring relationship. Finally, we conclude bydemonstrating how our model can be modularized and usedto perform attention saccades through a scene graph.
CLEVR. CLEVR is a synthetic dataset generated from scene graphs [12], where the relationships between objects are limited to spatial predicates (left, right, front, behind) and distinct entity categories. With millions of relationships, many of which are ambiguous, along with the ease of localizing its object categories, this dataset allows us to explicitly test the effects of our predicate attention shifts without confounding errors from poor image features or noise in real-world datasets.

VRD. Visual Relationship Detection (VRD) is the most widely benchmarked dataset for relationship detection in real-world images [23]. It consists of thousands of images spanning a large set of object and predicate categories, a sizeable fraction of whose relationships are ambiguous. With only a few examples per object and predicate category, this dataset allows us to evaluate how our model performs when starved for data.

Visual Genome. Visual Genome is the largest publicly available dataset of visual relationships in real images [17]. It contains a large collection of images with millions of relationship instances; we use a version that focuses on the most common object and predicate categories. Our experiments on Visual Genome represent a large-scale evaluation of our method, where a substantial share of relationships refer to ambiguous entities.

Evaluation Metrics.
Recall that the output of our model is a localization of the subject and the object of the referring relationship. To evaluate how our model performs, we report the Mean Intersection over Union (IoU), a common metric used in localizing salient parts of an image [4, 5]. This metric measures the average intersection over union between the predicted image regions and the ground truth bounding boxes. Next, we report the KL divergence, which measures the dissimilarity between the two saliency maps and heavily penalizes false positives.
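As a concrete reference, both metrics can be computed as below. This is a minimal sketch under stated assumptions: the predicted attention map is binarized at the threshold τ for IoU, and both maps are normalized into distributions for KL divergence; the authors' exact evaluation protocol may differ in details.

```python
import math

def mean_iou(pred, gt, tau=0.5):
    """Threshold the predicted map at tau, then compute intersection
    over union against a binary ground-truth mask."""
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            b = p > tau
            if b and g:
                inter += 1
            if b or g:
                union += 1
    return inter / union if union else 0.0

def kl_divergence(gt, pred, eps=1e-8):
    """KL(gt || pred) between the two maps, each normalized to sum to 1.
    False positives in pred inflate its normalizer and raise the score."""
    g = [v for row in gt for v in row]
    p = [v for row in pred for v in row]
    zg, zp = sum(g) + eps, sum(p) + eps
    return sum((v / zg) * math.log((v / zg) / (w / zp + eps))
               for v, w in zip(g, p) if v > 0)
```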
Baseline models.
We create three competitive baseline models inspired by related work in entity co-occurrence [8], spatial attention shifts [18] and visual relationship detection [23]. The first model tests how much we can leverage only the entities' co-occurrence, without using the predicate. This model simply embeds the subject and the object and combines them to collectively attend over the image features. The next baseline embeds the entities along with the predicate using a series of dense layers, similar to the vision component of the relationship embeddings used in visual relationship detection (VRD) [23, 11]. This model has access to the entire relationship when finding the two entities. Finally, the third baseline replaces our learnt predicate shifts with a spatial shift that we statistically estimate for each predicate in the dataset (see supplementary material for details). This final model tests whether our model utilizes semantic information from the images, and not just the spatial information from the entities, to make predictions.
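The co-occurrence baseline can be sketched as follows. The combination of the two entity embeddings is assumed here to be a simple sum (the text does not specify the exact fusion), and `cooccurrence_att` is a hypothetical name, not from the released code.

```python
def cooccurrence_att(fmap, emb_s, emb_o):
    """Co-occurrence baseline: fuse both entity embeddings (here by
    summation, an assumed choice), then attend once over the image
    features. The predicate is ignored entirely."""
    fused = [a + b for a, b in zip(emb_s, emb_o)]
    L = len(fmap)
    return [[max(0.0, sum(f * e for f, e in zip(fmap[i][j], fused)))
             for j in range(L)] for i in range(L)]
```

Because the attention map depends only on which categories appear, every instance of the subject or object category lights up, which is why this baseline cannot disambiguate between multiple instances.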
Quantitative results.
Across all the datasets, we find that the co-occurrence model is unable to disambiguate between instances of the same category and only performs well when there is only one instance of that category in an image. The spatial shift model does better than the other baselines on CLEVR, where the predicates are spatial, and worse on the real-world datasets, implying that it is insufficient to model predicates only as spatial shifts. Surprisingly, when evaluating on the CLEVR dataset, we find that the VRD model does not properly utilize the predicate and leads to only marginal gains over the co-occurrence model. In comparison, we find that our SSAS variants perform better across all metrics, with a clear Mean IoU gain on CLEVR. This gain, however, is smaller on Visual Genome and VRD, as these datasets are noisy and incomplete, penalizing our model for making predictions that are not annotated in the datasets. KL divergence, which only penalizes false predictions, highlights that our models are more precise than our baselines. Across the different ablations of SSAS, we notice that having more iterations is better, but the performance saturates after a few iterations because the predicate shifts and the inverse predicate shifts learn near-inverse operations of one another.

Table 1: Results for referring relationships on CLEVR [12], VRD [23] and Visual Genome [17]. We report Mean IoU and KL divergence for the subject (S) and object (O) localizations individually.

                    Mean IoU ↑                              KL divergence ↓
                    CLEVR       VRD         Vis. Genome     CLEVR       VRD         Vis. Genome
                    S     O     S     O     S     O         S     O     S     O     S     O
Co-occurrence [8]   0.691 0.691 0.347 0.389 0.414 0.490     0.839 0.839 2.598 2.307 1.501 1.271
Spatial shift [18]  0.740 0.740 0.320 0.371 0.399 0.469     0.643 0.643 2.612 2.318 1.512 1.293
VRD [23, 11]        0.734 0.732 0.345 0.387 0.417 0.480     1.024 1.014 2.492 2.171 1.483 1.255
SSAS (iter 1)       0.742 0.748 0.358 0.398

Figure 3: (a) Relative to a subject in the middle of an image, the predicate left of will shift the attention to the right when using the relationship <subject - left of - object> to find the object. Inversely, when using the object to find the subject, the inverse predicate shift will move the attention to the left. We visualize all VRD, CLEVR and Visual Genome predicate and inverse predicate shifts in our supplementary material. (b) We also show that these shifts are intuitive when looking at the dataset that was used to learn them. For example, we find that ride usually corresponds to an object below the subject.

Interpreting our results.
We can interpret the predicate shifts by synthetically initializing the subject to be at the center of an image, as shown in Figure 3(a). When applying the left of predicate shift, we see that the model has learnt to focus its attention to the right, expecting to find the object to the right of the subject. Similarly, the inverse predicate shift learns to do nearly the opposite by focusing attention in the other direction. When visualizing these shifts next to the dataset examples in Visual Genome, we see that the shifts represent the biases that exist in the dataset (Figure 3(b)). For example, since most entities that can be ridden are below the subject, the shifts learn to focus attention down to find the object and up to find the subject. We also find that our model learns to encode dataset bias in these shifts. Since most training images for hit show people playing tennis or baseball facing left, our model captures this bias by learning that hit should focus attention to the bottom left to find the entity being hit.

Figure 4 shows numerous examples of how our model shifts attention over multiple iterations. We see that, generally across all our test cases, the subject and object attention modules use the image features to localize all instances of their categories in the first iteration. For example, in Figure 4(a), all the regions that contain a person are initially activated. But after the predicate and the inverse predicate shifts, we see that the model learns to move the attention in opposite directions for the predicate left of. In the second iteration, both people are uniquely localized in the image. Figure 4(b) clearly shows that we can easily locate all instances of purple metal cylinders in the image, since it is easy to detect entities in CLEVR. Our model learns to identify which purple metal cylinders we are actually referring to on successive iterations while suppressing the other instance.

In Figure 4(c), even though both the subject and object have multiple instances of person and cup, we can disambiguate which person is actually holding the cup. For the same image in Figure 4(d), our model is able to distinguish the cup being held in the previous referring relationship from the one that is on top of the table. In cases where a referring relationship is not unique, like the example in Figure 4(e), we manage to find all instances that satisfy the relationship we care about; here, we return both persons riding the skateboards. Having learnt from the dataset that most relationships with stand next to annotate the subject to the left of the object, our model emulates this behaviour in Figure 4(f). However, our model does make its fair share of mistakes: for example, in Figure 4(g), it finds both persons and is unable to distinguish which one is wearing the skis.

Now that we have evaluated our model, one natural question to ask is how important it is for the model to receive both entities of the relationship as input. Can it localize the person from Figure 1 if we only use <- kicking - ball> as input? Or can we localize both the subject and the object with only <- kicking ->? We are also interested in taking this task a step further and studying whether we can localize categories that we have never seen before.

Figure 4: Example visualizations of how attention shifts across multiple iterations on the CLEVR and Visual Genome datasets. On the first iteration, the model receives information only about the entities it is trying to find and hence attempts to localize all instances of those categories. In later iterations, we see that the predicate shifts the attention, allowing our model to disambiguate between different instances of the same category.
Previous work has shown that we can localize seen categories in novel relationship combinations [23], but we want to know whether it is possible to localize unseen categories. We remove from our training set all instances of categories like pants, hydrant, etc. that are not in ImageNet (CNN(·) was pretrained on ImageNet) and attempt to localize these novel categories using their relationships. We do not make any changes to our model, but alter the training script to randomly mask out (with a fixed drop rate) the subject, the object, or both in the referring relationships during each iteration. The model learns to attend over general object categories when the entities are masked out. We find that we can in fact localize these missing entities, even if they are from unseen categories. We report results for this experiment on the VRD dataset in Table 2.

Table 2: Referring relationships results in the absence of entities, under three test conditions: no subject, where the input is <- predicate - object>; no object, where the input is <subject - predicate ->; and only predicate, where the input is <- predicate ->.

                No subject     No object      Only predicate
                S-IoU  O-IoU   S-IoU  O-IoU
VRD [23]        0.208  0.008   0.024  0.026
SSAS (iter 1)   0.331  0.359   0.332  0.361
SSAS (iter 2)   0.333  0.360

Figure 5: We can decompose our model into its attention and shift modules and stack them to attend over the nodes of a scene graph. Here we demonstrate how our model can start at one node (phone) and traverse a scene graph, using the relationships to connect the nodes and localize all the entities in the phrase <phone on the person next to another person wearing a jacket>. A second example attends over the entities in <hat worn by person to the right of another person above the table>.
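The entity-masking scheme described above might look like this in training code. `MASK` and `mask_relationship` are hypothetical names for illustration, and the drop rate is a tuned hyperparameter, not a value from the paper.

```python
import random

MASK = 0  # hypothetical index reserved for a masked-out entity category

def mask_relationship(subj, pred, obj, drop_rate, rng=random):
    """Randomly replace the subject and/or object category with a generic
    MASK token, so the model learns to rely on the predicate and the
    remaining entity alone. Category indices are illustrative."""
    s = MASK if rng.random() < drop_rate else subj
    o = MASK if rng.random() < drop_rate else obj
    return s, pred, o
```

At test time, feeding MASK in place of an unseen category then asks the model to ground the entity purely through the predicate shift from its partner.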
One ramification of our model design is its modularity: the attention and shift modules expect inputs and produce outputs that are image features of shape L × L × C. We can decompose these modules and stack them like Lego blocks, allowing us to perform more complicated tasks. One particularly interesting extension to referring relationships is attention saccades [36]. Instead of using a single relationship as input, we can extend our model to take an entire scene graph as input. Figure 5 demonstrates how we can iterate between the attention and shift modules to traverse a scene graph. We can start from the phone and localize the jacket worn by the "woman on the right of the man using the phone". A scene graph traversal can be evaluated by decomposing the graph into a series of relationships. We do not quantitatively evaluate these saccades here, as their evaluations are already captured by the referring relationships in the graph.
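The traversal above can be sketched with toy stand-ins for the two modules. This is an illustrative skeleton, not the paper's implementation: the real attention module is learned, the real shift is a per-predicate convolution rather than the fixed translation used here, and all names and sizes are our own assumptions.

```python
import numpy as np

L, C = 14, 32  # assumed spatial size and channel depth of the feature map

def attend(image_features, entity_emb):
    # Toy attention module: score every L x L cell against the entity
    # embedding and softmax the scores into a heatmap that sums to 1.
    scores = image_features @ entity_emb          # (L, L)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def shift(attention_map, offset):
    # Toy predicate shift: translate the heatmap by (dy, dx). The model
    # in the paper learns this shift per predicate.
    return np.roll(attention_map, offset, axis=(0, 1))

def traverse(image_features, start_emb, edges):
    # Walk a chain of (predicate shift, next entity) edges, conditioning
    # each localization on the previous node's attention map.
    att = attend(image_features, start_emb)
    for offset, entity_emb in edges:
        att = shift(att, offset) * attend(image_features, entity_emb)
        att = att / att.sum()
    return att

rng = np.random.default_rng(0)
feats = rng.normal(size=(L, L, C))
phone, person, jacket = rng.normal(size=C), rng.normal(size=C), rng.normal(size=C)
# phone -> (on) -> person -> (wearing) -> jacket
heat = traverse(feats, phone, [((0, 2), person), ((-2, 0), jacket)])
```

Because each module maps an L × L heatmap to another L × L heatmap, arbitrarily long chains of relationships compose without any change to the modules themselves.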
5. Conclusion
We introduced the task of referring relationships, where our model utilizes visual relationships to disambiguate between instances of the same category. Our model learns to iteratively use predicates as an attention shift between the two entities in a relationship. It updates its belief of where the subject and object are by conditioning its predictions on the previous location estimate of the object and subject, respectively. We show improvements on the CLEVR, VRD and Visual Genome datasets. We also demonstrate that our model produces interpretable predicate shifts, allowing us to verify that the model is in fact learning to shift attention. We even showcase how our model can be used to localize completely unseen categories by relying on partial referring relationships, and how it can be extended to perform attention saccades on scene graphs. Improvements in referring relationships could pave the way for vision algorithms to detect unseen entities and grow their understanding of the visual world.
Acknowledgements.
Toyota Research Institute (TRI) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We thank John Emmons, Justin Johnson and Yuke Zhu for their helpful comments.

Supplementary material
In the supplementary material, we include more detailed results of our task for every entity and predicate category, allowing us to diagnose which entities or predicates are difficult to model. We also include the learnt predicate and inverse predicate shifts for all the predicates we modeled in VRD [23], CLEVR [12] and Visual Genome [17]. Furthermore, we explain our baseline models in more detail here.

Co-occurrence and VRD baseline models
Given that the closest task to referring relationships is referring expression comprehension [24], we draw inspiration from this literature when designing our baselines. A frequent approach used by most models for this task involves semantically mapping language expressions to their corresponding image regions [28, 24, 41]. In other words, they map the image features extracted from a CNN close to the language expression features extracted from a Long Short-Term Memory (LSTM) network. Our baseline models (co-occurrence and VRD) draw inspiration from this line of work: they map relationships to a semantic feature space and map them close to the image regions to which they refer using our attention module.

The difference between the two baseline models lies in how we embed the relationships into that semantic space. In the case of co-occurrence, we are only interested in studying how well we can model a relationship without the predicate, relying simply on co-occurrence statistics. So, we first embed the subject and the object, concatenate their representations and pass them through a dense layer followed by a ReLU non-linearity to allow the two embeddings to interact. For the
VRD baseline, we embed the entire relationship, similar to prior work [23], by embedding all three components of the relationship, concatenating their representations and passing them through a dense layer and a non-linearity.

Unlike our model, which attends over the subject and object in succession, these models are jointly aware of the entire relationship, or at least of the other entity, when attending over the image features. Also, embedding the predicate and attending over the image with this embedding asks these baselines to model predicates visually. But predicates such as above or below are not visually significant and can only be modelled as a relative shift from one entity to another. We show through our experiments that such baselines are not able to perform as well as our model, nor are they interpretable.

Spatial shift baseline model
Instead of learning the attention shifts for each predicate, we assume (incorrectly) that all predicates are simply spatial shifts and model each predicate as a shift function. We learn the shift statistically from the relative locations of the two entities of the relationship. We visualize these statistically calculated shifts in Figures 8, 10 and 12. We normalize the shifts to visualize them as heatmaps; they do not show the actual magnitude by which each predicate shifts attention, only the direction of the shift. As expected, we see that predicates like left push attention to the right, etc. This baseline uses our attention modules to find the subject and object, and uses these precalculated shifts to move attention around. We only need to train the attention module, which is equivalent to training our SSAS model with zero iterations. During evaluation, we use these statistical spatial shifts to move attention.

This baseline is useful in two ways. First, it demonstrates that it is important to model predicates as both spatial and semantic. Second, it allows us to compare the learnt predicate shifts with these calculated ones to verify that our SSAS models are in fact learning spatial shifts as well.
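The statistical shift for a predicate can be estimated as sketched below. This is our own minimal reading of the procedure: average the displacement from subject center to object center over all training instances of the predicate, normalized by image size. The box format and function names are illustrative.

```python
import numpy as np

def box_center(box):
    # Box given as (x1, y1, x2, y2); return its (cx, cy) center.
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def statistical_shift(pairs):
    # Average displacement from subject center to object center over all
    # training instances of one predicate, normalized by image size.
    deltas = []
    for subj_box, obj_box, (img_w, img_h) in pairs:
        sx, sy = box_center(subj_box)
        ox, oy = box_center(obj_box)
        deltas.append(((ox - sx) / img_w, (oy - sy) / img_h))
    return np.mean(deltas, axis=0)

# Hypothetical instances of a predicate like "above": the object tends
# to sit below the subject, so the shift points downward.
pairs = [
    ((10, 10, 30, 30), (10, 60, 30, 80), (100, 100)),
    ((40, 0, 60, 20), (42, 50, 58, 70), (100, 100)),
]
dx, dy = statistical_shift(pairs)  # -> (0.0, 0.5): no horizontal, downward
```

At evaluation time, the precomputed (dx, dy) for the input predicate is applied to the attention map in place of a learned shift module.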
While above and below are spatial predicates, others like hit or sleep on are both spatial and semantic. hit usually refers to entities around the subject, which are usually balls. Similarly, sleep on usually refers to something below the subject, typically a bed or couch. We show the learnt predicate shifts of all the predicates in the three datasets in Figures 7, 9 and 11. As expected, most relationships that are spatial are interpretable. In Figure 7, above moves attention below while its inverse moves it up. hit focuses on the bottom right, emulating the dataset bias of right-handed people hitting in tennis or baseball. In Figure 11, wearing shifts attention all over the body of the subject, focusing mainly on shirts, pants and glasses. by splits the attention both to the left and to the right to find what the subject is next to. Some predicates, like attached to, are harder to interpret as they depend on both semantic and spatial shifts. While our model uses the image features to learn these shifts, our current spatial shift visualization does not create an interpretable predicate shift.

One of the benefits of referring relationships is their structured representation of the visual world, allowing us to study which entities and predicates are hard to model. In this section we report the Mean IoU of our model on all the predicate categories in Tables 3 and 5. Note that we don't report the results for CLEVR here since all the spatial predicates are equally represented in the dataset and perform equally across all categories. Across most predicates we find that object localization is much harder than subject localization. This occurs because most objects tend to be smaller objects, which are better localized by first attending over the subject.

Figure 6: Example bounding box annotations we added to the CLEVR dataset.
We also see that size is an important factor in detection: predicates like carry and use usually have a larger subject and a smaller object, and we find that the IoU for the subject is much higher than that of the object. We also see that when entities are partially occluded, for example in < subject - drive - object >, the object IoU is much higher than that of the occluded subject.

We run a similar analysis of the performance of our model across all the entity categories and report Mean IoU results in Tables 4 and 6. Note that we don't report the results for CLEVR here since all the entities perform equally across all categories. We find that the Mean IoU for all entities in Visual Genome is higher than the ones in VRD, implying that more data for each of these categories helps the model learn to attend over the right image regions. In Table 6, we find that with the predicate shifts, we can detect smaller objects, like face, ear, bowl and eye, a lot better. Some entities like shelves and light don't perform well because not all the shelves or light sources are annotated in the dataset, causing the model's correct predictions to be penalized. Surprisingly, the model has a hard time finding bags, perhaps because it learns that bags are often found being worn or carried by people in the training set, while the test set contains bags that are on the ground or resting against other entities.

The CLEVR dataset is annotated with objects in 3D space [12]. To use the dataset in the same manner as VRD [23] and Visual Genome [17], we converted all the 3D entity locations into 2D bounding boxes with respect to the viewing perspective of every image. We will release the conversion code as well as the bounding box annotations that we added to CLEVR. Figure 6 showcases an example image annotated with our bounding boxes.
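The per-category Mean IoU analysis above can be sketched as follows. This is a generic illustration of the metric, not the paper's evaluation code; the record format and function names are our own.

```python
def box_iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mean_iou_by_category(records):
    # records: list of (category, predicted_box, ground_truth_box).
    # Returns the average IoU per category, exposing which entities
    # or predicates the model finds hard to localize.
    sums, counts = {}, {}
    for cat, pred, gt in records:
        sums[cat] = sums.get(cat, 0.0) + box_iou(pred, gt)
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

per_cat = mean_iou_by_category([
    ("person", (0, 0, 2, 2), (0, 0, 2, 2)),  # perfect localization
    ("person", (0, 0, 2, 2), (1, 1, 3, 3)),  # partial overlap
    ("cup",    (0, 0, 1, 1), (0, 0, 1, 1)),
])
```

Grouping the same records by predicate instead of by entity category yields the predicate-level tables.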
References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
[4] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT saliency benchmark, 2015.
[5] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In European Conference on Computer Vision, pages 809–824. Springer, 2016.
[6] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2d knowledge graph embeddings. arXiv preprint arXiv:1707.01476, 2017.
[7] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, pages 1–8. IEEE, 2008.
[8] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, pages 1–8. IEEE, 2008.
[9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4418–4427. IEEE, 2017.
[12] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.
[13] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. arXiv preprint arXiv:1705.03633, 2017.
[14] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3678, 2015.
[15] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[16] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[17] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[18] D. LaBerge, R. L. Carlson, J. K. Williams, and B. G. Bunney. Shifting attention in visual space: tests of moving-spotlight models versus an activity-distribution model. Journal of Experimental Psychology: Human Perception and Performance, 23(5):1380, 1997.
[19] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Graph cut based inference with co-occurrence statistics. In European Conference on Computer Vision, pages 239–253. Springer, 2010.
[20] Y. Li, W. Ouyang, and X. Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. arXiv preprint arXiv:1702.07191, 2017.
[21] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.
[22] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.
[23] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
[24] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
[25] T. Mensink, E. Gavves, and C. G. Snoek. Costa: Co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2441–2448, 2014.
[26] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive linguistic cues. arXiv preprint arXiv:1611.06641, 2016.
[27] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[28] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[30] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1745–1752. IEEE, 2011.
[31] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1481–1488. IEEE, 2011.
[32] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.
[33] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, volume 2, 2015.
[34] M. Shridhar and D. Hsu. Grounding spatio-semantic referring expressions for human-robot interaction. arXiv preprint arXiv:1707.05720, 2017.
[35] G. Sperling and E. Weichselgartner. Episodic theory of the dynamics of spatial attention. Psychological Review, 102(3):503, 1995.
[36] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766, 2006.
[37] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017.
[38] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[39] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[40] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 17–24. IEEE, 2010.
[41] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
[42] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. arXiv preprint arXiv:1612.09542, 2016.
[43] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. arXiv preprint arXiv:1702.08319, 2017.
[44] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

Figure 8: Spatial shifts calculated from the VRD dataset. These shifts were used for the spatial shift baseline model.
Figure 9: Learnt predicate shifts from the CLEVR dataset.
Figure 10: Spatial shifts calculated from the CLEVR dataset. These shifts were used for the spatial shift baseline model.
Figure 11: Learnt predicate shifts from the Visual Genome dataset.
Figure 12: Spatial shifts calculated from the Visual Genome dataset. These shifts were used for the spatial shift baseline model.