Inferring spatial relations from textual descriptions of images
Aitzol Elu, Gorka Azkune, Oier Lopez de Lacalle, Ignacio Arganda-Carreras, Aitor Soroa, Eneko Agirre
HiTZ Basque Center for Language Technologies - Ixa NLP Group, University of the Basque Country UPV/EHU, M. Lardizabal 1, Donostia 20018, Basque Country, Spain
Dept. of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Paseo Manuel Lardizabal 1, 20018 Donostia-San Sebastian, Spain
Ikerbasque, Basque Foundation for Science, Maria Diaz de Haro 3, 48013 Bilbao, Spain
Donostia International Physics Center (DIPC), Paseo Manuel Lardizabal 4, 20018 Donostia-San Sebastian, Spain
[email protected], {gorka.azcune, oier.lopezdelacalle, ignacio.arganda, a.soroa, e.agirre}@ehu.eus

Abstract
Generating an image from its textual description requires both a certain level of language understanding and common sense knowledge about the spatial relations of the physical entities being described. In this work, we focus on inferring the spatial relation between entities, a key step in the process of composing scenes based on text. More specifically, given a caption containing a mention to a subject and the location and size of the bounding box of that subject, our goal is to predict the location and size of an object mentioned in the caption. Previous work did not use the caption text information, but a manually provided relation holding between the subject and the object. In fact, the evaluation datasets used so far contain manually annotated ontological triplets but no captions, making the exercise unrealistic: a manual step was required, and systems did not leverage the richer information in captions. Here we present a system that uses the full caption, and
Relations in Captions (REC-COCO), a dataset derived from MS-COCO which allows to evaluate spatial relation inference from captions directly. Our experiments show that: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) the use of the full text allows to place the object better than using a manually annotated relation. Our work paves the way for systems that, given a caption, decide which entities need to be depicted and their respective locations and sizes, in order to then generate the final image.

Introduction

The ability to automatically generate images from textual descriptions is a fundamental skill which can boost many relevant applications, such as art generation and computer-aided design. From a scientific point of view, it also drives research progress in multimodal learning and inference across vision and language, which is currently a very active research area [1]. In the case of scenes comprising several entities, it is necessary to infer an adequate scene layout, i.e., which entities to show, and their location and size.

From the language understanding perspective, in order to generate realistic images from textual descriptions, it is necessary to infer visual features and relations between the entities mentioned in the text. For example, given the text "a black cat on a table", an automatic system has to understand that the cat has a certain color (black) and is situated on top of the table, among other details. In this paper, we focus on the spatial relations between the entities, since they are the key to suitably compose scenes described in texts. The spatial information is sometimes given explicitly, in the form of prepositions ("cat on a table"), but more often implicitly, since the verb used to relate two entities contains information about the spatial arrangement of both. For example, from the text "a woman riding a horse" it is obvious for humans that the woman is on top of the horse. However, acquiring such spatial relations from text is far from trivial, as this kind of common sense spatial knowledge is rarely stated explicitly in natural language text [2]. That is precisely what text-to-image systems learn, relating both explicit and implicit spatial relations expressed in text with the actual visual arrangements shown in images.

A large strand of research in text-to-image generation is evaluated according to the pixel-based quality of the generated images and the global fidelity to the textual descriptions, but does not evaluate whether the entities have been arranged according to the spatial relations mentioned in the text [3]. Closer to our goal, some researchers do focus on learning spatial relations between entities [4, 5, 6, 7, 8, 9]. For instance, in [6, 8] the authors proposed to associate actions along with their semantic arguments (subject and object) with pixels in images (i.e., bounding boxes of entities) as a way towards understanding the images. V-COCO is a dataset which comprises images and manually created Subject, Relation, Object
(S, R, O) ontological triplets, henceforth called concept triplets, where each S and O is associated with a bounding box in the image [6]. Note that the terms used to describe the triplet concepts are selected manually from among a small vocabulary of an ontology, e.g. PERSON or BOOK, and are not linked to the words in the caption (in this paper we will use uppercase words for ontology concepts, as opposed to lowercase for caption words). Visual Genome is constructed similarly [8]. Typically, those datasets are created by showing images to human annotators, asking them to locate the bounding boxes of the entities participating in predefined relations, and to select the terms for the relation and entities from a reduced vocabulary in a small ontology. Using such a dataset, [5] presents a system that uses concept triplets to infer the spatial relation between the subject S and the object O. Given the bounding box of the subject, the system outputs the location and size of the bounding box of the object. Evaluation is done by checking whether the predicted bounding box matches the actual bounding box in the image. The datasets and systems in the previous work require the use of manually extracted ontological triplets, and systems did not use the actual captions, posing two issues: a manual pre-processing step was required, and systems did not use the richer information in captions.

Figure 1: An example to illustrate the relevance of full captions in spatial relation inference. Given a caption, a subject token in the caption, the bounding box for the subject (both in red), and a target object in the caption (in green), the systems need to return the bounding box for the object (see Figure 2 for the actual images). The relationship between subject and object is highlighted in purple. In each row we can see two different layouts for the same subject, object and relation, motivating the need to model the full caption. Best viewed in color.

Figure 2: The images that underlie the bounding boxes in Figure 1. Best viewed in color.

In this paper we propose to study the use of full captions instead of manually selected relations when inferring the spatial relations between two entities, where one of them is considered the subject and the other is the object of the action being described by the relation. The problem we address is depicted in Figure 1. Given a textual description of an image (we use caption and textual description interchangeably) and the location (bounding box) of the subject of the action in the description, we want the system to predict the bounding box of the target object. Note that we do not use the actual pixels for this task, but we include Figure 2 for illustrative purposes. To the best of our knowledge, there is no previous work addressing the same problem, i.e. nobody has studied before whether using full captions instead of concept triplets benefits spatial relation inference.

Our hypothesis is that the textual description accompanying the image contains information that helps inferring the spatial relations between two entities. We argue that the information presented in manually created triplets alone is often insufficient to properly infer spatial relations. As a motivation, Figure 2 shows pairs of examples (left and right) where the relation between the subject and the object (given by a verb) is not enough to correctly predict the spatial relation between
them. For instance, the two examples in the top row share the same triplet (person, reading, book), but the spatial relation between subject and object is different, and depends on the interpretation of the rest of the caption. In the top-left caption the person is sitting while reading a book, so the book is around the middle of the bounding box of the person, while in the top-right caption the person is lying in bed, and therefore the book is slightly above the person.

To validate the main hypothesis of our work, we created a new dataset called Relations in Captions (REC-COCO) that contains associations between caption tokens and bounding boxes in images. REC-COCO is based on the MS-COCO [10] and V-COCO [6] datasets. For each image in V-COCO, we collect its corresponding captions from MS-COCO and automatically align the concept triplet in V-COCO to the tokens in the caption. This requires finding the token for concepts such as
PERSON. As a result, REC-COCO contains the captions and the tokens which correspond to each subject and object, as well as the bounding boxes for the subject and object (cf. Figure 3). In addition, we have adapted a well-known state-of-the-art architecture that worked on concept triplets [5] to work also with full captions, and performed experiments which show that: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) the use of the full text of the caption allows to place the object better than using the manually extracted relation.

The main contributions of this work are the following:

• We show for the first time that the textual description includes information that is complementary to the relation between a subject and an object. From another perspective, our work shows that, given a caption, a reference subject and an object in the caption, our system can assign a location and a size to the object using the information in the caption, without any manually added relation.

• We introduce a new dataset created for this task. The dataset comprises pairs of images and captions, including, for each pair, the tokens in the caption that describe the subject and object, and the bounding boxes of subject and object. The dataset is publicly available under a free license (https://github.com/ixa-ehu/rec-coco).

Related work
Understanding the spatial relations between entities and their distribution in space is essential to solve several tasks such as human-machine collaboration [11] or text-to-scene synthesis [12, 7, 13], and has attracted the attention of different research communities. In this section, we review the different approaches to infer spatial relations among entities, the evaluation methodologies arising from those communities, and available resources such as datasets.
Visual scene understanding.
There has been great interest in tasks related to visual scene understanding in recent years, such as human-object interaction, semantic segmentation or object detection. As a consequence, there are large-scale image-based datasets like MS-COCO [10], V-COCO [6] or Visual Genome [8]. Those datasets contain very rich and diverse scenes combining humans and their daily environments, accompanied by textual descriptions and/or structured text, among others. Thus, in principle, they should be appropriate to test whether textual descriptions are useful to infer spatial relations between entities. However, none of those datasets combines concept triplets, image descriptions, textual triplets as mentioned in the textual description, and the bounding boxes of the subject and object for each instance. V-COCO is the most similar, but the captions and the mentions of the concepts as expressed in the caption are not included. We thus had to build a new dataset, REC-COCO, which contains all that information.
Spatial common sense knowledge.
Initial proposals created rule-based systems to generate spatial representations [14]. With the arrival of deep learning systems this task began to gain more interest among researchers. Malinowski et al. [9] demonstrated that it was possible to create a system to estimate spatial templates from structured input such as (Object1, spatial preposition, Object2) [15]. Collell et al. [5] proposed the task of predicting the 2D relative spatial arrangement of two entities under a relationship given a concept triplet (Subject, Relation, Object). The template is determined by the interaction/composition of the Subject, Relation and Object, so changing one of the concepts that make up the structured input may change the spatial template. Contrary to those previous works, we argue that the information presented in the concept triplets alone is often insufficient to properly infer spatial relations. Therefore, we propose to check whether textual descriptions in the form of captions encode contextual information which is useful to infer spatial relations and thus place entities better in an image.

Text-to-image synthesis.
Recent studies have proposed a variety of models to generate an image given a sentence. Reed et al. [3] used a GAN [16] that is conditioned on a text encoding for generating images of flowers and birds. Zhang et al. [17] proposed a GAN-based image generation framework where the image is progressively generated in two stages at increasing resolutions. Reed et al. [18] performed image generation with sentence input along with additional information in the form of keypoints or bounding boxes. Some works [19, 20] break down the process of generating an image from a sentence into multiple stages: the input sentence is first used to predict the entities that are present in the scene, followed by the prediction of bounding boxes, then semantic segmentation masks, and finally the image. These works are aligned with ours, since they also assume that the spatial relations can be obtained from paired textual descriptions and images, as we do. However, their focus is on image generation and they do not prove that using raw textual information is actually helpful for spatial relation inference. In that sense, our work provides a solid foundation for their design choices and thus complements their work.
Quantitative information about entities.
There is a line of work to determine the quantitative relation between two nouns on a specific scale [21, 22]. These types of relations are key for image understanding tasks such as image captioning [23, 24] and visual question answering [25, 26]. The common theme in the recent work [27, 28, 29] is to use search query templates with other textual cues (e.g., more than, at least, as many as, and so on), collect numerical values, and model sizes as a normal distribution. However, the quality and scale of such extraction is somewhat limited. Bagherinezhad et al. [4] showed that textual observations about the relative sizes of entities are very limited, and relative size comparisons are better collected through visual data. In this sense, our work shows that it is possible to extract information about the relative sizes of entities, learning the implicit relations that appear in the raw text. In [30] the authors automatically collected large amounts of web data and created a resource with distributions over physical quantities that can be used to acquire common knowledge such as relative sizes of entities, but they did not use images for that goal. Their work is complementary to ours, as we use multimodal data instead of textual data alone.
Other multimodal tasks.
Following the success of transformers in natural language processing [31], multimodal transformers have been proposed to tackle several multimodal tasks with similar architecture designs. Good examples are ViLBERT [32], VisualBERT [33] and InterBERT [34]. Those multimodal transformers have shown strong performance in multimodal tasks such as visual question answering, visual commonsense reasoning, natural language for visual reasoning and region-to-phrase grounding. However, multimodal transformers have been investigated for discriminative tasks, rather than generative tasks such as image generation. Only very recently a solution has been proposed for text-to-image generation: X-LXMERT [35], which shows that multimodal transformers can also generate state-of-the-art images from textual input. For that purpose, the authors proposed to sample visual features for masked inputs and to add an image generator to transform those sampled visual features into images. Although the proposal is very relevant for the field, the suggested solution does not explicitly model the spatial layout of entities and thus cannot be used for the purposes of this work.

Figure 3: Example of REC-COCO. Given the caption, the subject token (man), the bounding box for the subject (in red), and the target object (book), the systems need to return the bounding box for the object (in green). The dataset is automatically created from V-COCO and MS-COCO, matching the ontological triplet in V-COCO (PERSON, READ, BOOK) with the tokens in the MS-COCO captions. The actual image is included for illustration purposes, it is not used by the systems. Best viewed in color.
The REC-COCO dataset

The main goal of this paper is to extract spatial relations among the entities mentioned in image captions. To the best of our knowledge, there exists no dataset that contains explicit correspondences between image pixels (bounding boxes of entities) and their respective mentions in the image descriptions. We thus developed a new dataset, called
Relations in Captions (REC-COCO), that contains such correspondences. REC-COCO is derived from MS-COCO [10] and V-COCO [6]. The former is a collection of images, each image described by 5 different captions. The latter comprises a subset of the MS-COCO images, where each image has a manually created Subject, Relation, Object
(S, R, O) concept triplet, a bounding box for the subject and a bounding box for the object. V-COCO uses ontology concepts to describe the elements of
(S, R, O) triplets, e.g. (PERSON, READ, BOOK). The triplets correspond to actions performed by the subject on the object. Given the bounding box of the subject and the triplet, the dataset has been used to evaluate whether a system is able to infer the spatial relation between the subject and object and thus produce the correct bounding box for the object. Note that the concept triplets are not linked to the actual words used in the image captions.

In order to be able to access the information in the captions, we devised an automatic method to map MS-COCO and V-COCO, so that each term in the V-COCO
(S, R, O) triplet is linked to the most similar token in one of the MS-COCO captions for the corresponding image. If the similarity between terms and tokens is below a threshold, the example is discarded. To create the links, the method considers all
(S, R, O) triplets of V-COCO images in turn. For each triplet and image, it first gathers all five captions from MS-COCO, and represents each caption by concatenating the normalized vector embeddings of its words. Let $C_i = [c_{i1}; \ldots; c_{iN_i}]$ be the matrix representing caption $i$ ($i \in [1, 5]$), where column $c_{ij}$ is the unit-normalized embedding of the $j$-th word. Let also $(s, r, o)$ be the unit-normalized embeddings of the terms used to describe the elements in the (S, R, O) triplet from V-COCO. For each caption $i$, the algorithm first obtains the word in the caption ($j_i$) that is closest to each of the embeddings of the concept triplet, as well as the similarity score between them ($sc_i$). For example, the method would compute $j_{iS}$ and $sc_{iS}$ for the S element in the triplet as follows:

$$sc_{iS} = \max_{j} \; s^{T} c_{ij} \qquad\qquad j_{iS} = \operatorname*{argmax}_{j \in [1, N_i]} \; s^{T} c_{ij}$$

$j_{iR}$, $sc_{iR}$, $j_{iO}$ and $sc_{iO}$ are calculated likewise for R and O. Afterwards, it selects the caption $i^{*}$ whose sum of scores is maximum:

$$i^{*} = \operatorname*{argmax}_{i} \; (sc_{iS} + sc_{iR} + sc_{iO})$$

If the similarity score is below a certain threshold (set empirically based on manual inspection of the resulting dataset), the triplet is discarded. If not, the caption word corresponding to index $j_{i^{*}S}$ (respectively $j_{i^{*}R}$, $j_{i^{*}O}$) is selected to represent the subject of the triplet (same for relation and object).

By applying the method described above, each element of the (S, R, O) triplets in V-COCO is anchored to actual words occurring in the captions accompanying the image. Figure 3 shows a sample output, where the subject and object in the V-COCO triplet (PERSON, READ, BOOK) are linked to the corresponding words in the MS-COCO caption, man and book, respectively. In addition, the action concept READ is matched to the token reading. We discarded V-COCO triplets corresponding to actions that do not explicitly require a subject and an object, and that have a single argument instead (smile, look, stand, and so on). All in all, REC-COCO comprises 19,559 instances. Each instance consists of an image, a caption, the subject and object words, and the bounding boxes of subject and object. In addition, in order to enable contrastive experiments, the V-COCO concept triplet and the word corresponding to the relation are also provided in the dataset. Table 1 shows further statistics of the dataset.

Number of Images                    –
Number of Captions                  –
Number of Instances                 19,559
Captions per Image                  –
Subject-Object Pairs per Caption    1.31

Table 1: Statistics of the REC-COCO dataset. Note that an image can be described by more than one caption, and that a caption can contain more than one subject-object pair of interest.

As the method to generate the dataset is automatic, we checked the quality of the produced alignments by manually annotating 100 random samples of the dataset. For each subject and object pair extracted by the automatic alignment algorithm, we checked whether the tokens matched the action described by the concept triplet. The results can be seen in Table 2. More than 85 examples got either the subject or the object correctly aligned with caption tokens, and in 71 examples both of them were correctly aligned. In addition, we also checked whether the verb describing the action could be correctly identified. Identifying the token that describes the relation is more difficult, and the algorithm is only able to do it correctly for 56 examples.

Term               Accuracy
Subject            86%
Object             85%
Subject & Object   71%
Table 2: Quality of token identification in REC-COCO for 100 random samples.
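To make the alignment procedure concrete, the following is a minimal Python sketch of it. It is our own reconstruction, not the authors' released code: the embedding lookup (embed) and the threshold value are assumptions, and the thresholding step admits several readings (here, each triplet element must clear the threshold).

```python
import numpy as np

def unit(v):
    """Unit-normalize an embedding vector."""
    return v / (np.linalg.norm(v) + 1e-12)

def best_match(concept, C):
    """Index and cosine score of the caption token closest to a concept."""
    sims = C @ concept            # rows of C and `concept` are unit vectors
    j = int(np.argmax(sims))
    return j, float(sims[j])

def align_triplet(triplet, captions, embed, threshold=0.5):
    """Link a V-COCO (S, R, O) concept triplet to tokens of one caption.

    triplet:  three concept terms, e.g. ("person", "read", "book")
    captions: the five MS-COCO captions of the image, each a token list
    embed:    word -> embedding vector (e.g., GloVe); the threshold value
              here is a placeholder, not the one used for REC-COCO.
    """
    s, r, o = (unit(embed(t)) for t in triplet)
    best = None
    for tokens in captions:
        C = np.stack([unit(embed(w)) for w in tokens])   # caption matrix C_i
        (js, ss), (jr, sr), (jo, so) = (best_match(e, C) for e in (s, r, o))
        score = ss + sr + so          # sum of per-element similarity scores
        if best is None or score > best[0]:
            best = (score, tokens, (js, jr, jo), (ss, sr, so))
    _, tokens, (js, jr, jo), sims = best
    if min(sims) < threshold:         # one possible reading of the filter
        return None                   # discard weak alignments
    return tokens[js], tokens[jr], tokens[jo]
```

Running this over every V-COCO triplet with the five MS-COCO captions of its image, and keeping the bounding boxes of the matched subject and object, yields REC-COCO-style instances.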
Task and model

The problem addressed in this work (cf. Figure 3) is the following: given a caption, a subject token in the caption (S), the location and size of the bounding box for the subject, and a target object (O), the system needs to predict a sensible location and size for the bounding box of the object. More formally, we denote by $O_c = [O_{cx}, O_{cy}] \in \mathbb{R}^2$ the $(x, y)$ coordinates of the center of the bounding box covering the object O, and by $O_b = [O_{bx}, O_{by}] \in \mathbb{R}^2$ half of its width and height. Thus, we use $O = [O_c, O_b] \in \mathbb{R}^4$ as the ground truth location and size of the object. Model predictions are denoted with a hat, $\hat{O}_c$ and $\hat{O}_b$. The task is then to produce the location and size $\hat{O} = [\hat{O}_c, \hat{O}_b] \in \mathbb{R}^4$ of the token filling the Object role in the caption describing the scene, given the bounding box of the
Subject, which is defined analogously as $S = [S_c, S_b] \in \mathbb{R}^4$.

The proposed model is a neural network inspired by [5]. We chose this model because of its excellent results in spatial relation inference, and adapted it to include the caption text in the input. Our model takes as input the embeddings of the caption words, additional embeddings for the subject and object tokens (S and O), denoted respectively as $v_S$ and $v_O$, and the bounding box of the subject $[S_c, S_b]$. Figure 4 shows the diagram of the model, with the input in the lower part and the output at the top. The system first uses a caption encoder and a dense layer to produce a fixed-length representation of the caption. We tried different alternative caption encoders (see below). The output of the dense layer is concatenated to the embeddings of subject and object, and fed into a dense layer which encodes the caption and the subject and object tokens. This representation is concatenated to the subject bounding box representation and fed into the final dense layer, which is used to predict the object bounding box. We experimented with three different caption encoders:

Average embedding (AVG)
This encoder just averages the embeddings of the tokens in the caption:

$$c_{cap} = \frac{1}{N} \sum_{i=1}^{N} v_i$$

where $v_i$ is the embedding of the $i$-th word in a caption of length $N$.

BiLSTM encoder
The caption words are fed into a bidirectional LSTM [36] and the final hidden states of the left and right LSTMs are concatenated:

$$c_{cap} = [h^{L}_{N}; h^{R}_{N}]$$

The embedding layer of the LSTM modules is initialized with external word embeddings, and the rest of the weights are learned during training.

Figure 4: Architecture of the spatial relation inference model. The system receives as input a caption, the subject and object tokens, and the location and size of the subject bounding box, and outputs the location and size of the object. See text for further details.
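To make the architecture of Figure 4 concrete, here is a minimal PyTorch-style sketch of the model with the AVG encoder. It is our own illustration, not the authors' released implementation; the class name and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SpatialRelationNet(nn.Module):
    """Sketch of the caption-based model in Figure 4 (AVG encoder)."""

    def __init__(self, emb_dim=300, hid_dim=256):
        super().__init__()
        # dense layer over the caption encoding -> v_cap
        self.caption_proj = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.ReLU())
        # [v_cap; v_S; v_O] -> z_c
        self.text_proj = nn.Sequential(nn.Linear(hid_dim + 2 * emb_dim, hid_dim), nn.ReLU())
        # [z_c; S_c; S_b] -> z_h
        self.hidden = nn.Sequential(nn.Linear(hid_dim + 4, hid_dim), nn.ReLU())
        # final regression layer -> [O_c; O_b]
        self.out = nn.Linear(hid_dim, 4)

    def forward(self, caption_embs, v_s, v_o, s_box):
        c_cap = caption_embs.mean(dim=1)   # AVG encoder over word embeddings
        v_cap = self.caption_proj(c_cap)
        z_c = self.text_proj(torch.cat([v_cap, v_s, v_o], dim=1))
        z_h = self.hidden(torch.cat([z_c, s_box], dim=1))
        return self.out(z_h)               # predicted object bounding box

# Example with random inputs: a batch of 8 captions of 12 tokens each.
model = SpatialRelationNet()
caption = torch.randn(8, 12, 300)
v_s, v_o = torch.randn(8, 300), torch.randn(8, 300)
s_box = torch.rand(8, 4)                   # normalized [S_c; S_b]
o_hat = model(caption, v_s, v_o, s_box)    # shape (8, 4)
```

The BiLSTM and BERT variants would only replace the averaging line with the corresponding encoder output.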
BERT encoder
In this setting we use a pre-trained BERT model [31]. More specifically, the caption of length $N$ is represented by the embedding corresponding to the special [CLS] token (position 0). BERT weights are fine-tuned during training:

$$c_{cap} = \mathrm{BERT}_{[0]}(v_1, \ldots, v_N)$$

Given the output $c_{cap}$ of any of the above caption encoders, we stack a dense layer to obtain the final caption representation $v_{cap}$:

$$v_{cap} = \mathrm{ReLU}(W_{cap} \, c_{cap} + b_{cap})$$

The caption representation $v_{cap}$ is then concatenated to the object and subject embeddings and fed into a dense layer to obtain the final textual embedding:

$$z_c = \mathrm{ReLU}(W_c \, [v_{cap}; v_S; v_O] + b_c)$$

This representation is concatenated to the subject bounding box, and a final regression dense layer produces the object bounding box:

$$z_h = \mathrm{ReLU}(W_h \, [z_c; S_c; S_b] + b_h)$$
$$\hat{O} = W_{out} \, z_h + b_{out}$$

where $W_{cap}$, $b_{cap}$, $W_c$, $b_c$, $W_h$, $b_h$, $W_{out}$ and $b_{out}$ are the parameters of the model (along with the parameters of the caption encoders). We used the ReLU activation function because it is widely used in similar neural network architectures. The loss function is the mean squared error between the predicted and the actual values:

$$L(O, \hat{O}) = \| \hat{O} - O \|^2$$

Experiments

In this section we report the results of the experiments performed. We conduct several sets of experiments, depending on the research question addressed. In the first set we assess the validity and quality of the REC-COCO dataset, complementing the analysis presented in Section 3. In the second set, we study which encoder is the most effective for solving this task. In a third set, we check whether it is possible to infer the size and location of an object with respect to a given subject directly from the caption, without the need of manually extracted concept triplets. In addition, we present a fourth set to study how complementary the information in the captions is with respect to the triplets.

The evaluation metrics used within the paper are the ones proposed by Collell et al. [5], and include the following: Above/below Classification Accuracy, a binary metric that measures whether the model correctly predicts that the object center is above/below the subject in the image, for which we report both macro-averaged accuracy (acc_y) and macro-averaged F1 score (F_y); Pearson Correlation (r) between the predicted values and the ground truth on both the x and y axes; Coefficient of Determination (R²) of the prediction and the ground truth [37]; and Intersection over Union (IoU), a bounding box overlap measure [38].

The data is preprocessed using the same procedure presented in [5], namely, we normalize the bounding box coordinates with the width and height of the images and apply a mirror transformation on the vertical axis to the image when the object is at the left of the subject. Textual captions are lowercased and punctuation marks are removed.

Regarding model hyperparameters, we initialize all word embeddings used in the model with publicly available GloVe embeddings [39] (http://nlp.stanford.edu/projects/glove). Regarding training details, we use cross-validation to train all the models, with a fixed number of epochs, batch size and learning rate, using an RMSprop optimizer. The same parameters are used when training the BiLSTM sentence encoders.
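As an illustration of this training setup, the following is a minimal sketch of one optimization step, reusing the SpatialRelationNet sketch from the model section. The learning rate shown is purely illustrative, not the value used in the paper.

```python
import torch

model = SpatialRelationNet()          # from the sketch in the model section
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)  # lr illustrative
loss_fn = torch.nn.MSELoss()          # mean squared error, as in the paper

def train_step(caption_embs, v_s, v_o, s_box, o_box):
    """One gradient step: predict the object box and regress to the gold box."""
    optimizer.zero_grad()
    o_hat = model(caption_embs, v_s, v_o, s_box)
    loss = loss_fn(o_hat, o_box)
    loss.backward()
    optimizer.step()
    return loss.item()
```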
We use default parameters when fine-tuning the BERT encoder. All the experiments have been performed on a single Nvidia Titan XP GPU. The observed training time per epoch on the REC-COCO dataset is shown in Table 3. Most of the complexity lies in the training of the BiLSTM and BERT caption encoders; in fact, the rest of the model only takes 0.016 minutes per epoch, as shown by the training time needed by the average caption encoder.

Caption Encoder   minutes per epoch
AVG               0.016
BiLSTM            1.783
BERT              9.783

Table 3: Average training time per epoch for our model when training on the REC-COCO dataset, depending on the caption encoder used.

In this set of experiments we want to assess the quality and validity of the REC-COCO dataset. More concretely, we want to check two important features of REC-COCO:

1. The effect of the token alignment algorithm on the task: we check whether the noise introduced by the token-concept alignment method used to create REC-COCO has any negative effect, by comparing the results of a system running on the aligned tokens with the results of a system running on the manual concept triplets.

2. The difficulty of the task: the proposed task should be feasible for automatic methods, yielding results comparable to those obtained on related datasets.

Dataset             N        acc_y   F_y    r_x    r_y    R²     IoU
REC-COCO            19,559   77.9    77.7   70.4   67.6   47.3   12.1
V-COCO (subset)     19,559   75.6    75.4   78.3   63.4   51.7   14.9
Visual Genome       20,000   71.7    71.2   87.2   76.5   46.9   6.8
Visual Genome       378k     74.5    74.5   89.2   83.2   64.8   11.1
Table 4: Assessing REC-COCO as a dataset: comparison of results attained on comparable datasets. V-COCO contains manually created ontological triplets and REC-COCO uses their automatically linked mentions. The results for V-COCO refer to the subset which was linked in REC-COCO. Visual Genome also contains images and manually annotated triplets, where annotations do not overlap with those in V-COCO or REC-COCO. All results use the same simplified model, where the relation was used instead of the full caption.
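For reference, here is a minimal sketch of two of the metrics reported in Table 4 and the following tables (IoU and above/below accuracy), operating on the normalized $[O_c, O_b]$ box representation defined in the previous section. This is our own reading of the metrics described in [5, 38]; in particular, the paper reports macro-averaged acc_y, while this sketch computes the plain (micro) version.

```python
import numpy as np

def to_corners(box):
    """[cx, cy, bx, by] (center, half width/height) -> [x1, y1, x2, y2]."""
    cx, cy, bx, by = box
    return np.array([cx - bx, cy - by, cx + bx, cy + by])

def iou(pred, gold):
    """Intersection over Union of two boxes in [cx, cy, bx, by] form."""
    p, g = to_corners(pred), to_corners(gold)
    iw = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))   # overlap width
    ih = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))   # overlap height
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union

def above_below_accuracy(preds, golds, subjects):
    """Fraction of instances where the predicted object center falls on the
    same side (above/below) of the subject center as the ground truth."""
    hits = sum((p[1] < s[1]) == (g[1] < s[1])
               for p, g, s in zip(preds, golds, subjects))
    return hits / len(preds)
```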
In these experiments we train and run a model which is a simplified version of our full model, as it does not use the caption in the input, just the subject, relation and object. The architecture is the same as in Section 4 (cf. Figure 4), where instead of the output of the linear layer over the caption encoder, $v_{cap}$, we use the embedding of the relation, $v_R$:

$$z_c = \mathrm{ReLU}(W_c \, [v_R; v_S; v_O] + b_c)$$

We thus compare the results of this system under the same conditions across three different datasets:
V-COCO
This dataset contains (S, R, O) concept triplets, e.g. (PERSON, RIDE, HORSE), with corresponding bounding boxes in the images [6]. We use the subset of V-COCO obtained by discarding the actions that have no argument, as described in Section 3. This allows for a head-to-head comparison with REC-COCO.
REC-COCO
It contains the same examples as above, but the subject, object and relation tokens have been extracted from the captions after the automatic link to the concept triplets, e.g. (woman, riding, horse).

Visual Genome
This dataset [8] also contains manually annotated (S, R, O) concept triplets with corresponding bounding boxes in the images. The images and annotations are independent of those in V-COCO (and therefore REC-COCO). We use two variants of this dataset: the 378k version is the same used in [5], where all instances containing explicit relations are discarded, and a reduced version with the same size as REC-COCO. The subset is created by randomly selecting triplets of the most used actions and the most used entities.

Table 4 shows the results of the system on each dataset. Regarding the effect of the automatic mapping, the table shows that using the caption tokens (i.e. REC-COCO) instead of ontology concept triplets (i.e. V-COCO) yields better results in three of the evaluation metrics and worse in the other three, so we can conclude that they are comparable. These results show that the possible error introduced when aligning triplets to caption tokens is relatively low, and that REC-COCO is overall a valid dataset for inferring spatial relations from triplets.

The results are better than those obtained using a subset of Visual Genome of comparable size, although a larger training dataset (378k) yields better results overall. Although the results of different datasets cannot be directly compared, they can give insights regarding the task difficulty. In this regard, the table shows that the task proposed by REC-COCO is comparable in difficulty to the triplets presented in Visual Genome. As Visual Genome is a well-established dataset, we think these results are relevant.

Table 5 shows the results which confirm the hypothesis in this paper. The top row shows the results when the system ignores the caption and uses instead the manually extracted conceptual relation in V-COCO. The second row shows the results of the same system when using the automatically mapped relation token in the input (these two rows are the same as the rows labeled V-COCO and REC-COCO in Table 4). The bottom row shows the results for the system when using the full caption encoded with BERT, ignoring which is the relevant relation (the results for alternative caption encoders are shown below, in Table 6).

Input                 acc_y   F_y    r_x    r_y    R²     IoU
Conceptual relation   75.6    75.4   78.3   63.4   51.7   14.9
Relation token        77.9    77.7   70.4   67.6   47.3   12.1
Caption               77.6    77.7   –      –      –      –

Table 5: Evaluating the contribution of captions. Performance for different inputs: manual conceptual relations (V-COCO subset), the relation token in the caption (as linked when deriving REC-COCO) and full captions (REC-COCO). Results are fully comparable, as they only differ in the input used, and show the performance gains when using captions. See text for more details.

The best results in all metrics except acc_y and r_y are obtained when using the caption. Indeed, this model achieves the best R² and IoU, the metrics which are best suited for evaluating spatial relations, since they are continuous and consider both the x and y axes. These results confirm our hypothesis: (1) it is possible to infer the size and location of an object with respect to a given subject directly from the caption; (2) the use of the full text allows to place the object better than using a manually extracted relation. The improvement obtained by the use of full captions with respect to using the relation token alone reflects that the motivation was correct (cf. Figure 1).

Table 6 shows the results of our model on REC-COCO for different caption encoders (cf. Section 4). As expected, the simplest model (AVG) yields the worst results across all metrics, and the better results of BiLSTM show that this sentence encoder is able to profit from modelling word order in order to learn a more effective caption representation. In the bottom row, the use of transformers pre-trained on a masked language modelling task to encode the caption (BERT) yields the best results for all metrics except acc_y and F_y. The fact that these results agree with those obtained by the community on sentence encoding problems across multiple tasks [31] serves as indirect confirmation that REC-COCO is a well-designed dataset, and that full captions contain information which is relevant for inducing spatial relations.

Caption Encoder   acc_y   F_y    r_x    r_y    R²     IoU
AVG               77.8    77.7   80.0   60.4   53.8   13.8
BiLSTM            –       –      –      –      –      –
BERT              77.6    77.7   –      –      –      –

Table 6: Evaluating caption encoders. Performance on REC-COCO for different caption encoders (cf. Section 4).

In order to understand the contribution of each possible input, we tried several additional combinations and ablations, as shown in Table 7. In the first row we show the results when using the BERT encoding over captions, already reported in Table 5 and repeated here for easier comparison. In the second row we show the results when extending the input to consider the embedding of the relation in addition to the caption, with no clear improvement, as the results only improve slightly in the binary above/below metrics (acc_y and F_y), with lower performance in the other metrics. This result shows that the caption encoder is able to represent the relevant information regarding the relation between subject and object, without the need of an additional explicit signal.

The third row shows the results when our model (cf. Figure 4 in Section 4) does not receive any information about which are the subject and the object. The clear decrease in performance shows that the model is not learning hidden biases in the data, and that the results of our model are sensible. From another perspective, it also shows that the captions in our dataset are complex and describe multiple relations between different subjects and objects.

Input                            acc_y   F_y    r_x    r_y    R²     IoU
Caption                          77.6    77.7   –      –      –      –
Caption + relation               –       –      –      –      –      –
Caption w/o subject and object   –       –      –      –      –      –

Table 7: Analysis of combined inputs. Performance on REC-COCO for different inputs in each row: caption, caption plus relation, and caption without subject and object information.

All in all, the results validate our hypothesis that the information conveyed in captions is complementary to the structured information, and that the unstructured information is particularly useful when important information is missing from the triplets.

Error analysis. The MS-COCO dataset contains complex scenes with many entities in diverse contexts, which makes spatial relation prediction very challenging. Even the context provided by captions may be insufficient to properly identify the spatial relations of some images. Figure 5 shows examples of system predictions that do not agree with the ground truth.

Figure 5: Four examples of spatial relations inferred by our method (green bounding box on white background, right side) that do not match the gold standard (green bounding box in the image, left side). Best viewed in color.

Example a) shows a difficult scene where the caption does not provide enough information about the scene. Note that, although wrong, the system prediction corresponds more or less to prototypical spatial arrangements between the entities mentioned in the scene, which would probably agree with the spatial relations that a typical person would draw.

Example b) shows incorrectly tagged entities. For example, the bench in the image is larger than the tagged bounding box, but the model prediction for the bounding box suggests that it knows that a bench is usually larger than a person. Further, when we compare examples a) and b) we see that our model is also able to differentiate when a person is lying or sitting on a bench. When the person is lying, the bench is roughly equal in size to the person, but it is larger along the x axis when the person is sitting on it. This is interesting, because it shows the ability to learn common sense from raw text and visual information, like humans do.

Example c) in Figure 5 shows another difficult scene, where the person is jumping with his/her snowboard. The position of the person is not the usual one (on top of the board). This is not fully described in the caption, and our model infers that the person is on the board. Once again, it would be interesting to see what humans would draw given the caption. Example d) is also complicated, since it shows an occlusion of the object, which our model cannot handle properly. In that case, the surfboard is well located (under the person), but its size is larger than in the image.
However, it is worth noting that the bounding box predicted for the surfboard is not long enough: given the size of the person, the surfboard should be longer along the x axis if it were not occluded. It might be that the "riding a wave" expression made the model infer that a part of the board is actually occluded.

Conclusions

In this paper, we show that using the full textual descriptions of images improves the ability to model the spatial relationships between entities. Previous research has focused on using structured concept triplets which include an ontological representation of the relation, but we show that the caption contains additional useful information which our system uses effectively to improve results. Our experiments are based on REC-COCO, a new dataset that we have automatically derived from MS-COCO and V-COCO, containing associations between the words in the caption and bounding boxes in images. Although there is some loss of information when moving from the ontological concept triplets to the corresponding textual triplet as mentioned in the textual caption, the use of the full caption yields the best results. Furthermore, we see that the improvement also holds without explicitly specifying the relation token in the caption, which shows that our system is able to automatically place entities relative to others without any additional manual annotation. The system is thus able to figure out the relation and the relevant contextual information from the textual caption. Our error analysis shows that even in the case of examples where the system output gets low scores, the system often guesses prototypical locations and sizes, which we think reflects common sense knowledge about the scenes.

In order to place an object according to a caption, our system needs to take as reference the size and location of another object. In the future, we would like to explore techniques to infer, from a caption describing a scene, which entities need to be depicted and their respective locations and sizes in the image. In addition, recent multimodal transformers like X-LXMERT [35] could be used to improve the encoding of captions, taking advantage of the visual grounding previously learned by the model. Finally, given that our work shows that it is not necessary to manually annotate the relation between entities for satisfactory results, large collections like MS-COCO which include captions and bounding boxes can be readily used to train and test systems with the ability to decide which entities from a caption need to be depicted.
Acknowledgements
Aitzol Elu has been supported by an ETORKIZUNA ERAIKIZ grant from the Provincial Council of Gipuzkoa. This research has been partially funded by the Basque Government excellence research group (IT1343-19), the Spanish MINECO (FuturAAL RTI2018-101045-B-C21, DeepReading RTI2018-096846-B-C21 MCIU/AEI/FEDER, UE), Project BigKnowledge (Ayudas Fundación BBVA a equipos de investigación científica 2018), and the NVIDIA GPU grant program.

References

[1] A. Mogadala, M. Kalimuthu, D. Klakow, Trends in integration of vision and language research: A survey of tasks, datasets, and methods, arXiv preprint arXiv:1907.09358.

[2] B. D. Van Durme, Extracting implicit knowledge from text, Ph.D. thesis, University of Rochester (2010).

[3] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, in: M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine Learning, Vol. 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 1060–1069. URL http://proceedings.mlr.press/v48/reed16.html

[4] H. Bagherinezhad, H. Hajishirzi, Y. Choi, A. Farhadi, Are elephants bigger than butterflies? Reasoning about sizes of objects, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[5] G. Collell, L. Van Gool, M.-F. Moens, Acquiring common sense spatial knowledge through implicit spatial templates, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[6] S. Gupta, J. Malik, Visual semantic role labeling, arXiv preprint arXiv:1505.04474.

[7] F. Huang, J. F. Canny, Sketchforme: Composing sketched scenes from text descriptions for interactive applications, in: Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST '19, ACM, New York, NY, USA, 2019, pp. 209–220. doi:10.1145/3332165.3347878. URL http://doi.acm.org/10.1145/3332165.3347878

[8] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Visual Genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vision 123 (1) (2017) 32–73.

[9] M. Malinowski, M. Fritz, A pooling approach to modelling spatial relations for image retrieval and annotation, arXiv preprint arXiv:1411.5190.

[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.

[11] S. Guadarrama, L. Riano, D. Golland, D. Göhring, Y. Jia, D. Klein, P. Abbeel, T. Darrell, et al., Grounding spatial relations for human-robot interaction, in: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2013, pp. 1640–1647.

[12] T. Hinz, S. Heinrich, S. Wermter, Generating multiple objects at spatially distinct locations, in: International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1edIiA9KQ

[13] A. A. Jyothi, T. Durand, J. He, L. Sigal, G. Mori, LayoutVAE: Stochastic scene layout generation from a label set, in: The IEEE International Conference on Computer Vision (ICCV), 2019.

[14] G.-J. Kruijff, H. Zender, P. Jensfelt, H. Christensen, Situated dialogue and spatial organization: What, where and why?, International Journal of Advanced Robotic Systems 4. doi:10.5772/5701.
[15] G. Platonov, L. Schubert, Computational models for spatial prepositions, in: Proceedings of the First International Workshop on Spatial Language Understanding, 2018, pp. 21–30.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[17] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[18] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, H. Lee, Learning what and where to draw, in: D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems 29, Curran Associates, Inc., 2016, pp. 217–225. URL http://papers.nips.cc/paper/6111-learning-what-and-where-to-draw.pdf

[19] S. Hong, D. Yang, J. Choi, H. Lee, Inferring semantic layout for hierarchical text-to-image synthesis, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7986–7994. doi:10.1109/CVPR.2018.00833.

[20] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, J. Gao, Object-driven text-to-image synthesis via adversarial training, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[21] M. Forbes, Y. Choi, Verb physics: Relative physical knowledge of actions and objects, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 266–276. doi:10.18653/v1/P17-1025.

[22] doi:10.18653/v1/P18-2102.

[23] doi:10.1017/S1351324918000104.

[24] J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng, T. Tan, Learning visual relationship and context-aware attention for image captioning, Pattern Recognition 98 (2020) 107075.

[25] S. Aditya, R. Saha, Y. Yang, C. Baral, Spatial knowledge distillation to aid visual reasoning, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 227–235.

[26] Z. Bai, Y. Li, M. Woźniak, M. Zhou, D. Li, DecomVQANet: Decomposing visual question answering deep network via tensor decomposition and regression, Pattern Recognition 110 (2020) 107538.

[27] D. Davidov, A. Rappoport, Extraction and approximation of numerical attributes from the web, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 1308–1317.

[28] K. Narisawa, Y. Watanabe, J. Mizuno, N. Okazaki, K. Inui, Is a 204 cm man tall or small? Acquisition of numerical common sense from the web, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 382–391.

[29] N. Tandon, G. De Melo, G. Weikum, Acquiring comparative commonsense knowledge from the web, in: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[30] Y. Elazar, A. Mahabal, D. Ramachandran, T. Bedrax-Weiss, D. Roth, How large are lions? Inducing distributions over quantitative attributes, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019. doi:10.18653/v1/p19-1388.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.

[32] J. Lu, D. Batra, D. Parikh, S. Lee, ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in: Advances in Neural Information Processing Systems 32, 2019.

[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, VisualBERT: A simple and performant baseline for vision and language, arXiv preprint arXiv:1908.03557.

[34] J. Lin, et al., InterBERT: Vision-and-language interaction for multi-modal pretraining, arXiv preprint arXiv:2003.13198.

[35] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, A. Kembhavi, X-LXMERT: Paint, caption and answer questions with multi-modal transformers, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. doi:10.18653/v1/2020.emnlp-main.707.

[36] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 6645–6649.

[37] N. R. Draper, H. Smith, Applied regression analysis, Vol. 326, John Wiley & Sons, 1998.

[38] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The Pascal visual object classes challenge: A retrospective, International Journal of Computer Vision 111 (1) (2015) 98–136.

[39] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. doi:10.3115/v1/D14-1162.