SG2Caps: Revisiting Scene Graphs for Image Captioning

Abstract

The mainstream image captioning models rely on Convolutional Neural Network (CNN) image features with additional attention to salient regions and objects to generate captions via recurrent models. Recently, scene graph representations of images have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. Several studies have noted that naive use of scene graphs from a black-box scene graph generator harms image captioning performance, and that scene graph-based captioning models have to incur the overhead of explicit use of image features to generate decent captions. Addressing these challenges, we propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance. The basic idea is to close the semantic gap between two scene graphs: one derived from the input image and the other from its caption. In order to achieve this, we leverage the spatial location of objects and the Human-Object-Interaction (HOI) labels as an additional HOI graph. Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs 71), indicating scene graphs as a promising representation for image captioning. Direct utilization of the scene graph labels avoids expensive graph convolutions over high-dimensional CNN features, resulting in 49% fewer trainable parameters.


Subarna Tripathi *, Intel Labs, USA

Kien Nguyen *, UC San Diego, USA

Tanaya Guha, University of Warwick, UK

Bang Du, UC San Diego, USA

Truong Q. Nguyen, UC San Diego, USA


1. Introduction

The mainstream image captioning models rely on convolutional image features and/or attention to salient regions and objects to generate captions via recurrent models [20, 1]. Recently, scene graph representations of images have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes [32, 30, 6].

* Authors have equal contributions.

arXiv preprint [cs.CV].

The literature, however, has mixed opinions about the usefulness of scene graphs in captioning. A few works have reported improvement in caption generation using scene graphs [21, 30], while several others have highlighted that scene graphs alone yield poor captioning results and can even harm captioning performance [11, 14]. In this paper, we identify the challenges in effective utilization of scene graphs in image captioning, and subsequently investigate how to best harness them for this task.

A scene graph representation, consisting of nodes and edges, can be derived either (i) from an image, where the nodes correspond to the objects present in the scene, termed Visual Scene Graphs (VSG), or (ii) from a caption, where nouns and verbs take on the roles of nodes and edges in a rule-based semantic parsing, termed Textual Scene Graphs (TSG). The literature on scene graph generation, and on scene graphs for image captioning, primarily refers to the VSG representation.

To be able to leverage scene graphs for captioning, we need paired VSG-caption annotations. These are currently unavailable. Hence, methods requiring explicit scene graphs end up training the VSG generator and the caption generator on disparate datasets [30, 6, 14]. The current practice is to train VSG generators on the Visual Genome (VG) dataset, train TSG-to-caption generation on the COCO-captions dataset, and finally transform the VG-trained VSGs to captions utilizing the latter. We note two issues with this approach:

• The VG-trained VSGs are highly biased towards certain types of relationships (e.g., has, on); the relationship distribution is significantly long-tailed, and even the top-performing VSG generators fail to learn meaningful relationships accurately [26]. This results in noisy VSGs, which in turn degrades the quality of captions [14].

• There is an assumption in the existing approaches that TSGs and VSGs are compatible. But are VSGs and TSGs actually compatible? TSGs, when used as inputs, can generate excellent captions [30]. However, the problem arises when parameters trained for TSGs are used for VSG inputs, assuming direct compatibility. TSGs, being generated from captions, do not include every object seen in the image or all their pairwise relationships, which is the very information VSGs are designed to extract (see Fig. 2). In other words, VSGs are exhaustive while TSGs focus only on the salient objects and relationships. Thus the natural language inductive bias does not translate automatically from TSG models to VSG models. We argue that this is the major reason why previous efforts to exploit VSGs for captioning did not achieve the desired results.

Figure 1. SG2Caps first creates Visual Scene Graphs (VSG) by combining (1) the pseudolabel, i.e. the output of a black-box VSG generator, and (2) the HOI graph from an HOI inference model. Each object node of the VSG has a bounding box label. Object nodes, relations and attributes are color-coded in red, blue and green, respectively. The output of the VSG encoding is the input for the LSTM-based decoder for caption generation.

Figure 2. Characterization of TSG and VSG. (a) An image and the TSG generated from its caption. (b) VSG containing all detected objects as nodes. While the TSG only contains salient contents such as man, motorcycle, flag for natural language description, the VSG includes unnecessary details such as wheel, tire, window, sign, pole. Objects, attributes and edges are shown in pink, green and blue, respectively. (Best viewed in color)

To mitigate the above issues, we explore several novel ways to enhance VSGs in the context of captioning:

(i) Human-Object Interaction (HOI) information: Humans tend to describe visual scenes involving humans by focusing on the human-object interactions to the exclusion of other details. If HOI information is extracted from an image, it can provide an effective way to highlight the 'salient' parts in its VSG, thereby bringing it closer to its corresponding TSG. Hence, we propose to harness pre-trained HOI inferences as partial VSGs, where all detected objects (not limited to humans) in a scene form the graph nodes and the HOI information augments a few relevant nodes with appropriate relationships and attributes.

(ii) VSG grounding: A unique aspect of a VSG is that each of its nodes is grounded, i.e., has a one-to-one association with an object bounding box in the image. This spatial information can be used to capture the relationship between objects. It is well known from the scene graph generation literature that inter-object relationship classification performance greatly benefits from ground-truth bounding box locations [18, 17]. Despite this evidence, no VSG-based captioning model has yet used the spatial information of the nodes. We show that such information can significantly improve captioning performance.

In this paper, we investigate how to best leverage VSGs for caption generation, if at all. To this end, we develop a new image captioning model, termed SG2Caps, that utilizes the VSGs alone for caption generation (see Fig. 1 for the main idea). In contrast to the existing work, we do not use any image or object-level visual features; yet, we achieve competitive caption generation performance by exploiting the HOI information and VSG grounding. Directly utilizing the scene graph labels avoids expensive graph convolutions over high-dimensional CNN features, and we show that it is still effective for caption generation by capturing visual relationships. This also results in a 49% reduction in the number of trainable parameters compared with the methods that require processing of both visual features and scene graphs. Researchers have shown that image captioning algorithms are limited by dataset biases and the availability of exhaustive human annotations [2]. Our SG2Caps, on the other hand, leverages annotations beyond the available paired image-caption training data.

Our contributions are summarized below.

• We show that competitive captioning performance can be achieved on the COCO-captions dataset using VSGs alone, without any visual features.

• We experimentally show that VSGs and TSGs are not compatible with each other in the context of caption generation, and we propose to improve the learnable transformation directly from VSG to caption.

• We propose a new captioning model, SG2Caps, that utilizes VSG node groundings and HOI information to generate better captions from VSGs. While node groundings help to identify the meaningful relationships among objects (resulting in a 10-point gain in CIDEr score), HOI captures the essence of natural language communication (resulting in a 7-point gain in CIDEr score). Thus they help to close the semantic gap between TSGs and VSGs for the purpose of image captioning.

2. Related work

Image captioning: Mainstream image captioning models [20, 4] directly feed convolutional image features to a recurrent network to generate natural language. The top-down approaches in such image captioning models rely on attention-based deep models [13, 15, 31, 24], where a partially completed caption is utilized as the 'context'. An attention, based on the context, is then applied to the output of one or more layers of a convolutional neural network (CNN). These approaches can predict attention weights without any prior knowledge of objects or salient image regions. Later, bottom-up approaches [1] enabled attention to be computed at the object level using object detectors. Such object-level attention mechanisms are the state of the art in many vision-language tasks, including image captioning and visual question answering.

VSG in image captioning: A large body of work [27, 7, 18, 23, 28, 12, 33] has devised approaches that strive to perform well on VSG generation tasks on the benchmark VG dataset [9]. A few recent works [32, 30, 6, 21, 25, 11] have introduced the use of VSGs (in addition to the visual features) with the hope that encoding object attributes and relationships would improve image captioning. Some of the works used an implicit scene graph representation [32, 5], while others explored an explicit representation of relations and attributes [30, 21, 14, 10]. The explicit scene graph approaches integrate VSG features with convolutional neural network (CNN) features from the image or objects. Such explicit scene graph approaches use a scene graph generator as a black box. For example, [21] utilized FactorizableNet [12], [30, 6] utilized MotifNet [33], and [11, 14] utilized Iterative Message Passing [23] as their black-box scene graph generator. Attributes from the scene graphs, without the relationships, have also been used for image captioning [22].

Researchers found that the Visual Scene Graph alone yields poor captioning results. The literature so far has mixed opinions about the usefulness of scene graphs. While Wang et al. [21] observed improvement in caption generation using VSGs, Li and Jiang [11] did not find VSGs useful. Recently, an in-depth study concluded that it is the noise in VSGs (often in the relations) that harms image captioning performance [14].

In contrast to the existing VSG-augmented captioning models, our proposed model, SG2Caps, does not use image/object CNN features as input but utilizes the VSG labels only. We are aware of one work that used VSG labels as the only input in captioning, and it observed that scene graph labels alone as input lead to unsatisfactory results [21]. Our method outperforms Wang et al. [21] significantly and differs from it in several ways: we (a) propose novel techniques to make VSGs compatible for captioning, (b) leverage HOI information to further improve captioning, and (c) are able to achieve competitive image captioning performance with VSG labels alone.

3. Proposed model

Given an image, our SG2Caps model first constructs its VSG such that it is particularly suited for caption generation. The VSG consists of objects, their attributes, spatial locations and inter-object relationships. The VSG is encoded to generate context-aware embeddings, which in turn are fed into a long short-term memory (LSTM) language model to generate the captions (see Fig. 1).

Our VSG generator has two components, as described below.

VG150 Pseudolabel: Off-the-shelf VSG generators provide object classes as the node labels and pairwise relationships. We learn our own attribute classifier and train a VSG generator on Visual Genome (VG150) using MotifNet [33]. This pretrained VSG generator is applied to the COCO images to create scene graph pseudolabels, yielding visual scene graphs with nodes, their attributes, their locations and pairwise relationships. The objects, attributes and relationships correspond to the vocabulary of VG150, consisting of 150 object classes, 50 relations and 203 attributes. Please note that COCO does not provide scene graph annotations. Also note that the pseudolabel path is not bound to the specific generator we used in this paper; it can be any black-box visual scene graph generator.

Partial COCO scene graph: In parallel, we use an object detector pretrained on COCO images to create a list of COCO objects that serve as nodes of another graph, termed the HOI graph. Then we use an HOI inference to fill in only a few attributes and relationship edges, involving only the 'person' category. We call it a partial scene graph since it has limited relationship information with mostly disconnected nodes, but all in the COCO vocabulary. For images where no person-category objects are detected, the partial VSG does not contain any HOI augmentation; it consists of only the list of nodes created from detected object instances of other COCO categories.

For our experiments, we use the union of pseudolabels and HOI graphs as the VSG of an image. A VSG is a tuple $G = (N, E)$, where $N$ and $E$ are the sets of nodes and edges. In our formulation, there are four kinds of nodes in $N$: object nodes $o$, attribute nodes $a$, bounding box nodes $b$ and relationship nodes $r$. We denote $o_i$ as the $i$-th object, $r_{ij}$ as the relationship between $o_i$ and $o_j$, $b_i$ as the bounding box coordinates of $o_i$, and $a_{i,l}$ as the $l$-th attribute of object $o_i$. Each node in $N$ is represented by a $d$-dimensional vector, i.e. $e_o$, $e_a$, $e_b$ and $e_r$. The edges in $E$ are defined as follows:

• If an object $o_i$ has an attribute $a_{i,l}$, there is a directed edge from $o_i$ to $a_{i,l}$.

• There is a directed edge from $o_i$ to its bounding box $b_i$.

• If there is a relationship triplet $<o_i - r_{ij} - o_j>$, there are two directed edges, from $o_i$ to $r_{ij}$ and from $r_{ij}$ to $o_j$, respectively.

Below, we present the VSG encoding, i.e., how to transform the original node embeddings $e_o$, $e_a$, $e_b$, $e_r$ into a new set of context-aware embeddings $X$ that contains four kinds of $d$-dimensional embeddings: the relationship embedding $x_{r_{ij}}$ for the relationship node $r_{ij}$, and the object embedding $x_{o_i}$, attribute embedding $x_{a_i}$ and bounding box embedding $x_{b_i}$ for the object node $o_i$. We use five spatial graph convolutions $g_r$, $g_a$, $g_b$, $g_s$ and $g_o$ to generate the above embeddings. All of $g_r$, $g_a$, $g_b$, $g_s$ and $g_o$ have the same structure with independent parameters: a fully-connected layer followed by a ReLU.

Relationship embedding $x_{r_{ij}}$: Given a relationship triplet $<o_i - r_{ij} - o_j>$ in $G$, $x_{r_{ij}}$ is defined in the context of the subject ($o_i$), object ($o_j$) and predicate ($r_{ij}$) together as follows:

$$x_{r_{ij}} = g_r(e_{o_i}, e_{r_{ij}}, e_{o_j}) \tag{1}$$

Attribute embedding $x_{a_i}$: For an object node $o_i$ with its attributes $a_{i,1}, \ldots, a_{i,N_{a_i}}$ in $G$, the embedding $x_{a_i}$ is given by

$$x_{a_i} = \frac{1}{N_{a_i}} \sum_{l=1}^{N_{a_i}} g_a(e_{o_i}, e_{a_{i,l}}) \tag{2}$$

where $N_{a_i}$ is the number of attributes for $o_i$. Here the context of an object with all its attributes is incorporated.

Bounding box embedding $x_{b_i}$: Given $o_i$ with its bounding box $b_i$, $x_{b_i}$ is defined as:

$$x_{b_i} = g_b(e_{o_i}, e_{b_i}) \tag{3}$$

Object embedding $x_{o_i}$: An object node $o_i$ plays different roles based on the edge directions, i.e., whether $o_i$ acts as the subject or the object in a triplet. Following past work [30], our object embedding takes the entire triplet into consideration. We define $x_{o_i}$ as follows:

$$x_{o_i} = \frac{1}{N_{r_i}} \Big[ \sum_{o_j \in sbj(o_i)} g_s(e_{o_i}, e_{o_j}, e_{r_{ij}}) + \sum_{o_k \in obj(o_i)} g_o(e_{o_k}, e_{o_i}, e_{r_{ki}}) \Big] \tag{4}$$

where $N_{r_i} = |sbj(o_i)| + |obj(o_i)|$ is the number of relationship triplets involving $o_i$. Each node $o_j \in sbj(o_i)$ acts as an object while $o_i$ acts as a subject.

Given an image $I$, we want to generate a natural language sentence $S = \{w_1, w_2, \ldots, w_T\}$ describing the image. We follow an encoder-decoder architecture. The encoder in SG2Caps takes the VSG encoding as input (in contrast to CNN image features in popular captioning models), followed by an attention mechanism [1]. The decoder is an LSTM-based language decoder.

Given the ground truth captions $S^*$, we train our encoder-decoder model using one of two objectives:

(i) Minimize a cross-entropy loss: $L_{XY} = -\log P(S^*)$

(ii) Maximize a reinforcement learning (RL) based reward [15]: $R_{RL} = E_{S^s \sim P(S)}[rw(S^s; S^*)]$, where $rw(\cdot)$ provides a sentence-level metric (e.g. the CIDEr metric) for the sampled sentence $S^s$ and the ground truth $S^*$.

4. Experiments

The COCO-Captions [3]: We conducted the experiments and evaluated our proposed SG2Caps model on the Karpathy split of the COCO-Captions dataset for the offline test. This split partitions the images into train/val/test sets, and each image has 5 captions.

Visual Genome (VG) [9]: We used the Visual Genome (VG) dataset to pre-train a scene graph generator and attribute classifier. VG has scene graph annotations, e.g. object categories, object attributes and pairwise relationships, that we utilized to train the object proposal detector, attribute classifier and relationship classifier as our visual scene graph parser. We first pre-trained a Faster R-CNN based object detector [16] using object annotations from Visual Genome, with an additional 2-layer multi-layer perceptron for 203 one-vs-all attribute classifiers. Then the Neural Motif [33] model serves as the ROI head to predict the pairwise relationships.

Verbs in COCO: The V-COCO dataset contains a subset of COCO images and was created for evaluating the HOI task. It annotates 16K people instances in 10K images with their actions and associates objects in the scene with different semantic roles for each action. We utilize VSGNet [19], a pre-trained HOI model, to generate the inference on COCO images.
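Returning to the VSG encoding of Section 3, Eqs. (1)-(4) can be made concrete with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the embedding size `D`, the random weights, the helper `make_g`, the function `encode_vsg`, the dictionary-based graph layout, and the choice to skip objects that appear in no triplet are all hypothetical. It also assumes at most one relationship per ordered object pair.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding size d; illustrative value, not taken from the paper


def make_g(n_inputs):
    """Fully connected layer + ReLU over the concatenation of n_inputs vectors."""
    W = rng.normal(scale=0.1, size=(n_inputs * D, D))
    b = np.zeros(D)
    return lambda *vecs: np.maximum(np.concatenate(vecs) @ W + b, 0.0)


# Five functions with the same structure but independent parameters.
g_r, g_s, g_o = make_g(3), make_g(3), make_g(3)
g_a, g_b = make_g(2), make_g(2)


def encode_vsg(e_obj, e_attr, e_box, triplets):
    """Return (x_r, x_a, x_b, x_o) as dicts of D-dimensional vectors.

    e_obj:    {i: embedding of object i}
    e_attr:   {i: [attribute embeddings of object i]} (list may be empty)
    e_box:    {i: embedding of the bounding box of object i}
    triplets: [(i, e_rel, j), ...] with subject i, predicate embedding, object j
    """
    # Eq. (1): relationship embedding from subject, predicate and object.
    x_r = {(i, j): g_r(e_obj[i], e_rel, e_obj[j]) for i, e_rel, j in triplets}
    # Eq. (2): average over all attributes of an object.
    x_a = {i: np.mean([g_a(e_obj[i], a) for a in attrs], axis=0)
           for i, attrs in e_attr.items() if attrs}
    # Eq. (3): object combined with its bounding box.
    x_b = {i: g_b(e_obj[i], e_box[i]) for i in e_obj}
    # Eq. (4): average over triplets where object i is subject (g_s) or object (g_o).
    x_o = {}
    for i in e_obj:
        terms = [g_s(e_obj[i], e_obj[j], e_rel) for s, e_rel, j in triplets if s == i]
        terms += [g_o(e_obj[k], e_obj[i], e_rel) for k, e_rel, o in triplets if o == i]
        if terms:  # assumption: objects in no triplet get no x_o entry
            x_o[i] = np.mean(terms, axis=0)
    return x_r, x_a, x_b, x_o


# Toy graph: object 0 (two attributes) --relation--> object 1 (no attributes).
e_obj = {0: rng.normal(size=D), 1: rng.normal(size=D)}
e_attr = {0: [rng.normal(size=D), rng.normal(size=D)], 1: []}
e_box = {0: rng.normal(size=D), 1: rng.normal(size=D)}
triplets = [(0, rng.normal(size=D), 1)]
x_r, x_a, x_b, x_o = encode_vsg(e_obj, e_attr, e_box, triplets)
```

In a trained model the weights of each `g` would be learned jointly with the decoder; here they are random, so only the shapes and the wiring of the four embeddings are meaningful.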

Processing pseudolabels:

The VSG generated by an inference on COCO-caption images utilizes a pre-trained scene graph generator model [33] as a black box (refer to Sec. 4.1). The predicted scene graphs contain a large amount of noise. They can include many object proposals, often duplicates. The generator also predicts attributes for each proposal and a relationship for each object pair. Below we describe the method to re-purpose the above VSGs so they become suitable for caption generation.

First, we discard less confident object predictions (confidence score below 0.25) and apply non-maximum suppression (NMS) on object proposals using an intersection-over-union (IOU) threshold. The motivation behind re-purposing the output of the black-box VSG generator is as follows. Black-box VSG generators were trained to achieve higher performance as measured by a retrieval metric, Recall. Recall rewards a higher fraction of ground truth returned, but does not penalize duplicate objects. However, a VSG should be free of duplicate objects in order to be used for caption generation. Similarly, we discard relationships returned by the black-box generator if the confidence score is below 0.3, and keep only the best attribute per node if its confidence is above 90%. The generator predicts attributes for each object and relations for each pair, and only a few will have confident predictions. Finally, the detected object categories are manually mapped to the closest word string in the caption dataset for the corresponding labels.
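The pseudolabel clean-up described above can be sketched as a short filtering routine. This is a minimal illustration under assumptions, not the authors' code: the input layout (dicts with `label`, `score`, `box`, `attrs`), the function names, and the use of class-agnostic greedy NMS are hypothetical, and the NMS IOU threshold is left as a required argument `nms_iou` rather than given a default. The manual label-mapping step is not modeled.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def clean_pseudolabel(objects, relations, nms_iou,
                      obj_thr=0.25, rel_thr=0.3, attr_thr=0.9):
    """Filter a noisy scene graph prediction.

    objects:   list of {"label", "score", "box", "attrs": [(name, score), ...]}
    relations: list of (subject_index, predicate, object_index, score)
    Returns (nodes, edges) with indices renumbered over the kept objects.
    """
    # 1. Drop low-confidence objects, then greedy NMS in descending score order.
    order = sorted((i for i, o in enumerate(objects) if o["score"] >= obj_thr),
                   key=lambda i: -objects[i]["score"])
    keep = []
    for i in order:
        if all(iou(objects[i]["box"], objects[k]["box"]) <= nms_iou for k in keep):
            keep.append(i)
    remap = {old: new for new, old in enumerate(keep)}

    # 2. Keep only the best attribute per node, and only if it is confident.
    nodes = []
    for old in keep:
        o = objects[old]
        best = max(o["attrs"], key=lambda a: a[1], default=None)
        attr = best[0] if best is not None and best[1] > attr_thr else None
        nodes.append({"label": o["label"], "box": o["box"], "attr": attr})

    # 3. Keep confident relations whose endpoints both survived.
    edges = [(remap[s], p, remap[t]) for s, p, t, score in relations
             if score >= rel_thr and s in remap and t in remap]
    return nodes, edges


# Toy prediction: a duplicated "man", a low-confidence "dog" and a "bike".
objects = [
    {"label": "man", "score": 0.9, "box": (0, 0, 10, 10),
     "attrs": [("tall", 0.95), ("old", 0.5)]},
    {"label": "man", "score": 0.8, "box": (1, 1, 10, 10), "attrs": []},
    {"label": "dog", "score": 0.2, "box": (50, 50, 60, 60), "attrs": []},
    {"label": "bike", "score": 0.7, "box": (20, 20, 30, 30),
     "attrs": [("red", 0.6)]},
]
relations = [(0, "on", 3, 0.8), (1, "on", 3, 0.9), (0, "near", 3, 0.1)]
nodes, edges = clean_pseudolabel(objects, relations, nms_iou=0.5)
```

With the illustrative `nms_iou=0.5`, the second "man" box is suppressed as a duplicate of the first, so the relation attached to it is dropped as well.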

Processing HOI graph:

The HOI graph is formed by extracting relationships and attributes from the HOI inference network [19], which utilizes the instance detection results from detectron2. Images where human objects are detected with a score of 0.5 or more are selected as the input of the HOI network. The outputs of the HOI network are $<agent(human) - instrument - object>$ triplets associated with a relation, r. For example, in the case of "a person hitting a ball with a bat", the triplet takes the form of $<person(agent) - bat - ball>$ for the action "hitting". To transform such a triplet into a scene-graph-like structure, all agents, objects and instruments become the nodes of the graph, and the corresponding relation (r) adds the relationship between the subject and the object. The list of such HOI relations consists of 10 different semantic verb actions. On the other hand, for inferences without objects, e.g. stand, we add them as an attribute, e.g. standing, to the subject, since it is not possible to form any directed edge between a subject and an object. The list of such object-less actions that are utilized as attributes consists of 16 different semantic verbs. HOI graphs generated in the above way have such relations or attributes for only a subset of the training and test images. Other images transform to graphs with only the detected objects as nodes, with only one relation $<object, AND, object>$ per image. Since the goal of utilizing scene graphs for captioning is to enrich the model with object relationships, we argue that the spatial locations of nodes should be leveraged. Our VSG thus contains a list of nodes, each consisting of an object class label, bounding box and attributes, and a set of edges.

For our experiments, we use the union of pseudolabels and HOI graphs as the VSG of an image. For inference in language generation, we use greedy search. Our implementation is built upon the source code [29] of SGAE [30]. Our models are trained on a single NVIDIA 1080Ti GPU running PyTorch 0.4.0 with Python 2.7.15. We evaluate our caption generation model on standard metrics such as BLEU@1 (B@1), BLEU@4 (B@4), ROUGE (R), METEOR (M), CIDEr (C) and SPICE (S).
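The conversion of HOI inferences into a partial scene graph, as described above, might be sketched as follows. This is a hedged illustration only: the record layout for detections and HOI outputs (`agent`, `action`, `object`, `instrument` keys) and the function name `hoi_to_graph` are hypothetical and do not reflect the actual output format of VSGNet or detectron2. The fallback `<object, AND, object>` relation and the verb-to-gerund rewriting (stand to standing) are omitted.

```python
def hoi_to_graph(detections, hoi_outputs, human_thr=0.5):
    """Build a partial scene graph from object detections and HOI inferences.

    detections:  list of {"label", "score", "box"}; every detection becomes a node
    hoi_outputs: list of {"agent": idx, "action": str,
                          "object": idx or None, "instrument": idx or None}
    Returns (nodes, edges), where edges are (subject_idx, action, object_idx).
    """
    nodes = [{"label": d["label"], "box": d["box"], "attrs": []}
             for d in detections]
    edges = []

    # HOI inference is only used when a confident person detection exists.
    has_human = any(d["label"] == "person" and d["score"] >= human_thr
                    for d in detections)
    if not has_human:
        return nodes, edges

    for h in hoi_outputs:
        if h.get("object") is not None:
            # Action with an object: directed edge from the agent to the object.
            # The instrument, if any, is already present as a plain node.
            edges.append((h["agent"], h["action"], h["object"]))
        else:
            # Object-less action (e.g. "stand"): stored as an agent attribute.
            nodes[h["agent"]]["attrs"].append(h["action"])
    return nodes, edges


# "A person hitting a ball with a bat", plus an object-less action.
detections = [
    {"label": "person", "score": 0.9, "box": (0, 0, 50, 100)},
    {"label": "sports ball", "score": 0.8, "box": (60, 40, 70, 50)},
    {"label": "baseball bat", "score": 0.7, "box": (40, 20, 65, 30)},
]
hoi_outputs = [
    {"agent": 0, "action": "hit", "object": 1, "instrument": 2},
    {"agent": 0, "action": "stand", "object": None, "instrument": None},
]
nodes, edges = hoi_to_graph(detections, hoi_outputs)

# Without a person detection the graph is just a list of disconnected nodes.
nodes_no_human, edges_no_human = hoi_to_graph(detections[1:], hoi_outputs)
```

Keeping every detection as a node, even when no HOI edge touches it, mirrors the statement that the partial graph has mostly disconnected nodes in the COCO vocabulary.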

TSG-VSG incompatibility:

First we show the incompatibility between TSG and VSG in Table 1. All the entries in this table are from our reproduced models. TSG, when used as an input, generates excellent captions using the GCN and LSTM language model, as can be seen from the first row. The above TSG-caption model was trained with cross-entropy loss. However, if we simply use that model and perform inference using the pseudolabels as inputs, it performs significantly worse. Although the HOI augmentation improves the CIDEr score by 7 points, the overall caption generation performance still remains poor. The takeaway message is that although VSG and TSG are similar types of representations, they are not directly compatible with each other for caption generation.

Table 1. Incompatibility between TSG and VSG. Performance of the caption model trained on textual scene graphs (TSG) while evaluating on different scene graphs as input. The TSG row denotes sentence scene graphs as inputs. † denotes our reproduced results using [29]. PL denotes the pseudolabel and (PL + HOI) denotes the union of the pseudolabel and HOI graph as the input, respectively. Inference on the TSG-trained model fails to generate decent captions with VSG input.

Model       B@1    B@4    M      R      C      S
TSG †
PL          46.7   7.7    14.6   35.7   31.5   9.0
PL + HOI

Caption generation performance comparison:

Table 2 shows the main result of our proposed SG2Caps method. The bottom half of the table denotes methods that use only scene graph labels as input to the captioning model. Our SG2Caps outperforms the existing G-only [21] method by a large margin. Our graph construction differs from it in a few ways. Wang et al. used FactorizableNet [12] as the relation detection model on top of the RPN from Faster R-CNN. We use a simpler relation model [33]. Wang et al. did not use bounding box locations, which we find important for caption generation. With HOI augmentation and the bounding box feature, SG2Caps produces competitive results (CIDEr score of 109.7), close to the SoTA models that rely on image or object detection CNN features.

Table 2 also reports the performance of the state-of-the-art captioning models that use both the visual features and explicit scene graphs, taken from the corresponding papers [30, 21, 22, 14]. In our work, we focus on scene graphs for caption generation, and thus limit our experiments and comparisons to the LSTM language model. Results in Table 2 show that our SG2Caps substantially narrows the gap between the performance of state-of-the-art caption generation models and a scene-graph-only model.

We also note that the number of trainable parameters for captioning models that use both the high-dimensional CNN features and scene graph labels is significantly higher than for our scene-graph-only model. For example, SG2Caps has roughly half as many parameters as the SGAE [30] full model (21M vs 41M) for the same language decoder model. (More details are available in the supplementary material.)

Ablation experiments:

Table 3 shows the results of our ablation experiments. First, we observe that when we directly use the post-processed pseudolabels for training a captioning model, it largely beats the existing G-only model from [21] (CIDEr score 93 vs 71). This is our baseline model. The BBox model corresponds to adding the node groundings on top of the baseline. SG2Caps, our final model, uses the HOI graph atop the BBox model. We observe that when we incorporate the node groundings in the pseudolabels, our BBox model achieves a 10-point gain in CIDEr score. Finally, HOI augmentation results in a further 7-point gain in CIDEr score. The performance of the existing G-only caption generator from the literature is also shown for comparison.

Table 2. Comparison with state-of-the-art captioning models. We compare with both image captioning SoTA methods (that use both image features and scene graph labels) and methods that use scene graphs as the only input. Here, all methods use an LSTM language model. RSG-G1 and RSG-G2 refer to the G-only setup from RSG without and with gate-wise gating, respectively.

Model             B@4    M      R      C       S
visual scene graph + visual features
Attribute [22]    31.0   26.0   -      94.0    -
R-SCAN [10]       -      -      -      126.1   21.8
KMSL [11]         36.3   27.6   56.8   120.2   21.4
SGAE [30]
SGC [14]          35.5   -      56.0   109.9   19.8
RSG [21]          34.5   26.8   55.9   108.6   20.3
visual scene graph only
RSG-G1 [21]       22.8   20.6   46.7   66.3    13.5
RSG-G2 [21]       22.9   21.1   47.5   70.7    14.0
SG2Caps (ours)    32.8   26.0   55.5   109.7   19.2

Table 3. Ablation results: Performance of our SG2Caps model, when the object groundings and HOI are leveraged, in comparison with the G-only model from the literature.

Model      bbox   B@1    B@4    M      R      C      S
RSG-G2     -      -      22.9   21.1   47.5   70.7   14.0
Baseline   -      72.0   29.3   24.5   52.6   92.6   17.8
BBox       ✓
SG2Caps    ✓

Model size

We have compared the model size of SG2Caps with SGAE [30], which utilizes both image features and scene graph labels. We have counted the trainable parameters needed for the captioning model directly from their supplementary material. Unlike SGAE, SG2Caps does not require the Multi-Modal GCN (MGCN) module over high-dimensional visual features and scene graph features before passing them to the language decoder. This results in 49.4% savings in the number of trainable parameters (41.7 million vs 21.1 million). Please note that the above trainable parameters are only for the language decoder part; we do not count the parameters needed to extract the visual features or to run the scene graph generator for any of the methods.

Figure 3. Qualitative examples: generated captions from different baseline models on the Karpathy test split for COCO image ids 177861, 45710 and 553879, respectively.

Table 4. Comparison of number of trainable parameters.

Model       SGAE         SG2Caps
GCN         21,102,000   21,107,000
MGCN        20,616,000   NA
Total       41,718,000   21,107,000
Reduction   0            49.4%
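The reduction figure in Table 4 follows directly from the listed parameter counts; the short check below simply redoes that arithmetic. The variable names are illustrative, and the numbers are copied from Table 4.

```python
# Trainable parameter counts as listed in Table 4.
sgae = {"GCN": 21_102_000, "MGCN": 20_616_000}
sg2caps = {"GCN": 21_107_000}  # SG2Caps has no MGCN module

total_sgae = sum(sgae.values())
total_sg2caps = sum(sg2caps.values())

# Relative reduction of SG2Caps with respect to the SGAE full model.
reduction = 1 - total_sg2caps / total_sgae

print(f"{total_sgae:,} vs {total_sg2caps:,}: {reduction:.1%} fewer parameters")
# prints "41,718,000 vs 21,107,000: 49.4% fewer parameters"
```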

Qualitative results:

Our baseline model generates convincing caption sentences. As shown in Fig. 3, the caption quality improves from the baseline to the BBox and SG2Caps models. We also show output captions generated by SGAE [30], one of the best captioning models from the literature, which uses both the visual features and scene graphs. SG2Caps is able to generate high-quality and competitive captions without using visual features (refer to Fig. 3).

Additional qualitative results for COCO images 258628, 110877, 21900 and 548361 are shown in Figure 4. For each image, the visualizations of the HOI partial graph and the SGDet pseudolabel graph are displayed with the captions generated from different models as well as the ground truth.

5. Conclusion

Explicit encodings of objects, attributes and relations are useful information for image captioning. However, blindly using visual scene graphs for captioning fails to produce reasonable caption sentences. The proposed SG2Caps pipeline enables networks pre-trained for (1) SGDet on other scene graph datasets, and (2) semantic roles on HOI datasets, to greatly reduce the gap in accuracy on COCO caption datasets, indicating that strong captioning models can be achieved with the low-dimensional object and relation label space only. These results further strengthen our defense of scene graphs for image captioning. We hope our observations can open up new opportunities for vision and language research in general.

Ethical/ Societal impact statement

Visual content on the web or from photographs can be made more accessible to visually impaired people via automatic image captioning combined with text-to-speech. Industries such as commerce, education, digital libraries, web search and social media platforms utilize image captioning to mine visual content for image indexing, image retrieval and data summarization [8]. Although this helps with managing online content and may provide value to the users, it can also compromise the users' privacy, e.g. by summarizing user behavior or preferences for targeted advertisement. Importantly, image captioning algorithms are still limited by dataset biases and the availability of exhaustive human annotations [2]. In this paper, we attempt to address the latter by leveraging annotations beyond the paired training data via explicit use of scene graphs. More research is necessary to reach human-level accuracy and diversity.

Figure 4. Subjective results of COCO images: (a) 258628, (b) 110877, (c) 21900, (d) 548361.

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, June 2018.
[2] Kaylee Burns, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
[3] Xinlei Chen, H. Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. ArXiv, abs/1504.00325, 2015.
[4] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[5] Lizhao Gao, Bo Wang, and Wenmin Wang. Image captioning with scene-graph based semantic concepts. In ICMLC, pages 225–229, 2018.
[6] Jiuxiang Gu, Shafiq R. Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. Unpaired image captioning via scene graph alignments. CoRR, abs/1903.10658, 2019.
[7] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
[8] Gunhee Kim, L. Sigal, and E. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. CVPR, pages 4225–4232, 2014.
[9] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. CoRR, abs/1602.07332, 2016.
[10] Kuang-Huei Lee, Hamid Palangi, Xi Chen, Houdong Hu, and Jianfeng Gao. Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. CoRR, abs/1909.09953, 2019.
[11] X. Li and S. Jiang. Know more say less: Image captioning based on scene graphs. IEEE Transactions on Multimedia, 21(8):2117–2130, 2019.
[12] Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. Factorizable net: An efficient subgraph-based framework for scene graph generation. In ECCV, 2018.
[13] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
[14] Victor Milewski, Marie-Francine Moens, and Iacer Calixto. Are scene graphs good enough to improve image captioning? In AACL-IJCNLP, pages 29–34, 10 2020.
[15] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, pages 7008–7024, 2017.
[16] Kaihua Tang. A scene graph generation codebase in pytorch, 2020. https://github.com/KaihuaTang/Scene-Graph-Benchmark.pytorch.
[17] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In CVPR, 2020.
[18] Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. In CVPR, 2019.
[19] Oytun Ulutan, A S M Iftekhar, and B. S. Manjunath. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR, 2020.
[20] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, June 2015.
[21] Dalin Wang, Daniel Beck, and Trevor Cohn. On the role of scene graphs in image captioning. In EMNLP WS, 2019.
[22] Qi Wu, Chunhua Shen, Peng Wang, A. Dick, and A. V. D. Hengel. Image captioning and visual question answering based on attributes and external knowledge. IEEE TPAMI, 40:1367–1381, 2018.

[23] Danfei Xu, Yuke Zhu, Christopher Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
[24] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057. JMLR.org, 2015.
[25] Ning Xu, An-An Liu, Jing Liu, Weizhi Nie, and Yuting Su. Scene graph captioner: Image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, 58, 12 2018.
[26] Pengfei Xu, Xiaojun Chang, Ling Guo, Po-Yao Huang, Xiaojiang Chen, and Alex Hauptmann. A survey of scene graph: Generation and application. EasyChair Preprint no. 3385, EasyChair, 2020.
[27] Shaotian Yan, Chen Shen, Zhongming Jin, Jianqiang Huang, Rongxin Jiang, Yaowu Chen, and Xian-Sheng Hua. PCPL: predicate-correlation perception learning for unbiased scene graph generation. In ACM Multimedia, 2020.
[28] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation. In ECCV, September 2018.
[29] Xu Yang. Sgae/ pytorch 0.4.0. https://github.com/yangxuntu/SGAE, 2019. [Accessed: 2020-09-25].
[30] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, June 2019.
[31] Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Russ R Salakhutdinov. Review networks for caption generation. In NIPS, pages 2361–2369, 2016.
[32] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, September 2018.
[33] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
[34] Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, and Yin Li. Comprehensive image captioning via scene graph decomposition. In ECCV, 2020.