Scene Graph Generation with External Knowledge and Image Reconstruction
Jiuxiang Gu*, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, Mingyang Ling
ROSE Lab, Interdisciplinary Graduate School, Nanyang Technological University, Singapore; Adobe Research, USA; University of Georgia, USA; Google Cloud AI, USA
{jgu004, asjfcai}@ntu.edu.sg, {hazhao, zlin}@[email protected], [email protected]
Abstract
Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attribute and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, or often come with noisy and missing annotations, which makes the development of a reliable scene graph prediction model very challenging. In this paper, we propose a novel scene graph generation algorithm with external knowledge and an image reconstruction loss to overcome these dataset issues. In particular, we extract commonsense knowledge from an external knowledge base to refine object and phrase features for improving generalizability in scene graph generation. To address the bias from noisy object annotations, we introduce an auxiliary image reconstruction path to regularize the scene graph generation network. Extensive experiments show that our framework can generate better scene graphs, achieving state-of-the-art performance on two benchmark datasets: the Visual Relationship Detection and Visual Genome datasets.
1. Introduction
With recent breakthroughs in deep learning and image recognition, higher-level visual understanding tasks, such as visual relationship detection, have become popular research topics [9, 19, 15, 40, 44]. A scene graph, as an abstraction of objects and their complex relationships, provides rich semantic information about an image. It involves the detection of all ⟨subject-predicate-object⟩ triplets in an image and the localization of all objects. A scene graph provides a structured representation of an image that can support a wide range of high-level visual tasks, including image captioning [12, 14, 13, 43], visual question answering [36, 38, 47], image retrieval [11, 21], and image generation [20].
* This work was done during the author's internship at Adobe Research.
Figure 1: Conceptual illustration of our scene graph learning model. The left (green) part illustrates the image-to-scene-graph generation, the right (blue) part illustrates the image-level regularizer that reconstructs the image based on object labels and bounding boxes. The commonsense knowledge reasoning (top) is introduced to the scene graph generation process.

However, it is not easy to extract scene graphs from images, since it involves not only detecting and localizing pairs of interacting objects but also recognizing their pairwise relationships. Currently, there are two categories of approaches for scene graph generation. Both categories group object proposals into pairs and use the phrase features (features of their union area) for predicate inference. The difference between the two categories lies in their procedures. The first category detects the objects first and then recognizes the relationships between those objects [5, 28, 29]. The second category jointly identifies the objects and their relationships based on the object and relationship proposals [27, 25, 37]. Despite the promising progress introduced by these approaches, most of them suffer from the limitations of existing scene graph datasets. First, to comprehensively depict an image using a scene graph, a wide variety of relation triplets ⟨subject-predicate-object⟩ is required. Unfortunately, current datasets only capture a small portion of this knowledge [29], e.g., the Visual Relationship Detection (VRD) dataset. Training on such a dataset with a long-tail distribution will bias the prediction model towards the most frequent relationships. Second, predicate labels are highly determined by the identification of object pairs [46]. However, due to the difficulty of exhaustively labeling bounding boxes of all instances of each object, current large-scale crowd-sourced datasets like Visual Genome (VG) [22] are contaminated by noise (e.g., missing annotations and meaningless proposals). Such a noisy dataset will inevitably result in a poor performance of the trained object detector [3], which further hinders the performance of predicate detection.

For human beings, we are capable of reasoning over visual elements of an image based on our commonsense knowledge. For example, in Figure 1, humans have the background knowledge: the subject (woman) appears / stands on something; the object (snow) enhances the evidence of the predicate (skiing). Commonsense knowledge can also help correct object detection. For example, the specific external knowledge for skiing benefits inference of the object (snow) as well. This motivates us to leverage commonsense knowledge to help scene graph generation.

Meanwhile, despite the crucial role of object labels for relationship prediction, existing datasets are very noisy due to the significant amount of missing object annotations. However, our goal is to obtain scene graphs with a more complete scene representation. Motivated by this goal, we regularize our scene graph generation network by reconstructing the image from detected objects. Considering the case in Figure 1, a method might recognize snow as grass by mistake. If we generate an image based on the falsely predicted scene graph, this minor error would be heavily penalized, even though most of snow's relationships might be correctly identified.

The contributions of this paper are threefold.
1) We propose a knowledge-based feature refinement module to incorporate commonsense knowledge from an external knowledge base. Specifically, the module extracts useful information from ConceptNet [35] to refine object and phrase features before scene graph generation. We exploit the Dynamic Memory Network (DMN) [23] to implement multi-hop reasoning over the retrieved facts and infer the most probable relations accordingly. 2) We introduce an image-level supervision module that reconstructs the image to regularize our scene graph generation model. We view this auxiliary branch as a regularizer, which is only present during training. 3) We conduct extensive experiments on two benchmark datasets: the VRD and VG datasets. Our empirical results demonstrate that our approach can significantly improve the state of the art in scene graph generation.
2. Related Works
Incorporating Knowledge in Neural Networks.
There has been growing interest in improving data-driven models with external Knowledge Bases (KBs) in the natural language processing [17, 4] and computer vision communities [24, 1, 6]. Large-scale structured KBs are constructed either by manual effort (e.g., Wikipedia, DBpedia [2]) or by automatic extraction from unstructured or semi-structured data (e.g., ConceptNet). One direction for improving data-driven models is to distill external knowledge into Deep Neural Networks [39, 45, 18]. Wu et al. [38] encode the mined knowledge from DBpedia [2] into a vector and combine it with visual features to predict answers. Instead of aggregating the textual vectors with an average-pooling operation [38], Li et al. [24] distill the retrieved context-relevant external knowledge triplets through a DMN for open-domain visual question answering. Unlike [38, 24], Yu et al. [45] extract linguistic knowledge from training annotations and Wikipedia, and distill this knowledge to regularize training and provide extra cues for inference. A teacher-student framework is adopted to minimize the KL-divergence between the prediction distributions of the teacher and the student.
Visual Relationship Detection.
Visual relationship detection has been investigated by many works in the last decade [21, 8, 7, 31]. Lu et al. [29] introduce generic visual relationship detection as a visual task, where they detect objects first and then recognize predicates between object pairs. Recently, some works have explored message passing for context propagation and feature refinement [41, 27]. Xu et al. [41] construct the scene graph by refining the object and relationship features jointly with message passing. Dai et al. [5] exploit the statistical dependencies between objects and their relationships and refine the posterior probabilities iteratively with a Conditional Random Field (CRF) network. More recently, Zellers et al. [46] achieve a strong baseline by predicting relationships with frequency priors. To deal with the large number of potential relations between objects, Yang et al. [42] propose a relation proposal network that prunes out uncorrelated object pairs, and capture the contextual information with an attentional graph convolutional network. In [25], a clustering method is proposed that factorizes the full graph into subgraphs, where each subgraph is composed of several objects and a subset of their relationships.

Most related to our work are the approaches proposed by Li et al. [25] and Yu et al. [45]. Unlike [25], which focuses on efficient scene graph generation, our approach addresses the long-tail distribution of relationships with commonsense cues along with visual cues. Unlike [45], which leverages linguistic knowledge to regularize the network, our knowledge-based module improves the feature refining procedure by reasoning over a basket of commonsense knowledge retrieved from ConceptNet.

Figure 2: Overview of the proposed scene graph generation framework. The left part generates a scene graph from the input image. The right part is an auxiliary image-level regularizer which reconstructs the image based on the detected object labels and bounding boxes. After training, we discard the image reconstruction branch.
3. Methodology
Figure 2 gives an overview of our proposed scene graph generation framework. The entire framework can be divided into the following steps: (1) generate object and subgraph proposals for a given image; (2) refine object and subgraph features with external knowledge; (3) generate the scene graph by recognizing object categories with object features and recognizing object relations by fusing subgraph features and object feature pairs; (4) reconstruct the input image via an additional generative path. During training, we use two types of supervision: scene graph level supervision and image-level supervision. For scene graph level supervision, we optimize our model by guiding the generated scene graph with the ground-truth object and predicate categories. The image-level supervision is introduced to overcome the aforementioned missing annotations by reconstructing the image from objects and enforcing the reconstructed image to be close to the original image.
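As a roadmap for the components detailed below, the following is a minimal PyTorch-style sketch of this four-step forward pass; all module names (rpn, refiner, kb_module, rel_head, obj_head, generator) are placeholders of ours, not the released implementation.

```python
import torch.nn as nn

class KBGANSketch(nn.Module):
    """Skeleton of the pipeline: proposals -> refinement -> scene graph -> reconstruction."""
    def __init__(self, rpn, refiner, kb_module, rel_head, obj_head, generator):
        super().__init__()
        self.rpn = rpn              # step (1): object / subgraph proposals
        self.refiner = refiner      # step (2a): object-subgraph inter-refinement
        self.kb_module = kb_module  # step (2b): ConceptNet-based feature refinement
        self.rel_head = rel_head    # step (3): predicate classifier
        self.obj_head = obj_head    # step (3): object classifier
        self.generator = generator  # step (4): object-to-image reconstruction (training only)

    def forward(self, image, reconstruct=False):
        obj_feats, subg_feats, boxes = self.rpn(image)
        obj_feats, subg_feats = self.refiner(obj_feats, subg_feats)
        obj_feats = self.kb_module(obj_feats)             # inject commonsense knowledge
        obj_logits = self.obj_head(obj_feats)
        rel_logits = self.rel_head(obj_feats, subg_feats)
        recon = self.generator(obj_logits, boxes) if reconstruct else None
        return obj_logits, rel_logits, recon
```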
Object Proposal Generation.
Given an image $I$, we first use the Region Proposal Network (RPN) [33] to extract a set of object proposals:

$[o_0, \cdots, o_{N-1}] = f_{\mathrm{RPN}}(I) \quad (1)$

where $f_{\mathrm{RPN}}(\cdot)$ stands for the RPN module, and $o_i$ is the $i$-th object proposal represented by a bounding box $r_i = [x_i, y_i, w_i, h_i]$, with $(x_i, y_i)$ being the coordinates of the top-left corner and $w_i$ and $h_i$ being the width and the height of the bounding box, respectively. For any two different objects $\langle o_i, o_j \rangle$, there are two possible relationships in opposite directions. Thus, for $N$ object proposals, there are in total $N(N-1)$ potential relations. Although more object proposals lead to a bigger scene graph, the number of potential relations increases dramatically, which significantly increases the computational cost and deteriorates the inference speed. To address this issue, subgraphs are introduced in [25] to reduce the number of potential relations by clustering.
Subgraph Proposal Construction.
We adopt the clustering approach proposed in [25]. In particular, for a pair of object proposals, a subgraph proposal is constructed as the union box, with the confidence score being the product of the scores of the two object proposals. Subgraph proposals are then suppressed by non-maximum suppression (NMS). In this way, a candidate relation can be represented by two objects and one subgraph: $\langle o_i, o_j, s_{ik} \rangle$, where $i \neq j$ and $s_{ik}$ is the $k$-th subgraph of all the subgraphs associated with $o_i$, which contains $o_j$ as well as some other object proposals. Following [25], we represent a subgraph as a feature map, $s_{ik} \in \mathbb{R}^{D \times K_s \times K_s}$, and an object as a feature vector, $o_i \in \mathbb{R}^{D}$, where $D$ and $K_s$ are the dimensions.
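The union-box construction and score-product suppression can be sketched as follows, assuming boxes are given as [x1, y1, x2, y2] tensors (the $r_i = [x, y, w, h]$ boxes above would need conversion) and using torchvision's NMS in place of the authors' exact suppression step.

```python
import torch
from torchvision.ops import nms

def build_subgraph_proposals(boxes, scores, iou_thresh=0.5):
    """Union-box subgraph proposals for every ordered object pair, scored by the
    product of the two object scores, then pruned with NMS (a sketch of [25])."""
    n = boxes.size(0)
    i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    keep_pairs = i != j                      # the N(N-1) ordered pairs
    i, j = i[keep_pairs], j[keep_pairs]
    union = torch.stack([
        torch.minimum(boxes[i, 0], boxes[j, 0]),
        torch.minimum(boxes[i, 1], boxes[j, 1]),
        torch.maximum(boxes[i, 2], boxes[j, 2]),
        torch.maximum(boxes[i, 3], boxes[j, 3]),
    ], dim=1)
    pair_scores = scores[i] * scores[j]      # confidence of each subgraph proposal
    kept = nms(union, pair_scores, iou_thresh)
    return union[kept], pair_scores[kept], i[kept], j[kept]
```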
Object and Subgraph Inter-refinement.
Considering that each object $o_i$ is connected to a set of subgraphs $S_i$ and each subgraph $s_k$ is associated with a set of objects $O_k$, we refine the object vector (resp. the subgraph) by attending the associated subgraph feature maps (resp. the associated object vectors):

$\bar{o}_i = o_i + f_{s \to o}\Big(\textstyle\sum_{s_{ik} \in S_i} \alpha_k^{s \to o} \cdot s_{ik}\Big) \quad (2)$

$\bar{s}_k = s_k + f_{o \to s}\Big(\textstyle\sum_{o_{ki} \in O_k} \alpha_i^{o \to s} \cdot o_{ki}\Big) \quad (3)$

where $\alpha_k^{s \to o}$ (resp. $\alpha_i^{o \to s}$) is the output of a softmax layer indicating the weight for passing $s_{ik}$ (resp. $o_{ki}$) to $o_i$ (resp. $s_k$), and $f_{s \to o}$ and $f_{o \to s}$ are non-linear mapping functions. This part is similar to [25]. Note that due to the different dimensions of $o_i$ and $s_k$, pooling or spatial-location-based attention needs to be applied for the $s \to o$ or $o \to s$ refinement, respectively. Interested readers are referred to [25] for details.
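A minimal sketch of one $s \to o$ refinement pass of Eq. (2), under the simplifying assumption that subgraph maps are average-pooled before attention; the attention layer and the mapping function $f_{s \to o}$ are our own placeholders, and the $o \to s$ direction of Eq. (3) would mirror this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterRefine(nn.Module):
    """One subgraph->object refinement pass, roughly Eq. (2)."""
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(2 * dim, 1)                          # scores each (object, subgraph) pair
        self.f_s2o = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, obj_vec, subg_maps):
        # obj_vec: (D,); subg_maps: (M, D, Ks, Ks) -> pooled to (M, D)
        subg_vecs = subg_maps.mean(dim=(2, 3))
        pair = torch.cat([obj_vec.expand(subg_vecs.size(0), -1), subg_vecs], dim=1)
        alpha = F.softmax(self.att(pair).squeeze(1), dim=0)       # attention weights over subgraphs
        msg = (alpha.unsqueeze(1) * subg_vecs).sum(dim=0)         # weighted message
        return obj_vec + self.f_s2o(msg)                          # residual update, Eq. (2)
```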
Knowledge Retrieval and Embedding.
To address the relationship distribution bias of current visual relationship datasets, we propose a novel feature refinement network to further improve the feature representation by taking advantage of the commonsense relationships in an external knowledge base (KB). In particular, we predict the object label $a_i$ from the refined object vector $\bar{o}_i$, and match $a_i$ with the corresponding semantic entities in the KB. Afterwards, we retrieve the corresponding commonsense relationships from the KB using the object label $a_i$:

$a_i \xrightarrow{\text{retrieve}} \langle a_i, a_{i,j}^{r}, a_{j}^{o}, w_{i,j} \rangle, \quad j \in [0, K-1] \quad (4)$

where $a_{i,j}^{r}$, $a_{j}^{o}$ and $w_{i,j}$ are the top-$K$ corresponding relationships, the object entity and the weight, respectively. Note that the weight $w_{i,j}$ is provided by the KB (i.e., ConceptNet [35]), indicating how common a triplet $\langle a_i, a_{i,j}^{r}, a_{j}^{o} \rangle$ is. Based on the weight $w_{i,j}$, we can identify the top-$K$ most common relationships for $a_i$. Figure 3 illustrates the process of our proposed knowledge-based feature refinement module.

Figure 3: Illustration of our proposed knowledge-based feature refinement module. Given the object labels, we retrieve the facts (or symbolic triplets) from ConceptNet (bottom), and then reason over those facts with a dynamic memory network using two passes (top right).

To encode the retrieved commonsense relationships, we first transform each symbolic triplet $\langle a_i, a_{i,j}^{r}, a_{j}^{o} \rangle$ into a sequence of words $[X_0, \cdots, X_{T_a-1}]$, and then map each word in the sentence into a continuous vector space with a word embedding $x_t = W_e X_t$. The embedded vectors are then fed into an RNN-based encoder [39] as

$h_k^t = \mathrm{RNN}_{\mathrm{fact}}(x_k^t, h_k^{t-1}), \quad t \in [0, T_a - 1] \quad (5)$

where $x_k^t$ is the $t$-th word embedding of the $k$-th sentence, and $h_k^t$ is the hidden state of the encoder. We use a bi-directional Gated Recurrent Unit (GRU) for $\mathrm{RNN}_{\mathrm{fact}}$, and the final hidden state $h_k^{T_a-1}$ is treated as the vector representation of the $k$-th retrieved sentence or fact, denoted as $f_{ik}$ for object $o_i$.
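The fact-encoding step of Eq. (5) might look as follows; the verbalize helper and the FactEncoder class are illustrative stand-ins (the ConceptNet query itself, which returns weighted edges per concept, is omitted).

```python
import torch
import torch.nn as nn

class FactEncoder(nn.Module):
    """Encode verbalized ConceptNet triplets with a bi-directional GRU (cf. Eq. (5))."""
    def __init__(self, vocab_size, emb_dim=300, hidden=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # W_e, e.g. GloVe-initialized
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, fact_tokens):
        # fact_tokens: (K, T_a) word ids of the K verbalized facts for one object
        x = self.embed(fact_tokens)                      # (K, T_a, emb_dim)
        _, h_n = self.gru(x)                             # h_n: (2, K, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=1)        # (K, 2*hidden) fact vectors f_k

def verbalize(triplet):
    """Turn a symbolic triplet such as ('woman', 'CapableOf', 'ski') into a word sequence."""
    subj, rel, obj = triplet
    return f"{subj} {rel} {obj}".lower().split()
```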
Attention-based Knowledge Fusion.
The knowledge units are stored in memory slots for reasoning and updating. Our target is to incorporate the external knowledge into the procedure of feature refining. However, for $N$ objects, we have $N \times K$ relevant fact vectors in the memory slots, which makes it difficult to distill the useful information from the candidate knowledge when $N \times K$ is large. DMN [23] provides a mechanism to pick out the most relevant facts by using an episodic memory module. Inspired by this, we adopt the improved DMN [39] to reason over the retrieved facts $F$, where $F$ denotes the set of fact embeddings $\{f_k\}$. It consists of an attention component which generates a contextual vector using the episode memory $m^{t-1}$. Specifically, we feed the object vector $\bar{o}$ to a non-linear fully-connected layer and attend the facts as follows:

$q = \tanh(W_q \bar{o} + b_q) \quad (6)$
$z^t = [F \circ q;\; F \circ m^{t-1};\; |F - q|;\; |F - m^{t-1}|] \quad (7)$
$g^t = \mathrm{softmax}(W_2 \tanh(W_1 z^t + b_1) + b_2) \quad (8)$
$e^t = \mathrm{AGRU}(F, g^t) \quad (9)$

where $z^t$ captures the interactions between the facts $F$, the episode memory $m^{t-1}$ and the mapped object vector $q$, $g^t$ is the output of a softmax layer, $\circ$ is the element-wise product, $|\cdot|$ is the element-wise absolute value, and $[\,;\,]$ is the concatenation operation. Note that $q$ and $m$ need to be expanded via duplication in order to have the same dimension as $F$ for the interactions. In (9), $\mathrm{AGRU}(\cdot)$ refers to the attention-based GRU [39], which replaces the update gate in the GRU with the output attention weight $g_k^t$ for fact $k$:

$e_k^t = g_k^t\, \mathrm{GRU}(f_k, e_{k-1}^t) + (1 - g_k^t)\, e_{k-1}^t \quad (10)$

where $e_K^t$ is the final state of the episode, i.e., the state of the GRU after all the $K$ sentences have been seen. After one pass of the attention mechanism, the memory is updated using the current episode state and the previous memory state:

$m^t = \mathrm{ReLU}(W_m [m^{t-1}; e_K^t; q] + b_m) \quad (11)$

where $m^t$ is the new episode memory state. By the final pass $T_m$, the episodic memory $m^{T_m-1}$ can memorize useful knowledge information for relationship prediction. The final episodic memory $m^{T_m-1}$ is passed to refine the object feature $\bar{o}$ as

$\tilde{o} = \mathrm{ReLU}(W_c [\bar{o}; m^{T_m-1}] + b_c) \quad (12)$

where $W_c$ and $b_c$ are parameters to be learned. In particular, we refine objects with the KB via (12) as well as jointly refining objects and subgraphs by replacing $\{o_i, s_i\}$ with $\{\tilde{o}_i, \bar{s}_i\}$ in (2) and (3), in an iterative fashion (see Alg. 1).
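A compact sketch of one episodic-memory pass (Eqs. (6)-(11)), assuming the facts, the query and the memory share a single dimension; the gating network is a placeholder rather than the authors' exact parameterization of Eq. (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """One DMN-style pass: attention over facts (Eqs. 6-8), attention GRU (9-10), memory update (11)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(4 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.gru_cell = nn.GRUCell(dim, dim)
        self.w_m = nn.Linear(3 * dim, dim)

    def forward(self, facts, obj_vec, memory):
        # facts: (K, D); obj_vec, memory: (D,)
        q = torch.tanh(self.w_q(obj_vec))                                     # Eq. (6)
        z = torch.cat([facts * q, facts * memory,
                       (facts - q).abs(), (facts - memory).abs()], dim=1)     # Eq. (7)
        g = F.softmax(self.gate(z).squeeze(1), dim=0)                         # Eq. (8)
        e = torch.zeros_like(memory)
        for k in range(facts.size(0)):                                        # Eq. (10): attention GRU
            h = self.gru_cell(facts[k].unsqueeze(0), e.unsqueeze(0)).squeeze(0)
            e = g[k] * h + (1 - g[k]) * e
        return torch.relu(self.w_m(torch.cat([memory, e, q], dim=0)))         # Eq. (11)
```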
Relation Prediction.
After the feature refinement, we can predict object labels as well as predicate labels with the refined object and subgraph features. The object label can be predicted directly from the object features. For the relationship label, as the subgraph feature is related to several object pairs, we predict the label based on the subject and object feature vectors along with their corresponding subgraph feature map. We formulate the inference process as

$P_{i,j} \sim \mathrm{softmax}\big(f_{\mathrm{rel}}([\tilde{o}_i \otimes \bar{s}_k;\; \tilde{o}_j \otimes \bar{s}_k;\; \bar{s}_k])\big) \quad (13)$
$V_i \sim \mathrm{softmax}\big(f_{\mathrm{node}}(\tilde{o}_i)\big) \quad (14)$

where $f_{\mathrm{rel}}(\cdot)$ and $f_{\mathrm{node}}(\cdot)$ denote the mapping layers for predicate and object recognition, respectively, and $\otimes$ denotes the convolution operation [25]. Then, we can construct the scene graph as $G = \langle V_i, P_{i,j}, V_j \rangle$, $i \neq j$.
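One way to read Eq. (13) in code: the refined object vector acts as a 1x1 convolution kernel over the subgraph map, and the resulting attention map re-weights the subgraph features. This is a sketch under our own interpretation of the ⊗ operation, with a single linear layer standing in for the bottleneck $f_{\mathrm{rel}}$ of [25].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationHead(nn.Module):
    """Predicate scores from subject/object vectors and their subgraph map, cf. Eq. (13)."""
    def __init__(self, dim, n_predicates):
        super().__init__()
        self.fc = nn.Linear(3 * dim, n_predicates)   # f_rel, a single layer here for brevity

    def forward(self, o_subj, o_obj, subg_map):
        # o_subj, o_obj: (D,); subg_map: (D, Ks, Ks)
        def conv_attend(obj_vec):
            # use the object vector as a 1x1 kernel over the subgraph map, then pool
            kernel = obj_vec.view(1, -1, 1, 1)                         # (1, D, 1, 1)
            attn = F.conv2d(subg_map.unsqueeze(0), kernel)             # (1, 1, Ks, Ks)
            attn = torch.softmax(attn.flatten(), dim=0).view_as(attn)
            return (subg_map * attn.squeeze(0)).sum(dim=(1, 2))        # (D,)
        feats = torch.cat([conv_attend(o_subj), conv_attend(o_obj),
                           subg_map.mean(dim=(1, 2))], dim=0)          # [o_i (x) s; o_j (x) s; s]
        return self.fc(feats)                                          # predicate logits
```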
Like other approaches [26, 25, 37], during training we want the generated scene graph to be close to the ground-truth scene graph, optimizing the scene graph generation process with an object detection loss and a relationship classification loss:

$\mathcal{L}_{\mathrm{im2sg}} = \lambda_{\mathrm{pred}} \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{obj}} \mathcal{L}_{\mathrm{obj}} + \lambda_{\mathrm{reg}}\, \mathbb{1}[u \geq 1]\, \mathcal{L}_{\mathrm{reg}} \quad (15)$

where $\mathcal{L}_{\mathrm{pred}}$, $\mathcal{L}_{\mathrm{obj}}$ and $\mathcal{L}_{\mathrm{reg}}$ are the predicate classification loss, the object classification loss and the bounding box regression loss, respectively, $\lambda_{\mathrm{obj}}$, $\lambda_{\mathrm{pred}}$ and $\lambda_{\mathrm{reg}}$ are hyperparameters, and $\mathbb{1}[\cdot]$ is the indicator function with $u$ being the object label, $u \geq 1$ for object categories and $u = 0$ for background. For predicate detection, the output is the probability over all the candidate predicates, and $\mathcal{L}_{\mathrm{pred}}$ is defined as the softmax loss. Similarly, the output of object detection is the probability over all the object categories, and $\mathcal{L}_{\mathrm{obj}}$ is also defined as the softmax loss. For the bounding box regression loss $\mathcal{L}_{\mathrm{reg}}$, we use the smooth $L_1$ loss [33].

To better regularize the network, an object-to-image generative path is added. Figure 4 depicts our proposed object-to-image generation module Gen_o2i.

Figure 4: Illustration of our proposed object-to-image generation module Gen_o2i.

In particular, we first compute a scene layout based on the object labels and their corresponding locations. For each object $i$, we expand the object embedding vector $o_i \in \mathbb{R}^{D}$ to shape $D \times 1 \times 1$, and then warp it to the position of the bounding box $r_i$ using bilinear interpolation to give an object layout $o_i^{\mathrm{layout}} \in \mathbb{R}^{D \times H \times W}$, where $D$ is the dimension of the embedding vectors for objects and $H \times W = 64 \times 64$ is the output image resolution. We sum all object layouts to obtain the scene layout $S^{\mathrm{layout}} = \sum_i o_i^{\mathrm{layout}}$.

Given the scene layout, we synthesize an image that respects the object positions with an image generator $G$. Here, we adopt a cascaded refinement network [20] which consists of a series of convolutional refinement modules to generate the image. The spatial resolution doubles between consecutive refinement modules, which allows the generation to proceed in a coarse-to-fine manner. Each module takes two inputs: the output from the previous module (the first module takes Gaussian noise), and the scene layout $S^{\mathrm{layout}}$ downsampled to the input resolution of the module. These inputs are concatenated channel-wise and passed to a pair of convolution layers. The outputs are then upsampled using nearest-neighbor interpolation before being passed to the next module. The output from the last module is passed to two final convolution layers to produce the output image.
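A sketch of the scene-layout computation described above, assuming normalized [x, y, w, h] boxes; F.interpolate stands in for the bilinear warping that places each $D \times 1 \times 1$ embedding into its box.

```python
import torch
import torch.nn.functional as F

def scene_layout(obj_embeds, boxes, H=64, W=64):
    """Sum per-object layouts: each D-dim embedding is stretched over its box (cf. S_layout)."""
    # obj_embeds: (N, D); boxes: (N, 4) as normalized [x, y, w, h]
    N, D = obj_embeds.shape
    layout = torch.zeros(D, H, W)
    for emb, (x, y, w, h) in zip(obj_embeds, boxes):
        x0, y0 = int(x * W), int(y * H)
        bw, bh = max(int(w * W), 1), max(int(h * H), 1)
        # expand the D x 1 x 1 embedding to the box size via bilinear interpolation
        patch = F.interpolate(emb.view(1, D, 1, 1), size=(bh, bw), mode="bilinear",
                              align_corners=False).squeeze(0)
        layout[:, y0:y0 + bh, x0:x0 + bw] += patch[:, :H - y0, :W - x0]
    return layout  # (D, H, W), fed to the cascaded refinement generator
```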
Image-level Supervision.
In addition to the common pixel reconstruction loss $\mathcal{L}_{\mathrm{pixel}}$, we also adopt a conditional GAN loss [32], considering that the image is generated based on the objects. In particular, we train the discriminator $D_i$ and the generator $G_i$ by alternately optimizing $\mathcal{L}_{D_i}$ in Eq. (16) and $\mathcal{L}_{G_i}$ in Eq. (17):

$\mathcal{L}_{D_i} = \mathbb{E}_{I \sim p_{\mathrm{real}}}[\log D_i(I)] + \mathbb{E}_{\hat{I} \sim p_G}[\log(1 - D_i(\hat{I}))] \quad (16)$

$\mathcal{L}_{G_i} = \mathbb{E}_{\hat{I} \sim p_G}[\log(1 - D_i(\hat{I}))] + \lambda_p \mathcal{L}_{\mathrm{pixel}} \quad (17)$

where $\lambda_p$ is a tuning parameter. For the generator loss, we maximize $\log D_i(G_i(z \mid S^{\mathrm{layout}}))$ rather than minimizing the original $\log(1 - D_i(G_i(z \mid S^{\mathrm{layout}})))$ for better gradient behavior. For the pixel reconstruction loss, we calculate the $\ell_1$ distance between the real image $I$ and the corresponding synthetic image $\hat{I}$ as $\|I - \hat{I}\|_1$.

As shown in Figure 2, we view the object-to-image generation branch as a regularizer: it acts as a corrective model for scene graph generation by improving the performance of object detection. During training, backpropagation from the losses (15), (16), and (17) influences the model parameter updates, and the gradients back-propagated from the object-to-image branch update the parameters of our object detector and the feature refinement module, which is followed by the relation prediction. Alg. 1 summarizes the entire training procedure.
Algorithm 1: Training procedure.
Input: image $I$, number of training steps $T_s$.
  Pretrain the image generation module Gen_o2i with ground-truth objects.
  for $t = 0 : T_s - 1$ do
    Get objects and relationship triplets.
    Proposal generation: $(O, S) \leftarrow I$ {RPN}
    /* Knowledge-based feature refining */
    for $r = 0 : T_r - 1$ do
      $\bar{o}_i \leftarrow \{o_i, S_i\}$   /* refine using (2) */
      $\bar{s}_k \leftarrow \{s_k, O_k\}$   /* refine using (3) */
      $\tilde{o}_i \leftarrow \{F, \bar{o}_i\}$   /* refine using (12) */
      $o_i \leftarrow \tilde{o}_i$, $s_i \leftarrow \bar{s}_i$
    end for
    Update parameters with Gen_o2i (predicted objects).
    Update parameters with (15).
  end for

Function Gen_o2i:
Input: real image $I$, objects (ground truth / predicted).
  Object layout generation: $o_i^{\mathrm{layout}} \leftarrow \{o_i, r_i\}$
  Scene layout generation: $S^{\mathrm{layout}} = \sum_i o_i^{\mathrm{layout}}$
  Image reconstruction: $\hat{I} = G_i(z, S^{\mathrm{layout}})$
  Update image generator $G_i$ parameters using (17).
  Update image discriminator $D_i$ parameters using (16).
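The Gen_o2i updates in Alg. 1 could be implemented roughly as below, pairing the adversarial terms of Eqs. (16)-(17) with the $\ell_1$ pixel loss; the noise dimension, the optimizers, and the logits-based loss functions are our own simplifications.

```python
import torch
import torch.nn.functional as F

def regularizer_step(generator, discriminator, g_opt, d_opt, real_img, layout, lambda_p=1.0):
    """One alternating update of the image-level regularizer (cf. Eqs. (16)-(17))."""
    z = torch.randn(real_img.size(0), 128)        # noise fed to the first refinement module (dim assumed)
    fake_img = generator(z, layout)

    # Generator: non-saturating GAN term plus l1 pixel reconstruction
    d_out = discriminator(fake_img)               # discriminator is assumed to return logits
    g_loss = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out)) \
             + lambda_p * (real_img - fake_img).abs().mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # Discriminator: real vs. reconstructed images
    d_real = discriminator(real_img)
    d_fake = discriminator(fake_img.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    return g_loss.item(), d_loss.item()
```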
4. Experiments
We evaluate our approach on two datasets: VRD [29] and VG [26]. VRD is the most widely used benchmark dataset for visual relationship detection. Compared with VRD, the raw VG [22] contains a large number of noisy labels; in our experiments, we use the cleansed version VG-MSDN from [26]. Detailed statistics of both datasets are shown in Table 1.

Table 1: Dataset statistics.
Dataset   Training Set   Testing Set

For the external KB, we employ the English subgraph of ConceptNet [35] as our knowledge graph. ConceptNet is a large-scale graph of general knowledge which aims to align its knowledge resources on its core set of 40 relations. A large portion of these relation types can be considered as visual relations, such as spatial co-occurrence (e.g., AtLocation, LocatedNear), visual properties of objects (e.g., HasProperty, PartOf), and actions (e.g., CapableOf, UsedFor).

As shown in Alg. 1, we train our model in two phases. The initial phase looks only at the object annotations of the training set, ignoring the relationship triplets. For each dataset, we filter the objects according to the category and relation vocabularies in Table 1. We then learn an image-level regularizer that reconstructs the image based on the object labels and bounding boxes. The output size of the image generator is 64 x 64, and the real image is resized before being input to the discriminator. We train the regularizer with batch size 32; for each mini-batch we first update $G_i$ and then update $D_i$.

The second phase jointly trains the scene graph generation model and the auxiliary reconstruction branch. We adopt Faster R-CNN [33] with a VGG-16 [34] backbone. During training, the number of object proposals is 256. For each proposal, we use RoI Align [16] pooling to generate object and subgraph features. The subgraph regions are pooled to $K_s \times K_s$ feature maps. The dimension $D$ of the pooled object vector and the subgraph feature map is set to 512. For the knowledge-based refinement module, we set the dimension of the word embedding to 300 and initialize it with the GloVe 6B pre-trained word vectors [30]. We keep the top-8 commonsense relationships. The number of hidden units of the fact encoder is set to 300, and the dimension of the episodic memory is set to 512. The iteration number $T_m$ of the DMN update is set to 2. For the relation inference module, we adopt the same bottleneck layer as [25]. All the newly introduced layers are randomly initialized except the auxiliary regularizer. We set $\lambda_{\mathrm{pred}} = 2.0$ and $\lambda_{\mathrm{obj}} = 1.0$ in Eq. (15), and weight the bounding box regression with a smaller $\lambda_{\mathrm{reg}}$. The hyperparameter $\lambda_p$ in Eq. (17) is set to 1.0. The iteration number $T_r$ of the feature refinement is set to 2. We first train the RPNs and then jointly train the entire network. The initial learning rate is 0.01, the decay rate is 0.1, and stochastic gradient descent (SGD) is used as the optimizer. We deploy weight decay and dropout to prevent over-fitting.

During testing, the image reconstruction branch is discarded. We set the RPN non-maximum suppression (NMS) [33] threshold to 0.6 and the subgraph clustering [25] threshold to 0.5. We output all the predicates and use the top-1 category as the prediction for objects and relations. Models are evaluated on two tasks: Visual Phrase Detection (PhrDet) and Scene Graph Generation (SGGen).
PhrDet is to detect the ⟨subject-predicate-object⟩ phrases. SGGen is to detect the objects within the image and recognize their pairwise relationships. Following [29, 25], the Top-$K$ Recall (denoted as Rec@$K$) is used as the performance metric; it calculates how many labeled relationships are hit in the top $K$ predictions. In our experiments, Rec@50 and Rec@100 are reported. Note that Li et al. [26] and Yang et al. [42] reported results on two more metrics, Predicate Recognition and Phrase Recognition. These two evaluation metrics are based on ground-truth object locations, which is not the case we consider: in our setting, we use detected objects for image reconstruction and scene graph generation. To be consistent with the training, we choose PhrDet and SGGen as the evaluation metrics, which is also more practical.
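For reference, a plain-Python sketch of Rec@K for a single image, under the usual convention of [29] that a ground-truth triplet counts as hit when the three labels match and both boxes overlap their ground-truth counterparts with IoU >= 0.5.

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds, gts, k, thresh=0.5):
    """preds: confidence-sorted list of (subj_lbl, pred_lbl, obj_lbl, subj_box, obj_box);
    gts: same format. Returns the fraction of ground-truth triplets hit in the top k."""
    hit, matched = 0, set()
    for p in preds[:k]:
        for g_idx, g in enumerate(gts):
            if g_idx in matched:
                continue
            if p[:3] == g[:3] and iou(p[3], g[3]) >= thresh and iou(p[4], g[4]) >= thresh:
                hit += 1
                matched.add(g_idx)
                break
    return hit / max(len(gts), 1)
```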
Baseline.
This baseline model is a re-implementation of Factorizable Net [25]. We re-train it based on our backbone; specifically, we use the same RPN model and jointly train the scene graph generator until convergence.
KB.
This model is a KB-enhanced version of the baseline model. External knowledge triplets are incorporated via the DMN, and the explicit knowledge-based reasoning is incorporated into the feature refining procedure.
GAN.
This model improves the baseline model by attaching an auxiliary branch that generates the image from objects with a GAN. We train this model in two phases: the first phase trains the image reconstruction branch only with the object annotations, and we then refine the model jointly with the scene graph generation model.
KB-GAN.
This is our full model containing both KB and GAN. It is initialized with the trained parameters from KB and GAN, and fine-tuned with Alg. 1.
In this section, we present our quantitative results and analysis. To verify the effectiveness of our approach and analyze the contribution of each component, we first compare different baselines in Table 2 and investigate the improvement in recognizing objects in Table 3. Then, we conduct a simulation experiment on VRD to investigate the effectiveness of our auxiliary regularizer in Table 4. The comparison of our approach with the state-of-the-art methods is reported in Table 5.
Component Analysis.
In our framework, we proposed two novel modules: KB-based feature refinement (KB) and auxiliary image generation (GAN). To get a clear sense of how these components affect the final performance, we perform ablation studies in Table 2; the left-most columns in Table 2 indicate whether or not we use KB and GAN in our approach. To further investigate the improvement of our approach on recognizing objects, we also report the object detection performance mAP [10] in Table 3.

Table 2: Ablation studies of individual components of our method on VRD.

KB   GAN   PhrDet Rec@50   PhrDet Rec@100   SGGen Rec@50   SGGen Rec@100
-    -     25.57           31.09            18.16          22.30
✓    -     27.02           34.04            19.85          24.58
-    ✓     –               –                –              –
✓    ✓     27.39           34.38            20.31          25.01

In Table 2, we observe that KB boosts PhrDet and SGGen significantly. This indicates that our knowledge-based feature refinement can effectively learn the commonsense knowledge of objects to achieve high recall for the correct relationships. By adding the image-level supervision to the baseline model, the performance is further improved. This improvement demonstrates that the proposed image-level supervision is capable of capturing meaningful context across the objects. These results align with our intuitions discussed in the introduction. With KB and GAN, our model can generate scene graphs with high recall.

Table 3 demonstrates the improvement in recognizing objects. We can see that our full model (KB-GAN) outperforms Faster R-CNN [33] and ViP-CNN [27] measured by mAP. It is worth noticing that the large gain of KB illustrates that the introduction of commonsense knowledge substantially contributes to the object detection task.

Table 3: Ablation study of the object detection on VRD.

Model   Faster R-CNN [33]   ViP-CNN [27]   Baseline   KB      GAN     KB-GAN
mAP     14.35               20.56          20.70      22.26   22.10   –

Table 4: Ablation study of image-level supervision on sub-sampled VRD.

KB   GAN   PhrDet Rec@50   PhrDet Rec@100   SGGen Rec@50   SGGen Rec@100
-    -     15.44           20.96            10.94          14.53
-    ✓     –               –                –              –
✓    ✓     –               –                –              –
Investigation on Image-level Supervision.
As aforementioned, our image-level supervision can exploit the instances of rare categories. To demonstrate that the introduced image-level supervision can help on this issue, we exaggerate the problem by randomly removing 20% of the object instances, as well as their corresponding relationships, from the dataset. In Table 4, we can see that when training on such a sub-sampled dataset (with only 80% of the object instances), Rec@50 of the baseline model drops from 25.57 (resp. 18.16) to 15.44 (resp. 10.94) for PhrDet and SGGen. However, with the help of GAN, Rec@50 of our final model decreases only slightly from 27.39 (resp. 20.31) to 26.62 (resp. 19.78) for PhrDet and SGGen, respectively.

We give our explanation of this significant performance improvement as follows. Too many low-frequency categories deteriorate the training gain when only the class label is utilized as the training target. With the explicit image-level supervision, the proposed image reconstruction path can utilize the large quantities of instances of rare classes. This image-level supervision idea is generic and can apply to many potential applications such as object detection.

Table 5: Comparison with existing methods on PhrDet and SGGen.

Dataset        Model                   PhrDet Rec@50   PhrDet Rec@100   SGGen Rec@50   SGGen Rec@100
VRD [29]       ViP-CNN [27]            22.78           27.91            17.32          20.01
               DR-Net [5]              19.93           23.45            17.73          20.88
               U+W+SF+LK: T+S [45]     26.32           29.43            19.17          21.34
               Factorizable Net [25]   26.03           30.77            18.32          21.20
               KB-GAN                  27.39           34.38            20.31          25.01
VG-MSDN [26]   ISGG [41]               15.87           19.45            8.23           10.88
               MSDN [26]               19.95           24.93            10.72          14.22
               Graph R-CNN [42]        –               –                11.40          13.70
               Factorizable Net [25]   22.84           28.57            13.06          16.47
               KB-GAN                  23.51           30.04            13.65          17.57

Figure 5: Qualitative results from KB-GAN. In each example, the left image is the original input image; the scene graph is generated by KB-GAN; and the right image is reconstructed from the detected objects.
Comparison with Existing Methods.
Table 5 shows the comparison of our approach with the existing methods. We can see that our proposed method outperforms all the existing methods in recall on both datasets. Compared with these methods, our model recognizes the objects and their relationships not only in the graph domain but also in the image domain.
Figure 5 visualizes some examples from our full model. We show the generated scene graph as well as the reconstructed image for each sample. It is clear that our method can generate high-quality relationship predictions in the generated scene graph. Also notable is that our auxiliary output images are reasonable. This demonstrates our model's capability to generate rich scene graphs by learning with both the external KB and the auxiliary image-level regularizer.
5. Conclusion
In this work, we have introduced a new model for scene graph generation which includes a novel knowledge-based feature refinement network that effectively propagates contextual information across the graph, and an image-level supervision that regularizes scene graph generation from the image domain. Our framework outperforms state-of-the-art methods for scene graph generation on the VRD and VG datasets. Our experiments show that it is fruitful to incorporate commonsense knowledge as well as image-level supervision into scene graph generation. Our work shows a promising way to improve high-level image understanding via scene graphs.
Acknowledgments
This work was supported in part by Adobe Research, NTU-IGS, NTU-Alibaba Lab, and NTU ROSE Lab.

References

[1] Somak Aditya, Yezhou Yang, and Chitta Baral. Explicit reasoning over end-to-end neural architectures for visual question answering. In AAAI, 2018.
[2] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. 2007.
[3] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018.
[4] Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. Knowledge-based question answering as machine translation. In ACL, 2014.
[5] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
[6] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In ECCV, 2014.
[7] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, 2018.
[8] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Semantic correlation promoted shape-variant context for segmentation. In CVPR, 2019.
[9] Desmond Elliott and Frank Keller. Image description using visual dependency representations. In EMNLP, 2013.
[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[11] Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
[12] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-captioning: Coarse-to-fine learning for image captioning. In AAAI, 2018.
[13] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. Unpaired image captioning by language pivoting. In ECCV, 2018.
[14] Jiuxiang Gu, Gang Wang, Jianfei Cai, and Tsuhan Chen. An empirical study of language CNN for image captioning. In ICCV, 2017.
[15] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 2017.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[17] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Workshop, 2015.
[18] Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric Xing. Deep neural networks with massive learned knowledge. In EMNLP, 2016.
[19] Hamid Izadinia, Fereshteh Sadeghi, and Ali Farhadi. Incorporating scene context and object layout into appearance modeling. In CVPR, 2014.
[20] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
[21] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. In ICCV, 2017.
[23] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
[24] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. In CVPR, 2018.
[25] Yikang Li, Wanli Ouyang, Bolei Zhou, Yawen Cui, Jianping Shi, and Xiaogang Wang. Factorizable Net: An efficient subgraph-based framework for scene graph generation. In ECCV, 2018.
[26] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation from objects, phrases and region captions. In ICCV, 2017.
[27] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. ViP-CNN: Visual phrase guided convolutional neural network. In CVPR, 2017.
[28] Wentong Liao, Lin Shuai, Bodo Rosenhahn, and Michael Ying Yang. Natural language guided visual relationship detection. arXiv preprint arXiv:1711.06032, 2017.
[29] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
[30] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[31] Bryan A Plummer, Arun Mallya, Christopher M Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
[32] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[35] Robert Speer and Catherine Havasi. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP, pages 161–176. 2013.
[36] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. FVQA: Fact-based visual question answering. PAMI, 2018.
[37] Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, and Alan Yuille. Scene graph parsing as dependency parsing. In ACL, 2018.
[38] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. Image captioning and visual question answering based on attributes and external knowledge. PAMI, 2018.
[39] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
[40] Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang. Recognize complex events from static images by fusing deep channels. In CVPR, 2015.
[41] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
[42] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph generation. In ECCV, 2018.
[43] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
[44] Xu Yang, Hanwang Zhang, and Jianfei Cai. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In ECCV, 2018.
[45] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, 2017.
[46] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
[47] Handong Zhao, Quanfu Fan, Dan Gutfreund, and Yun Fu. Semantically guided visual question answering.