Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation
Tao He, Lianli Gao, Jingkuan Song, Jianfei Cai, Yuan-Fang Li∗
Faculty of Information Technology, Monash University, Australia
Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
{tao.he, jianfei.cai, yuanfang.li}@monash.edu, [email protected], [email protected]
∗ Corresponding author

Abstract
Despite the huge progress in scene graph generation in recent years, the long-tail distribution of object relationships remains a challenging and pestering issue. Existing methods largely rely on either external knowledge or statistical bias information to alleviate this problem. In this paper, we tackle the issue from two other aspects: (1) scene-object interaction, which learns scene-specific knowledge via an additive attention mechanism; and (2) long-tail knowledge transfer, which transfers the rich knowledge learned from the head relations to the tail. Extensive experiments on three tasks on the benchmark dataset Visual Genome demonstrate that our method outperforms current state-of-the-art competitors.
Scene graph generation is a fundamental task in computer vision that has been successfully applied to many other tasks, including image captioning [Yang et al., 2019], image retrieval [Johnson et al., 2015] and commonsense reasoning [Zellers et al., 2019]. Given an image, a relationship between objects in the image is typically denoted as a triple (subject, predicate, object), where the predicate can also be denoted as the relation. Detecting such relationships requires a global understanding of the image content. In scene graph generation, the representation of a relationship needs to preserve the semantic information of the triple as well as the inherent attributes of the objects and the relations between them. It is a challenging task due to the distributional biases present in the datasets. For example, the benchmark dataset Visual Genome [Krishna et al., 2017] contains 150 distinct object classes, producing approximately 22K possible unique relationships. Such a large number of relationships is too arduous to train a model on, as it is impossible to cover each relationship with sufficient samples [Zellers and Yatskar, 2018]. This challenge is further complicated by the highly imbalanced distribution of the relations. It has been observed [Zellers and Yatskar, 2018; Zhang et al., 2019b; Dornadula et al., 2019; Chen et al., 2019] that the distribution of relations in Visual Genome
is highly long-tailed and biased: the head relations have orders of magnitude more instances than the tail relations. Thus, a model can readily learn the representation of head relations but struggles to learn that of tail relations. Many previous methods focus on the union region of a pair of objects [Deng et al., 2014; Dai et al., 2017], where only visual features are considered, but not the distributional bias of relations. However, as mentioned before, due to the highly imbalanced nature of relations, a relation classifier is hardly well optimized by such uneven data. Xu et al. [2017] developed a message passing strategy to aid relation recognition, where refining the object and relation features becomes the central goal. However, its performance still suffers from the lack of sufficient data required for learning. By counting the frequency of various relations, Neural Motifs [Zellers and Yatskar, 2018] discovers that some relations are highly correlated with the objects. For instance, the possession relation "has" always exists between some specific pairs of objects, such as subject "man" and object "eye". Similarly, based on the statistics of a dataset, KRE [Chen et al., 2019] developed a knowledge routing network to preserve the relation bias in their model. Additionally, other work [Lu et al., 2016] utilized natural language information as an auxiliary tool to boost relation classification by mapping language prior knowledge to relation phrases. One limitation of these methods is their reliance on statistical bias knowledge, without which their results would decline significantly. Similarly, Gu et al. [2019] leveraged ConceptNet [Speer et al., 2017], a commonsense knowledge graph, to bridge the gap between visual features and external knowledge with a recurrent neural network.

Moreover, many recent works discover that a well-represented contextual feature can significantly benefit relation recognition. Specifically, Graph R-CNN [Yang et al., 2018] develops an attentional Graph Convolutional Network (aGCN), focusing on learning the contextual information between two objects that are filtered by a Relation Proposal Network (RePN). Qi et al. [2019] proposed two interacting modules to inject contextual clues into relation features: a semantic transformer module that preserves semantically embedded relation features by projecting visual and textual features into a common semantic space, and a graph self-attention module that embeds a joint graph representation by aggregating neighboring nodes' information. Shi et al. [2019] utilized the attention mechanism to enhance node and relationship representations and to trace the reasoning flow in complex scenes.

In this paper we address two critical challenges in scene graph generation: (1) how to effectively encode contextual clues into the corresponding object representations; and (2) how to balance the severely skewed predicate distribution to improve model performance. Specifically, for the first challenge, we propose a scene-object interaction module that learns the interplay coefficient between individual objects and their specific scene context. For instance, the relation triple "man riding bike" is usually associated with an outdoor scene rather than an indoor one. Therefore, the outdoor scene is a key contextual clue that helps us confidently predict the "riding" relation once given the objects "man" and "bike" and the outdoor information.
To this end, we treat the annotated objects of each image as the scene label of the image and deploy a weighted multi-label classifier to learn the contextual scene clue. At the same time, we employ an additive attention technique to effectively fuse this clue with the objects' visual features. For the second challenge, we introduce a knowledge transfer module to enhance the representation of tail (data-starved) relations by transferring the knowledge learned in head relations to tail relations. In addition, we introduce a calibration operation, inspired by the notion of reachability in reinforcement learning [Savinov et al., 2018], to rescale the head and tail features and enhance their discriminative ability. In summary, our contributions are threefold:

• We introduce a scene-object interaction module to fuse objects' visual features and the scene contextual clue via an additive attention mechanism.

• To alleviate the imbalanced distribution of relations, we propose a head-to-tail knowledge transfer module that preserves the rich knowledge learned from the head relations in the tail. Moreover, our calibration operation further enhances the discriminative ability of the learned visual features.

• We evaluate our method on the standard scene graph generation dataset Visual Genome [Krishna et al., 2017] on three tasks: predicate classification, scene graph classification and scene graph detection, on which our model outperforms current state-of-the-art methods.
Our overall framework, shown in Figure 1, consists of three main modules: (1) feature extraction, (2) scene-object interaction, and (3) knowledge transfer. Specifically, the scene-object interaction module combines scene context features with object features via an additive attention mechanism, while the knowledge transfer module focuses on fusing the knowledge learned in head and tail relations to enhance their representation.
Problem Formulation. A scene graph is a directed relation network extracted from a multi-object image. Each edge in a scene graph is represented by a triple $(o_i, r_{ij}, o_j)$, consisting of two objects $o_i$, $o_j$ and the relationship predicate $r_{ij}$ between them. Additionally, a scene graph requires localizing each object in the referring image, and we denote the localization of object $o_i$ as $b_i$. Thus, given a set of object labels $\mathcal{C}$ and a set of relationship types $\mathcal{R}$ (including the none relation), a complete scene graph for an image consists of:

• A set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$, where $b_i \in \mathbb{R}^4$ denotes the coordinates of the top-left corner and the bottom-right corner.

• A set of objects $O = \{o_1, o_2, \ldots, o_n\}$, assigning a class label $o_i \in \mathcal{C}$ to each $b_i$.

• A set of triples $T = \{(o_i, r_{ij}, o_j)\}$, where $o_i, o_j \in O$, $r_{ij} \in \mathcal{R}$, and $i \neq j$.

Feature Extraction. The first step in scene graph generation is to detect objects in an image. Numerous object detection methods have been proposed, e.g., Faster R-CNN [Girshick, 2015]. To compare fairly with other baseline methods, we adopt Faster R-CNN trained on VGG-16 [Simonyan and Zisserman, 2014] as our object detection and localization backbone network. For each detected object $o_i$, we extract two types of features: visual features $f_{o_i}$ and spatial features $l_i \in \mathbb{R}^5$. Specifically, the visual feature extraction for $f_{o_i}$ follows that of Neural Motifs [Zellers and Yatskar, 2018]. The spatial feature $l_i$ is a 5-dimensional vector that encodes the top-left and bottom-right coordinates and the size of the object: $l_i = [x^t_i, y^t_i, x^b_i, y^b_i, w_i \ast h_i]$, where $w_i$ and $h_i$ are the width and height of the object, respectively. Recent works [Zhuang et al., 2017; Woo et al., 2018] have demonstrated that the relative position of two objects in an image can significantly enhance relation recognition. Thus, we also encode the relative position into the relation representation as $s_{ij}$. Concretely, we first convert $l_i$ to the centralized coordinate $[x^c_i, y^c_i, w_i, h_i]$ and then calculate the relative spatial feature as

$$s_{ij} = \left[\frac{x^t_j - x^c_i}{w_i},\ \frac{y^t_j - y^c_i}{h_i},\ \frac{x^b_j - x^c_i}{w_i},\ \frac{y^b_j - y^c_i}{h_i},\ \frac{w_j \cdot h_j}{w_i \cdot h_i}\right].$$

It is worth noting that $s_{ij}$ is different from $s_{ji}$. To enrich the representation of $s_{ij}$, we feed this raw 5-dimensional vector into a non-linear layer and convert it to a higher-dimensional vector $s_{ij}$. As for the union region feature $f_{u_{ij}}$ of subject $o_i$ and object $o_j$, we first generate their union bounding box and then follow the extraction of an object's visual feature to obtain $f_{u_{ij}}$.
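To make the geometry concrete, here is a minimal sketch of how the 5-dimensional relative spatial feature $s_{ij}$ and its non-linear projection could be computed. It assumes boxes are given as [x1, y1, x2, y2] pixel coordinates; the projection width (128 here) and all names are illustrative placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def relative_spatial_feature(box_i: torch.Tensor, box_j: torch.Tensor) -> torch.Tensor:
    """Raw 5-d relative spatial feature s_ij for subject box_i and object box_j.

    The subject box is converted to its centralized form [x_c, y_c, w, h]; the
    object's corners are then expressed relative to the subject's center and
    normalized by the subject's width/height, plus an area ratio.
    """
    x1_i, y1_i, x2_i, y2_i = box_i
    x1_j, y1_j, x2_j, y2_j = box_j
    w_i, h_i = x2_i - x1_i, y2_i - y1_i
    x_c, y_c = x1_i + w_i / 2, y1_i + h_i / 2
    w_j, h_j = x2_j - x1_j, y2_j - y1_j
    return torch.stack([
        (x1_j - x_c) / w_i,          # object's top-left x relative to subject center
        (y1_j - y_c) / h_i,          # object's top-left y
        (x2_j - x_c) / w_i,          # object's bottom-right x
        (y2_j - y_c) / h_i,          # object's bottom-right y
        (w_j * h_j) / (w_i * h_i),   # area ratio
    ])

# Hypothetical non-linear projection to a higher-dimensional embedding.
spatial_proj = nn.Sequential(nn.Linear(5, 128), nn.ReLU())

box_man = torch.tensor([40., 60., 120., 260.])
box_bike = torch.tensor([30., 180., 200., 320.])
s_ij = spatial_proj(relative_spatial_feature(box_man, box_bike))  # note: s_ij != s_ji
```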
Figure 1: The high-level architecture of our framework. It consists of two main parts: the scene-object interaction module and the knowledge transfer module. The scene-object interaction module refines object features by injecting the global scene interaction information. The knowledge transfer module transfers the knowledge learned in the head relations to the tail relations and bridges the knowledge gap between them.

Scene-Object Interaction. For scene graph generation, the correct recognition of relations not only depends on object features, but also takes important cues from the scene. For example, the "outdoor" scene should contribute more to the relation "riding" and less to "holding", as riding mostly takes place outdoors, which is not the case for holding. Many works, such as IMP [Xu et al., 2017] and Neural Motifs [Zellers and Yatskar, 2018], demonstrate that the contextual representation has a conspicuous effect on relation recognition. In this work, we propose a scene-object interaction module to encode the global scene contextual information into the object representation, implemented via an additive attention module widely used in machine translation models [Bahdanau et al., 2014; Britz et al., 2017]:

$$a_i = \max\{0,\ w_g \cdot (f_{o_i} + f_s)\} \qquad (1)$$

where $f_{o_i}$ is the feature of object $o_i$, $f_s$ is the global scene feature of the image, $\cdot$ denotes the pointwise product, and $w_g$ computes a coefficient of interaction between the object and its contextualized scene. It is worth mentioning that all object features $f_{o_i}$ in the same image share a common scene feature $f_s$. $a_i$ is clipped to the interval $[0, +\infty)$, and a greater value of $a_i$ corresponds to more interaction with the scene, that is, the scene feature should contribute more to the object feature (see Eq. (3)). Note that LinkNet [Woo et al., 2018] has also proposed to incorporate scene features, while we model the contribution of a scene to relations via an attention mechanism instead of the simple concatenation used in LinkNet. $w_g$ is implemented by a fully-connected layer activated by a ReLU function. The global scene feature $f_s$ is learned with a weighted multi-label classification loss:

$$L_s = \sum_{c=1}^{|\mathcal{C}|} W_c \cdot \mathrm{BCE}(p_c, l_c) \qquad (2)$$

where $W_c$ is a per-class weight pre-calculated by counting the proportion of each object class in the training set, $p_c$ is the probability of each class output by a sigmoid function, $l_c$ is the true target label, and $\mathrm{BCE}(\cdot)$ is a binary cross-entropy function for classifying multi-label images. With the scene-object interaction, the object feature is then refined as:

$$\tilde{f}_{o_i} = f_{o_i} + a_i \ast f_s. \qquad (3)$$

From the refined object features $\tilde{f}_{o_i}$, the union region feature $f_{u_{ij}}$ and the transformed relative spatial feature $s_{ij}$, we construct the final representation of each triple $(o_i, r_{ij}, o_j)$ as:

$$f_{t_{ij}} = \left[\tilde{f}_{o_i} \times f_{u_{ij}} \times \tilde{f}_{o_j};\ s_{ij}\right] \qquad (4)$$

where $\times$ is the element-wise multiplication following [Zellers and Yatskar, 2018; Woo et al., 2018] and $[\,;\,]$ is the vector concatenation operation.
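As a rough illustration of the scene-object interaction in Eqs. (1)-(4), the sketch below implements one plausible reading in PyTorch. The feature dimension, the number of object classes, the uniform class weights, and the interpretation of $w_g$ as a linear layer producing one scalar coefficient per object are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneObjectInteraction(nn.Module):
    """Fuse a global scene feature into each object feature via additive attention."""

    def __init__(self, dim: int = 512, num_obj_classes: int = 150):
        super().__init__()
        self.w_g = nn.Linear(dim, 1)                        # interaction coefficient, Eq. (1)
        self.scene_cls = nn.Linear(dim, num_obj_classes)    # multi-label scene classifier, Eq. (2)

    def forward(self, f_obj, f_scene, scene_targets=None, class_weights=None):
        # f_obj: (n_objects, dim); f_scene: (dim,), shared by all objects of one image.
        a = F.relu(self.w_g(f_obj + f_scene))               # Eq. (1): a_i = max{0, w_g . (f_oi + f_s)}
        f_refined = f_obj + a * f_scene                     # Eq. (3): inject the scene clue per object
        loss_s = None
        if scene_targets is not None:                       # Eq. (2): weighted binary cross-entropy
            logits = self.scene_cls(f_scene)
            loss_s = F.binary_cross_entropy_with_logits(logits, scene_targets, weight=class_weights)
        return f_refined, loss_s

def triple_feature(f_subj, f_union, f_obj, s_ij):
    # Eq. (4): element-wise product of subject/union/object features, concatenated with s_ij.
    return torch.cat([f_subj * f_union * f_obj, s_ij], dim=-1)

# Toy usage with assumed sizes.
mod = SceneObjectInteraction()
f_obj, f_scene = torch.randn(3, 512), torch.randn(512)
targets = torch.zeros(150); targets[[5, 42]] = 1.0          # object classes present in the image
refined, loss_s = mod(f_obj, f_scene, targets, torch.ones(150))
f_t = triple_feature(refined[0], torch.randn(512), refined[1], torch.randn(128))
```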
Many previous works [Zellers and Yatskar, 2018; Chen et al., 2019] have observed that the distribution of relations is significantly unbalanced and long-tailed: very few relations (the head) have orders of magnitude more data than the majority of the relations (the tail). Intuitively, the head relations can be accurately classified while the less frequent relations are much more challenging. Therefore, how to transfer the knowledge learned in the head relations to the tail is a key point in our model.

Knowledge Codewords Construction.
Inspired by the great success of knowledge transfer in domain-adaptive learning [Hsu et al., 2017; Xie et al., 2018], our model adopts semantic codewords as the knowledge representation for each relation class. Our model first learns $|\mathcal{R}|$ codewords, denoted as $D = \{d_r\}_{r=1}^{|\mathcal{R}|}$, where $|\mathcal{R}|$ is the number of unique relation types. The codewords should possess two properties: they should be discriminative and semantic. To this end, we add two constraints to learn $D$: a near-zero margin for intra-relation groups and a large margin for inter-relation groups, as follows:

$$L_d = \sum_{r=1}^{|\mathcal{R}|} Y\,\mathrm{dis}(f_{t_{ij}}, d_r) + (1 - Y)\max(0,\ M - \mathrm{dis}(f_{t_{ij}}, d_r)) \qquad (5)$$

where $M$ is a constant margin for inter-relation groups; $Y = 1$ if the relation of $f_{t_{ij}}$ is $r$, otherwise $Y = 0$; $d_r$ is a learnable codeword; and $\mathrm{dis}(\cdot,\cdot)$ is a metric function that calculates the distance between two features, for which we choose the $L_2$ metric. Intuitively, $L_d$ forces the same relation group to cluster together while pushing inter-relation groups away.

Knowledge Transfer. Relations at the tail of the distribution are hard to train, as there is an insufficient number of samples for training. Simply put, the challenge lies in the fact that the feature $f_{t_{ij}}$ learned for the tail relationships is not representative. Therefore, transferring knowledge learned from the head of the distribution to the tail is critical for the recognition of those data-starved relationships. Inspired by the hallucination strategy used in meta-learning [Zhang et al., 2019c; Zhang et al., 2019a], we propose a knowledge transfer method that hallucinates the learned features. Specifically, we first build a coarse classifier on $f_{t_{ij}}$, that is,

$$p = \mathrm{softmax}(f_{t_{ij}}) \qquad (6)$$

where $p$ is a probability distribution over the relation types $\mathcal{R}$, implemented by a softmax classification layer. Then, the hallucinated feature is calculated by:

$$\tilde{f}_{t_{ij}} = \sum_{r=1}^{|\mathcal{R}|} p_r d_r \qquad (7)$$

where $d_r$ are the informative knowledge codewords learned by Equation (5). Similarly, we also apply an additive attention to combine the original feature $f_{t_{ij}}$ with the hallucinated $\tilde{f}_{t_{ij}}$:

$$a_{t_{ij}} = \max\left\{0,\ w_f \cdot (f_{t_{ij}} + \tilde{f}_{t_{ij}})\right\} \qquad (8)$$

where $w_f$ is the parameter of a non-linear layer that calculates a coefficient between the two features. Finally, we obtain the new relation feature as:

$$\tilde{f}_{ij} = f_{t_{ij}} + a_{t_{ij}} \tilde{f}_{t_{ij}}. \qquad (9)$$
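The following is a minimal PyTorch sketch of the codeword learning and head-to-tail transfer steps (Eqs. (5)-(9)), under the assumptions that the coarse classifier of Eq. (6) is a linear layer followed by a softmax and that dis is the Euclidean distance; the dimensions, margin value, and names are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeTransfer(nn.Module):
    """Learn per-relation codewords and hallucinate features for tail relations."""

    def __init__(self, dim: int = 512, num_rel: int = 51, margin: float = 10.0):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_rel, dim))  # D = {d_r}, one codeword per relation
        self.coarse_cls = nn.Linear(dim, num_rel)                 # coarse relation classifier, Eq. (6)
        self.w_f = nn.Linear(dim, 1)                               # fusion coefficient, Eq. (8)
        self.margin = margin

    def codeword_loss(self, f_t, rel_labels):
        # Eq. (5): pull each feature toward its own relation's codeword and push it at
        # least `margin` away from every other relation's codeword (L2 distance).
        dist = torch.cdist(f_t, self.codewords)                    # (n_pairs, num_rel)
        same = F.one_hot(rel_labels, self.codewords.size(0)).float()
        return (same * dist + (1 - same) * F.relu(self.margin - dist)).sum(dim=1).mean()

    def forward(self, f_t):
        p = F.softmax(self.coarse_cls(f_t), dim=-1)                # Eq. (6): coarse relation probabilities
        f_hall = p @ self.codewords                                # Eq. (7): hallucinated feature
        a = F.relu(self.w_f(f_t + f_hall))                         # Eq. (8): additive attention coefficient
        f_fused = f_t + a * f_hall                                 # Eq. (9): fused relation feature
        return f_fused, p

# Toy usage with assumed sizes.
kt = KnowledgeTransfer()
f_t = torch.randn(4, 512)
labels = torch.tensor([3, 3, 17, 42])
fused, p = kt(f_t)
loss_d = kt.codeword_loss(f_t, labels)
```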
Long-tail Features Calibration. Ideally, $\tilde{f}_{ij}$ should be close to $f_{t_{ij}}$ so that the fused feature does not change $f_{t_{ij}}$ too much, because the head relations already have sufficient samples for training and the codewords of head relations should be close to $f_{t_{ij}}$. For the tail relations, on the contrary, the modification can be significant and arbitrary, consequently leading to confusion with the head relations. Many previous works have demonstrated that the discrimination between head and tail class representations plays an essential role in imbalanced data learning [Zhu et al., 2014]. To avoid this confusion, we calibrate $\tilde{f}_{ij}$ to different scales for relationships of different frequencies:

$$f_{ij} = \alpha \cdot \max(p) \cdot \tilde{f}_{ij} \qquad (10)$$

where $p$ is the probability vector from Equation (6). Generally, for the data-rich relations, $\max(p)$ should be a large value, possibly close to 1, whereas it is much smaller for the rare relations, because the frequent relations are trained with more data and their predicate predictions should be more confident. Thus, $\max(p)$ can be seen as a discriminative calibrating metric that separates the head and tail features, and $\alpha$ is a constant scalar used to rescale them. Finally, we deploy a relation classifier on $f_{ij}$, on which a cross-entropy loss $L_{rel}$ is imposed.

Loss Function. The overall loss function is as follows:

$$L = L_s + L_{det} + L_p + L_{rel} + \epsilon L_d \qquad (11)$$

where $L_s$ is the multi-label classification loss defined in Equation (2) to learn the scene feature, $L_{det}$ is the object detection loss of Faster R-CNN, $L_d$ is the knowledge codewords learning loss defined in Equation (5), $L_p$ is the coarse relation classification loss from Equation (6), and $L_{rel}$ is the final relation classification loss defined above. $\epsilon$ is a small constant that balances the codewords loss term. The reason why $\epsilon$ is set to a small value is that $L_d$ is a distance metric usually much greater than the other terms, not that $L_d$ is unimportant. All parameters in our model are differentiable, so the model is trained in an end-to-end fashion.
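A short sketch of the calibration in Eq. (10) and the overall objective in Eq. (11), again with placeholder values for alpha, epsilon, and the feature and label sizes (none of these constants are taken from the paper).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

alpha, eps = 1.0, 0.1                      # placeholder constants; the paper keeps eps small
rel_cls = nn.Linear(512, 51)               # final relation classifier producing L_rel (sizes assumed)

def calibrate(f_fused, p):
    # Eq. (10): rescale the fused feature by the coarse classifier's confidence max(p), so
    # confident (head) relations keep large-magnitude features while tail ones are shrunk.
    return alpha * p.max(dim=-1, keepdim=True).values * f_fused

# Toy end-of-pipeline usage.
f_fused = torch.randn(4, 512)              # fused relation features from Eq. (9)
p = torch.softmax(torch.randn(4, 51), -1)  # coarse relation probabilities from Eq. (6)
labels = torch.tensor([3, 3, 17, 42])
loss_rel = F.cross_entropy(rel_cls(calibrate(f_fused, p)), labels)

def total_loss(l_s, l_det, l_p, l_rel, l_d):
    # Eq. (11): L = L_s + L_det + L_p + L_rel + eps * L_d
    return l_s + l_det + l_p + l_rel + eps * l_d
```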
We evaluate our method on three standard scene graph generation tasks: predicate classification (PredCls), scene graph classification (SGCls) and scene graph detection (SGDet). In PredCls, given ground-truth bounding boxes and object labels, the task is to predict scene graph triples over these objects. In SGCls, given the ground-truth bounding boxes only, the task is to predict object labels and triples. In SGDet, the task is to localize bounding boxes, predict object labels and predict triples. Specifically, the experiments are conducted to answer the following research questions:

RQ1: How does our method compare with state-of-the-art scene graph generation methods?
RQ2: How does each part of our model contribute to the relation recognition performance on the three tasks?

RQ3: How well does our method perform in qualitative analysis?
Dataset.
We evaluate our method on the challenging and most widely used benchmark, Visual Genome (VG) [Krishna et al., 2017], which consists of 108,077 images with, on average, annotations of 38 objects and 22 relations per image. The experimental settings follow previous works [Zellers and Yatskar, 2018; Chen et al., 2019], where we use the 150 most frequent object classes for $\mathcal{C}$ and 50 relations for $\mathcal{R}$. Similar to Neural Motifs [Zellers and Yatskar, 2018], we utilize the statistical bias information as extra knowledge to boost relation recognition performance, and we also report results without this information.

Implementation Details. $\alpha$, $\epsilon$ and the initial learning rate are fixed hyper-parameters, with the learning rate decaying as training proceeds. The codewords $D = \{d_r\}_{r=1}^{|\mathcal{R}|}$ are initialized with pre-calculated clusters computed by K-means. We apply Faster R-CNN [Girshick, 2015] based on VGG-16 as the backbone object detection and localization network. Each object proposal is processed by RoIAlign [He et al., 2017] pooling to extract object and union region features. We adopt the Top-K Recall (denoted as R@K) as the evaluation metric, following previous work [Zellers and Yatskar, 2018; Chen et al., 2019], and report R@20, R@50 and R@100 on the three tasks.
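For reference, below is a simplified sketch of how R@K could be computed for a single image in the PredCls setting, where ground-truth boxes and object labels are given, so a predicted triple counts as a hit only if its subject index, predicate, and object index exactly match a ground-truth triple; the data layout and scoring here are illustrative assumptions, not the exact evaluation protocol of prior work.

```python
def recall_at_k(pred_triples, gt_triples, k):
    """Top-K recall for one image in the PredCls setting.

    pred_triples: list of (subject_idx, predicate, object_idx, score), higher score = more confident.
    gt_triples:   iterable of (subject_idx, predicate, object_idx) ground-truth annotations.
    Returns the fraction of ground-truth triples recovered among the top-k predictions.
    """
    top_k = sorted(pred_triples, key=lambda t: t[3], reverse=True)[:k]
    hits = {t[:3] for t in top_k} & set(gt_triples)
    return len(hits) / max(len(gt_triples), 1)

# Toy example: 2 of 3 ground-truth triples are recovered within the top-20 predictions.
gt = {(0, "riding", 1), (0, "wearing", 2), (1, "near", 2)}
preds = [(0, "riding", 1, 0.9), (1, "near", 2, 0.8), (0, "on", 1, 0.4)]
print(recall_at_k(preds, gt, k=20))   # -> 0.666...
```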
With graph constraint:

| Method | SGDet R@20 | SGDet R@50 | SGDet R@100 | SGCls R@20 | SGCls R@50 | SGCls R@100 | PredCls R@20 | PredCls R@50 | PredCls R@100 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| IMP | - | 3.4 | 4.2 | - | 21.7 | 24.4 | - | 44.8 | 44.8 | 25.3 |
| Graph-RCNN | - | 11.4 | 13.7 | - | 21.7 | 31.6 | - | 54.2 | 59.2 | 33.2 |
| Neural Motifs† | - | - | - | - | - | - | - | 56.6 | 61.3 | - |
| Mem | 7.7 | 11.4 | 13.9 | 23.3 | 27.8 | 29.5 | 42.1 | 53.2 | 57.9 | 29.6 |
| KRE† | - | - | - | - | - | - | - | - | - | - |
| Ours† | - | - | - | - | - | - | - | - | - | - |
| Ours | 23.6 | 28.2 | 31.4 | 33.6 | 37.5 | 38.3 | 60.3 | 66.2 | 68.0 | 43.1 |

Without graph constraint:

| Method | SGDet R@20 | SGDet R@50 | SGDet R@100 | SGCls R@20 | SGCls R@50 | SGCls R@100 | PredCls R@20 | PredCls R@50 | PredCls R@100 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| IMP | - | 22.0 | 27.4 | - | 43.4 | 47.2 | - | 75.2 | 83.6 | 49.8 |
| Neural Motifs | 25.7 | 30.5 | 35.8 | 42.6 | 44.5 | 47.7 | 76.3 | 81.1 | 88.3 | 52.5 |
| GSM | - | - | - | - | 41.4 | 46.0 | - | 61.6 | 68.9 | - |
| KRE | 24.6 | 30.9 | 35.8 | 42.8 | 45.9 | 49.0 | 77.1 | 81.9 | 88.9 | 52.9 |
| Ours | 26.9 | 31.4 | 36.5 | 43.6 | 46.2 | 50.2 | 77.9 | 82.5 | 90.2 | 53.9 |
Table 1: Performance (R@K) comparison with the state-of-the-art methods with and without the graph constraint on VG. Since some works do not report R@20, their mean is computed over R@50 and R@100 only. † indicates that the method discards the statistical bias prior information during training.

With graph constraint:

| Method | SGDet R@20 | SGDet R@50 | SGDet R@100 | SGCls R@20 | SGCls R@50 | SGCls R@100 | PredCls R@20 | PredCls R@50 | PredCls R@100 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| BL | 20.4 | 25.2 | 27.5 | 30.3 | 33.4 | 34.6 | 54.8 | 58.5 | 62.1 | 38.5 |
| BL+SO | 22.5 | 26.7 | 30.1 | 32.5 | 35.7 | 36.8 | 58.2 | 64.2 | 66.8 | 41.5 |
| BL+SO+KT | 23.0 | 27.6 | 30.9 | 33.4 | 37.1 | 38.0 | 59.8 | 65.8 | 67.6 | 42.6 |
| BL+SO+KT+FC | 23.6 | 28.2 | 31.4 | 33.6 | 37.5 | 38.3 | 60.3 | 66.2 | 68.0 | 43.1 |

Without graph constraint:

| Method | SGDet R@20 | SGDet R@50 | SGDet R@100 | SGCls R@20 | SGCls R@50 | SGCls R@100 | PredCls R@20 | PredCls R@50 | PredCls R@100 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| BL | 23.3 | 27.5 | 32.6 | 40.2 | 43.4 | 45.3 | 73.3 | 78.5 | 86.7 | 50.0 |
| BL+SO | 25.4 | 29.2 | 34.3 | 42.7 | 44.7 | 48.1 | 76.4 | 80.6 | 88.0 | 52.2 |
| BL+SO+KT | 26.2 | 30.7 | 35.9 | 43.1 | 45.0 | 49.4 | 77.2 | 82.1 | 89.4 | 53.3 |
| BL+SO+KT+FC | 26.9 | 31.4 | 36.5 | 43.6 | 46.2 | 50.2 | 77.9 | 82.5 | 90.2 | 53.9 |
Table 2: Ablation study results, where we study the effect of the three main modules of our method: scene-object (SO), knowledge transfer(KT) and feature calibration (FC). BL denotes the baseline without any of the above modules.
Comparison with State-of-the-Art Methods (RQ1). We compare our method with the following recent state-of-the-art methods: KRE [Chen et al., 2019], GSM [Qi et al., 2019], Mem [Wang et al., 2019], IMP [Xu et al., 2017], and Neural Motifs [Zellers and Yatskar, 2018]. In addition, we also compare with Graph-RCNN [Yang et al., 2018], since it also develops an attention mechanism to learn contextual information. As the source code of LinkNet [Woo et al., 2018] is unavailable and we are unable to reproduce its results, we do not compare with LinkNet. It is worth noting that Neural Motifs and KRE use the relation bias as an additional prior to guide recognition, and we report their results with and without this bias. Also, following IMP [Xu et al., 2017], we report two sets of results for R@K computed under different conditions, constraint and unconstraint.

Table 1 shows the results on the three tasks. As some methods did not report their results at R@20, the mean result is calculated according to their reported results. From Table 1, we can make the following observations.

(1) Our method is superior to the other methods in the majority of cases, irrespective of the use of the bias information. Specifically, in terms of mean recall in the constraint setting, our method surpasses KRE, the best method among the baselines, when the statistical bias information is used, and by a larger margin when that information is not used. A similar pattern can be observed in the comparison with Neural Motifs. Moreover, the performance difference between using and not using the statistical bias information is smaller for our method than for KRE and Neural Motifs, indicating that our method does not heavily rely on this bias and that our model can essentially learn it from the raw data.

(2) GSM shows a clear advantage on the SGCls task but performs poorly on predicate classification. As GSM does not report results on the scene graph detection task, its mean result is not computed.

Table 3: Predicate classification results of the 10 least frequent tail relations with and without the knowledge transfer module, on unconstraint R@50 and R@100.

| Relation | R@50 w/o KT | R@50 w KT | R@100 w/o KT | R@100 w KT |
|---|---|---|---|---|
| lying on | - | - | - | - |
| on back of | - | - | - | - |
| to | - | - | - | - |
| mounted on | - | - | - | - |
| walk in | - | - | - | - |
| across | - | - | - | - |
| made of | - | - | - | - |
| playing | - | - | - | - |
| says | - | - | - | - |
| flying in | 0.0 | 0.0 | 0.0 | 0.0 |

Ablation Study (RQ2). We split our model into three modules: scene-object interaction (SO), knowledge transfer (KT) and feature calibration (FC). The baseline model (BL) denotes the simple model that only uses the features generated by Faster R-CNN to recognize relations. The ablation study results are shown in Table 2, where we test the performance on the three tasks by adding each module one at a time. For a fair comparison, all ablated models are trained for the same number of epochs. We can observe that under both experimental conditions, constraint and unconstraint, the performance of the baseline is the worst. The addition of the scene-object interaction module SO improves the average performance by 2-3 percentage points, which confirms the crucial role the global contextual information plays in relation recognition. When we deploy the knowledge transfer module KT, a further improvement of about 1 percentage point is gained. Finally, though the gain from adding the feature calibration module FC is not as significant as that of the other two modules, it still obtains a noticeable lift of about 0.5 percentage points.

Our knowledge transfer module (KT) is specifically designed to address the problem of data imbalance.
To evaluate its effectiveness, Table 3 shows the predicate classification (PredCls) results for the 10 least frequent tail relations, whose frequencies are substantially lower than the average frequency of all relations. The columns "w/o KT" (respectively "w KT") denote the model without (respectively with) knowledge transfer and feature calibration. The superiority of the knowledge transfer module can be clearly observed. It is worth noting that since the relation flying in has only five samples in the entire dataset, all of its results are zero. More generally, the knowledge transfer module on average improves performance on each of these relations by a noticeable margin.
Figure 2: Qualitative results on two images from two models: the baseline and the full model. Solid lines represent correct relations detected by both models. Dashed lines represent wrong relations detected by the baseline model. Dotted lines represent correct relations detected by the full model that the baseline model missed.

Briefly, we can draw two conclusions from the ablation study. (1) The three modules all positively contribute to the relation recognition performance, and their combination achieves the best results. (2) The scene-object interaction module is the most effective of the three, as it offers more contextual clues and knowledge, and the other two modules rely on the knowledge learned from the scene context.
Qualitative Analysis (RQ3). Figure 2 visualizes scene graph generation results from two models: the baseline model and the full model. We can observe that though the baseline model is able to capture many relations, it gets confused in some cases. Taking the second image as an example, the baseline model predicts that the keyboard is under the screen when it is in fact under the table. The likely reason is that the baseline model only considers the visual and spatial features of the screen and keyboard objects but does not consider the global scene feature.
In this work, we investigate the long-tail problem in scene graph generation. To address this issue, we propose an end-to-end framework consisting of three modules: scene-object interaction, knowledge transfer and feature calibration, each of which has its specific function. Extensive experimental results show that our method significantly outperforms other state-of-the-art methods on all standard evaluation metrics. We observe that there still exists a large performance gap between the scene graph detection task and the predicate classification task. In the future, we will focus on object label refinement, which is a promising way to improve scene graph generation performance.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[Britz et al., 2017] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906, 2017.

[Chen et al., 2019] Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, pages 6163–6171, 2019.
[Dai et al., 2017] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In CVPR, pages 3076–3086, 2017.

[Deng et al., 2014] Jia Deng, Nan Ding, Yangqing Jia, and Andrea Frome. Large-scale object classification using label relation graphs. In ECCV, pages 48–64. Springer, 2014.

[Dornadula et al., 2019] Apoorva Dornadula, Austin Narcomey, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationships as functions: Enabling few-shot scene graph prediction. arXiv preprint arXiv:1906.04876, 2019.

[Girshick, 2015] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.

[Gu et al., 2019] Jiuxiang Gu, Handong Zhao, Zhe Lin, and Sheng Li. Scene graph generation with external knowledge and image reconstruction. In CVPR, pages 1969–1978, 2019.

[He et al., 2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.

[Hsu et al., 2017] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. arXiv preprint arXiv:1711.10125, 2017.

[Johnson et al., 2015] Justin Johnson, Ranjay Krishna, and Michael Stark. Image retrieval using scene graphs. In CVPR, pages 3668–3678, 2015.

[Krishna et al., 2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, and Joshua Kravitz. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.

[Lu et al., 2016] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, pages 852–869. Springer, 2016.

[Qi et al., 2019] Mengshi Qi, Weijian Li, Zhengyuan Yang, and Yunhong Wang. Attentive relational networks for mapping images to scene graphs. In CVPR, pages 3957–3966, 2019.

[Savinov et al., 2018] Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, and Marc Pollefeys. Episodic curiosity through reachability. In ICLR, 2018.

[Shi et al., 2019] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. In CVPR, pages 8376–8384, 2019.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Speer et al., 2017] Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.

[Wang et al., 2019] Wenbin Wang, Ruiping Wang, and Shiguang Shan. Exploring context and visual pattern of relationship for scene graph generation. In CVPR, pages 8188–8197, 2019.

[Woo et al., 2018] Sanghyun Woo, Dahun Kim, and Donghyeon Cho. LinkNet: Relational embedding for scene graph. In NIPS, pages 560–570, 2018.

[Xie et al., 2018] Shaoan Xie, Zibin Zheng, and Liang Chen. Learning semantic representations for unsupervised domain adaptation. In ICML, pages 5419–5428, 2018.

[Xu et al., 2017] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, pages 5410–5419, 2017.

[Yang et al., 2018] Jianwei Yang, Jiasen Lu, and Stefan Lee. Graph R-CNN for scene graph generation. In ECCV, pages 670–685, 2018.

[Yang et al., 2019] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, pages 10685–10694, 2019.

[Zellers and Yatskar, 2018] Rowan Zellers and Mark Yatskar. Neural Motifs: Scene graph parsing with global context. In CVPR, pages 5831–5840, 2018.

[Zellers et al., 2019] Rowan Zellers, Yonatan Bisk, and Ali Farhadi. From recognition to cognition: Visual commonsense reasoning. In CVPR, pages 6720–6731, 2019.

[Zhang et al., 2019a] Hongguang Zhang, Jing Zhang, and Piotr Koniusz. Few-shot learning via saliency-guided hallucination of samples. In CVPR, June 2019.

[Zhang et al., 2019b] Ji Zhang, Yannis Kalantidis, and Marcus Rohrbach. Large-scale visual relationship understanding. In AAAI, volume 33, pages 9185–9194, 2019.

[Zhang et al., 2019c] Weihe Zhang, Yali Wang, and Yu Qiao. MetaCleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In CVPR, June 2019.

[Zhu et al., 2014] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, pages 915–922, 2014.

[Zhuang et al., 2017] Bohan Zhuang, Lingqiao Liu, and Chunhua Shen. Towards context-aware interaction recognition for visual relationship detection. In ICCV, 2017.