3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents
Ue-Hwan Kim, Jin-Man Park, Taek-Jin Song, and Jong-Hwan Kim, Fellow, IEEE
Abstract—Intelligent agents gather information and perceive semantics within their environments before taking on given tasks. The agents store the collected information in the form of environment models that compactly represent the surrounding environments. Without an efficient and effective environment model, however, the agents can only conduct limited tasks. Thus, such an environment model plays a crucial role in the autonomy systems of intelligent agents. We claim the following characteristics for a versatile environment model: accuracy, applicability, usability, and scalability. Although a number of researchers have attempted to develop models that represent environments precisely to a certain degree, those models lack broad applicability, intuitive usability, and satisfactory scalability. To tackle these limitations, we propose the 3-D scene graph as an environment model and the 3-D scene graph construction framework. The concise and widely used graph structure readily guarantees usability as well as scalability for the 3-D scene graph. We demonstrate the accuracy and applicability of the 3-D scene graph by exhibiting its deployment in practical applications. Moreover, we verify the performance of the proposed 3-D scene graph and the framework by conducting a series of comprehensive experiments under various conditions.
Index Terms—3-D scene graph, environment model, intelligent agent, scene graph, scene understanding.
I. INTRODUCTION

THE CAPABILITY of understanding the surrounding environments is one of the key factors for intelligent agents to successfully complete given tasks [1]. Without this capability, the agents can only perform simple and limited tasks. For versatile performance, the agents have to perceive not only the physical attributes of the environments but also the semantic information inherent in the environments.
Manuscript received October 12, 2018; revised May 20, 2019; accepted July 22, 2019. This work was supported by the Institute for Information and Communications Technology Promotion (IITP) Grant funded by the Korean government (MSIT) (No. 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion). This paper was recommended by Associate Editor P. De Meo. (Corresponding author: Jong-Hwan Kim.)
The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the author. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2019.2931042

In the process of observing the environments and storing up the collected information, the agents construct environment models, which compactly represent the surrounding spaces [2]. Such models include dense maps generated by SLAM [3] and descriptions of scenes [4] produced by computer vision and natural language processing (NLP) algorithms. Environment models let the agents plan how to perform given tasks and offer grounds for inference and reasoning. Thus, an effective environment model for intelligent agents is of great importance.

We claim the following properties for an effective environment model for intelligent agents.
1) Accuracy: The model should delineate environments precisely and provide intelligent agents with correct information.
2) Applicability: Intelligent agents should be able to utilize the model in performing various types of tasks rather than just one specific task.
3) Usability: The model should not require a complicated procedure for usage, but provide an intuitive user interface for application.
4) Scalability: The model should be able to depict both large- and small-scale environments and increment the coverage step by step.

The claimed properties of environment models enhance the autonomy of intelligent agents. Intelligent agents could collect correct information regarding the environments in the process of constructing the models (accuracy), utilize the models to complete diverse tasks in the environments (applicability) with ease (usability), and expand their knowledge of the environments incrementally (scalability).

In this paper, we define the 3-D scene graph, which represents physical environments in a sparse and semantic way, and propose the 3-D scene graph construction framework. The proposed 3-D scene graph describes the environments compactly by abstracting the environments as graphs, where nodes depict the objects and edges characterize the relations between the pairs of objects. As the proposed 3-D scene graph illustrates the environments in a sparse manner, the graph can cover an extensive range of physical spaces, which guarantees the scalability. Even when agents have to deal with a broad range of environments or encounter new environments in the middle of a task, the 3-D scene graph offers a quick way of accessing and updating the environment models. Furthermore, the concise structure of the 3-D scene graph ensures intuitive usability. The graph structure is straightforward, since the 3-D scene graph follows the convention of common graph structures, and graph structures are already familiar to the majority of researchers due to their wide use. Next, we verify the applicability of the 3-D scene graph by demonstrating two major applications: 1) visual question and answering (VQA) and 2) task planning. The two applications are under active research in the computer vision [5], NLP [6], and robotics societies [7].

The proposed 3-D scene graph construction framework extracts relevant semantics within environments, such as object categories and relations between objects, as well as physical attributes, such as 3-D positions and major colors, in the process of generating 3-D scene graphs for the given environments. The framework receives a sequence of observations in the form of RGB-D image frames. For robust performance, the framework filters out unstable observations (i.e., blurry images) using the proposed adaptive blurry image rejection (ABIR) algorithm. Then, the framework factors out keyframe groups to avoid redundant processing of the same information. Keyframe groups contain reasonably overlapping frames. Next, the framework extracts semantics and physical attributes within the environments through the recognition modules. During the recognition processes, spurious detections get rejected.
Finally, the gathered information gets fused into the 3-D scene graph, and the graph gets updated upon new observations.

The main contributions of this paper are as follows.
1) We define the concept of the 3-D scene graph, which represents environments in an accurate, applicable, usable, and scalable way.
2) We design the 3-D scene graph construction framework, which generates 3-D scene graphs for environments upon receiving a sequence of observations.
3) We provide two application examples of the 3-D scene graph: a) VQA and b) task planning.
4) We conduct a series of thorough experiments and analyze the results both quantitatively and qualitatively to verify the performance of the 3-D scene graph.
5) We make the source code of the algorithms presented in this paper public to contribute to the research society and the development of the field.

The remainder of this paper is organized as follows. Section II introduces the related works with associated results and issues. Section III presents an overview of the 3-D scene graph construction framework. Sections IV and V describe each module of the proposed 3-D scene graph construction framework. Section VI provides two major applications of the 3-D scene graph. Section VII illustrates the experimental settings and the analysis of the results. Discussion and concluding remarks follow in Sections VIII and IX, respectively.

II. RELATED WORKS
We overview the previous research outcomes relevant to our 3-D scene graph and point out the differences between the previous works and the 3-D scene graph in this section. (Source code: https://github.com/Uehwan/3-D-Scene-Graph)

A. Environment Representation
A number of studies have attempted to digitize and store environmental information obtained from various sensors. Although there exist multiple options for sensor selection, we focus on visual sensors in this paper. The studies fall into two categories: 1) raw and dense representations and 2) abstractive and descriptive representations. First of all, the raw and dense representations aim to represent the environments as they are with minimum distortion. They take two steps in general. First, the algorithms gather information for model building using one of the SLAM algorithms [3], [8], and then construct environment models through point clouds [9], grid-based models [10], and probabilistic occupancy maps [11]. The resulting models, known as 3-D reconstructions, however, require massive memory space, take much time for processing, and lack any semantic information, requiring additional recognition processes to be applicable to reasoning-enabled high-level AI applications.

3-D semantic segmentation, which labels every point in a point cloud, supplements the raw and dense representations with semantic information. The area is under active research and empowered by recent advances in deep learning. One recent work has demonstrated real-time 3-D semantic segmentation by combining the latest SLAM [3] with a CNN-based 2-D semantic segmentation algorithm [12]. In addition, methods directly segmenting 3-D point clouds semantically have been presented [13]. These methods can provide the locations and classes of objects in given environments. Nevertheless, the relationships between objects are missing in these methods, and they still require much memory and processing time.

The abstractive and descriptive representations, on the other hand, aim to summarize the characteristics of the environments by retaining only the relevant information. Algorithms for the abstractive and descriptive representations include image captioning, which describes given scenes in natural language [4]. Image captioning algorithms select pairs of objects from the given images, generate descriptions for the pairs, and repeat the previous processes multiple times. Therefore, the relations between the objects are naturally extracted in the process of generating descriptions. However, the coverage of the spaces image captioning methods can express is confined to the field of view (FoV) of the cameras, since the methods utilize only one image at a time. In addition, the output of image captioning is in the form of natural language, which makes it difficult for other AI applications to utilize the information, as natural language presents the information in an unordered way. Video captioning [14] could compensate for the limited coverage of image captioning, but the current focus of the field is on the dynamics of objects or scenes, discarding a considerable amount of information retained in the sequence. Moreover, the output format of video captioning does not vary from that of image captioning.
B. Scene Graph
A scene graph represents an image with a graph structure, where the nodes correspond to each object instance in the image and the edges depict the pairwise relationships between the objects (see Fig. 1).
Fig. 1. Example of 2-D scene graph. The scene graph abstracts the given scene and represents the relations between objects (gray nodes) as well as object attributes (black nodes). (a) Input scene with object recognition results. (b) Corresponding scene graph.

Compared to the previous text-based representations of visual scenes [4], the scene graph representation offers much contextual information with respect to relative geometry and semantics. Research related to scene graphs can be categorized into two groups: 1) scene graph generation and 2) application of scene graphs. The scene graph generation algorithms focus on producing accurate scene graphs given visual scenes. Current state-of-the-art methods utilize iterative message passing [15], the multitask formulation technique [16] for boosting the performance, or the subgraph formulation [17].

For the second group, a wide range of visual tasks, such as semantic image retrieval [18], scene synthesis [19], and visual question answering [20], employ the scene graph representations and report enhanced performance. As end-to-end approaches often exploit the statistics of the training dataset rather than truly understanding the visual scenes, the scene graph representation facilitates the algorithms to actually extract relevant features and understand the semantics. The current research on scene graphs, however, concentrates on 2-D static scenes, which cannot provide physical attributes such as 3-D positions, and deals with only one image, which limits the spatial coverage. Our proposed 3-D scene graph expands 2-D scene graphs into 3-D spaces. This expansion enables intelligent agents to understand the surrounding environments in a more tangible way and to perform tasks in a more stable and precise manner.
III. OVERALL SYSTEM ARCHITECTURE
Fig. 2 shows the overall system architecture of the proposed 3-D scene graph construction framework. The proposed framework takes in streams of images and preprocesses the input images. In the preprocessing step, the noise within the images gets removed and appropriate scaling and cropping are applied. Then, the preprocessed images enter the blurry image rejection module. Blurry images get rejected to guarantee the stable performance of later modules.
Fig. 2. Overall architecture of the proposed 3-D scene graph construction framework. The framework receives RGB-D image frames and then processes the input to generate a 3-D scene graph.

Then, the pose estimation module extracts the relative poses between nearby frames using visual odometry or SLAM. The poses estimated in this module are utilized for extracting the keyframe groups and physical features of objects in the following modules. In this framework, we use ElasticFusion [3] and BundleFusion [8] for estimating the poses. Next, the keyframe group extraction (KGE) module categorizes each frame as a key, anchor, or garbage frame for boosting efficiency. After KGE, the framework processes only the reasonably overlapping frames.

The extracted keyframe groups go through the region proposal and object recognition modules. The region proposal module detects the plausible regions of object instances, and the object recognition module classifies the objects within the regions. Specifically, we use Faster R-CNN [21] to recognize object instances (region proposal) and object categories (object recognition). The object recognition module outputs a list of category candidates with confidence scores for an object. After the object instances with the corresponding classes are extracted, the relation extraction module identifies the pairwise relations between object instances. The relations include action (e.g., jumping over), spatial (e.g., behind), and prepositional (e.g., with) relations. Factorizable Net (F-Net) [17] is incorporated in the proposed framework for the relation extraction process.

Running the recognition modules on the stream of images inevitably generates spurious detections despite the postprocessing procedures that follow the recognition modules. The spurious detection rejection (SDR) module removes both the spurious and duplicate detections using 3-D position information, semantics from word2vec [22], and rule-bases. The local 3-D scene graph construction module receives the processed detection results and constructs a local 3-D scene graph for one input frame. This local 3-D scene graph covers only a short range of physical spaces, and the first local 3-D scene graph becomes the initial global 3-D scene graph. In the global 3-D scene graph construction module, the initial global 3-D scene graph gets updated and expanded upon receiving local 3-D scene graphs generated from the following image frames.
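To make the data flow of Fig. 2 concrete, the following Python sketch chains the modules in the order just described. It is a minimal illustration under our own naming, not the released implementation: every module name and signature here is a hypothetical stand-in for the corresponding block in Fig. 2.

    class SceneGraphPipeline:
        """Minimal sketch of the construction pipeline in Fig. 2 (hypothetical names)."""

        def __init__(self, modules):
            self.modules = modules      # dict of callables, one per stage
            self.global_graph = None    # global 3-D scene graph

        def process_frame(self, rgbd_frame):
            m = self.modules
            frame = m["preprocess"](rgbd_frame)              # denoise, scale, crop
            if m["is_blurry"](frame):                        # ABIR (Section IV-A)
                return self.global_graph
            pose = m["estimate_pose"](frame)                 # visual odometry / SLAM
            role = m["classify_frame"](frame, pose)          # KGE (Section IV-B)
            if role == "garbage":                            # redundant frame: skip
                return self.global_graph
            objects = m["recognize"](frame)                  # region proposal + recognition
            relations = m["extract_relations"](frame, objects)             # F-Net-style module
            objects, relations = m["reject_spurious"](objects, relations)  # SDR (Section IV-C)
            local_graph = m["build_local_graph"](objects, relations, pose) # Section V-B
            self.global_graph = m["merge"](self.global_graph, local_graph) # Section V-C
            return self.global_graph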
IV. DATA PROCESSING
In this section, we describe the major data processing modules of the proposed 3-D scene graph construction framework. The three modules proposed in this paper allow the efficient and effective generation of 3-D scene graphs.
A. Adaptive Blurry Image Rejection
For stable performance, the object recognition and relation extraction modules require clean and motionless images as input. Blurry input images deteriorate the performance, since the shape, size, and even color of an object appear different in blurry images. The input image sequences for the proposed 3-D scene graph construction framework could contain such blurry images due to abrupt camera motions. To prevent blurry images and to guarantee the robust performance of the recognition modules, we adopt the variance of Laplacian defined as

$$V = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left[ L(x,y) - \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} L(x,y) \right]^2 \quad (1)$$

where $W$ and $H$ denote the image width and height, respectively, and $L(x,y) = \partial^2 I/\partial x^2 + \partial^2 I/\partial y^2$ is the Laplacian operator. The variance of Laplacian measures the intensity variations across pixels in an image. By filtering images with the variance of Laplacian less than a threshold, blurry images get removed. However, a fixed threshold value could detect nonblurry images with low texture as blurry frames, since the intensities of low-texture images do not vary significantly across pixels.

To cope with the varying texture problem, we propose the ABIR algorithm. First, the proposed algorithm evaluates the exponential moving average (EMA), $S_t$, over the Laplacian variances

$$S_t = \begin{cases} V_t & t = 1 \\ \alpha \cdot S_{t-1} + (1-\alpha) \cdot V_t & t > 1 \end{cases} \quad (2)$$

where $t$ is the time step, $V_t$ is the variance of Laplacian at $t$, and $\alpha$ is a constant smoothing factor in the interval [0, 1). A lower $\alpha$ reduces the effect of older observations faster. At the initial phase, the EMA does not follow the observed values since only a small number of observations are available at first. This phenomenon is known as a bias, and we correct the bias as follows:

$$S'_t = \frac{S_t}{1 - \alpha^t}. \quad (3)$$

Then, we modify the bias-corrected average value, $S'_t$, to generate the adaptive threshold, $t_{blurry}$, as follows:

$$t_{blurry} = g \cdot \ln(1 + S'_t) + b \quad (4)$$

where $g$ and $b$ represent the gain and offset, respectively.
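As a concrete illustration, the ABIR logic of (1)-(4) can be sketched in a few lines of Python with OpenCV. This is a hedged sketch, not the authors' released code; the default gain, offset, and smoothing factor simply reuse the values reported in Section VII.

    import cv2
    import numpy as np

    class AdaptiveBlurRejector:
        """Sketch of ABIR, (1)-(4); defaults follow the values in Section VII."""

        def __init__(self, alpha=0.9, gain=30.0, offset=25.0):
            self.alpha, self.gain, self.offset = alpha, gain, offset
            self.ema = 0.0  # S_t: exponential moving average of Laplacian variances
            self.t = 0      # time step

        def is_blurry(self, gray_image):
            # Eq. (1): variance of Laplacian; low variance means little
            # high-frequency content, i.e., a blurry (or low-texture) frame.
            v = cv2.Laplacian(gray_image, cv2.CV_64F).var()
            self.t += 1
            # Eq. (2): EMA over the observed variances.
            self.ema = v if self.t == 1 else self.alpha * self.ema + (1.0 - self.alpha) * v
            # Eq. (3): bias correction for the initial phase.
            s_corrected = self.ema / (1.0 - self.alpha ** self.t)
            # Eq. (4): texture-adaptive threshold.
            t_blurry = self.gain * np.log(1.0 + s_corrected) + self.offset
            return v < t_blurry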
Fig. 3. Process of KGE. The input image frames are classified into three classes: key, anchor, and garbage frames. Through the process, redundancies in the image frames are resolved.

B. Keyframe Group Extraction
The KGE module 1) receives the sequence of preprocessed image frames; 2) filters out unnecessary frames by categorizing the input frames into three classes; and 3) forms the keyframe groups. The three classes of the input frames are as follows.
1) Keyframe: A keyframe works as a reference. The keyframe determines the first anchor frame and the coverage of the keyframe group.
2) Anchor Frame: An anchor frame is either active or inactive. Only the latest anchor frame is active, and the active anchor frame determines the next anchor frame.
3) Garbage Frame: Frames other than the keyframes and the anchor frames are classified as garbage frames and discarded due to redundancy.

Fig. 3 and Algorithm 1 depict the process of the KGE module. First of all, the module sets the first nonblurry incoming frame as the first keyframe. The rest of the frames go through the classification process. Each incoming frame is compared with both the current keyframe and the active anchor frame. A frame with less than $t_{anchor}$% of overlap with the active anchor frame is kept as the next anchor frame. For the extraction of the first anchor frame, the input frames are compared with the keyframe. With the detection of a new anchor frame, the current active anchor frame becomes inactive and the new anchor frame turns active. If an incoming frame overlaps less than $t_{keyframe}$% with the keyframe, the incoming frame becomes a new keyframe, and the previous keyframe and the anchor frames detected up to that point form a keyframe group. We set $t_{anchor}$ larger than $t_{keyframe}$.

The overlap between two frames is calculated by projecting one frame into the other frame's coordinates as follows:

$$\text{overlap} = \frac{1}{W \cdot H}\sum_{x}\sum_{y} \mathbb{1}_{I_{W,H}}\big(p'(i,j)\big) \quad (5)$$

where $\mathbb{1}_A(\cdot)$ is an indicator function for a set $A$, $I_{W,H} = \{(x,y) \mid 0 \le x < W,\ 0 \le y < H,\ (x,y) \in \mathbb{Z}^2\}$, and $p'(i,j)$ is a projection function from the source image frame to the target image frame, which is defined as

$$p' = K \cdot T_{i,j} \cdot D(p) \cdot K^{-1} \cdot p \quad (6)$$

where $K$ is the camera intrinsic matrix, $T_{i,j}$ is the relative pose between the $i$th and $j$th frames, $D$ is the depth measure of point $p$, and $p$ is the original point before projection.
Algorithm 1 Keyframe Group Extraction
Input: Sequence of image frames
Output: List of keyframe groups

    keyframe_groups = []
    keyframe, anchor = [get_next_frame()] * 2
    curr_keyframe_group = [keyframe]
    while curr_frame = get_next_frame() do
        overlap_key = overlap(keyframe, curr_frame)
        overlap_anchor = overlap(anchor, curr_frame)
        if overlap_key < t_keyframe then
            keyframe_groups.append(curr_keyframe_group)
            keyframe, anchor = [curr_frame] * 2
            curr_keyframe_group = [keyframe]
            continue
        end if
        if overlap_anchor < t_anchor then
            anchor = curr_frame
            curr_keyframe_group.append(anchor)
        end if
    end while

By having both the keyframe and the anchor frame as two references, the proposed KGE can effectively handle the redundant information inherent within consecutive image sequences. In addition, the proposed framework does not explode, but efficiently deals with the redundancies even when the camera captures the input frames while staying around the same point for a long time. The time complexity of Algorithm 1 is $O(NWH)$, where $N$ denotes the number of input image frames. The single while loop in Algorithm 1 makes the total time complexity $O(N \cdot \text{overlap})$. The time complexity of the overlap function is $O(WH)$, since the function iterates through both the horizontal and vertical directions as shown in (5), and each projection takes constant time as shown in (6).
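A direct way to read (5) and (6) is as a sampled projection test: back-project source pixels using the depth map, transform them with the relative pose, re-project with the intrinsics, and count how many land inside the target image. The Python sketch below assumes pinhole geometry and illustrative array shapes, and mirrors the 1000-point sampling mentioned in Section VII; it is an approximation of (5), not the exact full-image sum.

    import numpy as np

    def overlap_ratio(depth_src, K, T_src_to_tgt, width, height, n_samples=1000):
        # Sample source pixels (Section VII samples 1000 points for speed).
        xs = np.random.randint(0, width, n_samples)
        ys = np.random.randint(0, height, n_samples)
        d = depth_src[ys, xs]
        valid = d > 0                                         # ignore missing depth
        # Eq. (6): p' = K * T_{i,j} * D(p) * K^{-1} * p, with p a homogeneous pixel.
        pixels = np.stack([xs, ys, np.ones(n_samples)])       # 3 x N
        cam_src = d * (np.linalg.inv(K) @ pixels)             # back-projected 3-D points
        cam_src_h = np.vstack([cam_src, np.ones(n_samples)])  # 4 x N homogeneous
        cam_tgt = (T_src_to_tgt @ cam_src_h)[:3]              # into the target frame
        proj = K @ cam_tgt
        z = np.where(np.abs(proj[2]) < 1e-9, np.nan, proj[2])
        u, v = proj[0] / z, proj[1] / z
        # Eq. (5): indicator of landing inside the target image bounds.
        inside = valid & (z > 0) & (0 <= u) & (u < width) & (0 <= v) & (v < height)
        return inside.sum() / n_samples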
C. Spurious Detection Rejection

Processing streams of images unavoidably produces erroneous results, since the recognition modules do not achieve perfect performance and image frames captured with the sensing devices are corrupted by noise. The purpose of SDR lies in removing both the spurious and repeated detections by distilling human knowledge and priors from other knowledge bases. SDR works over multiple modules: the region proposal, object recognition, and relation extraction modules. First, SDR removes redundant object regions proposed by the region proposal module. It utilizes the popular nonmaximum suppression (NMS) [23] to achieve this goal. For the object recognition module, the SDR module removes irrelevant objects: it deletes predefined irrelevant objects, such as road, sky, building, and moving objects. These objects cannot exist in the setting of the 3-D scene graph.

For the relation extraction module, the SDR module inspects the detected relations between the pairs of objects from multiple frames in one keyframe group, and the most frequently occurring relations remain in the graph. If a tie occurs, all the top-ranked relations get added to the graph. Then, SDR applies a relation dictionary as a prior. The relation dictionary stores the statistics of relations for the pairs of objects. We construct the relation dictionary using the Visual Genome dataset [24] in advance. We collect the statistics of the relations between the pairs of objects and store the pixel distance information ($d_{pixel}$) as Gaussian distributions.

To apply the prior, SDR gathers the related object pairs, searches through the relation dictionary, and calculates the probabilities of the detected relations. The relations with probabilities lower than a threshold value are deleted from the graph. In calculating the probabilities, we utilize the pixel distances between the objects to filter out relations between pairs of objects whose distances make the relations implausible. The probability of a relation, $\Pr(r \mid d_{pixel})$, takes both the pixel distance and frequency into account as follows:

$$\Pr(r \mid d_{pixel}) = \Pr\nolimits_{dict}(r \mid d_{pixel}) \cdot \phi_{\mu,\sigma}(d_{pixel}) / \phi_{\mu,\sigma}(\mu) \quad (7)$$

where $\Pr_{dict}(r \mid d_{pixel})$ is the statistics from the Visual Genome dataset and $\phi_{\mu,\sigma}(k) = \big(1/\sqrt{2\pi}\sigma\big)\exp\big(-\tfrac{1}{2}([k-\mu]/\sigma)^2\big)$ is the Gaussian function. We approximate the probability of the distance between the points by a normalized probability density function.
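The prior in (7) reduces to a table lookup followed by a normalized Gaussian penalty on the pixel distance. A minimal sketch follows, assuming a dictionary layout (our own invention, built offline from the Visual Genome statistics) that stores, per object-pair key and relation, the frequency prior and the fitted (mu, sigma) of the pixel distance.

    import math

    def relation_probability(pair_key, relation, d_pixel, relation_dict):
        # relation_dict[pair_key][relation] = (p_dict, mu, sigma); this layout
        # is an illustrative assumption, populated from Visual Genome stats.
        p_dict, mu, sigma = relation_dict[pair_key][relation]
        gauss = lambda k: math.exp(-0.5 * ((k - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
        # Eq. (7): frequency prior, scaled by the distance plausibility and
        # normalized so the penalty equals 1 at the distribution mode (k = mu).
        return p_dict * gauss(d_pixel) / gauss(mu)

    # A detected relation is dropped when relation_probability(...) falls
    # below the rejection threshold used by SDR.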
V. 3-D SCENE GRAPH CONSTRUCTION

In this section, we define the proposed 3-D scene graph with its properties. Next, we detail the 3-D scene graph construction algorithm and the merge and update algorithm for the global 3-D scene graph.
A. Graph Representation
A 3-D scene graph $G$ is defined as a pair of sets, $G = (V, E)$, where $V$ and $E$ denote the set of vertices and the set of edges, respectively. On the one hand, the vertices represent the objects present in the environments. Each vertex contains the following information regarding the object it represents.
1) Identification Number (ID): A unique number assigned to each object in the environment.
2) Semantic Label: Category of the object classified by the object recognition module.
3) Physical Attributes: Physical characteristics, such as size (height and width), major colors, or position relative to the first keyframe.
4) Visual Feature: A thumbnail, color histogram, or extracted visual features with descriptors.

On the other hand, the following lists the types of relations an edge could stand for.
1) Actions: Behaviors shown by one object toward another one (e.g., feeding).
2) Spatial Relation: Spatial relations, such as distance and relative position (e.g., in front of).
3) Description: States of one object related to another object (e.g., wear).
4) Preposition: Semantic relations that are expressed by prepositions (e.g., with).
5) Comparison: Relative attributes of one object compared to another one (e.g., smaller).

As one object is subjective and the other is objective given a pair of objects, the edges in 3-D scene graphs are directed (the subjective objects point at the objective objects). Therefore, 3-D scene graphs are directed graphs (or digraphs).
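The definition above maps naturally onto a small record type per vertex and per edge. The Python sketch below fixes one possible layout; all field names are our own illustration rather than a prescribed schema, and the extraction and merge sketches later in this section reuse it.

    from dataclasses import dataclass

    @dataclass
    class ObjectNode:
        # One vertex of the 3-D scene graph (illustrative field layout).
        node_id: int                    # 1) unique identification number
        labels: dict                    # 2) top-k semantic labels -> scores
        position: tuple                 # 3) Gaussian 3-D position: (mean, variance)
        n_points: int = 0               # number of points behind the Gaussian
        size: tuple = (0.0, 0.0)        # 3) physical height and width
        color_histogram: object = None  # 4) HSV histogram (visual feature)
        thumbnail: object = None        # 4) cropped image patch

    @dataclass
    class RelationEdge:
        # One directed edge: the subject node points at the object node.
        subject_id: int
        object_id: int
        predicate: str                  # action / spatial / description / preposition / comparison

    # The graph itself can then be as simple as:
    # graph = {"nodes": {node_id: ObjectNode, ...}, "edges": [RelationEdge, ...]}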
B. Local Graph Construction
Up to the relation extraction module, the proposed 3-D scene graph construction framework extracts a part of the object attributes and the relations between objects, which could form a 2-D version of the scene graph for the input frame. In the local 3-D scene graph construction module, the 2-D attributes are translated into 3-D attributes and additional attributes are extracted. The constructed local 3-D scene graph for one image frame gets merged and updated into the global 3-D scene graph in the following module.

The ID of an object is temporarily assigned, and a later module determines the ID with respect to the global 3-D scene graph. Furthermore, the previous modules readily provide the semantic label candidates for the object with the corresponding scores. We keep the top-$k$ labels and the relevant scores for the same node detection afterward. The 3-D position of the object, on the other hand, is expressed as a Gaussian distribution, because using only one center point for the position of the object is prone to measurement error. We carve out the center rectangle from the object bounding box given by the region proposal module, after dividing the object region into a 5 × 5 grid. The 3-D position of each point, $p''$, is evaluated by

$$p'' = T_{i,o} \cdot D(p) \cdot K^{-1} \cdot p \quad (8)$$

where $i$ and $o$ represent the indices of the current frame and the first keyframe, respectively. Then, we evaluate the mean ($\mu$) and variance ($\sigma^2$) for the Gaussian distribution of the 3-D position, $\sim N(\mu, \sigma^2)$. In the process, we assume each dimension, $x$, $y$, and $z$, is independent and identically distributed, and we keep the number of points used for the evaluation.

Next, we calculate the color histogram for the object, $h_{H,S,V}$, as follows:

$$h_{h,s,v} = N \cdot \Pr(H = h, S = s, V = v) \quad (9)$$

where $(H, S, V)$ represents the three axes of the color space and $N$ is the number of pixels. It is reported that the $(H, S, V)$ space shows superior performance over the $(R, G, B)$ space in general [25]. We digitize each axis of the color space into $c$ bins, making the size of the histogram $c^3$. Finally, the region inside the bounding box becomes the thumbnail of the object.

The attributes extracted in this step get updated and modified by later modules, which collect the information of objects from multiple frames and then make the final decisions.
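A compact sketch of the two attribute computations above: the per-axis Gaussian over the points back-projected by (8), and the $c^3$-bin HSV histogram of (9). OpenCV conventions (H in [0, 180), S and V in [0, 256)) are assumed here, and the bin count c is a free parameter.

    import cv2
    import numpy as np

    def position_gaussian(points_3d):
        # points_3d: (n, 3) array of 3-D points from Eq. (8); each axis is
        # modeled as an independent Gaussian, as assumed in the text.
        return points_3d.mean(axis=0), points_3d.var(axis=0)

    def hsv_histogram(bgr_patch, c=8):
        # Eq. (9): digitize each HSV axis into c bins -> histogram of size c^3.
        # Counts are kept un-normalized, so hist.sum() equals the pixel count N.
        hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
        return cv2.calcHist([hsv], [0, 1, 2], None, [c, c, c],
                            [0, 180, 0, 256, 0, 256])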
C. Graph Merge and Update

The graph merge and update process merges the individual 3-D scene graphs for single image frames into one global 3-D scene graph and updates the nodes and edges of the global 3-D scene graph accordingly. As the camera view and position vary, the recognition modules would extract different information from the same objects. The following procedures compensate for such variations and construct the global 3-D scene graph for the entire environment.
1) Same Node Detection:
Without same node detection, the 3-D scene graph would explode with the same nodes added numerous times, and multiple observations of the same objects could not be integrated effectively. We utilize the following features for the same node detection: object label, 3-D position, and color histogram. From these features, we evaluate the similarity scores as follows.

First, we define the label similarity, $s_{label}$, as

$$s_{label} = \begin{cases} |C_o \cap C_c| \cdot \text{score} & |C_o \cap C_c| > 0 \\ \{1 - d_{wv}(f_{wv}(l_o), f_{wv}(l_c))\} \cdot \text{score} & \text{otherwise} \end{cases} \quad (10)$$

where the subscripts $o$ and $c$ refer to the original node in the global 3-D scene graph and the candidate node in the current frame, respectively, $C_o$ and $C_c$ contain the top-$k$ labels for the objects, $l_o$ and $l_c$ are the labels of the objects with the maximum scores, and $d_{wv}$ measures the distance between the word2vec embeddings $f_{wv}(\cdot)$. The number of common elements complements the score when the two label sets share the same elements, while the score gets penalized by the distance between word vectors otherwise. The score for the label similarity is evaluated by

$$\text{score} = \max_{i \in \{o,c\}} \{ f_s^i(l) : l \in C_i \} \quad (11)$$

where the score function, $f_s^i$, returns a score for the given label $l$ in the candidate set, $C_i$.

Second, the position similarity, $s_{position}$, is evaluated as

$$s_{position} = \prod_{j \in \{x,y,z\}} I_j(\mu_c^j) \quad (12)$$

where $\mu_c$ is the mean position of the object in the candidate node and $I_j$ calculates the position similarity along $x$, $y$, and $z$ as follows:

$$I_j = \begin{cases} 1 & |\mu_c^j - \mu_o^j| < \sigma_o^j \\ \dfrac{1 - \phi(Z_{\mu_c^j}) + \phi(-Z_{\mu_c^j})}{1 - \phi(Z_{\sigma_o^j}) + \phi(-Z_{\sigma_o^j})} & \text{otherwise} \end{cases} \quad (13)$$

where $Z$ is the z-score of the normal distribution and $\phi(\cdot)$ returns the area of the standard normal distribution. If the candidate object's position lies within the $\sigma_o$ boundary of the object in the global 3-D scene graph, no penalty occurs. Otherwise, the score becomes inversely proportional to the distance.

Third, we calculate the following color distance, $d_h(h_i, h_j)$, using the intersection of histograms:

$$d_h(h_i, h_j) = 1 - \frac{\sum_{X}\sum_{Y}\sum_{Z} \min\big(h_i(x,y,z),\, h_j(x,y,z)\big)}{\min(|h_i|, |h_j|)} \quad (14)$$

where $h_i$ and $h_j$ are a pair of histograms for comparison, $X$, $Y$, and $Z$ are the axes of the 3-D color space, and $|\cdot|$ gives the magnitude of a histogram. Among various options for histogram distance measures, the intersection of histograms guarantees efficient computation as well as effective comparison for color histograms [25]. The color similarity becomes $s_{color} = 1 - d_h(h^o_{H,S,V}, h^c_{H,S,V})$.

Finally, the confidence score for the same node, $s_{total}$, is the weighted combination of the three similarities

$$s_{total} = \sum_{i \in F} w_i \cdot s_i \quad (15)$$

where $F = \{label, color, position\}$ contains the features. A pair of nodes with a similarity score higher than a threshold is detected as the same node.
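The three similarity scores and their combination in (10)-(15) translate directly into the sketch below. The word-vector distance is passed in as a callable, and the combination weights default to placeholders of our own choosing, since the tuned values belong to Section VII; everything else follows the equations above.

    import math
    import numpy as np

    def label_similarity(labels_o, labels_c, wv_distance):
        # Eqs. (10)-(11): labels_* map label -> score; wv_distance is a callable
        # returning a word2vec distance in [0, 1] between two label strings.
        score = max(max(labels_o.values()), max(labels_c.values()))   # Eq. (11)
        common = set(labels_o) & set(labels_c)
        if common:
            return len(common) * score
        best_o = max(labels_o, key=labels_o.get)
        best_c = max(labels_c, key=labels_c.get)
        return (1.0 - wv_distance(best_o, best_c)) * score

    def _two_sided_tail(z):
        # 1 - phi(z) + phi(-z): standard-normal mass farther than z from the mean.
        return math.erfc(abs(z) / math.sqrt(2.0))

    def position_similarity(mu_o, sigma_o, mu_c):
        # Eqs. (12)-(13): product over the x, y, z axes; no penalty inside the
        # one-sigma boundary, normalized tail probability outside it.
        sim = 1.0
        for j in range(3):
            if abs(mu_c[j] - mu_o[j]) < sigma_o[j]:
                continue
            z = (mu_c[j] - mu_o[j]) / sigma_o[j]
            sim *= _two_sided_tail(z) / _two_sided_tail(1.0)
        return sim

    def color_similarity(h_o, h_c):
        # Eq. (14): s_color = 1 - d_h = the normalized histogram intersection.
        return np.minimum(h_o, h_c).sum() / min(h_o.sum(), h_c.sum())

    def same_node_score(s_label, s_color, s_position, w=(0.4, 0.3, 0.3)):
        # Eq. (15): weighted combination; the weights here are placeholders.
        return w[0] * s_label + w[1] * s_color + w[2] * s_position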
2) Merge and Update:
The 3-D scene graph constructed from the first keyframe forms the initial global 3-D scene graph for the environment. Upon receiving image frames, local 3-D scene graphs are generated and merged into the global 3-D scene graph. The merge process compares the nodes in the newly constructed scene graph with the nodes in the global 3-D scene graph and detects same nodes. Nodes not included in the global 3-D scene graph get added to the graph with the corresponding edges, and nodes detected as same nodes update the nodes in the global graph. The update process proceeds as follows. The label set for the same node selects the top-$k$ labels with the maximum scores from $C_o \cup C_c$. The 3-D position now considers the newly sampled points and recalculates the mean and variance. The color histogram gets combined with the incoming color histogram. The number of points kept in the node becomes the number of the original points plus the number of new points. The thumbnail gets replaced with the incoming one if the label with the maximum score comes from the incoming scene graph.
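Using the ObjectNode sketch from Section V-A, the per-node update rules read roughly as follows. The pooled-variance step is the standard recomputation from the stored sample counts; the exact bookkeeping of the authors' implementation may differ.

    def update_node(node_o, node_c, k=5):
        """Merge a matched candidate node into the global node (sketch)."""
        # Thumbnail: replaced if the maximum-score label comes from the incoming
        # local scene graph (checked before the label sets are merged).
        if max(node_c.labels.values()) > max(node_o.labels.values()):
            node_o.thumbnail = node_c.thumbnail
        # Label set: keep the top-k labels with the maximum scores from C_o U C_c.
        merged = dict(node_o.labels)
        for label, s in node_c.labels.items():
            merged[label] = max(s, merged.get(label, 0.0))
        node_o.labels = dict(sorted(merged.items(), key=lambda kv: -kv[1])[:k])
        # 3-D position: recompute mean/variance from the pooled samples.
        (mu_o, var_o), (mu_c, var_c) = node_o.position, node_c.position
        n_o, n_c = node_o.n_points, node_c.n_points
        n = n_o + n_c
        mu = (n_o * mu_o + n_c * mu_c) / n
        var = (n_o * (var_o + (mu_o - mu) ** 2) + n_c * (var_c + (mu_c - mu) ** 2)) / n
        node_o.position, node_o.n_points = (mu, var), n
        # Color histogram: combine the incoming bin counts.
        node_o.color_histogram = node_o.color_histogram + node_c.color_histogram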
VI. APPLICATION: VQA AND TASK PLANNING
The proposed 3-D scene graph allows a deeper understanding of the environments for intelligent agents. Thus, the agents can perform various tasks in a versatile manner. Such tasks include VQA, task planning, 3-D space captioning, 3-D environment model generation, and place recognition. We illustrate two of the major applications of the 3-D scene graph in this section.
A. Visual Question and Answering
It is possible to assess how well an intelligent agent understands a given entity, such as an image or text, by asking questions and evaluating the answers. In this paper, we adopt a similar question and answering (QA) approach both to demonstrate the performance of the 3-D scene graph in environment understanding for intelligent agents and to provide one of the major applications of the 3-D scene graph.

In contrast to ongoing research on the QA approach, where QA pairs along with the entity to understand are provided as the training datasets, such QA datasets are not yet prepared for the 3-D scene graph, as the area is in its commencing stage. Therefore, we constrain the types of questions as follows.
1) Object Counting: Either simple (e.g., how many cups are in the environment?) or hierarchical (e.g., how many pieces of cutlery are there?).
2) Counting With Attributes: Number of objects with specific attributes, such as size, visual feature, and location (e.g., how many red chairs are in the environment?).
3) Counting With Relations: Number of objects distinctively related to a specific object (e.g., how many objects are on the shelf?).
4) Multimodal VQA: An answer given by providing a thumbnail of an object (e.g., show me the biggest bowl in the environment).

The above-listed questions can be easily converted to a query form (machine-readable form), even when asked in a free-form and open-ended manner, by employing a few NLP algorithms. The intelligent agent can answer the questions by searching for the asked entity through the constructed 3-D scene graph, as sketched below. A number of graph search algorithms are available off the shelf, including depth-first search and breadth-first search.
B. Task Planning
Robots plan how to perform given tasks before actually conducting them, which is called task planning. In the process of task planning, robots generate sequences of primitive actions to achieve the goals using the collected environment information. The environment information includes the positions, states, and attributes of objects in the environment. Task planning assumes the environment information is already given before it starts to generate the action plans. Therefore, the environment information, the purpose of the 3-D scene graph, takes a central role in task planning.

We demonstrate the effectiveness of the 3-D scene graph in task planning using the fast-forward (FF) planner [26], one of the representative task planners. The FF planner requires two types of descriptions: 1) a problem description and 2) a domain description. The problem description contains the information regarding the categories and states of objects and the goal definition. The domain description describes the primitive actions robots can take. The descriptions are written in the planning domain definition language (PDDL) [27]. The 3-D scene graph can be directly turned into the problem description format, as it stores the information the FF planner needs: a few rule-bases can transform a 3-D scene graph into problem descriptions, as sketched below.
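To give a feel for that transformation, the sketch below emits a PDDL problem string from the graph layout used throughout Section V. The predicate naming scheme and the domain name are invented for illustration; a real rule-base would map node attributes and edge predicates onto the predicates that the domain description actually declares.

    def graph_to_pddl_problem(graph, goal, domain="kitchen"):
        # Objects: one PDDL object per scene graph node.
        objects = " ".join(f"obj{n.node_id}" for n in graph["nodes"].values())
        init = []
        for n in graph["nodes"].values():
            label = max(n.labels, key=n.labels.get)      # max-score semantic label
            init.append(f"(is-{label} obj{n.node_id})")  # e.g., (is-cup obj3)
        for e in graph["edges"]:
            pred = e.predicate.replace(" ", "-")         # e.g., "on", "in-front-of"
            init.append(f"({pred} obj{e.subject_id} obj{e.object_id})")
        return (f"(define (problem scene) (:domain {domain})\n"
                f"  (:objects {objects})\n"
                f"  (:init {' '.join(init)})\n"
                f"  (:goal {goal}))")

    # Example goal for "throw the used cups into the sink" (predicates hypothetical):
    #   graph_to_pddl_problem(graph, "(forall (?c) (imply (is-cup ?c) (in-sink ?c)))")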
VII. EXPERIMENT
In this section, we first focus on the verification of accuracy, since the widely used and well-known graph structure already guarantees both usability and scalability. Next, the applicability of the 3-D scene graph is verified.
A. Performance Verification

1) Dataset: We selected a few sequences from the ScanNet dataset [28]. ScanNet consists of 1513 sequences, which were collected using one type of RGB-D sensor. The resolutions of the image frames are 1296 × 968 (color) and 640 × 480 (depth) with a 30-Hz frame rate, and all the image and depth frames are calibrated. The imaging parameters for the RGB and depth sensors are provided as well. Classes of imaging environments include bedroom, classroom, office, apartment, etc.
TABLE I
LIST OF ALGORITHMS FOR COMPARATIVE STUDY

We chose one challenging sequence after filtering out sequences with too few objects, narrow coverage, or multiple blurry images. The length of the sequence is 5578 frames, with sequence number 0 of a living room.
2) Algorithms:
We compared the quality of the 3-D scene graph with a few baseline methods. We used extended versions of the base scene graph generation algorithm (F-Net) [17] as baselines (see Table I). The first baseline computes 2-D scene graphs for every image frame, then concatenates all the generated graphs after removing the same (subject, relation, object) pairs. The second baseline applies ABIR and KGE to boost efficiency. The third baseline constructs 3-D scene graphs for every input frame and combines the graphs using the proposed same node detection. The fourth baseline is built upon the third baseline and additionally employs ABIR and KGE for efficiency. The last algorithm refers to the full model, the proposed 3-D scene graph construction framework, which adds SDR to the fourth baseline.
3) Evaluation Metrics:
We used a human judgment metric to evaluate the accuracy of each method. We recruited six experiment participants (age: 22–35, male/female: 5/1) and gathered six responses for each graph. For each resulting graph, we showed the participants the corresponding ScanNet sequence and asked them to count the number of spurious entities (nodes and edges) and missing entities. We followed the majority voting approach [29] when interpreting the responses; thus, only the entities for which two thirds of the participants responded consistently were considered for the statistics. In addition, we asked the participants to rate how well the generated graphs represented the environments on a 7-point Likert scale (overall accuracy). Finally, we averaged the results for each method. We measured the runtime of each method to compare the computational efficiency as well.
4) Implementation Details:
For the implementation of the proposed 3-D scene graph construction framework, we used Python and PyTorch for seamless integration, since the building blocks of the framework had been developed in the same environment. We used an Intel Core i9-7980XE CPU @ 2.60 GHz and a Titan XP for the experiment. For ABIR, we set α, g, and b as 0.9, 30, and 25, respectively. We tuned the values before applying the algorithm. During the process of calculating overlaps between the frames for KGE, we sampled 1000 points after projection from the source to the target frame to reduce the amount of computation; inspecting all the projected points required much computation, slowing down the entire process. In addition, SDR rejects 68 predefined object classes among the 400 classes the recognition module can classify. To reject spurious relations, we used 0. as the threshold value, and set w_label, w_color, and w_position as 0. , 0.25, and 0. , respectively.
5) Results and Analysis:
Fig. 4 shows the resulting 2-D and 3-D scene graphs for the experiment sequence with the first keyframe group, and Table II reports the quantitative results of the comparative study. The resulting scene graph from the 2D-basic baseline was too crowded with spurious nodes and edges; thus, the human judges were not able to count the number of spurious or missing entities. Similarly, the resulting 3-D scene graph from the 3D-basic baseline contained multiple spurious nodes. However, in contrast to 2D-basic, the 3-D scene graph from the 3D-basic method includes only a reasonable number of spurious nodes. The comparison between 2D-basic and 3D-basic establishes the superior performance of the same node detection over the simple triple pair detection.

Comparing the basic and efficient baselines verifies the performance of ABIR and KGE in boosting the efficiency. The average runtime improves in both the 2-D and 3-D cases. The two modules boost the speed of graph construction by avoiding repeated processing of redundant information. The ABIR module, in addition, filters out blurry images, which improves the performance of the recognition modules. Although the ABIR and KGE modules advance the efficiency, the resulting 2-D and 3-D scene graphs still contain multiple spurious entities.

The proposed framework, the 3D-full model, compactly extracts the objects in the environments and the pairwise relations among the objects. Although a few entities are missing in the resulting 3-D scene graph, SDR rejects most of the irrelevant relations, and the quantitative analysis corroborates this. The number of detected objects in the 3-D scene graph from the 3D-full model equals that of 3D-efficient. Although the percentages of spurious entities are higher for the other baselines than for the 3D-full method, the human judges assessed the overall accuracy of 3D-full lower than that of 3D-efficient. We assume that human judges consider excessiveness better than missing entities. The 3D-full method processes the input frames faster than the 3D-efficient method, because the 3D-full method holds fewer entities in the process of constructing 3-D scene graphs. As fewer entities are compared against incoming entities in local 3-D scene graphs, the amount of computation is reduced. The human judges were not able to count the number of missing relations for 2D-efficient and 3D-efficient, since the graphs were already crowded with spurious relations.

SDR could be utilized for inferring missing entities. An inference module can figure out missing relations by collecting unpaired objects and calculating the probability of the existence of relations between them. In a similar way, missing objects can be inferred by calculating conditional probabilities. We implemented the inference module, but the module did not show noticeable performance improvement. We suspect the knowledge bases used for the priors were not enough to assure the performance.
Fig. 4. First keyframe group of the first experiment sequence and the resulting 2-D and 3-D scene graphs. Red, green, and blue denote nodes, edges, and attributes, respectively. Graphs generated by 2D-basic and 3D-basic are omitted due to overcrowded entities. From the basic to the full model, 3-D scene graphs become more efficient, precise, and free of spurious entities. (a) First keyframe. (b) Anchor frame 1. (c) Anchor frame 2. (d) Anchor frame 3. (e) Scene graph from 2D-efficient. (f) 3-D scene graph from 3D-efficient. (g) 3-D scene graph from 3D-full.

Moreover, the limited performance of the recognition module stems from an inherent feature of the ScanNet dataset, whose color images are blurrier than those of general image datasets. The average blurriness of the sequence used for the experiment was around 200, whereas general image datasets provide images with blurriness higher than 1000.
TABLE II
RESULTS OF COMPARATIVE STUDY

The recognition modules, trained on general image datasets, would have shown higher performance in the experiment if they had been trained on blurrier images.

In summary, the experiment results verified the performance of the proposed 3-D scene graph construction framework. As each module of the proposed framework was added to the base algorithm one by one, the performance improved step by step. The results also established that a mere extension of a 2-D scene graph generation algorithm to 3-D spaces would not work without the proposed 3-D scene graph construction framework.
B. Applicability Demonstration

1) Environment: We constructed a kitchen simulation environment to verify the applicability of the proposed 3-D scene graph. The kitchen simulation environment models an actual kitchen environment, so that the 3-D scene graph from the actual kitchen can be directly used in the simulation. In the simulation environment, a human and a robot interact through input devices, such as a mouse and keyboard, and the robot performs the given tasks. We conducted the experiment with a simulated Mybot, which was developed in the Robot Intelligence Technology (RIT) Laboratory at KAIST. Mybot includes a robotic head connected to the upper body through a three degrees of freedom (DoFs) neck, an RGB-D camera, two arms (10 DoFs each) attached to the upper body, a 2-DoF trunk, and an omnidirectional wheel-base with a power supply. Mybot is able to perform home chore tasks [30], control its gaze [31], and converse autonomously [32]. We utilized previously built modules in the Mybot system to implement the demonstration scenario. We implemented the simulation environment using Webots [33] and ROS [34].
2) Scenario:
The demonstration scenario consists of four phases. In the first phase, a human scans through the actual kitchen, and a 3-D scene graph is constructed for use in the simulation. The generated 3-D scene graph is transferred to Mybot in the simulated environment, and Mybot builds a 3-D map of the simulation environment for localization, which is required for the later phases. Next, the human asks a couple of questions regarding the environment and Mybot answers those questions. The types of questions the human can ask are predefined, as discussed before. Third, the human commands Mybot to clean up the kitchen (throwing used cups into the sink), and Mybot generates a problem description using the transferred 3-D scene graph and a sequence of actions for the task using the FF planner. Finally, Mybot completes the given task according to the generated task plan.
3) Results and Analysis:
Fig. 5 shows the demonstration process. In the first step, the human scanned through the environment and gathered relevant information to construct a 3-D scene graph. During the process, objects in the environment were detected, and the categories, positions, and other attributes of each object were extracted. Mybot in the simulation environment saved the transferred 3-D scene graph and built the 3-D map of the simulation environment. Then, Mybot answered questions regarding the environment. Mybot could count the number of objects with specific classes and features and tell how objects were related. After VQA, the human commanded Mybot to clean up the table. To accomplish the command, Mybot generated a problem description using the constructed 3-D scene graph. A few rule-bases could transform the 3-D scene graph into the problem description in PDDL. Then, Mybot developed a task plan combining the problem description with the predefined domain description. With the task plan, Mybot took the cups to the sink one by one.

In short, the demonstration verified the broad applicability of the proposed 3-D scene graph. Intelligent agents can answer questions regarding the environments they are situated in utilizing the constructed 3-D scene graphs. The 3-D scene graph offers a straightforward way for intelligent agents to count objects with specific traits and categories. Furthermore, intelligent agents can answer questions asking about relations between pairs of objects, as the 3-D scene graph characterizes the relations between objects. In addition, the agents can perform given tasks by autonomously formulating problem descriptions in PDDL using 3-D scene graphs. In general, problem descriptions are written by humans; intelligent agents equipped with 3-D scene graphs can plan tasks with enhanced autonomy.
VIII. DISCUSSION
In this section, we present a couple of discussion points for future works. First of all, the current setting of the 3-D scene graph does not include moving objects. Although intelligent agents could conduct heaps of tasks in such static environments, real-world settings might contain dynamic objects. One of the dynamic objects that the 3-D scene graph should consider with high priority is the human, since intelligent agents deployed in the real world entail the issues of interaction with humans and human safety. To take dynamic objects into account, research on real-time updates of the 3-D scene graph should follow.
Fig. 5. Applicability demonstration. The human scanned through the kitchen environment and gathered relevant information for constructing a 3-D scene graph. Then, the simulated Mybot answered questions regarding the environment using the transferred 3-D scene graph. Upon receiving the command to clean up the table, Mybot generated a task plan using the 3-D scene graph. Finally, Mybot performed the given task in the simulated kitchen. (a) Scanning through the environment to construct a 3-D scene graph. (b) VQA with Mybot. (c) Mybot generating the problem PDDL from the 3-D scene graph. (d) Mybot performing the given task.

The real-time update algorithm would handle the same objects with varying positions, changes in relations, and state variations.

In the second place, we could improve the computational efficiency and the accuracy of the 3-D scene graph construction framework in following research. The present framework interprets the image frames using deep learning for semantics and a pose estimation algorithm for physical attributes. Although the algorithms run in real time when used individually, each of them generally requires the entire computation capability that a single processing unit offers. We proposed KGE for the sake of efficiency, but a single processing unit cannot guarantee real-time operation. Furthermore, the accuracy of the framework depends on the performance of each module. A part of the modules achieves human-level performance, yet the remaining part needs further improvement. We supplemented the framework with human knowledge and priors from other knowledge bases by proposing SDR, but the framework still misses a few corner cases. In other words, enhancement in both the deep-learning-based recognition modules and the pose estimation module would benefit the proposed 3-D scene graph construction framework.

For the third point, we could investigate the architecture of the proposed 3-D scene graph construction framework. We have built up a set of modules that take distinct roles. However, an end-to-end architecture or an integration of a part of the modules to form a bigger building block is possible. It is reported that the integration of submodules whose functions are related results in performance improvement when combined properly. This formulation, known as multitask learning [35], helps the integrated modules to generalize over the training data and to avoid overfitting. Candidates for the incorporation include the pair of relation extraction and SDR.

Last but not least, we could explore the application areas that might benefit from the 3-D scene graph and expand the demonstrated applications presented in this paper. For the exploration of the application areas, environment description in natural language and semantic segmentation for 3-D spaces are representative. The environment description could support the blind to understand the surrounding spaces and avoid possible dangers. The semantic segmentation for 3-D spaces would aid dense modeling/reconstruction of 3-D spaces. For the expansion of the presented applications, the current VQA system can be upgraded to a full-sentence VQA system or linked to an active search system. In addition, the task planning algorithm introduced in this paper only verified the feasibility of the 3-D scene graph's utilization in task planning. We could formulate the related research issues in a more delicate and mathematical way for state-of-the-art level performance.
IX. CONCLUSION
In this paper, we defined the 3-D scene graph and proposed the 3-D scene graph construction framework. The 3-D scene graph represents the surrounding environments in a sparse and semantic manner, providing intelligent agents with an effective environment model to store collected information and retrieve the information for reasoning and inference. The well-known graph structure of the 3-D scene graph offers intuitive usability and broad scalability. We verified the applicability and the accuracy of the 3-D scene graph through two major applications: 1) VQA and 2) task planning. Moreover, the proposed 3-D scene graph construction framework generates 3-D scene graphs representing environments upon receiving multiple observations in the form of image frames. The 3-D scene graph construction framework handles the input image frames both efficiently through the keyframe groups and effectively through the SDR algorithm. The experimental results established the four properties of an effective environment model for the 3-D scene graph: 1) accuracy; 2) applicability; 3) usability; and 4) scalability.
REFERENCES

[1] S. Chitta, I. Sucan, and S. Cousins, "MoveIt! [ROS topics]," IEEE Robot. Autom. Mag., vol. 19, no. 1, pp. 18–19, Mar. 2012.
[2] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," KI-Künstliche Intelligenz, vol. 24, no. 4, pp. 345–348, 2010.
[3] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, "ElasticFusion: Real-time dense SLAM and light source estimation," Int. J. Robot. Res., vol. 35, no. 14, pp. 1697–1716, 2016.
[4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 652–663, Apr. 2017.
[5] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra, "VQA: Visual question answering," Int. J. Comput. Vis., vol. 123, no. 1, pp. 4–31, 2017.
[6] A. Kumar et al., "Ask me anything: Dynamic memory networks for natural language processing," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1378–1387.
[7] M. Hanheide et al., "Robot task planning and explanation in open and uncertain worlds," Artif. Intell., vol. 247, pp. 119–150, Jun. 2017.
[8] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, "BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration," ACM Trans. Graph., vol. 36, no. 4, 2017, Art. no. 76a.
[9] R. B. Rusu and S. Cousins, "3D is here: Point cloud library (PCL)," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Shanghai, China, 2011, pp. 1–4.
[10] G. Tanzmeister, J. Thomas, D. Wollherr, and M. Buss, "Grid-based mapping and tracking in dynamic environments using a uniform evidential environment representation," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2014, pp. 6090–6095.
[11] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. [Online]. Available: https://doi.org/10.1007/s10514-012-9321-0
[12] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 4628–4635.
[13] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5099–5108.
[14] Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in Proc. CVPR, vol. 2, 2017, p. 3.
[15] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, "Scene graph generation by iterative message passing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, 2017, pp. 3097–3106.
[16] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang, "Scene graph generation from objects, phrases and region captions," in Proc. ICCV, 2017, pp. 1270–1279.
[17] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, "Factorizable Net: An efficient subgraph-based framework for scene graph generation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 335–351.
[18] J. Johnson et al., "Image retrieval using scene graphs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, 2015, pp. 3668–3678.
[19] A. Chang, M. Savva, and C. D. Manning, "Learning spatial knowledge for text to 3D scene generation," in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), 2014, pp. 2028–2038.
[20] D. Teney, L. Liu, and A. van den Hengel, "Graph-structured representations for visual question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1–9.
[21] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[23] A. Neubeck and L. Van Gool, "Efficient non-maximum suppression," in Proc. IEEE 18th Int. Conf. Pattern Recognit. (ICPR), vol. 3, 2006, pp. 850–855.
[24] R. Krishna et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," Int. J. Comput. Vis., vol. 123, no. 1, pp. 32–73, 2017.
[25] S. Jeong, "Histogram-based color image retrieval," Project Rep. Psych221/EE362, 2001.
[26] J. Hoffmann, "FF: The fast-forward planning system," AI Mag., vol. 22, no. 3, p. 57, 2001.
[27] M. Fox and D. Long, "PDDL2.1: An extension to PDDL for expressing temporal planning domains," J. Artif. Intell. Res., vol. 20, pp. 61–124, 2003.
[28] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proc. CVPR, vol. 2, 2017, p. 10.
[29] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast—But is it good? Evaluating non-expert annotations for natural language tasks," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2008, pp. 254–263.
[30] D.-H. Kim, G.-M. Park, Y.-H. Yoo, S.-J. Ryu, I.-B. Jeong, and J.-H. Kim, "Realization of task intelligence for service robots in an unstructured environment," Annu. Rev. Control, vol. 44, pp. 9–18, 2017.
[31] B.-S. Yoo and J.-H. Kim, "Fuzzy integral-based gaze control of a robotic head for human robot interaction," IEEE Trans. Cybern., vol. 45, no. 9, pp. 1769–1783, Sep. 2015.
[32] W.-H. Lee and J.-H. Kim, "Hierarchical emotional episodic memory for social human robot collaboration," Auton. Robots, vol. 42, no. 5, pp. 1087–1102, 2018.
[33] O. Michel, "Webots: Symbiosis between virtual and real mobile robots," in Proc. Int. Conf. Virtual Worlds, 1998, pp. 254–263.
[34] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., vol. 3, Kobe, Japan, 2009, p. 5.
[35] R. Ranjan, V. M. Patel, and R. Chellappa, "HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 1, pp. 121–135, Jan. 2019.
Ue-Hwan Kim received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2013 and 2015, respectively, where he is currently pursuing the Ph.D. degree.

His current research interests include service robots, cognitive IoT, computational memory systems, and learning algorithms.
Jin-Man Park received the B.S. degree in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2015, and the M.S. degree in the robotics program from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2017, where he is currently pursuing the Ph.D. degree.

His current research interests include service robots, natural language processing, and visual question answering.
Taek-Jin Song received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2017, where he is currently pursuing the integrated master's and doctoral degrees.

His current research interests include service robots, visual question answering, and learning algorithms.