SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors
MOHAMMAD KESHAVARZI, University of California, Berkeley
AAKASH PARIKH, University of California, Berkeley
XIYU ZHAI, University of California, Berkeley
MELODY MAO, University of California, Berkeley
LUISA CALDAS, University of California, Berkeley
ALLEN Y. YANG, University of California, Berkeley
Fig. 1. SceneGen is a framework to augment scenes with virtual objects using an explicit generative model to learn topological relationships from priors extracted from real-world datasets. Primarily designed for spatial computing applications, SceneGen extracts features from rooms into a novel spatial Scene Graph representation and iteratively augments objects by sampling positions and orientations in the scene to create a probability map, predicting a viable contextual placement for the virtual object.
Spatial computing experiences are constrained by the real-world surroundings of the user. In such experiences, augmenting virtual objects to existing scenes requires a contextual approach, where geometrical conflicts are avoided and functional, plausible relationships to other objects are maintained in the target environment. Yet, due to the complexity and diversity of user environments, automatically calculating ideal positions of virtual content that adapt to the context of the scene is a challenging task. Motivated by this problem, in this paper we introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes. SceneGen takes a semantically segmented scene as input and outputs positional and orientational probability maps for placing virtual content. We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room. We believe providing explicit and intuitive features plays an important role in informative content creation and user interaction in spatial computing settings, a quality that is not captured in implicit models. We use kernel density estimation (KDE) to build a multivariate conditional knowledge model trained using prior spatial
Scene Graphs extracted from real-world 3D scanned data. To further capture orientational properties, we develop a fast pose annotation tool to extend current real-world datasets with orientational labels. Finally, to demonstrate our system in action, we develop an Augmented Reality application in which objects can be contextually augmented in real-time.

CCS Concepts: • Computing methodologies → Mixed / augmented reality; Virtual reality; • Mathematics of computing → Kernel density estimators;
Additional Key Words and Phrases: Augmented Reality, Scene Graphs, Scene Synthesis, Generative Modelling, Spatial Computing
ACM Reference format:
Mohammad Keshavarzi, Aakash Parikh, Xiyu Zhai, Melody Mao, Luisa Caldas, and Allen Y. Yang. 2020. SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors. 0, 0, Article 0 (2020), 19 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
Spatial Computing experiences such as augmented reality (AR) and virtual reality (VR) have formed an exciting new market in today's technological space. New applications and experiences are being launched daily across the categories of gaming, healthcare, design, education, and more. However, all of the countless applications available are physically constrained by the geometry and semantics of the 3D user environment, where existing furniture and building elements are present (Narang et al. 2018; Razzaque et al. 2001). Contrary to traditional 2D graphical user interfaces, where a flat rectangular region hosts digital content, 3D spatial computing environments are often occupied by physical obstacles that are diverse and often non-convex. Therefore, how one can assess content placement in spatial computing experiences is highly dependent on the user's target scene.

Fig. 2. End-to-end workflow of SceneGen shows the four main modules of our framework to augment rooms with virtual objects. The left pipeline shows the training procedure, including dataset processing (blue) and Knowledge Model creation (pink). The right pipeline shows the test-time procedure of sampling and prediction (yellow) and the application (green).

However, since different users reside in different spatial environments, which differ in dimensions, functions (rooms, workplace, garden, etc.), and open usable spaces, the existing furniture and its arrangement are often unknown to developers, making it very challenging to design a virtual experience that would adapt to all users' environments. Therefore, contextual placement is currently addressed by asking users themselves to identify the usable spaces in their surrounding environments or to manually position the augmented object(s) within the scene. Currently, virtual object placement in most AR experiences is limited to specific surfaces and locations, e.g., placing objects naively in front of the user with no scene understanding, or only using basic horizontal or vertical surface detection. These simplistic strategies can work to some extent for small virtual objects, but they break down for larger objects or complex scenes with multiple object augmentation requirements. This limitation is further amplified in remote multi-user interaction scenarios, where finding a common virtual ground physically accessible to all participants to augment their content becomes challenging (Keshavarzi et al. 2020b). Hence, such experiences automatically become less immersive once the users encounter implausible virtual object augmentation in their environments.

The task of adding objects to existing constructed scenes falls under the problem of constrained scene synthesis. The work of (Kermani et al. 2016; Li et al. 2019; Ma et al. 2016; Qi et al. 2016; Ritchie et al. 2019; Wang et al. 2019) are examples of such an approach. However, there are currently two major challenges in the general literature which also create bottlenecks for virtual content augmentation in spatial computing experiences. First, currently available scanned 3D datasets are limited in size and diversity, and may not offer all the data required to capture the topological properties of rooms. For instance, pose, the direction in which an object is facing, is a critical feature for understanding the orientational property of an object.
Yet, such a property is not clearly annotated for all objects in many large-scale real-world datasets such as SUN-RGBD and Matterport3D. Therefore, more recent research has adopted synthetic datasets, which can be used to extract higher-level information such as pose, as they do not necessarily need to be manually annotated. However, a critical drawback of synthetic datasets is that they cannot capture the natural transformation and topological properties of objects in real-world settings. Furniture in real-world settings is a product of gradual adoption of a space, contributing to the functionality of the room and surrounding items. Topological relationships between objects in real-world scenes typically exceed the theoretical design assumptions of an architect, and instead capture contextual relationships from a living environment. Moreover, the limitations of the modeling software used for synthetic datasets can also introduce unwanted biases to the generated scenes. The SUNCG dataset (Song et al. 2017), for instance, was built with the Planner5D platform, an online tool which any user around the world can use. However, it comes with modeling limitations for generating rooms and furniture. Orientations are also snapped to right angles by default, which makes most scenes in the dataset Manhattan-like. More importantly, there is no indication of whether a design is complete or not; a user may just start playing with the software and then leave at a random time, while the resulting arrangement is still captured as a legitimate human-modeled arrangement in the dataset.

Second, recent models take advantage of implicit deep learning models and have shown promising results in synthesizing indoor scenes. Yet, their approach falls short when content developers wish to parameterize customized placement in relation to standard objects in the scene, and to generate custom spatial functionalities. One major limitation of these studies is that they do not have direct control over objects in the generated scene. For example, the authors of (Li et al. 2019) report that they cannot specify object counts or constrain the scene to contain a subset of objects. Such limitations come from the implicit nature of such networks. Implicit models produce a black-box tool, which is difficult to comprehend should an end-user wish to tweak its functions. In cases where new objects are to be placed, implicit structures may not provide the ability to manually define new object types. Moreover, deep convolutional networks require large datasets to train, a bottleneck we discussed above.

Motivated by these challenges, in this paper we introduce SceneGen, a generative contextual augmentation framework that provides probability maps for virtual object placements. Given a non-empty room already occupied by furniture, SceneGen provides a model-based solution to add new objects in functional placements and orientations. We also propose an interactive generative system to model the surrounding room. Contrary to unintuitive implicit models, SceneGen is based on clear, logical object attributes and relationships. In light of the existing body of literature on semantic
Scene Graphs, we leverage this approach to encapsulate the relevant object relationships for scene augmentation. Scene Graphs have already been used for general scene generation tasks; they can also inform the intelligent placement of virtual objects in physical scenes.

We use kernel density estimation (KDE) to build a multivariate conditional model to encapsulate explicit positioning and clustering information for object and room types. This information allows our algorithm to determine likely locations to place a new object in a scene while satisfying its physical constraints. Object orientations are predicted using a probability distribution. From the calculated probabilities, we generate a score for each potential placement of the new object, visualized as a heat map over the room. Our system is user-centric and ensures that the user understands the influence of data points and object attributes on the results. In addition, recent work has produced extensive scans of real-world environments. We use one such dataset, Matterport3D (Chang et al. 2018), in place of synthetic datasets such as SUNCG. As a trade-off, our real-world environment data are prone to messy object scans and non-Manhattan alignments.

Our contributions can be summarized as follows:

(1) We introduce a spatial Scene Graph representation which encapsulates positional and orientational relationships of a scene. Our proposed Scene Graph captures pairwise topology between objects, object groups, and the room.
(2) We develop a prediction model for object contextual augmentation in existing scenes. We construct an explicit Knowledge Model which is trained from Scene Graph representations captured from real-world 3D scanned data.
(3) To learn orientational relationships from real-world 3D scanned data, we have manually labeled the Matterport3D dataset with pose directions. To do so, we have developed an open-source tool for fast pose labeling.
(4) We develop an Augmented Reality (AR) application that scans a user's room and generates a Scene Graph based on the existing objects. Using our model, we sample poses across the room to determine a probabilistic heat map of where the object can be placed. By placing objects in poses where the spatial relationships are likely, we are able to augment scenes in a realistic manner.

We believe our proposed system can facilitate a wide variety of AR/VR applications. For example, collaborative environments require placing one user's objects into another user's surroundings. More recently, adding virtual objects to scenes has been explored in online-shopping settings. This work can also apply to design industries, for example in generating 3D representations of example furniture placements. In addition, content creation for augmented and virtual reality experiences requires long hours of cross-platform development in current applications, so our system will allow faster scene generation and content generation in AR/VR experiences. Source code and pretrained models for our system can be found at our website after the review.
Fig. 3. Our proposed Scene Graph representation is extracted from each scene, capturing orientation- and position-based relationships between objects in a scene (pairwise) and between objects and the room itself. The visualization shows a subset of features for clarity.
Semantic Scene Graphs form one part of the overall task of scene understanding. Given visual input, as AR experiences generally would receive, one can tackle the tasks of 3D scene reconstruction and visual relationship detection. On the latter topic, a progression of papers attempted to encapsulate human "common-sense" knowledge in various ways: physical constraints and statistical priors (Silberman et al. 2012), physical constraints and stability reasoning (Jia et al. 2013), physics-based stability modeling (Zheng et al. 2015), language priors (Lu et al. 2016), and statistical modeling with deep learning (Dai et al. 2017). A similar approach was detailed in (Kim et al. 2012) for 3D reconstruction, taking advantage of the regularity
and repetition of furniture arrangements in certain indoor spaces, e.g., office buildings. In (Xu et al. 2016), the authors proposed a technique that could potentially be well suited to AR applications, as it builds a 3D reconstruction of the scene through consecutive depth acquisitions, which could be taken incrementally as a user moves within their environment. Recent work has addressed problems like retrieving 3D layouts from 2D panoramic input (Kotadia et al. 2020; Sun et al. 2019) or floorplan sketches (Keshavarzi et al. 2020a), building scenes from 3D point clouds (Pittaluga et al. 2019; Shi et al. 2019), and 3D plane reconstruction from a single image (Liu et al. 2019b; Yu et al. 2019). One can consult a recent overview of the topic in (Liu et al. 2019a). Our approach leverages this work on scene understanding, because our model operates on the assumption that we already have the locations and bounding boxes of the existing objects in the scene.

Fig. 4. Each placement choice for an object forms different topological relationships captured by the Scene Graphs. SceneGen evaluates the probability of these new relationships to create a probability map and recommend a placement.
Semantic Scene Graphs have been applied to various tasks in the past, including image retrieval (Johnson et al. 2015), visual question answering (Teney et al. 2017), image caption generation (Yao et al. 2018), and more. Past research can be divided into two approaches: (1) separate stages of object detection and graph inference, and (2) joint inference of object classes and graph relationships. Papers that followed the first approach often leverage existing object detection networks (Chen et al. 2019; Li et al. 2017; Ren et al. 2015; Yao et al. 2018; Zellers et al. 2018). Similarly to other scene understanding tasks, many methods also involved learning prior knowledge of common scene structures in order to apply it to new scenes, such as physical constraints from stability reasoning (Yang et al. 2017) or frequency priors represented as recurring scene motifs (Zellers et al. 2018). Most methods were benchmarked on the Visual Genome dataset (Krishna et al. 2017). However, recent studies found this dataset to have an uneven distribution of examples across its data space. In response, researchers in (Gu et al. 2019) and (Chen et al. 2019) proposed new networks to draw from an external knowledge base and to utilize statistical correlations between objects and relationships, respectively. Our work focuses on the construction and utilization of the semantic Scene Graph. As in (Chen et al. 2019; Zellers et al. 2018), we also use statistical relationships and dataset priors; but unlike these papers, we use a mathematical model rather than deep learning. Because our approach is based on a model with specified properties, we can explain our results with explicit reasoning based on these properties.
The general goal of indoor scene synthesis is to produce a feasible furniture layout of various object classes which addresses both functional and aesthetic criteria (Zhang et al. 2019). Early work on synthetic generation focused on hard-coding rules, guidelines, and grammars, resembling a procedural approach to this problem (Bukowski and Séquin 1995; Germer and Schwarz 2009; Xu et al. 2002). The work of (Merrell et al. 2011) is a successful example of hard-coding design guidelines as priors for the scene generation process. They extracted these guidelines by consulting manuals on furniture layout (Sharp 2008; Talbott and Matthews 1999; Ward 1999) and interviewing professional designers who specialize in arranging furniture. A similar approach is also seen in the work of Yu et al. (2011), while (Yeh et al. 2012) attempted synthesizing open-world layouts with hard-coded factor graphs.

The work of (Fisher et al. 2012) can be seen as one of the early adopters of example-based scene synthesis. They synthesized scenes by training a probabilistic model based on Bayesian networks and Gaussian mixtures. Their problem, however, was one of generating an entire scene, and they utilized a more limited set of input example scenes. In the work of (Kermani et al. 2016), a full 3D scene is synthesized iteratively by adding a single object at a time. This system learned some priors similar to ours, including pairwise and higher-order object relations. Compared to this work, we incorporate additional priors, including objects' relative position within the room bounds. The work of Liang et al. (Liang et al. 2018, 2017) and Fu et al. (Fu et al. 2017) also took room functions into account. They argued that extracting topological priors should also be extended to room functions and their activities, which would impact the pairwise relationships between objects. While object topologies differ across room functions, a major challenge in this approach is that not all spaces can be classified with a certain room function. For instance, in a small studio apartment, the living room might serve additional functions such as dining room and study space. (Savva et al. 2017) also proposed a similar approach, involving a Gaussian mixture model and kernel density estimation. However, their system targeted the inverse of our problem: it received a selected object location as input and was asked to predict an object type. We find our problem to be more relevant to the needs of a content creator who knows what object they wish to place in a scene, but does not have prior knowledge about the user's surroundings.

Another data-driven approach to scene generation involves modeling human activities and interactions with the scene (Fisher et al. 2015; Fu et al. 2017; Ma et al. 2016; Qi et al. 2018). Research following this approach generally seeks to model and adjust the entire scene according to human actions or presence. There have also been a number of interesting studies that take advantage of logical structures modeled for natural language processing (NLP) scenarios. The work of (Chang et al. 2014b), (Chang et al. 2014a), (Chang et al. 2017), and (Ma et al. 2018) are examples of such an approach.
More specifically, (Ma et al. 2018) bears a minor resemblance to our approach, in 1) training on object relations, and 2) the ability to augment an initial input scene; but unlike our work, it augments scenes by merging in subscenes retrieved from a database. In contrast, we seek to add individual objects, which is more aligned with the needs of creators of augmented reality experiences. A series of papers (including (Avetisyan et al. 2019; Chen et al. 2014; Shao et al. 2012)) proposed generating a 3D scene representation by recreating the scene from RGB-D image input, using retrieved and aligned 3D models. This research, however, involves recreating an existing physical scene, and does not handle adding new objects.

More recent work endeavors to improve learning-based methods, using deep convolutional priors (Wang et al. 2018), scene autoencoding (Li et al. 2019), and new representations of object semantics (Balint and Bidarra 2019), to name just a few. (Zhang et al. 2020) addressed a related but distinct problem of synthesizing a scene by arranging and grouping an input set of objects. The work of Ritchie et al. (Ritchie et al. 2019) is another example of using deep generative models for scene synthesis. Their method sampled each object attribute with a single inference step to allow constrained scene synthesis. This work was extended in PlanIt (Wang et al. 2019), where the authors proposed a combination of two separate convolutional networks to address constrained scene synthesis problems. They argue that object-level relationships facilitate high-level planning of how a room should be laid out, while room-level relationships perform well at placing objects in precise spatial configurations. Our method differs from the discussed studies in 1) utilizing an explicit model rather than an implicit structure, 2) taking advantage of higher-level relationships with the room itself in our proposed Scene Graph, and 3) generating a probability map which guides the end user to potential locations for object augmentation.
Fig. 5. In our annotation tool, a camera is orbited around each object to facilitate labeling of object orientations.

Fig. 6. A labeler using our annotation tool can select which direction the object is facing or move to the next camera to get a better view. The selection is used to automatically standardize the axes of each object's bounding box.
SceneGen is a framework to augment scenes with virtual objects using a generative model to maximize the likelihood of the relationships captured in a spatial Scene Graph. Specifically, given a partially filled room, SceneGen will augment it with one or multiple new virtual objects in a realistic manner, using an explicit model trained on relationships between objects in the real world. The SceneGen workflow is shown in Figure 2.

In this paper, we first introduce a novel Scene Graph that connects the objects and the room (both represented as nodes) using spatial relationships (represented as edges) in Section 4. For each object, these relationships are determined by positional and orientational
features between itself and other objects, object groups, and the room.

Fig. 7. SceneGen places objects into scenes by extracting a Scene Graph from each room, sampling positions and orientations to create probability maps, and then placing an object in the most probable pose. (a) A sofa is placed in a living room, (b) a bed is placed in a bedroom, (c) a chair is placed in an office, (d) a table is placed in a family room, (e) a storage unit is placed in a bedroom.

In Section 5 we show how, from a dataset of rooms, we can extract these Scene Graphs to construct a Knowledge Model that is used to train explicit models approximating the probability density functions of position and orientation relationships for a given object using kernel density estimation. In order to augment a scene with a virtual object, SceneGen samples possible positions and orientations in a scene, building updated Scene Graphs for each sample. We estimate the probability of each sample and place an object at the most likely pose. SceneGen also shares a heat map of the likelihood of each sample to suggest alternate high-probability placements. This can be repeated to augment multiple virtual objects.

Our implementation of SceneGen is built using data extracted from the Matterport3D dataset as our priors and is detailed in Section 6. This dataset is chosen as it contains real-world rooms. As using object scans results in unoriented bounding boxes, we develop an application to facilitate the labeling of the facing direction of each object.

We assess the effectiveness of SceneGen in Sections 7 and 8 for eight categories of objects across several types of rooms, including bedrooms, living rooms, hallways, and kitchens. In order to understand the effectiveness of each relationship on predicting where and how a new object should be placed, we run a series of ablation tests on each feature. We use k-fold cross validation to partition the Matterport3D dataset, building the Knowledge Model on a training set and assessing how well the model can replace removed objects from a validation set. Additionally, we carry out a user study to analyze how SceneGen compares with random placement and the reference scene when placing new objects into virtual rooms based on real scenes from the Matterport3D dataset, and to evaluate the value of a heat map showing the probability of all samples.

Finally, Section 9 details an Augmented Reality mobile application that we have developed employing SceneGen to add new virtual objects to a scene. This application locally computes the semantic segmentation and generates a Scene Graph, estimates sample probabilities on an external server, and then parses and visualizes the prediction results. This demonstrates how our framework can work with state-of-the-art AR/VR systems.
In this section, we introduce a novel spatial Scene Graph that converts a room and the objects included in it into a graphical representation using extracted spatial features. A Scene Graph $G$ is defined by nodes representing objects, object groups, and the room, and by edges representing the spatial relationships between the nodes. While various objects hold different individual functions (e.g., a chair to sit on, a table to dine at), their combinations and topological relationships tend to generate the main functional purpose of the space. In other words, spatial functions are created by the pairwise topologies of objects and their relationships with the room. In our proposed Scene Graph representation, we intend to explicitly extract a wide variety of positional and orientational relationships that can be present between objects. We model descriptive topologies that are commonly utilized by architects and interior designers to generate spatial functionalities in a given space. Therefore, our Scene Graph representation can also be described as a function map, where objects (nodes) and their relationships (edges) correspond to single or multiple spatial functionalities present in a scene. Figure 3 illustrates two examples of our Scene Graph representation, where a subset of topological features is visualized in the graph.

In this paper, we consider a room or a scene in 3D space whose floor is on the flat $(x, y)$-plane with the $z$-axis orthogonal to the $(x, y)$-plane. In this orientation, we denote the room space in a floorplan representation as $R$, namely, an orthographic projection of its 3D geometry plus a possible adjacency relationship in which objects in $R$ may overlap on the $(x, y)$-plane while lying on top of one another along the $z$-axis. Specifically, the "support" relationship is defined in Section 4.3.3. This can also be viewed as a 2.5-D representation of the space.

Further denote the $k$-th object (e.g., a bed or a table) in $R$ as $O_k$. The collection of all $n$ objects in $R$ is denoted as $O = \{O_1, O_2, \dots, O_n\}$. $B(O_k)$ represents the bounding box of the object $O_k$, and $\dot{O}_k$ represents the center of the object $O_k$. Every object $O_k$ has a label to classify its type. Related to the same $R$, we also have a set of groups $G = \{g_1, \dots, g_m\}$, where each group $g_i$ contains all objects of the same type within $R$.

Furthermore, each $O_k$ has a primary axis $a_k$ and a secondary axis $b_k$. For Asymmetric objects, $a_k$ represents the orientation of the object. $a_k$ and $b_k$ are both unit vectors such that $b_k$ is a $\pi/2$ radian counter-clockwise rotation of $a_k$. We define $\theta_{a_k}$ and $\theta_{b_k}$ to be the angles in radians represented by $a_k$ and $b_k$, respectively.

For each room $R$, we define $W = \{W_1, W_2, \dots, W_l\}$, where each $W_k$ is a wall of the $l$-sided room. In the floor plan representation, $W_k$ is represented by a 1D line segment. We also introduce a distance function $\delta(a, b)$ as the shortest distance between $a$ and $b$. For example, $\delta(B(O_k), \dot{R})$ is the shortest distance between the bounding box of $O_k$ and the center of the room $R$.

We first introduce features for objects based on their spatial positions in a scene. We include pairwise relationships between objects (e.g., between a chair and a desk), relationships with object groups (e.g., between a dining table and dining chairs), and relationships between an object and the room.
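To make the notation above concrete, the following is a minimal sketch (not the authors' released code) of how such a 2.5-D scene representation could be held in memory. The class names, field layout, and the use of the shapely library are illustrative assumptions; later sketches in this section build on these structures.

```python
# Minimal sketch of the 2.5-D scene representation: each object O_k carries its
# label, center, oriented footprint B(O_k), and primary/secondary axes a_k, b_k;
# the room R carries its wall segments W_i.
from dataclasses import dataclass, field
from typing import List
import numpy as np
from shapely.geometry import Polygon, LineString

@dataclass
class SceneObject:
    label: str               # object type, used to form the groups g_i
    center: np.ndarray       # \dot{O}_k: 2D centroid on the (x, y)-plane
    footprint: Polygon       # B(O_k): oriented bounding box projected to 2D
    a: np.ndarray            # primary (facing) axis a_k, unit vector
    b: np.ndarray            # secondary axis b_k, orthogonal to a_k

@dataclass
class Room:
    walls: List[LineString]                       # W = {W_1, ..., W_l}
    objects: List[SceneObject] = field(default_factory=list)

def delta(geom_a, geom_b):
    """Shortest planar distance delta(a, b) between two shapely geometries."""
    return geom_a.distance(geom_b)
```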
RoomPosition: The room position feature of an object denotes whether the object is in the middle, at the edge, or in a corner of a room. This is based on how many walls the object is less than a distance $\varrho$ from:

$$\mathrm{RoomPosition}(O_k, R) = \sum_{W_i \in W} \mathbb{1}\big(\delta(\dot{O}_k, W_i) < \varrho\big) \qquad (1)$$

In other words, if $\mathrm{RoomPosition}(O_k, R) \geq 2$, the object is near at least two walls of the room and hence is near a corner of the room; if $\mathrm{RoomPosition}(O_k, R) = 1$, the object is near only one wall of the room and is at the edge of the room; otherwise, the object is not near any wall and is in the middle of the room.
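As an illustration, a small sketch of the RoomPosition feature of Eq. (1), under the representation assumed above; the threshold value is hypothetical, not taken from the paper.

```python
# Sketch of RoomPosition (Eq. 1): count walls closer than rho to the object's center.
from shapely.geometry import Point

def room_position(obj, walls, rho=0.5):
    """0 -> middle of the room, 1 -> edge, >= 2 -> corner."""
    center = Point(obj.center[0], obj.center[1])
    return sum(1 for wall in walls if center.distance(wall) < rho)
```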
Fig. 8. SceneGen can be used to iteratively add multiple virtual objects to a scene. For each object we sample poses and place it in the most likely position and orientation before placing the next object into a partially emptied room. (Top) A bed, storage and sofa are replaced in a bedroom, reorganizing the room into a viable alternative to the dataset ground truth; (Middle) Two sofas and a table are replaced in a living room in an arrangement similar to ground truth; (Bottom) A sofa and a table are replaced, and another sofa and then a table are added to a family room, demonstrating how a scene augmented with different objects compares to the ground truth.

AverageDistance: For each object and each group of objects, we calculate the average distance between that object and all objects within that group. For cases where the object is a member of the group, we do not count the distance between the object in question and itself in the average:

$$\mathrm{AverageDistance}(O_k, g_i) = \sum_{\substack{O_j \in g_i \\ j \neq k}} \delta\big(B(O_k), B(O_j)\big) \Big/ \sum_{\substack{O_j \in g_i \\ j \neq k}} 1 \qquad (2)$$

SurroundedBy: For each object and each group of objects, we compute how many objects in the group are within a distance $\varepsilon$ of the object. For cases where the object is a member of the group, we do not count the object in question:

$$\mathrm{SurroundedBy}(O_k, g_i) = \sum_{\substack{O_j \in g_i \\ j \neq k}} \mathbb{1}\big(\delta(B(O_j), B(O_k)) < \varepsilon\big) \qquad (3)$$

Support: An object is considered to be supported by a group if it is directly on top of an object from the group, or to support a group if it is directly underneath an object from the group:

$$\mathrm{Support}(O_k, g_i) = \mathbb{1}\big(\exists\, O_j \in g_i \text{ s.t. } O_k \text{ is on top of } O_j\big) - \mathbb{1}\big(\exists\, O_j \in g_i \text{ s.t. } O_k \text{ is under } O_j\big) \qquad (4)$$
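A sketch of the pairwise positional features of Eqs. (2)-(3), under the same assumed representation; the distance threshold is illustrative, not a value from the paper.

```python
# Sketch of AverageDistance and SurroundedBy over a group g_i of SceneObjects.
def average_distance(obj, group):
    """AverageDistance(O_k, g_i): mean footprint distance to the other group members."""
    dists = [obj.footprint.distance(o.footprint) for o in group if o is not obj]
    return sum(dists) / len(dists) if dists else float("inf")

def surrounded_by(obj, group, eps=1.0):
    """SurroundedBy(O_k, g_i): number of group members within eps of the object."""
    return sum(1 for o in group if o is not obj and obj.footprint.distance(o.footprint) < eps)
```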
Fig. 9. Visualization of the Knowledge Model built from Scene Graphs extracted from the Matterport3D dataset shows, for each group of objects: (a) the frequency of each Room Position, (b) the frequency with which the object is surrounded by multiple objects from another group, (c) the frequency with which the object is facing an object from another group, (d) the frequency with which the object is facing towards the center of the room or not.

We categorize the objects in our scenes into three main groups:

(1) $G_{\mathrm{sym}}$: Symmetric objects, such as coffee tables and houseplants, that have no clear front-facing direction;
(2) $G_{\mathrm{asym}}$: Asymmetric objects, such as beds and chairs, that can be oriented to face in a specific direction;
(3) $G_{\mathrm{in}}$: Inside Facing objects, such as paintings and storage, that always face away from the wall of the room where they are situated.

In this section we discuss features applicable to objects with a defined facing direction, not to Symmetric objects. We first define an indicator function that is 1 if a ray extending from the center of an object in the direction $d_k$ intersects a wall $W_i$:

$$f(\dot{O}_k, d_k, W_i) = \mathbb{1}\big(\exists\, \gamma \geq 0 \mid \dot{O}_k + \gamma d_k \in W_i\big) \qquad (5)$$

TowardsCenter: An object is considered to be facing towards the center of the room if a ray extending from the center of the object intersects one of the $l-2$ furthest walls from the object. Ordering the walls by decreasing distance from the object,

$$c_j = \operatorname*{argmax}_{W_i \in W \setminus \{c_1, \dots, c_{j-1}\}} \delta(\dot{O}_k, W_i), \qquad j = 1, \dots, l-2 \qquad (6)$$

$$\mathrm{TowardsCenter}(O_k) = f(\dot{O}_k, a_k, c_1) \vee \dots \vee f(\dot{O}_k, a_k, c_{l-2}) \qquad (7)$$

AwayFromWall: An object is considered to be facing away from a wall if it is oriented away from, and is normal to, the closest wall to the object. Let $c^* = \operatorname*{argmin}_{W_i \in W} \delta(B(O_k), W_i)$; then

$$\mathrm{AwayFromWall}(O_k) = f(\dot{O}_k, -a_k, c^*) \wedge (a_k \perp c^*) \qquad (8)$$

DirectionSimilarity: An object has a similar direction to one or more objects within a constant distance $\varepsilon$ of the object if the other objects are facing in the same direction as, or in the opposite direction ($\pi$ radians apart) from, the first object, subject to some small angular error $\varphi$:

$$\mathrm{Same}(O_k) = \sum_{\substack{O_j \in O,\; j \neq k \\ \delta(B(O_k), B(O_j)) \leq \varepsilon}} \mathbb{1}\big(|\theta_{a_k} - \theta_{a_j}| \leq \varphi\big), \qquad \mathrm{Opp}(O_k) = \sum_{\substack{O_j \in O,\; j \neq k \\ \delta(B(O_k), B(O_j)) \leq \varepsilon}} \mathbb{1}\big(\big|\pi - |\theta_{a_k} - \theta_{a_j}|\big| \leq \varphi\big)$$

$$\mathrm{DirectionSimilarity}(O_k) = [\mathrm{Same}(O_k), \mathrm{Opp}(O_k)] \in \mathbb{R}^2 \qquad (9)$$

We also define an indicator function that is 1 if a ray extending from the center of the object in direction $d_k$ intersects the bounding box of a second object:

$$h(\dot{O}_k, d_k, B(O_j)) = \mathbb{1}\big(\exists\, \gamma \geq 0 \mid \dot{O}_k + \gamma d_k \in B(O_j)\big) \qquad (10)$$

Facing: Between an object and a group of objects, we count how many objects of the group are within a distance $\varepsilon$ of the object and lie in the direction of the primary axis of the first object:

$$\mathrm{Facing}(O_k, g_i) = \sum_{\substack{O_j \in g_i,\; j \neq k \\ \delta(B(O_k), B(O_j)) \leq \varepsilon}} h(\dot{O}_k, a_k, B(O_j)) \qquad (11)$$

NextTo: Between an object and a group of objects, we count how many objects of the group are within a distance $\varepsilon$ of the object and lie in the direction of the positive or negative secondary axis of the first object:

$$\mathrm{NextTo}(O_k, g_i) = \sum_{\substack{O_j \in g_i,\; j \neq k \\ \delta(B(O_k), B(O_j)) \leq \varepsilon}} h(\dot{O}_k, \pm b_k, B(O_j)) \qquad (12)$$
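A sketch of the ray-based orientational features of Eqs. (10)-(11), again under the assumed representation; the epsilon and ray-length constants are illustrative.

```python
# Sketch of the indicator h and the Facing feature using a finite ray cast.
import numpy as np
from shapely.geometry import LineString

EPS = 1.0          # pairwise distance threshold epsilon (assumed)
RAY_LENGTH = 50.0  # long enough to cross any room in the dataset (assumed)

def ray_hits(center, direction, target):
    """Indicator h: does a ray from `center` along `direction` hit polygon `target`?"""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    p0 = np.asarray(center, dtype=float)
    ray = LineString([tuple(p0), tuple(p0 + RAY_LENGTH * d)])
    return ray.intersects(target)

def facing(obj, group, eps=EPS):
    """Facing(O_k, g_i): nearby group members hit by a ray along the primary axis a_k."""
    return sum(
        1 for o in group
        if o is not obj
        and obj.footprint.distance(o.footprint) <= eps
        and ray_hits(obj.center, obj.a, o.footprint)
    )
```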
To evaluate the plausibility of a new arrangement, we compare its corresponding Scene Graph with a population of viable prior Scene Graphs. By extracting Scene Graphs from a corpus of rooms, we construct a Knowledge Model which serves as our spatial prior for the position and orientation relationships of each object group. For each object instance, we assemble a data vector of positional features from $G$. For Asymmetric objects, we similarly create a data vector of orientational features. First we define the following vectors, which represent an object's relationships with all groups $G = \{g_1, \dots, g_m\}$:

$$\begin{aligned}
\mathrm{AD}(O_k) &= [\mathrm{AverageDistance}(O_k, g_i) \mid i = 1, \dots, m] \in \mathbb{R}^m \\
\mathrm{S}(O_k)  &= [\mathrm{SurroundedBy}(O_k, g_i) \mid i = 1, \dots, m] \in \mathbb{R}^m \\
\mathrm{F}(O_k)  &= [\mathrm{Facing}(O_k, g_i) \mid i = 1, \dots, m] \in \mathbb{R}^m \\
\mathrm{NT}(O_k) &= [\mathrm{NextTo}(O_k, g_i) \mid i = 1, \dots, m] \in \mathbb{R}^m \\
\mathrm{SP}(O_k) &= [\mathrm{Support}(O_k, g_i) \mid i = 1, \dots, m] \in \mathbb{R}^m
\end{aligned} \qquad (13)$$

This allows us to construct data arrays $d_p(O_k)$ and $d_o(O_k)$ containing the features that relate to the position and orientation of an object, respectively. RoomPosition is also included in the data array for orientational features, $d_o$, since the other features of $d_o$ are strongly correlated with an object's position in the room. RoomPosition is abbreviated as RP; we also abbreviate TowardsCenter to TC and DirectionSimilarity to DS. For succinctness, when using these abbreviations, the parameter $O_k$ is dropped from our notation.

$$\begin{aligned}
d_p(O_k) &= [\mathrm{RP} \in \mathbb{R},\; \mathrm{AD} \in \mathbb{R}^m,\; \mathrm{SP} \in \mathbb{R}^m,\; \mathrm{S} \in \mathbb{R}^m] \in \mathbb{R}^{3m+1} \\
d_o(O_k) &= [\mathrm{RP} \in \mathbb{R},\; \mathrm{TC} \in \mathbb{R},\; \mathrm{DS} \in \mathbb{R}^2,\; \mathrm{F} \in \mathbb{R}^m,\; \mathrm{NT} \in \mathbb{R}^m] \in \mathbb{R}^{2m+4}
\end{aligned} \qquad (14)$$

Finally, given one feature vector per object for position and orientation, respectively, we can collect more samples from a database, which we will discuss in Section 6, to form our Knowledge Model. The model collects feature vectors separately with respect to different object types in multiple room spaces. To do so, we introduce $g_{i,j}$ to collect all of the $i$-th type objects in room $R_j$, $j = 1, \dots, r$. Without loss of generality, we assume that the $i$-th object type is the same across all rooms. Therefore, we can collect all the objects of the same $i$-th type from a database as $g_{i,*} = \bigcup_{j=1}^{r} g_{i,j}$. Then $D_p(g_{i,*})$ and $D_o(g_{i,*})$ represent the collections of all feature vectors in (14) from objects in $g_{i,*}$:

$$D_p(g_{i,*}) = \{ d_p(O_k) \mid \forall O_k \in g_{i,*} \}, \qquad D_o(g_{i,*}) = \{ d_o(O_k) \mid \forall O_k \in g_{i,*} \} \qquad (15)$$

Given the feature samples for the same type of object in (15), we can now estimate their likelihood distribution. In particular, given an object placement $O$ of the $i$-th type, we seek to estimate the likelihood function for its position features:

$$P\big(d_p(O) \mid D_p(g_{i,*})\big). \qquad (16)$$

If $O$ is Asymmetric, we also seek to estimate the likelihood function for its orientation features:

$$P\big(d_o(O) \mid D_o(g_{i,*})\big). \qquad (17)$$

However, if $O$ is an Inside Facing object, its orientation is determined with certainty by that of its adjacent wall. Additionally, if $O$ is a Symmetric object, it has no clear orientation. Therefore, for these categories of objects, estimation of the orientation likelihood is not needed. In this section, we discuss how to estimate (16) and (17).

We approximate the shape of these distributions using multivariate kernel density estimation (KDE). Kernel density estimation is a non-parametric way to create a smooth function approximating the true distribution by summing kernel functions $K$ placed at each observation $X_1, \dots, X_n$ (Sheather 2004):

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \qquad (18)$$

This allows us to estimate the probability density function (PDF) of the position and orientation relationships from the spatial priors in our Knowledge Model, $D_p(g_{i,*})$ and $D_o(g_{i,*})$, for each group $g_i$.
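The paper cites the KDE implementation of (Seabold and Perktold 2010), i.e., statsmodels. The sketch below shows how a positional likelihood of the form (16) could be fitted with that library; the file name, feature ordering, and var_type string are assumptions, not the authors' exact configuration.

```python
# Sketch of fitting P(d_p | D_p(g_i,*)) with statsmodels' multivariate KDE.
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

# One row per training object of a group g_i;
# columns assumed to be [RP, AD_1..AD_m, SP_1..SP_m, S_1..S_m] as in Eq. (14).
D_p = np.load("priors_bed_position.npy")        # hypothetical prior feature matrix

m = (D_p.shape[1] - 1) // 3
var_type = "o" + "c" * m + "o" * m + "o" * m    # ordered-discrete counts, continuous distances

kde_p = KDEMultivariate(data=D_p, var_type=var_type, bw="normal_reference")

d_p_candidate = D_p[0]                          # feature vector of one sampled placement
score = kde_p.pdf(np.atleast_2d(d_p_candidate)) # likelihood used to rank candidate poses
```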
Algorithm 1 describes the SceneGen algorithm. Given a room model $R$ and a set of existing objects $O = \{O_1, O_2, \dots, O_n\}$, the algorithm evaluates the position and orientation likelihood of augmenting a new object $O'$ and recommends its most likely poses.

ALGORITHM 1: SceneGen Algorithm
  Given a training database, calculate $D_p(g_{i,*})$ and $D_o(g_{i,*})$ as priors.
  For a given room $R$, construct the Scene Graph $G$ of its objects $O$.
  while sampling the position of $O'$ of type $i$ in $R$ do
      Calculate $P(d_p(O') \mid D_p(g_{i,*}))$.
      while sampling the orientation of $O'$ in $[0, 2\pi)$ do
          Calculate $P(d_o(O') \mid D_o(g_{i,*}))$.
      end
  end
  Generate a heat map displaying the likelihood distributions.
  Recommend placing $O'$ at the highest-probability pose.

Figure 4 shows how potential Scene Graphs are created for sampled placements. For scenes where multiple objects need to be added, we repeat Algorithm 1 for each additional object.
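A sketch of the test-time loop of Algorithm 1: score candidate positions on a grid, keep the resulting heat map, and then score a set of orientations only at the best position (following the computational shortcut described for the implementation in Section 6, rather than the nested loops above). The helper functions passed in, the grid step, and the number of sampled angles are illustrative assumptions.

```python
# Sketch of sampling poses for one new object and ranking them with the trained KDEs.
import numpy as np

def augment(room_bounds, kde_p, kde_o, positional_features, orientational_features,
            step=0.25, n_angles=16):
    """Returns ((x, y, theta), heatmap) for one new object."""
    min_x, min_y, max_x, max_y = room_bounds
    xs = np.arange(min_x, max_x, step)
    ys = np.arange(min_y, max_y, step)
    heatmap = np.zeros((len(xs), len(ys)))
    best_xy, best_score = (xs[0], ys[0]), -np.inf
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            d_p = positional_features((x, y))      # features of the updated Scene Graph
            heatmap[i, j] = kde_p.pdf(d_p)
            if heatmap[i, j] > best_score:
                best_xy, best_score = (x, y), heatmap[i, j]
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    scores = [kde_o.pdf(orientational_features(best_xy, t)) for t in angles]
    theta = float(angles[int(np.argmax(scores))])
    return (best_xy[0], best_xy[1], theta), heatmap
```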
In this section, we discuss the implementation details of the SceneGen framework, based on relationship data learned from the Matterport3D dataset.
Matterport3D (Chang et al. 2018) is a large-scale RGB-D dataset containing 90 building-scale scenes. The dataset consists of various building types with diverse architectural styles, including numerous spatial functionalities and furniture layouts. Annotations of building elements and furniture are provided with surface reconstructions as well as 2D and 3D semantic segmentation.
In order to use the Matterport3D dataset as priors for SceneGen, we must make a few modifications to standardize object orientations, using an annotation tool we have also developed. In particular, different from Section 4.2, our annotation tool interacts with the dataset fully in a 3D environment (i.e., through Unity 3D). After the annotation, the relationship data are consolidated back to the 2.5-D representation conforming to the computation of the SceneGen models.

For each object $O_k$, the Matterport3D dataset supplies labeled oriented 3D bounding boxes $B(O)$ aligned to the $(x, y)$-plane. Each box is defined by a center position $\dot{O}$, a primary axis $a$, a secondary axis $b$, an implicit tertiary axis $c$, and $r \in \mathbb{R}^3$, which denotes the bounding box size of $O$ divided in half. However, the Matterport3D dataset does not provide information about which labeled direction the object is facing or which axis aligns with the $z$-axis. Hence, we rely on our labeling tool to resolve these ambiguities.

To provide a consistent definition, we describe a scheme to label these axes such that the primary axis $a$ points in the direction the object is facing, $a^*$. Since only one of these three axes has a $z$ component, we store this in the third axis $c$ and define $b$ to be orthogonal to $a$ on the $(x, y)$-plane. The box size $r$ is also updated to correspond to the correct axes. By constraining these axes to be right-handed, for a given $a^*$ we have:

$$c^* \equiv [0, 0, 1], \qquad b^* \equiv c^* \times a^*. \qquad (19)$$

In order to correctly relabel each object, we have developed an application to facilitate the identification of the correct primary axis for all Asymmetric objects, and supplemented this to the updated dataset. For each object, we view the house model mesh at different camera positions around the bounding box in order to determine the primary axis of the object, as displayed in Figure 5. Our annotation tool, shown in Figure 6, allows a labeler to select from two possible directions at each camera position or to move the camera clockwise or counter-clockwise to get a better view. Once a selection is made, the orienting axis $a^*$ can be determined from the camera in use and the direction selected. We use (19) to standardize the axes. Using our annotation tool, the orientations of all objects in a typical house scan can be labeled in about 5 minutes.

For this study, we have reduced the categories of object types considered for building our model and placing new objects. Though the Matterport3D dataset includes many different types of furniture, organized with room labels to describe furniture function (e.g., "dining chair" vs. "office chair"), we found that the dataset has a limited number of instances for many object categories. Because we build statistical models for each object category, we require an adequate representation of each category.
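Returning to the axis relabeling of Eq. (19), a minimal sketch of standardizing an object's axes from an annotated facing direction; inputs are assumed to be unit vectors lying on the $(x, y)$-plane.

```python
# Sketch of Eq. (19): fix the tertiary axis to +z and derive the secondary axis
# from a right-handed cross product with the annotated facing direction a*.
import numpy as np

def standardize_axes(a_star):
    """Returns (a, b, c) with c = [0, 0, 1] and b = c x a (right-handed frame)."""
    a = np.asarray(a_star, dtype=float)
    c = np.array([0.0, 0.0, 1.0])
    b = np.cross(c, a)
    return a, b, c

a, b, c = standardize_axes([0.0, 1.0, 0.0])   # e.g., an object annotated as facing +y
```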
Table 1. Distance between ground truth and predicted positions for different models, with the smallest distance for each object type in bold (ablation study). Topology features are abbreviated as follows: AverageDistance as AD, SurroundedBy as S, and RoomPosition as RP.
Fig. 10. Distance between the ground truth object's position and where SceneGen and other ablated versions of our system predict the object should be re-positioned, shown in a cumulative density plot.

Fig. 11. Distance between the ground truth object's position and the nearest of the 5 highest-probability positions predicted by SceneGen and other ablated versions of our system, shown in a cumulative density plot.

Fig. 12. Cumulative density plot indicates the angular distance between the ground truth orientation and our system's predicted orientation for SceneGen and other subsets of orientation features. The range is $[0, \pi)$.

Table 2. Angular distance (radians) between ground truth and predicted orientations for different model architectures (ablation study). Topology features are abbreviated as follows: Facing as F, TowardsCenter as C, RoomPosition as RP, NextTo as NT, DirectionSimilarity as DS.

System              Bed    Chair   Sofa   TV     Overall
F+C+RP (SceneGen)
F only              1.13   1.66    1.51   0.91   1.54
F+C                 1.13   1.55    1.18   0.49   1.35
F+C+NT              1.18   1.53    1.23

Thus, we reduce the categories to a better-represented subset for the purposes of this study. We group the objects into 9 broader categories: G = {Bed, Chair, Decor, Picture, Sofa, Storage, Table, TV, Other}. Each of these categories has a specific type of orientation, as described in Section 4.4. Of these categories, Asymmetric objects are $G_{\mathrm{asym}}$ = {Bed, Chair, Sofa, TV}, Symmetric objects are $G_{\mathrm{sym}}$ = {Decor, Table}, and Inside Facing objects are $G_{\mathrm{in}}$ = {Picture, Storage}.

For room types, we consider the set {library, living room, meeting room, TV room, bedroom, rec room, office, dining room, family room, kitchen, lounge} to avoid overly specialized rooms such as balconies, garages, and stairs. We also manually eliminate unusually small or large rooms with outlier areas, as well as rooms where scans and bounding boxes are incorrect.

After the data reduction, we consider a total of 1,326 rooms and 7,017 objects in our training and validation sets. The object and room categories used can be expanded if sufficient data is available.

We use the processed dataset as priors to train the SceneGen Knowledge Model. The procedure first estimates each object $O_k$ according to (14), and subsequently constructs $D_p(g_{i,*})$ and $D_o(g_{i,*})$ in (15) for categories in $G$ and $G_{\mathrm{asym}}$, respectively. We do not construct models for the 'Other' category, as objects in this category are sparse and unrelated to each other. Given our priors, we estimate the likelihood functions $P(d_p(O) \mid D_p(g_{i,*}))$ and $P(d_o(O) \mid D_o(g_{i,*}))$ from (16) and (17) using kernel density estimation.

We utilize a KDE library developed by (Seabold and Perktold 2010) with a normal reference rule-of-thumb bandwidth and ordered, discrete variable types. We make an exception for AverageDistance, which is continuous. When there are no objects of a certain group $g_i$ in a room, the value of AverageDistance($O_k$, $g_i$) is set to a large constant (1000), and we use a manually tuned bandwidth (0.1) to reduce the impact of this on the rest of the distribution.

Furthermore, we found that for this particular dataset, a subset of features, Facing, TowardsCenter, and RoomPosition, is most impactful in predicting orientation, as detailed in Section 8.1.2. Therefore, while we model all of the orientational features, we only use the Facing, TowardsCenter, and RoomPosition features in our implementation of SceneGen and in the user studies. Finally, due to overlapping bounding boxes in the dataset, calculating object support relationships (SP) precisely is not possible. Thus, in our implementation, we allow certain natural overlaps defined heuristically.

We sample poses across a room R with an object of type i and generate a probability heat map. This can be repeated in order to add multiple objects. To speed up computation in this implementation, we first sample positions, and then sample orientations at the most probable position, instead of sampling orientations at every possible position. Figure 7 shows how our implementation of SceneGen adds a new object to a scene, and examples of scenes augmented with multiple objects iteratively are shown in Figure 8.

Computation Time.
We train and evaluate our model using a machine with a 4-core Intel i7-4770HQ CPU and 16GB of RAM. In training, creating our Knowledge Model and estimating distributions for 8 categories of objects takes approximately 12 seconds. In testing, it takes ≈

To evaluate our prediction system, we run ablation studies, examining how the presence or absence of particular features affects our object position and orientation prediction results. We use a K=4-fold cross validation method in our ablation studies, with 100 rooms in each validation set and the remaining rooms in our training set.
The full position prediction model, SceneGen, trains three features: AverageDistance (AD), SurroundedBy (S), and RoomPosition (RP), or AD+S+RP in short. We create three reduced versions of our system: AD+RP, using only the AverageDistance and RoomPosition features; S+RP, using only the SurroundedBy and RoomPosition features; and RP, solely using the RoomPosition feature.

We evaluate each system using the K-fold method described above. In this study, we remove each object in the validation set, one at a time, and use our model to predict where the removed object should be positioned. The orientation of the replaced object is kept the same as the original. We compute the distance between the original object location and our system's prediction.

However, as inhabitants of actual rooms, we are aware that there is often more than one plausible placement of an object, though some may be more optimal than others. Thus, we raise the question of whether there is more than one ground truth or correct answer for our object placement problem. Hence, in addition to validating our model's features, our first ablation study validates them in relation to the simple approach of taking the single highest-scored location from our system, while our second ablation study uses the top 5 highest-scored locations, opening up examination to multiple potential "right answers".
We run a similar experiment to evaluate our orientation prediction models for Asymmetric objects. Our Scene Graphs capture 5 relationships based on the orientation of the objects: Facing (F), TowardsCenter (C), NextTo (NT), DirectionSimilarity (DS), and RoomPosition (RP). We assess models based on several combinations of these relationships.

We evaluate each of these models using the same K-fold approach, removing the orientation information of each object in the validation set, and then using our system to predict the best orientation, keeping the object's position constant. We measure the angular distance between our system's predictions and the original object's orientation.
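For clarity, this angular-distance metric can be written as the smaller of the two arc distances between the predicted and ground-truth facing angles, which lies in $[0, \pi]$; a small sketch:

```python
# Sketch of the angular distance between a predicted and a ground-truth orientation.
import numpy as np

def angular_distance(theta_pred, theta_gt):
    d = abs(theta_pred - theta_gt) % (2.0 * np.pi)
    return min(d, 2.0 * np.pi - d)
```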
We conduct user studies with a 3D application designed around our prediction system to evaluate the plausibility of our predicted positions and the usefulness of our heat map system. We recruited 40 participants, of whom 8 were trained architects. To ensure unbiased results, the participants were randomly divided into 4 groups. Each group of users was shown 5 scenes from each of the 5 levels, for a total of 25 scenes. The order in which these scenes were presented was randomized for each user, and users were not told which level a scene was at.
Fig. 13. Users are shown scenes that are simplified models based on original Matterport3D rooms. An object is replaced in rooms using one of 5 levels of the system. Level I places the object randomly in the room. Level II places the object randomly in an open space. Levels III and IV use SceneGen to predict the most likely placement and orientation, and Level IV also shows a heat map visualizing the probabilities of each sampled position. In Level V, the user sees the ground truth scene. When viewing the 3D model during the experiment, the user has multiple camera angles available and is able to pan, zoom, and orbit around the 3D room to evaluate the placement.
We reconstructed 34 3D scenes from our dataset test split, where each scene had one object randomly removed. In this reconstruction process, we performed some simplification and regularized the furniture designs using prefabricated libraries, so that users would evaluate the layout of the room rather than the design of the object itself, while matching the placement and size of each object. An example of this scene reconstruction and simplification can be seen in Figure 13(a-b).

The five defined levels test different object placement methods, as shown in Figure 13(c-g), to replace the removed object. Levels I and II are both random placements, generated at run time for each user. The Level I system places the object in a random position and orientation in the scene. The Level II system places the object in an open random position and orientation, where the placement does not overlap with the room walls or other objects. Levels III and IV use SceneGen predictions. The Level III system places the object in the position and orientation predicted by SceneGen. Level IV also places the object in the predicted position and orientation, but additionally overlays a probability map. The Level V system places the object at the position it appears in the Matterport3D dataset, i.e., the ground truth.

We recorded the users' Likert rating of the plausibility of the initial object placement on a scale of 1 to 5 (1 = implausible/random, 3 = somewhat plausible, 5 = very plausible). We also recorded whether the user chose to adjust the initial placement, the Euclidean distance between the initial placement and the final user-chosen placement, and the orientation change between the initial orientation and the final user-chosen orientation. We expect higher initial Likert ratings and smaller adjustments to position and orientation for levels initialized by our system than for levels initialized to random positions.

Each participant used an executable application on a desktop computer. The goal of the study was explained to the user, and they were shown a demonstration of how to use the interface. For each scene, the user was shown a 3D room and an object that had been removed. After inspecting the initial scene and clicking "place object", the object was placed in the scene using the method corresponding to the level of the scene. In Level IV scenes, the probability heat map was also visualized. The user was shown multiple camera angles and was able to pan, zoom, and orbit around the 3D room to evaluate the placement.

The user was first asked to rate the plausibility of the placement on a Likert scale from 1 to 5. Following this, the user was asked if they wanted to move the object to a new location. If they answered "no", the user progressed to the next scene. If they answered "yes", the UI displayed transformation control handles (position axis arrows, rotation axis circles) to adjust the object's position and orientation. After moving the object to the desired location, the user could save the placement and progress to the next scene. IRB approval was obtained ahead of the experiment.
In this experiment, we remove objects from test scenes taken from the Matterport3D dataset and replace them using various versions of our model in an ablation study. In Figure 10, we plot the cumulative distance between the ground truth position and the top position prediction, and in Figure 11, we plot the cumulative distance between the ground truth position and the nearest of the top 5 position predictions, using our full system and three ablated versions.
Fig. 14. Users rate the plausibility of object placement in each room on the Likert scale from 1 to 5 (1 = implausible/random, 3 = somewhat plausible, 5 = very plausible). Scores are displayed in a box plot separated by user study level.
We find that the full SceneGen system predicts placements closer to the ground truth than any of the ablated versions, followed by the models using AverageDist and RoomPosition features (AD+RP), and SurroundedBy and RoomPosition (S+RP). The predictions furthest from the ground truth are generated by using only the RoomPosition (RP) feature. These curves are consistent between the best and the closest of the top 5 predicted positions and indicate that each of our features for position prediction contributes to the accuracy of the final result.

In addition, when the top 5 predictions are considered, we see that each system we assessed is able to identify high probability zones closer to the ground truth. This is supported by the slope of the curves in Figure 11, evaluating the closest of the top 5 predictions, which rise much more sharply than in Figure 10, which uses only the best prediction. This difference supports the importance of predicting multiple locations instead of simply returning the highest-scored sampling location. A room can contain multiple plausible locations for a new object, so our system's most highly scored location will not necessarily be the same as the ground truth's. For this reason, our system returns probabilities across sampled positions using a heat map to show multiple viable predictions for any placement query.

Table 1 shows the mean distance of the position prediction to the ground truth position, separated by object category. We find that the object categories where the full SceneGen system outperforms its ablations are chairs, storage, and decor. For beds and TVs, SceneGen only produces the closest placements among the system versions when considering the top five predictions. For pictures and tables, SceneGen's top prediction is closest to the ground truth, and is only slightly further when comparing the nearest of the top 5 predictions.
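The metric behind Figures 10 and 11 can be sketched as follows; this is an illustrative reconstruction under the assumption that predictions are returned in descending score order, not the paper's published evaluation code.

import numpy as np

def topk_prediction_error(gt_pos, predicted_positions, k=5):
    """Distance from the ground-truth position to the top prediction and
    to the nearest of the top-k predictions."""
    preds = np.asarray(predicted_positions)[:k]
    dists = np.linalg.norm(preds - np.asarray(gt_pos), axis=1)
    return dists[0], dists.min()   # top-1 error, best-of-top-k error

def cumulative_curve(errors, grid):
    """Fraction of test objects placed within each distance threshold,
    i.e. the empirical CDF plotted in Figures 10 and 11."""
    errors = np.sort(np.asarray(errors))
    return np.searchsorted(errors, grid, side="right") / len(errors)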
Fig. 15. The plausibility score for each object category on the Likert scale given by users is compared between SceneGen Levels (III, IV) and the ground truth, Level V.

As with our position ablation studies, we assess the ability of various versions of our model to reorient asymmetric objects from test scenes. In Figure 12, we plot the angular distance between the ground truth orientation and the top orientation prediction for various versions of our system. The base model includes only Facing (F), and is the lowest performing. We find that the system that also includes the TowardsCenter and RoomPosition features performs best overall. We use this system (F+C+RP) in our implementation of SceneGen. The other four versions of our system perform similarly to each other overall.

Table 2 shows the results of the orientation ablation study separated by object category. In this case, the system with Facing, TowardsCenter, and RoomPosition features (F+C+RP) outperforms all other versions across all categories except for TVs, where the system that includes Facing, TowardsCenter, and NextTo (F+C+NT) produces the least deviation. In fact, all three of the systems that include either DirectionSimilarity or NextTo predict the orientation of TVs more closely than the overall best performing system, but perform more poorly on other objects such as beds when compared with systems without those features. This suggests that for other datasets, these features could be more effective in predicting orientations.
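The angular distance used in Figure 12 and Table 2 can be computed as below; this is a minimal sketch assuming orientations are expressed as yaw angles about the vertical axis, which may differ from the paper's exact representation.

import numpy as np

def angular_error(gt_yaw, predicted_yaw):
    """Smallest angle (in radians) between the ground-truth and predicted
    facing directions, wrapped so that the result lies in [0, pi]."""
    delta = (predicted_yaw - gt_yaw) % (2 * np.pi)
    return min(delta, 2 * np.pi - delta)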
We show the distributions of Likert ratings by level in Figure 14. We also run a one-way ANOVA test on the Likert ratings of initial placements, finding significant differences between all pairs of levels except for Levels IV and V. In other words, the ratings for Level IV's representation of our prediction system are not significantly different from ground truth placements. Across multiple tests, we see that Level IV result means are significantly different from levels based on randomization, while Level III is only sometimes. As the Level IV presentation of the system can have multiple suggested initial placements, this difference between Levels III and IV could support our conjecture that accounting for multiple "right answer" placements improves the predictions.
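A minimal sketch of this kind of analysis, assuming the per-trial Likert ratings are stored in a pandas DataFrame with columns "level" (I-V) and "likert"; statsmodels (Seabold and Perktold 2010) is cited in our references, and its Tukey HSD routine is one way to obtain the pairwise comparisons, though the exact analysis pipeline is not reproduced here.

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_levels(ratings: pd.DataFrame):
    # Overall one-way ANOVA across the five placement levels.
    groups = [g["likert"].values for _, g in ratings.groupby("level")]
    f_stat, p_value = f_oneway(*groups)
    # Post-hoc pairwise comparison to see which level pairs differ.
    posthoc = pairwise_tukeyhsd(ratings["likert"], ratings["level"])
    return f_stat, p_value, posthoc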
Fig. 16. Radial histograms display the distribution of how much a user rotated an object from its orientation in each level of the user study. Figure created using (Zittrell 2020).
We analyze how participants' choices to adjust placement and the amount moved varied across the different scene levels. Results can be seen in Figure 17. A one-way ANOVA test of the distance users moved objects from their initial placements found a significant difference between two groupings of levels: 1) Levels I and II (with higher means), and 2) Levels III, IV, and V (with lower means). The first group contains the levels with randomized initial placements, while the second group contains the levels that use our prediction system or the ground truth placement. This differentiation in groupings provides support for the plausibility of our system's position predictions over random placements.

A one-way ANOVA test was also performed on the overall change in object orientation from the participants' manual adjustment, and found a significant difference between a different pair of level groupings: 1) Levels I, II, and III, and 2) Levels IV and V. In Figure 16, we show the distributions of angular distance between the initial object orientation and the final user-chosen orientation for each level. The distributions for Levels IV and V are most concentrated at no rotation by the user. In Levels I and II, users rotate objects more than half of the time, with an average rotation greater than π radians. A vast majority of objects placed by the Level III, IV, and V systems are not rotated by the user, lending support to the validity of our prediction system.

To demonstrate our prediction system in action, we have implemented an augmented reality application that augments a scene using SceneGen. Users can overlay bounding boxes over the existing furniture to see the object bounds used in our predictions. On inserting a new object into the scene, the user can visualize a probability map to observe potential positions. Our Augmented Reality application consists of five main modules: (i) local semantic segmentation of the room; (ii) local Scene Graph generation; (iii) heat map generation, which runs on an external server; (iv) local data parsing and visualization; and finally (v) the user interface. We briefly discuss each of these modules below.

Semantic segmentation of the room can be done either manually or automatically, using integrated tools available on augmented reality devices. However, as not all current AR devices are equipped with depth-sensing hardware, we use techniques previously introduced by (Saran et al. 2019), allowing the users themselves to manually generate and annotate semantic bounding boxes of objects in the target scene. The acquired data are then converted to our proposed spatial Scene Graph, resulting in an abstract representation of the scene. Both the semantic segmentation and graph generation modules are performed locally on the AR device, ensuring the privacy of the user's raw spatial data.

Once the Scene Graph is generated, it is sent to a remote server where the SceneGen engine calculates positional and orientational augmentation probability maps for the target scene. The prediction probability maps for all objects are generated in this step. This approach allows faster computation, since current AR devices come with limited computational and memory resources. The results are sent back to the local device, where they can be parsed and visualized using the Augmented Reality GUI.

The instantiation system can toggle between two modes: Manual and SceneGen. In Manual mode, the object is placed in front of the user, at the intersection of the camera's front-facing vector with the floor. This normally results in augmenting the object in the middle of the screen. While this conventional approach allows the user to control the initial placement by determining the pose of the AR camera, in many cases additional movements are necessary to place the object in a plausible final location. In such cases, the user can further move and rotate the object to its desired location. In SceneGen mode, the virtual object is augmented using the prediction of our system, resulting in faster and contextual placements.
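The client-server split described above can be sketched as follows, assuming a JSON-over-HTTP interface; the endpoint URL and payload schema are hypothetical and stand in for the application's actual API. The AR client only transmits the abstract Scene Graph, never the raw scan data.

import json
import urllib.request

def request_probability_map(scene_graph: dict, object_category: str,
                            server_url: str = "http://scenegen.example/predict"):
    """Send the locally built Scene Graph to the remote SceneGen engine and
    return its predicted placement probability map for one object category."""
    payload = json.dumps({"scene_graph": scene_graph,
                          "category": object_category}).encode("utf-8")
    req = urllib.request.Request(server_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # Expected reply: sampled (x, y, theta) poses with their scores,
        # which the client parses and renders as a heat map.
        return json.loads(resp.read())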
Fig. 17. Cumulative density plot indicates the distance objects were moved from their initial placement in each level of the user study.
10 DISCUSSION
10.1 Features and Predictions
The Scene Graph we introduce in this paper is designed to capture spatial relationships between objects, object categories, and the room. Overall, we have found that each of the relationships we present improves the model's ability to augment virtual objects in realistic placements in a scene. These relationships are important for understanding the functional purposes of the space in addition to the individual objects.

In SceneGen, RoomPosition is used as a feature in predicting both the orientation and position of objects. While this feature is based solely on the position of the object, where an object is in a room also has a strong impact on its function and how it should be oriented. For example, a chair in a corner of the room is very likely to face towards the center of the room, while a chair in the middle of the room is more likely to face towards a table or a sofa. When analyzing our placement prediction probability maps and our user study results, we have observed that the best orientation is not the same at each position. This is affected not only by the nearby objects, but also by the sampled position within the room.
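A minimal sketch of a RoomPosition-style feature, assuming an axis-aligned rectangular room and a wall-proximity threshold; the feature in our system is computed from the Scene Graph representation and may differ in detail.

def room_position(x, y, room_min, room_max, wall_threshold=0.5):
    """Classify a sampled position as 'corner', 'edge', or 'middle' based on
    how many room walls lie within wall_threshold (meters)."""
    near_x = min(x - room_min[0], room_max[0] - x) < wall_threshold
    near_y = min(y - room_min[1], room_max[1] - y) < wall_threshold
    if near_x and near_y:
        return "corner"
    if near_x or near_y:
        return "edge"
    return "middle"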
In our evaluation of SceneGen, we have found a number of benefits in using an explicit model to predict object placements. One benefit is that if we want to define a non-standard object to be placed in relation to standard objects, specifying our own relationship distributions is feasible with our system but would not be possible with implicit models. For example, in a collaborative virtual environment where special markers are desired to be placed near each user, one could specify distributions for relationships such as NextTo chair and Facing table, without needing to train these from a dataset.
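A hedged illustration of this point: because the model is explicit, a user can hand-specify feature distributions for a non-standard object (here, a hypothetical collaboration marker) instead of learning them from data. The feature names and Gaussian parameters below are assumptions chosen for illustration only.

import numpy as np

def marker_placement_score(dist_to_nearest_chair, facing_table_angle):
    # NextTo chair: prefer the marker to sit roughly 0.5 m from the nearest chair.
    next_to = np.exp(-0.5 * ((dist_to_nearest_chair - 0.5) / 0.2) ** 2)
    # Facing table: prefer the facing direction to point at a table
    # (angular offset near zero, in radians).
    facing = np.exp(-0.5 * (facing_table_angle / 0.3) ** 2)
    return next_to * facing  # score sampled poses and keep the highest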
Fig. 18. Top 5 highest probability positions for placing a sofa (a, b), table (c), and TV (d) predicted by SceneGen (green) are compared to the user placements (red), showing that different users prefer different locations in a room and that SceneGen also finds the clusters preferred by users to be highly probable.
Another benefit is that explicit models can be examined directly to understand why objects are being placed where they are. For example, the bed orientation feature distribution based on the Matterport3D priors in Figure 9, marginalized with respect to all variables except TowardsCenter, shows that beds are nearly 5 times as likely to face the center of the room, while marginalizing all features except the position of storage objects shows that a storage object is found in a corner of a room 63% of the time, along an edge 33% of the time, and in the middle of the room in only 4% of occurrences.
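A minimal sketch of this kind of inspection: because the priors are explicit, the empirical feature distribution can be marginalized onto a single variable (here RoomPosition) to read off interpretable statistics such as how often storage objects sit in a corner. The data layout (a list of per-object feature dictionaries) is an assumption for illustration.

from collections import Counter

def room_position_marginal(prior_objects, category="storage"):
    """Share of prior objects of a given category found at each RoomPosition
    value (e.g. corner / edge / middle)."""
    counts = Counter(o["room_position"] for o in prior_objects
                     if o["category"] == category)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}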
One important consideration in our choice of dataset is that we aim to learn spatial relationships for real-world scenes. One can imagine idiosyncrasies of lived-in rooms, such as an office chair that is not always tucked into a desk but often left rotated away from it, or a dining table pushed against a wall to create more space in a family room. Using personal living spaces from the Matterport3D dataset as our priors, we can capture these relationships that exist only in real-world, lived-in scenes.

One drawback of using the Matterport3D dataset is that it is not as large as some synthetic datasets. In our implementation, we group objects into broader categories so that all object categories are represented well enough to approximate the distribution of a large feature space.

Another downside of using a real-world dataset is the accuracy of its labels, as many human errors occur in this labour-intensive process. Such mismatches are unlikely to happen in synthetic datasets, as the geometry is already assigned in a digital format.
Fig. 19. Augmented Reality application demonstrates how SceneGen can be used to add virtual objects to a scene. From the target scene (top-left), a TV (top-right), a table (middle-left), and then a sofa (middle-right) are placed in the most probable poses. A probability map can be displayed indicating how likely each position is (top-left, middle-left). The AR application with virtual objects is compared to the original scene (bottom).

To mitigate some of these concerns, we have developed a labeling application that allows us to determine the correct orientation of each object, and also to filter out rooms with corrupted scans and inaccurate labeling.
Where and how an object is placed in a scene is often very subjective, and preferences can differ between users. This is demonstrated by the Likert scale plausibility ratings of the Level V reference scenes in the user studies. Figures 14 and 15 show that some users gave only "somewhat plausible" scores to scenes that are modelled from real-world ground truth Matterport3D rooms. This supports providing a heat map of probabilities over the sampled placements, as alternate high probability positions may be preferable to different users. Our results also indicate that most users prefer Level IV scenes, with the heat map, over Level III scenes, even though the placements use the same SceneGen models. This suggests that the inclusion of the heat map guides users towards the system's placement and may help convince them of the viability of and reasoning behind such a choice.

We also see that some users move objects to other high probability alternatives, as seen in Figure 18. This mirrors the position prediction experiment, which compares the ground truth position to the closest of SceneGen's top 5 predictions and shows that while the reference position may not always be the top prediction, it is often one of the top predictions. Moreover, the results in Figure 15 show that the subjectivity of an object placement is highly dependent on the size and type of the object itself. In any room, there are very few natural places to put a bed; hence the results for placing beds cluster in one or two high probability locations. Other objects, such as decor, are more subject to user preferences.
11 CONCLUSION
In this paper we introduce a framework to augment scenes with one or more virtual objects using an explicit generative model trained on spatial relationship priors. Scene Graphs from a dataset of scenes are aggregated into a Knowledge Model and used to train a probabilistic model. This explicit model allows for direct analysis of the learned priors and allows users to input custom relationships to place non-standard objects alongside traditional objects. SceneGen places the object in the highest probability pose and also offers alternate highly likely placements.

We implement SceneGen using Matterport3D, a dataset composed of 3D scans of lived-in rooms, in order to understand object relationships in a real-world setting. The features that SceneGen extracts to build our Scene Graph are assessed through an ablation study, identifying how each feature contributes to our model's ability to predict realistic object placements. User studies also demonstrate that SceneGen is able to augment scenes in a much more plausible way than systems that place objects randomly or in open spaces. We also found that different users have their own preferences for where an object should be placed. Suggesting multiple high probability possibilities through a heat map gives users an intuitive visualization of the augmentation process.

There are, of course, limitations to our work. While SceneGen is able to iteratively add objects to a scene, the resulting layout is highly dependent on the order in which objects are placed. Such an approach does not consider all possible permutations of the arrangements. In addition, it can narrow down the open space available for later objects, forcing placements that are far from optimal. Moreover, in scenarios where a large number of objects are to be augmented, the current approach may not be able to fit all the objects within the usable space, as initial placements are not aware of upcoming objects. Future work could incorporate floorplanning methodologies into the current sampling mechanism, allowing a robust search of the solution space while addressing combinatorial arrangement.

Moreover, SceneGen is a framework that naturally fits into spatial computing applications. We demonstrate this in an augmented reality application that augments a scene with a virtual object using SceneGen. Contextual scene augmentation can be useful in augmenting collaborative mixed reality environments or in other design applications, and using this framework allows for fast and realistic scene and content generation. We plan on improving our framework by providing the option to contextually augment non-standard objects by parameterizing topological relationships, a feature that would facilitate content creation for future spatial computing workflows.
ACKNOWLEDGMENTS
We acknowledge the generous support from the following research grants: FHL Vive Center for Enhanced Reality Seed Grant, a Siemens Berkeley Industrial Partnership Grant, and ONR N00014-19-1-2066.
REFERENCES
Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X Chang, andMatthias Nießner. 2019. Scan2cad: Learning cad model alignment in rgb-d scans.In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition .2614–2623.J Timothy Balint and Rafael Bidarra. 2019. A generalized semantic representation forprocedural generation of rooms. In
Proceedings of the 14th International Conferenceon the Foundations of Digital Games . ACM, 85.Richard W Bukowski and Carlo H S´equin. 1995. Object associations: a simple andpractical approach to virtual 3D manipulation. In
Proceedings of the 1995 symposiumon Interactive 3D graphics . 131–ff.Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Mano-lis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2018. Matterport3D: Learningfrom RGB-D data in indoor environments. In
Proceedings - 2017 International Con-ference on 3D Vision, 3DV 2017 . 667–676. https://doi.org/10.1109/3DV.2017.00081arXiv:1709.06158Angel Chang, Manolis Savva, and Christopher D Manning. 2014a. Interactive learningof spatial knowledge for text to 3D scene generation. In
Proceedings of the Workshopon Interactive Language Learning, Visualization, and Interfaces . 14–21.Angel Chang, Manolis Savva, and Christopher D Manning. 2014b. Learning spatialknowledge for text to 3D scene generation. In
Proceedings of the 2014 Conference onEmpirical Methods in Natural Language Processing (EMNLP) . 2028–2038.Angel X Chang, Mihail Eric, Manolis Savva, and Christopher D Manning. 2017. Sce-neSeer: 3D scene design with natural language. arXiv preprint arXiv:1703.00050 (2017).Kang Chen, Yukun Lai, Yu-Xin Wu, Ralph Robert Martin, and Shi-Min Hu. 2014. Au-tomatic semantic modeling of indoor scenes from low-quality RGB-D data usingcontextual information.
ACM Transactions on Graphics
33, 6 (2014).Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. 2019. Knowledge-EmbeddedRouting Network for Scene Graph Generation. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition . 6163–6171.Bo Dai, Yuqi Zhang, and Dahua Lin. 2017. Detecting visual relationships with deeprelational networks. In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition . 3076–3086.Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan.2012. Example-based synthesis of 3D object arrangements.
ACM Transactions onGraphics
31, 6 (2012), 1. https://doi.org/10.1145/2366145.2366154Matthew Fisher, Manolis Savva, Yangyan Li, Pat Hanrahan, and Matthias Nießner.2015. Activity-centric scene synthesis for functional 3D scene modeling.
ACMTransactions on Graphics (TOG)
34, 6 (2015), 1–13.Qiang Fu, Xiaowu Chen, Xiaotian Wang, Sijia Wen, Bin Zhou, and Hongbo Fu. 2017.Adaptive synthesis of indoor scenes via activity-associated object relation graphs.
ACM Transactions on Graphics (TOG)
36, 6 (2017), 1–13.Tobias Germer and Martin Schwarz. 2009. Procedural Arrangement of Furniture forReal-Time Walkthroughs. In
Computer Graphics Forum , Vol. 28. Wiley Online Library,2068–2078.Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. 2019.Scene graph generation with external knowledge and image reconstruction. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition .1969–1978.Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, and Tsuhan Chen. 2013. 3d-basedreasoning with blocks, support, and stability. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition . 1–8.Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, MichaelBernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In
Proceedings ofthe IEEE conference on computer vision and pattern recognition . 3668–3678.Z Sadeghipour Kermani, Zicheng Liao, Ping Tan, and H Zhang. 2016. Learning 3D SceneSynthesis from Annotated RGB-D Images. In
Computer Graphics Forum , Vol. 35.Wiley Online Library, 197–206. Mohammad Keshavarzi, Clayton Hutson, Chin-Yi Cheng, Mehdi Nourbakhsh, MichaelBergin, and Mohammad Rahmani Asl. 2020a. SketchOpt: Sketch-based ParametricModel Retrieval for Generative Design. arXiv preprint arXiv:2009.00261 (2020).Mohammad Keshavarzi, Allen Y Yang, Woojin Ko, and Luisa Caldas. 2020b. Optimiza-tion and Manipulation of Contextual Mutual Spaces for Multi-User Virtual andAugmented Reality Interaction. In . IEEE, 353–362.Young Min Kim, Niloy J Mitra, Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring3D indoor environments with variability and repetition.
ACM Transactions onGraphics (TOG)
31, 6 (2012), 138.Yash Kotadia, Krisha Mehta, Mihir Manjrekar, and Ruhina Karani. 2020. IndoorNet:Generating Indoor Layouts from a Single Panorama Image. In
Advanced Comput-ing Technologies and Applications: Proceedings of 2nd International Conference onAdvanced Computing Technologies and ApplicationsfiICACTA 2020 . Springer, 57–66.Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Vi-sual genome: Connecting language and vision using crowdsourced dense imageannotations.
International Journal of Computer Vision
ACM Transactions on Graphics(TOG)
38, 2 (2019), 12.Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scenegraph generation from objects, phrases and region captions. In
Proceedings of theIEEE International Conference on Computer Vision . 1261–1270.Yuan Liang, Fei Xu, Song Hai Zhang, Yu Kun Lai, and Taijiang Mu. 2018. Knowl-edge graph construction with structure and parameter learning for indoor scenedesign.
Computational Visual Media
4, 2 (2018), 123–137. https://doi.org/10.1007/s41095-018-0110-3Yuan Liang, Song-Hai Zhang, and Ralph Robert Martin. 2017. Automatic data-drivenroom design generation. In
International Workshop on Next Generation ComputerAnimation Techniques . Springer, 133–148.Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, and Jan Kautz. 2019b. Planercnn:3d plane detection and reconstruction from a single image. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition . 4450–4459.Daqi Liu, Miroslaw Bober, and Josef Kittler. 2019a. Visual Semantic Information Pursuit:A Survey. arXiv preprint arXiv:1903.05434 (2019).Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationshipdetection with language priors. In
European Conference on Computer Vision . Springer,852–869.Rui Ma, Honghua Li, Changqing Zou, Zicheng Liao, Xin Tong, and Hao Zhang. 2016.Action-driven 3D indoor scene evolution.
ACM Trans. Graph.
35, 6 (2016), 173–1.Rui Ma, Akshay Gadi Patil, Matthew Fisher, Manyi Li, S¨oren Pirk, Binh-Son Hua, Sai-KitYeung, Xin Tong, Leonidas Guibas, and Hao Zhang. 2018. Language-driven synthesisof 3D scenes from scene databases. In
SIGGRAPH Asia 2018 Technical Papers . ACM,212.Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala, and Vladlen Koltun. 2011.Interactive furniture layout using interior design guidelines.
ACM transactions ongraphics (TOG)
30, 4 (2011), 1–10.Sahil Narang, Andrew Best, and Dinesh Manocha. 2018. Simulating movement interac-tions between avatars & agents in virtual worlds using human motion constraints.In . IEEE, 9–16.Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. 2019.Revealing scenes by inverting structure from motion reconstructions. In
Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition . 145–154.Charles R. Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, and Leonidas J.Guibas. 2016. Volumetric and Multi-View CNNs for Object Classification on 3DData. (2016). https://doi.org/10.1109/CVPR.2016.609 arXiv:1604.03265Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, and Song-Chun Zhu. 2018.Human-centric indoor scene synthesis using stochastic grammar. In
Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition . 5899–5908.Sharif Razzaque, Zachariah Kohn, and Mary C Whitton. 2001. Redirected Walking.
Proceedings of EUROGRAPHICS (2001), 289–294.Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towardsreal-time object detection with region proposal networks. In
Advances in neuralinformation processing systems . 91–99.Daniel Ritchie, Kai Wang, and Yu-an Lin. 2019. Fast and flexible indoor scene synthesisvia deep convolutional generative models. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition . 6182–6190.Vedant Saran, James Lin, and Avideh Zakhor. 2019. Augmented Annotations: In-door Dataset Generation with Augmented Reality.
International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences (2019).Manolis Savva, Angel X Chang, and Maneesh Agrawala. 2017. Scenesuggest: Context-driven 3D scene design. arXiv preprint arXiv:1703.00061 (2017).Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with python. In .
Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo.2012. An interactive approach to semantic modeling of indoor scenes with an rgbdcamera.
ACM Transactions on Graphics (TOG)
31, 6 (2012), 136.V. Sharp. 2008.
The Art of Redesign . Sharp Publishing. https://books.google.com/books?id=2kxqIfFC1EcCSimon J. Sheather. 2004. Density Estimation.
Statist. Sci.
Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition . 1771–1780.Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoorsegmentation and support inference from rgbd images. In
European conference oncomputer vision . Springer, 746–760.Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and ThomasFunkhouser. 2017. Semantic Scene Completion from a Single Depth Image.
Proceed-ings of 30th IEEE Conference on Computer Vision and Pattern Recognition (2017).Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. 2019. Horizonnet:Learning room layout with 1d representation and pano stretch data augmentation.In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition .1047–1056.Carole Talbott and Maggie Matthews. 1999.
Decorating for Good: A Step-by-step Guideto Rearranging What You Already Own . Clarkson Potter.Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-structuredrepresentations for visual question answering. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition . 1–9.Kai Wang, Yu-An Lin, Ben Weissmann, Manolis Savva, Angel X Chang, and DanielRitchie. 2019. Planit: Planning and instantiating indoor scenes with relation graphand spatial prior networks.
ACM Transactions on Graphics (TOG)
38, 4 (2019), 132.Kai Wang, Manolis Savva, Angel X Chang, and Daniel Ritchie. 2018. Deep convolutionalpriors for indoor scene synthesis.
ACM Transactions on Graphics (TOG)
37, 4 (2018),70.Lauri Ward. 1999.
Use what You Have Decorating: Transform Your Home in One Hourwith Ten Simple Design Principles Using…
Penguin.Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, DanielCohen-Or, and Baoquan Chen. 2016. 3d attention-driven depth acquisition for objectidentification.
ACM Transactions on Graphics (TOG)
35, 6 (2016), 238.Ken Xu, James Stewart, and Eugene Fiume. 2002. Constraint-based automatic placementfor scene composition. In
Graphics Interface , Vol. 2. 25–34.Michael Ying Yang, Wentong Liao, Hanno Ackermann, and Bodo Rosenhahn. 2017. Onsupport relations and semantic scene graphs.
ISPRS journal of photogrammetry andremote sensing
131 (2017), 15–25.Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship forimage captioning. In
Proceedings of the European Conference on Computer Vision(ECCV) . 684–699.Yi-Ting Yeh, Lingfeng Yang, Matthew Watson, Noah D Goodman, and Pat Hanrahan.2012. Synthesizing open worlds with constraints using locally annealed reversiblejump mcmc.
ACM Transactions on Graphics (TOG)
31, 4 (2012), 1–11.Lap-Fai Yu, Sai-Kit Yeung, Chi-Keung Tang, Demetri Terzopoulos, Tony F. Chan,and Stanley J. Osher. 2011. Make it Home: Automatic Optimization of Furni-ture Arrangement Lap-Fai.
ACM Transactions on Graphics
30, 4 (July 2011), 1.https://doi.org/10.1145/2010324.1964981Zehao Yu, Jia Zheng, Dongze Lian, Zihan Zhou, and Shenghua Gao. 2019. Single-imagepiece-wise planar 3d reconstruction via associative embedding. In
Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition . 1029–1037.Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scenegraph parsing with global context. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition . 5831–5840.Song-Hai Zhang, Shao-Kui Zhang, Yuan Liang, and Peter Hall. 2019. A Survey of 3DIndoor Scene Synthesis.
Journal of Computer Science and Technology
34, 3 (2019),594–608.Song-Hai Zhang, Shao-Kui Zhang, Wei-Yu Xie, Cheng-Yang Luo, and Hong-Bo Fu. 2020.Fast 3D Indoor Scene Synthesis with Discrete and Exact Layout Pattern Extraction. arXiv preprint arXiv:2002.00328 (2020).Bo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. 2015. Sceneunderstanding by reasoning stability and safety.
International Journal of ComputerVision