In-game Residential Home Planning via Visual Context-aware Global Relation Learning
Lijuan Liu, Yin Yang, Yi Yuan, Tianjia Shao, He Wang, Kun Zhou
NetEase Fuxi AI Lab; School of Computing, Clemson University; State Key Lab of CAD&CG, Zhejiang University; University of Leeds
liulijuan, [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
In this paper, we propose an effective global relation learning algorithm to recommend an appropriate location of a building unit for in-game customization of residential home complexes. Given a construction layout, we propose a visual context-aware graph generation network that learns the implicit global relations among the scene components and infers the location of a new building unit. The proposed network takes as input the scene graph and the corresponding top-view depth image. It provides location recommendations for a newly-added building unit by learning an auto-regressive edge distribution conditioned on existing scenes. We also introduce a global graph-image matching loss to enhance the awareness of essential geometry semantics of the site. Qualitative and quantitative experiments demonstrate that the recommended locations well reflect the implicit spatial rules of components in the residential estates, and that they are instructive and practical for locating building units in the 3D scene of the complex construction.
Introduction
Customized residential complex design has become a popular element in modern MMORPG games. This module allows players to virtually create personalized housing experiences with a comprehensive construction and design interface. For instance, a player could have a palace-like mansion with a carefully-shaped garden of pools and a greenhouse. Possessing such luxury housing is unlikely to be possible for most of us. Yet, it could enhance the feeling of belongingness and escalate the joyfulness during gaming. On the downside, designing a residential housing complex is not "as easy as pie": it requires professional expertise and extensive experience. Our answer to this dilemma is to resort to machine learning to prompt smart suggestions during user interaction, similar to a smart typing system that predicts the next word we will input. Following this motivation, we introduce an algorithm for location recommendation, which interactively provides suggestions to players on where to place new building components.

Figure 1: A high-level overview of our pipeline: from an input image of the in-game home customization site, our system extracts a relation graph. With other learned features, we train a graph generation network to infer the deployment of new edges. Finally, our system outputs a location prediction indicating the "suitableness" for the next building unit. (Pipeline stages, left to right: home customization, relation graph with depth, edge inferring, suggestion heatmap.)
A few techniques have been proposed to automatically suggest a placement of a new component in an indoor scene (Wang et al. 2018, 2019; Nauata et al. 2020; Wu et al. 2019). They use deep learning techniques, e.g., the FiLM net (Perez et al. 2018), to predict a component location as an attribute of the new node. However, these approaches do not directly transfer to home construction, which occurs in an open outdoor space. Clearly, planning the building construction of a residential complex involves elements of diverse dimensions and scales; therefore, searching all the possible positions on the site is not efficient. Second, items/elements in indoor scenes are normally associated with well-defined functional constraints, which can be fully exploited by the network. However, we have much weaker functional relations among buildings on the construction site. Certain types of building units are also exclusive; for instance, one cannot add extra building blocks on top of a swimming pool, and we name such locations forbidden areas. This type of exclusiveness is not considered in previous algorithms.

Given a layout of a housing site, we aim to suggest the user a location directly, without any prior knowledge of the building unit to be placed. Our method is inspired by PlanIT (Wang et al. 2019), which converts the site layout into a graph. To reduce the search space, we do not iterate over all the candidate locations on the layout. Instead, we infer a possible location of a building unit through global graph relations. While building units do not have strong local/neighborhood dependence, in a complex with multiple building units we leverage global relations (i.e., graph edges) among all the building units to facilitate our prediction. For instance, one does not want houses to fully enclose a golf court. In this way, our graph generation network learns implicit global constraints from the existing graph and understands how to add new graph edges following such ineffable rules. To account for the exclusive units/areas, we extract the essential visual clues of the input scene from the top view of the site image through a convolutional net and fuse them into the graph generation network. Concretely, we construct two data structures as the network inputs. One is the top-down rendered scene image (with the exclusive units labeled in a detailed area description), and the other is the scene graph. Our graph generation network takes as input the scene graph and integrates the corresponding visual clues learned from the scene image to learn the global relations of nodes and mimic constructing new edges. The scene graph does not contain the buildings' visual geometry semantics, nor can it describe the forbidden areas in the scene. To this end, we introduce a global graph-image feature matching loss to enable the awareness of the scene geometry during graph generation. The proposed visual context-aware global relation learning network can precisely describe the geometric and topological semantics of the input scene. The auto-regressive generative mode within the network can effectively model the edge distribution from the existing nodes to the future nodes.
Finally, we infer the recommended location for guiding the placement from the learned edge distribution.

We have qualitatively and quantitatively evaluated our method on a residential housing dataset collected from a commercial game. The results show that our method can effectively model the global spatial rules in the layout of building components. With the extracted visual clues, our network effectively avoids suggestions in the forbidden areas and collisions with existing buildings. The perceptual study and the quantitative evaluation results demonstrate that our generated location maps yield meaningful and instructive guidance for the players to place new building units.

Related Works
Residential Scene Layout Synthesis.
Residential scene layout synthesis plays an important role in various domains, such as game design and architectural layouts. With the emergence of large scene datasets, more deep learning based models have been proposed to address the layout generation problem. DeepSynth (Wang et al. 2018) and FastSynth (Ritchie, Wang, and Lin 2019) introduce iterative generation methods to synthesize new indoor layouts, representing the unstructured input as top-view rendered images. In GRAINS (Li et al. 2019), the input is represented as a tree structure and a recursive auto-encoder network is introduced to learn and sample the layout hierarchies. (Zhang et al. 2020) represents the input as both an arrangement matrix and rendered images and generates scenes in an attribute-matrix form with a generative adversarial network. PlanIT (Wang et al. 2019) proposes a two-stage method that first generates a layout plan encoded as a relation graph and then instantiates the plan through an autoregressive convolutional generator based on the rendered images. In (Zhang et al. 2019), a stylistic GAN is proposed to model the relationship between the style distribution and the enhancements for 3D indoor scenes. A novel evaluation method is also introduced by (Liu 2019) to evaluate synthesized 3D indoor scenes qualitatively. In addition to the work mentioned above on indoor layout generation, a few researchers have also proposed techniques for floor layout design. For example, (Wu et al. 2019) proposes a two-stage method to iteratively locate rooms and walls given an input boundary, while (Hu et al. 2020) introduces an interactive solution in which users can specify constraints during planning. In (Nauata et al. 2020), a convolutional message passing network named House-GAN is proposed that takes as input a bubble diagram and outputs the house layout with axis-aligned bounding boxes. Unlike these tasks of indoor layout and floor plan design, our work focuses on outdoor home planning; specifically, we aim to suggest locations for new buildings.
Graph Generation Networks.
Graphs are natural representations of information in many areas, such as biology, engineering, and the social sciences. Traditional techniques, such as (Bollobás and Béla 2001; Leskovec et al. 2010; Margaritis 2003; Leskovec, Kleinberg, and Faloutsos 2007), are based on hand-engineered graph priors that adhere to a pre-decided distribution, so the learned generative models do not have enough capacity to represent the graph structures contained in the observed data. Inspired by recent advances in deep generative models in computer vision (Wang, She, and Ward 2019; Kingma and Welling 2019; Kobyzev, Prince, and Brubaker 2019) and natural language processing (Radford et al. 2019; Brown et al. 2020), recent techniques have shifted towards a learning-based approach and have made significant progress. (Simonovsky and Komodakis 2018) proposes a VAE-based graph generation model that learns to translate a latent continuous vector to a graph, generating a graph matrix at once. However, different node orderings lead to different graph matrices for the same graph structure, making the learning process difficult. In (Li et al. 2018), a message passing method is introduced to express probabilistic dependencies between nodes and edges within a graph, but the messages pass on every single edge, leading to a complex training process. GraphRNN (You et al. 2018) proposes a hierarchical RNN framework to generate nodes and edges alternately; it also proposes a BFS node ordering scheme to improve scalability. To speed up generation, GRAN (Liao et al. 2019) employs an efficient framework that generates one block of edge connections between nodes at a time. Inspired by GRAN (Liao et al. 2019), we propose a graph generation stream in our framework to learn the edge distribution between existing scene building units and the new units.
The Dataset
We collected nearly 150K residential garden plans designed by players from a popular online game, which provides a large grid-based construction area (one grid = 64 pixels) and multiple building units of different sizes. Many players are novices to home design and landscaping, or they simply do not want to spend time on it, so some home designs are more like a collection of random building units. Significant effort has been devoted to cleaning up the dataset. We first rendered all the designs into images and randomly picked 30K of them. Those images were sent to an annotation team consisting of trained professionals. Each image was labeled with one of five grades, and a ResNet50 was trained with those manually annotated labels. About another 30K designs that fell into the top three grades were automatically picked out by the trained model. Afterwards, the annotation team re-assessed the machine-graded designs, and we kept the ones labeled in the top three grades. After this processing, our dataset contains about 28K designs, with about 276 building units per sample on average. There are a number of different building unit types in total, including infrastructure units (e.g., walls, doors, etc.), architectural units, and one forbidden unit that can be any shape (i.e., the pool).

Relation Graph Extraction
We convert the scene into a directed relation graph $\mathcal{G} = (V, E)$. In this graph, the nodes $V$ denote scene units, each of which also has a spatial coordinate. The edges $E \subseteq V \times V$ represent the spatial relations between nodes.
Edges.
In order to encode the arrangement relations between the components, the spatial relationship is described with four direction types, i.e., front, back, right, left, and four distance types, next to, adjacent, proximal, distant, resulting in 16 spatial edge types in total. To model the geometric relationship between units in more detail, we also detect six edge alignment attributes, namely left side, vertical center, right side, top side, horizontal center, and bottom side. To extract spatial edges for node $v_i$, we first raycast from the four sides of its oriented bounding box on the $xy$ plane, and then detect intersections with other nodes. For an intersecting node, an edge is added to the graph from $v_i$ to that node if the node is sufficiently visible from $v_i$ on one side, and the directions are defined in the coordinate frame of node $v_i$. We set the distance label based on the distance between the two nodes' oriented bounding boxes: next to if the distance is 0, and adjacent, proximal, and distant for successively larger distance ranges separated by fixed thresholds. The alignment attributes are added if there is an edge connecting two nodes. For clarity, we only show one edge between two nodes; in fact, once one edge is detected between two nodes, we add another edge between them (opposite direction, same distance).
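To make the edge typing concrete, here is a minimal Python sketch of the direction/distance labeling. The distance thresholds T1 and T2 are placeholders (the paper's exact values did not survive extraction), so the numbers below are illustrative only.

```python
# Sketch of the 4-direction x 4-distance spatial edge typing described above.
T1, T2 = 4.0, 16.0  # hypothetical thresholds (in grid units); not the paper's values

DIRECTIONS = ("front", "back", "right", "left")
DISTANCES = ("next to", "adjacent", "proximal", "distant")

def distance_type(d: float) -> str:
    """Map a bounding-box distance to one of the four distance labels."""
    if d == 0:
        return "next to"
    if d <= T1:
        return "adjacent"
    if d <= T2:
        return "proximal"
    return "distant"

def edge_type(direction: str, d: float) -> int:
    """Combine direction and distance into one of the 16 spatial edge types."""
    return DIRECTIONS.index(direction) * len(DISTANCES) + DISTANCES.index(distance_type(d))

# Example: a unit 2 grid units in front of another
print(edge_type("front", 2.0))  # -> index of ("front", "adjacent")
```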
Nodes.

An obvious strategy is to represent each building unit as a node in the relation graph. However, as one layout design in our dataset contains about 276 different units (most of them infrastructure units), doing so leads to an over-complicated graph. To this end, we simplify the relation graph by merging multiple infrastructure units into one node. Two units can be merged if they satisfy all of the following conditions: 1) they are in the same category and have the same orientation; 2) they have the same height and are aligned in the $x$-axis, or the same width and are aligned in the $y$-axis; 3) they are next to each other and are completely visible to each other. A sketch of this merge test is given below. After merging, the number of nodes in the graph is reduced to an average of 63, with the primary information preserved.
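The following is a small sketch of the merge test, assuming a hypothetical Unit container; the ray-based visibility and adjacency checks are abstracted into boolean arguments rather than reproduced.

```python
# Sketch of the three merging conditions for two infrastructure units.
from dataclasses import dataclass

@dataclass
class Unit:  # hypothetical container for a building unit's pose and extent
    category: str
    orientation: int   # e.g., 0/90/180/270 degrees
    x: float
    y: float
    w: float
    h: float

def can_merge(a: Unit, b: Unit, next_to: bool, fully_visible: bool) -> bool:
    """True if two units satisfy all three merging conditions."""
    same_kind = a.category == b.category and a.orientation == b.orientation  # cond. 1
    aligned = (a.h == b.h and a.y == b.y) or (a.w == b.w and a.x == b.x)     # cond. 2
    return same_kind and aligned and next_to and fully_visible               # cond. 3
```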
Attributes.

We assign attribute vectors to each graph node and edge to encode the geometric/semantic information of the corresponding scene. Specifically, for a node $v_i \in V$, its attribute vector is defined as $\tilde{v}_i = [l_i^T, o_i^T]$, where $l_i \in \mathbb{R}^{|D|}$ is the one-hot encoded vector of the label, $|D|$ is the number of unit labels, and $o_i$ encodes the oriented bounding box of the unit on the $xy$ plane. For an edge $e_k \in E$, its attribute vector is defined as $\tilde{e}_k = [t_k^T, d_k^T, m_k^T]$, in which $t_k \in \mathbb{R}^{16}$ is the one-hot encoded vector of the edge type (16 edge types in total), $d_k \in \mathbb{R}$ is the distance between the two nodes, and $m_k \in \mathbb{R}^6$ is the alignment vector of the edge.
Top-down View Representation

We convert the 3D residential home design into a 2D layout with a top-down orthographic depth render, as sketched below. Doing so brings several benefits. First, since a forbidden area in the design can be of any shape, it is difficult to represent it as a node in the graph; rendering it into a spatial image instead provides detailed shape information to the network. Second, although the design is in 3D, most building units are arranged in 2D, and the top-view rendering better reveals the spatial outline of the design. Following (Wang et al. 2018), this rendering maps the grid-based site area to a fixed-resolution image.
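Below is a minimal NumPy sketch of such a top-down orthographic depth render; the output resolution RES is an assumption, since the paper's exact image size was lost in extraction, and each unit is rasterized as an axis-aligned box of its height.

```python
# Sketch: rasterize building units into a top-down orthographic depth image.
import numpy as np

RES = 256  # assumed output resolution

def render_depth(units, site_size: float, res: int = RES) -> np.ndarray:
    """units: list of (x0, y0, x1, y1, height) in site coordinates."""
    img = np.zeros((res, res), dtype=np.float32)
    scale = res / site_size
    for x0, y0, x1, y1, h in units:
        c0, r0 = int(x0 * scale), int(y0 * scale)
        c1, r1 = int(x1 * scale), int(y1 * scale)
        region = img[r0:r1, c0:c1]
        np.maximum(region, h, out=region)  # keep the tallest surface per pixel
    return img

depth = render_depth([(2, 2, 6, 10, 3.0), (8, 4, 12, 8, 1.5)], site_size=64.0)
print(depth.shape, depth.max())  # (256, 256) 3.0
```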
Our Method

We propose a visual context-aware graph generation model to learn the edge distribution for the possible building. Our model consists of two streams: one is a ConvNet that learns detailed semantic information of each unit from the rendered image; the other is a graph generation network that takes as input the relation graph, fuses the visual clues learned from the ConvNet, and outputs the edge distribution for the possible building unit based on the existing graph.
Visual Context Extraction
We extract transformed visual features from the rendered images with a ConvNet. It is known that low-level features from a ConvNet characterize the details of local regions, while high-level features represent the global structural information of the input image. In our framework, we produce the transformed visual features using an FPN-based object detector (Lin et al. 2017) in a multi-stage manner. We crop the visual feature of each building unit from the feature pyramids $\{C_2, C_3, C_4, C_5\}$ through the ROIAlign layer (He et al. 2017). Each cropped visual feature is then transformed into a fixed-dimensional visual clue through a convolutional block and finally integrated into the corresponding node features in the graph relation learning network to make the learning process visual context-aware (Figure 2). The convolutional block is a Conv-BN-ReLU block.
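A sketch of this step in PyTorch, using torchvision's roi_align: the channel widths, the 7x7 crop size, and the spatial-average pooling at the end are assumptions rather than the paper's exact settings.

```python
# Sketch: crop each unit's region from an FPN feature map with ROIAlign,
# then project it to a fixed-size visual clue with a Conv-BN-ReLU block.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

FEAT_C, CLUE_C = 256, 64  # assumed channel widths

class ClueHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(  # the Conv-BN-ReLU block from the paper
            nn.Conv2d(FEAT_C, CLUE_C, kernel_size=3, padding=1),
            nn.BatchNorm2d(CLUE_C),
            nn.ReLU(inplace=True),
        )

    def forward(self, fpn_feat, boxes, spatial_scale):
        # boxes: (N, 5) tensor of (batch_idx, x0, y0, x1, y1) in image coords
        crops = roi_align(fpn_feat, boxes, output_size=(7, 7),
                          spatial_scale=spatial_scale, aligned=True)
        return self.block(crops).mean(dim=(2, 3))  # (N, CLUE_C) visual clues

head = ClueHead()
feat = torch.randn(1, FEAT_C, 64, 64)            # one FPN level at stride 4
boxes = torch.tensor([[0, 10., 10., 50., 40.]])  # one unit's bounding box
print(head(feat, boxes, spatial_scale=0.25).shape)  # torch.Size([1, 64])
```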
Figure 2: An overview of our proposed visual context-aware attentive message passing for the $r$-th round. This is a toy example with five building units and four edges to illustrate a single message passing iteration ("large bld": "large building").

Context-Aware Global Relation Learning
With the extracted visual clues and the relation graph as input, our global relation learning model outputs the edge distribution between the existing nodes and the possible node. Inspired by (Liao et al. 2019), we encode the edges of the relation graph $\mathcal{G} = (V, E)$ with a label-weighted adjacency matrix $A$. For each edge $(i, j) \in E$, $A_{ij} = t_{ij}$, where $t_{ij} \in T$ is the edge label and $T$ is the edge type set. Each row vector $a_i \in A$ is interpreted as a connectivity feature of node $v_i$, representing the connected relations between $v_i$ and the other nodes in the graph. We learn an edge distribution $P(a_{|V|+1} \mid \mathcal{G})$, which samples connectivity features of relations between the new node and the existing nodes in the graph. In our experiments, we only model the edge distribution from the previous nodes to the new node, from which the opposite relations can easily be inferred. In the following, we describe how the edge distribution is learned in detail. More implementation details of the network structure are provided in the supplementary material.

Graph Node Initialization.
We first translate the adjacency matrix $A$ into a one-hot matrix $\tilde{A}$ of size $|V| \times (|T|+1) \times |V|$ with $\tilde{A}[i, A_{ij}, j] = 1$. All the node connectivity features $\tilde{a}_i \in \tilde{A}$ are padded with zeros to the maximum dimension of the adjacency matrix over the whole dataset (443 in our dataset). Together with the node attribute vector $\tilde{v}_i$, the node representation is initialized as
$$h_i = f_{\text{init}}(\tilde{a}_i, \tilde{v}_i; W_{\text{init}}), \quad (1)$$
where $f_{\text{init}}$ is a stacked 1D convolutional block transforming the raw connectivity features $\tilde{a}_i$ into latent embeddings, followed by a one-layer MLP that takes as input the embeddings and the node attribute vector $\tilde{v}_i$ and outputs a node representation $h_i \in \mathbb{R}^I$, where $I$ is the dimension of the node representation. For the new node, we set $h_{|V|+1} = \mathbf{0} \in \mathbb{R}^I$.
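A sketch of Eq. (1) in PyTorch: the padding size 443 and the 16+1 edge-type channels follow the text, while the attribute dimension, hidden widths, and the exact arrangement of the stacked 1D convolutions are assumptions (the paper defers the architecture to its supplementary material).

```python
# Sketch: initialize node representations from padded one-hot connectivity
# rows and node attribute vectors, per Eq. (1).
import torch
import torch.nn as nn

MAX_N, NUM_ETYPES, ATTR_DIM, I = 443, 17, 21, 128  # ATTR_DIM and I are assumed

class NodeInit(nn.Module):
    def __init__(self):
        super().__init__()
        # f_init: stacked 1D convolutions over each connectivity row
        self.f_init = nn.Sequential(
            nn.Conv1d(NUM_ETYPES, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 8, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.mlp = nn.Linear(8 * MAX_N + ATTR_DIM, I)  # one-layer MLP

    def forward(self, a_onehot, v_attr):
        # a_onehot: (N, NUM_ETYPES, MAX_N) one-hot connectivity, zero-padded
        # v_attr:   (N, ATTR_DIM) node attribute vectors
        emb = self.f_init(a_onehot).flatten(1)
        return self.mlp(torch.cat([emb, v_attr], dim=-1))  # (N, I)

h = NodeInit()(torch.zeros(5, NUM_ETYPES, MAX_N), torch.zeros(5, ATTR_DIM))
print(h.shape)  # torch.Size([5, 128]); the new node gets h = 0 in R^I
```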
Edge Masked Attentive Message Propagation.

With the node features (the node representations and the corresponding visual clues) and the associated attribute vectors, stacked edge-masked attentive message propagation blocks propagate messages and update the state of the node representations. For the new node, we assume that it is connected to all existing nodes with an unknown label. At the $r$-th step, we first compute the visual-semantics-augmented node representations for all graph nodes:
$$h_i^{r\prime} = f_{\text{ctx}}(h_i^r, c_i^r; W_{\text{ctx}}), \quad (2)$$
where $h_i^r$ is the node representation and $c_i^r$ is the corresponding cropped visual clue. $f_{\text{ctx}}$ is a two-layer MLP with learnable parameters $W_{\text{ctx}}$ that makes the output node representations aware of the corresponding visual clues.

To propagate messages and update the node representations, the multi-head attention mechanism (Veličković et al. 2018) is used to weight different messages for different nodes:
$$m_{ij}^{rk} = f_{\text{msg}}^{rk}(h_i^{r\prime}, h_j^{r\prime}, \tilde{e}_k; W_{\text{msg}}^{rk}), \quad (3)$$
$$ma_{ij}^{rk} = f_{\text{att}}^{rk}(h_i^{r\prime}, h_j^{r\prime}, \tilde{e}_k; W_{\text{att}}^{rk}), \quad (4)$$
$$att_{ij}^{rk} = \frac{\exp(ma_{ij}^{rk})}{\sum_{l \in \mathcal{N}(i)} \exp(ma_{il}^{rk})}, \quad (5)$$
$$h_i^{r+1} = f_{\text{GRU}}^{r}\Big(h_i^r,\; \big\|_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} att_{ij}^{rk}\, m_{ij}^{rk};\; w_{\text{GRU}}^{r}\Big). \quad (6)$$
Here, $K$ indicates that we use $K$ different attention mechanisms to transform the messages flowing on the edges. In the $k$-th attention mechanism, we first compute the message $m_{ij}^{rk}$ for all triplets $[v_i, e_k, v_j]$ (where $v_i$ and $v_j$ are the two nodes of edge $e_k$) according to Eq. (3). Edge-masked self-attention weights on the messages are then obtained (according to Eqs. (4) and (5)) to compute a linear combination of the messages for each node. Finally, the graph node representations are updated with the concatenation of the $K$ different message combinations from the $K$ attention mechanisms (according to Eq. (6)). In our experiments, $f_{\text{msg}}^{rk}$ is a two-layer MLP with learnable parameters $W_{\text{msg}}^{rk}$, and $f_{\text{att}}^{rk}$ is implemented as a single-layer feed-forward neural network followed by a ReLU nonlinearity. $\mathcal{N}(i)$ denotes the neighboring nodes of each node $i$, and $w_{\text{GRU}}^{r}$ are the learnable parameters of the GRU. We show an example of this process in Figure 2.
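Here is a single-head sketch of Eqs. (2)-(6) in PyTorch. All dimensions are assumptions, and a K-head version would concatenate the K aggregated messages before the GRU update.

```python
# Sketch: one round of visual context-aware attentive message passing.
import torch
import torch.nn as nn

I, C, E = 128, 64, 23  # node dim, clue dim, edge-attribute dim (assumed)

class MessageRound(nn.Module):
    def __init__(self):
        super().__init__()
        self.f_ctx = nn.Sequential(nn.Linear(I + C, I), nn.ReLU(), nn.Linear(I, I))
        self.f_msg = nn.Sequential(nn.Linear(2 * I + E, I), nn.ReLU(), nn.Linear(I, I))
        self.f_att = nn.Sequential(nn.Linear(2 * I + E, 1), nn.ReLU())
        self.gru = nn.GRUCell(I, I)

    def forward(self, h, clues, edges, e_attr):
        # h: (N, I) states; clues: (N, C); edges: (M, 2) index pairs; e_attr: (M, E)
        hp = self.f_ctx(torch.cat([h, clues], -1))             # Eq. (2)
        src, dst = edges[:, 0], edges[:, 1]
        pair = torch.cat([hp[src], hp[dst], e_attr], -1)
        m = self.f_msg(pair)                                    # Eq. (3)
        logit = self.f_att(pair).squeeze(-1)                    # Eq. (4)
        # Eq. (5): softmax of logits over each destination node's incoming edges
        w = torch.exp(logit)
        denom = torch.zeros(h.size(0)).index_add_(0, dst, w)
        att = (w / denom[dst]).unsqueeze(-1)
        agg = torch.zeros_like(h).index_add_(0, dst, att * m)
        return self.gru(agg, h)                                 # Eq. (6)

rnd = MessageRound()
h = rnd(torch.randn(5, I), torch.randn(5, C),
        torch.tensor([[0, 1], [1, 0], [2, 1]]), torch.randn(3, E))
print(h.shape)  # torch.Size([5, 128])
```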
Edge Distribution Modelling.

After $R$ steps of message propagation, we obtain the final node representation $h_i^R$ for each node $i$ and compute the raw messages from the existing graph to the new node, $m_{i,|V|+1}^{R} = [h_i^R, h_{|V|+1}^R]$. We model the edge distribution from the existing nodes to the new node, $P(a_{|V|+1} \mid \mathcal{G})$, with a mixture-of-categoricals model based on the raw messages:
$$P(a_{|V|+1} \mid \mathcal{G}) = \sum_{s=1}^{S} \alpha_s \prod_{1 \le j \le |V|} \theta_{s,j,|V|+1}, \quad (7)$$
$$\alpha = \text{Softmax}\Big(\sum_{1 \le j \le |V|} f_\alpha(m_{j,|V|+1}^{R}; W_\alpha)\Big), \quad (8)$$
$$\theta = \text{Sigmoid}\big(f_\theta(m_{j,|V|+1}^{R}; W_\theta)\big), \quad (9)$$
where $S$ is the number of mixture components, $\alpha$ is the $S$-dimensional vector of mixture coefficients, and $\theta$ contains the learned edge probabilities of the different mixture components. Both $f_\alpha$ and $f_\theta$ are implemented as two-layer MLPs, and $W_\alpha$ and $W_\theta$ are their learnable parameters. The mixture distribution provides an efficient way to capture dependence in the output distribution through the latent mixture components.
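A sketch of the mixture head, Eqs. (7)-(9): following the sigmoid in Eq. (9), connectivity is treated as binary here; the typed-edge case would replace the sigmoid with a per-type softmax. S and the layer widths are assumptions.

```python
# Sketch: mixture edge-distribution head over messages to the new node.
import torch
import torch.nn as nn

I, S = 128, 20  # node dim and number of mixture components (assumed)

class EdgeHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.f_alpha = nn.Sequential(nn.Linear(2 * I, I), nn.ReLU(), nn.Linear(I, S))
        self.f_theta = nn.Sequential(nn.Linear(2 * I, I), nn.ReLU(), nn.Linear(I, S))

    def forward(self, h, h_new):
        # h: (N, I) final states of existing nodes; h_new: (I,) the new node
        m = torch.cat([h, h_new.expand_as(h)], -1)         # raw messages m_{j,|V|+1}
        alpha = torch.softmax(self.f_alpha(m).sum(0), -1)  # Eq. (8): (S,)
        theta = torch.sigmoid(self.f_theta(m))             # Eq. (9): (N, S)
        return alpha, theta

    def log_prob(self, alpha, theta, a):
        # Eq. (7): log P(a | G) for binary connectivity a: (N,)
        per_comp = (torch.log(theta + 1e-9) * a.unsqueeze(-1)
                    + torch.log(1 - theta + 1e-9) * (1 - a).unsqueeze(-1)).sum(0)
        return torch.logsumexp(torch.log(alpha + 1e-9) + per_comp, dim=0)

head = EdgeHead()
alpha, theta = head(torch.randn(6, I), torch.zeros(I))
print(head.log_prob(alpha, theta, torch.randint(0, 2, (6,)).float()))
```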
Losses.

To learn the edge distribution from the existing nodes to the new node, we define the objective as the negative log posterior probability of the mixture model:
$$L_o = -\sum_{z=1}^{Z} \log P(a_{z,|V|+1} \mid \mathcal{G}_z), \quad (10)$$
where $Z$ is the batch size. To encourage the graph generation network to perceive the global visual semantics, we add a global graph-image matching loss that encourages a high matching score for each paired graph and image. We obtain the two global features $\tilde{v}_G^r$ and $\tilde{v}_I^r$ by averaging the corresponding node features, for simplicity. The matching score is defined as a cosine similarity:
$$R(\mathcal{G}_z^r, I_z^r) = \frac{(\tilde{v}_G^r)^T \tilde{v}_I^r}{\|\tilde{v}_G^r\| \cdot \|\tilde{v}_I^r\|}. \quad (11)$$
Similar to (Xu et al. 2018), for a batch of graph-image pairs $\{(\mathcal{G}_z, I_z)\}_{z=1}^{Z}$, the posterior probability of image $I_z$ matching graph $\mathcal{G}_z$ is computed as
$$P(\mathcal{G}_z^r \mid I_z^r) = \frac{\exp(\gamma R(\mathcal{G}_z^r, I_z^r))}{\sum_{b=1}^{Z} \exp(\gamma R(\mathcal{G}_z^r, I_b^r))}, \quad (12)$$
and the paired symmetric loss is defined as the negative posterior probability:
$$L_m^r = -\sum_{z=1}^{Z} \log P(\mathcal{G}_z^r \mid I_z^r) - \sum_{z=1}^{Z} \log P(I_z^r \mid \mathcal{G}_z^r). \quad (13)$$
Finally, the objective function of our model is
$$L = L_o + \sum_{r=1}^{R} L_m^r. \quad (14)$$
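A sketch of Eqs. (11)-(13) in PyTorch; the temperature gamma and the feature width are assumptions. Each loss term is the standard cross-entropy over batch pairings, with matching pairs on the diagonal of the score matrix.

```python
# Sketch: symmetric global graph-image matching loss over a batch of pairs.
import torch
import torch.nn.functional as F

def matching_loss(v_graph, v_image, gamma: float = 10.0):
    # v_graph, v_image: (Z, D) global features, one pair per batch element
    g = F.normalize(v_graph, dim=-1)
    i = F.normalize(v_image, dim=-1)
    scores = gamma * g @ i.t()                # gamma * R(G_z, I_b) for all pairs
    target = torch.arange(g.size(0))          # matching pairs on the diagonal
    # Eq. (13): -log P(I|G) - log P(G|I), summed over the batch
    return F.cross_entropy(scores, target, reduction="sum") \
         + F.cross_entropy(scores.t(), target, reduction="sum")

loss = matching_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```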
Figure 3: Visual examples of the discrete labels for evaluation: the current scene, the edges to all to-be-located buildings, the visualized heatmaps for the edges, and the thresholded and smoothed heatmaps.
Implementation Details
In our implementation, we first extract relation graphs and rendered images from the unstructured sites. The ConvNet that extracts the visual clues for each component is implemented based on Detectron2, with ResNet50 as the backbone and a fixed set of five anchor aspect ratios. This detection model is pretrained on our rendered scene images, and its parameters are frozen in the following training phases. The cropped features are then transformed into fixed-size visual clues by the convolutional block. For the graph generation network, we first learn the initial node representations and then update them for several rounds, together with the corresponding visual clues, through the stack of edge-masked attentive message propagation blocks. In each message passing block, we first compute the messages and then concatenate the 4-head attention outputs to update the node representations. We add the global graph-image matching loss at every round of message passing. To model the latent dependencies between edges, we use a mixture model in the edge distribution head. During the training phase, we choose the Adam solver for optimization. The model is trained on 4 TitanX 2080 GPUs.

Experiments
We have systematically tested the proposed method. We use 22.4K designs as the training set, and the remaining 5.6K designs are used for testing.
Visualization.
In order to evaluate our experimental results intuitively, we first convert the discrete edges from the existing scene graph to the target node into heatmaps, both for the testing set and for the predictions produced as described above. Since our purpose is to recommend a location for the possible building unit, we set a default target unit size (24 pixels on a side) when visualizing edges for both the ground-truth testing data and the predictions. During the visualization process, for the edge set $\{e\}_t$ from the current scene to component $t$, the probability of the location that each edge points to is set as $1 / |\{e\}_t|$. The final heatmap is the sum of the probability values of all the locations implied by the corresponding edges, normalized to a maximum value of 1. In our experiment for the perceptual study, we only keep the areas of the heatmap whose probability value is greater than a threshold and smooth them with a Gaussian kernel (kernel size = 5), which better indicates our recommendation; a sketch of this procedure follows. Several examples are shown in Figure 3.
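The following sketch turns a set of edge-implied positions into a display heatmap. The conversion from an edge's direction/distance label to a target position is abstracted away; the probability threshold (lost in extraction) is assumed to be 0.3, and the paper's kernel size of 5 is mapped to a Gaussian sigma of 5 for simplicity.

```python
# Sketch: each edge votes 1/|{e}_t| for its implied position; votes are
# summed, normalized, thresholded, and smoothed for display.
import numpy as np
from scipy.ndimage import gaussian_filter

def heatmap_from_edges(positions, res=256, unit=24, thresh=0.3):
    """positions: list of (row, col) corners implied by the predicted edges."""
    hm = np.zeros((res, res), dtype=np.float32)
    p = 1.0 / max(len(positions), 1)      # each edge contributes 1/|{e}_t|
    for r, c in positions:
        hm[r:r + unit, c:c + unit] += p   # default target footprint
    hm /= max(hm.max(), 1e-9)             # normalize the peak to 1
    hm[hm < thresh] = 0.0                 # keep only confident areas
    return gaussian_filter(hm, sigma=5)   # smooth for display

hm = heatmap_from_edges([(100, 120), (104, 118), (102, 124)])
print(hm.shape, round(float(hm.max()), 3))
```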
Comparisons.

We compare our method with PlanIT (Wang et al. 2019) and FastSynth (Ritchie, Wang, and Lin 2019). While those two methods were originally designed for indoor scenes, they are quite relevant to our method. For a fair comparison, we implement the partial graph completion of PlanIT and add only one unit per scene during the test. For (Ritchie, Wang, and Lin 2019), we implement the object location module and choose the predicted location map indicated by the ground-truth label. The heatmaps generated by our method are visualized from discrete edges, which is more coarse-grained than a pixel-level prediction. Therefore, for a fair comparison we enlarge the areas of the positions whose probability value is higher than the mean probability value: each such position is enlarged to a region 24 pixels on a side centered on itself, and the heatmap is then thresholded and smoothed as in the visualization method above. We also compare our method with two degraded variants. The first variant is Baseline: we implement the GRAN model (Liao et al. 2019) with 5 GNN layers and test it on our dataset; in this model, the relation graph is the only input. The second variant is LocRec(w/o): the proposed model trained without the global graph-image matching loss. We denote our full model as LocRec in the benchmark reports.
Quality Metrics.
We provide two types of metrics to evaluate the quality of the predicted location maps. First, we define two criteria to evaluate the visualized heatmaps. For a ground-truth heatmap $ht_r$ and the corresponding predicted heatmap $ht_p$, we first compute the mask $m$ of the intersection of their two non-zero areas; f1 scores on both the areas and the probabilities are then calculated to evaluate the results. The recall and precision scores of the area are defined as $ar = \sum m / \sum (ht_r > 0)$ and $ap = \sum m / \sum (ht_p > 0)$, and the final score is $\text{f1 score}_{area} = 2 / (ap^{-1} + ar^{-1})$. For the f1 score on probabilities, the recall and precision are defined as $pr = \sum \min(ht_r[m], ht_p[m]) / \sum ht_r$ and $pp = \sum \min(ht_r[m], ht_p[m]) / \sum ht_p$. A high f1 score on area indicates that our model can effectively recommend the location for the new building units, while a high f1 score on probabilities indicates that our recommended location is compact and has clear guiding significance. A sketch of these metrics follows Table 1.

Since we aim at prompting players where to place new units, we also conducted a ranking-choice perceptual study to rank the results generated by the different methods. We provided 60 questions and invited 45 participants to rank the different results in each question. We posed two questions to each participant: "does this heatmap clearly specify a location?" and "are you willing to place a building at this location?". Participants were asked to rank the results based on their answers. We define five levels to quantify the results, and participants are not allowed to give the same rank to different results in the same question. Rank5 represents the best, while Rank1 represents the worst. A detailed questionnaire is provided in the supplementary material.

Method | ar | ap | pr | pp | f1s_a | f1s_p
LocRec | - | - | - | - | - | -
LocRec(w/o) | - | - | - | - | - | -
Baseline | - | - | - | - | - | -
PlanIT | - | - | - | - | - | -
FastSynth | - | - | - | - | - | -

Table 1: The quantitative scores on the testing dataset for the different methods. f1s_a: f1 score_area; f1s_p: f1 score_probability.
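Here is a NumPy sketch of the two f1 scores defined above, with small epsilons added only to guard against empty heatmaps.

```python
# Sketch: area and probability f1 scores between ground-truth and predicted heatmaps.
import numpy as np

def f1_scores(ht_r: np.ndarray, ht_p: np.ndarray):
    m = (ht_r > 0) & (ht_p > 0)                  # intersection of non-zero areas
    ar = m.sum() / max((ht_r > 0).sum(), 1)      # area recall
    ap = m.sum() / max((ht_p > 0).sum(), 1)      # area precision
    f1_area = 2.0 / (1.0 / max(ap, 1e-9) + 1.0 / max(ar, 1e-9))
    overlap = np.minimum(ht_r[m], ht_p[m]).sum()
    pr = overlap / max(ht_r.sum(), 1e-9)         # probability recall
    pp = overlap / max(ht_p.sum(), 1e-9)         # probability precision
    f1_prob = 2.0 / (1.0 / max(pp, 1e-9) + 1.0 / max(pr, 1e-9))
    return f1_area, f1_prob

a = np.zeros((8, 8)); a[2:5, 2:5] = 1.0
b = np.zeros((8, 8)); b[3:6, 3:6] = 0.8
print(f1_scores(a, b))
```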
Experimental Results.

Figure 1 shows a location map predicted by our method, where the heatmap is directly visualized from the predicted edges. We observe that the results generated by visualizing edges occupy a relatively large area, but the central location with high probability is still obvious, which gives good guidance. We report more results in Figure 4 (and more in the supplementary material), where the background is the input scene and the corresponding predicted location map is superimposed on the background as a heatmap. To more clearly point out the locations generated by the different methods, we show the results after thresholding and smoothing in the figure. For the results generated by our method (i.e., LocRec), we also plot the generated edges. We observe that our model can effectively learn the relations contained in the scenes, and the edge set predicted from the learned distribution has high consistency and seldom points to multiple areas at the same time, which ensures the stability of our results.

We find that the locations generated by our method are accurate (only one peak area exists in the heatmap, and the area with large probability values is very compact) and collision-avoiding, which is instructive and meaningful for players locating new building units. We also find that our predictions reasonably avoid the forbidden areas and are harmonious with the existing scenes. The corresponding quantitative results are shown in Table 1. It can be seen that our generated locations hit the ground-truth results in the testing dataset in most cases (with f1 score_area around 62 and f1 score_prob around 38), which shows that our approach can effectively model the spatial rules among the units in the scene. The results of the perceptual study (Table 2) also confirm that the players accept our recommendation results and are willing to place buildings in such locations in most cases (64%). The visualization results of PlanIT are given in Figure 4; the quantitative results and the perceptual study results are provided in Tables 1 and 2.
Figure 4: Location predictions using different algorithms on our testing dataset. Columns: input scenes, ground truth, predicted edges, LocRec, LocRec(w/o), Baseline, PlanIT, FastSynth.
Method | Rank1 | Rank2 | Rank3 | Rank4 | Rank5
LocRec | - | - | - | - | -
LocRec(w/o) | - | - | - | - | -
Baseline | - | - | - | - | -
PlanIT | - | - | - | - | -
FastSynth | - | - | - | - | -

Table 2: The resulting scores of the perceptual study for the different methods.

Compared with FastSynth, PlanIT gives a more reasonable location, which verifies that the constraint of the extrinsic relation graphs is more conducive to recommending a reasonable location. Because PlanIT relies on the local relationships of the current unit when recommending locations, it is more difficult for it to learn the global relations between units in the scene, and the resulting recommended locations are much worse than our results. From the perceptual study results, we observe that players are more satisfied with the locations recommended by LocRec than by PlanIT. This is because the construction site has a large space, so the locations and probability values recommended by PlanIT can be scattered; our method is based on global relations and leads to consistent recommendations. The $pr$ score in Table 1 also reflects this fact.

We also provide the visualization results of our method and its variants in Figure 4; the corresponding benchmarks are shown in Tables 1 and 2. As one can see, our baseline model is effective. Compared to FastSynth, which only inputs visual semantics, an algorithm based on the relation graph is more conducive to learning a compact location, even though it may conflict with other building units in the scene. After adding the visual clues, the learned locations become more effective. Since LocRec(w/o) does not integrate the global graph-image matching loss into the learning of the edge distribution, its resulting locations cannot effectively avoid the forbidden areas (e.g., pools). With the local visual clues and the global graph-image matching loss in the learning process, our full model can effectively capture the detailed and global structure of the input scene, resulting in the best location predictions.
Conclusion
We propose an effective location recommendation method based on a visual context-aware graph generation network. This network learns the global relations between the building units. To integrate the visual clues into the learning process, a global graph-image matching loss is also designed to enable awareness of the scene geometry during graph generation. The experimental results show that our method can generate instructive and meaningful locations at which to place the possible units. Currently, our work focuses on recommending one location for the next building unit. In practice, it would be more convenient to recommend multiple choices for different units collectively, which clearly offers more options to the user during customization. However, more building units entail more flexibility and ambiguity during learning. In the future, we plan to investigate possible solutions to this problem. Besides, quantitative measurement of uncertainty during learning is also worth exploring.

References
Bollobás, B.; and Béla, B. 2001. Random Graphs, volume 73. Cambridge University Press.
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2961-2969.
Hu, R.; Huang, Z.; Tang, Y.; van Kaick, O.; Zhang, H.; and Huang, H. 2020. Graph2Plan: Learning Floorplan Generation from Layout Graphs. arXiv preprint arXiv:2004.13204.
Kingma, D. P.; and Welling, M. 2019. An Introduction to Variational Autoencoders.
Kobyzev, I.; Prince, S.; and Brubaker, M. A. 2019. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257.
Leskovec, J.; Chakrabarti, D.; Kleinberg, J.; Faloutsos, C.; and Ghahramani, Z. 2010. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research.
Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2007. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD).
Li, M.; Patil, A. G.; Xu, K.; Chaudhuri, S.; Khan, O.; Shamir, A.; Tu, C.; Chen, B.; Cohen-Or, D.; and Zhang, H. 2019. GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG).
Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; and Battaglia, P. 2018. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.
Liao, R.; Li, Y.; Song, Y.; Wang, S.; Hamilton, W.; Duvenaud, D. K.; Urtasun, R.; and Zemel, R. 2019. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, 4257-4267.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117-2125.
Liu, H. 2019. A Qualitative and Localized Evaluation for 3D Indoor Scene Synthesis. Ph.D. thesis, Applied Sciences: School of Computing Science.
Liu, J.; Kumar, A.; Ba, J.; Kiros, J.; and Swersky, K. 2019. Graph normalizing flows. In Advances in Neural Information Processing Systems, 13556-13566.
Margaritis, D. 2003. Learning Bayesian network model structure from data. Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.
Nauata, N.; Chang, K.-H.; Cheng, C.-Y.; Mori, G.; and Furukawa, Y. 2020. House-GAN: Relational Generative Adversarial Networks for Graph-constrained House Layout Generation. arXiv preprint arXiv:2003.06988.
Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; and Courville, A. C. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In AAAI.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
Ritchie, D.; Wang, K.; and Lin, Y.-A. 2019. Fast and flexible indoor scene synthesis via deep convolutional generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6182-6190.
Simonovsky, M.; and Komodakis, N. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, 412-422. Springer.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations.
Wang, K.; Lin, Y.-A.; Weissmann, B.; Savva, M.; Chang, A. X.; and Ritchie, D. 2019. PlanIT: Planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG).
Wang, K.; Savva, M.; Chang, A. X.; and Ritchie, D. 2018. Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG).
Wang, Z.; She, Q.; and Ward, T. E. 2019. Generative adversarial networks in computer vision: A survey and taxonomy. arXiv preprint arXiv:1906.01529.
Wu, W.; Fu, X.-M.; Tang, R.; Wang, Y.; Qi, Y.-H.; and Liu, L. 2019. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG).
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1316-1324.
You, J.; Ying, R.; Ren, X.; Hamilton, W. L.; and Leskovec, J. 2018. GraphRNN: Generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773.
Zhang, S.; Han, Z.; Lai, Y.-K.; Zwicker, M.; and Zhang, H. 2019. Stylistic scene enhancement GAN: mixed stylistic enhancement generation for 3D indoor scenes. The Visual Computer.
Zhang, Z.; Yang, Z.; Ma, C.; Luo, L.; Huth, A.; Vouga, E.; and Huang, Q. 2020. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG).