LayoutGMN: Neural Graph Matching for Structural Layout Similarity
Akshay Gadi Patil, Manyi Li, Matthew Fisher, Manolis Savva, Hao Zhang
Simon Fraser University, Adobe Research
Abstract
We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, using an attention-based GMN designed under a triplet network setting. To train our network, we utilize weak labels obtained by pixel-wise Intersection-over-Union (IoU) to define the triplet loss. Importantly, LayoutGMN is built with a structural bias which can effectively compensate for the lack of structure awareness in IoUs. We demonstrate this on two prominent forms of layouts, viz., floorplans and UI designs, via retrieval experiments on large-scale datasets. In particular, retrieval results by our network better match human judgement of structural layout similarity compared to both IoUs and other baselines, including a state-of-the-art method based on graph neural networks and image convolution. In addition, LayoutGMN is the first deep model to offer both metric learning of structural layout similarity and structural matching between layout elements.
1. Introduction
Two-dimensional layouts are ubiquitous visual abstractions in graphic and architectural designs. They typically represent blueprints or conceptual sketches for such data as floorplans, documents, scene arrangements, and UI designs. Recent advances in pattern analysis and synthesis have propelled the development of generative models for layouts [11, 25, 47, 15, 26] and led to a steady accumulation of relevant datasets [48, 42, 10, 46]. Despite these developments, however, there have been few attempts at employing a deeply learned metric to reason about layout data, e.g., for retrieval, data embedding, and evaluation. For example, current evaluation protocols for layout generation still rely heavily on segmentation metrics such as intersection-over-union (IoU) [15, 30] and human judgement [15, 26].

Figure 1. LayoutGMN learns a structural layout similarity metric between floorplans and other 2D layouts, through attention-based neural graph matching. The learned attention weights (numbers shown in the boxes) can be used to match the structural elements.

The ability to compare data effectively and efficiently is arguably the most foundational task in data analysis. The key challenge in comparing layouts is that it is not purely a task of visual comparison: it depends critically on inference and reasoning about structures, which are expressed by the semantics and organizational arrangements of the elements or subdivisions which compose a layout. Hence, none of the well-established image-space metrics, whether model-driven, perceptual, or deeply learned, are best suited to measure structural layout similarity. Frequently applied similarity measures for image segmentation, such as IoU and F1 scores, all perform pixel-level matching "in place"; they are not structural, and can be sensitive to element misalignments which are structure-preserving.

In this work, we develop a deep neural network to predict structural similarity between two 2D layouts, e.g., floorplans or UI designs. We take a predominantly structural view of layouts for both data representation and layout comparison. Specifically, we represent each layout using a directed, fully connected graph over its semantic elements. Our network learns structural layout similarity via neural graph matching, where an attention-based graph matching network [27] is designed under a triplet network setting. The network, coined LayoutGMN, takes as input a triplet of layout graphs, composed of one anchor-positive pair and one anchor-negative pair, and performs intra-graph message passing and cross-graph information communication per pair, to learn a graph embedding for layout similarity prediction. In addition to returning a metric, the attention weights learned by our network can also be used to match the layout elements; see Figure 1.

To train our triplet network, it is natural to consider human labeling of positive and negative samples. However, it is well known that subjective judgements by humans over structured data such as layouts are often unreliable, especially with non-experts [45, 2]. When domain experts are employed, the task becomes time-consuming and expensive [45, 2, 14, 9, 20, 41], and discrepancies among even these experts still remain [14]. In our work, we avoid this issue by resorting to weakly supervised training of LayoutGMN, which obtains positive and negative labels from the training data through thresholding on layout IoUs [30].

Figure 2. Structure matching in LayoutGMN "neutralizes" IoU feedback. In each example (left: floorplan; right: UI design), a training sample N labeled as "Negative" by IoU is more structurally similar to the anchor (A) than P, a "Positive" sample. With structure matching, our network predicts a smaller A-to-N distance than A-to-P distance in each case, which contradicts IoU.

The motivations behind using IoU for training are threefold, despite its shortcomings for structural matching. First, IoU does hold merit as one of the most widely used layout similarity measures [30, 15]. Second, IoU is objective and much cheaper to obtain compared to expert annotations. Finally, and most importantly, our network has a built-in inductive bias to enforce structural correspondence, via inter-graph information exchange, when learning the graph embeddings. The structural bias introduced can effectively compensate for the lack of structure awareness in the IoU-based triplet loss. In Figure 2, we illustrate the effect of the structural bias on the metric learned by our network.

We evaluate our network on retrieval tasks over large datasets of floorplans and UI designs, via Precision@k scores, and investigate the stability of the proposed metric by checking retrieval consistency and top-1 retrieved results. Overall, retrieval results by LayoutGMN better match human judgement of structural layout similarity compared to both IoUs and other baselines, including a state-of-the-art method based on graph neural networks [30]. Finally, we show a label transfer application for floorplans enabled by the structure matching learned by our network.
2. Related Work
Layout analysis.
Early works [18, 3] on document analysis involved primitive heuristics to analyse document structures. Organizing a large collection of such structures into meaningful clusters requires a distance measure between layouts, which typically involved content-based heuristics [34] for documents and constrained graph matching algorithms for floorplans [40]. An improved distance measure relied on rich layout representations obtained using autoencoders [7, 29] operating on an entire UI layout. Although such models capture rich raster properties of layout images, layout structures are not modeled, leading to noisy recommendations in contextual search over layout datasets.
Layout generation.
Early works on synthesizing 2D layouts relied on exemplars [16, 23, 37] and rule-based heuristics [33, 38], and were unable to capture complex element distributions. The advent of deep learning led to generative models of layouts for floorplans [42, 15, 5, 32], documents [25, 11, 47], and UIs [7, 6]. Perceptual studies aside, evaluation of generated layouts, in terms of diversity and generalization, has mostly revolved around IoUs of the constituent semantic entities [25, 11, 15]. While IoU provides a visual similarity measure, it is expensive to compute over a large number of semantic entities, and is sensitive to element positions within a layout. A tool for structural comparison would complement visual features in contextual similarity search. In particular, a learning-based method that compares layouts structurally can prove useful in tasks such as layout correspondence, component labeling, and layout retargeting. We present a Layout Graph Matching Network, called LayoutGMN, for learning to compare two graphical layouts in a structured manner.
Structural similarity in 3D.
Fisher et al. [8] develop Graph Kernels for characterizing structural relationships in 3D indoor scenes. Indoor scenes are represented as graphs, and the Graph Kernel compares substructures in the graphs to capture similarity between the corresponding scenes. The challenging problem of organizing a heterogeneous collection of such 3D indoor scenes was tackled in [43] by focusing on a subscene and using it as a reference point for distance measures between two scenes. Shape Edit Distance (SHED) [22] is another fine-grained sub-structure similarity measure for comparing two 3D shapes. These works provide valuable cues on developing an effective structural metric for layout similarity. Graph Neural Networks (GNNs) [28, 21, 4, 36] model node dependencies in a graph via message passing, and are well suited to learning on structured data. GNNs provide coarse-level graph embeddings which, although useful for many tasks [39, 1, 17, 19], can lose useful structural information in contextual search if each graph is processed in isolation. We make use of the Graph Matching Network [27] to retain structural correspondence between layout elements.
GNNs for structural layout similarity.
To the best of our knowledge, the recent work by Manandhar et al. [30] is the first to leverage GNNs to learn structural similarity of 2D graphical layouts, focusing on UI layouts with rectangular boundaries. They employ a GCN-CNN architecture on a graph of UI layout images, also under an IoU-trained triplet network [13], but obtain the graph embeddings for the anchor, positive, and negative graphs independently.

In contrast, LayoutGMN learns the graph embeddings in a dependent manner. Through cross-graph information exchange, the embeddings are learned in the context of the anchor-positive (respectively, the anchor-negative) pair. This is a critical distinction from GCN-CNN [30], although both train their triplet networks using IoUs. However, since IoU does not involve structure matching, it is not a reliable measure of structural similarity, leading to labels which are considered "structurally incorrect"; see Figure 2.

In addition, our network does not perform any convolutional processing over layout images; it only involves eight MLPs, placing more emphasis on learning finer-scale structural variations for graph embedding, and less on image-space features. We clearly observe that the cross-graph communication module in our GMNs does help in learning finer graph embeddings than the GCN-CNN framework [30]. Finally, another advantage of moving away from any reliance on image alignment is that similarity predictions by our network are more robust against highly varied, non-rectangular layout boundaries, e.g., for floorplans.

Figure 3. Given an input floorplan image with room segmentations in (a), we abstract each room into a bounding box and obtain layout features from the constituent semantic elements, as shown in (b). These features form the initial node and edge features (Section 3.1) of the corresponding layout graph shown in (c).
3. Method
The Graph Matching Network (GMN) [27] consumes a pair of graphs, processes the graph interactions via an attention-based cross-graph communication mechanism, and produces graph embeddings for the two input graphs, as shown in Figure 4. Our LayoutGMN plugs the Graph Matching Network into a triplet backbone architecture for learning a (pseudo-)metric space for similarity on 2D layouts such as floorplans, UIs, and documents.
Figure 4. LayoutGMN takes two layout graphs as input, performs intra-graph message passing (Eq. 2), along with cross-graph information exchange (Eq. 3) via an attention mechanism (Eq. 5, also visualized in Figure 1), to update node features, from which the final graph embeddings are obtained (Eq. 7).

Given a layout image of height $H$ and width $W$ with semantic annotations, we abstract each element into a bounding box; these boxes form the nodes of the resulting layout graph. Specifically, for a layout image $I$, its layout graph $G_l$ is given by $G_l = (V, E)$, where the node set $V = \{v_1, v_2, \ldots, v_n\}$ represents the semantic elements in the layout, and the edge set $E = \{e_1, \ldots, e_{ij}, \ldots, e_{n(n-1)}\}$ represents the edges connecting the constituent elements. Our layout graphs are directed and fully connected.
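To make the construction concrete, below is a minimal sketch (ours, not the paper's released code) of building the directed, fully connected layout graph from a list of semantic boxes; the box values and category names are hypothetical.

```python
from itertools import permutations

def build_layout_graph(boxes, labels):
    """Build a directed, fully connected layout graph G_l = (V, E).

    boxes:  list of (x, y, w, h) tuples, one per semantic element
            (box centers and sizes).
    labels: list of semantic category names/ids, aligned with boxes.
    Returns the node list V and all ordered pairs E = {(i, j) : i != j},
    i.e., n*(n-1) directed edges for n elements.
    """
    nodes = list(zip(labels, boxes))
    edges = list(permutations(range(len(boxes)), 2))
    return nodes, edges

# A toy floorplan with three rooms (hypothetical coordinates).
nodes, edges = build_layout_graph(
    boxes=[(40, 60, 80, 120), (120, 60, 80, 120), (80, 160, 160, 80)],
    labels=["bedroom", "bathroom", "living_room"],
)
assert len(edges) == 3 * 2  # n*(n-1) directed edges
```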
Initial Node Features.
There exist a variety of visual and content-based features that could be incorporated as the initial node features, e.g., the text data, font size, or font type of a UI element, or the image features of a room in a floorplan. For structured learning tasks such as ours, we ignore such content-based features and focus only on the box abstractions. Specifically, similar to [11, 12], the initial node features contain semantic and geometric information of the layout elements. As shown in Figure 3, for a layout element $k$ centered at $(x_k, y_k)$, with dimensions $(w_k, h_k)$, its geometric information is:

$$g_k = \left[ \frac{x_k}{W}, \frac{y_k}{H}, \frac{w_k}{W}, \frac{h_k}{H}, \frac{w_k h_k}{\sqrt{WH}} \right].$$

Instead of one-hot encoding of the semantics, we use a learnable embedding layer to embed a semantic type into a 128-D code, $s_k$. A two-layer MLP embeds the 5-dimensional vector $g_k$ into a 128-D code, which is concatenated with the 128-D semantic embedding $s_k$ to form the initial node features $U = \{u_1, u_2, \ldots, u_n\}$.
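As an illustration, here is a PyTorch sketch of the initial node features; the 128-D codes follow the text, while the hidden width of the two-layer MLP is our assumption.

```python
import torch
import torch.nn as nn

class NodeFeatures(nn.Module):
    """Initial node features u_k: geometry code concatenated with semantic code s_k."""

    def __init__(self, num_semantic_types, dim=128, hidden=64):
        super().__init__()
        self.sem_embed = nn.Embedding(num_semantic_types, dim)  # learnable 128-D s_k
        self.geo_mlp = nn.Sequential(                           # two-layer MLP on g_k
            nn.Linear(5, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, boxes, types, W, H):
        # boxes: (n, 4) tensor of (x_k, y_k, w_k, h_k); types: (n,) long tensor
        x, y, w, h = boxes.unbind(dim=1)
        g = torch.stack([x / W, y / H, w / W, h / H,
                         w * h / (W * H) ** 0.5], dim=1)        # 5-D geometry g_k
        return torch.cat([self.geo_mlp(g), self.sem_embed(types)], dim=1)
```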
Initial Edge Features.

In visual reasoning and relationship detection tasks, edge features in a graph are designed to capture the relative differences of the abstracted semantic entities (represented as nodes) [12, 44]. Thus, for an edge $e_{ij}$, we capture the spatial relationship (see Figure 3) between the semantic entities by an 8-dimensional vector:

$$e_{ij} = \left[ \frac{\Delta x_{ij}}{\sqrt{A_i}}, \frac{\Delta y_{ij}}{\sqrt{A_i}}, \sqrt{\frac{A_j}{A_i}}, U_{ij}, \frac{w_i}{h_i}, \frac{w_j}{h_j}, \frac{\sqrt{\Delta x^2 + \Delta y^2}}{\sqrt{W^2 + H^2}}, \theta \right],$$

where $A_i$ is the area of the element box $i$; $U_{ij} = \frac{B_i \cap B_j}{B_i \cup B_j}$ is the IoU of the bounding boxes of the layout elements $i, j$; $\theta = \mathrm{atan2}(\Delta y, \Delta x)$ is the relative angle between the two components, $\theta \in [-\pi, \pi]$; and $\Delta x_{ij} = x_j - x_i$, $\Delta y_{ij} = y_j - y_i$. This edge vector accounts for the translation between the two layout elements, in addition to encoding their box IoU, individual aspect ratios, and relative orientation.
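A matching sketch for the 8-D edge feature follows; it assumes center-based box coordinates, and the box-IoU computation is spelled out for clarity.

```python
import math

def edge_features(box_i, box_j, W, H):
    """8-D edge feature e_ij for center-based boxes (x, y, w, h)."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    dx, dy = xj - xi, yj - yi
    area_i, area_j = wi * hi, wj * hj
    # Box IoU U_ij from the overlap of the two axis-aligned boxes.
    ox = max(0.0, min(xi + wi / 2, xj + wj / 2) - max(xi - wi / 2, xj - wj / 2))
    oy = max(0.0, min(yi + hi / 2, yj + hj / 2) - max(yi - hi / 2, yj - hj / 2))
    inter = ox * oy
    iou = inter / (area_i + area_j - inter)
    return [dx / math.sqrt(area_i), dy / math.sqrt(area_i),
            math.sqrt(area_j / area_i), iou, wi / hi, wj / hj,
            math.hypot(dx, dy) / math.sqrt(W ** 2 + H ** 2),
            math.atan2(dy, dx)]  # theta in [-pi, pi]
```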
The graph matching module employed in LayoutGMN is made up of three parts: (1) node and edge encoders, (2) message propagation layers, and (3) an aggregator.

Node and Edge Encoders.
We use two MLPs to embed the initial node and edge features and compute their corresponding code vectors:

$$h_i^{(0)} = \mathrm{MLP}_{\mathrm{node}}(u_i), \ \forall u_i \in U, \qquad r_{ij} = \mathrm{MLP}_{\mathrm{edge}}(e_{ij}), \ \forall (i, j) \in E. \quad (1)$$

The above MLPs map the initial node and edge features to their 128-D code vectors.
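In code, the two encoders are small MLPs. Only the 128-D output width is given in the text, so the input widths below (256-D node features from the concatenation above, 8-D edge features) and the depth are assumptions.

```python
import torch.nn as nn

# Node and edge encoders (Eq. 1): map initial features to 128-D code vectors.
mlp_node = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 128))
mlp_edge = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 128))
```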
Message Propagation Layers.

The graph matching framework hinges on coherent information exchange between graphs to compare two layouts in a structural manner. The propagation layers update the node features by aggregating messages along the edges within a graph, in addition to relying on a graph matching vector that measures how similar a node in one layout graph is to one or more nodes in the other. Specifically, given two node embeddings $h_i^{(0)}$ and $h_p^{(0)}$ from two different layout graphs, the node updates for node $i$ are given by:

$$m_{j \to i} = f_{\mathrm{intra}}\left(h_i^{(t)}, h_j^{(t)}, r_{ij}\right), \ \forall (i, j) \in E_1 \quad (2)$$
$$\mu_{p \to i} = f_{\mathrm{cross}}\left(h_i^{(t)}, h_p^{(t)}\right), \ \forall i \in V_1, \ p \in V_2 \quad (3)$$
$$h_i^{(t+1)} = f_{\mathrm{update}}\left(h_i^{(t)}, \sum_j m_{j \to i}, \sum_p \mu_{p \to i}\right) \quad (4)$$

where $f_{\mathrm{intra}}$ is an MLP on the node embedding codes that aggregates information from other nodes within the same graph, $f_{\mathrm{cross}}$ is a function that communicates cross-graph information, and $f_{\mathrm{update}}$ is an MLP used to update the node features in the graph, whose input is the concatenation of the current node features, the aggregated information from within the graph, and the aggregated information from across the graphs. $f_{\mathrm{cross}}$ is designed as an attention-based module:

$$a_{p \to i} = \frac{\exp\left(s_h(h_i^{(t)}, h_p^{(t)})\right)}{\sum_{p'} \exp\left(s_h(h_i^{(t)}, h_{p'}^{(t)})\right)}, \qquad \mu_{p \to i} = a_{p \to i}\left(h_i^{(t)} - h_p^{(t)}\right) \quad (5)$$

where $a_{p \to i}$ is the attention value (scalar) between node $p$ in the second graph and node $i$ in the first, and such attention weights are calculated for every pair of nodes across the two graphs; $s_h$ is implemented as the dot product of the embedded code vectors. The interaction of all the nodes $p \in V_2$ with node $i$ in $V_1$ is then given by:

$$\sum_p \mu_{p \to i} = \sum_p a_{p \to i}\left(h_i^{(t)} - h_p^{(t)}\right) = h_i^{(t)} - \sum_p a_{p \to i} h_p^{(t)} \quad (6)$$

Intuitively, $\sum_p \mu_{p \to i}$ measures the (dis)similarity between $h_i^{(t)}$ and its nearest neighbor in the other graph. The pairwise attention computation results in stronger structural bonds between the two graphs, but requires additional computation. We use five rounds of message propagation, after which the representation for each node is updated accordingly.
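The following sketch implements one such propagation round for graph 1 against graph 2 (run symmetrically for both graphs, for T = 5 rounds); the MLP widths are assumptions, but the message, attention, and update computations follow Eqs. 2-6.

```python
import torch
import torch.nn as nn

class MatchingLayer(nn.Module):
    """One round of intra-graph message passing plus cross-graph attention."""

    def __init__(self, dim=128):
        super().__init__()
        self.f_intra = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())   # Eq. 2
        self.f_update = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())  # Eq. 4

    def forward(self, h1, h2, edges1, r1):
        # h1: (n1, d) node codes of graph 1; h2: (n2, d) node codes of graph 2
        # edges1: (m, 2) long tensor of directed edges (j, i); r1: (m, d) edge codes
        src, dst = edges1[:, 0], edges1[:, 1]
        m = self.f_intra(torch.cat([h1[dst], h1[src], r1], dim=1))  # m_{j->i}
        msg = torch.zeros_like(h1).index_add_(0, dst, m)            # sum_j m_{j->i}
        a = torch.softmax(h1 @ h2.t(), dim=1)                       # Eq. 5, s_h = dot product
        mu = h1 - a @ h2                                            # Eq. 6: h_i - sum_p a_{p->i} h_p
        return self.f_update(torch.cat([h1, msg, mu], dim=1))       # Eq. 4
```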
A 1024-D graph-level representation, $h_G$, is obtained via a feature aggregator MLP, $f_G$, that takes as input the set of node representations $\{h_i^{(T)}\}$, as given below:

$$h_G = \mathrm{MLP}_G\left(\sum_{i \in V} \sigma\left(\mathrm{MLP}_{\mathrm{gate}}(h_i^{(T)})\right) \odot \mathrm{MLP}(h_i^{(T)})\right) \quad (7)$$

Graph-level embeddings for the two layout graphs are similarly computed:

$$h_{G_1} = f_G\left(\{h_i^{(T)}\}_{i \in V_1}\right), \qquad h_{G_2} = f_G\left(\{h_p^{(T)}\}_{p \in V_2}\right).$$

To learn a layout similarity metric, we borrow the triplet training framework [13]. Specifically, given two pairs of layout graphs, i.e., anchor-positive and anchor-negative, each pair is passed through the same GMN module to get the graph embeddings in the context of the other graph, as shown in Figure 5. A margin loss based on the $L_2$ distance between the graph embeddings, as given in Equation 8, is used to backpropagate the gradients through the GMN:

$$L_{\mathrm{tri}}(a, p, n) = \max\left(0, \ \gamma + \left\| h_{G_a} - h_{G_p} \right\|_2 - \left\| h'_{G_a} - h_{G_n} \right\|_2\right) \quad (8)$$

Figure 5. Given a triplet of graphs $G_a$, $G_p$ and $G_n$ corresponding to the anchor, positive and negative examples respectively, the anchor graph paired with each of the other two graphs is passed through a Graph Matching Network (Figure 4) to get two 1024-D embeddings. Note that the anchor graph has different contextual embeddings $h_{G_a}$ and $h'_{G_a}$. LayoutGMN is trained using the margin loss (margin = 5) on the $L_2$ distances of the two paired embeddings.
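A sketch of the gated aggregator (Eq. 7) and the triplet margin loss (Eq. 8); the single-layer gate and projection are simplifications of the MLPs in the text.

```python
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Gated aggregation of final node codes into a 1024-D graph embedding (Eq. 7)."""

    def __init__(self, dim=128, out=1024):
        super().__init__()
        self.gate = nn.Linear(dim, out)   # MLP_gate
        self.proj = nn.Linear(dim, out)   # MLP on h_i^(T)
        self.mlp_g = nn.Sequential(nn.Linear(out, out), nn.ReLU(), nn.Linear(out, out))

    def forward(self, h):  # h: (n, dim) node codes after T propagation rounds
        return self.mlp_g((torch.sigmoid(self.gate(h)) * self.proj(h)).sum(dim=0))

def triplet_loss(h_a, h_p, h_a2, h_n, gamma=5.0):
    """Margin loss (Eq. 8); note the two contextual anchor embeddings h_a and h_a2."""
    return torch.clamp(gamma + torch.norm(h_a - h_p, p=2)
                       - torch.norm(h_a2 - h_n, p=2), min=0.0)
```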
4. Datasets

We use two kinds of layout datasets in our experiments: (1) UI layouts from the RICO dataset [7], and (2) floorplans from the RPLAN dataset [42]. After some data filtering, the sizes of the two datasets are, respectively, 66,261 and 77,669. In the absence of a ground-truth label set, and needing to obtain the triplets in a consistent manner, we resort to using IoU values of two layouts, represented as multi-channel images, to ascertain their closeness. Given an anchor layout, the threshold on IoU values to classify another layout as positive, from observations, is 0.6 for both UIs and floorplans. Negative examples are those whose IoU value is at least 0.1 less than that of the positive ones, avoiding incorrect "negatives" during training. The train-test sizes for the aforementioned datasets are, respectively, 7700-1588 and 25000-7204. In the filtered floorplan training dataset [42], the number of distinct semantic categories/rooms across the dataset is nine, and the maximum number of rooms per floorplan is eight. Similarly, for the filtered UI layout dataset [7], the number of distinct semantic categories is twenty-five, and the number of elements per UI layout is at most one hundred.

5. Results and Evaluation
We evaluate LayoutGMN by comparing its retrieval results to those of several baselines, with relevance evaluated using human judgements. Similarity prediction by our network is efficient, taking 33 milliseconds per layout pair on a CPU. With our learning framework, we can efficiently retrieve multiple, sorted results by batching the database samples.
Graph Kernel (GK) [8].
GK is one of the earliest structural similarity metrics, initially developed to compare indoor 3D scenes. We adapt it to 2D layouts of floorplans and UI designs. We input the same layout graphs to GK to get retrievals from the two databases, and use the best setting based on the result quality/computation cost trade-off.
U-Net [35].
As one of the best segmentation networks, we use U-Net, with the same parameter settings as in PyTorch, in a triplet network setting to auto-encode layout images. The input to the network is a multi-channel image with semantic segmentations. The network is trained on the same set of triplets as LayoutGMN until convergence.
IoU Metric.
Given two layouts represented as multi-channel images, we compute the IoU between the two layout images to get their IoU score, and use this score to sort the examples in the datasets and rank the retrievals for a given query.
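As a sketch of this baseline, the IoU score between two layouts rendered as multi-channel semantic masks can be computed as below; averaging over non-empty channels is our assumption, since the text does not spell out how per-channel IoUs are combined.

```python
import numpy as np

def layout_iou(a, b):
    """IoU between two layouts given as boolean masks of shape (C, H, W),
    one channel per semantic category."""
    ious = []
    for c in range(a.shape[0]):
        union = np.logical_or(a[c], b[c]).sum()
        if union == 0:            # category absent from both layouts
            continue
        ious.append(np.logical_and(a[c], b[c]).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```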
GCN-CNN [30].
The state-of-the-art network for structural similarity on UI layouts is a hybrid network comprised of an attention-based GCN, similar to the gating mechanism in [28], coupled with a CNN. In the original GCN-CNN, the training triplets are randomly sampled every epoch, leading to better training due to diverse training data. In our work, for a fair comparison across all the aforementioned networks, we sample a fixed set of triplets in every epoch of training. The GCN-CNN network is trained on the two datasets of our interest, using the same training data as ours. Qualitative retrieval results for GCN-CNN, the IoU metric, and LayoutGMN for a given query are shown in Figure 6.

Method                        k=1 (↑)    k=5 (↑)    k=10 (↑)
UI layouts:
  Graph Kernel [8]            33.33      15.83      11.46
  U-Net Triplet [35]          27.08      10.83       7.92
  IoU Metric                  43.75      22.92      14.38
  GCN-CNN Triplet [30]        39.6       17.1       13.33
  LayoutGMN                   –          –          –
Floorplans:
  Graph Kernel [8]            27.27      15.15      12.42
  U-Net Triplet [35]          28.28      18.18      15.05
  IoU Metric                  33.84      24.04      17.48
  GCN-CNN Triplet [30]        37.37      22.02      17.02
  LayoutGMN                   –          –          –

Table 1. Precision scores (%) for the top-k retrieved results obtained using different methods, on a set of randomly chosen UI and floorplan queries. The first set of five rows is for UI layouts, followed by floorplans.
Precision@k scores.
To validate the correctness of LayoutGMN as a tool for measuring layout similarity, we start by evaluating layout retrieval from a large database. A standard evaluation protocol for measuring the relevance of ranked lists is the
Precision@k score [31]. Given a query $q_i$ from a query set $Q = \{q_1, q_2, q_3, \ldots, q_n\}$, we measure the relevance of the ranked lists $L(q_i) = [l_{i1}, l_{i2}, \ldots, l_{ik}, \ldots]$ using the precision scores, defined as:

$$P@k(Q, L) = \frac{1}{k\,|Q|} \sum_{q_i \in Q} \sum_{j=1}^{k} rel(L_{ij}, q_i), \quad (9)$$

where $rel(L_{ij}, q_i)$ is a binary indicator of the relevance of the returned element $L_{ij}$ for the query $q_i$. In our evaluation, due to the lack of a labeled and exhaustive recommendation set for any query over the layout datasets employed, such a binary indication of relevance is determined by human subjects. Table 1 shows the P@k scores for the different networks described in Section 5.1 employed for the layout retrieval task. To get the precision scores, similar to [30], we conducted a crowd-sourced annotation study via Amazon Mechanical Turk (AMT) on the top-10 retrievals per query, for N (N = 50 for UIs and 100 for floorplans) randomly chosen queries outside the training set. Ten turkers were asked to indicate the structural relevance of each of the top-10 results per query, without any specific instructions on what a structural comparison means. A result was considered relevant if at least 6 turkers agreed. For details on the AMT study, please see the supplementary material.

Figure 6. Top-5 retrieved results for an input query based on the IoU metric, GCN-CNN Triplet [30], and LayoutGMN. We observe that the ranked results returned by LayoutGMN are closer to the input query than those of the other two methods, although it was trained on triplets computed using the IoU metric. Attention weights for understanding structural correspondence in LayoutGMN are shown in Figure 1 and also provided in the supplementary material. UI and floorplan IDs from the RICO dataset [7] and RPLAN dataset [42] are indicated on top of each result. More results, along with results on document layouts, can be found in the supplementary material.

We observe that LayoutGMN better matches humans' notion of structural similarity. [30] performs better than the IoU metric on floorplan data (+3.5%) on the top-1 retrievals and is comparable to the IoU metric on top-5 and top-10 results. On UI layouts, the IoU metric is judged better by turkers than [30]. U-Net fails to retrieve structurally similar results because it overfits on the small amount of training data and relies more on image pixels due to its convolutional structure. LayoutGMN outperforms the other methods by at least 1% for all k, on both datasets. The precision scores on floorplans (bottom set) are lower than on UI layouts, perhaps because floorplans are easier to compare owing to a smaller set of semantic elements than UIs, and turkers tend to focus more on the size and boundary of the floorplans in addition to the structural arrangements. We believe that when many semantic elements are present in the layouts and are scattered (as in UIs), users tend to look at the overall structure instead of trying to match every single element owing to reduced attention span, which likely explains the higher scores for UIs.
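A small sketch of Eq. 9, assuming the binary relevance flags have already been collected from the annotators:

```python
def precision_at_k(relevance, k):
    """P@k (Eq. 9): mean fraction of relevant results among the top k.

    relevance: dict mapping each query id to a list of binary rel(L_ij, q_i)
    flags for its ranked results, e.g., aggregated from AMT annotators.
    """
    total = sum(sum(flags[:k]) for flags in relevance.values())
    return total / (k * len(relevance))

# Two hypothetical queries with top-3 relevance flags.
rel = {"q1": [1, 1, 0], "q2": [1, 0, 0]}
print(precision_at_k(rel, k=3))  # (2 + 1) / (3 * 2) = 0.5
```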
Overlap@k score. We also propose another measure to quantify the stability of retrieved results, the Overlap@k score. Specifically, if $Q_1$ is a set of queries and $Q_{top1}$ is the set of top-1 retrieved results for every query in $Q_1$, then

$$Ov@k(Q_1, Q_{top1}) = \frac{1}{k\,|Q_1|} \sum_{\substack{q_m \in Q_1 \\ q_p = \mathrm{top1}(q_m)}} \sum_{j=1}^{k} \left(L_{mj} \wedge L_{pj}\right), \quad (10)$$

where $L_{ij}$ is the $j$-th ranked result for the query $q_i$, and $\wedge$ is the logical AND operation. Thus, $(L_{mj} \wedge L_{pj})$ is 1 if the $j$-th result for query $q_m \in Q_1$ and query $q_p = \mathrm{top1}(q_m) \in Q_{top1}$ are the same. This score measures the ability of the layout similarity metric to replicate the distance field implied by a query according to its top-ranked retrieval. In other words, retrieval stability can be measured by checking the consistency of retrievals for many $(q_m, q_p)$ pairs. This score makes sense only when the ranked results returned by a layout similarity tool are deemed reasonable, as assessed by the P@k scores.

Method                        Ov@5 (↑)   Ov@10 (↑)
UI layouts:
  IoU Metric                  –          –
  GCN-CNN Triplet [30]        –          –
  LayoutGMN                   –          –
Floorplans:
  IoU Metric                  30.42      30.8
  GCN-CNN Triplet [30]        43.2       46.8
  LayoutGMN                   –          –

Table 2. Overlap scores (%) for checking the consistency of retrievals for a query and its top-1 retrieved result, over 50 such pairs. The first set of three rows is for UI layouts, followed by floorplans.

Table 2 shows the Overlap@k scores for k = 5, 10 for the IoU metric, GCN-CNN [30], and LayoutGMN on 50 such pairs. On UIs (first three rows), the IoU metric has a slightly higher Ov@5 score (+0.6%) than LayoutGMN. It also shares the largest P@5 score with LayoutGMN, indicating that the IoU metric has slightly better retrieval stability for the top-5 results. However, in the case of Ov@10, LayoutGMN has a higher score (+0.4%) than the IoU metric and also has a higher P@10 score than the other two methods, indicating that when top-10 retrievals are considered, LayoutGMN has slightly better consistency on the retrievals. As for floorplans (last three rows), Table 1 already shows that LayoutGMN has the best P@k scores. This, coupled with higher Ov@k scores, indicates that on floorplans, LayoutGMN has better retrieval stability. In the supplementary material, we show qualitative results on the stability of retrievals for the three methods.
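A sketch of Eq. 10; it assumes every top-1 result is itself available as a query key in the ranked lists.

```python
def overlap_at_k(ranked, queries, k):
    """Ov@k (Eq. 10): agreement between the top-k list of each query q_m and
    the top-k list of its own top-1 retrieval q_p = top1(q_m).

    ranked: dict mapping a layout id to its ranked list of retrieved ids.
    queries: the query set Q_1.
    """
    score = 0
    for q_m in queries:
        q_p = ranked[q_m][0]          # top-1 retrieval of q_m
        score += sum(int(a == b)      # (L_mj AND L_pj)
                     for a, b in zip(ranked[q_m][:k], ranked[q_p][:k]))
    return score / (k * len(queries))
```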
Classification accuracy.
We also measure the classification accuracy on test triplets as a sanity check. However, such a measure alone is not an appropriate one for the correctness of a similarity metric employed in information retrieval tasks [31]. We present it alongside the Precision@k and Overlap@k scores for a broader, informed evaluation, in Table 3. Since user annotations are expensive and time consuming (and hence the motivation to use the IoU metric to get weak training labels), we only obtain user annotations on 452 triplets for both UIs and floorplans, and the last column of Table 3 reflects the accuracy on such triplets. LayoutGMN outperforms all the baselines by at least 1.32% on triplets obtained using both the IoU metric and user annotations.
Following [30], we employed fully connected graphs for our experiments and observed that such graphs are a good design for training graph neural networks to learn structural similarity. We also performed experiments using adjacency graphs with GCN-CNN [30] and LayoutGMN, and observed that, for floorplans (where the graph node count is small), the quality of retrievals improved in the case of LayoutGMN, but degraded for GCN-CNN. This is mainly because GCN-CNN obtains independent graph embeddings for each input graph, and when the graphs are built only on adjacency connections, some amount of global structural prior is lost. On the other hand, GMNs obtain better contextual embeddings by matching the sparsely connected adjacency graphs, as a result of a narrower search space; for a qualitative result using adjacency graphs, see Figure 7. However, for UIs (where the graph node count is large), the elements are scattered all over the layout, and no one heuristic is able to capture adjacency relations perfectly. The quality of retrievals for both networks degraded when using adjacency graphs on UIs. More results can be found in the supplementary material.

Method                        IoU-based (↑)   User-based (↑)
UI layouts:
  Graph Kernel [8]            90.09           90.73
  U-Net Triplet [35]          96.67           93.38
  GCN-CNN Triplet [30]        96.45           94.48
  LayoutGMN                   –               –
Floorplans:
  Graph Kernel [8]            92.07           95.60
  U-Net Triplet [35]          93.01           91.00
  GCN-CNN Triplet [30]        92.50           91.8
  LayoutGMN                   –               –

Table 3. Classification accuracy on test triplets obtained using the IoU metric (IoU-based) and annotated by users (User-based). The first set of comparisons is for UI layouts, followed by floorplans.

Figure 7. Retrieval results for the bottom-left query in Figure 6, when adjacency graphs are used. On most queries, the performance of LayoutGMN improves, but degrades in the case of GCN-CNN [30] on floorplan data.
To evaluate how the node and edge features in our layout representation contribute to network performance, we conduct an ablation study by gradually removing these features. Our design of the initial representation of the layout graphs (Section 3.1) is well studied in prior works on layout generation [11, 26], visual reasoning, and relationship detection tasks [12, 44, 30]. As such, we focus on analyzing LayoutGMN's behavior when strong structural priors, viz., the edges, box positions, and element semantics, are ablated. Figure 8 shows qualitative top-5 retrieved results for a given query when these structural priors in the training graphs are gradually removed.

Figure 8. Top-5 retrieved results for a given query when structural priors (edges, box positions, and element semantics) are gradually removed from the input graphs.
Graph edges.
When the edges of the graphs are not considered, i.e., when there is no message propagation within a graph, the only component that updates the node features is the attention-weighted node update (Eq. 4). Naturally, the structure encoding is lost in both the query and the database sample, leading to random retrievals; see the first row of Figure 8.
Effect of box positions:
The nodes of the layout graphs encode both the absolute box positions and the element semantics. When the position encoding information is withdrawn, arguably the most important cue is lost. The resulting retrievals from such a poorly trained model, as seen in the second row of Figure 8, are noisy, as semantics alone do not provide enough structural priors.
Effect of node semantics:
Next, when the box positions are preserved but the element semantics are not encoded, we observe that the network slowly begins to understand element comparison guided by the position information, but falls short of understanding the overall structure. Finally, when all the above information is accounted for, we observe that the network better learns the structural information, and even returns structurally sound results compared to the IoU metric.
We present layout label transfer, via attention-based structural element matching, as a natural application of LayoutGMN. Given a source layout image $I_1$ with known labels, the goal is to transfer the labels to a target layout $I_2$. A straightforward approach to establishing element correspondence is via maximum area/pixel-overlap matching for every element in $I_2$ with respect to all the elements in $I_1$. However, this scheme is highly sensitive to element positions within the two layouts. Moreover, raster alignment (via translations) of layouts is non-trivial to formulate when the two layout images have different boundaries and structures. LayoutGMN, on the other hand, is robust to such boundary variations, and can be directly used to obtain element-level correspondences using the built-in attention mechanism, which provides an attention score for every element-level match.

Figure 9. Element-level label transfer results from a source image $I_1$ to a target image $I_2$, using a pretrained LayoutGMN vs. maximum pixel-overlap matching. LayoutGMN predicts correct labels via attention-based element matching.

Specifically, we use a pretrained
LayoutGMN which is fed with two layout graphs, where the semantic encoding of all nodes is set to a vector of ones. As shown in Figure 9, the pretrained
LayoutGMN is able to find the correct labels despite the masking of the semantic information at the input. Note that when semantic information is masked at the input, such a transfer cannot be applied to any two layouts; it is limited by a weak/floating alignment of $I_1$ and $I_2$, as seen in Figure 9.
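A sketch of the transfer step follows; how the cross-graph attention matrix is read out of the pretrained network is implementation-specific, so the `attention` argument below is a stand-in for the a_{p->i} weights from the last propagation round.

```python
import torch

def transfer_labels(attention: torch.Tensor, source_labels):
    """Transfer element labels from a source layout to a target layout.

    attention: (n_target, n_source) matrix of cross-graph attention weights
    a_{p->i}, computed with the semantic node codes masked to ones.
    source_labels: list of n_source semantic labels.
    """
    # For each target element, pick the source element it attends to most.
    best = attention.argmax(dim=1)
    return [source_labels[j] for j in best.tolist()]

# Hypothetical 2-element target attending over a 3-element source.
attn = torch.tensor([[0.7, 0.2, 0.1],
                     [0.1, 0.1, 0.8]])
print(transfer_labels(attn, ["bedroom", "kitchen", "living_room"]))
# -> ['bedroom', 'living_room']
```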
6. Conclusion, limitation, and future work
With the advent of large-scale layout datasets, analysing and organizing layout data becomes crucial, where the first step is to develop an effective means to compare layouts. We present the first deep neural network to offer both metric learning of structural layout similarity and structural matching between layout elements. Extensive experiments demonstrate that our metric best matches human judgement of structural similarity for both floorplans and UI designs, compared to all well-known baselines.

The main limitation of our current learning framework is the requirement for strong supervision, which justifies, in part, the use of the less-than-ideal IoU metric for network training. An interesting future direction is to combine few-shot or active learning with our GMN-based triplet network, e.g., by finding ways to obtain small sets of training triplets that are both informative and diverse [24].

Another limitation of our current network is that it does not learn hierarchical graph representations or structural matching, which would be desirable when handling large graphs. In addition, the graph embedding space learned by LayoutGMN may be worth a closer examination to assess its potential for generative modeling. Finally, we would like to explore applying our learning framework to other, more complex graph-structured data.

References

[1] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4561-4569, 2019.
[2] Thorsten Brants. Inter-annotator agreement for a German newspaper corpus. In International Conference on Knowledge Engineering and Knowledge Management, 2000.
[3] Thomas M Breuel. High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology, pages 209-218, 2003.
[4] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.
[5] Qi Chen, Qi Wu, Rui Tang, Yuhan Wang, Shuai Wang, and Mingkui Tan. Intelligent home 3D: Automatic 3D-house design from linguistic descriptions only. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12625-12634, 2020.
[6] Niraj Ramesh Dayama, Kashyap Todi, Taru Saarelainen, and Antti Oulasvirta. GRIDS: Interactive layout design with integer programming. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1-13, 2020.
[7] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845-854, 2017.
[8] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing structural relationships in scenes using graph kernels. In ACM SIGGRAPH 2011 Papers, pages 1-12, 2011.
[9] Karën Fort, Maud Ehrmann, and Adeline Nazarenko. Towards a methodology for named entities annotation. 2009.
[10] Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics, 2020.
[11] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. READ: Recursive autoencoders for document layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 544-545, 2020.
[12] Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, and Hanqing Lu. Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, pages 765-773, 2019.
[13] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84-92. Springer, 2015.
[14] George Hripcsak and Adam Wilcox. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. Journal of the American Medical Informatics Association, 9(1):1-15, 2002.
[15] Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver van Kaick, Hao Zhang, and Hui Huang. Graph2Plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG), 2020.
[16] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In Proceedings of the 9th ACM Symposium on Document Engineering, pages 99-108, 2009.
[17] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219-1228, 2018.
[18] Rangachar Kasturi. Document Image Analysis, volume 39.
[19] Nagma Khan, Ushasi Chaudhuri, Biplab Banerjee, and Subhasis Chaudhuri. Graph convolutional network for multi-label VHR remote sensing scene recognition. Neurocomputing, 357:36-46, 2019.
[20] Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(1):10, 2008.
[21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. 2017.
[22] Yanir Kleiman, Oliver van Kaick, Olga Sorkine-Hornung, and Daniel Cohen-Or. SHED: Shape edit distance for fine-grained shape similarity. ACM Transactions on Graphics (TOG), 34(6):1-11, 2015.
[23] Ranjitha Kumar, Jerry O Talton, Salman Ahmad, and Scott R Klemmer. Bricolage: Example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2197-2206, 2011.
[24] Priyadarshini Kumari, Ritesh Goru, Siddhartha Chaudhuri, and Subhasis Chaudhuri. Batch decorrelation for active metric learning. In IJCAI-PRICAI, 2020.
[25] Jianan Li, Tingfa Xu, Jianming Zhang, Aaron Hertzmann, and Jimei Yang. LayoutGAN: Generating graphic layouts with wireframe discriminator. In International Conference on Learning Representations, 2019.
[26] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri, Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen, Daniel Cohen-Or, and Hao Zhang. GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2):1-16, 2019.
[27] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In ICML, 2019.
[28] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. 2016.
[29] Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pages 569-579, 2018.
[30] Dipu Manandhar, Dan Ruta, and John Collomosse. Learning structural similarity of user interface layouts using graph networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[31] Christopher D Manning, Hinrich Schütze, and Prabhakar Raghavan. Chapter 8: Evaluation in information retrieval, in "Introduction to Information Retrieval", pages 151-175. Cambridge University Press, 2008.
[32] Nelson Nauata, Kai-Hung Chang, Chin-Yi Cheng, Greg Mori, and Yasutaka Furukawa. House-GAN: Relational generative adversarial networks for graph-constrained house layout generation. In Eur. Conf. Comput. Vis., 2020.
[33] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE Transactions on Visualization and Computer Graphics, 20(8):1200-1213, 2014.
[34] Daniel Ritchie, Ankita Arvind Kejriwal, and Scott R Klemmer. d.tour: Style-based exploration of design example galleries. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 165-174, 2011.
[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[36] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593-607. Springer, 2018.
[37] Amanda Swearngin, Mira Dontcheva, Wilmot Li, Joel Brandt, Morgan Dixon, and Andrew J Ko. Rewire: Interface design assistance from examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1-12, 2018.
[38] Sou Tabata, Hiroki Yoshihara, Haruka Maeda, and Kei Yokoyama. Automatic layout generation for graphical design magazines. In ACM SIGGRAPH 2019 Posters, pages 1-2, 2019.
[39] Subarna Tripathi, Sharath Nittur Sridhar, Sairam Sundaresan, and Hanlin Tang. Compact scene graphs for layout composition and patch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[40] Raoul Wessel, Ina Blümel, and Reinhard Klein. The room connectivity graph: Shape retrieval in the architectural domain. 2008.
[41] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics, 7(1):1-10, 2006.
[42] Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1-12, 2019.
[43] Kai Xu, Rui Ma, Hao Zhang, Chenyang Zhu, Ariel Shamir, Daniel Cohen-Or, and Hui Huang. Organizing heterogeneous scene collections through contextual focal points. ACM Transactions on Graphics (TOG), 33(4):1-12, 2014.
[44] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684-699, 2018.
[45] Ziqi Zhang, Sam Chapman, and Fabio Ciravegna. A methodology towards effective and efficient manual document annotation: addressing annotator discrepancy and annotation quality. In International Conference on Knowledge Engineering and Knowledge Management, pages 301-315. Springer, 2010.
[46] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A large photo-realistic dataset for structured 3D modeling. In Eur. Conf. Comput. Vis., 2020.
[47] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1-15, 2019.
[48] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In ICDAR, pages 1015-1022. IEEE, 2019.