ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis
R. Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie
R. KENNY JONES, Brown University
THERESA BARTON, Brown University
XIANGHAO XU, Brown University
KAI WANG, Brown University
ELLEN JIANG, Brown University
PAUL GUERRERO, Adobe Research
NILOY J. MITRA, University College London, Adobe Research
DANIEL RITCHIE, Brown University
Interpolation in ShapeAssembly Program Space
Fig. 1. We present a deep generative model which learns to write novel programs in ShapeAssembly, a domain-specific language for modeling 3D shape structures. Executing a ShapeAssembly program produces a shape composed of a hierarchical, connected assembly of part proxies (cuboids). Our method develops a well-formed latent space that supports interpolations between programs. Above, we show one such interpolation, and also visualize the geometry these programs produce when executed. In the last column, we manually edit the continuous parameters of a generated program, in order to produce a variant geometric structure with new topology.
Manually authoring 3D shapes is difficult and time consuming; generative models of 3D shapes offer compelling alternatives. Procedural representations are one such possibility: they offer high-quality and editable results but are difficult to author and often produce outputs with limited diversity. On the other extreme are deep generative models: given enough data, they
can learn to generate any class of shape, but their outputs have artifacts and the representation is not editable.

In this paper, we take a step towards achieving the best of both worlds for novel 3D shape synthesis. First, we propose ShapeAssembly, a domain-specific "assembly-language" for 3D shape structures. ShapeAssembly programs construct shape structures by declaring cuboid part proxies and attaching them to one another, in a hierarchical and symmetrical fashion. ShapeAssembly functions are parameterized with continuous free variables, so that one program structure is able to capture a family of related shapes. We show how to extract ShapeAssembly programs from existing shape structures in the PartNet dataset. Then, we train a deep generative model, a hierarchical sequence VAE, that learns to write novel ShapeAssembly programs. Our approach leverages the strengths of each representation: the program captures the subset of shape variability that is interpretable and editable, and the deep generative model captures variability and correlations across shape collections that is hard to express procedurally.

We evaluate our approach by comparing the shapes output by our generated programs to those from other recent shape structure synthesis models. We find that our generated shapes are more plausible and physically-valid than those of other methods. Additionally, we assess the latent spaces of these models, and find that ours is better structured and produces smoother interpolations. As an application, we use our generative model and differentiable program interpreter to infer and fit shape programs to unstructured geometry, such as point clouds.

CCS Concepts: • Computing methodologies → Neural networks; Latent variable models; Shape analysis.

Additional Key Words and Phrases: Shape analysis, shape synthesis, generative models, deep learning, procedural modeling, neurosymbolic models

Authors' addresses: R. Kenny Jones, Brown University; Theresa Barton, Brown University; Xianghao Xu, Brown University; Kai Wang, Brown University; Ellen Jiang, Brown University; Paul Guerrero, Adobe Research; Niloy J. Mitra, University College London, Adobe Research; Daniel Ritchie, Brown University.

© 2020 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3414685.3417812.
ACM Reference Format:
R. Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J. Mitra, and Daniel Ritchie. 2020. ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis.
ACM Trans. Graph.
39, 6, Article 234 (December 2020), 20 pages. https://doi.org/10.1145/3414685.3417812
1 INTRODUCTION

3D models of human-made objects are more in-demand than ever. In addition to the traditional drivers of demand in computer graphics (visual effects, animation, games), new applications in artificial intelligence increasingly benefit from or even require high-quality 3D objects, such as producing synthetic training imagery for computer vision systems [Kar et al. 2019; Richter et al. 2016; Zhang et al. 2017] or training robots to perform tasks in virtual environments [Abbatematteo et al. 2019; Kolve et al. 2017; Savva et al. 2019; Xia et al. 2018]. Despite the growing demand, the craft of 3D modeling largely remains as difficult and time-consuming as it has ever been. The time and expertise required to create 3D content by hand will not scale to these demands.

One promising way out of this conundrum is the development of generative models of 3D shapes, i.e., procedures which can be executed to generate novel shapes within some class [Müller et al. 2006; Parish and Müller 2001; Prusinkiewicz and Lindenmayer 1996]. An ideal generative model would produce plausible output geometry, capture a wide range of shape variations, and use an interpretable representation which a user could subsequently manipulate and edit. Unfortunately, no existing shape generative model achieves all of these properties. On the one hand are procedural models: structured computer programs which produce geometry when executed. Procedural models can produce high-quality geometry, and their program-based representation makes them interpretable and editable to users with some programming background. However, authoring a good procedural model from scratch is difficult (arguably at least as difficult as modeling an object by hand), and the amount of shape variation captured by a single procedural model is limited (e.g., it is difficult to write one program that can model all types of cars). On the other hand are data-driven generative models, particularly deep generative models: neural networks which learn how to generate 3D shapes from data [Fan et al. 2017; Groueix et al. 2018; Li et al. 2017; Mo et al. 2019a; Wu et al. 2016]. Deep generative models capture variability with little human effort: given enough training data, they can in theory learn to generate any class of shape. Since they lack the strict semantics of programs, however, their outputs often exhibit "noise" artifacts such as incomplete geometry and floating parts. Additionally, the representations they learn are typically inscrutable to people, making them hard to edit or manipulate in predictable ways.

Our insight, in this work, is that these two approaches have complementary strengths: deep generative models are efficient to create and excel at broad-scale variability, and procedural models produce high-quality geometry by construction and better facilitate editing for fine-scale variability. We take a first step toward achieving the best of both worlds by integrating these two approaches into a single pipeline: a deep generative model that learns to write programs, which, when executed, themselves output 3D geometry. We hypothesize that going through this intermediate program representation produces a generative model with a smoother latent space, whose outputs are more likely to be physically valid, compact, and editable.

As the motivating applications mentioned earlier demand 3D models of human-made objects, we focus on generating novel part-based shape structures in this paper. We introduce ShapeAssembly, an "assembly language" for 3D shape structures.
In ShapeAssembly, shape structures are represented by hierarchical assemblies of connected parts, where leaf-level parts are approximated by a bounding cuboid (a similar representation to the ones used by PartNet [Mo et al. 2019b] and StructureNet [Mo et al. 2019a]); these hierarchical cuboid structures can then be used to condition the generation of shape surface geometry in the form of, e.g., point clouds. A ShapeAssembly program constructs a shape by declaring cuboids, iteratively attaching them to one another, and specifying symmetric repetitions of connected cuboid assemblies. The dimensions of these cuboids and the positions of these attachments are a program's parameters; manipulating them allows for exploring a family of related shapes. Furthermore, our interpreter for executing ShapeAssembly programs is fully differentiable, meaning it is possible to compute gradients of a program's output geometry with respect to its continuous parameters. Figure 1 shows some example hierarchical ShapeAssembly programs and the output shapes they produce.

While ShapeAssembly programs produce valid geometry under a range of parameter values, they do not exhibit structural variability, and authoring them from scratch still takes time. Thus, we train a neural network to write a variety of ShapeAssembly programs for us. Using programs we extract from a shape dataset, we train a hierarchical sequence VAE which outputs hierarchical ShapeAssembly programs. Each node in the hierarchy uses a recurrent language model to generate the program text at that level, and to decide which cuboids should be expanded into subroutine calls. Furthermore, the well-defined semantics of ShapeAssembly allow us to identify semantically-invalid programs and modify the generator such that it never produces them. The programs shown in Figure 1 were written by our generative model, by decoding code vectors along a straight line in its latent space. We show that this generative model indeed learns to generate plausible, novel shape programs that were never seen in its training set.

We evaluate our approach by comparing it to other recently-proposed generative models of 3D shape structure along several
axes, including plausibility, diversity, complexity, and physical validity. We find that our generated shapes are both more plausible and more physically-valid than those of other methods. Additionally, we assess the latent spaces of these models, and find that ours is better structured and produces smoother interpolations, both in terms of geometric and structural continuity. As a bonus, we also show that ShapeAssembly's decoder does a better job of fitting programs to unstructured point clouds while also maintaining physical validity, and that this performance difference is magnified by optimizing the program fit via our differentiable interpreter.

In summary, our contributions are:
(i) The ShapeAssembly language and its differentiable interpreter, allowing the procedural specification of shape structures represented as connected part assemblies.
(ii) A deep generative model for ShapeAssembly programs, coupling the ease-of-training and variability of neural networks with the precision and editability of procedural representations.

Code and data used for all of our experiments can be found at https://github.com/rkjones4/ShapeAssembly.
2 RELATED WORK

Deep Generative Models of 3D Shapes.
Recent years have seen an explosion of activity in applying deep generative models to 3D shape generation. Some of the earliest approaches generated shapes as 3D occupancy grids [Wu et al. 2016; Z. Wu 2015]; later work has explored generative representations of point clouds [Fan et al. 2017], 2D surface patches [Groueix et al. 2018], and implicit surfaces [Chen et al. 2019; Chen and Zhang 2019; Michalkiewicz et al. 2019; Park et al. 2019]. Our approach is more closely related to generative models of part-based shapes, wherein a complete object is synthesized by generating and assembling multiple subparts. These include approaches for iteratively adding parts to partially-complete shapes [Sung et al. 2017], generating symmetry hierarchies [Li et al. 2017], composing parts from two different shapes [Zhu et al. 2018], and generating hierarchical connectivity graphs [Mo et al. 2019a]. Our method is different in that we do not aim to learn a generative model that outputs a single shape; rather, ours outputs a procedural program which then itself generates a related family of shapes.
Procedural and Inverse Procedural Modeling.
There is a rich history of methods for procedural modeling in computer graphics: especially noteworthy examples include its use in modeling plants [Prusinkiewicz and Lindenmayer 1996] and urban environments [Müller et al. 2006; Parish and Müller 2001]. Most procedural modeling systems use some form of (context-free) grammar, i.e., a recursive string re-writing system (which may be interpreted as, e.g., recursively splitting a spatial domain, in the case of shape grammars). Additionally, attachment-based grammars of part assemblies have been used to aid in the structural analysis of shapes [Lau et al. 2011]. Our procedural representation is fundamentally different: we use an imperative language which iteratively constructs shapes via declaring and then connecting parts represented as simple proxy geometry. Also related to our work is the line of research on inverse procedural modeling, i.e., inferring a procedural model from a set of examples [Hwang et al. 2011; Martinovic and Van Gool 2013; Nishida et al. 2018, 2016; Ritchie et al. 2018; Talton et al. 2012]. These methods all strive to infer an interpretable, stochastic program which generates multiple output shapes. In contrast, we represent shapes via deterministic programs, and then we use a stochastic neural network to generate those programs.
Visual Program Induction.
Another related line of work to ours is visual program induction (VPI): the practice of inferring a program which describes a single visual entity, such as a 3D shape. We address a fundamentally different problem: training a generative model to generate novel
3D shape programs from scratch. We do use a VPI-like process as a subroutine, to convert every shape in a large dataset into training programs for our generative model. Prior work in this area can be roughly divided into two categories: methods that assume that clean, segmented geometry is available and then use geometric heuristics to infer a program [Demir et al. 2016; Stava et al. 2010], and methods which use learning or optimization to operate directly on "raw" visual inputs such as images and occupancy grids [Du et al. 2018; Ellis et al. 2019, 2018; Liu et al. 2019; Sharma et al. 2018; Tian et al. 2019; Zhou et al. 2019; Zou et al. 2017]. Our approach to extracting programs from shapes to formulate training data is more similar to the former.

One could consider solving our problem of novel shape program generation by first generating novel 3D shapes with an existing shape generative model and then using a VPI-like system to infer a program describing that shape. However, as we will later show, the programs produced by such a process are less clean and editable than ones generated by our model; furthermore, training to generate programs rather than shapes directly actually produces a better-structured latent space.

Our complete pipeline of using a neural network to generate a program and then using that program to generate the ultimate output is also related to work in visual question answering which uses neural networks to generate a "query program" for each question, which then analyzes the input image and produces an answer [Johnson et al. 2017]. It is also related to work that tries to combine the advantages of neural guidance with symbolic search for performing inference over structured domains [Lu et al. 2019].

Our work is the first to train a deep generative model to produce novel shape programs from scratch, each of which outputs a parametric family of related 3D shapes. This pipeline combines the advantages of neural and procedural shape modeling.
3 OVERVIEW

Our approach (Figure 2) is divided into the following stages:
Input.
Our pipeline takes as input a large dataset of hierarchical 3D part graphs [Mo et al. 2019a,b]. This is a shape representation in which each node represents a part in a shape consisting of an assembly of parts. Nodes are connected via edges that denote physical part attachments. They can also be connected via parent-child edges that denote hierarchy relationships (i.e., that one part is composed of several other smaller parts). At the leaf level of this hierarchy, atomic parts are represented by cuboid proxy geometry (typically minimum-volume bounding boxes of more detailed part meshes).
Fig. 2. Our pipeline for generating 3D shape structure programs. We first define a DSL for 3D shapes, ShapeAssembly. Then, given a dataset of hierarchical part graphs, we extract ShapeAssembly programs from them. Finally, we use these programs as training data for a deep generative model. Our method learns to generate novel program instances that can be executed to produce complex and interesting 3D shape structures.
Defining a DSL for connected, hierarchical shapes.
To represent shapes as programs, we introduce a domain-specific language (DSL). Since our input shapes are characterized by graphs of parts, where graph edges denote physical part connections, we introduce a DSL based around declaring parts and then attaching them to one another. We call this language ShapeAssembly (as in, an "assembly language" for shapes). Section 4 describes the language.
Creating a dataset of shape-program pairs.
Given the language described above, we present a method for finding programs that represent the shapes in our dataset. In our procedure, we first extract the program content based on a combination of data cleaning and geometric analysis. Then, we create canonical programs through a series of ordering and filter steps. Section 5 describes this procedure in more detail.
Learning to generate programs.
Finally, we treat the programs extracted from each shape as training data for a generative model. Section 6 describes our deep generative model's architecture, the procedure we use to train it, and how we sample from it to synthesize new programs, which when executed produce novel shape structures.
1.  def Chair():
2.      bbox = Cuboid(1, 1.5, .8, True)
3.      base = Base(.8, .5, .8, True)
4.      cube1 = Cuboid(.8, .1, .8, True)
5.      back = Back(.9, .8, .07, True)
6.      attach(base, bbox, .5, 0, .5, .5, 0, .5)
7.      attach(cube1, base, .5, 0, .5, .5, 1, .5)
8.      squeeze(back, bbox, cube1, top, .5, .05)
9.  def Base(l, w, h, aligned):
10.     bbox = Cuboid(l, w, h, aligned)
11.     cube0 = Cuboid(.2, .5, .2, True)
12.     cube1 = Cuboid(.2, .5, .2, True)
13.     squeeze(cube0, bbox, bbox, top, .1, .1)
14.     squeeze(cube1, bbox, bbox, top, .1, .8)
15.     reflect(cube0, X)
16.     reflect(cube1, X)
17. def Back(l, w, h, aligned):
18.     bbox = Cuboid(l, w, h, aligned)
19.     cube0 = Cuboid(.9, .4, .07, True)
20.     cube1 = Cuboid(.1, .4, .05, True)
21.     attach(cube0, bbox, .5, 1, .5, .5, 1, .5)
22.     squeeze(cube1, bbox, cube0, bot, .3, .5)
23.     translate(cube1, X, 2, .5)

Fig. 3. An example ShapeAssembly program and the shape that it generates. Parts are colored according to the line of the program which instantiates them, and attachment points are numbered accordingly. In the top shape, we show the executed Chair program without hierarchy. In the bottom shape, we show the Chair program executed hierarchically with its sub-programs (Base and Back). For instance, the light grey back part is expanded into the purple back surface and gold slats.
bbox = Cuboid(.7, 1.8, .6, True)
cube0 = Cuboid(.6, .6, .6, True)
cube1 = Cuboid(.6, .2, .6, True)
cube2 = Cuboid(.6, .9, .2, True)
cube3 = Cuboid(.2, .2, .4, True)
(1) attach(cube0, bbox, .5, 0, .5, .5, 0, .5)
(2) attach(cube1, cube0, .5, 0, .5, .5, 1, .5)
(3) squeeze(cube2, bbox, cube1, top, .5, .18)
(4) attach(cube3, cube2, .5, .5, 0, .1, .1, 1)
(5) attach(cube3, cube1, .5, 0, .5, .1, 1, .7)
(6) reflect(cube3, X)

Fig. 4. An illustration of how the ShapeAssembly interpreter incrementally constructs shapes by imperatively executing program commands. Cuboids are instantiated at the origin and are moved through attachment. Notice how the reflect command in line 6 acts as a macro function, creating a new cuboid and two new attachments.
4 THE SHAPEASSEMBLY LANGUAGE

Our goal in this section is to define a domain-specific language for shapes which are specified as connected assemblies of parts. As we focus on the problem of shape structure synthesis, cuboids, serving as part proxy geometry, are the only data type in our language. In Section 7, we show how to use other existing techniques to convert these proxies into surface geometry.

The primary operation in the language is attaching these cuboids together. Attachment turns out to be a very powerful and flexible operation. In fact, our language does not include any operations for explicitly positioning or orienting cuboids: all of this is accomplished via attachment operations. Additionally, the language includes higher-level macros that capture more complex spatial relationships, such as symmetry. At execution time, each macro is expanded into a series of cuboid declarations and attachment operations.

We call this DSL ShapeAssembly, because it is an "assembly language for shapes": a low-level language for creating shapes, in which shapes are created by assembling parts. Table 1 shows the grammar for ShapeAssembly, and Figure 3 shows an annotated hierarchical program along with its executed 3D shape.

A ShapeAssembly program consists of four main blocks:
• BBlock: Declares a non-visible bounding volume of the overall shape. This bounding volume is treated as a physical entity to which other parts can be connected.
• CBlock: Declares all the cuboid part proxies that will be used by the remainder of the program. The Cuboid command takes in l, w, h parameters that control the starting dimensions of the part, and an aligned flag a that specifies if the part has the same orientation as its bounding volume.
• ABlock: Connects cuboids by iteratively attaching them to one another. The attach command takes in two cuboids, c1 and c2, and attaches the point (x1, y1, z1) in the local coordinate frame of c1 to the point (x2, y2, z2) in the local coordinate frame of c2. The squeeze macro expands into two attach statements, such that c1 is placed in-between c2 and c3 along the specified face f at the face-coordinate position (u, v).
• SBlock: Generates symmetry groups by instantiating additional Cuboid and attach commands. The reflect macro reflects a cuboid over the given axis of the bounding volume. The translate macro creates a translational symmetry group starting at a cuboid, with m additional members along the given axis of the bounding volume, that ends distance d away.

Semantics.
ShapeAssembly has imperative semantics: every line of the program immediately takes effect and alters the state of the shape being constructed. Figure 4 shows an example of a simple shape being imperatively constructed. Declaring a cuboid instantiates a new piece of cuboid geometry with the requested dimensions, centered at the origin. Invoking the attach command alters the cuboid, potentially translating, rotating, or resizing it in order to satisfy the attachment (see Appendix A for details). Higher-level macros expand into two or more Cuboid or attach lines, which are then immediately executed (see Appendix B for details).

One distinct advantage of this imperative semantics, as opposed to an alternative formulation in which the program specifies constraints which are jointly optimized, is that the entire process of executing a program is end-to-end differentiable. That is, it is possible to compute the gradient of the program's output geometry with respect to the continuous parameters in the text of the program (e.g., cuboid dimensions, attachment point locations). We make use of this feature in results shown later in this paper.

Start     → BBlock; CBlock; ABlock; SBlock;
BBlock    → bbox = Cuboid(l, h, w, True)
CBlock    → cn = Cuboid(l, w, h, a); CBlock | None
ABlock    → Attach; ABlock | Squeeze; ABlock | None
SBlock    → Reflect; SBlock | Translate; SBlock | None
Attach    → attach(c1, c2, x1, y1, z1, x2, y2, z2)
Squeeze   → squeeze(c1, c2, c3, f, u, v)
Reflect   → reflect(cn, axis)
Translate → translate(cn, axis, m, d)
f         → right | left | top | bot | front | back
axis      → X | Y | Z
l, h, w ∈ R+    x, y, z, u, v, d ∈ [0, 1]    a ∈ {True, False}    n, m ∈ Z+

Table 1. The grammar for ShapeAssembly, our low-level domain-specific "assembly language" for shape structure. A program consists of Cuboid statements which instantiate new geometry and attach statements which connect these geometries together at specified points on their surfaces. Macro functions (reflect, translate, squeeze) form complex spatial relationships by expanding into multiple Cuboid and attach statements.
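To make the differentiability claim concrete, the following is a minimal PyTorch sketch of the translation-only case of attach. The helper attach_translate and its simplified placement logic are our illustration, not the paper's interpreter, which also rotates and resizes cuboids to satisfy attachments (Appendix A).

import torch

def attach_translate(dims, local_uvw, world_point):
    # Place a cuboid so that its surface point at relative coordinates
    # local_uvw (each in [0, 1]) lands on world_point; returns the cuboid
    # center, which is differentiable in dims.
    offset = (local_uvw - 0.5) * dims
    return world_point - offset

dims = torch.tensor([0.6, 0.2, 0.6], requires_grad=True)
center = attach_translate(dims, torch.tensor([0.5, 0.0, 0.5]),
                          torch.tensor([0.0, 0.9, 0.0]))
# Any geometric loss on the placed cuboid backpropagates to the program's
# continuous parameters, here the Cuboid line's dimensions.
corner = center + 0.5 * dims
loss = ((corner - torch.tensor([0.3, 1.2, 0.3])) ** 2).sum()
loss.backward()
print(dims.grad)  # nonzero gradient with respect to the cuboid dimensions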
Fig. 5. The steps of our program extraction pipeline. (a) Fragment of an input hierarchical part graph showing chair back (parent node), chair back frame (blue child), and chair back surface (orange child). (b) Locally flattening the hierarchy so that physically interacting leaf parts become siblings. (c) Shortening leaf parts that intersect other leaf parts. (d) Locating attachment points between parts. (e) Forming leaf parts into symmetry groups.
Handling hierarchy.
Thus far, we have described a language that can generate flat assemblies of parts, but not hierarchical ones. The extension to hierarchical shapes is straightforward: we represent hierarchical shapes by treating select non-leaf cuboids as the bounding box of another program (e.g., the contents of its BBlock). Figure 3 shows an example of a program in which cuboids expand into sub-programs.
5 SHAPES TO TRAINING PROGRAMS

ShapeAssembly allows us to write programs that generate new shapes. However, we are interested in using the language to represent existing shapes in a dataset, so that we can learn to generate novel instances from the same underlying shape distribution. In this section, we describe how we accomplish this goal. Given an input shape, represented as a hierarchical part graph, the process divides into three steps: extracting program information, creating candidate programs, and checking program validity.
To convert hierarchical part graphs into ShapeAssembly programs, we perform a series of data regularizations, record cuboid parameters, locate cuboid-to-cuboid attachments, and identify symmetry groups (Figure 5). We provide a high-level overview of the steps involved here, and a detailed description in Appendix C.
Regularization.
Before we parse program attributes, we attempt to create more regularized part graphs through a series of data-cleaning steps. For instance, in the flattening phase, we restructure the part graph hierarchy so that leaf parts with spatial relationships are more often siblings. In the shortening phase, we decrease the dimensions of leaf cuboids that interpenetrate other leaf cuboids (to create more surface-to-surface part connections).
Cuboids.
Ground truth cuboid dimensions are provided in the input part graphs. A cuboid is marked as aligned if its orientation matches its parent cuboid (with an allowable error of 5 degrees).
Attachment.
To locate cuboid-to-cuboid attachments, we sample a uniform, dense point cloud on each cuboid in the scene. For each pair of cuboids, we compute the intersection of the point clouds. If the intersection set is non-empty, we record an attachment point within the volume formed by the intersection, with preference for locations on the centers of faces. For every cuboid, we then check if any of its parsed attachments could be represented as a squeeze relationship, and replace any that can.
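A simplified version of this test can be sketched as follows, under the assumption of axis-aligned cuboids; unlike the actual procedure, it uses volume samples and returns the overlap centroid rather than preferring face centers, and all names are hypothetical.

import numpy as np

def sample_cuboid_volume(dims, center, n=4096, rng=None):
    # Uniform samples inside an axis-aligned cuboid.
    rng = rng or np.random.default_rng(0)
    return center + (rng.random((n, 3)) - 0.5) * dims

def detect_attachment(dims_a, center_a, dims_b, center_b, eps=0.01):
    # Points of cuboid A that fall (within eps) inside cuboid B approximate
    # the contact region; return a representative attachment point in A's
    # local [0, 1]^3 coordinates, or None if the cuboids do not touch.
    pts = sample_cuboid_volume(dims_a, center_a)
    lo = center_b - 0.5 * dims_b - eps
    hi = center_b + 0.5 * dims_b + eps
    inside = np.all((pts >= lo) & (pts <= hi), axis=1)
    if not inside.any():
        return None
    contact = pts[inside].mean(axis=0)
    return (contact - (center_a - 0.5 * dims_a)) / dims_a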
Symmetry.
To find symmetry groups, we identify collections of cuboids that share a reflectional or translational relationship about either the X, Y, or Z axis of their parent cuboid. For each collection, if all of the member cuboids have the same connectivity relationships, we form them into a symmetry group. Each symmetry group is represented by a transform applied to a single cuboid, and all other members are removed from the graph.
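As an illustration, reflectional groups about one axis could be proposed by greedily pairing mirrored cuboid centers, as in the sketch below; the dimension-equality and connectivity checks the procedure requires are omitted, and the tolerance value is an assumption.

import numpy as np

def find_reflectional_pairs(centers, axis, bbox_center, tol=0.05):
    # Greedily pair cuboids whose centers are mirror images across the given
    # axis plane of the parent bounding volume.
    pairs, used = [], set()
    for i, ci in enumerate(centers):
        if i in used:
            continue
        mirrored = ci.copy()
        mirrored[axis] = 2.0 * bbox_center[axis] - mirrored[axis]
        for j, cj in enumerate(centers):
            if j != i and j not in used and np.linalg.norm(cj - mirrored) < tol:
                pairs.append((i, j))
                used.update((i, j))
                break
    return pairs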
Given the extracted program information, we know the content of the program, but not how the lines should be ordered. To make the task of learning a generative model of programs easier, we aim to extract only a single, "canonical" program for each shape. As the ordering of cuboid and symmetry lines doesn't change the executed geometry, this consistency is enforced by ordering these lines according to the semantic label of each part involved in the line. Ties in this ordering between same part-type cuboids are broken by sorting on centroid position.

Deciding on a single ordering of the attach and squeeze statements is more challenging. Since ShapeAssembly has imperative execution semantics, the order in which these commands are executed is significant: different orderings can potentially create different output geometries. To reduce the space of possible orderings, we only consider programs which follow a grounded attachment order, which we define as follows (a sketch of enumerating such orders appears below):
• Initially, only the shape bounding box is grounded.
• The only valid attachments to perform are those which connect a cuboid to a grounded cuboid.
• After executing an attachment, the newly-attached cuboid becomes grounded.

If there are multiple valid grounding orders, we first discard any orderings that produce worse geometric fits to the target shape. If ambiguities in the attachment ordering still remain, we break ties using (1) the semantic ordering of the cuboids involved in the attachment, (2) preferring attachments from non-aligned to aligned cuboids, and finally (3) preferring attachments from cuboid face-centers.
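The sketch below enumerates grounded attachment orders for a flat program; it is our simplification (it tracks only which cuboids are grounded), with the geometric-fit and semantic tie-breaking left out.

def grounded_orders(attachments, root=0):
    # attachments: list of (a, b) pairs meaning "attach cuboid a to cuboid b";
    # `root` (the shape bounding box) starts grounded, and a cuboid becomes
    # grounded once it has been attached to grounded geometry.
    results = []
    def search(grounded, remaining, prefix):
        if not remaining:
            results.append(prefix)
            return
        for k, (a, b) in enumerate(remaining):
            if b in grounded:  # may only attach to a grounded cuboid
                search(grounded | {a}, remaining[:k] + remaining[k + 1:],
                       prefix + [(a, b)])
    search(frozenset([root]), attachments, [])
    return results

# e.g., grounded_orders([(1, 0), (2, 1), (3, 2), (4, 3), (4, 2)]) enumerates
# the valid orders for a five-cuboid chain with one doubly-attached cuboid.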
Once we extract a canonical ShapeAssembly program, we perform a series of checks to verify the results of our procedure. Programs must pass the following validation steps in order to be added to our training data:
Reconstruction.
Executed programs should recreate the geometry of their respective ground truth part graph. To verify this, we sample point clouds from the surfaces of the ground truth shape and the geometry generated by executing the canonical program. These point clouds are compared using the F-score metric [Knapitsch et al. 2017]; a program is filtered out if it produces an F-score lower than 75.
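A typical implementation of this check, with an assumed distance threshold (the paper defers metric details to the F-score reference), looks like:

import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, tau=0.02):
    # F-score between two point clouds (N, 3): the harmonic mean of precision
    # (fraction of predicted points within tau of the ground truth) and recall
    # (the reverse), scaled to [0, 100].
    d_pred = cKDTree(gt_pts).query(pred_pts)[0]
    d_gt = cKDTree(pred_pts).query(gt_pts)[0]
    precision = float((d_pred < tau).mean())
    recall = float((d_gt < tau).mean())
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)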
Semantics.
Programs must respect the semantics of ShapeAssembly. For instance, within each program, the connectivity graph of all parts should have only one component. Likewise, executed programs should not create geometry that extends beyond the bounding volumes they define.
Complexity.
Programs that are overly complex (more than 12 leaf cuboid instantiations) are discarded. Note that, when executed, programs can still produce more than 12 leaf cuboids through expanding symmetry macros.
6 LEARNING TO GENERATE PROGRAMS

Given the programs extracted from our dataset, we now have the data we need to train a neural network to write novel hierarchical ShapeAssembly programs for us. In this section, we describe the generative model architecture we use, our learning procedure, and how we sample new shapes from the learned model.
Figure 6 shows our generative model architecture. It is a hierarchical sequence VAE. The encoder branch embeds a hierarchical ShapeAssembly program into a latent space. The decoder branch converts a point in this latent space into a hierarchical ShapeAssembly program. The bottleneck of our network is parameterized by separate μ and σ vectors in the standard variational autoencoder (VAE) setup.

The dark grey callout in Figure 6 illustrates the operation of our decoder within a single node of the program hierarchy. The decoder receives as input the latent code z_par of its parent node (or the root latent code from the encoder, if it is the root node of the hierarchy). This latent code is used to initialize the hidden state of a
Cuboid , attach , squeeze , translate and reflect . • f cube : (4): Predicts the length, width, height, and alignedflag for cuboid lines, conditioned on the bounding volumedimensions. • f idx : (11 × Predicts the indices of the cuboids involved inthe line represented as 3 one-hot vectors, conditioned on thepredicted command. We limit each node in the hierarchy tocontain at most 10 children parts, so there are 11 choices (10cuboids and the bounding volume). • f att : (3 × Predicts the ( x , y , z ) coordinates involved inan attach line, conditioned on the cuboids involved in theattach. • f sqz : (8): Predicts the the face involved in a squeeze line as aone-hot vector in the first 6 indices. The last 2 indices predictthe ( u , v ) coordinates. Both predictions are conditioned onthe cuboids involved in the squeeze operation. • f sym : (5): Predicts the axis involved in a symmetry line as aone-hot vector in the first 3 indices. For translate lines, the4th index is the number of cuboids involved in the symmetrygroup, and the 5th index is the scale of the symmetry. Allpredictions are conditioned on the cuboid involved in thesymmetry and the bounding volume dimensions.
Hierarchical decoding.
To generate a hierarchical program, our decoder also includes a submodule f_child which is executed after every predicted Cuboid command to determine whether that cuboid should be recursively expanded. This is another MLP which takes as input both the current hidden state of the GRU as well as z_par, the overall latent code for this hierarchy node. f_child produces two outputs: a Boolean flag for whether the current cuboid should be expanded into a child program, and a new latent code z_child which is used to initialize the decoder for this child program.

We implement our models in PyTorch [Paszke et al. 2017]. All training is done with the Adam optimizer [Kingma and Ba 2014], with a learning rate of 0.0001, without batching. All multilayer perceptrons have 3 layers and use leaky ReLU [Maas 2013] with α = 0.2. We train our model in a seq2seq fashion, where the ground truth input sequence is teacher-forced to the model, and our model is tasked with predicting each subsequent line.
During training, we use a program reconstruction loss that only considers entries of the predicted 63-dimensional vector that are relevant to the target line. For instance, when predicting a Cuboid line, no part of the reconstruction loss comes from the indices in the tensor associated with symmetry. The program reconstruction loss is comprised of a cross-entropy component for each one-hot prediction (with weight 1) and an L1 loss for each continuous component (with weight 50). Additionally, we use a KL loss in the standard VAE setup, with weight 0.1 [Kingma and Welling 2014].

Fig. 6. Architecture of our hierarchical sequence VAE for ShapeAssembly programs. Given a ShapeAssembly program, the encoder ascends the hierarchy from the leaves to the root, encoding each sub-program into a latent z vector. Given a latent code, the decoder recursively decodes a hierarchical ShapeAssembly program. Within each hierarchy node, a recurrent neural network decodes each line of the program.
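A sketch of this loss, assuming the relevant index ranges for the target command have already been computed (the masking bookkeeping here is ours, not the paper's):

import torch
import torch.nn.functional as F

def line_loss(pred, target, onehot_blocks, cont_idx, w_cont=50.0):
    # pred, target: 63-dim line vectors. onehot_blocks lists (start, end)
    # ranges of the one-hot groups relevant to the target command; cont_idx
    # lists the relevant continuous entries. Irrelevant entries contribute
    # nothing, mirroring the masking described above.
    loss = pred.new_zeros(())
    for lo, hi in onehot_blocks:
        loss = loss + F.cross_entropy(pred[lo:hi].unsqueeze(0),
                                      target[lo:hi].argmax().unsqueeze(0))
    if cont_idx:
        idx = torch.tensor(cont_idx)
        loss = loss + w_cont * F.l1_loss(pred[idx], target[idx])
    return loss

def kl_loss(mu, logvar, w_kl=0.1):
    # Standard VAE KL term against a unit Gaussian, with the stated weight.
    return w_kl * -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())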
Enforcing semantically-valid output.
As our model generates shape programs, rather than raw shape geometry, we can use the semantics of the ShapeAssembly language to detect outputs that would be invalid, and prevent them from happening. For instance, attaches must be made in a grounded order. If a predicted attach line violates such a constraint, we use a backtracking procedure to find new 'valid' parameter values whenever possible. During unconditional generation, if we cannot fix the line through backtracking, we reject the sample. During interpolation, if we cannot fix the line through backtracking, we don't add the predicted line to the program. Appendix D describes the complete semantic validity procedure we enforce. We also note that this approach to forbidding the generation of invalid outputs is similar to that of the Grammar Variational Autoencoder [Kusner et al. 2017]. However, that model only uses grammar syntax to determine whether an output is valid, whereas we use program semantics.
7 RESULTS

In this section, we demonstrate our learned generative model's ability to synthesize high-quality hierarchical ShapeAssembly programs, and we compare it to alternative generative models of 3D shape structure. All of the experiments described were run on a GeForce RTX 2080 Ti GPU with an Intel i9-9900K CPU, and consumed 3GB of GPU memory.

We use objects from the PartNet dataset [Mo et al. 2019b] as our training data. It contains 3D shapes in multiple categories, each with a hierarchical part segmentation and labeling. For the experiments in this paper, we use the Chairs, Tables, and Storage categories. After running the extraction procedure described in Section 5, we obtain 3835 Chair, 6536 Table, and 1551 Storage ground truth programs.
In this section, we present both qualitative and quantitative evaluations of our method's ability to produce novel shape structures. Figure 7 includes some unconditionally generated samples from our learned generative model for each of the three shape categories. Above each sample we show its nearest neighbor in the training data based on Chamfer distance. Additionally, below each sample we visualize its nearest neighbor in the training data based on program distance, the string edit distance of a tokenized version of our hierarchical programs. As shown, our method is able to generate complex and interesting structural variation without copying either the geometry or program structure of its training data.

As our model directly generates programs, its outputs can be easily edited to produce variants. In Figure 8 we demonstrate how, by changing just the continuous parameters of programs generated by our model, we are able to create a wide variety of output geometry, all the while maintaining part-to-part attachment relationships.

We compare the generated results of our method against two baselines:
• StructureNet is a variational autoencoder that generates hierarchical part graphs with cuboids at each node [Mo et al. 2019a].
• 3D-PRNN is a recurrent neural network that generates a sequence of cuboids [Zou et al. 2017]. It enforces global bilateral symmetry by only generating cuboids with some part of their geometry on the negative side of the x = 0 plane.

For comparisons with StructureNet on reconstruction tasks, we only consider shapes that appear in the validation splits of both methods. We compare against a version of 3D-PRNN that was re-trained on the data we use for our generative model. Figure 9 shows a qualitative comparison of unconditionally generated samples from each method. Our method is capable of generating diverse, structurally complex 3D shape structures across multiple categories. Attachment as a primary operation provides a strong inductive bias for generating physically plausible shapes that maintain realistic part-to-part relationships. In contrast, both comparison methods, which directly predict part placements in 3D space, are prone to producing floating cuboids or jumbled collections of spatially colocated parts.

Fig. 7. In the middle row, we show samples from our generative model of ShapeAssembly programs. In the top row, we show the nearest neighbor shape in the training set by Chamfer distance. In the bottom row, we show the nearest neighbor shape in the training set by program edit distance. Our method synthesizes interesting and high-quality structures that go beyond direct structural or geometric memorization. We quantitatively examine ShapeAssembly's generalization in Table 4. Refer to the supplemental material for the corresponding program text.

Fig. 8. Programs, by way of representational form, allow for easy semantic editing of generated output. Each column shows a sample from our model in the top row. In the bottom row we create a variant with the same structure, but different geometry, by editing only the continuous parameters of the program. Program text can be found in the supplemental material.
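Both nearest-neighbor measures above are simple to state; a sketch follows, where the program tokenization is left abstract (the paper does not fully specify it here):

import numpy as np
from scipy.spatial import cKDTree

def chamfer(a, b):
    # Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).
    return cKDTree(b).query(a)[0].mean() + cKDTree(a).query(b)[0].mean()

def edit_distance(toks1, toks2):
    # Levenshtein distance between two token sequences.
    m, n = len(toks1), len(toks2)
    D = np.zeros((m + 1, n + 1), dtype=int)
    D[:, 0] = np.arange(m + 1)
    D[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + int(toks1[i - 1] != toks2[j - 1]))
    return int(D[m, n])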
We also quantitatively compare the quality of the shape structures generated by different methods. Our desiderata for generated shape structures are that they should be physically plausible and come from the same distribution that the model was trained on. In order to assess the quality of generated output, we use the following metrics:
• Rootedness ⇑ (% rooted): The percentage of shapes for which a connected path exists between the ground and all leaf parts.
• Stability ⇑ (% stable): The percentage of shapes which remain upright under gravity and small forces in a physical simulation.
• Realism ⇑ (% fool): The percentage of test set shapes classified as "generated" by a PointNet classifier trained to distinguish between generated shapes and shapes from the training dataset.
• Frechet Distance ⇓ (FD): Measurement of distributional similarity between generated shapes and the training dataset, using the feature space of a pre-trained PointNet model [Heusel et al. 2017] (sketched below, after the ablation analysis).

Further details about these metrics are provided in Appendix E. We show results for these metrics on 1000 unconditionally generated shapes in Table 2. Our method largely outperforms 3D-PRNN and StructureNet across these metrics for three categories of shapes. While StructureNet achieves good rootedness scores, especially for the Storage category, our method performs better in the other three metrics along all categories. The samples from 3D-PRNN achieve similar FD and % fool scores to StructureNet, but perform markedly worse on the rootedness and stability metrics.

Additionally, in this experiment we compare our model with a series of ablated versions:
• Flat: Training on programs with no hierarchies, only leaf parts.
• No Order: Training on programs without the canonical ordering described in Section 5.
• No Align: Training on programs without an aligned flag for cuboids.
• No Macros: Training on programs without squeeze, translate, or reflect commands.
• No Reject: At generation time, discard unfixable, invalid program line predictions instead of rejecting the entire sample.

Training without hierarchy (Flat) slightly improves rootedness, but drastically lowers the quality of output, as seen in the % fool and FD columns. Training on programs without a canonical ordering (No Order) performs worse on every metric. Removing the alignment flag (No Align) actually improves performance on the Chair category for % rooted and % fool, but drastically worsens the physical validity of generations for Tables and Storage, categories where parts are much more often aligned with their parent cuboid. Training without macros (No Macros) once again decreases the performance of all metrics, but not by a substantial margin. Finally, we see that while the rejection sampling step does improve the quality of our generated samples, without it we still outperform 3D-PRNN and StructureNet by a wide margin.
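The FD metric referenced above follows the Frechet formulation of Heusel et al. [2017], applied to PointNet features; a sketch (the feature extractor itself is omitted):

import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    # Frechet distance between Gaussians fit to two feature sets (N, D),
    # e.g., PointNet features of generated vs. training shapes.
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # strip numerical noise from the matrix sqrt
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))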
In this section, we quantitatively analyze our previous claim that directly predicting programs improves editability. We claim that a program is more editable if it is both compact and comprised of higher-level functions. That is, a shorter program that uses higher-level constructs will be easier to understand and make changes to.

As a strong baseline, we evaluate the editability of our programs against the generated outputs of 3D-PRNN and StructureNet. As 3D-PRNN and StructureNet do not directly produce ShapeAssembly programs, we use our extraction procedure described in Section 5 in order to convert their generations into programs. As StructureNet predicts part graph hierarchies, the representational form our extraction procedure takes as input, we use our procedure without any of the data cleaning steps. As 3D-PRNN has no notion of hierarchy, we create single-node part graphs out of their output samples, which are then run through our program extraction logic.

Table 3 shows results from an experiment where we compare the ShapeAssembly programs of each method's generations (directly predicted by our method, parsed from the outputs of the comparison methods). The metrics we use are the number of lines in each program (as a coarse measure of compactness) and the percentage of lines which are macros (split by macro type).

Compared with programs parsed from StructureNet, the programs generated by our model are much more compact and have higher rates of macro usage across all categories of shapes. While our method also has higher macro usage rates compared with 3D-PRNN, 3D-PRNN programs are more compact in the Chair and Table categories. Based on 3D-PRNN's poor performance within our shape quality experiments (Table 2), and its significant deviation from the number of lines found in the ground truth programs (the cleanest set of ShapeAssembly programs we have access to), there is reason to believe that the compactness of its parsed programs more likely reflects shape simplicity rather than useful editability.

Category  Method            % rooted ⇑  % stable ⇑  % fool ⇑  FD ⇓
Chair     Ours              94.5
Table     Ours (No Macros)  95.9        85.0        33.16     53.21
Table     Ours (No Reject)  94.1        76.4        29.20     52.78
Storage   Ours (No Macros)  87.5        69.9        5.92      72.80
Storage   Ours (No Reject)  94.3        80.9        11.66     31.69
Storage   Ours              95.3

Table 2. Comparing the quality of generated samples. Our method outperforms other generative methods for 3D shape structure in terms of realism and physical validity. Through a series of ablation baselines, we validate various design decisions of our method.
Beyond quality and editability, we also consider the variability of outputs of each method. Specifically, for generated shapes, we care about their novelty with respect to the training data, their complexity, and their variety. We present results of an experiment using Chamfer distance to quantify performance across these areas in Table 4.

The Generalization metric measures the average distance of each generated sample to its nearest neighbor in the training set. As all methods have higher generalization scores than the validation set, we can conclude that none of the methods appear to be overfitting. For our method specifically, this re-enforces the qualitative nearest neighbor results presented in Figure 7.

The Coverage metric measures the average distance of each validation shape to its nearest neighbor in the set of generated shapes. Across all categories our method achieves the best results, and by a wide margin for tables, which indicates that our generations have enough complexity to match the distribution of the validation shapes.

The Variety metric measures the average distance of each generated shape to its nearest neighbor in the set of generated shapes besides itself. Once again, across all categories our method achieves top, or tied for top, performance.
Fig. 9. Qualitative comparison between generated samples from our method, StructureNet, and 3D-PRNN. Across different categories, our method creates novel ShapeAssembly programs that, when executed, produce shape structures that maintain realistic and physically valid part-to-part relationships. Comparison methods that directly predict 3D shape geometry exhibit failure cases where parts become disconnected or intersect in an implausible manner.

                                 Macros Per Line
Category  Method        Lines ⇓  Refl ⇑  Trans ⇑  Squeeze ⇑  Total ⇑
Chair     Ground Truth  24.4     0.0800  0.0090   0.1130     0.2070
Table     Ground Truth  20.0     0.0950  0.0050   0.1450     0.2460
Storage   Ground Truth  24.7     0.0650  0.0147   0.1510     0.2320

Table 3. Markers of program editability for ShapeAssembly programs predicted by our generative model, compared with ShapeAssembly programs parsed from the outputs of other generative methods. Training our model in the space of programs allows us to represent geometry more compactly. We find higher rates of macro functions per program line in our method's generations compared with extracting programs from other generative models' predictions.

Additionally, we look at the average number of leaf parts as a coarse proxy for the complexity of a shape's structure, which is shown in Table 5. While our method has a similar number of leaf parts to the comparison methods for the Chair and Table categories, we do have fewer leaf parts on average for Storage. Qualitatively, these additional parts in the comparison methods often manifest as collections of spatially colocated cuboids, and not necessarily more complex shape structures.

In terms of the variability of programs generated by our method, we note that 65% of Chair programs, 85% of Table programs, and 53% of Storage programs contained ShapeAssembly program structures not present in the training data. Thus our method exhibits novelty not only in the geometric domain, but also in the structural domain.
Our approach is predicated on the assumption that a single program can represent a parametric family of multiple shapes, allowing this shape space to be explored via manipulation of interpretable program parameters. To verify whether this is true, we cluster shapes that are represented by structurally-equivalent programs (i.e., programs that are the same up to continuous parameter variations). Figure 10 shows program clustering results for the ground truth programs we parse from PartNet. These results demonstrate how the structure of a single ShapeAssembly program is able to represent related shapes through different parameterizations. The marked improvement in clustering when splitting by intermediate part programs, compared with clustering on entire shape programs, provides additional support for our hierarchical approach; shape programs are more likely to share structure within a node of the hierarchy than they are to match entire hierarchies exactly.
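Structural equivalence can be operationalized by masking continuous parameters out of a tokenized program, as in this sketch; the tokenization and the dictionary layout are hypothetical.

from collections import defaultdict

def is_number(tok):
    try:
        float(tok)
        return True
    except ValueError:
        return False

def structure_key(tokens):
    # Two programs get the same key exactly when they are structurally
    # equivalent, i.e., identical up to continuous parameter values.
    return tuple('<num>' if is_number(t) else t for t in tokens)

def cluster_by_structure(programs):
    # programs: {shape_id: token sequence}; returns clusters of shape ids
    # sharing one program structure.
    clusters = defaultdict(list)
    for sid, toks in programs.items():
        clusters[structure_key(toks)].append(sid)
    return clusters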
                        Generalization        Coverage             Variety
                        NND to Train ⇑ (CD)   NND from Val ⇓ (CD)  NND to Self ⇑ (CD)
Category  Method
Chair     StructureNet  0.104                 0.119                0.087
Chair     Ours          0.108
Chair     Validation    0.105                 —                    0.114
Table     Validation    0.09                  —                    0.099
Storage   StructureNet  0.129                 0.135                0.107
Storage   Ours          0.125
Storage   Validation    0.11                  —                    0.125

Table 4. We compare the geometric variability of generated shapes from different methods. In the first column, we measure generalization as the average nearest neighbor distance (NND) from generated samples to shapes in the training set. In the second column, we measure coverage as the average NND from shapes in the validation set to generated samples. In the last column, we measure variety as the average NND from generated samples to other generated shapes in the same set. Across three categories of shapes, our method performs the best on the coverage and variety metrics, while outperforming validation on generalization (demonstrating we are not overfitting).
Table 5. We compare the average number of leaf parts in generated shapes, as a coarse proxy for the complexity of shape structure. Our method generates similar numbers of leaf parts compared with other methods for Chairs and Tables, but fewer leaf parts for Storage. Qualitatively, the additional leaf parts measured in comparison methods often manifest as spurious overlapping cuboids, rather than more complex structural variety.
While collections of part proxies are a useful modeling representation for 3D shape structures, they do not directly attempt to capture the wide range of intra-part variability present in man-made objects. We demonstrate how ShapeAssembly programs can additionally be used to model parts at finer levels of detail by turning ShapeAssembly programs into dense point clouds. As a proof of concept, we augment our generative model with a point cloud encoder that consumes dense point cloud samples of ground truth leaf parts, and a point cloud decoder that generates dense point clouds for every leaf part within
ACM Trans. Graph., Vol. 39, No. 6, Article 234. Publication date: December 2020. hapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis • 234:13
Avg. Step Size ⇓ Category Method Geo Prog
Chair
StructureNet
Table
StructureNet 0.0474 4.75Ours
Storage
StructureNet 0.0512 4.29Ours
Table 6. We measure smoothness along random high-frequency interpo-lation sequences in each method’s latent space. The Geo column mea-sures smoothness with Chamfer distance, while the Prog column measuressmoothness with program edit distance. Note that 3D-PRNN is missingbecause it is not a latent variable model and thus does not support interpo-lation.Fig. 10. Clustering results that demonstrate how the structure of a sin-gle ShapeAssembly program is capable of capturing a family of relatedshapes. Using ground truth programs found with our program extractionprocedure, in the left graph we plot the percentage of shapes captured aswe consider more program structures extracted from the data. In the rightgraph we show the same plot but with parts (nodes) instead of shapes (fullhierarchy). its predicted bounding volume. Figure 11 shows some qualitativeresults of our method, trained on point clouds sampled from thedense geometry of Chairs found in PartNet. These generated sur-faces provide additional detail over the geometry specified by theircuboid part proxies, as evidenced by both the rounding in the legsand back slats, and also in the curvature of the chair back surfaces.
Beyond novel shape generation, we evaluate the ability of our method to interpolate between two points in our latent space. The presence of smooth, semantic transitions between end-points indicates a well-formed latent space. In Figure 12 we qualitatively compare our method with StructureNet on the task of interpolating between shapes in the validation sets of both models. Our interpolations demonstrate both geometrically smooth and semantically consistent transitions. For instance, in the top interpolation sequence, the surface of the chair back in the source shape gradually shrinks vertically until in the target shape it is just a horizontal bar. At the same time, the number of vertical slats in the chair back gradually increases from 2, to 4, to 5.

Fig. 11. Converting generated ShapeAssembly programs into dense point clouds (panels: Part Cuboids, Surface Points). We use a point cloud decoder to predict the surface geometry of each leaf part proxy in our 3D shape structure. In this process, geometric details begin to take form, at the cost of some artifacts. We discuss a method for improving this procedure in Section 8.
In Table 6, we attempt to quantify the smoothness along random interpolation sequences within the latent space of each generative model. In this experiment, 100 interpolation sequences were computed from sources to targets that were randomly sampled in each model's latent space, with 100 interpolation steps per sequence. Each method's geometric smoothness is computed by taking the average Chamfer distance (normalized by shape scale) between adjacent interpolation steps. The lower average geometric step size of our method, compared to StructureNet in the Table and Storage categories, demonstrates the quality of the latent space learned by our method. Moreover, using our procedure to turn StructureNet outputs into ShapeAssembly programs, we can measure the program smoothness along these interpolation paths. Each method's program smoothness is computed by taking the average tokenized program edit distance between adjacent interpolation steps. As a measure of structural change throughout the transitions of an interpolation sequence, our lower program step size again shows how our method benefits from operating within the space of 3D shape programs.
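The measurement above reduces to averaging two distances over adjacent steps of a decoded interpolation path. The sketch below assumes a decode(z) function that executes a latent code into a (point cloud, token sequence) pair, plus chamfer and edit_distance helpers; all names are illustrative assumptions.

# Sketch of per-sequence smoothness measurement along an interpolation.
import numpy as np

def interpolation_smoothness(z_src, z_tgt, decode, chamfer,
                             edit_distance, steps=100):
    """Average geometric and program step size between adjacent steps."""
    alphas = np.linspace(0.0, 1.0, steps)
    decoded = [decode((1 - a) * z_src + a * z_tgt) for a in alphas]
    geo, prog = 0.0, 0.0
    for (pc0, toks0), (pc1, toks1) in zip(decoded, decoded[1:]):
        geo += chamfer(pc0, pc1)          # Geo column of Table 6
        prog += edit_distance(toks0, toks1)  # Prog column of Table 6
    return geo / (steps - 1), prog / (steps - 1)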
Another way to inspect the structure of a generative model's latent space is through performing "synthesis from X", by projecting X into the latent space of the generative model. As an application to 3D reconstruction, we are able to perform such a projection with point clouds, demonstrating how our generative model's latent space can synthesize ShapeAssembly programs from unstructured geometry. Specifically, we train a PointNet++ encoder [Qi et al. 2017] to map point clouds sampled on dense mesh geometry to the latent space learned by our generative model. These latent codes are then converted into programs by our trained decoder.

In Table 7, we show an experiment comparing our method against StructureNet on the task of reconstructing point cloud samplings of dense geometry on the intersection of each method's validation set for Chairs in PartNet (463 shapes total). We evaluate reconstruction accuracy with F-score [Knapitsch et al. 2017], and the physical validity of reconstructions with the rootedness and stability metrics.
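One way to realize the projection step described above is to regress, for each training shape, the latent code that the generative model assigns to its ground-truth program. The sketch below assumes a PointNet++-style encoder module and precomputed target codes; the MSE regression objective is one plausible setup, not necessarily the paper's exact training loss.

# Sketch of latent-space projection training; names are illustrative.
import torch
import torch.nn as nn

def train_projection(encoder, loader, epochs=10, lr=1e-4):
    """Map sampled point clouds to the generative model's latent space."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for points, z_target in loader:  # (B, N, 3), (B, latent_dim)
            z_pred = encoder(points)
            loss = nn.functional.mse_loss(z_pred, z_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder

# At test time: program = decoder(encoder(point_cloud)).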
Fig. 12. A qualitative comparison of latent space interpolation between our method and StructureNet on shapes from the validation set (rows alternate between StructureNet and Ours, running from source shape through interpolations to target shape). Our method's interpolations within program space produce sequences that combine smooth continuous variation with discrete structural transitions.
Method                F1 ⇑    % rooted ⇑    % stable ⇑
StructureNet          24.3    95.1          78.4
Ours
SN + Opt Cuboids
SN + Opt Program
Ours + Opt Cuboids
Ours + Opt Program
[Values for the remaining rows were lost in extraction.]
Table 7. Results from our point cloud reconstruction experiment. Our model's well-formed latent space allows for more accurate and physically valid reconstructions without further optimization. With additional optimization, using the reconstructed program from our method and our differentiable interpreter finds the best trade-off between reconstruction accuracy and maintaining physical validity.

When projecting point clouds into the latent space of each method (top two rows), our method outperforms StructureNet on both reconstruction accuracy and maintaining physical validity. This demonstrates, once again, the well-structured nature of our method's latent space.

Moreover, as the ShapeAssembly interpreter is differentiable, we can further refine the continuous parameters of a program by minimizing the Chamfer distance between executed geometry and a target point cloud with a gradient-based optimizer. We compare this procedure (Ours + Opt Program) against the following conditions:

• SN + Opt Cuboids: Starting with StructureNet's reconstruction, then directly optimizing predicted cuboids to minimize Chamfer distance to the target point cloud.
• SN + Opt Program: Parsing StructureNet's reconstruction into a ShapeAssembly program, then optimizing the program to minimize Chamfer distance to the target point cloud.
• Ours + Opt Cuboids: Starting with our reconstruction, then directly optimizing predicted cuboids to minimize Chamfer distance to the target point cloud.

We show results for this experiment in the last four rows of Table 7. All of the optimization procedures improve reconstruction accuracy at the cost of physical validity. However, Ours + Opt Program is the only condition that achieves a desirable trade-off in this exchange, gaining much more in reconstruction accuracy than it loses in physical validity.

We show some qualitative results of this experiment in Figure 13. Through latent space projection, our model is able to output the rough 3D shape structure (column 1) of an input unstructured point cloud (column 0). Through our differentiable interpreter, we are able to find continuous parameters for the predicted program structure that ultimately lead to better reconstruction fits (column 3). Shape programs place a strong structural regularization prior over unstructured 3D data, and thus our presented method is less prone to "losing" semantic parts, such as small legs, in comparison to the other conditions.
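The refinement stage described above amounts to gradient descent through the interpreter. Below is a minimal sketch, assuming a differentiable execute(params, structure) that returns sampled surface points built from torch ops, and a differentiable chamfer helper; these names are stand-ins, and the paper's interpreter is more involved.

# Sketch of the "Opt Program" refinement: fit continuous program
# parameters to a target point cloud by minimizing Chamfer distance.
import torch

def refine_program(structure, init_params, target_pts, execute,
                   chamfer, steps=500, lr=1e-2):
    """Gradient-descend continuous program parameters to fit a target."""
    params = init_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        pred_pts = execute(params, structure)  # differentiable execution
        loss = chamfer(pred_pts, target_pts)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()

Because the program structure stays fixed and only continuous parameters move, attachments and symmetries are preserved during this fit, which is why this condition loses less physical validity than directly optimizing cuboids.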
In this paper, we took a first step toward marrying the complementary strengths of neural and procedural 3D shape generative models by introducing a hybrid neural-procedural approach for synthesizing novel 3D shape structures. We introduced ShapeAssembly, a low-level "assembly language" for shape structures, in which shapes are constructed by declaring cuboidal parts and attaching them to one another. We also introduced a differentiable interpreter for ShapeAssembly, allowing the optimization of program parameters to produce desired output geometry. After describing how to extract consistent programs from existing shape structures in the PartNet dataset, we then defined a deep generative model for ShapeAssembly programs, effectively training a neural network to write novel shape programs for us. We evaluated the quality of the generative model along several axes, showing that it produces more plausible and physically-valid shapes, and that its latent space is better-structured than that of other generative models of shape structure. We also found that directly generating shape programs leads to more compact, editable programs than extracting programs from shapes generated by methods that directly output 3D geometry.
Limitations.
As mentioned in Section 5, we do not successfully extract training programs from every shape in our dataset. For instance, our program extraction procedure assumes that the orientation of all parts can be specified solely through part-to-part attachments; yet, as demonstrated in Figure 14, this does not hold for all shapes. While it is possible to reconstruct these shapes with ShapeAssembly programs (by attaching parts to "floating" points in space via the bounding volume), such programs will never be added to our training data, and thus our generative model won't learn to produce such constructs. Our design decision to discard training programs with more than 12 total Cuboid declarations has a similar effect: it limits our generative model from synthesizing the most complex shape structures that exist in our dataset. We impose such strict criteria in order to make our training programs exhibit more regularity, simplifying the learning task for our neural network at the expense of its potential expressivity.

This highlights a central tradeoff: higher variability in the training programs may result in lower-quality shapes synthesized by a generative model. This phenomenon is not unique to our setting: it is well known that, e.g., image generative models perform better on very regularly structured domains, such as human faces. The question, looking forward, is how to capture more data variability while keeping a high degree of regularity in the input data representation. We believe that using programs as a data representation is the best avenue of attack here. As we have shown in our work, a single program can capture a wide range of parametrically related shapes. One program, many shapes; strong regularity, but also high variability. We are excited to investigate extensions of ShapeAssembly, as well as other shape-generating languages, which can capture even more shape variability with highly regular structures.

While ShapeAssembly has a strong inductive bias for generating physically-connected shapes, it is not guaranteed to do so. Hierarchical part structures which are locally connected everywhere may occasionally still exhibit disconnected leaf cuboids. This is more likely to happen with very non-axis-aligned structures that result in loose bounding cuboids at the intermediate levels of the hierarchy. It is worth investigating mechanisms to guarantee that ShapeAssembly programs maintain leaf-to-leaf connectivity.
Fig. 13. Qualitative comparison of synthesis from point clouds of our method against StructureNet (SN). Columns show: Input Points, Ours, Ours + Opt Cuboids, Ours + Opt Program, SN, SN + Opt Cuboids, SN + Opt Program. Our method is able to infer good program structures that match well with the unstructured geometry. The continuous parameters of this program structure can be further refined through an optimization procedure in order to better fit the target point cloud without creating artifacts.

Fig. 14. Examples of PartNet shapes that contain parts whose orientations cannot be inferred from part-to-part attachments alone. While these shapes can be represented with ShapeAssembly programs that attach parts to "floating" points within the bounding volume, such programs are not added to our training data during our program extraction phase. As a result, our generative model never learns to produce shapes that require this type of attachment pattern.
Future work.
In Section 7, we showed an example of refining our generated hierarchical cuboid structures with point cloud surface geometry. This is not a new idea; other recent related work on part-based shape generation takes a similar approach to refining high-level part structures [Gao et al. 2019; Li et al. 2017; Mo et al. 2019a]. However, there is more work to be done at the intersection of structure generation and surface generation. These two paradigms could be much more closely married than they have been thus far, as existing part-surface generation has been explored largely in an independent, part-by-part fashion. What would it look like to swap out the "surface style" code for a shape while retaining its "structure" code? Our procedural representation may confer distinct advantages here, as the attachments explicitly specify where and how part geometries must connect.

It would also be interesting to move beyond cuboids as the proxy geometry used for atomic parts, as not all atomic parts are well-approximated by rectilinear geometry. In some cases, spherical, cylindrical, or more general curvilinear geometry would be a better choice. Pursuing this direction would help push more shape variability into the procedural representation, so that we do not lean so heavily on the neural network to capture it.

Another way to push knowledge from the learned latent space into the programs would be to make the programs include constraints on their parameters: either independent bounds, or correlations between parameters. For instance, it is non-semantic to make a chair leg too thin, or to make a chair back much narrower than the seat to which it is attached. It should be possible to mine shape datasets for this information, and to include it in the data used to train the generative model.

There are also more opportunities to apply generative models of shape programs to "synthesis from X" applications. While we showed translation from point clouds to shape programs, there are many more exciting possibilities in terms of linking 3D geometry, 2D images, and shape programs, and seamlessly using the three modalities to author different forms of shape manipulations.

Finally, if we aim for our generated shapes to be useful in embodied AI applications, they should also be equipped with information about kinematics and/or dynamics. For example, a program which specifies a cabinet could also specify the type of hinge with which the door attaches to the body, and how far that hinge opens. Ultimately, we believe that shape programs, and generative models which produce them, are the right fundamental representation for both human creative tasks and AI analysis tasks involving part-based 3D shapes.
ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their helpful suggestions. Renderings of part cuboids and point clouds were produced using the Blender Cycles renderer. This research was supported by the National Science Foundation (
REFERENCES
Ben Abbatematteo, Stefanie Tellex, and George Konidaris. 2019. Learning to Generalize Kinematic Models to Novel Objects. In Proceedings of the Third Conference on Robot Learning.
Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. 2019. BSP-Net: Generating Compact Meshes via Binary Space Partitioning. (2019). arXiv:cs.CV/1911.06971
Zhiqin Chen and Hao Zhang. 2019. Learning Implicit Fields for Generative Shape Modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
İ. Demir, D. G. Aliaga, and B. Benes. 2016. Proceduralization for Editing 3D Architectural Models.
Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar-Lezama, and Wojciech Matusik. 2018. InverseCSG: Automatic Conversion of 3D Models to CSG Trees. ACM Trans. Graph. 37, 6 (Dec. 2018).
Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. 2019. Write, Execute, Assess: Program Synthesis with a REPL. In Advances in Neural Information Processing Systems (NeurIPS).
Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. 2018. Learning to Infer Graphics Programs from Hand-Drawn Images. In Advances in Neural Information Processing Systems (NeurIPS).
Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 605–613.
Lin Gao, Jie Yang, Tong Wu, Yu-Jie Yuan, Hongbo Fu, Yu-Kun Lai, and Hao (Richard) Zhang. 2019. SDM-NET: Deep Generative Network for Structured Deformable Mesh. In SIGGRAPH Asia.
Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. 2018. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS.
Irvin Hwang, Andreas Stuhlmüller, and Noah D. Goodman. 2011. Inducing Probabilistic Programs by Bayesian Program Merging. CoRR arXiv:1110.5667 (2011).
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. Inferring and Executing Programs for Visual Reasoning. In ICCV.
Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. 2019. Meta-Sim: Learning to Generate Synthetic Datasets. (2019). arXiv:cs.CV/1904.11621
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR).
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4 (2017).
Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. 2017. AI2-THOR: An Interactive 3D Environment for Visual AI. CoRR arXiv:1712.05474 (2017).
Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar Variational Autoencoder. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (ICML'17). JMLR.org, 1945–1954.
Manfred Lau, Akira Ohgawara, Jun Mitani, and Takeo Igarashi. 2011. Converting 3D Furniture Models to Fabricatable Parts and Connectors. ACM Trans. Graph. 30, 4, Article 85 (July 2011), 6 pages. https://doi.org/10.1145/2010324.1964980
Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. 2017. GRASS: Generative Recursive Autoencoders for Shape Structures. ACM Transactions on Graphics (TOG) 36, 4 (2017), 52.
Yunchao Liu, Zheng Wu, Daniel Ritchie, William T. Freeman, Joshua B. Tenenbaum, and Jiajun Wu. 2019. Learning to Describe Scenes with Programs. In International Conference on Learning Representations (ICLR).
Sidi Lu, Jiayuan Mao, Joshua B. Tenenbaum, and Jiajun Wu. 2019. Neurally-Guided Structure Inference. In International Conference on Machine Learning (ICML).
Andrew L. Maas. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models.
A. Martinovic and L. Van Gool. 2013. Bayesian Grammar Learning for Inverse Procedural Modeling. In CVPR.
Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders P. Eriksson. 2019. Deep Level Sets: Implicit Surface Representations for 3D Shape Inference. CoRR abs/1901.06802 (2019).
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. 2019a. StructureNet: Hierarchical Graph Networks for 3D Shape Generation. In SIGGRAPH Asia.
Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. 2019b. PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pascal Müller, Peter Wonka, Simon Haegler, Andreas Ulmer, and Luc Van Gool. 2006. Procedural Modeling of Buildings. In SIGGRAPH.
Gen Nishida, Adrien Bousseau, and Daniel G. Aliaga. 2018. Procedural Modeling of a Building from a Single Image. Computer Graphics Forum (Eurographics) 37, 2 (2018).
Gen Nishida, Ignacio Garcia-Dorado, Daniel G. Aliaga, Bedrich Benes, and Adrien Bousseau. 2016. Interactive Sketching of Urban Procedural Models. ACM Transactions on Graphics (TOG) 35, 4 (2016), 130.
Yoav I. H. Parish and Pascal Müller. 2001. Procedural Modeling of Cities. In SIGGRAPH.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. (2017).
Przemyslaw Prusinkiewicz and Aristid Lindenmayer. 1996. The Algorithmic Beauty of Plants. Springer-Verlag, Berlin, Heidelberg.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems. 5099–5108.
Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for Data: Ground Truth from Computer Games. In European Conference on Computer Vision. Springer, 102–118.
Daniel Ritchie, Sarah Jobalia, and Anna Thomas. 2018. Example-based Authoring of Procedural Modeling Programs with Structural and Continuous Variability. In EUROGRAPHICS.
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. 2019. Habitat: A Platform for Embodied AI Research. In The IEEE International Conference on Computer Vision (ICCV).
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. 2018. CSGNet: Neural Shape Parser for Constructive Solid Geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ondrej Stava, Bedrich Benes, Radomír Mech, Daniel G. Aliaga, and Peter Kristof. 2010. Inverse Procedural Modeling by Automatic Generation of L-systems. Comput. Graph. Forum 29 (2010), 665–674.
Minhyuk Sung, Hao Su, Vladimir G. Kim, Siddhartha Chaudhuri, and Leonidas Guibas. 2017. ComplementMe: Weakly-Supervised Component Suggestions for 3D Modeling. ACM Transactions on Graphics (Proc. of SIGGRAPH Asia) (2017).
Jerry O. Talton, Lingfeng Yang, Ranjitha Kumar, Maxine Lim, Noah D. Goodman, and Radomír Mech. 2012. Learning Design Patterns with Bayesian Grammar Induction. In UIST.
Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, William T. Freeman, Joshua B. Tenenbaum, and Jiajun Wu. 2019. Learning to Infer and Execute 3D Shape Programs. In International Conference on Learning Representations (ICLR).
Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. 2016. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In Advances in Neural Information Processing Systems (NeurIPS).
Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. 2018. Gibson Env: Real-World Perception for Embodied Agents. In CVPR.
Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 2015. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Computer Vision and Pattern Recognition.
Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. 2017. Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
Chenghui Zhou, Chun-liang Li, and Barnabas Poczos. 2019. Program Synthesis for Images using Tree-Structured LSTM. In PGR Workshop at NeurIPS.
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Renjiao Yi, and Hao Zhang. 2018. SCORES: Shape Composition with Recursive Substructure Priors. ACM Transactions on Graphics (TOG) 37, 6 (2018), 211:1–211:14.
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3D-PRNN: Generating Shape Primitives with Recurrent Neural Networks. In IEEE International Conference on Computer Vision (ICCV).

Fig. 15. Illustrating how the attach command executes, depending on the number of existing attachments (left column) to the cuboid in question. Cuboids with no existing attachments can simply be translated into place (top). Cuboids with one existing attachment can be scaled along one axis and then rotated (middle). Cuboids with two or more existing attachments are more complicated, and the attachment may not always be satisfiable. Our interpreter attempts to rotate and scale the cuboid to get as close as possible to a valid solution.

A SEMANTICS OF THE ATTACH COMMAND
In designing the ShapeAssembly interpreter, our goal is to ensure that its internal operations stay limited to simple fixed-function, differentiable operations. Thus, in implementing the attach command, we opt not to use any constrained optimization routines which could resolve a globally-optimal configuration of cuboids given the attachment constraints. Instead, the interpreter immediately executes each attachment as it is declared, i.e. it greedily solves for attachments. To make the behavior of this procedure as predictable as possible, the greedy attachment procedure should induce the fewest changes possible to the current cuboid shapes. With these desiderata in mind, we designed the following procedure for attaching cuboid c_n to cuboid c_m (see Figure 15). The logic that executes depends upon how many prior attachments c_n has, and the aligned flag of c_n:

No prior attachments. In this case, cuboid c_n can connect to cuboid c_m by simply translating until the attach points are colocated.

One prior attachment. Here, the interpreter scales cuboid c_n along one of its axes and then rotates it such that the attachment is satisfied. To choose the axis along which to scale c_n, the interpreter checks how quickly scaling each of its three dimensions would reduce the ratio n/k, where n is the distance between c_n's existing attachment point and the new target attachment point, and k is the distance between c_n's existing attachment point and the new source attachment point. The interpreter then scales c_n by n/k along this dimension, which gives it the correct length. Finally, c_n is rotated such that the source and target attachment points are colinear (and thus colocated).

Two or more prior attachments. In this case, it is not always possible to satisfy the attachment, as three point constraints on a cube may be overconstrained. If a solution exists, however, our interpreter will find it. In the case where no solution exists, it attempts to approximately satisfy the attachment (which we decided to be more user-friendly behavior than throwing an error). First, the interpreter checks if c_n's existing attachment points are all colinear. If they are, then it rotates c_n about this axis of colinearity to make the source attachment point face the target attachment point. The final step is to scale c_n along the normal of the face containing the source attachment point. If the existing attachment points were not colinear, and this face was not rotated to point toward the target attachment point, then this may not be a useful operation (i.e. it may introduce undesirable changes to the cuboid shape while doing little to bring the source point closer to the target point). Thus, the interpreter only executes this scale if the angle between the source face normal and the vector to the target point is smaller than a threshold τ (25 degrees in our implementation).

Aligned Cuboids. Cuboids that are marked as aligned in ShapeAssembly programs cannot have their orientations changed through attachment. In fact, with a correct cuboid dimension parameterization, a single attachment is enough to properly position and orient an aligned cuboid. However, in order to ensure that aligned cuboids remain connected through edits and predictions of our generative model, we minimally grow aligned cuboid dimensions to satisfy the part-to-part connectivity specified through attachments. That is, for aligned cuboids we do not guarantee attachment point colocation after the first attachment, as this is often impossible to exactly fulfill without changing a cuboid's orientation. Rather, we guarantee that aligned cuboids will fulfill attachment relationships with cuboids they are attached to at some attachment point.
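The control flow above can be summarized in a short, runnable sketch. It is heavily simplified: cuboids are kept axis-aligned, rotation is omitted (so the one-attachment case only scales), and the two-or-more and aligned-growth branches are reduced to comments. The class and helper names are illustrative, not the interpreter's actual code.

# Simplified sketch of the greedy attach dispatch described above.
import numpy as np

class Cuboid:
    def __init__(self, dims, aligned=False):
        self.dims = np.asarray(dims, dtype=float)  # (lx, ly, lz)
        self.pos = np.zeros(3)                     # world position of min corner
        self.aligned = aligned
        self.attachments = []                      # local source points so far

    def to_world(self, local):                     # local coords in [0, 1]^3
        return self.pos + np.asarray(local, dtype=float) * self.dims

def attach(c_n, c_m, src_local, tgt_local):
    """Greedily move/scale c_n so its src point meets the tgt point on c_m."""
    target = c_m.to_world(tgt_local)
    if len(c_n.attachments) == 0:
        # No prior attachments: rigid translation into place.
        c_n.pos += target - c_n.to_world(src_local)
    elif len(c_n.attachments) == 1 and not c_n.aligned:
        # One prior attachment: scale along the axis that moves the new
        # source point fastest, keeping the existing attach point pinned.
        # (The real interpreter then rotates c_n; rotation is omitted here.)
        anchor_local = c_n.attachments[0]
        anchor = c_n.to_world(anchor_local)
        n = np.linalg.norm(anchor - target)                  # desired length
        k = np.linalg.norm(anchor - c_n.to_world(src_local))  # current length
        sep = np.abs(np.asarray(src_local) - np.asarray(anchor_local)) * c_n.dims
        axis = int(np.argmax(sep))
        if k > 0:
            c_n.dims[axis] *= n / k
        c_n.pos += anchor - c_n.to_world(anchor_local)       # re-pin anchor
    else:
        # Two or more prior attachments (or aligned cuboids): the interpreter
        # rotates about the colinear axis / conditionally scales along the
        # source face normal / minimally grows dimensions; omitted here.
        pass
    c_n.attachments.append(np.asarray(src_local, dtype=float))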
B SEMANTICS OF SHAPEASSEMBLY MACRO FUNCTIONS
We provide an account of the logic for macro function expansion in ShapeAssembly:
Squeeze.
The squeeze macro is parameterized by three cuboids (c_1, c_2, c_3), a face f, and a (u, v) position in f's 2D coordinate system. A squeeze command expands into two attach functions. The first attach function attaches the center of c_1's f face to the (u, v) position on the opposite face of f on c_2. The second attach function attaches the center of c_1's opposite face of f to the (u, v) position on the f face of c_3. For example, consider the line squeeze(c_1, c_2, c_3, left, .1, .4). It expands into attach(c_1, c_2, 0.0, .5, .5, 1.0, .1, .4) and attach(c_1, c_3, 1.0, .5, .5, 0.0, .1, .4).
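This expansion rule can be captured mechanically. The following sketch reproduces the worked example above for f = left and (u, v) = (.1, .4); the face-to-coordinate encoding and helper names are assumptions for illustration, not the released implementation.

# Sketch of squeeze-macro expansion into two attach commands.
# Map each face name to (axis index, coordinate on that axis).
FACES = {
    'right': (0, 1.0), 'left':  (0, 0.0),
    'top':   (1, 1.0), 'bot':   (1, 0.0),
    'front': (2, 1.0), 'back':  (2, 0.0),
}

def face_point(face, u, v):
    """Local (x, y, z) coordinates of position (u, v) on the given face."""
    axis, coord = FACES[face]
    uv = [u, v]
    return tuple(coord if i == axis else uv.pop(0) for i in range(3))

def opposite(face):
    axis, coord = FACES[face]
    for name, (a, c) in FACES.items():
        if a == axis and c != coord:
            return name

def expand_squeeze(c1, c2, c3, face, u, v):
    """Expand squeeze(c1, c2, c3, face, u, v) into two attach tuples."""
    a1 = ('attach', c1, c2, *face_point(face, 0.5, 0.5),
          *face_point(opposite(face), u, v))
    a2 = ('attach', c1, c3, *face_point(opposite(face), 0.5, 0.5),
          *face_point(face, u, v))
    return [a1, a2]

# squeeze(c1, c2, c3, left, .1, .4) yields attach coordinates
# (0.0, .5, .5) -> (1.0, .1, .4) and (1.0, .5, .5) -> (0.0, .1, .4).
print(expand_squeeze('c1', 'c2', 'c3', 'left', 0.1, 0.4))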
Reflect. The reflect macro is parameterized by a cuboid c_n and an axis a. A reflect command first expands into one Cuboid function that creates a new cuboid c_n′ with the same parameters as c_n. Then, for every previous attachment line that had moved c_n, of the form attach(c_n, c_m, x1, y1, z1, x2, y2, z2), the reflect command creates a new attachment line: attach(c_n′, c_m, x1, y1, z1, R(x2, y2, z2, c_n, c_m, a)). R is a function that applies a reflection of the global point specified by (x2, y2, z2) in the local coordinate frame of c_m about the axis a, and then returns the local coordinates of that reflected point within c_m.
Translate. The translate macro is parameterized by a cuboid c_n, an axis a, a number of members m, and a distance d. A translate command first expands into m Cuboid functions, each of which creates a new cuboid c_{n,i} with the same parameters as c_n. Then, for every previous attachment line that had moved c_n, of the form attach(c_n, c_m, x1, y1, z1, x2, y2, z2), the translate command creates a new attachment line attach(c_{n,i}, c_m, x1, y1, z1, T(x2, y2, z2, c_n, c_m, a, d)). T is a function that applies a translation of the global point specified by (x2, y2, z2) in the local coordinate frame of c_m along the axis a (of the bounding volume) for a distance of d (where d is normalized by the size of the bounding volume), and then returns the local coordinates of that translated point within c_m.

C PROGRAM EXTRACTION PROCEDURE
Here, we provide an account of our program extraction procedure in greater detail:
Part Shortening.
Before any hierarchical processing, we first attempt to regularize any artifacts in the input data. Specifically, for each leaf cuboid part proxy, we check if any of its faces are completely contained within any other leaf cuboid. If we find that we can shorten a leaf cuboid without changing the visible, non-intersecting geometry of the part graph, we do so.
Semantic Hierarchy Arrangement.
During our data preprocessing stage, when converting PartNet part graphs into ShapeAssembly programs, we locally flatten part graph hierarchies based on semantic rules, as depicted in Figure 5. For chairs we flatten the following nodes: back, arm, base, seat, footrest and head. For tables we flatten the following nodes: top and base. For storage we flatten the following nodes: cabinet frame and cabinet base. For storage, we also move the following nodes into the cabinet frame sub-program: countertop, shelf, drawer, cabinet door and mirror. We additionally perform a semantic collapsing step where intermediate nodes containing detailed geometry are converted into leaf nodes and their children are discarded. For chairs we collapse the following nodes: caster and mechanical control. For tables we collapse the following nodes: caster, cabinet door, drawer and keyboard tray. For storage we collapse the following nodes: drawer, cabinet door, mirror and caster. Empirically, we observed that this method of hierarchy re-arrangement produces cleaner and more regularized training data for our generative model.
Attachment Point Detection.
In order to identify which cuboids connect, and where they connect, we use a point cloud intersection procedure. We sample a uniform 20x20x20 point cloud within the volume defined by each cuboid. To check if two cuboids are attached, we find the set of points in the pairwise point cloud comparison that have a minimum distance to any point in the other point cloud within a distance threshold determined by the scale of the larger cuboid. For cuboids that attach (i.e. this intersection set is non-empty), we sample a denser 50x50x50 point cloud within the bounds of the detected intersection volume, forming a set of candidate attachment points. From this set, we first filter out all attachment points that lie outside of either cuboid. If any remaining attachment points form face-to-face connections between the cuboids, we choose them; otherwise, we define the attachment as taking place at the mean of the remaining attachment points. With the same procedure, we also record whether cuboids connect to the top or bottom of the bounding volume: sampled points with bounding-volume-local y-coordinates near 0 and 1 (within fixed thresholds) are assigned to the bottom and top respectively.
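A runnable sketch of the core intersection test appears below, simplified to axis-aligned cuboids represented as (min_corner, max_corner) pairs. The grid resolutions follow the text; the threshold constant and the face-to-face preference step are assumptions or omissions of this sketch, not our exact procedure.

# Sketch of the point-cloud intersection test for attachment detection.
import numpy as np
from scipy.spatial import cKDTree

def grid_points(lo, hi, res):
    """Uniform res^3 grid of points spanning the box [lo, hi]."""
    axes = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    return np.stack(np.meshgrid(*axes, indexing='ij'), -1).reshape(-1, 3)

def detect_attachment(cub_a, cub_b, rel_thresh=0.02):
    lo_a, hi_a = map(np.asarray, cub_a)
    lo_b, hi_b = map(np.asarray, cub_b)
    # Distance threshold scaled by the larger cuboid's diagonal.
    thresh = rel_thresh * max(np.linalg.norm(hi_a - lo_a),
                              np.linalg.norm(hi_b - lo_b))
    pa = grid_points(lo_a, hi_a, 20)
    pb = grid_points(lo_b, hi_b, 20)
    near = pa[cKDTree(pb).query(pa)[0] < thresh]
    if len(near) == 0:
        return None                      # cuboids do not attach
    # Resample densely inside the detected intersection volume and keep
    # candidate points that lie (approximately) inside both cuboids.
    dense = grid_points(near.min(0), near.max(0), 50)
    inside = ((dense >= np.maximum(lo_a, lo_b) - thresh) &
              (dense <= np.minimum(hi_a, hi_b) + thresh)).all(axis=1)
    cands = dense[inside]
    return cands.mean(axis=0) if len(cands) else near.mean(axis=0)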
Symmetry Detection. We enforce that all members of a symmetry group share the same connectivity structure in the input part graph. Cuboids are grouped together by symmetry if they: (i) connect to the same cuboids, (ii) share a reflectional or translational symmetry about the X, Y or Z axis of their parent bounding volume, and (iii) each attachment point involved in their outgoing connections also shares this same symmetrical relationship. Two cuboids, or two attachment points, are considered to share a symmetrical relationship if applying the symmetry transformation matrix to one member produces a parameterization close to that of the other member.

Notice that this procedure can disqualify symmetry formation about groups of interconnected cuboids that share a symmetrical relationship. As such, before forming symmetry groups about individual cuboids, we attempt to form symmetry groups about connected components of multiple cuboids. Whenever such a component is found, we locally abstract its structure with a bounding volume, and create a symmetry group sub-program. In this manner, we capture additional spatial symmetries while continuing to enforce the relationship between symmetry and part connectivity. The "H-leg" program (Program 3) in Figure 2 shows an example of where such a symmetry sub-program was formed.

In total, our parsing procedure finds valid ShapeAssembly programs for 46% of Chairs, 65% of Tables and 58% of Storage shapes in PartNet.
D DECODER SEMANTIC VALIDITY CHECKS
During the process of decoding a latent code, our generative network enforces the following semantic validity conditions on its outputs:

• XYZ attachment coordinates are clamped between 0 and 1.0. Additionally, attachments to the bounding box can only be at the top or bottom faces, with an allowable error of .05.
• Cuboid dimensions are clamped between 0.01 and the corresponding bounding box dimension.
• Bounding box cuboids can have no sub-programs.
• Cuboids only attach at a single location. As an exception, cuboids are allowed to attach to both the top and bottom faces of the bounding volume.
• The bounding box cannot be moved by an attach command.
• Attachment orderings must be grounded. Upon terminating, any ungrounded cuboid instantiations are discarded.
• Symmetries can only operate on grounded cuboids.
• The ordering of Cuboid, attach, squeeze, reflect, and translate lines must be consistent with the ShapeAssembly grammar.
• Commands must keep cuboids within the bounds of the defined bounding volume, with an allowable error of 10%.

During generation, if our model predicts a non-semantic program line, we attempt to back-track until we are able to find a semantically valid solution. For instance, if we predict a new line to be a reflect command, but no cuboids have been grounded, we pick a new command type for the line by zeroing out the logits for the reflect command index.

In some cases, a combination of bad continuous parameters and program structure predictions produces a violating line that cannot be easily fixed. During unconditional generation, we reject the sample if we encounter this behavior (this happens for 10% - 20% of our random samples across the categories we consider). We run an ablation on this rejection sampling in Table 2. During interpolation, we never reject a sample. Instead, we simply do not add lines to the predicted program for which we could not find a fix.
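The back-tracking fix described above can be sketched as masked resampling: when a sampled command fails a validity check, its logit is set to negative infinity and the line is re-sampled. The function below is an illustrative sketch of that mechanism, with an assumed is_valid predicate.

# Sketch of back-tracking over command predictions via logit masking.
import torch

def sample_valid_command(logits, is_valid, max_tries=None):
    """Sample a command index, masking out choices that fail validity."""
    logits = logits.clone()
    tries = max_tries or logits.numel()
    for _ in range(tries):
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs, 1).item()
        if is_valid(idx):
            return idx
        logits[idx] = float('-inf')  # zero out this command and retry
    return None  # no fixable choice: reject sample / drop this line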
E SHAPE QUALITY METRICS
We provide additional details about the metrics used in Table 2:
• Rootedness: We check if a connected path exists between the ground and all parts in the shape. We judge two parts to be connected if they are separated by a distance no larger than 2% of the overall shape's bounding box diagonal length.
• Stability: We convert generated 3D shape structures into rigid bodies and place them in a physical simulation with gravity. A vertical force is applied to each shape proportional to its mass, along with some other small random forces and torques. If the resting height of any connected component of the shape changes by more than 10% after these perturbations, we declare it unstable. Note that this is by definition less than or equal to the percentage of rooted shapes, as a shape must be rooted in order to be stable.
• Realism: