DSM-Net: Disentangled Structured Mesh Net for Controllable Generation of Fine Geometry
JIE YANG∗, Institute of Computing Technology, CAS and University of Chinese Academy of Sciences
KAICHUN MO∗, Stanford University
YU-KUN LAI, Cardiff University
LEONIDAS GUIBAS, Stanford University
LIN GAO†, Institute of Computing Technology, CAS and University of Chinese Academy of Sciences

∗Authors contributed equally. †Corresponding author.
Project webpage: http://geometrylearning.com/dsm-net/. This is the author's version of the work. It is posted here for your personal use. Not for redistribution.
Fig. 1. Our deep generative network DSM-Net encodes 3D shapes with complex structure and fine geometry in a representation that leverages the synergy between geometry and structure, while disentangling these two aspects as much as possible. This enables novel modes of controllable generation for high-quality shapes. Left: results of disentangled interpolation. Here, the top left and bottom right chairs (highlighted with red rectangles) are the input shapes. The remaining chairs are generated automatically with our DSM-Net, where in each row, the structure of the shapes is interpolated while keeping the geometry unchanged, whereas in each column, the geometry is interpolated while retaining the structure. Right: shape generation results with complex structure and fine geometry details by our DSM-Net. We show close-up views in dashed yellow rectangles to highlight local details.
3D shape generation is a fundamental operation in computer graphics. While significant progress has been made, especially with recent deep generative models, it remains a challenge to synthesize high-quality geometric shapes with rich detail and complex structure, in a controllable manner. To tackle this, we introduce DSM-Net, a deep neural network that learns a disentangled structured mesh representation for 3D shapes, where two key aspects of shapes, geometry and structure, are encoded in a synergistic manner to ensure plausibility of the generated shapes, while also being disentangled as much as possible. This supports a range of novel shape generation applications with intuitive control, such as interpolation of structure (geometry) while keeping geometry (structure) unchanged. To achieve this, we simultaneously learn structure and geometry through variational autoencoders (VAEs) in a hierarchical manner for both, with bijective mappings at each level. In this manner we effectively encode geometry and structure in separate latent spaces, while ensuring their compatibility: the structure is used to guide the geometry and vice versa. At the leaf level, the part geometry is represented using a conditional part VAE, to encode high-quality geometric details, guided by the structure context as the condition. Our method not only supports controllable generation applications, but also produces high-quality synthesized shapes, outperforming state-of-the-art methods.

CCS Concepts: • Computing methodologies → Shape modeling.
Additional Key Words and Phrases: 3D shape generation, disentangled representation, structure, geometry, hierarchies
1 INTRODUCTION

3D shapes are widely used in computer graphics and computer vision, with applications ranging from modeling and recognition to rendering. Synthesizing high-quality shapes is therefore in high demand for many downstream applications. Ideally, the synthesized shapes should contain fine geometric details and complex structures, and the generation process needs to provide high-level control to ensure desired shapes are produced.

Shape generation has been extensively researched in recent years, benefiting especially from the capabilities of deep generative models. This has been true across a variety of 3D representations used to represent generated shapes, including point clouds, voxels, implicit fields, meshes, etc. However, existing methods still have limitations in representing both complex shape structure and geometric details, which is what is required for many downstream applications. Moreover, to ensure high-level control of shape generation, it is important to decompose shapes into multiple aspects that can be independently manipulated, typically geometry and structure (i.e., how different parts are related to form the overall shape). On the one hand, geometry and structure are synergistic: the structure of an object may restrict the specific geometric shapes that are plausible, and vice versa. On the other hand, to support high-level control, it is beneficial to derive a representation that disentangles these two aspects as much as possible. Such disentangled representations have been widely studied in deep image generation, allowing different aspects, such as different facial attributes (expression, age, gender, etc.), to be manipulated separately, either in a supervised [Xiao et al. 2018a,b] or unsupervised [Chen et al. 2016] manner. For disentanglement of 3D shapes, existing works either focus on specific object categories such as human faces [Abrevaya et al. 2019], where explicit annotation is used for supervision, or are restricted to intrinsic/extrinsic decomposition, where shape geometry and poses are considered [Aumentado-Armstrong et al. 2019]. However, such methods are rather restrictive, and usually require the set of shapes to have point-to-point correspondence. None of these methods can handle the more general geometry and structure disentanglement we address in this work. Such disentangled and synergistic representations offer significant benefits, including controllable generation of new shapes, e.g., interpolating or transferring structure while keeping geometry unchanged, or manipulating geometry while retaining the structure.

Specifically, most existing deep shape generation works produce synthesized shapes as a whole. This makes it particularly difficult to control the generation, either in a topology-aware or geometry-aware manner. Recently, some pioneering works have addressed this shortcoming by considering shape generation using parts and their compositions, leading to improved geometric detail [Gao et al. 2019b] and better handling of complex structure [Mo et al. 2019a]. However, neither is capable of generating shapes with both complex structures and detailed geometry. They also have not addressed disentanglement of structure and geometry.

In this paper, we introduce Disentangled Structured Mesh Net (DSM-Net), a novel deep generative model which overcomes the above limitations. DSM-Net is based on the PartNet [Mo et al. 2019c] dataset with fine-grained, consistent part annotations aggregated into shape hierarchies. We follow the PT2PC [Mo et al. 2020] approach to group the PartNet data (e.g.
chair, table, cabinet, etc.) according to their structure. However, our structure only includes the hierarchical graph of part semantics and relationships, excluding all geometric information, whereas the geometry hierarchy includes the detailed geometry and position of each part. Our network encodes the structure and geometry hierarchies of an n-ary tree using separate variational autoencoders (VAEs) with recursive neural network architectures. Both the geometry and structure information flow along the edges of the hierarchical graphs and aggregate into two latent spaces, allowing these two key aspects to be encoded separately in a disentangled manner.

However, the latent codes from both spaces need to be correlated, to ensure the plausibility of the represented shape. To achieve this, we simultaneously train both structure and geometry VAEs. We further ensure that they communicate with each other: both hierarchies have bijective mappings bridging them at each level. During training, the structure communicates with the geometry and gives guidance on the generated part shapes. The geometry follows the inter-part relationship edges with a message passing protocol. The geometry of parts also supplements the structure to provide reliable correspondences to the ground truth to facilitate training. The detailed geometry is encoded using a conditional VAE, where the structure context is used as the condition to further promote structure and geometry compatibility.

Our novel solution allows shapes with complex structure and delicate geometry to be represented and synthesized, outperforming state-of-the-art methods, e.g. [Gao et al. 2019b; Mo et al. 2019a]. The disentangled and synergistic formulation allows novel applications, such as shape generation and interpolation with separate control of structure and geometry, which is an intuitive process for shape modelers. New shapes can also be synthesized by mixing structure and geometry from different examples.

In summary, our DSM-Net makes the following key contributions:

• We propose a novel deep network that decomposes the shape space into two disentangled latent spaces, encoding the geometry and structure of shapes. We incorporate communication between the geometry and structure, making them compatible on the generated shapes, while supporting novel synthesis applications that exploit independent control of structure and geometry.

• Our DSM-Net also allows high-quality shapes with complex structures and delicate geometric details to be effectively represented and synthesized, outperforming state-of-the-art methods.

Figure 1 demonstrates the capability of our DSM-Net to interpolate shapes with rich geometry and complex structure in the geometry and structure spaces separately, where each row shows interpolation of structure while keeping geometry unchanged, and each column presents interpolation of geometry while retaining the same structure. Through extensive evaluations and comparisons with state-of-the-art deep neural generative models, our method shows significant advantages on various shape categories. Our method supports traditional applications such as shape generation, synthesis, and interpolation, but now with
independent control over the shape structure and geometry details, facilitating the design process.
2 RELATED WORK

In recent years, researchers have been making great advances towards learning deep representations for 3D data and pushing the frontiers of 3D shape analysis, synthesis and modeling. A key research topic in 3D computer vision and graphics is how to represent, reconstruct and generate 3D shapes with complicated part structures and delicate geometric details. In this section, we give a brief review of various kinds of 3D shape representations and provide a comprehensive discussion of recent advances in modeling 3D shape geometry and structure.
2.1 3D Shape Representations

While there is broad consensus on representing 2D images as pixel grids, researchers have been exploring a wide variety of representations for 3D data. To name a few, recent works have developed deep learning frameworks for 3D voxel grids [Choy et al. 2019, 2016; Girdhar et al. 2016; Graham et al. 2018; Maturana and Scherer 2015; Riegler et al. 2017; Tatarchenko et al. 2017; Wu et al. 2017, 2016, 2015; Yan et al. 2016], multi-view 2D renderings of 3D data [Huang et al. 2017; Kalogerakis et al. 2017; Kanezaki et al. 2018; Lyu et al. 2020; Su et al. 2018, 2015], 3D point clouds [Achlioptas et al. 2018; Fan et al. 2017; Gadelha et al. 2018; Le et al. 2019; Li et al. 2018; Qi et al. 2017a,b; Shu et al. 2019; Valsesia et al. 2018; Yang et al. 2019, 2018; Zhao et al. 2019], 3D polygonal meshes [Chen et al. 2019a; Dai and Nießner 2019; Gkioxari et al. 2019; Groueix et al. 2018; Kanazawa et al. 2018; Nash et al. 2020; Sinha et al. 2017; Wang et al. 2018], and 3D implicit functions [Chabra et al. 2020; Chen and Zhang 2019; Chibane et al. 2020; Duan et al. 2020; Jiang et al. 2020; Mescheder et al. 2019; Park et al. 2019; Peng et al. 2020; Xu et al. 2019]. For more detailed discussion and comparison, we refer the readers to these survey papers [Ahmed et al. 2018; Bronstein et al. 2017; Ioannidou et al. 2017; Xiao et al. 2020].

More relevant to our work is the trend of part-based and structure-aware 3D shape representations. 3D shapes naturally exhibit compositional part structures. Part-based shape modeling decomposes complicated shapes into simpler parts for geometric modeling and organizes parts as part sequences or part hierarchies that encode shape part relationships and structures. Many previous works investigated parsing 3D shapes into parts [Chen et al. 2019b; Golovinskiy and Funkhouser 2009; Hu et al. 2012; Huang et al. 2011; Kalogerakis et al. 2017; Mo et al. 2019c; Tulsiani et al. 2017; Yi et al. 2017; Yu et al. 2019; Zhu et al. 2020; Zou et al. 2017], representing 3D shapes as part sequences or hierarchies [Ganapathi-Subramanian et al. 2018; Kim et al. 2013; Mo et al. 2019a; Niu et al. 2018; Sung et al. 2017; Van Kaick et al. 2013; Wang et al. 2011; Wu et al. 2019b; Zhu et al. 2018a], and generating 3D shapes with part structures [Gao et al. 2019b; Kalogerakis et al. 2012; Li et al. 2019, 2017; Mo et al. 2020; Schor et al. 2019; Wu et al. 2019a]. We refer to these survey papers [Chaudhuri et al. 2020; Mitra et al. 2014; Xu et al. 2016] for more comprehensive discussion.
2.2 Generating Shape Geometry

There are several different approaches to generating detailed 3D shape geometry: direct methods, patch-based methods, deformation-based methods, and others. Direct methods exploit decoder networks that output 3D content in a direct feed-forward procedure. For instance, Choy et al. [2016] and Tatarchenko et al. [2017] directly generate 3D voxel grids using 3D convolutional neural networks. Fan et al. [2017] and Achlioptas et al. [2018] use multi-layer perceptrons (MLPs) to directly generate 3D point clouds. Patch-based methods generate 3D shapes by assembling many local 3D surface patches. AtlasNet [Groueix et al. 2018] and Deprelle et al. [2019] learn to reconstruct each 3D shape with a collection of local surface elements or point clouds. Recent papers [Genova et al. 2019; Jiang et al. 2020] learn local implicit functions that are aggregated together to generate 3D shapes. Deformation-based methods train neural networks to deform an initial shape template to the output shape. For example, FoldingNet [Yang et al. 2018] and Pixel2Mesh [Wang et al. 2018] learn to deform 2D grid surfaces and 3D sphere manifolds to reconstruct 3D target outputs.

In our paper, we choose a deformation-based mesh representation for leaf-node parts, where we deform a unified unit cube mesh with 5,402 vertices to describe leaf-node part geometry. Representing 3D shapes as fine-grained part hierarchies [Mo et al. 2019a,c], we find this effective and efficient for preserving the geometric details of leaf-node parts, as previously shown in recent works [Gao et al. 2019a,b]. Different from SDM-Net [Gao et al. 2019b], we introduce a structure-conditioned part geometry VAE that substantially improves data efficiency and reconstruction performance beyond SDM-Net. We also build bijective mappings between the structure and geometry nodes for synergistic joint learning, which enables disentangled representations for shape structure and geometry. Compared to StructureNet [Mo et al. 2019a], which directly generates 3D point clouds for leaf-node parts, we find our method generates 3D part geometry with sharper edges and more details.
2.3 Modeling Shape Structures

3D objects, especially man-made ones, are highly compositional and structured. Previous works attempt to infer the underlying shape grammars [Chaudhuri et al. 2011; Kalogerakis et al. 2012; Wu et al. 2016], part-based templates [Ganapathi-Subramanian et al. 2018; Kim et al. 2013; Ovsjanikov et al. 2011], and shape programs [Sharma et al. 2018; Tian et al. 2019]. There are also many papers investigating generating shapes in a part-by-part manner using consistent part semantics [Dubrovina et al. 2019; Li et al. 2019; Schor et al. 2019; Wu et al. 2019a] and sequential part instances [Sung et al. 2017; Wu et al. 2019b].

Recently, researchers have been investigating representing every shape as a hierarchy of parts, which extends part granularity to more fine-grained scales. GRASS [Li et al. 2017], the pioneering work on encoding the tree structure of objects, uses binary part hierarchies and advocates recursive neural networks (RvNNs) to hierarchically encode and decode parts along the tree structure. A follow-up work, StructureNet [Mo et al. 2019a], further extends the framework to handle n-ary part hierarchies with consistent part semantics for an object category [Mo et al. 2019c]. A concurrent work, SDM-Net [Gao et al. 2019b], learns to generate structured meshes with deformable parts by leveraging a part graph with rich support and symmetry relations. Sun et al. [2019] and Paschalidou et al. [2020] explore learning hierarchical part decompositions in unsupervised settings.

Our work adapts the hierarchical part representation introduced in StructureNet [Mo et al. 2019a], which can represent ShapeNet [Chang et al. 2015] shapes with complicated structures and fine-grained leaf-node parts. Different from StructureNet, where shape geometry and structure are jointly modeled in one RvNN, we learn a pair of separate geometry and structure RvNNs in a disentangled but synergistic fashion, which enables exploring geometric (structural) changes while keeping the shape structure (geometry) unchanged. We also find that by combining the state-of-the-art structure learning modules from StructureNet [Mo et al. 2019a] with the latest techniques in modeling detailed part geometry from SDM-Net [Gao et al. 2019b] in an effective way, we achieve the best of both worlds and outperform both StructureNet and SDM-Net.

In the pioneering work SAG-Net [Wu et al. 2019a], both geometry and structure are encoded in a single latent code by an attention-based GRU network, the geometric details are represented with a voxel-based representation, and the graph structure is represented by a fully connected graph. Our DSM-Net is fundamentally different in that geometry and structure are encoded into separate latent codes in a hierarchical manner. The hierarchy of encoded geometry guided by the structure is the key in our work that achieves disentanglement while ensuring structure/geometry compatibility, which does not appear in SAG-Net. This novel design enables DSM-Net to disentangle geometry and structure while keeping the two informed of each other, and thus synthesize 3D mesh models with complex structure and compatible fine geometry, advancing the state of the art in neural shape representations.
2.4 Deep Learning for Shape Editing

Using deep learning to aid shape editing, deformation and transformation applications has attracted much research attention in recent years. To name a few, Yumer et al. [2016] learn semantic deformations over 3D voxel grids for deforming shapes subject to user intents. 3DN [Wang et al. 2019b] learns to deform 3D meshes by predicting offsets for mesh vertices. NeuralCages [Wang et al. 2019a] learns to fit coarse cages outside shape meshes and conducts deformation over the cages. StructEdit [Mo et al. 2019b] learns a conditional variational autoencoder (cVAE) to generate plausible shape variations for a source shape and transfer editing operations among similar shapes. LOGAN [Yin et al. 2019] proposes a general framework to learn shape transforms from unpaired domains and demonstrates many interesting applications, such as 3D style transfer and generating shapes from skeletons. PT2PC [Mo et al. 2020] learns a conditional generative adversarial network that generates 3D shapes with geometric variations given a part-tree condition. Aumentado-Armstrong et al. [2019] propose an unsupervised approach to learn disentangled representations for mammal and human point cloud shapes with factorization of pose and intrinsic shape.

In this paper, we learn disentangled part hierarchies for shape geometry and structure, which enables many controllable shape editing and transformation applications, such as varying shape geometry (structure) while keeping the structure (geometry) unchanged, and shape re-synthesis combining the structure of one shape with the geometry of another.
2.5 Disentangled Analysis

In the fields of 2D image and 3D shape processing, there are several pioneering research works on disentangled analysis. With the advancement of deep learning in the field of 2D images, many works aim to improve generation quality and manipulate the generated images. Borrowing from the style transfer literature, the architecture proposed by Karras et al. [2019] enables intuitive, scale-specific control of high-resolution image synthesis by automatic unsupervised separation of high-level attributes. HoloGAN [Nguyen-Phuoc et al. 2019] improves the visual quality of generation and allows manipulation via disentangled learning, utilizing explicit 3D features to disentangle shape and appearance in an end-to-end manner from unlabeled 2D images only. In the field of 3D shape processing, generative modeling has become a mainstream topic thanks to deep learning and the availability of large public 3D datasets, some of which contain rich textures for realism. Levinson et al. [2019] propose a supervised generative model to achieve accurate disentanglement of pose and shape on a large-scale human mesh dataset, and successfully incorporate techniques such as pose and shape transfer. Moreover, CFAN-VAE [Tatro et al. 2020] proposes a conformal factor and normal (CFAN) feature to achieve geometric disentanglement (pose and identity of human shapes) in an unsupervised way. For datasets of general textured objects, VON [Zhu et al. 2018b] presents a fully differentiable 3D-aware generative model with a disentangled 3D representation for image and shape synthesis. For more photo-realistic image generation, it decomposes the process into three factors: shape, viewpoint, and texture.

Compared to the above works, our work offers a rather novel capability: disentanglement of structure and geometry. We learn a disentangled structured mesh representation for 3D shapes, where the disentanglement is entirely between two explicitly defined factors, namely structure and geometry. Our network can not only generate shapes with improved geometric details, but also exploit independent control of structure and geometry via the disentangled latent spaces.
3 METHOD

As provided in the PartNet dataset [Mo et al. 2019c], every 3D shape is decomposed into semantically consistent part instances that are organized in an n-ary part hierarchy covering parts at different granularities, ranging from coarse-grained parts (e.g. chair back, chair base) to fine-grained ones (e.g. chair back bars, chair legs). The part hierarchy also includes a rich set of part relationships shedding light on the complicated shape structure, such as the vertical parent-child relations and the horizontal symmetry or adjacency relations. Such a part hierarchy provides a powerful representation that describes complex structure and geometric details in a unified format.

In this paper, we propose a disentangled but highly synergistic hierarchical representation for shape geometry and structure (see Figure 2). We disentangle the unified PartNet [Mo et al. 2019c] part hierarchy into a structure hierarchy, which describes the symbolic part semantics and part relationships, and a geometry hierarchy, which contains the detailed part mesh geometry for the tree nodes. The structure and geometry hierarchies are disentangled to enable controllable shape editing in downstream applications, as we will show in Sec. 4, while still being highly coupled and synergistic in that the two hierarchies have a bijective part correspondence among the tree nodes and are learned together so as to generate 3D shapes with compatible structure and geometry.

We use recursive neural networks (RvNNs) to hierarchically encode and decode the structure and geometry part hierarchies. Different from StructureNet [Mo et al. 2019a], we propose to learn two separate but deeply coupled VAEs to encode the geometry and structure hierarchies into two latent spaces, producing a disentangled representation for shape structure and geometry. However, there are rich communications between the disentangled structure and geometry VAEs during both the encoding and decoding procedures, since the part geometry is generated under the shape structure guidelines while the shape structure leverages the produced part geometry for effective training. Such communication is necessary to ensure the compatibility of the generated shape structure and geometry.

In the following subsections, we first describe the detailed definitions of our disentangled shape representation of structure and geometry hierarchies. Then, we introduce a conditional part geometry VAE for encoding and decoding the fine-grained part geometry using a unified deformable mesh. Finally, we present our network architecture designs for the geometry and structure VAEs and discuss how to learn the disentangled shape geometry and structure latent spaces simultaneously, where the geometry and structure VAEs guide each other's learning processes.
3.1 Disentangled Structure and Geometry Hierarchies

We adapt the hierarchical part segmentation in PartNet [Mo et al. 2019c] for ShapeNet models [Chang et al. 2015], where each shape is decomposed into a set of parts P and organized in a part hierarchy H (i.e., the vertical parent-child part relationships) with rich part relationships R (i.e., the horizontal among-sibling symmetry or adjacency part relationships). Each part P_i is associated with a semantic label l_i (e.g. chair back, chair leg) defined for a certain object class, as well as the detailed part geometry G_i.

We introduce a disentangled but highly synergistic shape representation for shape structure and geometry, where we represent each 3D shape as a pair of a structure hierarchy and a geometry hierarchy. In our disentangled representation (see Figure 2), a structure hierarchy abstracts away the part geometry and only describes a symbolic part hierarchy with part structures and relationships, namely (⟨l_1, l_2, ..., l_N⟩, H, R), while a geometry hierarchy describes the part geometry ⟨G_1, G_2, ..., G_N⟩. There is a bijective mapping between the tree nodes of the structure and geometry hierarchies, such that the part semantic label l_i defined in the structure hierarchy corresponds to the part geometry G_i included in the geometry hierarchy. Also, the geometry hierarchy implicitly follows the same part hierarchy H and part relationships R as specified in the structure hierarchy.
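To make the paired representation concrete, here is a minimal Python sketch of the two hierarchies with their bijective node correspondence. All class and field names are our own illustrative choices, not identifiers from the paper's released code:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class StructNode:
    """A structure-hierarchy node: symbolic semantics and relations only."""
    sem_label: int                       # index into the category's label set
    children: List["StructNode"] = field(default_factory=list)
    # horizontal among-sibling relations: (child_a, child_b, relation_type),
    # relation_type in {"adjacency", "translational", "reflective", "rotational"}
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

@dataclass
class GeoNode:
    """The corresponding geometry-hierarchy node."""
    acap: Optional[np.ndarray] = None    # (V, 9) ACAP feature, leaf nodes only
    center: Optional[np.ndarray] = None  # (3,) part center c_i
    children: List["GeoNode"] = field(default_factory=list)

def zip_hierarchies(s: StructNode, g: GeoNode):
    """Walk the bijective node correspondence between the two trees."""
    yield s, g
    for sc, gc in zip(s.children, g.children):
        yield from zip_hierarchies(sc, gc)
```

Keeping the two trees index-aligned in this way is what later allows structure features to condition geometry encoding and decoding at every node.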
Fig. 2. An example showing the proposed disentangled but highly synergistic representation of shape geometry and structure hierarchies. There is a bijective mapping between the tree nodes in the two hierarchies. In the structure hierarchy, we consider symbolic part semantics and a rich set of part relationships (orange arrows), such as adjacency (τ_a), translational symmetry (τ_t), reflective symmetry (τ_r) and rotational symmetry (τ_o). In the part geometry hierarchy, the part geometry is represented by a mesh.

Part Geometry Representation.
For each part geometry G_i, we use a mesh representation to capture more geometric details, such as decorative patterns and sharp boundary edges, than the point cloud representation used in StructureNet [Mo et al. 2019a]. Given a closed box mesh manifold G_box with 5,402 vertices, we first calculate the oriented bounding box (OBB) B_i of each part P_i and deform G_box, initialized with the shape of B_i, to the target part geometry G_i by adjusting the vertex positions through a non-rigid registration procedure. Then, for each part, we use the ACAP (as-consistent-as-possible) feature [Gao et al. 2019a,b] X_i as the representation of the deformed box mesh. The ACAP feature X_i ∈ R^{V×9} captures the local rotation and scale information in the one-ring neighborhood of every vertex on the mesh and is capable of capturing large-scale local geometric deformations (e.g. rotations greater than 180°). We show an example registration result in Figure 3 (a). For the detailed calculation, please refer to [Gao et al. 2019a]. Since the ACAP feature is invariant to spatial translation of the part, we incorporate an additional 3-dimensional vector to describe the part center c_i. Overall, each part geometry is represented as a pair of an ACAP feature X_i and a part center vector c_i, as shown in Figure 3 (b), i.e., G_i = (X_i, c_i).
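As a concrete example of one ingredient above, the following sketch fits an oriented bounding box to a part via PCA, a standard heuristic; the paper does not detail its exact OBB fitting procedure, and the ACAP feature extraction itself is omitted here (see [Gao et al. 2019a]):

```python
import numpy as np

def fit_obb(points: np.ndarray):
    """PCA-based oriented bounding box fit for a part point set (M, 3).

    Returns (center, axes, extents): `axes` is a 3x3 matrix whose rows are the
    box directions, `extents` are the half-lengths along those directions.
    This is a common heuristic, not necessarily the exact method used for B_i.
    """
    center = points.mean(axis=0)
    # Principal directions of the centered point distribution.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    local = (points - center) @ vt.T          # coordinates in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    extents = 0.5 * (hi - lo)
    center = center + 0.5 * (lo + hi) @ vt    # re-center the box in world space
    return center, vt, extents
```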
Geometry Hierarchy. The geometry hierarchy for a 3D shape is a hierarchy of part geometries ⟨G_1, G_2, ..., G_N⟩ (see Figure 2, right). It decomposes a complicated shape geometry into a hierarchy of parts ranging from coarse-grained levels to fine-grained levels. Each part geometry G_i in the geometry hierarchy corresponds to a tree node in the structure hierarchy and gives a concrete geometric realization given the context of the entire shape structure to generate. The geometry hierarchy implicitly follows the structural hierarchy and part relationships H and R defined in the structure hierarchy.

Structure Hierarchy.
We consider a symbolic structure hierarchy (⟨l_1, l_2, ..., l_N⟩, H, R) as the structure representation for a shape, inspired by the recent work PT2PC [Mo et al. 2020]. Figure 2 (left) presents an example of the symbolic structure hierarchy. It only includes the semantic information of shape parts and the relationships between parts, while abstracting away the concrete part geometry. PT2PC learns to generate 3D point cloud shapes conditioned on a given symbolic structure hierarchy as a fixed skeleton for shape generation. In this work, we extend PT2PC to consider encoding and decoding the symbolic structure hierarchy and investigate its disentangled but synergistic relationship to the geometry hierarchy.

In the symbolic structure hierarchy, we represent each part with a semantic label l_i (e.g. chair back, chair leg) without having a concrete part geometry in the representation. We include the rich sets of part relationships defined in the PartNet dataset in the symbolic structure hierarchy representation. There are two kinds of part relationships: the vertical parent-child inclusion relationships (e.g. a chair back and its sub-component chair back bars), as defined in H, and the horizontal among-sibling part symmetry and adjacency relationships (e.g. chair back bars have translational symmetry), as denoted by R. We use the part relationships H and R as provided in StructureNet [Mo et al. 2019a].

Coupling Geometry and Structure Hierarchies.
Even though we are attempting a disentangled shape representation, the structure and geometry need to be compatible with each other for generating plausible and realistic shapes. On the one hand, the shape structure provides high-level guidance for part geometry. If four legs of a chair are specified to be symmetric to each other in the structure hierarchy, the four legs should have identical part geometry to satisfy the structural requirement. On the other hand, given a certain type of part geometry, only certain kinds of shape structures are possible. For example, it is nearly impossible to manufacture a swivel chair if no lift handle or gas cylinder parts are provided.

Concretely, in our disentangled shape representation, the geometry hierarchy ⟨G_1, G_2, ..., G_N⟩ and the structure hierarchy (⟨l_1, l_2, ..., l_N⟩, H, R) of a shape are highly correlated and tightly coupled. There is a bijective mapping between each part geometry node G_i and the symbolic part structure node l_i. We set up communication channels between the two hierarchies in the joint learning process. The geometry hierarchy uses the part hierarchy H and relationships R in the encoding and decoding stages for passing messages and synchronizing geometry generation among related nodes. To train the decoding stage of the structure hierarchy, we leverage the corresponding geometry nodes to help match the prediction to the ground-truth parts. Thus, the synergy between the structure and geometry hierarchies is essential for simultaneously learning the embedding spaces.
Fig. 3. We present (a) the non-rigid registration process that deforms a box mesh to a part geometry, (b) the deformed part mesh, and (c) our proposed part geometry representation, consisting of an ACAP deformation feature [Gao et al. 2019a] and a 3-dimensional part center vector.
3.2 Conditional Part Geometry VAE

In the geometry hierarchy of a 3D shape, each part geometry G_i is represented as a pair of an ACAP feature X_i ∈ R^{V×9} and the part center c_i ∈ R^3. We propose a part geometry conditional variational autoencoder (VAE) with a conditional part geometry encoder Enc_PG that maps the part geometry G_i = (X_i, c_i) into a 128-dimensional latent feature, and a conditional part geometry decoder Dec_PG that reconstructs ˆG_i from the latent code. Both the encoder and decoder are conditioned on the part semantics and the current structural context, in order to generate part geometry that is synergistic with the current structure tree nodes. We use a mesh graph convolutional operator to aggregate the local features around each vertex, which is also suitable for shape analysis [Monti et al. 2017; Wang et al. 2020].

Figure 4 illustrates the proposed part geometry conditional VAE architecture. The encoder network Enc_PG performs two sequential mesh graph convolutional operations over the X_i ∈ R^{V×9} feature map within the local one-ring neighborhood of each vertex, extracts a global part geometry feature via a single fully-connected layer, which is then concatenated with the part center vector c_i, and finally predicts a 128-dimensional geometry feature f_Gi for part P_i. The decoder network Dec_PG decodes the part ACAP feature ˆX_i and the part center ˆc_i through fully-connected and mesh-based convolutional layers. Then, the decoded ACAP feature ˆX_i is applied to every vertex of the closed box mesh G_box to reconstruct the part mesh ˆG_i, and the reconstructed center ˆc_i moves the part mesh to the correct position in the shape space.

Different from SDM-NET, which trains separate PartVAEs for different part semantics, we propose to use a single shared PartVAE to encode and decode shape part geometry, conditioned on the part structure information f_Si. The reason is three-fold: firstly, PartNet has far more part semantic labels than the SDM-NET data, so training separate networks for different part semantics is extremely costly and empirically hard to converge; secondly, the data samples for some rare part categories are not sufficient to train a separate network; lastly, our conditional PartVAE can be conditioned on structure codes summarizing the part semantics and sub-hierarchy information, allowing effective specialization of part geometry generation given different structure contexts.

In summary, the conditional encoder Enc_PG takes as input a part geometry G_i = (X_i, c_i) and a structure code condition f_Si summarizing certain part semantics and structural context information, and outputs a latent part embedding f_Gi = Enc_PG(G_i, f_Si).
Fig. 4. The architecture of our conditional part geometry variational autoencoder. For a single part mesh geometry, the encoder maps the part ACAP feature and its center position into a 128-dimensional geometric latent code, while the decoder reconstructs the part geometry by decoding the ACAP feature and the center vector. Both networks are conditioned on the part structure information along the structure hierarchy to generate specialized part geometry for different structure contexts.
The conditional decoder Dec_PG learns to reconstruct ˆG_i from a geometry latent code f_Gi and the current structural context information f_Si, namely, ˆG_i = (ˆX_i, ˆc_i) = Dec_PG(f_Gi, f_Si). To train the proposed conditional PartVAE, we define the loss as follows:

L_cond-PartVAE = λ L^recon_cond-PartVAE + L^KL_cond-PartVAE    (1)

where L^recon_cond-PartVAE = ∥ˆX_i − X_i∥² + ∥ˆc_i − c_i∥² is the reconstruction loss and L^KL_cond-PartVAE is the standard KL divergence loss encouraging the learned embedding space to be close to a unit multivariate Gaussian distribution.
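A minimal PyTorch sketch of this conditional PartVAE and the loss in Eq. 1 follows. For brevity, the two mesh graph convolutions are replaced by plain fully-connected layers, and the hidden sizes, the 128-dimensional condition size, and the value of λ are assumptions; only V = 5402, the 9-channel ACAP map, the 3-D center, and the 128-D latent code come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondPartVAE(nn.Module):
    """Sketch of the conditional part-geometry VAE (Enc_PG / Dec_PG)."""
    def __init__(self, V=5402, feat=9, cond=128, z=128):
        super().__init__()
        # Encoder: [flattened ACAP ; center ; structure condition] -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(V * feat + 3 + cond, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * z))
        # Decoder: [latent ; structure condition] -> (ACAP, center)
        self.dec = nn.Sequential(nn.Linear(z + cond, 512), nn.ReLU(),
                                 nn.Linear(512, V * feat + 3))
        self.V, self.feat = V, feat

    def forward(self, X, c, f_S):
        h = self.enc(torch.cat([X.flatten(1), c, f_S], dim=1))
        mu, logvar = h.chunk(2, dim=1)
        zcode = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        out = self.dec(torch.cat([zcode, f_S], dim=1))
        X_hat = out[:, :-3].view(-1, self.V, self.feat)
        c_hat = out[:, -3:]
        return X_hat, c_hat, mu, logvar

def cond_partvae_loss(X, c, X_hat, c_hat, mu, logvar, lam=1.0):
    # Eq. 1: weighted reconstruction term plus the standard VAE KL term.
    recon = F.mse_loss(X_hat, X) + F.mse_loss(c_hat, c)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return lam * recon + kl
```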
3.3 Disentangled Structure and Geometry VAEs

To learn disentangled latent spaces for shape geometry and structure, we design two variational autoencoders (VAEs) with recursive neural network (RvNN) encoders and decoders that are trained in a disentangled but tightly coupled manner. Figure 5 provides an overview of the proposed disentangled VAEs. The geometry VAE (the blue part) and the structure VAE (the red part) learn two disentangled latent spaces for shape geometry and structure.

Though disentangled, the structure and geometry VAEs are jointly learned in a highly synergistic manner, where we build a bijective mapping among the nodes of the structure and geometry hierarchies and allow communication across the two hierarchies. Such communication is necessary for the learning procedure, since neither the structure hierarchy nor the geometry hierarchy alone contains sufficient information for training.
Structure VAE. Given a structure hierarchy (⟨l_1, l_2, ..., l_N⟩, H, R) describing a symbolic tree with part semantics, hierarchy and relationships, the structure VAE is trained to learn a structure latent space. For the encoding process, a part structure encoder Enc_PS first summarizes the leaf-node part semantics, and then a recursive graph structure encoder Enc_RvS propagates features from the leaf nodes to the root in a bottom-up manner according to the part hierarchy H and relationships R. Inversely, the decoding process contains a recursive graph structure decoder Dec_RvS that hierarchically predicts the structure features from the root to the leaf nodes in a top-down fashion, and a part structure decoder Dec_PS that decodes part semantic labels for the leaf nodes.

The structure VAE uses a recursive neural network architecture similar to StructureNet [Mo et al. 2019a], but we encode and decode symbolic structure hierarchies with no concrete part geometry. It is thus difficult to train the decoding procedure given no part geometry, since we are not able to perform node matching between a set of decoded children and the set of ground-truth parts. To address this challenge, we borrow the corresponding part geometry decoded from the geometry VAE to perform the node matching for training, whereby a communication channel between the structure and geometry VAEs is established. Below, we discuss more details of the four network components of the structure VAE.

Encoders.
To encode a symbolic structure hierarchy represented as (⟨l_1, l_2, ..., l_N⟩, H, R), we need to introduce an additional part instance identifier d_i for each part, where d_i = 0, 1, 2, ..., similar to PT2PC [Mo et al. 2020]. Part instance identifiers help differentiate part instances with the same part semantics under a parent node. For example, if a chair base contains four chair legs, we mark them with part instance identifiers 0, 1, 2, 3. The part instance identifiers are only necessary in the encoding stage and are ignored in the decoding procedure.

For each leaf node part P_i, the part structure encoder Enc_PS encodes the part semantics l_i and its part instance identifier d_i into a part structure latent code f_Si:

f_Si = Enc_PS([l_i; d_i])    (2)

where Enc_PS is simply a fully-connected layer, [ ; ] denotes vector concatenation, and we represent both d_i and l_i as one-hot vectors.
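In code, Enc_PS is just a linear layer over concatenated one-hot vectors; a short sketch (the 128-D feature size and the label-set size are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartStructEncoder(nn.Module):
    """Enc_PS (Eq. 2): a single fully-connected layer over [l_i ; d_i]."""
    def __init__(self, num_labels, max_children=10, dim=128):
        super().__init__()
        self.num_labels, self.max_children = num_labels, max_children
        self.fc = nn.Linear(num_labels + max_children, dim)

    def forward(self, label_idx, inst_idx):
        l = F.one_hot(label_idx, self.num_labels).float()   # semantics one-hot
        d = F.one_hot(inst_idx, self.max_children).float()  # identifier one-hot
        return self.fc(torch.cat([l, d], dim=-1))
```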
For a non-leaf part P_i, the recursive graph structure encoder Enc_RvS gathers all children node features, performs graph message-passing among the children nodes along the part relationships defined in R, and finally computes f_Si by aggregating the children nodes' features. Specifically, we have

f_Si = Enc_RvS({f_Sj}_{(P_i, P_j) ∈ H}, l_i, d_i)    (3)

where (P_i, P_j) ∈ H denotes that part P_j is a child of P_i. The module Enc_RvS is composed of two iterations of graph message-passing similar to StructureNet [Mo et al. 2019a], a max-pooling operation over the obtained node features, and a fully-connected layer producing the part structure feature f_Si given the pooled feature and the part identifiers [l_i; d_i]. Here, please note that the part instance identifiers are necessary, due to the max-pooling operation, to distinguish and count the different occurrences of part instances with the same part semantics.

We repeatedly apply the recursive graph structure encoder Enc_RvS until reaching the root node P_root. The final root node structure feature f_Sroot is then mapped to the final structure embedding space through a fully-connected layer. We use a KL divergence loss to encourage the learned structure latent space to be close to a unit multivariate Gaussian distribution.
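The following is a simplified sketch of Enc_RvS: message passing along the sibling relation edges, max-pooling, and a final fully-connected layer over the pooled feature and the parent's [l_i; d_i] one-hot. The message function and residual update are our simplifications of the StructureNet-style module, not the paper's exact layers:

```python
import torch
import torch.nn as nn

class RecursiveStructEncoder(nn.Module):
    """Sketch of Enc_RvS (Eq. 3). `id_dim` is the length of the parent's
    [l_i ; d_i] one-hot vector."""
    def __init__(self, dim, id_dim, rounds=2):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)    # message along one relation edge
        self.out = nn.Linear(dim + id_dim, dim)
        self.rounds = rounds

    def forward(self, child_feats, edges, parent_id):
        # child_feats: (K, dim); edges: list of (src, dst) sibling relations
        f = child_feats
        for _ in range(self.rounds):          # two message-passing iterations
            agg = torch.zeros_like(f)
            for s, d in edges:
                agg[d] = agg[d] + torch.relu(self.msg(torch.cat([f[s], f[d]])))
            f = f + agg                       # residual node update
        pooled = f.max(dim=0).values          # order-invariant max-pooling
        return self.out(torch.cat([pooled, parent_id]))
```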
Decoders.
The decoding process of the structure VAE takes a structure latent code as input and recursively decodes a symbolic structure hierarchy (⟨ˆl_1, ˆl_2, ..., ˆl_N⟩, ˆH, ˆR) as the output. The part instance identifiers are not involved in the decoding procedure.
Fig. 5. We train two coupled variational autoencoders (VAEs) with recursive encoders and decoders and learn disentangled latent spaces for shape geometry and structure. The left figure illustrates the joint learning procedure of the structure VAE (shown in red) and the geometry VAE (shown in blue). In the encoding stages, the structure features summarize the symbolic part semantics and recursively compute sub-hierarchy structure contexts, while the geometry features encode the detailed part geometry for leaf nodes and propagate the geometry information along the same hierarchy. The decoding procedures of the VAEs are supervised to reconstruct the hierarchical structure and geometry information in an inverse manner. The right figure illustrates the shared message-passing mechanism used in both VAEs among related part nodes in the encoding (top) and decoding (bottom) stages, as well as the matching procedure for simultaneous training of the decoding stages of the two VAEs (middle). The blue and red nodes refer to the part nodes in the geometry and structure hierarchies respectively. In the encoding stage, two branches aggregate the information (geometry/structure) of the same-type siblings respectively; each performs several message-passing operations along the relation edges among the siblings and finally gathers the information into a feature via max-pooling and FC layers. In the decoding stage, two branches decode one feature into its siblings for geometry and structure. The structure branch predicts node existence and the edges among the existing nodes; the geometry branch utilizes the predicted relationships. Based on this, the final node features of the two branches are updated by several message-passing operations.
The recursive graph structure decoder Dec_RvS consumes the parent structure feature ˆf_Si and infers a set of children node structure features {˜f_Si,1, ˜f_Si,2, ..., ˜f_Si,10}, where we assume there are at most 10 children parts per parent node. Following StructureNet [Mo et al. 2019a], we predict a semantic label and an existence probability for each part by another fully-connected layer followed by classification output layers. Besides the node prediction, by connecting all pairs of parts, we also predict a set of symmetry or adjacency edges ˆR_i among the existing nodes. Along the predicted edges, the node features {˜f_Si,k}_k are updated via two graph message-passing operations, and finally we decode a set of structure part nodes {ˆf_Sj_1, ˆf_Sj_2, ..., ˆf_Sj_{K_i}}, where K_i denotes the number of existing children nodes for part P_i. We refer the readers to StructureNet [Mo et al. 2019a] for more details. In summary, we have

{ˆf_Sj_1, ˆf_Sj_2, ..., ˆf_Sj_{K_i}, ˆR_i} = Dec_RvS(ˆf_Si)    (4)

We repeat the recursive structure decoding procedure until reaching the leaf nodes. For a leaf node part ˆP_i, the part structure decoder Dec_PS simply decodes the part semantic label via a fully-connected layer followed by outputting a likelihood score for each part semantic label. Finally, we get

ˆl_i = Dec_PS(ˆf_Si)    (5)

To train the hierarchical decoding process, StructureNet [Mo et al. 2019a] predicts part geometry for the intermediate nodes and establishes a correspondence between the predicted set of parts and the ground-truth set of parts. However, it is difficult to directly adapt this training procedure to decode the symbolic structure hierarchy by matching the part semantic labels. We resolve this challenge by building a communication channel between the structure hierarchy and the geometry one, borrowing the corresponding part geometry decoded in the geometry VAE for the matching procedure. In our implementation, we resort to the conditional part geometry decoder Dec_PG introduced in Sec. 3.2 and predict an oriented bounding box geometry ˆB_j for each part ˆP_j, where j = j_1, j_2, ..., j_{K_i}. We choose to use the OBB geometry for the matching process instead of the mesh geometry ˆG_i, since we observe decreased accuracy when registering the box mesh G_box to an intermediate part geometry, which is usually more complex than leaf-node parts.
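A simplified sketch of Dec_RvS (Eq. 4): the parent feature is expanded into at most 10 candidate children, each scored for existence, and relation edges are scored for every child pair. Thresholding existence, as done here, is an inference-time shortcut (during training the scores are supervised directly), and the edge-type count is an assumption:

```python
import torch
import torch.nn as nn

class RecursiveStructDecoder(nn.Module):
    """Sketch of Dec_RvS (Eq. 4); hidden sizes are assumptions."""
    def __init__(self, dim=128, max_children=10, num_edge_types=4):
        super().__init__()
        self.expand = nn.Linear(dim, max_children * dim)
        self.exists = nn.Linear(dim, 1)                     # existence score
        self.edge = nn.Linear(2 * dim, num_edge_types + 1)  # +1 = "no edge"
        self.max_children, self.dim = max_children, dim

    def forward(self, parent_feat):
        kids = self.expand(parent_feat).view(self.max_children, self.dim)
        keep = torch.sigmoid(self.exists(kids)).squeeze(-1) > 0.5
        kids = kids[keep]                          # existing children only
        K = kids.shape[0]
        pairs = [(a, b) for a in range(K) for b in range(a + 1, K)]
        edge_logits = {p: self.edge(torch.cat([kids[p[0]], kids[p[1]]]))
                       for p in pairs}             # per-pair relation scores
        return kids, edge_logits
```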
To train the part existence scores, part edge predictions and part semantic labels, we follow StructureNet [Mo et al. 2019a] and refer the readers to that paper for more details. We use a KL divergence loss term to encourage the structure latent space to be close to the unit multivariate Gaussian distribution.
Geometry VAE. Given a geometry hierarchy ⟨G_1, G_2, ..., G_N⟩ encoding the part geometry of shape parts, the geometry VAE learns to map the shape geometry to a geometry latent space, disentangled from the structure latent space. The geometry latent space is also modeled as a unit multivariate Gaussian distribution.

The geometry VAE shares a similar network architecture with the structure VAE. The encoding process starts from extracting part geometry features for all leaf-node parts via a part geometry encoder Enc_PG and then recursively propagates the geometry features along the hierarchy to the root node, summarizing the geometry information of the entire shape through a recursive graph part geometry encoder Enc_RvG. For the decoding process, we first use a recursive graph geometry decoder Dec_RvG that hierarchically decodes the geometry features from the root to the leaf-node parts in an inversely recursive manner. Then, we leverage a part geometry decoder Dec_PG to reconstruct the part geometry for leaf-node parts.

There are two communication channels that allow the synergistic structure hierarchy to guide the geometry VAE encoding and decoding procedures. Firstly, the part geometry encoder Enc_PG and decoder Dec_PG are conditioned on the structure context produced by the structure VAE, which allows generating different kinds of part geometry according to different part semantics and shape structures. Secondly, the graph message-passing procedures in the recursive graph geometry encoder Enc_RvG and decoder Dec_RvG borrow the part hierarchy and relationships defined in the structure hierarchy. In the following, we describe the encoding and decoding stages for learning the geometry VAE in more detail.
Encoders.
We start from encoding each leaf node part geometry G_i = (X_i, c_i) into a latent part geometry feature space. We use the conditional part geometry encoder Enc_PG introduced in Sec. 3.2 that maps the part ACAP feature X_i and the part center c_i to a 128-dimensional feature f_Gi, namely,

f_Gi = Enc_PG([X_i; c_i], f_Si)    (6)

The network is conditioned on the structure code f_Si generated in the structure VAE, in order to gain structural context on the semantics of the current part and the role the part plays in generating the final shape.

For each sub-hierarchy of the part geometry, we recursively produce the intermediate part geometry node feature f_Gi by aggregating its children geometry node features {f_Gj}_j through the recursive graph geometry encoder Enc_RvG. Similar to the design of Enc_RvS for the structure VAE, it performs two iterations of graph message-passing operations among the children geometry node features based on the part relationships between sibling part nodes, and conducts a simple max-pooling operation to compute f_Gi:

f_Gi = Enc_RvG({f_Gj}_{(P_i, P_j) ∈ H})    (7)

Different from the recursive graph structure encoder Enc_RvS shown in Eq. 3, we do not encode the part geometry for non-leaf nodes, since the geometry is more complex and the registration to a box mesh is less accurate. The increased geometric complexity also makes it harder to effectively embed them in a low-dimensional latent space. For the message-passing operations, we borrow the part relationships defined in the structure hierarchy. This is achieved by maintaining a bijective mapping among the tree nodes in the structure and geometry hierarchies, as illustrated in Figure 5.

We repeatedly apply the recursive graph geometry encoder Enc_RvG until reaching the root node P_root. The final root node geometry feature f_Groot is then mapped to the final geometry embedding space through a fully-connected layer.
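Enc_RvG can reuse the same message-passing skeleton as Enc_RvS; below is a sketch highlighting its two differences, namely that it consumes children geometry features and borrows the relation edges from the corresponding structure node, with no part identifiers (hidden sizes again assumed):

```python
import torch
import torch.nn as nn

class RecursiveGeoEncoder(nn.Module):
    """Sketch of Enc_RvG (Eq. 7): aggregates children geometry features along
    edges defined on the bijectively-mapped structure node."""
    def __init__(self, dim=128, rounds=2):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, dim)
        self.rounds = rounds

    def forward(self, child_geo_feats, struct_edges):
        f = child_geo_feats                    # (K, dim) geometry features
        for _ in range(self.rounds):
            agg = torch.zeros_like(f)
            for s, d in struct_edges:          # edges borrowed from structure
                agg[d] = agg[d] + torch.relu(self.msg(torch.cat([f[s], f[d]])))
            f = f + agg
        return self.out(f.max(dim=0).values)   # max-pool then FC
```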
Decoders.
The decoding process of the geometry VAE takes a geometry latent code as input and recursively decodes a geometry hierarchy ⟨ˆG_1, ˆG_2, ..., ˆG_N⟩ for a shape.

The recursive graph geometry decoder Dec_RvG takes the parent geometry feature ˆf_Gi as input and decodes a set of children node geometry features {˜f_Gi,1, ˜f_Gi,2, ..., ˜f_Gi,10}. Then, based on the structural predictions of part existence scores, part semantic labels and part edge information from the synergistic structure VAE, we conduct two iterations of graph message-passing over the children node geometry features along the predicted pairwise part relationships ˆR_i. The decoder Dec_RvG then produces a final set of children nodes with the predicted part geometry features:

{ˆf_Gj_1, ˆf_Gj_2, ..., ˆf_Gj_{K_i}} = Dec_RvG(ˆf_Gi, ˆR_i)    (8)

where Dec_RvG is conditioned on the part relationships ˆR_i decoded in the structure VAE, and K_i denotes the number of existing part nodes predicted by the recursive graph structure decoder Dec_RvS.

We repeat the recursive graph geometry decoding procedure until reaching the leaf nodes. For a leaf node part ˆP_i, we use the conditional part geometry decoder Dec_PG introduced in Sec. 3.2 to reconstruct ˆG_i = (ˆX_i, ˆc_i) from an input part geometry feature ˆf_Gi. Formally, we have

ˆG_i = Dec_PG(ˆf_Gi, ˆf_Si)    (9)

Notice that the network Dec_PG is conditioned on the part structure code ˆf_Si predicted in the coupled structure VAE decoding procedure. The geometry VAE is trained jointly with the structure VAE and the conditional part geometry VAE. To supervise the reconstruction of the leaf-node part geometry in the decoding process, we simply adapt the loss terms defined in Eq. 1 from Sec. 3.2. We also add a KL divergence loss term to encourage the geometry latent space to be close to the unit multivariate Gaussian distribution.
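Putting the pieces together, one joint optimization step might look as follows. The decomposition of the total loss into these named terms and the unit weights on the structure terms are our assumptions; only the Eq. 1 weighting λ and the two KL terms are stated in the text:

```python
import torch

def joint_training_step(batch, model, optimizer, lam=1.0):
    """One optimization step for the coupled VAEs; `model` is assumed to
    return a dict of the per-shape loss terms described above."""
    optimizer.zero_grad()
    out = model(batch)                 # runs both recursive encoders/decoders
    loss = (lam * out["geo_recon"]     # leaf ACAP + center reconstruction
            + out["kl_geometry"]       # geometry latent prior
            + out["kl_structure"]      # structure latent prior
            + out["struct_exist"]      # part existence scores
            + out["struct_sem"]        # leaf semantic labels
            + out["struct_edge"])      # symmetry/adjacency edge prediction
    loss.backward()
    optimizer.step()
    return float(loss)
```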
Fig. 6. Example shapes in the PartNet dataset (left) and the synthetic dataset (right).

Table 1. We summarize the data statistics of the two datasets in our experiments. We use four categories from PartNet (chairs, tables, cabinets and lamps) for the majority of our experiments and one synthetic dataset (SynChair) for evaluating disentangled shape reconstruction.
4 EXPERIMENTS

Learning disentangled latent spaces for shape structure and geometry allows us to generate high-quality 3D shape meshes with complex structure and detailed geometry in a controllable manner. Not only do we demonstrate state-of-the-art performance for structured shape generative modeling, we also illustrate how our DSM-Net can generate shape meshes with controllable structure and geometry configurations.

In this section, we present extensive experiments on the tasks of shape reconstruction, generation and interpolation, and show the superior performance of our proposed method on the PartNet dataset [Mo et al. 2019c] compared to several strong baseline methods, including StructureNet [Mo et al. 2019a], SDM-Net [Gao et al. 2019b], IM-Net [Chen and Zhang 2019] and BSP-Net [Chen et al. 2019a]. We also propose and formulate the tasks of disentangled shape reconstruction, generation and interpolation, where we manipulate one factor of shape structure and geometry while keeping the other unchanged. We further benchmark our performance for disentangled shape reconstruction on a synthetic dataset. All experiments were carried out on a computer with an i9-9900K CPU, 64 GB RAM, and a GTX 2080Ti GPU.
4.1 Datasets

We primarily use the PartNet dataset [Mo et al. 2019c] for the majority of our experiments. PartNet provides fine-grained, multi-scale and hierarchical shape part segmentations for ShapeNet [Chang et al. 2015] models. We use the four biggest and most commonly used object categories for our experiments: chairs, tables, cabinets and lamps. Table 1 summarizes the data statistics. We follow the official training and testing data splits. Figure 6 (left) shows example shapes in PartNet.

All the PartNet shapes from the same object category share a canonical part template with consistent part semantics. The vertical parent-child relationships are defined consistently according to the shared part semantics set. However, the horizontal pairwise part symmetry and adjacency relationships are detected from the part annotations, which provides different part structures for different shapes. Also, the part hierarchies for complex shapes usually contain more part instances than the ones for simple shapes. We directly follow the part semantics, hierarchy and relationships introduced in StructureNet [Mo et al. 2019a], but we disentangle the unified part hierarchy into two disentangled but coupled structure and geometry hierarchies (see Figure 2). Following StructureNet [Mo et al. 2019a], we only use shapes in which each parent part has at most 10 children parts.

Moreover, for quantitatively evaluating the task of disentangled shape reconstruction, we further introduce a synthetic dataset that contains 10,800 shapes (see Figure 6, right) with 54 kinds of shape structures and 200 geometric variations. Each shape is generated by picking one shape structure and one geometric variation, granting us access to the ground-truth shape synthesis outcome for every configuration pair. The 54 structures are generated by enumerating structural combinations of different back types, leg styles, and whether the chair has arms or not. The 200 geometric variations are created by varying the global parameters of the part geometry (e.g. the width of the legs, the height of the back). The dataset is divided into training and testing sets with a ratio of 3:1. We will release the code and data to facilitate future research.
4.2 Implementation Details

For our network, we train the part geometry conditional VAE and the coupled hierarchical VAEs simultaneously. The part geometry conditional VAE is used to recover the part geometric details according to the structure context. The two coupled hierarchical VAEs aim to learn two latent spaces for shape geometry and structure in a disentangled but tightly coupled manner. Training of the whole network is optimized with the Adam solver [Kingma and Ba 2014]. All learnable parameters are initialized randomly from a Gaussian distribution. For the training of the whole network, we set the batch size to 16 and the learning rate to start from 0.001, decaying every 100 steps with a decay rate of 0.9, until the loss converges after about 1000 iterations.
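In PyTorch, this schedule corresponds to the following setup, assuming `model` is the coupled network and `loader` yields batches of 16 shapes (both placeholders here):

```python
import torch

# Adam with initial lr 1e-3, decayed by 0.9 every 100 steps, as stated above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)

for step, batch in enumerate(loader):
    loss = joint_training_step(batch, model, optimizer)  # see sketch above
    scheduler.step()
```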
In this section, we present the shape reconstruction performance of our DSM-Net and provide quantitative and qualitative comparisons to state-of-the-art 3D shape generative models. Figure 7 shows shape reconstruction results of our DSM-Net on the four shape categories in PartNet. We observe that our method successfully captures both the complex shape structure and the fine-grained geometry details. We then propose a novel task of disentangled shape reconstruction, which takes two shapes as input and re-synthesizes a novel shape with the structure of one shape and the geometry of the other. We present qualitative results on PartNet and quantitative evaluations on the synthetic dataset, where ground-truth re-synthesis outputs are available.
Baselines.
We compare DSM-Net to four state-of-the-art methods for learning 3D shape representations: IM-Net [Chen and Zhang 2019], BSP-Net [Chen et al. 2019a], StructureNet [Mo et al. 2019a] and SDM-Net [Gao et al. 2019b]. IM-Net learns an implicit function representation for encoding 3D shapes, while BSP-Net focuses on designing a compact mesh representation for 3D shapes. They both represent shapes as a whole, without explicit modeling of shape parts and structures. StructureNet and SDM-Net are more relevant baselines to our method, since both explicitly represent shapes as part hierarchies. StructureNet uses a point cloud representation for the part geometry, which we empirically find less effective at generating fine-grained geometric details. SDM-Net represents shapes with shallower part hierarchies, which prevents it from generating shapes with complicated structures. All results of the four baselines are reproduced with their official pre-trained models on the corresponding sub-categories of ShapeNet. In Figure 8, we can clearly see that our method achieves the best of both worlds, reconstructing shapes with more accurate structure and more detailed geometry.

Fig. 7. Gallery of shape reconstruction results on PartNet: (a) Chair, (b) Table, (c) Cabinet, (d) Lamp. For each set of results, the left column shows the ground-truth targets and the right column presents our reconstruction results. We observe that our method captures complex shape structures and detailed part geometry at the same time.
Metrics.
We adopt two kinds of metrics for quantitative comparisons to the baseline methods: geometry metrics and a structure metric. For the geometry metrics, we compare the reconstructed shapes against the input shapes without explicitly considering the shape parts and structures. We follow the commonly used metrics in the literature: Chamfer Distance (CD) [Barrow et al. 1977] and Earth Mover's Distance (EMD) [Rubner et al. 2000]. CD and EMD are two permutation-invariant metrics for evaluating the difference between two unordered point sets [Fan et al. 2017]. CD measures, for each point in one set, the distance to its nearest point in the other set, while EMD solves an optimization for a bijective mapping between the two point sets. For the structure metric, we use the HierInsSeg score proposed in PT2PC [Mo et al. 2020]. To compute the HierInsSeg score, Mo et al. [2020] first parse the reconstructed shape point cloud into the PartNet part hierarchy using a pre-trained hierarchical shape instance segmentation network, and then compute the normalized tree-editing distance between the reconstructed and ground-truth part hierarchies. We refer the readers to Fan et al. [2017] and Mo et al. [2020] for more details on the definitions of these metrics.
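For concreteness, a minimal NumPy/SciPy sketch of the two geometry metrics is given below; CD conventions vary in the literature (squared vs. Euclidean distances, sum vs. mean), so this is one common variant rather than our exact evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric CD between point sets a (N, 3) and b (M, 3): mean Euclidean
    nearest-neighbor distance, accumulated in both directions."""
    d = cdist(a, b)                         # pairwise Euclidean distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def earth_movers_distance(a: np.ndarray, b: np.ndarray) -> float:
    """EMD via an optimal bijection; assumes equal-size point sets."""
    d = cdist(a, b)
    rows, cols = linear_sum_assignment(d)   # optimal one-to-one matching
    return d[rows, cols].mean()

# Example on two random 1024-point clouds.
a, b = np.random.rand(1024, 3), np.random.rand(1024, 3)
print(chamfer_distance(a, b), earth_movers_distance(a, b))
```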
Table 2. Shape reconstruction quantitative evaluations. We use two geometry metrics (CD and EMD) and one structure metric (HierInsSeg). DSM-Net achieves the best geometry performance among all baseline methods and takes second place in structure reconstruction accuracy: it achieves a HierInsSeg score comparable to StructureNet, while beating it on the geometry metrics by a large margin.
Dataset  Method        CD ↓      EMD ↓     HierInsSeg (HIS) ↓
Chair    StructureNet  0.00973   –         0.43912
         IM-Net        0.004795  0.117533  0.689327
         BSP-Net       0.004117  0.109377  0.743159
         SDM-Net       0.006025  0.187308  0.845902
         Ours
Table    IM-Net        0.004993  0.17932   1.139477
         BSP-Net       0.004733  0.170344  1.203952
         SDM-Net       0.007891  0.223127  1.389521
         Ours
Results.
Table 2 presents the quantitative comparisons between our method and the baselines. Our method outperforms all baseline methods on the geometry metrics, indicating that DSM-Net better captures and reconstructs shape geometry. We also beat IM-Net, BSP-Net and SDM-Net by large margins on the structure metric HierInsSeg, while achieving performance comparable to StructureNet. Although SDM-Net utilizes structure information, the shape segmentation it uses is very coarse and some complex structures simply do not exist in its results, so SDM-Net has the worst HierInsSeg score. Figure 8 shows the qualitative comparison with the other methods on the table and chair categories. It is easy to observe that IM-Net, BSP-Net and SDM-Net fail to generate complicated shape structures, such as the chair back bars and the table leg stretchers, while our method successfully captures these complex structures. Compared with StructureNet, we reconstruct the shape geometry more accurately: for example, StructureNet fails to reconstruct the slanted back bars of the chair in the first row, and does not recover an accurate aspect ratio for the table top surface in the third row.

Fig. 8. Shape reconstruction comparison with the baseline methods: (a) input shape, (b) IM-Net, (c) BSP-Net, (d) SDM-Net, (e) StructureNet, (f) ours. DSM-Net reconstructs high-quality shape meshes with complex shape structures and detailed part geometry. IM-Net, BSP-Net and SDM-Net fail to reconstruct the complicated shape structures (e.g., chair back bars and table leg stretchers), while StructureNet generates point cloud shapes with fewer part geometry details and inaccurate part geometry. For instance, StructureNet fails to reconstruct the slanted bars of the chair in the first row and loses accuracy in the aspect ratio of the table top surface in the third row.

Table 3. Quantitative evaluations of disentangled shape reconstruction on the synthetic data. We compare to an ablated version of our method, ours (no edge), since there is no applicable published baseline for this novel task. We observe that allowing edge communication between the structure and geometry hierarchies is essential for learning good shape representations.

Method          CD ↓      EMD ↓     HierInsSeg (HIS) ↓
Ours (no edge)  0.001577  0.146813  1.959482
Ours
GT                                  1.79281
Disentangled Shape Reconstruction.
Our method learns two disentangled latent manifolds (structure and geometry) for shape representation, which opens up new possibilities for controllable shape editing and re-synthesis. Given two input shapes, one can push both through our structure and geometry VAE encoders to obtain the structure and geometry features of each shape. Then, by re-combining the structure code of one shape with the geometry code of the other, DSM-Net re-synthesizes a novel shape that follows the structure of the first shape and the geometry of the second. Figure 9 (left) shows a set of qualitative results on the PartNet dataset. The shapes in each row share the same geometry code, while the shapes in each column share the same structure feature; the top left and bottom right chairs are the inputs and the remaining chairs are generated automatically by our DSM-Net. The figure demonstrates that our method can re-synthesize novel shapes for arbitrary pairs of geometry and structure configurations.
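The re-combination step can be summarized as the following sketch; the flat feature encoders and decoder are hypothetical stand-ins, since the actual DSM-Net encoders and decoders operate on part hierarchies rather than flat vectors:

```python
import torch

# Hypothetical flat stand-ins for the DSM-Net structure/geometry VAEs and decoder.
encode_structure = torch.nn.Linear(1024, 128)
encode_geometry = torch.nn.Linear(1024, 128)
decode_shape = torch.nn.Linear(256, 1024)

def resynthesize(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Combine the structure of shape A with the geometry of shape B."""
    z_struct = encode_structure(feat_a)    # structure code of shape A
    z_geom = encode_geometry(feat_b)       # geometry code of shape B
    return decode_shape(torch.cat([z_struct, z_geom], dim=-1))

novel = resynthesize(torch.randn(1, 1024), torch.randn(1, 1024))
```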
Fig. 9. Disentangled shape reconstruction and interpolation results on (a) PartNet chairs and (b) the synthetic data. The top left and bottom right chairs are the input shapes; the remaining chairs are generated automatically with our DSM-Net, where in each row the structure of the shapes is interpolated while keeping the geometry unchanged, and in each column the geometry is interpolated while retaining the structure.
We further quantitatively benchmark the performance of DSM-Net for disentangled shape reconstruction on the synthetic dataset, where we have access to the ground-truth reconstruction result for every pair of structure and geometry configurations. Table 3 shows the quantitative results. Since there are no applicable baseline methods for this novel task, we compare with an ablated version of our method, ours (no edge), in which we ignore the part relationships in the part hierarchies and remove the graph message-passing procedures, further reducing the communication between structure and geometry. We see that removing edge communication leads to worse performance, demonstrating the importance of maintaining the synergy between the disentangled structure and geometry hierarchies. Figure 9 (right) presents qualitative results on the synthetic data.
The main goal of DSM-Net is to generate high-quality shapes with complex structures and fine-grained geometry. Given a noise vector sampled from a unit Gaussian distribution, a 3D shape generative model maps it to a realistic 3D shape. We evaluate the shape generation performance of DSM-Net and perform qualitative comparisons to several state-of-the-art baseline methods; quantitative evaluations and a user study further validate our superior performance over the baselines. Equipped with two disentangled latent spaces for shape structure and geometry, DSM-Net also enables a novel task of generating shapes with a given structure or geometry pattern.
Metrics.
The shape generation task aims to generate diverse shapes with complex structure and geometry, covering the data distribution as much as possible; meanwhile, a good generative model should also generate shapes that are as realistic as possible. Following StructureNet [Mo et al. 2019a], we measure shape generation performance with coverage and quality scores. The coverage score computes the average distance from each real shape to its closest generated shape, while the quality score computes the average distance from each generated shape to its closest real shape. The coverage score reflects whether the diversity of the generated results is large enough to cover all real samples, and the quality score measures whether the generated results contain bad examples that are far from the real data distribution. To compare with the baseline methods, we generate 1000 shapes and compute the coverage and quality scores under the geometry metric (CD) and the structure metric (HierInsSeg).
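Both scores reduce to nearest-neighbor distances between the generated and real sets; a minimal sketch follows, where dist stands for either the Chamfer distance or the HierInsSeg distance:

```python
import numpy as np

def coverage_and_quality(real, gen, dist):
    """Coverage: average distance from each real shape to its closest generated
    shape. Quality: average distance from each generated shape to its closest
    real shape. The paper reports both relative to DSM-Net's own scores."""
    cov = float(np.mean([min(dist(r, g) for g in gen) for r in real]))
    qua = float(np.mean([min(dist(g, r) for r in real) for g in gen]))
    return cov, qua

# Toy usage with a stand-in distance; in the paper, dist is CD or HierInsSeg.
real = [np.random.rand(64, 3) for _ in range(10)]
gen = [np.random.rand(64, 3) for _ in range(10)]
print(coverage_and_quality(real, gen, lambda a, b: float(np.abs(a - b).mean())))
```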
Results.
Figure 10 shows eight generated shapes for each of the four object categories in PartNet, generated by randomly sampling the two latent spaces. The results show the diversity of the generated shapes in both structure and geometry. In Figure 11, comparing with SDM-Net and StructureNet, we demonstrate that our method generates shapes with more complex structure and better geometric detail. In this experiment, we randomly generate shapes with the different methods and then select similar shapes to compare their quality. We observe that SDM-Net cannot handle complex structure, and StructureNet performs worse than ours in capturing detailed shape geometry. We report quantitative evaluation results relative to the performance of DSM-Net (i.e., all reported scores are divided by the corresponding DSM-Net scores for normalization) in Table 4, where we see clear improvements over the baseline methods.

In addition, we conduct a user study to further evaluate how realistic the generated shapes appear to humans. We render the shapes into images with the same settings and ask each user 10 questions. In every question, the user ranks the three algorithms according to three criteria (geometry, structure and overall); we shuffle the order of the algorithms each time and sample the shapes from the three methods randomly. We show the results of the user study in Table 5, where we observe that our generated shapes perform best on all three criteria. We also see that StructureNet takes second place for shape structure, while SDM-Net does better on shape geometry.

Fig. 10. Shape generation results. We sample random Gaussian noise vectors and use our DSM-Net to generate realistic shapes with complex structures and detailed geometry. Here we show eight generation results for each of the four object categories in PartNet.
Disentangled Shape Generation.
DSM-Net learns two disentangled latent spaces for modeling shape structure and geometry, which enables a novel task of generating shapes with a given structure or geometry pattern. Given an input shape, DSM-Net can extract the structure code from the shape and pair it with randomly sampled geometry codes, which allows us to explore geometric variations satisfying a certain shape structure. It works equally well to explore structure variations while keeping the geometry code unchanged. We show two controllable generation results in Figure 12. In these experiments, given an input shape, the geometry code and structure code are extracted by running it through the encoding procedure.
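As a sketch (with the same hypothetical flat stand-ins as in the reconstruction example), exploring geometry variations under a fixed structure amounts to re-decoding with freshly sampled geometry codes:

```python
import torch

encode_structure = torch.nn.Linear(1024, 128)   # hypothetical stand-ins, as before
decode_shape = torch.nn.Linear(256, 1024)

z_struct = encode_structure(torch.randn(1, 1024))   # structure code, held fixed
# Sample geometry codes from the unit Gaussian prior and decode each pairing,
# exploring geometric variations under one fixed structure (swap the roles of
# the two codes to explore structures under a fixed geometry).
variations = [decode_shape(torch.cat([z_struct, torch.randn(1, 128)], dim=-1))
              for _ in range(8)]
```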
Table 4. Quantitative evaluations on shape generation. We report the coverage and quality scores relative to DSM-Net (i.e., all reported scores are divided by the corresponding DSM-Net scores for normalization) under the geometry metric (Chamfer Distance) and the structure metric (HierInsSeg), comparing to StructureNet and SDM-Net as two baseline methods. We observe that DSM-Net achieves the best performance across all metrics.
              Geometry                Structure
Method        Coverage ↑  Quality ↑   Coverage ↑  Quality ↑
SDM-Net       0.587687    0.230641    0.422925    0.479782
StructureNet  0.702391    0.766193    0.760957    0.975336
Ours          1.000       1.000       1.000       1.000
Table 5. User study results on shape generation. We show the average ranking score of the three methods: SDM-Net, StructureNet, and ours. Rankings range from 1 (best) to 3 (worst) and are computed over 238 trials. Our method achieves the best scores in terms of both structure and geometry.
Method        Structure  Geometry  Overall
SDM-Net       2.6832     1.7853    1.7706
StructureNet  1.7448     2.6233    2.7014
Ours
Then, we keep one of the codes unchanged and randomly sample in the other latent space; Figure 12 shows the resulting controllable generations. We see that when we preserve the geometry code, the chair legs usually maintain a width and length similar to the input shape, whereas when we keep the structure code unchanged, we generate shapes with large geometric variations that satisfy the same symbolic structure hierarchy. This enables novel applications such as exploring plausible geometry variations for a given shape structure, or editing shape structures while keeping similar geometric patterns.

Fig. 11. Qualitative comparisons on shape generation: (a) SDM-Net, (b) StructureNet, (c) ours. Our method learns to generate shapes with complex structures and fine-grained geometry, whereas StructureNet fails to generate high-quality shape geometry and SDM-Net cannot generate shapes with complex part structures.

Fig. 12. Qualitative results for disentangled shape generation. Given an input shape (a), we extract its geometry and structure codes; fixing one of them, we randomly sample the other latent space to generate new shapes (b). In the first row of (b), we keep the geometry code unchanged and randomly explore the structure latent space; in the second row, we keep the structure code unchanged and randomly sample the geometry latent space.
If a shape generative model learns a smooth latent space for shape embedding, it allows users to create novel shapes by interpolating given shapes on the latent manifold. We evaluate DSM-Net on interpolating between shape pairs and demonstrate that our network learns a smooth latent space for shape interpolation.
Fig. 13. Shape interpolation results on the four PartNet categories. We linearly interpolate both the structure and geometry features of the two shapes. In the interpolated steps, we see both continuous geometry variations and discrete structure changes.
Moreover, with the help of the two learned disentangled latent spaces for shape structure and geometry, we can also achieve controllable interpolation between two shapes, varying shape structure while keeping geometry unchanged, and vice versa.
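Concretely, following the linear interpolation used for Figure 13, the interpolation itself is a straight walk in one or both latent spaces; the decoder and the 128-dimensional codes below are placeholders:

```python
import torch

def interpolate_codes(z_a: torch.Tensor, z_b: torch.Tensor, steps: int = 5):
    """Linear interpolation between two latent codes."""
    return [(1 - t) * z_a + t * z_b for t in torch.linspace(0.0, 1.0, steps)]

# Joint interpolation walks both spaces; disentangled interpolation freezes one.
decode_shape = torch.nn.Linear(256, 1024)      # placeholder decoder
z_geom_fixed = torch.randn(1, 128)             # geometry code held fixed
for z_s in interpolate_codes(torch.randn(1, 128), torch.randn(1, 128)):
    shape = decode_shape(torch.cat([z_s, z_geom_fixed], dim=-1))
```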
Shape Interpolation Results.
Figure 13 shows interpolation results on the four shape categories, obtained by interpolating the structure and geometry latent spaces jointly. All interpolated results exhibit both geometric and structural changes, and the interpolation in our two learned latent spaces leads to valid and functional shapes: at each interpolated step, we see continuous geometry variations together with discrete structure changes. For example, in the first row, the armrests become smaller and then disappear while the backrest changes from square to round in a natural manner. In the second row, the backrest gradually becomes square, while the support disappears from the first chair to the second.
Disentangled Shape Interpolation Results.
Our disentangled representations for shape structure and geometry also allow controllable interpolation between two shapes while keeping the structure or geometry unchanged. Figure 14 shows such controllable interpolation results. Given a pair of source and target shapes (a, b), we extract the geometry and structure codes of both shapes, and then perform interpolation in the structure or geometry latent space while using the code of shape b in the other space. From the results, we find that the interpolation is controllable and every interpolated shape is realistic and reasonable, with a clear disentanglement of shape structure and geometry.

We perform several ablation studies to demonstrate the necessity of the key components of our method. First, we demonstrate that explicitly considering part relationships and conducting graph message-passing operations along the edges is important: removing the edge components from our network gives significantly worse results. Then, we validate the design choice of learning a unified conditional part geometry VAE, instead of training separate VAEs for each part semantic as used in SDM-Net [Gao et al. 2019b].
Table 6. Quantitative shape reconstruction performance comparing our full pipeline with two ablated versions: Ours (no edge), which removes the edge components and the graph message-passing modules, and StructureNet + Mesh, which naively combines the StructureNet backbone with the SDM-Net ACAP mesh representation. We observe worse performance when we remove edges from the part hierarchies or naively replace the point cloud representation with the ACAP mesh representation.
Dataset  Method             CD ↓      EMD ↓     HierInsSeg (HIS) ↓
Chair    StructureNet (SN)  0.00973   –         0.43912
         SN + Mesh          0.003317  0.089114  0.514022
         Ours (no edge)     0.003784  0.099193  0.624833
         Ours
         SN + Mesh          0.004789  0.095722  0.969731
         Ours (no edge)     0.005341  0.106832  1.079425
         Ours
         SN + Mesh          0.004019  0.159338  0.583601
         Ours (no edge)     0.004839  0.193373  0.697722
         Ours
         GT                                     0.547783
Removing Part Relationships and Edges.
Our network explicitly models part relationships as horizontal edges among sibling nodes in the shape part hierarchy, and graph message-passing operations are conducted along these edges in both the encoding and decoding stages. In this experiment, we compare to a no-edge version of our network in which we remove the edge components and the message-passing modules. In Table 6, we see that removing the edge components gives worse results than our full pipeline. Figure 15 illustrates three example reconstructed shapes for our method with and without edge components, where we see clearly that removing the edge components creates more artifacts, such as disconnected parts.

Fig. 14. Qualitative results for disentangled shape interpolation. (a, c, e) and (b, d, f) respectively show the source and target shapes. The following two rows present the interpolation result in one latent space (geometry or structure) while using the code of the target shape in the other latent space. Concretely, the first row interpolates the structure between the two shapes while fixing the geometry code of the target shape, and the second row interpolates the geometry while fixing the structure code of the target shape. We see a clear disentanglement of shape structure and geometry in the interpolated results.

Fig. 15. Qualitative comparison on shape reconstruction with the no-edge version of our method. Removing edges introduces disconnected parts in the reconstructed shapes.
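As a simplified illustration of the edge component removed in this ablation, one round of message passing among sibling parts might look like the following sketch; the actual DSM-Net message functions and feature dimensions differ:

```python
import torch
import torch.nn as nn

class SiblingMessagePassing(nn.Module):
    """One simplified round of message passing along horizontal sibling edges,
    i.e., the component removed in the 'no edge' ablation."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)      # message from node i to node j
        self.update = nn.Linear(2 * dim, dim)   # node update from aggregated messages

    def forward(self, feats, edges):
        # feats: (num_parts, dim); edges: directed (src, dst) symmetry/adjacency pairs
        agg = [torch.zeros_like(feats[0]) for _ in range(feats.shape[0])]
        for i, j in edges:
            agg[j] = agg[j] + torch.relu(self.msg(torch.cat([feats[i], feats[j]], dim=-1)))
        return torch.relu(self.update(torch.cat([feats, torch.stack(agg)], dim=-1)))

mp = SiblingMessagePassing()
updated = mp(torch.randn(4, 128), [(0, 1), (1, 0), (2, 3), (3, 2)])
```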
Naive Combination of StructureNet and the SDM-Net ACAP Mesh Representation.
Our network focuses on disentangling shape structure and geometry for controllable generation of meshes with fine geometry and complicated structure. If we set aside the applications enabled by the disentangled design, such as the disentangled shape generation and interpolation presented in previous sections, a natural baseline naively combines StructureNet's structure generation with the detailed part geometry representation (ACAP) for leaf-node mesh generation. Compared to our network with two disentangled structure and geometry VAEs, this baseline uses one shared backbone for both. In Table 6, we clearly see that this naive baseline (SN + Mesh) obtains worse results than our DSM-Net.
Training Separate Part Geometry VAEs.
In our network design, for encoding and decoding leaf-node part geometry, we train a unified part geometry VAE conditioned on the part semantic labels and structure contexts. SDM-Net, in contrast, uses separate part geometry VAEs for different part semantics. While this is workable in the SDM-Net experiments, it is costly and ineffective to train separate networks on the PartNet data, where there are many more fine-grained part categories than in the SDM-Net dataset. In Table 7, we evaluate the alternative of training our network with 57 separate part geometry VAEs for PartNet chairs and show that its shape reconstruction performance is significantly worse than training a unified conditional part geometry VAE. Figure 16 compares the reconstructed shapes of a chair; the unified part geometry VAE reconstructs parts with more fine-grained geometric details, e.g., the curvy armrests and the delicate sofa feet.

Fig. 16. Comparison of reconstruction results for a chair: (a) input shape; (b) reconstruction using separate part geometry VAEs for different part semantics; (c) reconstruction using a single conditional part geometry VAE for all part semantics. The single VAE reconstructs part geometry with more fine-grained details, e.g., the curvy armrests and the delicate sofa feet.
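A minimal sketch of such a shared conditional VAE is shown below; the feature dimension, the 57-way semantic one-hot (matching the 57 chair part semantics mentioned above), and the structure-context size are illustrative assumptions, and the real model encodes and decodes ACAP mesh deformation features:

```python
import torch
import torch.nn as nn

class ConditionalPartVAE(nn.Module):
    """Sketch of one shared part geometry VAE conditioned on the part's semantic
    label and structure context; all dimensions are placeholder assumptions."""
    def __init__(self, feat_dim=512, cond_dim=57 + 128, z_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim))   # -> (mu, logvar)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, part_feat, cond):
        mu, logvar = self.enc(torch.cat([part_feat, cond], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(torch.cat([z, cond], -1)), mu, logvar

vae = ConditionalPartVAE()
part_feat = torch.randn(4, 512)    # placeholder part geometry features
cond = torch.randn(4, 57 + 128)    # semantic one-hot + structure context (assumed)
recon, mu, logvar = vae(part_feat, cond)
```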
Cascaded vs. End-to-end Training.
Our network contains multiple VAEs for predicting the shape structure, the shape geometry, and the part geometric details. We train all network modules, including the part geometry VAE and the coupled hierarchical graph VAEs, in an end-to-end manner. We compare to a cascaded training scheme in which we first train the part geometry VAE and then train the rest of the model. In Table 8, we evaluate the influence of the two training strategies on the chair category and observe similar performance for both, so we pick the end-to-end solution for its simplicity.
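The two strategies differ only in how the optimizers are wired; the following schematic sketch uses stand-in modules and a placeholder loss:

```python
import torch
import torch.nn as nn

part_vae = nn.Linear(64, 64)     # stand-in for the part geometry (PG) VAE
hier_vaes = nn.Linear(64, 64)    # stand-in for the coupled hierarchical VAEs

def loss_of(module: nn.Module) -> torch.Tensor:
    return module(torch.randn(16, 64)).pow(2).mean()   # placeholder loss

# (1) Cascaded: pre-train the PG VAE, then train the hierarchical VAEs on top.
opt_pg = torch.optim.Adam(part_vae.parameters(), lr=1e-3)
for _ in range(100):
    opt_pg.zero_grad(); loss_of(part_vae).backward(); opt_pg.step()
opt_hier = torch.optim.Adam(hier_vaes.parameters(), lr=1e-3)
for _ in range(100):
    opt_hier.zero_grad(); loss_of(hier_vaes).backward(); opt_hier.step()

# (2) End-to-end (our choice): optimize all modules jointly.
opt_all = torch.optim.Adam([*part_vae.parameters(), *hier_vaes.parameters()], lr=1e-3)
for _ in range(100):
    opt_all.zero_grad()
    (loss_of(part_vae) + loss_of(hier_vaes)).backward()
    opt_all.step()
```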
Table 7. Shape reconstruction performance of our DSM-Net using a single conditional part geometry VAE for all semantic parts versus separate VAEs for different semantic parts. Training one single conditional part geometry VAE is more data-efficient and thus works much better on the PartNet dataset.
Method                        CD ↓      EMD ↓     HierInsSeg (HIS) ↓
Ours (separate PG VAEs)       0.015722  0.627309  0.972835
Ours (one shared PG VAE)
GT                                                0.321953
Table 8. Shape reconstruction quantitative evaluations of the training strategy on the chair category. We evaluate two strategies: (1) cascaded training, where the PG VAE is pre-trained first and the coupled hierarchical VAEs are then trained on top of it; (2) end-to-end training, where both networks are trained simultaneously. Cascaded training achieves similar geometry performance, so for simplicity we choose the end-to-end solution.
Method                      CD ↓      EMD ↓     HierInsSeg (HIS) ↓
Ours (cascaded training)    0.002207  0.080533  0.537924
Ours (end-to-end training)  0.002394  0.081749  0.534247
GT                                              0.321953
Our method depends on heavily annotated shape hierarchies and fine-grained part annotations for a large set of 3D shapes as input to our networks, and obtaining such data with automatic algorithms is non-trivial. One may consider predicting such hierarchies by training hierarchical part instance segmentation networks (as shown in PartNet [Mo et al. 2019c], Sec. 5.3, and the StructureNet [Mo et al. 2019a] shape abstraction experiments), but these methods all require a large-scale training dataset of fine-grained part and structure annotations. Among unsupervised methods, recent works such as Cuboid Abstraction [Sun et al. 2019] show promising results for learning such fine-grained shape parts and structures, but this remains a challenging topic in the research community.

In our work, we explicitly define the structure and geometry hierarchies and supervise the networks with the annotated data. It would be interesting if a network could learn to disentangle the shape structure and geometry representations automatically; however, learning 3D shape disentanglement in a fully unsupervised manner is very challenging. We hope that our fully supervised version will bring attention to this topic and that future work will reduce the required supervision.
In this paper, we have presented DSM-Net, a novel deep generative model that learns to represent and generate 3D shapes in disentangled latent spaces of geometry and structure, while exploiting their synergy to ensure the plausibility of generated shapes. Extensive evaluation shows that our method produces high-quality shapes with complex structure and fine geometric details, outperforming state-of-the-art methods. Our method also enables intuitive and flexible control over geometry and structure in shape generation, supporting novel applications such as interpolation of geometry (structure) while keeping structure (geometry) unchanged.
ACKNOWLEDGMENTS
This work was supported by the Royal Society Newton Advanced Fellowship (No. NAF\R2\ ).

REFERENCES
Victoria Fernández Abrevaya, Adnane Boukhayma, Stefanie Wuhrer, and Edmond Boyer. 2019. A Decoupled 3D Facial Shape Model by Adversarial Training. In IEEE/CVF International Conference on Computer Vision (ICCV). 9418–9427.
Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning Representations and Generative Models for 3D Point Clouds. In ICML. 40–49.
Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Björn Ottersten. 2018. Deep learning advances on different 3D data representations: A survey. arXiv preprint arXiv:1808.01462 (2018).
Tristan Aumentado-Armstrong, Stavros Tsogkas, Allan Jepson, and Sven Dickinson. 2019. Geometric Disentanglement for Generative Latent Shape Models. In IEEE/CVF International Conference on Computer Vision (ICCV). 8180–8189.
Harry G Barrow, Jay M Tenenbaum, Robert C Bolles, and Helen C Wolf. 1977. Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proceedings: Image Understanding Workshop. Science Applications, Inc., Arlington, VA, 21–27.
Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. 2020. Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction. arXiv preprint arXiv:2003.10983 (2020).
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015).
Siddhartha Chaudhuri, Evangelos Kalogerakis, Leonidas J. Guibas, and Vladlen Koltun. 2011. Probabilistic reasoning for assembly-based 3D modeling. ACM Trans. Graph. 30 (2011), 35.
Siddhartha Chaudhuri, Daniel Ritchie, Jiajun Wu, Kai Xu, and Hao Zhang. 2020. Learning Generative Models of 3D Structures. Computer Graphics Forum (Eurographics STAR) (2020).
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems. 2172–2180.
Zhiqin Chen, Andrea Tagliasacchi, and Hao Zhang. 2019a. BSP-Net: Generating Compact Meshes via Binary Space Partitioning. arXiv preprint arXiv:1911.06971 (2019).
Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha Chaudhuri, and Hao Zhang. 2019b. BAE-NET: Branched autoencoder for shape co-segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 8490–8499.
Zhiqin Chen and Hao Zhang. 2019. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5939–5948.
Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. 2020. Implicit functions in feature space for 3D shape reconstruction and completion. arXiv preprint arXiv:2003.01456 (2020).
Christopher Choy, JunYoung Gwak, and Silvio Savarese. 2019. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3075–3084.
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV. Springer, 628–644.
Angela Dai and Matthias Nießner. 2019. Scan2Mesh: From unstructured range scans to 3D meshes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5574–5583.
Theo Deprelle, Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. 2019. Learning Elementary Structures for 3D Shape Generation and Matching. NeurIPS (2019).
Yueqi Duan, Haidong Zhu, He Wang, Li Yi, Ram Nevatia, and Leonidas J. Guibas. 2020. Curriculum DeepSDF. arXiv preprint arXiv:2003.08593 (2020).
Anastasia Dubrovina, Fei Xia, Panos Achlioptas, Mira Shalah, and Leonidas J. Guibas. 2019. Composite Shape Modeling via Latent Space Factorization. (2019), 8139–8148.
Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 605–613.
Matheus Gadelha, Rui Wang, and Subhransu Maji. 2018. Multiresolution tree networks for 3D point cloud processing. In Proceedings of the European Conference on Computer Vision (ECCV). 103–118.
Vignesh Ganapathi-Subramanian, Olga Diamanti, Soeren Pirk, Chengcheng Tang, Matthias Niessner, and Leonidas Guibas. 2018. Parsing geometry using structure-aware shape templates. In International Conference on 3D Vision (3DV). IEEE, 672–681.
Lin Gao, Yu-Kun Lai, Jie Yang, Zhang Ling-Xiao, Shihong Xia, and Leif Kobbelt. 2019a. Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics (2019).
Lin Gao, Jie Yang, Tong Wu, Yu-Jie Yuan, Hongbo Fu, Yu-Kun Lai, and Hao (Richard) Zhang. 2019b. SDM-NET: Deep Generative Network for Structured Deformable Mesh. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2019) (2019), 7153–7163.
Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. 2016. Learning a predictable and generative vector representation for objects. In ECCV. Springer, 484–499.
Georgia Gkioxari, Jitendra Malik, and Justin Johnson. 2019. Mesh R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 9785–9795.
Aleksey Golovinskiy and Thomas Funkhouser. 2009. Consistent segmentation of 3D models. Computers & Graphics 33, 3 (2009), 262–269.
Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 2018. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9224–9232.
Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. 2018. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In CVPR.
Ruizhen Hu, Lubin Fan, and Ligang Liu. 2012. Co-segmentation of 3D shapes via subspace clustering. In Computer Graphics Forum, Vol. 31. Wiley Online Library, 1703–1713.
Haibin Huang, Evangelos Kalogerakis, Siddhartha Chaudhuri, Duygu Ceylan, Vladimir G Kim, and Ersin Yumer. 2017. Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Transactions on Graphics (TOG) 37, 1 (2017), 1–14.
Qixing Huang, Vladlen Koltun, and Leonidas Guibas. 2011. Joint shape segmentation with linear programming. In Proceedings of the 2011 SIGGRAPH Asia Conference. 1–12.
Anastasia Ioannidou, Elisavet Chatzilari, Spiros Nikolopoulos, and Ioannis Kompatsiaris. 2017. Deep learning advances in computer vision with 3D data: A survey. ACM Computing Surveys (CSUR) 50, 2 (2017), 1–38.
Chiyu Max Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. 2020. Local Implicit Grid Representations for 3D Scenes. arXiv preprint arXiv:2003.08981 (2020).
Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 2017. 3D shape segmentation with projective convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3779–3788.
Evangelos Kalogerakis, Siddhartha Chaudhuri, Daphne Koller, and Vladlen Koltun. 2012. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG) 31, 4 (2012), 1–11.
Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7122–7131.
Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. 2018. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5010–5019.
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4401–4410.
Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Siddhartha Chaudhuri, Stephen DiVerdi, and Thomas Funkhouser. 2013. Learning Part-based Templates from Large Collections of 3D Shapes. ACM Transactions on Graphics (Proc. of SIGGRAPH) 32, 4 (2013).
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Eric-Tuan Le, Iasonas Kokkinos, and Niloy J Mitra. 2019. Going Deeper with Point Networks. arXiv preprint arXiv:1907.00960 (2019).
Jake Levinson, Avneesh Sud, and Ameesh Makadia. 2019. Latent feature disentanglement for 3D meshes. arXiv preprint arXiv:1906.03281 (2019).
Chun-Liang Li, Manzil Zaheer, Yang Zhang, Barnabas Poczos, and Ruslan Salakhutdinov. 2018. Point cloud GAN. arXiv preprint arXiv:1810.05795 (2018).
Jun Li, Chengjie Niu, and Kai Xu. 2019. Learning part generation and assembly for structure-aware shape synthesis. arXiv preprint arXiv:1906.06693 (2019).
Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. 2017. GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–14.
Yecheng Lyu, Xinming Huang, and Ziming Zhang. 2020. Learning to Segment 3D Point Clouds in 2D Image Space. arXiv preprint arXiv:2003.05593 (2020).
Daniel Maturana and Sebastian Scherer. 2015. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 922–928.
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4460–4470.
Niloy J Mitra, Michael Wand, Hao Zhang, Daniel Cohen-Or, Vladimir Kim, and Qi-Xing Huang. 2014. Structure-aware shape processing. In ACM SIGGRAPH 2014 Courses. 1–21.
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. 2019a. StructureNet: Hierarchical Graph Networks for 3D Shape Generation. ACM Transactions on Graphics (TOG) 38, 6 (2019), Article 242.
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas J Guibas. 2019b. StructEdit: Learning Structural Shape Variations. arXiv preprint arXiv:1911.11098 (2019).
Kaichun Mo, He Wang, Xinchen Yan, and Leonidas J Guibas. 2020. PT2PC: Learning to Generate 3D Point Cloud Shapes from Part Tree Conditions. arXiv preprint arXiv:2003.08624 (2020).
Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. 2019c. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 909–918.
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. 2017. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5115–5124.
Charlie Nash, Yaroslav Ganin, SM Eslami, and Peter W Battaglia. 2020. PolyGen: An autoregressive generative model of 3D meshes. arXiv preprint arXiv:2002.10880 (2020).
Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. 2019. HoloGAN: Unsupervised learning of 3D representations from natural images. In Proceedings of the IEEE International Conference on Computer Vision. 7588–7597.
Chengjie Niu, Jun Li, and Kai Xu. 2018. Im2Struct: Recovering 3D shape structure from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4521–4529.
Maks Ovsjanikov, Wilmot Li, Leonidas J. Guibas, and Niloy Jyoti Mitra. 2011. Exploration of continuous variability in collections of 3D shapes. ACM Trans. Graph. 30 (2011), 33.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 165–174.
Despoina Paschalidou, Luc Van Gool, and Andreas Geiger. 2020. Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image. arXiv preprint arXiv:2004.01176 (2020).
Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. arXiv preprint arXiv:2003.04618 (2020).
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems. 5099–5108.
Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017. OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3577–3586.
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40, 2 (2000), 99–121.
Nadav Schor, Oren Katzir, Hao Zhang, and Daniel Cohen-Or. 2019. CompoNet: Learning to Generate the Unseen by Part Synthesis and Composition. In Proceedings of the IEEE International Conference on Computer Vision. 8759–8768.
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. 2018. CSGNet: Neural Shape Parser for Constructive Solid Geometry. (2018), 5515–5523.
Dong Wook Shu, Sung Woo Park, and Junseok Kwon. 2019. 3D point cloud generative adversarial network based on tree structured graph convolutions. In Proceedings of the IEEE International Conference on Computer Vision. 3859–3868.
Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. 2017. SurfNet: Generating 3D shape surfaces using deep residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6040–6049.
Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. 2018. SplatNet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2530–2539.
Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision. 945–953.
Chun-Yu Sun and Qian-Fang Zou. 2019. Learning adaptive hierarchical cuboid abstractions of 3D shape collections. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–13.
Minhyuk Sung, Hao Su, Vladimir G Kim, Siddhartha Chaudhuri, and Leonidas Guibas. 2017. ComplementMe: Weakly-supervised component suggestions for 3D modeling. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–12.
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. 2017. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In Proceedings of the IEEE International Conference on Computer Vision. 2088–2096.
N Joseph Tatro, Stefan C Schonsheck, and Rongjie Lai. 2020. Unsupervised Geometric Disentanglement for Surfaces via CFAN-VAE. arXiv preprint arXiv:2005.11622 (2020).
Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, William T. Freeman, Joshua B. Tenenbaum, and Jiajun Wu. 2019. Learning to Infer and Execute 3D Shape Programs. arXiv preprint arXiv:1901.02875 (2019).
Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. 2017. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2635–2643.
Diego Valsesia, Giulia Fracastoro, and Enrico Magli. 2018. Learning Localized Generative Models for 3D Point Clouds via Graph Convolution. (2018).
Oliver Van Kaick, Kai Xu, Hao Zhang, Yanzhen Wang, Shuyang Sun, Ariel Shamir, and Daniel Cohen-Or. 2013. Co-hierarchical analysis of shape structures. ACM Transactions on Graphics (TOG) 32, 4 (2013), 1–10.
Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. 2018. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV). 52–67.
Weiyue Wang, Duygu Ceylan, Radomír Mech, and Ulrich Neumann. 2019b. 3DN: 3D Deformation Network. (2019), 1038–1046.
Yifan Wang, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, and Olga Sorkine-Hornung. 2019a. Neural Cages for Detail-Preserving 3D Deformations. arXiv preprint arXiv:1912.06395 (2019).
Yiqun Wang, Jing Ren, Dong-Ming Yan, Jianwei Guo, Xiaopeng Zhang, and Peter Wonka. 2020. MGCN: Descriptor Learning using Multiscale GCNs. ACM Transactions on Graphics (Proc. SIGGRAPH) (2020).
Yanzhen Wang, Kai Xu, Jun Li, Hao Zhang, Ariel Shamir, Ligang Liu, Zhiquan Cheng, and Yueshan Xiong. 2011. Symmetry hierarchy of man-made objects. In Computer Graphics Forum, Vol. 30. Wiley Online Library, 287–296.
Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. 2017. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS. 540–550.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS. 82–90.
Rundi Wu, Yixin Zhuang, Kai Xu, Hao Zhang, and Baoquan Chen. 2019b. PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes. arXiv preprint arXiv:1911.10949 (2019).
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR. 1912–1920.
Zhijie Wu, Xiang Wang, Di Lin, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2019a. SAGNet: Structure-aware generative network for 3D-shape modeling. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–14.
Taihong Xiao, Jiapeng Hong, and Jinwen Ma. 2018a. DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images. In ICLR Workshops.
Taihong Xiao, Jiapeng Hong, and Jinwen Ma. 2018b. ELEGANT: Exchanging Latent Encodings with GAN for Transferring Multiple Face Attributes. In ECCV.
Yun-Peng Xiao, Yu-Kun Lai, Fang-Lue Zhang, Chunpeng Li, and Lin Gao. 2020. A Survey on Deep Geometry Learning: From a Representation Perspective. arXiv preprint arXiv:2002.07995 (2020).
Kai Xu, Vladimir G Kim, Qixing Huang, Niloy Mitra, and Evangelos Kalogerakis. 2016. Data-driven shape analysis and processing. In SIGGRAPH ASIA 2016 Courses. 1–38.
Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. 2019. DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. In Advances in Neural Information Processing Systems. 490–500.
Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. 2016. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS. 1696–1704.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision. 4541–4550.
Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. 2018. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 206–215.
Li Yi, Leonidas Guibas, Aaron Hertzmann, Vladimir G Kim, Hao Su, and Ersin Yumer. 2017. Learning hierarchical shape segmentation and labeling from online repositories. arXiv preprint arXiv:1705.01661 (2017).
Kangxue Yin, Zhiqin Chen, Hui Huang, Daniel Cohen-Or, and Hao Zhang. 2019. LOGAN: Unpaired Shape Transform in Latent Overcomplete Space. ACM Transactions on Graphics (Special Issue of SIGGRAPH Asia) 38, 6 (2019), 198:1–198:13.
Fenggen Yu, Kun Liu, Yan Zhang, Chenyang Zhu, and Kai Xu. 2019. PartNet: A Recursive Part Decomposition Network for Fine-grained and Hierarchical Shape Segmentation. In CVPR.
Mehmet Ersin Yümer and Niloy Jyoti Mitra. 2016. Learning Semantic Deformation Flows with 3D Convolutional Networks. In ECCV.
Yongheng Zhao, Tolga Birdal, Haowen Deng, and Federico Tombari. 2019. 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1009–1018.
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Li Yi, Leonidas J. Guibas, and Hao Zhang. 2020. AdaCoSeg: Adaptive Shape Co-Segmentation with Group Consistency Loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Renjiao Yi, and Hao Zhang. 2018a. SCORES: Shape composition with recursive substructure priors. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–14.
Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. 2018b. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems. 118–129.
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3D-PRNN: Generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision.