STD-Net: Structure-preserving and Topology-adaptive Deformation Network for 3D Reconstruction from a Single Image
Aihua Mao ⋆, Canglan Dai ⋆, Lin Gao †, Ying He, Yong-Jin Liu
School of Computer Science and Engineering, South China University of Technology; Chinese Academy of Sciences (CAS); Nanyang Technological University; Tsinghua University
[email protected], dcl [email protected], [email protected], [email protected], [email protected]
Abstract.
3D reconstruction from a single-view image is a long-standing problem in computer vision. Various methods based on different shape representations (such as point clouds or volumetric representations) have been proposed. However, 3D shape reconstruction with fine details and complex structures is still challenging and has not yet been solved. Thanks to recent advances in deep shape representations, it becomes promising to learn structure and detail representations using deep neural networks. In this paper, we propose a novel method called STD-Net to reconstruct 3D models using the mesh representation, which is well suited for characterizing complex structures and geometric details. To reconstruct complex 3D mesh models with fine details, our method consists of (1) an auto-encoder network for recovering the structure of an object with a bounding box representation from a single image, (2) a topology-adaptive graph CNN for updating the vertex positions of meshes with complex topology, and (3) a unified mesh deformation block that deforms the structural boxes into structure-aware meshed models. Experimental results on images from ShapeNet show that our proposed STD-Net performs better than other state-of-the-art methods in reconstructing 3D objects with complex structures and fine geometric details.
Keywords:
Single-image 3D reconstruction · Structure preservation · Topology-adaptive graph convolution · Mesh deformation
⋆ indicates equal contribution. † indicates corresponding author.

Introduction

3D reconstruction plays an essential role in various tasks in computer vision and computer graphics. Traditional approaches mainly exploit stereo correspondence based on multi-view geometry, but are restricted to the coverage provided by the input views. This limitation makes single-view reconstruction particularly tricky due to the lack of correspondence with other views and large occlusions. With the availability of large-scale 3D shape databases [2], shape information can be efficiently encoded in a deep neural network, enabling faithful 3D reconstruction even from a single image. Although many 3D representations (such as voxel-based and point cloud representations) have been utilized for 3D reconstruction, they are not efficient at expressing the surface details of a shape and may generate part-missing or broken structures due to their high computational cost and memory footprint. In contrast, the triangle mesh is well known for its efficiency in modelling geometric details, and it has therefore attracted considerable attention in computer vision and computer graphics.

Recently, mesh-based 3D methods have been explored with deep learning technology [29,8]. A triangle mesh can be represented by a graph-based neural network [25,17]. Although these methods can reconstruct the surface of an object, the reconstruction results are still limited to certain categories of 3D models and miss structural information about the object. In the literature, structure recovery of 3D shapes has mostly been studied with non-deep-learning approaches, owing to the lack of a structural representation of 3D shapes suitable for deep neural networks. It is therefore necessary to build a deep neural network that can directly recover the 3D shape structure of an object from a single RGB image.
Recent works show that cuboid primitives can be a good choice for structure representation [18,16]. Meanwhile, the cuboid structure representation can recover complete models with part relationships, while results estimated by 3D-GAN [31] may lose some parts. However, it still lacks the surface details of the object. To express a shape's structural information delicately by advancing the cuboid representation, we propose a deep neural network that reconstructs 3D objects at the mesh level, which makes it feasible to express the complex structure and the fine-grained surface details of 3D objects simultaneously.

So far, mesh-based deep learning approaches mostly rely on the graph convolution network (GCN) [29,8,24]. The common practice of these methods is to use a GCN to deform a pre-defined mesh, generally a sphere, or to deform a set of primitives (square planes) to form 3D shapes. Although GCNs are effective in many classification and regression tasks, they may be inadequate for reconstruction, generation, or 3D structure analysis, because aggregating neighbouring information at the vertex level may cause over-smoothing. More importantly, existing GCN-based methods always deal with fixed-topology meshes, while the cuboid structure naturally represents meshes of variable topology, so existing GCNs are not suitable for the cuboid representation. In this paper, we aim to address these challenging issues and reconstruct 3D meshed models with adaptive topology and fine-grained geometric details from a single RGB image. The key idea for single-view reconstruction is to obtain the structural representation of 3D objects with cuboids and then deform them into concrete meshes through an integrated deep neural network (called STD-Net), by which a 3D structure-preserving object model can
be reconstructed from a single RGB image. The contributions of this paper are summarized as follows:
Fig. 1.
The overview of STD-Net. Given an input image, an auto-encoder architecture is first employed to obtain the bounding boxes covering the parts of the object. Then, multiple mesh deformation and unpooling operations are applied to progressively deform the mesh vertices and update the topology to approximate the target 3D object surface.

– We propose an integrated learning framework that takes a single-view image as input and reconstructs a 3D meshed model with complex structure and fine-grained geometric details.
– We represent the 3D object structure by recovering cuboid bounding boxes that delicately express rich structural information.
– We construct a topology-adaptive mesh deformation network that removes the fixed-topology limitation on the input graph and is able to reconstruct shapes with various topologies.
Related Work

The voxel representation has been widely used for 3D shape generation and understanding with neural networks [26,28,3,12,9]. Although it is simple to implement and has good compatibility, the voxel representation is limited by resolution and by the computational cost of 3D convolution. Recently, octrees have been exploited to develop computationally efficient approaches [22,26,10].

Another popular representation for 3D shapes is the point cloud [21,1,33], which describes 3D shapes through a set of points in 3D space. Naturally, the
point cloud can represent the surface information of 3D objects without local connections between points, which makes it flexible, with sufficient degrees of freedom to match 3D shapes of arbitrary topology. However, the irregular structure of point clouds also poses a great challenge for deep learning, which is only able to produce relatively coarse geometry and cannot be used to directly recover a detailed 3D shape.

Recently, mesh models have been introduced into deep learning for 3D generation and reconstruction tasks. Pontes et al. [20] proposed a learning framework to infer 3D mesh models from a single image using a compact mesh representation. Wang et al. [29] generated 3D mesh models from an RGB image by deforming an initial ellipsoid mesh from coarse to fine. Groueix et al. [8] deformed a set of primitives to generate parametric surface elements for 3D shapes. Smith et al. [24] presented an approach for adaptive mesh reconstruction that focuses on exploiting the geometric structure of graph-encoded objects. Some methods attempt to reconstruct category-specific meshes parameterized by a learned mean shape [13,7]. Our method is most closely related to the recent works [29,24], which deform a generic pre-defined mesh to form 3D surfaces. However, [29] does not consider structure information in the 3D representation and hence cannot reconstruct well-structured objects, while [24] takes additional voxels to provide the global structure and cannot express local structure. Moreover, both require a fixed-topology mesh model, whereas our method allows input meshes of different topological types and is thus able to generate more delicate structures for the 3D shape.
Many learning-based methods already generate 3D models in the form of voxels, point clouds, or meshes. However, their outputs are non-structured models, mainly because an effective structural representation for 3D shapes is still lacking. Some works attempt to address this issue through non-deep-learning approaches. For example, Xu et al. [32] modeled 3D objects from photos by utilizing an available set of 3D candidate models. Huang et al. [11] jointly analyzed a collection of images and reconstructed the 3D shapes from existing 3D model collections.

Recently, researchers have explored expressing 3D structure through learning methods. Tulsiani et al. [27] proposed a deep architecture that maps a 3D volume to 3D cuboid primitives, which makes it possible to automatically discover and exploit consistent structure in the data. Niu et al. [19] developed a convolutional-recursive auto-encoder comprising structure parsing of a 2D image and structure recovery of a cuboid hierarchy. Gao et al. [6] reported a deep generative neural network that produces structured deformable meshes. Our method adopts ideas from these works and proposes a cost-effective auto-encoder to generate cuboid bounding boxes expressing the hierarchical structure of 3D objects. The cuboid bounding boxes are further meshed and fed into the learning network for 3D model generation.
The graph convolution neural network [23] has become a popular choice for 3D shape analysis [34] and 3D reconstruction [29]. Unlike a traditional CNN, which is defined on regular grids (e.g., 2D pixels and 3D voxels), a GCN regards the mesh as a graph and learns features using graph convolution. The potential of graph convolution lies in capturing the structural relations among data nodes, but the irregular structure of graphs poses major challenges for convolution on graphs. The difficulties of deep learning on graphs stem mainly from the complex and diverse connectivity patterns in graph-structured data.

Defferrard et al. [4] proposed an approach that approximates the filters with Chebyshev polynomials applied to the Laplacian operator. This approximation was further simplified by [14], which proposes a filter in the spatial domain and achieves good performance on classification tasks. The assumption of a symmetric adjacency matrix in these two spectral-based methods restricts their application to undirected graph-structured data, and the issue of extending CNNs from grid-structured data to arbitrary graph-structured data remains unsolved. To adapt to the various topologies of graph data, Du et al. [5] implement graph convolution with a set of fixed-size learnable filters. We adopt the idea of TAGCN [5] to construct a GCN that deforms meshed cuboid bounding boxes with different topology types.
Method Overview

The network architecture of our method (STD-Net) is shown in Fig. 1. It is composed of two parts: a structure recovery network and a mesh deformation network. The structure recovery network is designed as an auto-encoder that predicts the 3D structure of the object from a single RGB image. It generates the object's hierarchical cuboid bounding boxes, which delicately describe the structure information in detail. These bounding boxes are further meshed and fed into the GCNs in the next phase.

The mesh deformation network aims to deform the input bounding boxes into a structure-preserving shape. It consists of three blocks interleaved with two graph unpooling layers. Each block has the same network structure, containing 14 TAGCN layers with 192 channels, and accepts graphs of variable topology. The unpooling layer increases the number of vertices to handle fine-grained geometric details. The two network parts are explained in detail in the following sections.
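The graph unpooling layer mentioned above can be sketched as edge-midpoint subdivision: a new vertex is inserted at the center of each edge and every triangle is split into four. The 1-to-4 face split below follows the Pixel2Mesh scheme and is our assumption of the concrete implementation; the function name is hypothetical.

```python
import numpy as np

def unpool(vertices, faces):
    """Edge-based unpooling: insert a vertex at each edge midpoint and
    split every triangle into four smaller ones (assumed 1-to-4 scheme)."""
    vertices = list(map(np.asarray, vertices))
    midpoint = {}                        # edge (i, j) -> new vertex index

    def mid(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint:          # create the midpoint vertex once per edge
            midpoint[key] = len(vertices)
            vertices.append(0.5 * (vertices[i] + vertices[j]))
        return midpoint[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.array(vertices), new_faces

# One triangle: 3 vertices -> 6 vertices, 1 face -> 4 faces.
V = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
V2, F2 = unpool(V, [(0, 1, 2)])
print(len(V2), len(F2))   # 6 4
```

Because shared edges are deduplicated through the `midpoint` dictionary, applying this to a closed mesh adds exactly one vertex per edge, matching the vertex growth described above.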
Structure Recovery Network

Recently, shape abstraction [18,19] has been used to discover high-level structure in un-annotated point clouds and images. These works inspire us to use the decoder from shape abstraction to recover the structure of the object. In our
method, an encoder is first employed to map a shape (represented as a hierarchy of n-ary graphs or cuboids) to a latent feature vector z. Then, a decoder transforms the feature vector z back into a shape (also represented as a hierarchy of graphs or cuboids). The structure of an object is represented by a hierarchical graph, in which every node is represented by a bounding box.

In the encoder, the structure information of the image passes through a CNN and is transformed into a latent code of features. In the decoder, a recursive network unfolds the features into a hierarchical organization of bounding boxes, which form the recovered structure.

Encoder
The encoder takes a 2D RGB image as input to obtain the object structure. It contains two parts. The first part is a contour estimator that estimates the contour of the object, which provides strong cues for understanding shape structures in the image. Inspired by the multi-scale network for detailed depth estimation [15], we design two-scale CNNs, where the first captures information from the whole image and the second produces a detailed mask map at a quarter of the input resolution. The second part fuses the features of the input image through two convolution channels. One channel takes the binary mask as input, followed by two convolution layers. The other channel takes the CNN features of the original image (extracted by a ResNet-18) as input. The output features from the two channels are then concatenated and further encoded into an n-D code by two fully connected layers, capturing the object structure information from the input image.
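The two-channel fusion step can be sketched in a few lines. All feature dimensions below (256 mask features, 512 ResNet-18 features, a 128-D code) are illustrative assumptions, since the paper leaves n and the intermediate widths unspecified; the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    """Fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

# Hypothetical per-channel features after the convolution layers.
mask_feat  = rng.standard_normal(256)   # binary-mask channel features
image_feat = rng.standard_normal(512)   # ResNet-18 image features

# Concatenate the two channels, then encode with two FC layers
# into an n-D structure code (n = 128 here, an assumption).
fused = np.concatenate([mask_feat, image_feat])          # (768,)
w1, b1 = rng.standard_normal((768, 256)), np.zeros(256)
w2, b2 = rng.standard_normal((256, 128)), np.zeros(128)
latent = fc(fc(fused, w1, b1), w2, b2)                   # structure code
print(latent.shape)   # (128,)
```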
Decoder
We adopt a recursive network architecture [18] over a hierarchy of graphs in the decoder, which is more cost-effective than GRASS [16]. In the whole structure recovery network, the latent code can be regarded as the root feature of the structure tree. The decoder gradually decodes the node feature into a hierarchy of features until reaching the leaf nodes, each of which can be parsed into a vector of box parameters. In more detail, there are two types of decoders in the decoding pipeline. The bounding box decoder is implemented as a multi-layer perceptron (MLP) with two layers, which transforms a feature vector into a bounding box. The graph decoder transforms a parent feature vector back into a child graph.
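The recursive decode step can be sketched as follows. The binary split, the fixed depth, the 12-number box parameterization, and all layer widths are simplifying assumptions (the real decoder handles n-ary hierarchies with learned stopping); the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 128   # latent feature dimension (assumption)

def mlp(x, w1, w2):
    """Two-layer MLP with tanh hidden activation."""
    return np.tanh(x @ w1) @ w2

# Bounding box decoder: feature -> 12 box parameters
# (e.g. center + axes + sizes; exact parameterization is assumed).
Wb1, Wb2 = rng.standard_normal((D, 64)), rng.standard_normal((64, 12))

# Graph decoder: parent feature -> two child features
# (binary split for illustration only).
Wg1, Wg2 = rng.standard_normal((D, 64)), rng.standard_normal((64, 2 * D))

def decode(feature, depth):
    """Recursively unfold a feature vector into leaf bounding boxes."""
    if depth == 0:                                  # leaf: emit box parameters
        return [mlp(feature, Wb1, Wb2)]
    children = mlp(feature, Wg1, Wg2).reshape(2, D)  # split into child features
    boxes = []
    for child in children:
        boxes += decode(child, depth - 1)
    return boxes

root = rng.standard_normal(D)        # latent code from the encoder
boxes = decode(root, depth=2)        # depth-2 binary tree -> 4 leaf boxes
print(len(boxes), boxes[0].shape)    # 4 (12,)
```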
Mesh Deformation Network

Given a bounding box B generated as in Section 3.2 and the ground truth S, our goal is to deform the bounding box so that it is as close to the ground-truth shape as possible. Fig. 1 depicts our mesh deformation module, which takes a meshed bounding box, defined by a set of vertex positions, as input and outputs the predicted deformed mesh. In the mesh deformation network, three deformation blocks are interleaved with two graph unpooling layers to progressively achieve the mesh deformation. Each deformation block takes a graph (representing the mesh, with the 3D shape features attached to its vertices) as input. The unpooling
Fig. 2.
Reconstruction results on three categories: Airplane, Table and Chair. (a) Input image; (b) Pixel2Mesh; (c) AtlasNet-25; (d) GEOMetrics; (e) Ours; (f) Ground truth.

layer increases the number of vertices, which raises the capacity for handling fine-grained geometric details.

The three deformation blocks have the same architecture, containing 14 TAGCN layers with 192 channels. The first block takes the initial meshed bounding box as input, and the remaining two blocks take the output of the previous unpooling layer as input. Inspired by TAGCN [5], we only consider a small local region around each vertex (see Fig. 4). By defining a path of length m on a graph as a sequence v = (v_1, v_2, ..., v_m), v_k ∈ V, the small region can be determined by the node path. The convolution used in each convolution layer is

X^{l+1} = f(A^K X^l W^l_K + · · · + A X^l W^l_1 + X^l W^l_0)   (1)

where A is the adjacency matrix, X^l is the input feature matrix of the l-th convolution layer, f is the nonlinear activation function, and the W^l_k are the learnable weights. Experiments have shown that K = 2 achieves good performance. The convolution defined in Eq. (1) is similar to the traditional convolution operation: in the convolution layer of a traditional CNN, the receptive field is a small rectangle on the grid, while in a GCN the receptive field is a small region around the vertex. The operation in Eq. (1) computes a linear combination of the signals of nodes up to K hops away from the start node. Moreover, we propose a deep network with several shortcut connections, which can alleviate over-smoothing in the GCN. In addition to the output of the network, a branch applies an extra TAGCN layer to the last layer and outputs the 3D coordinates. Similar to Pixel2Mesh, each unpooling layer adds a new vertex at the center of each edge, which increases the number of vertices in the GCN.
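Eq. (1) can be sketched directly with matrix powers. The toy 4-vertex graph below and the row-normalization of A are illustrative assumptions (TAGCN typically uses a normalized adjacency matrix); the key property shown is that the filters W_k are independent of the graph size, so the same layer applies to graphs of any topology.

```python
import numpy as np

rng = np.random.default_rng(2)

def tagcn_layer(A, X, Ws):
    """One TAGCN layer: f(sum_k A^k X W_k), Eq. (1), with f = ReLU.

    A  : (n, n) normalized adjacency matrix of the mesh graph
    X  : (n, c_in) vertex features
    Ws : list of K+1 weight matrices, one per power of A
    """
    out = np.zeros((X.shape[0], Ws[0].shape[1]))
    Ak = np.eye(A.shape[0])          # A^0 = identity
    for Wk in Ws:
        out += Ak @ X @ Wk           # add the k-hop term
        Ak = Ak @ A                  # next power of A
    return np.maximum(out, 0.0)      # nonlinear activation f

# Toy mesh graph: 4 vertices in a cycle, row-normalized adjacency.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A /= A.sum(axis=1, keepdims=True)

X = rng.standard_normal((4, 3))                          # 3D coordinates as features
Ws = [rng.standard_normal((3, 192)) for _ in range(3)]   # K = 2 -> 3 filters
Y = tagcn_layer(A, X, Ws)
print(Y.shape)   # (4, 192)
```

The same `Ws` would work unchanged on a graph with any number of vertices, which is what makes the layer topology-adaptive.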
Fig. 3.
Sampling strategy from predicted mesh.
Losses
The mesh deformation network is supervised by a hybrid loss including the chamfer distance loss, Laplacian loss, and edge length loss. The chamfer distance (CD) measures the nearest-neighbor distances between two point sets. We minimize the two directional distances between the deformed bounding boxes and the ground-truth shape. The chamfer distance loss is defined as:

Fig. 4.
The left (resp. right) figure shows the convolution starting from vertex 1 (resp. 2).

L_cd = Σ_{x∈M} min_{y∈S} ‖x − y‖²₂ + Σ_{y∈S} min_{x∈M} ‖x − y‖²₂   (2)

where M and S are the two point sets. Akin to [30,24], our method does not simply compute the CD loss between the predicted points and the ground-truth points. Instead, we uniformly sample the same number of points from the predicted mesh and the ground-truth mesh and then compute the CD loss between them. More precisely (see Fig. 3), for each triangular face we first compute its area and store it in an array along with the cumulative area of the triangles visited so far. Next, we select a triangle with probability equal to the ratio between its area and the total cumulative area. For each selected triangular face defined by vertices v₁, v₂, v₃, a new point r can be sampled uniformly from the surface as:

r = (1 − √u) v₁ + √u (1 − w) v₂ + √u w v₃   (3)

where u, w ∼ U(0, 1). For the Laplacian loss l_lap, we first define a Laplacian coordinate for each vertex p as δ_p = p − Σ_{k∈N(p)} k / |N(p)|, and compute l_lap = Σ_p ‖δ′_p − δ_p‖²₂, where δ′_p is the Laplacian coordinate of vertex p after the deformation block and δ_p is its input Laplacian coordinate. The edge length loss is defined as l_edge = Σ_p Σ_{q∈N(p)} ‖p − q‖²₂ to make the predicted mesh visually appealing. The total loss of the deformation module is

L_all = l_cd + λ₁ l_lap + λ₂ l_edge   (4)

where λ₁ and λ₂ are hyper-parameters weighting the importance of each term. Note that the loss L_all is applied to the output of each mesh deformation block.

Experiments

In this section, we demonstrate the efficiency and performance of STD-Net on structure-preserving reconstruction from single RGB images, taking advantage of the mesh-based structure representation.
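The sampling scheme of Eq. (3) and the chamfer distance of Eq. (2) can be sketched together. The tetrahedron mesh and sample count below are toy values; the brute-force pairwise distance matrix is an illustrative stand-in for the nearest-neighbor search a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_surface(vertices, faces, n):
    """Uniformly sample n points from a triangle mesh: pick faces with
    probability proportional to area, then sample barycentrically (Eq. 3)."""
    V = np.asarray(vertices, dtype=float)
    tris = V[np.asarray(faces)]                       # (f, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0]), axis=1)
    idx = rng.choice(len(areas), size=n, p=areas / areas.sum())
    u = rng.random(n)[:, None]
    w = rng.random(n)[:, None]
    t = tris[idx]
    return ((1 - np.sqrt(u)) * t[:, 0]
            + np.sqrt(u) * (1 - w) * t[:, 1]
            + np.sqrt(u) * w * t[:, 2])

def chamfer(M, S):
    """Bidirectional chamfer distance between two point sets (Eq. 2)."""
    d = np.linalg.norm(M[:, None, :] - S[None, :, :], axis=2) ** 2
    return d.min(axis=1).sum() + d.min(axis=0).sum()

# Sample both the "predicted" and "ground-truth" meshes, then compare.
V = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
F = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
pred = sample_surface(V, F, 256)
gt = sample_surface(V, F, 256)
loss = chamfer(pred, gt)
print(pred.shape, loss >= 0.0)   # (256, 3) True
```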
In addition, we present an ablationstudy to demonstrate how individual components of our model contribute to itsoverall performance.
Implementation Details

Dataset We use ShapeNet [2] as the main dataset for training and testing STD-Net. Each shape is rendered from 24 different views, and the rendered image resolution is 224 × 224.

Structure Recovery Network For the encoder, the contour estimator consists of a VGG-16 (with two fully connected layers) and one 9×9 convolution layer. The fusion phase is implemented by a CNN. For the decoder, we use the pre-trained shape abstraction decoder from StructureNet [18]. There are two stages in training the structure recovery network. We first train the network to estimate a binary object mask for the input image; the first- and second-scale networks are trained jointly. Second, we train the encoder and decoder together, using a low learning rate for the encoder.
Mesh Deformation Network
Prior to training this network, we prepare pairs of bounding boxes and meshes from ShapeNet. For each category, a bounding-box(OBB)-to-mesh mapping is trained from the ground truth. These mappings are trained with the Adam optimizer (β = 5e−) and a learning rate of 3e−. We conduct the training for 10 iterations and stop it empirically, with the best model selected by evaluating on the validation set every 10 iterations. The hyper-parameter settings, as described in Eq. (4), are λ₁ = 0. and λ₂ = 0.

Table 1.
The F1-score (%) on the three categories at threshold d = 1e−4. Larger is better. Best results (ours) are in bold.

        Pixel2Mesh  AtlasNet  GEOMetrics  Ours
Chair   54.38       55.2      56.61
Plane   71.12       77.15     89.00
Table   66.30       59.49     66.33
Mean    63.9        63.61     70.64
Comparison
We evaluate the quantitative performance of our method by comparing with results generated by the mesh-based approaches Pixel2Mesh [29], AtlasNet [8], and GEOMetrics [24]. To do this, we train our network on the
Fig. 5.
Comparison of mean IoU (%) with other mesh-based approaches on the three categories of the ShapeNet dataset.
ShapeNet and evaluate two metrics (F1 and mIoU) on the test sets against state-of-the-art approaches. We mainly test on three categories of ShapeNet: Chair, Table and Airplane, using the most commonly used metrics for shape reconstruction. First, we compare the F1 score, the harmonic mean of precision and recall at a given threshold d. The comparison results are summarized in Table 1, which shows that our method performs much better than previous approaches on the three categories under a small threshold of d = 1e−4. Second, we compare mIoU with the other works, as depicted in Fig. 5, which shows that our method achieves higher performance than the others.

Qualitative comparison with state-of-the-art works on shape reconstruction for each category is presented in Fig. 2. These results demonstrate that our method provides highly accurate reconstructions of objects from input RGB images and captures structural details effectively.
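The F1-at-threshold metric used above can be sketched as follows. The precision/recall definitions (fraction of points of one set within distance d of the other) are our assumption of the standard formulation used by Pixel2Mesh and GEOMetrics; the function name is hypothetical.

```python
import numpy as np

def f1_score(pred, gt, d=1e-4):
    """F1 at threshold d between two point sets.

    precision: fraction of predicted points within distance d of ground truth
    recall   : fraction of ground-truth points within distance d of prediction
    """
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    precision = (dist.min(axis=1) < d).mean()
    recall = (dist.min(axis=0) < d).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Identical point sets score a perfect 1.0.
pts = np.zeros((10, 3))
print(f1_score(pts, pts))   # 1.0
```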
Ablation Study

In this section, we study the influence of our network's components and demonstrate their importance by comparing the full method's results on the chair data with an ablated version of the network. Here, we mainly evaluate the impact of the topology-adaptive layers by replacing them with a naive GCN. The quantitative results are shown in Table 2.
Table 2.
An ablation study of the mesh deformation network.

           Chamfer loss  F1-score  IoU
Naive GCN  3.550         30.84     12.76
TAGCN
Conclusion

In this paper, we propose a structure-preserving and topology-adaptive deformation network for 3D object reconstruction from single images. Our method provides a graph representation of 3D models with cuboid bounding boxes, which delicately describes the structural information of the object and thus allows 3D models with complex structure to be reconstructed directly from a single image. Our learning framework consists of a structure recovery network and a mesh deformation network. The former is designed as an auto-encoder that generates cuboid bounding boxes for the object; the latter consists of three deformation blocks interleaved with two graph unpooling layers, which progressively deform the input meshed bounding box into the meshed model. The most significant feature of the mesh deformation network is that it accepts input graphs with different topology types, which enables the reconstruction of 3D models with various complex structures. Compared to previous works, our method achieves better performance in structure-preserving 3D shape reconstruction.
References
1. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3D point clouds. In: ICML (2018)
2. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An information-rich 3D model repository. Tech. rep. (2015)
3. Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: ECCV (2016)
4. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: NeurIPS (2016)
5. Du, J., Zhang, S., Wu, G., Moura, J.M.F., Kar, S.: Topology adaptive graph convolutional networks (2017)
6. Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y.K., Zhang, H.: SDM-NET: Deep generative network for structured deformable mesh. ACM Trans. Graph., 243:1–243:15 (2019)
7. Groueix, T., Fisher, M., Kim, V.G., Russell, B., Aubry, M.: 3D-CODED: 3D correspondences by deep deformation. In: ECCV (2018)
8. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: A papier-mâché approach to learning 3D surface generation. In: CVPR (2018)
9. Han, X., Li, Z., Huang, H., Kalogerakis, E., Yu, Y.: High-resolution shape completion using deep neural networks for global structure and local geometry inference. In: CVPR (2017)
10. Häne, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3D object reconstruction. In: 3DV (2017)
11. Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. ACM TOG (2015)
12. Huang, Z., Li, T., Chen, W., Zhao, Y., Xing, J., LeGendre, C., Luo, L., Ma, C., Li, H.: Deep volumetric video from very sparse multi-view performance capture. In: ECCV (2018)
13. Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: ECCV (2018)
14. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
15. Li, J., Klein, R., Yao, A.: A two-streamed network for estimating fine-scaled depth maps from single RGB images. In: ICCV. pp. 3392–3400 (2017)
16. Li, J., Xu, K., Chaudhuri, S.: GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (2017)
17. Litany, O., Bronstein, A.M., Bronstein, M.M., Makadia, A.: Deformable shape completion with graph convolutional autoencoders. In: CVPR. pp. 1886–1895 (2018)
18. Mo, K., Guerrero, P., Yi, L., Su, H., Wonka, P., Mitra, N., Guibas, L.: StructureNet: Hierarchical graph networks for 3D shape generation. ACM TOG (SIGGRAPH Asia) (2019)
19. Niu, C., Li, J., Xu, K.: Im2Struct: Recovering 3D shape structure from a single RGB image. In: CVPR (2018)
20. Pontes, J.K., Kong, C., Sridharan, S., Lucey, S., Eriksson, A., Fookes, C.: Image2Mesh: A learning framework for single image 3D reconstruction. arXiv:1711.10669 (2017)
21. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: CVPR (2017)
22. Riegler, G., Ulusoy, A.O., Geiger, A.: OctNet: Learning deep 3D representations at high resolutions. In: CVPR (2017)
23. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Networks (2009)
24. Smith, E., Fujimoto, S., Romero, A., Meger, D.: GEOMetrics: Exploiting geometric structure for graph-encoded objects (2019)
25. Tan, Q., Gao, L., Lai, Y.K., Yang, J., Xia, S.: Mesh-based autoencoders for localized deformation component analysis. In: AAAI (2018)
26. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In: ICCV (2017)
27. Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. In: CVPR (2017)
28. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: CVPR (2017)
29. Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2Mesh: Generating 3D mesh models from single RGB images. In: ECCV, LNCS 11215 (2018)