A Survey on Deep Geometry Learning: From a Representation Perspective
Yun-Peng Xiao, Yu-Kun Lai, Fang-Lue Zhang, Chunpeng Li, Lin Gao
Computational Visual Media, Review Article. DOI 10.1007/s41095-xxx-xxxx-x, Vol. x, No. x, month year, xx–xx.
© The Author(s) 2020. This article is published with open access at Springerlink.com.
Abstract
Researchers have achieved great success in dealing with 2D images using deep learning. In recent years, 3D computer vision and geometry deep learning have gained more and more attention. Many advanced techniques for 3D shapes have been proposed for different applications. Unlike 2D images, which can be uniformly represented by regular grids of pixels, 3D shapes have various representations, such as depth and multi-view images, voxel-based representation, point-based representation, mesh-based representation, implicit surface representation, etc. However, the performance for different applications largely depends on the representation used, and there is no unique representation that works well for all applications. Therefore, in this survey, we review recent developments in deep learning for 3D geometry from a representation perspective, summarizing the advantages and disadvantages of different representations in different applications. We also present existing datasets in these representations and further discuss future research directions.
Keywords
3D representation, geometry learning, neural networks, computer graphics.

2 School of Computer Science and Informatics, Cardiff University, Wales, UK. E-mail: LaiY4@cardiff.ac.uk
3 School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. E-mail: [email protected]
1 Introduction

Recent improvements in methods for acquisition and rendering of 3D models have resulted in consolidated repositories containing massive amounts of 3D shapes on the Internet. With the increased availability of 3D models, we have been seeing an explosion in the demands of processing, generation and visualization of 3D models in a variety of disciplines, such as medicine, architecture and entertainment. The techniques for matching, identification and manipulation of 3D shapes have become fundamental building blocks in modern computer vision and computer graphics systems. Due to the complexity and irregularity of 3D shape data, how to effectively represent 3D shapes remains a challenging problem. Thus, there have been extensive research efforts concentrating on how to process and generate 3D shapes based on different representations.

In early research on 3D shape representations, 3D objects were normally modeled with a global approach, such as constructive solid geometry and deformed superquadrics. Those approaches have several drawbacks when utilized for tasks like recognition and retrieval. First, when representing imperfect 3D shapes, including those with noise and incompleteness, which are common in practice, such representations may impose a negative influence on matching performance. Second, the high dimensionality heavily burdens the computation and tends to make models overfit. Hence, more sophisticated methods have been designed to extract representations of 3D shapes in a more concise, yet discriminative and informative form.

Several related surveys have been published [1, 9, 35], which focus on different aspects of deep learning for 3D geometry. Moreover, with the rapid development of 3D shape representations and related techniques for deep learning, it is essential to further summarize up-to-date research works. In this survey, we mainly review
deep learning methods on 3D shape representations and discuss their advantages and disadvantages considering different application scenarios. We now give a brief summary of different 3D shape representation categories.
Depth and multi-view images can be used to represent 3D models in the 2D domain. The regular structure of images makes them efficient to process. Depending on whether depth maps are included, 3D shapes can be represented by RGB (color) or RGB-D (color and depth) images viewed from different viewpoints. Because of the influx of available depth data due to the popularity of 2.5D sensors, such as Microsoft Kinect, Intel RealSense, etc., multi-view RGB-D images are widely used to represent real-world 3D shapes. The large asset of image-based processing models can be leveraged using this representation, but it is inevitable that this kind of representation loses some geometric features.

A voxel is a 3D extension of the concept of a pixel. Similar to pixels in 2D, the voxel-based representation also has a regular structure in 3D space. The architectures of some neural networks which have been demonstrated useful in the 2D image field [48, 50] can easily be extended to the voxel form. Nevertheless, adding one dimension means an exponentially increased data size. As the resolution increases, the amount of required memory and computational cost increases dramatically, which restricts the representation to low resolutions when representing 3D shapes.
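As a concrete illustration of the voxel-based representation and its cubic memory growth, the following is a minimal sketch (not from any particular paper) that rasterizes a point set into a binary occupancy grid; the function name and normalization scheme are our own assumptions:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Map an (N, 3) point set into a binary occupancy grid.

    Points are normalized into a cube preserving aspect ratio; each
    point marks the voxel containing it as occupied.
    """
    pts = np.asarray(points, dtype=np.float64)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    scale = max((maxs - mins).max(), 1e-9)          # guard degenerate input
    idx = ((pts - mins) / scale * (resolution - 1)).astype(int)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Memory grows cubically: 32^3 = 32768 cells, but 256^3 is ~16.8M cells,
# which is the resolution bottleneck discussed above.
```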
Surface-based representations describe 3D shapes by encoding their surfaces, which can also be regarded as 2-manifolds. Point clouds and meshes are both discretized forms of 3D shape surfaces. Point clouds use a set of sampled 3D point coordinates to represent the surface. They can be easily generated by scanners but are difficult to process due to their lack of order and connectivity information. Researchers use order-invariant operators such as the max pooling operator in deep neural networks [75, 77] to mitigate the lack-of-order problem. Meshes can depict higher quality 3D shapes with less memory and computational cost compared with point clouds and voxels. A mesh contains a vertex set and an edge set. Due to its graphical nature, researchers have made attempts to build graph-based convolutional neural networks for coping with meshes. Some other methods regard meshes as the discretization of 2-manifolds. Moreover, meshes are more suitable for 3D shape deformation: one can deform a mesh model by transforming its vertices while keeping the connectivity unchanged.
Implicit surface representation exploits implicit field functions, such as occupancy functions [67] and signed distance functions [116], to describe the surface of 3D shapes. The implicit functions learned by deep neural networks define the spatial relationship between points and surfaces. They provide a description of 3D shapes with infinite resolution and reasonable memory consumption, and are capable of representing shapes with changing topology. Nevertheless, implicit representations cannot reflect the geometric features of 3D shapes directly, and usually need to be transformed into explicit representations such as meshes. Most methods apply iso-surfacing, such as marching cubes [58], which is an expensive operation.
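To make the occupancy/SDF distinction concrete, here is a minimal analytic sketch for a sphere (our own toy example; learned models replace these closed forms with networks): the signed distance is negative inside and positive outside, and occupancy is just its thresholded sign.

```python
import numpy as np

def sdf_sphere(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point(s) p to a sphere:
    negative inside, zero on the surface, positive outside."""
    p = np.atleast_2d(p).astype(np.float64)
    return np.linalg.norm(p - np.asarray(center), axis=-1) - radius

def occupancy(p):
    """Occupancy derived from the SDF: 1 inside the shape, 0 outside."""
    return (sdf_sphere(p) <= 0.0).astype(np.float64)

# The surface is the zero level set {p : sdf(p) = 0}; turning it into a
# mesh requires iso-surfacing such as marching cubes [58].
```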
Structured representation. One way to cope with complex 3D shapes is to decompose them into structure and geometric details, leading to structured representations. Recently, increasingly more methods regard a 3D shape as a collection of parts and organize them linearly or hierarchically. The structure of 3D shapes is processed by Recurrent Neural Networks (RNNs) [121], Recursive Neural Networks (RvNNs) [51] or other network architectures. Each part of the shape can be processed by unstructured models. The structured representation focuses on the relations (such as symmetry, supporting, being supported, etc.) between different parts within a 3D shape, which provides better description capability than alternative representations.
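The bottom-up merging performed by an RvNN over a part hierarchy can be sketched as follows. This is a minimal illustration with a random, untrained merge matrix `W` and a made-up chair hierarchy, not the actual architecture of [51]:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Hypothetical learned merge weights: two child codes -> one parent code.
W = rng.standard_normal((DIM, 2 * DIM)) * 0.1

def merge(left, right):
    """RvNN-style merge of two part feature codes: tanh(W [l; r])."""
    return np.tanh(W @ np.concatenate([left, right]))

def encode(node):
    """Recursively encode a part hierarchy given as nested 2-tuples
    whose leaves are per-part feature vectors."""
    if isinstance(node, tuple):
        return merge(encode(node[0]), encode(node[1]))
    return node  # leaf: a per-part feature vector

leaf = lambda: rng.standard_normal(DIM)
chair = ((leaf(), leaf()), (leaf(), leaf()))  # e.g. (legs, (seat, back))
root = encode(chair)                          # one code for the whole shape
```

A decoder would run the same recursion in reverse, splitting the root code back into part codes.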
Deformation-based representation. Unlike rigid man-made 3D shapes such as chairs and tables, there are also a large number of non-rigid (e.g. articulated) 3D shapes such as human bodies, which play an important role in computer animation, augmented reality, etc. The deformation-based representation is proposed mainly for describing the intrinsic deformation properties while ignoring the extrinsic transformation properties. Many methods use rotation-invariant local features to describe shape deformation, reducing distortion while keeping the geometric details at the same time.

Recently, deep learning has achieved superior performance in contrast to classical methods in many fields, including 3D shape analysis, reconstruction, etc. A variety of deep network architectures have been designed to process or generate 3D shape representations, which we refer to as geometry learning. In the following sections, we focus on the most recent deep learning based methods for representing and processing 3D shapes in different forms. According
Fig. 1 The timeline of deep learning based methods for various 3D shape representations.

to how the representation is encoded and stored, our survey is organized in the following structure: Section 2 reviews image-based shape representation methods. Sections 3 and 4 introduce voxel-based and surface-based representations respectively. Section 5 further introduces implicit surface representations. Sections 6 and 7 review structure-based and deformation-based description methods. We then summarize typical datasets in Section 8 and typical applications for shape analysis and reconstruction in Section 9, before concluding the paper in Section 10. Figure 1 summarizes the timeline of representative deep learning methods based on various 3D shape representations.
2 Image-based representations

2D images are projections of 3D entities. Although the geometric information carried by one image is incomplete, a plausible 3D shape can be inferred from a set of images with different perspectives. The extra depth channel in RGB-D data further enhances the capacity of image-based representations for encoding geometric cues. Benefiting from their image-like structure, research using deep neural networks for 3D shape inference from images started earlier than for alternative representations that depict the surface or geometry of 3D shapes explicitly.

Socher et al. [89] proposed a convolutional and recursive neural network for 3D object recognition, which copes with RGB and depth images by single convolutional layers separately and merges the features by a recursive network. Eigen et al. [20] first proposed to reconstruct the depth map from a single RGB image and designed a new scale-invariant loss for the training stage. Gupta et al. [37] encoded the depth map into three channels: disparity, height and angle. Other deep learning methods based on RGB-D images are designed for 3D object detection [36, 91], outperforming previous methods.

Images from different viewpoints can provide complementary cues to infer 3D objects. Thanks to the development of deep learning models in the 2D field, learning methods based on the multi-view image representation perform better in 3D shape recognition than those based on other 3D representations. Su et al. [93] proposed
MVCNN (Multi-View Convolutional Neural Network) for 3D object recognition. MVCNN first processes the images from different views separately by the first part of a CNN, then aggregates the features extracted from different views by view-pooling layers, and finally feeds the merged feature to the remaining part of the CNN. Qi et al. [76] proposed to add multi-resolution inputs to MVCNN for higher classification accuracy.
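The view-pooling step at the heart of MVCNN is an element-wise max over per-view feature vectors, which makes the aggregated feature independent of view order. A minimal sketch (our own simplification; the real model pools CNN feature maps, not raw vectors):

```python
import numpy as np

def view_pool(view_features):
    """MVCNN-style view pooling: element-wise max over per-view
    feature vectors. Input shape: (num_views, feature_dim)."""
    return np.max(np.asarray(view_features, dtype=np.float64), axis=0)

feats = [[1.0, 5.0, 0.0],   # features from view 1
         [3.0, 2.0, 4.0]]   # features from view 2
pooled = view_pool(feats)   # -> [3., 5., 4.]
```

Because max is symmetric, shuffling the views leaves the pooled feature unchanged.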
3 Voxel-based representations

The voxel-based representation is traditionally a dense representation, which describes 3D shape data by volumetric grids in 3D space. Each voxel in the grid records the occupancy status (e.g., occupied or unoccupied) of a cuboid cell.

One of the earliest methods that applies deep neural networks to volumetric representations was proposed by Wu et al. [112] in 2015, called
3D ShapeNets. Wu et al. assigned three different states to the voxels in the volumetric representation produced from 2.5D depth maps: observed, unobserved and free. 3D ShapeNets extended the deep belief network (DBN) [41] from pixel data to voxel data and replaced fully connected layers in the DBN with convolutional layers. The model takes the aforementioned volumetric representation as input, and outputs category labels and a predicted 3D shape through iterative computations. Concurrently, Maturana et al. proposed to process the volumetric representation with 3D Convolutional Neural Networks (3D CNNs) [62] and designed
VoxNet [63] for object recognition. VoxNet defines several volumetric layers, including an input layer, convolutional layers, pooling layers and fully connected layers. Although these layers are simple extensions of traditional 2D CNNs [48] to 3D, VoxNet is easy to implement and train, and achieved promising performance as the first attempt at volumetric convolutions. In addition, to make VoxNet invariant to orientation, Maturana et al. further augment the input data by rotating each shape into n instances with different orientations in the training stage, and add a pooling operation after the output layer to group all predictions from the n instances in the test stage.

In addition to the development of deep belief networks and convolutional neural networks for shape analysis based on the volumetric representation, the two most successful generative models, namely auto-encoders and Generative Adversarial Networks (GANs) [33], have also been extended to support this representation. Inspired by Denoising Auto-Encoders (DAEs) [101, 102], Sharma et al. proposed an auto-encoder model, VConv-DAE, for coping with voxels [83]. It is one of the earliest unsupervised learning approaches in voxel-based shape analysis to our knowledge. Without object labels for training, VConv-DAE chooses mean square loss or cross entropy loss as the reconstruction loss function. Girdhar et al. [32] also proposed
TL-embedding Network, which combines an auto-encoder for generating a voxel-based representation with a convolutional neural network for predicting the embeddings from 2D images.

Choy et al. [16] proposed 3D-R2N2, which takes single or multiple images as input and reconstructs objects as occupancy grids. 3D-R2N2 regards the input images as a sequence and designs a 3D recurrent neural network based on LSTM (Long Short-Term Memory) [42] or GRU (Gated Recurrent Unit) [15]. The architecture consists of three parts: an image encoder to extract features from 2D images, a 3D-LSTM to predict hidden states as coarse representations of the final 3D models, and a decoder to increase the resolution and generate the target shapes.

Wu et al. [110] designed a generative model called 3D-GAN that applies the Generative Adversarial Network (GAN) [33] to voxel data. 3D-GAN learns to synthesize a 3D object from a latent vector z sampled from the probability distribution P(z). Moreover, [110] also proposed 3D-VAE-GAN, inspired by VAE-GAN [49], for the object reconstruction task. 3D-VAE-GAN puts an encoder before 3D-GAN to infer the latent vector z from input 2D images and shares its decoder with the generator of 3D-GAN.

After the early attempts at dealing with volumetric representations by deep learning, researchers began to optimize the architecture of volumetric networks for better performance and more applications. A motivation is that the naive extension of traditional 2D networks often does not perform better than image-based CNNs such as MVCNN [93]. The main challenges affecting performance include overfitting, orientation, data sparsity and low resolution.

Qi et al. [76] proposed two new network structures aiming to improve the performance of volumetric CNNs. One introduces an auxiliary task, namely predicting class labels from subvolumes, to prevent overfitting; the other utilizes elongated kernels to compress the 3D information into the 2D field in order to use 2D CNNs directly.
Both of them use mlpconv layers [55] to replace traditional convolutional layers. [76] also augments the input data with different orientations and elevations to encourage the network to capture more local features in different poses, so that the results are less influenced by orientation changes. To further mitigate the impact of orientation on recognition accuracy, instead of using data augmentation like [63, 76], [82] proposed a new model called ORION, which extends VoxNet [63] and uses a fully connected layer to predict the object class label and orientation label simultaneously.
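The rotate-and-pool strategy used at test time by VoxNet can be sketched as follows. This is a toy illustration with a stand-in `predict` function and 90-degree z-axis rotations only (the papers use finer orientation sets):

```python
import numpy as np

def rotations_z(grid, n=4):
    """n copies of a voxel grid rotated about the z axis in 90-degree
    steps; such rotations are exact on a regular grid."""
    return [np.rot90(grid, k, axes=(0, 1)) for k in range(n)]

def orientation_pooled_predict(grid, predict, n=4):
    """Average a classifier's scores over rotated instances, mirroring
    VoxNet-style orientation pooling at test time.
    `predict` maps a voxel grid to a score vector."""
    scores = [predict(g) for g in rotations_z(grid, n)]
    return np.mean(scores, axis=0)
```

At training time the same rotated copies serve as data augmentation.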
Voxel-based representations often lead to high computational cost because of the exponential increase in computations from pixels to voxels. Most methods cannot cope with or generate high-resolution models within reasonable time. For instance,
TL-embedding Network [32] was designed for 20^3 voxel grids; 3D ShapeNets [112] and VConv-DAE [83] were designed for 24^3 voxel grids with 3 voxels of padding in each direction; VoxNet [63], 3D-R2N2 [16] and
ORION [82] were designed for 32^3 voxel grids; 3D-GAN [110] was designed to generate 64^3 occupancy grids as the 3D shape representation. As the voxel resolution increases, the occupied cells become sparser in the whole 3D space, which leads to more unnecessary computation. To address this problem, Li et al. [54] designed a novel method called FPNN to cope with the data sparsity.

Some methods instead encode the voxel grids with a sparse, adaptive data structure, namely an octree [64], to reduce the dimensionality of the input data. Häne et al. [38] proposed Hierarchical Surface Prediction (HSP), which generates voxel grids in the form of an octree, from coarse to fine. Häne et al. observed that only the voxels near the object surface need to be predicted at high resolution, so HSP avoids unnecessary calculation and ensures affordable generation of high-resolution voxel grids. As introduced in [38], each node in the octree is defined as a voxel block with a fixed number (16^3 in the paper) of voxels at different sizes, and each voxel block is classified as occupied, boundary or free. The decoder of the model takes a feature vector as input, and hierarchically predicts feature blocks that correspond to voxel blocks. HSP defines an octree with 5 layers, each voxel block containing 16^3 voxels, so HSP can generate up to 256^3 voxel grids. Tatarchenko et al. [98] also proposed a decoder called OGN for generating high-resolution volumetric representations. In [98], nodes in the octree are separated into three categories: "empty", "filled" and "mixed". The octree representing a 3D model and the feature map of the octree are stored as hash tables indexed by spatial position and octree level. In order to process the feature maps represented as hash tables, Tatarchenko et al. designed a convolutional layer named
OGN-Conv, which converts the convolutional operation into matrix multiplication. [98] generates voxel grids at a different resolution in each decoder layer by convolutional operations on feature maps, and then decides whether to propagate the features to the next layer according to the predicted labels (propagating the features of "mixed" cells, which straddle the surface, and stopping the propagation for "empty" and "filled" cells).

Besides decoder models for synthesizing voxel grids, shape analysis methods have also been designed using octrees. However, the conventional octree structure [64] is difficult to use in deep networks, so many researchers try to resolve this problem by designing new octree structures and special operations on octrees such as convolution, pooling and unpooling. Riegler et al. [80] proposed
OctNet. The octree representation in [80] has a more regular structure than a traditional octree: it places shallow octrees in regular 3D grids. Each shallow octree is constrained to have up to 3 levels and is encoded in 73 bits, where each bit determines whether the corresponding cell is split. Wang et al. [105] also proposed a convolutional neural network based on octrees called
O-CNN, where the model also removes pointers, like the shallow octree [80], and stores the octree data and structure in a series of vectors including shuffle key vectors, labels and input signals. In addition to representing voxels, the octree structure can also be utilized to represent 3D shape surfaces with planar patches. Wang et al. [106] proposed
Adaptive O-CNN, where they define another form of octree named the patch-guided adaptive octree, which divides a 3D shape surface into a set of planar patches restricted by bounding boxes corresponding to octants. They also provide an encoder and a decoder for this octree.
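The reason octrees save memory is the empty/filled/mixed classification described above: only blocks that straddle the surface are subdivided. A minimal sketch of this recursion over a binary occupancy grid (our own illustration, not the storage scheme of any particular paper):

```python
import numpy as np

def build_octree(grid, x0=0, y0=0, z0=0, size=None):
    """Recursively classify cubic blocks of a binary occupancy grid as
    'empty', 'filled', or a list of 8 sub-blocks ('mixed'); only mixed
    blocks are subdivided, concentrating storage near the surface."""
    if size is None:
        size = grid.shape[0]          # assumes a cube with power-of-2 side
    block = grid[x0:x0 + size, y0:y0 + size, z0:z0 + size]
    if not block.any():
        return 'empty'
    if block.all():
        return 'filled'
    h = size // 2                     # mixed: split into 8 octants
    return [build_octree(grid, x0 + dx, y0 + dy, z0 + dz, h)
            for dx in (0, h) for dy in (0, h) for dz in (0, h)]
```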
4 Surface-based representations

4.1 Point-based representation

The typical point-based representation is also referred to as point clouds or point sets. They can be raw data generated by 3D scanning devices. Because of their unordered and irregular structure, this kind of representation is relatively difficult to handle with traditional deep learning methods. Therefore, most researchers avoided using point clouds directly at the early stage of deep learning-based geometry research. One of the first models to generate point clouds by deep learning came out in 2017 [21]. The authors designed a neural network to learn a point sampler based on the 3D shape point distribution. The network takes a single image and a random vector as input, and outputs an N × 3 matrix (x, y, z coordinates for each of N points). In addition, [21] proposed to use the Chamfer Distance (CD) and
Earth Mover's Distance (EMD) [81] as loss functions to train the networks.
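The Chamfer Distance between two point sets can be computed in a few lines; this is a common symmetric squared-distance variant (formulations differ slightly across papers):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    for each point, the squared distance to its nearest neighbor in the
    other set, averaged per set and summed over both directions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Unlike EMD, no point correspondence is optimized, which makes CD cheap but blind to density mismatches.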
PointNet. At almost the same time, Qi et al. [75] proposed PointNet for shape analysis, which was the first successful deep network architecture that directly processes point clouds without unnecessary rendering. The pipeline of PointNet is illustrated in Figure 2. On account of three properties of point sets mentioned in [75], PointNet designed three components in its network: using max-pooling layers as symmetric functions to deal with the unordered property, concatenating global and local features together for point interaction, and jointly aligning the network for transformation invariance.

Fig. 2 The pipeline of PointNet. Reproduced with permission from Ref. [75], © IEEE 2017.

Based on PointNet, Qi et al. further improved this model and proposed
PointNet++ [77], in order to resolve the problem that PointNet cannot capture and exploit local, metric-induced features well. Compared with PointNet, PointNet++ introduces a hierarchical structure, so that the model can capture features at different scales, which improves its capability of extracting 3D shape features. As PointNet and PointNet++ showed state-of-the-art performance in shape classification and semantic segmentation, more and more deep learning models were proposed based on point-based representations.
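The key PointNet idea, a shared per-point transformation followed by a symmetric max pooling, can be sketched minimally as follows. The two-layer MLP with random, untrained weights is our own stand-in; the point is that the output is provably invariant to input point order:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 16)) * 0.1    # hypothetical shared MLP weights
W2 = rng.standard_normal((16, 32)) * 0.1

def pointnet_global_feature(points):
    """Minimal PointNet-style encoder: a shared per-point MLP followed
    by max pooling. Max is a symmetric function, so the global code is
    invariant to the order of the input points."""
    h = np.maximum(points @ W1, 0.0)       # per-point features (N, 16)
    h = np.maximum(h @ W2, 0.0)            # per-point features (N, 32)
    return h.max(axis=0)                   # order-invariant global code
```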
Convolutional Neural Networks for Point Clouds. Some research works focus on applying CNNs to the irregular and unordered form of point clouds for analysis. Li et al. [53] proposed
PointCNN for point clouds and designed the X-transformation to weight and permute the input point features, which guarantees equivariance under different point orders. Each feature matrix is multiplied by the X-transformation matrix before passing through the convolutional operator. This process is named the X-Conv operator, which is the key of PointCNN. Wang et al. [108] proposed
DGCNN, a dynamic graph CNN architecture for point cloud classification and segmentation. Instead of processing point features directly like PointNet [75], DGCNN first connects neighboring points in the spatial or semantic space to generate a graph, and then captures local geometric features by applying the EdgeConv operator on it. Moreover, unlike other graph CNNs which process a fixed input graph, DGCNN updates the graph to obtain new nearest neighbors in the feature space at different layers, which is beneficial for obtaining larger and sparser receptive fields.
Other Point Cloud Processing Techniques using Neural Networks. Klokov et al. [47] proposed
Kd-Network to process point clouds in the form of kd-trees. Yang et al. [117] proposed
FoldingNet, an end-to-end auto-encoder for further compressing the point-based representation with unsupervised learning. Because point clouds can be reconstructed from 2D grids by folding operations, FoldingNet integrates folding operations in its encoder-decoder to recover the input 3D shapes. Mehr et al. [65] further proposed
DiscoNet for 3D model editing, combining multiple auto-encoders trained specifically for different types of 3D shapes. The auto-encoders use the pre-learned mean geometry of the training 3D shapes as their templates. Meng et al. [66] proposed
VV-Net (Voxel VAE Net) for point segmentation, which represents a point cloud by a structured voxel representation. In
VV-Net, instead of containing a Boolean value representing the occupancy status of each voxel as in a normal volumetric representation, each voxel stores a latent code computed by an RBF-VAE, a variational auto-encoder based on a radial basis function (RBF) interpolation of points, to describe the point distribution within the voxel. This representation is used to extract intrinsic symmetry of point clouds by a group equivariant CNN, and the output is combined with PointNet [75] for better segmentation performance.

Although the point-based representation can be more easily obtained by 3D scanners than other 3D representations, this raw form of 3D shapes is often unsuitable for 3D shape analysis, due to noise and data sparsity. Therefore, compared with other representations, it is essential for the point-based representation to incorporate an upsampling module to obtain fine-grained point clouds, such as
PU-Net [119], MPU [118], PU-GAN [52], etc. Additionally, point cloud registration is also an essential preprocessing step, e.g. to fuse points from multiple scans, which aims to calculate rigid transformation parameters to align the point clouds. Wang et al. [107] proposed
Deep Closest Point (DCP), which extends the traditional Iterative Closest Point (ICP) method [4] and uses deep learning to obtain the transformation parameters. Recently, Guo et al. [35] presented a survey focusing on deep learning models for point clouds, which provides more details on this field.
4.2 Mesh-based representation

Compared with point-based representations, mesh-based representations contain connectivity between neighboring points, so they are more suitable for describing local regions on surfaces. As a typical type of representation in non-Euclidean space, mesh-based representations can be processed by deep learning models in both the spatial and spectral domains [9].
Parametric representations for meshes. Directly applying CNNs to irregular data structures like meshes is non-trivial, so a handful of approaches emerged that map 3D shape surfaces to 2D domains, such as 2D geometry images, which can also be regarded as another 3D shape representation, and apply traditional 2D CNNs on them [60, 87]. Based on geometry images, Sinha et al. [88] proposed
SurfNet for shape generation using a deep residual network. Similarly, Shi et al. [84] projected 3D models into cylindrical panoramic images, which are processed by CNNs. Some other methods convert mesh models into spherical signals, and design convolutional operators in the spherical domain for shape analysis. To handle high-resolution signals on 3D meshes, in particular texture information, Huang et al. [43] proposed
TextureNet to extract features in this situation, where a 4-rotationally symmetric (4-RoSy) field is defined to parametrize surfaces. In the following, we review deep learning models according to how meshes are directly treated as input, and introduce generative models working on meshes.
Graphs. The mesh-based representation is constructed from sets of vertices and edges, and so can be seen as a graph. Some models were proposed based on the graph spectral theorem. They generalize CNNs to graphs [2, 10, 18, 40, 46] by eigen-decomposition of Laplacian matrices, which generalizes convolutional operators to the spectral domain of graphs. Verma et al. [100] proposed another graph-based CNN named
FeaStNet, which computes the receptive fields of the convolution operator dynamically. Specifically, FeaStNet determines the assignment of neighboring vertices using features obtained in the network. Hanocka et al. [39] also designed convolution, pooling and unpooling operators for triangle meshes, and proposed
MeshCNN. Different from other graph-based methods, MeshCNN focuses on processing features stored on edges, and proposes a convolution operator applied to edges with a fixed number of neighbors and a pooling operator based on edge collapse. MeshCNN extracts 3D shape features with respect to specific tasks, and the network learns to preserve the important features and ignore the unimportant ones.
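The spectral approaches above build on the eigen-decomposition of the graph Laplacian: its eigenvectors act as a Fourier basis on the mesh's vertex graph, so spectral filtering means rescaling eigen-coefficients. A minimal sketch on a toy 4-cycle graph (our own example):

```python
import numpy as np

def graph_laplacian(num_vertices, edges):
    """Combinatorial Laplacian L = D - A of an undirected graph given
    as a vertex count and a list of (i, j) edges."""
    A = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

# A 4-cycle; L's eigenvectors play the role of a Fourier basis on the
# graph, and eigenvalues order them from smooth to oscillatory.
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
evals, evecs = np.linalg.eigh(L)
```

On a real mesh the cotangent Laplacian is commonly used instead of this purely combinatorial one.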
The mesh-based representation can also be viewed as the discretization of 2-manifolds. Several works operate on 2-manifolds with a series of refined CNN operators adapted to this non-Euclidean space. These methods define their own local patches and kernel functions to generalize CNN models. Masci et al. [61] proposed
Geodesic Convolutional Neural Networks (GCNNs) for manifolds, which extract and discretize local geodesic patches and apply convolutional filters on these patches in polar coordinates. The convolution operator is designed in the spatial domain, and Geodesic CNN is quite similar to conventional CNNs applied in Euclidean space.
Localized Spectral CNNs [6], proposed by Boscaini et al., apply the windowed Fourier transform to non-Euclidean spaces.
Anisotropic Convolutional Neural Networks (ACNNs) [7] further designed an anisotropic heat kernel to replace the isotropic patch operator in GCNN [61], which gives another solution to avoid ambiguity. Xu et al. [115] proposed
Directionally Convolutional Networks (DCNs), which define local patches based on the faces of the mesh representation. In this work, the researchers also designed a two-stream network for 3D shape segmentation, which takes local face normals and the global face distance histogram as input for training. Monti et al. [70] proposed
MoNet, which replaces the weight functions in [7, 61] with Gaussian kernels with learnable parameters. Fey et al. [22] proposed
SplineCNN, which designed a convolutional operator based on B-splines. Pan et al. [72] designed a surface CNN for irregular 3D surfaces that preserves the standard CNN property of translation equivariance by using parallel translation frames and group convolutional operations. Qiao et al. [78] proposed
Laplacian Pooling Network (LaplacianNet) for 3D mesh analysis. LaplacianNet considers both spectral and spatial information of the mesh, and contains three parts: feature preprocessing to form the network input, Mesh Pooling Blocks that split the surface and cluster patches for feature extraction, and a Correlation Network to aggregate global information.
Generative Models.
There are also many generative models for the mesh-based representation. Wang et al. [104] proposed
Pixel2Mesh for reconstructing 3D shapes from single images, which generates the target triangular mesh by deforming an ellipsoid template. As shown in Figure 3, the Pixel2Mesh network is implemented based on
Graph-based Convolutional Networks (GCNs) [9] and generates the target mesh from coarse to fine via an unpooling operation. Wen et al. [109] advanced Pixel2Mesh and proposed
Pixel2Mesh++, which extends single-image 3D shape reconstruction to 3D shape reconstruction from multi-view images. To achieve this, Pixel2Mesh++ adds a Multi-view Deformation Network (MDN) to the original Pixel2Mesh; the MDN incorporates cross-view information into the process of mesh generation. Groueix et al. [34] proposed
AtlasNet, which generates 3D surfaces as multiple patches. AtlasNet learns to map 2D square patches to 2-manifolds covering the surface of the 3D shape using an MLP (Multi-Layer Perceptron). Ben-Hamu et al. [3] proposed a multi-chart generative model for 3D shape generation. The method uses a multi-chart structure as input and builds the network architecture based on a standard image GAN [33]. The transformation between the 3D surface and the multi-chart structure is based on [60]. However, methods based on deforming a template mesh into the target shape cannot express the complex topology of some 3D shapes. Pan et al. [73] proposed a new single-view reconstruction method, which combines a deformation network and a topology modification network to model meshes with complex topology. In the topology modification network, faces with high distortion are removed. Tang et al. [97] proposed to generate meshes with complex topology by a skeleton-bridged learning method, because the skeleton preserves topology information well. Instead of generating triangular meshes, Nash et al. [71] proposed
PolyGen to generate the polygon mesh representation.Inspired by neural autoregressive models in otherfields like natural language processing, researchers regard mesh generation as a sequence, and design atransformer-based network [99], including a vertexmodel and a face model. The vertex model generatesa sequence of vertex positions and the face modelgenerates variable-length vertex sequences conditionedon input vertices.
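As a toy illustration of this serialization step, the sketch below quantizes vertex coordinates and orders vertices before flattening them into a token sequence. The bit depth, the (z, y, x) ordering and the function name are simplifying assumptions, not PolyGen's exact pipeline.

```python
import numpy as np

def serialize_vertices(vertices, bits=8):
    """Toy PolyGen-style serialization: quantize coordinates in [0, 1] to
    integer tokens and sort vertices (by z, then y, then x) so that an
    autoregressive model can predict the mesh one token at a time."""
    q = np.clip((vertices * (2 ** bits - 1)).round().astype(int),
                0, 2 ** bits - 1)
    # np.lexsort sorts primarily by its LAST key, so pass (x, y, z).
    order = np.lexsort((q[:, 0], q[:, 1], q[:, 2]))
    return q[order].reshape(-1)  # flat token sequence

tokens = serialize_vertices(np.array([[1.0, 1.0, 1.0],
                                      [0.0, 0.0, 0.0]]), bits=2)
```

With 2-bit quantization the two vertices become the tokens [0, 0, 0, 3, 3, 3]; a vertex model would be trained to predict exactly such sequences.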
In addition to explicit representations such as point clouds and meshes, implicit fields have gained popularity in recent studies. A major reason is that an implicit representation is not limited by fixed topology or resolution. An increasing number of deep models define their own implicit representations and, building on them, propose various methods for shape analysis and generation.
The Occupancy/Indicator Function is one form of implicit 3D shape representation. The Occupancy Network was proposed by Mescheder et al. [67] to learn a continuous occupancy function as a new representation of 3D shapes via neural networks. The occupancy function reflects the status of a 3D point with respect to the shape surface: 1 means inside the surface and 0 otherwise. The authors regard this problem as a binary classification task and design an occupancy network that takes a 3D point position and a 3D shape observation as input and outputs the probability of occupancy. The generated implicit field is then processed by a Multi-resolution IsoSurface Extraction (MISE) method and the marching cubes algorithm [58] to obtain meshes. Moreover, encoder networks are introduced to obtain latent embeddings. Similarly, Chen et al. [14] designed IM-NET as a decoder for learning generative models, which also adopts an implicit function in the form of an indicator function.
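A minimal sketch of an occupancy/indicator function, here for an analytic sphere rather than a learned network; an Occupancy Network would replace this rule with an MLP conditioned on the input observation and output a probability instead of a hard label.

```python
import numpy as np

def occupancy(points, radius=1.0):
    """Toy occupancy/indicator function of a sphere: 1 for points
    inside the surface, 0 otherwise."""
    return (np.linalg.norm(points, axis=-1) < radius).astype(np.float32)

pts = np.array([[0.0, 0.0, 0.0],   # inside
                [2.0, 0.0, 0.0]])  # outside
inside_outside = occupancy(pts)    # [1., 0.]
```

Extracting a mesh then amounts to finding the 0.5 level set of the learned function, which is what MISE plus marching cubes does at increasing resolutions.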
Signed Distance Functions (SDFs) are another form of implicit representation. A signed distance function maps a 3D point to a real value rather than a probability, indicating both the spatial relation to and the distance from the 3D surface. Denote by SDF(x) the signed distance value of a given 3D point x ∈ ℝ³. Then SDF(x) > 0 means x is outside the shape surface, SDF(x) < 0 means x is inside the surface, and SDF(x) = 0 means x lies on the surface. The absolute value |SDF(x)| is the distance between point x and the surface. Park et al. [74] proposed DeepSDF and introduced an auto-decoder-based DeepSDF as a new 3D shape representation. Wang et al. [116] also proposed
Deep Implicit Surface Networks (DISNs) for single-view 3D reconstruction based on SDFs. Thanks to the advantages of SDFs, DISN was the first to reconstruct 3D shapes with flexible topology and thin structures in the single-view reconstruction task, which is difficult for other 3D representations.

Fig. 3 The pipeline of Pixel2Mesh. Ref. [104] © Springer 2018.
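The SDF sign convention can be sketched with an analytic example; DeepSDF and DISN learn such a function with a neural network instead of writing it in closed form.

```python
import numpy as np

def sdf_sphere(x, center=np.zeros(3), radius=1.0):
    """Analytic signed distance to a sphere: negative inside,
    positive outside, zero on the surface."""
    return np.linalg.norm(x - center, axis=-1) - radius

q = np.array([[0.0, 0.0, 0.0],   # inside      -> -1.0
              [2.0, 0.0, 0.0],   # outside     -> +1.0
              [1.0, 0.0, 0.0]])  # on surface  ->  0.0
d = sdf_sphere(q)
```

The surface itself is the zero level set {x : SDF(x) = 0}, so a mesh can again be extracted with marching cubes once the field is sampled on a grid.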
Function Sets. Occupancy functions and signed distance functions represent the 3D shape surface by a single function learned by a deep neural network. Genova et al. [30, 31] instead proposed to represent the whole 3D shape by combining a set of shape elements. In [31], they proposed Structured Implicit Functions (SIFs), where each element is represented by a scaled axis-aligned anisotropic 3D Gaussian, and the sum of these shape elements represents the whole 3D shape; the Gaussian parameters are learned by a CNN. [30] improved SIF into Deep Structured Implicit Functions (DSIFs), which add deep neural networks, as Deep Implicit Functions (DIFs), to provide local geometric detail. To summarize, DSIF exploits SIF to depict the coarse information of each shape element, and applies DIF for local shape details.
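A toy evaluation of an SIF-style function as a sum of scaled, axis-aligned anisotropic Gaussians. The function name and the hard-coded example are illustrative assumptions; in the actual papers the element parameters are predicted by a CNN and the surface is extracted at a chosen iso-level.

```python
import numpy as np

def sif_value(x, centers, radii, scales):
    """Evaluate a toy Structured Implicit Function at a query point x:
    a sum of scaled, axis-aligned anisotropic 3D Gaussians.
    centers: (N, 3) element centers; radii: (N, 3) per-axis spreads;
    scales: (N,) per-element weights."""
    d = (x - centers) / radii                     # (N, 3) normalized offsets
    return float(np.sum(scales * np.exp(-0.5 * np.sum(d * d, axis=-1))))

# One element centered at the origin with unit spread and weight 1.
centers = np.zeros((1, 3))
radii = np.ones((1, 3))
scales = np.array([1.0])
v_center = sif_value(np.zeros(3), centers, radii, scales)  # peaks at the center
```

Each element contributes most near its own center, which is why the set of elements doubles as a coarse part decomposition of the shape.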
Approaches without 3D Supervision. The above implicit representation models need to sample 3D points in the shape's bounding box as ground truth and are trained with 3D supervision. However, 3D ground truth may not be easy to obtain in some situations. Liu et al. [56] proposed a framework that learns implicit representations without explicit 3D supervision. The model uses a field probing algorithm to bridge the gap between the 3D shape and 2D images, designs a silhouette loss to constrain the 3D shape outline, and applies geometry regularization to keep the surface plausible.
Recently, more and more researchers have begun to realize the importance of the structure of 3D shapes and to integrate structural information into deep learning models. Primitive representations are a typical type of structure-based representation that depicts 3D shape structure well. A primitive representation describes a 3D shape with primitives such as oriented 3D boxes. Instead of describing geometric details, it concentrates on the overall structure of the shape, representing the structure as several primitives with a compact parameter set. More importantly, obtaining a primitive representation helps to generate more detailed and plausible 3D shapes.
Linearly Organized. Observing that humans often regard 3D shapes as collections of parts, Zou et al. [121] proposed 3D-PRNN, which applies an LSTM in a primitive generator so that 3D-PRNN can generate primitives sequentially. The generated primitive representations are highly efficient at depicting simple and regular 3D shapes. Wu et al. [111] further proposed an RNN-based method called PQ-NET, which also regards 3D shape parts as a sequence; the difference is that PQ-NET encodes geometry features in the network. Gao et al. [27] proposed a deep generative model named SDM-NET (Structured Deformable Mesh-Net). They designed a two-level VAE, containing a PartVAE for part geometry and an SP-VAE (Structured Parts VAE) for both structure and geometry features. In [27], each shape part is encoded in a carefully designed form that records both structure information (symmetry, supporting and supported) and geometry features.
Hierarchically Organized. Li et al. [51] proposed GRASS (Generative Recursive Autoencoders for Shape Structures), one of the first attempts to encode 3D shape structure with a neural network. They describe the shape structure by a hierarchical binary tree, in which child nodes are merged into a parent node by either adjacency or symmetry relations. Leaves in this structure tree represent the oriented bounding boxes (OBBs) and geometry features of each part, while intermediate nodes represent both the geometry features of their child nodes and the relations between them. Inspired by recursive neural networks (RvNNs) [89, 90], GRASS recursively merges the codes representing the OBBs into a root code that depicts the whole shape structure. The architecture of GRASS has three parts: (1) an RvNN autoencoder that encodes a 3D shape into a fixed-length code; (2) a GAN that learns the distribution of root codes and generates plausible structures; (3) another autoencoder, inspired by [32], that synthesizes the geometry of each part. Furthermore, to synthesize fine-grained geometry in voxel grids, a structure-aware recursive feature (SARF) is proposed, which contains both the geometry features of each part and the global and local OBB layout. However, GRASS [51] uses a binary tree to organize the part structure, which leads to ambiguity; binary trees are therefore not suitable for large-scale datasets. To address this problem, Mo et al. [68] proposed StructureNet, which organizes the hierarchical structure in the form of graphs.
BSP-Net (Binary Space Partitioning-Net), proposed by Chen et al. [13], is the first method to depict sharp geometric features; it constructs a 3D shape from convexes organized in a BSP-tree. The Binary Space Partitioning (BSP) tree defined in [13] represents 3D shapes by collections of convexes and comprises three layers: hyperplane extraction, hyperplane grouping, and shape assembly. The convexes can also be seen as a new form of primitive that represents the geometric details of 3D shapes rather than just their overall structure.
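The convex building block of a BSP-style representation can be sketched as an intersection of half-spaces; the hard max used here is an assumption standing in for BSP-Net's differentiable grouping layers.

```python
import numpy as np

def convex_indicator(x, planes):
    """A BSP-style convex as an intersection of half-spaces.
    planes: (K, 4) rows (a, b, c, d) defining a*px + b*py + c*pz + d <= 0
    as the inside. Returns 1.0 if x lies in every half-space, else 0.0."""
    h = planes[:, :3] @ x + planes[:, 3]   # signed value per hyperplane
    return 1.0 if np.max(h) <= 0.0 else 0.0

# A unit cube centered at the origin, built from 6 half-spaces.
cube = np.array([[ 1, 0, 0, -0.5], [-1, 0, 0, -0.5],
                 [ 0, 1, 0, -0.5], [ 0,-1, 0, -0.5],
                 [ 0, 0, 1, -0.5], [ 0, 0,-1, -0.5]], dtype=float)
inside = convex_indicator(np.zeros(3), cube)                 # 1.0
outside = convex_indicator(np.array([1.0, 0.0, 0.0]), cube)  # 0.0
```

A whole shape is then the union of many such convexes, which is exactly what the shape assembly layer of a BSP-tree computes.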
Structure and Geometry. Researchers have tried to encode 3D shape structure and geometry features separately [51] or jointly [113]. Wang et al. [103] proposed a Global-to-Local (G2L) generative model to generate man-made 3D shapes from coarse to fine. To address the problem that GANs cannot generate geometric details well [110], G2L first applies a GAN to generate coarse voxel grids with semantic labels that represent shape structure at the global level, and then feeds the voxels, separated by semantic label, into an autoencoder called the Part Refiner (PR) to optimize part geometric details part by part at the local level. Wu et al. [113] proposed SAGNet for detailed 3D shape generation, which encodes structure and geometry jointly with a GRU [15] architecture in order to find the intra-relations between them. SAGNet shows better performance on tenon-mortise joints than other structure-based learning methods.
Deformable 3D models play an important role in computer animation. However, most of the methods mentioned above focus on rigid 3D models, paying less attention to the deformation of non-rigid models. Compared with other representations, deformation-based representations parameterize the deformation information and perform better on non-rigid 3D shapes, such as articulated models.
Mesh-based Deformation Descriptions. A mesh can be seen as a graph, which makes it convenient to manipulate vertex positions while maintaining the connectivity between vertices. Therefore, a great number of methods choose meshes to represent deformable 3D shapes. Based on this property, some mesh-based generation methods generate target shapes by deforming a mesh template [27, 73, 104, 109], and these methods can also be regarded as deformation-based methods. The graph structure makes it easy to store deformation information as vertex features, which can be seen as deformation representations. Gao et al. [24] designed an efficient and rotation-invariant deformation representation called the Rotation-Invariant Mesh Difference (RIMD), which achieves high performance in shape reconstruction, deformation and registration. Based on [24], Tan et al. [94] proposed Mesh VAE for deformable shape analysis and synthesis, which takes the RIMD as the feature input of a VAE and uses fully connected layers for the encoder and decoder. Further, Gao et al. [25] designed an as-consistent-as-possible (ACAP) representation that constrains the rotation angles and rotation axes between adjacent vertices of the deformable mesh, to which graph convolutions are easily applied. Tan et al. [95] proposed SparseAE based on the ACAP representation [25], applying graph convolutional operators [19] to ACAP features to analyze mesh deformations. Gao et al. [26] proposed VC-GAN (VAE CycleGAN) for unpaired mesh deformation transfer, the first automatic method for this task. It takes the ACAP representation as input, encodes it into a latent space by a VAE, and then transfers deformations between source and target in the latent domain with cycle consistency and visual similarity consistency. Gao et al. [27] first viewed the geometric details shown in Fig. 5 as deformations; based on the previous techniques [25, 26, 94, 95], these geometric details can be encoded and generated. The structure
in [27] is also analyzed in a stable-supportable manner [44]. Yuan et al. [120] apply a newly designed pooling operation, based on mesh simplification and graph convolution, to a VAE architecture that also takes the ACAP representation as network input. Tan et al. [96] use the ACAP representation for simulating thin-shell deformable materials, applying a graph-based CNN to embed high-dimensional features into a low-dimensional space. Beyond single deformable meshes, mesh sequences play an even more important role in computer animation, and the deformation-based ACAP representation [25] is suitable for representing mesh sequences. Qiao et al. [79] also take the ACAP representation as input to generate mesh animation sequences with a bidirectional LSTM network.

Fig. 4 Research works on deformation-based shape representation by the geometry learning group at ICT, CAS.
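The rotation-invariant flavor of these deformation representations can be illustrated by splitting a per-vertex deformation gradient into a rotation and a symmetric scale/shear part. This polar decomposition sketch is only in the spirit of RIMD/ACAP, not the papers' exact feature construction.

```python
import numpy as np

def polar_decompose(F):
    """Split a deformation gradient F into a rotation R and a symmetric
    scale/shear factor S, with F = R @ S, via the SVD. RIMD-style features
    keep the S parts plus rotation *differences* between neighboring
    vertices, which removes sensitivity to global rotation."""
    U, sigma, Vt = np.linalg.svd(F)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # keep R a proper rotation
        U[:, -1] *= -1.0
        R = U @ Vt
    S = Vt.T @ np.diag(sigma) @ Vt
    return R, S

theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
F = Rz @ np.diag([2.0, 1.0, 0.5])   # rotation times anisotropic scaling
R, S = polar_decompose(F)
```

Given per-vertex rotations, the rotation difference between adjacent vertices i and j can then be formed as R_i.T @ R_j, which is the kind of relative quantity that makes such representations rotation-invariant.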
Fig. 5 An example of representing a chair leg by deforming a bounding box in SDM-NET: (a) a chair with one of its leg parts highlighted; (b) the highlighted part in (a) and the overlaid bounding box; (c) the bounding box used as the template; (d) the deformed bounding box; (e) the recovered shape. Ref. [27] © ACM 2019.
Implicit Surface Based Approaches. With the development of implicit surface representations, Jeruzalski et al. [45] proposed a method to represent articulated deformable shapes by pose parameters, called Neural Articulated Shape Approximation (NASA). The pose parameters in [45] record the transformations of the bones defined in the models. They compared three network architectures, an unstructured model (U), a piecewise rigid model (R) and a piecewise deformable model (D), on training and test datasets, which opens another direction for representing deformable 3D shapes.
With the development of 3D scanners, 3D models have become easier to obtain, so more and more 3D shape datasets have been released in different 3D representations. Larger and more detailed datasets bring new challenges for existing techniques, which further promotes the development of deep learning on different 3D representations. The datasets can be divided into several types by representation and application; choosing an appropriate dataset benefits the performance and generalization of learning-based models.
RGB-D Images.
RGB-D image datasets can be collected by depth sensors such as the Microsoft Kinect, and most of them can be regarded as video sequences. The indoor-scene RGB-D dataset NYU Depth [85, 86] was first provided for the segmentation problem; the v1 version [85] covers 64 categories while the v2 version [86] covers 464 categories. The KITTI dataset [29] provides outdoor-scene images mainly for autonomous driving, in five categories: 'Road', 'City', 'Residential', 'Campus' and 'Person'. Depth maps can be computed with the development kit provided by KITTI, and the dataset also contains 3D object annotations for applications such as object detection. ScanNet [17] is a large annotated RGB-D video dataset, which includes 2.5M views in 1,513 scenes with 3D camera poses, surface reconstructions and semantic segmentations. Another dataset, Human10 [11], is sampled from 10 human action sequences.
Man-made 3D Object Datasets.
ModelNet [112] is one of the best-known CAD model datasets for 3D shape analysis, including 127,915 3D CAD models in 662 categories. ModelNet provides two subsets, ModelNet10 and ModelNet40: ModelNet10 includes 10 categories from the whole dataset, with manually aligned 3D models; ModelNet40 includes 40 categories, also aligned. ShapeNet [12] provides a larger-scale dataset, containing more than 3 million models in more than 4K categories, with two smaller subsets, ShapeNetCore and ShapeNetSem. For various geometry applications, ShapeNet [12] provides rich annotations for the 3D objects in the dataset, including category labels, part labels, symmetry information, etc. ObjectNet3D [114] is a large-scale dataset for 3D object recognition from 2D images, which includes 201,888 3D objects in 90,127 images and 44,147 different 3D shapes; it is annotated with 3D pose parameters that align the 3D objects with the 2D images. SUNCG [92] includes full-room 3D models and is suitable for 3D scene analysis and scene completion tasks. The 3D models in SUNCG are represented by dense voxel grids with object annotations; the whole dataset includes 49,884 valid floors with 404,058 rooms and 5,697,217 object instances. PartNet [69] provides a more detailed CAD model dataset with fine-grained, hierarchical part annotations, which brings new challenges and resources for 3D object applications such as semantic segmentation, shape editing and shape generation. 3D-FUTURE [23] provides a large-scale furniture dataset, which includes 20,000+ scenes in 5,000+ rooms with 10,000+ 3D instances; each 3D shape is of high quality, with the best texture information to date.
Non-Rigid Model Datasets.
TOSCA [8] is a high-resolution 3D non-rigid model dataset containing 80 objects in 9 categories. The models are in mesh representation, and objects within the same category have the same resolution. FAUST [5] is a dataset of 3D human body scans of 10 different people in a variety of poses, with ground-truth correspondences provided. Because FAUST was built for real-world shape registration, the scans are noisy and incomplete, but the corresponding ground truth is watertight and aligned. AMASS [59] provides a large and varied human motion dataset, which gathers previous mocap datasets within a consistent framework and parameterization; it contains 344 subjects, 11,265 motions and more than 40 hours of recordings.
The shape representations mentioned above are fundamental for shape analysis and shape reconstruction. In this section, we summarize representative works in these two directions and compare their performance.
Shape analysis methods usually extract latent codes from different 3D shape representations with different network architectures. The latent codes are then used for specific applications such as shape classification, shape retrieval and shape segmentation, and different representations suit different applications. We now review the performance of different representations in different models and discuss which representations suit which applications.
Shape Classification and Retrieval are basic problems of shape analysis; both rely on the feature vectors extracted by analysis networks. For shape classification, the ModelNet10 and ModelNet40 datasets [112] are widely used, and Table 2 reports the accuracy of several methods on them. For shape retrieval, given a 3D shape as a query, the goal is to find the shape(s) in the dataset most similar to the query. Retrieval methods usually learn a compact code representing the object in a latent space and return the closest object under the Euclidean distance, Mahalanobis distance or another distance metric. Unlike classification, shape retrieval uses a number of evaluation measures, including precision, recall and mAP (mean average precision).
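A minimal sketch of latent-space retrieval under the Euclidean metric; in practice the codes come from a trained analysis network and the metric may itself be learned.

```python
import numpy as np

def retrieve(query_code, gallery_codes, k=1):
    """Rank gallery shapes by Euclidean distance to the query code in
    latent space and return the indices of the k nearest shapes."""
    d = np.linalg.norm(gallery_codes - query_code, axis=1)
    return np.argsort(d)[:k]

gallery = np.array([[0.0, 0.0],
                    [1.0, 1.0],
                    [0.9, 1.1]])
idx = retrieve(np.array([1.0, 1.0]), gallery, k=2)  # nearest two codes
```

Precision, recall and mAP are then computed by checking how many of the retrieved indices share the query's category label.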
Tab. 1 An overview of 3D model datasets.

Source      Type                  Dataset             Year  Categories  Size       Description
Real-world  RGB-D Images          NYU Depth v1 [85]   2011  64          -          Indoor scenes
Real-world  RGB-D Images          NYU Depth v2 [86]   2012  464         407,024    Indoor scenes
Real-world  RGB-D Images          KITTI [29]          2013  5           -          Outdoor scenes
Real-world  RGB-D Images          ScanNet [17]        2017  1,513       2.5M       Indoor scene video
Real-world  RGB-D Images          Human10 [11]        2018  10          9,746      Human actions
Synthetic   3D CAD Models         ModelNet [112]      2015  662         127,915    Mesh representation
Synthetic   3D CAD Models         ModelNet10 [112]    2015  10          4,899      -
Synthetic   3D CAD Models         ModelNet40 [112]    2015  40          12,311     -
Synthetic   3D CAD Models         ShapeNet [12]       2015  4K          3 million  Rich annotations
Synthetic   3D CAD Models         ShapeNetCore [12]   2015  55          51,300     -
Synthetic   3D CAD Models         ShapeNetSem [12]    2015  270         12,000     -
Synthetic   Images and 3D Models  ObjectNet3D [114]   2016  100         44,161     2D aligned with 3D
Synthetic   3D CAD Models         SUNCG [92]          2017  -           49,884     Full-room scenes
Synthetic   3D CAD Models         PartNet [69]        2019  24          26,671     573,585 part instances
Synthetic   3D CAD Models         3D-FUTURE [23]      2020  -           10K        Texture information
Synthetic   Non-Rigid Models      TOSCA [8]           2008  9           80         -
Real-world  Non-Rigid Models      FAUST [5]           2014  10          300        Human bodies
Synthetic   Non-Rigid Models      AMASS [59]          2019  344         11,265     Human motions
Tab. 2 Accuracy of shape classification on the ModelNet10 and ModelNet40 datasets.

Form        Model                ModelNet10 (%)  ModelNet40 (%)
Voxel       3D ShapeNets [112]   83.54           77.32
Voxel       VoxNet [63]          92              83
Voxel       3D-GAN [110]         91.0            83.3
Voxel       Qi et al. [76]       -               86
Voxel       ORION [82]           93.8            -
Point       PointNet [75]        -               89.2
Multi-view  MVCNN [93]           -               90.1
Point       Kd-Net [47]          93.3            90.6
Multi-view  Qi et al. [76]       -               91.4
Point       PointNet++ [77]      -               91.9
Point       Point2Sequence [57]  95.3            92.6
Shape Segmentation aims to discriminate the part categories of a 3D shape, and plays an important role in understanding 3D shapes. The mean Intersection-over-Union (mIoU) is often used as the evaluation metric for shape segmentation. Most researchers choose the point-based representation for the segmentation task [47, 53, 66, 75, 77].
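A minimal mIoU computation for one shape; skipping parts absent from both prediction and ground truth is a common convention, and the exact averaging protocol (per shape, per category, per instance) varies between papers.

```python
import numpy as np

def mean_iou(pred, gt, num_parts):
    """Mean Intersection-over-Union over part labels for one shape.
    pred, gt: integer part labels per point."""
    ious = []
    for p in range(num_parts):
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        if union > 0:                  # skip parts absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
# part 0: inter 1, union 2 -> 0.5; part 1: inter 2, union 3 -> 2/3
miou = mean_iou(pred, gt, 2)
```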
Shape Symmetry Detection. Symmetry is important geometric information of 3D shapes, and can be further used in many other applications such as shape alignment, registration and completion. Gao et al. [28] designed the first unsupervised deep learning method, PRS-Net (Planar Reflective Symmetry Net), to detect the planar reflective symmetry of 3D shapes, introducing a new symmetry distance loss and a regularization loss. PRS-Net was shown to be robust to noisy and incomplete input and more efficient than traditional methods. As symmetry is largely determined by the overall shape, PRS-Net is built on a 3D voxel CNN and achieves high performance even at low resolution.
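A sketch of the geometric core of planar reflective symmetry detection: reflect the points across a candidate plane and measure how far the reflected set is from the original. PRS-Net's actual symmetry distance loss operates on sampled surface points with a learned plane; the nearest-neighbor error used here is a simplification.

```python
import numpy as np

def reflect(points, n, d):
    """Reflect points across the plane n.x + d = 0 (n need not be unit)."""
    n = n / np.linalg.norm(n)
    dist = points @ n + d                  # signed distance to the plane
    return points - 2.0 * np.outer(dist, n)

def symmetry_distance(points, n, d):
    """Toy symmetry error: mean nearest-neighbor distance from each
    reflected point back to the input set (0 for a symmetric set)."""
    r = reflect(points, n, d)
    return float(np.mean([np.min(np.linalg.norm(points - q, axis=1))
                          for q in r]))

pts = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
err = symmetry_distance(pts, np.array([1.0, 0.0, 0.0]), 0.0)  # plane x = 0
```

For this point set the plane x = 0 maps every point onto another point of the set, so the error is zero; an asymmetric candidate plane would score higher, which is what the loss penalizes during training.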
Learning-based generative models have been proposed for different representations, another important field in geometry learning. Reconstruction applications include single-view shape reconstruction, shape generation, shape editing, etc. The generation methods can be summarized by representation: for voxel-based representations, learning-based models predict the occupancy probability of each voxel in the grid; for point-based representations, they either sample 3D points in space or fold 2D grids into the target 3D objects; for mesh-based representations, most generation methods deform a mesh template into the final mesh. In recent studies, more and more methods use structured representations and generate 3D shapes from coarse to fine.
Fig. 6 The pipeline of PRS-Net. Ref. [28] © IEEE 2020.
10 Summary
In this survey, we have reviewed a series of deep learning methods based on different 3D object representations. We first overviewed different 3D representation learning models; the tendency in geometry learning is toward less computation and memory demand, and more detail and structure. We then introduced 3D datasets widely used in research, which provide rich resources and support evaluation for data-driven learning methods. Finally, we discussed 3D shape applications based on different 3D representations, including shape analysis and shape reconstruction. Different representations are usually suitable for different applications, so it is vitally important to choose suitable 3D representations for specific tasks.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61828204 and 61872440), Beijing Municipal Natural Science Foundation (No. L182016), the Youth Innovation Promotion Association CAS, the CCF-Tencent Open Fund, and a Royal Society-Newton Advanced Fellowship (No. NAF \ R2 \ \ R1 \ ).

Open Access
This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References

[1] E. Ahmed, A. Saint, A. E. R. Shabayek, K. Cherenkova, R. Das, G. Gusev, D. Aouada, and B. Ottersten. Deep learning advances on different 3D data representations: A survey. arXiv preprint arXiv:1808.01462, 2018.
[2] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[3] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman. Multi-chart generative surface modeling. In SIGGRAPH Asia 2018 Technical Papers, page 215. ACM, 2018.
[4] P. J. Besl and N. D. McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–606. International Society for Optics and Photonics, 1992.
[5] F. Bogo, J. Romero, M. Loper, and M. J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3794–3801, 2014.
[6] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. In Computer Graphics Forum, volume 34, pages 13–23. Wiley Online Library, 2015.
[7] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pages 3189–3197, 2016.
[8] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Numerical Geometry of Non-Rigid Shapes. Springer Science & Business Media, 2008.
[9] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[10] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[11] Y.-P. Cao, Z.-N. Liu, Z.-F. Kuang, L. Kobbelt, and S.-M. Hu. Learning to reconstruct high-quality 3D shapes with cascaded fully convolutional networks. In The European Conference on Computer Vision (ECCV), September 2018.
[12] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[13] Z. Chen, A. Tagliasacchi, and H. Zhang. BSP-Net: Generating compact meshes via binary space partitioning. arXiv preprint arXiv:1911.06971, 2019.
[14] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
[15] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[16] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV), pages 628–644. Springer, 2016.
[17] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
[18] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[19] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[20] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[21] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
[22] M. Fey, J. Eric Lenssen, F. Weichert, and H. Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.
[23] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3D-FUTURE: 3D furniture shape with texture.
[24] L. Gao, Y.-K. Lai, D. Liang, S.-Y. Chen, and S. Xia. Efficient and flexible deformation representation for data-driven surface modeling. ACM Transactions on Graphics (TOG), 35(5):158, 2016.
[25] L. Gao, Y.-K. Lai, J. Yang, Z. Ling-Xiao, S. Xia, and L. Kobbelt. Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics, 2019.
[26] L. Gao, J. Yang, Y.-L. Qiao, Y.-K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. In SIGGRAPH Asia 2018 Technical Papers, page 237. ACM, 2018.
[27] L. Gao, J. Yang, T. Wu, Y.-J. Yuan, H. Fu, Y.-K. Lai, and H. Zhang. SDM-NET: Deep generative network for structured deformable mesh. ACM Transactions on Graphics (TOG), 38(6):243, 2019.
[28] L. Gao, L.-X. Zhang, H.-Y. Meng, Y.-H. Ren, Y.-K. Lai, and L. Kobbelt. PRS-Net: Planar reflective symmetry detection net for 3D models. IEEE Transactions on Visualization and Computer Graphics, 2020.
[29] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[30] K. Genova, F. Cole, A. Sud, A. Sarna, and T. Funkhouser. Deep structured implicit functions. arXiv preprint arXiv:1912.06126, 2019.
[31] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser. Learning shape templates with structured implicit functions. arXiv preprint arXiv:1904.06447, 2019.
[32] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV), 2016.
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[34] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[35] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. Deep learning for 3D point clouds: A survey. arXiv preprint arXiv:1912.12033, 2019.
[36] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik. Aligning 3D models to RGB-D images of cluttered scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4731–4740, 2015.
[37] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
[38] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3D object reconstruction. In International Conference on 3D Vision (3DV), pages 412–420. IEEE, 2017.
[39] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or. MeshCNN: A network with an edge. ACM Transactions on Graphics (TOG), 38(4):90, 2019.
[40] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
[41] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[42] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[43] J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner, and L. J. Guibas. TextureNet: Consistent local parametrizations for learning from high-resolution signals on meshes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4440–4449, 2019.
[44] S.-S. Huang, H. Fu, L.-Y. Wei, and S.-M. Hu. Support substructures: support-induced part-level structural representation. IEEE Transactions on Visualization and Computer Graphics, 22(8):2024–2036, 2015.
[45] T. Jeruzalski, B. Deng, M. Norouzi, J. Lewis, G. Hinton, and A. Tagliasacchi. NASA: Neural articulated shape approximation. arXiv preprint arXiv:1912.03207, 2019.
[46] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[47] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
[48] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[49] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[50] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256. IEEE, 2010.
[51] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas. GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36(4):52, 2017.
[52] R. Li, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng. PU-GAN: A point cloud upsampling adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7203–7212, 2019.
[53] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
[54] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. FPNN: Field probing neural networks for 3D data. In Advances in Neural Information Processing Systems, pages 307–315, 2016.
[55] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[56] S. Liu, S. Saito, W. Chen, and H. Li. Learning to infer implicit surfaces without 3D supervision. In Advances in Neural Information Processing Systems, pages 8293–8304, 2019.
[57] X. Liu, Z. Han, Y.-S. Liu, and M. Zwicker. Point2Sequence: Learning the shape representation of 3D point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8778–8785, 2019.
[58] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In
ACM siggraph computer graphics , volume 21, pages 163–169.ACM, 1987.[59] N. Mahmood, N. Ghorbani, N. F. Troje,G. Pons-Moll, and M. J. Black. Amass:Archive of motion capture as surface shapes.In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 5442–5451, 2019.[60] H. Maron, M. Galun, N. Aigerman, M. Trope,N. Dym, E. Yumer, V. G. Kim, and Y. Lipman.Convolutional neural networks on surfaces viaseamless toric covers.
ACM Trans. Graph. ,36(4):71–1, 2017.[61] J. Masci, D. Boscaini, M. Bronstein, andP. Vandergheynst. Geodesic convolutionalneural networks on Riemannian manifolds.In
Proceedings of the IEEE internationalconference on computer vision workshops ,pages 37–45, 2015.[62] D. Maturana and S. Scherer. 3D convolutionalneural networks for landing zone detectionfrom LiDAR. In , pages 3471–3478. IEEE, 2015.[63] D. Maturana and S. Scherer. VoxNet: A3D convolutional neural network for real-time object recognition. In , pages 922–928. IEEE,2015.[64] D. Meagher. Geometric modeling using octreeencoding.
Computer graphics and imageprocessing , 19(2):129–147, 1982.[65] E. Mehr, A. Jourdan, N. Thome, M. Cord,and V. Guitteny. DiscoNet: Shapes learningon disconnected manifolds for 3D editing.In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 3474–3483, 2019.[66] H.-Y. Meng, L. Gao, Y.-K. Lai, andD. Manocha. Vv-net: Voxel vae net with groupconvolutions for point cloud segmentation.In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 8500–8508, 2019.[67] L. Mescheder, M. Oechsle, M. Niemeyer,S. Nowozin, and A. Geiger. Occupancynetworks: Learning 3D reconstruction infunction space. In
Proceedings of the IEEE
178 Xiao et al.
Conference on Computer Vision and PatternRecognition , pages 4460–4470, 2019.[68] K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka,N. J. Mitra, and L. J. Guibas. StructureNet:hierarchical graph networks for 3D shapegeneration.
ACM Transactions on Graphics(TOG) , 38(6):242, 2019.[69] K. Mo, S. Zhu, A. X. Chang, L. Yi,S. Tripathi, L. J. Guibas, and H. Su.PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D objectunderstanding. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 909–918, 2019.[70] F. Monti, D. Boscaini, J. Masci, E. Rodola,J. Svoboda, and M. M. Bronstein. Geometricdeep learning on graphs and manifolds usingmixture model CNNs. In
Proceedings of theIEEE Conference on Computer Vision andPattern Recognition , pages 5115–5124, 2017.[71] C. Nash, Y. Ganin, S. Eslami, and P. W.Battaglia. PolyGen: An autoregressivegenerative model of 3D meshes. arXiv preprintarXiv:2002.10880 , 2020.[72] H. Pan, S. Liu, Y. Liu, and X. Tong.Convolutional neural networks on 3D surfacesusing parallel frames. arXiv preprintarXiv:1808.04952 , 2018.[73] J. Pan, X. Han, W. Chen, J. Tang, andK. Jia. Deep mesh reconstruction from singlergb images via topology modification networks.In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 9964–9973, 2019.[74] J. J. Park, P. Florence, J. Straub,R. Newcombe, and S. Lovegrove. DeepSDF:Learning continuous signed distance functionsfor shape representation. In
Proceedings of theIEEE Conference on Computer Vision andPattern Recognition , 2019.[75] C. R. Qi, H. Su, K. Mo, and L. J. Guibas.PointNet: Deep learning on point sets for 3Dclassification and segmentation. In
Proceedingsof the IEEE Conference on Computer Visionand Pattern Recognition , pages 652–660, 2017.[76] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan,and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3Ddata. In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages5648–5656, 2016.[77] C. R. Qi, L. Yi, H. Su, and L. J.Guibas. PointNet++: Deep hierarchicalfeature learning on point sets in a metric space.In
Advances in neural information processingsystems , pages 5099–5108, 2017.[78] Y.-L. Qiao, L. Gao, J. Yang, P. L. Rosin, Y.-K. Lai, and X. Chen. LaplacianNet: Learningon 3D meshes with laplacian encoding andpooling. arXiv preprint arXiv:1910.14063 ,2019.[79] Y.-L. Qiao, Y.-K. Lai, H. Fu, and L. Gao.Synthesizing mesh deformation sequences withbidirectional lstm.
IEEE Transactions onVisualization and Computer Graphics , 2020.[80] G. Riegler, A. Osman Ulusoy, and A. Geiger.OctNet: Learning deep 3D representations athigh resolutions. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 3577–3586, 2017.[81] Y. Rubner, C. Tomasi, and L. J. Guibas. Theearth mover’s distance as a metric for imageretrieval.
International journal of computervision , 40(2):99–121, 2000.[82] N. Sedaghat, M. Zolfaghari, E. Amiri, andT. Brox. Orientation-boosted voxel nets for 3Dobject recognition. In
British Machine VisionConference , 2017.[83] A. Sharma, O. Grau, and M. Fritz. Vconv-DAE: Deep volumetric shape learning withoutobject labels. In
European Conference onComputer Vision , pages 236–250. Springer,2016.[84] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano:Deep panoramic representation for 3-d shaperecognition.
IEEE Signal Processing Letters ,22(12):2339–2343, 2015.[85] N. Silberman and R. Fergus. Indoor scenesegmentation using a structured light sensor.In
Proceedings of the International Conferenceon Computer Vision - Workshop on 3DRepresentation and Recognition , 2011.[86] N. Silberman, D. Hoiem, P. Kohli, andR. Fergus. Indoor segmentation and supportinference from RGBD images. In
EuropeanConference on Computer Vision , pages 746–760. Springer, 2012.[87] A. Sinha, J. Bai, and K. Ramani. Deep
18 Survey on Deep Geometry Learning: From a Representation Perspective 19 learning 3D shape surfaces using geometryimages. In
European Conference on ComputerVision , pages 223–240. Springer, 2016.[88] A. Sinha, A. Unmesh, Q. Huang, andK. Ramani. SurfNet: Generating 3Dshape surfaces using deep residual networks.In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages6040–6049, 2017.[89] R. Socher, B. Huval, B. Bath, C. D. Manning,and A. Y. Ng. Convolutional-recursive deeplearning for 3D object classification. In
Advances in neural information processingsystems , pages 656–664, 2012.[90] R. Socher, C. C. Lin, C. Manning, andA. Y. Ng. Parsing natural scenes and naturallanguage with recursive neural networks.In
Proceedings of the 28th internationalconference on machine learning (ICML-11) ,pages 129–136, 2011.[91] S. Song and J. Xiao. Deep sliding shapesfor amodal 3D object detection in RGB-Dimages. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition ,pages 808–816, 2016.[92] S. Song, F. Yu, A. Zeng, A. X. Chang,M. Savva, and T. Funkhouser. Semanticscene completion from a single depth image.
Proceedings of 30th IEEE Conference onComputer Vision and Pattern Recognition ,2017.[93] H. Su, S. Maji, E. Kalogerakis, andE. Learned-Miller. Multi-view convolutionalneural networks for 3D shape recognition.In
Proceedings of the IEEE internationalconference on computer vision , pages 945–953,2015.[94] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia.Variational autoencoders for deforming 3Dmesh models. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 5841–5850, 2018.[95] Q. Tan, L. Gao, Y.-K. Lai, J. Yang,and S. Xia. Mesh-based autoencoders forlocalized deformation component analysis. In
Thirty-Second AAAI Conference on ArtificialIntelligence , 2018.[96] Q. Tan, Z. Pan, L. Gao, and D. Manocha.Realtime simulation of thin-shell deformable materials using cnn-based mesh embedding.
IEEE Robotics and Automation Letters ,5(2):2325–2332, 2020.[97] J. Tang, X. Han, J. Pan, K. Jia, and X. Tong.A skeleton-bridged deep learning approach forgenerating meshes of complex topologies fromsingle RGB images. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 4541–4550, 2019.[98] M. Tatarchenko, A. Dosovitskiy, andT. Brox. Octree generating networks:Efficient convolutional architectures for high-resolution 3D outputs. In
Proceedings of theIEEE International Conference on ComputerVision , pages 2088–2096, 2017.[99] A. Vaswani, N. Shazeer, N. Parmar,J. Uszkoreit, L. Jones, A. N. Gomez, (cid:32)L. Kaiser,and I. Polosukhin. Attention is all you need.In
Advances in neural information processingsystems , pages 5998–6008, 2017.[100] N. Verma, E. Boyer, and J. Verbeek. FeastNet:Feature-steered graph convolutions for 3Dshape analysis. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 2598–2606, 2018.[101] P. Vincent, H. Larochelle, Y. Bengio, andP.-A. Manzagol. Extracting and composingrobust features with denoising autoencoders.In
Proceedings of the 25th internationalconference on Machine learning , pages 1096–1103. ACM, 2008.[102] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio,and P.-A. Manzagol. Stacked denoisingautoencoders: Learning useful representationsin a deep network with a local denoisingcriterion.
Journal of machine learningresearch , 11(Dec):3371–3408, 2010.[103] H. Wang, N. Schor, R. Hu, H. Huang,D. Cohen-Or, and H. Huang. Global-to-local generative model for 3D shapes. In
SIGGRAPH Asia 2018 Technical Papers , page214. ACM, 2018.[104] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu,and Y.-G. Jiang. Pixel2Mesh: Generating3D mesh models from single RGB images.In
Proceedings of the European Conference onComputer Vision (ECCV) , pages 52–67, 2018.[105] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, andX. Tong. O-CNN: Octree-based convolutional
190 Xiao et al. neural networks for 3D shape analysis.
ACMTransactions on Graphics (TOG) , 36(4):72,2017.[106] P.-S. Wang, C.-Y. Sun, Y. Liu, andX. Tong. Adaptive O-CNN: a patch-based deeprepresentation of 3D shapes. In
SIGGRAPHAsia 2018 Technical Papers , page 217. ACM,2018.[107] Y. Wang and J. M. Solomon. Deep closestpoint: Learning representations for point cloudregistration. In
Proceedings of the IEEEInternational Conference on Computer Vision ,pages 3523–3532, 2019.[108] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M.Bronstein, and J. M. Solomon. Dynamicgraph cnn for learning on point clouds.
ACMTransactions on Graphics (TOG) , 38(5):1–12,2019.[109] C. Wen, Y. Zhang, Z. Li, and Y. Fu.Pixel2Mesh++: Multi-view 3D meshgeneration via deformation. In
Proceedingsof the IEEE International Conference onComputer Vision , pages 1042–1051, 2019.[110] J. Wu, C. Zhang, T. Xue, B. Freeman, andJ. Tenenbaum. Learning a probabilistic latentspace of object shapes via 3D generative-adversarial modeling. In
Advances in neuralinformation processing systems , pages 82–90,2016.[111] R. Wu, Y. Zhuang, K. Xu, H. Zhang, andB. Chen. PQ-NET: A generative part seq2seqnetwork for 3D shapes. arXiv preprintarXiv:1911.10949 , 2019.[112] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang,X. Tang, and J. Xiao. 3D ShapeNets: Adeep representation for volumetric shapes.In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages1912–1920, 2015.[113] Z. Wu, X. Wang, D. Lin, D. Lischinski,D. Cohen-Or, and H. Huang. SAGNet:structure-aware generative network for 3D-shape modeling.
ACM Transactions onGraphics (TOG) , 38(4):91, 2019.[114] Y. Xiang, W. Kim, W. Chen, J. Ji,C. Choy, H. Su, R. Mottaghi, L. Guibas,and S. Savarese. ObjectNet3D: A largescale database for 3D object recognition. In
European Conference on Computer Vision , pages 160–176. Springer, 2016.[115] H. Xu, M. Dong, and Z. Zhong. Directionallyconvolutional networks for 3D shapesegmentation. In
Proceedings of the IEEEInternational Conference on Computer Vision ,pages 2698–2707, 2017.[116] Q. Xu, W. Wang, D. Ceylan, R. Mech,and U. Neumann. DISN: Deep implicitsurface network for high-quality single-view 3Dreconstruction. In
NeurIPS , 2019.[117] Y. Yang, C. Feng, Y. Shen, and D. Tian.FoldingNet: Point cloud auto-encoder via deepgrid deformation. In
Proceedings of the IEEEConference on Computer Vision and PatternRecognition , pages 206–215, 2018.[118] W. Yifan, S. Wu, H. Huang, D. Cohen-Or, andO. Sorkine-Hornung. Patch-based progressive3D point set upsampling. In
Proceedings ofthe IEEE Conference on Computer Vision andPattern Recognition , pages 5958–5967, 2019.[119] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or,and P.-A. Heng. PU-Net: Point cloudupsampling network. In
Proceedings of theIEEE Conference on Computer Vision andPattern Recognition , pages 2790–2799, 2018.[120] Y.-J. Yuan, Y.-K. Lai, J. Yang, H. Fu, andL. Gao. Mesh variational autoencoders withedge contraction pooling. arXiv preprintarXiv:1908.02507 , 2019.[121] C. Zou, E. Yumer, J. Yang, D. Ceylan,and D. Hoiem. 3D-PRNN: Generating shapeprimitives with recurrent neural networks.In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 900–909, 2017.
Yun-Peng Xiao received his bachelor's degree in computer science from Nankai University. He is currently a master's student at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer graphics and geometric processing.
Yu-Kun Lai received his bachelor's degree and PhD degree in computer science from Tsinghua University in 2003 and 2008, respectively. He is currently a Reader in the School of Computer Science & Informatics, Cardiff University. His research interests include computer graphics, geometry processing, image processing and computer vision. He is on the editorial boards of Computer Graphics Forum and The Visual Computer.

Fang-Lue Zhang is currently a Lecturer at Victoria University of Wellington, New Zealand. He received the bachelor's degree from Zhejiang University, Hangzhou, China, in 2009, and the doctoral degree from Tsinghua University, Beijing, China, in 2015. His research interests include image and video editing, computer vision, and computer graphics. He is a member of IEEE and ACM. He received the Victoria Early-Career Research Excellence Award in 2019.
Chunpeng Li was born in 1980. He received his PhD degree in 2008 and is now an Associate Professor at the Institute of Computing Technology, Chinese Academy of Sciences. His main research interests are virtual reality, human-computer interaction, and computer graphics.