DNF-Net: a Deep Normal Filtering Network for Mesh Denoising
Xianzhi Li, Ruihui Li, Lei Zhu, Chi-Wing Fu, Member, IEEE, and Pheng-Ann Heng, Senior Member, IEEE

X. Li, R. Li, L. Zhu, C.-W. Fu, and P.-A. Heng are with the Chinese University of Hong Kong. P.-A. Heng is also with the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China. E-mail: {xzli, lirh, lzhu, cwfu, pheng}@cse.cuhk.edu.hk
Fig. 1: Our method is able to denoise meshes of various shapes and noise patterns, while preserving the fine details in the models; see the boxed regions in the above results. The left two models are corrupted by Gaussian noise, while the rightmost one is produced by Kinect v1 scans. Note that the input meshes are provided by [1].
Abstract — This paper presents a deep normal filtering network, called DNF-Net, for mesh denoising. To better capture local geometry, our network processes the mesh in terms of local patches extracted from the mesh. Overall, DNF-Net is an end-to-end network that takes patches of facet normals as inputs and directly outputs the corresponding denoised facet normals of the patches. In this way, we can reconstruct the geometry from the denoised normals with feature preservation. Besides the overall network architecture, our contributions include a novel multi-scale feature embedding unit, a residual learning strategy to remove noise, and a deeply-supervised joint loss function. Compared with the recent data-driven works on mesh denoising, DNF-Net does not require manual input to extract features and better utilizes the training data to enhance its denoising performance. Finally, we present comprehensive experiments to evaluate our method and demonstrate its superiority over the state of the art on both synthetic and real-scanned meshes.
Index Terms — Mesh denoising, normal filtering, deep neural network, data-driven learning, local patches.
1 INTRODUCTION
3D meshes are very common 3D representations widely used in animations and games, as well as in various applications such as virtual and augmented reality, 3D simulations, medical shape analysis, etc. While 3D meshes can be manually created by artists using software tools, the creation process is usually long and tedious. Automatically capturing and reconstructing 3D meshes using scanning has become a viable and efficient solution for preparing 3D meshes. However, raw meshes inevitably contain noise, so mesh denoising is often employed as a post-processing step to remove noise while preserving the fine object details.

Fundamentally, the key difficulty of mesh denoising lies in how to differentiate noise from fine details, which are both high frequency and small in scale [2], [3]. In the literature, lots of efforts have been devoted to denoising meshes. Traditional methods address the problem by introducing various kinds of filter-based models, e.g., bilateral normal filtering [2], [4], [5], tensor voting [6], [7], [8], and non-local low-rank normal filtering [3], [9], or by assuming some kinds of priors, e.g., L0 minimization [10], L1-norm sparsity [11], and L0 sparse regularization [12]. However, a noisy mesh may contain a variety of irregular structures that are corrupted by noise of different patterns. Hence, making use of a particular filter or prior assumption to denoise meshes may not always produce satisfactory results. Also, users often have to carefully fine-tune various model parameters in these methods for denoising different input meshes.

To circumvent these limitations, researchers began to explore data-driven methods [1], [9]. The basic idea of these methods is to regress functions that map noisy inputs to the ground-truth counterparts. Although these pioneering methods are already data-driven, they still rely on manual inputs to extract features. Hence, the valuable information available in the training data may not be fully exhausted.

Unlike existing methods, we introduce a novel deep normal filtering network, called DNF-Net, for mesh denoising.
Fig. 2: Comparing the performance of various methods (b)-(i) on denoising (a) an input noisy mesh scanned by Microsoft Kinect v1: BNF [2], LM [10], GNF [5], CNR [1], NLLR [3], GSP [13], PcFilter [9], and our method, respectively. (j) The ground truth. We show also the associated normal error maps, where the colors reveal the angular difference between the corresponding normal vectors in the ground truth and denoised meshes. Both the visual comparisons (compare the surface region marked by the red arrow above) and the θ values (mean angular difference; see Section 4) show the superiority of our method over the others.

Given a noisy mesh with corrupted facet normals, our DNF-Net is able to robustly generate a corresponding denoised facet normal field, which is then employed to reconstruct the denoised mesh, while preserving the fine details, such as the sharp edges and corners, in the input mesh.

The key contribution in our method is a deep neural network framework that learns to filter normal vectors on meshes without requiring explicit information about the underlying surface or the noise characteristics. To learn the local geometry patterns, our network processes the mesh in the form of patches on the mesh surface. Particularly, to facilitate the network learning, we design the multi-scale feature embedding unit to extract the normal feature map, and the residual learning unit to regress the features of the noise per patch. Also, we drive DNF-Net to learn by formulating the deeply-supervised joint loss function, which consists of a normal recovery loss and a residual regularization loss.

We performed several experiments to qualitatively and quantitatively evaluate DNF-Net. Results show that DNF-Net is able to handle meshes of various shapes and noise patterns and produce high-quality denoised results for both synthetic and real-scanned noisy inputs in terms of denoising quality and feature preservation; see Figure 1 for results produced by our method on various noisy meshes. Figure 2 further shows a comparison example on a real-scanned noisy mesh, demonstrating the strong capability of DNF-Net to remove such severe noise, as compared with various state-of-the-art methods; please see Section 4 for more experiments and comparison results.

2 RELATED WORK
Early methods [14], [15], [16], [17] denoise meshes by formulating local isotropic filters to remove noise and solving the volume shrinkage problem caused by denoising. Since the filter weights remain unchanged for varying surface characteristics, the denoised results are often overly smoothed. Hence, various anisotropic techniques were proposed: Fleishman et al. [18] and Jones et al. [19] extended the bilateral filtering technique in image denoising to mesh denoising to directly filter the vertex positions. Later, observing that normal information can better capture the underlying surface characteristics, techniques based on bilateral normal filtering [2], [4], [5] were introduced to first filter facet normals and then adjust vertex coordinates accordingly. Very recently, Li et al. [3] and Wei et al. [9] independently developed a non-local low-rank scheme to filter the normal field, where promising results were obtained.

Besides bilateral filtering and normal filtering, some works explore the notion of voting on the surface tensors to guide the mesh denoising process with feature preservation [6], [7], [8]. Some other works formulate the mesh denoising problem as a global optimization and recover the meshes that best fit both the inputs and some pre-defined constraints or priors, e.g., He et al. [10] explored L0 minimization; Lu et al. [11] explored L1-norm sparsity; and Zhao et al. [12] explored L0 sparse regularization. However, these methods rely on priors of the noise distribution. Recently, Arvanitis et al. [13] proposed a novel coarse-to-fine graph spectral processing approach for mesh denoising. Although the above methods work well for different noisy inputs, users have to specifically fine-tune various parameters to obtain satisfactory results for meshes of different geometry features and noise levels.
Fig. 3: Illustration of our network training pipeline. Given a noisy mesh and its corresponding ground truth, we first crop local patches with N faces. Then, for each noisy patch, we prepare a matrix of facet normals N_i and a matrix of local neighbor indices I_i as the network inputs. Also, we extract the corresponding ground-truth facet normals N_i^G and train the network to learn to directly output the denoised facet normals Ñ_i, which are supervised by N_i^G.

There has been increasing attention on exploiting data-driven methods to denoise meshes. Wang et al. [1] presented a pioneering work called cascaded normal regression (CNR) to learn the mapping from noisy inputs to the ground-truth counterparts. Considering that the learning process may lose some fine details, Wang et al. [20] further developed a two-step data-driven method with the second step to enhance the recovery of the geometric details.

Although these data-driven approaches are able to learn the denoising pattern to a certain extent without specific assumptions on the underlying geometry features and noise patterns, they still need manually-extracted geometry descriptors from the noisy inputs, e.g., the filtered facet normal descriptor [1], without fully exploiting deep neural networks to automatically learn and extract features. Hence, the information provided in the training data may not be fully exhausted. To the best of our knowledge, this paper presents the first work that formulates a deep neural network to denoise meshes by filtering the raw facet normals.
Driven by the success of deep learning in diverse computer vision, graphics, and image processing tasks, researchers in 3D geometry processing have started to explore deep neural networks for 3D model processing. However, unlike 2D images with regular pixel grid structures, 3D models suffer from the property of irregular connectivity. Hence, early works explored the transformation of the input 3D models to grid structures, e.g., volume representation [21], [22], [23], [24], [25], depth map [26], multi-view images [23], [27], etc., so that we can apply deep convolutional neural networks (CNNs) to directly process the data.

On the other hand, there have been extensive studies on directly taking deep neural networks to process 3D (irregular) point clouds. PointNet [28] and PointNet++ [29] are two pioneering works that consume point clouds directly as the network input. Subsequently, more network architectures were designed for handling various tasks on 3D point clouds, including object recognition [30], [31], [32], unsupervised feature learning [33], [34], upsampling [35], [36], completion [37], instance segmentation [38], etc.
Compared to point clouds, 3D meshes contain vertex connectivity in addition to point/vertex coordinates. Hence, only a few works process 3D meshes using neural networks. Some of them focus on generating meshes from single images [39], [40] or from incomplete range scans [41], [42]. Ranjan et al. [43] introduced a mesh autoencoder to generate 3D human faces. Several studies on 3D shape representation directly operate on mesh data. Hanocka et al. [44] designed MeshCNN, a neural network that performs task-driven mesh simplification based on the edges between the mesh vertices. Feng et al. [45] proposed MeshNet to learn from polygon faces for 3D shape classification and retrieval. Yi et al. [46] and Kostrikov et al. [47] exploited the differential geometry properties of manifolds through graph neural networks and their spectral extensions. Different from these works, we design our DNF-Net to directly process normal vectors in local patches extracted on the mesh surface. We do not assume a grid structure nor resample the normals into a grid; our network directly processes the normal vectors, as well as the local triangle connectivity information, on each patch as inputs and outputs the denoised normal vectors.
3 METHOD
3.1 Overview

Given a noisy triangular 3D mesh M = (V, F) with vertex set V and face set F, our goal is to produce a denoised mesh M̃ = (Ṽ, F) from M with updated vertex set Ṽ. Compared with vertex positions, first-order normal variations are known to better capture the local surface variations [48]. Therefore, we take a normal filtering approach [1], [2], [3], [5], [9], [20] to formulate our mesh denoising method. Distinctively, we design a deep neural network, called the deep normal filtering network (or DNF-Net), to learn to map the noisy facet normal vectors to noise-free ground-truth facet normal vectors on meshes. In the course of formulating this network, we have the following considerations:
Fig. 4: The overall network architecture of DNF-Net. We feed the patch facet normals N_i and corresponding neighbor indices I_i as network inputs to extract feature map F_i (with C channels). Our DNF-Net removes noise by first learning the noise residual ΔF_i from F_i, then obtains the denoised feature map F̃_i by subtracting the residual. The process is repeated to improve the noise removal. Lastly, we regress the output patch facet normals Ñ_i from the denoised feature map F̃_i.

• First, since mesh denoising is a low-level task, the network should focus on learning the local geometry. Hence, we propose to crop patches on the object surface and process facet normals per patch in the network; see the left part in Figure 3.

• Second, to enhance the generality of the network, it should abstract local spatial patterns instead of just encoding each facet normal individually. Also, the network should produce the same results regardless of the order of the normals in its input; see N_i in Figure 3. Hence, based on the mesh connectivity, we extract indices of the K-nearest faces per face in the patch as network inputs, so the network can locate the face neighbors for further feature embedding; see the input-packing sketch below.

• Lastly, for efficiency concerns, our network is end-to-end to directly output denoised facet normals, and we supervise the network training with corresponding facet normals from the ground-truth meshes.

In the following subsections, we first elaborate on the patch preparation procedure (Section 3.2). We then introduce the architecture of DNF-Net and the loss function in the network training (Sections 3.3 & 3.4). Lastly, we give details on the method implementation (Section 3.5).
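To make the two network inputs concrete, the sketch below shows one way to pack N_i and I_i for a cropped patch. The function name and the use of Euclidean distances between face centroids are illustrative simplifications; as described above and detailed in Section 3.2, the actual neighbor search is based on the mesh connectivity and geodesic distances.

```python
import numpy as np

def build_patch_inputs(face_normals, face_centroids, K=50):
    """Pack the per-patch network inputs N_i and I_i (simplified sketch).

    face_normals:   (N, 3) unit facet normals of the N faces in a cropped patch.
    face_centroids: (N, 3) face centroids, used here as a stand-in for the
                    connectivity/geodesic-based neighbor search in the paper.
    Returns N_i of shape (N, 3) and I_i of shape (N, K), where each row of I_i
    stores the row indices (in N_i) of that face's K nearest faces.
    """
    N_i = face_normals.astype(np.float32)                         # (N, 3)
    # Pairwise squared Euclidean distances between face centroids.
    d2 = ((face_centroids[:, None, :] - face_centroids[None, :, :]) ** 2).sum(-1)
    # For each face, keep the indices of its K nearest faces (itself included).
    I_i = np.argsort(d2, axis=1)[:, :K].astype(np.int32)          # (N, K)
    return N_i, I_i
```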
3.2 Patch Preparation

Given a pair of meshes, a noisy mesh M and its corresponding ground truth M^G, as inputs, there are three steps to prepare the training patches from them. First, we locate a set of faces on M as seeds to generate patches. To randomly select seed faces such that the resulting patches exhibit more diverse surface patterns, we calculate the one-ring facet normal variation around each face and randomly pick P seed faces by an anisotropic sampling based on the normal variance.

Second, from each seed face, we grow a patch by finding the N−1 nearby faces on M with the shortest geodesic distances from the seed. Specifically, to compute the geodesic distance between the seed face and a nearby face, we try each of its three vertices as the start point, find the shortest geodesic distance to each vertex of the nearby face, then take the smallest distance among the nine distances as the geodesic distance from the seed face to that nearby face. Here, we use the heat method in [49]. Lastly, we sort all the distances among the surrounding (nearby) faces, and select the N−1 nearest ones. Hence, we can produce patches (with N faces) that are more regular in shape for training.

Lastly, we pack the N facet normals on each patch as the patch normal matrix N_i ∈ R^{N×3}. Also, we take advantage of the mesh connectivity and prepare the patch index matrix I_i ∈ R^{N×K}, where each row represents a face on the patch and stores the indices (row indices in N_i) of the K-nearest faces to that face on the patch. Then, we feed N_i and I_i as inputs to the network. Further, for each patch formed on M, we follow the same procedure to form a patch from the corresponding seed face on M^G and extract the corresponding N facet normals to form matrix N_i^G as N_i's ground truth to supervise the network; see Figure 3.

3.3 Network Architecture

Figure 4 shows the overall architecture of DNF-Net. Taking N_i and I_i as inputs, DNF-Net first employs the multi-scale feature embedding unit (Section 3.3.1) to extract the normal feature map F_i. Since F_i contains noise, we model it as F_i = F̃_i + ΔF_i, where F̃_i is the cleaned noise-free normal feature map and ΔF_i is the feature map of the noise.

Considering that the underlying noise-free surface is usually more diverse compared with the noise patterns, encoding ΔF_i is more effective than directly encoding F̃_i. Hence, we feed F_i to a residual learning unit (Section 3.3.2) to first extract the residual ΔF_i from F_i; the cleaned feature map F̃_i is then recast as F_i − ΔF_i. To enhance the outputs, we cascade the residual-learning-and-subtraction process in a progressive manner, so besides the final cleaned feature map F̃_i, we also obtain an intermediate cleaned feature map F̃_i′ in the middle of the network; see Figure 4. Importantly, instead of only supervising the final output Ñ_i regressed from F̃_i, we also regress an intermediate output Ñ_i′ from F̃_i′ to give direct supervision when training the hidden layers in the network (Section 3.4).

3.3.1 Multi-scale Feature Embedding Unit

For a comprehensive geometric understanding of a mesh structure, say locally around a vertex in the mesh, a general approach is to do a multi-scale analysis around the vertex, so that we can extract geometric features for different spatial scales. Particularly, the geometric structures usually vary over scales.
Hence, given an input patch with N facet normals, we formulate a multi-scale feature embedding unit of three levels to harvest geometric features of different scales. Specifically, the purpose of this unit is to extract the normal feature map F_i ∈ R^{N×C} from N_i and I_i, where C is the number of channels and N is the number of faces (normals) on the input patch. To do so, we build a three-level architecture to learn F_i by progressively enlarging the contextual scales; see Figure 5 for the detailed illustration.

Fig. 5: The three-level architecture of the multi-scale feature embedding unit.

Fig. 6: The normal grouping layer (left) and feature grouping layer (right) pack relevant local data (normals or features) for the feature extraction layer (see Figure 5) to process.

Specifically, in the first level of the multi-scale feature embedding unit, we design the normal grouping layer and the feature extraction layer to generate an embedded feature map E_i ∈ R^{N×C}, given inputs N_i and I_i. In short, the normal grouping layer packs, for each face in the patch, the facet normals of its k1 (k1 ≤ K) nearest faces according to the index matrix I_i, and the feature extraction layer then abstracts the grouped normals into a per-face feature vector.
Fig. 7: The architecture of the residual learning unit.

In addition, we employ a channel attention module to fuse the features among the different channels. The basic idea of this module is to learn the channel weights from the grouped normal or feature vectors via MLPs, then use the weights to adjust the importance of each channel. For more details of the channel attention module, readers may refer to [50].
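As a concrete illustration, the sketch below implements one grouping-and-extraction level: for each face, it gathers the normals of its k nearest faces via I_i and abstracts them into a per-face feature with a shared point-wise MLP and max pooling. The relative encoding (neighbor minus center) and the omission of the channel attention module are simplifying assumptions, not the exact layer design of DNF-Net.

```python
import tensorflow as tf

def embedding_level(normals, index, k, mlp):
    """One normal-grouping + feature-extraction level (simplified sketch).

    normals: (B, N, 3) patch facet normals N_i
    index:   (B, N, K) per-face neighbor indices I_i
    k:       number of neighbors used at this level (k <= K)
    mlp:     a shared point-wise network, e.g., tf.keras.Sequential of Dense layers
    Returns a per-face embedded feature map of shape (B, N, C).
    """
    grouped = tf.gather(normals, index[:, :, :k], batch_dims=1)    # (B, N, k, 3)
    center = tf.expand_dims(normals, 2)                            # (B, N, 1, 3)
    # Encode each neighbor relative to the face's own normal, keep the normal too.
    x = tf.concat([grouped - center, tf.tile(center, [1, 1, k, 1])], axis=-1)
    return tf.reduce_max(mlp(x), axis=2)                           # (B, N, C)
```

Running this three times with increasing neighborhood sizes (k1, k2, k3 in Section 3.5) and fusing the per-level outputs would give a multi-scale feature map in the spirit of Figure 5; how the levels are actually fused in DNF-Net follows Figure 5 rather than this sketch.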
3.3.2 Residual Learning Unit

After presenting the multi-scale feature embedding unit, we now go back to the overall architecture (see Figure 4) and present the residual learning unit. This unit extracts the noise feature ΔF_i from the normal feature F_i, so that we can later obtain the denoised feature map F̃_i by F_i − ΔF_i.

To better extract features for denoising, we should encode features over a local neighborhood rather than just as an individual feature vector. Hence, like the feature grouping layer, this unit also finds the k-most similar feature vectors for each of the N feature vectors in F_i. However, it employs KNN to locate similar feature vectors in the feature space instead of using the index matrix I_i [32]; see Figure 7. The reason behind this is that, in the multi-scale feature embedding unit, our goal is to extract representative features to encode the local context, so using I_i enables us to locate features that are geodesically nearby. In this residual learning unit, however, we need to extract residual features from the input feature map F_i, so we employ a KNN search and extract features by considering feature similarity in the feature space. In our experimental settings, as suggested by [32], we set k = 20 in this unit. As shown in Figure 7, after the concatenation between the duplicated (k copies) input feature vectors (F_i) and their associated k-most similar feature vectors, we use two MLPs followed by a max pooling to get the residual feature map ΔF_i of size N × C.

Also, as shown in Figure 4, we use two consecutive residual learning units to progressively remove noise and to improve the overall denoising performance. For an experiment that explores the effect of using a different number of residual learning units, please refer to Section 4.3.
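Below is a sketch of one residual learning unit under the description above: feature-space KNN, duplication and concatenation, two shared MLPs, and max pooling. The distance computation and the MLP configuration (passed in as mlp1 and mlp2) are illustrative assumptions.

```python
import tensorflow as tf

def residual_learning_unit(F, mlp1, mlp2, k=20):
    """Sketch of one residual learning unit (feature-space KNN, as in [32]).

    F: (B, N, C) normal feature map F_i; mlp2 is assumed to output C channels.
    Returns the residual (noise) feature map, so the denoised features are
    F minus this output.
    """
    # Pairwise squared distances between feature vectors (feature-space KNN).
    sq = tf.reduce_sum(F * F, axis=-1, keepdims=True)                  # (B, N, 1)
    d2 = sq - 2.0 * tf.matmul(F, F, transpose_b=True) + tf.linalg.matrix_transpose(sq)
    _, idx = tf.math.top_k(-d2, k=k)                                   # (B, N, k)
    neighbors = tf.gather(F, idx, batch_dims=1)                        # (B, N, k, C)
    center = tf.tile(tf.expand_dims(F, 2), [1, 1, k, 1])               # (B, N, k, C)
    x = tf.concat([center, neighbors], axis=-1)                        # (B, N, k, 2C)
    return tf.reduce_max(mlp2(mlp1(x)), axis=2)                        # (B, N, C)
```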
3.4 Loss Function

We design a deeply-supervised joint loss function with two terms to train the proposed network in an end-to-end manner: (i) a deeply-supervised normal recovery loss and (ii) a residual regularization loss.

Deeply-supervised normal recovery loss. To encourage the denoised facet normals Ñ_i to be consistent with the ground-truth normals N_i^G, we use an L2 norm to minimize the difference between Ñ_i and N_i^G. However, considering the deepness of our network, we further add another feedback, or supervision, on the companion intermediate output Ñ_i′ (see the middle part in Figure 4), by applying an L2 norm to also minimize the difference between Ñ_i′ and N_i^G. So, the deeply-supervised normal recovery loss is expressed as

L_deep = (1/N_p) Σ_{i=1}^{N_p} ( ‖N_i^G − Ñ_i‖ + ‖N_i^G − Ñ_i′‖ ),    (1)

where N_p is the total number of training patches. By doing so, we can directly influence the parameters in the hidden layers to enhance the quality of the feature maps.
Residual regularization loss. As shown in Figure 4, DNF-Net progressively learns the residual features ΔF_i′ and ΔF_i. Theoretically, these residual features should only be a small portion of F_i; that is, the magnitudes of ΔF_i′ and ΔF_i should not be too large. Hence, we formulate the residual regularization loss on ΔF_i′ and ΔF_i as

L_residual = (1/N_p) Σ_{i=1}^{N_p} ( ‖ΔF_i′‖ + ‖ΔF_i‖ ).    (2)

Joint loss. Overall, we formulate the deeply-supervised joint loss function as a combination of Eqs. (1) and (2):

L = L_deep + α L_residual,    (3)

where α is a weight to balance the relative importance of the two loss terms, and we empirically set it to 0.5.
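A minimal sketch of this joint loss is given below, assuming L2 norms over the per-face differences and a mean over faces and patches in a mini-batch as the reduction; Eqs. (1)–(2) are written as averages over patches, so the exact reduction may differ slightly.

```python
import tensorflow as tf

def dnf_net_loss(N_gt, N_final, N_inter, dF1, dF2, alpha=0.5):
    """Deeply-supervised joint loss (Eqs. (1)-(3)), sketched with L2 norms.

    N_gt:    (B, N, 3) ground-truth facet normals N_i^G
    N_final: (B, N, 3) final denoised normals
    N_inter: (B, N, 3) intermediate denoised normals (after the first residual unit)
    dF1/dF2: (B, N, C) residual feature maps of the two residual learning units
    """
    l_deep = tf.reduce_mean(tf.norm(N_gt - N_final, axis=-1)
                            + tf.norm(N_gt - N_inter, axis=-1))        # Eq. (1)
    l_residual = tf.reduce_mean(tf.norm(dF1, axis=-1)
                                + tf.norm(dF2, axis=-1))               # Eq. (2)
    return l_deep + alpha * l_residual                                 # Eq. (3)
```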
3.5 Implementation Details

Datasets. In our experiments, we use the two benchmark datasets kindly provided by [1] to train our network: (i) a synthetic dataset and (ii) a real-scanned dataset. The synthetic dataset contains 21 training models and 30 testing models, including CAD models, smooth models, and models with rich fine details. For each model, the dataset provides a noise-free mesh as ground truth and three noisy meshes. These noisy meshes are generated by adding three different magnitudes of Gaussian noise to the noise-free mesh; the standard deviations of the noise in these meshes are 0.1 l̄_e, 0.2 l̄_e, and 0.3 l̄_e, where l̄_e is the average edge length in the mesh. On the other hand, the real-scanned dataset has 146 training noisy meshes and 149 testing noisy meshes. Each noisy mesh is accompanied by a clean mesh as the ground truth. For more details, readers may refer to [1].

Network training. Observing that the synthetic and real-scanned datasets have very different noise distributions, we follow the existing data-driven method, i.e., CNR [1], to train our network separately on each dataset for obtaining our results on synthetic meshes and real-scanned meshes. We plan to explore knowledge distillation techniques in the future to combine the two trained network models.

For each training mesh in the synthetic dataset, we crop P = 100 patches, each with N = 800 faces, such that a patch roughly covers 5% to 10% of the whole mesh. For each face in a patch, we empirically store K = 50 neighboring face indices, and set k1 = 10, k2 = 30, and k3 = 50 in the multi-scale feature embedding unit (see Section 3.3.1) of our network.
Fig. 8: Comparing the mesh denoising results produced using different methods (b)-(g) on the synthetic noisy models shown in the leftmost column; these inputs have different Gaussian noise levels: 0.1 l̄_e, 0.2 l̄_e, and 0.3 l̄_e (top to bottom).

For each training mesh in the real-scanned dataset, we crop more patches with P = 200, due to the complex noise distributions, and each patch has N = 800 faces. Considering that most of the real-scanned meshes have simple and smooth structures, we store K = 150 neighbor facet indices for each face and set k1 = 50, k2 = 100, and k3 = 150. Please see the supplementary material for the effect of different parameter settings on the real-scanned dataset.

To avoid over-fitting in network training, we augmented N_i with random rotation and jittering. We implemented our method using TensorFlow and adopted the Adam optimizer with a mini-batch size of 10 and a learning rate of 0.001, and trained the network for 400 epochs. Our trained network model, data, and code can be found on the GitHub project page: https://github.com/nini-lxz/DNF-Net.
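As an illustration of this augmentation step, the sketch below applies one random rotation to all normals of a patch and adds a small jitter before renormalizing; the jitter magnitude and the use of a single shared rotation per patch are assumptions, not the exact settings used in training.

```python
import numpy as np

def augment_patch_normals(N_i, jitter_sigma=0.01):
    """Random rotation + jittering of a patch's facet normals (sketch).

    N_i: (N, 3) facet normals; jitter_sigma is an assumed magnitude.
    """
    # Random rotation about a random axis (Rodrigues' formula).
    axis = np.random.randn(3)
    axis /= np.linalg.norm(axis)
    angle = np.random.uniform(0.0, 2.0 * np.pi)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    out = N_i @ R.T
    # Small Gaussian jitter, then renormalize to unit length.
    out += np.random.normal(scale=jitter_sigma, size=out.shape)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```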
Network inference. Given a test mesh, we crop patches on the mesh following the procedure of preparing training patches, then employ the trained network to produce denoised normals on the patches. After that, we integrate the denoised normals on all patches to compute facet normals over the mesh, and follow [3] to pass the restored facet normals to the iterative vertex updating method [51] to update vertex positions and produce the denoised mesh.
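For completeness, a sketch of this vertex-update step is given below. It follows the commonly-used update that moves each vertex along the incident face normals toward the face planes; the exact iteration scheme of [51] may differ in details, and the number of iterations here is an assumption.

```python
import numpy as np

def update_vertices(V, F, face_normals, n_iter=20):
    """Iterative vertex update from filtered facet normals, in the spirit of [51].

    V: (num_v, 3) vertices; F: (num_f, 3) triangle vertex indices;
    face_normals: (num_f, 3) denoised unit facet normals.
    This is a common variant, not necessarily the exact scheme of [51].
    """
    V = V.copy()
    for _ in range(n_iter):
        centroids = V[F].mean(axis=1)                      # (num_f, 3)
        disp = np.zeros_like(V)
        count = np.zeros((V.shape[0], 1))
        for f, (i, j, k) in enumerate(F):
            n = face_normals[f]
            for v in (i, j, k):
                # Move the vertex along n by its distance to the face plane.
                disp[v] += n * np.dot(n, centroids[f] - V[v])
                count[v] += 1
        V += disp / np.maximum(count, 1)
    return V
```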
4 RESULTS AND DISCUSSIONS
To demonstrate the effectiveness of our method, we compare it with several state-of-the-art methods, including L0 minimization (LM) [10], guided normal filtering (GNF) [5], cascaded normal regression (CNR) [1], non-local low-rank normal filtering (NLLR) [3], the graph spectral processing approach (GSP) [13], and patch normal co-filter (PcFilter) [9]. For LM, GNF, and NLLR, we obtained their publicly-released codes and fine-tuned their model parameters with best effort to produce their denoising results; see our supplementary material for the details of the employed parameter values. For CNR, we directly employed their released trained models to generate their results, while for GSP and PcFilter, we obtained their results directly from the authors.

Following the recent works [1], [3], we also employed the mean angular difference metric (denoted as θ) to quantitatively evaluate and compare the results produced by the various methods. By definition, θ is the mean angular difference (in degrees) between the corresponding facet normals in the ground truth and denoised meshes. Hence, a smaller θ value indicates a better denoising result. Note that θ is calculated on each denoised mesh after the vertex update, and all methods being compared (including our method) use the same vertex update algorithm [51].
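For reference, θ can be computed as below from the facet normals of the denoised and ground-truth meshes, assuming they are in one-to-one correspondence (identical face sets).

```python
import numpy as np

def mean_angular_difference(normals_denoised, normals_gt):
    """Mean angular difference theta (in degrees) between corresponding
    facet normals of the denoised mesh and the ground truth."""
    n1 = normals_denoised / np.linalg.norm(normals_denoised, axis=1, keepdims=True)
    n2 = normals_gt / np.linalg.norm(normals_gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(n1 * n2, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```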
First, we compare our method with the state-of-the-art methods on the test models provided in the synthetic dataset of [1]. Figure 8 shows the visual comparisons on three noisy meshes with different amounts of Gaussian noise. Comparing the results produced by our method (g) and others (b-f) with the ground truths (h), we can see that the other methods tend to over-smooth the fine details, over-sharpen the edges, or retain excessive noise in the results. Our method is able to preserve more geometric details and recover the sharp edges, while effectively removing the noise; see particularly the blown-up views in Figure 8. See also the θ value below each generated result. Overall, our method achieved the smallest θ values for all models compared with all the other methods; see Part 1 of the supplementary material for more comparison results. Besides, for each denoised mesh shown in Figure 8, we visualize its normal error distribution; please refer to Part 9 of our supplementary material.

Fig. 9: Comparing the scores obtained by different methods over 30 test models (sorted in descending order of score).

Further, we expand the quantitative comparison by considering all the 30 test models in the dataset. Note that, for each test model, [1] provides three noisy meshes with different Gaussian noise levels. Here, we define θ̄_ij as the averaged θ value achieved by method i over the three noisy meshes of the j-th test model. Also, we define the normalized score score_ij ∈ [0, 1] achieved by method i on the j-th test model as

score_ij = 1 − (θ̄_ij − min_i θ̄_ij) / (max_i θ̄_ij − min_i θ̄_ij),    (4)

so a value of one indicates the best result among the methods. Figure 9 plots the scores achieved by each method over the 30 test models. Note that, to more clearly reveal the results, the 30 score values over the test models per method are sorted in descending order when producing each plot. Also, we consider only LM [10], GNF [5], CNR [1], and NLLR [3], since we have obtained their code or denoising results for all the test models in the dataset. From the plots, we can see that our method obtains the best results (scores of one) for most test models, while the other methods start to drop much earlier. More importantly, the test models have various geometric structures, including simple and smooth surfaces, sharp edges, and highly-detailed fine structures. The reason behind the success of our method is that formulating a deep neural network allows us to more exhaustively extract discriminative features from the data to differentiate the underlying details and noise in the input meshes, while the other methods heavily rely on hand-crafted features from assumptions or priors on the input models.
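The sketch below computes these normalized scores from a methods-by-models array of θ̄ values; the array layout is an illustrative assumption.

```python
import numpy as np

def normalized_scores(theta_bar):
    """Per-model normalized score of Eq. (4).

    theta_bar: (num_methods, num_models) array, where theta_bar[i, j] is
    method i's theta averaged over the three noisy versions of test model j.
    Returns scores in [0, 1]; 1 means the best method on that model.
    """
    t_min = theta_bar.min(axis=0, keepdims=True)
    t_max = theta_bar.max(axis=0, keepdims=True)
    return 1.0 - (theta_bar - t_min) / (t_max - t_min)
```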
Next, we compare our method with others on real-scanned models. Besides the results shown earlier in Figure 2, we further show in Figure 10 more visual comparison results on three other Kinect real-scanned models [1]. From the noisy input models in the leftmost column of the figure, we can see that Kinect scanning produces severe and irregular noise. Such a noise pattern differs from that of the Gaussian noise. Comparing the results produced by the various methods, the other methods tend to retain noise in their results, or fail to smooth flat surfaces. On the contrary, our method is able to produce denoised models that are smoother and closer to the ground truths, as verified again by the smallest θ values achieved by our method on the models. Additionally, please refer to Parts 2 and 9 of the supplementary material for more real-scanned comparison results and for the normal error visualizations on the denoised meshes shown in Figure 10, respectively.

TABLE 1: Comparing the denoising quality (θ) of our full pipeline with the ablated cases (w/o inter, w/o reg, w/o sub, one residual learning unit, and three residual learning units) on the Block, Cube, Sphere, Carter100K, and Eros100K models.

To evaluate the effectiveness of the major components in our method, we conducted an ablation study by simplifying DNF-Net in the following four cases:

(i) w/o inter: we remove the supervision on the intermediate output Ñ_i′ from the first residual learning unit in the network (see Figure 4), and supervise only the final output Ñ_i, i.e., we remove the second term in Eq. (1).

(ii) w/o reg: we remove the residual regularization loss term (i.e., Eq. (2)) from the total loss of our network.

(iii) w/o sub: in the full pipeline, we compute a residual in step 3 of the feature grouping layer (Figure 6) by a subtraction operation before concatenating two feature volumes; such an operation helps capture the local information for denoising. To verify its effectiveness, we remove it and directly concatenate the two feature volumes.

(iv) 1 or 3 res units: as shown in Figure 4, our network cascades two residual learning units successively in the overall network architecture. Instead of deploying two residual learning units, we tried one and three units to explore the effect of the number of residual learning units on the network performance. Note that, for fair comparison, we modified both Eqs. (1) & (2) to supervise the output of every residual learning unit in the network.

Specifically, we re-trained the network model separately for each case using the same training dataset of synthetic models (see Section 3.5) and tested each network on five test models, i.e., Block, Cube, Sphere, Carter100K, and Eros100K, which contain sharp edges, simple structures, and fine details. As mentioned earlier, each provided model has three versions of noisy meshes. Hence, similar to Section 4.1, we
test each network on all the three noisy meshes per model, then compute the averaged θ value achieved over the three noisy meshes as the overall θ value of the model.

Table 1 shows the results. By comparing the first three cases (w/o inter, w/o reg, and w/o sub) with our full pipeline, we can see that each term in our loss function contributes to the mesh denoising performance, including the supervision on the intermediate output and the residual regularization loss term. Besides, the subtraction operation in the feature grouping layer also contributes to improving the overall performance. Further, the network with only one residual learning unit achieved a worse result than our full pipeline with two residual learning units. If we increase the number of residual learning units to three, the performance improves only slightly on two models, but the number of network parameters increases from 0.36M to 0.44M. Hence, we decided to deploy two residual learning units in our network to balance performance and efficiency.

Fig. 10: Comparing the mesh denoising results produced using different methods (b)-(g) on real-scanned noisy models.

Fig. 11: Denoising performance (θ) of different methods on the same test model in different resolutions. Note that we start with the Eros model with 100K faces (corrupted by Gaussian noise) and use the quadric edge collapse decimation method in MeshLab [52] to progressively simplify the mesh model.

Next, we explore the robustness of DNF-Net by considering (i) varying mesh resolution (i.e., different numbers of faces in the same model); (ii) irregular mesh triangulation; (iii) unseen noise patterns; and (iv) varying noise intensities.
Robustness on mesh resolution.
Most existing methods are sensitive to the mesh resolution, where the error metric θ could fluctuate greatly for low-resolution meshes. This is because these methods filter normals (or vertices) using a local neighborhood of a fixed number of rings, e.g., a one- or two-ring neighborhood. Hence, when given a low-resolution mesh, they could involve a too-large neighborhood in the filtering, thus leading to over-smoothing [5]. Thanks to the patch-based training strategy, our DNF-Net is trained for various sizes of local neighborhoods on different models. Hence, it can be less sensitive to the mesh resolution. To explore this, we employed noisy Eros models of different resolutions as the inputs and plotted the θ values
of the denoised results produced by various methods in Figure 11. Here, we directly used our trained network to test all the inputs without re-training or fine-tuning the network. From the plots in Figure 11, we can see that our method (orange plot) consistently has the lowest θ values vs. all the others for various mesh resolutions. Particularly, when the number of faces is reduced to less than 50K, the gaps of θ values between our method and the others gradually increase. For example, when the number of faces is reduced from 50K down to 1562, the increase in θ value for our method is 6.57 vs. 9.58 for the second-best one. This shows that DNF-Net is not as sensitive as the others to the mesh resolution.

Fig. 12: Comparing the denoising results produced by various methods (b)-(d) on two low-resolution noisy inputs (a) of the same model. Note that the input meshes were corrupted by Gaussian noise.

Further, Figure 12 shows the visual comparisons on two example low-resolution Eros models with 6250 and 3125 faces. Comparing the results in (b), (c), and (d), we can see that the other two methods overly smooth out the details, while our method can better preserve the fine details; see the areas around the eyes and mouth in the results. Also, it achieves the lowest θ values for both resolutions.

Robustness on irregular triangulation.
Next, we test the robustness of our method on handling noisy meshes with irregular triangulation. Figure 14 (a) shows two example meshes with elongated triangles and irregular vertex degrees; see particularly the blown-up views of the nose regions in the figure. Still, the results produced by our method (b) are quite close to the ground truths (c). Note that some comparison results with other methods are also presented in Part 8 of the supplementary material.
Robustness on noise patterns.
Further, we tested the generalization ability of our network on handling 3D meshes with noise patterns different from the noise pattern employed in the training set. To do so, we directly employed our trained network to denoise meshes corrupted by impulsive noise and uniform noise (by replacing the Gaussian noise we employed with uniformly-distributed noise). Figure 15 shows comparisons on two 3D meshes corrupted with impulsive noise (top) and uniform noise (bottom). From the normal error visualizations and the θ values presented in the figure, we can see that our network has a superior performance for both input meshes, even though our network was trained only on Gaussian noisy meshes. Please refer to supplementary material Part 6 for more results on noise patterns.

Note that Gaussian noise, uniform noise, and impulsive noise are all synthetic additive noises. Hence, the differences between their distributions may not be very large (compared to real-scanned noise). As a result, our DNF-Net model trained only on Gaussian noise can still generalize well to handle other noise patterns.
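For reference, the sketch below corrupts a clean mesh with the different synthetic noise patterns by displacing vertices along their normals; the displacement direction, the impulsive-noise definition, and the default parameters are assumptions rather than the exact settings used to generate the test meshes.

```python
import numpy as np

def add_synthetic_noise(V, vertex_normals, avg_edge_len, level=0.2,
                        kind="uniform", impulse_ratio=0.1):
    """Corrupt mesh vertices with synthetic noise along vertex normals (sketch).

    The noise magnitude is expressed relative to the average edge length, as in
    the benchmark of [1]; the parameters here are illustrative assumptions.
    """
    num_v = V.shape[0]
    if kind == "uniform":
        mag = np.random.uniform(-level, level, size=(num_v, 1)) * avg_edge_len
    elif kind == "impulsive":
        # Only a random subset of vertices receives offsets.
        mag = np.zeros((num_v, 1))
        picked = np.random.rand(num_v) < impulse_ratio
        mag[picked] = np.random.normal(scale=level * avg_edge_len,
                                       size=(picked.sum(), 1))
    else:  # Gaussian, as used for the training data
        mag = np.random.normal(scale=level * avg_edge_len, size=(num_v, 1))
    return V + mag * vertex_normals
```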
Robustness on noise intensities. Lastly, we tested our network on 3D meshes corrupted with noise intensities larger than those in the training set. Here, we used the training set of CNR [1] with Gaussian noise magnitudes from 0.1 l̄_e to 0.3 l̄_e, for which the released network of CNR was trained. In this regard, for a fair comparison, we directly applied the trained models of both our network and CNR to denoise 3D meshes with noise magnitudes above 0.3 l̄_e, without re-training the networks.

Fig. 13: Comparing the mesh denoising performance of various methods on input meshes of increasing noise intensity.

Figure 13 plots the θ values for using various methods to denoise 3D meshes of increasing noise intensities beyond the training range. From the plots, we can see
that our method consistently achieves a better performance with the lowest θ values for all noise levels. Please refer to supplementary material Part 7 for more results.

Fig. 14: Denoising results produced by our DNF-Net on two irregularly-triangulated meshes corrupted by Gaussian noise. Apparently, the triangles around the nose regions are elongated with irregular vertex degrees, and our method can still produce results that are quite similar to the ground truths.

Fig. 15: Comparing the mesh denoising performance of various methods (b)-(e) on two synthetic models that are corrupted by impulsive noise (top) and uniform noise (bottom).

Time performance.
Our network takes only 0.04 seconds to process a patch on an NVidia Titan Xp GPU. Further, since patch processing can be parallelized, our method's running time does not increase linearly with the number of patches. Hence, to process thousands of patches on a dense mesh, e.g., a model with 50K faces, our method takes only 40 seconds. If more GPUs are available, the computation time can be further shortened.
Limitations.
First, as a common drawback of data-driven methods like [1], our DNF-Net may produce unsatisfactory denoising results if we apply the network to process a test mesh whose noise pattern is very different from that in the training set. We plan to explore domain adaptation techniques to extract or transfer knowledge from unpaired data sources. Also, our method cannot handle meshes with faulty topological issues, e.g., self-intersections and inconsistent facet normal orientations. Such cases are also hard for existing filter-based and optimization-based methods. Lastly, our network requires paired data (noisy meshes with ground truths) to train. However, collecting a large amount of paired data is expensive and time-consuming. In the future, we plan to explore the possibility of learning from unpaired data or training in an unsupervised manner.
5 CONCLUSION
This paper presents a novel deep normal filtering network, namely DNF-Net, formulated for mesh denoising. DNF-Net
is an end-to-end network that directly predicts denoised facet normals from noisy input meshes, without requiring explicit information about the underlying surface or the noise characteristics. To effectively learn the local geometric patterns for denoising meshes, DNF-Net processes normal data grouped by patches. Further, we design the multi-scale feature embedding unit to extract normal features, followed by the cascaded residual learning units to progressively remove noise. Also, we drive DNF-Net to learn by formulating a deeply-supervised joint loss function with a normal recovery loss and a residual regularization loss. Lastly, we performed several experiments on our method using a rich variety of synthetic and real-scanned models. Both visual and quantitative comparisons demonstrate the superiority of our method over the state of the art.

As the first attempt to design a deep neural network to filter facet normals for mesh denoising, our DNF-Net can yet be improved in several aspects. First, instead of using only the normal data for denoising, we might as well consume other mesh information to enhance the extracted features, e.g., vertex positions, facet centroids, etc. Second, we plan to explore graph convolutional networks to take into account the mesh topology in the network learning. Third, enhancing the vertex update technique, e.g., to handle local fold-overs, will certainly help to improve the robustness of the overall method. On the other hand, exploring techniques in domain adaptation and transfer learning is another future direction for improving the network generalization ability, particularly for handling real-scanned inputs.

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable comments. The work is supported by the Hong Kong Research Grants Council with Project No. CUHK 14225616, the Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), and the National Natural Science Foundation of China with Project No. U1813204. The work is also supported by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CUHK 14201717) and the National Natural Science Foundation of China (Grant No. 61902275).

REFERENCES

[1] P. Wang, Y. Liu, and X. Tong, "Mesh denoising via cascaded normal regression,"
ACM Trans. on Graphics (SIGGRAPH Asia) ,vol. 35, no. 6, pp. 232:1–12, 2016.[2] Y. Zheng, H. Fu, O. K.-C. Au, and C.-L. Tai, “Bilateral normalfiltering for mesh denoising,”
IEEE Trans. Vis. & Comp. Graphics ,vol. 17, no. 10, pp. 1521–1530, 2011.[3] X. Li, L. Zhu, C.-W. Fu, and P.-A. Heng, “Non-local low-ranknormal filtering for mesh denoising,”
Computer Graphics Forum(Pacific Graphics) , vol. 37, no. 7, pp. 155–166, 2018.[4] K.-W. Lee and W.-P. Wang, “Feature-preserving mesh denoisingvia bilateral normal filtering,” in
Ninth Int. Conf. on Computer AidedDesign and Computer Graphics , 2005, pp. 275–280.[5] W. Zhang, B. Deng, J. Zhang, S. Bouaziz, and L. Liu, “Guidedmesh normal filtering,”
Computer Graphics Forum (Pacific Graphics) ,vol. 34, no. 7, pp. 23–34, 2015.[6] L. Zhu, M. Wei, J. Yu, W. Wang, J. Qin, and P.-A. Heng, “Coarse-to-fine normal filtering for feature-preserving mesh denoising basedon isotropic subneighborhoods,”
Computer Graphics Forum (PacificGraphics) , vol. 32, no. 7, pp. 371–380, 2013. [7] M. Wei, L. Liang, W.-M. Pang, J. Wang, W. Li, and H. Wu, “Tensorvoting guided mesh denoising,”
IEEE Trans. on Automation Scienceand Engineering , vol. 14, no. 2, pp. 931–945, 2016.[8] S. K. Yadav, U. Reitebuch, and K. Polthier, “Mesh denoising basedon normal voting tensor and binary optimization,”
IEEE Trans. Vis. & Comp. Graphics , vol. 24, no. 8, pp. 2366–2379, 2017.[9] M. Wei, J. Huang, X. Xie, L. Liu, J. Wang, and J. Qin, “Meshdenoising guided by patch normal co-filtering via kernel low-rankrecovery,”
IEEE Trans. Vis. & Comp. Graphics, vol. 25, no. 10, pp. 2910–2926, 2019. [10] L. He and S. Schaefer, "Mesh denoising via L0 minimization," ACM Trans. on Graphics (SIGGRAPH), vol. 32, no. 4, pp. 64:1–8, 2013. [11] X. Lu, W. Chen, and S. Schaefer, "Robust mesh denoising via vertex pre-filtering and L1-median normal filtering," Computer Aided Geometric Design, vol. 54, pp. 49–60, 2017. [12] Y. Zhao, H. Qin, X. Zeng, J. Xu, and J. Dong, "Robust and effective mesh denoising using L0 sparse regularization," Computer-Aided Design, vol. 101, pp. 82–97, 2018. [13] G. Arvanitis, A. S. Lalos, K. Moustakas, and N. Fakotakis, "Feature preserving mesh denoising based on graph spectral processing,"
IEEE Trans. Vis. & Comp. Graphics , vol. 25, no. 3, pp. 1513–1527,2019.[14] D. A. Field, “Laplacian smoothing and Delaunay triangulations,”
Communications in applied numerical methods , vol. 4, no. 6, pp. 709–712, 1988.[15] G. Taubin, “A signal processing approach to fair surface design,”
Proc. of SIGGRAPH , pp. 351–358, 1995.[16] M. Desbrun, M. Meyer, P. Schr¨oder, and A. H. Barr, “Implicitfairing of irregular meshes using diffusion and curvature flow,”
Proc. of SIGGRAPH , pp. 317–324, 1999.[17] J. Vollmer, R. Mencl, and H. Mueller, “Improved Laplaciansmoothing of noisy surface meshes,”
Computer Graphics Forum(Eurographics) , vol. 18, no. 3, pp. 131–138, 1999.[18] S. Fleishman, I. Drori, and D. Cohen-Or, “Bilateral mesh denois-ing,”
ACM Trans. on Graphics (SIGGRAPH) , vol. 22, no. 3, pp. 950–953, 2003.[19] T. R. Jones, F. Durand, and M. Desbrun, “Non-iterative feature pre-serving mesh smoothing,”
ACM Trans. on Graphics (SIGGRAPH) ,vol. 22, no. 3, pp. 943–949, 2003.[20] J. Wang, J. Huang, F.-L. Wang, M. Wei, H. Xie, and J. Qin, “Data-driven geometry-recovering mesh denoising,”
Computer-Aided De-sign , vol. 114, pp. 133–142, 2019.[21] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao,“3D ShapeNets: A deep representation for volumetric shapes,” in
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , 2015,pp. 1912–1920.[22] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neuralnetwork for real-time object recognition,” in
IEEE/RSJ Int. Conf. onIntelligent Robots and Systems , 2015, pp. 922–928.[23] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas,“Volumetric and multi-view CNNs for object classification on 3Ddata,” in
IEEE Conf. on Computer Vision and Pattern Recognition(CVPR) , 2016, pp. 5648–5656.[24] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong, “O-CNN: Octree-based convolutional neural networks for 3D shape analysis,”
ACMTrans. on Graphics (SIGGRAPH) , vol. 36, no. 4, pp. 72:1–11, 2017.[25] P. Wang, C. Sun, Y. Liu, and X. Tong, “Adaptive O-CNN: a patch-based deep representation of 3D shapes,”
ACM Trans. on Graphics(SIGGRAPH Asia) , vol. 37, no. 6, pp. 217:1–11, 2018.[26] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “FrustumPointNets for 3D object detection from RGB-D data,” in
IEEE Conf.on Computer Vision and Pattern Recognition (CVPR) , 2018, pp. 918–927.[27] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-viewconvolutional neural networks for 3D shape recognition,” in
IEEEInt. Conf. on Computer Vision (ICCV) , 2015, pp. 945–953.[28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learningon point sets for 3D classification and segmentation,” in
IEEE Conf.on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 652–660.[29] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deephierarchical feature learning on point sets in a metric space,” in
Conference and Workshop on Neural Information Processing Systems(NeurIPS) , 2017, pp. 5099–5108.
[30] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, "PointCNN: Convolution on X-transformed points," in Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2018, pp. 828–838. [31] A. Matan, M. Haggai, and L. Yaron, "Point convolutional neural networks by extension operators,"
ACM Trans. on Graphics (SIG-GRAPH) , vol. 37, no. 4, pp. 71:1–12, 2018.[32] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M.Solomon, “Dynamic graph CNN for learning on point clouds,”
ACM Trans. on Graphics , vol. 38, no. 5, pp. 146:1–12, 2019.[33] Y. Yang, C. Feng, Y. Shen, and D. Tian, “FoldingNet: Point cloudauto-encoder via deep grid deformation,” in
IEEE Conf. on Com-puter Vision and Pattern Recognition (CVPR) , 2018, pp. 206–215.[34] H. Deng, T. Birdal, and S. Ilic, “PPF-Foldnet: Unsupervised learn-ing of rotation invariant 3D local descriptors,” in
European Conf. onComputer Vision (ECCV) , 2018, pp. 602–618.[35] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng, “PU-Net:Point cloud upsampling network,” in
IEEE Conf. on ComputerVision and Pattern Recognition (CVPR) , 2018, pp. 2790–2799.[36] W. Yifan, S. Wu, H. Huang, D. Cohen-Or, and O. Sorkine-Hornung,“Patch-based progressive 3D point set upsampling,” in
IEEE Conf.on Computer Vision and Pattern Recognition (CVPR) , 2019, pp. 5958–5967.[37] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert, “PCN: Pointcompletion network,” in
Int. Conf. on 3D Vision (3DV) , 2018, pp.728–737.[38] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas, “GSPN:Generative shape proposal network for 3D instance segmentationin point cloud,” in
IEEE Conf. on Computer Vision and PatternRecognition (CVPR) , 2019, pp. 3947–3956.[39] H. Kato, Y. Ushiku, and T. Harada, “Neural 3D mesh renderer,”in
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) ,2018, pp. 3907–3916.[40] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang, “Pixel2Mesh:Generating 3D mesh models from single RGB images,” in
EuropeanConf. on Computer Vision (ECCV) , 2018, pp. 52–67.[41] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia, “De-formable shape completion with graph convolutional autoen-coders,” in
IEEE Conf. on Computer Vision and Pattern Recognition(CVPR) , 2018, pp. 1886–1895.[42] A. Dai and M. Nießner, “Scan2Mesh: From unstructured rangescans to 3D meshes,” in
IEEE Conf. on Computer Vision and PatternRecognition (CVPR) , 2019, pp. 5574–5583.[43] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black, “Generating 3Dfaces using convolutional mesh autoencoders,” in
European Conf.on Computer Vision (ECCV) , 2018, pp. 704–720.[44] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, andD. Cohen-Or, “MeshCNN: A network with an edge,”
ACM Trans.on Graphics (SIGGRAPH) , vol. 38, no. 4, pp. 90:1–12, 2019.[45] Y. Feng, Y. Feng, H. You, X. Zhao, and Y. Gao, “MeshNet: Meshneural network for 3D shape representation,” in
AAAI Conf. onArtificial Intell. (AAAI) , 2019, pp. 8279–8286.[46] L. Yi, H. Su, X. Guo, and L. J. Guibas, “SyncSpecCNN: Synchro-nized spectral CNN for 3D shape segmentation,” in
IEEE Conf. onComputer Vision and Pattern Recognition (CVPR) , 2017, pp. 2282–2290.[47] I. Kostrikov, Z. Jiang, D. Panozzo, D. Zorin, and J. Bruna, “Surfacenetworks,” in
IEEE Conf. on Computer Vision and Pattern Recognition(CVPR) , 2018, pp. 2540–2548.[48] T. Tasdizen, R. Whitaker, P. Burchard, and S. Osher, “Geometricsurface processing via normal maps,”
ACM Trans. on Graphics ,vol. 22, no. 4, pp. 1012–1033, 2003.[49] K. Crane, C. Weischedel, and M. Wardetzky, “Geodesics in heat: Anew approach to computing distance based on heat flow,”
ACMTrans. on Graphics (SIGGRAPH) , vol. 32, no. 5, pp. 152:1–11, 2013.[50] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “CBAM: Convolutionalblock attention module,” in
European Conf. on Computer Vision(ECCV) , 2018, pp. 3–19.[51] X. Sun, P. Rosin, R. Martin, and F. Langbein, “Fast and effectivefeature-preserving mesh denoising,”
IEEE Trans. Vis. & Comp.Graphics , vol. 13, no. 5, pp. 925–938, 2007.[52] P. Cignoni, M. Corsini, and G. Ranzuglia, “MeshLab: an open-source 3D mesh processing system,”
ERCIM News , no. 73, 2008.
Xianzhi Li is currently a PhD student in the Department of Computer Science and Engineering, the Chinese University of Hong Kong. She will receive her Ph.D. degree in July 2020. She served as a reviewer for CVPR 2020 and WACV 2021. Her research interests focus on geometry processing, computer graphics, and deep learning.

Ruihui Li is currently a PhD student in the Department of Computer Science and Engineering, the Chinese University of Hong Kong. His research interests include 3D processing, computer graphics, and deep learning.

Lei Zhu received his Ph.D. degree from the Department of Computer Science and Engineering, the Chinese University of Hong Kong, in 2017. He is working as a postdoctoral fellow at the Chinese University of Hong Kong. His research interests include computer graphics, computer vision, medical image processing, and deep learning.

Chi-Wing Fu is currently an associate professor at the Chinese University of Hong Kong. He served as the co-chair of SIGGRAPH ASIA 2016's Technical Brief and Poster program, associate editor of IEEE Computer Graphics & Applications and Computer Graphics Forum, panel member in the SIGGRAPH 2019 Doctoral Consortium, and program committee member in various research conferences, including SIGGRAPH Asia Technical Brief, SIGGRAPH Asia Emerging Technologies, IEEE Visualization, CVPR, IEEE VR, VRST, Pacific Graphics, GMP, etc. His recent research interests include computational fabrication, point cloud processing, 3D computer vision, user interaction, and data visualization.