Multiscale Mesh Deformation Component Analysis with Attention-based Autoencoders
Jie Yang, Lin Gao, Qingyang Tan, Yihua Huang, Shihong Xia, Yu-Kun Lai
Abstract—Deformation component analysis is a fundamental problem in geometry processing and shape understanding. Existing approaches mainly extract deformation components in local regions at a similar scale, while deformations of real-world objects are usually distributed in a multi-scale manner. In this paper, we propose a novel method to extract multiscale deformation components automatically with a stacked attention-based autoencoder. The attention mechanism is designed to learn to softly weight multi-scale deformation components in active deformation regions, and the stacked attention-based autoencoder is learned to represent the deformation components at different scales. Quantitative and qualitative evaluations show that our method outperforms state-of-the-art methods. Furthermore, with the multiscale deformation components extracted by our method, the user can edit shapes in a coarse-to-fine fashion, which facilitates effective modeling of new shapes.
Index Terms—Multi-Scale, Shape Analysis, Attention Mechanism, Sparse Regularization, Stacked Auto-Encoder
∗ Corresponding author: Lin Gao.
• J. Yang, L. Gao, Y. Huang and S. Xia are with the Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, and also with the University of Chinese Academy of Sciences, Beijing, China. E-mail: {yangjie01, gaolin, huangyihua20g, xsh}@ict.ac.cn
• Q. Tan is with the University of Maryland, College Park, Maryland, USA. E-mail: [email protected]
• Y.-K. Lai is with the Visual Computing Group, School of Computer Science and Informatics, Cardiff University, Wales, UK. E-mail: [email protected]
Manuscript received April 19, 2019; revised August 26, 2019.

1 INTRODUCTION

With the development of 3D scanning and modeling technology, 3D mesh collections are becoming much more popular. These mesh models usually use fixed vertex connectivity with variable vertex positions to characterize different shapes. Analyzing such mesh collections to extract meaningful components, and using these components to generate new models, are key research problems in these areas. Some works [1], [2], [3], [4] propose to extract deformation components from mesh datasets. They mainly focus on extracting local deformation components with sparse regularization at a uniform scale. However, real-world objects deform at multiple scales: for example, a person may show different facial expressions, which are localized deformations on the face, but the whole body can also be bent, which is a larger-scale deformation.

Multi-scale techniques are becoming increasingly popular in various fields. In finite element methods, multiscale analysis is widely used [5], [6], [7]. In the spectral geometry field, research works apply multiscale technology to deformation representation [8], physics-based simulation of deformable objects [9] and surface registration [10] by analyzing non-isometric global and local (multiscale) deformations. Moreover, for shape editing, multiscale technology also enables modeling rich facial expressions on human faces [11].

Such multiscale deformation components are especially useful to support model editing from the coarse level to the fine level. One motivation of this work is to achieve editing consistent with perceptual semantics by modifying shapes at suitable scales. The user should be able to make rough edits of the overall shape at a large scale, as well as localized modifications to surface details at a small scale. Inspired by recent advances in image processing with attention mechanisms [12], attention is formulated in our approach to focus on specific regions.

We propose a novel autoencoder architecture to extract multiscale local deformation components from shape deformation datasets. Our network structure is based on a mesh-based convolutional autoencoder architecture and uses an effective shape representation [13] as input, which is able to encode large-scale deformations. In this work, a stacked autoencoder architecture is proposed such that the network can encode the residual of the preceding autoencoder with the attention mechanism, which helps to separate the deformations into different scales and extract multiscale local deformation components. The network architecture is shown in Fig. 1. We further utilize a sparsity constraint on the parameters of the fully connected layers to keep the deformation components localized.
The autoencoder architecture ensures that the extracted deformation components are suitable for multiscale shape editing and helps reconstruct high-quality shapes with less distortion. Our contributions are twofold:
• To the best of our knowledge, this is the first work that automatically extracts multiscale deformation components from a deformed shape collection. With these extracted components, the user can edit a 3D mesh shape in a coarse-to-fine fashion, which makes 3D modeling much more effective.
• To achieve this, we propose a novel deep architecture involving attentional stacked autoencoders. The attention mechanism is designed to learn the soft weights that help extract multiscale deformation components in the shape analysis, and the stacked autoencoders are used to decompose the deformations of shape collections into shape components of different scales.
All the components of the network are tightly integrated and help each other: the attention mechanism makes the follow-up autoencoders focus on specific regions, allowing smaller-scale local deformation components to be extracted. Extensive comparisons show that our method extracts more meaningful deformation components than state-of-the-art methods.

In Sec. 2, we review the most related work. We then give a brief description of the input features used in our method in Sec. 3, and present a detailed description of our novel autoencoder with the attention mechanism, including implementation details, in Sec. 4. Finally, we present experimental results, including extensive comparisons with state-of-the-art methods, in Sec. 5, and draw conclusions in Sec. 6.
2 RELATED WORK
Mesh deformation component analysis has attracted significant interest in the research of shape analysis and data-driven mesh deformation. Many data-driven methods for editing either man-made objects [14], [15], [16] or general deformable surfaces [13], [17], [18] benefit from extracted deformation components. Our work shares the same interest, aiming to assist users in editing shapes efficiently. Although our focus is to extract more meaningful multiscale deformation components automatically, our method can be incorporated into existing data-driven deformation methods. In the following, we review the work most related to ours.
3D shape deformation component extraction.
With the increasing use of 3D models, analyzing their intrinsic characteristics has become a mainstream need. Early work [19] extracts principal components from mesh datasets by Principal Component Analysis (PCA), but the extracted components contain global deformations, which is not effective for users making local edits. Some works [20] demonstrate that sparsity constraints are effective for achieving localized deformation results. However, classical sparse PCA [21] does not take spatial information into consideration. By promoting sparsity in the spatial domain, many works extract localized deformation components with a sparsity constraint [1], [22], which outperforms standard PCA variants such as Clustering-PCA [23] with respect to choosing suitable compact basis modes, especially for producing more localized, meaningful deformations. Moreover, the pioneering work [1] represents meshes with Euclidean coordinates, but this representation is sensitive to rigid and non-rigid transformations. Later work [2] extends [1] to better deal with rotations by using deformation gradients to represent shapes, but the method still cannot cope with rotations larger than 180° due to their inherent ambiguity. Based on deformation gradients, Neumann et al. [24] learn arm-muscle deformations using a small set of intuitive parameters. The work [3] extends [1] by using a rotation-invariant representation based on edge lengths and dihedral angles [25], and so can handle large-scale deformations. However, the representation [25] is not suitable for extrapolation, as this would result in negative edge lengths. This limits the capability of [3] for deformation component analysis, since extrapolation is often needed, e.g. when utilizing the extracted deformation components for data-driven shape editing. Recent work [4] proposes a convolutional autoencoder based on an effective shape representation [13] to learn the localized deformations of a shape set, but its architecture is not suitable for extracting multiscale deformation components. Overall, different from these works [1], [2], [3], [4], our method can produce meaningful and multiscale localized deformation components.

Deep learning on 3D Shapes.
With the development of artificial intelligence, deep learning and neural networks have made great progress in many areas, in particular 2D image processing. Hence, some researchers transform the non-uniform geometry signals defined on meshes of different topologies into a regular domain, while preserving shape information as much as possible, which enables powerful uniform Cartesian-grid-based CNN (Convolutional Neural Network) backbone architectures to be used on problems such as cross-domain shape retrieval [26], surface reconstruction [27] and shape completion [28]. DDSL [29] was recently proposed as a differentiable layer compatible with deep neural networks for learning geometric signals. However, due to the irregularity of 3D shapes, deep learning is difficult to apply directly. Inspired by image processing, some works [30], [31], [32] apply deep learning to the voxel representation with regular connectivity. However, the voxel representation incurs significant computation and memory costs, which limits the resolution such methods can handle. Wang et al. [33], [34] improve the performance of voxel-based convolutions by proposing an adaptive octree structure to represent 3D shapes, and apply it to shape completion [35]. Meanwhile, recent works [36], [37] define convolutions on point clouds using K-nearest neighbors (KNN) and spherical convolutions. More recently, the work [38] proposed the EdgeConv operator for learning on point clouds to improve the performance of segmentation and classification. In addition to the voxel and point-cloud representations, shapes can also be represented as multiview projection images on which 2D CNNs are applied [31], [39], [40] for 3D object recognition and classification; such approaches are also used to learn local shape descriptors for shape segmentation and correspondence [41]. For applications that take meshes as input or generate meshes as output, converting meshes to alternative representations can cause useful topology information to be lost.

Alternatively, as a mesh can be represented as a graph, CNNs can be extended to graph CNNs in the spatial domain [42], [43] or the spectral domain [44], [45], [46] for mesh-based deep learning. In the spatial domain, the works [47], [48], [49], [50] apply variational autoencoders to 3D meshes for various applications such as reconstruction, interpolation, completion and embedding, and the work [51] uses autoencoders to analyze deformable solid dynamics. However, none of the existing methods can extract multiscale deformation components, which we address with a novel attention-based mesh convolutional network architecture.
Attention mechanism on convolutional networks.
Deep neural networks have proved their superiority in extracting high-level semantics and highly discriminative features on various image datasets. Researchers now pay more attention to using convolutional features more effectively on fine-grained datasets to improve performance. Such works can be widely seen in different areas of computer vision and natural language processing, such as image translation (DA-GAN) [52], person re-identification [53], [54], document classification [55], object detection [56], [57] and video classification [58]. The work [59] proposes a "soft attention" mechanism that predicts soft weights and computes a weighted combination of the items in machine translation. In [60], a hierarchical co-attention method is proposed to learn the conditional representation of an image given a problem, and [61] extends this co-attention model to higher orders. Some works effectively utilize attention as a way to focus on specific regions for learning. Wang et al. [62] demonstrate the benefit of guiding the feature learning by using residual
attention learning to improve recognition performance. Another example is the attention-focused CNN (RA-CNN) [12] based on the Attention Proposal Network (APN), which actively identifies the effective region, uses bilinear interpolation to adjust the scale, and then uses the enlarged region of interest for improved fine-grained classification. However, the above works apply the attention mechanism in the 2D domain. Our work extends the attention mechanism to the 3D domain to extract multiscale deformation features based on an effective shape representation [13].

Figure 1. Our network architecture with the attention mechanism and stacked autoencoders. We obtain large-scale deformation components and attention masks from the first-level autoencoder AE_0. For the second-level autoencoders AE_k, 1 ≤ k ≤ K, we feed the residual X − X̂ of the first-level autoencoder AE_0, weighted by the attention masks, into K autoencoders focusing on different sub-regions of the shape, from which we obtain small-scale deformation components. This architecture can be further extended to include more scale levels. Each autoencoder has a mirrored encoder and decoder structure: the encoder has one convolution layer and a fully connected layer, and the encoder and decoder share the same trainable parameters. K, V and µ are the dimension of the latent space (the number of attention masks), the number of vertices and the dimension of the vertex features, respectively. By setting the latent vector to a one-hot vector, we can extract the K attention masks AM_k, 1 ≤ k ≤ K, from the parameter C, as the top-left corner of the figure shows. Please refer to Sec. 4.2 for details.

3 DEFORMATION REPRESENTATION AND CONVOLUTION OPERATOR
The input to our overall network is based on the recently proposed as-consistent-as-possible (ACAP) deformation representation [13], which can cope with large-scale deformations of shapes and is defined only on vertices, making mesh-based convolutions easier to implement. To validate that it is a good choice, we compare it with a recently proposed general-purpose mesh autoencoder (AE) architecture, DEMEA [49]. For a fair comparison, we compare with DEMEA [49] on the COMA [48] dataset with the same setting as used in [49]. We also use the same training and test split: the training set contains 10 various expressions with 17,794 meshes in total, and the test set contains 2 challenging expressions (high smile and mouth extreme) with 2,671 meshes in total. We set the same latent space dimension (32), and our network uses only a single level of AE to make the architectures comparable. In terms of average per-vertex error on the test set, our base AE network with 0.822 mm error outperforms DEMEA with 1.05 mm error. Such benefits can be more substantial when shapes undergo more substantial deformations, such as various human body poses, so we build our network architecture on this representation.

For a given shape set with N shapes that share the same connectivity, each with V vertices, without loss of generality we choose the first shape as the reference shape. For the patch consisting of the i-th vertex and its 1-ring neighbor vertices, we can calculate a deformation gradient T_{m,i} ∈ R^{3×3}. The deformation gradient of the patch is defined on the i-th vertex of the m-th shape and describes the local deformation w.r.t. the reference shape. T_{m,i} of shape m is obtained by minimizing

T_{m,i} = argmin_{T_{m,i}} Σ_{j∈N_i} c_{ij} ‖(p_{m,i} − p_{m,j}) − T_{m,i}(p_{1,i} − p_{1,j})‖²

where N_i is the set of 1-ring neighbor vertices of vertex i and c_{ij} = cot α_{ij} + cot β_{ij} is the cotangent weight [63], with α_{ij} and β_{ij} being the angles opposite the common edge (i, j) in the two faces that share it. T_{m,i} can then be decomposed as T_{m,i} = R_{m,i} S_{m,i} using the polar decomposition, where S_{m,i} ∈ R^{3×3} is a symmetric matrix that describes the scaling/shear deformation and R_{m,i} ∈ R^{3×3} is an orthogonal matrix that describes the rotation. The rotation matrix R_{m,i} can be represented by a rotation axis ω_{m,i} and a rotation angle θ_{m,i}, but the mapping from a rotation matrix to an axis and angle is one-to-many. For shapes with large-scale rotations, the rotation axes and angles of adjacent vertices may become inconsistent, which results in artifacts when synthesizing new shapes, as shown in [13]. Gao et al. propose a two-step integer optimization to solve this problem, making the rotation angles and axes of adjacent vertices as consistent as possible; please refer to [13] for details.

Next, for each vertex i on the m-th shape, we obtain the feature q_{m,i} = {r_{m,i}, s_{m,i}} ∈ R^9 by extracting the non-trivial elements r_{m,i} ∈ R^3 and s_{m,i} ∈ R^6 from the logarithm of the rotation matrix R_{m,i} and from the scaling/shear matrix S_{m,i}, respectively. Finally, the ACAP feature of the m-th shape is represented by {q_{m,i} | 1 ≤ i ≤ V}. Due to the use of the tanh activation function [4], we further linearly scale each element of r_{m,i} and s_{m,i} separately to a fixed symmetric range inside (−1, 1). We then concatenate q_{m,i}, 1 ≤ i ≤ V, in vertex order to form a long vector X_m ∈ R^{µV} as the feature of the m-th shape, where µ = 9 is the dimension of the ACAP feature of each vertex.
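As an illustration of the feature construction just described, here is a minimal NumPy/SciPy sketch, assuming the deformation gradients T_{m,i} have already been solved for; the function names are ours, and the axis/angle consistency optimization of [13] is omitted.

```python
import numpy as np
from scipy.linalg import polar, logm

def acap_vertex_feature(T):
    """Map one 3x3 deformation gradient T_{m,i} to the 9-D feature q_{m,i}.

    Sketch of Sec. 3: polar decomposition T = R S, matrix logarithm of the
    rotation R, and the non-trivial entries of both parts. The two-step
    integer optimization of [13] that makes adjacent axes/angles consistent
    is not shown here.
    """
    R, S = polar(T)                                        # R orthogonal, S symmetric
    log_R = np.real(logm(R))                               # antisymmetric 3x3
    r = np.array([log_R[2, 1], log_R[0, 2], log_R[1, 0]])  # r_{m,i} in R^3
    s = S[np.triu_indices(3)]                              # s_{m,i} in R^6
    return np.concatenate([r, s])                          # q_{m,i} in R^9 (mu = 9)

def shape_feature(T_per_vertex):
    """Concatenate the per-vertex features of one shape into X_m in R^{9V}.
    The linear rescaling into a range inside (-1, 1), done per element over
    the whole dataset in the paper, is left out of this sketch."""
    return np.concatenate([acap_vertex_feature(T) for T in T_per_vertex])
```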
We further introduce the graph convolution operator used in our architecture. As illustrated in [4], the output of the convolution operator for every vertex is a linear combination of the inputs at the vertex and its 1-ring neighbor vertices, plus a bias. The output y_i for the i-th vertex is defined as

y_i = W_point x_i + W_neighbor (1 / D_i) Σ_{j=1}^{D_i} x_{n_ij} + b    (1)

where x_i is the input feature vector of the i-th vertex, D_i is the degree of the i-th vertex, and n_ij (1 ≤ j ≤ D_i) is the j-th neighbor vertex of the i-th vertex. W_point, W_neighbor ∈ R^{µ×µ} and b ∈ R^µ are the trainable parameters of the graph convolutional layer. These weights are shared by all vertices and their neighborhoods in the same convolutional layer and are learned during the training of the network.
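Below is a small PyTorch sketch of this per-vertex convolution, assuming the 1-ring neighborhoods are given as a padded index matrix with a validity mask; the class and argument names are ours, not from [4]. The tanh activation used by all layers of our network is applied outside the layer.

```python
import torch
import torch.nn as nn

class VertexGraphConv(nn.Module):
    """Per-vertex graph convolution of Eq. (1): each output mixes the vertex's
    own feature (W_point) with the mean feature of its 1-ring neighbours
    (W_neighbor), plus a bias b."""

    def __init__(self, mu=9):
        super().__init__()
        self.w_point = nn.Linear(mu, mu, bias=False)      # W_point
        self.w_neighbor = nn.Linear(mu, mu, bias=False)   # W_neighbor
        self.bias = nn.Parameter(torch.zeros(mu))         # b

    def forward(self, x, neighbors, mask):
        # x:         (B, V, mu)  per-vertex input features
        # neighbors: (V, Dmax)   padded 1-ring indices n_ij
        # mask:      (V, Dmax)   1.0 for a real neighbour, 0.0 for padding
        nbr = x[:, neighbors, :] * mask.unsqueeze(0).unsqueeze(-1)
        degree = mask.sum(dim=1).clamp(min=1.0)           # D_i
        nbr_mean = nbr.sum(dim=2) / degree.view(1, -1, 1)
        return self.w_point(x) + self.w_neighbor(nbr_mean) + self.bias
```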
4 METHOD

In this section, we introduce our network architecture from three main aspects: the novel autoencoder structure, the attention mechanism, and redundant component removal. We first introduce the autoencoder structure, then describe the attention mechanism, which helps the autoencoders extract multiscale deformation components, and lastly explain how we remove redundant components, followed by the implementation details of our network. The architecture is flexible and supports multiple levels of scales; in most cases, two levels of deformation scales are sufficient to represent the deformations in the datasets of this paper (please refer to Sec. 5.3.3 for details).

4.1 Stacked Autoencoder Structure

As Fig. 1 illustrates, we achieve the multiscale structure by stacking autoencoder blocks. In the first (coarsest) level, we have one autoencoder AE_0, and in the second level, K autoencoders AE_k (k = 1, 2, ..., K) are built, each focusing on one local region through an attention mechanism, which will be detailed later. The number of second-level AEs is determined by the dimension of the latent space of AE_0.

Each shape m, 1 ≤ m ≤ N, is represented by its pre-processed ACAP feature X_m ∈ R^{V×µ}, as described in Sec. 3. We use an encoder to map the feature to a 128-dimensional latent code and a decoder that reconstructs the ACAP feature of the shape from a latent code z. Both the encoder and the decoder have one mesh-based graph convolutional layer and a fully connected layer, and their structures are symmetric; the learnable parameters of the fully connected layer are defined as C ∈ R^{K_z × µV}, where K_z is the dimension of the latent space. In particular, the fully connected layers of the encoder and decoder share the same learnable parameter C and have no bias. The latent vectors z of all N shapes form a matrix Z ∈ R^{N × K_z}. Similar to [4], all layers use the tanh activation function. Figure 1 illustrates the autoencoder architecture in its top-left corner.

The output X̂ ∈ R^{µV} of the whole autoencoder block can be scaled back to the ACAP deformation representation, from which the Euclidean coordinates are reconstructed using [13]. For every autoencoder, we optimize a loss function with three terms: a reconstruction loss that ensures accurate reconstruction of the input, a sparsity loss Ω(C) that promotes localized deformation components, and a non-trivial regularization term V(Z) that avoids trivial solutions. The total loss for an autoencoder block AE_k is

L_{AE_k} = λ_1 L_recon + λ_2 Ω(C) + V(Z)    (2)

where AE_k, 0 ≤ k ≤ K, denotes the k-th autoencoder and λ_1, λ_2 are balancing weights. The reconstruction loss is the MSE (mean square error) loss, defined as L_recon = (1/N) Σ_{i=1}^{N} ‖X_i − X̂_i‖². The non-trivial regularization term is V(Z) = (1/K_z) Σ_{j=1}^{K_z} max(max_m |Z_{jm}| − θ, 0), where Z_{jm} is the weight of the j-th dimension for the m-th shape and θ is a positive number, set to θ = 5 in our experiments.

The above two loss terms are the same as in the previous work [4]. The loss Ω(C), however, is different: we choose a step function Λ (Eq. 3) that maps geodesic distances to {0, 1}, rather than the previously used clipped linear interpolation function. This is because our architecture extracts hierarchical deformation components, so at any level a fixed component size (rather than a range) is preferred; autoencoder blocks at different levels produce localized deformation components of different scales by adjusting the tunable parameter d. Our sparsity loss term is defined as Ω(C) = (1/K_z) Σ_{k=1}^{K_z} Σ_{i=1}^{V} Λ_ik ‖C_{k,i}‖₂, where C_{k,i} is the µ-dimensional vector of component k at vertex i, ‖·‖₂ is the group sparsity (ℓ_{2,1}) norm, and the sparsity regularization weights Λ_ik are defined as

Λ_ik = 0 if d_ik < d, and Λ_ik = 1 if d_ik ≥ d    (3)

where Λ_ik is a binary function and d_ik denotes the normalized geodesic distance [64] from vertex i to the center point c_k of component k, defined as c_k = argmax_i ‖C_{k,i}‖ and updated in each iteration of network training.
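For illustration, here is a PyTorch sketch (with our own variable names) of the per-block loss of Eq. (2), including the step-function sparsity weights of Eq. (3); it assumes the normalized geodesic distance matrix and the current component centers c_k are available, and the default weights shown are the ones reported in Sec. 4.4.

```python
import torch

def ae_block_loss(X, X_hat, C, Z, geo_dist, centers, d,
                  lam1=10.0, lam2=1.0, theta=5.0):
    """Per-block loss of Eq. (2): L = lam1 * L_recon + lam2 * Omega(C) + V(Z).

    X, X_hat : (N, V*mu)   input and reconstructed ACAP features
    C        : (Kz, V, mu) fully-connected weights, viewed per vertex
    Z        : (N, Kz)     latent codes of all shapes
    geo_dist : (V, V)      normalized geodesic distances between vertices
    centers  : (Kz,)       c_k = argmax_i ||C_{k,i}||, refreshed every iteration
    d        : float       cut-off of the step function Lambda in Eq. (3)
    """
    recon = ((X - X_hat) ** 2).sum(dim=1).mean()          # L_recon (MSE over shapes)

    group_norm = C.norm(dim=2)                            # ||C_{k,i}||_2, shape (Kz, V)
    Lam = (geo_dist[centers] >= d).float()                # Eq. (3): 0 inside, 1 outside
    sparsity = (Lam * group_norm).sum() / C.shape[0]      # Omega(C)

    # V(Z): penalize latent codes whose per-dimension maximum exceeds theta
    nontrivial = torch.clamp(Z.abs().max(dim=0).values - theta, min=0.0).mean()

    return lam1 * recon + lam2 * sparsity + nontrivial
```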
d is a tunable parameter, which controls the size of the deformation region of a component: a larger d corresponds to a bigger deformed region of the shape. For our task, AEs at different levels use different values of d; please refer to Sec. 4.4 for the default values. We train the whole network end-to-end by adding the losses of all autoencoder blocks together as L_total = Σ_{k=0}^{K} L_{AE_k}, which includes the first-level AE (AE_0) and the K second-level AEs (AE_k, 1 ≤ k ≤ K).

4.2 Attention Mechanism

As with 2D images, where many tasks (e.g. image recognition) focus on different levels of an image through an attention mechanism, such as [12], we explore 3D shapes in a multiscale fashion using attention. Most deformation datasets (e.g. human, horse and fabric) contain both global-scale and local-scale deformations, so it is natural to extract these deformations in a multiscale manner. We therefore design an attention mechanism that helps our autoencoder blocks extract multiscale deformation components. To make the second-level AEs focus on sub-regions, extract finer-level components, and thus form a multiscale structure, we extract learnable attention masks from the fully connected layer of the first-level AE, AE_0. Our attention mechanism is shown in the bottom-left corner of Fig. 1. Due to the sparsity constraint Ω(C), the parameter C of the fully connected layer represents the sparse deformation components, and each row C_k ∈ R^{Vµ}, 1 ≤ k ≤ K, of C represents a deformed sub-region of the shape. These deformed sub-regions can be regarded as the regions of interest of the second-level AEs. In every training iteration, we can therefore extract each row of C by setting the latent vector to a one-hot vector:

C_k = C^T × OH_k    (4)

where OH_k is a K-dimensional column vector whose k-th entry is 1 and whose other entries are 0. We then reshape C_k into a 2D array of size V × µ, denoted C^r_k. The unnormalized attention mask am_{k,i} for the k-th component at the i-th vertex is defined as

am_{k,i} = Σ_{j=1}^{µ} C^r_{k,ij}    (5)

where C^r_{k,ij} is the (i, j) entry of C^r_k. We further normalize it to obtain the normalized attention mask AM ∈ R^{K×V}, where

AM_{k,i} = am_{k,i} / Σ_{k=1}^{K} am_{k,i}    (6)

For the first-level autoencoder AE_0, the residual of the reconstruction is X − X̂ and the normalized attention mask is AM. We reshape (X − X̂) into a 2D array X_res ∈ R^{V×µ}. The input of the second-level autoencoder AE_k, 1 ≤ k ≤ K, as in Fig. 1, is diag(AM_k) × X_res, where diag(·) returns a square diagonal matrix with the elements of the vector on its main diagonal. The input of each second-level AE is therefore the residual of the first-level AE weighted by the corresponding attention mask. This attention mechanism ensures that the second-level AEs reconstruct smaller-scale deformations that cannot be well captured by AE_0, and that each AE_k focuses on an individual local region. The sum of every column of AM is one according to Eq. (6), which ensures that the sum of the inputs to AE_k, 1 ≤ k ≤ K, equals the residual X_res of the first-level autoencoder AE_0.
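A small PyTorch sketch of the mask construction in Eqs. (4) to (6) and of the weighted residual that feeds each second-level AE; the helper names are ours. Since Eq. (4) simply selects the k-th row of C, the sketch reads the rows directly instead of multiplying by one-hot vectors.

```python
import torch

def attention_masks(C, V, mu):
    """Attention masks of Eqs. (4)-(6) from the fully-connected weights C of AE_0.
    C: (K, V*mu); row k equals C_k = C^T x OH_k of Eq. (4).
    Returns AM: (K, V), whose columns sum to one (Eq. 6)."""
    C_r = C.view(C.shape[0], V, mu)          # reshape each C_k to (V, mu)
    am = C_r.sum(dim=2)                      # unnormalized mask am_{k,i}, Eq. (5)
    return am / am.sum(dim=0, keepdim=True)  # normalize over components k, Eq. (6)

def second_level_inputs(X, X_hat, AM, V, mu):
    """Inputs of the second-level AEs for one shape: the residual of AE_0,
    weighted per vertex by each mask, i.e. diag(AM_k) x X_res."""
    X_res = (X - X_hat).view(V, mu)
    return [AM[k].unsqueeze(1) * X_res for k in range(AM.shape[0])]
```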
Under the supervision of the loss function and the attention mechanism, the first-level autoencoder AE_0 captures the large-scale deformations and the second-level AEs capture the smaller-scale deformations in specific regions of the shape. Consequently, our network learns the multiscale deformation components of the whole shape set, and these components can be extracted from the parameters of the fully connected layers of all AEs.

4.3 Redundant Component Removal

For all autoencoder blocks, we extract a fixed number of deformation components per AE. For fair comparison and to capture all deformations, we set K_z = 10 for AE_0 and K_z = 5 for AE_k, 1 ≤ k ≤ K. Because our multiscale analysis is set up to extract as many components as possible so as not to miss any, some components of the second-level AEs are redundant, i.e., they contain no deformation or only slight deformations compared to the reference mesh. We remove these components when the information they contain is below a given threshold; all results in this paper are processed by this redundant component removal. Fig. 6 shows the raw output of the network without this process (it contains some components with only slight deformations compared to the reference mesh), while Fig. 11 illustrates that a finite number of components is sufficient to represent the whole dataset in the deformation component analysis. We therefore remove the redundant components that contain only slight deformations, as defined in Eq. (7). This post-process, performed after network training, makes our results more compact and reasonable, and trades off the multiscale decomposition against overfitting to the training data. To filter out slight or noisy deformation components, we define the following deformation strength of a component. The strength I of the feature X_m is

I(X_m) = Σ_{i=1}^{V} 1(‖(X_diff)_i‖₂ > ε) ‖(X_diff)_i‖₂ / Σ_{i=1}^{V} 1(‖(X_diff)_i‖₂ > ε)    (7)

where X_diff = X_m − X_r, and X_r, X_m ∈ R^{V×µ} are the features of the reference mesh and of the deformation component extracted from the autoencoders, respectively. ‖·‖₂ is the ℓ₂ norm of a vector, 1(·) gives 1 if the condition is true and 0 otherwise, and ε is a small threshold.

Figure 2. The multiscale structure of deformation components on the shape set SCAPE [65]. In the figure, we filter out the redundant components; our method learns deformation components of different scales. The first column shows some examples from the SCAPE dataset, the second column presents coarse-level deformation components from the first-level AE_0, and the third column gives the fine-level deformation components from the second-level AE_k, 1 ≤ k ≤ K.

If an extracted deformation component corresponds to a slight deformation, defined as its strength being smaller than the threshold ε, we remove the component. Finally, we obtain the multiscale structure of the deformation components, as shown in Figs. 2, 9, 10, 11, 12 and 13, where deformations at different scales are indicated with arrows.
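A minimal NumPy sketch of the strength measure of Eq. (7) and of the removal step follows; the function names are ours and the threshold value used below is only an assumed placeholder.

```python
import numpy as np

def deformation_strength(X_m, X_r, eps=1e-3):
    """Strength I(X_m) of Eq. (7) for one extracted component.
    X_m, X_r: (V, mu) features of the component and the reference mesh.
    eps is the small threshold epsilon (the value here is an assumption)."""
    diff = np.linalg.norm(X_m - X_r, axis=1)   # ||(X_diff)_i||_2 per vertex
    active = diff > eps                        # indicator 1(.)
    if not active.any():
        return 0.0
    return diff[active].sum() / active.sum()

def remove_redundant(components, X_r, eps=1e-3):
    """Drop components whose deformation strength falls below the threshold."""
    return [c for c in components if deformation_strength(c, X_r, eps) >= eps]
```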
We also check whether our network produces similar components (near duplicates). To do so, we test the similarity of the extracted deformation components. Fig. 3 visualizes the cosine similarity matrix of the components extracted from the first-level AE. It shows that the components have low similarity: our AE applies the localization constraint and reconstruction error minimization to ensure that different components in the latent space represent different parts of the shape, and having duplicated components would reduce the representation capability and thus increase the total loss.
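A tiny NumPy sketch of how such a similarity matrix can be computed from the learned components (our helper; taking the absolute value to map the similarities into [0, 1] is an assumption):

```python
import numpy as np

def component_cosine_similarity(C):
    """Pairwise cosine similarity between extracted components (rows of C),
    as visualized in Fig. 3. C: (Kz, V*mu)."""
    normed = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
    return np.abs(normed @ normed.T)   # (Kz, Kz), values in [0, 1]
```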
Figure 3. Visualization of the cosine similarity matrix of the components extracted from the first-level AE. It shows that the components have low similarity. The value (0-1) in each grid cell indicates the similarity between two components; larger values mean more similar.

Figure 4. Comparison of deformation components located in the left arm of SCAPE [65], extracted by different methods. The deformed region is highlighted in blue. Every row shows components located in a similar region. Our results are more reasonable.
4.4 Implementation Details

Our experiments were carried out on a computer with an i7-6850K CPU, 16 GB RAM and an Nvidia GTX 1080Ti GPU.
Datasets:
We compare with the state-of-the-art methods on the SCAPE dataset [65], Horse dataset [66], Face dataset [67], Humanoid dataset [3], Dress dataset, Pants dataset [68], Flag dataset, Skirt dataset, the Fat person (ID: 50002) from the Dyna dataset [69], the CoMA dataset [48], and the Swing and Jumping datasets [70]. The Dress, Flag and Skirt datasets were synthesized with the NVIDIA clothing tools in 3ds Max. The data we use range from rigged deformations such as human motion to non-rigged deformations (faces, cloth), and from small datasets (several hundred shapes) to large datasets (thousands of shapes, e.g. Dyna). Most of these datasets are model sequences in which neighboring shapes have similar deformations, so to evaluate our model's generalizability we select one of every ten models for training and the rest for testing, i.e., a 1:9 training-to-test ratio. For non-sequential data such as SCAPE, we split the training and test sets randomly with a 1:1 ratio. As a special case, for the CoMA dataset we use the same setting as DEMEA [49] for fair comparison. The statistics of the datasets are shown in Table 1, which lists the number of shapes each dataset contains and the numbers of training and testing examples. All data are easy to obtain: some are public, and some were synthesized with professional software. We will release the synthesized data for future research.
Table 1. Data statistics. We summarize the statistics of 10 datasets used in our experiments. Each dataset is split into a training set and a test set with a ratio of 1:9. For the CoMA [48] dataset, we use the same setting as DEMEA [49] for fair comparison.
In our experiments, our network takes the ACAP features of 3D shapes as input, which can describe large-scale deformations and are computed following [13]. We have two levels of autoencoders: the first level has a single AE (AE_0), and the second level has the same number of AEs (AE_k, 1 ≤ k ≤ K) as there are attention masks. Since in most real-world datasets the deformations of the shapes are not extremely exaggerated, two levels of autoencoders are enough to extract the multiscale localized deformation components in our experiments, but this can be extended if necessary. We use fixed hyperparameters across the different categories above, and we perform experiments on the SCAPE dataset in Sec. 5.3 to demonstrate how suitable hyperparameters are chosen. As shown in that section, our stacked AEs achieve the lowest error with the following default parameters: λ_1 = 10.0 and λ_2 = 1.0, the weights of the reconstruction error and of the sparsity constraint term, respectively, and d = [d_1, d_2] and K_z = [10, 5], corresponding to the coarse and fine levels. Here, we train the whole network end-to-end rather than separately. We train with the ADAM solver [71] using an exponentially decaying learning rate until the network converges. For all AEs, we set the batch size to 256, randomly sampled from the training set. For a typical dataset, training the stacked AEs takes about 10 hours. Once the network is trained, the extracted components can be output efficiently: outputting one component takes only about 50 milliseconds.
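As a rough PyTorch sketch of the optimization setup just described (the learning-rate value and decay factor below are illustrative assumptions, not the paper's exact numbers):

```python
import torch

def total_loss(block_losses):
    """L_total = sum_k L_AE_k over the first-level AE and all second-level AEs."""
    return sum(block_losses)

def make_optimizer(parameters, base_lr=1e-3, decay=0.95):
    """ADAM with an exponentially decaying learning rate, as used for the
    end-to-end training; base_lr and decay are assumed values."""
    opt = torch.optim.Adam(parameters, lr=base_lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    return opt, sched
```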
Table 2. Errors of applying our method to generate unseen data from the Horse [72], Face [67], Humanoid [3], Pants [68], Flag, Fat person (ID: 50002 from the Dyna dataset [69]), Swing and Jumping [70] datasets. The table shows that the generalization ability of our network is better than that of the other methods in terms of both the E_rms and STED errors.
(Table 2 reports, for each dataset, Horse, Face, Jumping, Humanoid, Swing, Pants, Dress, Fat and Flag, the E_rms and STED errors of ours, Tan et al., Wang et al., Huang et al., Neumann et al. and Bernard et al.)

As we will discuss later, compared with the separate training shown in Table 4, training the network end-to-end results in smaller reconstruction errors E_rms on all datasets. The main reason is that the network can adjust the attention masks by minimizing the loss function, and in turn the adjusted attention masks lead to smaller reconstruction errors E_rms; under this collaboration we obtain better results, as shown in Table 4.

5 EXPERIMENTAL RESULTS & EVALUATION
In this section, we evaluate our method on the above datasets from the following aspects: quantitative evaluation, qualitative evaluation and applications.
5.1 Quantitative Evaluation

We compare the generation ability of our method with the state-of-the-art methods [1], [2], [3], [4], [22] on various datasets. In this experiment, we select one model from every ten models for training and use the remaining ones for testing. After training, we align all the models and scale them into a unit ball. We then use the E_rms (root mean square) error [73] and the STED error [74] to compare the generalization error on the test data (i.e., the reconstruction error for unseen data) across the various methods. In particular,
the STED error is designed for motion sequences with a focus on the 'perceptual' error of models. To ensure fairness, we train each autoencoder to extract 50 components. As Table 2 shows, our method performs better than the existing methods in terms of both E_rms and STED. Because the Euclidean coordinate representation is sensitive to rotations, the deformation components extracted by the methods [1], [22] have more artifacts and implausible deformations, leading to larger reconstruction errors. Due to the limitations of the edge-length and dihedral-angle representation, the reconstruction using the method [3] can also be inaccurate and unstable. The method [2] is not capable of encoding large-scale deformations (e.g. folds on fabric), so it cannot recover the original deformation accurately in such cases. The method [4] uses a large-scale deformation representation to achieve good performance, but it cannot produce multiscale deformation components. In comparison, our method keeps the reconstruction errors lower by using stacked AEs and analyzing the residual of the first-level AE, extracting effective multiscale deformation components.

Meanwhile, our extracted components serve data-driven deformation. In the real world, a user usually edits a shape through a limited number of control points. To demonstrate the ability of each method to reconstruct deformed models from limited control points, we use furthest point sampling [73] to choose control points that are distributed evenly over the shape. Under the constraint of the control points, we use the same number of extracted components to
perform data-driven deformation on the SCAPE dataset. As shown in Fig. 5, the reconstructions using our extracted components consistently have lower errors. In comparison, due to the use of the Euclidean coordinate representation, the methods [1], [22] fail to reconstruct shapes accurately. Our multiscale deformation components better characterize the deformation of the shape, leading to reduced errors.

Figure 5. The reconstruction error when using sparse control points to deform the SCAPE dataset [65], comparing our method with Tan et al., Neumann et al., Huang et al. and Bernard et al. The control points are obtained by furthest point sampling, and the generalization error is measured by data-driven deformation with the deformation components extracted by the various methods. Our method achieves lower errors than the other methods.
5.2 Qualitative Evaluation

We also provide a qualitative evaluation of the extracted multiscale deformation components by comparing the visualization results of our method with those of the other methods. In our experiments, we have two levels of autoencoders; for each level, we extract the corresponding scale of deformation components from the fully connected layer using the same method as [4]. For every autoencoder, we can extract as many components as the dimension of its latent space, and we visualize every extracted component at each level. With our setting (K_z = [10, 5]), the first-level and second-level autoencoders output 10 and 5 deformation components, respectively. For all results, some components with only slight deformations are not shown, to keep the visualization brief and readable; we remove them with the post-process of Sec. 4.3. Furthermore, we color the regions that deform relative to the reference mesh.

As independently extracted deformation components do not correspond to each other, we adopt the visualization method of [2], [3], [4] and manually select components whose deformation areas are as similar as possible. Figs. 4 and 7 show the comparison with various methods on the SCAPE and Horse datasets. We show the corresponding deformation components in a similar area of the shape (the left arm of SCAPE and every major part of the horse). Our results are more meaningful, capturing major deformation modes in a multiscale manner.

We then verify whether our extracted components are more semantically meaningful than those of other methods through a user study. We test this on three datasets (SCAPE, Horse and Pants) and compare our method with five state-of-the-art methods [1], [2], [3], [4], [22].

Table 3. User study verifying the semantic meaning of the deformation components extracted by the various methods. We asked 10 participants and report the average score of each method on the three datasets.
(Table 3 lists, for the Horse, SCAPE and Pants datasets, the average user scores given to the components of Neumann et al., Bernard et al., Wang et al., Huang et al., Tan et al. and ours.)

We adopt a scoring protocol to judge whether a component is semantically meaningful. The scoring is based on the participants' perception of the semantic appropriateness of each component. For each dataset, we first render the components extracted by all methods in the same rendering style and mix them together. We let the users browse all the pictures to get an idea of the distribution first, and then ask them to score each rendered image in the range 0 to 100 using a slider. 10 participants were involved in the user study. We then compute the average score of each method over all participants, as shown in Table 3. Our method receives the highest scores on all the datasets. In summary, compared with existing methods, our method extracts plausible and reasonable localized deformation components with semantic meaning, while the other methods suffer from distortions and cannot extract multiscale deformation components. Note that our extracted components correspond to deformations, rather than semantic parts, so it is natural that they are not always aligned with a semantic segmentation; instead they are aligned at the motion-sequence level, as observed also in previous works [1], [2], [3], [4], [22].

In addition, we show more visualization results of multiscale deformation components on the following datasets: Swing [70], a fat person (ID: 50002) from Dyna [69], Flag, Dress and Skirt. The Flag, Dress and Skirt datasets are synthesized by physical simulation, and the Skirt dataset contains more complex motion and deformations. In Figs. 8, 9, 10, 12 and 13, the components extracted by our method form a multiscale structure, are consistent with semantic meanings, and show that our method can learn deformation components of different scales even on complex data.

We further compare shape editing with various methods given control points and the deformation components they extract. An example is shown in Fig. 14. We use 8 control points (rendered as green balls) on a shape of the SCAPE dataset [65], manually chosen by the user. We then apply the data-driven method [13] to reconstruct the shape with the help of the deformation components extracted by the various methods. As the left part of Fig. 14 shows, our result is plausible and similar to the ground truth, while the other results have artifacts and distortions, such as the right arm of [3] and the left arm of [4]. The right part of Fig. 14 shows the three main activated components during data-driven deformation. Since SCAPE contains much large-scale rotation, the method [4] focuses only on extracting large-scale deformations but fails to capture important fine details, which results in serious distortion of the arm due to the lack of essential components.
Figure 6. The multiscale structure of deformation components on the Pants dataset [68]. In this figure, we do not filter out the redundant components; the symbol '···' indicates further similar results that contain only slight deformations. This result is comparable with Fig. 11, which is processed by the removal step of Sec. 4.3. Our method learns deformation components of different scales; among them, a finite number of components suffices to represent the whole dataset, while many redundant components containing only slight deformations appear.

Table 4. Comparison of different training strategies and the influence of the attention mechanism on shape reconstruction. We compare joint training with separate training, with and without the attention mechanism. For each setting, we compute the E_rms of the reconstructed shapes for unseen shapes on the Swing, SCAPE, Pants, Humanoid, Horse, Flag, Dress, Jump, Fat and Face datasets. Joint training with the attention mechanism gives the best results.

Figure 7. Comparison of deformation components located in a similar area of the Horse [75], extracted by different methods. The deformed regions are highlighted in blue. The first row shows the results of other methods, and the second row gives our results. Every column shows a component located in a similar region.
5.3 Evaluation of Parameters

In this section, we evaluate the sensitivity of the model to its parameters, including the weights (λ_1, λ_2) in the loss function and the size of the deformation region (d in Λ_ik), as well as the effect of the attention mechanism on the generalization error and the difference between joint training and separate training.

5.3.1 λ_1 and λ_2

We test the influence of the parameters λ_1 and λ_2 on the generalization ability of the network, evaluated by the E_rms error of reconstructing unseen shapes on the SCAPE dataset. Fixing λ_1 to its default value, we vary λ_2 over a wide range, as the right curve of Fig. 16 shows; the result shows that our network is robust to the choice of λ_2. With λ_2 fixed, we vary λ_1 over a wide range, as the left curve of Fig. 16 shows; the result justifies that our network obtains lower errors when λ_1 = 10, which is chosen as the default value in our experiments.

5.3.2 d in Λ_ik

The other parameter to choose is d in Λ_ik, which determines the scale of the extracted components. d_1 and d_2 are cut-offs for the normalized geodesic distance, which ranges from 0 to 1, so the absolute size of a specific dataset has little effect on them; these parameters only reflect the relative size of the localized deformation components compared to the whole model. In our network, we stack two levels of autoencoders by default, so we need to choose two parameters d_1 and d_2 for the first-level and second-level autoencoders (AE_0 and AE_k), respectively. To extract multiscale deformation components, we require d_1 > d_2, so that AE_k covers a more detailed region than AE_0. As Fig. 17 shows, we test the generalization ability of the network (E_rms) on the SCAPE dataset with different combinations of d_1 and d_2. The figure illustrates that smaller d_1 and d_2 result in larger reconstruction errors. Although lower errors can be achieved when d_1 and d_2 are large enough (close to 0.5), the extracted deformation components then become more global (i.e., not localized).
This is a trade-off, and we must balance the generalization ability against the locality of the extracted deformation components. In our experiments, we set d = [d_1, d_2] to values that let the network extract localized components while keeping the reconstruction error low.

Figure 8. The multiscale structure of deformation components on the Skirt cloth dataset, which is our data simulated with a physics engine and contains many cloth folds and complex motion. In the figure, we filter out the redundant components. The first row shows some samples from the Skirt dataset, and the second and third rows visualize the deformation components extracted from the first-level AE and the second-level AEs, respectively. Our method learns deformation components of different scales with a multiscale structure.

Figure 9. The multiscale structure of deformation components on the Dress dataset (a lady walking forward in a skirt). In the figure, we have removed redundant components. The left part shows the result with the attention mechanism: the first row shows some examples of the Dress dataset from the side view; the second row presents the deformations caused by leg movement and the wind, extracted by the first-level AE (AE_0); and the third row shows the detailed deformations of the cloth during the movement, extracted by the second-level AEs (AE_k, 1 ≤ k ≤ K). In the second row, the two results on the left are deformations of the front of the skirt, and the two results on the right are deformations of the back of the skirt. The right part shows the result without the attention mechanism: the second and third rows again show the components extracted by the first-level AE and the second-level AEs after removal of redundant components, and the first row shows some sample data of the Dress dataset. Without attention, the results no longer have a multiscale structure like the left part does.

5.3.3 Number of Scale Levels

The architecture of this network supports multiple levels of deformation scales from coarse to fine. For almost all datasets tested in this paper, the two-level AE architecture is sufficient. The error between the input ACAP feature X_i and the reconstructed ACAP feature X̂_i, 1 ≤ i ≤ N, is divided by the norm of X_i to obtain the percentage in Eq. (8):

Perc_i = ‖X_i − X̂_i‖_F / ‖X_i‖_F    (8)

where ‖·‖_F is the Frobenius norm and Perc_i is the error percentage of the i-th shape in the dataset. We then take the maximum error percentage over the whole dataset as the proportion of deformation that has not been represented. The results are shown in Table 5: the percentage of the two-level AEs is very low on all datasets, so two levels of AEs are sufficient.
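A short sketch of how the per-shape percentage of Eq. (8) and its dataset maximum (the quantity reported in Table 5) can be computed; variable names are ours.

```python
import numpy as np

def max_error_percentage(X, X_hat):
    """Per-shape error percentage of Eq. (8) and its maximum over the dataset.
    X, X_hat: (N, V, mu) input and reconstructed ACAP features."""
    num = np.linalg.norm((X - X_hat).reshape(len(X), -1), axis=1)  # Frobenius norm per shape
    den = np.linalg.norm(X.reshape(len(X), -1), axis=1)
    perc = num / den
    return perc, perc.max()
```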
Figure 10. The multiscale structure of deformation components on the fat person (ID: 50002) from the Dyna [69] dataset. The first row shows example shapes of the dataset, the second row presents coarse-level deformation components from the first-level AE_0, and the third row shows the fine-level deformation components from the second-level AEs.

Table 5. The maximum percentage of reconstruction error over the shapes of every dataset tested in this paper, computed with Eq. (8). The two-level AEs represent the deformations very well with a tiny error percentage, while the reconstruction error with a one-level AE is much larger.

Dataset: Horse, Face, Jumping, Swing, Pants, Dress, Fat, Flag
Percentage of error (one-level AE): 0.1219, 0.06325, 0.2312, 0.4756, 0.1763, 0.5471, 0.1322, 0.1983
Percentage of error (two-level AEs): 0.005491, 0.001065, 0.007800, 0.06876, 0.01745, 0.06296, 0.005652, 0.02674
Figure 11. The multiscale structure of deformation components on the running Pants dataset [68]. In the figure, we have removed redundant components; our method learns deformation components at different scales. The first row shows example shapes of the running pants, the second row gives the deformation components caused by leg movement, extracted by the first-level AE, and the third row presents the detailed deformations of the cloth from the second-level AEs.

5.3.4 K_z

K_z describes the dimensions of the latent spaces of the two levels of AEs; this hyper-parameter decides the number of deformation components in each autoencoder of each level. Since the methods compared in this paper produce 50 deformation components, we let the two K_z values multiply to 50, giving the combinations K_z = [1, 50], [2, 25], [5, 10], [10, 5], [25, 2] and [50, 1]. The settings [1, 50] and [50, 1] are trivial, so we test the other four settings on the SCAPE dataset. Table 6 shows the results, which indicate that the network performs best with K_z = [25, 2]. However, that setting provides only two deformation components for each second-level AE, which is not reasonable for most datasets: for example, in Fig. 2 our network extracts three meaningful components for some branches of the SCAPE dataset, and most of our experimental results show more than two meaningful components extracted from the second-level AEs. We therefore choose the setting with the second-best performance, K_z = [10, 5].
Figure 12. The multiscale structure of deformation components on the Swing [70] dataset. The first row shows some example shapes of the dataset, the second row presents coarse-level deformation components from the first-level AE_0, and the third row shows the fine-level deformation components from the second-level AEs (AE_k, 1 ≤ k ≤ K).

5.3.5 Attention Mechanism and Training Strategy

Finally, we evaluate the effect of the attention mechanism and of different training strategies. The statistics are shown in Table 4, based on the experiments on the SCAPE dataset.
Figure 13. The multiscale structure of deformation components on the Flag dataset extracted by our method. The first row shows some example shapes of the dataset, the second row presents coarse-level deformation components from the first-level AE (AE_0), and the third row shows the fine-level deformation components from the second-level AEs (AE_k, 1 ≤ k ≤ K).

Figure 14. Comparison of shape reconstruction with different methods. The first column shows the error heat maps between the editing results and the ground truth, the second column presents the editing results of the different methods, and the right side of the figure shows the three main activated deformation components during data-driven deformation. We use the same control points manually chosen by the user (8 vertices on a shape of the SCAPE dataset [65]) to reconstruct the shape with the data-driven deformation method [13] and the same number of components. Our result is more plausible than those of the existing methods and is similar to the ground truth.

For the training strategy, we compare the reconstruction error (E_rms) obtained by training the network either jointly or separately. The results are shown in the first and third rows of Table 4: jointly training the network gives better results. We also perform an experiment to demonstrate the effect of the attention mechanism qualitatively and quantitatively. The results are shown in the first and second rows of Table 4: training with the attention mechanism yields lower errors, because each second-level autoencoder focuses on a different sub-region to minimize the loss.

Table 6. The reconstruction errors on unseen data from SCAPE with different settings of K_z, the dimensions of the latent spaces of the two levels of AEs (E_rms for K_z = [2, 25], [5, 10], [10, 5] and [25, 2]). The table shows that our method performs best with K_z = [25, 2]; however, that setting gives at most 2 deformation components for each second-level AE, which is not reasonable for most datasets. For example, in Fig. 2 our network extracts three meaningful components for some branches of the SCAPE dataset, and most of our experimental results show more than 2 meaningful components extracted from the second-level AEs. We therefore choose K_z = [10, 5].
For the qualitative evaluation, we show the results in Fig. 9. If we train our network without the attention mechanism, our architecture degenerates into the version of Tan et al. [4] with two levels of autoencoders; in the second level there is effectively only a single AE, because all of them are the same. Despite this, autoencoders at different levels are still able to extract different deformation components, as shown in the right part of Fig. 9. In the left part of Fig. 9, the second and third rows show the deformation components extracted by the first-level AE and the second-level AEs, respectively, after removal of redundant components, and the first row shows some sample data from the Dress dataset. The results illustrate that the deformation components no longer have a multiscale structure when the second-level AEs do not focus on sub-regions to extract the deformation components.
Multiscale shape editing is an important application in computer graphics. Users usually start by editing the overall shape, and then focus on adjusting the details. With existing methods, the extracted deformation components either contain some global information [1], [22], making them unsuitable for local editing, or focus too much on large-scale deformations and fail to capture the essential small-scale deformations needed for faithful reconstruction, leading to distortions as in [4]; both cases reduce the user's editing efficiency for 3D animation. Given a shape deformation dataset that contains diverse deformations, our method produces multiscale localized deformation components that are visually and semantically meaningful and correspond to typical deformation behaviour. Combined with data-driven deformation [13], this allows users to edit shapes efficiently and intuitively under the constraints of the control points and the subspace spanned by the extracted deformation components; please refer to [13] for details of the data-driven deformation.

Fig. 15 shows some examples. For the SCAPE dataset, we design two actions: raising the left arm (Action 1) and then turning the wrist (Action 2). All the compared methods perform well in Action 1. However, in Action 2, only our method twists the wrist naturally, with the help of our extracted multiscale deformation components; all other methods show various distortions. For the Horse dataset, we also design two actions: raising the whole tail (Action 1) and then twisting the end of the tail (Action 2). Our method bends the end of the tail naturally after raising the whole tail, while the other methods produce more distortions and even change the entire tail, especially the methods [1], [22] based on the Euclidean coordinate representation. In summary, our extracted multiscale deformation components perform better than existing methods in multiscale shape editing. See the accompanying video for more results.
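As a rough illustration of editing within the subspace spanned by the extracted components, the sketch below treats the components as per-vertex displacement fields and solves a small regularized least-squares problem so that a handful of user-chosen control vertices reach their targets. This is only a linear stand-in for the idea: the actual pipeline of [13] solves for the deformation in a nonlinear, rotation-aware representation, and the names and the regularization weight here are illustrative.

```python
import numpy as np

def edit_with_components(rest_verts, components, handle_ids, handle_targets, reg=1e-3):
    # rest_verts:     (N, 3) vertices of the shape being edited.
    # components:     (M, N, 3) extracted deformation components, treated here as
    #                 per-vertex displacement fields.
    # handle_ids:     indices of the user-chosen control vertices.
    # handle_targets: (H, 3) desired positions of those vertices.
    M = components.shape[0]
    A = components[:, handle_ids, :].reshape(M, -1).T          # (3H, M)
    b = (handle_targets - rest_verts[handle_ids]).reshape(-1)  # (3H,)
    # Regularized least squares: keep the edit inside the learned subspace without
    # exaggerating components that the few constraints barely determine.
    w = np.linalg.solve(A.T @ A + reg * np.eye(M), A.T @ b)    # component weights
    return rest_verts + np.tensordot(w, components, axes=1)    # edited (N, 3) shape
```

The regularization is what keeps such few-constraint edits stable: components that the control points barely touch are left close to zero instead of being amplified.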
Figure 15. Comparison of multiscale shape editing. We compare the editing results of different methods on the SCAPE [65] and Horse [72] datasets. The results demonstrate that our extracted deformation components are suitable for multiscale shape editing. The first column shows the editing steps, and every row gives the deformed results of the corresponding action for the various methods. The differences from other methods are highlighted in the orange rectangles, with closeups showing the details. The existing methods exhibit obvious distortions, demonstrating the superiority of our multiscale deformation components.

Figure 16. E_rms errors of generating unseen shapes with our overall autoencoder w.r.t. the weights λ_1 and λ_2. The figure shows that our network obtains lower errors when λ_1 = 10 and is robust to different choices of λ_2.

LIMITATIONS AND CONCLUSION
In this paper, we propose a novel autoencoder with an attention mechanism to extract multiscale localized deformation components. We use stacked AEs to extract deformation components at multiple scales, which helps capture richer information for better shape editing and gives better generalization ability (see Table 2). Moreover, the first-level AE extracts coarse-level components and learns attention masks that help the second-level AEs focus on relevant sub-regions to extract fine-level components. Extensive quantitative and qualitative evaluations show that our method is effective, outperforming state-of-the-art methods. The deformation components extracted by our method can be used for multiscale shape editing in computer animation, which demonstrates that they are effective and meaningful for reducing the user's effort. In the future, this work can offer more solutions and explorations for applying the attention mechanism to 3D shape analysis and synthesis.

Figure 17. The relationship between the E_rms of generating unseen data and (d_1, d_2) on the SCAPE data, where d_1 and d_2 are the latent dimensions of the first-level AE and the second-level AEs (AE_k, 1 ≤ k ≤ K). The noise in the figure is a result of the gradient-descent training procedure. The black region (below the main diagonal) contains no data, as by definition d_1 should be larger than d_2.

Although our method can analyze a shape dataset in a multiscale manner and extract components at different scales for easy shape editing, it has some limitations. While datasets such as ShapeNet contain general 3D shapes, our method cannot handle such data, especially CAD models for manufacturing, because the shapes it analyzes must share the same vertex connectivity. Also, our method uses a fixed network architecture and attention mechanism to analyze shapes in a multiscale manner, so the multiscale structure found for every dataset is constrained to follow this fixed network structure, and we cannot capture the variation of the multiscale structure between different categories in a dataset. In the future, we plan to explore a method that can automatically analyze and capture the variation of the multiscale structure on ShapeNet. Finally, the post-processing (Sec. 4.3) could also be merged into the neural network so that the number of sub-parts is predicted automatically.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (No. 61872440 and No. 61828204), the Beijing Municipal Natural Science Foundation (No. L182016), and the Royal Society Newton Advanced Fellowship (No. NAF\R2\ ).

REFERENCES

[1] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt, "Sparse localized deformation components," ACM Transactions on Graphics (TOG), vol. 32, no. 6, p. 179, 2013.
[2] Z. Huang, J. Yao, Z. Zhong, Y. Liu, and X. Guo, "Sparse localized decomposition of deformation gradients," in Computer Graphics Forum, vol. 33, no. 7. Wiley Online Library, 2014, pp. 239–248.
[3] Y. Wang, G. Li, Z. Zeng, and H. He, "Articulated-motion-aware sparse localized decomposition," in Computer Graphics Forum, vol. 36, no. 8. Wiley Online Library, 2017, pp. 247–259.
[4] Q. Tan, L. Gao, Y. Lai, J. Yang, and S. Xia, "Mesh-based autoencoders for localized deformation component analysis," in AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 2452–2459.
[5] C. N. Alleman, J. W. Foulk, A. Mota, H. Lim, and D. J. Littlewood, "Concurrent multiscale modeling of microstructural effects on localization behavior in finite deformation solid mechanics," Computational Mechanics, vol. 61, no. 1-2, pp. 207–218, 2018.
[6] M. Mathew, A. Ellenberg, S. Esola, M. McCarthy, I. Bartoli, and A. Kontsos, "Multiscale deformation measurements using multispectral optical metrology," Structural Control and Health Monitoring, vol. 25, no. 6, p. e2166, 2018.
[7] M. K. Abeyratne, W. Freeden, and C. Mayer, "Multiscale deformation analysis by Cauchy-Navier wavelets," Journal of Applied Mathematics, vol. 2003, 2002.
[8] K. C. Lam, T. C. Ng, and L. M. Lui, "Multiscale representation of deformation via Beltrami coefficients," Multiscale Modeling & Simulation, vol. 15, no. 2, pp. 864–891, 2017.
[9] Y. Yang, W. Xu, X. Guo, K. Zhou, and B. Guo, "Boundary-aware multidomain subspace deformation," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 10, pp. 1633–1645, 2013.
[10] H. Hamidian, Z. Zhong, F. Fotouhi, and J. Hua, "Surface registration with eigenvalues and eigenvectors," IEEE Transactions on Visualization and Computer Graphics, 2019.
[11] S.-L. Liu, Y. Liu, L.-F. Dong, and X. Tong, "RAS: A data-driven rigidity-aware skinning model for 3D facial animation," in Computer Graphics Forum, vol. 39, no. 1. Wiley Online Library, 2020, pp. 581–594.
[12] J. Fu, H. Zheng, and T. Mei, "Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017, p. 3.
[13] L. Gao, Y.-K. Lai, J. Yang, Z. Ling-Xiao, S. Xia, and L. Kobbelt, "Sparse data driven mesh deformation," IEEE Transactions on Visualization and Computer Graphics, 2019.
[14] W. Xu, J. Wang, K. Yin, K. Zhou, M. Van De Panne, F. Chen, and B. Guo, "Joint-aware manipulation of deformable models," ACM Transactions on Graphics (TOG), vol. 28, no. 3, pp. 1–9, 2009.
[15] M. E. Yumer and L. B. Kara, "Co-constrained handles for deformation in shape collections," ACM Transactions on Graphics (TOG), vol. 33, no. 6, p. 187, 2014.
[16] M. E. Yumer, S. Chaudhuri, J. K. Hodgins, and L. B. Kara, "Semantic shape editing using deformation handles," ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 86, 2015.
[17] K. Zhou, J. Huang, J. Snyder, X. Liu, H. Bao, B. Guo, and H.-Y. Shum, "Large mesh deformation using the volumetric graph Laplacian," in ACM SIGGRAPH 2005 Papers, 2005, pp. 496–503.
[18] L. Gao, Y.-K. Lai, D. Liang, S.-Y. Chen, and S. Xia, "Efficient and flexible deformation representation for data-driven surface modeling," ACM Transactions on Graphics (TOG), vol. 35, no. 5, p. 158, 2016.
[19] M. Alexa and W. Müller, "Representing animations by principal components," in Computer Graphics Forum, vol. 19, no. 3. Wiley Online Library, 2000, pp. 411–418.
[20] L. Gao, G. Zhang, and Y. Lai, "Lp shape deformation," Science China Information Sciences, vol. 55, no. 5, pp. 983–993, 2012.
[21] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
[22] F. Bernard, P. Gemmar, F. Hertel, J. Goncalves, and J. Thunberg, "Linear shape deformation models with local support using graph-based structured matrix factorisation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5629–5638.
[23] J. R. Tena, F. De la Torre, and I. Matthews, "Interactive region-based linear 3D face models," in ACM Transactions on Graphics (TOG), vol. 30, no. 4. ACM, 2011, p. 76.
[24] T. Neumann, K. Varanasi, N. Hasler, M. Wacker, M. Magnor, and C. Theobalt, "Capture and statistical modeling of arm-muscle deformations," in Computer Graphics Forum, vol. 32, no. 2pt3. Wiley Online Library, 2013, pp. 285–294.
[25] S. Fröhlich and M. Botsch, "Example-driven deformations based on discrete shells," in Computer Graphics Forum, vol. 30, no. 8. Wiley Online Library, 2011, pp. 2246–2257.
[26] M. Chen, C. Wang, and L. Liu, "Cross-domain retrieving sketch and shape using cycle CNNs," Computers & Graphics, 2020.
[27] C. Jiang, D. Wang, J. Huang, P. Marcus, M. Nießner et al., "Convolutional neural networks on non-uniform geometrical signals using Euclidean spectral transformation," arXiv preprint arXiv:1901.02070, 2019.
[28] K. Sarkar, K. Varanasi, and D. Stricker, "3D shape processing by convolutional denoising autoencoders on local patches," IEEE, 2018, pp. 1925–1934.
[29] C. Jiang, D. Lansigan, P. Marcus, and M. Nießner, "DDSL: Deep differentiable simplex layer for learning geometric signals," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8769–8778.
[30] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in IEEE Conference on Intelligent Robots and Systems, 2015, pp. 922–928.
[31] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 945–953.
[32] M. E. Yumer and N. J. Mitra, "Learning semantic deformation flows with 3D convolutional networks," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 294–311.
[33] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong, "O-CNN: Octree-based convolutional neural networks for 3D shape analysis," ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 72, 2017.
[34] P.-S. Wang, C.-Y. Sun, Y. Liu, and X. Tong, "Adaptive O-CNN: A patch-based deep representation of 3D shapes," ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–11, 2018.
[35] P.-S. Wang, Y. Liu, and X. Tong, "Deep octree-based CNNs with output-guided skip connections for 3D shape and scene completion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 266–267.
[36] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, "PointCNN: Convolution on X-transformed points," Advances in Neural Information Processing Systems (NIPS), pp. 820–830, 2018.
[37] H. Lei, N. Akhtar, and A. Mian, "Spherical convolutional neural network for 3D point clouds," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9631–9640.
[38] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph CNN for learning on point clouds," ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019.
[39] B. Shi, S. Bai, Z. Zhou, and X. Bai, "DeepPano: Deep panoramic representation for 3-D shape recognition," IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2339–2343, 2015.
[40] K. Sarkar, B. Hampiholi, K. Varanasi, and D. Stricker, "Learning 3D shapes as multi-layered height-maps using 2D convolutional networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 71–86.
[41] H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. Kim, and E. Yumer, "Learning local shape descriptors with view based convolutional neural networks," ACM Transactions on Graphics (TOG), vol. 2, 2018.
[42] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2224–2232.
[43] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in International Conference on Machine Learning (ICML), 2016, pp. 2014–2023.
[44] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems (NIPS), 2016. [Online]. Available: https://arxiv.org/abs/1606.09375
[45] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in International Conference on Learning Representations (ICLR), 2014.
[46] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 3844–3852.
[47] Q. Tan, L. Gao, Y. Lai, and S. Xia, "Variational autoencoders for deforming 3D mesh models," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5841–5850.
[48] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black, "Generating 3D faces using convolutional mesh autoencoders," in European Conference on Computer Vision (ECCV). Springer International Publishing, 2018, pp. 725–741. [Online]. Available: http://coma.is.tue.mpg.de/
[49] E. Tretschk, A. Tewari, M. Zollhöfer, V. Golyanik, and C. Theobalt, "DEMEA: Deep mesh autoencoders for non-rigidly deforming objects," arXiv e-prints, 2019.
[50] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia, "Deformable shape completion with graph convolutional autoencoders," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1886–1895.
[51] L. Fulton, V. Modi, D. Duvenaud, D. I. Levin, and A. Jacobson, "Latent-space dynamics for reduced deformable simulation," in Computer Graphics Forum, vol. 38, no. 2. Wiley Online Library, 2019, pp. 379–391.
[52] S. Ma, J. Fu, C. W. Chen, and T. Mei, "DA-GAN: Instance-level image translation by deep attention generative adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5657–5666.
[53] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, "Dual attention matching network for context-aware feature sequence based person re-identification," 2018.
[54] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, "Attention-aware compositional network for person re-identification," 2018.
[55] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[56] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 714–722.
[57] B. Zhuang, Q. Wu, C. Shen, I. Reid, and A. van den Hengel, "Parallel attention: A unified framework for visual object discovery through dialogs and queries," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4252–4261.
[58] X. Long, C. Gan, G. de Melo, J. Wu, X. Liu, and S. Wen, "Attention clusters: Purely attention based local feature integration for video classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7834–7843.
[59] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[60] J. Lu, J. Yang, D. Batra, and D. Parikh, "Hierarchical question-image co-attention for visual question answering," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 289–297.
[61] P. Wang, Q. Wu, C. Shen, and A. van den Hengel, "The VQA-machine: Learning how to use existing vision algorithms to answer new questions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 4, 2017, pp. 1173–1182.
[62] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164, 2017.
[63] O. Sorkine and M. Alexa, "As-rigid-as-possible surface modeling," in Proceedings of EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, 2007, pp. 109–116.
[64] K. Crane, C. Weischedel, and M. Wardetzky, "Geodesics in heat: A new approach to computing distance based on heat flow," ACM Transactions on Graphics (TOG), vol. 32, pp. 152:1–152:11, 2013.
[65] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: Shape completion and animation of people," in ACM Transactions on Graphics (TOG), vol. 24, no. 3. ACM, 2005, pp. 408–416.
[66] R. W. Sumner and J. Popović, "Deformation transfer for triangle meshes," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 399–405, 2004.
[67] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz, "Spacetime faces: High-resolution capture for modeling and animation," in ACM Transactions on Graphics (SIGGRAPH), 2004, pp. 548–558.
[68] R. White, K. Crane, and D. A. Forsyth, "Capturing and animating occluded cloth," in ACM Transactions on Graphics (TOG), vol. 26, no. 3. ACM, 2007, p. 34.
[69] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black, "Dyna: A model of dynamic human shape in motion," ACM Transactions on Graphics (Proc. SIGGRAPH), vol. 34, no. 4, pp. 120:1–120:14, Aug. 2015.
[70] D. Vlasic, I. Baran, W. Matusik, and J. Popović, "Articulated mesh animation from multi-view silhouettes," in ACM Transactions on Graphics (TOG), vol. 27, no. 3. ACM, 2008, pp. 97:1–97:9.
[71] D. Kingma and J. Ba, "ADAM: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[72] R. W. Sumner and J. Popović, "Deformation transfer for triangle meshes," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 399–405, 2004.
[73] L. Kavan, P.-P. Sloan, and C. O'Sullivan, "Fast and efficient skinning of animated meshes," in Computer Graphics Forum, vol. 29, no. 2. Wiley Online Library, 2010, pp. 327–336.
[74] L. Vasa and V. Skala, "A perception correlated comparison method for dynamic meshes," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 2, pp. 220–230, 2011.
[75] R. W. Sumner, M. Zwicker, C. Gotsman, and J. Popović, "Mesh-based inverse kinematics," in