High-order Differentiable Autoencoder for Nonlinear Model Reduction
SIYUAN SHEN,
State Key Lab of CAD&CG, Zhejiang University, China
YANG YIN,
Clemson University, USA
TIANJIA SHAO,
State Key Lab of CAD&CG, Zhejiang University, China
HE WANG,
University of Leeds, United Kingdom
CHENFANFU JIANG,
University of Pennsylvania, USA
LEI LAN,
Clemson University, USA
KUN ZHOU,
State Key Lab of CAD&CG, Zhejiang University, China
Fig. 1. In this paper, we exploit deep autoencoder (DAE) networks to accelerate physics-based simulation. In order to model nonlinear subspace dynamics accurately, second- and higher-order derivatives of the deep decoder net must be evaluated efficiently enough to match the subspace simulation frame rate. We address this technical challenge by applying complex-step network perturbations to the deep net collectively. This is the first time a high-order differentiable neural net is employed in physical simulation problems. Our method can be further strengthened with the domain decomposition method, as a nonlinear DAE better captures local deformation effects. In this example, the puffer ball has elastic strings, and we assign a low-dimensional linear subspace and a low-dimensional nonlinear subspace to each string. With the help of substructured deformation, DAE-based nonlinear reduction produces interesting animation effects. We believe this is a representative example showcasing the advantage of data-driven animation using a DAE. As the geometries of all the strings are the same, generating local training poses is more effective.

This paper provides a new avenue for exploiting deep neural networks to improve physics-based simulation. Specifically, we integrate classic Lagrangian mechanics with a deep autoencoder to accelerate elastic simulation of deformable solids. Due to the inertia effect, the dynamic equilibrium cannot be established without evaluating the second-order derivatives of the deep autoencoder network. This is beyond the capability of off-the-shelf automatic differentiation packages and algorithms, which mainly focus on gradient evaluation. Solving the nonlinear force equilibrium is even more challenging if the standard Newton's method is to be used, because we then need a third-order derivative of the network to obtain the variational Hessian. We attack those difficulties by exploiting the complex-step finite difference, coupled with reverse automatic differentiation. This strategy allows us to enjoy the convenience and accuracy of the complex-step finite difference and, in the meantime, to deploy complex-value perturbations as collectively as possible to avoid excessive network passes.
Authors' addresses: Siyuan Shen, State Key Lab of CAD&CG, Zhejiang University, 866 Yuhangtang Rd, Hangzhou, 310058, China, [email protected]; Yang Yin, Clemson University, USA, [email protected]; Tianjia Shao, State Key Lab of CAD&CG, Zhejiang University, China, [email protected]; He Wang, University of Leeds, United Kingdom, [email protected]; Chenfanfu Jiang, University of Pennsylvania, USA, [email protected]; Lei Lan, Clemson University, USA, [email protected]; Kun Zhou, State Key Lab of CAD&CG, Zhejiang University, China, [email protected].
With a GPU-based implementation, we are able to wield deep autoencoders (e.g., 10+ layers) with a relatively high-dimension latent space in real time. Along this pipeline, we also design a sampling network and a weighting network to enable weight-varying Cubature integration in order to incorporate nonlinearity in the model reduction. We believe this work will inspire and benefit future research efforts in nonlinearly reduced physical simulation problems.

CCS Concepts: • Computing methodologies → Physical simulation; Dimensionality reduction and manifold learning.

Additional Key Words and Phrases: Model reduction, Autoencoder, Differentiation, GPU, Deformable model
Reference Format:
Siyuan Shen, Yang Yin, Tianjia Shao, He Wang, Chenfanfu Jiang, Lei Lan, and Kun Zhou. 2021. High-order Differentiable Autoencoder for Nonlinear Model Reduction. arXiv (February 2021), 15 pages.
Model reduction is a widely used and highly effective technique for accelerating physically based simulation. It is also sometimes known as reduced-order simulation or subspace simulation. While named differently, the core idea is to build a linear subspace with reduced degrees of freedom (DOFs) so that the physical equations can be solved with a system of a smaller size. This approach is sensible because many parts of the real physical world evolve smoothly and continuously in time and space. Sharp and high-frequency physical changes are less common and should be treated with dedicated numerical methods. Existing model reduction methods have been dominantly linear reductions with a constant tangent space. The expressivity of linear reduction is a known limitation. As many physical phenomena are intrinsically nonlinear, a linearly reduced model only covers a small fraction of the dynamics space; all the information outside of the subspace is filtered out. Thus, one has to (substantially) increase the dimensionality of the subspace to incorporate a desired nonlinear effect, even if this effect itself is of low rank (think of a bead traveling on a circle).

One question arises naturally: can we build a nonlinear reduction framework with a time-varying (as opposed to constant) tangent space that best fits the "local" dynamics? The challenges are twofold. First, the underlying manifold representing the nonlinear dynamics is often too complex to be expressed in a closed form. Developing a nonlinear subspace-fullspace transformation, as a counterpart of modal analysis in linear reduction, is theoretically difficult. Second, a nonlinear reduction brings extra computational burdens to the simulation, largely originating from the need to evaluate the derivatives of the subspace-fullspace transformation function. The computational cost goes up quickly with respect to the subspace size, which in turn neutralizes the original motivation of applying model reduction.

In this paper, we propose a new nonlinear model reduction framework that is tightly coupled with classic Lagrangian mechanics. Although we demonstrate the effectiveness of our method in the context of elastic simulation, we believe our method could also be useful in other physics-based simulation problems like fluid [Kim and Delaney 2013] or cloth animations [Hahn et al. 2014]. Our reduction mechanism is data-driven with a deep autoencoder (DAE) in the loop, which obviates the need for a closed-form subspace-fullspace transformation function. The DAE is an unsupervised learning architecture skilled in data compression [Hinton and Salakhutdinov 2006] and has recently been proven effective in deformable simulation [Fulton et al. 2019]. Along this direction, we augment the DAE network with the complex-step finite difference (CSFD) method [Martins et al. 2003], enabling its high-order differentiability, so that the neural net can be computationally integrated with the Lagrangian formulation. To achieve this goal, we make several mentionable technical contributions:

• Efficient high-order differentiable deep autoencoder.
A physically accurate coupling between the DAE and elastic dynamics requires the first- and second-order derivative information (for the Newtonian equations of motion). Existing differentiation techniques such as backpropagation (BP) [Hecht-Nielsen 1992] for neural networks or automatic differentiation (AD) [Bücker et al. 2006] for more general computations are optimized for gradient estimation only, and become cumbersome in high-order cases. We leverage CSFD to facilitate the differentiation of the decoder network. Conceptually straightforward, this is however not as easy as it appears. Despite its excellent numerical robustness and accuracy, CSFD needs to apply a complex-value perturbation for each input variable, leading to excessive forward passes of the deep neural net. We resolve this challenge by applying the function perturbation collectively and deploying CSFD inside other differentiation procedures such as BP and directional derivatives. With a GPU-based implementation, we simulate complex nonlinear models in real time with a deep decoder net in the loop.

• Coupling PCA with deep encoding net.
The tangent space of a DAE may vary drastically to incorporate nonlinearity seen in the training poses, yielding a bumpy and uneven deformation manifold. This is analogous to over-fitting. A possible cure is to use a contractive autoencoder or CAE [Rifai et al. 2011b]. CAE adds a regularizer to the objective function that forces the network to learn a function that is robust under slight input variations. While this could be a viable solution, we propose a more convenient and effective option. In our framework, the DAE is constructed within the residual space of a standard PCA. In other words, the DAE is designed to be complementary to an underlying linear subspace, and the latter guarantees the existence of a smooth tangent variation. With this design, we can deepen the DAE architecture to capture nonlinear and salient deformation poses.

• Weight-varying subspace integration using deep neural networks.
Reduced simulation is often coupled with sparse force and Hessian integration, which downsamples element-wise volumetric integration to a small collection of key elements, called Cubature elements [An et al. 2008]. After Cubature elements are selected, one also needs to compute their integration weights by solving a non-negative least-squares problem. The weight coefficient of a Cubature element is typically fixed given the training set. This is reasonable for linear reduction and works well in practice. However, in the context of nonlinear reduction, as the tangent space varies along the simulation, fixed-weight Cubature is problematic. To this end, we propose a deep neural network (DNN) based sampling method that fully replaces Cubature training. Our DNN has two modules. The first module is a graph convolution network (GCN) that outputs the probability of an element being a Cubature element. On top of it, the second module is a DNN that predicts the weights of selected Cubature elements. The last layer of this DNN carries out a per-neuron square operation to ensure the final network output is non-negative. The training alternates between those two modules. Unlike the conventional Cubature sampling strategy, our network-based approach is able to select multiple Cubature elements in each iteration, thus greatly shortening the training time.

We have evaluated our framework on various simulation scenarios, and our method produces visually plausible results in real time or at an interactive rate. We also notice that our nonlinear model reduction framework synergizes with the domain decomposition method [Barbič and Zhao 2011; Wu et al. 2015; Yang et al. 2013]: a small-size nonlinear subspace captures deformation effects much better at a local domain than over the entire deformable body. To this end, we also demonstrate examples combining the DAE and domain decomposition. As a natural follow-up of our method, we do not intend to over-claim this extension as our contribution. Model reduction, regardless of being nonlinear or linear, seeks smart trade-offs among simulation effects, accuracy, and performance. Arguing conclusively that nonlinear model reduction is always better than linear model reduction techniques would be too bold and over-confident for us. Indeed, one should scrutinize various aspects in practice, such as the problem size, expected results, time budget, hardware resources, etc., before choosing a specific simulation algorithm. Regardless, we do believe the techniques proposed in this paper are worthy and non-trivially advance state-of-the-art model reduction methods.
Model reduction has been successfully employed in many simulation-related problems in computer graphics, including fluid dynamics [Kim and Delaney 2013; Treuille et al. 2006], cloth animation [Hahn et al. 2014], shape deformation [Von-Tycowicz et al. 2015; Wang et al. 2015], material design [Musialski et al. 2016; Xu et al. 2015], animation control [Barbič et al. 2009, 2012], etc. In this paper, we narrow our focus to using data-driven nonlinear model reduction to improve the elastic simulation of solid objects.

There are several well-established numerical solutions for deformable models such as the finite element method (FEM) [Zienkiewicz et al. 1977], the finite difference method [Zhu et al. 2010], meshless methods [Martin et al. 2010; Müller et al. 2005], or mass-spring systems [Liu et al. 2013]. Most of them end up solving a large-scale nonlinear system if an implicit time integration scheme is used. For high-resolution models, computing their time-varying nonlinear dynamics is expensive. Speeding up deformable simulation can be achieved using carefully designed numerical treatments like the multigrid method [Tamstorf et al. 2015; Zhu et al. 2010], delayed matrix update [Hecht et al. 2012], or parallelizable solvers [Fratarcangeli et al. 2016; Wang and Yang 2016]. These methods focus on improving the performance of the fullspace nonlinear optimization without condensing the simulation scale.

Acceleration can also be achieved using model reduction, which removes less important DOFs and creates a subspace representation of the fullspace DOFs. Modal analysis [Choi and Ko 2005; Hauser et al. 2003; Pentland and Williams 1989] and its first-order derivatives [Barbič and James 2005; Yang et al. 2015] are often considered the most effective way for subspace construction. Displacement vectors from recent fullspace simulations can also be utilized as subspace bases [Kim and James 2009]. Alternatively, it is also viable to coarsen the geometric representation to prescribe the dynamics of a fine model like skin rigging, a technique widely used in animation systems [James and Twigg 2005]. Analogously, Capell and colleagues [2002] deformed an elastic body using an embedded skeleton; Gilles and colleagues [2011] used rigid frames to drive the deformable simulation; Faure and colleagues [2011] used scattered handles for reduced models; Lei and colleagues [2020] combined model reduction with collision processing using the medial axis transform; Martin and colleagues [2010] used sparsely distributed integrators named elastons to model the nonlinear dynamics of rods, shells, and solids uniformly.

Those prior arts demonstrate impressive results, often with orders-of-magnitude performance speedups from model reduction. In most cases, the generalized coordinate linearly depends on the fullspace displacement in the form of u = Uq with U being constant. This is why we refer to them as linear subspace methods. On the contrary, nonlinear reduction holds a more complicated relation between generalized and fullspace coordinates. For instance, nonlinear modal analysis [Pesheck et al. 2001] aims to extend its linear version, and it has been used in structural analysis [Setio et al. 1992]. However, it is hardly useful for simulation acceleration: extracting the modal space for a given configuration is normally dealt with by solving an eigenproblem (i.e., as in linear modal analysis), and it is clearly infeasible to exhaustively sample all the system configurations even in the pre-computation stage.
Only when the animation follows some pre-known patterns may we re-use the solution of the rest-shape eigenproblem [Mukherjee et al. 2016] or interpolate multiple sparsely chosen linear subspaces [Xu and Barbič 2016]. Due to this challenge, the nonlinear subspace method is less explored.

In this paper, we do not aim to derive a closed-form mathematical formulation connecting the generalized coordinate and the fullspace coordinate. Instead, we leave this challenge to a deep neural network that learns the map directly from many seen simulation poses. This is a straightforward data-driven approach and has been exploited in graphics for years [Ladický et al. 2015; Wang et al. 2011]. Our novelty, however, is to enable its efficient and high-order differentiability so that the DNN can be embedded into classic physical simulation frameworks such as Lagrangian mechanics [Brizard 2014]. The DNN used in our framework is a deep autoencoder or DAE [Hinton and Salakhutdinov 2006], originally designed for dimension reduction. Its superior performance in data compression and multidimensional scaling has quickly drawn much attention. The DAE has been successfully deployed in NLP [Socher et al. 2011], image/video compression [Ballé et al. 2016; Habibian et al. 2019], GANs [Makhzani et al. 2015], facial recognition [Zeng et al. 2018], and 3D shape analysis [Nair and Hinton 2009], just to name a few. The volume of DAE-related studies is too vast to be covered here.

The most relevant study to our work is the contribution from Fulton and colleagues [2019]. Indeed, we are strongly motivated and inspired by those recent efforts [Fulton et al. 2019; Wiewel et al. 2019] that also seek DAE-based nonlinear reduction. To this end, we re-examine each step along the pipeline of reduced simulation and devise a comprehensive solution to couple the DAE with nonlinear elastic simulation seamlessly. One core ingredient is the high-order differentiability, which should be evaluated efficiently enough to match the frame rate of model reduction. We deliver this important technical asset by leveraging the complex-step finite difference or CSFD [Martins et al. 2003]. Similar to the standard finite difference, CSFD applies a perturbation to the function input and evaluates how the perturbation alters the function output. This perturbation, however, is a complex-value quantity in CSFD, which avoids the numerical issue of subtractive cancellation. While CSFD seems to be a possible approach to the differentiability of the DAE, its efficient deployment for high-order differentiation remains challenging in practice. A naïve implementation of CSFD requires one forward pass of the net for each input variable. This scheme leads to a prohibitive number of network passes for a DAE-enabled Newton iteration. As discussed in § 4, we attack this difficulty by applying the complex stepping in CSFD collectively whenever possible to remove redundant network passes. This method allows us to efficiently run the DAE differentiation on the GPU and obtain its high-order derivatives in milliseconds. In order to ensure the smoothness of the simulation tangent space, our framework consists of two layers: the first layer is a standard PCA-based linear subspace, and within its orthogonal space, the DAE is deployed to capture nonlinear deformations more effectively. Lastly, we also design a DNN-based Cubature training procedure to generate pose-dependent weight coefficients for a more accurate subspace integration.
To make the paper more self-contained, we start with a brief review of the linear model reduction framework and then show its nonlinear generalization with the DAE. Here, we assume that a DAE is differentiable and defer the discussion about how to compute its first- and high-order derivatives to the next section.
Under the FEM discretization, the motion of an elastically deformable solid can be described with the Euler-Lagrange equation:

M \ddot{u} + f_{damp}(u, \dot{u}) + f_{int}(u) = f_{ext},    (1)

where M ∈ R^{n×n} is the fullspace mass matrix; f_{int} and f_{ext} are the nonlinear internal force and the external force. Here, we lump M to be a diagonal matrix. f_{damp} is the damping force, and it is often modeled, under the assumption of Rayleigh damping, as:

f_{damp} = ( \alpha M + \beta \, \partial f_{int}(u)/\partial u ) \dot{u}.    (2)

Eq. (1) describes the force equilibrium at all n DOFs of the unknown displacement vector u. Computing u in Eq. (1) using nonlinear methods like Newton's method requires solving an n-by-n linearized system repeatedly, which is slow and expensive for large-scale models. Linear model reduction prescribes the kinematics of this n-dimension system with a set of generalized coordinates p such that u = Up. U ∈ R^{n×n_p} is sometimes called the subspace matrix, which is constant in linear reduction. An important convenience brought by the linearity is that the time derivatives \dot{u} = U\dot{p} and \ddot{u} = U\ddot{p} follow the same relation. Therefore, Eq. (1) can be projected into the column space of U as:

M_p \ddot{p} + U^T f_{damp}(Up, U\dot{p}) + U^T f_{int}(Up) = U^T f_{ext},    (3)

where M_p = U^T M U is the reduced mass matrix. Eq. (3) has the same structure as Eq. (1), albeit under the more compact representation of p.

Nonlinear reduction also uses a set of generalized coordinates q. However, the relation between u and q is in the more generic form u = D(q). Intuitively, D prescribes an n_q-dimension deformation manifold embedded in R^n. Applying time differentiation on both sides yields:

\dot{u} = dD(q)/dt = ( \partial D(q)/\partial q ) \dot{q} = J \dot{q},    (4)

and

\ddot{u} = d/dt( ( \partial D(q)/\partial q ) \dot{q} ) = ( H \cdot \dot{q} ) \dot{q} + J \ddot{q},    (5)

where J = \partial D(q)/\partial q ∈ R^{n×n_q} is the Jacobian of D(q), which depends on q and spans the tangent space at a given reduced coordinate. H ∈ R^{n×n_q×n_q} is a third-order tensor holding the second-order derivative (i.e., the Hessian) of D. Substituting Eqs. (4) and (5) into Eq. (1), followed by a tangent space projection, gives the nonlinearly reduced equation of motion:

J^T M ( ( H \cdot \dot{q} ) \dot{q} + J \ddot{q} ) + J^T f_{int}( D(q) ) = J^T f_{ext}.    (6)

Here, the damping force term is dropped for a more concise notation. We note that if D(q) is linear, H vanishes and J = U, so Eq. (6) echoes Eq. (3) completely.

Given a time integration algorithm on q, e.g., the implicit Euler method, we have q = \bar{q} + h\dot{q} and \dot{q} = \bar{\dot{q}} + h\ddot{q}, where h is the time step size and the bar indicates a kinematic variable from the previous time step. The final system that needs to be solved becomes:

J^T M J ( q - \bar{q} - h\bar{\dot{q}} ) + J^T f^*_{int}(q) + h^2 J^T f_{int}(q) = h^2 J^T f_{ext},    (7)

with

f^*_{int} = M ( [ H \cdot ( q - \bar{q} ) ] ( q - \bar{q} ) ).    (8)

f^*_{int} is the fictitious force that accounts for inertia effects associated with the varying Jacobian J. We also consider M_q = J^T M J the reduced mass matrix of the nonlinear reduction, which is no longer constant since J also depends on q.

Clearly, f^*_{int} is the trickiest part in Eq. (7). H is the Hessian of the coordinate transformation D: it is not only a third-order tensor, but H(q) is also a function of q.
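Before turning to the CSFD machinery used later in the paper, note that the double contraction [H·(q − q̄)](q − q̄) in Eq. (8) is simply the second directional derivative of the decoder D along d = q − q̄, so the quantity can be probed with off-the-shelf forward-mode autodiff. The sketch below is a minimal PyTorch illustration of that observation; `decoder`, `M_diag`, `q`, and `q_bar` are hypothetical placeholders, and this is not the paper's CSFD-based implementation.

```python
import torch

def fictitious_force(decoder, M_diag, q, q_bar):
    # f*_int = M [H(q) . d] d with d = q - q_bar (Eq. 8); the double contraction
    # equals the second directional derivative of the decoder along d.
    d = q - q_bar

    def jac_vec(x):
        # J(x) d via one forward-mode directional derivative
        _, out = torch.func.jvp(decoder, (x,), (d,))
        return out

    # Differentiate J(x) d along d once more -> [H(q) . d] d
    _, h_dd = torch.func.jvp(jac_vec, (q,), (d,))
    return M_diag * h_dd          # lumped (diagonal) mass as elementwise scale
```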
Therefore, if we want to solve Eq. (7) using, for instance, Newton's method in the implicit integration, we need to compute ∂H/∂q to assemble the corresponding system matrix, which is a fourth-order tensor and the result of a third-order differentiation of D. This nasty computation stands as a major obstacle for nonlinear model reduction. In [Fulton et al. 2019], the fictitious force term is discarded in the time integration of the generalized coordinate. This heuristic can somewhat be understood as performing an explicit subspace projection at the current time step, ignoring the fact that a generalized velocity \dot{q} also brings inertia effects when D is nonlinear.

On the one hand, we consider ignoring f^*_{int} a reasonable and understandable compromise in the setting of [Fulton et al. 2019]. First of all, f^*_{int} vanishes under quasi-static deformations as \dot{q} is close to zero. Secondly, in [Fulton et al. 2019] the encoding-decoding network is shallow, and D represents a net of only two layers. In addition, a preliminary PCA is performed to "regularize" the raw training poses. Those treatments effectively suppress the nonlinearity in J (so H is small) and lessen the inertia deformation induced by f^*_{int}. On the other hand, as f^*_{int} is missing in [Fulton et al. 2019], the underlying dynamic equation is inaccurate anyway. Visible artifacts are inevitable under higher-velocity deformations or when a deeper DAE is employed (i.e., as in our case).

An autoencoder is an unsupervised learning algorithm that condenses the input high-dimension data into a low-dimension latent space (i.e., encoding), which is then expanded to the original dimensionality to monitor the compression loss (i.e., decoding). If this network only has one hidden layer, or it does not involve nonlinear activations, the autoencoder is similar to PCA [Bourlard and Kamp 1988]. In this case D is linear, and the resulting network after training spans the same linear subspace as PCA does (under an L2 loss).

This, however, is not what we seek in nonlinear model reduction, since D is expected to capture as much nonlinearity as possible to enrich the subspace expressivity. To this end, we ought to keep the autoencoder deep and nonlinear. Unfortunately, too much nonlinearity seems to be harmful to the simulation as well. As illustrated in Fig. 2, with increased nonlinearity, the network (i.e., the curve in the figure) can be stretched to reach some irregular and distant poses in the training set. In the meantime, the geometry of the deformation manifold also becomes more wiggly, the same as what we experience in high-order polynomial fitting. As we know, the simulation under nonlinear reduction corresponds to travelling on the deformation manifold of D, driven by the generalized forces in the tangent space. A wiggly manifold could stiffen the simulation and induce artifacts.
Fig. 2. Increased nonlinearity better fits the training poses but also makes the network more wiggly.
There are several possible remedies for this issue. As in [Fulton et al. 2019], one could regularize the training data before the network training. This strategy is commonly used in training deep neural nets on a very large-scale data set that could potentially be noisy, e.g., ImageNet [Russakovsky et al. 2015]. However, the training data in our case are synthesized by running physical simulations, and they are noise-free. While PCA regularization certainly prevents overfitting, it also negatively impacts the richness of the nonlinear subspace. Alternatively, CAE is also a promising method [Rifai et al. 2011a]. It injects a penalty term related to ||J||_F to enhance the smoothness of D so that a highly curved manifold is unlikely. Unfortunately, neither method looks attractive to us: the very reason for using a DAE is to enhance the nonlinearity of the subspace, while both PCA regularization and the Jacobian penalty aim to prune the subspace nonlinearity, contradicting our original motivation. The remaining option is to expand the dimension of the latent space, which is also problematic knowing that the computational cost for nonlinear model reduction is much higher than for linear reduction (i.e., due to the evaluation of high-order differentiation). If the majority of the information encoded in the DAE is close to linear, why bother using nonlinear reduction and a deep network in the first place?

Our answer to this dilemma is to split the total simulation space into two orthogonal spectra: a PCA-based linear subspace (or one constructed by any linear model reduction method) S_p, and a DAE-based nonlinear manifold S_q such that S_p ⊥ S_q. The orthogonality allows the dynamics from both subspaces to be simply superposed. The advantages of this subspace design are multifaceted. First of all, we can now increase the dimension of the linear subspace with a moderate cost to the overall simulation performance. Secondly, under this design, the PCA basis matrix is part of the Jacobian of the overall subspace S_p ∪ S_q. Therefore, the simulation does not experience the locking artifact. Explicitly building the linear subspace also allows us to push the depth of the DAE as needed to capture nonlinear deformations while keeping the latent space highly compact. Lastly, we note that such multi-layer subspace construction has been successfully employed in simulation [Harmon and Zorin 2013; Zhang et al. 2020], but here for the first time in conjunction with a deep network.
Fig. 3. The network structure of our DAE. At both ends of the DAE, we append a fully connected filtering layer (in green) with fixed weights \delta_{i,j} - \sum_{k=0}^{n_p-1} U_{i,k} U_{j,k} to remove any displacements in the PCA space S_p.

The network architecture of our DAE is visualized in Fig. 3. It has a symmetric structure in its encoding and decoding parts. Before training the DAE, we perform PCA over the training set to obtain the basis vectors of S_p. They are packed into the matrix U. U is n × n_p, where n_p represents the dimensionality of S_p. The columns of U are all unit vectors, and they are orthogonal to each other. To ensure S_p ⊥ S_q, we append a filtering layer at both ends of the encoder and the decoder. This filtering layer is fully connected (FC) and has fixed weights: the weight coefficient of the edge connecting the i-th and j-th neurons before and after this FC layer is \delta_{i,j} - \sum_{k=0}^{n_p-1} U_{i,k} U_{j,k}, where \delta_{i,j} = 1 if i = j and 0 otherwise. In fact, this FC layer carries out a matrix-vector product (I - U U^T) x for an input vector x, which removes any components in x that generate non-zero projections in S_p, so that the input of the encoder and the output from the decoder are all orthogonal to S_p.

After filtering, the DAE moves to an intermediate activation part, which consists of multiple (e.g., 6 to 8) FC layers of the same width. The width is normally set at the order of log n. Each layer is nonlinearly activated. Our activation function is quite different from those of other deep nets. ReLU (rectified linear unit) is a widely chosen activation and works well in many deep learning tasks by default [Nair and Hinton 2010]. However, as ReLU is a piecewise linear activation, it fails to nonlinearly compress the training data. More importantly, ReLU is only C^0 continuous, and a DAE activated solely by ReLU may degenerate to PCA. The exponential linear unit or ELU enhances the smoothness of ReLU, but it could remain a linear activator for certain input signals. To this end, we use the trigonometric function sin x as our activation. sin x has derivatives of arbitrary order, and it does not have a saturated gradient in either direction. This pleasing property frees us from worrying about the vanishing gradient problem [Hochreiter et al. 2001] even when the network is deep (over 10 layers). Finally, the feature vector is compressed to the latent space. We use n_q to denote the dimension of S_q; n_q is a small number, typically below a couple of dozens in our experiments.
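To make the architecture concrete, here is a minimal PyTorch sketch of the decoder side of Fig. 3, assuming the PCA basis U has already been computed; the layer count, widths, and class name are illustrative assumptions rather than the authors' released code. The encoder mirrors this structure with the same filtering layer prepended, so its input is also orthogonal to S_p.

```python
import torch

class Decoder(torch.nn.Module):
    def __init__(self, U, n_q, width, depth=6):
        super().__init__()
        self.register_buffer("U", U)             # n x n_p PCA basis, fixed
        dims = [n_q] + [width] * depth + [U.shape[0]]
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, q):
        x = q
        for layer in self.layers[:-1]:
            x = torch.sin(layer(x))              # sin activation: smooth, no saturated gradient
        x = self.layers[-1](x)
        # fixed filtering layer: x <- (I - U U^T) x, output orthogonal to S_p
        return x - (x @ self.U) @ self.U.T
```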
Now we have everything needed to formulate the final system we must solve. With S_p and S_q constructed, the fullspace displacement is written as:

u = U p + D(q),  s.t.  U^T D(q) = 0.    (9)

Thanks to the orthogonality between S_p and S_q, we stack Eqs. (3) and (6) jointly to obtain:

\tilde{J}^T M \tilde{J} ( r - \bar{r} - h\bar{\dot{r}} ) + \tilde{J}^T f^*_{int} + h^2 \tilde{J}^T ( f_{int} - f_{ext} ) = 0,    (10)

where r = [p^T, q^T]^T is the generalized coordinate concatenating both p and q, and \tilde{J} = [U, J] ∈ R^{n×(n_p+n_q)}. Here, f^*_{int} keeps the same form as Eq. (8) because it vanishes in S_p. Eq. (10) can then be concisely written as G(r) = 0. Its Jacobian is an (n_p + n_q) × (n_p + n_q) matrix:

\partial G / \partial r = \tilde{H}^T M \tilde{J} ( r - \bar{r} - h\bar{\dot{r}} ) + \tilde{H}^T f^*_{int} + \tilde{J}^T M [ U, \Delta J + J ] + h^2 \tilde{H}^T ( f_{int} - f_{ext} ) + h^2 ( \tilde{J}^T ( \partial f_{int}/\partial u ) \tilde{J} ),    (11)

which needs to be updated and solved at each Newton iteration. Here, \tilde{H} = \partial^2 u / \partial r^2 = diag(0, H). The most involved term is \Delta J, which is defined as:

\Delta J \equiv ( S \cdot ( q - \bar{q} ) ) \cdot ( q - \bar{q} ) + H \cdot ( 3q - 3\bar{q} - h\bar{\dot{q}} ).    (12)

In order to compute \Delta J, we need to calculate S, an n × n_q × n_q × n_q fourth-order tensor, which is the third-order derivative of the DAE: \partial^3 D(q) / \partial q^3. If we choose to use first-order or quasi-Newton solvers [Liu et al. 2017], the computation of S could be avoided, but we still need to compute the Hessian H. Nevertheless, for subspace simulation with a small-size system matrix, Newton's method with a direct linear solver like Cholesky is always preferred.

Evaluating high-order differentiation of a deep net is not straightforward. Currently, gradient-based optimization is the mainstream solution for network training, where the network gradient can be computed via BP. Second- and higher-order derivatives are not well supported and are not efficient enough for subspace simulation tasks. Next, we discuss how we solve this technical challenge by exploiting the complex-step finite difference scheme.

Computing the derivative of a function is omnipresent in physics-based simulation. It is typically done by deriving the analytic form of the derivative function by hand or with assistance from symbolic differentiation software like
Mathematica [Wolfram et al. 1999]. Alternatively, it is also possible to approximate the derivative numerically. The finite difference is the most commonly used method, which applies a small perturbation h to the function input so that the first-order derivative can be estimated as f'(x) ≈ ( f(x + h) - f(x) ) / h. However, it is also known that the finite difference suffers from a numerical stability issue named subtractive cancellation [Luo et al. 2019]. This limitation can be avoided by the complex-step finite difference or CSFD [Luo et al. 2019; Martins et al. 2003].

Let (·)^* denote a complex variable, and suppose f^*: C → C is differentiable around x^* = x + 0i. With an imaginary perturbation hi, f^* can be expanded as:

f^*(x + hi) = f^*(x) + f^{*\prime}(x) \cdot hi + O(h^2).    (13)

We can "promote" a real-value function f to be a complex-value one f^* by allowing complex inputs while retaining its original computation procedure. Under this circumstance, we have f^*(x) = f(x) and f^{*\prime}(x) = f'(x) ∈ R. Extracting the imaginary parts of both sides in Eq. (13) yields:

Im( f^*(x + hi) ) = Im( f^*(x) + f^{*\prime}(x) \cdot hi ) + O(h^3).    (14)

Note that the error term O(h^3) in Eq. (14) is cubic because the quadratic term of h in Eq. (13) is a real quantity and is excluded by the Im operator. We then have the first-order CSFD approximation:

f'(x) = Im( f^*(x + hi) ) / h + O(h^2) \approx Im( f^*(x + hi) ) / h.    (15)

It is clear that Eq. (15) does not have a subtractive numerator, meaning it only has round-off error regardless of the size of the perturbation h. If h ~ sqrt(ε), i.e., around 1e-8, the CSFD approximation error is at the order of the machine epsilon ε. Hence, CSFD can be as accurate as the analytic derivative, because the analytic derivative also has a round-off error of ε.

The generalization of CSFD to second- or even higher-order differentiation is straightforward by making the perturbation a multicomplex quantity [Lantoine et al. 2012; Nasir 2013]. The multicomplex number is defined recursively: its base cases are the real set C^0 = R and the regular complex set C^1 = C. C^1 extends the real set C^0 by adding an imaginary unit i_1 as C^1 = { x + y i_1 | x, y ∈ C^0 }. The multicomplex number up to an order of n is defined as C^n = { z_1 + z_2 i_n | z_1, z_2 ∈ C^{n-1} }. Under this generalization, the multicomplex Taylor expansion becomes:

f^*( x + h i_1 + \cdots + h i_n ) = f^*(x) + f^{*\prime}(x) h \sum_{j=1}^{n} i_j + \frac{f^{*\prime\prime}(x)}{2!} h^2 ( \sum_{j=1}^{n} i_j )^2 + \cdots + \frac{f^{*(n)}(x)}{n!} h^n ( \sum_{j=1}^{n} i_j )^n + \cdots    (16)

Here, ( \sum_j i_j )^k can be computed following the multinomial theorem, and it contains products of k mixed imaginary directions for the k-th-order terms. For instance, the second-order CSFD formulation can then be derived as follows:

\partial^2 f(x, y) / \partial x^2 \approx Im^{(1,2)}( f(x + h i_1 + h i_2, y) ) / h^2,
\partial^2 f(x, y) / \partial y^2 \approx Im^{(1,2)}( f(x, y + h i_1 + h i_2) ) / h^2,
\partial^2 f(x, y) / \partial x \partial y \approx Im^{(1,2)}( f(x + h i_1, y + h i_2) ) / h^2,    (17)

where Im^{(1,2)} picks the mixed imaginary direction i_1 i_2. One can easily tell from Eq. (17) that the second-order CSFD is also subtraction-free, making it as robust and accurate as the first-order case.
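As a quick illustration of Eq. (15), the first-order CSFD can be implemented in a few lines of NumPy; this is only a toy sketch, and the higher-order variants of Eq. (17) additionally require multicomplex arithmetic, which NumPy does not provide out of the box.

```python
import numpy as np

# First-order CSFD (Eq. 15): f'(x) ~ Im(f(x + ih)) / h. The perturbation can be
# extremely small (e.g., 1e-40) because there is no subtractive cancellation.
def csfd_grad(f, x, h=1e-40):
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.sin(x) * np.exp(x)     # any real function "promoted" to complex input
x = 0.7
print(csfd_grad(f, x))                  # matches cos(x)e^x + sin(x)e^x to machine precision
```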
With CSFD, we augment the DAE so that each neuron can house a complex or a multicomplex quantity. Therefore, the input perturbation can be passed through the network for computing its derivative values.

A limitation of CSFD lies in its dependency on the perturbation. If the function takes n input variables, e.g., an n-dimension vector, CSFD needs to evaluate the function n times in order to compute its first-order derivative. In our case, the function is shaped as a DAE. More precisely, D(q) corresponds to the decoding part of the network (Fig. 3). We need to take a forward pass of the decoding network as one function evaluation. The total number of network forwards goes up exponentially with respect to the order of differentiation. Therefore, naively computing S in Eq. (12) requires on the order of n_q^3 network forwards per Newton iteration, which is further scaled by the complexity of D. This is too expensive for real-time simulation even on the GPU.

An important contribution of this work is to efficiently enable the high-order differentiability of the DAE (or other deep networks) while eliminating excessive network perturbations. Our method is based on the following two key observations:

• In CSFD, the imaginary parts can be understood as the differential change induced by the perturbation. Under a straight usage, CSFD is analogous to forward automatic differentiation (AD) [Guenter 2007], but with much better generalization to higher orders. The potential of CSFD is maximized if the function has a high-dimension output, so that one function evaluation yields more information about the differentiation. Conversely, the BP procedure of a neural net is essentially a reverse AD [Baydin et al. 2017]: its efficiency is optimal when the input of a network is of high dimension. This is exactly the case in neural net training, where we have a large number of network parameters as the function input. It is clear that CSFD and BP nicely complement each other, so we can choose the direction of network propagation accordingly.

• While high-order differentiation produces high-order tensors, those tensors are rarely needed in their original form. In most cases, they are to be "reduced" by tensor contractions with other tensors to their left and right. Those reduction operations allow us to apply the perturbation collectively, not at an individual variable but in the form of a vector or tensor.
We now elaborate our method, starting with a toy example. Consider f: R^n → R. Computing its Hessian (H = \nabla^2 f) needs O(n^2) perturbations with second-order CSFD (Eq. (17)). However, if \nabla^2 f is also contracted with a right vector a, H a can actually be evaluated much more efficiently as:

[ H(x) a ]_i = \sum_{j=0}^{n-1} \lim_{h \to 0} \frac{ [ \nabla f(x + h e_j) - \nabla f(x) ]_i }{ h } \cdot [a]_j,
\Rightarrow [ H(x) a ]_i = \sum_{j=0}^{n-1} \lim_{h \to 0} \frac{ [ \nabla f(x + [a]_j h e_j) - \nabla f(x) ]_i }{ h },
\Rightarrow H a = \lim_{h \to 0} \frac{ \nabla f(x + h a) - \nabla f(x) }{ h } \approx \frac{ Im( \nabla f(x + hi \cdot a) ) }{ h }.

Here, [·]_j gives the j-th element of a vector, and e_j is the canonical basis. In the second line of the derivation, we substitute h with [a]_j h to cancel the multiplication by [a]_j. One may now recognize that H a is essentially the directional derivative of \nabla f along a.

Fig. 4. Right contraction can be dealt with by applying the CSFD perturbation collectively.
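The derivation above translates directly into code: a single gradient evaluation under a collectively scaled complex perturbation returns H a. The toy below uses a hand-written gradient of an assumed quadratic-plus-sine function purely to verify the identity numerically; it is not part of the paper's pipeline.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def grad_f(x):
    # gradient of f(x) = 0.5 x^T A x + sum(sin(x)), written so complex inputs pass through
    return A @ x + np.cos(x)

def hess_vec(x, a, h=1e-40):
    # H(x) a  via one collectively perturbed gradient evaluation
    return np.imag(grad_f(x + 1j * h * a)) / h

x = np.array([0.3, -1.2])
a = np.array([1.0, 2.0])
H_exact = A + np.diag(-np.sin(x))          # exact Hessian for verification
print(hess_vec(x, a))
print(H_exact @ a)                         # the two agree to machine precision
```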
This finding is not new and has been used in Jacobian-free solvers [Knoll and Keyes 2004]. However, we note that this strategy can also be generalized to high-order cases. As shown in Fig. 4, one differentiation operation lifts the order of the resulting tensor by one. A right contraction of the tensor undoes this expansion so that the perturbation can be applied together. Now let us turn to the fourth-order tensor S in Eq. (12). Its exact form is of less interest to us. Instead, we would like to compute the matrix obtained after two contractions with q - \bar{q}. To this end, we apply a collective third-order multicomplex perturbation to the decoding DAE n_q times. The i-th perturbation computes the i-th column of the resulting matrix. This perturbation is applied along the first imaginary direction i_1 at the i-th element of the DAE input q. The perturbations in i_2 and i_3 are scaled by the corresponding elements of q - \bar{q}. Putting it together, the i-th element of the CSFD input, which is a third-order multicomplex quantity, is:

[ q^* ]_i = [ q ]_i + h i_1 + h [ q - \bar{q} ]_i i_2 + h [ q - \bar{q} ]_i i_3.    (18)

After the forward pass, we extract the component along the i_1 i_2 i_3 direction and divide it by h^3.

In the simulation, there are several computations involving a contraction between a left vector and a differentiation tensor, such as all the \tilde{H}^T terms in Eq. (11). In those cases, the contraction occurs at the dimension which is not expanded by the differentiation. Hence, the strategy outlined in § 4.3 does not apply. Consider evaluating a · H (i.e., H^T a). We carry out the computation with an auxiliary function g(q) = a · D(q) ∈ R. As D is embodied as a neural network, this auxiliary function can also be viewed as appending an FC layer at the end of the net, reducing its n-dimension output to a single scalar (like a loss function). Because a is independent of D, \partial^2 g / \partial q^2 = a · \partial^2 D / \partial q^2. Hence, a · H can be computed as the Hessian of g(q). Here, the reader may be reminded that H is a function of q, and it is the second derivative of g. A standard second-order CSFD would need n_q(n_q + 1)/2 perturbed passes since \partial^2 g / \partial q^2 is symmetric.

We show that this computation can be further reduced to O(n_q). As mentioned, CSFD is best suited for differentiating functions with a high-dimension output; g(q) is not such a function, as it outputs a single scalar. Its derivative can be more efficiently computed by reverse AD or BP. As a first-order routine, however, BP only computes the gradient function \partial g / \partial q. To this end, we inject CSFD into the BP procedure, treating BP as a generic function and enabling complex arithmetic along the BP computation to perturb the gradient of g. It starts with a complex-perturbed forward pass of the network g, adding the perturbation at one element (say the i-th element) of the network input: [ q^* ]_i = [ q ]_i + hi. The feedforward of the net delivers this complex perturbation to all the neurons that [ q ]_i influences. BP then ensues. During BP, all the computations are complex-based. If a neuron received an imaginary component in the forward pass, this imaginary component participates in BP and passes complex-value feedback signals to its previous layer. After BP, all the signals at the input layer are divided by h, yielding one column of a · H.

Fig. 5 illustrates this process with a simple net: two neurons (x and y) are multiplied first, and the result (z) is squared to generate the output (w).
Suppose x = y = 3, and we want to compute the second derivative of the network output with respect to x. The perturbation h is applied to x so that x = 3 + hi, and all downstream neurons become complex-valued: the forward pass gives z = xy = 9 + 3hi and w = z^2 = 81 + 54hi. After the forward pass, BP is invoked. Everything remains the same as in regular BP except that the computation is complex. For instance, \partial w / \partial z = 2z; as z holds a complex value, \partial w / \partial z = 18 + 6hi is also complex, and the feedback at x becomes \partial w / \partial x = (\partial w / \partial z)(\partial z / \partial x) = (18 + 6hi) · 3 = 54 + 18hi. Finally, after BP completes, the real part of the signal at x gives the value of the first-order derivative (54, the same as the original BP algorithm would produce), and the imaginary part of the signal at x, after being divided by the input perturbation h, is the second derivative (18). Along this procedure, we follow the strategy in [Luo et al. 2019] to avoid unneeded complex computations. For instance, in Fig. 5, the high-order terms of h are discarded in w.
Fig. 5. By augmenting BP with CSFD, we can efficiently evaluate high-order differentiation of a deep net, followed by a left-side tensor contraction.
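The example of Fig. 5 can be reproduced in a few lines: the complex step simply rides through both the forward pass and a hand-written backward pass. This snippet is a toy illustration of the idea, not the GPU implementation used in the paper.

```python
# Toy version of Fig. 5: z = x*y, w = z^2, with x = y = 3.
h = 1e-30
x, y = 3.0 + 1j * h, 3.0

# forward pass (complex arithmetic)
z = x * y                 # 9 + 3hi
w = z * z                 # 81 + 54hi (the h^2 term is negligible)

# backward pass (still complex)
dw_dz = 2.0 * z           # 18 + 6hi
dw_dx = dw_dz * y         # 54 + 18hi

print(dw_dx.real)         # 54 = dw/dx    (= 2 x y^2)
print(dw_dx.imag / h)     # 18 = d2w/dx2  (= 2 y^2)
```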
Thanks to CSFD, all the tensor-related computations can now be completed with O(n_q) network passes, either forward passes with CSFD or backward passes with CSFD-enabled BP. Those network passes can be executed in parallel on the GPU as one single mini-batch. The remaining performance bottleneck is the subspace integration of the reduced force and elastic Hessian. This computation is usually handled with the Cubature method [An et al. 2008]. In the next section, we discuss how we replace the classic Cubature sampling with a neural-network-based one to allow a pose-dependent subspace integration.

In model reduction, it is expected that all the computations are carried out in polynomial time of the reduced order n_p + n_q. For the Saint Venant-Kirchhoff (StVK) material model under linear reduction, it is possible to pre-compute the polynomial coefficients for the reduced force and Hessian [Barbič and James 2005] at a cost quartic in the subspace size. Unfortunately, other material models do not share this convenience.

Fig. 6. Neural Cubature alternates between two neural networks: s and ω. The s net is a GCN and selects the Cubature elements with the highest scores. The ω net predicts the weight of each Cubature element. We add a square activation to ensure the non-negativeness of the output weight values. This information is then passed back to s for the next-round selection.

A practical solution is the so-called Cubature sampling [An et al. 2008]. Cubature selects a subset of key elements (i.e., Cubature elements) such that the reduced force and reduced Hessian can be integrated only at the Cubature elements with designated non-negative weights. Cubature has been proven effective for linear model reduction. However, its naïve deployment for nonlinear reduction is questionable: as the tangent space varies in nonlinear cases, why should we stick with invariant Cubature weights?

Our neural Cubature consists of two networks, as shown in Fig. 6. The first net is in charge of selecting new Cubature elements, and the second net is responsible for predicting Cubature weights. Specifically, the first neural network outputs a "score" for each element, and we can add multiple elements to the Cubature set based on the elements' scores. After updating the Cubature set C, the second network outputs the weights of all the Cubature elements based on the input r. Neural Cubature training alternates between those two networks. After the training, only the weight prediction net participates in the simulation, i.e., given a generalized coordinate r, the neural net outputs the weight coefficients at the simulation run time, which are then used for the subspace integration.

We use s to denote the first neural network for Cubature element selection. s is a graph convolutional network (GCN) [Wu et al. 2020], which naturally inherits the topology of the input 3D model. The input of s is the fullspace displacement u, followed by two graph convolution layers. Each convolution layer produces eight channels. After that, another two FC layers are applied. s outputs the probability p_e for each element e on the mesh, which is concatenated into a global probability or score vector s.
The graph convolution operation can be written as:

h_i^{(l+1)} = \sin\Big( \sum_{j \in \mathcal{N}_i} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)} \Big),    (19)

where h_i^{(l)} represents the i-th vertex in the l-th neural network layer, W^{(l)} is the trainable parameter, and \mathcal{N}_i denotes the one-ring neighborhood of i on the mesh. Similar to the DAE, we use sin(·) for the intermediate nonlinear activations. c_{ij} = \sqrt{d_i \cdot d_j} is the normalization constant of edge ⟨i, j⟩, where d_i is the degree of vertex i. At the last hidden layer, we use the softmax activation [Goodfellow et al. 2016], which assigns each element a probability score.

s is trained in the residual space. This scheme is inspired by the original Cubature algorithm. At the beginning, the set of Cubature elements is empty: C = ∅, and the original training set consists of pose-force pairs. With some elements being selected, C ≠ ∅, we compute the remaining reduced force for each training datum with the current Cubature integration:

\bar{f}(r) = \tilde{f}_{int}(r) - \sum_{e} m_e(C, w) \tilde{f}_e(r).    (20)

Here, \tilde{f}_{int}(r) is the reduced internal force projected into the column space of \tilde{J}(r). The summation iterates over all the elements on the mesh. m_e(C, w) is a mask function that removes non-Cubature weights from an input weight vector w. In other words, m_e(C, w) = [w]_e if element e ∈ C and 0 otherwise. Note that the dimensionality of w corresponds to the total number of elements on the model, and it is the output from the current weight prediction net. \tilde{f}_e is the reduced force at element e. Instead of adding one Cubature element at a time, neural Cubature allows us to select multiple elements. After the s net outputs the score vector s, we can pick the K non-Cubature elements with the highest scores and update C accordingly.
We have tested several choices of K, up to K = 20, and did not find much difference between them.
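For reference, one graph-convolution layer of the selection net s in Eq. (19) could look like the sketch below; the edge list, feature sizes, and the dense Python loop are illustrative simplifications (a practical implementation would use sparse adjacency operations and batching), not the authors' code.

```python
import torch

def gcn_layer(h, edges, W, degrees):
    # h: V x C_in node features, W: C_in x C_out trainable weights,
    # edges: list of undirected mesh edges (i, j), degrees: per-vertex degree d_i.
    msg = torch.zeros(h.shape[0], W.shape[1])
    for i, j in edges:
        c_ij = 1.0 / torch.sqrt(degrees[i] * degrees[j])   # edge normalization 1/sqrt(d_i d_j)
        msg[i] += c_ij * (h[j] @ W)
        msg[j] += c_ij * (h[i] @ W)
    return torch.sin(msg)                                   # sin activation as in Eq. (19)
```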
The weight prediction network ω takes a generalized coordinate r as well as the current Cubature set C as input, and outputs the weight coefficients w for all the elements. Specifically, r is first spanned to u with our decoder net. The network parameters of this part are fixed and do not participate in the training. Four additional FC layers with sin(·) activations follow. ω is not a graph network, as we believe the geometry and topology information of the model is already captured in s. Training the weights should be purely algebraic, and several nonlinearly activated FC layers serve this purpose well. Because ω is part of the simulation (we need to obtain w at each time step), we also want to make sure it is light-weight and runs its feedforward pass efficiently. Therefore, the structure of ω is plain and straightforward. Finally, the weight coefficients of Cubature elements should be non-negative in order to prevent extrapolation and overfitting. To this end, we put a square operation at the last layer to enforce the non-negativity constraint.

The loss functions of s and ω resemble each other:

L_s = \| \bar{f}(r) - \sum_{e} m_e(C, w) \tilde{f}_e(r) \|^2,   L_\omega = \| \tilde{f}(r) - \sum_{e} m_e(C, w) \tilde{f}_e(r) \|^2.    (21)

In practice, neural Cubature kicks off by initializing C with a few Voronoi samples of the input model, which are passed to ω to start the alternation. After w is predicted, we feed this information to s (i.e., updating the Cubature residual), which in turn updates the Cubature set C. Our neural Cubature is more efficient and accurate than conventional Cubature methods. Because neural Cubature picks multiple elements each time, we can also quickly build a bigger Cubature set C.
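A minimal sketch of the weight-prediction net ω and the masked loss of Eq. (21) is given below, assuming a trained decoder is available; the layer sizes, names, and tensor layout are placeholder assumptions rather than the authors' configuration.

```python
import torch

class WeightNet(torch.nn.Module):
    def __init__(self, decoder, n_full, n_elems, width=128):
        super().__init__()
        self.decoder = decoder.requires_grad_(False)     # fixed: not trained here
        dims = [n_full, width, width, width, n_elems]
        self.fcs = torch.nn.ModuleList(
            torch.nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, r):
        x = self.decoder(r)                  # span r to fullspace u first
        for fc in self.fcs[:-1]:
            x = torch.sin(fc(x))
        return self.fcs[-1](x) ** 2          # square activation keeps w >= 0

def cubature_loss(w, mask, f_target, f_per_elem):
    # || f_target - sum_e m_e(C, w) f_e ||^2; mask zeroes non-Cubature weights
    approx = ((mask * w).unsqueeze(-1) * f_per_elem).sum(dim=0)
    return ((f_target - approx) ** 2).sum()
```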
We have implemented our framework on a desktop computer with an Intel i7 9700 CPU and an NVIDIA 2080 GPU. The simulation part mostly runs on the CPU, but we move all the matrix-matrix and matrix-vector computations to the GPU with cuBLAS [Nvidia 2008]. The simulation is implemented in C++, and some linear algebra computations are based on the Eigen library [Guennebaud et al. 2010]. Network training is initially carried out using PyTorch [Paszke et al. 2019]. After we have all the network parameters, we port the resulting neural network to CUDA. Network BP for computing tensor contractions is also implemented with cuBLAS.

We generate training poses by running a scripted simulation. At the training stage, given a random surface vertex on the model, we select its nearby vertices within a given radius and apply a random force to them (Fig. 7). All the simulation poses along this dynamic procedure are recorded as training data.
Fig. 7. We generate training data by applying scripted random forces to the model.
In linear model reduction, it is common to directly sample training data within the modal space, e.g., see [Von-Tycowicz et al. 2015]. We found that this strategy is not valid for nonlinear model reduction. Here, we would like to clarify two confusing concepts: pose and basis. In linear model reduction, we care more about the basis, whose most important attribute is its direction; its magnitude matters little. This is not the case for nonlinear reduction, where we essentially learn the underlying deformation manifold. An effective training needs samples on this manifold, i.e., poses, without unnecessary scaling. Therefore, training data should be generated via real simulation.

Training poses ought not to be weighted equally. In general, we prefer to better fit poses closer to the rest shape. A slightly higher fitting error may be acceptable for poses under large deformations. This is also the motivation in linear model reduction for scaling basis vectors by their vibration frequencies [Barbič and James 2005; Von-Tycowicz et al. 2015]. However, nonlinear eigenvalues of deformed poses are difficult to estimate. We found that a good metric is the elastic energy of a given pose, which nonlinearly measures how far a deformation is away from the rest configuration. As a result, we weight the loss value of each pose by the inverse of its elastic energy. As discussed, our method also builds a linear subspace (i.e., S_p) via PCA out of the training poses. If the training data set is too big, computing a full PCA is time-consuming. We find that a good workaround is to randomly pick poses with the smallest elastic energy to form a more compact training set for PCA, and leave the DAE to extract nonlinear information out of the residual pose space.
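Concretely, the pose weighting can be folded into the reconstruction loss as in the sketch below, where `dae`, `poses`, and `energies` are hypothetical handles to the autoencoder, the training poses, and their precomputed elastic energies.

```python
import torch

def weighted_recon_loss(dae, poses, energies, eps=1e-8):
    # Scale each pose's reconstruction error by the inverse of its elastic
    # energy, so poses near the rest shape are fitted more tightly.
    weights = 1.0 / (energies + eps)
    recon = dae(poses)
    per_pose = ((recon - poses) ** 2).sum(dim=1)
    return (weights * per_pose).sum() / weights.sum()
```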
Fig. 8. Training curves (training loss vs. epoch) for the bunny model. Neural Cubature is trained by alternating the s and ω nets, and we use the parameters from the previous alternation. Adding more layers helps reduce the training error effectively.
We use PyTorch and Adam for all our network training. For training the DAE, we start with a small initial learning rate and use training sets on the order of thousands of poses per model. When training the neural Cubature networks s and ω, we stick with a fixed learning rate for both nets. Depending on how many Cubature elements we want to pick, the neural Cubature training could take several thousand epochs. The total network training time is less than one might expect, taking ten to twenty minutes. Generating the training poses is the most expensive part; it often needs a couple of hours. A typical training curve is reported in Fig. 8, which is for the bunny model. We note that the expressivity of the DAE improves with increased depth. This can be observed from Fig. 8: if the DAE is shallow, e.g., four layers, its performance is only marginally better than PCA; but with a deeper DAE, the error decreases sharply.

Fig. 9. The ground truth poses of "bending" (left) and "opening arm" (right) of the dinosaur model.
First, we report a comprehensive comparative experiment between our DAE-based nonlinear reduction and other commonly seen linear reduction methods, including PCA, the physical modal derivative (PMD) [Barbič and James 2005], and the geometric modal derivative (GMD) [Von-Tycowicz et al. 2015]. Both physical and geometric modal derivatives are based on linear modal analysis (LMA) [Pentland and Williams 1989]. PMD is computed by solving a set of static equilibria around the rest shape, while GMD constructs the subspace matrix by spanning each LMA basis to nine tangent directions corresponding to its local linear transformation. In this experiment, we first compute 50 LMA basis vectors. Based on them, we compute 50 × (50 + 1)/2 = 1,275 PMD bases and 50 × 9 = 450 GMD bases. For our method, the reduced subspace dimensions are kept small in this test, at 5 and at 10. The ground truth shape is given in Fig. 9 (left).

In this experiment, we can see a clear advantage of our method over PCA-based linear reduction. We think the reason is straightforward: the DAE is known to be more expressive than conventional PCA, especially for nonlinear data sets. In addition, we find that PMD also gives very good results, while GMD does not perform well. We assume this is because one needs to fully incorporate all 450 geometric derivative modes in GMD to approximate the derivatives of the LMA modes to first order reasonably well. PMD is optimal for low-frequency deformations like this dinosaur bending. Indeed, PMD is exactly designed to capture such deformations, while our method is based on a data set generated randomly. From this perspective, it is actually encouraging to see our method yield comparable results on PMD's "home field". Another common trend for all the reduction methods is that the deformation improves with increased subspace dimensions.

To further verify our hypothesis, we generate another set of training poses (500 poses), where we only add random forces at the hands of the dinosaur. This type of deformation is local and high-frequency, which is less friendly to PMD than the bending pose. In the test, we ask the dinosaur to open its arms by applying outward forces to its hands. All the other settings remain unchanged. As shown in Fig. 10 (right), the difference between our method and PMD becomes more obvious in this experiment. The ground truth result is the right snapshot of Fig. 9. Interestingly, when we narrow our training sampling to the hands, the performance of PCA also gets much better. We can see from the figure that PCA is very close to our method. This is because local deformation does not necessarily suggest higher nonlinearity. In the "opening arm" test, we only generate 500 training poses, which can be fairly well captured by PCA. The advantage of nonlinear reduction is more observable when the subspace size is further condensed.
Fig. 10. We compare simulation results of the dinosaur model using various model reduction methods: PCA, physical modal derivative (PMD), geometric modal derivative (GMD), and our method. The ground truth shapes are given in Fig. 9. On the left, we globally bend the dinosaur, and on the right, we apply local forces at its hands.
Fig. 11. Latent space dynamics [Fulton et al. 2019] uses a trimmed formulation to avoid the evaluation of high-order derivatives of the DAE. This simplification leads to serious artifacts when the model moves under a high velocity.

We are not the first to leverage DAE to perform nonlinear reduction. Latent space dynamics (LSD) [Fulton et al. 2019] is closely relevant to our method. Both our method and LSD share the same high-level motivation of nonlinear subspace simulation, and both choose to use an autoencoder as the machinery of the reduction in elastic simulation. Therefore, we consider LSD our major competitor. There are several key differences between our method and LSD. The most important one lies in the fact that LSD lacks high-order differentiability and therefore resorts to a simplified formulation that ignores the fictitious force f_fict (Eq. (8)). This could lead to significant error during the simulation when the model undergoes a high-velocity motion.

Fig. 11 reports snapshots of this issue. In this test, we drag the head of the dinosaur to the left with an abrupt force. The artifact does not appear serious in the first few frames. However, once the accumulated error reaches a certain level, the simulation diverges and cannot be recovered even if we slow down the animation later. In order to have a fair comparison, we run our simulation fully in the nonlinear subspace without building the intermediate PCA space, and we do not use Cubature sampling for subspace force integration, to avoid other potential error sources. As most dynamic simulation problems are prescribed by Newton's law of motion, being able to evaluate high-order derivatives of the nonlinear model reduction is a must for a successful deployment of this technique. This turns out to be the key contribution of our method. LSD employs many smart strategies to remedy the risk induced by the missing f_fict, such as pre-PCA filtering in the DAE network, etc. They are all compatible with our method, but the f_fict issue does not even exist in our framework.

Cubature elements        10       20       50       100      200
Neural Cubature [10]     91.1%    62.9%    39.1%    23.8%    16.–%
Neural Cubature [5]      –        60.3%    37.1%    22.3%    15.–%
Greedy Cubature          –        66.6%    47.1%    31.2%    19.–%

Table 1. Cubature sampling error using neural Cubature and the classic Cubature method. This experiment is performed on an Armadillo model with 40K elements. Neural Cubature [10] means we add 10 elements to the Cubature set C based on each prediction of the selection network; Neural Cubature [5] adds 5 elements each time. Neural Cubature outperforms the classic Cubature method by a substantial margin: in general, neural Cubature yields noticeably less error than the classic greedy Cubature method.

In the next experiment, we would like to investigate the difference between our neural Cubature and the classic Cubature method. We first compare the fitting error with 10, 20, 50, 100, and 200 Cubature elements. The results are reported in Tab. 1. We can see from the listed error percentages that neural Cubature outperforms the classic Cubature method [An et al. 2008] in the context of nonlinear reduction. The selection network of neural Cubature uses a GCN, which captures the geometry and topology information of the input model, while the classic Cubature method is purely algebraic. Another advantage of neural Cubature is its efficiency. Neural Cubature allows us to choose multiple Cubature elements each time the selection net predicts a score vector. We find that the Cubature training error is not sensitive to how many new Cubature elements we add to C each time, as long as this is a reasonable number, i.e., within a few dozen. Picking five elements at a time yields a higher accuracy than picking ten, but both are better than greedy Cubature (Tab. 1).

Fig. 12. We drag the left leg of the Armadillo. Neural Cubature with 140 elements runs the simulation robustly. We also visualize the weight value at each Cubature element; higher weighted elements are brighter. Greedy Cubature fails in this simulation.
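For concreteness, the following is a minimal sketch of this batched selection idea: a scoring network proposes candidate elements, the top-k scored elements are appended to the Cubature set, and non-negative weights are refit by NNLS. The interface of score_net, the per-element force matrix G, and the target vector b are illustrative assumptions; the actual selection network in our pipeline is a GCN and the weights come from the weighting network.

```python
import numpy as np
from scipy.optimize import nnls

def fit_cubature_weights(G, b, cubature_set):
    """Fit non-negative weights w >= 0 so that G[:, C] @ w approximates b.
    G : (m, n) matrix whose column j stacks the reduced internal forces of
    element j over all training poses; b : (m,) stacked target forces."""
    A = G[:, sorted(cubature_set)]
    w, _ = nnls(A, b)
    return w

def neural_cubature(G, b, score_net, k_per_round=10, target_size=100):
    """Batched greedy selection: each round a scoring network ranks the
    remaining elements and the top-k are appended to the Cubature set."""
    cubature, residual = set(), b.copy()
    w = np.zeros(0)
    while len(cubature) < target_size:
        scores = score_net(residual)                # one score per element
        scores[list(cubature)] = -np.inf            # never pick twice
        new_ids = np.argsort(scores)[-k_per_round:]
        cubature.update(int(i) for i in new_ids)
        w = fit_cubature_weights(G, b, cubature)
        residual = b - G[:, sorted(cubature)] @ w   # drives the next round
    return sorted(cubature), w

if __name__ == "__main__":
    # Toy usage with a correlation-based stand-in for the GCN score predictor.
    rng = np.random.default_rng(0)
    G, b = rng.standard_normal((200, 500)), rng.standard_normal(200)
    dummy_net = lambda r: np.abs(G.T @ r)
    C, w = neural_cubature(G, b, dummy_net, k_per_round=10, target_size=50)
    print(len(C), np.linalg.norm(b - G[:, C] @ w) / np.linalg.norm(b))
```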
In addition, we can re-use the network parameters from the previous alternation to warm start the training for a faster convergence (e.g., see Fig. 8). With neural Cubature, we can conveniently build a bigger C set. Under CUDA-assisted subspace integration, the neural Cubature sampling error can then be effectively suppressed. This is hardly possible with the classic Cubature method, as we would need to solve a non-negative least square problem of ever increasing size; building a C of a few hundred elements would be very expensive. Fig. 12 shows a concrete experiment of simulating an Armadillo using greedy and neural Cubature strategies with 140 Cubature elements. We fix the Armadillo's hand and drag its leg downwards. Our neural Cubature with varying weights simulates this animation robustly, while greedy Cubature fails.

We have briefly discussed in § 4 that finite difference is not numerically robust even for first-order differentiation, and it is unlikely to be applicable in nonlinear reduction. To verify this, we implement second- and third-order differentiation by recursively applying central finite differences. The simulation does not converge no matter how we tweak the perturbation size h, over several orders of magnitude down to h = 1E-9. In fact, the simulation crashes almost immediately when finite difference is used. We believe the increased depth of the neural net imposes more challenges for finite difference to work properly. However, CSFD is robust and accurate even for high-order differentiation.
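As a self-contained illustration of why the complex step behaves so differently from real finite differences, the snippet below differentiates a toy scalar function both ways; the function and step sizes are arbitrary choices for demonstration, not the network or values used in our experiments.

```python
import numpy as np

def f(x):
    """Toy analytic function standing in for one network output."""
    return np.sin(3.0 * x) * np.exp(-0.5 * x)

def d2_central(x, h):
    """Second derivative via central differences (real arithmetic)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def d1_csfd(x, h):
    """First derivative via complex step: Im f(x + ih) / h, no subtraction."""
    return np.imag(f(x + 1j * h)) / h

x0 = 0.7
exact_d1 = (3.0 * np.cos(3.0 * x0) - 0.5 * np.sin(3.0 * x0)) * np.exp(-0.5 * x0)
exact_d2 = (-8.75 * np.sin(3.0 * x0) - 3.0 * np.cos(3.0 * x0)) * np.exp(-0.5 * x0)
for h in (1e-4, 1e-8, 1e-12, 1e-16):
    print(f"h={h:.0e}  CSFD err={abs(d1_csfd(x0, h) - exact_d1):.2e}  "
          f"FD2 err={abs(d2_central(x0, h) - exact_d2):.2e}")
# The complex step stays accurate to machine precision as h shrinks, while
# the real central-difference second derivative is destroyed by cancellation.
```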
We first use
PyTorch to test and train our neural networks (the DAE and the Cubature sampling and weighting networks). After the training is complete, we re-implement the nets with CUDA so that they are directly embedded in our simulation framework. Our
CUDA port is CSFD-capable, i.e., the forward and backward passes of the network also take complex or multicomplex values. This could be done by overloading the real operators with their complex or multicomplex counterparts. Instead, we choose to use the
Cauchy-Riemann (CR) formulation [Ahlfors 1973; Luo et al. 2019] to achieve (multi-)complex perturbations without overloading the complex arithmetic. The CR equation represents a multicomplex number in the form of a real matrix. Suppose z^1 = z_0 + z_1 i; its CR form is a 2 × 2 real matrix:

z^1 = z_0 + z_1 i = [ z_0, -z_1 ; z_1, z_0 ],

where z^1 ∈ C^1 and z_0, z_1 ∈ C^0 = R. Here, we use the superscript (·)^n to denote the order of a multicomplex number. The CR matrix of z^n can be constructed recursively using the CR matrices of z_0^{n-1} and z_1^{n-1} as:

z^n = z_0^{n-1} + z_1^{n-1} i_n ∈ C^n = [ z_0^{n-1}, -z_1^{n-1} ; z_1^{n-1}, z_0^{n-1} ].    (22)

Each of the 2 × 2 blocks is an (n - 1)-order multicomplex number, which can be further expanded with (n - 2)-order multicomplex numbers and so on. Eventually, the CR form of z^n becomes a 2^n × 2^n real matrix. With the CR formula, we organize each network layer into a real layer and an imaginary layer (or multiple multicomplex layers) rather than generalizing each neuron to be a complex or multicomplex quantity. All the computations now stay in real arithmetic, and we implement the forward pass of the net for FC layers with cuBLAS. The CR matrix multiplication is carried out block-wise, so that we do not generate redundant multiplications corresponding to the off-diagonal blocks. The activation, on the other hand, is directly implemented by launching CUDA threads. Fortunately, we do not have many different types of activations: only the periodic activation function sin(·) is used. To this end, we just implement its naïve expression up to the third order to maximize the performance on
CUDA without recursive variable initialization. While the expression looks verbose (e.g., in Appendix A and the supplementary document), computing the high-order derivatives of the activation function only takes a small fraction of a network forward pass. The major computing effort remains in the FC forward and backward passes.
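To make the CR construction concrete, here is a small NumPy sketch (our illustration, not the CUDA implementation) that builds the 2^n × 2^n CR matrix of Eq. (22) recursively and checks, for the first-order case, that multiplying CR matrices reproduces complex multiplication; the component ordering used below is our own convention for the example.

```python
import numpy as np

def cr_matrix(parts):
    """Recursively build the Cauchy-Riemann real matrix of a multicomplex
    number from its 2**n real components, ordered as
    [real, i1, i2, i1*i2, i3, ...] (Eq. (22) applied top-down)."""
    parts = np.asarray(parts, dtype=float)
    if parts.size == 1:                       # 0-order: a plain real number
        return parts.reshape(1, 1)
    half = parts.size // 2
    z0 = cr_matrix(parts[:half])              # (n-1)-order "real" block
    z1 = cr_matrix(parts[half:])              # (n-1)-order "imaginary" block
    return np.block([[z0, -z1], [z1, z0]])

# First-order check: CR-matrix product vs. ordinary complex product.
a, b = 1.3 + 0.2j, -0.7 + 1.1j
C = cr_matrix([a.real, a.imag]) @ cr_matrix([b.real, b.imag])
assert np.allclose([C[0, 0], C[1, 0]], [(a * b).real, (a * b).imag])

# A complex-perturbed FC layer: stack [Re(x); Im(x)] and apply the real
# weight matrix to each block. Since the weights are real, the CR form is
# block-diagonal, which is why the off-diagonal products can be skipped.
W = np.random.default_rng(1).standard_normal((4, 4))   # real layer weights
x = np.random.default_rng(2).standard_normal(4) + 1e-8j * np.ones(4)
y_complex = W @ x                                        # reference forward
y_real = np.kron(np.eye(2), W) @ np.concatenate([x.real, x.imag])
assert np.allclose(y_real[:4], y_complex.real)
assert np.allclose(y_real[4:], y_complex.imag)
```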
Our DAE-based nonlinearly reduced simulation algorithm can be extended and integrated into other simulation frameworks with ease. For instance, we can use the generalized Newton-Euler equation to couple local deformation and rigid body dynamics [Kim and Delaney 2013; Shabana 2003]. In this case, the training poses need to be generated with the rigid body motion removed as well. Fig. 13 reports a real-time simulation of a falling bunny on wooden stairs. We use implicit penalty forces to resolve collision and self-collision.
Fig. 13. Falling bunny. We use the generalized Newton-Euler equation to couple DAE-based model reduction with large rigid body motion to simulate free-floating objects. The subspace configuration of the bunny is 30 + 10 (linear + nonlinear, see Tab. 2).

Model         Ele.    Tri.    Subspace   |C|     D.    FPS (speedup)
–             –       –       – + 10     100     –     44 (31×)
Armadillo     40K     20K     30 + 10    140     –     30 (35×)
Bunny         16K     8K      30 + 10    100     –     45 (30×)
Cactus        233K    139K    10 + 6     –,600   165   2.– (–×)
Puffer ball   625K    120K    10 + –     –,400   321   1.4 (36×)

Table 2. Time performance of our nonlinear subspace simulator. Ele. and Tri. are the total numbers of elements and surface triangles on the model; Subspace reports the composition of our subspace configuration (linear + nonlinear, per domain); |C| is the number of Cubature elements used; D. is the total number of domains on the model; FPS is the simulation frame rate (and the speedup compared with the single-core full simulation).

Fig. 14. Dropping an Armadillo onto the cactus. We use the deformation substructuring method [Barbič and Zhao 2011] to build multi-level subspaces on the cactus. Each local domain has a compact subspace of size 10 + 6 (linear + nonlinear).

For geometrically complex models, the advantage of nonlinear reduction can be further amplified with the domain decomposition method [Barbič and Zhao 2011; Wu et al. 2015; Yang et al. 2013] instead of naively increasing the global subspace size. To this end, we also couple our DAE-based nonlinear reduction with deformation substructuring [Barbič and Zhao 2011], which delivers more interesting animations with DAE-enriched local details. Two examples are reported in Figs. 1 and 14. The puffer ball is an ideal vehicle to demonstrate the advantage of this generalization. It has 320 elastic strings with the same geometry. Therefore, the network training (for both the DAE and neural Cubature) of one string can be re-used for all other strings. Thanks to its simple and symmetric geometry, the depth of the DAE can also be cut to 8. The cactus example shown in Fig. 14 is another representative case: the tree-like structure of the cactus allows an effective hierarchical deployment of nonlinear subspaces. Here, we have 165 domains on the cactus, and each domain has a linear subspace of size 10 and a nonlinear subspace of size 6. Lastly, the simulation time performance is summarized in Tab. 2.
In this paper, we present a framework combining classic reduced deformable simulation with deep-learning-empowered data-driven approaches. We advance state-of-the-art reduction methods by plugging a deep autoencoder into the simulation pipeline. While some existing work has attempted this idea before, we are the first to address the high-order differentiability of the deep neural net in order to accurately project the nonlinear dynamics of deformable solids into the tangent space of the deformation manifold. This is made possible by carefully re-engineering complex-step finite difference in the context of deep learning and complementing CSFD with reverse AD. With a CSFD-augmented BP and CSFD directional derivatives, we can evaluate the high-order derivatives of a deep net with only a small number of network passes. Based on this, we also propose a neural Cubature scheme that allows a more efficient Cubature sampling and more accurate weighting. Without ignoring the inertia forces induced by the time-varying tangent projection, we are able to robustly simulate deformable objects with nonlinear model reduction in real time. We believe CSFD-enabled differentiability paves the way to an in-depth integration of deep neural networks and physics-based simulation, which could inspire many follow-up research efforts.

There are also several limitations of our framework. First, the visual improvement of our nonlinear reduction method over existing linear reduction methods is not "wow". After all, we are manipulating a reduced simulation with only dozens of DOFs. We believe combining neural networks with other data-driven approaches used in graphics could potentially improve this. For instance, if the future deformation types are somehow known, we could use the DAE to build a more specific nonlinear subspace as in [Harmon and Zorin 2013]. Building a hierarchical DAE is also a promising solution. It may be possible to train a neural network to select multiple pre-trained DAEs to locally expand the tangent space. Model reduction is a powerful tool not only for deformable object simulation; we will further investigate how to use nonlinear reduction to improve other simulation problems like fluid, cloth, and sound synthesis.

REFERENCES
Lars V Ahlfors. 1973. Complex Analysis. 1979.Steven S An, Theodore Kim, and Doug L James. 2008. Optimizing cubature for efficientintegration of subspace deformations.
ACM transactions on graphics (TOG)
27, 5(2008), 1โ10.Johannes Ballรฉ, Valero Laparra, and Eero P Simoncelli. 2016. End-to-end optimizedimage compression. arXiv preprint arXiv:1611.01704 (2016).Jernej Barbiฤ, Marco da Silva, and Jovan Popoviฤ. 2009. Deformable object animationusing reduced optimal control. In
ACM SIGGRAPH 2009 papers . 1โ9.Jernej Barbiฤ, Funshing Sin, and Eitan Grinspun. 2012. Interactive editing of deformablesimulations.
ACM Transactions on Graphics (TOG)
31, 4 (2012), 1โ8.Jernej Barbiฤ and Yili Zhao. 2011. Real-time large-deformation substructuring.
ACMtransactions on graphics (TOG)
30, 4 (2011), 1โ8.Jernej Barbiฤ and Doug L James. 2005. Real-time subspace integration for St. Venant-Kirchhoff deformable models. In
ACM Trans. Graph. (TOG) , Vol. 24. ACM, 982โ990.Atฤฑlฤฑm Gรผnes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey MarkSiskind. 2017. Automatic differentiation in machine learning: a survey.
The Journalof Machine Learning Research
18, 1 (2017), 5595โ5637.Hervรฉ Bourlard and Yves Kamp. 1988. Auto-association by multilayer perceptrons andsingular value decomposition.
Biological cybernetics
59, 4-5 (1988), 291โ294.Alain J Brizard. 2014.
Introduction To Lagrangian Mechanics, An . World ScientificPublishing Company.H Martin Bรผcker, George Corliss, Paul Hovland, Uwe Naumann, and Boyana Norris.2006.
Automatic differentiation: applications, theory, and implementations . Vol. 50.Springer Science & Business Media.Steve Capell, Seth Green, Brian Curless, Tom Duchamp, and Zoran Popoviฤ. 2002.Interactive skeleton-driven dynamic deformations. In
ACM Trans. Graph. (TOG) ,Vol. 21. ACM, 586โ593.Min Gyu Choi and Hyeong-Seok Ko. 2005. Modal warping: Real-time simulation oflarge rotational deformation and manipulation.
IEEE Trans. on Visualization andComputer Graphics
11, 1 (2005), 91โ101.Franรงois Faure, Benjamin Gilles, Guillaume Bousquet, and Dinesh K Pai. 2011. Sparsemeshless models of complex deformable solids. In
ACM Trans. Graph. (TOG) , Vol. 30.ACM, 73.Marco Fratarcangeli, Valentina Tibaldo, and Fabio Pellacini. 2016. Vivace: A practicalgauss-seidel method for stable soft body dynamics.
ACM Transactions on Graphics(TOG)
35, 6 (2016), 1โ9.Lawson Fulton, Vismay Modi, David Duvenaud, David IW Levin, and Alec Jacobson.2019. Latent-space Dynamics for Reduced Deformable Simulation. In
Computergraphics forum , Vol. 38. Wiley Online Library, 379โ391.Benjamin Gilles, Guillaume Bousquet, Francois Faure, and Dinesh K Pai. 2011. Frame-based elastic models.
ACM Trans. Graph. (TOG)
30, 2 (2011), 15.Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. 6.2. 2.3 softmax units formultinoulli output distributions.
Deep learning (2016), 180โ184.Gaรซl Guennebaud, Benoit Jacob, et al. 2010. Eigen.
URl: http://eigen. tuxfamily. org (2010).Brian Guenter. 2007. Efficient symbolic differentiation for graphics applications. In
ACM SIGGRAPH 2007 papers . 108โes.Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. 2019.Video compression with rate-distortion autoencoders. In
Proceedings of the IEEEInternational Conference on Computer Vision . 7033โ7042.Fabian Hahn, Bernhard Thomaszewski, Stelian Coros, Robert W Sumner, Forrester Cole,Mark Meyer, Tony DeRose, and Markus Gross. 2014. Subspace clothing simulationusing adaptive bases.
ACM Transactions on Graphics (TOG)
33, 4 (2014), 1โ9.David Harmon and Denis Zorin. 2013. Subspace integration with local deformations.
ACM Transactions on Graphics (TOG)
32, 4 (2013), 1โ10.Kris K Hauser, Chen Shen, and James F OโBrien. 2003. Interactive Deformation UsingModal Analysis with Constraints.. In
Graphics Interface , Vol. 3. 16โ17.Florian Hecht, Yeon Jin Lee, Jonathan R Shewchuk, and James F OโBrien. 2012. Updatedsparse cholesky factors for corotational elastodynamics.
ACM Trans. Graph. (TOG)
31, 5 (2012), 123.Robert Hecht-Nielsen. 1992. Theory of the backpropagation neural network. In
Neuralnetworks for perception . Elsevier, 65โ93. Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality ofdata with neural networks. science
ACMTransactions on Graphics (TOG)
24, 3 (2005), 399โ407.Theodore Kim and John Delaney. 2013. Subspace fluid re-simulation.
ACM Transactionson Graphics (TOG)
32, 4 (2013), 1โ9.Theodore Kim and Doug L James. 2009. Skipping steps in deformable simulation withonline model reduction. In
ACM Trans. Graph. (TOG) , Vol. 28. ACM, 123.Dana A Knoll and David E Keyes. 2004. Jacobian-free NewtonโKrylov methods: asurvey of approaches and applications.
J. Comput. Phys.
ACM Transactions onGraphics (TOG)
34, 6 (2015), 1โ9.Lei Lan, Ran Luo, Marco Fratarcangeli, Weiwei Xu, Huamin Wang, Xiaohu Guo, JunfengYao, and Yin Yang. 2020. Medial Elastics: Efficient and Collision-Ready Deformationvia Medial Axis Transform.
ACM Transactions on Graphics (TOG)
39, 3 (2020), 1โ17.Gregory Lantoine, Ryan P Russell, and Thierry Dargent. 2012. Using multicomplexvariables for automatic computation of high-order derivatives.
ACM Transactionson Mathematical Software (TOMS)
38, 3 (2012), 1โ21.Tiantian Liu, Adam W. Bargteil, James F. OโBrien, and Ladislav Kavan. 2013. FastSimulation of Mass-spring Systems.
ACM Trans. Graph. (TOG)
32, 6 (2013), 214:1โ214:7.Tiantian Liu, Sofien Bouaziz, and Ladislav Kavan. 2017. Quasi-newton methods forreal-time simulation of hyperelastic materials.
ACM Transactions on Graphics (TOG)
36, 3 (2017), 1โ16.Ran Luo, Weiwei Xu, Tianjia Shao, Hongyi Xu, and Yin Yang. 2019. Accelerated complex-step finite difference for expedient deformable simulation.
ACM Transactions onGraphics (TOG)
38, 6 (2019), 1โ16.Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey.2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).Sebastian Martin, Peter Kaufmann, Mario Botsch, Eitan Grinspun, and Markus Gross.2010. Unified simulation of elastic rods, shells, and solids. In
ACM Trans. Graph.(TOG) , Vol. 29. ACM, 39.Joaquim RRA Martins, Peter Sturdza, and Juan J Alonso. 2003. The complex-stepderivative approximation.
ACM Transactions on Mathematical Software (TOMS)
Computer Graphics Forum , Vol. 35. Wiley Online Library,169โ178.Matthias Mรผller, Bruno Heidelberger, Matthias Teschner, and Markus Gross. 2005.Meshless deformations based on shape matching. In
ACM Trans. Graph. (TOG) ,Vol. 24. ACM, 471โ478.Przemyslaw Musialski, Christian Hafner, Florian Rist, Michael Birsak, Michael Wim-mer, and Leif Kobbelt. 2016. Non-linear shape optimization using local subspaceprojections.
ACM Transactions on Graphics (TOG)
35, 4 (2016), 1โ13.Vinod Nair and Geoffrey E Hinton. 2009. 3D object recognition with deep belief nets.
Advances in neural information processing systems
22 (2009), 1339โ1347.Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restrictedboltzmann machines. In
ICML .HM Nasir. 2013. A new class of multicomplex algebra with applications.
MathematicalSciences International Research Journal
2, 2 (2013), 163โ168.CUDA Nvidia. 2008. Cublas library.
NVIDIA Corporation, Santa Clara, California
15, 27(2008), 31.Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, GregoryChanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019.Pytorch: An imperative style, high-performance deep learning library. arXiv preprintarXiv:1912.01703 (2019).Alex Pentland and John Williams. 1989. Good vibrations: Modal dynamics for graphicsand animation. In
Proceedings of the 16th annual conference on Computer graphicsand interactive techniques . 215โ222.Eric Pesheck, Nicolas Boivin, Christophe Pierre, and Steven W Shaw. 2001. Nonlin-ear modal analysis of structural systems using multi-mode invariant manifolds.
Nonlinear Dynamics
25, 1-3 (2001), 183โ205.Salah Rifai, Grรฉgoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, YannDauphin, and Xavier Glorot. 2011a. Higher order contractive auto-encoder. In
Joint European conference on machine learning and knowledge discovery in databases .Springer, 645โ660.Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. 2011b.Contractive auto-encoders: Explicit invariance during feature extraction. In
Icml .Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015.Imagenet large scale visual recognition challenge.
International journal of computervision igh-order Differentiable Autoencoder for Nonlinear Model Reduction โข 15
Sangriyadi Setio, Herlien D Setio, and Louis Jezequel. 1992. Modal analysis of nonlinearmulti-degree-of-freedom structures.
IJAEM
7, 2 (1992), 75โ93.Ahmed A Shabana. 2003.
Dynamics of multibody systems . Cambridge university press.Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher DManning. 2011. Semi-supervised recursive autoencoders for predicting sentimentdistributions. In
Proceedings of the 2011 conference on empirical methods in naturallanguage processing . 151โ161.Rasmus Tamstorf, Toby Jones, and Stephen F McCormick. 2015. Smoothed aggregationmultigrid for cloth simulation.
ACM Trans. Graph. (TOG)
34, 6 (2015), 245.Adrien Treuille, Andrew Lewis, and Zoran Popoviฤ. 2006. Model reduction for real-timefluids.
ACM Transactions on Graphics (TOG)
25, 3 (2006), 826โ834.Christoph Von-Tycowicz, Christian Schulz, Hans-Peter Seidel, and Klaus Hildebrandt.2015. Real-time nonlinear shape interpolation.
ACM Transactions on Graphics (TOG)
34, 3 (2015), 1โ10.Huamin Wang, James F OโBrien, and Ravi Ramamoorthi. 2011. Data-driven elasticmodels for cloth: modeling and measurement.
ACM transactions on graphics (TOG)
30, 4 (2011), 1โ12.Huamin Wang and Yin Yang. 2016. Descent methods for elastic body simulation on theGPU.
ACM Trans. Graph. (TOG)
35, 6 (2016), 212.Yu Wang, Alec Jacobson, Jernej Barbiฤ, and Ladislav Kavan. 2015. Linear subspacedesign for real-time shape deformation.
ACM Transactions on Graphics (TOG)
34, 4(2015), 1โ11.Steffen Wiewel, Moritz Becher, and Nils Thuerey. 2019. Latent space physics: Towardslearning the temporal evolution of fluid flow. In
Computer Graphics Forum , Vol. 38.Wiley Online Library, 71โ82.Stephen Wolfram et al. 1999.
The MATHEMATICAยฎ book, version 4 . Cambridge univer-sity press.Xiaofeng Wu, Rajaditya Mukherjee, and Huamin Wang. 2015. A unified approach forsubspace simulation of deformable bodies in multiple domains.
ACM Transactionson Graphics (TOG)
34, 6 (2015), 1โ9.Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S YuPhilip. 2020. A comprehensive survey on graph neural networks.
IEEE Transactionson Neural Networks and Learning Systems (2020).Hongyi Xu and Jernej Barbiฤ. 2016. Pose-space subspace dynamics.
ACM Transactionson Graphics (TOG)
35, 4 (2016), 1โ14.Hongyi Xu, Yijing Li, Yong Chen, and Jernej Barbiฤ. 2015. Interactive material designusing model reduction.
ACM Transactions on Graphics (TOG)
34, 2 (2015), 1โ14.Yin Yang, Dingzeyu Li, Weiwei Xu, Yuan Tian, and Changxi Zheng. 2015. Expeditingprecomputation for reduced deformable simulation.
ACM Trans. Graph. (TOG)
34, 6(2015).Yin Yang, Weiwei Xu, Xiaohu Guo, Kun Zhou, and Baining Guo. 2013. Boundary-awaremultidomain subspace deformation.
IEEE transactions on visualization and computergraphics
19, 10 (2013), 1633โ1645.Nianyin Zeng, Hong Zhang, Baoye Song, Weibo Liu, Yurong Li, and Abdullah MDobaie. 2018. Facial expression recognition via learning deep sparse autoencoders.
Neurocomputing
273 (2018), 643โ649.Jiayi Eris Zhang, Seungbae Bang, David I.W. Levin, and Alec Jacobson. 2020. Comple-mentary Dynamics.
ACM Transactions on Graphics (2020).Yongning Zhu, Eftychios Sifakis, Joseph Teran, and Achi Brandt. 2010. An efficientmultigrid method for the simulation of high-resolution elastic solids.
ACM Trans.Graph. (TOG)
29, 2 (2010), 16.Olgierd Cecil Zienkiewicz, Robert Leroy Taylor, Olgierd Cecil Zienkiewicz, andRobert Lee Taylor. 1977.
The finite element method . Vol. 36. McGraw-hill Lon-don.
A FIRST-, SECOND-, AND THIRD-ORDER DERIVATIVES OF NONLINEAR ACTIVATION
In this appendix, we give the analytic formulas for the first- and second-order derivatives of the activation function sin(·) used in our network. We give a partial formulation for the third-order derivative too, which is quite verbose; for completeness, we move the entire formulation into the supplementary document.

The first-order CSFD perturbation of sin(·) is:

sin(a + b i_1) = sin(a) cosh(b) + cos(a) sinh(b) i_1.

The second-order CSFD perturbation of sin(·) can be written as:

sin(a + b i_1 + c i_2 + d i_1 i_2) = a' + b' i_1 + c' i_2 + d' i_1 i_2,

where

a' = sin(a) cosh(b) cosh(c) cos(d) - cos(a) sinh(b) sinh(c) sin(d),
b' = sin(a) cosh(b) sinh(c) sin(d) + cos(a) sinh(b) cosh(c) cos(d),
c' = cos(a) cosh(b) sinh(c) cos(d) + sin(a) sinh(b) cosh(c) sin(d),
d' = cos(a) cosh(b) cosh(c) sin(d) - sin(a) sinh(b) sinh(c) cos(d).

Similarly, we write the third-order multicomplex perturbation of sin(·) as:

sin(a + b i_1 + c i_2 + d i_1 i_2 + e i_3 + f i_1 i_3 + g i_2 i_3 + h i_1 i_2 i_3) = a' + b' i_1 + c' i_2 + d' i_1 i_2 + e' i_3 + f' i_1 i_3 + g' i_2 i_3 + h' i_1 i_2 i_3.

The coefficient a' of the real part is:

a' =   sin(a) cosh(b) cosh(c) cos(d) cosh(e) cos(f) cos(g) cosh(h)
     + sin(a) cosh(b) cosh(c) cos(d) sinh(e) sin(f) sin(g) sinh(h)
     + sin(a) cosh(b) sinh(c) sin(d) cosh(e) cos(f) sin(g) sinh(h)
     - sin(a) cosh(b) sinh(c) sin(d) sinh(e) sin(f) cos(g) cosh(h)
     + cos(a) sinh(b) cosh(c) cos(d) cosh(e) cos(f) sin(g) sinh(h)
     - cos(a) sinh(b) cosh(c) cos(d) sinh(e) sin(f) cos(g) cosh(h)
     - cos(a) sinh(b) sinh(c) sin(d) cosh(e) cos(f) cos(g) cosh(h)
     - cos(a) sinh(b) sinh(c) sin(d) sinh(e) sin(f) sin(g) sinh(h)
     - cos(a) cosh(b) sinh(c) cos(d) sinh(e) cos(f) sin(g) cosh(h)
     + cos(a) cosh(b) sinh(c) cos(d) cosh(e) sin(f) cos(g) sinh(h)
     + cos(a) cosh(b) cosh(c) sin(d) sinh(e) cos(f) cos(g) sinh(h)
     + cos(a) cosh(b) cosh(c) sin(d) cosh(e) sin(f) sin(g) cosh(h)
     - sin(a) sinh(b) sinh(c) cos(d) sinh(e) cos(f) cos(g) sinh(h)
     - sin(a) sinh(b) sinh(c) cos(d) cosh(e) sin(f) sin(g) cosh(h)
     - sin(a) sinh(b) cosh(c) sin(d) sinh(e) cos(f) sin(g) cosh(h)
     + sin(a) sinh(b) cosh(c) sin(d) cosh(e) sin(f) cos(g) sinh(h).    (23)

The exact formulations of b', c', d', e', f', g', and h' can be found in the supplementary document.
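As a quick numerical sanity check of the first- and second-order expansions above (our own verification snippet, not part of the original implementation), one can compare the closed-form coefficients against NumPy's complex sin and against the exact derivatives of sin(·):

```python
import numpy as np

a, h = 0.9, 1e-20   # evaluation point and complex-step size

# First order: sin(a + h*i) = sin(a)cosh(h) + cos(a)sinh(h) i,
# so Im(sin(a + ih))/h recovers d/da sin(a) = cos(a).
z = np.sin(a + 1j * h)
assert np.isclose(z.real, np.sin(a) * np.cosh(h))
assert np.isclose(z.imag, np.cos(a) * np.sinh(h))
print("sin'(a) via CSFD:", z.imag / h, "exact:", np.cos(a))

# Second order: with a perturbation b i1 + c i2 (and d = 0), the i1*i2
# coefficient d' divided by b*c approximates d^2/da^2 sin(a) = -sin(a).
b = c = 1e-6
d = 0.0
d_prime = (np.cos(a) * np.cosh(b) * np.cosh(c) * np.sin(d)
           - np.sin(a) * np.sinh(b) * np.sinh(c) * np.cos(d))
print("sin''(a) via 2nd-order coefficient:", d_prime / (b * c),
      "exact:", -np.sin(a))
```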