Deep Autoencoders: From Understanding to Generalization Guarantees
Romain Cosentino, Randall Balestriero, Richard Baraniuk, Behnaam Aazhang
E.C.E. Department, Rice University, Houston, TX
[email protected]
Abstract
Deep Autoencoders (AEs) provide a versatile framework to learn a compressed, interpretable, or structured representation of data. As such, AEs have been used extensively for denoising, compression, and data completion, as well as for pre-training of Deep Networks (DNs) for various tasks such as classification. By providing a careful analysis of current AEs from a spline perspective, we can interpret the input-output mapping, in turn allowing us to derive conditions for generalization and reconstruction guarantees. By assuming a Lie group structure on the data at hand, we derive a novel regularization of AEs that, for the first time, ensures the generalization of AEs in the finite training set case. We validate our theoretical analysis by demonstrating how this regularization significantly increases the generalization of the AE on various datasets.
Introduction

Autoencoders provide a rich and versatile framework that discovers the salient features of the data in an unsupervised manner. Such algorithms can be leveraged for compression Cheng et al. [2018], denoising Eraslan et al. [2019], and data completion Tran et al. [2017], as well as for pre-training of DNs Erhan et al. [2010]. This method has been developed under the common assumption that the data lie on a low-dimensional nonlinear manifold. Solving those denoising or compression tasks is equivalent to discovering the underlying manifold structure of the data, a task that becomes challenging in the high-dimensional and finite-sample regime Mallat [2016]. The extensive use of AEs during the last decade led to the development of methods improving their generalization capability by introducing various explicit or implicit regularizations Rifai et al. [2011b], Makhzani and Frey [2013], Vincent et al. [2008]. Despite that progress, their underlying mechanisms and generalization capability are still poorly understood Lei et al. [2020].

In this paper, we analytically characterize the mechanisms of AEs and develop a regularization that forces the approximation of the data manifold with generalization guarantees in the finite data regime. We demonstrate that, under this regularization, the AE achieves perfect generalization in the case of a Lie group based manifold. Besides this theoretical statement, our empirical results on real datasets demonstrate the performance of this regularization.

Our approach is two-fold. First, we provide an interpretable formulation of the local parametric representation of the manifold approximated by AEs. To do so, we leverage recent advances in the theoretical study of DNs developed in Balestriero and Baraniuk [2018]. By leveraging the Max-Affine Spline (MAS) properties of current DNs and making explicit their continuous piecewise affine (CPA) input-output mapping, we make explicit some critical properties of AEs, such as the roles of the per layer parameters, how standard regularization techniques affect the AE mapping, and how the encoder and decoder per region affine mappings are related. Second, we extend those results by
considering problems where the data manifold corresponds to the orbit of a Lie group, such as in Rao and Ruderman [1999], Sohl-Dickstein et al. [2010]. In such a dataset, each sample is represented by a group action applied to another sample. We provide a novel regularization for AEs that constrains the nearby linear maps to incorporate such structural information about the data at hand, and show that, under this regularization, generalization is guaranteed in the finite data regime.

Our contributions are summarized below:

• We demonstrate how current AEs provide a CPA manifold approximation from which we can interpret the role of the encoder, decoder, layer parameters, and latent dimension (Sec. 3.1); e.g., after successful learning, the encoder and decoder must be tied by a bi-orthogonality condition.

• Under this viewpoint, we demonstrate how comparing adjacent region affine mappings provides an efficient method to estimate the curvature of the manifold (Sec. 3.2), and provide insights into the standard regularization techniques employed in AEs (Sec. 3.3).

• Finally, we demonstrate how, for a dataset corresponding to the orbit of a Lie group, the AE must fulfill a curvature condition (Sec. 4.1). We turn this curvature condition into a regularization adapted to AEs and demonstrate how generalization guarantees can be obtained even in the finite training data regime (Sec. 4.2). We empirically demonstrate that our regularization outperforms other variants of AE and stabilizes the AE's training (Sec. 4.3).
Max Affine Spline Network:
A Deep (Neural) Network (DN) is an operator f_Θ with parameters Θ composing L intermediate layer mappings f_ℓ, ℓ = 1, ..., L, that combine affine and simple nonlinear operators such as the fully connected operator, convolution operator, activation operator (applying a scalar nonlinearity such as the ubiquitous ReLU), or pooling operator. A DN employing nonlinearities such as (leaky-)ReLU, absolute value, and max-pooling is a continuous piecewise linear operator and thus lives on a partition Ω of the input space. As such, the DN's CPA mapping of an input x can be written as

f_Θ(x) = Σ_{ω ∈ Ω} 1_{x ∈ ω} (A_ω x + B_ω),    (1)

where 1 denotes the indicator function, and A_ω and B_ω are the per region affine parameters involving the DN per layer affine parameters W_ℓ, b_ℓ ∈ Θ and the nonlinearity states of the region ω ∈ Ω Balestriero and Baraniuk [2018]. The unit and layer input space partitionings can be rewritten as Power Diagrams, a generalization of Voronoi Diagrams Balestriero et al. [2019]; composing layers produces a Power Diagram subdivision.
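As an illustration of Eq. (1) (a sketch of our own, not code from the paper), the per-region affine parameters (A_ω, B_ω) of a small, randomly initialized ReLU network can be read off from the activation pattern of an input; the layer sizes and weights below are arbitrary assumptions.

```python
# Sketch: recover the per-region affine map (A_omega, B_omega) of a toy ReLU net
# from the activation pattern of one input (Eq. (1)).  Weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)   # layer 1
W2, b2 = rng.normal(size=(3, 16)), rng.normal(size=3)    # layer 2

def relu_net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_affine(x):
    """Return (A_omega, B_omega) such that f(x') = A_omega x' + B_omega
    for every x' sharing the same activation pattern as x."""
    q = (W1 @ x + b1 > 0).astype(float)          # ReLU state of the region
    Q = np.diag(q)
    A = W2 @ Q @ W1                              # per-region slope
    B = W2 @ Q @ b1 + b2                         # per-region offset
    return A, B

x = rng.normal(size=4)
A, B = region_affine(x)
assert np.allclose(relu_net(x), A @ x + B)       # the affine map matches the net at x
```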
Autoencoder:
An Autoencoder (AE) aims at learning an identity mapping, also known as auto-association Ackley et al. [1985], on a given dataset with a bottleneck latent dimension. It was first implemented for image compression Cottrell et al. [1987], speech recognition Elman and Zipser [1988], and dimensionality reduction Baldi and Hornik [1989]. It is composed of two nonlinear maps: an encoder, denoted by E, and a decoder, denoted by D. The encoder maps an input x ∈ R^d to a hidden layer of dimension h < d, E(x), which encodes the salient features in the data Goodfellow et al. [2016] and defines its code or embedding. The decoder reconstructs the input from its code; thus the entire AE map is defined as D ∘ E(x), with ∘ denoting the composition operator. The weights of the AE are learned based on some flavor of reconstruction loss, e.g., the mean-square error for real-valued data and the binary cross-entropy for binary data, between the output D ∘ E(x) and the input x. To improve generalization, some regularization can complement the reconstruction loss Srivastava et al. [2014], such as favoring sparsity of the code Makhzani and Frey [2013] or sparsity of the weights Jarrett et al. Other types of regularization include injecting noise into the input, leading to the Denoising AE, known to increase the robustness to small input perturbations Vincent et al. [2008]. Closer to our work, Rifai et al. [2011b] and Rifai et al. [2011a] proposed to improve the robustness of the code to small input perturbations and to penalize the curvature of the encoder mapping by regularizing the Jacobian as well as the Hessian of E.

Figure 1: 2-dimensional visualization of the input space partitioning Ω_{E,D} induced by a randomly initialized AE. Each region bounded by the red lines has a set of MAS parameters A^E_ω, A^D_ω, B^E_ω, B^D_ω described in Eq. (5), which depend on the per layer affine parameters as well as the nonlinearity states of the region ω. To reconstruct its input, an AE realizes an affine map for each region; its output for a sample of a given region ω is given by Eq. (4).

Learning Group Transformations:
The approximation of Lie groups was introduced by Rao and Ruderman [1999] and aims at learning the transformation operator underlying the data, under the assumption that the dataset is the result of the action of a group on a sample. Different forms of this approximation have been introduced in Sohl-Dickstein et al. [2010], Wang et al. [2011] to reduce its computational complexity and improve its efficiency. This approach has also been used in Bahroun et al. [2019] to develop a biologically plausible motion detector algorithm describing vision. In the case of a Lie group, the dataset can be modeled according to the first-order Lie equation

dx(θ)/dθ = G x(θ),    (2)

where x(θ) ∈ R^d, θ is the coefficient governing the amount of transformation, and G ∈ R^{d×d}. This first-order differential equation indicates that the variation of the data is linear with respect to the data and depends on the infinitesimal operator G ∈ T_I 𝒢, where T_I 𝒢 denotes the Lie algebra of the group 𝒢, i.e., the tangent of the group at the identity element. The solution of Eq. (2) is given by x(θ) = exp(θG) x(0); an intuitive example is given in Appendix B.

While the learnability of the exponential map is tedious, one can exploit its local linearity to learn the infinitesimal operator. In fact, for a small ε we have

x(θ + ε) ≈ (I + εG) x(θ).    (3)

The operator G can thus be learned using data that are close to each other, as they result from small transformations and thus follow this approximation. Without supervision, the search for neighboring data is achieved by the nearest neighbor algorithm, as in Hashimoto et al. [2017]. Note that in the case of a group depending on multiple parameters, Eq. (3) becomes x(θ + ε) ≈ (I + Σ_{k=1}^h ε_k G_k) x(θ), where h denotes the dimension of the group and ε_k the transformation parameter associated with the infinitesimal operator G_k.
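As a minimal numerical sketch of Eq. (3) (our illustration; the rotation generator, orbit, and sample sizes are assumptions), the infinitesimal operator of the 2-D rotation group can be recovered by least squares from pairs of nearby samples along an orbit.

```python
# Sketch: recover G in  x(theta+eps) - x(theta) ≈ eps * G * x(theta)  (Eq. (3))
# from pairs of nearby samples on a rotation orbit.  Toy data, not the paper's.
import numpy as np

rng = np.random.default_rng(1)
G_true = np.array([[0.0, -1.0], [1.0, 0.0]])       # generator of 2-D rotations
eps = 1e-2

thetas = rng.uniform(0, 2 * np.pi, size=200)
x0 = np.array([1.0, 0.5])
orbit = lambda t: np.stack([np.cos(t) * x0[0] - np.sin(t) * x0[1],
                            np.sin(t) * x0[0] + np.cos(t) * x0[1]])
X = orbit(thetas)                                   # shape (2, n): samples x(theta)
Y = orbit(thetas + eps)                             # nearby samples x(theta + eps)

D = Y - X                                           # least squares for eps*G in D ≈ eps*G*X
G_hat = (D @ X.T) @ np.linalg.inv(X @ X.T) / eps
print(np.round(G_hat, 3))                           # close to G_true for small eps
```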
We leverage the CPA operator defined in Eq. (1) to reformulate the AE so as to interpret the roles of the decoder and encoder and derive a necessary condition for the reconstruction of a piecewise linear data surface (Sec. 3.1), to characterize its per region surface via the Jacobian and approximated Hessian of the CPA operator (Sec. 3.2), and to use these findings to analyze commonly used variations of AEs (Sec. 3.3).

The output of a CPA DN is formed as per Eq. (1). Since an AE composes two DNs, the encoder and the decoder, the entire mapping remains a CPA with an input space partition and per region affine mappings. Let ω ∈ Ω_{E,D} define a region induced by the AE partitioning of the input space, as described in Sec. 2 and visualized in Fig. 1. Given a d-dimensional sample x ∈ ω, the max-affine spline formulation of the AE mapping is

D ∘ E(x) = A^D_ω A^E_ω x + A^D_ω B^E_ω + B^D_ω,    (4)

where ∘ is the composition operator, A^D_ω ∈ R^{d×h}, A^E_ω ∈ R^{h×d}, B^E_ω ∈ R^h, and B^D_ω ∈ R^d, with d the dimension of the input data and h the bottleneck dimension. Let us denote by W_ℓ ∈ R^{d_ℓ × d_{ℓ-1}}, b_ℓ ∈ R^{d_ℓ} the affine parameters of each layer (with structure depending on the layer type), where ℓ ∈ {1, ..., L} indexes the encoder layers and ℓ ∈ {L+1, ..., L+P} the decoder ones, L denotes the number of encoder layers, P the number of decoder layers, d_{ℓ-1} the input dimension of layer ℓ, and d_ℓ its output dimension. We have d_L = h, the bottleneck dimension, and d_0 = d_{L+P} = d, the input and output dimension. We also denote by Q^ℓ_ω the diagonal matrices encoding the region-induced states of the nonlinearities (taking values in {0, 1} for ReLU and {−1, 1} for absolute value). The parameters of the max-affine spline AE formulation described in Eq. (4) are defined as

A^E_ω = W_L Q^{L-1}_ω W_{L-1} ⋯ Q^1_ω W_1  and  B^E_ω = b_L + Σ_{i=1}^{L-1} W_L Q^{L-1}_ω W_{L-1} ⋯ Q^i_ω b_i.    (5)

A^D_ω and B^D_ω are defined similarly with ℓ ∈ {L+1, ..., L+P}. Now, let us rewrite Eq. (4) as

D ∘ E(x) = Σ_{k=1}^h ⟨a^E_k, x⟩ a^D_k + B^{E,D}_ω = A^D_ω A^E_ω x + B^{E,D}_ω,    (6)

where B^{E,D}_ω = A^D_ω B^E_ω + B^D_ω, the a^{E⊤}_k are the rows of A^E_ω, and the a^D_k are the columns of A^D_ω. This is the shifted mapping of x onto the subspace spanned by A^D_ω, with coordinates driven by A^E_ω.

From Eq. (6), we deduce the per region roles of the encoder and decoder. The samples of each region ω are expressed in the basis defined by the decoder region-dependent parameter A^D_ω, which is the per region parametric representation of the approximated manifold, with coordinates given by the region-dependent parameter A^E_ω; the whole mapping is then shifted according to both the encoder and decoder CPA parameters.

Figure 2: MAS surface induced by the decoder (latent dimension being 2). The gray denotes the regions, and the red lines their borders. Each gray region has a slope characterized by the Jacobian of the decoder as in Eq. (8). Our work aims at developing a constraint on these surfaces via their per-region tangent, such that they approximate the manifold defined by the orbit of a signal with respect to the action of a group.
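To make the decomposition of Eqs. (4)–(5) concrete, here is a small sketch (illustrative sizes and random weights, not the paper's architecture) that assembles A^E_ω, B^E_ω, A^D_ω, B^D_ω for a one-hidden-layer ReLU encoder and decoder and checks that the resulting affine map matches the network output.

```python
# Sketch: per-region parameters of a toy ReLU autoencoder and a check of Eq. (4).
import numpy as np

rng = np.random.default_rng(2)
d, h = 6, 2                                                 # input and bottleneck dims
We1, be1 = rng.normal(size=(8, d)), rng.normal(size=8)
We2, be2 = rng.normal(size=(h, 8)), rng.normal(size=h)      # encoder output layer
Wd1, bd1 = rng.normal(size=(8, h)), rng.normal(size=8)
Wd2, bd2 = rng.normal(size=(d, 8)), rng.normal(size=d)      # decoder output layer

def encode(x): return We2 @ np.maximum(We1 @ x + be1, 0.0) + be2
def decode(z): return Wd2 @ np.maximum(Wd1 @ z + bd1, 0.0) + bd2

def region_params(x):
    Qe = np.diag((We1 @ x + be1 > 0).astype(float))         # encoder ReLU states
    Ae, Be = We2 @ Qe @ We1, We2 @ Qe @ be1 + be2            # Eq. (5)
    z = Ae @ x + Be
    Qd = np.diag((Wd1 @ z + bd1 > 0).astype(float))          # decoder ReLU states
    Ad, Bd = Wd2 @ Qd @ Wd1, Wd2 @ Qd @ bd1 + bd2
    return Ae, Be, Ad, Bd

x = rng.normal(size=d)
Ae, Be, Ad, Bd = region_params(x)
out = Ad @ Ae @ x + Ad @ Be + Bd                             # Eq. (4)
assert np.allclose(out, decode(encode(x)))
```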
We now derive a necessary condition on the CPA parameters A^D_ω, A^E_ω such that the AE achieves perfect reconstruction of a given continuous piecewise linear surface.

Proposition 1. A necessary condition for the AE to reconstruct a continuous piecewise linear data surface is to be bi-orthogonal, as per

∀x ∈ 𝒳, D ∘ E(x) = x  ⟹  ⟨a^D_k, a^E_{k'}⟩ = 1_{k = k'},

where 𝒳 denotes the data surface. (Proof in Appendix A.1.)

That is, if a continuous piecewise linear surface is perfectly approximated, we know that the parameters of the MAS operator describing the encoder and decoder are bi-orthogonal, i.e., the column vectors of A^D_ω and the row vectors of A^E_ω form a bi-orthogonal basis.

From the CPA formulation, we observed that for each region ω, D ∘ E defines a composition of two affine functions, defined respectively by the parameters A^E_ω, B^E_ω and A^D_ω, B^D_ω. We can thus derive the per region Jacobian and approximated Hessian of the AE. The Jacobian of the AE for a given region ω ∈ Ω_{E,D} is

J_ω[D ∘ E] = A^D_ω A^E_ω.    (7)

More details regarding the Jacobian are given in Appendix A.6. It is also clear that the rank of the Jacobian is upper bounded by the latent dimension, as rank(J_ω[D ∘ E]) ≤ h, where h is the number of units of the bottleneck layer of the AE. This dimension is directly related to the dimension of the manifold that one aims to approximate, assuming that all other layer widths are larger than h.

In this paper, we are interested in the per-region tangent of the decoder, as it defines the per region parametric representation of the manifold; see Fig. 2. We will denote by Ω_D the partition of the latent space induced by the CPA map D. We have

∀ω ∈ Ω_D,  J_ω[D] = A^D_ω,    (8)

where the columns of A^D_ω form the basis of the tangent space induced by D.

The characterization of the curvature of the approximation of the data manifold can be done using the per region Hessian H_ω, ∀ω ∈ Ω_D, which in our case is defined as the sum of the differences of neighboring tangent planes:

∀ω ∈ Ω_D,  ‖H_ω‖_F = Σ_{ω' ∈ N(ω)} ‖J_ω[D] − J_{ω'}[D]‖_F,    (9)

where N(ω) denotes the set of neighbors of region ω and ‖·‖_F is the Frobenius norm. This approach is based on the derivation described in Rifai et al. [2011a]. In practice, we use a stochastic approximation of the sum by generating a small mini-batch of a few corrupted samples, which induce neighboring regions. While inducing a variance in the estimate, this alleviates the need to explicitly compute the analytical Hessian.

We are now interested in leveraging these findings to analyze and interpret the regularizations proposed in standard AE variants.
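Before turning to those variants, the following sketch (our own, with an assumed toy decoder) illustrates the stochastic estimate of Eq. (9): neighboring regions are reached by small random perturbations of a latent code, and their tangents are compared to the tangent at that code.

```python
# Sketch: stochastic estimate of the per-region "Hessian" norm of Eq. (9)
# for a toy one-hidden-layer ReLU decoder with illustrative random weights.
import numpy as np

rng = np.random.default_rng(3)
Wd1, bd1 = rng.normal(size=(8, 2)), rng.normal(size=8)
Wd2, bd2 = rng.normal(size=(6, 8)), rng.normal(size=6)

def decoder_slope(z):
    """Per-region decoder tangent J_omega[D] = A^D_omega (Eq. (8))."""
    Qd = np.diag((Wd1 @ z + bd1 > 0).astype(float))
    return Wd2 @ Qd @ Wd1

def hessian_estimate(z, n_neighbors=8, sigma=0.05):
    """Sum of Frobenius distances between the tangent at z and tangents of
    randomly perturbed (neighboring) latent codes, as in Eq. (9)."""
    J = decoder_slope(z)
    return sum(np.linalg.norm(J - decoder_slope(z + sigma * rng.normal(size=z.shape)), "fro")
               for _ in range(n_neighbors))

print(hessian_estimate(rng.normal(size=2)))
```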
Higher Order Contractive AE
Rifai et al. [2011a]: The regularization penalizes the energy of the first and approximated second derivatives of the encoder map for any region containing a training sample, i.e., ‖A^E_ω‖_F and Σ_{ω' ∈ N(ω)} ‖A^E_ω − A^E_{ω'}‖_F. In the case of a ReLU AE, we know from Eq. (4) and the submultiplicativity of the Frobenius norm that the norm of the Jacobian is upper-bounded by ‖W_L‖_F × ⋯ × ‖W_1‖_F. Therefore, adding a weight-decay penalty on the encoder weights induces the first-order contractive AE. The second-order term induces the curvature of the piecewise linear map A^E to be smooth. However, this constraint applies only to the regions containing training samples (and their neighbors) and thus does not constrain the entire latent space approximation.
Vincent et al. [2008]: The Denoising AE is known to have a similar effect to a weight-decay penalty on the DN architecture Wager et al. [2013]. A penalty on the energy of the W_ℓ induces a penalty on the energy of A^E_ω and A^D_ω, ∀ω ∈ Ω_{E,D}. Therefore, it constrains the slope of each piece to be as flat as possible, in turn implying that the piecewise linear map focuses on approximating the low-frequency content of the data, which reinforces the learning bias of deep networks towards low-frequency information Rahaman et al. [2018].
Weight-Sharing AE
Teng and Choromanska [2019]: In the case of weight-sharing between the decoder and encoder, the W_ℓ of the decoder are the transposes of the W_ℓ of the encoder, which implies that the max-affine spline parameters A^E_ω and A^D_ω only differ via their matrices encoding the nonlinearity states, Q^ℓ_ω.

For the remainder of the paper, we model the observations X as the orbit of a Lie group, for which we first derive the curvature condition that such a manifold must fulfill (Sec. 4.1). We then translate this condition to CPA operators in order to apply it to an AE and demonstrate the generalization guarantees it yields (Sec. 4.2); finally, we demonstrate why such a regularization should be leveraged for learning (Sec. 4.3).

4.1 A Second-Order Characterization of a Lie Group Orbit

The dataset is defined as the orbit of a Lie group, as per Eq. (2): x(θ) = exp(θG) x(0), θ ∈ R, G ∈ T_I 𝒢, where T_I 𝒢 denotes the Lie algebra of the group 𝒢. First, we want to understand under which condition a smooth approximant f ∈ C^∞(R) coincides with the orbit of x(0) under the action of the group 𝒢. In particular, we propose a condition which guarantees perfect approximation of x in the finite data regime, defined as

R(f) ≜ min_G ∫ ‖ d²f(θ)/dθ² − G df(θ)/dθ ‖² dθ,    (10)

where d²f(θ)/dθ² and df(θ)/dθ denote respectively the second- and first-order derivatives of f, and G ∈ R^{d×d} is the infinitesimal operator to be learned.

This regularization constrains f such that its second-order derivative is a linear map of the first-order one. This penalizing term is usually coupled with an approximation loss function, such as the reconstruction error in the case of an AE. Therefore, the function f generally coincides with x on the training samples. The following theorem shows that f coincides with x(θ) = exp(θG) x(0) if and only if d²f(θ)/dθ² = G df(θ)/dθ and ∃θ' such that f(θ') = x(θ').
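As a quick numerical illustration (ours, using an assumed 2-D rotation generator), one can check by finite differences that an exact orbit f(θ) = exp(θG) x(0) satisfies this second-order condition.

```python
# Sketch: finite-difference check of d^2 f/dtheta^2 = G df/dtheta along an
# exact orbit f(theta) = exp(theta G) x(0).  Toy rotation generator.
import numpy as np
from scipy.linalg import expm

G = np.array([[0.0, -1.0], [1.0, 0.0]])
x0 = np.array([1.0, 0.5])
f = lambda t: expm(t * G) @ x0

dt = 1e-4
for theta in np.linspace(0.0, 2.0, 5):
    d1 = (f(theta + dt) - f(theta - dt)) / (2 * dt)                  # df/dtheta
    d2 = (f(theta + dt) - 2 * f(theta) + f(theta - dt)) / dt**2      # d^2 f/dtheta^2
    print(np.linalg.norm(d2 - G @ d1))       # ~0: the orbit satisfies the condition
```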
Theorem 1. Assume G is invertible. If a function f satisfies R(f) = 0 and there exists θ' such that f(θ') = x(θ'), then f has perfect generalization, as in

R(f) = 0 and ∃θ' s.t. f(θ') = x(θ')  ⟺  ‖x − f‖ = 0.    (11)

(Proof in Appendix A.2.)

Thus, one can approximate the orbit of a Lie group 𝒢 with respect to a signal x(0) using two components: the aforementioned second-order regularization and a reconstruction error. If the appropriate infinitesimal operator G is learned and the regularization is equal to zero, then only one data point needs to be approximated to obtain a perfect approximation of the entire data manifold. In the next section, we exploit these findings by considering a constraint on the tangent planes of the decoder of an AE such that their relation is driven by Eq. (10), so as to improve the generalization capability of AEs.

The derived regularization was based on a smooth approximant f and needs to be adapted to the case of a CPA map. From Sec. 3, we know that for each region ω ∈ Ω_D, the decoder is characterized by its tangent plane, i.e., defined by the slope parameter A^D_ω. As developed in the previous section, we consider a regularization imposing that the tangent plane of a given region is the result of a small transformation of its neighboring tangent planes. Following the definition of the Hessian in Eq. (9), we obtain the regularization

R(D) ≜ min_{G_1,...,G_h} Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} min_{θ_1,...,θ_h} ‖ J_ω[D] − (I + Σ_{k=1}^h θ_k G_k) J_{ω'}[D] ‖²_F,    (12)

where N(ω) denotes the set of neighbors of region ω, ‖·‖_F is the Frobenius norm, and we recall that J_{ω'}[D] = A^D_{ω'} and J_ω[D] = A^D_ω. The implementation of the sampling of neighboring regions is detailed in Appendix C.

This regularization imposes on the CPA map to form an approximation of the orbit induced by a group whose infinitesimal generators are the G_k, k ∈ {1, ..., h}. In fact, one can see that while Eq. (3) described the transformation of data, we presently characterize the transformation of tangent planes. This regularization applies to the entire piecewise linear map, as it applies to all the regions induced by the AE. Therefore, regions without data are regularized as much as regions with data points, which is consistent with the aim of improving the generalization capability of AEs. This is a crucial component of the generalization guarantee of the proposed regularization. From Eq. (9), this regularization constrains the Hessian of the decoder, which defines the curvature of the piecewise linear map approximating the data manifold. Therefore, this penalization enforces the curvature of the piecewise linear map to fit the curvature of the orbit of the learned group.
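The inner minimization of Eq. (12) is a linear least-squares problem (its closed form is given in Proposition 2 below); the following sketch (ours, with illustrative shapes and random generators) evaluates the penalty for a single pair of neighboring tangents.

```python
# Sketch: the per-pair term of Eq. (12) -- min over theta of
# || J_w - (I + sum_k theta_k G_k) J_wp ||_F^2 -- solved by least squares.
import numpy as np

def pair_penalty(J_w, J_wp, Gs):
    target = (J_w - J_wp).ravel()                              # vec(J_w - J_wp)
    M = np.stack([(Gk @ J_wp).ravel() for Gk in Gs], axis=1)   # columns vec(G_k J_wp)
    theta, *_ = np.linalg.lstsq(M, target, rcond=None)         # same solution as Prop. 2
    residual = target - M @ theta
    return float(residual @ residual), theta

d, h = 6, 2
rng = np.random.default_rng(4)
Gs = [rng.normal(size=(d, d)) for _ in range(h)]               # candidate generators
J_wp = rng.normal(size=(d, h))                                 # tangent of region w'
theta_true = rng.normal(size=h)
J_w = (np.eye(d) + sum(t * G for t, G in zip(theta_true, Gs))) @ J_wp
penalty, theta_hat = pair_penalty(J_w, J_wp, Gs)
print(penalty, theta_hat - theta_true)                         # ~0 when the model holds
```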
Figure 3: The first and second figures (from left to right) represent the number of data points inside a ball of growing radius (first to second: CIFAR10, MNIST). From the third to the last figure, we show the number of regions in the latent space of the AE inside the same ball of growing radius for different AE architectures (third to fifth: Small MLP, Large MLP, Convolutional). We observe that for any radius, the number of regions induced by the AE partitioning of any DN architecture in any randomly sampled ball is much larger than the number of data points.

We thereby derive the optimal group parameters θ with respect to the regularization defined in Eq. (12).

Proposition 2.
The θ minimizing the inner problem of the regularization defined in Eq. (12) are obtained as θ* = M^{-1} v, where the entries of the h × h matrix M and of the vector v ∈ R^h are

M_{j,k} = Σ_i ⟨G_j [A^D_ω]_{·,i}, G_k [A^D_ω]_{·,i}⟩  and  v_j = Σ_i ⟨G_j [A^D_ω]_{·,i}, [A^D_{ω'}]_{·,i} − [A^D_ω]_{·,i}⟩,

where the matrix M is invertible and θ* = argmin_{θ ∈ R^h} ‖A^D_ω − (I + Σ_{k=1}^h θ_k G_k) A^D_{ω'}‖²_F. (Proof in Appendix A.3.)

As mentioned in Sec. 3.3, the higher order contractive constraint penalizes the Hessian of the encoder CPA map to reduce its curvature only around the training points. In our approach, the constraint is global and enforces the entire decoder CPA map curvature to approximate the curvature of the orbit of the data under the action of the learned group.

In Sec. 4.1, we showed that if the regularization defined in Eq. (10) is equal to zero and if the approximant function f coincides with the data manifold defined by x at a single point, then the approximant coincides with x. We now derive the generalization guarantees in the particular case where f is a CPA approximant. Based on the assumptions that (i) for a specific region the real manifold is approximated, (ii) the regularization defined in Eq. (12) is minimized, and (iii) the G matrices obtained from the regularization coincide with the infinitesimal operators of the group governing the data, we obtain the following approximation result.

Theorem 2.
If on a region ω' ∈ Ω_D the matrix A^D_{ω'} forms a basis of the manifold tangent space on this region, and R(D) = 0, then for every region ω ∈ Ω_D the basis vectors of A^D_ω are basis vectors of the tangent of the data manifold, with

d( ∪_{ω ∈ Ω_D} T_{AE}(ω), 𝒳 ) ≤ Σ_{ω_i ∈ Ω_D} Rad(ω_i),

where T_{AE}(ω) is the tangent space of the AE for the region ω, 𝒳 denotes the data manifold, d is the 2-norm distance, and Rad(ω_i) is the radius of the region ω_i. (Proof in Appendix A.4.)

We showed how to adapt the second-order regularization of Eq. (10) to AEs via the CPA framework and developed its generalization guarantees. The next part is dedicated to the empirical advantages and difficulties regarding the learning of such a regularization, as well as its performance.
This section starts with the observation that an AE produces a larger number of regions in its latent space than the number of data points usually available; see Fig. 3. Therefore, in the case of a tangent-based regularization, it is more appropriate to depend on the sampling of regions rather than on the sampling of data, as the high density of regions eases the approximation of the tangent. The regularization we developed follows this scheme and forces the tangents of the AE to be related to any other tangent by the action of the group governing the data.

Figure 4: Test set reconstruction error on the Earthquakes dataset evaluated on the best set of parameters for different AEs (from left to right): AE, Higher Order Contractive AE, Denoising AE, Group AE. For each model, the mean over runs is reported in black, and the gray area corresponds to its standard deviation.

Table 1: Comparison of the best testing reconstruction error (×10^-) for each AE model (columns) and dataset (rows).
Dataset \ Model: AE | Denoising AE | Higher Order Contractive AE | Group AE
Rows: MNIST, ECG500, Earthquakes, Haptics, FaceFour, SyntheticControl.

Remark:
The dimension of the infinitesimal operators G_k, ∀k ∈ {1, ..., h}, is quadratic in the dimension of the data. As such, for a high-dimensional dataset, the number of learnable parameters is large. Also, since the adjacent regions are sampled, the regularization term is volatile, with behavior varying through training. Hence the optimization of the G_k matrices remains the current bottleneck of the method.

We evaluate our framework on diverse datasets, including images and time-series; the description of each dataset is given in Appendix D. For all the datasets, we use only a few samples for training to show the generalization capability of the different AEs. As per the previous remark, the dimension of the datasets we evaluate is kept moderate.

For each model and each hyperparameter, we perform several runs over a fixed number of epochs and batch size. The main results are reported in Table 1 and complementary ones in Table 3 (Appendix F). In these tables, the reported statistics correspond to the best reconstruction error mean ± standard deviation on the test set for each model. We propose, in particular, to visualize the test set reconstruction for the different AE models during training on the Earthquakes data in Fig. 4, where we can see that the Group AE is robust to the DN initialization and does not overfit. Similar figures for other datasets are provided in Appendix G.

The hyperparameters corresponding to the results of Table 1 are given in Appendix E. In the case of the Denoising AE, the hyperparameter ε corresponds to the variance of the noise added to the data, while for the Higher Order Contractive AE it is the noise added to the data in order to sample the Jacobian of nearby regions; this parameter is evaluated over a grid of values. The hyperparameter λ corresponds to the regularization trade-off parameter for both the Higher Order Contractive AE and the Group AE; a grid of values is tested for both models. All the models were trained using the same AE architecture, with fully connected encoder layers with ReLU, fully connected decoder layers with ReLU, and a linear fully connected output layer.

Conclusion
In this paper, we analyzed and characterized AEs via the MAS framework. We leveraged this formulation to develop a constraint on the slope parameters of adjacent regions that enforces the AE to follow the geometry of a Lie group orbit. While this demonstrates increased generalization performance, it also opens many avenues for combining the rich theory of Lie group approximation with deep AEs. In particular, the development of a robust optimization scheme for the proposed regularization remains to be improved. Our work also opens the door to the study of the learned infinitesimal operator, which could bring further insights into AE learning as well as into the group underlying the dataset at hand.
References
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
Y. Bahroun, D. Chklovskii, and A. Sengupta. A similarity-preserving network trained on transformed images recapitulates salient features of the fly motion detection circuit. In Advances in Neural Information Processing Systems, pages 14178–14189, 2019.
P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
R. Balestriero and R. Baraniuk. A spline theory of deep learning. In Proceedings of the 35th International Conference on Machine Learning, pages 374–383, 2018.
R. Balestriero, R. Cosentino, B. Aazhang, and R. Baraniuk. The geometry of deep networks: power diagram subdivision. In Advances in Neural Information Processing Systems, pages 15806–15815, 2019.
R. Balestriero. SymJAX: symbolic CPU/GPU/TPU programming. arXiv preprint arXiv:2005.10635, 2020.
Z. Cheng, H. Sun, M. Takeuchi, and J. Katto. Deep convolutional autoencoder-based lossy image compression. Pages 253–257. IEEE, 2018.
G. Cottrell, P. Munro, and D. Zipser. Image compression by back propagation: An example of extensional programming. ICS Report, (8702), 1987.
J. L. Elman and D. Zipser. Learning the hidden structure of speech. The Journal of the Acoustical Society of America, 83(4):1615–1626, 1988.
G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, and F. J. Theis. Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 10(1):1–14, 2019.
D. Erhan, Y. Bengio, A. Courville, P. A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
B. Hall. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction, volume 222. Springer, 2015.
T. B. Hashimoto, P. S. Liang, and J. C. Duchi. Unsupervised transformation learning via convex relaxations. In Advances in Neural Information Processing Systems, pages 6875–6883, 2017.
K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? Pages 2146–2153. IEEE.
N. Lei, D. An, Y. Guo, K. Su, S. Liu, Z. Luo, S. Yau, and X. Gu. A geometric understanding of deep learning. Engineering, 2020.
A. Makhzani and B. Frey. k-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.
S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. arXiv preprint arXiv:1806.08734, 2018.
R. Rao and D. L. Ruderman. Learning Lie groups for invariant visual perception. In Advances in Neural Information Processing Systems, pages 810–816, 1999.
S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot. Higher order contractive auto-encoder. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 645–660. Springer, 2011a.
S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. 2011b.
J. Sohl-Dickstein, C. M. Wang, and B. A. Olshausen. An unsupervised algorithm for learning Lie group transformations. arXiv preprint arXiv:1001.1027, 2010.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Y. Teng and A. Choromanska. Invertible autoencoder for domain adaptation. Computation, 7(2):20, 2019.
L. Tran, X. Liu, J. Zhou, and R. Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1405–1414, 2017.
P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351–359, 2013.
C. M. Wang, J. Sohl-Dickstein, I. Tosic, and B. A. Olshausen. Lie group transformation models for predictive video coding. Pages 83–92. IEEE, 2011.

A Proofs
A.1 Proof of Proposition 1
Proof.
Perfect reconstruction ⟹: we have ∀ω, ∀x ∈ ω, x = Σ_{k=1}^h ⟨x, a^E_k⟩ a^D_k. Therefore, ∀ω, ∀x ∈ ω,

Σ_k ⟨x, a^E_k⟩ a^D_k = Σ_{k=1}^h ⟨ Σ_{k'=1}^h ⟨x, a^E_{k'}⟩ a^D_{k'}, a^E_k ⟩ a^D_k = Σ_k Σ_{k'=1}^h ⟨x, a^E_{k'}⟩ ⟨a^D_{k'}, a^E_k⟩ a^D_k

⟺ A^D_ω A^E_ω x = A^D_ω A^{D⊤}_ω A^{E⊤}_ω A^E_ω x.

Since A^D_ω A^E_ω is injective on the region (as per the perfect reconstruction condition), this implies that A^{D⊤}_ω A^{E⊤}_ω = I_h, where I_h is the identity matrix of dimension h × h.

A.2 Proof of Theorem 1
Proof.
Let us first build some intuition on the implications of the ordinary differential equation d²f(θ)/dθ² = G df(θ)/dθ when the initial condition df(θ)/dθ |_{θ=0} = G x(0) is satisfied. The following shows that the solution of this ordinary differential equation coincides with the manifold of the data described by x(θ), ∀θ ∈ R, up to a constant shift.

Let y(θ) = df(θ)/dθ; then dy(θ)/dθ = G y(θ). The solution to this problem with initial condition y(0) = G x(0) is y(θ) = exp(θG) y(0) = exp(θG) G x(0). Since y(θ) = df(θ)/dθ and

exp(θG) G x(0) = Σ_{n ≥ 0} (θG)^n / n! · G x(0) = Σ_{n ≥ 0} G (θG)^n / n! · x(0) = G exp(θG) x(0),

we have that f(θ) = exp(θG) x(0) + c, where c ∈ R^d.

Now, consider the initial condition given by df(θ)/dθ |_{θ=0}; we have f(θ) = G^{-1} exp(θG) df(θ)/dθ |_{θ=0} + c = exp(θG) G^{-1} df(θ)/dθ |_{θ=0} + c. Now if ∃θ' such that f(θ') = x(θ'), we have

exp(θ'G) G^{-1} df(θ)/dθ |_{θ=0} + c = exp(θ'G) x(0),

thus df(θ)/dθ |_{θ=0} = G (x(0) − exp(−θ'G) c), and therefore

exp(θ'G) G^{-1} df(θ)/dθ |_{θ=0} + c = exp(θ'G) G^{-1} G (x(0) − exp(−θ'G) c) + c = exp(θ'G) x(0) = x(θ').

A.3 Proof of Proposition 2

Proof.

‖A^D_{ω'} − A^D_ω − Σ_{k=1}^h θ_k G_k A^D_ω‖²_F = Tr( ((A^D_{ω'} − A^D_ω − Σ_{k=1}^h θ_k G_k A^D_ω) ⊙ (A^D_{ω'} − A^D_ω − Σ_{k=1}^h θ_k G_k A^D_ω)) 1 1^⊤ ),

where ⊙ denotes the Hadamard product and 1 the all-ones vector. Expanding the products and taking the derivative with respect to θ_j, ∀j ∈ {1, ..., h}, gives

∂/∂θ_j ‖A^D_{ω'} − A^D_ω − Σ_k θ_k G_k A^D_ω‖²_F = 2 Tr( ((G_j A^D_ω) ⊙ (A^D_ω − A^D_{ω'})) 1 1^⊤ ) + 2 Σ_{k=1}^h θ_k Tr( ((G_j A^D_ω) ⊙ (G_k A^D_ω)) 1 1^⊤ ).

Setting this derivative to zero for all j and rearranging in matrix form gives the expression of θ* in Proposition 2. Moreover, the matrix to invert has entries M_{j,k} = Σ_i ⟨G_j [A^D_ω]_{·,i}, G_k [A^D_ω]_{·,i}⟩, i.e.,

M = Σ_{i=1}^h B_i^⊤ B_i  with  B_i = ( G_1 [A^D_ω]_{·,i}, ..., G_h [A^D_ω]_{·,i} ),

therefore it is a sum of positive definite matrices.

For the case h = 1, we have

‖a^D_{ω'} − a^D_ω − θ G a^D_ω‖² = ⟨a^D_{ω'}, a^D_{ω'}⟩ − 2⟨a^D_{ω'}, a^D_ω⟩ + ⟨a^D_ω, a^D_ω⟩ + 2⟨θ G a^D_ω, a^D_ω − a^D_{ω'}⟩ + ⟨θ G a^D_ω, θ G a^D_ω⟩,

thus

∂/∂θ ‖a^D_{ω'} − a^D_ω − θ G a^D_ω‖² = 2 a^{D⊤}_ω G^⊤ (a^D_ω − a^D_{ω'}) + 2 θ a^{D⊤}_ω G^⊤ G a^D_ω.

For the following proofs, we will denote by T : R^d × R^h → R^d the transformation operator taking as input a datum and a group parameter and giving as output the transformed datum. As we use a Lie group, we can define this operator analytically as T(x, θ) = exp(θG) x.

A.4 Proof of Theorem 2
For this proof, we will use the notation T_X(ω) for the tangent space of the manifold described by the data X for the data in the region ω, and T_{AE}(ω) for the tangent space of the AE for the region ω. We show that if these two tangent spaces coincide for a given region, i.e., if the tangent space of the AE coincides with the tangent space of the manifold at a specific position, then they coincide everywhere.

Proof. By assumption, we know that {a^D_1(ω'), ..., a^D_h(ω')} form a basis of T_X(ω'). If the regularization is satisfied, we also know that the tangent induced by the AE at position ω, denoted by T_{AE}(ω), is equal to T(T_X(ω'), θ). In fact, the regularization imposes that the tangents (induced by the AE) of the different regions are transformed versions of each other by the transformation operator T. As this operator is a Lie group action operator, it is a diffeomorphism from the orbit of the group to the orbit of the group. Therefore, ∀ω, there exists θ such that T(T_X(ω'), θ) = T_X(ω).

Per assumption, the tangent of the region ω', i.e., T_{AE}(ω'), is actually tangent to X, as its basis coincides with T_X(ω'). Denote by x ∈ X the point at which T_X(ω') and X intersect. Let us first consider ε' = argmax_ε {ε : x + εh ∈ ω'}, where h ∈ T_X(ω'), that is, x + ε'h lies at the boundary of the region ω'. We further assume that ‖h‖ = 1, so that ε' = Rad(ω'). Let us define a smooth curve on the manifold γ : R → X such that γ(0) = x and γ'(0) = h. Now,

d(x + ε'h, X) ≤ d(x + ε'h, γ(ε')) = ‖γ(ε') − γ(0) − ε'γ'(0)‖.

Since lim_{ε'→0} (γ(ε') − γ(0))/ε' = γ'(0), we have that d(x + ε'h, γ(ε'))/ε' = o(Rad(ω')). Then, since the ω_i, ∀i ∈ {1, ..., |Ω|}, form a partition of Ω, and since by Proposition 2 we know that once one tangent of the AE coincides with the tangent of the manifold at the point x, any tangent of the AE coincides with a tangent of the manifold, we have that

d(∪_{ω ∈ Ω} T_{AE}(ω), X) = Σ_{i=1}^{|Ω|} d(T_{AE}(ω_i), X) ≤ Σ_{i=1}^{|Ω|} Rad(ω_i).

A.5 Updates for G

Case h = 1: For the sake of clarity we will denote J^D_ω = J_ω and H_{ω,ω'} = A_ω − A_{ω'}.
argmin_G Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} ‖A^D_ω − (I + θ_{ω,ω'} G) A^D_{ω'}‖²_F
= argmin_G Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} ‖H_{ω,ω'} − θ_{ω,ω'} G J_{ω'}‖²_F
= argmin_G Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} Tr[ (H_{ω,ω'} − θ_{ω,ω'} G J_{ω'})(H_{ω,ω'} − θ_{ω,ω'} G J_{ω'})^⊤ ]
= argmin_G Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} Tr[ −θ_{ω,ω'} H_{ω,ω'} J_{ω'}^⊤ G^⊤ − θ_{ω,ω'} G J_{ω'} H_{ω,ω'}^⊤ + θ_{ω,ω'}² G J_{ω'} J_{ω'}^⊤ G^⊤ ]
= argmin_G Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} ( −2 θ_{ω,ω'} Tr[ G J_{ω'} H_{ω,ω'}^⊤ ] + θ_{ω,ω'}² Tr[ G J_{ω'} J_{ω'}^⊤ G^⊤ ] ).

Now,

∂/∂G [ Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} ‖A_ω − (I + θ_{ω,ω'} G) A_{ω'}‖²_F ] = Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} [ −2 θ_{ω,ω'} H_{ω,ω'} J_{ω'}^⊤ + 2 θ_{ω,ω'}² G J_{ω'} J_{ω'}^⊤ ].

Therefore,

G* = ( Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} θ_{ω,ω'} (A_ω − A_{ω'}) A_{ω'}^⊤ ) ( Σ_{ω ∈ Ω_D} Σ_{ω' ∈ N(ω)} θ_{ω,ω'}² A_{ω'} A_{ω'}^⊤ )^{-1}.
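A small numerical sanity check of this closed-form update (our own sketch; the synthetic tangents, coefficients, and shapes are assumptions):

```python
# Sketch: closed-form update of a single generator G (case h = 1) from a list of
# neighboring tangent pairs (A_w, A_wp) and per-pair coefficients theta.
import numpy as np

def g_update(pairs, thetas, d):
    """pairs: list of (A_w, A_wp), each of shape (d, h); thetas: per-pair scalars."""
    num = np.zeros((d, d))
    den = np.zeros((d, d))
    for (A_w, A_wp), th in zip(pairs, thetas):
        num += th * (A_w - A_wp) @ A_wp.T
        den += th**2 * A_wp @ A_wp.T
    return num @ np.linalg.inv(den)

rng = np.random.default_rng(5)
d, h, n = 4, 1, 50
G_true = rng.normal(size=(d, d))
pairs, thetas = [], []
for _ in range(n):
    A_wp = rng.normal(size=(d, h))
    th = rng.normal()
    pairs.append(((np.eye(d) + th * G_true) @ A_wp, A_wp))   # A_w = (I + th*G) A_wp
    thetas.append(th)
print(np.linalg.norm(g_update(pairs, thetas, d) - G_true))   # ~0
```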
A.6 Per Region Tangent - Details

Let [D ∘ E(·)]_i : R^d → R be the i-th coordinate output of the AE, defined as [D ∘ E(x)]_i = [A^D_ω]_{i,·} A^E_ω x + [A^D_ω]_{i,·} B^E_ω + [B^D_ω]_i. Then

d[D ∘ E(·)]_i = [D ∘ E(x + ε)]_i − [D ∘ E(x)]_i = ⟨ A^{E⊤}_ω [A^D_ω]^⊤_{i,·}, ε ⟩, ∀ε ∈ R^d.    (13)

As such, we directly obtain that

∇_x [D ∘ E(·)]_i = A^{E⊤}_ω [A^D_ω]^⊤_{i,·},    (14)

which leads to the Jacobian of the AE as defined in Eq. (7).

B Orbit of a Lie Group
One example of the orbit of a datum with respect to a Lie group is the rotation of an initial point x(0) ∈ R²: x(θ) = exp(θG) x(0), θ ∈ R, with

G = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}.

Recall that

exp\left( θ \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \right) = \begin{pmatrix} \cos(θ) & -\sin(θ) \\ \sin(θ) & \cos(θ) \end{pmatrix}.

The infinitesimal operator G thus encapsulates the group information. For more details regarding Lie groups and the exponential map, refer to Hall [2015].
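A short numerical check of this example (ours, using SciPy's matrix exponential):

```python
# Sketch: exp(theta * G) for the rotation generator equals the rotation matrix.
import numpy as np
from scipy.linalg import expm

G = np.array([[0.0, -1.0], [1.0, 0.0]])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(expm(theta * G), R))        # True
```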
C Sampling

Recall that the proposed regularization requires knowledge of the decoder's latent space partition. In practice, and for large networks, discovering the entire partition is not feasible. We thus propose to approximate the regularization by sampling only some of the regions and some of their respective neighbours. This sampling is done by first sampling random vectors in the AE latent space. Since, based on the position of the sampled vectors, the associated per region mappings are automatically formed during the forward pass to produce the decoder output, it is enough to compute the affine mappings induced by the samples. To compute the neighbours of those sampled regions, we use a simple dichotomic search. That is, for each sampled region, we sample another (nearby) vector and keep pushing this new sample toward the first one until we obtain the closest sample that remains in a different region. With the above, one now has the knowledge of some regions and of one neighbouring region for each of them. While we use a single neighbour, for a better approximation of the regularization one can repeat this sampling process and accumulate the obtained regions and neighbours. All the experiments are written in SymJAX Balestriero [2020].
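A minimal sketch of this dichotomic search (ours, not the SymJAX implementation; the one-layer decoder and its random weights are illustrative, and a region is identified by its ReLU pattern):

```python
# Sketch: bisection toward a latent code z until finding the closest code that
# still lies in a different region of a toy ReLU decoder.
import numpy as np

rng = np.random.default_rng(6)
Wd1, bd1 = rng.normal(size=(16, 2)), rng.normal(size=16)

def pattern(z):
    return tuple((Wd1 @ z + bd1 > 0).astype(int))     # region signature

def neighbor(z, n_steps=30, scale=1.0):
    """Return a code just across the boundary of z's region."""
    z_far = z + scale * rng.normal(size=z.shape)
    if pattern(z_far) == pattern(z):                   # resample farther if same region
        return neighbor(z, n_steps, 2 * scale)
    lo, hi = z, z_far                                  # lo in region(z), hi outside
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if pattern(mid) == pattern(z):
            lo = mid
        else:
            hi = mid
    return hi

z = rng.normal(size=2)
z_nb = neighbor(z)
print(pattern(z) == pattern(z_nb), np.linalg.norm(z - z_nb))   # False, small distance
```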
D Datasets
Most of the datasets used for the experiments are extracted from the univariate time-series repository in ?. Some of them are recordings from sensors or simulated data; they range from motion time-series to biological ones. The dimension of the data we used is kept moderate, as discussed in the remark of Sec. 4.3.

Another dataset we use is the MNIST dataset, which consists of images of hand-written digits. This is an example of a dataset where the intra-class variability is induced by group transformations such as rotations, translations, and small diffeomorphisms. The MNIST single class data are trained with h = 1 (bottleneck dimension 1).

E Best Hyperparameters
Table 2: Best Hyperparameters for each AE (columns) and Dataset (rows).
Dataset \ Model: Denoising AE (ε) | Higher Order Contractive AE (ε, λ) | Lie Group AE (λ)
Rows: Single Class MNIST (three single-class variants), CBF, Yoga, Trace, Wine, ShapesAll, FiftyWords, WordSynonyms, InsectWingbeatSound, MNIST, ECG500, Earthquakes, Haptics, FaceFour, SyntheticControl.

F Supplementary Results
Table 3: Comparison of the best testing reconstruction error (×10^-) for each AE (columns) and dataset (rows).
Dataset \ Model: AE | Denoising AE | High-Order Contractive AE | Lie Group AE
Rows: Single Class MNIST (three single-class variants), CBF, Yoga, Trace, Wine, ShapesAll, FiftyWords, WordSynonyms, InsectWingbeatSound, MNIST, ECG500, Earthquakes, Haptics, FaceFour, SyntheticControl.

G Additional Experimental Figures

Figure 5: Test set reconstruction error on the SyntheticControl dataset evaluated on the best set of parameters for different AEs (from left to right): AE, Higher Order Contractive AE, Denoising AE, Group AE. For each model, the mean over runs is reported in black, and the gray area corresponds to its standard deviation.

Figure 6: Test set reconstruction error on the Haptics dataset evaluated on the best set of parameters for different AEs (from left to right): AE, Higher Order Contractive AE, Denoising AE, Group AE. For each model, the mean over runs is reported in black, and the gray area corresponds to its standard deviation.

Figure 7: Test set reconstruction error on the FaceFour dataset evaluated on the best set of parameters for different AEs (from left to right): AE, Higher Order Contractive AE, Denoising AE, Group AE. For each model, the mean over runs is reported in black, and the gray area corresponds to its standard deviation.