Latent Space Non-Linear Statistics
Line Kühnel, University of Copenhagen, Denmark ([email protected])
Tom Fletcher, University of Utah, USA ([email protected])
Sarang Joshi, University of Utah, USA ([email protected])
Stefan Sommer, University of Copenhagen, Denmark ([email protected])
Abstract
Given data, deep generative models, such as variational autoencoders (VAE) and generative adversarial networks (GAN), train a lower dimensional latent representation of the data space. The linear Euclidean geometry of data space pulls back to a nonlinear Riemannian geometry on the latent space. The latent space thus provides a low-dimensional nonlinear representation of data, and classical linear statistical techniques are no longer applicable. In this paper, we show how statistics of data in their latent space representation can be performed using techniques from the field of nonlinear manifold statistics. Nonlinear manifold statistics provides generalizations of Euclidean statistical notions including means, principal component analysis, and maximum likelihood fits of parametric probability distributions. We develop new techniques for maximum likelihood inference in latent space, and address the computational complexity of using geometric algorithms with high-dimensional data by training a separate neural network to approximate the Riemannian metric and cometric tensor capturing the shape of the learned data manifold.
1 Introduction

The Riemannian geometry of latent spaces, provided by deep generative models, has recently been explored in [16, 4, 2]. The mapping f : Z → X, from latent space Z to the data space X, constitutes an embedding of Z into X under mild assumptions on the network architecture. This allows the image f(Z) to inherit the Riemannian metric and hence the geometry from the Euclidean ambient space X. Equivalently, the metric structure of X pulls back via f to a nonlinear Riemannian structure on Z. The above papers explore aspects of this geometry including numerical schemes for geodesic integration, parallel transport, Fréchet mean estimation, simulation of Brownian motion, and interpolation. With this paper, we wish to focus on performing subsequent statistics after learning the latent representation and the embedding f. We aim at using the constructions, tools, and methods from nonlinear statistics [15] to perform statistical analysis of data in the latent representation.

Deep generative models are excellent tools for learning the intrinsic geometry of a low-dimensional data manifold f(Z), a subspace of the data space X. When the major modes of data variation are of low intrinsic dimensionality, statistical analyses exploiting the lower dimensionality can be more efficient than performing statistics directly in the high-dimensional data space. By performing statistics in lower-dimensional manifolds learned with deep generative models, we simultaneously adapt the statistics to the intrinsic geometry of the data manifold, exploit the compact representation, and avoid unnecessary dimensions in the high-dimensional space X affecting the statistical analysis.

Exemplified on two datasets, synthetic data on the sphere S^2 for visualization and the MNIST digits dataset, we show how statistical procedures such as principal component analysis can be performed on the latent space. We subsequently define and infer parameters of geometric distributions, allowing the definition and inference of maximum likelihood estimates via simulation of diffusion processes. Both VAEs and GANs themselves learn distributions representing the input training data. The aim here is to perform nonlinear statistical analyses for data independent of the training data and with a different distribution, but which are elements of the same low-dimensional manifold of the data space. The latent representation can in this way be learned unsupervised from large numbers of unlabeled training samples, while subsequent low-sample-size statistics can be performed using the low-dimensional latent representation. This setting occurs for example in medical imaging, where brain MR scans are abundant while controlled disease progression studies have much smaller sample sizes. The approach resembles the common task of using principal component analysis to represent data in the span of fewer principal eigenvectors, with the important difference that in the present case a nonlinear manifold is learned using deep generative models instead of standard linear subspace approximation.

The field of nonlinear statistics provides generalizations of statistical constructions and tools from linear Euclidean vector spaces to Riemannian manifolds. Such constructs, e.g. the mean value, often have many equivalent definitions in Euclidean space. However, nonlinearity and curvature generally break this equivalence, leading to a plethora of different generalizations.
For this reason, we here focus on a subset of selected methods to exemplify the use of nonlinear statistics tools in the latent space setting: principal component analysis on manifolds with principal geodesic analysis (PGA, [6]), and inference of maximum likelihood means from intrinsic diffusion processes [18].

The learned manifold defines a Riemannian metric on the latent representation, yet the often high dimensionality of the data manifold makes evaluation of the metric computationally costly. This is severely amplified when calculating the higher-order derivatives needed for geometric concepts such as the curvature tensor and the Christoffel symbols that are crucial for numerical integration of geodesics and simulation of sample paths of Brownian motions. We present a new method for handling the computational complexity of evaluating the metric by training a second neural network to approximate the local metric tensor of the latent space, thereby achieving a massive speed-up in the implementation of the geometric and nonlinear statistical algorithms.

The paper thus presents the following contributions:
1. we couple tools from nonlinear statistics with deep end-to-end differentiable generative models for analyzing data using a pre-trained low-dimensional latent representation,
2. we show how an additional neural network can be trained to learn the metric tensor and thereby greatly speed up the computations needed for the nonlinear statistics algorithms,
3. we develop a method for maximum likelihood estimation of diffusion processes in the latent geometry and use this to estimate ML means from Riemannian Brownian motions.

We show examples of the presented methods on latent geometries learned from synthetic data in R^3 and on the MNIST dataset. The statistical computations are implemented in the Theano Geometry package [12], which, using the automatic differentiation features of Theano [19], allows for easy and concise expression of differential geometry concepts.

The paper starts with a brief description of latent space geometry based on the papers [16, 4, 2]. We then discuss the definition of mean values in the nonlinear latent geometry and the use of the principal geodesic analysis (PGA) procedure, before developing a scheme for maximum likelihood estimation of parameters with Riemannian Brownian motion using a diffusion bridge sampling scheme. We end the paper with experiments.

2 Latent Space Geometry

Deep generative models such as generative adversarial networks (GANs, [8]) and autoencoders/variational autoencoders (VAEs, [3]) learn mappings from a latent space Z to the data space X. In the VAE case, the decoder mapping f : Z → X describes the mean of the data distribution, P(X|z) = N(X | f(z), σ(z)I), and is complemented by an encoder h : X → Z. Both Z and X are Euclidean spaces, with dimension d and n respectively, and generally d ≪ n. When the pushforward f_*, the differential df of f, is of rank d at every point z, the image f(Z) in X is an embedded differentiable manifold of dimension d. We denote this manifold by M. Generally, for deep models, f is nonlinear, making M a nonlinear manifold. An example of a manifold trained with a VAE is shown in Figure 1. Here we simulate synthetic data on the sphere S^2 by the transition distribution of a Riemannian Brownian motion starting at the north pole. The learned submanifold approximates S^2 on the northern hemisphere containing the greatest concentration of samples.
Figure 1: (left) Samples from the data distribution (blue) with corresponding predictions from the VAE (red). (right) The trained manifold.

The learned manifold M inherits differential and geometric structure from X. In particular, the standard Euclidean inner product restricts to tangent spaces T_xM for x ∈ M to give a Riemannian metric g on M, i.e. for v, w ∈ T_xM, g(v, w) = ⟨v, w⟩ = v^T w. Locally, we invert f to obtain charts on M, and get the standard expression g_{ij}(z) = ⟨∂_{z_i} f, ∂_{z_j} f⟩ for the metric tensor in Z coordinates. Using the Jacobian matrix Jf = (∂_{z_i} f_j)_{ij}, the matrix expression of g(z) is g(z) = (Jf(z))^T Jf(z). The metric tensor on Z can be seen as the pullback f*g of the Riemannian metric on X.
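To make the pullback construction concrete, the following is a minimal numpy sketch computing g(z) = Jf(z)^T Jf(z) for a decoder given as a plain Python function. The finite-difference Jacobian and the toy decoder are illustrative assumptions; the paper instead obtains Jf through Theano's automatic differentiation.

```python
import numpy as np

def jacobian(f, z, eps=1e-5):
    """Central finite-difference approximation of the n x d Jacobian of f: R^d -> R^n."""
    d = len(z)
    cols = []
    for i in range(d):
        dz = np.zeros(d)
        dz[i] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

def pullback_metric(f, z):
    """Metric tensor g(z) = Jf(z)^T Jf(z) induced on the latent space by the decoder f."""
    J = jacobian(f, np.asarray(z, dtype=float))
    return J.T @ J

# Toy decoder embedding a 2D latent space into R^3 (a paraboloid patch).
f = lambda z: np.array([z[0], z[1], z[0] ** 2 + z[1] ** 2])
g = pullback_metric(f, [0.5, -0.3])   # 2 x 2 symmetric positive definite matrix
```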
The geometry of latent spaces was explored in [16]. In addition to setting up the geometric foundation, the paper developed efficient algorithms for geodesic integration, parallel transport, and Fréchet mean estimation on the latent space. The algorithms make particular use of the encoder function h : X → Z that is trained as part of the VAE. Instead of explicitly computing Christoffel symbols for geodesic integration, the presence of h allows steps of the integration algorithm to be taken in X and then subsequently mapped back to Z. Avoiding computation of Christoffel symbols significantly increases execution speed, a critical improvement for the heavy computations involved with the typically high dimension of X. [4] provides additional views on the latent geometry and interpolation examples on the MNIST dataset and robotic arm movements. [2] includes the z-variability of the variance σ(z) of VAEs, resulting in the inclusion of the Jacobian of σ in the expected metric. The paper in addition explores random walks in the latent geometry and how to enable meaningful extrapolation of the latent representation beyond the training data.

Given sampled data y_1, ..., y_N in X, the aim is here to perform statistics on the data after mapping to the low-dimensional latent space Z. Note that the mapping f can thus be trained unsupervised and afterwards used to perform statistics on new data in the low-dimensional representation. Therefore, the data y_1, ..., y_N are generally different from the training data used to train f. In particular, N can be much lower than the size of the training set.

For VAEs, the mapping of y_i to corresponding points z_i in the latent representation is directly available from the encoder function h, i.e. z_i = h(y_i). In more general settings where h is not present, we need to construct z_i from y_i. A natural approach is to define z_i from the optimization problem

    z_i = argmin_{z ∈ Z} ||f(z) - y_i||^2.   (1)

This can be seen as a projection from X to M using the Euclidean distance in X.

3.2 Geodesics and Brownian Motions

The pullback metric f*g on Z defines geometric concepts such as geodesics, exponential and logarithm maps, and Riemannian Brownian motions on Z. Using f, each of these definitions is equivalently expressed on M, viewing it as a submanifold of X with inherited metric. Given z ∈ Z and v ∈ T_zZ, the exponential map Exp_z : T_zZ → Z is defined as the endpoint at time t = 1 of the geodesic γ^v_t with starting point z and initial velocity v, i.e. Exp_z(v) = γ^v_1. The logarithm map Log : Z × Z → TZ is the local inverse of Exp: given two points z_1, z_2 ∈ Z, Log_{z_1}(z_2) returns the tangent vector v ∈ T_{z_1}Z defining the minimizing geodesic between z_1 and z_2. The Riemannian metric defines the geodesic distance, expressed from the logarithm map by d(z_1, z_2) = ||Log_{z_1}(z_2)||_g. Using Z as coordinates for M by local inverses of f, the Riemannian Brownian motion on Z, and equivalently on M, is defined by the coordinate expression

    dz^j_t = -(1/2) g(z_t)^{kl} Γ^j_{kl} dt + (√(g(z_t)^{-1}) dB_t)^j,   (2)

where Γ^j_{kl} denotes the Christoffel symbols, g^{-1} the cometric, i.e. the inverse of the metric tensor g, B_t a standard Brownian motion in R^d, and where Einstein notation is used for index summation.
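As a sketch of how (2) can be integrated numerically, the following Euler-Maruyama scheme assumes only a callable g(z) returning the metric matrix (for instance pullback_metric above, or the learned approximation described below). Christoffel symbols are obtained here by finite differences of g, where the paper uses automatic differentiation; step counts and tolerances are illustrative choices.

```python
import numpy as np

def christoffel(g, z, eps=1e-4):
    """Christoffel symbols Gamma[k, i, j] = Gamma^k_ij from a metric function g."""
    d = len(z)
    dg = np.zeros((d, d, d))              # dg[l, i, j] = d g_ij / d z^l
    for l in range(d):
        dz = np.zeros(d)
        dz[l] = eps
        dg[l] = (g(z + dz) - g(z - dz)) / (2 * eps)
    Ginv = np.linalg.inv(g(z))
    # Gamma^k_ij = 1/2 g^{kl} (d_i g_jl + d_j g_il - d_l g_ij)
    T = dg + dg.transpose(1, 0, 2) - dg.transpose(1, 2, 0)
    return 0.5 * np.einsum('kl,ijl->kij', Ginv, T)

def simulate_brownian(g, z0, T=1.0, n_steps=100, seed=None):
    """Euler-Maruyama integration of the latent Brownian motion (2)."""
    rng = np.random.default_rng(seed)
    d = len(z0)
    dt = T / n_steps
    z = np.asarray(z0, dtype=float)
    path = [z.copy()]
    for _ in range(n_steps):
        Ginv = np.linalg.inv(g(z))
        drift = -0.5 * np.einsum('kl,jkl->j', Ginv, christoffel(g, z))
        w, V = np.linalg.eigh(Ginv)       # eigen-decomposition gives the cometric square root
        noise = V @ (np.sqrt(np.maximum(w, 0.0)) * (V.T @ rng.normal(size=d)))
        z = z + drift * dt + np.sqrt(dt) * noise
        path.append(z.copy())
    return np.array(path)
```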
While metric computation is easily expressed using automatic differentiation to compute the Jacobian Jf of the embedding map f, the high dimensionality of the data space incurs a computational cost when evaluating the metric. This is particularly emphasized when computing higher-order differential concepts such as Christoffel symbols, used for geodesic integration, curvature, and Brownian motion simulation, due to the multiple derivatives and metric inverse computations involved. For integration of geodesics and Brownian motion, one elegant way to avoid the computation of Christoffel symbols is to take each step of the integration in the ambient data space of M and map the result back to the latent space using the encoder mapping h [16]. This requires h to be close to the inverse of f restricted to M, and limits the method to VAEs where h is trained along with the decoder f.

We here propose an additional way to allow efficient computations without using the encoder map h. The approach therefore works for both GANs and VAEs. The latent space Z is of low dimension, and the only entity needed for encoding the geometry is the metric g : Z → Sym+(d) that to each z assigns a symmetric positive definite d × d matrix. Sym+(d) has dimension d(d+1)/2. The high dimensionality of the data space thus does not appear directly when defining the geometry, and X is only used for the actual computation of g(z). We therefore train a second neural network g̃ to act as a function approximator for g, i.e. we train g̃ to produce an element of Sym+(d) that is close to g(z) for each z. Notice that this network does not evaluate a Jacobian matrix when computing g(z), and no derivatives are hence needed for evaluating the metric. Because of this, and because both the input and output spaces of the network are of low dimension, d and d(d+1)/2 respectively, evaluating g̃ and Christoffel symbols computed from g̃ is orders of magnitude faster than evaluating g directly when the dimensionality n of X is high compared to d: integration of the geodesic equation with 100 timesteps in the MNIST case presented later takes on the order of seconds when computing the metric from Jf, compared to milliseconds when using the second neural network to predict g.

Inverting g̃(z) is sensitive to the approximation of g provided by g̃. The cometric tensor g^{-1} is therefore more sensitive to the approximation when computed from g̃ than from g itself. This is emphasized when g(z) has small eigenvalues. As a solution, we let the second neural network predict both the metric g(z) and the cometric g(z)^{-1}. When defining the loss function for training the network, we balance the norms between the predicted matrices g̃ and g̃^{-1}. In addition, we ensure that the predicted g̃ and g̃^{-1} are close to being actual inverses. These observations are expressed in the loss function

    loss(g_true, g^{-1}_true, g_pred, g^{-1}_pred) = ||g_true - g_pred|| / ||g_true|| + ||g^{-1}_true - g^{-1}_pred|| / ||g^{-1}_true|| + ||g^{-1}_pred g_pred - Id_d||,   (3)

using Frobenius matrix norms. We train a neural network with two dense hidden layers to minimize (3), and use this network for the geometry calculations. The network predicts the upper triangular part of each matrix, and this part is symmetrized to produce g_pred and g^{-1}_pred. Note that additional methods could be employed to ensure that the predicted metric is positive definite, see e.g. [10]. For the presented examples, it is our observation that the loss (3) ensures positive definiteness without further measures.
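A sketch of the metric/cometric approximator in PyTorch follows (the paper's implementation is in Theano). The two dense hidden layers and the prediction of upper-triangular parts come from the text above; the hidden width and activation are assumptions. Training pairs (z, g(z), g(z)^{-1}) can be generated by sampling latent points and evaluating the exact pullback metric.

```python
import torch
import torch.nn as nn

class MetricApproximator(nn.Module):
    """Predicts the upper-triangular entries of g(z) and g(z)^{-1} from z in R^d."""
    def __init__(self, d, hidden=128):
        super().__init__()
        self.d = d
        m = d * (d + 1) // 2                       # dim Sym+(d)
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(),       # two dense hidden layers, as in the text
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * m),              # upper triangles of g and g^{-1}
        )
        self.iu = torch.triu_indices(d, d)

    def _symmetrize(self, tri):
        """Rebuild full symmetric matrices from batched upper-triangular entries."""
        M = torch.zeros(tri.shape[0], self.d, self.d, device=tri.device)
        M[:, self.iu[0], self.iu[1]] = tri
        return M + M.transpose(1, 2) - torch.diag_embed(M.diagonal(dim1=1, dim2=2))

    def forward(self, z):
        m = self.d * (self.d + 1) // 2
        out = self.net(z)
        return self._symmetrize(out[:, :m]), self._symmetrize(out[:, m:])

def metric_loss(g_true, ginv_true, g_pred, ginv_pred):
    """The loss (3): relative Frobenius errors plus an inverse-consistency penalty."""
    fro = lambda A: A.flatten(1).norm(dim=1)
    I = torch.eye(g_true.shape[-1], device=g_true.device)
    return (fro(g_true - g_pred) / fro(g_true)
            + fro(ginv_true - ginv_pred) / fro(ginv_true)
            + fro(torch.bmm(ginv_pred, g_pred) - I)).mean()
```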
4 Nonlinear Latent Space Statistics

We now discuss aspects of nonlinear statistics applicable to the latent geometry setting. We start by focusing on means, particularly the Fréchet and maximum likelihood (ML) means, before modeling the variation around the mean with the principal geodesic analysis procedure.
The Fréchet mean [7] of a distribution on M and its sample equivalent minimize the expected squared Riemannian distance: x̂ = argmin_{x ∈ M} E[d(x, y)^2] and x̂ = argmin_{x ∈ M} (1/N) Σ_{i=1}^N d(x, y_i)^2. The standard way to estimate a sample Fréchet mean is to employ an iterative optimization minimizing the sum of squared Riemannian distances. For this, the Riemannian gradient of the squared distance can be expressed using the Riemannian Log map [15] by ∇_x d(x, y)^2 = -2 Log_x(y).

The Fréchet mean generalizes the Euclidean concept of a mean value as a distance minimizer. In Euclidean space, this is equivalent to the standard Euclidean estimator x̂ = (1/N) Σ_i y_i. From a probabilistic viewpoint, the equivalence between the log-density function of a Euclidean normal distribution and the squared distance makes x̂ an ML fit of a normal distribution to data:

    x̂ = argmax_x log p_{N,x}(y),   (4)

with p_{N,x}(y) ∝ exp(-||x - y||^2/2) being the density of a normal distribution with mean x. While the normal distribution does not have a canonical equivalent on Riemannian manifolds, an intrinsic generalization comes from the transition density of a Riemannian Brownian motion. This density on M arises as the solution to the heat PDE, ∂_t p_{x,t} = (1/2) Δ_g p_{x,t}, using the Laplace-Beltrami operator Δ_g, or, equivalently, from the law of the Brownian motion started at x. In [18, 14, 17], this density is used to generalize the ML definition of the Euclidean mean,

    x̂ = argmax_x log p_{x,T}(y),   (5)

for a fixed T > 0. We will develop approximation schemes for evaluating the log-density and for solving the optimization problem (5) in Section 5.
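The Fréchet mean iteration sketched below follows the gradient expression above. Here exp_map(x, v) and log_map(x, y) are assumed implementations of the Riemannian Exp and Log on the latent space (e.g. by geodesic integration with the learned metric), and the step size is an illustrative choice.

```python
import numpy as np

def frechet_mean(exp_map, log_map, samples, x0, step=0.5, n_iter=100, tol=1e-8):
    """Gradient descent on the sum of squared Riemannian distances.

    Uses grad_x d(x, y)^2 = -2 Log_x(y): the negative gradient of the Frechet
    functional is proportional to the mean log vector, so we repeatedly shoot
    along it with Exp until it vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        v = np.mean([log_map(x, y) for y in samples], axis=0)
        if np.linalg.norm(v) < tol:
            break
        x = exp_map(x, step * v)
    return x
```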
Euclidean principal component analysis (PCA) estimates subspaces of the data space that explain the major variation of the data, either by maximizing variance or by minimizing residuals. PCA is built around the linear vector space structure and the Euclidean inner product. Defining procedures that resemble PCA for manifold-valued data hence becomes challenging, as neither inner products between arbitrary vectors nor the concept of linear subspaces is defined on manifolds.

Fletcher et al. [6] presented a generalized version of Euclidean PCA denoted principal geodesic analysis (PGA). PGA estimates nested geodesic submanifolds of M that capture the major variation of the data projected to each submanifold. The geodesic subspaces hence take the place of the linear subspaces found with Euclidean PCA.

Let z_1, ..., z_N ∈ Z be latent space representations of the data y_1, ..., y_N in M, and let μ be a Fréchet mean of the samples z_1, ..., z_N. We assume the observations are located in a neighbourhood U of μ where Exp_μ is invertible and the logarithm map Log_μ thus well-defined. We then search for an orthonormal basis of tangent vectors in T_μZ such that for each nested submanifold H_k = Exp_μ(span{v_1, ..., v_k}), the variance of the data projected onto H_k is maximized. The projection map used is based on the geodesic distance d and is defined by π_H(z) = argmin_{z' ∈ H} d(z, z'). The tangent vectors v_1, ..., v_k in the orthonormal basis of T_μZ are found by optimizing the Fréchet variance of the projected data on the submanifold H, i.e.

    v_k = argmax_{||v||=1} Σ_{i=1}^N d(μ, π_H(z_i))^2,   (6)

where H = Exp_μ(span{v_1, ..., v_{k-1}, v}). For a more detailed description of the PGA procedure, including the computational approximations of the projection map in the tangent space of μ that we employ as well, see [6]. In the experiments section, we will perform PGA on the manifold defined by the latent space of a deep generative model for the MNIST dataset.
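As a concrete instance of the tangent-space approximation just mentioned, the sketch below linearizes PGA: the data are mapped to T_μZ with Log_μ, and Euclidean PCA is performed there. It assumes log_map returns tangent vector coordinates in a basis that is orthonormal with respect to the metric g at μ; the exact PGA optimization (6) would replace the eigen-decomposition with optimization over projections onto geodesic submanifolds.

```python
import numpy as np

def tangent_pga(log_map, samples, mu):
    """Linearized PGA: Euclidean PCA of the Log-mapped data in the tangent space at mu."""
    V = np.stack([log_map(mu, z) for z in samples])   # N x d matrix of tangent vectors
    C = V.T @ V / V.shape[0]                          # tangent-space covariance at mu
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                 # principal directions first
    return eigvals[order], eigvecs[:, order]
```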
5 Maximum Likelihood Inference of Latent Diffusions

As in Euclidean statistics, parameters of distributions on manifolds can be inferred from data by maximum likelihood or, from a Bayesian viewpoint, maximum a posteriori. This can even be used to define statistical notions, as exemplified by the ML mean in Section 4. This probabilistic viewpoint relies on the existence of parametric families of distributions in the geometric spaces, and on the ability to evaluate likelihoods. One example of such a distribution is the transition distribution of the Riemannian Brownian motion, see e.g. [9]. In this section, we show how likelihoods of data in the latent space Z under this distribution can be evaluated by Monte Carlo sampling of conditioned diffusion bridges. As before, we assume the geometry of Z has been trained on a separate training dataset, and that we wish to statistically analyze new observed data represented by z_i. To determine the transition distribution of a Brownian motion on the data manifold, we apply a conditioned diffusion bridge simulation procedure defined in [5], described in the following. This sampling scheme has previously been used for geometric spaces in [1, 17].

Let z_1, ..., z_N ∈ Z be N observations in Z. We assume the z_i are time T observations of a Brownian motion z_t, defined by (2), on Z started at x ∈ Z. The aim is to optimize for the initial point x by maximizing the likelihood of the observed data and thereby find the ML mean (5). The mean value of the data distribution is thus defined as the starting point of the process maximizing the data likelihood, L_θ(z_1, ..., z_N) = Π_{i=1}^N p_{T,θ}(z_i), where p_{T,θ}(z_i) is the time T transition density of z_t evaluated at z_i. The difficulty is to determine the transition density p_{T,θ}(z_i), i.e. the time T density conditional on z_T = z_i. In [5] it was shown that this conditional probability can be calculated based on the notion of a guided process

    dz̃^j_t = -(1/2) g(z̃_t)^{kl} Γ^j_{kl} dt - (z̃^j_t - z^j_i)/(T - t) dt + (√(g(z̃_t)^{-1}) dB_t)^j,   (7)

which, without conditioning, almost surely hits the observation z_i at time t = T. In fact, the guided process is absolutely continuous with respect to the conditioned process z_t | z_T = z_i with Radon-Nikodym derivative dP_{z|z_i}/dP_{z̃} = φ(z̃)/E_{z̃}[φ(z̃)]. From this, the transition density can be expressed as

    p_{T,θ}(z_i) = √(|g(z_i)| / (2πT)^d) exp(-(x - z_i)^T g(x)(x - z_i) / (2T)) E_{z̃}[φ(z̃)],   (8)

see [5, 17] for more details. We can use Monte Carlo sampling of z̃_t to approximate E_{z̃}[φ(z̃)] and hence determine p_{T,θ}(z_i) by (8). The likelihood can then be iteratively optimized to find the ML mean by computing gradients with respect to x.

Figure 2: (left) Brownian bridge sample paths on the trained data manifold. (middle) The estimated ML mean (blue) from the data (black points). (right) The likelihood values from the MLE procedure.

Figure 2 shows sample paths of a Brownian bridge on the trained manifold for the synthetic data on S^2, in addition to the ML estimated mean (middle). The likelihood values at each iteration are plotted in the same figure, illustrating that convergence has been reached for the MLE procedure.
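The following sketch outlines the bridge sampling underlying the ML mean estimate: guided paths (7) are simulated toward each observation, and (8) is estimated by Monte Carlo. It reuses the christoffel helper from the earlier sketch; the correction functional φ from [5] is passed as an assumed callable, since its explicit form, and the gradient-based optimization over the starting point x, are beyond this sketch.

```python
import numpy as np

def simulate_guided_bridge(g, christoffel, x, v, T=1.0, n_steps=100, rng=None):
    """Euler-Maruyama simulation of the guided process (7) from x toward the observation v."""
    rng = rng or np.random.default_rng()
    x, v = np.asarray(x, dtype=float), np.asarray(v, dtype=float)
    d = len(x)
    dt = T / n_steps
    z = x.copy()
    path = [z.copy()]
    for n in range(n_steps - 1):          # stop one step early: the drift blows up at t = T
        t = n * dt
        Ginv = np.linalg.inv(g(z))
        drift = (-0.5 * np.einsum('kl,jkl->j', Ginv, christoffel(g, z))
                 - (z - v) / (T - t))     # guiding term pulling the path toward v
        w, V = np.linalg.eigh(Ginv)
        noise = V @ (np.sqrt(np.maximum(w, 0.0)) * (V.T @ rng.normal(size=d)))
        z = z + drift * dt + np.sqrt(dt) * noise
        path.append(z.copy())
    path.append(v.copy())                 # the guided process hits v at time T
    return np.array(path)

def transition_density(g, christoffel, phi, x, z_i, T=1.0, n_samples=64):
    """Monte Carlo estimate of the transition density (8) at the observation z_i."""
    x, z_i = np.asarray(x, dtype=float), np.asarray(z_i, dtype=float)
    d = len(x)
    E_phi = np.mean([phi(simulate_guided_bridge(g, christoffel, x, z_i, T))
                     for _ in range(n_samples)])
    quad = (x - z_i) @ g(x) @ (x - z_i)
    return (np.sqrt(np.linalg.det(g(z_i)) / (2 * np.pi * T) ** d)
            * np.exp(-quad / (2 * T)) * E_phi)
```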
6 Experiments

We will give examples of the analyses described above for the MNIST dataset [13]. The computations are performed with the Theano Geometry package (http://bitbucket.com/stefansommer/theanogeometry/) described in [12]. The package contains implementations of differential geometry concepts and corresponding statistical algorithms.

The MNIST dataset consists of images of handwritten digits from 0 to 9, each observation of dimension 28 × 28. A VAE [11] has been trained on the full dataset, providing a 2-dimensional latent space representation Z. The VAE has one dense hidden layer for both the encoder and the decoder.
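For concreteness, a minimal PyTorch counterpart of such a VAE is sketched below (the paper's model is in Theano). The single dense hidden layer per network and the 2D latent space follow the description above; the hidden width, activations, and Bernoulli reconstruction loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE with one dense hidden layer in encoder and decoder and a 2D latent space."""
    def __init__(self, n=28 * 28, d=2, hidden=512):
        super().__init__()
        self.enc = nn.Linear(n, hidden)
        self.mu = nn.Linear(hidden, d)
        self.logvar = nn.Linear(hidden, d)
        self.dec = nn.Linear(d, hidden)
        self.out = nn.Linear(hidden, n)

    def encode(self, x):                    # h : X -> Z (mean and log-variance of q(z|x))
        h = torch.tanh(self.enc(x))
        return self.mu(h), self.logvar(h)

    def decode(self, z):                    # f : Z -> X, the decoder defining the manifold
        return torch.sigmoid(self.out(torch.tanh(self.dec(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decode(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Reconstruction plus KL divergence to the standard normal prior [11]."""
    rec = F.binary_cross_entropy(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```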
Figure 3: (left) Scalar curvature of the latent space Z. (middle) Minimum eigenvalue of the Ricci curvature tensor. (right) Parallel transport of a tangent vector in Z. The transported vectors have constant length measured by the Riemannian metric.

Figure 3 shows the scalar curvature (left) and minimum Ricci curvature (middle) in a neighbourhood of the origin of Z. In addition, an example of parallel transport of a tangent vector along a curve in the latent space is visualized in the same figure (right). Note that the transported vector has constant length as measured by the metric g, which is clearly not the case for the Euclidean R^2 norm.

Figure 4: (top left) Samples from a Riemannian Brownian motion in latent space. (top right) Samples from a Brownian bridge simulated by (7). (bottom) Examples of Brownian bridges of MNIST data between two fixed 9s (left-/rightmost). The variance of the Brownian motion has been increased to visually emphasize the image variation.

The top row of Figure 4 shows samples of Brownian motions and Brownian bridges in the latent space Z. Each of these Brownian bridges corresponds to a bridge on the data manifold of the MNIST data. Examples of bridges in the high-dimensional space X are shown in the bottom row of Figure 4.

We now perform PGA on the latent space representation of the subset of the MNIST data consisting of even digits. PGA is a nonlinear coordinate change of the latent space around the Fréchet mean. PGA is applied to the data in Figure 5(a), and the resulting transformed data in the PGA basis is shown in Figure 5(b). The variation along the two principal component directions is visualized in the full-dimensional data space in the bottom row of Figure 5.

Figure 5: (top left, (a)) Latent space representation of the data. (top right, (b)) PGA analysis on the subspace of even digits. (bottom) Variation along the first (1st row) and second (2nd row) principal components.

Figure 6 (third image from left) shows the maximum likelihood mean image for a subset of 256 even digits, estimated by the ML procedure described in Section 5. Figure 6 (right) shows the corresponding Fréchet mean. The iterations in latent space, for both the ML and Fréchet means, are shown in Figure 6 (left), with the last plot showing the likelihood values for each step of the ML optimization.
Figure 6: From left: (1.) iterations of the ML mean (green) and Fréchet mean (red) for the subset of even MNIST digits. (2.) Likelihood evolution during the MLE. Estimated ML mean (3.) and Fréchet mean (4.).
7 Conclusion

Deep generative models define an embedding of a low-dimensional latent space Z into a high-dimensional data space X. The embedding can be used to reduce data dimensionality and move statistical analysis from X to the low-dimensional latent representation in Z. This method can be seen as a nonlinear equivalent of the dimensionality reduction commonly performed with PCA. Nonlinear structure in data can be represented compactly, and the induced geometry necessitates the use of nonlinear statistical tools. We considered principal geodesic analysis on the latent space and maximum likelihood estimation of the mean using simulation of conditioned diffusion processes. To enable fast computation of the geometric algorithms that involve higher-order derivatives of the metric, we fit a second neural network to predict the metric g and its inverse, which vastly speeds up computations. We visualized examples on 3D synthetic data simulated on S^2 and performed analyses on the MNIST dataset based on a trained VAE with a 2D latent space.

References

[1] Alexis Arnaudon, Darryl D. Holm, and Stefan Sommer. A geometric framework for stochastic shape analysis. Accepted for Foundations of Computational Mathematics, arXiv:1703.09971 [cs, math], 2018.
[2] G. Arvanitidis, L. K. Hansen, and S. Hauberg. Latent space oddity: on the curvature of deep generative models. ICLR 2018, arXiv:1710.11379, October 2017.
[3] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, November 2009.
[4] Nutan Chen, Alexej Klushyn, Richard Kurle, Xueyan Jiang, Justin Bayer, and Patrick van der Smagt. Metrics for deep generative models. In AISTATS 2018, November 2017. arXiv:1711.01204.
[5] Bernard Delyon and Ying Hu. Simulation of conditioned diffusion and application to parameter estimation. Stochastic Processes and their Applications, 116(11):1660–1675, November 2006.
[6] P. T. Fletcher, C. Lu, S. M. Pizer, and S. Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 2004.
[7] M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10:215–310, 1948.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[9] Elton P. Hsu. Stochastic Analysis on Manifolds. American Mathematical Society, 2002.
[10] Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. AAAI-17, arXiv:1608.04233, August 2016.
[11] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114 [cs, stat], December 2013.
[12] L. Kühnel, A. Arnaudon, and S. Sommer. Differential geometry and stochastic dynamics with deep learning numerics. arXiv:1712.08364 [cs, stat], December 2017.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[14] Tom Nye. Construction of distributions on tree-space via diffusion processes. Mathematisches Forschungsinstitut Oberwolfach, 2014.
[15] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, 2006.
[16] Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The Riemannian geometry of deep generative models. arXiv:1711.08014 [cs, stat], November 2017.
[17] Stefan Sommer, Alexis Arnaudon, Line Kühnel, and Sarang Joshi. Bridge simulation and metric estimation on landmark manifolds. In Graphs in Biomedical Image Analysis, Computational Anatomy and Imaging Genetics, Lecture Notes in Computer Science, pages 79–91. Springer, September 2017.
[18] Stefan Sommer and Anne Marie Svane. Modelling anisotropic covariance using stochastic development and sub-Riemannian frame bundle geometry. Journal of Geometric Mechanics, 9(3):391–410, June 2017.
[19] The Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688 [cs].