Deep generative models in inversion: a review and development of a new approach based on a variational autoencoder
Jorge Lopez-Alvis a,b,∗, Eric Laloy c, Frédéric Nguyen a, Thomas Hermans b
a Urban and Environmental Engineering, Applied Geophysics, University of Liege, Belgium
b Department of Geology, Ghent University, Belgium
c Engineered and Geosystems Analysis, Institute for Environment, Health and Safety, Belgian Nuclear Research Center, Belgium
Abstract
When solving inverse problems in geophysical imaging, deep generative models (DGMs) may be used to enforce the solution to display highly structured spatial patterns which are supported by independent information (e.g. the geological setting) of the subsurface. In such a case, inversion may be formulated in a latent space where a low-dimensional parameterization of the patterns is defined and where Markov chain Monte Carlo or gradient-based methods may be applied. However, the generative mapping between the latent and the original (pixel) representations is usually highly nonlinear, which may cause some difficulties for inversion, especially for gradient-based methods. In this contribution we review the conceptual framework of inversion with DGMs and study the principal causes of the nonlinearity of the generative mapping. As a result, we identify a conflict between two goals: the accuracy of the generated patterns and the feasibility of gradient-based inversion. In addition, we show how some of the training parameters of a variational autoencoder, which is a particular instance of a DGM, may be chosen so that a tradeoff between these two goals is achieved and acceptable inversion results are obtained with a stochastic gradient-descent scheme. A test case using truth models with channel patterns of different complexity and cross-borehole traveltime tomographic data involving both a linear and a nonlinear forward operator is used to assess the performance of the proposed approach.
Keywords: deep generative model, geophysical inversion, geological prior information, variational autoencoder, stochastic gradient descent, deep learning
Jorge Lopez-Alvis conceived the main idea of the paper, developed and tested the proposed method (including its software implementation) and wrote a first draft of the manuscript. Eric Laloy conceived the main idea of the paper, performed some software tests, suggested improvements to the proposed method and edited the first draft. Frédéric Nguyen and Thomas Hermans are supervisors of the first author, conceived the main idea of the paper, suggested improvements to the proposed method, obtained funding for the research and edited the first draft.
∗ Corresponding author
Email address: [email protected] (Jorge Lopez-Alvis)
1. Introduction

A common task in the geosciences is to solve an inverse problem in order to obtain a model (or image) from a set of measurements sensing a heterogeneous spatial domain. When measurements are sparse, the inverse problem is usually ill-posed and its solution non-unique. In such cases it is possible to constrain the solution to be found only among models with certain spatial patterns. In practice, such patterns are supported by independent (prior) information of the sensed domain (e.g. knowledge of the geological setting) and used with the aim of appropriately reconstructing heterogeneity. Classical regularization may be used to impose certain structures on the solution (Tikhonov and Arsenin, 1977), but these are generally too simple to be realistic. Recently, the use of deep generative models (DGMs) to constrain the solution space of inverse problems has been proposed so that resulting models have specific spatial patterns (Bora et al., 2017; Laloy et al., 2017; Hand and Voroninski, 2018; Seo et al., 2019). DGMs can deal with realistic (natural) patterns which are not captured by classical regularization or random processes defined by second-order statistics (Linde et al., 2015). In this way, inversion with DGMs provides an alternative to inversion with either multiple-point geostatistics (MPS) (González et al., 2008; Hansen et al., 2012; Linde et al., 2015) or example-based texture synthesis (ETS) (Zahner et al., 2016).

All the previously mentioned methods generally rely on gridded representations for the models (i.e. by dividing the spatial domain in cells or pixels). However, a key difference between them is that MPS and ETS directly extract spatial patterns from a training image (or exemplar) (Strebelle, 2002; Mariethoz et al., 2010), whereas DGMs require an initial learning phase in which samples of the patterns are used as a training set—such samples may be obtained by different means, e.g. cropped from a training image or taken directly from a large set of images. While any of the methods (DGMs, MPS or ETS) may be used with inversion, DGMs rely on a so-called latent space where a low-dimensional parametric representation (referred to as latent vector or code) is defined and where Markov chain Monte Carlo (MCMC) or gradient-based methods may be applied. Note that when using gridded representations each model may be seen as a vector in "pixel" space (a space where each pixel is one dimension). Since only highly structured spatial patterns are allowed, vectors will only occupy a subset of the pixel space. This subset defines a manifold of lower dimensionality than the pixel space (Fefferman et al., 2016) and the latent space is simply a low-dimensional space onto which such manifold is mapped. Most inversion methods require a perturbation step to search for models that fit the data, but such a step is not straightforward to compute for highly structured patterns (Linde et al., 2015; Hansen et al., 2012). The latent space of DGMs provides a useful frame to compute a perturbation step (Laloy et al., 2017) or even a local gradient-descent direction (Laloy et al., 2019), which generally results in better exploration of the posterior distribution and/or faster convergence compared to inversion with MPS or ETS.

So far, inversion with DGMs has been done successfully with regular MCMC sampling methods (Laloy et al., 2017, 2018). However, when applicable, gradient-based methods may be preferred given their lower computational demand.
Gradient-based deterministic inversion with DGMs has been pursued with encouraging results (Richardson, 2018; Laloy et al., 2019); however, convergence to the true model was shown to be dependent on the initial model. In the framework of probabilistic inversion, MCMC methods that use the gradient to guide the sampling in the latent space have been shown to be less prone to getting trapped in local minima than gradient-based deterministic methods, while they are also expected to reach convergence faster than regular MCMC (Mosser et al., 2018). A different inversion strategy that has also been applied successfully with DGMs and has a relatively low computational cost is the ensemble smoother (Canchumuni et al., 2019; Mo et al., 2020).

Recently, Laloy et al. (2019) studied the difficulties of performing gradient-based deterministic inversion with a specific DGM. They concluded that the nonlinearity of their generative function or "generator" (i.e. the mapping from the latent space to the pixel space) was high enough to hinder gradient-based optimization, causing the latter to often fail in finding the global minimum even when the objective function was known to be convex (in pixel space). Such high nonlinearity is expected since DGMs approximate highly complex patterns by using low-dimensional inputs usually generated by simple probability distributions (e.g. normal or uniform distributions). The complexity of realistic patterns means that the corresponding manifold will generally have both a curvature and a topology that is radically different from the region (or subset) defined indirectly in the latent space by the chosen probability distribution. The generative function then has to deform this region (when mapping it to the pixel space) in such a way as to approximate (or cover) the manifold as closely as possible. The combined deformation needed to curve the region and to approximate its topology causes the generative function to be highly nonlinear. In order to approximate manifolds of realistic patterns, most common DGMs involve (artificial) neural networks with several layers and nonlinear (activation) functions. Recently, it has been shown that deep neural networks with a ReLU activation function are able to change the topology of the input (Naitzat et al., 2020), so, besides nonlinearity, the generator of DGMs may also induce changes in topology when mapping from the latent space. When the sole purpose of the DGM is to generate new samples, high nonlinearity and induced changes in topology are not important, but they might be an issue when the DGM is used for additional tasks, such as inversion.

For a specific subsurface pattern, the degree of nonlinearity and the changes induced in topology by the generative function may be controlled mainly by its architecture and the way it is trained (Goodfellow et al., 2016). Regarding differences in training, two common types of DGMs can be distinguished: generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma and Welling, 2014)—in both cases training generally takes the form of optimizing a loss function. GANs and VAEs require the specification of a probability distribution in the latent space and an architecture for the discriminator or encoder (respectively) in addition to the one for their generators. They might also require other parameters to be specified, such as the weights on the different terms of the loss function.
Frequently, some of these choices use default values, but generally all of them may affect the degree of nonlinearity of the generator (Rolinek et al., 2019).

Given all these possible controls for learning the generator, it is interesting to investigate whether there exist some combinations of such controls that allow both for a good reproduction of the patterns and a good performance of less computationally demanding gradient-based inversion. In this work, we review some of the difficulties of performing inversion with DGMs and show how to obtain a well-balanced tradeoff between accuracy in patterns and applicability of gradient-based methods. In particular, we propose to use a VAE as DGM and to select some of its training parameters in order to achieve good results with gradient-based inversion. Then, we compare this to the training choice of a GAN that has been tested with gradient-based inversion in prior studies (Laloy et al., 2019; Richardson, 2018). Furthermore, we show that since the resulting VAE inversion is only mildly nonlinear, modified stochastic gradient-descent (SGD) methods are generally sufficient to avoid getting trapped in local minima and provide a better alternative than regular gradient-based methods while also retaining a low computational cost.

The remainder of this paper is structured as follows. Section 2.1 explains DGMs and their conceptualization as approximating the real (pattern) manifold. In Section 2.2 the use of DGMs to represent prior information in inversion and the difficulties of performing gradient-based inversion are reviewed. Sections 2.3 and 2.4 show how to use a VAE and SGD to cope with some of the mentioned difficulties. Then, Section 3 shows some results of the proposed approach. Section 4 discusses the obtained results and points to some remaining challenges. Finally, Section 5 presents the conclusions of this work.
2. Methods
The term "deep learning" generally refers to machine learning methods that involve several layers ofmultidimensional functions. This general "deep" setting has been shown to allow for complex mappings tobe accurately approximated by building a succession of intermediate (simpler) representations or concepts(Goodfellow et al., 2016). Consider, for instance, deep neural networks (DNNs) which are mappings definedby a composition of a set of (multidimensional) functions φ k as: g ( x ) = ( φ L ◦ · · · ◦ φ ◦ φ )( x ) (1)where x is a multidimensional (vector) input, k = { , . . . , L } denotes the function (layer) index and compo-sition follows the order from right to left. Furthermore, each φ k is defined as: φ k ( ξ ) = ψ k ( M k ξ + b k ) (2)4n which ψ k is a (nonlinear) activation function, M k and b k are vectors of weights and biases, respectively,and ξ denotes the output of the previous function (layer) φ k − for k > or the initial input x for k = 1 .Then, training the DNN involves estimating the values for the all the parameters θ = { M k , b k | ≤ k ≤ L } where each M k or b k may be of different dimensionality depending on the layer. In practice, the numberof parameters θ for such models may reach the order of , therefore training is achieved by relying onautodifferentiation (see e.g. Paszke et al., 2017) and fast optimization techniques based on stochastic gradientdescent (SGD) (see e.g. Kingma and Ba, 2017), both usually implemented for and run in highly parallel(GPU) computing architectures.A deep generative model (DGM) is a particular application of such deep methods (Salakhutdinov, 2015).In a DGM a set of training examples X = { x ( i ) | ≤ i ≤ N } and a simple low-dimensional probabilitydistribution p ( z ) are used to learn a model g ( z ) that is capable of generating new samples of x (which areconsistent with the training set) by using as input samples from p ( z ) . This can be written as: x = g ( z ) , z ∼ p ( z ) (3)where g ( z ) is referred to as the "generator" and z denotes a vector of latent variables or "code". While thetraining (and generated) samples x are usually represented in a high-dimensional space R D , the probabilitydistribution p ( z ) is defined in a low-dimensional space R d . The space R D is often referred to as "ambientspace" while the space R d is called the "latent space". Fig. 1a,c shows a schematic representation where thedimensionality of the ambient space is D = 3 and the one of the latent space is d = 2 . A typical applicationof DGMs is the generation of images (see e.g. Kingma and Welling, 2014; Goodfellow et al., 2014) for whichthe ambient space is just the pixel space. Gridded representations of subsurface models may be seen as two-or three-dimensional images of the subsurface.The underlying assumption in DGMs is that real-world data are generally structured in their high-dimensional ambient space R D and therefore have an intrinsic lower dimensionality—such assumption isknown in machine learning literature as the manifold hypothesis (Fefferman et al., 2016) because it statesthat high-dimensional data usually lie on (or lie close to) a lower-dimensional manifold M ⊂ R D . Forinstance, when studying a subsurface region it is usually assumed that geological processes gave it certaindegree of structure then, to allow for a flexible base on which to represent the distribution of the differentsubsurface materials, the region is usually divided in pixels (or cells) within each of which the material isassumed to be homogeneous. 
The underlying assumption in DGMs is that real-world data are generally structured in their high-dimensional ambient space R^D and therefore have an intrinsic lower dimensionality—such an assumption is known in the machine learning literature as the manifold hypothesis (Fefferman et al., 2016) because it states that high-dimensional data usually lie on (or close to) a lower-dimensional manifold M ⊂ R^D. For instance, when studying a subsurface region it is usually assumed that geological processes gave it a certain degree of structure; then, to allow for a flexible base on which to represent the distribution of the different subsurface materials, the region is usually divided in pixels (or cells) within each of which the material is assumed to be homogeneous. Such gridded representation "lives" in the high-dimensional pixel space (the ambient space), but since it has some structure there should be a lower-dimensional space (the latent space) where the same distribution of subsurface materials might be represented. Technically, while both the latent space R^d and the manifold M are usually low-dimensional, they may differ in dimensionality and/or the manifold may only occupy a certain portion of the latent space (e.g. the shaded region in Fig. 1c).

Figure 1: Sketch of the low-dimensional manifold setting. (a) Real manifold M and misfit function γ(x) in ambient space R^D. (b) Approximate manifold M′ overlaying the real manifold. (c) Region of latent space R^d where the approximate manifold is implicitly defined by the probability distribution p(z). (d) Misfit function contours intersected by the real manifold. (e) Misfit function contours intersected by the approximate manifold. (f) Misfit function contours back-mapped onto the latent space and the related gradient ∇_z ζ(z) computed at one iteration.

A DGM aims to approximate the real manifold M by generating samples that closely follow such manifold, i.e. that lie on an approximate manifold M′ (Fig. 1b). Samples of this approximate manifold are generated by sampling first from a simple probability distribution p(z) in latent space (e.g. a normal or uniform distribution) and then passing them through the generator g(z). Since the probability distribution p(z) indirectly defines a region (or subset) in latent space that generally has a different curvature and topology than the real manifold, the generator g(z) must be able to approximate both curvature and topology when mapping the samples of p(z) to ambient space. This generally requires the generator to be a highly nonlinear function. As an instance, consider the case of certain spatial patterns whose real manifold is a highly curved surface with "holes" in ambient space, while the (input) region defined by a uniform p(z) is a (flat) plane in a two-dimensional latent space. Regarding their topological properties, one technically says that this plane is simply connected while the real manifold is not (see e.g. Kim and Zhang, 2019). Then, the generative function has to deform this plane in such a way as to approximate (or cover) the real manifold as closely as possible. An important property of DGMs is that, since a probability distribution in latent space is used, the sample "density" of such plane (and its mapping) also plays an important role. For instance, the generative function may approximate the "holes" of the real manifold by creating regions of very low density of samples when mapping to ambient space (to picture this one can imagine locally stretching a flexible material without changing its curvature). The combined deformation needed to curve the plane and to "make" the holes causes the generative function to be highly nonlinear. Note that when considering a DGM that uses a DNN with ReLU activation functions as generator g(z), it is also possible for g(z) to change the topology of the input by "folding" transformations (Naitzat et al., 2020).

While one should always strive to accurately approximate the real manifold, since a finite set of training samples is used, a tradeoff between accuracy and diversity in the generated samples may be a better objective. Indeed, the prescribed probability distributions are used to continuously "fill" the space between the samples and therefore generate samples of a continuous manifold.
Recent success—in terms of accuracy and diversity of generated samples—has been achieved with two DGMs that are based on deep neural networks (DNNs): generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma and Welling, 2014). The generator g(z) in both strategies is a mapping from a low-dimensional input z ∈ R^d to a high-dimensional output x ∈ R^D. In contrast, the mappings corresponding to the discriminator and the encoder take high-dimensional inputs x and return low-dimensional outputs.

2.2. Inversion with deep generative models

Consider a survey or experiment for which a vector of noisy measurements y = (y_1, ..., y_Q)^T ∈ R^Q of a physical process is available. A simplified description of the process may be expressed by a (mathematical) forward operator f : R^D → R^Q that takes as input a subsurface model vector x = (x_1, ..., x_D)^T ∈ R^D and returns a simulated response f(x). Commonly, this operator is in the form of a (numerical) discretization of a set of PDEs describing the process under study and is an approximation of the real process. As a result of such approximation and the use of noisy data, a noise term η is added to the simulation to represent the total uncertainty. Then, the relation between the operator and the measurements may be written as (see e.g. Aster et al., 2013):

y = f(x) + η    (4)

The corresponding inverse problem, or inversion of Eq. (4), aims to obtain an estimation of the vector x from the (noisy) data y. Deterministic inversion does so by optimizing a misfit function γ(x) that is usually given in the form of a distance function between the simulated response f(x) and the data y, e.g. by the squared ℓ2 norm:

γ(x) = ‖f(x) − y‖²    (5)

In this way, traditional gradient-based inversion requires the gradient ∇_x γ(x), whose elements are:

[∇_x γ(x)]_i = ∂γ(x)/∂x_i    (6)

and are computed by considering Eq. (4) together with the chosen misfit.

DGMs may be used with inversion of subsurface data y to obtain geologically realistic spatial distributions of physical properties x (Laloy et al., 2017). While this is also possible with traditional deterministic inversion, where a regularization term is added directly in Eq. (5) (i.e. in ambient space) to obtain models with the imposed structures that minimize the misfit (Lange et al., 2012; Caterina et al., 2014), DGMs are more flexible because they can enforce different structures provided they are appropriately trained. In the DGM setting, the low-dimensional samples z that are input to the generator g(z) may be seen as defining a low-dimensional parameterization (or encoding) of realistic patterns x, and therefore exploration of the set of feasible models may be done in the latent space R^d, as long as the search is done within the region where the approximate manifold M′ is defined (depicted by shading in Fig. 1c).

Since the misfit γ(x) is typically defined in ambient space R^D (e.g. in Fig. 1a), gradient-based inversion with DGMs may be seen as optimizing the intersection of γ(x) with the approximate manifold M′ (Fig. 1e). Such intersected misfit is mapped into the latent space (Fig. 1f) and may be expressed as γ(g(z)).
Also note that when probability distributions p(z) with infinite support are used (e.g. a normal distribution), one can guide the search in the latent space by adding controlling (regularization) terms to the mapped misfit (see e.g. Bora et al., 2017), and the resulting objective function may be written as:

ζ(z) = γ(g(z)) + λ R(z) = ‖f(g(z)) − y‖² + λ R(z)    (7)

where R(z) is a regularization term defined in the latent space and λ is the corresponding regularization factor.

In practice, no exhaustive mapping has to be done and the gradient ∇_z ζ(z) is only computed for the points in latent space where the optimization lands in each iteration (in Fig. 1f the gradient is represented for one iteration). The gradient ∇_z ζ(z) is computed by adding a derivative layer corresponding to ∇_x γ(x) to the autodifferentiation that was set up for g(z) while training the DGM (see e.g. Laloy et al., 2019). Such autodifferentiation setup may be seen as implicitly obtaining the jacobian J(z) of size D × d, whose elements are:

[J(z)]_{i,j} = ∂g_i(z)/∂z_j    (8)

Then, the gradient ∇_z ζ(z) is obtained from Eq. (7) by using the chain rule given by the product of Eqs. (6) and (8):

∇_z ζ(z) = ∇_z γ(g(z)) + λ ∇_z R(z) = J(z)^T ∇_x γ(x) + λ ∇_z R(z)    (9)

The latter may also be done implicitly by incorporating it directly in the autodifferentiation framework, e.g. putting it on top of the so-called computational graph (Richardson, 2018; Mosser et al., 2018).
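As a minimal illustration of Eqs. (7)–(9), the gradient ∇_z ζ(z) can be obtained by letting automatic differentiation backpropagate the data misfit through an already trained generator; the product with J(z)^T in Eq. (9) is applied implicitly and never formed explicitly. The `generator`, `forward_op` and `reg` callables below are placeholders for the trained DGM, the forward operator and a latent-space regularization term.

```python
import torch

def latent_objective(z, generator, forward_op, y_obs, lam, reg):
    """zeta(z) = ||f(g(z)) - y||^2 + lam * R(z), Eq. (7)."""
    x = generator(z)                                # g(z): latent code -> ambient (pixel) space
    misfit = ((forward_op(x) - y_obs) ** 2).sum()   # data misfit gamma(g(z))
    return misfit + lam * reg(z)

def latent_gradient(z, generator, forward_op, y_obs, lam, reg):
    """grad_z zeta(z), Eq. (9), obtained by backpropagation through the generator."""
    z = z.detach().requires_grad_(True)
    latent_objective(z, generator, forward_op, y_obs, lam, reg).backward()
    return z.grad

# Hypothetical usage, e.g. with a linear operator F (a Q x D tensor) and a quadratic regularizer:
# grad = latent_gradient(z0, generator, lambda x: x @ F.T, y_obs, lam=10.0, reg=lambda z: (z ** 2).sum())
```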
Assuming the considered misfit function γ(x) is convex in ambient space R^D (as depicted by the concentric contours in Fig. 1a), difficulties in performing gradient-based deterministic inversion may arise due to the generator g(z) (Laloy et al., 2019). We propose that such difficulties arise because the generator (1) is highly nonlinear and (2) changes the topology of the input region defined by p(z). Both of these properties often cause distances (between samples) in latent space to be significantly different than distances in ambient space. Consider again the example of a real manifold that is a highly curved surface with "holes" in it, with a uniform distribution p(z) used as input to the generator; the latter might be able to approximate both the curvature and the holes at the cost of increasing nonlinearity and/or changing topology. When considering this backwards—e.g. when mapping the misfit function γ(x) into the latent space—the approximation of both high curvature and differences in topology often translates into discontinuities or high nonlinearities because a continuous mapping onto the uniform distribution is enforced. This results in high curvature being effectively "flattened" and holes effectively "glued", both of which cause distances to be highly distorted. In this work, we will call a generator "well-behaved" when it is only mildly nonlinear and preserves topology.

Both the generator's nonlinearity and its ability to change topology may be controlled by two factors: (1) the generator architecture (type and size of each layer and total number of layers) and (2) the way it is trained (including training parameters). If the goal is to perform gradient-based inversion with DGMs, one should try to preserve convexity of γ(x) as much as possible when mapping it to the latent space as γ(g(z)), while not degrading the generator's ability to reproduce the desired patterns. To aid in preserving such convexity, we propose to enforce the generator g(z) to be well-behaved. This means that the generator will approximate the real manifold M with a manifold M′ with a moderate curvature and whose topology is the same as that of the region defined in latent space by p(z). By enforcing a moderate-curvature manifold, local oscillations that may give rise to local minima (as those shown in Fig. 1d) but have only minimal impact on pattern accuracy are avoided in the approximate manifold M′ (the local minima are no longer present in Fig. 1e). In turn, when the generator is encouraged to preserve topology, no more local minima should arise in R^d than the ones resulting from intersecting γ(x) with the approximate manifold M′ in R^D (note e.g. there is one local minimum in both Fig. 1e,f). The latter is in line with the proposal of Falorsi et al. (2018), where they argue that for the purpose of representation learning (which basically means learning encodings that are useful for other tasks than just generative modeling) the mapping should preserve topology.

GANs often produce highly nonlinear generators that do not preserve topology, which may result in challenging inversion in the latent space. Laloy et al. (2018) provide an example of how the architecture of a GAN is set to obtain a relatively well-behaved generator g(z). They propose to use a model called spatial generative adversarial network (SGAN) (Jetchev et al., 2017) that enforces different latent variables to affect different local regions in the ambient space. Their architecture results in a high compression (lower dimensionality of the latent space) and controls nonlinearity, which allowed them to successfully perform MCMC-based inversion in the latent space. However, gradient-based deterministic inversion performed with the same DGM was shown to be highly dependent on the initial model (Laloy et al., 2019), pointing towards the existence of local minima. In this work we aim for robust gradient-based inversion in latent space by considering a VAE, the other predominant type of DGM, and its ability to produce a well-behaved generator.

2.3. Variational autoencoders

A variational autoencoder (VAE) is the model resulting from using a reparameterized gradient estimator for the evidence lower bound while applying (amortized) variational inference to an autoencoder, i.e. an architecture involving an encoder and a decoder which are both (possibly deep) neural networks (Kingma and Welling, 2014; Zhang et al., 2018). To train a VAE one uses a dataset X = {x^(i) | 1 ≤ i ≤ N}, where each x^(i) is a sample (e.g. an image) with the desired patterns, and then maximizes the sum of the evidence (or marginal likelihood) lower bound of each individual sample. The evidence lower bound for each sample can be written as (Kingma and Welling, 2014)

L(θ, ϑ; x^(i)) = L_x + L_z    (10)

with

L_x = E_{q_ϑ(z|x^(i))}[log p_θ(x^(i)|z)]    (11)

and

L_z = −D_KL(q_ϑ(z|x^(i)) ‖ p(z))    (12)

where z refers to the codes or latent vectors, p_θ(x|z) is the (probabilistic) decoder, q_ϑ(z|x) is the (probabilistic) encoder, E denotes the expectation operator, D_KL denotes the Kullback-Leibler distance, and θ and ϑ are the parameters (weights and biases) of the DNNs for the decoder and encoder, respectively. In order to maximize the evidence lower bound in Eq. (10), its gradient with respect to both θ and ϑ is required; however, this is generally intractable and therefore an estimator is used.
This estimator is based on a so-called reparameterization trick of the random variable z̃ ∼ q_ϑ(z|x), which uses an auxiliary noise ε. In the case of a VAE, the encoder is defined as a multivariate Gaussian with diagonal covariance:

q_ϑ(z|x) = N(h_ϑ(x), u_ϑ(x) · I_d)    (13)

where h_ϑ(x) and log u_ϑ(x) are modeled with DNNs and I_d is a d × d diagonal matrix. Then, the encoder and the auxiliary noise ε are used in the following way during training (Kingma and Welling, 2014):

z̃ = h_ϑ(x) + u_ϑ(x) ⊙ ε,   ε ∼ p(ε)    (14)

where ⊙ denotes an element-wise product. Often Eq. (12) has an analytical solution; then only Eq. (11) is approximated with the estimator as (Kingma and Welling, 2014)

L̃_x = (1/M) Σ_{j=1}^{M} log p_θ(x^(i) | z̃^(i,j))    (15)

where z̃^(i,j) = h_ϑ(x^(i)) + u_ϑ(x^(i)) ⊙ ε^(j), ε^(j) ∼ p(ε) and M is the number of samples used for the estimator. Further, if we set the decoder p_θ(x|z) as a multivariate Gaussian with diagonal covariance structure, then

p_θ(x|z) = N(g_θ(z), v_θ(z) · I_D)    (16)

where g_θ(z) and log v_θ(z) are modeled with DNNs and I_D is a D × D diagonal matrix. In this work, we consider only the mean of the decoder p_θ(x|z), which is just the (deterministic) generator g_θ(z). Then, the corresponding (mean-square error) loss function may be written as

L̃_x = (1/M) Σ_{j=1}^{M} ‖g_θ(z̃^(i,j)) − x^(i)‖²    (17)

The described setting allows the gradient to be computed with respect to both θ and ϑ, and then stochastic gradient descent is used to maximize the lower bound in Eq. (10). In the rest of this work, we drop the subindex θ in g(z) to simplify notation and also because once the DGM is trained, the parameters θ do not change, i.e. they are fixed for the subsequent inversion.

As previously mentioned, it is often possible to analytically integrate the Kullback-Leibler distance in Eq. (12). In this work, we consider that p(z) and q_ϑ(z|x) are both Gaussian, therefore Eq. (12) may be rewritten as (Kingma and Welling, 2014):

L_z = (1/2) Σ_{i=1}^{d} (1 + log((u_i)²) − (h_i)² − (u_i)²)    (18)

where the sum is done over the d output dimensions of the encoder.

Note that the term in Eqs. (11), (15) and (17) may be interpreted as a reconstruction term that causes the outputs of the encode-decode operation to look similar to the training samples, while the term in Eqs. (12) and (18) may be considered a regularization term that enforces the encoder q_ϑ(z|x) to be close to a prescribed distribution p(z). In practice, one may add a weight to the second term (Higgins et al., 2017) of the lower bound as:

L̃(θ, ϑ; x^(i)) = L̃_x + β L_z    (19)

to prevent samples from being encoded far from each other in the latent space, which may cause overfitting of the reconstruction term and degrade the VAE's generative performance. The overall process of training and generation for a VAE is depicted in Fig. 2.
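For concreteness, the following is a minimal sketch of the weighted bound for one mini-batch, with a single reparameterized sample per data point (M = 1, Eq. (14)), the mean-squared reconstruction term of Eq. (17) and the analytical Gaussian term of Eq. (18). The networks `encoder_mean`, `encoder_logvar` and `decoder` are placeholders for h_ϑ(x), log u_ϑ(x)² and g_θ(z), and x is assumed flattened to shape (batch, D); minimizing the returned loss corresponds to maximizing the weighted lower bound of Eq. (19).

```python
import torch

def beta_vae_loss(x, encoder_mean, encoder_logvar, decoder, beta, alpha=1.0):
    """Beta-weighted VAE loss for one mini-batch (cf. Eqs. (14), (17)-(19)), with M = 1."""
    h = encoder_mean(x)                        # h(x), Eq. (13)
    logvar = encoder_logvar(x)                 # log u(x)^2
    u = torch.exp(0.5 * logvar)                # u(x)

    eps = alpha ** 0.5 * torch.randn_like(u)   # eps ~ p(eps) = N(0, alpha * I_d)
    z = h + u * eps                            # reparameterization trick, Eq. (14)

    recon = ((decoder(z) - x) ** 2).sum(dim=1).mean()                      # Eq. (17), batch average
    kl = -0.5 * (1.0 + logvar - h ** 2 - logvar.exp()).sum(dim=1).mean()   # negative of Eq. (18)
    return recon + beta * kl                   # weighted as in Eq. (19)
```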
Figure 2: A diagram for a VAE: steps needed for training are shown in frames with continuous lines and steps needed for generation are in frames with dashed lines.

Note that in setting up the VAE one has to choose: (1) the architectures of the encoder and decoder, (2) the probability distribution p(z), (3) the noise distribution p(ε) and (4) the regularization weight β. As mentioned in Section 2.2, these choices may impact the nonlinearity of the generator and its ability to preserve topology, which in turn affect the mapping of the data misfit function γ(x) in latent space and possibly diminish the performance of inversion methods. In this work, we assume that an architecture for the generator (decoder) g(z) is chosen so that it performs sufficiently well in terms of reproducing the patterns. For instance, when gridded spatial distributions (images) are considered, a typical choice is a deep convolutional neural network (Radford et al., 2016). While the choice of the probability distribution p(z) may aid in obtaining a well-behaved generator, e.g. by selecting a probability distribution with the same (or similar) topology as the real manifold (Falorsi et al., 2018), we expect such a choice to be highly problem (pattern) dependent. Therefore, herein we focus on the other two possible controls: the noise distribution p(ε) and the regularization weight β.

The effect of the regularization weight β is such that, when increased, the encoded training samples tend to lie closer to the prescribed probability distribution p(z). Then, one may picture the transformation of the encoder as taking the low-dimensional approximate manifold in the ambient space and charting it (e.g. by bending, stretching and even folding) into the region defined by p(z) in the latent space, and the generator as the transformation undoing such charting.

While the effect of β in a VAE is relatively easy to understand, the effect of the noise distribution p(ε) is not so straightforward. First, note that the typical choice of a diagonal noise p(ε) = N(0, α · I_d), where α denotes a constant variance (frequently set to α = 1.0) and I_d is a d × d diagonal matrix, is usually done for tractability or computational convenience (Kingma and Welling, 2014; Rolinek et al., 2019). However, it has been proposed recently that the choice of a diagonal noise has an impact on a property called disentanglement (Rolinek et al., 2019). Such disentanglement basically means that different latent directions control different independent characteristics of the training (or generated) samples. They explain that a diagonal p(ε) might induce an encoding that preserves local orthogonality of the ambient space. In this work, we argue that the choice of a diagonal p(ε) (which is usually done only for computational convenience) might be useful in producing a well-behaved generator.

In order to visualize the joint effect of α and β, Fig. 3 shows a synthetic example where samples in a two-dimensional ambient space lie close to a rotated "eight-shaped" manifold (Fig. 3a). In addition, to study the impact on inversion, a convex data misfit function γ(x) in the same space (created synthetically with a negative isotropic Gaussian function) is shown in Fig. 3b.
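The exact construction of this toy example is not given in the text, but a dataset of this kind can be produced, for instance, from a noisy, rotated lemniscate ("figure-eight") curve, together with a misfit defined as a negative isotropic Gaussian centered on one point of the manifold. The sketch below is one such plausible construction with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Eight-shaped" manifold: a rotated lemniscate with small Gaussian jitter (cf. Fig. 3a).
t = rng.uniform(0.0, 2.0 * np.pi, 2000)
curve = np.stack([np.cos(t), np.sin(t) * np.cos(t)], axis=1)   # figure-eight parameterization
theta = np.deg2rad(30.0)                                       # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_train = curve @ R.T + 0.02 * rng.standard_normal((2000, 2))

# Synthetic misfit (cf. Fig. 3b): a negative isotropic Gaussian centered on a "true" model.
x_true = x_train[0]
def misfit(x):
    return -np.exp(-np.sum((x - x_true) ** 2, axis=-1) / (2.0 * 0.5 ** 2))
```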
Figure 3: Synthetic example of a two-dimensional "eight-shaped" manifold: (a) training samples lying close to the manifold, and (b) synthetic misfit function γ(x).

The latent space is also chosen two-dimensional for visualization purposes, but recall that for a real case the dimensionality of the latent space is usually much lower than the one of the ambient space. Then, Fig. 4 considers nine different combinations for the values of α and β to show how the (nonlinear) generator g(z) maps a region of the latent space (denoted by the z-axes in the first three rows) into the ambient space (denoted by the x-axes in the last three rows) in order to approximate the manifold in Fig. 3a. To visualize the deformation caused by the generator, an orthogonal grid in the z-axes and its mapping into the x-axes (a deformed grid) are shown (both on the left of each inset). The corresponding encoded training samples are shown in red in the z-axes (left of each inset) and their reconstruction (resulting from the operation of encode-decode) is shown also in red in the x-axes (right of each inset), where the original training samples are also shown (in blue) to assess the accuracy of reconstruction. Samples obtained from a Gaussian distribution with a unitary diagonal covariance p(z) are shown in the z-axes in orange (left of each inset), while their generator-mapped values are shown also in orange in the x-axes (right of each inset). Finally, the mapping of the data misfit function in Fig. 3b into the latent space is shown in the z-axes (right of each inset).

It is worth mentioning a few effects visible in the illustrative example of Figs. 3 and 4. First, note that increasing the variance α seems to cause the grid to be more "rigid" locally (grid lines tend to intersect more at right angles) while going through the generator, which may in turn help in preserving topology and controlling nonlinearity (e.g. compare the deformation of the grids for different values of α at fixed β), and, more importantly, in preserving the convexity of the data misfit function in the latent space (for some of the shown combinations of α and β the mapped misfit function has a single global minimum, while for others it has two minima in latent space). Also note that both α and β should be set so as to not cause a significant degradation in: (1) the reconstruction of the patterns, e.g. the cases with α = 1.0 and two of the β values show that the "eight-shape" is not completely reconstructed (seen in red samples not fully overlaying the blue samples in the x-axes), or (2) the similarity of the encoded samples to the prescribed distribution p(z), e.g. one of the cases with small α and β shows that the encoded samples (in red) deviate markedly from samples of the prescribed p(z) (in orange).

Figure 4: Mapping of a region of the latent space by the generator g(z) and mapping of the misfit function γ(x) to the latent space with different values for α and β. The first three rows (z-axes) depict the latent space, where each case shows: (left frame) orthogonal grid (black), encoded training samples (red) and generated samples (orange); (right frame) misfit function mapped in latent space (blue). The last three rows (x-axes) depict the ambient space, where each case shows: (left frame) the same grid but mapped by the generator; (right frame) training (blue), reconstructed (red) and generated samples (orange).
Among the nine combinations shown, an intermediate choice of α and β seems to provide the best results in terms of reconstruction of the patterns, generative accuracy and convexity of the misfit function in latent space.

In summary, a generator g(z) that preserves topology and keeps nonlinearity contained is the best choice for gradient-based inversion in the latent space because it preserves convexity of the objective function. Note, however, that if the topology of the probability distribution p(z) is different from the one of the real manifold M, this strategy may result in approximate manifolds M′ that do not account for all topological differences—e.g. that partially cover holes of the real one (see e.g. Fig. 1b)—and therefore might produce models that have non-accurate patterns when sampling from p(z). We argue that the two training parameters α and β of a VAE may be chosen in order for the latter issue to not be severe, i.e. the generated patterns do not deviate too much from the training patterns, while still approximately preserving convexity of the objective function in the latent space.

To test our proposed method we implement a VAE in PyTorch (Paszke et al., 2017) and use training samples cropped from a "training image" which is large enough to have many repetitions of the patterns at the cropping size—a requirement similar to that of MPS. For our synthetic case, we use a training image of 2500 × … pixels (Fig. 5).

Figure 5: (a) A 1000 × … portion of the training image used for cropping the training samples.
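A minimal sketch of such random cropping, assuming the training image is available as a 2-D NumPy array (the patch size and number of crops below are placeholders):

```python
import numpy as np

def random_crops(training_image, patch_size, n_crops, seed=None):
    """Extract n_crops random square patches from a 2-D training image."""
    rng = np.random.default_rng(seed)
    H, W = training_image.shape
    crops = np.empty((n_crops, patch_size, patch_size), dtype=training_image.dtype)
    for k in range(n_crops):
        i = rng.integers(0, H - patch_size + 1)
        j = rng.integers(0, W - patch_size + 1)
        crops[k] = training_image[i:i + patch_size, j:j + patch_size]
    return crops

# e.g. one training batch of 100 random croppings (the patch size is hypothetical):
# batch = random_crops(training_image, patch_size=128, n_crops=100)
```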
2.4. Stochastic gradient descent in the latent space

Note that even when topology is preserved and nonlinearity is contained, the data misfit function in the latent space might still present some local minima. Using our proposed VAE approach in the synthetic case study, the resulting misfit function seems to have the shape of a global basin of attraction with some local minima of lower amplitude. To deal with such remaining local minima we propose to use a stochastic gradient descent (SGD) method instead of regular gradient-based optimization.

SGD methods are commonly used in training machine learning models to cope with large datasets (e.g. Kingma and Ba, 2017), and it has also been shown that they are able to find minima that are useful in terms of generalization (Smith and Le, 2018). They essentially use an estimator for the gradient of the objective function computed only with a batch of the data. Such an estimator is used in each gradient-descent iteration and may be written for the case of inversion in the latent space as:

z_{k+1} = z_k − ℓ · ∇_z ζ(z)_k    (20)

where k denotes the iteration index, ℓ is the step size (or learning rate) and the gradient estimator ∇_z ζ(z)_k is computed by using Eq. (9) for a data batch (i.e. a subset of y) which is different for each k-th iteration but of constant size b. Relying on such an estimator makes SGD methods less likely to get trapped in local minima if a sufficiently large step is chosen. However, if such a step is too large, the optimization will have issues when it is close to the global minimum, usually seen in the form of a high misfit and oscillations in the value of the objective function. Note that, similar to other stochastic optimization methods, SGD only guarantees convergence to the global minimum with a certain probability; however, if modified in the right way for the type of problems to be solved and with its parameters chosen appropriately, such probability could be very close to one.

Recently, it has been proposed that using SGD may be seen as optimizing a smoothed version of the objective function obtained by convolving it with the gradient "noise" resulting from batching (Kleinberg et al., 2018). The degree of noise (and therefore the degree of smoothness) is controlled by the ratio of the learning rate to the batch size ℓ/b (Chaudhari and Soatto, 2018; Smith and Le, 2018). Therefore, if we choose to decrease the value of ℓ (while keeping b constant) as the optimization progresses, we might be able to achieve lower misfit values, i.e. get sufficiently close to the global minimum. This may be implemented by using:

ℓ_{k+1} = c_ℓ · ℓ_k    (21)

where a constant value of c_ℓ < 1.0 and a starting value ℓ_0 must be chosen. In practice, the method may be further improved by also decreasing the controlling (regularization) term in Eqs. (7) and (9) in order to prevent large initial steps from diverging from the region of the latent space where the manifold is defined (Bora et al., 2017). Then, similarly to ℓ, this may be done as:

λ_{k+1} = c_λ · λ_k    (22)

where again a constant c_λ < 1.0 and a starting value λ_0 must be selected.

The combined effect of simultaneously decreasing ℓ and λ is illustrated in Fig. 6 for a simple synthetic problem in a two-dimensional (d = 2) latent space R^d. The misfit term (i.e. the first term of Eq. (7)) of the synthetic problem is shown in Fig. 6a. Assuming that p(z) is a normal distribution N(0, I_d), where I_d is a d × d identity matrix, we propose a specific regularization term R(z) that will preferentially stay in the regions of higher mass (where most samples are located). This is done by radially constraining the search space by means of a χ-distribution, i.e. the regularization term is written as:

R(z) = (‖z‖ − µ_χ)²    (23)

where µ_χ is the mean of a χ-distribution with d degrees of freedom. Dashed lines in Fig. 6a denote this mean together with the lower and upper percentiles. In general, this is especially useful for higher dimensionalities, where most of the mass of a normal distribution is far from its center (Domingos, 2012). Then, Eq. (7) may be rewritten as:

ζ(z) = ‖f(g(z)) − y‖² + λ(‖z‖ − µ_χ)²    (24)

and correspondingly Eq. (9) may be expressed as:

∇_z ζ(z) = J(z)^T ∇_x γ(x) + 2λ z (1 − µ_χ/‖z‖)    (25)

As mentioned above, this gradient is often computed simply by adding a layer to the autodifferentiation of the generator. One optimization instance for a random initial model is shown in Fig. 6b, while the behavior of the misfit and ‖z‖ is shown in Fig. 6c,d. Notice the rather "noisy" inversion trajectory, but also its ability to escape local minima. The effect of decreasing ℓ is seen in Fig. 6c by the decrease of the oscillation amplitude as the optimization progresses, while the effect of decreasing λ is noticeable in Fig. 6d by the progressive shifting of ‖z‖ away from µ_χ.

Figure 6: Regularized gradient-based inversion in a synthetic two-dimensional latent space: (a) misfit (blue) and mean of the χ-distribution (black dashed) together with the lower and upper percentiles (gray dashed), (b) the same setting as (a) with an overlay of one optimization instance (trajectory in orange) for a random initial model (black '⋆'), showing also the final model (red '×') and the true model (black '+'), (c) misfit vs. iteration number, and (d) norm of z vs. iteration number.
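A minimal sketch of the latent-space SGD scheme just described (Eqs. (20)–(25)), assuming a trained PyTorch generator, a forward operator that can be evaluated on a subset (batch) of the data, and default values in the spirit of the VSbrd setting reported later; the batch selection is simplified to reshuffling at every iteration.

```python
import math
import torch

def sgd_latent_inversion(generator, forward_op, y_obs, d, n_iter=3000, batch_size=25,
                         lr0=1e-2, c_lr=0.95, lam0=10.0, c_lam=0.999, seed=0):
    """Latent-space inversion with decreasing step size (Eq. (21)) and regularization (Eq. (22))."""
    torch.manual_seed(seed)
    # Mean of a chi distribution with d degrees of freedom, mu_chi in Eq. (23).
    mu_chi = math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))
    z = torch.randn(d, requires_grad=True)          # random initial latent model
    lr, lam = lr0, lam0
    Q = y_obs.numel()
    for _ in range(n_iter):
        idx = torch.randperm(Q)[:batch_size]        # data batch for the gradient estimator
        x = generator(z)                            # g(z)
        misfit = ((forward_op(x, idx) - y_obs[idx]) ** 2).sum()
        reg = (z.norm() - mu_chi) ** 2              # ring regularizer, Eq. (23)
        loss = misfit + lam * reg                   # Eq. (24), restricted to the batch
        loss.backward()                             # backpropagation yields Eq. (25)
        with torch.no_grad():
            z -= lr * z.grad                        # Eq. (20)
        z.grad.zero_()
        lr, lam = c_lr * lr, c_lam * lam            # Eqs. (21) and (22)
    return generator(z).detach()
```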
The strategy described above and stated by Eq. (24) is generally applicable to DGMs that use an independent normal distribution as their probability distribution p(z) and whose generator is well-behaved. In this work, we consider a VAE whose training parameters β and p(ε) are chosen so that it results in a mildly nonlinear inversion for which such an SGD strategy is generally useful.

2.5. Test case: crosshole GPR traveltime tomography

To test our proposed method and compare it with a previous instance of inversion with a DGM, we consider an identical setting to that used in Laloy et al. (2019). Such setting considers a dataset of borehole ground penetrating radar (GPR) traveltime tomography. To obtain a subsurface model x ∈ R^D this method relies on contrasts of electromagnetic wave velocity, which is related to moisture content and therefore to porosity for saturated media. The tomographic array considers a transmitter antenna in one borehole and a receiver antenna in the other, each of which is moved to different positions, and a vector of measurements y ∈ R^Q is obtained by taking the traveltime of the wave's first arrival for each transmitter-receiver combination. We assume that the sensed physical domain is a 6.5 m-wide cross-borehole panel, discretized into a regular grid of D cells. Measurements are taken every 0.5 m in depth (the first being at 0.5 m and the last at 12.5 m), resulting in a dataset of Q = 625 traveltimes.
For one instance of our synthetic case, we add normal independent noise η ∼ N(0, σ² · I_Q), where σ² is the noise variance and I_Q is a 625 × 625 diagonal matrix.
Similarly to Laloy et al. (2019), we first consider a fully linear forward operator f for which raypaths are always straight, i.e. independent of the velocity spatial distribution. For this case Eq. (4) may be rewritten as:

y = F x + η    (26)

where F is a matrix of dimension Q × D in which a given row contains the length of the raypath in each cell of the model for a certain transmitter-receiver combination. The corresponding gradient of the misfit ∇_x γ(x), to be used in Eq. (25) for the solution of the inversion, is:

∇_x γ(x) = −F^T (y − F x)    (27)

We also consider the case of a more physically realistic nonlinear forward operator f (see Eq. (4)) for which raypaths are not straight. In particular, we consider a shortest path (graph) method which uses secondary nodes to improve the accuracy of the simulated traveltimes, as proposed by Giroux and Larouche (2013) and implemented in PyGIMLi (Rücker et al., 2017). For this case, when inversion with Eq. (25) is pursued, we linearize the forward operator f in order to compute the gradient:

∇_x γ(x) = −S(x)^T (y − f(x))    (28)

where S(x) is the Q × D jacobian matrix of the forward operator, whose elements are:

[S(x)]_{i,j} = ∂f_i(x)/∂x_j    (29)

The elements of the jacobian S(x) are computed by the shortest path method and also represent lengths of raypaths. In contrast to the linear case, these have to be recomputed in every iteration. Both the nonlinear forward operator and the need for recomputing the jacobian result in a higher computational cost compared to the linear operator.

The method proposed in Sec. 2.4 to perform gradient-based inversion with a VAE should work for the linear forward operator because the nonlinearity in the inverse problem arises only due to the generator g(z), which is moderate when the latter is well-behaved. However, since the considered nonlinear forward operator in Eq. (28) is only mildly nonlinear (when the contrast in velocities is not extreme), the same method may also provide good inversion results for this operator.
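As a small sanity check of Eq. (27) (and, analogously, of Eq. (28) with S(x) in place of F), the misfit gradient for the straight-ray operator reduces to two matrix products; the shapes and values below are placeholders.

```python
import torch

def linear_misfit_grad(F, x, y_obs):
    """grad_x gamma(x) = -F^T (y - F x), Eq. (27), for the linear operator of Eq. (26)."""
    return -F.T @ (y_obs - F @ x)

# Placeholder shapes: Q traveltimes, D cells.
Q, D = 625, 4096
F = torch.rand(Q, D) * 0.1          # rows hold straight-ray path lengths through each cell
x_true = torch.rand(D)              # an arbitrary model in ambient space
y_obs = F @ x_true                  # noise-free synthetic data
print(linear_misfit_grad(F, x_true, y_obs).norm())   # ~0 at the true model
```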
3. Results

As previously mentioned, our proposed method relies on a VAE whose training parameters are selected in order to improve gradient-based inversion. The training samples are the croppings detailed in Sec. 2.3, whose dimensionality is D, while the dimensionality of the latent space is set to d = 20. The probability distribution p(z) is an independent multinormal distribution N(0, I_d), with I_d an identity matrix of size 20 × 20.
The architecture of the encoder and the decoder includes 4 convolutional layers and 2 fully-connected layers, and instance normalization is used between each layer (details may be consulted in the associated code). The training parameters relevant to our proposed method are chosen in the following way: (1) β in Eq. (19) is given a value of 1000, chosen by visually assessing the generated samples, which also coincided with a moderate visual deviation from the prescribed p(z); (2) the distribution p(ε) in Eq. (14) is given the typical value of a diagonal unit variance (α = 1.0), which enforces local orthogonality while passing through the generator and therefore aids in preserving topology and controlling nonlinearity. The value of β is rather high compared to previous studies, e.g. Laloy et al. (2017) used a value of 20 for similar two-dimensional patterns, but it seems to allow slightly higher compression ratios (in our work the compression ratio is 420, compared to 200 in the mentioned study). The VAE is trained by maximizing the lower bound in Eq. (19), using batches of 100 random croppings in each iteration (a GPU was used in order to reduce training time). In the following, we test the performance of this VAE when used for our proposed SGD-based inversion with a linear and a mildly nonlinear forward operator.

3.1. Linear forward operator

In this section, we consider the linear operator in Eq. (26) and assess the performance of our proposed DGM inversion approach: using a VAE trained as above (to have a well-behaved generator) and SGD with both decreasing step size and regularization to optimize Eq. (24). We aim to show that such an approach is robust regarding its convergence to the global minimum and therefore assess its performance by using 100 different initial models. Compared to previous studies, our proposed approach involves changes in both the DGM and the optimization; therefore we compare with the base cases listed in Table 1 to show the impact of each proposed change. As denoted by the columns of this table, the different cases consider: (1) VAE and SGAN as DGMs, (2) SGD and Adam (Kingma and Ba, 2017) as stochastic optimizers, (3) data batching for computing the gradient ∇_z ζ(z), which basically means using SGD when batching and using (regular) gradient descent when not batching, (4) regularization in the latent space, with "origin" being the one proposed in Bora et al. (2017) and "ring" the one proposed herein, and (5) decreasing of the step size (or learning rate). Our proposed approach is then labeled "VSbrd". We also show the chosen values for the step size ℓ and its decreasing factor c_ℓ when applicable; for these cases the values λ = 10.0 and c_λ = 0.999 are used.
Case    DGM    Optimizer   Batching   Regularization   Decreasing step   ℓ      c_ℓ
VSnnn   VAE    SGD         no         none             no                1e-4   -
VSbnn   VAE    SGD         yes        none             no                1e-4   -
VSbod   VAE    SGD         yes        origin           yes               1e-2   0.95
VSbrd   VAE    SGD         yes        ring             yes               1e-2   0.95
SAnnn   SGAN   Adam        no         none             no                1e-2   -
SSbnd   SGAN   SGD         yes        none             yes               1e-3   0.95

Table 1: Configuration of our proposed approach (VSbrd) and the base cases for comparison.

The number of iterations for inversion is set to 3000 for all cases. When data batching is used, the batch size b is 25 (of a total of 625) and is sampled with no replacement, so that the whole dataset is used every 25 iterations (i.e. the number of epochs is 120 for a total of 120 × 25 = 3000 iterations).
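As a small illustration of this batching scheme (sampling without replacement, so that one epoch of 25 iterations covers all 625 traveltimes):

```python
import torch

Q, b = 625, 25
epoch_batches = torch.randperm(Q).split(b)   # 25 disjoint batches of 25 data indices
assert len(epoch_batches) == Q // b          # one epoch uses every traveltime exactly once
```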
Note that we compare against the approach in Laloy et al. (2019), where an SGAN is used as DGM and Adam (gradient descent with adaptive moments) is used to optimize the resulting objective function—this case is labeled "SAnnn" in Table 1. We also consider the case where we apply our proposed SGD to the same SGAN (labeled "SSbnd"). For both of these cases, instead of regularization we use stochastic clipping in the latent space (Laloy et al., 2018, 2019) because a uniform p(z) with finite support is used.

We consider 6 different truth subsurface models to assess our method and compare with the base cases: (1) a set of three models cropped directly from the training image and (2) a set of three models obtained by generating from the trained VAE. Both sets include models with three different degrees of complexity. These truth models are shown in the first row of Fig. 7, where "mc" refers to the first set, "mv" refers to the second set and the degree of complexity is denoted by a subscript, with "1" denoting the least complex and "3" the most complex. For the first set (mc), to avoid "memorizing" the croppings we exclude them (and any overlapping cropping) from the samples used to train the VAE. The second set (mv) is similar to the one used by Laloy et al. (2019) to test the performance of their setup, only in their case the models were generated from an SGAN instead of a VAE. For each one of these truths, we generate synthetic data by applying the forward operator F and use these data to perform gradient-based inversion for each case in Table 1.

Figure 7: Truth models (first row): cropped from the training image (denoted by "mc") and generated from the trained VAE (denoted by "mv"). Corresponding models resulting from encode-decode of the truth models (second row). The subindex indicates the level of complexity, with "1" being the least complex.

We first consider no added noise in the synthetic dataset, hence after inversion the data misfit should be close to zero for inverted models that are sufficiently close to the global minimum. To define a threshold for this data misfit beyond which inverted models are "accepted", we use the RMSE between these synthetic data and data obtained by applying the forward operator on models resulting from passing the truth models through the VAE's encoding-decoding (these models are shown in the second row of Fig. 7 and the corresponding values for the threshold are shown in Table 2). This is done because we found the encode-decode reconstructed models to be visually very similar to the truth models (compare first and second rows of Fig. 7) and because they also show a low model RMSE when compared to them (computed simply as the difference of pixel values between the truth model and the encode-decode model, and shown in Table 2). Once such a threshold is defined for each truth model, gradient-based inversion is run for the same 100 initial models for all cases in Table 1. Note that no convergence criteria were set in order to compare to all base cases (some cases such as "SAnnn" do not allow for easily defining such criteria), but in practice it is possible to set them for our proposed approach (VSbrd) in terms of a minimal change in either step size and/or data misfit. This also means that for some cases (including our proposed VSbrd) the 3000 iterations may not be necessary for all truths and all initial models.
Results for the number of accepted inverted models are shown in Table 3, while the corresponding mean of the misfit (expressed as RMSE) over the 100 inversions is shown in Table 4.

As seen in Table 3, given our defined threshold: (1) the cases where the VAE and SGD with decreasing step were used (VSbod and VSbrd) resulted in all inverted models being accepted, (2) the cases where the SGAN was used (SAnnn and SSbnd) resulted in almost all models being rejected (only two models accepted for SAnnn with truth mv1), and (3) the cases where the VAE was used with a non-decreasing step size (VSnnn and VSbnn) resulted in some inverted models being accepted. Note also that using SGD (data batching) without a decreasing step size (VSbnn) results in fewer accepted models compared to GD (VSnnn), highlighting the importance of our proposed decreasing step size and regularization. As shown in Table 4, a higher mean RMSE is related to a lower number of accepted models. Furthermore, it is worth highlighting in Table 4 the general improvement for inversion with the SGAN brought by our proposed SGD compared to Adam.
       VSnnn  VSbnn  VSbod  VSbrd  SAnnn  SSbnd  VSbrd (noise)
mc1       88     35    100    100      0      0     100
mc2       96     50    100    100      0      0     100
mc3       92     58    100    100      0      0     100
mv1      100     75    100    100      2      0     100
mv2       64     54    100    100      0      0     100
mv3       91     71    100    100      0      0     100

Table 3: Number of accepted inversions (using 100 different initial models) according to the defined threshold.
Table 4: Mean RMSE (ns) of the inversions using 100 different initial models and the defined threshold for accepting models (columns: VSnnn, VSbnn, VSbod, VSbrd, SAnnn, SSbnd, threshold and VSbrd with noise; rows: mc1–mc3 and mv1–mv3).
Examples of inverted models for one of the mc truths are shown in Fig. 8. Here, the truth model is shown in Fig. 8a, while Fig. 8b shows one example of an accepted model for the cases that have at least one (VSnnn, VSbnn, VSbod and VSbrd). Similarly, Fig. 8c shows one example of a rejected model for the applicable cases (VSnnn, VSbnn, SAnnn and SSbnd). Finally, the corresponding data RMSE vs. iteration number plots are shown in Fig. 8d (in blue for accepted models and in red for rejected ones) and the corresponding model RMSE plots are shown in Fig. 8e. Note both the higher similarity with the truth model (i.e. the low model RMSE; compare the models in Fig. 8b,c with those in Fig. 8a) and the lower data RMSE for the accepted models. Examples of inverted models obtained with our proposed approach (VSbrd) for all the truths are shown in Fig. 9b, together with plots of RMSE vs. iteration number (Fig. 9d) and of the norm of z vs. iteration number (Fig. 9e). For the cropped truths (mc), the visual similarity decreases and the final data RMSE of the inverted models increases as the complexity increases, whereas for the generated truths (mv) both appear independent of the complexity. Notice the overshoot in ‖z‖ during the initial iterations and its eventual convergence close to µ_χ as defined in Eq. (23).
To study the effect of noise for our proposed approach (VSbrd), we added noise with a standard deviation σ = 0.25 ns to the synthetic traveltime data. The corresponding results are shown in the rightmost column of Tables 3 and 4 and in Fig. 9c (with the corresponding data RMSE and ‖z‖ plots in Fig. 9d,e). The threshold in this case is set equal to the one for the noise-free case plus σ, and with this threshold all inverted models obtained with our proposed approach are accepted. It is also worth noticing the relative robustness of the method to noise: the corresponding mean misfit values in Table 4 indicate no significant overfitting, i.e. they remain close to the noise-free threshold plus σ even though no traditional regularization was used. This means that optimizing in the latent space of the DGM effectively constrains the inverted models to display the prescribed patterns.
After showing that our proposed method works with the linear forward operator for the synthetic case considered, we now test its performance with a nonlinear forward operator. For inversion, the general form of Eq. (4) is used and the gradient in the latent space given in Eq. (25) is computed using Eq. (28). As mentioned in Sec. 2.5, we consider a shortest-path method to solve for the traveltimes, for which we use 3 secondary nodes added to the edges of the velocity grid. Note that the Jacobian S(x) in Eq. (28) has to be recomputed at every iteration.
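As an illustration of how this gradient may be assembled in practice, the sketch below combines the externally computed Jacobian S(x) with automatic differentiation through the generator, assuming a least-squares misfit and hypothetical forward/jacobian routines wrapping the raytracer; the actual implementation may differ.

import numpy as np
import torch

def latent_gradient(vae, z, d_obs, forward, jacobian):
    # forward(x) returns simulated traveltimes and jacobian(x) the sensitivity matrix S(x),
    # e.g. from a shortest-path raytracer; both names are placeholders, not the released API.
    z = z.detach().requires_grad_(True)
    x = vae.decode(z)                                   # generated model g(z)
    x_np = x.detach().cpu().numpy().ravel()
    r = forward(x_np) - d_obs                           # data residual F(g(z)) - d
    S = jacobian(x_np)                                  # S(x), recomputed at every iteration
    grad_x = torch.as_tensor(S.T @ r, dtype=x.dtype).reshape(x.shape)
    x.backward(gradient=grad_x)                         # vector-Jacobian product through the generator
    return z.grad                                       # gradient of the least-squares misfit w.r.t. z

Only a vector-Jacobian product is needed on the network side, so the cost per iteration is dominated by the raytracing and the assembly of S(x).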
Given the higher computational demand for inversion with the nonlinear forward operator, and since VSbrd was already shown to be the best performing approach for the linear forward operator, we only test our proposed approach with all the truths and for a single initial model (Fig. 10).
Figure 8: Examples of inverted models for one of the mc truths for all cases in Table 1: (a) truth model, (b) accepted models according to the defined threshold, (c) rejected models, (d) data RMSE vs. iteration plots (blue for accepted models and red for rejected models; the dashed line indicates the defined threshold) and (e) model RMSE vs. iteration plots (the dashed line indicates the model RMSE of the encode-decode operation).
Figure 9: Examples of gradient-based inversion using our proposed approach (VSbrd) for all truth models and the linear forward operator: (a) truth models, (b) inverted models with no added noise, (c) inverted models with added noise, (d) RMSE vs. iteration plots (no-noise case in dark blue and noise case in light blue; the lower dashed line indicates the defined threshold while the upper dashed line is the threshold plus σ = 0.25 ns), and (e) norm of z vs. iteration plots (no-noise case in dark green and noise case in light green).
As in the linear operator scenario, we consider both noise-free data and data with added noise of standard deviation σ = 0.25 ns. We again select values for the required inversion parameters ℓ, c_ℓ, λ and c_λ (with a different c_ℓ compared to the linear case), but the decrease in Eq. (21) is only applied every 5 iterations. This may cause the method to converge to the global minimum with lower probability; however, this probability seems to remain high enough since all of the models inverted with no added noise are very similar to the truth models. Also, using the threshold obtained by encoding-decoding the truth models (now computed with the nonlinear forward operator), all inverted models are accepted (these models are shown in Fig. 10b). When considering added noise, the results are similar but the inversion seems to converge to the global minimum with a slightly lower probability (6 out of 8 inversions are accepted); the accepted models are shown in Fig. 10c. The behavior of the misfit during optimization (Fig. 10d) is similar to the linear case, although oscillations of a slightly higher amplitude are still visible in the last iterations (mainly due to the lower number of iterations). To partially address this issue, we take as the inverted model the one with the lowest misfit rather than the one from the final iteration (these are the models shown in Fig. 10b,c). The plot of the norm of z vs. iterations in Fig. 10e shows a behavior similar to the linear case, although there seem to be more oscillations in ‖z‖ during the initial iterations.
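The behavior described above (large early oscillations of the misfit, an initial overshoot of ‖z‖ and its eventual convergence close to µ_χ) can be summarized with the following sketch of the latent-space update. The exact step-size and regularization schedules of Eqs. (21) and (23) are not reproduced here, so both the penalty form and the multiplicative decay are illustrative assumptions.

import torch

def latent_sgd_step(z, grad_misfit, step, lam, mu_chi):
    # One illustrative latent-space update: a gradient step on the (mini-batch) misfit plus a
    # penalty pulling ||z|| toward mu_chi; only a sketch of the kind of update described above.
    norm_z = torch.norm(z)
    grad_reg = (norm_z - mu_chi) * z / norm_z      # gradient of 0.5 * (||z|| - mu_chi)**2
    return z - step * (grad_misfit + lam * grad_reg)

# A possible outer loop (c_step < 1 shrinks the step size so the early oscillations die out;
# lam and c_lam play the analogous role for the regularization weight):
# for k in range(n_iter):
#     z = latent_sgd_step(z, misfit_grad(z, batch(k)), step, lam, mu_chi)
#     step, lam = step * c_step, lam * c_lam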
4. Discussion
When training the VAE for our considered synthetic case, only the selection of β was done, while the variance of p(ε) was not changed (a unit variance α = 1.0 was used). However, as noted in Sec. 2.3 and in Fig. 4, the variance α also has an impact on (gradient-based) inversion, since it may be used to control the generator's nonlinearity and the changes in topology it induces.
In order to select the SGD parameters in our proposed approach, we suggest looking jointly at the behavior of the misfit and of the norm of z. For instance, if a certain number of iterations is desired for computational reasons, we suggest first choosing ℓ and c_ℓ so that the misfit behaves as in Fig. 9d, i.e. oscillations of high amplitude at the beginning followed by a progressive attenuation such that the oscillations become negligible at the end. Note, however, that the inversion may have to be run a few times because divergence may occur during the initial iterations (this is easily seen when ‖z‖ takes values far from µ_χ). Once ℓ and c_ℓ are chosen, the selection of λ and c_λ is done only to prevent divergence, which may be achieved by looking for a behavior similar to that in Fig. 9e. An initial overshoot in ‖z‖ is normal (and even necessary) since the method is then exploring the latent space more rapidly; however, ‖z‖ should eventually converge to a value close to µ_χ.
Figure 10: Examples of gradient-based inversion using our proposed approach (VSbrd) for all truth models and the nonlinear forward operator: (a) truth models, (b) inverted models with no added noise, (c) inverted models with added noise, (d) RMSE vs. iteration plots (no-noise case in dark blue and noise case in light blue; the lower dashed line indicates the defined threshold while the upper dashed line is the threshold plus σ = 0.25 ns), and (e) norm of z vs. iteration plots (no-noise case in dark green and noise case in light green).
The results for gradient-based inversion using our proposed approach point to a (possible) conflict between the accuracy of the generated patterns and the feasibility of gradient-based inversion, which may be addressed by an appropriate choice of α and β while training a VAE in order to improve the performance of gradient-based inversion. We empirically prove the validity of this statement for our case study with specific values of α and β. In general (for inversion with DGMs), this implies that a tradeoff between generative accuracy and a well-behaved generator may be found. The latter statement also supports our assumption regarding the "holes" of the real manifold for the case of channel patterns (as mentioned in Section 2.3): when approximating the real manifold using a VAE with a well-behaved generator, the approximate manifold will tend to fill the holes and therefore produce breaking channels. While the generator's nonlinearity was already identified by Laloy et al. (2019) as a potential factor hindering gradient-based inversion, its causes (the curvature and topology of the real manifold) and the possible induced changes in topology have not, to the authors' knowledge, been previously discussed as factors degrading the performance of gradient-based inversion in the latent space. In general, the good performance of DNNs for some tasks is usually associated with their ability to change topology (Naitzat et al., 2020).
However, when one wants to use the latent variables or codes of DGMs for further tasks and not just for generation, these changes in topology may become an issue. For instance, we interpret the misfit "jumps" seen in gradient-based inversion with the SGAN (as seen in the data RMSE plots of Fig. 8 for case SAnnn) as resulting from the "gluing" or "collapsing", in latent space, of holes in the real manifold, caused either by an induced change in topology or by a high nonlinearity of the SGAN generator. Some studies have even suggested that, if one wants to obtain useful geometric interpretations in the latent space (e.g. to perform interpolation), the activation functions should be restricted to smooth ones (Shao et al., 2017; Arvanitidis et al., 2018), which means, for example, avoiding the ReLU activation function that is generally recognized to result in faster learning. In contrast, in this work we do use ReLU activation functions but control the changes in topology by means of a combination of α and β; whether this nullifies the advantages of ReLU remains an open question. Note, however, that controlling induced changes in topology and high nonlinearities (as in our proposed approach) may in general be useful for any inversion method that relies on the concept of a neighborhood (e.g. MCMC and ensemble smoothers).
Besides its good performance for gradient-based inversion, a further advantage of our approach compared to previous approaches is that, when the data used for inversion are not sufficiently informative, regularization in the latent space may be used to constrain the solution to the most common patterns through our regularization term in Eq. (23). This provides an interesting paradigm in which regularization in the latent space may be seen as a flexible way to incorporate complex prior constraints. In contrast, a disadvantage of our proposed approach is that GANs generally achieve a higher generative accuracy (all generated patterns look more similar to those in the training image); however, as previously mentioned, this may conflict with the performance of certain inversion methods. Also, as may be noticed from the relation between the data misfit and the degree of complexity for the cropped truths, a limiting factor of our VAE is its inability to produce new, highly complex patterns. However, this lack of innovation (or sample diversity) is generally present in other methods and may be even more severe for regular GANs, where the phenomenon is known as mode collapse. Recently, different ways to control mode collapse in VAEs and GANs have been proposed (Metz et al., 2017; Salimans et al., 2016).
Finally, regarding our proposed SGD, we note that similar results might be obtained with an MCMC method that takes gradient information into account. For example, Mosser et al. (2018) use a Metropolis-adjusted Langevin method, which essentially follows a gradient-descent step and adds noise to it. However, the noise added to the gradient step in our approach is different: SGD noise has been shown to be approximately constant but anisotropic (Chaudhari and Soatto, 2018). Another possible alternative to our method is Riemannian optimization, which is possible when the approximate manifold of the DGM is smooth. Although it is possible to compute the direction of the gradient by using the pullback Riemannian metric, which may be obtained as suggested by e.g. Shao et al. (2017), Chen et al. (2018) and Arvanitidis et al. (2018), it is not straightforward to compute the step because it would have to follow a geodesic curve instead of a straight path, and such geodesics are computationally demanding to compute.
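For reference, a minimal sketch of how such a pullback metric could be obtained with automatic differentiation is given below, assuming a differentiable generator exposed as a hypothetical decoder function mapping a latent vector to a flattened model; computing the subsequent geodesic step is the expensive part and is not shown.

import torch
from torch.autograd.functional import jacobian

def pullback_metric(decoder, z):
    # Pullback of the Euclidean pixel-space metric onto the latent space: M(z) = J_g(z)^T J_g(z),
    # with J_g the Jacobian of the generator; decoder is a hypothetical stand-in for g.
    J = jacobian(lambda zz: decoder(zz).flatten(), z)   # shape (n_pixels, dim_z)
    return J.T @ J                                      # (dim_z, dim_z) metric tensor

The Riemannian (natural) gradient direction would then be obtained by solving M(z) v = ∇φ(z), but following it exactly still requires integrating along a geodesic rather than taking a straight step.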
5. Conclusions
In this work the principal difficulties of performing inversion with deep generative models (DGMs) are reviewed and a conflict between the accuracy of the generated patterns and the feasibility of gradient-based inversion is identified. An approach based on a variational autoencoder (VAE) as DGM and a modified stochastic gradient descent method for optimization is proposed to address this conflict. We show that two training parameters of the VAE (the weight factor β and the variance α of the encoder's noise distribution p(ε)) may be chosen in order to obtain a well-behaved generator g(z), i.e. one that is mildly nonlinear and approximately preserves topology when mapping from latent space to ambient space. This helps in maintaining the convexity of the misfit function in the latent space and therefore improves the behavior of gradient-based inversion. We highlight changes in topology, which have not been previously identified as impacting the convexity of the inversion objective function. In contrast to prior studies where gradient-based inversion was used, our approach converges to the neighborhood of the global minimum with very high probability for both a linear forward operator and a mildly nonlinear forward operator, with and without noise. We argue that when using DGMs in inversion, a tradeoff may be found where inverted models are close enough to the prescribed patterns while low-cost gradient-based inversion is still applicable; our proposed approach relies on such a tradeoff and produces inverted models with significant similarity to the training patterns and a low data misfit.
Acknowledgements
This work has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement number 722028 (ENIGMA ITN).
Computer code availability
All code necessary to reproduce the test case is available at: https://github.com/jlalvis/VAE_SGD.
References
Arvanitidis, G., Hansen, L. K., Hauberg, S., Jan. 2018. Latent Space Oddity: on the Curvature of Deep Generative Models. arXiv:1710.11379 [stat]. URL http://arxiv.org/abs/1710.11379
Aster, R., Borchers, B., Thurber, C., 2013. Parameter Estimation and Inverse Problems, 2nd Edition. Academic Press.
Bora, A., Jalal, A., Price, E., Dimakis, A. G., Mar. 2017. Compressed Sensing using Generative Models. arXiv:1703.03208 [cs, math, stat]. URL http://arxiv.org/abs/1703.03208
Canchumuni, S. W., Emerick, A. A., Pacheco, M. A. C., Jul. 2019. Towards a robust parameterization for conditioning facies models using deep variational autoencoders and ensemble smoother. Computers & Geosciences 128, 87–102. URL https://linkinghub.elsevier.com/retrieve/pii/S0098300419300378
Caterina, D., Hermans, T., Nguyen, F., Aug. 2014. Case studies of incorporation of prior information in electrical resistivity tomography: comparison of different approaches. Near Surface Geophysics 12, 451–465. URL http://nsg.eage.org/publication/publicationdetails/?publication=76904
Chaudhari, P., Soatto, S., Jan. 2018. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv:1710.11029 [cond-mat, stat]. URL http://arxiv.org/abs/1710.11029
Chen, N., Klushyn, A., Kurle, R., Jiang, X., Bayer, J., van der Smagt, P., Feb. 2018. Metrics for Deep Generative Models. arXiv:1711.01204 [cs, stat]. URL http://arxiv.org/abs/1711.01204
Domingos, P., Oct. 2012. A few useful things to know about machine learning. Communications of the ACM 55 (10), 78. URL http://dl.acm.org/citation.cfm?doid=2347736.2347755
Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., Cohen, T. S., Jul. 2018. Explorations in Homeomorphic Variational Auto-Encoding. arXiv:1807.04689 [cs, stat]. URL http://arxiv.org/abs/1807.04689
Fefferman, C., Mitter, S., Narayanan, H., Feb. 2016. Testing the manifold hypothesis. Journal of the American Mathematical Society 29 (4), 983–1049.
Giroux, B., Larouche, B., Apr. 2013. Task-parallel implementation of 3D shortest path raytracing for geophysical applications. Computers & Geosciences 54, 130–141. URL https://linkinghub.elsevier.com/retrieve/pii/S0098300412004128
González, E. F., Mukerji, T., Mavko, G., Jan. 2008. Seismic inversion combining rock physics and multiple-point geostatistics. GEOPHYSICS 73 (1), R11–R21. URL http://library.seg.org/doi/10.1190/1.2803748
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., Jun. 2014. Generative Adversarial Networks. arXiv:1406.2661 [cs, stat]. URL http://arxiv.org/abs/1406.2661
Hand, P., Voroninski, V., Dec. 2018. Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk. arXiv:1705.07576 [cs, math]. URL http://arxiv.org/abs/1705.07576
Hansen, T. M., Cordua, K. S., Mosegaard, K., Jun. 2012. Inverse problems with non-trivial priors: efficient solution through sequential Gibbs sampling. Computational Geosciences 16 (3), 593–611. URL http://link.springer.com/10.1007/s10596-011-9271-1
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A., 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework.
Jetchev, N., Bergmann, U., Vollgraf, R., Sep. 2017. Texture Synthesis with Spatial Generative Adversarial Networks. arXiv:1611.08207 [cs, stat]. URL http://arxiv.org/abs/1611.08207
Kim, J., Zhang, B.-T., Jan. 2019. Data Interpolations in Deep Generative Models under Non-Simply-Connected Manifold Topology. arXiv:1901.08553 [cs, stat]. URL http://arxiv.org/abs/1901.08553
Kingma, D. P., Ba, J., Jan. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs]. URL http://arxiv.org/abs/1412.6980
Kingma, D. P., Welling, M., May 2014. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat]. URL http://arxiv.org/abs/1312.6114
Kleinberg, R., Li, Y., Yuan, Y., Aug. 2018. An Alternative View: When Does SGD Escape Local Minima? arXiv:1802.06175 [cs]. URL http://arxiv.org/abs/1802.06175
Laloy, E., Hérault, R., Jacques, D., Linde, N., Jan. 2018. Training-Image Based Geostatistical Inversion Using a Spatial Generative Adversarial Neural Network. Water Resources Research 54 (1), 381–406. URL http://doi.wiley.com/10.1002/2017WR022148
Laloy, E., Hérault, R., Lee, J., Jacques, D., Linde, N., Dec. 2017. Inversion using a new low-dimensional representation of complex binary geological media based on a deep neural network. Advances in Water Resources 110, 387–405. URL https://linkinghub.elsevier.com/retrieve/pii/S0309170817306243
Laloy, E., Linde, N., Ruffino, C., Hérault, R., Gasso, G., Jacques, D., Dec. 2019. Gradient-based deterministic inversion of geophysical data with generative adversarial networks: Is it feasible? Computers & Geosciences 133, 104333. URL https://linkinghub.elsevier.com/retrieve/pii/S009830041831207X
Lange, K., Frydendall, J., Cordua, K. S., Hansen, T. M., Melnikova, Y., Mosegaard, K., Oct. 2012. A Frequency Matching Method: Solving Inverse Problems by Use of Geologically Realistic Prior Information. Mathematical Geosciences 44 (7), 783–803. URL http://link.springer.com/10.1007/s11004-012-9417-2
Linde, N., Renard, P., Mukerji, T., Caers, J., Dec. 2015. Geological realism in hydrogeological and geophysical inverse modeling: A review. Advances in Water Resources 86, 86–101.
Mariethoz, G., Renard, P., Straubhaar, J., Nov. 2010. The Direct Sampling method to perform multiple-point geostatistical simulations. Water Resources Research 46 (11). URL http://doi.wiley.com/10.1029/2008WR007621
Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J., May 2017. Unrolled Generative Adversarial Networks. arXiv:1611.02163 [cs, stat]. URL http://arxiv.org/abs/1611.02163
Mo, S., Zabaras, N., Shi, X., Wu, J., Feb. 2020. Integration of Adversarial Autoencoders With Residual Dense Convolutional Networks for Estimation of Non-Gaussian Hydraulic Conductivities. Water Resources Research 56 (2). URL https://onlinelibrary.wiley.com/doi/abs/10.1029/2019WR026082
Mosser, L., Dubrule, O., Blunt, M. J., Jun. 2018. Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. arXiv:1806.03720 [physics, stat]. URL http://arxiv.org/abs/1806.03720
Naitzat, G., Zhitnikov, A., Lim, L.-H., Apr. 2020. Topology of deep neural networks. arXiv:2004.06093 [cs, math, stat]. URL http://arxiv.org/abs/2004.06093
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in PyTorch.
Radford, A., Metz, L., Chintala, S., Jan. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs]. URL http://arxiv.org/abs/1511.06434
Richardson, A., Jun. 2018. Generative Adversarial Networks for Model Order Reduction in Seismic Full-Waveform Inversion. arXiv:1806.00828 [physics]. URL http://arxiv.org/abs/1806.00828
Rolinek, M., Zietlow, D., Martius, G., Apr. 2019. Variational Autoencoders Pursue PCA Directions (by Accident). arXiv:1812.06775 [cs, stat]. URL http://arxiv.org/abs/1812.06775
Rücker, C., Günther, T., Wagner, F. M., Dec. 2017. pyGIMLi: An open-source library for modelling and inversion in geophysics. Computers & Geosciences 109, 106–123.
Salakhutdinov, R., Apr. 2015. Learning Deep Generative Models. Annual Review of Statistics and Its Application 2 (1), 361–385.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., Jun. 2016. Improved Techniques for Training GANs. arXiv:1606.03498 [cs]. URL http://arxiv.org/abs/1606.03498
Seo, J. K., Kim, K. C., Jargal, A., Lee, K., Harrach, B., Jan. 2019. A Learning-Based Method for Solving Ill-Posed Nonlinear Inverse Problems: A Simulation Study of Lung EIT. SIAM Journal on Imaging Sciences 12 (3), 1275–1295. URL https://epubs.siam.org/doi/10.1137/18M1222600
Shao, H., Kumar, A., Fletcher, P. T., Nov. 2017. The Riemannian Geometry of Deep Generative Models. arXiv:1711.08014 [cs, stat]. URL http://arxiv.org/abs/1711.08014
Smith, S. L., Le, Q. V., Feb. 2018. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv:1710.06451 [cs, stat]. URL http://arxiv.org/abs/1710.06451
Strebelle, S., 2002. Conditional simulation of complex geological structures using multiple-point statistics. Mathematical Geology 34 (1), 1–21.
Tikhonov, A. N., Arsenin, V. I. A., 1977. Solutions of ill-posed problems. Winston. URL https://books.google.be/books?id=ECrvAAAAMAAJ
Zahner, T., Lochbühler, T., Mariethoz, G., Linde, N., Feb. 2016. Image synthesis with graph cuts: a fast model proposal mechanism in probabilistic inversion. Geophysical Journal International 204 (2), 1179–1190. URL https://academic.oup.com/gji/article-lookup/doi/10.1093/gji/ggv517
Zhang, C., Butepage, J., Kjellstrom, H., Mandt, S., Oct. 2018. Advances in Variational Inference. arXiv:1711.05597 [cs, stat]. URL http://arxiv.org/abs/1711.05597