Numerical analysis of a deep learning formulation of elastic full waveform inversion with high-order total variation regularization in different parameterizations
Tianze Zhang*, Jian Sun*†, Kristopher A. Innanen*, Daniel Trad*
*University of Calgary, Department of Geoscience, Calgary, Alberta, Canada.
†Pennsylvania State University, College of Earth and Mineral Science, University Park, Pennsylvania, USA.
(January 25, 2021)
ABSTRACT
We have formulated elastic seismic full waveform inversion (FWI) within a deep learning environment. Training this network with a single seismic data set is equivalent to carrying out elastic FWI. There are three main motivations for this effort. The first is an interest in developing an inversion algorithm which is more or less equivalent to standard elastic FWI but which is ready-made for a range of cloud computational architectures. The second is an interest in algorithms which can later (i.e., not in this paper) be coupled with a more involved training component, involving multiple datasets. The third is a general interest in developing the idea of theory-guiding within machine learning solutions for large geophysical problems, wherein the number of degrees of freedom within a network, and the reliance on exhaustive training data, are both reduced by constraining the network with physical rules. In our formulation, a recurrent neural network is set up with rules enforcing elastic wave propagation, with the wavefield projected onto a measurement surface acting as the synthetic data to be compared with observed seismic data. Gradients for iterative updating of an elastic model, with a variety of parameterizations and misfit functionals, can be efficiently constructed within the network through automatic differentiation. With this method, inversions based on complex misfits can be calculated. We survey the impact of different misfits (based on the $l_1$ and $l_2$ norms) with high-order total variation (TV) regularization on multiparameter elastic FWI recovery of models within velocity/density, modulus/density, and stiffness-parameter/density parameterizations. We analyze parameter cross-talk. Inversion results on simple and complex models show that RNN elastic FWI with high-order TV regularization using the $l_1$ norm can help mitigate cross-talk issues with gradient-based optimization methods.

INTRODUCTION
It has recently been shown (Sun et al., 2020) that seismic wave propagation can be simulated with a specialized recurrent neural network (RNN), and that the process of training such a network with a single seismic data set is equivalent to carrying out seismic full waveform inversion (FWI). We are motivated to extend and expand on these results, because of (1) the apparent potential for wider training of such a network to combat common FWI issues such as modelling error; (2) the opportunities for efficient computation (e.g., cloud) offered by FWI realized on platforms like TensorFlow; and (3) an interest in the behavior of more complex RNN-FWI formulations than previously analyzed, e.g., multi-parameter elastic FWI. In this paper we report on progress on the third of these.

The application of machine learning methods to seismic problems has been underway for decades; for example, Röth and Tarantola (1994) presented a neural network-based inversion of time-domain seismic data for acoustic velocity depth-profiles. However, the evolution of these network approaches into deep learning methods, and the results which have subsequently become possible, make many aspects of the discipline quite new, and explain the major recent surge in development and interest. Now, novel seismic applications are being reported in fault detection, denoising, reservoir prediction and velocity inversion (e.g., Jin et al., 2019; Zheng et al., 2019; Peters et al., 2019; Chen et al., 2019; Li et al., 2019a; Smith et al., 2019; Shustak and Landa, 2018). Specifically in seismic velocity inversion, Sun and Demanet (2018) applied a deep learning method to the problem of bandwidth extension; Zhang et al. (2019) designed an end-to-end framework, the VelocityGAN, which generates velocity images directly from the raw seismic waveform data; Wu and Lin (2018) trained a network with an encoder-decoder structure to model the correspondence between seismic data and subsurface velocity structures; and Yang and Ma (2019) investigated a supervised, deep convolutional neural network for velocity-model building directly from raw seismograms.

The above examples are purely data-driven, in the sense that they involve no assumed theoretical/physical relationships between the input layer (e.g., velocity model) and output layer (e.g., seismic data). The advantage of purely data-driven methods is speed: once a network has been trained to perform inversion, taking data-driven training of an FWI network as an example, raw seismograms can be mapped directly to velocity models, without the iterations required by conventional inversion methods. However, seismic inversion is a sophisticated problem, and choosing a set of training data large and varied enough to represent such complex wave physics features and their corresponding velocity models is itself a hard problem.

Here we distinguish between such methods and those belonging to the theory-guided
AI paradigm (e.g., Wagner and Rondinelli, 2016; Wu et al., 2018; Karpatne et al., 2017). Theory-guided deep learning networks are, broadly, those which enforce physical rules on the relationships between variables and/or layers. This may occur through the training of a standard network with simulated or synthetic data, which are created using physical rules, or by holding some weights in a network, chosen to force the network to mimic a physical process, fixed and un-trainable. Theory-guiding was explicitly used in the former sense by Downton and Hampson (2018), in which a network designed to predict well log properties from seismic amplitudes was trained not only with real data but with synthetics derived from the Zoeppritz equations. Bar-Sinai et al. (2019) and Raissi (2018) built deep convolutional neural networks (or CNNs) to solve partial differential equations, i.e., explicitly using theoretical models within the design, which is an example of theory-guiding in the latter sense. Sun et al. (2020), in the work we extend in this paper, similarly set up a deep recurrent neural network to simulate the propagation of a seismic wave through a general acoustic medium. The network is set up in such a way that the trainable weights correspond to the medium property unknowns (i.e., the wave velocity model), and the non-trainable weights correspond to the mathematical rules (differencing, etc.) enforcing wave propagation. The output layer was the wave field projected onto a measurement surface, i.e., a simulation of measured seismic data. The training of the Sun et al. (2020) network with a single data set was shown to be an instance of full waveform inversion.

Conventional (i.e., not network-based) seismic full waveform inversion, or FWI, is a complex data-fitting procedure aimed at extracting important information from seismic records. Key ingredients are an efficient forward modeling method to simulate synthetic data, and a local differential approach, in which the gradient and search direction are calculated to update the model. When updating several parameters simultaneously, as in multiparameter full waveform inversion, error in one parameter tends to produce updates in others, a phenomenon referred to as inter-parameter trade-off or cross-talk (e.g., Kamei and Pratt, 2013; Alkhalifah and Plessix, 2014; Innanen, 2014; Pan et al., 2018; Keating and Innanen, 2019). The degree of trade-off or cross-talk between parameters can depend sensitively on the specific parameterization; even though, for instance, $\lambda$, $\mu$ and $\rho$ have the same information content as $V_P$, $V_S$, and $\rho$, updating in one can produce markedly different convergence properties and final models than doing so in the other. The coupling between different physical parameters is controlled by the physical relationships amongst these parameters, and the relationships between the parameters and the wave which interacts with them. Tarantola (1986), by using the scattering or radiation pattern, systematically analyzed the effect of different model parameterizations on isotropic FWI. He suggested that the greater the difference in the scattering pattern between each parameter, the better the parameters can be resolved. Köhn et al.
(2012) showed that across all geophysical parameterizations, within isotropic-elastic FWI, density is always the most challenging to resolve (e.g., Tarantola, 1984; Plessix, 2006; Kamath et al., 2018; Lailly, 1983; Operto et al., 2013; Oh and Alkhalifah, 2016; Pan et al., 2016; Keating and Innanen, 2017).

In RNN-based FWI, then, a natural step is the extension of the Sun et al. (2020) result, which involves a scalar-acoustic formulation of FWI, to a multi-parameter elastic version. Cells within a deep recurrent neural network are designed such that the propagation of information through the network is equivalent to the propagation of an elastic wavefield through a 2D heterogeneous and isotropic-elastic medium; the network is equipped to explore a range of parameterizations and misfit functionals for training based on seismic data. As with the acoustic network, the output layer is a projection of the wave field onto some defined measurement surface, simulating measured data.

In addition to providing a framework for inversion methods which mixes the features of FWI with the training capacity of a deep learning network, this approach also allows for efficient calculation of the derivatives of the residual through automatic differentiation (AD) (e.g., Li et al., 2019b; Sambridge et al., 2007), an engine for which is provided by the open-source machine learning library PyTorch (Paszke et al., 2017). Our recurrent neural network is designed using this library. The forward simulation of wave propagation is represented by a Dynamic Computational Graph (DCG), which records how each internal parameter is calculated from all previous ones. In the inversion, the partial derivative of the residual with respect to any parameter (within a trainable class) is computed by (1) backpropagating within the computational graph to that parameter, and (2) calculating the partial derivative along the path using the chain rule.

This paper is organized as follows. First, we introduce the basic structure of the recurrent neural network and how the gradients can be calculated using the backpropagation method. Second, we demonstrate how the elastic FWI RNN cell is constructed and how the wavefields propagate in the RNN cells. Third, we explain the $l_1$ and $l_2$ misfits with high-order TV regularization and the mathematical expressions for the gradients based on these misfits. Fourth, we use simple layered models and complex overthrust models to perform inversions with various parameterizations using different misfits. Finally, we discuss the conclusions of this work.

RECURRENT NEURAL NETWORK
A recurrent neural network (RNN) is a machine learning network suitable for dealing with data which have some sequential meaning. Within such a network, the information generated in one cell layer can be stored and used by the next layer. This design has natural applicability in the processing and interpretation of the time evolution of physical processes (Sun et al., 2020), and time-series data in general; examples include language modeling (Mikolov, 2012), speech recognition (Graves et al., 2013), and machine translation (Kalchbrenner and Blunsom, 2013; Zaremba et al., 2014).

[Figure 1 about here.]

[Figure 2 about here.]

Figure 1 illustrates the forward propagation of information through an example RNN, in which each RNN cell represents an instant in time: $O = [O_1, O_2, O_3, \cdots]$ are the internal variables in each RNN cell; $S = [S_1, S_2]$ are the inputs at each time step; $P = [P_1, P_2]$ are the outputs; $L = [L_1, L_2]$ are the labeled data; and $W = [W_1, W_2]$ are the trainable parameters at each time step. Mathematical operations relating the internal variables within a cell, and those relating internal variables across adjacent cells, are represented as arrows.

To train this network is to select trainable weights such that the labeled data $L$ and RNN output $P$ are as close as possible. Specifically, in training we determine the parameters $W$ through a gradient-based optimization involving the partial derivatives of the residual with respect to each $W_i$. These derivatives are determined through backpropagation, which is a repeated application of the chain rule, organized to resemble flow in the reverse direction along the arrows in the network. For the example RNN in Figure 1, this takes the form illustrated in Figure 2, in which the sequence $R = [R_1, R_2]$ represents the residuals at each time step. Within this example, to calculate the partial derivative of $R_2$ with respect to $W_2$, we backpropagate from node $R_2$ to $W_2$:
$$\frac{\partial R_2}{\partial W_2} = \frac{\partial R_2}{\partial P_2}\frac{\partial P_2}{\partial W_2} = \frac{\partial R_2}{\partial P_2}\frac{\partial P_2}{\partial O_4}\frac{\partial O_4}{\partial W_2}, \tag{1}$$
and to calculate the partial derivative of $R_2$ with respect to $W_1$, we backpropagate from $R_2$ all the way to $W_1$:
$$\frac{\partial R_2}{\partial W_1} = \frac{\partial R_2}{\partial P_2}\frac{\partial P_2}{\partial W_1} = \frac{\partial R_2}{\partial P_2}\frac{\partial P_2}{\partial O_4}\frac{\partial O_4}{\partial O_3}\frac{\partial O_3}{\partial O_2}\frac{\partial O_2}{\partial W_1}. \tag{2}$$
If the RNN is set up to propagate through two time steps, the gradient for $W_1$ is the sum of the contributions backpropagated from both $R_1$ and $R_2$. Real RNNs are more complex and involve propagation through larger numbers of time steps, but all are optimized through a process similar to this. These derivatives are the basis for gradients of the optimization misfit function; using them, the $W$ are updated and iterations continue.
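The same mechanics can be reproduced in a few lines of PyTorch, the library used in this study. The toy recurrence below is a minimal sketch of ours, not the paper's network: the specific cell operations are placeholder scalars chosen only so that the hand-derived chain rule of equations (1)-(2) can be checked against autograd.

import torch

# Two-step toy recurrence mirroring Figure 2: each "cell" maps the previous
# internal variable O and a trainable weight W to the next O, and the residual
# R is the absolute error between the cell output and the labeled data L.
W1 = torch.tensor(2.0, requires_grad=True)
W2 = torch.tensor(3.0, requires_grad=True)
S1, S2 = torch.tensor(1.0), torch.tensor(0.5)   # inputs at each time step
L2 = torch.tensor(10.0)                          # labeled data at step 2

O1 = W1 * S1            # internal variable of cell 1
O2 = O1 + S2            # information communicated between cells
P2 = W2 * O2            # output of cell 2
R2 = torch.abs(P2 - L2) # residual, absolute error (l1 norm) as in Figure 2

R2.backward()           # reverse traversal of the computational graph
# By hand: dR2/dW2 = sign(P2 - L2) * O2 = -2.5,
#          dR2/dW1 = sign(P2 - L2) * W2 * S1 = -3.0
print(W2.grad, W1.grad)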
RECURRENT NEURAL NETWORK FORMULATION OF EFWI

Wave propagation can be simulated using suitably-designed RNNs (Sun et al., 2020; Richardson, 2018; Hughes et al., 2019). Here we take the acoustic wave propagation approach of Sun et al. (2020) as a starting point, and formulate an RNN which simulates the propagation of an elastic wave through an isotropic elastic medium. The underlying equations are the 2D velocity-stress form of the elastodynamic equations (Virieux, 1986; Liu and Sen, 2009):
$$\begin{aligned}
\frac{\partial v_x}{\partial t} &= \frac{1}{\rho}\left(\frac{\partial \sigma_{xx}}{\partial x} + \frac{\partial \sigma_{xz}}{\partial z}\right)\\
\frac{\partial v_z}{\partial t} &= \frac{1}{\rho}\left(\frac{\partial \sigma_{xz}}{\partial x} + \frac{\partial \sigma_{zz}}{\partial z}\right)\\
\frac{\partial \sigma_{xx}}{\partial t} &= (\lambda + 2\mu)\frac{\partial v_x}{\partial x} + \lambda\frac{\partial v_z}{\partial z}\\
\frac{\partial \sigma_{zz}}{\partial t} &= (\lambda + 2\mu)\frac{\partial v_z}{\partial z} + \lambda\frac{\partial v_x}{\partial x}\\
\frac{\partial \sigma_{xz}}{\partial t} &= \mu\left(\frac{\partial v_x}{\partial z} + \frac{\partial v_z}{\partial x}\right),
\end{aligned} \tag{3}$$
where $v_x$ and $v_z$ are the $x$ and $z$ components of the particle velocity, and $\sigma_{xx}$, $\sigma_{zz}$ and $\sigma_{xz}$ are the three 2D components of the stress tensor. Discretized spatial distributions of the Lamé parameters $\lambda$ and $\mu$, and the density $\rho$, form the elastic model.

[Figure 3 about here.]

In Figure 3 the structure of an RNN cell which produces a staggered-grid finite difference solution for the velocity and stress fields is illustrated. At each time step the discrete sources $s_x$ and $s_z$ act as inputs; the velocity and stress information, $v_x^{t-1/2}$, $v_z^{t-1/2}$, $\sigma_{xx}^t$, $\sigma_{zz}^t$, and $\sigma_{xz}^t$, is communicated between the RNN cells; the partial derivative fields, $\partial_x \sigma_{xx}^t$, $\partial_z \sigma_{zz}^t$, $\partial_x \sigma_{xz}^t$, $\partial_z \sigma_{xz}^t$, $\partial_x v_x^{t+1/2}$, $\partial_z v_x^{t+1/2}$, $\partial_x v_z^{t+1/2}$, $\partial_z v_z^{t+1/2}$, are the internal variables in each RNN cell, which correspond to $O$ in Figure 2; and $\lambda$, $\mu$ and $\rho$ are included as trainable weights, which correspond to $W$ in Figure 2. In Algorithm 1, pseudocode detailing these calculations within the RNN cell is provided. The $*$ symbol represents the machine-learning image convolution operator: each element of the image is combined with its local neighbors, weighted by the convolution kernel. This image convolution operator is also capable of calculating spatial partial derivatives if the convolution kernel is designed according to the finite difference coefficients. Details about the image convolution operation can be found in Podlozhnyuk (2007). $dx$ and $dz$ are the grid intervals, and the image convolution kernels are $k_{x1} = a/dx$, $k_{x2} = b/dx$, $k_{z1} = a^T/dz$, and $k_{z2} = b^T/dz$, where $a = [0, 1/24, -9/8, 9/8, -1/24]$ and $b = [1/24, -9/8, 9/8, -1/24, 0]$ are $1\times 5$ vectors of fourth-order staggered-grid finite difference coefficients. $k_{x1}$ and $k_{x2}$ are the kernels, in the image convolution process, responsible for calculating the staggered-grid spatial partial derivatives in the $x$ direction; $k_{z1}$ and $k_{z2}$ are the corresponding kernels for the $z$ direction, which is why the arrays $a$ and $b$ are transposed in $k_{z1}$ and $k_{z2}$. Spatial partial derivatives calculated in this way are, mathematically, the same as in the conventional staggered-grid method (e.g., Virieux, 1986); in this study we simply carry the calculation out as an image convolution.
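As an illustration of the derivative-by-convolution idea, the sketch below applies a $1\times 5$ staggered-grid kernel to a smooth field with PyTorch's conv2d. The coefficient values follow the fourth-order stencil (9/8, 1/24) as reconstructed above, and the test field and grid spacing are arbitrary choices of ours.

import torch
import torch.nn.functional as F

# Fourth-order staggered-grid coefficients folded into a 1x5 kernel, following
# the a/dx construction in the text (exact value layout assumed).
dx = 4.0
a = torch.tensor([0.0, 1.0 / 24.0, -9.0 / 8.0, 9.0 / 8.0, -1.0 / 24.0])
k_x = (a / dx).view(1, 1, 1, 5)   # conv2d weight shape: (out_ch, in_ch, kH, kW)

# Smooth test field sampled on the grid: f(x) = sin(0.1 x)
nz, nx = 40, 90
x = torch.arange(nx, dtype=torch.float32) * dx
field = torch.sin(0.1 * x).repeat(nz, 1).view(1, 1, nz, nx)

# PyTorch's conv2d is a cross-correlation, so this evaluates the forward
# staggered x-derivative at the half grid points.
d_field = F.conv2d(field, k_x, padding=(0, 2))
print(d_field.shape)   # interior values approximate 0.1 cos(0.1 (x + dx/2))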
In Algorithm 1, in order to implement the PML boundary, all of the stress and velocity fields are split into their $x$ and $z$ components, and $d_x$ and $d_z$ are the PML damping coefficients in the $x$ and $z$ directions. The profile of $d_x$ can be expressed as:
$$d_x(i) = d_{x0}\left(\frac{i}{n_{pmlx}}\right)^p, \tag{4}$$
where $i$ is the PML layer number, counted from the boundary of the effective computational region, $n_{pmlx}$ is the number of PML layers in the $x$ direction, and $p$ is an integer taking values from 1 to 4. The amplitude $d_{x0}$ can be expressed as:
$$d_{x0} = \log\left(\frac{1}{R}\right)\frac{r V_s}{n_{pmlx}\,\delta_x}, \tag{5}$$
where $R$ is a theoretical reflection coefficient, $r$ is a value ranging from 3 to 4, and $\delta_x$ is the grid length in the $x$ direction. $d_z$ is calculated in the same way.

[Figure 4 about here.]

The activity of the RNN network is illustrated in "unfolded" form in Figure 4. Above the unfolded network, horizontal velocity fields associated with a point source at the top left of a model are plotted at three times during propagation of the wave information through the network, the third being at $t_{max}$, the maximum receiving time. The wave field values are stored at positions selected to match multicomponent receivers; these form shot records as time evolves, which become the output data at each time. These shot records correspond to the variables $P$ in Figures 1 and 2. For a synthetic FWI problem the observed data are obtained from the true model; for real seismic data inversion they are obtained from a field survey. In this RNN-based elastic FWI the observed data are treated as the labeled data. The residuals are calculated at the last computational time, as Figure 4 illustrates, and these, together with a selected norm, define the misfit function used to train the network.

Algorithm 2 describes the training of the network, i.e., the process of elastic RNN FWI. Partial derivatives of the residual with respect to the trainable parameters (in this case $\lambda$, $\mu$ and $\rho$) are calculated through backpropagation using automatic differentiation, as set out in the previous section. Once the gradients are available, an optimization method and step length are used to update the trainable parameters, reduce the misfit, and begin another iteration. In step 4, RNN() is the network discussed above, whose output is the synthetic data; costFunc(), in step 5, is the misfit or loss function chosen to measure the difference between the synthetic and observed data; loss.backward() begins the backpropagation within the computational graph and produces the gradients for each parameter, with which the current parameter model can be updated and another iteration started.
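A minimal PyTorch sketch of the Algorithm 2 loop follows. Here elastic_rnn is a differentiable stand-in of ours for the unrolled wave-propagation network, which we do not reproduce; only the loop structure, the per-parameter optimizers, and the loss.backward() call reflect the actual procedure.

import torch

def elastic_rnn(lam, mu, rho):
    # Stand-in for the unrolled RNN cells: returns "shot records" that depend
    # differentiably on the models so that loss.backward() has a graph to walk.
    return torch.sqrt((lam[0, :] + 2.0 * mu[0, :]) / rho[0, :])

# Trainable parameter grids (lambda, mu, rho), as in Algorithm 2.
lam = torch.full((40, 90), 9.0e9, requires_grad=True)
mu = torch.full((40, 90), 5.0e9, requires_grad=True)
rho = torch.full((40, 90), 2000.0, requires_grad=True)
d_obs = torch.full((90,), 3200.0)              # placeholder observed data

# One optimizer per parameter class; learning rates are placeholders.
optimizers = [torch.optim.Adam([lam], lr=1e8),
              torch.optim.Adam([mu], lr=1e8),
              torch.optim.Adam([rho], lr=10.0)]

for it in range(50):
    d_syn = elastic_rnn(lam, mu, rho)          # step 4: synthetic data
    loss = 0.5 * ((d_syn - d_obs) ** 2).sum()  # step 5: misfit (l2 here)
    for opt in optimizers:
        opt.zero_grad()
    loss.backward()                            # step 6: AD gradients
    for opt in optimizers:
        opt.step()                             # step 7: parameter update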
The gradient calculated within the training process above essentially reproduces the adjoint-state calculations within FWI, as discussed by Sun et al. (2020). Formulated as a problem of RNN training, the gradient calculation occurs rapidly and in a manner suitable for cloud computational architectures; it also allows the researcher to efficiently alter misfit function choices and parameterization in order to do high-level optimization. However, it involves the storage of the whole wavefield, and thus should be expected to have significant memory requirements.
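One standard mitigation of this storage cost, not used in the paper, is gradient checkpointing via PyTorch's torch.utils.checkpoint, which caches only block boundaries and re-computes the interior time steps during the backward pass. The sketch below is our illustration of the idea with a placeholder cell update.

import torch
from torch.utils.checkpoint import checkpoint

def time_block(fields, lam, mu, rho):
    # Placeholder for a block of RNN cells; only the block's input and output
    # are cached, and the interior states are re-computed on backward().
    for _ in range(50):
        fields = fields + 1.0e-12 * (lam + 2.0 * mu) / rho
    return fields

fields = torch.zeros(40, 90)
lam = torch.full((40, 90), 9.0e9, requires_grad=True)
mu = torch.full((40, 90), 5.0e9, requires_grad=True)
rho = torch.full((40, 90), 2000.0, requires_grad=True)

for block in range(10):   # 500 time steps, but only 10 cached block states
    fields = checkpoint(time_block, fields, lam, mu, rho, use_reentrant=False)

fields.sum().backward()   # gradients w.r.t. lam, mu, rho flow as usual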
MISFITS WITH HIGH-ORDER TV REGULARIZATION

Here we first introduce the elastic RNN misfit based on the $l_2$ norm with high-order TV regularization:
$$\begin{aligned}
\Phi_{l2}^{TV}(m_\lambda, m_\mu, m_\rho, \alpha_{\lambda 1}, \alpha_{\mu 1}, \alpha_{\rho 1}, \alpha_{\lambda 2}, \alpha_{\mu 2}, \alpha_{\rho 2}) = \; &\frac{1}{2}\left\| D_{syn}(m_\lambda, m_\mu, m_\rho) - D_{obs} \right\|_2^2 \\
&+ \alpha_{\lambda 1}\Theta_{TV}(m_\lambda) + \alpha_{\mu 1}\Theta_{TV}(m_\mu) + \alpha_{\rho 1}\Theta_{TV}(m_\rho) \\
&+ \alpha_{\lambda 2}\Upsilon_{TV}(m_\lambda) + \alpha_{\mu 2}\Upsilon_{TV}(m_\mu) + \alpha_{\rho 2}\Upsilon_{TV}(m_\rho),
\end{aligned} \tag{6}$$
where $\alpha_{\lambda 1}, \alpha_{\mu 1}, \alpha_{\rho 1}, \alpha_{\lambda 2}, \alpha_{\mu 2}, \alpha_{\rho 2}$ are Lagrange multipliers, $D_{syn}(m_\lambda, m_\mu, m_\rho)$ represents the synthetic data, which is a function of the model parameters (here $\lambda$, $\mu$ and $\rho$), and $\Theta_{TV}$ and $\Upsilon_{TV}$ are the functions calculating the first- and second-order TV regularization terms of the models, respectively.

The first-order TV regularization term can be expressed as:
$$TV_1(m) = \sum_{i=1}^{n-1}\sum_{j=1}^{m-1}\left| M_{i+1,j} - M_{i,j} \right| + \sum_{i=1}^{n-1}\sum_{j=1}^{m-1}\left| M_{i,j+1} - M_{i,j} \right| = \left\|\left(\nabla_x, \nabla_z\right)m\right\|_1 = \left\|\left(L_x, L_z\right)m\right\|_1 = \Theta_{TV}(m). \tag{7}$$
The second-order TV regularization term can be expressed as:
$$TV_2(m) = \sum_{i=2}^{n-1}\sum_{j=2}^{m-1}\left| M_{i+1,j} - 2M_{i,j} + M_{i-1,j} \right| + \sum_{i=2}^{n-1}\sum_{j=2}^{m-1}\left| M_{i,j+1} - 2M_{i,j} + M_{i,j-1} \right| = \left\|\left(\nabla_{xx}, \nabla_{zz}\right)m\right\|_1 = \left\|\left(K_{xx}, K_{zz}\right)m\right\|_1 = \Upsilon_{TV}(m). \tag{8}$$
The derivative of $\Phi_{l2}^{TV}$ with respect to each parameter, which is the gradient for $\lambda$, $\mu$ and $\rho$ based on the $l_2^{TV}$ norm, can be expressed as:
$$\begin{bmatrix} \partial \Phi_{l2}^{TV}/\partial m_\lambda \\ \partial \Phi_{l2}^{TV}/\partial m_\mu \\ \partial \Phi_{l2}^{TV}/\partial m_\rho \end{bmatrix} = \begin{bmatrix} G_\lambda^{l2} \\ G_\mu^{l2} \\ G_\rho^{l2} \end{bmatrix} + \begin{bmatrix} R_\lambda \\ R_\mu \\ R_\rho \end{bmatrix}, \tag{9}$$
where $G_\lambda^{l2}$, $G_\mu^{l2}$, $G_\rho^{l2}$ are the data-misfit gradients for $\lambda$, $\mu$, $\rho$, and $R_\lambda$, $R_\mu$, $R_\rho$ are the regularization terms, whose mathematical expressions are:
$$\begin{bmatrix} R_\lambda \\ R_\mu \\ R_\rho \end{bmatrix} = \begin{bmatrix} \left(\alpha_{\lambda 1} L_x^T Q_{x\lambda} L_x + \alpha_{\lambda 1} L_z^T Q_{z\lambda} L_z + \alpha_{\lambda 2} K_{xx}^T Q_{xx\lambda} K_{xx} + \alpha_{\lambda 2} K_{zz}^T Q_{zz\lambda} K_{zz}\right) m_\lambda \\ \left(\alpha_{\mu 1} L_x^T Q_{x\mu} L_x + \alpha_{\mu 1} L_z^T Q_{z\mu} L_z + \alpha_{\mu 2} K_{xx}^T Q_{xx\mu} K_{xx} + \alpha_{\mu 2} K_{zz}^T Q_{zz\mu} K_{zz}\right) m_\mu \\ \left(\alpha_{\rho 1} L_x^T Q_{x\rho} L_x + \alpha_{\rho 1} L_z^T Q_{z\rho} L_z + \alpha_{\rho 2} K_{xx}^T Q_{xx\rho} K_{xx} + \alpha_{\rho 2} K_{zz}^T Q_{zz\rho} K_{zz}\right) m_\rho \end{bmatrix}, \tag{10}$$
in which the first- and second-difference vectors are
$$\begin{bmatrix} q_{x\lambda} & q_{x\mu} & q_{x\rho} \\ q_{z\lambda} & q_{z\mu} & q_{z\rho} \\ q_{xx\lambda} & q_{xx\mu} & q_{xx\rho} \\ q_{zz\lambda} & q_{zz\mu} & q_{zz\rho} \end{bmatrix} = \begin{bmatrix} L_x \\ L_z \\ K_{xx} \\ K_{zz} \end{bmatrix}\begin{pmatrix} m_\lambda, & m_\mu, & m_\rho \end{pmatrix}, \tag{11}$$
and the diagonal weighting matrices are built from the elementwise absolute inverses,
$$\begin{bmatrix} Q_{x\lambda} & Q_{x\mu} & Q_{x\rho} \\ Q_{z\lambda} & Q_{z\mu} & Q_{z\rho} \\ Q_{xx\lambda} & Q_{xx\mu} & Q_{xx\rho} \\ Q_{zz\lambda} & Q_{zz\mu} & Q_{zz\rho} \end{bmatrix} = \begin{bmatrix} 1/\left| q_{x\lambda} \right| & 1/\left| q_{x\mu} \right| & 1/\left| q_{x\rho} \right| \\ 1/\left| q_{z\lambda} \right| & 1/\left| q_{z\mu} \right| & 1/\left| q_{z\rho} \right| \\ 1/\left| q_{xx\lambda} \right| & 1/\left| q_{xx\mu} \right| & 1/\left| q_{xx\rho} \right| \\ 1/\left| q_{zz\lambda} \right| & 1/\left| q_{zz\mu} \right| & 1/\left| q_{zz\rho} \right| \end{bmatrix}. \tag{12}$$
Here $T$ denotes the matrix transpose. For example, $q_{x\lambda} = L_x m_\lambda$ is the first-order difference of the $\lambda$ model in the $x$ direction, and $Q_{x\lambda}$ weights it by $1/|q_{x\lambda}|$, so that $L_x^T Q_{x\lambda} L_x m_\lambda$ is the subgradient of $\|L_x m_\lambda\|_1$; similarly, $q_{xx\lambda} = K_{xx} m_\lambda$ is the second-order difference in the $x$ direction, weighted by $Q_{xx\lambda}$. The other terms in equations (9) and (10) follow in the same way. $L_x$ and $L_z$ are the first-order difference operators giving the first-order total variations in the $x$ and $z$ directions, respectively, and $K_{xx}$ and $K_{zz}$ are the second-order difference operators giving the second-order total variations in the $x$ and $z$ directions, respectively.
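The TV terms of equations (7) and (8) translate directly into tensor differences. The following sketch is our own, with arbitrary multiplier values; it lets PyTorch's autograd supply the subgradient of equation (10) rather than assembling the $L^T Q L$ operators explicitly.

import torch

def theta_tv(M):
    # First-order TV, eq. (7): absolute forward differences in both directions.
    return ((M[1:, :] - M[:-1, :]).abs().sum()
            + (M[:, 1:] - M[:, :-1]).abs().sum())

def upsilon_tv(M):
    # Second-order TV, eq. (8): absolute second differences in both directions.
    return ((M[2:, :] - 2 * M[1:-1, :] + M[:-2, :]).abs().sum()
            + (M[:, 2:] - 2 * M[:, 1:-1] + M[:, :-2]).abs().sum())

m = torch.rand(40, 90, requires_grad=True)        # one model class, e.g. lambda
reg = 0.1 * theta_tv(m) + 0.05 * upsilon_tv(m)    # alpha_1, alpha_2 placeholders
reg.backward()    # autograd returns the regularization gradient of eq. (10)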
If we were to use the $l_1$ norm instead, the misfit would be:
$$\begin{aligned}
\Phi_{l1}^{TV}(m_\lambda, m_\mu, m_\rho, \alpha_{\lambda 1}, \alpha_{\mu 1}, \alpha_{\rho 1}, \alpha_{\lambda 2}, \alpha_{\mu 2}, \alpha_{\rho 2}) = \; &\left\| D_{syn}(m_\lambda, m_\mu, m_\rho) - D_{obs} \right\|_1 \\
&+ \alpha_{\lambda 1}\Theta_{TV}(m_\lambda) + \alpha_{\mu 1}\Theta_{TV}(m_\mu) + \alpha_{\rho 1}\Theta_{TV}(m_\rho) \\
&+ \alpha_{\lambda 2}\Upsilon_{TV}(m_\lambda) + \alpha_{\mu 2}\Upsilon_{TV}(m_\mu) + \alpha_{\rho 2}\Upsilon_{TV}(m_\rho),
\end{aligned} \tag{13}$$
and the gradient for each parameter based on the $l_1$ norm is:
$$\begin{bmatrix} \partial \Phi_{l1}^{TV}/\partial m_\lambda \\ \partial \Phi_{l1}^{TV}/\partial m_\mu \\ \partial \Phi_{l1}^{TV}/\partial m_\rho \end{bmatrix} = \begin{bmatrix} G_\lambda^{l1} \\ G_\mu^{l1} \\ G_\rho^{l1} \end{bmatrix} + \begin{bmatrix} R_\lambda \\ R_\mu \\ R_\rho \end{bmatrix}, \tag{14}$$
where $G_\lambda^{l1}$, $G_\mu^{l1}$, $G_\rho^{l1}$ are the gradients for $\lambda$, $\mu$, $\rho$ using the $l_1$ norm as the misfit function. The $l_1$-norm gradient calculation uses a different adjoint source (Pyun et al., 2009; Brossier et al., 2010): the adjoint source for the $l_1$ adjoint fields is $\Delta d / \left| \Delta d \right|$, which, in the case of real arithmetic, corresponds to the sign function. In this study we did not encounter the condition $\Delta d = 0$. The detailed gradient expressions, via the adjoint-state method, for the parameters $\lambda$, $\mu$ and $\rho$ based on the $l_2$ and $l_1$ norms are:
$$G_\lambda^{l2} = -\sum_{x_s}\sum_{x_g}\int_0^T \left( \partial_x \tilde{u}_x(\mathbf{r}, \mathbf{r}_s, t) + \partial_z \tilde{u}_z(\mathbf{r}, \mathbf{r}_s, t) \right)\left( \partial_x \tilde{u}_x^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) + \partial_z \tilde{u}_z^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) \right) dt, \tag{15}$$
$$\begin{aligned}
G_\mu^{l2} = -\sum_{x_s}\sum_{x_g}\int_0^T \Big( &\left( \partial_z \tilde{u}_x(\mathbf{r}, \mathbf{r}_s, t) + \partial_x \tilde{u}_z(\mathbf{r}, \mathbf{r}_s, t) \right)\left( \partial_z \tilde{u}_x^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) + \partial_x \tilde{u}_z^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) \right) \\
&+ 2\left( \partial_x \tilde{u}_x(\mathbf{r}, \mathbf{r}_s, t)\, \partial_x \tilde{u}_x^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) + \partial_z \tilde{u}_z(\mathbf{r}, \mathbf{r}_s, t)\, \partial_z \tilde{u}_z^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) \right) \Big)\, dt,
\end{aligned} \tag{16}$$
$$G_\rho^{l2} = \sum_{x_s}\sum_{x_g}\int_0^T \left( \partial_t \tilde{u}_x(\mathbf{r}, \mathbf{r}_s, t)\, \partial_t \tilde{u}_x^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) + \partial_t \tilde{u}_z(\mathbf{r}, \mathbf{r}_s, t)\, \partial_t \tilde{u}_z^{*l2}(\mathbf{r}, \mathbf{r}_g, T-t) \right) dt, \tag{17}$$
with the $l_1$ gradients $G_\lambda^{l1}$, $G_\mu^{l1}$ and $G_\rho^{l1}$ (equations 18-20) taking exactly the same forms, but with the adjoint fields $\tilde{u}_x^{*l1}$ and $\tilde{u}_z^{*l1}$ in place of $\tilde{u}_x^{*l2}$ and $\tilde{u}_z^{*l2}$. Here $\tilde{u}_x^{*l1}$ and $\tilde{u}_z^{*l1}$ are the adjoint wavefields generated by the $l_1$-norm adjoint source, and $\tilde{u}_x^{*l2}$ and $\tilde{u}_z^{*l2}$ are those generated by the $l_2$-norm adjoint source. $T$ is the total receiving time for the shot records, $\mathbf{r}_s$ and $\mathbf{r}_g$ represent the source and receiver locations respectively, and $\mathbf{r}$ represents the model perturbation locations for the $\lambda$, $\mu$ and $\rho$ models. Figure 5 compares the gradients calculated using the adjoint-state method and automatic differentiation: Figure 5 (a), (b), (c) are the normalized $\lambda$, $\mu$ and $\rho$ gradients calculated using the adjoint-state method, and Figure 5 (d), (e), (f) are the normalized $\lambda$, $\mu$ and $\rho$ gradients calculated using the AD method. The gradients calculated using the AD method contain more information about the model, for instance in its lower part, indicating that they can better reconstruct the model.
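The difference between the two adjoint sources can be checked directly with autograd: the gradient of the $l_1$ data misfit with respect to the synthetic data is the sign of the residual, while for the halved, squared $l_2$ misfit it is the residual itself. The data values below are arbitrary.

import torch

d_obs = torch.tensor([0.1, -0.1, 0.0, 1.0])

d_syn = torch.tensor([0.2, -0.5, 0.0, 1.3], requires_grad=True)
loss_l1 = (d_syn - d_obs).abs().sum()
loss_l1.backward()
print(d_syn.grad)    # sign(d_syn - d_obs); the zero residual maps to 0

d_syn2 = torch.tensor([0.2, -0.5, 0.0, 1.3], requires_grad=True)
loss_l2 = 0.5 * ((d_syn2 - d_obs) ** 2).sum()
loss_l2.backward()
print(d_syn2.grad)   # the raw residual d_syn - d_obs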
Now we rewrite the misfit function as:
$$\Phi^{TV} = J_D + J_{r1} + J_{r2}, \tag{21}$$
where $J_D$ represents the data misfit between observed and synthetic data under whichever norm is chosen, $J_{r1} = \alpha_{\lambda 1}\Theta_{TV}(m_\lambda) + \alpha_{\mu 1}\Theta_{TV}(m_\mu) + \alpha_{\rho 1}\Theta_{TV}(m_\rho)$, and $J_{r2} = \alpha_{\lambda 2}\Upsilon_{TV}(m_\lambda) + \alpha_{\mu 2}\Upsilon_{TV}(m_\mu) + \alpha_{\rho 2}\Upsilon_{TV}(m_\rho)$. The values $\alpha_{\lambda 1}, \alpha_{\mu 1}, \alpha_{\rho 1}, \alpha_{\lambda 2}, \alpha_{\mu 2}, \alpha_{\rho 2}$ are chosen according to the following formula (Guitton et al., 2012):
$$T = \frac{J_D}{J_{r1} + J_{r2}}. \tag{22}$$
We control the values of $\alpha_{\lambda 1}, \alpha_{\mu 1}, \alpha_{\rho 1}, \alpha_{\lambda 2}, \alpha_{\mu 2}, \alpha_{\rho 2}$ to keep the value of $T$ between 1 and 10; $T$ should be relatively large when noise occurs in the data (Xiang and Zhang, 2016).

[Figure 5 about here.]
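A small sketch of this tuning rule: evaluate $T$ with the current multipliers and rescale them so that $T$ lands at a chosen point inside the 1-10 window. The target value and the multiplier list are placeholders of ours.

def rescale_multipliers(J_D, J_r1, J_r2, alphas, target_T=5.0):
    # T = J_D / (J_r1 + J_r2); multiplying every alpha by T / target_T scales
    # the regularization terms so that the ratio moves to target_T.
    T = J_D / (J_r1 + J_r2)
    scale = T / target_T
    return [a * scale for a in alphas], T

alphas, T = rescale_multipliers(
    J_D=2.0e3, J_r1=40.0, J_r2=10.0,
    alphas=[0.1, 0.1, 0.1, 0.05, 0.05, 0.05])
print(T, alphas)   # T = 40 here, so every multiplier is scaled up by 8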
Parameterization testing

By modifying the RNN cells, and changing the trainable parameters, we can examine the influence of parameterization on waveform inversion within the deep learning formulation. Three sets of parameter classes are considered: the velocity parameterization (D-V model), involving P-wave velocity, S-wave velocity, and density; the modulus parameterization (D-M model), involving the Lamé parameters $\lambda$ and $\mu$, and density; and the stiffness-matrix parameterization (D-S model), involving $C_{11}$, $C_{44}$ and density $\rho$. A change of variables connecting the classes is sketched below.
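The conversions use the standard isotropic relations $\mu = \rho V_S^2$, $\lambda = \rho(V_P^2 - 2V_S^2)$, $C_{11} = \rho V_P^2$ and $C_{44} = \rho V_S^2$; the velocity and density values below are placeholders.

import torch

def dv_to_dm(vp, vs, rho):
    # D-V to D-M: Lame parameters from velocities and density.
    mu = rho * vs ** 2
    lam = rho * vp ** 2 - 2.0 * mu
    return lam, mu, rho

def dv_to_ds(vp, vs, rho):
    # D-V to D-S: stiffness parameters C11 = rho Vp^2, C44 = rho Vs^2.
    return rho * vp ** 2, rho * vs ** 2, rho

vp = torch.full((40, 90), 2500.0)
vs = torch.full((40, 90), 1400.0)
rho = torch.full((40, 90), 2100.0)
lam, mu, _ = dv_to_dm(vp, vs, rho)
c11, c44, _ = dv_to_ds(vp, vs, rho)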
In these tests, the size of each model is 40 × 90 grid points, and seven source points are evenly distributed across the surface of the model; the source is a Ricker wavelet with a dominant frequency of 30 Hz. The grid length of the model is $dx = dz = 4$ m. In Figure 6, (a)-(b) are the true and initial $V_P$ models, (c)-(d) are the true and initial $V_S$ models, and (e)-(f) are the true and initial $\rho$ models used in this test.

In Figure 7 are the inversion results using the D-V parameterization. Figure 7 (a)-(d) are the inversion results for $V_P$ generated by the $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norms; Figure 7 (e)-(h) are the inversion results for $V_S$; Figure 7 (i)-(l) are the inversion results for $\rho$. In Figure 7 (a), (e) and (i) we can see the cross-talk between different parameters, as the black arrows indicate. In Figure 7 (b), (c) and (d) we can see that, by changing the misfits and adding high-order TV regularization, the cross-talk between $V_P$ and $\rho$ has been reduced. In Figure 7 (h) and (l) we can see that by using the $l_1^{TV}$ norm the cross-talk between density and $V_S$ has been mitigated, while in (j) and (k) we still see cross-talk between $V_S$ and density. From Figure 7 we can see that the $l_1$ norm with high-order TV regularization can help to mitigate the cross-talk problem.

[Figure 6 about here.]

[Figure 7 about here.]

Next the modulus parameterization is examined, in which we seek to recover the $\lambda$, $\mu$ and $\rho$ models. This occurs through a straightforward modification of the RNN cell, and again a change in the trainable parameters from velocities to moduli. Figure 8 (a)-(d) are the inversion results for $\lambda$ generated by the $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norms; Figure 8 (e)-(h) are the inversion results for $\mu$; Figure 8 (i)-(l) are the inversion results for $\rho$. In Figure 8 (b) and (d) we can see that using high-order TV regularization with the $l_2$ and $l_1$ norms mitigates the cross-talk between density and $\lambda$. Figure 8 (j) shows that in this parameterization the high-order TV regularization with the $l_2$ norm provides better inversion results for density as well.

[Figure 8 about here.]

Inversion results for models in the D-S parameterization are plotted in Figure 9, generated using the change of variables $C_{11} = V_P^2\rho$ and $C_{44} = V_S^2\rho$. Figure 9 (a)-(d) are the inversion results for $C_{11}$ generated by the $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norms; Figure 9 (e)-(h) are the inversion results for $C_{44}$; Figure 9 (i)-(l) are the inversion results for $\rho$. In Figure 9 we can still see cross-talk between
$C_{44}$ and
$C_{11}$, as the black arrows indicate. However, this cross-talk has been mitigated in Figure 9 (d) by using the $l_1^{TV}$ norm misfit. The inversion results above show that RNN-based high-order TV regularized FWI using the $l_1$ and $l_2$ norms has the ability to mitigate the cross-talk problem with only gradient-based optimization methods. The $l_1^{TV}$ RNN FWI in the D-M parameterization and the $l_1^{TV}$ RNN FWI in the D-S parameterization provide better inversion results than the other inversion tests.

[Figure 9 about here.]

[Figure 10 about here.]

[Figure 11 about here.]

[Figure 12 about here.]

[Figure 13 about here.]

Next, we verify the proposed methods on the overthrust model. Figure 10 (a), (c), (e) show the true $V_P$, $V_S$, and density $\rho$ models, and (b), (d) and (f) are the corresponding initial models. The size of the model is 121 × … grid points. Figure 11 (a) is the inversion result for $V_P$, Figure 11 (b) is the inversion for $V_S$, and Figure 11 (c) is the inversion for $\rho$. From Figure 11 we can see that the overall resolution of the conventional FWI inversion is poor.

Figure 12 shows the inversion results using the D-M parameterization. Figure 12 (a)-(c) are the $\lambda$ inversion results for the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms respectively, (d)-(f) are the corresponding $\mu$ results, and (g)-(i) are the corresponding $\rho$ results. In this parameterization we get unstable inversion results for the parameter $\lambda$. However, in the D-M parameterization, Figure 12 (f), the small half-arc structure at around 1000 m in the model has been recovered, as the black arrows indicate. Figure 13 shows the profiles through the recovered elastic moduli at 1000 m of the models based on the D-M parameterization. In Figure 13, the black lines are the true values, the yellow lines are the initial values, the red lines are the $l_2$ norm inversion results, the green lines are the $l_2^{TV}$ norm results and the blue lines are the $l_1^{TV}$ norm results. Compared with the true lines we can see in Figure 13 (b) and (c) that the $l_1^{TV}$ norm inversion results are closest to the true values.

[Figure 14 about here.]

[Figure 15 about here.]

[Figure 16 about here.]

Figure 14 shows the inversion results using the D-V parameterization. In this parameterization all three parameters are stable. Figure 14 (a)-(c) are the $V_P$ inversion results for the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms respectively, (d)-(f) are the corresponding $V_S$ results, and (g)-(i) are the corresponding $\rho$ results. Compared with the true models, we can see that the inversion results generated using the plain $l_2$ norm are less robust than those of the other misfits; the $l_1$ norm with high-order TV regularization provides more accurate inversion results. This can also be seen from Figure 15, which shows the profiles through the recovered models at 1000 m. In Figure 15, the black lines are the true velocities and density, the yellow lines are the initial values, the red lines are the $l_2$ norm inversion results, the green lines are the $l_2^{TV}$ norm results and the blue lines are the $l_1^{TV}$ norm results. Figure 15 (a) shows the results for $V_P$, Figure 15 (b) shows the results for $V_S$, and Figure 15 (c) shows the results for $\rho$.
In all three figures we can see that the blue lines are closer to the true values than the other lines, which means that the $l_1^{TV}$ norm provides more accurate inversion results. Figure 16 shows the D-V parameterization model misfits at each iteration: the red lines are the inversion using the $l_2$ norm, the green lines the $l_2^{TV}$ norm, and the blue lines the $l_1^{TV}$ norm. Figure 16 shows that we can obtain higher-accuracy inversion results with fewer iterations by using the $l_1^{TV}$ norm.

Figure 17 shows the inversion results using the D-S parameterization. Figure 17 (a)-(c) are the $C_{11}$ inversion results for the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms respectively, (d)-(f) are the corresponding $C_{44}$ results, and (g)-(i) are the corresponding $\rho$ results. In this parameterization the small half-arc structure at around 1000 m in the model is clearly resolved by using the $l_1^{TV}$ norm; compared with the other misfits, the $l_1^{TV}$ norm again provides the best resolution of the model. Figure 18 shows the profiles through the recovered models at 1000 m based on the D-S parameterization; the red, green and blue lines are the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively. We can see that in the D-S parameterization the $l_1^{TV}$ norm likewise provides inversion results that are closest to the true values. Figure 19 shows how the model misfits in the D-S parameterization change at each iteration. The blue line, the $l_1^{TV}$ norm inversion, has the fastest model-misfit decline rate. We conclude that in this parameterization the $l_1^{TV}$ norm can generate more accurate inversion results with fewer iterations.

[Figure 17 about here.]

[Figure 18 about here.]

[Figure 19 about here.]

RANDOM NOISE TESTING
In this section, we test the sensitivity of this deep learning method to data contaminated with random noise drawn from a Gaussian distribution. In Figure 20, we plot a trace from an example shot profile with different ratios of noise.

[Figure 20 about here.]

In (a), the red signal is the record without noise, and the blue line is the record with Gaussian random noise; the mean value of the noise is zero, and the standard deviation is 0.3 ($std = 0.3$). In (b) and (c), records with $std = 0.5$ and $std = 1$ are plotted.

[Figure 21 about here.]

In Figure 21 (a), (e) and (i) the inversion results from noise-free inversion based on the RNN are plotted. In Figure 21 (b), (f) and (j), the inversion results with noise at $std = 0.3$ for $V_P$, $V_S$, and $\rho$ are plotted, respectively; in Figure 21 (c), (g) and (k), the inversion results with noise at $std = 0.5$ for $V_P$, $V_S$, and $\rho$ are plotted; and Figure 21 (d), (h) and (l) are likewise for $std = 1.0$.
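For reference, the contamination itself is a one-line operation. The sketch below, with placeholder records, adds zero-mean Gaussian noise at the three standard deviations used here and reports the resulting signal-to-noise ratio, a diagnostic of ours rather than one quoted in the paper.

import torch

torch.manual_seed(0)
d_obs = torch.randn(1000, 7)            # placeholder normalized shot records
for std in (0.3, 0.5, 1.0):
    noisy = d_obs + std * torch.randn_like(d_obs)
    snr = 10.0 * torch.log10(d_obs.pow(2).mean()
                             / (noisy - d_obs).pow(2).mean())
    print(std, round(snr.item(), 2))    # SNR falls as the noise std grows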
We conclude that a moderate amount of random error in the data used for the RNN training leads to acceptable results, though some blurring is introduced and detail in the structure is lost. The $V_S$ recovery appears to be much more sensitive to noise than are those of $V_P$ or $\rho$.

CONCLUSIONS
Elastic multi-parameter full waveform inversion can be formulated as a strongly constrained, theory-based deep-learning network. Specifically, a recurrent neural network, set up with rules enforcing elastic wave propagation, with the wavefield projected onto a measurement surface acting as the labeled data to be compared with observed seismic data, recapitulates elastic FWI but with both (1) the opportunity for data-driven learning to be incorporated, and (2) a design supported by powerful cloud computing architectures. Each cell of the recurrent neural network is designed according to the isotropic-elastic wave equation.

The partial derivatives of the data residual with respect to the trainable parameters which represent the elastic media are calculated using automatic differentiation. With automatic differentiation, gradients can be calculated via the chain rule, guided by backpropagation along the paths within the computational graph. Automatic differentiation produces a high level of computational efficiency and scalability for the calculation of gradients for the different parameters in elastic media. The formulation is suitable for exploring numerical features of different misfits and different parameterizations, with an aim of improving the resolution of the recovered elastic models and mitigating cross-talk.

We compared RNN waveform inversions based on the $l_1$ and $l_2$ norms with high-order total variation regularization. We used this RNN synthetic environment to compare the density-velocity (P-wave velocity, S-wave velocity, and density), modulus-density (Lamé parameters and density) and D-S ($C_{11}$, $C_{44}$ and density) parameterizations, and their exposure to cross-talk for the varying misfit functions. Our results suggest generally that this approach to full waveform inversion is consistent and stable, with the D-S parameterization under the $l_1^{TV}$ misfit providing the most accurate recoveries in our tests.

REFERENCES
Alkhalifah, T., and R.-É. Plessix, 2014, A recipe for practical full-waveform inversion in anisotropic media: An analytical parameter resolution study: Geophysics, R91–R101.
Bar-Sinai, Y., S. Hoyer, J. Hickey, and M. P. Brenner, 2019, Learning data-driven discretizations for partial differential equations: Proceedings of the National Academy of Sciences, 15344–15349.
Brossier, R., S. Operto, and J. Virieux, 2010, Which data residual norm for robust elastic frequency-domain full waveform inversion?: Geophysics, R37–R46.
Chen, Y., M. Zhang, M. Bai, and W. Chen, 2019, Improving the signal-to-noise ratio of seismological datasets by unsupervised machine learning: Seismological Research Letters.
Downton, J. E., and D. P. Hampson, 2018, Deep neural networks to predict reservoir properties from seismic: CSEG Geoconvention.
Graves, A., A.-r. Mohamed, and G. Hinton, 2013, Speech recognition with deep recurrent neural networks: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 6645–6649.
Guitton, A., G. Ayeni, and E. Díaz, 2012, Constrained full-waveform inversion by model reparameterization: Geophysics, R117–R127.
Hughes, T. W., I. A. Williamson, M. Minkov, and S. Fan, 2019, Wave physics as an analog recurrent neural network: Science Advances, eaay6946.
Innanen, K. A., 2014, Seismic AVO and the inverse Hessian in precritical reflection full waveform inversion: Geophysical Journal International, 717–734.
Jin, G., K. Mendoza, B. Roy, and D. G. Buswell, 2019, Machine learning-based fracture-hit detection algorithm using LFDAS signal: The Leading Edge, 520–524.
Kalchbrenner, N., and P. Blunsom, 2013, Recurrent continuous translation models: 1700–1709.
Kamath, N., R. Brossier, L. Métivier, and P. Yang, 2018, 3D acoustic/visco-acoustic time-domain FWI of OBC data from the Valhall field, in SEG Technical Program Expanded Abstracts 2018: Society of Exploration Geophysicists, 1093–1097.
Kamei, R., and R. Pratt, 2013, Inversion strategies for visco-acoustic waveform inversion: Geophysical Journal International, 859–884.
Karpatne, A., G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, 2017, Theory-guided data science: A new paradigm for scientific discovery from data: IEEE Transactions on Knowledge and Data Engineering, 2318–2331.
Keating, S., and K. A. Innanen, 2017, Crosstalk and frequency bands in truncated Newton an-acoustic full-waveform inversion: 1416–1421.
——–, 2019, Parameter crosstalk and modeling errors in viscoacoustic seismic full-waveform inversion: Geophysics, R641–R653.
Köhn, D., D. De Nil, A. Kurzmann, A. Przebindowska, and T. Bohlen, 2012, On the influence of model parametrization in elastic full waveform tomography: Geophysical Journal International, 325–345.
Lailly, P., 1983, The seismic inverse problem as a sequence of before stack migrations: Conference on Inverse Scattering–Theory and Application, SIAM, 206.
Li, C., Y. Zhang, and C. C. Mosher, 2019a, A hybrid learning-based framework for seismic denoising: The Leading Edge, 542–549.
Li, D., K. Xu, J. M. Harris, and E. Darve, 2019b, Time-lapse full waveform inversion for subsurface flow problems with intelligent automatic differentiation: arXiv preprint arXiv:1912.07552.
Liu, Y., and M. K. Sen, 2009, An implicit staggered-grid finite-difference method for seismic modelling: Geophysical Journal International, 459–474.
Mikolov, T., 2012, Statistical language models based on neural networks: PhD thesis, Brno University of Technology.
Oh, J.-W., and T.
Alkhalifah, 2016, Elastic orthorhombic anisotropic parameter inversion: An analysis of parameterization: Geophysics, C279–C293.
Operto, S., Y. Gholami, V. Prieux, A. Ribodetti, R. Brossier, L. Metivier, and J. Virieux, 2013, A guided tour of multiparameter full-waveform inversion with multicomponent data: From theory to practice: The Leading Edge, 1040–1054.
Pan, W., K. A. Innanen, and Y. Geng, 2018, Elastic full-waveform inversion and parametrization analysis applied to walk-away vertical seismic profile data for unconventional (heavy oil) reservoir characterization: Geophysical Journal International, 1934–1968.
Pan, W., K. A. Innanen, G. F. Margrave, M. C. Fehler, X. Fang, and J. Li, 2016, Estimation of elastic constants for HTI media using Gauss-Newton and full-Newton multiparameter full-waveform inversion: Geophysics, R275–R291.
Paszke, A., S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, 2017, Automatic differentiation in PyTorch.
Peters, B., E. Haber, and J. Granek, 2019, Neural-networks for geophysicists and their application to seismic data interpretation: arXiv preprint arXiv:1903.11215.
Plessix, R.-E., 2006, A review of the adjoint-state method for computing the gradient of a functional with geophysical applications: Geophysical Journal International, 495–503.
Podlozhnyuk, V., 2007, Image convolution with CUDA: NVIDIA Corporation white paper, June.
Pyun, S., W. Son, and C. Shin, 2009, Frequency-domain waveform inversion using an l1-norm objective function: Exploration Geophysics, 227–232.
Raissi, M., 2018, Deep hidden physics models: Deep learning of nonlinear partial differential equations: The Journal of Machine Learning Research, 932–955.
Richardson, A., 2018, Seismic full-waveform inversion using deep learning tools and techniques: arXiv preprint arXiv:1801.07232.
Röth, G., and A. Tarantola, 1994, Neural networks and inversion of seismic data: Journal of Geophysical Research: Solid Earth, 6753–6768.
Sambridge, M., P. Rickwood, N. Rawlinson, and S. Sommacal, 2007, Automatic differentiation in geophysical inverse problems: Geophysical Journal International, 1–8.
Shustak, M., and E. Landa, 2018, Time reversal for wave refocusing and scatterer detection using machine learning: Geophysics, T257–T263.
Smith, R., T. Mukerji, and T. Lupo, 2019, Correlating geologic and seismic data with unconventional resource production curves using machine learning: Geophysics, O39–O47.
Sun, H., and L. Demanet, 2018, Low frequency extrapolation with deep learning, in SEG Technical Program Expanded Abstracts 2018: Society of Exploration Geophysicists, 2011–2015.
Sun, J., Z. Niu, K. A. Innanen, J. Li, and D. O. Trad, 2020, A theory-guided deep-learning formulation and optimization of seismic waveform inversion: Geophysics, R87–R99.
Tarantola, A., 1984, Inversion of seismic reflection data in the acoustic approximation: Geophysics, 1259–1266.
——–, 1986, A strategy for nonlinear elastic inversion of seismic reflection data: Geophysics, 1893–1903.
Virieux, J., 1986, P-SV wave propagation in heterogeneous media: Velocity-stress finite-difference method: Geophysics, 889–901.
Wagner, N., and J. M. Rondinelli, 2016, Theory-guided machine learning in materials science: Frontiers in Materials, 28.
Wu, J.-L., H. Xiao, and E. Paterson, 2018, Physics-informed machine learning approach for augmenting turbulence models: A comprehensive framework: Physical Review Fluids, 074602.
Wu, Y., and Y.
Lin, 2018, InversionNet: A real-time and accurate full waveform inversion with CNNs and continuous CRFs: arXiv preprint arXiv:1811.07875.
Xiang, S., and H. Zhang, 2016, Efficient edge-guided full-waveform inversion by Canny edge detection and bilateral filtering algorithms: Geophysical Supplements to the Monthly Notices of the Royal Astronomical Society, 1049–1061.
Yang, F., and J. Ma, 2019, Deep-learning inversion: a next generation seismic velocity-model building method: Geophysics, 1–133.
Zaremba, W., I. Sutskever, and O. Vinyals, 2014, Recurrent neural network regularization: arXiv preprint arXiv:1409.2329.
Zhang, Z., Y. Wu, Z. Zhou, and Y. Lin, 2019, VelocityGAN: Subsurface velocity image estimation using conditional adversarial networks: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 705–714.
Zheng, Y., Q. Zhang, A. Yusifov, and Y. Shi, 2019, Applications of supervised deep learning for seismic interpretation and inversion: The Leading Edge, 526–533.

Algorithm 1 Sequence of calculations in the RNN cell with PML boundary
Require:
Source: $s_x$, $s_z$; velocity and stress fields at the previous time step; parameters: $\lambda$, $\mu$, $\rho$; time step: $dt$; PML damping coefficients $d_x$, $d_z$.
Updated velocity fields at $t+1/2$ and stress fields at $t+1$.
1: $\sigma_{xx}^t \leftarrow \sigma_{xx}^t + s_x$;  $\sigma_{zz}^t \leftarrow \sigma_{zz}^t + s_z$
2: $\partial_x \sigma_{xx}^t \leftarrow (\sigma_{xx}^t * k_{x1})/\rho$;  $\partial_z \sigma_{xz}^t \leftarrow (\sigma_{xz}^t * k_{z2})/\rho$;  $\partial_x \sigma_{xz}^t \leftarrow (\sigma_{xz}^t * k_{x2})/\rho$;  $\partial_z \sigma_{zz}^t \leftarrow (\sigma_{zz}^t * k_{z1})/\rho$
3: $v_x^{t+1/2,x} \leftarrow (1 - dt\, d_x)\, v_x^{t-1/2,x} + dt\, \partial_x \sigma_{xx}^t$;  $v_x^{t+1/2,z} \leftarrow (1 - dt\, d_z)\, v_x^{t-1/2,z} + dt\, \partial_z \sigma_{xz}^t$;  $v_x^{t+1/2} \leftarrow v_x^{t+1/2,x} + v_x^{t+1/2,z}$
4: $v_z^{t+1/2,x} \leftarrow (1 - dt\, d_x)\, v_z^{t-1/2,x} + dt\, \partial_x \sigma_{xz}^t$;  $v_z^{t+1/2,z} \leftarrow (1 - dt\, d_z)\, v_z^{t-1/2,z} + dt\, \partial_z \sigma_{zz}^t$;  $v_z^{t+1/2} \leftarrow v_z^{t+1/2,x} + v_z^{t+1/2,z}$
5: $\partial_x v_x^{t+1/2} \leftarrow v_x^{t+1/2} * k_{x2}$;  $\partial_z v_x^{t+1/2} \leftarrow v_x^{t+1/2} * k_{z1}$;  $\partial_x v_z^{t+1/2} \leftarrow v_z^{t+1/2} * k_{x1}$;  $\partial_z v_z^{t+1/2} \leftarrow v_z^{t+1/2} * k_{z2}$
6: $\sigma_{xx}^{t+1,x} \leftarrow (1 - dt\, d_x)\, \sigma_{xx}^{t,x} + dt\,(\lambda + 2\mu)\, \partial_x v_x^{t+1/2}$;  $\sigma_{xx}^{t+1,z} \leftarrow (1 - dt\, d_z)\, \sigma_{xx}^{t,z} + dt\, \lambda\, \partial_z v_z^{t+1/2}$;  $\sigma_{xx}^{t+1} \leftarrow \sigma_{xx}^{t+1,x} + \sigma_{xx}^{t+1,z}$
7: $\sigma_{zz}^{t+1,x} \leftarrow (1 - dt\, d_x)\, \sigma_{zz}^{t,x} + dt\, \lambda\, \partial_x v_x^{t+1/2}$;  $\sigma_{zz}^{t+1,z} \leftarrow (1 - dt\, d_z)\, \sigma_{zz}^{t,z} + dt\,(\lambda + 2\mu)\, \partial_z v_z^{t+1/2}$;  $\sigma_{zz}^{t+1} \leftarrow \sigma_{zz}^{t+1,x} + \sigma_{zz}^{t+1,z}$
8: $\sigma_{xz}^{t+1,x} \leftarrow (1 - dt\, d_x)\, \sigma_{xz}^{t,x} + dt\, \mu\, \partial_x v_z^{t+1/2}$;  $\sigma_{xz}^{t+1,z} \leftarrow (1 - dt\, d_z)\, \sigma_{xz}^{t,z} + dt\, \mu\, \partial_z v_x^{t+1/2}$;  $\sigma_{xz}^{t+1} \leftarrow \sigma_{xz}^{t+1,x} + \sigma_{xz}^{t+1,z}$

Algorithm 2 Loop for elastic RNN FWI
1: Set trainable parameters: $\lambda$, $\mu$, $\rho$ in this test.
2: Set optimizers for the parameters:
Optimizer$_\lambda$, Optimizer$_\mu$ and Optimizer$_\rho$ for $\lambda$, $\mu$ and $\rho$, respectively.
3: for iter $\in [1, maxiter]$ or not converged do
4:   $D_{syn}$ = RNN($\lambda$, $\mu$, $\rho$): generate synthetic data
5:   loss = costFunc($D_{syn}$, $D_{obs}$): calculate the misfit
6:   loss.backward(): backpropagate and produce gradients for the parameters
7:   optimizers.step(): update the parameters
8: end for
Figure 1: Forward propagation through an example RNN. Cells are identical in form, with the output of one being the input of the next. $O = [O_1, O_2, O_3, \cdots]$ are the internal variables in each cell; $S = [S_1, S_2]$ are the inputs; $P = [P_1, P_2]$ are the outputs; $L = [L_1, L_2]$ are the labeled data; $W = [W_1, W_2]$ are the trainable weights. The black dashed line indicates forward propagation.

Figure 2: Residual backward propagation through an example RNN. $R = [R_1, R_2]$ are the residuals at each RNN cell, which are calculated using the absolute error ($l_1$ norm). The gray solid line shows how gradients are calculated using backpropagation from the residual to the trainable weights in the RNN cell along the computational graph.

Figure 3: The structure of each RNN cell. In this figure, $\partial_x \sigma_{xx}^t$, $\partial_z \sigma_{zz}^t$, $\partial_x \sigma_{xz}^t$, $\partial_z \sigma_{xz}^t$, $\partial_x v_x^{t+1/2}$, $\partial_z v_x^{t+1/2}$, $\partial_x v_z^{t+1/2}$, $\partial_z v_z^{t+1/2}$ are the internal variables; $v_x^{t-1/2}$, $v_z^{t-1/2}$, $\sigma_{xx}^t$, $\sigma_{zz}^t$, $\sigma_{xz}^t$ are communicated between the RNN cells; $\lambda$, $\mu$, $\rho$ are trainable parameters.

Figure 4: Velocity field forward and residual backward propagation under the formatting of the RNN. The shot records formed at each time step correspond to the output of the RNN cell (P in Figure 2). Observed shot records correspond to the labeled data (L in Figure 2). The residual shot record corresponds to the residual information (R in Figure 2). Backpropagation starts from the residual shot record.

Figure 5: (a) $\lambda$ gradient given by the adjoint-state method. (b) $\mu$ gradient given by the adjoint-state method. (c) $\rho$ gradient given by the adjoint-state method. (d) $\lambda$ gradient given by the AD method. (e) $\mu$ gradient given by the AD method. (f) $\rho$ gradient given by the AD method.

Figure 6: (a) true $V_P$ model, (b) initial $V_P$ model, (c) true $V_S$ model, (d) initial $V_S$ model, (e) true $\rho$ model, (f) initial $\rho$ model.

Figure 7: D-V parameterization inversion results. (a)-(d), $V_P$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (e)-(h), $V_S$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (i)-(l), $\rho$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively.

Figure 8: D-M parameterization inversion results. (a)-(d), $\lambda$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (e)-(h), $\mu$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (i)-(l), $\rho$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively.

Figure 9: D-S parameterization inversion results. (a)-(d), $C_{11}$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (e)-(h), $C_{44}$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively; (i)-(l), $\rho$ $l_2$, $l_2^{TV}$, $l_1$ and $l_1^{TV}$ norm inversion results respectively.
Figure 10: (a) true $V_P$ model, (b) initial $V_P$ model, (c) true $V_S$ model, (d) initial $V_S$ model, (e) true $\rho$ model, (f) initial $\rho$ model.

Figure 11: (a) true $V_P$ model, (b) initial $V_P$ model, (c) true $V_S$ model, (d) initial $V_S$ model, (e) true $\rho$ model, (f) initial $\rho$ model.

Figure 12: D-M parameterization inversion results. (a)-(c), $\lambda$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (d)-(f), $\mu$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (g)-(i), $\rho$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively.

Figure 13: Profiles through the recovered elastic models. (a) Vertical $\lambda$ profiles; (b) vertical $\mu$ profiles; (c) vertical $\rho$ profiles.

Figure 14: D-V parameterization inversion results. (a)-(c), $V_P$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (d)-(f), $V_S$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (g)-(i), $\rho$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively.

Figure 15: Profiles through the recovered elastic models. (a) Vertical $V_P$ profiles; (b) vertical $V_S$ profiles; (c) vertical $\rho$ profiles.

Figure 16: D-V parameterization model misfits. (a) $V_P$ model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms; (b) $V_S$ model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms; (c) $\rho$ model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms.

Figure 17: D-S parameterization inversion results. (a)-(c), $C_{11}$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (d)-(f), $C_{44}$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively; (g)-(i), $\rho$ $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norm inversion results respectively.

Figure 18: Profiles through the recovered elastic models based on the D-S parameterization. (a) Vertical
$C_{11}$ profiles; (b) vertical
$C_{44}$ profiles; (c) vertical $\rho$ profiles.

Figure 19: D-S parameterization model misfits. (a) $C_{11}$
model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms; (b) $C_{44}$
model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms; (c) $\rho$ model misfit using the $l_2$, $l_2^{TV}$ and $l_1^{TV}$ norms.

Figure 20: Noise-free and noisy shot records. (a) Noise-free record and record with Gaussian noise $std = 0.3$.
(b) Noise-free record and record with Gaussian noise $std = 0.5$. (c) Noise-free record and record with Gaussian noise $std = 1$.

Figure 21: Random noise testing inversion results. (a), (e), (i): noise-free inversion results for $V_P$, $V_S$, and density; (b), (f), (j): Gaussian noise $std = 0.3$ inversion results for $V_P$, $V_S$, and density; (c), (g), (k): Gaussian noise $std = 0.5$ inversion results for $V_P$, $V_S$, and density; (d), (h), (l): Gaussian noise $std = 1.0$ inversion results for $V_P$, $V_S$, and density.