Seismic Inversion by Hybrid Machine Learning
Yuqing Chen and Erdinc Saygin, Deep Earth Imaging Future Science Platform, CSIRO, Kensington, Australia. (September 16, 2020)
Running head:
Seismic Inversion by HML
ABSTRACT
We present a new seismic inversion method which uses deep learning (DL) features for subsurface velocity model estimation. The DL feature is a low-dimensional representation of the high-dimensional seismic data, which is automatically generated by a convolutional autoencoder (CAE) and preserved in its latent space. The low-dimensional DL feature contains the key information of the high-dimensional input seismic data. Therefore, instead of directly comparing the waveform differences between the observed and predicted data, as full waveform inversion (FWI) does, we measure their DL feature differences in the latent space of a CAE. The advantage of this low-dimensional comparison is that it is less prone to the cycle-skipping problem than FWI. The reason is that the automatically generated DL features mainly contain the kinematic information, such as traveltime, of the input seismic data when the latent space dimension is small. However, more dynamic information, such as the waveform variations, can be preserved in the DL feature when the latent space dimension becomes larger. Therefore we propose a multiscale inversion approach that starts with inverting the low-dimensional DL features for the low-wavenumber information of the subsurface velocity model, and then recovers its high-wavenumber details through inverting the high-dimensional DL features. However, there is no governing equation that contains both the velocity and DL feature terms in the same equation. Therefore we use automatic differentiation (AD) to numerically connect the perturbation of the DL features to the velocity perturbation. In other words, we connect a deep learning network with the wave-equation inversion by using the AD. We denote this hybrid connection as hybrid machine learning (HML) inversion.
Here, the AD replaces the complex math derivation of the gradient with a black box so anyone can do HML without having a deep geophysical background. One concern of the HML method is that it is expensive to solve the wave-equation inversion using the AD. To mitigate this problem, we propose a hybrid implementation approach that uses the AD only through the CAE and computes the velocity gradient using the imaging condition. This approach is computationally efficient and benefits from the quasi-linear misfit function at the same time. Numerical tests on both synthetic and real data show that the multiscale HML approach can effectively recover both the low- and high-wavenumber information of the subsurface velocity model.

INTRODUCTION
Full waveform inversion (FWI) is a powerful tool for inverting for a high-resolution subsurface model by minimizing the waveform differences between the observed and predicted data (Lailly and Bednar, 1983; Tarantola, 1984; Virieux and Operto, 2009; Simutė et al., 2016; Pérez Solano and Plessix, 2019). However, conventional FWI assumes that its forward modeling operator L includes all the physics of wave propagation in the real Earth. Moreover, a good initial model is essential for FWI, which requires that the time-lag between the observed and predicted data be smaller than half of the period. Otherwise FWI will suffer from the cycle-skipping problem, and the failure of these assumptions can lead FWI to converge to a local minimum. To mitigate these problems, an alternative solution is to invert the skeletonized data, such as the traveltime and peak frequency, rather than the whole waveform. The skeletonized data is a simplified form of the original data but still contains its key information. Therefore inverting the skeletonized data is less prone to the local minimum and can successfully recover the low-to-intermediate wavenumber information of the model of interest. Luo and Schuster (1991a,b) used the wave-equation solution to invert the first-arrival traveltime for the subsurface background velocity model. Dutta and Schuster (2016) inverted for the Qp model by minimizing the central/peak frequency differences between the observed and predicted early arrivals. Similarly, Li et al. (2017) utilized the peak frequency shifts of the surface waves to invert for the Qs model. Li et al. (2016) and Liu et al. (2018) found the optimal S-velocity model by using the dispersion curves associated with the surface waves. A more comprehensive introduction to skeletonized inversion can be found in Lu et al. (2017). The generation of the skeletonized data mentioned above is based on human knowledge and usually requires manual picking.
For a large dataset, this picking task is labor intensive and time consuming. Here we use a convolutional autoencoder (CAE) to automatically extract the skeletal information from seismic data, so no manual picking is required. The skeletal data, also known as the deep learning (DL) feature, is a low-dimensional representation of the high-dimensional input seismic trace, which contains the key information of the input data and is preserved in the latent space of the CAE. When the latent space dimension is small, the DL feature mainly contains the kinematic information of the seismic trace, such as traveltime. However, a high-dimensional latent space is capable of preserving both the kinematic and dynamic information of the seismic data. In this paper, we invert for the subsurface velocity by measuring the difference of the DL features between the observed and predicted data in the latent space of a well-trained CAE. We first invert the low-dimensional DL features for the low-wavenumber information of the subsurface velocity model. We then recover its high-wavenumber details by inverting the high-dimensional DL features. However, there is no governing equation that contains both the DL feature and velocity terms in the same equation. Therefore there is no way to connect the perturbation of the DL feature to the velocity perturbation directly. In other words, one cannot directly compute the velocity gradient associated with the DL feature difference misfit function. In this research, instead of using the connective function assumption (Chen and Schuster, 2020), we use automatic differentiation (AD) to automatically compute the derivative of the DL feature with respect to the velocity model. Therefore the AD technique can numerically connect the CAE network with the wave-equation inversion, where no assumptions are made.
We denote this hybrid connection technique as hybrid machine learning (HML) inversion.

The AD is a set of techniques to numerically evaluate the derivatives of a function specified by a computer program (Schuster, 2020). It uses the chain rule to break up the derivative of a complicated composite function into a chain of simple derivatives (Schuster, 2020). The AD is widely used in deep learning to compute the gradients of the model parameters and bias terms of a DL network. Moreover, the AD has shown its potential in solving inverse problems. Sambridge et al. (2007) showed several examples of using the AD to solve geophysical inverse problems, such as ray tracing. Hughes et al. (2019) showed that wave-equation forward modeling is equivalent to a recurrent neural network (RNN). Sun et al. (2020) used the AD as an alternative to the imaging condition to compute the FWI gradient. Therefore we use the AD as a perfect tool to connect the CAE network with the wave-equation inversion, where we only need to program the forward process (shown in Figure 1) from the velocity model v to the final misfit ε. Then the AD can automatically compute the derivative of the misfit with respect to the input velocity model. Here the AD replaces the complex math derivation of ∂ε/∂v with a black box so anyone can do HML without having a deep geophysical background. Moreover, the CAE network can be replaced by any other network and the wave equation can be replaced by other types of Newtonian equations to solve a variety of problems. No matter what changes are made, the AD can automatically compute the derivative of the misfit with respect to the model of interest.

Figure 1: The demonstration of the forward process of HML (velocity model v → wave-equation modeling → deep learning skeletal DL features → misfit ε).

In HML, a convolutional autoencoder (CAE) is first trained by the seismic traces to learn their low-dimensional DL features that contain the key information of the input seismic traces.
We then compute the L2 misfit ε of the DL features between the observed and synthetic data in the latent space of the well-trained CAE. Next, the AD computes the velocity gradient ∂ε/∂v automatically and we use the gradient descent method to update the velocity model. However, one concern of the HML method is that it is computationally expensive to use the AD to solve the wave-equation inversion, because it needs to compute at least nt × N local derivatives, where nt is the simulation time in time samples and N defines the model size in grid points. For a large 3D model, this computation becomes nearly impossible. As an alternative, we use the AD only through the CAE to compute ∂ε/∂d, where d represents the predicted data. We then use the imaging condition to compute the velocity gradient, where ∂ε/∂d is used as the virtual source for constructing the backward-propagated wavefield, which is then zero-lag cross-correlated with the forward wavefield. This hybrid implementation approach enjoys both the computational efficiency and the quasi-linear property of the HML misfit function at the same time, which gives HML the potential for solving large-scale inversion problems. Numerical tests on both synthetic and real data show that the HML approach can successfully recover the low- and high-wavenumber information of the subsurface velocity model in a multiscale way.

THEORY

Convolutional autoencoder
A convolutional autoencoder (CAE) is an unsupervised neural network that is trained to learn the low-dimensional representation of high-dimensional input data. An example of a typical 1D CAE architecture is shown in Figure 2, where the pink, yellow, and purple boxes represent the encoder network, the latent space, and the decoder network, respectively. The encoder network includes three convolutional layers with an increasing number of channels C and a decreasing length L. Usually, "convolution" + activation function + pooling operations exist between each pair of convolutional layers and decide the channel size and length of the next convolutional layer. The data in the last convolutional layer, with the size of C × L, needs to be flattened to a vector with the size of (C × L) × 1 before entering the latent space. The decoder then reshapes the latent vector and mirrors the encoder to reconstruct the input.

Figure 2: An example of a typical 1D CAE architecture.
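The encoder/latent/decoder structure described above can be sketched in a few lines of PyTorch. The layer counts, channel sizes, kernel widths, and the trace length nt = 512 below are illustrative assumptions, not the authors' exact network; only the overall shape (three conv + pool stages, a flatten into a small latent space of dimension la, and a mirrored decoder) follows the description.

```python
# Minimal 1D convolutional autoencoder sketch (assumed sizes, not the paper's exact net).
import torch
import torch.nn as nn

class CAE(nn.Module):
    def __init__(self, nt=512, la=1):
        super().__init__()
        # Encoder: channels C increase while the trace length L shrinks (pooling by 2).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        flat = 32 * (nt // 8)                 # last layer flattened to (C × L) × 1
        self.to_latent = nn.Linear(flat, la)  # DL feature z lives here
        self.from_latent = nn.Linear(la, flat)
        # Decoder: mirror of the encoder, reshaping the latent vector back to a trace.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv1d(32, 16, 5, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(16, 8, 5, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(8, 1, 5, padding=2),
        )
        self.nt = nt

    def encode(self, d):                      # d: (batch, 1, nt) seismic traces
        return self.to_latent(self.encoder(d).flatten(1))

    def forward(self, d):
        h = self.from_latent(self.encode(d))
        return self.decoder(h.view(-1, 32, self.nt // 8))

cae = CAE(nt=512, la=1)
trace = torch.randn(4, 1, 512)   # four nt-sample traces
z = cae.encode(trace)            # low-dimensional DL features, shape (4, 1)
rec = cae(trace)                 # reconstruction, same shape as the input
```

Training such a network on the observed and predicted traces with a reconstruction loss (e.g. mean squared error) yields the latent DL features used throughout the rest of the paper.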
Automatic differentiation

Automatic differentiation (AD) is a technique that numerically estimates the derivative of a function specified by a computer program (Schuster, 2020). The AD treats any complicated function as a composition of elementary math operations, such as addition, multiplication, log, exp, etc. Therefore the AD uses the chain rule to break up the derivative of a complicated composite function into a chain of simple derivatives. Figure 3 shows an example of computing the derivatives of the function ε = (a + b) × c using the AD. This function is described by a computational graph in Figure 3a, where the yellow and white nodes indicate the computational and math operation nodes, respectively. In the forward operation, an intermediate term p is first generated to represent the result of a + b, and is then multiplied with c to get the output ε. In the backward operation, the AD first computes the derivative of ε with respect to the intermediate variable p, then calculates the derivative of the intermediate variable p with respect to each input variable. In general, the AD only computes the local derivative between a pair of nearby computational nodes that are directly connected to a math operation node. These computations start from the very final output and work back to the inputs; this procedure is also denoted as the reverse mode of the AD. Once the AD has computed all the local derivatives, a global derivative, such as ∂ε/∂a, can be acquired by multiplying the local derivatives along the corresponding computational path.

Figure 3: (a) The computational graph of the function ε = (a + b) × c and (b) the math operations of the computational graph. The forward part indicates the feedforward operation of the computational graph and the backward part indicates the reverse mode of the AD, where each local derivative is computed by the AD from the very final misfit ε back to the input variables.

Similarly, a neural network (NN) shown in Figure 4a can also be depicted by a computational graph, shown in Figure 4b. Here, w and x represent the model parameters of the NN and the input data, respectively. The forward operation in Figure 4c is very similar to the previous example except that the input variables are vectors. Here, g() represents an activation function, such as the sigmoid function 1/(1 + e^(−x)). To compute ∂ε/∂w, the AD computes each local derivative from the output back to the inputs, and ∂ε/∂w can be acquired by multiplying all the local derivatives together along the red path as ∂ε/∂w = (∂ε/∂p)(∂p/∂w).

Figure 4: A (a) neural network and (b) its computational graph. The (c) forward and backward operations of the AD.
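The toy function ε = (a + b) × c from Figure 3 can be differentiated in reverse mode by any AD framework. Here PyTorch autograd is used as the engine (an assumption; the paper does not name a specific framework), with the input values a = 2, b = 3, c = 4 chosen only for illustration:

```python
# Reverse-mode AD on ε = (a + b) × c, mirroring the computational graph of Figure 3.
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(4.0, requires_grad=True)

p = a + b        # intermediate node p, local derivatives ∂p/∂a = ∂p/∂b = 1
eps = p * c      # output ε, local derivatives ∂ε/∂p = c and ∂ε/∂c = p

eps.backward()   # reverse mode: chain local derivatives from ε back to the inputs

# a.grad = ∂ε/∂a = (∂ε/∂p)(∂p/∂a) = c = 4
# b.grad = ∂ε/∂b = 4, and c.grad = ∂ε/∂c = p = 5
```

Only the forward expressions are programmed; the backward pass and the multiplication of local derivatives along each path are performed automatically, which is exactly the property HML relies on.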
Hybrid machine learning inversion

Full waveform inversion (FWI) is a powerful tool for recovering a high-resolution subsurface velocity model by minimizing the waveform differences between the observed and predicted data. However, the FWI misfit function is often characterized by many local minima, which could be due to many reasons, such as: (1) the wave-equation forward modeling operator L cannot take into account all the physics in the real Earth; (2) the initial model is far away from the true model, so that the time-lag between the observed and predicted data is larger than half of the fundamental period and FWI suffers from the cycle-skipping problem. To mitigate these problems, instead of computing their waveform differences, we measure their low-dimensional deep learning (DL) feature differences in the latent space of a CAE:

ε = (1/2) Σ_s Σ_r Σ_k [z_k^obs(x_r, x_s) − z_k^pred(x_r, x_s)]^2,   (1)

where z_k^obs and z_k^pred represent the compressed DL features of the observed and predicted data in the k-th latent space dimension, and x_s and x_r indicate the locations of the source and receiver, respectively. When the latent space dimension is small, the compressed DL feature mainly contains the kinematic information of the seismic data, such as traveltime. Therefore, the HML misfit function in equation 1 is less prone to local minima compared to the FWI misfit function. The low-wavenumber information of the subsurface velocity model can be recovered by inverting these low-dimensional DL features. However, more dynamic information, such as the waveform variation, can be preserved in the DL feature when the latent space becomes larger. As a consequence, the HML method can recover the high-wavenumber information of the subsurface model. Therefore, we propose a multiscale HML inversion approach where we start from inverting the low-dimensional DL features for the low-wavenumber information of the subsurface model.
We then recover the high-wavenumber information by inverting the high-dimensional DL features. Similar to FWI, the velocity gradient γ(x) can be computed by taking the derivative of the misfit ε with respect to the velocity v:

γ(x) = −∂ε/∂v(x) = −Σ_s Σ_r Σ_k [ (∂Δz_k(x_r, x_s)/∂v(x))^T Δz_k(x_r, x_s) ].   (2)

However, there is no governing equation which contains both the velocity term v and the DL features z in the same equation, so there is no way to compute ∂z/∂v directly. Chen and Schuster (2020) proposed a Newtonian machine learning (NML) inversion which uses a connective function to connect the perturbation of the DL feature to the velocity perturbation. However, one problem of the connective function assumption is that, for a multi-dimensional latent space, each latent space dimension is characterized by a gradient, and the weighted sum of all these gradients is used for the velocity updates (Chen and Saygin, 2020). Therefore the complexity of NML in both the theoretical and computational aspects will increase when the latent space dimension increases.

Hughes et al. (2019) and Sun et al. (2020) showed that wave-equation modeling is equivalent to a recurrent neural network (RNN) and that the FWI gradient can be automatically calculated by the AD. Because CAE training also relies on the AD, the AD is a perfect tool to numerically connect a CAE architecture to the wave-equation inversion. Figure 5a shows the architecture of HML, where we first input a velocity model v and a source wavelet f into a wave-equation modeling module to generate the predicted data d_pred. We then use the encoder network of a well-trained CAE to compress the observed and synthetic data. Once we get their DL features in the latent space, we compute their L2 misfit using equation 1.
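Written as code, equation 1 is a sum of squared latent-space residuals over sources, receivers, and latent dimensions. The array names and the example dimensions (119 shots × 239 receivers, la = 1, as in the later layered-model test) are illustrative:

```python
# Equation 1 as code: ε = ½ Σ_s Σ_r Σ_k [z_k^obs(x_r, x_s) − z_k^pred(x_r, x_s)]².
import numpy as np

def hml_misfit(z_obs, z_pred):
    """L2 misfit of DL features; z arrays have shape (n_src, n_rec, la)."""
    dz = z_obs - z_pred          # Δz_k(x_r, x_s)
    return 0.5 * np.sum(dz ** 2)

# Illustrative sizes matching the crosswell layered-model test, la = 1.
rng = np.random.default_rng(0)
z_obs = rng.normal(size=(119, 239, 1))
z_pred = np.zeros_like(z_obs)
eps = hml_misfit(z_obs, z_pred)  # a single non-negative scalar
```

Differentiating this scalar with respect to z_pred gives −Δz, which is the residual that equation 2 back-projects onto the velocity model.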
This feedforward process can be described by a simplified computational graph shown in Figure 5b, where w represents the model parameters of the encoder network from a well-trained CAE. Here, L indicates the wave-equation modeling operation, and the symbols × and − represent the CAE encoding and misfit calculation operations, respectively. All three operations are composed of elementary math operations such as addition, multiplication, log, and so on, but we do not show their detailed computational graphs here, as that would be too complicated. Once the feedforward process from the velocity v to the misfit ε has been programmed, the AD can automatically compute each local derivative, such as ∂ε/∂z, from the very final misfit ε back to the input velocity model v. Therefore, the global derivative ∂ε/∂v, which is the velocity gradient of the HML misfit function, can be computed by multiplying together all of the local derivatives located on the computational path indicated by the red line in Figure 5b. In summary, the AD can automatically compute the velocity gradient once the feedforward process has been programmed, where no connective function assumption is required and there is no need to derive a complicated formula for the imaging condition. The AD replaces these complex math derivations with a black box so anyone can do HML without having a deep background in geophysics. Moreover, the CAE network can be replaced by any other deep learning architecture and the wave equation can be replaced by other Newtonian equations to solve a variety of problems. However, no matter what changes are made, the AD can still automatically compute the derivative of the misfit with respect to the model of interest.
Figure 5: The (a) architecture of hybrid machine learning and (b) its simplified computational graph.
Hybrid machine learning using a hybrid implementation approach
Using the AD to solve the wave-equation inversion is computationally expensive, because it needs to compute at least nt × N local derivatives, where nt is the simulation time in time samples and N defines the model size in grid points. For a large 3D inversion project, this computational task becomes nearly impossible. To mitigate this problem, we propose a hybrid implementation approach where we only use the AD through the CAE to compute ∂ε/∂d, and then use the imaging condition to compute the velocity gradient ∂ε/∂v. Here, the AD-computed derivative ∂ε/∂d is used as the virtual source to construct the backward-propagated wavefield, which is then zero-lag cross-correlated with the forward wavefield to generate the velocity gradient ∂ε/∂v. Figure 6 shows the computational graph of HML using the hybrid implementation approach, which is very similar to Figure 5b. The only difference is that the calculation of ∂d/∂v is replaced by the wave-equation inversion kernel L^T L. Therefore the velocity gradient ∂ε/∂v can be expressed as

∂ε/∂v = L^T L (∂z/∂d ∂ε/∂z).   (3)

Because the computational cost of ∂z/∂d and ∂ε/∂z is trivial compared to L^T L, the computational efficiency of HML with the hybrid implementation approach is approximately equal to that of a conventional inversion method, such as FWI. Moreover, HML is less prone to the local minima because it inverts the low-dimensional DL features, but it can also recover the high-wavenumber details through inverting the higher-dimensional DL features. This multiscale inversion strategy guarantees that HML with the hybrid implementation approach can efficiently recover a reliable subsurface velocity model for both its low- and high-wavenumber information.

Figure 6: The computational graph of hybrid machine learning with the hybrid implementation approach.
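The hybrid step above can be sketched as follows: AD is run only through the encoder to obtain the virtual source ∂ε/∂d, while the wave-equation part (back-propagation and the zero-lag imaging condition) is left as a placeholder comment. The tiny linear `encoder` stands in for a trained CAE encoder, and the shot-gather dimensions are illustrative assumptions:

```python
# AD through the encoder only: compute the virtual source ∂ε/∂d (hybrid approach sketch).
import torch
import torch.nn as nn

nt = 512
encoder = nn.Sequential(nn.Flatten(), nn.Linear(nt, 1))  # stand-in for a trained CAE encoder

d_pred = torch.randn(239, 1, nt, requires_grad=True)     # predicted shot gather (traces × 1 × nt)
d_obs = torch.randn(239, 1, nt)                          # observed shot gather

z_pred = encoder(d_pred)
with torch.no_grad():
    z_obs = encoder(d_obs)                               # observed features carry no gradient

eps = 0.5 * torch.sum((z_obs - z_pred) ** 2)  # HML misfit, equation 1
eps.backward()                                # AD only through the encoder, never through L

virtual_source = d_pred.grad                  # ∂ε/∂d, one adjoint trace per receiver
# Not coded here: inject `virtual_source` at the receivers, back-propagate it with the
# wave equation, and zero-lag cross-correlate with the forward wavefield to get ∂ε/∂v
# (the L^T L part of equation 3).
```

The expensive nt × N derivative bookkeeping of full-graph AD is thereby avoided; the AD cost is only that of one encoder backward pass per shot gather.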
NUMERICAL TESTS
In the numerical tests, HML with the hybrid approach is first tested on two synthetic datasets with a crosswell geometry. We then test this method using a field dataset collected at the Gulf of Aqaba with a surface geometry. In the descriptions below, la = n means that the latent space dimension equals n, where n is a positive integer.
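The multiscale recipe used in all of the tests below can be written as a short control loop. The four routines are hypothetical placeholders for operations the paper describes (wave-equation modeling, CAE re-training, DL-feature inversion, and a final FWI pass); they are stubbed here only so the loop structure is concrete and runnable:

```python
# Multiscale HML control flow (stubs stand in for the real geophysical routines).
def forward_model(v):              return v           # stub: wave-equation modeling
def train_cae(d_obs, d_pred, la):  return {"la": la}  # stub: re-train CAE at this scale
def hml_invert(v, d_obs, cae):     return v           # stub: invert DL features for v
def fwi(v, d_obs):                 return v           # stub: final high-resolution pass

def multiscale_hml(v0, d_obs, latent_dims=(1, 10)):
    """Invert la = 1 features first (low wavenumbers), then la = 10
    (high wavenumbers), then finish with FWI from the HML result."""
    v = v0
    for la in latent_dims:
        d_pred = forward_model(v)              # model data in the current velocity
        cae = train_cae(d_obs, d_pred, la=la)  # autoencoder is re-trained at each scale
        v = hml_invert(v, d_obs, cae)          # update v from the DL-feature misfit
    return fwi(v, d_obs)                       # recover the remaining details
```

In both synthetic tests and the field test, exactly this schedule is followed: la = 1, then la = 10, then conventional FWI initialized with the HML result.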
Layered model

A layered model with a linearly increasing background is used as the true model, which is shown in Figure 10a. Figure 10b shows the initial model, where the effective inversion area between z = 0.4 km and z = 2.2 km is set as a homogeneous model with a constant velocity equal to 3535 m/s. 119 acoustic shots are generated by a crosswell acquisition system where the source and receiver wells are located at x = 0.01 km and x = 1 km, respectively. The shots are evenly distributed in the source well at an interval of 20 m. Each shot has 239 receivers deployed in the receiver well at an equal spacing of 10 m. A 20 Hz Ricker wavelet is used as the source wavelet. Figures 7a and 7b show one example of the observed and predicted data, where most of the traces suffer from the cycle-skipping problem. Before the HML inversion, an autoencoder needs to be trained to learn the low-dimensional DL features of the input data. We use the seismic traces from the observed and predicted shot gathers as the training data to train an autoencoder with a latent space dimension equal to one. Here, each nt × 1 seismic trace is compressed into a one-dimensional DL feature.
Figure 7: One example of the (a) observed and (b) predicted common shot gathers.

Figures 8a and 8b show the compressed DL features and the traveltimes of the observed and predicted data, respectively. The compressed DL features are very similar to the traveltimes shown in Figure 8b. This similarity demonstrates that the compressed one-dimensional DL features mainly contain the kinematic information of the input seismic traces. We compute the misfit ε using equation 1 and then use the AD to automatically calculate the HML (la = 1) virtual source ∂ε/∂d, which is shown in Figure 9a. Compared to the NML virtual source shown in Figure 9b, which is computed by perturbing the DL feature differences between the observed and predicted data on the predicted shot gather trace by trace, the HML (la = 1) virtual source is very dissimilar in the waveform's shape. The reason is that the latent space dimension is too small to preserve the information of the waveform variations. This problem can be solved by using a larger-dimensional latent space. However, both the HML (la = 1) and NML virtual sources show an opposite waveform polarity on the left- and right-hand sides of a certain trace. The HML (la = 1) velocity gradient ∂ε/∂v is estimated by combining the HML (la = 1) virtual source with the imaging condition.

Figure 8: The (a) compressed one-dimensional DL features and (b) traveltimes.

Figure 9: The virtual sources of (a) HML and (b) NML.

Figure 10c shows the first-iteration gradient of HML (la = 1), which is dominated by the low-wavenumber updates. The HML (la = 1) and FWI inverted models are shown in Figures 11a and 11b, respectively, where the FWI result suffers severely from the cycle-skipping problem, especially in the deep part below z = 1.4 km. Figures 12a and 12b show the velocity profile comparisons at x = 0.5 km and x = 0.8 km, respectively, between the true, initial, HML (la = 1) inverted and FWI inverted velocity models, which are represented by the black, green, red and blue lines. They clearly show that HML (la = 1) has successfully recovered the low-wavenumber information of the velocity model. In contrast, the FWI inverted result is far away from the true model.

In the next step, we increase the latent space dimension to ten, and re-train the autoencoder using the observed data and the predicted data that are generated based on the HML (la = 1) inverted model. Figure 13a shows the computed HML (la = 10) virtual source, which is similar to the FWI virtual source shown in Figure 13b. Here, the FWI virtual source is computed by subtracting the predicted data from the observed data. This similarity is because the autoencoder can preserve both the kinematic and dynamic information, such as the traveltime and waveform variations, of the seismic traces by using a larger latent space. Figure 14a shows the HML (la = 10) inverted velocity model, where most of the high-wavenumber information has been recovered. To further recover the high-wavenumber details, we use this HML (la = 10) inverted result as the initial model for FWI inversion. Figure 14b shows the FWI inverted result, which has the best resolution among all these results.

SEAM model
Data calculated from a portion of the SEAM model, with a size of 157 × 135 grid points, are used to test the HML with the hybrid approach method. Figure 15a shows the true model, and a homogeneous model is used as the initial model, which is shown in Figure 15b. A source well is located at x = 0.01 km and there are 52 shots distributed in the well at an equal spacing of 30 m. Each shot includes 156 receivers, which are evenly deployed in the receiver well located at x = 1.35 km. The receiver interval is 10 m and a 20 Hz Ricker wavelet is used as the source.

Figure 10: The (a) true and (b) initial models. The (c) first-iteration gradient of HML.

Similar to the layered model test, an autoencoder with a one-dimensional latent space is first trained by the observed and predicted seismic traces to learn the one-dimensional DL features that contain the kinematic information of the seismic traces. Once the training is finished, we use HML (la = 1) with the hybrid approach to invert these DL features for the low-wavenumber information of the subsurface model. Figure 15c shows the first-iteration gradient of the HML (la = 1) inversion, which is dominated by the low-wavenumber updates. Figures 16a and 16b show the velocity models inverted by HML (la = 1) and FWI, respectively, where the FWI result suffers severely from the cycle-skipping problem. In comparison, the HML (la = 1) inverted model has successfully recovered the low-wavenumber information of the subsurface velocity model. This successful recovery can be further proved by the velocity profile comparisons at x = 0.5 km and x = 0.8 km, which are shown in Figures 17a and 17b, respectively.

Figure 11: The (a) HML and (b) FWI inverted velocity models.

The black, green, red, and blue curves represent the velocity profiles of the true, initial, HML (la = 1) inverted and FWI inverted velocity models, where the HML (la = 1) inverted result best matches the true model. However, the high-wavenumber information is still missing in the HML (la = 1) inverted result because the latent space dimension is too small to preserve the information of the waveform variations of the seismic data.

Following the multiscale strategy, we increase the latent space dimension to ten and re-train the autoencoder using the observed data and the predicted data that are generated based on the HML (la = 1) inverted model. We then invert the ten-dimensional DL features for the high-wavenumber information of the velocity model, and the HML (la = 10) inverted result is shown in Figure 18a. It shows an obvious resolution increase when compared to the HML (la = 1) inverted result. Finally, we use FWI to further recover the velocity details, and the inverted result is shown in Figure 18b, which shows a better resolution at depths above z = 0.6 km.
Figure 12: The velocity profile comparisons at (a) x = 0.5 km and (b) x = 0.8 km between the true model, initial model, HML and FWI inverted velocity models, which are represented by the black, green, red and blue lines, respectively.
Gulf of Aqaba field data
The field dataset was collected by a surface seismic survey at an alluvial fan on the Gulf of Aqaba coast in Saudi Arabia. A total of 120 shot gathers were collected at an equal spacing of 2.5 m. Each shot contains 120 traces evenly distributed along the seismic survey with a receiver interval of 2.5 m. Data were recorded using a 1 ms sampling rate for a total recording time of 0.5 s. A 200 lb weight drop was used as the source, with 10 to 15 stacks at each shot location (Hanafy et al., 2014). An example of a raw shot gather is shown in Figure 19a, which includes very strong surface wave energy and weak refraction events. We first remove the surface waves because we only consider inverting the P waves in this paper. We then bandpass the data to a peak frequency of 40 Hz. A processed shot gather is shown in Figure 19b, where only the refraction events remain. We further apply an amplitude damping along the time axis to highlight the early arrivals and attenuate the later arrivals. One example of the processed + damping shot gathers is shown in Figure 19c, where the early arrivals have been highlighted. A linearly increasing model, shown in Figure 21a, is used as the initial model.

Figure 13: The computed (a) HML virtual source with the latent space dimension equal to 10 and (b) the FWI virtual source.

According to the multiscale inversion strategy of HML, we first invert the low-dimensional DL features for the background velocity model. A CAE with a one-dimensional latent space is first trained using the seismic traces from the processed + damping shot gathers. The well-trained CAE can effectively compress each nt × 1 seismic trace into a one-dimensional DL feature. Figure 20a shows the DL feature map of the observed data.
X (km) (b) FWI vel model
Figure 14: The (a) HML ( la = 10) inverted velocity model using the previous HML ( la = 1)inverted result as the initial model. The (b) FWI inverted result which uses (a) as the initialmodel. X (km) (c) 1st itr gradient (la=1)
X (km) D e p t h ( k m ) (a) True vel model X (km) (b) Initial vel model
Figure 15: The (a) true and (b) initial model. The (c) first iteration gradient of HML (la = 1) inversion.

Figure 20d shows the traveltime map of the observed data, which shows a similar pattern to Figure 20a. Figures 20b and 20e show the DL feature and traveltime maps of the predicted data, which also show a similar pattern. The most obvious similarity between the DL features and traveltimes can be seen in their difference maps shown in Figures 20c and 20f, respectively. Both the DL feature and traveltime differences show that the major difference between the observed and predicted data is confined to the area between certain shot positions.

Figure 16: The (a) HML (la = 1) and (b) FWI inverted results.

Figure 17: The velocity profile comparisons at (a) x = 0.5 km and (b) x = 0.8 km between the true model, initial model, HML and FWI inverted velocity models, which are represented by the black, green, red and blue lines, respectively.
Figure 18: The (a) HML (la = 10) inverted velocity model which uses the HML (la = 1) inverted result as the initial model. The (b) FWI inverted velocity model which uses the HML (la = 10) inverted result as the initial model.

The DL features do contain the kinematic information of the seismic traces, which is similar to the traveltime. Figure 21b shows the velocity model inverted by the wave-equation traveltime (WT) inversion, which reveals a dipping interface between the upper low-velocity layer and the bedrock. This dipping feature arises because the mountain and the sea are located on the left- and right-hand sides of the seismic survey, respectively. The dipping can be seen more clearly in the HML (la = 1) inverted result shown in Figure 21c. After inversion, we generate a new set of predicted shot gathers based on the HML (la = 1) inverted velocity model. The DL feature and traveltime maps of the new predicted data are shown in Figures 22b and 22e, which are similar to their corresponding observed maps shown in Figures 22a and 22d. Their differences are shown in Figures 22c and 22f, which are much smaller than the initial differences shown in Figures 20c and 20f. The reduced differences indicate that the HML (la = 1) inverted velocity model is closer to the true model than the initial model.
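The traveltime maps compared above can be built by picking the first arrival of every trace and arranging the picks over shot and receiver indices. The paper does not specify its picking method; the `pick_first_arrival` threshold picker below is a simple hypothetical stand-in:

```python
import numpy as np

def pick_first_arrival(trace, dt, threshold_ratio=0.05):
    """Return the time of the first sample whose amplitude exceeds
    threshold_ratio * max|amplitude| -- a crude first-break picker."""
    amp = np.abs(trace)
    idx = np.argmax(amp > threshold_ratio * amp.max())
    return idx * dt

def traveltime_map(gathers, dt):
    """gathers: (nshot, nrec, nt) array -> (nshot, nrec) traveltime map."""
    nshot, nrec, _ = gathers.shape
    tmap = np.zeros((nshot, nrec))
    for s in range(nshot):
        for r in range(nrec):
            tmap[s, r] = pick_first_arrival(gathers[s, r], dt)
    return tmap

# Synthetic demo: impulses placed at known first-break samples.
dt, nt = 1e-3, 500
idx = np.array([[100, 200], [150, 250]])   # known arrival samples (nshot=2, nrec=2)
gathers = np.zeros((2, 2, nt))
for s in range(2):
    for r in range(2):
        gathers[s, r, idx[s, r]] = 1.0
tmap = traveltime_map(gathers, dt)
# tmap recovers the known arrival times idx * dt
```

Subtracting the map of the predicted data from that of the observed data gives the difference maps of Figures 20f and 22f.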
Figure 19: An example of a (a) raw and (b) processed shot gather. (c) is the processed shot gather with damping along the time axis.
Figure 20: The DL features of the (a) observed and (b) predicted data, where the predicted data are generated based on the initial model. The (c) DL feature differences between the observed and predicted data. The traveltimes of the (d) observed and (e) predicted data. (f) Their traveltime differences.

Figure 21: The (a) initial model and (b) the model inverted by the wave-equation traveltime inversion method. The (c) HML (la = 1) inverted velocity model with the latent space dimension equal to one.

In the next step, we use the HML (la = 1) inverted velocity model as the initial model and start to recover the high-wavenumber information of the subsurface model. We increase the latent space dimension to twenty and re-train the autoencoder using the seismic traces from the processed shot gathers. The reason we use the processed rather than the processed and damped shot gathers for training is that the twenty-dimensional latent space is capable of preserving the kinematic and dynamic information for both the early and later P-wave events. The HML (la = 20) inverted velocity model is shown in Figure 23b, which reveals more high-resolution details than the HML (la = 1) inverted result.
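The effect of enlarging the latent space can be illustrated with a linear stand-in for the CAE: truncated SVD compression of a set of traces. This is not the authors' network, only a sketch showing why a larger latent dimension preserves more of the waveform (dynamic) information; the trace parameters are illustrative:

```python
import numpy as np

def ricker(t, f):
    """Ricker wavelet with peak frequency f (Hz)."""
    a = (np.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

# A set of traces with varying arrival times and amplitudes.
dt, nt = 1e-3, 500
t = np.arange(nt) * dt
rng = np.random.default_rng(1)
shifts = rng.uniform(0.1, 0.3, size=200)    # arrival times (s)
amps = rng.uniform(0.5, 2.0, size=200)
X = np.stack([a * ricker(t - s, 40.0) for s, a in zip(shifts, amps)])

def compress_reconstruct(X, k):
    """Keep k SVD components ('latent dimensions') and reconstruct."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k]

err1 = np.linalg.norm(X - compress_reconstruct(X, 1))
err20 = np.linalg.norm(X - compress_reconstruct(X, 20))
# the 20-dimensional compression reproduces the waveforms far better
```

The one-component reconstruction retains only the dominant variation across traces, while twenty components capture most of the waveform detail, which mirrors the multiscale use of la = 1 and la = 20 in the inversion.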
Figure 22: The DL features of the (a) observed and (b) predicted data, where the predicted data are generated based on the HML (la = 1) inverted result. The (c) DL feature differences between the observed and predicted data. The traveltimes of the (d) observed and (e) predicted data. (f) Their traveltime differences.

Low- and high-velocity anomalies appear in the regions between x = 80 m and x = 130 m and between x = 230 m and x = 280 m, respectively. There is also a velocity discontinuity at x = 140 m. This discontinuity could be caused by an active fault which has been identified by Hanafy et al. (2014). Figure 23c shows the FWI inverted model which uses the HML (la = 20) inverted result as the initial model. The FWI method only slightly increases the velocity resolution, which means the HML (la = 20) inverted result is already good enough. Figures 24a, 24b and 24c show the HML (la = 1), HML (la = 20) and FWI inverted velocity models overlaid with their contour lines. The contour lines around x = 140 m in Figures 24b and 24c point downward, which further highlights the velocity discontinuity in this region. We mark the possible fault with the white line in these figures.
Figure 23: (a) The HML (la = 1) inverted velocity model. (b) The HML (la = 20) inverted velocity model which uses (a) as the initial model. (c) The FWI inverted result which uses (b) as the initial model.

CONCLUSION
Figure 24: (a) The HML (la = 1) inverted velocity model with overlaid contour lines. (b) The HML (la = 20) inverted velocity model with overlaid contour lines, which uses (a) as the initial model. (c) The FWI inverted result with overlaid contour lines, which uses (b) as the initial model. The white line indicates the fault.

We present a seismic inversion method that inverts the deep learning (DL) features for the subsurface velocity model. The DL feature is a low-dimensional representation of the high-dimensional seismic data, which is automatically generated by a convolutional autoencoder (CAE) and preserved in the latent space. When the latent space dimension is small, the DL feature mainly contains the kinematic information, such as the traveltime, of the input seismic data. However, both the kinematic and dynamic information, such as the traveltime and waveform variations, can be preserved in the DL features by using a larger latent space. Therefore, we propose a multiscale inversion approach which starts with inverting the low-dimensional DL features for the low-wavenumber information of the subsurface model, and then recovers its high-wavenumber details through inverting the high-dimensional DL features. Because there is no governing equation that contains both the velocity and DL feature terms in the same equation, we use automatic differentiation (AD) to numerically connect these two terms. In other words, we use the AD to numerically connect the CAE network with the wave-equation inversion. One can replace the CAE network with any type of deep learning architecture and connect it with any type of governing equations by using the AD to solve various problems. We denote this hybrid connection through the AD as hybrid machine learning (HML).
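The AD linkage can be made concrete with a toy reverse-mode AD engine in pure Python. The wave-equation forward model and the CAE encoder are replaced by simple scalar stand-in functions, so this is only a sketch of the mechanism, not the paper's implementation:

```python
# Minimal reverse-mode AD over scalars: each Value records its parents
# and a local backward rule; backward() propagates d(misfit)/d(node)
# from the final misfit back to every input.
class Value:
    def __init__(self, data, parents=(), backward=lambda g: ()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, backward

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other),
                     lambda g: (g, g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        topo, seen = [], set()
        def build(v):                      # reverse topological order
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, g in zip(v._parents, v._backward(v.grad)):
                p.grad += g

# Toy chain mimicking HML: velocity v -> predicted data d(v) ->
# DL feature f(d) -> misfit eps = (f - f_obs)^2.
v = Value(2.0)
d = v * 3.0 + 1.0      # stand-in for the wave-equation forward model
f = d * d              # stand-in for the CAE encoder
r = f + (-36.0)        # feature residual against the "observed" feature
eps = r * r
eps.backward()
# v.grad now holds d(eps)/dv, obtained purely by the AD chain rule
```

The same sweep, with the scalar stand-ins replaced by a wave-equation solver and the CAE, is what connects the DL feature perturbation to the velocity perturbation in HML.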
This method would be appreciated by geophysical novices because the AD replaces the complex mathematical derivation with a black box, so anyone can apply HML without having a deep background in geophysics. However, one concern of the HML method is its computational cost, because it is expensive to use the AD to solve the wave-equation inversion. Therefore we also propose a hybrid implementation approach which gives HML the same level of computational efficiency as conventional wave-equation methods, such as full waveform inversion (FWI). This hybrid implementation approach gives HML the potential to solve very large-scale inversion problems.

ACKNOWLEDGEMENT
This research was fully funded by the Deep Earth Imaging Future Science Platform, CSIRO.
REFERENCES
Chen, Y., and E. Saygin, 2020, Seismic inversion by multi-dimensional Newtonian machine learning (under review): Geophysics.

Chen, Y., and G. T. Schuster, 2020, Seismic inversion by Newtonian machine learning: Geophysics, 1–59.

Dutta, G., and G. T. Schuster, 2016, Wave-equation Q tomography: Geophysics, R471–R484.

Hanafy, S. M., S. Jonsson, and Y. Klinger, 2014, Imaging normal faults in alluvial fans using geophysical techniques: Field example from the coast of Gulf of Aqaba, Saudi Arabia: 4670–4674.

Hughes, T. W., I. A. Williamson, M. Minkov, and S. Fan, 2019, Wave physics as an analog recurrent neural network: Science Advances, eaay6946.

Lailly, P., and J. Bednar, 1983, The seismic inverse problem as a sequence of before stack migrations: Conference on Inverse Scattering: Theory and Application, 206–220.

Li, J., G. Dutta, and G. Schuster, 2017, Wave-equation Qs inversion of skeletonized surface waves: Geophysical Journal International, 979–991.

Li, J., Z. Feng, and G. Schuster, 2016, Wave-equation dispersion inversion: Geophysical Journal International, 1567–1578.

Liu, Z., J. Li, S. M. Hanafy, and G. Schuster, 2018, 3D wave-equation dispersion inversion of surface waves: 4733–4737.

Lu, K., J. Li, B. Guo, L. Fu, and G. Schuster, 2017, Tutorial for wave-equation inversion of skeletonized data: Interpretation, SO1–SO10.

Luo, Y., and G. T. Schuster, 1991a, Wave equation inversion of skeletalized geophysical data: Geophysical Journal International, 289–294.

Luo, Y., and G. T. Schuster, 1991b, Wave-equation traveltime inversion: Geophysics, 645–653.

Pérez Solano, C., and R.-É. Plessix, 2019, Velocity-model building with enhanced shallow resolution using elastic waveform inversion: An example from onshore Oman: Geophysics, R977–R988.

Sambridge, M., P. Rickwood, N. Rawlinson, and S. Sommacal, 2007, Automatic differentiation in geophysical inverse problems: Geophysical Journal International, 1–8.

Schuster, G. T., 2020, Practical machine learning methods in geosciences: Society of Exploration Geophysicists.

Simutė, S., H. Steptoe, L. Cobden, A. Gokhberg, and A. Fichtner, 2016, Full-waveform inversion of the Japanese Islands region: Journal of Geophysical Research: Solid Earth, 3722–3741.

Sun, J., Z. Niu, K. A. Innanen, J. Li, and D. O. Trad, 2020, A theory-guided deep-learning formulation and optimization of seismic waveform inversion: Geophysics, R87–R99.

Tarantola, A., 1984, Inversion of seismic reflection data in the acoustic approximation: Geophysics, 1259–1266.

Virieux, J., and S. Operto, 2009, An overview of full-waveform inversion in exploration geophysics: Geophysics, WCC1–WCC26.