Bayesian geoacoustic inversion using mixture density network
Submitted to
Geophys. J. Int.
Guoli Wu, Hefeng Dong, Junqiang Song and Jingya Zhang
College of Meteorology and Oceanography, National University of Defense Technology, Changsha, China
Department of Electronic Systems, Norwegian University of Science and Technology, Trondheim, Norway
Department of Geoscience and Petroleum, Norwegian University of Science and Technology, Trondheim, Norway
Research Institute of Petroleum Exploration & Development, PetroChina, Beijing, China
SUMMARY
Bayesian geoacoustic inversion problems are conventionally solved by Markov chain Monte Carlo methods or their variants, which are computationally expensive. This paper extends the classic Bayesian geoacoustic inversion framework using the mixture density network (MDN), which provides a much more efficient way to solve geoacoustic inversion problems in a Bayesian inference framework. Some important geoacoustic statistics of Bayesian geoacoustic inversion are derived from the multidimensional posterior probability density (PPD) using the MDN theory. These statistics make it convenient to train the network directly on the whole parameter space and obtain the multidimensional PPD of the model parameters. The network is trained on a simulated dataset of surface-wave dispersion curves with shear-wave velocities as labels. The results show that the network gives reliable predictions and has good generalization performance on unseen data. Once trained, the network can rapidly (within seconds) give a fully probabilistic solution which is comparable to Monte Carlo methods. It provides a promising approach for real-time inversion.
Key words:
Bayesian inversion – Mixture density network – Shear-wave velocity.
Geoacoustic inversion focuses on the reconstruction of unknown geoacoustic model parameters (such as wave speed, density, attenuation and layer thickness), which cannot be directly observed, from a set of observations. It is a challenging and representative inverse problem because of its high nonlinearity and non-uniqueness, and it has drawn a great deal of attention in the seismic and underwater acoustics communities. Some researchers use linearized inversions to approximate the nonlinear relation between the model and the observations by a linearized relation (Tarantola & Valette 1982; Caiti et al. 1994; Zhdanov 2002; Bensen et al. 2009; Dong et al. 2010). They apply regularization techniques to deal with the ill-conditioning problem and seek an approximate solution which minimizes the misfit using optimization algorithms. Other researchers use metaheuristic algorithms, such as genetic algorithms (Gerstoft 1994a; Heard et al. 1998), simulated annealing algorithms (Collins et al. 1992; Dosso et al. 1993; Gerstoft & Michalopoulou 2000) and hybrid inversion algorithms (Gerstoft 1995; Musil et al. 1999; Dosso et al. 2001), to search the model space and approximate the globally optimal solutions. These nonlinear optimization methods are better than linearized inversions at finding the global model that best fits the observations. However, the lack of quantitative uncertainty estimates limits the understanding of the solutions, especially when the solution of the data is non-unique.

Bayesian inference provides a probabilistic framework to quantify parameter uncertainties for inversion problems. In a Bayesian formulation, the general solution to an inversion problem is expressed as the posterior probability density (PPD) of the model parameters given the observed data.
The model parameter estimates and uncertainties can be characterized by the moments of the multidimensional PPD, such as the optimal model estimates (e.g., the posterior mean model and the maximum a posteriori (MAP) model), the marginal probability density function (PDF) of each model parameter and the inter-parameter relationships (covariance matrix). Bayesian inversion has been applied to solve geoacoustic inverse problems since the 1990s (Gerstoft 1994b; Mosegaard & Tarantola 1995) and has been widely used during the past two decades (e.g., Dosso 2002; Dosso & Wilmut 2006; Bodin & Sambridge 2009; Dong & Dosso 2011; Dosso & Dettmer 2011; Steininger et al. 2014; Galetti et al. 2017; Zhang et al. 2018). In Bayesian inversion, how to efficiently obtain asymptotically unbiased samples from the PPD is a key issue. In most publications, sampling-based methods, such as Markov chain Monte Carlo algorithms and their variants, are normally used to sample from the PPD. However, the sampling-based methods are computationally intensive and their computational cost is extremely high.

The mixture density network (MDN) is a type of neural network which combines conventional neural networks with a mixture density model (Bishop 1994). Different from conventional neural networks, the MDN provides full statistical properties of the solution and is suitable for solving such problems.
Let m denote the c-dimensional vector of model parameters with entries m_i and d denote the N-dimensional vector of observations with entries d_i. According to Bayes' theorem, the posterior probability can be written as

P(m | d) = P(d | m) P(m) / P(d).  (1)

P(m | d) denotes the conditional PDF of the parameters m given the observations d and is known as the PPD. P(m) denotes the prior PDF, which represents the estimate of the probability of m independent of the observations d. P(d) is the PDF of the observations d and can normally be treated as a constant. P(d | m) denotes the conditional PDF of the observations d given the parameters m. When the observations d are fixed, it can be interpreted as the likelihood function and generally be written as

L(m) = P(d | m) ∝ exp[−E(m)],  (2)

where E(m) denotes the misfit function. Some researchers have derived several forms of E(m) under appropriate assumptions on the distribution of the observation errors (Dosso 2002; Dosso & Dettmer 2011; Dong & Dosso 2011). If Equation 2 is substituted back into Equation 1, Equation 1 can be rewritten as

P(m | d) = k P(m) exp[−E(m)],  (3)

where k is a normalisation constant. Taking the logarithm of both sides leads to an expression for the misfit function

E(m) = −log P(m | d) + log[k P(m)],  (4)

where log is the natural logarithm. The multidimensional PPD is considered as a general solution to Bayesian inversion.
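As a toy illustration of Equations 3-5, the unnormalized PPD can be evaluated on a grid for a one-dimensional parameter. This is a sketch under assumed settings, not the paper's code: the forward model, noise level and uniform prior below are hypothetical.

```python
import numpy as np

# Illustrative sketch: evaluate P(m|d) ∝ P(m) exp[-E(m)] (Equation 3) on a
# 1-D grid, with a hypothetical forward model and a Gaussian-error misfit.

def forward(m):
    # hypothetical nonlinear forward model d = f(m)
    return np.array([np.sin(m), m ** 2])

d_obs = forward(1.2) + 0.01     # slightly perturbed synthetic "observation"
s = 0.1                         # assumed data standard deviation

grid = np.linspace(0.0, 2.0, 401)          # uniform prior: P(m) constant
E = np.array([np.sum((d_obs - forward(m)) ** 2) / (2 * s ** 2) for m in grid])
ppd = np.exp(-E)                           # Equation 3 up to the constant k
ppd /= np.sum(ppd) * (grid[1] - grid[0])   # normalize to unit area

m_map = grid[np.argmax(ppd)]               # grid-search MAP model (Equation 5)
```

For realistic dimensionalities such a grid search is infeasible, which is exactly why sampling methods, or the MDN approach below, are needed.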
Based on the PPD, some important statistics, such as the MAP model, mean model, marginal PDFs and model covariance matrix (Dosso & Dettmer 2011), can be defined as

m̂ = argmax_m P(m | d),  (5)

m̄ = ∫ m P(m | d) dm,  (6)

P(m_i | d) = ∫ δ(m_i − m′_i) P(m′ | d) dm′,  (7)

P(m_i, m_j | d) = ∫ δ(m_i − m′_i) δ(m_j − m′_j) P(m′ | d) dm′,  (8)

C_m = ∫ (m − m̄)(m − m̄)^T P(m | d) dm,  (9)

where δ denotes the Dirac delta function. Note that ∫ means integrating over the entire parameter space. Inter-parameter relationships can be computed in a normalized form as

R_ij = C_m,ij / √(C_m,ii C_m,jj).  (10)

Generally speaking, analytical solutions to Equations 5-9 are not available for nonlinear inverse problems.

Assume that we have a data set D = {(d_i, m_i) : i = 1, ..., N_D}, where N_D is the number of samples, and d_i and m_i are the observation vector and the corresponding model parameter vector in the i-th sample, respectively. A neural network can be trained to learn the underlying transformation function between input and output spaces. For the data set D, if we take m as the input and d as the output, the neural network learns a forward problem. Conversely, it learns an inverse problem. The central goal of the learning is that, when fed a new input, the neural network can provide an appropriate prediction as an output. For nonlinear inverse problems involving non-unique solutions and prediction of continuous variables, conventional neural networks have limited capability to give adequate statistical information about the solution, while the MDN overcomes this limitation and supplies a framework which has the flexibility to model arbitrary probability distributions (Bishop 1994; Meier et al. 2007a).
The basic idea of the MDN is to model the PPD using a linear combination of Gaussian kernel functions as

P(m | d) = Σ_{l=1}^{M} α_l(d) φ_l(m | d),  (11)

where α_l(d) denotes the mixing coefficients representing the relative importance of each kernel, M denotes the number of Gaussian kernels used in the mixture and φ_l(m | d) is the Gaussian kernel function with the form

φ_l(m | d) = 1 / [(2π)^{c/2} σ_l(d)^c] exp{ −‖m − μ_l(d)‖² / (2σ_l(d)²) }.  (12)

Equation 12 is a multivariate normal distribution of m, in which μ_l(d) is the mean, representing the center of the l-th kernel, and σ_l(d)² is the common variance. More generally, the common variance can be replaced by a full covariance matrix, leading to a more complex form of Equation 12. However, this is not necessary, because a model with the simplified kernels (Equation 11) has the potential to approximate any PDF to arbitrary accuracy (McLachlan & Basford 1988).
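The mixture PPD of Equations 11-12 can be evaluated in a few lines. The sketch below is illustrative (the array shapes are assumptions, not the paper's code): one isotropic Gaussian kernel per mixture component, each with its own mixing coefficient.

```python
import numpy as np

# Sketch of Equations 11-12: evaluate the mixture PPD for one observation,
# given the M kernel parameters output by the network.
def mixture_ppd(m, alpha, mu, sigma):
    """m: (c,) model vector; alpha: (M,) weights; mu: (M, c); sigma: (M,)."""
    c = m.size
    # isotropic Gaussian kernels with common variance sigma^2 (Equation 12)
    norm = (2 * np.pi) ** (c / 2) * sigma ** c
    sq = np.sum((m - mu) ** 2, axis=1)
    phi = np.exp(-sq / (2 * sigma ** 2)) / norm
    return np.sum(alpha * phi)            # weighted sum (Equation 11)
```

For a single kernel (M = 1, α = 1) with σ = 1 centered at the origin, the value at m = 0 in a 2-parameter space is 1/(2π), the peak of a standard bivariate normal.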
Additionally, according to Bishop (1994), the mixing coefficients are required to satisfy the constraint

Σ_{l=1}^{M} α_l(d) = 1.  (13)

Note that α_l(d) and σ_l(d) are scalar functions of d, and μ_l(d) is a vector function of d with the same dimensionality c as m. The M sets of parameters α_l(d), σ_l(d) and μ_l(d) fully define the model and can be used to approximate any possible PPD given a sufficient number of kernels. Therefore, an MDN takes these parameters as its output. Once fed an input d_i, the trained MDN outputs M sets of parameters, a total of M(c + 2) values.

During the training of the MDN on the data set D, the parameters of the network are tuned in order to find the parameters which best fit the training data. The maximum likelihood principle is usually adopted. Equation 2 indicates that maximizing the likelihood L(m) is equivalent to minimizing the misfit function E(m). In the expression of the misfit function (Equation 4), P(m) is the prior and independent of the observations, and k is a constant, making the last term log[kP(m)] a constant factor in the training process. Therefore, we can define a loss (or cost) function Loss, which quantifies how well the network fits the training data, as
Loss = −log P(m | d) = −log{ Σ_{l=1}^{M} α_l(d) φ_l(m | d) }.  (14)

Once we find the network parameters which minimize the loss function Loss, we can compute the PPD, which forms the solution of the inverse problem as a function of m, using Equation 11.

Suppose an appropriately designed MDN has been trained. We can then obtain a multidimensional PPD of the form of Equation 11 as soon as new observation data d are fed to the network. Knowing the multidimensional PPD, it is possible to compute the statistics in Equations 5-9 using numerical approaches. However, we propose to derive these statistics further using the MDN theory and make this process more efficient and straightforward.

The MAP model, defined in Equation 5, involves locating the global maximum in the multidimensional parameter space. It is a nonlinear optimization problem and can be time-consuming when the dimensionality of m is large. From Equations 5, 11 and 12 we have

m̂ = argmax_m P(m | d) = 1/(2π)^{c/2} argmax_m { Σ_{l=1}^{M} [α_l(d) / σ_l(d)^c] exp[ −‖m − μ_l(d)‖² / (2σ_l(d)²) ] }.  (15)

Note that the exponential term in Equation 15 is nonnegative and attains its maximum when m = μ_l(d). If M = 1, the solution is simply m̂ = μ(d). If M > 1, it becomes more complicated. Bishop (1994) provides a fast approximation under the assumption that the M kernel functions are well separated from each other. The approximation to the MAP model is the center μ_l of the kernel which satisfies

max_l { α_l(d) / σ_l(d)^c }.  (16)

This makes sense because, when the kernel functions are well separated, the MAP model is dominated by the kernel with the largest weight α_l(d)/σ_l(d)^c, considering that the maximum value of the exponential term in Equation 15 is 1 for all kernels.
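A numerically stable way to compute the per-sample loss of Equation 14 is the log-sum-exp trick; Bishop's fast MAP approximation of Equation 16 is a simple argmax. The helper functions below are a sketch with assumed array shapes, not the paper's implementation:

```python
import numpy as np

# Sketch: the negative log-likelihood of Equation 14 computed via
# log-sum-exp (avoids underflow of the kernel densities), and the fast
# MAP approximation of Equation 16 for well-separated kernels.
def mdn_loss(m, alpha, mu, sigma):
    """m: (c,); alpha: (M,); mu: (M, c); sigma: (M,)."""
    c = m.size
    sq = np.sum((m - mu) ** 2, axis=1)
    log_phi = -sq / (2 * sigma ** 2) - c * np.log(sigma) \
              - 0.5 * c * np.log(2 * np.pi)
    a = np.log(alpha) + log_phi
    amax = a.max()                       # subtract max before exponentiating
    return -(amax + np.log(np.sum(np.exp(a - amax))))

def map_approx(alpha, mu, sigma):
    """Return the center of the kernel maximizing alpha_l / sigma_l^c."""
    c = mu.shape[1]
    return mu[np.argmax(alpha / sigma ** c)]
```

With two well-separated kernels, `map_approx` picks the center of the dominant kernel, and the loss is much lower at that center than far from both kernels.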
Substituting Equations 11 and 12 into Equation 6 leads to

m̄ = ∫ m Σ_{l=1}^{M} α_l(d) φ_l(m | d) dm = Σ_{l=1}^{M} α_l(d) ∫ m φ_l(m | d) dm.  (17)

Note that φ_l(m | d) is a multivariate normal distribution, whose mean value can be written as

μ_l(d) = ∫ m φ_l(m | d) dm.  (18)

Using the relationship in Equation 18, Equation 17 can be simplified as

m̄ = Σ_{l=1}^{M} α_l(d) μ_l(d).  (19)

This indicates that the mean model derived from the multidimensional PPD is the weighted average of the centers of all the Gaussian kernels.
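Equation 19 reduces the posterior mean to a single weighted average, which is trivial to compute from the network outputs. A minimal sketch (array shapes are assumptions):

```python
import numpy as np

# Sketch of Equation 19: the posterior mean model is the alpha-weighted
# average of the kernel centers; no integration is required.
def mean_model(alpha, mu):
    """alpha: (M,) mixing coefficients; mu: (M, c) kernel centers."""
    return alpha @ mu                     # (c,) weighted average

# e.g. two kernels in a 2-parameter space
alpha = np.array([0.25, 0.75])
mu = np.array([[1.0, 2.0],
               [3.0, 4.0]])
# mean_model(alpha, mu) -> [2.5, 3.5]
```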
1D and 2D marginal PDFs of model parameters are widely used. Since the output of the MDN is a multidimensional PPD, 1D and 2D marginal PDFs can be derived from it. Substituting
Equations 11 and 12 into Equation 7 leads to

P(m_i | d) = ∫ δ(m_i − m′_i) P(m′ | d) dm′
= 1/(2π)^{c/2} ∫ δ(m_i − m′_i) Σ_{l=1}^{M} (α_l / σ_l^c) exp{ −‖m′ − μ_l‖² / (2σ_l²) } dm′
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) ∫ δ(m_i − m′_i) exp{ −‖m′ − μ_l‖² / (2σ_l²) } dm′
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) I_l(m_i).  (20)

Note that the argument (d) is dropped from α_l(d), σ_l(d) and μ_l(d) for simplification. The integral term I_l(m_i) can be computed as

I_l(m_i) = ∫ δ(m_i − m′_i) exp{ −[(m′_1 − μ_l1)² + (m′_2 − μ_l2)² + ··· + (m′_i − μ_li)² + ··· + (m′_c − μ_lc)²] / (2σ_l²) } dm′_1 dm′_2 ··· dm′_i ··· dm′_c
= ∫ exp{ −(m′_1 − μ_l1)² / (2σ_l²) } dm′_1 · ∫ exp{ −(m′_2 − μ_l2)² / (2σ_l²) } dm′_2 ··· ∫ δ(m_i − m′_i) exp{ −(m′_i − μ_li)² / (2σ_l²) } dm′_i ··· ∫ exp{ −(m′_c − μ_lc)² / (2σ_l²) } dm′_c
= √(2π) σ_l · √(2π) σ_l ··· exp{ −(m_i − μ_li)² / (2σ_l²) } ··· √(2π) σ_l
= (2π)^{(c−1)/2} σ_l^{c−1} exp{ −(m_i − μ_li)² / (2σ_l²) },  (21)

in which the relationship ∫ exp{ −(m′ − μ_li)² / (2σ_l²) } dm′ = √(2π) σ_l is adopted. This relationship is proved in APPENDIX A. Substituting Equation 21 back into Equation 20 we have

P(m_i | d) = 1/√(2π) Σ_{l=1}^{M} (α_l / σ_l) exp{ −(m_i − μ_li)² / (2σ_l²) }.  (22)

The derivation of the 2D joint marginal PDF is similar to that of the 1D marginal PDF and is omitted. Similar to the 1D marginal PDF (Equation 22), the 2D joint marginal PDF can be derived as

P(m_i, m_j | d) = ∫ δ(m_i − m′_i) δ(m_j − m′_j) P(m′ | d) dm′
= 1/(2π) Σ_{l=1}^{M} (α_l / σ_l²) exp{ −[(m_i − μ_li)² + (m_j − μ_lj)²] / (2σ_l²) }.  (23)
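The closed forms above mean that marginal PDFs can be evaluated directly from the mixture parameters, with no numerical integration. A sketch with assumed array shapes (not the paper's code):

```python
import numpy as np

# Sketch of Equations 22-23: closed-form 1-D and 2-D marginal PDFs of the
# mixture PPD, evaluated directly from the kernel parameters.
def marginal_1d(x, i, alpha, mu, sigma):
    """Marginal PDF of parameter i at scalar x (Equation 22)."""
    g = np.exp(-(x - mu[:, i]) ** 2 / (2 * sigma ** 2))
    return np.sum(alpha * g / sigma) / np.sqrt(2 * np.pi)

def marginal_2d(x, y, i, j, alpha, mu, sigma):
    """Joint marginal PDF of parameters i and j (Equation 23)."""
    g = np.exp(-((x - mu[:, i]) ** 2 + (y - mu[:, j]) ** 2) / (2 * sigma ** 2))
    return np.sum(alpha * g / sigma ** 2) / (2 * np.pi)
```

For a single standard kernel the 1-D marginal at its center is 1/√(2π) and the 2-D marginal is 1/(2π), and each marginal integrates to one, as a valid PDF must.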
Based on Equations 22 and 23, the formulas to compute higher-dimensional joint marginal PDFs are straightforward. When the dimensionality reaches c, the joint marginal PDF converges to Equation 11.

The diagonal elements of C_m (Equation 9) are computed as

C_m,ii = ∫ (m_i − m̄_i)² P(m | d) dm
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) ∫ (m_i − m̄_i)² exp{ −‖m − μ_l‖² / (2σ_l²) } dm
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) ∫ exp{ −(m_1 − μ_l1)² / (2σ_l²) } dm_1 · ∫ exp{ −(m_2 − μ_l2)² / (2σ_l²) } dm_2 ··· ∫ (m_i − m̄_i)² exp{ −(m_i − μ_li)² / (2σ_l²) } dm_i ··· ∫ exp{ −(m_c − μ_lc)² / (2σ_l²) } dm_c
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) √(2π)σ_l · √(2π)σ_l ··· √(2π)σ_l [(m̄_i − μ_li)² + σ_l²] ··· √(2π)σ_l
= Σ_{l=1}^{M} α_l [(m̄_i − μ_li)² + σ_l²],  (24)

in which the relationship ∫ (m_i − m̄_i)² exp{ −(m_i − μ_li)² / (2σ_l²) } dm_i = √(2π) σ_l [(m̄_i − μ_li)² + σ_l²] is adopted. This relationship is proved in APPENDIX B.

The off-diagonal elements of C_m (Equation 9) are computed as

C_m,ij = ∫ (m_i − m̄_i)(m_j − m̄_j) P(m | d) dm
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) ∫ (m_i − m̄_i)(m_j − m̄_j) exp{ −‖m − μ_l‖² / (2σ_l²) } dm
= 1/(2π)^{c/2} Σ_{l=1}^{M} (α_l / σ_l^c) √(2π)σ_l · √(2π)σ_l ··· √(2π)σ_l (μ_li − m̄_i) ··· √(2π)σ_l (μ_lj − m̄_j) ··· √(2π)σ_l
= Σ_{l=1}^{M} α_l (μ_li − m̄_i)(μ_lj − m̄_j).  (25)

Then the inter-parameter correlation matrix can be computed using Equation 10.
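Equations 24, 25 and 10 can likewise be assembled directly from the mixture parameters. The sketch below (assumed shapes, not the paper's code) builds the full covariance matrix in one pass:

```python
import numpy as np

# Sketch of Equations 24, 25 and 10: closed-form model covariance and
# correlation matrices from the mixture parameters.
def covariance(alpha, mu, sigma):
    """alpha: (M,); mu: (M, c); sigma: (M,). Returns C_m of shape (c, c)."""
    mbar = alpha @ mu                               # mean model (Equation 19)
    dev = mu - mbar                                 # (M, c) kernel offsets
    # between-kernel term, Sum_l alpha_l dev_l dev_l^T (Equations 24-25)
    C = (alpha[:, None, None] * dev[:, :, None] * dev[:, None, :]).sum(axis=0)
    # within-kernel term, Sum_l alpha_l sigma_l^2 on the diagonal (Equation 24)
    C += np.diag(np.full(mu.shape[1], np.sum(alpha * sigma ** 2)))
    return C

def correlation(C):
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)                       # Equation 10
```

For two equally weighted unit-variance kernels centered at (0, 0) and (2, 2), this gives C = [[2, 1], [1, 2]] and an inter-parameter correlation of 0.5.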
Figure 1.
Model parametrization. The deepest bottom layer is a semi-infinite layer. Note that layers with different thicknesses are not drawn to the same scale, for display purposes.
In this section we perform the MDN inversion of the shear-wave velocity profile of the seabed from surface-wave dispersion curves. Surface-wave measurements are usually recorded directly in an active-source survey (Dong & Dosso 2011) or reconstructed from passively collected ambient noise (Wu et al. 2019). Surface-wave dispersion curves can be extracted from these measurements and used to estimate the shear-wave velocity profile (Li et al. 2012). The method described here is general and can easily be applied to other geoacoustic inverse problems.
The training of machine learning algorithms relies on sufficient training data, especially for problems with high complexity. However, only limited observational data, or even no geophysical ground-truth data, are available in many geoacoustic problems. Hence, training the network on simulations is a good alternative. The forward problem is solved using the DISPER80 subroutines developed by Saito (1980).
The shear-wave velocity profile is parametrized as a layered structure, which is shown in Figure 1. The symbols ρ, v_P, v_S and h represent the density, compressional-wave velocity, shear-wave velocity and thickness of a layer. The water layer is colored in blue, and the seabed layers are colored in darksalmon. The seabed is composed of 3 top layers with h_i = 30 m (i = 1, 2, 3), 6 middle layers with h_i = 75 m (i = 4, ..., 9) and 9 bottom layers with h_i = 250 m (i = 10, ..., 18). The first top layer is a near-surface layer which is thought to be soft and unconsolidated, with v_S1 drawn from a uniform distribution. Galetti et al. (2017) found that DISPER80 produces unreliable results if a large velocity drop exists between two consecutive layers with increasing depth in the model. Therefore, we define the constraint on the velocity model as

v_Si ∼ U(lbound, ubound), i = 2, 3, ..., 18,
lbound = max(0.8 v_S,i−1, v_S1),
ubound = min(1.4 v_S,i−1, 2.5 km/s).  (26)

The constraint makes the shear-wave velocity of each layer at least 80 and at most 140 per cent of that of the layer above. A hard bound (v_S1, 2.5 km/s) is also implied in the constraint. The constraint excludes many models that would be unusual in the real world. ρ and v_P are computed using the empirical relations (Castagna et al. 1985; Brocher 2005)

v_P = 1.16 v_S + 1.36,  (27)

and

ρ = 1.74 v_P^{0.25}.  (28)

10^7 model samples are generated randomly, and the resulting shear-wave velocity density distribution is shown in Figure 2. These samples are well distributed based on our definition, and most models plausible in the real world are covered. Based on the definition of Equation 26, the velocity of a model tends to increase with depth because of the setting of the bounds. The hard upper bound of 2.5 km/s stops the increase, resulting in a high-density zone on the right side of Figure 2.

Figure 2. The shear-wave velocity density distribution.

The corresponding phase-velocity dispersion curves (observations) are computed using the forward model DISPER80. This procedure is fully parallelized using parallel computing techniques. Thus we obtain a data set D = {(d_i, m_i) : i = 1, ..., N_D} with N_D = 10^7.

Realistic measurements of phase-velocity dispersion curves are subject to uncertainties. We assume that the uncertainties of the observations are independent, Gaussian-distributed random processes and add synthetic noise to the simulated observations according to the probability density

ρ(d̃) = 1 / [(2π)^{N/2} |C_D|^{1/2}] exp{ −(1/2) (d − d̃)^T C_D^{−1} (d − d̃) },  (29)

where d̃ denotes the noisy observations and C_D denotes the diagonal data covariance matrix.
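Perturbing the simulated dispersion curves under a diagonal C_D is straightforward. The sketch below is illustrative, not the paper's code: it assumes the diagonal entries scale with the data values (σ_i = ε d_i), with an arbitrary placeholder value for ε and a hypothetical 21-point curve.

```python
import numpy as np

# Illustrative sketch of the noise model of Equation 29 with a diagonal C_D,
# assuming sigma_i = eps * d_i (eps below is a placeholder value).
rng = np.random.default_rng(0)

def add_noise(d, eps=0.05):
    """d: (N,) clean phase-velocity samples; returns a noisy copy."""
    sigma = eps * np.abs(d)              # per-point standard deviations
    return d + rng.normal(0.0, sigma)    # independent Gaussian perturbations

d_clean = np.linspace(2.0, 1.0, 21)      # hypothetical 21-point curve (km/s)
d_noisy = add_noise(d_clean)
```

Because the perturbations are independent per frequency point, this reproduces the assumed diagonal (uncorrelated) error structure.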
The diagonal elements σ_i of C_D are defined as

σ_i = ε d_i,  (30)

where ε is a fixed relative error level.

The capacity of the MDN to approximate the nonlinear mapping of inverse problems relies on sufficient flexibility of the network. In practice, we are prone to adding plenty of degrees of freedom to the network to make sure that it is sufficiently flexible to model the nonlinear mapping with little bias. However, too much flexibility can result in overfitting, which means that the network fits the training set perfectly but shows bad performance on new data. That is why regularization is needed to constrain the network.

Adding noise to the observations works like an implicit regularizer. It prevents the network from learning small variations in the observations and makes the MDN less sensitive to variations within the uncertainties. Bishop (1994) and Meier et al. (2007a) show that adding noise to the observations has the same effect as using a regularized error function which involves a penalty term.

Besides, an early stopping approach is employed in this paper to prevent overfitting. The model is evaluated on the validation set at regular checkpoints and the value of the loss function is compared with the previously lowest one. The value of the “winner” is updated after each comparison. If the value stops updating for a certain time, the training of the network is interrupted and the latest “winner” model is adopted.

Figure 3. Network structure. Dense layer is the regular densely connected neural network layer. [The diagram shows a (None, 21) input layer, four Dense layers of 500 units each, three parallel Dense output branches of sizes 648, 36 and 36, and their concatenation into a (None, 720) output.]
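The early-stopping scheme described above can be sketched as a generic training loop. This is an illustration, not the paper's Keras code; `train_one_epoch` and `validate` are hypothetical stand-ins for the real training and validation routines.

```python
# Sketch of the early-stopping procedure: track the lowest validation loss
# ("winner") and stop once it has not improved for `patience` checkpoints.
def train_with_early_stopping(train_one_epoch, validate, max_epochs, patience):
    best_loss, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        state = train_one_epoch()
        val_loss = validate(state)
        if val_loss < best_loss:          # new "winner" at this checkpoint
            best_loss, best_state, wait = val_loss, state, 0
        else:
            wait += 1
            if wait >= patience:          # loss stopped improving: interrupt
                break
    return best_state, best_loss          # latest "winner" model is adopted

# hypothetical validation losses for a quick demonstration
losses = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7]
epochs_run = []
state, loss = train_with_early_stopping(
    train_one_epoch=lambda: epochs_run.append(len(epochs_run)) or len(epochs_run) - 1,
    validate=lambda s: losses[s],
    max_epochs=10, patience=2)
```

With patience 2, training stops after two consecutive non-improving epochs and the epoch with the lowest validation loss is kept.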
A 7-layer MDN is designed for the shear-wave velocity inversion problem, including an input layer, an output layer and 5 dense layers (the 5th dense layer is composed of 3 branches). The basic shape of the network is shown in Figure 3. The input layer takes the discretized points of the phase-velocity dispersion curves as inputs. The dense_4, dense_5 and dense_6 layers respectively compute the μ's, σ's and α's of all kernels, which fully define the multidimensional PPD. The last layer concatenates all the outputs and acts as the output layer. Once trained, the statistics of the shear-wave velocities for all layers of the seabed can be extracted from a single MDN. Table 1 displays the distribution of trainable parameters in the network. The total number of trainable parameters in the network is 1,123,220.

The ReLU activation function is used in the first 4 dense layers. μ is a location parameter which
Table 1.
Network summary.

denotes the center of a kernel. The output of the dense_4 layer is directly used as μ and no activation function is applied. α is a mixing coefficient which should satisfy the constraint of Equation 13; thus the softmax activation function is used in the dense_6 layer for the calculation of α. σ is a scale parameter which denotes the variance of a kernel. A modified ELU activation function f(z) = ELU(1, z) + 1 (Brando Guillaumes 2017) is used in the dense_5 layer. f(z) is smooth and positive everywhere (shown in Figure 4) and is suitable for the activation of σ.

Figure 4.
The modified ELU activation function.

The network is trained on the data set D = {(d_i, m_i) : i = 1, ..., N_D} with 10^7 samples. The data set is shuffled and split into a training set, a validation set and a test set with 9,800,000, 100,000 and 100,000 samples, respectively.

The network is trained on a server with the following configuration:
- Operating system: Ubuntu 18.04.4 LTS
- CPU: Intel Xeon(R) E5-1620 v4 @ 3.50GHz × 8
- GPU: GeForce GTX 1080 Ti
- Memory: 64 GB
- GPU memory: 12 GB
- Hard disk: Samsung SSD 850 PRO 512 GB

The following programming tools and software libraries were used to implement the experiment:
- Python 3.7.7
- Tensorflow-GPU 1.13.1
- Keras 2.3.1

In this study it takes about 8.75 hours to run 1500 epochs of training with a batch size of 8192.
The designed MDN is trained on the prepared training set and evaluated on the validation set during the training process. Noise is added to both sets based on Equation 29 to form noisy training and validation sets.
Figure 5 compares the loss (Equation 14) of the noiseless and noisy training. Note that early stopping is not applied here, for comparison purposes. Figure 5a shows that the validation loss of the noiseless training can reach a minimum of around −75 at epoch 1067. However, the loss curve oscillates at a large scale during the training process, which indicates that the network is sensitive to slight changes and may be overfitting the training set. Figure 5b shows a better convergence of the noisy training compared with Figure 5a. The validation loss reaches a minimum of around −65 at epoch 705. A small oscillation occurs at around epoch 1000, but it soon returns to convergence. The performance of
Figure 5.
The loss as a function of epoch. (a) The loss of the training on noiseless data. (b) The loss of the training on noisy data.

the network is similar on the training set and validation set, indicating that overfitting does not happen in the training.
We evaluate the trained networks on the test set. Random noise is added to the test set based onEquation 29 to simulate the measurement errors in the real world. Since the test set is never seen by
Figure 6.
The predictions of noiseless training versus true velocity models.

Figure 7.
The predictions of noisy training versus true velocity models.

the networks during the training process, the performance of the networks on the test set is a good indicator of generalization.

Figure 6 presents the predictions of the noiseless training network versus the true velocity models. The predictions (mean velocity model) are computed using Equation 19. Precise predictions should fall on the diagonal lines of each subplot. Layers 2 to 10 show good generalization performance for small true velocities, with the corresponding dots located close to the diagonal lines. For large true velocities, layers 4 to 8 show worse generalization performance, while layer 10 performs relatively better. One possible reason is a lack of samples with large velocities in layers 4-8 in the training set (as shown in Figure 2). The performance becomes worse as the depth increases (layers 12 to 18), indicating larger uncertainties for deeper layers. This is because the dispersion curve is less sensitive to the velocity of deeper layers, making it harder for the networks to learn. The overall average loss of the network on the test set is 95.32, which is much larger than the losses on the training and validation sets. It illustrates that the network performs badly when the observations are corrupted by noise.

Figure 7 presents the predictions of the noisy training network versus the true velocity models. It shows improved performance at every layer compared with Figure 6. The poor performance for large true velocities in layers 4 to 8 (Figure 6) has been well resolved. When the depth goes deeper than 1 km (below layer 12), the uncertainty starts to increase gradually. The overall average loss of the
Figure 8.
The predictions of noisy training versus true velocity models.

network on the test set is −65.46, which is similar to the lowest losses on the training and validation sets. In general, the network trained on noisy data is more robust and generalizes better. Note that the points off the diagonal line are not necessarily bad predictions, because they may represent alternative solutions when non-unique solutions exist for the problem.
To illustrate how the predictions fit the true model and data, we consider two synthetic cases. The true shear-wave velocity models of the synthetic cases are shown in Figure 8 with blue solid lines. The corresponding dispersion curves are computed and displayed in Figure 9 with blue solid lines. Noise is added to the dispersion curves using Equation 29 and the noisy dispersion curves are displayed in Figure 9 with black dashed lines. The noisy dispersion curves are fed to the MDN and the predictions are displayed in Figure 8. The mean prediction models are plotted with green dashed lines and the MAP prediction models are plotted with red chain lines. The background color behind the lines denotes the 1D PPD at each layer. The shade of the color indicates the probability (dark colors represent higher probability). Note that the probability is normalized at each layer. The corresponding dispersion curves of the mean and MAP models are also computed (using the forward model DISPER80) and displayed in Figure 9 for comparison.
Figure 9.
The predictions of dispersion curves versus true dispersion curves.
Figure 8 shows that the predictions fit the true model well, especially at shallow layers. In Figure 8a, the mean prediction performs slightly better than the MAP prediction. The uncertainties of the PPD encompass the true model at most layers, except for the third layer from the bottom, which is not very well determined. This may be caused by the large error of the noisy dispersion data at low frequencies (0.2 to 0.4 Hz) shown in Figure 9a. In Figure 8b, the MAP prediction performs slightly better than the mean prediction, which is the opposite of Figure 8a. That is because side lobes of the PPD affect the values of the mean model and result in a larger misfit, especially for deep layers. The layer with depth from 1.0 to 1.3 km is not very well determined. One possible reason is the large error of the noisy dispersion data at high frequencies (1.0 to 1.4 Hz) shown in Figure 9b.

Figure 9 shows that the MAP and mean dispersion data fit the true dispersion data very well, except for some small anomalies caused by the large error of the noisy data at some frequencies. This indicates a good generalization performance of the designed MDN.
We have presented a Bayesian geoacoustic inversion approach using the mixture density network (MDN), which can provide probabilistic solutions to nonlinear inverse problems. Some important geoacoustic statistics have been derived from the MDN theory, making it convenient to train the network directly on the whole parameter space and obtain the multidimensional posterior probability density of the model parameters directly. The evaluation on the test data set indicates a good generalization performance of the designed network trained using noisy data. The inversion results of the synthetic cases show that the predictions of the MDN fit the true velocity model and true dispersion data well, although some local layers may be influenced by large errors in the noisy data.
ACKNOWLEDGMENTS
A number of colleagues have helped with suggestions for the improvement of this material, and I would particularly like to thank Bob Geller, University of Tokyo, for his criticisms and corrections.

REFERENCES
Bensen, G. D., Ritzwoller, M. H., & Yang, Y., 2009. A 3-D shear velocity model of the crust and uppermost mantle beneath the United States from ambient seismic noise, 1177–1196.
Bishop, C. M., 1994. Mixture density networks.
Bodin, T. & Sambridge, M., 2009. Seismic tomography with the reversible jump algorithm, Geophysical Journal International, (3), 1411–1436.
Brando Guillaumes, A., 2017. Mixture density networks for distribution and uncertainty estimation, Master's thesis, Universitat Politècnica de Catalunya.
Brocher, T. M., 2005. Empirical relations between elastic wavespeeds and density in the Earth's crust, Bulletin of the Seismological Society of America, (6), 2081–2092.
Caiti, A., Akal, T., & Stoll, R. D., 1994. Estimation of shear wave velocity in shallow marine sediments, 58–72.
Cao, R., Earp, S., de Ridder, S. A. L., Curtis, A., & Galetti, E., 2020. Near-real-time near-surface 3D seismic velocity and uncertainty models by wavefield gradiometry and neural network inversion of ambient seismic noise, Geophysics, (1), KS13–KS27.
Castagna, J. P., Batzle, M. L., & Eastwood, R. L., 1985. Relationships between compressional-wave and shear-wave velocities in clastic silicate rocks, Geophysics, 571–581.
Collins, M. D., Kuperman, W. A., & Schmidt, H., 1992. Nonlinear inversion for ocean-bottom properties, 2770–2783.
Dong, H. & Dosso, S. E., 2011. Bayesian inversion of interface-wave dispersion for seabed shear-wave speed profiles, IEEE Journal of Oceanic Engineering, (1), 1–11.
Dong, H., Liu, L., Thompson, M., & Morton, A., 2010. Estimation of seismic interface wave dispersion using ambient noise recorded by ocean bottom cable, in ECUA 2010.
Dosso, S. & Wilmut, M., 2006. Estimating data uncertainty in matched-field geoacoustic inversion, IEEE Journal of Oceanic Engineering, 470–479.
Dosso, S. E., 2002. Quantifying uncertainty in geoacoustic inversion. I. A fast Gibbs sampler approach, 129–142.
Dosso, S. E. & Dettmer, J., 2011. Bayesian matched-field geoacoustic inversion, Inverse Problems, (5), 055009.
Dosso, S. E., Yeremy, M. L., Ozard, J. M., & Chapman, N. R., 1993. Estimation of ocean-bottom properties by matched-field inversion of acoustic field data, 232–239.
Dosso, S. E., Wilmut, M. J., & Lapinski, A.-L. S., 2001. An adaptive-hybrid algorithm for geoacoustic inversion, 324–336.
Earp, S., Curtis, A., Zhang, X., & Hansteen, F., 2019. Probabilistic neural network tomography across Grane field (North Sea) from surface wave dispersion data, arXiv preprint arXiv:1908.09588 (not peer reviewed).
Galetti, E., Curtis, A., Baptie, B., Jenkins, D., & Nicolson, H., 2017. Transdimensional Love-wave tomography of the British Isles and shear-velocity structure of the East Irish Sea Basin from ambient-noise interferometry, Geophysical Journal International, 36–58.
Gerstoft, P., 1994a. Inversion of seismoacoustic data using genetic algorithms and a posteriori probability distributions, 770–782.
Gerstoft, P., 1994b. Inversion of seismoacoustic data using genetic algorithms and a posteriori probability distributions, The Journal of the Acoustical Society of America, (2), 770–782.
Gerstoft, P., 1995. Inversion of acoustic data using a combination of genetic algorithms and the Gauss-Newton approach, 2181–2190.
Gerstoft, P. & Michalopoulou, Z.-H., 2000. Global optimization in matched field inversion, pp. 27–32, Citeseer.
Heard, G. J., Hannay, D., & Carr, S., 1998. Genetic algorithm inversion of the 1997 geoacoustic inversion workshop test case data, 61–71.
Li, C., Dosso, S. E., Dong, H., Yu, D., & Liu, L., 2012. Bayesian inversion of multimode interface-wave dispersion from ambient noise, IEEE Journal of Oceanic Engineering, (3), 407–416.
McLachlan, G. J. & Basford, K. E., 1988. Mixture Models: Inference and Applications to Clustering.
Meier, U., Curtis, A., & Trampert, J., 2007a. Global crustal thickness from neural network inversion of surface wave data, Geophysical Journal International, (2), 706–722.
Meier, U., Curtis, A., & Trampert, J., 2007b. Fully nonlinear inversion of fundamental mode surface waves for a global crustal model, Geophysical Research Letters.
Mosegaard, K. & Tarantola, A., 1995. Monte Carlo sampling of solutions to inverse problems, Journal of Geophysical Research: Solid Earth, (B7), 12431–12447.
Musil, M., Wilmut, M. J., & Chapman, N. R., 1999. A hybrid simplex genetic algorithm for estimating geoacoustic parameters using matched-field inversion, 358–369.
Saito, M., 1980. DISPER80: A subroutine package for the calculation of seismic normal mode solutions, Seismological Algorithms.
Shahraeeni, M. S., Curtis, A., & Chao, G., 2012. Fast probabilistic petrophysical mapping of reservoirs from 3D seismic data, Geophysics, (3), O1–O19.
Steininger, G., Dosso, S. E., Holland, C. W., & Dettmer, J., 2014. A trans-dimensional polynomial-spline parameterization for gradient-based geoacoustic inversion, The Journal of the Acoustical Society of America, (4), 1563–1573.
Tarantola, A. & Valette, B., 1982. Generalized nonlinear inverse problems solved using the least squares criterion, 219.
Wit, R. D., Kufl, P., Valentine, A., & Trampert, J., 2014. Bayesian inversion of free oscillations for Earth's radial (an)elastic structure, Physics of the Earth and Planetary Interiors, 1–17.
Wit, R. W. L. D., Valentine, A. P., & Trampert, J., 2013. Bayesian inference of Earth's radial seismic structure from body-wave traveltimes using neural networks, 408–422.
Wu, G., Dong, H., Ke, G., & Song, J., 2019. An adapted eigenvalue-based filter for ocean ambient noise processing, Geophysics, (1), KS29–KS38.
Zhang, X., Curtis, A., Galetti, E., & de Ridder, S., 2018. 3-D Monte Carlo surface wave tomography, Geophysical Journal International, (3), 1644–1658.
Zhdanov, M. S., 2002. Geophysical Inverse Theory and Regularization Problems, vol. 36, Elsevier.
APPENDIX A:
This section proves the following integral formula:
\[
\int \exp\left\{-\frac{(m' - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m' = \sqrt{2\pi}\,\sigma_l, \qquad (\mathrm{A.1})
\]
where $\sigma_l$ and $\mu_{li}$ are considered to be constant. Note that $\int$ means integrating over the entire parameter space. For the convenience of calculation, we compute the infinite integral as an estimation. Since the value interval of the parameter usually covers $[\mu_{li} - 3\sigma_l,\ \mu_{li} + 3\sigma_l]$ in geoacoustic inversions, the infinite integral is a good estimation. First, we compute
\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left\{-(x^2 + y^2)\right\} \mathrm{d}x\,\mathrm{d}y
= \int_{0}^{2\pi}\int_{0}^{\infty} \exp\left\{-r^2\right\} r\,\mathrm{d}r\,\mathrm{d}\theta
= \int_{0}^{2\pi} \left[-\frac{1}{2}\exp\left\{-r^2\right\}\right]_{0}^{\infty} \mathrm{d}\theta
= \pi. \qquad (\mathrm{A.2})
\]
On the other side, we have
\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left\{-(x^2 + y^2)\right\} \mathrm{d}x\,\mathrm{d}y
= \int_{-\infty}^{\infty} \exp\left\{-x^2\right\} \mathrm{d}x \int_{-\infty}^{\infty} \exp\left\{-y^2\right\} \mathrm{d}y
= \left(\int_{-\infty}^{\infty} \exp\left\{-x^2\right\} \mathrm{d}x\right)^2. \qquad (\mathrm{A.3})
\]
It leads to the fact that
\[
\int_{-\infty}^{\infty} \exp\left\{-x^2\right\} \mathrm{d}x = \sqrt{\pi}. \qquad (\mathrm{A.4})
\]
Taking $y = m' - \mu_{li}$, the left-hand side (LHS) of Equation A.1 becomes
\[
\mathrm{LHS} = \int_{-\infty}^{\infty} \exp\left\{-\frac{y^2}{2\sigma_l^2}\right\} \mathrm{d}y. \qquad (\mathrm{A.5})
\]
Taking $z = y/(\sqrt{2}\,\sigma_l)$ and using Equation A.4, we have
\[
\int_{-\infty}^{\infty} \exp\left\{-\frac{y^2}{2\sigma_l^2}\right\} \mathrm{d}y
= \int_{-\infty}^{\infty} \exp\left\{-z^2\right\} \sqrt{2}\,\sigma_l\,\mathrm{d}z
= \sqrt{2\pi}\,\sigma_l. \qquad (\mathrm{A.6})
\]
Another simple way to prove Equation A.1 is to use the fact that the integral of the standard normal distribution equals 1:
\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m' - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m' = 1, \qquad (\mathrm{A.7})
\]
which directly leads to the result of Equation A.1.
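As a quick standalone sanity check, Equation A.1 can be verified numerically. The values of $\sigma_l$ and $\mu_{li}$ below are arbitrary illustrative choices, and a trapezoid sum over $\pm 10\sigma_l$ stands in for the infinite integral:

```python
import numpy as np

# Numerically verify Eq. (A.1): the unnormalized Gaussian integrates
# to sqrt(2*pi)*sigma_l. sigma_l and mu_li are arbitrary test values.
sigma_l, mu_li = 0.7, 1.3
m = np.linspace(mu_li - 10 * sigma_l, mu_li + 10 * sigma_l, 100001)
dm = m[1] - m[0]
integrand = np.exp(-(m - mu_li) ** 2 / (2 * sigma_l ** 2))
# Endpoint values are essentially zero, so a plain Riemann sum
# coincides with the trapezoid rule here.
integral = np.sum(integrand) * dm
assert abs(integral - np.sqrt(2 * np.pi) * sigma_l) < 1e-6
```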
APPENDIX B:
This section proves the following integral formula:
\[
\int (m_i - \bar{m}_i)^2 \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= \sqrt{2\pi}\,\sigma_l \left[(\bar{m}_i - \mu_{li})^2 + \sigma_l^2\right], \qquad (\mathrm{B.1})
\]
where $\sigma_l$, $\mu_{li}$ and $\bar{m}_i$ are considered to be constant. The LHS of Equation B.1 can be expanded as
\[
\frac{1}{\sqrt{2\pi}\,\sigma_l}\,\mathrm{LHS}
= \int m_i^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
- \int 2\bar{m}_i m_i\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
+ \int \bar{m}_i^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= I_1 - I_2 + I_3. \qquad (\mathrm{B.2})
\]
Using the integral properties of the standard normal distribution, we have
\[
\sigma_l^2 = \int (m_i - \mu_{li})^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= \int m_i^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
- \int 2\mu_{li} m_i\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
+ \int \mu_{li}^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= I_1 - 2\mu_{li}^2 + \mu_{li}^2 = I_1 - \mu_{li}^2, \qquad (\mathrm{B.3})
\]
\[
I_2 = \int 2\bar{m}_i m_i\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= 2\bar{m}_i \mu_{li}, \qquad (\mathrm{B.4})
\]
and
\[
I_3 = \int \bar{m}_i^2\, \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\left\{-\frac{(m_i - \mu_{li})^2}{2\sigma_l^2}\right\} \mathrm{d}m_i
= \bar{m}_i^2. \qquad (\mathrm{B.5})
\]
Substituting Equations B.3, B.4 and B.5 back into Equation B.2 leads to
\[
\mathrm{LHS} = \sqrt{2\pi}\,\sigma_l \left[\sigma_l^2 + \mu_{li}^2 - 2\bar{m}_i \mu_{li} + \bar{m}_i^2\right]
= \sqrt{2\pi}\,\sigma_l \left[(\bar{m}_i - \mu_{li})^2 + \sigma_l^2\right].
\]
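Equation B.1 can likewise be checked numerically. The values of $\sigma_l$, $\mu_{li}$ and $\bar{m}_i$ below are arbitrary illustrative choices, with a trapezoid sum over $\pm 12\sigma_l$ approximating the infinite integral:

```python
import numpy as np

# Numerically verify Eq. (B.1): the second moment of the unnormalized
# Gaussian about an arbitrary constant m_bar. All values are test values.
sigma_l, mu_li, m_bar = 0.5, 1.0, 1.8
m = np.linspace(mu_li - 12 * sigma_l, mu_li + 12 * sigma_l, 100001)
dm = m[1] - m[0]
lhs = np.sum((m - m_bar) ** 2
             * np.exp(-(m - mu_li) ** 2 / (2 * sigma_l ** 2))) * dm
rhs = np.sqrt(2 * np.pi) * sigma_l * ((m_bar - mu_li) ** 2 + sigma_l ** 2)
assert abs(lhs - rhs) < 1e-6
```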