Deep generative models for accelerated Bayesian posterior inference of microseismic events
Alessio Spurio Mancini, Davide Piras, Michael P. Hobson, Benjamin Joachimi
Geophys. J. Int. (0000) 000, 000–000
Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK
Astrophysics Group, Cavendish Laboratory, J. J. Thomson Avenue, Cambridge CB3 0HE, UK
⋆ E-mail: [email protected]
16 September 2020
SUMMARY
We present a series of generative models for fast emulation of noiseless isotropic microseismic traces from source coordinates, given a 3D heterogeneous velocity and density model. The surrogate models can accelerate Bayesian source inversion by emulating the forward modelling of the seismograms, thus replacing the expensive solution of the elastic wave equation at each likelihood evaluation. Our generative models, based on deep learning algorithms, are trained using only 2000 synthetic seismic traces forward modelled with a pseudospectral method. We evaluate the performance of our methods on a testing set and compare them against state-of-the-art algorithms applied to the same data, showing that all of our models provide more accurate predictions and up to a O(10²) speed-up factor during their training and evaluation. We finally apply our generative models to a posterior inference analysis of a simulated microseismic event. We show that constraints three times tighter can be obtained more than an order of magnitude faster than with existing methods. The speed, accuracy and scalability of our proposed models pave the way for extensions of these emulators to more complicated source mechanisms and applications to real data.

Key words: seismograms – seismic modelling – statistical methods – artificial intelligence.
1 INTRODUCTION

Microseismic events are seismic signals characterised by a very small amplitude. Their monitoring is of crucial importance in studying induced seismicity, to help quantify the seismic hazard caused by human activity (Mukuhira et al., 2016). A Bayesian analysis of seismic traces recorded by receiving sensors allows for the solution of the inverse problem: given the recorded seismograms, we wish to infer properties of the microseismic sources in the subsurface, most importantly their coordinates for localisation purposes (Stähler and Sigloch, 2014; Pugh et al., 2016). In a marine environment, geophones are situated at the seabed, recording water pressure and the three-component particle velocity of the propagating medium, in order to localise and characterise microseismic sources in the subsurface (Fertitta et al., 2010). While in this work we consider a marine environment, the methodology proposed here is not limited to this context; it applies equally to more standard geophysical scenarios, where geophones are located on the land surface.

Bayesian inference represents a principled approach to geophysical source inversion (e.g. Aster et al., 2012), as it allows one to extract model parameters along with their associated uncertainty by means of techniques such as Markov Chain Monte Carlo (MCMC), which sample the parameters' posterior distribution (see e.g. Craiu and Rosenthal, 2014, for a review of different sampling algorithms). The solution of the inverse problem requires, however, numerous generations of forward-simulated events from given values of the source parameters. While performing MCMC sampling, the forward model needs to be simulated at each point in parameter space where the likelihood function is evaluated. The number of such evaluations scales exponentially with the number of parameters (an example of the curse of dimensionality, see e.g. MacKay 2003). If the forward model is computationally expensive to simulate, sampling the posterior distribution becomes extremely challenging or even unattainable, even for small parameter spaces. This is the case for microseismic events in a marine (as well as terrestrial) environment: solving the elastic wave equation to forward model microseismic traces given their source coordinates is extremely computationally expensive, which makes Bayesian inference of the coordinates' posterior distribution unfeasible. For example, given the geophysical model considered in Das et al. (2017), the generation of a single seismic trace from its coordinates requires O(1) hour of Graphics Processing Unit (GPU) time.

To overcome this issue, Das et al. (2018, D18 hereafter) developed a machine learning framework (also referred to as a metamodel, surrogate model or emulator) for fast generation of synthetic seismic traces, given their locations in a marine domain and a specified 3D heterogeneous velocity and density model. A similar Bayesian surrogate learning approach in an astrophysical context was first proposed in Auld et al. (2007, 2008) and recently investigated in dynamic simulator-based regression problems by Chen and Hobson (2019). More generally, replacing the usually expensive simulations with emulated realisations obtained by training machine learning algorithms has become progressively common in multiple scientific fields (see e.g.
Kasim et al., 2020; Raghu and Schmidt, 2020, and references therein).

In D18, Gaussian processes (GPs, Rasmussen and Williams, 2005) were first trained on a few thousand forward simulations, obtained by solving the computationally expensive elastic wave equation. The trained GPs were then shown to produce seismic traces in good agreement with those of the testing set, previously produced via full solution of the elastic wave equation but kept separate from the training set. It was also demonstrated that the trained GPs were able to produce values of the likelihood function similar to those obtained from the true events. D18 concluded that the trained surrogate module could be employed for Bayesian inference, to replace the expensive solution of the elastic wave equation at each point explored in parameter space.

In this paper, we improve on the method developed in D18 by training multiple generative models, based on deep learning algorithms, to predict the seismic traces corresponding to given coordinates. Similar to D18, the goal is to use the trained generative models as emulators, replacing the forward modelling of the seismograms at each likelihood evaluation in the posterior inference analysis. Thus, the trained emulators permit the otherwise computationally prohibitive Bayesian inference of the coordinates' posterior distribution. All of our newly proposed generative models achieve similar accuracy when compared with each other, and in all cases their accuracy is greater than that achieved by the method proposed in D18. Moreover, the emulators we develop are faster, less computationally demanding and easier to store than the surrogate model developed in D18. We train and test our generative models on the same seismograms used by D18, to achieve a direct comparison between the methodologies.

In addition, we demonstrate in practice how the new emulation schemes make it possible to perform Bayesian inference of a microseismic source location. We carry out a full posterior inference analysis of the location of a simulated microseismic event, given its seismic trace recorded by receivers placed at the seabed. We show that a configuration with only four receivers is sufficient to yield an accurate characterisation of the source location.

This paper is structured as follows. In Sec. 2 we describe our simulation setup, identical to that of D18, detailing the general characteristics of the simulated seismograms which constitute our training, validation and testing sets. In Sec. 3 we describe the preprocessing operations that we apply to our seismograms to facilitate the training of our new generative models, outlined in detail in Sec. 4. In Sec. 5 we present our results in terms of the accuracy and speed achieved by the trained models; moreover, we use our best performing model to show that we can accelerate accurate Bayesian inference of a simulated microseismic event. Finally, we conclude in Sec. 6 with a discussion of our main findings and their future applications, some of which will be explored in a forthcoming paper (Piras et al., in prep.).
2 SIMULATION SETUP

Since one of our goals is to compare our new emulation methods with the one previously developed in D18, we use the same dataset, to ensure a careful comparison. We briefly recap here the characteristics of the dataset, referring to D18 for further details.

We consider a geophysical framework where we record seismic traces in a marine environment. Sensors are placed at the seabed to record both pressure and the three-component particle velocity of the propagating medium. As was the case in D18, we assume that our recorded seismic traces are generated by isotropic sources. This means that their moment tensor has only diagonal elements, thus making the point sources 'explosive': their signal is characterised by strong peaks, representing the arrival of P-waves at the sensor. For isotropic sources, considering only the pressure wave and ignoring the particle velocity is sufficient to determine the location of the event in the studied domain, as shown in D18. We consider the seismic traces to be noiseless when building the emulator, while some noise is added to the simulated recorded seismogram when inferring the coordinates' posterior distribution, as we will show in Sec. 5.

Forward simulations of seismic traces are obtained by solving the elastic wave equation given a 3D heterogeneous velocity and density model for the propagating medium, shown in Fig. 1. The model specifies values of the propagation velocities for P- and S-waves (V_p, V_s) as well as the density ρ of the propagating medium, discretised on a 3D grid of voxels. The solution of the elastic wave equation is a computationally challenging task, which can be accelerated using Graphics Processing Units (GPUs) (Das et al., 2017). This is implemented in the software K-WAVE (Treeby et al., 2014), a pseudospectral method employed by D18 to generate their training and testing samples, which we also use in our analysis. The GPU software allows for the computation of the acoustic pressure measured at specified receiver locations. In total, 4000 microseismic traces are generated, given random locations within a predefined domain and a specified form for their moment tensor. The value of the diagonal components of the moment tensor is set equal to 1 MPa, following Collettini and Barchi (2002). The coordinates (x, y, z) of the simulated sources are sampled using Latin Hypercube Sampling on a 3D grid of 81 × 81 ×
301 gridpoints, corresponding to a real geological model (the same used in D18) with a horizontal extent of 1 km. Each simulated trace consists of N_t = 501 time components. Note that, similarly to D18, we augment each of our (x, y, z) coordinate sets with the distance from the receiver d = √(x² + y² + z²), as we noted that this helps the training of the generative models.

Figure 1. Density, P-wave velocity and S-wave velocity models of the simulated domain used in this work. The models are specified as 3D grids of 81 × 81 × 301 voxels. Notice the layered structure, with variation along the vertical dimension much more pronounced than across horizontal planes.

3 DATA PREPROCESSING

Training deep learning algorithms like the ones described in Sec. 4 as surrogate models is a challenging task in this context, due to the limited number of training samples available, i.e. the 2000 training seismograms generated by solution of the elastic wave equation (see e.g. Bishop, 2006). We note that each of these seismic traces corresponds to one set of integer coordinates (x, y, z) in our 3D grid of 81 × 81 × 301 voxels: this means that our training set covers only ∼0.1 per cent of the possible coordinates in our simulated geophysical framework.

Such a small number of training samples means, for example, that training even just a simple neural network (which will be described in Sec. 4.1) to emulate the seismograms given their source coordinates does not achieve satisfactory accuracy on the testing set, leading only to overfitting of the training set, as we verified directly.
This issue can be relieved by applying some form of preprocessing to the data, in order to reduce the number of relevant features that have to be learnt by the emulators. In addition, a compression method can be employed to reduce the dimensionality of the mapping even further, on condition that the compression is efficient in preserving the information carried by the original signal.

To preprocess our seismograms we first identify the maximum positive amplitude A_i and the corresponding time index t_i in each seismogram, labelled by index i = 1, ..., N_train, in our training set of N_train = 2000 samples (the same used by D18). We then isolate one random seismic event in our training set and store the value of its maximum positive amplitude A* and its corresponding time index t*. We normalise all of our training seismograms to the amplitude of this reference peak, and shift them so that their peak location corresponds to the reference peak location (see Fig. 2), replacing the missing time components with zeros. This preprocessing leaves us with two additional parameters for each seismic trace: a normalising factor Ā_i ≡ A_i/A* and a time shift t̄_i ≡ t_i − t*. The preprocessing is encouraged by the structure of the signal, which is localised in the form of spikes preceded by absence of signal, corresponding to the sudden arrival of the P-wave at the sensor location. The amplitudes A_i and time indices t_i depend mainly on the distance of the seismic source from the sensor. By rescaling all training seismograms to the reference amplitude A* and time index t*, we allow the deep learning algorithm to 'concentrate' on learning the rest of the signal, which instead depends on the properties of the heterogeneous medium encountered by the wave while propagating to the sensor. We verified that the numerical conclusions of our analysis do not depend significantly on the specific choice of the reference seismogram; therefore, in our analysis we choose the reference seismogram completely at random†.

A consequence of this type of preprocessing is that, in order to recover the original seismograms, one also needs to learn the coefficients Ā_i and t̄_i.

† It could be argued that the reference seismogram should be chosen in order to minimise the truncated signal, i.e. that for every receiver we should choose the closest event as the reference seismogram. However, we verified that, in order to constrain the source location adequately well, the main peak of a seismogram is sufficient, and therefore this choice is irrelevant in our approach.
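A minimal NumPy sketch of this peak-alignment preprocessing, assuming the training traces are stored in an array of shape (N_train, N_t); the variable names are illustrative and not taken from the original code:

```python
import numpy as np

def preprocess(traces, ref_idx=0):
    """Align all traces to the peak of a reference trace and rescale.

    traces : array of shape (N_train, N_t), raw training seismograms.
    Returns the preprocessed traces and the (A_bar, t_bar) coefficients.
    """
    peaks = traces.max(axis=1)                 # maximum positive amplitude A_i
    times = traces.argmax(axis=1)              # corresponding time index t_i
    A_star, t_star = peaks[ref_idx], times[ref_idx]   # reference A*, t*

    A_bar = peaks / A_star                     # normalising factors
    t_bar = times - t_star                     # time shifts

    out = np.zeros_like(traces)
    n_t = traces.shape[1]
    for i, shift in enumerate(t_bar):
        # shift the trace so its peak lands at t*, padding with zeros
        src = traces[i] / A_bar[i]
        if shift >= 0:
            out[i, : n_t - shift] = src[shift:]
        else:
            out[i, -shift:] = src[: n_t + shift]
    return out, A_bar, t_bar
```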
Figure 2.
Example of the preprocessing applied to the seismograms. We consider one random reference seismogram, shown in black in both the upper and lower panels. Given another generic seismogram (red line in the upper panel), we rescale it to have its positive maximum peak amplitude and time location matching those of the reference seismogram. The result is a seismogram, like the one shown in red in the bottom panel, whose main difference with the reference seismogram is given by the additional fluctuations surrounding the main peak. The generative methods we develop learn to predict these fluctuations, as well as the main peak, given the source coordinates, whereas two GPs learn the amplitude and time shift coefficients to rescale the predicted seismograms back to their natural amplitude and peak location. Throughout the paper, the seismograms' amplitude is measured in arbitrary units of pressure.

To this purpose we train two additional independent machine learning algorithms that map the input source coordinates (x_i, y_i, z_i, d_i) to the output amplitude and time shift (Ā_i, t̄_i). We model each of these outputs as a Gaussian process (GP, Rasmussen and Williams, 2005). When performing GP regression given a generic function f(θ) of parameters θ, we assume

f(θ) ∼ N(0, K(θ, θ′; ψ)),   (1)

where the kernel K(θ, θ′; ψ) represents the covariance between two points in parameter space and may depend on additional hyperparameters, collectively denoted as ψ. In our case Ā and t̄ are each modelled as a function of the coordinates (x, y, z, d) using one GP. We find that a Matérn kernel K in its Automatic Relevance Determination (ARD) version (Neal, 1996; Rasmussen and Williams, 2005), defined as

K_{ARD-Matérn-3/2}(θ, θ′; ψ) = σ_f² (1 + √3 r̃) exp(−√3 r̃),   (2)

where

r̃ = √( Σ_{m=1}^{n} (θ_m − θ′_m)² / σ_m² ),   (3)

produces the most accurate results in training both GPs. The hyperparameters of this kernel are a signal standard deviation σ_f and a characteristic length scale σ_m for each input feature m = 1, ..., n (the source coordinates, in our case). We optimise these parameters while training our GPs, which we implement using the software GPy‡.

‡ http://github.com/SheffieldML/GPy
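A minimal GPy sketch of this step, fitting one GP with a Matérn-3/2 ARD kernel to the amplitude coefficients; the calls to GPy.kern.Matern32 and GPy.models.GPRegression follow the standard GPy API, while the training arrays are placeholders standing in for the real data:

```python
import numpy as np
import GPy

# Stand-ins for the real training data: (x, y, z, d) inputs and the
# amplitude coefficients A_bar produced by the preprocessing step.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 81.0, size=(2000, 4))
A_bar = rng.uniform(0.5, 2.0, size=2000)

X, Y = coords, A_bar[:, None]            # GPy expects 2D targets

# Matérn-3/2 kernel with one length scale per input feature (ARD)
kernel = GPy.kern.Matern32(input_dim=4, ARD=True)

gp_amplitude = GPy.models.GPRegression(X, Y, kernel)
gp_amplitude.optimize()                  # fits sigma_f and the four sigma_m

# predict the amplitude coefficient for a new source location
mean, var = gp_amplitude.predict(np.array([[40.0, 40.0, 100.0, 115.0]]))
```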
4 GENERATIVE MODELS

In this Section we describe the deep generative models that we train as emulators of the seismic traces given their source coordinates. Each generative model can be composed of:

• neural networks (NNs), which will be described in Sec. 4.1 and can be used either as a compression scheme or in combination with other algorithms;
• Gaussian processes (GPs), introduced in Sec. 3, trained on a compressed version of the seismograms;
• Principal Component Analysis (PCA), which is a standard linear dimensionality reduction technique and will be described in detail in Sec. 4.3.

For all methods, we also train two GPs to learn the mapping between source coordinates and the coefficients (Ā, t̄), as described in Sec. 3. In Fig. 3 we provide a schematic summarising the mapping between coordinates and seismograms for each emulation framework described below. In the end, we will show that all our methods achieve a better performance than D18, and that we can use a simple neural network as our proxy to constrain the posterior distribution of a source's coordinates.

4.1 Direct neural network mapping

The first method we propose is a simple direct mapping between coordinates and preprocessed seismograms, without any intermediate data compression. The mapping is learnt by a fully connected neural network, which consists of a stack of layers, each made of a certain number of neurons. Each layer maps the input of the previous layer θ_in to an output θ_out via

θ_out = A(w θ_in + b),   (4)

where w and b are called the network weights and biases, respectively, and A is the activation function, which is introduced in order to be able to model non-linear mappings. The output of each layer becomes the input to the following layer, and the number of neurons in each layer determines the shape of w and b. Training the neural network consists of optimising the weights and biases to minimise a specific loss function which quantifies the deviation of the predicted output from the target training sample. The optimisation is performed by back-propagating the gradient of the loss function with respect to the network's parameters (Rumelhart et al., 1988).

After experimenting with different architectures and activation functions, we find our best results with a neural network made of three hidden layers, with 64, 128 and 256 hidden units each, and a Leaky ReLU (Maas, 2013) activation function for each hidden layer, except the last one where we maintain a linear activation (cf. Sec. 4.8 for details on the choice of our network hyperparameters). The Leaky ReLU activation function for an input x is defined as

f(x) = x if x > 0, αx otherwise,   (5)

where we set the hyperparameter α to a fixed small positive value. We choose the mean squared error (MSE) between predicted and original seismograms S̃ and S as our loss function:

MSE = (1/N_t) Σ_{m=1}^{N_t} (S̃_m − S_m)².   (6)

Image (A) in Fig. 3 summarises the emulation framework with a direct NN between coordinates and preprocessed seismograms, plus the two GPs that rescale the predicted seismograms to their real amplitude and peak location.
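As an illustration, a direct-mapping network with the quoted layer sizes can be written in a few lines of TensorFlow/Keras; this is a sketch consistent with the description above, not the authors' released code, and the training arrays are placeholders:

```python
import numpy as np
import tensorflow as tf

N_t = 501  # time components per preprocessed trace

# Placeholder training data: (x, y, z, d) inputs, preprocessed traces as targets.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 81.0, size=(2000, 4)).astype("float32")
traces = rng.normal(size=(2000, N_t)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(128), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(256), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(N_t),              # linear output layer
])

model.compile(optimizer="adam", loss="mse")  # Adam + MSE, as in Sec. 4.8
model.fit(coords, traces, validation_split=0.2, epochs=10, verbose=0)
pred = model.predict(coords[:1])             # emulated preprocessed trace
```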
Figure 3. Summary of the generative models described in this work. The generative models emulate the forward modelling of the seismograms given their coordinates (x, y, z, d). All models share the use of two Gaussian processes (GPs) to learn the (Ā, t̄) coefficients described in Sec. 3, used to undo the preprocessing of the seismograms. The mapping between coordinates and preprocessed seismograms is learnt by a neural network (NN) in method (A), whereas in methods (B) and (C) the preprocessed seismograms are compressed into Principal Component Analysis (PCA) coefficients, which are then learnt by a NN and a GP, respectively. In methods (D) and (E) the seismograms are compressed into the central features of an autoencoder (AE), which are then learnt by a NN and a GP, respectively. Finally, a Conditional Variational Autoencoder (CVAE) and a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) are used in methods (F) and (G), respectively, to learn the mapping between coordinates and preprocessed seismograms. The performance of each of these models is reported in Table 1.

4.2 PCA + NN

The second method we propose makes use of a signal compression stage prior to the emulation step. We first perform Principal Component Analysis (PCA) of the preprocessed seismograms in the training set. PCA is a technique for dimensionality reduction performed by eigenvalue decomposition of the data covariance matrix. This identifies the principal vectors, maximising the variance of the data when projected onto those vectors. The projections of each data point onto the principal axes are the 'principal components' of the signal. By retaining only a limited number of these components, discarding the ones that carry less variance, one achieves dimensionality reduction. We retain only the first N_PCA = 20 principal components, as we find that in this case the 2D correlation coefficient between original and reconstructed seismograms is already close to unity. We can then model the seismograms as linear combinations of the PCA basis functions f_i,

S(x, y, z, d) = Σ_{i=1}^{N_PCA} c_i(x, y, z, d) f_i,   (7)

where the coefficients c_i(x, y, z, d) are unknown non-linear functions of the source coordinates. We train a neural network like the one described in Sec. 4.1 to learn this mapping. In other words, contrary to the direct mapping between coordinates and seismogram components, we train a NN to learn to predict the PCA basis coefficients c_i given a set of coordinates. Image (B) in Fig. 3 summarises the emulation framework in this case. We find that a neural network architecture similar to the one employed in the direct mapping approach, with three layers and LeakyReLU activation functions, performs well also for this task. The number of nodes in each hidden layer is reduced to 50, and we still minimise the MSE, now between predicted and original PCA coefficients.
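A compact sketch of method (B), using scikit-learn's PCA for the compression stage (an assumption made for illustration; any PCA implementation would do) together with a small Keras network for the coefficients:

```python
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

# Placeholder data, as in the previous sketch.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 81.0, size=(2000, 4)).astype("float32")
traces = rng.normal(size=(2000, 501)).astype("float32")

pca = PCA(n_components=20)                 # keep N_PCA = 20 components
c_train = pca.fit_transform(traces)        # PCA coefficients c_i

# Small NN mapping (x, y, z, d) -> 20 PCA coefficients.
nn = tf.keras.Sequential([
    tf.keras.layers.Dense(50), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(50), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(50), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(20),
])
nn.compile(optimizer="adam", loss="mse")
nn.fit(coords, c_train, epochs=10, verbose=0)

# Emulated preprocessed trace: predict coefficients, then invert the PCA.
trace_pred = pca.inverse_transform(nn.predict(coords[:1]))
```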
4.3 PCA + GP

Once PCA has been performed on the training set, as an alternative to a neural network one can train multiple GPs to learn the mapping between the source coordinates and the PCA coefficients. We train one GP for each PCA component. Image (C) in Fig. 3 summarises the emulation framework for this approach.
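Continuing the previous sketch, method (C) amounts to a loop over the 20 coefficients, one GPy regression each; the reuse of the Matérn ARD kernel of Sec. 3 for these GPs is our assumption for illustration:

```python
import numpy as np
import GPy

gps = []
for i in range(20):
    k = GPy.kern.Matern32(input_dim=4, ARD=True)
    gp = GPy.models.GPRegression(coords.astype("float64"),
                                 c_train[:, i:i + 1].astype("float64"), k)
    gp.optimize()
    gps.append(gp)

# Predicted coefficients for one source, then inverse PCA as before.
c_pred = np.hstack([gp.predict(coords[:1].astype("float64"))[0] for gp in gps])
trace_pred = pca.inverse_transform(c_pred)
```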
4.4 AE + NN

An autoencoder (AE) is a neural network with an equal number of neurons in the input and output layers, trained to reproduce the input in the output (Hinton and Salakhutdinov, 2006). An autoencoder is typically made of an encoder followed by a decoder. The encoder network maps the input signal into a central layer (the latent space), usually with fewer neurons than the input, to achieve dimensionality reduction. The decoder network receives as input the output layer of the encoder and learns to map these compressed features back to the original input signals of the encoder. Together, encoder and decoder form a 'funnel-like' structure for the AE network, as shown in Fig. 4. In seismology, autoencoders have been studied by Valentine and Trampert (2012), who used them to compress seismic traces, and they are generally used as a non-linear alternative to PCA.

Once the AE has been trained, new input signals can be compressed into the central features of the AE. Our aim is to learn the mapping between the source coordinates and these features. This can be achieved, for example, with an additional neural network. Once this NN is trained, it can be used to generate new encoded features of the AE from new coordinates, with the decoder turning the new features into preprocessed seismograms. By using the GPs for the coefficients (Ā, t̄) one can then rescale the preprocessed generated seismograms to their natural amplitude and peak location. At the end of this procedure, summarised in Image (D) in Fig. 3, one has effectively learnt a way to generate new seismograms given certain coordinates.

In our application, we find that a fully connected§ architecture with 501, 256, 128, 64 and 5 nodes for each layer in the encoder (from the input to the latent space, with a symmetric decoder) produces the best results in compressing the seismograms. Hence, we encode our seismograms into z_dim = 5 central features; using a higher number of central features does not lead to significant improvements in the reconstruction performance, as discussed in Sec. 5. We train our AE with the same procedure described in Sec. 4.1.

§ We also experimented with a convolutional architecture, but noticed that it did not yield better accuracy, while also slowing down the training considerably.

4.5 AE + GP

Similarly to what we did with the PCA+GP method described in Sec. 4.3, one can train GPs to predict the encoded features given the source coordinates. The predicted encoded features are then decoded by the trained decoder to generate new preprocessed seismograms. The amplitude and time GPs scale the seismograms back to their original values. The scheme is summarised in Image (E) in Fig. 3.
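A Keras sketch of the quoted 501-256-128-64-5 autoencoder; the layer sizes are from the text, while the data and everything else are illustrative:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
traces = rng.normal(size=(2000, 501)).astype("float32")

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(256), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(128), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(64), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(5),                     # z_dim = 5 central features
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(128), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(256), tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(501),                   # reconstructed trace
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(traces, traces, epochs=10, verbose=0)  # reproduce the input

latent = encoder.predict(traces)   # compressed features, to be learnt from
                                   # the coordinates by a NN (D) or GPs (E)
```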
4.6 Conditional Variational Autoencoder

In general, the encoded features in the latent space of an autoencoder have no specific structure, as the only requirement is for the reconstructed data points to be similar to the input points. However, it is possible to enforce a desired distribution over the latent space, which is driven by our preliminary knowledge of the problem and is therefore called a prior distribution. This is one of the advantages of Variational Autoencoders (VAEs, Kingma and Welling, 2013). In this case, the model becomes fully probabilistic, and the loss function to maximise is the ELBO (Evidence Lower BOund), defined as

ELBO = E_z[log p_φ(x|z)] − D_KL(q_θ(z|x) || p(z)),   (8)

where x indicates the seismograms, z the encoded features, and E_z the expectation value over z ∼ q_θ(z|x). Additionally, p(z) refers to the prior distribution we wish to impose on the latent space, q_θ(z|x) to the encoder distribution, and p_φ(x|z) to the decoder distribution; θ and φ indicate the learnable parameters of the encoder and of the decoder, respectively. Finally, D_KL refers to the Kullback-Leibler divergence (Kullback, 1959), which is a measure of distance between distributions; see Appendix B for further details.

In simple words, when maximising the objective in Eq. 8 with respect to θ and φ, we demand that the encoded distribution match the prior p(z) as closely as possible, while requiring that the decoded data points resemble the input data. For a full derivation of the ELBO and further details about VAEs see e.g. Kingma and Welling (2013); Doersch (2016); Odaibo (2019), and references therein.

VAEs can be used both as a compression algorithm and as a generative method. Since we want to map coordinates to seismograms, we choose to employ a supervised version of VAEs, called Conditional Variational Autoencoders (CVAEs, Sohn et al., 2015), which maximise the slightly altered loss function

L(θ, φ; x, c) = E_z[log p_φ(x|z, c)] − D_KL(q_θ(z|x, c) || p(z|c)),   (9)

where c refers to the coordinates associated with the seismograms x, and the expectation value is over z ∼ q_θ(z|x, c).

In our analysis, we set a latent space size of z_dim = 5. Moreover, we choose the encoding q_θ(z|x, c) to be a multivariate normal distribution with mean given by the encoder and a fixed diagonal covariance matrix Σ ∝ I_{z_dim}, with a small variance. We choose a multivariate normal distribution with zero mean and the same covariance matrix Σ as our prior p(z|c), and we employ a deterministic p_φ(x|z, c) as our decoding distribution. We estimate the expectation value in Eq. 9 using a Monte Carlo approximation, and we calculate the KL divergence in closed form, as both q_θ(z|x, c) and p(z|c) are multivariate normal distributions; see Appendix B for the full derivation. The choice of Σ is made in order to limit the spread of points in latent space, such that we can approximate the desired deterministic mapping with the probabilistic model offered by the CVAE. Once trained, we can feed a set of coordinates c and a vector z ∼ p(z|c) to the decoder to obtain a seismogram; with our setup, we verified that using a sample z ∼ p(z|c) or the mean value z = 0 has no significant impact on the final performance of the model. Image (F) in Fig. 3 summarises the emulation framework making use of the CVAE.
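For reference, when both distributions are Gaussian with the same fixed covariance Σ = σ² I_{z_dim}, with mean μ_θ(x, c) from the encoder and zero mean for the prior, the closed-form KL divergence reduces to a simple quadratic penalty on the encoder mean; this is the standard Gaussian result, quoted here for convenience rather than taken from Appendix B:

D_KL( q_θ(z|x, c) || p(z|c) ) = ||μ_θ(x, c)||² / (2σ²).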
Figure 4.
Typical architecture of an autoencoder. A bottleneck architecture allows for the compression of the input signal into a central layer through the 'encoder' part of the network. The central layer is characterised by fewer nodes than the input one, thus leading to dimensionality reduction, on condition that the 'decoder' part can efficiently reconstruct the input signal (to a good degree of accuracy) starting from the central encoded features. In this schematic we highlight that training of the autoencoder is performed by feeding a seismogram to the encoder, and then comparing the output of the decoder with the same input seismogram.
4.7 WGAN-GP

One of the main lines of research in generative models is based on Generative Adversarial Networks (GANs, Goodfellow et al., 2014). In this framework, two neural networks, called the generator (G) and the discriminator (D), are trained simultaneously with two different goals. While G maps noise to candidate fake samples which resemble the training data, in order to fool the discriminator, D is trained to distinguish between these fake samples and the real data points. More formally, we can define a value function

V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))],   (10)

where x refers to the training data sampled from the data distribution p_data(x), and z to a noise variable sampled from some prior p(z). The discriminator is thus trained to maximise V, while the generator aims at minimising it; the two networks play a minimax game until a Nash equilibrium is (hopefully) reached (Goodfellow et al., 2014; Che et al., 2016; Oliehoek et al., 2018).

In practice, despite generating incredibly sharp images, GANs have proved to be quite unstable at training time. Moreover, it has been shown that vanilla GANs are prone to mode collapse, where the generator focuses on just a few modes of the data distribution and yields new samples with low diversity (see e.g. Metz et al., 2016; Che et al., 2016).

Many alternatives to vanilla GANs have been proposed to address these issues. We focus here on Wasserstein GANs with Gradient Penalty (WGAN-GP¶; Arjovsky et al., 2017; Gulrajani et al., 2017). In short, we still consider two networks, a generator G and a critic C, which are trained to minimise the following objective:

¶ To avoid confusion, we stress that the acronym 'GP' is used to indicate a Gaussian Process throughout the paper, while it refers to the 'Gradient Penalty' variant of the WGAN algorithm only when quoted as 'WGAN-GP'.
min_G max_C  E_{x,c}[log C(x, c)] − E_{z,c}[log(1 − C(G(z, c)))] − λ E_{x̂,c}[(||∇_{x̂} C(x̂, c)||₂ − 1)²],   (11)

where c refers to the coordinates, x̂ is a linear combination of the real and generated data, λ ≥ 0 is a penalty coefficient for the regularisation term, and ||∇_{x̂} C||₂ refers to the L₂ norm of the critic's gradient with respect to x̂. See Appendix C for further details. Image (G) in Fig. 3 summarises the emulation framework making use of the WGAN-GP.

In our experiments, we chose λ = 10, and trained the critic n_crit = 100 times for every generator weight update. Both our generator and critic are made of fully connected layers with various numbers of hidden neurons. We set the dimension of the latent space to z_dim = 64, with p(z) ∼ U(−1, 1). Note that the choice of how to include the conditional information in the architecture is not unique, and we experimented with different combinations without significant differences. Once the algorithm has been trained, a new seismogram is obtained by feeding the generator with a latent vector and a set of coordinates.

Finally, note that, in this case only, we standardised the data x after the rescaling described in Sec. 3: we calculated the mean μ and the standard deviation σ over all seismograms x, and trained our model on x′ = (x − μ)/σ.
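To make the penalty term concrete, here is a minimal TensorFlow sketch of one critic loss evaluation with gradient penalty, written in the standard log-free Wasserstein form of Gulrajani et al. (2017); the critic is assumed to be a two-input Keras model, and this is our illustration rather than the authors' implementation:

```python
import tensorflow as tf

def critic_loss(critic, real, fake, cond, lam=10.0):
    """One WGAN-GP critic loss evaluation.

    real, fake : batches of (preprocessed) seismograms
    cond       : conditional inputs, i.e. the source coordinates
    """
    # Interpolate between real and generated samples: x_hat
    eps = tf.random.uniform([tf.shape(real)[0], 1], 0.0, 1.0)
    x_hat = eps * real + (1.0 - eps) * fake

    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        c_hat = critic([x_hat, cond])
    grads = tape.gradient(c_hat, x_hat)                 # dC / dx_hat
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    penalty = lam * tf.reduce_mean(tf.square(grad_norm - 1.0))

    # Wasserstein part: score real samples high and fakes low, plus penalty.
    return (tf.reduce_mean(critic([fake, cond]))
            - tf.reduce_mean(critic([real, cond])) + penalty)
```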
4.8 Training, validation and testing

All our models are trained on the same 2000 simulated events used in D18. For optimisation and testing purposes, we divide the remaining 2000 samples (from the pool of 4000 events generated in total by D18) into a validation set and a testing set of 1000 events each. Differently from D18, in this paper we use a validation set to tune the hyperparameters of our deep learning models. To provide an unbiased estimate of the performance of the final tuned models, we quote our definitive results evaluating the accuracy of each model on the testing set, which is never 'seen' by the model at any point in the training or optimisation procedures.

Similar to D18, our accuracy performance is quantified in terms of the R_2D coefficient, a standard statistic commonly used in time series analysis to quantify the correlation between two signals. Given a batch of true seismograms G and the corresponding emulated ones P, the R_2D coefficient is defined as

R_2D = Σ_i Σ_j (G_ij − Ḡ)(P_ij − P̄) / √[ (Σ_i Σ_j (G_ij − Ḡ)²)(Σ_i Σ_j (P_ij − P̄)²) ],   (12)

where Ḡ = (1/(N_s N_t)) Σ_i Σ_j G_ij and P̄ = (1/(N_s N_t)) Σ_i Σ_j P_ij are the means over all i = 1, ..., N_s samples and j = 1, ..., N_t time components of the ground truth G_ij and predicted seismograms P_ij. Given its normalisation, the R_2D coefficient ranges between −1, denoting perfect anti-correlation, and +1, indicating perfect correlation; a vanishing correlation coefficient denotes absence of correlation.

When training our NNs, all implemented in TENSORFLOW‖ (Abadi et al., 2015), we monitor the value of the validation loss to choose the total number of epochs, waiting 100 epochs after the loss has stopped decreasing and restoring the model with the lowest validation loss value. In other words, we early-stop (Yao et al., 2007) based on the validation loss with patience = 100 epochs. Moreover, we optimise our algorithms by calculating the final R_2D coefficient, as defined in Eq. 12, over different combinations of the hyperparameters, choosing the values that yield the highest R_2D. The optimisation procedure is performed using Adam (Kingma and Ba, 2014), with default parameters. The optimisation of the network hyperparameters is entirely performed on the validation set; the testing set is left unseen by the networks until the very last stage of the analysis, when it is used to calculate the results quoted in Tab. 1.

‖ Version 2.2.0, available from https://github.com/tensorflow/tensorflow/tree/v2.2.0
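The R_2D statistic of Eq. 12 is straightforward to compute; a NumPy sketch (our illustration):

```python
import numpy as np

def r2d(G, P):
    """2D correlation coefficient between two batches of seismograms.

    G, P : arrays of shape (N_s, N_t), true and emulated traces.
    """
    dG = G - G.mean()          # deviations from the global means
    dP = P - P.mean()
    return (dG * dP).sum() / np.sqrt((dG**2).sum() * (dP**2).sum())
```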
5 RESULTS

In this Section we summarise our main findings. We start in Sec. 5.1 by describing the accuracy performance of all our new methods, and compare them with that achieved by D18. In Sec. 5.2 we then move to our inference results, describing how we simulated a microseismic measurement and used our generative models to accelerate Bayesian inference of the event coordinates, again comparing against the results obtained by applying the method described in D18.

5.1 Performance of the generative models

In Tab. 1 we report summary statistics for the performance of our generative models. Our goal is to critically compare the different methods, highlighting their strengths and weaknesses, so that the reader can adopt the one that best fits their primary interests and available resources. To perform this comparison, similarly to D18, we consider an experimental setup with only one central receiver at planar coordinates (x = 41, y = 41) in the detection plane z = 244 (see Fig. 6). In the following paragraphs we then consider a more complicated geometrical setup for the detection of microseismic events.

Considerations of accuracy in terms of reconstructed seismograms are important for applications to posterior inference analysis, to avoid biases and/or misestimates of the uncertainty associated with the inferred parameters. In our case, achieving higher accuracy is crucial to guarantee unbiased and accurate estimation of the microseismic source location. For this reason, in Tab. 1 we cite the R_2D statistic defined in Eq. 12 as a means to quantify the accuracy of our methods, similarly to what was done in D18. The R_2D coefficient is evaluated on the testing set, after training and validation of each method, according to the procedure described in Sec. 4.8.

All of our new methods provide an R_2D statistic higher than the one reported by D18 on their testing set. We note that in D18 the testing set was composed of 2000 events, whereas here we split those 2000 events into a validation and a testing set of 1000 samples each. However, we checked that all of our numerical conclusions are unchanged when considering a larger testing set composed of the same 2000 events used by D18. We also checked that training the D18 emulator (augmented with the two GPs for the (Ā, t̄) coefficients) on the seismograms preprocessed following Sec. 3 leads to values of R_2D worse than those obtained applying the D18 method without preprocessing. Hence, for our comparison we decided to leave the D18 method unchanged from its original version, i.e. without performing the preprocessing of Sec. 3 prior to training.
Model            R_2D       Training time (s)   Evaluation time (ms)   Size (MB)
NN direct (A)    … ± …      … ± 12              9.… ± …                …
PCA + NN (B)     … ± …      … ± …               … ± …                  …
PCA + GP (C)     … ± …      … ± 27              97.… ± …               …
AE + NN (D)      … ± …      … ± 12              9.… ± …                …
AE + GP (E)      … ± …      … ± 23              25.… ± …               …
CVAE (F)         … ± …      … ± …               … ± …                  …
WGAN-GP (G)      … ± …      … ± 59              9.… ± …                …
D18              ∼ 0.89     29232               621 ± …                …
Table 1. 2D correlation coefficient R_2D (as defined in Eq. 12), training time, single likelihood evaluation time and total size of the model, for all our models and the model of Das et al. (2018, D18). The capital letter in round brackets refers to the schematic in Fig. 3. Note that the training time refers to the total time to preprocess, train and postprocess the data. All of our experiments were run on an Intel® Core™ i7-8750H CPU @ 2.20 GHz, which can be found in an average-performing laptop. Results for D18 are taken from Tab. 2 and Fig. 14 in D18, and were run in parallel on an HPC cluster, with the only exception of the likelihood evaluation time, which we measured on our machine. The reported values of R_2D and times are the mean and standard deviation of 3 runs. All of our proposed models perform better than the one presented in D18, while taking considerably less time and requiring less disk space and lower hardware performance.

The 'NN direct' model, described in Sec. 4.1, provides the highest R_2D value among our proposed methods. This is due to the combination of two factors: the relatively simple structure of the seismograms, given the isotropic nature of their moment tensor, and the preprocessing applied to the training seismograms. On the one hand, isotropic sources are characterised by strong and very localised P-wave peaks, which clearly dominate over the rest of the signal. This simplifies the form of the signal with respect to e.g. pure compensated linear vector dipole (CLVD) and double-couple (DC) events, characterised by a more complicated signal structure (Vavryčuk, 2015; Das et al., 2017). On the other hand, even with the relatively simple structure of the isotropic seismograms, the training of a NN to map coordinates to seismic traces is extremely challenging due to the reduced number of training samples available. It is for this reason that we applied the preprocessing of Sec. 3 to the training seismograms. This has the advantage of extracting information regarding the source-sensor distance, encoded mainly in the location and amplitude of the P-wave peak of each seismic trace. By isolating these features into the parameters (Ā, t̄), we simplify the task for our NN or any other method learning the mapping between coordinates and seismograms. This approach relies on being able to train methods that learn efficiently the mapping between coordinates and the (Ā, t̄) coefficients. Fortunately, this mapping is not too complicated, depending mainly on the distance of the source from the sensor, and is quite simple to learn for the GPs described in Sec. 3 which, as we verified experimentally, show higher accuracy than NNs in learning the (Ā, t̄) coefficients.

Fig. 5 shows the reconstruction accuracy of three models among the ones considered in Tab. 1, namely the D18 method and the two models proposed in this paper which yield the lowest and highest R_2D coefficients (the 'WGAN-GP' and 'NN direct' methods, respectively). We evaluate the predictions of these three models for three random coordinates among those of the testing set, and check how the predictions compare with the original seismograms. We notice how in some cases the D18 method fails to produce accurate predictions (as for the seismograms shown in the second and third columns of Fig. 5).
The 'WGAN-GP' and 'NN direct' methods, instead, manage to yield more accurate predictions in these cases, in particular regarding the location and amplitude of the P-wave peak, which is crucial for localisation purposes. From the Figure we can appreciate how even the 'WGAN-GP' method, whose accuracy is the worst among the methods proposed in this paper (cf. Tab. 1), reconstructs the seismograms in the second and third columns better than the D18 method.

Speed considerations are also important when evaluating the performance of the models. In general, applications of deep learning to Bayesian analysis may often be possible only by making use of High Performance Computing (HPC) infrastructures. If these are not available, applications to real parameter estimation frameworks may be fatally compromised. Therefore, it is important to notice that all our proposed models can be efficiently run on a simple laptop, without the need of any HPC platform. If HPC infrastructures are available, they would speed up our models even further. In particular, running all generative models on GPUs would lead to a speed-up of at least an order of magnitude (Wang et al., 2019).

Importantly, however, even without this HPC acceleration we find that all our models are ∼2-3 orders of magnitude faster to train and to evaluate than the method described in D18. We stress here that the advantage of our models in terms of speed lies not only in requiring considerably less time to train, but also, and arguably more importantly, in predicting a seismogram much faster than with the D18 method. This point is essential for applications to parameter inference, e.g. through MCMC techniques, where a forward model needs to be computed at each likelihood evaluation. The single-evaluation time for our models is reduced by up to 2 orders of magnitude with respect to that of D18, which in turn means that Bayesian inference of microseismic sources will be similarly faster (see Sec. 5.2). The training time required by each method is also significantly lower than in D18. We note that this last property makes the training of our models much less demanding in terms of computational resources. We also note that the creation of a training dataset, with a few thousand seismograms generated by solution of the elastic wave equation, is a computational overhead cost that we share with the analysis of D18, and therefore its generation time is not reported in Tab. 1 for any of the methods, including D18 (see Sec. 6 for a discussion on how to reduce this overhead simulation time in future work).
Figure 5.
Comparison of the reconstruction accuracy of different emulation methods on three random seismograms from the testing set (in black, dashed line), with coordinates (x, y, z) = (15, 48, 111), (15, 38, 51) and (73, 13, 58) reported on top of each panel. The horizontal axis is zoomed around the location of the P-wave peak. In addition to the D18 method (in blue), we show the performance of the methods achieving the lowest and highest accuracy as reported in Tab. 1: the 'WGAN-GP' model (pink) and the 'NN direct' model (red), respectively.
Among our proposed methods, the fastest to evaluate is the 'PCA+NN' method described in Sec. 4.2. This was expected, as this model is composed of a relatively small NN and a reconstruction through the predicted PCA coefficients. Both operations essentially boil down to matrix multiplications, which can be executed with highly optimised software libraries. We also notice that the methods requiring GP predictions ('PCA+GP' and 'AE+GP' in Tab. 1) are the ones that perform worst in terms of evaluation and training speed. Again, this was expected, as it is due to the nature of GP regression itself. Contrary to NNs, GPs are non-parametric methods that need to take into account the entire training dataset each time they make a prediction. At inference time they need to keep in memory the whole training set, and the computational cost of predictions scales cubically with the number of training samples (Liu et al., 2018). This also affects the D18 method, in an even more exacerbated form, since the number of GPs involved in that method is higher.

Related to the difference between GP and NN regression are the storage size requirements of the different methods. Models employing NNs are less demanding than GPs in terms of memory requirements, mainly because they do not need to keep memory of the training data. Within NN architectures, the simpler ones are, intuitively, the lightest to store. 'PCA+NN' is, again, the best performing method in this regard, winning in particular over 'AE+NN', since the latter requires the storage of weights and biases for two NNs.

Figure 6.
Left panel: simulated noisy seismic traces recorded by the sensors at the seabed, with the configuration shown in the right panel. These recorded seismograms represent the data vector for our simulated posterior distribution inference.
Right panel: simulated receiver geometry in the detection plane z = 244. The dots indicate the sensor locations, at planar coordinates (21, 61), (41, 41), (61, 21) and (61, 61), with colours matching those of the recorded seismic traces in the left panel. The central receiver with coordinates (41, 41) is the one that we consider (similarly to D18) when we quantify the performance of our trained generative models in Table 1. We then include the other receivers when we demonstrate the effectiveness of our models in carrying out Bayesian inference of the coordinates of a simulated seismic event, whose projection on the detection plane, at (x, y) = (31, 25), is marked with an orange cross.

5.2 Bayesian inference of a simulated microseismic event

Now that we have quantified the performance of our generative models, we want to apply them to the Bayesian inference of a microseismic event location. To this purpose, we simulate the detection of a microseismic event and wish to infer the posterior distribution of its coordinates. The posterior distribution of a set of parameters θ for a given model or hypothesis H and a data set D is given by Bayes' theorem (e.g. MacKay, 2003),

Pr(θ|D, H) = Pr(D|θ, H) Pr(θ|H) / Pr(D|H).   (13)

The posterior distribution Pr(θ|D, H) is the product of the likelihood function Pr(D|θ, H) and the prior distribution Pr(θ|H) on the parameters, normalised by the evidence Pr(D|H), usually ignored in parameter estimation problems since it is independent of the parameters θ. In this work we employ the algorithm MULTINEST (Feroz et al., 2009) for multi-modal nested sampling (Skilling, 2006), as implemented in the software PYMULTINEST** (Buchner et al., 2014), to sample the posterior distribution of our model parameters (i.e. the source coordinates).

We simulate the observation of a microseismic isotropic event by generating a noiseless trace for a source with specified coordinates (x, y, z) = (31, 25, …). Our goal is to derive posterior distribution contours on the coordinates x, y, z, which represent our parameters. Following D18, we add random Gaussian noise to each component of the noiseless trace, with standard deviation σ = 250 in the same arbitrary units as the seismograms' amplitude. The resulting seismic trace, measured at multiple receivers, is shown in the left panel of Fig. 6. Similarly to D18, we assume a Gaussian likelihood. We stress here that the particular choices made for the noise modelling and the likelihood function are not restrictive: our methodologies are easily applicable to more complicated noise models or likelihood forms, while we chose to use the same as investigated by D18 for a direct and fair comparison.

** https://github.com/JohannesBuchner/PyMultiNest
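As an illustration of this inference setup, a Gaussian likelihood wrapped around a trained emulator can be passed to PyMultiNest as follows; `emulate` is a hypothetical stand-in for the 'NN direct' model plus the two GPs, the true source depth and the z prior upper bound are placeholders, and the x and y prior ranges follow Tab. 2:

```python
import numpy as np
from pymultinest.solve import solve

sigma = 250.0                           # noise level, arbitrary pressure units
n_receivers, n_t = 4, 501

def emulate(x, y, z):
    """Hypothetical stand-in for the trained emulator: it should return the
    noiseless traces predicted at all receivers for a source at (x, y, z)."""
    return np.zeros((n_receivers, n_t))

# Simulated data vector: emulated truth plus Gaussian noise.
rng = np.random.default_rng(0)
data = emulate(31.0, 25.0, 100.0) + rng.normal(0.0, sigma, (n_receivers, n_t))

def prior(cube):
    # Map the unit hypercube to uniform priors on (x, y, z); the z upper
    # bound here is a placeholder.
    return cube * np.array([81.0, 81.0, 301.0])

def loglike(params):
    model = emulate(*params)
    return -0.5 * np.sum(((data - model) / sigma) ** 2)

result = solve(LogLikelihood=loglike, Prior=prior, n_dims=3,
               outputfiles_basename="event_", verbose=False)
```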
Coordinate   Prior range   Ground truth   D18               NN direct
x            [0, 81]       31             17^{+42}_{−…}     …^{+14}_{−…}
y            [0, 81]       25             16^{+42}_{−…}     …^{+11}_{−…}
z            [0, …]        …              …^{+64}_{−…}      …^{+18}_{−…}

Table 2.
Prior ranges, and means with marginalised 68 per cent credibility intervals, on the coordinates (x, y, z), for both the D18 method and our proposed 'NN direct' model, described in Sec. 4.1.

Instead of repeating the analysis for each proposed generative model, we decide to use the one that has been shown in Tab. 1 to achieve the greatest accuracy, i.e. the direct neural network mapping between coordinates and seismograms, described in Sec. 4.1. We simulate an experimental setup with multiple receivers on the detection plane z = 244, shown in the right panel of Fig. 6. D18 reported a maximum likelihood calculation for up to 23 receivers placed on the same plane. Here, our aim is to test the performance of our models at inference time, while we do not wish to carry out a detailed analysis for the optimisation of the receiver geometry. In particular, we wish to compare the D18 emulator with ours at inference time. Thus, we do not consider all 23 receivers considered by D18. While we find that considering only one receiver is obviously not enough to achieve significant constraints on the coordinates, after experimenting with different configurations and numbers of receivers we find that considering four receivers, placed in the upper diagonal part of the detection plane as shown in Fig. 6, leads to significant constraints on the event coordinates. This is true if we consider our 'NN direct' method, whose accuracy is higher than that of D18. Indeed, repeating the inference analysis with the D18 emulator, given the same receiver configuration, leads to constraints ∼3 times less tight on the coordinates, and at a considerable increase in computation time: 66 h of computation using 24 Central Processing Units (CPUs), compared to the few hours required by the 'NN direct' method on a single CPU. The fact that such tighter constraints can be achieved with our emulator, even when making use of the information coming from only four receivers, is due to the increased accuracy of our method, evident from the R_2D values reported in Tab. 1.

Fig. 7 shows the posterior contour plots for the four-receiver configuration described above, obtained with our 'NN direct' generative model and the emulator of D18. The numerical results are summarised in Tab. 2, reporting the prior ranges and the means with marginalised 68 per cent credibility intervals on the coordinates. We notice that the x and y coordinates are less constrained than the z coordinate. This is due to the layered structure of the density and velocity model (cf. Fig. 1), with much more variability along z than along the horizontal directions. A full comparison between the D18 and 'NN direct' methods would require performing the inference process using data from all 23 receivers. However, we found that implementing the D18 method with all 23 receivers involves significant computational complications, even when making use of highly parallelised HPC implementations. We remark that the D18 method fails with few detectors and is computationally expensive with many, while the 'NN direct' method proposed in this paper works well with just 4 detectors and can be expected to work very well, and at lower cost, with many.

Figure 7. Comparison of the marginalised 68 and 95 per cent credibility contours obtained with the D18 method (in blue) and our proposed 'NN direct' generative model (in red) described in Sec. 4.1, considering a seismic trace measured by the four receivers shown in Fig. 6. The black dashed lines indicate the source's true coordinates at (x, y, z) = (31, 25, …).

6 CONCLUSIONS

In this paper we developed new generative models to accelerate Bayesian inference of microseismic event coordinates. Our setup was similar to the one used in Das et al. (2018, D18) to train an emulator with the aim of speeding up the inference process of the source coordinates.
This was achieved by replacing the computationally expensive solution of the elastic wave equation at each point in the parameter space explored by e.g. MCMC techniques for posterior distribution sampling. In both D18 and this work, emulators were trained to learn the mapping between source coordinates and the seismic traces recorded by the sensor.

All models developed in this paper were trained on the same 2000 forward-simulated seismograms used by D18 when training their emulator. However, our models are based on deep learning architectures and make minimal use of GP regression, which is instead performed multiple times in the method proposed by D18. This makes all of our models faster to train and evaluate compared to the previous emulator, achieving a speed-up factor of up to O(10²), as well as reducing the storage requirements of the models. Crucially, this acceleration does not come at the expense of accuracy; on the contrary, our models provide improved constraints on the source coordinates.

We showed this first by calculating the 2D correlation coefficient for the seismograms of the testing set. The values obtained with all our models were higher than those obtained by D18, indicating the higher accuracy achieved. Secondly, we repeated the simulated experiment devised by D18, with sensors placed at the seabed of a 3D marine environment where our simulated sources were randomly located. We showed that, using information coming from only four receivers situated on the detection plane, we were able to provide accurate and tight constraints on the source coordinates, whereas the D18 method struggled to provide any significant constraint given the same setup, and would likely need additional information from more sensors to achieve comparable constraints. As a result of the speed-up obtained at evaluation time, we were able to perform the inference process on a single CPU in a few hours, compared to ∼66 h of calculation over 24 CPUs required by the D18 method.

In conclusion, we provide the community with a collection of deep generative models that can very efficiently accelerate Bayesian inference of microseismic sources. The performances of these methods in terms of accuracy are all comparable among them, and improved with respect to the D18 method. Speed considerations may therefore be invoked in the decision process for a particular method. However, we notice that our framework is valid only for microseismic events characterised by an isotropic moment tensor. Considering more complicated forms of the moment tensor will likely require additional extensions, first of all considering seismic traces recorded for a longer time, since the signal structure will in general be more complicated. Extensions of this work to non-isotropic sources, possibly in combination with other source inversion techniques (e.g. Weston et al., 2014; Frietsch et al., 2019), would then allow for an extension of the parameter space to be explored, including for example the moment tensor components for characterisation of the source mechanism. Additionally, applications to real analyses will need to implement more realistic models for the noise than the one we considered when performing Bayesian inference.

We stress, however, that since these sophistications will likely require larger training sets, the development of the generative models presented in this paper is a crucial step for future work. Once again, this is because our models are based on deep learning architectures, reducing the use of GP regression.
This is a crucial feature for extensions of this work to more complicated source mechanisms and larger training sets, as it is well known that GPs scale poorly with the size of the training set (see e.g. Liu et al., 2018). There remains the issue of producing such large training sets, which requires considerable computational resources. To meet these demands in future extensions of this work, we advocate the use of Bayesian optimisation (see e.g. Frazier, 2018, for a review) to optimise the simulation of training seismograms. Some of the mentioned extensions are currently in development and will be presented in an upcoming paper (Piras et al., in prep.); meanwhile, the software implementation of the models developed in this work is available from the authors upon request.

ACKNOWLEDGMENTS
This work has been partially enabled by funding from Royal Dutch Shell plc and the UCL Cosmoparticle Initiative. Some of the computations have been performed on the Wilkes High Performance GPU computer cluster at the University of Cambridge; we are grateful to Stuart Rankin and Greg Willatt for their technical support. We wish to express our deepest gratitude to Saptarshi Das for useful discussions and support during the initial phase of this project. We also thank Stephen Bourne, Xi Chen, Ana Ferreira, Detlef Hohl, Jonathan Smith and Teh-Ru Alex Song for useful discussions. We acknowledge the use of the software ChainConsumer (Hinton, 2016) for producing contour plots and NN-SVG (LeNail, 2019) for producing neural network architecture schematics. DP is supported by the STFC UCL Centre for Doctoral Training in Data Intensive Science.
Figure 7. Comparison of the marginalised 68 and 95 per cent credibility contours obtained with the D18 method (in blue) and our proposed 'NN direct' generative model (in red) described in Sec. 4.1, considering a seismic trace measured by the four receivers shown in Fig. 6. The black dashed lines indicate the source's true coordinates at (x, y, z) = (31, …, …).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv e-prints, arXiv:1701.07875.
Aster, R., Borchers, B., and Thurber, C. (2012). Parameter estimation and inverse problems.
Recherche, 67:02.
Auld, T., Bridges, M., Hobson, M., and Gull, S. (2007). Fast cosmological parameter estimation using neural networks. Mon. Not. Roy. Astron. Soc., 376:L11–L15.
Auld, T., Bridges, M., and Hobson, M. P. (2008). cosmonet: fast cosmological parameter estimation in non-flat models using neural networks. Monthly Notices of the Royal Astronomical Society, 387(4):1575–1582.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Buchner, J., Georgakakis, A., Nandra, K., Hsu, L., Rangel, C., Brightman, M., Merloni, A., Salvato, M., Donley, J., and Kocevski, D. (2014). X-ray spectral modelling of the AGN obscuring region in the CDFS: Bayesian model selection and catalogue. Astronomy and Astrophysics, 564:A125.
Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. (2016). Mode regularized generative adversarial networks. CoRR, abs/1612.02136.
Chen, X. and Hobson, M. (2019). Bayesian surrogate learning in dynamic simulator-based regression problems. arXiv e-prints, arXiv:1901.08898.
Collettini, C. and Barchi, M. R. (2002). A low-angle normal fault in the Umbria region (central Italy): a mechanical model for the related microseismicity. Tectonophysics, 359(1):97–115.
Craiu, R. V. and Rosenthal, J. S. (2014). Bayesian computation via Markov chain Monte Carlo. Annual Review of Statistics and Its Application, 1(1):179–201.
Das, S., Chen, X., and Hobson, M. P. (2017). Fast GPU-based seismogram simulation from microseismic events in marine environments using heterogeneous velocity models. IEEE Transactions on Computational Imaging, 3(2):316–329.
Das, S., Chen, X., Hobson, M. P., Phadke, S., van Beest, B., Goudswaard, J., and Hohl, D. (2018). Surrogate regression modelling for fast seismogram generation and detection of microseismic events in heterogeneous velocity models. Geophysical Journal International, 215(2):1257–1290.
Devroye, L., Mehrabian, A., and Reddad, T. (2018). The total variation distance between high-dimensional Gaussians. arXiv e-prints, arXiv:1810.08693.
Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv e-prints, arXiv:1606.05908.
Feroz, F., Hobson, M., and Bridges, M. (2009). MultiNest: an efficient and robust Bayesian inference tool for cosmology and particle physics. Mon. Not. Roy. Astron. Soc., 398:1601–1614.
Fertitta, G., Stefano, A. D., Fiscelli, G., and Giaconia, G. C. (2010). A low power and high resolution data logger for submarine seismic monitoring. Microprocessors and Microsystems, 34(2):63–72.
Frazier, P. I. (2018). A Tutorial on Bayesian Optimization. arXiv e-prints, arXiv:1807.02811.
Frietsch, M., Ferreira, A. M. G., Funning, G. J., and Weston, J. (2019). Multiple fault modelling combining seismic and geodetic data: the importance of simultaneous subevent inversions. Geophysical Journal International, 218(2):958–976.
Gibbs, A. L. and Su, F. E. (2002). On Choosing and Bounding Probability Metrics. International Statistical Review, 70(3):419–435.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Networks. arXiv e-prints, arXiv:1406.2661.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. CoRR, abs/1704.00028.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
Hinton, S. R. (2016). ChainConsumer. The Journal of Open Source Software, 1:00045.
Kasim, M. F., Watson-Parris, D., Deaconu, L., Oliver, S., Hatfield, P., Froula, D. H., Gregori, G., Jarvis, M., Khatiwala, S., Korenaga, J., Topp-Mugglestone, J., Viezzer, E., and Vinko, S. M. (2020). Up to two billion times acceleration of scientific simulations with deep neural architecture search. arXiv e-prints, arXiv:2001.08055.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv e-prints, arXiv:1412.6980.
Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv e-prints, arXiv:1312.6114.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
LeNail, A. (2019). NN-SVG: Publication-ready NN-architecture schematics.
Liu, H., Ong, Y.-S., Shen, X., and Cai, J. (2018). When Gaussian Process Meets Big Data: A Review of Scalable GPs. arXiv e-prints, arXiv:1807.01065.
Maas, A. L. (2013). Rectifier nonlinearities improve neural network acoustic models.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. CoRR, abs/1611.02163.
Mukuhira, Y., Asanuma, H., Ito, T., and Häring, M. O. (2016). Physics-based seismic evaluation method: Evaluating possible seismic moment based on microseismic information due to fluid stimulation. Geophysics, 81(6):KS195–KS205.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg.
Odaibo, S. G. (2019). Tutorial: Deriving the standard variational autoencoder (VAE) loss function. CoRR, abs/1907.08956.
Oliehoek, F. A., Savani, R., Gallego-Posada, J., van der Pol, E., and Groß, R. (2018). Beyond local Nash equilibria for adversarial networks. CoRR, abs/1806.07268.
Pugh, D. J., White, R. S., and Christie, P. A. F. (2016). A Bayesian method for microseismic source inversion. Geophysical Journal International, 206(2):1009–1038.
Raghu, M. and Schmidt, E. (2020). A Survey of Deep Learning for Scientific Discovery. arXiv e-prints, arXiv:2003.11755.
Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Rubner, Y., Tomasi, C., and Guibas, L. J. (1998). A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 59–66.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning Representations by Back-Propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA.
Sason, I. and Verdú, S. (2015). f-divergence Inequalities. arXiv e-prints, arXiv:1508.00335.
Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859.
Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc.
Stähler, S. C. and Sigloch, K. (2014). Fully probabilistic seismic source inversion; part 1: Efficient parameterisation. Solid Earth, 5(2):1055–1069.
Treeby, B. E., Jaros, J., Rohrbach, D., and Cox, B. T. (2014). Modelling elastic wave propagation using the k-Wave MATLAB toolbox. In 2014 IEEE International Ultrasonics Symposium, pages 146–149.
Valentine, A. P. and Trampert, J. (2012). Data space reduction, quality assessment and searching of seismograms: autoencoder networks for waveform data. Geophysical Journal International, 189(2):1183–1202.
Vavryčuk, V. (2015). Moment tensor decompositions revisited. Journal of Seismology, 19(1):231–252.
Villani, C. (2008). Optimal transport – Old and new, volume 338, pages xxii+973.
Wang, Y. E., Wei, G.-Y., and Brooks, D. (2019). Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv e-prints, arXiv:1907.10701.
Weston, J., Ferreira, A., and Funning, G. J. (2014). Joint earthquake source inversions using seismo-geodesy and 3-D earth models. Geophysical Journal International, 198(2):671–696.
Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26:289–315.
APPENDIX A: SUMMARY OF D18 EMULATOR
Here we briefly summarise, for comparison, the surrogate model developed in D18 for fast emulation of isotropic microseismic traces, given their source locations on a 3D grid. We report the main steps of the procedure, referring to D18 for all details.
(i) We first compress the training seismograms, isolating in each of them the 100 dominant components in absolute value and storing their amplitudes and time indices;
(ii) we then train a GP for each dominant component and for each time index, so that in total there are 2 × 100 = 200 GPs to train: each of the 100 GPs for the signal part learns to predict the mapping between coordinates and one sorted dominant component of the seismograms, while the corresponding GP for the time index learns to predict the temporal index associated with that dominant component;
(iii) once the GPs are trained, for each set of coordinates the 100 predictions for the dominant signal components and the 100 predictions for their indices produce a compressed version of the seismogram, in which the (predicted) subdominant components are set to zero.
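To make steps (i) and (iii) concrete, the following minimal NumPy sketch shows one possible implementation of the compression and reconstruction of a single trace. The function names and the exact sorting convention are our own illustrative choices, not code from D18.

```python
import numpy as np

def compress_seismogram(trace, n_keep=100):
    """Keep the n_keep components of a trace with the largest absolute value.

    Returns their amplitudes and time indices, which form the regression
    targets of the 2 x n_keep GPs described in step (ii).
    """
    # Indices of the dominant components, sorted by decreasing |amplitude|.
    idx = np.argsort(np.abs(trace))[::-1][:n_keep]
    return trace[idx], idx

def reconstruct_seismogram(amplitudes, indices, n_samples):
    """Rebuild a compressed trace, setting subdominant components to zero (step iii)."""
    trace = np.zeros(n_samples)
    trace[indices] = amplitudes
    return trace
```

At prediction time, the trained amplitude and index GPs would supply `amplitudes` and `indices` for a new set of source coordinates, and `reconstruct_seismogram` would then produce the emulated (compressed) trace.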
APPENDIX B: KULLBACK-LEIBLER DIVERGENCE

B1 Definition and properties

Given two probability distributions P and Q of a continuous random variable X, one possible way of measuring their distance is the Kullback-Leibler divergence (KL divergence; Kullback, 1959), defined as
\[
D_{\rm KL}(P \,||\, Q) = \int_X p(x) \log \frac{p(x)}{q(x)} \, {\rm d}x \, , \tag{B.1}
\]
where p and q are the probability densities of P and Q, respectively. It is easy to show that $D_{\rm KL}(P \,||\, Q) \geq 0$ and that $D_{\rm KL}(P \,||\, Q) = 0 \iff P = Q$ almost everywhere: this is in line with the idea of $D_{\rm KL}(P \,||\, Q)$ being a way of measuring the distance between P and Q. However, we also note that the KL divergence is not symmetric ($D_{\rm KL}(P \,||\, Q) \neq D_{\rm KL}(Q \,||\, P)$), that it does not satisfy the triangle inequality, and that it belongs to a broader class of divergences called f-divergences (see e.g. Gibbs and Su, 2002; Sason and Verdú, 2015; Arjovsky et al., 2017, and references therein).
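As a concrete illustration of the definition (B.1) and of the asymmetry of the KL divergence, the short Python snippet below compares the closed-form expression for two univariate Gaussians against a direct numerical integration of Eq. (B.1); the means and standard deviations are arbitrary illustrative values, not quantities from this work.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1 = 0.0, 1.0   # P = N(0, 1)
mu2, s2 = 1.0, 2.0   # Q = N(1, 4)

def kl_gauss(m1, sd1, m2, sd2):
    """Closed-form KL divergence between two univariate Gaussians."""
    return np.log(sd2 / sd1) + (sd1**2 + (m1 - m2)**2) / (2 * sd2**2) - 0.5

# Direct numerical evaluation of Eq. (B.1).
integrand = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1)
                                              - norm.logpdf(x, mu2, s2))
kl_numeric, _ = quad(integrand, -30.0, 30.0)

print(kl_gauss(mu1, s1, mu2, s2), kl_numeric)  # both ~ 0.4431
print(kl_gauss(mu2, s2, mu1, s1))              # ~ 1.3069: D_KL(Q||P) != D_KL(P||Q)
```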
B2 Calculation of the loss function

In Sec. 4.6 we introduced the KL divergence in the loss function of the Conditional Variational Autoencoder (CVAE). In that instance, we calculate $D_{\rm KL}(q_\theta(\mathbf{z}|\mathbf{x},\mathbf{c}) \,||\, p(\mathbf{z}|\mathbf{c}))$, where both $q_\theta(\mathbf{z}|\mathbf{x},\mathbf{c})$ and $p(\mathbf{z}|\mathbf{c})$ are multivariate normal distributions. In particular, we choose $q_\theta(\mathbf{z}|\mathbf{x},\mathbf{c}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}(\mathbf{x},\mathbf{c}), \Sigma)$ and $p(\mathbf{z}|\mathbf{c}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \Sigma)$, where $\Sigma$ is a diagonal matrix with all entries equal to a fixed variance $\sigma^2$, and $\boldsymbol{\mu}(\mathbf{x},\mathbf{c})$ is the output of the encoder network of the CVAE. It is easy to show (Kullback, 1959; Rasmussen and Williams, 2005; Devroye et al., 2018) that the KL divergence between two multivariate normal distributions reduces to
\[
D_{\rm KL}(\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) \,||\, \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)) = \frac{1}{2} \log |\Sigma_2 \Sigma_1^{-1}| + \frac{1}{2} \mathrm{tr}\left[ \Sigma_2^{-1} \left( (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\rm T} + \Sigma_1 - \Sigma_2 \right) \right] . \tag{B.2}
\]
In our case, since $\Sigma_1 = \Sigma_2 = \Sigma$ and $\boldsymbol{\mu}_2 = \mathbf{0}$, we can write
\[
D_{\rm KL}(q_\theta(\mathbf{z}|\mathbf{x},\mathbf{c}) \,||\, p(\mathbf{z}|\mathbf{c})) = \frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \boldsymbol{\mu}(\mathbf{x},\mathbf{c}) \, \boldsymbol{\mu}^{\rm T}(\mathbf{x},\mathbf{c}) \right] = \frac{1}{2\sigma^2} \sum_{i=1}^{z_{\rm dim}} \mu_i^2(\mathbf{x},\mathbf{c}) \, , \tag{B.3}
\]
where $z_{\rm dim} = 5$ is the chosen dimensionality of the latent space.
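The reduction from Eq. (B.2) to Eq. (B.3) is easily verified numerically. In the NumPy sketch below, the vector standing in for $\boldsymbol{\mu}(\mathbf{x},\mathbf{c})$ is drawn at random and the value of $\sigma^2$ is an arbitrary placeholder; neither corresponds to the trained encoder or to the variance actually used in our experiments.

```python
import numpy as np

z_dim = 5                 # latent-space dimensionality used in this work
sigma2 = 0.1              # placeholder for the fixed variance sigma^2
rng = np.random.default_rng(0)
mu = rng.normal(size=z_dim)          # stand-in for the encoder output mu(x, c)
Sigma = sigma2 * np.eye(z_dim)

# General expression (B.2) with Sigma_1 = Sigma_2 = Sigma and mu_2 = 0:
# the log-determinant term and (Sigma_1 - Sigma_2) vanish identically.
kl_general = 0.5 * (np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma))
                    + np.trace(np.linalg.inv(Sigma)
                               @ (np.outer(mu, mu) + Sigma - Sigma)))

# Simplified expression (B.3).
kl_simple = np.sum(mu**2) / (2.0 * sigma2)

assert np.isclose(kl_general, kl_simple)
```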
APPENDIX C: DETAILS OF WGAN-GP

In Sec. 4.7 we explained how standard Generative Adversarial Networks (GANs) are prone to training instabilities and mode collapse; therefore, in this work we chose to employ a variant called Wasserstein GAN with Gradient Penalty (WGAN-GP; Arjovsky et al., 2017; Gulrajani et al., 2017). In this algorithm, two networks, called generator (G) and critic (C), are trained to minimise the Wasserstein-1 distance between the data distribution and the generative model distribution, implicitly defined by $G(\mathbf{z})$ with $\mathbf{z} \sim p(\mathbf{z})$.
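For concreteness, the sketch below shows one way to implement the critic side of this objective in TensorFlow, including the gradient penalty of Gulrajani et al. (2017). The conditioning of both networks on the source coordinates, the call signatures of `generator` and `critic`, and the penalty weight of 10 are our assumptions for illustration; they are not the exact implementation used in this work.

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake, cond, gp_weight=10.0):
    """Gradient penalty term of the WGAN-GP critic loss.

    Interpolates between real and generated seismograms and penalises
    deviations of the critic's gradient norm from 1.
    """
    batch_size = tf.shape(real)[0]
    # Random interpolation points between real and fake traces.
    eps = tf.random.uniform([batch_size, 1], 0.0, 1.0)
    interp = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interp)
        # The critic is assumed to be conditioned on the source coordinates.
        score = critic([interp, cond])
    grads = tape.gradient(score, interp)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return gp_weight * tf.reduce_mean((grad_norm - 1.0) ** 2)

def critic_loss(critic, generator, real, cond, z):
    """Full WGAN-GP critic objective: Wasserstein estimate plus penalty."""
    fake = generator([z, cond])
    w_estimate = (tf.reduce_mean(critic([fake, cond]))
                  - tf.reduce_mean(critic([real, cond])))
    return w_estimate + gradient_penalty(critic, real, fake, cond)
```

In standard WGAN-GP training the critic is updated several times per generator update, after which the generator is trained to minimise $-\mathbb{E}[C(G(\mathbf{z}))]$.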