HypoSVI: Hypocenter inversion with Stein variational inference and Physics Informed Neural Networks
Jonathan D. Smith, Zachary E. Ross, Kamyar Azizzadenesheli, Jack B. Muir
Abstract
We introduce a scheme for probabilistic hypocenter inversion with Stein variational inference. Our approach uses a differentiable forward model in the form of a physics-informed neural network, which we train to solve the Eikonal equation. This allows for rapid approximation of the posterior by iteratively optimizing a collection of particles against a kernelized Stein discrepancy. We show that the method is well-equipped to handle highly non-convex posterior distributions, which are common in hypocentral inverse problems. A suite of experiments is performed to examine the influence of the various hyperparameters. Once trained, the method is valid for any network geometry within the study area without the need to build travel time tables. We show that the computational demands scale efficiently with the number of differential times, making it ideal for large-N sensing technologies like Distributed Acoustic Sensing.
Earthquake hypocenters represent the points in space and time at which earthquakes occur. They are a fundamental component of many downstream analyses in seismology, from seismic tomography to earthquake source properties. They are also used for real-time earthquake forecasting, such as during active sequences. Thus the ability to reliably estimate hypocenters and characterize their uncertainty is of major importance in seismology.

Determining a hypocenter from observations of seismic waves is a classic inverse problem in geophysics (Geiger, 1912; Thurber, 1985). More recently, Bayesian inference has been used for hypocenter inversion (Tarantola, 2004; Lomax et al., 2000), in which prior information is combined with observations to infer posterior distributions over the hypocentral parameters. In their most general form, the travel time solutions from ray theory are nonlinear in the hypocentral coordinates, which for this particular problem adds non-convexity to the posterior. Furthermore, and perhaps a more serious issue, the observations of seismic wave arrival times often contain significant errors resulting from the widespread adoption of automated picking algorithms (Ross et al., 2018; Mousavi et al., 2020). One strategy for adding robustness to hypocenter inversions has been to incorporate non-standard likelihood functions into the inverse problem, which has significantly improved the results but at the cost of creating highly non-convex posteriors. This makes many common techniques for performing Bayesian inference, such as Markov chain Monte Carlo or variational inference, ill-suited for this particular problem, with the non-convex nature of the posterior expected to make them computationally slow.

∗ Seismological Laboratory, California Institute of Technology, Pasadena, CA, USA
† Lawson Computer Science Building, Purdue University, West Lafayette, IN, USA
Recent advances in deep learning have led to the development of physics-informed neural networks (PINNs), which are designed to learn solutions to partial differential equations (PDEs). Such approaches have a number of appealing properties that are not present with conventional approaches like finite difference methods: for example, the solutions can be made differentiable, are often mesh-free, and can be rapidly calculated on demand. These properties make PINNs well-suited as the forward model in an inverse problem, in particular since it is often desirable to take gradients of some objective function. Sampling and ensemble methods such as MCMC or variational inference typically require many evaluations of the forward model, so solving this with a PINN makes the problem more computationally tractable.

The rise of deep learning over the last decade has accelerated the development of novel approaches for performing Bayesian inference. One notable example is Stein variational inference (SVI), in which a collection of particles is iteratively optimized to approximate a target posterior (Qiang & Dilin, 2016). It is better suited than standard variational techniques to handling multi-modal distributions because the number of modes does not need to be known a priori; this results from a kernelized objective function that creates a natural repulsive force between the particles. SVI requires evaluating many gradients and, as such, benefits from the differentiability of PINNs.

Our contributions in this paper are as follows: (1) we develop a framework for earthquake hypocenter inversion using Stein variational inference; (2) we incorporate a PINN trained to solve the Eikonal equation as a forward model; (3) we perform experiments on the hyperparameters of the inverse problem to characterize their effect on the solution; and (4) we benchmark the method against a catalog of earthquakes from Southern California.
In this section, we provide background information on Stein variational inference as well as physics-informed neural networks.
For two random variables x and y, let p(x) denote the prior on x, p(y|x) the likelihood function, and p(x|y) the posterior over x after observing (conditioning on) y. Using the celebrated Bayes rule, these quantities are related as p(x|y) = p̄(x)/Z, where p̄(x) = p(y|x) p(x), and the normalization constant Z = ∫ p̄(x) dx.

Let H denote a reproducing kernel Hilbert space on the domain of x, with a positive definite reproducing kernel κ, endowed with the inner product ⟨·,·⟩ and the norm ‖·‖_H. We further define H^d as a set of multivalued functions with d values, with the corresponding norm ‖·‖_{H^d}, where for any f = [f_1, f_2, ..., f_d] ∈ H^d we have f_i ∈ H for all i ∈ {1, 2, ..., d}.

For a function f ∈ H^d, we define Stein's operator endowed with H^d and p as

    (A_p f)(x) = f(x) ∇_x log p(x)^⊤ + ∇_x f(x).    (1)

We further define the kernelized Stein discrepancy between two distributions p and q using H^d as

    D(q, p) := max{ E_{x∼q}[trace(A_p f(x))] : f ∈ H^d, ‖f‖_{H^d} ≤ 1 }.    (2)

This discrepancy equals zero when p = q. Fortunately, the maximization in Eq. 2 has a closed-form solution D(q, p) = ‖f*_q‖_{H^d}, where f*_q := E_{x∼q}[A_p κ(x, ·)] is the maximizer.

Now consider the Kullback–Leibler divergence between q and p, KL(q ‖ p). We aim to find a gradient direction g, a function, such that g maximally reduces the KL divergence. Using g, we can perform gradient descent with learning rate α and update q ← q + αg to reduce KL(q ‖ p) and move q closer to p. It is known that for the kernelized Stein discrepancy, the direction g ∈ H^d of maximal change is g := f*_q (Qiang & Dilin, 2016).
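The kernelized direction f*_q can be estimated from a finite set of particles and used as a gradient-descent update. The following NumPy sketch is our own illustration (function names and the toy Gaussian target are not from the paper): it implements one such update with an RBF kernel and a median-based bandwidth, and drives a particle cloud toward a known target distribution.

```python
import numpy as np

def svgd_step(x, grad_logp, h=None):
    """Estimate the update direction f*_q from particles x, shape (n, d)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
    d2 = (diff ** 2).sum(-1)                      # pairwise squared distances
    if h is None:                                 # median heuristic bandwidth
        h = np.median(d2) / np.log(n) + 1e-12
    K = np.exp(-d2 / h)                           # RBF kernel k(x_i, x_j)
    # attractive term: (1/n) sum_i k(x_i, x_j) * grad log p(x_i)
    attract = K.T @ grad_logp(x) / n
    # repulsive term: (1/n) sum_i grad_{x_i} k(x_i, x_j)
    repulse = -(2.0 / h) * (K[:, :, None] * diff).sum(axis=0) / n
    return attract + repulse

# Toy target: 2-D Gaussian with mean (2, -1) and unit covariance
mu = np.array([2.0, -1.0])
grad_logp = lambda x: mu - x                      # grad of log N(mu, I)

rng = np.random.default_rng(1)
particles = rng.standard_normal((50, 2))          # random initial particles
for _ in range(500):                              # q <- q + alpha * g
    particles = particles + 0.1 * svgd_step(particles, grad_logp)

print(particles.mean(axis=0))                     # approaches mu = (2, -1)
```

The repulsive term (the kernel gradient) is what keeps the particles spread out across the target distribution rather than collapsing onto its mode.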
In the following, we provide an update rule to update q and approximate the posterior p given observed data. We represent q with a set of particles {x_i}, i = 1, ..., n, i.e., an average of many Dirac delta measures, where q is the distribution of these particles. We update q and move it closer to p by moving the particles. Therefore, for the update direction f*_q = E_{x∼q}[A_p κ(x, ·)], at each point x we have

    f*_q(x) = (1/n) Σ_{i=1}^{n} [ κ(x_i, x) ∇_{x'} log p(x')|_{x'=x_i} + ∇_{x'} κ(x', x)|_{x'=x_i} ],

with the updating rule given by

    x_i^{l+1} ← x_i^l + α_l f*_q(x_i^l).    (3)

Here α_l is the step size at the l-th epoch. For the choice of kernel, we deploy the celebrated radial basis function (RBF), κ(x', x) = exp(−‖x − x'‖²/h), with h representing the width of the kernel, for its empirical and universal approximation properties. As discussed above, the update in Eq. 3 updates q (through updating the particle distribution) at each step to make it closer to p in the Stein discrepancy sense.

In solving inverse problems for earthquake hypocenters, the most common approach is to use a ray-theoretical forward model to calculate the expected travel times, T, for seismic waves propagating from a given source location to a receiver location. In heterogeneous 3D Earth models, the Eikonal equation is often solved to determine T (Rawlinson & Sambridge, 2005),

    ‖∇_r T_{s→r}‖ = 1/V(x_r) = S(x_r),    (4)

where ‖·‖ is the Euclidean norm, T_{s→r} is the travel time through the medium from a source location s to a receiver location r, V(x_r) is the velocity of the medium at the receiver location, S(x_r) is the slowness of the medium at the receiver location, and ∇_r is the gradient with respect to the receiver location.

Smith et al.
(2020) developed a PINN approach to solving the factored Eikonal equation (EikoNet), which trains a deep neural network to calculate the travel time between any two points in a 3D medium for a given velocity model, satisfying the additional boundary condition that the travel time at the source location equals zero, T_{s→s} = 0. We leverage a factored Eikonal formulation to mitigate the strong singularity effects at the source location, representing the travel time as a deviation from a homogeneous medium with V = 1 (Treister et al., 2016). The factored travel-time form can then be represented by

    T_{s→r} = T_0 · τ_{s→r},    (5)

where T_0 = ‖x_r − x_s‖ is the distance function from the source location, and τ is the deviation of the travel-time field from a model travel-time field with homogeneous unit velocity. Substituting the formulation of Eq. 5 into Eq. 4 and expanding using the chain rule, the velocity can be represented by

    V(x_r) = [ T_0² ‖∇_r τ_{s→r}‖² + 2 τ_{s→r} (x_r − x_s) · ∇_r τ_{s→r} + τ_{s→r}² ]^{−1/2}.    (6)

They leveraged the analytical differentiability of neural networks to solve the factored Eikonal equation from scratch, without the use of finite-difference solutions during training. Once trained, a network describing the travel time between any source-receiver pair can be represented by

    T_{s→r} = f_θ(x_s, x_r),    (7)

where T_{s→r} is the travel time between the source location x_s and receiver location x_r, and f is the neural network with weights and biases given by θ.

EikoNet has several properties that are mathematically advantageous in solving inverse problems. First, the solutions to the Eikonal equation are mesh-independent, i.e. they are not discretized on a fixed grid and can be evaluated at any point within the 3D medium.
Second, the network is a forward model that is analytically differentiable, which allows gradient-based methods to be efficiently employed to calculate a downstream objective function. Third, by approximating the Eikonal equation with a deep neural network, the optimization part of the inverse problem is easily solved with graphics processing units (GPUs).

We now present an approach for probabilistic hypocenter inversion that uses a PINN as a forward model and SVI to approximate the posterior distribution. The method consists of several primary steps:

1. An EikoNet model is trained for a given Earth velocity model to solve the Eikonal equation. This is performed for both P-waves and S-waves.
2. A collection of particles is randomly initialized throughout the geographic study area. These represent preliminary hypocenter locations.
3. Travel times are calculated with EikoNet from each particle to every receiver with an observation.
4. The synthetic travel times are used together with the data to calculate a kernelized Stein discrepancy (loss function).
5. The gradients of the loss are calculated with automatic differentiation and used to collectively update the particles' locations.
6. Steps 3-5 are repeated until convergence. The final collection of particle positions will approximate the posterior distribution of the hypocenter.
7. Uncertainty estimates are extracted from the particles using kernel density methods.

Figure 1: Overview of the inversion procedure. Panel 1 represents the travel times between the observational locations (red triangles, given by x_obs) and the particle locations (black dots, given by x_i). The particle kernel density is given by a series of gray contours, with the particle with the maximum kernel density given by the white star. Panel 2 shows the distribution of the particle locations changing with the Stein variational gradient descent. Panel 3 represents the particle location optimisation, with the step size given by α_i and optimisation direction f*_q(x). Panel 4 represents the clustering procedure used to determine the optimised location, represented by the kernel density maximum, and the location uncertainty given by the clustered particle kernel density.

Next, we provide a detailed discussion of each stage of the procedure, with the outline of the inversion given in Figure 1.
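To make the enumerated steps concrete, the toy sketch below runs the loop end-to-end in 2-D under simplifying assumptions of our own (not the paper's implementation): an analytic homogeneous-velocity travel time stands in for EikoNet, a Gaussian likelihood on absolute arrival times stands in for the EDT likelihood, and plain fixed-step updates stand in for ADAM.

```python
import numpy as np

def svgd_step(x, grad_logp):
    """SVGD update direction with an RBF kernel (median-heuristic bandwidth)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]
    d2 = (diff ** 2).sum(-1)
    h = np.median(d2) / np.log(n) + 1e-12
    K = np.exp(-d2 / h)
    return (K.T @ grad_logp(x) - (2.0 / h) * (K[:, :, None] * diff).sum(0)) / n

# Step 1 stand-in: homogeneous medium, so T(s -> r) = |r - s| / v analytically.
v, sigma = 3.0, 0.5
stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 0.0]])
true_src = np.array([3.0, 4.0])
t_obs = np.linalg.norm(stations - true_src, axis=1) / v     # noise-free "picks"

def grad_logp(x):
    """Per-particle gradient of a Gaussian log-likelihood of arrival times."""
    g = np.zeros_like(x)
    for k, xi in enumerate(x):
        d = np.linalg.norm(stations - xi, axis=1) + 1e-9    # step 3: travel distances
        resid = t_obs - d / v                               # step 4: time misfits
        # d/dx of -(resid^2) / (2 sigma^2), summed over stations
        g[k] = ((resid / (sigma ** 2 * v))[:, None] * (xi - stations) / d[:, None]).sum(0)
    return g

rng = np.random.default_rng(2)
particles = rng.uniform(0.0, 10.0, size=(30, 2))            # step 2: random particles
for _ in range(1000):                                       # steps 5-6: iterate
    particles = particles + 0.01 * svgd_step(particles, grad_logp)

print(particles.mean(axis=0))                               # clusters near (3, 4)
```

In the full method, steps 3-5 replace the analytic distance with the EikoNet network f_θ and obtain the gradients by automatic differentiation through it.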
Throughout this study we train EikoNet travel-time models using a set of constant training parameters and network architecture as described in Smith et al. (2020) and supplied in Table 1. A model region is defined spanning our longitude, latitude and depth region of interest, with xmin and xmax locations of [°′W, °′N, − km] and [°′W, °′N, km] respectively. The grid is projected to a UTM coordinate system, with random source-receiver locations selected within the UTM model space. These points represent the training locations, with the different velocity models discussed below.

Table 1: EikoNet training paradigm used to learn velocity models
Parameter: Value
Dataset Size: ×
Validation Fraction: .
Batch Size:
Optimizer: ADAM (+ scheduler)
Learning Rate: ×10−
Sampling Type: Weighted Random Distance
Sampling Type Bounds: [0., .]
Domain Normalization: Offset Min-Max Normalization
Network Architecture: Dense → + Dense → + × Residual Blocks → + Dense → + Dense →
Activation Function: ELU

In many earthquake location procedures the complex geometry of the subsurface is poorly understood, and lateral variations in velocity are assumed to be negligible compared to velocity variations in depth. As such, one-dimensional velocity structures describing how the velocity changes with depth are specified. These models typically have independent velocity structures defined for both the P-wave and S-wave arrivals, or a Vp/Vs scaling relationship. It is important to understand how reliable these models are for location procedures such as HypoSVI, as they would be a typical starting model for many use cases. In addition, understanding the computational demand of training these simpler travel-time models informs the feasibility of the method on typical computational systems. We investigate these questions for our region of interest by training EikoNet travel-time models from the Vp and Vs velocity structures shown by the blue dots in Figure 2a. We interpolate the velocity between the supplied points as the linear interpolation of the observed velocity values. Two independent EikoNet neural networks are trained for the Vp and Vs velocity structures using the network parameters specified in Table 1. The training of each model took 10 epochs, with roughly minutes of training time on an Nvidia V100 GPU and ∼ minutes on a free Colab GPU (either an Nvidia K80, T4 or P100). Once trained, the travel-time models can be validated by comparing the imposed observed velocity to the predicted velocity, determined as the analytical gradient of the travel-time neural network, for a series of × source-receiver pairs within the three-dimensional domain. Figure 2 outlines the comparison of the observed velocity structure and the predicted velocity, with the variance of the predicted velocity within . km/s of the observed values.
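This validation step, recovering velocity from the gradient of the travel-time field, can also be checked independently of any neural network using Eq. 6. The sketch below is our own illustration (the function name is hypothetical): for a homogeneous medium with velocity c, the factored field reduces to τ = 1/c with ∇τ = 0, and Eq. 6 returns V = c.

```python
import numpy as np

def velocity_from_tau(x_s, x_r, tau, grad_tau):
    """Recover V(x_r) from the factored travel-time field via Eq. 6.

    tau      : deviation field evaluated at the receiver
    grad_tau : gradient of tau with respect to the receiver coordinates
    """
    T0 = np.linalg.norm(x_r - x_s)                 # distance factor T0 = |x_r - x_s|
    term = (T0 ** 2) * grad_tau @ grad_tau \
         + 2.0 * tau * (x_r - x_s) @ grad_tau \
         + tau ** 2
    return term ** -0.5

# Homogeneous medium with V = 5 km/s: T = |x_r - x_s| / 5, so tau = 1/5, grad tau = 0
x_s = np.zeros(3)
x_r = np.array([3.0, 4.0, 0.0])
print(velocity_from_tau(x_s, x_r, 1.0 / 5.0, np.zeros(3)))  # ~5.0, to float precision
```

In practice τ and ∇τ come from the trained network, with ∇τ evaluated by automatic differentiation rather than analytically.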
The consistent velocity structure and low computational overhead show that this method is viable regardless of the available computational infrastructure.

Figure 2: EikoNet trained travel-time formulation for a one-dimensional velocity model changing only in depth (Z). Each curve represents a different model computed for both the P-wave velocity structure (VP) and S-wave velocity structure (VS). Blue points represent the user-defined velocity values at depth, and blue lines the linear interpolation of velocity between points. Gray points represent the predicted velocity from EikoNet for × randomly selected points for each of the velocity models.

In more well-studied regions, prior geophysical datasets and analyses can be leveraged to gain better insight into the complex subsurface and therefore the velocity structure. The velocity structure of our region of interest, Southern California, has been densely studied, with a compilation of direct velocity model estimates (Shaw et al., 2015; Lee et al., 2014) used to construct detailed subsurface velocity models defined under the group project 'Southern California Earthquake Center Community Velocity Model' (SCEC-CVM). The current versions of these velocity models encompass the SCEC-CVM-H (Shaw et al., 2015, version 15.1.0) and SCEC-CVM-S (Lee et al., 2014, version 4.26). Incremental improvements are made to the CVM-H, but the CVM-S has seen no advances since the current version. The SCEC-CVMs are computed using numerous datasets (Süss & Shaw, 2003), encompassing a detailed subsurface three-dimensional velocity structure built from the Moho surface, basement surface and topographic/bathymetric surfaces.

Using the SCEC-CVM-H model, we train EikoNet models to determine the travel time within the complex 3D velocity structure. The models are trained on × randomly selected source-receiver points within the domain, with an example slice at Longitude = 115°′W ± .′ given for the P-wave and S-wave velocity structures in Figures 3a and 3b respectively. Once trained, the EikoNet models represent the travel time and predicted velocity between any two points; as such, we show the recovered velocity model colourmap and travel-time contours (at s spacing) for an earthquake source at [115°′W, °′, km] on a receiver grid with separation [Latitude, Depth] = [0.°, . km] at Longitude = 115°′W. This example shows consistent agreement between the observed and predicted velocity models, reconciling the sharp velocity contrasts which create deflections in the travel-time fields, and demonstrates the viability of the method in complex 3D velocity structures.

Figure 3: EikoNet trained travel-time formulation for the complex three-dimensional velocity of SCEC-CVM-H. Plots represent a slice of the three-dimensional structure taken at Longitude = 115°′W. (a) and (b) represent the P-wave (VP) and S-wave (VS) velocity structures for the training points within ±.′ of the longitude slice and within the latitude and depth domain of the model space. (c) and (d) represent the predicted velocity structure colourmap and predicted travel-time contours, at s intervals, for the P-wave and S-wave EikoNet models.

An earthquake hypocenter, m, is composed of three spatial coordinates, [x, y, z], and the origin time, t_o. Most commonly, the data used to locate earthquakes are measured arrival times of seismic P- and S-waves ("phase picks") over a network of seismic instruments. These phase picks define a set of absolute arrival time observations d = T^obs, where d ∈ R^N. In a Bayesian framework (Tarantola, 2004; Lomax et al., 2000), inference on m is performed by combining prior knowledge together with the observations,

    p(m|d) = Z^{-1} p(d|m) p(m),    (8)

where p(m|d) is the posterior distribution, p(m) is the prior distribution, p(d|m) is the likelihood, and Z is a normalization constant.

A simple example of p(d|m) for hypocenter inversion is

    p(d|m) = exp( − Σ_i [T_i^obs − T_i^pred]² / σ_i² ),    (9)

where σ_i is an estimate of uncertainty and

    T^pred = t_o + f_θ(x_s, x_r)    (10)

is a nonlinear forward model, i.e. a solution to the Eikonal equation plus the origin time. Thus, the forward model in this problem is a physics-informed neural network. Since x_s is included as an input to the neural network, downstream gradients can be taken with respect to it.

More recently, a likelihood function based on the equal differential time method (Lomax et al., 2000, EDT) has seen increasing usage. The EDT likelihood builds differential times from all pairs of phases and, in the process, decouples the origin time, t_o, from the spatial coordinates of the hypocenter. The formulation is given by

    p(d|m) = [ Σ_a Σ_b (σ_a² + σ_b²)^{-1/2} exp(A) ]^N,    (11)

    A = − [ (T_a^obs − T_b^obs) − (T_a^pred − T_b^pred) ]² / (σ_a² + σ_b²),    (12)

where a and b are different phase arrival time observations, σ is a phase-dependent estimate of uncertainty, and N is the total number of differential times. In addition to reducing the number of latent variables by one, this formulation acts to minimise the effects of outliers, which are particularly common with automated picking algorithms. This robustness results from the fact that in the EDT likelihood the errors are combined in an additive manner, rather than the multiplicative manner typical in Bayesian inference problems. Each term in Eq. 11 produces a hyperbolic error surface that decays like a Gaussian in the direction normal to each point on the hyperbola. Thus, Eq.
11 can be viewed as producing a stack of hyperbolas with relatively limited intersection, which creates robustness in the presence of strong outliers. However, the downside is that it results in posterior distributions that are highly non-convex, making MCMC methods and standard variational inference schemes difficult to use for this problem (Lomax et al., 2000). The origin time is reintroduced by using the optimised earthquake location to determine the predicted origin times at each of the observational locations, taking the origin time as the median of the predicted origin times. The origin-time uncertainty is then defined by the median absolute deviation (MAD) from the predicted origin time. We use a uniform prior, p(m), with samples selected within the model domain specified in the Eikonal physics-informed neural network.

The uncertainty in the posterior distribution is assigned as a combination of the observational uncertainty, σ_obs, and the forward model uncertainty, σ_pred, given as

    σ² = σ_obs² + σ_pred².    (13)

The observational uncertainty represents uncertainty in each of the observational times, with an expected standard deviation for each observation time supplied by the user. This value is then converted to a variance to define σ_obs for each observation. The forward model uncertainty is constructed as a function of the predicted travel time for each of the observational locations (similar to that given in Lomax et al., 2000),

    σ_pred = { σ_min,         for σ_f T^pred < σ_min
             { σ_f T^pred,    for σ_min ≤ σ_f T^pred ≤ σ_max
             { σ_max,         for σ_f T^pred > σ_max    (14)

where σ_f is the fraction of the travel time to use as uncertainty, bounded within the minimum and maximum uncertainties specified by σ_min and σ_max respectively. Throughout this work we use [σ_f, σ_min, σ_max] = [0., . s, . s], discussing the effects of these parameters on synthetic testing within Section 4.5.

A Stein variational gradient descent procedure is used to optimise for the equal differential time posterior. We use a Gaussian kernel for the SVI procedure, also known as the radial basis function (RBF) kernel, for its practical and universal approximation properties. First, we initialize N particles randomly using a uniform prior over the 3D study area. For each of these particle locations, we calculate the corresponding travel times using the EikoNet forward model (Panel 1 of Figure 1), evaluate the posterior (to within the normalization constant Z), and determine the kernelized Stein discrepancy (Panel 2 of Figure 1). Then, we calculate the gradients of this loss function particle-wise with respect to the hypocentral coordinates using automatic differentiation, which is possible due to the differentiability of the PINN (Panel 3 of Figure 1). We use these gradients together with the ADAM optimizer (Kingma & Ba, 2014) to update the particle locations until convergence, where the optimal hypocentral location is consistent across multiple epochs. Supplementary Video 1 demonstrates the convergence for the example outlined in Figure 1.

The next step is to extract summary statistics from the posterior distribution. As mentioned previously, the posterior is typically strongly non-convex due to the EDT likelihood function, although many of the local extrema are effectively negligible in amplitude. Therefore, we aim to determine the dominant cluster of particles representing the main peak of the posterior. This is achieved by using the DBSCAN clustering technique (Hahsler et al., 2019, Panel 4 of Figure 1) to identify high-density clusters of particles, with the cluster containing the largest number of particles used as the dominant cluster.
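Before moving to the clustering step, the loss being optimised is worth making concrete. Below is a minimal NumPy sketch of the EDT log-likelihood (Eqs. 11-12) and the travel-time-dependent uncertainty (Eqs. 13-14); the function names are our own, and we read Eq. 13 as combining variances in quadrature. It also verifies the decoupling property noted above: adding a constant origin-time shift to all predicted times leaves the EDT likelihood unchanged.

```python
import numpy as np
from itertools import combinations

def sigma_total(t_pred, sigma_obs, s_frac, s_min, s_max):
    """Combined uncertainty: observational plus clipped forward-model term (Eqs. 13-14)."""
    sigma_pred = np.clip(s_frac * t_pred, s_min, s_max)
    return np.sqrt(sigma_obs ** 2 + sigma_pred ** 2)

def edt_log_likelihood(t_obs, t_pred, sigma):
    """Equal differential time log-likelihood (Eqs. 11-12) for a single particle."""
    pairs = list(combinations(range(len(t_obs)), 2))
    total = 0.0
    for a, b in pairs:
        var = sigma[a] ** 2 + sigma[b] ** 2
        resid = (t_obs[a] - t_obs[b]) - (t_pred[a] - t_pred[b])
        total += np.exp(-resid ** 2 / var) / np.sqrt(var)
    return len(pairs) * np.log(total)          # N = number of differential times

t_obs = np.array([1.2, 2.5, 3.1, 4.0])
sigma = sigma_total(t_obs, sigma_obs=0.2, s_frac=0.1, s_min=0.1, s_max=1.0)

exact = edt_log_likelihood(t_obs, t_obs, sigma)
shifted = edt_log_likelihood(t_obs, t_obs + 7.0, sigma)    # common origin-time shift
print(np.isclose(exact, shifted))                           # True: origin time decouples
```

Because only differential times enter the residual, the origin time drops out of the spatial optimisation exactly as described, and is recovered afterwards from the median predicted origin time.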
Once the dominant cluster is identified, we apply kernel density methods using Gaussian kernels to estimate the MAP location and quantify the location uncertainty from its covariance matrix. We discuss hyperparameter tuning for DBSCAN (Hahsler et al., 2019) in a later section.

In this section we first demonstrate the earthquake inversion scheme on a series of synthetic tests. We construct a catalogue of synthetic earthquake locations across the region, determining the travel time to a grid of observation points at a fixed elevation of km, before applying a . s uncertainty to the synthetic phase arrivals and inverting to determine the earthquake location and uncertainty. The earthquake locations are at a fixed latitude and depth of °′N and km respectively, with longitude varying from °′W to °′W at ′ separations. The recovered optimal hypocentre and location uncertainty are then compared with the imposed earthquake locations and an expected 2-std contour from a grid-search approach. We vary the possible user-defined parameters, with the optimised parameters given in Table 2 and earthquake locations in Figure 4. However, we expect that these parameters will need to be varied somewhat depending on the exact application, for example if the error models or network geometry are changed significantly. As such, we recommend that initial synthetic testing is undertaken before real data are inverted. Outlined below are discussions of how each hyperparameter affects the recovered locations for this study, with corresponding Supplementary Figures S9-S11.

The number of particles used in the Stein variational gradient descent is of great importance for the resolution of the resolved earthquake location. If the number of particles is too small then the particle density is unable to adequately represent the posterior distribution.
However, a large number of particles increases the computational demand of the inversion procedure and is intractable for large earthquake catalogues. We specify an optimal number of samples equal to and find that an increase in the number of particles does not provide additional information on the earthquake location, but reducing the number of particles greatly affects the posterior. Additional plots for variations in the number of particles, with the remaining parameters set equal to Table 2, are given in Supplementary Figure S9.

Figure 4: (a) km. Pink triangles represent the synthetic station locations, at a fixed depth of km. The black line represents a cross section at a fixed latitude, with the cross section given in (b). (b) represents the imposed earthquake locations (black points), recovered optimal locations with error bars (blue dots), posterior determined by the particle density (red contours), and grid-search derived posterior at 2-std (gray line).

Table 2: HypoSVI parameters used in earthquake location
Parameter: Value
Number of Particles:
Number of Epochs:
Observation Weights: [0., . s, . s]
Radial Basis Function:
Location Uncertainty: [0. km,

The number of epochs affects how well the particle density represents the posterior for the earthquake location. Figure 1b demonstrates that the posterior changes drastically in the early epochs, but once it converges there is little to no change in the recovered posterior. However, an increase in the number of epochs affects the computational time, which over a large number of earthquake locations could have a large effect. We optimise the number of epochs to determine when the earthquake locations and location uncertainty are consistent between epochs.
The RBF kernel can be represented by κ(x, x') = exp(−‖x − x'‖²/h), where h is the shape parameter and x − x' the pairwise particle difference. As this term acts as a repulsive force in the SVI procedure, increasing h has the effect of increasing the minimum distance between particle locations. Understanding the trade-off for the shape parameter is important, as larger values could affect the recovered posterior. Qiang & Dilin (2016) defined a dynamic shape parameter with the value changing as h = med²/log n, where med is the median pairwise particle distance; with this definition Σ_j κ(x_i, x_j) ≈ n exp(−med²/h) = 1, demonstrating that for each x_i the contribution from its own gradient and the influence from the other points balance out. We investigate the variation of the RBF shape parameter on the recovered synthetic earthquake locations, finding that the parameter has little effect on the recovered optimal hypocentral location, with minor variations in the recovered posterior for static values between − and for the dynamic shape parameter (Supplementary Figure S10). We decided to use a static shape parameter of , to mitigate any difference in the posterior that could occur between multiple runs of the same observations with a dynamic shape parameter.

The total uncertainty assigned to the inverse problem is a combination of the picking uncertainty and the forward model uncertainty due to the velocity structure. As described previously, we follow Lomax et al. (2000) and characterize the uncertainty in the forward model as a fraction of the travel time. This is a reasonable choice, as the uncertainty in the predicted travel times is expected to grow in proportion to the travel time. In our hyperparameter investigation we found that a fraction of . should be used, as lower values lead to significant mislocation of the recovered events (Supplementary Figure S11).
The upper and lowerbounds to the allowed error has less of an effect on our synthetic testing, which we attribute to the syntheticstation locations being regularly spaced. For observational data that is clustered spatially the upper andlower bounds could be of great importance and should be investigated with synthetic examples for the specificnetwork geometry. The posterior for earthquake location is non-convex due to the EDT likelihood function and we aim todetermine the dominant cluster of particles representing the main peak of the posterior. This is achieved byusing the DBSCAN clustering technique (Hahsler et al. , 2019) to identify high-density clusters of particles.We investigate the variation of the two dominant parameters defining the DSCAN clustering technique, themaximum distance between two samples for them to be considered as in the neighbourhood of the other andthe minimum number of samples per cluster (Supplementary figure S12). The definition of the minimumdistance is crucial for effectively sampling the dominant peak of the distribution, with a minimum distancetoo small possibly subsampling the peak as multiple clusters and a value too large including additional localpeaks in the the posterior. We found that the minimum distance must be defined large enough to remove
the effects of subsampling the dominant peak in the posterior; in this use case we found . km sufficient, although there is little variation in the recovered kernel density function until the minimum distance is expanded to at least km. In addition, we find little effect on the kernel density function from changes in the minimum number of particles per cluster, something that is to be expected when the dominant cluster typically comprises > of the particles for these synthetic inversions. We conclude that a minimum cluster separation of . km should be used for these large regional-scale problems and that the result is insensitive to low values of the minimum number of samples per cluster; we use a minimum cluster comprising particles.

The number of observations going into an inversion affects the compute time, as each observation requires predicted travel times and their gradients to be computed from EikoNet. Here, we investigate the computational cost of the inversion procedure with an increasing number of observations. We replicate an increasing number of observations by copying the synthetic station deployment locations multiple times, labelling them as different station names but retaining the same arrival times. This synthetic test was chosen to minimise changes to the location estimate, which would occur if additional synthetic station locations were provided. All other location hyperparameters are fixed at the values given in Table 2. The earthquake locations are then determined for the varying numbers of observations and total numbers of pairwise differential times, with the average computational time on an Nvidia V100 shown in Table 3.

Table 3: Average computational time per event on an Nvidia V100 for increasing numbers of observations.

Observations   Differential times   Time (s)
32             496                  6
128            8128                 17
512            130816               64
1024           523776               155
1408           990528               247
1728           1492128              336
2028           2055378              439
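The dominant-cluster selection described above can be sketched as follows. The paper cites the R dbscan package (Hahsler et al., 2019); this Python analogue uses scikit-learn's DBSCAN instead, and the `eps_km` and `min_samples` values are illustrative rather than the paper's exact choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dominant_cluster(particles, eps_km=0.5, min_samples=5):
    """Keep only the dominant high-density cluster of SVI particles.

    eps_km is the maximum neighbour distance and min_samples the minimum
    number of samples per cluster, the two hyperparameters discussed above.
    """
    labels = DBSCAN(eps=eps_km, min_samples=min_samples).fit_predict(particles)
    core = labels[labels >= 0]                 # drop noise points (label -1)
    if core.size == 0:
        return particles                       # no cluster found: keep all
    dominant = np.bincount(core).argmax()      # most populated cluster
    return particles[labels == dominant]

rng = np.random.default_rng(seed=2)
# a dense main peak plus a small secondary local peak of the posterior
main = rng.normal(loc=0.0, scale=0.1, size=(80, 3))
side = rng.normal(loc=5.0, scale=0.1, size=(10, 3))
kept = dominant_cluster(np.vstack([main, side]))
```

With a well-chosen `eps_km`, the secondary peak is rejected and only the particles around the main mode are retained for the kernel density estimate.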
The computational time even for the largest case of 2028 observations, comprising 2,055,378 differential times, is only 439 s per event. These synthetic tests demonstrate that the approach is computationally scalable, with computational time increasing linearly in log-log space with the number of observational differential times.

To further validate the developed method, we apply it to real earthquakes occurring within the Southern California region, with the region defined in Section 3.2. This study area was chosen as it encompasses a large seismic network and complex 3D regional velocity structures (Allam & Ben-Zion, 2012). We used the detections and phase picks from the open-source Southern California Earthquake Data Center (SCEDC) phase arrival observational catalogue, for the first k events starting 2019-01-01. The events and phase picks used have all been manually reviewed by analysts at the Southern California Seismic Network (Hutton et al., 2010).

5.2 Earthquake Location comparisons with NonLinLoc

We infer hypocenters for the k earthquakes using two different velocity models (the 1D and 3D cases described in Section 3.2). The hyperparameters used for the inversions are outlined in Table 2, with a detailed explanation of the reasoning behind the parameter definitions given in Section 4. The catalogues are generated on an Nvidia V100 GPU with an average of s per event, varying depending on the number of observations in the inversion procedure, with on average ∼ observations per event. Since the calculation of travel times from EikoNet is independent of the complexity of the velocity model (once the network has been trained), the processing takes equal time for the 1D and 3D trained models. Example inversions for three events are shown in Figure 5.

To understand the validity of our location technique, we compare our earthquake catalogue with a catalogue determined using the conventional earthquake location software, NonLinLoc.
NonLinLoc is a nonlinear earthquake location package leveraging finite-difference travel-time solutions; Gaussian or equal differential-time likelihood functions; and posterior sampling schemes using oct-tree, grid-search or Markov chain Monte Carlo (MCMC). Travel times are computed by solving the eikonal equation with the finite-difference approach outlined in Podvin & Lecomte (1991). For a 1D velocity structure, varying only in depth, the package computes the travel times as a radial 2D finite-difference travel-time model that depends on the radial distance from the observation point and the depth, saving these as independent travel-time look-up tables. In contrast, for complex three-dimensional velocity structures the travel times are computed on a user-defined grid of receiver locations, with each observation saved as a separate travel-time look-up table. Since the storage and computational requirements for NonLinLoc with the complex 3D velocity model at very high location-grid resolution are prohibitive, this approach was intractable: it would introduce large gridding artifacts into the recovered earthquake locations and predicted location uncertainties, which are not directly comparable to the grid-free solutions of HypoSVI. Instead, we compare the HypoSVI and NonLinLoc locations using the one-dimensional velocity structure, with the NonLinLoc travel-time and initial location grids resolved to km and km respectively. The location is determined using an Equal Differential Travel-Time (EDT) likelihood function and the oct-tree sampling technique. The location uncertainty of the recovered NonLinLoc catalogue is determined as the standard error in X, Y, Z at 2-std using the diagonal of the covariance matrix. The remaining NonLinLoc user parameters are given in the full control file in the Supplementary Material.
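The defining property of the EDT likelihood used above is that, by working with differential times between station pairs, the unknown origin time cancels out of each residual. A bare-bones sketch of an EDT-style log-likelihood is shown below; the exact NonLinLoc form includes additional normalization terms (Lomax et al., 2000), and all names here are illustrative:

```python
import numpy as np

def edt_log_likelihood(obs_times, pred_times, sigmas):
    """Simplified equal differential travel-time (EDT) log-likelihood.

    For every station pair (a, b) the residual is
    (t_a - t_b) - (T_a - T_b), so a constant origin-time shift cancels.
    """
    n = len(obs_times)
    ll = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            r = (obs_times[a] - obs_times[b]) - (pred_times[a] - pred_times[b])
            var = sigmas[a] ** 2 + sigmas[b] ** 2  # combined pick variances
            ll += -0.5 * r ** 2 / var
    return ll

obs = np.array([3.1, 4.0, 5.2])
sig = np.array([0.1, 0.1, 0.1])
# a uniform shift of all predictions (wrong origin time) cancels exactly,
# while errors in the relative move-out are penalized
good = edt_log_likelihood(obs, obs + 0.7, sig)
bad = edt_log_likelihood(obs, obs + np.array([0.0, 0.5, 1.0]), sig)
```

Because every pair of the n observations contributes one residual, the cost of evaluating this likelihood grows as n(n−1)/2, which is the differential-time scaling reported in Table 3.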
The HypoSVI earthquake catalogues for the 1D and 3D velocity structures are given in Figure 6a-b and 6c-d respectively. For comparison we derive a NonLinLoc catalogue for the subregion [ °W, °N ] to [ °W, °′N ]. This region comprises a total of events in the HypoSVI 1D catalogue (Figure 7a-b), with the NonLinLoc catalogue comprising events (Figure 7c-d). Manual inspection showed that the events present in the NonLinLoc catalogue but not in the HypoSVI catalogue are events that locate outside the subregion in the HypoSVI catalogue but are projected to the edge of the NonLinLoc search grid, having large location uncertainties. For the remaining events we determine the relative location differences between the two catalogues by projecting both catalogues to a local Universal Transverse Mercator (UTM) coordinate system and computing the distance between the events in km in local XYZ coordinates. The relative distances of the NonLinLoc locations minus the HypoSVI 1D locations are given in Figure 8a-c. The relative locations demonstrate no consistent spatial bias, with the mean location difference given by [X, Y, Z] = [+0. km, +0. km, −. km], as shown by the red dot in Figure 8a-c. In addition, we normalise the location differences by the location uncertainty from the NonLinLoc catalogue. Figure 8d-f gives the normalized location distances, with . of the events having a relative distance less than the NonLinLoc location uncertainty, as shown by the points within the dashed box.

In this paper, we developed a new approach to performing Bayesian inference on earthquake hypocenters that combines a differentiable forward model (physics-informed neural network) with Stein variational inference.
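The catalogue comparison described above (per-axis offsets normalized by the NonLinLoc 2-std uncertainty) can be sketched as follows; both catalogues are assumed to already be projected to a common local XYZ frame in km, and all names are illustrative:

```python
import numpy as np

def normalized_offsets(loc_a_km, loc_b_km, sigma_b_km):
    """Relative catalogue offsets, raw and normalized by location uncertainty.

    loc_a_km, loc_b_km: (n_events, 3) arrays of local X, Y, Z positions in km;
    sigma_b_km: per-axis 2-std uncertainties of catalogue b (e.g. NonLinLoc).
    """
    d = loc_a_km - loc_b_km                       # per-axis offsets in km
    z = d / sigma_b_km                            # offsets in units of uncertainty
    within = np.all(np.abs(z) <= 1.0, axis=1)     # inside the dashed box
    return d, z, within.mean()

# two toy events: one within uncertainty, one offset well beyond it in X
hyposvi = np.array([[0.1, -0.2, 0.3], [1.5, 0.0, 0.1]])
nll     = np.array([[0.0,  0.0, 0.0], [0.0, 0.0, 0.0]])
sigma   = np.array([[0.5,  0.5, 0.5], [0.5, 0.5, 0.5]])
d, z, frac = normalized_offsets(hyposvi, nll, sigma)
```

The returned fraction corresponds to the proportion of events falling inside the dashed box of Figure 8d-f.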
[Figure 5 panels: maps and depth cross-sections of SVGD particle evolution (epochs 0, 50, 150) for EventIDs 1000000-1000002 (2019-01-01), each annotated with its optimal location, number of observations and differential times, and 2-std location uncertainties of ∼0.1-0.2 km per axis. Key: optimal hypocentral location; Stein Variational Gradient Descent particles; clustered particle locations; kernel density of clustered particle locations.]
Figure 5: Example earthquake locations for three earthquakes in the catalogues using travel times derived from the three-dimensional regional velocity model. Left panels represent the particle locations changing at different epochs of the Stein Variational Gradient Descent. Right panels represent a zoom-in of the final event locations, with the particle locations shown relative to the recovered optimal hypocentral location. Kernel density contours are shown in red for the clustered particles.

Figure 6: Comparison of earthquake locations between HypoSVI and NLLoc. Left column represents the latitude/longitude map of the detected earthquakes given by red dots, observational station locations given by blue triangles, and mapped faults by gray lines. Right column represents longitude vs depth cross-sections of the earthquakes. (a) and (b) are the locations determined from HypoSVI with an EikoNet model trained on a regional 1D velocity. (c) and (d) are the locations determined from HypoSVI with an EikoNet model trained on the regional SCEC-CVM-H 3D velocity structure.

Figure 7: Zoomed-in earthquake location comparison for the subregion [ °W, °N ] to [ °W, °′N ]. (a)-(b) are the locations determined from HypoSVI with an EikoNet model trained on a regional 1D velocity. (c)-(d) are the locations determined from the NonLinLoc inversion procedure.

Figure 8: Earthquake distance comparison for the NonLinLoc and HypoSVI 1D catalogues for the region [ °W, °N ] to [ °W, °′N ], projected to the local X, Y, Z UTM coordinate system. (a)-(c) black dots represent the relative distance between catalogue event locations in X, Y, Z, with the red dot representing the mean location difference. (d)-(f) black points represent the relative distance between catalogue event locations normalized by the NonLinLoc 2-std location uncertainty.
Red-dashed region represents the catalogue events with a relative distance less than the location uncertainty.

Unlike MCMC sampling methods, SVI approximates a posterior with a collection of particles, with the set of particle locations jointly optimized. In this paper we used an EikoNet forward model, but this could be replaced with any other differentiable forward model. Thus, HypoSVI is a general variational approach to hypocenter inversion. We validated the method with synthetic tests and compared the locations for ∼ events in Southern California with those produced by the Southern California Seismic Network. In particular, we focused on demonstrating the reliability of the method in the presence of non-convex posterior distributions, which SVI is well suited to handling. This is all possible because of the differentiable forward model.

Another advantage of our approach is that it is computationally efficient and can make use of state-of-the-art GPU architectures and modern deep-learning APIs like PyTorch. This allows for rapid calculation of the gradients with automatic differentiation. As GPU hardware improves, for example with increased memory, these performance gains will be passed on to the algorithm, allowing even larger datasets to be worked with than is currently possible. By combining SVI with EikoNet, we are able to evaluate observations at any point within the 3D volume without retraining, i.e. the forward model is valid for any array geometry. Due to the highly parallelized nature of calculations with neural networks, our method scales well to very large networks, which may be important for emerging technologies like Distributed Acoustic Sensing (DAS). This was demonstrated herein by the ability to locate an earthquake with 2048 phase picks in 439 seconds. Thus, our HypoSVI approach is ideal for handling the enormous data volumes that are starting to emerge in seismology.

Acknowledgments
HypoSVI is available on GitHub at https://github.com/Ulvetanna/HypoSVI, with additional runnable Colab notebooks supplied in the repository. This project was partly supported by a grant from the United States Geological Survey (USGS). We would like to thank Jack Wilding and Bing Q. Li for interesting discussions about the implementation and software usage.
References
Allam, A. & Ben-Zion, Y., 2012, Seismic velocity structures in the Southern California plate-boundary environment from double-difference tomography, Geophysical Journal International, , 1181-1196

Geiger, L., 1912, Probability method for the determination of earthquake epicenters from the arrival time only, St. Louis Univ. Bull., , 60-71

Hahsler, M., Piekenbrock, M., & Doran, D., 2019, dbscan: Fast Density-Based Clustering with R, Journal of Statistical Software, , 1-30

Hutton, K., Woessner, J., & Hauksson, E., 2010, Earthquake monitoring in southern California for seventy-seven years (1932-2008), Bulletin of the Seismological Society of America, , 423-446

Kingma, D.P., & Ba, J., 2014, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980

Lee, E., Chen, P., Jordan, T.H., Maechling, P.B., Denolle, M.A.M., & Beroza, G.C., 2014, Full-3-D tomography for crustal structure in Southern California based on the scattering-integral and the adjoint-wavefield methods, Journal of Geophysical Research: Solid Earth, , 6421-6451

Lomax, A., Virieux, J., Volant, P., & Berge-Thierry, C., 2000, Probabilistic earthquake location in 3D and layered models, Advances in Seismic Event Location, 101-134

Mousavi, S. M., Ellsworth, W. L., Zhu, W., Chuang, L. Y., & Beroza, G. C., 2020, Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking, Nature Communications, , 1-12

Podvin, P., & Lecomte, I., 1991, Finite difference computation of traveltimes in very contrasted velocity models: a massively parallel approach and its associated tools, Geophysical Journal International, , 271-284

Qiang, L. & Dilin, W., 2016, Stein variational gradient descent: A general purpose Bayesian inference algorithm, Advances in Neural Information Processing Systems, 2378-2386

Rawlinson, N., & Sambridge, M., 2005, The fast marching method: an effective tool for tomographic imaging and tracking multiple phases in complex layered media, Exploration Geophysics, , 341-350

Ross, Z. E., Meier, M., Hauksson, E., & Heaton, T. H., 2018, Generalized seismic phase detection with deep learning, Bulletin of the Seismological Society of America, , 2894-2901

Shaw, J.H., Plesch, A., Tape, C., Suess, M.P., Jordan, T.H., Ely, G., Hauksson, E., Tromp, J., Tanimoto, T. & Graves, R., 2015, Unified structural representation of the southern California crust and upper mantle, Earth and Planetary Science Letters, , 1-15

Smith, J. D., Azizzadenesheli, K., & Ross, Z. E., 2020, EikoNet: Solving the Eikonal Equation With Deep Neural Networks, IEEE Transactions on Geoscience and Remote Sensing, 1-12, 10.1109/TGRS.2020.3039165

Süss, M.P., & Shaw, J.H., 2003, P wave seismic velocity structure derived from sonic logs and industry reflection data in the Los Angeles basin, California, Journal of Geophysical Research: Solid Earth

Tarantola, A., 2004, Inverse Problem Theory and Methods for Model Parameter Estimation, Society for Industrial and Applied Mathematics

Thurber, C.H., 1985, Nonlinear earthquake location: theory and examples, Bulletin of the Seismological Society of America, , 779-790

Treister, E. & Haber, E., 2016, A fast marching algorithm for the factored eikonal equation, Journal of Computational Physics, , 210-225

A Supplementary Figures