70 YEARS OF MACHINE LEARNING IN GEOSCIENCE IN REVIEW

A PREPRINT

Jesper Sören Dramsch [email protected]

August 27, 2020

ABSTRACT
This review gives an overview of the development of machine learning in geoscience. A thorough analysis of the co-developments of machine learning applications throughout the last 70 years relates the recent enthusiasm for machine learning to developments in geoscience. I explore the shift of kriging towards a mainstream machine learning method and the historic application of neural networks in geoscience, following the general trend of machine learning enthusiasm through the decades. Furthermore, this chapter explores the shift from mathematical fundamentals and knowledge in software development towards skills in model validation, applied statistics, and integrated subject matter expertise. The review is interspersed with code examples to complement the theoretical foundations and illustrate model validation and machine learning explainability for science. The scope of this review includes various shallow machine learning methods, e.g. Decision Trees, Random Forests, Support-Vector Machines, and Gaussian Processes, as well as deep neural networks, including feed-forward neural networks, convolutional neural networks, recurrent neural networks, and generative adversarial networks. Regarding geoscience, the review has a bias towards geophysics but aims to strike a balance with geochemistry, geostatistics, and geology; it excludes remote sensing, as this would exceed the scope. In general, I aim to provide context for the recent enthusiasm surrounding deep learning with respect to research, hardware, and software developments that enable successful application of shallow and deep machine learning in all disciplines of Earth science.

Keywords Review · Machine Learning · Deep Learning · Neural Networks · Kriging · Earth Science · Geoscience · Geology · Geophysics

The author of this manuscript has a background in geophysics, exploration geoscience, and active source 4D seismic. While this skews the expertise, they attempt to give a full overview over developments in all of geoscience with the minimum amount of bias possible.

In recent years machine learning has become an increasingly important interdisciplinary tool that has advanced several fields of science, such as biology [Ching et al., 2018], chemistry [Schütt et al., 2017], medicine [Shen et al., 2017] and pharmacology [Kadurin et al., 2017]. Specifically, the method of deep neural networks has found wide application. While geoscience was slower in the adoption, bibliometrics show the adoption of deep learning in all aspects of geoscience. Most subdisciplines of geoscience have been treated to a review of machine learning. Remote sensing has been an early adopter [Lary et al., 2016], with geomorphology [Valentine and Kalnins, 2016], solid Earth geoscience [Bergen et al., 2019], hydrogeophysics [Shen, 2018], seismology [Kong et al., 2019], seismic interpretation [Wang et al., 2018] and geochemistry [Zuo et al., 2019] following suit. Climate change, in particular, has received a thorough treatment of the potential impact of varying machine learning methods for modelling, engineering and mitigation to address the problem [Rolnick et al., 2019]. This review addresses the development of applied statistics and machine learning in the wider discipline of geoscience in the past 70 years and aims to provide context for the recent increase in interest and successes in machine learning and its challenges.

Machine learning (ML) is deeply rooted in applied statistics, building computational models that use inference and pattern recognition instead of explicit sets of rules. Machine learning is generally regarded as a sub-field of artificial intelligence (AI), with the notion of AI first being introduced by Turing [1950].
Figure 1: Machine Learning timeline from [Dramsch, 2019]. Neural Networks: [Russell and Norvig, 2010]; Kriging: [Krige, 1951]; Decision Trees: [Belson, 1959]; Nearest Neighbours: [Cover and Hart, 1967]; Automatic Differentiation: [Linnainmaa, 1970]; Convolutional Neural Networks: [Fukushima, 1980, LeCun et al., 2015]; Recurrent Neural Networks: [Hopfield, 1982]; Backpropagation: [Kelley, 1960, Bryson, 1961, Dreyfus, 1962, Rumelhart et al., 1988]; Reinforcement Learning: [Watkins, 1989]; Support Vector Machines: [Cortes and Vapnik, 1995]; Random Forests: [Ho, 1995]; LSTM: [Hochreiter and Schmidhuber, 1997]; Torch Library: [Collobert et al., 2002]; ImageNet: [Deng et al., 2009]; Scikit-Learn: [Pedregosa et al., 2011]; LibSVM: [Chang and Lin, 2011]; Generative Adversarial Networks: [Goodfellow et al., 2014]; Tensorflow: [Abadi et al., 2015]; XGBoost: [Chen and Guestrin, 2016]

Samuel [1959] coined the term machine learning itself, with Mitchell et al. [1997] providing a commonly quoted definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. Mitchell et al. [1997]

This means that a machine learning model is defined by a combination of requirements. A task such as classification, regression, or clustering is improved by conditioning of the model on a training data set. The performance of the model is measured with regard to a loss, also called metric, which quantifies the performance of a machine learning model on the provided data. In regression, this would be measuring the misfit of the data from the expected values. Commonly, the model improves with exposure to additional samples of data. Eventually, a good model generalizes to unseen data, which was not part of the training set, on the same task the model was trained to perform.

Accordingly, many mathematical and statistical methods and concepts, including Bayes' rule [Bayes, 1763], least-squares [Legendre, 1805], and Markov models [Markov, 1906, 1971], are applied in machine learning. Gaussian processes stand out as they originate in time series applications [Kolmogorov, 1939] and geostatistics [Krige, 1951], which roots this machine learning application in geoscience [Rasmussen, 2003]. "Kriging" originally applied two-dimensional Gaussian processes to the prediction of gold mine valuation and has since found wide application in geostatistics. Generally, Matheron [1963] is credited with formalizing the mathematics of kriging and developing it further in the following decades.

Between 1950 and 2020 much has changed. Computational resources are now widely available both as hardware and software, with high-performance compute being affordable to anyone through cloud computing vendors. High-quality software for machine learning is widely available through the free and open-source software movement, with major companies (Google, Facebook, Microsoft) competing for the usage of their open-source machine learning frameworks (Tensorflow, PyTorch, and CNTK, the latter deprecated in 2019) and independent developments reaching wide application, such as scikit-learn [Pedregosa et al., 2011] and xgboost [Chen and Guestrin, 2016].

Nevertheless, investigations of machine learning in geoscience are not a novel development. The research into machine learning follows interest in artificial intelligence closely.
Since its inception, artificial intelligence has experienced two periods of a decline in interest and trust, which has impacted negatively upon its funding. Developments in geoscience follow this widespread cycle of enthusiasm and loss of interest with a time lag of a few years. This may be the result of a variety of factors, including research funding availability and a change in willingness to publish results.
The 1950s and 1960s were decades of machine learning optimism, with machines learning to play simple games and perform tasks like route mapping. Intuitive methods like k-means, Markov models, and decision trees were used as early as the 1960s in geoscience. K-means was used to describe the cyclicity of sediment deposits [Preston and Henderson, 1964]. Krumbein and Dacey [1969] give a thorough treatment of the mathematical foundations of Markov chains and embedded Markov chains in a geological context through application to sedimentological processes, which also provides a comprehensive bibliography of Markov processes in geology. Some selected examples of early applications of Markov chains are found in sedimentology [Schwarzacher, 1972], well log analysis [Agterberg, 1966], hydrology [Matalas, 1967], and volcanology [Wickman, 1968]. Decision tree-based methods found early applications in economic geology and prospectivity mapping [Newendorp, 1976, Reddy and Bonham-Carter, 1991].

The 1970s were left with few developments in both the methods of machine learning and the applications and adoption in geoscience (cf. Figure 1), due to the "first AI winter" after initial expectations were not met. Nevertheless, as kriging was not considered an AI technology, it was unaffected by this cultural shift and found applications in mining [Huijbregts and Matheron, 1970], oceanography [Chiles and Chauvet, 1975], and hydrology [Delhomme, 1978]. This was in part due to superior results over other interpolation techniques, but also the provision of uncertainty measures.
The 1980s marked an uptake of interest in machine learning and artificial intelligence through so-called "expert systems" and corresponding specialized hardware. While neural networks were introduced in the 1950s, the tools of automatic differentiation and backpropagation for error-correcting machine learning were necessary to spark their adoption in geophysics in the late 1980s. Zhao and Mendel [1988] performed seismic deconvolution with a recurrent neural network (Hopfield network). Dowla et al. [1990] discriminated between natural earthquakes and underground nuclear explosions using feed-forward neural networks. An ensemble of networks was able to achieve 97 % accuracy for nuclear monitoring. Moreover, the researchers inspected the network to gain the insight that the ratio of particular input spectra was beneficial to the network's discrimination of seismological events. However, in practice the neural networks underperformed on uncurated data, which is often the case in comparison to published results. Huang et al. [1990] presented work on self-organizing maps (also Kohonen networks), a special type of unsupervised neural network, applied to picking seismic horizons. The field of geostatistics saw a formalization of theory and an uptake of interest, with Matheron et al. [1981] formalizing the relationship of spline-interpolation and kriging and Dubrule [1984] further developing the theory and applying it to well data. At this point, kriging was well-established in the mining industry as well as other disciplines that rely on spatial data, including the successful analysis and construction of the Channel Tunnel [Chilès and Desassis, 2018]. The late 1980s then marked the second AI winter, where expensive machines tuned to run "expert systems" were outperformed by desktop hardware from non-specialist vendors, causing the collapse of a half-billion-dollar hardware industry. Moreover, government agencies cut funding in AI specifically.

The 1990s are generally regarded as the shift from a knowledge-driven to a data-driven approach in machine learning. The term AI, and especially expert systems, was almost exclusively used in computer gaming and regarded with cynicism and as a failure in the scientific world. In the background, however, with research into applied statistics and machine learning, this decade marked the inception of Support-Vector Machines (SVM) [Cortes and Vapnik, 1995], the tree-based method Random Forests (RF) [Ho, 1995], and a specific type of recurrent neural network (RNN), Long Short-Term Memories (LSTM) [Hochreiter and Schmidhuber, 1997]. SVMs were utilized for land usage classification in remote sensing early on [Hermes et al., 1999]. Geophysics applied SVMs a few years later to approximate the Zoeppritz equations for AVO inversion, outperforming linearized inversion [Kuzma, 2003]. Random Forests, however, were delayed in broader adoption, due to the term "random forests" only being coined in 2001 [Breiman, 2001], the statistical basis initially being less rigorous, and the implementation being more complicated. LSTMs necessitate large amounts of data for training and can be expensive to train; after further development in 2011 [Ciresan et al., 2011] they gained popularity in commercial time series applications, particularly speech and audio analysis.
McCormack [1991] marks the first review of the emerging tool of neural networks in geophysics. The paper goes into the mathematical details and explores pattern recognition. The author summarizes neural network applications over the 30 years prior to the review and presents worked examples in automated well-log analysis and seismic trace editing. The review comes to the conclusion that neural networks are, in fact, good function approximators, taking over tasks that were previously reserved for human work. He criticizes slow training, the cost of retraining networks upon new knowledge, imprecision of outputs, non-optimal training results, and the black box property of neural networks.
Figure 2: Single layer neural network as described in equation 1. Two inputs x_i are multiplied by the weights w_ij and summed with the biases b_j. Subsequently an activation function σ is applied to obtain the outputs o_j.

The main conclusion sees the implementation of neural networks in conventional computation and expert systems to leverage the pattern recognition of networks with the advantages of conventional computer systems.

Neural networks are the primary subject of the modern day machine learning interest; however, significant developments leading up to these successes were made prior to the 1990s. The first neural network machine was constructed by Minsky [described in Russell and Norvig [2010]] and soon followed by the "Perceptron", a binary decision boundary learner [Rosenblatt, 1958]. This decision was calculated as follows:

$$o_j = \sigma\Big(\sum_i w_{ij} x_i + b_j\Big) = \sigma(a_j) = \begin{cases} 1 & a_j > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

It describes a linear system with the output o, the linear activation a of the input data x, the index of the source i and target node j, the trainable weights w, the trainable bias b, and a binary activation function σ. The activation function σ in particular has received ample attention since its inception. During this period, a binary σ became uncommon and was replaced by non-linear mathematical functions. Neural networks are commonly trained by gradient descent, therefore differentiable functions like sigmoid or tanh took its place, allowing the activation o of each neuron in a neural network to be continuous.

Deep learning [Dechter, 1986] expands on this concept. It is the combination of multiple layers of neurons in a neural network. These deep networks learn representations with multiple levels of abstraction and can be expressed using equation 1 as input neurons to the next layer:

$$o_k = \sigma\Big(\sum_j w_{jk} \cdot o_j + b_k\Big) = \sigma\Big(\sum_j w_{jk} \cdot \sigma\Big(\sum_i w_{ij} x_i + b_j\Big) + b_k\Big) \qquad (2)$$

Röth and Tarantola [1994] apply these building blocks of multi-layered neural networks with sigmoid activation to perform seismic inversion. They successfully invert low-noise and noise-free data on small training data. The authors note that the approach is susceptible to errors at low signal-to-noise ratios and coherent noise sources. Further applications include electromagnetic subsurface localization [Poulton et al., 1992], magnetotelluric inversion via Hopfield neural networks [Zhang and Paulson, 1997], and geomechanical modelling of microfractures in triaxial compression tests [Feng and Seto, 1998].
Figure 3: Deep multi-layer neural network as described in equation 2.

Figure 4: Sigmoid activation function (red) and derivative (blue) to train the multi-layer neural network described in equation 2.
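To make equations 1 and 2 concrete, the following is a minimal NumPy sketch of the forward pass through such a network with a sigmoid activation; the two-input, three-neuron layer sizes and random weights are purely illustrative and not part of the original examples.

import numpy as np

def sigmoid(a):
    # Differentiable activation function, cf. Figure 4
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=2)                         # two inputs x_i (cf. Figure 2)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # weights w_ij and biases b_j
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)  # weights w_jk and biases b_k

o_j = sigmoid(x @ W1 + b1)    # equation 1, with sigmoid instead of a binary step
o_k = sigmoid(o_j @ W2 + b2)  # equation 2: the first layer feeds the next
print(o_k)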
Cressie [1990] reviews the history of kriging, prompted by the uptake of interest in geostatistics. The author defines kriging as Best Linear Unbiased Prediction and reviews the historical co-development of disciplines. Similar concepts were developed in mining, meteorology, physics, plant and animal breeding, and geodesy, which all relied on optimal spatial prediction. Later, Williams [1998] provides a thorough treatment of Gaussian Processes in the light of recent successes of neural networks.
Figure 5: Gaussian Process separating two classes with different kernels. This image presents a 2D slice out of a 3D decision space. The decision boundary learnt from the data is visible, as well as the prediction in every location of the 2D slice. The two kernels presented are a linear kernel and a radial basis function (RBF) kernel, which show a significant discrepancy in performance. The bottom right number shows the accuracy on unseen test data. The linear kernel achieves 71 % accuracy, while the RBF kernel achieves 90 %.

An alternative method of putting a prior over functions is to use a Gaussian process (GP) prior over functions. This idea has been used for a long time in the spatial statistics community under the name of "kriging", although it seems to have been largely ignored as a general-purpose regression method. Williams [1998]

Overall, Gaussian Processes benefit from the fact that a Gaussian distribution will stay Gaussian under conditioning. That means that we can use Gaussian distributions in this machine learning process and they will produce a smooth Gaussian result after conditioning on the training data. To become a universal machine learning model, Gaussian Processes have to be able to describe infinitely many dimensions. Instead of storing infinite values to describe this random process, Gaussian Processes go the path of describing a distribution over functions that can produce each value when required:

$$p(x) \sim \mathcal{GP}\big(\mu(x), k(x, x')\big) \qquad (3)$$

The multivariate distribution over functions p(x) described by the Gaussian Process depends on a mean function µ(x) and a covariance function k(x, x'). It follows that choosing an appropriate mean and covariance function, also known as kernel, is essential. Very commonly, the mean function is chosen to be zero, as this simplifies some of the math. Therefore, data with a non-zero mean is commonly centered to comply with this assumption [Görtler et al., 2019]. Choosing an appropriate kernel for the machine learning task is one of the benefits of the Gaussian Process. The kernel is where expert knowledge can be incorporated into the model, e.g. seasonality in meteorological data can be described by a periodic covariance function.

Figure 5 presents a 2D slice of 3D data with two classes. This binary problem can be approached by applying a Gaussian Process to it. In the second panel, a linear kernel is shown, which predicts the data relatively poorly with an accuracy of 71 %. A radial basis function (RBF) kernel, shown in the third panel, generalizes to unseen test data with an accuracy of 90 %.
This figure shows how a trained Gaussian Process would predict any new data point presented to the model. The linear kernel would predict any data in the top part to be blue (Class 0) and any data in the bottom part to be red (Class 1). The RBF kernel, which we explore further in the section introducing support-vector machines, separates the prediction into four uneven quadrants. The choice of kernel is very important in Gaussian Processes, and research into extracting specific kernels is ongoing [Duvenaud, 2014].

In a more practical sense, Gaussian processes are computationally expensive, as an n × n matrix must be inverted, with n being the number of samples. This results in a space complexity of O(n²) and a time complexity of O(n³) [Williams and Rasmussen, 2006]. This makes Gaussian Processes most feasible for smaller data problems, which is one explanation for their rapid uptake in geoscience. An approximate computation of the inverted matrix is possible using the Conjugate Gradient (CG) optimization method, which can be stopped early, with a maximum time cost of O(n³) [Williams and Rasmussen, 2006]. For problems with larger data sets, neural networks become feasible due to being computationally cheaper than Gaussian Processes, regularization on large data sets being viable, as well as their flexibility to model a wide variety of functions and objectives. Regularization is essential, as neural networks otherwise tend to "overfit" and simply memorize the training data, instead of learning a generalizable relationship in the data. Interestingly, Hornik et al. [1989] showed that neural networks are a universal function approximator as the number of weights tends to infinity, and Neal [1996] was able to show that the infinitely wide stochastic neural network converges to a Gaussian Process. Oftentimes Gaussian Processes are trained on a subset of a large data set to avoid the computational cost. Gaussian Processes have seen successful application on a wide variety of problems and domains that benefit from expert knowledge.
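As an illustration of the kernel choice shown in Figure 5, the following sketch trains Gaussian Process classifiers with a linear (dot-product) and an RBF covariance function in scikit-learn. The two-moons toy data set stands in for the data of the figure, so the accuracies will differ from the 71 % and 90 % reported there.

from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import DotProduct, RBF
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in [DotProduct(), RBF()]:
    # The kernel encodes the prior over functions, cf. equation 3
    gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
    gpc.fit(X_train, y_train)
    print(kernel, gpc.score(X_test, y_test))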
The 2000s were opened with a review by van der Baan and Jutten [2000] recapitulating the most recent geophysical applications of neural networks. They went into much detail on neural network theory and the difficulties in building and training these models. The authors identify the following subsurface geoscience applications through history: first-break picking, electromagnetics, magnetotellurics, seismic inversion, shear-wave splitting, well log analysis, trace editing, seismic deconvolution, and event classification. They reveal a strong focus on exploration geophysics. The authors evaluated the application of neural networks as subpar to physics-based approaches and concluded that neural networks are too expensive and complex to be of real value in geoscience. This sentiment is consistent with the broader perception of artificial intelligence during this decade. Artificial intelligence and expert systems over-promised human-like performance, causing a shift in focus on research into specialized sub-fields, e.g. machine learning, fuzzy logic, and cognitive systems.

Mjolsness and DeCoste [2001] review machine learning in a broader context outside of exploration geoscience. The authors discuss recent successes in applications of remote sensing and robotic geology using machine learning models. They review graphical models, (hidden) Markov models, and SVMs and go on to disseminate the limitations of applications to vector data and poor performance when applied to rich data, such as graphs and text data. Moreover, the authors from NASA JPL go into detail on pattern recognition in automated rovers to identify geological prospects on Mars. They state:

The scientific need for geological feature catalogs has led to multiyear human surveys of Mars orbital imagery yielding tens of thousands of cataloged, characterized features including impact craters, faults, and ridges. Mjolsness and DeCoste [2001]

The review points out the profound impact SVMs have on identifying geomorphological features without modelling the underlying processes.
This decade of the 2000s introduced a shift in tooling, which is a direct contributor to the recent increase in adoption and research of both shallow and deep machine learning.

Machine learning software had primarily comprised proprietary software like Matlab™ with the Neural Networks Toolbox and Wolfram Mathematica™, or independent university projects like the Stuttgart Neural Network Simulator (SNNS). These tools were generally closed source, hard or impossible to extend, and could be difficult to operate due to limited accompanying documentation. Early open-source projects include WEKA [Witten et al., 2005], a graphical user interface to build machine learning and data mining projects. Shortly after that, LibSVM was released as free open-source software (FOSS) [Chang and Lin, 2011], which implements support vector machines efficiently. It is still used in many other libraries to this day, including WEKA [Chang and Lin, 2011]. Torch was then released in 2002, a machine learning library with a focus on neural networks. While it has been discontinued in its original implementation in the programming language Lua [Collobert et al., 2002], PyTorch, the reimplementation in the programming language Python, is one of the leading deep learning frameworks at the time of writing [Paszke et al., 2017]. In 2007, the libraries Theano and scikit-learn were released under open licenses in Python [Theano Development Team, 2016, Pedregosa et al., 2011]. Theano is a neural network library developed at the Montreal Institute for Learning Algorithms (MILA) that ceased development in 2017 after strong industrial developers had released openly licensed deep learning frameworks. Scikit-learn implements many different machine learning algorithms, including SVMs, Random Forests, and single-layer neural networks, as well as utility functions including cross-validation, stratification, metrics, and train-test splitting, necessary for robust machine learning model building and evaluation.
The impact of scikit-learn has shaped current machine learning software packages by implementing a unified application programming interface (API) [Buitinck et al., 2013]. This API is explored by example in the following code snippets; the code can be obtained at Dramsch [2020b]. First, we generate a classification dataset using a utility function. The make_classification function takes different arguments to adjust the generated data: we are generating 5000 samples (n_samples) for two classes, with five features (n_features), of which three features are actually relevant to the classification (n_informative). The data is stored in X, whereas the labels are contained in y.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=5,
                           n_informative=3, n_redundant=0,
                           random_state=0, shuffle=False)

It is good practice to divide the available labeled data into a training data set and a validation or test data set. This split ensures that models can be evaluated on unseen data to test the generalization to unseen samples. The utility function train_test_split takes an arbitrary amount of input arrays and separates them according to specified arguments. In this case 25 % of the data are kept for the hold-out validation set and not used in training. The random_state is fixed to make these examples reproducible.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.25,
                                                    random_state=0)
Then we need to define a machine learning model. Considering the previous discussion of high-impact machine learning models, the first example is an SVM classifier. This example uses the default values for the hyperparameters of the SVM classifier; for best results on real-world problems these have to be adjusted. The machine learning training is always done by calling classifier.fit(X, y) on the classifier object, which in this case is the SVM object. In more detail, the .fit() method implements an optimization loop that will condition the model to the training data by minimizing the defined loss function. In the case of the SVM classification the parameters are adjusted to optimize a hinge loss, outlined in equation 5. The trained scikit-learn model contains information about all its hyperparameters in addition to the trained model, shown below. The exact meaning of all these hyperparameters is laid out in the scikit-learn documentation [Buitinck et al., 2013].

from sklearn.svm import SVC

svm = SVC(random_state=0)
svm.fit(X_train, y_train)

>>> SVC(C=1.0, break_ties=False, cache_size=200,
        class_weight=None, coef0=0.0, degree=3,
        decision_function_shape='ovr', gamma='scale',
        kernel='rbf', max_iter=-1, probability=False,
        random_state=0, shrinking=True, tol=0.001,
        verbose=False)
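The hyperparameter adjustment mentioned above is commonly automated by a search over cross-validated candidate values. The following is a minimal sketch using scikit-learn's GridSearchCV; it is not part of the original example, and the parameter grid is purely illustrative.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for the regularization strength C and the RBF width gamma
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)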
The trained SVM can then be used to predict on new data, by calling classifier.predict(data) on the trained classifier object. The new data has to contain five features like the training data did. Generally, machine learning models always need to be trained on the same set of input features as the data available for prediction. The .predict() method outputs the most likely estimate on the new data to generate predictions. In the following code snippet, predictions on three input vectors are performed with the previously trained model.
Figure 6: Example of a Support Vector Machine separating two classes, showing the decision boundary learnt from the data. The data contains three informative features; the decision boundary is therefore three-dimensional, and shown is a central slice of data points in 2D. (A video is available at [Dramsch, 2020a])

print(svm.predict([[0, 0, 0, 0, 0],
                   [-1, -1, -1, -1, -1],
                   [1, 1, 1, 1, 1]]))

>>> [1 0 1]
The blackbox model should be evaluated with the classifier.score() function. Evaluating the performance on the training data set gives an indication how well the model is performing, but this is generally not enough to gauge the performance of machine learning models. In addition, the trained model has to be evaluated on the hold-out set, a dataset the model has not been exposed to during training. This guards against the model only performing well on the training data by "memorization" instead of extracting meaningful generalizable relationships, an effect called overfitting. In this example the hyperparameters are left at the default values; in real-life applications hyperparameters are usually adjusted to build better models. This can lead to an additional meta-level of overfitting on the hold-out set, which necessitates an additional third hold-out set to test the generalizability of the trained model with optimized hyperparameters. The default score uses the class accuracy, which suggests our model is approximately 90 % correct. Similar train and test scores indicate that the model learned a generalizable model, enabling prediction on unseen data without a performance loss. A large difference between the training score and test score indicates overfitting, in the case of a better training score. A higher test score than training score can be an indication of a deeper problem with the data split, scoring, or class imbalances, and needs to be investigated by means of external cross-validation, building standard "dummy" models, independence tests, and further manual investigations.

print(svm.score(X_train, y_train))
print(svm.score(X_test, y_test))

>>> 0.9098666666666667
>>> 0.9032
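The external cross-validation and "dummy" baseline models mentioned above can be sketched as follows; this is an illustrative addition to the example, not part of the original code.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Five-fold cross-validation of the SVM on the training data
print(cross_val_score(SVC(random_state=0), X_train, y_train, cv=5))

# A "dummy" model predicting the most frequent class as a sanity baseline
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))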
Support-vector machines can be employed for each class of machine learning problem, i.e. classification, regression, and clustering. In a two-class problem, the algorithm considers the n-dimensional input and attempts to find an (n − 1)-dimensional hyperplane that separates these input data points. The problem is trivial if the two classes are linearly separable, also called a hard margin. The plane can pass the two classes of data without ambiguity. For data with an overlap, which is usually the case, the problem becomes an optimization problem to fit the ideal hyperplane. The hinge loss provides the ideal loss function for this problem, yielding 0 if none of the data overlap, but a linear residual for overlapping points that can be minimized:

$$\max\big(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\big), \qquad (4)$$

with y_i being the current target label and \vec{w} \cdot \vec{x}_i - b being the hyperplane under consideration. The hyperplane consists of the normal vector w and point x, with the offset b. This leads the algorithm to optimize

$$\left[\frac{1}{n} \sum_{i=1}^{n} \max\big(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\big)\right] + \lambda \|\vec{w}\|^2, \qquad (5)$$

with λ being a scaling factor. For small λ the loss becomes the hard margin classifier for linearly separable problems. The nature of the algorithm dictates that only values for \vec{x} close to the hyperplane define the hyperplane itself; these values are called the support vectors.

Figure 7: Samples from two classes that are not linearly separable as input data (left). Applying a Gaussian Radial Basis Function centered in the data results in the two classes being linearly separable (right).

The SVM algorithm would not be as successful if it were simply a linear classifier. Some data can become linearly separable in higher dimensions. This, however, poses the question of how many dimensions should be searched, because of the exponential cost in computation that follows from the increase of dimensionality (also known as the curse of dimensionality). Instead, the "kernel trick" was proposed [Aizerman, 1964], which defines a set of values that are applied to the input data simply via the dot product. A common kernel is the radial basis function (RBF), which is also the kernel we applied in the example. The kernel is defined as:

$$k(\vec{x}_i, \vec{x}_j) = \exp\left(-\gamma \|\vec{x}_i - \vec{x}_j\|^2\right) \qquad (6)$$

This specifically defines the Gaussian Radial Basis Function of every input data point with regard to a central point. This transformation can be performed with other functions (or kernels), such as polynomials or the sigmoid function. The RBF will transform the data according to the distance between x_i and x_j, which can be seen in Figure 7. This results in the decision surface in Figure 6 consisting of various Gaussian areas. The RBF is generally regarded as a good default, in part due to being translation invariant (i.e. stationary) and smoothly varying.
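A small NumPy sketch in the spirit of Figure 7 illustrates the transformation in equation 6; the concentric toy data, the center, and γ = 1 are illustrative choices, not the values of the original figure.

import numpy as np

rng = np.random.default_rng(0)
# Two concentric classes: radii around 1 (class 0) and around 2 (class 1),
# which no straight line can separate in the 2D plane
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1, 0.1, 100), rng.normal(2, 0.1, 100)])
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

gamma, center = 1.0, np.zeros(2)
# Equation 6 evaluated against the central point: the response decays with
# distance, so a single threshold on it now separates the two classes
rbf = np.exp(-gamma * np.sum((X - center) ** 2, axis=1))
print(rbf[:100].mean(), rbf[100:].mean())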
An important topic in machine learning is explainability, which inspects the influence of input variables on the prediction. We can employ the utility function permutation_importance to inspect any model and how it performs with regard to its input features [Breiman, 2001]. The permutation importance evaluates how well the blackbox model performs when a feature is not available. Practically, a feature is replaced with random noise. Subsequently, the score is calculated, which provides a representation of how informative a feature is compared to noise. The data we generated in the first example contains three informative features and two random data columns. The mean values of the calculated importances show that three features are estimated to be three magnitudes more important, with the second feature containing the maximum amount of information to predict the labels.

from sklearn.inspection import permutation_importance

importances = permutation_importance(svm, X_train, y_train,
                                     n_repeats=10, random_state=0)

print(importances.importances_mean)
print(importances.importances_mean.argsort())

>>> [ 2.1787e-01  2.8712e-01  1.2293e-01 -1.8667e-04  7.7333e-04]
>>> [3 4 2 0 1]
Support-vector machines were applied to seismic data analysis [Li and Castagna, 2004] and automatic seismic interpretation [Liu et al., 2015, Di et al., 2017b, Mardan et al., 2017]. Compared to convolutional neural networks, these approaches usually do not perform as well when the CNN can gain information from adjacent samples. Seismological volcanic tremor classification [Masotti et al., 2006, 2008] and analysis of ground-penetrating radar [Pasolli et al., 2009, Xie et al., 2013] were other notable applications of SVMs in geoscience. The 2016 Society of Exploration Geophysicists (SEG) machine learning challenge was held using an SVM baseline [Hall, 2016]. Several other authors investigated well log analysis [Anifowose et al., 2017, Caté et al., 2018, Gupta et al., 2018, Saporetti et al., 2018], as well as seismology for event classification [Malfante et al., 2018] and magnitude determination [Ochoa et al., 2018]. These rely on SVMs being capable of regression on time-series data. Generally, many applications in geoscience have been enabled by the strong mathematical foundation of SVMs, such as microseismic event classification [Zhao and Gross, 2017], seismic well ties [Chaki et al., 2018], landslide susceptibility [Marjanović et al., 2011, Ballabio and Sterlacchini, 2012], digital rock models [Ma et al., 2012], and lithology mapping [Cracknell and Reading, 2013].
The following example shows the application of Random Forests to illustrate the similarity of the API for different machine learning algorithms in the scikit-learn library. The Random Forest classifier is instantiated with a maximum depth of seven, and the random state is fixed to zero again. Limiting the depth of the trees forces the random forest to conform to a simpler model. Random forests have the capability to become highly complex models that are very powerful predictive models. This is not conducive to this small example dataset, but easy to modify for the inclined reader. The classifier is then trained using the same API as all classifiers in scikit-learn. The example shows a very high number of hyperparameters; however, Random Forests work well without further optimization of these.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=7, random_state=0)
rf.fit(X_train, y_train)

>>> RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
        class_weight=None, criterion='gini', max_depth=7,
        max_features='auto', max_leaf_nodes=None,
        max_samples=None, min_impurity_decrease=0.0,
        min_impurity_split=None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        n_estimators=100, n_jobs=None, oob_score=False,
        random_state=0, verbose=0, warm_start=False)
The prediction of the random forest is performed with the same API call again, also consistent with all classifiers available. The predicted values match the prediction of the SVM in this instance.

print(rf.predict([[0, 0, 0, 0, 0],
                  [-1, -1, -1, -1, -1],
                  [1, 1, 1, 1, 1]]))

>>> [1 0 1]
The training score of the random forest model is 2.5 % better than that of the SVM in this instance; this score, however, is not informative on its own. Comparing the test scores shows only a 0.88 % difference, which is the relevant value to evaluate, as it shows the performance of a model on data it has not seen during the training stage. The random forest performed slightly better on the training set than on the test data set. This slight discrepancy is usually not an indicator of an overfit model. Overfit models "memorize" the training data and do not generalize well, which results in poor performance on unseen data. Generally, overfitting is to be avoided in real applications, but it can be seen in competitions, on benchmarks, and in show-cases of new algorithms and architectures to oversell the improvement over state-of-the-art methods [Recht et al., 2019].

print(rf.score(X_train, y_train))
print(rf.score(X_test, y_test))

>>> 0.9306
>>> 0.912
Random forests have specialized methods available for introspection, which can be used to calculate feature importance. These are based on the decision process the random forest used to build the machine learning model. The feature importance in Random Forests serves the same purpose as the permutation importance, estimating how much each feature contributes to the model's performance. Random Forests use a measure called Gini impurity to determine the split between classes at each node of the trees. While the permutation importance uses the accuracy score of the prediction with a feature replaced by noise, in Random Forests this Gini impurity can be used to measure how informative a feature is in a model. It is important to note that this impurity-based process can be susceptible to noise and can overestimate the importance of features with a high number of categories. Using the permutation importance instead is a valid choice. In this instance, as opposed to the permutation importance, the random forest estimates the two non-informative features to be one magnitude less useful than the informative features, instead of two magnitudes.

print(rf.feature_importances_)
print(rf.feature_importances_.argsort())

>>> [0.2324 0.4877 0.2527 0.0141 0.0129]
>>> [4 3 0 2 1]
Random forests and other tree-based methods, including gradient boosting, a related tree-based ensemble method, have generally found wider application with the implementation in scikit-learn and packages for the statistical languages R and SPSS. Similar to neural networks, this method was applied to automatic seismic interpretation [Guillen et al., 2015] with limited success, which is due to the independent treatment of samples, like SVMs. Random forests have the ability to approximate regression problems and time series, which made them suitable for seismological applications including localization [Dodge and Harris, 2016], event classification in volcanic tremors [Maggi et al., 2017], and slow slip analysis [Hulbert et al., 2018]. They have also been applied to geomechanical applications in fracture modelling [Valera et al., 2017] and fault failure prediction [Rouet-Leduc et al., 2017, 2018], as well as detection of reservoir property changes from 4D seismic data [Cao and Roy, 2017]. Gradient Boosted Trees were the winning models in the 2016 SEG machine learning challenge [Hall and Hall, 2017] for well-log analysis, propelling a variety of publications in facies prediction [Bestagini et al., 2017, Blouin et al., 2017, Caté et al., 2018, Saporetti et al., 2018].

Furthermore, various methods that have been implemented in scikit-learn have been applied to a multitude of geoscience problems. Hidden Markov models were used for seismological event classification [Ohrnberger, 2001, Beyreuther and Wassermann, 2008, Bicego et al., 2013], well-log classification [Jeong et al., 2014, Wang et al., 2017a], and landslide detection from seismic monitoring [Dammeier et al., 2016]. These hidden Markov models are highly performant on time series and spatially coherent problems. The "hidden" part of Markov models enables the model to assume influences on the predictions that are not directly represented in the input data.
Figure 8: Binary decision boundary for the Random Forest in 2D. This is the same central slice of the 3D decision volume used in Figure 6.

The k-nearest neighbours method has been used for well-log analysis [Caté et al., 2017, Saporetti et al., 2018], seismic well ties [Wang et al., 2017b] combined with dynamic time warping, and fault extraction in seismic interpretation [Hale, 2013], which is highly dependent on choosing the right hyperparameter k. The unsupervised k-NN equivalent, k-means, has been applied to seismic interpretation [Di et al., 2017a], ground motion model validation [Khoshnevis and Taborda, 2018], and seismic velocity picking [Wei et al., 2018]. These are very simple machine learning models that are useful as baseline models. Graphical modelling in the form of Bayesian networks has been applied to seismology in modelling earthquake parameters [Kuehn et al., 2011], basin modelling [Martinelli et al., 2013], seismic interpretation [Ferreira et al., 2018], and flow modelling in discrete fracture networks [Karra et al., 2018]. These graphical models are effective in causal modelling and gained popularity in modern applications of machine learning explainability, interpretability, and generalization in combination with do-calculus [Pearl, 2012].
The 2010s marked a renaissance of deep learning and particularly convolutional neural networks. The convolutional neural network (CNN) architecture AlexNet [Krizhevsky et al., 2012] was the first CNN to win the ImageNet challenge [Deng et al., 2009]. The ImageNet challenge is considered a benchmark competition and database of natural images established in the field of computer vision. AlexNet improved the classification error rate from 25.8 % to 16.4 % (top-5 accuracy). This has propelled research in CNNs, resulting in error rates on ImageNet of 2.25 % on top-5 accuracy in 2017 [Russakovsky et al., 2015]. The Tensorflow library [Abadi et al., 2015] was introduced for open-source deep learning models, with a somewhat different software design compared to the Theano and Torch libraries.

The following example shows an application of deep learning to the data presented in the previous examples. The classification data set we use has independent samples, which leads to the use of simple densely connected feed-forward networks. Image data or spatially correlated datasets would ideally be fed to a convolutional neural network (CNN), whereas time series are often best approached with recurrent neural networks (RNN). This example is written using the Tensorflow library. PyTorch would be an equally good library to use.

All modern deep learning libraries take a modular approach to building deep neural networks that abstracts operations into layers. These layers can be combined into input and output configurations in highly versatile and customizable ways. The simplest architecture, which is the one we implement below, is a sequential model, which consists of one input and one output layer, with a "stack" of layers in between. It is possible to define more complex models with multiple inputs and outputs, as well as the branching of layers, to build very sophisticated neural network pipelines. These are supported by the functional API and the subclassing API, but would not be conducive to this example.
Figure 9: ReLU activation (red) and derivative (blue) for efficient gradient computation.

The example model consists of Dense layers and a Dropout layer, which are arranged in sequence. Densely connected layers contain a specified number of neurons with an appropriate activation function, shown in the example below. Each neuron performs the calculation outlined in equation 1, with σ defining the activation. Modern neural networks rarely implement sigmoid and tanh activations anymore. Their activation characteristic leads them to lose information for large positive and negative values of the input, commonly called saturation [Hochreiter et al., 2001]. This saturation of neurons prevented good deep neural network performance until new non-linear activation functions took their place [Xu et al., 2015]. The activation function Rectified Linear Unit (ReLU) is generally credited with facilitating the development of very deep neural networks, due to its non-saturating properties [Hahnloser et al., 2000]. It sets all negative values to zero and provides a linear response for positive values, as seen in equation 7. Since its inception, many more rectifiers with different properties have been introduced.

$$\sigma(a) = \max(0, a) \qquad (7)$$

The other activation function used in the example is the "softmax" function on the output layer. This activation is commonly used for classification tasks, as it normalizes all activations at all outputs to one. It achieves this by applying the exponential function to each of the outputs in \vec{a} for C classes and dividing that value by the sum of all exponentials:

$$\sigma(\vec{a})_j = \frac{e^{a_j}}{\sum_{p=1}^{C} e^{a_p}} \qquad (8)$$

The example additionally uses a Dropout layer, which is a common layer used for regularization of the network by randomly setting a specified percentage of nodes to zero for each iteration. Neural networks are particularly prone to overfitting, which is counteracted by various regularization techniques that also include input-data augmentation, noise injection, L1 and L2 constraints, or early stopping of the training loop [Goodfellow et al., 2016]. Modern deep learning systems may even leverage noisy student-teacher networks for regularization [Xie et al., 2019b].

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(.3),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')])

Figure 10: Three-layer convolutional network. The input image (yellow) is convolved with several filters or kernel matrices (purple). Commonly, the convolution is used to downsample an image in the spatial dimension, while expanding the dimension of the filter response, hence expanding in "thickness" in the schematic. The filters are learned in the machine learning optimization loop. The shared weights within a filter improve efficiency of the network over classic dense networks.
These sequential models are also used for simple image classification using CNNs. Instead of Dense layers, these are built up with convolutional layers, which are readily available in 1D, 2D, and 3D as Conv1D, Conv2D, and Conv3D respectively. A two-dimensional CNN learns a so-called filter f for the n × m-dimensional image G, expressed as:

$$G^*(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(i, j) \cdot G(x - i + c, y - j + c), \qquad (9)$$

resulting in the central result G* around the central coordinate c. In CNNs each layer learns several of these filters f, usually followed by a down-sampling operation in n and m to compress the spatial information. This serves as a forcing function to learn increasingly abstract representations in subsequent convolutional layers.
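A minimal sketch of such a convolutional architecture is shown below, assuming hypothetical 28 × 28 single-channel images with ten classes; the layer sizes are illustrative and not part of the original example.

import tensorflow as tf

cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu',
                           input_shape=(28, 28, 1)),  # 16 learned filters f
    tf.keras.layers.MaxPooling2D(),  # down-sampling in the spatial dimensions
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),       # hand over to densely connected layers
    tf.keras.layers.Dense(10, activation='softmax')])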
The sequential example model of densely connected layers above, with 32, 16, and two neurons, contains a total of 754 trainable weights for the five input features. Initially, each of these weights is set to a pseudo-random value, which is often drawn from a distribution beneficial to fast training. Consequently, the data is passed through the network, and the result is numerically compared to the expected values. This form of training is defined as supervised training and error-correcting learning, which is a form of Hebbian learning. Other forms of learning exist and are employed in machine learning, e.g. competitive learning in self-organizing maps.

$$MAE = |y_j - o_j| \qquad (10)$$

$$MSE = (y_j - o_j)^2 \qquad (11)$$

In regression problems the error is often calculated using the Mean Absolute Error (MAE) or Mean Squared Error (MSE), the L1 norm shown in equation 10 and the L2 norm shown in equation 11 respectively. Classification problems form a special type of problem that can leverage a different kind of loss called cross-entropy (CE). The cross-entropy depends on the true label y and the prediction in the output layer:

$$CE = -\sum_{j}^{C} y_j \log(o_j) \qquad (12)$$

Many machine learning data sets have one true label y_true = 1 for the class C_j = true, leaving all other y_j = 0. This makes the sum over all labels obsolete. It is debatable how well binary labels reflect reality, but it simplifies equation 12 to minimizing the (negative) logarithm of the neural network output o_j, also known as the negative log-likelihood:

$$CE = -\log(o_j) \qquad (13)$$

Technically, the data we generated is a binary classification problem, which means we could use the sigmoid activation function in the last layer and optimize a binary CE. This can speed up computation, but in this example an approach is shown that works for many other problems and can therefore be applied to the reader's data.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Large neural networks can be extremely costly to train, with significant developments in 2019/2020 reporting multi-billion parameter language models (Google, OpenAI) trained on massive hardware infrastructure for weeks, with a single epoch taking several hours. This calls for validation on unseen data after every epoch of the training run. Therefore, neural networks, like all machine learning models, are commonly trained with two hold-out sets, a validation and a final test set. The validation set can be provided or be defined as a percentage of the training data, as shown below. In the example, 10 % of the training data are held out for validation after every epoch, reducing the training data set from 3750 to 3375 individual samples.

model.fit(X_train, y_train, validation_split=.1, epochs=100)

>>> [...]
Epoch 100/100
3375/3375 [==============================] - 0s 66us/sample
loss: 0.1567 - accuracy: 0.9401 - val_loss: 0.1731 - val_accuracy: 0.9359
Neural networks are trained with variations of stochastic gradient descent (SGD), an incremental version of the classic steepest descent algorithm. We use the Adam optimizer, a variation of SGD that converges fast; a full explanation would go beyond the scope of this chapter. The gist of the Adam optimizer is that it maintains a per-parameter learning rate based on the first statistical moment (mean), which is beneficial for sparse problems, and the second moment (uncentered variance), which is beneficial for noisy and non-stationary problems [Kingma and Ba, 2014]. The main alternative to Adam is SGD with Nesterov momentum [Sutskever et al., 2013], an optimization method that models conjugate gradient methods (CG) without the heavy computation that comes with the search in CG. SGD anecdotally finds a better optimal point for neural networks than Adam, but converges much slower.

In addition to the loss value, we display the accuracy metric. While accuracy should not be the sole arbiter of model performance, it gives a reasonable initial estimate of how many samples are predicted correctly, as a fraction between zero and one. As opposed to scikit-learn, deep learning models are compiled after their definition to make them fit for optimization on the available hardware. Then the neural network can be fit like the SVM and Random Forest models before, using the X_train and y_train data. In addition, a number of epochs can be provided, as well as other parameters that are left at default values for the example. The number of epochs defines how many cycles of optimization on the full training data set are performed. Conventional wisdom for neural network training is that it should always learn for more epochs than machine learning researchers estimate initially.
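Training longer, however, risks overfitting. A common remedy is early stopping, already mentioned among the regularization techniques above; the following sketch is a hypothetical addition to the example using the Keras EarlyStopping callback, which halts training once the validation loss stops improving.

# Stop training when the validation loss has not improved for ten epochs
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=10,
                                                  restore_best_weights=True)
model.fit(X_train, y_train, validation_split=.1, epochs=100,
          callbacks=[early_stopping])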
Figure 11: Loss and accuracy of the example neural network for ten random initializations, training for 100 epochs, with the shaded area showing the 95 % confidence intervals of the loss and metric. Analyzing loss curves is important to evaluate overfitting. The training loss decreasing while the validation loss is close to plateauing is a sign of overfitting. Generally, it can be seen that the model converged and is only making marginal gains, with the risk of overfitting.

It can be difficult to fix all sources of randomness and stochasticity in neural networks to make both research and examples reproducible. This example does not fix these so-called random seeds, as it would detract from the example. That implies that the results for loss and accuracy will differ from the printed examples. In research, fixing the seed is very important to ensure reproducibility of claims. Moreover, to avoid bad practices or so-called "lucky seeds", a statistical analysis of multiple fixed seeds is good practice when reporting results of any machine learning model.

model.evaluate(X_test, y_test)

>>> 1250/1250 [==============================] - 0s 93us/sample
loss: 0.1998 - accuracy: 0.9360
[0.19976349686831235, 0.936]
In the examples before, the SVM and Random Forest classifiers were scored on unseen data. This is equally important for neural networks. Neural networks are prone to overfit, which we try to circumvent by regularizing the weights and by evaluating the final network on an unseen test set. The prediction on the test set is very close to the last epoch in the training loop, which is a good indicator that this neural network generalizes to unseen data. Moreover, the loss curves in Figure 11 converge, but not too fast. However, it appears that the network would overfit if we let training continue. The exemplary decision boundary in Figure 12 very closely models the local distribution of the data, which is true for the entire decision volume [Dramsch, 2020a].

These examples illustrate the open-source revolution in machine learning software. The consolidated API and utility functions make it seem trivial to apply various machine learning algorithms to scientific data. This can be seen in the recent explosion of publications of applied machine learning in geoscience. The need to be able to implement algorithms has been replaced by merely installing a package and calling model.fit(X, y). These developments call for strong validation requirements of models to ensure valid, reproducible, and scientific results. Without this careful validation these modern-day tools can be severely misused to oversell results and even come to incorrect conclusions.

In aggregate, modern-day neural networks benefit from the development of non-saturating non-linear activation functions. The advancement of stochastic gradient descent with Nesterov momentum and the Adam optimizer (following AdaGrad and RMSProp) was essential for faster training of deep neural networks. The leverage of graphics hardware available in most high-end desktop computers, which is specialized for linear algebra computation, further reduced training times. Finally, open-source software that is well-maintained, tested, and documented with a consistent API made both shallow and deep machine learning accessible to non-experts.
Figure 12: Central 2D slice of the decision boundary of the deep neural network trained on data with 3 informative features. The 3D volume is available in [Dramsch, 2020a].

Figure 13: Schematic of a VGG16 network for ImageNet. The input data is convolved and down-sampled repeatedly. The final image classification is performed by flattening the image and feeding it to a classic feed-forward densely connected neural network. The 1000 output nodes for the 1000 ImageNet classes are normalized by a final softmax layer (cf. equation 8). Visualization library [Iqbal, 2018].
In deep learning, the implementation of models is commonly more complicated than understanding the underlying algorithm. Modern deep learning makes use of various recent developments that can be beneficial to the data set it is applied to; without specific implementation details, results are often not reproducible. However, the machine learning community has a firm grounding in openness and sharing, which is seen in both publications and code. New developments are commonly published alongside their open-source code, and frequently with the trained networks on standard benchmark data sets. This facilitates thorough inspection and transferring the new insights to applied tasks such as geoscience. In the following, some relevant neural network architectures and their applications are explored.
Figure 14: Schematic of a ResNet block. The block contains a 1 × 1, 3 × 3, and 1 × 1 convolution with ReLU activation. The output is added to the input and passed through another ReLU activation function.

The first model to discuss is the VGG-16 model, a 16-layer deep convolutional neural network [Simonyan and Zisserman, 2014] represented in figure 13. This network was an attempt at building even deeper networks and uses small 3 × 3 convolutional filters in the network, called f in equation 9. This small filter size was sufficient to build powerful models that abstract the information from layer to deeper layer, are easy to visualize, and generalize well. The trained model on natural images also transfers well to other domains like seismic interpretation [Dramsch and Lüthje, 2018]. Later, the concept of Network-in-Network was introduced, which suggested defined sub-networks or blocks in the larger network structure [Lin et al., 2013]. The ResNet architecture uses this concept of blocks to define residual blocks. These use a shortcut around a convolutional block [He et al., 2016] to achieve neural networks with up to 152 layers that still generalize well. ResNets, and residual blocks in particular, are very popular in modern architectures, including the shortcuts or skip connections they popularized, which address the following problem:

When deeper networks start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. He et al. [2016]

The developments and successes in image classification on benchmark competitions like ImageNet and Pascal-VOC inspired applications in automatic seismic interpretation. These networks are usually single image classifiers using convolutional neural networks (CNNs). The first application of a convolutional neural network to seismic data used a relatively small deep CNN for salt identification [Waldeland and Solberg, 2017]. The open-source software "MaLenoV" implemented a single image classification network, which was the earliest freely available implementation of deep learning for seismic interpretation [Ildstad and Bormann]. Dramsch and Lüthje [2018] applied pre-trained VGG-16 and ResNet50 networks to single image seismic interpretation. Recent successful applications build upon pre-trained, pre-built architectures to implement more sophisticated deep learning systems, e.g. semantic segmentation, which is important in seismic interpretation. This is already a narrow field of application of machine learning, and it can be observed that many early applications focus on sub-sections of seismic interpretation utilizing these pre-built architectures, such as salt detection [Waldeland et al., 2018, Di et al., 2018, Gramstad and Nickel, 2018], fault interpretation [Araya-Polo et al., 2017, Guitton, 2018, Purves et al., 2018], facies classification [Chevitarese et al., 2018, Dramsch and Lüthje, 2018], and horizon picking [Wu and Zhang, 2018]. In comparison, this is, however, already a broader application than prior machine learning approaches for seismic interpretation that utilized very specific seismic attributes as input to self-organizing maps (SOM) for e.g. sweet spot identification [Guo et al., 2017, Zhao et al., 2017, Roden and Chen, 2017].

In geoscience, single image classification, as presented in the ImageNet challenge, is less relevant than other applications like image segmentation and time series classification. The developments and insights resulting from the ImageNet challenge were, however, transferred to network architectures that have relevance in machine learning for science. Fully convolutional networks are a way to better achieve image segmentation. A particularly useful implementation, the U-net, was first introduced in biomedical image segmentation, a discipline notorious for small datasets [Ronneberger et al., 2015]. The U-net architecture shown in Figure 15 utilizes several shortcuts in an encoder-decoder architecture to achieve stable segmentation results. Shortcuts (or skip connections) are a way in neural networks to combine the original information and the processed information, usually through concatenation or addition. In ResNet blocks this concept is taken to an extreme, where every block in the architecture contains a shortcut between the input and output, as seen in Figure 14. These blocks are universally used in many architectures to implement deeper networks, i.e. ResNet-152 with 60 million parameters, with fewer parameters than previous architectures like VGG-16 with 138 million parameters. This essentially enables models that are ten times as deep with less than half the parameters, and significantly better accuracy on image benchmark problems.
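A minimal sketch of such a residual block in Keras, assuming the input tensor already has `filters` channels so that the addition is shape-compatible; the 1 × 1, 3 × 3, 1 × 1 pattern follows the bottleneck design of He et al. [2016]:

from tensorflow.keras import layers

def residual_block(x, filters):
    # Bottleneck block: 1x1, 3x3, and 1x1 convolutions with ReLU activations.
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    # The skip connection adds the unmodified input back before the final activation.
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)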
Figure 15: Schematic of U-net architecture. Convolutional layers are followed by a downsampling operation in the encoder. The central bottleneck contains a compressed representation of the input data. The decoder contains upsampling operations followed by convolutions. The last layer is commonly a softmax layer to provide classes. Equally sized layers are connected via shortcut connections.

In 2018 the seismic contractor TGS made a seismic interpretation challenge available on the data science competition platform Kaggle. Successful participants in the competition combined ResNet architectures with the U-net architecture as their base architecture and modified these with state-of-the-art image segmentation techniques [Babakhin et al., 2019]. Moreover, Dramsch and Lüthje [2018] showed that transferring networks trained on large bodies of natural images to seismic data yields good results on small datasets, which was further confirmed in this competition. The learnings from the TGS Salt Identification challenge have been incorporated in production-scale models that perform human-like salt interpretation [Sen et al., 2020]. In broader geoscience, U-nets have been used to model global water storage using GRACE satellite data [Sun et al., 2019], predict landslides [Hajimoradlou et al., 2019], and pick earthquake arrival times [Zhu and Beroza, 2018]. A more classical approach identifies subsea scale worms in hydrothermal vents [Shashidhara et al., 2020], whereas Dramsch et al. [2019] include a U-net in a larger system for unsupervised 3D time-shift extraction from 4D seismic.

This modularity of neural networks can be seen throughout the research and application of deep learning. New insights can be incorporated into existing architectures to enhance their predictive power. This can be in the form of swapping out the activation function σ or including new layers for improvements, e.g. regularization with batch normalization [Ioffe and Szegedy, 2015]. The original U-net architecture is relatively shallow, but was modified to contain a modified ResNet for the Kaggle salt identification challenge [Babakhin et al., 2019]. Overall, these serve as examples of the flexibility of neural networks.
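A minimal sketch of this encoder-decoder pattern with a single skip connection; a realistic U-net stacks several such levels, and all layer sizes here are illustrative assumptions:

from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(64, 64, 1), n_classes=2):
    inputs = layers.Input(input_shape)
    # Encoder: convolution followed by downsampling.
    e = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    p = layers.MaxPooling2D()(e)
    # Bottleneck with the compressed representation.
    b = layers.Conv2D(32, 3, padding="same", activation="relu")(p)
    # Decoder: upsampling followed by convolution.
    u = layers.UpSampling2D()(b)
    u = layers.Concatenate()([u, e])  # skip connection to the equally sized encoder layer
    d = layers.Conv2D(16, 3, padding="same", activation="relu")(u)
    # Softmax output layer to provide class probabilities.
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(d)
    return Model(inputs, outputs)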
Generative adversarial networks (GAN) take the composition of neural networks to another level, where two networks are trained in aggregate to reach a desired result. In GANs, a generator network G and a discriminator network D work against each other in the training loop [Goodfellow et al., 2014]. The generator G is set up to generate samples from an input; these were often natural images in early GANs, but have now progressed to anything from time series [Engel et al., 2019] to high-energy physics simulation [Paganini et al., 2018]. The discriminator network D attempts to distinguish whether a sample is generated by G, i.e. fake, or a real sample from the training data. Mathematically, this defines a minimax game for the value function V of G and D:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \qquad (14)

with x representing the data, z the latent space G draws samples from, and p the respective probability distributions. Training eventually reaches a Nash equilibrium [Nash, 1951], where neither the generator network G can produce better outputs, nor the discriminator network D can improve its capability to discern between fake and real samples. Despite how versatile U-nets are, they still need an appropriately defined loss function and labels to build a discriminative model. GANs, however, build a generative model that approximates the training sample distribution in the Generator and a discriminative model in the Discriminator, modeled dynamically through adversarial training. The Discriminator effectively provides an adversarial loss in a GAN. In addition to providing two models that serve different purposes, learning the training sample distribution with an adversarial loss makes GANs one of the most versatile model classes currently available. Mosser et al. [2017] applied GANs early on to geoscience, modeling 3D porous media at the pore scale with a deep convolutional GAN. The authors extended this approach to conditional simulations of oolithic digital rock [Mosser et al., 2018a]. Early applications of GANs also included approximating the problem of velocity inversion of seismic data [Mosser et al., 2018c], geostatistical inversion [Laloy et al., 2017], and generating seismograms [Krischer and Fichtner, 2017]. Richardson [2018] integrates the Generator of the GAN into full waveform inversion of the scalar wavefield. Alternatively, a Bayesian inversion using the Generator as prior for velocity inversion was introduced in Mosser et al. [2018b]. In geomodeling, generation of geological channel models was presented [Chan and Elsheikh, 2017], which was subsequently extended with the capability to be conditioned on physical measurements [Dupont et al., 2018]. Naturally, GANs were also applied to the growing field of automatic seismic interpretation [Lu et al., 2018].
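A minimal sketch of one adversarial training step in TensorFlow, implementing the value function of equation 14; the networks G and D, their optimizers, and the latent dimension are assumed to be defined elsewhere, and the generator uses the common non-saturating heuristic rather than the literal minimax loss.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(G, D, g_opt, d_opt, real, latent_dim=64):
    # Draw latent samples z ~ p_z(z).
    z = tf.random.normal([tf.shape(real)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = G(z, training=True)
        real_logits = D(real, training=True)
        fake_logits = D(fake, training=True)
        # D maximizes log D(x) + log(1 - D(G(z))).
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Non-saturating generator loss: maximize log D(G(z)).
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss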
The final type of architecture applied in geoscience is recurrent neural networks (RNN). In contrast to all previous architectures, recurrent neural networks feed back into themselves. There are many types of RNNs, Hopfield networks being an early one applied to seismic source wavelet prediction [Wang and Mendel, 1992]. However, LSTMs [Hochreiter and Schmidhuber, 1997] are the main application in geoscience and wider machine learning. This type of network achieves state-of-the-art performance on sequential data like language tasks and time series applications. LSTMs solve some common problems of RNNs by implementing specific gates that regulate information flow in an LSTM cell, namely the input gate, forget gate, and output gate, visualized in Figure 16. The input gate feeds input values to the internal cell. The forget gate overwrites the previous state. Finally, the output gate regulates the direct contribution of the input value to the output value, combined with the internal state of the cell. Additionally, a peephole functionality that serves as a shortcut between inputs and gates helps with training.

A classic application of LSTMs is text analysis and natural language understanding, which has been applied to geological relation extraction from unstructured text documents [Luo et al., 2017, Blondelle et al., 2017]. Because LSTMs are suited for time series data, they have been applied to seismological event classification of volcanic activity [Titos et al., 2018], multi-factor landslide displacement prediction [Xie et al., 2019a], and hydrological modelling [Kratzert et al., 2019]. Talarico et al. [2019] applied LSTMs to model sedimentological sequences and compared the model to a baseline Hidden Markov Model (HMM), concluding that RNNs outperform HMMs based on first-order Markov chains, while higher-order Markov chains were too complex to calibrate satisfactorily. The Gated Recurrent Unit (GRU) [Cho et al., 2014] is another RNN developed based on the insights into LSTMs, which was applied to predict petrophysical properties from seismic data [Alfarraj and AlRegib, 2019].
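A minimal sketch of an LSTM classifier for sequence data in Keras; the sequence length, feature count, and layer sizes are illustrative assumptions:

from tensorflow.keras import layers, models

# Binary classification of sequences with 100 time steps and one feature.
model = models.Sequential([
    layers.LSTM(32, input_shape=(100, 1)),  # gated recurrent processing of the sequence
    layers.Dense(1, activation="sigmoid"),  # event / no-event output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])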
Figure 16: Schematic of LSTM architecture. The input data is processed together with the hidden state and cell state. The LSTM avoids the exploding gradient problem by implementing an input, forget, and output gate.

The scope of this review only allowed for a broad overview of the types of networks that were successfully applied to geoscience. Many more specific architectures exist and are in development that provide different advantages: Siamese networks for one-shot image analysis [Koch et al., 2015], transformer networks that largely replaced LSTM and GRU in language modelling [Vaswani et al., 2017], or attention as a general mechanism in deep neural networks [Zheng et al., 2017].

Neural network architectures have been modified and applied to diverse problems in geoscience. Every architecture type is particularly suited to certain data types that are present in each field of geoscience. However, fields with data present in machine-readable format experienced accelerated adoption of machine learning tools and applications. For example, Ross et al. [2018a] were able to successfully apply CNNs to seismological phase detection, relying on an extensive catalogue of hand-picked data [Ross et al., 2018b], and consequently generalized this work [Ross et al., 2018c]. It has to be noted that synthetic or specifically sampled data can introduce an implicit bias into the network [Wirgin, 2004, Kim et al., 2019]. Nevertheless, it is particularly this black-box property of machine learning models that makes them versatile and powerful tools that were leveraged in every subdiscipline of the Earth sciences.
Overall, geoscience and especially geophysics has followed developments in machine learning closely. Across disciplines, machine learning methods have been applied to various problems that can generally be categorized into three sections:

1. Build a surrogate ML model of a well-understood process. This model usually provides an advantage in computational cost.
2. Build an ML model of a task previously only possible with human interaction, interpretation, or knowledge and experience.

3. Build a novel ML model that performs a task that was previously not possible.

Granulometry on SEM images is an example of an application in category 1, where previously sediments were hand-measured in images [Dramsch et al., 2018]. Large deformation diffeomorphic mapping was computationally infeasible for matching 4D seismic data, but was made feasible by applying a U-net architecture to the problem, an example of category 2 [Dramsch et al., 2019]. The problem of earthquake magnitude prediction falls into category 3 due to the complexity of the system, but was nevertheless approached with neural networks [Panakkat and Adeli, 2007].

The accessibility of tools, knowledge, and compute makes this cycle of machine learning enthusiasm unique with regard to previous decades. This unprecedented access to tools makes the application of machine learning algorithms possible for any problem where data is available. The bibliometrics of machine learning in geoscience, shown in figure 17, serve as a proxy for this increased access. These papers include varying degrees of depth in application and model validation. One of the primary influences for the current increase in publications is new fields such as automatic seismic interpretation, as well as publications soliciting and encouraging machine learning submissions. Computer vision models were relatively straightforward to transfer to seismic interpretation tasks, with papers in this sub-sub-field ranging from single 2D line salt identification models with limited validation to 3D multi-facies interpretation with validation on a separate geographic area.

Geoscientific publishing can be challenging to navigate with respect to machine learning. While papers investigating the theoretical fundamentals of machine learning in geoscience exist, the overwhelming majority of papers present applications of ML to geoscientific problems. It is complex to evaluate whether a paper is a case study or a methodological paper with an exemplary application to a specific data set. Despite the difficulty of most thorough applications of ML, "idea papers" exist that simply present an established algorithm for a problem in geoscience without a specific implementation or addressing the possible caveats. On the flip side, some papers apply machine learning algorithms as pure regression models without the aim to generalize the model to other data. Unfortunately, this makes meta-analysis articles difficult to impossible. This kind of meta-analysis article is commonly done in medicine, where it is considered a gold-standard study, and would greatly benefit the geoscientific community to determine the efficacy of algorithms on sets of similar problems.

Analogous to the medical field, obtaining accurate ground truth data is often impossible and usually expensive. Geological ground truth data for seismic data is usually obtained through expert interpreters. Quantifying the uncertainty of these interpretations is an active field of research, which suggests that a broader set of experiences and a diverse set of sources of information facilitate correct geological interpretation across interpreters [Bond et al., 2007]. Radiologists tasked to interpret x-ray images showed similar decreases in both inter- and intra-interpreter error rate with more diverse data sources [Jewett et al., 1992].
These uncertainties in the training labels are commonly known as "label noise" and can be a detriment to building accurate and generalizable machine learning models. A significant portion of data in geoscience, however, is not machine learning ready. Actual ground truth data from drilling reports is often locked away in running-text reports, sometimes in scanned PDFs. Data is often siloed and most likely proprietary. Sometimes the amount of samples to process is so large that many insights are yet to be made from samples in core stores or the storage rooms of museums. Benchmark models are either non-existent or made by consortia that only provide access to their members. Academic data is usually only available within academic groups for competitive advantage, out of respect for the amount of work, and for fear of being exposed to legal and public repercussions. These problems are currently being addressed by a culture change. Nevertheless, liberating data will be a significant investment, regardless of who will work on it, and a slow culture change can be observed already.

Generally, machine learning has seen the fastest successes in domains where decisions are cheap (e.g. click advertising), data is readily available (e.g. online shops), and the environment is simple (e.g. games) or unconstrained (e.g. image generation). Geoscience is generally at the opposite end of this spectrum. Decisions are expensive, be it drilling new wells or assessing geohazards. Data is expensive, sparse, and noisy. The environment is heterogeneous and constrained by physical limitations. Therefore, problems with fewer initial constraints, like automatic seismic interpretation, see a surge of activity. Problems like inversion have solutions that are verifiably wrong due to physics. These constraints do not prohibit machine learning applications in geoscience. However, most successes are seen in close collaboration with subject matter experts. Moreover, model explainability becomes essential in the geoscience domain. While not a strict equivalence, simpler models are usually easier to interpret, especially regarding failure modes. A prominent example of "excessive" [Mignan and Broccardo, 2019a] model complexity was presented in DeVries et al. [2018], applying deep learning to aftershock prediction. Independent data scientists identified methodological errors, including data leakage from the train set to the test set used to present results [Shah and Innig, 2019]. Moreover, Mignan and Broccardo [2019b] showed that the central physical interpretation of the deep learning model, the von Mises yield criterion, could be used to build a surrogate logistic regression.
Figure 17: Bibliometry of 242 papers in Machine Learning for Geoscience per year. Search terms include variations of machine learning terms and geoscientific subdisciplines but exclude remote sensing and kriging.

The resulting surrogate or baseline model outperforms the deep network and overfits less. Moreover, replacing the ~13,000-parameter model with the two-parameter baseline model increases calculation speed, which is essential in aftershock forecasting and disaster response. (All authors point out the potential of deep and machine learning research in geoscience regardless and do not wish to stifle such research [Shah and Innig, 2019, Mignan and Broccardo, 2019b].) More generally, this is an example where data science practices such as model validation, baseline models, and preventing data leakage and overfitting become increasingly important when the tools of applying machine learning become readily available.

Despite potential setbacks and the fields of deep learning and data science being relatively young, they can rely on mathematical and statistical foundations and make significant contributions to science and society. Machine learning systems have contributed to modelling the protein structure of the current pandemic virus COVID-19 [Jumper et al.]. A deep learning computer vision system was built to stabilize food safety by identifying Cassava plant disease on offline mobile devices [Ramcharan et al., 2017, 2019]. Self-driving cars have become a possibility [Bojarski et al., 2016], and natural language understanding has progressed significantly [Devlin et al., 2018].

Geoscience is slower in the adoption of machine learning compared to other disciplines. To be able to adopt the progress in machine learning research, many valuable data sources have to be made machine-readable. There has already been a change towards making computer code open source, which has led to collaborations and accelerated scientific progress. While specific open benchmark data sets have been instrumental to the progress in machine learning, it is questionable whether these would be beneficial to machine learning in geoscience. The problems are often very complex with non-unique explanations and solutions, which historically has led to disagreements over geophysical benchmark data sets. Open data and open-source software, however, have played and will play a significant role in advancing the field. Examples of this include basic utility functions to load geoscientific data [Kvalsvik and Contributors, 2019] or, more specifically, cross-validation functions tailored to geoscience [Uieda, 2018].

Moreover, machine learning is fundamentally conservative, training on available data. This bias of data collection will influence the ability to generate new insights in all areas of geoscience. Machine learning in geoscience may be able to generate insights and establish relationships in existing data. Entirely new insights from previously unseen data or analysis
of particularly complex models will still be a task performed by trained geoscientists. Transfer learning is an active field of machine learning research that geoscience can significantly benefit from. However, no significant headway has been made in transferring trained machine learning models to out-of-distribution data, i.e. data that is conceptually similar but explicitly different from the training data set. The fields of self-supervised learning, including reinforcement learning that can learn by exploration, may be able to approach some of these problems. They are, however, notoriously hard to set up and train, necessitating significant expertise in machine learning.

Large portions of publications are concerned with weakly or unconstrained predictions such as seismic interpretation and other applications that perform image recognition on SEM or core photography. These methods will continue to improve by implementing algorithmic improvements from machine learning research, specialized data augmentation strategies, and more diverse training data becoming available. New techniques such as multi-task learning [Kendall et al., 2018], which improved computer vision and computer linguistic models, deep Bayesian networks [Mosser et al., 2019] to obtain uncertainties, noisy teacher-student networks [Xie et al., 2019b] to improve training, and transformer networks [Graves, 2012] for time series processing, will significantly improve applications in geoscience. For example, automated seismic interpretation may advance to provide reliable outputs for relatively difficult geological regimes beyond existing solutions. Success will rely on interdisciplinary teams that can discern why geologically specific faults are important to interpret, while others would be ignored in manual interpretations, to encode geological understanding in automatic interpretation systems.

Currently, the most successful applications of machine learning and deep learning tie into existing workflows to automate sub-tasks in a grander system. These models are highly specific, and their predictive capability does not resemble an artificial intelligence or attempt to do so. Mathematical constraints and existing theory in other applied fields, especially neuroscience, were able to generate insights into deep learning, and geoscience has the opportunity to make significant contributions to machine learning, considering its unique problem set of heterogeneity, varying scales, and non-unique solutions. This has already taken place with the wider adoption of "kriging", or more generally Gaussian processes, into machine learning. Moreover, known applications of signal theory and information theory employed in geophysics are equally applicable in machine learning, with examples utilizing complex-valued neural networks [Trabelsi et al., 2017], deep Kalman filters [Krishnan et al., 2015], and Fourier analysis [Tancik et al., 2020], possibly enabling additional insights, particularly when integrated with deep learning, due to its modularity and versatility.

Previous reservations about neural networks included the difficulty of implementation and susceptibility to noise, in addition to computational costs. Research into updating trained models and saving the optimizer state with the model has in part alleviated the cost of re-training existing models. Moreover, fine-tuning pre-trained large complex models to specific problems has proven successful in several domains.
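A minimal sketch of such fine-tuning in Keras, assuming image patches resized to the 224 × 224 × 3 input that VGG16 expects; the frozen base provides pre-trained features while only the new head is trained:

from tensorflow.keras import applications, layers, models

# Load convolutional features pre-trained on natural images (ImageNet).
base = applications.VGG16(weights="imagenet", include_top=False,
                          input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Attach a small task-specific classification head.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])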
Regularization techniques and noise modelling, as well as data cleaning pipelines, can be implemented to lessen the impact of noise on machine learning models. Specific types of noise can be attenuated or even used as an additional source of information. The aforementioned concerns have mainly transitioned into a critique of overly complex models that overfit the training data and are not interpretable. Modern software makes very sophisticated machine learning models and data pipelines available to researchers, which has, in turn, increased the importance of controlling for data leakage and performing thorough model validation.

Currently, machine learning for science primarily relies on the emerging field of explainability [Lundberg et al., 2018]. These methods provide primarily post-hoc explanations for predictions from models. This field is particularly important to evaluate which inputs from the data have the strongest influence on the prediction result. The major point of critique regarding post-hoc explanations is that these methods attempt to explain how the algorithm reached a wrong prediction with equal confidence. Bayesian neural networks intend to address this issue by providing confidence intervals for the prediction based on prior beliefs. These neural networks intend to incorporate prior expert knowledge into neural networks, which can be beneficial in geoscientific applications, where strong priors can be necessary. Machine learning interpretability attempts to impose constraints on the machine learning models to make the model itself explainable. Closely related to these topics is the statistics field of causal inference. Causal inference attempts to model the cause of a variable, instead of correlative prediction. Some methods exist that can perform causal machine learning, i.e. causal trees [Athey and Imbens, 2016]. These three fields will be necessary to glean verifiable scientific insights from machine learning in geoscience. They are active fields of research and more involved to apply correctly, which often makes cooperation with a statistician necessary.
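A minimal sketch of post-hoc explanations with the SHAP library [Lundberg et al., 2018], assuming the X_train, y_train, and X_test arrays from the earlier examples:

import shap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# The summary plot ranks features by their influence on the predictions.
shap.summary_plot(shap_values, X_test)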
In conclusion, machine learning has had a long history in geoscience. Kriging has progressed into more general machine learning methods, and geoscience has made significant progress applying deep learning. Applying deep convolutional networks to automatic seismic interpretation has progressed these methods beyond what was previously possible, albeit still being an active field of research. Using modern tools, composing custom neural networks and conventional machine learning pipelines has become increasingly trivial, enabling widespread applications in every sub-field of geoscience. Nevertheless, it is important to acknowledge the limitations of machine learning in geoscience. Machine learning methods are often cutting-edge technology, yet properly validated models take time to develop, which is often perceived as inconvenient when working in a hot scientific field. Despite being cutting edge, it is important to acknowledge that none of these applications are fully automated, as would be suggested by the lure of artificial intelligence. Nevertheless, within applied geoscience, significant new insights have been presented. Applications in geoscience use machine learning as a utility for data pre-processing, implement previous insights beyond theory and synthetic cases, or enable unprecedented applications through the model itself. Overall, applied machine learning has matured into an established tool in computational geoscience and has the potential to provide further insights into the theory of geoscience itself.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin,S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, andX. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/ . Software available from tensorflow.org.F. Agterberg. Markov schemes for multivariate well data. In
Proceedings, symposium on applications of computers andoperations research in the mineral industries, Pennsylvania State University, State College, Pennsylvania , volume 2,pages X1–X18, 1966.M. A. Aizerman. Theoretical foundations of the potential function method in pattern recognition learning.
Automationand remote control , 25:821–837, 1964.M. Alfarraj and G. AlRegib. Petrophysical property estimation from seismic data using recurrent neural networks. arXiv preprint arXiv:1901.08623 , 2019.F. Anifowose, C. Ayadiuno, and F. Rashedian. Carbonate reservoir cementation factor modeling using wireline logs andartificial intelligence methodology. In , 2017.M. Araya-Polo, T. Dahlke, C. Frogner, C. Zhang, T. Poggio, and D. Hohl. Automated fault detection without seismicprocessing.
The Leading Edge , 36(3):208–214, 2017.S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academyof Sciences , 113(27):7353–7360, 2016.Y. Babakhin, A. Sanakoyeu, and H. Kitamura. Semi-supervised segmentation of salt bodies in seismic images using anensemble of convolutional neural networks.
German Conference on Pattern Recognition (GCPR) , 2019.C. Ballabio and S. Sterlacchini. Support vector machines for landslide susceptibility mapping: The staffora riverbasin case study, italy.
Math. Geosci. , 44(1):47–70, Jan. 2012. ISSN 1874-8961, 1874-8953. doi: 10.1007/s11004-011-9379-9. URL https://doi.org/10.1007/s11004-011-9379-9 .T. Bayes. Lii. an essay towards solving a problem in the doctrine of chances. by the late rev. mr. bayes, frs communicatedby mr. price, in a letter to john canton, amfr s.
Philosophical transactions of the Royal Society of London , (53):370–418, 1763.W. A. Belson. Matching and prediction on the principle of biological classification.
Journal of the Royal StatisticalSociety: Series C (Applied Statistics) , 8(2):65–75, 1959.K. J. Bergen, P. A. Johnson, V. Maarten, and G. C. Beroza. Machine learning for data-driven discovery in solid earthgeoscience.
Science , 363(6433):eaau0323, 2019.P. Bestagini, V. Lipari, and S. Tubaro. A machine learning approach to facies classification using well logs. In
SEG Technical Program Expanded Abstracts 2017 , SEG Technical Program Expanded Abstracts, pages 2137–2142. Society of Exploration Geophysicists, Aug. 2017. doi: 10.1190/segam2017-17729805.1. URL https://doi.org/10.1190/segam2017-17729805.1 .M. Beyreuther and J. Wassermann. Continuous earthquake detection and classification using discrete hidden markovmodels.
Geophys. J. Int. , 175(3):1055–1066, Dec. 2008. ISSN 0956-540X. doi: 10.1111/j.1365-246X.2008.03921.x.URL https://academic.oup.com/gji/article-abstract/175/3/1055/634811 .M. Bicego, C. Acosta-Muñoz, and M. Orozco-Alzate. Classification of seismic volcanic signals using Hidden-Markov-Model-Based generative embeddings.
IEEE Trans. Geosci. Remote Sens., 51(6):3400–3409, June 2013. ISSN 0196-2892. doi: 10.1109/TGRS.2012.2220370. URL http://dx.doi.org/10.1109/TGRS.2012.2220370.
H. Blondelle, A. Juneja, J. Micaelli, and P. Neri. Machine learning can extract the information needed for modelling and data analysing from unstructured documents. earthdoc.org, 2017. M. Blouin, A. Caté, L. Perozzi, and E. Gloaguen. Automated facies prediction in drillholes using machine learning. earthdoc.org, 2017. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016. C. E. Bond, A. D. Gibbs, Z. K. Shipton, S. Jones, et al. What do you think this is? "Conceptual uncertainty" in geoscience interpretation.
GSA Today, 17(11):4, 2007. L. Breiman. Random forests.
Machine learning , 45(1):5–32, 2001.A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. In
Proc. Harvard Univ. Symposiumon digital computers and their applications , volume 72, 1961.L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort,J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software:experiences from the scikit-learn project. In
ECML PKDD Workshop: Languages for Data Mining and MachineLearning , pages 108–122, 2013.J. Cao and B. Roy. Time-lapse reservoir property change estimation from seismic using machine learning.
Lead. Edge ,36(3):234–238, Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030234.1. URL https://doi.org/10.1190/tle36030234.1 .A. Caté, L. Perozzi, E. Gloaguen, and M. Blouin. Machine learning as a tool for geologists.
Lead. Edge , 36(3):215–219,Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030215.1. URL https://doi.org/10.1190/tle36030215.1 .A. Caté, E. Schetselaar, P. Mercier-Langevin, and P.-S. Ross. Classification of lithostratigraphic and alteration unitsfrom drillhole lithogeochemical data using machine learning: A case study from the lalor volcanogenic massivesulphide deposit, snow lake, manitoba, canada.
J. Geochem. Explor., 188:216–228, 2018. ISSN 0375-6742. S. Chaki, A. Routray, and W. K. Mohanty. Well-Log and seismic data integration for reservoir characterization: A signal processing and Machine-Learning perspective.
IEEE Signal Process. Mag. , 35(2):72–81, Mar. 2018. ISSN1053-5888. doi: 10.1109/MSP.2017.2776602. URL http://dx.doi.org/10.1109/MSP.2017.2776602 .S. Chan and A. H. Elsheikh. Parametrization and generation of geological models with generative adversarial networks. arXiv preprint arXiv:1708.01810 , 2017.C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines.
ACM Trans. Intell. Syst. Technol. , 2(3):27:1–27:27, May 2011. ISSN 2157-6904. doi: 10.1145/1961189.1961199. URL http://doi.acm.org/10.1145/1961189.1961199 .T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining , KDD ’16, pages 785–794, New York, NY,USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785 .D. Chevitarese, D. Szwarcman, R. M. D. Silva, and E. V. Brazil. Seismic facies segmentation using deep learning. In
ACE 2018 Annual Convention & Exhibition . searchanddiscovery.com, 2018.J. Chiles and P. Chauvet. Kriging: a method for cartography of the sea floor.
The International Hydrographic Review ,1975.J.-P. Chilès and N. Desassis. Fifty years of kriging. In
Handbook of mathematical geosciences , pages 589–612. Springer,2018.T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow,M. Zietz, M. M. Hoffman, W. Xie, G. L. Rosen, B. J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A. E.Carpenter, A. Shrikumar, J. Xu, E. M. Cofer, C. A. Lavender, S. C. Turaga, A. M. Alexandari, Z. Lu, D. J.Harris, D. DeCaprio, Y. Qi, A. Kundaje, Y. Peng, L. K. Wiley, M. H. S. Segler, S. M. Boca, S. J. Swamidass,A. Huang, A. Gitter, and C. S. Greene. Opportunities and obstacles for deep learning in biology and medicine.
J. R. Soc. Interface, 15(141), Apr. 2018. ISSN 1742-5689, 1742-5662. doi: 10.1098/rsif.2017.0387. URL http://dx.doi.org/10.1098/rsif.2017.0387. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In
Twenty-Second International Joint Conference on Artificial Intelligence ,2011.R. Collobert, S. Bengio, and J. Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap,2002.C. Cortes and V. Vapnik. Support-vector networks.
Machine learning , 20(3):273–297, 1995.T. Cover and P. Hart. Nearest neighbor pattern classification.
IEEE transactions on information theory , 13(1):21–27,1967.M. J. Cracknell and A. M. Reading. The upside of uncertainty: Identification of lithology contact zones from airbornegeophysics and satellite data using random forests and support vector machines.
Geophysics , 78(3):WB113–WB126,2013.N. Cressie. The origins of kriging.
Mathematical geology , 22(3):239–252, 1990.F. Dammeier, J. R. Moore, C. Hammer, F. Haslinger, and S. Loew. Automatic detection of alpine rockslides incontinuous seismic data using hidden markov models.
J. Geophys. Res. Earth Surf. , 121(2):351–371, Feb. 2016.ISSN 2169-9003. doi: 10.1002/2015JF003647. URL http://doi.wiley.com/10.1002/2015JF003647 .R. Dechter.
Learning while searching in constraint-satisfaction problems . University of California, Computer ScienceDepartment, Cognitive Systems . . . , 1986.J. Delhomme. Kriging in the hydrosciences.
Advances in water resources , 1:251–266, 1978.J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In , pages 248–255. Ieee, 2009.J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for languageunderstanding. arXiv preprint arXiv:1810.04805 , 2018.P. M. DeVries, F. Viégas, M. Wattenberg, and B. J. Meade. Deep learning of aftershock patterns following largeearthquakes.
Nature, 560(7720):632, 2018. H. Di, M. Shafiq, and G. AlRegib. Multi-attribute k-means cluster analysis for salt boundary detection. 2017a. H. Di, M. A. Shafiq, and G. AlRegib. Seismic-fault detection based on multiattribute support vector machine analysis. In
SEG Technical Program Expanded Abstracts 2017 , pages 2039–2044. Society of Exploration Geophysicists,2017b.H. Di, Z. Wang, and G. AlRegib. Deep convolutional neural networks for seismic salt-body delineation.
AAPG AnnualConvention and , 2018.D. A. Dodge and D. B. Harris. Large-scale test of dynamic correlation processors: Implications for correlation-basedseismic pipelines.
Bull. Seismol. Soc. Am. , 2016. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/106/2/435/332173 .F. U. Dowla, S. R. Taylor, and R. W. Anderson. Seismic discrimination with artificial neural networks: Preliminaryresults with regional spectral data.
Bull. Seismol. Soc. Am. , 80(5):1346–1373, Oct. 1990. ISSN 0037-1106. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/80/5/1346/119382 .J. Dramsch.
Machine Learning in 4D Seismic Data Analysis: Deep Neural Networks in Geophysics . PhD thesis, 2019.J. S. Dramsch. 3d decision volume of svm, random forest, and deep neural network, Jul 2020a. URL https://figshare.com/articles/media/3D_decision_volume_of_SVM_Random_Forest_and_Deep_Neural_Network/12640226/1 .J. S. Dramsch. Code for 70 years of machine learning in geoscience in review, Jul 2020b.J. S. Dramsch and M. Lüthje. Deep-learning seismic facies on state-of-the-art cnn architectures. In
SEG TechnicalProgram Expanded Abstracts 2018 , pages 2036–2040. Society of Exploration Geophysicists, 8 2018. doi: 10.1190/segam2018-2996783.1. URL https://doi.org/10.1190/segam2018-2996783.1 .J. S. Dramsch, F. Amour, and M. Lüthje. Gaussian mixture models for robust unsupervised scanning-electron microscopyimage segmentation of north sea chalk. In
First EAGE/PESGB Workshop Machine Learning . EAGE PublicationsBV, 2018. doi: 10.3997/2214-4609.201803014. URL https://doi.org/10.3997/2214-4609.201803014 .J. S. Dramsch, A. N. Christensen, C. MacBeth, and M. Lüthje. Deep unsupervised 4d seismic 3d time-shift estimationwith convolutional neural networks.
IEEE Transactions in Geoscience and Remote Sensing, 2019.
S. Dreyfus. The numerical solution of variational problems.
Journal of Mathematical Analysis and Applications , 5(1):30–45, 1962.O. Dubrule. Comparing splines and kriging.
Computers & Geosciences , 10(2-3):327–338, 1984.E. Dupont, T. Zhang, P. Tilke, L. Liang, and W. Bailey. Generating realistic geology conditioned on physicalmeasurements with generative adversarial networks. arXiv preprint arXiv:1802.03065 , 2018.D. Duvenaud.
Automatic model construction with Gaussian processes . PhD thesis, University of Cambridge, 2014.J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts. Gansynth: Adversarial neural audiosynthesis. arXiv preprint arXiv:1902.08710 , 2019.X.-T. Feng and M. Seto. Neural network dynamic modelling of rock microfracturing sequences under triaxial compres-sive stress conditions.
Tectonophysics, 292(3):293–309, July 1998. ISSN 0040-1951. doi: 10.1016/S0040-1951(98)00072-9. R. Ferreira, E. V. Brazil, R. Silva, and others. Texture-Based similarity graph to aid seismic interpretation.
ACE 2018Annual , 2018.K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognitionunaffected by shift in position.
Biological cybernetics , 36(4):193–202, 1980.I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generativeadversarial nets. In
Advances in neural information processing systems , pages 2672–2680, 2014.I. Goodfellow, Y. Bengio, and A. Courville.
Deep Learning. MIT Press, 2016. O. Gramstad and M. Nickel. Automated top salt interpretation using a deep convolutional net. 2018. A. Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. P. Guillen, G. Larrazabal, G. González, D. Boumber, and R. Vilalta. Supervised learning to detect salt body. In
SEG Technical Program Expanded Abstracts 2015 , SEG Technical Program Expanded Abstracts, pages 1826–1829. Society of Exploration Geophysicists, Aug. 2015. doi: 10.1190/segam2015-5931401.1. URL https://doi.org/10.1190/segam2015-5931401.1 .A. Guitton. 3D convolutional neural networks for fault interpretation. ,2018. URL .R. Guo, Y. S. Zhang, H. Lin, and W. Liu. Sweet spot interpretation from multiple attributes: Machine learningand neural networks technologies.
First EAGE/AMGP/AMGE Latin , 2017. URL .I. Gupta, C. Rai, C. Sondergeld, and D. Devegowda. Rock typing in the upper Devonian-Lower mississippianwoodford shale formation, oklahoma, USA.
Interpretation , 6(1):SC55–SC66, Feb. 2018. ISSN 2324-8858. doi:10.1190/INT-2017-0015.1. URL https://doi.org/10.1190/INT-2017-0015.1 .J. Görtler, R. Kehlbeck, and O. Deussen. A visual exploration of gaussian processes.
Distill , 2019. doi: 10.23915/distill.00017. https://distill.pub/2019/visual-exploration-gaussian-processes.R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogueamplification coexist in a cortex-inspired silicon circuit.
Nature , 405(6789):947–951, 2000.A. Hajimoradlou, G. Roberti, and D. Poole. Predicting landslides using contour aligning convolutional neural networks. arXiv: Computer Vision and Pattern Recognition , 2019.D. Hale. Methods to compute fault images, extract fault surfaces, and estimate fault throws from 3d seismic images.
Geophysics , 78(2):O33–O43, 2013.B. Hall. Facies classification using machine learning.
Lead. Edge , 35(10):906–909, Oct. 2016. ISSN 1070-485X. doi:10.1190/tle35100906.1. URL https://doi.org/10.1190/tle35100906.1 .M. Hall and B. Hall. Distributed collaborative prediction: Results of the machine learning contest.
Lead. Edge , 36(3):267–269, Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030267.1. URL https://doi.org/10.1190/tle36030267.1 .K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition , pages 770–778, 2016.L. Hermes, D. Frieauff, J. Puzicha, and J. M. Buhmann. Support vector machines for land usage classification inlandsat tm imagery. In
IEEE 1999 International Geoscience and Remote Sensing Symposium. IGARSS'99 (Cat. No. 99CH36293), volume 1, pages 348–350. IEEE, 1999.
T. K. Ho. Random decision forests. In
Proceedings of 3rd international conference on document analysis andrecognition , volume 1, pages 278–282. IEEE, 1995.S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation , 9(8):1735–1780, 1997.S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learninglong-term dependencies, 2001.J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities.
Proceedings ofthe national academy of sciences , 79(8):2554–2558, 1982.K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
NeuralNetw. , 2(5):359–366, Jan. 1989. ISSN 0893-6080. doi: 10.1016/0893-6080(89)90020-8. URL .K. Y. Huang, W. R. I. Chang, and H. T. Yen. Self-organizing neural network for picking seismic horizons.
SEGTechnical Program Expanded , 1990. URL https://library.seg.org/doi/pdf/10.1190/1.1890183 .C. Huijbregts and G. Matheron. Universal kriging (an optimal method for estimating and contouring in trend surfaceanalysis): 9th intern.
Sym. on Decisionmaking in the Mineral Industries (proceedings to be published by CanadianInst. Mining), Montreal , 1970.C. Hulbert, B. Rouet-Leduc, C. X. Ren, J. Riviere, D. C. Bolton, C. Marone, and P. A. Johnson. Estimating the physicalstate of a laboratory slow slipping fault from seismic signals. Jan. 2018. URL http://arxiv.org/abs/1801.07806 .C. R. Ildstad and P. Bormann. Malenov_nd (machine learning of voxels). URL https://github.com/bolgebrygg/MalenoV .S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015.H. Iqbal. Harisiqbal88/plotneuralnet v1.0.0, Dec. 2018. URL https://doi.org/10.5281/zenodo.2526396 .J. Jeong, E. Park, W. S. Han, and K.-Y. Kim. A novel data assimilation methodology for predicting lithology based onsequence labeling algorithms.
J. Geophys. Res. [Solid Earth] , 119(10):7503–7520, Oct. 2014. ISSN 2169-9313. doi:10.1002/2014JB011279. URL http://doi.wiley.com/10.1002/2014JB011279 .M. A. Jewett, C. Bombardier, D. Caron, M. R. Ryan, R. R. Gray, E. L. S. Louis, S. J. Witchell, S. Kumra, and K. E.Psihramis. Potential for inter-observer and intra-observer variability in x-ray review to establish stone-free rates afterlithotripsy.
The Journal of urology , 147(3):559–562, 1992.J. Jumper, K. Tunyasuvunakool, P. Kohli, D. Hassabis, and A. Team. Computational predictions of protein struc-tures associated with covid-19. Technical report. URL https://deepmind.com/research/open-source/computational-predictions-of-protein-structures-associated-with-COVID-19 .A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov. druGAN: An advanced generative adversarialautoencoder model for de novo generation of new molecules with desired molecular properties in silico.
Mol. Pharm. ,14(9):3098–3104, Sept. 2017. ISSN 1543-8384, 1543-8392. doi: 10.1021/acs.molpharmaceut.7b00346. URL http://dx.doi.org/10.1021/acs.molpharmaceut.7b00346 .S. Karra, D. O’Malley, J. D. Hyman, H. S. Viswanathan, and G. Srinivasan. Modeling flow and transport in fracturenetworks using graphs.
Phys Rev E , 97(3-1):033304, Mar. 2018. ISSN 2470-0053, 2470-0045. doi: 10.1103/PhysRevE.97.033304. URL http://dx.doi.org/10.1103/PhysRevE.97.033304 .H. J. Kelley. Gradient theory of optimal flight paths.
Ars Journal , 30(10):947–954, 1960.A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry andsemantics. In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7482–7491,2018.N. Khoshnevis and R. Taborda. Prioritizing ground-motion validation metrics using semisupervised and super-vised learning.
Bull. Seismol. Soc. Am. , 2018. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/108/4/2248/536309 .B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim. Learning not to learn: Training deep neural networks with biased data.In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 9012–9020, 2019.D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv , 2014.G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In
ICML deep learning workshop, volume 2. Lille, 2015.
A. N. Kolmogorov. Sur l'interpolation et extrapolation des suites stationnaires.
CR Acad Sci , 208:2043–2045, 1939.Q. Kong, D. T. Trugman, Z. E. Ross, M. J. Bianco, B. J. Meade, and P. Gerstoft. Machine learning in seismology:Turning data into insights.
Seismological Research Letters, 90(1):3–14, 2019.
F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing. Benchmarking a catchment-aware long short-term memory network (LSTM) for large-scale hydrological modeling. arXiv preprint arXiv:1907.08456, 2019.
D. G. Krige. A statistical approach to some mine valuation and allied problems on the Witwatersrand. PhD thesis, Johannesburg, 1951.
L. Krischer and A. Fichtner. Generating seismograms with deep neural networks. AGUFM, 2017:S41D–03, 2017.
R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
W. C. Krumbein and M. F. Dacey. Markov chains and embedded Markov chains in geology. Journal of the International Association for Mathematical Geology, 1(1):79–96, 1969.
N. M. Kuehn, C. Riggelsen, et al. Modeling the joint probability of earthquake, site, and ground-motion parameters using Bayesian networks. Bulletin of the Seismological Society of America, 2011. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/101/1/235/349494.
H. A. Kuzma. A support vector machine for AVO interpretation. In SEG Technical Program Expanded Abstracts 2003, pages 181–184. Society of Exploration Geophysicists, 2003.
J. Kvalsvik and Contributors. Segyio, 2019. URL https://github.com/equinor/segyio/.
E. Laloy, R. Hérault, D. Jacques, and N. Linde. Efficient training-image based geostatistical simulation and inversion using a spatial generative adversarial neural network. Aug. 2017. URL http://arxiv.org/abs/1708.04975.
D. J. Lary, A. H. Alavi, A. H. Gandomi, and A. L. Walker. Machine learning in geosciences and remote sensing. Geoscience Frontiers, 7(1):3–10, Jan. 2016. ISSN 1674-9871. doi: 10.1016/j.gsf.2015.07.003.
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
A. M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes [New methods for the determination of comet orbits]. F. Didot, 1805.
J. Li and J. Castagna. Support vector machine (SVM) pattern recognition to AVO classification. Geophys. Res. Lett., 31(2):948, Jan. 2004. ISSN 0094-8276. doi: 10.1029/2003GL018299. URL http://doi.wiley.com/10.1029/2003GL018299.
M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, pages 6–7, 1970.
Y. Liu, Z. Chen, L. Wang, Y. Zhang, Z. Liu, and Y. Shuai. Quantitative seismic interpretations to detect biogenic gas accumulations: a case study from Qaidam Basin, China. Bull. Can. Petrol. Geol., 63(1):108–121, Mar. 2015. ISSN 0007-4802. doi: 10.2113/gscpgbull.63.1.108. URL https://pubs.geoscienceworld.org/cspg/bcpg/article-abstract/63/1/108/455952.
P. Lu, M. Morris, S. Brazell, C. Comiskey, and Y. Xiao. Using generative adversarial networks to improve deep-learning fault interpretation networks. The Leading Edge, 37(8):578–583, 2018.
S. M. Lundberg, B. Nair, M. S. Vavilala, M. Horibe, M. J. Eisses, T. Adams, D. E. Liston, D. K.-W. Low, S.-F. Newman, J. Kim, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10):749–760, 2018.
X. Luo, W. Zhou, W. Wang, Y. Zhu, and J. Deng. Attention-based relation extraction with bidirectional gated recurrent unit and highway network in the analysis of geological data. IEEE Access, 6:5705–5715, 2017.
J. Ma, Z. Jiang, Q. Tian, and G. D. Couples. Classification of digital rocks by machine learning. ECMOR XIII - 13th European Conference on the Mathematics of Oil Recovery, 2012.
A. Maggi, V. Ferrazzini, C. Hibert, F. Beauducel, P. Boissier, and A. Amemoutou. Implementation of a multistation approach for automated event classification at Piton de la Fournaise volcano. Seismol. Res. Lett., 88(3):878–891, May 2017. ISSN 0895-0695. doi: 10.1785/0220160189. URL https://pubs.geoscienceworld.org/ssa/srl/article-abstract/88/3/878/284054.
M. Malfante, M. D. Mura, J. Metaxian, J. I. Mars, O. Macedo, and A. Inza. Machine learning for volcano-seismic signals: Challenges and perspectives.
IEEE Signal Process. Mag., 35(2):20–30, Mar. 2018. ISSN 1053-5888. doi: 10.1109/MSP.2017.2779166. URL http://dx.doi.org/10.1109/MSP.2017.2779166.
A. Mardan, A. Javaherian, et al. Channel characterization using support vector machine. 2017.
M. Marjanović, M. Kovačević, B. Bajat, and V. Voženílek. Landslide susceptibility assessment using SVM machine learning algorithm. Eng. Geol., 123(3):225–234, Nov. 2011. ISSN 0013-7952. doi: 10.1016/j.enggeo.2011.09.006.
A. A. Markov. Rasprostranenie zakona bol'shih chisel na velichiny, zavisyaschie drug ot druga [Extension of the law of large numbers to quantities depending on each other]. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete, 15(135-156):18, 1906.
A. A. Markov. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. Dynamic Probabilistic Systems, 1:552–577, 1971. Reprint in English of [Markov, 1906].
G. Martinelli, J. Eidsvik, R. Sinding-Larsen, et al. Building Bayesian networks from basin-modelling scenarios for improved geological decision making. Petroleum Geoscience, 2013. URL http://pg.lyellcollection.org/content/early/2013/06/24/petgeo2012-057.abstract.
M. Masotti, S. Falsaperla, H. Langer, S. Spampinato, and R. Campanini. Application of support vector machine to the classification of volcanic tremor at Etna, Italy. Geophys. Res. Lett., 33(20):113, Oct. 2006. ISSN 0094-8276. doi: 10.1029/2006GL027441. URL http://doi.wiley.com/10.1029/2006GL027441.
M. Masotti, R. Campanini, L. Mazzacurati, S. Falsaperla, H. Langer, and S. Spampinato. TREMOrEC: A software utility for automatic classification of volcanic tremor. Geochem. Geophys. Geosyst., 9(4), Apr. 2008. ISSN 1525-2027. doi: 10.1029/2007GC001860. URL http://doi.wiley.com/10.1029/2007GC001860.
N. C. Matalas. Mathematical assessment of synthetic hydrology. Water Resources Research, 3(4):937–945, 1967.
G. Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.
G. Matheron et al. Splines and kriging; their formal equivalence. 1981.
M. McCormack. Neural computing in geophysics. Lead. Edge, 10(1):11–15, Jan. 1991. ISSN 1070-485X. doi: 10.1190/1.1436771. URL https://doi.org/10.1190/1.1436771.
A. Mignan and M. Broccardo. A deeper look into 'deep learning of aftershock patterns following large earthquakes': Illustrating first principles in neural network physical interpretability. In International Work-Conference on Artificial Neural Networks, pages 3–14. Springer, 2019a.
A. Mignan and M. Broccardo. One neuron versus deep learning in aftershock prediction. Nature, 574(7776):E1–E3, 2019b.
T. M. Mitchell et al. Machine learning, 1997.
E. Mjolsness and D. DeCoste. Machine learning for science: state of the art and future prospects. Science, 293(5537):2051–2055, Sept. 2001. ISSN 0036-8075. doi: 10.1126/science.293.5537.2051. URL http://dx.doi.org/10.1126/science.293.5537.2051.
L. Mosser, O. Dubrule, and M. J. Blunt. Reconstruction of three-dimensional porous media using generative adversarial neural networks. Phys. Rev. E, 96(4-1):043309, Oct. 2017. ISSN 2470-0053, 2470-0045. doi: 10.1103/PhysRevE.96.043309. URL http://dx.doi.org/10.1103/PhysRevE.96.043309.
L. Mosser, O. Dubrule, and M. J. Blunt. Conditioning of three-dimensional generative adversarial networks for pore and reservoir-scale models. Feb. 2018a. URL http://arxiv.org/abs/1802.05622.
L. Mosser, O. Dubrule, and M. J. Blunt. Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. June 2018b. URL http://arxiv.org/abs/1806.03720.
L. Mosser, W. Kimman, J. S. Dramsch, S. Purves, A. De la Fuente Briceño, and G. Ganssle. Rapid seismic domain transfer: Seismic velocity inversion and modeling using deep generative neural networks. EAGE Publications BV, June 2018c. doi: 10.3997/2214-4609.201800734. URL https://doi.org/10.3997/2214-4609.201800734.
L. Mosser, R. Oliveira, and M. Steventon. Probabilistic seismic interpretation using Bayesian neural networks, volume 2019, pages 1–5. European Association of Geoscientists & Engineers, 2019.
J. Nash. Non-cooperative games. Annals of Mathematics, pages 286–295, 1951.
R. M. Neal. Bayesian learning for neural networks. Springer Science & Business Media, 1996.
P. D. Newendorp. Decision analysis for petroleum exploration. 1976.
L. H. Ochoa, L. F. Niño, and C. A. Vargas. Fast magnitude determination using a single seismological station record implementing machine learning techniques.
Geodesy and Geodynamics, 9(1):34–41, Jan. 2018. ISSN 1674-9847. doi: 10.1016/j.geog.2017.03.010.
M. Ohrnberger. [no title], 2001. Accessed: 2018-12-17.
M. Paganini, L. de Oliveira, and B. Nachman. CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D, 97(1):014021, 2018.
A. Panakkat and H. Adeli. Neural network models for earthquake magnitude prediction using multiple seismicity indicators. International Journal of Neural Systems, 17(01):13–33, 2007.
E. Pasolli, F. Melgani, and M. Donelli. Automatic analysis of GPR images: A pattern-recognition approach. IEEE Transactions on Geoscience and Remote Sensing, 47(7):2206–2217, 2009.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
J. Pearl. The do-calculus revisited. arXiv preprint arXiv:1210.4852, 2012.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
M. Poulton, B. Sternberg, and C. Glass. Location of subsurface targets in geophysical data using neural networks. Geophysics, 57(12):1534–1544, Dec. 1992. ISSN 0016-8033. doi: 10.1190/1.1443221. URL https://doi.org/10.1190/1.1443221.
F. W. Preston and J. Henderson. Fourier series characterization of cyclic sediments for stratigraphic correlation. Kansas Geological Survey, 1964.
S. Purves, B. Alaei, and E. Larsen. Bootstrapping machine-learning based seismic fault interpretation.
ACE 2018 Annual Convention & Exhibition, 2018.
A. Ramcharan, K. Baranowski, P. McCloskey, B. Ahmed, J. Legg, and D. P. Hughes. Deep learning for image-based cassava disease detection.
Frontiers in Plant Science, 8:1852, 2017.
A. Ramcharan, P. McCloskey, K. Baranowski, N. Mbilinyi, L. Mrisho, M. Ndalahwa, J. Legg, and D. P. Hughes. A mobile-based deep learning model for cassava disease diagnosis. Frontiers in Plant Science, 10:272, 2019.
C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811, 2019.
R. Reddy and G. Bonham-Carter. A decision-tree approach to mineral potential mapping in Snow Lake area, Manitoba. Canadian Journal of Remote Sensing, 17(2):191–200, 1991. doi: 10.1080/07038992.1991.10855292. URL https://doi.org/10.1080/07038992.1991.10855292.
A. Richardson. Seismic full-waveform inversion using deep learning tools and techniques. Jan. 2018. URL http://arxiv.org/abs/1801.07232.
R. Roden and C. W. Chen. Interpretation of DHI characteristics with machine learning. First Break, 2017. ISSN 0263-5046.
D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, et al. Tackling climate change with machine learning. arXiv preprint arXiv:1906.05433, 2019.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
Z. E. Ross, M. A. Meier, and E. Hauksson. P-wave arrival picking and first-motion polarity determination with deep learning. J. Geophys. Res., 2018a. ISSN 0148-0227. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2017JB015251.
Z. E. Ross, M.-A. Meier, E. Hauksson, and T. H. Heaton. Generalized seismic phase detection with deep learning. May 2018b. URL http://arxiv.org/abs/1805.01075.
Z. E. Ross, M.-A. Meier, E. Hauksson, and T. H. Heaton. Generalized seismic phase detection with deep learning.
Bulletin of the Seismological Society of America, 108(5A):2894–2901, 2018c.
G. Röth and A. Tarantola. Neural networks and inversion of seismic data. J. Geophys. Res., 99(B4):6753, 1994. ISSN 0148-0227. doi: 10.1029/93JB01563. URL http://doi.wiley.com/10.1029/93JB01563.
B. Rouet-Leduc, C. Hulbert, N. Lubbers, et al. Machine learning predicts laboratory earthquakes. Geophysical Research Letters, 2017. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/2017GL074677.
B. Rouet-Leduc, C. Hulbert, D. C. Bolton, et al. Estimating fault friction from seismic signals in the laboratory. Geophysical Research Letters, 2018. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/2017GL076708.
D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
S. J. Russell and P. Norvig. Artificial Intelligence - A Modern Approach, Third International Edition. Pearson Education, 2010. ISBN 978-0-13-207148-2. URL http://vig.pearsoned.com/store/product/1,1207,store-12521_isbn-0136042597,00.html.
A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
C. M. Saporetti, L. G. da Fonseca, E. Pereira, and L. C. de Oliveira. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys., 155:217–225, Aug. 2018. ISSN 0926-9851. doi: 10.1016/j.jappgeo.2018.06.012.
K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nat. Commun., 8:13890, Jan. 2017. ISSN 2041-1723. doi: 10.1038/ncomms13890. URL http://dx.doi.org/10.1038/ncomms13890.
W. Schwarzacher. The semi-Markov process as a general sedimentation model. In Mathematical Models of Sedimentary Processes, pages 247–268. Springer, 1972.
S. Sen, S. Kainkaryam, C. Ong, and A. Sharma. SaltNet: A production-scale deep learning pipeline for automated salt model building. The Leading Edge, 39(3):195–203, 2020.
R. Shah and L. Innig. Aftershock issues, 2019. URL https://github.com/rajshah4/aftershocks_issues.
B. M. Shashidhara, M. Scott, and A. Marburg. Instance segmentation of benthic scale worms at a hydrothermal site. In The IEEE Winter Conference on Applications of Computer Vision, pages 1314–1323, 2020.
C. Shen. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resources Research, 54(11):8558–8593, 2018.
D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng., 19:221–248, June 2017. ISSN 1523-9829, 1545-4274. doi: 10.1146/annurev-bioeng-071516-044442. URL http://dx.doi.org/10.1146/annurev-bioeng-071516-044442.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
A. Y. Sun, B. R. Scanlon, Z. Zhang, D. Walling, S. N. Bhanja, A. Mukherjee, and Z. Zhong. Combining physically based modeling and deep learning for fusing GRACE satellite data: Can we learn from mismatch? Water Resources Research, 55(2):1179–1195, 2019.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/sutskever13.html.
E. Talarico, W. Leão, and D. Grana. Comparison of recursive neural network and Markov chain models in facies inversion. In Petroleum Geostatistics 2019, volume 2019, pages 1–5. European Association of Geoscientists & Engineers, 2019.
M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
M. Titos, A. Bueno, L. García, M. C. Benítez, and J. Ibañez. Detection and classification of continuous volcano-seismic signals with recurrent neural networks.
IEEE Transactions on Geoscience and Remote Sensing, 57(4):1936–1948, 2018.
C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal. Deep complex networks. arXiv preprint arXiv:1705.09792, 2017.
A. M. Turing. I.—Computing Machinery and Intelligence. Mind, LIX(236):433–460, 10 1950. ISSN 0026-4423. doi: 10.1093/mind/LIX.236.433. URL https://doi.org/10.1093/mind/LIX.236.433.
L. Uieda. Verde: Processing and gridding spatial data using Green's functions. Journal of Open Source Software, 3(29):957, 2018. ISSN 2475-9066. doi: 10.21105/joss.00957.
A. Valentine and L. M. Kalnins. An introduction to learning algorithms and potential applications in geomorphometry and Earth surface dynamics. Earth Surface Dynamics, 4:445–460, 2016.
M. Valera, Z. Guo, P. Kelly, S. Matz, V. A. Cantu, A. G. Percus, J. D. Hyman, G. Srinivasan, and H. S. Viswanathan. Machine learning for graph-based representations of three-dimensional discrete fracture networks. May 2017. URL http://arxiv.org/abs/1705.09866.
M. van der Baan and C. Jutten. Neural networks in geophysical applications. Geophysics, 65(4):1032–1047, July 2000. ISSN 0016-8033. doi: 10.1190/1.1444797. URL https://doi.org/10.1190/1.1444797.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
A. Waldeland, A. Jensen, L. Gelius, and A. Solberg. Convolutional neural networks for automated seismic interpretation. Lead. Edge, 37(7):529–537, July 2018. ISSN 1070-485X. doi: 10.1190/tle37070529.1. URL https://doi.org/10.1190/tle37070529.1.
A. U. Waldeland and A. Solberg. Salt classification using deep learning, 2017.
H. Wang, J. F. Wellmann, Z. Li, X. Wang, and R. Y. Liang. A segmentation approach for stochastic geological modeling using hidden Markov random fields. Math. Geosci., 49(2):145–177, Feb. 2017a. ISSN 1874-8961, 1874-8953. doi: 10.1007/s11004-016-9663-9. URL https://doi.org/10.1007/s11004-016-9663-9.
K. Wang, J. Lomask, and F. Segovia. Automatic, geologic layer-constrained well-seismic tie through blocked dynamic warping. Interpretation, 5(3):SJ81–SJ90, Aug. 2017b. ISSN 2324-8858. doi: 10.1190/INT-2016-0160.1. URL https://doi.org/10.1190/INT-2016-0160.1.
L. X. Wang and J. M. Mendel. Adaptive minimum prediction-error deconvolution and source wavelet estimation using Hopfield neural networks. Geophysics, 1992. ISSN 0016-8033. URL https://library.seg.org/doi/abs/10.1190/1.1443281.
Z. Wang, H. Di, M. A. Shafiq, Y. Alaudah, and G. AlRegib. Successful leveraging of image processing and machine learning in seismic structural interpretation: A review. The Leading Edge, 37(6):451–461, 2018.
C. J. C. H. Watkins. Learning from delayed rewards. 1989.
S. Wei, O. Yonglin, Z. Qingcai, H. Jiaqiang, et al. Unsupervised machine learning: K-means clustering velocity semblance auto-picking. 2018.
F. Wickman. Repose period patterns of volcanoes. V. General discussion and a tentative stochastic model. Arkiv för Mineralogi och Geologi, 4(5):351, 1968.
C. K. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning in Graphical Models, pages 599–621. Springer, 1998.
C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
A. Wirgin. The inverse crime. arXiv preprint math-ph/0401050, 2004.
I. H. Witten, E. Frank, and M. A. Hall. Practical machine learning tools and techniques. Morgan Kaufmann, page 578, 2005.
H. Wu and B. Zhang. A deep convolutional encoder-decoder neural network in assisting seismic horizon tracking. Apr. 2018. URL http://arxiv.org/abs/1804.06814.
P. Xie, A. Zhou, and B. Chai. The application of long short-term memory (LSTM) method on displacement prediction of multifactor-induced landslides.
IEEE Access, 7:54305–54311, 2019a.
Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le. Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252, 2019b.
X. Xie, H. Qin, C. Yu, and L. Liu. An automatic recognition algorithm for GPR images of RC structure voids. J. Appl. Geophys., 99:125–134, Dec. 2013. ISSN 0926-9851. doi: 10.1016/j.jappgeo.2013.02.016.
B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Y. Zhang and K. V. Paulson. Magnetotelluric inversion using regularized Hopfield neural networks. Geophys. Prospect., 1997. ISSN 0016-8025.
T. Zhao, F. Li, and K. Marfurt. Constraining self-organizing map facies analysis with stratigraphy: An approach to increase the credibility in automatic seismic facies classification. Interpretation, 5(2):T163–T171, May 2017. ISSN 2324-8858. doi: 10.1190/INT-2016-0132.1. URL https://doi.org/10.1190/INT-2016-0132.1.
X. Zhao and J. M. Mendel. Minimum-variance deconvolution using artificial neural networks. SEG Technical Program Expanded Abstracts, 1988. URL https://library.seg.org/doi/pdf/10.1190/1.1892433.
Z. Zhao and L. Gross. Using supervised machine learning to distinguish microseismic from noise events. In SEG Technical Program Expanded Abstracts 2017, SEG Technical Program Expanded Abstracts, pages 2918–2923. Society of Exploration Geophysicists, Aug. 2017. doi: 10.1190/segam2017-17727697.1. URL https://doi.org/10.1190/segam2017-17727697.1.
H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 5209–5217, 2017.
W. Zhu and G. C. Beroza. PhaseNet: A deep-neural-network-based seismic arrival time picking method. Mar. 2018. URL http://arxiv.org/abs/1803.03211.
R. Zuo, Y. Xiong, J. Wang, and E. J. M. Carranza. Deep learning and its application in geochemical mapping.