70 YEARS OF MACHINE LEARNING IN GEOSCIENCE IN REVIEW

A PREPRINT

Jesper Sören Dramsch [email protected]

August 27, 2020

ABSTRACT
This review gives an overview of the development of machine learning in geoscience. A thorough analysis of the co-developments of machine learning applications throughout the last 70 years relates the recent enthusiasm for machine learning to developments in geoscience. I explore the shift of kriging towards a mainstream machine learning method and the historic application of neural networks in geoscience, following the general trend of machine learning enthusiasm through the decades. Furthermore, this chapter explores the shift from mathematical fundamentals and knowledge in software development towards skills in model validation, applied statistics, and integrated subject matter expertise. The review is interspersed with code examples to complement the theoretical foundations and illustrate model validation and machine learning explainability for science. The scope of this review includes various shallow machine learning methods, e.g. Decision Trees, Random Forests, Support-Vector Machines, and Gaussian Processes, as well as deep neural networks, including feed-forward neural networks, convolutional neural networks, recurrent neural networks, and generative adversarial networks. Regarding geoscience, the review has a bias towards geophysics but aims to strike a balance with geochemistry, geostatistics, and geology; it excludes remote sensing, as this would exceed the scope. In general, I aim to provide context for the recent enthusiasm surrounding deep learning with respect to research, hardware, and software developments that enable successful application of shallow and deep machine learning in all disciplines of Earth science.

Keywords Review · Machine Learning · Deep Learning · Neural Networks · Kriging · Earth Science · Geoscience · Geology · Geophysics

The author of this manuscript has a background in geophysics, exploration geoscience, and active source 4D seismic. While this skews the expertise, they attempt to give a full overview over developments in all of geoscience with the minimum amount of bias possible.

In recent years machine learning has become an increasingly important interdisciplinary tool that has advanced several fields of science, such as biology [Ching et al., 2018], chemistry [Schütt et al., 2017], medicine [Shen et al., 2017] and pharmacology [Kadurin et al., 2017]. Specifically, the method of deep neural networks has found wide application. While geoscience was slower in the adoption, bibliometrics show the adoption of deep learning in all aspects of geoscience. Most subdisciplines of geoscience have been treated to a review of machine learning. Remote sensing has been an early adopter [Lary et al., 2016], with geomorphology [Valentine and Kalnins, 2016], solid Earth geoscience [Bergen et al., 2019], hydrogeophysics [Shen, 2018], seismology [Kong et al., 2019], seismic interpretation [Wang et al., 2018] and geochemistry [Zuo et al., 2019] following suit. Climate change, in particular, has received a thorough treatment of the potential impact of varying machine learning methods for modelling, engineering and mitigation to address the problem [Rolnick et al., 2019]. This review addresses the development of applied statistics and machine learning in the wider discipline of geoscience in the past 70 years and aims to provide context for the recent increase in interest and successes in machine learning and its challenges.

Machine learning (ML) is deeply rooted in applied statistics, building computational models that use inference and pattern recognition instead of explicit sets of rules. Machine learning is generally regarded as a sub-field of artificial intelligence (AI), with the notion of AI first being introduced by Turing [1950].
Figure 1: Machine Learning timeline from [Dramsch, 2019]. Neural Networks: [Russell and Norvig, 2010]; Kriging: [Krige, 1951]; Decision Trees: [Belson, 1959]; Nearest Neighbours: [Cover and Hart, 1967]; Automatic Differentiation: [Linnainmaa, 1970]; Convolutional Neural Networks: [Fukushima, 1980, LeCun et al., 2015]; Recurrent Neural Networks: [Hopfield, 1982]; Backpropagation: [Kelley, 1960, Bryson, 1961, Dreyfus, 1962, Rumelhart et al., 1988]; Reinforcement Learning: [Watkins, 1989]; Support Vector Machines: [Cortes and Vapnik, 1995]; Random Forests: [Ho, 1995]; LSTM: [Hochreiter and Schmidhuber, 1997]; Torch Library: [Collobert et al., 2002]; ImageNet: [Deng et al., 2009]; Scikit-Learn: [Pedregosa et al., 2011]; LibSVM: [Chang and Lin, 2011]; Generative Adversarial Networks: [Goodfellow et al., 2014]; Tensorflow: [Abadi et al., 2015]; XGBoost: [Chen and Guestrin, 2016]

Samuel [1959] coined the term machine learning itself, with Mitchell et al. [1997] providing a commonly quoted definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. Mitchell et al. [1997]

This means that a machine learning model is defined by a combination of requirements. A task such as classification, regression, or clustering is improved by conditioning of the model on a training data set. The performance of the model is measured with regard to a loss, also called metric, which quantifies the performance of a machine learning model on the provided data. In regression, this would be measuring the misfit of the data from the expected values. Commonly, the model improves with exposure to additional samples of data. Eventually, a good model generalizes to unseen data, which was not part of the training set, on the same task the model was trained to perform.

Accordingly, many mathematical and statistical methods and concepts, including Bayes' rule [Bayes, 1763], least-squares [Legendre, 1805], and Markov models [Markov, 1906, 1971], are applied in machine learning. Gaussian processes stand out as they originate in time series applications [Kolmogorov, 1939] and geostatistics [Krige, 1951], which roots this machine learning application in geoscience [Rasmussen, 2003]. "Kriging" originally applied two-dimensional Gaussian processes to the prediction of gold mine valuation and has since found wide application in geostatistics. Generally, Matheron [1963] is credited with formalizing the mathematics of kriging and developing it further in the following decades.

Between 1950 and 2020 much has changed. Computational resources are now widely available both as hardware and software, with high-performance compute being affordable to anyone through cloud computing vendors. High-quality software for machine learning is widely available through the free and open-source software movement, with major companies (Google, Facebook, Microsoft) competing for the usage of their open-source machine learning frameworks (Tensorflow, PyTorch, and CNTK, the latter deprecated in 2019) and independent developments reaching wide application, such as scikit-learn [Pedregosa et al., 2011] and xgboost [Chen and Guestrin, 2016].

Nevertheless, investigations of machine learning in geoscience are not a novel development. The research into machine learning follows interest in artificial intelligence closely.
Since its inception, artificial intelligence has experienced two periods of a decline in interest and trust, which has impacted negatively upon its funding. Developments in geoscience follow this widespread cycle of enthusiasm and loss of interest with a time lag of a few years. This may be the result of a variety of factors, including research funding availability and a change in willingness to publish results.
The 1950s and 1960s were decades of machine learning optimism, with machines learning to play simple games and perform tasks like route mapping. Intuitive methods like k-means, Markov models, and decision trees were used as early as the 1960s in geoscience. K-means was used to describe the cyclicity of sediment deposits [Preston and Henderson, 1964]. Krumbein and Dacey [1969] give a thorough treatment of the mathematical foundations of Markov chains and embedded Markov chains in a geological context through application to sedimentological processes, which also provides a comprehensive bibliography of Markov processes in geology. Some selected examples of early applications of Markov chains are found in sedimentology [Schwarzacher, 1972], well log analysis [Agterberg, 1966], hydrology [Matalas, 1967], and volcanology [Wickman, 1968]. Decision tree-based methods found early applications in economic geology and prospectivity mapping [Newendorp, 1976, Reddy and Bonham-Carter, 1991].

The 1970s were left with few developments in both the methods of machine learning and the applications and adoption in geoscience (cf. Figure 1), due to the "first AI winter" after initial expectations were not met. Nevertheless, as kriging was not considered an AI technology, it was unaffected by this cultural shift and found applications in mining [Huijbregts and Matheron, 1970], oceanography [Chiles and Chauvet, 1975], and hydrology [Delhomme, 1978]. This was in part due to superior results over other interpolation techniques, but also the provision of uncertainty measures.
The 1980s marked an uptake of interest in machine learning and artificial intelligence through so-called "expert systems" and corresponding specialized hardware. While neural networks were introduced in the 1950s, the tools of automatic differentiation and backpropagation for error-correcting machine learning were necessary to spark their adoption in geophysics in the late 1980s. Zhao and Mendel [1988] performed seismic deconvolution with a recurrent neural network (Hopfield network). Dowla et al. [1990] discriminated between natural earthquakes and underground nuclear explosions using feed-forward neural networks. An ensemble of networks was able to achieve 97 % accuracy for nuclear monitoring. Moreover, the researchers inspected the network to gain the insight that the ratio of particular input spectra was beneficial to the network's discrimination of seismological events. However, in practice the neural networks underperformed on uncurated data, which is often the case in comparison to published results. Huang et al. [1990] presented work on self-organizing maps (also Kohonen networks), a special type of unsupervised neural network, applied to picking seismic horizons. The field of geostatistics saw a formalization of theory and an uptake of interest, with Matheron et al. [1981] formalizing the relationship of spline-interpolation and kriging and Dubrule [1984] further developing the theory and applying it to well data. At this point, kriging was well-established in the mining industry as well as other disciplines that rely on spatial data, including the successful analysis and construction of the Channel Tunnel [Chilès and Desassis, 2018]. The late 1980s then marked the second AI winter, where expensive machines tuned to run "expert systems" were outperformed by desktop hardware from non-specialist vendors, causing the collapse of a half-billion-dollar hardware industry. Moreover, government agencies cut funding in AI specifically.

The 1990s are generally regarded as the shift from a knowledge-driven to a data-driven approach in machine learning. The term AI, and especially expert systems, was almost exclusively used in computer gaming and regarded with cynicism and as a failure in the scientific world. In the background, however, with research into applied statistics and machine learning, this decade marked the inception of Support-Vector Machines (SVM) [Cortes and Vapnik, 1995], the tree-based method Random Forests (RF) [Ho, 1995], and a specific type of recurrent neural network (RNN), Long Short-Term Memories (LSTM) [Hochreiter and Schmidhuber, 1997]. SVMs were utilized for land usage classification in remote sensing early on [Hermes et al., 1999]. Geophysics applied SVMs a few years later to approximate the Zoeppritz equations for AVO inversion, outperforming linearized inversion [Kuzma, 2003]. Random Forests, however, were delayed in broader adoption, due to the term "random forests" only being coined in 2001 [Breiman, 2001], the statistical basis initially being less rigorous, and the implementation being more complicated. LSTMs necessitate large amounts of data for training and can be expensive to train; after further development in 2011 [Ciresan et al., 2011] they gained popularity in commercial time series applications, particularly speech and audio analysis.
McCormack [1991] marks the first review of the emerging tool of neural networks in geophysics. The paper goes into the mathematical details and explores pattern recognition. The author summarizes neural network applications over the 30 years prior to the review and presents worked examples in automated well-log analysis and seismic trace editing. The review comes to the conclusion that neural networks are, in fact, good function approximators, taking over tasks that were previously reserved for human work. He criticizes slow training, the cost of retraining networks upon new knowledge, imprecision of outputs, non-optimal training results, and the black box property of neural networks.
Figure 2: Single layer neural network as described in equation 1. Two inputs x_i are multiplied by the weights w_ij and summed with the biases b_j. Subsequently an activation function σ is applied to obtain the outputs o_j.

The main conclusion sees the implementation of neural networks in conventional computation and expert systems to leverage the pattern recognition of networks with the advantages of conventional computer systems.

Neural networks are the primary subject of the modern day machine learning interest; however, significant developments leading up to these successes were made prior to the 1990s. The first neural network machine was constructed by Minsky [described in Russell and Norvig [2010]] and soon followed by the "Perceptron", a binary decision boundary learner [Rosenblatt, 1958]. This decision was calculated as follows:

$$o_j = \sigma\Big(\sum_i w_{ij} x_i + b_j\Big) = \sigma(a_j) = \begin{cases} 1 & a_j > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

It describes a linear system with the output o, the linear activation a of the input data x, the index of the source i and target node j, the trainable weights w, the trainable bias b, and a binary activation function σ. The activation function σ in particular has received ample attention since its inception. During this period, a binary σ became uncommon and was replaced by non-linear mathematical functions. Neural networks are commonly trained by gradient descent, therefore differentiable functions like sigmoid or tanh took its place, allowing the activation o of each neuron in a neural network to be continuous.

Deep learning [Dechter, 1986] expands on this concept. It is the combination of multiple layers of neurons in a neural network. These deep networks learn representations with multiple levels of abstraction and can be expressed using equation 1 as input neurons to the next layer:

$$o_k = \sigma\Big(\sum_j w_{jk} \cdot o_j + b_k\Big) = \sigma\Big(\sum_j w_{jk} \cdot \sigma\Big(\sum_i w_{ij} x_i + b_j\Big) + b_k\Big) \qquad (2)$$

Röth and Tarantola [1994] apply these building blocks of multi-layered neural networks with sigmoid activation to perform seismic inversion. They successfully invert low-noise and noise-free data on small training data. The authors note that the approach is susceptible to errors at low signal-to-noise ratios and coherent noise sources. Further applications include electromagnetic subsurface localization [Poulton et al., 1992], magnetotelluric inversion via Hopfield neural networks [Zhang and Paulson, 1997], and geomechanical modelling of microfractures in triaxial compression tests [Feng and Seto, 1998].
Figure 3: Deep multi-layer neural network as described in equation 2.

Figure 4: Sigmoid activation function (red) and derivative (blue) to train the multi-layer neural network described in equation 2.
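To make equations 1 and 2 concrete, the following is a minimal NumPy sketch of the forward pass through such a network with a sigmoid activation; the two-input, three-neuron layer sizes and random weights are purely illustrative and not part of the original examples.

import numpy as np

def sigmoid(a):
    # Differentiable activation function, cf. Figure 4
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=2)                         # two inputs x_i (cf. Figure 2)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # weights w_ij and biases b_j
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)  # weights w_jk and biases b_k

o_j = sigmoid(x @ W1 + b1)    # equation 1, with sigmoid instead of a binary step
o_k = sigmoid(o_j @ W2 + b2)  # equation 2: the first layer feeds the next
print(o_k)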
Cressie [1990] reviews the history of kriging, prompted by the uptake of interest in geostatistics. The author defines kriging as Best Linear Unbiased Prediction and reviews the historical co-development of disciplines. Similar concepts were developed in mining, meteorology, physics, plant and animal breeding, and geodesy, which all relied on optimal spatial prediction. Later, Williams [1998] provides a thorough treatment of Gaussian Processes in the light of recent successes of neural networks.
Figure 5: Gaussian Process separating two classes with different kernels. This image presents a 2D slice out of a 3D decision space. The decision boundary learnt from the data is visible, as well as the prediction in every location of the 2D slice. The two kernels presented are a linear kernel and a radial basis function (RBF) kernel, which show a significant discrepancy in performance. The bottom right number shows the accuracy on unseen test data. The linear kernel achieves 71 % accuracy, while the RBF kernel achieves 90 %.

An alternative method of putting a prior over functions is to use a Gaussian process (GP) prior over functions. This idea has been used for a long time in the spatial statistics community under the name of "kriging", although it seems to have been largely ignored as a general-purpose regression method. Williams [1998]

Overall, Gaussian Processes benefit from the fact that a Gaussian distribution will stay Gaussian under conditioning. That means that we can use Gaussian distributions in this machine learning process and they will produce a smooth Gaussian result after conditioning on the training data. To become a universal machine learning model, Gaussian Processes have to be able to describe infinitely many dimensions. Instead of storing infinite values to describe this random process, Gaussian Processes go the path of describing a distribution over functions that can produce each value when required:

$$p(x) \sim \mathcal{GP}\big(\mu(x), k(x, x')\big) \qquad (3)$$

The multivariate distribution over functions p(x) described by the Gaussian Process depends on a mean function µ(x) and a covariance function k(x, x'). It follows that choosing an appropriate mean and covariance function, also known as kernel, is essential. Very commonly, the mean function is chosen to be zero, as this simplifies some of the math. Therefore, data with a non-zero mean is commonly centered to comply with this assumption [Görtler et al., 2019]. Choosing an appropriate kernel for the machine learning task is one of the benefits of the Gaussian Process. The kernel is where expert knowledge can be incorporated into the model, e.g. seasonality in meteorological data can be described by a periodic covariance function.

Figure 5 presents a 2D slice of 3D data with two classes. This binary problem can be approached by applying a Gaussian Process to it. In the second panel, a linear kernel is shown, which predicts the data relatively poorly with an accuracy of 71 %. A radial basis function (RBF) kernel, shown in the third panel, generalizes to unseen test data with an accuracy of 90 %.
This figure shows how a trained Gaussian Process would predict any new data point presented to the model. The linear kernel would predict any data in the top part to be blue (Class 0) and any data in the bottom part to be red (Class 1). The RBF kernel, which we explore further in the section introducing support-vector machines, separates the prediction into four uneven quadrants. The choice of kernel is very important in Gaussian Processes, and research into extracting specific kernels is ongoing [Duvenaud, 2014].

In a more practical sense, Gaussian processes are computationally expensive, as an n × n matrix must be inverted, with n being the number of samples. This results in a space complexity of O(n²) and a time complexity of O(n³) [Williams and Rasmussen, 2006]. This makes Gaussian Processes most feasible for smaller data problems, which is one explanation for their rapid uptake in geoscience. An approximate computation of the inverted matrix is possible using the Conjugate Gradient (CG) optimization method, which can be stopped early, with a maximum time cost of O(n³) [Williams and Rasmussen, 2006]. For problems with larger data sets, neural networks become feasible due to being computationally cheaper than Gaussian Processes, regularization on large data sets being viable, as well as their flexibility to model a wide variety of functions and objectives. Regularization is essential, as neural networks otherwise tend to "overfit" and simply memorize the training data, instead of learning a generalizable relationship in the data. Interestingly, Hornik et al. [1989] showed that neural networks are a universal function approximator as the number of weights tends to infinity, and Neal [1996] was able to show that the infinitely wide stochastic neural network converges to a Gaussian Process. Oftentimes Gaussian Processes are trained on a subset of a large data set to avoid the computational cost. Gaussian Processes have seen successful application on a wide variety of problems and domains that benefit from expert knowledge.
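As an illustration of the kernel choice shown in Figure 5, the following sketch trains Gaussian Process classifiers with a linear (dot-product) and an RBF covariance function in scikit-learn. The two-moons toy data set stands in for the data of the figure, so the accuracies will differ from the 71 % and 90 % reported there.

from sklearn.datasets import make_moons
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import DotProduct, RBF
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in [DotProduct(), RBF()]:
    # The kernel encodes the prior over functions, cf. equation 3
    gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
    gpc.fit(X_train, y_train)
    print(kernel, gpc.score(X_test, y_test))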
The 2000s were opened with a review by van der Baan and Jutten [2000] recapitulating the most recent geophysical applications of neural networks. They went into much detail on neural network theory and the difficulties in building and training these models. The authors identify the following subsurface geoscience applications through history: first-break picking, electromagnetics, magnetotellurics, seismic inversion, shear-wave splitting, well log analysis, trace editing, seismic deconvolution, and event classification. They reveal a strong focus on exploration geophysics. The authors evaluated the application of neural networks as subpar to physics-based approaches and concluded that neural networks are too expensive and complex to be of real value in geoscience. This sentiment is consistent with the broader perception of artificial intelligence during this decade. Artificial intelligence and expert systems over-promised human-like performance, causing a shift in focus on research into specialized sub-fields, e.g. machine learning, fuzzy logic, and cognitive systems.

Mjolsness and DeCoste [2001] review machine learning in a broader context outside of exploration geoscience. The authors discuss recent successes in applications of remote sensing and robotic geology using machine learning models. They review graphical models, (hidden) Markov models, and SVMs and go on to disseminate the limitations of applications to vector data and poor performance when applied to rich data, such as graphs and text data. Moreover, the authors from NASA JPL go into detail on pattern recognition in automated rovers to identify geological prospects on Mars. They state:

The scientific need for geological feature catalogs has led to multiyear human surveys of Mars orbital imagery yielding tens of thousands of cataloged, characterized features including impact craters, faults, and ridges. Mjolsness and DeCoste [2001]

The review points out the profound impact SVMs have on identifying geomorphological features without modelling the underlying processes.
This decade of the 2000s introduced a shift in tooling, which is a direct contributor to the recent increase in adoption and research of both shallow and deep machine learning.

Machine learning software had primarily comprised proprietary software like Matlab™ with the Neural Networks Toolbox and Wolfram Mathematica™, or independent university projects like the Stuttgart Neural Network Simulator (SNNS). These tools were generally closed source, hard or impossible to extend, and could be difficult to operate due to limited accompanying documentation. Early open-source projects include WEKA [Witten et al., 2005], a graphical user interface to build machine learning and data mining projects. Shortly after that, LibSVM was released as free open-source software (FOSS) [Chang and Lin, 2011], which implements support vector machines efficiently. It is still used in many other libraries to this day, including WEKA [Chang and Lin, 2011]. Torch was then released in 2002, a machine learning library with a focus on neural networks. While it has been discontinued in its original implementation in the programming language Lua [Collobert et al., 2002], PyTorch, the reimplementation in the programming language Python, is one of the leading deep learning frameworks at the time of writing [Paszke et al., 2017]. In 2007, the libraries Theano and scikit-learn were released under open licenses in Python [Theano Development Team, 2016, Pedregosa et al., 2011]. Theano is a neural network library developed at the Montreal Institute for Learning Algorithms (MILA) that ceased development in 2017 after strong industrial developers had released openly licensed deep learning frameworks. Scikit-learn implements many different machine learning algorithms, including SVMs, Random Forests, and single-layer neural networks, as well as utility functions including cross-validation, stratification, metrics, and train-test splitting, necessary for robust machine learning model building and evaluation.
The impact of scikit-learn has shaped current machine learning software packages by implementing a unified application programming interface (API) [Buitinck et al., 2013]. This API is explored by example in the following code snippets; the code can be obtained at Dramsch [2020b]. First, we generate a classification dataset using a utility function. The make_classification function takes different arguments to adjust the generated data: we are generating 5000 samples (n_samples) for two classes, with five features (n_features), of which three features are actually relevant to the classification (n_informative). The data is stored in X, whereas the labels are contained in y.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=5,
                           n_informative=3, n_redundant=0,
                           random_state=0, shuffle=False)

It is good practice to divide the available labeled data into a training data set and a validation or test data set. This split ensures that models can be evaluated on unseen data to test the generalization to unseen samples. The utility function train_test_split takes an arbitrary amount of input arrays and separates them according to specified arguments. In this case 25 % of the data are kept for the hold-out validation set and not used in training. The random_state is fixed to make these examples reproducible.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.25,
                                                    random_state=0)
Then we need to define a machine learning model. Considering the previous discussion of high-impact machine learning models, the first example is an SVM classifier. This example uses the default values for the hyperparameters of the SVM classifier; for best results on real-world problems these have to be adjusted. The machine learning training is always done by calling classifier.fit(X, y) on the classifier object, which in this case is the SVM object. In more detail, the .fit() method implements an optimization loop that will condition the model to the training data by minimizing the defined loss function. In the case of the SVM classification the parameters are adjusted to optimize a hinge loss, outlined in equation 5. The trained scikit-learn model contains information about all its hyperparameters in addition to the trained model, shown below. The exact meaning of all these hyperparameters is laid out in the scikit-learn documentation [Buitinck et al., 2013].

from sklearn.svm import SVC

svm = SVC(random_state=0)
svm.fit(X_train, y_train)

>>> SVC(C=1.0, break_ties=False, cache_size=200,
        class_weight=None, coef0=0.0, degree=3,
        decision_function_shape='ovr', gamma='scale',
        kernel='rbf', max_iter=-1, probability=False,
        random_state=0, shrinking=True, tol=0.001,
        verbose=False)
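The hyperparameter adjustment mentioned above is commonly automated by a search over cross-validated candidate values. The following is a minimal sketch using scikit-learn's GridSearchCV; it is not part of the original example, and the parameter grid is purely illustrative.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for the regularization strength C and the RBF width gamma
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)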
The trained SVM can then be used to predict on new data, by calling classifier.predict(data) on the trained classifier object. The new data has to contain five features like the training data did. Generally, machine learning models always need to be trained on the same set of input features as the data available for prediction. The .predict() method outputs the most likely estimate on the new data to generate predictions. In the following code snippet, predictions on three input vectors are performed with the previously trained model.
Figure 6: Example of a Support Vector Machine separating two classes, showing the decision boundary learnt from the data. The data contains three informative features; the decision boundary is therefore three-dimensional, and shown is a central slice of data points in 2D. (A video is available at [Dramsch, 2020a])

print(svm.predict([[0, 0, 0, 0, 0],
                   [-1, -1, -1, -1, -1],
                   [1, 1, 1, 1, 1]]))

>>> [1 0 1]
The blackbox model should be evaluated with the classifier.score() function. Evaluating the performance on the training data set gives an indication how well the model is performing, but this is generally not enough to gauge the performance of machine learning models. In addition, the trained model has to be evaluated on the hold-out set, a dataset the model has not been exposed to during training. This guards against the model only performing well on the training data by "memorization" instead of extracting meaningful generalizable relationships, an effect called overfitting. In this example the hyperparameters are left at the default values; in real-life applications hyperparameters are usually adjusted to build better models. This can lead to an additional meta-level of overfitting on the hold-out set, which necessitates an additional third hold-out set to test the generalizability of the trained model with optimized hyperparameters. The default score uses the class accuracy, which suggests our model is approximately 90 % correct. Similar train and test scores indicate that the model learned a generalizable model, enabling prediction on unseen data without a performance loss. A large difference between the training score and test score indicates overfitting, in the case of a better training score. A higher test score than training score can be an indication of a deeper problem with the data split, scoring, or class imbalances, and needs to be investigated by means of external cross-validation, building standard "dummy" models, independence tests, and further manual investigations.

print(svm.score(X_train, y_train))
print(svm.score(X_test, y_test))

>>> 0.9098666666666667
>>> 0.9032
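The external cross-validation and "dummy" baseline models mentioned above can be sketched as follows; this is an illustrative addition to the example, not part of the original code.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Five-fold cross-validation of the SVM on the training data
print(cross_val_score(SVC(random_state=0), X_train, y_train, cv=5))

# A "dummy" model predicting the most frequent class as a sanity baseline
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))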
Support-vector machines can be employed for each class of machine learning problem, i.e. classification, regression, and clustering. In a two-class problem, the algorithm considers the n-dimensional input and attempts to find an (n − 1)-dimensional hyperplane that separates these input data points. The problem is trivial if the two classes are linearly separable, also called a hard margin. The plane can pass the two classes of data without ambiguity. For data with an overlap, which is usually the case, the problem becomes an optimization problem to fit the ideal hyperplane. The hinge loss provides the ideal loss function for this problem, yielding 0 if none of the data overlap, but a linear residual for overlapping points that can be minimized:

$$\max\big(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\big), \qquad (4)$$

with y_i being the current target label and \vec{w} \cdot \vec{x}_i - b being the hyperplane under consideration. The hyperplane consists of the normal vector w and point x, with the offset b. This leads the algorithm to optimize

$$\left[\frac{1}{n} \sum_{i=1}^{n} \max\big(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b)\big)\right] + \lambda \|\vec{w}\|^2, \qquad (5)$$

with λ being a scaling factor. For small λ the loss becomes the hard margin classifier for linearly separable problems. The nature of the algorithm dictates that only values for \vec{x} close to the hyperplane define the hyperplane itself; these values are called the support vectors.

Figure 7: Samples from two classes that are not linearly separable as input data (left). Applying a Gaussian Radial Basis Function centered in the data results in the two classes being linearly separable (right).

The SVM algorithm would not be as successful if it were simply a linear classifier. Some data can become linearly separable in higher dimensions. This, however, poses the question of how many dimensions should be searched, because of the exponential cost in computation that follows from the increase of dimensionality (also known as the curse of dimensionality). Instead, the "kernel trick" was proposed [Aizerman, 1964], which defines a set of values that are applied to the input data simply via the dot product. A common kernel is the radial basis function (RBF), which is also the kernel we applied in the example. The kernel is defined as:

$$k(\vec{x}_i, \vec{x}_j) = \exp\left(-\gamma \|\vec{x}_i - \vec{x}_j\|^2\right) \qquad (6)$$

This specifically defines the Gaussian Radial Basis Function of every input data point with regard to a central point. This transformation can be performed with other functions (or kernels), such as polynomials or the sigmoid function. The RBF will transform the data according to the distance between x_i and x_j, which can be seen in Figure 7. This results in the decision surface in Figure 6 consisting of various Gaussian areas. The RBF is generally regarded as a good default, in part due to being translation invariant (i.e. stationary) and smoothly varying.
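A small NumPy sketch in the spirit of Figure 7 illustrates the transformation in equation 6; the concentric toy data, the center, and γ = 1 are illustrative choices, not the values of the original figure.

import numpy as np

rng = np.random.default_rng(0)
# Two concentric classes: radii around 1 (class 0) and around 2 (class 1),
# which no straight line can separate in the 2D plane
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1, 0.1, 100), rng.normal(2, 0.1, 100)])
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

gamma, center = 1.0, np.zeros(2)
# Equation 6 evaluated against the central point: the response decays with
# distance, so a single threshold on it now separates the two classes
rbf = np.exp(-gamma * np.sum((X - center) ** 2, axis=1))
print(rbf[:100].mean(), rbf[100:].mean())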
An important topic in machine learning is explainability, which inspects the influence of input variables on the prediction. We can employ the utility function permutation_importance to inspect any model and how it performs with regard to its input features [Breiman, 2001]. The permutation importance evaluates how well the blackbox model performs when a feature is not available. Practically, a feature is replaced with random noise. Subsequently, the score is calculated, which provides a representation of how informative a feature is compared to noise. The data we generated in the first example contains three informative features and two random data columns. The mean values of the calculated importances show that three features are estimated to be three magnitudes more important, with the second feature containing the maximum amount of information to predict the labels.

from sklearn.inspection import permutation_importance

importances = permutation_importance(svm, X_train, y_train,
                                     n_repeats=10, random_state=0)

print(importances.importances_mean)
print(importances.importances_mean.argsort())

>>> [ 2.1787e-01  2.8712e-01  1.2293e-01 -1.8667e-04  7.7333e-04]
>>> [3 4 2 0 1]
Support-vector machines were applied to seismic data analysis [Li and Castagna, 2004] and automatic seismic interpretation [Liu et al., 2015, Di et al., 2017b, Mardan et al., 2017]. Compared to convolutional neural networks, these approaches usually do not perform as well when the CNN can gain information from adjacent samples. Seismological volcanic tremor classification [Masotti et al., 2006, 2008] and analysis of ground-penetrating radar [Pasolli et al., 2009, Xie et al., 2013] were other notable applications of SVMs in geoscience. The 2016 Society of Exploration Geophysicists (SEG) machine learning challenge was held using an SVM baseline [Hall, 2016]. Several other authors investigated well log analysis [Anifowose et al., 2017, Caté et al., 2018, Gupta et al., 2018, Saporetti et al., 2018], as well as seismology for event classification [Malfante et al., 2018] and magnitude determination [Ochoa et al., 2018]. These rely on SVMs being capable of regression on time-series data. Generally, many applications in geoscience have been enabled by the strong mathematical foundation of SVMs, such as microseismic event classification [Zhao and Gross, 2017], seismic well ties [Chaki et al., 2018], landslide susceptibility [Marjanović et al., 2011, Ballabio and Sterlacchini, 2012], digital rock models [Ma et al., 2012], and lithology mapping [Cracknell and Reading, 2013].
The following example shows the application of Random Forests to illustrate the similarity of the API for different machine learning algorithms in the scikit-learn library. The Random Forest classifier is instantiated with a maximum depth of seven, and the random state is fixed to zero again. Limiting the depth of the trees forces the random forest to conform to a simpler model. Random forests have the capability to become highly complex models that are very powerful predictive models. This is not conducive to this small example dataset, but easy to modify for the inclined reader. The classifier is then trained using the same API as all classifiers in scikit-learn. The example shows a very high number of hyperparameters; however, Random Forests work well without further optimization of these.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=7, random_state=0)
rf.fit(X_train, y_train)

>>> RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
        class_weight=None, criterion='gini', max_depth=7,
        max_features='auto', max_leaf_nodes=None,
        max_samples=None, min_impurity_decrease=0.0,
        min_impurity_split=None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        n_estimators=100, n_jobs=None, oob_score=False,
        random_state=0, verbose=0, warm_start=False)
The prediction of the random forest is performed with the same API call again, also consistent with all classifiers available. The predicted values match the prediction of the SVM in this instance.

print(rf.predict([[0, 0, 0, 0, 0],
                  [-1, -1, -1, -1, -1],
                  [1, 1, 1, 1, 1]]))

>>> [1 0 1]
The training score of the random forest model is 2.5 % better than that of the SVM in this instance; this score, however, is not informative on its own. Comparing the test scores shows only a 0.88 % difference, which is the relevant value to evaluate, as it shows the performance of a model on data it has not seen during the training stage. The random forest performed slightly better on the training set than on the test data set. This slight discrepancy is usually not an indicator of an overfit model. Overfit models "memorize" the training data and do not generalize well, which results in poor performance on unseen data. Generally, overfitting is to be avoided in real applications, but it can be seen in competitions, on benchmarks, and in show-cases of new algorithms and architectures to oversell the improvement over state-of-the-art methods [Recht et al., 2019].

print(rf.score(X_train, y_train))
print(rf.score(X_test, y_test))

>>> 0.9306
>>> 0.912
Random forests have specialized methods available for introspection, which can be used to calculate feature importance. These are based on the decision process the random forest used to build the machine learning model. The feature importance in Random Forests serves the same purpose as the permutation importance, estimating how much each feature contributes to the model's performance. Random Forests use a measure called Gini impurity to determine the split between classes at each node of the trees. While the permutation importance uses the accuracy score of the prediction with a feature replaced by noise, in Random Forests this Gini impurity can be used to measure how informative a feature is in a model. It is important to note that this impurity-based process can be susceptible to noise and can overestimate the importance of features with a high number of categories. Using the permutation importance instead is a valid choice. In this instance, as opposed to the permutation importance, the random forest estimates the two non-informative features to be one magnitude less useful than the informative features, instead of two magnitudes.

print(rf.feature_importances_)
print(rf.feature_importances_.argsort())

>>> [0.2324 0.4877 0.2527 0.0141 0.0129]
>>> [4 3 0 2 1]
Random forests and other tree-based methods, including gradient boosting, a related tree-based ensemble method, have generally found wider application with the implementation in scikit-learn and packages for the statistical languages R and SPSS. Similar to neural networks, this method was applied to automatic seismic interpretation [Guillen et al., 2015] with limited success, which is due to the independent treatment of samples, like SVMs. Random forests have the ability to approximate regression problems and time series, which made them suitable for seismological applications including localization [Dodge and Harris, 2016], event classification in volcanic tremors [Maggi et al., 2017], and slow slip analysis [Hulbert et al., 2018]. They have also been applied to geomechanical applications in fracture modelling [Valera et al., 2017] and fault failure prediction [Rouet-Leduc et al., 2017, 2018], as well as detection of reservoir property changes from 4D seismic data [Cao and Roy, 2017]. Gradient Boosted Trees were the winning models in the 2016 SEG machine learning challenge [Hall and Hall, 2017] for well-log analysis, propelling a variety of publications in facies prediction [Bestagini et al., 2017, Blouin et al., 2017, Caté et al., 2018, Saporetti et al., 2018].

Furthermore, various methods that have been implemented in scikit-learn have been applied to a multitude of geoscience problems. Hidden Markov models were used for seismological event classification [Ohrnberger, 2001, Beyreuther and Wassermann, 2008, Bicego et al., 2013], well-log classification [Jeong et al., 2014, Wang et al., 2017a], and landslide detection from seismic monitoring [Dammeier et al., 2016]. These hidden Markov models are highly performant on time series and spatially coherent problems. The "hidden" part of Markov models enables the model to assume influences on the predictions that are not directly represented in the input data.
Figure 8: Binary decision boundary for the Random Forest in 2D. This is the same central slice of the 3D decision volume used in Figure 6.

The k-nearest neighbours method has been used for well-log analysis [Caté et al., 2017, Saporetti et al., 2018], seismic well ties [Wang et al., 2017b] combined with dynamic time warping, and fault extraction in seismic interpretation [Hale, 2013], which is highly dependent on choosing the right hyperparameter k. The unsupervised k-NN equivalent, k-means, has been applied to seismic interpretation [Di et al., 2017a], ground motion model validation [Khoshnevis and Taborda, 2018], and seismic velocity picking [Wei et al., 2018]. These are very simple machine learning models that are useful as baseline models. Graphical modelling in the form of Bayesian networks has been applied to seismology in modelling earthquake parameters [Kuehn et al., 2011], basin modelling [Martinelli et al., 2013], seismic interpretation [Ferreira et al., 2018], and flow modelling in discrete fracture networks [Karra et al., 2018]. These graphical models are effective in causal modelling and gained popularity in modern applications of machine learning explainability, interpretability, and generalization in combination with do-calculus [Pearl, 2012].
The 2010s marked a renaissance of deep learning and particularly convolutional neural networks. The convolutional neural network (CNN) architecture AlexNet [Krizhevsky et al., 2012] was the first CNN to win the ImageNet challenge [Deng et al., 2009]. The ImageNet challenge is considered a benchmark competition and database of natural images established in the field of computer vision. AlexNet improved the classification error rate from 25.8 % to 16.4 % (top-5 accuracy). This has propelled research in CNNs, resulting in error rates on ImageNet of 2.25 % on top-5 accuracy in 2017 [Russakovsky et al., 2015]. The Tensorflow library [Abadi et al., 2015] was introduced for open-source deep learning models, with a somewhat different software design compared to the Theano and Torch libraries.

The following example shows an application of deep learning to the data presented in the previous examples. The classification data set we use has independent samples, which leads to the use of simple densely connected feed-forward networks. Image data or spatially correlated datasets would ideally be fed to a convolutional neural network (CNN), whereas time series are often best approached with recurrent neural networks (RNN). This example is written using the Tensorflow library. PyTorch would be an equally good library to use.

All modern deep learning libraries take a modular approach to building deep neural networks that abstracts operations into layers. These layers can be combined into input and output configurations in highly versatile and customizable ways. The simplest architecture, which is the one we implement below, is a sequential model, which consists of one input and one output layer, with a "stack" of layers in between. It is possible to define more complex models with multiple inputs and outputs, as well as the branching of layers, to build very sophisticated neural network pipelines. These are supported by the functional API and the subclassing API, but would not be conducive to this example.
Figure 9: ReLU activation (red) and derivative (blue) for efficient gradient computation.

The example model consists of Dense layers and a Dropout layer, which are arranged in sequence. Densely connected layers contain a specified number of neurons with an appropriate activation function, shown in the example below. Each neuron performs the calculation outlined in equation 1, with σ defining the activation. Modern neural networks rarely implement sigmoid and tanh activations anymore. Their activation characteristic leads them to lose information for large positive and negative values of the input, commonly called saturation [Hochreiter et al., 2001]. This saturation of neurons prevented good deep neural network performance until new non-linear activation functions took their place [Xu et al., 2015]. The activation function Rectified Linear Unit (ReLU) is generally credited with facilitating the development of very deep neural networks, due to its non-saturating properties [Hahnloser et al., 2000]. It sets all negative values to zero and provides a linear response for positive values, as seen in equation 7. Since its inception, many more rectifiers with different properties have been introduced.

$$\sigma(a) = \max(0, a) \qquad (7)$$

The other activation function used in the example is the "softmax" function on the output layer. This activation is commonly used for classification tasks, as it normalizes all activations at all outputs to one. It achieves this by applying the exponential function to each of the outputs in \vec{a} for C classes and dividing that value by the sum of all exponentials:

$$\sigma(\vec{a})_j = \frac{e^{a_j}}{\sum_{p=1}^{C} e^{a_p}} \qquad (8)$$

The example additionally uses a Dropout layer, which is a common layer used for regularization of the network by randomly setting a specified percentage of nodes to zero for each iteration. Neural networks are particularly prone to overfitting, which is counteracted by various regularization techniques that also include input-data augmentation, noise injection, L1 and L2 constraints, or early stopping of the training loop [Goodfellow et al., 2016]. Modern deep learning systems may even leverage noisy student-teacher networks for regularization [Xie et al., 2019b].

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(.3),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')])

Figure 10: Three-layer convolutional network. The input image (yellow) is convolved with several filters or kernel matrices (purple). Commonly, the convolution is used to downsample an image in the spatial dimension, while expanding the dimension of the filter response, hence expanding in "thickness" in the schematic. The filters are learned in the machine learning optimization loop. The shared weights within a filter improve efficiency of the network over classic dense networks.
These sequential models are also used for simple image classification using CNNs. Instead of Dense layers, these are built up with convolutional layers, which are readily available in 1D, 2D, and 3D as Conv1D, Conv2D, and Conv3D respectively. A two-dimensional CNN learns a so-called filter f for the n × m-dimensional image G, expressed as:

$$G^*(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(i, j) \cdot G(x - i + c, y - j + c), \qquad (9)$$

resulting in the central result G* around the central coordinate c. In CNNs each layer learns several of these filters f, usually followed by a down-sampling operation in n and m to compress the spatial information. This serves as a forcing function to learn increasingly abstract representations in subsequent convolutional layers.
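A minimal sketch of such a convolutional architecture is shown below, assuming hypothetical 28 × 28 single-channel images with ten classes; the layer sizes are illustrative and not part of the original example.

import tensorflow as tf

cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu',
                           input_shape=(28, 28, 1)),  # 16 learned filters f
    tf.keras.layers.MaxPooling2D(),  # down-sampling in the spatial dimensions
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),       # hand over to densely connected layers
    tf.keras.layers.Dense(10, activation='softmax')])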
The sequential example model of densely connected layers above, with 32, 16, and two neurons, contains a total of 754 trainable weights for the five input features. Initially, each of these weights is set to a pseudo-random value, which is often drawn from a distribution beneficial to fast training. Consequently, the data is passed through the network, and the result is numerically compared to the expected values. This form of training is defined as supervised training and error-correcting learning, which is a form of Hebbian learning. Other forms of learning exist and are employed in machine learning, e.g. competitive learning in self-organizing maps.

$$MAE = |y_j - o_j| \qquad (10)$$

$$MSE = (y_j - o_j)^2 \qquad (11)$$

In regression problems the error is often calculated using the Mean Absolute Error (MAE) or Mean Squared Error (MSE), the L1 norm shown in equation 10 and the L2 norm shown in equation 11 respectively. Classification problems form a special type of problem that can leverage a different kind of loss called cross-entropy (CE). The cross-entropy depends on the true label y and the prediction in the output layer:

$$CE = -\sum_{j}^{C} y_j \log(o_j) \qquad (12)$$

Many machine learning data sets have one true label y_true = 1 for the class C_j = true, leaving all other y_j = 0. This makes the sum over all labels obsolete. It is debatable how well binary labels reflect reality, but it simplifies equation 12 to minimizing the (negative) logarithm of the neural network output o_j, also known as the negative log-likelihood:

$$CE = -\log(o_j) \qquad (13)$$

Technically, the data we generated is a binary classification problem, which means we could use the sigmoid activation function in the last layer and optimize a binary CE. This can speed up computation, but in this example an approach is shown that works for many other problems and can therefore be applied to the reader's data.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Large neural networks can be extremely costly to train, with significant developments in 2019/2020 reporting multi-billion parameter language models (Google, OpenAI) trained on massive hardware infrastructure for weeks, with a single epoch taking several hours. This calls for validation on unseen data after every epoch of the training run. Therefore, neural networks, like all machine learning models, are commonly trained with two hold-out sets, a validation and a final test set. The validation set can be provided or be defined as a percentage of the training data, as shown below. In the example, 10 % of the training data are held out for validation after every epoch, reducing the training data set from 3750 to 3375 individual samples.

model.fit(X_train, y_train, validation_split=.1, epochs=100)

>>> [...]
Epoch 100/100
3375/3375 [==============================] - 0s 66us/sample
loss: 0.1567 - accuracy: 0.9401 - val_loss: 0.1731 - val_accuracy: 0.9359
Neural networks are trained with variations of stochastic gradient descent (SGD), an incremental version of the classic steepest descent algorithm. We use the Adam optimizer, a variation of SGD that converges fast; a full explanation would go beyond the scope of this chapter. The gist of the Adam optimizer is that it maintains a per-parameter learning rate based on the first statistical moment (mean), which is beneficial for sparse problems, and the second moment (uncentered variance), which is beneficial for noisy and non-stationary problems [Kingma and Ba, 2014]. The main alternative to Adam is SGD with Nesterov momentum [Sutskever et al., 2013], an optimization method that models conjugate gradient methods (CG) without the heavy computation that comes with the search in CG. SGD anecdotally finds a better optimal point for neural networks than Adam, but converges much slower.

In addition to the loss value, we display the accuracy metric. While accuracy should not be the sole arbiter of model performance, it gives a reasonable initial estimate of how many samples are predicted correctly, as a fraction between zero and one. As opposed to scikit-learn, deep learning models are compiled after their definition to make them fit for optimization on the available hardware. Then the neural network can be fit like the SVM and Random Forest models before, using the X_train and y_train data. In addition, a number of epochs can be provided, as well as other parameters that are left at default values for the example. The number of epochs defines how many cycles of optimization on the full training data set are performed. Conventional wisdom for neural network training is that it should always learn for more epochs than machine learning researchers estimate initially.
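Training longer, however, risks overfitting. A common remedy is early stopping, already mentioned among the regularization techniques above; the following sketch is a hypothetical addition to the example using the Keras EarlyStopping callback, which halts training once the validation loss stops improving.

# Stop training when the validation loss has not improved for ten epochs
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=10,
                                                  restore_best_weights=True)
model.fit(X_train, y_train, validation_split=.1, epochs=100,
          callbacks=[early_stopping])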
Figure 11: Loss and accuracy of the example neural network for ten random initializations, training for 100 epochs, with the shaded area showing the 95 % confidence intervals of the loss and metric. Analyzing loss curves is important to evaluate overfitting. The training loss decreasing while the validation loss is close to plateauing is a sign of overfitting. Generally, it can be seen that the model converged and is only making marginal gains, with the risk of overfitting.

It can be difficult to fix all sources of randomness and stochasticity in neural networks to make both research and examples reproducible. This example does not fix these so-called random seeds, as it would detract from the example. That implies that the results for loss and accuracy will differ from the printed examples. In research, fixing the seed is very important to ensure reproducibility of claims. Moreover, to avoid bad practices or so-called "lucky seeds", a statistical analysis of multiple fixed seeds is good practice when reporting results of any machine learning model.

model.evaluate(X_test, y_test)

>>> 1250/1250 [==============================] - 0s 93us/sample
loss: 0.1998 - accuracy: 0.9360
[0.19976349686831235, 0.936]
In the examples before, the SVM and Random Forest classifiers were scored on unseen data. This is equally important for neural networks. Neural networks are prone to overfit, which we try to circumvent by regularizing the weights and by evaluating the final network on an unseen test set. The prediction on the test set is very close to the last epoch in the training loop, which is a good indicator that this neural network generalizes to unseen data. Moreover, the loss curves in Figure 11 converge, but not too fast. However, it appears that the network would overfit if we let training continue. The exemplary decision boundary in Figure 12 very closely models the local distribution of the data, which is true for the entire decision volume [Dramsch, 2020a].

These examples illustrate the open-source revolution in machine learning software. The consolidated API and utility functions make it seem trivial to apply various machine learning algorithms to scientific data. This can be seen in the recent explosion of publications of applied machine learning in geoscience. The need to be able to implement algorithms has been replaced by merely installing a package and calling model.fit(X, y). These developments call for strong validation requirements of models to ensure valid, reproducible, and scientific results. Without this careful validation these modern-day tools can be severely misused to oversell results and even come to incorrect conclusions.

In aggregate, modern-day neural networks benefit from the development of non-saturating non-linear activation functions. The advancement of stochastic gradient descent with Nesterov momentum and the Adam optimizer (following AdaGrad and RMSProp) was essential for faster training of deep neural networks. The leverage of graphics hardware available in most high-end desktop computers, which is specialized for linear algebra computation, further reduced training times. Finally, open-source software that is well-maintained, tested, and documented with a consistent API made both shallow and deep machine learning accessible to non-experts.
Figure 12: Central 2D slice of the decision boundary of the deep neural network trained on data with 3 informative features. The 3D volume is available in [Dramsch, 2020a].

Figure 13: Schematic of a VGG16 network for ImageNet. The input data is convolved and down-sampled repeatedly. The final image classification is performed by flattening the image and feeding it to a classic feed-forward densely connected neural network. The 1000 output nodes for the 1000 ImageNet classes are normalized by a final softmax layer (cf. equation 8). Visualization library [Iqbal, 2018].
In deep learning, the implementation of models is commonly more complicated than understanding the underlying algorithm. Modern deep learning makes use of various recent developments that can be beneficial to the data set it is applied to; without specific implementation details, results are often not reproducible. However, the machine learning community has a firm grounding in openness and sharing, which is seen in both publications and code. New developments are commonly published alongside their open-source code, and frequently with the trained networks on standard benchmark data sets. This facilitates thorough inspection and transferring the new insights to applied tasks such as geoscience. In the following, some relevant neural network architectures and their applications are explored.
Figure 14: Schematic of a ResNet block. The block contains a 1 × 1, 3 × 3, and 1 × 1 convolution with ReLU activation. The output is added to the input and passed through another ReLU activation function.

The first model to discuss is the VGG-16 model, a 16-layer deep convolutional neural network [Simonyan and Zisserman, 2014] represented in figure 13. This network was an attempt at building even deeper networks and uses small 3 × 3 convolutional filters in the network, called f in equation 9. This small filter size was sufficient to build powerful models that abstract the information from layer to deeper layer, are easy to visualize, and generalize well. The trained model on natural images also transfers well to other domains like seismic interpretation [Dramsch and Lüthje, 2018]. Later, the concept of Network-in-Network was introduced, which suggested defined sub-networks or blocks in the larger network structure [Lin et al., 2013]. The ResNet architecture uses this concept of blocks to define residual blocks. These use a shortcut around a convolutional block [He et al., 2016] to achieve neural networks with up to 152 layers that still generalize well. ResNets, and residual blocks in particular, are very popular in modern architectures, including the shortcuts or skip connections they popularized, which address the following problem:

When deeper networks start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. He et al. [2016]

The developments and successes in image classification on benchmark competitions like ImageNet and Pascal-VOC inspired applications in automatic seismic interpretation. These networks are usually single image classifiers using convolutional neural networks (CNNs). The first application of a convolutional neural network to seismic data used a relatively small deep CNN for salt identification [Waldeland and Solberg, 2017]. The open-source software "MaLenoV" implemented a single image classification network, which was the earliest freely available implementation of deep learning for seismic interpretation [Ildstad and Bormann]. Dramsch and Lüthje [2018] applied pre-trained VGG-16 and ResNet50 networks to single image seismic interpretation. Recent successful applications build upon pre-trained, pre-built architectures to implement more sophisticated deep learning systems, e.g. semantic segmentation, which is important in seismic interpretation. This is already a narrow field of application of machine learning, and it can be observed that many early applications focus on sub-sections of seismic interpretation utilizing these pre-built architectures, such as salt detection [Waldeland et al., 2018, Di et al., 2018, Gramstad and Nickel, 2018], fault interpretation [Araya-Polo et al., 2017, Guitton, 2018, Purves et al., 2018], facies classification [Chevitarese et al., 2018, Dramsch and Lüthje, 2018], and horizon picking [Wu and Zhang, 2018]. In comparison, this is, however, already a broader application than prior machine learning approaches for seismic interpretation that utilized very specific seismic attributes as input to self-organizing maps (SOM) for e.g. sweet spot identification [Guo et al., 2017, Zhao et al., 2017, Roden and Chen, 2017].

In geoscience, single image classification, as presented in the ImageNet challenge, is less relevant than other applications like image segmentation and time series classification. The developments and insights resulting from the ImageNet challenge were, however, transferred to network architectures that have relevance in machine learning for science. Fully convolutional networks are a way to better achieve image segmentation. A particularly useful implementation, the U-net, was first introduced in biomedical image segmentation, a discipline notorious for small datasets [Ronneberger et al., 2015]. The U-net architecture shown in Figure 15 utilizes several shortcuts in an encoder-decoder architecture to achieve stable segmentation results. Shortcuts (or skip connections) are a way in neural networks to combine the original information and the processed information, usually through concatenation or addition. In ResNet blocks this concept is taken to an extreme, where every block in the architecture contains a shortcut between the input and output, as seen in Figure 14. These blocks are universally used in many architectures to implement deeper networks, i.e. ResNet-152 with 60 million parameters, with fewer parameters than previous architectures like VGG-16 with 138 million parameters. This essentially enables models that are ten times as deep with less than half the parameters, and significantly better accuracy on image benchmark problems.
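A minimal sketch of such a residual block in Keras, assuming the input tensor already has `filters` channels so that the addition is shape-compatible; the 1 × 1, 3 × 3, 1 × 1 pattern follows the bottleneck design of He et al. [2016]:

from tensorflow.keras import layers

def residual_block(x, filters):
    # Bottleneck block: 1x1, 3x3, and 1x1 convolutions with ReLU activations.
    shortcut = x
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    # The skip connection adds the unmodified input back before the final activation.
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)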
Figure 15: Schematic of U-net architecture. Convolutional layers are followed by a downsampling operation in the encoder. The central bottleneck contains a compressed representation of the input data. The decoder contains upsampling operations followed by convolutions. The last layer is commonly a softmax layer to provide classes. Equally sized layers are connected via shortcut connections.

In 2018 the seismic contractor TGS made a seismic interpretation challenge available on the data science competition platform Kaggle. Successful participants in the competition combined ResNet architectures with the U-net architecture as their base architecture and modified these with state-of-the-art image segmentation techniques [Babakhin et al., 2019]. Moreover, Dramsch and Lüthje [2018] showed that transferring networks trained on large bodies of natural images to seismic data yields good results on small datasets, which was further confirmed in this competition. The learnings from the TGS Salt Identification challenge have been incorporated in production-scale models that perform human-like salt interpretation [Sen et al., 2020]. In broader geoscience, U-nets have been used to model global water storage using GRACE satellite data [Sun et al., 2019], predict landslides [Hajimoradlou et al., 2019], and pick earthquake arrival times [Zhu and Beroza, 2018]. A more classical approach identifies subsea scale worms in hydrothermal vents [Shashidhara et al., 2020], whereas Dramsch et al. [2019] include a U-net in a larger system for unsupervised 3D time-shift extraction from 4D seismic.

This modularity of neural networks can be seen throughout the research and application of deep learning. New insights can be incorporated into existing architectures to enhance their predictive power. This can be in the form of swapping out the activation function σ or including new layers for improvements, e.g. regularization with batch normalization [Ioffe and Szegedy, 2015]. The original U-net architecture is relatively shallow, but was modified to contain a modified ResNet for the Kaggle salt identification challenge [Babakhin et al., 2019]. Overall, these serve as examples of the flexibility of neural networks.
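A minimal sketch of this encoder-decoder pattern with a single skip connection; a realistic U-net stacks several such levels, and all layer sizes here are illustrative assumptions:

from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(64, 64, 1), n_classes=2):
    inputs = layers.Input(input_shape)
    # Encoder: convolution followed by downsampling.
    e = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    p = layers.MaxPooling2D()(e)
    # Bottleneck with the compressed representation.
    b = layers.Conv2D(32, 3, padding="same", activation="relu")(p)
    # Decoder: upsampling followed by convolution.
    u = layers.UpSampling2D()(b)
    u = layers.Concatenate()([u, e])  # skip connection to the equally sized encoder layer
    d = layers.Conv2D(16, 3, padding="same", activation="relu")(u)
    # Softmax output layer to provide class probabilities.
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(d)
    return Model(inputs, outputs)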
Generative adversarial networks (GAN) take the composition of neural networks to another level, where two networks are trained in aggregate to reach a desired result. In GANs, a generator network G and a discriminator network D work against each other in the training loop [Goodfellow et al., 2014]. The generator G is set up to generate samples from an input; these were often natural images in early GANs, but have now progressed to anything from time series [Engel et al., 2019] to high-energy physics simulation [Paganini et al., 2018]. The discriminator network D attempts to distinguish whether a sample is generated by G, i.e. fake, or a real sample from the training data. Mathematically, this defines a minimax game for the value function V of G and D:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \qquad (14)

with x representing the data, z the latent space G draws samples from, and p the respective probability distributions. Training eventually reaches a Nash equilibrium [Nash, 1951], where neither the generator network G can produce better outputs, nor the discriminator network D can improve its capability to discern between fake and real samples. Despite how versatile U-nets are, they still need an appropriately defined loss function and labels to build a discriminative model. GANs, however, build a generative model that approximates the training sample distribution in the Generator and a discriminative model in the Discriminator, modeled dynamically through adversarial training. The Discriminator effectively provides an adversarial loss in a GAN. In addition to providing two models that serve different purposes, learning the training sample distribution with an adversarial loss makes GANs one of the most versatile model classes currently available. Mosser et al. [2017] applied GANs early on to geoscience, modeling 3D porous media at the pore scale with a deep convolutional GAN. The authors extended this approach to conditional simulations of oolithic digital rock [Mosser et al., 2018a]. Early applications of GANs also included approximating the problem of velocity inversion of seismic data [Mosser et al., 2018c], geostatistical inversion [Laloy et al., 2017], and generating seismograms [Krischer and Fichtner, 2017]. Richardson [2018] integrates the Generator of the GAN into full waveform inversion of the scalar wavefield. Alternatively, a Bayesian inversion using the Generator as prior for velocity inversion was introduced in Mosser et al. [2018b]. In geomodeling, generation of geological channel models was presented [Chan and Elsheikh, 2017], which was subsequently extended with the capability to be conditioned on physical measurements [Dupont et al., 2018]. Naturally, GANs were also applied to the growing field of automatic seismic interpretation [Lu et al., 2018].
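A minimal sketch of one adversarial training step in TensorFlow, implementing the value function of equation 14; the networks G and D, their optimizers, and the latent dimension are assumed to be defined elsewhere, and the generator uses the common non-saturating heuristic rather than the literal minimax loss.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(G, D, g_opt, d_opt, real, latent_dim=64):
    # Draw latent samples z ~ p_z(z).
    z = tf.random.normal([tf.shape(real)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = G(z, training=True)
        real_logits = D(real, training=True)
        fake_logits = D(fake, training=True)
        # D maximizes log D(x) + log(1 - D(G(z))).
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Non-saturating generator loss: maximize log D(G(z)).
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss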
The final type of architecture applied in geoscience is recurrent neural networks (RNN). In contrast to all previous architectures, recurrent neural networks feed back into themselves. There are many types of RNNs, Hopfield networks being an early one applied to seismic source wavelet prediction [Wang and Mendel, 1992]. However, LSTMs [Hochreiter and Schmidhuber, 1997] are the main application in geoscience and wider machine learning. This type of network achieves state-of-the-art performance on sequential data like language tasks and time series applications. LSTMs solve some common problems of RNNs by implementing specific gates that regulate information flow in an LSTM cell, namely the input gate, forget gate, and output gate, visualized in Figure 16. The input gate feeds input values to the internal cell. The forget gate overwrites the previous state. Finally, the output gate regulates the direct contribution of the input value to the output value, combined with the internal state of the cell. Additionally, a peephole functionality that serves as a shortcut between inputs and gates helps with training.

A classic application of LSTMs is text analysis and natural language understanding, which has been applied to geological relation extraction from unstructured text documents [Luo et al., 2017, Blondelle et al., 2017]. Because LSTMs are suited for time series data, they have been applied to seismological event classification of volcanic activity [Titos et al., 2018], multi-factor landslide displacement prediction [Xie et al., 2019a], and hydrological modelling [Kratzert et al., 2019]. Talarico et al. [2019] applied LSTMs to model sedimentological sequences and compared the model to a baseline Hidden Markov Model (HMM), concluding that RNNs outperform HMMs based on first-order Markov chains, while higher-order Markov chains were too complex to calibrate satisfactorily. The Gated Recurrent Unit (GRU) [Cho et al., 2014] is another RNN developed based on the insights into LSTMs, which was applied to predict petrophysical properties from seismic data [Alfarraj and AlRegib, 2019].
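A minimal sketch of an LSTM classifier for sequence data in Keras; the sequence length, feature count, and layer sizes are illustrative assumptions:

from tensorflow.keras import layers, models

# Binary classification of sequences with 100 time steps and one feature.
model = models.Sequential([
    layers.LSTM(32, input_shape=(100, 1)),  # gated recurrent processing of the sequence
    layers.Dense(1, activation="sigmoid"),  # event / no-event output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])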
Figure 16: Schematic of LSTM architecture. The input data is processed together with the hidden state and cell state. The LSTM avoids the exploding gradient problem by implementing an input, forget, and output gate.

The scope of this review only allowed for a broad overview of the types of networks that were successfully applied to geoscience. Many more specific architectures exist and are in development that provide different advantages: Siamese networks for one-shot image analysis [Koch et al., 2015], transformer networks that largely replaced LSTM and GRU in language modelling [Vaswani et al., 2017], or attention as a general mechanism in deep neural networks [Zheng et al., 2017].

Neural network architectures have been modified and applied to diverse problems in geoscience. Every architecture type is particularly suited to certain data types that are present in each field of geoscience. However, fields with data present in machine-readable format experienced accelerated adoption of machine learning tools and applications. For example, Ross et al. [2018a] were able to successfully apply CNNs to seismological phase detection, relying on an extensive catalogue of hand-picked data [Ross et al., 2018b], and consequently generalized this work [Ross et al., 2018c]. It has to be noted that synthetic or specifically sampled data can introduce an implicit bias into the network [Wirgin, 2004, Kim et al., 2019]. Nevertheless, it is particularly this black-box property of machine learning models that makes them versatile and powerful tools that were leveraged in every subdiscipline of the Earth sciences.
Overall, geoscience and especially geophysics has followed developments in machine learning closely. Across disciplines, machine learning methods have been applied to various problems that can generally be categorized into three sections:

1. Build a surrogate ML model of a well-understood process. This model usually provides an advantage in computational cost.
2. Build an ML model of a task previously only possible with human interaction, interpretation, or knowledge and experience.

3. Build a novel ML model that performs a task that was previously not possible.

Granulometry on SEM images is an example of an application in category 1, where previously sediments were hand-measured in images [Dramsch et al., 2018]. Large deformation diffeomorphic mapping was computationally infeasible for matching 4D seismic data, but was made feasible by applying a U-net architecture to the problem, an example of category 2 [Dramsch et al., 2019]. The problem of earthquake magnitude prediction falls into category 3 due to the complexity of the system, but was nevertheless approached with neural networks [Panakkat and Adeli, 2007].

The accessibility of tools, knowledge, and compute makes this cycle of machine learning enthusiasm unique with regard to previous decades. This unprecedented access to tools makes the application of machine learning algorithms possible for any problem where data is available. The bibliometrics of machine learning in geoscience, shown in figure 17, serve as a proxy for this increased access. These papers include varying degrees of depth in application and model validation. One of the primary influences for the current increase in publications is new fields such as automatic seismic interpretation, as well as publications soliciting and encouraging machine learning submissions. Computer vision models were relatively straightforward to transfer to seismic interpretation tasks, with papers in this sub-sub-field ranging from single 2D line salt identification models with limited validation to 3D multi-facies interpretation with validation on a separate geographic area.

Geoscientific publishing can be challenging to navigate with respect to machine learning. While papers investigating the theoretical fundamentals of machine learning in geoscience exist, the overwhelming majority of papers present applications of ML to geoscientific problems. It is complex to evaluate whether a paper is a case study or a methodological paper with an exemplary application to a specific data set. Despite the difficulty of most thorough applications of ML, "idea papers" exist that simply present an established algorithm for a problem in geoscience without a specific implementation or addressing the possible caveats. On the flip side, some papers apply machine learning algorithms as pure regression models without the aim to generalize the model to other data. Unfortunately, this makes meta-analysis articles difficult to impossible. This kind of meta-analysis article is commonly done in medicine, where it is considered a gold-standard study, and would greatly benefit the geoscientific community to determine the efficacy of algorithms on sets of similar problems.

Analogous to the medical field, obtaining accurate ground truth data is often impossible and usually expensive. Geological ground truth data for seismic data is usually obtained through expert interpreters. Quantifying the uncertainty of these interpretations is an active field of research, which suggests that a broader set of experiences and a diverse set of sources of information facilitate correct geological interpretation across interpreters [Bond et al., 2007]. Radiologists tasked to interpret x-ray images showed similar decreases in both inter- and intra-interpreter error rate with more diverse data sources [Jewett et al., 1992].
These uncertainties in the training labels are commonly known as "label noise" and can be a detriment to building accurate and generalizable machine learning models. A significant portion of data in geoscience, however, is not machine learning ready. Actual ground truth data from drilling reports is often locked away in running-text reports, sometimes in scanned PDFs. Data is often siloed and most likely proprietary. Sometimes the amount of samples to process is so large that many insights are yet to be made from samples in core stores or the storage rooms of museums. Benchmark models are either non-existent or made by consortia that only provide access to their members. Academic data is usually only available within academic groups for competitive advantage, out of respect for the amount of work, and for fear of being exposed to legal and public repercussions. These problems are currently being addressed by a culture change. Nevertheless, liberating data will be a significant investment, regardless of who will work on it, and a slow culture change can be observed already.

Generally, machine learning has seen the fastest successes in domains where decisions are cheap (e.g. click advertising), data is readily available (e.g. online shops), and the environment is simple (e.g. games) or unconstrained (e.g. image generation). Geoscience is generally at the opposite end of this spectrum. Decisions are expensive, be it drilling new wells or assessing geohazards. Data is expensive, sparse, and noisy. The environment is heterogeneous and constrained by physical limitations. Therefore, problems with fewer initial constraints, like automatic seismic interpretation, see a surge of activity. Problems like inversion have solutions that are verifiably wrong due to physics. These constraints do not prohibit machine learning applications in geoscience. However, most successes are seen in close collaboration with subject matter experts. Moreover, model explainability becomes essential in the geoscience domain. While not a strict equivalence, simpler models are usually easier to interpret, especially regarding failure modes. A prominent example of "excessive" [Mignan and Broccardo, 2019a] model complexity was presented in DeVries et al. [2018], applying deep learning to aftershock prediction. Independent data scientists identified methodological errors, including data leakage from the train set to the test set used to present results [Shah and Innig, 2019]. Moreover, Mignan and Broccardo [2019b] showed that the central physical interpretation of the deep learning model, the von Mises yield criterion, could be used to build a surrogate logistic regression.
Figure 17: Bibliometry of 242 papers in Machine Learning for Geoscience per year. Search terms include variations of machine learning terms and geoscientific subdisciplines but exclude remote sensing and kriging.

The resulting surrogate or baseline model outperforms the deep network and overfits less. Moreover, replacing the ~13,000-parameter model with the two-parameter baseline model increases calculation speed, which is essential in aftershock forecasting and disaster response. (All authors point out the potential of deep and machine learning research in geoscience regardless and do not wish to stifle such research [Shah and Innig, 2019, Mignan and Broccardo, 2019b].) More generally, this is an example where data science practices such as model validation, baseline models, and preventing data leakage and overfitting become increasingly important when the tools of applying machine learning become readily available.

Despite potential setbacks and the fields of deep learning and data science being relatively young, they can rely on mathematical and statistical foundations and make significant contributions to science and society. Machine learning systems have contributed to modelling the protein structure of the current pandemic virus COVID-19 [Jumper et al.]. A deep learning computer vision system was built to stabilize food safety by identifying Cassava plant disease on offline mobile devices [Ramcharan et al., 2017, 2019]. Self-driving cars have become a possibility [Bojarski et al., 2016], and natural language understanding has progressed significantly [Devlin et al., 2018].

Geoscience is slower in the adoption of machine learning compared to other disciplines. To be able to adopt the progress in machine learning research, many valuable data sources have to be made machine-readable. There has already been a change towards making computer code open source, which has led to collaborations and accelerated scientific progress. While specific open benchmark data sets have been instrumental to the progress in machine learning, it is questionable whether these would be beneficial to machine learning in geoscience. The problems are often very complex with non-unique explanations and solutions, which historically has led to disagreements over geophysical benchmark data sets. Open data and open-source software, however, have played and will play a significant role in advancing the field. Examples of this include basic utility functions to load geoscientific data [Kvalsvik and Contributors, 2019] or, more specifically, cross-validation functions tailored to geoscience [Uieda, 2018].

Moreover, machine learning is fundamentally conservative, training on available data. This bias of data collection will influence the ability to generate new insights in all areas of geoscience. Machine learning in geoscience may be able to generate insights and establish relationships in existing data. Entirely new insights from previously unseen data or analysis
of particularly complex models will still be a task performed by trained geoscientists. Transfer learning is an active field of machine learning research that geoscience can significantly benefit from. However, no significant headway has been made in transferring trained machine learning models to out-of-distribution data, i.e. data that is conceptually similar but explicitly different from the training data set. The fields of self-supervised learning, including reinforcement learning that can learn by exploration, may be able to approach some of these problems. They are, however, notoriously hard to set up and train, necessitating significant expertise in machine learning.

Large portions of publications are concerned with weakly or unconstrained predictions such as seismic interpretation and other applications that perform image recognition on SEM or core photography. These methods will continue to improve by implementing algorithmic improvements from machine learning research, specialized data augmentation strategies, and more diverse training data becoming available. New techniques such as multi-task learning [Kendall et al., 2018], which improved computer vision and computer linguistic models, deep Bayesian networks [Mosser et al., 2019] to obtain uncertainties, noisy teacher-student networks [Xie et al., 2019b] to improve training, and transformer networks [Graves, 2012] for time series processing, will significantly improve applications in geoscience. For example, automated seismic interpretation may advance to provide reliable outputs for relatively difficult geological regimes beyond existing solutions. Success will rely on interdisciplinary teams that can discern why geologically specific faults are important to interpret, while others would be ignored in manual interpretations, to encode geological understanding in automatic interpretation systems.

Currently, the most successful applications of machine learning and deep learning tie into existing workflows to automate sub-tasks in a grander system. These models are highly specific, and their predictive capability does not resemble an artificial intelligence or attempt to do so. Mathematical constraints and existing theory in other applied fields, especially neuroscience, were able to generate insights into deep learning, and geoscience has the opportunity to make significant contributions to machine learning, considering its unique problem set of heterogeneity, varying scales, and non-unique solutions. This has already taken place with the wider adoption of "kriging", or more generally Gaussian processes, into machine learning. Moreover, known applications of signal theory and information theory employed in geophysics are equally applicable in machine learning, with examples utilizing complex-valued neural networks [Trabelsi et al., 2017], deep Kalman filters [Krishnan et al., 2015], and Fourier analysis [Tancik et al., 2020], possibly enabling additional insights, particularly when integrated with deep learning, due to its modularity and versatility.

Previous reservations about neural networks included the difficulty of implementation and susceptibility to noise, in addition to computational costs. Research into updating trained models and saving the optimizer state with the model has in part alleviated the cost of re-training existing models. Moreover, fine-tuning pre-trained large complex models to specific problems has proven successful in several domains.
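A minimal sketch of such fine-tuning in Keras, assuming image patches resized to the 224 × 224 × 3 input that VGG16 expects; the frozen base provides pre-trained features while only the new head is trained:

from tensorflow.keras import applications, layers, models

# Load convolutional features pre-trained on natural images (ImageNet).
base = applications.VGG16(weights="imagenet", include_top=False,
                          input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Attach a small task-specific classification head.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])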
Regularization techniques and noise modelling, as well as data cleaning pipelines, can be implemented to lessen the impact of noise on machine learning models. Specific types of noise can be attenuated or even used as an additional source of information. The aforementioned concerns have mainly transitioned into a critique of overly complex models that overfit the training data and are not interpretable. Modern software makes very sophisticated machine learning models and data pipelines available to researchers, which has, in turn, increased the importance of controlling for data leakage and performing thorough model validation.

Currently, machine learning for science primarily relies on the emerging field of explainability [Lundberg et al., 2018]. These methods provide primarily post-hoc explanations for predictions from models. This field is particularly important to evaluate which inputs from the data have the strongest influence on the prediction result. The major point of critique regarding post-hoc explanations is that these methods attempt to explain how the algorithm reached a wrong prediction with equal confidence. Bayesian neural networks intend to address this issue by providing confidence intervals for the prediction based on prior beliefs. These neural networks intend to incorporate prior expert knowledge into neural networks, which can be beneficial in geoscientific applications, where strong priors can be necessary. Machine learning interpretability attempts to impose constraints on the machine learning models to make the model itself explainable. Closely related to these topics is the statistics field of causal inference. Causal inference attempts to model the cause of a variable, instead of correlative prediction. Some methods exist that can perform causal machine learning, i.e. causal trees [Athey and Imbens, 2016]. These three fields will be necessary to glean verifiable scientific insights from machine learning in geoscience. They are active fields of research and more involved to apply correctly, which often makes cooperation with a statistician necessary.
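A minimal sketch of post-hoc explanations with the SHAP library [Lundberg et al., 2018], assuming the X_train, y_train, and X_test arrays from the earlier examples:

import shap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# The summary plot ranks features by their influence on the predictions.
shap.summary_plot(shap_values, X_test)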
In conclusion, machine learning has had a long history in geoscience. Kriging has progressed into more general machine learning methods, and geoscience has made significant progress applying deep learning. Applying deep convolutional networks to automatic seismic interpretation has progressed these methods beyond what was previously possible, albeit still being an active field of research. Using modern tools, composing custom neural networks and conventional machine learning pipelines has become increasingly trivial, enabling widespread applications in every sub-field of geoscience. Nevertheless, it is important to acknowledge the limitations of machine learning in geoscience. Machine learning methods are often cutting-edge technology, yet properly validated models take time to develop, which is often perceived as inconvenient when working in a hot scientific field. Despite being cutting edge, it is important to acknowledge that none of these applications are fully automated, as would be suggested by the lure of artificial intelligence. Nevertheless, within applied geoscience, significant new insights have been presented. Applications in geoscience use machine learning as a utility for data pre-processing, implement previous insights beyond theory and synthetic cases, or enable unprecedented applications through the model itself. Overall, applied machine learning has matured into an established tool in computational geoscience and has the potential to provide further insights into the theory of geoscience itself.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin,S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, andX. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/ . Software available from tensorflow.org.F. Agterberg. Markov schemes for multivariate well data. In
Proceedings, symposium on applications of computers andoperations research in the mineral industries, Pennsylvania State University, State College, Pennsylvania , volume 2,pages X1–X18, 1966.M. A. Aizerman. Theoretical foundations of the potential function method in pattern recognition learning.
Automationand remote control , 25:821–837, 1964.M. Alfarraj and G. AlRegib. Petrophysical property estimation from seismic data using recurrent neural networks. arXiv preprint arXiv:1901.08623 , 2019.F. Anifowose, C. Ayadiuno, and F. Rashedian. Carbonate reservoir cementation factor modeling using wireline logs andartificial intelligence methodology. In , 2017.M. Araya-Polo, T. Dahlke, C. Frogner, C. Zhang, T. Poggio, and D. Hohl. Automated fault detection without seismicprocessing.
The Leading Edge , 36(3):208–214, 2017.S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academyof Sciences , 113(27):7353–7360, 2016.Y. Babakhin, A. Sanakoyeu, and H. Kitamura. Semi-supervised segmentation of salt bodies in seismic images using anensemble of convolutional neural networks.
German Conference on Pattern Recognition (GCPR) , 2019.C. Ballabio and S. Sterlacchini. Support vector machines for landslide susceptibility mapping: The staffora riverbasin case study, italy.
Math. Geosci. , 44(1):47–70, Jan. 2012. ISSN 1874-8961, 1874-8953. doi: 10.1007/s11004-011-9379-9. URL https://doi.org/10.1007/s11004-011-9379-9 .T. Bayes. Lii. an essay towards solving a problem in the doctrine of chances. by the late rev. mr. bayes, frs communicatedby mr. price, in a letter to john canton, amfr s.
Philosophical transactions of the Royal Society of London , (53):370–418, 1763.W. A. Belson. Matching and prediction on the principle of biological classification.
Journal of the Royal StatisticalSociety: Series C (Applied Statistics) , 8(2):65–75, 1959.K. J. Bergen, P. A. Johnson, V. Maarten, and G. C. Beroza. Machine learning for data-driven discovery in solid earthgeoscience.
Science , 363(6433):eaau0323, 2019.P. Bestagini, V. Lipari, and S. Tubaro. A machine learning approach to facies classification using well logs. In
SEG Technical Program Expanded Abstracts 2017 , SEG Technical Program Expanded Abstracts, pages 2137–2142. Society of Exploration Geophysicists, Aug. 2017. doi: 10.1190/segam2017-17729805.1. URL https://doi.org/10.1190/segam2017-17729805.1 .M. Beyreuther and J. Wassermann. Continuous earthquake detection and classification using discrete hidden markovmodels.
Geophys. J. Int. , 175(3):1055–1066, Dec. 2008. ISSN 0956-540X. doi: 10.1111/j.1365-246X.2008.03921.x.URL https://academic.oup.com/gji/article-abstract/175/3/1055/634811 .M. Bicego, C. Acosta-Muñoz, and M. Orozco-Alzate. Classification of seismic volcanic signals using Hidden-Markov-Model-Based generative embeddings.
IEEE Trans. Geosci. Remote Sens., 51(6):3400–3409, June 2013. ISSN 0196-2892. doi: 10.1109/TGRS.2012.2220370. URL http://dx.doi.org/10.1109/TGRS.2012.2220370.
H. Blondelle, A. Juneja, J. Micaelli, and P. Neri. Machine learning can extract the information needed for modelling and data analysing from unstructured documents. earthdoc.org, 2017. M. Blouin, A. Caté, L. Perozzi, and E. Gloaguen. Automated facies prediction in drillholes using machine learning. earthdoc.org, 2017. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016. C. E. Bond, A. D. Gibbs, Z. K. Shipton, S. Jones, et al. What do you think this is? "Conceptual uncertainty" in geoscience interpretation.
GSA Today, 17(11):4, 2007. L. Breiman. Random forests.
Machine learning , 45(1):5–32, 2001.A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. In
Proc. Harvard Univ. Symposiumon digital computers and their applications , volume 72, 1961.L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort,J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software:experiences from the scikit-learn project. In
ECML PKDD Workshop: Languages for Data Mining and MachineLearning , pages 108–122, 2013.J. Cao and B. Roy. Time-lapse reservoir property change estimation from seismic using machine learning.
Lead. Edge ,36(3):234–238, Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030234.1. URL https://doi.org/10.1190/tle36030234.1 .A. Caté, L. Perozzi, E. Gloaguen, and M. Blouin. Machine learning as a tool for geologists.
Lead. Edge , 36(3):215–219,Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030215.1. URL https://doi.org/10.1190/tle36030215.1 .A. Caté, E. Schetselaar, P. Mercier-Langevin, and P.-S. Ross. Classification of lithostratigraphic and alteration unitsfrom drillhole lithogeochemical data using machine learning: A case study from the lalor volcanogenic massivesulphide deposit, snow lake, manitoba, canada.
J. Geochem. Explor., 188:216–228, 2018. ISSN 0375-6742. S. Chaki, A. Routray, and W. K. Mohanty. Well-Log and seismic data integration for reservoir characterization: A signal processing and Machine-Learning perspective.
IEEE Signal Process. Mag. , 35(2):72–81, Mar. 2018. ISSN1053-5888. doi: 10.1109/MSP.2017.2776602. URL http://dx.doi.org/10.1109/MSP.2017.2776602 .S. Chan and A. H. Elsheikh. Parametrization and generation of geological models with generative adversarial networks. arXiv preprint arXiv:1708.01810 , 2017.C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines.
ACM Trans. Intell. Syst. Technol. , 2(3):27:1–27:27, May 2011. ISSN 2157-6904. doi: 10.1145/1961189.1961199. URL http://doi.acm.org/10.1145/1961189.1961199 .T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining , KDD ’16, pages 785–794, New York, NY,USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785 .D. Chevitarese, D. Szwarcman, R. M. D. Silva, and E. V. Brazil. Seismic facies segmentation using deep learning. In
ACE 2018 Annual Convention & Exhibition . searchanddiscovery.com, 2018.J. Chiles and P. Chauvet. Kriging: a method for cartography of the sea floor.
The International Hydrographic Review ,1975.J.-P. Chilès and N. Desassis. Fifty years of kriging. In
Handbook of mathematical geosciences , pages 589–612. Springer,2018.T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow,M. Zietz, M. M. Hoffman, W. Xie, G. L. Rosen, B. J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A. E.Carpenter, A. Shrikumar, J. Xu, E. M. Cofer, C. A. Lavender, S. C. Turaga, A. M. Alexandari, Z. Lu, D. J.Harris, D. DeCaprio, Y. Qi, A. Kundaje, Y. Peng, L. K. Wiley, M. H. S. Segler, S. M. Boca, S. J. Swamidass,A. Huang, A. Gitter, and C. S. Greene. Opportunities and obstacles for deep learning in biology and medicine.
J. R. Soc. Interface, 15(141), Apr. 2018. ISSN 1742-5689, 1742-5662. doi: 10.1098/rsif.2017.0387. URL http://dx.doi.org/10.1098/rsif.2017.0387. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In
Twenty-Second International Joint Conference on Artificial Intelligence ,2011.R. Collobert, S. Bengio, and J. Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap,2002.C. Cortes and V. Vapnik. Support-vector networks.
Machine learning , 20(3):273–297, 1995.T. Cover and P. Hart. Nearest neighbor pattern classification.
IEEE transactions on information theory , 13(1):21–27,1967.M. J. Cracknell and A. M. Reading. The upside of uncertainty: Identification of lithology contact zones from airbornegeophysics and satellite data using random forests and support vector machines.
Geophysics , 78(3):WB113–WB126,2013.N. Cressie. The origins of kriging.
Mathematical geology , 22(3):239–252, 1990.F. Dammeier, J. R. Moore, C. Hammer, F. Haslinger, and S. Loew. Automatic detection of alpine rockslides incontinuous seismic data using hidden markov models.
J. Geophys. Res. Earth Surf. , 121(2):351–371, Feb. 2016.ISSN 2169-9003. doi: 10.1002/2015JF003647. URL http://doi.wiley.com/10.1002/2015JF003647 .R. Dechter.
Learning while searching in constraint-satisfaction problems . University of California, Computer ScienceDepartment, Cognitive Systems . . . , 1986.J. Delhomme. Kriging in the hydrosciences.
Advances in water resources , 1:251–266, 1978.J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In , pages 248–255. Ieee, 2009.J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for languageunderstanding. arXiv preprint arXiv:1810.04805 , 2018.P. M. DeVries, F. Viégas, M. Wattenberg, and B. J. Meade. Deep learning of aftershock patterns following largeearthquakes.
Nature, 560(7720):632, 2018. H. Di, M. Shafiq, and G. AlRegib. Multi-attribute k-means cluster analysis for salt boundary detection. 2017a. H. Di, M. A. Shafiq, and G. AlRegib. Seismic-fault detection based on multiattribute support vector machine analysis. In
SEG Technical Program Expanded Abstracts 2017 , pages 2039–2044. Society of Exploration Geophysicists,2017b.H. Di, Z. Wang, and G. AlRegib. Deep convolutional neural networks for seismic salt-body delineation.
AAPG AnnualConvention and , 2018.D. A. Dodge and D. B. Harris. Large-scale test of dynamic correlation processors: Implications for correlation-basedseismic pipelines.
Bull. Seismol. Soc. Am. , 2016. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/106/2/435/332173 .F. U. Dowla, S. R. Taylor, and R. W. Anderson. Seismic discrimination with artificial neural networks: Preliminaryresults with regional spectral data.
Bull. Seismol. Soc. Am. , 80(5):1346–1373, Oct. 1990. ISSN 0037-1106. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/80/5/1346/119382 .J. Dramsch.
Machine Learning in 4D Seismic Data Analysis: Deep Neural Networks in Geophysics . PhD thesis, 2019.J. S. Dramsch. 3d decision volume of svm, random forest, and deep neural network, Jul 2020a. URL https://figshare.com/articles/media/3D_decision_volume_of_SVM_Random_Forest_and_Deep_Neural_Network/12640226/1 .J. S. Dramsch. Code for 70 years of machine learning in geoscience in review, Jul 2020b.J. S. Dramsch and M. Lüthje. Deep-learning seismic facies on state-of-the-art cnn architectures. In
SEG TechnicalProgram Expanded Abstracts 2018 , pages 2036–2040. Society of Exploration Geophysicists, 8 2018. doi: 10.1190/segam2018-2996783.1. URL https://doi.org/10.1190/segam2018-2996783.1 .J. S. Dramsch, F. Amour, and M. Lüthje. Gaussian mixture models for robust unsupervised scanning-electron microscopyimage segmentation of north sea chalk. In
First EAGE/PESGB Workshop Machine Learning . EAGE PublicationsBV, 2018. doi: 10.3997/2214-4609.201803014. URL https://doi.org/10.3997/2214-4609.201803014 .J. S. Dramsch, A. N. Christensen, C. MacBeth, and M. Lüthje. Deep unsupervised 4d seismic 3d time-shift estimationwith convolutional neural networks.
IEEE Transactions in Geoscience and Remote Sensing, 2019.
S. Dreyfus. The numerical solution of variational problems.
Journal of Mathematical Analysis and Applications , 5(1):30–45, 1962.O. Dubrule. Comparing splines and kriging.
Computers & Geosciences , 10(2-3):327–338, 1984.E. Dupont, T. Zhang, P. Tilke, L. Liang, and W. Bailey. Generating realistic geology conditioned on physicalmeasurements with generative adversarial networks. arXiv preprint arXiv:1802.03065 , 2018.D. Duvenaud.
Automatic model construction with Gaussian processes . PhD thesis, University of Cambridge, 2014.J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts. Gansynth: Adversarial neural audiosynthesis. arXiv preprint arXiv:1902.08710 , 2019.X.-T. Feng and M. Seto. Neural network dynamic modelling of rock microfracturing sequences under triaxial compres-sive stress conditions.
Tectonophysics, 292(3):293–309, July 1998. ISSN 0040-1951. doi: 10.1016/S0040-1951(98)00072-9. R. Ferreira, E. V. Brazil, R. Silva, and others. Texture-Based similarity graph to aid seismic interpretation.
ACE 2018Annual , 2018.K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognitionunaffected by shift in position.
Biological cybernetics , 36(4):193–202, 1980.I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generativeadversarial nets. In
Advances in neural information processing systems , pages 2672–2680, 2014.I. Goodfellow, Y. Bengio, and A. Courville.
Deep Learning. MIT Press, 2016. O. Gramstad and M. Nickel. Automated top salt interpretation using a deep convolutional net. 2018. A. Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. P. Guillen, G. Larrazabal, G. González, D. Boumber, and R. Vilalta. Supervised learning to detect salt body. In
SEG Technical Program Expanded Abstracts 2015 , SEG Technical Program Expanded Abstracts, pages 1826–1829. Society of Exploration Geophysicists, Aug. 2015. doi: 10.1190/segam2015-5931401.1. URL https://doi.org/10.1190/segam2015-5931401.1 .A. Guitton. 3D convolutional neural networks for fault interpretation. ,2018. URL .R. Guo, Y. S. Zhang, H. Lin, and W. Liu. Sweet spot interpretation from multiple attributes: Machine learningand neural networks technologies.
First EAGE/AMGP/AMGE Latin , 2017. URL .I. Gupta, C. Rai, C. Sondergeld, and D. Devegowda. Rock typing in the upper Devonian-Lower mississippianwoodford shale formation, oklahoma, USA.
Interpretation , 6(1):SC55–SC66, Feb. 2018. ISSN 2324-8858. doi:10.1190/INT-2017-0015.1. URL https://doi.org/10.1190/INT-2017-0015.1 .J. Görtler, R. Kehlbeck, and O. Deussen. A visual exploration of gaussian processes.
Distill , 2019. doi: 10.23915/distill.00017. https://distill.pub/2019/visual-exploration-gaussian-processes.R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogueamplification coexist in a cortex-inspired silicon circuit.
Nature , 405(6789):947–951, 2000.A. Hajimoradlou, G. Roberti, and D. Poole. Predicting landslides using contour aligning convolutional neural networks. arXiv: Computer Vision and Pattern Recognition , 2019.D. Hale. Methods to compute fault images, extract fault surfaces, and estimate fault throws from 3d seismic images.
Geophysics , 78(2):O33–O43, 2013.B. Hall. Facies classification using machine learning.
Lead. Edge , 35(10):906–909, Oct. 2016. ISSN 1070-485X. doi:10.1190/tle35100906.1. URL https://doi.org/10.1190/tle35100906.1 .M. Hall and B. Hall. Distributed collaborative prediction: Results of the machine learning contest.
Lead. Edge , 36(3):267–269, Mar. 2017. ISSN 1070-485X. doi: 10.1190/tle36030267.1. URL https://doi.org/10.1190/tle36030267.1 .K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition , pages 770–778, 2016.L. Hermes, D. Frieauff, J. Puzicha, and J. M. Buhmann. Support vector machines for land usage classification inlandsat tm imagery. In
IEEE 1999 International Geoscience and Remote Sensing Symposium. IGARSS'99 (Cat. No. 99CH36293), volume 1, pages 348–350. IEEE, 1999.
T. K. Ho. Random decision forests. In
Proceedings of 3rd international conference on document analysis andrecognition , volume 1, pages 278–282. IEEE, 1995.S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation , 9(8):1735–1780, 1997.S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learninglong-term dependencies, 2001.J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities.
Proceedings ofthe national academy of sciences , 79(8):2554–2558, 1982.K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
NeuralNetw. , 2(5):359–366, Jan. 1989. ISSN 0893-6080. doi: 10.1016/0893-6080(89)90020-8. URL .K. Y. Huang, W. R. I. Chang, and H. T. Yen. Self-organizing neural network for picking seismic horizons.
SEGTechnical Program Expanded , 1990. URL https://library.seg.org/doi/pdf/10.1190/1.1890183 .C. Huijbregts and G. Matheron. Universal kriging (an optimal method for estimating and contouring in trend surfaceanalysis): 9th intern.
Sym. on Decisionmaking in the Mineral Industries (proceedings to be published by CanadianInst. Mining), Montreal , 1970.C. Hulbert, B. Rouet-Leduc, C. X. Ren, J. Riviere, D. C. Bolton, C. Marone, and P. A. Johnson. Estimating the physicalstate of a laboratory slow slipping fault from seismic signals. Jan. 2018. URL http://arxiv.org/abs/1801.07806 .C. R. Ildstad and P. Bormann. Malenov_nd (machine learning of voxels). URL https://github.com/bolgebrygg/MalenoV .S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015.H. Iqbal. Harisiqbal88/plotneuralnet v1.0.0, Dec. 2018. URL https://doi.org/10.5281/zenodo.2526396 .J. Jeong, E. Park, W. S. Han, and K.-Y. Kim. A novel data assimilation methodology for predicting lithology based onsequence labeling algorithms.
J. Geophys. Res. [Solid Earth] , 119(10):7503–7520, Oct. 2014. ISSN 2169-9313. doi:10.1002/2014JB011279. URL http://doi.wiley.com/10.1002/2014JB011279 .M. A. Jewett, C. Bombardier, D. Caron, M. R. Ryan, R. R. Gray, E. L. S. Louis, S. J. Witchell, S. Kumra, and K. E.Psihramis. Potential for inter-observer and intra-observer variability in x-ray review to establish stone-free rates afterlithotripsy.
The Journal of urology , 147(3):559–562, 1992.J. Jumper, K. Tunyasuvunakool, P. Kohli, D. Hassabis, and A. Team. Computational predictions of protein struc-tures associated with covid-19. Technical report. URL https://deepmind.com/research/open-source/computational-predictions-of-protein-structures-associated-with-COVID-19 .A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov. druGAN: An advanced generative adversarialautoencoder model for de novo generation of new molecules with desired molecular properties in silico.
Mol. Pharm. ,14(9):3098–3104, Sept. 2017. ISSN 1543-8384, 1543-8392. doi: 10.1021/acs.molpharmaceut.7b00346. URL http://dx.doi.org/10.1021/acs.molpharmaceut.7b00346 .S. Karra, D. O’Malley, J. D. Hyman, H. S. Viswanathan, and G. Srinivasan. Modeling flow and transport in fracturenetworks using graphs.
Phys Rev E , 97(3-1):033304, Mar. 2018. ISSN 2470-0053, 2470-0045. doi: 10.1103/PhysRevE.97.033304. URL http://dx.doi.org/10.1103/PhysRevE.97.033304 .H. J. Kelley. Gradient theory of optimal flight paths.
Ars Journal , 30(10):947–954, 1960.A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry andsemantics. In
Proceedings of the IEEE conference on computer vision and pattern recognition , pages 7482–7491,2018.N. Khoshnevis and R. Taborda. Prioritizing ground-motion validation metrics using semisupervised and super-vised learning.
Bull. Seismol. Soc. Am. , 2018. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/108/4/2248/536309 .B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim. Learning not to learn: Training deep neural networks with biased data.In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 9012–9020, 2019.D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv , 2014.G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In
ICML deep learning workshop, volume 2. Lille, 2015.
A. N. Kolmogorov. Sur l'interpolation et extrapolation des suites stationnaires.
CR Acad Sci , 208:2043–2045, 1939.Q. Kong, D. T. Trugman, Z. E. Ross, M. J. Bianco, B. J. Meade, and P. Gerstoft. Machine learning in seismology:Turning data into insights.
Seismological Research Letters, 90(1):3–14, 2019.
F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing. Benchmarking a catchment-aware long short-term memory network (LSTM) for large-scale hydrological modeling. arXiv preprint arXiv:1907.08456, 2019.
D. G. Krige. A statistical approach to some mine valuation and allied problems on the Witwatersrand. PhD thesis, Johannesburg, 1951.
L. Krischer and A. Fichtner. Generating seismograms with deep neural networks. AGUFM, 2017:S41D–03, 2017.
R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
W. C. Krumbein and M. F. Dacey. Markov chains and embedded Markov chains in geology. Journal of the International Association for Mathematical Geology, 1(1):79–96, 1969.
N. M. Kuehn, C. Riggelsen, et al. Modeling the joint probability of earthquake, site, and ground-motion parameters using Bayesian networks. Bulletin of the Seismological Society of America, 2011. URL https://pubs.geoscienceworld.org/ssa/bssa/article-abstract/101/1/235/349494.
H. A. Kuzma. A support vector machine for AVO interpretation. In SEG Technical Program Expanded Abstracts 2003, pages 181–184. Society of Exploration Geophysicists, 2003.
J. Kvalsvik and Contributors. Segyio, 2019. URL https://github.com/equinor/segyio/.
E. Laloy, R. Hérault, D. Jacques, and N. Linde. Efficient training-image based geostatistical simulation and inversion using a spatial generative adversarial neural network. Aug. 2017. URL http://arxiv.org/abs/1708.04975.
D. J. Lary, A. H. Alavi, A. H. Gandomi, and A. L. Walker. Machine learning in geosciences and remote sensing. Geoscience Frontiers, 7(1):3–10, Jan. 2016. ISSN 1674-9871. doi: 10.1016/j.gsf.2015.07.003.
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
A. M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes [New methods for the determination of comet orbits]. F. Didot, 1805.
J. Li and J. Castagna. Support vector machine (SVM) pattern recognition to AVO classification. Geophys. Res. Lett., 31(2):948, Jan. 2004. ISSN 0094-8276. doi: 10.1029/2003GL018299. URL http://doi.wiley.com/10.1029/2003GL018299.
M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, pages 6–7, 1970.
Y. Liu, Z. Chen, L. Wang, Y. Zhang, Z. Liu, and Y. Shuai. Quantitative seismic interpretations to detect biogenic gas accumulations: a case study from Qaidam Basin, China. Bull. Can. Petrol. Geol., 63(1):108–121, Mar. 2015. ISSN 0007-4802. doi: 10.2113/gscpgbull.63.1.108. URL https://pubs.geoscienceworld.org/cspg/bcpg/article-abstract/63/1/108/455952.
P. Lu, M. Morris, S. Brazell, C. Comiskey, and Y. Xiao. Using generative adversarial networks to improve deep-learning fault interpretation networks. The Leading Edge, 37(8):578–583, 2018.
S. M. Lundberg, B. Nair, M. S. Vavilala, M. Horibe, M. J. Eisses, T. Adams, D. E. Liston, D. K.-W. Low, S.-F. Newman, J. Kim, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10):749–760, 2018.
X. Luo, W. Zhou, W. Wang, Y. Zhu, and J. Deng. Attention-based relation extraction with bidirectional gated recurrent unit and highway network in the analysis of geological data. IEEE Access, 6:5705–5715, 2017.
J. Ma, Z. Jiang, Q. Tian, and G. D. Couples. Classification of digital rocks by machine learning. ECMOR XIII - 13th European Conference on the Mathematics of Oil Recovery, 2012.
A. Maggi, V. Ferrazzini, C. Hibert, F. Beauducel, P. Boissier, and A. Amemoutou. Implementation of a multistation approach for automated event classification at Piton de la Fournaise volcano. Seismol. Res. Lett., 88(3):878–891, May 2017. ISSN 0895-0695. doi: 10.1785/0220160189. URL https://pubs.geoscienceworld.org/ssa/srl/article-abstract/88/3/878/284054.
M. Malfante, M. D. Mura, J. Metaxian, J. I. Mars, O. Macedo, and A. Inza. Machine learning for volcano-seismic signals: Challenges and perspectives.
IEEE Signal Process. Mag., 35(2):20–30, Mar. 2018. ISSN 1053-5888. doi: 10.1109/MSP.2017.2779166. URL http://dx.doi.org/10.1109/MSP.2017.2779166.
A. Mardan, A. Javaherian, et al. Channel characterization using support vector machine. 2017.
M. Marjanović, M. Kovačević, B. Bajat, and V. Voženílek. Landslide susceptibility assessment using SVM machine learning algorithm. Eng. Geol., 123(3):225–234, Nov. 2011. ISSN 0013-7952. doi: 10.1016/j.enggeo.2011.09.006.
A. A. Markov. Rasprostranenie zakona bol'shih chisel na velichiny, zavisyaschie drug ot druga [Extension of the law of large numbers to quantities depending on each other]. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete, 15(135-156):18, 1906.
A. A. Markov. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. Dynamic Probabilistic Systems, 1:552–577, 1971. Reprint in English of [Markov, 1906].
G. Martinelli, J. Eidsvik, R. Sinding-Larsen, et al. Building Bayesian networks from basin-modelling scenarios for improved geological decision making. Petroleum Geoscience, 2013. URL http://pg.lyellcollection.org/content/early/2013/06/24/petgeo2012-057.abstract.
M. Masotti, S. Falsaperla, H. Langer, S. Spampinato, and R. Campanini. Application of support vector machine to the classification of volcanic tremor at Etna, Italy. Geophys. Res. Lett., 33(20):113, Oct. 2006. ISSN 0094-8276. doi: 10.1029/2006GL027441. URL http://doi.wiley.com/10.1029/2006GL027441.
M. Masotti, R. Campanini, L. Mazzacurati, S. Falsaperla, H. Langer, and S. Spampinato. TREMOrEC: A software utility for automatic classification of volcanic tremor. Geochem. Geophys. Geosyst., 9(4), Apr. 2008. ISSN 1525-2027. doi: 10.1029/2007GC001860. URL http://doi.wiley.com/10.1029/2007GC001860.
N. C. Matalas. Mathematical assessment of synthetic hydrology. Water Resources Research, 3(4):937–945, 1967.
G. Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963.
G. Matheron et al. Splines and kriging; their formal equivalence. 1981.
M. McCormack. Neural computing in geophysics. Lead. Edge, 10(1):11–15, Jan. 1991. ISSN 1070-485X. doi: 10.1190/1.1436771. URL https://doi.org/10.1190/1.1436771.
A. Mignan and M. Broccardo. A deeper look into 'deep learning of aftershock patterns following large earthquakes': Illustrating first principles in neural network physical interpretability. In International Work-Conference on Artificial Neural Networks, pages 3–14. Springer, 2019a.
A. Mignan and M. Broccardo. One neuron versus deep learning in aftershock prediction. Nature, 574(7776):E1–E3, 2019b.
T. M. Mitchell et al. Machine learning, 1997.
E. Mjolsness and D. DeCoste. Machine learning for science: state of the art and future prospects. Science, 293(5537):2051–2055, Sept. 2001. ISSN 0036-8075. doi: 10.1126/science.293.5537.2051. URL http://dx.doi.org/10.1126/science.293.5537.2051.
L. Mosser, O. Dubrule, and M. J. Blunt. Reconstruction of three-dimensional porous media using generative adversarial neural networks. Phys. Rev. E, 96(4-1):043309, Oct. 2017. ISSN 2470-0053, 2470-0045. doi: 10.1103/PhysRevE.96.043309. URL http://dx.doi.org/10.1103/PhysRevE.96.043309.
L. Mosser, O. Dubrule, and M. J. Blunt. Conditioning of three-dimensional generative adversarial networks for pore and reservoir-scale models. Feb. 2018a. URL http://arxiv.org/abs/1802.05622.
L. Mosser, O. Dubrule, and M. J. Blunt. Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. June 2018b. URL http://arxiv.org/abs/1806.03720.
L. Mosser, W. Kimman, J. S. Dramsch, S. Purves, A. De la Fuente Briceño, and G. Ganssle. Rapid seismic domain transfer: Seismic velocity inversion and modeling using deep generative neural networks. EAGE Publications BV, June 2018c. doi: 10.3997/2214-4609.201800734. URL https://doi.org/10.3997/2214-4609.201800734.
L. Mosser, R. Oliveira, and M. Steventon. Probabilistic seismic interpretation using Bayesian neural networks, volume 2019, pages 1–5. European Association of Geoscientists & Engineers, 2019.
J. Nash. Non-cooperative games. Annals of Mathematics, pages 286–295, 1951.
R. M. Neal. Bayesian learning for neural networks. Springer Science & Business Media, 1996.
P. D. Newendorp. Decision analysis for petroleum exploration. 1976.
L. H. Ochoa, L. F. Niño, and C. A. Vargas. Fast magnitude determination using a single seismological station record implementing machine learning techniques.
Geodesy and Geodynamics, 9(1):34–41, Jan. 2018. ISSN 1674-9847. doi: 10.1016/j.geog.2017.03.010.
M. Ohrnberger. [no title], 2001. Accessed: 2018-12-17.
M. Paganini, L. de Oliveira, and B. Nachman. CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D, 97(1):014021, 2018.
A. Panakkat and H. Adeli. Neural network models for earthquake magnitude prediction using multiple seismicity indicators. International Journal of Neural Systems, 17(01):13–33, 2007.
E. Pasolli, F. Melgani, and M. Donelli. Automatic analysis of GPR images: A pattern-recognition approach. IEEE Transactions on Geoscience and Remote Sensing, 47(7):2206–2217, 2009.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
J. Pearl. The do-calculus revisited. arXiv preprint arXiv:1210.4852, 2012.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
M. Poulton, B. Sternberg, and C. Glass. Location of subsurface targets in geophysical data using neural networks. Geophysics, 57(12):1534–1544, Dec. 1992. ISSN 0016-8033. doi: 10.1190/1.1443221. URL https://doi.org/10.1190/1.1443221.
F. W. Preston and J. Henderson. Fourier series characterization of cyclic sediments for stratigraphic correlation. Kansas Geological Survey, 1964.
S. Purves, B. Alaei, and E. Larsen. Bootstrapping machine-learning based seismic fault interpretation.
ACE 2018 Annual Convention & Exhibition, 2018.
A. Ramcharan, K. Baranowski, P. McCloskey, B. Ahmed, J. Legg, and D. P. Hughes. Deep learning for image-based cassava disease detection.
Frontiers in Plant Science, 8:1852, 2017.
A. Ramcharan, P. McCloskey, K. Baranowski, N. Mbilinyi, L. Mrisho, M. Ndalahwa, J. Legg, and D. P. Hughes. A mobile-based deep learning model for cassava disease diagnosis. Frontiers in Plant Science, 10:272, 2019.
C. E. Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811, 2019.
R. Reddy and G. Bonham-Carter. A decision-tree approach to mineral potential mapping in Snow Lake area, Manitoba. Canadian Journal of Remote Sensing, 17(2):191–200, 1991. doi: 10.1080/07038992.1991.10855292. URL https://doi.org/10.1080/07038992.1991.10855292.
A. Richardson. Seismic full-waveform inversion using deep learning tools and techniques. Jan. 2018. URL http://arxiv.org/abs/1801.07232.
R. Roden and C. W. Chen. Interpretation of DHI characteristics with machine learning. First Break, 2017. ISSN 0263-5046.
D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, et al. Tackling climate change with machine learning. arXiv preprint arXiv:1906.05433, 2019.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
Z. E. Ross, M. A. Meier, and E. Hauksson. P-wave arrival picking and first-motion polarity determination with deep learning. J. Geophys. Res., 2018a. ISSN 0148-0227. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2017JB015251.
Z. E. Ross, M.-A. Meier, E. Hauksson, and T. H. Heaton. Generalized seismic phase detection with deep learning. May 2018b. URL http://arxiv.org/abs/1805.01075.
Z. E. Ross, M.-A. Meier, E. Hauksson, and T. H. Heaton. Generalized seismic phase detection with deep learning.
Bulletin of the Seismological Society of America, 108(5A):2894–2901, 2018c.
G. Röth and A. Tarantola. Neural networks and inversion of seismic data. J. Geophys. Res., 99(B4):6753, 1994. ISSN 0148-0227. doi: 10.1029/93JB01563. URL http://doi.wiley.com/10.1029/93JB01563.
B. Rouet-Leduc, C. Hulbert, N. Lubbers, et al. Machine learning predicts laboratory earthquakes. Geophysical Research Letters, 2017. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/2017GL074677.
B. Rouet-Leduc, C. Hulbert, D. C. Bolton, et al. Estimating fault friction from seismic signals in the laboratory. Geophysical Research Letters, 2018. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/2017GL076708.
D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
S. J. Russell and P. Norvig. Artificial Intelligence - A Modern Approach, Third International Edition. Pearson Education, 2010. ISBN 978-0-13-207148-2. URL http://vig.pearsoned.com/store/product/1,1207,store-12521_isbn-0136042597,00.html.
A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
C. M. Saporetti, L. G. da Fonseca, E. Pereira, and L. C. de Oliveira. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys., 155:217–225, Aug. 2018. ISSN 0926-9851. doi: 10.1016/j.jappgeo.2018.06.012.
K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nat. Commun., 8:13890, Jan. 2017. ISSN 2041-1723. doi: 10.1038/ncomms13890. URL http://dx.doi.org/10.1038/ncomms13890.
W. Schwarzacher. The semi-Markov process as a general sedimentation model. In Mathematical Models of Sedimentary Processes, pages 247–268. Springer, 1972.
S. Sen, S. Kainkaryam, C. Ong, and A. Sharma. SaltNet: A production-scale deep learning pipeline for automated salt model building. The Leading Edge, 39(3):195–203, 2020.
R. Shah and L. Innig. Aftershock issues, 2019. URL https://github.com/rajshah4/aftershocks_issues.
B. M. Shashidhara, M. Scott, and A. Marburg. Instance segmentation of benthic scale worms at a hydrothermal site. In The IEEE Winter Conference on Applications of Computer Vision, pages 1314–1323, 2020.
C. Shen. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resources Research, 54(11):8558–8593, 2018.
D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng., 19:221–248, June 2017. ISSN 1523-9829, 1545-4274. doi: 10.1146/annurev-bioeng-071516-044442. URL http://dx.doi.org/10.1146/annurev-bioeng-071516-044442.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
A. Y. Sun, B. R. Scanlon, Z. Zhang, D. Walling, S. N. Bhanja, A. Mukherjee, and Z. Zhong. Combining physically based modeling and deep learning for fusing GRACE satellite data: Can we learn from mismatch? Water Resources Research, 55(2):1179–1195, 2019.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/sutskever13.html.
E. Talarico, W. Leão, and D. Grana. Comparison of recursive neural network and Markov chain models in facies inversion. In Petroleum Geostatistics 2019, volume 2019, pages 1–5. European Association of Geoscientists & Engineers, 2019.
M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
M. Titos, A. Bueno, L. García, M. C. Benítez, and J. Ibañez. Detection and classification of continuous volcano-seismic signals with recurrent neural networks.
IEEE Transactions on Geoscience and Remote Sensing, 57(4):1936–1948, 2018.
C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal. Deep complex networks. arXiv preprint arXiv:1705.09792, 2017.
A. M. Turing. I.—Computing Machinery and Intelligence. Mind, LIX(236):433–460, 10 1950. ISSN 0026-4423. doi: 10.1093/mind/LIX.236.433. URL https://doi.org/10.1093/mind/LIX.236.433.
L. Uieda. Verde: Processing and gridding spatial data using Green's functions. Journal of Open Source Software, 3(29):957, 2018. ISSN 2475-9066. doi: 10.21105/joss.00957.
A. Valentine and L. M. Kalnins. An introduction to learning algorithms and potential applications in geomorphometry and Earth surface dynamics. Earth Surface Dynamics, 4:445–460, 2016.
M. Valera, Z. Guo, P. Kelly, S. Matz, V. A. Cantu, A. G. Percus, J. D. Hyman, G. Srinivasan, and H. S. Viswanathan. Machine learning for graph-based representations of three-dimensional discrete fracture networks. May 2017. URL http://arxiv.org/abs/1705.09866.
M. van der Baan and C. Jutten. Neural networks in geophysical applications. Geophysics, 65(4):1032–1047, July 2000. ISSN 0016-8033. doi: 10.1190/1.1444797. URL https://doi.org/10.1190/1.1444797.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
A. Waldeland, A. Jensen, L. Gelius, and A. Solberg. Convolutional neural networks for automated seismic interpretation. Lead. Edge, 37(7):529–537, July 2018. ISSN 1070-485X. doi: 10.1190/tle37070529.1. URL https://doi.org/10.1190/tle37070529.1.
A. U. Waldeland and A. Solberg. Salt classification using deep learning, 2017.
H. Wang, J. F. Wellmann, Z. Li, X. Wang, and R. Y. Liang. A segmentation approach for stochastic geological modeling using hidden Markov random fields. Math. Geosci., 49(2):145–177, Feb. 2017a. ISSN 1874-8961, 1874-8953. doi: 10.1007/s11004-016-9663-9. URL https://doi.org/10.1007/s11004-016-9663-9.
K. Wang, J. Lomask, and F. Segovia. Automatic, geologic layer-constrained well-seismic tie through blocked dynamic warping. Interpretation, 5(3):SJ81–SJ90, Aug. 2017b. ISSN 2324-8858. doi: 10.1190/INT-2016-0160.1. URL https://doi.org/10.1190/INT-2016-0160.1.
L. X. Wang and J. M. Mendel. Adaptive minimum prediction-error deconvolution and source wavelet estimation using Hopfield neural networks. Geophysics, 1992. ISSN 0016-8033. URL https://library.seg.org/doi/abs/10.1190/1.1443281.
Z. Wang, H. Di, M. A. Shafiq, Y. Alaudah, and G. AlRegib. Successful leveraging of image processing and machine learning in seismic structural interpretation: A review. The Leading Edge, 37(6):451–461, 2018.
C. J. C. H. Watkins. Learning from delayed rewards. 1989.
S. Wei, O. Yonglin, Z. Qingcai, H. Jiaqiang, et al. Unsupervised machine learning: K-means clustering velocity semblance auto-picking. 2018.
F. Wickman. Repose period patterns of volcanoes. V. General discussion and a tentative stochastic model. Arkiv för Mineralogi och Geologi, 4(5):351, 1968.
C. K. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning in Graphical Models, pages 599–621. Springer, 1998.
C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
A. Wirgin. The inverse crime. arXiv preprint math-ph/0401050, 2004.
I. H. Witten, E. Frank, and M. A. Hall. Practical machine learning tools and techniques. Morgan Kaufmann, page 578, 2005.
H. Wu and B. Zhang. A deep convolutional encoder-decoder neural network in assisting seismic horizon tracking. Apr. 2018. URL http://arxiv.org/abs/1804.06814.
P. Xie, A. Zhou, and B. Chai. The application of long short-term memory (LSTM) method on displacement prediction of multifactor-induced landslides.
IEEE Access, 7:54305–54311, 2019a.
Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le. Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252, 2019b.
X. Xie, H. Qin, C. Yu, and L. Liu. An automatic recognition algorithm for GPR images of RC structure voids. J. Appl. Geophys., 99:125–134, Dec. 2013. ISSN 0926-9851. doi: 10.1016/j.jappgeo.2013.02.016.
B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Y. Zhang and K. V. Paulson. Magnetotelluric inversion using regularized Hopfield neural networks. Geophys. Prospect., 1997. ISSN 0016-8025.
T. Zhao, F. Li, and K. Marfurt. Constraining self-organizing map facies analysis with stratigraphy: An approach to increase the credibility in automatic seismic facies classification. Interpretation, 5(2):T163–T171, May 2017. ISSN 2324-8858. doi: 10.1190/INT-2016-0132.1. URL https://doi.org/10.1190/INT-2016-0132.1.
X. Zhao and J. M. Mendel. Minimum-variance deconvolution using artificial neural networks. SEG Technical Program Expanded Abstracts, 1988. URL https://library.seg.org/doi/pdf/10.1190/1.1892433.
Z. Zhao and L. Gross. Using supervised machine learning to distinguish microseismic from noise events. In SEG Technical Program Expanded Abstracts 2017, SEG Technical Program Expanded Abstracts, pages 2918–2923. Society of Exploration Geophysicists, Aug. 2017. doi: 10.1190/segam2017-17727697.1. URL https://doi.org/10.1190/segam2017-17727697.1.
H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 5209–5217, 2017.
W. Zhu and G. C. Beroza. PhaseNet: A deep-neural-network-based seismic arrival time picking method. Mar. 2018. URL http://arxiv.org/abs/1803.03211.
R. Zuo, Y. Xiong, J. Wang, and E. J. M. Carranza. Deep learning and its application in geochemical mapping.