An introduction to distributed training of deep neural networks for segmentation tasks with large seismic datasets
Claire Birnie
KAUST, KSA (formerly Equinor ASA, Norway)
[email protected]

Haithem Jarraya
Jarraya Consultancy Ltd., London, United Kingdom

Fredrik Hansteen
Equinor ASA, Bergen, Norway

February 26, 2021

ABSTRACT
Deep learning applications are drastically progressing in seismic processing and interpretation tasks. However, the majority of approaches subsample data volumes and restrict model sizes to minimise computational requirements. Subsampling the data risks losing vital spatio-temporal information which could aid training, whilst restricting model sizes can impact model performance, or in some extreme cases, render more complicated tasks such as segmentation impossible. This paper illustrates how to tackle the two main issues of training large neural networks: memory limitations and impracticably large training times. Typically, training data is preloaded into memory prior to training, a particular challenge for seismic applications where data is typically four times larger than that used for standard image processing tasks (float32 vs. uint8). Using a microseismic use case, we illustrate how over 750GB of data can be used to train a model by using a data-generator approach which only stores in memory the data required for that training batch. Furthermore, efficient training over large models is illustrated through the training of a 7-layer UNet with input data dimensions of 4096 × 4096 (∼ M parameters). Through a batch-splitting distributed training approach, training times are reduced by a factor of four. The combination of data generators and distributed training removes any necessity of data subsampling or restriction of neural network sizes, offering the opportunity of utilising larger networks, higher-resolution input data or moving from 2D to 3D problem spaces.
The use of Deep Learning (DL) has seen a resurgence in its application to geophysical problems over the past decade. Last century's investigations into the potential benefits of DL methodologies were hampered by technological limitations [1]. Nowadays, access to reasonably powerful compute is freely available, with certain cloud providers even offering free GPU provisions in their experimentation environments, for example Colab. Alongside this, "tech giants" have open-sourced deep-learning packages en masse, such as Google's TensorFlow package [2] and Facebook's PyTorch package [3]. These advancements have significantly lowered the bar for incorporating DL approaches into research projects and, as such, have contributed to the surge in development of deep learning applications for the geoscience domain. Furthermore, training data and pretrained models have become increasingly more available. Whilst DL methodologies have seen a resurgence across all fields of seismology, and wider geoscience applications, the use of computer vision procedures in particular has been shown to be incredibly useful for seismic processing and interpretation problems, where the 'input' data can be treated as an image. [4] illustrated the use of Cycle Generative Adversarial Networks for groundroll suppression in land seismic data, whilst [5] illustrated the potential of Convolutional NNs (CNNs) for seismic denoising of random and linear noise signals, as well as multiple suppression. The use of Neural Networks (NNs) for interpretation of seismic cubes has been extensively tested over the last five years, with promising results being offered from many different approaches varying in preprocessing, NN architecture and postprocessing.
For example, [6] investigate the use of a growing NN for an unsupervised clustering procedure to accelerate seismic interpretation, whilst [7] illustrated the use of a CNN for identification of faults within a 2D window from a seismic volume.

DL is not just making waves in the active seismic community; it has also begun making headway in passive seismic applications through the introduction of new, more reliable procedures for event detection. From a single-station viewpoint, i.e., where traces are handled independently of one another, Recurrent NNs have been shown to be particularly powerful in offering an alternative to the commonly used short-time average, long-time average detection procedure, for example [8, 9]. From an array point of view, both [10] and [11] have illustrated how CNNs can be used for detecting an event's arrival within a certain time-space bounding box.

Despite great advancements being made on tailoring NN architectures for geophysical applications, one large drawback remains: training of large NNs is memory and time expensive. As such, the majority of deep learning applications for seismic datasets require subsampling of the data [12]. A solution to this is to train NNs in a distributed manner. Using passive monitoring as a use case, this paper walks through the design, implementation and deployment of a deep learning problem that leverages the ability to distribute the NN training, allowing efficient training of a large NN (∼ M trainable parameters) with a large (>750GB) training dataset.
Similar to the development of many processing, imaging and inversion algorithms, in this study our approach is developed on synthetic data and tested on a field dataset. The field dataset comes from a PRM system deployed on the seabed at the Grane field in the Norwegian sector of the North Sea. The PRM system consists of 3458 sensors, 3-component geophones with a hydrophone, arranged in a pseudo-gridded style with a sparser "crossline" backbone as illustrated in Figure 1. The receiver spacing is approximately 50m along the cables (inline) and 300m between the cables (crossline). Continuously recording at a 500Hz sampling rate, almost 2.4TB of passive seismic data are collected every day.

The system is primarily used for reservoir and overburden monitoring with active seismic surveys. However, it has also been shown to provide invaluable additional information when used for passive monitoring, for example drill-bit localisation during drilling campaigns [13] and interferometric velocity modelling [14].

To date, no seismic events due to subsurface movement have been recorded. However, in the summer of 2015, during a drilling campaign, energy waves resulting from a liner collapse were captured in the seismic data. An in-depth analysis of this event was performed by [15] using a subset of the receivers. The z-component of this event, hereinafter referred to as the G8-event, is illustrated in Figure 2 and used in this study for benchmarking of the developed ML detection procedure.
Defining a clear problem statement is fundamental for the development of any new algorithm, whether ML related or not. For the passive monitoring scenario, the problem statement we investigate in this paper is how to develop a real-time event detection procedure that utilises the full array. Two other key elements in the development of ML approaches are the training dataset and the model architecture. Below we discuss in detail how the problem is set up, how training data is chosen and how the model architecture is adapted for the use case.
For the microseismic scenario, events are typically below an SNR of one and therefore many standard processing measures leverage the additional spatio-temporal information that can be captured by using array processing procedures as opposed to trace-by-trace methods. For example, there are a number of different stacking procedures that have been shown to improve detection by increasing the SNR, such as envelope stacking [16] or semblance stacking [17].

Figure 3 illustrates how microseismic event detection can be considered as a computer vision task, whether as a classification, object detection, or image segmentation task. Considering the full array, the identification of the signal within a certain time window can be considered as an image segmentation task where each pixel represents a single point in time, t, and space, x. Therefore the task is to determine for each pixel in the image whether it contains a seismic event or not, i.e., a binary classification per pixel.

Figure 1: Array information for the permanent reservoir monitoring system deployed over the Grane field in the Norwegian North Sea. The array is separated into "inline" sections as represented by the colourscale. (a) illustrates the array geometry overlaid on the field's polygon with the black triangle indicating the location of the platform. (b) details the number of sensors per line, whilst (c) details the distribution in distances between neighbouring sensors per "inline".
Figure 2: Bandpassed data of the G8 event recorded over the full PRM array (a). The blue box in the center corresponds to the zoomed-in data segment shown in (b), highlighting the event arrival at the same time as an onset of platform noise. The red box corresponds to the zoomed-in data segment shown in (c) from a quieter section of the array.
Figure 3: Schematic illustrating how microseismic event detection can be considered as a computer vision task, either as full image classification, object detection, or image segmentation.

Figure 4: Schematic illustrating possible approaches to windowing of the seismic data prior to developing DL models.

Sliding window approaches have proved very popular in previous image segmentation tasks on post-stack seismic data. This works particularly well due to the uniform sampling in processed seismic sections from active acquisitions, meaning that all windows, whether 2D or 3D, maintain the same distance between samples. However, this is not the case when working with pre-migrated data, as is often the case in passive monitoring. Figure 4 Scenario A offers an impression of how a rudimentary spatio-temporal windowing procedure, the most commonly applied in seismic DL applications, could be implemented for raw passive data on a pseudo-gridded geometry analogous to the Grane geometry. For this approach, one must only consider/optimise the number of stations to include in the window and the time range to span. However, due to the irregular spacing between receivers, there is little consistency in the relationship between event arrivals across the different windows.

Scenario B offers a number of more sophisticated alternatives to Scenario A. As illustrated in Figure 4, receiver groups could be selected in multiple ways: by inline grouping, a radius-based approach from central receivers, or a
nearest-neighbour approach. A number of design decisions must be considered with these approaches: the number of receivers per group; the number of models to be created (e.g., one per group); how to handle over-utilisation of receivers where they fall into multiple groups; as well as the obvious, which grouping method to use. For the inline and radius-based approaches the number of stations would change between each window, therefore requiring a different NN model per group. Fixing the number of 'neighbours', as illustrated by the neighbour-based approach, would remove the complication of varying input dimensions; however, it would still introduce inconsistencies in the spatial distribution of arrivals, particularly at the edges of the array.

Figure 5: Workflow of the generation and labelling of synthetic data.

The alternative to splitting the data is to develop an image segmentation procedure that uses all 3458 sensors simultaneously. This removes the complications of determining the optimal receiver groupings (and number of models); however, it introduces computational complexities due to the size of each data "observation". To provide a comparison, most image recognition tasks utilise input dimensions of 256 × 256 [18]. Other DL applications on seismic data have ranged from input windows of 24 × 24 [19] to 100 × 100 [20] to 128 × 128 × 128 [21]. An input covering all 3458 sensors over 4096 time samples is substantially larger than most input dimensions; as such, the remainder of the paper will focus on how to efficiently train NNs with large input dimensions.
In the seismic space there are three main options for gathering training data: field data collection, laboratory-created data, or synthetically generated data. A good training dataset must have a large volume of data available, be similar to the data onto which the trained model will be applied, and be simple to label. Largely to avoid the tedious annotation procedure typically associated with supervised learning approaches, in this study synthetic datasets were generated for training the model. Historically, synthetic datasets have been heavily utilised in the development and benchmarking of new algorithms, and the importance of using realistic synthetics to accurately depict how an algorithm will perform on field data cannot be overstated [22]. Similarly, to train an ML model that is robust for application to field data, the training data must provide an efficient representation of the variety of waveforms and noises that exist in such recordings, at a reasonable creation speed. In this section we discuss how we have generated a diverse dataset of realistic synthetic seismic recordings for training and evaluation purposes.

Using travel times and the standard convolutional modelling approach, synthetic datasets are generated using the workflow illustrated in Figure 5. First, the source location is randomly selected from a cube in the subsurface centered around the top of the reservoir. The source parameters (wavelet type, frequency content, and SNR) are also randomly selected. The wavelet is then generated and the wavefield data is created via convolutional modelling, with a scaler accounting for amplitude decay due to geometrical spreading.

Noise is an ever-persistent challenge in seismic field data handling.
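The traveltime-plus-convolution modelling step can be sketched as below. The Ricker wavelet, the fixed parameter values and the function names are illustrative assumptions only; the paper draws wavelet type, frequency content and SNR at random per sample.

```python
import math

def ricker(freq, dt, n):
    """Zero-phase Ricker wavelet of peak frequency freq (Hz), sampled at dt (s)."""
    return [(1 - 2 * (math.pi * freq * t) ** 2) * math.exp(-(math.pi * freq * t) ** 2)
            for t in (dt * (i - n // 2) for i in range(n))]

def model_trace(traveltime, distance, freq=20.0, dt=0.002, nt=4096):
    """Convolutional modelling of one trace: a spike at the traveltime,
    convolved with a Ricker wavelet and scaled by 1/r for geometrical spreading."""
    trace = [0.0] * nt
    spike = int(round(traveltime / dt))
    wav = ricker(freq, dt, 101)
    amp = 1.0 / max(distance, 1.0)  # spreading scaler (hypothetical form)
    for i, w in enumerate(wav):
        j = spike + i - 50  # centre the wavelet on the arrival sample
        if 0 <= j < nt:
            trace[j] += amp * w
    return trace
```

In the full workflow one such trace would be modelled per receiver, using the traveltime from the randomly drawn source location to that receiver.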
To make the synthetics representative of field data, synthetic coloured noise models are generated using statistics observed from previously collected passive recordings. The frequency spectrum of the recordings is grouped into 5Hz bands representing the percent of total energy within each band. This is used to scale the coloured noise model such that it has a similar frequency content to recorded noise, similar to the approach of [23]. The coloured noise model is then scaled spatially to represent the spatial distribution in energy typically observed on the array, e.g., higher amplitudes in the vicinity of the production platform.

As well as forming the base of the synthetic seismic dataset, the wavefield data is used to generate the matching "label" dataset for training and evaluation purposes. As event detection is a binary classification, the labels are either zero or one, where one indicates that a wavefield of interest is present. An event's arrival is labelled anywhere the wavefield energy is greater than a specified amount, depending on the wavelet type and frequency content.
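The per-pixel labelling can be sketched as below, assuming a single fixed energy threshold for simplicity (in the paper the threshold depends on the wavelet type and frequency content):

```python
def label_section(wavefield, threshold):
    """Binary segmentation labels for a modelled section: 1 where the
    absolute wavefield amplitude exceeds the threshold, 0 elsewhere.
    wavefield is a list of traces, each a list of samples."""
    return [[1 if abs(sample) > threshold else 0 for sample in trace]
            for trace in wavefield]
```

Applied to the noise-free wavefield (before noise is added), this yields a label image of the same dimensions as the input section.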
Figure 6: Seven-layer UNet architecture.

To simplify experimentation with the NN architecture, to be discussed below, the length of each synthetic dataset is 4096 time samples, which equates to 8.192 seconds given a 2ms sampling interval. Assuming the energy bands for the noise spectrum and the array geometry are preloaded, the generation procedure for a single data sample takes 1.7 seconds from start to end (when computed on a 2.9GHz, 6-core Intel Core i9 machine with 32GB RAM).
The U-Net architecture of [24] has become the workhorse for most image segmentation tasks on seismic data, following on from its successful application for image segmentation in medical imaging. The standard U-Net architecture follows the form of a contracting (left) path and an expansive (right) path as illustrated in Figure 6. The contracting path has the ability to capture context and consists of repeated blocks of two 3 × 3 convolutions, each followed by a ReLU activation, and a 2 × 2 max-pooling operation; the expansive path mirrors this with up-convolutions and concatenation of the corresponding contracting-path features. As [24] note: "To allow a seamless tiling of the output segmentation map ..., it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size." When halved, 3458, the number of sensors in the Grane PRM system, becomes an odd number (1729); therefore it is not possible to build a U-Net without altering the input dimensions. An additional 638 null traces were added to the array such that the input dimension became 4096, a power of two, meaning it can be divided by two all the way down to one. These input images are now orders of magnitude larger than those of [24]'s study, whose experiment used images of 512 × 512 pixels.

In the original U-Net study, four layers were utilised, reducing the data dimensionality down to 32 at the base of the NN. For the Grane example, an additional three layers are required to reduce the data down to the same dimensions. For the convolution steps we begin with four filters at the top layer, multiplying by a factor of two at each reduction step. With the incorporation of the additional layers and following this filter methodology, the resulting model has ∼ M trainable parameters.
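The even-size condition quoted above can be checked mechanically. This hypothetical helper walks the max-pooling path and shows why the raw 3458 traces fail, whilst the padded 4096 halves cleanly down to 32 over the seven layers:

```python
def pooling_path(size, n_layers):
    """Trace the sizes produced by repeated 2x2 max pooling, raising if a
    layer would receive an odd size (the even-size condition of the U-Net)."""
    sizes = [size]
    for _ in range(n_layers):
        if sizes[-1] % 2:
            raise ValueError(f"odd size {sizes[-1]}: cannot pool evenly")
        sizes.append(sizes[-1] // 2)
    return sizes

print(pooling_path(4096, 7))  # → [4096, 2048, 1024, 512, 256, 128, 64, 32]
# pooling_path(3458, 7) raises at 1729, hence the padding with 638 null traces.
```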
The large dimensions of the data are not the only "size" complexity arising in this use case, due to the data types involved. Typically, images are stored with a uint8 data type whilst seismic data is stored as float32. Therefore, a seismic section with the same dimensions as an image is four times larger, impacting memory requirements for NN training. This complexity presents a challenge when loading data into memory for training the NN. For the majority of image segmentation tasks the full training set is loaded into memory prior to training. In this experiment, each labelled seismic section is 108 MB, therefore it is not feasible to load the full training set into memory.

TensorFlow's dataset functions offer a manageable solution to the memory limitation challenges encountered due to the data size. These allow the storing of only the required data samples per step, therefore removing any necessity to reduce the size of the model or the input data dimensions.
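The factor-of-four penalty of float32 over uint8 is simple arithmetic. The sketch below counts only raw samples for the padded 4096 × 4096 section; the 108 MB figure quoted above presumably also covers the accompanying label data and storage overhead.

```python
def section_megabytes(n_traces, n_samples, bytes_per_value):
    """Raw in-memory size of one seismic section, in MB."""
    return n_traces * n_samples * bytes_per_value / 1e6

float32_mb = section_megabytes(4096, 4096, 4)  # seismic section stored as float32
uint8_mb = section_megabytes(4096, 4096, 1)    # same dimensions as a uint8 image
print(f"float32: {float32_mb:.1f} MB, uint8: {uint8_mb:.1f} MB")
# → float32: 67.1 MB, uint8: 16.8 MB
```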
Figure 7: Comparison between a single process for training vs a distributed process using a data parallelism strategy with 4 workers. The evaluator node is not illustrated.

In the data creation section above we argued for the use of synthetic datasets for training purposes. However, there are two approaches to how this can be implemented. Firstly, data can be pre-made, written to file and read in as needed. Alternatively, a data generator can be implemented that creates data on-the-fly. For this specific use case, we calculated that the first option would take ∼3 hours of generation time and ∼750GB of storage, additionally taking 2 seconds per file to be read in, assuming the data is stored as a TensorFlow Tensor. The second option has the advantage that no additional storage is required; however, the data would need to be re-generated every epoch. In this case, the generation time is similar to the loading time and as such there is little difference in the processing time of either approach (considering only the reading time for the first approach). Therefore, due to the lowered storage requirements, we chose to implement the second approach of generating the data on-the-fly. The data generator was seeded with the sample number such that the same data was generated per epoch and could be replicated at any future point.
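The on-the-fly option can be sketched as a plain Python generator, seeded with the sample number so that every epoch (and any future re-run) reproduces identical data. The drawn parameters below are placeholders; in practice such a generator would be wrapped with TensorFlow's `tf.data.Dataset.from_generator` so that only the current batch is ever held in memory.

```python
import random

def sample_generator(n_samples):
    """Yield synthetic samples on-the-fly; seeding the RNG with the sample
    number makes the stream fully reproducible across epochs and machines."""
    for i in range(n_samples):
        rng = random.Random(i)  # seeded with the sample number
        # Placeholder draws standing in for the randomly selected source
        # parameters described above (wavelet frequency, SNR, location):
        params = {
            "frequency_hz": rng.uniform(10.0, 40.0),
            "snr": rng.uniform(0.1, 2.0),
            "depth_m": rng.uniform(1500.0, 2000.0),
        }
        yield params  # in practice: the modelled section and its label image
```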
The model has ∼ M trainable parameters, with 6000 seismic sections per epoch and an additional 1000 samples generated for validation. Using a single machine with a large GPU*, a single training sample takes ∼ s; therefore one epoch, excluding validation, on a single GPU machine takes ∼ hours.

Parallelisation of the training regime can drastically decrease the total training time and is a functionality available in both of the two biggest machine learning Python libraries: TensorFlow and PyTorch. In this example, we use a batch-splitting (data parallelism) approach implemented using TensorFlow Estimators with 4 workers, as illustrated in Figure 7. A separate evaluator node is also added to our resource pool such that training is not paused during the validation steps. We follow a synchronous updating procedure, requiring each worker to complete its batch and return weight updates to the chief before workers can begin on the next batch of training samples. Utilising 4 workers of the same specs as the GPU machine in the serial example, with an additional evaluator node, training time for one epoch is reduced by approximately a factor of four. Note, some additional compute time is introduced due to both communication and waiting (due to the synchronous training mode).

The training is run on cloud resources and orchestrated using Kubernetes. The training scripts were written and tested locally on small, dummy datasets before being incorporated into a custom Docker image. A cluster of cloud compute resources was commissioned, in this case five GPU machines with the specs as previously described. A fileshare containing the necessary files for the synthetic data creation (geometry and noise energy frequency bands) was mounted to the resources, allowing access to the files as if they were locally stored. Distributed training is initialised by applying a Kubernetes .yaml file to the cluster.
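Under this setup, each TensorFlow Estimator process learns its role in the cluster from the `TF_CONFIG` environment variable. A minimal sketch for the one-chief, three-worker, one-evaluator layout described above follows; the pod addresses are hypothetical, in practice being assigned by Kubernetes.

```python
import json

# Hypothetical pod addresses; Kubernetes assigns the real ones.
CLUSTER = {
    "chief": ["chief-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"],
    "evaluator": ["evaluator-0:2222"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG JSON string for one pod in the cluster."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })

# Each pod would export this before launching the training script, e.g.
# os.environ["TF_CONFIG"] = tf_config_for("worker", 1)
```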
The .yaml contains all the necessary information regarding file paths, the number of resources to use for training and validation, as well as additional Python inputs such as the number of training samples per epoch, snapshot frequency, range of synthetic parameters, etc. Once the Kubernetes job has been initiated, the required number of pods are created, in our case one chief, three additional training pods and an evaluator, and the training job begins.

The model is saved at every checkpoint during the training procedure, allowing analysis of the model whilst training is ongoing. Training ran for approximately 6 days, covering 12 epochs (i.e., 18000 training steps of 4 samples each), before the model was deemed sufficiently trained via a qualitative analysis of detection performed on newly created
(i.e., blind) synthetic recordings.

Figure 8: Progression of the model accuracy and loss during training.

Figure 8 illustrates the progression of the model's accuracy and loss with respect to the evaluation dataset, as well as the chief's loss, over the training period.
Once sufficiently trained, a number of new synthetic datasets, covering a range of different event locations, were generated that the model had not been exposed to during the training period. Figure 9 illustrates the performance of the trained network in predicting the event arrival for three events of the same magnitude (SNR=0.4): one to the NorthEast of the array, one below the center of the array, and one to the SouthWest. Whilst the moveout patterns are significantly different, the network manages to accurately detect the arrivals. Figure 10 zooms in on the recordings from different sections of the receiver array, illustrating how the detection procedure accurately handles the varying amplitude of arrivals across the array as well as the varying local moveouts. In both Figures 9 and 10, there is little-to-no additional noise in the detection arising from the heightened noise levels around the platform site.

A similar analysis is run to assess the sensitivity of the trained segmentation model to varying SNRs. Figure 11 illustrates how the detection procedure can handle low-SNR events. As expected, decreasing the SNR of arrivals results in increasing noise in the detection procedure. Down to an SNR of 0.2 the arrival shape is clearly visible within the prediction section; however, at an SNR of 0.1 the event arrivals are no longer easily identifiable.

To ensure the trained model is applicable to field data, it is applied to the previously described G8 event. Figure 12 displays the 8-second seismic recording of the event alongside the UNet predictions. The blue box highlights the arrival on a particularly noisy receiver grouping, whilst the red box indicates the arrival on a quieter group of receivers at the edge of the array. The event is clearly detected across the majority of receivers without detecting the platform noise that begins halfway through the recording.
The aim of this study was to investigate possible methodologies for the training and application of large, deep NNs on seismic datasets without the requirement of subsampling or windowing. The ability to handle larger input data dimensions as well as train larger models offers the opportunity of capturing additional spatio-temporal information from the seismic data, a well-documented approach for enhancing SNR. The solution design section of this paper, in particular Figure 4, highlighted the complications in developing a generic model to be applied on either receiver lines or by windowing the array, for this specific use case. As such, the simplest path was to process the full array in one go and leverage technological advances to allow the training of such a large model.

There also exist a number of other use cases which naturally permit windowing but may benefit from using larger windows. Fault detection is one such task that is often reduced to a 2D problem despite the 'original' 3D subsurface data volume. For example, [20] extract 2D slices from a 3D seismic cube, explicitly stating: "The dimension reduction from 3D to 2D is to reduce the time to train the CNN." Similarly, [19] provide 24 × 24 images with an inline, crossline and time section as input channels to a 2D NN, rendering the problem pseudo-3D. However, they stop short of utilising a full 3D input. Distributing the training allows the possibility of using larger input data dimensions
Figure 9: UNet detections on synthetic seismic events with source origins in different subsurface locations: NorthEast of the array, below the center of the array and to the SouthWest of the array, as illustrated by red crosses in the array map. The top panel shows the synthetic data, the middle panel shows the labels corresponding to the synthetics and the bottom panel shows the UNet's detection.
(increasing window sizes or adding an additional dimension), therefore either capturing a larger spatio-temporal area or offering the opportunity to use higher-resolution data.

Figure 10: Magnified results from the event below the center of the array as illustrated in Figure 9. The first row comes from receiver lines in the West of the array, the middle row includes the two inlines closest to the platform in the center of the array, and the bottom row includes receivers in the furthest East lines.

A trade-off can occur between input data dimensions and model size where, as opposed to subsampling data, a smaller, simpler model is used. For example, in the microseismic event detection use case both [10] and [11] have trained CNNs to detect a time-space box in which an arrival is detected. The smaller computational requirement allows for a faster training procedure; however, less information can be derived from the models' predictions. For object detection the returned information is that of a bounding box with the same "arrival time" for all receivers, as opposed to segmentation procedures which detect arrival times per trace. [21] provide another example of where a smaller network has been utilised. They used a simplified UNet with a reduced number of layers for a 3D fault detection procedure, which allowed for significant "savings in GPU memory and computational time". The procedure for efficient training detailed in this paper provides the opportunity to increase model dimensions whilst still keeping a reasonable training time.

It should be noted that not all deep learning applications on seismic data require subsampling. For example, should the same segmentation procedure developed in this paper be adapted for a different, smaller permanent array, such as the 50-receiver array at Aquistore [25], the input dimensions would be smaller than those used in the original UNet implementation, rendering any discussion on subsampling unnecessary. However, such use cases are becoming rarer, particularly with the adoption of densely sampled fiber optic cables for permanent monitoring.
Figure 11: SNR investigation on the performance of the trained UNet, with the event always originating from the same subsurface location.
Figure 12: UNet event detection on the Grane G8 liner collapse event. The blue box in the center corresponds to the zoomed-in data segment and detections shown in the bottom left column, highlighting the event arrival at the same time as an onset of platform noise. The red box corresponds to the zoomed-in data segment and detections shown in the bottom right column, from a quieter section of the array.
The success of a model is highly dependent on its training data, and the use of synthetic datasets for training has become commonplace in seismic deep learning procedures, for example [26, 27, 21, 28]. In traditional synthetic data usage for developing and benchmarking algorithms, it has been shown that the more realistic the synthetic data, the better for understanding uncertainties and identifying pitfalls [22]. However, there is a trade-off between similarity to field data and computational cost, which is particularly applicable when developing the large volume of datasets required for training deep learning models. In this use case, we found that generating the waveform data via a wave propagation procedure would be too computationally expensive for what we classified as a reasonable generation time: sub two seconds. As such, we used a simple convolutional modelling procedure incorporating geometric spreading and assuming a homogeneous velocity model for the traveltime computations. Similarly, for the incorporation of noise in the dataset, the generation of realistic noise models (i.e., non-stationary, non-Gaussian, non-white noise), such as via a covariance-based approach [29], was deemed too time-consuming. Therefore, an approach similar to that of [23] is used, which generates a stationary noise model that accurately replicates the frequency content of recorded noise. As of yet, no analysis has been published showing the trade-off between the complexity/realism of synthetics and the performance of the trained model. In this use case the resulting network produces acceptable predictions on the field dataset, but more testing is required to fully assess the model's performance on a wider variety of field data with varying noise and event properties.

Despite many advancements in detection algorithms over the years, [30] highlighted how computational cost is a big barrier preventing the majority of these algorithms from making it into a production toolbox.
One of the key criteria of such a detection algorithm is its real-time applicability. Whilst the training took 6 days utilising four GPU machines, detection can be performed in under 3 seconds for an 8-second recording segment on a 2.9GHz, 6-core Intel Core i9 machine with 32GB RAM. Therefore, once trained, the model can be used for real-time monitoring applications without any requirement of large computational resources or parallelisation across multiple machines.
The majority of deep learning applications for seismic data involve subsampling or windowing of the dataset. In this paper, we have illustrated how, through the distribution of training, larger networks can be efficiently trained, removing the need for subsampling and/or windowing. Illustrated on a microseismic monitoring use case, the paper walks through the stages of a deep learning project, from synthetic training data creation, to adapting a standard model architecture, to distributed model training, and finally to model evaluation using both synthetic and field datasets. Whilst illustrated on a scenario where data windowing is non-trivial, the benefits of not windowing data, or of using larger windows than previously possible, have great potential for other segmentation tasks such as fault and horizon detection.
The authors would like to thank the Grane license partners Equinor Energy AS, Petoro AS, Vår Energi AS, and ConocoPhillips Skandinavia AS for allowing us to present this work. The views and opinions expressed in this abstract are those of the Operator and are not necessarily shared by the license partners. The authors would also like to thank Ahmed Khamassi and Florian Schuchert for their invaluable support on the data science elements of this project, as well as Marianne Houbiers for her insightful discussions on the application of DL for passive monitoring.
References

[1] Jeff Dean, David Patterson, and Cliff Young. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro, 38(2):21–29, 2018.
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[3] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[4] Harpreet Kaur, Sergey Fomel, and Nam Pham. Seismic ground-roll noise attenuation using deep learning. Geophysical Prospecting, 68(7):2064–2077, 2020.
[5] Siwei Yu, Jianwei Ma, and Wenlong Wang. Deep learning for denoising. Geophysics, 84(6):V333–V350, 2019.
[6] Kamal Hami-Eddine, Bruno de Ribet, Patrick Durand, and Patxi Gascue. A growing machine learning approach to optimize use of prestack and poststack seismic data. In SEG Technical Program Expanded Abstracts 2017, pages 2149–2152. Society of Exploration Geophysicists, 2017.
[7] Xinming Wu, Yunzhi Shi, Sergey Fomel, and Luming Liang. Convolutional neural networks for fault interpretation in seismic images. In SEG Technical Program Expanded Abstracts 2018, pages 1946–1950. Society of Exploration Geophysicists, 2018.
[8] Jing Zheng, Jiren Lu, Suping Peng, and Tianqi Jiang. An automatic microseismic or acoustic emission arrival identification scheme with deep recurrent neural networks. Geophysical Journal International, 212(2):1389–1397, 2018.
[9] C. Birnie and F. Hansteen. Bidirectional recurrent neural networks for seismic event detection. Submitted to Geophysics, 2020.
[10] Anna L Stork, Alan F Baird, Steve A Horne, Garth Naldrett, Sacha Lapins, J-Michael Kendall, James Wookey, James P Verdon, Andy Clarke, and Anna Williams. Application of machine learning to microseismic event detection in distributed acoustic sensing data. Geophysics, 85(5):KS149–KS160, 2020.
[11] Benjamin Consolvo and Michael Thornton. Microseismic event or noise: Automatic classification with convolutional neural networks. In SEG Technical Program Expanded Abstracts 2020, pages 1616–1620. Society of Exploration Geophysicists, 2020.
[12] Stephen Alwon. Generative adversarial networks in seismic data processing. In SEG Technical Program Expanded Abstracts 2018, pages 1991–1995. Society of Exploration Geophysicists, 2018.
[13] M Houbiers, S Bussat, and F Hansteen. Real-time drill bit tracking with passive seismic data at Grane, offshore Norway. In , volume 2020, pages 1–5. European Association of Geoscientists & Engineers, 2020.
[14] Xin Zhang, Fredrik Hansteen, Andrew Curtis, and Sjoerd de Ridder. 1D, 2D and 3D Monte Carlo ambient noise tomography using a dense passive seismic array installed on the North Sea seabed. Journal of Geophysical Research: Solid Earth, 2019.
[15] S Bussat, M Houbiers, and Z Zarifi. Real-time microseismic overburden surveillance at the Grane PRM field offshore Norway. First Break, 36(4):63–70, 2018.
[16] Hom Nath Gharti, Volker Oye, Michael Roth, and Daniela Kühn. Automated microearthquake location using envelope stacking and robust global optimization. Geophysics, 75(4):MA27–MA46, 2010.
[17] Kit Chambers, J Kendall, Sverre Brandsberg-Dahl, Jose Rueda, et al. Testing the ability of surface arrays to monitor microseismic activity. Geophysical Prospecting, 58(5):821–830, 2010.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In , pages 248–255. IEEE, 2009.
[19] Yue Ma, Xu Ji, Nasher M BenHassan, and Yi Luo. A deep learning method for automatic fault detection. In SEG Technical Program Expanded Abstracts 2018, pages 1941–1945. Society of Exploration Geophysicists, 2018.
[20] Bowen Guo, Lu Li, and Yi Luo. A new method for automatic seismic fault detection using convolutional neural network. In SEG Technical Program Expanded Abstracts 2018, pages 1951–1955. Society of Exploration Geophysicists, 2018.
[21] Xinming Wu, Luming Liang, Yunzhi Shi, and Sergey Fomel. FaultSeg3D: Using synthetic data sets to train an end-to-end convolutional neural network for 3D seismic fault segmentation. Geophysics, 84(3):IM35–IM45, 2019.
[22] Claire Birnie, Kit Chambers, Doug Angus, and Anna L Stork. On the importance of benchmarking algorithms under realistic noise conditions. Geophysical Journal International, 221(1):504–520, 2020.
[23] RG Pearce and BJ Barley. The effect of noise on seismograms. Geophysical Journal International, 48(3):543–547, 1977.
[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[25] AL Stork, CG Nixon, CD Hawkes, C Birnie, DJ White, DR Schmitt, and B Roberts. Is CO2 injection at Aquistore aseismic? A combined seismological and geomechanical study of early injection operations. International Journal of Greenhouse Gas Control, 75:107–124, 2018.
[26] Lei Huang, Xishuang Dong, and T Edward Clee. A scalable deep learning platform for identifying geologic features from seismic attributes. The Leading Edge, 36(3):249–256, 2017.
[27] Nam Pham, Sergey Fomel, and Dallas Dunlap. Automatic channel detection using deep learning. Interpretation, 7(3):SE43–SE50, 2019.
[28] Augusto Cunha, Axelle Pochet, Hélio Lopes, and Marcelo Gattass. Seismic fault detection in real data using transfer learning from a convolutional neural network pre-trained with synthetic seismic data. Computers & Geosciences, 135:104344, 2020.
[29] Claire Birnie, Kit Chambers, Doug Angus, and Anna Stork. Analysis and models of pre-injection surface seismic array noise recorded at the Aquistore carbon storage site. Geophysical Journal International, 2016.
[30] Robert J Skoumal, Michael R Brudzinski, and Brian S Currie. An efficient repeating signal detector to investigate earthquake swarms.