Reconstructing missing seismic data using Deep Learning
Dieuwertje Kuijpers and Ivan Vasconcelos (Utrecht University, Utrecht, the Netherlands); Patrick Putzky (AMLab, University of Amsterdam, Amsterdam, the Netherlands)
Abstract
In current seismic acquisition practice, there is an increasing drive for data to be acquired sparsely in space, and often in irregular geometry. These surveys can trade off subsurface information for efficiency/cost, creating a problem of "missing seismic data" that can greatly hinder subsequent seismic processing and interpretation. Reconstruction of regularly-sampled dense data from highly sparse, irregular data can therefore aid in the processing and interpretation of these far sparser, more efficient seismic surveys. Here, we compare two methods to solve the reconstruction problem in both the space-time and wavenumber-frequency domain. Both of these methods require an operator that maps sparse to dense data: this operator is generally unknown, being the inverse of a known data sampling operator. As such, here our deterministic inversion is efficiently solved by least-squares optimisation using a numerically efficient Python-based linear operator representation. An alternative method is the probabilistic approach that uses deep learning. Here, two specific deep learning architectures are benchmarked against each other and the deterministic approach: a Recurrent Inference Machine (RIM), which is designed specifically to solve inverse problems given known forward operators, and the U-Net, originally designed for image segmentation tasks. The trained deep learning networks are capable of successfully mapping sparse to dense seismic data for a range of different datasets and decimation percentages, thereby significantly reducing spatial aliasing in the wavenumber-frequency domain. The deterministic inversion, on the contrary, could not reconstruct the missing data in the space-time domain and thus did not reduce the undesired spatial aliasing. Our results show that the application of Deep Learning for seismic reconstruction is promising, but the treatment of large-volume, multi-component seismic datasets will require dedicated learning architectures not yet realisable with existing tools.

1 - Introduction

Efficient and cost-effective data acquisition is, together with streamlined data processing, of crucial importance in seismic imaging, from exploration to the global scale. In the example of exploration surveys, acquisition is designed to sample data at a set Nyquist rate (or higher), driving costs to be very high and the duration to often be very long. In principle, a more beneficial acquisition model would be to use fewer sources and/or receivers, yet still maintain the same information content as a more conventional high-density, regularly-sampled setup. However, on its own, sparse, irregular acquisition results in missing data/information due to sparser sampling (i.e. sub-Nyquist sampling). Missing seismic data, either due to sparser sampling or irregularities, can greatly hinder accurate processing and interpretation. For example, Peng and Vasconcelos (2019) find that missing seismic data in either the source or receiver domain, or both domains, can lead to different types of artifacts and data gaps after using the sparse datasets for Marchenko methods. The reconstruction of dense, regularly sampled wavefields from highly sparse, (ir)regular data can therefore play a critical role in achieving better processing and interpretation from far sparser, more efficient seismic surveys.

Several methods exist to solve this reconstruction problem. These methods can broadly be divided into two groups: deterministic and probabilistic.
Most often the reconstruction problem is solved using deterministic, iterative linear solvers. Ruan and Vasconcelos (2019), for example, find that the sampling rate in seismic acquisition can be decimated further than the Nyquist rate by means of preconditioning and compressive sensing techniques in the presence of acquired data gradients. Using a multi-component reconstruction theorem that includes the acquired data, the first- and second-order spatial derivatives plus the cross-derivatives in the shot and receiver domains, Ruan (2019) can successfully reconstruct regularly decimated 3D seismic data with one-third of the original Nyquist rate using a gradient-based, sparsity-promoting solver. When using an irregular sampling scheme as proposed by Hennenfent and Herrmann (2008), Ruan (2019) can decimate the sample rate even further. One major requirement for this method is the need for spatial derivatives of the data in the inversion: in practice, this would mean that data are acquired far more sparsely, but each data station contains many channels due to the multi-component nature of gradient data. For example, in offshore seismic, derivatives of the wavefield can be measured if particle-velocity measurements are available, something that is often not the case for vintage seismic data and also presents technological challenges in practice, such as the engineering of source-side derivatives, or higher-order derivatives on either the source or receiver side.

The interest in machine learning solutions to inverse (seismic) problems is growing; the reconstruction problem provides an attractive application because the underlying forward operators are computationally inexpensive. For deterministic approaches, however, achieving accurate solutions to data reconstruction can be quite challenging. Recently, Siahkoohi et al. (2018) addressed the use of generative adversarial networks (GANs) to learn a map from sparsely to fully sampled seismic data. With the use of their trained GAN, Siahkoohi et al. (2018) are able to reconstruct 90 percent of the missing seismic data in the frequency domain under different types of frequency-domain decimation, as long as at least 5 percent of the data in that particular frequency slice was densely sampled. Seismic acquisition, however, is done in the spatial domain, and thus the decimation also takes place in the spatial domain.

This research focuses on reconstructing dense seismic wavefields from spatially decimated data using deep learning, by means of the so-called Recurrent Inference Machine (RIM) deep learning architecture designed by Putzky and Welling (2017). We test the potential of using RIMs in seismic processing problems where determining a complex inverse map to a known forward problem is the main goal. The RIM will be benchmarked against the U-Net deep learning architecture (originally designed for biomedical image segmentation; Ronneberger et al. (2015)) and will be compared to deterministic linear iterative methods.

Deep learning mainly consists of two stages. The first stage is the training stage, in which the neural networks have access to an input and expected output. Based on the input, the network has to make a prediction that should be as close as possible to the expected output. The misfit between the prediction and the expected output can be backpropagated through the network, thereby updating its internal state in order to make a better prediction for the next example. After a period of training, the neural nets enter the inference stage.
In this stage the network will only have access to input data that it has never seen before. From this input the network should make a prediction. Here, the reconstruction problem will be studied and the neural networks will estimate a map between the decimated and dense seismic wavefields, in which deep learning can be seen as an approach to solving an inverse problem.

The reconstruction problem will be studied mostly in the time-space domain, as most seismic data are acquired in this domain. In the frequency-wavenumber domain the reconstruction problem becomes a dealiasing problem, as sub-Nyquist spatial sampling leads to spatial aliasing. After studying the approach the two methods take in solving inverse problems, the reconstruction problem will first be studied in 2D, where decimation (with different patterns and percentages) only takes place along the receiver dimension. As a final test, all studied methods will aim at solving a highly decimated 3D ocean turbulence dataset that is decimated not just along the receiver dimension but also along the source dimension, resulting in over 90 % missing data to be reconstructed. The next section gives the reader a general introduction to machine learning; a deeper description of the specific architectures used here will be given in the coming sections.

2 - A brief introduction to Machine Learning
In this section, a short introduction to machine learning is given to help the reader understand the techniques used in this research. Because the machine learning community often uses specific wording that will also be used in this study, a short glossary is given at the end of this section. The introduction and glossary are far from complete, as they only serve to describe the basic concepts. Two recommended references for a more detailed description or a more hands-on experience are the book on Deep Learning by Goodfellow et al. (2016) and the online course on PyTorch via Udacity (2019).

A machine learning algorithm is able to learn from example data, with learning being described as increased performance over repetitive execution of a given task. In its mathematical basics, machine learning can be seen as a form of applied statistics, since computer models are used to statistically estimate an unknown, often complicated function that maps a given input to a given output. Deep learning is a form of machine learning in which a deep (multi-layer) neural network is the learning computer model. The network is a numerical representation of a series of computations that process information. With every pass through a layer, mathematical computations are applied to the input data, thereby mapping part of the input data to a new representation. The visible input and output of a machine learning network can have very different forms, such as images, text or classification labels. All layers in between hold hidden representations of the data that are invisible to the user.

The layers in a neural network consist of nodes; each node applies a mathematical function to part of the input data. The output of each node has a different importance in the layer's representation of the data, and therefore all nodes have a corresponding weight. When building a machine learning model, the weights have an initial setup that is not optimal in mapping the input to the output. Thus, for a model that should generalize well to different and highly variable data, it is important to find the optimum set of weights (high weights corresponding to more important features) that represent a map between the data in a so-called training dataset.

The network, mathematically represented by g, defines a parametric model between the output x̃ and the input y as set by the weights, such that x̃ = g(y, w). Training consists of estimating the network weights w by minimization of a specific loss function suitable for the problem. Training data consist of a large set of data for which both x and y are known, such that the difference (loss) between the model output x̃ (generated by the network from input y; indicated by a tilde) and the, during training known, x can be minimized. Minimization of the loss by altering the weights during training is achieved with the help of an optimizer that performs iterative optimisation using stochastic gradient descent. The training stage is followed by the inference stage, during which the trained network is deployed for testing. In this phase, never before seen data y can be used as input and the model will map this to a new output representation x̃.

A deep learning model is built by selecting an architecture suited for the specific problem, a loss function and an optimizer. Many different combinations of these three exist, and here we have chosen to use convolutional networks to solve a regression problem.
The most simple form of a regression problem consists of finding the parameters a and b fitting a linear trend (y = ax + b) to (training) data in Cartesian space. In this study the problem is more complex: the convolutional networks take corrupted (decimated) 2D seismic gathers as input and the network should map these to an output consisting of 2D reconstructed (dense) gathers. Convolutional networks (CNNs) are capable of taking N-dimensional images as input without having to transform these into 1-dimensional vectors (a very common technique in machine learning), thereby more successfully capturing the spatial and temporal dependencies in the data. In CNNs, 2D convolutional kernels are applied to the input data; the weights in a CNN therefore correspond to kernel weights that extract higher-level features from the input.

The main goal in deep learning is thus to find a "different" (the meaning of different is unique for each problem) representation of the input data after a forward pass through the model. The mapping function that takes input to output is then represented by the network weights. The problem of mapping corrupted to reconstructed seismic gathers can be cast as an inverse problem (forward problem: y = Ax), where the task is to find x (the reconstructed gather) given y (the corrupted gather) and the forward operator A. In this example, the weights of the neural network, representing the mapping function, should represent the inverse of the forward operator and map y back to x. Therefore, deep learning will be used in this study as a probabilistic approach to inverse problems. After the machine learning glossary, the next sections describe the exact deep learning architectures used in this study and how each of them approaches inverse problems.

Machine Learning Glossary

• Activation function - the function applied to the input data in a node, activating that node or transforming input to output. Here, the Rectified Linear Unit (ReLU) activation (ReLU(x) = max(0, x)) is used.

• Batch - the set of data(-patches) that is used for one update step during training of the network.

• Channels / Features - features are the properties or characteristic phenomena of the input data that are extracted in a layer. Channels and features refer to the same dimension in the data (e.g. a grayscale image consists of 1 channel and a colour image of 3 for RGB).

• Dropout - a layer that randomly sets some nodes to zero during the update step in training, which can help prevent overfitting.

• Epoch - the time the network needs to see all training data once.

• Gated Recurrent Unit (GRU) - gating mechanism in recurrent neural networks that has feedback connections and can process entire data sequences at once. The cell regulates information flow through the network with the use of a forget and memory gate.

• Learning rate - parameter that controls the step size in stochastic gradient descent; how much the weights are adjusted with respect to the loss gradient.

• Loss - cost function that measures the misfit between the network's predictions and the expected results; the loss should be minimized during the training phase.

• Optimizer - the algorithm that is used to update the weights and/or learning rate in order to reduce the loss during the training phase.

• Overfitting - when an algorithm is overfitting the training data, the model memorizes the output with the input instead of learning. The model therefore generalizes poorly to unseen datasets during the inference stage.
• Training / Inference - the training phase is the phase in which a machine learning algorithm is built; inference uses this trained model to make a prediction.

3 - The Reconstruction problem
In sparse seismic wavefield acquisition, the reconstruction problem can be posed as the general linear problem (3.1),

$y = Rx \qquad (3.1)$

in which y is the decimated (corrupted) wavefield and x the dense wavefield. R is the restriction operator that can be assembled from the characteristics of the acquisition setup (e.g. malfunctioning receivers or missing shots). R represents a mask that extracts a subset of data from the dense wavefield into the decimated wavefield. Equation (3.1) is known as the forward problem that generates the observed data. The inverse problem consists of reconstructing the dense wavefield x from the observed decimated wavefield y using an inverse of the restriction operator.

From the Nyquist-Shannon sampling theorem it is known that the restriction operator in equation (3.1) has an exact inverse as long as the sample-rate criterion is satisfied. A main assumption in the Nyquist-Shannon sampling theorem is that of uniform sampling. In reality, however, irregularities in the acquired data can be caused by malfunctioning receivers or by perturbations leading to a varying receiver spacing or sample rate during acquisition. Irregular and/or far sparser sampling both result in ill-posedness of the inverse of equation (3.1). In these cases, the inverse of the restriction operator can be approximated by two types of approaches: iterative deterministic or probabilistic inversion. In what follows, each densely sampled gather is represented by x and the decimated version by y. The goal is to estimate a dense version of the data from the decimated data and the forward operator; this estimate is represented by x̃ and should be as close to the original dense data x as possible. The seismic data can be decimated over a single source or receiver dimension, resulting in the reconstruction of missing traces in 2D seismic gathers, or decimated in both dimensions, resulting in a highly sparse 3D decimated dataset.
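To make the forward model of equation (3.1) concrete, the sketch below assembles a restriction operator for a regularly decimated gather and applies it to a dense wavefield. It is only a minimal illustration using PyLops, the linear-operator package used later for the deterministic inversion; the gather size and the 50 % regular mask are illustrative choices rather than the survey geometries used in this study, and the exact Restriction signature may differ slightly between PyLops versions.

```python
import numpy as np
import pylops

# Dense gather: nr receivers by nt time samples, flattened into a single vector
nr, nt = 128, 512
x = np.random.randn(nr * nt)  # stand-in for a dense gather d(receiver, time)

# Keep every second receiver (50 % regular decimation): indices of retained samples
kept_traces = np.arange(0, nr, 2)
kept_samples = (kept_traces[:, None] * nt + np.arange(nt)[None, :]).ravel()

# Restriction operator R: selects the sampled entries from the dense wavefield
R = pylops.Restriction(nr * nt, kept_samples)

# Forward model of equation (3.1): the decimated (observed) data
y = R @ x
```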
Deterministic - Linear Solvers

Deterministic methods aim at inverting equation (3.1) without explicitly using any probability theory on the parameters of the inversion. The most general solution to this inverse problem is the least-squares solution, to which possible regularization terms can be added. Minimizing the least-squares cost function yields the reconstructed dense wavefield x̃ of equation (3.2). The linear system in equation (3.1) can be represented numerically using an efficient linear operator representation in the Python-based PyLops framework (Ravasi and Vasconcelos, 2020). PyLops-implemented least-squares optimisation can also be used to efficiently solve the inversion in equation (3.2). Least-squares optimisation uses the forward operators in the inversion and is therefore controlled by the physics of the restriction operator.

$\tilde{x} = \arg\min_x \| y - Rx \|_2^2 = (R^T R)^{-1} R^T y \qquad (3.2)$
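Continuing the sketch above, the least-squares baseline of equation (3.2) can be run by handing the PyLops operator to an iterative solver. This is a minimal illustration rather than the authors' exact script; it uses SciPy's LSQR instead of a specific PyLops solver call.

```python
from scipy.sparse.linalg import lsqr

# Least-squares reconstruction of equation (3.2): R and y come from the sketch
# above. PyLops operators behave as SciPy LinearOperators, so an iterative
# solver such as LSQR can be applied without ever forming R as a dense matrix.
x_rec = lsqr(R, y, iter_lim=1000)[0]

# For a pure restriction operator, this minimum-norm solution simply re-inserts
# the sampled traces and leaves the missing traces at zero, which is why the
# unregularized deterministic inversion cannot by itself fill the gaps.
```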
Probabilistic - Deep Learning

An alternative method to solve the inverse problem makes use of deep learning. The neural network (mathematically represented by g_φ) is trained to represent an approximate inverse of the restriction operator, thereby mapping the decimated to the dense data. From now on, φ will be used to represent the network's parameters instead of the earlier introduced w. This is because φ includes the weights and, since the models used are more complex than simple linear regression, can also include other trainable parameters such as a varying learning rate. The neural network is trained to minimize the mean squared cost function J (see equation (3.3)) with the use of an optimizer that performs gradient descent on this cost function and the model parameters. The main focus of this study lies on the Recurrent Inference Machine (RIM) as designed by Putzky and Welling (2017), which will be benchmarked against a more simplistic network architecture: the U-Net as first designed by Ronneberger et al. (2015). The numerical code used for the U-Net is based on that of Zbontar et al. (2018) for their fastMRI challenge. Both existing code bases for the RIM and U-Net are adjusted for the specific goal of reconstructing missing seismic data.

$J = \| x - \tilde{x} \|_2^2 = \| x - g_\phi(y) \|_2^2 \qquad (3.3)$

Probability Theorem

The parameters in a neural network should represent an unknown map between an input y and an output x, which is supposed to be an inverse to a known forward operator (linear or non-linear) mapping x to y. This means that solving inverse problems using deep learning comes down to creating a function estimator of the actual inverse operator. The neural network parameters are trained to represent this function estimator; the belief that these parameters (θ) can represent the inverse operator can be expressed using probabilities. Maximum probability corresponds to a 100 % capability of the network parameters to represent the desired inverse operators. Different approaches can be taken to maximize this probability (refer to Chapter 5 of Goodfellow et al. (2016)). Here, the inverse problem is approached by defining a likelihood and a prior and optimizing for the maximum a posteriori (MAP) solution in the following equation,

$\tilde{x} = \arg\max_x \; \log p(y \mid x; \theta) + \log p_\theta(x), \qquad (3.4)$

such that the iterative approach to MAP inference represents the iterative approach to inversion (an optimization problem).

In equation (3.4), the first term is a conditional probability (log-likelihood term) under network parameters θ that represents the forward problem, while the latter is a parametric prior over x that reduces the ill-posedness of the inverse problem by including, for example, a sparsity-promoting term (Putzky and Welling, 2017). Maximizing the conditional log-likelihood term is an attempt to make the network parameters match the mapping function between input and output as set by the training data. Ideally this would match all data used during inference; however, these data are not directly available and therefore that probability distribution remains unknown. The conditional log-likelihood term is the basis for supervised learning, in which y is predicted given x and the model parameters. The maximum a posteriori approach also includes the prior on the dense wavefield, thereby allowing the network parameters (and therefore the estimate of the inverse function) to be affected by prior beliefs. The prior distribution is also related to the training data. In the case of seismic data, the prior space can include information on spatial and temporal signal distribution, curvature and sparsity. The next sections describe the two specific architectures used in this study and how each of them approximates the inverse problem.

4 - The Recurrent Inference Machine

By design, a Recurrent Inference Machine (Putzky and Welling, 2017), or RIM, uses a recurrent neural network (RNN) as a recurrent approach to MAP inference. Putzky and Welling (2017) stepped away from the usual deep learning approach, in which the prior and log-likelihood are learned separately, and instead set up an RNN that jointly learns inference and a prior. The RIM uses the current reconstruction (x̃_t), a hidden memory state (s) and the gradient of the log-likelihood term (∇ log p(y|x; θ)) to infer a better reconstruction (x̃_{t+1}) over a fixed number of steps in the recurrent part of the RIM. Each consecutive estimate of the recurrent part of the RIM can, in its most simple form, be obtained through the recursive update function

$\tilde{x}_{t+1} = \tilde{x}_t + \gamma_t \nabla \big( \log p(y \mid x) + \log p_\theta(x) \big). \qquad (4.1)$

Using Bayes' rule and generalizing to the RIM's formulation, this results in the recursive update equation (4.2). The learnable parameters φ in the RIM (represented by g_φ in (3.3)) now include the network and prior parameters θ and the learning rate.
For a more detailed description of RIMs and the derivation from equation (4.1) to (4.2), the reader is referred to Putzky and Welling (2017). For now it suffices to know that the inputs to a RIM consist of a memory state, the gradient of the likelihood term (as given by the forward operator R) and the current reconstruction. The gradient of the likelihood term for general inverse problems where y = Ax can be written as $\nabla \log p(y \mid x) = A^T (y - Ax)$. Because the restriction operator R is self-adjoint, the gradient can here be computed as $\nabla \log p(y \mid x) = R(y - Rx)$.

$x^{\mathrm{RIM}}_{t+1} = x^{\mathrm{RIM}}_t + g_\phi\big( \nabla \log p(y \mid x)(x^{\mathrm{RIM}}_t),\; x^{\mathrm{RIM}}_t,\; s_{t+1} \big) \qquad (4.2)$
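As a schematic illustration of how the update in equation (4.2) can be stepped through, a PyTorch-style sketch is given below. The `rim_cell` module and its interface are hypothetical placeholders for the convolutional/GRU cell g_φ described in the next subsection, and the mask-based gradient assumes the decimated data y are zero-filled at the missing traces; this is not the authors' implementation.

```python
import torch

def likelihood_gradient(x_t, y, mask):
    # Gradient of the log-likelihood for the restriction forward model,
    # written as R(y - R x_t); `mask` is a 0/1 tensor marking sampled traces
    # and `y` is assumed to be zero at the missing traces.
    return mask * (y - mask * x_t)

def rim_step(rim_cell, x_t, s_t, y, mask):
    # One recurrent update of equation (4.2): the cell g_phi maps the current
    # estimate, the likelihood gradient and the memory state to an increment
    # dx and a new memory state. `rim_cell` is a hypothetical stand-in for the
    # convolutional/GRU cell described in the RIM architecture section.
    grad = likelihood_gradient(x_t, y, mask)
    dx, s_next = rim_cell(torch.cat([x_t, grad], dim=1), s_t)
    return x_t + dx, s_next
```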
RIM architecture

The RIM can be seen as a series of repeating neural nets configured in a single cell representing the iterative approach to inverse problems (indicated by subscripts t and t+1 in figure 1). The RIM cell consists of a Gated Recurrent Unit (GRU) and convolutional layers. The flow through a cell is intrinsically repeated for a fixed number of steps (here chosen to be 10). Over these steps the network should improve its reconstruction, for which it uses an intrinsic loss function that compares the inference prediction with the expected outcome (known for all training data). For both the intrinsic and the global loss in the RIM, the mean squared error is used (see equation (3.3)).

Figure 1: RIM Architecture - An overview of the data flow through the RIM used in this project. Bold arrows are direct connections within a single timestep, dotted lines are recurrent connections passing information through to the next timestep. Conv is short for convolution and ∇y|x_t for the gradient of the log-likelihood term. The different representations of the input data throughout the model are described in the main text. Figure adapted from Lønning et al. (2018).

In figure 1, input image y is the decimated data. The forward operator that generated this decimated data is applied to the current estimate of the RIM (x_t) to generate the gradient of the log-likelihood term in the green cell. The gradient (indicated by ∇y|x_t; short for ∇ log p(y|x)) and the current estimate (x_t) of the dense wavefield are concatenated over the channel dimension and form the input to the first convolutional layer, which is followed by a ReLU activation layer. The next layer is a GRU (gating mechanism) that determines what information in the hidden state (s_{t+1}) is important and what can be forgotten for the next step. Another convolutional layer, followed by ReLU activation and a GRU pass (with hidden state s_{t+1}), follows before the final convolutional layer. The exact RIM architecture chosen here consists of three hidden convolutional layers, the first with kernel size 5x5 and the last two with size 3x3. Padded convolution is used to keep a constant image size throughout the whole network. The output of the recurrent network is an update Δx_{t+1} that is added to the current estimate (x_t) to form the new estimate (x_{t+1}). Neural networks extract features from the input to learn about data characteristics: in the first two hidden layers, 64 features are extracted from the input that consists of two channels (the decimated data concatenated with the gradient of the log-likelihood term); the final output consists of a single channel, the grayscale reconstructed seismic gather x_{t+1}, which becomes x_t in the next timestep. In total, the RIM consists of just over 90,000 trainable parameters.

5 - U-Net

The U-Net is a very well-known deep learning architecture for image tasks, with the benefit of being relatively easy to implement and train. The U-Net consists of a contracting path, a bottleneck in the center and an expanding path. The two paths consist of a number of blocks in which convolutional operations are applied. The contracting path maps the input y to a hidden representation in the bottleneck layer, thereby compressing the input to a higher-level feature representation over the blocks. The expanding path transforms the hidden representation coming from the bottleneck layer into an estimate x̃ of the dense data x, thereby decreasing the number of features over the blocks while increasing the size of the data. Thus, the contracting path of the U-Net is trained such that it maps the corrupted input to a compact representation of the reconstructed data, and the expanding path is trained to map from this compact, hidden representation to the full reconstructed data.

What is special about the U-Net is that the features from each contracting block are concatenated with the features from the expansion block at the same level. Concatenation ensures that the learned features in the contracting path are used to build up the image in the expansion path. In contrast to the RIM, the U-Net has no knowledge of the forward operator that created the decimated data. This means that, where the RIM is forced to follow the physics set by the restriction operator, the U-Net is not, and that is expected to sometimes lead to physically implausible results. Here, the same loss function and optimizer as for the RIM are used.
Figure 2: U-Net Architecture - An overview of the data flow through the U-Net as used in this project; the different representations are described in the main text. The colours of the cells represent the path from which the features come: blue for the contracting path, gray for the expanding path and green for the fully connected layers. Conv is short for convolution; the numbers above the cells stand for the number of features present in the representation of the data in that cell, the width of a cell for the number of features, and the length for the size of the representation of the data.

In the U-Net blocks, 2D max-pooling, bilinear upsampling and instance normalization are used. Pooling is a form of non-linear downsampling; the convolutional kernels output an image of the same dimensions as the input with a different number of features. Max-pooling is used to reduce the size of the data between two blocks in the contracting path, thereby also reducing the required number of parameters (the more parameters, the more the network is prone to overfitting the training data), the memory load and the number of computations. The output from one block is reassembled into small windows, from which only the maximum values are kept and assembled to form the input to the next block. Pooling is a valid operation in the reasoning behind the U-Net because the exact location of a feature is less important than its relative position in the global image. In order to undo this downsampling process from the contracting path, bilinear upsampling is used in the expanding path. In bilinear upsampling, linear interpolation is used to interpolate the missing data in a 2D grid: first, one of the dimensions is kept fixed and linear interpolation occurs in the other direction, and in the second step vice versa. Each step is thus linear, but the total interpolation is non-linear at the sampled location. Similar to the effect of data and feature normalization on network performance, instance normalization improves training by normalizing the data over the channel dimension.
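As an illustration of the operations named above, a minimal PyTorch sketch of one U-Net building block and the down/up-sampling between blocks is given below. It only mirrors the description in this section (3x3 convolutions, instance normalization, ReLU, 2D max-pooling and bilinear upsampling); the exact block composition of the network used in this study is specified in the next subsection and in figure 2.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two 3x3 convolutions, each followed by instance normalization and ReLU
    # (a common U-Net block layout; the exact number of convolutions per block
    # in this study is not specified in the text).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Between contracting blocks the data size is halved by max-pooling;
# between expanding blocks it is doubled again by bilinear upsampling.
downsample = nn.MaxPool2d(kernel_size=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```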
U-Net architecture
The U-Net architecture used here consists of four pooling blocks that perform 3x3 convolutions in both the contracting and the expanding path; no dropout is used in these blocks. In figure 2, the input to the contracting path (indicated in blue) consists of a seismic gather that is decimated in the spatial domain (the same y as in the RIM). In the first block, 64 features are extracted from the gather; this number doubles with each block in the contracting path (indicated by cell width) and reaches its maximum of 1024 features in the bottleneck layer (the rectangular area in figure 2). The size of the input image decreases by a factor 2 in both image dimensions per layer (indicated by the length of the cells). Over the four expanding blocks (gray in figure 2) the number of features is decreased to 64 again, and in the final two 1x1 convolutional layers (indicated in green in figure 2) this decreases to a single-feature image with the same size as the original input. A 1x1 convolutional layer decreases the number of features in the representations without a change in size. In total this U-Net consists of almost 13.5 million trainable parameters. Both the input (y; the decimated data) and the output (x̃; the reconstructed data) of the U-Net thus consist of a single-feature, single-channel seismic gather. The concatenation between the features from the contracting and expanding path is indicated by the gray horizontal arrows and the combined blue/gray cells. Figure 2 also justifies the name of the U-Net, as the input data indeed follows a U-like flow towards the output.

6 - Methods

The inverse problem, which consists of retrieving the dense seismic wavefields from the restriction operator and the decimated data, will be solved by two approaches: deterministic inversion and deep learning. Here, the main focus lies on the RIM and its potential to solve the reconstruction problem, as an example of an inverse problem for which the forward operator is known and computationally inexpensive. The reconstruction is benchmarked against the deterministic approach and the U-Net deep learning architecture. Even though the U-Net was originally designed for image segmentation (Ronneberger et al., 2015), it has lately been used for other tasks as well. For both deep learning networks, many different architectures, choices of activation functions, loss functions and training data are possible. The architectures used in this study have been described in the previous sections; both networks are numerically implemented using the Python-based deep learning package PyTorch (Paszke et al., 2019). The most important step before deploying the neural networks in their inference stage is training the networks on seismic data representative of the data to be inferred. The trained models can then, during the inference stage, be compared to the deterministic inversion over several tasks. The least-squares optimisation in the deterministic approach is numerically implemented using the efficient linear operator representation of the Python-based package PyLops (Ravasi and Vasconcelos, 2020).
Training networks using Seismic data
Four different seismic datasets of different formats and sizes have been used for this study. These include the Gulf of Suez (Gulf) field dataset, which consists of 128 shots, 128 receivers and 512 timesamples; two more complex numerical subsalt datasets (Pdat and Rdat) with in total 202 shots, 201 receivers and 2001 timesamples; and a 3D numerical ocean turbulence dataset (OTD) consisting of 300 shots, 301 receivers and 1081 timesamples. A range of different networks are trained on different parts of these datasets. To generate synthetic sparser (decimated) training data for the neural networks, the originally densely sampled data (in the source, receiver and time domains) are decimated using five different decimation patterns on the receiver domain. To limit the possible effects of the selected training decimation patterns on the networks' capability to generalize to other decimation patterns, two jittered irregular patterns (based on ideas of Hennenfent and Herrmann (2008)) and three regular patterns (factor 2, 3 and 4) are applied. During training, the decimation percentages vary between 50 and 80 %.

It is well known that sufficient data is required to accurately train a neural network (e.g. Siahkoohi et al. (2018)). For this study a single GPU (Nvidia GeForce GTX 1080 Ti; 11 GB memory) is used. In order to both increase the amount of training data and decrease the computational memory load on the single GPU, non-overlapping patches of 32 traces by 64 timesamples are extracted from all shot gathers. The patches are decimated using five different masks, resulting in 5 times as many decimated input wavefields as there are dense wavefields. The effect of this data windowing is a band-limitation of the original signal; the full frequency content of the original signal is therefore no longer present in the windowed signal. In addition, edge effects can introduce undesired peaks in the frequency spectrum related to smaller-scale structures. To reduce this effect, a 2D Tukey taper (with fraction 0.3) is applied to the windowed gathers. This space-time domain multiplication of the windowed data with the Tukey taper corresponds to a smoothing convolutional operation in the frequency-wavenumber domain, which attempts to diminish the undesired effects introduced by space-time windowing. During inference the seismic gathers are not windowed, and therefore tapering is only used in the training stage. Note that it is thus not necessary to train a neural network on the same input data size as used for inference.
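A minimal sketch of this training-data preparation (non-overlapping 32x64 patches, a 2D Tukey taper with fraction 0.3, and a regular decimation mask) is given below; function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np
from scipy.signal.windows import tukey

def make_training_patches(gather, patch_nr=32, patch_nt=64, alpha=0.3):
    # Cut a dense shot gather (n_receivers x n_timesamples) into non-overlapping
    # patches and apply a 2D Tukey taper to suppress windowing edge effects.
    taper = np.outer(tukey(patch_nr, alpha), tukey(patch_nt, alpha))
    nr, nt = gather.shape
    patches = []
    for i in range(0, nr - patch_nr + 1, patch_nr):
        for j in range(0, nt - patch_nt + 1, patch_nt):
            patches.append(gather[i:i + patch_nr, j:j + patch_nt] * taper)
    return np.stack(patches)

def regular_mask(n_traces, factor):
    # Binary mask that keeps every `factor`-th trace (regular decimation);
    # multiplying a patch by the mask along the trace axis zeroes the
    # decimated traces.
    mask = np.zeros(n_traces)
    mask[::factor] = 1.0
    return mask
```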
Prior space sampling
To make the best possible predictions for unseen data during the inference stage, the trained deep learning algorithms require the prior space inferred from the training data to be an accurate description of the space that the networks have to infer. In the case of reconstructing seismic data, it is important for the training data to have similar slope variation, curvature and types of reflections as in the data to be inferred. In addition, the bandwidth of the reconstruction plays an important role: the finer the temporal and spatial scale of the structures in the data to be inferred, the broader the bandwidth of the training data should be. From later results it will become clear that having an idea of the decimation percentage in the data to be reconstructed can improve the network's predictions. This is related to the fact that the network's prediction quality starts to decrease at the higher end of the range of decimation percentages present in the prior. Therefore, it is important to generate synthetic training data with high decimation percentages if that is what should be reconstructed during inference. Figure 3 illustrates this effect: if the left panel (Pdat; single-shot salt data) were the goal of inference, it would be important to include similar structures and properties in the training data.

The four different datasets used in this study have different complexities. The Gulf of Suez dataset (Gulf) has little structural variation but includes velocity variations of the subsurface, therefore having hyperbolas centered around the source location. The ocean turbulence dataset (OTD) is the complete opposite: the ocean layers have very little velocity variation but high structural variation (turbulence), so this dataset includes many different diffractions and reflections that can be off-centered and interfering. The Rdat salt dataset is a synthetic dataset that includes all of the previously mentioned properties. All of these structures can be found in the single-shot Pdat salt dataset; this dataset is, however, generated from a source within the medium and is therefore different from all other datasets, which are generated by sources at the surface.
General Deep Learning parameters
Both networks make use of the Adam optimizer (Kingma and Ba, 2014) with a weight decay factor of 1e-8 and a gradient norm of 0.1. The initial learning rate is set to 1e-4 and can be altered by the optimizer. The networks are subject to the same mean squared loss function and use the Rectified Linear Unit (ReLU) activation function. During training, batches of 32 images are made over seismic shot gathers in windows of size 32x64. After the training stage, dense wavefields can be predicted for single decimated seismic gathers of varying sizes (these do not have to equal the training data size). All models are trained for 40 epochs, during which the loss is monitored using Tensorboard (Martinez, 2016). The same decimation percentages used to decimate the training data for the RIM are used for the U-Net.

Figure 3: Illustration of the importance of a representative prior space in the training data. The data to be inferred, on the left, is a complex dataset consisting of many slope variations (blue), of variation in the scale of structures (bandwidth; green) and of a combination of diffractions and reflections (orange). The training data should therefore consist of a combination of the properties desired to be inferred by the network. The different datasets all have different properties, as explained in the main text. From left to right, the shot gathers used are from Pdat (shot 1), Rdat (156), Gulf (47) and OTD (145).

Some machine learning architectures can be very sensitive to the scale of the input data. Scaling the input data is known to have a positive effect on network performance, as it is a helpful approach to the vanishing gradient problem that often occurs during back-propagation of the misfit (e.g. Ioffe and Szegedy (2015); Dai and Heckel (2019)). The variety in amplitude and complexity of the different seismic datasets is high; scaling is therefore applied to reduce this variance and improve training. Four different types of scaling are compared: normalisation (to the range -1, +1), normalisation (using the maximum absolute amplitude), standardisation (zero mean, unit standard deviation) and no scaling of the original data.
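For concreteness, the training configuration described above can be set up in PyTorch roughly as follows. This is a hedged sketch: `net` and `train_loader` are placeholders for either network and for a loader yielding (decimated, dense) patch pairs, and the quoted gradient norm of 0.1 is interpreted here as a gradient-clipping threshold.

```python
import torch

# Adam with the learning rate and weight decay quoted in the text
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-8)
loss_fn = torch.nn.MSELoss()   # mean squared loss of equation (3.3)

for epoch in range(40):                       # 40 epochs
    for y_batch, x_batch in train_loader:     # decimated input, dense target
        optimizer.zero_grad()
        x_pred = net(y_batch)
        loss = loss_fn(x_pred, x_batch)
        loss.backward()
        # clip gradients to the quoted norm before the update step
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)
        optimizer.step()
```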
Reconstruction Approach
During both the training and inference stages of the deep learning approach, a single decimated 2D seismic gather is used as input. During the inference stage, the 2D decimated wavefields of unseen data are mapped to dense reconstructions. The same synthetically generated decimated gathers are used to perform a deterministic inversion with the help of PyLops' least-squares optimisation over 1000 iterations. The inference and inversion results will be compared over two tasks: 2D seismic gather reconstruction and highly decimated 3D reconstruction.

Unlike the deep learning networks, which can only take single 2D gathers as input, the deterministic approach can invert the problem for any N-dimensional decimated data. In addition, it is known from compressive sensing techniques that far sparser data can be reconstructed by inversion with the help of derivatives of the decimated data (e.g. Ruan (2019)). To test the potential of the neural networks (specifically trained to perform 2D reconstruction) for more complex 3D highly sparse data decimated over both the source and receiver domain, the 3D reconstruction problem is split into two 2D problems: first all shot gathers are reconstructed, and after sorting the data to the common-receiver domain, inference is again applied to the receiver gathers to reconstruct the rest of the missing data (a minimal sketch of this two-step procedure is given below). This two-step approach will be compared to least-squares optimisation using the first- and second-order derivatives of the ocean turbulence data as well as the cross-derivatives in the source and receiver domains. The ocean turbulence dataset is a seismic dataset generated from a synthetic 2D model, as described in more detail by Ruan (2019). All (cross-)derivatives are created synthetically with the use of PyLops' linear operators and are decimated as well, to simulate the effect of measuring these derivatives in the field with the use of streamers.
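The two-step 2D inference over the 3D volume can be expressed compactly as follows; `model` stands for a trained network (RIM or U-Net, or one of each for the combined variants) that maps a single decimated 2D gather to a dense one, and the array layout is an assumption for illustration.

```python
import numpy as np

def reconstruct_3d(decimated, model):
    # decimated: 3D volume of shape (n_src, n_rec, n_t) with missing traces
    # zero-filled; `model` maps one 2D gather to its dense reconstruction.
    n_src, n_rec, n_t = decimated.shape

    # Step 1: reconstruct every shot gather (receiver-time panels)
    step1 = np.stack([model(decimated[isrc]) for isrc in range(n_src)])

    # Step 2: sort to common-receiver gathers (source-time panels) and
    # reconstruct the remaining missing traces
    step2 = np.stack([model(step1[:, irec, :]) for irec in range(n_rec)])

    # Sort back to shot-gather ordering
    return step2.transpose(1, 0, 2)
```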
Evaluation Metrics
The different results will be compared visually in both the space-time domain (data reconstruction) and the wavenumber-frequency domain (aliasing-dealiasing problem). To quantitatively compare the different reconstruction qualities, which are scaled and created differently, two different metrics are used. A common evaluation metric in inversion techniques is the (normalized) root mean squared error; in image reconstruction, however, the structural similarity index is more common. Both metrics focus on different aspects of the reconstruction and are here used jointly to compare the performance of inversion and inference.

The root mean squared error (RMSE) measures the difference in per-pixel amplitude between the reconstructed and reference image, thereby representing the Euclidean distance between two images. The RMSE (see equation (6.1), with N the number of pixels) is very easy to implement, as the mean squared error is already used as the loss function in the RIM and U-Net. However, the RMSE lacks the ability to use the overall image structure because the comparison is made per pixel. The Structural Similarity Index (SSIM; Ndajah et al. (2010)), on the other hand, uses the structural properties of an image and can be computed at different local patches of the image data with the use of a sliding window. SSIM is used here as defined in equation (6.2), in which the average pixel intensities (µ), their (co)variances (σ) and two stabilizing factors (c_1, c_2) are used to calculate the structural similarity between two seismic gathers.

$\mathrm{RMSE}(\tilde{x}, x) = \sqrt{\tfrac{1}{N}\,\| \tilde{x} - x \|_2^2} \qquad (6.1)$

$\mathrm{SSIM}(\tilde{x}, x) = \frac{(2\mu_{\tilde{x}}\mu_x + c_1)(2\sigma_{\tilde{x}x} + c_2)}{(\mu_{\tilde{x}}^2 + \mu_x^2 + c_1)(\sigma_{\tilde{x}}^2 + \sigma_x^2 + c_2)} \qquad (6.2)$
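A minimal numerical sketch of both metrics, written directly from equations (6.1) and (6.2), is given below; the stabilizing constants are arbitrary illustrative values, and in practice the SSIM is usually averaged over local sliding windows rather than computed once over the whole gather.

```python
import numpy as np

def rmse(x_rec, x_ref):
    # Root mean squared error of equation (6.1)
    return np.sqrt(np.mean((x_rec - x_ref) ** 2))

def ssim_global(x_rec, x_ref, c1=1e-4, c2=9e-4):
    # Global structural similarity of equation (6.2), computed once over the
    # whole gather; c1 and c2 are small stabilizing constants.
    mu1, mu2 = x_rec.mean(), x_ref.mean()
    var1, var2 = x_rec.var(), x_ref.var()
    cov = ((x_rec - mu1) * (x_ref - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))
```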
7 - Results

Comparison of all trained models revealed that the networks trained on normalized (by maximum) data performed best. Scaling the data proved to be necessary to obtain a well-generalizing model. Normalization by the maximum absolute value results in scaled data without altering the physics of the wavefield, something that is no longer true when standardizing the data or normalizing to a custom range. Application of the Tukey taper to the patched data proved to decrease the effect of the undesired edge effects (present in the training data) on the inference results. Therefore, all deep learning results that follow are based on normalized, tapered models.

Prior space sampling
As stated before, it is important for a neural network to generalize well to new data. The ability to generalize is determined by the prior space sampled from the training data. The generalization quality of the networks is also dependent on the amount of data used during training, because an incorrect ratio between the number of training data and the number of network parameters can lead to under- or overfitting. First, the effect of data complexity is studied, and later the decimation patterns; varying both of these factors results in a varying amount of training data as well.

Initially, the five different decimation patterns consisted of two irregular and three regular patterns, thereby decimating the data between 50 and 80 %. Four different models are compared for both the U-Net and the RIM, based on different training data consisting of Gulf (of Suez) (every second shot), Rdat (every second shot of the largest salt dataset), GulfRdat (a combination of the former two) or Ocean Turbulence Data (OTD; every second shot). The different decimation percentages, in addition to patching, result in a dataset size of just over 100,000 images for the last two models, just under 100,000 for only Rdat and only around 10,000 for Gulf. 75 percent of these images went into training; the other 25 percent is used for testing and validation.
Data complexity
Table 1 in combination with figure 4 illustrates the effect of data complexity on the potential of the networks to generalize to unseen data. From the average SSIM in table 1 (the arithmetic mean of all but the training-data performance), it can be deduced that all models perform best on their training data and that the RIM overall performs slightly better than the U-Net. The RIM generalizes equally well for models trained on different higher-complexity datasets, and more poorly when inference is performed on data with a higher complexity than seen during training. This result is to be expected based on the data complexity discussion given before. The U-Net, on the other hand, has more trouble generalizing to unseen datasets, especially if trained only on the ocean turbulence data, which consists of many diffractions and reflections but very little velocity variation (and therefore very little slope variation).

Figure 4 illustrates this effect and also gives an indication of the misfit between the networks' inference results and the dense data to be inferred. The displayed shot gather comes from the single-shot salt dataset (Pdat) that none of the models had been trained on. This dataset is different from the rest because the data is generated from a source within the medium. The decimation is irregular with a percentage of 62 % (within the range of decimation percentages in the training data). The 8 different reconstruction panels (B-E in figures 4a and 4b) are all very different. For example, both reconstructions made by the network trained on Gulf data only show many more small-scale structures on the left flank than are present in the dense data (see panels B in figure 4). In the RIM it is clear that many small-scale structures, most likely related to the curvature in the training data, overprint the desired curvature of the salt data. In the U-Net this effect is less pronounced, related to the fact that this network also underestimates the amplitude of the reconstruction. Both networks perform best when trained on a combination of the complex salt dataset and the Gulf of Suez dataset, which includes many velocity and slope variations.
Table 1: Average SSIM for inference using the trained models (columns: Gulf, Gulf/Rdat, Rdat and OTD, each evaluated for both the U-Net and the RIM) on the dense data to be inferred (rows: Gulf, Rdat, OTD, Pdat and their average). The SSIM values are computed as an arithmetic mean over the SSIM for 10 different decimation percentages (5 regular, 5 irregular) for 3 shot gathers in the data (if available; left quarter, center, right quarter), without taking the training data into the calculation (indicated by gray cells). All models perform best on the data they are trained on, and the RIM outperforms the U-Net in these tasks.
Decimation patterns
The networks were initially trained on 5 different decimation masks, with decimation percentages ranging between 50 and 80 %. Of these patterns, 2 were irregular and 3 regular. When performing inference on data decimated by between 25 and 82 percent, it is observed that the networks generalize better towards lower percentages than towards the higher end of the range present in the prior space. This means that the reconstruction quality decreases when the data is highly decimated. There is no clear indication that the networks perform better on irregularly or regularly decimated data, unlike the deterministic inversion, which tends to reconstruct irregularly sampled data better. Training the RIM on only two patterns (50 % regular and 84 % irregular) in the same prior space range resulted in similar observations. Using more patterns in the same range (50, 67, 75 and 80 % regular, and 75, 81 and twice 84 % random-jittering irregular) improved the reconstruction quality but increased the training time, as more training data became available. With fewer decimation patterns in the lower range, the RIM could generalize well to percentages just outside this training range and predicted structures similar to the training data at much higher percentages. These observations confirm that, in order to reconstruct well, it is important to train the network in the range of decimation percentages that have to be inferred. The U-Net performance varied highly with a change in training data, which resulted in a performance drop when less data or fewer decimation percentages are used.

Figure 4: Reconstruction of a 62 % irregularly decimated shot gather from a complex single-shot salt dataset (normalized), for (a) the U-Net (panel SSIMs: B - Gulf 0.61, C - Rdat 0.82, D - GulfRdat 0.84, E - OTD 0.08) and (b) the RIM (B - Gulf 0.83, C - Rdat 0.88, D - GulfRdat 0.88, E - OTD 0.81). The top bar represents the decimation pattern, in which black stands for a missing trace. None of the models in panels B-E has been trained on this shot gather. Panel A represents the original shot gather that has to be inferred by the network; the reconstructions in panels B-E should therefore be as close as possible to this panel if inference were perfect. The quality of generalization to unseen data is indicated by the SSIM in brackets and by the amplitude of the reconstruction.

Based on the previous discussion on prior-space sampling, the networks trained on half of the Gulf of Suez and salt data for five different decimation percentages (previously called
GulfRdat) are selected for further inference. This is a trade-off between training time and inference performance at different percentages. Training the RIM for 40 epochs using just over 100,000 training images on a single GPU took around 12 hours. The U-Net is not a recurrent neural net and requires less GPU memory; training this network on the same data and number of epochs took only 1.5 hours. Performing inference on a single full-size shot gather is almost instantaneous, whereas deterministic inversion can take minutes per gather before convergence is reached.
2D gather reconstruction
The reconstruction results for a central shot gather from the ocean turbulence dataset are shown in figure 5. Panel A illustrates the temporal bandwidth and spatial variation present in the ocean turbulence dataset. The first arrivals have a strong amplitude; later arrivals are less pronounced but, because of normalization and filtering, still clearly visible. In this example, the shot gather is regularly decimated by a factor 4, resulting in the decimated gather of panel B. Because of sub-Nyquist spatial decimation, spatial aliasing occurs in the frequency-wavenumber domain, as can be seen in the corresponding Fourier spectrum.

Solving the deterministic inversion without regularization results in panel C of figure 5. By visual inspection and comparison of the norms in table 2, there is no difference between the decimated and the reconstructed gather. The misfit between the original Fourier-domain image and the Fourier transform of the reconstruction equals the original Fourier-domain image. This means that the inversion is not capable of reconstructing the 75 % missing seismic traces, even though the iterative inversion has converged. Both deep learning approaches, on the other hand (panels D and E in figure 5), are capable of reconstructing the missing seismic data. In both panels there is still an imprint of the missing traces; this is especially clear in the first arrivals. The later reflections and diffractions do not seem to have this imprint, resulting in a low misfit in both the spatial and the Fourier domain. Similar to what has been observed before, the U-Net introduces low-frequency structures into the reconstruction, visible in the low-frequency, low-wavenumber part of the misfit, which has a higher amplitude than that same area for the RIM's reconstruction. The U-Net again also underestimates the amplitude of the data more than the RIM (see the difference in norms in table 2). The training data included higher velocity variations than present in the data to be inferred, as well as structural variation. Structure-wise, this results in a high correspondence between the predicted wavefields and the dense wavefield to be inferred. Not just the strong first arrivals, but also the later diffractions and reflections are reconstructed without loss of bandwidth.

Both deep learning approaches are thus capable of reconstructing the missing data to a similar extent, thereby decreasing spatial aliasing in the Fourier domain. The higher SSIM values and lower misfit amplitudes of the RIM reconstructions are not limited to this specific gather or dataset only; table 2 indicates that this is a general trend. The presented results are based on 75 % regularly decimated data and can be generalized to other gathers and decimation percentages as well. Where the deterministic inversion on the decimated data and the forward decimation operator already breaks down at very low decimation percentages due to the Shannon-Nyquist sampling theorem, the neural nets' performance only starts to decrease at decimation percentages near the edge of the sampled prior space.
Table 2: A comparison of the reconstructions for the different approaches to inversion (rows: decimated gather, deterministic inversion, U-Net, RIM; columns: SSIM and norm for Gulf shot 27 (4.292e3), Rdat shot 97 (0.938), Pdat shot 1 (29.08e3) and OTD shot 154 (4.242e-3)). The different gathers are regularly decimated by a factor 4 (75 % decimation); the norm of the dense shot gathers is given in brackets after the name of the dataset and the selected shot. The deterministic iterative inversion cannot solve the reconstruction problem for any of the datasets at this decimation percentage (no difference between the input decimated gather and the reconstruction); the RIM slightly outperforms the U-Net when comparing the metrics.
3D Ocean Turbulence reconstruction
Because the neural nets are trained to reconstruct 2D seismic gathers, a two-step inference procedure is followed to reconstruct the 3D decimated dataset. The total 3D reconstruction is thus an inference result created by first reconstructing all shot gathers and, after sorting to common-receiver gathers, reconstructing the receiver gathers in a second step. The deterministic inversion uses the forward operator and performs 1000 iterations. In addition, for the 3D inversion it is assumed that the first- and second-order derivatives as well as the spatial cross-derivatives are available, thereby taking more data into the inversion and solving a multichannel reconstruction problem. The data are decimated by 94 %, with only about every fourth trace randomly retained in both the source and receiver dimensions. In total there is 16 times less data present in the decimated wavefield than in the dense wavefield. The decimation pattern is equal in the source and receiver domains; source and receiver positions are colocated in this dataset. Therefore each position either has both its shot and receiver sampled or neither of them.

Table 3 compares the inference and inversion results for 5 different methods. Because of the two-step procedure used in inference, the two different networks (RIM and U-Net) can also be used jointly, such that the networks can benefit from each other's reconstruction made in the first step. The best overall reconstruction is clearly made by the deterministic inversion that used the forward operator, the decimated data and all eight (cross-)derivatives.
[Panels B-E: B (decimated), C - Det. inv. (0.61), D - U-Net (0.75), E - RIM (0.83); axes: trace number versus time sample (wavefields) and wavenumber k/k_Ny versus frequency (Hz) (Fourier spectra).]

Figure 5:
The reconstruction of a central shot gather from the ocean turbulence dataset; each panel consists (from top to bottom) of a bar representing the sample distribution (black for missing, white for sampled traces), the normalized wavefield and the corresponding Fourier spectrum. A) Original dense seismic gather, no missing data. B) Data regularly decimated by a factor of 4 (75 %); spatial aliasing occurs in the Fourier domain. C - E) Reconstructions using the three different approaches; the Fourier spectrum shown is the misfit with respect to A. The SSIM between A and the reconstruction in the space-time domain is given in brackets. The RIM reconstruction has the lowest misfit in both the space-time and frequency-wavenumber domains as well as the highest SSIM.

All deep learning methods, however, still estimate the wavefield reasonably well considering that these networks only know the decimated data and, in the case of the RIM, a 2D version of the forward operator. Because two steps are taken in the inference procedure, the second inference step operates on reconstructed data; this reconstruction is far from perfect and error propagation therefore occurs. From table 3 it should be clear that the reconstruction is best at positions where some data was sampled. Because of the loss function used in training, the networks are free to also alter the traces that were sampled instead of only the missing traces. The inversion uses the forward operator and does not allow alteration of the sampled traces; the misfit of the inference results can therefore always be higher than that of the inversion.

Figures 6 and 7 display the dense wavefield estimates from deterministic inversion for a set of shots in the center of the ocean turbulence dataset. These results are compared to the best probabilistic estimate of the wavefield, made by the RIM, in figures 8 and 9. Because the data are randomly decimated by 75 % over each dimension, the maximum number of consecutive missing traces within a gather is six. In the panels of all figures, only the first and last shot/receiver were recorded and the traces in all other missing shots/receivers are reconstructed.

                   Decimated data       Sampled data         Total 3D dataset
                   (44.91)              (26.07)              (51.94)
                   SSIM    norm         SSIM    norm         SSIM    norm
Inversion
RIM
U-Net / RIM
U-Net
RIM / U-Net
Table 3:
A comparison of 3D inversion results for the 94 % decimated ocean turbulence data. The deterministic inversion in this case performs best on all components. The two-step RIM reconstruction again estimates the amplitudes of the reconstruction better than the U-Net. Combining the U-Net and RIM leads to a better 3D reconstruction than using the U-Net for both steps, possibly because the RIM uses the forward operator in the estimation. The norm of the original part of the data is given in brackets; all norms are scaled by a factor of 1e3.

Because the RIM reconstructs the decimated data in two steps over the two dimensions, the maximum decimation percentage the network takes as an input equals that of a single dimension; this 75 % decimation falls just within the range sampled by the prior space.

Traces in the six missing shots in figure 6 are reconstructed by the deterministic inversion. Of all approaches, the amplitude of this reconstruction best approximates the dense wavefield. The misfit increases further away from the last sampled shot, yet all major seismic events are accurately recovered. In panels A-D it can be observed that the temporal bandwidth of the reconstruction also decreases with distance from the last sampled shot. As expected, more densely sampled areas result in a better reconstruction. The same general trend can be observed in figure 7 for the missing receivers, because the decimation patterns over both dimensions are equal and the deterministic inversion included the 3D forward decimation operator.

Traces in the six missing shots in figure 8 are reconstructed by the two-step RIM inference procedure. Again, the misfit increases further away from the last sampled shot. The temporal bandwidth of the reconstruction, however, does not seem to decrease with distance, although this approach does underestimate the dense wavefield amplitude. At source and receiver locations where many data points are missing, the imprint of the decimation pattern is more evident than in the deterministic inversion. The RIM reconstruction is relatively poor in panels D and E, where the distance to the last sampled shot/receiver is largest. This is most likely due to the fact that, as an input to the model, these panels contained no data. The reconstruction is thus fully based on inference and the build-up of errors over the two steps.

Figure 6: Deterministic inversion results for the 94 % decimated ocean turbulence dataset, shot gathers (panels A-H; receiver number versus time sample). The bar on top of each panel represents the sample distribution (white for sampled, black for decimated); SSIM values are reported for each gather as well. The quality of the reconstruction decreases and bandwidth is lost with distance from the last sampled gather.

Figure 7: Deterministic inversion results for the 94 % decimated ocean turbulence dataset, receiver gathers (panels A-H; shot number versus time sample).
The bar on top of each panel represents the sample distribution (white for sampled, black for decimated); SSIM values are reported for each gather as well.

Figure 8: Two-stage RIM inference results for the 94 % decimated ocean turbulence dataset, shot gathers (panels A-H; receiver number versus time sample). The bar on top of each panel represents the sample distribution (white for sampled, black for decimated); SSIM values are reported for each gather as well. The reconstruction is poorer than the deterministic inversion and its quality decreases with distance from the last sampled shot, yet there is no clear indication of loss of bandwidth.

Figure 9: Two-stage RIM inference results for the 94 % decimated ocean turbulence dataset, receiver gathers (panels A-H; shot number versus time sample). The bar on top of each panel represents the sample distribution (white for sampled, black for decimated); SSIM values are reported for each gather as well.

Discussion

In order to solve the reconstruction problem, two different approaches have been studied. The wavefields reconstructed with deterministic inversion without regularization confirm the Shannon-Nyquist sampling theorem, which states that dense wavefields can only be reconstructed from the decimated (sampled) wavefields if the sampling rate is at least twice the highest frequency (here, spatial wavenumber) present in the data. Herrmann (2010) studied the effect of different decimation patterns on the imprint in the Fourier spectrum. Regular sampling leads to sparse, strong aliased signals in the Fourier spectrum, whereas irregular sampling tends to generate weaker decimation artifacts. The regular-sampling artifacts hinder the reconstruction and dominate the misfit, whereas the irregular-sampling artifacts are less distinct and therefore do not hinder the reconstruction of the main structures in the wavefield. Because of irregularities or limitations in data acquisition, sampled data often do not fulfil the sampling criterion and aliasing therefore occurs. These effects are also observed in this study. At lower decimation percentages the deterministic inversion can reconstruct both regularly and irregularly decimated data, with the best reconstructions made on irregularly decimated data. However, at higher decimation percentages the inversion without regularization is not able to solve the inverse problem for either regular or irregular decimation. Deterministic inversion is thus limited to low decimation percentages, yet it would be beneficial to reconstruct data that is far sparser than what is reconstructable with inversion. Here, two deep learning approaches have been introduced that have shown to be able to map decimated wavefields into denser wavefields for both regular and irregular, highly sparse data.
Deterministic versus Probabilistic approach
Deep learning approaches the inverse problem in a probabilistic sense, in which the prior has been shown to be of crucial importance. The quality of the reconstruction is mainly dependent on the information extracted from the training data. Sampling the training data results in a prior space distribution that is used in the neural networks' inference stage. In the seismic reconstruction problem, the most important elements the prior space should contain include reflections and diffractions due to spatial variation, bandwidth, slope variations due to velocity variations, and a range of decimation percentages. Unlike the deterministic inversion of 2D decimated gathers, which can only reconstruct data accurately when the sampling criterion is fulfilled, the neural networks have proved to be able to reconstruct 2D seismic gathers with decimation percentages up to the edge of the decimation range the networks were trained on.

When the derivatives of the data are available, however, the deterministic inversion of the reconstruction problem turns into a multichannel reconstruction problem. In this case the deterministic inversion improved, as sparser data could be reconstructed. In the 3D highly sparse reconstruction of the ocean turbulence data, the deep learning methods have proved to be able to reconstruct the sparse data without the need for derivatives. The reconstruction quality is not as good as that of the inversion, but it is believed that the reconstruction can be improved by more extensive training on highly sparse data or by creating a neural network capable of taking N-dimensional data as in- and output. The two-step inference procedure is prone to error propagation, something that does not occur when N-dimensional data are used as input. The loss of bandwidth in the inversion with distance to the last sampled shot is not observed in the inference results, indicating that the training data used was sufficient to describe the bandwidth in the ocean turbulence dataset. Because the extra data taken into the inversion (derivatives) is often not available, deep learning should be considered a viable option in data reconstruction.

Besides the fact that the deep learning methods do not require anything but the data and possibly the forward operator, another advantage of deep learning methods over deterministic methods lies in the short inference times. Of course, training a neural network takes time: for the RIM used here training took 12 hours, whereas the U-Net trained in under 2 hours. However, with a good generalizing ability, a network only has to be trained once and can be used for inference on unseen datasets afterwards. The reconstruction of a single 2D seismic gather by inference is almost instantaneous, whereas the inversion can take up to minutes per gather. When including the derivatives in the inversion this may take even longer (the 3D inversion with 1000 iterations took over 14 hours to converge). The training time of the neural networks could possibly be reduced, based on the discussion of the prior space sampling required for a good generalizing model.

The requirement of a large training dataset to extract an accurate description of the prior space could be seen as a difficulty of deep learning as well. In this case, the training data are created synthetically from dense seismic wavefields that include a range of different properties and structures.
This means that in all cases it is best to either use existing dense data for training or to sample part of the acquisition densely, thereby providing the possibility of generating synthetic training data consisting of structures present in the data to be reconstructed. As is noticeable in the results, without accurate prior space sampling the deep learning networks cannot generalize well enough. Of course, the required quality of the reconstructed data also depends on what these data will be used for in post-processing steps. For example, migration is less demanding than full waveform inversion, which attempts to use every single arrival. Drawing exact conclusions based on the metrics presented here should therefore be done with care, taking the ultimate aim of the reconstruction into account. In seismics, collecting suitable and sufficient training data should be a manageable task, as the required features are very common in seismic data.
Comparison of deep learning methods
The two deep learning architectures used here are the Recurrent Inference Machine (RIM) and the U-Net. Both methods require training data to update their internal state to match the approximate inverse of the forward operator that generated the decimated data. The RIM approaches inverse problems by combining the known forward operator and the sampled data within a main Recurrent Neural Net (RNN) cell. According to Putzky and Welling (2017), this approach that combines the inference and training stages is crucial and unique to solving inverse problems with deep learning. That the RIM has the potential to solve inverse problems has been demonstrated here by solving the reconstruction problem, for which the forward operator is the linear (computationally inexpensive) restriction operator. The RIM was demonstrated to generalize well to unseen data and decimation percentages, also with a limited amount of training data. From the results it can be concluded that RIMs have a low tendency to overfit the training data while generalizing well outside the prior range.

That the RIM is not the only neural net that can represent the inverse of the restriction operator has been shown with the help of the U-Net. Like the RIM, the U-Net makes use of convolutional operators to extract higher-level features from the input data. However, the U-Net does not use an RNN or the forward operator. In both the 2D seismic gather and the 3D highly decimated reconstruction, the U-Net consistently underestimates the amplitude of the reconstruction and introduces lower-frequency structures in the prediction. Most often, however, it is possible to filter these lower-frequency structures from the predictions and reach results that are similar to the predictions made by the RIM. Likewise, it is often not the absolute amplitude of the reconstruction that is the main goal; the relative distribution of amplitudes is of greater importance, as this is a measure of the contrast in subsurface properties. This indicates that, structure-wise, the reconstruction of the U-Net after filtering could be good enough for further processing as well. Training the U-Net on different training data resulted in highly varying inference results. It can therefore be concluded that the U-Net is much more likely to overfit the training data, possibly because of the high number of trainable parameters in the network, and is therefore more prone to prior space variance.

During the course of this study, another study was published by Mandelli et al. (2019) in which the U-Net is again used to solve the reconstruction problem, as a pre-processing step before using the reconstructed data for migration. There, however, as a post-processing step, the traces at the sampled locations are removed from the network's prediction and replaced by the actual sampled traces. Mandelli et al. (2019) find that the U-Net can be used to solve the reconstruction problem. However, their results are based on decimation percentages of 10, 30 and 50 %, and similar observations of poorer generalization to unseen data or decimation patterns were made there.

When taking these considerations into account, it can be stated that the wavefields reconstructed in both 2D and 3D by the RIM are slightly better (in structural similarity, norm, as well as dealiasing) than those of the U-Net, while both methods perform better than the single-channel deterministic inversion at higher decimation percentages. In this assessment, emphasis is put on the fact that the RIM generalizes better to unseen data and decimation percentages outside the prior range.
When the deterministic inversion does include the derivatives of the data (multichannel reconstruction), the reconstruction improves and becomes better than that of the deep learning methods. Deep learning has proven to be a promising strategy for the single-channel reconstruction problem, it does not lose bandwidth over the reconstructions, and it should also be considered for N-dimensional problems when only the decimated data are acquired.

The choice of hyperparameters in the RIM architecture is based on considerations made by Patrick Putzky and described in Lønning et al. (2018). The U-Net architecture is created such that it extracts a similar number of features in the first layer as the RIM does (here 64). The number of pooling layers is chosen to be four, such that the representation of the input data has a minimum size in the bottleneck layer. The size of the input data (32 receiver samples, 64 time samples) is based on the memory load on a single GPU. For the RIM, which has a higher memory load than the U-Net, this input size in batches of 32 was the maximum load the single GPU could handle. As observed in the results, with this window size the temporal and spatial structures can be captured such that generalization to full (not windowed; inference stage) seismic gathers is possible. To benchmark the U-Net against the RIM, the input size of the U-Net is chosen to be equal to that of the RIM, even though the computational load is much lower for this network and a larger window could have been chosen. The training data are windowed using non-overlapping patches; results in Mandelli et al. (2019) indicate that overlapping patches increase the computational load while yielding only a very limited increase in inference performance. Even though the neural networks have been trained to reach minimum states that are as comparable as possible, the networks should still be compared with care as their architectures are different.
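A minimal sketch of this non-overlapping windowing is given below; the helper name, the handling of the gather edges, and the absence of any normalization are assumptions made purely for illustration.

    import numpy as np

    def extract_patches(gather, win=(32, 64)):
        # split a 2D gather (receivers x time samples) into non-overlapping
        # 32 x 64 training patches; any remainder at the edges is discarded
        nx, nt = gather.shape
        wx, wt = win
        patches = []
        for ix in range(0, nx - wx + 1, wx):
            for it in range(0, nt - wt + 1, wt):
                patches.append(gather[ix:ix + wx, it:it + wt])
        return np.stack(patches)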
Effect of forward operator
The fact that the RIM takes the forward operator into account is believed to be what makes the RIM's approach to inverse problems better than that of the U-Net. Unfortunately, because that is not the only difference between the two architectures (1. the RIM is an RNN, and 2. the RIM uses the forward operator in its update function), it can only be stated with care that the use of the forward operator is what makes the RIM a better probabilistic inverse problem solver than the U-Net. To exclude the possibility that the RNN alone is what makes the RIM perform better than the U-Net, a network was trained using a unit forward operator. In that case, the predictions made by the RIM are worse than those of the U-Net. This observation supports the hypothesis and indicates that the differences between the RIM and the U-Net indeed come from the fact that the RIM can extract information from the gradient of the log-likelihood, for which the forward operator is required.
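To make this explicit, the RIM update at step t can be written, paraphrasing Putzky and Welling (2017) and assuming a Gaussian likelihood with a linear forward operator A (here the restriction operator),

    \[
      x_{t+1} = x_t + h_\phi\!\left(\nabla_{x_t}\log p(y \mid x_t),\, x_t,\, s_{t+1}\right),
      \qquad
      \nabla_{x_t}\log p(y \mid x_t) \propto A^{\top}\!\left(y - A x_t\right),
    \]

where h_\phi is the learned recurrent update function and s_{t+1} its memory state. With the unit operator, the gradient term no longer distinguishes sampled from missing traces, which is consistent with the degraded performance observed in that experiment.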
More complex forward operator
Even though the U-Net performs slightly worse than the RIM, the U-Net is able to represent the inverse of the linear forward operator decimating the data. Because the RIM is specifically designed as an approach to inverse problems, it was expected to outperform the U-Net. The RIM does perform better than the U-Net, but it did not excel in the reconstruction problem. It is believed that the RIM will excel for more complex (possibly even non-linear) forward operators. As a first test closely related to the reconstruction problem, the reconstruction problem was transformed to the Fourier domain. Reconstructing data in the space-time domain can be seen as dealiasing the Fourier spectrum that is aliased due to sub-Nyquist spatial sampling (the small numerical sketch at the end of this subsection illustrates this aliasing). Because of the current limitations of the single-GPU setup it was not possible to study this approach to more complex forward operators. This is related to the fact that taking the Fourier transform of a patch of data results in a local Fourier representation of the data instead of the full global spectrum. Training the networks to dealias the local spectrum did not correspond to dealiasing the global spectrum for any of the given methods, and this should therefore be part of future studies.

Lønning et al. (2019) did use the RIM as an approximate inverse of a more complex forward operator and also compared it to the U-Net. In that case, the data are sampled in image space with decimation taking place in another data space related to the image space by the Fourier transform. Results from Lønning et al. (2019) indicate that it is indeed the RIM's architecture that makes the network a potential inverse problem solver. The RIM generalized better to unseen data, required less training data (fewer parameters to train) and did not suffer from the structural artifacts generated by the U-Net. Again the U-Net generalized poorly to unseen data or decimation ranges, linked to its number of trainable parameters.
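The sketch below is a small, self-contained illustration (not taken from this study) of the Fourier-domain view of the problem: a single dipping event in a synthetic gather maps to one line in the f-k spectrum, and zeroing three out of every four traces replicates that line at aliased wavenumbers.

    import numpy as np

    nx, nt, dx, dt = 128, 256, 12.5, 0.004           # illustrative geometry
    x = np.arange(nx) * dx
    t = np.arange(nt) * dt
    p = 1.0 / 1500.0                                  # slowness of one linear event
    gather = np.sin(2 * np.pi * 30.0 * (t[None, :] - p * x[:, None]))

    fk_dense = np.fft.fftshift(np.abs(np.fft.fft2(gather)))

    decimated = gather.copy()
    decimated[np.arange(nx) % 4 != 0, :] = 0.0        # keep every fourth trace (75 %)
    fk_decimated = np.fft.fftshift(np.abs(np.fft.fft2(decimated)))
    # fk_decimated contains the original event plus replicas at aliased wavenumbers,
    # which is the imprint a reconstruction method has to remove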
Limitations & Future work
Unlike the deterministic inversion, the networks were free to alter the sampled traces. This might not have been the best approach and should be changed in the future. A weighting factor and the forward operator could be included in the loss function, emphasizing that the network should reconstruct the decimated traces only (a minimal sketch of such a loss is given at the end of this section). It is believed that this would positively affect the reconstruction results.

From these results and those in Mandelli et al. (2019), it became clear that not just the RIM but also the U-Net has the ability to represent the inverse of the restriction operator. Despite currently being limited by the single-GPU setup, it would be interesting to test the ability of both networks to represent more complex (possibly non-linear) operators. Results from Lønning et al. (2019) indicate that in that case the RIM will outperform the U-Net. This could be studied in the Fourier domain as a follow-up to this study, where reconstruction took place in the space-time domain. With multiple GPUs it would be possible to distribute the training data without being limited to the 32x64 window size currently used. This would mean the networks can be trained to dealias the global Fourier spectrum, thereby reducing spatial aliasing and thus reconstructing decimated data in the space-time domain. This study, as well as comparisons made by e.g. Kim and Nakata (2018) and Russell (2019), indicates that deep learning should indeed be considered a viable option for solving inverse problems, especially those for which deterministic inversion is not possible.

It would also be interesting to use the reconstructed data volumes in post-processing steps. For example, migration can be performed on the reconstructed, highly sparse 3D ocean turbulence data volume. At this point, the comparison between the deterministic and probabilistic approaches is limited to the reconstructions; after migration it would be possible to see whether the methods result in a similar image of the studied subsurface. A decisive conclusion should therefore not be based purely on the metrics used in this study; different effects may or may not matter in post-processing steps, and it is therefore difficult to state exactly what makes a reconstructed image 'good'. Using the reconstructed data volumes for migration is currently part of ongoing studies.
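A hedged sketch of the masked loss suggested above is given below, assuming PyTorch and a sampling mask aligned with the gather; the weighting scheme, the weight value and the function name are illustrative assumptions rather than the loss used in this study.

    import torch

    def masked_reconstruction_loss(pred, target, mask, w_sampled=0.1):
        # `mask` is 1 where a trace was sampled and 0 where it was decimated;
        # sampled traces get a reduced weight so the network focuses on the
        # missing traces instead of altering data that was actually recorded
        weights = torch.where(mask.bool(),
                              torch.full_like(pred, w_sampled),
                              torch.ones_like(pred))
        return torch.mean(weights * (pred - target) ** 2)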
Conclusions

In this study, two different approaches to solving the reconstruction problem, as an example of an inverse problem for which the forward operator is known, have been compared. The deterministic inversion without regularization is not capable of reconstructing the decimated seismic data when the acquisition did not follow the setup specified by the Shannon-Nyquist sampling theorem.

Two deep learning methods, which approach the inverse problem in a probabilistic sense, have been compared on different reconstruction tasks. It can be concluded that the most important element in building a well-generalizing neural network is the prior space. In the seismic data reconstruction problem, this prior space should consist of features similar to those to be inferred, including bandwidth, structural and velocity variations, and a range of decimation percentages. The ability of the deep learning methods to represent the inverse of the restriction operator is better than that of the deterministic inversion. The predictions made by the networks result in higher SSIM values and better estimates of the norm for all studied decimation percentages, patterns and datasets. The deep learning methods are capable of eliminating spatial aliasing in the Fourier domain, whereas the inversion cannot undo the aliasing caused by sub-Nyquist spatial sampling. Both deep learning methods have proved to be able to map decimated data into dense seismic data, thereby solving the reconstruction problem. The deterministic inversion can be improved by incorporating spatial derivatives. The two-step deep-learning reconstruction of the highly sparse 3D dataset showed that deep learning should be considered a viable option for N-dimensional data reconstruction when only the decimated data are acquired.

The RIM architecture is specifically designed to approximate the inverse of the forward operator and is compared to the U-Net (initially designed for image segmentation). Benchmarking the RIM against the U-Net leads to the conclusion that the RIM generalizes better to unseen decimation percentages and data, due to the nature of the architecture in which the reconstruction is regularized by the forward operator. The RIM contains fewer trainable parameters and is thereby less prone to overfitting. For simple linear operators, the U-Net is also capable of inverting the system, albeit while underestimating amplitudes and introducing low-frequency artifacts, thereby requiring further processing before using the data volumes in e.g. migration and full waveform inversion.

Benchmarking the RIM against other deep learning architectures for more complex forward operators should be the subject of future studies. However, the initial results presented here show that RIMs have great potential in seismic processing problems where determining a complex inverse map to a known forward problem is the goal of inference by machine learning.

References

, V19–V28.
Herrmann, F. J., 2010, Randomized sampling and sparsity: Getting more information from fewer samples: Geophysics, , WB173–WB187.
Ioffe, S., and C. Szegedy, 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift: arXiv preprint arXiv:1502.03167.
Kim, Y., and N. Nakata, 2018, Geophysical inversion versus machine learning in inverse problems: The Leading Edge, , 894–901.
Kingma, D. P., and J. Ba, 2014, Adam: A method for stochastic optimization: arXiv preprint arXiv:1412.6980.
Lønning, K., P. Putzky, M. W. Caan, and M. Welling, 2018, Recurrent inference machines for accelerated MRI reconstruction: Presented at the International Conference on Medical Imaging with Deep Learning (MIDL 2018).
Lønning, K., P. Putzky, J.-J. Sonke, L. Reneman, M. W. Caan, and M. Welling, 2019, Recurrent inference machines for reconstructing heterogeneous MRI data: Medical Image Analysis, , 64–78.
Mandelli, S., V. Lipari, P. Bestagini, and S. Tubaro, 2019, Interpolation and denoising of seismic data using convolutional neural networks: arXiv preprint arXiv:1901.07927.
Martinez, M. T., 2016, An overview of Google's Machine Intelligence Software TensorFlow: Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States).
Ndajah, P., H. Kikuchi, M. Yukawa, H. Watanabe, and S. Muramatsu, 2010, SSIM image quality metric for denoised images: Proc. 3rd WSEAS Int. Conf. on Visualization, Imaging and Simulation, 53–58.
Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., 2019, PyTorch: An imperative style, high-performance deep learning library: Advances in Neural Information Processing Systems, 8024–8035.
Peng, H., and I. Vasconcelos, 2019, A study of acquisition-related sub-sampling and aperture effects on Marchenko focusing and redatuming, in SEG Technical Program Expanded Abstracts 2019: Society of Exploration Geophysicists, 248–252.
Putzky, P., and M. Welling, 2017, Recurrent inference machines for solving inverse problems: arXiv preprint arXiv:1706.04008.
Ravasi, M., and I. Vasconcelos, 2020, PyLops—A linear-operator Python library for scalable algebra and optimization: SoftwareX, , 100361.
Ronneberger, O., P. Fischer, and T. Brox, 2015, U-Net: Convolutional networks for biomedical image segmentation: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 234–241.
Ruan, J., 2019, Compressive Acquisition and Ocean Turbulence Wavefield Reconstruction: Master's thesis, Utrecht University.
Ruan, J., and I. Vasconcelos, 2019, Data- and prior-driven sampling and wavefield reconstruction for sparse, irregularly-sampled, higher-order gradient data, in SEG Technical Program Expanded Abstracts 2019: Society of Exploration Geophysicists, 4515–4519.
Russell, B., 2019, Machine learning and geophysical inversion—A numerical study: The Leading Edge, , 512–519.
Siahkoohi, A., R. Kumar, and F. Herrmann, 2018, Seismic data reconstruction with generative adversarial networks: Presented at the 80th EAGE Conference and Exhibition 2018.
Udacity, 2019, Intro to Deep Learning with PyTorch,