Training Image Selection using Recurrent Neural Networks: An Application in Hydrogeology
Zhendan Cao, Jiawei Shen∗, Mathieu Gravey
Department of Land, Air and Water Resources, University of California, Davis, USA
College of Mathematics and Statistics, Chongqing University, Chongqing, China
Institute of Earth Surface Dynamics, University of Lausanne, Lausanne, Switzerland
Key Points:
• A new training image (TI) selection model for multiple-point geostatistics is proposed based on recurrent neural networks (RNNs)
• RNNs are applied to extract features from the observed hydraulic head time series as the criteria for the TI selection
• The GRU model has the best performance in the TI selection task and can reach an accuracy of 97.63%

∗ This author contributed equally with the first author.
Corresponding author: Zhendan Cao, [email protected]

Abstract
Multiple-point geostatistics plays an important role in characterizing complex subsurface aquifer systems such as channelized structures. However, only a few studies have paid attention to how to choose an applicable training image (TI). In this paper, a TI selection method based on recurrent neural networks (RNNs) is proposed. A synthetic case with two channelized training images is tested, given the hydraulic head time series. Three different RNN architectures are compared for selection performance, and various input scenarios are tested, including the number of observation wells, the number of observation time steps, the influence of observation noise, and the training dataset size. In this TI selection task, the GRU has the best performance among the three architectures and reaches an accuracy of 97.63%.
Plain Language Summary
The physical properties of a subsurface aquifer system, such as hydraulic conductivity, play a significant role in groundwater modeling. Multiple-point geostatistics (MPS) is a competent approach for reproducing complex subsurface systems. A training image (TI) can be a conceptual representation of the subsurface hydraulic conductivity, and it is the foremost input of an MPS algorithm. However, few studies have paid attention to how to choose a proper TI. In this paper, we propose a new method that uses observations of the hydraulic head to select the most likely TI from a set of candidates with the help of recurrent neural networks (RNNs). Two different channelized training images and three different RNN architectures were applied to test the performance of this new method. The results show that our method can reach an accuracy of 97.63% in the TI selection task.
Introduction

Geological subsurface modeling plays a significant role in groundwater protection and management. Reproducing the strong heterogeneity of aquifers is one of the key tasks of most studies. Geostatistical simulations can capture the details of the heterogeneity of subsurface models. However, traditional two-point geostatistical modeling fails to reproduce complex subsurface structures such as channelized geometries (Caers, 2001; Journel & Zhang, 2006).

Multiple-point geostatistics (MPS) has gained popularity for capturing the complexity of subsurface systems. Widely used algorithms include SNESIM (Strebelle, 2002), Direct Sampling (Mariethoz et al., 2010), and graph-cuts methods (Li et al., 2016). The main idea of these MPS algorithms is to generate realizations that mimic a training image (TI). The TI is a conceptual model that reflects the geometrical structure of the subsurface; it can be obtained from outcrop data, object-based methods (Deutsch & Wang, 1996; Holden et al., 1998), or process-based methods (Gross & Small, 1998; Pyrcz et al., 2009).

However, although it is the foremost input of the MPS workflow, the TI is usually selected by researchers from their empirical understanding of the study area based on outcrops and sparse hard data, and therefore often carries a large uncertainty that can greatly affect subsequent studies (Tahmasebi, 2018). Nevertheless, only a few studies have discussed how to choose or determine a proper TI. Brunetti et al. (2019) proposed an MCMC-based method to select the hydrogeological model and applied it to the MADE site. However, due to the computational cost, it is not feasible to apply this method to a 3D model selection problem.

In traditional hydrogeological conceptual model selection, matching the observation data is one of the most important criteria (Carrera et al., 1993; Rojas et al., 2008; Refsgaard et al., 2012; Pirot et al., 2015). Knowing that the hydraulic conductivity has a strong correlation with the hydraulic head, and by observing the change of the hydraulic head in a transient-state model with channelized hydraulic conductivity, we find that the pattern of the hydraulic head change is quite sensitive to the hydraulic conductivity. The head drop between two adjacent time steps is faster where there are channels with larger hydraulic conductivity, which indicates that the features of the hydraulic head change can reflect the pattern of the hydraulic conductivity. Therefore, learning features from the hydraulic head time series of groups of realizations generated from their corresponding TIs can help us determine which TI is more applicable to the study area.

Recurrent neural networks (RNNs) have a strong capability for dealing with sequential data and have been widely applied to real-world tasks such as speech recognition, machine translation, and time series prediction (Graves et al., 2013; Sutskever et al., 2014; Fernández et al., 2007). Chen et al. (2019) combined the Gated Recurrent Unit (GRU) and kernel principal component analysis (KPCA) to predict the remaining useful life of a very complex system where conventional methods do not perform well. Kumar et al. (2004) found that RNNs gave better results than artificial neural networks (ANNs) in river flow forecasting. Zhang et al. (2018) used Long Short-Term Memory (LSTM) with water diversion, evaporation, precipitation, temperature, and time data to forecast groundwater table depth in agricultural areas. Inspired by these properties and applications of RNNs, we applied different RNNs, i.e., the standard RNN, GRU, and LSTM, to extract features from observation data of the hydraulic head at different time steps. A multilayer perceptron (MLP) was then adopted to map the extracted features to a vector representing the predicted probability of each TI. To the best of our knowledge, this is the first attempt to apply RNNs to choose a proper TI from observation data of the hydraulic head.

In this paper, a new TI selection method based on RNNs is proposed for MPS algorithms when several TIs are available. Instead of running thousands of iterations as MCMC-based methods do, this new approach only needs a single round of forward modeling to generate the training data, and the RNN training part is also computationally efficient. Two TIs with channelized geometrical structures were applied in a hydrogeological model to test the performance of this new method. The results show that the selection model can reach an accuracy of 97.63% using the GRU in the synthetic hydrogeological case.
Methodology

The TI selection steps can be summarized as follows. First, select the TI candidates: several TIs need to be chosen or generated from field observation data for evaluation. Then, generate the corresponding realizations from each TI with an MPS algorithm. After that, solve the groundwater flow equation with MODFLOW for each realization to obtain the hydraulic head time series, and label each time series with its corresponding TI. Next, train RNNs to extract the features of the hydraulic head time series and make the prediction decisions. Once the RNN training is done, feed the observations of the hydraulic head into the trained model and it will select the most likely TI. Figure 1 shows the flowchart.
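Schematically, the data-preparation loop can be written as below. This is a minimal sketch rather than the authors' code: `generate_realization` and `run_modflow` are hypothetical stand-ins for an MPS simulator (e.g., QuickSampling) and a MODFLOW wrapper, and the realization count of 2000 per TI is chosen to match the 4000-sample dataset used later.

```python
import numpy as np

def generate_realization(ti: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an MPS simulation using the given TI."""
    raise NotImplementedError

def run_modflow(k_field: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in returning heads, shape (time_steps, n_wells)."""
    raise NotImplementedError

def build_dataset(ti_candidates, n_realizations=2000):
    sequences, labels = [], []
    for label, ti in enumerate(ti_candidates):
        for _ in range(n_realizations):
            k_field = generate_realization(ti)  # MPS realization of the TI
            heads = run_modflow(k_field)        # forward groundwater model
            sequences.append(heads)             # hydraulic head time series
            labels.append(label)                # label the series with its TI
    return np.stack(sequences), np.array(labels)
```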
The time series were generated using MODFLOW-2000 (Harbaugh et al., 2000) with the following governing equation:

$$\frac{\partial}{\partial x}\left(K_x \frac{\partial h}{\partial x}\right) + \frac{\partial}{\partial y}\left(K_y \frac{\partial h}{\partial y}\right) + \frac{\partial}{\partial z}\left(K_z \frac{\partial h}{\partial z}\right) - W^* = S_s \frac{\partial h}{\partial t} \tag{1}$$

where $K$ is the hydraulic conductivity, $h$ is the hydraulic head, $S_s$ is the specific storage, $t$ is time, and $W^*$ is the source and sink term.
Figure 1. Flowchart of the TI selection method. The first step is to choose the TI candidates and then apply the MPS algorithm to generate realizations. After that, feed MODFLOW with each realization as the hydraulic conductivity field and obtain the hydraulic head time series. Then, extract the data at the observation points and label the data based on the TI. Train the RNNs with the prepared data; the RNN module gives the predicted probability of each TI, and the TI with the maximum probability is selected.
Figure 2. Structure diagram of RNNs. $x_t$ denotes the observed data of the hydraulic head at time step $t$; $h_t$ denotes the hidden state at time step $t$.

Knowing that RNNs are efficient architectures for sequential data, we applied three different RNN models to test the TI selection performance. As Figure 2 shows, the whole neural network is divided into two parts, an RNN and an MLP. First, the RNN extracts features from the observation data of the hydraulic head. The extracted features are vector representations called hidden states. Then, the MLP takes the hidden state at the last time step as input and maps it into another vector that can be considered the predicted probability of each TI.

The idea of RNNs is to repeatedly apply a shared unit and to update the hidden state along the time steps. The shared unit is a chunk of neural network that takes the previous hidden state and the data at the current time step as input and outputs the current hidden state. All data at different time steps share the same repeating unit, and a chain allows information to be passed along time steps, as shown in Figure 2. The standard RNN, GRU, and LSTM differ in the design of the repeating unit.
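As an illustration, a minimal PyTorch sketch of this two-part network is given below (GRU variant shown). The hidden size of 128 follows Table 1, while the depth and width of the MLP head are assumptions, since the exact MLP configuration is not listed in the paper.

```python
import torch
import torch.nn as nn

class TISelector(nn.Module):
    def __init__(self, n_wells: int, n_tis: int, hidden_size: int = 128):
        super().__init__()
        # RNN part: encodes the head sequence into hidden states
        self.rnn = nn.GRU(input_size=n_wells, hidden_size=hidden_size,
                          batch_first=True)
        # MLP part: maps the last hidden state to one score per TI
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, n_tis),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, n_wells) hydraulic head sequences
        _, h_last = self.rnn(x)             # h_last: (1, batch, hidden_size)
        return self.mlp(h_last.squeeze(0))  # logits over TI candidates

model = TISelector(n_wells=49, n_tis=2)
logits = model(torch.randn(8, 100, 49))     # 8 sequences of 100 time steps
probs = torch.softmax(logits, dim=-1)       # predicted probability of each TI
```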
In the standard RNN, the repeating unit has a simple structure such as a single tanh layer, i.e., a linear layer with a tanh activation function. Given an input sequence $x = (x_1, x_2, \cdots, x_T)$, the hidden state is updated iteratively from $t = 1$ to $T$:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{2}$$

where $f$ is a non-linear activation function, $h_t$ is the current hidden state at time $t$, $x_t$ is the current input at time $t$, $h_{t-1}$ is the previous hidden state at time $t-1$, $W_{xh}$ and $W_{hh}$ are the recurrent weight matrices, and $b_h$ denotes the bias vector.

However, it has been observed by Bengio et al. (1994) that the standard RNN suffers from long-term dependency problems, i.e., for long input sequences the gradients tend to vanish, so that the information extracted from early observation data is forgotten.
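For concreteness, the recurrence of Equation (2) can be written out directly in NumPy with $f = \tanh$. The dimensions (49 wells, 128 hidden units) follow the experiments below; the random weights are illustrative only.

```python
import numpy as np

T, d_in, d_h = 100, 49, 128            # time steps, input dim, hidden dim
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_in))     # stand-in for head observations x_1..x_T
W_xh = 0.01 * rng.standard_normal((d_h, d_in))
W_hh = 0.01 * rng.standard_normal((d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                      # initial hidden state h_0
for t in range(T):
    # h_t = f(W_xh x_t + W_hh h_{t-1} + b_h), with f = tanh
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
# h now holds h_T, the feature vector handed to the MLP
```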
Figure 3. Frame of the LSTM unit.
Figure 4. Frame of the GRU unit.
The LSTM was initially proposed by Hochreiter and Schmidhuber (1997). The LSTM unit adopts a gating mechanism to help the RNN avoid the long-term dependency problem. The structure of the LSTM unit is shown in Figure 3.

The LSTM unit consists of a cell state, a hidden state, and three different gates. Each gate is a sigmoid layer that takes the previous hidden state and the current data as input. The cell state is introduced in the LSTM to store historical information from the previous sequential data. The extent to which the previous cell state needs to be forgotten is regulated by a forget gate:

$$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f) \tag{3}$$

The new information from the current input and the previous hidden state is learned by a tanh layer. Then an input gate decides the extent to which the new information needs to be added to the cell state:

$$n_t = \tanh(W_n h_{t-1} + U_n x_t + b_n) \tag{4}$$
$$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i) \tag{5}$$

Next, the cell state is updated by partially forgetting historical information and adding new information:

$$c_t = f_t \odot c_{t-1} + i_t \odot n_t \tag{6}$$

Lastly, the updated cell state is passed through a tanh layer to reproduce the current hidden state, and an output gate decides the extent to which the current hidden state is retained:

$$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o) \tag{7}$$
$$h_t = o_t \odot \tanh(c_t) \tag{8}$$

In the formulas above, $h_t$, $c_t$, $x_t$ are the hidden state, cell state, and input at time $t$, and $f_t$, $i_t$, $o_t$ are the forget, input, and output gates, respectively. $n_t$ is the new information, $W$ and $U$ denote weight matrices, $b$ denotes bias vectors, $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product.

The GRU was proposed by Chung et al. (2014) for statistical machine translation. Figure 4 shows its structure. Like the LSTM unit, the GRU adopts a gating mechanism but has no separate cell state; the cell state and hidden state are merged. Another difference is that the forget gate and input gate are combined into a single layer, so the GRU has only two gates, the reset and update gates.

The reset gate decides how much of the previous hidden state is retained to produce new information:

$$r_t = \sigma(W_r h_{t-1} + U_r x_t + b_r) \tag{9}$$

The new information is computed by

$$n_t = \tanh(W_n (r_t \odot h_{t-1}) + U_n x_t + b_n) \tag{10}$$

The hidden state is updated by a weighted average between the previous hidden state and the new information, where the update gate decides the weight coefficient:

$$z_t = \sigma(W_z h_{t-1} + U_z x_t + b_z) \tag{11}$$
$$h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1} \tag{12}$$

In the formulas above, $n_t$ is the new information and $r_t$, $z_t$ are the reset and update gates, respectively; $W$, $U$ denote weight matrices, $b$ denotes bias vectors, $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product.
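A didactic NumPy version of one GRU step following Equations (9)-(12) is sketched below; this is not the fused implementation that deep learning libraries use.

```python
import numpy as np

def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_n, U_n, b_n, W_z, U_z, b_z):
    r_t = sigmoid(W_r @ h_prev + U_r @ x_t + b_r)          # reset gate, Eq. (9)
    n_t = np.tanh(W_n @ (r_t * h_prev) + U_n @ x_t + b_n)  # new information, Eq. (10)
    z_t = sigmoid(W_z @ h_prev + U_z @ x_t + b_z)          # update gate, Eq. (11)
    return (1.0 - z_t) * n_t + z_t * h_prev                # new hidden state, Eq. (12)
```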
In hydrological conceptual model selection, various studies have applied Bayesian statistics to help make the decision (Hsu et al., 2009; Schöniger et al., 2014; Brunetti et al., 2017). The proposed neural network approach can also be interpreted from a statistical perspective. In Bayesian theory, the posterior probability $p(Y \mid X)$ of a model $Y$ given data $X$ plays a crucial role in model selection. By Bayes' theorem,

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} \tag{13}$$

where $X$ denotes the observations of hydraulic head and $Y$ is the TI candidate, $Y \in \{1, 2, \cdots, K\}$, with $K$ the number of TI candidates. As the term $p(X)$ is a normalizing constant of the posterior distribution of $Y$, it can be neglected, and the posterior probability is proportional to the likelihood function multiplied by the prior probability:

$$p(Y \mid X) \propto p(X \mid Y)\, p(Y) \tag{14}$$

The term $p(X \mid Y)$, named the Bayesian model evidence (BME) or marginal likelihood, quantifies the likelihood of the observed data integrated over the TI's realization space:

$$p(X \mid Y) = \int p(X \mid Y, \mu)\, p(\mu \mid Y)\, \mathrm{d}\mu \tag{15}$$

where $\mu$ denotes a realization of the TI, $p(\mu \mid Y)$ represents the probability that the realization $\mu$ is generated by a given TI $Y$, and $p(X \mid Y, \mu)$ represents the probability that the observed data $X$ are generated by a given realization $\mu$. For simplicity, the errors are generally assumed to be uncorrelated and Gaussian distributed with constant standard deviation $\sigma$:

$$p(X \mid Y, \mu) = \left(\sqrt{2\pi}\,\sigma\right)^{-n} \exp\!\left[-\frac{1}{2} \sum_{i=1}^{n} \left(\frac{x_i - F_i(\mu)}{\sigma}\right)^{2}\right] \tag{16}$$

where $F(\mu)$ is the time series of hydraulic head generated by MODFLOW for realization $\mu$, and $n$ is the total number of hydraulic head observations. Since Equation (15) is hard to compute, it can be approximated by the Monte Carlo method (Hammersley, 1960):

$$p(X \mid Y) \approx \frac{1}{N} \sum_{i=1}^{N} p(X \mid Y, \mu_i) \tag{17}$$

where $\mu_i$ denotes the realizations of $Y$ generated through an MPS algorithm and $N$ is the number of realizations. Other modified approaches based on Monte Carlo integration, such as path sampling, power posteriors, and stepping-stone sampling, have been proposed to compute the marginal likelihood $p(X \mid Y)$ in Equation (15) (Gelman & Meng, 1998; Friel & Pettitt, 2008; Xie et al., 2011). All of them are in the Bayesian frame, i.e., compute the marginal likelihood first and then use Equation (14) to compare the posterior probability of each TI. The TI with the larger posterior probability is preferred statistically. Equation (17) can be evaluated directly, as in the sketch below.
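For comparison with our learned approach, the Monte Carlo estimate of Equations (16)-(17) can be sketched as follows, computed in log space for numerical stability; `x_obs` and `simulated_heads` are placeholders for the observed and simulated head series.

```python
import numpy as np

def log_likelihood(x_obs: np.ndarray, x_sim: np.ndarray, sigma: float) -> float:
    # log p(X | Y, mu) for uncorrelated Gaussian errors, Eq. (16)
    n = x_obs.size
    resid = (x_obs - x_sim) / sigma
    return -n * np.log(np.sqrt(2.0 * np.pi) * sigma) - 0.5 * np.sum(resid ** 2)

def log_bme_monte_carlo(x_obs, simulated_heads, sigma=1.0) -> float:
    # log of p(X|Y) ~= (1/N) sum_i p(X | Y, mu_i), Eq. (17), via log-sum-exp
    logs = np.array([log_likelihood(x_obs, s, sigma) for s in simulated_heads])
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))
```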
Our approach uses a different strategy. Instead of computing the marginal likelihood first, we use a neural network with parameters $\theta$ to approximate the posterior probability directly. We use an MPS algorithm to generate realizations for each TI candidate $Y$, then apply MODFLOW to produce the hydraulic head sequence $x$ for each realization. The neural network learns from the matched $(x, y)$ pairs: it takes the hydraulic head $x$ as input and outputs a real vector $p = (p_1, p_2, \cdots, p_K)$ with $p_i = p(Y = i \mid X)$, the probability of TI $i$ given the hydraulic head sequence $X$, where $\sum_{i=1}^{K} p(Y = i \mid X) = 1$. Obviously,

$$p(Y \mid X) = \prod_{i=1}^{K} p(Y = i \mid X)^{I_{\{i\}}(Y)} \tag{18}$$

where $I_{\{i\}}(Y)$ is the indicator function,

$$I_{\{i\}}(Y) = \begin{cases} 1, & Y = i \\ 0, & \text{otherwise} \end{cases} \tag{19}$$

The training of the neural network is to find a proper group of parameters $\theta$ such that

$$\hat{\theta} = \arg\max_{\theta} \log p(Y \mid X, \theta) = \arg\min_{\theta} \left(-\log p(Y \mid X, \theta)\right) = \arg\min_{\theta} \text{Loss} \tag{20}$$

We choose the cross-entropy error as the loss function, which is equal to the negative logarithm of the posterior probability. Specifically,

$$\text{Loss} = -\sum_{(x, y) \in S} \sum_{i=1}^{K} I_{\{i\}}(y) \log p(y = i \mid x) \tag{21}$$

where $S$ is the dataset, $N$ is the dataset size, and $(x, y)$ are the matched pairs of hydraulic head sequence and TI label in $S$. For prediction, the trained neural network takes the observed hydraulic head data as input and outputs the TI prediction vector $p$. The selected TI is

$$\hat{Y} = \arg\max_{i \in \{1, 2, \ldots, K\}} p_i \tag{22}$$

All networks were trained in PyTorch, an open-source machine learning library with strong graphics processing unit (GPU) acceleration (Paszke et al., 2019). We used the Adam algorithm to search for a proper group of parameters that minimizes the cross-entropy loss in Equation (20). Adam is a gradient-based optimization algorithm that calculates the gradients of the loss function with respect to the weights of the network. It can be considered a combination of AdaGrad and RMSProp and has a faster convergence rate than other gradient-based algorithms, e.g., SGD, AdaGrad, and RMSProp (Kingma & Ba, 2014). The gradients involved in Adam can be computed by back-propagation through time (BPTT), an extension of back-propagation (BP) for RNNs (Rumelhart et al., 1986; Hecht-Nielsen, 1992; Werbos, 1990); in PyTorch, gradients are computed by automatic differentiation (Paszke et al., 2017).

The dataset was divided into a training set (80%) and a testing set (20%). We trained the network on the training set and tested its performance on the testing set. L2 regularization and dropout were used to prevent overfitting, i.e., models performing much better on the training set than on the testing set (Hawkins, 2004). L2 regularization is a popular weight-decay approach that drives the weights of the network closer to 0; it has been observed that weight decay can improve the generalization of neural networks (Krogh & Hertz, 1992). The key point of L2 regularization is adding a penalty term to the loss function. The main idea of dropout is to randomly drop nodes of the neural network during training, which can improve performance by preventing units from co-adapting (Hinton et al., 2012; Srivastava et al., 2014). Batch normalization (BN) and exponential learning rate decay were also adopted to accelerate convergence. BN is a widely used technique that makes the training of neural networks faster and more stable. Despite its promising properties, the mechanism of BN remains under discussion. A widely accepted explanation is that BN reduces the internal covariate shift problem, i.e., the change in the distributions of layers' inputs that makes upper layers hard to learn (Ioffe & Szegedy, 2015); by fixing the distribution of layers' inputs, BN can improve training speed. Santurkar et al. (2018) proposed that BN improves training speed by smoothing the optimization landscape rather than by reducing internal covariate shift, and Cooijmans et al. (2016) found it beneficial to apply BN to the hidden state transition in RNNs.

Since model performance is influenced by the hyper-parameters of the network, we adopted a random search strategy to find a proper group of hyper-parameters: for each model, the best group was chosen from at least 50 different candidates. The hyper-parameters cover the learning rate, dropout probability, regularization coefficient, dimension of the hidden state, batch size, and the period and multiplicative factor of the learning rate decay. We used ray.tune, a unified platform for model selection that provides hyper-parameter search algorithms (Liaw et al., 2018). Part of the hyper-parameter settings is shown in Table 1.
Table 1. Hyper-parameter settings. Lambda denotes the regularization coefficient, hidden size the dimension of the hidden state, gamma the multiplicative factor of the learning rate decay, and step size the period of the learning rate decay.

model  learning rate  dropout        lambda      hidden size  batch size  gamma  step size
RNN    0.0001         6.09 × 10^−…   … × 10^−…   128          64          0.5    40
GRU    0.0001         1.37 × 10^−…   … × 10^−…   128          128         0.2    40
LSTM   0.0001         7.04 × 10^−…   … × 10^−…   128          128         0.5    40
The dataset was constructed in two steps. The first step is generating realizations from each TI with an MPS algorithm; the second step is generating the hydraulic head time series with MODFLOW for each realization.
Two TI candidates of a potentially fluvial aquifer were selected to generate realizations and obtain the corresponding hydraulic head time series. TI a (Figure 5(a)) is a classic channelized TI obtained from Strebelle (2002); TI b (Figure 5(b)) is obtained from the TI Library and represents the Bangladesh delta. In both TIs, the channels are assumed to be filled with sand and the background with clay. Figure 6 shows the corresponding realizations of each TI generated by QuickSampling (Gravey & Mariethoz, 2020); the input parameters of QuickSampling are shown in Table 2. Note that we chose two structurally similar TIs in order to make the selection task more challenging. In TI a, the channels have the same width, while in TI b the channel width varies from place to place; however, both show strong connectivity and curvilinear structures.
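For reference, a QuickSampling call via the G2S server's Python interface might look as below. The flags follow the g2s package conventions as we understand them and are assumptions here, with parameter values as illustrative placeholders (the paper's actual QS parameters are those of Table 2).

```python
import numpy as np
from g2s import g2s  # Python interface to the G2S geostatistical server

ti = np.load("ti_a.npy")            # hypothetical file holding a categorical TI
di = np.full((100, 100), np.nan)    # empty destination grid (unconditional)

# '-a' algorithm, '-ti' training image, '-di' destination grid, '-dt' variable
# type (1 = categorical), '-k' and '-n' the QS parameters, '-j' parallelism
sim, *_ = g2s('-a', 'qs',
              '-ti', ti,
              '-di', di,
              '-dt', np.ones(1),
              '-k', 1.2,
              '-n', 25,
              '-j', 0.5)
```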
Figure 5. TI candidates.
Figure 6. Realizations of the corresponding TIs. Row a shows the realizations of TI a, and row b the realizations of TI b.
Table 2. Input parameters of the QS.

Training Image a: size 250 × … × …
Figure 7. Selected hydraulic head time series.
A transient flow model was applied to run the test. A period of 2000 days was divided into 100 time steps. The aquifer has 100 × … × … cells, and a pumping well with a constant rate q (in m³/day) was applied to obtain the time series of the hydraulic head. The hydraulic conductivity K for sand was set to K_sand = 3.… m/day, and for clay to K_clay = 0.… m/day. Different numbers of observation wells were selected to collect the observed hydraulic head data. Table 3 shows the aquifer setting parameters, and a schematic model setup is sketched after the table. Figure 7 is an illustration of the hydraulic head time series at different time steps for one realization; as the figure shows, the pattern of the change of the hydraulic head can reflect the channel structure to a certain degree.

Table 3. Setup of the groundwater flow model.

Model size: 100 × … × …
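A hedged flopy sketch of a transient model of this kind is given below. The paper uses MODFLOW-2000; this sketch targets the closely related mf2005 executable through flopy, and the grid dimensions beyond 100 rows/columns, the pumping rate, the well location, and the exact K values are placeholders, since they are not fully legible in the source.

```python
import numpy as np
import flopy

K_SAND, K_CLAY = 3.0, 0.1       # m/day; placeholder conductivities
Q_PUMP = -50.0                  # m^3/day; placeholder pumping rate

facies = np.random.randint(0, 2, size=(1, 100, 100))   # stand-in MPS realization
hk = np.where(facies == 1, K_SAND, K_CLAY)             # binary K field

m = flopy.modflow.Modflow("ti_case", exe_name="mf2005")
dis = flopy.modflow.ModflowDis(m, nlay=1, nrow=100, ncol=100,
                               nper=100, perlen=20.0, steady=False)  # 2000 days
bas = flopy.modflow.ModflowBas(m, ibound=1, strt=10.0)
lpf = flopy.modflow.ModflowLpf(m, hk=hk, ss=1e-4)
wel = flopy.modflow.ModflowWel(m, stress_period_data={0: [[0, 50, 50, Q_PUMP]]})
oc = flopy.modflow.ModflowOc(m)
pcg = flopy.modflow.ModflowPcg(m)
m.write_input()
# after m.run_model(), heads at the observation wells can be read with
# flopy.utils.HeadFile("ti_case.hds")
```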
We used three metrics to evaluate the performance of all models.

1. Accuracy: the percentage of correct predictions for model selection.
2. AUC: the Area Under the ROC Curve, which equals the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (Fawcett, 2006). A higher AUC value indicates better performance.
3. Cross-entropy loss: since the cross-entropy loss is the objective function that all models attempt to minimize, we use it as a straightforward metric.

The accuracy and AUC on the testing set are the two main metrics; a short sketch of how they are computed follows this list.
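A minimal sketch of computing the three metrics with scikit-learn and PyTorch is shown below; `y_true` and `logits` are placeholders for the test labels and the model outputs.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, roc_auc_score

# placeholders for test labels and model outputs
y_true = torch.randint(0, 2, (200,))
logits = torch.randn(200, 2)
probs = torch.softmax(logits, dim=-1)

acc = accuracy_score(y_true, probs.argmax(dim=-1))   # fraction of correct picks
auc = roc_auc_score(y_true, probs[:, 1])             # AUC for the two-TI case
ce = F.cross_entropy(logits, y_true).item()          # cross-entropy loss
```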
Results

We compared the performance of the three different RNNs on the dataset with 49 observation wells and 100 time steps of the hydraulic head. The results are shown in Figure 8 and Table 4. All three RNNs achieve over 90% accuracy on the testing set, which indicates that this method selects the TI with high accuracy given the observation data. In addition, according to Figure 8, the GRU and LSTM achieve better performance than the standard RNN. The reason is that the standard RNN suffers from long-term dependency problems, while the GRU and LSTM alleviate them with the gating mechanism. The overfitting problem persists although we tried L2 regularization and dropout with different combinations of regularization coefficient and dropout probability; this is possibly because a dataset size of 4000 is relatively small for the selection task, so overfitting is hard to avoid. Compared to the LSTM, the GRU performs slightly better on the testing data, with a 1.19% rise in accuracy, a 0.0089 rise in AUC, and a 0.0326 decline in loss. It seems that the GRU is a more efficient model for extracting information from the observation data of hydraulic head, but the impact of the RNN architecture is small.
The number of observation wells of hydraulic head, N, is an important factor with a direct impact on the TI selection accuracy. In this approach, N is also the input dimension of the RNN. We compared GRU performance on datasets with N = 9, 49, 400, 900, and 1600, respectively. Since the appropriate hyper-parameters may vary with the number of observation wells, we tried 50 different groups of hyper-parameters for each N and chose the best one, i.e., the group of hyper-parameters with the highest accuracy on the testing set.
Table 4. Model comparison on the dataset with 49 observation wells and 100 time steps. All results were computed with the parameters that made each test accuracy highest.

model  train accuracy  test accuracy  train AUC  test AUC  train loss  test loss
RNN    0.9372          0.9023         0.9831     0.9592    0.1729      0.2440
GRU    0.9559          0.9219         0.9897     0.9698    0.1458      0.2122
LSTM   —               0.9100         —          0.9609    —           0.2448
Figure 8. Model comparison on the dataset with 49 observation wells and a sequence length of 100. Each model was trained with 50 different groups of hyper-parameters, and the best one is presented.
Table 5. GRU performance on datasets with different numbers of observation wells. For each number of observation wells, 50 different groups of hyper-parameters were tested, and the one with the highest accuracy on the testing set was selected; all rows are presented with the parameters that made each test accuracy highest.

observation wells  train accuracy  test accuracy  train AUC  test AUC  train loss  test loss
9                  0.8678          0.8555         0.9446     0.9073    0.3025      0.3878
49                 0.9559          0.9219         0.9897     0.9698    0.1458      0.2122
400                —               —              —          0.9705    —           —
900                —               —              —          —         —           —
1600               —               —              —          —         —           —
Table 6. GRU performance on datasets with different sequence lengths of hydraulic head.

sequence length  train accuracy  test accuracy  train AUC  test AUC  train loss  test loss
20               0.9300          0.9010         0.9795     0.9579    0.1943      0.2555
50               —               0.9258         —          0.9613    —           0.2487
100              0.9559          0.9219         0.9897     0.9698    0.1458      0.2122

The results are shown in Table 5 and Figure 9. As N increases from 9 to 400, the test accuracy increases from 85.55% to 92.19%, and the test AUC increases from 0.9073 to 0.9705. This intuitively makes sense: the more observation wells of hydraulic head, the more information is available to select the TI. However, as N increases from 400 to 1600, the tendency reverses, i.e., the test accuracy decreases to under 89% and the test AUC to under 0.95. A possible explanation is that a large input dimension may make the network more prone to falling into local extreme points. Nevertheless, in practical applications, observations of hydraulic head may not be abundant enough to cause this issue, owing to the economic cost.

To verify the impact of the sequence length (number of time steps) L of the hydraulic head on TI selection, we tested sequence lengths of 20, 50, and 100 time steps while keeping other conditions unchanged. Note that all series of different lengths were taken from the end of the sequence. Again, since the sequence length may influence the choice of hyper-parameters, we tried 50 different groups for each length and chose the one with the highest accuracy on the testing data.

Table 6 and Figure 10 show the GRU performance on the datasets with L = 20, 50, and 100. The GRU performances on the testing set with L = 50 and 100 are very close: as L increases from 50 to 100, the test accuracy decreases by 0.39%, the test AUC increases by 0.0085, and the test loss decreases by 0.0365. The performance with L = 20 is poorer than with L = 50 and 100, but the difference is small. The experimental results indicate that the sequence length of the hydraulic head has little impact on TI selection.
Figure 9. GRU performance on datasets with a sequence length of 100 and different numbers of observation wells of the hydraulic head sequence.
Figure 10. GRU performance on the testing set with 49 observation wells and different lengths of the hydraulic head sequence.
Table 7. GRU performance on the dataset with 49 observation wells, 100 time steps, and different noise settings. Small noise denotes uncorrelated Gaussian noise with a small variance σ = 1 and mean µ = 0; large noise denotes uncorrelated Gaussian noise with a large variance σ = 9 and mean µ = 0.

noise          train accuracy  test accuracy  train AUC  test AUC  train loss  test loss
without noise  0.9559          0.9219         0.9897     0.9698    0.1458      0.2122
small noise    0.9406          —              —          —         —           —
large noise    0.8819          0.8815         0.9512     0.9454    0.3010      0.3153
Table 8. GRU performance on datasets of different sizes. Other conditions, such as sequence length, observation number, and hyper-parameters, are kept unchanged.

dataset size  train accuracy  test accuracy  train AUC  test AUC  train loss  test loss
4000          0.9559          0.9219         0.9897     0.9698    0.1458      0.2122
10000         0.9922          0.9565         0.9995     0.9831    0.0425      0.1547
20000         —               0.9763         —          0.9963    —           0.0755
Considering that there are usually errors in collecting hydraulic head observation data in real applications, we added two different Gaussian noises to the hydraulic head sequences and compared the performance with the noise-free case. One is uncorrelated Gaussian noise with a small variance σ = 1 and mean µ = 0; the other is uncorrelated Gaussian noise with a large variance σ = 9 and mean µ = 0. The results are shown in Table 7 and Figure 11. Adding a small Gaussian noise to the observations of hydraulic head does not degrade the GRU performance on either the training or the testing set. However, adding a large noise causes a visible degradation of GRU performance: a 7.4% decline in train accuracy, a 4.04% decline in test accuracy, a 0.0385 decline in train AUC, a 0.0244 decline in test AUC, a 0.1552 rise in train loss, and a 0.1031 rise in test loss.

It can also be seen from Table 7 that the GRU performances on the training set and testing set are very close when a large noise is added to the observations. This seems to indicate that adding random noise can alleviate the overfitting problem of the neural network to some extent, so the degradation of GRU performance on the testing set is significantly lighter than on the training set.

Three different dataset sizes (4000, 10000, 20000) were tested to assess model performance, with the hyper-parameters kept unchanged for all sizes. As Table 8 and Figure 12 show, GRU performance on both the training and testing sets improves greatly as the dataset size increases. As the dataset size increases from 4000 to 20000, the test accuracy increases from 92.19% to 97.63%, the test AUC increases from 0.9698 to 0.9963, and the test loss decreases from 0.2122 to 0.0755. In addition, as Figure 12 shows, the tendencies of the curves are quite consistent; they have their inflection points at almost the same position.
Figure 11. GRU performance on the testing set with 49 observation wells, 100 time steps, and different noise settings. Small noise denotes uncorrelated Gaussian noise with a small variance σ = 1 and mean µ = 0, while large noise denotes uncorrelated Gaussian noise with a large variance σ = 9 and mean µ = 0.
Figure 12. GRU performance on datasets of different sizes. Other conditions, such as sequence length, observation number, and hyper-parameters, are kept unchanged.
Conclusions

In this paper, we proposed a robust and straightforward TI selection model using RNNs to help researchers determine an applicable TI for their study area in a given context. We presented the workflow of the new method as well as detailed information about the neural network settings.

Three different RNN architectures were tested to select a proper TI between two channelized TI candidates. The results show that the GRU performs slightly better than the LSTM, while the LSTM performs better than the standard RNN on both prediction accuracy and AUC. The number of observation points of hydraulic head plays a significant role in our method: enough observation points improve the accuracy of TI selection, while too many points may have an adverse effect. Fifty time steps of the hydraulic head are sufficient for the GRU to extract the features and perform the TI selection. Adding a small noise to the observations of hydraulic head does not degrade model performance, and even with large noise the prediction accuracy is still acceptable. This is a promising property because in practice there may be errors between the observed data and the true values, yet they have little effect on the proposed method. Increasing the dataset size can greatly improve the accuracy of TI selection, and 16000 training samples give a satisfactory performance with respect to the overfitting issue. Since no iteration is needed in the forward simulation in this framework, increasing the dataset size does not significantly increase the computational cost.

The concept of this method is to capture the relationship between the observation data and the hard data and, based on that relationship, to make the selection. An example has been illustrated with a groundwater model; the result shows that the selection accuracy can reach 97.63%. TI selection plays a significant role in the whole workflow of modeling a complex subsurface system, and this method can be considered the foremost step in an MPS-based workflow. Besides, this selection model can also be applied to other subsurface models, such as oil reservoir models and CO₂ sequestration models.

Acknowledgments
References
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166. doi: 10.1109/72.279181
Brunetti, C., Bianchi, M., Pirot, G., & Linde, N. (2019). Hydrogeological model selection among complex spatial priors. Water Resources Research, 55(8), 6729-6753. doi: 10.1029/2019WR024840
Brunetti, C., Linde, N., & Vrugt, J. A. (2017). Bayesian model selection in hydrogeophysics: Application to conceptual subsurface models of the South Oyster bacterial transport site, Virginia, USA. Advances in Water Resources, 102, 127-141. doi: 10.1016/j.advwatres.2017.02.006
Caers, J. (2001). Geostatistical reservoir modelling using statistical pattern recognition. Journal of Petroleum Science and Engineering, 29(3-4), 177-188. doi: 10.1016/S0920-4105(01)00088-2
Carrera, J., Mousavi, S. F., Usunoff, E. J., Sánchez-Vila, X., & Galarza, G. (1993). A discussion on validation of hydrogeological models. Reliability Engineering & System Safety, 42(2), 201-216. doi: 10.1016/0951-8320(93)…
Chen, J., Jing, H., Chang, Y., & Liu, Q. (2019). Gated recurrent unit based recurrent neural network for remaining useful life prediction of nonlinear deterioration process. Reliability Engineering & System Safety, 185, 372-382. doi: 10.1016/j.ress.2019.01.006
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
Deutsch, C. V., & Wang, L. (1996). Hierarchical object-based stochastic modeling of fluvial reservoirs. Mathematical Geology, 28(7), 857-880. doi: 10.1007/BF02066005
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. doi: 10.1016/j.patrec.2005.10.010
Fernández, S., Graves, A., & Schmidhuber, J. (2007). An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks (pp. 220-229).
Friel, N., & Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3), 589-607. doi: 10.1111/j.1467-9868.2007.00650.x
Gelman, A., & Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13(2), 163-185.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6645-6649). doi: 10.1109/ICASSP.2013.6638947
Gravey, M., & Mariethoz, G. (2020). QuickSampling v1.0: a robust and simplified pixel-based multiple-point simulation approach. Geoscientific Model Development, 13(6), 2611-2630. doi: 10.5194/gmd-13-2611-2020
Gross, L. J., & Small, M. J. (1998). River and floodplain process simulation for subsurface characterization. Water Resources Research, 34(9), 2365-2376. doi: 10.1029/98WR00777
Hammersley, J. M. (1960). Monte Carlo methods for solving multivariable problems. Annals of the New York Academy of Sciences, 86(3), 844-874. doi: 10.1111/j.1749-6632.1960.tb42846.x
Harbaugh, A. W., Banta, E. R., Hill, M. C., & McDonald, M. G. (2000). MODFLOW-2000, the U.S. Geological Survey modular ground-water model: User guide to modularization concepts and the ground-water flow process. Open-File Report 00-92, U.S. Geological Survey.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1), 1-12. doi: 10.1021/ci0342472
Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural Networks for Perception (pp. 65-93). Elsevier. doi: 10.1016/B978-0-12-741252-8.50010-8
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi: 10.1162/neco.1997.9.8.1735
Holden, L., Hauge, R., Skare, Ø., & Skorstad, A. (1998). Modeling of fluvial reservoirs with object models. Mathematical Geology, 30(5), 473-496. doi: 10.1023/A:1021769526425
Hsu, K.-l., Moradkhani, H., & Sorooshian, S. (2009). A sequential Bayesian approach for hydrologic model selection and prediction. Water Resources Research, 45(12). doi: 10.1029/2008WR006824
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Journel, A., & Zhang, T. (2006). The necessity of a multiple-point prior model. Mathematical Geology, 38(5), 591-610. doi: 10.1007/s11004-006-9031-2
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krogh, A., & Hertz, J. A. (1992). A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems (pp. 950-957).
Kumar, D. N., Raju, K. S., & Sathish, T. (2004). River flow forecasting using recurrent neural networks. Water Resources Management, 18(2), 143-161. doi: 10.1023/B:WARM.0000024727.94701.12
Li, X., Mariethoz, G., Lu, D., & Linde, N. (2016). Patch-based iterative conditional geostatistical simulation using graph cuts. Water Resources Research, 52(8), 6297-6320. doi: 10.1002/2015WR018378
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., & Stoica, I. (2018). Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
Mariethoz, G., Renard, P., & Straubhaar, J. (2010). The Direct Sampling method to perform multiple-point geostatistical simulations. Water Resources Research, 46(11). doi: 10.1029/2008WR007621
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … others (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (pp. 8026-8037).
Pirot, G., Renard, P., Huber, E., Straubhaar, J., & Huggenberger, P. (2015). Influence of conceptual model uncertainty on contaminant transport forecasting in braided river aquifers. Journal of Hydrology, 531, 124-141. doi: 10.1016/j.jhydrol.2015.07.036
Pyrcz, M., Boisvert, J., & Deutsch, C. (2009). ALLUVSIM: A program for event-based stochastic modeling of fluvial depositional systems. Computers & Geosciences, 35(8), 1671-1685. doi: 10.1016/j.cageo.2008.09.012
Refsgaard, J. C., Christensen, S., Sonnenborg, T. O., Seifert, D., Højberg, A. L., & Troldborg, L. (2012). Review of strategies for handling geological uncertainty in groundwater flow and transport modeling. Advances in Water Resources, 36, 36-50. doi: 10.1016/j.advwatres.2011.04.006
Rojas, R., Feyen, L., & Dassargues, A. (2008). Conceptual model uncertainty in groundwater modeling: Combining generalized likelihood uncertainty estimation and Bayesian model averaging. Water Resources Research, 44(12). doi: 10.1029/2008WR006908
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. doi: 10.1038/323533a0
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? In Advances in Neural Information Processing Systems (pp. 2483-2493).
Schöniger, A., Wöhling, T., Samaniego, L., & Nowak, W. (2014). Model selection on solid ground: Rigorous comparison of nine ways to evaluate Bayesian model evidence. Water Resources Research, 50(12), 9484-9513. doi: 10.1002/2014WR016062
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
Strebelle, S. (2002). Conditional simulation of complex geological structures using multiple-point statistics. Mathematical Geology, 34(1), 1-21. doi: 10.1023/A:1014009426274
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
Tahmasebi, P. (2018). Multiple point statistics: A review. In Handbook of Mathematical Geosciences (pp. 613-643). Springer, Cham.
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550-1560. doi: 10.1109/5.58337
Xie, W., Lewis, P. O., Fan, Y., Kuo, L., & Chen, M.-H. (2011). Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60(2), 150-160. doi: 10.1093/sysbio/syq085
Zhang, J., Zhu, Y., Zhang, X., Ye, M., & Yang, J. (2018). Developing a long short-term memory (LSTM) based model for predicting water table depth in agricultural areas. Journal of Hydrology, 561, 918-929. doi: 10.1016/j.jhydrol.2018.04.065