A Machine-Learning-Based Importance Sampling Method to Compute Rare Event Probabilities
Vishwas Rao, Romit Maulik, Emil Constantinescu, Mihai Anitescu
Argonne National Laboratory, Lemont, IL, USA
University of Chicago, Chicago, IL, USA
{vhebbur,rmaulik,emconsta,anitescu}@anl.gov

Abstract.
We develop a novel computational method for evaluating the extreme excursion probabilities arising from random initialization of nonlinear dynamical systems. The method uses excursion probability theory to formulate a sequence of Bayesian inverse problems that, when solved, yields the biasing distribution. Solving multiple Bayesian inverse problems can be expensive, more so in higher dimensions. To alleviate the computational cost, we build machine-learning-based surrogates to solve the Bayesian inverse problems that give rise to the biasing distribution. This biasing distribution can then be used in an importance sampling procedure to estimate the extreme excursion probabilities.
Keywords: Machine learning · Rice's formula · Gaussian processes.
Characterizing high-impact rare and extreme events such as hurricanes, tornadoes, and cascading power failures is of great social and economic importance. Many of these natural phenomena and engineering systems can be modeled by using dynamical systems. The models representing these complex phenomena are approximate and have many sources of uncertainties. For example, the exact initial and boundary conditions or the external forcings that are necessary to fully define the underlying model might be unknown. Other parameters that are set based on experimental data may also be uncertain or only partially known. A probabilistic framework is generally used to formulate the problem of quantifying various uncertainties in these complex systems. By definition, the outcomes of interest that correspond to high-impact rare and extreme events reside in the tails of the probability distribution of the associated event space. Fully characterizing the tails requires resolving high-dimensional integrals over irregular domains. The most commonly used method to determine the probability
of rare and extreme events is Monte Carlo simulation (MCS). Computing rare-event probabilities via MCS involves generating several samples of the random variable and calculating the fraction of the samples that produce the outcome of interest. For small probabilities, however, this process is expensive. For example, consider an event whose probability is around $10^{-}$ and whose underlying numerical model requires ten minutes per simulation. With MCS, estimating the probability of such an event to an accuracy of 10% will require two years of serial computation. Hence, alternative methods are needed that are computationally efficient.

Important examples of extreme events are rogue waves in the ocean [14], hurricanes, tornadoes [29], and power outages [3]. The motivation for this work comes from the rising concern surrounding transient security in the presence of uncertain initial conditions, identified by the North American Electric Reliability Corporation in connection with its long-term reliability assessment [32]. The problem can be mathematically formulated as a dynamical system with uncertain initial conditions. In this paper, the aim is to compute the extreme excursion probability: the probability that the transient due to a sudden malfunction exceeds preset safety limits. Typically, the target safe limit exceedance probabilities are in the range $10^{-}$–$10^{-}$. We note that the same formulation is applicable in other applications such as data assimilation, which is used extensively for medium- to long-term weather forecasting.

⋆ This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR) under Contract DE-AC02-06CH11347.
For example, one can potentially use the formulation in this paper to determine the likelihood of temperature levels at a location exceeding certain thresholds or the likelihood of precipitation levels exceeding safe levels in a certain area.

In [25], we presented an algorithm that uses ideas from excursion probability theory [1] to evaluate the probability of extreme events. In particular, we used Rice's formula [27], which was developed to estimate the average number of upcrossings for a generic stochastic process. Rice's formula is given by

$$E\{N_u^+(0, T)\} = \int_0^T \int_0^\infty y\, \varphi_t(u, y)\, \mathrm{d}y\, \mathrm{d}t\,, \qquad (1)$$

where the left-hand side denotes the expected number of upcrossings of the level $u$, $y$ is the derivative of the stochastic process in a mean squared sense, and $\varphi_t(u, y)$ represents the joint probability distribution of the process and its derivative. In this paper, we build on our recent algorithm [25], in which we constructed an importance biasing distribution (IBD) to accelerate the computation of extreme event probabilities. A key step in the algorithm presented in [25] involves solving multiple Bayesian inverse problems, which can be expensive in high dimensions. Here, we propose to use machine-learning-based surrogates to obtain the inverse maps and hence alleviate the computational costs.

The mathematical setup used in this paper consists of a nonlinear dynamical system that is excited by a Gaussian initial state and that results in a non-Gaussian stochastic process. We are interested in estimating the probability of the stochastic process exceeding a preset threshold. Moreover, we wish to estimate these probabilities when the event of the process exceeding the threshold is a rare event. Rare events typically lie in the tails of the underlying event distribution. To characterize the tail of the resulting stochastic process, we use ideas from the theory of excursion probabilities [1].
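For a stationary Gaussian process the integrand of Rice's formula (1) is time-independent and the expected upcrossing count has a classical closed form. The following sketch is a toy illustration (not the paper's code, and with assumed standard deviations for the process and its derivative) that checks a numerical quadrature of (1) against that closed form:

```python
import numpy as np

def joint_density(u, y, sigma, sigma_dot):
    # For a stationary zero-mean Gaussian process, the value and its
    # derivative are independent, so phi_t(u, y) factorizes into two normals.
    pu = np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    py = np.exp(-y**2 / (2 * sigma_dot**2)) / (np.sqrt(2 * np.pi) * sigma_dot)
    return pu * py

def upcrossings_numeric(u, T, sigma, sigma_dot, ny=200_000, y_max=50.0):
    # E{N_u^+(0,T)} = int_0^T int_0^inf y * phi_t(u, y) dy dt  (Rice's formula).
    y = np.linspace(0.0, y_max, ny)
    f = y * joint_density(u, y, sigma, sigma_dot)
    inner = np.sum((f[1:] + f[:-1]) / 2.0 * np.diff(y))  # trapezoidal rule
    return T * inner  # the integrand does not depend on t here

def upcrossings_closed(u, T, sigma, sigma_dot):
    # Classical closed form for stationary Gaussian processes.
    return T * sigma_dot / (2 * np.pi * sigma) * np.exp(-u**2 / (2 * sigma**2))
```

With, say, `sigma = 1`, `sigma_dot = 1`, `u = 2`, and `T = 10`, the two evaluations agree to several digits, which is a useful sanity check before replacing the Gaussian density with the linearized approximation used later in the paper.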
Specifically, we use Rice's formula (1) to estimate the expected number of upcrossings of a stochastic process. For a description of the mathematically rigorous settings used for the rare event problem, we refer interested readers to [25, §2] and references therein.

Evaluating $\varphi_t(u, y)$, the joint probability distribution of the stochastic process and its derivative, is central to evaluating the integral in Rice's formula. However, $\varphi_t(u, y)$ is analytically computable only for Gaussian processes. Since our setup results in a non-Gaussian stochastic process, we linearize the nonlinear dynamical system around the trajectory starting at the mean of the initial state. We thus obtain a Gaussian approximation to the distribution of the system trajectories. In [25], we solve a sequence of Bayesian inverse problems to determine a biasing distribution that accelerates the convergence of the probability estimates. For high-dimensional problems, however, solving multiple Bayesian inverse problems can be expensive. In this work, we propose to replace the multiple solutions of Bayesian inverse problems with machine-learning-based surrogates to alleviate the computational burden.

The rest of the paper is organized as follows: we first review existing approaches to rare event estimation, then present the mathematical formulation of our method, describe the machine-learning-based surrogate for the inverse map, and present numerical results on a test problem.

Most of the existing methods to compute the probabilities of rare events use MCS directly or indirectly. The MCS approach was developed by Metropolis and his collaborators to solve problems in mathematical physics [22]. Since then, it has been used in a variety of applications [21, 28]. When evaluating rare event probabilities, the MCS method basically counts the fraction of the random samples that cause the rare event. For a small probability $P$ of the underlying event, the number of samples required to obtain an accuracy of $\epsilon \ll P$ is $O(\epsilon^{-2} P^{-1})$. Hence MCS becomes impractical for estimating rare event probabilities.

A popular sampling technique that is employed to compute rare event probabilities is importance sampling (IS). IS is a variance reduction technique developed in the 1950s [17] to estimate the quantity of interest by constructing estimators that have smaller variance than MCS.
In MCS, simulations from most of the samples do not result in the rare event and hence do not play a part in probability calculations. IS, instead, uses problem-specific information to construct an IBD; computing the rare event probability using the IBD requires fewer samples. Based on this idea, several techniques for constructing IBDs have been developed [8]. For a more detailed treatment of IS, we direct interested readers to [2, 13]. One of the major challenges involved with importance sampling is the construction of an IBD that results in a low-variance estimator. We note that the approach may sometimes be inefficient for high-dimensional problems [19]. A more detailed description of MCS and IS in the context of rare events can be found in [25, §2] and references therein.
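As a minimal generic illustration of this variance reduction (not the paper's setup), consider estimating $P(X > 4)$ for $X \sim N(0,1)$: plain MCS rarely sees the event, while shifting the proposal to $N(4,1)$ and reweighting by the density ratio recovers an accurate estimate from the same number of samples:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
u, N = 4.0, 100_000
true_p = 0.5 * (1.0 - erf(u / sqrt(2.0)))  # P(X > 4), about 3.17e-5

# Plain Monte Carlo: the expected number of hits in N samples is only ~3,
# so the estimate is extremely noisy (often exactly zero).
p_mcs = np.mean(rng.standard_normal(N) > u)

# Importance sampling with proposal N(u, 1); the importance weight is the
# density ratio phi(x) / phi(x - u) = exp(u^2/2 - u*x).
z = rng.standard_normal(N) + u
p_is = np.mean((z > u) * np.exp(u**2 / 2.0 - u * z))
```

With this biasing distribution roughly half the proposed samples land in the rare region, and the weighted average typically lands within about one percent of the true probability.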
Other methods use the notion of conditional probability over a sequence of nested subsets of the probability space of interest. For example, one can start with the entire probability space and progressively shrink to the region that corresponds to the rare event. Furthermore, one can use the notion of conditional probability to factorize the event of interest as a product of conditional events. Subset simulation (SS) [4] and splitting methods [16] are based on this idea. Several modifications and improvements have been proposed to both SS [6, 9, 10, 18, 33] and splitting methods [5, 7]. Evaluating the conditional probabilities forms a major portion of the computational load, and computing the conditional probabilities for different nested subsets concurrently is nontrivial.
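The nested-subset idea can be illustrated on a toy problem: write $P(X > 4)$ for $X \sim N(0,1)$ as a product of conditional probabilities over nested level sets $\{X > 1\} \supset \{X > 2\} \supset \cdots$. For clarity this sketch samples each conditional distribution exactly via the inverse CDF; subset simulation proper would generate the conditional samples with MCMC:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
levels = [1.0, 2.0, 3.0, 4.0]   # nested sets {X > l}
N = 50_000

# First factor: P(X > levels[0]) by plain sampling (not rare).
p_hat = np.mean(rng.standard_normal(N) > levels[0])

# Conditional factors P(X > hi | X > lo); each one is far from rare.
for lo, hi in zip(levels[:-1], levels[1:]):
    # Exact conditional sampling via the inverse CDF (illustration only).
    u01 = rng.uniform(norm.cdf(lo), 1.0, size=N)
    p_hat *= np.mean(norm.ppf(u01) > hi)
```

The product `p_hat` approximates `norm.sf(4.0)` (about 3.17e-5) even though no individual factor being estimated is smaller than about 0.02, which is the source of the method's efficiency.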
Large deviation theory (LDT) is an efficient approach for estimating rare events in cases when the event space of interest is dominated by a few elements, such as rogue waves of a certain height. LDT has also been used to estimate the probabilities of extreme events in dynamical systems with random components [11, 12]. A sequential sampling strategy has been used to compute extreme event statistics [23, 24].
Most of the work in this section is a review of our approach described in [25]. Consider the dynamical system

$$x' = f(t, x)\,, \quad t \in [0, T]\,, \qquad x(0) = x_0\,, \quad x_0 \sim p_0\,, \quad x_0 \in \Omega\,, \qquad (2)$$

where $c$ is a canonical basis vector and $x_0$, the initial state of the system, is uncertain and has a probability distribution $p_0$. The problem of interest is to estimate the probability that $c^\top x(t)$ exceeds the level $u$ for $t \in [0, T]$. That is, we seek to estimate the following excursion probability,

$$P_T(u) := P\left( \sup_{0 \le t \le T} c^\top x(t, x_0) \ge u \right)\,, \qquad (3)$$

where $x(t, x_0)$ represents the solution of the dynamical system (2) for a given initial condition $x_0$. We note that

$$P_T(u) = \mu(\Omega(u))\,, \qquad (4)$$

where $\mu$ is the respective measure transformation subject to (2) and $\Omega(u) \subset \Omega$ represents the excursion set

$$\Omega(u) := \left\{ x_0 : \sup_{0 \le t \le T} c^\top x(t, x_0) \ge u \right\}\,. \qquad (5)$$

Hence, estimating $\Omega(u)$ will help us in estimating the excursion probability $P_T(u)$. In general, however, estimating the excursion set $\Omega(u)$ analytically is difficult. Rice's formula (1) gives us insights about the excursion set and can be used to construct an approximation to it.

Recall that in Rice's formula (1), $\varphi_t(u, y)$ represents the joint probability density of $c^\top x$ and its derivative $c^\top x'$ for an excursion level $u$. The right-hand side of (1) can be interpreted as the summation over all times and slopes at which an excursion occurs. One can sample from $y \varphi_t(u, y)$ to obtain a slope-time pair $(y_i, t_i)$ at which the sample paths of the stochastic process cause an excursion. Now consider the map $G : \mathbb{R}^d \times [0, T] \to \mathbb{R}^2$ that evaluates the vector $[\, c^\top x(t)\,,\; c^\top x'(t)\, ]^\top$ based on the dynamical system (2), given an initial state $x_0$ and a time $t$.
By definition of the excursion set $\Omega(u)$, there exists an element $x_i \in \Omega(u)$ that satisfies the following relationship,

$$G(x_i, t_i) = \begin{bmatrix} u + \varepsilon_i \\ y_i \end{bmatrix}\,, \qquad (6)$$

where $\varepsilon_i > 0$. We can use this insight to construct an approximation of $\Omega(u)$ by constructing the preimages of multiple slope-time pairs. Observe that the problem of finding the preimage of a sample $(y_i, t_i)$ is ill-posed, since there could be multiple $x_i$'s that map to $[\, u + \varepsilon_i\,,\; y_i\, ]^\top$ at $t_i$ via the operator $G$. We define the set

$$X_i := \left\{ x_i \in \Omega : G(x_i, t_i) = \begin{bmatrix} u + \varepsilon_i \\ y_i \end{bmatrix} \right\}\,, \qquad (7)$$

and an approximation $\widehat{\Omega}(u)$ to $\Omega(u)$ can be written as

$$\widehat{\Omega}(u) := \bigcup_{i=1}^{N} X_i\,. \qquad (8)$$
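To make the preimage computation in (7) concrete, consider a linear system $x' = Ax$ with a two-dimensional state (the matrix below is an assumed placeholder). Then $G(x_0, t) = [\,c^\top e^{At} x_0\,,\; c^\top A e^{At} x_0\,]^\top$ is linear in $x_0$, so the preimage of a slope-time pair reduces to a 2×2 linear solve; for larger state dimensions the same system is underdetermined, which is exactly the ill-posedness noted above:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-1.0, -0.1]])  # lightly damped oscillator (assumed)
c = np.array([1.0, 0.0])

def preimage(u_plus_eps, y_i, t_i):
    # Solve G(x0, t_i) = [u + eps_i, y_i]^T for x0, where
    # G(x0, t) = [c^T exp(A t) x0,  c^T A exp(A t) x0]^T.
    E = expm(A * t_i)
    M = np.vstack([c @ E, c @ A @ E])
    return np.linalg.solve(M, np.array([u_plus_eps, y_i]))

x0 = preimage(1.5, 0.3, 2.0)
```

Applying the forward map $G$ to the recovered `x0` reproduces the requested level and slope, which is the defining property of an element of $X_i$.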
Note that the approximation (8) improves as we increase $N$. For a discussion on the choice of $\varepsilon_i$, we refer interested readers to [25]. Constructing $\widehat{\Omega}(u)$ consists of the following stages:

– Draw samples from the unnormalized density $y \varphi_t(u, y)$.
– Find the preimages of these samples to approximate $\Omega(u)$.

We use MCMC to draw samples from the unnormalized $y \varphi_t(u, y)$. We note that irrespective of the size of the dynamical system, $y \varphi_t(u, y)$ represents an unnormalized density in two dimensions; hence, MCMC is an effective means to draw samples from it. Drawing samples from $y \varphi_t(u, y)$ requires evaluating it repeatedly, and in the following we discuss the means to do so.

We note that $y \varphi_t(u, y)$ can be evaluated analytically only for special cases. Specifically, when the underlying process is Gaussian, the joint density function $y \varphi_t(u, y)$ is analytically computable. Consider the dynamical system described by (2). When $p_0$ is Gaussian and $f$ is linear, we have

$$x' = A x(t) + b\,, \quad x(t_0) = x_0\,, \quad x_0 \sim \mathcal{N}(\bar{x}_0, \Sigma_0)\,. \qquad (9)$$

Assuming $A$ is invertible, $x(t)$ can be written as

$$x(t) = \exp(A(t - t_0))\, x_0 - (I - \exp(A(t - t_0)))\, A^{-1} b\,, \qquad (10)$$

where $I$ represents an identity matrix of the appropriate size. Given that $x_0$ is normally distributed, it follows that $x(t)$ is a Gaussian process:

$$x(t) \sim \mathcal{GP}(\bar{x}, \mathrm{cov}_x)\,, \quad \text{where} \qquad (11)$$
$$\bar{x} = \exp(A(t - t_0))\, \bar{x}_0 - (I - \exp(A(t - t_0)))\, A^{-1} b \quad \text{and} \quad \mathrm{cov}_x = \exp(A(t - t_0))\, \Sigma_0\, (\exp(A(t - t_0)))^\top\,.$$

The joint probability density function (PDF) of $c^\top x(t)$ and $c^\top x'(t)$ is given by [26, equation 9.1]

$$\begin{bmatrix} c^\top x \\ c^\top x' \end{bmatrix} \sim \mathcal{GP}\left( \bar{x}_\varphi\,,\; \begin{bmatrix} c^\top \Phi c & c^\top \Phi A^\top c \\ c^\top A \Phi^\top c & c^\top A \Phi A^\top c \end{bmatrix} \right)\,, \qquad (12)$$

where $\bar{x}_\varphi := \begin{bmatrix} c^\top \bar{x} \\ c^\top (A \bar{x} + b) \end{bmatrix}$ and $\Phi := \exp(A(t - t_0))\, \Sigma_0\, (\exp(A(t - t_0)))^\top$.
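The moments in (10)-(12) can be assembled numerically with a matrix exponential. The sketch below (all system matrices are assumed placeholders) evaluates the joint Gaussian density of $c^\top x$ and $c^\top x'$ at a point $(u, y)$ and multiplies by $y$, i.e., the integrand of Rice's formula; SciPy's multivariate normal supplies the bivariate density:

```python
import numpy as np
from scipy.linalg import expm
from scipy.stats import multivariate_normal

# Placeholder linear-Gaussian system x' = A x + b, x0 ~ N(x0_bar, Sigma0).
A = np.array([[-0.5, 1.0], [-1.0, -0.5]])
b = np.array([0.1, 0.0])
c = np.array([1.0, 0.0])
x0_bar = np.array([1.0, 0.0])
Sigma0 = 0.05 * np.eye(2)

def rice_integrand(u, y, t):
    E = expm(A * t)
    # Mean of x(t) from equation (10) and covariance factor Phi from (12).
    xbar = E @ x0_bar - (np.eye(2) - E) @ np.linalg.solve(A, b)
    Phi = E @ Sigma0 @ E.T
    mean = np.array([c @ xbar, c @ (A @ xbar + b)])
    Upsilon = np.array([[c @ Phi @ c, c @ Phi @ A.T @ c],
                        [c @ A @ Phi @ c, c @ A @ Phi @ A.T @ c]])
    return y * multivariate_normal(mean, Upsilon).pdf(np.array([u, y]))
```

A routine like `rice_integrand` is what an MCMC sampler would call repeatedly when drawing slope-time pairs from the unnormalized density $y\varphi_t(u,y)$.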
We now can evaluate $y \varphi_t(u, y)$ for arbitrary values of $u_i$, $y_i$, and $t_i$ as

$$y_i\, \varphi_{t_i}(u_i, y_i) = \frac{y_i}{2\pi |\Upsilon|^{1/2}} \exp\left( -\frac{1}{2} \left\| \begin{bmatrix} u_i \\ y_i \end{bmatrix} - \bar{x}_\varphi \right\|^2_{\Upsilon^{-1}} \right)\,, \qquad (13)$$

where $\Upsilon := \begin{bmatrix} c^\top \Phi c & c^\top \Phi A^\top c \\ c^\top A \Phi^\top c & c^\top A \Phi A^\top c \end{bmatrix}$ and $|\Upsilon|$ denotes the determinant of $\Upsilon$. Note that the right-hand side in (13) depends on $t_i$ via $\Upsilon$.

When $f$ is nonlinear, $y \varphi_t(u, y)$ cannot be computed analytically, even though its evaluation is a key ingredient of our computational procedure. We approximate the nonlinear dynamics by linearizing $f$ around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by equation (9), linearizing around the mean of the initial state gives

$$x' \approx F \cdot (x - \bar{x}_0) + f(\bar{x}_0, 0)\,, \qquad (14)$$

where $F$ represents the Jacobian of $f$ at $t = 0$ and $x = \bar{x}_0$; this reduces the nonlinear dynamical system to a form that is similar to equation (9). Thus, we can now use equations (11), (12), and (13) to approximate $y \varphi_t(u, y)$ for nonlinear $f$.

In [25] we formulated the problem of determining the preimages (7) as a Bayesian inverse problem. However, solving multiple Bayesian inverse problems can be expensive. Hence we approximated our IBD by using the solutions of a small number of Bayesian inverse problems. In this section we build a simple data-driven surrogate for approximating the preimages $X_i$ described in equation (7). Using the surrogate, we can approximate the preimages of several $y_i$'s obtained by sampling from $y \varphi_t(u, y)$. The surrogate developed here approximates the inverse of the map defined in equation (6). To that end, we wish to approximate the map

$$G^{-1} : \mathbb{R}^2 \to \mathbb{R}^d\,, \qquad (15)$$

where the input space corresponds to $(u + \varepsilon_i, y)|_{t_i}$ and the output lives in the domain of the state space ($\Omega$ here).
This is equivalent to augmenting $t_i$ as an additional input variable and building a surrogate that maps from $\mathbb{R}^3 \to \mathbb{R}^d$. We utilize a fully connected deep neural network to approximate this map. A one-layer neural network can be expressed as

$$\xi_j = F\left( \sum_{\ell=1}^{L} c_{\ell m}\, x_\ell + \epsilon_m \right)\,, \qquad (16)$$

where $F$ is a differentiable activation function that imparts nonlinearity to this transformation; $L$ is the input dimension of an incoming signal; $M$ is the number of hidden-layer neurons (in machine learning terminology); $c_{\ell m} \in \mathbb{R}^{M \times L}$ are the weights of this map; $\epsilon_m \in \mathbb{R}^M$ are the biases; and $\xi_j \in \mathbb{R}^J$ is the nonlinear output of this map, which may be matched to targets available from data or "fed forward" into future maps. Note that $\xi_j$ is the postactivation value of each neuron in a hidden layer of $J$ neurons. In practice, multiple compositions of this map may be used to obtain nonlinear function approximators, called deep neural networks, that are very expressive. For nonlinear activation, our network utilizes

$$F(\xi) = \max(\xi, 0) \qquad (17)$$

for all its activation functions. In addition, we concatenate three such maps as shown in equation (16) to ultimately obtain an approximation for $G^{-1}$. Two such submaps have $J$ fixed at 256, and a final transformation utilizes $J = 3$. We note that the function $F$ for the final transformation is the identity, as is common in machine learning algorithms. A schematic of this network architecture is shown in Figure 1. The trainable parameters ($c_{\ell m}$ and $\epsilon_m$ for each transformation) are optimized with the use of backpropagation [30], an adjoint calculation technique that obtains gradients of the loss function with respect to these parameters. A stochastic gradient optimization technique, ADAM, is used to update these parameters [20] with a learning rate of 0.001. Our loss function is given by the $L_2$-distance between the prediction of the network and the targets (i.e., the mean-squared error).
Our network also incorporates a regularization strategy, called dropout [31], that randomly switches off certain units $\xi_j$ in the forward propagation of the map (here we utilize a dropout probability of 0.1).

The following procedure is used to construct the IBD:
1. Obtain different realizations of the initial conditions of the dynamical system by sampling from the initial PDF $p_0$.
2. Use $G$ to obtain the forward maps of these realizations.

Fig. 1. Schematic of our neural network architecture. Note that the number of hidden-layer units shown is not representative, since this study utilizes 256 such units.
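A forward pass through the composition of maps (16)-(17) can be sketched in a few lines of NumPy. The weights below are untrained random placeholders (training with Adam and dropout is omitted), and the output dimension is assumed to equal the state dimension $d$, taken here as 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, activate=True):
    # One map of equation (16): xi = F(W x + eps), with F = ReLU from (17)
    # for the hidden maps and the identity for the final map.
    z = W @ x + b
    return np.maximum(z, 0.0) if activate else z

# Untrained placeholder weights: input in R^3 (level, slope, time),
# two hidden maps with J = 256 units, output in R^d with d = 2 assumed.
d_in, J, d_out = 3, 256, 2
W1, b1 = 0.1 * rng.standard_normal((J, d_in)), np.zeros(J)
W2, b2 = 0.1 * rng.standard_normal((J, J)), np.zeros(J)
W3, b3 = 0.1 * rng.standard_normal((d_out, J)), np.zeros(d_out)

def surrogate(v):
    # Approximates G^{-1}: (u + eps_i, y_i, t_i) -> initial state x0.
    return layer(layer(layer(v, W1, b1), W2, b2), W3, b3, activate=False)

out = surrogate(np.array([1.2, 0.4, 2.0]))
```

In practice these weights would be fit by minimizing the mean-squared error between `surrogate` outputs and the sampled initial conditions, as described in the text.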
Fig. 2. Convergence of training for our network. Note how both training and validation losses diminish in magnitude concurrently.
Fig. 3. Scatter plots between true and predicted quantities of the inverse map for dimension 1 (left) and dimension 2 (right). These results are from unseen data.
3. Use the forward maps and the corresponding random realizations of the initial conditions to train the inverse map $G^{-1}$.
4. Apply this trained inverse map to samples generated from $y \varphi_t(u, y)$ to obtain the approximate preimages of the samples $y_i$.
5. Use a Gaussian approximation of these preimages as the IBD. Assume that this Gaussian approximation has PDF $p_{\mathrm{IBD}}$.
6. Sample from the IBD, and use importance sampling to estimate the probabilities.
We can now estimate $P_T(u)$ using the IBD as follows:

$$P_T^{IS}(u)(\widehat{x}_1, \ldots, \widehat{x}_M) = \frac{1}{M} \sum_{i=1}^{M} I(\widehat{x}_i)\, \psi(\widehat{x}_i)\,, \qquad (18)$$

where $\widehat{x}_1, \ldots, \widehat{x}_M$ are sampled from the biasing distribution $p_{\mathrm{IBD}}$ and $I(\widehat{x}_i)$ represents the indicator function given by

$$I(\widehat{x}_i) = \begin{cases} 1\,, & \sup_{0 \le t \le T} c^\top x(t, \widehat{x}_i) \ge u\,, \\ 0\,, & \sup_{0 \le t \le T} c^\top x(t, \widehat{x}_i) < u\,. \end{cases} \qquad (19)$$

Also, $\psi(\widehat{x}_i)$ represents the importance weights. The importance weight for an arbitrary $\widehat{x}_i$ is given by

$$\psi(\widehat{x}_i) = \frac{p_0(\widehat{x}_i)}{p_{\mathrm{IBD}}(\widehat{x}_i)}\,. \qquad (20)$$

We demonstrate the application of the procedure described in the preceding sections for nonlinear dynamical systems excited by a Gaussian distribution. We use the Lotka-Volterra equations as a test problem. These equations, also known as the predator-prey equations, are a pair of first-order nonlinear differential equations and are used to describe the dynamics of biological systems in which two species interact, one as a predator and the other as a prey. The populations change through time according to the following pair of equations,

$$\frac{dx_1}{dt} = \alpha x_1 - \beta x_1 x_2\,, \qquad \frac{dx_2}{dt} = \delta x_1 x_2 - \gamma x_2\,, \qquad (21)$$

where $x_1$ is the number of prey, $x_2$ is the number of predators, and $\frac{dx_1}{dt}$ and $\frac{dx_2}{dt}$ represent the instantaneous growth rates of the two populations. We assume that the initial state of the system at time $t = 0$ is a random variable that is normally distributed,

$$x(0) \sim \mathcal{N}\left( \bar{x}_0\,,\; \sigma_0^2 I \right)\,,$$

and we are interested in estimating the probability of the event $P(c^\top x \ge u)$, where $c$ is a canonical basis vector, $t \in [0, T]$, and $u = 17$. The first step of our solution procedure involves sampling from $y \varphi_t(u, y)$ to generate observations $y_i$.
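Putting the pieces together for the predator-prey test problem: the sketch below integrates (21) with SciPy, implements the indicator (19) for the prey population, and evaluates the weighted estimator (18) and (20). The rate constants here are assumed for illustration and are not the values used in the experiments:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Assumed Lotka-Volterra rate constants (not the paper's values).
alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def rhs(t, x):
    x1, x2 = x
    return [alpha * x1 - beta * x1 * x2, delta * x1 * x2 - gamma * x2]

def indicator(x0, u=17.0, T=10.0):
    # Equation (19): does c^T x(t, x0), taken here as the prey population
    # (an assumption), reach the level u within [0, T]?
    sol = solve_ivp(rhs, (0.0, T), x0, max_step=0.05)
    return float(np.max(sol.y[0]) >= u)

def is_estimate(samples, p0_pdf, p_ibd_pdf):
    # Equations (18) and (20): mean of indicator times importance weight.
    vals = [indicator(x) * p0_pdf(x) / p_ibd_pdf(x) for x in samples]
    return float(np.mean(vals))
```

Samples drawn from the Gaussian IBD would be passed to `is_estimate` together with the densities $p_0$ and $p_{\mathrm{IBD}}$ evaluated pointwise.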
We linearize the dynamical system about the mean of the distribution of $x_0$ (equation (14)) and express $\varphi_t(u, y)$ as a function of $t$ and $y$ as described by equation (12). We compute $y \varphi_t(u, y)$ as shown in equation (13). We use the delayed rejection adaptive Metropolis (DRAM) Markov chain Monte Carlo (MCMC) method to generate samples from $y \varphi_t(u, y)$. (For more details about DRAM, see [15].) To minimize the effect of the initial guess on the posterior inference, we use a burn-in of 1,000 samples. Figure 4 shows the contours of $y \varphi_t(u, y)$ and samples drawn from it by using DRAM MCMC. In [25] we then solved the Bayesian inverse problem by using both MCMC and a Laplace approximation at the MAP point to construct a distribution that approximately maps to the likelihood constructed around $y_i$. Here, we replace the solution to the Bayesian inverse problem with the machine-learned inverse map described in the previous section. Multiple samples generated from $y \varphi_t(u, y)$ can then be used to construct the IBD and to estimate $P_T(u)$, as explained above.

ML-based IS gives a fairly good estimate even with a small number of model evaluations. When the training dataset size is large enough, the improvements are dramatic. Notice that for a true probability of the order of $10^{-}$ we obtain an estimate that has a relative error of less than 1%. Notice also that our method gives the same (or better) accuracy as MCS with a hundred times lower computational cost. The convergence with just 5,000 training samples is acceptable, and these results improve dramatically for 10,000 and 20,000 training samples. We believe the results could be even better if we used a Gaussian mixture to represent the IBD instead of a simple Gaussian approximation.

Fig. 4. Contours of $y \varphi_t(u, y)$ and samples drawn from it using DRAM MCMC.

Fig. 5. Comparison between conventional MCS and ML-based IS. We observe that even with a small amount of training data we obtain fairly accurate estimates, and as we increase the training data, the accuracy improves dramatically.
In Figure 5, we have not included the costs associated with generating training data, training costs, and the cost of approximating the inverse map, since these costs are almost negligible when compared with the overall costs. Note that generating 20,000 samples is approximately equivalent to 400 model evaluations (because a single model evaluation can be used to generate the slope and state at 50 different times, and each of them can be used as a training sample). The training of the ML framework for this problem required very little compute time. Each training was executed on an 8th-generation Intel Core i7 machine with Python 3.6.8 and Tensorflow 1.14 and took less than 180 seconds for training on 20,000 samples (this is less than 50 model evaluations). Inference (for 20,000 prediction points) cost less than 2 seconds on average.
In this work we developed an ML-based IS method to estimate rare event probabilities, and we demonstrated the algorithm on the predator-prey system. The method developed here builds on the approach in [25] and replaces the expensive Bayesian inference with a machine-learning-based surrogate. This approach yields fairly accurate estimates of the probabilities and, for a given accuracy, requires at least three orders of magnitude less computational effort than traditional MCS.
In the future, we aim to test this algorithm on larger problems and to use an active-learning-based approach to pick the training samples. Scaling this algorithm to high dimensions (say $O(1000)$) could be challenging; to address this, we will use state-of-the-art techniques developed by the machine learning and deep learning communities.

References
1. Adler, R.J.: The Geometry of Random Fields. SIAM (2010)
2. Asmussen, S., Glynn, P.W.: Stochastic Simulation: Algorithms and Analysis, vol. 57. Springer Science & Business Media (2007)
3. Atputharajah, A., Saha, T.K.: Power system blackouts - literature review. In: Industrial and Information Systems (ICIIS), 2009 International Conference on, pp. 460–465. IEEE (2009)
4. Au, S.K., Beck, J.L.: Estimation of small failure probabilities in high dimensions by subset simulation. Probabilistic Engineering Mechanics (4), 263–277 (2001)
5. Beck, J.L., Zuev, K.M.: Rare-event simulation. Handbook of Uncertainty Quantification, pp. 1–26 (2016)
6. Bect, J., Li, L., Vazquez, E.: Bayesian subset simulation. SIAM/ASA Journal on Uncertainty Quantification (1), 762–786 (2017)
7. Botev, Z.I., Kroese, D.P.: Efficient Monte Carlo simulation via the generalized splitting method. Statistics and Computing (1), 1–16 (2012)
8. Bucklew, J.: Introduction to Rare Event Simulation. Springer Science & Business Media (2013)
9. Ching, J., Beck, J.L., Au, S.: Hybrid subset simulation method for reliability estimation of dynamical systems subject to stochastic excitation. Probabilistic Engineering Mechanics (3), 199–214 (2005)
10. Ching, J., Au, S.K., Beck, J.L.: Reliability estimation for dynamical systems subject to stochastic excitation using subset simulation with splitting. Computer Methods in Applied Mechanics and Engineering (12-16), 1557–1579 (2005)
11. Dematteis, G., Grafke, T., Vanden-Eijnden, E.: Rogue waves and large deviations in deep sea. Proceedings of the National Academy of Sciences (5), 855–860 (2018)
12. Dematteis, G., Grafke, T., Vanden-Eijnden, E.: Extreme event quantification in dynamical systems with random components. SIAM/ASA Journal on Uncertainty Quantification (3), 1029–1059 (2019)
13. Dunn, W.L., Shultis, J.K.: Exploring Monte Carlo Methods. Elsevier (2011)
14. Dysthe, K., Krogstad, H.E., Müller, P.: Oceanic rogue waves. Annu. Rev. Fluid Mech., 287–310 (2008)
15. Haario, H., Laine, M., Mira, A., Saksman, E.: DRAM: Efficient adaptive MCMC. Statistics and Computing (4), 339–354 (2006)
16. Kahn, H., Harris, T.E.: Estimation of particle transmission by random sampling. National Bureau of Standards Applied Mathematics Series, 27–30 (1951)
17. Kahn, H., Marshall, A.W.: Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America (5), 263–278 (1953)
18. Katafygiotis, L., Cheung, S.H.: A two-stage subset simulation-based approach for calculating the reliability of inelastic structural systems subjected to Gaussian random excitations. Computer Methods in Applied Mechanics and Engineering (12-16), 1581–1595 (2005)
19. Katafygiotis, L.S., Zuev, K.M.: Geometric insight into the challenges of solving high-dimensional reliability problems. Probabilistic Engineering Mechanics (2-3), 208–218 (2008)
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
21. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media (2008)
22. Metropolis, N., Ulam, S.: The Monte Carlo method. Journal of the American Statistical Association (247), 335–341 (1949)
23. Mohamad, M.A., Sapsis, T.P.: A sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. arXiv preprint arXiv:1804.07240 (2018)
24. Mohamad, M.A., Sapsis, T.P.: Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. Proceedings of the National Academy of Sciences (44), 11138–11143 (2018)
25. Rao, V., Anitescu, M.: Efficient computation of extreme excursion probabilities for dynamical systems (2020)
26. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, USA (2006)
27. Rice, S.O.: Mathematical analysis of random noise. Bell Labs Technical Journal (3), 282–332 (1944)
28. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods (Springer Texts in Statistics) (2005)
29. Ross, T., Lott, N.: A climatology of 1980-2003 extreme weather and climate events. US Department of Commerce, National Oceanic and Atmospheric Administration, National Environmental Satellite Data and Information Service, National Climatic Data Center (2003)
30. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature (6088), 533–536 (1986)
31. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research