# Model Calibration via Distributionally Robust Optimization: On the NASA Langley Uncertainty Quantification Challenge


Yuanlu Bai

Department of Industrial Engineering & Operations Research, Columbia University, USA.

Zhiyuan Huang

Department of Management Science & Engineering, Tongji University, Shanghai, China.

Henry Lam

Department of Industrial Engineering & Operations Research, Columbia University, USA.

Abstract

We study a methodology to tackle the NASA Langley Uncertainty Quantification Challenge, a model calibration problem under both aleatory and epistemic uncertainties. Our methodology is based on an integration of robust optimization, more specifically a recent line of research known as distributionally robust optimization, and importance sampling in Monte Carlo simulation. The main computation machinery in this integrated methodology amounts to solving sampled linear programs. We present theoretical statistical guarantees of our approach via connections to nonparametric hypothesis testing, and numerical performances including parameter calibration and downstream decision and risk evaluation tasks.

Keywords: uncertainty quantification, model calibration, distributionally robust optimization, importance sampling, linear programming, nonparametric.

Email addresses: [email protected] (Yuanlu Bai), [email protected] (Zhiyuan Huang), [email protected] (Henry Lam)

Preprint submitted to Elsevier February 4, 2021

We consider the NASA Langley Uncertainty Quantification (UQ) Challenge problem [16] where, given a set of "output" data and under both aleatory and epistemic uncertainties, we aim to infer a region that contains the true values of the associated variables. These steps allow us to investigate the reduction of uncertainty by obtaining further information, and to estimate the failure probabilities of related systems. To tackle these challenges, we study a methodology based on an integration of robust optimization (RO), more specifically a recent line of research known as distributionally robust optimization (DRO), and importance sampling in Monte Carlo simulation. We will see that the main computation machinery in this integrated methodology boils down to solving sampled linear programs (LPs). In this paper, we explain our methodology, introduce theoretical statistical guarantees via connections to nonparametric hypothesis testing, and present numerical results on this UQ Challenge.

We briefly introduce the Challenge and notation; details can be found in [16]. The uncertainty model in the Challenge is given by $\langle f_a, E \rangle$, where $a \sim f_a$ is an aleatory variable following a probability density $f_a$ and probability distribution function $F_a$, and $e \in E$ is an epistemic variable inside the deterministic set $E$. Both the true distribution of $a$ and the true value of $e$ are unknown. Initially, we are given $E$ and data $D = \{y^{(i)}(t)\}$, $i = 1, \dots, n$, in the form of discrete-time trajectories, $t = 0, \dots, T$. We have the computational capability to simulate $y(a, e, t)$ for given values of $a \in A$, $e \in E$. The task is to calibrate the distribution of $a$ and the value of $e$ with uncertainty quantification, as well as to use them to conduct downstream decision and risk evaluation tasks.

1. Overview of Our Methodology (Problem A)

We first give a high-level overview of our methodology for extracting a region $E$ that contains the true epistemic variables. For convenience, we call this region an "eligibility set" of $e$. For each value of $e$ inside $E$, we also have a set (in the space of probability distributions) that contains "eligible" distributions for the random variable $a$. For the sake of computational tractability (as we will see shortly), the eligibility set of $e$ is represented by a set of sampled points in $E$ that approximate its shape, whereas the eligibility set of $a$ is represented by probability weights on sampled points in $A$. The eligibility set $E$ and the corresponding eligibility set of distributions for $a$ are obtained by solving an array of LPs constructed from these properly sampled points, and then deciding eligibility by checking the LP optimal values against a threshold that resembles the "$p$-value" approach in hypothesis testing. As another key ingredient, this methodology involves a dimension-collapsing transformation $S$, applied to the raw data, which ultimately allows using the Kolmogorov-Smirnov (KS) statistic to endow rigorous statistical guarantees.

Algorithm 1 is a procedural description of our approach to construct the eligibility set $E$, which also gives as a side product an eligibility set of the distributions of $a$ for each $e$, represented by weights in the set (11). In the following, we explain the elements and terminology in this algorithm in detail.

Algorithm 1: Constructing the eligibility set $E$.

Input:
- Data $D = \{(y^{(i)}(t))_{t=0,\dots,T}\}_{i=1,\dots,n}$.
- A uniformly sampled set of $e^{(l)}$, $l = 1, \dots, n$, over $E$.
- A uniformly sampled set of $a^{(j)}$, $j = 1, \dots, k$, over $A$.
- A summary function $S(\cdot): \mathbb{R}^{T+1} \to \mathbb{R}^m$.
- A target confidence level $1 - \alpha$.

Procedure:
1. Simulate outputs from the baseline distribution: evaluate $(y(a^{(j)}, e^{(l)}, t))_{t=0,\dots,T}$ for $j = 1, \dots, k$ and $l = 1, \dots, n$.
2. Summarize the outputs: evaluate $s^{(i)} = S((y^{(i)}(t))_{t=0,\dots,T})$ for $i = 1, \dots, n$, and $S((y(a^{(j)}, e^{(l)}, t))_{t=0,\dots,T})$ for $j = 1, \dots, k$ and $l = 1, \dots, n$.
3. Compute the degree of eligibility: for each $l = 1, \dots, n$, solve the optimization problem in Eq. (10) to obtain $q^*_l$.
4. Construct the eligibility set: output $E = \{e^{(l)} : q^*_l \le q_{1-\alpha/m}\}$. Smooth the set if needed.

2. A DRO Perspective

Our starting idea is to approximate the set

$E = \{ e \in E : \text{there exists } P_e \text{ such that } d(P_e, \hat{P}) \le \eta \}$   (1)

where $P_e$ is the probability distribution of $\{y(a,e,t)\}_{t=0,\dots,T}$, namely the outputs of the simulation model at a fixed $e$ but random $a$. $\hat{P}$ denotes the empirical distribution of $D$, more concretely the distribution given by

$\hat{P}(\cdot) = \frac{1}{n} \sum_{i=1}^n \delta_{(y^{(i)}(t))_{t=0,\dots,T}}(\cdot)$

where $\delta_{(y^{(i)}(t))_{t=0,\dots,T}}(\cdot)$ denotes the Dirac measure at $(y^{(i)}(t))_{t=0,\dots,T}$. $d(\cdot,\cdot)$ denotes a discrepancy between two probability distributions, and $\eta \in \mathbb{R}_+$ is a suitable constant. Intuitively, $E$ in Eq. (1) is the set of $e$ such that there exists a distribution for the outputs that is close enough to the empirical distribution from the data. If for a given $e$ there does not exist any possible output distribution that is close to $\hat{P}$, then $e$ is likely not the truth. The following gives a theoretical justification for using Eq. (1):

Theorem 1.

Suppose that the true distribution of the output $(y(t))_{t=0,\dots,T}$, called $P_{true}$, satisfies $d(P_{true}, \hat{P}) \le \eta$ with confidence level $1 - \alpha$, i.e., we have

$\mathbb{P}(d(P_{true}, \hat{P}) \le \eta) \ge 1 - \alpha$   (2)

where $\mathbb{P}$ denotes the probability with respect to the data. Then the set $E$ in Eq. (1) satisfies $\mathbb{P}(e_{true} \in E) \ge 1 - \alpha$, where $e_{true}$ denotes the true value of $e$. A similar deduction holds if Eq. (2) holds asymptotically (as the data size grows), in which case the same asymptotic modification holds for the conclusion.

The proof of Theorem 1 comes from a straightforward set inclusion.

Proof.

Note that $d(P_{true}, \hat{P}) \le \eta$ implies $e_{true} \in E$. Thus we have $\mathbb{P}(e_{true} \in E) \ge \mathbb{P}(d(P_{true}, \hat{P}) \le \eta) \ge 1 - \alpha$. A similar derivation holds for the asymptotic version.

In Eq. (1), the set of distributions $\{P_e : d(P_e, \hat{P}) \le \eta\}$ is analogous to the so-called uncertainty set or ambiguity set in the RO literature (e.g., [7, 6]), which is a set postulated to contain the true values of uncertain parameters in a model. RO generally advocates decision-making under uncertainty that hedges against the worst-case scenario, where the worst case is over the uncertainty set (and thus often leads to a minimax optimization problem). DRO, in particular, focuses on problems where the uncertainty is on the probability distribution of an underlying random variable (e.g., [62, 18]). This is the perspective that we take here, where $a$ has a distribution that is unknown, in addition to the uncertainty on $e$. Moreover, we also take a generalized view of RO or DRO as attempting to construct an eligibility set of $e$ instead of finding a robust decision via a minimax optimization.

Theorem 1 focuses on the situation where the uncertainty set is constructed and calibrated from data, which is known as data-driven RO or DRO ([8, 33]). If such an uncertainty set has the property of being a confidence region for the uncertain parameters or distributions, then by solving the RO or DRO, the confidence guarantee can be translated to the resulting decision, or the eligibility set in our case. Here we take a nonparametric and frequentist approach, as opposed to other potential Bayesian methods. In implementation we choose $\alpha = 0.05$, so that the eligibility set $E$ has the interpretation of approximating a 95% confidence set for $e$. In the above developments, $d(P_e, \hat{P}) \le \eta$ can in fact be replaced with a more general set $P_e \in U$ where $U$ is calibrated from the data. Nonetheless, the distance-based set (or "ball") surrounding the empirical distribution is intuitive to understand, and our specific choice of the set below falls into such a representation.

To use Eq. (1), there are two immediate questions:

1. What $d(\cdot,\cdot)$ should and can we use, and how do we calibrate $\eta$?
2. How do we determine whether there exists $P_e$ that satisfies $d(P_e, \hat{P}) \le \eta$ for a given $e$?

In the following two sections, we address these two questions respectively, which will lead us to Algorithm 1.

3. Constructing Discrepancy Measures

For the first question, we first point out that in theory many choices of $d$ could be used (basically, any $d$ that satisfies the confidence property in Theorem 1). But a poor choice of $d$ would lead to a more conservative result, i.e., a larger $E$, than others. A natural choice of $d$ should capture the discrepancy of the distributions efficiently. Moreover, the choice of $d$ should also account for the difficulty in calibrating $\eta$ such that the assumption in Theorem 1 can be satisfied, as well as the computational tractability in solving the eligibility determination problem in Eq. (1).

Based on the above considerations, we construct $d$ and calibrate $\eta$ as follows. First, we "summarize" the data $D$ into a lower-dimensional representation, say $\{s^{(i)}_1, \dots, s^{(i)}_m\}$, $i = 1, \dots, n$, where $s^{(i)}_v = S_v((y^{(i)}(t))_{t=0,\dots,T})$ for some function $S_v(\cdot)$. For convenience, we denote $S(\cdot) = (S_1(\cdot), \dots, S_m(\cdot)): \mathbb{R}^{T+1} \to \mathbb{R}^m$ and $s^{(i)} = (s^{(i)}_1, \dots, s^{(i)}_m)$. We call $S(\cdot)$ the "summary function" and $s^{(i)}$ the "summaries" of the $i$-th output. $S(\cdot)$ attempts to capture important characteristics of the raw data (we will see later that we use the positions and values of the peaks extracted from Fourier analysis). Also, the low dimensionality of $s^{(i)}$ is important for calibrating $\eta$ well.

Next, we define

$d(P_e, \hat{P}) = \max_{v=1,\dots,m} \sup_{x \in \mathbb{R}} \left| F_{e,v}(x) - \hat{F}_v(x) \right|$   (3)

where $\hat{F}_v(x) = \frac{1}{n} \sum_{i=1}^n I(s^{(i)}_v \le x)$, with $I(\cdot)$ denoting the indicator function, is the empirical distribution function of $s^{(i)}_v$ (i.e., the distribution function of $\hat{P}$ projected onto the $v$-th summary). $F_{e,v}(x)$ is the probability distribution function of the $v$-th summary of the simulation model output $S_v((y(a,e,t))_{t=0,\dots,T})$ (i.e., the distribution function of the projection of $P_e$ onto the $v$-th summary). We then choose $\eta = q_{1-\alpha/m}/\sqrt{n}$, where $q_{1-\alpha/m}$ is the $(1-\alpha/m)$-quantile of the Kolmogorov-Smirnov (KS) statistic, namely the $(1-\alpha/m)$-quantile of $\sup_{x \in [0,1]} |BB(x)|$, where $BB(\cdot)$ denotes a standard Brownian bridge.

To understand Eq. (3), note that the set of $P_e$ satisfying $d(P_e, \hat{P}) \le \eta$ is equivalent to the set of $P_e$ satisfying

$\sup_{x \in \mathbb{R}} \left| F_{e,v}(x) - \hat{F}_v(x) \right| \le \frac{q_{1-\alpha/m}}{\sqrt{n}}, \quad v = 1, \dots, m.$   (4)

Here, $\sup_{x \in \mathbb{R}} | F_{e,v}(x) - \hat{F}_v(x) |$ is the KS statistic for a goodness-of-fit test against the distribution $F_{e,v}(x)$, using the data on the $v$-th summary. Since we have $m$ summaries and hence $m$ tests, we use a Bonferroni correction and deduce that

$\liminf_{n \to \infty} \mathbb{P}\left( \sup_{x \in \mathbb{R}} \left| F_{true,v}(x) - \hat{F}_v(x) \right| \le \frac{q_{1-\alpha/m}}{\sqrt{n}} \text{ for } v = 1, \dots, m \right) \ge 1 - \alpha$

where $F_{true,v}$ denotes the true distribution function of the $v$-th summary. Thus, the (asymptotic version of the) assumption in Theorem 1 holds.

Note that the quality of the summaries does not affect the statistical correctness of our method (in terms of overfitting), but it does crucially affect the resulting conservativeness (in the sense of getting a larger $E$). Moreover, in choosing the number of summaries $m$, there is a tradeoff between the conservativeness coming from representativeness and that coming from simultaneous estimation. On one end, using more summaries means imposing more knowledge on $P_e$, which translates into a smaller feasible set for $P_e$ and ultimately a smaller eligibility set $E$. This relation, however, is true only if there is no statistical noise coming from the data. With a finite data size $n$, more summaries also mean that constructing the feasible set for $P_e$ requires more simultaneous estimations in calibrating its size, which is manifested in the Bonferroni correction whose degree increments with each additional summary. In our implementation (see Section 7), we find that using 12 summaries balances well this representativeness versus simultaneous-estimation-error tradeoff.
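Concretely, the Bonferroni-adjusted threshold $q_{1-\alpha/m}$ and the ball radius $\eta$ can be computed from the asymptotic law of the KS statistic (the supremum of a Brownian bridge), which scipy exposes as `kstwobign`. A minimal sketch; the helper name and the illustrative data size `n` are ours:

```python
# Sketch: the Bonferroni-adjusted KS threshold q_{1-alpha/m} of Section 3
# and the resulting ball radius eta = q_{1-alpha/m} / sqrt(n).
from math import sqrt

from scipy.stats import kstwobign  # asymptotic law of sup_x |BB(x)|

def ks_threshold(alpha, m, n):
    """Return (q_{1-alpha/m}, eta) for m summaries and data size n."""
    q = kstwobign.ppf(1 - alpha / m)  # (1-alpha/m)-quantile of the KS statistic
    return q, q / sqrt(n)

q, eta = ks_threshold(alpha=0.05, m=12, n=100)  # q is about 1.76, as in Section 8
```

With $m = 1$ this recovers the familiar single-test KS critical value (about 1.36 at the 95% level); the Bonferroni division by $m$ pushes the quantile further into the tail.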

4. Determining Existence of an Aleatory Distribution

Now we address the second question: how we can decide, for a given $e$, whether a $P_e$ exists such that $d(P_e, \hat{P}) \le \eta$. We first rephrase the representation with a change of measure. Consider a "baseline" probability distribution, say $P$, chosen by us in advance. A reasonable choice, for instance, is the uniform distribution over $A$, the support of $a$. Then we can write $d(P_e, \hat{P}) \le \eta$ as

$\sup_{x \in \mathbb{R}} \left| \int_{S_v(u) \le x} W_e(u) \, dP(u) - \hat{F}_v(x) \right| \le \frac{q_{1-\alpha/m}}{\sqrt{n}}$   (5)

for $v = 1, \dots, m$, where $W_e(\cdot) = dP_e/dP$ is the Radon-Nikodym derivative of $P_e$ with respect to $P$, and we have used the change-of-measure representation $F_{e,v}(x) = \int_{S_v(u) \le x} W_e(u) \, dP(u)$. Here we have assumed that $P$ is suitably chosen such that absolute continuity of $P_e$ with respect to $P$ holds. Eq. (5) turns the determination of the existence of an eligible $P_e$ into the existence of an eligible Radon-Nikodym derivative $W_e(\cdot)$.

The next step is to utilize Monte Carlo simulation to approximate $P$. More specifically, given $e$, we run $k$ simulation runs under $P$ to generate $(y(a^{(j)}, e, t))_{t=0,\dots,T}$ for $j = 1, \dots, k$. Then Eq. (5) can be approximated by

$\sup_{x \in \mathbb{R}} \left| \sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e, t))_{t=0,\dots,T}) \le x) - \hat{F}_v(x) \right| \le \frac{q_{1-\alpha/m}}{\sqrt{n}}, \quad v = 1, \dots, m$   (6)

where $W_j = (1/k)\,(dP_e/dP)((y(a^{(j)}, e, t)))$ represents the (unknown) sampled likelihood ratio from the view of importance sampling [36, 12]. Our task is to find a set of weights $W_j$, $j = 1, \dots, k$, such that Eq. (6) holds. These weights should approximately satisfy the properties of the Radon-Nikodym derivative, namely positivity and integrating to one. Thus, we seek $W_j$, $j = 1, \dots, k$, such that

$\sup_{x \in \mathbb{R}} \left| \sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e, t))_{t=0,\dots,T}) \le x) - \hat{F}_v(x) \right| \le \frac{q_{1-\alpha/m}}{\sqrt{n}}, \quad v = 1, \dots, m$   (7)

$\sum_{j=1}^k W_j = 1, \quad W_j \ge 0, \quad j = 1, \dots, k$   (8)

where Eq. (8) enforces the weights to lie in a probability simplex. If $k$ is much larger than $n$, then the existence of $W_j$, $j = 1, \dots, k$, satisfying Eq. (7) and Eq. (8) would determine that the considered $e$ is in $E$. To summarize, we have:

Theorem 2.

Suppose $k = \omega(n)$, $P_{true}$ is absolutely continuous with respect to $P$, and $\|dP_{true}/dP\|_\infty \le C$ for some constant $C > 0$, where $\|\cdot\|_\infty$ denotes the essential supremum. Suppose, for each $e$, we generate $k$ simulation replications to get $(y(a^{(j)}, e, t))_{t=0,\dots,T}$, $j = 1, \dots, k$, where the $a^{(j)}$ are drawn from $P$ in an i.i.d. fashion. Then the set

$E = \{ e : \text{there exist } W_j, \, j = 1, \dots, k, \text{ such that Eq. (7) and Eq. (8) hold} \}$

will satisfy

$\liminf_{n \to \infty, \, k/n \to \infty} \mathbb{P}(e_{true} \in E) \ge 1 - \alpha.$

The proof is in the appendix. Note that in Theorem 2, the $W_j$'s represent the unknown sampled likelihood ratios such that, together with the $a^{(j)}$'s generated from $P$, the function

$\sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e, t))_{t=0,\dots,T}) \le \cdot)$

approximates the unknown true $v$-th summary distribution function $F_{true,v}$.

To use the above $E$ and elicit the guarantee in Theorem 2, we still need some steps to conduct a feasible numerical implementation. First, we need to discretize, or sufficiently sample, $e$'s over $E$, since checking the existence of eligible $W_j$'s for all $e$ is computationally infeasible. In our implementation we draw $n = 1000$ $e$'s uniformly over $E$, call them $e^{(1)}, \dots, e^{(n)}$, and then put together the geometry of $E$ from the eligible $e^{(l)}$'s. Second, the current representation of the KS constraint Eq. (7) involves entire distribution functions. We can write Eq. (7) as a finite number of linear constraints, given by

$\hat{F}_v(s^{(i)}_v+) - \frac{q_{1-\alpha/m}}{\sqrt{n}} \le \sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e, t))_{t=0,\dots,T}) \le s^{(i)}_v) \le \hat{F}_v(s^{(i)}_v-) + \frac{q_{1-\alpha/m}}{\sqrt{n}}$   (9)

for $i = 1, \dots, n$ and $v = 1, \dots, m$, where $s^{(i)}_v$, $i = 1, \dots, n$, is the $v$-th summary of the $i$-th data point, and $s^{(i)}_v+$ and $s^{(i)}_v-$ denote the right and left limits of the empirical distribution at $s^{(i)}_v$.

Thus, putting everything together, we solve, for each $e^{(l)}$, $l = 1, \dots, n$, the feasibility problem:

Find $W_j$, $j = 1, \dots, k$, such that Eq. (9) and Eq. (8) hold.

If there exist feasible $W_j$, $j = 1, \dots, k$, then $e^{(l)}$ is eligible. The set $\{e^{(l)} : e^{(l)} \text{ is eligible}\}$ is an approximation of $E$. Note that this is a "sampled" subset of $E$. In general, without running the simulation at the other points of $E$, there is no guarantee whether these other points are eligible or not. However, if the distribution of $\{y(a,e,t)\}_{t=0,\dots,T}$ is continuous in $e$ in some suitable sense, then it is reasonable to believe that the neighborhood of an eligible point $e^{(l)}$ is also eligible (and vice versa). In this case, we can "smooth" the discrete set $\{e^{(l)} : e^{(l)} \text{ is eligible}\}$ if needed (e.g., by doing some clustering and taking the convex hull of each cluster). Finally, note that the feasibility problem above is a linear problem in the decision variables $W_j$.
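The change-of-measure identity behind Eq. (5)-(6) can be checked numerically: drawing $a^{(j)}$ from the baseline $P$ and weighting by the likelihood ratio $W_j = (1/k)\,dP_e/dP(a^{(j)})$ recovers the target summary distribution function. A small sketch, where the triangular target density and the stand-in summary $S_v(a) = a^2$ are hypothetical choices of ours:

```python
# Numeric check of F_{e,v}(x) = ∫_{S_v(u)<=x} W_e(u) dP(u) via Monte Carlo:
# draw a^(j) from the baseline P and weight by the likelihood ratio.
import numpy as np

rng = np.random.default_rng(2)
k = 200_000
a = rng.uniform(0.0, 2.0, size=k)   # i.i.d. baseline draws, P = Unif[0, 2]

def pdf_e(u):                        # hypothetical target density of a under some e
    return (2.0 - u) / 2.0           # triangular on [0, 2], integrates to 1

W = pdf_e(a) / 0.5 / k               # W_j = (1/k) dP_e/dP(a^(j)); baseline pdf is 1/2

S = a ** 2                           # a stand-in summary S_v(a) = a^2
est = np.sum(W * (S <= 1.0))         # approximates F_{e,v}(1.0)
exact = 0.75                         # P_e(a <= 1) = ∫_0^1 (2-u)/2 du = 3/4
```

The weights sum to roughly one, as the Radon-Nikodym derivative requires; the LP of the next section searches over such weight vectors directly, without ever knowing $dP_e/dP$.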

5. Towards the Main Procedure

To link to our main Algorithm 1, we offer an equivalent approach to the above feasibility-problem-based procedure that allows further flexibility in choosing the threshold $q_{1-\alpha/m}$, which is currently set as the Bonferroni-adjusted KS critical value. This equivalent approach leaves the choice of threshold open and can determine the set of eligible $e^{(l)}$ as a function of the threshold, thus giving some room to improve conservativeness should the formed approximate $E$ turn out to be too loose according to other expert opinion. Here, we solve, for each $e^{(l)}$, $l = 1, \dots, n$, the optimization problem

$q^*_l = \min q$
subject to $\hat{F}_v(s^{(i)}_v+) - \frac{q}{\sqrt{n}} \le \sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e^{(l)}, t))_{t=0,\dots,T}) \le s^{(i)}_v) \le \hat{F}_v(s^{(i)}_v-) + \frac{q}{\sqrt{n}}$ for $i = 1, \dots, n$, $v = 1, \dots, m$;
$\sum_{j=1}^k W_j = 1$, $W_j \ge 0$, $j = 1, \dots, k$   (10)

where the decision variables are $W_j$, $j = 1, \dots, k$, and $q$. If the optimal value $q^*_l$ satisfies $q^*_l \le q_{1-\alpha/m}$, then $e^{(l)}$ is eligible. (This can be seen by checking its equivalence to the feasibility problem via the monotonicity of the feasible region for the $W_j$'s in Eq. (10) as $q$ increases.) The rest then follows as above: $\{e^{(l)} : e^{(l)} \text{ is eligible}\}$ is an approximation of $E$. Like before, Eq. (10) is an LP. Moreover, $q^*_l$ captures in a sense the "degree of eligibility" of $e^{(l)}$, and allows convenient visualization by plotting $q^*_l$ against $e^{(l)}$ to assess the geometry of $E$. For these reasons we prefer Eq. (10) over the earlier feasibility problem. These give the full procedure in Algorithm 1. Note that Algorithm 1 has a variant where we re-generate a sample of $a^{(j)}$'s for each different $e^{(l)}$; it is clear that the correctness guarantee (Theorem 2) still holds in this case.

Moreover, we also present how to find eligible distributions of $a$ for an eligible $e^{(l)}$. The set of eligible distributions of $a$ is approximated by the weights $W_j$ that satisfy Eq. (9) and Eq. (8), namely

$\Big\{ W_j, \, j = 1, \dots, k : \ \hat{F}_v(s^{(i)}_v+) - \frac{q_{1-\alpha/m}}{\sqrt{n}} \le \sum_{j=1}^k W_j \, I(S_v((y(a^{(j)}, e^{(l)}, t))_{t=0,\dots,T}) \le s^{(i)}_v) \le \hat{F}_v(s^{(i)}_v-) + \frac{q_{1-\alpha/m}}{\sqrt{n}}$ for $i = 1, \dots, n$, $v = 1, \dots, m$; $\sum_{j=1}^k W_j = 1, \ W_j \ge 0, \, j = 1, \dots, k \Big\}$   (11)

where $W_j$ is the probability weight on $a^{(j)}$. From this, one could also obtain approximate bounds for quantities related to the distribution of $a$. For instance, to get approximate bounds for the mean of $a$, we can maximize and minimize $\sum_j W_j a^{(j)}$ subject to the constraints in (11).
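The LP in Eq. (10) is straightforward to set up with an off-the-shelf solver. Below is a minimal sketch using `scipy.optimize.linprog` on a toy instance; the function name and the toy summaries are ours, not the Challenge data. The decision vector is $(W_1, \dots, W_k, q)$:

```python
# Sketch of Eq. (10): find the smallest q such that weights W_1..W_k exist
# whose weighted summary CDFs stay inside the q/sqrt(n) KS bands of Eq. (9).
import numpy as np
from scipy.optimize import linprog

def degree_of_eligibility(s_data, s_sim):
    """s_data: (n, m) data summaries; s_sim: (k, m) simulated summaries at a
    fixed e^(l). Returns q*_l; e^(l) is eligible if q*_l <= q_{1-alpha/m}."""
    n, m = s_data.shape
    k = s_sim.shape[0]
    blocks_A, blocks_b = [], []
    for v in range(m):
        d, s = s_data[:, v], s_sim[:, v]
        ind = (s[None, :] <= d[:, None]).astype(float)    # I(S_v(sim_j) <= s_iv)
        f_plus = (d[None, :] <= d[:, None]).mean(axis=1)  # Fhat_v(s_iv +)
        f_minus = (d[None, :] < d[:, None]).mean(axis=1)  # Fhat_v(s_iv -)
        col = -np.ones((n, 1)) / np.sqrt(n)               # coefficient of q
        blocks_A += [np.hstack([ind, col]),               #  ind.W - q/sqrt(n) <= Fhat-
                     np.hstack([-ind, col])]              # -ind.W - q/sqrt(n) <= -Fhat+
        blocks_b += [f_minus, -f_plus]
    c = np.zeros(k + 1)
    c[-1] = 1.0                                           # minimize q
    A_eq = np.hstack([np.ones((1, k)), [[0.0]]])          # sum_j W_j = 1
    res = linprog(c, A_ub=np.vstack(blocks_A), b_ub=np.concatenate(blocks_b),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (k + 1))
    assert res.status == 0
    return res.fun

# Toy instance: simulations drawn from the same law as the "data", so this
# candidate e should come out with a small degree of eligibility.
rng = np.random.default_rng(0)
s_data = rng.normal(size=(30, 2))   # n = 30 data points, m = 2 summaries
s_sim = rng.normal(size=(200, 2))   # k = 200 simulation runs at this e
q_star = degree_of_eligibility(s_data, s_sim)
```

The eligible-distribution set (11) is the same constraint system with $q$ fixed at $q_{1-\alpha/m}$, so the bounds on quantities such as $\sum_j W_j a^{(j)}$ mentioned above follow by swapping in that linear objective.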

6. Related Literature

Before we discuss our numerical findings, we review some related literature on the problem setting and our proposed methodology.

The model calibration problem that infers inputs from output data has been studied across different disciplines. In scientific areas it is viewed as an inverse problem [60], in which Bayesian methodologies are predominantly used (e.g., [17, 15, 37, 3, 32]). Our presented approach is an alternative to Bayesian methods that aims to provide frequentist guarantees in the form of confidence regions. This approach is motivated by the need for sophisticated techniques such as approximate Bayesian computation [48] in the Bayesian framework, and by the observation that the DRO methodology we develop appears well-suited to the UQ Challenge setup. In addition to Bayesian approaches, other alternative methods include entropy maximization [39], which uses entropy as a criterion to select the "best" distribution, but it does not have the frequentist guarantee in recovering the true distribution that we provide in this UQ Challenge.

We point out that model calibration has also been investigated in the stochastic simulation community [55, 38]. In this setting, model calibration is often viewed together with model validation. To validate a model, the conventional approach is to use statistical tests, such as two-sample mean-difference tests [2] or others like the Schruben-Turing test [57], that decide whether the simulated output data and historical real output data are close enough. If not, then the simulation model is re-calibrated, and this process is repeated until the gap between simulation and real data is sufficiently small. Though having a long history, the development of rigorous frameworks to conduct model calibration and validation remains quite open, with relatively few elaborate discussions in the literature [49].

In terms of methodology, our approach is closely related to RO, which is an established method for optimization under uncertainty that advocates representing unknown or uncertain parameters in the model as a (deterministic) set (e.g., [7, 6]). This set is often called an uncertainty set or an ambiguity set. In the face of decision-making, RO optimizes the decision over the worst-case scenario within the uncertainty set, which usually comes in the form of a minimax problem, with the outer optimization on the decision and the inner optimization on the worst-case scenario. DRO, a recently active branch of RO, considers stochastic optimization where the underlying probability distribution is uncertain (e.g., [30, 62, 18]). In this case, the uncertainty set lies in the space of probability distributions, and one attempts to make decisions under the worst-case distribution. In this paper we take a generalized view of RO or DRO as attempting to find a set of eligible "decisions", namely the $e$, so it does not necessarily involve a minimax problem but instead a set construction.

In data-driven RO or DRO, the uncertainty set is constructed or calibrated from data. If such a set has the property of being a confidence region for the uncertain parameters or distributions, then by solving the RO or DRO, the confidence guarantee can be translated to bounds on the resulting decision, and in our case the eligibility set. This approach of constructing uncertainty sets, by viewing them as confidence regions or via hypothesis testing, has been the main approach in data-driven RO or DRO [9]. Recently, alternate approaches have been studied to reduce the conservativeness in set calibration, utilizing techniques from empirical likelihood [45, 43, 19], the Wasserstein profile function [11], Bayesian perspectives [31] and data splitting [33, 44].

In our development, we have constructed an uncertainty set for the unknown distribution $P_e$ via a confidence region associated with the KS goodness-of-fit test. This uncertainty set was proposed in [10]. Other distance-based uncertainty sets, including $\phi$-divergence [51, 5, 27, 41, 42, 4] and Wasserstein distance [20, 13, 22], have also been used, as well as sets based on moment [18, 24, 34] or distributional shape information [52, 46, 61]. We use a simultaneous group of KS statistics with Bonferroni correction, motivated by the tractability of the resulting integration with the importance weighting. The closest work to our framework is the stochastic simulation inverse calibration problem studied in [29], but they consider a single-dimensional output and parameter to calibrate the input distributions, in contrast to our "summary" approach via Fourier analysis and the multi-dimensional settings we face.

Another important ingredient in our approach is importance sampling. This is often used as a variance reduction tool (e.g., [58, 28]; [1] Chapter 5; [26] Chapter 4) and is shown to be particularly effective in rare-event simulation (e.g., [14, 54, 36, 12]). It operates by sampling a random variable from a distribution different from the true underlying distribution, and applies a so-called likelihood ratio to de-bias the resulting estimate. Beyond variance reduction, importance sampling is also used in risk quantification in operations research and mathematical finance under a robust optimization perspective (e.g., [27, 25, 41]), which is more closely related to our use in this paper. Additionally, it is used in Bayesian computation [47], and more recently in machine learning contexts such as covariate shift estimation [50, 59] and off-policy evaluation in reinforcement learning [53, 56].

In the remainder of this paper, we illustrate the use of our methodology and report our numerical results on the UQ Challenge.

7. Summarizing Discrete-Time Histories using Fourier Transform

By observing the plot of the outputs $y^{(i)}$, $i = 1, \dots, n$ (see Fig. 1), we judge that these time series are highly seasonal. Naturally, we choose to use the Fourier transform to summarize $(y(t))_{t=0,\dots,T}$, and we may write $y(t)$ in the form $y(t) = \sum_{k=-\infty}^{\infty} C_k e^{-ik\omega_0 t}$.

Figure 1: The plot of $y^{(i)}$, $i = 1, \dots, n$.

First we apply the Fourier transform to $y^{(i)}$, $i = 1, \dots, n$. For each $y^{(i)}$, we compute the $C_k$'s. Fig. 2 shows the real part and the imaginary part of the $C_k$'s against the corresponding frequencies. For the real part, we see that there is a large positive peak, a large negative peak, a small positive peak and a small negative peak. After testing, we confirm that for any $i$, the large peaks lie in the first 14 terms (starting from 0 Hz).

Figure 2: The real part and the imaginary part of the $C_k$'s against the corresponding frequencies.

We summarize $y$ (i.e., construct the function $S(\cdot)$) as follows: first, we apply the Fourier transform to compute the $C_k$'s and the corresponding frequencies; second, we compute the real part and the imaginary part of the $C_k$'s; third, for the real part, we find the maximum value and the minimum value over a low-frequency band $[0\ \text{Hz}, \cdot\ \text{Hz}]$ and a higher band $[1.\!\cdot\ \text{Hz}, \cdot\ \text{Hz}]$, as well as their corresponding frequencies; fourth, for the imaginary part, we find the minimum value over the first band and the maximum value over the second band, as well as their corresponding frequencies. Then we use these 12 parameters as the summaries of $y$.

To illustrate how well these summaries fit $y$, Fig. 3(a) shows the comparison for $y^{(1)}$. The fit qualities of the other time series are similar to this example. Though they are not extremely close to each other, the fitted curves do resemble the original curves. Note that it is entirely possible to improve the fit if we keep more frequencies, even if they are not as significant as the main peaks. For instance, Fig. 3(b) shows the improved fitted curve if, for both the real part and the imaginary part, we respectively keep the 20 frequencies with the largest values. It can be seen that now the fit quality is quite good. On the other hand, as discussed in Section 3, using a larger number of summaries represents more knowledge of $P_e$ (better fitting) but also leads to more simultaneous estimation error through the Bonferroni correction needed in calibrating the set for $P_e$. To balance the conservativeness of our approach coming from representativeness versus simultaneous estimation, we choose to use the 12-parameter summaries described above.

Figure 3: Fitting $y^{(1)}$ with different numbers of parameters: (a) 12 parameters; (b) 80 parameters.
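The 12-dimensional summary map $S(\cdot)$ can be sketched with numpy's FFT. The exact band edges are garbled in the source text, so `B1` and `B2` below are placeholder values of ours, not the paper's; the structure (per-band extrema of the real and imaginary coefficients plus their frequencies) follows the four steps above:

```python
# Sketch of the 12-parameter summary function S(.): band-limited peaks of
# the real/imaginary FFT coefficients and their frequencies.
# B1 and B2 are hypothetical band edges (the paper's values are illegible).
import numpy as np

B1 = (0.0, 0.65)   # placeholder low-frequency band (Hz)
B2 = (1.30, 1.95)  # placeholder higher-frequency band (Hz)

def summarize(y, dt):
    """Map a trajectory (y(t))_{t=0..T} to a 12-vector of summaries."""
    C = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=dt)
    re, im = C.real, C.imag
    out = []
    for lo, hi in (B1, B2):            # real part: max and min in each band
        band = (freqs >= lo) & (freqs <= hi)
        f, r = freqs[band], re[band]
        i_max, i_min = np.argmax(r), np.argmin(r)
        out += [r[i_max], f[i_max], r[i_min], f[i_min]]
    for (lo, hi), pick in ((B1, np.argmin), (B2, np.argmax)):
        band = (freqs >= lo) & (freqs <= hi)   # imag part: min in B1, max in B2
        j = pick(im[band])
        out += [im[band][j], freqs[band][j]]
    return np.array(out)

t = np.arange(1000) * 0.01                     # 10 s of data at dt = 0.01
y = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.cos(2 * np.pi * 1.6 * t)
s = summarize(y, dt=0.01)                      # s.shape == (12,)
```

On this synthetic two-tone signal, the summary correctly locates the 0.5 Hz sine through the imaginary-part minimum in the low band and the 1.6 Hz cosine through the real-part maximum in the high band.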

8. Uncertainty Reduction (Problem B)

Now we implement Algorithm 1 with $n = k = 1000$ and the summary function $S(\cdot)$ defined in the previous section. The dimension of the summary function is $m = 12$, and we choose $\alpha = 0.05$. Thus, following the algorithm, for each $l = 1, \dots, n$, we compute $q^*_l$ and then compare it with $q_{1-\alpha/m} = q_{1-0.05/12} = 1.76$. Fig. 4 shows scatter plots of the $q^*_l$'s against each dimension of $e$. The red horizontal lines in the graphs correspond to $q_{1-\alpha/m} = 1.76$; the dots below the red lines constitute the eligible $e$'s. We rank the epistemic parameters according to these graphs, namely we rank higher the parameter whose range can potentially be reduced the most. Note that this ranking scheme could be summarized using more rigorous metrics related to the expected amount of eligible $e$'s after range shrinkage, but since there are only four dimensions, using the graphs directly seems sufficient for our purpose here.

We find that, for two of the four epistemic variables, the values among the eligible $e$'s broadly range from 0 to 2, which implies that reducing the ranges of these two dimensions could hardly reduce our uncertainty. By contrast, the values of the other two epistemic variables among the eligible $e$'s are concentrated in the lower part of $[0, 2]$, and these two variables rank highest. Chances are that the true values of these two variables are relatively small. In order to further pinpoint their true values, we choose to make two uncertainty reductions: increasing the lower limits of the bounding intervals of these two variables.

Figure 4: $q^*_l$ against each epistemic variable.

(A.2) To investigate the impact of the value of $n$, for different values of $n$ we randomly sample $n$ outputs without replacement and take these outputs as the new data set. By repeatedly implementing Algorithm 1, we find that the larger $n$ is, the smaller the proportion of eligible $e$'s; it is intuitive that as the data size grows, $e$ can be better pinpointed. Moreover, for all but one epistemic variable, the range of the eligible $e$'s visibly shrinks as $n$ increases, which further confirms that the remaining variable is the least important epistemic variable.

After the epistemic space is reduced, we repeat the above process but with the $e$'s now generated uniformly from the reduced $E$. From the associated scatter plots (Fig. 5), we obtain an updated ranking of the epistemic parameters.

Figure 5: $q^*_l$ against each epistemic variable (refined).
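The graphical ranking above can also be backed by a simple numeric proxy, for example ranking each coordinate by how much the range of the eligible $e$'s shrinks relative to the full interval $[0, 2]$. This is one possible metric of our own devising, sketched on synthetic $q^*_l$ values:

```python
# Sketch: rank epistemic coordinates by potential range reduction among the
# eligible e's (those with q*_l below the KS threshold). The q*_l values
# below are synthetic, for demonstration only.
import numpy as np

def rank_by_shrinkage(e_samples, q_star, threshold=1.76, full_width=2.0):
    """Return coordinate indices, most reducible first."""
    eligible = e_samples[q_star <= threshold]
    widths = eligible.max(axis=0) - eligible.min(axis=0)
    shrinkage = 1.0 - widths / full_width   # fraction of [0, 2] ruled out
    return np.argsort(-shrinkage)

rng = np.random.default_rng(4)
e_samples = rng.uniform(0.0, 2.0, size=(1000, 4))
# Synthetic degrees of eligibility that depend strongly on coordinate 0 only:
q_star = 0.5 + 2.0 * e_samples[:, 0] + rng.uniform(0.0, 0.1, size=1000)
order = rank_by_shrinkage(e_samples, q_star)  # coordinate 0 should rank first
```

Here only coordinate 0 drives eligibility, so its eligible range collapses while the other three remain close to the full interval, and the ranking recovers that.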

9. Reliability of Baseline Design (Problem C)

Combining the refined range of e provided by the host with our Algorithm 1, we construct a reduced epistemic set E. To estimate min_{e∈E} / max_{e∈E} P(g_i(a, e, θ) ≥ 0), we solve

$$\min_{e\in E,\,W\in U}\Big/\max_{e\in E,\,W\in U}\ \sum_{j=1}^k W_j\, I(g_i(a^{(j)}, e, \theta)\ge 0) \qquad (12)$$

where U is the set of (W_1, ..., W_k) in Eq. (11). These give the range of R_i(θ). We use the same method to approximate R(θ), the failure probability for any requirement. Note that in our implementation the E in the formulations above is represented by the discrete points e^{(l)}. As discussed previously, under additional smoothness assumptions we could "smooth" these points to obtain a continuum. Nonetheless, under sufficient sampling of the e^{(l)}'s, the discretized set should be a good enough approximation in the sense that the optimal values from the "discretized" problems are close to those using the continuum.

Using the above method, we get that the ranges of R_1(θ), R_2(θ), R_3(θ) and R(θ) each take the form of an interval with lower endpoint approximately 0. For s_i(θ), the severity of each individual requirement violation, we similarly simulate

$$\max_{e\in E}\max_{W\in U}\ \sum_{j=1}^k W_j\, g_i(a^{(j)}, e, \theta)\, I(g_i(a^{(j)}, e, \theta)\ge 0).$$

The results for s_1(θ), s_2(θ) and s_3(θ) are respectively 0.1464, 0.0493 and 3.5989. Clearly the violation of g_3 is the most severe one while the violation of g_2 is the least.

Our analysis on the ranking of the epistemic uncertainties is based on the range of R(θ) obtained above. In our computation, we obtain

$$\min_{W\in U}\Big/\max_{W\in U}\ \sum_{j=1}^k W_j\, I(g_i(a^{(j)}, e, \theta)\ge 0)$$

for each requirement and each eligible e ∈ E. For simplicity, we use R_min and R_max to denote these two values for each eligible e ∈ E respectively.

Our approach is to scrutinize the plots of R_min and R_max against each epistemic variable (Figs. 6 and 7). For R_min, a large value is notable, since it means that any distribution that provides similarity to the original data is going to fail with large probability. Therefore the most ideal reduction is to avoid the region of e in which all R_min's are large. For R_max, the largest R_max over the region gives the maximum failure probability that one can incur, so we pay attention to the epistemic variables that could potentially reduce this "worst-case" failure probability. Based on these considerations, we obtain a ranking of the epistemic uncertainties.

Figure 6: R_min against each epistemic variable.

Since the distribution of a in our approach is defined via an ambiguity set that depends on e, the failure domain is also based on each eligible e. We classify an eligible e as notable if its corresponding R_min is relatively large.

Figure 7: R_max against each epistemic variable.

We denote the weights corresponding to R_min as w_min, where

$$w_{min} = \arg\min_{W\in U}\ \sum_{j=1}^k W_j\, I(g_i(a^{(j)}, e, \theta)\ge 0).$$

We consider the representative realizations of uncertainties to be those a^{(j)}'s with large values of w_min. We group these realizations on a scatter plot using two of the aleatory coordinates (as in Fig. 8). We also provide some example responses of cases in each group. We observe that there is a clear similarity in the responses within each group, which can be interpreted as different failure patterns.
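The inner optimization in (12) over W for fixed e is an LP. Its structure can be illustrated with a simplified stand-in for the set U of Eq. (11): a hypothetical constraint set {W ≥ 0 : Σ_j W_j = 1, W_j ≤ C/k} (bounded likelihood ratios), under which the bounds on a 0-1 linear objective admit a closed-form greedy solution. This is a sketch under that assumption, not the actual Eq. (11) set:

```python
def prob_bounds(fail, C=2.0):
    """Min/max of sum_j W_j * fail[j] over the toy ambiguity set
    {W >= 0, sum_j W_j = 1, W_j <= C/k} (a hypothetical stand-in
    for the KS-based set U of Eq. (11))."""
    k = len(fail)
    cap = C / k                      # per-sample weight cap
    n_fail = sum(fail)
    # max: pile as much weight as allowed on the failing samples
    w_max = min(n_fail * cap, 1.0)
    # min: pile as much weight as allowed on the non-failing samples
    w_min = max(1.0 - (k - n_fail) * cap, 0.0)
    return w_min, w_max

# 30 failure indicators among k = 100 simulated samples
indicators = [1] * 30 + [0] * 70
print(prob_bounds(indicators))  # ~ (0.0, 0.6)
```

For the actual KS-constrained set U, the same bounds would instead be obtained by passing the constraints of Eq. (11) to an LP solver.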

10. Reliability-Based Design (Problem D)

To find a reliability-optimal design point θ_new, we minimize

$$\max_{e\in E}\min_{W\in U}\ \sum_{j=1}^k W_j\, I(g(a^{(j)}, e, \theta)\ge 0). \qquad (13)$$

Figure 8: The four groups of representative realizations on the scatter plot with two aleatory coordinates as the axes; the groups (blue, red, yellow and green) are failure cases caused by different requirements or combinations thereof.

For any e ∈ E, if min_{W∈U} Σ_{j=1}^k W_j I(g(a^{(j)}, e, θ) ≥ 0) is large, then the true probability with which the system fails must be even larger than this "best-case" estimate, which implies that this point e has a considerable failure likelihood. The objective above thus aims to find a design point that minimizes this best-case estimate, while taking the worst case among all the eligible e's. Arguably, one can use other criteria, such as minimizing max_{e∈E} max_{W∈U} Σ_{j=1}^k W_j I(g(a^{(j)}, e, θ) ≥ 0).

Note that g is only observed through simulation, and the problem is easily non-convex. Our approach is to use gradient descent to guide us towards a better θ_new, with the goal of finding a reasonably good θ_new (instead of insisting on full optimality, which could be difficult to achieve in this problem). Note that we need to sample the a^{(j)}'s when we land at a new θ during our iterations, and hence our approach takes the form of stochastic gradient descent or stochastic approximation [21, 35]. Moreover, the gradient cannot be estimated in an unbiased fashion as we only have black-box function evaluations, and thus we need to resort to finite differences. This results in a zeroth-order or so-called Kiefer-Wolfowitz (KW) algorithm [40, 23]. As we have a nine-dimensional design variable, we choose to update θ via coordinate descent, namely at each iteration we choose one of the dimensions and run a central finite difference along that dimension, followed by a movement of θ guided by this gradient estimate with a suitable step size. The updates are done in a round-robin fashion over the dimensions. The perturbation size in the finite difference is chosen to decay polynomially in the iteration count, as this appears to perform well empirically (though theoretically other scalings could be better).

Algorithm 2 shows the details of our optimization procedure. Considering that the components of θ_baseline are of very different magnitudes, we first perform a normalization to ease this difference.
The quantity x_now encodes the position of the normalized θ, and 1 denotes a nine-dimensional vector of 1's that is set as the initial normalized design point. We set c_0 and a_0 to a small common value and N_max = 8. After running the algorithm, we arrive at a new design point θ_new.

Algorithm 2: KW algorithm to find θ_new.

Input: The baseline design point θ_baseline. The initial perturbation size c_0. The initial step size a_0. The maximum iteration count N_max. The objective function f(θ).

Procedure:
Set x_now = 1 and n = 1.
while n ≤ N_max do
    Set c_n = c_0/n^{1/4} and a_n = a_0/n.
    for i from 1 to 9 do
        u = f(θ_baseline ∘ (x_now + c_n e_i))
        l = f(θ_baseline ∘ (x_now − c_n e_i))
        g = (u − l)/(2 c_n)
        x_now = x_now − a_n g
    end for
    n = n + 1
end while
Output θ_baseline ∘ x_now (∘ denotes the Hadamard product).

The resulting failure probability, among the worst possible of all eligible e's, is 0.2732. For θ_new, the ranges of R_1(θ), R_2(θ), R_3(θ) and R(θ) (defined in Section 9.1) each have lower endpoint approximately 0. We also observe from the plots of R_min and R_max that one of the epistemic variables shows significantly different patterns at high values in both plots; according to the trends shown in the plots, we rank the epistemic variables accordingly.
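Algorithm 2 can be sketched on a toy smooth objective as follows. The quadratic objective, the all-ones baseline, and the constants c_0 = a_0 = 0.1 are illustrative stand-ins (the actual objective is the simulated worst-case failure probability):

```python
def kw_coordinate_descent(f, theta_baseline, c0=0.1, a0=0.1, n_max=8):
    """Zeroth-order (Kiefer-Wolfowitz) coordinate descent with central
    finite differences, mirroring Algorithm 2. Iterates on the normalized
    point x and evaluates f at theta_baseline * x (Hadamard product)."""
    d = len(theta_baseline)
    had = lambda x: [b * xi for b, xi in zip(theta_baseline, x)]
    x = [1.0] * d                    # initial normalized design point
    for n in range(1, n_max + 1):
        c_n = c0 / n ** 0.25         # shrinking perturbation size
        a_n = a0 / n                 # shrinking step size
        for i in range(d):           # round-robin over coordinates
            x[i] += c_n
            u = f(had(x))
            x[i] -= 2 * c_n
            l = f(had(x))
            x[i] += c_n              # restore the coordinate
            g = (u - l) / (2 * c_n)  # central-difference gradient estimate
            x[i] -= a_n * g
    return had(x)

# toy objective: squared distance to (2, ..., 2); baseline of all ones
f = lambda th: sum((t - 2.0) ** 2 for t in th)
theta_new = kw_coordinate_descent(f, [1.0] * 9)
print(f(theta_new) < f([1.0] * 9))  # -> True (the design improves)
```

On this quadratic the central difference is exact, so each coordinate contracts monotonically toward the optimum; on the simulated objective the same loop only yields a noisy descent direction, which is why the step and perturbation sizes shrink with n.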

11. Design Tuning (Problem E)

With the additional data sequence D_2 = {z^{(i)}(t)}, i = 1, ..., n, we may incorporate the new information to update our model as before. Similar to Section 7, we use the Fourier transform to summarize the highly seasonal responses. In particular, we represent (z_1^{(i)}(t))_{t=0,...,T} and (z_2^{(i)}(t))_{t=0,...,T} as

$$z_1^{(i)}(t)=\sum_{k=-\infty}^{\infty} C_k^1 e^{-ik\omega_0 t}\quad\text{and}\quad z_2^{(i)}(t)=\sum_{k=-\infty}^{\infty} C_k^2 e^{-ik\omega_0 t}$$

respectively. As shown in Figures 9 and 10, the responses in the frequency domain have common patterns in the positive and negative peaks. Again we use the values of these peaks and their corresponding frequencies to summarize z_1 and z_2, which leads to 20 extra parameters adding to the 12 parameters extracted from the original data.

Figure 9: The real part and the imaginary part of the C_k^1's against the corresponding frequencies.

Figure 10: The real part and the imaginary part of the C_k^2's against the corresponding frequencies.

With the extracted parameters from both data sets, we now update our eligibility set for E by computing the q∗_l's. We determine eligible e's with the new threshold q_{1−0.05/32} = 1.89. The values of the q∗_l's are presented in Figure 11. Compared with Figure 5, we observe that the trend in one of the epistemic variables changes slightly: its q∗_l's at high values become higher after introducing the information from D_2, which indicates that e's with high values in this dimension are less eligible. Based on this stronger trend and the observation in Section 10, we determine to refine this variable's range on both ends.

Figure 11: q∗_l against each epistemic variable (after incorporating D_2).

In Figure 12, we present the q∗_l's of samples of e for determining the final eligibility set E. With this updated information, the final design θ_final is obtained using Algorithm 2. The ranges of R_1(θ_final), R_2(θ_final), R_3(θ_final) and R(θ_final) are [0, 0.1676], [0, 0.1620], [0, 0.046] and [0, 0.2551] respectively. Compared to θ_baseline and θ_new, the worst-case reliability performance is significantly improved.
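The Fourier-based summarization above can be illustrated with a naive DFT that extracts a dominant peak's frequency index and magnitude. The signal and window length below are made up for the example; the actual summary keeps both the positive and negative peaks of z_1 and z_2:

```python
import cmath
import math

def dft(signal):
    """Naive DFT: C_k = (1/N) * sum_t z(t) * exp(-2*pi*i*k*t/N)."""
    N = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N)) / N
            for k in range(N)]

def dominant_peak(signal):
    """Frequency index and magnitude of the largest positive-frequency peak."""
    C = dft(signal)
    half = range(1, len(signal) // 2)   # positive frequencies only
    k_star = max(half, key=lambda k: abs(C[k]))
    return k_star, abs(C[k_star])

# a response oscillating 3 times over the observation window
z = [math.cos(2 * math.pi * 3 * t / 64) for t in range(64)]
k_star, peak = dominant_peak(z)
print(k_star)  # -> 3
```

A pure cosine at 3 cycles per window concentrates its energy at k = 3 (with magnitude 1/2), which is exactly the kind of peak location and value used as summary parameters.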

12. Risk-Based Design (Problem F)

Recall that we create an eligibility set for e in the form {e^{(l)} : q∗_l ≤ q_{1−α/m}}, which provides us (1 − α) confidence for covering the truth asymptotically. To reduce the volume of the eligibility set by r%, we remove the r% of eligible points with the largest q∗_l's. Since a larger q∗_l indicates less similarity with the true response, the reduced eligibility set retains the more important e^{(l)}'s.

In our setting for e, taking risks is equivalent to reducing the confidence level for covering the truth. Let the r% upper quantile of the q∗_l's be q_{r%}. Then the reduced eligibility set can be represented as {e^{(l)} : q∗_l ≤ q_{r%}}. By finding the α̃ such that q_{r%} = q_{1−α̃/m}, we can find the confidence level 1 − α̃ that corresponds to each choice of r%. In the later discussion, the reduced eligibility set corresponding to risk level r% is denoted E_{r%}.

Figure 12: q∗_l against each epistemic variable (final refined).

In our experiment, we use the E from Section 11 as the baseline. Table 1 shows the risk levels and their corresponding confidence levels. The relation between r% and α̃ highly depends on the values of the q∗_l's. In our case, we observe that a large portion of the q∗_l's are close to q_{1−α/m}. As a consequence, the reduction in the volume of the set does not lead to a similar extent of reduction in the confidence level. Since the confidence level is almost unchanged, we can anticipate that the design results for r% in the range (0, 10) will perform similarly.

With different r%'s, we construct the E_{r%}'s using the above approach and implement Algorithm 2 to obtain risk-based designs θ_{r%}. Then we evaluate the θ_{r%}'s by computing the reliability and severity metrics based on their corresponding eligibility sets E_{r%} and also E. The evaluation results using E are shown in Figure 13, from which we observe that both the reliability and the severity metrics are insensitive to the change of r%. In fact, the results are also insensitive to whether E or E_{r%} is used to compute the metrics; since the difference is negligible (the largest difference is smaller than 0.01), we omit the results using E_{r%}. From these results, we confirm our conjecture that taking risks would not make much difference in our design approach.

Table 1: The risk levels against their corresponding confidence levels.

r%          0      2      4      6      8      10
|E_{r%}|    114    113    110    108    106    104
1 − α̃      95%    94.9%  94.8%  94.8%  94.7%  94.6%

Figure 13: The reliability and severity metrics for the θ_{r%}'s evaluated using E.
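The mapping from a removal fraction r% to the implied confidence level 1 − α̃ can be sketched via the limiting Kolmogorov CDF K, since q_{r%} = q_{1−α̃/m} gives α̃ = m(1 − K(q_{r%})). The q∗ values below are synthetic, and m = 32 follows the summary dimension of Section 11:

```python
import math

def kolmogorov_cdf(x, terms=100):
    """Limiting CDF of the scaled KS statistic."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def reduce_set(q_values, r_pct, m=32):
    """Drop the r% largest q* values; return the reduced set and 1 - alpha_tilde."""
    qs = sorted(q_values)
    keep = len(qs) - int(round(len(qs) * r_pct / 100.0))
    reduced = qs[:keep]
    q_r = reduced[-1]                           # new, stricter threshold q_{r%}
    alpha_tilde = m * (1.0 - kolmogorov_cdf(q_r))
    return reduced, 1.0 - alpha_tilde

# synthetic eligible q* values just below the m = 32 threshold 1.89
qs = [1.89 - 0.005 * i for i in range(100)]
reduced, conf = reduce_set(qs, 10)
print(len(reduced))  # -> 90
```

With r = 0 this construction recovers a confidence level of about 95%, consistent with the baseline column of Table 1; because many q∗ values sit near the threshold, removing 10% of points lowers the confidence only slightly.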

13. Discussion

In this UQ Challenge, we propose a methodology to calibrate model parameters and quantify calibration errors from output data under both aleatory and epistemic uncertainties. The approach utilizes a framework based on an integration of distributionally robust optimization and importance sampling, and operates computationally by solving sampled linear programs. It provides theoretical confidence guarantees on the coverage of the ground-truth parameters and distributions. We apply and illustrate our approach on the model calibration and downstream risk analysis tasks in the UQ Challenge. Our approach is drastically different from established Bayesian methodologies, both in the type of guarantee (frequentist versus Bayesian) and the computation method (optimization versus posterior sampling). We anticipate much further work in the future in expanding our methodology to more general problems as well as comparing with the established approaches.

We discuss some immediate future improvements to our implementation in this UQ Challenge. Our procedure relies on several configurations that warrant further exploration. First, the eligibility set geometry is dictated by the choices of the distance metric between distributions and the summary function. Our choice of the KS-distance is motivated from nonparametric hypothesis testing, which provides asymptotic guarantees. However, since only a finite number of samples is available in practice, its performance can be problem dependent, and other nonparametric test statistics could be considered. Regarding summary functions, we have chosen them based on the visualization of the Fourier transform and justified their number via a balance of representativeness and conservativeness in simultaneous estimation. Our refinement results indicate that our eligibility set performs well in locating e, which validates our configurations. Nonetheless, a more rigorous approach to choosing both the distance metric and the summary functions is desirable.

Our approach requires sampling a number of a and e for eligibility set and aleatory distribution construction. Since a limited number of naive (uniform) samples might miss important information in a large continuous space and cause high variance, we have used several variance reduction techniques including stratified sampling and common random numbers. We note that the samples for a have a larger effect on designs, since they are used to construct the associated best- and worst-case distributions, and the quality of samples can be crucial to correctly evaluating the design performances. Moreover, a good sampling scheme can also lead to higher stability of the stochastic gradient descent algorithm.

Lastly, we note that the conservative nature of our robust approach is reflected in the system design. While our robust approach performs well in locating the eligibility set and providing upper bounds on reliability, directly using these bounds as the objectives for optimizing designs appears over-conservative. Further work on improving the choice of eligibility sets and the sampling of a and e could help improve these design performances.

Appendix A. Proof of Theorem 2

This proof is adapted from [29]. We denote $L = dP_{true}/dP$ and let

$$W_j = \frac{L(a^{(j)})}{\sum_{j=1}^k L(a^{(j)})}.$$

For simplicity, we use $y(a)$ to denote $(y(a, e_{true}, t))_{t=1,\dots,T}$ and $y_j$ to denote $(y(a^{(j)}, e_{true}, t))_{t=1,\dots,T}$. Then we have that

$$\sup_{x\in\mathbb R}\left|\sum_{j=1}^k W_j I(S_v(y_j)\le x)-\hat F_v(x)\right|
\le \sup_{x\in\mathbb R}\left|\sum_{j=1}^k W_j I(S_v(y_j)\le x)-\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)\right|$$
$$+\ \sup_{x\in\mathbb R}\left|\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)-E_P\big(L(a)\, I(S_v(y(a))\le x)\big)\right|
+\ \sup_{x\in\mathbb R}\left|E_{P_{true}}\big(I(S_v(y(a))\le x)\big)-\hat F_v(x)\right|.$$

For the first term, we have that

$$\sum_{j=1}^k W_j I(S_v(y_j)\le x)-\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)
= \frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)\left(\frac{k}{\sum_{j=1}^k L(a^{(j)})}-1\right).$$

Since $\|dP_{true}/dP\|_\infty\le C$, we get that $\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)\le C$. Moreover, we know that $E_P(L)=1$ and $var_P(L)<\infty$, and thus

$$\sqrt k\left(\frac{k}{\sum_{j=1}^k L(a^{(j)})}-1\right)\Rightarrow N(0, var_P(L)).$$

Hence, we get that

$$\sup_{x\in\mathbb R}\left|\sum_{j=1}^k W_j I(S_v(y_j)\le x)-\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)\right| = O_p(1/\sqrt k).$$

For the second term, following the proof in [29], we know that

$$\left\{\sqrt k\left(\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)-E_P\big(L(a)\, I(S_v(y(a))\le x)\big)\right)\right\}\Rightarrow\{G(x)\}$$

in $\ell^\infty(\{a\mapsto L(a)\, I(S_v(y(a))\le x): x\in\mathbb R\})$, where $G$ is a Gaussian process. Therefore, we get that

$$\sup_{x\in\mathbb R}\left|\frac1k\sum_{j=1}^k L(a^{(j)})\, I(S_v(y_j)\le x)-E_P\big(L(a)\, I(S_v(y(a))\le x)\big)\right| = O_p(1/\sqrt k).$$

Finally, it is known that

$$\sqrt n\,\sup_{x\in\mathbb R}\left|E_{P_{true}}\big(I(S_v(y(a))\le x)\big)-\hat F_v(x)\right|\Rightarrow\sup_{x\in[0,1]}\big|BB(F_{true,v}(x))\big|$$

where $BB$ denotes a standard Brownian bridge. Combining the above results, we get that for each $v=1,\dots,m$,

$$\limsup_{n\to\infty,\,k/n\to\infty} P\left(\sup_{x\in\mathbb R}\left|\sum_{j=1}^k W_j I(S_v(y_j)\le x)-\hat F_v(x)\right|>\frac{q_{1-\alpha/m}}{\sqrt n}\right)\le\frac{\alpha}{m}$$

and hence

$$\liminf_{n\to\infty,\,k/n\to\infty} P(e_{true}\in E)\ge 1-\alpha.$$

Acknowledgements

We gratefully acknowledge support from the National Science Foundation under grants CAREER CMMI-1834710 and IIS-1849280. A preliminary conference version of this paper has appeared in the Proceedings of the 30th European Safety and Reliability Conference and the 15th Probabilistic Safety Assessment and Management Conference (ESREL-PSAM) 2020.

References

[1] Asmussen, S. and P. W. Glynn (2007). Stochastic Simulation: Algorithms and Analysis, Volume 57. Springer Science & Business Media.
[2] Balci, O. and R. G. Sargent (1982). Some examples of simulation model validation using hypothesis testing. In H. J. Highland, Y. W. Chao, and O. S. Madrigal (Eds.), Proceedings of the 14th Conference on Winter Simulation - Volume 2, Piscataway, New Jersey, pp. 621–629. Institute of Electrical and Electronics Engineers, Inc.
[3] Bayarri, M. J., J. O. Berger, R. Paulo, J. Sacks, J. A. Cafeo, J. Cavendish, C.-H. Lin, and J. Tu (2007). A framework for validation of computer models. Technometrics 49(2), 138–154.
[4] Bayraksan, G. and D. K. Love (2015). Data-driven stochastic programming using phi-divergences. In Tutorials in Operations Research, pp. 1–19. INFORMS.
[5] Ben-Tal, A., D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science 59(2), 341–357.
[6] Ben-Tal, A. and A. Nemirovski (2002). Robust optimization - methodology and applications. Mathematical Programming 92(3), 453–480.
[7] Bertsimas, D., D. B. Brown, and C. Caramanis (2011). Theory and applications of robust optimization. SIAM Review 53(3), 464–501.
[8] Bertsimas, D., V. Gupta, and N. Kallus (2018a). Data-driven robust optimization. Mathematical Programming 167(2), 235–292.
[9] Bertsimas, D., V. Gupta, and N. Kallus (2018b). Robust sample average approximation. Mathematical Programming 171(1-2), 217–282.
[10] Bertsimas, D., V. Gupta, and N. Kallus (2018c). Robust sample average approximation. Mathematical Programming 171(1-2), 217–282.
[11] Blanchet, J., Y. Kang, and K. Murthy (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability 56(3), 830–857.
[12] Blanchet, J. and H. Lam (2012). State-dependent importance sampling for rare-event simulation: An overview and recent advances. Surveys in Operations Research and Management Science 17(1), 38–59.
[13] Blanchet, J. and K. R. Murthy (2016). Quantifying distributional model risk via optimal transport. arXiv preprint arXiv:1604.01446.
[14] Bucklew, J. (2013). Introduction to Rare Event Simulation. New York: Springer Science & Business Media.
[15] Craig, P. S., M. Goldstein, J. C. Rougier, and A. H. Seheult (2001). Bayesian forecasting for complex systems using computer simulators. Journal of the American Statistical Association 96(454), 717–729.
[16] Crespo, L. and S. Kenny (2020). The NASA Langley Challenge on Optimization under Uncertainty. ESREL.
[17] Currin, C., T. Mitchell, M. Morris, and D. Ylvisaker (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association 86(416), 953–963.
[18] Delage, E. and Y. Ye (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research 58(3), 595–612.
[19] Duchi, J. C., P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, to appear.
[20] Esfahani, P. M. and D. Kuhn (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming 171(1-2), 115–166.
[21] Fu, M. C. (2015). Handbook of Simulation Optimization, Volume 216. Springer.
[22] Gao, R. and A. J. Kleywegt (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199.
[23] Ghadimi, S. and G. Lan (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), 2341–2368.
[24] Ghaoui, L. E., M. Oks, and F. Oustry (2003). Worst-case value-at-risk and robust portfolio optimization: A conic programming approach. Operations Research 51(4), 543–556.
[25] Ghosh, S. and H. Lam (2019). Robust analysis in stochastic simulation: Computation and performance guarantees. Operations Research 67(1), 232–249.
[26] Glasserman, P. (2013). Monte Carlo Methods in Financial Engineering, Volume 53. Springer Science & Business Media.
[27] Glasserman, P. and X. Xu (2014). Robust risk measurement and model risk. Quantitative Finance 14(1), 29–58.
[28] Glynn, P. W. and D. L. Iglehart (1989). Importance sampling for stochastic simulations. Management Science 35(11), 1367–1392.
[29] Goeva, A., H. Lam, H. Qian, and B. Zhang (2019). Optimization-based calibration of simulation input models. Operations Research 67(5), 1362–1382.
[30] Goh, J. and M. Sim (2010). Distributionally robust optimization and its tractable approximations. Operations Research 58(4, Part 1), 902–917.
[31] Gupta, V. (2019). Near-optimal Bayesian ambiguity sets for distributionally robust optimization. Management Science 65(9), 4242–4260.
[32] Higdon, D., J. Gattiker, B. Williams, and M. Rightley (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association 103(482), 570–583.
[33] Hong, L. J., Z. Huang, and H. Lam (2020). Learning-based robust optimization: Procedures and statistical guarantees. Management Science.
[34] Hu, Z., J. Cao, and L. J. Hong (2012). Robust simulation of global warming policies using the DICE model. Management Science 58(12), 2190–2206.
[35] Jian, N. and S. G. Henderson (2015). An introduction to simulation optimization. pp. 1780–1794. IEEE.
[36] Juneja, S. and P. Shahabuddin (2006). Rare-event simulation techniques: An introduction and recent advances. Handbooks in Operations Research and Management Science 13, 291–350.
[37] Kennedy, M. C. and A. O'Hagan (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(3), 425–464.
[38] Kleijnen, J. P. (1995). Verification and validation of simulation models. European Journal of Operational Research 82(1), 145–162.
[39] Kraan, B. and T. Bedford (2005). Probabilistic inversion of expert judgments in the quantification of model uncertainty. Management Science 51(6), 995–1006.
[40] Kushner, H. and G. G. Yin (2003). Stochastic Approximation and Recursive Algorithms and Applications, Volume 35. Springer Science & Business Media.
[41] Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research 41(4), 1248–1275.
[42] Lam, H. (2018). Sensitivity to serial dependency of input processes: A robust approach. Management Science 64(3), 1311–1327.
[43] Lam, H. (2019). Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research 67(4), 1090–1105.
[44] Lam, H. and H. Qian (2019). Combating conservativeness in data-driven optimization under uncertainty: A solution path approach. arXiv preprint arXiv:1909.06477.
[45] Lam, H. and E. Zhou (2017). The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters 45(4), 301–307.
[46] Li, B., R. Jiang, and J. L. Mathieu (2017). Ambiguous risk constraints with moment and unimodality information. Mathematical Programming, 1–42.
[47] Liu, J. S. (2008). Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media.
[48] Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100(26), 15324–15328.
[49] Nelson, B. L. (2016). 'Some tactical problems in digital simulation' for the next 10 years. Journal of Simulation 10(1), 2–11.
[50] Pan, S. J. and Q. Yang (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359.
[51] Petersen, I. R., M. R. James, and P. Dupuis (2000). Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control 45(3), 398–412.
[52] Popescu, I. (2005). A semidefinite programming approach to optimal-moment bounds for convex classes of distributions. Mathematics of Operations Research 30(3), 632–657.
[53] Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
[54] Rubinstein, R. Y. and D. P. Kroese (2016). Simulation and the Monte Carlo Method, Volume 10. John Wiley & Sons.
[55] Sargent, R. G. (2010). Verification and validation of simulation models. In S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu (Eds.), Proceedings of the 2010 Winter Simulation Conference, Piscataway, New Jersey, pp. 166–183. Institute of Electrical and Electronics Engineers, Inc.
[56] Schlegel, M., W. Chung, D. Graves, J. Qian, and M. White (2019). Importance resampling for off-policy prediction. In Advances in Neural Information Processing Systems, pp. 1799–1809.
[57] Schruben, L. W. (1980). Establishing the credibility of simulations. Simulation 34(3), 101–105.
[58] Siegmund, D. (1976). Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, 673–684.
[59] Sugiyama, M., S. Nakajima, H. Kashima, P. Von Buenau, and M. Kawanabe (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, Volume 7, pp. 1433–1440.
[60] Tarantola, A. (2005). Inverse Problem Theory and Methods for Model Parameter Estimation. Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics.
[61] Van Parys, B. P., P. J. Goulart, and D. Kuhn (2016). Generalized Gauss inequalities via semidefinite programming. Mathematical Programming 156(1-2), 271–302.
[62] Wiesemann, W., D. Kuhn, and M. Sim (2014). Distributionally robust convex optimization.