Optimal Nested Simulation Experiment Design via Likelihood Ratio Method
Mingbin Ben Feng
Department of Statistics and Actuarial Science, University of Waterloo, Canada

Eunhye Song
The Harold and Inge Marcus Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, USA
Abstract
Nested simulation arises frequently in financial or input uncertainty quantification problems, where the performance measure is defined as a function of the simulation output mean conditional on the outer scenario. The standard nested simulation samples M outer scenarios and runs N inner replications at each. We propose a new experiment design framework for a problem whose inner replication's inputs are generated from probability distribution functions parameterized by the outer scenario. This structure lets us pool replications from one outer scenario to estimate another scenario's conditional mean via the likelihood ratio method. We formulate a bi-level optimization problem to decide not only which of the M outer scenarios to simulate and how many times to replicate at each, but also how to pool these replications such that the total simulation effort is minimized while achieving the same estimation error as the standard nested simulation. The resulting optimal design requires far less simulation effort than MN. We provide asymptotic analyses of the convergence rates of the performance measure estimators computed from the experiment design. Empirical results show that our experiment design significantly reduces the simulation cost compared to the standard nested simulation as well as a state-of-the-art design that pools replications via regressions.
Keywords— nested simulation, importance sampling, likelihood ratio method, optimal simulation experiment design

1 Introduction

In this paper, we consider a simulation experiment whose performance measure is a function of a conditional mean of a simulation output. This class of problems is known as nested simulation (Hong et al., 2017), where the conditioning random vectors are referred to as outer scenarios and the simulation replications run at each outer scenario as inner replications. In particular, we are interested in the case when the outer scenarios are independent and identically distributed (i.i.d.) and the random inputs generated within each inner replication are drawn from a joint probability distribution that is completely specified by the outer scenario. We consider two classes of performance measures: (i) an expectation of a real-valued function of the conditional mean; (ii) an α-quantile of the conditional mean given a user-specified probability 0 < α < 1. In the rest of the paper, we adopt the term nested simulation statistic to refer to a generic performance measure estimator produced by a nested simulation. Two important applications of nested simulation are introduced below.
Enterprise risk management (ERM): To monitor the risk exposures of large investment portfolios of complex financial instruments, financial institutions such as banks and insurance companies regularly estimate risk measures of their portfolios' profits and losses (P&Ls) at a future time. These estimation problems have the nested simulation structure: one first simulates the possible evolution of underlying risk factors, which serves as the outer scenario; then inner simulations are conducted at each outer scenario to estimate the expected portfolio P&L in that scenario. The empirical distribution of the simulated P&Ls for all outer scenarios is then used to estimate the risk measure of interest, e.g., exceedance probability, variance, quantile or Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR), etc.
Input uncertainty quantification (IUQ): Input uncertainty refers to simulation output variation caused by estimation error in the input probability distribution inferred from finite observations obtained from the target system (Song et al., 2014). In a Bayesian setting, a prior distribution is imposed on the parameter vector of the input model, which is then updated to the posterior given the data. Zouaoui and Wilson (2004) quantify input uncertainty by propagating variability of the parameter vector captured by its posterior to the simulation output via nested simulations. In this context, an outer scenario is a parameter vector sampled from its posterior, and inner replications are averaged to estimate the conditional mean of the simulation output given each sampled parameter. From the conditional mean estimates, one can compute empirical quantiles of the conditional mean and construct a credible interval that contains the conditional mean with a desired probability (Xie et al., 2014).

A challenge in nested simulation is its computational cost. A classical approach, which we refer to as the standard nested simulation, samples M outer scenarios and runs N inner replications at each outer scenario to compute the Monte Carlo (MC) estimator of the conditional mean, resulting in a total simulation budget of Γ = MN. In the nested simulation literature, it is well known that small M and N lead to high variance and bias, respectively, of the nested simulation statistic. Gordy and Juneja (2010) analyze the standard nested simulation experiment design when the performance measure of interest is an exceedance probability, VaR, or CVaR. They show that the bias and the variance of the nested simulation statistic diminish in O(N^{-1}) and O(M^{-1}), respectively, which leads to the asymptotically optimal allocation that minimizes the mean squared error (MSE) given fixed budget Γ: N = O(Γ^{1/3}) and M = O(Γ^{2/3}). Sun et al. (2011) consider estimating the variance of the nested simulation statistic via analysis of variance. They show that as the number of outer scenarios M → ∞, the N that minimizes the asymptotic variance of the variance estimator is a constant, which leads to an experiment design that chooses a finite N and spends the rest of the simulation budget to increase M. Focusing on estimating the exceedance probability, Broadie et al. (2011) devise a sequential budget allocation scheme. Given M, their algorithm sequentially assigns inner replications to the scenario whose estimated conditional mean is the closest to the threshold relative to its estimation error. Common to these nested simulation approaches is that the inner replications at each outer scenario are used only to estimate the conditional mean at that scenario, but no others.

On the contrary, there are several approaches that pool inner simulation replications from some or all outer scenarios to improve the estimation error of the nested simulation statistic; we refer to these as pooled nested simulation designs. Liu and Staum (2010) pool inner simulation replications to calibrate a stochastic kriging model, which is then used to predict the outer scenarios' conditional means for CVaR estimation. Their numerical studies show that their CVaR estimator achieves orders of magnitude smaller MSE than that from the standard nested simulation with the same simulation budget. Broadie et al. (2015) run a smaller nested simulation experiment and use it as the initial design to calibrate a regression model, which is then used to predict the outer scenarios' conditional means.
For a performance measure defined as the expectation of a smooth function, they suggest M = Γ and N = 1 for the nested simulation at the initial design and show that the MSE of the nested simulation statistic converges in O(Γ^{-1+δ}) for any δ > 0. Hong et al. (2017) propose a kernel-based k-nearest-neighbor estimator to pool inner replications from nearby outer scenarios to estimate the conditional mean given each outer scenario. Considering expected-value-type performance measures, their optimal kernel parameter choice leads to an MSE convergence rate of O(Γ^{-min{1, 4/(d+2)}}) for the nested simulation statistics, where d is the dimension of the outer scenario vector.

There are two experiment design decisions to make when one decides to pool inner simulation replications among different outer scenarios: 1) where (at which outer scenarios) to run inner simulations and how many times to replicate, and 2) how to pool these replications when estimating the conditional mean at each outer scenario. We refer to the first question as the sampling decision and the latter as the pooling decision. The aforementioned pooled nested simulation designs address the latter, but not the former, by running the same number of replications at all outer scenarios.

We propose a new pooled nested simulation experiment design framework that optimizes both sampling and pooling decisions prior to running any inner replications. For any pair of the M outer scenarios, the conditional mean at one scenario can be estimated from inner replications made at the other scenario by constructing a likelihood ratio (LR) estimator. Not all such pairwise LR estimators are efficient, as some may have unbounded variances. We adopt a version of the LR estimator that allows us to assess the variance of the pairwise LR estimators and hence evaluate their efficiency prior to running any inner replications.
To this end, we propose a pooled LR estimator for each outer scenario that is a weighted sum of the pairwise LR estimators defined with respect to all outer scenarios (including itself). Our pooling decision is to find the variance-minimizing weights for each pooled LR estimator of the conditional mean. For any pair that results in an infinite-variance LR estimator, the pooling weight becomes 0. Our sampling decision minimizes the total simulation budget, Γ, while ensuring the variance of each outer scenario's pooled LR estimator is below that of the standard nested simulation's conditional mean estimator computed from N inner replications. This may well result in running no replications at some outer scenarios. We combine the two design decisions into a bi-level optimization problem whose upper level optimizes the sampling decision and whose lower level optimizes the pooling decision given any sampling decision.

The main contributions of our study can be summarized as follows:

• We propose a framework to optimize the nested simulation experiment design that accounts for both sampling and pooling decisions and formulate it as a bi-level optimization problem. Then, we show the problem can be reformulated into a linear program (LP) with guaranteed feasibility, which can be solved easily.

• We perform asymptotic analyses on the nested simulation statistics constructed from our pooled LR estimators under commonly adopted conditions in the literature. When the nested statistic computed from our LR estimators is expressed as the expectation of a function of the conditional mean, we show that its MSE converges in O(M^{-1}) + O(N^{-1}). When the nested simulation statistic is a quantile of the conditional mean, we show it converges to the true quantile in O_p(M^{-1/2}) + O_p(N^{-1/2}). We emphasize that N only sets the target precision in our optimization formulation, not the actual number of inner replications in each outer scenario.
The resulting minimum simulation budget is guaranteed to be no larger than MN.

• Our experiment design is applied to two examples in ERM and IUQ applications. Both studies provide supporting evidence for our asymptotic analyses. Moreover, we observe that the minimized simulation budget is orders of magnitude smaller than MN while achieving smaller estimation error compared to benchmark algorithms using the same simulation budget.

To put our methodological contribution in context, we provide a brief literature review on two closely related methods, namely importance sampling (IS) and the LR method. The two methods are mathematically identical but differ in goals and use cases. IS concerns the simulation design and aims to select variance-minimizing sampling distributions, typically for a single estimator, prior to running any simulation experiment. We address the sampling decision in a similar manner as IS (Hesterberg, 1988; Owen, 2013), but our selection criteria are different in that we optimize efficiency of the entire experiment design that estimates multiple conditional means simultaneously. The LR method, also known as the score function method, aims to save computation by reusing outputs after the simulation experiments are run. The LR method has been applied to improve efficiency of metamodeling (Dong et al., 2018), gradient estimation (L'Ecuyer, 1990, 1993; Glasserman and Xu, 2014), and sensitivity analysis and optimization (Rubinstein and Shapiro, 1993; Kleijnen and Rubinstein, 1996; Fu, 2015; Maggiar et al., 2018). Feng and Staum (2015, 2017) coin the term green simulation, which aims to recycle and reuse simulation outputs to save computation and improve precision by applying the LR method. Similar to these approaches, our pooling decision is also guided by the LR method once the sampling decision is made.

In recent developments in the IUQ literature, Zhou and Liu (2019) and Feng and Song (2019) apply the LR method to pool inner replications of the nested simulation and demonstrate significant computational savings. However, neither optimizes the sampling decision.

The remainder of the paper is organized as follows.
Section 2 provides a mathematical framework for the nested simulation problem and Section 3 defines our pooled LR estimator and discusses its properties. In Section 4, we propose the experiment design framework to optimize sampling and pooling decisions. Asymptotic properties of the nested simulation statistics computed from the experiment design are analyzed in Section 5, followed by their empirical evaluations in Section 6.

2 Mathematical Framework

Consider a nested simulation whose outer scenarios are M i.i.d. multidimensional random vectors, θ_1, θ_2, ..., θ_M. Let Θ ⊆ R^p denote the support of θ_i. The scenarios may be sampled from a known distribution or generated from an outer simulation model. For each θ_i ∈ Θ, the conditional mean of the simulation output is defined as µ_i := µ(θ_i) ≡ E_{θ_i}[g(X)], where X is a vector of all inputs generated within each replication with joint distribution function h(x; θ_i) and g is a real-valued simulation output function. Here, µ_i is a conditional mean since the expectation is taken with respect to h parameterized by (random) θ_i. Similarly, the conditional variance of the simulation output given θ_i is denoted by V_{θ_i}[g(X)]. The following assumption facilitates our experiment design.

Assumption 2.1.
For all θ ∈ Θ, h(x; θ) is a well-defined probability distribution function and has a common support X ⊆ R^d for a fixed dimension d. Furthermore, the simulation output function g does not depend on θ and sup_{θ∈Θ} V_θ[g(X)] < ∞.

The common support X ensures that the likelihood ratio between any two outer scenarios θ_i, θ_j ∈ Θ is well-defined. The common function g allows us to reuse X and g(X) from θ_j to estimate µ_i via the LR method without running any additional inner replications at θ_j. The fixed input dimension, d, is a limitation of our method. For example, a queueing simulation may not satisfy Assumption 2.1 if the numbers of interarrival times and service times generated within each run vary. We refer the readers to Feng and Song (2019) for a discussion on applying the LR method to improve nested simulation efficiency when the dimension of X varies per iteration. The condition sup_{θ∈Θ} V_θ[g(X)] < ∞ ensures that the Monte Carlo estimator of µ_i for any θ_i ∈ Θ has a finite variance.

We consider two classes of performance measures. The first class is written as E[ζ(µ_i)], where ζ is a real-valued function. Specifically, three types of ζ are investigated. The first is when ζ is the indicator function ζ(µ_i) = I(µ_i ≤ ξ) for some ξ ∈ R. If the performance measure of interest is the probability of a large portfolio loss beyond ξ, then the exceedance probability, Pr(µ_i > ξ), can be estimated from the indicator function ζ. We also consider the hockey stick function ζ(µ_i) = max{µ_i − ξ, 0} = (µ_i − ξ)I(µ_i > ξ), commonly used in ERM applications to price derivatives or compound options (Glasserman, 2003). The last case is when ζ is a smooth function of µ_i with a bounded second derivative. An example of such ζ is the squared loss function given target ξ, ζ(µ_i) = (µ_i − ξ)². The second class of performance measures is the α-quantile of µ_i given a user-specified 0 < α < 1, which has relevance in both ERM and IUQ applications as mentioned in Section 1.

In the standard nested simulation with N inner replications for each outer scenario, the MC estimator of the conditional mean µ_i, µ̄_i ≡ Σ_{k=1}^N g(X_k)/N, is computed for each i, where X_k i.i.d.∼ h(x; θ_i) is the input vector generated within the k-th inner replication. Under Assumption 2.1, V_{θ_i}[µ̄_i] = V_{θ_i}[g(X)]/N for any θ_i ∈ Θ. From µ̄_1, ..., µ̄_M, E[ζ(µ_i)] can be estimated by Σ_{i=1}^M ζ(µ̄_i)/M. Also, the α-quantile of µ_i can be estimated by the empirical quantile µ̄_(⌈Mα⌉), where µ̄_(i) is the i-th order statistic of µ̄_1, ..., µ̄_M.

3 Likelihood Ratio Estimators

We apply the LR method to pool inner replications from several outer scenarios to estimate the conditional mean at each outer scenario. Consider two scenarios θ_i, θ_j ∈ Θ, where θ_i is the target scenario whose conditional mean is desired and θ_j is the sampling scenario where inner replications are run. Several LR-based estimators are inspired by the following identity (Hesterberg, 1988; Owen, 2013; Dong et al., 2018):

    µ_i = E_{θ_i}[g(X)] = E_{θ_j}[g(X) h(X; θ_i)/h(X; θ_j)] = E_{θ_j}[g(X) W_ij(X)],   (1)

where W_ij(X) ≡ h(X; θ_i)/h(X; θ_j) is the LR of the input vector, X, between θ_i and θ_j. Under Assumption 2.1, W_ij(X) is well-defined and E_{θ_j}[W_ij(X)] = 1 for any θ_i, θ_j ∈ Θ. From (1), we can define an unbiased estimator of µ_i, µ̂_ij ≡ Σ_{k=1}^{N_j} g(X_k) W_ij(X_k)/N_j, where N_j is the number of replications made at θ_j and X_k is the input vector of the k-th replication generated from h(x; θ_j).
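The identity (1) can be checked numerically. The sketch below is our illustration, not the paper's example: it estimates µ_i using replications drawn only at the sampling scenario θ_j, under the assumed one-dimensional Gaussian input model h(x; θ) = N(θ, σ²) with known σ and output function g(x) = x².

```python
import numpy as np

# Minimal sketch of the nominal LR estimator from (1), assuming a
# one-dimensional Gaussian input model h(x; theta) = N(theta, sigma^2)
# and output function g(x) = x^2; both choices are illustrative only.
rng = np.random.default_rng(0)
sigma = 1.0
theta_i, theta_j = 0.3, 0.0        # target and sampling scenarios
N_j = 200_000                      # inner replications run at theta_j

x = rng.normal(theta_j, sigma, N_j)   # X_k ~ h(x; theta_j)
# Likelihood ratio W_ij(X) = h(X; theta_i) / h(X; theta_j)
w = np.exp((-(x - theta_i) ** 2 + (x - theta_j) ** 2) / (2 * sigma**2))
g = x**2

mu_hat_ij = np.mean(g * w)         # nominal LR estimator of mu_i
mu_true = theta_i**2 + sigma**2    # E[X^2] under N(theta_i, sigma^2)
```

With this model, the sample mean of `w` should be close to 1 (consistent with E_{θ_j}[W_ij(X)] = 1), and `mu_hat_ij` close to `mu_true`, even though no replication was run at θ_i.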
We refer to µ̂_ij as the nominal LR estimator.

A diagnostic measure for assessing the efficiency of an LR estimator is the effective sample size (ESS), which is defined as the number of replications required for the MC estimator to achieve the same variance as the LR estimator. So, the larger the ESS, the more precise the LR estimator. Because V[µ̂_ij] = V_{θ_j}[g(X)W_ij(X)]/N_j, the ESS of the nominal LR estimator is N_j V_{θ_i}[g(X)]/V_{θ_j}[g(X)W_ij(X)], which, in general, cannot be evaluated analytically, nor is there a known approximation that does not involve a moment of g(X).

In this study, we adopt the self-normalized LR estimator:

    µ̃_ij = (1/N_j) Σ_{k=1}^{N_j} g(X_k) W̃_ij,k,   X_k i.i.d.∼ h(x; θ_j), ∀k = 1, ..., N_j,   (2)

where W̃_ij,k = W_ij(X_k) / (Σ_{ℓ=1}^{N_j} W_ij(X_ℓ)/N_j) is the self-normalized LR from the k-th replication. Lemma 3.1 states asymptotic properties of µ̃_ij. Note that X is dropped from W_ij(X) for notational convenience.

Lemma 3.1.
Consider any given target and sampling scenarios θ_i, θ_j ∈ Θ. Under Assumption 2.1, (i) µ̃_ij converges almost surely to µ_i as N_j increases; and (ii) with additional regularity conditions in Assumption A.1 in Appendix A in the electronic companion,

    E_{θ_j}[µ̃_ij] − µ_i = −E_{θ_j}[W_ij²(g(X) − µ_i)]/N_j + o(N_j^{-1});
    V_{θ_j}[µ̃_ij] = E_{θ_j}[W_ij²(g(X) − µ_i)²]/N_j + o(N_j^{-1}).   (3)

Part (i) of Lemma 3.1 is proved in Theorem 9.2 of Owen (2013). Results similar to Part (ii) can be found in Hesterberg (1988) and Owen (2013); however, neither provides the exact assumptions for them to hold. We present the assumptions and the proof of Part (ii) in Appendix A.

As indicated in Lemma 3.1, the self-normalized LR estimator is biased but consistent. We adopt it in our study because it has a convenient approximate ESS expression. Kong (1992) shows that the variance of µ̃_ij can be approximated as

    V_{θ_j}[µ̃_ij] ≈ V_{θ_i}[g(X)] E_{θ_j}[W_ij²] N_j^{-1}.   (4)

Then, the approximate ESS of µ̃_ij is n^e_ij ≡ N_j / E_{θ_j}[W_ij²], which is free of g and thus can be computed without running any inner replications. For a large class of distribution families, E_{θ_j}[W_ij²] can be computed analytically. In Appendix B in the electronic companion, we derive a closed-form expression for E_{θ_j}[W_ij²] when h(x; θ) is a member of the exponential family.

In a more detailed derivation, Liu (1996) shows the approximation error of (4) as

    V_{θ_j}[µ̃_ij] = (V_{θ_i}[g(X)] E_{θ_j}[W_ij²] + E_{θ_i}[(W_ij − E_{θ_i}[W_ij])(g(X) − µ_i)²]) N_j^{-1} + o(N_j^{-1}),   (5)

which holds under Assumption A.1. Liu (1996) mentions that the O(N_j^{-1}) term, E_{θ_i}[(W_ij − E_{θ_i}[W_ij])(g(X) − µ_i)²] N_j^{-1}, omitted from (4) may be small when g(X) is relatively flat; however, when it is large, (4) can be substantially off.
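As an illustration (ours, not the paper's), the following sketch computes the self-normalized LR estimator (2) and its approximate ESS, assuming a one-dimensional Gaussian input model h(x; θ) = N(θ, σ²) with known σ and g(x) = x². For this assumed model, the second moment of the LR has the closed form E_{θ_j}[W_ij²] = exp((θ_i − θ_j)²/σ²), a Gaussian specialization of the exponential-family expression the paper derives in Appendix B.

```python
import numpy as np

# Self-normalized LR estimator (2) and approximate ESS from (4), assuming
# h(x; theta) = N(theta, sigma^2) and g(x) = x^2; both are illustrative.
rng = np.random.default_rng(1)
sigma, theta_i, theta_j = 1.0, 0.3, 0.0
N_j = 100_000

x = rng.normal(theta_j, sigma, N_j)        # replications at theta_j
w = np.exp((-(x - theta_i) ** 2 + (x - theta_j) ** 2) / (2 * sigma**2))
w_tilde = w / w.mean()                     # self-normalized LRs (mean 1 by construction)
mu_tilde_ij = np.mean(x**2 * w_tilde)      # estimator (2)

# Approximate ESS, computable BEFORE simulating: n^e = N_j / E_{theta_j}[W^2],
# using the Gaussian closed form E_{theta_j}[W^2] = exp((theta_i - theta_j)^2 / sigma^2).
e_w_sq = np.exp((theta_i - theta_j) ** 2 / sigma**2)
ess = N_j / e_w_sq                         # strictly < N_j whenever theta_i != theta_j

mu_true = theta_i**2 + sigma**2            # E[X^2] under theta_i
```

A quick sanity check: the sample mean of `w**2` should be near `e_w_sq`, and `mu_tilde_ij` near `mu_true`, while `ess` reflects the precision loss from sampling at θ_j instead of θ_i.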
Nevertheless, Liu (1996) recommends using (4) as a rule of thumb because of its convenient approximate ESS expression. We adopt this recommendation and evaluate the efficiency of µ̃_ij using its ESS, n^e_ij, for each (θ_i, θ_j) pair prior to running any inner simulations. In Section 6.1, we empirically demonstrate that Approximation (4) performs well with an ERM example.

Remark 1. Some studies (Martino et al., 2017; Elvira et al., 2018) define the ESS as the number of replications such that the LR estimator's MSE matches V[µ̄_i]. As seen in Lemma 3.1, the bias diminishes faster than the variance; thus we ignore the bias in the ESS definition.

4 Optimal Experiment Design

Suppose M outer scenarios, θ_1, θ_2, ..., θ_M, are given. Our goal is to estimate the conditional means at all M scenarios with a precision guarantee while minimizing the total number of inner replications run at the scenarios. The precision guarantee we adopt is that the variance of the conditional mean estimator for each θ_i is no larger than that of µ̄_i computed from N inner replications.

Suppose N_j ≥ 0 inner replications are run at θ_j for each j = 1, ..., M. Some of the outer scenarios may have zero replications, i.e., N_j = 0. Then, µ̃_ij is well-defined for any (θ_i, θ_j), 1 ≤ i, j ≤ M, if N_j > 0. For each i = 1, ..., M, consider the following pooled LR estimator of µ_i:

    µ̃_i ≡ Σ_{j=1, N_j>0}^M γ_ij µ̃_ij,   Σ_{j=1, N_j>0}^M γ_ij = 1,   (6)

where γ_ij, j = 1, ..., M, are the pooling weights. In the standard nested simulation, µ̄_i only pools the inner replications simulated at θ_i. In contrast, the estimator (6) pools all inner replications from all sampling scenarios with appropriate weights {γ_ij : j = 1, ..., M}. Because all inner replications are run independently, V[µ̃_i] = Σ_{j=1, N_j>0}^M γ_ij² V_{θ_j}[µ̃_ij].

Our precision guarantee for each µ_i is V[µ̃_i] ≤ V[µ̄_i].
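The pooled estimator (6) can be sketched as follows. This is our illustration under an assumed Gaussian input model h(x; θ) = N(θ, σ²) with g(x) = x²: one target scenario borrows self-normalized LR estimates from three sampling scenarios, with weights proportional to the approximate ESS (the variance-minimizing choice derived later in this section).

```python
import numpy as np

# Sketch of the pooled LR estimator (6), assuming h(x; theta) = N(theta, sigma^2)
# and g(x) = x^2 (illustrative choices). Every sampling scenario lends its
# replications to the target scenario via self-normalized LRs.
rng = np.random.default_rng(3)
sigma = 1.0
theta = np.array([-0.2, 0.0, 0.2])        # outer scenarios (illustrative)
N_j = np.array([20_000, 20_000, 20_000])  # replications at each scenario
samples = [rng.normal(t, sigma, n) for t, n in zip(theta, N_j)]

def self_norm_lr(x, t_i, t_j):
    """Self-normalized LR estimate of mu_i from replications x drawn at t_j."""
    w = np.exp((-(x - t_i) ** 2 + (x - t_j) ** 2) / (2 * sigma**2))
    return np.mean(x**2 * (w / w.mean()))

i = 0                                             # target scenario index
ew2 = np.exp((theta[i] - theta) ** 2 / sigma**2)  # Gaussian closed form for the
ess = N_j / ew2                                   # approximate ESS of each pair
gamma = ess / ess.sum()                           # ESS-proportional weights, sum to 1
mu_pooled = sum(g * self_norm_lr(x, theta[i], t)
                for g, x, t in zip(gamma, samples, theta))
mu_true = theta[i] ** 2 + sigma**2                # E[X^2] under N(theta_i, sigma^2)
```

By construction the weights sum to one, so the pooled estimate remains a consistent estimator of µ_i while drawing on all 60,000 replications rather than only the 20,000 run at θ_i.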
From Approximation (4), we have

    V[µ̃_i] = Σ_{j=1, N_j>0}^M γ_ij² V_{θ_j}[µ̃_ij] ≤ V[µ̄_i] = V_{θ_i}[g(X)]/N
    ⟹ (approx.)  Σ_{j=1, N_j>0}^M γ_ij² E_{θ_j}[W_ij²]/N_j ≤ 1/N.   (7)

Therefore, the total simulation budget can be minimized by solving

    min_{N_j ≥ 0, γ_ij}  Σ_{j=1}^M N_j   (8)
    subject to  Σ_{j=1, N_j>0}^M γ_ij² E_{θ_j}[W_ij²]/N_j ≤ 1/N,  ∀i = 1, ..., M,
                Σ_{j=1, N_j>0}^M γ_ij = 1,  ∀i = 1, ..., M.

For simplicity, we ignore integrality constraints for {N_j}. In numerical experiments, we assign ⌈N_j⌉ inner replications to θ_j. For some feasible {N_j}, there may be infinitely many feasible {γ_ij} that satisfy the constraints; among them, it is sensible to choose {γ_ij} such that V[µ̃_i] is minimized. From this insight, we reformulate (8) as the following bi-level optimization problem:

    min_{N_j ≥ 0, γ_ij}  Σ_{j=1}^M N_j   (9)
    subject to  Σ_{j=1, N_j>0}^M γ_ij² E_{θ_j}[W_ij²]/N_j ≤ 1/N,  ∀i = 1, ..., M,
                {γ_ij} ∈ arg min_{γ_ij} { Σ_{j=1, N_j>0}^M γ_ij² E_{θ_j}[W_ij²]/N_j : Σ_{j=1, N_j>0}^M γ_ij = 1 },  ∀i = 1, ..., M.   (10)

The upper-level problem of (9) makes the sampling decision; it decides not only where to sample, i.e., the θ_j's with N_j > 0, but also how many replications to run at each scenario. The lower-level problem (10), defined for each i, makes the pooling decision to find γ_ij, j = 1, ..., M, that minimize the (approximate) variance of µ̃_i given {N_j}. Notice that (10) is separable for each i given {N_j}.

Any optimal solution to Problem (9) is also an optimal solution to Problem (8). By means of contradiction, suppose {N*_j, γ*_ij} is an optimal solution to (9) but is suboptimal to (8). This implies that there exists {N'_j, γ'_ij} ≠ {N*_j, γ*_ij} such that Σ_{j=1}^M N'_j < Σ_{j=1}^M N*_j. Now suppose we solve the lower-level problem (10) given {N'_j} for each i and obtain {γ''_ij}. Clearly, {N'_j, γ''_ij} is a feasible solution to (9), but the objective function value of {N'_j, γ''_ij} in (9) is Σ_{j=1}^M N'_j < Σ_{j=1}^M N*_j, which contradicts the premise that {N*_j, γ*_ij} is an optimal solution to (9).

Given {N_j}, (10) is a simple quadratic program that can be solved analytically via the Karush-Kuhn-Tucker (KKT) conditions. The Lagrangian function of the i-th lower-level problem is L(γ_ij, λ) = Σ_{j=1, N_j>0}^M γ_ij² E_{θ_j}[W_ij²]/N_j + λ(1 − Σ_{j=1, N_j>0}^M γ_ij). The corresponding KKT condition is 2γ_ij E_{θ_j}[W_ij²]/N_j − λ = 0 for all j = 1, ..., M such that N_j > 0, which implies γ_ij = λN_j/(2E_{θ_j}[W_ij²]). Therefore, the optimal γ*_ij is proportional to N_j/E_{θ_j}[W_ij²]. Considering the constraint Σ_{j=1, N_j>0}^M γ_ij = 1, we have

    γ*_ij = (N_j/E_{θ_j}[W_ij²]) / (Σ_{k=1}^M N_k/E_{θ_k}[W_ik²]),

for all j = 1, ..., M such that N_j > 0. Notice that the same expression produces γ*_ij = 0 when N_j = 0; thus the condition N_j > 0 can be dropped. Also notice that γ*_ij = 0 when E_{θ_j}[W_ij²] = ∞ even if N_j > 0; this is the case when θ_i does not pool from the inner replications at θ_j because V[µ̃_ij] is large. Consequently, the i-th lower-level problem's optimal objective is

    Σ_{j=1}^M (γ*_ij)² E_{θ_j}[W_ij²]/N_j = (Σ_{j=1}^M N_j/E_{θ_j}[W_ij²])^{-1}.   (11)

Notice that the condition N_j > 0 is dropped from the summation as γ*_ij = 0 for N_j = 0. Plugging (11) into the first constraint of (9), we obtain the following LP:

    min_{N_j ≥ 0}  Σ_{j=1}^M N_j   (12)
    subject to  Σ_{j=1}^M N_j/E_{θ_j}[W_ij²] ≥ N,  ∀i = 1, ..., M.

Note that (12) is always feasible; if N_j = N for all j, then all constraints are satisfied because E_{θ_j}[W_jj²] = 1. Solving (12) is easy even for large M and can be done prior to running any inner replications, provided that E_{θ_j}[W_ij²] is computable a priori (e.g., for the exponential family). Numerical studies in Section 6 show that the optimal objective function value of (12) tends to be orders of magnitude smaller than MN, which demonstrates that our experiment design significantly reduces the simulation budget compared to the standard nested simulation.

Let {c*_j} be an optimal solution of (12) when N = 1. Proposition 4.1 shows that one can solve (12) for any other target N by simply scaling {c*_j} by N. Therefore, even if the target precision, N, is changed post hoc, there is no need to re-solve (12). This proportionality property is also useful for showing asymptotic properties of the pooled estimator in Section 5.

Proposition 4.1.
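The LP (12) can be handed to an off-the-shelf solver. The sketch below is our illustration using scipy.optimize.linprog with equally spaced Gaussian outer scenarios, for which E_{θ_j}[W_ij²] = exp((θ_i − θ_j)²/σ²) is an assumed closed form; the scenario grid, M, N, and σ are all hypothetical choices.

```python
import numpy as np
from scipy.optimize import linprog

# Solve LP (12): min sum_j N_j  s.t.  sum_j N_j / E_{theta_j}[W_ij^2] >= N
# for all i, N_j >= 0, assuming Gaussian scenarios so that
# E_{theta_j}[W_ij^2] = exp((theta_i - theta_j)^2 / sigma^2) (illustrative model).
M, N, sigma = 20, 1000, 1.0
theta = np.linspace(-1.0, 1.0, M)                  # outer scenarios (illustrative)
D = np.exp((theta[:, None] - theta[None, :]) ** 2 / sigma**2)  # D[i,j] = E_{theta_j}[W_ij^2]

# linprog minimizes c @ x subject to A_ub @ x <= b_ub; flip signs to
# encode the ">=" coverage constraints of (12).
res = linprog(c=np.ones(M), A_ub=-1.0 / D, b_ub=-N * np.ones(M),
              bounds=[(0, None)] * M)
N_star = res.x                    # optimal sampling decision {N_j^*}
total_budget = N_star.sum()       # guaranteed <= M*N, since N_j = N is feasible
```

For this configuration the optimal budget is far below the standard MN = 20,000: a single scenario at θ = 0 with roughly 1000·e ≈ 2718 replications already covers every constraint, so the LP optimum can be no larger than that.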
Suppose outer scenarios θ_1, θ_2, ..., θ_M are given and N > 0 is a constant. Let {c_j} and {c*_j} be a feasible solution and an optimal solution of (12), respectively, when N = 1. Then, {N_j = N·c_j} and {N*_j = N·c*_j} are a feasible solution and an optimal solution of (12), respectively, when the target precision is N.

Proof. As {c_j} is feasible to (12) when N = 1, multiplying both sides of the inequality constraints of (12) by N shows that {N_j = N·c_j} is a feasible solution of the revised problem. To see that {N*_j = N·c*_j} is an optimal solution of the revised problem, suppose to the contrary that there exists a feasible solution {N'_j} of the revised problem such that Σ_{j=1}^M N'_j < Σ_{j=1}^M N*_j. Dividing both sides of the constraints by N, it is clear that {c'_j = N'_j/N} is a feasible solution of (12) when N = 1. However, by construction, Σ_{j=1}^M c'_j = (Σ_{j=1}^M N'_j)/N < (Σ_{j=1}^M N*_j)/N = Σ_{j=1}^M c*_j, which contradicts the premise that {c*_j} is an optimal solution of (12) when N = 1.

In light of Proposition 4.1, the optimal pooling weights corresponding to {N*_j} satisfy

    γ*_ij = (c*_j/E_{θ_j}[W_ij²]) / (Σ_{k=1}^M c*_k/E_{θ_k}[W_ik²]),  ∀j = 1, ..., M,   (13)

which implies that {γ*_ij} do not depend on N and only depend on the outer scenarios.

In the context of experiment design, the optimal solution (13) to the lower-level problem, its objective function value (11), and the constraints of (12) all have meaningful interpretations. Recall that the ESS of µ̃_ij is approximated by n^e_ij = N_j/E_{θ_j}[W_ij²]. For any outer scenario θ_i, (13) can be written as γ*_ij = n^e_ij/(Σ_{k=1}^M n^e_ik). This suggests that, given any sampling plan {N_j}, the optimal way to pool the estimators µ̃_ij, j = 1, ..., M, is to weight them proportionally to their ESS. Moreover, plugging the optimal objective (11) into (7), we have V[µ̃_i] ≈ V_{θ_i}[g(X)]/(Σ_{j=1}^M n^e_ij). This suggests that the ESS of the optimally weighted estimator µ̃_i is equal to the sum of the ESSs of µ̃_ij, j = 1, ..., M. Lastly, the constraint in (12) can be written as Σ_{j=1}^M n^e_ij ≥ N for i = 1, ..., M, i.e., the ESS of the pooled estimator µ̃_i should be no less than the target, N, for each i.

Once {N*_j} and {γ*_ij} are found, we run inner replications at {θ_j} as prescribed by {N*_j}. For each (i, j) pair such that γ*_ij > 0, we compute the self-normalized LR estimator µ̃_ij as defined in (2). The optimally pooled conditional mean estimators are then computed as µ̃*_i = Σ_{j=1}^M γ*_ij µ̃_ij, for all i = 1, ..., M. Then, the performance measure E[ζ(µ_i)] can be estimated by ζ̃ = Σ_{i=1}^M ζ(µ̃*_i)/M. The α-quantile of µ_i is estimated by the empirical quantile µ̃*_(⌈Mα⌉).

5 Asymptotic Analysis

In this section, we discuss asymptotic properties of {µ̃*_i} and the nested statistics computed from them as M and N increase without bound. Since feasibility of (12) is always guaranteed, there exist optimal {N*_j} and corresponding {γ*_ij} for any N and M. Thus, µ̃*_i is well-defined for any N and M. We begin with the following strong consistency result for µ̃*_i.

Theorem 5.1.
Suppose Assumption 2.1 holds. Given θ_1, θ_2, ..., θ_M, µ̃*_i → µ_i almost surely as N → ∞, for all i = 1, ..., M.

Proof. Recall from Proposition 4.1 that {γ*_ij} do not depend on N and are constants once θ_1, θ_2, ..., θ_M are given. From the definition of µ̃*_i,

    lim_{N→∞} µ̃*_i = lim_{N→∞} Σ_{j=1}^M γ*_ij µ̃_ij = Σ_{j=1}^M γ*_ij lim_{N_j→∞} µ̃_ij  →a.s.  Σ_{j=1}^M γ*_ij µ_i = µ_i Σ_{j=1}^M γ*_ij = µ_i,

where the second equality holds because N*_j ∝ N from Proposition 4.1 and γ*_ij = 0 whenever N*_j = 0. The almost sure convergence holds from Lemma 3.1.

Next, we examine the MSE convergence rate of µ̃*_i. Recall that Lemma 3.1 shows both the bias and the variance of µ̃_ij are O(N_j^{-1}) if the moment conditions in Assumption A.1 are satisfied. Because N*_j ∝ N, to show MSE[µ̃*_i] = O(N^{-1}), it suffices to have the result of Lemma 3.1 hold for each (θ_i, θ_j) pair such that γ*_ij > 0.

Assumption 5.1.
For each $\theta_i\in\Theta$, let $\Theta_i \equiv \{\theta_j\in\Theta \mid \mathrm{E}_{\theta_j}[W_{ij}^2] < \infty\}$. Then, there exists $C\in\mathbb{N}$ such that for any $N_j > C$, $\theta_i\in\Theta$, and $\theta_j\in\Theta_i$,
\[
N_j\big|\mathrm{E}_{\theta_j}[\tilde{\mu}_{ij}]-\mu_i\big| < \big|\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)]\big|+1 \quad\text{and}\quad N_j\,\mathrm{V}_{\theta_j}[\tilde{\mu}_{ij}] < \mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2]+1.
\]
Also, $\sup_{\theta_i\in\Theta}\sup_{\theta_j\in\Theta_i}\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)] < \infty$ and $\sup_{\theta_i\in\Theta}\sup_{\theta_j\in\Theta_i}\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2] < \infty$.

Additionally, we make a minor modification to the definition of $\{N_j^\star\}$:
\[
N_j^\star = \begin{cases} \delta N, & \text{if } 0 < c_j^\star < \delta, \\ c_j^\star N, & \text{otherwise}, \end{cases} \tag{14}
\]
where $\delta$ is a small positive constant. In words, (14) guarantees that if any replications are made at $\theta_j$, then $N_j^\star$ is at least a $\delta$ fraction of $N$. The set of outer scenarios we sample from does not grow, because $N_j^\star = 0$ if $c_j^\star = 0$. We emphasize that (14) has no practical impact for finite $M$, as $\delta$ can be chosen arbitrarily small. In the remainder of this section, we assume (14) is adopted. The following theorem establishes that for any sample of outer scenarios, $\mathrm{MSE}[\tilde{\mu}_i^\star] = O(N^{-1})$.

Theorem 5.2.
Suppose Assumptions 2.1 and 5.1 hold. Then, for any finite $M$,
\[
\sup_{\{\theta_1,\ldots,\theta_M\}\subset\Theta}\big|\mathrm{E}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M]-\mu_i\big| = O(N^{-1}) \quad\text{and}\quad \sup_{\{\theta_1,\ldots,\theta_M\}\subset\Theta}\mathrm{V}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M] = O(N^{-1})
\]
as $N\to\infty$. Moreover, the same statement holds when $M\to\infty$.

Proof. By construction, $\tilde{\mu}_i^\star$ only pools replications at $\theta_j\in\Theta_i$. For sufficiently large $N$,
\begin{align*}
\big|\mathrm{E}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M]-\mu_i\big| &\le \sum_{j=1}^{M}\gamma_{ij}^\star\big|\mathrm{E}_{\theta_j}[\tilde{\mu}_{ij}]-\mu_i\big| < \sum_{j=1,\,N_j^\star>0}^{M}\frac{\gamma_{ij}^\star}{N_j^\star}\big(\big|\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)]\big|+1\big) \\
&\le \sum_{j=1,\,N_j^\star>0}^{M}\frac{\gamma_{ij}^\star}{\delta N}\sup_{\theta_j\in\Theta_i}\big(\big|\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)]\big|+1\big) = \frac{\sup_{\theta_j\in\Theta_i}\big(\big|\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)]\big|+1\big)}{\delta N},
\end{align*}
where the second and third inequalities follow from Assumption 5.1 and (14), and the last equality holds since $\sum_{j=1,\,N_j^\star>0}^{M}\gamma_{ij}^\star = 1$. Because $\sup_{\theta_i\in\Theta}\sup_{\theta_j\in\Theta_i}\big(\big|\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)]\big|+1\big)$ is bounded by Assumption 5.1, we conclude $\sup_{\{\theta_1,\ldots,\theta_M\}\subset\Theta}\big|\mathrm{E}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M]-\mu_i\big| = O(N^{-1})$ for finite $M$ as well as when $M\to\infty$.

For the variance, as all inner replications are simulated independently,
\begin{align*}
\mathrm{V}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M] &= \sum_{j=1}^{M}(\gamma_{ij}^\star)^2\,\mathrm{V}_{\theta_j}[\tilde{\mu}_{ij}] < \sum_{j=1,\,N_j^\star>0}^{M}\frac{(\gamma_{ij}^\star)^2}{N_j^\star}\big(\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2]+1\big) \\
&\le \sum_{j=1,\,N_j^\star>0}^{M}\frac{\gamma_{ij}^\star}{\delta N}\sup_{\theta_j\in\Theta_i}\big(\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2]+1\big) \le \frac{\sup_{\theta_j\in\Theta_i}\big(\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2]+1\big)}{\delta N}
\end{align*}
for sufficiently large $N$, where the second inequality follows from Assumption 5.1 and (14) and the third holds since $0\le\gamma_{ij}^\star\le 1$. Because $\sup_{\theta_i\in\Theta}\sup_{\theta_j\in\Theta_i}\big(\mathrm{E}_{\theta_j}[W_{ij}^2(g(X)-\mu_i)^2]+1\big) < \infty$ by Assumption 5.1, we conclude $\sup_{\{\theta_1,\ldots,\theta_M\}\subset\Theta}\mathrm{V}[\tilde{\mu}_i^\star\,|\,\theta_1,\ldots,\theta_M] = O(N^{-1})$ for finite $M$ as well as when $M\to\infty$. □

In the following subsections, we analyze asymptotic properties of the two classes of nested simulation statistics of our interest. For $\tilde{\zeta} = \sum_{i=1}^{M}\zeta(\tilde{\mu}_i^\star)/M$, we show that its bias and variance converge in $O(N^{-1})$ and $O(N^{-1})+O(M^{-1})$, respectively, when $\zeta$ is an indicator, a hockey-stick, or a smooth function with bounded second derivative. Sections 5.1–5.3 present different assumptions and proofs for each choice of $\zeta$. To contrast with the standard nested simulation, recall that Gordy and Juneja (2010) show that the bias and variance of $\sum_{i=1}^{M}\zeta(\bar{\mu}_i)/M$ converge in $O(N^{-1})$ and $O(M^{-1})$, respectively, when $\zeta$ is an indicator or hockey-stick function. The additional $O(N^{-1})$ term in our variance is the price we pay for pooling the same inner replications across all outer scenarios, which introduces correlations among $\tilde{\mu}_1^\star, \tilde{\mu}_2^\star, \ldots, \tilde{\mu}_M^\star$. In Section 5.4, we show that the empirical quantile $\tilde{\mu}^\star_{(\lceil M\alpha\rceil)}$ converges to the true $\alpha$-quantile of $\mu(\theta)$ in $O_p(M^{-1/2})+O_p(N^{-1/2})$; the corresponding analysis by Gordy and Juneja (2010) implies that $\bar{\mu}_{(\lceil M\alpha\rceil)}$ converges in $O_p(M^{-1/2})+O_p(N^{-1})$. Again, the difference is caused by pooling the inner replications.

To summarize, a consistent finding from these analyses is that choosing $N = O(M)$ for our experiment design gives the best convergence rate of the nested statistics. We emphasize that $N$ in our scheme is merely the target sample size, not the actual number of inner replications at each scenario.

Suppose $\zeta(\mu_i) = I(\mu_i\le\xi)$ for some $\xi\in\mathbb{R}$. Let $\Phi$ be the cdf of $\mu_i$. By definition, $\mathrm{E}[\zeta(\mu_i)] = \Phi(\xi)$.
Thus, we denote the corresponding estimator $\tilde{\zeta} \equiv M^{-1}\sum_{i=1}^{M} I(\tilde{\mu}_i^\star \le \xi)$ by $\Phi_{M,N}(\xi)$, where $\Phi_{M,N}(\cdot)$ is the empirical cdf (ecdf) constructed from $\tilde{\mu}_1^\star, \tilde{\mu}_2^\star, \ldots, \tilde{\mu}_M^\star$.

For ease of exposition, let $\epsilon_i \equiv \sqrt{N}(\tilde{\mu}_i^\star - \mu_i)$ denote the scaled estimation error of $\tilde{\mu}_i^\star$, so that its limiting distribution is not degenerate as $N\to\infty$. From Theorem 5.2, $\mathrm{E}[\epsilon_i\,|\,\theta_i] = \mathrm{E}[\mathrm{E}[\epsilon_i\,|\,\theta_1,\ldots,\theta_M]\,|\,\theta_i] = O(N^{-1/2})$ uniformly for all (fixed) $\theta_i\in\Theta$. Similarly, $\mathrm{V}[\epsilon_i\,|\,\theta_i] = O(1)$ uniformly for all $\theta_i\in\Theta$. In the following, we denote the joint distribution of $\mu_i$ and $\epsilon_i$ by $f_i(\mu,\epsilon)$, where $\theta_i$ is an arbitrary scenario in $\{\theta_1,\ldots,\theta_M\}$. Similarly, $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ refers to the joint distribution of $\mu_i$, $\mu_j$, $\epsilon_i$, and $\epsilon_j$ for arbitrary $\theta_i$ and $\theta_j$ among $\{\theta_1,\ldots,\theta_M\}$. We make Assumption 5.2 below, which facilitates the following Theorem 5.3 on the MSE convergence rate of $\Phi_{M,N}(\xi)$.

Assumption 5.2.
The cdf of $\mu_i$, $\Phi$, is absolutely continuous with continuous pdf $\phi$. For any outer scenario $\theta_i\in\{\theta_1,\ldots,\theta_M\}$, $f_i(\mu,\epsilon)$ is differentiable with respect to $\mu$ for each $M$ and $N$. Moreover, there exist functions $p_{s,M,N}(\epsilon)$, $s=0,1$, such that $f_i(\mu,\epsilon) \le p_{0,M,N}(\epsilon)$ and $\big|\frac{\partial f_i(\mu,\epsilon)}{\partial\mu}\big| \le p_{1,M,N}(\epsilon)$ for all $\mu$ and for each $M$ and $N$, and
\[
\sup_M\sup_N\int_{-\infty}^{\infty}|\epsilon|^k\,p_{s,M,N}(\epsilon)\,d\epsilon < \infty
\]
for $s=0,1$ and $0\le k\le 2$. Similarly, for any outer scenarios $\theta_i\ne\theta_j$, $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ is differentiable with respect to $\mu_i$ and $\mu_j$ for each $M$ and $N$. There exist functions $p_{s,M,N}(\epsilon_i,\epsilon_j)$, $s=0,1$, such that $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j) \le p_{0,M,N}(\epsilon_i,\epsilon_j)$ and $\big|\frac{\partial f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)}{\partial\mu_i}\big|,\ \big|\frac{\partial f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)}{\partial\mu_j}\big| \le p_{1,M,N}(\epsilon_i,\epsilon_j)$ for all $\mu_i$, $\mu_j$, $i\ne j$, and for each $M$ and $N$, and
\[
\sup_M\sup_N\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}|\epsilon_i|^{k_i}|\epsilon_j|^{k_j}\,p_{s,M,N}(\epsilon_i,\epsilon_j)\,d\epsilon_i\,d\epsilon_j < \infty
\]
for $s=0,1$, $0\le k_i,k_j\le 2$, and $k_i+k_j\le 3$.

Theorem 5.3.
Under Assumptions 2.1–5.2,
\[
\mathrm{E}[\Phi_{M,N}(\xi)] - \Phi(\xi) = O(N^{-1}) \quad\text{and}\quad \mathrm{V}[\Phi_{M,N}(\xi)] = O(N^{-1}) + O(M^{-1}).
\]

Proof. Let us define $\Phi_M$ as the ecdf constructed from $\mu_1,\mu_2,\ldots,\mu_M$ given the same outer scenarios as $\Phi_{M,N}$. Because $\Phi_M$ is an unbiased estimator of $\Phi$, $\mathrm{E}[\Phi_{M,N}(\xi)] - \Phi(\xi) = \mathrm{E}[\Phi_{M,N}(\xi) - \Phi_M(\xi)]$. From the definitions, $\Phi_{M,N}(\xi) - \Phi_M(\xi) = M^{-1}\sum_{i=1}^{M}\big(I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)\big)$. For each $i$,
\[
\mathrm{E}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\xi-\epsilon/\sqrt{N}} f_i(\mu,\epsilon)\,d\mu\,d\epsilon - \int_{-\infty}^{\infty}\int_{-\infty}^{\xi} f_i(\mu,\epsilon)\,d\mu\,d\epsilon = -\int_{-\infty}^{\infty}\int_{\xi-\epsilon/\sqrt{N}}^{\xi} f_i(\mu,\epsilon)\,d\mu\,d\epsilon. \tag{15}
\]
Under Assumption 5.2, the first-order Taylor series expansion of $f_i(\mu,\epsilon)$ at $\mu\in[\xi-\epsilon/\sqrt{N},\xi]$ is
\[
f_i(\mu,\epsilon) = f_i(\xi,\epsilon) + \frac{\partial f_i(\check{\mu},\epsilon)}{\partial\mu}(\mu-\xi), \tag{16}
\]
where $\check{\mu}\in(\mu,\xi)$. From Assumption 5.2,
\[
\frac{\epsilon}{\sqrt{N}} f_i(\xi,\epsilon) - \frac{\epsilon^2}{2N}\,p_{1,M,N}(\epsilon) \;\le\; \int_{\xi-\epsilon/\sqrt{N}}^{\xi} f_i(\mu,\epsilon)\,d\mu \;\le\; \frac{\epsilon}{\sqrt{N}} f_i(\xi,\epsilon) + \frac{\epsilon^2}{2N}\,p_{1,M,N}(\epsilon) \tag{17}
\]
for all $M$ and $N$. Thus, from (15), integrating all three sides of (17) with respect to $\epsilon\in(-\infty,\infty)$ gives upper and lower bounds for $\mathrm{E}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)]$. Note that $f_i(\xi,\epsilon) = f_i(\epsilon\,|\,\xi)\,\phi(\xi)$, where $f_i(\epsilon\,|\,\xi)$ is the conditional pdf of $\epsilon_i$ given $\mu_i=\xi$. Thus, $\int_{-\infty}^{\infty} f_i(\xi,\epsilon)\frac{\epsilon}{\sqrt{N}}\,d\epsilon = \phi(\xi)\,\mathrm{E}\big[\frac{\epsilon_i}{\sqrt{N}}\,\big|\,\mu_i=\xi\big] = \phi(\xi)\,\mathrm{E}[\tilde{\mu}_i^\star-\mu_i\,|\,\mu_i=\xi]$, where $\mathrm{E}[\tilde{\mu}_i^\star-\mu_i\,|\,\mu_i] = O(N^{-1})$ uniformly for all $\mu_i$ from Theorem 5.2. Also, Assumption 5.2 guarantees that $\int_{-\infty}^{\infty}\epsilon^2\,p_{1,M,N}(\epsilon)\,d\epsilon$ is bounded. Therefore, $\mathrm{E}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)] = O(N^{-1})$ for each $i$, which in turn implies $\mathrm{E}[\Phi_{M,N}(\xi) - \Phi_M(\xi)] = O(N^{-1})$.
Next, noticing $\Phi_{M,N}(\xi) = \Phi_{M,N}(\xi) - \Phi_M(\xi) + \Phi_M(\xi)$, $\mathrm{V}[\Phi_{M,N}(\xi)]$ can be written as
\[
\mathrm{V}[\Phi_{M,N}(\xi)] = \mathrm{V}[\Phi_{M,N}(\xi) - \Phi_M(\xi)] + \mathrm{V}[\Phi_M(\xi)] + 2\,\mathrm{Cov}[\Phi_{M,N}(\xi) - \Phi_M(\xi),\, \Phi_M(\xi)]. \tag{18}
\]
Clearly, $\mathrm{V}[\Phi_M(\xi)] = O(M^{-1})$. In the following, we show $\mathrm{V}[\Phi_{M,N}(\xi) - \Phi_M(\xi)] = O(M^{-1}) + O(N^{-1})$. Because the $O(N^{-1})$ term only appears in the expression for $\mathrm{V}[\Phi_{M,N}(\xi) - \Phi_M(\xi)]$, not in $\mathrm{V}[\Phi_M(\xi)]$, the covariance term in (18) does not cancel the $O(N^{-1})$ term in general. Thus, we may ignore the covariance term as far as the convergence rate of $\mathrm{V}[\Phi_{M,N}(\xi)]$ is concerned. Note that $\mathrm{V}[\Phi_{M,N}(\xi) - \Phi_M(\xi)]$ can be expanded as
\[
\frac{1}{M^2}\sum_{i=1}^{M}\mathrm{V}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)] + \frac{1}{M^2}\sum_{i=1}^{M}\sum_{\substack{j=1\\ j\ne i}}^{M}\mathrm{Cov}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi),\, I(\tilde{\mu}_j^\star\le\xi) - I(\mu_j\le\xi)],
\]
where the first summation is $O(M^{-1})$ and the pairwise covariance term can be rewritten as
\[
\mathrm{E}\big[\big(I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)\big)\big(I(\tilde{\mu}_j^\star\le\xi) - I(\mu_j\le\xi)\big)\big] - \mathrm{E}\big[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)\big]\,\mathrm{E}\big[I(\tilde{\mu}_j^\star\le\xi) - I(\mu_j\le\xi)\big].
\]
(19)

The first expectation in (19) is equal to
\begin{align*}
&\Pr\{\tilde{\mu}_i^\star\le\xi,\, \tilde{\mu}_j^\star\le\xi\} - \Pr\{\mu_i\le\xi,\, \tilde{\mu}_j^\star\le\xi\} + \Pr\{\mu_i\le\xi,\, \mu_j\le\xi\} - \Pr\{\tilde{\mu}_i^\star\le\xi,\, \mu_j\le\xi\} \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\bigg[\int_{-\infty}^{\xi-\epsilon_i/\sqrt{N}}\int_{-\infty}^{\xi-\epsilon_j/\sqrt{N}} - \int_{-\infty}^{\xi}\int_{-\infty}^{\xi-\epsilon_j/\sqrt{N}} + \int_{-\infty}^{\xi}\int_{-\infty}^{\xi} - \int_{-\infty}^{\xi-\epsilon_i/\sqrt{N}}\int_{-\infty}^{\xi}\bigg] f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i \\
&= -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\xi}\int_{-\infty}^{\xi-\epsilon_j/\sqrt{N}} f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\xi}\int_{-\infty}^{\xi} f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\xi}\int_{\xi-\epsilon_j/\sqrt{N}}^{\xi} f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i.
\end{align*}
Applying the first-order Taylor series expansion to $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ gives
\[
f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j) = f_{ij}(\xi,\xi,\epsilon_i,\epsilon_j) + \frac{\partial f_{ij}(\check{\mu}_i,\check{\mu}_j,\epsilon_i,\epsilon_j)}{\partial\mu_i}(\mu_i-\xi) + \frac{\partial f_{ij}(\check{\mu}_i,\check{\mu}_j,\epsilon_i,\epsilon_j)}{\partial\mu_j}(\mu_j-\xi) \tag{20}
\]
for some $\check{\mu}_i\in(\mu_i,\xi)$ and $\check{\mu}_j\in(\mu_j,\xi)$.
Under Assumption 5.2, the integral of (20) with respect to $\mu_i\in[\xi-\epsilon_i/\sqrt{N},\xi]$ and $\mu_j\in[\xi-\epsilon_j/\sqrt{N},\xi]$ is bounded from below and above by
\[
\frac{\epsilon_i\epsilon_j}{N}\,f_{ij}(\xi,\xi,\epsilon_i,\epsilon_j) \mp \frac{|\epsilon_i\epsilon_j^2+\epsilon_j\epsilon_i^2|}{2N^{3/2}}\,p_{1,M,N}(\epsilon_i,\epsilon_j). \tag{21}
\]
Integrating (21) once again with respect to $\epsilon_i\in(-\infty,\infty)$ and $\epsilon_j\in(-\infty,\infty)$, we have
\[
\mathrm{E}\big[\big(I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)\big)\big(I(\tilde{\mu}_j^\star\le\xi) - I(\mu_j\le\xi)\big)\big] = O(N^{-1}).
\]
Since it is already shown that $\mathrm{E}\big[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi)\big] = O(N^{-1})$, we conclude $\mathrm{Cov}[I(\tilde{\mu}_i^\star\le\xi) - I(\mu_i\le\xi),\, I(\tilde{\mu}_j^\star\le\xi) - I(\mu_j\le\xi)] = O(N^{-1})$ from (19), which in turn implies $\mathrm{V}[\Phi_{M,N}(\xi) - \Phi_M(\xi)] = O(M^{-1}) + O(N^{-1})$ and hence, from (18), $\mathrm{V}[\Phi_{M,N}(\xi)] = O(M^{-1}) + O(N^{-1})$. □

Next, we analyze the case when $\zeta$ is a hockey-stick function, i.e., $\zeta(\mu_i) = \max\{\mu_i-\xi,\,0\} = (\mu_i-\xi)I(\mu_i>\xi)$ for some $\xi\in\mathbb{R}$, which requires the following additional moment conditions.

Assumption 5.3.
For $p_{1,M,N}(\epsilon)$ defined in Assumption 5.2, $\sup_M\sup_N\int_{-\infty}^{\infty}|\epsilon|^3\,p_{1,M,N}(\epsilon)\,d\epsilon < \infty$. Similarly, for $p_{s,M,N}(\epsilon_i,\epsilon_j)$ defined in Assumption 5.2,
\[
\sup_M\sup_N\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}|\epsilon_i|^{k_i}|\epsilon_j|^{k_j}\,p_{s,M,N}(\epsilon_i,\epsilon_j)\,d\epsilon_i\,d\epsilon_j < \infty
\]
for $s=0,1$, $0\le k_i,k_j\le 3$, and $k_i+k_j\le 5$. Also, for each $M$ and $N$, there exist functions $q_{s,M,N}(\mu_i,\epsilon_i,\epsilon_j)$, $s=0,1$, such that $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j) \le q_{0,M,N}(\mu_i,\epsilon_i,\epsilon_j)$ and $\big|\frac{\partial f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)}{\partial\mu_j}\big| \le q_{1,M,N}(\mu_i,\epsilon_i,\epsilon_j)$ for all $\mu_i$, $\mu_j$, $\epsilon_i$, and $\epsilon_j$, and
\[
\sup_M\sup_N\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi}^{\infty}|\mu_i|^{k_m}|\epsilon_i|^{k_i}|\epsilon_j|^{k_j}\,q_{s,M,N}(\mu_i,\epsilon_i,\epsilon_j)\,d\mu_i\,d\epsilon_i\,d\epsilon_j < \infty
\]
for any $\xi\in\mathbb{R}$, $s=0,1$, $0\le k_m\le 1$, and $0\le k_i,k_j\le 3$, $k_i+k_j\le 4$.

Theorem 5.4 shows that the hockey-stick function results in the same bias and variance convergence rates as those of the indicator function.
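Before the formal statement, the indicator and hockey-stick statistics built on pooled LR estimators can be made concrete with a minimal numerical sketch. Everything below is an illustrative assumption rather than the paper's setting: a toy Gaussian model $X\sim N(\theta,1)$ with $g(X)=X$ (so $\mu(\theta)=\theta$), equal pooling weights $\gamma_{ij}=1/M$ in place of the optimal $\gamma_{ij}^\star$ from (12), and an arbitrary threshold $\xi=0.25$.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_normalized_lr(x, theta_i, theta_j):
    """Self-normalized LR estimate of mu(theta_i) = E_{theta_i}[X]
    from replications x drawn under N(theta_j, 1)."""
    # log-likelihood ratio of N(theta_i, 1) over N(theta_j, 1)
    logw = -0.5 * (x - theta_i) ** 2 + 0.5 * (x - theta_j) ** 2
    w = np.exp(logw - logw.max())  # shift before exponentiating for stability
    return np.sum(w * x) / np.sum(w)

thetas = np.array([-0.5, 0.0, 0.5])   # outer scenarios; mu(theta) = theta here
N = 200_000                           # inner replications per scenario
samples = [rng.normal(th, 1.0, N) for th in thetas]

# Pool every scenario's replications with equal weights gamma_ij = 1/M
# (a stand-in for the optimal pooling weights of the paper's design).
M = len(thetas)
mu_star = np.array([
    sum((1.0 / M) * self_normalized_lr(samples[j], thetas[i], thetas[j])
        for j in range(M))
    for i in range(M)
])

xi = 0.25
zeta_indicator = np.mean(mu_star <= xi)               # estimates Pr(mu <= xi)
zeta_hockey = np.mean(np.maximum(mu_star - xi, 0.0))  # estimates E[(mu - xi)^+]
```

With these scenarios the pooled estimates recover $\mu(\theta_i)=\theta_i$ closely, so the indicator statistic evaluates to $2/3$ and the hockey-stick statistic to roughly $(0.5-0.25)/3$; the point of the sketch is only the mechanics of pooling replications across scenarios through likelihood ratios.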
Theorem 5.4.
Suppose $\zeta(\mu_i) = (\mu_i-\xi)I(\mu_i>\xi)$ for some constant $\xi\in\mathbb{R}$ and $\tilde{\zeta} = \sum_{i=1}^{M}\zeta(\tilde{\mu}_i^\star)/M$. Under Assumptions 2.1–5.3, $\mathrm{E}[\tilde{\zeta}-\zeta(\mu_i)] = O(N^{-1})$ and $\mathrm{V}[\tilde{\zeta}] = O(M^{-1})+O(N^{-1})$.

Proof. From the definition of $\tilde{\zeta}$, we have
\[
\mathrm{E}[\tilde{\zeta}-\zeta(\mu_i)] = \mathrm{E}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi) - (\mu_i-\xi)I(\mu_i>\xi)] = \int_{-\infty}^{\infty}\int_{\xi-\epsilon/\sqrt{N}}^{\infty}\Big(\mu+\frac{\epsilon}{\sqrt{N}}-\xi\Big) f_i(\mu,\epsilon)\,d\mu\,d\epsilon - \int_{-\infty}^{\infty}\int_{\xi}^{\infty}(\mu-\xi) f_i(\mu,\epsilon)\,d\mu\,d\epsilon. \tag{22}
\]
Note that
\[
\int_{-\infty}^{\infty}\int_{\xi}^{\infty}\frac{\epsilon}{\sqrt{N}} f_i(\mu,\epsilon)\,d\mu\,d\epsilon = \int_{\xi}^{\infty}\phi(\mu)\,\mathrm{E}\Big[\frac{\epsilon}{\sqrt{N}}\,\Big|\,\mu_i=\mu\Big]\,d\mu = O(N^{-1}), \tag{23}
\]
where the last equality holds because $\mathrm{E}[\epsilon/\sqrt{N}\,|\,\mu_i=\mu] = O(N^{-1})$ uniformly for all $\mu$, as shown in the proof of Theorem 5.2. Adding and subtracting (23) on the right-hand side of (22) gives $(22) = \int_{-\infty}^{\infty}\int_{\xi-\epsilon/\sqrt{N}}^{\xi}\big(\mu+\frac{\epsilon}{\sqrt{N}}-\xi\big) f_i(\mu,\epsilon)\,d\mu\,d\epsilon + O(N^{-1})$. From the Taylor series expansion of $f_i(\mu,\epsilon)$ in (16),
\[
\int_{\xi-\epsilon/\sqrt{N}}^{\xi}\Big(\mu+\frac{\epsilon}{\sqrt{N}}-\xi\Big) f_i(\mu,\epsilon)\,d\mu = \int_{\xi-\epsilon/\sqrt{N}}^{\xi}\Big(\mu+\frac{\epsilon}{\sqrt{N}}-\xi\Big)\Big\{f_i(\xi,\epsilon)+\frac{\partial f_i(\check{\mu},\epsilon)}{\partial\mu}(\mu-\xi)\Big\}\,d\mu,
\]
which is bounded from below and above by $\frac{\epsilon^2}{2N} f_i(\xi,\epsilon) \mp \frac{|\epsilon|^3}{6N^{3/2}}\,p_{1,M,N}(\epsilon)$ under Assumption 5.3. Integrating these bounds once again with respect to $\epsilon\in(-\infty,\infty)$, we have
\[
\mathrm{E}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi) - (\mu_i-\xi)I(\mu_i>\xi)] = O(N^{-1}). \tag{24}
\]

The variance of $\tilde{\zeta}$ can be expanded as
\[
\frac{1}{M^2}\sum_{i=1}^{M}\mathrm{V}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi)] + \frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1,\,j\ne i}^{M}\mathrm{Cov}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi),\, (\tilde{\mu}_j^\star-\xi)I(\tilde{\mu}_j^\star>\xi)], \tag{25}
\]
where the first term is $O(M^{-1})$. Because $\mu_i$ and $\mu_j$ for arbitrary $i\ne j$ are independent, $\mathrm{Cov}[(\mu_i-\xi)I(\mu_i>\xi),\, (\mu_j-\xi)I(\mu_j>\xi)] = 0$. Therefore, the covariance term in (25) is equal to
\begin{align}
&\mathrm{Cov}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi),\, (\tilde{\mu}_j^\star-\xi)I(\tilde{\mu}_j^\star>\xi)] - \mathrm{Cov}[(\mu_i-\xi)I(\mu_i>\xi),\, (\mu_j-\xi)I(\mu_j>\xi)] \nonumber\\
&= \mathrm{E}[(\tilde{\mu}_i^\star-\xi)(\tilde{\mu}_j^\star-\xi)I(\tilde{\mu}_i^\star>\xi,\, \tilde{\mu}_j^\star>\xi)] - \mathrm{E}[(\mu_i-\xi)(\mu_j-\xi)I(\mu_i>\xi,\, \mu_j>\xi)] \tag{26}\\
&\quad + \mathrm{E}[(\mu_i-\xi)I(\mu_i>\xi)]\,\mathrm{E}[(\mu_j-\xi)I(\mu_j>\xi)] - \mathrm{E}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi)]\,\mathrm{E}[(\tilde{\mu}_j^\star-\xi)I(\tilde{\mu}_j^\star>\xi)]. \tag{27}
\end{align}
From (24), $(27) = O(N^{-1})$.
We rewrite (26) as
\begin{align}
&\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\infty}\int_{\xi-\epsilon_j/\sqrt{N}}^{\infty}\Big(\mu_i+\frac{\epsilon_i}{\sqrt{N}}-\xi\Big)\Big(\mu_j+\frac{\epsilon_j}{\sqrt{N}}-\xi\Big) f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i \nonumber\\
&\qquad - \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi}^{\infty}\int_{\xi}^{\infty}(\mu_i-\xi)(\mu_j-\xi) f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i \nonumber\\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\xi}\int_{\xi-\epsilon_j/\sqrt{N}}^{\xi}(\mu_i-\xi)(\mu_j-\xi) f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i \tag{28}\\
&\quad - \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{\xi-\epsilon_i/\sqrt{N}}^{\infty}\int_{\xi-\epsilon_j/\sqrt{N}}^{\infty}\Big\{(\mu_i-\xi)\frac{\epsilon_j}{\sqrt{N}} + (\mu_j-\xi)\frac{\epsilon_i}{\sqrt{N}} + \frac{\epsilon_i\epsilon_j}{N}\Big\} f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)\,d\mu_j\,d\mu_i\,d\epsilon_j\,d\epsilon_i. \tag{29}
\end{align}
Using the Taylor expansion in (20), the two inner integrals of (28) can be bounded from above and below by $\frac{\epsilon_i^2\epsilon_j^2}{4N^2}\,f_{ij}(\xi,\xi,\epsilon_i,\epsilon_j) \pm \frac{|\epsilon_i^3\epsilon_j^2|+|\epsilon_i^2\epsilon_j^3|}{6N^{5/2}}\,p_{1,M,N}(\epsilon_i,\epsilon_j)$, which yields $O(N^{-2})$ when integrated with respect to $\epsilon_i$ and $\epsilon_j$. Showing $(29) = O(N^{-1})$ is rather tedious; we first partition the integration ranges for $\mu_i$ and $\mu_j$ as: (i) $\mu_i\in[\xi-\epsilon_i/\sqrt{N},\xi]$, $\mu_j\in[\xi-\epsilon_j/\sqrt{N},\xi]$; (ii) $\mu_i\in[\xi,\infty)$, $\mu_j\in[\xi,\infty)$; (iii) $\mu_i\in[\xi-\epsilon_i/\sqrt{N},\xi]$, $\mu_j\in[\xi,\infty)$; and (iv) $\mu_i\in[\xi,\infty)$, $\mu_j\in[\xi-\epsilon_j/\sqrt{N},\xi]$.

Part (i).
Plugging the Taylor series expansion (20) into $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ and computing the two inner integrals of (29) for Part (i), we obtain the lower and upper bounds $\frac{\epsilon_i^2\epsilon_j^2}{N^2}\,f_{ij}(\xi,\xi,\epsilon_i,\epsilon_j) \pm \frac{|\epsilon_i^3\epsilon_j^2|+|\epsilon_i^2\epsilon_j^3|}{N^{5/2}}\,p_{1,M,N}(\epsilon_i,\epsilon_j)$, which yield $O(N^{-2})$ when integrated with respect to $\epsilon_i$ and $\epsilon_j$.

Part (ii).
We can change the order of integration because the ranges of $\mu_i$ and $\mu_j$ no longer depend on $\epsilon_i$ and $\epsilon_j$. Thus, Part (ii) can be rewritten as
\[
\int_{\xi}^{\infty}\int_{\xi}^{\infty}\Big\{(\mu_i-\xi)\,\mathrm{E}\Big[\frac{\epsilon_j}{\sqrt{N}}\,\Big|\,\mu_i,\mu_j\Big] + (\mu_j-\xi)\,\mathrm{E}\Big[\frac{\epsilon_i}{\sqrt{N}}\,\Big|\,\mu_i,\mu_j\Big] + \mathrm{E}\Big[\frac{\epsilon_i\epsilon_j}{N}\,\Big|\,\mu_i,\mu_j\Big]\Big\}\,\phi(\mu_i)\phi(\mu_j)\,d\mu_i\,d\mu_j.
\]
Because $\mathrm{E}[\epsilon_j/\sqrt{N}\,|\,\mu_i,\mu_j] = O(N^{-1})$ and $\mathrm{E}[\epsilon_i\epsilon_j/N\,|\,\mu_i,\mu_j] = O(N^{-1})$ for all $\mu_i$ and $\mu_j$, Part (ii) $= O(N^{-1})$.

Parts (iii) and (iv).
Because Parts (iii) and (iv) are symmetric, it suffices to bound the latter. Applying the first-order Taylor expansion to $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ with respect to $\mu_j\in[\xi-\epsilon_j/\sqrt{N},\xi]$ gives $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j) = f_{ij}(\mu_i,\xi,\epsilon_i,\epsilon_j) + \frac{\partial f_{ij}(\mu_i,\check{\mu}_j,\epsilon_i,\epsilon_j)}{\partial\mu_j}(\mu_j-\xi)$ for some $\check{\mu}_j\in(\mu_j,\xi)$. Substituting this expansion for $f_{ij}(\mu_i,\mu_j,\epsilon_i,\epsilon_j)$ and integrating with respect to $\mu_j\in[\xi-\epsilon_j/\sqrt{N},\xi]$ yields the lower and upper bounds
\[
\Big\{\frac{\epsilon_j^2(\mu_i-\xi)}{N} + \frac{3\epsilon_i\epsilon_j^2}{2N^{3/2}}\Big\}\, f_{ij}(\mu_i,\xi,\epsilon_i,\epsilon_j) \pm \Big\{\frac{|\epsilon_j^3(\mu_i-\xi)|}{2N^{3/2}} + \frac{|\epsilon_i\epsilon_j^3|}{N^2}\Big\}\, q_{1,M,N}(\mu_i,\epsilon_i,\epsilon_j). \tag{30}
\]
Integrating (30) with respect to $\mu_i\in[\xi,\infty)$, $\epsilon_i\in(-\infty,\infty)$, and $\epsilon_j\in(-\infty,\infty)$, we conclude Part (iv) $= O(N^{-1})$. Combining Parts (i)–(iv) with (28) and (27), $\mathrm{Cov}[(\tilde{\mu}_i^\star-\xi)I(\tilde{\mu}_i^\star>\xi),\, (\tilde{\mu}_j^\star-\xi)I(\tilde{\mu}_j^\star>\xi)] = O(N^{-1})$. Therefore, $\mathrm{V}[\tilde{\zeta}] = O(M^{-1}) + O(N^{-1})$. □

Next, we analyze the case when $\zeta$ is a smooth function of $\mu_i$ that satisfies the following assumption.

Assumption 5.4.
The continuous function $\zeta:\mathbb{R}\to\mathbb{R}$ is twice differentiable everywhere with bounded second derivative $\zeta''$. Also, $\mathrm{E}[(\zeta(\mu_i))^2]$, $\mathrm{E}[(\zeta'(\mu_i)\epsilon_i)^2]$, and $\mathrm{E}[\epsilon_i^2]$ are bounded.

Similar to the indicator and hockey-stick functions, the following theorem shows that the MSE convergence rate of the estimator of $\mathrm{E}[\zeta(\mu_i)]$ is $O(M^{-1}) + O(N^{-1})$ for $\zeta$ satisfying Assumption 5.4.

Theorem 5.5.
Suppose $\zeta$ satisfies Assumption 5.4 and $\tilde{\zeta} = \sum_{i=1}^{M}\zeta(\tilde{\mu}_i^\star)/M$. Under Assumptions 2.1 and 5.1, $\mathrm{E}[\tilde{\zeta}-\zeta(\mu_i)] = O(N^{-1})$ and $\mathrm{V}[\tilde{\zeta}] = O(M^{-1})+O(N^{-1})$.

Proof. From the definition, $\mathrm{E}[\tilde{\zeta}] = \sum_{i=1}^{M}\mathrm{E}[\zeta(\tilde{\mu}_i^\star)]/M$. To obtain $\mathrm{E}[\zeta(\tilde{\mu}_i^\star)]$, we apply the Taylor series expansion
\[
\zeta(\tilde{\mu}_i^\star) = \zeta\Big(\mu_i + \frac{\epsilon_i}{\sqrt{N}}\Big) = \zeta(\mu_i) + \zeta'(\mu_i)\frac{\epsilon_i}{\sqrt{N}} + \frac{\zeta''(\check{\mu}_i)}{2}\frac{\epsilon_i^2}{N},
\]
where $\check{\mu}_i$ lies between $\tilde{\mu}_i^\star$ and $\mu_i$. Therefore, $\mathrm{E}[\zeta(\tilde{\mu}_i^\star)] - \mathrm{E}[\zeta(\mu_i)] = \mathrm{E}\big[\zeta'(\mu_i)\,\mathrm{E}\big[\frac{\epsilon_i}{\sqrt{N}}\,\big|\,\mu_i\big]\big] + \mathrm{E}\big[\zeta''(\check{\mu}_i)\frac{\epsilon_i^2}{2N}\big]$. Recall that $\mathrm{E}\big[\frac{\epsilon_i}{\sqrt{N}}\,\big|\,\mu_i\big] = O(N^{-1})$ uniformly for all $\mu_i$. Because $\zeta'(\mu_i)$ does not depend on $N$ and $\mathrm{E}[\zeta'(\mu_i)\epsilon_i]$ is bounded by Assumption 5.4, $\mathrm{E}\big[\zeta'(\mu_i)\,\mathrm{E}\big[\frac{\epsilon_i}{\sqrt{N}}\,\big|\,\mu_i\big]\big] = O(N^{-1})$. Since $\zeta''$ and $\mathrm{E}[\epsilon_i^2]$ are bounded, $\mathrm{E}\big[\zeta''(\check{\mu}_i)\frac{\epsilon_i^2}{2N}\big] = O(N^{-1})$. Therefore, $\mathrm{E}[\zeta(\tilde{\mu}_i^\star)] - \mathrm{E}[\zeta(\mu_i)] = O(N^{-1})$. For the variance,
\[
\mathrm{V}[\tilde{\zeta}] = \frac{1}{M^2}\sum_{i=1}^{M}\mathrm{V}[\zeta(\tilde{\mu}_i^\star)] + \frac{1}{M^2}\sum_{1\le i\ne j\le M}\mathrm{Cov}[\zeta(\tilde{\mu}_i^\star),\, \zeta(\tilde{\mu}_j^\star)]. \tag{31}
\]
Clearly, the first sum of (31) is $O(M^{-1})$. The covariance term of (31) can be written as $\mathrm{Cov}[\zeta(\tilde{\mu}_i^\star),\, \zeta(\tilde{\mu}_j^\star)] = \mathrm{E}[\zeta(\tilde{\mu}_i^\star)\zeta(\tilde{\mu}_j^\star)] - \mathrm{E}[\zeta(\tilde{\mu}_i^\star)]\,\mathrm{E}[\zeta(\tilde{\mu}_j^\star)]$.
Because $\mathrm{E}[\zeta(\tilde{\mu}_i^\star)] = \mathrm{E}[\zeta(\mu_i)] + O(N^{-1})$ as shown above, $\mathrm{E}[\zeta(\tilde{\mu}_i^\star)]\,\mathrm{E}[\zeta(\tilde{\mu}_j^\star)] = \mathrm{E}[\zeta(\mu_i)]\,\mathrm{E}[\zeta(\mu_j)] + O(N^{-1})$. Moreover,
\[
\zeta(\tilde{\mu}_i^\star)\zeta(\tilde{\mu}_j^\star) = \zeta(\mu_i)\zeta(\mu_j) + \big\{\zeta(\mu_i)\zeta'(\mu_j)\epsilon_j + \zeta(\mu_j)\zeta'(\mu_i)\epsilon_i\big\}N^{-1/2} + \Big\{\zeta'(\mu_i)\zeta'(\mu_j)\epsilon_i\epsilon_j + \frac{\zeta(\mu_i)\zeta''(\check{\mu}_j)\epsilon_j^2}{2} + \frac{\zeta(\mu_j)\zeta''(\check{\mu}_i)\epsilon_i^2}{2}\Big\}N^{-1} + R,
\]
where $R$ contains $O(N^{-3/2})$ terms. Note that $\mathrm{E}\big[\zeta(\mu_i)\zeta'(\mu_j)\frac{\epsilon_j}{\sqrt{N}}\big] = \mathrm{E}\big[\zeta(\mu_i)\zeta'(\mu_j)\,\mathrm{E}\big[\frac{\epsilon_j}{\sqrt{N}}\,\big|\,\mu_i,\mu_j\big]\big] = O(N^{-1})$. Under Assumption 5.4, one can verify that the coefficients of the $N^{-1}$ term are bounded in mean and $\mathrm{E}[R] = O(N^{-3/2})$. Therefore, from (31), $\mathrm{V}[\tilde{\zeta}] = O(M^{-1}) + O(N^{-1})$. □

Let $\tilde{q}_\alpha = \tilde{\mu}^\star_{(\lceil M\alpha\rceil)}$. In this section, we prove weak consistency of $\tilde{q}_\alpha$ as $M$ and $N$ increase. In the standard nested simulation experiment, $q_\alpha$ is estimated by the $\lceil M\alpha\rceil$th order statistic of $M$ independent conditional mean estimators. In contrast, $\tilde{q}_\alpha$ is an order statistic of the correlated estimators $\tilde{\mu}_1^\star, \tilde{\mu}_2^\star, \ldots, \tilde{\mu}_M^\star$. Consistency of an empirical quantile estimator constructed from dependent outputs has been studied (Sen, 1972; Heidelberger and Lewis, 1984) under the assumption that the output sequence has a strong mixing property, which ensures that the pairwise correlation between distant outputs in the sequence dies down.
Our pooled LR estimators do not have this property; however, their pairwise correlation decreases as $N$ increases. To show weak consistency of $\tilde{q}_\alpha$, we need the following intermediate result, which states that the ecdf of $\tilde{\mu}_1^\star, \tilde{\mu}_2^\star, \ldots, \tilde{\mu}_M^\star$, i.e., $\Phi_{M,N}(\cdot)$, is uniformly weakly consistent for $\Phi(\cdot)$, the cdf of $\mu_i$.

Lemma 5.1.
Under Assumptions 2.1–5.2, $\sup_{\xi\in\mathbb{R}}|\Phi_{M,N}(\xi) - \Phi(\xi)| = O_p(M^{-1/2}) + O_p(N^{-1/2})$.

The proof of Lemma 5.1 can be found in Appendix C in the electronic companion. The following theorem is the main result of this section.
Theorem 5.6.
Suppose Assumptions 2.1–5.2 hold and $\phi(q_\alpha) > 0$ for given $0<\alpha<1$. Then, $|\tilde{q}_\alpha - q_\alpha| = O_p(M^{-1/2}) + O_p(N^{-1/2})$.

Proof. For each $M$, $|\Phi_{M,N}(\tilde{q}_\alpha) - \alpha| \le 1/M$. Also, Lemma 5.1 implies $|\Phi_{M,N}(\tilde{q}_\alpha) - \Phi(\tilde{q}_\alpha)| \le \sup_{\xi\in\mathbb{R}}|\Phi_{M,N}(\xi) - \Phi(\xi)| = O_p(M^{-1/2}) + O_p(N^{-1/2})$. Therefore, $|\Phi(\tilde{q}_\alpha) - \alpha| = O_p(M^{-1/2}) + O_p(N^{-1/2})$. Then, for sufficiently large $M$ and $N$, with probability arbitrarily close to 1 there exists $U^\star$ between $\Phi(\tilde{q}_\alpha)$ and $\alpha$ such that $\phi(\Phi^{-1}(U^\star)) > 0$ and
\[
\Phi^{-1}(\Phi(\tilde{q}_\alpha)) = \Phi^{-1}(\alpha) + \frac{1}{\phi(\Phi^{-1}(U^\star))}\big(\Phi(\tilde{q}_\alpha) - \alpha\big).
\]
Because $\phi$ is continuous and $\phi(q_\alpha)>0$, $\phi$ is bounded away from zero in a neighborhood of $q_\alpha$, and therefore $|\Phi^{-1}(\Phi(\tilde{q}_\alpha)) - \Phi^{-1}(\alpha)| = |\tilde{q}_\alpha - q_\alpha| = O_p(M^{-1/2}) + O_p(N^{-1/2})$. □

We present two numerical examples to demonstrate the performance of the proposed nested simulation experiment design. For both examples, we adopt the following common settings:

• Optimal design: Our proposed design, obtained by solving (12) with $N = M$, which is shown in Section 5 to give the best convergence rates for all performance measures we consider.

• Standard design: The standard nested simulation, where the outer and inner sample sizes are set according to the asymptotically optimal allocation given by Gordy and Juneja (2010). For the same macro replication, if the optimal design ran $\Gamma$ replications in total, we chose $M = \lceil\Gamma^{2/3}\rceil$ and $N = \lceil\Gamma^{1/3}\rceil$, respectively, for the standard design so that their total simulation budgets are approximately equal.

• Standard design+: The standard nested simulation that adopts the same $M$ outer scenarios and the same $N$ used in the optimal design; the total budget is $MN$, which tends to be much larger than $\Gamma$. This design coincides with the naïve feasible solution to (9) that assigns $N_i = N$ for all $1\le i\le M$, $\gamma_{ij} = 0$ for $i\ne j$, and $\gamma_{ii} = 1$.
• Regression: A regression approach that fits a model by sampling $\Gamma$ initial design points and running one replication at each, as in Broadie et al. (2015). We chose different basis functions to suit each example. Once the model is fitted, it is evaluated at the same $M$ outer scenarios as the optimal design to compute the performance measures.

The first example is a portfolio risk management problem, where the objective is to evaluate risk measures of the future portfolio value under price fluctuations of the underlying asset. We compute all four performance measures discussed in Section 5 from each of the four experiment designs and compare their MSEs. We found that the nested statistics computed from the optimal design have smaller MSEs for all performance measures than those from the regression; the standard design consistently performs worse than both.

The second example demonstrates Bayesian IUQ applied to a multi-product newsvendor problem. The performances of the experiment designs are evaluated by the coverage probabilities and widths of the credible intervals (CrIs) of the expected profit constructed from each design. We show that the optimal design performs significantly better than the standard design and the regression across all target coverage probabilities.

The expression for $\mu(\theta)$ is known in both examples, which facilitates evaluating and comparing the performances of the four experiment designs.

We consider a straddle option portfolio that consists of a call option and a put option with the same underlying stock, strike, and maturity. The call option is profitable when the underlying stock price increases and the put option is profitable when the price drops. Combining both, the straddle portfolio is profitable when the stock price at maturity is much greater or much smaller than the strike price; straddles are thus popular in volatile markets, e.g., during a financial crisis. Let the underlying stock price at the current time, $t=0$, be $S_0 = \$100$.
We assume that the stock is non-dividend-paying and follows the Black–Scholes model with annualized expected rate of return $\eta = 2\%$ and annualized volatility $\sigma = 30\%$. The annualized risk-free rate is $r = 2\%$. The common maturity of both options is $T = 2$ years and the common strike price is $K = \$110$.

We are interested in the value of the portfolio in three months, i.e., at $\tau = 1/4$ years. This involves first simulating the stock price at time $\tau$, $S_\tau$, and then computing the expected payoff at maturity given $S_\tau$ by running inner replications. Here, the outer scenario is one-dimensional, $\theta = S_\tau$, and $\mu(\theta)$ corresponds to the conditional expected payoff given $S_\tau$. From the Black–Scholes model, the outer scenarios are simulated under the real-world measure, i.e., $S_\tau\,|\,S_0 = S_0 e^{Z_\tau}$, where the rate of return $Z_\tau$ is distributed as $N((\eta-\sigma^2/2)\tau,\ \sigma^2\tau)$. Thus, $\theta = S_\tau$ has a log-normal distribution whose density function is shown in Figure 1a. This lets us choose the outer scenarios for the experiment to be equally-spaced percentiles of the log-normal distribution instead of sampling from it.

Given any $\theta = S_\tau$, the input random variable for the inner replication is the stock price at maturity, $X = S_T$. From the Black–Scholes model, the inner simulation is conducted under the risk-neutral measure, i.e., $S_T\,|\,S_\tau = S_\tau e^{Z_T}$, where $Z_T \sim N((r-\sigma^2/2)(T-\tau),\ \sigma^2(T-\tau))$. The simulation model $g$ computes the discounted payoff of the straddle option from $X$: $g(X) = e^{-r(T-\tau)}\big[\max\{K-X,\,0\} + \max\{X-K,\,0\}\big]$. Thus, $\mu(\theta) = \mathrm{E}_\theta[g(X)] = \mathrm{E}[g(X)\,|\,S_\tau]$. The analytical expression for $\mu(\theta)$ can be derived from the Black–Scholes model without simulation. Figure 1b depicts $\mu(\theta)$, which shows that the portfolio value is high when $S_\tau$ takes extreme values. We take the standpoint of a financial institution that offers the straddle strategies to investors, i.e., holds a short position.
So the company suffers large losses when µ(θ) is large, that is, when S_τ takes extreme values. In the following, we present the nested simulation results from the four experiment designs we compare. For the regression approach, the weighted Laguerre polynomials up to order 3 are adopted as the basis functions, which is a common choice in pricing American options (see Longstaff and Schwartz, 2001, for example).

In the first set of experiments, we chose M = 1,000 equally-spaced quantiles of θ as the outer scenarios. For the optimal design, this results in a total simulation budget of Γ = 2,148 replications. The optimal sampling decision, {c*_j}, indicates allocating 29%, 21%, 21%, and 29% of the simulation budget to four outer scenarios, the smallest near θ = 70 and the largest at θ = 141.94.

Figure 1: (a) shows the probability density function of θ and (b) plots µ(θ) against θ.

Notice that these sampled scenarios are near the two tail ends of the distribution of θ, which can be explained by the ESS formula for this example. Here, the inner simulation random variable X = S_T | S_τ for any outer scenario θ = S_τ follows a log-normal distribution with common variance σ²(T − τ) and mean m_θ = ln θ + (r − σ²/2)(T − τ). Thus, for any two scenarios θ_i and θ_j, the likelihood ratio is

W_ij(X) = exp( [(ln X − m_{θ_j})² − (ln X − m_{θ_i})²] / (2σ²(T − τ)) )

and its second moment is

E_{θ_j}[W_ij²] = exp( (m_{θ_i} − m_{θ_j})² / (σ²(T − τ)) ) = exp( (ln θ_i − ln θ_j)² / (σ²(T − τ)) ).

Consequently, the ESS of using one sample from θ_j to estimate the conditional mean for scenario θ_i is inversely proportional to exp((ln θ_i − ln θ_j)²/(σ²(T − τ))), which indicates that the ESS falls off quickly as θ_i and θ_j move apart. Therefore, the θs on the tails benefit the most by pooling from nearby θs, whereas the θs in the middle can achieve the desired ESS by pooling from both tails.

The estimation results from the optimal design, the standard design+, and the regression are depicted in Figure 2: the black curve shows the exact µ(θ) based on the closed-form calculation in the Black-Scholes model. For each of the three methods indicated in the legend, the confidence bands are created from the 2.5% and 97.5% quantiles of the estimated µ(θ) from the 10,000 macro replications at each θ. Note that the standard design is omitted from Figure 2 as its confidence band is too wide to be compared on the same plot. Figure 2a shows the confidence band constructed from all 1,000 outer scenarios, and Figure 2b zooms in on the region around the mode of the distribution of θ (beginning at θ = 80). Note that the confidence band of the optimal design is slightly more inflated than that of the standard design+. This is because (4) is an approximation of V[µ̃_ij]; thus, we do not match the variance of the standard design+ exactly.

From Figure 2a, the confidence band produced by the optimal design is indistinguishable from that of the standard design+. Note that the former costs approximately 1/
460 of the total replications of the latter. This demonstrates that the precision requirement (7) in our optimization formulation is effective, even though the ESS constraint in the optimization only approximates the desired precision requirement based on relative variance. The regression approach's confidence band is wider than those of the other two approaches when θ = S_τ takes extreme values. In Figure 2b, when zoomed in near the mode of the outer distribution, we see that the optimal design's confidence band is wider than the other two approaches'.

Figure 2: (a) shows the 95% confidence bands for the optimal design, standard design+, and the regression evaluated from 10,000 macro replications; (b) is a zoomed-in version of (a) around the median of θ.

Table 1: The MSEs of the nested simulation statistics computed from 1,000 macro runs of the four experiment designs. All methods use the same simulation budget except for the standard design+.

            Quantile                                             Indicator Function ζ(·)
M       Opt. Design  Std. Design  Std. Design+  Regression   Opt. Design  Std. Design  Std. Design+  Regression
512     5.43         175          1.34          36.0         2.66E-05     4.39E-03     4.60E-06      1.34E-04
1024    2.56         130          0.46          20.0         1.11E-05     2.38E-03     1.78E-06      6.71E-05
2048    1.10         78.2         0.21          10.3         4.35E-06     1.29E-03     6.61E-07      3.54E-05
4096    0.52         58.3         0.06          5.11         2.28E-06     7.59E-04     2.52E-07      1.89E-05

            Hockey Stick Function ζ(·)                           Square Function ζ(·)
M       Opt. Design  Std. Design  Std. Design+  Regression   Opt. Design  Std. Design  Std. Design+  Regression
512     7.77E-04     2.94E-01     2.85E-04      1.53E-02     0.142        103          0.109         9.05
1024    3.84E-04     1.39E-01     1.07E-04      6.96E-03     0.084        41.1         0.058         4.34
2048    1.71E-04     5.84E-02     3.92E-05      3.02E-03     0.042        13.9         0.028         1.14
4096    8.36E-05     3.24E-02     1.41E-05      1.54E-03     0.021        7.13         0.012         0.52

Next, we compare the MSEs of the portfolio risk measures computed from all four designs. We consider α = 0.
99 to emulate tail risk estimation. The four risk measures we consider are: i) the α-quantile of µ(θ), µ_α; ii) the indicator risk measure E[I(µ(θ) > c)]; iii) the hockey-stick risk measure E[(µ(θ) − c) I(µ(θ) > c)]; and iv) the square risk measure E[(µ(θ) − c)² I(µ(θ) > c)], for a threshold c. The latter three tail risk measures emulate the probability of large losses, the expected excess loss, and the expected squared excess loss, respectively. All are common in practical ERM problems. These risk measures cannot be calculated analytically; thus, they were estimated via MC simulation by sampling a large set of outer scenarios θ and computing µ(θ) at each θ from its analytical expression, which are then used to compute the risk measures.

Table 1 presents the MSEs of the different nested statistics computed from the four designs in comparison. The MSEs are computed from 1,000 independent macro replications. For the optimal design, as M is increased by a factor of two, each risk measure's MSE also shrinks by approximately a half, which is consistent with our asymptotic results in Section 5. Also observe that the optimal design's MSEs are significantly lower than those of the standard design and the regression, for all risk measures and M. Moreover, the optimal design's MSEs are within the same order of magnitude as those of the standard design+, even though the latter requires a much larger simulation budget. Specifically, the optimal design's simulation budget is about 250 and 1,760 times smaller than that of the standard design+ when M = 512 and M = 4,096, respectively.

In this section, we consider a single-stage newsvendor problem with 10 products. We assume that the ℓth product's demand, X_ℓ, follows a Poisson distribution and is independent of all other products' demands. Let c_ℓ and p_ℓ be the unit cost and sale price of the ℓth product, respectively, and {k_1, ..., k_10} be the stocking policy, where k_ℓ is the number of available units of the ℓth product. Specifically, we chose the unit sale price p_ℓ to increase linearly in ℓ from a base of 10, c_ℓ = 2, and k_ℓ = 9 + ℓ for all ℓ for the experiment. Given these inputs, the simulator computes the total profit, g(X) = Σ_{ℓ=1}^{10} {p_ℓ min(X_ℓ, k_ℓ) − c_ℓ k_ℓ}.

The correct mean demand of the ℓth product is assumed to be ϑ^c_ℓ = 5 + ℓ, unknown to us. We have 50 + 5ℓ i.i.d. realizations from Poisson(ϑ^c_ℓ) to estimate ϑ^c_ℓ for the simulation study. Taking the Bayesian view, we model the unknown parameter ϑ_ℓ as a random variable with a prior distribution and update it with the real-world observations generated from Poisson(ϑ^c_ℓ). To exploit conjugacy, the Gamma prior with rate 0.
001 and shape 0.001 is adopted for each ϑ_ℓ. Then, the posterior distribution of ϑ_ℓ is still Gamma, with rate 0.001 + 50 + 5ℓ and shape 0.001 plus the sum of the observed demands of the ℓth product. Let θ = {ϑ_1, ϑ_2, ..., ϑ_10} be a parameter vector sampled from the joint posterior distribution, which is simply a product of the marginals as all product demands are mutually independent. The expected profit given θ, µ(θ) = E_θ[g(X)], is a random variable whose distribution is induced by the posterior of θ. The variability of µ(θ) reflects input uncertainty caused by the finiteness of the demand data. To quantify input uncertainty, we construct a 1 − α credible interval (CrI) for µ(θ) via nested simulation. The analytical expression for µ(θ) can be derived easily using the Poisson distribution function. Thus, a CrI can be constructed by sampling θ_1, θ_2, ..., θ_M from the posterior of θ and computing the empirical α/2 and 1 − α/2 quantiles of µ(θ_1), µ(θ_2), ..., µ(θ_M); this interval is referred to as the oracle CrI in the following and used as a benchmark to compare the performances of the algorithms.

Table 2 compares the CrIs constructed by the four nested simulation experiment designs as well as the oracle CrI from 1,
000 macro-runs. For each macro-run, a new set of real-world demands is sampled from the true demand distributions and the joint posterior of θ is updated conditional on the data.

Table 2: The estimated coverage probabilities and the widths of CrIs constructed by the oracle, the optimal design, the standard design, the standard design+, and the regression from 1,000 macro-runs. All methods use the same simulation budget except for the standard design+. The standard errors are in parentheses.

The oracle CrI is constructed from M = 1,000 θs sampled from its posterior. The optimal design and the regression use the same 1,000 θs as outer scenarios to construct CrIs. The average simulation budget used by the optimal design across 1,000 macro-runs is 1,471 (with a standard error of about 1), compared with MN = 10⁶ for the standard design+. For the regression, polynomial basis functions up to order 2 were used without cross-terms, reflecting that all product demands are independent. The empirical coverage probabilities and the widths of CrIs in Table 2 are averaged over 1,000 macro replications. To compute the former, a million θs were drawn from the posterior distribution, independently of the experiment designs, to construct a test set of µ(θ)s using its analytical expression. For all algorithms and macro replications, the same test set was used to compute the empirical coverage probabilities.

Table 2 shows that the CrIs constructed by the oracle and the optimal design are very close in both coverage and width across all αs, although the latter shows a slight undercoverage compared to the former. The undercoverage arises because the optimal design interpolates the simulation outputs run at the sampled outer scenarios; however, it is alleviated as N grows. The standard design clearly exhibits overcoverage and wide CrIs. This is because the small inner sample size, N, makes V[µ̄_i] large, which inflates the CrI. Notice that the standard design+ still shows overcoverage and slightly wider CrIs than the oracle, indicating that the inflation of the CrI from the MC error of µ̄_i persists even with N = 1,000. The optimal design and the standard design+ show comparable performances across all αs; however, the former costs only 1/670 of the simulation replications of the latter on average. The regression method shows clear overcoverage across all αs compared to the optimal design. In particular, the difference between the CrI widths from the two methods is larger for smaller α, which coincides with the observations from the ERM example: the regression tends to work poorly at predicting µ(θ) for extreme θs. On the other hand, the optimal nested simulation design does not suffer from this, as it allocates more replications to the extreme outer scenarios so that they achieve the same target ESS as all others.

References
Broadie, M., Y. Du, and C. C. Moallemi (2011). Efficient risk estimation via nested sequential simulation. Management Science 57(6), 1172–1194.

Broadie, M., Y. Du, and C. C. Moallemi (2015). Risk estimation via regression. Operations Research 63(5), 1077–1097.

Dong, J., M. Feng, and B. L. Nelson (2018). Unbiased metamodeling via likelihood ratios. In Proceedings of the 2018 Winter Simulation Conference, pp. 1778–1789. IEEE.

Elvira, V., L. Martino, and C. P. Robert (2018). Rethinking the effective sample size. arXiv preprint. https://arxiv.org/abs/1809.04129, Last accessed on 6/20/2020.

Feng, M. and E. Song (2019). Efficient input uncertainty quantification via green simulation using sample path likelihood ratios. In Proceedings of the 2019 Winter Simulation Conference, pp. 3693–3704. IEEE.

Feng, M. and J. Staum (2015). Green simulation designs for repeated experiments. In Proceedings of the 2015 Winter Simulation Conference, pp. 403–413. IEEE.

Feng, M. and J. Staum (2017). Green simulation: Reusing the output of repeated experiments. ACM Transactions on Modeling and Computer Simulation (TOMACS) 27(4), 1–28.

Fu, M. C. (2015). Handbook of Simulation Optimization. New York, New York: Springer.

Glasserman, P. (2003). Monte Carlo Methods in Financial Engineering. Springer.

Glasserman, P. and X. Xu (2014). Robust risk measurement and model risk. Quantitative Finance 14(1), 29–58.

Gordy, M. B. and S. Juneja (2010). Nested simulation in portfolio risk measurement. Management Science 56(10), 1833–1848.

Heidelberger, P. and P. A. W. Lewis (1984). Quantile estimation in dependent sequences. Operations Research 32(1), 185–209.

Hesterberg, T. C. (1988). Advances in Importance Sampling. Ph.D. thesis, Stanford University.

Hong, L. J., S. Juneja, and G. Liu (2017). Kernel smoothing for nested estimation with application to portfolio risk measurement. Operations Research 65(3), 657–673.

Kleijnen, J. P. and R. Y. Rubinstein (1996). Optimization and sensitivity analysis of computer simulation models by the score function method. European Journal of Operational Research 88(3), 413–427.

Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago.

L'Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science 36(11), 1364–1383.

L'Ecuyer, P. (1993). Two approaches for estimating the gradient in functional form. In Proceedings of the 1993 Winter Simulation Conference, pp. 338–346. IEEE.

Liu, J. S. (1996). Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing 6(2), 113–119.

Liu, M. and J. Staum (2010). Stochastic kriging for efficient nested simulation of expected shortfall. Journal of Risk 12(3), 3–27.

Longstaff, F. A. and E. S. Schwartz (2001). Valuing American options by simulation: A simple least-squares approach. The Review of Financial Studies 14(1), 113–147.

Maggiar, A., A. Waechter, I. S. Dolinskaya, and J. Staum (2018). A derivative-free trust-region algorithm for the optimization of functions smoothed via Gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization 28(2), 1478–1507.

Martino, L., V. Elvira, and F. Louzada (2017). Effective sample size for importance sampling based on discrepancy measures. Signal Processing 131, 386–401.

Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples. https://statweb.stanford.edu/~owen/mc/, Last accessed on 6/20/2020.

Rubinstein, R. Y. and A. Shapiro (1993). Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley & Sons Inc.

Sen, P. K. (1972). On the Bahadur representation of sample quantiles for sequences of φ-mixing random variables. Journal of Multivariate Analysis 2(1), 77–95.

Song, E., B. L. Nelson, and C. D. Pegden (2014). Advanced tutorial: Input uncertainty quantification. In Proceedings of the 2014 Winter Simulation Conference, pp. 162–176. IEEE.

Sun, Y., D. W. Apley, and J. Staum (2011). Efficient nested simulation for estimating the variance of a conditional expectation. Operations Research 59(4), 998–1007.

Xie, W., B. L. Nelson, and R. R. Barton (2014). A Bayesian framework for quantifying uncertainty in stochastic simulation. Operations Research 62(6), 1439–1452.

Zhou, E. and T. Liu (2019). Online quantification of input model uncertainty by two-layer importance sampling. https://arxiv.org/abs/1912.11172, Last accessed on 6/20/2020.

Zouaoui, F. and J. R. Wilson (2004). Accounting for input-model and input-parameter uncertainties in simulation. IIE Transactions 36(11), 1135–1151.

Electronic Companion

A Assumptions for Part (ii) of Lemma 3.1 and its proof
Let $\overline{W_{ij}g(X)} \equiv N_j^{-1}\sum_{k=1}^{N_j} g(X_k)W_{ij,k}$ and $\overline{W}_{ij} \equiv N_j^{-1}\sum_{k=1}^{N_j} W_{ij,k}$. The following assumption states the conditions under which Lemma 3.1 holds.

Assumption A.1. Given $\theta_i$ and $\theta_j$ in $\Theta$, assume that

(i) $\mathrm{E}_{\theta_j}[W_{ij}^{2}] < \infty$ and $\mathrm{E}_{\theta_j}[(W_{ij}g(X) - \mu_i)^{2}] < \infty$,

(ii) $\mathrm{E}\left[\sup_{N_j}\sup_{B \in (\overline{W}_{ij},\,1)} \left(B^{-2}(\overline{W}_{ij} - 1)\left(\overline{W_{ij}g(X)} - \mu_i\right)\right)^{2}\right] < \infty$, and

(iii) $\mathrm{E}\left[\sup_{N_j}\sup_{A \in (\overline{W_{ij}g(X)},\,\mu_i),\, B \in (\overline{W}_{ij},\,1)} \left(AB^{-3}(\overline{W}_{ij} - 1)^{2}\right)^{2}\right] < \infty$.

Parts (ii) and (iii) of the moment conditions in Assumption A.1 may appear strong, but in practice they are likely to hold if Part (i) holds, because $\sup_{A \in (\overline{W_{ij}g(X)},\,\mu_i)} A \overset{a.s.}{\to} \mu_i$ by Part (i) of Lemma 3.1 and $\sup_{B \in (\overline{W}_{ij},\,1)} B \overset{a.s.}{\to} 1$.

Proof of Part (ii) of Lemma 3.1.
By definition, $\widetilde{\mu}_{ij}$ is the ratio between $\overline{W_{ij}g(X)}$ and $\overline{W}_{ij}$. Applying a two-dimensional Taylor series expansion to $\widetilde{\mu}_{ij}$ at $\left(\mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}],\ \mathrm{E}_{\theta_j}[\overline{W}_{ij}]\right)^{\top}$,
\[
\widetilde{\mu}_{ij} = \frac{\mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}]}{\mathrm{E}_{\theta_j}[\overline{W}_{ij}]}
+ \begin{pmatrix} \dfrac{1}{\mathrm{E}_{\theta_j}[\overline{W}_{ij}]} \\[6pt] -\dfrac{\mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}]}{\left(\mathrm{E}_{\theta_j}[\overline{W}_{ij}]\right)^{2}} \end{pmatrix}^{\!\top}
\begin{pmatrix} \overline{W_{ij}g(X)} - \mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}] \\ \overline{W}_{ij} - \mathrm{E}_{\theta_j}[\overline{W}_{ij}] \end{pmatrix}
+ \frac{1}{2}
\begin{pmatrix} \overline{W_{ij}g(X)} - \mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}] \\ \overline{W}_{ij} - \mathrm{E}_{\theta_j}[\overline{W}_{ij}] \end{pmatrix}^{\!\top}
\begin{pmatrix} 0 & -1/B^{2} \\ -1/B^{2} & 2A/B^{3} \end{pmatrix}
\begin{pmatrix} \overline{W_{ij}g(X)} - \mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}] \\ \overline{W}_{ij} - \mathrm{E}_{\theta_j}[\overline{W}_{ij}] \end{pmatrix},
\]
where $A$ and $B$ are in between $\overline{W_{ij}g(X)}$ and $\mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}]$, and $\overline{W}_{ij}$ and $\mathrm{E}_{\theta_j}[\overline{W}_{ij}]$, respectively. Because $\mathrm{E}_{\theta_j}[\overline{W_{ij}g(X)}] = \mu_i$ and $\mathrm{E}_{\theta_j}[\overline{W}_{ij}] = 1$, the expansion can be rewritten as
\[
\widetilde{\mu}_{ij} = \overline{W_{ij}g(X)} - \mu_i(\overline{W}_{ij} - 1) - B^{-2}(\overline{W}_{ij} - 1)\left(\overline{W_{ij}g(X)} - \mu_i\right) + AB^{-3}(\overline{W}_{ij} - 1)^{2}. \tag{32}
\]
We first show that the variance of $\widetilde{\mu}_{ij}$ has the stated expression. From Assumption A.1, the second moment of (32) is bounded. Then, by the dominated convergence theorem, (32) converges in mean square to
\[
\overline{W_{ij}g(X)} - \mu_i(\overline{W}_{ij} - 1) - (\overline{W}_{ij} - 1)\left(\overline{W_{ij}g(X)} - \mu_i\right) + \mu_i(\overline{W}_{ij} - 1)^{2}. \tag{33}
\]
After some tedious algebra, the variance of (33) can be shown to have the following form:
\[
\mathrm{E}_{\theta_j}\!\left[W_{ij}^{2}(g(X) - \mu_i)^{2}\right] N_j^{-1} + R_1 N_j^{-2} + R_2 N_j^{-3}, \tag{34}
\]
where $R_1$ and $R_2$ are functions of moments of $W_{ij}$ and $g(X)$ bounded under Assumption A.1. Thus, $\mathrm{V}_{\theta_j}[\widetilde{\mu}_{ij}] = \mathrm{E}_{\theta_j}[W_{ij}^{2}(g(X) - \mu_i)^{2}]\, N_j^{-1} + o(N_j^{-1})$. Since the variance of $\widetilde{\mu}_{ij}$ is bounded, its bias is also bounded. Taking the expectation of both sides of (33), we obtain $\mathrm{E}_{\theta_j}[\widetilde{\mu}_{ij}] - \mu_i = -\mathrm{E}_{\theta_j}[W_{ij}^{2}(g(X) - \mu_i)]\, N_j^{-1} + o(N_j^{-1})$, which concludes the proof.
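The bias and variance expansion above is easy to check numerically. The sketch below does so for a toy Gaussian pair of our own choosing (not from the paper): X | θ ~ N(θ, 1) and g(X) = X, so that µ_i = θ_i, with samples drawn under θ_j and reweighted by the likelihood ratio.

```python
import numpy as np

# Hypothetical toy setup: X | theta ~ N(theta, 1), g(X) = X, so mu_i = theta_i.
# Samples are drawn under theta_j and reweighted by W_ij = phi_i(X) / phi_j(X).
rng = np.random.default_rng(0)
theta_i, theta_j, N = 0.3, 0.0, 2000

def mu_tilde(rng):
    X = rng.normal(theta_j, 1.0, N)
    # log of the N(theta_i, 1) / N(theta_j, 1) density ratio
    logW = 0.5 * (X - theta_j) ** 2 - 0.5 * (X - theta_i) ** 2
    W = np.exp(logW)
    return (W * X).mean() / W.mean()  # self-normalized estimator mu_tilde_ij

estimates = np.array([mu_tilde(rng) for _ in range(4000)])

# Leading-order variance from (34): E_{theta_j}[W^2 (g(X) - mu_i)^2] / N.
# For this Gaussian pair it works out to exp(d^2) * (1 + d^2) / N, d = theta_i - theta_j.
d = theta_i - theta_j
var_leading = np.exp(d**2) * (1.0 + d**2) / N
```

The empirical variance of `estimates` matches `var_leading` up to the higher-order $N^{-2}$ and $N^{-3}$ terms, and the empirical bias is $O(N^{-1})$, consistent with the proof.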
B Analytical calculation of $\mathrm{E}_{\theta_j}[W_{ij}^{2}]$ for exponential family distributions

The joint input distribution, $h(x; \theta)$, is said to be a member of the exponential family if it can be written as
\[
h(x; \theta) = B(x)\exp\!\left(\theta^{\top}T(x) - A(\theta)\right), \tag{35}
\]
where $B(x)$, $T(x)$, and $A(\theta)$ are known functions. The exponential family includes both discrete (e.g., Poisson) and continuous (e.g., normal) distributions. For convenience, we focus on the latter in the following, but the discussion can be generalized to both. In particular, $A(\theta) = \ln\!\left(\int_{\mathcal{X}} B(x)\exp(\theta^{\top}T(x))\, dx\right)$ is called the log-partition function. The natural parameter space is defined as $\Theta = \left\{\theta : \int_{\mathcal{X}} B(x)\exp(\theta^{\top}T(x))\, dx < \infty\right\}$. Then, for any $\theta \in \Theta$, $h(x; \theta)$ in (35) is a well-defined probability density function, i.e., $\int_{\mathcal{X}} h(x; \theta)\, dx = 1$. Moreover, $\int_{\mathcal{X}} B(x)\exp(\theta^{\top}T(x))\, dx$ diverges for any $\theta \notin \Theta$. For any $\theta_i, \theta_j \in \Theta$, we have
\[
\mathrm{E}_{\theta_j}[W_{ij}^{2}] = \int_{\mathcal{X}} \left(\frac{h(x;\theta_i)}{h(x;\theta_j)}\right)^{2} h(x;\theta_j)\, dx
= \exp\!\left(A(\theta_j) - 2A(\theta_i)\right) \int_{\mathcal{X}} B(x)\exp\!\left((2\theta_i - \theta_j)^{\top}T(x)\right) dx
= \begin{cases} \exp\!\left(A(\theta_j) - 2A(\theta_i) + A(2\theta_i - \theta_j)\right), & \text{if } 2\theta_i - \theta_j \in \Theta, \\ \infty, & \text{if } 2\theta_i - \theta_j \notin \Theta. \end{cases} \tag{36}
\]
Note that if $2\theta_i - \theta_j \notin \Theta$ then $\mathrm{E}_{\theta_j}[W_{ij}^{2}] = \infty$, and therefore $(\mathrm{E}_{\theta_j}[W_{ij}^{2}])^{-1} = 0$. In this case, pooling replications from $h(x; \theta_j)$ to estimate $\mu_i$ has zero ESS, hence there is no benefit.

C Proof of Lemma 5.1
From Theorem 5.3 and Chebyshev's inequality, $|\Phi_{M,N}(\xi) - \Phi(\xi)| = O_p(M^{-1/2}) + O_p(N^{-1/2})$ for any $\xi \in \mathbb{R}$. To show that the convergence rate holds uniformly, we proceed with a Glivenko-Cantelli-type argument. Let $J$ be an arbitrary positive integer and $-\infty = \xi_0 < \xi_1 < \cdots < \xi_J = \infty$ be such that $\Phi(\xi_j) - \Phi(\xi_{j-1}) = 1/J$ for all $j = 1, 2, \ldots, J$. Then, for any $\xi \in \mathbb{R}$, there exists $j \in \{1, \ldots, J\}$ such that $\xi \in [\xi_{j-1}, \xi_j]$. Note that
\[
\Phi_{M,N}(\xi) - \Phi(\xi) \le \Phi_{M,N}(\xi_j) - \Phi(\xi_{j-1}) = \Phi_{M,N}(\xi_j) - \Phi(\xi_j) + 1/J, \quad \text{and}
\]
\[
\Phi_{M,N}(\xi) - \Phi(\xi) \ge \Phi_{M,N}(\xi_{j-1}) - \Phi(\xi_j) = \Phi_{M,N}(\xi_{j-1}) - \Phi(\xi_{j-1}) - 1/J.
\]
Thus, $|\Phi_{M,N}(\xi) - \Phi(\xi)| \le \max\left\{|\Phi_{M,N}(\xi_j) - \Phi(\xi_j)|,\ |\Phi_{M,N}(\xi_{j-1}) - \Phi(\xi_{j-1})|\right\} + 1/J$ and
\[
\sup_{\xi \in \mathbb{R}} |\Phi_{M,N}(\xi) - \Phi(\xi)| \le \max_{1 \le j \le J}\left\{|\Phi_{M,N}(\xi_j) - \Phi(\xi_j)|\right\} + 1/J.
\]
Choosing $J \ge \max\{M, N\}$, the right-hand side of the inequality above is $O_p(M^{-1/2}) + O_p(N^{-1/2})$.
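As a final remark, the exponential-family second-moment formula (36) in Section B is straightforward to verify numerically. The sketch below does so for the Poisson distribution, whose natural parameter is θ = ln λ with A(θ) = e^θ, so the formula reduces to exp(λ_j − 2λ_i + λ_i²/λ_j); the λ values used are arbitrary choices of ours.

```python
from math import exp, log

def second_moment_closed_form(lam_i, lam_j):
    """E_{theta_j}[W_ij^2] for Poisson via the exponential-family identity (36):
    theta = ln(lambda), A(theta) = e^theta, so
    A(theta_j) - 2*A(theta_i) + A(2*theta_i - theta_j) = lam_j - 2*lam_i + lam_i**2/lam_j."""
    return exp(lam_j - 2.0 * lam_i + lam_i**2 / lam_j)

def second_moment_direct(lam_i, lam_j, x_max=300):
    """Direct truncated sum of p_i(x)^2 / p_j(x) over the Poisson support,
    accumulating log-pmfs incrementally to avoid factorial overflow."""
    total, log_pi, log_pj = 0.0, -lam_i, -lam_j  # log P(X = 0) under each rate
    for x in range(x_max + 1):
        if x > 0:
            log_pi += log(lam_i) - log(x)
            log_pj += log(lam_j) - log(x)
        total += exp(2.0 * log_pi - log_pj)
    return total
```

Note that for the Poisson family the natural parameter space is all of ℝ, so 2θ_i − θ_j = ln(λ_i²/λ_j) always stays inside it and the second moment is always finite, although it grows quickly as λ_i and λ_j move apart, which is exactly the ESS decay the experiment design exploits.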