Ensemble approximate control variate estimators: Applications to multi-fidelity importance sampling
Trung Pham∗†a and Alex A. Gorodetsky∗†a
a Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
∗ Corresponding authors. † Email addresses: [email protected] (T. Pham), [email protected] (A.A. Gorodetsky)
Abstract
The recent growth in multi-fidelity uncertainty quantification has given rise to a large set of variance-reduction techniques that leverage information from model ensembles to provide variance reduction for estimates of the statistics of a high-fidelity model. In this paper we provide two contributions: (1) we utilize an ensemble estimator to account for uncertainties in the optimal weights of approximate control variate (ACV) approaches and derive lower bounds on the number of samples required to guarantee variance reduction; and (2) we extend an existing multi-fidelity importance sampling (MFIS) scheme to leverage control variates. As such we make significant progress towards both increasing the practicality of approximate control variates—for instance, by accounting for the effect of pilot samples—and using multi-fidelity approaches more effectively for estimating low-probability events. The numerical results indicate that our hybrid MFIS-ACV estimator achieves up to 50% improvement in variance reduction over the existing state-of-the-art MFIS estimator, which had already shown an outstanding convergence rate compared to the Monte Carlo method, on several problems of computational mechanics.
Keywords. multi-fidelity, uncertainty quantification, approximate control variates, importance sampling, rare-event simulation
This paper develops an advancement to the approximate control variate (ACV) [1] approach for variance reduction in uncertainty quantification applications where multiple models with varying qualities and computational costs are available. Specifically, we analyze the effect of using estimated control variate weights within the ACV on variance reduction, and we provide conditions under which variance reduction can still be guaranteed in these cases. Multi-fidelity approaches for uncertainty quantification have recently seen significant adoption across widely varying domains where expensive simulations are required for accurate predictions. These domains include heat transfer problems [2], aerospace design [3], optimization under uncertainty [4, 5], and ensembles of computer simulator outputs [6]. A survey of multi-fidelity methods is presented in [7].

As with single-fidelity UQ, multi-fidelity (MF) UQ techniques can leverage both surrogate and sampling-based algorithmic approaches. While surrogate-based techniques are plentiful [8–10], we focus on sampling approaches that are often both more flexible to leverage and also provide the foundation for many surrogate approaches (e.g., estimating where to obtain more data). The basis of a majority of sampling approaches is the usage of Monte Carlo (MC) simulation [11] to estimate the output statistics. The primary advantage of MC over surrogate approaches is that it does not impose any requirement on the smoothness of the forward model, and its accuracy and convergence rate are independent of the model dimension. Nevertheless, its convergence rate is slow, demanding a large number of model evaluations to reach satisfactory accuracy. This computational cost can be prohibitive for many practical problems with expensive simulation models. A straightforward error analysis in [12] reveals that the efficiency of MC simulation can be greatly improved by variance reduction methods, which, as explained in [13], “can be viewed as a means of utilizing known information about the model in order to obtain more accurate estimators of its performance.” Two such methods are heavily used for quantifying uncertainty: importance sampling (IS) and control variates (CV). We explore the adaptation and advancement of certain aspects of these two approaches to multi-fidelity uncertainty quantification problems in this paper.

Adapting standard statistical approaches for variance reduction in the context of UQ problems must address special challenges. The primary challenge is that the relationships about and between low-fidelity and high-fidelity models are unknown or imprecisely encoded. For instance, CV techniques require the low-fidelity models to have known means and known covariance amongst models. However, the low-fidelity information sources in uncertainty quantification typically are in the form of simulation models—their statistics are not known but simply easier to compute than the high-fidelity model. Thus, algorithms that adapt IS and CV to these problems must determine and model the relationship between such information sources.

A number of multi-fidelity techniques leveraging importance sampling have also been proposed. Importance sampling approaches generate weighted samples from a biasing distribution. A prudent choice of biasing distribution can lead to a drastic reduction of computational cost [14, 15].
MFUQ techniques that use IS are based on the idea that the distributions of low-fidelity quantities of interest (QoI) are closely related to the high-fidelity QoI, even if their pointwise evaluations have errors. In [8] the IS density is constructed from a single surrogate model built on the high-fidelity model, and the multi-fidelity importance sampling (MFIS) estimator is basically an IS estimator using that surrogate-based density. In [9] multiple low-fidelity surrogate-based IS densities are aggregated to derive an estimator following the mixed IS approach [16]. In [10] multiple surrogate-based IS estimators are fused into a weighted ensemble estimator, and the weights are determined through minimizing the variance of the fused estimator. Though these estimators have achieved impressive speedups, further improvement is still possible. In particular, they can be enhanced by also using the low-fidelity models as control variates, not just for biasing distribution construction.

Control variate techniques introduce a weighted adjustment term to maintain an unbiased estimator with lower variance. These techniques leverage correlations between information sources to achieve variance reduction. Examples include CV [13], approximate CV [1], multi-level MC (MLMC) [17, 18], multi-index MC (MIMC) [19], multi-fidelity MC [3], and multi-level multi-fidelity (MLMF) MC [20].

The classical CV technique requires low-fidelity information sources with known mean and known correlation to high-fidelity information sources [13], thus making it unsuitable for direct application to the MFUQ problem. Work to extend control variates to the case of unknown covariances was achieved in [21] by leveraging an ensemble estimator that returns an average over a set of CV estimators. However, this estimator still required known means of the low-fidelity information sources.

On the other hand, approximate control variates [1] were recently developed to tackle the problem of unknown means in the case where the covariance amongst models is known (or easily estimated). This estimator was shown to generalize existing control-variate-inspired techniques such as MFMC and MLMF [3, 5, 8, 22]. Indeed, these existing techniques were viewed as either recursive difference or recursive nested estimators that sequentially estimate various unknown means of the low-fidelity sources. Moreover, the ACV provides a method to achieve optimal variance reduction in the case of limited high-fidelity resources and increasing low-fidelity resources. The ACV can also be viewed as a particular generalization of MLMC and MIMC. The difference is that these approaches embed recursive difference estimators with a fixed control variate weight (−1) within a bias-reduction scheme that sequentially refines the high-fidelity model. Indeed, the typical MLMC-type estimators have achieved great success for variance reduction in cases where information sources are related through discretization refinement. Moreover, they do not require estimation of covariances or control variate weights because they use a fixed CV weight.

Nevertheless, [1] showed that the fixed weight of MLMC (−1) is only optimal in the case where the correlation between sequential models is one. In all other cases, there exist weights that can further improve variance reduction. However, it has not been clear that this benefit can be truly realized because of the additional variance introduced by weight estimation from pilot samples.
This paper takes the first steps to answer this question by providing guidelines for choosing the number of pilot samples required to achieve variance reduction.

Our primary contribution begins to bridge the remaining gap between ACV theory and practice by considering unknown means and unknown correlations (and therefore unknown optimal weights). It does so by adapting the ACV into an ensemble estimator, similar to obtaining pilot samples, and analyzing its performance. The main results are
1. Theorems 1 and 4: providing the inefficiency induced by unknown correlations as a scaling on the typical (1 − R²) term for ACV-based estimators (ensemble estimator);
2. Corollary 4.1: providing the number of samples required for achieving variance reduction (ensemble estimator).
Our derivation of the lower bounds depends on the Gaussianity assumption of the MC estimators and the construction of ensemble estimators, which are based on the method of batch means [21, 23, 24]. These bounds are known given prior knowledge on the correlation amongst models, which can be obtained from physics information when possible or lower-bounded to be conservative. Two secondary contributions include
1. Theorem 7: explicitly specifying the range of weights over which one still obtains variance reduction, even under errors in approximating the optimal control variate weight;
2. a new multi-fidelity IS estimator that combines the MFIS approach of using the low-fidelity model for determining the biasing distribution with a CV estimator that also uses the low-fidelity model to leverage its correlations.
Finally, empirical demonstrations are performed for rare-event estimation with an emphasis on mechanical systems. Multiple examples demonstrate the ensemble estimator theory as well as show improvement over the MFIS estimator.

The rest of this paper is organized as follows. Section 2 reviews the MC, IS, and CV estimators along with our target application of rare-event (low-probability-event) simulation. Section 3 lists our main results on variance reduction via the proposed ensemble-based ACV estimator. The proofs can be found in the appendices. Section 4 introduces how to simply use importance sampling as the underlying estimator within the control variate framework and provides a step-by-step implementation of our estimators in pseudocode. The numerical examples in Section 5 demonstrate our theoretical results by providing empirical performance on synthetic and PDE-based problems. It provides comparison to the MFIS estimator, and comparisons between ensemble and standalone ACV estimators. The paper is concluded in Section 6.

In this section, we review our notation and provide background about the main statistical estimators, which are gathered in Table 1 for convenience.
  Notes                                                            Notation^a          Equation
  Monte Carlo                                                      Q_n                 (1)
  Importance sampling                                              Q^IS_{q,n}          (4)
  Classical control variates (known mean and known weight)         Q^CV                (10)
  Approximate control variates (unknown mean and known weight)     Q^e                 (23)
  Ensemble (A)CV-type estimator ((un)known mean, unknown weight)   Q̄^e(ᾱ_e)            (22) and (37)

  ^a e ∈ {ACV, ACV-X} for the approximate control variates and e ∈ {CV, ACV, ACV-X} for the ensemble estimator, where ACV-X indicates a particular sample partitioning scheme used within the ACV.

Table 1: List of estimators
Let (Ω, F, P) be a probability space. Let N denote the positive natural numbers, R the real numbers, and R_+ the positive real numbers. For d ∈ N we use Z : Ω → R^d to denote an F-measurable random variable corresponding to input uncertainty. This variable is assumed to be continuous and to have probability density function (PDF) p(z) : R^d → R_+, with z ∈ R^d denoting a realization of Z. Let supp(p) be the support of the PDF p, i.e., supp(p) = {z ∈ R^d : p(z) > 0}. Let Y_i(Z) : R^d → 𝒴_i ⊂ R for i = 0, 1, ..., M denote the quantities of interest of a high-fidelity model Y_0 and of M low-fidelity models {Y_i}_{i=1}^M, respectively. Let the expected values of those models be denoted by µ_i = E_p[Y_i(Z)], where E_p[·] is the expectation operator taken with respect to p; for brevity, whenever the PDF p is omitted, it is implicitly assumed, i.e., the expectation E[·], the variance Var[·] and the covariance Cov[·] are computed with respect to p. Our goal is to estimate µ_0.

In this section we describe the rare-event estimation problem for which we seek to apply variance reduction in Section 5. Let g : R^d → R denote the limit state function. A failure event is defined by g(z) < 0, with the corresponding failure domain

    G = {z ∈ R^d : g(z) < 0}.

The failure probability P_f(g(z) < 0) can be cast as an expectation by considering the indicator function defined on the failure domain, I_G : R^d → {0, 1}, with I_G(z) = 1 if z ∈ G and I_G(z) = 0 otherwise. Using this indicator function, the failure probability P_f is given by

    P_f(g(z) < 0) = E[I_G(Z)].
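As a small illustration of casting the failure probability as an expectation, the following toy Python sketch estimates E[I_G(Z)] by averaging the indicator over samples of Z; the scalar limit-state function and its threshold are assumptions for illustration only, not quantities from this paper (the estimator used is simply the Monte Carlo estimator reviewed next).

import numpy as np

# Toy sketch: P_f = E[I_G(Z)] for an assumed limit state g(z) = 3 - z with Z ~ N(0, 1).
rng = np.random.default_rng(0)
g = lambda z: 3.0 - z
indicator_G = lambda z: (g(z) < 0).astype(float)   # I_G(z) = 1 if z in G, else 0

z = rng.standard_normal(1_000_000)                 # i.i.d. samples of Z ~ p
print("crude estimate of P_f:", indicator_G(z).mean())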
2.2 Monte Carlo estimator

In this section we review the basic Monte Carlo (MC) estimator. The MC estimator is defined as a normalized sum of random variables

    Q_n(Y_0) = (1/n) Σ_{i=1}^n Y_0(Z^(i)),    (1)

where the random variables Z^(i) are independent and identically distributed (i.i.d.) according to the input random variable Z, and Q_n is a new random variable derived from Y_0 and each Z^(i). This estimator is unbiased and has variance decaying proportionally to 1/n, since

    E[Q_n(Y_0)] = E[Y_0(Z)] = µ_0,    (2)
    Var[Q_n(Y_0)] = Var[Y_0(Z)]/n = σ_0²/n.    (3)

The root mean squared error of the MC estimator is √(Var[Q_n(Y_0)]) = σ_0/√n, which shows that to gain one more decimal digit of accuracy, the computational cost needs to increase 100 times [25]. In this paper, we seek an estimator with reduced variance by leveraging two ideas: importance sampling [13] and control variates [25].

Importance sampling (IS) seeks variance reduction by carefully choosing a sampling distribution that differs from p. The IS estimator Q^IS_{q,n}(Y_0) is defined by a weighted sum of random variables as

    Q^IS_{q,n}(Y_0) = (1/n) Σ_{i=1}^n Y_0(Z^(i)) W(Z^(i)),    (4)

where the input random variables Z^(i) are i.i.d. according to the PDF q(z), and

    W(z) = p(z)/q(z)    (5)

is the density ratio. The new density q is called the proposal (or biasing) density.

According to [13, 25], the IS estimator is unbiased when Y_0(z)p(z) is dominated by q(z), that is, q(z) = 0 implies Y_0(z)p(z) = 0. In other words, if supp(Y_0 p) ⊆ supp(q), we have

    E_q[Q^IS_{q,n}(Y_0)] = E_q[Y_0(Z)W(Z)] = E_p[Y_0(Z)] = µ_0,    (6)
    Var_q[Q^IS_{q,n}(Y_0)] = Var_q[Y_0(Z)W(Z)]/n.    (7)

A prudent choice of the IS proposal density q(z) can yield an estimator with a smaller variance than that of the MC estimator. The optimal IS density is obtained by minimizing the variance Var_q[Y_0(Z)W(Z)] as in [13], and is given in closed form by

    q*(z) = |Y_0(z)| p(z) / ∫_Ω |Y_0(z)| p(z) dz.

Specializing this proposal to the case of rare-event simulation, i.e., Y_0 is an indicator function, we obtain

    q*(z) = Y_0(z) p(z) / µ_0.    (8)

Using this proposal, the variance of Q^IS_{q,n}(Y_0) becomes identically zero because each evaluation of the high-fidelity model Y_0 with a single sample drawn from q*(z) is exactly equal to its expected value µ_0.

However, since µ_0 is unknown, it is intractable to exactly compute the optimal IS density. Instead, several approaches can be used for obtaining an approximation [26–28]. In this paper, the cross-entropy (CE) method with a Gaussian mixture model (GMM) in [29, 30] is chosen to find an approximate IS density q̂(z). Here, q̂(z) is defined as a weighted sum of k ∈ N multivariate normal density functions

    q̂(z) = Σ_{i=1}^k π_i N(z; µ_i, Σ_i),    (9)

where π_i ∈ R, µ_i ∈ R^d, and Σ_i ∈ R^{d×d} for i = 1, 2, ..., k are the mixture weights, means, and covariance matrices, respectively. The mixture coefficient π_i is the probability that the i-th density N(z; µ_i, Σ_i) is selected at a given time [30], requiring 0 ≤ π_i ≤ 1 and Σ_{i=1}^k π_i = 1.

There exist several approaches to estimate the three parameters for each element of the mixture [29, 30]. Here we use an expectation-maximization (EM) algorithm with a cross-entropy objective function. We briefly explain this approach in Appendix A by following the construction in [29] and refer to [13, 26, 29, 31] for a more detailed treatment.
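To make the definitions (4)–(5) and (9) concrete, the minimal Python sketch below estimates a Gaussian tail probability with the IS estimator, using a hand-specified two-component Gaussian mixture as a stand-in for the CE/EM-fitted proposal q̂; the mixture parameters and the limit-state threshold are illustrative assumptions, not values from this paper.

import numpy as np
from scipy import stats

# IS estimate of P_f = E_p[I{3 - z < 0}] for Z ~ N(0, 1), with a (assumed) GMM proposal.
rng = np.random.default_rng(0)
n = 5000
indicator = lambda z: (3.0 - z < 0).astype(float)   # toy high-fidelity indicator model

weights = np.array([0.7, 0.3])                      # assumed mixture parameters
means   = np.array([3.2, 4.0])
stds    = np.array([0.6, 1.0])

def sample_q(n):
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

def q_pdf(z):
    return sum(w * stats.norm.pdf(z, m, s) for w, m, s in zip(weights, means, stds))

z = sample_q(n)
w_ratio = stats.norm.pdf(z) / q_pdf(z)              # density ratio W(z) = p(z)/q(z)
vals = indicator(z) * w_ratio
print("IS estimate :", vals.mean())                 # compare with 1 - Phi(3) ~ 1.35e-3
print("IS std error:", vals.std(ddof=1) / np.sqrt(n))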
In this section, we first consider the classical control variates—a variance-reduction technique that relies on introducing additional information sources. The classical control variates assume that both the means and the covariances of the additional information sources are available. Next we present an extension of the classical control variates in which the control means are known but the covariances amongst the low-fidelity information sources are unknown [21, 23]. Finally, we review the approximate control variates [1], which consider the case with unknown control variate means and known covariances.
A control variate (CV) estimator Q^CV(α) utilizes a set of M additional estimators Q_n(Y_1), ..., Q_n(Y_M), and augments a baseline estimator Q_n(Y_0) via a linear combination of these estimators

    Q^CV(α) = Q_n(Y_0) + Σ_{i=1}^M α_i (Q_n(Y_i) − µ_i) = Q_n(Y_0) + α^T (Q − µ),    (10)

where µ_i = E[Q_n(Y_i)] is the known mean of Q_n(Y_i), µ = [µ_1, ..., µ_M]^T, α = [α_1, ..., α_M]^T is the vector of control variate weights, and Q = [Q_n(Y_1), ..., Q_n(Y_M)]^T is the vector of additional estimators. These additional estimators are shown here to be Monte Carlo estimators Q_n, but can actually be any random variables. In the later sections we will use importance sampling estimators instead.

This CV estimator is unbiased and has reduced variance compared with the baseline estimator Q_n(Y_0). Specifically, since Q_n(Y_i) is unbiased by specification of the estimator above, we have

    E[Q^CV(α)] = µ_0.    (11)

Furthermore, the variance of this estimator is

    Var[Q^CV(α)] = Var[Q_n(Y_0)] + α^T Cov[Q, Q] α + 2 α^T Cov[Q, Q_n(Y_0)].    (12)

The optimal control variate weight [13], which minimizes the above variance, is then given by

    α*_CV = −C^{−1} c,    (13)

where C = Cov[Y, Y] ∈ R^{M×M} is the covariance matrix among the Y_i, c = Cov[Y, Y_0] ∈ R^M is the vector of covariances between Y_0 and each Y_i, and Y = [Y_1, ..., Y_M]^T. If we further define

    c̄ = c / √(Var[Y_0]) = [ρ_1 √(Var[Y_1]), ρ_2 √(Var[Y_2]), ..., ρ_M √(Var[Y_M])]^T,    (14)

where ρ_i is the Pearson correlation coefficient between Y_0 and Y_i, then the variance corresponding to α*_CV becomes

    Var[Q^CV(α*_CV)] = (1 − R²) Var[Q_n(Y_0)],    (15)

where

    R² = c̄^T C^{−1} c̄.    (16)

Here we see that the greater the correlation amongst models, the greater the achieved variance reduction.

Furthermore, since each Q_n(Y_i) shares the same n i.i.d. samples, we have

    Cov[Q_n(Y_i), Q_n(Y_j)] = Cov[(1/n) Σ_{k=1}^n Y_i(Z^(k)), (1/n) Σ_{k=1}^n Y_j(Z^(k))] = (1/n) Cov[Y_i, Y_j],    (17)

i.e.,

    C_Q = (1/n) C,    c_Q = (1/n) c,    (18)

where C_Q = Cov[Q, Q] and c_Q = Cov[Q, Q_n(Y_0)]. Thus, we can rewrite (13) as

    α*_CV = −C_Q^{−1} c_Q    (19)

to obtain an expression in terms of the covariances between estimators.
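As a concrete illustration of (10) and (13), the Python sketch below uses one control variate whose mean and covariances are available in closed form, so the classical CV assumptions hold exactly; the indicator models and thresholds are toy assumptions, not quantities from this paper.

import numpy as np
from scipy import stats

# Classical CV estimator (10) with M = 1: Y0 = I{Z > 3}, Y1 = I{Z > 2.5}, Z ~ N(0, 1).
rng = np.random.default_rng(1)
n = 200_000
z = rng.standard_normal(n)
y0 = (z > 3.0).astype(float)
y1 = (z > 2.5).astype(float)

mu0 = 1 - stats.norm.cdf(3.0)              # target (known here only for checking)
mu1 = 1 - stats.norm.cdf(2.5)              # known control mean
var1 = mu1 * (1 - mu1)                     # Var[Y1] (Bernoulli)
cov01 = mu0 - mu0 * mu1                    # Cov[Y0, Y1] = E[Y0 Y1] - mu0 mu1, with E[Y0 Y1] = mu0

alpha_star = -cov01 / var1                 # optimal weight (13) for M = 1
q_mc = y0.mean()                           # plain MC estimate (1)
q_cv = q_mc + alpha_star * (y1.mean() - mu1)   # CV estimate (10)

R2 = cov01**2 / (var1 * mu0 * (1 - mu0))   # squared correlation, cf. (15)-(16)
print(f"MC: {q_mc:.3e}  CV: {q_cv:.3e}  exact: {mu0:.3e}  1-R^2 = {1 - R2:.3f}")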
It is often the case that the covariances amongst the low-fidelity information sources are not available; however, these sources can be simulated to obtain estimates of these covariances. One well-analyzed strategy for estimation in this context is to generate an ensemble of K realizations of the random variables Q_n(Y_i) to estimate the required covariances and correlations [21, 23]. Because we are considering MC estimators as additional information sources, this operation requires a total of nK samples for each Y_i. The estimated optimal weight is then obtained as

    ᾱ_CV = −Ĉ^{−1} ĉ,    (20)

where Ĉ and ĉ are the sample versions of C_Q and c_Q, respectively. Hence, Ĉ and ĉ can be computed as

    Ĉ = (1/(K−1)) Σ_{j=1}^K (Q_j − Q̄)(Q_j − Q̄)^T,    ĉ = (1/(K−1)) Σ_{j=1}^K (Q_j − Q̄)(Q_n^(j)(Y_0) − Q̄_n(Y_0)),    (21)

where

    Q̄ = (1/K) Σ_{j=1}^K Q_j,    Q̄_n(Y_0) = (1/K) Σ_{j=1}^K Q_n^(j)(Y_0),

and Q_j = [Q_n^(j)(Y_1), ..., Q_n^(j)(Y_M)]^T is the vector of MC estimators using the n i.i.d. samples of the j-th batch (out of a total of K batches).

This estimated weight is then used in an ensemble estimator [21] given by

    Q̄^CV(ᾱ_CV) = (1/K) Σ_{j=1}^K Q^CV_j(ᾱ_CV) = (1/K) Σ_{j=1}^K (Q_n^(j)(Y_0) + ᾱ_CV^T (Q_j − µ_j)) = (1_K^T/K) Q(Y_0) + ᾱ_CV^T (Q̄ − µ̄),    (22)

where µ̄ = (1/K) Σ_{j=1}^K µ_j, Q(Y_0) = [Q_n^(1)(Y_0), ..., Q_n^(K)(Y_0)]^T, and 1_K is a K × 1 vector of ones. Comparing (10) and (22), we see that this estimator is identical to the estimator Q^CV(ᾱ_CV) that uses the same set of nK samples. Our analytical results in the following sections build from the ensemble form.
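The batch-means construction can be sketched in a few lines of Python; in the fragment below (toy indicator models, a known control mean, and the sample sizes are assumptions for illustration) the weight is estimated from K batch estimators as in (20)–(21) and reused in the ensemble estimator (22).

import numpy as np
from scipy import stats

# Batch-means estimation of the CV weight with M = 1 and K batches of n samples each.
rng = np.random.default_rng(2)
K, n = 20, 10_000
mu1 = 1 - stats.norm.cdf(2.5)              # known control mean (classical CV setting)

z = rng.standard_normal((K, n))
Q0 = (z > 3.0).mean(axis=1)                # Q_n^(j)(Y0), j = 1..K
Q1 = (z > 2.5).mean(axis=1)                # Q_n^(j)(Y1), j = 1..K

C_hat = Q1.var(ddof=1)                     # sample version of Cov[Q1, Q1], cf. (21)
c_hat = np.cov(Q1, Q0, ddof=1)[0, 1]       # sample version of Cov[Q1, Q0]
alpha_bar = -c_hat / C_hat                 # estimated weight (20)

Q_ens = Q0.mean() + alpha_bar * (Q1.mean() - mu1)   # ensemble estimator (22)
print(f"ensemble CV estimate: {Q_ens:.3e}, estimated weight: {alpha_bar:.3f}")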
The approximate control variate (ACV) estimator is designed to leverage the control variate framework when the analytical expectation of the additional estimators Q_n(Y_i) is not known [1]. This estimator replaces the unknown µ_i with another estimator µ_i^ACV as

    Q^ACV(α) = Q_n(Y_0) + Σ_{i=1}^M α_i (Q_n(Y_i) − µ_i^ACV) = Q_n(Y_0) + α^T (Q − µ^ACV),    (23)

where µ^ACV = [µ_1^ACV, ..., µ_M^ACV]^T. If the µ_i^ACV are unbiased, then Q^ACV(α) is unbiased. Furthermore, the ACV estimator variance is

    Var[Q^ACV(α)] = Var[Q_n(Y_0)] + α^T Cov[Q − µ^ACV, Q − µ^ACV] α + 2 α^T Cov[Q − µ^ACV, Q_n(Y_0)],    (24)

so that the optimal weight is

    α*_ACV = −Cov[Q − µ^ACV, Q − µ^ACV]^{−1} Cov[Q − µ^ACV, Q_n(Y_0)],    (25)

corresponding to the variance

    Var[Q^ACV(α*_ACV)] = (1 − R²_ACV) Var[Q_n(Y_0)],    (26)

where

    R²_ACV = Cov[Q − µ^ACV, Q_n(Y_0)]^T Cov[Q − µ^ACV, Q − µ^ACV]^{−1} Cov[Q − µ^ACV, Q_n(Y_0)] / Var[Q_n(Y_0)].

The ACV estimator is flexible in that it permits a variety of estimators to be used for the unknown means. While there may be many partitioning strategies [32], two specific schemes selected for further analysis in this paper arise from using different partitionings of the samples of Y_i for the estimators Q_n and µ_i^ACV. Let z_0, z_i and ẑ_i denote the sample sets (realizations of Z) used to compute Q_n(Y_0), Q_n(Y_i) and µ_i^ACV, respectively. The two sample partitioning strategies [1] are defined as

    ACV-IS:  z_i = z_0,    ẑ_i = z_0 ∪ z̃_i,    z̃_i ∩ z̃_j = ∅ for i ≠ j,    (27)
    ACV-MF:  z_i = z_0,    ẑ_i = z_0 ∪ (∪_{j=1}^i z̃_j),    z̃_i ∩ z̃_j = ∅ for i ≠ j,    (28)

where z̃_i and z̃_j are extra sets of samples used to estimate µ_i^ACV. In ACV-IS, the computation of Q_n(Y_0) and Q_n(Y_i) employs only the sample set z_0, while the computation of µ_i^ACV-IS uses these same samples plus a sample increment. In ACV-MF, besides sharing the sample set z_0 between Q_n(Y_0) and Q_n(Y_i), the estimation of µ_i^ACV-MF utilizes the sample set of µ_{i−1}^ACV-MF with some extra samples. We refer to [1, Fig. 2] for a visual explanation of the two strategies. No analytical results are available for the optimal sample distribution strategy in the case of finite sample sizes, and so our aim is simply to show that our analysis applies to a variety of strategies.

According to [1], the ACV-IS estimator obtains an optimal weight

    α*_ACV-IS = −[C ∘ F^ACV-IS]^{−1} [diag(F^ACV-IS) ∘ c],    (29)

where F^ACV-IS ∈ R^{M×M} has elements

    f_ij^ACV-IS = ((r_i − 1)/r_i)((r_j − 1)/r_j) if i ≠ j,  and  (r_i − 1)/r_i otherwise,    (30)

diag(•) denotes a vector whose elements are the diagonal of the matrix •, and r_i ∈ R_+ is the ratio between the total number of realizations of Y_i and the total number of evaluations of Y_0. Similarly, the ACV-MF estimator obtains an optimal weight

    α*_ACV-MF = −[C ∘ F^ACV-MF]^{−1} [diag(F^ACV-MF) ∘ c],    (31)

where F^ACV-MF ∈ R^{M×M} has elements

    f_ij^ACV-MF = (min(r_i, r_j) − 1)/min(r_i, r_j) if i ≠ j,  and  (r_i − 1)/r_i otherwise.    (32)

The corresponding correlation coefficients are

    R²_ACV-IS = a^T [C ∘ F^ACV-IS]^{−1} a,    (33)
    R²_ACV-MF = b^T [C ∘ F^ACV-MF]^{−1} b,    (34)

where a = diag(F^ACV-IS) ∘ c̄ and b = diag(F^ACV-MF) ∘ c̄. We can rewrite the ACV optimal weights in terms of the covariances amongst MC estimators C_Q and c_Q (rather than amongst the models Y_i) by substituting (18) into (29) and (31) as

    α*_ACV-IS = −[C_Q ∘ F^ACV-IS]^{−1} [diag(F^ACV-IS) ∘ c_Q],    (35)
    α*_ACV-MF = −[C_Q ∘ F^ACV-MF]^{−1} [diag(F^ACV-MF) ∘ c_Q].    (36)

In this section we describe an extension to the ACV estimator that considers unknown covariances amongst the low-fidelity information sources. This extension utilizes the same idea as [21] to construct a new estimator as an average of an ensemble of estimators. Suppose again that the optimal weight is estimated from an ensemble of K simulations, each of which employs the same number of samples. Using the sample weight ᾱ_ACV, we define the ensemble ACV estimator as

    Q̄^ACV(ᾱ_ACV) = (1/K) Σ_{j=1}^K Q^ACV_j(ᾱ_ACV) = (1/K) Σ_{j=1}^K (Q_n^(j)(Y_0) + ᾱ_ACV^T (Q_j − µ_j^ACV)) = (1_K^T/K) Q(Y_0) + (Q̄ − µ̄^ACV)^T ᾱ_ACV,    (37)

where µ̄^ACV = (1/K) Σ_{j=1}^K µ_j^ACV. The next section analyzes the variance reduction of the proposed ensemble estimators Q̄^ACV, both for the MF and IS sampling strategies.
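For reference, the Python fragment below evaluates the ACV-IS weight and variance-reduction ratio directly from (29), (30) and (33) as reconstructed above; the model covariances, the variance of Y_0, and the sample ratios are assumed numbers chosen only to illustrate the mechanics, not data from this work.

import numpy as np

# ACV-IS optimal weight (29)-(30) for M = 2 low-fidelity models (assumed covariances).
C = np.array([[1.0, 0.6],
              [0.6, 1.2]])                 # Cov[Y_i, Y_j], i, j = 1..M
c = np.array([0.85, 0.70])                 # Cov[Y_0, Y_i]
var0 = 1.0                                 # Var[Y_0]
r = np.array([10.0, 50.0])                 # r_i = (# evaluations of Y_i) / (# of Y_0)

d = (r - 1.0) / r                          # (r_i - 1)/r_i
F = np.outer(d, d)                         # off-diagonal entries of F^{ACV-IS}, cf. (30)
np.fill_diagonal(F, d)                     # diagonal entries

alpha_acv_is = -np.linalg.solve(C * F, d * c)               # weight (29); * is the Hadamard product
cbar = c / np.sqrt(var0)
R2_acv_is = (d * cbar) @ np.linalg.solve(C * F, d * cbar)   # cf. (33)
print("ACV-IS weights:", alpha_acv_is, " 1 - R^2:", 1 - R2_acv_is)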
The goal of this section is to present Theorem 1, which expresses the variances of the three ensemble estimators derived from the CV and ACV (i.e., Q̄^CV(ᾱ_CV), Q̄^ACV-IS(ᾱ_ACV-IS) and Q̄^ACV-MF(ᾱ_ACV-MF)) with respect to the number of samples, the correlation coefficient, the variance of the MC estimator Q_n(Y_0), and the expectation of a function of estimators. We use this relationship in Theorem 4 to compute explicit expressions for these variances. In Corollary 4.1 we derive lower bounds on the number of ensembles required to guarantee smaller variances of the three sample-weight estimators than that of the baseline estimator Q_n(Y_0).

The results rely on a multivariate Gaussianity assumption, and so are asymptotically true for model settings where the information sources have finite mean and variance, as an implication of the central limit theorem. That is, the results hold when the vectors {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} in the CV-based estimator and {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), Q_{n r_1}(Y_1), ..., Q_{n r_M}(Y_M)} in the ACV-based estimators have a multivariate normal distribution as n → ∞. Using the Gaussianity assumption on the vectors of MC estimators, the first theorem allows us to calculate the variances of the ensemble estimators in terms of the expectations shown below.
Theorem 1.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then we have

    Var[Q̄^CV(ᾱ_CV)] = Var[Q_n(Y_0)] (1 − R²) (1/K + E_Q[(Q̄ − µ̄_CV)^T (DD^T)^{−1} (Q̄ − µ̄_CV)]),    (38)

where R² is defined in (16).
b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. Then, we have

    Var[Q̄^e(ᾱ_e)] = Var[Q_n(Y_0)] (1 − R_e²) × (1/K + E_{Q̃_e}[(Q̄ − µ̄^e)^T [(DD^T) ∘ F^e]^{−1} [(DD^T) ∘ F_M^e ∘ (F_M^e)^T] [(DD^T) ∘ F^e]^{−1} (Q̄ − µ̄^e)]),    (39)

where Q̃_e = [Q_n(Y_1) − µ_1^e, ..., Q_n(Y_M) − µ_M^e]^T, F_M^e = diag(F^e) ⊗ 1_M, and R_e² is defined in (33) and (34).

The proof of Theorem 1 in Appendix E requires the identities provided in Appendix C and the following two useful propositions. In the first proposition, we rewrite the sample weight in terms of the centered data matrix to facilitate the calculation of the variances of the ensemble estimators.
Proposition 2. The estimated control variate weights can be written as

    ᾱ_CV = −(DD^T)^{−1} D Q(Y_0),    (40)
    ᾱ_e = −[(DD^T) ∘ F^e]^{−1} [diag(F^e) ∘ (D Q(Y_0))],    (41)

where e ∈ {ACV-IS, ACV-MF} and D is the centered data matrix

    D = [ Q_n^(1)(Y_1) − Q̄_n(Y_1)   Q_n^(2)(Y_1) − Q̄_n(Y_1)   ...   Q_n^(K)(Y_1) − Q̄_n(Y_1) ]
        [ Q_n^(1)(Y_2) − Q̄_n(Y_2)   Q_n^(2)(Y_2) − Q̄_n(Y_2)   ...   Q_n^(K)(Y_2) − Q̄_n(Y_2) ]
        [           ...                        ...              ...            ...           ]
        [ Q_n^(1)(Y_M) − Q̄_n(Y_M)   Q_n^(2)(Y_M) − Q̄_n(Y_M)   ...   Q_n^(K)(Y_M) − Q̄_n(Y_M) ].    (42)

Proof. See Appendix B.
The next step is to calculate the expectations in Theorem 1 by finding the distribution of the expressions inside these operators. Coupled with the Gaussianity assumption, the specific structure of these expressions suggests Hotelling's T² distribution [33], which is confirmed by the second proposition.

Proposition 3.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then,

    Q̄ − µ̄_CV ∼ N(0_M, C_Q/K).    (43)

b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, µ_2^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. Then,

    Q̄ − µ̄^e ∼ N(0_M, (C_Q ∘ F^e)/K).    (44)

Proof. See Appendix D.
We can obtain explicit expressions for the expectations in Theorem 1 under certain reasonable limiting conditions on the ratio of low-fidelity to high-fidelity samples r_i. Theorem 4 summarizes the variance reduction ratios of the ensemble CV-type estimators with respect to the baseline estimator Q_n(Y_0).

Theorem 4.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then,

    Var[Q^CV(ᾱ_CV)] / Var[Q_n(Y_0)] = (1 − R²) (1 + M/(K − M − 2)).    (45)

b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. If we further assume that

    (ACV-IS)  r_i ≫ 1,    (46)
    (ACV-MF)  r_i = r,    (47)

for i = 1, 2, ..., M, then

    Var[Q^e(ᾱ_e)] / Var[Q_n(Y_0)] = (1 − R_e²) (1 + a(e) M/(K − M − 2)),    (48)

where a(ACV-IS) = 1 and a(ACV-MF) = (r − 1)/r.

Proof. See Appendix F.

To guarantee variance reduction, the number of ensembles K must be bounded from below as follows.
Corollary 4.1. Suppose R² ≠ 0 and R_e² ≠ 0 for e ∈ {ACV-IS, ACV-MF}. Furthermore, let e ∈ {CV, ACV-IS, ACV-MF} and

    K > max(M + 2, B_e),    (49)

where

    B_CV = M/R² + 2,    (50)
    B_ACV-IS = M/R²_ACV-IS + 2,    (51)
    B_ACV-MF = ((r − 1)/r) M/R²_ACV-MF + M/r + 2.    (52)

Then, we have

    Var[Q^e(ᾱ_e)] / Var[Q_n(Y_0)] < 1.

Proof. See Appendix G.
The above corollary implies that in order for Var[Q^e(ᾱ_e)] < Var[Q_n(Y_0)], i.e., variance reduction of the CV-type estimators with estimated weight compared to the baseline estimator, the number of realizations K of the random variable Q^e(ᾱ_e) has to be at least max(M + 2, B_e), where B_e depends on the sample partitioning, the correlation amongst models, and the number of low-fidelity models. This corollary recovers the result in [21] for the CV estimator. From the proof of the corollary in Appendix G, the results are obtained by setting the upper bound of the ratio Var[Q^e(ᾱ_e)]/Var[Q_n(Y_0)] (i.e., y in (G.1)) equal to 1, which satisfies the constraint y + R_e² > 1 for any non-zero R_e²; that is, when y = 1, we only need the models to be correlated and do not need a specific value of the correlation coefficient. In case we know the correlation amongst models, we may choose a smaller upper bound such that y > 1 − R_e², i.e., y can take any value in the interval (1 − R_e², 1]. This reflects the principle of the CV-based methods: stronger correlation leads to smaller variance.

The corollary explicitly ties the correlation amongst models to the sampling requirements. This type of connection has previously been ignored in the general case of the ACV-like estimators. Furthermore, it provides an avenue through which to inject problem-specific information as correlations. In this respect it can be used in multi-level Monte Carlo [18] schemes for which a convergence rate for a numerical method is used to determine optimal allocations and guarantee convergence. We envision that these types of problems can also be amenable to deriving expressions for the correlation.

Finally, to choose K and the numbers of samples in practice we can solve an optimization problem in which (1) K and the numbers of samples are design variables; (2) the variance of an appropriate estimator is minimized; and (3) the total cost is bounded above and K is constrained by (49). We leave solving such a problem for future work.
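In practice the corollary translates into a one-line check on the number of ensembles; the small Python helper below returns the smallest admissible K, following the reconstruction of (49)–(52) above, with the numerical values in the example calls being assumptions for illustration only.

import math

def min_ensembles(M, R2, scheme="CV", r=None):
    """Smallest integer K satisfying K > max(M + 2, B_e) in (49)-(52)."""
    if scheme in ("CV", "ACV-IS"):
        B = M / R2 + 2
    elif scheme == "ACV-MF":
        B = (r - 1) / r * M / R2 + M / r + 2
    else:
        raise ValueError("scheme must be CV, ACV-IS, or ACV-MF")
    return math.floor(max(M + 2, B)) + 1

print(min_ensembles(M=1, R2=0.8))                               # CV with one control
print(min_ensembles(M=3, R2=0.5, scheme="ACV-MF", r=20.0))      # ACV-MF example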
(12), (24), (13) and(25) become V ar ˆ q (cid:104) Q MF ˆ q,n (cid:16) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) , α (cid:17)(cid:105) = V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + α V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + 2 α C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:105) , (55) V ar ˆ q (cid:104) Q MF-ACV ˆ q,n (cid:16) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) , α (cid:17)(cid:105) = V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + α (cid:16) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + V ar ˆ q (cid:104) µ IS (cid:105) − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105) (cid:17) + 2 α (cid:16) C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:105) − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105)(cid:17) , (56) α ∗ CV = − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:3) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) , (57) α ∗ ACV = − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:3) − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , µ IS (cid:3) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + V ar ˆ q [ µ IS ] − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105) . (58) The proposed estimators are unbiased.
Theorem 5.
Suppose supp(Y_0 p) ⊆ supp(q̂). Then Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

Proof. Let us consider the expected values of Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} with respect to q̂:

    E_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = E_q̂[Q^IS_{q̂,n}(Y_0)] + α E_q̂[Q^IS_{q̂,n}(Y_1) − µ_1],
    E_q̂[Q^MF-ACV_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = E_q̂[Q^IS_{q̂,n}(Y_0)] + α E_q̂[Q^IS_{q̂,n}(Y_1) − µ_1^IS].

Since supp(Y_0 p) ⊆ supp(q̂), which allows us to apply (6), we have E_q̂[Q^IS_{q̂,n}(Y_i)] = E[Y_i(Z)] = µ_i for i = 0, 1, and E_q̂[µ_1^IS] = E[Y_1(Z)] = µ_1. Thus,

    E_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = µ_0,
    E_q̂[Q^MF-ACV_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = µ_0,

which implies that Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

This result means that the Gaussian mixture can be used as a proposal.

Corollary 5.1. If q̂ is a Gaussian mixture as in (9), then Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

Proof. Because the π_i are probabilities and N(z; µ_i, Σ_i) > 0 for all z ∈ R^d and i = 1, 2, ..., k, the GMM in (9) has global support, that is, q̂(z) > 0 for all z ∈ R^d. This leads to supp(Y_0 p) ⊆ supp(q̂) and, hence, Theorem 5 holds.

As with any control-variate estimator, variance reduction is greater when the correlation between the estimators Q^IS_{q̂,n}(Y_0) and Q^IS_{q̂,n}(Y_1) is larger. In fact, as shown in the theorems below, for any non-zero correlation the variance of our estimator is smaller than that of the multi-fidelity importance sampling (MFIS) estimator Q^MFIS_{q̂,n} presented in [8]. Specifically, the MFIS estimator uses a proposal density that is derived from Y_1. In other words, the MFIS estimator leverages low-fidelity information sources to design the proposal distribution and then uses this proposal distribution within a standard importance sampling scheme for the high-fidelity model Y_0.

Assuming that we use the same proposal, our proposed estimator is guaranteed to have lower variance than MFIS for a certain range of the control weight. This further reduction is achieved by leveraging the low-fidelity models again as control variates. When the control variate weight is zero and the proposal distributions match, we obtain an equivalent estimator, Q^MFIS_{q̂,n}(Y_0) = Q^IS_{q̂,n}(Y_0) when α = 0. We can generally choose the weight α to achieve greater variance reduction by leveraging the correlation between Y_0 and Y_1. Formally, Theorem 6 states that if the weight α belongs to a certain range and Var[Q(Y_1)] ≠ 0, then the variance of the CV estimator Q^CV(α) is bounded above by that of the estimator Q(Y_0). Equality happens when there is no correlation between Q(Y_0) and Q(Y_1).

Theorem 6 (Range of control variate weight for CV estimator). Let Q^CV(α) = Q(Y_0) + α(Q(Y_1) − µ_1) with Var[Q(Y_1)] > 0, where Q(Y_0) and Q(Y_1) are unbiased estimators for the means of Y_0 and Y_1, respectively. Furthermore, let

    f̄ = −2 Cov[Q(Y_0), Q(Y_1)] / Var[Q(Y_1)]    (59)

denote a scaled ratio of the covariance to the variance of Q(Y_1). If f̄ ≥ 0 (resp. f̄ ≤ 0), then for a control variate weight in the range α ∈ [0, f̄] (resp. α ∈ [f̄, 0]), the variance of the CV estimator is bounded above by that of the baseline estimator, i.e.,

    Var[Q^CV(α)] ≤ Var[Q(Y_0)].

Equality is obtained for α = 0.
Proof. Let

    V(α) = Var[Q^CV(α)] − Var[Q(Y_0)]
         = (Var[Q(Y_0)] + α² Var[Q(Y_1)] + 2α Cov[Q(Y_0), Q(Y_1)]) − Var[Q(Y_0)]
         = α (α Var[Q(Y_1)] + 2 Cov[Q(Y_0), Q(Y_1)]),

where the second equality uses (12) with one low-fidelity model. Therefore, V(α) ≤ 0 if and only if either α ≥ 0 and α ≤ f̄, or α ≤ 0 and α ≥ f̄; that is, if f̄ ≥ 0 then α ∈ [0, f̄], and if f̄ ≤ 0 then α ∈ [f̄, 0].

The corollary below confirms the advantage of our estimator over the MFIS one: aside from the trivial case of independent models, using an appropriate value of α ensures variance reduction.

Corollary 6.1 (Range of control variate weight for MF estimator). Let

    f̃ = −2 Cov_q̂[Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1)] / Var_q̂[Q^IS_{q̂,n}(Y_1)],

and Var_q̂[Q^IS_{q̂,n}(Y_1)] > 0. If f̃ ≥ 0 (resp. f̃ ≤ 0), then for a control variate weight in the range α ∈ [0, f̃] (resp. α ∈ [f̃, 0]), the variance of the MF estimator is bounded above by that of the MFIS estimator, i.e.,

    Var_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] ≤ Var_q̂[Q^MFIS_{q̂,n}(Y_0)].

Equality is obtained for α = 0.
Proof. Substitute p, Q(Y_0) and Q(Y_1) in Theorem 6 with q̂, Q^IS_{q̂,n}(Y_0) and Q^IS_{q̂,n}(Y_1), respectively.

Theorem 6 can be extended straightforwardly to the approximate control variates described in Section 2.4.3.

Theorem 7 (Range of control variate weight for ACV estimator). Let

    s_1 = Var[Q(Y_1) − µ_1^ACV] > 0,
    s_2 = Cov[Q(Y_0), Q(Y_1) − µ_1^ACV].

If s_2 ≤ 0 (resp. s_2 ≥ 0), then for a control variate weight in the range α ∈ [0, −2s_2/s_1] (resp. α ∈ [−2s_2/s_1, 0]), the variance of the ACV estimator is bounded above by that of the baseline estimator, i.e.,

    Var[Q^ACV(α)] ≤ Var[Q(Y_0)].

Equality is obtained for α = 0.
Proof. This theorem is proved by following the same steps as in the proof of Theorem 6: first simplify the difference V(α) = Var[Q^ACV(α)] − Var[Q(Y_0)] using (24) with one low-fidelity model, and then find α such that V(α) ≤ 0. We skip the details for brevity.

In words, the theorem states that if the variance of Q(Y_1) − µ_1^ACV is non-zero, we can always choose α from an interval depending on the sign of Cov[Q(Y_0), Q(Y_1)] − Cov[Q(Y_0), µ_1^ACV] so that the ACV estimator has smaller variance than the MC estimator.

These results and the ensemble estimator variance reduction results provide strong motivation towards using the control-variate weight for further variance reduction even when the weight cannot be exactly estimated. These results indicate that there is a range of values of the weight that still leads to variance reduction, i.e., a weight estimate can have an error but still be beneficial. The ensemble estimator variance results provide a sufficient condition for the number of samples required to guarantee a particular size of variance reduction.

4.2 Algorithms

Algorithms 1 and 2 provide pseudocode to implement the ensemble estimators for the control variate and the approximate control variate with the importance sampling approach. These algorithms require the selection of a low-fidelity model correlated to the high-fidelity model, the input parameters to the EM algorithm, and the number of samples. Using these inputs, the algorithms produce an estimate of the expected value of the high-fidelity model. We note that Algorithm 2 uses the ACV-IS strategy; adapting it to the ACV-MF strategy is trivial and omitted for brevity.
Algorithm 1: Ensemble control variate estimator using importance sampling
Require: Y_0, Y_1: high-fidelity and low-fidelity models; p: PDF of the input random variables; µ_1: expected value of Y_1, i.e., µ_1 = E[Y_1]; n_s, τ, k_init: parameters of the EM algorithm; R²: correlation coefficient; C: target cost, or ζ: target accuracy.
Ensure: µ̂_0: estimate of the expected value of Y_0, i.e., µ̂_0 ≈ E[Y_0]; v̄_CV: estimate of the sample variance of the ensemble CV estimator.
  Compute the approximate IS density q̂ using the EM algorithm with the parameters n_s, τ, and k_init
  Determine the number of outer loops K and the number of samples n using R² and C (or ζ) (e.g., by solving an optimization problem)
  for i = 1, 2, ..., K do
      Draw n samples z_1, z_2, ..., z_n from the density q̂
      for j = 1, 2, ..., n do
          Q_j(Y_k) = Y_k(z_j) Ŵ(z_j) for k = 0, 1
      end for
      Q̄_i(Y_k) = (1/n) Σ_{j=1}^n Q_j(Y_k) for k = 0, 1
  end for
  Q̃(Y_k) = (1/K) Σ_{i=1}^K Q̄_i(Y_k) for k = 0, 1
  ĉ = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_0) − Q̃(Y_0)) (Q̄_i(Y_1) − Q̃(Y_1));   ŝ_k² = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k))² for k = 0, 1
  ᾱ = −ĉ / ŝ_1²
  µ̂_0 = Q̃(Y_0) + ᾱ (Q̃(Y_1) − µ_1);   v̄_CV = (1/K)(ŝ_0² + ᾱ² ŝ_1² + 2 ᾱ ĉ)   {(55)}
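The following fragment is a minimal Python sketch of Algorithm 1 for a scalar input; the EM/CE step is replaced by a fixed Gaussian proposal, and the models, proposal parameters, K and n are assumptions chosen only to make the listing self-contained, not settings from this paper.

import numpy as np
from scipy import stats

# Minimal sketch of Algorithm 1 for Z ~ N(0, 1) with indicator models (toy assumptions).
rng = np.random.default_rng(3)
Y0 = lambda z: (3.0 - z < 0).astype(float)       # high-fidelity indicator model
Y1 = lambda z: (2.5 - z < 0).astype(float)       # low-fidelity indicator model
mu1 = 1 - stats.norm.cdf(2.5)                    # known mean of Y1 (CV setting)

q_loc, q_scale = 3.0, 1.0                        # stand-in for the EM-fitted q_hat
p_pdf = stats.norm.pdf
q_pdf = lambda z: stats.norm.pdf(z, q_loc, q_scale)

K, n = 50, 2000                                  # outer loops and samples per loop
Qbar0, Qbar1 = np.empty(K), np.empty(K)
for i in range(K):
    z = rng.normal(q_loc, q_scale, size=n)       # draw n samples from q_hat
    w = p_pdf(z) / q_pdf(z)                      # density ratio W(z)
    Qbar0[i] = np.mean(Y0(z) * w)                # batch IS estimator of E[Y0]
    Qbar1[i] = np.mean(Y1(z) * w)                # batch IS estimator of E[Y1]

Qt0, Qt1 = Qbar0.mean(), Qbar1.mean()
c_hat = np.sum((Qbar0 - Qt0) * (Qbar1 - Qt1)) / (K - 1)
s0, s1 = Qbar0.var(ddof=1), Qbar1.var(ddof=1)
alpha = -c_hat / s1
mu0_hat = Qt0 + alpha * (Qt1 - mu1)              # ensemble CV-IS estimate
v_cv = (s0 + alpha**2 * s1 + 2 * alpha * c_hat) / K   # variance estimate, cf. (55)
print(f"estimate {mu0_hat:.3e}  (exact {1 - stats.norm.cdf(3.0):.3e}), est. var {v_cv:.2e}")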
Algorithm 2: Ensemble approximate control variate estimator using importance sampling

Require: Y_0, Y_1: high-fidelity and low-fidelity models; p: PDF of the input random variables; n_s, τ, k_init: parameters of the EM algorithm; R²: correlation coefficient; C: target cost, or ζ: target accuracy.
Ensure: µ̂_0: estimate of the expected value of Y_0, i.e., µ̂_0 ≈ E[Y_0]; v̄_IS: estimate of the sample variance of the ensemble ACV-IS estimator.
  Compute the approximate IS density q̂ using the EM algorithm with the parameters n_s, τ, and k_init
  Determine the number of outer loops K, and the numbers of samples n and m using R² and C (or ζ) (e.g., by solving an optimization problem)
  for i = 1, 2, ..., K do
      Draw n samples {z_1, z_2, ..., z_n} and m samples {z'_1, z'_2, ..., z'_m} from the density q̂
      for j = 1, 2, ..., n do
          Q_j(Y_k) = Y_k(z_j) Ŵ(z_j) for k = 0, 1
      end for
      for j = 1, 2, ..., m do
          Q'_j(Y_1) = Y_1(z'_j) Ŵ(z'_j)
      end for
      Q̄_i(Y_k) = (1/n) Σ_{j=1}^n Q_j(Y_k) for k = 0, 1
      Q̄'_i(Y_1) = (1/(n+m)) (Σ_{j=1}^n Q_j(Y_1) + Σ_{j=1}^m Q'_j(Y_1))
  end for
  Q̃(Y_k) = (1/K) Σ_{i=1}^K Q̄_i(Y_k) for k = 0, 1;   Q̃'(Y_1) = (1/K) Σ_{i=1}^K Q̄'_i(Y_1)
  ŝ_k² = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k))² for k = 0, 1;   ŝ'² = (1/(K−1)) Σ_{i=1}^K (Q̄'_i(Y_1) − Q̃'(Y_1))²
  ĉ = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_0) − Q̃(Y_0)) (Q̄_i(Y_1) − Q̃(Y_1))
  ĉ'_k = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k)) (Q̄'_i(Y_1) − Q̃'(Y_1)) for k = 0, 1
  ᾱ = −(ĉ − ĉ'_0) / (ŝ_1² + ŝ'² − 2 ĉ'_1)   {(58)}
  µ̂_0 = Q̃(Y_0) + ᾱ (Q̃(Y_1) − Q̃'(Y_1));   v̄_IS = (1/K)(ŝ_0² + ᾱ² (ŝ_1² + ŝ'² − 2 ĉ'_1) + 2 ᾱ (ĉ − ĉ'_0))   {(56)}

In this section we demonstrate the performance of our ensemble importance-sampling control variate algorithms for rare-event estimation on three examples. The first example is a simple case of estimating tail probabilities involving normal random variables. The second example is a cantilever beam whose material uncertainty is modeled as a random field discretized by the Karhunen-Loève expansion. Our third example analyzes a clamped Mindlin plate in bending under random loads and material properties. To focus on demonstrating the benefit of the control variates and exercising our theory, we focus only on problems with a single low-fidelity model.

The simplicity of the first example allows cheap evaluations of its models, and so enables us to disambiguate between sources of errors, such as lack of optimality in the proposal distribution. The second example is more realistic than the first one in the sense that it solves a system of PDEs by the finite element method, and computing output statistics on a dense mesh is often expensive. Based on similar settings to the first example of [8], our last example aims to stress the benefit of our proposed estimator on a more complex problem.

For each example we provide the governing equations, definitions of limit state functions, and the choice of HF and LF models. We then describe implementation details such as parameters of the EM algorithm and numbers of samples. In each example we aim to compare the variances of several estimators under equal costs in order to assess their performance relative to the MFIS estimator. In each example we fix the total cost to C_MFIS, the cost of the MFIS estimator, and then only use a number of high-fidelity samples n_HF and low-fidelity samples n_LF to ensure C_MFIS = n_HF C_HF + n_LF C_LF, where C_HF and C_LF are the costs of one evaluation of the HF and LF models, respectively. Note that we only compare online cost, where the low-fidelity distribution has already been computed, because the offline cost is equal for all algorithms (i.e., they all use the same biasing distribution). The values of n_HF and n_LF are determined in each example from either a stated assumption or from the runtime of the implementation.

All examples [34] are implemented in Matlab and use the cross-entropy (CE) code from [35] with some trivial modifications to integrate with our estimator. All quantities given below are dimensionless for simplicity. For convenience, the figures in this section use shortened notations for the variances of the considered estimators, i.e.,

    v_CV = Var_q̂[Q^MF_{q̂,n}],    (60)
    v_0 = Var_q̂[Q^MFIS_{q̂,n}],    (61)
    v_IS = Var_q̂[Q^MF-1_{q̂,n}],    (62)
    v̄_CV = Var_q̂[Q̄^MF_{q̂,K}],    (63)
    v̄_IS = Var_q̂[Q̄^MF-1_{q̂,K}],    (64)

where Q̄^MF_{q̂,K} and Q̄^MF-1_{q̂,K} are the ensemble estimators defined in (22) and (37) when replacing Q^CV and Q^ACV with Q^MF_{q̂,n} and Q^MF-1_{q̂,n}, respectively, and using K ensembles; and MF-1 indicates the use of the ACV-IS scheme. Estimates of the variance of each ensemble estimator are available from Algorithms 1 and 2, and for reference we recall that the expressions of v_CV and v_IS are Eqs. (55) and (56). However, these variances are only valid for cases with large K.
In cases with small K these algorithms need to be run many times to evaluate the empirical variances of the estimators. Lastly, we note that the true mean in the control variate approach is determined either from an analytical expression (the first example) or using an extra set of a very large number of samples (the second and third examples).

We first consider an analytical example where we seek to evaluate tail probabilities of Gaussians. Our aim is to explore: (1) how non-optimality in the proposal distribution affects the variance reduction, and (2) how much benefit we obtain, in both cases, over the process that uses a single-fidelity importance sampling estimator based on a low-fidelity proposal [8].

We consider a standard normal input space Z ∼ N(0, 1), and two limit-state functions g_0 and g_1. In this problem we will treat g_0 as the high-fidelity model.
The failure probabilities can be analytically computed using the standard normal cumulative distribution function Ψ, according to P_f(g_0(z) < 0) = 1 − Ψ(l_0), where l_0 = 3 is our chosen threshold. The high-fidelity limit-state function becomes

    g_0(z) = l_0 − z.

A low-fidelity model would have an error in the failure probability threshold, and we posit that such an error occurs from an incorrect specification of the threshold. In this case we use a smaller threshold l_1 < l_0 to make this model “lower-fidelity”,

    g_1(z) = l_1 − z.

Figure 1: Minimum variance ratio of v_CV to v_0, and KL divergence between the approximate low-fidelity distribution and the exact low-fidelity distribution, vs. threshold l^(i).

The cross-entropy method does not provide fine-grained control over the approximation error due to the approximation quality of the Gaussian mixture model. However, we would like to investigate how such approximation quality affects algorithm performance. To this end, this subsection investigates an alternative way to generate proposal distributions with fine-grained control on error.

Our alternative approach is to use a sequence of biasing distributions obtained by specifying an alternate set of limits g^(i)(z) = l^(i) − z, with the intermediate thresholds l^(i) increasing toward the low-fidelity threshold l_1, and to use rejection sampling to sample from them exactly. When l^(i) = l_1, we are exactly sampling from the optimal proposal for the low-fidelity model. As the threshold decreases below l_1, our proposal distribution has increasing error (as any GMM proposal would). In other words, as the intermediate thresholds l^(i) approach the LF threshold l_1, the corresponding intermediate IS densities progressively become better and ultimately recover the low-fidelity IS density. In this manner, we have disambiguated the error due to sub-optimal biasing distributions from that due to low-fidelity effects.

Figure 2: α vs. v_CV/v_0 for the smallest intermediate threshold l^(i).

Figure 1 shows two curves: the red curve is the Kullback-Leibler (KL) divergence between the approximate low-fidelity distribution and the exact low-fidelity distribution, and the blue curve is the minimum variance ratio min(v_CV/v_0) between the control-variate and the MFIS estimators. As the KL divergence converges to zero, the variance ratio min(v_CV/v_0) approaches 1. This behavior is expected because it corresponds to perfect sampling of the low-fidelity model, which leads to a zero-variance estimate for the low-fidelity model. Specifically, as this variance vanishes, the contribution to the covariance vanishes, and we recover v_CV = v_0.

We also see that as the KL divergence increases, min(v_CV/v_0) reaches a plateau. This behavior aligns well with the fact that the variance reduction must depend on the correlation amongst models, and we cannot reduce v_CV/v_0 to 0 simply by using cruder approximations of the LF biasing distribution.

Next we show that even if the estimate of the control variate weight is not extremely accurate in practice, there is still an interval of weights over which our estimator is able to achieve a larger variance reduction. Figure 2 shows the ratio of the variances v_CV and v_0 with respect to varying weight α for the proposal based on the smallest intermediate threshold. Such a dependence is quadratic according to (55). The threshold of equal performance is shown in red. It is clear from Figure 2 that there is a range of α in which the CV estimator has smaller variance than the MFIS estimator. This fact has been established in Corollary 6.1. In other words, although the CV estimator has an additional parameter to estimate, there is a range of weights for which it still improves upon the baseline estimator.
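A compact way to see the quadratic dependence of Figure 2 is to evaluate (55) directly with sample moments; in the Python sketch below a truncated normal plays the role of the exact low-fidelity-optimal proposal, and the low-fidelity threshold and sample size are assumed values, not settings from this paper.

import numpy as np
from scipy import stats

# Sweep the CV weight and evaluate the variance ratio of (55) for the Gaussian tail example.
rng = np.random.default_rng(4)
l0, l1, n = 3.0, 2.5, 100_000
q = stats.truncnorm(a=l1, b=np.inf)          # standard normal truncated to (l1, inf): q*(z) for Y1
z = q.rvs(size=n, random_state=rng)
w = stats.norm.pdf(z) / q.pdf(z)             # density ratio p(z)/q(z)

h0 = (z > l0).astype(float) * w              # per-sample IS terms for Y0
h1 = (z > l1).astype(float) * w              # per-sample IS terms for Y1 (here equal to w)
v0 = h0.var(ddof=1) / n                      # variance of the MFIS estimator
cov = np.cov(h0, h1, ddof=1)[0, 1] / n
v1 = h1.var(ddof=1) / n

for alpha in np.linspace(-1.5, 0.5, 5):
    v_cv = v0 + alpha**2 * v1 + 2 * alpha * cov          # eq. (55)
    print(f"alpha = {alpha:+.2f}   v_CV / v_0 = {v_cv / v0:.3f}")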
We now switch from the exact biasing-distribution sampling to the EM algorithm approach of Section 2.3. This algorithm is deployed to construct the approximate density from the low-fidelity model (i.e., the threshold l_1), from which samples are drawn to calculate the estimated variances in Figure 3. The input parameters of the EM algorithm are n_s = 3000 samples, the parameter τ, and k_init = 3 initial mixture components. The variance of the baseline estimator v_0 is calculated using a large reference set of samples. We assume further that for this example the HF model is 30 times more expensive to evaluate than the LF model. As shown in Table 2, the number of samples used to calculate the other estimators is chosen such that they consume the same cost as the baseline estimator. To compute the variances of the ensemble estimators, i.e., v̄_CV and v̄_IS, we utilize the sample sets of the corresponding component estimators (i.e., Q^MF_{q̂,n} and Q^MF-1_{q̂,n}) but divide each of them into K = 1000 batches, and then apply Algorithms 1 and 2.

Table 2: Numbers of high-fidelity and low-fidelity samples (n_HF, n_LF) allocated to the estimators v_0, v_CV, and v_IS.

Figure 3 shows that:
1. the ACV estimators Q^MF_{q̂,n} and Q^MF-1_{q̂,n} have smaller variances than that of MFIS over a certain range of weights;
2. as reported in [1], v_CV is the smallest among the estimators since it uses the exact mean of Y_1;
3. since v̄_CV and v̄_IS utilize the estimated optimal control weight, their variances approximate the minima of v_CV and v_IS;
4. even with small K, the ensemble estimators ṽ_CV and ṽ_IS are still smaller than v_0, in accordance with Corollary 4.1.

Figure 3: α vs. variance ratios using the EM algorithm with GMM; ṽ_CV = Var_q̂[Q̄^MF_{q̂,K}] and ṽ_IS = Var_q̂[Q̄^MF-1_{q̂,K}] with small K (i.e., K = 4).

As a reference, Figure 4 shows the HF, LF and approximate densities, denoted by q_0, q_1 and q̂, respectively. Here, q̂ is the approximate optimal density obtained via the GMM approximation. Being the cross-entropy approximation to the LF density, the function q̂ fluctuates around q_1; in particular, the vertical jump of q_1 at the threshold l_1 is estimated by a very steep curve.

Figure 4: High-fidelity, low-fidelity and approximate densities.

In this section, we compare estimators for a cantilever beam with uncertain material properties. Figure 5 displays the cantilever beam, the two-dimensional domain used in this example. The beam is fixed on its left side, subject to a vertical load at its upper-right corner, and has dimensions of L_1 × L_2 unit lengths. Let X = [0, L_1] × [0, L_2] ⊂ R² denote the domain in Figure 5.

Figure 5: The cantilever beam.

Considering a rectangular Cartesian coordinate system and the Einstein summation convention, the governing equations [36] of a two-dimensional, infinitesimal-strain, linear elastic problem without body force are given as

    σ_{ij,j} = 0,    σ_{ij} = λ ε_{kk} δ_{ij} + 2 µ ε_{ij},    (65)
    ε_{ij} = (1/2)(u_{i,j} + u_{j,i}),    u_i(0, x_2) = 0,    (66)

where the (•)_{,j} subscript stands for ∂(•)/∂x_j; σ_{ij} is the Cauchy stress tensor, u_i the displacement, ε_{ij} the strain, and δ_{ij} the Kronecker delta. The boundary condition (66) reflects the fixed left side of the beam. In (65), λ and µ are the Lamé constants, which can be calculated from the Young's modulus E and Poisson's ratio ν as

    λ = Eν / ((1 + ν)(1 − 2ν)),    µ = E / (2(1 + ν)).

The equations are solved using the finite element method, whose mesh is assembled from square, linear, plane-stress elements with unit thickness. Those elements are made of an isotropic, linear elastic material characterized by E and ν.
In this section, we compare estimators for a cantilever beam with uncertain material properties. Figure 5 displays the cantilever beam, the two-dimensional domain used in this example. The beam is fixed on its left side, is subject to a vertical load at its upper-right corner, and has dimensions $L_1 \times L_2$. Let $X = [0, L_1] \times [0, L_2] \subset \mathbb{R}^2$ denote the domain in Figure 5. Considering a rectangular Cartesian coordinate system and the Einstein summation convention, the governing equations [36] of a two-dimensional, infinitesimal-strain, linear elastic problem without body force are

\sigma_{ij,j} = 0, \qquad \sigma_{ij} = \lambda \epsilon_{kk}\delta_{ij} + 2\mu\epsilon_{ij},   (65)

\epsilon_{ij} = \tfrac{1}{2}(u_{i,j} + u_{j,i}), \qquad u_i(0, x_2) = 0,   (66)

where the $(\bullet)_{,j}$ subscript stands for $\partial(\bullet)/\partial x_j$; $\sigma_{ij}$ is the Cauchy stress tensor, $u_i$ the displacement, $\epsilon_{ij}$ the strain, and $\delta_{ij}$ the Kronecker delta. The boundary condition (66) reflects the fixed left side of the beam. In (65), $\lambda$ and $\mu$ are the Lamé constants, which can be calculated from the Young's modulus $E$ and Poisson's ratio $\nu$ as

\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.

The equations are solved using the finite element method, whose mesh is assembled from square, linear, plane-stress elements with unit thickness. Those elements are made of an isotropic, linear elastic material characterized by $E$ and $\nu$.

Figure 5: The cantilever beam.

We treat Young's modulus as the uncertain quantity in this problem. Young's modulus must be positive and finitely bounded, so we model it as a uniform random field obtained by transforming a two-dimensional Gaussian random field with the covariance

K(\mathbf{s}, \mathbf{t}) = \exp\!\left(-\frac{(s_1 - t_1)^2}{r_1^2}\right)\exp\!\left(-\frac{(s_2 - t_2)^2}{r_2^2}\right) = K_1(s_1, t_1)\, K_2(s_2, t_2) \quad \text{for } \mathbf{s}, \mathbf{t} \in X,   (67)

where $r_1$ and $r_2$ are the correlation lengths in the two coordinate directions. The transformation [37] is performed by

E(\mathbf{x}, \omega) = F^{-1} \circ [\Phi(y(\mathbf{x}, \omega))] \quad \text{for } \mathbf{x} \in X \text{ and } \omega \in \Omega,   (68)

where $F^{-1}$ is the inverse of a prescribed CDF, $\Phi(\cdot)$ the standard normal CDF, $y(\mathbf{x}, \omega)$ a stationary zero-mean Gaussian random field, and $E(\mathbf{x}, \omega)$ the Young's modulus. Choosing $F^{-1}$ as the inverse of a uniform CDF, we have

E(\mathbf{x}, \omega) = a + (b - a)\,\Phi(y(\mathbf{x}, \omega)),   (69)

where $a$ and $b$ are the lower and upper bounds of the uniform distribution. In practice the random field $y(\mathbf{x}, \omega)$ needs to be discretized by an appropriate method such as the Expansion Optimal Linear Estimator [38] or polynomial chaos expansion [39, 40]. Among such methods, the Karhunen-Loève (KL) expansion minimizes the mean squared error [41, 42], resulting in the smallest number of terms in the expansion for a required accuracy [43]. The KL expansion of the random field $y(\mathbf{x}, \omega)$ is

y(\mathbf{x}, \omega) = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \xi_i(\omega)\, \psi_i(\mathbf{x}),   (70)

where $\xi_i(\omega)$ are standard normal random variables. The eigenvalues $\lambda_i$ and the corresponding orthogonal eigenfunctions $\psi_i(\mathbf{x})$ are solutions of the eigenvalue problem

\int_D K(\mathbf{s}, \mathbf{t})\, \psi_i(\mathbf{t})\, d\mathbf{t} = \lambda_i\, \psi_i(\mathbf{s}),   (71)

where $K(\mathbf{s}, \mathbf{t})$ is the covariance function of the random field. Two practical issues have to be considered. First, the infinite KL expansion (70) must be truncated to be computable. Second, the integral equation (71) is not trivial to solve on a high-dimensional domain. Thus, the separability of the covariance function (67) is exploited, leading to separable eigenvalues and eigenfunctions [42]:

\lambda_i = \lambda_{i_1}\lambda_{i_2}, \qquad \psi_i(\mathbf{x}) = \psi_{i_1}(x_1)\,\psi_{i_2}(x_2),   (72)

where $\lambda_{i_j}$ and $\psi_{i_j}(x_j)$, $j \in \{1, 2\}$, are the eigenvalues and eigenfunctions of the integral equation (71) using the covariance function $K_j(s_j, t_j)$ in (67), and the $\lambda_i$ are arranged in decreasing order.

Figure 6: Young's modulus for a fixed sample $\{\xi_1, \ldots, \xi_{n_{KL}}\}$: (a) high-fidelity model, (b) low-fidelity model.

Figure 7: Mesh deformation for the same sample: (a) high-fidelity model, (b) low-fidelity model.

In this example the following numerical values are used: Poisson's ratio $\nu = 0.$; correlation lengths $r_1 = 60$ and $r_2 = 20$; beam dimensions $L_1 \times L_2 = 0. \times .$; and bounds of the Young's modulus $a = 1$ and $b = 2$. A minimal numerical sketch of the resulting random-field construction is given below.
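The sketch below illustrates the separable KL construction: it discretizes the two one-dimensional eigenproblems (71) on a grid (a simple Nyström-type approximation), forms the 2D eigenpairs as products (72), and maps the truncated Gaussian field to a uniform Young's modulus via (69). The unit-square domain, the correlation lengths, the grid resolution, and the function names are our own illustrative assumptions, not the values used for the beam above.

```python
import numpy as np
from scipy.stats import norm

def kl_1d(cov_fn, length, n_grid, n_modes):
    """Nystrom-type discretization of the 1D eigenproblem
    int K(s,t) psi(t) dt = lambda psi(s) on [0, length]."""
    s = (np.arange(n_grid) + 0.5) * length / n_grid       # midpoint grid
    w = length / n_grid                                     # quadrature weight
    K = cov_fn(s[:, None], s[None, :])
    lam, vec = np.linalg.eigh(K * w)                        # symmetric eigensolve
    idx = np.argsort(lam)[::-1][:n_modes]                   # keep largest eigenvalues
    return lam[idx], vec[:, idx] / np.sqrt(w), s            # L2-normalized modes

# Separable squared-exponential covariances (illustrative correlation lengths).
cov1 = lambda s, t: np.exp(-((s - t) ** 2) / 0.3 ** 2)
cov2 = lambda s, t: np.exp(-((s - t) ** 2) / 0.1 ** 2)
lam1, psi1, x1 = kl_1d(cov1, length=1.0, n_grid=200, n_modes=5)
lam2, psi2, x2 = kl_1d(cov2, length=1.0, n_grid=200, n_modes=5)

# 2D eigenpairs as products (eq. (72)), sorted by decreasing eigenvalue, truncated.
pairs = sorted(((lam1[i] * lam2[j], i, j) for i in range(5) for j in range(5)),
               reverse=True)[:10]

def young_modulus(xi, a=1.0, b=2.0):
    """Truncated KL realization of y(x, omega), mapped to a uniform field (eq. (69))."""
    y = sum(np.sqrt(l) * xi[k] * np.outer(psi1[:, i], psi2[:, j])
            for k, (l, i, j) in enumerate(pairs))
    return a + (b - a) * norm.cdf(y)

E_field = young_modulus(np.random.default_rng(1).standard_normal(10))
```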
The limit-state functions are given as $g_i(u) = l_i - u$, $i \in \{1, 2\}$, where $u$ is the vertical displacement at the load application point. The thresholds $l_i$ are chosen so that the failure probabilities $P_f(g_i(u) < 0)$, computed using reference samples of $\{\xi_i(\omega)\}_{i=1}^{n_{KL}}$, are small and different; specifically, $l_1 = 118.$, $l_2 = 108.$, with reference values $P_f(g_1(u) < 0) = 0.$ and $P_f(g_2(u) < 0) = 0.$.

We utilize two different meshing schemes as the multi-fidelity models: the high-fidelity model corresponds to the finer mesh and the low-fidelity model to the coarser mesh. The models also have different, but shared, sources of uncertainty. For the high-fidelity model, $u$ depends on the Young's modulus $E(\mathbf{x}, \omega)$, which is transformed according to (69) using a truncated KL expansion $y(\mathbf{x}, \omega)$ from (70); the stiffness matrix of each finite element is calculated using the values of the eigenfunctions at the element's centroid coordinates. For the low-fidelity model, the Young's modulus is just a uniform random variable $E(\omega) = a + (b - a)\Phi(y(\omega))$, where

y(\omega) = \sum_{i=1}^{n_{KL}} \sqrt{\lambda_i}\, \xi_i(\omega)\, \psi_i(\bar{\mathbf{x}})   (73)

and $\bar{\mathbf{x}}$ is the coordinate of the beam centroid. For both models, the number of $\lambda_{i_j}$ and $\psi_{i_j}(x_j)$ is $n_j = 5$, and the number of $\lambda_i$ and $\psi_i(\mathbf{x})$ is $n_{KL} = 10$.

Figures 6 and 7 show the Young's modulus and the mesh deformation of the beam for the high- and low-fidelity models when all the random variables $\xi_i(\omega)$ take the same fixed value; the displacements are scaled for visualization. It is clear from Figure 6 that the Young's modulus is spatially variable for the high-fidelity model while it is constant for the low-fidelity model.

Figure 8: $\alpha$ vs. variance ratios for the cantilever beam.

The EM algorithm uses inputs $n_s = 5000$, $\tau = 0.$, and $k_{init} = 5$. The baseline estimator $Q^{MFIS}_{\hat{q},n}$ takes $n = 4 \times$ samples to compute its variance. We first estimate the ratio between $C_{HF}$ and $C_{LF}$, the costs of one evaluation of the HF and LF models, respectively. Then, we allocate the samples to each model as shown in Table 3, using this cost ratio so that the total cost of model evaluations equals that of the baseline estimator; a short sketch of this budget matching is given at the end of this example.

Table 3: Estimator variances ($v$, $v^{CV}$, $v^{IS}$) and sample allocations ($n_{HF}$, $n_{LF}$) for the cantilever beam.

The results of this example are presented in Figure 8. Again, our estimators perform better than the MFIS estimator, and the CV estimator shows favorable results compared with the ACV schemes.
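The cost-matched allocation can be summarized as follows: given the per-evaluation costs and a chosen ratio of low- to high-fidelity samples, the number of high-fidelity samples is set so that the total cost equals the baseline MFIS budget. The sketch below is our own schematic of this bookkeeping; the numerical inputs are illustrative only, and the actual ratios behind Table 3 are determined by the ACV schemes and are not reproduced here.

```python
def cost_matched_allocation(n_baseline, cost_hf, cost_lf, r):
    """Allocate (n_HF, n_LF) so that n_HF*cost_hf + n_LF*cost_lf equals the
    baseline budget n_baseline*cost_hf, given the ratio r = n_LF / n_HF."""
    budget = n_baseline * cost_hf
    n_hf = int(budget / (cost_hf + r * cost_lf))
    n_lf = int(r * n_hf)
    return n_hf, n_lf

# Illustrative numbers only (the paper's exact budget, costs, and ratio differ).
n_hf, n_lf = cost_matched_allocation(n_baseline=4 * 10**5, cost_hf=30.0,
                                     cost_lf=1.0, r=20.0)
```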
The last example is a modified version of the one provided in [8], where the authors derive the MFIS estimator. While the MFIS estimator is able to achieve impressive speedups of up to several orders of magnitude compared to the MC method, we demonstrate that further variance reduction is still possible by employing control variates.

Figure 9: The Mindlin plate [8].

Let $\mathcal{Y} \subset \mathbb{R}^2$ denote the square domain of the clamped Mindlin plate in Figure 9; $\mathcal{E}$ the four edges of the plate; $\theta_1$ and $\theta_2$ the rotations of the normal to the plate middle plane with respect to the axes $x_1$ and $x_2$, respectively; and $w$ the displacement of the middle plane in the (out-of-plane) $x_3$-direction. The governing equations of Mindlin's theory of plates in static equilibrium are

M_{ij,j} - Q_i = 0, \qquad Q_{i,i} + s = 0, \qquad \theta_1(\mathcal{E}) = 0, \quad \theta_2(\mathcal{E}) = 0, \quad w(\mathcal{E}) = 0,   (74)

where $M_{ij}$ and $Q_i$ are the moment and shear resultants, e.g.,

M_{11} = D\left(\frac{\partial \theta_1}{\partial x_1} + \nu \frac{\partial \theta_2}{\partial x_2}\right), \qquad Q_1 = \kappa G h\left(\theta_1 + \frac{\partial w}{\partial x_1}\right),

and so on; $D$, $G$, $\kappa$, $h$, and $\nu$ are the bending rigidity, shear modulus, shear correction factor, plate thickness, and Poisson's ratio, respectively; and $s$ is a transverse load. We refer to [44, 45] for a complete treatment of the plate theory with detailed equations. The boundary conditions in (74) state that for a clamped plate there are no rotations and no displacement along the edges of the plate, which is made of an isotropic, linear elastic material with Young's modulus E = 10
and Poisson's ratio $\nu = 0.$ Following the settings in [8], we divide the plate into four regions $\{\mathcal{Y}_i\}_{i=1}^{4}$, each of which has a random thickness $h_i$ and is subject to a random load $s_i$. Both $h_i$ and $s_i$ are uniformly distributed, i.e., $\{h_i\}_{i=1}^{4} \sim U(0., .)$ and $\{s_i\}_{i=1}^{4} \sim U(1, .)$. According to the first-order shear deformation theory [44], the shear correction factor is $\kappa = 5/6$.

In this example the high-fidelity model uses a finer mesh of square bilinear isoparametric (Q4) elements while the low-fidelity model uses a coarser mesh. The Matlab code from [46, Chapter 12] is adopted for the finite element analysis. Figure 10 shows the plate deformation for both models for a fixed realization of $\{h_i, s_i\}_{i=1}^{4}$. The limit-state functions are defined as $g_i(\mathbf{z}) = l_i - w_{c_i}(\mathbf{z})$, $i \in \{1, 2\}$, where $\mathbf{z} = \{\{h_i\}_{i=1}^{4}, \{s_i\}_{i=1}^{4}\} \subset \mathbb{R}^8$ is a realization of the input random variables and $w_{c_i}(\mathbf{z})$ is the $x_3$-direction displacement of the plate centroid.

Figure 10: Plate deformation for a fixed sample $\{h_i, s_i\}_{i=1}^{4}$: (a) high-fidelity model, (b) low-fidelity model.

The parameters used in the EM algorithm are $n_s = 5000$, $\tau = 0.$, and $k_{init} = 5$. The baseline estimator $Q^{MFIS}_{\hat{q},n}$ is evaluated using $n = 4 \times$ samples for its variance. As shown in Table 4, the cost of the other estimators is guaranteed to equal that of the baseline estimator by appropriate sample allocation, using the number of ensembles $K = 1000$ and the empirically estimated cost ratio between $C_{HF}$ and $C_{LF}$.

Table 4: Estimator variances ($v$, $v^{CV}$, $v^{IS}$) and sample allocations ($n_{HF}$, $n_{LF}$) for the Mindlin plate.

Figure 11: $\alpha$ vs. variance ratios for the Mindlin plate.

The variance ratios are shown in Figure 11. First, we stress that since the MFIS estimator does not take into account the correlations among models, it cannot exploit multi-fidelity modeling to the fullest extent, and thus our estimators achieve clear advantages over it. Second, comparing $v^{IS}$ with $v^{CV}$, the estimator becomes less efficient with an estimated weight, as concluded in [1]. Third, as explained in the first example, $\bar{v}^{CV}$ and $\bar{v}^{IS}$ tightly follow the minima of $v^{CV}$ and $v^{IS}$, respectively. Finally, even with a sub-optimal weight the performance of these estimators is still better than that of the MFIS estimator over a significant range of weights, as specified by Theorem 7.

In this paper we have developed an ensemble estimator for approximate control variate schemes that provides a mechanism to estimate unknown covariances, in addition to unknown means. This contribution has allowed us to provide theoretical bounds on the number of samples required to guarantee a certain variance reduction. Furthermore, this guarantee depends upon a correlation coefficient that is problem dependent. The second contribution is applying the framework in the context of importance sampling. We show that the approximate control variate can further reduce the variance compared to the MFIS approach described in [8], and we are able to achieve considerably greater variance reduction with this approach on several problems of computational mechanics.

Future work will seek to study values of the correlation coefficient that can be derived from the underlying problem, similar to what is done in multi-level MC for multi-fidelity models arising from varying discretizations. Another line of work is extending the importance sampling techniques to include several low-fidelity models. One challenge to overcome is effectively balancing the cost of computing a biasing distribution using the low-fidelity model against the variance reduction that it provides.
Indeed, current approaches to multi-fidelity importance sampling tend to ignore this computational aspect. Finally, as an effective variance reduction technique, our estimators have extensive application potential in expensive UQ problems, including optimization under uncertainty and, in particular, reliability-based and robust optimization.
We thank Gianluca Geraci, John Jakeman, Mike Eldred, and Teresa Portone for helpful discussions surrounding this paper. This project was funded by the Sandia National Laboratories LDRD program.

A The expectation-maximization algorithm
The Kullback-Leibler (KL) divergence between the optimal density $q^*(\mathbf{z})$ and the approximate density $\hat{q}(\mathbf{z})$ is

D(q^*(\mathbf{z}), \hat{q}(\mathbf{z})) = \mathbb{E}_{q^*}\left[\ln\!\left(\frac{q^*(\mathbf{z})}{\hat{q}(\mathbf{z})}\right)\right] = \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(q^*(\mathbf{z}))\, d\mathbf{z} - \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(\hat{q}(\mathbf{z}))\, d\mathbf{z}.

The CE method aims to minimize the KL divergence to find the unknowns in the GMM (9). Let us gather the unknown parameters into the vector $\mathbf{v} = \{\pi_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i;\ i = 1, 2, \ldots, k\}$. Then the optimization problem can be equivalently written as

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, d\mathbf{z},   (A.1)

because the first term of the KL divergence is independent of $\hat{q}$; here $\hat{q}(\mathbf{z}; \mathbf{v})$ stresses the presence of the parameters of the GMM (9). Replacing $Y(\mathbf{z})$ with $I_G(\mathbf{z})$ in (8) and inserting this expression into (A.1), we obtain

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} I_G(\mathbf{z})\, p(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, d\mathbf{z}.   (A.2)

Another sampling density $\hat{q}(\mathbf{z}; \mathbf{w})$, which has the same form as $\hat{q}(\mathbf{z}; \mathbf{v})$ but with a different parameter vector $\mathbf{w}$, is introduced to facilitate the optimization algorithm:

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} I_G(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, \widehat{W}(\mathbf{z}; \mathbf{w})\, \hat{q}(\mathbf{z}; \mathbf{w})\, d\mathbf{z} = \max_{\mathbf{v}} \mathbb{E}_{\hat{q}(\mathbf{z}; \mathbf{w})}\!\left[I_G(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, \widehat{W}(\mathbf{z}; \mathbf{w})\right] \approx \max_{\mathbf{v}} \frac{1}{n_s}\sum_{i=1}^{n_s} I_G(\mathbf{z}_i) \ln(\hat{q}(\mathbf{z}_i; \mathbf{v}))\, \widehat{W}(\mathbf{z}_i; \mathbf{w}),   (A.3)

where $\mathbf{z}_i$, $i = 1, 2, \ldots, n_s$, are samples drawn from $\hat{q}(\mathbf{z}; \mathbf{w})$, and $\widehat{W}(\mathbf{z}; \mathbf{w}) = p(\mathbf{z})/\hat{q}(\mathbf{z}; \mathbf{w})$. It is noted that by choosing an appropriate joint likelihood $h(\hat{\mathbf{z}} \mid \mathbf{v}) = \prod_{i \in \hat{n}_s} \hat{q}(\mathbf{z}_i \mid \mathbf{v})^{\widehat{W}(\mathbf{z}_i)}$, the optimization problem (A.3) is equivalent to the maximum log-likelihood estimation (MLE) problem

\hat{\mathbf{v}} = \arg\max_{\mathbf{v}} \ln(h(\hat{\mathbf{z}} \mid \mathbf{v})) = \arg\max_{\mathbf{v}} \sum_{i \in \hat{n}_s} \ln(\hat{q}(\mathbf{z}_i \mid \mathbf{v}))\, \widehat{W}(\mathbf{z}_i),   (A.4)

where $\hat{\mathbf{z}} = \{\mathbf{z}_i\}_{i \in \hat{n}_s}$ and $\hat{n}_s = \{i \in n_s : I_G(\mathbf{z}_i) \neq 0\}$. The EM algorithm is an iterative method to find $\hat{\mathbf{v}}$, which is also the solution of (A.3). Let $\hat{\mathbf{v}}^{(m)}$ denote the parameter vector at the $m$-th iteration. In [47] it is shown that

\hat{\mathbf{v}}^{(m+1)} = \arg\max_{\mathbf{v}} \mathbb{E}_{X \mid \hat{\mathbf{z}}, \hat{\mathbf{v}}^{(m)}}[\ln h(\mathbf{x} \mid \mathbf{v})] = \arg\max_{\mathbf{v}} Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}),   (A.5)

where $X$ is the complete data set. Using the GMM (9) and $n_\tau = \tau \hat{n}_s$, where $\tau \in\, ]0, 1]$ is a fixed value used to identify the intermediate failure domains, we have [29, 31]

Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}) = \sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i) \sum_{j=1}^{k} \gamma_{ij}^{(m)} \ln\!\big(\pi_j\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\big),   (A.6)

\gamma_{ij}^{(m)} = \frac{\pi_j^{(m)}\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_j^{(m)}, \boldsymbol{\Sigma}_j^{(m)})}{\sum_{r=1}^{k} \pi_r^{(m)}\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_r^{(m)}, \boldsymbol{\Sigma}_r^{(m)})}.   (A.7)

The updating scheme is then derived by solving the optimization problem

\max_{\mathbf{v}} Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}) \quad \text{subject to} \quad \sum_{i=1}^{k} \pi_i = 1, \quad \pi_i \geq 0,\ i = 1, 2, \ldots, k, \quad \boldsymbol{\Sigma}_i \succ 0,\ i = 1, 2, \ldots, k,   (A.8)

where the first and second constraints enforce the $\pi_i$ to be probabilities, and the last constraint renders the covariance matrices positive definite. Using (A.6), (A.7), and the method of Lagrange multipliers, the updating equations [29, 31] are

\nu_j^{(m)} = \sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)}, \qquad \pi_j^{(m+1)} = \frac{\nu_j^{(m)}}{\sum_{r=1}^{k} \nu_r^{(m)}}, \qquad \boldsymbol{\mu}_j^{(m+1)} = \frac{\sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)}\, \mathbf{z}_i}{\nu_j^{(m)}}, \qquad \boldsymbol{\Sigma}_j^{(m+1)} = \frac{\sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)} (\mathbf{z}_i - \boldsymbol{\mu}_j^{(m+1)})(\mathbf{z}_i - \boldsymbol{\mu}_j^{(m+1)})^T}{\nu_j^{(m)}}, \qquad j = 1, 2, \ldots, k.
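For concreteness, the following sketch implements one pass of the weighted updating equations above for a Gaussian mixture with per-sample importance weights $\widehat{W}(\mathbf{z}_i)$. It is a minimal illustration of the E- and M-steps only, under our own naming, and not the full cross-entropy algorithm (which also adapts the intermediate failure domains and the number of components).

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_em_step(z, W, pis, mus, covs):
    """One EM iteration for a GMM with per-sample importance weights W,
    following the weighted updating equations for (pi_j, mu_j, Sigma_j)."""
    k = len(pis)
    # E-step: responsibilities gamma_ij as in (A.7).
    dens = np.column_stack([pis[j] * multivariate_normal.pdf(z, mus[j], covs[j])
                            for j in range(k)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step with importance weights.
    wg = W[:, None] * gamma                      # \hat{W}(z_i) * gamma_ij
    nu = wg.sum(axis=0)                          # nu_j
    pis_new = nu / nu.sum()
    mus_new = (wg.T @ z) / nu[:, None]
    covs_new = []
    for j in range(k):
        diff = z - mus_new[j]
        covs_new.append((wg[:, j, None] * diff).T @ diff / nu[j])
    return pis_new, mus_new, np.array(covs_new)
```

Iterating this step on the weighted samples from the intermediate failure domain yields the GMM parameters of the biasing density $\hat{q}$.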
B Proof of Proposition 2
We rewrite $\hat{\mathbf{C}}$ and $\hat{\mathbf{c}}$ using (21) and (42) as

\hat{\mathbf{C}} = \frac{\mathbf{D}\mathbf{D}^T}{K-1},   (B.1)

\hat{\mathbf{c}} = \frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1}.   (B.2)

Substituting (B.1) and (B.2) into (20), (35), and (36), we obtain

\bar{\boldsymbol{\alpha}}^{CV} = -\hat{\mathbf{C}}^{-1}\hat{\mathbf{c}} = -\left(\frac{\mathbf{D}\mathbf{D}^T}{K-1}\right)^{-1}\frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1} = -\left(\mathbf{D}\mathbf{D}^T\right)^{-1}\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right),   (B.3)

and

\bar{\boldsymbol{\alpha}}^{e} = -\left[\hat{\mathbf{C}} \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \hat{\mathbf{c}}\right] = -\left[\frac{\mathbf{D}\mathbf{D}^T}{K-1} \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1}\right] = -\left[\left(\mathbf{D}\mathbf{D}^T\right) \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \left(\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)\right)\right],   (B.4)

respectively. Because $\mathbf{D}$ is the centered data matrix, whose $(i, j)$ entry is $Q^{(j)}_n(Y_i) - \bar{Q}_n(Y_i)$, its rows sum to zero:

\mathbf{D}\mathbf{1}_K = \mathbf{0}_M.   (B.5)

Since $\mathbf{D}\mathbf{1}_K = \mathbf{0}_M$, $\bar{\boldsymbol{\alpha}}^{CV}$ and $\bar{\boldsymbol{\alpha}}^{e}$ become

\bar{\boldsymbol{\alpha}}^{CV} = -\left(\mathbf{D}\mathbf{D}^T\right)^{-1}\mathbf{D}\mathbf{Q}(Y_0), \qquad \bar{\boldsymbol{\alpha}}^{e} = -\left[\left(\mathbf{D}\mathbf{D}^T\right) \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \left(\mathbf{D}\mathbf{Q}(Y_0)\right)\right].
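As a quick numerical sanity check of the simplification above, the snippet below verifies that centering the rows of the data matrix makes $\mathbf{D}\mathbf{1}_K$ vanish, so that the mean term drops out of $\bar{\boldsymbol{\alpha}}^{CV}$; the matrix sizes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 50
Q_lf = rng.normal(size=(M, K))               # batch estimates Q^(j)_n(Y_i)
Q_hf = rng.normal(size=K)                    # batch estimates Q^(j)_n(Y_0)

D = Q_lf - Q_lf.mean(axis=1, keepdims=True)  # centered data matrix
ones = np.ones(K)

assert np.allclose(D @ ones, 0.0)            # eq. (B.5): D 1_K = 0_M

alpha_full = -np.linalg.solve(D @ D.T, D @ (Q_hf - Q_hf.mean() * ones))
alpha_simplified = -np.linalg.solve(D @ D.T, D @ Q_hf)
assert np.allclose(alpha_full, alpha_simplified)
```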
C Useful matrix algebra identities

The proposition below provides several identities used to manipulate the expressions for the variances of the ensemble estimators.

Proposition 8. Let $\mathbf{A} \in \mathbb{R}^{M \times M}$, $\mathbf{B} \in \mathbb{R}^{M \times K}$, $\mathbf{V} \in \mathbb{R}^{K \times M}$, and $\mathbf{v} \in \mathbb{R}^{K}$. Then we have the identities

\mathrm{diag}(\mathbf{A}) \circ (\mathbf{B}\mathbf{v}) = \left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{v},   (C.1)

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{V} = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathbf{B}\mathbf{V}),   (C.2)

\mathbf{V}^T\left(\mathbf{B} \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K)\right)^T = \left(\mathbf{V}^T\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T,   (C.3)

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T = \left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T,   (C.4)

where $\circ$ is the Hadamard product and $\otimes$ is the outer product.

Proof. Let $[\star]_i$ ($[\star]_{ij}$) denote an entry of the vector (matrix) $\star$. We prove the first two identities by showing that the entries of both sides are equal:

[\mathrm{diag}(\mathbf{A}) \circ (\mathbf{B}\mathbf{v})]_i = a_{ii}\sum_{j=1}^{K} b_{ij} v_j = \sum_{j=1}^{K} (a_{ii} b_{ij}) v_j = \left[\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{v}\right]_i,

\left[\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{V}\right]_{ij} = \sum_{k=1}^{K} a_{ii} b_{ik} v_{kj} = a_{ii}\sum_{k=1}^{K} b_{ik} v_{kj} = \left[(\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathbf{B}\mathbf{V})\right]_{ij}.

The third identity follows by transposing both sides of the second:

\mathbf{V}^T\left(\mathbf{B} \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K)\right)^T = \left(\mathbf{B}\mathbf{V}\right)^T \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T = \left(\mathbf{V}^T\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T.

The final identity follows by applying the second and third identities consecutively, and by noting that the Hadamard product is commutative and associative:

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ \left(\mathbf{B}\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T\right) = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ \left(\left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T\right) = \left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T.
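These identities are straightforward to check numerically. The snippet below verifies (C.1)-(C.4) for random matrices, with diag(A) interpreted as the column vector of diagonal entries and the outer product taken with a vector of ones; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 6
A = rng.normal(size=(M, M))
B = rng.normal(size=(M, K))
V = rng.normal(size=(K, M))
v = rng.normal(size=K)

d = np.diag(A)                    # diag(A) as a vector in R^M
DK = np.outer(d, np.ones(K))      # diag(A) (outer) 1_K  (M x K)
DM = np.outer(d, np.ones(M))      # diag(A) (outer) 1_M  (M x M)

assert np.allclose(d * (B @ v), (DK * B) @ v)                        # (C.1)
assert np.allclose((DK * B) @ V, DM * (B @ V))                       # (C.2)
assert np.allclose(V.T @ (B * DK).T, (V.T @ B.T) * DM.T)             # (C.3)
assert np.allclose((DK * B) @ (DK * B).T, (B @ B.T) * DM * DM.T)     # (C.4)
```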
D Proof of Proposition 3

The assumptions imply that the distributions of $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}$ and $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}$ are multivariate normal. Thus, the proof focuses on finding the means and variances of those distributions. It is trivial to show that $\mathbb{E}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathbb{E}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}] = \mathbf{0}_M$.

a. We compute the variance of $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}$ as $\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathrm{Var}[\bar{\mathbf{Q}}]$, the $M \times M$ matrix whose $(i, l)$ entry is

\mathrm{Cov}\!\left[\frac{1}{K}\sum_{j=1}^{K} Q^{(j)}_n(Y_i),\ \frac{1}{K}\sum_{j=1}^{K} Q^{(j)}_n(Y_l)\right] = \frac{1}{K}\,\mathrm{Cov}\!\left[Q_n(Y_i), Q_n(Y_l)\right],

because the $K$ batches are i.i.d.; hence $\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathbf{C}/K$.

b. From [1, Appendix D] and [1, Appendix E], we know that

\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}] = \frac{\mathbf{C} \circ \mathbf{F}^e}{K}.
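A quick numerical illustration of part (a): averaging K i.i.d. batch vectors scales the covariance by 1/K. The batch covariance below is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, n_rep = 2, 25, 20000
C = np.array([[1.0, 0.6], [0.6, 2.0]])               # covariance of one batch estimator
L = np.linalg.cholesky(C)

batches = rng.standard_normal((n_rep, K, M)) @ L.T    # n_rep realizations of K batches
Q_bar = batches.mean(axis=1)                          # ensemble means, one per realization

print(np.cov(Q_bar.T))                                # approximately C / K
```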
E Proof of Theorem 1

The goal of this proposition is to compute $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV})]$ and $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e})]$ for $e \in \{\text{ACV-IS}, \text{ACV-MF}\}$ in terms of $\mathrm{Var}[Q_n(Y_0)]$ and known quantities, e.g., the covariances amongst models, the number of ensembles, etc. We begin with an auxiliary result that will be used in the rest of the proof. Recall the law of total expectation $\mathbb{E}_X[X] = \mathbb{E}_Y[\mathbb{E}_X[X \mid Y]]$, where $X$ and $Y$ are random variables in the same probability space. We apply it to calculate the variances of $\bar{Q}^{CV}(\bar{\alpha}^{CV})$ and $\bar{Q}^{e}(\bar{\alpha}^{e})$ by setting $X = (\bar{Q}^{CV}(\bar{\alpha}^{CV}) - \mathbb{E}[\bar{Q}^{CV}(\bar{\alpha}^{CV})])^2$, $Y = \mathbf{Q}$ and $X = (\bar{Q}^{e}(\bar{\alpha}^{e}) - \mathbb{E}[\bar{Q}^{e}(\bar{\alpha}^{e})])^2$, $Y = \tilde{\mathbf{Q}}^e$, to obtain

\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV})\right] = \mathbb{E}_{\mathbf{Q}}\!\left[\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV}) \mid \mathbf{Q}\right]\right], \qquad \mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right]\right],

where $\mathbf{Q} = [Q_n(Y_1), Q_n(Y_2), \ldots, Q_n(Y_M)]^T$, $\tilde{\mathbf{Q}}^e = [Q_n(Y_1) - \mu^e_1, Q_n(Y_2) - \mu^e_2, \ldots, Q_n(Y_M) - \mu^e_M]^T$, and $\mu^e_i = Q_{n r_i}(Y_i)$ for $i = 1, 2, \ldots, M$. The vectors $\mathbf{Q}$ and $\tilde{\mathbf{Q}}^e$ involve only the low-fidelity models, and the expectations with respect to these vectors eliminate the dependence of $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV})]$ and $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e})]$ on $\{Y_i\}_{i=1}^{M}$. Since $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV}) \mid \mathbf{Q}]$ is a special case of $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e]$ with $\mathbf{F}^e = \mathbf{1}_M \otimes \mathbf{1}_M$, the computation of the latter plays a central role in the proof below.

We now begin the main logic of the proof by substituting (41) into (37) to obtain

\bar{Q}^{e}(\bar{\alpha}^{e}) = \frac{\mathbf{1}_K^T}{K}\mathbf{Q}(Y_0) - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathrm{diag}(\mathbf{F}^e) \circ (\mathbf{D}\mathbf{Q}(Y_0))\right] = \left(\frac{\mathbf{1}_K^T}{K} - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\right)\mathbf{Q}(Y_0) = \mathbf{X}^T\mathbf{Q}(Y_0),   (E.1)

where $\mathbf{F}^e_K = \mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_K$. The second equality of (E.1) uses the identity (C.1). Given $\tilde{\mathbf{Q}}^e$, the vector $\mathbf{X}$ in (E.1) is fixed, so the conditional variance of $\bar{Q}^{e}(\bar{\alpha}^{e})$ becomes

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right] = \mathbf{X}^T\,\mathrm{Var}\!\left[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e\right]\mathbf{X}.   (E.2)

To compute $\mathrm{Var}[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e]$, we use the assumption that the vector $\{Q_n(Y_0), Q_n(Y_1), \ldots, Q_n(Y_M), \mu^e_1, \ldots, \mu^e_M\}$ has a multivariate normal distribution, so that the distribution of $Q_n(Y_0)$ conditional on $\tilde{\mathbf{Q}}^e = \{Q_n(Y_1) - \mu^e_1, \ldots, Q_n(Y_M) - \mu^e_M\}$ is also normal [48, Theorem 5.3] with variance

\mathrm{Var}\!\left[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}[Q_n(Y_0)] - \mathrm{Cov}[\tilde{\mathbf{Q}}^e, Q_n(Y_0)]^T\,\mathrm{Cov}[\tilde{\mathbf{Q}}^e, \tilde{\mathbf{Q}}^e]^{-1}\,\mathrm{Cov}[\tilde{\mathbf{Q}}^e, Q_n(Y_0)] = (1 - R^e)\,\mathrm{Var}[Q_n(Y_0)].   (E.3)

The conditional covariance matrix of $\mathbf{Q}(Y_0)$ given $\tilde{\mathbf{Q}}^e$ is diagonal because the $K$ batches are i.i.d., and all diagonal entries equal $\mathrm{Var}[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e]$, so

\mathrm{Var}\!\left[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}\!\left[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e\right]\mathbf{I} = (1 - R^e)\,\mathrm{Var}[Q_n(Y_0)]\,\mathbf{I},   (E.4)

where $\mathbf{I} \in \mathbb{R}^{K \times K}$ is the identity matrix. Substituting (E.4) into (E.2), we obtain

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\,\mathbf{X}^T\mathbf{X}.   (E.5)

The expression $\mathbf{X}^T\mathbf{X}$ can be expanded as

\mathbf{X}^T\mathbf{X} = \left(\frac{\mathbf{1}_K^T}{K} - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\right)\left(\frac{\mathbf{1}_K}{K} - \left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right) = \frac{1}{K} - \frac{\mathbf{1}_K^T}{K}\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A} - \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\frac{\mathbf{1}_K}{K} + \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A},   (E.6)

where $\mathbf{A} = \left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)$. Using the identity (C.1), we obtain

\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\mathbf{1}_K = \left[(\mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_K) \circ \mathbf{D}\right]\mathbf{1}_K = \mathrm{diag}(\mathbf{F}^e) \circ (\mathbf{D}\mathbf{1}_K) = \mathbf{0}_M, \qquad \mathbf{1}_K^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T = \mathbf{0}_M^T,

which simplify (E.6) to

\mathbf{X}^T\mathbf{X} = \frac{1}{K} + \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A} = \frac{1}{K} + (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e),   (E.7)

where the last equality applies the identity (C.4) with $\mathbf{F}^e_M = \mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_M$. Thus, Theorem 1b is proved as

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right]\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right]\right).   (E.8)

To deduce the result of Theorem 1a, we replace $\bar{Q}^{e}(\bar{\alpha}^{e})$, $R^e$, $\tilde{\mathbf{Q}}^e$, and $\bar{\boldsymbol{\mu}}^e$ with $\bar{Q}^{CV}(\bar{\alpha}^{CV})$, $R$, $\mathbf{Q}$, and $\bar{\boldsymbol{\mu}}^{CV}$, respectively, and note that $\mathbf{F}^e = \mathbf{F}^e_M = \mathbf{1}_M \otimes \mathbf{1}_M$ in the CV case; hence,

\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R)\left(\frac{1}{K} + \mathbb{E}_{\mathbf{Q}}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV})^T(\mathbf{D}\mathbf{D}^T)^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV})\right]\right).   (E.9)

F Proof of Theorem 4
Proposition 3 suggests that the expressions inside the expectation operators in Theorem 1 may follow Hotelling's $T^2$ distributions. This is indeed the case for (38), while extra assumptions on the ACV-IS and ACV-MF schemes are needed to establish the distribution in (39). The means of the Hotelling's $T^2$ distributions are then computed explicitly to prove Theorem 4. We note that (45) is a special case of (48), and so we prove Theorem 4b first.

The proof strategy is to simplify the expression inside the expectation operator in (39),

(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e),

using the extra assumptions (46) and (47) on the ACV-IS and ACV-MF schemes. First, an identity is inserted into the middle term,

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T = (\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e \circ (\mathbf{F}^e)^{\circ(-1)} \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T,   (F.1)

where $(\mathbf{F}^e)^{\circ(-1)}$ is the Hadamard inverse of $\mathbf{F}^e$, i.e.,

\left[(\mathbf{F}^e)^{\circ(-1)}\right]_{ij} = \frac{1}{f^e_{ij}},   (F.2)

and $[\star]_{ij}$ denotes an entry of the matrix $\star$. We recall from (30) and (32) that $f^e_{ij}$ depends on the ratios $r_i$ and $r_j$, which must be positive for any meaningful setting; moreover, using either ACV-IS or ACV-MF makes $r_i$ and $r_j$ greater than 1. Thus, in practice $f^e_{ij} \neq 0$ and the Hadamard inverse of $\mathbf{F}^e$ exists.

For the ACV-IS scheme we then have, for $i \neq j$,

\left[(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T\right]_{ij} = \frac{1}{f^{\text{ACV-IS}}_{ij}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj} = \frac{1}{f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj} = 1,

and, on the diagonal,

\left[(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T\right]_{ii} = \frac{1}{f^{\text{ACV-IS}}_{ii}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{ii} = f^{\text{ACV-IS}}_{ii} = \frac{r_i - 1}{r_i}.   (F.3)

Because we assume $r_i \gg 1$, the diagonal entries equal 1 as well, and

(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T = \mathbf{1}_M \otimes \mathbf{1}_M.   (F.4)

For the ACV-MF scheme,

\left[(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T\right]_{ij} = \frac{1}{f^{\text{ACV-MF}}_{ij}}\,f^{\text{ACV-MF}}_{ii}\,f^{\text{ACV-MF}}_{jj} = \frac{\min(r_i, r_j)}{\min(r_i, r_j) - 1}\cdot\frac{r_i - 1}{r_i}\cdot\frac{r_j - 1}{r_j} = \frac{\max(r_i, r_j) - 1}{\max(r_i, r_j)},

\left[(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T\right]_{ii} = \frac{1}{f^{\text{ACV-MF}}_{ii}}\,f^{\text{ACV-MF}}_{ii}\,f^{\text{ACV-MF}}_{ii} = f^{\text{ACV-MF}}_{ii} = \frac{r_i - 1}{r_i}.   (F.5)

Because we assume $r_i = r$, every entry equals $(r-1)/r$, and

(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T = \frac{r-1}{r}\,(\mathbf{1}_M \otimes \mathbf{1}_M).   (F.6)

Substituting (F.4) and (F.6) into (F.1) gives

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T = (\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-IS}},   (F.7)

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T = \frac{r-1}{r}\,(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-MF}}.   (F.8)

Substituting (F.7) and (F.8) into (39) yields

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + a(e)\,\mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right]\right).   (F.9)

Here the expectation of $(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)$ can be computed explicitly because $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e$ has a multivariate normal distribution from (44), and

(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e) = \frac{1}{K(K-1)}\left(\sqrt{K}\,(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right)^T\!\left[\frac{(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e}{K-1}\right]^{-1}\!\left(\sqrt{K}\,(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right) = \frac{1}{K(K-1)}\,t^2_{M,K-1},   (F.10)

where $t^2_{M,K-1}$ follows the Hotelling's $T^2$ distribution [48, Corollary 5.3]. Substituting (F.10) into (F.9), the variance of $\bar{Q}^{e}(\bar{\alpha}^{e})$ becomes

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \frac{a(e)}{K(K-1)}\,\mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[t^2_{M,K-1}\right]\right) = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \frac{a(e)}{K(K-1)}\cdot\frac{(K-1)M}{K-M-2}\right) = \frac{\mathrm{Var}[Q_n(Y_0)]}{K}\,(1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right),   (F.11)

where the second equality uses the expectation of the Hotelling's $T^2$ distribution and the third equality simplifies the result. Thus,

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{K} = \frac{\mathrm{Var}[Q_n(Y_0)]}{K}\,(1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right), \qquad \frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{\mathrm{Var}[Q_n(Y_0)]} = (1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right).

To prove Theorem 4a, we replace $\bar{Q}^{e}(\bar{\alpha}^{e})$, $R^e$, $\tilde{\mathbf{Q}}^e$, and $\bar{\boldsymbol{\mu}}^e$ with $\bar{Q}^{CV}(\bar{\alpha}^{CV})$, $R$, $\mathbf{Q}$, and $\bar{\boldsymbol{\mu}}^{CV}$, respectively, and note that $\mathbf{F}^e = \mathbf{F}^e_M = \mathbf{1}_M \otimes \mathbf{1}_M$ in the CV case; hence,

\frac{\mathrm{Var}[Q^{CV}(\bar{\alpha}^{CV})]}{\mathrm{Var}[Q_n(Y_0)]} = (1 - R)\left(1 + \frac{M}{K-M-2}\right).   (F.12)
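The key ingredient in (F.11) is the mean of the Hotelling $T^2$ statistic, $\mathbb{E}[t^2_{M,K-1}] = (K-1)M/(K-M-2)$. The snippet below checks this value by simulation for one illustrative pair $(M, K)$, with the sample covariance built from K i.i.d. batch vectors as in (F.10); the distribution used here is standard normal purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, n_rep = 3, 12, 50000

t2 = np.empty(n_rep)
for r in range(n_rep):
    X = rng.standard_normal((K, M))          # K i.i.d. batch vectors, identity covariance
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, ddof=1)      # sample covariance, analogue of (DD^T)/(K-1)
    t2[r] = K * xbar @ np.linalg.solve(S, xbar)

print(t2.mean(), (K - 1) * M / (K - M - 2))  # both approximately 4.71
```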
G Proof of Corollary 4.1

We seek to find $K$ such that Theorem 4 guarantees a variance reduction

\frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{\mathrm{Var}[Q_n(Y_0)]} < y \quad \text{for } 0 < y \leq 1 \text{ and } e \in \{\text{CV}, \text{ACV-IS}, \text{ACV-MF}\}.   (G.1)

We first solve (G.1) for $K$ for the ACV-IS and ACV-MF strategies, and then deduce the corresponding result in the CV case. Thus, for $0 < y \leq 1$ and $e \in \{\text{ACV-IS}, \text{ACV-MF}\}$, (G.1) becomes

\left(1 + \frac{a(e)\,M}{K-M-2}\right)(1 - R^e) < y \quad\Longleftrightarrow\quad \frac{(K-M-2) + a(e)\,M}{K-M-2}\,(1 - R^e) < y.   (G.2)

Because of the assumption $K > M + 2$, (G.2) becomes

\big((K-M-2) + a(e)\,M\big)(1 - R^e) < y\,(K-M-2)
\;\Longleftrightarrow\; K(1 - y - R^e) < (M + 2 - a(e)\,M)(1 - R^e) - y(M+2)
\;\Longleftrightarrow\; K(1 - y - R^e) < (M+2)(1 - y - R^e) - a(e)\,M(1 - R^e)
\;\Longleftrightarrow\;
\begin{cases}
K > M + 2 - \dfrac{a(e)\,M(1 - R^e)}{1 - y - R^e} & \text{if } y + R^e > 1,\\[2mm]
K < M + 2 - \dfrac{a(e)\,M(1 - R^e)}{1 - y - R^e} & \text{if } y + R^e < 1.
\end{cases}   (G.3)

Now we show that the case $y + R^e < 1$ leads to a contradiction. Specifically, (G.3) implies that $M + 2 < K < M + 2 - a(e)\,M(1 - R^e)/(1 - y - R^e) < M + 2$, which is impossible.
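In practice, the bound in (G.3) gives the smallest ensemble size that guarantees a prescribed variance-reduction factor y. The helper below evaluates this bound; the numerical inputs are illustrative only, and a_e stands for the scheme-dependent constant a(e) from Theorem 4.

```python
import math

def min_ensembles(M, R_e, y, a_e=1.0):
    """Smallest integer K satisfying K > M + 2 - a_e*M*(1 - R_e)/(1 - y - R_e),
    which by (G.3) requires y + R_e > 1 (otherwise no such K > M + 2 exists)."""
    if y + R_e <= 1.0:
        raise ValueError("Guaranteed reduction to y requires R_e > 1 - y.")
    bound = M + 2 - a_e * M * (1.0 - R_e) / (1.0 - y - R_e)
    return max(math.floor(bound) + 1, M + 3)   # also enforce K > M + 2

# Example: one low-fidelity model, strong correlation, target ratio y = 0.5.
print(min_ensembles(M=1, R_e=0.9, y=0.5))
```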
References

[1] Alex A. Gorodetsky, Gianluca Geraci, Michael S. Eldred, and John D. Jakeman. A generalized approximate control variate framework for multifidelity uncertainty quantification. Journal of Computational Physics, 408:109257, May 2020.
[2] Alireza Doostan, Gianluca Geraci, and Gianluca Iaccarino. A Bi-Fidelity Approach for Uncertainty Quantification of Heat Transfer in a Rectangular Ribbed Channel. In Volume 2C: Turbomachinery. American Society of Mechanical Engineers, June 2016.
[3] Gianluca Geraci, Michael S. Eldred, and Gianluca Iaccarino. A multifidelity multilevel Monte Carlo method for uncertainty propagation in aerospace applications. American Institute of Aeronautics and Astronautics, January 2017.
[4] Anirban Chaudhuri, John Jasa, Joaquim Martins, and Karen E. Willcox. Multifidelity Optimization Under Uncertainty for a Tailless Aircraft. American Institute of Aeronautics and Astronautics, January 2018.
[5] Leo W. T. Ng and Karen E. Willcox. Multifidelity approaches for optimization under uncertainty. International Journal for Numerical Methods in Engineering, 100(10):746–772, December 2014.
[6] Joslin Goh, Derek Bingham, James Paul Holloway, Michael J. Grosskopf, Carolyn C. Kuranz, and Erica Rutter. Prediction and Computer Model Calibration Using Outputs From Multifidelity Simulators. Technometrics, 55(4):501–512, November 2013.
[7] Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization. SIAM Review, 60(3):550–591, January 2018.
[8] Benjamin Peherstorfer, Tiangang Cui, Youssef Marzouk, and Karen Willcox. Multifidelity importance sampling. Computer Methods in Applied Mechanics and Engineering, 300:490–509, March 2016.
[9] Benjamin Peherstorfer, Boris Kramer, and Karen Willcox. Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models. Journal of Computational Physics, 341:61–75, July 2017.
[10] Boris Kramer, Alexandre Noll Marques, Benjamin Peherstorfer, Umberto Villa, and Karen Willcox. Multifidelity probability estimation via fusion of estimators. Journal of Computational Physics, 392:385–402, September 2019.
[11] G. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer, 2014.
[12] P. Glasserman. Monte Carlo Methods in Financial Engineering. Stochastic Modelling and Applied Probability. Springer, 2004.
[13] R.Y. Rubinstein and D.P. Kroese. Simulation and the Monte Carlo Method. Wiley Series in Probability and Statistics. Wiley, 2016.
[14] R. Srinivasan. Importance Sampling: Applications in Communications and Detection. Springer, 2013.
[15] J. Bucklew. Introduction to Rare Event Simulation. Springer Series in Statistics. Springer, 2004.
[16] Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
[17] Michael B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
[18] Michael B. Giles, Mateusz B. Majka, Lukasz Szpruch, Sebastian J. Vollmer, and Konstantinos C. Zygalakis. Multi-level Monte Carlo methods for the approximation of invariant measures of stochastic differential equations. Statistics and Computing, 30(3):507–524, May 2020.
[19] Abdul-Lateef Haji-Ali, Fabio Nobile, and Raúl Tempone. Multi-index Monte Carlo: when sparsity meets sampling. Numerische Mathematik, 132(4):767–806, April 2016.
[20] Casey M. Fleeter, Gianluca Geraci, Daniele E. Schiavazzi, Andrew M. Kahn, and Alison L. Marsden. Multilevel and multifidelity uncertainty quantification for cardiovascular hemodynamics. Computer Methods in Applied Mechanics and Engineering, 365:113030, June 2020.
[21] S. S. Lavenberg and P. D. Welch. A Perspective on the Use of Control Variables to Increase the Efficiency of Monte Carlo Simulations. Management Science, 27(3):322–335, March 1981.
[22] Raghu Pasupathy, Bruce W. Schmeiser, Michael R. Taaffe, and Jin Wang. Control-variate estimation using estimated control means. IIE Transactions, 44(5):381–385, May 2012.
[23] Barry L. Nelson. Control Variate Remedies. Operations Research, 38(6):974–992, December 1990.
[24] Barry L. Nelson. Batch size effects on the efficiency of control variates in simulation. European Journal of Operational Research, 43(2):184–196, 1989.
[25] Art B. Owen. Monte Carlo theory, methods and examples, 2013.
[26] R.Y. Rubinstein and D.P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Information Science and Statistics. Springer, 2004.
[27] Monica F. Bugallo, Victor Elvira, Luca Martino, David Luengo, Joaquin Miguez, and Petar M. Djuric. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine, 34(4):60–79, July 2017.
[28] Iason Papaioannou, Costas Papadimitriou, and Daniel Straub. Sequential importance sampling for structural reliability analysis. Structural Safety, 62:66–75, September 2016.
[29] Sebastian Geyer, Iason Papaioannou, and Daniel Straub. Cross entropy-based importance sampling using Gaussian densities revisited. Structural Safety, 76:15–27, January 2019.
[30] Nolan Kurtz and Junho Song. Cross-entropy-based adaptive importance sampling using Gaussian mixture. Structural Safety, 42:35–44, May 2013.
[31] Yihua Chen and Maya R. Gupta. EM demystified: An expectation-maximization tutorial. Department of Electrical Engineering, University of Washington, 2010.
[32] Geoffrey F. Bomarito, Patrick E. Leser, James E. Warner, and William P. Leser. On the Optimization of Approximate Control Variates with Parametrically Defined Estimators. arXiv:2012.02750, December 2020.
[33] NIST/SEMATECH. e-Handbook of Statistical Methods.
[34] https://github.com/pbtrung/cvis. Accessed: 2020-12-25.
[35] Cross entropy-based importance sampling code. Accessed: 2020-03-31.
[36] M.H. Sadd. Elasticity: Theory, Applications, and Numerics. Elsevier Science, 2020.
[37] Mircea Grigoriu. Simulation of Stationary Non-Gaussian Translation Processes. Journal of Engineering Mechanics, 124(2):121–126, February 1998.
[38] Chun-Ching Li and A. Der Kiureghian. Optimal Discretization of Random Fields. Journal of Engineering Mechanics, 119(6):1136–1154, June 1993.
[39] R.G. Ghanem and P.D. Spanos. Stochastic Finite Elements: A Spectral Approach. Civil, Mechanical and Other Engineering Series. Dover Publications, 2003.
[40] D. Xiu. Numerical Methods for Stochastic Computations: A Spectral Method Approach. Princeton University Press, 2010.
[41] Alen Alexanderian. A brief note on the Karhunen-Loève expansion, October 2015. arXiv:1509.07526.
[42] Limin Wang. Karhunen-Loève Expansions and their Applications. PhD thesis, London School of Economics and Political Science, 2008.
[43] Bruno Sudret and Armen Der Kiureghian. Stochastic Finite Element Methods and Reliability: A State-of-the-Art Report (Report No. UCB/SEMM-2000/08). Department of Civil and Environmental Engineering, University of California, Berkeley, 2000.
[44] G.T. Lim and J.N. Reddy. On canonical bending relationships for plates. International Journal of Solids and Structures, 40(12):3039–3067, June 2003.
[45] J.N. Reddy. Theory and Analysis of Elastic Plates and Shells, Second Edition. Series in Systems and Control. Taylor & Francis, 2006.
[46] A.J.M. Ferreira. MATLAB Codes for Finite Element Analysis: Solids and Structures. Solid Mechanics and Its Applications. Springer Netherlands, 2008.
[47] Sean Borman. The Expectation Maximization Algorithm: A short tutorial. Manuscript, 2009.
[48] W.K. Härdle and L. Simar. Applied Multivariate Statistical Analysis. Springer International Publishing, 2019.