Ensemble approximate control variate estimators: Applications to multi-fidelity importance sampling
Trung Pham∗†a and Alex A. Gorodetsky∗†a
a Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI, 48109, USA
∗ Corresponding authors. † Email addresses: [email protected] (T. Pham), [email protected] (A.A. Gorodetsky)
Abstract
The recent growth in multi-fidelity uncertainty quantification has given rise to a large set of variance-reduction techniques that leverage information from model ensembles to provide variance reduction for estimates of the statistics of a high-fidelity model. In this paper we provide two contributions: (1) we utilize an ensemble estimator to account for uncertainties in the optimal weights of approximate control variate (ACV) approaches and derive lower bounds on the number of samples required to guarantee variance reduction; and (2) we extend an existing multi-fidelity importance sampling (MFIS) scheme to leverage control variates. As such we make significant progress towards both increasing the practicality of approximate control variates—for instance, by accounting for the effect of pilot samples—and using multi-fidelity approaches more effectively for estimating low-probability events. The numerical results indicate that our hybrid MFIS-ACV estimator achieves up to 50% improvement in variance reduction over the existing state-of-the-art MFIS estimator, which had already shown an outstanding convergence rate compared to the Monte Carlo method, on several problems of computational mechanics.
Keywords. multi-fidelity, uncertainty quantification, approximate control variates, importance sampling, rare-event simulation
This paper develops an advancement to the approximate control variate (ACV) [1] approach for variance reduction in uncertainty quantification applications where multiple models with varying qualities and computational costs are available. Specifically, we analyze the effect of using estimated control variate weights within the ACV on variance reduction, and we provide conditions under which variance reduction can still be guaranteed in these cases. Multi-fidelity approaches for uncertainty quantification have recently seen significant adoption across widely varying domains where expensive simulations are required for accurate predictions. These domains include heat transfer problems [2], aerospace design [3], optimization under uncertainty [4, 5], and ensembles of computer simulator outputs [6]. A survey of multi-fidelity methods is presented in [7].

As with single-fidelity UQ, multi-fidelity (MF) UQ techniques can leverage both surrogate and sampling-based algorithmic approaches. While surrogate-based techniques are plentiful [8–10], we focus on sampling approaches that are often both more flexible to leverage and also provide the foundation for many surrogate approaches (e.g., estimating where to obtain more data). The basis of a majority of sampling approaches is the usage of Monte Carlo (MC) simulation [11] to estimate the output statistics. The primary advantage of MC over surrogate approaches is that it does not impose any requirement on the smoothness of the forward model, and its accuracy and convergence rate are independent of the model dimension. Nevertheless, its convergence rate is slow, demanding a large number of model evaluations to reach satisfactory accuracy. This computational cost can be prohibitive for many practical problems with expensive simulation models. A straightforward error analysis in [12] reveals that the efficiency of MC simulation can be greatly improved by variance reduction methods, which, as explained in [13], “can be viewed as a means of utilizing known information about the model in order to obtain more accurate estimators of its performance.” Two such methods are heavily used for quantifying uncertainty: importance sampling (IS) and control variates (CV). We explore the adaptation and advancement of certain aspects of these two approaches to multi-fidelity uncertainty quantification problems in this paper.

Adapting standard statistical approaches for variance reduction in the context of UQ problems must address special challenges. The primary challenge is that the relationships about and between low-fidelity and high-fidelity models are unknown or imprecisely encoded. For instance, CV techniques require the low-fidelity models to have known means and known covariance amongst models. However, the low-fidelity information sources in uncertainty quantification typically are in the form of simulation models—their statistics are not known but simply easier to compute than the high-fidelity model. Thus, algorithms that adapt IS and CV to these problems must determine and model the relationship between such information sources.

A number of multi-fidelity techniques leveraging importance sampling have also been proposed. Importance sampling approaches generate weighted samples from a biasing distribution. A prudent choice of biasing distribution can lead to a drastic reduction of computational cost [14, 15].
MFUQ techniques that use IS are based on the idea that the distributions of low-fidelity quantities of interest (QoI) are closely related to the high-fidelity QoI, even if their pointwise evaluations have errors. In [8] the IS density is constructed from a single surrogate model built on the high-fidelity model, and the multi-fidelity importance sampling (MFIS) estimator is basically an IS estimator using that surrogate-based density. In [9] multiple low-fidelity surrogate-based IS densities are aggregated to derive an estimator following the mixed IS approach [16]. In [10] multiple surrogate-based IS estimators are fused into a weighted ensemble estimator, and the weights are determined through minimizing the variance of the fused estimator. Though these estimators have achieved impressive speedups, further improvement is still possible. In particular, they can be enhanced by also using the low-fidelity models as control variates, not just for biasing distribution construction.

Control variate techniques introduce a weighted adjustment term to maintain an unbiased estimator with lower variance. These techniques leverage correlations between information sources to achieve variance reduction. Examples include CV [13], approximate CV [1], multi-level MC (MLMC) [17, 18], multi-index MC (MIMC) [19], multi-fidelity MC [3], and multi-level multi-fidelity (MLMF) MC [20].

The classical CV technique requires low-fidelity information sources with known mean and known correlation to high-fidelity information sources [13], thus making it unsuitable for direct application to the MFUQ problem. Work to extend control variates to the case of unknown covariances was achieved in [21] by leveraging an ensemble estimator that returns an average over a set of CV estimators. However, this estimator still required known means of the low-fidelity information sources.

On the other hand, approximate control variates [1] were recently developed to tackle the problem of unknown means in the case where the covariance amongst models is known (or easily estimated). This estimator was shown to generalize existing control-variate-inspired techniques such as MFMC and MLMF [3, 5, 8, 22]. Indeed, these existing techniques were viewed as either recursive difference or recursive nested estimators that sequentially estimate various unknown means of the low-fidelity sources. Moreover, the ACV provides a method to achieve optimal variance reduction in the case of limited high-fidelity resources and increasing low-fidelity resources. The ACV can also be viewed as a particular generalization of MLMC and MIMC. The difference is that these approaches embed recursive difference estimators with a fixed control variate weight (−1) within a bias-reduction scheme that sequentially refines the high-fidelity model. Indeed, the typical MLMC-type estimators have achieved great success for variance reduction in cases where information sources are related through discretization refinement. Moreover, they do not require estimation of covariances or control variate weights because they use a fixed CV weight.

Nevertheless, [1] showed that the fixed weight of MLMC (−1) is only optimal in the case where the correlation between sequential models is one. In all other cases, there exist weights that can further improve variance reduction. However, it has not been clear that this benefit can be truly realized because of the additional variance introduced by weight estimation from pilot samples.
This paper takes the first steps to answer this question by providing guidelines for choosing the number of pilot samples required to achieve variance reduction.

Our primary contribution begins to bridge the remaining gap between ACV theory and practice by considering unknown means and unknown correlations (and therefore unknown optimal weights). It does so by adapting the ACV into an ensemble estimator, similar to obtaining pilot samples, and analyzing its performance. The main results are
1. Theorems 1 and 4: providing the inefficiency induced by unknown correlations as a scaling on the typical (1 − R²) term for ACV-based estimators (ensemble estimator);
2. Corollary 4.1: providing the number of samples required for achieving variance reduction (ensemble estimator).
Our derivation of the lower bounds depends on the Gaussianity assumption of the MC estimators and the construction of ensemble estimators, which are based on the method of batch means [21, 23, 24]. These bounds are known given prior knowledge on the correlation amongst models, which can be obtained from physics information when possible or lower-bounded to be conservative. Two secondary contributions include
1. Theorem 7: explicitly specifying the range of weights over which one still obtains variance reduction, even under errors in approximating the optimal control variate weight;
2. a new multi-fidelity IS estimator that combines the MFIS approach of using the low-fidelity model for determining the biasing distribution with a CV estimator that also uses the low-fidelity model to leverage its correlations.
Finally, empirical demonstrations are performed for rare-event estimation with an emphasis on mechanical systems. Multiple examples demonstrate the ensemble estimator theory as well as show improvement over the MFIS estimator.

The rest of this paper is organized as follows. Section 2 reviews the MC, IS, and CV estimators along with our target application of rare-event (low-probability-event) simulation. Section 3 lists our main results on variance reduction via the proposed ensemble-based ACV estimator. The proofs can be found in the appendices. Section 4 introduces how to simply use importance sampling as the underlying estimator within the control variate framework and provides a step-by-step implementation of our estimators in pseudocode. The numerical examples in Section 5 demonstrate our theoretical results by providing empirical performance on synthetic and PDE-based problems. It provides comparison to the MFIS estimator, and comparisons between ensemble and standalone ACV estimators. The paper is concluded in Section 6.

In this section, we review our notation and provide background about the main statistical estimators, which are gathered in Table 1 for convenience.
  Notes                                                            Notation^a          Equation
  Monte Carlo                                                      Q_n                 (1)
  Importance sampling                                              Q^IS_{q,n}          (4)
  Classical control variates (known mean and known weight)         Q^CV                (10)
  Approximate control variates (unknown mean and known weight)     Q^e                 (23)
  Ensemble (A)CV-type estimator ((un)known mean, unknown weight)   Q̄^e(ᾱ_e)            (22) and (37)

  ^a e ∈ {ACV, ACV-X} for the approximate control variates and e ∈ {CV, ACV, ACV-X} for the ensemble estimator, where ACV-X indicates a particular sample partitioning scheme used within the ACV.

Table 1: List of estimators
Let (Ω, F, P) be a probability space. Let N denote the positive natural numbers, R the real numbers, and R_+ the positive real numbers. For d ∈ N we use Z : Ω → R^d to denote an F-measurable random variable corresponding to input uncertainty. This variable is assumed to be continuous and to have probability density function (PDF) p(z) : R^d → R_+, with z ∈ R^d denoting a realization of Z. Let supp(p) be the support of the PDF p, i.e., supp(p) = {z ∈ R^d : p(z) > 0}. Let Y_i(Z) : R^d → 𝒴_i ⊂ R for i = 0, 1, ..., M denote the quantities of interest of a high-fidelity model Y_0 and of M low-fidelity models {Y_i}_{i=1}^M, respectively. Let the expected values of those models be denoted by µ_i = E_p[Y_i(Z)], where E_p[·] is the expectation operator taken with respect to p; for brevity, whenever the PDF p is omitted, it is implicitly assumed, i.e., the expectation E[·], the variance Var[·] and the covariance Cov[·] are computed with respect to p. Our goal is to estimate µ_0.

In this section we describe the rare-event estimation problem for which we seek to apply variance reduction in Section 5. Let g : R^d → R denote the limit state function. A failure event is defined by g(z) < 0, with the corresponding failure domain

    G = {z ∈ R^d : g(z) < 0}.

The failure probability P_f(g(z) < 0) can be cast as an expectation by considering the indicator function defined on the failure domain, I_G : R^d → {0, 1}, with I_G(z) = 1 if z ∈ G and I_G(z) = 0 otherwise. Using this indicator function, the failure probability P_f is given by

    P_f(g(z) < 0) = E[I_G(Z)].
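As a small illustration of casting the failure probability as an expectation, the following toy Python sketch estimates E[I_G(Z)] by averaging the indicator over samples of Z; the scalar limit-state function and its threshold are assumptions for illustration only, not quantities from this paper (the estimator used is simply the Monte Carlo estimator reviewed next).

import numpy as np

# Toy sketch: P_f = E[I_G(Z)] for an assumed limit state g(z) = 3 - z with Z ~ N(0, 1).
rng = np.random.default_rng(0)
g = lambda z: 3.0 - z
indicator_G = lambda z: (g(z) < 0).astype(float)   # I_G(z) = 1 if z in G, else 0

z = rng.standard_normal(1_000_000)                 # i.i.d. samples of Z ~ p
print("crude estimate of P_f:", indicator_G(z).mean())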
2.2 Monte Carlo estimator

In this section we review the basic Monte Carlo (MC) estimator. The MC estimator is defined as a normalized sum of random variables

    Q_n(Y_0) = (1/n) Σ_{i=1}^n Y_0(Z^(i)),    (1)

where the random variables Z^(i) are independent and identically distributed (i.i.d.) according to the input random variable Z, and Q_n is a new random variable derived from Y_0 and each Z^(i). This estimator is unbiased and has variance decaying proportionally to 1/n, since

    E[Q_n(Y_0)] = E[Y_0(Z)] = µ_0,    (2)
    Var[Q_n(Y_0)] = Var[Y_0(Z)]/n = σ_0²/n.    (3)

The root mean squared error of the MC estimator is √(Var[Q_n(Y_0)]) = σ_0/√n, which shows that to gain one more decimal digit of accuracy, the computational cost needs to increase 100 times [25]. In this paper, we seek an estimator with reduced variance by leveraging two ideas: importance sampling [13] and control variates [25].

Importance sampling (IS) seeks variance reduction by carefully choosing a sampling distribution that differs from p. The IS estimator Q^IS_{q,n}(Y_0) is defined by a weighted sum of random variables as

    Q^IS_{q,n}(Y_0) = (1/n) Σ_{i=1}^n Y_0(Z^(i)) W(Z^(i)),    (4)

where the input random variables Z^(i) are i.i.d. according to the PDF q(z), and

    W(z) = p(z)/q(z)    (5)

is the density ratio. The new density q is called the proposal (or biasing) density.

According to [13, 25], the IS estimator is unbiased when Y_0(z)p(z) is dominated by q(z), that is, q(z) = 0 implies Y_0(z)p(z) = 0. In other words, if supp(Y_0 p) ⊆ supp(q), we have

    E_q[Q^IS_{q,n}(Y_0)] = E_q[Y_0(Z)W(Z)] = E_p[Y_0(Z)] = µ_0,    (6)
    Var_q[Q^IS_{q,n}(Y_0)] = Var_q[Y_0(Z)W(Z)]/n.    (7)

A prudent choice of the IS proposal density q(z) can yield an estimator with a smaller variance than that of the MC estimator. The optimal IS density is obtained by minimizing the variance Var_q[Y_0(Z)W(Z)] as in [13], and is given in closed form by

    q*(z) = |Y_0(z)| p(z) / ∫_Ω |Y_0(z)| p(z) dz.

Specializing this proposal to the case of rare-event simulation, i.e., Y_0 is an indicator function, we obtain

    q*(z) = Y_0(z) p(z) / µ_0.    (8)

Using this proposal, the variance of Q^IS_{q,n}(Y_0) becomes identically zero because each evaluation of the high-fidelity model Y_0 with a single sample drawn from q*(z) is exactly equal to its expected value µ_0.

However, since µ_0 is unknown, it is intractable to exactly compute the optimal IS density. Instead, several approaches can be used for obtaining an approximation [26–28]. In this paper, the cross-entropy (CE) method with a Gaussian mixture model (GMM) in [29, 30] is chosen to find an approximate IS density q̂(z). Here, q̂(z) is defined as a weighted sum of k ∈ N multivariate normal density functions

    q̂(z) = Σ_{i=1}^k π_i N(z; µ_i, Σ_i),    (9)

where π_i ∈ R, µ_i ∈ R^d, and Σ_i ∈ R^{d×d} for i = 1, 2, ..., k are the mixture weights, means, and covariance matrices, respectively. The mixture coefficient π_i is the probability that the i-th density N(z; µ_i, Σ_i) is selected at a given time [30], requiring 0 ≤ π_i ≤ 1 and Σ_{i=1}^k π_i = 1.

There exist several approaches to estimate the three parameters for each element of the mixture [29, 30]. Here we use an expectation-maximization (EM) algorithm with a cross-entropy objective function. We briefly explain this approach in Appendix A by following the construction in [29] and refer to [13, 26, 29, 31] for a more detailed treatment.
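To make the definitions (4)–(5) and (9) concrete, the minimal Python sketch below estimates a Gaussian tail probability with the IS estimator, using a hand-specified two-component Gaussian mixture as a stand-in for the CE/EM-fitted proposal q̂; the mixture parameters and the limit-state threshold are illustrative assumptions, not values from this paper.

import numpy as np
from scipy import stats

# IS estimate of P_f = E_p[I{3 - z < 0}] for Z ~ N(0, 1), with a (assumed) GMM proposal.
rng = np.random.default_rng(0)
n = 5000
indicator = lambda z: (3.0 - z < 0).astype(float)   # toy high-fidelity indicator model

weights = np.array([0.7, 0.3])                      # assumed mixture parameters
means   = np.array([3.2, 4.0])
stds    = np.array([0.6, 1.0])

def sample_q(n):
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

def q_pdf(z):
    return sum(w * stats.norm.pdf(z, m, s) for w, m, s in zip(weights, means, stds))

z = sample_q(n)
w_ratio = stats.norm.pdf(z) / q_pdf(z)              # density ratio W(z) = p(z)/q(z)
vals = indicator(z) * w_ratio
print("IS estimate :", vals.mean())                 # compare with 1 - Phi(3) ~ 1.35e-3
print("IS std error:", vals.std(ddof=1) / np.sqrt(n))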
In this section, we first consider the classical control variates—a variance-reduction technique that relies on introducing additional information sources. The classical control variates assume that both the means and the covariances of the additional information sources are available. Next we present an extension of the classical control variates in which the control means are known but the covariances amongst the low-fidelity information sources are unknown [21, 23]. Finally, we review the approximate control variates [1], which consider the case with unknown control variate means and known covariances.
A control variate (CV) estimator Q^CV(α) utilizes a set of M additional estimators Q_n(Y_1), ..., Q_n(Y_M), and augments a baseline estimator Q_n(Y_0) via a linear combination of these estimators

    Q^CV(α) = Q_n(Y_0) + Σ_{i=1}^M α_i (Q_n(Y_i) − µ_i) = Q_n(Y_0) + α^T (Q − µ),    (10)

where µ_i = E[Q_n(Y_i)] is the known mean of Q_n(Y_i), µ = [µ_1, ..., µ_M]^T, α = [α_1, ..., α_M]^T is the vector of control variate weights, and Q = [Q_n(Y_1), ..., Q_n(Y_M)]^T is the vector of additional estimators. These additional estimators are shown here to be Monte Carlo estimators Q_n, but can actually be any random variables. In the later sections we will use importance sampling estimators instead.

This CV estimator is unbiased and has reduced variance compared with the baseline estimator Q_n(Y_0). Specifically, since Q_n(Y_i) is unbiased by specification of the estimator above, we have

    E[Q^CV(α)] = µ_0.    (11)

Furthermore, the variance of this estimator is

    Var[Q^CV(α)] = Var[Q_n(Y_0)] + α^T Cov[Q, Q] α + 2 α^T Cov[Q, Q_n(Y_0)].    (12)

The optimal control variate weight [13], which minimizes the above variance, is then given by

    α*_CV = −C^{−1} c,    (13)

where C = Cov[Y, Y] ∈ R^{M×M} is the covariance matrix among the Y_i, c = Cov[Y, Y_0] ∈ R^M is the vector of covariances between Y_0 and each Y_i, and Y = [Y_1, ..., Y_M]^T. If we further define

    c̄ = c / √(Var[Y_0]) = [ρ_1 √(Var[Y_1]), ρ_2 √(Var[Y_2]), ..., ρ_M √(Var[Y_M])]^T,    (14)

where ρ_i is the Pearson correlation coefficient between Y_0 and Y_i, then the variance corresponding to α*_CV becomes

    Var[Q^CV(α*_CV)] = (1 − R²) Var[Q_n(Y_0)],    (15)

where

    R² = c̄^T C^{−1} c̄.    (16)

Here we see that the greater the correlation amongst models, the greater the achieved variance reduction.

Furthermore, since each Q_n(Y_i) shares the same n i.i.d. samples, we have

    Cov[Q_n(Y_i), Q_n(Y_j)] = Cov[(1/n) Σ_{k=1}^n Y_i(Z^(k)), (1/n) Σ_{k=1}^n Y_j(Z^(k))] = (1/n) Cov[Y_i, Y_j],    (17)

i.e.,

    C_Q = (1/n) C,    c_Q = (1/n) c,    (18)

where C_Q = Cov[Q, Q] and c_Q = Cov[Q, Q_n(Y_0)]. Thus, we can rewrite (13) as

    α*_CV = −C_Q^{−1} c_Q    (19)

to obtain an expression in terms of the covariances between estimators.
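As a concrete illustration of (10) and (13), the Python sketch below uses one control variate whose mean and covariances are available in closed form, so the classical CV assumptions hold exactly; the indicator models and thresholds are toy assumptions, not quantities from this paper.

import numpy as np
from scipy import stats

# Classical CV estimator (10) with M = 1: Y0 = I{Z > 3}, Y1 = I{Z > 2.5}, Z ~ N(0, 1).
rng = np.random.default_rng(1)
n = 200_000
z = rng.standard_normal(n)
y0 = (z > 3.0).astype(float)
y1 = (z > 2.5).astype(float)

mu0 = 1 - stats.norm.cdf(3.0)              # target (known here only for checking)
mu1 = 1 - stats.norm.cdf(2.5)              # known control mean
var1 = mu1 * (1 - mu1)                     # Var[Y1] (Bernoulli)
cov01 = mu0 - mu0 * mu1                    # Cov[Y0, Y1] = E[Y0 Y1] - mu0 mu1, with E[Y0 Y1] = mu0

alpha_star = -cov01 / var1                 # optimal weight (13) for M = 1
q_mc = y0.mean()                           # plain MC estimate (1)
q_cv = q_mc + alpha_star * (y1.mean() - mu1)   # CV estimate (10)

R2 = cov01**2 / (var1 * mu0 * (1 - mu0))   # squared correlation, cf. (15)-(16)
print(f"MC: {q_mc:.3e}  CV: {q_cv:.3e}  exact: {mu0:.3e}  1-R^2 = {1 - R2:.3f}")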
It is often the case that the covariances amongst the low-fidelity information sources are not available; however, these sources can be simulated to obtain estimates of these covariances. One well-analyzed strategy for estimation in this context is to generate an ensemble of K realizations of the random variables Q_n(Y_i) to estimate the required covariances and correlations [21, 23]. Because we are considering MC estimators as additional information sources, this operation requires a total of nK samples for each Y_i. The estimated optimal weight is then obtained as

    ᾱ_CV = −Ĉ^{−1} ĉ,    (20)

where Ĉ and ĉ are the sample versions of C_Q and c_Q, respectively. Hence, Ĉ and ĉ can be computed as

    Ĉ = (1/(K−1)) Σ_{j=1}^K (Q_j − Q̄)(Q_j − Q̄)^T,    ĉ = (1/(K−1)) Σ_{j=1}^K (Q_j − Q̄)(Q_n^(j)(Y_0) − Q̄_n(Y_0)),    (21)

where

    Q̄ = (1/K) Σ_{j=1}^K Q_j,    Q̄_n(Y_0) = (1/K) Σ_{j=1}^K Q_n^(j)(Y_0),

and Q_j = [Q_n^(j)(Y_1), ..., Q_n^(j)(Y_M)]^T is the vector of MC estimators using the n i.i.d. samples of the j-th batch (out of a total of K batches).

This estimated weight is then used in an ensemble estimator [21] given by

    Q̄^CV(ᾱ_CV) = (1/K) Σ_{j=1}^K Q^CV_j(ᾱ_CV) = (1/K) Σ_{j=1}^K (Q_n^(j)(Y_0) + ᾱ_CV^T (Q_j − µ_j)) = (1_K^T/K) Q(Y_0) + ᾱ_CV^T (Q̄ − µ̄),    (22)

where µ̄ = (1/K) Σ_{j=1}^K µ_j, Q(Y_0) = [Q_n^(1)(Y_0), ..., Q_n^(K)(Y_0)]^T, and 1_K is a K × 1 vector of ones. Comparing (10) and (22), we see that this estimator is identical to the estimator Q^CV(ᾱ_CV) that uses the same set of nK samples. Our analytical results in the following sections build from the ensemble form.
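The batch-means construction can be sketched in a few lines of Python; in the fragment below (toy indicator models, a known control mean, and the sample sizes are assumptions for illustration) the weight is estimated from K batch estimators as in (20)–(21) and reused in the ensemble estimator (22).

import numpy as np
from scipy import stats

# Batch-means estimation of the CV weight with M = 1 and K batches of n samples each.
rng = np.random.default_rng(2)
K, n = 20, 10_000
mu1 = 1 - stats.norm.cdf(2.5)              # known control mean (classical CV setting)

z = rng.standard_normal((K, n))
Q0 = (z > 3.0).mean(axis=1)                # Q_n^(j)(Y0), j = 1..K
Q1 = (z > 2.5).mean(axis=1)                # Q_n^(j)(Y1), j = 1..K

C_hat = Q1.var(ddof=1)                     # sample version of Cov[Q1, Q1], cf. (21)
c_hat = np.cov(Q1, Q0, ddof=1)[0, 1]       # sample version of Cov[Q1, Q0]
alpha_bar = -c_hat / C_hat                 # estimated weight (20)

Q_ens = Q0.mean() + alpha_bar * (Q1.mean() - mu1)   # ensemble estimator (22)
print(f"ensemble CV estimate: {Q_ens:.3e}, estimated weight: {alpha_bar:.3f}")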
The approximate control variate (ACV) estimator is designed to leverage the control variate framework when the analytical expectation of the additional estimators Q_n(Y_i) is not known [1]. This estimator replaces the unknown µ_i with another estimator µ_i^ACV as

    Q^ACV(α) = Q_n(Y_0) + Σ_{i=1}^M α_i (Q_n(Y_i) − µ_i^ACV) = Q_n(Y_0) + α^T (Q − µ^ACV),    (23)

where µ^ACV = [µ_1^ACV, ..., µ_M^ACV]^T. If the µ_i^ACV are unbiased, then Q^ACV(α) is unbiased. Furthermore, the ACV estimator variance is

    Var[Q^ACV(α)] = Var[Q_n(Y_0)] + α^T Cov[Q − µ^ACV, Q − µ^ACV] α + 2 α^T Cov[Q − µ^ACV, Q_n(Y_0)],    (24)

so that the optimal weight is

    α*_ACV = −Cov[Q − µ^ACV, Q − µ^ACV]^{−1} Cov[Q − µ^ACV, Q_n(Y_0)],    (25)

corresponding to the variance

    Var[Q^ACV(α*_ACV)] = (1 − R²_ACV) Var[Q_n(Y_0)],    (26)

where

    R²_ACV = Cov[Q − µ^ACV, Q_n(Y_0)]^T Cov[Q − µ^ACV, Q − µ^ACV]^{−1} Cov[Q − µ^ACV, Q_n(Y_0)] / Var[Q_n(Y_0)].

The ACV estimator is flexible in that it permits a variety of estimators to be used for the unknown means. While there may be many partitioning strategies [32], two specific schemes selected for further analysis in this paper arise from using different partitionings of the samples of Y_i for the estimators Q_n and µ_i^ACV. Let z_0, z_i and ẑ_i denote the sample sets (realizations of Z) used to compute Q_n(Y_0), Q_n(Y_i) and µ_i^ACV, respectively. The two sample partitioning strategies [1] are defined as

    ACV-IS:  z_i = z_0,    ẑ_i = z_0 ∪ z̃_i,    z̃_i ∩ z̃_j = ∅ for i ≠ j,    (27)
    ACV-MF:  z_i = z_0,    ẑ_i = z_0 ∪ (∪_{j=1}^i z̃_j),    z̃_i ∩ z̃_j = ∅ for i ≠ j,    (28)

where z̃_i and z̃_j are extra sets of samples used to estimate µ_i^ACV. In ACV-IS, the computation of Q_n(Y_0) and Q_n(Y_i) employs only the sample set z_0, while the computation of µ_i^ACV-IS uses these same samples plus a sample increment. In ACV-MF, besides sharing the sample set z_0 between Q_n(Y_0) and Q_n(Y_i), the estimation of µ_i^ACV-MF utilizes the sample set of µ_{i−1}^ACV-MF with some extra samples. We refer to [1, Fig. 2] for a visual explanation of the two strategies. No analytical results are available for the optimal sample distribution strategy in the case of finite sample sizes, and so our aim is simply to show that our analysis applies to a variety of strategies.

According to [1], the ACV-IS estimator obtains an optimal weight

    α*_ACV-IS = −[C ∘ F^ACV-IS]^{−1} [diag(F^ACV-IS) ∘ c],    (29)

where F^ACV-IS ∈ R^{M×M} has elements

    f_ij^ACV-IS = ((r_i − 1)/r_i)((r_j − 1)/r_j) if i ≠ j,  and  (r_i − 1)/r_i otherwise,    (30)

diag(•) denotes a vector whose elements are the diagonal of the matrix •, and r_i ∈ R_+ is the ratio between the total number of realizations of Y_i and the total number of evaluations of Y_0. Similarly, the ACV-MF estimator obtains an optimal weight

    α*_ACV-MF = −[C ∘ F^ACV-MF]^{−1} [diag(F^ACV-MF) ∘ c],    (31)

where F^ACV-MF ∈ R^{M×M} has elements

    f_ij^ACV-MF = (min(r_i, r_j) − 1)/min(r_i, r_j) if i ≠ j,  and  (r_i − 1)/r_i otherwise.    (32)

The corresponding correlation coefficients are

    R²_ACV-IS = a^T [C ∘ F^ACV-IS]^{−1} a,    (33)
    R²_ACV-MF = b^T [C ∘ F^ACV-MF]^{−1} b,    (34)

where a = diag(F^ACV-IS) ∘ c̄ and b = diag(F^ACV-MF) ∘ c̄. We can rewrite the ACV optimal weights in terms of the covariances amongst MC estimators C_Q and c_Q (rather than amongst the models Y_i) by substituting (18) into (29) and (31) as

    α*_ACV-IS = −[C_Q ∘ F^ACV-IS]^{−1} [diag(F^ACV-IS) ∘ c_Q],    (35)
    α*_ACV-MF = −[C_Q ∘ F^ACV-MF]^{−1} [diag(F^ACV-MF) ∘ c_Q].    (36)

In this section we describe an extension to the ACV estimator that considers unknown covariances amongst the low-fidelity information sources. This extension utilizes the same idea as [21] to construct a new estimator as an average of an ensemble of estimators. Suppose again that the optimal weight is estimated from an ensemble of K simulations, each of which employs the same number of samples. Using the sample weight ᾱ_ACV, we define the ensemble ACV estimator as

    Q̄^ACV(ᾱ_ACV) = (1/K) Σ_{j=1}^K Q^ACV_j(ᾱ_ACV) = (1/K) Σ_{j=1}^K (Q_n^(j)(Y_0) + ᾱ_ACV^T (Q_j − µ_j^ACV)) = (1_K^T/K) Q(Y_0) + (Q̄ − µ̄^ACV)^T ᾱ_ACV,    (37)

where µ̄^ACV = (1/K) Σ_{j=1}^K µ_j^ACV. The next section analyzes the variance reduction of the proposed ensemble estimators Q̄^ACV, both for the MF and IS sampling strategies.
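For reference, the Python fragment below evaluates the ACV-IS weight and variance-reduction ratio directly from (29), (30) and (33) as reconstructed above; the model covariances, the variance of Y_0, and the sample ratios are assumed numbers chosen only to illustrate the mechanics, not data from this work.

import numpy as np

# ACV-IS optimal weight (29)-(30) for M = 2 low-fidelity models (assumed covariances).
C = np.array([[1.0, 0.6],
              [0.6, 1.2]])                 # Cov[Y_i, Y_j], i, j = 1..M
c = np.array([0.85, 0.70])                 # Cov[Y_0, Y_i]
var0 = 1.0                                 # Var[Y_0]
r = np.array([10.0, 50.0])                 # r_i = (# evaluations of Y_i) / (# of Y_0)

d = (r - 1.0) / r                          # (r_i - 1)/r_i
F = np.outer(d, d)                         # off-diagonal entries of F^{ACV-IS}, cf. (30)
np.fill_diagonal(F, d)                     # diagonal entries

alpha_acv_is = -np.linalg.solve(C * F, d * c)               # weight (29); * is the Hadamard product
cbar = c / np.sqrt(var0)
R2_acv_is = (d * cbar) @ np.linalg.solve(C * F, d * cbar)   # cf. (33)
print("ACV-IS weights:", alpha_acv_is, " 1 - R^2:", 1 - R2_acv_is)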
The goal of this section is to present Theorem 1, which expresses the variances of the three ensemble estimators derived from the CV and ACV (i.e., Q̄^CV(ᾱ_CV), Q̄^ACV-IS(ᾱ_ACV-IS) and Q̄^ACV-MF(ᾱ_ACV-MF)) with respect to the number of samples, the correlation coefficient, the variance of the MC estimator Q_n(Y_0), and the expectation of a function of estimators. We use this relationship in Theorem 4 to compute explicit expressions for these variances. In Corollary 4.1 we derive lower bounds on the number of ensembles required to guarantee smaller variances of the three sample-weight estimators than that of the baseline estimator Q_n(Y_0).

The results rely on a multivariate Gaussianity assumption, and so are asymptotically true for model settings where the information sources have finite mean and variance, as an implication of the central limit theorem. That is, the results hold when the vectors {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} in the CV-based estimator and {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), Q_{n r_1}(Y_1), ..., Q_{n r_M}(Y_M)} in the ACV-based estimators have a multivariate normal distribution as n → ∞. Using the Gaussianity assumption on the vectors of MC estimators, the first theorem allows us to calculate the variances of the ensemble estimators in terms of the expectations shown below.
Theorem 1.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then we have

    Var[Q̄^CV(ᾱ_CV)] = Var[Q_n(Y_0)] (1 − R²) (1/K + E_Q[(Q̄ − µ̄_CV)^T (DD^T)^{−1} (Q̄ − µ̄_CV)]),    (38)

where R² is defined in (16).
b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. Then, we have

    Var[Q̄^e(ᾱ_e)] = Var[Q_n(Y_0)] (1 − R_e²) × (1/K + E_{Q̃_e}[(Q̄ − µ̄^e)^T [(DD^T) ∘ F^e]^{−1} [(DD^T) ∘ F_M^e ∘ (F_M^e)^T] [(DD^T) ∘ F^e]^{−1} (Q̄ − µ̄^e)]),    (39)

where Q̃_e = [Q_n(Y_1) − µ_1^e, ..., Q_n(Y_M) − µ_M^e]^T, F_M^e = diag(F^e) ⊗ 1_M, and R_e² is defined in (33) and (34).

The proof of Theorem 1 in Appendix E requires the identities provided in Appendix C and the following two useful propositions. In the first proposition, we rewrite the sample weight in terms of the centered data matrix to facilitate the calculation of the variances of the ensemble estimators.
Proposition 2. The estimated control variate weights can be written as

    ᾱ_CV = −(DD^T)^{−1} D Q(Y_0),    (40)
    ᾱ_e = −[(DD^T) ∘ F^e]^{−1} [diag(F^e) ∘ (D Q(Y_0))],    (41)

where e ∈ {ACV-IS, ACV-MF} and D is the centered data matrix

    D = [ Q_n^(1)(Y_1) − Q̄_n(Y_1)   Q_n^(2)(Y_1) − Q̄_n(Y_1)   ...   Q_n^(K)(Y_1) − Q̄_n(Y_1) ]
        [ Q_n^(1)(Y_2) − Q̄_n(Y_2)   Q_n^(2)(Y_2) − Q̄_n(Y_2)   ...   Q_n^(K)(Y_2) − Q̄_n(Y_2) ]
        [           ...                        ...              ...            ...           ]
        [ Q_n^(1)(Y_M) − Q̄_n(Y_M)   Q_n^(2)(Y_M) − Q̄_n(Y_M)   ...   Q_n^(K)(Y_M) − Q̄_n(Y_M) ].    (42)

Proof. See Appendix B.
The next step is to calculate the expectations in Theorem 1 by finding the distribution of the expressions inside these operators. Coupled with the Gaussianity assumption, the specific structure of these expressions suggests Hotelling's T² distribution [33], which is confirmed by the second proposition.

Proposition 3.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then,

    Q̄ − µ̄_CV ∼ N(0_M, C_Q/K).    (43)

b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, µ_2^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. Then,

    Q̄ − µ̄^e ∼ N(0_M, (C_Q ∘ F^e)/K).    (44)

Proof. See Appendix D.
We can obtain explicit expressions for the expectations in Theorem 1 under certain reasonable limiting conditions on the ratio of low-fidelity to high-fidelity samples r_i. Theorem 4 summarizes the variance reduction ratios of the ensemble CV-type estimators with respect to the baseline estimator Q_n(Y_0).

Theorem 4.
a. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M)} have a multivariate normal distribution. Then,

    Var[Q^CV(ᾱ_CV)] / Var[Q_n(Y_0)] = (1 − R²) (1 + M/(K − M − 2)).    (45)

b. Let the vector {Q_n(Y_0), Q_n(Y_1), ..., Q_n(Y_M), µ_1^e, ..., µ_M^e}, where e ∈ {ACV-IS, ACV-MF} and µ_i^e = Q_{n r_i}(Y_i) for i = 1, 2, ..., M, have a multivariate normal distribution. If we further assume that

    (ACV-IS)  r_i ≫ 1,    (46)
    (ACV-MF)  r_i = r,    (47)

for i = 1, 2, ..., M, then

    Var[Q^e(ᾱ_e)] / Var[Q_n(Y_0)] = (1 − R_e²) (1 + a(e) M/(K − M − 2)),    (48)

where a(ACV-IS) = 1 and a(ACV-MF) = (r − 1)/r.

Proof. See Appendix F.

To guarantee variance reduction, the number of ensembles K must be bounded from below as follows.
Corollary 4.1. Suppose R² ≠ 0 and R_e² ≠ 0 for e ∈ {ACV-IS, ACV-MF}. Furthermore, let e ∈ {CV, ACV-IS, ACV-MF} and

    K > max(M + 2, B_e),    (49)

where

    B_CV = M/R² + 2,    (50)
    B_ACV-IS = M/R²_ACV-IS + 2,    (51)
    B_ACV-MF = ((r − 1)/r) M/R²_ACV-MF + M/r + 2.    (52)

Then, we have

    Var[Q^e(ᾱ_e)] / Var[Q_n(Y_0)] < 1.

Proof. See Appendix G.
The above corollary implies that in order for Var[Q^e(ᾱ_e)] < Var[Q_n(Y_0)], i.e., variance reduction of the CV-type estimators with estimated weight compared to the baseline estimator, the number of realizations K of the random variable Q^e(ᾱ_e) has to be at least max(M + 2, B_e), where B_e depends on the sample partitioning, the correlation amongst models, and the number of low-fidelity models. This corollary recovers the result in [21] for the CV estimator. From the proof of the corollary in Appendix G, the results are obtained by setting the upper bound of the ratio Var[Q^e(ᾱ_e)]/Var[Q_n(Y_0)] (i.e., y in (G.1)) equal to 1, which satisfies the constraint y + R_e² > 1 for any non-zero R_e²; that is, when y = 1, we only need the models to be correlated and do not need a specific value of the correlation coefficient. In case we know the correlation amongst models, we may choose a smaller upper bound such that y > 1 − R_e², i.e., y can take any value in the interval (1 − R_e², 1]. This reflects the principle of the CV-based methods: stronger correlation leads to smaller variance.

The corollary explicitly ties the correlation amongst models to the sampling requirements. This type of connection has previously been ignored in the general case of the ACV-like estimators. Furthermore, it provides an avenue through which to inject problem-specific information as correlations. In this respect it can be used in multi-level Monte Carlo [18] schemes for which a convergence rate for a numerical method is used to determine optimal allocations and guarantee convergence. We envision that these types of problems can also be amenable to deriving expressions for the correlation.

Finally, to choose K and the numbers of samples in practice we can solve an optimization problem in which (1) K and the numbers of samples are design variables; (2) the variance of an appropriate estimator is minimized; and (3) the total cost is bounded above and K is constrained by (49). We leave solving such a problem for future work.
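In practice the corollary translates into a one-line check on the number of ensembles; the small Python helper below returns the smallest admissible K, following the reconstruction of (49)–(52) above, with the numerical values in the example calls being assumptions for illustration only.

import math

def min_ensembles(M, R2, scheme="CV", r=None):
    """Smallest integer K satisfying K > max(M + 2, B_e) in (49)-(52)."""
    if scheme in ("CV", "ACV-IS"):
        B = M / R2 + 2
    elif scheme == "ACV-MF":
        B = (r - 1) / r * M / R2 + M / r + 2
    else:
        raise ValueError("scheme must be CV, ACV-IS, or ACV-MF")
    return math.floor(max(M + 2, B)) + 1

print(min_ensembles(M=1, R2=0.8))                               # CV with one control
print(min_ensembles(M=3, R2=0.5, scheme="ACV-MF", r=20.0))      # ACV-MF example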
(12), (24), (13) and(25) become V ar ˆ q (cid:104) Q MF ˆ q,n (cid:16) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) , α (cid:17)(cid:105) = V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + α V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + 2 α C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:105) , (55) V ar ˆ q (cid:104) Q MF-ACV ˆ q,n (cid:16) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) , α (cid:17)(cid:105) = V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + α (cid:16) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + V ar ˆ q (cid:104) µ IS (cid:105) − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105) (cid:17) + 2 α (cid:16) C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:105) − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105)(cid:17) , (56) α ∗ CV = − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:3) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) , (57) α ∗ ACV = − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , Q IS ˆ q,n ( Y ) (cid:3) − C ov ˆ q (cid:2) Q IS ˆ q,n ( Y ) , µ IS (cid:3) V ar ˆ q (cid:104) Q IS ˆ q,n ( Y ) (cid:105) + V ar ˆ q [ µ IS ] − C ov ˆ q (cid:104) Q IS ˆ q,n ( Y ) , µ IS (cid:105) . (58) The proposed estimators are unbiased.
Theorem 5.
Suppose supp(Y_0 p) ⊆ supp(q̂). Then Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

Proof. Let us consider the expected values of Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} with respect to q̂:

    E_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = E_q̂[Q^IS_{q̂,n}(Y_0)] + α E_q̂[Q^IS_{q̂,n}(Y_1) − µ_1],
    E_q̂[Q^MF-ACV_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = E_q̂[Q^IS_{q̂,n}(Y_0)] + α E_q̂[Q^IS_{q̂,n}(Y_1) − µ_1^IS].

Since supp(Y_0 p) ⊆ supp(q̂), which allows us to apply (6), we have E_q̂[Q^IS_{q̂,n}(Y_i)] = E[Y_i(Z)] = µ_i for i = 0, 1, and E_q̂[µ_1^IS] = E[Y_1(Z)] = µ_1. Thus,

    E_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = µ_0,
    E_q̂[Q^MF-ACV_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] = µ_0,

which implies that Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

This result means that the Gaussian mixture can be used as a proposal.

Corollary 5.1. If q̂ is a Gaussian mixture as in (9), then Q^MF_{q̂,n} and Q^MF-ACV_{q̂,n} are unbiased estimators of the expected value µ_0.

Proof. Because the π_i are probabilities and N(z; µ_i, Σ_i) > 0 for all z ∈ R^d and i = 1, 2, ..., k, the GMM in (9) has global support, that is, q̂(z) > 0 for all z ∈ R^d. This leads to supp(Y_0 p) ⊆ supp(q̂) and, hence, Theorem 5 holds.

As with any control-variate estimator, variance reduction is greater when the correlation between the estimators Q^IS_{q̂,n}(Y_0) and Q^IS_{q̂,n}(Y_1) is larger. In fact, as shown in the theorems below, for any non-zero correlation the variance of our estimator is smaller than that of the multi-fidelity importance sampling (MFIS) estimator Q^MFIS_{q̂,n} presented in [8]. Specifically, the MFIS estimator uses a proposal density that is derived from Y_1. In other words, the MFIS estimator leverages low-fidelity information sources to design the proposal distribution and then uses this proposal distribution within a standard importance sampling scheme for the high-fidelity model Y_0.

Assuming that we use the same proposal, our proposed estimator is guaranteed to have lower variance than MFIS for a certain range of the control weight. This further reduction is achieved by leveraging the low-fidelity models again as control variates. When the control variate weight is zero and the proposal distributions match, we obtain an equivalent estimator, Q^MFIS_{q̂,n}(Y_0) = Q^IS_{q̂,n}(Y_0) when α = 0. We can generally choose the weight α to achieve greater variance reduction by leveraging the correlation between Y_0 and Y_1. Formally, Theorem 6 states that if the weight α belongs to a certain range and Var[Q(Y_1)] ≠ 0, then the variance of the CV estimator Q^CV(α) is bounded above by that of the estimator Q(Y_0). Equality happens when there is no correlation between Q(Y_0) and Q(Y_1).

Theorem 6 (Range of control variate weight for CV estimator). Let Q^CV(α) = Q(Y_0) + α(Q(Y_1) − µ_1) with Var[Q(Y_1)] > 0, where Q(Y_0) and Q(Y_1) are unbiased estimators for the means of Y_0 and Y_1, respectively. Furthermore, let

    f̄ = −2 Cov[Q(Y_0), Q(Y_1)] / Var[Q(Y_1)]    (59)

denote a scaled ratio of the covariance to the variance of Q(Y_1). If f̄ ≥ 0 (resp. f̄ ≤ 0), then for a control variate weight in the range α ∈ [0, f̄] (resp. α ∈ [f̄, 0]), the variance of the CV estimator is bounded above by that of the baseline estimator, i.e.,

    Var[Q^CV(α)] ≤ Var[Q(Y_0)].

Equality is obtained for α = 0.
Proof. Let

    V(α) = Var[Q^CV(α)] − Var[Q(Y_0)]
         = (Var[Q(Y_0)] + α² Var[Q(Y_1)] + 2α Cov[Q(Y_0), Q(Y_1)]) − Var[Q(Y_0)]
         = α (α Var[Q(Y_1)] + 2 Cov[Q(Y_0), Q(Y_1)]),

where the second equality uses (12) with one low-fidelity model. Therefore, V(α) ≤ 0 if and only if either α ≥ 0 and α ≤ f̄, or α ≤ 0 and α ≥ f̄; that is, if f̄ ≥ 0 then α ∈ [0, f̄], and if f̄ ≤ 0 then α ∈ [f̄, 0].

The corollary below confirms the advantage of our estimator over the MFIS one: aside from the trivial case of independent models, using an appropriate value of α ensures variance reduction.

Corollary 6.1 (Range of control variate weight for MF estimator). Let

    f̃ = −2 Cov_q̂[Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1)] / Var_q̂[Q^IS_{q̂,n}(Y_1)],

and Var_q̂[Q^IS_{q̂,n}(Y_1)] > 0. If f̃ ≥ 0 (resp. f̃ ≤ 0), then for a control variate weight in the range α ∈ [0, f̃] (resp. α ∈ [f̃, 0]), the variance of the MF estimator is bounded above by that of the MFIS estimator, i.e.,

    Var_q̂[Q^MF_{q̂,n}(Q^IS_{q̂,n}(Y_0), Q^IS_{q̂,n}(Y_1), α)] ≤ Var_q̂[Q^MFIS_{q̂,n}(Y_0)].

Equality is obtained for α = 0.
Proof. Substitute p, Q(Y_0) and Q(Y_1) in Theorem 6 with q̂, Q^IS_{q̂,n}(Y_0) and Q^IS_{q̂,n}(Y_1), respectively.

Theorem 6 can be extended straightforwardly to the approximate control variates described in Section 2.4.3.

Theorem 7 (Range of control variate weight for ACV estimator). Let

    s_1 = Var[Q(Y_1) − µ_1^ACV] > 0,
    s_2 = Cov[Q(Y_0), Q(Y_1) − µ_1^ACV].

If s_2 ≤ 0 (resp. s_2 ≥ 0), then for a control variate weight in the range α ∈ [0, −2s_2/s_1] (resp. α ∈ [−2s_2/s_1, 0]), the variance of the ACV estimator is bounded above by that of the baseline estimator, i.e.,

    Var[Q^ACV(α)] ≤ Var[Q(Y_0)].

Equality is obtained for α = 0.
Proof. This theorem is proved by following the same steps as in the proof of Theorem 6: first simplify the difference V(α) = Var[Q^ACV(α)] − Var[Q(Y_0)] using (24) with one low-fidelity model, and then find α such that V(α) ≤ 0. We skip the details for brevity.

In words, the theorem states that if the variance of Q(Y_1) − µ_1^ACV is non-zero, we can always choose α from an interval depending on the sign of Cov[Q(Y_0), Q(Y_1)] − Cov[Q(Y_0), µ_1^ACV] so that the ACV estimator has smaller variance than the MC estimator.

These results and the ensemble estimator variance reduction results provide strong motivation towards using the control-variate weight for further variance reduction even when the weight cannot be exactly estimated. These results indicate that there is a range of values of the weight that still leads to variance reduction, i.e., a weight estimate can have an error but still be beneficial. The ensemble estimator variance results provide a sufficient condition for the number of samples required to guarantee a particular size of variance reduction.

4.2 Algorithms

Algorithms 1 and 2 provide pseudocode to implement the ensemble estimators for the control variate and the approximate control variate with the importance sampling approach. These algorithms require the selection of a low-fidelity model correlated to the high-fidelity model, the input parameters to the EM algorithm, and the number of samples. Using these inputs, the algorithms produce an estimate of the expected value of the high-fidelity model. We note that Algorithm 2 uses the ACV-IS strategy; adapting it to the ACV-MF strategy is trivial and omitted for brevity.
Algorithm 1: Ensemble control variate estimator using importance sampling
Require: Y_0, Y_1: high-fidelity and low-fidelity models; p: PDF of the input random variables; µ_1: expected value of Y_1, i.e., µ_1 = E[Y_1]; n_s, τ, k_init: parameters of the EM algorithm; R²: correlation coefficient; C: target cost, or ζ: target accuracy.
Ensure: µ̂_0: estimate of the expected value of Y_0, i.e., µ̂_0 ≈ E[Y_0]; v̄_CV: estimate of the sample variance of the ensemble CV estimator.
  Compute the approximate IS density q̂ using the EM algorithm with the parameters n_s, τ, and k_init
  Determine the number of outer loops K and the number of samples n using R² and C (or ζ) (e.g., by solving an optimization problem)
  for i = 1, 2, ..., K do
      Draw n samples z_1, z_2, ..., z_n from the density q̂
      for j = 1, 2, ..., n do
          Q_j(Y_k) = Y_k(z_j) Ŵ(z_j) for k = 0, 1
      end for
      Q̄_i(Y_k) = (1/n) Σ_{j=1}^n Q_j(Y_k) for k = 0, 1
  end for
  Q̃(Y_k) = (1/K) Σ_{i=1}^K Q̄_i(Y_k) for k = 0, 1
  ĉ = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_0) − Q̃(Y_0)) (Q̄_i(Y_1) − Q̃(Y_1));   ŝ_k² = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k))² for k = 0, 1
  ᾱ = −ĉ / ŝ_1²
  µ̂_0 = Q̃(Y_0) + ᾱ (Q̃(Y_1) − µ_1);   v̄_CV = (1/K)(ŝ_0² + ᾱ² ŝ_1² + 2 ᾱ ĉ)   {(55)}
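The following fragment is a minimal Python sketch of Algorithm 1 for a scalar input; the EM/CE step is replaced by a fixed Gaussian proposal, and the models, proposal parameters, K and n are assumptions chosen only to make the listing self-contained, not settings from this paper.

import numpy as np
from scipy import stats

# Minimal sketch of Algorithm 1 for Z ~ N(0, 1) with indicator models (toy assumptions).
rng = np.random.default_rng(3)
Y0 = lambda z: (3.0 - z < 0).astype(float)       # high-fidelity indicator model
Y1 = lambda z: (2.5 - z < 0).astype(float)       # low-fidelity indicator model
mu1 = 1 - stats.norm.cdf(2.5)                    # known mean of Y1 (CV setting)

q_loc, q_scale = 3.0, 1.0                        # stand-in for the EM-fitted q_hat
p_pdf = stats.norm.pdf
q_pdf = lambda z: stats.norm.pdf(z, q_loc, q_scale)

K, n = 50, 2000                                  # outer loops and samples per loop
Qbar0, Qbar1 = np.empty(K), np.empty(K)
for i in range(K):
    z = rng.normal(q_loc, q_scale, size=n)       # draw n samples from q_hat
    w = p_pdf(z) / q_pdf(z)                      # density ratio W(z)
    Qbar0[i] = np.mean(Y0(z) * w)                # batch IS estimator of E[Y0]
    Qbar1[i] = np.mean(Y1(z) * w)                # batch IS estimator of E[Y1]

Qt0, Qt1 = Qbar0.mean(), Qbar1.mean()
c_hat = np.sum((Qbar0 - Qt0) * (Qbar1 - Qt1)) / (K - 1)
s0, s1 = Qbar0.var(ddof=1), Qbar1.var(ddof=1)
alpha = -c_hat / s1
mu0_hat = Qt0 + alpha * (Qt1 - mu1)              # ensemble CV-IS estimate
v_cv = (s0 + alpha**2 * s1 + 2 * alpha * c_hat) / K   # variance estimate, cf. (55)
print(f"estimate {mu0_hat:.3e}  (exact {1 - stats.norm.cdf(3.0):.3e}), est. var {v_cv:.2e}")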
Algorithm 2: Ensemble approximate control variate estimator using importance sampling

Require: Y_0, Y_1: high-fidelity and low-fidelity models; p: PDF of the input random variables; n_s, τ, k_init: parameters of the EM algorithm; R²: correlation coefficient; C: target cost, or ζ: target accuracy.
Ensure: µ̂_0: estimate of the expected value of Y_0, i.e., µ̂_0 ≈ E[Y_0]; v̄_IS: estimate of the sample variance of the ensemble ACV-IS estimator.
  Compute the approximate IS density q̂ using the EM algorithm with the parameters n_s, τ, and k_init
  Determine the number of outer loops K, and the numbers of samples n and m using R² and C (or ζ) (e.g., by solving an optimization problem)
  for i = 1, 2, ..., K do
      Draw n samples {z_1, z_2, ..., z_n} and m samples {z'_1, z'_2, ..., z'_m} from the density q̂
      for j = 1, 2, ..., n do
          Q_j(Y_k) = Y_k(z_j) Ŵ(z_j) for k = 0, 1
      end for
      for j = 1, 2, ..., m do
          Q'_j(Y_1) = Y_1(z'_j) Ŵ(z'_j)
      end for
      Q̄_i(Y_k) = (1/n) Σ_{j=1}^n Q_j(Y_k) for k = 0, 1
      Q̄'_i(Y_1) = (1/(n+m)) (Σ_{j=1}^n Q_j(Y_1) + Σ_{j=1}^m Q'_j(Y_1))
  end for
  Q̃(Y_k) = (1/K) Σ_{i=1}^K Q̄_i(Y_k) for k = 0, 1;   Q̃'(Y_1) = (1/K) Σ_{i=1}^K Q̄'_i(Y_1)
  ŝ_k² = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k))² for k = 0, 1;   ŝ'² = (1/(K−1)) Σ_{i=1}^K (Q̄'_i(Y_1) − Q̃'(Y_1))²
  ĉ = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_0) − Q̃(Y_0)) (Q̄_i(Y_1) − Q̃(Y_1))
  ĉ'_k = (1/(K−1)) Σ_{i=1}^K (Q̄_i(Y_k) − Q̃(Y_k)) (Q̄'_i(Y_1) − Q̃'(Y_1)) for k = 0, 1
  ᾱ = −(ĉ − ĉ'_0) / (ŝ_1² + ŝ'² − 2 ĉ'_1)   {(58)}
  µ̂_0 = Q̃(Y_0) + ᾱ (Q̃(Y_1) − Q̃'(Y_1));   v̄_IS = (1/K)(ŝ_0² + ᾱ² (ŝ_1² + ŝ'² − 2 ĉ'_1) + 2 ᾱ (ĉ − ĉ'_0))   {(56)}

In this section we demonstrate the performance of our ensemble importance-sampling control variate algorithms for rare-event estimation on three examples. The first example is a simple case of estimating tail probabilities involving normal random variables. The second example is a cantilever beam whose material uncertainty is modeled as a random field discretized by the Karhunen-Loève expansion. Our third example analyzes a clamped Mindlin plate in bending under random loads and material properties. To focus on demonstrating the benefit of the control variates and exercising our theory, we focus only on problems with a single low-fidelity model.

The simplicity of the first example allows cheap evaluations of its models, and so enables us to disambiguate between sources of errors, such as lack of optimality in the proposal distribution. The second example is more realistic than the first one in the sense that it solves a system of PDEs by the finite element method, and computing output statistics on a dense mesh is often expensive. Based on similar settings to the first example of [8], our last example aims to stress the benefit of our proposed estimator on a more complex problem.

For each example we provide the governing equations, definitions of limit state functions, and the choice of HF and LF models. We then describe implementation details such as parameters of the EM algorithm and numbers of samples. In each example we aim to compare the variances of several estimators under equal costs in order to assess their performance relative to the MFIS estimator. In each example we fix the total cost to C_MFIS, the cost of the MFIS estimator, and then only use a number of high-fidelity samples n_HF and low-fidelity samples n_LF to ensure C_MFIS = n_HF C_HF + n_LF C_LF, where C_HF and C_LF are the costs of one evaluation of the HF and LF models, respectively. Note that we only compare online cost, where the low-fidelity distribution has already been computed, because the offline cost is equal for all algorithms (i.e., they all use the same biasing distribution). The values of n_HF and n_LF are determined in each example from either a stated assumption or from the runtime of the implementation.

All examples [34] are implemented in Matlab and use the cross-entropy (CE) code from [35] with some trivial modifications to integrate with our estimator. All quantities given below are dimensionless for simplicity. For convenience, the figures in this section use shortened notations for the variances of the considered estimators, i.e.,

    v_CV = Var_q̂[Q^MF_{q̂,n}],    (60)
    v_0 = Var_q̂[Q^MFIS_{q̂,n}],    (61)
    v_IS = Var_q̂[Q^MF-1_{q̂,n}],    (62)
    v̄_CV = Var_q̂[Q̄^MF_{q̂,K}],    (63)
    v̄_IS = Var_q̂[Q̄^MF-1_{q̂,K}],    (64)

where Q̄^MF_{q̂,K} and Q̄^MF-1_{q̂,K} are the ensemble estimators defined in (22) and (37) when replacing Q^CV and Q^ACV with Q^MF_{q̂,n} and Q^MF-1_{q̂,n}, respectively, and using K ensembles; and MF-1 indicates the use of the ACV-IS scheme. Estimates of the variance of each ensemble estimator are available from Algorithms 1 and 2, and for reference we recall that the expressions of v_CV and v_IS are Eqs. (55) and (56). However, these variances are only valid for cases with large K.
In cases with small K these algorithms need to be run many times to evaluate the empirical variances of the estimators. Lastly, we note that the true mean in the control variate approach is determined either from an analytical expression (the first example) or using an extra set of a very large number of samples (the second and third examples).

We first consider an analytical example where we seek to evaluate tail probabilities of Gaussians. Our aim is to explore: (1) how non-optimality in the proposal distribution affects the variance reduction, and (2) how much benefit we obtain, in both cases, over the process that uses a single-fidelity importance sampling estimator based on a low-fidelity proposal [8].

We consider a standard normal input space Z ∼ N(0, 1), and two limit-state functions g_0 and g_1. In this problem we will treat g_0 as the high-fidelity model.
The failure probabilities can be analytically computed using the standard normal cumulative distribution function Ψ, according to P_f(g_0(z) < 0) = 1 − Ψ(l_0), where l_0 = 3 is our chosen threshold. The high-fidelity limit-state function becomes

    g_0(z) = l_0 − z.

A low-fidelity model would have an error in the failure probability threshold, and we posit that such an error occurs from an incorrect specification of the threshold. In this case we use a smaller threshold l_1 < l_0 to make this model “lower-fidelity”,

    g_1(z) = l_1 − z.

Figure 1: Minimum variance ratio of v_CV to v_0, and KL divergence between the approximate low-fidelity distribution and the exact low-fidelity distribution, vs. threshold l^(i).

The cross-entropy method does not provide fine-grained control over the approximation error due to the approximation quality of the Gaussian mixture model. However, we would like to investigate how such approximation quality affects algorithm performance. To this end, this subsection investigates an alternative way to generate proposal distributions with fine-grained control on error.

Our alternative approach is to use a sequence of biasing distributions obtained by specifying an alternate set of limits g^(i)(z) = l^(i) − z, with the intermediate thresholds l^(i) increasing toward the low-fidelity threshold l_1, and to use rejection sampling to sample from them exactly. When l^(i) = l_1, we are exactly sampling from the optimal proposal for the low-fidelity model. As the threshold decreases below l_1, our proposal distribution has increasing error (as any GMM proposal would). In other words, as the intermediate thresholds l^(i) approach the LF threshold l_1, the corresponding intermediate IS densities progressively become better and ultimately recover the low-fidelity IS density. In this manner, we have disambiguated the error due to sub-optimal biasing distributions from that due to low-fidelity effects.

Figure 2: α vs. v_CV/v_0 for the smallest intermediate threshold l^(i).

Figure 1 shows two curves: the red curve is the Kullback-Leibler (KL) divergence between the approximate low-fidelity distribution and the exact low-fidelity distribution, and the blue curve is the minimum variance ratio min(v_CV/v_0) between the control-variate and the MFIS estimators. As the KL divergence converges to zero, the variance ratio min(v_CV/v_0) approaches 1. This behavior is expected because it corresponds to perfect sampling of the low-fidelity model, which leads to a zero-variance estimate for the low-fidelity model. Specifically, as this variance vanishes, the contribution to the covariance vanishes, and we recover v_CV = v_0.

We also see that as the KL divergence increases, min(v_CV/v_0) reaches a plateau. This behavior aligns well with the fact that the variance reduction must depend on the correlation amongst models, and we cannot reduce v_CV/v_0 to 0 simply by using cruder approximations of the LF biasing distribution.

Next we show that even if the estimate of the control variate weight is not extremely accurate in practice, there is still an interval of weights over which our estimator is able to achieve a larger variance reduction. Figure 2 shows the ratio of the variances v_CV and v_0 with respect to varying weight α for the proposal based on the smallest intermediate threshold. Such a dependence is quadratic according to (55). The threshold of equal performance is shown in red. It is clear from Figure 2 that there is a range of α in which the CV estimator has smaller variance than the MFIS estimator. This fact has been established in Corollary 6.1. In other words, although the CV estimator has an additional parameter to estimate, there is a range of weights for which it still improves upon the baseline estimator.
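A compact way to see the quadratic dependence of Figure 2 is to evaluate (55) directly with sample moments; in the Python sketch below a truncated normal plays the role of the exact low-fidelity-optimal proposal, and the low-fidelity threshold and sample size are assumed values, not settings from this paper.

import numpy as np
from scipy import stats

# Sweep the CV weight and evaluate the variance ratio of (55) for the Gaussian tail example.
rng = np.random.default_rng(4)
l0, l1, n = 3.0, 2.5, 100_000
q = stats.truncnorm(a=l1, b=np.inf)          # standard normal truncated to (l1, inf): q*(z) for Y1
z = q.rvs(size=n, random_state=rng)
w = stats.norm.pdf(z) / q.pdf(z)             # density ratio p(z)/q(z)

h0 = (z > l0).astype(float) * w              # per-sample IS terms for Y0
h1 = (z > l1).astype(float) * w              # per-sample IS terms for Y1 (here equal to w)
v0 = h0.var(ddof=1) / n                      # variance of the MFIS estimator
cov = np.cov(h0, h1, ddof=1)[0, 1] / n
v1 = h1.var(ddof=1) / n

for alpha in np.linspace(-1.5, 0.5, 5):
    v_cv = v0 + alpha**2 * v1 + 2 * alpha * cov          # eq. (55)
    print(f"alpha = {alpha:+.2f}   v_CV / v_0 = {v_cv / v0:.3f}")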
We now switch from the exact biasing-distribution sampling to the EM algorithm approach of Section 2.3. This algorithm is deployed to construct the approximate density from the low-fidelity model (i.e., the threshold l_1), from which samples are drawn to calculate the estimated variances in Figure 3. The input parameters of the EM algorithm are n_s = 3000 samples, the parameter τ, and k_init = 3 initial mixture components. The variance of the baseline estimator v_0 is calculated using a large reference set of samples. We assume further that for this example the HF model is 30 times more expensive to evaluate than the LF model. As shown in Table 2, the number of samples used to calculate the other estimators is chosen such that they consume the same cost as the baseline estimator. To compute the variances of the ensemble estimators, i.e., v̄_CV and v̄_IS, we utilize the sample sets of the corresponding component estimators (i.e., Q^MF_{q̂,n} and Q^MF-1_{q̂,n}) but divide each of them into K = 1000 batches, and then apply Algorithms 1 and 2.

Table 2: Numbers of high-fidelity and low-fidelity samples (n_HF, n_LF) allocated to the estimators v_0, v_CV, and v_IS.

Figure 3 shows that:
1. the ACV estimators Q^MF_{q̂,n} and Q^MF-1_{q̂,n} have smaller variances than that of MFIS over a certain range of weights;
2. as reported in [1], v_CV is the smallest among the estimators since it uses the exact mean of Y_1;
3. since v̄_CV and v̄_IS utilize the estimated optimal control weight, their variances approximate the minima of v_CV and v_IS;
4. even with small K, the ensemble estimators ṽ_CV and ṽ_IS are still smaller than v_0, in accordance with Corollary 4.1.

Figure 3: α vs. variance ratios using the EM algorithm with GMM; ṽ_CV = Var_q̂[Q̄^MF_{q̂,K}] and ṽ_IS = Var_q̂[Q̄^MF-1_{q̂,K}] with small K (i.e., K = 4).

As a reference, Figure 4 shows the HF, LF and approximate densities, denoted by q_0, q_1 and q̂, respectively. Here, q̂ is the approximate optimal density obtained via the GMM approximation. Being the cross-entropy approximation to the LF density, the function q̂ fluctuates around q_1; in particular, the vertical jump of q_1 at the threshold l_1 is estimated by a very steep curve.

Figure 4: High-fidelity, low-fidelity and approximate densities.

In this section, we compare estimators for a cantilever beam with uncertain material properties. Figure 5 displays the cantilever beam, the two-dimensional domain used in this example. The beam is fixed on its left side, subject to a vertical load at its upper-right corner, and has dimensions of L_1 × L_2 unit lengths. Let X = [0, L_1] × [0, L_2] ⊂ R² denote the domain in Figure 5.

Figure 5: The cantilever beam.

Considering a rectangular Cartesian coordinate system and the Einstein summation convention, the governing equations [36] of a two-dimensional, infinitesimal-strain, linear elastic problem without body force are given as

    σ_{ij,j} = 0,    σ_{ij} = λ ε_{kk} δ_{ij} + 2 µ ε_{ij},    (65)
    ε_{ij} = (1/2)(u_{i,j} + u_{j,i}),    u_i(0, x_2) = 0,    (66)

where the (•)_{,j} subscript stands for ∂(•)/∂x_j; σ_{ij} is the Cauchy stress tensor, u_i the displacement, ε_{ij} the strain, and δ_{ij} the Kronecker delta. The boundary condition (66) reflects the fixed left side of the beam. In (65), λ and µ are the Lamé constants, which can be calculated from the Young's modulus E and Poisson's ratio ν as

    λ = Eν / ((1 + ν)(1 − 2ν)),    µ = E / (2(1 + ν)).

The equations are solved using the finite element method, whose mesh is assembled from square, linear, plane-stress elements with unit thickness. Those elements are made of an isotropic, linear elastic material characterized by E and ν.
In this section, we compare estimators for a cantilever beam with uncertain material properties. Figure 5 displays the cantilever beam, the two-dimensional domain used in this example. The beam is fixed on its left side, is subject to a vertical load at its upper-right corner, and has dimensions $L_1 \times L_2$. Let $X = [0, L_1] \times [0, L_2] \subset \mathbb{R}^2$ denote the domain in Figure 5. Considering a rectangular Cartesian coordinate system and the Einstein summation convention, the governing equations [36] of a two-dimensional, infinitesimal-strain, linear elastic problem without body force are

\sigma_{ij,j} = 0, \qquad \sigma_{ij} = \lambda \epsilon_{kk}\delta_{ij} + 2\mu\epsilon_{ij},   (65)

\epsilon_{ij} = \tfrac{1}{2}(u_{i,j} + u_{j,i}), \qquad u_i(0, x_2) = 0,   (66)

where the $(\bullet)_{,j}$ subscript stands for $\partial(\bullet)/\partial x_j$; $\sigma_{ij}$ is the Cauchy stress tensor, $u_i$ the displacement, $\epsilon_{ij}$ the strain, and $\delta_{ij}$ the Kronecker delta. The boundary condition (66) reflects the fixed left side of the beam. In (65), $\lambda$ and $\mu$ are the Lamé constants, which can be calculated from the Young's modulus $E$ and Poisson's ratio $\nu$ as

\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.

The equations are solved using the finite element method, whose mesh is assembled from square, linear, plane-stress elements with unit thickness. Those elements are made of an isotropic, linear elastic material characterized by $E$ and $\nu$.

Figure 5: The cantilever beam.

We treat Young's modulus as the uncertain quantity in this problem. Young's modulus must be positive and finitely bounded, so we model it as a uniform random field obtained by transforming a two-dimensional Gaussian random field with the covariance

K(\mathbf{s}, \mathbf{t}) = \exp\!\left(-\frac{(s_1 - t_1)^2}{r_1^2}\right)\exp\!\left(-\frac{(s_2 - t_2)^2}{r_2^2}\right) = K_1(s_1, t_1)\, K_2(s_2, t_2) \quad \text{for } \mathbf{s}, \mathbf{t} \in X,   (67)

where $r_1$ and $r_2$ are the correlation lengths in the two coordinate directions. The transformation [37] is performed by

E(\mathbf{x}, \omega) = F^{-1} \circ [\Phi(y(\mathbf{x}, \omega))] \quad \text{for } \mathbf{x} \in X \text{ and } \omega \in \Omega,   (68)

where $F^{-1}$ is the inverse of a prescribed CDF, $\Phi(\cdot)$ the standard normal CDF, $y(\mathbf{x}, \omega)$ a stationary zero-mean Gaussian random field, and $E(\mathbf{x}, \omega)$ the Young's modulus. Choosing $F^{-1}$ as the inverse of a uniform CDF, we have

E(\mathbf{x}, \omega) = a + (b - a)\,\Phi(y(\mathbf{x}, \omega)),   (69)

where $a$ and $b$ are the lower and upper bounds of the uniform distribution. In practice the random field $y(\mathbf{x}, \omega)$ needs to be discretized by an appropriate method such as the Expansion Optimal Linear Estimator [38] or polynomial chaos expansion [39, 40]. Among such methods, the Karhunen-Loève (KL) expansion minimizes the mean squared error [41, 42], resulting in the smallest number of terms in the expansion for a required accuracy [43]. The KL expansion of the random field $y(\mathbf{x}, \omega)$ is

y(\mathbf{x}, \omega) = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \xi_i(\omega)\, \psi_i(\mathbf{x}),   (70)

where $\xi_i(\omega)$ are standard normal random variables. The eigenvalues $\lambda_i$ and the corresponding orthogonal eigenfunctions $\psi_i(\mathbf{x})$ are solutions of the eigenvalue problem

\int_D K(\mathbf{s}, \mathbf{t})\, \psi_i(\mathbf{t})\, d\mathbf{t} = \lambda_i\, \psi_i(\mathbf{s}),   (71)

where $K(\mathbf{s}, \mathbf{t})$ is the covariance function of the random field. Two practical issues have to be considered. First, the infinite KL expansion (70) must be truncated to be computable. Second, the integral equation (71) is not trivial to solve on a high-dimensional domain. Thus, the separability of the covariance function (67) is exploited, leading to separable eigenvalues and eigenfunctions [42]:

\lambda_i = \lambda_{i_1}\lambda_{i_2}, \qquad \psi_i(\mathbf{x}) = \psi_{i_1}(x_1)\,\psi_{i_2}(x_2),   (72)

where $\lambda_{i_j}$ and $\psi_{i_j}(x_j)$, $j \in \{1, 2\}$, are the eigenvalues and eigenfunctions of the integral equation (71) using the covariance function $K_j(s_j, t_j)$ in (67), and the $\lambda_i$ are arranged in decreasing order.

Figure 6: Young's modulus for a fixed sample $\{\xi_1, \ldots, \xi_{n_{KL}}\}$: (a) high-fidelity model, (b) low-fidelity model.

Figure 7: Mesh deformation for the same sample: (a) high-fidelity model, (b) low-fidelity model.

In this example the following numerical values are used: Poisson's ratio $\nu = 0.$; correlation lengths $r_1 = 60$ and $r_2 = 20$; beam dimensions $L_1 \times L_2 = 0. \times .$; and bounds of the Young's modulus $a = 1$ and $b = 2$. A minimal numerical sketch of the resulting random-field construction is given below.
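The sketch below illustrates the separable KL construction: it discretizes the two one-dimensional eigenproblems (71) on a grid (a simple Nyström-type approximation), forms the 2D eigenpairs as products (72), and maps the truncated Gaussian field to a uniform Young's modulus via (69). The unit-square domain, the correlation lengths, the grid resolution, and the function names are our own illustrative assumptions, not the values used for the beam above.

```python
import numpy as np
from scipy.stats import norm

def kl_1d(cov_fn, length, n_grid, n_modes):
    """Nystrom-type discretization of the 1D eigenproblem
    int K(s,t) psi(t) dt = lambda psi(s) on [0, length]."""
    s = (np.arange(n_grid) + 0.5) * length / n_grid       # midpoint grid
    w = length / n_grid                                     # quadrature weight
    K = cov_fn(s[:, None], s[None, :])
    lam, vec = np.linalg.eigh(K * w)                        # symmetric eigensolve
    idx = np.argsort(lam)[::-1][:n_modes]                   # keep largest eigenvalues
    return lam[idx], vec[:, idx] / np.sqrt(w), s            # L2-normalized modes

# Separable squared-exponential covariances (illustrative correlation lengths).
cov1 = lambda s, t: np.exp(-((s - t) ** 2) / 0.3 ** 2)
cov2 = lambda s, t: np.exp(-((s - t) ** 2) / 0.1 ** 2)
lam1, psi1, x1 = kl_1d(cov1, length=1.0, n_grid=200, n_modes=5)
lam2, psi2, x2 = kl_1d(cov2, length=1.0, n_grid=200, n_modes=5)

# 2D eigenpairs as products (eq. (72)), sorted by decreasing eigenvalue, truncated.
pairs = sorted(((lam1[i] * lam2[j], i, j) for i in range(5) for j in range(5)),
               reverse=True)[:10]

def young_modulus(xi, a=1.0, b=2.0):
    """Truncated KL realization of y(x, omega), mapped to a uniform field (eq. (69))."""
    y = sum(np.sqrt(l) * xi[k] * np.outer(psi1[:, i], psi2[:, j])
            for k, (l, i, j) in enumerate(pairs))
    return a + (b - a) * norm.cdf(y)

E_field = young_modulus(np.random.default_rng(1).standard_normal(10))
```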
The limit-state functions are given as $g_i(u) = l_i - u$, $i \in \{1, 2\}$, where $u$ is the vertical displacement at the load application point. The thresholds $l_i$ are chosen so that the failure probabilities $P_f(g_i(u) < 0)$, computed using reference samples of $\{\xi_i(\omega)\}_{i=1}^{n_{KL}}$, are small and different; specifically, $l_1 = 118.$, $l_2 = 108.$, with reference values $P_f(g_1(u) < 0) = 0.$ and $P_f(g_2(u) < 0) = 0.$.

We utilize two different meshing schemes as the multi-fidelity models: the high-fidelity model corresponds to the finer mesh and the low-fidelity model to the coarser mesh. The models also have different, but shared, sources of uncertainty. For the high-fidelity model, $u$ depends on the Young's modulus $E(\mathbf{x}, \omega)$, which is transformed according to (69) using a truncated KL expansion $y(\mathbf{x}, \omega)$ from (70); the stiffness matrix of each finite element is calculated using the values of the eigenfunctions at the element's centroid coordinates. For the low-fidelity model, the Young's modulus is just a uniform random variable $E(\omega) = a + (b - a)\Phi(y(\omega))$, where

y(\omega) = \sum_{i=1}^{n_{KL}} \sqrt{\lambda_i}\, \xi_i(\omega)\, \psi_i(\bar{\mathbf{x}})   (73)

and $\bar{\mathbf{x}}$ is the coordinate of the beam centroid. For both models, the number of $\lambda_{i_j}$ and $\psi_{i_j}(x_j)$ is $n_j = 5$, and the number of $\lambda_i$ and $\psi_i(\mathbf{x})$ is $n_{KL} = 10$.

Figures 6 and 7 show the Young's modulus and the mesh deformation of the beam for the high- and low-fidelity models when all the random variables $\xi_i(\omega)$ take the same fixed value; the displacements are scaled for visualization. It is clear from Figure 6 that the Young's modulus is spatially variable for the high-fidelity model while it is constant for the low-fidelity model.

Figure 8: $\alpha$ vs. variance ratios for the cantilever beam.

The EM algorithm uses inputs $n_s = 5000$, $\tau = 0.$, and $k_{init} = 5$. The baseline estimator $Q^{MFIS}_{\hat{q},n}$ takes $n = 4 \times$ samples to compute its variance. We first estimate the ratio between $C_{HF}$ and $C_{LF}$, the costs of one evaluation of the HF and LF models, respectively. Then, we allocate the samples to each model as shown in Table 3, using this cost ratio so that the total cost of model evaluations equals that of the baseline estimator; a short sketch of this budget matching is given at the end of this example.

Table 3: Estimator variances ($v$, $v^{CV}$, $v^{IS}$) and sample allocations ($n_{HF}$, $n_{LF}$) for the cantilever beam.

The results of this example are presented in Figure 8. Again, our estimators perform better than the MFIS estimator, and the CV estimator shows favorable results compared with the ACV schemes.
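The cost-matched allocation can be summarized as follows: given the per-evaluation costs and a chosen ratio of low- to high-fidelity samples, the number of high-fidelity samples is set so that the total cost equals the baseline MFIS budget. The sketch below is our own schematic of this bookkeeping; the numerical inputs are illustrative only, and the actual ratios behind Table 3 are determined by the ACV schemes and are not reproduced here.

```python
def cost_matched_allocation(n_baseline, cost_hf, cost_lf, r):
    """Allocate (n_HF, n_LF) so that n_HF*cost_hf + n_LF*cost_lf equals the
    baseline budget n_baseline*cost_hf, given the ratio r = n_LF / n_HF."""
    budget = n_baseline * cost_hf
    n_hf = int(budget / (cost_hf + r * cost_lf))
    n_lf = int(r * n_hf)
    return n_hf, n_lf

# Illustrative numbers only (the paper's exact budget, costs, and ratio differ).
n_hf, n_lf = cost_matched_allocation(n_baseline=4 * 10**5, cost_hf=30.0,
                                     cost_lf=1.0, r=20.0)
```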
The last example is a modified version of the one provided in [8], where the authors derive the MFIS estimator. While the MFIS estimator is able to achieve impressive speedups of up to several orders of magnitude compared to the MC method, we demonstrate that further variance reduction is still possible by employing control variates.

Figure 9: The Mindlin plate [8].

Let $\mathcal{Y} \subset \mathbb{R}^2$ denote the square domain of the clamped Mindlin plate in Figure 9; $\mathcal{E}$ the four edges of the plate; $\theta_1$ and $\theta_2$ the rotations of the normal to the plate middle plane with respect to the axes $x_1$ and $x_2$, respectively; and $w$ the displacement of the middle plane in the (out-of-plane) $x_3$-direction. The governing equations of Mindlin's theory of plates in static equilibrium are

M_{ij,j} - Q_i = 0, \qquad Q_{i,i} + s = 0, \qquad \theta_1(\mathcal{E}) = 0, \quad \theta_2(\mathcal{E}) = 0, \quad w(\mathcal{E}) = 0,   (74)

where $M_{ij}$ and $Q_i$ are the moment and shear resultants, e.g.,

M_{11} = D\left(\frac{\partial \theta_1}{\partial x_1} + \nu \frac{\partial \theta_2}{\partial x_2}\right), \qquad Q_1 = \kappa G h\left(\theta_1 + \frac{\partial w}{\partial x_1}\right),

and so on; $D$, $G$, $\kappa$, $h$, and $\nu$ are the bending rigidity, shear modulus, shear correction factor, plate thickness, and Poisson's ratio, respectively; and $s$ is a transverse load. We refer to [44, 45] for a complete treatment of the plate theory with detailed equations. The boundary conditions in (74) state that for a clamped plate there are no rotations and no displacement along the edges of the plate, which is made of an isotropic, linear elastic material with Young's modulus E = 10
and Poisson's ratio $\nu = 0.$ Following the settings in [8], we divide the plate into four regions $\{\mathcal{Y}_i\}_{i=1}^{4}$, each of which has a random thickness $h_i$ and is subject to a random load $s_i$. Both $h_i$ and $s_i$ are uniformly distributed, i.e., $\{h_i\}_{i=1}^{4} \sim U(0., .)$ and $\{s_i\}_{i=1}^{4} \sim U(1, .)$. According to the first-order shear deformation theory [44], the shear correction factor is $\kappa = 5/6$.

In this example the high-fidelity model uses a finer mesh of square bilinear isoparametric (Q4) elements while the low-fidelity model uses a coarser mesh. The Matlab code from [46, Chapter 12] is adopted for the finite element analysis. Figure 10 shows the plate deformation for both models for a fixed realization of $\{h_i, s_i\}_{i=1}^{4}$. The limit-state functions are defined as $g_i(\mathbf{z}) = l_i - w_{c_i}(\mathbf{z})$, $i \in \{1, 2\}$, where $\mathbf{z} = \{\{h_i\}_{i=1}^{4}, \{s_i\}_{i=1}^{4}\} \subset \mathbb{R}^8$ is a realization of the input random variables and $w_{c_i}(\mathbf{z})$ is the $x_3$-direction displacement of the plate centroid.

Figure 10: Plate deformation for a fixed sample $\{h_i, s_i\}_{i=1}^{4}$: (a) high-fidelity model, (b) low-fidelity model.

The parameters used in the EM algorithm are $n_s = 5000$, $\tau = 0.$, and $k_{init} = 5$. The baseline estimator $Q^{MFIS}_{\hat{q},n}$ is evaluated using $n = 4 \times$ samples for its variance. As shown in Table 4, the cost of the other estimators is guaranteed to equal that of the baseline estimator by appropriate sample allocation, using the number of ensembles $K = 1000$ and the empirically estimated cost ratio between $C_{HF}$ and $C_{LF}$.

Table 4: Estimator variances ($v$, $v^{CV}$, $v^{IS}$) and sample allocations ($n_{HF}$, $n_{LF}$) for the Mindlin plate.

Figure 11: $\alpha$ vs. variance ratios for the Mindlin plate.

The variance ratios are shown in Figure 11. First, we stress that since the MFIS estimator does not take into account the correlations among models, it cannot exploit multi-fidelity modeling to the fullest extent, and thus our estimators achieve clear advantages over it. Second, comparing $v^{IS}$ with $v^{CV}$, the estimator becomes less efficient with an estimated weight, as concluded in [1]. Third, as explained in the first example, $\bar{v}^{CV}$ and $\bar{v}^{IS}$ tightly follow the minima of $v^{CV}$ and $v^{IS}$, respectively. Finally, even with a sub-optimal weight the performance of these estimators is still better than that of the MFIS estimator over a significant range of weights, as specified by Theorem 7.

In this paper we have developed an ensemble estimator for approximate control variate schemes that provides a mechanism to estimate unknown covariances, in addition to unknown means. This contribution has allowed us to provide theoretical bounds on the number of samples required to guarantee a certain variance reduction. Furthermore, this guarantee depends upon a correlation coefficient that is problem dependent. The second contribution is applying the framework in the context of importance sampling. We show that the approximate control variate can further reduce the variance compared to the MFIS approach described in [8], and we are able to achieve considerably greater variance reduction with this approach on several problems of computational mechanics.

Future work will seek to study values of the correlation coefficient that can be derived from the underlying problem, similar to what is done in multi-level MC for multi-fidelity models arising from varying discretizations. Another line of work is extending the importance sampling techniques to include several low-fidelity models. One challenge to overcome is effectively balancing the cost of computing a biasing distribution using the low-fidelity model against the variance reduction that it provides.
Indeed, current approaches to multi-fidelity importance sampling tend to ignore this computational aspect. Finally, as an effective variance reduction technique, our estimators have extensive application potential in expensive UQ problems, including optimization under uncertainty and, in particular, reliability-based and robust optimization.
We thank Gianluca Geraci, John Jakeman, Mike Eldred, and Teresa Portone for helpful discussions surrounding this paper. This project was funded by the Sandia National Laboratories LDRD program.

A The expectation-maximization algorithm
The Kullback-Leibler (KL) divergence between the optimal density $q^*(\mathbf{z})$ and the approximate density $\hat{q}(\mathbf{z})$ is

D(q^*(\mathbf{z}), \hat{q}(\mathbf{z})) = \mathbb{E}_{q^*}\left[\ln\!\left(\frac{q^*(\mathbf{z})}{\hat{q}(\mathbf{z})}\right)\right] = \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(q^*(\mathbf{z}))\, d\mathbf{z} - \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(\hat{q}(\mathbf{z}))\, d\mathbf{z}.

The CE method aims to minimize the KL divergence to find the unknowns in the GMM (9). Let us gather the unknown parameters into the vector $\mathbf{v} = \{\pi_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i;\ i = 1, 2, \ldots, k\}$. Then the optimization problem can be equivalently written as

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} q^*(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, d\mathbf{z},   (A.1)

because the first term of the KL divergence is independent of $\hat{q}$; here $\hat{q}(\mathbf{z}; \mathbf{v})$ stresses the presence of the parameters of the GMM (9). Replacing $Y(\mathbf{z})$ with $I_G(\mathbf{z})$ in (8) and inserting this expression into (A.1), we obtain

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} I_G(\mathbf{z})\, p(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, d\mathbf{z}.   (A.2)

Another sampling density $\hat{q}(\mathbf{z}; \mathbf{w})$, which has the same form as $\hat{q}(\mathbf{z}; \mathbf{v})$ but with a different parameter vector $\mathbf{w}$, is introduced to facilitate the optimization algorithm:

\min_{\mathbf{v}} D(q^*(\mathbf{z}), \hat{q}(\mathbf{z}; \mathbf{v})) = \max_{\mathbf{v}} \int_{\mathbb{R}^d} I_G(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, \widehat{W}(\mathbf{z}; \mathbf{w})\, \hat{q}(\mathbf{z}; \mathbf{w})\, d\mathbf{z} = \max_{\mathbf{v}} \mathbb{E}_{\hat{q}(\mathbf{z}; \mathbf{w})}\!\left[I_G(\mathbf{z}) \ln(\hat{q}(\mathbf{z}; \mathbf{v}))\, \widehat{W}(\mathbf{z}; \mathbf{w})\right] \approx \max_{\mathbf{v}} \frac{1}{n_s}\sum_{i=1}^{n_s} I_G(\mathbf{z}_i) \ln(\hat{q}(\mathbf{z}_i; \mathbf{v}))\, \widehat{W}(\mathbf{z}_i; \mathbf{w}),   (A.3)

where $\mathbf{z}_i$, $i = 1, 2, \ldots, n_s$, are samples drawn from $\hat{q}(\mathbf{z}; \mathbf{w})$, and $\widehat{W}(\mathbf{z}; \mathbf{w}) = p(\mathbf{z})/\hat{q}(\mathbf{z}; \mathbf{w})$. It is noted that by choosing an appropriate joint likelihood $h(\hat{\mathbf{z}} \mid \mathbf{v}) = \prod_{i \in \hat{n}_s} \hat{q}(\mathbf{z}_i \mid \mathbf{v})^{\widehat{W}(\mathbf{z}_i)}$, the optimization problem (A.3) is equivalent to the maximum log-likelihood estimation (MLE) problem

\hat{\mathbf{v}} = \arg\max_{\mathbf{v}} \ln(h(\hat{\mathbf{z}} \mid \mathbf{v})) = \arg\max_{\mathbf{v}} \sum_{i \in \hat{n}_s} \ln(\hat{q}(\mathbf{z}_i \mid \mathbf{v}))\, \widehat{W}(\mathbf{z}_i),   (A.4)

where $\hat{\mathbf{z}} = \{\mathbf{z}_i\}_{i \in \hat{n}_s}$ and $\hat{n}_s = \{i \in n_s : I_G(\mathbf{z}_i) \neq 0\}$. The EM algorithm is an iterative method to find $\hat{\mathbf{v}}$, which is also the solution of (A.3). Let $\hat{\mathbf{v}}^{(m)}$ denote the parameter vector at the $m$-th iteration. In [47] it is shown that

\hat{\mathbf{v}}^{(m+1)} = \arg\max_{\mathbf{v}} \mathbb{E}_{X \mid \hat{\mathbf{z}}, \hat{\mathbf{v}}^{(m)}}[\ln h(\mathbf{x} \mid \mathbf{v})] = \arg\max_{\mathbf{v}} Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}),   (A.5)

where $X$ is the complete data set. Using the GMM (9) and $n_\tau = \tau \hat{n}_s$, where $\tau \in\, ]0, 1]$ is a fixed value used to identify the intermediate failure domains, we have [29, 31]

Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}) = \sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i) \sum_{j=1}^{k} \gamma_{ij}^{(m)} \ln\!\big(\pi_j\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\big),   (A.6)

\gamma_{ij}^{(m)} = \frac{\pi_j^{(m)}\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_j^{(m)}, \boldsymbol{\Sigma}_j^{(m)})}{\sum_{r=1}^{k} \pi_r^{(m)}\, \mathcal{N}(\mathbf{z}_i; \boldsymbol{\mu}_r^{(m)}, \boldsymbol{\Sigma}_r^{(m)})}.   (A.7)

The updating scheme is then derived by solving the optimization problem

\max_{\mathbf{v}} Q(\mathbf{v} \mid \hat{\mathbf{v}}^{(m)}) \quad \text{subject to} \quad \sum_{i=1}^{k} \pi_i = 1, \quad \pi_i \geq 0,\ i = 1, 2, \ldots, k, \quad \boldsymbol{\Sigma}_i \succ 0,\ i = 1, 2, \ldots, k,   (A.8)

where the first and second constraints enforce the $\pi_i$ to be probabilities, and the last constraint renders the covariance matrices positive definite. Using (A.6), (A.7), and the method of Lagrange multipliers, the updating equations [29, 31] are

\nu_j^{(m)} = \sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)}, \qquad \pi_j^{(m+1)} = \frac{\nu_j^{(m)}}{\sum_{r=1}^{k} \nu_r^{(m)}}, \qquad \boldsymbol{\mu}_j^{(m+1)} = \frac{\sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)}\, \mathbf{z}_i}{\nu_j^{(m)}}, \qquad \boldsymbol{\Sigma}_j^{(m+1)} = \frac{\sum_{i=1}^{n_\tau} \widehat{W}(\mathbf{z}_i)\, \gamma_{ij}^{(m)} (\mathbf{z}_i - \boldsymbol{\mu}_j^{(m+1)})(\mathbf{z}_i - \boldsymbol{\mu}_j^{(m+1)})^T}{\nu_j^{(m)}}, \qquad j = 1, 2, \ldots, k.
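For concreteness, the following sketch implements one pass of the weighted updating equations above for a Gaussian mixture with per-sample importance weights $\widehat{W}(\mathbf{z}_i)$. It is a minimal illustration of the E- and M-steps only, under our own naming, and not the full cross-entropy algorithm (which also adapts the intermediate failure domains and the number of components).

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_em_step(z, W, pis, mus, covs):
    """One EM iteration for a GMM with per-sample importance weights W,
    following the weighted updating equations for (pi_j, mu_j, Sigma_j)."""
    k = len(pis)
    # E-step: responsibilities gamma_ij as in (A.7).
    dens = np.column_stack([pis[j] * multivariate_normal.pdf(z, mus[j], covs[j])
                            for j in range(k)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step with importance weights.
    wg = W[:, None] * gamma                      # \hat{W}(z_i) * gamma_ij
    nu = wg.sum(axis=0)                          # nu_j
    pis_new = nu / nu.sum()
    mus_new = (wg.T @ z) / nu[:, None]
    covs_new = []
    for j in range(k):
        diff = z - mus_new[j]
        covs_new.append((wg[:, j, None] * diff).T @ diff / nu[j])
    return pis_new, mus_new, np.array(covs_new)
```

Iterating this step on the weighted samples from the intermediate failure domain yields the GMM parameters of the biasing density $\hat{q}$.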
B Proof of Proposition 2
We rewrite $\hat{\mathbf{C}}$ and $\hat{\mathbf{c}}$ using (21) and (42) as

\hat{\mathbf{C}} = \frac{\mathbf{D}\mathbf{D}^T}{K-1},   (B.1)

\hat{\mathbf{c}} = \frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1}.   (B.2)

Substituting (B.1) and (B.2) into (20), (35), and (36), we obtain

\bar{\boldsymbol{\alpha}}^{CV} = -\hat{\mathbf{C}}^{-1}\hat{\mathbf{c}} = -\left(\frac{\mathbf{D}\mathbf{D}^T}{K-1}\right)^{-1}\frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1} = -\left(\mathbf{D}\mathbf{D}^T\right)^{-1}\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right),   (B.3)

and

\bar{\boldsymbol{\alpha}}^{e} = -\left[\hat{\mathbf{C}} \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \hat{\mathbf{c}}\right] = -\left[\frac{\mathbf{D}\mathbf{D}^T}{K-1} \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \frac{\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)}{K-1}\right] = -\left[\left(\mathbf{D}\mathbf{D}^T\right) \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \left(\mathbf{D}\left(\mathbf{Q}(Y_0) - \bar{Q}_n(Y_0)\,\mathbf{1}_K\right)\right)\right],   (B.4)

respectively. Because $\mathbf{D}$ is the centered data matrix, whose $(i, j)$ entry is $Q^{(j)}_n(Y_i) - \bar{Q}_n(Y_i)$, its rows sum to zero:

\mathbf{D}\mathbf{1}_K = \mathbf{0}_M.   (B.5)

Since $\mathbf{D}\mathbf{1}_K = \mathbf{0}_M$, $\bar{\boldsymbol{\alpha}}^{CV}$ and $\bar{\boldsymbol{\alpha}}^{e}$ become

\bar{\boldsymbol{\alpha}}^{CV} = -\left(\mathbf{D}\mathbf{D}^T\right)^{-1}\mathbf{D}\mathbf{Q}(Y_0), \qquad \bar{\boldsymbol{\alpha}}^{e} = -\left[\left(\mathbf{D}\mathbf{D}^T\right) \circ \mathbf{F}^e\right]^{-1}\left[\mathrm{diag}(\mathbf{F}^e) \circ \left(\mathbf{D}\mathbf{Q}(Y_0)\right)\right].
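As a quick numerical sanity check of the simplification above, the snippet below verifies that centering the rows of the data matrix makes $\mathbf{D}\mathbf{1}_K$ vanish, so that the mean term drops out of $\bar{\boldsymbol{\alpha}}^{CV}$; the matrix sizes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 50
Q_lf = rng.normal(size=(M, K))               # batch estimates Q^(j)_n(Y_i)
Q_hf = rng.normal(size=K)                    # batch estimates Q^(j)_n(Y_0)

D = Q_lf - Q_lf.mean(axis=1, keepdims=True)  # centered data matrix
ones = np.ones(K)

assert np.allclose(D @ ones, 0.0)            # eq. (B.5): D 1_K = 0_M

alpha_full = -np.linalg.solve(D @ D.T, D @ (Q_hf - Q_hf.mean() * ones))
alpha_simplified = -np.linalg.solve(D @ D.T, D @ Q_hf)
assert np.allclose(alpha_full, alpha_simplified)
```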
C Useful matrix algebra identities

The proposition below provides several identities used to manipulate the expressions for the variances of the ensemble estimators.

Proposition 8. Let $\mathbf{A} \in \mathbb{R}^{M \times M}$, $\mathbf{B} \in \mathbb{R}^{M \times K}$, $\mathbf{V} \in \mathbb{R}^{K \times M}$, and $\mathbf{v} \in \mathbb{R}^{K}$. Then we have the identities

\mathrm{diag}(\mathbf{A}) \circ (\mathbf{B}\mathbf{v}) = \left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{v},   (C.1)

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{V} = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathbf{B}\mathbf{V}),   (C.2)

\mathbf{V}^T\left(\mathbf{B} \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K)\right)^T = \left(\mathbf{V}^T\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T,   (C.3)

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T = \left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T,   (C.4)

where $\circ$ is the Hadamard product and $\otimes$ is the outer product.

Proof. Let $[\star]_i$ ($[\star]_{ij}$) denote an entry of the vector (matrix) $\star$. We prove the first two identities by showing that the entries of both sides are equal:

[\mathrm{diag}(\mathbf{A}) \circ (\mathbf{B}\mathbf{v})]_i = a_{ii}\sum_{j=1}^{K} b_{ij} v_j = \sum_{j=1}^{K} (a_{ii} b_{ij}) v_j = \left[\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{v}\right]_i,

\left[\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\mathbf{V}\right]_{ij} = \sum_{k=1}^{K} a_{ii} b_{ik} v_{kj} = a_{ii}\sum_{k=1}^{K} b_{ik} v_{kj} = \left[(\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathbf{B}\mathbf{V})\right]_{ij}.

The third identity follows by transposing both sides of the second:

\mathbf{V}^T\left(\mathbf{B} \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K)\right)^T = \left(\mathbf{B}\mathbf{V}\right)^T \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T = \left(\mathbf{V}^T\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T.

The final identity follows by applying the second and third identities consecutively, and by noting that the Hadamard product is commutative and associative:

\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ \left(\mathbf{B}\left((\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_K) \circ \mathbf{B}\right)^T\right) = (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ \left(\left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T\right) = \left(\mathbf{B}\mathbf{B}^T\right) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M) \circ (\mathrm{diag}(\mathbf{A}) \otimes \mathbf{1}_M)^T.
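These identities are straightforward to check numerically. The snippet below verifies (C.1)-(C.4) for random matrices, with diag(A) interpreted as the column vector of diagonal entries and the outer product taken with a vector of ones; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 4, 6
A = rng.normal(size=(M, M))
B = rng.normal(size=(M, K))
V = rng.normal(size=(K, M))
v = rng.normal(size=K)

d = np.diag(A)                    # diag(A) as a vector in R^M
DK = np.outer(d, np.ones(K))      # diag(A) (outer) 1_K  (M x K)
DM = np.outer(d, np.ones(M))      # diag(A) (outer) 1_M  (M x M)

assert np.allclose(d * (B @ v), (DK * B) @ v)                        # (C.1)
assert np.allclose((DK * B) @ V, DM * (B @ V))                       # (C.2)
assert np.allclose(V.T @ (B * DK).T, (V.T @ B.T) * DM.T)             # (C.3)
assert np.allclose((DK * B) @ (DK * B).T, (B @ B.T) * DM * DM.T)     # (C.4)
```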
D Proof of Proposition 3

The assumptions imply that the distributions of $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}$ and $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}$ are multivariate normal. Thus, the proof focuses on finding the means and variances of those distributions. It is trivial to show that $\mathbb{E}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathbb{E}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}] = \mathbf{0}_M$.

a. We compute the variance of $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}$ as $\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathrm{Var}[\bar{\mathbf{Q}}]$, the $M \times M$ matrix whose $(i, l)$ entry is

\mathrm{Cov}\!\left[\frac{1}{K}\sum_{j=1}^{K} Q^{(j)}_n(Y_i),\ \frac{1}{K}\sum_{j=1}^{K} Q^{(j)}_n(Y_l)\right] = \frac{1}{K}\,\mathrm{Cov}\!\left[Q_n(Y_i), Q_n(Y_l)\right],

because the $K$ batches are i.i.d.; hence $\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV}] = \mathbf{C}/K$.

b. From [1, Appendix D] and [1, Appendix E], we know that

\mathrm{Var}[\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{e}] = \frac{\mathbf{C} \circ \mathbf{F}^e}{K}.
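A quick numerical illustration of part (a): averaging K i.i.d. batch vectors scales the covariance by 1/K. The batch covariance below is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, n_rep = 2, 25, 20000
C = np.array([[1.0, 0.6], [0.6, 2.0]])               # covariance of one batch estimator
L = np.linalg.cholesky(C)

batches = rng.standard_normal((n_rep, K, M)) @ L.T    # n_rep realizations of K batches
Q_bar = batches.mean(axis=1)                          # ensemble means, one per realization

print(np.cov(Q_bar.T))                                # approximately C / K
```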
E Proof of Theorem 1

The goal of this proposition is to compute $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV})]$ and $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e})]$ for $e \in \{\text{ACV-IS}, \text{ACV-MF}\}$ in terms of $\mathrm{Var}[Q_n(Y_0)]$ and known quantities, e.g., the covariances amongst models, the number of ensembles, etc. We begin with an auxiliary result that will be used in the rest of the proof. Recall the law of total expectation $\mathbb{E}_X[X] = \mathbb{E}_Y[\mathbb{E}_X[X \mid Y]]$, where $X$ and $Y$ are random variables in the same probability space. We apply it to calculate the variances of $\bar{Q}^{CV}(\bar{\alpha}^{CV})$ and $\bar{Q}^{e}(\bar{\alpha}^{e})$ by setting $X = (\bar{Q}^{CV}(\bar{\alpha}^{CV}) - \mathbb{E}[\bar{Q}^{CV}(\bar{\alpha}^{CV})])^2$, $Y = \mathbf{Q}$ and $X = (\bar{Q}^{e}(\bar{\alpha}^{e}) - \mathbb{E}[\bar{Q}^{e}(\bar{\alpha}^{e})])^2$, $Y = \tilde{\mathbf{Q}}^e$, to obtain

\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV})\right] = \mathbb{E}_{\mathbf{Q}}\!\left[\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV}) \mid \mathbf{Q}\right]\right], \qquad \mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right]\right],

where $\mathbf{Q} = [Q_n(Y_1), Q_n(Y_2), \ldots, Q_n(Y_M)]^T$, $\tilde{\mathbf{Q}}^e = [Q_n(Y_1) - \mu^e_1, Q_n(Y_2) - \mu^e_2, \ldots, Q_n(Y_M) - \mu^e_M]^T$, and $\mu^e_i = Q_{n r_i}(Y_i)$ for $i = 1, 2, \ldots, M$. The vectors $\mathbf{Q}$ and $\tilde{\mathbf{Q}}^e$ involve only the low-fidelity models, and the expectations with respect to these vectors eliminate the dependence of $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV})]$ and $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e})]$ on $\{Y_i\}_{i=1}^{M}$. Since $\mathrm{Var}[\bar{Q}^{CV}(\bar{\alpha}^{CV}) \mid \mathbf{Q}]$ is a special case of $\mathrm{Var}[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e]$ with $\mathbf{F}^e = \mathbf{1}_M \otimes \mathbf{1}_M$, the computation of the latter plays a central role in the proof below.

We now begin the main logic of the proof by substituting (41) into (37) to obtain

\bar{Q}^{e}(\bar{\alpha}^{e}) = \frac{\mathbf{1}_K^T}{K}\mathbf{Q}(Y_0) - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathrm{diag}(\mathbf{F}^e) \circ (\mathbf{D}\mathbf{Q}(Y_0))\right] = \left(\frac{\mathbf{1}_K^T}{K} - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\right)\mathbf{Q}(Y_0) = \mathbf{X}^T\mathbf{Q}(Y_0),   (E.1)

where $\mathbf{F}^e_K = \mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_K$. The second equality of (E.1) uses the identity (C.1). Given $\tilde{\mathbf{Q}}^e$, the vector $\mathbf{X}$ in (E.1) is fixed, so the conditional variance of $\bar{Q}^{e}(\bar{\alpha}^{e})$ becomes

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right] = \mathbf{X}^T\,\mathrm{Var}\!\left[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e\right]\mathbf{X}.   (E.2)

To compute $\mathrm{Var}[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e]$, we use the assumption that the vector $\{Q_n(Y_0), Q_n(Y_1), \ldots, Q_n(Y_M), \mu^e_1, \ldots, \mu^e_M\}$ has a multivariate normal distribution, so that the distribution of $Q_n(Y_0)$ conditional on $\tilde{\mathbf{Q}}^e = \{Q_n(Y_1) - \mu^e_1, \ldots, Q_n(Y_M) - \mu^e_M\}$ is also normal [48, Theorem 5.3] with variance

\mathrm{Var}\!\left[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}[Q_n(Y_0)] - \mathrm{Cov}[\tilde{\mathbf{Q}}^e, Q_n(Y_0)]^T\,\mathrm{Cov}[\tilde{\mathbf{Q}}^e, \tilde{\mathbf{Q}}^e]^{-1}\,\mathrm{Cov}[\tilde{\mathbf{Q}}^e, Q_n(Y_0)] = (1 - R^e)\,\mathrm{Var}[Q_n(Y_0)].   (E.3)

The conditional covariance matrix of $\mathbf{Q}(Y_0)$ given $\tilde{\mathbf{Q}}^e$ is diagonal because the $K$ batches are i.i.d., and all diagonal entries equal $\mathrm{Var}[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e]$, so

\mathrm{Var}\!\left[\mathbf{Q}(Y_0) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}\!\left[Q_n(Y_0) \mid \tilde{\mathbf{Q}}^e\right]\mathbf{I} = (1 - R^e)\,\mathrm{Var}[Q_n(Y_0)]\,\mathbf{I},   (E.4)

where $\mathbf{I} \in \mathbb{R}^{K \times K}$ is the identity matrix. Substituting (E.4) into (E.2), we obtain

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\,\mathbf{X}^T\mathbf{X}.   (E.5)

The expression $\mathbf{X}^T\mathbf{X}$ can be expanded as

\mathbf{X}^T\mathbf{X} = \left(\frac{\mathbf{1}_K^T}{K} - (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\right)\left(\frac{\mathbf{1}_K}{K} - \left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right) = \frac{1}{K} - \frac{\mathbf{1}_K^T}{K}\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A} - \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\frac{\mathbf{1}_K}{K} + \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A},   (E.6)

where $\mathbf{A} = \left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)$. Using the identity (C.1), we obtain

\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\mathbf{1}_K = \left[(\mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_K) \circ \mathbf{D}\right]\mathbf{1}_K = \mathrm{diag}(\mathbf{F}^e) \circ (\mathbf{D}\mathbf{1}_K) = \mathbf{0}_M, \qquad \mathbf{1}_K^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T = \mathbf{0}_M^T,

which simplify (E.6) to

\mathbf{X}^T\mathbf{X} = \frac{1}{K} + \mathbf{A}^T\left[\mathbf{F}^e_K \circ \mathbf{D}\right]\left[\mathbf{F}^e_K \circ \mathbf{D}\right]^T\mathbf{A} = \frac{1}{K} + (\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e),   (E.7)

where the last equality applies the identity (C.4) with $\mathbf{F}^e_M = \mathrm{diag}(\mathbf{F}^e) \otimes \mathbf{1}_M$. Thus, Theorem 1b is proved as

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e}) \mid \tilde{\mathbf{Q}}^e\right]\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right]\right).   (E.8)

To deduce the result of Theorem 1a, we replace $\bar{Q}^{e}(\bar{\alpha}^{e})$, $R^e$, $\tilde{\mathbf{Q}}^e$, and $\bar{\boldsymbol{\mu}}^e$ with $\bar{Q}^{CV}(\bar{\alpha}^{CV})$, $R$, $\mathbf{Q}$, and $\bar{\boldsymbol{\mu}}^{CV}$, respectively, and note that $\mathbf{F}^e = \mathbf{F}^e_M = \mathbf{1}_M \otimes \mathbf{1}_M$ in the CV case; hence,

\mathrm{Var}\!\left[\bar{Q}^{CV}(\bar{\alpha}^{CV})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R)\left(\frac{1}{K} + \mathbb{E}_{\mathbf{Q}}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV})^T(\mathbf{D}\mathbf{D}^T)^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^{CV})\right]\right).   (E.9)

F Proof of Theorem 4
Proposition 3 suggests that the expressions inside the expectation operators in Theorem 1 may follow Hotelling's $T^2$ distributions. This is indeed the case for (38), while extra assumptions on the ACV-IS and ACV-MF schemes are needed to establish the distribution in (39). The means of the Hotelling's $T^2$ distributions are then computed explicitly to prove Theorem 4. We note that (45) is a special case of (48), and so we prove Theorem 4b first.

The proof strategy is to simplify the expression inside the expectation operator in (39),

(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T\right]\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e),

using the extra assumptions (46) and (47) on the ACV-IS and ACV-MF schemes. First, an identity is inserted into the middle term,

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T = (\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e \circ (\mathbf{F}^e)^{\circ(-1)} \circ \mathbf{F}^e_M \circ (\mathbf{F}^e_M)^T,   (F.1)

where $(\mathbf{F}^e)^{\circ(-1)}$ is the Hadamard inverse of $\mathbf{F}^e$, i.e.,

\left[(\mathbf{F}^e)^{\circ(-1)}\right]_{ij} = \frac{1}{f^e_{ij}},   (F.2)

and $[\star]_{ij}$ denotes an entry of the matrix $\star$. We recall from (30) and (32) that $f^e_{ij}$ depends on the ratios $r_i$ and $r_j$, which must be positive for any meaningful setting; moreover, using either ACV-IS or ACV-MF makes $r_i$ and $r_j$ greater than 1. Thus, in practice $f^e_{ij} \neq 0$ and the Hadamard inverse of $\mathbf{F}^e$ exists.

For the ACV-IS scheme we then have, for $i \neq j$,

\left[(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T\right]_{ij} = \frac{1}{f^{\text{ACV-IS}}_{ij}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj} = \frac{1}{f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{jj} = 1,

and, on the diagonal,

\left[(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T\right]_{ii} = \frac{1}{f^{\text{ACV-IS}}_{ii}}\,f^{\text{ACV-IS}}_{ii}\,f^{\text{ACV-IS}}_{ii} = f^{\text{ACV-IS}}_{ii} = \frac{r_i - 1}{r_i}.   (F.3)

Because we assume $r_i \gg 1$, the diagonal entries equal 1 as well, and

(\mathbf{F}^{\text{ACV-IS}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T = \mathbf{1}_M \otimes \mathbf{1}_M.   (F.4)

For the ACV-MF scheme,

\left[(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T\right]_{ij} = \frac{1}{f^{\text{ACV-MF}}_{ij}}\,f^{\text{ACV-MF}}_{ii}\,f^{\text{ACV-MF}}_{jj} = \frac{\min(r_i, r_j)}{\min(r_i, r_j) - 1}\cdot\frac{r_i - 1}{r_i}\cdot\frac{r_j - 1}{r_j} = \frac{\max(r_i, r_j) - 1}{\max(r_i, r_j)},

\left[(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T\right]_{ii} = \frac{1}{f^{\text{ACV-MF}}_{ii}}\,f^{\text{ACV-MF}}_{ii}\,f^{\text{ACV-MF}}_{ii} = f^{\text{ACV-MF}}_{ii} = \frac{r_i - 1}{r_i}.   (F.5)

Because we assume $r_i = r$, every entry equals $(r-1)/r$, and

(\mathbf{F}^{\text{ACV-MF}})^{\circ(-1)} \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T = \frac{r-1}{r}\,(\mathbf{1}_M \otimes \mathbf{1}_M).   (F.6)

Substituting (F.4) and (F.6) into (F.1) gives

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-IS}}_M \circ (\mathbf{F}^{\text{ACV-IS}}_M)^T = (\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-IS}},   (F.7)

(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-MF}}_M \circ (\mathbf{F}^{\text{ACV-MF}}_M)^T = \frac{r-1}{r}\,(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^{\text{ACV-MF}}.   (F.8)

Substituting (F.7) and (F.8) into (39) yields

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + a(e)\,\mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right]\right).   (F.9)

Here the expectation of $(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)$ can be computed explicitly because $\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e$ has a multivariate normal distribution from (44), and

(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)^T\!\left[(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e\right]^{-1}(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e) = \frac{1}{K(K-1)}\left(\sqrt{K}\,(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right)^T\!\left[\frac{(\mathbf{D}\mathbf{D}^T) \circ \mathbf{F}^e}{K-1}\right]^{-1}\!\left(\sqrt{K}\,(\bar{\mathbf{Q}} - \bar{\boldsymbol{\mu}}^e)\right) = \frac{1}{K(K-1)}\,t^2_{M,K-1},   (F.10)

where $t^2_{M,K-1}$ follows the Hotelling's $T^2$ distribution [48, Corollary 5.3]. Substituting (F.10) into (F.9), the variance of $\bar{Q}^{e}(\bar{\alpha}^{e})$ becomes

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \frac{a(e)}{K(K-1)}\,\mathbb{E}_{\tilde{\mathbf{Q}}^e}\!\left[t^2_{M,K-1}\right]\right) = \mathrm{Var}[Q_n(Y_0)]\,(1 - R^e)\left(\frac{1}{K} + \frac{a(e)}{K(K-1)}\cdot\frac{(K-1)M}{K-M-2}\right) = \frac{\mathrm{Var}[Q_n(Y_0)]}{K}\,(1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right),   (F.11)

where the second equality uses the expectation of the Hotelling's $T^2$ distribution and the third equality simplifies the result. Thus,

\mathrm{Var}\!\left[\bar{Q}^{e}(\bar{\alpha}^{e})\right] = \frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{K} = \frac{\mathrm{Var}[Q_n(Y_0)]}{K}\,(1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right), \qquad \frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{\mathrm{Var}[Q_n(Y_0)]} = (1 - R^e)\left(1 + \frac{a(e)\,M}{K-M-2}\right).

To prove Theorem 4a, we replace $\bar{Q}^{e}(\bar{\alpha}^{e})$, $R^e$, $\tilde{\mathbf{Q}}^e$, and $\bar{\boldsymbol{\mu}}^e$ with $\bar{Q}^{CV}(\bar{\alpha}^{CV})$, $R$, $\mathbf{Q}$, and $\bar{\boldsymbol{\mu}}^{CV}$, respectively, and note that $\mathbf{F}^e = \mathbf{F}^e_M = \mathbf{1}_M \otimes \mathbf{1}_M$ in the CV case; hence,

\frac{\mathrm{Var}[Q^{CV}(\bar{\alpha}^{CV})]}{\mathrm{Var}[Q_n(Y_0)]} = (1 - R)\left(1 + \frac{M}{K-M-2}\right).   (F.12)
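The key ingredient in (F.11) is the mean of the Hotelling $T^2$ statistic, $\mathbb{E}[t^2_{M,K-1}] = (K-1)M/(K-M-2)$. The snippet below checks this value by simulation for one illustrative pair $(M, K)$, with the sample covariance built from K i.i.d. batch vectors as in (F.10); the distribution used here is standard normal purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, n_rep = 3, 12, 50000

t2 = np.empty(n_rep)
for r in range(n_rep):
    X = rng.standard_normal((K, M))          # K i.i.d. batch vectors, identity covariance
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, ddof=1)      # sample covariance, analogue of (DD^T)/(K-1)
    t2[r] = K * xbar @ np.linalg.solve(S, xbar)

print(t2.mean(), (K - 1) * M / (K - M - 2))  # both approximately 4.71
```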
G Proof of Corollary 4.1

We seek to find $K$ such that Theorem 4 guarantees a variance reduction

\frac{\mathrm{Var}[Q^{e}(\bar{\alpha}^{e})]}{\mathrm{Var}[Q_n(Y_0)]} < y \quad \text{for } 0 < y \leq 1 \text{ and } e \in \{\text{CV}, \text{ACV-IS}, \text{ACV-MF}\}.   (G.1)

We first solve (G.1) for $K$ for the ACV-IS and ACV-MF strategies, and then deduce the corresponding result in the CV case. Thus, for $0 < y \leq 1$ and $e \in \{\text{ACV-IS}, \text{ACV-MF}\}$, (G.1) becomes

\left(1 + \frac{a(e)\,M}{K-M-2}\right)(1 - R^e) < y \quad\Longleftrightarrow\quad \frac{(K-M-2) + a(e)\,M}{K-M-2}\,(1 - R^e) < y.   (G.2)

Because of the assumption $K > M + 2$, (G.2) becomes

\big((K-M-2) + a(e)\,M\big)(1 - R^e) < y\,(K-M-2)
\;\Longleftrightarrow\; K(1 - y - R^e) < (M + 2 - a(e)\,M)(1 - R^e) - y(M+2)
\;\Longleftrightarrow\; K(1 - y - R^e) < (M+2)(1 - y - R^e) - a(e)\,M(1 - R^e)
\;\Longleftrightarrow\;
\begin{cases}
K > M + 2 - \dfrac{a(e)\,M(1 - R^e)}{1 - y - R^e} & \text{if } y + R^e > 1,\\[2mm]
K < M + 2 - \dfrac{a(e)\,M(1 - R^e)}{1 - y - R^e} & \text{if } y + R^e < 1.
\end{cases}   (G.3)

Now we show that the case $y + R^e < 1$ leads to a contradiction. Specifically, (G.3) implies that $M + 2 < K < M + 2 - a(e)\,M(1 - R^e)/(1 - y - R^e) < M + 2$, which is impossible.
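In practice, the bound in (G.3) gives the smallest ensemble size that guarantees a prescribed variance-reduction factor y. The helper below evaluates this bound; the numerical inputs are illustrative only, and a_e stands for the scheme-dependent constant a(e) from Theorem 4.

```python
import math

def min_ensembles(M, R_e, y, a_e=1.0):
    """Smallest integer K satisfying K > M + 2 - a_e*M*(1 - R_e)/(1 - y - R_e),
    which by (G.3) requires y + R_e > 1 (otherwise no such K > M + 2 exists)."""
    if y + R_e <= 1.0:
        raise ValueError("Guaranteed reduction to y requires R_e > 1 - y.")
    bound = M + 2 - a_e * M * (1.0 - R_e) / (1.0 - y - R_e)
    return max(math.floor(bound) + 1, M + 3)   # also enforce K > M + 2

# Example: one low-fidelity model, strong correlation, target ratio y = 0.5.
print(min_ensembles(M=1, R_e=0.9, y=0.5))
```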
References

[1] Alex A. Gorodetsky, Gianluca Geraci, Michael S. Eldred, and John D. Jakeman. A generalized approximate control variate framework for multifidelity uncertainty quantification. Journal of Computational Physics, 408:109257, May 2020.
[2] Alireza Doostan, Gianluca Geraci, and Gianluca Iaccarino. A Bi-Fidelity Approach for Uncertainty Quantification of Heat Transfer in a Rectangular Ribbed Channel. In Volume 2C: Turbomachinery. American Society of Mechanical Engineers, June 2016.
[3] Gianluca Geraci, Michael S. Eldred, and Gianluca Iaccarino. A multifidelity multilevel Monte Carlo method for uncertainty propagation in aerospace applications. American Institute of Aeronautics and Astronautics, January 2017.
[4] Anirban Chaudhuri, John Jasa, Joaquim Martins, and Karen E. Willcox. Multifidelity Optimization Under Uncertainty for a Tailless Aircraft. American Institute of Aeronautics and Astronautics, January 2018.
[5] Leo W. T. Ng and Karen E. Willcox. Multifidelity approaches for optimization under uncertainty. International Journal for Numerical Methods in Engineering, 100(10):746–772, December 2014.
[6] Joslin Goh, Derek Bingham, James Paul Holloway, Michael J. Grosskopf, Carolyn C. Kuranz, and Erica Rutter. Prediction and Computer Model Calibration Using Outputs From Multifidelity Simulators. Technometrics, 55(4):501–512, November 2013.
[7] Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization. SIAM Review, 60(3):550–591, January 2018.
[8] Benjamin Peherstorfer, Tiangang Cui, Youssef Marzouk, and Karen Willcox. Multifidelity importance sampling. Computer Methods in Applied Mechanics and Engineering, 300:490–509, March 2016.
[9] Benjamin Peherstorfer, Boris Kramer, and Karen Willcox. Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models. Journal of Computational Physics, 341:61–75, July 2017.
[10] Boris Kramer, Alexandre Noll Marques, Benjamin Peherstorfer, Umberto Villa, and Karen Willcox. Multifidelity probability estimation via fusion of estimators. Journal of Computational Physics, 392:385–402, September 2019.
[11] G. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer, 2014.
[12] P. Glasserman. Monte Carlo Methods in Financial Engineering. Stochastic Modelling and Applied Probability. Springer, 2004.
[13] R.Y. Rubinstein and D.P. Kroese. Simulation and the Monte Carlo Method. Wiley Series in Probability and Statistics. Wiley, 2016.
[14] R. Srinivasan. Importance Sampling: Applications in Communications and Detection. Springer, 2013.
[15] J. Bucklew. Introduction to Rare Event Simulation. Springer Series in Statistics. Springer, 2004.
[16] Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
[17] Michael B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
[18] Michael B. Giles, Mateusz B. Majka, Lukasz Szpruch, Sebastian J. Vollmer, and Konstantinos C. Zygalakis. Multi-level Monte Carlo methods for the approximation of invariant measures of stochastic differential equations. Statistics and Computing, 30(3):507–524, May 2020.
[19] Abdul-Lateef Haji-Ali, Fabio Nobile, and Raúl Tempone. Multi-index Monte Carlo: when sparsity meets sampling. Numerische Mathematik, 132(4):767–806, April 2016.
[20] Casey M. Fleeter, Gianluca Geraci, Daniele E. Schiavazzi, Andrew M. Kahn, and Alison L. Marsden. Multilevel and multifidelity uncertainty quantification for cardiovascular hemodynamics. Computer Methods in Applied Mechanics and Engineering, 365:113030, June 2020.
[21] S. S. Lavenberg and P. D. Welch. A Perspective on the Use of Control Variables to Increase the Efficiency of Monte Carlo Simulations. Management Science, 27(3):322–335, March 1981.
[22] Raghu Pasupathy, Bruce W. Schmeiser, Michael R. Taaffe, and Jin Wang. Control-variate estimation using estimated control means. IIE Transactions, 44(5):381–385, May 2012.
[23] Barry L. Nelson. Control Variate Remedies. Operations Research, 38(6):974–992, December 1990.
[24] Barry L. Nelson. Batch size effects on the efficiency of control variates in simulation. European Journal of Operational Research, 43(2):184–196, 1989.
[25] Art B. Owen. Monte Carlo theory, methods and examples, 2013.
[26] R.Y. Rubinstein and D.P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Information Science and Statistics. Springer, 2004.
[27] Monica F. Bugallo, Victor Elvira, Luca Martino, David Luengo, Joaquin Miguez, and Petar M. Djuric. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine, 34(4):60–79, July 2017.
[28] Iason Papaioannou, Costas Papadimitriou, and Daniel Straub. Sequential importance sampling for structural reliability analysis. Structural Safety, 62:66–75, September 2016.
[29] Sebastian Geyer, Iason Papaioannou, and Daniel Straub. Cross entropy-based importance sampling using Gaussian densities revisited. Structural Safety, 76:15–27, January 2019.
[30] Nolan Kurtz and Junho Song. Cross-entropy-based adaptive importance sampling using Gaussian mixture. Structural Safety, 42:35–44, May 2013.
[31] Yihua Chen and Maya R. Gupta. EM demystified: An expectation-maximization tutorial. Department of Electrical Engineering, University of Washington, 2010.
[32] Geoffrey F. Bomarito, Patrick E. Leser, James E. Warner, and William P. Leser. On the Optimization of Approximate Control Variates with Parametrically Defined Estimators. arXiv:2012.02750, December 2020.
[33] NIST/SEMATECH. e-Handbook of Statistical Methods.
[34] https://github.com/pbtrung/cvis. Accessed: 2020-12-25.
[35] Cross entropy-based importance sampling code. Accessed: 2020-03-31.
[36] M.H. Sadd. Elasticity: Theory, Applications, and Numerics. Elsevier Science, 2020.
[37] Mircea Grigoriu. Simulation of Stationary Non-Gaussian Translation Processes. Journal of Engineering Mechanics, 124(2):121–126, February 1998.
[38] Chun-Ching Li and A. Der Kiureghian. Optimal Discretization of Random Fields. Journal of Engineering Mechanics, 119(6):1136–1154, June 1993.
[39] R.G. Ghanem and P.D. Spanos. Stochastic Finite Elements: A Spectral Approach. Civil, Mechanical and Other Engineering Series. Dover Publications, 2003.
[40] D. Xiu. Numerical Methods for Stochastic Computations: A Spectral Method Approach. Princeton University Press, 2010.
[41] Alen Alexanderian. A brief note on the Karhunen-Loève expansion, October 2015. arXiv:1509.07526.
[42] Limin Wang. Karhunen-Loève Expansions and their Applications. PhD thesis, London School of Economics and Political Science, 2008.
[43] Bruno Sudret and Armen Der Kiureghian. Stochastic Finite Element Methods and Reliability: A State-of-the-Art Report (Report No. UCB/SEMM-2000/08). Department of Civil and Environmental Engineering, University of California, Berkeley, 2000.
[44] G.T. Lim and J.N. Reddy. On canonical bending relationships for plates. International Journal of Solids and Structures, 40(12):3039–3067, June 2003.
[45] J.N. Reddy. Theory and Analysis of Elastic Plates and Shells, Second Edition. Series in Systems and Control. Taylor & Francis, 2006.
[46] A.J.M. Ferreira. MATLAB Codes for Finite Element Analysis: Solids and Structures. Solid Mechanics and Its Applications. Springer Netherlands, 2008.
[47] Sean Borman. The Expectation Maximization Algorithm: A short tutorial. Manuscript, 2009.
[48] W.K. Härdle and L. Simar. Applied Multivariate Statistical Analysis. Springer International Publishing, 2019.