Using Supervised Learning to Improve Monte Carlo Integral Estimation
Brendan Tracey
Stanford University, Stanford, CA 94305
David Wolpert
NASA Ames Research Center, Moffett Field, CA 94035

Juan J. Alonso
Stanford University, Stanford, CA 94305
October 1, 2018
Abstract
Monte Carlo (MC) techniques are often used to estimate integrals of a multivariate function using randomly generated samples of the function. In light of the increasing interest in uncertainty quantification and robust design applications in aerospace engineering, the calculation of expected values of such functions (e.g. performance measures) becomes important. However, MC techniques often suffer from high variance and slow convergence as the number of samples increases. In this paper we present Stacked Monte Carlo (StackMC), a new method for post-processing an existing set of MC samples to improve the associated integral estimate. StackMC is based on the supervised learning techniques of fitting functions and cross validation. It should reduce the variance of any type of Monte Carlo integral estimate (simple sampling, importance sampling, quasi-Monte Carlo, MCMC, etc.) without adding bias. We report on an extensive set of experiments confirming that the StackMC estimate of an integral is more accurate than both the associated unprocessed Monte Carlo estimate and an estimate based on a functional fit to the MC samples. These experiments run over a wide variety of integration spaces, numbers of sample points, dimensions, and fitting functions. In particular, we apply StackMC in estimating the expected value of the fuel burn metric of future commercial aircraft and in estimating sonic boom loudness measures. We compare the efficiency of StackMC with that of more standard methods and show that for negligible additional computational cost significant increases in accuracy are gained.

Nomenclature

$b$ = bias of $M$
$D_x$ = set of samples of $f(x)$
$D_x(i)$ = $i$th sample of $f(x)$
$D_{x_i,\mathrm{test}}$ = subset of samples in the $i$th testing set
$D_{x_i,\mathrm{train}}$ = subset of samples in the $i$th training set
$E[\cdot]$ = expected value
$f(x)$ = objective function
$\hat{f}$ = true expected value of $f(x)$
$\hat{f}_M$ = estimate of $\hat{f}$ from $M$
$f(D_x(i))$ = function value at the $i$th sample
$g(x)$ = fitting algorithm
$g_i(x)$ = $i$th fit to $f(x)$
$g_i(D_x(i))$ = prediction of fit $g_i$ at $D_x(i)$
$\hat{g}$ = expected value of $g(x)$
$k$ = number of folds
$\tilde{L}_i$ = likelihood of the expected value of the $i$th fold
$M$ = an estimator of $\hat{f}$
$m_i$ = number of samples in the $i$th testing set
$N$ = number of samples
$N_i$ = number of samples in the $i$th training set
$p(x)$ = probability distribution of $x$
$q(x)$ = alternate sample distribution
$r(x)$ = fitting algorithm for $p(x)$
$v$ = variance of $M$
$x$ = input parameters
$\alpha$ = free parameter of StackMC
$\beta$ = free parameter in the fitting algorithm
$\rho$ = correlation
$\sigma$ = standard deviation
A system optimized only for best performance at nominal operating conditions can see severe performance degradation at even slightly off-design conditions. Aerospace systems rarely operate at precisely the design condition, and a robust design approach dictates trading some performance at the nominal condition for improved performance over a wide range of input conditions. However, certain input conditions will occur rarely or never, and adding robustness for these conditions will degrade average system performance. Consequently, we are often interested in optimizing the expected performance of a system rather than the performance at any specific operating point.

Formally, we are interested in an integral of the form

$$E[f] = \hat{f} = \int f(x)\, p(x)\, dx, \qquad (1)$$

where $f(x)$ is the (potentially multivariate) objective function to be optimized, and $p(x)$ is the probability density function (or probability distribution, in which case (1) is actually a summation) from which $x$ is generated.

Evaluating (1) is non-trivial, especially when the objective function is high dimensional and/or expensive to evaluate. When standard quadrature techniques (such as Simpson's rule) become intractable, Monte Carlo (MC) methods are powerful techniques which have become the standard for integral estimation. However, MC techniques also have their drawbacks; the required computational expense may be prohibitive. The question is whether a technique exists to significantly lower the cost of estimating (1). This paper describes such a technique, which combines MC with machine learning and statistical techniques and can lead to significant computational savings over traditional methods.

We begin this paper with a brief review of integral estimation techniques, including Monte Carlo, fitting algorithms, and stacking. We then introduce a new technique we call Stacked Monte Carlo (StackMC), which uses stacking to reduce the estimation error of any Monte Carlo method. Finally, we apply StackMC to a number of analytic example problems and to two problems from the aerospace literature. We show that StackMC can significantly reduce estimation error and never has higher estimation error than the best of Monte Carlo and the chosen fitting algorithm.

When using any method $M$ returning $\hat{f}_M$ as an estimate of (1), there are two sources of estimation error: bias ($b$) and variance ($v$). Bias is the expected difference between the method and the truth: $b = E[\hat{f}_M - \hat{f}]$, where the expectation is over all data sets. A method whose average over many runs gives the correct answer has zero bias, whereas a method that estimates high or low on average has non-zero bias. As an example, the Euler equations have significant bias in their estimate of viscous drag. Variance is a measure of how much $\hat{f}_M$ varies between different runs of $M$: $v = E[(\hat{f}_M - E[\hat{f}_M])^2]$. If $M$ is deterministic, it has zero variance because every run returns the same answer, whereas if $M$ is stochastic, multiple runs have different outputs, leading to a non-zero variance. The total expected squared error is the combination of these two factors: $E[(\hat{f}_M - \hat{f})^2] = b^2 + v$. Any method seeking to reduce estimation error must keep both sources of error low.
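To make this decomposition concrete, the short sketch below (our own illustration, not from the paper; the integrand, sample counts, and seed are arbitrary choices) empirically verifies that the mean squared error of the simple MC estimator equals $b^2 + v$:

```python
# Empirical check of E[(f_M - f)^2] = b^2 + v for the simple Monte Carlo
# mean estimator. Hypothetical 1-D example: f(x) = x^2 with x ~ Uniform(0,1),
# so the true expected value is 1/3.
import numpy as np

rng = np.random.default_rng(0)
f_true = 1.0 / 3.0
n_runs, n_samples = 20000, 50

# One MC estimate per run, each from a fresh batch of samples.
estimates = np.array([
    np.mean(rng.uniform(0.0, 1.0, n_samples) ** 2) for _ in range(n_runs)
])

bias = estimates.mean() - f_true            # b: should be ~0 (MC is unbiased)
variance = estimates.var()                  # v: spread between runs
mse = np.mean((estimates - f_true) ** 2)    # total expected squared error

print(f"b^2 + v = {bias**2 + variance:.3e}, MSE = {mse:.3e}")  # nearly equal
```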
In "Simple Sampling" Monte Carlo (MC), a set of $N$ samples, $D_x$, is generated randomly according to $p(x)$. The estimate of the expected value of the function based on $D_x$ is the average of the function values of the samples:

$$\hat{f} \approx \hat{f}_{mc} = \frac{1}{N} \sum_{i=1}^{N} f(D_x(i)),$$

where $D_x(i)$ refers to the $i$th generated sample. Simple Monte Carlo has two extremely useful properties. First, MC is guaranteed to be unbiased and thus, on average, will return the correct answer. Second, MC is guaranteed to converge to the correct answer at a rate of $O(N^{-1/2})$, independent of the number of dimensions. That is, for any problem, increasing the number of samples by a factor of one hundred decreases the expected error by a factor of ten. As Simple Monte Carlo has zero bias, all of the error comes from the variance of Monte Carlo runs. This variance is due to a different kind of variance, concerning $f(x)$ itself; fluctuations in the Monte Carlo estimate are directly related to fluctuations in $f(x)$. As a result, the smaller the variance of $f(x)$, the smaller the expected error of Monte Carlo.

Many different sample generation techniques exist to help reduce the variance of Monte Carlo. Importance sampling [1] generates samples from an alternate distribution $q(x)$ (instead of the true distribution $p(x)$) and estimates integral (1) as a weighted average of sample values:

$$\hat{f} = \int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)}\, q(x)\, dx,$$
$$\hat{f}_{is} = \frac{1}{N} \sum_{i=1}^{N} f(D_x(i)) \frac{p(D_x(i))}{q(D_x(i))}. \qquad (2)$$

Importance sampling is often used when the tails of a distribution have a measurable effect on $\hat{f}$ but occur very infrequently. Using Simple MC, several million samples are needed to accurately capture a once-in-a-million event, but by using importance sampling, unlikely outcomes can be made to occur more frequently, reducing the total number of samples needed.

Quasi-Monte Carlo (QMC) techniques reduce variance by choosing sample locations more carefully. Sample points generated from simple Monte Carlo will inevitably cluster in some locations and leave other locations void of samples. QMC methods are usually deterministic and reduce variance by spreading points evenly throughout the sample space. Two such methods are the scrambled Halton sequence [2] and the Sobol sequence [3]. While often effective, due to deterministic sample generation they are not guaranteed to be unbiased. It is also difficult to generate points from an arbitrary $p(x)$; most QMC algorithms generate sample points from a uniform distribution over the unit hypercube.
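As a minimal sketch of the importance-sampling estimator (2) above, the following toy example (our own; the choices of $f$, $p$, and $q$ are illustrative) weights samples from a broader proposal distribution by $p(x)/q(x)$:

```python
# Minimal sketch of eq. (2): importance sampling for E_p[f] using samples
# drawn from a broader proposal q. Toy problem: f(x) = x^4 with p = N(0,1),
# for which the exact answer is 3.
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 4

# Draw from q = N(0,3), which visits the tails of p far more often.
x = rng.normal(0.0, 3.0, N)
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 3.0)   # p(x)/q(x) weights
f_is = np.mean(f(x) * w)                                # eq. (2)

f_mc = np.mean(f(rng.normal(0.0, 1.0, N)))              # simple-sampling MC
print(f"f_is = {f_is:.3f}, f_mc = {f_mc:.3f}, exact = 3")
```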
A second class of techniques for estimating integrals from data seeks to use the data samples more efficiently through the use of a fitting algorithm and supervised learning techniques. The fitting algorithm uses data samples to make a "fit" to the data, and the integral estimate is the integral of the fit. Examples of fitting algorithms include splines, high-order polynomials, and Fourier expansions; even Simpson's rule computes the quadrature by fitting a piecewise quadratic polynomial to the data samples. Fits incorporate the spatial distribution of sample points and thus often give more accurate estimates of $\hat{f}$ than MC. However, using a fitting algorithm can induce bias and may or may not exhibit convergence to the correct answer as the number of points increases. Additionally, when the number of data points is "too small," many fitting algorithms exhibit higher variance than MC, and, as a result, using a fitting algorithm can be worse than not using one, especially with low numbers of sample points and in high-dimensional spaces. It is usually impossible to know a priori how many points is "too small," making it difficult to know when to use a fitting algorithm.

Additionally, it is difficult to know if a fit to the data samples is an accurate representation of $f(x)$ at values of $x$ not in the data set. Many fitting algorithms exhibit a property known as "overfitting," where a fit to the data is very accurate at $x$ locations that are among the data samples but very inaccurate at $x$ locations not among the data samples. As a result, one cannot use the data samples themselves both to make a fit and to evaluate its accuracy. The standard technique for addressing this issue is known as cross-validation, where a certain percentage of the data is used to make the fit and the rest of the data is used to evaluate its performance. Usually this process is repeated several times, and the best performing fit is used as the output of the fitting algorithm (a.k.a. winner-takes-all).

Stacking is a technique first introduced by Wolpert [4] which improves upon cross-validation by combining the results of different fits. Wolpert describes stacking in his paper as "a strategy ... which is more sophisticated than winner-takes-all. Loosely speaking, this strategy is to combine the [fits] rather than choose one amongst them. ... [Stacking] can be viewed as a more flexible version of nonparametric statistics techniques like cross-validation, ... therefore it can be argued that for almost any generalization or classification problem ... one should use [stacking] rather than any single generalizer by itself." Interested readers can see stacking applied to improving regression [5], probability density estimation [6], confidence interval estimation [7], and optimization [8]. Stacking recently gained fame as a major part of the first and second place algorithms in the Netflix competition [9].

In addition to combining the predictions of multiple fitting algorithms, stacking can also be used to improve the prediction of a single fitting algorithm. In this variant of stacking, one again partitions the full set of data $D_x$ into a subset $S_1$ used to make fits and the remainder of the data, $S_2 = D_x \setminus S_1$. However, now one does not use $S_2$ to determine how to combine the predictions of multiple fitting algorithms that were all run on $S_1$. Instead, one uses $S_2$ to correct the prediction of a single algorithm that was run on $S_1$. This variant of stacking is closest to the algorithm we use below. For a comparison of stacking to other generalizers, see Clarke [10].

The main idea of Stacked Monte Carlo (StackMC) is to incorporate the variance reduction potential of a fitting algorithm while avoiding introducing bias and additional variance from poor fits. It makes no assumptions about the function $f(x)$ nor the sample generation method, and therefore StackMC can be used to augment nearly any Monte Carlo application.

Let us assume that we have a function $g(x)$ that is a reasonable (though not necessarily perfect) fit to $f(x)$. Equation (1) can be re-written as

$$\hat{f} = \int \alpha g(x)\, p(x)\, dx + \int \left( f(x) - \alpha g(x) \right) p(x)\, dx = \alpha \hat{g} + \int \left( f(x) - \alpha g(x) \right) p(x)\, dx, \qquad (3)$$

where $\alpha$ is a constant and $\hat{g} = \int g(x)\, p(x)\, dx$.
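A small numerical check of the idea behind (3): if $g$ is correlated with $f$, then $f - \alpha g$ fluctuates less than $f$, so averaging it (and adding back the known $\alpha \hat{g}$) gives a lower-variance estimate. The functions below are hypothetical stand-ins of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x ** 3 + x          # "expensive" objective (illustrative)
g = lambda x: x                   # imperfect but correlated fit
g_hat = 0.0                       # exact integral of g over Uniform(-1, 1)

x = rng.uniform(-1.0, 1.0, 100_000)
fx, gx = f(x), g(x)

# Variance-minimizing coefficient (derived as eq. (5) below):
# alpha = rho * sigma_f / sigma_g = cov(f, g) / var(g).
alpha = np.cov(fx, gx)[0, 1] / np.var(gx, ddof=1)

print("var(f)            :", np.var(fx))
print("var(f - alpha*g)  :", np.var(fx - alpha * gx))   # much smaller
print("estimate of f_hat :", alpha * g_hat + np.mean(fx - alpha * gx))  # ~0
```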
Instead of using Monte Carlo samples to estimate (1) directly, we can use the Monte Carlo samples to estimate the integral term in (3), i.e.

$$\hat{f} \approx \alpha \hat{g} + \frac{1}{N} \sum_{i=1}^{N} \left[ f(D_x(i)) - \alpha g(D_x(i)) \right]. \qquad (4)$$

Since $g(x)$ is assumed to be a reasonable fit, for a properly chosen $\alpha$, $f(x) - \alpha g(x)$ has lower variance than $f(x)$ alone (see Fig. 1), and so a MC estimate of (3) has lower expected error than an estimate of (1). In fact, it can be shown [11] that the optimal value of $\alpha$ to minimize expected error is

$$\alpha = \rho \frac{\sigma_f}{\sigma_g}, \qquad (5)$$

where $\sigma_f$ and $\sigma_g$ are the standard deviations of $f(x)$ and $g(x)$ respectively, and $\rho$ is the correlation between $f$ and $g$.

Figure 1: Comparison of $f(x)$ and $f(x) - \alpha g(x)$ for a fixed value of $\alpha$. (a) Example $f(x)$ and $g(x)$; (b) $f(x)$ and $f(x) - \alpha g(x)$. $f - \alpha g$ has lower variance than $f$ alone, and so a Monte Carlo estimate of $f - \alpha g$ will have less error than an estimate of $f$.

Intuitively, if $g$ is a good fit to $f$, $\rho$ (and correspondingly $\alpha$) will be high and $\hat{g}$ will be trusted as a good estimate for $\hat{f}$. If $g$ is a poor fit to $f$, $\rho$ and $\alpha$ will both be low and $\hat{g}$ will be downplayed. Looking at it from a different perspective, (3) takes the expected value estimated from $g(x)$ and corrects it with a term estimating the bias of $g(x)$. Either way, by using (3) the error should be lower than using either Monte Carlo or the fitting function alone. Since the expected value of the fit is both added and subtracted, (4) remains an unbiased estimate of (1). Thus, (4) incorporates a fitter while remaining unbiased, and $\alpha$ allows us to emphasize good fits while de-emphasizing poor ones.

The obvious question is how to obtain $g(x)$ and find $\alpha$ from data samples. The first step is to pick a functional form for a fitting algorithm $g(x)$, such as a polynomial with unknown coefficients. $g(x)$ can be nearly anything; the only restriction is that it can make predictions at new $x$ values. For now we also require that $g$ can be analytically integrated over the volume, i.e. that we can calculate

$$\hat{g} = \int g(x)\, p(x)\, dx \qquad (6)$$

analytically (see further discussion later in the paper). By comparing the output of a fit $g(x)$ to the true $f(x)$ values, an estimate for the "goodness" of the fit (and by extension $\alpha$) could be obtained, and (4) could be applied. However, doing this directly would cause over-fitting and would lead to an inaccurate estimate of $\alpha$.

Over-fitting can be mitigated by using a technique known as $k$-fold cross-validation [12]. The $N$ data samples are randomly partitioned into $k$ testing sets, $D_{x_i,\mathrm{test}}$, $i = 1, \ldots, k$, which are mutually exclusive and contain the same number of samples (differing slightly if $N/k$ is not an integer). Training sets, $D_{x_i,\mathrm{train}}$, $i = 1, \ldots, k$, contain all of the data samples not in $D_{x_i,\mathrm{test}}$. We call $N_i$ the number of samples in $D_{x_i,\mathrm{train}}$ and $m_i$ the number of samples in $D_{x_i,\mathrm{test}}$ (such that $N_i + m_i = N$). One fit per fold, $g_i(x)$, is created using only the $N_i$ samples in $D_{x_i,\mathrm{train}}$ (for a total of $k$ fitters). The samples in a training set are called "held in" points because they are used to generate the fit, whereas points in the corresponding testing set are "held out" points because the fit is generated without using these samples. Each data sample is in a held out data set once and in the held in data sets $k - 1$ times. For each fit $g_i(x)$, a prediction of the function values for the points in $D_{x_i,\mathrm{test}}$ is made, $g_i(D_{x_i,\mathrm{test}})$.
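The $k$-fold bookkeeping just described is straightforward to implement; a sketch using a cubic polynomial fit per fold (function and variable names are our own):

```python
# Partition N samples into k disjoint testing sets, fit g_i on each
# training set (a cubic least-squares polynomial via the pseudo-inverse),
# and record the held-out predictions g_i(D_{x_i,test}).
import numpy as np

def kfold_predictions(x, y, k=10, degree=3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    N = len(x)
    perm = rng.permutation(N)
    folds = np.array_split(perm, k)          # k mutually exclusive testing sets
    y_held_out = np.empty(N)
    coeffs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(perm, test_idx)
        # Fit on the held-in points only.
        beta = np.polyfit(x[train_idx], y[train_idx], degree)
        coeffs.append(beta)
        # Predict at the held-out points.
        y_held_out[test_idx] = np.polyval(beta, x[test_idx])
    return y_held_out, coeffs
```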
Standard cross-validation evaluates the accuracy of each of the fits and chooses the best fit to use as a single $g(x)$. Instead, we adopt the approach of stacking and use (3) to get an estimate of $\hat{f}$ from each of the fitters, using the held out data points to estimate the integral term:

$$\hat{f}_{smc}(i) = \alpha \hat{g}_i + \int \left( f(x) - \alpha g_i(x) \right) p(x)\, dx, \qquad (7)$$
$$\hat{f}_{smc}(i) = \alpha \hat{g}_i + \frac{1}{m_i} \sum_{j=1}^{m_i} \left[ f(D_{x_i,\mathrm{test}}(j)) - \alpha g_i(D_{x_i,\mathrm{test}}(j)) \right]. \qquad (8)$$

The mean of these estimates is taken as the final prediction:

$$\hat{f} \approx \hat{f}_{smc} = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_{smc}(i). \qquad (9)$$

We can also use the predictions at the held out points to estimate $\alpha$: $\sigma_f$ is estimated from the variance of the sample function values themselves, $\sigma_g$ from the variance of the predictions, and $\rho$ from the correlation between the predictions and the truth:

$$\mu_f = \frac{1}{N} \sum_{j=1}^{N} f(D_x(j)), \qquad \mu_g = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{m_i} g_i(D_{x_i,\mathrm{test}}(j)),$$
$$\sigma_f = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} \left( f(D_x(j)) - \mu_f \right)^2}, \qquad \sigma_g = \sqrt{\frac{1}{N-1} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \left( g_i(D_{x_i,\mathrm{test}}(j)) - \mu_g \right)^2},$$
$$\mathrm{cov}(f, g) = \frac{1}{N-1} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \left( f(D_{x_i,\mathrm{test}}(j)) - \mu_f \right) \left( g_i(D_{x_i,\mathrm{test}}(j)) - \mu_g \right), \qquad \rho = \frac{\mathrm{cov}(f, g)}{\sigma_f \sigma_g}.$$

Finally, we plug into (9):

$$\hat{f}_{smc} = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_{smc}(i) = \frac{1}{k} \sum_{i=1}^{k} \left[ \alpha \hat{g}_i + \frac{1}{m_i} \sum_{j=1}^{m_i} \left( f(D_{x_i,\mathrm{test}}(j)) - \alpha g_i(D_{x_i,\mathrm{test}}(j)) \right) \right]. \qquad (10)$$

One final correction is needed to complete StackMC. It is possible that some or all of the $\hat{g}_i$ differ greatly from $\hat{f}$ but $\alpha$ is calculated high because of good predictions at held out data points. If left alone, this would cause StackMC to return a low-quality prediction on some data sets. However, the error in the mean (EIM) of the Monte Carlo samples can be used as a second metric to evaluate the goodness of the fit. EIM is defined as

$$\bar{\sigma} = \frac{\sigma_f}{\sqrt{N}}$$

and represents the uncertainty in the mean of a set of Monte Carlo samples. Specifically, it gives us a likelihood bound on $\hat{f}$ based on $\hat{f}_{mc}$ and $\sigma_f$. We can find a "likelihood" for each fold by calculating

$$\tilde{L}_i = \frac{|\hat{g}_i - \hat{f}_{mc}|}{\bar{\sigma}}. \qquad (11)$$

The higher $\tilde{L}_i$, the less likely it is that $\hat{g}_i = \hat{f}$, and the more likely it is that the fitter is bad and should be ignored. A heuristic is set that if $\max(\tilde{L}_i) > C$, then $\hat{f}_{smc} = \hat{f}_{mc}$, i.e. all of the fits are ignored entirely and the Monte Carlo average is used. If $C$ is set too low, then too many reasonable fits are ignored, and if $C$ is set too high, then too many unreasonable fits are kept. A value of $C = 5$ was used for the results in this paper; values of $C$ as low as 3 or as high as 7 were inferior.

Finally, a note about the bias of StackMC. Because the expected value of the fit is being both added and subtracted, (3) (and by extension, (7) and (9)) are also unbiased for a fixed $\alpha$. However, using the held-out data both to set $\alpha$ and to estimate the integral can introduce bias. In practice, we have found that this is only a problem for very small numbers of sample points, where the output fit of the fitting algorithm changes significantly depending on the held-in samples.
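Putting (5), (8)-(10), and the EIM guard (11) together, a compact one-dimensional reconstruction of StackMC might look as follows. This is our own sketch under stated assumptions (uniform $p(x)$, a polynomial fitter so that $\hat{g}_i$ is analytic, and $C = 5$), not the authors' reference implementation:

```python
import numpy as np

def poly_mean(beta, lo, hi):
    """Analytic g_hat: mean of a polynomial fit under Uniform(lo, hi)."""
    P = np.polyint(np.poly1d(beta))                 # antiderivative
    return (P(hi) - P(lo)) / (hi - lo)

def stackmc(x, y, lo, hi, k=10, degree=3, C=5.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    N = len(x)
    folds = np.array_split(rng.permutation(N), k)   # k disjoint testing sets

    g_hat = np.empty(k)      # per-fold analytic fit means, g_hat_i
    y_ho = np.empty(N)       # held-out predictions g_i(D_{x_i,test})
    for i, test in enumerate(folds):
        train = np.setdiff1d(np.arange(N), test)
        beta = np.polyfit(x[train], y[train], degree)   # least-squares fit
        g_hat[i] = poly_mean(beta, lo, hi)
        y_ho[test] = np.polyval(beta, x[test])

    # alpha = rho * sigma_f / sigma_g = cov(f, g) / sigma_g^2, eq. (5),
    # estimated entirely from the held-out predictions.
    alpha = np.cov(y, y_ho)[0, 1] / np.var(y_ho, ddof=1)

    # Per-fold estimates, eq. (8), averaged as in eqs. (9)-(10).
    f_mc = y.mean()
    f_smc = np.mean([alpha * g_hat[i] + np.mean(y[t] - alpha * y_ho[t])
                     for i, t in enumerate(folds)])

    # Error-in-the-mean guard, eq. (11): revert to plain MC if any fold's
    # g_hat_i is implausibly far from the MC mean.
    eim = y.std(ddof=1) / np.sqrt(N)
    return f_mc if np.max(np.abs(g_hat - f_mc)) / eim > C else f_smc

# Example: E[x^4] for x ~ Uniform(-3, 3) is exactly 16.2.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
print(stackmc(x, x ** 4, -3.0, 3.0, rng=rng))
```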
The discussion above was for the case where the samples are generated according to a known $p(x)$; this is only a special case of a broader class of estimation scenarios. While the overall methodology remains approximately the same for these other scenarios, some specifics (such as the calculation of $\alpha$) change with the choice for $g(x)$ and the sample generation method.

If importance sampling methods are used, samples are generated from $q(x)$ instead of $p(x)$. As a result, (1) is expanded as

$$\hat{f} = \int \alpha g(x)\, q(x)\, dx + \int \left( f(x) \frac{p(x)}{q(x)} - \alpha g(x) \right) q(x)\, dx,$$

and so $g(x)$ should be a fitting algorithm for $f(x) \frac{p(x)}{q(x)}$ instead of just $f(x)$. As a result, a few modifications are needed, including the fact that

$$\hat{g}_i = \int g(x)\, q(x)\, dx,$$

and that in the calculation of $\alpha$ and (10), $f \frac{p}{q}$ is used instead of just $f$.

If the samples are generated from $p(x)$ but the choice for $g(x)$ cannot be integrated over $p(x)$, there are two options. The first option is to use another function $r(x)$ as a fitting algorithm for $p(x)$ over which $g(x)$ is integrable. We expand (1) as

$$\hat{f} = \int \alpha g(x)\, r(x)\, dx + \int \left( f(x)\, p(x) - \alpha g(x)\, r(x) \right) dx = \int \alpha g(x)\, r(x)\, dx + \int \left( f(x) - \alpha g(x) \frac{r(x)}{p(x)} \right) p(x)\, dx.$$

Similarly to the above case, when calculating $\alpha$ and (10), $g$ is replaced with $g \frac{r}{p}$.

Alternately, when $g(x)$ is significantly less computationally expensive than $f(x)$, $\hat{g}$ itself can be estimated by Monte Carlo techniques. When estimating $\hat{g}$ from $N_g$ additional samples, it should be the case that $N_g \gg N$ so that errors in the estimate of $\hat{g}$ do not cause significant errors in the estimate of $\hat{f}$.
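When $\hat{g}$ has no closed form, the Monte Carlo fallback just described takes only a few lines; a minimal sketch (all names are hypothetical), with $N_g$ much larger than $N$:

```python
# Estimate g_hat = E_p[g] with N_g cheap evaluations of the fit g,
# drawn directly from p(x).
import numpy as np

def g_hat_mc(g, sample_p, N_g=1_000_000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return np.mean(g(sample_p(N_g, rng)))

# Example: a fitted cubic, with p = Uniform(-3, 3).
beta = np.array([0.2, -0.1, 1.0, 0.5])            # hypothetical fit coefficients
g = lambda x: np.polyval(beta, x)
sample_p = lambda n, rng: rng.uniform(-3.0, 3.0, n)
print(g_hat_mc(g, sample_p))
```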
As a simple illustration of the full procedure, we attempt to find the expected value of a sixth-order polynomial $f(x)$, where $x$ is generated according to a uniform distribution, so that

$$\hat{f} = \int f(x)\, p(x)\, dx.$$

The actual value of $\hat{f}$ can be calculated analytically to be 0.7069. Twenty samples are generated and divided into five folds, with each fold's testing set containing four points and each corresponding training set containing the other sixteen points. While the function of interest is a sixth-order polynomial, $g(x)$ was chosen to be a third-order polynomial: $\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$. $g_1(x)$ is found by choosing the $\beta$'s which minimize the squared error of the data in $D_{x_1,\mathrm{train}}$ (using the standard pseudo-inverse); $g_2(x)$ through $g_5(x)$ are set similarly. Next, $g_i(D_{x_i,\mathrm{test}}(j))$ is evaluated for each test point in $D_{x_i,\mathrm{test}}$.

[Table: samples $x$, function values $f(x)$, per-fold coefficients $\beta_0, \ldots, \beta_3$, fit predictions $g_i(x)$, and fit means $\hat{g}_i$.]

From the sample function values and these held-out predictions, $\mu_f$, $\mu_g$, $\sigma_f$, $\sigma_g$, and $\mathrm{cov}(f, g)$ are computed, giving $\rho = \mathrm{cov}(f, g) / (\sigma_f \sigma_g)$ and $\alpha = \rho\, \sigma_f / \sigma_g$. Each fold's $\hat{g}_i$ is then corrected by the held-out term $\frac{1}{m_i} \sum_{j=1}^{m_i} \left[ f(D_{x_i,\mathrm{test}}(j)) - \alpha g_i(D_{x_i,\mathrm{test}}(j)) \right]$, and the corrected per-fold estimates are averaged to give the StackMC estimate $\hat{f}_{smc}$, which can be compared against the Monte Carlo estimate alone, $\hat{f}_{mc}$, and the integral of a single fit to all the data, $\hat{f}_{fit}$.
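The structure of this worked example is easy to reproduce. In the sketch below the sixth-order integrand is a hypothetical stand-in (the example's exact coefficients are not reproduced here), but the procedure (20 samples, 5 folds of 4 held-out points, a cubic least-squares fit per fold) is the one just described:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: (3 * x**6 + 0.5 * x**5 - 0.4 * x**4 - 0.2 * x**3
               + 0.6 * x**2 - 0.1 * x + 0.3)      # hypothetical integrand
lo, hi = -1.0, 1.0                                # x ~ Uniform(lo, hi)

x = rng.uniform(lo, hi, 20)
y = f(x)
folds = np.array_split(rng.permutation(20), 5)    # 5 testing sets of 4 points

g_hat, y_ho = np.empty(5), np.empty(20)
for i, test in enumerate(folds):
    train = np.setdiff1d(np.arange(20), test)
    beta = np.polyfit(x[train], y[train], 3)      # beta_0..beta_3, pseudo-inverse
    P = np.polyint(np.poly1d(beta))
    g_hat[i] = (P(hi) - P(lo)) / (hi - lo)        # analytic g_hat_i
    y_ho[test] = np.polyval(beta, x[test])        # held-out predictions

alpha = np.cov(y, y_ho)[0, 1] / np.var(y_ho, ddof=1)
f_smc = np.mean([alpha * g_hat[i] + np.mean(y[t] - alpha * y_ho[t])
                 for i, t in enumerate(folds)])

exact = 3 / 7 - 0.4 / 5 + 0.6 / 3 + 0.3           # odd powers integrate to zero
print(f"exact = {exact:.4f}, f_mc = {y.mean():.4f}, f_smc = {f_smc:.4f}")
```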
Results

We test the performance of StackMC on a number of different problems. In the first set of example cases the problems have known analytic results, and an exact analysis of the performance of StackMC is examined. In the second set, StackMC is applied to two aerospace problems in the literature. While an exact solution to the aerospace problems is not available, an approximate answer is obtained from a Monte Carlo estimate using 100,000 samples. For each example problem, a range of numbers of samples was tested using two thousand simulations at each value of $N$. In each case, 10-fold cross-validation was used. Unless otherwise noted, the plots in this section have three lines which show the mean squared error versus the number of sample points. The lines represent the average squared error from using just the prediction of Monte Carlo (green), the fit to all the samples (red), and StackMC (blue).

Rosenbrock function, uniform $p(x)$, polynomial fitter. The D-dimensional Rosenbrock function is given by

$$f(x) = \sum_{i=1}^{D-1} \left[ (1 - x_i)^2 + 100 \left( x_{i+1} - x_i^2 \right)^2 \right],$$

and $x$ is drawn from a uniform distribution over the $[-3,3]$ hypercube. The fitting algorithm is chosen as a third-order polynomial in each dimension, whose form is

$$g(x) = \beta_0 + \sum_{i=1}^{D} \left( \beta_{1,i}\, x_i + \beta_{2,i}\, x_i^2 + \beta_{3,i}\, x_i^3 \right),$$

where the $\beta_i$ are free parameters that are set from the data samples.

A comparison of the error of StackMC can be seen in Fig. 3 for the ten-dimensional version of the Rosenbrock function. At low numbers of sample points, Monte Carlo is more accurate, on average, than the polynomial fit to all the data points, but at higher numbers of points the polynomial outperforms Monte Carlo. Throughout the entire range of numbers of samples, StackMC performs at least as well as the best of the two. Additionally, as can be seen in Fig. 4, the polynomial fitter is actually a biased fitter for this example problem. StackMC is able to use cross-validation to remove the bias of the fitter while keeping the accuracy of its estimate.

Rosenbrock function, Gaussian $p(x)$. This is the same set-up as above, except that $x$ is generated according to a multivariate Gaussian distribution, each dimension independent. Results are shown in Fig. 5.

BTButterfly function, uniform $p(x)$, Fourier fitter. In the previous examples, the Rosenbrock function is a 4th order polynomial and the fitting algorithm is a 3rd order polynomial, so StackMC had reasonable fits to use to improve upon Monte Carlo. In order to challenge StackMC, a function we call the BTButterfly was created so that it would be very difficult to fit accurately. Like the Rosenbrock function, its general form is given by

$$f(x) = \sum_{i=1}^{D-1} h(x_i, x_{i+1}),$$

with the contour plot of $h$ shown in Fig. 6. $x$ is generated uniformly over the $[-3,3]$ hypercube.
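For the Rosenbrock set-up above, both the test function and the separable cubic fit take only a few lines; the following sketch (our own, with the $\beta$'s set by linear least squares) also shows why $\hat{g}$ is analytic over the hypercube:

```python
import numpy as np

def rosenbrock(X):
    """f(x) = sum_i (1 - x_i)^2 + 100 (x_{i+1} - x_i^2)^2; rows of X are samples."""
    return np.sum((1 - X[:, :-1]) ** 2
                  + 100 * (X[:, 1:] - X[:, :-1] ** 2) ** 2, axis=1)

def cubic_basis(X):
    """Design matrix [1, x_i, x_i^2, x_i^3 for every dimension i]."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2, X ** 3])

rng = np.random.default_rng(5)
D = 10
X = rng.uniform(-3.0, 3.0, (500, D))          # uniform over the [-3,3] hypercube
y = rosenbrock(X)
beta, *_ = np.linalg.lstsq(cubic_basis(X), y, rcond=None)

# Analytic g_hat over the hypercube: per dimension E[x] = 0, E[x^2] = 3,
# E[x^3] = 0, so only the constant and quadratic coefficients survive.
g_hat = beta[0] + 3.0 * beta[1 + D: 1 + 2 * D].sum()
print(f"g_hat = {g_hat:.1f}")
```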
Figure 3: Comparison of MC, fitter, and StackMC for the Rosenbrock function with uniform uncertainty.

A Fourier expansion for $g(x)$ was chosen, whose form is given by

$$g(x) = \beta_0 + \sum_{i=1}^{D} \left[ \beta_{1,i} \cos(x_i) + \beta_{2,i} \cos(2 x_i) + \beta_{3,i} \cos(3 x_i) + \beta_{4,i} \sin(x_i) + \beta_{5,i} \sin(2 x_i) + \beta_{6,i} \sin(3 x_i) \right].$$

Results for this function are shown in Fig. 7 and exhibit the same trends seen previously; StackMC has as low an error as the better of the two methods individually.

In Ref. [13], the authors use the Program for Aircraft Synthesis Studies [14] (PASS), a conceptual aircraft design tool, to predict the fuel burn of future aircraft given certain assumptions about technology advancement in the 2020 and 2030 time frames. In their predictions for single-aisle aircraft in 2020, the authors model eight probabilistic variables representing different effects of improvements in aircraft technology (propulsion, structures, aerodynamics). Each of the variables is represented by a unique beta distribution. The authors generated 15,000 samples (each representing one optimized aircraft) from which they measured the expected fuel-burn metric and the standard deviation of the expected fuel-burn metric.

To apply StackMC, we choose a third-order polynomial fitter. A polynomial fitter is convenient for this case because $E[x^n]$ over a beta distribution is easily found analytically as follows:

$$E[x^n] = \frac{\prod_{i=1}^{n} (\alpha + i - 1)}{\prod_{i=1}^{n} (\alpha + \beta + i - 1)},$$

where $\alpha$ and $\beta$ are the two parameters of the beta distribution, $B(\alpha, \beta)$ (not to be confused with the StackMC parameter $\alpha$).
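A quick check of this moment formula against direct sampling (the shape parameters below are arbitrary):

```python
# Closed-form beta-distribution moment vs. a Monte Carlo estimate:
# E[x^n] = prod_{i=1}^{n} (a + i - 1) / (a + b + i - 1) for Beta(a, b).
import numpy as np

def beta_moment(a, b, n):
    i = np.arange(1, n + 1)
    return np.prod((a + i - 1) / (a + b + i - 1))

rng = np.random.default_rng(6)
a, b = 2.5, 4.0                                      # hypothetical parameters
samples = rng.beta(a, b, 1_000_000)
print(beta_moment(a, b, 3), np.mean(samples ** 3))   # closed form vs. MC
```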
Figure 4: Expected output of MC, fitter, and StackMC with error in the mean for the 10-D Rosenbrock function. Note that the fitting function is biased, especially at lower numbers of sample points, while StackMC and MC are not.
Figure 5: Comparison of MC, fitter, and StackMC for the 10-D Rosenbrock function with Gaussian uncertainty.

Figure 6: Contour plot of successive dimensions of the BTButterfly function.
Figure 7: Comparison of MC, fitter, and StackMC for the BTButterfly function with uniform uncertainty.

A set of 100,000 samples was generated, and the mean and standard deviation of their function values were taken as the "true" expected value and standard deviation. The standard deviation is defined as $\sqrt{E[x^2] - E[x]^2}$, and therefore, to find the standard deviation we can run StackMC twice, once to find $E[x]$ and a second time to find $E[x^2]$. The results of applying StackMC to find the expected value and standard deviation are shown in Fig. 8 and Fig. 9, respectively. The exact same trend is seen here as in all of the analytic problems. At low numbers of samples, where the fit to all the data points performs badly, StackMC does as well as Monte Carlo. At high numbers of samples, where the fit greatly outperforms Monte Carlo, StackMC does as well as the fitter. In between these two extremes, StackMC outperforms both algorithms.

At 1500 samples, StackMC reduces the error by roughly an order of magnitude compared with the Monte Carlo result used by the authors. Each sample takes about 6 seconds to generate, whereas StackMC takes less than half a second to form the 10 fits, calculate $\alpha$, and calculate (10). For a computational cost of less than one additional sample, StackMC has reduced the number of samples needed by 90% for the same expected error.

The uncertainty quantification of sonic boom noise is used as the second application case. Unlike the aircraft design test case, the response surface for the sonic boom noise signature is not smooth; the output can vary significantly with slight adjustments to the input parameters. Colonno and Alonso [15] recently created a new sonic boom propagation tool, SUBoom, and used it to analyze the robustness of several aircraft pressure signatures optimized for minimal boom noise. Specifically, in their high-fidelity near-field case, they have 62 uncertain variables: 4 representing aircraft parameters such as cruise Mach number and roll angle, and 58 representing uncertainties in the near-field pressure signal. As in the Aircraft UQ case, 100,000 samples were generated and used as the true mean. A third-order polynomial was used as $g(x)$. The results can be seen in Fig. 10.
Figure 8: Comparison of MC, fitter, and StackMC for the Aircraft UQ test case, finding $E[x]$.
Figure 9: Comparison of MC, fitter, and StackMC for the Aircraft UQ test case, finding $E[x^2]$.
Figure 10: Comparison of MC, fitter, and StackMC for the SUBoom UQ test case.

Even with the discontinuities in the function space, the same performance for StackMC is seen. StackMC does as well as the best of Monte Carlo and the fitting algorithm. The reduction in error is not as great here as in the Aircraft UQ case because a polynomial is not a good fit to $f(x)$, though using domain knowledge to improve the accuracy of the fitting algorithm would further increase the performance of StackMC. Despite the relatively poor performance of the fitting algorithm, it is still better to use StackMC than not to, regardless of the number of sample points used. In some sense, this test case demonstrates one of the strengths of StackMC: at virtually no additional computational cost the user is guaranteed the highest accuracy without having to worry about the appropriateness of the methodology for a given number of samples.

In this paper, we have introduced a new technique, Stacked Monte Carlo, which reduces the error of Monte Carlo sampling by blending several different fitters of $f(x)$. StackMC is unbiased (thus retaining a major advantage of Monte Carlo) and is able to use cross validation to effectively incorporate the use of a fitting function.

StackMC is an extremely generic post-processing technique. It is applied after the samples are generated and makes no assumptions about the generation method, and therefore StackMC can be used not only with simple MC, but also with other sample generation techniques such as importance sampling, quasi-Monte Carlo, and Markov Chain Monte Carlo. Furthermore, StackMC makes no assumptions about $f(x)$; it applies not only to smooth functions, but also to discontinuous functions, and even functions with discrete variables. In a future paper [16] we will present results of the application of StackMC to different input regimes and different sample generation methods. We will examine higher dimensional spaces, explore the application of StackMC to multi-fidelity methods, and extend StackMC to incorporate multiple fitting functions.

The computation time of StackMC is dominated by forming the fits $g_i(x)$. As a result, StackMC is only affected by the curse of dimensionality insofar as the fitting algorithm is. In both of the aerospace applications, the additional cost of the entire StackMC algorithm was virtually non-existent; there are significant improvements in accuracy for less computation time than the cost of one additional function evaluation. The only assumptions made about $g(x)$ are that it be able to predict the value at new sample locations and that we are able to evaluate $\hat{g}$ analytically (and even this second assumption is relaxed). As shown in the BTButterfly and sonic boom example cases, the fit does not need to be particularly good to see improvement, and so the choice for $g(x)$ can be modified as computational effort and domain knowledge allow. StackMC does not eliminate the need for finding better fitters; a better fitter will always lead to an improved result. With the exception of the error in the mean test (whose value we did not change for any of the example cases), StackMC requires no user-set parameters that need to be heuristically tuned to see good results.

The version of StackMC implemented in this paper is relatively naive. Instead of taking the mean of the results from the different fits, taking a weighted combination of the fits could improve the estimate.
StackMC currently only partitions the data into folds once, but we could imagine re-partitioning the same samples several times to have more fitters. We use Monte Carlo to estimate the integral term in (7), but using a fitter, or even using StackMC recursively, could improve the estimates of $\hat{f}_{smc}$. Furthermore, $\alpha$ was set as a constant, but in general $\alpha$ could vary between the folds or could even be a function of $x$ so that the confidence in the fit varies spatially.

Despite this simplistic implementation, StackMC performs at least as well as the better of MC and the fitting function, and for a range of sample points outperforms both. Normally, there is a transition number of samples at which using a fit to all the data samples has a smaller average error than MC, and it is hard to know a priori if it is better to use the fit. StackMC eliminates the need to decide whether or not to use a fit; it will never be harmful to do so. While it is not true that for any set of data samples $\hat{f}_{smc}$ is closer to $\hat{f}$ than $\hat{f}_{mc}$, on average StackMC reduces the expected error. StackMC is a very generic method for the post-processing of data samples; it can be used by anyone trying to estimate an integral or expected value of a function where $p(x)$ is known.

References
[1] Haugh, M., "Variance Reduction Methods II," Monte Carlo Simulation: IEOR E4703, 2004.

[2] Kocis, L. and Whiten, W. J., "Computational Investigations of Low-Discrepancy Sequences," ACM Transactions on Mathematical Software, Vol. 23, No. 2, 1997, pp. 266–294.

[3] Sobol, I., "Uniformly Distributed Sequences with Additional Uniformity Properties," USSR Comput. Math. Math. Phys., Vol. 16, No. 5, 1976, pp. 236–242.

[4] Wolpert, D., "Stacked Generalization," Neural Networks, Vol. 5, No. 2, 1992, pp. 241–260.

[5] Breiman, L., "Stacked Regressions," Machine Learning, Vol. 24, 1996, pp. 49–64.

[6] Smyth, P. and Wolpert, D., "Stacked Density Estimation," Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS '97, MIT Press, Cambridge, MA, USA, 1998, pp. 668–674.

[7] Kim, K. and Bartlett, E. B., "Error Estimation by Series Association for Neural Network Systems," Neural Computation, Vol. 7, No. 4, 1995, pp. 799–808.

[8] Rajnarayan, D. and Wolpert, D., "Exploiting Parametric Learning to Improve Black-Box Optimization."

[9] Sill, J., Takács, G., Mackey, L., and Lin, D., "Feature-Weighted Linear Stacking," CoRR, Vol. abs/0911.0460, 2009.

[10] Clarke, B., "Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored," Journal of Machine Learning Research, Vol. 4, 2003, pp. 683–712.

[11] Haugh, M., "Variance Reduction Methods I," Monte Carlo Simulation: IEOR E4703, 2004.

[12] Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Wiley, New York, 2nd ed., 2001.

[13] Economon, T., Copeland, S., Alonso, J., Zeinali, M., and Rutherford, D., "Design and Optimization of Future Aircraft for Assessing the Fuel Burn Trends of Commercial Aviation," 49th AIAA Aerospace Sciences Meeting, Orlando, FL, January 2011, AIAA Paper 2011-267.

[14] Kroo, I., "An Interactive System for Aircraft Design and Optimization,"