Efficient estimation of divergence-based sensitivity indices with Gaussian process surrogates
A.W. Eggels, D.T. Crommelin

September 18, 2019
Abstract
We consider the estimation of sensitivity indices based on divergence measures such as the Hellinger distance. For sensitivity analysis of complex models, these divergence-based indices can be estimated by Monte Carlo sampling (MCS) in combination with kernel density estimation (KDE). In a direct approach, the complex model must be evaluated at every input point generated by MCS, resulting in samples in the input-output space that can be used for density estimation. However, if the computational cost of the complex model strongly limits the number of model evaluations, this direct method gives large errors. We propose to use Gaussian process (GP) surrogates to increase the number of samples in the combined input-output space. By enlarging this sample set, the KDE becomes more accurate, leading to improved estimates. As a comparison for the GP surrogates, we use a surrogate constructed from samples obtained with stochastic collocation, combined with Lagrange interpolation. Furthermore, we propose a new estimation method for these sensitivity indices based on minimum spanning trees. Finally, we also propose a new type of sensitivity indices based on divergence measures, namely direct sensitivity indices. These are useful when the input data is dependent.
Sensitivity analysis is an essential part of uncertainty quantification and a very active research field [1, 2, 3]. Several types of sensitivity indices have been formulated, such as variance-based (including Sobol's indices [4]), density-based [5], derivative-based [6] or divergence-based. Broadly speaking, divergence-based sensitivity indices quantify the difference between the joint probability distribution (or density) of model input and output on the one hand, and the product of their marginal distributions on the other hand. A variety of divergence-based indices can be brought into a common framework built on the notion of f-divergence [7], as was shown by Da Veiga [8]. The f-divergence is a generalization of several well-known divergences such as the Kullback-Leibler divergence [9] and the Hellinger distance [10].

In most cases, these sensitivity indices cannot be computed analytically because the distribution of the model output given the input is not known exactly. As an alternative, one can resort to Monte Carlo (MC) sampling combined with kernel density estimation: the input distribution is sampled using MC, the model is evaluated on all sampled input points, and from the resulting input-output points the joint and marginal probability densities of input and output are estimated. However, when the number of available output points is low, for example because of high computational cost of the model, the estimated densities will generally be inaccurate, resulting in large errors in the estimated sensitivity indices.

In this study we first propose to increase the number of output samples by using a Gaussian process (GP) surrogate. The GP is constructed on the input-output points that are obtained with the (expensive) model. The main idea is that the additional output samples improve the kernel density estimates even though they introduce a bias due to the difference between the true model and its GP approximation. Our approach is based on both the development of divergence-based indices and the use of Gaussian processes in sensitivity analysis. Therefore, we briefly summarize some of the advancements in these areas. Auder & Iooss [11] presented two sensitivity analysis methods based on Shannon and Kullback-Leibler entropy, respectively, building on work in [12] and [13]. Da Veiga [8] introduced sensitivity indices based on the f-divergence. Recently, KDE also appears in estimators of mutual information measures in [14], where f-divergences are computed between the joint distribution of two random variables and the product of their marginal distributions. In [15], f-divergence measures are computed by a k-nearest neighbor graph.

The use of GPs is discussed in Marrel et al. [16], together with the analytical expressions for Sobol indices that arise from them. To compute the indices, two approaches are considered: one in which the predictor of the GP is used and one in which the full GP is used. The latter approach is found to be superior in convergence and robustness. Furthermore, the modeling error of the GP is incorporated through confidence intervals; it is reported that the bias due to the use of the GP is negligible [16]. In a related study, Svenson et al. [17] estimate Sobol indices with GPs, using specific compactly supported kernel functions. Furthermore, combining GPs with derivative-based indices has been investigated by [6] and [18]. In [19], predictions from a GP are used to rank the input variables based on their predictive relevance.
Two methods for this are presented in [19], one based on the Kullback-Leibler divergence and one based on the variance of the posterior mean. Despite the developments sketched above, approaches that combine GP surrogate modeling and divergence-based sensitivity analysis have not been explored much yet, although [20] already applied this approach. The methodology proposed in this paper combines these two elements.

We note that for the approach proposed here it is not needed to assume that the inputs are mutually independent, nor does dependency of inputs make it more complicated. We present test cases with independent inputs as well as cases with dependent inputs. For the former, we compare with results obtained with stochastic collocation [21, 22]. In this method, an appropriate set of points, called collocation points, is obtained. These are usually chosen as the zeros of the orthogonal polynomials with respect to the marginal input probability distributions. Then, Lagrange interpolation is used to approximate the output function. For dependent inputs, this method might not be ideal, as [23] already showed.

Second, we propose a new estimation method for the divergence-based sensitivity indices introduced before. Because the KDE method depends on the choice of both kernel and kernel bandwidth, we propose an estimator without parameters which is numerically fast as well. This estimator is based on approximating one of the integrals appearing in the sensitivity index by computing a minimum spanning tree [24].

As a third contribution, we propose a new set of sensitivity indices to complement the ones introduced before. This new set consists of direct sensitivity indices, which measure the sensitivity of the output with respect to one input variable only. This is beneficial for cases where the input variables are dependent, because these indices remove indirect effects caused by dependent input variables. To illustrate this, consider an example where $X = (X_1, X_2)$ follows a bivariate normal distribution with means 0, variances 1 and covariance $\rho > 0$, while $u(x_1, x_2) = x_1$. Then the direct effect of $X_2$ on the output is zero, while the original sensitivity index of $X_2$ would be positive due to the dependence between $X_1$ and $X_2$.

Section 2 describes the sensitivity indices central to this paper, their estimation method and the complications therein. It also contains our proposed method to enlarge the set of input and output data and the new estimation method. Section 3 applies these estimators to several test cases. Section 4 introduces the direct sensitivity indices. Section 5 concludes.

We start by introducing the sensitivity indices derived from the f-divergences in Section 2.1. Section 2.2 discusses the complications in estimating them. Gaussian processes and the two estimators are given in Section 2.3.

2.1 f-divergence

We consider the situation where a model takes a vector of inputs $(X_1, \ldots, X_d)$ and returns a (scalar) output $Y$. The input vector $X$ is random, and as a result the output $Y$ is a random variable as well. Da Veiga [8] proposed to perform global sensitivity analysis with dependence measures, especially f-divergences (see also [25]). In this way, the impact of the $k$-th input variable $X_k$ on the output $Y$ is given by

$$S_{X_k} = \mathbb{E}\left[ d(Y, Y \mid X_k) \right], \qquad (1)$$

where $d(\cdot,\cdot)$ denotes a dissimilarity measure. The unnormalized first-order Sobol indices can also be written in this framework, namely with $d(Y, Y \mid X_k) = \left( \mathbb{E}(Y) - \mathbb{E}(Y \mid X_k) \right)^2$.
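To see how the Sobol index fits this framework, note that plugging $d(Y, Y \mid X_k) = \left(\mathbb{E}(Y) - \mathbb{E}(Y \mid X_k)\right)^2$ into (1) and using $\mathbb{E}_{X_k}\!\left[\mathbb{E}(Y \mid X_k)\right] = \mathbb{E}(Y)$ gives

$$S_{X_k} = \mathbb{E}_{X_k}\!\left[ \left( \mathbb{E}(Y) - \mathbb{E}(Y \mid X_k) \right)^2 \right] = \operatorname{Var}\!\left( \mathbb{E}(Y \mid X_k) \right),$$

which is the unnormalized first-order Sobol index; dividing by $\operatorname{Var}(Y)$ recovers the usual normalized index.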
We will use the Csiszár f-divergence [7], which is given by

$$d_f(Y, Y \mid X_k) = \int_{\mathbb{R}} f\!\left( \frac{p_Y(y)}{p_{Y \mid X_k}(y)} \right) p_{Y \mid X_k}(y)\, dy, \qquad (2)$$

with $f(\cdot)$ a convex function with $f(1) = 0$, and $p_\cdot(\cdot)$ denoting a probability distribution function. Some well-known choices for $f$ are $f(t) = -\log(t)$ (Kullback-Leibler divergence) and $f(t) = (\sqrt{t} - 1)^2$ (Hellinger distance). Combining (1) and (2) with basic probability theory gives us

$$S^f_{X_k} = \int\!\!\int_{\mathbb{R}^2} f\!\left( \frac{p_Y(y)\, p_{X_k}(x)}{p_{X_k,Y}(x,y)} \right) p_{X_k,Y}(x,y)\, dy\, dx. \qquad (3)$$

These sensitivity indices are equal to zero for $X_k$ and $Y$ independent and positive otherwise. Furthermore, they are invariant with respect to smooth and uniquely invertible transformations of $X_k$ and $Y$ [26], in contrast to Sobol indices, which are only invariant with respect to linear transformations. Moreover, it is easy to generalize (3) to a multidimensional $X_{k,l}$.

2.2 Difficulties for estimation

The main problem for computing $S^f_{X_k}$ is that the probability densities in (3) are not known. In order to estimate $S^f_{X_k}$ it is necessary to estimate $p_Y(\cdot)$ and $p_{X_k,Y}(\cdot,\cdot)$, and, depending on the type of input, $p_{X_k}(\cdot)$ as well. In [8] it is indicated that if samples $(X_L, Y_L)$ are available, only the ratio $r(x,y) = \frac{p_Y(y)\, p_{X_k}(x)}{p_{X_k,Y}(x,y)}$ needs to be estimated. The estimates of the densities can be obtained with kernel density estimation (as in [8, 25]). To do so, one chooses a suitable kernel and a suitable value for the kernel bandwidth $h$. When the density of the input $X$ is known, this information can be used to determine $h$; otherwise, guidelines are available [27].

Clearly, the estimate of the density $p_Y$ will not be perfect, leading to an error in the estimation of $S^f_{X_k}$. This error is strongly related to the number of samples $(X_L, Y_L)$ available for density estimation. If high computational cost of the model limits this number, the estimation of $S^f_{X_k}$ can be improved by using a surrogate of the model to generate more samples. One possible way to do so is to use stochastic collocation (SC) [21, 22]. Herein, one chooses the samples $X_L$ as the collocation points, which are obtained as the zeros of the orthogonal polynomials with respect to the marginal input distributions. Then, at these collocation points, the corresponding output samples are obtained. Finally, an emulator is constructed by Lagrange interpolation on these samples.

As an alternative, we propose to use Gaussian processes [28] as a surrogate model to obtain the larger sample $(X_+, Y_+) = (X_L \cup X_{L+}, Y_L \cup Y_{L+})$, in which $Y_{L+}$ indicates the surrogate model output for the extra input samples $X_{L+}$. For each data point in $X_{L+}$, this $Y_{L+}$ is a normal distribution in itself, and for each point in $X_L$ it is a degenerate normal distribution (i.e., it has zero variance). An additional advantage may be the availability of confidence intervals for $S^f_{X_k}$ at almost no extra computational cost. Unfortunately, these confidence intervals do not include the bias from approximating the output by a Gaussian process.

We assume the input samples $X_L := \{x_l\}_{l=1}^{L}$ are already available; otherwise one can use Monte Carlo sampling (or Latin hypercube sampling in the case of independent uniform data) to select samples from the data $X$. Although it may be tempting to use other sample selection methods, it is not guaranteed that they represent the distribution as well as naive sampling would.
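To make the augmentation step concrete, the following is a minimal sketch using scikit-learn's GaussianProcessRegressor; the kernel choice, the function name and the way $X_L$ and $X_{L+}$ are obtained are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def augment_with_gp(X_L, Y_L, X_Lplus):
    """Fit a GP surrogate on the available (X_L, Y_L) and emulate the output on X_Lplus.

    Returns the augmented dataset (X_plus, Y_plus); the extra outputs are the GP
    prediction means, and the returned standard deviations quantify the surrogate
    uncertainty on the added points."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(X_L.shape[1]))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_L, Y_L)                      # hyperparameters set by maximum likelihood
    mu, std = gp.predict(X_Lplus, return_std=True)
    X_plus = np.vstack([X_L, X_Lplus])
    Y_plus = np.concatenate([Y_L, mu])
    return X_plus, Y_plus, std

# hypothetical usage: Y_L from the expensive model evaluated at X_L,
# X_Lplus from cheap (Monte Carlo or Latin hypercube) sampling
# X_plus, Y_plus, std = augment_with_gp(X_L, Y_L, X_Lplus)
```

The standard deviations are not needed for the point estimates below, but they indicate where the surrogate is uncertain about the added output values.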
Then, the corresponding output $Y_L := \{y_l\}_{l=1}^{L}$ can be obtained as $Y_L = G(X_L)$, with $G$ the process that generates the output, which is either a function or a computational model. Next, one fits a Gaussian process $\widetilde{G}_{\{X_L,Y_L\}}(x) = \mathcal{N}(\mu(x), \Sigma(x))$ to $(X_L, Y_L)$, thereby choosing an appropriate kernel. This Gaussian process is then used to obtain output $Y_{L+} = \widetilde{G}_{\{X_L,Y_L\}}(X_{L+})$ for other input samples $X_{L+}$. This leads to the augmented dataset $X_+ = X_L \cup X_{L+}$ of size $N = L + L_+$ with (partially) surrogate output $Y_+ = Y_L \cup Y_{L+}$. Note that $Y_{L+}$ does not consist of single values, but rather of multivariate normal distributions.

We now explain how to compute the KDE on $(X_+, Y_+)$ and how it is used to approximate (3). Because $S^f_{X_k}$ is computed per input variable $X_k$, it suffices here to consider one-dimensional kernel densities. For each input variable $X_k$ and output variable $Y$, the kernel density estimators are given by [25]

$$\widehat{f}_{X_k}(x) = \frac{1}{J h_{X_k}} \sum_{j=1}^{J} K_{X_k}\!\left( \frac{x - x_j}{h_{X_k}} \right), \qquad \widehat{f}_Y(y) = \frac{1}{J h_Y} \sum_{j=1}^{J} K_Y\!\left( \frac{y - y_j}{h_Y} \right),$$

$$\widehat{f}_{X_k,Y}(x,y) = \frac{1}{J h_{X_k} h_Y} \sum_{j=1}^{J} K_{X_k}\!\left( \frac{x - x_j}{h_{X_k}} \right) K_Y\!\left( \frac{y - y_j}{h_Y} \right),$$

with $(x_j, y_j)$ the $j$-th sample of the data $(X_k, Y)$ and $J$ the size of the data. Note that the input data $X = (X_1, \ldots, X_d)$ has to represent the distribution of $X$. An extension to a higher-dimensional $X_k$ is easy to obtain. For our purpose, we either have $J = L$ and $(X, Y) = (X_L, Y_L)$, or $J = N$ and $(X, Y) = (X_+, Y_+)$. We choose the Gaussian kernel and $h_{X_k} = h_Y = h$ according to Scott's rule [29, p. 152], which is optimized with respect to the normal distribution. Then, the estimator for $S^f_{X_k}$ as given by [25] is obtained:

$$H^{(J)}_{X_k,f} := \frac{1}{J} \sum_{j=1}^{J} f\!\left( \frac{\widehat{f}_{X_k}(x_j)\, \widehat{f}_Y(y_j)}{\widehat{f}_{X_k,Y}(x_j, y_j)} \right). \qquad (4)$$

We note this choice of $h$ may not be optimal. We have previously adapted the bandwidth $h$ to the ranges of $X$ and $Y$, but the results were worse than with a single bandwidth. Also, kernel density estimation may not be the best choice when the domain of a variable $X_k$ or $Y$ is bounded and this variable has nonzero density at the boundaries.

So far, we have ignored the fact that $Y_{L+}$ is a multivariate normal random variable instead of a vector of single values when $J = N$. There are two options to obtain values for $Y_{L+}$. The first option is to use the prediction mean $\mu(x)$ and take the resulting output samples

$$Y_{L+} = \mu(X_{L+}), \qquad (5)$$

to be used in (4). The other is to sample from this normal distribution $n_s$ times. In that case, one gets the $n_s$ output sets

$$Y^{(s)}_{L+} \sim \mathcal{N}\!\left( \mu(X_{L+}), \Sigma(X_{L+}) \right), \qquad (6)$$

in which $\sim$ denotes "sampled from the distribution", and thereby $n_s$ estimates of $H^{(N)}_{X_k,f}$. Note that this also implies the kernel density estimates have to be computed $n_s$ times. Because the computation of the kernel density estimate is expensive, we choose not to include this option. We will indicate the estimator $H^{(J)}_{X_k,f}$ by $\widehat{S}^f_{X_k}$ in the results, where the value of $J$ is clear from the context.
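A hedged sketch of the plug-in estimator (4) for the Hellinger choice $f(t) = (\sqrt{t}-1)^2$, with Gaussian kernels; for simplicity it uses a Scott-type bandwidth per variable rather than the single common bandwidth $h$ described above, and the function names are illustrative.

```python
import numpy as np

def scott_bandwidth(z):
    # Scott's rule of thumb for one-dimensional data: h = sigma * n^(-1/5)
    return z.std(ddof=1) * len(z) ** (-0.2)

def gaussian_kde_at_samples(x, y):
    """Evaluate the marginal and joint Gaussian-kernel density estimates at the sample points."""
    n = len(x)
    h_x, h_y = scott_bandwidth(x), scott_bandwidth(y)
    ux = (x[:, None] - x[None, :]) / h_x            # (n, n) standardized differences
    uy = (y[:, None] - y[None, :]) / h_y
    kx = np.exp(-0.5 * ux ** 2) / np.sqrt(2 * np.pi)
    ky = np.exp(-0.5 * uy ** 2) / np.sqrt(2 * np.pi)
    p_x = kx.sum(axis=1) / (n * h_x)                # marginal KDE of X_k at each x_j
    p_y = ky.sum(axis=1) / (n * h_y)                # marginal KDE of Y at each y_j
    p_xy = (kx * ky).sum(axis=1) / (n * h_x * h_y)  # joint KDE at (x_j, y_j)
    return p_x, p_y, p_xy

def hellinger_index_kde(x, y):
    """Plug-in estimator (4) with f(t) = (sqrt(t) - 1)^2 (Hellinger distance)."""
    p_x, p_y, p_xy = gaussian_kde_at_samples(x, y)
    t = p_x * p_y / p_xy
    return np.mean((np.sqrt(t) - 1.0) ** 2)
```

With the augmented data, the same function is simply applied per input variable to the columns of $(X_+, Y_+)$, using the prediction means (5) for the emulated outputs.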
Before we can explain how to use minimum spanning trees, we first need to introduce the concept of Rényi entropy. This is a generalization of the continuous Shannon entropy (see e.g. [30]) and is given by

$$H_\alpha(X) = \frac{1}{1-\alpha} \log\!\left( \int_\Omega (p(x))^\alpha\, dx \right),$$

for $\alpha \in (0, \infty)$. In the limit $\alpha \to 1$, the Rényi entropy converges to the continuous Shannon entropy.
Hero & Michel [31, 24, 32] proposed a direct way to estimate the Rényi entropy for $\alpha \in (0, 1)$, given a dataset $X_N$ consisting of $N$ samples of the probability distribution $X$ of dimension $d$. Their estimator is

$$\widehat{H}_\alpha(X_N) = \frac{1}{1-\alpha} \left( \log\!\left( \frac{L_\gamma(X_N)}{N^\alpha} \right) - \log \beta_{L,\gamma} \right) = \frac{1}{1-\alpha} \log\!\left( \frac{L_\gamma(X_N)}{\beta_{L,\gamma}\, N^\alpha} \right), \qquad (7)$$

in which $\gamma$ can be derived from the relation $\alpha = (d - \gamma)/d$. The functional $L_\gamma(X_N)$ is given by

$$L_\gamma(X_N) = \min_{T(X_N)} \sum_{e \in T(X_N)} |e|^\gamma, \qquad (8)$$

in which $T(X_N)$ denotes the set of spanning trees on $X_N$ and $e$ denotes an edge therein. The parameter $\gamma$ can be computed from the desired value of $\alpha$ and lies in the interval $(0, d)$. The constant $\beta_{L,\gamma}$ is defined by

$$\beta_{L,\gamma} = \lim_{N \to \infty} L_\gamma(X_N) / N^\alpha, \qquad (9)$$

for $X_N$ a sample of size $N$ of the uniform distribution in $d$ dimensions. However, we will estimate it for $N$ samples only, by computing it for several repetitions of the sample $X_N$. The estimator (7) is asymptotically unbiased and strongly consistent for $\alpha \in (0, 1)$ [33].
We focus on the case $\alpha = 1/2$ and $d = 2$, wherein $|e|$ denotes the Euclidean distance, hence $\gamma = 1$. To see why we choose $\alpha = 1/2$, consider the Rényi divergence

$$D_\alpha(f, g) = \frac{1}{\alpha - 1} \log\!\left( \int_\Omega \left( \frac{f(x)}{g(x)} \right)^\alpha g(x)\, dx \right),$$

for probability distribution functions $f(\cdot)$ and $g(\cdot)$. We choose $f(\cdot) = p_{XY}(x, y)$ and $g(\cdot) = p_X(x)\, p_Y(y)$, where $p_{XY}$ is the joint probability distribution function of $X$ and $Y$, and $p_X(x)$ and $p_Y(y)$ denote the marginal probability distribution functions. Then, their Rényi divergence is given by

$$D_{1/2}(p_{XY}, p_X p_Y) = -2 \log\!\left( \int_Y\!\!\int_X \left( \frac{p_{XY}(x,y)}{p_X(x)\, p_Y(y)} \right)^{1/2} p_X(x)\, p_Y(y)\, dx\, dy \right) = -2 \log\!\left( \int_Y\!\!\int_X \sqrt{p_{XY}(x,y)\, p_X(x)\, p_Y(y)}\, dx\, dy \right). \qquad (10)$$

We also have

$$D_{1/2}(p_{XY}, p_X p_Y) = -2 \log\!\left( \int\!\!\int (h(x', y'))^{1/2}\, dx'\, dy' \right) = -H_{1/2}(h), \qquad (11)$$

with $h(\cdot)$ the (well-defined) probability distribution function given by

$$h(x', y') = \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)},$$

expressed in the transformed variables $x' = F_X(x)$ and $y' = F_Y(y)$, with $F_X$ and $F_Y$ the marginal cumulative distribution functions.

We also see that $S^H_{X_k}$, with $H$ denoting the sensitivity index derived from the Hellinger distance, is given through (3) by

$$S^H_{X_k} = \int\!\!\int_{\mathbb{R}^2} \left( \sqrt{ \frac{p_{X_k}(x)\, p_Y(y)}{p_{X_k,Y}(x,y)} } - 1 \right)^2 p_{X_k,Y}(x, y)\, dy\, dx,$$

which can be simplified to

$$S^H_{X_k} = 2 - 2 \int\!\!\int_{\mathbb{R}^2} \sqrt{ p_{X_k}(x)\, p_Y(y)\, p_{X_k,Y}(x,y) }\, dy\, dx. \qquad (12)$$

We now see the agreement between (10) and (12). In case the domain of $X_k$ and $Y$ is extended to $\mathbb{R}^2$ by zero density outside of the domain, it is possible to write

$$D_{1/2}(p_{X_k,Y}, p_{X_k} p_Y) = -2 \log(I), \qquad S^H_{X_k} = 2 - 2I, \qquad \text{with} \quad I = \int\!\!\int_{\mathbb{R}^2} \sqrt{ p_{X_k}(x)\, p_Y(y)\, p_{X_k,Y}(x,y) }\, dy\, dx,$$

hence

$$S^H_{X_k} = 2 - 2 \exp\!\left( -\tfrac{1}{2} D_{1/2}(p_{X_k,Y}, p_{X_k} p_Y) \right).$$

We can compute $S^H_{X_k}$ via $D_{1/2}(p_{X_k,Y}, p_{X_k} p_Y) = -H_{1/2}(h)$ (Equation 11). Therefore, we need to estimate $L_\gamma(X_k, Y)$ (8) and $\beta$ (9). Because $I$ can be estimated as $\frac{L_\gamma(X_k, Y)}{\beta \sqrt{N}}$, we estimate the sensitivity indices by

$$\widehat{S}^H_{X_k} = 2 - 2\, \frac{L_\gamma}{\beta \sqrt{N}}. \qquad (13)$$
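A sketch of the MST-based estimator (13). It assumes the tree is built on the rank-transformed (empirical copula) samples of $(X_k, Y)$, which are approximately distributed according to $h$, and estimates $\beta_{L,\gamma}$ by repeated uniform samples as in (9); the function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

def mst_length(points, gamma=1.0):
    """L_gamma of (8): total edge weight |e|^gamma of the Euclidean minimum spanning tree."""
    dists = squareform(pdist(points)) ** gamma
    return minimum_spanning_tree(dists).sum()

def beta_hat(n, d=2, alpha=0.5, n_rep=20, seed=None):
    """Monte Carlo estimate of beta_{L,gamma} in (9) from repeated uniform samples."""
    rng = np.random.default_rng(seed)
    return np.mean([mst_length(rng.random((n, d))) / n ** alpha for _ in range(n_rep)])

def hellinger_index_mst(xk, y, beta=None):
    """Estimator (13): MST built on the empirical copula (rank-transformed) samples of (X_k, Y)."""
    n = len(xk)
    u = np.column_stack([rankdata(xk) / n, rankdata(y) / n])  # approximate samples of h
    if beta is None:
        beta = beta_hat(n)
    return 2.0 - 2.0 * mst_length(u) / (beta * np.sqrt(n))
```

The estimator has no kernel or bandwidth parameters; the only tuning quantity is the number of repetitions used to estimate $\beta$.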
We test the estimators in several ways. The first test case concerns random input/output data and is described in Section 3.1. In this case, the estimates should be near zero. The second test case, in Section 3.2, is based on comparing analytic to numerical values of the sensitivity indices. In the third test case, the Ishigami function is used and tests are performed for both independent and dependent input data; the results can be found in Section 3.3. The last test case is higher-dimensional and considers the Piston function (Section 3.4). In these tests, we only use the Hellinger distance. All experiments have been performed $n_r = 10$ times. The error bars in the upcoming figures indicate the minimum and maximum value found. The results are summarized in Section 3.5.
First, we check the behavior for random output, in which case the sensitivity indices should be zero. Both the input and output data are one-dimensional, uniformly distributed on [0, 1], and have size N = 10, while L is varied from L = 10 to L = 200 based on [34]. The results are in Figure 1.

Figure 1: Sensitivity indices for random output.

On the right of Figure 1, we show the sensitivity index as computed on the complete data (i.e., L = N) by KDE and MST (blue circle and red pentagon); here, no Gaussian process is used. As expected, their mean is around zero. The spread for the KDE method is smaller than for the MST method. The estimates of $\widehat{S}^H_{X}$ based on L samples (blue circles) are also around zero, although their spread is larger than for L = N. Note that due to the numerical implementation, the sensitivity indices can become negative.

We continue with the estimates based on Gaussian processes. Here, the situation is a little different because the Gaussian process fits a function through the data while there is no functional relation between input and output. Hence, the sensitivity indices will most likely not be equal to zero. When fitting the Gaussian process, two cases appear, which have the same effect: the length scale and the process variance are either both small or both large. As a result, the predictions of the Gaussian process will be inaccurate. This can be seen in the figure for the KDE results (green diamonds) by their mean being away from zero and the large spread of their estimates, including one clear outlier. For the MST method, large values of $\widehat{S}^H_{X}$ are measured because the predicted output values are the values of the prediction mean function, which is a continuous function. Hence, these predictions are located on a curve. Therefore, the values of $\widehat{S}^H_{X}$ for the MST-based estimator are too large to be visible in this plot for the chosen values of L, except for L = N, where no emulated output is used. The reference result, where we computed the MST-based sensitivity index on the full data without emulator, is reasonable (red pentagon).

Similar results appear for estimates based on emulation by stochastic collocation (SC), where we used collocation samples based on the uniform distribution. A function is fit through the data while no functional relation between input and output exists. Therefore, high values of $\widehat{S}^H_{X}$ are measured, which are not shown in the plot.

We summarize these results as follows: when an emulator (either a Gaussian process or SC) is used to augment the data for sensitivity analysis, positive values of $\widehat{S}^H_{X}$ are found because the emulator is designed to fit a functional relation between input and output. The "sample" method does not suffer from this problem. However, this is a very specific test case in which sample-based estimators are preferred over ones which use an augmented dataset.

We now consider a small test case in which we can compute the sensitivity index analytically. The idea behind this is to compare the KDE and the MST method in case no emulator is used. We have

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right).$$

We took N = 10 and repeated the experiment $n_r = 10$ times.
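One way to obtain the analytic reference value is via the Bhattacharyya coefficient of zero-mean normals: the integral $I$ in (12) is the Bhattacharyya coefficient between the joint density and the product of its marginals, which for this Gaussian pair has the closed form $(1-\rho^2)^{1/4}/(1-\rho^2/4)^{1/2}$, so that

$$S^H_X(\rho) = 2 - 2\,\frac{(1-\rho^2)^{1/4}}{\left(1-\rho^2/4\right)^{1/2}},$$

which equals 0 at $\rho = 0$ and tends to 2 as $|\rho| \to 1$.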
The results are in Figure 2. Except for $\rho = 0.98$, the MST method outperforms the KDE method. Furthermore, the MST method is (i) not dependent on parameter choices such as kernel and kernel bandwidth, and (ii) faster to compute. One also needs to take into account that the rule of thumb used here to choose the kernel bandwidth is based on the assumption that the data come from a normal distribution; therefore, this kernel bandwidth is optimal in this test case. When the underlying distribution is not normal, this heuristic may not be optimal.
Figure 2: Computed values for the sensitivity indices, analytic test case.
We now continue to a non-trivial synthetic test case, with test function from Ishigami & Homma [35]. The output function is defined by

$$G(x, y, z \mid a, b) = (1 + b z^4) \sin(x) + a \sin^2(y)$$

on the domain $[-\pi, \pi]^3$ (dimension $d = 3$). We will use the well-known choice $a = 7$, $b = 0.1$ [36]. Two types of input data are considered, each consisting of N = 10 samples on the domain of the output function. The first consists of independent, uniformly distributed inputs. The other is the empirical copula of a multivariate normal distribution on the same domain: a multivariate normal $Z$ with standard normal marginals and nonzero correlations is drawn, and

$$X = -\pi + 2\pi \cdot F(Z),$$

with $F$ the cumulative distribution function of the marginal distributions (which are distributed as $\mathcal{N}(0, 1)$). The sensitivity indices computed on the full datasets of size N = 10, without use of an emulator, serve as reference values for comparison. Scott's rule [27] is used for the kernel bandwidth.

In the numerical experiments, we first compute, depending on the dataset, a Latin hypercube sample (LHS) or Monte Carlo sample (MCS) of size $L \in \{30, 50, 100, 200\}$ and combine it with KDE; for this data, we compute (4). Then, we fit a Gaussian process with Gaussian kernel to these samples, where the length scales have been estimated by maximum likelihood estimation. Next, we proceed with KDE on $(X_+, Y_+)$, in which we use the choice $Y_{L+} = \mu(X_{L+})$ (Equation 5). We obtain one estimate of $H^{(L + L_+)}_{X_k,f}$ for each repetition of the experiment, and thereby one value of $|H^{(L+L_+)}_{X_k,f} - H^{(N)}_{X_k,f}| \approx |\widehat{S}_{X_k} - S_{X_k}|$, which is used as a measure of convergence. In a similar way, we proceed with the MST method on $(X_+, Y_+)$ with $Y_{L+} = \mu(X_{L+})$. Finally, the SC method, based on the uniform distribution and combined with KDE, is used for comparison. Note that we showed earlier that KDE has a larger bias than MST, but we look mainly at the convergence here.

The computed reference values of the sensitivity indices are shown in Figure 3. The values for independent and dependent data are close to each other for variables 1 and 3, while they are far apart for variable 2, which is due to the dependency.

Figure 3: Computed values for the sensitivity indices per variable, Ishigami function.

We will first show the results for the independent data, followed by the results for the dependent data. We start by determining the goodness-of-fit of the Gaussian process by performing $k$-fold cross-validation (CV) with $k = 10$ and computing the coefficient of determination

$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} = 1 - \frac{\sum_{l=1}^{L} (\hat{Y}_l - Y_l)^2}{\sum_{l=1}^{L} (Y_l - \bar{Y})^2},$$

where $\hat{Y}_l$ are the CV predictions for $Y_l$ and $\bar{Y} = \frac{1}{L}\sum_{l=1}^{L} Y_l$.
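A small sketch of this goodness-of-fit check (10-fold CV predictions for the GP and the resulting $R^2$), with scikit-learn utilities standing in for the authors' implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_predict

def gp_cv_r2(X_L, Y_L, k=10):
    """k-fold cross-validation predictions for the GP and the resulting R^2."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=np.ones(X_L.shape[1])),
                                  normalize_y=True)
    y_hat = cross_val_predict(gp, X_L, Y_L, cv=k)   # out-of-fold predictions
    ss_res = np.sum((y_hat - Y_L) ** 2)
    ss_tot = np.sum((Y_L - Y_L.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```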
Figure 4: Cross-validation results showing the quality of the Gaussian process, independent input data.

The values of $1 - R^2$ for the independent data are given in Figure 4; they are near zero for the higher values of $L$. For $L = 30$ and $L = 50$, the fraction $SS_{\mathrm{res}}/SS_{\mathrm{tot}}$ can become larger than 1; in that case, the fit is worse than a constant function. Note that for these values of $L$ the Gaussian process is not fit well, while it is for the higher values of $L$.

Figure 5 shows the convergence of the estimates, where "sample" indicates the KDE is based on only the $L$ samples, "SC" indicates stochastic collocation is used (combined with KDE), "GP-KDE" is based on (5) and "GP-MST" is based on (13). From left to right, variables 1 to 3 are shown; this will also be the case for all similar figures in this section. In this figure, we see several trends.
Figure 5: Convergence of the estimates for Hellinger divergence, independent input data.

First of all, the sample-based estimates perform worse than the methods which use augmentation of the dataset. Second, we see the results for SC are not robust and their errors do not decrease in general for increasing $L$. Third, we see that GP-MST shows in general decreasing errors for increasing $L$.

The results for dependent data are shown in Figures 6 and 7. Note that LHS is not an appropriate sampling method because the data is dependent; therefore, Monte Carlo sampling is used instead. Furthermore, SC is here also not completely suitable because the input distribution is dependent. The results are similar to the previous experiments, although the cross-validation results imply the Gaussian process for this data has been fit better. Another observation is that GP-MST outperforms the other methods for variables 2 and 3, while it is not really worse than GP-KDE for variable 1. Overall, the Gaussian process-based methods outperform the other methods.

Figure 6: Cross-validation results showing the quality of the Gaussian process, dependent input data.
Figure 7: Convergence of the estimates for Hellinger divergence, dependent input data.
We also tested a higher-dimensional test case with independent, uniformly distributed input variables. In this case, the output function is defined by the Piston function from [37]. The output here is the cycle time of a piston, given by

$$C(x) = 2\pi \sqrt{ \frac{M}{k + S^2 \dfrac{P_0 V_0 T_a}{T_0 V^2}} }, \qquad V = \frac{S}{2k} \left( \sqrt{A^2 + 4k \frac{P_0 V_0}{T_0} T_a} - A \right), \qquad A = P_0 S + 19.62\, M - \frac{k V_0}{S},$$

with input vector $x = (M, S, V_0, k, P_0, T_a, T_0)$, of which the input ranges are given in Table 1; an illustrative implementation is sketched below the table. For numerical reasons, the data of size N = 10 is generated and processed on the unit hypercube: it is only transformed to the input ranges to obtain the output values. The sensitivity indices as computed on a larger dataset of size N = 10 are given in Figure 8. The values for KDE and MST differ, although Section 3.2 indicates the MST results are more accurate. The cross-validation results are in Figure 9. These last results show that the Gaussian process has been fit well for $L \geq 50$.
We can therefore continue with the remaining results. The results for the convergence are in Figure 10. From left to right, top to bottom, variables 1 to 7 are shown.

Table 1: Input variables for the Piston function.

Symbol and range | Explanation
$M \in [30, 60]$ | piston weight (kg)
$S \in [0.005, 0.020]$ | piston surface area (m^2)
$V_0 \in [0.002, 0.010]$ | initial gas volume (m^3)
$k \in [1000, 5000]$ | spring coefficient (N/m)
$P_0 \in [90000, 110000]$ | atmospheric pressure (N/m^2)
$T_a \in [290, 296]$ | ambient temperature (K)
$T_0 \in [340, 360]$ | filling gas temperature (K)
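For illustration, a sketch of the Piston cycle-time function as written above, together with the mapping from the unit hypercube to the ranges of Table 1; this is an independent re-implementation of the cited benchmark, not the authors' code.

```python
import numpy as np

def piston_cycle_time(M, S, V0, k, P0, Ta, T0):
    """Cycle time C(x) of the Piston function, following the formulas above."""
    A = P0 * S + 19.62 * M - k * V0 / S
    V = S / (2.0 * k) * (np.sqrt(A ** 2 + 4.0 * k * P0 * V0 * Ta / T0) - A)
    return 2.0 * np.pi * np.sqrt(M / (k + S ** 2 * P0 * V0 * Ta / (T0 * V ** 2)))

# lower and upper bounds of Table 1, in the order (M, S, V0, k, P0, Ta, T0)
LOWER = np.array([30.0, 0.005, 0.002, 1000.0, 90000.0, 290.0, 340.0])
UPPER = np.array([60.0, 0.020, 0.010, 5000.0, 110000.0, 296.0, 360.0])

def piston_from_unit_cube(U):
    """Map points U in [0, 1]^7 to the physical ranges and evaluate the cycle time."""
    X = LOWER + (UPPER - LOWER) * np.asarray(U)
    return piston_cycle_time(*X.T)
```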
Figure 8: Computed values for the sensitivity indices per variable, Piston function.
Figure 9: Cross-validation results showing the quality of the Gaussian process, Piston function.
Figure 10: Convergence of the estimates for Hellinger divergence, Piston function.

We see clear differences between variables 1-4 on the one hand and variables 5-7 on the other hand. This is due to $\widehat{S}^H_{X_k}$ being near zero for variables 5-7. As indicated in Section 3.1, methods which make use of a fit perform badly in this case. For variables 1-4, GP-MST clearly outperforms the others. The SC method has only been performed with 2 collocation points for each dimension, which leads to $L = 2^7 = 128$. Increasing to 3 points per dimension would give $3^7 = 2187$ collocation points, which is more than the number of data points in the dataset used.

The Gaussian process-based methods in general outperform the sample-based method and stochastic collocation, except when the value of the sensitivity index is (near-)zero. However, usually one is interested in ordering the input variables based on the sensitivity indices rather than obtaining their values very precisely. Input variables with sensitivity index values near zero are usually considered unimportant, and in that case it is also not very important to estimate the value of zero very precisely. We therefore advise to use the GP-MST method, wherein the available samples $(X_L, Y_L)$ are augmented to $(X_+, Y_+)$ by a Gaussian process, on which the sensitivity index (3) is computed for the Hellinger distance by the minimum spanning tree method.

We note that the sensitivity indices as described by [8] are total sensitivity indices, which include both direct and indirect effects. Direct effects measure the effect of one input variable only, while indirect effects contain the effect of the other variables due to possible dependencies in the input variables. The indirect effect is the difference between the total and the direct effect. To illustrate this, consider an example where $X = (X_1, X_2)$ follows a bivariate normal distribution with means 0, variances 1 and covariance $\rho > 0$ (hence $X_1$ and $X_2$ are dependent), while $u(x_1, x_2) = 2 x_1$. Then the direct effect of $X_2$ is zero, while its total effect is positive (because $u(X_1, X_2)$ and $X_2$ are dependent through $X_1$). Hence, the indirect effect of $X_2$ is positive as well. Our goal is now to find a measure for the direct effects, i.e., without the effects of the mutual input dependencies.

Although useful, total sensitivity indices do not tell the complete story. While an input variable may be completely irrelevant for the value of the output, it may have a positive sensitivity index due to a dependency with a relevant input. The relevant input variable would then be called a confounder. An example of this is wave height in a computational model of offshore wind energy: although the waves have nothing to do with the power output, they are linked to it via the wind speed, with which they have a dependency. To get rid of this effect, we need to construct indices which measure the effect of only one input variable, without effects due to dependencies in the input.
It is in this case necessary to remove the dependencies from the input.

For variance-based sensitivity indices, a distinction is made between first-order, higher-order and total sensitivity indices [38]. With first-order indices, one only measures the effect of varying one variable alone, whereas with higher-order indices multiple variables are varied at the same time. Because the number of second-order sensitivity indices grows as $d(d-1)/2$, with $d$ the number of input variables, and the total number of sensitivity indices is $2^d - 1$, usually not all of them are computed. Instead, one computes the first-order and total sensitivity indices. In a similar fashion to first-order indices, we define direct sensitivity indices, which measure the effect of varying one variable only. The direct indices then measure the direct effects, while the total indices measure the combination of direct and indirect effects, which also includes effects due to dependencies in the input.
The starting point of these new indices is the same divergence-based index as before, namely (12). We repeat (12) here for convenience:

$$S^H_{X_k} = 2 - 2 \int\!\!\int_{\mathbb{R}^2} \sqrt{ p_{X_k}(x)\, p_Y(y)\, p_{X_k,Y}(x,y) }\, dy\, dx.$$

Now, note that

$$Y = u(X_1, \ldots, X_d) = u(X),$$

with $u(\cdot)$ being the model used to obtain the output $Y$; hence, both $p_Y(y)$ and $p_{X_k,Y}(x,y)$ depend in theory on all input variables. Therefore, if we remove the dependencies between the input variables, these probability distributions change as well. This removal is done by applying a permutation operator $\Pi$, which is defined on a dataset $X$ in such a way that

$$\Pi(X) = (\Pi(X_1), \ldots, \Pi(X_d)), \quad \text{with} \quad \mathrm{CDF}(\Pi(X_k)) = \mathrm{CDF}(X_k), \qquad \Pi(X_i) \perp \Pi(X_j) \ \text{for all } i, j,\ i \neq j.$$

Hence, this operator keeps the marginal distributions the same, but it removes all dependencies ($\perp$ here denotes statistical independence). The implementation of this operator is detailed at the end of this section.

Now, we create a permuted version of our dataset $X$, namely $\Pi(X)$. For this dataset, we can define the direct sensitivity index by

$$S^H_{D,X_k} = 2 - 2 \int\!\!\int_{\mathbb{R}^2} \sqrt{ p_{\Pi(X_k)}(x)\, p_Y(y)\, p_{\Pi(X_k),Y}(x,y) }\, dy\, dx.$$

The output $Y = u(X)$ can be replaced by $\widetilde{Y} = \widetilde{u}(\Pi(X))$, in which $\widetilde{u}(\cdot)$ denotes the Gaussian process $\widetilde{G}_{\{X_L,Y_L\}}(\cdot)$ constructed earlier. This leads to the estimator

$$\widehat{S}^H_{D,X_k} = 2 - 2 \int\!\!\int_{\mathbb{R}^2} \sqrt{ p_{\Pi(X_k)}(x)\, p_{\widetilde{Y}}(y)\, p_{\Pi(X_k),\widetilde{Y}}(x,y) }\, dy\, dx.$$

It remains to explain how we implement the permutation operator $\Pi(X)$. A naive implementation could be one in which, for each variable, a random permutation of the values is performed. This is fast, but does not guarantee independence of the input variables after transformation. Also, the indexing of the permutations leads to a Latin hypercube design (LHD): each value from 1 to $N$ (for $N$ data points in the dataset) is used only once. However, this does not guarantee all dependencies are removed. In Latin hypercube sampling (LHS), a comparable problem exists, as equally probable subspaces can end up with a different number of sampling points. This is solved by orthogonal sampling [39] or by using a maximin criterion [40].

Inspired by this, we would like to generate an LHD of size $N$ in $d$ dimensions with the maximin criterion, which puts the samples at the middle of each interval. This LHD is easily transformed to an indexing, which can be applied to the original data $X$ to obtain $\Pi(X)$. However, obtaining such an LHD is computationally very expensive because it contains an optimization step, and it is therefore not feasible for the problem sizes we are looking at.

An alternative to Latin hypercube sampling is quasi-Monte Carlo sampling, which generates data points from low-discrepancy sequences such as Halton's [41, Chapter 3] and Sobol's [42]. In this way, we achieve the goal that the proportion of data points in a sequence falling into a subspace is nearly proportional to the probability measure of this subspace (the difference between them is the discrepancy). Hence, we achieve an approximately uniform distribution of data points over the unit hypercube, which means the dimensions are independent of each other. Furthermore, all values generated for a variable are unique, which means they can easily be transformed to the discrete hypercube $\{1, \ldots, N\}^d$. The transformed values can be used as an indexing for $X$ to obtain $\Pi(X)$. Because the data points generated by the sequence are uniform over the unit hypercube, they lead to an (approximately) independent dataset when their transformed values are used as indexing. We use the Sobol sequences as described by [42].
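One possible implementation of the permutation operator $\Pi$ along these lines, using SciPy's Sobol sequence generator; the scrambling choice and the function name are illustrative assumptions, not necessarily the authors' exact implementation.

```python
import numpy as np
from scipy.stats import qmc

def decorrelate(X, seed=None):
    """Apply the permutation operator Pi: keep each column's marginal values,
    but reorder them according to the ranks of a Sobol sequence so that the
    columns become (approximately) mutually independent."""
    n, d = X.shape
    U = qmc.Sobol(d, scramble=True, seed=seed).random(n)  # low-discrepancy points in [0, 1)^d
    ranks = np.argsort(np.argsort(U, axis=0), axis=0)     # per-column ranks 0..n-1
    X_sorted = np.sort(X, axis=0)                         # order statistics per column
    return np.take_along_axis(X_sorted, ranks, axis=0)    # re-index columns by the Sobol ranks
```

Using a power of two for the number of data points avoids the balance warning SciPy issues for Sobol sequences of other lengths.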
We compute these direct sensitivity indices for the Ishigami test case of Section 3.3. We split the results into the KDE and the MST estimates. For each of them, we show the estimates of both the independent and dependent direct sensitivity indices and the spread therein for increasing L. We also compare them to the values of the total indices.

We start with the KDE estimates in Figure 11. On the left, we see that the estimates are relatively stable for increasing L and the spread of the estimates decreases. On the right, we compare the estimates with N = 10 for the direct indices to the estimates with N = 10 for the total indices. We do not compute a reference value for the direct indices because of computational cost. For the dependent data, the indices work as expected: the total indices are larger than the direct sensitivity indices. For the independent data, this is not the case, as for variables 2 and 3 the direct sensitivity index is larger than the total sensitivity index. It is not immediately clear to us why this is the case, because for independent data the total and direct sensitivity indices should give the same results.
Figure 11: Direct indices for the Ishigami test case when computed with KDE. (a) Convergence for increasing L. (b) Comparison with the total indices.

Figure 12 shows on the left similar results for the stability and spread of the estimates as seen before with KDE. On the right, we see the total sensitivity indices are larger than or equal to their direct counterparts. For the independent data, the differences between the direct and total sensitivity indices are small for variables 1 and 3 and nearly invisible for variable 2. Theoretically, this difference should be (numerically) zero. For the dependent data, we see the difference between direct and total sensitivity index is largest for variable 2, while variables 1 and 3 show a small difference. This is due to variable 2 being more strongly dependent on the other two variables.

Figure 12: Direct indices for the Ishigami test case when computed with MST. (a) Convergence for increasing L. (b) Comparison with the total indices.
We proposed to use Gaussian processes in order to improve the estimates of divergence-based sensitivity indices. This is advantageous in cases where the number of available input-output samples is small, for example if the computational cost of each model evaluation needed to compute the output is high.

We compared the use of Gaussian processes to the well-established method of stochastic collocation combined with Lagrange interpolation. This method has several disadvantages in practice and is outperformed by the Gaussian process-based methods in our experiments. The use of Gaussian processes also allowed us to propose (i) a new estimation method and (ii) a new type of sensitivity indices. The new estimation method for divergence-based sensitivity indices is based on minimum spanning trees and can be used when the divergence is the Hellinger distance. This estimation method has been used before to compute entropies and is numerically fast. The new type of sensitivity index, named the direct sensitivity index, is especially useful when the input data is dependent.
Acknowledgments
This research is part of the EUROS programme, which is supported by NWO domain Applied and Engineering Sciences under grant number 14185 and partly funded by the Ministry of Economic Affairs.
References

[1] A. Saltelli, Global sensitivity analysis: an introduction, in: Proc. 4th International Conference on Sensitivity Analysis of Model Output (SAMO'04), 2004, pp. 27–43.
[2] J. E. Oakley, A. O'Hagan, Probabilistic sensitivity analysis of complex models: a Bayesian approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66 (3) (2004) 751–769. doi:10.1111/j.1467-9868.2004.05304.x.
[3] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, S. Tarantola, Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index, Computer Physics Communications 181 (2) (2010) 259–270. doi:10.1016/j.cpc.2009.09.018.
[4] I. M. Sobol', Sensitivity Estimates for Nonlinear Mathematical Models, Mathematical Modeling & Computational Experiments 1 (4) (1993) 407–414.
[5] E. Borgonovo, A new uncertainty importance measure, Reliability Engineering & System Safety 92 (6) (2007) 771–784.
[6] K. Blix, Sensitivity analysis of Gaussian process machine learning for chlorophyll prediction from optical remote sensing, Master's thesis, UiT Norges arktiske universitet (2014).
[7] I. Csiszár, P. C. Shields, Information theory and statistics: A tutorial, Foundations and Trends in Communications and Information Theory 1 (4) (2004) 417–528.
[8] S. Da Veiga, Global sensitivity analysis with dependence measures, Journal of Statistical Computation and Simulation 85 (7) (2014) 1283–1305. doi:10.1080/00949655.2014.945932.
[9] T. van Erven, P. Harremoës, Rényi Divergence and Kullback-Leibler Divergence, IEEE Transactions on Information Theory 60 (7) (2014) 3797–3820. doi:10.1109/tit.2014.2320500.
[10] A. L. Gibbs, F. E. Su, On Choosing and Bounding Probability Metrics, International Statistical Review 70 (3) (2002) 419–435. doi:10.1111/j.1751-5823.2002.tb00178.x.
[11] B. Auder, B. Iooss, Global sensitivity analysis based on entropy, in: Safety, Reliability and Risk Analysis - Proceedings of the ESREL 2008 Conference, 2008, pp. 2107–2115.
[12] B. Krzykacz-Hausmann, Epistemic sensitivity analysis based on the concept of entropy, in: International Symposium on Sensitivity Analysis of Model Output, 2001, pp. 53–57.
[13] H. Liu, W. Chen, A. Sudjianto, Relative Entropy Based Method for Probabilistic Sensitivity Analysis in Engineering Design, Journal of Mechanical Design 128 (2) (2006) 326–336. doi:10.1115/1.2159025.
[14] K. R. Moon, K. Sricharan, A. O. Hero, Ensemble estimation of mutual information, 2017 IEEE International Symposium on Information Theory (ISIT). doi:10.1109/isit.2017.8007086.
[15] M. Noshad, K. R. Moon, S. Y. Sekeh, A. O. Hero, Direct estimation of information divergence using nearest neighbor ratios, 2017 IEEE International Symposium on Information Theory (ISIT). doi:10.1109/isit.2017.8006659.
[16] A. Marrel, B. Iooss, B. Laurent, O. Roustant, Calculations of Sobol indices for the Gaussian process metamodel, Reliability Engineering & System Safety 94 (3) (2009) 742–751. doi:10.1016/j.ress.2008.07.008.
[17] J. Svenson, T. Santner, A. Dean, H. Moon, Estimating sensitivity indices based on Gaussian process metamodels with compactly supported correlation functions, Journal of Statistical Planning and Inference 144 (2014) 160–172. doi:10.1016/j.jspi.2013.04.003.
[18] M. De Lozzo, A. Marrel, Estimation of the Derivative-Based Global Sensitivity Measures Using a Gaussian Process Metamodel, SIAM/ASA Journal on Uncertainty Quantification 4 (1) (2016) 708–738. doi:10.1137/15m1013377.
[19] T. Paananen, J. Piironen, M. R. Andersen, A. Vehtari, Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution, arXiv preprint arXiv:1712.08048v2.
[20] E. Borgonovo, W. Castaings, S. Tarantola, Model emulation and moment-independent sensitivity analysis: An application to environmental modelling, Environmental Modelling & Software 34 (2012) 105–115. doi:10.1016/j.envsoft.2011.06.006.
[21] L. Mathelin, M. Y. Hussaini, T. A. Zang, A stochastic collocation algorithm for uncertainty analysis, Tech. Rep. NASA/CR-2003-212153, NASA (2003).
[22] D. Xiu, J. S. Hesthaven, High-order collocation methods for differential equations with random inputs, SIAM Journal on Scientific Computing 27 (3) (2005) 1118–1139.
[23] M. Navarro, J. A. S. Witteveen, J. Blom, Stochastic collocation for correlated inputs, in: UNCECOMP 2015, 2015.
[24] A. O. Hero, B. Ma, O. J. Michel, J. Gorman, Applications of entropic spanning graphs, IEEE Signal Processing Magazine 19 (5) (2002) 85–95.
[25] S. Rahman, The f-Sensitivity Index, SIAM/ASA Journal on Uncertainty Quantification 4 (1) (2016) 130–162. doi:10.1137/140997774.
[26] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Physical Review E 69 (6) (2004) 066138. doi:10.1103/physreve.69.066138.
[27] D. W. Scott, Multivariate Density Estimation, Wiley Series in Probability and Statistics. doi:10.1002/9780470316849.
[28] C. E. Rasmussen, C. K. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006.
[29] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., 1992. doi:10.1002/9780470316849.
[30] T. M. Cover, J. A. Thomas, Elements of Information Theory, Second Edition, John Wiley & Sons, 2006. doi:10.1002/047174882X.
[31] A. O. Hero, O. J. J. Michel, Robust Entropy Estimation Strategies Based on Edge Weighted Random Graphs, in: SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, International Society for Optics and Photonics, 1998, pp. 250–261. doi:10.1117/12.323804.
[32] A. O. Hero, J. Costa, B. Ma, Asymptotic Relations Between Minimal Graphs and alpha-entropy, Tech. Rep. 334, Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, University of Michigan, Ann Arbor (2003).
[33] A. Hero, O. J. J. Michel, Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees, in: Proceedings of the IEEE Signal Processing Workshop on Higher-Order Statistics, IEEE, 1999, pp. 264–268.
[34] J. L. Loeppky, J. Sacks, W. J. Welch, Choosing the Sample Size of a Computer Experiment: A Practical Guide, Technometrics 51 (4) (2009) 366–376. doi:10.1198/tech.2009.08040.
[35] T. Ishigami, T. Homma, An Importance Quantification Technique in Uncertainty Analysis for Computer Models, in: Uncertainty Modeling and Analysis, 1990. Proceedings., First International Symposium on, IEEE, 1990, pp. 398–403. doi:10.1109/isuma.1990.151285.
[36] T. Crestaux, O. Le Maître, J.-M. Martinez, Polynomial chaos expansion for sensitivity analysis, Reliability Engineering & System Safety 94 (7) (2009) 1161–1172. doi:10.1016/j.ress.2008.10.008.
[37] R. Kenett, S. Zacks, Modern Industrial Statistics: Design and Control of Quality and Reliability, Pacific Grove: Duxbury Press, 1998.
[38] I. M. Sobol', Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Mathematics and Computers in Simulation 55 (1) (2001) 271–280. doi:10.1016/s0378-4754(00)00270-6.
[39] A. B. Owen, Orthogonal arrays for computer experiments, integration and visualization, Statistica Sinica (1992) 439–452.
[40] M. Johnson, L. Moore, D. Ylvisaker, Minimax and maximin distance designs, Journal of Statistical Planning and Inference 26 (2) (1990) 131–148. doi:10.1016/0378-3758(90)90122-B.
[41] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, Society for Industrial and Applied Mathematics, 1992. doi:10.1137/1.9781611970081.
[42] S. Joe, F. Y. Kuo, Constructing Sobol sequences with better two-dimensional projections, SIAM Journal on Scientific Computing 30 (5) (2008) 2635–2654. doi:10.1137/070709359.