Kernel-based ANOVA decomposition and Shapley effects -- Application to global sensitivity analysis
Sébastien Da Veiga
Safran Tech, Modeling & Simulation
Rue des Jeunes Bois, Châteaufort, 78114 Magny-Les-Hameaux, France
Abstract
Global sensitivity analysis is the main quantitative technique for identifying the most influential input variables in a numerical simulation model. In particular when the inputs are independent, Sobol' sensitivity indices attribute a portion of the output of interest variance to each input and all possible interactions in the model, thanks to a functional ANOVA decomposition. On the other hand, moment-independent sensitivity indices focus on the impact of input variables on the whole output distribution instead of the variance only, thus providing complementary insight on the inputs/output relationship. Unfortunately they do not enjoy the nice decomposition property of Sobol' indices and are consequently harder to analyze. In this paper, we introduce two moment-independent indices based on kernel embeddings of probability distributions and show that the RKHS framework used for their definition makes it possible to exhibit a kernel-based ANOVA decomposition. This is the first time such a desirable property is proved for sensitivity indices apart from Sobol' ones. When the inputs are dependent, we also use these new sensitivity indices as building blocks to design kernel-embedding Shapley effects which generalize the traditional variance-based ones used in sensitivity analysis. Several estimation procedures are discussed and illustrated on test cases with various output types such as categorical variables and probability distributions. All these examples show their potential for enhancing traditional sensitivity analysis with a kernel point of view.
1 Introduction

In the computer experiments community, global sensitivity analysis (GSA) has now emerged as a central tool for exploring the inputs/outputs relationship of a numerical simulation model. Starting from the pioneering work of Sobol (Sobol', 1993) and Saltelli (Saltelli et al., 1999) on the interpretation and estimation of Sobol' indices, the last two decades have been a fertile ground for the development of advanced statistical methodologies and extensions of the original Sobol' indices: new estimation procedures (Da Veiga et al. (2009), Da Veiga and Gamboa (2013), Solís (2019), Gamboa et al. (2020)), multivariate outputs with aggregation (Gamboa et al., 2013) and dimensionality reduction (Lamboni et al., 2011), goal-oriented sensitivity analysis (Fort et al., 2016) or moment-independent sensitivity measures (Borgonovo (2007), Da Veiga (2015)), among others. At the heart of the popularity of Sobol' indices is the fundamental functional analysis of variance (ANOVA) decomposition, which opens the path for their interpretation as parts of the output variance and makes it possible to pull apart the input main effects and all their potential interactions, up to their whole influence measured by total Sobol' indices. This decomposition however has two drawbacks. First, it is only valid when the inputs are independent, although some generalizations were investigated (Chastaing et al., 2012). Secondly, it only concerns the original Sobol' indices, meaning that it is not possible to split the input effects with goal-oriented or moment-independent sensitivity analysis in general.

When the inputs are dependent, total Sobol' indices can still be used to discriminate them when the objective is to build a surrogate model of the system, and other Sobol'-related indices have also been proposed for interpretability (Mara et al., 2015). But the major breakthrough happened when Shapley effects were defined for GSA by Owen (Owen, 2014). Indeed, due to their mathematical foundations from game theory, Shapley effects do not require the independence assumption to enjoy nice properties: each input is assigned a Shapley effect lying between 0 and 1, while the sum of all effects is equal to 1. For a given input all interactions and correlations with other ones are mixed up, but the interpretation as parts of the output variance is kept and input rankings are still sensible. For these reasons Shapley effects are now commonly regarded as central importance measures in GSA for dealing with dependence, and their estimation has been thoroughly investigated recently (Song et al., 2016; Iooss and Prieur, 2019; Broto et al., 2020; Plischke et al., 2020).

From an interpretability perspective, other importance measures introduced in the context of goal-oriented and moment-independent sensitivity analysis have proven useful to gain additional insights on a given model. For example quantile-oriented (Fort et al., 2016; Maume-Deschamps and Niang, 2018) or reliability-based measures (Ditlevsen and Madsen, 1996) can help understand which inputs lead to the failure of the system, while optimization-related indices enable dimension reduction for optimization problems (Spagnol et al., 2019).
On the other hand, moment-independent sensitivity indices, which quantify the input impact on the whole output distribution instead of the variance only, are powerful complementary tools to grasp further types of input influence. Among them are the f-divergence indices (Da Veiga (2015), Rahman (2016)), with particular cases corresponding to the sensitivity index introduced by Borgonovo (Borgonovo, 2007), and the class of kernel-based sensitivity indices, which rely on the embedding of probability distributions in reproducing kernel Hilbert spaces (RKHS) (Da Veiga, 2015, 2016). Unfortunately an ANOVA-like decomposition is not available yet for any of these indices, even in the independent setting: as a consequence this limits the interpretation of their formulation for interactions, since without ANOVA it is not possible to remove the main effects, and at the same time the natural normalization constant (equivalent to the total output variance for Sobol' indices) is not known.

In this paper we focus on a general RKHS framework for GSA and prove that an ANOVA decomposition actually exists for two previously introduced kernel-based sensitivity indices in the independent setting. To the best of our knowledge this is the first time such a decomposition is available for sensitivity indices other than the original Sobol' ones. Not only does this make it possible to properly define higher-order indices, it further gives access to their natural normalization constant. We also demonstrate that these measures are generalizations of Sobol' indices, in the sense that they are recovered with specific kernels. But the RKHS point of view additionally comes with a large body of work on kernels specifically designed for particular target applications, such as multivariate, functional, categorical or time-series case studies, thus defining a unified framework for many real GSA test cases. When inputs are not independent, we finally introduce a kernel-based version of Shapley effects similar to the ones proposed by Owen.

The paper is organized as follows. Section 2 first briefly introduces the traditional functional ANOVA decomposition with Sobol' indices and moment-independent indices. In Section 3 we then discuss the elementary tools from RKHS theory needed to build kernel-based sensitivity indices, which are at the core of this work. We further investigate these indices and prove they also arise from an ANOVA decomposition. In addition we define Shapley effects with kernels, and the benefits of the RKHS framework for GSA are studied through several examples. Several estimation procedures are then discussed in Section 4, where we generalize some of the recent estimators for Sobol' indices. Finally, Section 5 illustrates the potential of these sensitivity indices with various numerical experiments corresponding to typical GSA applications.

Let $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ denote the numerical simulation model, which is a function of $d$ input variables $X_l \in \mathcal{X}_l$, $l = 1, \ldots, d$, and let $Y \in \mathcal{Y}$ be the model output given by $Y = \eta(X_1, \ldots, X_d)$. In standard GSA the inputs $X_l$ are further assumed to be independent with known probability distributions $\mathbb{P}_{X_l}$, meaning that the vector of inputs $X = (X_1, \ldots, X_d)$ is distributed as $\mathbb{P}_X = \mathbb{P}_{X_1} \otimes \ldots \otimes \mathbb{P}_{X_d}$. For any subset $A = \{l_1, \ldots, l_{|A|}\} \subseteq \mathcal{P}_d$ of indices taken from $\{1, \ldots, d\}$ we denote $X_A = \left(X_{l_1}, \ldots, X_{l_{|A|}}\right) \in \mathcal{X}_A = \mathcal{X}_{l_1} \times \ldots \times \mathcal{X}_{l_{|A|}}$ the vector of inputs with indices in $A$ and $X_{-A}$ the complementary vector with indices not in $A$.
In this setting, the main objective of global sensitivity analysis is to quantify the impact of any group of input variables $X_A$ on the model output $Y$. In this section we first recall the functional ANOVA decomposition and the definition of Sobol' indices, which fall into the category of variance-based indices. Sensitivity indices that account for the whole output distribution, referred to as moment-independent indices, are then discussed. Note that in the following, we adopt the notation $S$ for a properly normalized sensitivity index, while $\underline{S}$ will stand for an unnormalized index, where normalization is to be understood as an end result from an ANOVA-like decomposition.

Here we first assume that $Y \in \mathcal{Y} \subset \mathbb{R}$ is a square-integrable scalar output. If the inputs are independent, the function $\eta$ can then be decomposed according to the ANOVA decomposition:

Theorem 1 (ANOVA decomposition (Hoeffding, 1948; Antoniadis, 1984)). Assume that $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ is a square-integrable function of $d$ independent random variables $X_1, \ldots, X_d$. Then $\eta$ admits a decomposition
$$Y = \eta(X_1, \ldots, X_d) = \sum_{A \subseteq \mathcal{P}_d} \eta_A(X_A),$$
with $\eta_A$ depending only on the variables $X_A$ and satisfying
(a) $\eta_\emptyset = \mathbb{E}(Y)$,
(b) $\mathbb{E}_{X_l}(\eta_A(X_A)) = 0$ if $l \in A$,
(c) $\eta_A(X_A) = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}(Y \mid X_B)$.
Furthermore, (b) implies that all the terms $\eta_A$ in the decomposition are mutually orthogonal. As a consequence, the output variance can be decomposed as
$$\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} \operatorname{Var} \eta_A(X_A) = \sum_{A \subseteq \mathcal{P}_d} V_A \quad (1)$$
where
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y \mid X_B). \quad (2)$$

When this decomposition holds, it is then straightforward to quantify the influence of any subset of inputs $X_A$ on the output variance by normalizing each term with $\operatorname{Var} Y$.

Definition 1 (Sobol' indices (Sobol', 1993)). Under the same assumptions as Theorem 1, the Sobol' sensitivity index associated to a subset $A$ of input variables is defined as
$$S_A = \frac{V_A}{\operatorname{Var} Y}, \quad (3)$$
while the total Sobol' index associated to $A$ is
$$S^T_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S_B. \quad (4)$$

In particular, the first-order Sobol' index of an input $X_l$ writes
$$S_l = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l)}{\operatorname{Var} Y}$$
and its total Sobol' index is given by
$$S^T_l = \sum_{B \subseteq \mathcal{P}_d,\, l \in B} S_B = 1 - \frac{\operatorname{Var} \mathbb{E}(Y \mid X_{-l})}{\operatorname{Var} Y}.$$
Finally, the ANOVA decomposition (1) readily provides an interpretation of Sobol' indices as a percentage of explained output variance, i.e.
$$\sum_{A \subseteq \mathcal{P}_d} S_A = 1. \quad (5)$$
With these definitions, the impact of each input variable can be quantitatively assessed: the first-order Sobol' index measures the main effect of an input, while the total Sobol' index aggregates all its potential interactions with other inputs. As an illustration, an input variable with a low total Sobol' index is thus unimportant and one can freeze it at a default value. When for a given input both first-order and total Sobol' indices are close, this means that this input does not have interactions, while a large gap indicates strong interactions in the model. Furthermore, due to the summation property (5), the interpretation of Sobol' indices as shares of the output variance is an efficient tool for practitioners who aim at understanding precisely the impact and interactions of the inputs of a model on the output. For example the interaction of two inputs $X_l$ and $X_{l'}$ writes
$$S_{ll'} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'}) - \operatorname{Var} \mathbb{E}(Y \mid X_l) - \operatorname{Var} \mathbb{E}(Y \mid X_{l'})}{\operatorname{Var} Y} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'})}{\operatorname{Var} Y} - S_l - S_{l'}. \quad (6)$$
Note that to compute this interaction one subtracts the first-order indices $S_l$ and $S_{l'}$ from the sensitivity index of the subset $(X_l, X_{l'})$ in order to remove the main effects and highlight the interaction only.

2.2 Moment-independent sensitivity indices

Despite their appealing properties, Sobol' indices rank the input variables according to their impact on the output variance only. In a parallel line of work, several authors proposed to investigate instead how inputs influence the whole output distribution, thus introducing a different insight on the inputs/outputs relationship. The starting point (Baucells and Borgonovo (2013), Da Veiga (2015)) is to consider that a given input $X_l$ is important in the model if the probability distribution $\mathbb{P}_Y$ of the output changes when $X_l$ is fixed, i.e. if the conditional probability distribution $\mathbb{P}_{Y|X_l}$ is different from $\mathbb{P}_Y$. More precisely, if $d(\cdot,\cdot)$ denotes a dissimilarity measure between probability distributions, one can define a sensitivity index for variable $X_l$ given by
$$\underline{S}_l = \mathbb{E}_{X_l}\left(d(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right). \quad (7)$$
Such a general formulation is flexible, in the sense that many choices for $d(\cdot,\cdot)$ are available. As an illustration, it is straightforward to show that the unnormalized first-order Sobol' index is retrieved with the naive dissimilarity measure $d(\mathbb{P}, \mathbb{Q}) = \left(\mathbb{E}_{\xi \sim \mathbb{P}}(\xi) - \mathbb{E}_{\xi \sim \mathbb{Q}}(\xi)\right)^2$, which compares probability distributions only through their means. A large class of dissimilarity measures is also given by the so-called f-divergence family: assuming that $(X_l, Y)$ has an absolutely continuous distribution with respect to the Lebesgue measure on $\mathbb{R}^2$, the f-divergence between $\mathbb{P}_Y$ and $\mathbb{P}_{Y|X_l}$ is
$$d_f(\mathbb{P}_Y, \mathbb{P}_{Y|X_l}) = \int f\left(\frac{p_Y(y)}{p_{Y|X_l}(y)}\right) p_{Y|X_l}(y)\, dy$$
where $f$ is a convex function such that $f(1) = 0$, and $p_Y$ and $p_{Y|X_l}$ are the probability distribution functions of $Y$ and $Y|X_l$, respectively. The corresponding sensitivity index is then
$$\underline{S}^f_l = \int f\left(\frac{p_Y(y)\, p_{X_l}(x)}{p_{X_l,Y}(x,y)}\right) p_{X_l,Y}(x,y)\, dx\, dy$$
with $p_{X_l}$ and $p_{X_l,Y}$ the probability distribution functions of $X_l$ and $(X_l, Y)$, respectively. This index has been studied for example in Da Veiga (2015) and Rahman (2016).
A notable special case is obtained with the total-variation distance corresponding to $f(t) = |t - 1|$, leading to the sensitivity index proposed by Borgonovo (Borgonovo, 2007):
$$\underline{S}^{TV}_l = \int |p_Y(y)\, p_{X_l}(x) - p_{X_l,Y}(x,y)|\, dx\, dy.$$
Obviously, definition (7) can be easily extended to measure the influence of any subset of inputs: $\underline{S}_A = \mathbb{E}_{X_A}\left(d(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$. But in this case, since there is no ANOVA-like decomposition, there is no longer the guarantee that an interaction index defined following (6),
$$\underline{S}^{TV}_{ll'} = \int |p_Y(y)\, p_{X_l}(x)\, p_{X_{l'}}(x') - p_{X_l,X_{l'},Y}(x,x',y)|\, dx\, dx'\, dy - \underline{S}^{TV}_l - \underline{S}^{TV}_{l'},$$
really measures the pure interaction between $X_l$ and $X_{l'}$. Therefore the interpretation of higher-order moment-independent sensitivity indices is cumbersome. On the other hand, even if normalization constants have been proposed through general inequalities on f-divergences (Borgonovo (2007), Rahman (2016)), the lack of an ANOVA decomposition once again impedes the definition of a natural normalization constant equivalent to the output variance for Sobol' indices.

Recently, new moment-independent indices built upon the framework of RKHS embedding of probability distributions have also been investigated (Da Veiga, 2015, 2016). Though originally introduced as an alternative to reduce the curse of dimensionality and make the most of the vast kernel literature, we will see in what follows that they actually exhibit an ANOVA-like decomposition and can therefore be seen as a general kernelized version of Sobol' indices.

Before introducing the kernel-based sensitivity indices, we first review some elements of the RKHS embedding of probability distributions (Smola et al., 2007), which will serve as a building block for their definition.
We first introduce an RKHS $\mathcal{H}$ of functions $\mathcal{X} \to \mathbb{R}$ with kernel $k_\mathcal{X}$ and dot product $\langle \cdot, \cdot \rangle_\mathcal{H}$. The kernel mean embedding $\mu_\mathbb{P} \in \mathcal{H}$ of a probability distribution $\mathbb{P}$ on $\mathcal{X}$ is defined as
$$\mu_\mathbb{P} = \mathbb{E}_{\xi \sim \mathbb{P}}\, k_\mathcal{X}(\xi, \cdot) = \int_\mathcal{X} k_\mathcal{X}(\xi, \cdot)\, d\mathbb{P}(\xi)$$
if $\mathbb{E}_{\xi \sim \mathbb{P}}\, k_\mathcal{X}(\xi, \xi) < \infty$, see Smola et al. (2007). The representation $\mu_\mathbb{P}$ is appealing because, if the kernel $k_\mathcal{X}$ is characteristic, the map $\mathbb{P} \to \mu_\mathbb{P}$ is injective (Sriperumbudur et al., 2009, 2010). Consequently, the kernel mean embedding can be used in lieu of the probability distribution for several comparisons and manipulations of probability measures, but using only inner products or distances in the RKHS. For example, a distance between two probability measures $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathcal{X}$ can simply be obtained by computing the distance between their representations in $\mathcal{H}$, i.e.
$$\mathrm{MMD}(\mathbb{P}_1, \mathbb{P}_2) = \|\mu_{\mathbb{P}_1} - \mu_{\mathbb{P}_2}\|_\mathcal{H},$$
which is a distance if the kernel $k_\mathcal{X}$ is characteristic (Sriperumbudur et al., 2009, 2010). This distance is called the maximum mean discrepancy (MMD) and it has been recently used in many applications (Muandet et al., 2012; Szabó et al., 2016). Indeed, using the reproducing property of an RKHS one may show (Song, 2008) that
$$\mathrm{MMD}^2(\mathbb{P}_1, \mathbb{P}_2) = \mathbb{E}_{\xi,\xi'}\, k_\mathcal{X}(\xi, \xi') - 2\, \mathbb{E}_{\xi,\zeta}\, k_\mathcal{X}(\xi, \zeta) + \mathbb{E}_{\zeta,\zeta'}\, k_\mathcal{X}(\zeta, \zeta')$$
where $\xi, \xi' \sim \mathbb{P}_1$ and $\zeta, \zeta' \sim \mathbb{P}_2$ with $\xi, \xi', \zeta, \zeta'$ independent, this notation being used throughout the rest of the paper. This means that the MMD can be computed with expectations of kernels only, unlike other distances between probability distributions which typically require density estimation.

Another significant application of kernel embeddings concerns the problem of measuring the dependence between random variables. Given a pair of random vectors $(U, V) \in \mathcal{X} \times \mathcal{Y}$ with probability distribution $\mathbb{P}_{UV}$, we define the product RKHS $\mathcal{H} = \mathcal{F} \times \mathcal{G}$ with kernel $k_\mathcal{H}((u,v),(u',v')) = k_\mathcal{X}(u,u')\, k_\mathcal{Y}(v,v')$. A measure of the dependence between $U$ and $V$ can then be defined as the distance between the mean embedding of $\mathbb{P}_{UV}$ and that of $\mathbb{P}_U \otimes \mathbb{P}_V$, the joint distribution with independent marginals $\mathbb{P}_U$ and $\mathbb{P}_V$:
$$\mathrm{MMD}^2(\mathbb{P}_{UV}, \mathbb{P}_U \otimes \mathbb{P}_V) = \|\mu_{\mathbb{P}_{UV}} - \mu_{\mathbb{P}_U} \otimes \mu_{\mathbb{P}_V}\|^2_\mathcal{H}.$$
This quantity is known as the Hilbert-Schmidt independence criterion (HSIC, see Gretton et al. (2005a,b)) and can be expanded as
$$\mathrm{HSIC}(U, V) = \mathrm{MMD}^2(\mathbb{P}_{UV}, \mathbb{P}_U \otimes \mathbb{P}_V) = \mathbb{E}_{U,U',V,V'}\, k_\mathcal{X}(U,U')\, k_\mathcal{Y}(V,V') + \mathbb{E}_{U,U'}\, k_\mathcal{X}(U,U')\, \mathbb{E}_{V,V'}\, k_\mathcal{Y}(V,V') - 2\, \mathbb{E}_{U,V}\left[\mathbb{E}_{U'}\, k_\mathcal{X}(U,U')\, \mathbb{E}_{V'}\, k_\mathcal{Y}(V,V')\right] \quad (8)$$
where $(U', V')$ is an independent copy of $(U, V)$. Once again, the reproducing property implies that HSIC can be expressed with expectations of kernels, which facilitates its estimation when compared to other dependence measures such as the mutual information.

The RKHS framework introduced above can readily be used to define kernel-based sensitivity indices. The first approach relies on the MMD, while the second one builds upon HSIC. We discuss them below and show that in particular both of them admit an ANOVA-like decomposition.
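To fix ideas, here is a minimal numpy sketch of how the two quantities above can be computed from finite samples with a Gaussian kernel: a plug-in (V-statistic) estimate of $\mathrm{MMD}^2$ between two samples and the standard biased HSIC estimate from a paired sample. It is only an illustration of the kernel-expectation formulas; the function names, bandwidths and toy data are our own choices, not part of the paper.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel exp(-||a-b||^2 / (2*bandwidth^2))."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2(x, z, bandwidth=1.0):
    """Plug-in (V-statistic) estimate of MMD^2 between samples x ~ P1 and z ~ P2."""
    return (gaussian_gram(x, x, bandwidth).mean()
            + gaussian_gram(z, z, bandwidth).mean()
            - 2 * gaussian_gram(x, z, bandwidth).mean())

def hsic(u, v, bw_u=1.0, bw_v=1.0):
    """Biased (V-statistic) estimate of HSIC(U, V) from paired samples."""
    n = u.shape[0]
    ku, kv = gaussian_gram(u, u, bw_u), gaussian_gram(v, v, bw_v)
    h = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(ku @ h @ kv @ h) / n**2

rng = np.random.default_rng(0)
u = rng.normal(size=(500, 1))
v = u**2 + 0.1 * rng.normal(size=(500, 1))     # dependent pair
print(mmd2(u, v), hsic(u, v))
```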
The first natural idea is to come back to the general formulation (7) for moment-independent indices and use the MMD as the dissimilarity measure to compare $\mathbb{P}_Y$ and $\mathbb{P}_{Y|X_l}$, as proposed in Da Veiga (2016):
$$\underline{S}^{\mathrm{MMD}}_l = \mathbb{E}_{X_l}\, \mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l}) = \mathbb{E}_{X_l} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi') - 2\, \mathbb{E}_{X_l} \mathbb{E}_{\xi \sim \mathbb{P}_Y,\, \zeta \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\xi,\zeta) + \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\zeta,\zeta') = \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\zeta,\zeta') - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi'),$$
where we have defined an RKHS $\mathcal{G}$ of functions $\mathcal{Y} \to \mathbb{R}$ with kernel $k_\mathcal{Y}$. More generally, we can also consider the unnormalized MMD-based sensitivity index for a group of variables $X_A$ given by $\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\zeta,\zeta') - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi')$, provided the following assumption holds:

Assumption 1. $\forall A \subseteq \mathcal{P}_d$ and $\forall x_A \in \mathcal{X}_A$, $\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A = x_A}}\, k_\mathcal{Y}(\xi,\xi) < \infty$, with the convention $\mathbb{P}_{Y|X_A} = \mathbb{P}_Y$ if $A = \emptyset$.

First note that if we focus on the scalar output case $\mathcal{Y} \subset \mathbb{R}$ with the linear kernel $k_\mathcal{Y}(y,y') = yy'$, we have
$$\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathbb{E}_{\xi \sim \mathbb{P}_Y}(\xi) - \mathbb{E}_{\zeta \sim \mathbb{P}_{Y|X_A}}(\zeta)\right)^2 = \mathbb{E}_{X_A}\left(\mathbb{E} Y - \mathbb{E}(Y|X_A)\right)^2 = \operatorname{Var} \mathbb{E}(Y|X_A),$$
that is, we recover the unnormalized Sobol' index for $X_A$. $\underline{S}^{\mathrm{MMD}}_A$ can thus be seen as a kernelized version of Sobol' indices since the latter can be retrieved with a specific kernel. However, it is obvious that since the linear kernel is not characteristic, the MMD in this case is not a distance, which means that $\underline{S}^{\mathrm{MMD}}_A$ is no longer a moment-independent index.

To make another connection with Sobol' indices, we now recall Mercer's theorem, a notable representation theorem for kernels.

Theorem 2 (Mercer, see Aubin (2000)). Suppose $k_\mathcal{Y}$ is a continuous symmetric positive definite kernel on a compact set $\mathcal{Y}$ and consider the integral operator $T_{k_\mathcal{Y}} : L^2(\mathcal{Y}) \to L^2(\mathcal{Y})$ defined by
$$\left(T_{k_\mathcal{Y}} f\right)(x) = \int_\mathcal{Y} k_\mathcal{Y}(y, u)\, f(u)\, du.$$
Then there is an orthonormal basis $\{e_r\}$ of $L^2(\mathcal{Y})$ consisting of eigenfunctions of $T_{k_\mathcal{Y}}$ such that the corresponding sequence of eigenvalues $\{\lambda_r\}$ are non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $\mathcal{Y}$ and $k_\mathcal{Y}$ has the following representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \lambda_r\, e_r(y)\, e_r(y')$$
where the convergence is absolute and uniform.

Assume now that the output $Y \in \mathcal{Y}$ with $\mathcal{Y}$ a compact set, meaning that Mercer's theorem holds. Then $k_\mathcal{Y}$ admits a representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \phi_r(y)\, \phi_r(y')$$
where the $\phi_r(y) = \sqrt{\lambda_r}\, e_r(y)$ are orthogonal functions in $L^2(\mathcal{Y})$. In this setting we can write
$$\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}}\left(\sum_{r=1}^{\infty} \phi_r(\xi)\phi_r(\xi')\right) - \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_Y}\left(\sum_{r=1}^{\infty} \phi_r(\zeta)\phi_r(\zeta')\right).$$
Now, since the convergence of the series is absolute, we can interchange the expectations and the summations to get
$$\underline{S}^{\mathrm{MMD}}_A = \sum_{r=1}^{\infty}\left\{\mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}}\left(\phi_r(\xi)\phi_r(\xi')\right) - \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_Y}\left(\phi_r(\zeta)\phi_r(\zeta')\right)\right\} = \sum_{r=1}^{\infty}\left\{\mathbb{E}_{X_A} \mathbb{E}(\phi_r(Y)|X_A)^2 - \mathbb{E}(\phi_r(Y))^2\right\} = \sum_{r=1}^{\infty} \operatorname{Var} \mathbb{E}(\phi_r(Y)|X_A). \quad (9)$$
In other words, the MMD-based sensitivity index $\underline{S}^{\mathrm{MMD}}_A$ generalizes the Sobol' one in the sense that it measures the impact of the inputs not only on the conditional expectation of the output, but on a possibly infinite number of transformations $\phi_r$ of the output, given by the eigenfunctions of the kernel.

We can now state the main theorem of this section on the ANOVA-like decomposition for $\underline{S}^{\mathrm{MMD}}_A$. Recall that the variance decomposition (1) states that the variance of the output can be decomposed as $\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} V_A$ where each term is given by
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y|X_B).$$
The MMD-based equivalent is obtained with the following theorem.
Theorem 3 (ANOVA decomposition for MMD). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumption 1, denote $\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}\, k_\mathcal{Y}(Y,Y) - \mathbb{E}\, k_\mathcal{Y}(Y,Y')$ where $Y'$ is an independent copy of $Y$. Then the total MMD can be decomposed as
$$\mathrm{MMD}^2_{\mathrm{tot}} = \sum_{A \subseteq \mathcal{P}_d} \mathrm{MMD}^2_A$$
where each term is given by
$$\mathrm{MMD}^2_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}_{X_B}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_B})\right).$$

The proof is given in Appendix A.1. Theorem 3 is very similar to the ANOVA one given in (1): one can note that the total variance of the output is replaced by a generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$ defined by the kernel, and that each subset effect is obtained by removing lesser order ones in the MMD distance of the conditional distributions (instead of the variance of the conditional expectations in the ANOVA). The following corollary states that these two decompositions coincide when the kernel is chosen as the linear one.
Corollary 1.
When $Y \in \mathcal{Y} \subset \mathbb{R}$ and $k_\mathcal{Y}(y,y') = yy'$ in Theorem 3, the decomposition is identical to the decomposition (1), which means that $\mathrm{MMD}^2_{\mathrm{tot}} = \operatorname{Var} Y$ and
$$\forall B \in \mathcal{P}_d, \quad \mathbb{E}_{X_B}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_B})\right) = \operatorname{Var} \mathbb{E}(Y|X_B).$$
It further implies
$$\forall A \subseteq \mathcal{P}_d, \quad \mathrm{MMD}^2_A = V_A.$$
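As a quick numerical sanity check of Corollary 1, the sketch below uses a toy model of our own, $Y = X_1 + X_2^2$ with independent standard normal inputs, for which $\operatorname{Var}\mathbb{E}(Y|X_1) = 1$. It estimates $\mathbb{E}_{X_1}\big(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_1})\big)$ by brute-force double-loop Monte Carlo with the linear kernel; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = lambda x1, x2: x1 + x2**2            # toy model, Var E(Y|X1) = Var(X1) = 1

def mmd2_linear(y1, y2):
    """MMD^2 with the linear kernel k(y, y') = y y' reduces to (mean(y1) - mean(y2))^2."""
    return (y1.mean() - y2.mean())**2

n_outer, n_inner = 200, 2000
y_ref = eta(rng.normal(size=n_inner), rng.normal(size=n_inner))   # sample from P_Y
acc = 0.0
for _ in range(n_outer):                   # outer loop over x1 ~ P_X1
    x1 = rng.normal()
    y_cond = eta(x1, rng.normal(size=n_inner))                    # sample from P_{Y|X1=x1}
    acc += mmd2_linear(y_ref, y_cond)
print(acc / n_outer)                       # should be close to Var E(Y|X1) = 1
```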
Thanks to Theorem 3 we can now define properly normalized MMD-based indices.

Definition 2 (MMD-based sensitivity indices). In the frame of Theorem 3, let $A \subseteq \mathcal{P}_d$. The normalized MMD-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S^{\mathrm{MMD}}_A = \frac{\mathrm{MMD}^2_A}{\mathrm{MMD}^2_{\mathrm{tot}}},$$
while the total MMD-based index associated to $A$ is
$$S^{T,\mathrm{MMD}}_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S^{\mathrm{MMD}}_B = 1 - \frac{\mathbb{E}_{X_{-A}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{-A}})\right)}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$

From Theorem 3, we have the fundamental identity providing the interpretation of MMD-based indices as percentages of the explained generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$:
$$\sum_{A \subseteq \mathcal{P}_d} S^{\mathrm{MMD}}_A = 1.$$
Finally, we exhibit a generalized law of total variance for $\mathrm{MMD}^2_{\mathrm{tot}}$ which will yield another formulation for the total MMD-based index.
Proposition 1 (Generalized law of total variance). Assuming Assumption 1 holds, we have
$$\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}_{X_A}\left[\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi')\right] + \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right).$$

The proof is to be found in Appendix A.2. This is a generalization in the sense that the total variance is replaced by $\mathrm{MMD}^2_{\mathrm{tot}}$, the conditional variance by $\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi')$ and the variance of the conditional expectation by $\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$. In particular, all these terms reduce to the ones in the classical law of total variance if one uses the linear kernel $k_\mathcal{Y}(y,y') = yy'$ in the scalar case. This gives the following corollary.

Corollary 2 (Other formulation of the total MMD-based index). In the frame of Theorem 3, we have for all $A \subseteq \mathcal{P}_d$
$$S^{T,\mathrm{MMD}}_A = \frac{\mathbb{E}_{X_{-A}}\left[\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_{-A}}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_{-A}}} k_\mathcal{Y}(\xi,\xi')\right]}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$

Another approach for combining kernel embeddings with sensitivity analysis consists in directly using HSIC as a sensitivity index. For example Da Veiga (2015) considers the unnormalized index
$$\underline{S}^{\mathrm{HS}}_A = \mathrm{HSIC}(X_A, Y)$$
relying on a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_{\mathcal{H}_A}((x,y),(x',y')) = k_{\mathcal{X}_A}(x_A, x'_A)\, k_\mathcal{Y}(y,y')$, provided the following assumption holds:

Assumption 2. $\forall A \subseteq \mathcal{P}_d$, $\mathbb{E}_{\xi \sim \mathbb{P}_{X_A}}\, k_{\mathcal{X}_A}(\xi,\xi) < \infty$ and $\mathbb{E}_{\xi \sim \mathbb{P}_Y}\, k_\mathcal{Y}(\xi,\xi) < \infty$.

In Da Veiga (2015) an empirical normalization inspired by the definition of the distance correlation criterion (Székely et al., 2007) was also proposed. But similarly to the MMD decomposition above, it is actually possible to exhibit an ANOVA-like decomposition for HSIC, thus providing a natural normalization constant. The main ingredient is an assumption on the kernel $k_\mathcal{X}$ associated to the input variables.
Assumption 3. The reproducing kernel $k_\mathcal{X}$ of $\mathcal{F}$ is of the form
$$k_\mathcal{X}(x, x') = \prod_{l=1}^{d}\left(1 + k_l(x_l, x'_l)\right) \quad (10)$$
where for each $l = 1, \ldots, d$, $k_l(\cdot,\cdot)$ is the reproducing kernel of an RKHS $\mathcal{F}_l$ of real functions depending only on variable $x_l$ and such that $1 \notin \mathcal{F}_l$. In addition, for all $l = 1, \ldots, d$ and $\forall x_l \in \mathcal{X}_l$, we have
$$\int_{\mathcal{X}_l} k_l(x_l, x'_l)\, d\mathbb{P}_{X_l}(x'_l) = 0. \quad (11)$$

The first part (10) of Assumption 3 may seem stringent; however, it can be easily fulfilled by using univariate Gaussian kernels since they define an RKHS which does not contain constant functions (Steinwart et al., 2006). On the contrary, the second assumption (11) is more subtle. It requires using kernels defining a so-called RKHS of zero-mean functions (Wahba et al., 1995).
A prominent example of such an RKHS is obtained if (a) all input variables are uniformly distributed on $[0,1]$ and (b) the univariate kernels are chosen among the Sobolev kernels with smoothness parameter $r \geq 1$:
$$k_l(x_l, x'_l) = \frac{(-1)^{r+1}}{(2r)!}\, B_{2r}(|x_l - x'_l|) + \sum_{j=1}^{r} \frac{B_j(x_l)\, B_j(x'_l)}{(j!)^2} \quad (12)$$
where $B_j$ is the Bernoulli polynomial of degree $j$. Even though applying a preliminary transformation on the inputs in order to get uniform variables is conceivable (with e.g. the probability integral transform), a more general and elegant procedure has been proposed by Durrande et al. (2012). Starting from an arbitrary univariate kernel $k(\cdot,\cdot)$, they build a zero-mean kernel $k_D(\cdot,\cdot)$ given by
$$k_D(x, x') = k(x, x') - \frac{\int k(x,t)\, d\mathbb{P}(t) \int k(x',t)\, d\mathbb{P}(t)}{\iint k(s,t)\, d\mathbb{P}(s)\, d\mathbb{P}(t)}$$
where $k_D(\cdot,\cdot)$ satisfies $\forall x$, $\int k_D(x,t)\, d\mathbb{P}(t) = 0$. Interestingly, they also show that the RKHS associated to $k_D(\cdot,\cdot)$ is orthogonal to the constant functions, thus directly satisfying the requirements for the product kernel (10).

More recently, several works made use of the Stein operator (Stein et al., 1972) to define the Stein discrepancy in an RKHS (Chwialkowski et al., 2016), which showed great potential for Monte-Carlo integration (Oates et al., 2017) or goodness-of-fit tests (Gorham and Mackey, 2015; Chwialkowski et al., 2016; Jitkrittum et al., 2017) when the target distribution is either impossible to sample from or is known up to a normalization constant. More precisely, given an RKHS $\mathcal{H}$ with kernel $k(\cdot,\cdot)$ of functions in $\mathbb{R}^d$ and a (target) probability distribution with density $p(\cdot)$, they define a new RKHS with kernel $k_S(\cdot,\cdot)$ which writes
$$k_S(x, x') = \nabla_x \cdot \nabla_{x'} k(x,x') + \frac{\nabla_x p(x)}{p(x)} \cdot \nabla_{x'} k(x,x') + \frac{\nabla_{x'} p(x')}{p(x')} \cdot \nabla_x k(x,x') + \frac{\nabla_x p(x)}{p(x)} \cdot \frac{\nabla_{x'} p(x')}{p(x')}\, k(x,x')$$
and it can be proved that $\forall x \in \mathbb{R}^d$, $\int k_S(x,x')\, p(x')\, dx' = 0$. Unlike $k_D(\cdot,\cdot)$, the kernel $k_S(\cdot,\cdot)$ can still be defined when $p(\cdot)$ is known up to a constant: this property may find interesting applications in GSA when the input distributions are obtained via a preliminary Bayesian data calibration, since it would no longer be required to perform a costly sampling step of their posterior distribution and one could easily use the unnormalized posterior distribution instead.

With Assumption 3, we can now state a decomposition for HSIC-based sensitivity indices.

Theorem 4 (ANOVA decomposition for HSIC). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumptions 2 and 3, the HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$ can be decomposed as
$$\mathrm{HSIC}(X, Y) = \sum_{A \subseteq \mathcal{P}_d} \mathrm{HSIC}_A$$
where each term is given by
$$\mathrm{HSIC}_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathrm{HSIC}(X_B, Y)$$
and $\mathrm{HSIC}(X_B, Y)$ is defined with a product RKHS $\mathcal{H}_B = \mathcal{F}_B \times \mathcal{G}$ with kernel $k_B(x_B, x'_B)\, k_\mathcal{Y}(y,y') = \prod_{l \in B}(1 + k_l(x_l, x'_l))\, k_\mathcal{Y}(y,y')$ as in (10).

The proof, which mainly relies on Mercer's theorem and on Theorem 4.1 from Kuo et al. (2010), is given in Appendix A.3. Once again, this decomposition resembles the ANOVA decomposition (1), where the conditional variances are replaced with HSIC dependence measures between subsets of inputs and the output.
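Theorem 4 rests on Assumption 3, so it is worth seeing how a kernel satisfying the zero-mean requirement (11) can be built in practice. The sketch below is our own illustration of the construction of Durrande et al. (2012) for a univariate Gaussian kernel and a uniform input on $[0,1]$, where the integrals $\int k(x,t)\,d\mathbb{P}(t)$ are simply approximated on a quadrature grid; grid size and bandwidth are arbitrary choices.

```python
import numpy as np

def k_gauss(x, t, bw=0.2):
    return np.exp(-(x - t)**2 / (2 * bw**2))

# quadrature grid approximating P = U([0, 1])
t = np.linspace(0.0, 1.0, 2001)
w = np.full(t.size, 1.0 / t.size)

def k_int(x):
    """Approximation of \\int k(x, t) dP(t)."""
    return np.sum(k_gauss(np.asarray(x)[..., None], t) * w, axis=-1)

k_double_int = np.sum(k_int(t) * w)        # \int\int k(s, t) dP(s) dP(t)

def k_zero_mean(x, xp):
    """Durrande-type kernel k_D(x, x') = k(x, x') - k_int(x) * k_int(x') / k_double_int."""
    return k_gauss(x, xp) - k_int(x) * k_int(xp) / k_double_int

# check the zero-mean property (11): \int k_D(x0, t) dP(t) should vanish for any x0
x0 = 0.3
print(np.sum(k_zero_mean(x0, t) * w))      # close to 0
```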
Properly normalized HSIC-based indices can then be defined:

Definition 3 (HSIC-based sensitivity indices). In the frame of Theorem 4, let $A \subseteq \mathcal{P}_d$. The normalized HSIC-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S^{\mathrm{HSIC}}_A = \frac{\mathrm{HSIC}_A}{\mathrm{HSIC}(X, Y)},$$
while the total HSIC-based index associated to $A$ is
$$S^{T,\mathrm{HSIC}}_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S^{\mathrm{HSIC}}_B = 1 - \frac{\mathrm{HSIC}(X_{-A}, Y)}{\mathrm{HSIC}(X, Y)}.$$

From Theorem 4, we have the fundamental identity providing the interpretation of HSIC-based indices as percentages of the explained HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$:
$$\sum_{A \subseteq \mathcal{P}_d} S^{\mathrm{HSIC}}_A = 1.$$
Finally, a noteworthy asymptotic result yields a link between HSIC-based indices and MMD-based ones when the input kernel $k_\mathcal{X}$ degenerates to a Dirac kernel, as elaborated in the following proposition.
Proposition 2. For all subsets $A \subseteq \mathcal{P}_d$, let us define a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_A(x_A, x'_A)\, k_\mathcal{Y}(y,y')$. We further assume that $\forall x_A \in \mathcal{X}_A$, $p_{X_A}(x_A) > 0$ and that
$$k_A(x_A, x'_A) = \frac{1}{\sqrt{p_{X_A}(x_A)}\sqrt{p_{X_A}(x'_A)}} \prod_{l \in A} \frac{1}{h} K\left(\frac{x_l - x'_l}{h}\right) \quad (13)$$
where $K : \mathbb{R} \to \mathbb{R}$ is a symmetric kernel function satisfying $\int K(u)\, du = 1$, and $h > 0$. Then we have, $\forall A \subseteq \mathcal{P}_d$,
$$\lim_{h \to 0} \mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$$
where $\mathrm{HSIC}(X_A, Y)$ is defined with the product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ and $\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})$ with the RKHS $\mathcal{G}$.

Examples of such degenerating input kernels include a Gaussian kernel $k_\mathcal{X}$ with a standard deviation tending to 0, or the sinc kernel associated to the RKHS of band-limited continuous functions with a cutoff frequency tending to infinity. Obviously the result also holds if one uses different kernels $K$ for each input $X_l \in X_A$ in Eq. (13).

Although they may seem trivial, Proposition 2 and Corollary 1 actually justify our claim that both the MMD- and the HSIC-based sensitivity indices are natural generalizations of Sobol' indices, in the sense that a degenerate HSIC-based index with a Dirac kernel for the input variables gives the MMD-based index which, in turn, is equal to the Sobol' index when using the dot product kernel for the output.

In this section, we now discuss how the previously introduced indices can still be valuable in the case where the input variables are no longer independent. In this setting, the Shapley effects introduced in the context of GSA by Owen (Owen, 2014) and based on Shapley values (Shapley, 1953) from game theory have appealing properties, since they provide a proper allocation of the output variance to each input variable, without requiring that they are independent. We recall their definition below.
Definition 4 (Shapley effects (Shapley, 1953)). For any $l = 1, \ldots, d$, the Shapley effect of input $X_l$ is given by
$$\mathrm{Sh}_l = \frac{1}{d \operatorname{Var} Y} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\operatorname{Var} \mathbb{E}\left(Y \mid X_{A \cup \{l\}}\right) - \operatorname{Var} \mathbb{E}(Y \mid X_A)\right\}. \quad (14)$$

This definition corresponds to the Shapley value (Shapley, 1953)
$$\phi_l = \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathrm{val}(A \cup \{l\}) - \mathrm{val}(A)\right\}$$
with value function $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}^+$ equal to $\mathrm{val}(A) = \operatorname{Var} \mathbb{E}(Y|X_A)/\operatorname{Var} Y$. Moreover, we have the following decomposition
$$\sum_{l=1}^{d} \mathrm{Sh}_l = 1.$$
The only requirement is that the value function satisfies $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}^+$ with $\mathrm{val}(\emptyset) = 0$. Combining this result with the kernel-based sensitivity indices is consequently straightforward, which leads to the definition of kernel-embedding Shapley effects:

Definition 5 (Kernel-embedding Shapley effects). For any $l = 1, \ldots, d$, we define

(a) The MMD-Shapley effect
$$\mathrm{Sh}^{\mathrm{MMD}}_l = \frac{1}{d\, \mathrm{MMD}^2_{\mathrm{tot}}} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathbb{E}_{X_{A \cup \{l\}}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{A \cup \{l\}}})\right) - \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)\right\} \quad (15)$$
provided Assumption 1 holds.

(b) The HSIC-Shapley effect
$$\mathrm{Sh}^{\mathrm{HSIC}}_l = \frac{1}{d\, \mathrm{HSIC}(X, Y)} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathrm{HSIC}\left(X_{A \cup \{l\}}, Y\right) - \mathrm{HSIC}(X_A, Y)\right\} \quad (16)$$
provided Assumptions 2 and 3 hold.

We further have the decompositions
$$\sum_{l=1}^{d} \mathrm{Sh}^{\mathrm{MMD}}_l = \sum_{l=1}^{d} \mathrm{Sh}^{\mathrm{HSIC}}_l = 1.$$
Just like in the independent setting, kernel-embedding Shapley effects (15) and (16) can be seen as general kernelized versions of Shapley effects, since Proposition 2 and Corollary 1 are still valid when the inputs are dependent.
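The combinatorial aggregation in (14)-(16) is the same whatever the value function, so a direct implementation (exponential in $d$, hence for small $d$ only) can be shared by the variance-based, MMD- and HSIC-Shapley effects. The sketch below is our own and assumes a user-supplied normalized value function with $\mathrm{val}(\emptyset) = 0$ and $\mathrm{val}(\{1,\ldots,d\}) = 1$.

```python
from itertools import combinations
from math import comb

def shapley_effects(val, d):
    """Shapley effects Sh_1, ..., Sh_d from a value function val(A),
    using the subset-sum formula shared by (14)-(16)."""
    effects = []
    for l in range(1, d + 1):
        others = [i for i in range(1, d + 1) if i != l]
        sh = 0.0
        for size in range(0, d):
            for a in combinations(others, size):
                weight = 1.0 / (d * comb(d - 1, size))
                sh += weight * (val(frozenset(a) | {l}) - val(frozenset(a)))
        effects.append(sh)
    return effects          # sums to val({1,...,d}) - val(empty set) = 1

# toy additive value function with fixed shares per input
shares = {1: 0.5, 2: 0.3, 3: 0.2}
val = lambda a: sum(shares[i] for i in a)
print(shapley_effects(val, 3))   # approximately [0.5, 0.3, 0.2]
```

In practice `val` would be replaced by a Monte-Carlo estimate of one of the value functions discussed in Section 4, which is where the real computational effort lies.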
Remark 1.
In the machine learning community dedicated to the interpretability of black-box models, an importance measure called Kernel-Shap has been recently introduced (Lundberg and Lee, 2017). Although its naming resembles ours, they designate clearly separated approaches, since the Kernel-Shap measure is a local Shapley effect, and the "Kernel" denomination only refers to an estimation procedure without any link to RKHS.
Beyond their theoretical interest in themselves, the kernel-ANOVA decompositions and the associated sensitivity indices also appear powerful from a practical point of view when one carefully examines the potential of using kernels. We give below some insights on how they could enhance traditional GSA studies in several settings.
Categorical model outputs and target sensitivity analysis.
In some applications, the model output $Y$ is categorical, meaning that $\mathcal{Y} = \{1, \ldots, K\}$ when the output can take $K$ levels. A simple common instance involves two levels, corresponding to a failure/success situation. Similarly, even if $Y$ is not categorical, the objective may be to measure the impact of each input on the fact that the output reaches disjoint regions of interest $\mathcal{R}_1, \ldots, \mathcal{R}_K \subset \mathcal{Y}$, as for example in the case where one focuses on events $\{t_{i+1} > Y > t_i\}$ for thresholds $t_i$, $i = 1, \ldots, K$. Such an objective is called target sensitivity analysis (TSA, see Marrel and Chabridon (2020)) and can be reformulated in a categorical framework by the change of variable $Z = i$ if $Y \in \mathcal{R}_i$.

The case where $\mathcal{Y} = \{0, 1\}$ (or equivalently $Z = \mathbb{1}_{\{Y \in \mathcal{R}\}}$) is frequent in TSA. A straightforward approach is to use Sobol' indices with this 0/1 output, yielding
$$S^{\mathrm{TSA}}_l = \frac{\mathbb{E}_{X_l}\left(P(Y=1|X_l) - P(Y=1)\right)^2}{P(Y=1)(1 - P(Y=1))}, \quad (17)$$
see Li et al. (2012). But to the best of our knowledge, no systematic procedure is available when the number of levels is greater than two. Without resorting yet to our kernel-based indices, there are at least two roads, which actually lead to the same indices:

(a) The one-versus-all approach, where we compute several Sobol' indices by repeatedly considering $Z = \mathbb{1}_{\{Y=i\}}$ for all $i = 1, \ldots, K$. We thus have a collection of indices
$$S^{\mathrm{TSA},[i]}_l = \frac{\mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{P(Y=i)(1 - P(Y=i))},$$
and we can aggregate them by normalizing each of them by its own variance, yielding
$$S^{\mathrm{TSA}}_l = \frac{\sum_{i=1}^K P(Y=i)(1-P(Y=i))\, S^{\mathrm{TSA},[i]}_l}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))}.$$

(b) The one-hot encoding approach, which consists in encoding the categorical output into a multivariate vector of 0/1 values: if $\mathcal{Y} = \{1, \ldots, K\}$, $Y$ is encoded as a $K$-dimensional vector $(Z_1 = \mathbb{1}_{\{Y=1\}}, \ldots, Z_K = \mathbb{1}_{\{Y=K\}})$. The aggregated Sobol' indices are then
$$S^{\mathrm{TSA}}_l = \frac{\sum_{i=1}^K \operatorname{Var} \mathbb{E}(Z_i|X_l)}{\sum_{i=1}^K \operatorname{Var} Z_i} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))},$$
which is exactly the index obtained with the one-versus-all approach.

As for the kernel-based indices, the process is less cumbersome, since the only ingredient that requires attention is the choice of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ adapted to categorical outputs, which has already been investigated in the kernel literature (Song et al., 2007, 2012). We focus here on the simple Dirac kernel defined as $k_\mathcal{Y}(y,y') = \delta(y,y')$ for categorical values $y, y' \in \{1, \ldots, K\}$, and the corresponding kernel-based indices are then:

• The first-order MMD-based index:
$$S^{\mathrm{MMD}}_l = \frac{\mathbb{E}_{X_l}\sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i|X_l) P(Y=j|X_l) - \sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i) P(Y=j)}{\sum_{i=1}^K P(Y=i) - \sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i) P(Y=j)} = \frac{\mathbb{E}_{X_l}\sum_{i=1}^K P(Y=i|X_l)^2 - \sum_{i=1}^K P(Y=i)^2}{\sum_{i=1}^K P(Y=i) - \sum_{i=1}^K P(Y=i)^2} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))},$$
where we retrieve again the one-versus-all Sobol' index.
• The first-order HSIC-based index:
$$\underline{S}^{\mathrm{HSIC}}_l = \int_{\mathcal{X}_l \times \mathcal{X}_l} \sum_{i=1}^K\sum_{j=1}^K k_{\{l\}}(x,x')\, \delta(i,j)\left[p_{X_l|Y=i}(x) - p_{X_l}(x)\right]\left[p_{X_l|Y=j}(x') - p_{X_l}(x')\right] P(Y=i) P(Y=j)\, dx\, dx' = \sum_{i=1}^K P(Y=i)^2 \int_{\mathcal{X}_l \times \mathcal{X}_l} k_{\{l\}}(x,x')\left[p_{X_l|Y=i}(x) - p_{X_l}(x)\right]\left[p_{X_l|Y=i}(x') - p_{X_l}(x')\right] dx\, dx' = \sum_{i=1}^K P(Y=i)^2\, \mathrm{MMD}^2\left(\mathbb{P}_{X_l|Y=i}, \mathbb{P}_{X_l}\right).$$

• The MMD- and HSIC-Shapley effects using one of the above indices as building block.

Interestingly, it has been shown that Eq. (17) can also be written, up to a constant, as the Pearson $\chi^2$ divergence between $\mathbb{P}_{X_l|Y=1}$ and $\mathbb{P}_{X_l}$ (Perrin and Defaux, 2019; Spagnol, 2020). This means that $S^{\mathrm{TSA}}_l = S^{\mathrm{MMD}}_l$ and $\underline{S}^{\mathrm{HSIC}}_l$ essentially have the same interpretation as weighted sums of distances between the initial input distributions and the conditional input distributions (when restricted to an output level), with the Pearson $\chi^2$ divergence and the MMD distance, respectively. But we will see in Section 4 that the estimation of HSIC-based sensitivity indices is much less prone to the curse of dimensionality and does not require density estimation, as opposed to $S^{\mathrm{TSA}}_l$ (Perrin and Defaux, 2019). Finally, note that another kernel for categorical variables has also been proposed (Song et al., 2007, 2012), but this is actually a normalized Dirac kernel which would only modify the weights in the indices above.
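With the Dirac kernel, the normalized first-order MMD-based index above only involves conditional class probabilities. As a rough given-data illustration (not one of the estimators studied in Section 4), the sketch below estimates $P(Y=i|X_l)$ by a crude quantile binning of a single scalar input and plugs it into the formula; the binning strategy and toy data are our own choices.

```python
import numpy as np

def mmd_index_categorical(x, y, n_bins=20):
    """Normalized first-order MMD index with a Dirac output kernel:
    sum_i E_Xl[(P(Y=i|Xl) - P(Y=i))^2] / sum_i P(Y=i)(1 - P(Y=i)),
    with P(Y=i|Xl) estimated by binning the scalar input x."""
    classes = np.unique(y)
    p_marg = np.array([(y == c).mean() for c in classes])
    bins = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, bins[1:-1]), 0, n_bins - 1)
    num = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            p_cond = np.array([(y[mask] == c).mean() for c in classes])
            num += mask.mean() * np.sum((p_cond - p_marg)**2)
    return num / np.sum(p_marg * (1 - p_marg))

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(size=5000), rng.uniform(size=5000)
y = (x1 + 0.1 * x2 > 0.8).astype(int)        # two-class output driven mostly by x1
print(mmd_index_categorical(x1, y), mmd_index_categorical(x2, y))
```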
Beyond scalar model outputs.

In many numerical simulation models, some of the outputs are curves representing the temporal evolution of physical quantities of the system such as temperatures, pressures, etc. One can also encounter industrial applications which involve spatial outputs (Marrel et al., 2008). In such cases, the two main approaches in GSA are (a) the ubiquitous point of view, where one sensitivity index is computed for each time step or each spatial location (Terraz et al., 2017) and (b) the dimension reduction angle, in which one preliminarily projects the output onto a low-dimensional vector space and then calculates aggregated sensitivity indices for this new multivariate vector (Lamboni et al., 2011; Gamboa et al., 2013).

However, the kernel perspective for such structured outputs can bring new insights for GSA. Indeed the kernel literature has already proposed several ways to handle curves or images in regression or classification tasks. For instance the PCA-kernel (Ferraty and Vieu, 2006) can be used as an equivalent of (b), as illustrated in Da Veiga (2015). But more interestingly, kernels dedicated to time series were designed, such as the global alignment kernel (Cuturi, 2011) inspired by the dynamic time-warping kernel (Sakoe and Chiba, 1978). Such a kernel could be employed in industrial applications where one is interested in the impact of an input variable on the shape of the output curve. On the other hand, for dealing with spatial outputs similar to images such as in Marrel et al. (2008), one may consider a kernel based on image classification (Harchaoui and Bach, 2007) which would be better suited to analyze the impact of inputs on the change of the shapes appearing inside the image output. Finally, numerical models involving graphs as inputs or outputs (e.g. electricity networks or molecules) may now be tractable with GSA by employing kernels specifically tailored for graphs (Gärtner et al., 2003; Ramon and Gärtner, 2003).
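As an example of the dimension-reduction route (b) recast as a kernel, one can project output curves on their first principal components and apply a Gaussian kernel to the scores, in the spirit of the PCA-kernel mentioned above. The sketch below is a possible numpy implementation under our own choices of number of components, bandwidth and toy curves; the resulting Gram matrix can be plugged into any of the $k_\mathcal{Y}$-based indices.

```python
import numpy as np

def pca_kernel_gram(curves, n_comp=3, bandwidth=1.0):
    """Gram matrix of a PCA-type kernel between discretized output curves.
    curves: array of shape (n_samples, n_time_steps)."""
    centered = curves - curves.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # principal directions
    scores = centered @ vt[:n_comp].T                          # (n_samples, n_comp)
    d2 = ((scores[:, None, :] - scores[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

# toy functional outputs: 50 curves observed on a common time grid
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
amp = rng.uniform(0.5, 1.5, size=50)
curves = amp[:, None] * np.sin(2 * np.pi * t)[None, :]
K_Y = pca_kernel_gram(curves)
print(K_Y.shape)
```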
Stochastic numerical models.
On occasions one has to deal with stochastic simulators, where internally the numerical model relies on random draws to compute the output. Typical industrial applications include models dedicated to the optimization of maintenance costs, where random failures are simulated during the system life cycle, or molecular modeling to predict macroscopic properties based on statistical mechanics, where several microstates of the system are generated at random (Moutoussamy et al., 2015). For fixed values of the input variables, the output is therefore a probability distribution, meaning that
$\mathcal{Y} \subset \mathcal{M}^+_1$, the set of probability measures. In this setting, GSA aims at measuring how changes in the inputs modify the output probability distribution, which is clearly out of the traditional scope of GSA. Once again the kernel point of view makes it possible to easily recycle the MMD- and the HSIC-based sensitivity indices in this context, since they only require the definition of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ on probability distributions. This can be achieved through one of the two following kernels:
$$k_\mathcal{Y}(\mathbb{P}, \mathbb{Q}) = \sigma^2 e^{-\lambda\, \mathrm{MMD}^2(\mathbb{P}, \mathbb{Q})} \quad (18)$$
introduced in Song (2008), or
$$k_\mathcal{Y}(\mathbb{P}, \mathbb{Q}) = \sigma^2 e^{-\lambda\, W^2(\mathbb{P}, \mathbb{Q})}$$
discussed in Bachoc et al. (2017), where $\mathbb{P}, \mathbb{Q} \in \mathcal{M}^+_1$, $W$ is the Wasserstein distance and $\sigma^2, \lambda > 0$.
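When each output distribution is only available through a finite sample, as happens when a stochastic simulator is run several times at the same input, kernel (18) can be approximated by plugging in an empirical MMD. The sketch below is one possible implementation; the plug-in estimator, hyperparameter values and toy samples are our own choices.

```python
import numpy as np

def mmd2_gauss(s1, s2, bw=1.0):
    """Plug-in MMD^2 between two samples of scalar outputs, Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * bw**2))
    return k(s1, s1).mean() + k(s2, s2).mean() - 2 * k(s1, s2).mean()

def k_distributions(s1, s2, sigma2=1.0, lam=1.0, bw=1.0):
    """Kernel (18) between two output distributions represented by samples."""
    return sigma2 * np.exp(-lam * mmd2_gauss(s1, s2, bw))

rng = np.random.default_rng(4)
samples_a = rng.normal(0.0, 1.0, size=300)      # output distribution at input x_a
samples_b = rng.normal(0.5, 1.2, size=300)      # output distribution at input x_b
print(k_distributions(samples_a, samples_b))
```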
The properly normalized kernel-based sensitivity indices being defined above, we now discuss their estimation. The HSIC-based index is first examined, as we only consider already proposed estimators. On the other hand, the MMD-based index is analyzed more thoroughly since several estimators can be envisioned given its close links with Sobol' indices. Finally we investigate the estimation of kernel-embedding Shapley effects.

We start by observing that if Assumption 3 holds, for any subset $A \subseteq \mathcal{P}_d$ we have $\mathbb{E}_{X_A}\, k_A(X_A, x'_A) = 1$ for all $x'_A \in \mathcal{X}_A$, which means that HSIC in Eq. (8) simplifies into
$$\mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A, X'_A, Y, Y'}\, k_A(X_A, X'_A)\, k_\mathcal{Y}(Y, Y') - \mathbb{E}_{Y,Y'}\, k_\mathcal{Y}(Y, Y').$$
Given a sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, and following Song et al. (2007) and Gretton et al. (2008), two estimators $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ based on U- and V-statistics, respectively, can be introduced:
$$\widehat{\mathrm{HSIC}}_u(X_A, Y) = \frac{1}{n(n-1)} \sum_{i,j=1,\, i \neq j}^{n}\left(k_A(x^{(i)}_A, x^{(j)}_A) - 1\right) k_\mathcal{Y}(y^{(i)}, y^{(j)}),$$
$$\widehat{\mathrm{HSIC}}_b(X_A, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n}\left(k_A(x^{(i)}_A, x^{(j)}_A) - 1\right) k_\mathcal{Y}(y^{(i)}, y^{(j)}),$$
where we assume that $\mathbb{P}_{X_A}$ is known and is used to compute analytically the zero-mean kernels of Section 3.2.2. The study of the version of the above estimators when the sample $(x^{(i)})_{i=1,\ldots,n}$ also serves to estimate $k_A$ is left as future work. Both $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ converge in probability to $\mathrm{HSIC}(X_A, Y)$ with rate $1/\sqrt{n}$, and one can show (Song et al., 2007) that if we assume that $k_A$ and $k_\mathcal{Y}$ are bounded almost everywhere by 1 and are nonnegative, then with probability at least $1-\delta$ we have
$$\left|\widehat{\mathrm{HSIC}}_u(X_A, Y) - \mathrm{HSIC}(X_A, Y)\right| \leq \sqrt{\log(2/\delta)/n}.$$
The asymptotic distributions of $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ have also been studied in the case where $X_A$ and $Y$ are dependent, see Song et al. (2007) and Gretton et al. (2008). It is worth mentioning that here the number of model evaluations is $n$, which is independent of the input dimension, meaning that all HSIC-based sensitivity indices can be computed with only a given sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.

MMD-based indices are close generalizations of Sobol' indices, since they involve computing the expectation of a conditional quantity (an MMD distance with a conditional probability for the former and a conditional variance for the latter). This is the reason why estimation procedures developed for Sobol' indices can be adapted to the MMD ones. The first two estimators discussed below assume that one can easily sample the computer model for any input values (to be determined by the estimation procedure), as opposed to the next two ones which can be defined with any given sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.

The first naive estimator consists in systematically resampling the conditional distribution $\mathbb{P}_{Y|X_A = x_A}$ for many values of $x_A$, as detailed in Algorithm 1 below.
Algorithm 1 Double-loop estimator of $\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$

Sample $x^{(j)}$ from $\mathbb{P}_X$ and compute $y^{(j)} = \eta(x^{(j)})$ for $j = 1, \ldots, m$.
for $i = 1, \ldots, n$ do (outer loop)
    Sample $x^{(i)}_A$ from $\mathbb{P}_{X_A}$;
    for $j = 1, \ldots, m$ do (inner loop)
        Sample $x^{(j)}_{-A}$ from $\mathbb{P}_{X_{-A}}$ (if inputs are independent) or from $\mathbb{P}_{X_{-A}|X_A = x^{(i)}_A}$ (otherwise);
        Compute $\tilde{y}^{(j)} = \eta(x')$ where $x'_A = x^{(i)}_A$ and $x'_{-A} = x^{(j)}_{-A}$;
    end for
    Compute
    $$M^{(i)} = \frac{1}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(y^{(j)}, y^{(j')}\right) + \frac{1}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(\tilde{y}^{(j)}, \tilde{y}^{(j')}\right) - \frac{2}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(y^{(j)}, \tilde{y}^{(j')}\right),$$
    the estimator of $\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A = x^{(i)}_A})$;
end for
$\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$ is finally estimated by $\frac{1}{n}\sum_{i=1}^{n} M^{(i)}$.

For each MMD-based index of a subset of variables $X_A$ the total number of model evaluations is $(n+1)m$, which means for example that all first-order MMD-based sensitivity indices are computed at a cost of $d(n+1)m$ model evaluations. It is however possible to design better sampling strategies to compute first-order and total indices if the inputs are independent, as explained in the next section.
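A compact numpy transcription of Algorithm 1 for independent inputs is given below; the toy model, output kernel and sample sizes are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
eta = lambda x: x[..., 0] + x[..., 1]**2 + 0.5 * x[..., 2]     # toy model
k_y = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / 2)    # Gaussian output kernel

def mmd2_hat(y1, y2):
    return k_y(y1, y1).mean() + k_y(y2, y2).mean() - 2 * k_y(y1, y2).mean()

def double_loop_mmd(A, n_outer=100, n_inner=200):
    """Double-loop estimate of E_XA[MMD^2(P_Y, P_{Y|XA})] for independent N(0,1) inputs."""
    y_ref = eta(rng.normal(size=(n_inner, d)))                  # sample from P_Y
    acc = 0.0
    for _ in range(n_outer):                                    # outer loop: x_A ~ P_XA
        x = rng.normal(size=(n_inner, d))                       # inner loop: x_{-A} ~ P_{X_{-A}}
        x[:, A] = rng.normal(size=len(A))                       # freeze the coordinates in A
        acc += mmd2_hat(y_ref, eta(x))
    return acc / n_outer

print(double_loop_mmd([0]), double_loop_mmd([2]))               # X1 should dominate X3
```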
4.2.2 Pick-freeze estimators

We begin by recalling the definition of the pick-freeze estimators for Sobol' indices.

Lemma 1 (Pick-freeze formulation of Sobol' indices (Janon et al., 2014)). Assume $X$ and $X'$ are two independent copies of the input vector, the inputs being independent. For any subset $A \subseteq \mathcal{P}_d$, define $X^{\sim A}$ the vector assembled from $X$ and $X'$ such that $X^{\sim A}_A = X_A$ and $X^{\sim A}_{-A} = X'_{-A}$. Now if we denote $Y = \eta(X)$ and $Y^{\sim A} = \eta(X^{\sim A})$, we have
$$\operatorname{Var} \mathbb{E}(Y|X_A) = \operatorname{Cov}\left(Y, Y^{\sim A}\right), \qquad S^T_A = 1 - \frac{\operatorname{Cov}\left(Y, Y^{\sim -A}\right)}{\operatorname{Var} Y}.$$
In the particular case of $A = \{l\}$, the first-order and total indices $S_l$ and $S^T_l$ can be estimated by collecting estimators $\hat{V}_l$, $\hat{V}_{-l}$ and $\hat{V}$ of $\operatorname{Cov}\left(Y, Y^{\sim l}\right)$, $\operatorname{Cov}\left(Y, Y^{\sim -l}\right)$ and $\operatorname{Var} Y$, respectively. Such estimators have first been studied in Homma and Saltelli (1996), but we focus on the ones introduced by Saltelli et al. (2010), which write
$$\hat{V}_l = \frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})\left\{\eta(x^{\sim l,(i)}) - \eta(x'^{(i)})\right\}, \qquad \hat{V}_{-l} = \frac{1}{n}\sum_{i=1}^{n} \eta(x'^{(i)})\left\{\eta(x^{\sim l,(i)}) - \eta(x^{(i)})\right\},$$
$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})\right)^2,$$
where $x^{(i)}$ and $x'^{(i)}$ denote independent samples of $X$ and $x^{\sim l,(i)}$ is a vector such that $x^{\sim l,(i)}_l = x^{(i)}_l$ and $x^{\sim l,(i)}_{-l} = x'^{(i)}_{-l}$. The total number of model evaluations to estimate both $S_l$ and $S^T_l$ for all inputs is thus $(d+2)n$, which is much less than the amount required by the previously introduced double-loop estimator.

We now build upon these estimators to design equivalent ones for the first-order and total MMD-based sensitivity indices. The main ingredient is to state an equivalent of Lemma 1 for the MMD.

Lemma 2 (Pick-freeze formulation of MMD-based indices). With the same notations and assumptions as in Lemma 1, we have
$$\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}\, k_\mathcal{Y}\left(Y, Y^{\sim A}\right) - \mathbb{E}\, k_\mathcal{Y}\left(Y, Y'\right).$$
Proof. Since $Y$ and $Y^{\sim A}$ are conditionally independent given $X_A$ with the same conditional distribution, we can write $\mathbb{E}\, k_\mathcal{Y}\left(Y, Y^{\sim A}\right) = \mathbb{E}_{X_A} \mathbb{E}\left[k_\mathcal{Y}\left(Y, Y^{\sim A}\right) \mid X_A\right] = \mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\zeta, \zeta')$.

Estimators $\widehat{\mathrm{MMD}}^2_l$ of $\mathbb{E}_{X_l}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right)$ and $\widehat{\mathrm{MMD}}^2_{-l}$ of $\mathbb{E}_{X_{-l}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{-l}})\right)$ are therefore given by
$$\widehat{\mathrm{MMD}}^2_l = \frac{1}{n}\sum_{i=1}^{n}\left\{k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{\sim l,(i)})\right) - k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x'^{(i)})\right)\right\},$$
$$\widehat{\mathrm{MMD}}^2_{-l} = \frac{1}{n}\sum_{i=1}^{n}\left\{k_\mathcal{Y}\left(\eta(x'^{(i)}), \eta(x^{\sim l,(i)})\right) - k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x'^{(i)})\right)\right\},$$
while $\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}\, k_\mathcal{Y}(Y,Y) - \mathbb{E}\, k_\mathcal{Y}(Y,Y')$ is estimated by
$$\widehat{\mathrm{MMD}}^2_{\mathrm{tot}} = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{(i)})\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{(j)})\right).$$
All these estimators can actually be recovered by using Mercer's theorem $k_\mathcal{Y}(y,y') = \sum_{r=1}^{\infty} \phi_r(y)\phi_r(y')$ and plugging in the Sobol' estimators of $\operatorname{Cov}\left(\phi_r(Y), \phi_r(Y^{\sim l})\right)$, $\operatorname{Cov}\left(\phi_r(Y), \phi_r(Y^{\sim -l})\right)$ and $\operatorname{Var} \phi_r(Y)$ for all $r \geq 1$. Once again all first-order and total MMD-based sensitivity indices can be estimated with a total cost of $(d+2)n$ model evaluations, and by the strong law of large numbers it is straightforward to show that both $\widehat{\mathrm{MMD}}^2_l / \widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ and $\widehat{\mathrm{MMD}}^2_{-l} / \widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ are consistent.
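The pick-freeze estimators $\widehat{\mathrm{MMD}}^2_l$ and $\widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ translate directly into a few lines of numpy; the sketch below reuses the toy model of the earlier examples and is only meant to show the sampling scheme, with our own choices of kernel and sample size.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 2000, 3
eta = lambda x: x[..., 0] + x[..., 1]**2 + 0.5 * x[..., 2]
k_y = lambda a, b: np.exp(-(a - b)**2 / 2)       # evaluated pairwise on aligned samples

x, x_prime = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y, y_prime = eta(x), eta(x_prime)

def mmd2_tot_hat(y):
    gram = np.exp(-(y[:, None] - y[None, :])**2 / 2)
    return 1.0 - gram.mean()                      # k_y(y, y) = 1 for the Gaussian kernel

def pick_freeze_mmd(l):
    """Estimate E_Xl[MMD^2(P_Y, P_{Y|Xl})] with the pick-freeze scheme."""
    x_tilde = x_prime.copy()
    x_tilde[:, l] = x[:, l]                       # freeze the l-th column from x
    y_tilde = eta(x_tilde)
    return np.mean(k_y(y, y_tilde) - k_y(y, y_prime))

tot = mmd2_tot_hat(y)
print([pick_freeze_mmd(l) / tot for l in range(d)])   # normalized first-order MMD indices
```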
The two previous estimators, although simple, necessitate specific sampling schemes (double-loop Monte-Carlo or pick-freeze) which may not be amenable in practice. In addition, estimating first-order MMD indices calls for a number of model evaluations which increases with the number of input variables $d$. Recently, Gamboa et al. (2020) introduced new estimators of first-order Sobol' indices based on ranking and inspired by the work of Chatterjee (2020). In particular, for any pair of random variables $(V, Y)$ and measurable bounded functions $f$ and $g$, they propose a universal estimation procedure for expectations of the form
$$\mathbb{E}\left(\mathbb{E}[f(Y)|V]\, \mathbb{E}[g(Y)|V]\right)$$
using only a given sample $(v^{(i)}, y^{(i)})_{i=1,\ldots,n}$ and an estimator given by
$$\frac{1}{n}\sum_{i=1}^{n} f(y^{(i)})\, g(y^{(\sigma_n(i))})$$
where $\sigma_n$ is a random permutation with no fixed point, measurable with respect to the $\sigma$-algebra generated by $(v^{(1)}, \ldots, v^{(n)})$. First-order Sobol' indices are then estimated using $f(x) = g(x) = x$ and the permutation $\sigma_n = N$ defined as in Chatterjee (2020):
$$N(i) = \begin{cases} \pi^{-1}(\pi(i) + 1) & \text{if } \pi(i) + 1 \leq n, \\ \pi^{-1}(1) & \text{otherwise,} \end{cases} \quad (19)$$
where $\pi(i)$ is the rank of $V^{(i)}$ in the sample $(V^{(1)}, \ldots, V^{(n)})$. All first-order indices are finally obtained with a given sample by considering one after the other the pairs $(X_l, Y)$ with their own permutation based on the sample ranks of $X_l$.

Interestingly, it is possible to generalize this result to the first-order MMD indices with the following proposition.

Proposition 3 (Generalization of Proposition 3.2 from Gamboa et al. (2020)). Let $k(\cdot,\cdot)$ be a measurable bounded kernel and $(v^{(i)}, y^{(i)})_{i=1,\ldots,n}$ an iid sample from a pair of random variables $(V, Y)$. Consider a random permutation $\sigma_n$ with no fixed point, measurable with respect to the $\sigma$-algebra generated by $(v^{(1)}, \ldots, v^{(n)})$, and such that for any $i = 1, \ldots, n$, $v^{(\sigma_n(i))} \to v^{(i)}$ as $n \to \infty$ with probability one. Then the estimator
$$\chi_n(V, Y, k) = \frac{1}{n}\sum_{i=1}^{n} k(y^{(i)}, y^{(\sigma_n(i))})$$
converges almost surely to $\chi(V, Y, k) = \mathbb{E}_V\, \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|V}}\, k(\xi, \xi')$ as $n \to \infty$.

The proof relies again on Mercer's theorem and is given in Appendix A.5. The estimators of $\mathbb{E}_{X_l}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right)$ and $\mathrm{MMD}^2_{\mathrm{tot}}$ are finally given by
$$\widehat{\mathrm{MMD}}^2_l = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(\sigma^l_n(i))}\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(j)}\right), \quad (20)$$
$$\widehat{\mathrm{MMD}}^2_{\mathrm{tot}} = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(i)}\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(j)}\right), \quad (21)$$
for a sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, where $\sigma^l_n$ is the permutation defined in Eq. (19) with the ranking performed on the sample $(x^{(i)}_l)_{i=1,\ldots,n}$.
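The rank-based estimator (20) needs only a single input/output sample. A possible numpy implementation of the permutation (19) and of $\widehat{\mathrm{MMD}}^2_l$ is sketched below; function names, the toy model and the kernel are ours.

```python
import numpy as np

def chatterjee_permutation(v):
    """Permutation N of Eq. (19): i is sent to the index whose rank follows rank(v_i)."""
    order = np.argsort(v)                    # order[r] = index of the sample with rank r
    ranks = np.argsort(order)                # ranks[i] = rank of v_i
    next_rank = (ranks + 1) % len(v)         # rank pi(i)+1, wrapping back to the smallest rank
    return order[next_rank]

def rank_based_mmd(x_l, y, k_y):
    """Estimators (20)-(21): returns (MMD_l^2 hat, MMD_tot^2 hat) from a single sample."""
    sigma = chatterjee_permutation(x_l)
    gram = k_y(y[:, None], y[None, :])
    mmd_tot = np.mean(np.diag(gram)) - gram.mean()
    mmd_l = np.mean(gram[np.arange(len(y)), sigma]) - gram.mean()
    return mmd_l, mmd_tot

rng = np.random.default_rng(7)
x = rng.normal(size=(2000, 3))
y = x[:, 0] + x[:, 1]**2 + 0.5 * x[:, 2]
k_y = lambda a, b: np.exp(-(a - b)**2 / 2)
for l in range(3):
    mmd_l, mmd_tot = rank_based_mmd(x[:, l], y, k_y)
    print(l, mmd_l / mmd_tot)
```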
The ranking approach introduced above can actually be generalized to estimate higher-order sensitivity indices by replacing ranking (in dimension 1) by nearest neighbors (in arbitrary dimension), since they define a permutation with the same properties as required in Proposition 3. This was proposed independently by Azadkia and Chatterjee (2019) in the context of a dependence measure and by Broto et al. (2020) for Shapley effects estimation. Here we adopt the formalism of Broto et al. (2020), where they introduce $j^*_A(i,m)$, the index such that the sample point $\mathbf{x}^{(j^*_A(i,m))}_A$ of the subset $A\subseteq\mathcal{P}_d$ of input variables is the $m$-th nearest neighbor of the sample point $\mathbf{x}^{(i)}_A$ in a sample of the inputs $(\mathbf{x}^{(i)})_{i=1,\dots,n}$. Their nearest-neighbor estimator $\hat{V}^{knn}_A$ of $\mathrm{Var}\,\mathbb{E}(Y|\mathbf{X}_A)$ is then given by
$$\hat{V}^{knn}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\eta\bigl(\mathbf{x}^{(j^*_A(s(j),1))}\bigr)\,\eta\bigl(\mathbf{x}^{(j^*_A(s(j),2))}\bigr)-\Biggl(\frac{1}{n}\sum_{i=1}^{n}\eta\bigl(\mathbf{x}^{(i)}\bigr)\Biggr)^2$$
where $s(j)$, $j=1,\dots,n_A$, is a sample of uniformly distributed integers in $\{1,\dots,n\}$, with $n_A\le n$. The choice of using a subsample $s(j)$ is motivated by the authors so that their framework is general enough for the different aggregation procedures they propose for Shapley effects and for their consistency proofs. Several numerical experiments not reported here also show that using all the samples instead of subsamples yields biased estimators, so we follow the procedure of Broto et al. (2020). Once again this estimator can be generalized to MMD-based indices, where $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ is estimated by
$$\widehat{\mathrm{MMD}}^2_A=\frac{1}{n_A}\sum_{j=1}^{n_A}k_Y\Bigl(y^{(j^*_A(s(j),1))},y^{(j^*_A(s(j),2))}\Bigr)-\frac{1}{n^2}\sum_{i,j=1}^{n}k_Y\bigl(y^{(i)},y^{(j)}\bigr)$$
where we denote $y^{(i)}=\eta(\mathbf{x}^{(i)})$. The consistency of this estimator directly follows from the consistency of $\hat{V}^{knn}_A$ from Broto et al. (2020) and Mercer's theorem.
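A corresponding base-R sketch of the nearest-neighbor generalization, again under illustrative names and a Gaussian output kernel, could look as follows; it reuses `gauss_kernel` and `mmd_total` from the previous sketch.

```r
# Nearest-neighbor estimator of the MMD index of a subset A of inputs (base R);
# nA and the kernel choice are illustrative.
mmd_knn <- function(X, y, A, sigma, nA = length(y)) {
  n   <- length(y)
  s   <- sample.int(n, nA, replace = TRUE)          # subsample s(j) in {1, ..., n}
  D   <- as.matrix(dist(as.matrix(X[, A, drop = FALSE])))
  diag(D) <- Inf                                    # exclude the point itself
  nn2 <- apply(D, 1, which.min)                     # j*_A(i, 2): closest other point
  mean(gauss_kernel(y[s], y[nn2[s]], sigma)) -      # uses j*_A(s(j), 1) = s(j)
    mean(outer(y, y, gauss_kernel, sigma = sigma))
}

# Example with the toy sample above: index of the pair of inputs {1, 2}
# mmd_knn(X, y, A = c(1, 2), sigma, nA = 500) / mmd_total(y, sigma)
```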
Since $j^*_A(s(j),1)=s(j)$, this estimator is identical to the ranking-based one in (20), where the permutation built from ranks is simply replaced by the index of the nearest neighbor not including the point itself, $j^*_A(s(j),2)$.

4.3 Shapley effect estimation

The last estimation task concerns the kernel-embedding Shapley effects set forth in Definition 5. Of course a straightforward approach consists in using any of the estimators discussed before in the general formulation of the MMD- or HSIC-Shapley effects. But a closer inspection reveals a difference: while this is easy for the HSIC-Shapley effects, since both $\mathrm{HSIC}_u(\mathbf{X}_A,Y)$ and $\mathrm{HSIC}_b(\mathbf{X}_A,Y)$ can be computed for all subsets $A\subseteq\mathcal{P}_d$ with only one sample $(\mathbf{x}^{(i)},y^{(i)})$, $i=1,\dots,n$, the MMD-Shapley effects require estimators of $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ which do not involve too many calls to the numerical model. Among the estimators introduced in Section 4.2, only the one based on nearest neighbors has a computational cost independent of the number of input variables. This is exactly the framework proposed in Broto et al. (2020) for the variance-based Shapley effects.

However, as pointed out in Song et al. (2016) in the case of variance-based Shapley effects, a double-loop Monte Carlo estimator of the value function $\mathrm{val}(A)=\mathrm{Var}\,\mathbb{E}(Y|\mathbf{X}_A)/\mathrm{Var}\,Y$ can be heavily biased. They show that another value function $\mathrm{val}'(A)=\mathbb{E}\,\mathrm{Var}(Y|\mathbf{X}_{-A})/\mathrm{Var}\,Y$ behaves better and gives rise to the exact same Shapley effects (Theorem 1 in Song et al. (2016)). This is why Broto et al. (2020) also introduced a nearest-neighbor estimator of $\mathbb{E}\,\mathrm{Var}(Y|\mathbf{X}_{-A})$ given by
$$\hat{E}^{knn}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\frac{1}{n_I-1}\sum_{i=1}^{n_I}\Biggl[y^{(j^*_{-A}(s(j),i))}-\frac{1}{n_I}\sum_{i'=1}^{n_I}y^{(j^*_{-A}(s(j),i'))}\Biggr]^2$$
where this time $n_I$ nearest neighbors are used. In a nutshell, the nearest neighbors are used as if they were independent samples from $\mathbb{P}_{Y|\mathbf{X}_{-A}=\mathbf{x}^{(s(j))}_{-A}}$, which explains why we compute their empirical variance in the formula above. In order to follow the same road for the estimation of MMD-Shapley effects, we first need an equivalent of Theorem 1 from Song et al. (2016) for a new value function related to the MMD.

Lemma 3 (Other formulation of MMD-Shapley effects). The Shapley values obtained with the value function
$$\mathrm{val}'(A)=\mathbb{E}_{\mathbf{X}_{-A}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi')\Bigr]\Big/\mathrm{MMD}^2_{tot}$$
are exactly equal to the MMD-Shapley effects from Definition 5 with value function $\mathrm{val}(A)=\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)/\mathrm{MMD}^2_{tot}$.

The proof is based on the generalization of the law of total variance for the generalized variance $\mathrm{MMD}^2_{tot}$ and is given in Appendix A.6.
A nearest-neighbor estimator $\widehat{E\mathrm{MMD}}_A$ of $\mathbb{E}_{\mathbf{X}_{-A}}\bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi')\bigr]$ is then given by
$$\widehat{E\mathrm{MMD}}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\Biggl\{\frac{1}{n_I}\sum_{i=1}^{n_I}k_Y\Bigl(y^{(j^*_{-A}(s(j),i))},y^{(j^*_{-A}(s(j),i))}\Bigr)-\frac{1}{n_I^2}\sum_{i,i'=1}^{n_I}k_Y\Bigl(y^{(j^*_{-A}(s(j),i))},y^{(j^*_{-A}(s(j),i'))}\Bigr)\Biggr\}$$
and the MMD-Shapley effect estimator is
$$\widehat{\mathrm{Sh}}^{\mathrm{MMD}}_l=\frac{1}{p\,\widehat{\mathrm{MMD}}^2_{tot}}\sum_{A\subseteq\mathcal{P}_d,\,A\not\ni l}\binom{p-1}{|A|}^{-1}\Bigl\{\widehat{E\mathrm{MMD}}_{A\cup\{l\}}-\widehat{E\mathrm{MMD}}_A\Bigr\},$$
where $\widehat{\mathrm{MMD}}^2_{tot}$ is estimated as in Eq. (21).

As a side note, when the number of input variables is large, the number of terms involved in the Shapley effects severely increases and the computational cost to assemble all of them (even if one uses estimators relying on a given sample only) becomes prohibitive. For such cases it is possible to use a formulation of Shapley effects involving a sum over permutations of $\{1,\dots,d\}$ instead of a sum over subsets of $\mathcal{P}_d$, which makes it possible to add another level of approximation by computing the sum on a random sample of permutations instead of on all of them (Castro et al., 2009). Obviously, since this trick does not depend on the value function used inside the Shapley values, it can also be used for our kernel-embedding Shapley effects.
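For illustration, a possible way to assemble MMD-Shapley effects from a nearest-neighbor $\widehat{E\mathrm{MMD}}_A$ estimator is sketched below in base R. It enumerates all subsets explicitly (so it is only meant for small $p$), and the helper names, the Gaussian output kernel and the default values of $n_A$ and $n_I$ are our own illustrative choices; `gauss_kernel` and `mmd_total` come from the earlier sketches.

```r
# Sketch of MMD-Shapley effects assembled from a nearest-neighbor EMMD_A estimator.
emmd_hat <- function(X, y, A, nA, nI, sigma) {
  n <- length(y)
  minusA <- setdiff(seq_len(ncol(X)), A)
  if (length(minusA) == 0)                          # conditioning on nothing:
    return(1 - mean(outer(y, y, gauss_kernel, sigma = sigma)))  # k(y, y) = 1 here
  D <- as.matrix(dist(as.matrix(X[, minusA, drop = FALSE])))
  s <- sample.int(n, nA, replace = TRUE)
  mean(sapply(s, function(j) {
    nn <- order(D[j, ])[1:nI]                       # n_I nearest neighbors (j included)
    K  <- outer(y[nn], y[nn], gauss_kernel, sigma = sigma)
    mean(diag(K)) - mean(K)                         # within-neighborhood MMD terms
  }))
}

shapley_mmd <- function(X, y, sigma, nA = 200, nI = 10) {
  p <- ncol(X)
  mmd_tot <- mmd_total(y, sigma)                    # Eq. (21), from the earlier sketch
  all_subsets <- function(v) {                      # all subsets of an index vector v
    if (length(v) <= 1) return(c(list(integer(0)), if (length(v)) list(v)))
    c(list(integer(0)),
      unlist(lapply(seq_along(v), function(k) combn(v, k, simplify = FALSE)),
             recursive = FALSE))
  }
  sapply(seq_len(p), function(l) {
    incr <- sapply(all_subsets(setdiff(seq_len(p), l)), function(A) {
      (emmd_hat(X, y, c(A, l), nA, nI, sigma) - emmd_hat(X, y, A, nA, nI, sigma)) /
        choose(p - 1, length(A))
    })
    sum(incr) / (p * mmd_tot)
  })
}
```

For larger numbers of inputs, the permutation-sampling approximation mentioned above would replace the explicit subset enumeration in this sketch.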
In this section we illustrate the behavior of the kernel-based sensitivity indices on several test cases representative of typical GSA industrial applications. In particular, we address the following numerical model categories: a standard scalar-output model, a stochastic simulator, a model with a time-series output and a multi-class categorical-output simulator with dependent inputs. All the results presented here are reproducible with the R code provided in the supplementary material.

To exemplify the additional insight provided by these indices we first consider a classical GSA test case, the Ishigami function (Ishigami and Homma, 1990), where the output $Y$ is given by
$$Y=\sin(X_1)+7\sin^2(X_2)+0.1\,X_3^4\sin(X_1)$$
with $X_l\sim\mathcal{U}(-\pi,\pi)$ for $l=1,\dots,4$, meaning that we add a dummy input variable $X_4$ for analysis purposes.

We start by computing the traditional Sobol' first-order and total sensitivity indices using a pick-freeze estimator as in Section 4.2.2 with a sample size $n=1000$, and we repeat this calculation 50 times. For each replication the total number of calls to the numerical model is thus $(p+2)n=6000$. We then use the same pick-freeze procedure to estimate the MMD-based first-order and total indices with the exact same samples. For the output we use a Gaussian kernel $k_Y(y,y')=\exp(-\sigma(y-y')^2)$ where $\sigma$ is chosen as the median of the pairwise distances between the output samples. Results are given in Figure 1. First note that, as is well known, the first-order Sobol' index of $X_3$ is zero, while its total index is around 0.25 due to its interaction with $X_1$.
$X_2$ is also an important variable, which does not have any interaction since its total Sobol' index is equal to its first-order one. As expected, $X_4$ is correctly detected as non-important. The MMD-based indices however bring a different insight: from a probability distribution perspective, one can observe that interactions are much more present, since there is a large gap between total and first-order indices for all inputs (except $X_4$ of course). In addition, this time $X_3$ is detected to have a main effect: indeed, even though it does not impact the output conditional mean, it influences the tails of the output conditional distribution when it is close to $-\pi$ or $\pi$, as was already illustrated in Da Veiga (2016). This shows that MMD-based indices capture other types of input influence than Sobol' ones.

Figure 1: Ishigami test case. First-order (a) and total (b) Sobol' indices and first-order (c) and total (d) MMD-based indices with pick-freeze estimators, $n=1000$, 50 replicates.

To take a different view of the inputs/output relationship we also estimate HSIC-based first-order and total indices using the V-statistic of Section 4.1. Again we use the same Gaussian kernel as above for the output, while we use the Sobolev kernel from Eq. (12) for the inputs. Since they are uniform it is easy to renormalize them to satisfy the zero-mean kernel condition. We use only one sample of size $n=1000$ and estimates obtained with 50 replications are reported in Figure 2. Interestingly, we first observe that with HSIC we no longer detect any interaction: our intuition is that first-order HSIC indices already aggregate a very large family of potential influences, and thus interactions may only appear with highly complicated inputs/output link functions. This is supported by the fact that HSIC indices rank the inputs the exact same way as total Sobol' indices. Another appealing property is that computing all HSIC indices only requires a given sample of moderate size, which is interesting from a screening perspective for GSA on very time-consuming numerical models.

Figure 2: Ishigami test case. First-order (a) and total (b) HSIC-based indices with the V-statistic estimator, $n=1000$, 50 replicates.
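For completeness, here is a simplified base-R sketch of a V-statistic HSIC estimator of the kind used above, with Gaussian kernels on both sides; note that the paper uses a Sobolev kernel on the inputs, so this is only a stand-in with our own names and bandwidths.

```r
# Simplified V-statistic HSIC estimator between one input and the output.
hsic_v <- function(x, y, sx, sy) {
  n  <- length(y)
  Kx <- exp(-sx * outer(x, x, "-")^2)
  Ky <- exp(-sy * outer(y, y, "-")^2)
  H  <- diag(n) - matrix(1 / n, n, n)        # centering matrix
  sum((H %*% Kx %*% H) * Ky) / n^2           # (1/n^2) trace(H Kx H Ky)
}
# e.g. first-order HSIC values for the toy sample of the earlier sketch:
# sapply(1:p, function(l) hsic_v(X[, l], y, sx = 1, sy = sigma))
```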
Our second illustration is a more original setting for GSA, which consists of a stochastic simulator: for a fixed value of the input variables, the numerical model outputs a probability distribution, or rather a sample from a probability distribution in practice. Here we use a test case proposed in Moutoussamy et al. (2015) which involves five input variables and writes
$$Y=(X_1+2X_2+U_1)\sin(3X_3-X_4+N)+U_2+5X_5B+\sum_i iX_i$$
where $X_1,\dots,X_5\sim\mathcal{U}(0,1)$ are the input variables and $U_1$, $U_2$ (uniform), $N$ (centered Gaussian) and $B\sim\mathrm{Bernoulli}(1/2)$ are additional random variables which are responsible for the simulator stochasticity.
Note that we modify the constant in front of $X_5B$ to lessen the effect of $X_5$ as compared to Moutoussamy et al. (2015). An example of the output distribution for 20 random fixed values of the input variables, obtained each time with a sample of size 100 for the stochastic ones, is given in Figure 3.

Figure 3: Stochastic simulator test case. Output probability distribution for 20 values of the input variables chosen at random. The distribution is estimated with a kernel density estimator.

Leaving aside for now the whole output distribution, we first place ourselves in a standard deterministic GSA setting by analyzing the input influence on both the output mean and standard deviation (with respect to $U_1$, $U_2$, $N$ and $B$). We thus compute Sobol' indices for these two outputs of interest with a pick-freeze estimator with a sample of size $n=1000$ and perform 50 replications, see Figure 4. It shows that interactions are negligible, and that $X_5$ is clearly the most influential input by far: alone, it explains 65% of the output mean variability and 75% of the output standard deviation variability. This is expected since $X_5$ is coupled with $B$, which creates the multi-modal feature of the output distribution. The output mean variability also depends on $X_1$ and $X_2$ to some lesser extent, and the output standard deviation variability on $X_3$.

Figure 4: Stochastic simulator test case. First-order (a) and total (b) Sobol' indices of the output mean and first-order (c) and total (d) Sobol' indices of the output standard deviation with pick-freeze estimators, $n=1000$, 50 replicates.

We now make use of the kernel framework to compute MMD- and HSIC-based sensitivity indices which can accommodate the output distribution directly, thanks to the specific kernels discussed in Section 3.4. More precisely, we use the kernel of Eq. (18) with $\sigma=1$ and $\lambda$ chosen as the median of the MMDs computed on the preliminary sample used for visualization in Figure 3, with a kernel $k_Y(y,y')=\exp(-\tau(y-y')^2)$ and $\tau$ chosen as the median of the pairwise distances between the output samples. We only compute first-order indices here and use for illustration the rank estimator of the MMD index from Section 4.2.3, while for HSIC we use again the Sobolev kernel. Results with 50 replications and a sample of size $n=200$ are given in Figure 5. Both indices coincide and identify $X_5$ as the most important input variable, as well as a small influence of $X_1$, $X_2$ and $X_3$, while $X_4$ is non-important: considering the whole output distribution variability via the specific kernel is comparable to an aggregation of the variability of the output mean and standard deviation (and of other moments we did not compute above).

Figure 5: Stochastic simulator test case. First-order MMD (a) and HSIC (b) indices of the output distribution with rank and V-statistic estimators, respectively, $n=200$, 50 replicates.
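As an illustration of how such distribution-valued outputs can be handled, the sketch below implements one common Gaussian-type kernel between distributions represented by samples, $\exp(-\mathrm{MMD}^{2\sigma}/(2\lambda^2))$; the exact parameterization of Eq. (18) may differ, and the names and parameters are ours.

```r
# A possible kernel between output distributions given as samples (base R).
mmd2_hat <- function(y1, y2, tau) {
  k <- function(a, b) exp(-tau * outer(a, b, "-")^2)
  mean(k(y1, y1)) + mean(k(y2, y2)) - 2 * mean(k(y1, y2))   # V-statistic MMD^2
}
dist_kernel <- function(y1, y2, tau, lambda, sigma = 1) {
  exp(-mmd2_hat(y1, y2, tau)^sigma / (2 * lambda^2))
}
# Each "observation" of the stochastic simulator is a sample of outputs; the Gram
# matrix of dist_kernel over these samples can then feed the MMD/HSIC estimators above.
```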
Another commonly encountered industrial application is a physics-based numerical simulator involving functional outputs, such as curves representing the evolution over time of some system characteristics (e.g. pressure, temperature, ...). To illustrate how time-series kernels can easily handle GSA on such systems, we build a simplified compartmental epidemiological model inspired by previous works on COVID-19 (Magal and Webb, 2020; Charpentier et al., 2020; Di Domenico et al., 2020). Our model is a straightforward Susceptible-Infected-Recovered (SIR) model (Kermack and McKendrick, 1927) which is slightly modified, in the sense that it accounts for two different types of infectious people: the reported cases, which we assume are isolated and can no longer contaminate others, and the unreported cases, who can infect others. A summary of this compartmental model proposed by Magal and Webb (2020) is given in Figure 6.

Figure 6: Functional simulator test case. The modified SIR model with 4 compartments following Magal and Webb (2020).

$S$ consists of the susceptible individuals who are not yet infected. During the epidemic spread they are infected depending on the time-dependent transmission rate $\tau(t)$. Once infected they move to compartment $I$ of asymptomatic infectious individuals. After an asymptomatic period of $1/\nu$ days they become symptomatic, and a fraction $f$ of them is detected and goes to compartment $R$, while the rest of them are undetected and go to compartment $U$. After a recovery period of $1/\eta$ days, symptomatic people from $R$ and $U$ recover and go to the last compartment. Observe that this is a highly simplified representation of the epidemic where we do not account for hospitalizations, testing strategies or deaths: our goal here is not to be representative of COVID-19 but rather to exemplify how GSA can be applied to such models.

The dynamics of the evolution of individuals from one compartment to another is modeled with the following system of ordinary differential equations:
$$\frac{dS}{dt}=-\tau(t)\,S\,(I+U),\qquad \frac{dI}{dt}=\tau(t)\,S\,(I+U)-\nu I,\qquad \frac{dR}{dt}=f\nu I-\eta R,\qquad \frac{dU}{dt}=(1-f)\nu I-\eta U.$$
The transmission rate is chosen according to Magal and Webb (2020), where they propose the parametric form $\tau(t)=\tau_0\exp(-\mu\max(t-N,0))$: the transmission rate is constant equal to $\tau_0$ at the beginning of the epidemic, and it then decreases with an exponential decay of rate $\mu$ once social distancing and lockdown start to have an effect after $N$ days. They further assume that the cumulative number of reported cases $CR(t)$ is approximately $CR(t)=\chi_1\exp(\chi_2 t)-\chi_3$, where the $\chi_i$ are estimated on data. From this assumption they obtain the values of the initial conditions $I_0$, $U_0$ and $R_0$ (see Magal and Webb (2020) for the exact expressions), while in our case $S_0\approx 66\times 10^6$ is the initial susceptible population (here in France). From a GSA perspective we then assume that we have uncertainty on the following 6 input variables: $\tau_0$, $\mu$, $N$ (transmission rate), $\eta$, $\nu$ (days until symptoms and recovery) and $\chi_2$ (which impacts the initial conditions). $f$ is assumed to be fixed at a fraction equal to 0.1.
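A minimal base-R sketch of this modified SIR simulator (with a simple explicit Euler scheme) is given below; the signature, default values and time grid are placeholders of ours and do not reproduce the settings used in the paper.

```r
# Explicit-Euler sketch of the modified SIR model; parameter values are placeholders.
sir_unreported <- function(tau0, mu, N, eta, nu, I0, U0, R0,
                           S0 = 66e6, f = 0.1, t_end = 150, dt = 0.1) {
  steps <- floor(t_end / dt)
  S <- S0; I <- I0; U <- U0; R <- R0
  out <- matrix(NA_real_, steps, 2, dimnames = list(NULL, c("I", "R")))
  for (k in seq_len(steps)) {
    t    <- k * dt
    tau  <- tau0 * exp(-mu * max(t - N, 0))   # time-dependent transmission rate
    newI <- tau * S * (I + U)
    dS <- -newI
    dI <- newI - nu * I
    dR <- f * nu * I - eta * R
    dU <- (1 - f) * nu * I - eta * U
    S <- S + dt * dS; I <- I + dt * dI; R <- R + dt * dR; U <- U + dt * dU
    out[k, ] <- c(I, R) / S0                  # curves normalized by the population size
  }
  out                                         # functional (time-series) output
}
```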
We assign uniform distributions to the input variables $\tau_0$, $\mu$, $N$, $1/\eta$, $1/\nu$ and $\chi_2$, with ranges consistent with the values from Magal and Webb (2020). The output dynamics over time of compartments $I$ and $R$ for 20 values of the inputs chosen at random according to these uniform distributions are given in Figure 7.

Figure 7: Functional simulator test case. Output dynamics over time for compartments $I$ (left, infectious cases) and $R$ (right, reported cases) for 20 values of the input variables chosen at random. They are both normalized by the total population $S_0$.

For GSA we rely on the global-alignment kernel of Cuturi (2011) designed for time series, which searches over all their alignments and averages them, and use it inside our first-order HSIC indices for the output, whereas we still employ the Sobolev kernel for the inputs. The results obtained with 50 repetitions with a sample of size $n=200$ and the V-statistic estimator are reported in Figure 8. For both compartments the most influential input is $\chi_2$, as expected since it influences the initial conditions, followed by $N$ and $\mu$, both related to the transmission rate. $\nu$ also has an impact for compartment $I$ but not $\eta$, which is coherent with the ordinary differential equations, and one can see on the contrary that $\eta$ influences compartment $R$.

Figure 8: Functional simulator test case. First-order HSIC indices for compartments $I$ (left) and $R$ (right) with the V-statistic estimator, $n=200$, 50 replicates.

Finally we investigate a numerical model with both a categorical output (to make use of the discussion from Section 3.4) and dependent inputs (to analyze the kernel-embedding Shapley effects from Section 3.3). We build upon the famous wine quality data set (Cortez et al., 2009) of the UCI repository (Dua and Graff, 2017). This dataset consists of 4898 observations of wine quality (a categorical variable with levels 0 to 10 corresponding to a score) associated with 11 features obtained with physicochemical tests. In order to place ourselves in a standard computer experiments setting (i.e. a numerical simulator and uncertain inputs with a given probability distribution) we use this dataset to design a GSA scenario detailed in the following steps:

1. We regroup wine quality scores into only 3 categories: low (score less than 5), medium (score equal to 6) and high (score higher than 7) in order to have a balanced dataset. We also use a small subsample of size 600 of white wine only from the initial 4898 observations for faster estimation of the input dependence structure;

2. We estimate a random forest model between the wine quality and the 11 inputs from this transformed dataset and compute variable importance for each input. The variable importance score is used to select only 4 important features among the initial 11 ones (volatile acidity, chlorides, density and alcohol). This is absolutely not a mandatory step, but we choose to do so both for a faster computation of Shapley effects and for the estimation of the input dependence structure. A new random forest model is finally built with these 4 input variables only, and the predictor serves as our numerical simulation model;

3. The samples from the 4 input variables identified above are used to estimate a vine copula structure which models their dependence (Czado, 2019). Once the vine copula is estimated, it is then easy to generate as many new input samples as required.
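Before computing the Shapley effects described next, note that a categorical output only requires a Gram matrix built from a Dirac kernel; a hedged base-R sketch, with our own function names and placeholder labels, is given below.

```r
# Dirac (0/1) kernel for a categorical output, plugged into a Gram-matrix HSIC form.
dirac_gram <- function(y) 1 * outer(y, y, "==")      # 1 if same class, 0 otherwise

hsic_v_gram <- function(Kx, Ky) {
  n <- nrow(Kx)
  H <- diag(n) - matrix(1 / n, n, n)
  sum((H %*% Kx %*% H) * Ky) / n^2
}
# y_class <- sample(c("low", "medium", "high"), 100, replace = TRUE)  # placeholder output
# Ky <- dirac_gram(y_class)
```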
MMD- and HSIC-Shapley effects are then computed with a sample size of $n=1000$, with a Dirac categorical kernel for the output and a Sobolev kernel for the inputs in the HSIC case. For MMD we use the nearest-neighbor estimator of Section 4.3 and for HSIC the V-statistic estimator, and we repeat the estimation 50 times, see Figure 9.

Figure 9: Multi-class output test case. MMD- (a) and HSIC- (b) Shapley effects with nearest-neighbor and V-statistic estimators, respectively, $n=1000$, 50 replicates.

Both kernel-embedding Shapley effects identify alcohol as the most influential input, which was expected from the variable importance scores computed with the random forest. However, the MMD-Shapley effects do not discriminate the input variables as clearly as HSIC. We suspect that there may be remaining estimation bias coming from the nearest-neighbor estimators, which we plan to carefully examine in future work.

In this paper we discussed two moment-independent sensitivity indices which generalize Sobol' ones by relying on the RKHS embedding of probability distributions. These MMD- and HSIC-based sensitivity indices are shown to admit an ANOVA decomposition, which makes it possible to properly define input interactions and their natural normalization constant. To the best of our knowledge this is the first time such a result is proved for sensitivity indices apart from Sobol' ones. We also defined kernel-embedding Shapley effects which are built upon these indices for the case where the input variables are no longer independent. As discussed through several GSA applications with categorical outputs or stochastic simulators, this opens the path for new powerful and general GSA approaches by means of kernels adapted to the task at hand. Finally, several estimators have been introduced, including new ones inspired by recent advances in Sobol' indices and Shapley effects estimation.

However, there is still room for improvement in the theoretical understanding of these indices. First, we extensively used Mercer's theorem and it would be interesting to extend our results when it no longer holds. We also assume a kernel product form for HSIC indices, whereas the theorem used in our proof allows for more general kernels. From an estimation perspective, we did not exhibit here any central limit theorem, although this would be an important step enabling one to statistically test whether indices are zero or not. But this is not at all an easy task, which may be tackled via the functional delta method combined with Mercer's theorem. On the other hand, some bias can be observed in the nearest-neighbor estimators, which should be analyzed carefully in future work. Finally, further practical experiments should be performed to better understand the behavior of these new indices. We think in particular of the choice of the kernel hyperparameters, and of the investigation of invariant kernels for outputs given as curves or images.
A Proofs
A.1 Proof of Theorem 3
Proof.
The theorem is proved in the case where Mercer's theorem holds, i.e., the output is assumed to be such that $Y\in\mathcal{Y}$ with $\mathcal{Y}$ a compact set and $k_Y$ has the representation
$$k_Y(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y')$$
as in Eq. (9). Consider now the random variable $W=\sum_{r=1}^{\infty}\eta^{[r]}(\mathbf{X})$ where $\eta^{[r]}(\mathbf{X})=\varphi_r(Y)=\varphi_r(\eta(\mathbf{X}))$. To prove the theorem, two formulations of $\mathrm{Var}\,W$ are exhibited. First, since the functions $\varphi_r$ are orthogonal in $L^2(\mathcal{Y})$ and using the absolute convergence of the series, we have
$$\mathrm{Var}\,W=\sum_{r=1}^{\infty}\mathrm{Var}\,\varphi_r(Y)=\sum_{r=1}^{\infty}\mathbb{E}\bigl(\varphi_r(Y)\varphi_r(Y)\bigr)-\sum_{r=1}^{\infty}\mathbb{E}\bigl(\varphi_r(Y)\varphi_r(Y')\bigr)=\mathbb{E}\Bigl(\sum_{r=1}^{\infty}\varphi_r(Y)\varphi_r(Y)\Bigr)-\mathbb{E}\Bigl(\sum_{r=1}^{\infty}\varphi_r(Y)\varphi_r(Y')\Bigr)=\mathbb{E}\,k_Y(Y,Y)-\mathbb{E}\,k_Y(Y,Y'),$$
where $Y'$ denotes an independent copy of $Y$. On the other hand, using the variance decomposition (1) for each $\eta^{[r]}(\mathbf{X})=\varphi_r(\eta(\mathbf{X}))$ we get
$$\mathrm{Var}\,W=\sum_{r=1}^{\infty}\mathrm{Var}\,\eta^{[r]}(\mathbf{X})=\sum_{r=1}^{\infty}\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\,\mathrm{Var}\,\mathbb{E}\bigl(\eta^{[r]}(\mathbf{X})\,\big|\,\mathbf{X}_B\bigr)=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sum_{r=1}^{\infty}\mathrm{Var}\,\mathbb{E}\bigl(\varphi_r(Y)\,\big|\,\mathbf{X}_B\bigr)=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\,\mathbb{E}_{\mathbf{X}_B}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_B})\bigr)$$
using again the absolute convergence and the expansion of $\mathbb{E}_{\mathbf{X}_B}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_B})\bigr)$ obtained in Eq. (9). The theorem follows by equating both formulations of $\mathrm{Var}\,W$.

A.2 Proof of Proposition 1

Proof.
We simply add and subtract $\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')$:
$$\begin{aligned}
\mathrm{MMD}^2_{tot}&=\mathbb{E}_{\zeta\sim\mathbb{P}_Y}k_Y(\zeta,\zeta)-\mathbb{E}_{\zeta,\zeta'\sim\mathbb{P}_Y}k_Y(\zeta,\zeta')\\
&=\mathbb{E}_{\zeta\sim\mathbb{P}_Y}k_Y(\zeta,\zeta)-\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr).
\end{aligned}$$

A.3 Proof of Theorem 4
Proof.
We first rewrite the HSIC between $\mathbf{X}$ and $Y$ from Eq. (8) as a multivariate integral, assuming $\mathbb{P}_{\mathbf{X}Y}$ is absolutely continuous with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Y}$:
$$\mathrm{HSIC}(\mathbf{X},Y)=\int_{\mathcal{X}\times\mathcal{X}}\int_{\mathcal{Y}\times\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')k_Y(y,y')\,[p_{\mathbf{X}Y}(\mathbf{x},y)-p_{\mathbf{X}}(\mathbf{x})p_Y(y)]\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}\,d\mathbf{x}'\,dy\,dy'\qquad(22)$$
where $p_{\mathbf{X}Y}$, $p_{\mathbf{X}}$ and $p_Y$ are the probability density functions of $(\mathbf{X},Y)$, $\mathbf{X}$ and $Y$, respectively. As in Theorem 3 we further assume Mercer's theorem holds, which means that
$$k_Y(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y').$$
For each $r$, we then define the function
$$g^{[r]}(\mathbf{x})=\int_{\mathcal{X}}\int_{\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy',$$
noting that $g^{[r]}\in\mathcal{F}$ from Assumption 2. It is then straightforward to show that
$$\|g^{[r]}\|^2_{\mathcal{F}}=\int_{\mathcal{X}\times\mathcal{X}}\int_{\mathcal{Y}\times\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,\varphi_r(y)\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x},y)-p_{\mathbf{X}}(\mathbf{x})p_Y(y)]\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}\,d\mathbf{x}'\,dy\,dy',$$
which means that
$$\mathrm{HSIC}(\mathbf{X},Y)=\sum_{r=1}^{\infty}\|g^{[r]}\|^2_{\mathcal{F}}\qquad(23)$$
since in Mercer's theorem we have the absolute convergence of the series. Now the idea is to write an orthogonal decomposition (in $\mathcal{F}$) for each function $g^{[r]}$, which will finally provide a decomposition for HSIC through Eq. (23).

The orthogonal decomposition of $g^{[r]}$ is obtained with Theorem 4.1 from Kuo et al. (2010). First, recall that we have, from the first part of Assumption 3,
$$k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')=\prod_{l=1}^{p}\bigl(1+k_l(x_l,x'_l)\bigr)=\sum_{A\subseteq\mathcal{P}_d}\prod_{l\in A}k_l(x_l,x'_l):=\sum_{A\subseteq\mathcal{P}_d}k_A(\mathbf{x}_A,\mathbf{x}'_A)\qquad(24)$$
which corresponds to Eq. (4.1) in Kuo et al. (2010). We then introduce a set of commuting projections $\{P_l\}_{l=1}^{p}$ on $\mathcal{F}$ given by
$$P_l(f)=\int_{\mathcal{X}_l}f(x_1,\dots,x_{l-1},t,x_{l+1},\dots,x_d)\,p_{X_l}(t)\,dt\qquad(25)$$
for all $f\in\mathcal{F}$. From the second part of Assumption 3, one has for every subset $A\subseteq\mathcal{P}_d$, every $\mathbf{x}_A\in\mathcal{X}_A$ and every $l\in A$
$$P_l\bigl(k_A(\cdot,\mathbf{x}_A)\bigr)=\prod_{l'\in A,\,l'\neq l}k_{l'}(\cdot,x_{l'})\int_{\mathcal{X}_l}k_l(t,x_l)\,p_{X_l}(t)\,dt=0,$$
meaning that Eq. (4.5) from Kuo et al. (2010) is satisfied. From Theorem 4.1 of Kuo et al. (2010), we can now state that $g^{[r]}$ has a unique orthogonal decomposition given by
$$g^{[r]}=\sum_{A\subseteq\mathcal{P}_d}g^{[r]}_A\quad\text{where}\quad g^{[r]}_A=\sum_{B\subseteq A}(-1)^{|A|-|B|}P_{-B}(g^{[r]})$$
with $P_{-B}=\prod_{l\notin B}P_l$. Since the decomposition is orthogonal, we further have
$$\|g^{[r]}\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\|g^{[r]}_A\|^2_{\mathcal{F}}\quad\text{and}\quad\|g^{[r]}_A\|^2_{\mathcal{F}}=\sum_{B\subseteq A}(-1)^{|A|-|B|}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}.$$
It thus remains to compute $\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}$. We first write the projection:
$$\begin{aligned}
P_{-B}(g^{[r]})&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\int_{\mathcal{X}_{-B}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,p_{\mathbf{X}_{-B}}(\mathbf{x}_{-B})\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\,d\mathbf{x}_{-B}\\
&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\Bigl(\prod_{l\notin B}\int_{\mathcal{X}_l}\bigl(1+k_l(x_l,x'_l)\bigr)p_{X_l}(x_l)\,dx_l\Bigr)\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\\
&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\\
&=\int_{\mathcal{X}_B}\int_{\mathcal{Y}}\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\Bigl(\int_{\mathcal{X}_{-B}}[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'_{-B}\Bigr)d\mathbf{x}'_B\,dy'\\
&=\int_{\mathcal{X}_B}\int_{\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,\varphi_r(y')\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}'_B\,dy'
\end{aligned}$$
and its norm then equals
$$\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}=\int_{\mathcal{X}_B\times\mathcal{X}_B}\int_{\mathcal{Y}\times\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,\varphi_r(y)\varphi_r(y')\,[p_{\mathbf{X}_BY}(\mathbf{x}_B,y)-p_{\mathbf{X}_B}(\mathbf{x}_B)p_Y(y)]\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}_B\,d\mathbf{x}'_B\,dy\,dy'.$$
Finally, we have
$$\mathrm{HSIC}(\mathbf{X},Y)=\sum_{r=1}^{\infty}\|g^{[r]}\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\sum_{r=1}^{\infty}\|g^{[r]}_A\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sum_{r=1}^{\infty}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}},$$
and, summing the previous expression of $\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}$ over $r$ and using Mercer's representation $\sum_{r}\varphi_r(y)\varphi_r(y')=k_Y(y,y')$ once more,
$$\sum_{r=1}^{\infty}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}=\int_{\mathcal{X}_B\times\mathcal{X}_B}\int_{\mathcal{Y}\times\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,k_Y(y,y')\,[p_{\mathbf{X}_BY}(\mathbf{x}_B,y)-p_{\mathbf{X}_B}(\mathbf{x}_B)p_Y(y)]\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}_B\,d\mathbf{x}'_B\,dy\,dy'=\mathrm{HSIC}(\mathbf{X}_B,Y),$$
which concludes the proof.

A.4 Proof of Proposition 2
Proof.
We begin with the integral formulation of HSIC as in Eq. (22) and plug in the kernel defined in Eq. (13):
$$\begin{aligned}
\mathrm{HSIC}(\mathbf{X}_A,Y)&=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}k_A(\mathbf{x}_A,\mathbf{x}'_A)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}'_A,y')-p_{\mathbf{X}_A}(\mathbf{x}'_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{x}'_A\,dy\,dy'\\
&=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A)}\sqrt{p_{\mathbf{X}_A}(\mathbf{x}'_A)}}\prod_{l\in A}\frac{1}{h}K\Bigl(\frac{x_l-x'_l}{h}\Bigr)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}'_A,y')-p_{\mathbf{X}_A}(\mathbf{x}'_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{x}'_A\,dy\,dy'.
\end{aligned}$$
We then use the change of variables $u_l=(x_l-x'_l)/h$, which leads to
$$\mathrm{HSIC}(\mathbf{X}_A,Y)=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A)}\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A-h\mathbf{u}_A)}}\prod_{l\in A}K(u_l)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A-h\mathbf{u}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A-h\mathbf{u}_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{u}_A\,dy\,dy'.$$
Letting $h\to 0$, this converges to
$$\begin{aligned}
&\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{p_{\mathbf{X}_A}(\mathbf{x}_A)}\prod_{l\in A}K(u_l)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{u}_A\,dy\,dy'\\
&\quad=\int_{\mathcal{X}_A}\prod_{l\in A}K(u_l)\,d\mathbf{u}_A\int_{\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{p_{\mathbf{X}_A}(\mathbf{x}_A)}\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y')]\,d\mathbf{x}_A\,dy\,dy'\\
&\quad=\int_{\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}k_Y(y,y')\,[p_{Y|\mathbf{X}_A=\mathbf{x}_A}(y)-p_Y(y)]\,[p_{Y|\mathbf{X}_A=\mathbf{x}_A}(y')-p_Y(y')]\,p_{\mathbf{X}_A}(\mathbf{x}_A)\,d\mathbf{x}_A\,dy\,dy'
\end{aligned}$$
where we have used $\int_u K(u)\,du=1$. Proposition 2 then follows by noting that the last equation equals $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ thanks to the integral formulation of the MMD.

A.5 Proof of Proposition 3
Proof.
Assuming Mercer's theorem holds, we have $k(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y')$ and
$$\mathbb{E}(\chi_n)=\sum_{r=1}^{\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\bigl[\varphi_r(y^{(i)})\,\varphi_r(y^{(\sigma_n(i))})\bigr]\qquad(26)$$
$$=\sum_{r=1}^{\infty}\mathbb{E}\bigl[\varphi_r(y^{(1)})\,\varphi_r(y^{(\sigma_n(1))})\bigr]\;\longrightarrow\;\sum_{r=1}^{\infty}\mathbb{E}\bigl[\mathbb{E}[\varphi_r(Y)|V]\,\mathbb{E}[\varphi_r(Y)|V]\bigr]\qquad(27)$$
$$=\sum_{r=1}^{\infty}\mathbb{E}\bigl[\mathbb{E}[\varphi_r(Y)|V]^2\bigr]=\mathbb{E}_V\,\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|V}}\,k_Y(\xi,\xi')\qquad(28)$$
where (26) is obtained by the absolute convergence of Mercer's series, (27) by applying Eq. (34) in the proof of Proposition 3.2 from Gamboa et al. (2020) to $f=g=\varphi_r$ (which is bounded since $k$ is bounded) together with the absolute convergence of Mercer's series, and (28) with Eq. (9). The McDiarmid concentration inequality given in Theorem A.1 in the proof of Proposition 3.2 from Gamboa et al. (2020) is unchanged and concludes the proof.

A.6 Proof of Lemma 3

Proof.
We follow closely the proof of Theorem 1 from Song et al. (2016). We only need to prove that for a subset $A\subseteq\mathcal{P}_d$ such that $l\notin A$,
$$\mathrm{val}(A\cup\{l\})-\mathrm{val}(A)=\mathrm{val}'(B\cup\{l\})-\mathrm{val}'(B)\qquad(29)$$
where $B=\mathcal{P}_d\setminus(A\cup\{l\})$. We first need the generalized law of total variance for the MMD from Proposition 1:
$$\mathrm{MMD}^2_{tot}=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr).$$
Now we prove the equality (29) for val and val$'$ defined in Lemma 3, except that here we work without the denominator $\mathrm{MMD}^2_{tot}$ for better readability (this does not change the proof since this same constant appears in both value functions):
$$\begin{aligned}
\mathrm{val}(A\cup\{l\})-\mathrm{val}(A)&=\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}})\bigr)-\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\Bigl\{\mathrm{MMD}^2_{tot}-\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi')\Bigr]\Bigr\}-\Bigl\{\mathrm{MMD}^2_{tot}-\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]\Bigr\}\\
&=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]-\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi')\Bigr]\\
&=\mathbb{E}_{\mathbf{X}_{-(B\cup\{l\})}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-(B\cup\{l\})}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-(B\cup\{l\})}}}k_Y(\xi,\xi')\Bigr]-\mathbb{E}_{\mathbf{X}_{-B}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-B}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-B}}}k_Y(\xi,\xi')\Bigr]\\
&=\mathrm{val}'(B\cup\{l\})-\mathrm{val}'(B)
\end{aligned}$$
where we have used the generalized law of total variance for the second equality. The rest of the proof is identical to the one of Theorem 1 from Song et al. (2016).

References
Antoniadis, A. (1984), 'Analysis of variance on function spaces', Math. Operationsforsch. u. Statist., Ser. Statist., 59–71.
Aubin, J. P. (2000), Applied Functional Analysis, 2nd edn, New York: Wiley-Interscience.
Azadkia, M. and Chatterjee, S. (2019), 'A simple measure of conditional dependence', arXiv preprint arXiv:1910.12327.
Bachoc, F., Gamboa, F., Loubes, J.-M. and Venet, N. (2017), 'A gaussian process regression model for distribution inputs', IEEE Transactions on Information Theory (10), 6620–6637.
Baucells, M. and Borgonovo, E. (2013), 'Invariant probabilistic sensitivity analysis', to appear in Management Science.
Borgonovo, E. (2007), 'A new uncertainty importance measure', Reliability Engineering & System Safety (6), 771–784.
Broto, B., Bachoc, F. and Depecker, M. (2020), 'Variance reduction for estimation of shapley effects and adaptation to unknown input distribution', SIAM/ASA Journal on Uncertainty Quantification (2), 693–716.
Castro, J., Gómez, D. and Tejada, J. (2009), 'Polynomial calculation of the shapley value based on sampling', Computers & Operations Research (5), 1726–1730.
Charpentier, A., Elie, R., Laurière, M. and Tran, V. C. (2020), 'Covid-19 pandemic control: balancing detection policy and lockdown intervention under icu sustainability', arXiv preprint arXiv:2005.06526.
Chastaing, G., Gamboa, F., Prieur, C. et al. (2012), 'Generalized hoeffding-sobol decomposition for dependent variables - application to sensitivity analysis', Electronic Journal of Statistics, 2420–2448.
Chatterjee, S. (2020), 'A new coefficient of correlation', Journal of the American Statistical Association, pp. 1–21.
Chwialkowski, K., Strathmann, H. and Gretton, A. (2016), A kernel test of goodness of fit, in 'ICML'.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009), 'Modeling wine preferences by data mining from physicochemical properties', Decision Support Systems (4), 547–553.
Cuturi, M. (2011), Fast global alignment kernels, in 'Proceedings of the 28th International Conference on Machine Learning (ICML-11)', pp. 929–936.
Czado, C. (2019), 'Analyzing dependent data with vine copulas', Lecture Notes in Statistics, Springer.
Da Veiga, S. (2015), 'Global sensitivity analysis with dependence measures', Journal of Statistical Computation and Simulation (7), 1283–1305.
Da Veiga, S. (2016), New perspectives for sensitivity analysis, in 'Proceedings of Mascot-Num 2016 conference', Toulouse, France. https://mascot2016.sciencesconf.org/resource/page/id/2.
Da Veiga, S. and Gamboa, F. (2013), 'Efficient estimation of sensitivity indices', Journal of Nonparametric Statistics (3), 573–595.
Da Veiga, S., Wahl, F. and Gamboa, F. (2009), 'Local polynomial estimation for sensitivity analysis on models with correlated inputs', Technometrics (4), 452–463.
Di Domenico, L., Pullano, G., Sabbatini, C. E., Boëlle, P.-Y. and Colizza, V. (2020), 'Expected impact of lockdown in Île-de-France and possible exit strategies', medRxiv.
Ditlevsen, O. and Madsen, H., eds (1996), Structural reliability methods, Wiley & Sons.
Dua, D. and Graff, C. (2017), 'UCI machine learning repository'.
Durrande, N., Ginsbourger, D., Roustant, O. and Carraro, L. (2012), 'Anova kernels and rkhs of zero mean functions for model-based sensitivity analysis', Journal of Multivariate Analysis, 57–67.
Ferraty, F. and Vieu, P. (2006), Nonparametric functional data analysis: theory and practice, Springer.
Fort, J.-C., Klein, T. and Rachdi, N. (2016), 'New sensitivity analysis subordinated to a contrast', Communications in Statistics - Theory and Methods (15), 4349–4364.
Gamboa, F., Gremaud, P., Klein, T. and Lagnoux, A. (2020), 'Global sensitivity analysis: a new generation of mighty estimators based on rank statistics', hal-02474902v3.
Gamboa, F., Janon, A., Klein, T. and Lagnoux, A. (2013), 'Sensitivity indices for multivariate outputs', Comptes Rendus Mathematique (7-8), 307–310.
Gärtner, T., Flach, P. and Wrobel, S. (2003), On graph kernels: Hardness results and efficient alternatives, in 'Learning theory and kernel machines', Springer, pp. 129–143.
Gorham, J. and Mackey, L. (2015), 'Measuring sample quality with Stein's method', Advances in Neural Information Processing Systems, 226–234.
Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005a), Measuring statistical dependence with Hilbert-Schmidt norms, in S. Jain, H. Simon and E. Tomita, eds, 'Algorithmic Learning Theory', Vol. 3734 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 63–77.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B. and Smola, A. J. (2008), A kernel statistical test of independence, in 'Advances in Neural Information Processing Systems', pp. 585–592.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005b), 'Kernel methods for measuring independence', The Journal of Machine Learning Research, 2075–2129.
Harchaoui, Z. and Bach, F. (2007), Image classification with segmentation graph kernels, in '2007 IEEE Conference on Computer Vision and Pattern Recognition', IEEE, pp. 1–8.
Hoeffding, W. (1948), 'A class of statistics with asymptotically normal distributions', Annals of Mathematical Statistics, 293–325.
Homma, T. and Saltelli, A. (1996), 'Importance measures in global sensitivity analysis of nonlinear models', Reliability Engineering & System Safety (1), 1–17.
Iooss, B. and Prieur, C. (2019), 'Shapley effects for sensitivity analysis with dependent inputs: comparisons with Sobol' indices, numerical estimation and applications', International Journal for Uncertainty Quantification, 493–514.
Ishigami, T. and Homma, T. (1990), An importance quantification technique in uncertainty analysis for computer models, in '[1990] Proceedings. First International Symposium on Uncertainty Modeling and Analysis', IEEE, pp. 398–403.
Janon, A., Klein, T., Lagnoux, A., Nodet, M. and Prieur, C. (2014), 'Asymptotic normality and efficiency of two sobol index estimators', ESAIM: Probability and Statistics, 342–364.
Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K. and Gretton, A. (2017), A linear-time kernel goodness-of-fit test, in 'Advances in Neural Information Processing Systems', pp. 262–271.
Kermack, W. O. and McKendrick, A. G. (1927), 'A contribution to the mathematical theory of epidemics', Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character (772), 700–721.
Kuo, F., Sloan, I., Wasilkowski, G. and Woźniakowski, H. (2010), 'On decompositions of multivariate functions', Mathematics of Computation (270), 953–966.
Lamboni, M., Monod, H. and Makowski, D. (2011), 'Multivariate sensitivity analysis to measure global contribution of input factors in dynamic models', Reliability Engineering & System Safety, 450–459.
Li, L., Lu, Z., Feng, J. and Wang, B. (2012), 'Moment-independent importance measure of basic variable and its state dependent parameter solution', Structural Safety, 40–47.
Lundberg, S. M. and Lee, S.-I. (2017), A unified approach to interpreting model predictions, in 'Advances in Neural Information Processing Systems', pp. 4765–4774.
Magal, P. and Webb, G. (2020), 'Predicting the number of reported and unreported cases for the covid-19 epidemic in South Korea, Italy, France and Germany', medRxiv.
Mara, T. A., Tarantola, S. and Annoni, P. (2015), 'Non-parametric methods for global sensitivity analysis of model output with dependent inputs', Environmental Modelling & Software, 173–183.
Marrel, A. and Chabridon, V. (2020), 'Statistical developments for target and conditional sensitivity analysis: application on safety studies for nuclear reactor', hal-02541142.
Marrel, A., Iooss, B., Van Dorpe, F. and Volkova, E. (2008), 'An efficient methodology for modeling complex computer codes with Gaussian processes', Computational Statistics and Data Analysis, 4731–4744.
Maume-Deschamps, V. and Niang, I. (2018), 'Estimation of quantile oriented sensitivity indices', Statistics & Probability Letters, 122–127.
Moutoussamy, V., Nanty, S. and Pauwels, B. (2015), 'Emulators for stochastic simulation codes', ESAIM: Proceedings and Surveys, 116–155.
Muandet, K., Fukumizu, K., Dinuzzo, F. and Schölkopf, B. (2012), 'Learning from distributions via support measure machines', Advances in Neural Information Processing Systems, 10–18.
Oates, C. J., Girolami, M. and Chopin, N. (2017), 'Control functionals for Monte Carlo integration', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (3), 695–718.
Owen, A. (2014), 'Sobol' indices and Shapley value', SIAM/ASA Journal on Uncertainty Quantification, 245–251.
Perrin, G. and Defaux, G. (2019), 'Efficient estimation of reliability-oriented sensitivity indices', Journal of Scientific Computing (3).
Plischke, E., Rabitti, G. and Borgonovo, E. (2020), 'Computing shapley effects for sensitivity analysis', arXiv preprint arXiv:2002.12024.
Rahman, S. (2016), 'The f-sensitivity index', SIAM/ASA Journal on Uncertainty Quantification (1), 130–162.
Ramon, J. and Gärtner, T. (2003), Expressivity versus efficiency of graph kernels, in 'Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences', pp. 65–74.
Sakoe, H. and Chiba, S. (1978), 'Dynamic programming algorithm optimization for spoken word recognition', IEEE Transactions on Acoustics, Speech, and Signal Processing (1), 43–49.
Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. and Tarantola, S. (2010), 'Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index', Computer Physics Communications (2), 259–270.
Saltelli, A., Tarantola, S. and Chan, K.-S. (1999), 'A quantitative model-independent method for global sensitivity analysis of model output', Technometrics (1), 39–56.
Shapley, L. (1953), A value for n-persons game, in H. Kuhn and A. Tucker, eds, 'Contributions to the Theory of Games II, Annals of Mathematics Studies', Princeton University Press, Princeton, NJ.
Smola, A., Gretton, A., Song, L. and Schölkopf, B. (2007), A Hilbert space embedding for distributions, in 'Algorithmic Learning Theory', Vol. 4754, Springer, pp. 13–31.
Sobol', I. (1993), 'Sensitivity estimates for non linear mathematical models', Mathematical Modelling and Computational Experiments, 407–414.
Solís, M. (2019), 'Non-parametric estimation of the first-order sobol indices with bootstrap bandwidth', Communications in Statistics - Simulation and Computation, pp. 1–16.
Song, E., Nelson, B. L. and Staum, J. (2016), 'Shapley effects for global sensitivity analysis: Theory and computation', SIAM/ASA Journal on Uncertainty Quantification (1), 1060–1083.
Song, L. (2008), Learning via Hilbert Space Embedding of Distributions, PhD thesis, University of Sydney.
Song, L., Smola, A., Gretton, A., Bedo, J. and Borgwardt, K. (2012), 'Feature selection via dependence maximization', The Journal of Machine Learning Research, 1393–1434.
Song, L., Smola, A., Gretton, A., Borgwardt, K. M. and Bedo, J. (2007), Supervised feature selection via dependence estimation, in 'Proceedings of the 24th International Conference on Machine Learning', pp. 823–830.
Spagnol, A. (2020), Kernel-based sensitivity indices for high-dimensional optimization problems, PhD thesis, University Lyon, France.
Spagnol, A., Le Riche, R. and Da Veiga, S. (2019), 'Global sensitivity analysis for optimization with variable selection', SIAM/ASA Journal on Uncertainty Quantification (2), 417–443.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. and Schölkopf, B. (2009), Kernel choice and classifiability for RKHS embeddings of probability distributions, in 'Advances in Neural Information Processing Systems', pp. 1750–1758.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B. and Lanckriet, G. R. (2010), 'Hilbert space embeddings and metrics on probability measures', The Journal of Machine Learning Research, 1517–1561.
Stein, C. et al. (1972), A bound for the error in the normal approximation to the distribution of a sum of dependent random variables, in 'Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory', The Regents of the University of California.
Steinwart, I., Hush, D. and Scovel, C. (2006), 'An explicit description of the reproducing kernel Hilbert spaces of gaussian rbf kernels', IEEE Transactions on Information Theory (10), 4635–4643.
Szabó, Z., Sriperumbudur, B. K., Póczos, B. and Gretton, A. (2016), 'Learning theory for distribution regression', The Journal of Machine Learning Research (1), 5272–5311.
Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), 'Measuring and testing dependence by correlation of distances', The Annals of Statistics (6), 2769–2794.
Terraz, T., Ribes, A., Fournier, Y., Iooss, B. and Raffin, B. (2017), Large scale in transit global sensitivity analysis avoiding intermediate files, in 'Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing)', Denver, USA.
Wahba, G., Wang, Y., Gu, C., Klein, R., Klein, B. et al. (1995), 'Smoothing spline anova for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman memorial lecture', The Annals of Statistics 23.