Kernel-based ANOVA decomposition and Shapley effects -- Application to global sensitivity analysis
Sébastien Da Veiga
Safran Tech, Modeling & Simulation
Rue des Jeunes Bois, Châteaufort, 78114 Magny-Les-Hameaux, France
Abstract
Global sensitivity analysis is the main quantitative technique for identifying the most influential input variables in a numerical simulation model. In particular when the inputs are independent, Sobol' sensitivity indices attribute a portion of the output of interest variance to each input and all possible interactions in the model, thanks to a functional ANOVA decomposition. On the other hand, moment-independent sensitivity indices focus on the impact of input variables on the whole output distribution instead of the variance only, thus providing complementary insight on the inputs/output relationship. Unfortunately they do not enjoy the nice decomposition property of Sobol' indices and are consequently harder to analyze. In this paper, we introduce two moment-independent indices based on kernel embeddings of probability distributions and show that the RKHS framework used for their definition makes it possible to exhibit a kernel-based ANOVA decomposition. This is the first time such a desirable property is proved for sensitivity indices apart from Sobol' ones. When the inputs are dependent, we also use these new sensitivity indices as building blocks to design kernel-embedding Shapley effects which generalize the traditional variance-based ones used in sensitivity analysis. Several estimation procedures are discussed and illustrated on test cases with various output types such as categorical variables and probability distributions. All these examples show their potential for enhancing traditional sensitivity analysis with a kernel point of view.
1 Introduction

In the computer experiments community, global sensitivity analysis (GSA) has now emerged as a central tool for exploring the inputs/outputs relationship of a numerical simulation model. Starting from the pioneering work of Sobol (Sobol', 1993) and Saltelli (Saltelli et al., 1999) on the interpretation and estimation of Sobol' indices, the last two decades have been a fertile ground for the development of advanced statistical methodologies and extensions of the original Sobol' indices: new estimation procedures (Da Veiga et al. (2009), Da Veiga and Gamboa (2013), Solís (2019), Gamboa et al. (2020)), multivariate outputs with aggregation (Gamboa et al., 2013) and dimensionality reduction (Lamboni et al., 2011), goal-oriented sensitivity analysis (Fort et al., 2016) or moment-independent sensitivity measures (Borgonovo (2007), Da Veiga (2015)), among others. At the heart of the popularity of Sobol' indices is the fundamental functional analysis of variance (ANOVA) decomposition, which opens the path for their interpretation as parts of the output variance and makes it possible to pull apart the input main effects and all their potential interactions, up to their whole influence measured by total Sobol' indices. This decomposition however has two drawbacks. First, it is only valid when the inputs are independent, although some generalizations were investigated (Chastaing et al., 2012). Secondly, it only concerns the original Sobol' indices, meaning that it is not possible to split the input effects with goal-oriented or moment-independent sensitivity analysis in general.

When the inputs are dependent, total Sobol' indices can still be used to discriminate them when the objective is to build a surrogate model of the system, and other Sobol'-related indices have also been proposed for interpretability (Mara et al., 2015). But the major breakthrough happened when Shapley effects were defined for GSA by Owen (Owen, 2014). Indeed, due to their mathematical foundations from game theory, Shapley effects do not require the independence assumption to enjoy nice properties: each input is assigned a Shapley effect lying between 0 and 1, while the sum of all effects is equal to 1. For a given input all interactions and correlations with other ones are mixed up, but the interpretation as parts of the output variance is kept and input rankings are still sensible. For these reasons Shapley effects are now commonly regarded as central importance measures in GSA for dealing with dependence, and their estimation has been thoroughly investigated recently (Song et al., 2016; Iooss and Prieur, 2019; Broto et al., 2020; Plischke et al., 2020).

From an interpretability perspective, other importance measures introduced in the context of goal-oriented and moment-independent sensitivity analysis have proven useful to gain additional insights on a given model. For example quantile-oriented (Fort et al., 2016; Maume-Deschamps and Niang, 2018) or reliability-based measures (Ditlevsen and Madsen, 1996) can help understand which inputs lead to the failure of the system, while optimization-related indices enable dimension reduction for optimization problems (Spagnol et al., 2019).
On the other hand, moment-independent sensitivity indices, which quantify the input impact on the whole output distribution instead of the variance only, are powerful complementary tools to grasp further types of input influence. Among them are the f-divergence indices (Da Veiga (2015), Rahman (2016)), with particular cases corresponding to the sensitivity index introduced by Borgonovo (Borgonovo, 2007), and the class of kernel-based sensitivity indices, which rely on the embedding of probability distributions in reproducing kernel Hilbert spaces (RKHS) (Da Veiga, 2015, 2016). Unfortunately an ANOVA-like decomposition is not available yet for any of these indices, even in the independent setting: as a consequence this limits the interpretation of their formulation for interactions, since without ANOVA it is not possible to remove the main effects, and at the same time the natural normalization constant (equivalent to the total output variance for Sobol' indices) is not known.

In this paper we focus on a general RKHS framework for GSA and prove that an ANOVA decomposition actually exists for two previously introduced kernel-based sensitivity indices in the independent setting. To the best of our knowledge this is the first time such a decomposition is available for sensitivity indices other than the original Sobol' ones. Not only does this make it possible to properly define higher-order indices, it further gives access to their natural normalization constant. We also demonstrate that these measures are generalizations of Sobol' indices, in the sense that they are recovered with specific kernels. But the RKHS point of view additionally comes with a large body of work on kernels specifically designed for particular target applications, such as multivariate, functional, categorical or time-series case studies, thus defining a unified framework for many real GSA test cases. When inputs are not independent, we finally introduce a kernel-based version of Shapley effects similar to the ones proposed by Owen.

The paper is organized as follows. Section 2 first briefly introduces the traditional functional ANOVA decomposition with Sobol' indices and moment-independent indices. In Section 3 we then discuss the elementary tools from RKHS theory needed to build kernel-based sensitivity indices, which are at the core of this work. We further investigate these indices and prove they also arise from an ANOVA decomposition. In addition we define Shapley effects with kernels, and the benefits of the RKHS framework for GSA are studied through several examples. Several estimation procedures are then discussed in Section 4, where we generalize some of the recent estimators for Sobol' indices. Finally, Section 5 illustrates the potential of these sensitivity indices with various numerical experiments corresponding to typical GSA applications.

Let $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ denote the numerical simulation model, which is a function of $d$ input variables $X_l \in \mathcal{X}_l$, $l = 1, \ldots, d$, and let $Y \in \mathcal{Y}$ be the model output given by $Y = \eta(X_1, \ldots, X_d)$. In standard GSA the inputs $X_l$ are further assumed to be independent with known probability distributions $\mathbb{P}_{X_l}$, meaning that the vector of inputs $X = (X_1, \ldots, X_d)$ is distributed as $\mathbb{P}_X = \mathbb{P}_{X_1} \otimes \ldots \otimes \mathbb{P}_{X_d}$. For any subset $A = \{l_1, \ldots, l_{|A|}\} \subseteq \mathcal{P}_d$ of indices taken from $\{1, \ldots, d\}$ we denote $X_A = \left(X_{l_1}, \ldots, X_{l_{|A|}}\right) \in \mathcal{X}_A = \mathcal{X}_{l_1} \times \ldots \times \mathcal{X}_{l_{|A|}}$ the vector of inputs with indices in $A$ and $X_{-A}$ the complementary vector with indices not in $A$.
In this setting, the main objective of global sensitivity analysis is to quantify the impact of any group of input variables $X_A$ on the model output $Y$. In this section we first recall the functional ANOVA decomposition and the definition of Sobol' indices, which fall into the category of variance-based indices. Sensitivity indices that account for the whole output distribution, referred to as moment-independent indices, are then discussed. Note that in the following, we adopt the notation $S$ for a properly normalized sensitivity index, while $\underline{S}$ will stand for an unnormalized index, where normalization is to be understood as an end result from an ANOVA-like decomposition.

Here we first assume that $Y \in \mathcal{Y} \subset \mathbb{R}$ is a square-integrable scalar output. If the inputs are independent, the function $\eta$ can then be decomposed according to the ANOVA decomposition:

Theorem 1 (ANOVA decomposition (Hoeffding, 1948; Antoniadis, 1984)). Assume that $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ is a square-integrable function of $d$ independent random variables $X_1, \ldots, X_d$. Then $\eta$ admits a decomposition
$$Y = \eta(X_1, \ldots, X_d) = \sum_{A \subseteq \mathcal{P}_d} \eta_A(X_A),$$
with $\eta_A$ depending only on the variables $X_A$ and satisfying
(a) $\eta_\emptyset = \mathbb{E}(Y)$,
(b) $\mathbb{E}_{X_l}(\eta_A(X_A)) = 0$ if $l \in A$,
(c) $\eta_A(X_A) = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}(Y \mid X_B)$.
Furthermore, (b) implies that all the terms $\eta_A$ in the decomposition are mutually orthogonal. As a consequence, the output variance can be decomposed as
$$\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} \operatorname{Var} \eta_A(X_A) = \sum_{A \subseteq \mathcal{P}_d} V_A \quad (1)$$
where
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y \mid X_B). \quad (2)$$

When this decomposition holds, it is then straightforward to quantify the influence of any subset of inputs $X_A$ on the output variance by normalizing each term with $\operatorname{Var} Y$.

Definition 1 (Sobol' indices (Sobol', 1993)). Under the same assumptions as Theorem 1, the Sobol' sensitivity index associated to a subset $A$ of input variables is defined as
$$S_A = \frac{V_A}{\operatorname{Var} Y}, \quad (3)$$
while the total Sobol' index associated to $A$ is
$$S^T_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S_B. \quad (4)$$

In particular, the first-order Sobol' index of an input $X_l$ writes
$$S_l = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l)}{\operatorname{Var} Y}$$
and its total Sobol' index is given by
$$S^T_l = \sum_{B \subseteq \mathcal{P}_d,\, l \in B} S_B = 1 - \frac{\operatorname{Var} \mathbb{E}(Y \mid X_{-l})}{\operatorname{Var} Y}.$$
Finally, the ANOVA decomposition (1) readily provides an interpretation of Sobol' indices as a percentage of explained output variance, i.e.
$$\sum_{A \subseteq \mathcal{P}_d} S_A = 1. \quad (5)$$
With these definitions, the impact of each input variable can be quantitatively assessed: the first-order Sobol' index measures the main effect of an input, while the total Sobol' index aggregates all its potential interactions with other inputs. As an illustration, an input variable with a low total Sobol' index is thus unimportant and one can freeze it at a default value. When for a given input both first-order and total Sobol' indices are close, this means that this input does not have interactions, while a large gap indicates strong interactions in the model. Furthermore, due to the summation property (5), the interpretation of Sobol' indices as shares of the output variance is an efficient tool for practitioners who aim at understanding precisely the impact and interactions of the inputs of a model on the output. For example the interaction of two inputs $X_l$ and $X_{l'}$ writes
$$S_{ll'} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'}) - \operatorname{Var} \mathbb{E}(Y \mid X_l) - \operatorname{Var} \mathbb{E}(Y \mid X_{l'})}{\operatorname{Var} Y} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'})}{\operatorname{Var} Y} - S_l - S_{l'}. \quad (6)$$
Note that to compute this interaction one subtracts the first-order indices $S_l$ and $S_{l'}$ from the sensitivity index of the subset $(X_l, X_{l'})$ in order to remove the main effects and highlight the interaction only.

2.2 Moment-independent sensitivity indices

Despite their appealing properties, Sobol' indices rank the input variables according to their impact on the output variance only. In a parallel line of work, several authors proposed to investigate instead how inputs influence the whole output distribution, thus introducing a different insight on the inputs/outputs relationship. The starting point (Baucells and Borgonovo (2013), Da Veiga (2015)) is to consider that a given input $X_l$ is important in the model if the probability distribution $\mathbb{P}_Y$ of the output changes when $X_l$ is fixed, i.e. if the conditional probability distribution $\mathbb{P}_{Y|X_l}$ is different from $\mathbb{P}_Y$. More precisely, if $d(\cdot,\cdot)$ denotes a dissimilarity measure between probability distributions, one can define a sensitivity index for variable $X_l$ given by
$$\underline{S}_l = \mathbb{E}_{X_l}\left(d(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right). \quad (7)$$
Such a general formulation is flexible, in the sense that many choices for $d(\cdot,\cdot)$ are available. As an illustration, it is straightforward to show that the unnormalized first-order Sobol' index is retrieved with the naive dissimilarity measure $d(\mathbb{P}, \mathbb{Q}) = \left(\mathbb{E}_{\xi \sim \mathbb{P}}(\xi) - \mathbb{E}_{\xi \sim \mathbb{Q}}(\xi)\right)^2$, which compares probability distributions only through their means. A large class of dissimilarity measures is also given by the so-called f-divergence family: assuming that $(X_l, Y)$ has an absolutely continuous distribution with respect to the Lebesgue measure on $\mathbb{R}^2$, the f-divergence between $\mathbb{P}_Y$ and $\mathbb{P}_{Y|X_l}$ is
$$d_f(\mathbb{P}_Y, \mathbb{P}_{Y|X_l}) = \int f\left(\frac{p_Y(y)}{p_{Y|X_l}(y)}\right) p_{Y|X_l}(y)\, dy$$
where $f$ is a convex function such that $f(1) = 0$, and $p_Y$ and $p_{Y|X_l}$ are the probability distribution functions of $Y$ and $Y|X_l$, respectively. The corresponding sensitivity index is then
$$\underline{S}^f_l = \int f\left(\frac{p_Y(y)\, p_{X_l}(x)}{p_{X_l,Y}(x,y)}\right) p_{X_l,Y}(x,y)\, dx\, dy$$
with $p_{X_l}$ and $p_{X_l,Y}$ the probability distribution functions of $X_l$ and $(X_l, Y)$, respectively. This index has been studied for example in Da Veiga (2015) and Rahman (2016).
A notable special case is obtained with the total-variation distance corresponding to $f(t) = |t - 1|$, leading to the sensitivity index proposed by Borgonovo (Borgonovo, 2007):
$$\underline{S}^{TV}_l = \int |p_Y(y)\, p_{X_l}(x) - p_{X_l,Y}(x,y)|\, dx\, dy.$$
Obviously, definition (7) can be easily extended to measure the influence of any subset of inputs: $\underline{S}_A = \mathbb{E}_{X_A}\left(d(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$. But in this case, since there is no ANOVA-like decomposition, there is no longer the guarantee that an interaction index defined following (6),
$$\underline{S}^{TV}_{ll'} = \int |p_Y(y)\, p_{X_l}(x)\, p_{X_{l'}}(x') - p_{X_l,X_{l'},Y}(x,x',y)|\, dx\, dx'\, dy - \underline{S}^{TV}_l - \underline{S}^{TV}_{l'},$$
really measures the pure interaction between $X_l$ and $X_{l'}$. Therefore the interpretation of higher-order moment-independent sensitivity indices is cumbersome. On the other hand, even if normalization constants have been proposed through general inequalities on f-divergences (Borgonovo (2007), Rahman (2016)), the lack of an ANOVA decomposition once again impedes the definition of a natural normalization constant equivalent to the output variance for Sobol' indices.

Recently, new moment-independent indices built upon the framework of RKHS embedding of probability distributions have also been investigated (Da Veiga, 2015, 2016). Though originally introduced as an alternative to reduce the curse of dimensionality and make the most of the vast kernel literature, we will see in what follows that they actually exhibit an ANOVA-like decomposition and can therefore be seen as a general kernelized version of Sobol' indices.

Before introducing the kernel-based sensitivity indices, we first review some elements of the RKHS embedding of probability distributions (Smola et al., 2007), which will serve as a building block for their definition.
We first introduce an RKHS $\mathcal{H}$ of functions $\mathcal{X} \to \mathbb{R}$ with kernel $k_\mathcal{X}$ and dot product $\langle \cdot, \cdot \rangle_\mathcal{H}$. The kernel mean embedding $\mu_\mathbb{P} \in \mathcal{H}$ of a probability distribution $\mathbb{P}$ on $\mathcal{X}$ is defined as
$$\mu_\mathbb{P} = \mathbb{E}_{\xi \sim \mathbb{P}}\, k_\mathcal{X}(\xi, \cdot) = \int_\mathcal{X} k_\mathcal{X}(\xi, \cdot)\, d\mathbb{P}(\xi)$$
if $\mathbb{E}_{\xi \sim \mathbb{P}}\, k_\mathcal{X}(\xi, \xi) < \infty$, see Smola et al. (2007). The representation $\mu_\mathbb{P}$ is appealing because, if the kernel $k_\mathcal{X}$ is characteristic, the map $\mathbb{P} \to \mu_\mathbb{P}$ is injective (Sriperumbudur et al., 2009, 2010). Consequently, the kernel mean embedding can be used in lieu of the probability distribution for several comparisons and manipulations of probability measures, but using only inner products or distances in the RKHS. For example, a distance between two probability measures $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathcal{X}$ can simply be obtained by computing the distance between their representations in $\mathcal{H}$, i.e.
$$\mathrm{MMD}(\mathbb{P}_1, \mathbb{P}_2) = \|\mu_{\mathbb{P}_1} - \mu_{\mathbb{P}_2}\|_\mathcal{H},$$
which is a distance if the kernel $k_\mathcal{X}$ is characteristic (Sriperumbudur et al., 2009, 2010). This distance is called the maximum mean discrepancy (MMD) and it has been recently used in many applications (Muandet et al., 2012; Szabó et al., 2016). Indeed, using the reproducing property of an RKHS one may show (Song, 2008) that
$$\mathrm{MMD}^2(\mathbb{P}_1, \mathbb{P}_2) = \mathbb{E}_{\xi,\xi'}\, k_\mathcal{X}(\xi, \xi') - 2\, \mathbb{E}_{\xi,\zeta}\, k_\mathcal{X}(\xi, \zeta) + \mathbb{E}_{\zeta,\zeta'}\, k_\mathcal{X}(\zeta, \zeta')$$
where $\xi, \xi' \sim \mathbb{P}_1$ and $\zeta, \zeta' \sim \mathbb{P}_2$ with $\xi, \xi', \zeta, \zeta'$ independent, this notation being used throughout the rest of the paper. This means that the MMD can be computed with expectations of kernels only, unlike other distances between probability distributions which typically require density estimation.

Another significant application of kernel embeddings concerns the problem of measuring the dependence between random variables. Given a pair of random vectors $(U, V) \in \mathcal{X} \times \mathcal{Y}$ with probability distribution $\mathbb{P}_{UV}$, we define the product RKHS $\mathcal{H} = \mathcal{F} \times \mathcal{G}$ with kernel $k_\mathcal{H}((u,v),(u',v')) = k_\mathcal{X}(u,u')\, k_\mathcal{Y}(v,v')$. A measure of the dependence between $U$ and $V$ can then be defined as the distance between the mean embedding of $\mathbb{P}_{UV}$ and that of $\mathbb{P}_U \otimes \mathbb{P}_V$, the joint distribution with independent marginals $\mathbb{P}_U$ and $\mathbb{P}_V$:
$$\mathrm{MMD}^2(\mathbb{P}_{UV}, \mathbb{P}_U \otimes \mathbb{P}_V) = \|\mu_{\mathbb{P}_{UV}} - \mu_{\mathbb{P}_U} \otimes \mu_{\mathbb{P}_V}\|^2_\mathcal{H}.$$
This quantity is known as the Hilbert-Schmidt independence criterion (HSIC, see Gretton et al. (2005a,b)) and can be expanded as
$$\mathrm{HSIC}(U, V) = \mathrm{MMD}^2(\mathbb{P}_{UV}, \mathbb{P}_U \otimes \mathbb{P}_V) = \mathbb{E}_{U,U',V,V'}\, k_\mathcal{X}(U,U')\, k_\mathcal{Y}(V,V') + \mathbb{E}_{U,U'}\, k_\mathcal{X}(U,U')\, \mathbb{E}_{V,V'}\, k_\mathcal{Y}(V,V') - 2\, \mathbb{E}_{U,V}\left[\mathbb{E}_{U'}\, k_\mathcal{X}(U,U')\, \mathbb{E}_{V'}\, k_\mathcal{Y}(V,V')\right] \quad (8)$$
where $(U', V')$ is an independent copy of $(U, V)$. Once again, the reproducing property implies that HSIC can be expressed with expectations of kernels, which facilitates its estimation when compared to other dependence measures such as the mutual information.

The RKHS framework introduced above can readily be used to define kernel-based sensitivity indices. The first approach relies on the MMD, while the second one builds upon HSIC. We discuss them below and show that in particular both of them admit an ANOVA-like decomposition.
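To fix ideas, here is a minimal numpy sketch of how the two quantities above can be computed from finite samples with a Gaussian kernel: a plug-in (V-statistic) estimate of $\mathrm{MMD}^2$ between two samples and the standard biased HSIC estimate from a paired sample. It is only an illustration of the kernel-expectation formulas; the function names, bandwidths and toy data are our own choices, not part of the paper.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel exp(-||a-b||^2 / (2*bandwidth^2))."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2(x, z, bandwidth=1.0):
    """Plug-in (V-statistic) estimate of MMD^2 between samples x ~ P1 and z ~ P2."""
    return (gaussian_gram(x, x, bandwidth).mean()
            + gaussian_gram(z, z, bandwidth).mean()
            - 2 * gaussian_gram(x, z, bandwidth).mean())

def hsic(u, v, bw_u=1.0, bw_v=1.0):
    """Biased (V-statistic) estimate of HSIC(U, V) from paired samples."""
    n = u.shape[0]
    ku, kv = gaussian_gram(u, u, bw_u), gaussian_gram(v, v, bw_v)
    h = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(ku @ h @ kv @ h) / n**2

rng = np.random.default_rng(0)
u = rng.normal(size=(500, 1))
v = u**2 + 0.1 * rng.normal(size=(500, 1))     # dependent pair
print(mmd2(u, v), hsic(u, v))
```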
The first natural idea is to come back to the general formulation (7) for moment-independent indices and use the MMD as the dissimilarity measure to compare $\mathbb{P}_Y$ and $\mathbb{P}_{Y|X_l}$, as proposed in Da Veiga (2016):
$$\underline{S}^{\mathrm{MMD}}_l = \mathbb{E}_{X_l}\, \mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l}) = \mathbb{E}_{X_l} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi') - 2\, \mathbb{E}_{X_l} \mathbb{E}_{\xi \sim \mathbb{P}_Y,\, \zeta \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\xi,\zeta) + \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\zeta,\zeta') = \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_l}} k_\mathcal{Y}(\zeta,\zeta') - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi'),$$
where we have defined an RKHS $\mathcal{G}$ of functions $\mathcal{Y} \to \mathbb{R}$ with kernel $k_\mathcal{Y}$. More generally, we can also consider the unnormalized MMD-based sensitivity index for a group of variables $X_A$ given by $\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\zeta,\zeta') - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_Y} k_\mathcal{Y}(\xi,\xi')$, provided the following assumption holds:

Assumption 1. $\forall A \subseteq \mathcal{P}_d$ and $\forall x_A \in \mathcal{X}_A$, $\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A = x_A}}\, k_\mathcal{Y}(\xi,\xi) < \infty$, with the convention $\mathbb{P}_{Y|X_A} = \mathbb{P}_Y$ if $A = \emptyset$.

First note that if we focus on the scalar output case $\mathcal{Y} \subset \mathbb{R}$ with the linear kernel $k_\mathcal{Y}(y,y') = yy'$, we have
$$\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathbb{E}_{\xi \sim \mathbb{P}_Y}(\xi) - \mathbb{E}_{\zeta \sim \mathbb{P}_{Y|X_A}}(\zeta)\right)^2 = \mathbb{E}_{X_A}\left(\mathbb{E} Y - \mathbb{E}(Y|X_A)\right)^2 = \operatorname{Var} \mathbb{E}(Y|X_A),$$
that is, we recover the unnormalized Sobol' index for $X_A$. $\underline{S}^{\mathrm{MMD}}_A$ can thus be seen as a kernelized version of Sobol' indices since the latter can be retrieved with a specific kernel. However, it is obvious that since the linear kernel is not characteristic, the MMD in this case is not a distance, which means that $\underline{S}^{\mathrm{MMD}}_A$ is no longer a moment-independent index.

To make another connection with Sobol' indices, we now recall Mercer's theorem, a notable representation theorem for kernels.

Theorem 2 (Mercer, see Aubin (2000)). Suppose $k_\mathcal{Y}$ is a continuous symmetric positive definite kernel on a compact set $\mathcal{Y}$ and consider the integral operator $T_{k_\mathcal{Y}} : L^2(\mathcal{Y}) \to L^2(\mathcal{Y})$ defined by
$$\left(T_{k_\mathcal{Y}} f\right)(x) = \int_\mathcal{Y} k_\mathcal{Y}(y, u)\, f(u)\, du.$$
Then there is an orthonormal basis $\{e_r\}$ of $L^2(\mathcal{Y})$ consisting of eigenfunctions of $T_{k_\mathcal{Y}}$ such that the corresponding sequence of eigenvalues $\{\lambda_r\}$ are non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $\mathcal{Y}$ and $k_\mathcal{Y}$ has the following representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \lambda_r\, e_r(y)\, e_r(y')$$
where the convergence is absolute and uniform.

Assume now that the output $Y \in \mathcal{Y}$ with $\mathcal{Y}$ a compact set, meaning that Mercer's theorem holds. Then $k_\mathcal{Y}$ admits a representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \phi_r(y)\, \phi_r(y')$$
where the $\phi_r(y) = \sqrt{\lambda_r}\, e_r(y)$ are orthogonal functions in $L^2(\mathcal{Y})$. In this setting we can write
$$\underline{S}^{\mathrm{MMD}}_A = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}}\left(\sum_{r=1}^{\infty} \phi_r(\xi)\phi_r(\xi')\right) - \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_Y}\left(\sum_{r=1}^{\infty} \phi_r(\zeta)\phi_r(\zeta')\right).$$
Now, since the convergence of the series is absolute, we can interchange the expectations and the summations to get
$$\underline{S}^{\mathrm{MMD}}_A = \sum_{r=1}^{\infty}\left\{\mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}}\left(\phi_r(\xi)\phi_r(\xi')\right) - \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_Y}\left(\phi_r(\zeta)\phi_r(\zeta')\right)\right\} = \sum_{r=1}^{\infty}\left\{\mathbb{E}_{X_A} \mathbb{E}(\phi_r(Y)|X_A)^2 - \mathbb{E}(\phi_r(Y))^2\right\} = \sum_{r=1}^{\infty} \operatorname{Var} \mathbb{E}(\phi_r(Y)|X_A). \quad (9)$$
In other words, the MMD-based sensitivity index $\underline{S}^{\mathrm{MMD}}_A$ generalizes the Sobol' one in the sense that it measures the impact of the inputs not only on the conditional expectation of the output, but on a possibly infinite number of transformations $\phi_r$ of the output, given by the eigenfunctions of the kernel.

We can now state the main theorem of this section on the ANOVA-like decomposition for $\underline{S}^{\mathrm{MMD}}_A$. Recall that the variance decomposition (1) states that the variance of the output can be decomposed as $\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} V_A$ where each term is given by
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y|X_B).$$
The MMD-based equivalent is obtained with the following theorem.
Theorem 3 (ANOVA decomposition for MMD). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumption 1, denote $\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}\, k_\mathcal{Y}(Y,Y) - \mathbb{E}\, k_\mathcal{Y}(Y,Y')$ where $Y'$ is an independent copy of $Y$. Then the total MMD can be decomposed as
$$\mathrm{MMD}^2_{\mathrm{tot}} = \sum_{A \subseteq \mathcal{P}_d} \mathrm{MMD}^2_A$$
where each term is given by
$$\mathrm{MMD}^2_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}_{X_B}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_B})\right).$$

The proof is given in Appendix A.1. Theorem 3 is very similar to the ANOVA one given in (1): one can note that the total variance of the output is replaced by a generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$ defined by the kernel, and that each subset effect is obtained by removing lesser order ones in the MMD distance of the conditional distributions (instead of the variance of the conditional expectations in the ANOVA). The following corollary states that these two decompositions coincide when the kernel is chosen as the linear one.
Corollary 1.
When $Y \in \mathcal{Y} \subset \mathbb{R}$ and $k_\mathcal{Y}(y,y') = yy'$ in Theorem 3, the decomposition is identical to the decomposition (1), which means that $\mathrm{MMD}^2_{\mathrm{tot}} = \operatorname{Var} Y$ and
$$\forall B \in \mathcal{P}_d, \quad \mathbb{E}_{X_B}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_B})\right) = \operatorname{Var} \mathbb{E}(Y|X_B).$$
It further implies
$$\forall A \subseteq \mathcal{P}_d, \quad \mathrm{MMD}^2_A = V_A.$$
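As a quick numerical sanity check of Corollary 1, the sketch below uses a toy model of our own, $Y = X_1 + X_2^2$ with independent standard normal inputs, for which $\operatorname{Var}\mathbb{E}(Y|X_1) = 1$. It estimates $\mathbb{E}_{X_1}\big(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_1})\big)$ by brute-force double-loop Monte Carlo with the linear kernel; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = lambda x1, x2: x1 + x2**2            # toy model, Var E(Y|X1) = Var(X1) = 1

def mmd2_linear(y1, y2):
    """MMD^2 with the linear kernel k(y, y') = y y' reduces to (mean(y1) - mean(y2))^2."""
    return (y1.mean() - y2.mean())**2

n_outer, n_inner = 200, 2000
y_ref = eta(rng.normal(size=n_inner), rng.normal(size=n_inner))   # sample from P_Y
acc = 0.0
for _ in range(n_outer):                   # outer loop over x1 ~ P_X1
    x1 = rng.normal()
    y_cond = eta(x1, rng.normal(size=n_inner))                    # sample from P_{Y|X1=x1}
    acc += mmd2_linear(y_ref, y_cond)
print(acc / n_outer)                       # should be close to Var E(Y|X1) = 1
```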
Thanks to Theorem 3 we can now define properly normalized MMD-based indices.

Definition 2 (MMD-based sensitivity indices). In the frame of Theorem 3, let $A \subseteq \mathcal{P}_d$. The normalized MMD-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S^{\mathrm{MMD}}_A = \frac{\mathrm{MMD}^2_A}{\mathrm{MMD}^2_{\mathrm{tot}}},$$
while the total MMD-based index associated to $A$ is
$$S^{T,\mathrm{MMD}}_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S^{\mathrm{MMD}}_B = 1 - \frac{\mathbb{E}_{X_{-A}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{-A}})\right)}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$

From Theorem 3, we have the fundamental identity providing the interpretation of MMD-based indices as percentages of the explained generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$:
$$\sum_{A \subseteq \mathcal{P}_d} S^{\mathrm{MMD}}_A = 1.$$
Finally, we exhibit a generalized law of total variance for $\mathrm{MMD}^2_{\mathrm{tot}}$ which will yield another formulation for the total MMD-based index.
Proposition 1 (Generalized law of total variance). Assuming Assumption 1 holds, we have
$$\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}_{X_A}\left[\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi')\right] + \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right).$$

The proof is to be found in Appendix A.2. This is a generalization in the sense that the total variance is replaced by $\mathrm{MMD}^2_{\mathrm{tot}}$, the conditional variance by $\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\xi,\xi')$ and the variance of the conditional expectation by $\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$. In particular, all these terms reduce to the ones in the classical law of total variance if one uses the linear kernel $k_\mathcal{Y}(y,y') = yy'$ in the scalar case. This gives the following corollary.

Corollary 2 (Other formulation of the total MMD-based index). In the frame of Theorem 3, we have for all $A \subseteq \mathcal{P}_d$
$$S^{T,\mathrm{MMD}}_A = \frac{\mathbb{E}_{X_{-A}}\left[\mathbb{E}_{\xi \sim \mathbb{P}_{Y|X_{-A}}} k_\mathcal{Y}(\xi,\xi) - \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|X_{-A}}} k_\mathcal{Y}(\xi,\xi')\right]}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$

Another approach for combining kernel embeddings with sensitivity analysis consists in directly using HSIC as a sensitivity index. For example Da Veiga (2015) considers the unnormalized index
$$\underline{S}^{\mathrm{HS}}_A = \mathrm{HSIC}(X_A, Y)$$
relying on a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_{\mathcal{H}_A}((x,y),(x',y')) = k_{\mathcal{X}_A}(x_A, x'_A)\, k_\mathcal{Y}(y,y')$, provided the following assumption holds:

Assumption 2. $\forall A \subseteq \mathcal{P}_d$, $\mathbb{E}_{\xi \sim \mathbb{P}_{X_A}}\, k_{\mathcal{X}_A}(\xi,\xi) < \infty$ and $\mathbb{E}_{\xi \sim \mathbb{P}_Y}\, k_\mathcal{Y}(\xi,\xi) < \infty$.

In Da Veiga (2015) an empirical normalization inspired by the definition of the distance correlation criterion (Székely et al., 2007) was also proposed. But similarly to the MMD decomposition above, it is actually possible to exhibit an ANOVA-like decomposition for HSIC, thus providing a natural normalization constant. The main ingredient is an assumption on the kernel $k_\mathcal{X}$ associated to the input variables.
Assumption 3. The reproducing kernel $k_\mathcal{X}$ of $\mathcal{F}$ is of the form
$$k_\mathcal{X}(x, x') = \prod_{l=1}^{d}\left(1 + k_l(x_l, x'_l)\right) \quad (10)$$
where for each $l = 1, \ldots, d$, $k_l(\cdot,\cdot)$ is the reproducing kernel of an RKHS $\mathcal{F}_l$ of real functions depending only on variable $x_l$ and such that $1 \notin \mathcal{F}_l$. In addition, for all $l = 1, \ldots, d$ and $\forall x_l \in \mathcal{X}_l$, we have
$$\int_{\mathcal{X}_l} k_l(x_l, x'_l)\, d\mathbb{P}_{X_l}(x'_l) = 0. \quad (11)$$

The first part (10) of Assumption 3 may seem stringent; however, it can be easily fulfilled by using univariate Gaussian kernels since they define an RKHS which does not contain constant functions (Steinwart et al., 2006). On the contrary, the second assumption (11) is more subtle. It requires using kernels defining a so-called RKHS of zero-mean functions (Wahba et al., 1995).
A prominent example of such an RKHS is obtained if (a) all input variables are uniformly distributed on $[0,1]$ and (b) the univariate kernels are chosen among the Sobolev kernels with smoothness parameter $r \geq 1$:
$$k_l(x_l, x'_l) = \frac{(-1)^{r+1}}{(2r)!}\, B_{2r}(|x_l - x'_l|) + \sum_{j=1}^{r} \frac{B_j(x_l)\, B_j(x'_l)}{(j!)^2} \quad (12)$$
where $B_j$ is the Bernoulli polynomial of degree $j$. Even though applying a preliminary transformation on the inputs in order to get uniform variables is conceivable (with e.g. the probability integral transform), a more general and elegant procedure has been proposed by Durrande et al. (2012). Starting from an arbitrary univariate kernel $k(\cdot,\cdot)$, they build a zero-mean kernel $k_D(\cdot,\cdot)$ given by
$$k_D(x, x') = k(x, x') - \frac{\int k(x,t)\, d\mathbb{P}(t) \int k(x',t)\, d\mathbb{P}(t)}{\iint k(s,t)\, d\mathbb{P}(s)\, d\mathbb{P}(t)}$$
where $k_D(\cdot,\cdot)$ satisfies $\forall x$, $\int k_D(x,t)\, d\mathbb{P}(t) = 0$. Interestingly, they also show that the RKHS associated to $k_D(\cdot,\cdot)$ is orthogonal to the constant functions, thus directly satisfying the requirements for the product kernel (10).

More recently, several works made use of the Stein operator (Stein et al., 1972) to define the Stein discrepancy in an RKHS (Chwialkowski et al., 2016), which showed great potential for Monte-Carlo integration (Oates et al., 2017) or goodness-of-fit tests (Gorham and Mackey, 2015; Chwialkowski et al., 2016; Jitkrittum et al., 2017) when the target distribution is either impossible to sample from or is known up to a normalization constant. More precisely, given an RKHS $\mathcal{H}$ with kernel $k(\cdot,\cdot)$ of functions in $\mathbb{R}^d$ and a (target) probability distribution with density $p(\cdot)$, they define a new RKHS with kernel $k_S(\cdot,\cdot)$ which writes
$$k_S(x, x') = \nabla_x \cdot \nabla_{x'} k(x,x') + \frac{\nabla_x p(x)}{p(x)} \cdot \nabla_{x'} k(x,x') + \frac{\nabla_{x'} p(x')}{p(x')} \cdot \nabla_x k(x,x') + \frac{\nabla_x p(x)}{p(x)} \cdot \frac{\nabla_{x'} p(x')}{p(x')}\, k(x,x')$$
and it can be proved that $\forall x \in \mathbb{R}^d$, $\int k_S(x,x')\, p(x')\, dx' = 0$. Unlike $k_D(\cdot,\cdot)$, the kernel $k_S(\cdot,\cdot)$ can still be defined when $p(\cdot)$ is known up to a constant: this property may find interesting applications in GSA when the input distributions are obtained via a preliminary Bayesian data calibration, since it would no longer be required to perform a costly sampling step of their posterior distribution and one could easily use the unnormalized posterior distribution instead.

With Assumption 3, we can now state a decomposition for HSIC-based sensitivity indices.

Theorem 4 (ANOVA decomposition for HSIC). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumptions 2 and 3, the HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$ can be decomposed as
$$\mathrm{HSIC}(X, Y) = \sum_{A \subseteq \mathcal{P}_d} \mathrm{HSIC}_A$$
where each term is given by
$$\mathrm{HSIC}_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathrm{HSIC}(X_B, Y)$$
and $\mathrm{HSIC}(X_B, Y)$ is defined with a product RKHS $\mathcal{H}_B = \mathcal{F}_B \times \mathcal{G}$ with kernel $k_B(x_B, x'_B)\, k_\mathcal{Y}(y,y') = \prod_{l \in B}(1 + k_l(x_l, x'_l))\, k_\mathcal{Y}(y,y')$ as in (10).

The proof, which mainly relies on Mercer's theorem and on Theorem 4.1 from Kuo et al. (2010), is given in Appendix A.3. Once again, this decomposition resembles the ANOVA decomposition (1), where the conditional variances are replaced with HSIC dependence measures between subsets of inputs and the output.
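Theorem 4 rests on Assumption 3, so it is worth seeing how a kernel satisfying the zero-mean requirement (11) can be built in practice. The sketch below is our own illustration of the construction of Durrande et al. (2012) for a univariate Gaussian kernel and a uniform input on $[0,1]$, where the integrals $\int k(x,t)\,d\mathbb{P}(t)$ are simply approximated on a quadrature grid; grid size and bandwidth are arbitrary choices.

```python
import numpy as np

def k_gauss(x, t, bw=0.2):
    return np.exp(-(x - t)**2 / (2 * bw**2))

# quadrature grid approximating P = U([0, 1])
t = np.linspace(0.0, 1.0, 2001)
w = np.full(t.size, 1.0 / t.size)

def k_int(x):
    """Approximation of \\int k(x, t) dP(t)."""
    return np.sum(k_gauss(np.asarray(x)[..., None], t) * w, axis=-1)

k_double_int = np.sum(k_int(t) * w)        # \int\int k(s, t) dP(s) dP(t)

def k_zero_mean(x, xp):
    """Durrande-type kernel k_D(x, x') = k(x, x') - k_int(x) * k_int(x') / k_double_int."""
    return k_gauss(x, xp) - k_int(x) * k_int(xp) / k_double_int

# check the zero-mean property (11): \int k_D(x0, t) dP(t) should vanish for any x0
x0 = 0.3
print(np.sum(k_zero_mean(x0, t) * w))      # close to 0
```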
Properly normalized HSIC-based indices can then be defined:

Definition 3 (HSIC-based sensitivity indices). In the frame of Theorem 4, let $A \subseteq \mathcal{P}_d$. The normalized HSIC-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S^{\mathrm{HSIC}}_A = \frac{\mathrm{HSIC}_A}{\mathrm{HSIC}(X, Y)},$$
while the total HSIC-based index associated to $A$ is
$$S^{T,\mathrm{HSIC}}_A = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S^{\mathrm{HSIC}}_B = 1 - \frac{\mathrm{HSIC}(X_{-A}, Y)}{\mathrm{HSIC}(X, Y)}.$$

From Theorem 4, we have the fundamental identity providing the interpretation of HSIC-based indices as percentages of the explained HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$:
$$\sum_{A \subseteq \mathcal{P}_d} S^{\mathrm{HSIC}}_A = 1.$$
Finally, a noteworthy asymptotic result yields a link between HSIC-based indices and MMD-based ones when the input kernel $k_\mathcal{X}$ degenerates to a Dirac kernel, as elaborated in the following proposition.
Proposition 2. For all subsets $A \subseteq \mathcal{P}_d$, let us define a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_A(x_A, x'_A)\, k_\mathcal{Y}(y,y')$. We further assume that $\forall x_A \in \mathcal{X}_A$, $p_{X_A}(x_A) > 0$ and that
$$k_A(x_A, x'_A) = \frac{1}{\sqrt{p_{X_A}(x_A)}\sqrt{p_{X_A}(x'_A)}} \prod_{l \in A} \frac{1}{h} K\left(\frac{x_l - x'_l}{h}\right) \quad (13)$$
where $K : \mathbb{R} \to \mathbb{R}$ is a symmetric kernel function satisfying $\int K(u)\, du = 1$, and $h > 0$. Then we have, $\forall A \subseteq \mathcal{P}_d$,
$$\lim_{h \to 0} \mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$$
where $\mathrm{HSIC}(X_A, Y)$ is defined with the product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ and $\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})$ with the RKHS $\mathcal{G}$.

Examples of such degenerating input kernels include a Gaussian kernel $k_\mathcal{X}$ with a standard deviation tending to 0, or the sinc kernel associated to the RKHS of band-limited continuous functions with a cutoff frequency tending to infinity. Obviously the result also holds if one uses different kernels $K$ for each input $X_l \in X_A$ in Eq. (13).

Although they may seem trivial, Proposition 2 and Corollary 1 actually justify our claim that both the MMD- and the HSIC-based sensitivity indices are natural generalizations of Sobol' indices, in the sense that a degenerate HSIC-based index with a Dirac kernel for the input variables gives the MMD-based index which, in turn, is equal to the Sobol' index when using the dot product kernel for the output.

In this section, we now discuss how the previously introduced indices can still be valuable in the case where the input variables are no longer independent. In this setting, the Shapley effects introduced in the context of GSA by Owen (Owen, 2014) and based on Shapley values (Shapley, 1953) from game theory have appealing properties, since they provide a proper allocation of the output variance to each input variable, without requiring that they are independent. We recall their definition below.
Definition 4 (Shapley effects (Shapley, 1953)). For any $l = 1, \ldots, d$, the Shapley effect of input $X_l$ is given by
$$\mathrm{Sh}_l = \frac{1}{d \operatorname{Var} Y} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\operatorname{Var} \mathbb{E}\left(Y \mid X_{A \cup \{l\}}\right) - \operatorname{Var} \mathbb{E}(Y \mid X_A)\right\}. \quad (14)$$

This definition corresponds to the Shapley value (Shapley, 1953)
$$\phi_l = \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathrm{val}(A \cup \{l\}) - \mathrm{val}(A)\right\}$$
with value function $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}^+$ equal to $\mathrm{val}(A) = \operatorname{Var} \mathbb{E}(Y|X_A)/\operatorname{Var} Y$. Moreover, we have the following decomposition
$$\sum_{l=1}^{d} \mathrm{Sh}_l = 1.$$
The only requirement is that the value function satisfies $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}^+$ with $\mathrm{val}(\emptyset) = 0$. Combining this result with the kernel-based sensitivity indices is consequently straightforward, which leads to the definition of kernel-embedding Shapley effects:

Definition 5 (Kernel-embedding Shapley effects). For any $l = 1, \ldots, d$, we define

(a) The MMD-Shapley effect
$$\mathrm{Sh}^{\mathrm{MMD}}_l = \frac{1}{d\, \mathrm{MMD}^2_{\mathrm{tot}}} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathbb{E}_{X_{A \cup \{l\}}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{A \cup \{l\}}})\right) - \mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)\right\} \quad (15)$$
provided Assumption 1 holds.

(b) The HSIC-Shapley effect
$$\mathrm{Sh}^{\mathrm{HSIC}}_l = \frac{1}{d\, \mathrm{HSIC}(X, Y)} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1}\left\{\mathrm{HSIC}\left(X_{A \cup \{l\}}, Y\right) - \mathrm{HSIC}(X_A, Y)\right\} \quad (16)$$
provided Assumptions 2 and 3 hold.

We further have the decompositions
$$\sum_{l=1}^{d} \mathrm{Sh}^{\mathrm{MMD}}_l = \sum_{l=1}^{d} \mathrm{Sh}^{\mathrm{HSIC}}_l = 1.$$
Just like in the independent setting, kernel-embedding Shapley effects (15) and (16) can be seen as general kernelized versions of Shapley effects, since Proposition 2 and Corollary 1 are still valid when the inputs are dependent.
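The combinatorial aggregation in (14)-(16) is the same whatever the value function, so a direct implementation (exponential in $d$, hence for small $d$ only) can be shared by the variance-based, MMD- and HSIC-Shapley effects. The sketch below is our own and assumes a user-supplied normalized value function with $\mathrm{val}(\emptyset) = 0$ and $\mathrm{val}(\{1,\ldots,d\}) = 1$.

```python
from itertools import combinations
from math import comb

def shapley_effects(val, d):
    """Shapley effects Sh_1, ..., Sh_d from a value function val(A),
    using the subset-sum formula shared by (14)-(16)."""
    effects = []
    for l in range(1, d + 1):
        others = [i for i in range(1, d + 1) if i != l]
        sh = 0.0
        for size in range(0, d):
            for a in combinations(others, size):
                weight = 1.0 / (d * comb(d - 1, size))
                sh += weight * (val(frozenset(a) | {l}) - val(frozenset(a)))
        effects.append(sh)
    return effects          # sums to val({1,...,d}) - val(empty set) = 1

# toy additive value function with fixed shares per input
shares = {1: 0.5, 2: 0.3, 3: 0.2}
val = lambda a: sum(shares[i] for i in a)
print(shapley_effects(val, 3))   # approximately [0.5, 0.3, 0.2]
```

In practice `val` would be replaced by a Monte-Carlo estimate of one of the value functions discussed in Section 4, which is where the real computational effort lies.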
Remark 1.
In the machine learning community dedicated to the interpretability of black-box models, an importance measure called Kernel-Shap has been recently introduced (Lundberg and Lee, 2017). Although its naming resembles ours, they designate clearly separated approaches, since the Kernel-Shap measure is a local Shapley effect, and the "Kernel" denomination only refers to an estimation procedure without any link to RKHS.
Beyond their theoretical interest in themselves, the kernel-ANOVA decompositions and the associated sensitivity indices also appear powerful from a practical point of view when one carefully examines the potential of using kernels. We give below some insights on how they could enhance traditional GSA studies in several settings.
Categorical model outputs and target sensitivity analysis.
In some applications, the model output $Y$ is categorical, meaning that $\mathcal{Y} = \{1, \ldots, K\}$ when the output can take $K$ levels. A simple common instance involves two levels, corresponding to a failure/success situation. Similarly, even if $Y$ is not categorical, the objective may be to measure the impact of each input on the fact that the output reaches disjoint regions of interest $\mathcal{R}_1, \ldots, \mathcal{R}_K \subset \mathcal{Y}$, as for example in the case where one focuses on events $\{t_{i+1} > Y > t_i\}$ for thresholds $t_i$, $i = 1, \ldots, K$. Such an objective is called target sensitivity analysis (TSA, see Marrel and Chabridon (2020)) and can be reformulated in a categorical framework by the change of variable $Z = i$ if $Y \in \mathcal{R}_i$.

The case where $\mathcal{Y} = \{0, 1\}$ (or equivalently $Z = \mathbb{1}_{\{Y \in \mathcal{R}\}}$) is frequent in TSA. A straightforward approach is to use Sobol' indices with this 0/1 output, yielding
$$S^{\mathrm{TSA}}_l = \frac{\mathbb{E}_{X_l}\left(P(Y=1|X_l) - P(Y=1)\right)^2}{P(Y=1)(1 - P(Y=1))}, \quad (17)$$
see Li et al. (2012). But to the best of our knowledge, no systematic procedure is available when the number of levels is greater than two. Without resorting yet to our kernel-based indices, there are at least two roads, which actually lead to the same indices:

(a) The one-versus-all approach, where we compute several Sobol' indices by repeatedly considering $Z = \mathbb{1}_{\{Y=i\}}$ for all $i = 1, \ldots, K$. We thus have a collection of indices
$$S^{\mathrm{TSA},[i]}_l = \frac{\mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{P(Y=i)(1 - P(Y=i))},$$
and we can aggregate them by normalizing each of them by its own variance, yielding
$$S^{\mathrm{TSA}}_l = \frac{\sum_{i=1}^K P(Y=i)(1-P(Y=i))\, S^{\mathrm{TSA},[i]}_l}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))}.$$

(b) The one-hot encoding approach, which consists in encoding the categorical output into a multivariate vector of 0/1 values: if $\mathcal{Y} = \{1, \ldots, K\}$, $Y$ is encoded as a $K$-dimensional vector $(Z_1 = \mathbb{1}_{\{Y=1\}}, \ldots, Z_K = \mathbb{1}_{\{Y=K\}})$. The aggregated Sobol' indices are then
$$S^{\mathrm{TSA}}_l = \frac{\sum_{i=1}^K \operatorname{Var} \mathbb{E}(Z_i|X_l)}{\sum_{i=1}^K \operatorname{Var} Z_i} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))},$$
which is exactly the index obtained with the one-versus-all approach.

As for the kernel-based indices, the process is less cumbersome, since the only ingredient that requires attention is the choice of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ adapted to categorical outputs, which has already been investigated in the kernel literature (Song et al., 2007, 2012). We focus here on the simple Dirac kernel defined as $k_\mathcal{Y}(y,y') = \delta(y,y')$ for categorical values $y, y' \in \{1, \ldots, K\}$, and the corresponding kernel-based indices are then:

• The first-order MMD-based index:
$$S^{\mathrm{MMD}}_l = \frac{\mathbb{E}_{X_l}\sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i|X_l) P(Y=j|X_l) - \sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i) P(Y=j)}{\sum_{i=1}^K P(Y=i) - \sum_{i=1}^K\sum_{j=1}^K \delta(i,j) P(Y=i) P(Y=j)} = \frac{\mathbb{E}_{X_l}\sum_{i=1}^K P(Y=i|X_l)^2 - \sum_{i=1}^K P(Y=i)^2}{\sum_{i=1}^K P(Y=i) - \sum_{i=1}^K P(Y=i)^2} = \frac{\sum_{i=1}^K \mathbb{E}_{X_l}\left(P(Y=i|X_l) - P(Y=i)\right)^2}{\sum_{i=1}^K P(Y=i)(1-P(Y=i))},$$
where we retrieve again the one-versus-all Sobol' index.
• The first-order HSIC-based index:
$$\underline{S}^{\mathrm{HSIC}}_l = \int_{\mathcal{X}_l \times \mathcal{X}_l} \sum_{i=1}^K\sum_{j=1}^K k_{\{l\}}(x,x')\, \delta(i,j)\left[p_{X_l|Y=i}(x) - p_{X_l}(x)\right]\left[p_{X_l|Y=j}(x') - p_{X_l}(x')\right] P(Y=i) P(Y=j)\, dx\, dx' = \sum_{i=1}^K P(Y=i)^2 \int_{\mathcal{X}_l \times \mathcal{X}_l} k_{\{l\}}(x,x')\left[p_{X_l|Y=i}(x) - p_{X_l}(x)\right]\left[p_{X_l|Y=i}(x') - p_{X_l}(x')\right] dx\, dx' = \sum_{i=1}^K P(Y=i)^2\, \mathrm{MMD}^2\left(\mathbb{P}_{X_l|Y=i}, \mathbb{P}_{X_l}\right).$$

• The MMD- and HSIC-Shapley effects using one of the above indices as building block.

Interestingly, it has been shown that Eq. (17) can also be written, up to a constant, as the Pearson $\chi^2$ divergence between $\mathbb{P}_{X_l|Y=1}$ and $\mathbb{P}_{X_l}$ (Perrin and Defaux, 2019; Spagnol, 2020). This means that $S^{\mathrm{TSA}}_l = S^{\mathrm{MMD}}_l$ and $\underline{S}^{\mathrm{HSIC}}_l$ essentially have the same interpretation as weighted sums of distances between the initial input distributions and the conditional input distributions (when restricted to an output level), with the Pearson $\chi^2$ divergence and the MMD distance, respectively. But we will see in Section 4 that the estimation of HSIC-based sensitivity indices is much less prone to the curse of dimensionality and does not require density estimation, as opposed to $S^{\mathrm{TSA}}_l$ (Perrin and Defaux, 2019). Finally, note that another kernel for categorical variables has also been proposed (Song et al., 2007, 2012), but this is actually a normalized Dirac kernel which would only modify the weights in the indices above.
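With the Dirac kernel, the normalized first-order MMD-based index above only involves conditional class probabilities. As a rough given-data illustration (not one of the estimators studied in Section 4), the sketch below estimates $P(Y=i|X_l)$ by a crude quantile binning of a single scalar input and plugs it into the formula; the binning strategy and toy data are our own choices.

```python
import numpy as np

def mmd_index_categorical(x, y, n_bins=20):
    """Normalized first-order MMD index with a Dirac output kernel:
    sum_i E_Xl[(P(Y=i|Xl) - P(Y=i))^2] / sum_i P(Y=i)(1 - P(Y=i)),
    with P(Y=i|Xl) estimated by binning the scalar input x."""
    classes = np.unique(y)
    p_marg = np.array([(y == c).mean() for c in classes])
    bins = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, bins[1:-1]), 0, n_bins - 1)
    num = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            p_cond = np.array([(y[mask] == c).mean() for c in classes])
            num += mask.mean() * np.sum((p_cond - p_marg)**2)
    return num / np.sum(p_marg * (1 - p_marg))

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(size=5000), rng.uniform(size=5000)
y = (x1 + 0.1 * x2 > 0.8).astype(int)        # two-class output driven mostly by x1
print(mmd_index_categorical(x1, y), mmd_index_categorical(x2, y))
```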
Beyond scalar model outputs.

In many numerical simulation models, some of the outputs are curves representing the temporal evolution of physical quantities of the system such as temperatures, pressures, etc. One can also encounter industrial applications which involve spatial outputs (Marrel et al., 2008). In such cases, the two main approaches in GSA are (a) the ubiquitous point of view, where one sensitivity index is computed for each time step or each spatial location (Terraz et al., 2017) and (b) the dimension reduction angle, in which one preliminarily projects the output onto a low-dimensional vector space and then calculates aggregated sensitivity indices for this new multivariate vector (Lamboni et al., 2011; Gamboa et al., 2013).

However, the kernel perspective for such structured outputs can bring new insights for GSA. Indeed the kernel literature has already proposed several ways to handle curves or images in regression or classification tasks. For instance the PCA-kernel (Ferraty and Vieu, 2006) can be used as an equivalent of (b), as illustrated in Da Veiga (2015). But more interestingly, kernels dedicated to time series were designed, such as the global alignment kernel (Cuturi, 2011) inspired by the dynamic time-warping kernel (Sakoe and Chiba, 1978). Such a kernel could be employed in industrial applications where one is interested in the impact of an input variable on the shape of the output curve. On the other hand, for dealing with spatial outputs similar to images such as in Marrel et al. (2008), one may consider a kernel based on image classification (Harchaoui and Bach, 2007) which would be better suited to analyze the impact of inputs on the change of the shapes appearing inside the image output. Finally, numerical models involving graphs as inputs or outputs (e.g. electricity networks or molecules) may now be tractable with GSA by employing kernels specifically tailored for graphs (Gärtner et al., 2003; Ramon and Gärtner, 2003).
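As an example of the dimension-reduction route (b) recast as a kernel, one can project output curves on their first principal components and apply a Gaussian kernel to the scores, in the spirit of the PCA-kernel mentioned above. The sketch below is a possible numpy implementation under our own choices of number of components, bandwidth and toy curves; the resulting Gram matrix can be plugged into any of the $k_\mathcal{Y}$-based indices.

```python
import numpy as np

def pca_kernel_gram(curves, n_comp=3, bandwidth=1.0):
    """Gram matrix of a PCA-type kernel between discretized output curves.
    curves: array of shape (n_samples, n_time_steps)."""
    centered = curves - curves.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # principal directions
    scores = centered @ vt[:n_comp].T                          # (n_samples, n_comp)
    d2 = ((scores[:, None, :] - scores[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

# toy functional outputs: 50 curves observed on a common time grid
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
amp = rng.uniform(0.5, 1.5, size=50)
curves = amp[:, None] * np.sin(2 * np.pi * t)[None, :]
K_Y = pca_kernel_gram(curves)
print(K_Y.shape)
```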
Stochastic numerical models.
On occasions one has to deal with stochastic simulators, where internally the numerical model relies on random draws to compute the output. Typical industrial applications include models dedicated to the optimization of maintenance costs, where random failures are simulated during the system life cycle, or molecular modeling to predict macroscopic properties based on statistical mechanics, where several microstates of the system are generated at random (Moutoussamy et al., 2015). For fixed values of the input variables, the output is therefore a probability distribution, meaning that
$\mathcal{Y} \subset \mathcal{M}^+_1$, the set of probability measures. In this setting, GSA aims at measuring how changes in the inputs modify the output probability distribution, which is clearly out of the traditional scope of GSA. Once again the kernel point of view makes it possible to easily recycle the MMD- and the HSIC-based sensitivity indices in this context, since they only require the definition of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ on probability distributions. This can be achieved through one of the two following kernels:
$$k_\mathcal{Y}(\mathbb{P}, \mathbb{Q}) = \sigma^2 e^{-\lambda\, \mathrm{MMD}^2(\mathbb{P}, \mathbb{Q})} \quad (18)$$
introduced in Song (2008), or
$$k_\mathcal{Y}(\mathbb{P}, \mathbb{Q}) = \sigma^2 e^{-\lambda\, W^2(\mathbb{P}, \mathbb{Q})}$$
discussed in Bachoc et al. (2017), where $\mathbb{P}, \mathbb{Q} \in \mathcal{M}^+_1$, $W$ is the Wasserstein distance and $\sigma^2, \lambda > 0$.
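When each output distribution is only available through a finite sample, as happens when a stochastic simulator is run several times at the same input, kernel (18) can be approximated by plugging in an empirical MMD. The sketch below is one possible implementation; the plug-in estimator, hyperparameter values and toy samples are our own choices.

```python
import numpy as np

def mmd2_gauss(s1, s2, bw=1.0):
    """Plug-in MMD^2 between two samples of scalar outputs, Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / (2 * bw**2))
    return k(s1, s1).mean() + k(s2, s2).mean() - 2 * k(s1, s2).mean()

def k_distributions(s1, s2, sigma2=1.0, lam=1.0, bw=1.0):
    """Kernel (18) between two output distributions represented by samples."""
    return sigma2 * np.exp(-lam * mmd2_gauss(s1, s2, bw))

rng = np.random.default_rng(4)
samples_a = rng.normal(0.0, 1.0, size=300)      # output distribution at input x_a
samples_b = rng.normal(0.5, 1.2, size=300)      # output distribution at input x_b
print(k_distributions(samples_a, samples_b))
```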
The properly normalized kernel-based sensitivity indices being defined above, we now discuss their estimation. The HSIC-based index is first examined, as we only consider already proposed estimators. On the other hand, the MMD-based index is analyzed more thoroughly since several estimators can be envisioned given its close links with Sobol' indices. Finally we investigate the estimation of kernel-embedding Shapley effects.

We start by observing that if Assumption 3 holds, for any subset $A \subseteq \mathcal{P}_d$ we have $\mathbb{E}_{X_A}\, k_A(X_A, x'_A) = 1$ for all $x'_A \in \mathcal{X}_A$, which means that HSIC in Eq. (8) simplifies into
$$\mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A, X'_A, Y, Y'}\, k_A(X_A, X'_A)\, k_\mathcal{Y}(Y, Y') - \mathbb{E}_{Y,Y'}\, k_\mathcal{Y}(Y, Y').$$
Given a sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, and following Song et al. (2007) and Gretton et al. (2008), two estimators $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ based on U- and V-statistics, respectively, can be introduced:
$$\widehat{\mathrm{HSIC}}_u(X_A, Y) = \frac{1}{n(n-1)} \sum_{i,j=1,\, i \neq j}^{n}\left(k_A(x^{(i)}_A, x^{(j)}_A) - 1\right) k_\mathcal{Y}(y^{(i)}, y^{(j)}),$$
$$\widehat{\mathrm{HSIC}}_b(X_A, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n}\left(k_A(x^{(i)}_A, x^{(j)}_A) - 1\right) k_\mathcal{Y}(y^{(i)}, y^{(j)}),$$
where we assume that $\mathbb{P}_{X_A}$ is known and is used to compute analytically the zero-mean kernels of Section 3.2.2. The study of the version of the above estimators when the sample $(x^{(i)})_{i=1,\ldots,n}$ also serves to estimate $k_A$ is left as future work. Both $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ converge in probability to $\mathrm{HSIC}(X_A, Y)$ with rate $1/\sqrt{n}$, and one can show (Song et al., 2007) that if we assume that $k_A$ and $k_\mathcal{Y}$ are bounded almost everywhere by 1 and are nonnegative, then with probability at least $1-\delta$ we have
$$\left|\widehat{\mathrm{HSIC}}_u(X_A, Y) - \mathrm{HSIC}(X_A, Y)\right| \leq \sqrt{\log(2/\delta)/n}.$$
The asymptotic distributions of $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ have also been studied in the case where $X_A$ and $Y$ are dependent, see Song et al. (2007) and Gretton et al. (2008). It is worth mentioning that here the number of model evaluations is $n$, which is independent of the input dimension, meaning that all HSIC-based sensitivity indices can be computed with only a given sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.

MMD-based indices are close generalizations of Sobol' indices, since they involve computing the expectation of a conditional quantity (an MMD distance with a conditional probability for the former and a conditional variance for the latter). This is the reason why estimation procedures developed for Sobol' indices can be adapted to the MMD ones. The first two estimators discussed below assume that one can easily sample the computer model for any input values (to be determined by the estimation procedure), as opposed to the next two ones which can be defined with any given sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.

The first naive estimator consists in systematically resampling the conditional distribution $\mathbb{P}_{Y|X_A = x_A}$ for many values of $x_A$, as detailed in Algorithm 1 below.
Algorithm 1 Double-loop estimator of $\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$

Sample $x^{(j)}$ from $\mathbb{P}_X$ and compute $y^{(j)} = \eta(x^{(j)})$ for $j = 1, \ldots, m$.
for $i = 1, \ldots, n$ do (outer loop)
    Sample $x^{(i)}_A$ from $\mathbb{P}_{X_A}$;
    for $j = 1, \ldots, m$ do (inner loop)
        Sample $x^{(j)}_{-A}$ from $\mathbb{P}_{X_{-A}}$ (if inputs are independent) or from $\mathbb{P}_{X_{-A}|X_A = x^{(i)}_A}$ (otherwise);
        Compute $\tilde{y}^{(j)} = \eta(x')$ where $x'_A = x^{(i)}_A$ and $x'_{-A} = x^{(j)}_{-A}$;
    end for
    Compute
    $$M^{(i)} = \frac{1}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(y^{(j)}, y^{(j')}\right) + \frac{1}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(\tilde{y}^{(j)}, \tilde{y}^{(j')}\right) - \frac{2}{m^2}\sum_{j,j'=1}^{m} k_\mathcal{Y}\left(y^{(j)}, \tilde{y}^{(j')}\right),$$
    the estimator of $\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A = x^{(i)}_A})$;
end for
$\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right)$ is finally estimated by $\frac{1}{n}\sum_{i=1}^{n} M^{(i)}$.

For each MMD-based index of a subset of variables $X_A$ the total number of model evaluations is $(n+1)m$, which means for example that all first-order MMD-based sensitivity indices are computed at a cost of $d(n+1)m$ model evaluations. It is however possible to design better sampling strategies to compute first-order and total indices if the inputs are independent, as explained in the next section.
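A compact numpy transcription of Algorithm 1 for independent inputs is given below; the toy model, output kernel and sample sizes are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
eta = lambda x: x[..., 0] + x[..., 1]**2 + 0.5 * x[..., 2]     # toy model
k_y = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / 2)    # Gaussian output kernel

def mmd2_hat(y1, y2):
    return k_y(y1, y1).mean() + k_y(y2, y2).mean() - 2 * k_y(y1, y2).mean()

def double_loop_mmd(A, n_outer=100, n_inner=200):
    """Double-loop estimate of E_XA[MMD^2(P_Y, P_{Y|XA})] for independent N(0,1) inputs."""
    y_ref = eta(rng.normal(size=(n_inner, d)))                  # sample from P_Y
    acc = 0.0
    for _ in range(n_outer):                                    # outer loop: x_A ~ P_XA
        x = rng.normal(size=(n_inner, d))                       # inner loop: x_{-A} ~ P_{X_{-A}}
        x[:, A] = rng.normal(size=len(A))                       # freeze the coordinates in A
        acc += mmd2_hat(y_ref, eta(x))
    return acc / n_outer

print(double_loop_mmd([0]), double_loop_mmd([2]))               # X1 should dominate X3
```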
4.2.2 Pick-freeze estimators

We begin by recalling the definition of the pick-freeze estimators for Sobol' indices.

Lemma 1 (Pick-freeze formulation of Sobol' indices (Janon et al., 2014)). Assume $X$ and $X'$ are two independent copies of the input vector, the inputs being independent. For any subset $A \subseteq \mathcal{P}_d$, define $X^{\sim A}$ the vector assembled from $X$ and $X'$ such that $X^{\sim A}_A = X_A$ and $X^{\sim A}_{-A} = X'_{-A}$. Now if we denote $Y = \eta(X)$ and $Y^{\sim A} = \eta(X^{\sim A})$, we have
$$\operatorname{Var} \mathbb{E}(Y|X_A) = \operatorname{Cov}\left(Y, Y^{\sim A}\right), \qquad S^T_A = 1 - \frac{\operatorname{Cov}\left(Y, Y^{\sim -A}\right)}{\operatorname{Var} Y}.$$
In the particular case of $A = \{l\}$, the first-order and total indices $S_l$ and $S^T_l$ can be estimated by collecting estimators $\hat{V}_l$, $\hat{V}_{-l}$ and $\hat{V}$ of $\operatorname{Cov}\left(Y, Y^{\sim l}\right)$, $\operatorname{Cov}\left(Y, Y^{\sim -l}\right)$ and $\operatorname{Var} Y$, respectively. Such estimators have first been studied in Homma and Saltelli (1996), but we focus on the ones introduced by Saltelli et al. (2010), which write
$$\hat{V}_l = \frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})\left\{\eta(x^{\sim l,(i)}) - \eta(x'^{(i)})\right\}, \qquad \hat{V}_{-l} = \frac{1}{n}\sum_{i=1}^{n} \eta(x'^{(i)})\left\{\eta(x^{\sim l,(i)}) - \eta(x^{(i)})\right\},$$
$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \eta(x^{(i)})\right)^2,$$
where $x^{(i)}$ and $x'^{(i)}$ denote independent samples of $X$ and $x^{\sim l,(i)}$ is a vector such that $x^{\sim l,(i)}_l = x^{(i)}_l$ and $x^{\sim l,(i)}_{-l} = x'^{(i)}_{-l}$. The total number of model evaluations to estimate both $S_l$ and $S^T_l$ for all inputs is thus $(d+2)n$, which is much less than the amount required by the previously introduced double-loop estimator.

We now build upon these estimators to design equivalent ones for the first-order and total MMD-based sensitivity indices. The main ingredient is to state an equivalent of Lemma 1 for the MMD.

Lemma 2 (Pick-freeze formulation of MMD-based indices). With the same notations and assumptions as in Lemma 1, we have
$$\mathbb{E}_{X_A}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_A})\right) = \mathbb{E}\, k_\mathcal{Y}\left(Y, Y^{\sim A}\right) - \mathbb{E}\, k_\mathcal{Y}\left(Y, Y'\right).$$
Proof. Since $Y$ and $Y^{\sim A}$ are conditionally independent given $X_A$ with the same conditional distribution, we can write $\mathbb{E}\, k_\mathcal{Y}\left(Y, Y^{\sim A}\right) = \mathbb{E}_{X_A} \mathbb{E}\left[k_\mathcal{Y}\left(Y, Y^{\sim A}\right) \mid X_A\right] = \mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim \mathbb{P}_{Y|X_A}} k_\mathcal{Y}(\zeta, \zeta')$.

Estimators $\widehat{\mathrm{MMD}}^2_l$ of $\mathbb{E}_{X_l}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right)$ and $\widehat{\mathrm{MMD}}^2_{-l}$ of $\mathbb{E}_{X_{-l}}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_{-l}})\right)$ are therefore given by
$$\widehat{\mathrm{MMD}}^2_l = \frac{1}{n}\sum_{i=1}^{n}\left\{k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{\sim l,(i)})\right) - k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x'^{(i)})\right)\right\},$$
$$\widehat{\mathrm{MMD}}^2_{-l} = \frac{1}{n}\sum_{i=1}^{n}\left\{k_\mathcal{Y}\left(\eta(x'^{(i)}), \eta(x^{\sim l,(i)})\right) - k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x'^{(i)})\right)\right\},$$
while $\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}\, k_\mathcal{Y}(Y,Y) - \mathbb{E}\, k_\mathcal{Y}(Y,Y')$ is estimated by
$$\widehat{\mathrm{MMD}}^2_{\mathrm{tot}} = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{(i)})\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(\eta(x^{(i)}), \eta(x^{(j)})\right).$$
All these estimators can actually be recovered by using Mercer's theorem $k_\mathcal{Y}(y,y') = \sum_{r=1}^{\infty} \phi_r(y)\phi_r(y')$ and plugging in the Sobol' estimators of $\operatorname{Cov}\left(\phi_r(Y), \phi_r(Y^{\sim l})\right)$, $\operatorname{Cov}\left(\phi_r(Y), \phi_r(Y^{\sim -l})\right)$ and $\operatorname{Var} \phi_r(Y)$ for all $r \geq 1$. Once again all first-order and total MMD-based sensitivity indices can be estimated with a total cost of $(d+2)n$ model evaluations, and by the strong law of large numbers it is straightforward to show that both $\widehat{\mathrm{MMD}}^2_l / \widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ and $\widehat{\mathrm{MMD}}^2_{-l} / \widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ are consistent.
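The pick-freeze estimators $\widehat{\mathrm{MMD}}^2_l$ and $\widehat{\mathrm{MMD}}^2_{\mathrm{tot}}$ translate directly into a few lines of numpy; the sketch below reuses the toy model of the earlier examples and is only meant to show the sampling scheme, with our own choices of kernel and sample size.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 2000, 3
eta = lambda x: x[..., 0] + x[..., 1]**2 + 0.5 * x[..., 2]
k_y = lambda a, b: np.exp(-(a - b)**2 / 2)       # evaluated pairwise on aligned samples

x, x_prime = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y, y_prime = eta(x), eta(x_prime)

def mmd2_tot_hat(y):
    gram = np.exp(-(y[:, None] - y[None, :])**2 / 2)
    return 1.0 - gram.mean()                      # k_y(y, y) = 1 for the Gaussian kernel

def pick_freeze_mmd(l):
    """Estimate E_Xl[MMD^2(P_Y, P_{Y|Xl})] with the pick-freeze scheme."""
    x_tilde = x_prime.copy()
    x_tilde[:, l] = x[:, l]                       # freeze the l-th column from x
    y_tilde = eta(x_tilde)
    return np.mean(k_y(y, y_tilde) - k_y(y, y_prime))

tot = mmd2_tot_hat(y)
print([pick_freeze_mmd(l) / tot for l in range(d)])   # normalized first-order MMD indices
```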
The two previous estimators, although simple, necessitate specific sampling schemes (double-loop Monte-Carlo or pick-freeze) which may not be amenable in practice. In addition, estimating first-order MMD indices calls for a number of model evaluations which increases with the number of input variables $d$. Recently, Gamboa et al. (2020) introduced new estimators of first-order Sobol' indices based on ranking and inspired by the work of Chatterjee (2020). In particular, for any pair of random variables $(V, Y)$ and measurable bounded functions $f$ and $g$, they propose a universal estimation procedure for expectations of the form
$$\mathbb{E}\left(\mathbb{E}[f(Y)|V]\, \mathbb{E}[g(Y)|V]\right)$$
using only a given sample $(v^{(i)}, y^{(i)})_{i=1,\ldots,n}$ and an estimator given by
$$\frac{1}{n}\sum_{i=1}^{n} f(y^{(i)})\, g(y^{(\sigma_n(i))})$$
where $\sigma_n$ is a random permutation with no fixed point, measurable with respect to the $\sigma$-algebra generated by $(v^{(1)}, \ldots, v^{(n)})$. First-order Sobol' indices are then estimated using $f(x) = g(x) = x$ and the permutation $\sigma_n = N$ defined as in Chatterjee (2020):
$$N(i) = \begin{cases} \pi^{-1}(\pi(i) + 1) & \text{if } \pi(i) + 1 \leq n, \\ \pi^{-1}(1) & \text{otherwise,} \end{cases} \quad (19)$$
where $\pi(i)$ is the rank of $V^{(i)}$ in the sample $(V^{(1)}, \ldots, V^{(n)})$. All first-order indices are finally obtained with a given sample by considering one after the other the pairs $(X_l, Y)$ with their own permutation based on the sample ranks of $X_l$.

Interestingly, it is possible to generalize this result to the first-order MMD indices with the following proposition.

Proposition 3 (Generalization of Proposition 3.2 from Gamboa et al. (2020)). Let $k(\cdot,\cdot)$ be a measurable bounded kernel and $(v^{(i)}, y^{(i)})_{i=1,\ldots,n}$ an iid sample from a pair of random variables $(V, Y)$. Consider a random permutation $\sigma_n$ with no fixed point, measurable with respect to the $\sigma$-algebra generated by $(v^{(1)}, \ldots, v^{(n)})$, and such that for any $i = 1, \ldots, n$, $v^{(\sigma_n(i))} \to v^{(i)}$ as $n \to \infty$ with probability one. Then the estimator
$$\chi_n(V, Y, k) = \frac{1}{n}\sum_{i=1}^{n} k(y^{(i)}, y^{(\sigma_n(i))})$$
converges almost surely to $\chi(V, Y, k) = \mathbb{E}_V\, \mathbb{E}_{\xi,\xi' \sim \mathbb{P}_{Y|V}}\, k(\xi, \xi')$ as $n \to \infty$.

The proof relies again on Mercer's theorem and is given in Appendix A.5. The estimators of $\mathbb{E}_{X_l}\left(\mathrm{MMD}^2(\mathbb{P}_Y, \mathbb{P}_{Y|X_l})\right)$ and $\mathrm{MMD}^2_{\mathrm{tot}}$ are finally given by
$$\widehat{\mathrm{MMD}}^2_l = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(\sigma^l_n(i))}\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(j)}\right), \quad (20)$$
$$\widehat{\mathrm{MMD}}^2_{\mathrm{tot}} = \frac{1}{n}\sum_{i=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(i)}\right) - \frac{1}{n^2}\sum_{i,j=1}^{n} k_\mathcal{Y}\left(y^{(i)}, y^{(j)}\right), \quad (21)$$
for a sample $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, where $\sigma^l_n$ is the permutation defined in Eq. (19) with the ranking performed on the sample $(x^{(i)}_l)_{i=1,\ldots,n}$.
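The rank-based estimator (20) needs only a single input/output sample. A possible numpy implementation of the permutation (19) and of $\widehat{\mathrm{MMD}}^2_l$ is sketched below; function names, the toy model and the kernel are ours.

```python
import numpy as np

def chatterjee_permutation(v):
    """Permutation N of Eq. (19): i is sent to the index whose rank follows rank(v_i)."""
    order = np.argsort(v)                    # order[r] = index of the sample with rank r
    ranks = np.argsort(order)                # ranks[i] = rank of v_i
    next_rank = (ranks + 1) % len(v)         # rank pi(i)+1, wrapping back to the smallest rank
    return order[next_rank]

def rank_based_mmd(x_l, y, k_y):
    """Estimators (20)-(21): returns (MMD_l^2 hat, MMD_tot^2 hat) from a single sample."""
    sigma = chatterjee_permutation(x_l)
    gram = k_y(y[:, None], y[None, :])
    mmd_tot = np.mean(np.diag(gram)) - gram.mean()
    mmd_l = np.mean(gram[np.arange(len(y)), sigma]) - gram.mean()
    return mmd_l, mmd_tot

rng = np.random.default_rng(7)
x = rng.normal(size=(2000, 3))
y = x[:, 0] + x[:, 1]**2 + 0.5 * x[:, 2]
k_y = lambda a, b: np.exp(-(a - b)**2 / 2)
for l in range(3):
    mmd_l, mmd_tot = rank_based_mmd(x[:, l], y, k_y)
    print(l, mmd_l / mmd_tot)
```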
The ranking approach introduced above can actually be generalized to estimate higher-order sensitivity indices by replacing ranking (in dimension 1) by nearest neighbors (in arbitrary dimension), since they define a permutation with the same properties as required in Proposition 3. This was proposed independently by Azadkia and Chatterjee (2019) in the context of a dependence measure and by Broto et al. (2020) for Shapley effects estimation. Here we adopt the formalism of Broto et al. (2020), where they introduce $j^*_A(i,m)$, the index such that the sample point $\mathbf{x}^{(j^*_A(i,m))}_A$ of the subset $A\subseteq\mathcal{P}_d$ of input variables is the $m$-th nearest neighbor of the sample point $\mathbf{x}^{(i)}_A$ in a sample of the inputs $(\mathbf{x}^{(i)})_{i=1,\dots,n}$. Their nearest-neighbor estimator $\hat{V}^{knn}_A$ of $\mathrm{Var}\,\mathbb{E}(Y|\mathbf{X}_A)$ is then given by
$$\hat{V}^{knn}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\eta\bigl(\mathbf{x}^{(j^*_A(s(j),1))}\bigr)\,\eta\bigl(\mathbf{x}^{(j^*_A(s(j),2))}\bigr)-\Biggl(\frac{1}{n}\sum_{i=1}^{n}\eta\bigl(\mathbf{x}^{(i)}\bigr)\Biggr)^2$$
where $s(j)$, $j=1,\dots,n_A$, is a sample of uniformly distributed integers in $\{1,\dots,n\}$, with $n_A\le n$. The choice of using a subsample $s(j)$ is motivated by the authors so that their framework is general enough for the different aggregation procedures they propose for Shapley effects and for their consistency proofs. Several numerical experiments not reported here also show that using all the samples instead of subsamples yields biased estimators, so we follow the procedure of Broto et al. (2020). Once again this estimator can be generalized to MMD-based indices, where $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ is estimated by
$$\widehat{\mathrm{MMD}}^2_A=\frac{1}{n_A}\sum_{j=1}^{n_A}k_Y\Bigl(y^{(j^*_A(s(j),1))},y^{(j^*_A(s(j),2))}\Bigr)-\frac{1}{n^2}\sum_{i,j=1}^{n}k_Y\bigl(y^{(i)},y^{(j)}\bigr)$$
where we denote $y^{(i)}=\eta(\mathbf{x}^{(i)})$. The consistency of this estimator directly follows from the consistency of $\hat{V}^{knn}_A$ from Broto et al. (2020) and Mercer's theorem.
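A corresponding base-R sketch of the nearest-neighbor generalization, again under illustrative names and a Gaussian output kernel, could look as follows; it reuses `gauss_kernel` and `mmd_total` from the previous sketch.

```r
# Nearest-neighbor estimator of the MMD index of a subset A of inputs (base R);
# nA and the kernel choice are illustrative.
mmd_knn <- function(X, y, A, sigma, nA = length(y)) {
  n   <- length(y)
  s   <- sample.int(n, nA, replace = TRUE)          # subsample s(j) in {1, ..., n}
  D   <- as.matrix(dist(as.matrix(X[, A, drop = FALSE])))
  diag(D) <- Inf                                    # exclude the point itself
  nn2 <- apply(D, 1, which.min)                     # j*_A(i, 2): closest other point
  mean(gauss_kernel(y[s], y[nn2[s]], sigma)) -      # uses j*_A(s(j), 1) = s(j)
    mean(outer(y, y, gauss_kernel, sigma = sigma))
}

# Example with the toy sample above: index of the pair of inputs {1, 2}
# mmd_knn(X, y, A = c(1, 2), sigma, nA = 500) / mmd_total(y, sigma)
```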
Since $j^*_A(s(j),1)=s(j)$, this estimator is identical to the ranking-based one in (20), where the permutation built from ranks is simply replaced by the index of the nearest neighbor not including the point itself, $j^*_A(s(j),2)$.

4.3 Shapley effect estimation

The last estimation task concerns the kernel-embedding Shapley effects set forth in Definition 5. Of course a straightforward approach consists in using any of the estimators discussed before in the general formulation of the MMD- or HSIC-Shapley effects. But a closer inspection reveals a difference: while this is easy for the HSIC-Shapley effects, since both $\mathrm{HSIC}_u(\mathbf{X}_A,Y)$ and $\mathrm{HSIC}_b(\mathbf{X}_A,Y)$ can be computed for all subsets $A\subseteq\mathcal{P}_d$ with only one sample $(\mathbf{x}^{(i)},y^{(i)})$, $i=1,\dots,n$, the MMD-Shapley effects require estimators of $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ which do not involve too many calls to the numerical model. Among the estimators introduced in Section 4.2, only the one based on nearest neighbors has a computational cost independent of the number of input variables. This is exactly the framework proposed in Broto et al. (2020) for the variance-based Shapley effects.

However, as pointed out in Song et al. (2016) in the case of variance-based Shapley effects, a double-loop Monte Carlo estimator of the value function $\mathrm{val}(A)=\mathrm{Var}\,\mathbb{E}(Y|\mathbf{X}_A)/\mathrm{Var}\,Y$ can be heavily biased. They show that another value function $\mathrm{val}'(A)=\mathbb{E}\,\mathrm{Var}(Y|\mathbf{X}_{-A})/\mathrm{Var}\,Y$ behaves better and gives rise to the exact same Shapley effects (Theorem 1 in Song et al. (2016)). This is why Broto et al. (2020) also introduced a nearest-neighbor estimator of $\mathbb{E}\,\mathrm{Var}(Y|\mathbf{X}_{-A})$ given by
$$\hat{E}^{knn}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\frac{1}{n_I-1}\sum_{i=1}^{n_I}\Biggl[y^{(j^*_{-A}(s(j),i))}-\frac{1}{n_I}\sum_{i'=1}^{n_I}y^{(j^*_{-A}(s(j),i'))}\Biggr]^2$$
where this time $n_I$ nearest neighbors are used. In a nutshell, the nearest neighbors are used as if they were independent samples from $\mathbb{P}_{Y|\mathbf{X}_{-A}=\mathbf{x}^{(s(j))}_{-A}}$, which explains why we compute their empirical variance in the formula above. In order to follow the same road for the estimation of MMD-Shapley effects, we first need an equivalent of Theorem 1 from Song et al. (2016) for a new value function related to the MMD.

Lemma 3 (Other formulation of MMD-Shapley effects). The Shapley values obtained with the value function
$$\mathrm{val}'(A)=\mathbb{E}_{\mathbf{X}_{-A}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi')\Bigr]\Big/\mathrm{MMD}^2_{tot}$$
are exactly equal to the MMD-Shapley effects from Definition 5 with value function $\mathrm{val}(A)=\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)/\mathrm{MMD}^2_{tot}$.

The proof is based on the generalization of the law of total variance for the generalized variance $\mathrm{MMD}^2_{tot}$ and is given in Appendix A.6.
A nearest-neighbor estimator $\widehat{E\mathrm{MMD}}_A$ of $\mathbb{E}_{\mathbf{X}_{-A}}\bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-A}}}k_Y(\xi,\xi')\bigr]$ is then given by
$$\widehat{E\mathrm{MMD}}_A=\frac{1}{n_A}\sum_{j=1}^{n_A}\Biggl\{\frac{1}{n_I}\sum_{i=1}^{n_I}k_Y\Bigl(y^{(j^*_{-A}(s(j),i))},y^{(j^*_{-A}(s(j),i))}\Bigr)-\frac{1}{n_I^2}\sum_{i,i'=1}^{n_I}k_Y\Bigl(y^{(j^*_{-A}(s(j),i))},y^{(j^*_{-A}(s(j),i'))}\Bigr)\Biggr\}$$
and the MMD-Shapley effect estimator is
$$\widehat{\mathrm{Sh}}^{\mathrm{MMD}}_l=\frac{1}{p\,\widehat{\mathrm{MMD}}^2_{tot}}\sum_{A\subseteq\mathcal{P}_d,\,A\not\ni l}\binom{p-1}{|A|}^{-1}\Bigl\{\widehat{E\mathrm{MMD}}_{A\cup\{l\}}-\widehat{E\mathrm{MMD}}_A\Bigr\},$$
where $\widehat{\mathrm{MMD}}^2_{tot}$ is estimated as in Eq. (21).

As a side note, when the number of input variables is large, the number of terms involved in the Shapley effects severely increases and the computational cost to assemble all of them (even if one uses estimators relying on a given sample only) becomes prohibitive. For such cases it is possible to use a formulation of Shapley effects involving a sum over permutations of $\{1,\dots,d\}$ instead of a sum over subsets of $\mathcal{P}_d$, which makes it possible to add another level of approximation by computing the sum on a random sample of permutations instead of on all of them (Castro et al., 2009). Obviously, since this trick does not depend on the value function used inside the Shapley values, it can also be used for our kernel-embedding Shapley effects.
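For illustration, a possible way to assemble MMD-Shapley effects from a nearest-neighbor $\widehat{E\mathrm{MMD}}_A$ estimator is sketched below in base R. It enumerates all subsets explicitly (so it is only meant for small $p$), and the helper names, the Gaussian output kernel and the default values of $n_A$ and $n_I$ are our own illustrative choices; `gauss_kernel` and `mmd_total` come from the earlier sketches.

```r
# Sketch of MMD-Shapley effects assembled from a nearest-neighbor EMMD_A estimator.
emmd_hat <- function(X, y, A, nA, nI, sigma) {
  n <- length(y)
  minusA <- setdiff(seq_len(ncol(X)), A)
  if (length(minusA) == 0)                          # conditioning on nothing:
    return(1 - mean(outer(y, y, gauss_kernel, sigma = sigma)))  # k(y, y) = 1 here
  D <- as.matrix(dist(as.matrix(X[, minusA, drop = FALSE])))
  s <- sample.int(n, nA, replace = TRUE)
  mean(sapply(s, function(j) {
    nn <- order(D[j, ])[1:nI]                       # n_I nearest neighbors (j included)
    K  <- outer(y[nn], y[nn], gauss_kernel, sigma = sigma)
    mean(diag(K)) - mean(K)                         # within-neighborhood MMD terms
  }))
}

shapley_mmd <- function(X, y, sigma, nA = 200, nI = 10) {
  p <- ncol(X)
  mmd_tot <- mmd_total(y, sigma)                    # Eq. (21), from the earlier sketch
  all_subsets <- function(v) {                      # all subsets of an index vector v
    if (length(v) <= 1) return(c(list(integer(0)), if (length(v)) list(v)))
    c(list(integer(0)),
      unlist(lapply(seq_along(v), function(k) combn(v, k, simplify = FALSE)),
             recursive = FALSE))
  }
  sapply(seq_len(p), function(l) {
    incr <- sapply(all_subsets(setdiff(seq_len(p), l)), function(A) {
      (emmd_hat(X, y, c(A, l), nA, nI, sigma) - emmd_hat(X, y, A, nA, nI, sigma)) /
        choose(p - 1, length(A))
    })
    sum(incr) / (p * mmd_tot)
  })
}
```

For larger numbers of inputs, the permutation-sampling approximation mentioned above would replace the explicit subset enumeration in this sketch.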
In this section we illustrate the behavior of the kernel-based sensitivity indices on several test cases representative of typical GSA industrial applications. In particular, we address the following numerical model categories: a standard scalar-output model, a stochastic simulator, a model with a time-series output and a multi-class categorical-output simulator with dependent inputs. All the results presented here are reproducible with the R code provided in the supplementary material.

To exemplify the additional insight provided by these indices we first consider a classical GSA test case, the Ishigami function (Ishigami and Homma, 1990), where the output $Y$ is given by
$$Y=\sin(X_1)+7\sin^2(X_2)+0.1\,X_3^4\sin(X_1)$$
with $X_l\sim\mathcal{U}(-\pi,\pi)$ for $l=1,\dots,4$, meaning that we add a dummy input variable $X_4$ for analysis purposes.

We start by computing the traditional Sobol' first-order and total sensitivity indices using a pick-freeze estimator as in Section 4.2.2 with a sample size $n=1000$, and we repeat this calculation 50 times. For each replication the total number of calls to the numerical model is thus $(p+2)n=6000$. We then use the same pick-freeze procedure to estimate the MMD-based first-order and total indices with the exact same samples. For the output we use a Gaussian kernel $k_Y(y,y')=\exp(-\sigma(y-y')^2)$ where $\sigma$ is chosen as the median of the pairwise distances between the output samples. Results are given in Figure 1. First note that, as is well known, the first-order Sobol' index of $X_3$ is zero, while its total index is around 0.25 due to its interaction with $X_1$.
$X_2$ is also an important variable, which does not have any interaction since its total Sobol' index is equal to its first-order one. As expected, $X_4$ is correctly detected as non-important. The MMD-based indices however bring a different insight: from a probability distribution perspective, one can observe that interactions are much more present, since there is a large gap between total and first-order indices for all inputs (except $X_4$ of course). In addition, this time $X_3$ is detected to have a main effect: indeed, even though it does not impact the output conditional mean, it influences the tails of the output conditional distribution when it is close to $-\pi$ or $\pi$, as was already illustrated in Da Veiga (2016). This shows that MMD-based indices capture other types of input influence than Sobol' ones.

Figure 1: Ishigami test case. First-order (a) and total (b) Sobol' indices and first-order (c) and total (d) MMD-based indices with pick-freeze estimators, $n=1000$, 50 replicates.

To take a different view of the inputs/output relationship we also estimate HSIC-based first-order and total indices using the V-statistic of Section 4.1. Again we use the same Gaussian kernel as above for the output, while we use the Sobolev kernel from Eq. (12) for the inputs. Since they are uniform it is easy to renormalize them to satisfy the zero-mean kernel condition. We use only one sample of size $n=1000$ and estimates obtained with 50 replications are reported in Figure 2. Interestingly, we first observe that with HSIC we no longer detect any interaction: our intuition is that first-order HSIC indices already aggregate a very large family of potential influences, and thus interactions may only appear with highly complicated inputs/output link functions. This is supported by the fact that HSIC indices rank the inputs the exact same way as total Sobol' indices. Another appealing property is that computing all HSIC indices only requires a given sample of moderate size, which is interesting from a screening perspective for GSA on very time-consuming numerical models.

Figure 2: Ishigami test case. First-order (a) and total (b) HSIC-based indices with the V-statistic estimator, $n=1000$, 50 replicates.
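For completeness, here is a simplified base-R sketch of a V-statistic HSIC estimator of the kind used above, with Gaussian kernels on both sides; note that the paper uses a Sobolev kernel on the inputs, so this is only a stand-in with our own names and bandwidths.

```r
# Simplified V-statistic HSIC estimator between one input and the output.
hsic_v <- function(x, y, sx, sy) {
  n  <- length(y)
  Kx <- exp(-sx * outer(x, x, "-")^2)
  Ky <- exp(-sy * outer(y, y, "-")^2)
  H  <- diag(n) - matrix(1 / n, n, n)        # centering matrix
  sum((H %*% Kx %*% H) * Ky) / n^2           # (1/n^2) trace(H Kx H Ky)
}
# e.g. first-order HSIC values for the toy sample of the earlier sketch:
# sapply(1:p, function(l) hsic_v(X[, l], y, sx = 1, sy = sigma))
```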
Our second illustration is a more original setting for GSA, which consists of a stochastic simulator: for a fixed value of the input variables, the numerical model outputs a probability distribution, or rather a sample from a probability distribution in practice. Here we use a test case proposed in Moutoussamy et al. (2015) which involves five input variables and writes
$$Y=(X_1+2X_2+U_1)\sin(3X_3-X_4+N)+U_2+5X_5B+\sum_i iX_i$$
where $X_1,\dots,X_5\sim\mathcal{U}(0,1)$ are the input variables and $U_1$, $U_2$ (uniform), $N$ (centered Gaussian) and $B\sim\mathrm{Bernoulli}(1/2)$ are additional random variables which are responsible for the simulator stochasticity.
Note that we modify the constant in front of $X_5B$ to lessen the effect of $X_5$ as compared to Moutoussamy et al. (2015). An example of the output distribution for 20 random fixed values of the input variables, obtained each time with a sample of size 100 for the stochastic ones, is given in Figure 3.

Figure 3: Stochastic simulator test case. Output probability distribution for 20 values of the input variables chosen at random. The distribution is estimated with a kernel density estimator.

Leaving aside for now the whole output distribution, we first place ourselves in a standard deterministic GSA setting by analyzing the input influence on both the output mean and standard deviation (with respect to $U_1$, $U_2$, $N$ and $B$). We thus compute Sobol' indices for these two outputs of interest with a pick-freeze estimator with a sample of size $n=1000$ and perform 50 replications, see Figure 4. It shows that interactions are negligible, and that $X_5$ is clearly the most influential input by far: alone, it explains 65% of the output mean variability and 75% of the output standard deviation variability. This is expected since $X_5$ is coupled with $B$, which creates the multi-modal feature of the output distribution. The output mean variability also depends on $X_1$ and $X_2$ to some lesser extent, and the output standard deviation variability on $X_3$.

Figure 4: Stochastic simulator test case. First-order (a) and total (b) Sobol' indices of the output mean and first-order (c) and total (d) Sobol' indices of the output standard deviation with pick-freeze estimators, $n=1000$, 50 replicates.

We now make use of the kernel framework to compute MMD- and HSIC-based sensitivity indices which can accommodate the output distribution directly, thanks to the specific kernels discussed in Section 3.4. More precisely, we use the kernel of Eq. (18) with $\sigma=1$ and $\lambda$ chosen as the median of the MMDs computed on the preliminary sample used for visualization in Figure 3, with a kernel $k_Y(y,y')=\exp(-\tau(y-y')^2)$ and $\tau$ chosen as the median of the pairwise distances between the output samples. We only compute first-order indices here and use for illustration the rank estimator of the MMD index from Section 4.2.3, while for HSIC we use again the Sobolev kernel. Results with 50 replications and a sample of size $n=200$ are given in Figure 5. Both indices coincide and identify $X_5$ as the most important input variable, as well as a small influence of $X_1$, $X_2$ and $X_3$, while $X_4$ is non-important: considering the whole output distribution variability via the specific kernel is comparable to an aggregation of the variability of the output mean and standard deviation (and of other moments we did not compute above).

Figure 5: Stochastic simulator test case. First-order MMD (a) and HSIC (b) indices of the output distribution with rank and V-statistic estimators, respectively, $n=200$, 50 replicates.
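As an illustration of how such distribution-valued outputs can be handled, the sketch below implements one common Gaussian-type kernel between distributions represented by samples, $\exp(-\mathrm{MMD}^{2\sigma}/(2\lambda^2))$; the exact parameterization of Eq. (18) may differ, and the names and parameters are ours.

```r
# A possible kernel between output distributions given as samples (base R).
mmd2_hat <- function(y1, y2, tau) {
  k <- function(a, b) exp(-tau * outer(a, b, "-")^2)
  mean(k(y1, y1)) + mean(k(y2, y2)) - 2 * mean(k(y1, y2))   # V-statistic MMD^2
}
dist_kernel <- function(y1, y2, tau, lambda, sigma = 1) {
  exp(-mmd2_hat(y1, y2, tau)^sigma / (2 * lambda^2))
}
# Each "observation" of the stochastic simulator is a sample of outputs; the Gram
# matrix of dist_kernel over these samples can then feed the MMD/HSIC estimators above.
```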
Another commonly encountered industrial application is a physics-based numerical simulator involving functional outputs, such as curves representing the evolution over time of some system characteristics (e.g. pressure, temperature, ...). To illustrate how time-series kernels can easily handle GSA on such systems, we build a simplified compartmental epidemiological model inspired by previous works on COVID-19 (Magal and Webb, 2020; Charpentier et al., 2020; Di Domenico et al., 2020). Our model is a straightforward Susceptible-Infected-Recovered (SIR) model (Kermack and McKendrick, 1927) which is slightly modified, in the sense that it accounts for two different types of infectious people: the reported cases, which we assume are isolated and can no longer contaminate others, and the unreported cases, who can infect others. A summary of this compartmental model proposed by Magal and Webb (2020) is given in Figure 6.

Figure 6: Functional simulator test case. The modified SIR model with 4 compartments following Magal and Webb (2020).

$S$ consists of the susceptible individuals who are not yet infected. During the epidemic spread they are infected depending on the time-dependent transmission rate $\tau(t)$. Once infected they move to compartment $I$ of asymptomatic infectious individuals. After an asymptomatic period of $1/\nu$ days they become symptomatic, and a fraction $f$ of them is detected and goes to compartment $R$, while the rest of them are undetected and go to compartment $U$. After a recovery period of $1/\eta$ days, symptomatic people from $R$ and $U$ recover and go to the last compartment. Observe that this is a highly simplified representation of the epidemic where we do not account for hospitalizations, testing strategies or deaths: our goal here is not to be representative of COVID-19 but rather to exemplify how GSA can be applied to such models.

The dynamics of the evolution of individuals from one compartment to another is modeled with the following system of ordinary differential equations:
$$\frac{dS}{dt}=-\tau(t)\,S\,(I+U),\qquad \frac{dI}{dt}=\tau(t)\,S\,(I+U)-\nu I,\qquad \frac{dR}{dt}=f\nu I-\eta R,\qquad \frac{dU}{dt}=(1-f)\nu I-\eta U.$$
The transmission rate is chosen according to Magal and Webb (2020), where they propose the parametric form $\tau(t)=\tau_0\exp(-\mu\max(t-N,0))$: the transmission rate is constant equal to $\tau_0$ at the beginning of the epidemic, and it then decreases with an exponential decay of rate $\mu$ once social distancing and lockdown start to have an effect after $N$ days. They further assume that the cumulative number of reported cases $CR(t)$ is approximately $CR(t)=\chi_1\exp(\chi_2 t)-\chi_3$, where the $\chi_i$ are estimated on data. From this assumption they obtain the values of the initial conditions $I_0$, $U_0$ and $R_0$ (see Magal and Webb (2020) for the exact expressions), while in our case $S_0\approx 66\times 10^6$ is the initial susceptible population (here in France). From a GSA perspective we then assume that we have uncertainty on the following 6 input variables: $\tau_0$, $\mu$, $N$ (transmission rate), $\eta$, $\nu$ (days until symptoms and recovery) and $\chi_2$ (which impacts the initial conditions). $f$ is assumed to be fixed at a fraction equal to 0.1.
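A minimal base-R sketch of this modified SIR simulator (with a simple explicit Euler scheme) is given below; the signature, default values and time grid are placeholders of ours and do not reproduce the settings used in the paper.

```r
# Explicit-Euler sketch of the modified SIR model; parameter values are placeholders.
sir_unreported <- function(tau0, mu, N, eta, nu, I0, U0, R0,
                           S0 = 66e6, f = 0.1, t_end = 150, dt = 0.1) {
  steps <- floor(t_end / dt)
  S <- S0; I <- I0; U <- U0; R <- R0
  out <- matrix(NA_real_, steps, 2, dimnames = list(NULL, c("I", "R")))
  for (k in seq_len(steps)) {
    t    <- k * dt
    tau  <- tau0 * exp(-mu * max(t - N, 0))   # time-dependent transmission rate
    newI <- tau * S * (I + U)
    dS <- -newI
    dI <- newI - nu * I
    dR <- f * nu * I - eta * R
    dU <- (1 - f) * nu * I - eta * U
    S <- S + dt * dS; I <- I + dt * dI; R <- R + dt * dR; U <- U + dt * dU
    out[k, ] <- c(I, R) / S0                  # curves normalized by the population size
  }
  out                                         # functional (time-series) output
}
```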
We assign uniform distributions to the input variables $\tau_0$, $\mu$, $N$, $1/\eta$, $1/\nu$ and $\chi_2$, with ranges consistent with the values from Magal and Webb (2020). The output dynamics over time of compartments $I$ and $R$ for 20 values of the inputs chosen at random according to these uniform distributions are given in Figure 7.

Figure 7: Functional simulator test case. Output dynamics over time for compartments $I$ (left, infectious cases) and $R$ (right, reported cases) for 20 values of the input variables chosen at random. They are both normalized by the total population $S_0$.

For GSA we rely on the global-alignment kernel of Cuturi (2011) designed for time series, which searches over all their alignments and averages them, and use it inside our first-order HSIC indices for the output, whereas we still employ the Sobolev kernel for the inputs. The results obtained with 50 repetitions with a sample of size $n=200$ and the V-statistic estimator are reported in Figure 8. For both compartments the most influential input is $\chi_2$, as expected since it influences the initial conditions, followed by $N$ and $\mu$, both related to the transmission rate. $\nu$ also has an impact for compartment $I$ but not $\eta$, which is coherent with the ordinary differential equations, and one can see on the contrary that $\eta$ influences compartment $R$.

Figure 8: Functional simulator test case. First-order HSIC indices for compartments $I$ (left) and $R$ (right) with the V-statistic estimator, $n=200$, 50 replicates.

Finally we investigate a numerical model with both a categorical output (to make use of the discussion from Section 3.4) and dependent inputs (to analyze the kernel-embedding Shapley effects from Section 3.3). We build upon the famous wine quality data set (Cortez et al., 2009) of the UCI repository (Dua and Graff, 2017). This dataset consists of 4898 observations of wine quality (a categorical variable with levels 0 to 10 corresponding to a score) associated with 11 features obtained with physicochemical tests. In order to place ourselves in a standard computer experiments setting (i.e. a numerical simulator and uncertain inputs with a given probability distribution) we use this dataset to design a GSA scenario detailed in the following steps:

1. We regroup wine quality scores into only 3 categories: low (score less than 5), medium (score equal to 6) and high (score higher than 7) in order to have a balanced dataset. We also use a small subsample of size 600 of white wine only from the initial 4898 observations for faster estimation of the input dependence structure;

2. We estimate a random forest model between the wine quality and the 11 inputs from this transformed dataset and compute variable importance for each input. The variable importance score is used to select only 4 important features among the initial 11 ones (volatile acidity, chlorides, density and alcohol). This is absolutely not a mandatory step, but we choose to do so both for a faster computation of Shapley effects and for the estimation of the input dependence structure. A new random forest model is finally built with these 4 input variables only, and the predictor serves as our numerical simulation model;

3. The samples from the 4 input variables identified above are used to estimate a vine copula structure which models their dependence (Czado, 2019). Once the vine copula is estimated, it is then easy to generate as many new input samples as required.
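Before computing the Shapley effects described next, note that a categorical output only requires a Gram matrix built from a Dirac kernel; a hedged base-R sketch, with our own function names and placeholder labels, is given below.

```r
# Dirac (0/1) kernel for a categorical output, plugged into a Gram-matrix HSIC form.
dirac_gram <- function(y) 1 * outer(y, y, "==")      # 1 if same class, 0 otherwise

hsic_v_gram <- function(Kx, Ky) {
  n <- nrow(Kx)
  H <- diag(n) - matrix(1 / n, n, n)
  sum((H %*% Kx %*% H) * Ky) / n^2
}
# y_class <- sample(c("low", "medium", "high"), 100, replace = TRUE)  # placeholder output
# Ky <- dirac_gram(y_class)
```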
MMD- and HSIC-Shapley effects are then computed with a sample size of $n=1000$, with a Dirac categorical kernel for the output and a Sobolev kernel for the inputs in the HSIC case. For MMD we use the nearest-neighbor estimator of Section 4.3 and for HSIC the V-statistic estimator, and we repeat the estimation 50 times, see Figure 9.

Figure 9: Multi-class output test case. MMD- (a) and HSIC- (b) Shapley effects with nearest-neighbor and V-statistic estimators, respectively, $n=1000$, 50 replicates.

Both kernel-embedding Shapley effects identify alcohol as the most influential input, which was expected from the variable importance scores computed with the random forest. However, the MMD-Shapley effects do not discriminate the input variables as clearly as HSIC. We suspect that there may be remaining estimation bias coming from the nearest-neighbor estimators, which we plan to carefully examine in future work.

In this paper we discussed two moment-independent sensitivity indices which generalize Sobol' ones by relying on the RKHS embedding of probability distributions. These MMD- and HSIC-based sensitivity indices are shown to admit an ANOVA decomposition, which makes it possible to properly define input interactions and their natural normalization constant. To the best of our knowledge this is the first time such a result is proved for sensitivity indices apart from Sobol' ones. We also defined kernel-embedding Shapley effects which are built upon these indices for the case where the input variables are no longer independent. As discussed through several GSA applications with categorical outputs or stochastic simulators, this opens the path for new powerful and general GSA approaches by means of kernels adapted to the task at hand. Finally, several estimators have been introduced, including new ones inspired by recent advances in Sobol' indices and Shapley effects estimation.

However, there is still room for improvement in the theoretical understanding of these indices. First, we extensively used Mercer's theorem and it would be interesting to extend our results when it no longer holds. We also assume a kernel product form for HSIC indices, whereas the theorem used in our proof allows for more general kernels. From an estimation perspective, we did not exhibit here any central limit theorem, although this would be an important step enabling one to statistically test whether indices are zero or not. But this is not at all an easy task, which may be tackled via the functional delta method combined with Mercer's theorem. On the other hand, some bias can be observed in the nearest-neighbor estimators, which should be analyzed carefully in future work. Finally, further practical experiments should be performed to better understand the behavior of these new indices. We think in particular of the choice of the kernel hyperparameters, and of the investigation of invariant kernels for outputs given as curves or images.
A Proofs
A.1 Proof of Theorem 3
Proof.
The theorem is proved in the case where Mercer's theorem holds, i.e., the output is assumed to be such that $Y\in\mathcal{Y}$ with $\mathcal{Y}$ a compact set and $k_Y$ has the representation
$$k_Y(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y')$$
as in Eq. (9). Consider now the random variable $W=\sum_{r=1}^{\infty}\eta^{[r]}(\mathbf{X})$ where $\eta^{[r]}(\mathbf{X})=\varphi_r(Y)=\varphi_r(\eta(\mathbf{X}))$. To prove the theorem, two formulations of $\mathrm{Var}\,W$ are exhibited. First, since the functions $\varphi_r$ are orthogonal in $L^2(\mathcal{Y})$ and using the absolute convergence of the series, we have
$$\mathrm{Var}\,W=\sum_{r=1}^{\infty}\mathrm{Var}\,\varphi_r(Y)=\sum_{r=1}^{\infty}\mathbb{E}\bigl(\varphi_r(Y)\varphi_r(Y)\bigr)-\sum_{r=1}^{\infty}\mathbb{E}\bigl(\varphi_r(Y)\varphi_r(Y')\bigr)=\mathbb{E}\Bigl(\sum_{r=1}^{\infty}\varphi_r(Y)\varphi_r(Y)\Bigr)-\mathbb{E}\Bigl(\sum_{r=1}^{\infty}\varphi_r(Y)\varphi_r(Y')\Bigr)=\mathbb{E}\,k_Y(Y,Y)-\mathbb{E}\,k_Y(Y,Y'),$$
where $Y'$ denotes an independent copy of $Y$. On the other hand, using the variance decomposition (1) for each $\eta^{[r]}(\mathbf{X})=\varphi_r(\eta(\mathbf{X}))$ we get
$$\mathrm{Var}\,W=\sum_{r=1}^{\infty}\mathrm{Var}\,\eta^{[r]}(\mathbf{X})=\sum_{r=1}^{\infty}\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\,\mathrm{Var}\,\mathbb{E}\bigl(\eta^{[r]}(\mathbf{X})\,\big|\,\mathbf{X}_B\bigr)=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sum_{r=1}^{\infty}\mathrm{Var}\,\mathbb{E}\bigl(\varphi_r(Y)\,\big|\,\mathbf{X}_B\bigr)=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\,\mathbb{E}_{\mathbf{X}_B}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_B})\bigr)$$
using again the absolute convergence and the expansion of $\mathbb{E}_{\mathbf{X}_B}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_B})\bigr)$ obtained in Eq. (9). The theorem follows by equating both formulations of $\mathrm{Var}\,W$.

A.2 Proof of Proposition 1

Proof.
We simply add and subtract $\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')$:
$$\begin{aligned}
\mathrm{MMD}^2_{tot}&=\mathbb{E}_{\zeta\sim\mathbb{P}_Y}k_Y(\zeta,\zeta)-\mathbb{E}_{\zeta,\zeta'\sim\mathbb{P}_Y}k_Y(\zeta,\zeta')\\
&=\mathbb{E}_{\zeta\sim\mathbb{P}_Y}k_Y(\zeta,\zeta)-\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\mathbf{X}_A}\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr).
\end{aligned}$$

A.3 Proof of Theorem 4
Proof.
We first rewrite the HSIC between $\mathbf{X}$ and $Y$ from Eq. (8) as a multivariate integral, assuming $\mathbb{P}_{\mathbf{X}Y}$ is absolutely continuous with respect to the Lebesgue measure on $\mathcal{X}\times\mathcal{Y}$:
$$\mathrm{HSIC}(\mathbf{X},Y)=\int_{\mathcal{X}\times\mathcal{X}}\int_{\mathcal{Y}\times\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')k_Y(y,y')\,[p_{\mathbf{X}Y}(\mathbf{x},y)-p_{\mathbf{X}}(\mathbf{x})p_Y(y)]\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}\,d\mathbf{x}'\,dy\,dy'\qquad(22)$$
where $p_{\mathbf{X}Y}$, $p_{\mathbf{X}}$ and $p_Y$ are the probability density functions of $(\mathbf{X},Y)$, $\mathbf{X}$ and $Y$, respectively. As in Theorem 3 we further assume Mercer's theorem holds, which means that
$$k_Y(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y').$$
For each $r$, we then define the function
$$g^{[r]}(\mathbf{x})=\int_{\mathcal{X}}\int_{\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy',$$
noting that $g^{[r]}\in\mathcal{F}$ from Assumption 2. It is then straightforward to show that
$$\|g^{[r]}\|^2_{\mathcal{F}}=\int_{\mathcal{X}\times\mathcal{X}}\int_{\mathcal{Y}\times\mathcal{Y}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,\varphi_r(y)\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x},y)-p_{\mathbf{X}}(\mathbf{x})p_Y(y)]\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}\,d\mathbf{x}'\,dy\,dy',$$
which means that
$$\mathrm{HSIC}(\mathbf{X},Y)=\sum_{r=1}^{\infty}\|g^{[r]}\|^2_{\mathcal{F}}\qquad(23)$$
since in Mercer's theorem we have the absolute convergence of the series. Now the idea is to write an orthogonal decomposition (in $\mathcal{F}$) for each function $g^{[r]}$, which will finally provide a decomposition for HSIC through Eq. (23).

The orthogonal decomposition of $g^{[r]}$ is obtained with Theorem 4.1 from Kuo et al. (2010). First, recall that we have, from the first part of Assumption 3,
$$k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')=\prod_{l=1}^{p}\bigl(1+k_l(x_l,x'_l)\bigr)=\sum_{A\subseteq\mathcal{P}_d}\prod_{l\in A}k_l(x_l,x'_l):=\sum_{A\subseteq\mathcal{P}_d}k_A(\mathbf{x}_A,\mathbf{x}'_A)\qquad(24)$$
which corresponds to Eq. (4.1) in Kuo et al. (2010). We then introduce a set of commuting projections $\{P_l\}_{l=1}^{p}$ on $\mathcal{F}$ given by
$$P_l(f)=\int_{\mathcal{X}_l}f(x_1,\dots,x_{l-1},t,x_{l+1},\dots,x_d)\,p_{X_l}(t)\,dt\qquad(25)$$
for all $f\in\mathcal{F}$. From the second part of Assumption 3, one has for every subset $A\subseteq\mathcal{P}_d$, every $\mathbf{x}_A\in\mathcal{X}_A$ and every $l\in A$
$$P_l\bigl(k_A(\cdot,\mathbf{x}_A)\bigr)=\prod_{l'\in A,\,l'\neq l}k_{l'}(\cdot,x_{l'})\int_{\mathcal{X}_l}k_l(t,x_l)\,p_{X_l}(t)\,dt=0,$$
meaning that Eq. (4.5) from Kuo et al. (2010) is satisfied. From Theorem 4.1 of Kuo et al. (2010), we can now state that $g^{[r]}$ has a unique orthogonal decomposition given by
$$g^{[r]}=\sum_{A\subseteq\mathcal{P}_d}g^{[r]}_A\quad\text{where}\quad g^{[r]}_A=\sum_{B\subseteq A}(-1)^{|A|-|B|}P_{-B}(g^{[r]})$$
with $P_{-B}=\prod_{l\notin B}P_l$. Since the decomposition is orthogonal, we further have
$$\|g^{[r]}\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\|g^{[r]}_A\|^2_{\mathcal{F}}\quad\text{and}\quad\|g^{[r]}_A\|^2_{\mathcal{F}}=\sum_{B\subseteq A}(-1)^{|A|-|B|}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}.$$
It thus remains to compute $\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}$. We first write the projection:
$$\begin{aligned}
P_{-B}(g^{[r]})&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\int_{\mathcal{X}_{-B}}k_{\mathbf{X}}(\mathbf{x},\mathbf{x}')\,p_{\mathbf{X}_{-B}}(\mathbf{x}_{-B})\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\,d\mathbf{x}_{-B}\\
&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\Bigl(\prod_{l\notin B}\int_{\mathcal{X}_l}\bigl(1+k_l(x_l,x'_l)\bigr)p_{X_l}(x_l)\,dx_l\Bigr)\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\\
&=\int_{\mathcal{X}}\int_{\mathcal{Y}}\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\,[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'\,dy'\\
&=\int_{\mathcal{X}_B}\int_{\mathcal{Y}}\prod_{l\in B}\bigl(1+k_l(x_l,x'_l)\bigr)\,\varphi_r(y')\Bigl(\int_{\mathcal{X}_{-B}}[p_{\mathbf{X}Y}(\mathbf{x}',y')-p_{\mathbf{X}}(\mathbf{x}')p_Y(y')]\,d\mathbf{x}'_{-B}\Bigr)d\mathbf{x}'_B\,dy'\\
&=\int_{\mathcal{X}_B}\int_{\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,\varphi_r(y')\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}'_B\,dy'
\end{aligned}$$
and its norm then equals
$$\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}=\int_{\mathcal{X}_B\times\mathcal{X}_B}\int_{\mathcal{Y}\times\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,\varphi_r(y)\varphi_r(y')\,[p_{\mathbf{X}_BY}(\mathbf{x}_B,y)-p_{\mathbf{X}_B}(\mathbf{x}_B)p_Y(y)]\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}_B\,d\mathbf{x}'_B\,dy\,dy'.$$
Finally, we have
$$\mathrm{HSIC}(\mathbf{X},Y)=\sum_{r=1}^{\infty}\|g^{[r]}\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\sum_{r=1}^{\infty}\|g^{[r]}_A\|^2_{\mathcal{F}}=\sum_{A\subseteq\mathcal{P}_d}\sum_{B\subseteq A}(-1)^{|A|-|B|}\sum_{r=1}^{\infty}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}},$$
and, summing the previous expression of $\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}$ over $r$ and using Mercer's representation $\sum_{r}\varphi_r(y)\varphi_r(y')=k_Y(y,y')$ once more,
$$\sum_{r=1}^{\infty}\|P_{-B}(g^{[r]})\|^2_{\mathcal{F}}=\int_{\mathcal{X}_B\times\mathcal{X}_B}\int_{\mathcal{Y}\times\mathcal{Y}}k_B(\mathbf{x}_B,\mathbf{x}'_B)\,k_Y(y,y')\,[p_{\mathbf{X}_BY}(\mathbf{x}_B,y)-p_{\mathbf{X}_B}(\mathbf{x}_B)p_Y(y)]\,[p_{\mathbf{X}_BY}(\mathbf{x}'_B,y')-p_{\mathbf{X}_B}(\mathbf{x}'_B)p_Y(y')]\,d\mathbf{x}_B\,d\mathbf{x}'_B\,dy\,dy'=\mathrm{HSIC}(\mathbf{X}_B,Y),$$
which concludes the proof.

A.4 Proof of Proposition 2
Proof.
We begin with the integral formulation of HSIC as in Eq. (22) and plug in the kernel defined in Eq. (13):
$$\begin{aligned}
\mathrm{HSIC}(\mathbf{X}_A,Y)&=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}k_A(\mathbf{x}_A,\mathbf{x}'_A)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}'_A,y')-p_{\mathbf{X}_A}(\mathbf{x}'_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{x}'_A\,dy\,dy'\\
&=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A)}\sqrt{p_{\mathbf{X}_A}(\mathbf{x}'_A)}}\prod_{l\in A}\frac{1}{h}K\Bigl(\frac{x_l-x'_l}{h}\Bigr)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}'_A,y')-p_{\mathbf{X}_A}(\mathbf{x}'_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{x}'_A\,dy\,dy'.
\end{aligned}$$
We then use the change of variables $u_l=(x_l-x'_l)/h$, which leads to
$$\mathrm{HSIC}(\mathbf{X}_A,Y)=\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A)}\sqrt{p_{\mathbf{X}_A}(\mathbf{x}_A-h\mathbf{u}_A)}}\prod_{l\in A}K(u_l)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A-h\mathbf{u}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A-h\mathbf{u}_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{u}_A\,dy\,dy'.$$
Letting $h\to 0$, this converges to
$$\begin{aligned}
&\int_{\mathcal{X}_A\times\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{p_{\mathbf{X}_A}(\mathbf{x}_A)}\prod_{l\in A}K(u_l)\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y')]\,d\mathbf{x}_A\,d\mathbf{u}_A\,dy\,dy'\\
&\quad=\int_{\mathcal{X}_A}\prod_{l\in A}K(u_l)\,d\mathbf{u}_A\int_{\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}\frac{1}{p_{\mathbf{X}_A}(\mathbf{x}_A)}\,k_Y(y,y')\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y)-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y)]\,[p_{\mathbf{X}_AY}(\mathbf{x}_A,y')-p_{\mathbf{X}_A}(\mathbf{x}_A)p_Y(y')]\,d\mathbf{x}_A\,dy\,dy'\\
&\quad=\int_{\mathcal{X}_A}\int_{\mathcal{Y}\times\mathcal{Y}}k_Y(y,y')\,[p_{Y|\mathbf{X}_A=\mathbf{x}_A}(y)-p_Y(y)]\,[p_{Y|\mathbf{X}_A=\mathbf{x}_A}(y')-p_Y(y')]\,p_{\mathbf{X}_A}(\mathbf{x}_A)\,d\mathbf{x}_A\,dy\,dy'
\end{aligned}$$
where we have used $\int_u K(u)\,du=1$. Proposition 2 then follows by noting that the last equation equals $\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)$ thanks to the integral formulation of the MMD.

A.5 Proof of Proposition 3
Proof.
Assuming Mercer's theorem holds, we have $k(y,y')=\sum_{r=1}^{\infty}\varphi_r(y)\varphi_r(y')$ and
$$\mathbb{E}(\chi_n)=\sum_{r=1}^{\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\bigl[\varphi_r(y^{(i)})\,\varphi_r(y^{(\sigma_n(i))})\bigr]\qquad(26)$$
$$=\sum_{r=1}^{\infty}\mathbb{E}\bigl[\varphi_r(y^{(1)})\,\varphi_r(y^{(\sigma_n(1))})\bigr]\;\longrightarrow\;\sum_{r=1}^{\infty}\mathbb{E}\bigl[\mathbb{E}[\varphi_r(Y)|V]\,\mathbb{E}[\varphi_r(Y)|V]\bigr]\qquad(27)$$
$$=\sum_{r=1}^{\infty}\mathbb{E}\bigl[\mathbb{E}[\varphi_r(Y)|V]^2\bigr]=\mathbb{E}_V\,\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|V}}\,k_Y(\xi,\xi')\qquad(28)$$
where (26) is obtained by the absolute convergence of Mercer's series, (27) by applying Eq. (34) in the proof of Proposition 3.2 from Gamboa et al. (2020) to $f=g=\varphi_r$ (which is bounded since $k$ is bounded) together with the absolute convergence of Mercer's series, and (28) with Eq. (9). The McDiarmid concentration inequality given in Theorem A.1 in the proof of Proposition 3.2 from Gamboa et al. (2020) is unchanged and concludes the proof.

A.6 Proof of Lemma 3

Proof.
We follow closely the proof of Theorem 1 from Song et al. (2016). We only need to prove that for a subset $A\subseteq\mathcal{P}_d$ such that $l\notin A$,
$$\mathrm{val}(A\cup\{l\})-\mathrm{val}(A)=\mathrm{val}'(B\cup\{l\})-\mathrm{val}'(B)\qquad(29)$$
where $B=\mathcal{P}_d\setminus(A\cup\{l\})$. We first need the generalized law of total variance for the MMD from Proposition 1:
$$\mathrm{MMD}^2_{tot}=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]+\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr).$$
Now we prove the equality (29) for val and val$'$ defined in Lemma 3, except that here we work without the denominator $\mathrm{MMD}^2_{tot}$ for better readability (this does not change the proof since this same constant appears in both value functions):
$$\begin{aligned}
\mathrm{val}(A\cup\{l\})-\mathrm{val}(A)&=\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}})\bigr)-\mathbb{E}_{\mathbf{X}_A}\bigl(\mathrm{MMD}^2(\mathbb{P}_Y,\mathbb{P}_{Y|\mathbf{X}_A})\bigr)\\
&=\Bigl\{\mathrm{MMD}^2_{tot}-\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi')\Bigr]\Bigr\}-\Bigl\{\mathrm{MMD}^2_{tot}-\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]\Bigr\}\\
&=\mathbb{E}_{\mathbf{X}_A}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_A}}k_Y(\xi,\xi')\Bigr]-\mathbb{E}_{\mathbf{X}_{A\cup\{l\}}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{A\cup\{l\}}}}k_Y(\xi,\xi')\Bigr]\\
&=\mathbb{E}_{\mathbf{X}_{-(B\cup\{l\})}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-(B\cup\{l\})}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-(B\cup\{l\})}}}k_Y(\xi,\xi')\Bigr]-\mathbb{E}_{\mathbf{X}_{-B}}\Bigl[\mathbb{E}_{\xi\sim\mathbb{P}_{Y|\mathbf{X}_{-B}}}k_Y(\xi,\xi)-\mathbb{E}_{\xi,\xi'\sim\mathbb{P}_{Y|\mathbf{X}_{-B}}}k_Y(\xi,\xi')\Bigr]\\
&=\mathrm{val}'(B\cup\{l\})-\mathrm{val}'(B)
\end{aligned}$$
where we have used the generalized law of total variance for the second equality. The rest of the proof is identical to the one of Theorem 1 from Song et al. (2016).

References
Antoniadis, A. (1984), 'Analysis of variance on function spaces', Math. Operationsforsch. u. Statist., Ser. Statist., 59–71.
Aubin, J. P. (2000), Applied Functional Analysis, 2nd edn, New York: Wiley-Interscience.
Azadkia, M. and Chatterjee, S. (2019), 'A simple measure of conditional dependence', arXiv preprint arXiv:1910.12327.
Bachoc, F., Gamboa, F., Loubes, J.-M. and Venet, N. (2017), 'A gaussian process regression model for distribution inputs', IEEE Transactions on Information Theory (10), 6620–6637.
Baucells, M. and Borgonovo, E. (2013), 'Invariant probabilistic sensitivity analysis', to appear in Management Science.
Borgonovo, E. (2007), 'A new uncertainty importance measure', Reliability Engineering & System Safety (6), 771–784.
Broto, B., Bachoc, F. and Depecker, M. (2020), 'Variance reduction for estimation of shapley effects and adaptation to unknown input distribution', SIAM/ASA Journal on Uncertainty Quantification (2), 693–716.
Castro, J., Gómez, D. and Tejada, J. (2009), 'Polynomial calculation of the shapley value based on sampling', Computers & Operations Research (5), 1726–1730.
Charpentier, A., Elie, R., Laurière, M. and Tran, V. C. (2020), 'Covid-19 pandemic control: balancing detection policy and lockdown intervention under icu sustainability', arXiv preprint arXiv:2005.06526.
Chastaing, G., Gamboa, F., Prieur, C. et al. (2012), 'Generalized hoeffding-sobol decomposition for dependent variables - application to sensitivity analysis', Electronic Journal of Statistics, 2420–2448.
Chatterjee, S. (2020), 'A new coefficient of correlation', Journal of the American Statistical Association, pp. 1–21.
Chwialkowski, K., Strathmann, H. and Gretton, A. (2016), A kernel test of goodness of fit, in 'ICML'.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009), 'Modeling wine preferences by data mining from physicochemical properties', Decision Support Systems (4), 547–553.
Cuturi, M. (2011), Fast global alignment kernels, in 'Proceedings of the 28th International Conference on Machine Learning (ICML-11)', pp. 929–936.
Czado, C. (2019), 'Analyzing dependent data with vine copulas', Lecture Notes in Statistics, Springer.
Da Veiga, S. (2015), 'Global sensitivity analysis with dependence measures', Journal of Statistical Computation and Simulation (7), 1283–1305.
Da Veiga, S. (2016), New perspectives for sensitivity analysis, in 'Proceedings of Mascot-Num 2016 conference', Toulouse, France. https://mascot2016.sciencesconf.org/resource/page/id/2.
Da Veiga, S. and Gamboa, F. (2013), 'Efficient estimation of sensitivity indices', Journal of Nonparametric Statistics (3), 573–595.
Da Veiga, S., Wahl, F. and Gamboa, F. (2009), 'Local polynomial estimation for sensitivity analysis on models with correlated inputs', Technometrics (4), 452–463.
Di Domenico, L., Pullano, G., Sabbatini, C. E., Boëlle, P.-Y. and Colizza, V. (2020), 'Expected impact of lockdown in Île-de-France and possible exit strategies', medRxiv.
Ditlevsen, O. and Madsen, H., eds (1996), Structural reliability methods, Wiley & Sons.
Dua, D. and Graff, C. (2017), 'UCI machine learning repository'.
Durrande, N., Ginsbourger, D., Roustant, O. and Carraro, L. (2012), 'Anova kernels and rkhs of zero mean functions for model-based sensitivity analysis', Journal of Multivariate Analysis, 57–67.
Ferraty, F. and Vieu, P. (2006), Nonparametric functional data analysis: theory and practice, Springer.
Fort, J.-C., Klein, T. and Rachdi, N. (2016), 'New sensitivity analysis subordinated to a contrast', Communications in Statistics - Theory and Methods (15), 4349–4364.
Gamboa, F., Gremaud, P., Klein, T. and Lagnoux, A. (2020), 'Global sensitivity analysis: a new generation of mighty estimators based on rank statistics', hal-02474902v3.
Gamboa, F., Janon, A., Klein, T. and Lagnoux, A. (2013), 'Sensitivity indices for multivariate outputs', Comptes Rendus Mathematique (7-8), 307–310.
Gärtner, T., Flach, P. and Wrobel, S. (2003), On graph kernels: Hardness results and efficient alternatives, in 'Learning theory and kernel machines', Springer, pp. 129–143.
Gorham, J. and Mackey, L. (2015), 'Measuring sample quality with Stein's method', Advances in Neural Information Processing Systems, 226–234.
Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005a), Measuring statistical dependence with Hilbert-Schmidt norms, in S. Jain, H. Simon and E. Tomita, eds, 'Algorithmic Learning Theory', Vol. 3734 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 63–77.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B. and Smola, A. J. (2008), A kernel statistical test of independence, in 'Advances in Neural Information Processing Systems', pp. 585–592.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005b), 'Kernel methods for measuring independence', The Journal of Machine Learning Research, 2075–2129.
Harchaoui, Z. and Bach, F. (2007), Image classification with segmentation graph kernels, in '2007 IEEE Conference on Computer Vision and Pattern Recognition', IEEE, pp. 1–8.
Hoeffding, W. (1948), 'A class of statistics with asymptotically normal distributions', Annals of Mathematical Statistics, 293–325.
Homma, T. and Saltelli, A. (1996), 'Importance measures in global sensitivity analysis of nonlinear models', Reliability Engineering & System Safety (1), 1–17.
Iooss, B. and Prieur, C. (2019), 'Shapley effects for sensitivity analysis with dependent inputs: comparisons with Sobol' indices, numerical estimation and applications', International Journal for Uncertainty Quantification, 493–514.
Ishigami, T. and Homma, T. (1990), An importance quantification technique in uncertainty analysis for computer models, in '[1990] Proceedings. First International Symposium on Uncertainty Modeling and Analysis', IEEE, pp. 398–403.
Janon, A., Klein, T., Lagnoux, A., Nodet, M. and Prieur, C. (2014), 'Asymptotic normality and efficiency of two sobol index estimators', ESAIM: Probability and Statistics, 342–364.
Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K. and Gretton, A. (2017), A linear-time kernel goodness-of-fit test, in 'Advances in Neural Information Processing Systems', pp. 262–271.
Kermack, W. O. and McKendrick, A. G. (1927), 'A contribution to the mathematical theory of epidemics', Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character (772), 700–721.
Kuo, F., Sloan, I., Wasilkowski, G. and Woźniakowski, H. (2010), 'On decompositions of multivariate functions', Mathematics of Computation (270), 953–966.
Lamboni, M., Monod, H. and Makowski, D. (2011), 'Multivariate sensitivity analysis to measure global contribution of input factors in dynamic models', Reliability Engineering & System Safety, 450–459.
Li, L., Lu, Z., Feng, J. and Wang, B. (2012), 'Moment-independent importance measure of basic variable and its state dependent parameter solution', Structural Safety, 40–47.
Lundberg, S. M. and Lee, S.-I. (2017), A unified approach to interpreting model predictions, in 'Advances in Neural Information Processing Systems', pp. 4765–4774.
Magal, P. and Webb, G. (2020), 'Predicting the number of reported and unreported cases for the covid-19 epidemic in South Korea, Italy, France and Germany', medRxiv.
Mara, T. A., Tarantola, S. and Annoni, P. (2015), 'Non-parametric methods for global sensitivity analysis of model output with dependent inputs', Environmental Modelling & Software, 173–183.
Marrel, A. and Chabridon, V. (2020), 'Statistical developments for target and conditional sensitivity analysis: application on safety studies for nuclear reactor', hal-02541142.
Marrel, A., Iooss, B., Van Dorpe, F. and Volkova, E. (2008), 'An efficient methodology for modeling complex computer codes with Gaussian processes', Computational Statistics and Data Analysis, 4731–4744.
Maume-Deschamps, V. and Niang, I. (2018), 'Estimation of quantile oriented sensitivity indices', Statistics & Probability Letters, 122–127.
Moutoussamy, V., Nanty, S. and Pauwels, B. (2015), 'Emulators for stochastic simulation codes', ESAIM: Proceedings and Surveys, 116–155.
Muandet, K., Fukumizu, K., Dinuzzo, F. and Schölkopf, B. (2012), 'Learning from distributions via support measure machines', Advances in Neural Information Processing Systems, 10–18.
Oates, C. J., Girolami, M. and Chopin, N. (2017), 'Control functionals for Monte Carlo integration', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (3), 695–718.
Owen, A. (2014), 'Sobol' indices and Shapley value', SIAM/ASA Journal on Uncertainty Quantification, 245–251.
Perrin, G. and Defaux, G. (2019), 'Efficient estimation of reliability-oriented sensitivity indices', Journal of Scientific Computing (3).
Plischke, E., Rabitti, G. and Borgonovo, E. (2020), 'Computing shapley effects for sensitivity analysis', arXiv preprint arXiv:2002.12024.
Rahman, S. (2016), 'The f-sensitivity index', SIAM/ASA Journal on Uncertainty Quantification (1), 130–162.
Ramon, J. and Gärtner, T. (2003), Expressivity versus efficiency of graph kernels, in 'Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences', pp. 65–74.
Sakoe, H. and Chiba, S. (1978), 'Dynamic programming algorithm optimization for spoken word recognition', IEEE Transactions on Acoustics, Speech, and Signal Processing (1), 43–49.
Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. and Tarantola, S. (2010), 'Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index', Computer Physics Communications (2), 259–270.
Saltelli, A., Tarantola, S. and Chan, K.-S. (1999), 'A quantitative model-independent method for global sensitivity analysis of model output', Technometrics (1), 39–56.
Shapley, L. (1953), A value for n-persons game, in H. Kuhn and A. Tucker, eds, 'Contributions to the Theory of Games II, Annals of Mathematics Studies', Princeton University Press, Princeton, NJ.
Smola, A., Gretton, A., Song, L. and Schölkopf, B. (2007), A Hilbert space embedding for distributions, in 'Algorithmic Learning Theory', Vol. 4754, Springer, pp. 13–31.
Sobol', I. (1993), 'Sensitivity estimates for non linear mathematical models', Mathematical Modelling and Computational Experiments, 407–414.
Solís, M. (2019), 'Non-parametric estimation of the first-order sobol indices with bootstrap bandwidth', Communications in Statistics - Simulation and Computation, pp. 1–16.
Song, E., Nelson, B. L. and Staum, J. (2016), 'Shapley effects for global sensitivity analysis: Theory and computation', SIAM/ASA Journal on Uncertainty Quantification (1), 1060–1083.
Song, L. (2008), Learning via Hilbert Space Embedding of Distributions, PhD thesis, University of Sydney.
Song, L., Smola, A., Gretton, A., Bedo, J. and Borgwardt, K. (2012), 'Feature selection via dependence maximization', The Journal of Machine Learning Research, 1393–1434.
Song, L., Smola, A., Gretton, A., Borgwardt, K. M. and Bedo, J. (2007), Supervised feature selection via dependence estimation, in 'Proceedings of the 24th International Conference on Machine Learning', pp. 823–830.
Spagnol, A. (2020), Kernel-based sensitivity indices for high-dimensional optimization problems, PhD thesis, University Lyon, France.
Spagnol, A., Le Riche, R. and Da Veiga, S. (2019), 'Global sensitivity analysis for optimization with variable selection', SIAM/ASA Journal on Uncertainty Quantification (2), 417–443.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. and Schölkopf, B. (2009), Kernel choice and classifiability for RKHS embeddings of probability distributions, in 'Advances in Neural Information Processing Systems', pp. 1750–1758.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B. and Lanckriet, G. R. (2010), 'Hilbert space embeddings and metrics on probability measures', The Journal of Machine Learning Research, 1517–1561.
Stein, C. et al. (1972), A bound for the error in the normal approximation to the distribution of a sum of dependent random variables, in 'Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory', The Regents of the University of California.
Steinwart, I., Hush, D. and Scovel, C. (2006), 'An explicit description of the reproducing kernel Hilbert spaces of gaussian rbf kernels', IEEE Transactions on Information Theory (10), 4635–4643.
Szabó, Z., Sriperumbudur, B. K., Póczos, B. and Gretton, A. (2016), 'Learning theory for distribution regression', The Journal of Machine Learning Research (1), 5272–5311.
Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), 'Measuring and testing dependence by correlation of distances', The Annals of Statistics (6), 2769–2794.
Terraz, T., Ribes, A., Fournier, Y., Iooss, B. and Raffin, B. (2017), Large scale in transit global sensitivity analysis avoiding intermediate files, in 'Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing)', Denver, USA.
Wahba, G., Wang, Y., Gu, C., Klein, R., Klein, B. et al. (1995), 'Smoothing spline anova for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman memorial lecture', The Annals of Statistics 23.