Gaussian linear approximation for the estimation of the Shapley effects
Baptiste Broto, François Bachoc, Marine Depecker, Jean-Marc Martinez
CEA, LIST, Université Paris-Saclay, F-91120 Palaiseau, France
Institut de Mathématiques de Toulouse, Université Paul Sabatier, F-31062 Toulouse, France
CEA, DES/DM2S, Université Paris-Saclay, F-91191 Gif-sur-Yvette, France

June 4, 2020
Abstract
In this paper, we address the estimation of the sensitivity indices called "Shapley effects". These sensitivity indices make it possible to handle dependent input variables. The Shapley effects are generally difficult to estimate, but they are easily computable in the Gaussian linear framework. The aim of this work is to use the values of the Shapley effects in an approximated Gaussian linear framework as estimators of the true Shapley effects corresponding to a non-linear model. First, we assume that the input variables are Gaussian with small variances. We provide rates of convergence of the estimated Shapley effects to the true Shapley effects. Then, we focus on the case where the inputs are given by a non-Gaussian empirical mean. We prove that, under some mild assumptions, when the number of terms in the empirical mean increases, the difference between the true Shapley effects and the estimated Shapley effects given by the Gaussian linear approximation converges to 0. Our theoretical results are supported by numerical studies, which show that the Gaussian linear approximation is accurate and significantly decreases the computational time.
Sensitivity analysis, and particularly sensitivity indices, have become important tools in applied sciences. The aim of sensitivity indices is to quantify the impact of the input variables $X_1, \dots, X_p$ on the output $Y = f(X_1, \dots, X_p)$ of a model $f$. This information improves the interpretability of the model. In global sensitivity analysis, the input variables are assumed to be random variables. In this framework, the Sobol indices [Sob93] were the first suggested indices to be applicable to general classes of models. Nevertheless, one of the most important limitations of these indices is the assumption of independence between the input variables. Hence, many variants of the Sobol indices have been suggested for dependent input variables [MT12, Cha13, MTA15, CGP12].

Recently, Owen defined new sensitivity indices in [Owe14] called "Shapley effects". These sensitivity indices have many advantages over the Sobol indices for dependent inputs [IP19]. For general models, [SNS16] suggested an estimator of the Shapley effects. However, this estimator requires the ability to generate samples from the conditional distributions of the input variables. Then, a consistent estimator was suggested in [BBD20], requiring only a sample of the inputs-output. However, in practice, this estimator requires a large sample and is very costly in terms of computational time.

Let us now consider the framework where the distribution of $X_1, \dots, X_p$ is Gaussian and $f$ is linear, which we call the Gaussian linear framework. This framework is considered relatively commonly (see for example [KHF+06, HT11, Ros04, Clo19]),
since the unknown function $f(X_1, \dots, X_p)$ can be approximated by its linear approximation around $E(X)$. The Gaussian linear setting is highly beneficial, since the theoretical values of the Shapley effects can be computed explicitly [OP17, IP19, BBDM19, BBCM20]. These values depend on the covariance matrix of the inputs and on the coefficients of the linear model. An algorithm enabling to compute these values is implemented as the function "ShapleyLinearGaussian" in the R package sensitivity [IAP20]. It is shown in [BBDM19] that this computation is almost instantaneous when the number $p$ of input variables is smaller than 15, but becomes more difficult for larger $p$. However, "ShapleyLinearGaussian" uses the possible block-diagonal structure of the covariance matrix to reduce the dimension, thereby reducing the computation cost [BBDM19].

The aim of this paper is to use the Shapley values computed from a Gaussian linear model as estimates of the true Shapley values corresponding to a non-linear model $f$. We provide convergence guarantees, as the Gaussian linear approximation becomes more and more accurate. We address the two following settings.

First, we assume that $X = (X_1, \dots, X_p)$ is a Gaussian vector with variances decreasing to 0, and $f$ is not linear. We give the rate of convergence of the difference between the true Shapley effects and the ones given by the first-order Taylor polynomial of $f$ at the mean of $X$. To estimate the Shapley effects in a broader setting, we also provide the rate of convergence when the Taylor polynomial is unknown and the linear approximation is given by a finite difference approximation and a linear regression. To strengthen these theoretical results, we compare the three linear approximations on simulated data.

Second, we consider the case where the input vector is non-Gaussian and given by an empirical mean, and the model $f$ is non-linear. We address the estimators of the Shapley values obtained by treating the input vector as Gaussian and the model as linear. We show that, as the number of summands goes to infinity, the estimators of the Shapley values converge to the true Shapley values, corresponding to the non-Gaussian input vector and the non-linear model. Then, we treat the particular case where the Shapley effects evaluate the impact of the individual estimation errors on a global estimation error. In numerical experiments, we compare the estimator of the Shapley effects given by the Gaussian linear framework with the estimator of the Shapley effects given by the general procedure of [BBD20], to the advantage of the former.

The rest of the article is organized as follows. In Section 2, we recall the definition of the Shapley effects and we detail the particular form of the Gaussian linear framework. Section 3 provides the rates of convergence for Gaussian inputs and non-linear models. In Section 4, we address the case where the inputs are given by an empirical mean and $f$ is non-linear. The conclusions are given in Section 5. All the proofs are postponed to the supplementary material.

Let $(X_i)_{i \in [1:p]}$ be random input variables on $\mathbb{R}^p$ and let $Y = f(X)$ be the real random output variable, which is square integrable. We assume that $\mathrm{Var}(Y) \neq 0$. Here, $f : \mathbb{R}^p \to \mathbb{R}$ can be a numerical simulation model [SWNW03]. If $u \subset [1:p]$ and $x = (x_i)_{i \in [1:p]} \in \mathbb{R}^p$, we write $x_u := (x_i)_{i \in u}$.
We can define the Shapley effects as in [Owe14], where for each input variable $X_i$, the Shapley effect is
$$\eta_i(X, f) := \frac{1}{p\,\mathrm{Var}(Y)} \sum_{u \subset -i} \binom{p-1}{|u|}^{-1} \Big( \mathrm{Var}\big(E(Y \mid X_{u \cup \{i\}})\big) - \mathrm{Var}\big(E(Y \mid X_u)\big) \Big), \qquad (1)$$
where $-i$ is the set $[1:p] \setminus \{i\}$. We let $\eta(X, f)$ be the vector of dimension $p$ composed of $\eta_1(X, f), \dots, \eta_p(X, f)$. One can see in Equation (1) that adding $X_i$ to $X_u$ changes the conditional expectation of $Y$ and increases the variability of this conditional expectation. The Shapley effect $\eta_i(X, f)$ is large when, on average, the variance of this conditional expectation increases significantly when $X_i$ is observed. Thus, a large Shapley effect $\eta_i(X, f)$ corresponds to an important input variable $X_i$.

The Shapley effects have interesting properties for global sensitivity analysis. Indeed, there is only one Shapley effect for each variable (contrary to the Sobol indices). Moreover, the sum of all the Shapley effects is equal to 1 (see [Owe14]) and all these values lie in $[0, 1]$, even with dependent inputs. This is very convenient for the interpretation of these sensitivity indices.

An estimator of the Shapley effects has been suggested in [SNS16]. It is implemented in the R package sensitivity as the function "shapleyPermRand". However, it requires the ability to generate samples from the conditional distributions of the inputs, which limits the application framework. [BBD20] suggested another estimator, which requires only a sample of the inputs-output. This estimator uses nearest-neighbour methods to mimic the generation of samples from these conditional distributions. It is implemented in the R package sensitivity as the function "shapleySubsetMC". However, in practice, this estimator requires a large sample and is very costly in terms of computational time.

Consider now the case where $X \sim \mathcal{N}(\mu, \Sigma)$, with $\Sigma \in \mathcal{S}_p^{++}(\mathbb{R})$, and where the model is linear, that is, $f : x \mapsto \beta_0 + \beta^T x$ for a fixed $\beta_0 \in \mathbb{R}$ and a fixed vector $\beta \in \mathbb{R}^p$. In this framework, the sensitivity indices can be calculated explicitly [OP17]:
$$\eta_i(X, f) = \frac{1}{p\,\mathrm{Var}(Y)} \sum_{u \subset -i} \binom{p-1}{|u|}^{-1} \big( \mathrm{Var}(Y \mid X_u) - \mathrm{Var}(Y \mid X_{u \cup \{i\}}) \big), \qquad (2)$$
with
$$\mathrm{Var}(Y \mid X_u) = \mathrm{Var}\big( \beta_{-u}^T X_{-u} \mid X_u \big) = \beta_{-u}^T \big( \Sigma_{-u,-u} - \Sigma_{-u,u} \Sigma_{u,u}^{-1} \Sigma_{u,-u} \big) \beta_{-u}, \qquad (3)$$
where $\Gamma_{v,w} := (\Gamma_{i,j})_{i \in v, j \in w}$. Thus, in the Gaussian linear framework, the Shapley effects are functions of the parameters $\beta$ and $\Sigma$. The Gaussian linear framework is thus very beneficial from an estimation point of view, because in general one needs to estimate conditional moments of the form $\mathrm{Var}(E(Y \mid X_v))$ for $v \subset [1:p]$ using nearest-neighbour methods, while in the Gaussian linear framework, only standard matrix-vector operations are required.
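To make equations (2)-(3) concrete, here is a minimal Python sketch of the Gaussian linear computation (our own illustration, not the authors' code; the reference implementation is the function "ShapleyLinearGaussian" of the R package sensitivity, and the function names below are ours):

```python
import itertools
from math import comb

import numpy as np

def cond_var(beta, Sigma, u):
    """Var(Y | X_u) for Y = beta_0 + beta^T X with X ~ N(mu, Sigma), eq. (3)."""
    p = len(beta)
    rest = [i for i in range(p) if i not in u]  # the set -u
    if not rest:                # conditioning on all inputs: no variance left
        return 0.0
    S = Sigma[np.ix_(rest, rest)]
    if u:                       # Schur complement Sigma_{-u,-u} - Sigma_{-u,u} Sigma_{u,u}^{-1} Sigma_{u,-u}
        S = S - Sigma[np.ix_(rest, u)] @ np.linalg.solve(
            Sigma[np.ix_(u, u)], Sigma[np.ix_(u, rest)])
    b = beta[rest]
    return float(b @ S @ b)

def shapley_linear_gaussian(beta, Sigma):
    """Shapley effects (eta_1, ..., eta_p) of eq. (2), by enumeration of subsets."""
    beta, Sigma = np.asarray(beta, float), np.asarray(Sigma, float)
    p = len(beta)
    eta = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            for u in itertools.combinations(others, size):
                u = list(u)
                eta[i] += (cond_var(beta, Sigma, u)
                           - cond_var(beta, Sigma, u + [i])) / comb(p - 1, size)
    return eta / (p * cond_var(beta, Sigma, []))   # divide by p * Var(Y)
```

For independent inputs (diagonal $\Sigma$), the returned vector reduces to the normalized coefficients $\beta_i^2 \Sigma_{i,i} / \beta^T \Sigma \beta$, and in all cases it sums to 1.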
To model uncertain physical values, it can be convenient to consider them as a Gaussian vector. For example, the international libraries [McL05, JEF13, JEN11] on real data from the field of nuclear safety provide the average and covariance matrix of the input variables, so it is natural to model them with the Gaussian distribution. Hence, to quantify the impact of the uncertainties of the physical inputs of a model on a quantity of interest, it is common to estimate the Shapley effects of Gaussian inputs. The model $f$ is in general non-linear, and the estimation procedures dedicated to non-linear models [SNS16, BBD20] are typically computationally costly, with an accuracy that can be sensitive to the specific situation. Nevertheless, when the uncertainty on the inputs becomes small, the input vector converges to its mean $\mu$, and a linear approximation of the model at $\mu$ seems more and more appropriate.

To formalize this idea, let $X^{\{n\}} \sim \mathcal{N}(\mu^{\{n\}}, \Sigma^{\{n\}})$ be the input vector, with a sequence of mean vectors $(\mu^{\{n\}})_n$ and a sequence of covariance matrices $(\Sigma^{\{n\}})_n$. The index $n$ can represent for instance the number of measures of an uncertain input, in which case the covariance matrix $\Sigma^{\{n\}}$ will decrease with $n$.

Assumption 1. The covariance matrix $\Sigma^{\{n\}}$ decreases to $0$ in such a way that the eigenvalues of $a^{\{n\}} \Sigma^{\{n\}}$ are lower-bounded and upper-bounded in $\mathbb{R}_+^*$, with $a^{\{n\}} \to +\infty$ as $n \to +\infty$. Moreover, $\mu^{\{n\}} \to \mu$ as $n \to +\infty$, where $\mu$ is a fixed vector.

In Assumption 1, the condition on the eigenvalues of $a^{\{n\}} \Sigma^{\{n\}}$ means that the correlation matrix obtained from $\Sigma^{\{n\}}$ cannot get close to a singular matrix. This condition is necessary in our proofs.

If $j \in \mathbb{N}$ and if $f$ is $\mathcal{C}^j$ at $\mu^{\{n\}}$, we will write
$$f_j^{\{n\}}(x) = \frac{1}{j!} D^j f(\mu^{\{n\}})(x - \mu^{\{n\}})$$
(where $D^j f(\mu^{\{n\}})(z)$ is the image of $(z, z, \dots, z) \in (\mathbb{R}^p)^j$ through the multilinear function $D^j f(\mu^{\{n\}})$, which gathers all the partial derivatives of order $j$ of $f$ at $\mu^{\{n\}}$) and
$$R_j^{\{n\}}(x) = f(x) - \sum_{l=0}^{j} f_l^{\{n\}}(x),$$
the remainder of the $j$-th order Taylor approximation of $f$ at $\mu^{\{n\}}$. In particular, $f_1^{\{n\}}(x) = Df(\mu^{\{n\}})(x - \mu^{\{n\}})$, where $Df = D^1 f$. We identify the linear function $Df(\mu^{\{n\}})$ with the corresponding row gradient vector of size $1 \times p$, and the bilinear function $D^2 f(\mu^{\{n\}})$ with the corresponding Hessian matrix of size $p \times p$. We also write $f_1(x) = Df(\mu)(x - \mu)$. Finally, we assume that the function $f$ is subpolynomial, that is, there exist $k \in \mathbb{N}$ and $C > 0$ such that, for all $x \in \mathbb{R}^p$, $|f(x)| \leq C(1 + \|x\|^k)$.

First, we study the asymptotic difference between the Shapley effects given by the true model $f$ and the ones given by the first-order Taylor polynomial of $f$ at $\mu^{\{n\}}$. Remark that adding a constant to the function does not affect the values of the Shapley effects. Thus, the Shapley effects $\eta(X^{\{n\}}, f(\mu^{\{n\}}) + f_1^{\{n\}})$ given by the first-order Taylor polynomial of $f$ at $\mu^{\{n\}}$ are equal to $\eta(X^{\{n\}}, f_1^{\{n\}})$. In the next proposition, we show that approximating the true Shapley effects of the non-linear $f$ by the Shapley effects of the linear approximation $f_1^{\{n\}}$ yields a vanishing error of order $1/a^{\{n\}}$ as $n \to \infty$.

Proposition 1.
Assume that $X^{\{n\}} \sim \mathcal{N}(\mu^{\{n\}}, \Sigma^{\{n\}})$, Assumption 1 holds, $f$ is subpolynomial and $\mathcal{C}^3$ on a neighbourhood of $\mu$, and $Df(\mu) \neq 0$. Then,
$$\big\| \eta(X^{\{n\}}, f) - \eta(X^{\{n\}}, f_1^{\{n\}}) \big\| = O\!\left( \frac{1}{a^{\{n\}}} \right).$$

We remark that, when $f$ is a computer model, it can be the case that the gradient vector is available. First, the computer model can already provide it, by means of the Adjoint Sensitivity Method [Cac03]. Second, automatic differentiation methods can be used on the source file of the code and yield a differentiated code [HP04].

Remark 1. The rate $O(1/a^{\{n\}})$ is the best rate that we can reach under the assumptions of Proposition 1. Indeed, letting $X^{\{n\}} = (X_1^{\{n\}}, X_2^{\{n\}}) \sim \mathcal{N}(0, \frac{1}{a^{\{n\}}} I_2)$ and $Y^{\{n\}} = f(X^{\{n\}}) = X_1^{\{n\}} + (X_2^{\{n\}})^2$, we have $\eta_1(X^{\{n\}}, f_1^{\{n\}}) = 1$ and $\eta_2(X^{\{n\}}, f_1^{\{n\}}) = 0$. Moreover, $\eta_1(X^{\{n\}}, f) = \frac{a^{\{n\}}}{a^{\{n\}} + 2}$ and $\eta_2(X^{\{n\}}, f) = \frac{2}{a^{\{n\}} + 2}$. Thus, the rate of the difference between $\eta(X^{\{n\}}, f)$ and $\eta(X^{\{n\}}, f_1^{\{n\}})$ is exactly $1/a^{\{n\}}$.

In Proposition 1, we bound the difference between the Shapley effects given by $f$ and the ones given by the first-order Taylor polynomial of $f$. Moreover, when the matrix $a^{\{n\}} \Sigma^{\{n\}}$ converges, Proposition 2 shows that the Shapley effects given by the Taylor polynomial converge.

Proposition 2.
Assume that $X^{\{n\}} \sim \mathcal{N}(\mu^{\{n\}}, \Sigma^{\{n\}})$, Assumption 1 holds, $f$ is $\mathcal{C}^2$ on a neighbourhood of $\mu$, $Df(\mu) \neq 0$ and $a^{\{n\}} \Sigma^{\{n\}} \to \Sigma \in \mathcal{S}_p^{++}(\mathbb{R})$ as $n \to +\infty$. Then, if $X^* \sim \mathcal{N}(\mu, \Sigma)$,
$$\big\| \eta(X^{\{n\}}, f_1^{\{n\}}) - \eta(X^*, f_1) \big\| = O(\|\mu^{\{n\}} - \mu\|) + O(\|a^{\{n\}} \Sigma^{\{n\}} - \Sigma\|).$$

Proposition 1 shows that replacing $f$ by its first-order Taylor polynomial $f_1^{\{n\}}$ does not impact the Shapley effects significantly when the input variances are small. Thus, the knowledge of $f_1^{\{n\}}$ would enable us to use the explicit expression (3) of the Gaussian linear case, and for instance the function "ShapleyLinearGaussian" of the package sensitivity, to estimate the true Shapley effects $\eta(X^{\{n\}}, f)$. However, in practice, the first-order Taylor polynomial $f_1^{\{n\}}$ is not always available, except for instance in the situations described above. Thus, one may be interested in replacing the true first-order Taylor polynomial $f_1^{\{n\}}$ by an approximation. We will study two such approximations, given by finite differences and by linear regression.

For $h = (h_1, \dots, h_p) \in (\mathbb{R}_+^*)^p$ and writing $(e_1, \dots, e_p)$ for the canonical basis of $\mathbb{R}^p$, let
$$\widehat{D}_h f(x) := \left( \frac{f(x + e_1 h_1) - f(x - e_1 h_1)}{2 h_1}, \ \dots, \ \frac{f(x + e_p h_p) - f(x - e_p h_p)}{2 h_p} \right) \qquad (4)$$
be the approximation of the differential of $f$ at $x$ with the steps $h_1, \dots, h_p$. If $(h^{\{n\}})_n$ is a sequence of $(\mathbb{R}_+^*)^p$ converging to $0$, let
$$\tilde f_{1, h^{\{n\}}}(x) := \tilde f_{1, h^{\{n\}}, \mu^{\{n\}}}(x) := \widehat{D}_{h^{\{n\}}} f(\mu^{\{n\}})(x - \mu^{\{n\}})$$
be the approximation of the first-order Taylor polynomial of $f - f(\mu^{\{n\}})$ at $\mu^{\{n\}}$ with the steps $h_1^{\{n\}}, \dots, h_p^{\{n\}}$. The next proposition ensures that the Shapley effects computed from the true Taylor polynomial and the approximated one are close, for small steps.

Proposition 3. Under the assumptions of Proposition 1, we have
$$\big\| \eta(X^{\{n\}}, f_1^{\{n\}}) - \eta(X^{\{n\}}, \tilde f_{1, h^{\{n\}}}) \big\| = O\big( \|h^{\{n\}}\|^2 \big).$$
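As an illustration, here is a minimal sketch of the central finite-difference approximation (4) and of the resulting linear surrogate, reusing the `shapley_linear_gaussian` helper sketched above (both function names are ours, not the paper's):

```python
import numpy as np

def finite_diff_gradient(f, x, h):
    """Central finite-difference approximation (4) of Df at x, with steps h."""
    x, h = np.asarray(x, float), np.asarray(h, float)
    grad = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h[i]
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h[i])
    return grad

def shapley_finite_diff(f, mu, Sigma):
    """Shapley effects of the surrogate x -> D_h f(mu) (x - mu),
    with the steps h_i = sqrt(Var(X_i)) suggested in Corollary 1 below."""
    h = np.sqrt(np.diag(Sigma))
    beta = finite_diff_gradient(f, mu, h)
    return shapley_linear_gaussian(beta, Sigma)   # from the previous sketch
```

Only $2p$ evaluations of $f$ are needed here, which makes this surrogate essentially free compared with sampling-based estimators of the Shapley effects.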
Then, the next corollary extends Propositions 1 and 2 to the approximated Taylor polynomial based on finite differences.

Corollary 1.
Under the assumptions of Proposition 1, and if $\|h^{\{n\}}\| \leq C_{\sup}/\sqrt{a^{\{n\}}}$ (for example, choosing $h_i^{\{n\}} := \sqrt{\mathrm{Var}(X_i^{\{n\}})}$, the standard deviation of $X_i^{\{n\}}$), we have
$$\big\| \eta(X^{\{n\}}, f) - \eta(X^{\{n\}}, \tilde f_{1, h^{\{n\}}}) \big\| = O\!\left( \frac{1}{a^{\{n\}}} \right).$$
Moreover, if $a^{\{n\}} \Sigma^{\{n\}} \to \Sigma$ as $n \to +\infty$, then, letting $X^* \sim \mathcal{N}(\mu, \Sigma)$,
$$\big\| \eta(X^{\{n\}}, \tilde f_{1, h^{\{n\}}}) - \eta(X^*, f_1) \big\| = O(\|\mu^{\{n\}} - \mu\|) + O(\|a^{\{n\}} \Sigma^{\{n\}} - \Sigma\|) + O\!\left( \frac{1}{a^{\{n\}}} \right).$$

For $n \in \mathbb{N}$ and $N \in \mathbb{N}^*$, let $(X^{\{n\}(l)})_{l \in [1:N]}$ be an i.i.d. sample of $X^{\{n\}}$ of size $N$, and assume that we compute the image of $f$ at each sample point, obtaining the vector $Y^{\{n\}}$. Then, we can approximate $f$ with a linear regression, by least squares. In this case, we estimate the coefficients of the linear regression by the vector
$$\begin{pmatrix} \hat\beta_0^{\{n\}} \\ \hat\beta^{\{n\}} \end{pmatrix} = \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} Y^{\{n\}},$$
where $A^{\{n\}} \in \mathcal{M}_{N, p+1}(\mathbb{R})$ is such that, for all $j \in [1:N]$, the $j$-th line of $A^{\{n\}}$ is $(1 \ X^{\{n\}(j)T})$. The function $f$ is then approximated by
$$\hat f_{\mathrm{lin}}^{\{n\}(N)} : x \mapsto \hat\beta_0^{\{n\}} + \hat\beta^{\{n\}T} x.$$
Remark that the linear function $\hat f_{\mathrm{lin}}^{\{n\}(N)}$ is random, and so the deduced Shapley effects $\eta(X^{\{n\}}, \hat f_{\mathrm{lin}}^{\{n\}(N)})$ are random variables.
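A minimal sketch of this least-squares surrogate (again our own illustration; we use `np.linalg.lstsq` instead of forming $(A^{\{n\}T}A^{\{n\}})^{-1}A^{\{n\}T}$ explicitly, which computes the same least-squares solution more stably):

```python
import numpy as np

def shapley_linear_regression(f, mu, Sigma, N, rng=None):
    """Shapley effects of the least-squares linear surrogate of f,
    fitted on an i.i.d. sample of X ~ N(mu, Sigma) of size N."""
    rng = np.random.default_rng(rng)
    X = rng.multivariate_normal(mu, Sigma, size=N)   # sample of X^{n}
    Y = np.array([f(x) for x in X])                  # outputs Y^{n}
    A = np.column_stack([np.ones(N), X])             # design matrix A^{n}
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)     # (beta_0, beta)
    return shapley_linear_gaussian(coef[1:], Sigma)  # the intercept plays no role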
The next proposition and corollary correspond to Proposition 3 and Corollary 1, for the linear regression approximation of $f$.

Proposition 4. Under Assumption 1, if $f$ is $\mathcal{C}^3$ on a neighbourhood of $\mu$ with $Df(\mu) \neq 0$, there exist $C_{\inf} > 0$, $C_{\sup}^{(1)} < +\infty$ and $C_{\sup}^{(2)} < +\infty$ such that, with probability at least $1 - C_{\sup}^{(1)} \exp(-C_{\inf} N)$, we have
$$\big\| \eta(X^{\{n\}}, f_1^{\{n\}}) - \eta(X^{\{n\}}, \hat f_{\mathrm{lin}}^{\{n\}(N)}) \big\| \leq \frac{C_{\sup}^{(2)}}{\sqrt{a^{\{n\}}}}.$$

Corollary 2. Under the assumptions of Proposition 1, there exist $C_{\inf} > 0$, $C_{\sup}^{(1)} < +\infty$ and $C_{\sup}^{(2)} < +\infty$ such that, with probability at least $1 - C_{\sup}^{(1)} \exp(-C_{\inf} N)$, we have
$$\big\| \eta(X^{\{n\}}, f) - \eta(X^{\{n\}}, \hat f_{\mathrm{lin}}^{\{n\}(N)}) \big\| \leq \frac{C_{\sup}^{(2)}}{\sqrt{a^{\{n\}}}}.$$
Moreover, if $a^{\{n\}} \Sigma^{\{n\}} \to \Sigma$ as $n \to +\infty$, then, letting $X^* \sim \mathcal{N}(\mu, \Sigma)$, there exists $C_{\sup}^{(3)} < +\infty$ such that, with probability at least $1 - C_{\sup}^{(1)} \exp(-C_{\inf} N)$,
$$\big\| \eta(X^{\{n\}}, \hat f_{\mathrm{lin}}^{\{n\}(N)}) - \eta(X^*, f_1) \big\| \leq C_{\sup}^{(3)} \left( \|\mu^{\{n\}} - \mu\| + \|a^{\{n\}} \Sigma^{\{n\}} - \Sigma\| + \frac{1}{\sqrt{a^{\{n\}}}} \right).$$

In this section, we compute the Shapley effects of the true function $f$ and the ones obtained from the three previous linear approximations, to illustrate the previous theoretical results. Let $p = 4$ and let $f$ be of the form
$$f(x) = \cos(x_{i_1})\, x_{i_2} + \sin(x_{i_3}) + 2 \cos(x_{i_1})\, x_{i_4} - \sin(x_{i_2}),$$
for fixed coordinate indices $i_1, \dots, i_4 \in [1:4]$. This function is Lipschitz continuous and $\mathcal{C}^\infty$ on $\mathbb{R}^4$. We choose $\Sigma^{\{n\}} = \frac{1}{n} \Sigma$ (that is, $a^{\{n\}} = n$), where $\Sigma$ is defined by $\Sigma = A^T A$ for a fixed invertible matrix $A \in \mathcal{M}_4(\mathbb{R})$.
Let $\mu \in \mathbb{R}^4$ be a fixed vector and $\mu^{\{n\}} = \mu + \frac{1}{n} v$ for a fixed vector $v \in \mathbb{R}^4$. On Figure 1, we plot, for different values of $n$, the vector $\eta(X^{\{n\}}, \hat f_{\mathrm{lin}}^{\{n\}(N)})$ (given by the linear regression), the vector $\eta(X^{\{n\}}, f_1^{\{n\}})$ (given by the true Taylor polynomial), the vector $\eta(X^{\{n\}}, \tilde f_{1, h^{\{n\}}})$ (given by the finite difference approximation of the derivatives), and the boxplots of 200 estimates of $\eta(X^{\{n\}}, f)$ computed by the R function "shapleyPermRand" from the R package sensitivity (see [SNS16, IP19]), which is adapted to non-linear functions, with parameters $N_V = 10$, $m = 10$ and $N_I = 3$. To compute the linear regression, we observed a sample of size $N = 40$. To compute the finite difference approximation, we took $h_i^{\{n\}} = \sqrt{\mathrm{Var}(X_i^{\{n\}})}$.

The differences between the Shapley effects given by $f$ and the ones given by the linear approximations of $f$ seem to converge to $0$, as proved by Propositions 1, 3 and 4. Moreover, Figure 1 emphasizes that the Shapley effects obtained from the linear regression approach the true ones more slowly than the ones given by the other linear approximations.

We remark that we have here $\Sigma^{\{n\}} = \frac{1}{a^{\{n\}}} \Sigma$, and thus the assumptions of Proposition 2 hold. Hence, the values of the true Shapley effects $\eta(X^{\{n\}}, f)$ converge, as we can see on Figure 1.

Figure 1: Shapley effects of the linear approximations $\hat f_{\mathrm{lin}}^{\{n\}(N)}$, $f_1^{\{n\}}$, $\tilde f_{1, h^{\{n\}}}$, and boxplots of estimates of the Shapley effects of the function $f$.

The computation time for each estimate of the Shapley effects is around 5 seconds using "shapleyPermRand", and several orders of magnitude smaller using the linear approximations $f_1^{\{n\}}$, $\tilde f_{1, h^{\{n\}}}$ or $\hat f_{\mathrm{lin}}^{\{n\}(N)}$. Remark that this time difference can become even more pronounced if the function $f$ is a costly computer code.

Here, we extend the results of Section 3 to the case where the distribution of the input (which we now write $\hat X^{\{n\}}$) is close to a Gaussian distribution $X^{\{n\}}$. We focus on the setting where the input vector is an empirical mean
$$\hat X^{\{n\}} = \frac{1}{n} \sum_{l=1}^{n} U^{(l)},$$
where $(U^{(l)})_{l \in [1:n]}$ is an i.i.d. sample of a random vector $U$ in $\mathbb{R}^p$ such that $E(\|U\|^2) < +\infty$ and $\mathrm{Var}(U) \neq 0$. Let $\mu := E(U)$ and let $\Sigma$ be the covariance matrix of $U$. Remark that, as in Section 3, the input vector $\hat X^{\{n\}}$ is a random vector converging to its mean, and its covariance matrix $\Sigma^{\{n\}}$ is equal to $\frac{1}{n} \Sigma$.

Contrary to Section 3, $\hat X^{\{n\}}$ is not Gaussian, but, thanks to the central limit theorem, its distribution is close to $\mathcal{N}(\mu, \frac{1}{n}\Sigma)$. Hence, we would like to estimate the Shapley effects $\eta(\hat X^{\{n\}}, f)$ by $\eta(X^*, Df(\mu))$, where $X^* \sim \mathcal{N}(0, \Sigma)$, since $\eta(X^*, Df(\mu))$ can be computed using the explicit expression (3) of the Gaussian linear case, and for instance the function "ShapleyLinearGaussian" of the package sensitivity.

Proposition 5.
Assume that $f$ is $\mathcal{C}^1$ on a neighbourhood of $\mu$ with $Df(\mu) \neq 0$, and that $f$ is subpolynomial, that is, there exist $k \in \mathbb{N}^*$ and $C > 0$ such that for all $x \in \mathbb{R}^p$, we have $|f(x)| \leq C(1 + \|x\|^k)$. If $E(\|U\|^{2k}) < +\infty$ and if $U$ has a bounded probability density function, then
$$\eta(\hat X^{\{n\}}, f) \underset{n \to +\infty}{\longrightarrow} \eta(X^*, Df(\mu)).$$

Proposition 5 justifies that $\eta(X^*, Df(\mu))$ is a good approximation of $\eta(\hat X^{\{n\}}, f)$. Furthermore, if $\mu$, $\Sigma$ and $Df(\mu)$ are unknown, the following corollary shows that they can be replaced by approximations. Let $(U^{(l)\prime})_{l \in [1:n']}$ and $(U^{(l)\prime\prime})_{l \in [1:n'']}$ be independent of $(U^{(l)})_{l \in [1:n]}$, composed of i.i.d. copies of $U$, with $n' = n'(n)$ and $n'' = n''(n)$ such that $n', n'' \to \infty$ when $n \to \infty$. We can estimate $\mu$ (resp. $\Sigma$) by the empirical mean $\hat X^{\{n'\}\prime}$ of $(U^{(l)\prime})_{l \in [1:n']}$ (resp. the empirical covariance matrix $\hat\Sigma^{\{n''\}\prime\prime}$ of $(U^{(l)\prime\prime})_{l \in [1:n'']}$), and we can estimate $Df$ by a finite difference approximation. The next corollary guarantees that the error stemming from these additional estimations goes to $0$ as $n \to \infty$.

Corollary 3.
Assume that the assumptions of Proposition 5 hold and that $(h^{\{n\}})_{n \in \mathbb{N}}$ is a sequence of $(\mathbb{R}_+^*)^p$ converging to $0$. Let $X_n^*$ be a random vector with distribution $\mathcal{N}(\mu, \hat\Sigma^{\{n''\}\prime\prime})$ conditionally to $\hat\Sigma^{\{n''\}\prime\prime}$. Then
$$\Big\| \eta(\hat X^{\{n\}}, f) - \eta\big(X_n^*, \tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}}\big) \Big\| \overset{a.s.}{\underset{n \to +\infty}{\longrightarrow}} 0,$$
where $\tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}}$ is the linear approximation of $f$ at $\hat X^{\{n'\}\prime}$ obtained from Equation (4) by replacing $\mu^{\{n\}}$ by $\hat X^{\{n'\}\prime}$.

Remark 2. If $\mu$, $\Sigma$ or $Df$ is known, the previous corollary holds replacing $\hat X^{\{n'\}\prime}$, $\hat\Sigma^{\{n''\}\prime\prime}$ or $\tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}}$ by $\mu$, $\Sigma$ or $Df(\hat X^{\{n'\}\prime})$ respectively.

Remark 3. The notation $\eta(X_n^*, \tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}})$ is to be understood conditionally to $\hat\Sigma^{\{n''\}\prime\prime}, \hat X^{\{n'\}\prime}$. That is, conditionally to $\hat\Sigma^{\{n''\}\prime\prime}, \hat X^{\{n'\}\prime}$, the Shapley effects $\eta(X_n^*, \tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}})$ are defined with the fixed linear function $\tilde f_{1, h^{\{n\}}, \hat X^{\{n'\}\prime}}$ and the Gaussian distribution for $X_n^*$.

Let us show an example of application of the results of Section 4.1. Let $U$ be a continuous random vector of $\mathbb{R}^p$, with a bounded density and with an unknown mean $\mu$. Assume that we observe an i.i.d. sample $(U^{(l)})_{l \in [1:n]}$ of $U$, and that we focus on the estimation of a parameter $\theta = f(\mu)$, where $f$ is $\mathcal{C}^1$. This parameter is estimated by $f(\hat X^{\{n\}})$ (which is asymptotically efficient by the delta-method), where $\hat X^{\{n\}}$ is the empirical mean of $(U^{(l)})_{l \in [1:n]}$. The estimation error of each variable $\hat X_i^{\{n\}}$ (for $i = 1, \dots, p$) propagates through $f$. To quantify the part of the estimation error of $Y = f(\hat X^{\{n\}})$ caused by the individual estimation errors of each $\hat X_i^{\{n\}}$ (for $i = 1, \dots, p$), one can estimate the Shapley effects $\eta(\hat X^{\{n\}}, f) = \eta(\hat X^{\{n\}} - \mu, f(\cdot + \mu) - f(\mu))$, which assess the impact of the individual errors on the global error. To that end, Proposition 5 and Corollary 3 state that the Shapley effects can be estimated using a Gaussian linear approximation, with an error that vanishes as $n$ increases.

For example, let $f = \|\cdot\|$ and $p = 5$. In this case, the derivative $Df$ is known and no finite difference approximation is required. To generate $U$ with a bounded density and with dependencies, we define independent variables $A_1, \dots, A_5$, where $A_1$ has a uniform distribution, $A_2$ a Gaussian distribution, $A_3$ a symmetric triangular distribution, $A_4$ a Beta distribution, and $A_5$ an exponential distribution. Then, each component $U_i$ is defined as a fixed linear combination of three of the variables $A_1, \dots, A_5$, which creates dependence between the components of $U$. Since the mean $\mu$ and the covariance matrix $\Sigma$ are unknown, we need to estimate them (as in Corollary 3). Using the notation of Section 4.1, we choose $n = n' = n''$ and $(U^{(l)\prime})_{l \in [1:n']} = (U^{(l)\prime\prime})_{l \in [1:n']}$ (that is, we estimate the empirical mean and the empirical covariance matrix with the same sample).
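The following minimal sketch illustrates this procedure (our own illustration: the distribution parameters and the mixing matrix are placeholders, since the exact values are immaterial here, and `shapley_linear_gaussian` is the helper sketched in Section 2):

```python
import numpy as np

def sample_U(n, rng):
    """Dependent inputs: independent A_j's mixed linearly (placeholder values)."""
    A = np.column_stack([
        rng.uniform(5.0, 6.0, n),               # A_1: uniform
        rng.normal(0.0, 1.0, n),                # A_2: Gaussian
        rng.triangular(-1.0, 0.0, 1.0, n),      # A_3: symmetric triangular
        rng.beta(1.0, 2.0, n),                  # A_4: Beta
        rng.exponential(1.0, n),                # A_5: exponential
    ])
    M = np.array([[1, 2, -0.5, 0, 0],           # placeholder mixing matrix:
                  [0, 1, 2, -0.5, 0],           # each U_i combines three A_j's
                  [0, 0, 1, 2, -0.5],
                  [-0.5, 0, 0, 1, 2],
                  [2, -0.5, 0, 0, 1]])
    return A @ M.T

rng = np.random.default_rng(0)
n = 1000
U = sample_U(n, rng)
mu_hat = U.mean(axis=0)                         # empirical mean
Sigma_hat = np.cov(U, rowvar=False)             # empirical covariance
grad = mu_hat / np.linalg.norm(mu_hat)          # known gradient of f = ||.||
eta = shapley_linear_gaussian(grad, Sigma_hat)  # Gaussian linear Shapley estimates
```

Note that the $1/n$ factor in the covariance of $\hat X^{\{n\}}$ cancels in the Shapley effects of a linear model, which is why $\hat\Sigma$ can be passed directly.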
We estimate the Shapley effects $\eta(\hat X^{\{n\}}, f)$ by $\eta(X_n^*, Df(\hat X^{\{n\}\prime}))$, where $X_n^*$ is a random vector with distribution $\mathcal{N}(\mu, \hat\Sigma^{\{n\}\prime\prime})$ conditionally to $\hat\Sigma^{\{n\}\prime\prime}$. By Corollary 3 and Remark 2, the difference between $\eta(\hat X^{\{n\}}, f)$ and $\eta(X_n^*, Df(\hat X^{\{n\}\prime}))$ converges to 0 almost surely when $n$ goes to $+\infty$.

Here, we compute 1000 estimates of $\mu$ and $\Sigma$, and we compute the 1000 corresponding Shapley effects of the Gaussian linear approximation $\eta(X_n^*, Df(\hat X^{\{n\}\prime}))$. To compare with these estimates, we also compute 1000 estimates given by the function "shapleySubsetMC" suggested in [BBD20], with parameters $N_{tot} = 1000$, $N_i = 3$, and with an i.i.d. sample of $\hat X^{\{n\}}$ of size 1000. We plot the results on Figure 2.

Figure 2: Boxplots of the estimates of the Shapley effects given by the general estimation function "shapleySubsetMC" (in red) and by the Gaussian linear approximation (in black), for $n = 100$, $200$, $500$ and $1000$.

We observe that the estimates of the Shapley effects given by "shapleySubsetMC" and by the Gaussian linear approximation are rather similar, even for $n = 100$. However, the variance of the estimates given by the Gaussian linear approximation is smaller than that of the general estimates given by "shapleySubsetMC". Moreover, each Gaussian linear estimation requires only a sample of $(U^{(l)\prime})_{l \in [1:n]}$ (to compute $\hat X^{\{n\}\prime}$ and $\hat\Sigma^{\{n\}\prime\prime}$) and takes around 0.007 second on a personal computer, whereas each general estimation with "shapleySubsetMC" requires here 1000 samples of $(U^{(l)\prime})_{l \in [1:n]}$ and takes around 11 seconds. Remark that this time difference can become even more pronounced if the function $f$ is a costly computer code. Finally, the estimator of the Shapley effects given by the linear approximation converges almost surely when $n$ goes to $+\infty$, whereas the estimator of the Shapley effects given by "shapleySubsetMC" is only shown to converge in probability when the sample size and $N_{tot}$ go to $+\infty$ (see [BBD20]).

To conclude, we have provided a framework where the theoretical results of Section 4.1 can be applied. We have illustrated this framework with numerical experiments on generated data. We have shown that, in this framework, the Gaussian linear approximation provides an estimator of the Shapley effects that is much faster and much more accurate than the general estimator given by "shapleySubsetMC".

In this paper, we worked on the Gaussian linear framework approximation to estimate the Shapley effects, in order to take advantage of the simplicity brought by this framework. First, we focused on the case where the inputs are Gaussian variables converging to their means. This setting is motivated, in particular, by the case of uncertainties on physical quantities that are reduced by taking more and more measurements. We showed that, to estimate the Shapley effects, one can replace the true model $f$ by three possible linear approximations: the exact Taylor polynomial approximation, a finite difference approximation, and a linear regression. We gave the rate of convergence of the difference between the Shapley effects of the linear approximations and the Shapley effects of the true model. These results are illustrated by a simulated application that highlights the accuracy of the approximations.
Then, we focused on the case where the inputs are given by an empirical mean. In this case, we proved that the intuitive idea of replacing the empirical mean by a Gaussian vector and the true model by a linear approximation around the mean indeed gives good approximations of the Shapley effects. We highlighted the benefits of these estimators in numerical experiments.

Several questions remain open for future work. In particular, it would be valuable to obtain more insight on the choice between the general estimator of the Shapley effects for non-linear models and the estimators based on Gaussian linear approximations. Quantitative criteria for this choice, based for instance on the magnitude of the input uncertainties or on the number of input samples that are available, would be beneficial. Regarding the results on the impact of individual estimation errors in Section 4.2, it would be interesting to obtain extensions to estimators of quantities of interest that are not only empirical means, for instance general M-estimators.

Acknowledgements

We acknowledge the financial support of the Cross-Disciplinary Program on Numerical Simulation of CEA, the French Alternative Energies and Atomic Energy Commission. We would like to thank BPI France for co-financing this work, as part of the PIA (Programme d'Investissements d'Avenir) - Grand Défi du Numérique 2, supporting the PROBANT project. We acknowledge the Institut de Mathématiques de Toulouse.
References

[BBCM20] Baptiste Broto, François Bachoc, Laura Clouvel, and Jean-Marc Martinez. Block-diagonal covariance estimation and application to the Shapley effects in sensitivity analysis. https://hal.archives-ouvertes.fr/hal-02196583v2, February 2020.

[BBD20] Baptiste Broto, François Bachoc, and Marine Depecker. Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2):693-716, 2020.

[BBDM19] Baptiste Broto, François Bachoc, Marine Depecker, and Jean-Marc Martinez. Sensitivity indices for independent groups of variables. Mathematics and Computers in Simulation, 163:19-31, September 2019.

[BR86] Rabi N. Bhattacharya and R. Ranga Rao. Normal Approximation and Asymptotic Expansions, volume 64. SIAM, 1986.

[Cac03] Dan G. Cacuci. Sensitivity and Uncertainty Analysis, Volume 1: Theory, 2003.

[CGP12] Gaëlle Chastaing, Fabrice Gamboa, and Clémentine Prieur. Generalized Hoeffding-Sobol decomposition for dependent variables - application to sensitivity analysis. Electronic Journal of Statistics, 6:2420-2448, 2012.

[Cha13] Gaëlle Chastaing. Indices de Sobol généralisés pour variables dépendantes. PhD thesis, Université de Grenoble, September 2013.

[Clo19] Laura Clouvel. Quantification de l'incertitude du flux neutronique rapide reçu par la cuve d'un réacteur à eau pressurisée. PhD thesis, Université Paris-Saclay, November 2019.

[GJK+16] Fabrice Gamboa, Alexandre Janon, Thierry Klein, A. Lagnoux, and Clémentine Prieur. Statistical inference for Sobol pick-freeze Monte Carlo method. Statistics, 50(4):881-902, 2016.

[HP04] Laurent Hascoët and Valérie Pascual. Tapenade 2.1 user's guide. 2004.

[HT11] Hugo Hammer and Håkon Tjelmeland. Approximate forward-backward algorithm for a switching linear Gaussian model. Computational Statistics & Data Analysis, 55(1):154-167, January 2011.

[IAP20] Bertrand Iooss, Alexandre Janon, and Gilles Pujol. sensitivity: Global Sensitivity Analysis of Model Outputs, February 2020.

[IP19] Bertrand Iooss and Clémentine Prieur. Shapley effects for sensitivity analysis with correlated inputs: comparisons with Sobol' indices, numerical estimation and applications. International Journal for Uncertainty Quantification, 9(5):493-514, 2019.

[JEF13] JEFF-3.1. Validation of the JEFF-3.1 nuclear data library: JEFF report 23, 2013.

[JEN11] JENDL-4.0. JENDL-4.0: A new library for nuclear science and engineering. Journal of Nuclear Science and Technology, 48(1):1-30, 2011.

[KHF+06] T. Kawano, K. M. Hanson, S. Frankle, P. Talou, M. B. Chadwick, and R. C. Little. Evaluation and propagation of the $^{239}$Pu fission cross-section uncertainties using a Monte Carlo technique. Nuclear Science and Engineering, 153(1):1-7, May 2006.

[McL05] V. McLane. ENDF-6 data formats and procedures for the evaluated nuclear data file ENDF-VII, 2005.

[MT12] Thierry A. Mara and Stefano Tarantola. Variance-based sensitivity indices for models with dependent inputs. Reliability Engineering & System Safety, 107:115-121, November 2012.

[MTA15] Thierry A. Mara, Stefano Tarantola, and Paola Annoni. Non-parametric methods for global sensitivity analysis of model output with dependent inputs. Environmental Modelling and Software, 72:173-183, July 2015.

[OP17] Art B. Owen and Clémentine Prieur. On Shapley value for measuring importance of dependent inputs. SIAM/ASA Journal on Uncertainty Quantification, 5(1):986-1002, 2017.

[Owe14] Art B. Owen. Sobol' indices and Shapley value. SIAM/ASA Journal on Uncertainty Quantification, 2(1):245-251, January 2014.

[Ros70] Haskell P. Rosenthal. On the subspaces of $l^p$ ($p > 2$) spanned by sequences of independent random variables. Israel Journal of Mathematics, 8(3):273-303, 1970.

[Ros04] Antti-Veikko Ilmari Rosti. Linear Gaussian Models for Speech Recognition. PhD thesis, University of Cambridge, 2004.

[She71] T. L. Shervashidze. On a uniform estimate of the rate of convergence in the multidimensional local limit theorem for densities. Theory of Probability & Its Applications, 16(4):741-743, 1971.

[SNS16] Eunhye Song, Barry L. Nelson, and Jeremy Staum. Shapley effects for global sensitivity analysis: Theory and computation. SIAM/ASA Journal on Uncertainty Quantification, 4(1):1060-1083, January 2016.

[Sob93] Ilya M. Sobol. Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4):407-414, 1993.

[SWNW03] Thomas J. Santner, Brian J. Williams, and William Notz. The Design and Analysis of Computer Experiments, volume 1. Springer, 2003.

Appendices
We will write $C_{\sup}$ for a generic non-negative finite constant. The actual value of $C_{\sup}$ is of no interest and can change within the same sequence of equations. Similarly, we will write $C_{\inf}$ for a generic strictly positive constant. Moreover, for all $u \subset [1:p]$, if $Z$ is a random vector in $\mathbb{R}^p$ and $g$ is a function from $\mathbb{R}^p$ to $\mathbb{R}$ such that $E(g(Z)^2) < +\infty$ and $\mathrm{Var}(g(Z)) > 0$, let $S_u^{\mathrm{cl}}(Z, g)$ be the closed Sobol index (see
[GJK+16] for example) for the input vector $Z$ and the model $g$, defined by
$$S_u^{\mathrm{cl}}(Z, g) = \frac{\mathrm{Var}\big(E(g(Z) \mid Z_u)\big)}{\mathrm{Var}(g(Z))}.$$

Proof of Proposition 1
We divide the proof into several lemmas. We assume that the assumptions of Proposition 1 hold throughout this proof. Let $\varepsilon \in \,]0, 1[$ be such that $f$ is $\mathcal{C}^3$ on $B(\mu, \varepsilon)$ and such that, for all $x \in B(\mu, \varepsilon)$, we have $Df(x) \neq 0$. Since $\mu^{\{n\}}$ converges to $\mu$, there exists $n_0 \in \mathbb{N}$ such that, for all $n \geq n_0$, $\mu^{\{n\}} \in B(\mu, \varepsilon/2)$. In the following, we assume that $n$ is larger than $n_0$.

Lemma 1.
For all $x \in B(\mu^{\{n\}}, \varepsilon/2)$, we have
$$|R_1^{\{n\}}(x)| \leq C_1 \|x - \mu^{\{n\}}\|^2, \qquad |R_2^{\{n\}}(x)| \leq C_1' \|x - \mu^{\{n\}}\|^3,$$
and for all $x \notin B(\mu^{\{n\}}, \varepsilon/2)$,
$$|R_1^{\{n\}}(x)| \leq C_2 \|x - \mu^{\{n\}}\|^k, \qquad |R_2^{\{n\}}(x)| \leq C_2' \|x - \mu^{\{n\}}\|^k,$$
where $C_1$, $C_1'$, $C_2$ and $C_2'$ are positive constants that do not depend on $n$.

Proof. Using Taylor's theorem, for all $x \in B(\mu^{\{n\}}, \varepsilon/2)$, there exist $\theta_1(n, x), \theta_2(n, x) \in \,]0, 1[$ such that
$$f(x) = f_0^{\{n\}} + f_1^{\{n\}}(x) + \frac{1}{2} D^2 f\big(\mu^{\{n\}} + \theta_1(n, x)(x - \mu^{\{n\}})\big)(x - \mu^{\{n\}})$$
$$= f_0^{\{n\}} + f_1^{\{n\}}(x) + f_2^{\{n\}}(x) + \frac{1}{6} D^3 f\big(\mu^{\{n\}} + \theta_2(n, x)(x - \mu^{\{n\}})\big)(x - \mu^{\{n\}}).$$
Let $C_1 = \max_{x \in B(\mu, \varepsilon)} \|D^2 f(x)\|$ and $C_1' = \max_{x \in B(\mu, \varepsilon)} \|D^3 f(x)\|$, where $\|\cdot\|$ also denotes the operator norm of a multilinear form. Thus, for all $x \in B(\mu^{\{n\}}, \varepsilon/2)$,
$$|R_1^{\{n\}}(x)| \leq C_1 \|x - \mu^{\{n\}}\|^2, \qquad |R_2^{\{n\}}(x)| \leq C_1' \|x - \mu^{\{n\}}\|^3.$$
Since $f$ is subpolynomial, there exist $k \geq 2$ and $C < +\infty$ such that, for all $x \in \mathbb{R}^p$, $|f(x)| \leq C(1 + \|x\|^k)$. Hence, taking $C' = C(2\|\mu\| + 2)^k$, we have
$$|f(x)| \leq C\big(1 + 2^k \|x - \mu^{\{n\}}\|^k + 2^k \|\mu^{\{n\}}\|^k\big) \leq C'\big(1 + \|x - \mu^{\{n\}}\|^k\big).$$
Hence, taking $C'' := C' + \max_{y \in B(\mu, \varepsilon)} \|Df(y)\|$, we have
$$|R_1^{\{n\}}(x)| \leq |f(x)| + \max_{y \in B(\mu, \varepsilon)} \|Df(y)\| \, \|x - \mu^{\{n\}}\| \leq C''\big(1 + \|x - \mu^{\{n\}}\|^k\big).$$
Now, taking $C_2 := C''\big(1 + (2/\varepsilon)^k\big)$, we have, for all $x \notin B(\mu^{\{n\}}, \varepsilon/2)$,
$$|R_1^{\{n\}}(x)| \leq C'' + C'' \|x - \mu^{\{n\}}\|^k \leq C_2 \|x - \mu^{\{n\}}\|^k.$$
Similarly, there exists $C_2' < +\infty$ such that $|R_2^{\{n\}}(x)| \leq C_2' \|x - \mu^{\{n\}}\|^k$.

Lemma 2.
We have
$$\mathrm{cov}\Big( E\big(f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), \, E\big(f_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big) = 0.$$

Proof.
Let $n \in \mathbb{N}$. To simplify notation, let $A = X^{\{n\}} - \mu^{\{n\}}$, let $\beta \in \mathbb{R}^p$ be the vector of the linear map $Df(\mu^{\{n\}})$ and let $\Gamma \in \mathcal{M}_p(\mathbb{R})$ be the symmetric matrix of the quadratic form $D^2 f(\mu^{\{n\}})$. Then,
$$\mathrm{cov}\big( E(f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}), E(f_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) = \mathrm{cov}\big( E(\beta^T A \mid A_u), E(A^T \Gamma A \mid A_u) \big)$$
$$= E\Big( \big[\beta_u^T A_u + \beta_{-u}^T E(A_{-u} \mid A_u)\big] \big[ A_u^T \Gamma_{u,u} A_u + 2 A_u^T \Gamma_{u,-u} E(A_{-u} \mid A_u) + E(A_{-u}^T \Gamma_{-u,-u} A_{-u} \mid A_u) \big] \Big)$$
$$= E\Big( \big[\beta_u^T A_u + \beta_{-u}^T E(A_{-u} \mid A_u)\big] \, E\big(A_{-u}^T \Gamma_{-u,-u} A_{-u} \mid A_u\big) \Big),$$
since all the other terms are linear combinations of expectations of products of three zero-mean Gaussian variables. Indeed, the coefficients of $E(A_{-u} \mid A_u)$ are linear combinations of the coefficients of $A_u$. Now,
$$E\big( \beta_u^T A_u \times E(A_{-u}^T \Gamma_{-u,-u} A_{-u} \mid A_u) \big) = E\big( E(\beta_u^T A_u \times A_{-u}^T \Gamma_{-u,-u} A_{-u} \mid A_u) \big) = E\big( \beta_u^T A_u \times A_{-u}^T \Gamma_{-u,-u} A_{-u} \big) = 0.$$
Similarly, the term $E\big( \beta_{-u}^T E(A_{-u} \mid A_u) \, E(A_{-u}^T \Gamma_{-u,-u} A_{-u} \mid A_u) \big)$ is equal to 0.

Lemma 3. There exists $C_{\sup} < +\infty$ such that, for all $u \subset [1:p]$,
$$\mathrm{Var}\Big( E\big( \sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}} \big) \Big) \leq \frac{C_{\sup}}{a^{\{n\}}},$$
and
$$\Big| \mathrm{cov}\Big( E\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}} \big), \, E\big( \sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}} \big) \Big) \Big| \leq \frac{C_{\sup}}{a^{\{n\}}}.$$

Proof.
Using Lemma 1, we have
$$E\big( |\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}})|^2 \big) = E\big( |\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}})|^2 \mathbb{1}_{\|X^{\{n\}} - \mu^{\{n\}}\| < \varepsilon/2} \big) + E\big( |\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}})|^2 \mathbb{1}_{\|X^{\{n\}} - \mu^{\{n\}}\| \geq \varepsilon/2} \big)$$
$$\leq \frac{C_1^2}{a^{\{n\}}} E\Big( \big\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\big\|^4 \Big) + \frac{C_2^2}{a^{\{n\}(k-1)}} E\Big( \big\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\big\|^{2k} \Big) \leq \frac{C_{\sup}}{a^{\{n\}}},$$
since $a^{\{n\}} \Sigma^{\{n\}}$ is bounded. Hence,
$$\mathrm{Var}\big( \sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \big) \leq \frac{C_{\sup}}{a^{\{n\}}}.$$
Moreover, for all $u \subset [1:p]$,
$$0 \leq \mathrm{Var}\Big( E\big( \sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}} \big) \Big) \leq \mathrm{Var}\big( \sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \big) \leq \frac{C_{\sup}}{a^{\{n\}}}.$$
For all $u \subset [1:p]$,
$$\mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big)$$
$$= \mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} f_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big) + \mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big)$$
$$= \mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big),$$
using Lemma 2. Now, by the Cauchy-Schwarz inequality,
$$\Big| \mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big) \Big| \leq \sqrt{\mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big)} \sqrt{\mathrm{Var}\big( \sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}}) \big)}.$$
As above,
$$E\big( |\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}})|^2 \big) = E\big( |\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}})|^2 \mathbb{1}_{\|X^{\{n\}} - \mu^{\{n\}}\| \leq \varepsilon/2} \big) + E\big( |\sqrt{a^{\{n\}}} R_2^{\{n\}}(X^{\{n\}})|^2 \mathbb{1}_{\|X^{\{n\}} - \mu^{\{n\}}\| \geq \varepsilon/2} \big)$$
$$\leq \frac{C_1'^2}{a^{\{n\}2}} E\Big( \big\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\big\|^6 \Big) + \frac{C_2'^2}{a^{\{n\}(k-1)}} E\Big( \big\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\big\|^{2k} \, \mathbb{1}_{\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\| \geq \sqrt{a^{\{n\}}}\varepsilon/2} \Big) \leq \frac{C_{\sup}}{a^{\{n\}2}}.$$
Furthermore,
$$\mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big) \leq \max_{x \in B(\mu^{\{n\}}, \varepsilon/2)} \|Df(x)\|^2 \, E\Big( \big\|\sqrt{a^{\{n\}}}(X^{\{n\}} - \mu^{\{n\}})\big\|^2 \Big) \leq C_{\sup}.$$
Finally,
$$\Big| \mathrm{cov}\Big( E\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big), E\big(\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big) \Big| \leq \frac{C_{\sup}}{a^{\{n\}}},$$
which concludes the proof of Lemma 3.

Lemma 4.
For all $u \subset [1:p]$,
$$S_u^{\mathrm{cl}}(X^{\{n\}}, f) = S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) + O\!\left( \frac{1}{a^{\{n\}}} \right).$$

Proof.
We have $f(X^{\{n\}}) = f(\mu^{\{n\}}) + f_1^{\{n\}}(X^{\{n\}}) + R_1^{\{n\}}(X^{\{n\}})$. For all $u \subset [1:p]$, we have
$$E\big(f(X^{\{n\}}) \mid X_u^{\{n\}}\big) = f(\mu^{\{n\}}) + E\big(f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big) + E\big(R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}\big),$$
so
$$a^{\{n\}} \mathrm{Var}\big( E(f(X^{\{n\}}) \mid X_u^{\{n\}}) \big) = \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) + \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big)$$
$$+ 2\,\mathrm{cov}\big( E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}), E(\sqrt{a^{\{n\}}} R_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) = \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) + O\!\left( \frac{1}{a^{\{n\}}} \right),$$
by Lemma 3. Hence, for $u = [1:p]$, we have
$$a^{\{n\}} \mathrm{Var}\big( f(X^{\{n\}}) \big) = \mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big) + O\!\left( \frac{1}{a^{\{n\}}} \right).$$
Thus, for all $u \subset [1:p]$,
$$S_u^{\mathrm{cl}}(X^{\{n\}}, f) = \frac{\mathrm{Var}\big(E(f(X^{\{n\}}) \mid X_u^{\{n\}})\big)}{\mathrm{Var}\big(f(X^{\{n\}})\big)} = \frac{a^{\{n\}} \mathrm{Var}\big(E(f(X^{\{n\}}) \mid X_u^{\{n\}})\big)}{a^{\{n\}} \mathrm{Var}\big(f(X^{\{n\}})\big)} = \frac{\mathrm{Var}\big(E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}})\big) + O\big(\frac{1}{a^{\{n\}}}\big)}{\mathrm{Var}\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}})\big) + O\big(\frac{1}{a^{\{n\}}}\big)}$$
$$= \frac{\mathrm{Var}\big(E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}})\big)}{\mathrm{Var}\big(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}})\big)} + O\!\left( \frac{1}{a^{\{n\}}} \right) = S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) + O\!\left( \frac{1}{a^{\{n\}}} \right),$$
where we used that
$$\mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big) = Df(\mu^{\{n\}}) \big( a^{\{n\}} \Sigma^{\{n\}} \big) Df(\mu^{\{n\}})^T \geq \lambda_{\min}\big(a^{\{n\}} \Sigma^{\{n\}}\big) \inf_{x \in B(\mu, \varepsilon/2)} \|Df(x)\|^2 \geq C_{\inf}.$$

Now that we have proved the convergence of the closed Sobol indices, we can prove Proposition 1 easily.
Proof.
By Lemma 4 and applying the linearity of the Shapley effects with respect to the Sobol indices, we have
$$\eta(X^{\{n\}}, f) = \eta(X^{\{n\}}, f_1^{\{n\}}) + O\!\left( \frac{1}{a^{\{n\}}} \right).$$

Proof of Remark 1
Proof.
Let $X^{\{n\}} = (X_1^{\{n\}}, X_2^{\{n\}}) \sim \mathcal{N}(0, \frac{1}{a^{\{n\}}} I_2)$ and $Y^{\{n\}} = f(X^{\{n\}}) = X_1^{\{n\}} + (X_2^{\{n\}})^2$. We have $f_1^{\{n\}}(X^{\{n\}}) = X_1^{\{n\}}$ and $R_1^{\{n\}}(X^{\{n\}}) = (X_2^{\{n\}})^2$. Thus, $\eta_1(X^{\{n\}}, f_1^{\{n\}}) = 1$ and $\eta_2(X^{\{n\}}, f_1^{\{n\}}) = 0$. Now, let us compute the Shapley effects $\eta(X^{\{n\}}, f)$. We have
$$\mathrm{Var}\big(f(X^{\{n\}})\big) = \mathrm{Var}(X_1^{\{n\}}) + \mathrm{Var}\big((X_2^{\{n\}})^2\big) = \mathrm{Var}(X_1^{\{n\}}) + E\big((X_2^{\{n\}})^4\big) - E\big((X_2^{\{n\}})^2\big)^2 = \frac{1}{a^{\{n\}}} + \frac{3}{a^{\{n\}2}} - \frac{1}{a^{\{n\}2}} = \frac{a^{\{n\}} + 2}{a^{\{n\}2}}.$$
Moreover,
$$\mathrm{Var}\Big( E\big(f(X^{\{n\}}) \mid X_1^{\{n\}}\big) \Big) = \mathrm{Var}\Big( X_1^{\{n\}} + \frac{1}{a^{\{n\}}} \Big) = \mathrm{Var}(X_1^{\{n\}}) = \frac{1}{a^{\{n\}}}$$
and
$$\mathrm{Var}\Big( E\big(f(X^{\{n\}}) \mid X_2^{\{n\}}\big) \Big) = \mathrm{Var}\big( (X_2^{\{n\}})^2 \big) = E\big((X_2^{\{n\}})^4\big) - E\big((X_2^{\{n\}})^2\big)^2 = \frac{3 - 1}{a^{\{n\}2}} = \frac{2}{a^{\{n\}2}}.$$
Hence,
$$\eta_1(X^{\{n\}}, f) = \frac{a^{\{n\}2}}{(a^{\{n\}} + 2)\,2} \left( \frac{1}{a^{\{n\}}} + \frac{a^{\{n\}} + 2}{a^{\{n\}2}} - \frac{2}{a^{\{n\}2}} \right) = \frac{a^{\{n\}}}{a^{\{n\}} + 2},$$
and
$$\eta_2(X^{\{n\}}, f) = \frac{2}{a^{\{n\}} + 2}.$$
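As a quick numerical sanity check of these closed-form values (our own addition, not part of the original proof), a short Monte Carlo computation in Python:

```python
import numpy as np

# Monte Carlo check of Remark 1: f(x) = x_1 + x_2^2 with X ~ N(0, (1/a) I_2).
rng = np.random.default_rng(0)
a, m = 10.0, 10**6
X = rng.normal(0.0, 1.0 / np.sqrt(a), size=(m, 2))
Y = X[:, 0] + X[:, 1] ** 2

var_y = Y.var()
v1 = X[:, 0].var()                          # Var(E(Y | X_1)) = Var(X_1)
v2 = (X[:, 1] ** 2).var()                   # Var(E(Y | X_2)) = Var(X_2^2)
eta1 = 0.5 * (v1 + (var_y - v2)) / var_y    # Shapley formula for p = 2
print(eta1, a / (a + 2))                    # both close to 0.833...
```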
Proof of Proposition 2

As in the proof of Proposition 1, we first prove the convergence for the closed Sobol indices. To simplify notation, let $\Gamma^{\{n\}} := a^{\{n\}} \Sigma^{\{n\}}$.

Lemma 5.
Under the assumptions of Proposition 2, for all $u \subset [1:p]$, we have
$$S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) = S_u^{\mathrm{cl}}(X^*, f_1) + O(\|\mu^{\{n\}} - \mu\|) + O(\|\Gamma^{\{n\}} - \Sigma\|).$$

Proof.
We have
$$\mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big) - \mathrm{Var}\big( f_1(X^*) \big) = Df(\mu^{\{n\}}) \Gamma^{\{n\}} Df(\mu^{\{n\}})^T - Df(\mu) \Sigma Df(\mu)^T$$
$$= Df(\mu^{\{n\}}) \Gamma^{\{n\}} \big[ Df(\mu^{\{n\}})^T - Df(\mu)^T \big] + Df(\mu^{\{n\}}) \big[ \Gamma^{\{n\}} - \Sigma \big] Df(\mu)^T + \big[ Df(\mu^{\{n\}}) - Df(\mu) \big] \Sigma Df(\mu)^T$$
$$= O(\|Df(\mu^{\{n\}}) - Df(\mu)\|) + O(\|\Gamma^{\{n\}} - \Sigma\|) = O(\|\mu^{\{n\}} - \mu\|) + O(\|\Gamma^{\{n\}} - \Sigma\|),$$
using that $Df$ is Lipschitz continuous on a neighbourhood of $\mu$ (thanks to the continuity of $D^2 f$).

Moreover, for all $\emptyset \subsetneq u \subsetneq [1:p]$, we have
$$\mathrm{Var}\big( E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( E(f_1(X^*) \mid X_u^*) \big)$$
$$= \mathrm{Var}\big( \sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \big) - E\big( \mathrm{Var}(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( f_1(X^*) \big) + E\big( \mathrm{Var}(f_1(X^*) \mid X_u^*) \big)$$
$$= Df(\mu^{\{n\}}) \Gamma^{\{n\}} Df(\mu^{\{n\}})^T - Df(\mu^{\{n\}})_{-u} \big( \Gamma^{\{n\}}_{-u,-u} - \Gamma^{\{n\}}_{-u,u} \Gamma^{\{n\}-1}_{u,u} \Gamma^{\{n\}}_{u,-u} \big) Df(\mu^{\{n\}})_{-u}^T$$
$$- Df(\mu) \Sigma Df(\mu)^T + Df(\mu)_{-u} \big( \Sigma_{-u,-u} - \Sigma_{-u,u} \Sigma_{u,u}^{-1} \Sigma_{u,-u} \big) Df(\mu)_{-u}^T = O(\|\mu^{\{n\}} - \mu\|) + O(\|\Gamma^{\{n\}} - \Sigma\|),$$
so that
$$S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) = S_u^{\mathrm{cl}}(X^*, f_1) + O(\|\mu^{\{n\}} - \mu\|) + O(\|\Gamma^{\{n\}} - \Sigma\|).$$

Now, we can easily prove Proposition 2.
Proof.
By Lemma 5 and applying the linearity of the Shapley effects with respect to the Sobol indices, we have
$$\eta(X^{\{n\}}, f_1^{\{n\}}) = \eta(X^*, f_1) + O(\|\mu^{\{n\}} - \mu\|) + O(\|\Gamma^{\{n\}} - \Sigma\|).$$

Proof of Proposition 3
Under the assumptions of Proposition 3, let $\varepsilon > 0$ be such that $f$ is $\mathcal{C}^3$ on $B(\mu, \varepsilon)$ and such that, for all $x \in B(\mu, \varepsilon)$, we have $Df(x) \neq 0$. Since $\mu^{\{n\}}$ converges to $\mu$, there exists $n_0 \in \mathbb{N}$ such that, for all $n \geq n_0$, $\mu^{\{n\}} \in B(\mu, \varepsilon/2)$. In the following, we assume that $n$ is larger than $n_0$.

Lemma 6.
For all $x \in B(\mu, \varepsilon/2)$ and $h \in (\mathbb{R}_+^*)^p$ such that $\|h\|_\infty \leq \varepsilon/2$, we have
$$\big\| \widehat{D}_h f(x) - Df(x) \big\| \leq \frac{1}{6} \max_{i \in [1:p]} \max_{y \in B(\mu, \varepsilon)} \big| \partial_i^3 f(y) \big| \, \|h\|^2.$$

Proof.
Let $x \in B(\mu, \varepsilon/2)$ and $h \in (\mathbb{R}_+^*)^p$ such that $\|h\|_\infty \leq \varepsilon/2$. For all $i \in [1:p]$, using Taylor's theorem, there exist $\theta_{x,h,i}^+, \theta_{x,h,i}^- \in \,]0, 1[$ such that
$$\frac{f(x + e_i h_i) - f(x - e_i h_i)}{2 h_i} = \partial_i f(x) + \frac{h_i^2}{12} \Big( \partial_i^3 f\big(x + \theta_{x,h,i}^+ e_i h_i\big) + \partial_i^3 f\big(x - \theta_{x,h,i}^- e_i h_i\big) \Big).$$
Hence,
$$\big\| \widehat{D}_h f(x) - Df(x) \big\| \leq \sum_{i=1}^p \Big| \big[ \widehat{D}_h f(x) - Df(x) \big]_i \Big| \leq \frac{1}{6} \max_{i \in [1:p]} \max_{y \in B(\mu, \varepsilon)} |\partial_i^3 f(y)| \sum_{i=1}^p h_i^2 = \frac{1}{6} \max_{i \in [1:p]} \max_{y \in B(\mu, \varepsilon)} |\partial_i^3 f(y)| \, \|h\|^2.$$

Lemma 7.
For all linear functions $l_1$ and $l_2$ from $\mathbb{R}^p$ to $\mathbb{R}$, we have
$$\Big| \mathrm{Var}\big( E(l_1(X^{\{n\}}) \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( E(l_2(X^{\{n\}}) \mid X_u^{\{n\}}) \big) \Big| \leq \frac{C_{\sup}}{a^{\{n\}}} \|l_1 - l_2\|.$$

Proof. For all $u \subsetneq [1:p]$, let $\varphi_u^{\{n\}} : \mathbb{R}^{|u|} \to \mathbb{R}^p$ be defined by
$$\varphi_u^{\{n\}}(x_u) = \begin{pmatrix} x_u \\ \mu_{-u}^{\{n\}} + \Gamma_{-u,u}^{\{n\}} \Gamma_{u,u}^{\{n\}-1} (x_u - \mu_u^{\{n\}}) \end{pmatrix}$$
and $\varphi_{[1:p]}^{\{n\}} = \mathrm{id}_{\mathbb{R}^p}$. Let $u \subset [1:p]$. Then $E(X^{\{n\}} \mid X_u^{\{n\}}) = \varphi_u^{\{n\}}(X_u^{\{n\}})$. Now, for any linear function $l : \mathbb{R}^p \to \mathbb{R}$, we have
$$E\big(l(X^{\{n\}}) \mid X_u^{\{n\}}\big) = l\big( E(X^{\{n\}} \mid X_u^{\{n\}}) \big) = l\big( \varphi_u^{\{n\}}(X_u^{\{n\}}) \big),$$
so, identifying a linear function from $\mathbb{R}^p$ to $\mathbb{R}$ with its matrix of size $1 \times p$, we have
$$\mathrm{Var}\Big( E\big(l(X^{\{n\}}) \mid X_u^{\{n\}}\big) \Big) = l \, \varphi_u^{\{n\}} \, \frac{\Gamma_{u,u}^{\{n\}}}{a^{\{n\}}} \, \varphi_u^{\{n\}T} l^T.$$
Hence, for $l = l_1$ and $l = l_2$, one can show that
$$\Big| \mathrm{Var}\big( E(l_1(X^{\{n\}}) \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( E(l_2(X^{\{n\}}) \mid X_u^{\{n\}}) \big) \Big| \leq \frac{C_{\sup}}{a^{\{n\}}} \|l_1 - l_2\|.$$
Now, we can prove Proposition 3.
Proof.
By Lemmas 6 and 7, we have, for all $u \subset [1:p]$,
$$\mathrm{Var}\big( E(\sqrt{a^{\{n\}}} f_1^{\{n\}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} \tilde f_{1,h^{\{n\}}}(X^{\{n\}}) \mid X_u^{\{n\}}) \big) = O\big( \|h^{\{n\}}\|^2 \big).$$
Thus,
$$S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) - S_u^{\mathrm{cl}}(X^{\{n\}}, \tilde f_{1,h^{\{n\}}}) = O\big( \|h^{\{n\}}\|^2 \big),$$
so
$$\eta(X^{\{n\}}, f_1^{\{n\}}) - \eta(X^{\{n\}}, \tilde f_{1,h^{\{n\}}}) = O\big( \|h^{\{n\}}\|^2 \big).$$

Proof of Proposition 4
Under the assumptions of Proposition 3, let $\varepsilon > 0$ be such that $f$ is $\mathcal{C}^3$ on $B(\mu, \varepsilon)$ and such that, for all $x \in B(\mu, \varepsilon)$, we have $Df(x) \neq 0$. Since $\mu^{\{n\}}$ converges to $\mu$, there exists $n_0 \in \mathbb{N}$ such that, for all $n \geq n_0$, $\mu^{\{n\}} \in B(\mu, \varepsilon/2)$. In the following, we assume that $n$ is larger than $n_0$.

Lemma 8.
There exists $C_{\sup}$ such that, with probability at least $1 - p^2 \exp(-C_{\inf} N) - 2p \exp(-C_{\inf} N)$,
$$\Big\| \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \Big\| \leq C_{\sup} \frac{\sqrt{a^{\{n\}}}}{\sqrt{N}}.$$

Proof. We have
$$\Big\| \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \Big\|^2 = \lambda_{\max}\Big[ \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} \Big] = \frac{a^{\{n\}}}{N} \lambda_{\max}\left[ \left( \frac{a^{\{n\}}}{N} A^{\{n\}T} A^{\{n\}} \right)^{-1} \right].$$
Now, by the strong law of large numbers, we have almost surely
$$\frac{a^{\{n\}}}{N} A^{\{n\}T} A^{\{n\}} - (a^{\{n\}} - 1) \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix} \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix}^T \underset{N \to +\infty}{\longrightarrow} M^{\{n\}} := \begin{pmatrix} 1 & \mu^{\{n\}T} \\ \mu^{\{n\}} & \Gamma^{\{n\}} + \mu^{\{n\}} \mu^{\{n\}T} \end{pmatrix}.$$
Let
$$\underline{M}^{\{n\}} := \begin{pmatrix} 1 & \mu^{\{n\}T} \\ \mu^{\{n\}} & \lambda_{\inf} I_p + \mu^{\{n\}} \mu^{\{n\}T} \end{pmatrix} \quad \text{and} \quad \underline{M} := \begin{pmatrix} 1 & \mu^T \\ \mu & \lambda_{\inf} I_p + \mu \mu^T \end{pmatrix},$$
where $\lambda_{\inf} > 0$ is a lower bound of the eigenvalues of $(\Gamma^{\{n\}})_n$. We can see that $M^{\{n\}} \geq \underline{M}^{\{n\}} \to \underline{M}$ as $n \to +\infty$. Now,
$$\det(\underline{M}) = \det(1) \det\big( [\lambda_{\inf} I_p + \mu \mu^T] - \mu \cdot 1^{-1} \cdot \mu^T \big) = \lambda_{\inf}^p > 0.$$
Hence, writing $\lambda'_{\inf} > 0$ for the smallest eigenvalue of $\underline{M}$, the eigenvalues of $\underline{M}^{\{n\}}$ are lower-bounded by $\lambda'_{\inf}/2$ for $n$ large enough. Similarly, let
$$\overline{M}^{\{n\}} := \begin{pmatrix} 1 & \mu^{\{n\}T} \\ \mu^{\{n\}} & \lambda_{\sup} I_p + \mu^{\{n\}} \mu^{\{n\}T} \end{pmatrix} \quad \text{and} \quad \overline{M} := \begin{pmatrix} 1 & \mu^T \\ \mu & \lambda_{\sup} I_p + \mu \mu^T \end{pmatrix},$$
where $\lambda_{\sup} > 0$ is an upper bound of the eigenvalues of $(\Gamma^{\{n\}})_n$. Writing $\lambda'_{\sup} < +\infty$ for the largest eigenvalue of $\overline{M}$, the eigenvalues of $\overline{M}^{\{n\}}$ are upper-bounded by $2\lambda'_{\sup}$ for $n$ large enough.

Now, since the eigenvalues of $(M^{\{n\}})_n$ are lower-bounded and upper-bounded, there exists $\alpha > 0$ such that, for all $n \in \mathbb{N}$ (large enough) and all $M \in \mathcal{S}_{p+1}(\mathbb{R})$,
$$\|M - M^{\{n\}}\| \leq \alpha \implies |\lambda_{\min}(M) - \lambda_{\min}(M^{\{n\}})| \leq \frac{\lambda'_{\inf}}{4}.$$
Now, by Bernstein's inequality,
$$P\left( \left\| \frac{a^{\{n\}}}{N} A^{\{n\}T} A^{\{n\}} - (a^{\{n\}} - 1) \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix} \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix}^T - M^{\{n\}} \right\| \leq \alpha \right) \geq 1 - p^2 \exp(-C_{\inf} N) - 2p \exp(-C_{\inf} N) \geq 1 - C_{\sup} \exp(-C_{\inf} N),$$
where the term $p^2 \exp(-C_{\inf} N)$ bounds the deviation of the submatrix of indices $[2:p+1] \times [2:p+1]$, and the term $2p \exp(-C_{\inf} N)$ bounds those of the submatrices of indices $\{1\} \times [2:p+1]$ and $[2:p+1] \times \{1\}$. Hence, with probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$, we have
$$\lambda_{\min}\left( \frac{a^{\{n\}}}{N} A^{\{n\}T} A^{\{n\}} - (a^{\{n\}} - 1) \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix} \begin{pmatrix} 1 \\ \mu^{\{n\}} \end{pmatrix}^T \right) \geq \frac{\lambda'_{\inf}}{4},$$
and so
$$\lambda_{\min}\left( \frac{a^{\{n\}}}{N} A^{\{n\}T} A^{\{n\}} \right) \geq \frac{\lambda'_{\inf}}{4}.$$

Lemma 9.
With probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$, we have
$$\big\| \hat\beta^{\{n\}} - \nabla f(\mu^{\{n\}}) \big\| \leq \frac{C_{\sup}}{\sqrt{a^{\{n\}}}}.$$

Proof.
Let $Z^{\{n\}} \sim \mathcal{N}(0, \Gamma^{\{n\}})$. Then $\|X^{\{n\}} - \mu^{\{n\}}\| \leq \varepsilon/2$ with probability $P\big(\|Z^{\{n\}}\| \leq \sqrt{a^{\{n\}}}\,\varepsilon/2\big) \to 1$ as $n \to +\infty$. Let
$$\Omega_N^{\{n\}} := \big\{ \omega \in \Omega \ \big| \ \forall j \in [1:N], \ \|X^{\{n\}(j)}(\omega) - \mu^{\{n\}}\| \leq \varepsilon/2 \big\}.$$
Hence,
$$P\big(\Omega_N^{\{n\}}\big) \geq 1 - N \exp\big( -C_{\inf} \, a^{\{n\}} \big) \underset{n \to +\infty}{\longrightarrow} 1.$$
On $B(\mu^{\{n\}}, \varepsilon/2)$, we have $f = f(\mu^{\{n\}}) + f_1^{\{n\}} + R_1^{\{n\}}$. Hence, on $\Omega_N^{\{n\}}$, for all $j \in [1:N]$,
$$f\big(X^{\{n\}(j)}\big) = f(\mu^{\{n\}}) + f_1^{\{n\}}\big(X^{\{n\}(j)}\big) + R_1^{\{n\}}\big(X^{\{n\}(j)}\big).$$
Thus,
$$\begin{pmatrix} \hat\beta_0^{\{n\}} \\ \hat\beta^{\{n\}} \end{pmatrix} = \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \Big( f(\mu^{\{n\}}) + f_1^{\{n\}}(X^{\{n\}(j)}) + R_1^{\{n\}}(X^{\{n\}(j)}) \Big)_{j \in [1:N]}.$$
Since $f(\mu^{\{n\}}) + f_1^{\{n\}}$ is a linear function with gradient vector $\nabla f(\mu^{\{n\}})$ and with value at zero $f(\mu^{\{n\}}) - Df(\mu^{\{n\}}) \mu^{\{n\}}$, we have
$$\big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \Big( f(\mu^{\{n\}}) + f_1^{\{n\}}(X^{\{n\}(j)}) \Big)_{j \in [1:N]} = \begin{pmatrix} f(\mu^{\{n\}}) - Df(\mu^{\{n\}}) \mu^{\{n\}} \\ \nabla f(\mu^{\{n\}}) \end{pmatrix}.$$
Hence, it remains to check that
$$\big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \big( R_1^{\{n\}}(X^{\{n\}(j)}) \big)_{j \in [1:N]}$$
is small enough. By Lemma 1, we have on $\Omega_N^{\{n\}}$,
$$\big\| \big( R_1^{\{n\}}(X^{\{n\}(j)}) \big)_{j \in [1:N]} \big\|^2 = \sum_{j=1}^N R_1^{\{n\}}\big(X^{\{n\}(j)}\big)^2 \leq C_{\sup} \sum_{j=1}^N \|X^{\{n\}(j)} - \mu^{\{n\}}\|^4 \leq \frac{C_{\sup}}{a^{\{n\}2}} \sum_{j=1}^N \big\| \sqrt{a^{\{n\}}} \big(X^{\{n\}(j)} - \mu^{\{n\}}\big) \big\|^4.$$
Hence, on $\Omega_N^{\{n\}}$,
$$\big\| \big( R_1^{\{n\}}(X^{\{n\}(j)}) \big)_{j \in [1:N]} \big\| \leq C_{\sup} \frac{\sqrt{N}}{a^{\{n\}}}.$$
Thus,
$$\Big\| \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \big( R_1^{\{n\}}(X^{\{n\}(j)}) \big)_{j \in [1:N]} \Big\| \leq \Big\| \big( A^{\{n\}T} A^{\{n\}} \big)^{-1} A^{\{n\}T} \Big\| \, \Big\| \big( R_1^{\{n\}}(X^{\{n\}(j)}) \big)_{j \in [1:N]} \Big\| \leq \frac{C_{\sup}}{\sqrt{a^{\{n\}}}},$$
with probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$.

Now, it is easy to prove Proposition 4.

Proof.
By Lemma 7 applied to $l_1 = \hat\beta^{\{n\}T}$ and $l_2 = Df(\mu^{\{n\}})$, and by Lemma 9, we have, with probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$,
$$\Big| \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} Df(\mu^{\{n\}}) X^{\{n\}} \mid X_u^{\{n\}}) \big) - \mathrm{Var}\big( E(\sqrt{a^{\{n\}}} \hat\beta^{\{n\}T} X^{\{n\}} \mid X_u^{\{n\}}) \big) \Big| \leq C_{\sup} \big\| Df(\mu^{\{n\}}) - \hat\beta^{\{n\}T} \big\| \leq \frac{C_{\sup}}{\sqrt{a^{\{n\}}}},$$
where the conditional expectations and the variances are conditional to $(X^{\{n\}(j)})_{j \in [1:N]}$. Thus, with probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$, there exists $C_{\inf} > 0$ such that, for $n$ large enough, $\|\hat\beta^{\{n\}T}\| \geq C_{\inf}$, and thus $\mathrm{Var}\big(\sqrt{a^{\{n\}}} \hat\beta^{\{n\}T} X^{\{n\}}\big)$ is lower-bounded. Hence, with probability at least $1 - C_{\sup} \exp(-C_{\inf} N)$,
$$\Big| S_u^{\mathrm{cl}}(X^{\{n\}}, f_1^{\{n\}}) - S_u^{\mathrm{cl}}(X^{\{n\}}, \hat\beta^{\{n\}T}) \Big| \leq \frac{C_{\sup}}{\sqrt{a^{\{n\}}}},$$
and so
$$\Big\| \eta(X^{\{n\}}, f_1^{\{n\}}) - \eta(X^{\{n\}}, \hat\beta^{\{n\}T}) \Big\| \leq \frac{C_{\sup}}{\sqrt{a^{\{n\}}}}.$$

Proofs for Section 4
B Proofs for Section 4

In this section, we prove Proposition 5 in Subsections B.1 to B.6, and we prove Corollary 3 in Subsection B.7.
B.1 Structure of the proof of Proposition 5

Recall that $(U^{(l)})_{l\in[1:n]}$ is an i.i.d. sample of $U$ with $E(U) = \mu$ and $\mathrm{Var}(U) = \Sigma$, and that
$$\widehat X^{\{n\}} = \frac{1}{n}\sum_{l=1}^n U^{(l)}.$$
Let $X^{\{n\}}\sim N\left(\mu, \frac{1}{n}\Sigma\right)$. By Proposition 1 (applied with $a^{\{n\}} = n$), we have
$$\eta\left(X^{\{n\}}, f\right) = \eta\left(X^{\{n\}}, Df(\mu)\right) + O\left(\frac{1}{\sqrt n}\right) = \eta(X^*, Df(\mu)) + O\left(\frac{1}{\sqrt n}\right).$$
Hence, it remains to prove that
$$\left\|\eta\left(\widehat X^{\{n\}}, f\right) - \eta\left(X^{\{n\}}, f\right)\right\| \xrightarrow[n\to+\infty]{} 0,$$
that is, writing $f_n := \sqrt n\left(f\left(\frac{\cdot}{\sqrt n} + \mu\right) - f(\mu)\right)$ and $\tilde X^{\{n\}} := \sqrt n\left(\widehat X^{\{n\}} - \mu\right)$, that
$$\left\|\eta\left(\tilde X^{\{n\}}, f_n\right) - \eta(X^*, f_n)\right\| \xrightarrow[n\to+\infty]{} 0.$$
In Subsection B.2, we give some lemmas on $f_n$. Then, defining
$$E_{u,n,K}(Z) := E\left(E\left[f_n(Z)\mathbb{1}_{\|Z\|_\infty\le K}\mid Z_u\right]^2\right),\qquad E_{u,n}(Z) := E\left(E\left[f_n(Z)\mid Z_u\right]^2\right),$$
we prove in Subsection B.3 that $\sup_n |E_{u,n,K}(\tilde X^{\{n\}}) - E_{u,n}(\tilde X^{\{n\}})|$ converges to $0$ when $K\to+\infty$. In particular, for $U\sim N(\mu,\Sigma)$, the result holds for $\tilde X^{\{n\}} = X^*$. Hence, for any $\varepsilon > 0$, choosing $K$ such that $|E_{u,n,K}(\tilde X^{\{n\}}) - E_{u,n}(\tilde X^{\{n\}})| < \varepsilon/3$ and $|E_{u,n,K}(X^*) - E_{u,n}(X^*)| < \varepsilon/3$, we show in Subsection B.4 that
$$|E_{u,n,K}(X^*) - E_{u,n,K}(\tilde X^{\{n\}})| \xrightarrow[n\to+\infty]{} 0.$$
In Subsection B.5, we conclude the proof that
$$\left|\mathrm{Var}\left(E\left(f_n(\tilde X^{\{n\}})\mid \tilde X_u^{\{n\}}\right)\right) - \mathrm{Var}\left(E\left(f_n(X^*)\mid X_u^*\right)\right)\right| \xrightarrow[n\to+\infty]{} 0.$$
In Subsection B.6, we conclude the proof that
$$\left\|\eta\left(\tilde X^{\{n\}}, f_n\right) - \eta(X^*, f_n)\right\| \xrightarrow[n\to+\infty]{} 0.$$
The key of the proof is that the probability density function of $\tilde X^{\{n\}}$ converges uniformly to the one of $X^*$ by the local limit theorem (see [She71] or Theorem 19.1 of [BR86]).
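The local limit theorem at the heart of this proof can be visualized in a one-dimensional toy case: the density of $\tilde X^{\{n\}} = \sqrt n(\widehat X^{\{n\}} - \mu)$ for Uniform$(0,1)$ inputs, obtained here by repeated numerical convolution, approaches the $N(0, 1/12)$ density in sup norm. This sketch is our illustration only; the grid step and the values of $n$ are arbitrary.

```python
import numpy as np

# Density of X_tilde_n = sqrt(n)(mean(U_1..U_n) - mu) for U ~ Uniform(0, 1),
# obtained by repeated numerical convolution, compared in sup norm with the
# N(0, 1/12) limit density suggested by the local limit theorem.
dx = 1e-3
grid = np.arange(0.0, 1.0, dx)
f_U = np.ones_like(grid)  # Uniform(0, 1) density on a grid

for n in [2, 5, 20]:
    f_sum = f_U
    for _ in range(n - 1):
        f_sum = np.convolve(f_sum, f_U) * dx      # density of U_1 + ... + U_k
    s = np.arange(len(f_sum)) * dx                # support [0, n] of the sum
    x = (s - n / 2) / np.sqrt(n)                  # x_tilde = (sum - n mu)/sqrt(n)
    f_x = f_sum * np.sqrt(n)                      # change of variables
    f_gauss = np.exp(-x ** 2 / (2 / 12)) / np.sqrt(2 * np.pi / 12)
    print(f"n = {n:3d}   sup|f_X_tilde - f_N(0,1/12)| = {np.max(np.abs(f_x - f_gauss)):.4f}")
```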
B.2 Part 1

Lemma 10. There exists $C_{\sup} < +\infty$ such that, for all $x\in\mathbb{R}^p$,
$$|f_n(x)| \le C_{\sup}\left(\|x\|\,\mathbb{1}_{\|x\|\le\sqrt n} + \frac{\|x\|^k}{\sqrt n^{\,k-1}}\,\mathbb{1}_{\|x\|>\sqrt n}\right),$$
where we recall that $k\in\mathbb{N}^*$ is such that, for all $x\in\mathbb{R}^p$, we have $|f(x)| \le C(1+\|x\|^k)$.

Proof. For all $x\in\mathbb{R}^p$, we have
$$\left|f\left(\frac{x}{\sqrt n} + \mu\right) - f(\mu)\right| \le \left|f\left(\frac{x}{\sqrt n} + \mu\right)\right| + |f(\mu)| \le C_{\sup}\left(1 + \left\|\frac{x}{\sqrt n} + \mu\right\|^k\right) + |f(\mu)| \le C_{\sup}\left(1 + \left\|\frac{x}{\sqrt n}\right\|^k\right).$$
Thus, for all $\|x\| \ge \sqrt n$, we have
$$|f_n(x)| \le C_{\sup}\frac{\|x\|^k}{\sqrt n^{\,k-1}}.$$
If $\|x\| \le \sqrt n$, we have
$$\left|f\left(\frac{x}{\sqrt n} + \mu\right) - f(\mu)\right| \le \max_{\|y\|\le\|\mu\|+1}\|Df(y)\|\,\left\|\frac{x}{\sqrt n} + \mu - \mu\right\| \le C_{\sup}\left\|\frac{x}{\sqrt n}\right\|,$$
and thus $|f_n(x)| \le C_{\sup}\|x\|$. In particular, $|f_n(x)| \le C_{\sup}(\|x\| + \|x\|^k)$ and $f_n(x)^2 \le C_{\sup}(\|x\|^2 + \|x\|^{2k})$.
Lemma 11. For $i = 1, 2$, we have $E\left(f_n(\tilde X^{\{n\}})^{2i}\right) \le C_{\sup}$.

Proof. We have
$$E\left(f_n(\tilde X^{\{n\}})^{2i}\right) \le C_{\sup}\left(E\left(\|\tilde X^{\{n\}}\|^{2ik}\right) + E\left(\|\tilde X^{\{n\}}\|^{2i}\right)\right) \le C_{\sup}\left(E\left(\|\tilde X^{\{n\}}\|_{2ik}^{2ik}\right) + E\left(\|\tilde X^{\{n\}}\|_{2i}^{2i}\right)\right).$$
Now, for all $j\in[1:p]$, by the Rosenthal inequality,
$$E\left(|\tilde X_j^{\{n\}}|^{2ik}\right) = \frac{1}{n^{ik}}E\left(\left(\sum_{l=1}^n U_j^{(l)} - \mu_j\right)^{2ik}\right) \le \frac{C_{\sup}}{n^{ik}}\max\left(nE\left([U_j^{(1)} - \mu_j]^{2ik}\right),\ \left[nE\left([U_j^{(1)} - \mu_j]^2\right)\right]^{ik}\right) \le C_{\sup},$$
and similarly for the moments of order $2i$, which concludes the proof.
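A Monte Carlo illustration of Lemma 11 (ours, in dimension $p = 1$): with Uniform$(0,1)$ inputs and a smooth polynomially-bounded test function (both assumptions of ours), the estimated moments $E(f_n(\tilde X^{\{n\}})^2)$ and $E(f_n(\tilde X^{\{n\}})^4)$ remain stable as $n$ grows.

```python
import numpy as np

# Monte Carlo check that E(f_n^2) and E(f_n^4) stay bounded in n, where
# f_n(x) = sqrt(n)(f(x/sqrt(n) + mu) - f(mu)) and
# X_tilde_n = sqrt(n)(mean of n i.i.d. U - mu).
rng = np.random.default_rng(2)
mu = 0.5  # mean of Uniform(0, 1)

def f(x):
    return np.exp(x) + x ** 3  # smooth test function with polynomial growth

for n in [5, 50, 500]:
    U = rng.uniform(0.0, 1.0, size=(10_000, n))
    x_tilde = np.sqrt(n) * (U.mean(axis=1) - mu)
    f_n = np.sqrt(n) * (f(x_tilde / np.sqrt(n) + mu) - f(mu))
    print(f"n = {n:4d}   E(f_n^2) = {np.mean(f_n ** 2):.4f}   E(f_n^4) = {np.mean(f_n ** 4):.4f}")
```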
Lemma 12. For all $v\subset[1:p]$, $v\ne\emptyset$, and for $i = 1, 2$, we have
$$\sup_n E\left(|f_n(\tilde X^{\{n\}})|^i\,\mathbb{1}_{\tilde X_v^{\{n\}}\notin[-K,K]^{|v|}}\right) \xrightarrow[K\to+\infty]{} 0.$$

Proof. We have
$$E\left(|f_n(\tilde X^{\{n\}})|^i\,\mathbb{1}_{\tilde X_v^{\{n\}}\notin[-K,K]^{|v|}}\right) \le \sqrt{E\left(f_n(\tilde X^{\{n\}})^{2i}\right)}\sqrt{P\left(\tilde X_v^{\{n\}}\notin[-K,K]^{|v|}\right)}.$$
By Lemma 11, $\sup_n\sqrt{E\left(f_n(\tilde X^{\{n\}})^{2i}\right)}$ is bounded. Now, since $(\tilde X_v^{\{n\}})_n$ converges in distribution, it is a tight sequence, hence
$$\sup_n P\left(\tilde X_v^{\{n\}}\notin[-K,K]^{|v|}\right) \le \sup_n P\left(\|\tilde X_v^{\{n\}}\| \ge K\right) \xrightarrow[K\to+\infty]{} 0.$$
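Similarly, a small Monte Carlo sketch (ours, same toy setting as the previous sketch) of the uniform-in-$n$ truncation property of Lemma 12:

```python
import numpy as np

# The truncated moments E(f_n(X_tilde)^2 1{|X_tilde| > K}) vanish as K
# grows, with the bound holding uniformly over the tested values of n.
rng = np.random.default_rng(3)
mu = 0.5

def f(x):
    return np.exp(x) + x ** 3

for K in [1.0, 2.0, 3.0, 4.0]:
    worst = 0.0
    for n in [5, 50, 500]:
        U = rng.uniform(0.0, 1.0, size=(10_000, n))
        x = np.sqrt(n) * (U.mean(axis=1) - mu)
        f_n = np.sqrt(n) * (f(x / np.sqrt(n) + mu) - f(mu))
        worst = max(worst, float(np.mean(f_n ** 2 * (np.abs(x) > K))))
    print(f"K = {K}   max over n of E(f_n^2 1(|X| > K)) = {worst:.5f}")
```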
Lemma 13. The sequence $(f_n)_n$ converges pointwise to $Df(\mu)$.

Proof. For all $x\in\mathbb{R}^p$,
$$f\left(\frac{x}{\sqrt n} + \mu\right) - f(\mu) = Df(\mu)\frac{x}{\sqrt n} + O\left(\left\|\frac{x}{\sqrt n}\right\|^2\right),$$
so
$$f_n(x) = Df(\mu)x + O\left(\frac{\|x\|^2}{\sqrt n}\right).$$

B.3 Part 2

We want to prove that, for all $u\subset[1:p]$, $u\ne\emptyset$, we have
$$\sup_n\left|E_{u,n,K}(\tilde X^{\{n\}}) - E_{u,n}(\tilde X^{\{n\}})\right| \xrightarrow[K\to+\infty]{} 0.$$
We will prove this result for $\emptyset\subsetneq u\subsetneq[1:p]$, since it is easier for $u = [1:p]$ (see Remark 4). We have
$$\left|\int_{\mathbb{R}^{|u|}}\left(\int_{\mathbb{R}^{|-u|}}f_n(x)\,dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})\right)^2 dP_{\tilde X_u^{\{n\}}}(x_u) - \int_{[-K,K]^{|u|}}\left(\int_{[-K,K]^{|-u|}}f_n(x)\,dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})\right)^2 dP_{\tilde X_u^{\{n\}}}(x_u)\right|$$
$$\le \int_{([-K,K]^{|u|})^c}\left(\int_{\mathbb{R}^{|-u|}}f_n(x)\,dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})\right)^2 dP_{\tilde X_u^{\{n\}}}(x_u)$$
$$\quad + \int_{[-K,K]^{|u|}}\left|\left(\int_{\mathbb{R}^{|-u|}}f_n(x)\,dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})\right)^2 - \left(\int_{[-K,K]^{|-u|}}f_n(x)\,dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})\right)^2\right| dP_{\tilde X_u^{\{n\}}}(x_u).$$
We have to bound the two summands of the previous upper-bound. The first term converges to $0$ uniformly in $n$ by Lemma 12. Let us bound the second term. By the mean-value inequality applied to the square function, writing $dP(x_{-u})$ for $dP_{\tilde X_{-u}^{\{n\}}\mid \tilde X_u^{\{n\}}=x_u}(x_{-u})$, we have
$$\int_{[-K,K]^{|u|}}\left|\left(\int_{\mathbb{R}^{|-u|}}f_n(x)\,dP(x_{-u})\right)^2 - \left(\int_{[-K,K]^{|-u|}}f_n(x)\,dP(x_{-u})\right)^2\right| dP_{\tilde X_u^{\{n\}}}(x_u)$$
$$\le 2\int_{[-K,K]^{|u|}}\left(\int_{\mathbb{R}^{|-u|}}|f_n(x)|\,dP(x_{-u})\right)\left(\int_{\mathbb{R}^{|-u|}}\mathbb{1}_{x_{-u}\notin[-K,K]^{|-u|}}|f_n(x)|\,dP(x_{-u})\right)dP_{\tilde X_u^{\{n\}}}(x_u)$$
$$\le 2\sqrt{E\left(E\left(|f_n(\tilde X^{\{n\}})|\mid \tilde X_u^{\{n\}}\right)^2\right)}\sqrt{\int_{\mathbb{R}^{|u|}}\left(\int_{\mathbb{R}^{|-u|}}\mathbb{1}_{x_{-u}\notin[-K,K]^{|-u|}}|f_n(x)|\,dP(x_{-u})\right)^2 dP_{\tilde X_u^{\{n\}}}(x_u)}.$$
Now, $E\left(E\left(|f_n(\tilde X^{\{n\}})|\mid \tilde X_u^{\{n\}}\right)^2\right) \le E\left(f_n(\tilde X^{\{n\}})^2\right)$, which is bounded by Lemma 11, and the other term converges to $0$ uniformly in $n$ by Lemma 12.
Remark 4. In the case where $u = [1:p]$, it is much simpler, since
$$E\left(f_n(\tilde X^{\{n\}})^2\right) - E\left(f_n(\tilde X^{\{n\}})^2\mathbb{1}_{\tilde X^{\{n\}}\in[-K,K]^p}\right) = E\left(f_n(\tilde X^{\{n\}})^2\mathbb{1}_{\tilde X^{\{n\}}\notin[-K,K]^p}\right),$$
which converges to $0$ uniformly in $n$ when $K\to+\infty$ by Lemma 12.

B.4 Part 3

Let $K\in\mathbb{R}_+^*$ and $u\subset[1:p]$ such that $u\ne\emptyset$. We want to prove that
$$\left|E_{u,n,K}(X^*) - E_{u,n,K}(\tilde X^{\{n\}})\right| \xrightarrow[n\to+\infty]{} 0.$$
The case $u = [1:p]$ is much easier (see Remark 5), hence assume that $\emptyset\subsetneq u\subsetneq[1:p]$. Since $K$ is fixed, the probability density function $f_{X^*}$ of $X^*$ is lower-bounded by some $a > 0$ on $[-K,K]^p$. Let
$$\varepsilon_n := \max_{\emptyset\subsetneq u\subseteq[1:p]}\sup_{x\in\mathbb{R}^p}\left|f_{X_u^*}(x_u) - f_{\tilde X_u^{\{n\}}}(x_u)\right|.$$
Using the local limit theorem (see Theorem 19.1 of [BR86] or [She71]), $\varepsilon_n\to 0$ as $n\to+\infty$. We assume that $n$ is large enough so that $\varepsilon_n \le a/2$. Let $b < +\infty$ be the maximum of $f_{X^*}$.

We have
$$\left|E_{u,n,K}(X^*) - E_{u,n,K}(\tilde X^{\{n\}})\right|$$
$$\le \int_{[-K,K]^{|u|}}\left|\left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}dx_{-u}\right)^2 - \left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}dx_{-u}\right)^2\right| f_{X_u^*}(x_u)\,dx_u$$
$$\quad + \int_{[-K,K]^{|u|}}\left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}dx_{-u}\right)^2\left|f_{X_u^*}(x_u) - f_{\tilde X_u^{\{n\}}}(x_u)\right| dx_u.$$
Hence, we have to prove the convergence of the two summands of the previous upper-bound. For the second term, it suffices to remark that
$$\left|f_{X_u^*}(x_u) - f_{\tilde X_u^{\{n\}}}(x_u)\right| \le \varepsilon_n \le \frac{2\varepsilon_n}{a}f_{\tilde X_u^{\{n\}}}(x_u).$$
Hence, the second term is smaller than $\frac{2\varepsilon_n}{a}E\left(f_n(\tilde X^{\{n\}})^2\right)$, which converges to $0$. It remains to prove that the first term converges to $0$. By the mean-value inequality, we have
$$\int_{[-K,K]^{|u|}}\left|\left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}dx_{-u}\right)^2 - \left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}dx_{-u}\right)^2\right| f_{X_u^*}(x_u)\,dx_u$$
$$\le 2\int_{[-K,K]^{|u|}}\left(\int_{[-K,K]^{|-u|}}|f_n(x)|\max\left(\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}, \frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}\right)dx_{-u}\right)\left(\int_{[-K,K]^{|-u|}}|f_n(x)|\left|\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)} - \frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}\right|dx_{-u}\right)f_{X_u^*}(x_u)\,dx_u.$$
Now,
$$\left|\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)} - \frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}\right| \le \frac{\left|f_{X^*}(x) - f_{\tilde X^{\{n\}}}(x)\right|}{f_{X_u^*}(x_u)} + f_{\tilde X^{\{n\}}}(x)\left|\frac{1}{f_{X_u^*}(x_u)} - \frac{1}{f_{\tilde X_u^{\{n\}}}(x_u)}\right|$$
$$\le \frac{\left|f_{X^*}(x) - f_{\tilde X^{\{n\}}}(x)\right|}{f_{X_u^*}(x_u)} + f_{\tilde X^{\{n\}}}(x)\frac{4}{a^2}\left|f_{X_u^*}(x_u) - f_{\tilde X_u^{\{n\}}}(x_u)\right| \le \frac{\varepsilon_n}{f_{X_u^*}(x_u)} + f_{\tilde X^{\{n\}}}(x)\frac{4}{a^2}\varepsilon_n$$
$$\le \frac{\varepsilon_n}{f_{X_u^*}(x_u)} + f_{X^*}(x)\frac{8}{a^2}\varepsilon_n \le \frac{\varepsilon_n}{a}\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)} + \frac{8b}{a^2}\varepsilon_n\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)} \le C_{\sup}\varepsilon_n\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}.$$
Hence, for $n$ large enough such that $C_{\sup}\varepsilon_n \le 1$, we have
$$\int_{[-K,K]^{|u|}}\left|\left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}dx_{-u}\right)^2 - \left(\int_{[-K,K]^{|-u|}}f_n(x)\frac{f_{\tilde X^{\{n\}}}(x)}{f_{\tilde X_u^{\{n\}}}(x_u)}dx_{-u}\right)^2\right| f_{X_u^*}(x_u)\,dx_u$$
$$\le C_{\sup}\int_{[-K,K]^{|u|}}\left(\int_{[-K,K]^{|-u|}}|f_n(x)|\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}dx_{-u}\right)\left(\int_{[-K,K]^{|-u|}}|f_n(x)|\,\varepsilon_n\frac{f_{X^*}(x)}{f_{X_u^*}(x_u)}dx_{-u}\right)f_{X_u^*}(x_u)\,dx_u \le C_{\sup}\varepsilon_n E\left(f_n(X^*)^2\right),$$
which converges to $0$.

Remark 5. If $u = [1:p]$, it suffices to remark that $\left|f_{X^*}(x) - f_{\tilde X^{\{n\}}}(x)\right| \le \varepsilon_n \le \frac{\varepsilon_n}{a}f_{X^*}(x)$ on $[-K,K]^p$. Thus,
$$\left|E_{u,n,K}(X^*) - E_{u,n,K}(\tilde X^{\{n\}})\right| \le \int_{[-K,K]^p}f_n(x)^2\left|f_{X^*}(x) - f_{\tilde X^{\{n\}}}(x)\right| dx \le \frac{\varepsilon_n}{a}E\left(f_n(X^*)^2\right) \le C_{\sup}\varepsilon_n.$$

B.5 Part 4

Let us prove that $E(f_n(\tilde X^{\{n\}})) - E(f_n(X^*)) \to 0$ as $n\to+\infty$. By Lemma 12, we have
$$\sup_n\left|E\left(f_n(\tilde X^{\{n\}})\right) - E\left(f_n(\tilde X^{\{n\}})\mathbb{1}_{\tilde X^{\{n\}}\in[-K,K]^p}\right)\right| \xrightarrow[K\to\infty]{} 0.$$
Let $\varepsilon > 0$ and let $K$ be such that
$$\sup_n\left|E\left(f_n(\tilde X^{\{n\}})\right) - E\left(f_n(\tilde X^{\{n\}})\mathbb{1}_{\tilde X^{\{n\}}\in[-K,K]^p}\right)\right| < \varepsilon \quad\text{and}\quad \sup_n\left|E\left(f_n(X^*)\right) - E\left(f_n(X^*)\mathbb{1}_{X^*\in[-K,K]^p}\right)\right| < \varepsilon.$$
By the local limit theorem, we have
$$\left|E\left(f_n(\tilde X^{\{n\}})\mathbb{1}_{\tilde X^{\{n\}}\in[-K,K]^p}\right) - E\left(f_n(X^*)\mathbb{1}_{X^*\in[-K,K]^p}\right)\right| \xrightarrow[n\to+\infty]{} 0.$$
Since $\mathrm{Var}(E(f_n(Z)\mid Z_u)) = E_{u,n}(Z) - E(f_n(Z))^2$, combining this convergence of the means with Parts 2 and 3 gives, for all $u\subset[1:p]$,
$$\mathrm{Var}\left(E\left(f_n(\tilde X^{\{n\}})\mid \tilde X_u^{\{n\}}\right)\right) - \mathrm{Var}\left(E\left(f_n(X^*)\mid X_u^*\right)\right) \xrightarrow[n\to+\infty]{} 0.$$

B.6 Part 5

To prove the convergence of the Shapley effects, it suffices to prove that
$\mathrm{Var}(f_n(X^*))$ is lower-bounded for $n$ large enough. Hence, we show that $\mathrm{Var}(f_n(X^*))$ converges to $\mathrm{Var}(Df(\mu)X^*) > 0$.

Let $i = 1, 2$ and let $\varepsilon > 0$. By Lemma 12, let $K$ be such that
$$\sup_n E\left(|f_n(X^*)|^i\,\mathbb{1}_{X^*\notin[-K,K]^p}\right) \le \varepsilon,\qquad E\left(|Df(\mu)X^*|^i\,\mathbb{1}_{X^*\notin[-K,K]^p}\right) \le \varepsilon.$$
By Lemmas 10 and 13 and by the dominated convergence theorem, we have
$$E\left(f_n(X^*)^i\,\mathbb{1}_{X^*\in[-K,K]^p}\right) \xrightarrow[n\to+\infty]{} E\left([Df(\mu)X^*]^i\,\mathbb{1}_{X^*\in[-K,K]^p}\right).$$
Hence, $\mathrm{Var}(f_n(X^*))$ converges to $\mathrm{Var}(Df(\mu)X^*)$. Thus, for all $u\subset[1:p]$,
$$S_u^{\mathrm{cl}}\left(\tilde X^{\{n\}}, f_n\right) - S_u^{\mathrm{cl}}(X^*, f_n) \xrightarrow[n\to+\infty]{} 0.$$
Hence,
$$\left\|\eta\left(\tilde X^{\{n\}}, f_n\right) - \eta(X^*, f_n)\right\| \xrightarrow[n\to+\infty]{} 0.$$

B.7 Proof of Corollary 3

Since $\widehat X^{\{n'\}}\xrightarrow[n'\to+\infty]{\mathrm{a.s.}}\mu$ and $\widehat\Sigma^{\{n''\}}\xrightarrow[n''\to+\infty]{\mathrm{a.s.}}\Sigma$, it suffices to prove that, if $(x^{\{n\}})_n$ converges to $\mu$ and $(\Sigma^{\{n\}})_n$ converges to $\Sigma$, we have
$$\left\|\eta\left(\widehat X^{\{n\}}, f\right) - \eta\left(X_n^*, \tilde f^{\{n\}}_{h^{\{n\}},x^{\{n\}}}\right)\right\| \xrightarrow[n\to+\infty]{} 0,$$
where $X_n^*$ is a random vector with distribution $N(\mu, \Sigma^{\{n\}})$. Let $(x^{\{n\}})_n$ and $(\Sigma^{\{n\}})_n$ be such sequences. Recall that
$$\left\|\eta\left(\tilde X^{\{n\}}, f_n\right) - \eta(X^*, f_n)\right\| \xrightarrow[n\to+\infty]{} 0,$$
where $X^*\sim N(0,\Sigma)$, that is,
$$\left\|\eta\left(\widehat X^{\{n\}}, f\right) - \eta\left(X^{\{n\}}, f\right)\right\| \xrightarrow[n\to+\infty]{} 0,$$
where $X^{\{n\}}\sim N\left(\mu, \frac{1}{n}\Sigma\right)$. Hence, we have to prove that
$$\left\|\eta\left(X^{\{n\}}, f\right) - \eta\left(X_n^*, \tilde f^{\{n\}}_{h^{\{n\}},x^{\{n\}}}\right)\right\| \xrightarrow[n\to+\infty]{} 0.$$
By Propositions 1 and 2, remark that $\eta(X^{\{n\}}, f)$ converges to $\eta(X^*, Df(\mu))$. Moreover,
$$\eta\left(X_n^*, \tilde f^{\{n\}}_{h^{\{n\}},x^{\{n\}}}\right) = \eta\left(X_n^* + x^{\{n\}} - \mu, \tilde f^{\{n\}}_{h^{\{n\}},x^{\{n\}}}\right) \xrightarrow[n\to+\infty]{} \eta(X^*, Df(\mu)),$$
so the triangle inequality concludes the proof.
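As a last illustration (ours, not the paper's implementation), the finite-difference linearization $\tilde f_{h,x}$ appearing in Corollary 3 can be checked numerically to recover $Df(\mu)$ as $h\to 0$ and $x\to\mu$; the helper `fd_gradient` and the test function below are hypothetical choices.

```python
import numpy as np

# Sketch of the finite-difference linearization behind Corollary 3: the
# slope vector of forward differences with step h at a point x near mu
# approaches Df(mu) when h -> 0 and x -> mu.
def fd_gradient(f, x, h):
    """Forward-difference approximation of the gradient of f at x."""
    p = len(x)
    return np.array([(f(x + h * e) - f(x)) / h for e in np.eye(p)])

def f(x):
    return np.sin(x[0]) + x[0] * x[1] ** 2

mu = np.array([1.0, -0.5])
grad_mu = np.array([np.cos(mu[0]) + mu[1] ** 2, 2.0 * mu[0] * mu[1]])

for h, x in [(1e-1, mu + 0.05), (1e-2, mu + 0.01), (1e-4, mu + 1e-3)]:
    err = np.linalg.norm(fd_gradient(f, x, h) - grad_mu)
    print(f"h = {h:.0e}   ||x - mu|| = {np.linalg.norm(x - mu):.0e}   error = {err:.1e}")
```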