Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression
Ian Covert, University of Washington
Su-In Lee, University of Washington
Abstract
The Shapley value concept from cooperative game theory has become a popular technique for interpreting ML models, but efficiently estimating these values remains challenging, particularly in the model-agnostic setting. Here, we revisit the idea of estimating Shapley values via linear regression to understand and improve upon this approach. By analyzing the original KernelSHAP alongside a newly proposed unbiased version, we develop techniques to detect its convergence and calculate uncertainty estimates. We also find that the original version incurs a negligible increase in bias in exchange for significantly lower variance, and we propose a variance reduction technique that further accelerates the convergence of both estimators. Finally, we develop a version of KernelSHAP for stochastic cooperative games that yields fast new estimators for two global explanation methods.
1 INTRODUCTION

Shapley values are central to many machine learning (ML) model explanation methods (e.g., SHAP, IME, QII, Shapley Effects, Shapley Net Effects, SAGE) [24, 25, 38, 13, 32, 23, 12]. Though developed in the cooperative game theory context [36], recent work shows that Shapley values provide a powerful tool for explaining how models work when either individual features [24], individual neurons in a neural network [18], or individual samples in a dataset [17] are viewed as players in a cooperative game. They have become a go-to solution for allocating credit and quantifying contributions due to their appealing theoretical properties.
The main challenge when using Shapley values is calculating them efficiently. A naive calculation has computational complexity that is exponential in the number of players, so numerous approaches have been proposed to accelerate their calculation. Besides brute-force methods [23], other techniques include sampling-based approximations [38, 37, 10, 12], model-specific approximations (e.g., TreeSHAP) [1, 25] and a linear regression-based approximation (KernelSHAP) [24].

Here, we revisit the regression-based approach to address several shortcomings in KernelSHAP. Recent work has questioned whether KernelSHAP is an unbiased estimator [29], and, unlike sampling-based estimators [6, 26, 12], KernelSHAP does not provide uncertainty estimates. Furthermore, it provides no guidance on the number of samples required because its convergence properties are not well understood.

We address each of these problems, in part by building on a newly proposed unbiased version of the regression-based approach. Our contributions include:

1. Deriving an unbiased version of KernelSHAP and showing empirically that the original version incurs a negligible increase in bias in exchange for significantly lower variance.
2. Showing how to detect KernelSHAP's convergence, automatically determine the number of samples required, and calculate uncertainty estimates for the results.
3. Proposing a variance reduction technique that further accelerates KernelSHAP's convergence.
4. Adapting the regression-based approach to stochastic cooperative games [9] to provide fast new approximations for two global explanation methods, SAGE [12] and Shapley Effects [32].

With these new insights and tools, we offer a more practical approach to Shapley value estimation via linear regression. Our code is available online at https://github.com/iancovert/shapley-regression.

2 BACKGROUND

We now provide background information on cooperative game theory and the Shapley value.

2.1 Cooperative Games

A cooperative game is a function v : 2^d → R that returns a value for each coalition (subset) S ⊆ D, where D = {1, ..., d} represents a set of players. Cooperative game theory has become increasingly important in ML because many methods frame model explanation problems in terms of cooperative games [11]. Notably, SHAP [24], IME [38] and QII [13] define cooperative games that represent an individual prediction's dependence on different features. For a model f and an input x, SHAP (when using the marginal distribution [24]) analyzes the cooperative game v_x, defined as

    v_x(S) = E[f(x_S, X_{D\S})],    (1)

where x_S ≡ {x_i : i ∈ S} represents a feature subset and X_S is the corresponding random variable. Two other methods, Shapley Effects [32] and SAGE [12], define cooperative games that represent a model's behavior across the entire dataset. For example, given a loss function ℓ and response variable Y, SAGE uses a cooperative game w that represents the model's predictive performance given a subset of features X_S:

    w(S) = −E[ℓ(E[f(X) | X_S], Y)].    (2)

Several other techniques also frame model explanation questions in terms of cooperative games, where a target quantity (e.g., model loss) varies as groups of players (e.g., features) are removed, and the Shapley value summarizes each player's contribution [11].
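To make Eq. (1) concrete, the following is a minimal sketch of the marginal-expectation game v_x. The model f, the input x, the background data, and the plain Monte Carlo average are illustrative placeholders, not the original implementation:

    import numpy as np

    def make_shap_game(f, x, X_background):
        r"""Return v_x(S) = E[f(x_S, X_{D\S})], approximated by averaging
        over background samples drawn from the marginal distribution."""
        def v(S):                    # S: boolean mask of length d
            X = X_background.copy()  # draws of X_{D\S}
            X[:, S] = x[S]           # fix the features in S at x's values
            return f(X).mean()       # Monte Carlo estimate of the expectation
        return v

    # Example usage with a toy model (hypothetical):
    # v_x = make_shap_game(lambda X: X.sum(axis=1), x, X_background)
    # v_x(np.array([True, False, True]))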
2.2 Shapley Values

The Shapley value [36] assumes that the grand coalition D is participating and seeks to provide each player with a fair allocation of the total profit, which is represented by v(D). Fair allocations must be based on each player's contribution to the profit, but a player's contribution is often difficult to define. Player i's marginal contribution to the coalition S is the difference v(S ∪ {i}) − v(S), but the marginal contribution typically depends on which players S are already participating.

The Shapley value resolves this problem by deriving a unique value based on a set of fairness axioms; see [36, 30] for further detail. It can be understood as a player's average marginal contribution across all possible player orderings, and each player's Shapley value φ_1(v), ..., φ_d(v) for a game v is given by:

    φ_i(v) = (1/d) Σ_{S ⊆ D\{i}} C(d−1, |S|)^{−1} (v(S ∪ {i}) − v(S)),    (3)

where C(d−1, |S|) denotes the binomial coefficient "d−1 choose |S|". Many ML model explanation methods can be understood in terms of ideas from cooperative game theory [11], but the Shapley value is especially popular and is also widely used in other fields [2, 33, 39].

2.3 Shapley Values via Linear Regression

While we can characterize the Shapley value in many ways, the perspective most relevant to our work is viewing it as a solution to a weighted least squares problem. Many works have considered fitting simple models to cooperative games [8, 20, 19, 14, 15, 28], particularly additive models of the form

    u(S) = β_0 + Σ_{i ∈ S} β_i.

Such additive models are known as inessential games, and although a game v may not be inessential, an inessential approximation can help summarize each player's average contribution. Several works [8, 20, 14] model games by solving a weighted least squares problem using a weighting function μ:

    min_{β_0, ..., β_d} Σ_{S ⊆ D} μ(S) (u(S) − v(S))^2.

Perhaps surprisingly, different weighting kernels μ lead to recognizable optimal regression coefficients (β_0^*, ..., β_d^*) [11]. In particular, a carefully chosen weighting kernel yields optimal regression coefficients equal to the Shapley values [8, 24]. The Shapley kernel μ_Sh is given by

    μ_Sh(S) = (d − 1) / (C(d, |S|) |S| (d − |S|)),

where the values μ_Sh({}) = μ_Sh(D) = ∞ effectively enforce the constraints β_0 = v({}) for the intercept and Σ_{i ∈ D} β_i = v(D) − v({}) for the sum of the coefficients. Lundberg and Lee [24] used this Shapley value interpretation when developing an approach to approximate SHAP values via linear regression.
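This equivalence can be checked numerically. Below is a small self-contained sketch (a random game on d = 4 players is an arbitrary choice for illustration) that computes the Shapley values via Eq. (3) and via the constrained weighted least squares problem, and verifies that they match:

    import itertools
    import numpy as np
    from math import comb

    d = 4
    rng = np.random.default_rng(0)
    values = {S: rng.normal() for k in range(d + 1)
              for S in itertools.combinations(range(d), k)}
    def v(S): return values[tuple(sorted(S))]

    # Exact Shapley values via Eq. (3)
    phi = np.zeros(d)
    for i in range(d):
        for k in range(d):
            for S in itertools.combinations([j for j in range(d) if j != i], k):
                phi[i] += (v(S + (i,)) - v(S)) / (d * comb(d - 1, k))

    # Weighted least squares over interior coalitions with the Shapley kernel,
    # subject to sum(beta) = v(D) - v({}) (handled via a Lagrange multiplier)
    A = np.zeros((d, d)); b = np.zeros(d)
    for k in range(1, d):
        w = (d - 1) / (comb(d, k) * k * (d - k))   # Shapley kernel weight
        for S in itertools.combinations(range(d), k):
            z = np.zeros(d); z[list(S)] = 1
            A += w * np.outer(z, z)
            b += w * z * (v(S) - v(()))
    Ainv = np.linalg.inv(A); ones = np.ones(d)
    nu = (ones @ Ainv @ b - (v(tuple(range(d))) - v(()))) / (ones @ Ainv @ ones)
    beta = Ainv @ (b - nu * ones)
    print(np.allclose(phi, beta))   # True: the coefficients equal the Shapley values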
3 SHAPLEY VALUE ESTIMATION

As noted, Shapley values are difficult to calculate because they require examining each player's marginal contribution to every possible subset (Eq. 3). This leads to run-times that are exponential in the number of players, so efficient approximations are of great practical importance [38, 37, 24, 10, 1, 25, 12]. Here, we revisit the regression-based approach presented by Lundberg and Lee (KernelSHAP) [24] and then present an unbiased version of this approach whose properties are simpler to analyze.

3.1 Optimization Problem

The least squares characterization of the Shapley value suggests that we can calculate the values φ_1(v), ..., φ_d(v) by solving the optimization problem

    min_{β_1, ..., β_d} Σ_{0 < |S| < d} μ_Sh(S) (v({}) + Σ_{i∈S} β_i − v(S))^2
    s.t. Σ_{i∈D} β_i = v(D) − v({}).    (4)

We introduce new notation to make the problem easier to solve. First, we denote the non-intercept coefficients as β = (β_1, ..., β_d) ∈ R^d. Next, we denote each subset S ⊆ D using the corresponding binary vector z ∈ {0, 1}^d, and with tolerable abuse of notation we write v(z) ≡ v(S) and μ_Sh(z) ≡ μ_Sh(S) for S = {i : z_i = 1}. Lastly, we denote a distribution over Z using p(z), where we define p(z) ∝ μ_Sh(z) when 0 < 1^T z < d and p(z) = 0 otherwise. With this, we can rewrite the optimization problem as

    min_β Σ_z p(z) (v(0) + z^T β − v(z))^2
    s.t. 1^T β = v(1) − v(0),    (5)

where 0 and 1 denote the all-zeros and all-ones vectors (the empty and grand coalitions).

3.2 Dataset Sampling (KernelSHAP)

Solving the problem in Eq. 5 requires evaluating the cooperative game v with all 2^d coalitions. Evaluating v(0) and v(1) is sufficient to ensure that the constraints are satisfied, but all values v(z) for z such that 0 < 1^T z < d are required to fit the model exactly. KernelSHAP manages this challenge by subsampling a dataset and optimizing an approximate objective. We refer to this approach as dataset sampling. Using n independent samples z_i ∼ p(Z) and their values v(z_i), KernelSHAP solves the following problem:

    min_β (1/n) Σ_{i=1}^n (v(0) + z_i^T β − v(z_i))^2
    s.t. 1^T β = v(1) − v(0).    (6)

The dataset sampling approach, also applied by LIME [35], offers the flexibility to use only enough samples to accurately approximate the objective. Given a set of samples (z_1, ..., z_n), solving this problem is straightforward. The Lagrangian with multiplier ν ∈ R is given by:

    L̂(β, ν) = β^T ((1/n) Σ_i z_i z_i^T) β − 2 β^T ((1/n) Σ_i z_i (v(z_i) − v(0))) + (1/n) Σ_i (v(z_i) − v(0))^2 + 2ν(1^T β − v(1) + v(0)).

If we introduce the shorthand notation

    Â_n = (1/n) Σ_{i=1}^n z_i z_i^T  and  b̂_n = (1/n) Σ_{i=1}^n z_i (v(z_i) − v(0)),

then we can use the problem's KKT conditions [5] to derive the following solution:

    β̂_n = Â_n^{−1} (b̂_n − 1 (1^T Â_n^{−1} b̂_n − v(1) + v(0)) / (1^T Â_n^{−1} 1)).    (7)

This method is known as KernelSHAP [24], and the implementation in the SHAP repository (http://github.com/slundberg/shap) also allows for regularization terms in the approximate objective (Eq. 6), such as the ℓ_1 penalty [40]. While this approach is intuitive and simple to implement, the estimator β̂_n is surprisingly difficult to characterize. As we show in Section 4, it is unclear whether it is unbiased, and understanding its variance and rate of convergence is not straightforward. Therefore, we derive an alternative approach that is simpler to analyze.
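For illustration, a minimal sketch of this estimator: the coalition sampler draws sizes in proportion to the Shapley kernel weight summed over each size, and the final line applies Eq. (7). The game v is assumed to accept a boolean mask of length d (e.g., the make_shap_game sketch above), and n is assumed large enough that Â_n is invertible:

    import numpy as np

    def sample_z(d, n, rng):
        """Draw coalitions z ~ p(Z): first a size k with p(|z| = k) ∝ 1/(k(d−k)),
        then a uniformly random coalition of that size."""
        sizes = np.arange(1, d)
        w = 1.0 / (sizes * (d - sizes))
        ks = rng.choice(sizes, size=n, p=w / w.sum())
        Z = np.zeros((n, d))
        for row, k in zip(Z, ks):
            row[rng.choice(d, size=k, replace=False)] = 1
        return Z

    def kernelshap(v, d, n, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        Z = sample_z(d, n, rng)
        v0, v1 = v(np.zeros(d, dtype=bool)), v(np.ones(d, dtype=bool))
        vals = np.array([v(z) for z in Z.astype(bool)])
        A = Z.T @ Z / n                                   # Â_n
        b = Z.T @ (vals - v0) / n                         # b̂_n
        Ainv = np.linalg.inv(A); ones = np.ones(d)
        nu = (ones @ Ainv @ b - v1 + v0) / (ones @ Ainv @ ones)
        return Ainv @ (b - nu * ones)                     # Eq. (7)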
3.3 Unbiased KernelSHAP

Consider the solution to the problem that uses all 2^d player coalitions (Eq. 5). Rather than finding an exact solution to an approximate problem (Section 3.2), we now derive an approximate solution to the exact problem. The full problem's Lagrangian is given by

    L(β, ν) = β^T E[ZZ^T] β − 2 β^T E[Z(v(Z) − v(0))] + E[(v(Z) − v(0))^2] + 2ν(1^T β − v(1) + v(0)),

where we now consider Z to be a random variable distributed according to p(Z). Using the shorthand notation A = E[ZZ^T] and b = E[Z(v(Z) − v(0))], we can write the solution to the exact problem as:

    β* = A^{−1} (b − 1 (1^T A^{−1} b − v(1) + v(0)) / (1^T A^{−1} 1)).    (8)

Due to our setup of the optimization problem, we have the property that β*_i = φ_i(v). Unfortunately, we cannot evaluate this expression in practice without evaluating v for all 2^d coalitions S ⊆ D.

However, knowledge of p(Z) means that A ∈ R^{d×d} can be calculated exactly and efficiently. To see this, note that (ZZ^T)_{ij} = Z_i Z_j = 1{Z_i = Z_j = 1}. Therefore, to calculate A, we must calculate only p(Z_i = 1) for the diagonal values A_{ii} and p(Z_i = Z_j = 1) for the off-diagonal values A_{ij}. See Appendix A for their derivations.

Since b cannot be calculated exactly and efficiently due to its dependence on v, this suggests that we should use A's exact form and approximate β* by estimating (only) b. We propose the following estimator for b:

    b̄_n = (1/n) Σ_{i=1}^n z_i v(z_i) − E[Z] v(0).

Using this, we arrive at an alternative to the original KernelSHAP estimator, which we refer to as unbiased KernelSHAP:

    β̄_n = A^{−1} (b̄_n − 1 (1^T A^{−1} b̄_n − v(1) + v(0)) / (1^T A^{−1} 1)).    (9)

In the next section, we compare these two approaches both theoretically and empirically.
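A sketch of the unbiased estimator follows; it reuses sample_z from the previous sketch, and rather than assuming a closed form for A, it sums the required probabilities over coalition sizes (consistent with Appendix A):

    import numpy as np

    def exact_A(d):
        """A = E[ZZ^T]: P(Z_i = 1) on the diagonal, P(Z_i = Z_j = 1) off it,
        computed from p(|Z| = k) ∝ 1/(k(d−k))."""
        sizes = np.arange(1, d)
        p = 1.0 / (sizes * (d - sizes)); p = p / p.sum()
        p_i = np.sum(p * sizes / d)                             # equals 1/2 by symmetry
        p_ij = np.sum(p * sizes * (sizes - 1) / (d * (d - 1)))
        return np.full((d, d), p_ij) + (p_i - p_ij) * np.eye(d)

    def unbiased_kernelshap(v, d, n, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        Z = sample_z(d, n, rng)
        v0, v1 = v(np.zeros(d, dtype=bool)), v(np.ones(d, dtype=bool))
        vals = np.array([v(z) for z in Z.astype(bool)])
        b = Z.T @ vals / n - 0.5 * v0                 # b̄_n, with E[Z] = (1/2)1
        A = exact_A(d); Ainv = np.linalg.inv(A)
        ones = np.ones(d)
        nu = (ones @ Ainv @ b - v1 + v0) / (ones @ Ainv @ ones)
        return Ainv @ (b - nu * ones)                 # Eq. (9)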
4 ESTIMATOR ANALYSIS

We now analyze the consistency, bias and variance properties of the Shapley value estimators, and we consider how to detect, forecast, and accelerate their convergence.

4.1 Consistency, Bias and Variance

A consistent estimator is one that converges to the correct Shapley values β* given a sufficiently large number of samples. If the game v has bounded value, then the strong law of large numbers implies that

    lim_{n→∞} Â_n = A  and  lim_{n→∞} b̂_n = lim_{n→∞} b̄_n = b,

where the convergence is almost sure. From this, we see that both estimators are consistent:

    lim_{n→∞} β̂_n = lim_{n→∞} β̄_n = β*.

Next, an unbiased estimator is one whose expectation is equal to the correct Shapley values β*. This is difficult to verify for the KernelSHAP estimator β̂_n due to the interaction between Â_n and b̂_n (see Eq. 7). Both Â_n and b̂_n are unbiased, but terms such as E[Â_n^{−1} b̂_n] and E[Â_n^{−1} 1 1^T Â_n^{−1} b̂_n / (1^T Â_n^{−1} 1)] are difficult to characterize. To make any claims about KernelSHAP's bias, we rely instead on empirical observations.

In contrast, it is easy to see that the alternative estimator β̄_n is unbiased. Because of its linear dependence on b̄_n and the fact that E[b̄_n] = b, we can see that

    E[β̄_n] = β*.

We therefore conclude that the alternative estimator β̄_n is both consistent and unbiased, whereas the original KernelSHAP (β̂_n) is only provably consistent. It is for this reason that we refer to β̄_n as unbiased KernelSHAP.

Regarding the estimators' variance, unbiased KernelSHAP is once again simpler to characterize. The values β̄_n are a function of b̄_n, and the multivariate central limit theorem (CLT) [41] asserts that b̄_n converges in distribution to a multivariate Gaussian, or

    √n (b̄_n − b) →_D N(0, Σ_{b̄}),    (10)

where Σ_{b̄} = Cov(Z v(Z)). This implies that for the estimator β̄_n, we have the convergence property

    √n (β̄_n − β*) →_D N(0, Σ_{β̄}),    (11)

where, due to its linear dependence on b̄_n (see Eq. 9), we have the covariance Σ_{β̄} given by

    Σ_{β̄} = C Σ_{b̄} C^T    (12)
    C = A^{−1} − (A^{−1} 1 1^T A^{−1}) / (1^T A^{−1} 1).    (13)

This allows us to reason about unbiased KernelSHAP's asymptotic distribution. In particular, we remark that β̄_n has variance that reduces at a rate of O(1/n).

In comparison, the original KernelSHAP estimator β̂_n is difficult to analyze due to the interaction between the Â_n and b̂_n terms. We can apply the CLT to either term individually, but reasoning about β̂_n's distribution or variance remains challenging.

To facilitate our analysis of KernelSHAP, we present a simple experiment to compare the two estimators. We approximated the SHAP values for an individual prediction in the census income dataset [22] and empirically calculated the mean squared error relative to the true SHAP values across 250 runs. We then decomposed the error into bias and variance terms as follows:

    E[||β̂_n − β*||^2] = E[||β̂_n − E[β̂_n]||^2] + ||E[β̂_n] − β*||^2,

where the left side is the error and the two right-hand terms are the variance and squared bias, respectively.

Figure 1 shows that the error for both estimators is dominated by variance rather than bias. It also shows that KernelSHAP incurs virtually no bias in exchange for significantly lower variance. (The true SHAP values use a sufficient number of samples to ensure convergence; see Section 4.3. The unbiased approach appears to have higher bias due to estimation error, but its bias is provably zero.) In Appendix F, we provide global measures of the bias and variance to confirm these observations across multiple examples and two other datasets. This suggests that although KernelSHAP is more difficult to analyze theoretically, it should be used in practice because its bias is negligible and it converges faster.

Figure 1: SHAP error decomposition for the original and unbiased KernelSHAP estimators. The bias and variance are calculated empirically across 250 runs.

4.2 Variance Reduction

Having analyzed each estimator's properties, we now consider whether their convergence can be accelerated. We propose a simple variance reduction technique that leads to significantly faster convergence in practice. When sampling n subsets according to the distribution z_i ∼ p(Z), we suggest a paired sampling strategy where each sample z_i is paired with its complement 1 − z_i. (We call 1 − z the complement because it is the binary vector for D \ S, where S corresponds to z.) To show why this approach accelerates convergence, we focus on unbiased KernelSHAP, which is easier to analyze theoretically.

When estimating b for unbiased KernelSHAP (β̄_n), consider using the following modified estimator that combines z_i with 1 − z_i:

    b̌_n = (1/2n) Σ_{i=1}^n (z_i v(z_i) + (1 − z_i) v(1 − z_i) − 1 v(0)).    (14)

Substituting this into unbiased KernelSHAP (Eq. 9) yields a new estimator β̌_n that preserves the properties of being both consistent and unbiased:

    β̌_n = A^{−1} (b̌_n − 1 (1^T A^{−1} b̌_n − v(1) + v(0)) / (1^T A^{−1} 1)).    (15)
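For illustration, a sketch of Eqs. (14-15), reusing sample_z and exact_A from the sketches above; each drawn coalition is evaluated together with its complement:

    import numpy as np

    def paired_unbiased_kernelshap(v, d, n, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        Z = sample_z(d, n, rng)
        masks = Z.astype(bool)
        v0, v1 = v(np.zeros(d, dtype=bool)), v(np.ones(d, dtype=bool))
        vals = np.array([v(m) for m in masks])
        vals_c = np.array([v(~m) for m in masks])     # complements 1 − z_i
        b = (Z.T @ vals + (1 - Z).T @ vals_c) / (2 * n) - 0.5 * v0   # Eq. (14)
        A = exact_A(d); Ainv = np.linalg.inv(A)
        ones = np.ones(d)
        nu = (ones @ Ainv @ b - v1 + v0) / (ones @ Ainv @ ones)
        return Ainv @ (b - nu * ones)                 # Eq. (15)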
For games v that satisfy a specific condition, we can guarantee that this sampling approach leads to β̌_n having lower variance than β̄_n, even when we account for b̌_n requiring twice as many cooperative game evaluations as b̄_n (see proof in Appendix B).

Theorem 1. The difference between the covariance matrices for the estimators β̄_{2n} and β̌_n (which use equal numbers of game evaluations) is given by

    Cov(β̄_{2n}) − Cov(β̌_n) = (1/2n) C G_v C^T,

where G_v is a property of the game v, defined as

    G_v = −Cov(Z v(Z), (1 − Z) v(1 − Z)).

For sufficiently large n, G_v ⪰ 0 guarantees that the Gaussian confidence ellipsoid Ē_{n,α} for β̄_{2n} contains the corresponding confidence ellipsoid Ě_{n,α} for β̌_n, or Ě_{n,α} ⊆ Ē_{n,α}, at any confidence level α ∈ (0, 1).

Theorem 1 shows that β̌_n is a more precise estimator than β̄_n when the condition G_v ⪰ 0 holds (i.e., G_v is positive semi-definite). This may not hold in the general case, but in Appendix B we show that a weaker condition holds for all games: the diagonal values of G_v satisfy (G_v)_{ii} ≥ 0. Geometrically, this weaker condition means that Ē_{n,α} extends beyond Ě_{n,α} in the axis-aligned directions. Figure 2 illustrates the result of Theorem 1 by showing empirical 95% confidence ellipsoids for two SHAP values. Although a comparable condition is difficult to derive for the original KernelSHAP estimator (β̂_n), we find that the paired sampling approach yields a similar reduction in variance. Our experiments provide further evidence that this approach accelerates convergence for both estimators (Section 6).

Figure 2: Gaussian 95% confidence ellipsoids for two SHAP values in a census income prediction (from 250 runs). The estimators use an equal number of samples.
4.3 Convergence Detection and Forecasting

One of KernelSHAP's practical shortcomings is its lack of guidance on the number of samples required to obtain accurate estimates. We address this problem by developing an approach for convergence detection and forecasting.

Previously, we showed that unbiased KernelSHAP (β̄_n) has variance that reduces at a rate of O(1/n) (Eq. 10). Furthermore, its variance is simple to estimate in practice: we require only an empirical estimate Σ̂_{b̄} of Σ_{b̄} (defined above), which we can calculate using an online algorithm, such as Welford's [42].

We also showed that the original KernelSHAP (β̂_n) is difficult to characterize, but its variance is empirically lower than the unbiased version. Understanding its variance is useful for convergence detection, so we propose an approach for approximating it. Based on the results in Figure 1, we may hypothesize that KernelSHAP's variance reduces at the same rate of O(1/n); in Appendix F, we examine this by plotting the product of the variance and the number of samples over the course of estimation. We find that the product is constant as the sample number increases, which suggests that the O(1/n) rate holds in practice. This property is difficult to prove formally, but it can be used for simple variance approximation.

When running KernelSHAP, we suggest estimating the variance by selecting an intermediate value m such that m ≪ n and calculating multiple independent estimates β̂_m while accumulating samples for β̂_n. For any n, we can then approximate Cov(β̂_n) as

    Cov(β̂_n) ≈ (m/n) Cov(β̂_m),

where Cov(β̂_m) is estimated empirically using the multiple independent estimates β̂_m. This online approach has a negligible impact on the algorithm's run-time, and the covariance estimate can be used to provide confidence intervals for the final results.

Whether we use the original or unbiased version of KernelSHAP, the estimator's covariance at a given value of n lets us both detect and forecast convergence. For detection, we propose stopping at the current value n when the largest standard deviation is a sufficiently small portion t (e.g., t = 0.01) of the gap between the largest and smallest Shapley value estimates. For unbiased KernelSHAP, this criterion is equivalent to:

    max_i √((Σ̂_{β̄})_{ii} / n) < t (max_i (β̄_n)_i − min_i (β̄_n)_i).

To forecast the number of samples required to reach convergence, we again invoke the property that estimates have variance that reduces at a rate of O(1/n). Given a value t and estimates β̄_n and Σ̂_{β̄}, the approximate number of samples N̂ required is:

    N̂ = (1/t^2) (max_i √((Σ̂_{β̄})_{ii}) / (max_i (β̄_n)_i − min_i (β̄_n)_i))^2.

This allows us to forecast the time until convergence at any point during the algorithm. The forecast is expected to become more accurate as the estimated terms become more precise.

Our approach is theoretically grounded for unbiased KernelSHAP, and the approximate approach for the standard version of KernelSHAP relies only on the assumption that Cov(β̂_n) reduces at a rate of O(1/n). Appendix G shows algorithms for both approaches, which illustrate both the variance reduction and convergence detection techniques; a short sketch of the stopping rule follows below.
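A minimal sketch of the stopping rule and forecast: beta is the current estimate, var_beta its current per-coordinate variance (e.g., diag(Σ̂_{β̄})/n from Welford's algorithm), and sigma_diag the asymptotic per-coordinate variances diag(Σ̂_{β̄}); all names are placeholders:

    import numpy as np

    def is_converged(beta, var_beta, t=0.01):
        """Stop when the largest standard deviation falls below a fraction t
        of the gap between the largest and smallest Shapley value estimates."""
        gap = beta.max() - beta.min()
        return np.sqrt(var_beta.max()) < t * gap

    def forecast_n(beta, sigma_diag, t=0.01):
        """Predicted number of samples N̂ for convergence, using the O(1/n)
        variance rate."""
        gap = beta.max() - beta.min()
        return sigma_diag.max() / (t * gap) ** 2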
5 STOCHASTIC COOPERATIVE GAMES

We have thus far focused on developing a regression-based approach to estimate Shapley values for any cooperative game. We now discuss how to adapt this approach to stochastic cooperative games, which leads to fast estimators for two global explanation methods.

5.1 Stochastic Games and Global Explanations

Stochastic cooperative games return a random value for each coalition of participating players S ⊆ D. Such games are represented by a function V that maps coalitions to a distribution of possible outcomes, so that V(S) is a random variable [9, 7]. To aid our presentation, we assume that the uncertainty in the game can be represented by an exogenous random variable U. The game can then be denoted by V(S, U), where V(·, U) is a deterministic function of S for any fixed value of the variable U.

Stochastic cooperative games provide a useful tool for understanding two global explanation methods, SAGE [12] and Shapley Effects [32]. To see why, assume an exogenous variable U = (X, Y) that represents a random input-label pair, and consider the following game:

    W(S, X, Y) = −ℓ(E[f(X) | X_S], Y).    (16)

The game W evaluates the (negated) loss with respect to the label Y given a prediction that depends only on the features X_S. The cooperative game used by SAGE can be understood as the expectation of this game, or w(S) = E_{XY}[W(S, X, Y)] (see Eq. 2). Shapley Effects is based on the expectation of a similar game, where the loss is evaluated with respect to the full model prediction f(X) (see Appendix C). As we show next, an approximation approach tailored to this setting yields significantly faster estimators for these methods.

5.2 Generalized Shapley Values

It is natural to assign values to players in stochastic cooperative games like we do for deterministic games. We propose a simple generalization of the Shapley value for games V(S, U) that averages a player's marginal contributions over both (i) player orderings and (ii) values of the exogenous variable U:

    φ_i(V) = (1/d) Σ_{S ⊆ D\{i}} C(d−1, |S|)^{−1} E_U[V(S ∪ {i}, U) − V(S, U)].

Due to the linearity property of Shapley values [36, 30], the following sets of values are equivalent:

1. The Shapley values of the game's expectation v̄(S) = E_U[V(S, U)], or φ_i(v̄)
2. The expected Shapley values of games with fixed U, or E_U[φ_i(v_U)], where v_u(S) = V(S, u)
3. Our generalization of Shapley values to the stochastic cooperative game V(S, U), or φ_i(V)

The first two list items suggest ways of calculating the values φ_i(V) using tools designed for deterministic cooperative games. However, the expectation E_U[V(S, U)] may be slow to evaluate (e.g., if it is across an entire dataset), and calculating Shapley values separately for each value of U would be intractable if U has many possible values. We therefore introduce a third approach.

5.3 Regression-Based Estimation

We now propose a fast, regression-based approach for calculating the generalized Shapley values φ_i(V) of stochastic cooperative games V(S, U). Fortunately, it requires only a simple modification of the preceding approaches. First, we must calculate the values E_U[V(1, U)] and E_U[V(0, U)] for the grand coalition and the empty coalition. Next, we replace our previous b estimators (b̂_n and b̄_n) with estimators that use n pairs of independent samples z_i ∼ p(Z) and u_i ∼ p(U). To adapt the original KernelSHAP to this setting, we use

    b̃_n = (1/n) Σ_{i=1}^n z_i (V(z_i, u_i) − E_U[V(0, U)]).

We then substitute this into the KernelSHAP estimator, as follows:

    β̃_n = Â_n^{−1} (b̃_n − 1 (1^T Â_n^{−1} b̃_n − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T Â_n^{−1} 1)).    (17)

By the same argument used in Section 3.2, this approach estimates a solution to the weighted least squares problem whose optimal solution is the generalized Shapley values φ_i(V).
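A sketch of Eq. (17), reusing sample_z from the Section 3.2 sketch; the game V(z, u), the sampler sample_u, and the precomputed means V0_mean = E_U[V(0, U)] and V1_mean = E_U[V(1, U)] are placeholder interfaces:

    import numpy as np

    def stochastic_kernelshap(V, d, n, sample_u, V0_mean, V1_mean, rng=None):
        """Each coalition z_i is paired with an independent draw u_i, so each
        game evaluation touches one exogenous sample rather than a
        full-dataset expectation."""
        rng = np.random.default_rng() if rng is None else rng
        Z = sample_z(d, n, rng)
        vals = np.array([V(z, sample_u(rng)) for z in Z.astype(bool)])
        A = Z.T @ Z / n
        b = Z.T @ (vals - V0_mean) / n                # b̃_n
        Ainv = np.linalg.inv(A); ones = np.ones(d)
        nu = (ones @ Ainv @ b - V1_mean + V0_mean) / (ones @ Ainv @ ones)
        return Ainv @ (b - nu * ones)                 # Eq. (17)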
This adaptation of KernelSHAP is consistent, and the analogous version of unbiased KernelSHAP is consistent and unbiased (see Appendix D). These estimators can be run with our paired sampling approach, and we can also provide uncertainty estimates and detect convergence (Section 4).

6 EXPERIMENTS

We conducted experiments with four datasets to demonstrate the advantages of our Shapley value estimation approach. We used the census income dataset [22], the Portuguese bank marketing dataset [31], the German credit dataset [22], and a breast cancer (BRCA) subtype classification dataset [4]. To avoid overfitting with the BRCA data, we analyzed a random subset of 100 out of 17,814 genes (Appendix E). We trained a LightGBM model [21] for the census data, CatBoost [34] for the credit and bank data, and logistic regression for the BRCA data. Code for our experiments is available online.

To demonstrate local and global explanations with uncertainty estimates, we show examples of SHAP [24] and SAGE [12] values generated using our estimators (Figure 3). Both explanations used a convergence threshold of t = 0.01 and display 95% confidence intervals, which are features not previously offered by KernelSHAP. We used the dataset sampling approach for both explanations, and for SAGE we used the estimator designed for stochastic cooperative games. These estimators are faster than their unbiased versions, but the results are nearly identical.

Figure 3: Shapley value-based explanations with 95% uncertainty estimates. Left: SHAP values for a single prediction with the census income dataset. Right: SAGE values for the German credit dataset.

Table 1: SHAP estimator run-time comparison. Each value represents the ratio of the average number of samples required relative to the fastest estimator for that dataset (lower is better).

    Estimator                     Census Income    Bank Marketing    German Credit    BRCA Subtypes
    Unbiased                      380.63           176.45            17437.44         90.40
    Unbiased + Paired Sampling    128.60           90.61             422.17           40.44
    Original (KernelSHAP)         12.74            7.41              13.74            2.49
    Original + Paired Sampling    –                –                 –                –

To measure run-time differences between each estimator when calculating SHAP values, we compared the number of samples required to explain 100 instances for each dataset (Table 1). Rather than reporting the exact number of samples, which is dependent on the convergence threshold, we show the ratio between the number of samples required by each estimator; this ratio is independent of the convergence threshold when convergence is defined by the mean squared estimation error falling below a fixed value (Appendix E). Table 1 displays results based on 100 runs for each instance. Results show that the dataset sampling approach (original) is consistently faster than the unbiased estimator, and that paired sampling enables significantly faster convergence. In particular, we find that our paired sampling approach yields a substantial average speedup over the original KernelSHAP.

To investigate the accuracy of our convergence forecasting method, we compared the predicted number of samples to the true number across 250 runs. The number of samples depends on the convergence threshold, and we used a threshold t = 0.005 for SHAP and t = 0.02 for Shapley Effects and SAGE.
Figure 4 shows the results for SHAP (using the census data), Shapley Effects (using the bank data) and SAGE (using the BRCA data). In all three cases, the forecasts become more accurate with more samples, and they vary within an increasingly narrow range around the true number of required samples. There is a positive bias in the forecast, but the bias diminishes with more samples.

Figure 4: Convergence forecasting for SHAP, Shapley Effects and SAGE. The required number of samples is compared with the predicted number across 100 runs (with 90% confidence intervals displayed).

Finally, to demonstrate the speedup from our approach for stochastic cooperative games, we show that our stochastic estimator converges faster than a naive estimator based on the underlying game's expectation (see Section 5.2). We plotted the ratio between the maximum standard deviation and the gap between the smallest and largest values, which is the criterion we used to detect convergence (Figure 5). The fastest estimators for both datasets are stochastic estimators using the paired sampling technique, and only these methods converged for both datasets in the number of samples displayed. As is the case for SHAP, the dataset sampling approach is often faster than the unbiased approach, but the latter is slightly faster for SAGE when using paired sampling.

Figure 5: Convergence acceleration for SAGE and Shapley Effects. The ratio of the maximum standard deviation to the gap between the largest and smallest Shapley values is compared across six estimators.

7 CONCLUSION

This paper described several approaches for estimating Shapley values via linear regression. We first introduced an unbiased version of KernelSHAP, with properties that are simpler to analyze than the original version. We then developed techniques for detecting convergence, calculating uncertainty estimates, and reducing the variance of both the original and unbiased estimators. Finally, we adapted our approach to provide significantly faster estimators for two global explanation methods based on stochastic cooperative games. Our work makes significant strides towards improving the practicality of Shapley value estimation by automatically determining the required number of samples, providing confidence intervals, and accelerating the estimation process.

More broadly, our work contributes to a mature literature on Shapley value estimation [6, 26] and to the growing ML model explanation field [32, 38, 13, 24, 12]. We focused on improving the regression-based approach to Shapley value estimation, and we leave to future work a detailed comparison of this approach to sampling-based [38, 37, 10, 12] and model-specific approximations [1, 25]. We also believe that certain insights from our work may be applicable to LIME, which is based on a similar dataset sampling approach; recent work has noted LIME's high variance when using an insufficient number of samples [3], and an improved understanding of its convergence properties [16, 27] may lead to approaches for automatic convergence detection and uncertainty estimation.

A CALCULATING A EXACTLY

Recall the definition of A, which is a term in the solution to the Shapley value linear regression problem:

    A = E[ZZ^T].

The entries of A are straightforward to calculate because Z is a random binary vector with a known distribution. Recall that Z is distributed according to p(Z), which is defined as

    p(z) = Q^{−1} μ_Sh(z)  if 0 < 1^T z < d,  and  p(z) = 0 otherwise,

where the normalizing constant Q is given by:

    Q = Σ_{0 < 1^T z < d} μ_Sh(z).
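As a quick sanity check on these quantities, the sketch below compares the exactly computed A (the exact_A helper from the Section 3.3 sketch, which sums the required probabilities over coalition sizes) against a Monte Carlo average of zz^T; the dimension and sample count are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    Z = sample_z(d, 100_000, rng)             # sampler from the Section 3.2 sketch
    A_mc = Z.T @ Z / len(Z)                   # Monte Carlo estimate of E[ZZ^T]
    print(np.abs(A_mc - exact_A(d)).max())    # should be small, shrinking as O(1/sqrt(n))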
B VARIANCE REDUCTION PROOFS

We present a proof for Theorem 1, and we prove that a weaker condition than G_v ⪰ 0 holds for all games ((G_v)_{ii} ≥ 0).

B.1 Theorem 1 Proof

In Section 4.2, we proposed a variance reduction technique that pairs each sample z_i ∼ p(Z) with its complement 1 − z_i when estimating b. We now provide a proof for the condition that must be satisfied for the estimator β̌_n to have lower variance than β̄_n.

As mentioned in the main text, the multivariate CLT asserts that

    √n (b̄_n − b) →_D N(0, Σ_{b̄})
    √n (b̌_n − b) →_D N(0, Σ_{b̌}),

where

    Σ_{b̄} = Cov(Z v(Z)),
    Σ_{b̌} = Cov((1/2)(Z v(Z) + (1 − Z) v(1 − Z))).

We can also apply the multivariate CLT to the Shapley value estimators β̄_n and β̌_n. We can see that

    √n (β̄_n − β*) →_D N(0, Σ_{β̄})
    √n (β̌_n − β*) →_D N(0, Σ_{β̌}),

where, due to their linear dependence on the corresponding b estimators, the covariance matrices are defined as

    Σ_{β̄} = C Σ_{b̄} C^T,  Σ_{β̌} = C Σ_{b̌} C^T.

Next, we examine the relationship between Σ_{b̄} and Σ_{b̌} because they dictate the relationship between Σ_{β̄} and Σ_{β̌}. To simplify our notation, we introduce three jointly distributed random variables, M_1, M_2 and M̄, which are all functions of the random variable Z:

    M_1 = Z v(Z) − E[Z] v(0)
    M_2 = (1 − Z) v(1 − Z) − E[1 − Z] v(0)
    M̄ = (1/2)(M_1 + M_2).

To understand M̄'s covariance structure, we can decompose it using standard covariance properties and the fact that p(z) = p(1 − z) for all z:

    Cov(M̄, M̄)_{ij} = (1/4) Cov(M_{1i} + M_{2i}, M_{1j} + M_{2j})
                     = (1/4) (Cov(M_{1i}, M_{1j}) + Cov(M_{1i}, M_{2j}) + Cov(M_{2i}, M_{1j}) + Cov(M_{2i}, M_{2j}))
                     = (1/2) (Cov(M_{1i}, M_{1j}) + Cov(M_{1i}, M_{2j})).

We can now compare Σ_{b̄} to Σ_{b̌}. To account for each M̄ sample requiring twice as many cooperative game evaluations as M_1, we compare the covariance Cov(b̄_{2n}) to the covariance Cov(b̌_n):

    n (Cov(b̄_{2n}) − Cov(b̌_n))_{ij} = −(1/2) Cov(M_{1i}, M_{2j}).

Based on this, we define G_v as follows:

    G_v = −Cov(M_1, M_2)
        = −Cov(Z v(Z) − E[Z] v(0), (1 − Z) v(1 − Z) − E[1 − Z] v(0))
        = −Cov(Z v(Z), (1 − Z) v(1 − Z)).

This is the matrix referenced in Theorem 1. Notice that G_v is the negated cross-covariance between M_1 and M_2, which is the off-diagonal block in the joint covariance matrix for the concatenated random variable (M_1, M_2). This matrix is symmetric, unlike general cross-covariance matrices, and its eigen-structure determines whether our variance reduction approach is effective. In particular, if the condition G_v ⪰ 0 holds, then we have

    Cov(b̄_{2n}) ⪰ Cov(b̌_n),

which implies that

    Cov(β̄_{2n}) ⪰ Cov(β̌_n).

Since the inverses of two ordered matrices are also ordered, we get the result:

    Cov(β̄_{2n})^{−1} ⪯ Cov(β̌_n)^{−1}.

This has implications for quadratic forms involving each matrix. For any vector a ∈ R^d, we have the inequality

    a^T Cov(β̄_{2n})^{−1} a ≤ a^T Cov(β̌_n)^{−1} a.

The last inequality has a geometric interpretation: it shows that the confidence ellipsoid (i.e., the confidence region, or prediction ellipsoid) for β̌_n is contained by the corresponding confidence ellipsoid for β̄_{2n}, since large values of n lead each estimator to converge to its asymptotically normal distribution. The confidence ellipsoids are defined for α ∈ (0, 1) as

    Ē_{n,α} = {a ∈ R^d : (a − β*)^T Cov(β̄_{2n})^{−1} (a − β*) ≤ χ²_d(α)}
    Ě_{n,α} = {a ∈ R^d : (a − β*)^T Cov(β̌_n)^{−1} (a − β*) ≤ χ²_d(α)},

where χ²_d(α) denotes the inverse CDF of a chi-squared distribution with d degrees of freedom evaluated at α. More precisely, we have Ě_{n,α} ⊆ Ē_{n,α} because

    (a − β*)^T Cov(β̌_n)^{−1} (a − β*) ≤ χ²_d(α) ⇒ (a − β*)^T Cov(β̄_{2n})^{−1} (a − β*) ≤ χ²_d(α).

This completes the proof.
B.2 A Weaker Condition

Consider the matrix G_v, which for a game v is defined as

    G_v = −Cov(Z v(Z), (1 − Z) v(1 − Z)).

A necessary (but not sufficient) condition for G_v ⪰ 0 is that its diagonal values are non-negative. For any game v, the diagonal value (G_v)_{ii} is given by:

    (G_v)_{ii} = −Cov(Z_i v(Z), (1 − Z_i) v(1 − Z))
               = −E[Z_i (1 − Z_i) v(Z) v(1 − Z)] + E[Z_i v(Z)] E[(1 − Z_i) v(1 − Z)]
               = E[Z_i v(Z)]^2
               ∝ E[v(S) | i ∈ S]^2 ≥ 0,

where the second step uses the fact that Z_i (1 − Z_i) = 0 and the symmetry p(z) = p(1 − z). Geometrically, this condition means that the confidence ellipsoid Ē_{n,α} extends beyond the ellipsoid Ě_{n,α} in the axis-aligned directions. In a probabilistic sense, it means that the variance for each Shapley value estimate is lower when using the paired sampling technique.

C SHAPLEY EFFECTS

Shapley Effects is a model explanation method that summarizes the model f's sensitivity to each feature [32]. It is based on the cooperative game

    w̃(S) = Var(E[f(X) | X_S]).    (18)

To show that Shapley Effects can be viewed as the expectation of a stochastic cooperative game, we reformulate this game (Covert et al. [12]) as:

    w̃(S) = Var(E[f(X) | X_S])
          = Var(f(X)) − E_{X_S}[Var(f(X) | X_S)]
          = c − E_{X_S}[E_{X_{D\S} | X_S}[(E[f(X) | X_S] − f(X_S, X_{D\S}))^2]]
          = c − E_X[(E[f(X) | X_S] − f(X))^2].

If we generalize this cooperative game to allow arbitrary loss functions ℓ (e.g., cross entropy loss for classification tasks) rather than MSE, then we can ignore the constant value and re-write the game as

    w̃(S) = −E_X[ℓ(E[f(X) | X_S], f(X))].

Now, it is apparent that Shapley Effects is based on a cooperative game that is the expectation of a stochastic cooperative game, or w̃(S) = E_X[W̃(S, X)], where W̃(S, X) is defined as:

    W̃(S, X) = −ℓ(E[f(X) | X_S], f(X)).

Unlike the stochastic cooperative game implicitly used by SAGE, the exogenous random variable for this game is U = X.
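For illustration, a minimal sketch of the stochastic game W̃(S, X), with ℓ taken as squared error and the conditional expectation approximated by marginal sampling (the approximation used in our experiments); the model f and background data are placeholders:

    import numpy as np

    def make_shapley_effects_game(f, X_background):
        """W̃(S, x) = −ℓ(E[f(X) | X_S = x_S], f(x)) with ℓ = squared error;
        the conditional expectation uses the marginal approximation."""
        def W(S, x):                   # S: boolean mask; x: one draw of U = X
            Xs = X_background.copy()
            Xs[:, S] = x[S]
            cond_pred = f(Xs).mean()   # approximates E[f(X) | X_S]
            return -(cond_pred - f(x[None])[0]) ** 2
        return W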
D STOCHASTIC COOPERATIVE GAME PROOFS

For a stochastic cooperative game V(S, U), the generalized Shapley values are given by the expression

    φ_i(V) = (1/d) Σ_{S ⊆ D\{i}} C(d−1, |S|)^{−1} E_U[V(S ∪ {i}, U) − V(S, U)]
           = (1/d) Σ_{S ⊆ D\{i}} C(d−1, |S|)^{−1} (E_U[V(S ∪ {i}, U)] − E_U[V(S, U)]).

The second line above shows that the generalized Shapley values are equivalent to the Shapley values of the game's expectation, or φ_i(V̄), where V̄(S) = E_U[V(S, U)]. Based on this, we can also understand the values φ_1(V), ..., φ_d(V) as the optimal coefficients for the following weighted least squares problem:

    min_{β_0, ..., β_d} Σ_z p(z) (β_0 + z^T β − E_U[V(z, U)])^2
    s.t. β_0 = E_U[V(0, U)],  1^T β = E_U[V(1, U)] − E_U[V(0, U)].

Using our derivation from the main text (Section 3.3), we can write the solution as

    β* = A^{−1} (b − 1 (1^T A^{−1} b − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T A^{−1} 1)),

where A and b are given by the expressions

    A = E[ZZ^T],
    b = E_Z[Z (E_U[V(Z, U)] − E_U[V(0, U)])].

Now, we consider our adaptations of KernelSHAP and unbiased KernelSHAP and examine whether these estimators are consistent or unbiased. We begin with the stochastic version of KernelSHAP presented in the main text (Section 5.3). Recall that this approach uses the original A estimator Â_n and the modified b estimator b̃_n, which is defined as:

    b̃_n = (1/n) Σ_{i=1}^n z_i (V(z_i, u_i) − E_U[V(0, U)]).

As mentioned in the main text, the strong law of large numbers lets us conclude that lim_{n→∞} Â_n = A. Thus, we can understand the b estimator's expectation as follows:

    E[b̃_n] = E_{ZU}[Z (V(Z, U) − E_U[V(0, U)])]
            = E_Z[Z (E_U[V(Z, U)] − E_U[V(0, U)])]
            = b.

With this, we conclude that lim_{n→∞} b̃_n = b and that β̃_n is consistent, or

    lim_{n→∞} β̃_n = β*.

To adapt unbiased KernelSHAP to the setting of stochastic cooperative games, we use the same technique of pairing independent samples of Z and U. To estimate b, we use an estimator b̃̄_n defined as:

    b̃̄_n = (1/n) Σ_{i=1}^n z_i V(z_i, u_i) − E[Z] E_U[V(0, U)].

We then substitute this into a Shapley value estimator as follows:

    β̃̄_n = A^{−1} (b̃̄_n − 1 (1^T A^{−1} b̃̄_n − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T A^{−1} 1)).    (19)

This is consistent and unbiased because of the linear dependence on b̃̄_n and the fact that b̃̄_n is unbiased:

    E[b̃̄_n] = E_{ZU}[Z V(Z, U)] − E[Z] E_U[V(0, U)]
             = E_Z[Z (E_U[V(Z, U)] − E_U[V(0, U)])]
             = b.

With this, we conclude that E[β̃̄_n] = β* and lim_{n→∞} β̃̄_n = β*.

E EXPERIMENT DETAILS

Here, we provide further details about experiments described in the main body of text.

E.1 Datasets and Hyperparameters

For all three explanation methods considered in our experiments (SHAP [24], SAGE [12] and Shapley Effects [32]), we handled removed features by marginalizing them out according to their joint marginal distribution. This is the default behavior for SHAP, but it is an approximation of what is required by SAGE and Shapley Effects. However, this choice should not affect the outcome of our experiments, which focus on the convergence properties of our Shapley value estimators (and not the underlying cooperative games).

Both SAGE and Shapley Effects require a loss function (Appendix C). We used the cross entropy loss for SAGE and the soft cross entropy loss for Shapley Effects.

For the breast cancer (BRCA) subtype classification dataset, we selected 100 out of 17,814 genes to avoid overfitting given the relatively small dataset size (only 510 patients). These genes were selected at random: we tried ten random seeds and selected the subset that achieved the best performance to ensure that several relevant BRCA genes were included. A small portion of missing expression values were imputed with their mean. The data was centered and normalized prior to fitting a regularized logistic regression model; the regularization parameter was chosen using a validation set.
E.2 SHAP Run-time Comparison

To compare the run-time of various SHAP value estimators, we sought to compare the ratio of the mean number of samples required by each method. For a single example x whose SHAP values are represented by β*, the mean squared estimation error can be decomposed into the variance and bias as follows:

    E[||β̂_n − β*||^2] = E[||β̂_n − E[β̂_n]||^2] + ||E[β̂_n] − β*||^2.

Since we found that the error is dominated by variance rather than bias (Section 4.1), we can make the following approximation to relate the error to the trace of the covariance matrix:

    E[||β̂_n − β*||^2] ≈ E[||β̂_n − E[β̂_n]||^2] = Tr(Cov(β̂_n)).    (20)

If we define convergence based on the mean estimation error falling below a threshold value t, then the convergence condition is

    E[||β̂_n − β*||^2] ≤ t.

Using our approximation (Eq. 20), we can see that this condition is approximately equivalent to

    E[||β̂_n − β*||^2] ≈ Tr(Cov(β̂_n)) ≈ Tr(Σ_{β̂}) / n ≤ t.

For a given threshold t, the mean number of samples required to explain individual predictions is therefore based on the mean trace of the covariance matrix Σ_{β̂} (or the analogous covariance matrix for a different estimator). To compare two methods, we simply calculate the ratio of the mean trace of the covariance matrices. These ratios are reported in Table 1, where each covariance matrix is calculated empirically across 100 runs with n = 2048 samples.
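A sketch of this comparison; estimator is assumed to map a sample budget n to a length-d Shapley value estimate (e.g., the kernelshap or paired_unbiased_kernelshap sketches above):

    import numpy as np

    def mean_trace(estimator, n=2048, runs=100):
        """Empirical Tr(Cov(β̂_n)) across repeated runs, as in Eq. (20)."""
        estimates = np.stack([estimator(n) for _ in range(runs)])
        return np.trace(np.cov(estimates.T))

    # Run-time ratio between two estimators, as reported in Table 1:
    # ratio = mean_trace(est_a) / mean_trace(est_b)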
F CONVERGENCE EXPERIMENTS

In Section 4.1, we empirically compared the bias and variance for the original and unbiased versions of KernelSHAP using a single census income prediction. The results (Figure 1) showed that both versions' estimation errors were dominated by variance rather than bias, and that the original version had significantly lower variance. To verify that this result is not an anomaly, we replicated it on multiple examples and across several datasets.

First, we examined several individual predictions for the census income, German credit and bank marketing datasets. To highlight the effectiveness of our paired sampling approach (Section 4.2), we added these methods as additional comparisons. Rather than decomposing the error into bias and variance as in the main text, we simply calculated the mean squared error across 100 runs of each estimator. Figure 6 shows the error for several census income predictions, Figure 8 for several bank marketing predictions, and Figure 10 for several credit quality predictions. These results confirm that the original version of KernelSHAP converges significantly faster than the unbiased version, and that the paired sampling technique is effective for both estimators. The dataset sampling approach (original KernelSHAP) appears preferable in practice despite being more difficult to analyze because it converges to the correct result much faster.

Second, we calculated a global measure of the bias and variance for each estimator using the same datasets (Table 2). Given 100 examples from each dataset, we calculated the mean bias and mean variance for each estimator empirically across 100 runs given n = 256 samples. Results show that the bias is nearly zero for all estimators, not just the unbiased ones; they also show that the variance is often significantly larger than the bias. However, when using the dataset sampling approach (original) in combination with the paired sampling technique, the bias and variance are comparably low (≈ 0) after 256 samples. The only exception is the unbiased estimator that does not use paired sampling, but this is likely due to estimation error because its bias is provably equal to zero.

Table 2: Global measures of bias and variance for each SHAP value estimator. Each entry is the mean bias and mean variance calculated empirically across 100 examples (bias/variance, lower is better).

    Estimator                     Census Income    Bank Marketing    German Credit
    Unbiased                      0.0002/0.0208    0.0001/0.0125     0.0026/0.2561
    Unbiased + Paired Sampling    0.0000/0.0068    0.0000/0.0066     0.0000/0.0062
    Original (KernelSHAP)         0.0000/0.0007    0.0000/0.0006     0.0000/0.0002
    Original + Paired Sampling    0.0000/0.0001    0.0000/0.0001     0.0000/0.0000

Finally, Section 4.3 also proposed assuming that the original KernelSHAP estimator's variance reduces at a rate of O(1/n), similar to the unbiased version (for which we proved this rate). Although this result is difficult to prove formally, it seems to hold empirically across multiple predictions and several datasets. In Figures 7, 9 and 11, we display the product of the estimator's variance with the number of samples for the census, bank and credit datasets. Results confirm that the product is roughly constant as the number of samples increases, indicating that the variance for all four estimators (not just the unbiased ones) reduces at a rate of O(1/n).

Figure 6: Census income SHAP value estimation error on four predictions.

Figure 7: Census income SHAP value variance estimation on four predictions.

Figure 8: Bank marketing SHAP value estimation error on four predictions.

Figure 9: Bank marketing SHAP value variance estimation on four predictions.
Figure 10: German credit SHAP value estimation error on four predictions.

Figure 11: German credit SHAP value variance estimation on four predictions.

G ALGORITHMS

Here, we provide pseudocode for the estimation algorithms described in the main text. Algorithm 1 shows the dataset sampling approach (original KernelSHAP) with our convergence detection and paired sampling techniques. Algorithm 2 shows KernelSHAP's adaptation to the setting of stochastic cooperative games (stochastic KernelSHAP). Algorithm 3 shows the unbiased KernelSHAP estimator, and Algorithm 4 shows the adaptation of unbiased KernelSHAP to stochastic cooperative games.

Algorithm 1: Shapley value estimation with dataset sampling (KernelSHAP)

Input: game v, convergence threshold t, intermediate samples m

    // Initialize
    n = 0; A = 0; b = 0
    // For tracking intermediate samples
    counter = 0; Atemp = 0; btemp = 0; estimates = list()
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        if variance reduction then
            Asample = (z z^T + (1 − z)(1 − z)^T) / 2
            bsample = (z v(z) + (1 − z) v(1 − z) − 1 v(0)) / 2
        else
            Asample = z z^T
            bsample = z (v(z) − v(0))
        // Welford's algorithm
        n = n + 1
        A += (Asample − A) / n
        b += (bsample − b) / n
        counter += 1
        Atemp += (Asample − Atemp) / counter
        btemp += (bsample − btemp) / counter
        if counter == m then
            // Get intermediate estimate
            β_m = Atemp^{−1} (btemp − 1 (1^T Atemp^{−1} btemp − v(1) + v(0)) / (1^T Atemp^{−1} 1))
            estimates.append(β_m)
            counter = 0; Atemp = 0; btemp = 0
        // Get estimates, uncertainties
        β_n = A^{−1} (b − 1 (1^T A^{−1} b − v(1) + v(0)) / (1^T A^{−1} 1))
        Σ_β = m · Cov(estimates)       // empirical covariance
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n

Algorithm 2: Shapley value estimation with dataset sampling for stochastic cooperative games

Input: game V, convergence threshold t, intermediate samples m

    // Initialize
    n = 0; A = 0; b = 0
    // For tracking intermediate samples
    counter = 0; Atemp = 0; btemp = 0; estimates = list()
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z); sample u ~ p(U)
        if variance reduction then
            Asample = (z z^T + (1 − z)(1 − z)^T) / 2
            bsample = (z V(z, u) + (1 − z) V(1 − z, u) − 1 E_U[V(0, U)]) / 2
        else
            Asample = z z^T
            bsample = z (V(z, u) − E_U[V(0, U)])
        // Welford's algorithm
        n = n + 1
        A += (Asample − A) / n
        b += (bsample − b) / n
        counter += 1
        Atemp += (Asample − Atemp) / counter
        btemp += (bsample − btemp) / counter
        if counter == m then
            // Get intermediate estimate
            β_m = Atemp^{−1} (btemp − 1 (1^T Atemp^{−1} btemp − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T Atemp^{−1} 1))
            estimates.append(β_m)
            counter = 0; Atemp = 0; btemp = 0
        // Get estimates, uncertainties
        β_n = A^{−1} (b − 1 (1^T A^{−1} b − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T A^{−1} 1))
        Σ_β = m · Cov(estimates)       // empirical covariance
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n
Algorithm 3: Unbiased Shapley value estimation

Input: game v, convergence threshold t

    // Initialize
    set A (Section 3.3); set C (Eq. 13)
    n = 0; b = 0; bSSQ = 0
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        if variance reduction then
            bsample = (z v(z) + (1 − z) v(1 − z) − 1 v(0)) / 2
        else
            bsample = z v(z) − E[Z] v(0)
        // Welford's algorithm
        n = n + 1
        diff = bsample − b
        b += diff / n
        diff2 = bsample − b
        bSSQ += outer(diff, diff2)     // outer product
        // Get estimates, uncertainties
        β_n = A^{−1} (b − 1 (1^T A^{−1} b − v(1) + v(0)) / (1^T A^{−1} 1))
        Σ_b = bSSQ / n
        Σ_β = C Σ_b C^T
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n

Algorithm 4: Unbiased Shapley value estimation for stochastic cooperative games

Input: game V, convergence threshold t

    // Initialize
    set A (Section 3.3); set C (Eq. 13)
    n = 0; b = 0; bSSQ = 0
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z); sample u ~ p(U)
        if variance reduction then
            bsample = (z V(z, u) + (1 − z) V(1 − z, u) − 1 E_U[V(0, U)]) / 2
        else
            bsample = z V(z, u) − E[Z] E_U[V(0, U)]
        // Welford's algorithm
        n = n + 1
        diff = bsample − b
        b += diff / n
        diff2 = bsample − b
        bSSQ += outer(diff, diff2)     // outer product
        // Get estimates, uncertainties
        β_n = A^{−1} (b − 1 (1^T A^{−1} b − E_U[V(1, U)] + E_U[V(0, U)]) / (1^T A^{−1} 1))
        Σ_b = bSSQ / n
        Σ_β = C Σ_b C^T
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n

Acknowledgements

This work was funded by the National Science Foundation [CAREER DBI-1552309 and DBI-1759487]; the American Cancer Society [127332-RSG-15-097-01-TBG]; and the National Institutes of Health [R35 GM 128638 and R01 NIA AG 061132]. We would like to thank Hugh Chen, the Lee Lab, and our AISTATS reviewers for feedback that greatly improved this work.

References

[1] Marco Ancona, Cengiz Öztireli, and Markus Gross. Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. arXiv preprint arXiv:1903.10992, 2019.
[2] Robert JJ Aumann. Economic applications of the Shapley value. In Game-Theoretic Methods in General Equilibrium Analysis, pages 121–133. Springer, 1994.
[3] Naman Bansal, Chirag Agarwal, and Anh Nguyen. SAM: The sensitivity of attribution methods to hyperparameters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8673–8683, 2020.
[4] Ashton C Berger, Anil Korkut, Rupa S Kanchi, Apurva M Hegde, Walter Lenoir, Wenbin Liu, Yuexin Liu, Huihui Fan, Hui Shen, Visweswaran Ravikumar, et al.
A comprehensive pan-cancer molecular study of gynecologic and breast cancers. Cancer Cell, 33(4):690–705, 2018.

[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009.

[7] A Charnes and Daniel Granot. Coalitional and chance-constrained solutions to n-person games. I: The prior satisficing nucleolus. SIAM Journal on Applied Mathematics, 31(2):358–367, 1976.

[8] A Charnes, B Golany, M Keane, and J Rousseau. Extremal principle solutions of games in characteristic function form: core, Chebychev and Shapley value generalizations. In Econometrics of Planning and Efficiency, pages 123–133. Springer, 1988.

[9] Abraham Charnes and Daniel Granot. Prior solutions: Extensions of convex nucleus solutions to chance-constrained games. Technical report, Texas University at Austin Center for Cybernetic Studies, 1973.

[10] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610, 2018.

[11] Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. arXiv preprint arXiv:2011.14878, 2020.

[12] Ian Covert, Scott Lundberg, and Su-In Lee. Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems, 34, 2020.

[13] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617. IEEE, 2016.

[14] Guoli Ding, Robert F Lax, Jianhua Chen, and Peter P Chen. Formulas for approximating pseudo-boolean random variables. Discrete Applied Mathematics, 156(10):1581–1597, 2008.

[15] Guoli Ding, Robert F Lax, Jianhua Chen, Peter P Chen, and Brian D Marx. Transforms of pseudo-boolean random variables. Discrete Applied Mathematics, 158(1):13–24, 2010.

[16] Damien Garreau and Ulrike von Luxburg. Looking deeper into LIME. arXiv preprint arXiv:2008.11092, 2020.

[17] Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019.

[18] Amirata Ghorbani and James Zou. Neuron Shapley: Discovering the responsible neurons. arXiv preprint arXiv:2002.09815, 2020.

[19] Michel Grabisch, Jean-Luc Marichal, and Marc Roubens. Equivalent representations of set functions. Mathematics of Operations Research, 25(2):157–178, 2000.

[20] Peter L Hammer and Ron Holzman. Approximations of pseudo-boolean functions; applications to game theory. Zeitschrift für Operations Research, 36(1):3–21, 1992.

[21] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

[22] Moshe Lichman et al. UCI machine learning repository, 2013.

[23] Stan Lipovetsky and Michael Conklin. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319–330, 2001.

[24] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions.
In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[25] Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020.

[26] Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, and Alex Rogers. Bounding the estimation error of sampling-based Shapley value approximation. arXiv preprint arXiv:1306.4265, 2013.

[27] Dina Mardaoui and Damien Garreau. An analysis of LIME for text data. arXiv preprint arXiv:2010.12487, 2020.

[28] Jean-Luc Marichal and Pierre Mathonet. Weighted Banzhaf power and interaction indexes through weighted approximations of games. European Journal of Operational Research, 211(2):352–358, 2011.

[29] Luke Merrick and Ankur Taly. The explanation game: Explaining machine learning models using Shapley values. arXiv preprint arXiv:1909.08128, 2019.

[30] Dov Monderer, Dov Samet, et al. Variations on the Shapley value. Handbook of Game Theory, 3:2055–2076, 2002.

[31] Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.

[32] Art B Owen. Sobol' indices and Shapley value. SIAM/ASA Journal on Uncertainty Quantification, 2(1):245–251, 2014.

[33] Leon Petrosjan and Georges Zaccour. Time-consistent Shapley value allocation of pollution cost reduction. Journal of Economic Dynamics and Control, 27(3):381–398, 2003.

[34] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.

[35] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[36] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

[37] Eunhye Song, Barry L Nelson, and Jeremy Staum. Shapley effects for global sensitivity analysis: Theory and computation. SIAM/ASA Journal on Uncertainty Quantification, 4(1):1060–1083, 2016.

[38] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.

[39] Nikola Tarashev, Kostas Tsatsaronis, and Claudio Borio. Risk attribution using the Shapley value: Methodology and policy applications. Review of Finance, 20(3):1189–1213, 2016.

[40] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[41] Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[42] BP Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962.

Supplementary Materials for Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression

A CALCULATING A EXACTLY

Recall the definition of $A$, which is a term in the solution to the Shapley value linear regression problem:

$$A = E[ZZ^\top].$$

The entries of $A$ are straightforward to calculate because $Z$ is a random binary vector with a known distribution. Recall that $Z$ is distributed according to $p(Z)$, which is defined as

$$p(z) = Q^{-1} \cdot \frac{d - 1}{\binom{d}{\mathbf{1}^\top z}\,\mathbf{1}^\top z\,(d - \mathbf{1}^\top z)} \quad \text{for } 0 < \mathbf{1}^\top z < d,$$

where the normalizing constant $Q$ is given by

$$Q = \sum_{0 < \mathbf{1}^\top z < d} \frac{d - 1}{\binom{d}{\mathbf{1}^\top z}\,\mathbf{1}^\top z\,(d - \mathbf{1}^\top z)}.$$
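To make this concrete, $A$ can be computed in closed form by conditioning on the coalition size: every diagonal entry equals $P(Z_i = 1) = 1/2$ by the symmetry $z \leftrightarrow \mathbf{1} - z$, and every off-diagonal entry equals $P(Z_i = Z_j = 1)$. The sketch below is our own illustration in Python (not the paper's released code):

```python
import numpy as np

def shapley_kernel_A(d):
    """Closed-form A = E[Z Z^T] under the distribution p(Z) above.

    Conditioned on the coalition size k = 1^T Z, the vector Z is uniform
    over size-k coalitions, so P(Z_i = Z_j = 1 | k) = k(k-1) / (d(d-1)).
    """
    ks = np.arange(1, d)                 # allowed coalition sizes
    weights = (d - 1) / (ks * (d - ks))  # unnormalized P(1^T Z = k)
    size_probs = weights / weights.sum()
    off_diag = np.sum(size_probs * ks * (ks - 1) / (d * (d - 1)))
    A = np.full((d, d), off_diag)
    np.fill_diagonal(A, 0.5)             # P(Z_i = 1) = 1/2 by symmetry
    return A
```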
For the paired sampling estimator, the covariance of each paired sample $\frac{1}{2}(M_1 + M_2)$ includes the cross term $\frac{1}{2}\mathrm{Cov}(M_1, M_2)$, where $M_1 = Zv(Z) - E[Z]v(\mathbf{0})$ and $M_2 = (\mathbf{1} - Z)v(\mathbf{1} - Z) - E[\mathbf{1} - Z]v(\mathbf{0})$. Based on this, we define $G_v$ as follows:

$$G_v = -\mathrm{Cov}(M_1, M_2) = -\mathrm{Cov}\Big(Zv(Z) - E[Z]v(\mathbf{0}),\; (\mathbf{1} - Z)v(\mathbf{1} - Z) - E[\mathbf{1} - Z]v(\mathbf{0})\Big) = -\mathrm{Cov}\Big(Zv(Z),\; (\mathbf{1} - Z)v(\mathbf{1} - Z)\Big).$$

This is the matrix referenced in Theorem ??. Notice that $G_v$ is the negated cross-covariance between $M_1$ and $M_2$, which is the off-diagonal block in the joint covariance matrix for the concatenated random variable $(M_1, M_2)$. This matrix is symmetric, unlike general cross-covariance matrices, and its eigen-structure determines whether our variance reduction approach is effective. In particular, if the condition $G_v \succeq 0$ holds, then

$$\mathrm{Cov}(\bar{b}_n) \succeq \mathrm{Cov}(\check{b}_n),$$

which implies that

$$\mathrm{Cov}(\bar{\beta}_n) \succeq \mathrm{Cov}(\check{\beta}_n).$$

Since the inverses of two ordered matrices are also ordered, we get the result:

$$\mathrm{Cov}(\bar{\beta}_n)^{-1} \preceq \mathrm{Cov}(\check{\beta}_n)^{-1}.$$

This has implications for quadratic forms involving each matrix. For any vector $a \in \mathbb{R}^d$, we have the inequality

$$a^\top \mathrm{Cov}(\bar{\beta}_n)^{-1} a \leq a^\top \mathrm{Cov}(\check{\beta}_n)^{-1} a.$$

The last inequality has a geometric interpretation. It shows that the confidence ellipsoid (i.e., the confidence region, or prediction ellipsoid) for $\check{\beta}_n$ is contained in the corresponding confidence ellipsoid for $\bar{\beta}_n$, since large values of $n$ lead each estimator to converge to its asymptotically normal distribution. This is because the confidence ellipsoids are defined for $\alpha \in (0, 1)$ as

$$\bar{E}_{n,\alpha} = \Big\{ a \in \mathbb{R}^d : (a - \beta^*)^\top \mathrm{Cov}(\bar{\beta}_n)^{-1} (a - \beta^*) \leq \chi^2_d(\alpha) \Big\}$$
$$\check{E}_{n,\alpha} = \Big\{ a \in \mathbb{R}^d : (a - \beta^*)^\top \mathrm{Cov}(\check{\beta}_n)^{-1} (a - \beta^*) \leq \chi^2_d(\alpha) \Big\},$$

where $\chi^2_d(\alpha)$ denotes the inverse CDF of a chi-squared distribution with $d$ degrees of freedom evaluated at $\alpha$. More precisely, we have $\check{E}_{n,\alpha} \subseteq \bar{E}_{n,\alpha}$ because

$$(a - \beta^*)^\top \mathrm{Cov}(\check{\beta}_n)^{-1} (a - \beta^*) \leq \chi^2_d(\alpha) \implies (a - \beta^*)^\top \mathrm{Cov}(\bar{\beta}_n)^{-1} (a - \beta^*) \leq \chi^2_d(\alpha).$$

This completes the proof.

Consider the matrix $G_v$, which for a game $v$ is defined as

$$G_v = -\mathrm{Cov}\Big(Zv(Z),\; (\mathbf{1} - Z)v(\mathbf{1} - Z)\Big).$$

A necessary (but not sufficient) condition for $G_v \succeq 0$ is that its diagonal entries are non-negative. For a game $v$, the diagonal value $(G_v)_{ii}$ is given by:

$$(G_v)_{ii} = -\mathrm{Cov}\Big(Z_i v(Z),\; (1 - Z_i) v(\mathbf{1} - Z)\Big)$$
$$= -E\big[Z_i(1 - Z_i)v(Z)v(\mathbf{1} - Z)\big] + E\big[Z_i v(Z)\big]\, E\big[(1 - Z_i)v(\mathbf{1} - Z)\big]$$
$$= E\big[Z_i v(Z)\big]\, E\big[(1 - Z_i)v(\mathbf{1} - Z)\big]$$
$$= E\big[Z_i v(Z)\big]^2 \geq 0,$$

where we use the fact that $Z_i(1 - Z_i) = 0$ for binary $Z_i$, and then the symmetry of $p(Z)$ under $Z \mapsto \mathbf{1} - Z$, which gives $E[(1 - Z_i)v(\mathbf{1} - Z)] = E[Z_i v(Z)] = P(i \in S)\, E[v(S) \mid i \in S]$.

Geometrically, this condition means that the confidence ellipsoid $\bar{E}_{n,\alpha}$ extends beyond the ellipsoid $\check{E}_{n,\alpha}$ in the axis-aligned directions. In a probabilistic sense, it means that the variance for each Shapley value estimate is lower when using the paired sampling technique.
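As an informal check of this condition for a particular game, one can simulate the default and paired estimates and compare their variances. The snippet below is a toy sketch of ours; the game v, the dimensionality, and the sampler are placeholders rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def sample_z(n):
    """Draw binary coalition vectors from p(Z)."""
    ks = np.arange(1, d)
    probs = (d - 1) / (ks * (d - ks))
    probs /= probs.sum()
    sizes = rng.choice(ks, size=n, p=probs)
    z = np.zeros((n, d))
    for row, k in zip(z, sizes):
        row[rng.choice(d, size=k, replace=False)] = 1.0
    return z

def v(z):
    """Toy cooperative game (placeholder for a model-based game)."""
    weights = np.arange(1, d + 1)
    return z @ weights + 0.5 * z[:, 0] * z[:, 1]

z = sample_z(10_000)
m1 = z * v(z)[:, None]              # default single-sample terms
m2 = (1 - z) * v(1 - z)[:, None]    # complementary (paired) terms
paired = 0.5 * (m1 + m2)
print("default variance:", m1.var(axis=0).sum())
print("paired variance: ", paired.var(axis=0).sum())
```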
Shapley Effects is a model explanation method that summarizes the model $f$'s sensitivity to each feature [32]. It is based on the cooperative game

$$\tilde{w}(S) = \mathrm{Var}\big(E[f(X) \mid X_S]\big). \quad (1)$$

To show that Shapley Effects can be viewed as the expectation of a stochastic cooperative game, we reformulate this game (Covert et al. [12]) as:

$$\tilde{w}(S) = \mathrm{Var}\big(E[f(X) \mid X_S]\big)$$
$$= \mathrm{Var}\big(f(X)\big) - E_{X_S}\big[\mathrm{Var}(f(X) \mid X_S)\big]$$
$$= c - E_{X_S}\Big[E_{X_{D \setminus S} \mid X_S}\big[\big(E[f(X) \mid X_S] - f(X_S, X_{D \setminus S})\big)^2\big]\Big]$$
$$= c - E_X\Big[\big(E[f(X) \mid X_S] - f(X)\big)^2\Big].$$

If we generalize this cooperative game to allow arbitrary loss functions $\ell$ (e.g., cross entropy loss for classification tasks) rather than MSE, then we can ignore the constant value $c$ and re-write the game as

$$\tilde{w}(S) = -E_X\Big[\ell\big(E[f(X) \mid X_S],\, f(X)\big)\Big].$$

Now, it is apparent that Shapley Effects is based on a cooperative game that is the expectation of a stochastic cooperative game, or $\tilde{w}(S) = E_X[\tilde{W}(S, X)]$, where $\tilde{W}(S, X)$ is defined as:

$$\tilde{W}(S, X) = -\ell\big(E[f(X) \mid X_S],\, f(X)\big).$$

Unlike the stochastic cooperative game implicitly used by SAGE, the exogenous random variable for this game is $U = X$.

For a stochastic cooperative game $V(S, U)$, the generalized Shapley values are given by the expression

$$\phi_i(V) = \frac{1}{d} \sum_{S \subseteq D \setminus \{i\}} \binom{d - 1}{|S|}^{-1} E_U\big[V(S \cup \{i\}, U) - V(S, U)\big]$$
$$= \frac{1}{d} \sum_{S \subseteq D \setminus \{i\}} \binom{d - 1}{|S|}^{-1} \Big(E_U\big[V(S \cup \{i\}, U)\big] - E_U\big[V(S, U)\big]\Big).$$

The second line above shows that the generalized Shapley values are equivalent to the Shapley values of the game's expectation, or $\phi_i(\bar{V})$, where $\bar{V}(S) = E_U[V(S, U)]$. Based on this, we can also understand the values $\phi_1(V), \ldots, \phi_d(V)$ as the optimal coefficients for the following weighted least squares problem:

$$\min_{\beta_0, \ldots, \beta_d} \; \sum_z p(z) \Big(\beta_0 + z^\top \beta - E_U\big[V(z, U)\big]\Big)^2$$
$$\mathrm{s.t.} \quad \beta_0 = E_U\big[V(\mathbf{0}, U)\big], \qquad \mathbf{1}^\top \beta = E_U\big[V(\mathbf{1}, U)\big] - E_U\big[V(\mathbf{0}, U)\big].$$

Using our derivation from the main text (Section ??), we can write the solution as

$$\beta^* = A^{-1}\bigg(b - \mathbf{1}\, \frac{\mathbf{1}^\top A^{-1} b - E_U[V(\mathbf{1}, U)] + E_U[V(\mathbf{0}, U)]}{\mathbf{1}^\top A^{-1} \mathbf{1}}\bigg),$$

where $A$ and $b$ are given by the expressions

$$A = E[ZZ^\top], \qquad b = E_Z\Big[Z\big(E_U[V(Z, U)] - E_U[V(\mathbf{0}, U)]\big)\Big].$$

Now, we consider our adaptations of KernelSHAP and unbiased KernelSHAP and examine whether these estimators are consistent or unbiased. We begin with the stochastic version of KernelSHAP presented in the main text (Section ??). Recall that this approach uses the original $A$ estimator $\hat{A}_n$ and the modified $b$ estimator $\tilde{b}_n$, which is defined as:

$$\tilde{b}_n = \frac{1}{n} \sum_{i=1}^n z_i \Big(V(z_i, u_i) - E_U\big[V(\mathbf{0}, U)\big]\Big).$$

As mentioned in the main text, the strong law of large numbers lets us conclude that $\lim_{n \to \infty} \hat{A}_n = A$. Thus, we can understand the $b$ estimator's expectation as follows:

$$E\big[\tilde{b}_n\big] = E_{ZU}\Big[Z\big(V(Z, U) - E_U[V(\mathbf{0}, U)]\big)\Big] = E_Z\Big[Z\big(E_U[V(Z, U)] - E_U[V(\mathbf{0}, U)]\big)\Big] = b.$$

With this, we conclude that $\lim_{n \to \infty} \tilde{b}_n = b$ and that the estimates $\tilde{\beta}_n$ are consistent, or

$$\lim_{n \to \infty} \tilde{\beta}_n = \beta^*.$$
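All of the estimators in this appendix share the same final step, mapping estimates of $A$ and $b$ (plus the two constraint values) to Shapley value estimates. A minimal sketch of that step, in our own notation (the function name and interface are ours):

```python
import numpy as np

def constrained_solution(A, b, v0, v1):
    """Constrained weighted least squares solution.

    Computes beta = A^{-1} (b - 1 (1^T A^{-1} b - v1 + v0) / (1^T A^{-1} 1)),
    where (v0, v1) are v(0) and v(1) for deterministic games, or
    E_U[V(0, U)] and E_U[V(1, U)] for stochastic games.
    """
    ones = np.ones(len(b))
    A_inv_b = np.linalg.solve(A, b)
    A_inv_ones = np.linalg.solve(A, ones)
    correction = (ones @ A_inv_b - v1 + v0) / (ones @ A_inv_ones)
    return A_inv_b - correction * A_inv_ones
```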
To adapt unbiased KernelSHAP to the setting of stochastic cooperative games, we use the same technique of pairing independent samples of $Z$ and $U$. To estimate $b$, we use an estimator $\tilde{\bar{b}}_n$ defined as:

$$\tilde{\bar{b}}_n = \frac{1}{n} \sum_{i=1}^n z_i V(z_i, u_i) - E[Z]\, E_U\big[V(\mathbf{0}, U)\big].$$

We then substitute this into a Shapley value estimator as follows:

$$\tilde{\bar{\beta}}_n = A^{-1}\bigg(\tilde{\bar{b}}_n - \mathbf{1}\, \frac{\mathbf{1}^\top A^{-1} \tilde{\bar{b}}_n - E_U[V(\mathbf{1}, U)] + E_U[V(\mathbf{0}, U)]}{\mathbf{1}^\top A^{-1} \mathbf{1}}\bigg). \quad (2)$$

This is consistent and unbiased because of the linear dependence on $\tilde{\bar{b}}_n$ and the fact that $\tilde{\bar{b}}_n$ is unbiased:

$$E\big[\tilde{\bar{b}}_n\big] = E_{ZU}\Big[Z V(Z, U) - E[Z]\, E_U[V(\mathbf{0}, U)]\Big] = E_Z\Big[Z\big(E_U[V(Z, U)] - E_U[V(\mathbf{0}, U)]\big)\Big] = b.$$

With this, we conclude that $E[\tilde{\bar{\beta}}_n] = \beta^*$ and $\lim_{n \to \infty} \tilde{\bar{\beta}}_n = \beta^*$.

Here, we provide further details about the experiments described in the main body of the text.

For all three explanation methods considered in our experiments – SHAP [24], SAGE [12] and Shapley Effects [32] – we handled removed features by marginalizing them out according to their joint marginal distribution. This is the default behavior for SHAP, but it is an approximation of what is required by SAGE and Shapley Effects. However, this choice should not affect the outcome of our experiments, which focus on the convergence properties of our Shapley value estimators (and not the underlying cooperative games).

Both SAGE and Shapley Effects require a loss function (Section 3). We used the cross entropy loss for SAGE and the soft cross entropy loss for Shapley Effects.

For the breast cancer (BRCA) subtype classification dataset, we selected 100 out of 17,814 genes to avoid overfitting on the relatively small dataset (only 510 patients). These genes were selected at random: we tried ten random seeds and selected the subset that achieved the best performance to ensure that several relevant BRCA genes were included. A small portion of missing expression values were imputed with their mean. The data was centered and normalized prior to fitting a regularized logistic regression model; the regularization parameter was chosen using a validation set.

To compare the run-time of various SHAP value estimators, we sought to compare the ratio of the mean number of samples required by each method. For a single example $x$ whose SHAP values are represented by $\beta^*$, the mean squared estimation error can be decomposed into the variance and bias as follows:

$$E\big[\|\hat{\beta}_n - \beta^*\|^2\big] = E\big[\|\hat{\beta}_n - E[\hat{\beta}_n]\|^2\big] + \big\|E[\hat{\beta}_n] - \beta^*\big\|^2.$$

Since we found that the error is dominated by variance rather than bias (Section ??), we can make the following approximation to relate the error to the trace of the covariance matrix:

$$E\big[\|\hat{\beta}_n - \beta^*\|^2\big] = E\big[\|\hat{\beta}_n - E[\hat{\beta}_n]\|^2\big] + \big\|E[\hat{\beta}_n] - \beta^*\big\|^2 \approx E\big[\|\hat{\beta}_n - E[\hat{\beta}_n]\|^2\big] = \mathrm{Tr}\Big(\mathrm{Cov}(\hat{\beta}_n)\Big). \quad (3)$$

If we define convergence based on the mean estimation error falling below a threshold value $t$, then the convergence condition is

$$E\big[\|\hat{\beta}_n - \beta^*\|^2\big] \leq t.$$

Using our approximation (Eq. 3), we can see that this condition is approximately equivalent to

$$E\big[\|\hat{\beta}_n - \beta^*\|^2\big] \approx \mathrm{Tr}\Big(\mathrm{Cov}(\hat{\beta}_n)\Big) \approx \frac{\mathrm{Tr}(\Sigma_{\hat{\beta}})}{n} \leq t.$$

For a given threshold $t$, the mean number of samples required to explain individual predictions is therefore based on the mean trace of the covariance matrix $\Sigma_{\hat{\beta}}$ (or the analogous covariance matrix for a different estimator). To compare two methods, we simply calculate the ratio of the mean trace of the covariance matrices. These ratios are reported in Table ??, where each covariance matrix is calculated empirically across 100 runs with $n = 2048$ samples.
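In code, this comparison can be carried out by scaling the empirical covariance of repeated runs back to the per-sample covariance $\Sigma$. A sketch under the approximation above (the helper and its interface are ours):

```python
import numpy as np

def samples_required(estimates, n_used, t):
    """Approximate samples needed so that Tr(Cov(beta_n)) <= t.

    estimates: array of shape (runs, d), each row computed from n_used
    samples. Since Cov(beta_n) is approximately Sigma / n, we recover
    Tr(Sigma) = n_used * Tr(Cov(runs)) and return n* = Tr(Sigma) / t.
    """
    cov_runs = np.cov(estimates, rowvar=False)
    return n_used * np.trace(cov_runs) / t

# Ratio of mean samples required by two estimators (as compared in the
# run-time table): samples_required(a, n, t) / samples_required(b, n, t).
```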
In Section ??, we empirically compared the bias and variance for the original and unbiased versions of KernelSHAP using a single census income prediction. The results (Figure ??) showed that both versions' estimation errors were dominated by variance rather than bias, and that the original version had significantly lower variance. To verify that this result is not an anomaly, we replicated it on multiple examples and across several datasets.

First, we examined several individual predictions for the census income, German credit and bank marketing datasets. To highlight the effectiveness of our paired sampling approach (Section ??), we added these methods as additional comparisons. Rather than decomposing the error into bias and variance as in the main text, we simply calculated the mean squared error across 100 runs of each estimator. Figure 1 shows the error for several census income predictions, Figure 3 for several bank marketing predictions, and Figure 5 for several credit quality predictions. These results confirm that the original version of KernelSHAP converges significantly faster than the unbiased version, and that the paired sampling technique is effective for both estimators. The dataset sampling approach (original KernelSHAP) appears preferable in practice despite being more difficult to analyze because it converges to the correct result much faster.

Second, we calculated a global measure of the bias and variance for each estimator using the same datasets (Table 1). Given 100 examples from each dataset, we calculated the mean bias and mean variance for each estimator empirically across 100 runs given $n = 256$ samples. Results show that the bias is nearly zero for all estimators, not just the unbiased ones; they also show that the variance is often significantly larger than the bias. However, when using the dataset sampling approach (original) in combination with the paired sampling technique, the bias and variance are comparably low ($\approx 0$) after 256 samples. The only exception is the unbiased estimator that does not use paired sampling, but this is likely due to estimation error because its bias is provably equal to zero.

Finally, Section ?? also proposed assuming that the original KernelSHAP estimator's variance reduces at a rate of $O(1/n)$, similar to the unbiased version (for which we proved this rate). Although this result is difficult to prove formally, it seems to hold empirically across multiple predictions and several datasets. In Figures 2, 4 and 6, we display the product of the estimator's variance with the number of samples for the census, bank and credit datasets. Results confirm that the product is roughly constant as the number of samples increases, indicating that the variance for all four estimators (not just the unbiased ones) reduces at a rate of $O(1/n)$.
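This check is easy to reproduce for any estimator: compute repeated estimates at increasing budgets and verify that the product of the total variance and the budget stays roughly flat. A minimal sketch, assuming run_estimator(n) returns one Shapley value estimate computed from n samples (a placeholder for any of the four estimators):

```python
import numpy as np

def variance_times_samples(run_estimator, budgets, runs=100):
    """Empirical check of the O(1/n) variance rate."""
    products = []
    for n in budgets:
        estimates = np.stack([run_estimator(n) for _ in range(runs)])
        total_var = estimates.var(axis=0).sum()  # summed per-coordinate variance
        products.append(n * total_var)           # roughly constant if Var ~ 1/n
    return products
```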
Figure 1: Census income SHAP value estimation error on four predictions.

Figure 2: Census income SHAP value variance estimation on four predictions.

Figure 3: Bank marketing SHAP value estimation error on four predictions.

Figure 4: Bank marketing SHAP value variance estimation on four predictions.

Figure 5: German credit SHAP value estimation error on four predictions.

Figure 6: German credit SHAP value variance estimation on four predictions.

(Each figure compares the Original, Unbiased, Original (variance reduction) and Unbiased (variance reduction) estimators, plotting mean squared error or variance × samples against the number of game evaluations.)

Table 1: Global measures of bias and variance for each SHAP value estimator. Each entry is the mean bias and mean variance calculated empirically across 100 examples (bias/variance, lower is better).

                               Census Income     Bank Marketing    German Credit
Unbiased                       0.0002/0.0208     0.0001/0.0125     0.0026/0.2561
Unbiased + Paired Sampling     0.0000/0.0068     0.0000/0.0066     0.0000/0.0062
Original (KernelSHAP)          0.0000/0.0007     0.0000/0.0006     0.0000/0.0002
Original + Paired Sampling     0.0000/0.0001     0.0000/0.0001     0.0000/0.0000
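For reference, one reasonable way to compute entries like those above from repeated runs is sketched below; estimates stacks the runs for one example and beta_true holds the exact Shapley values (both placeholders), and the table's exact aggregation across examples may differ:

```python
import numpy as np

def bias_and_variance(estimates, beta_true):
    """Empirical bias and variance of repeated Shapley value estimates.

    estimates: shape (runs, d); beta_true: shape (d,). The two returned
    terms sum to the mean squared error, mirroring Eq. 3's decomposition.
    """
    mean_estimate = estimates.mean(axis=0)
    bias = np.sum((mean_estimate - beta_true) ** 2)
    variance = np.sum(estimates.var(axis=0))
    return bias, variance
```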
Here, we provide pseudocode for the estimation algorithms described in the main text. Algorithm 1 shows the dataset sampling approach (original KernelSHAP) with our convergence detection and paired sampling techniques. Algorithm 2 shows KernelSHAP's adaptation to the setting of stochastic cooperative games (stochastic KernelSHAP). Algorithm 3 shows the unbiased KernelSHAP estimator, and Algorithm 4 shows the adaptation of unbiased KernelSHAP to stochastic cooperative games. Throughout, $\mathbf{0}$ and $\mathbf{1}$ denote the empty and grand coalitions, so $v(\mathbf{0})$ and $v(\mathbf{1})$ are the corresponding game values.

Algorithm 1: Shapley value estimation with dataset sampling (KernelSHAP)

Input: game $v$, convergence threshold $t$, intermediate samples $m$

    // Initialize
    n = 0, A = 0, b = 0
    // For tracking intermediate samples
    counter = 0, A_temp = 0, b_temp = 0, estimates = list()
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        if variance reduction then
            A_sample = (1/2) (z zᵀ + (1 − z)(1 − z)ᵀ)
            b_sample = (1/2) (z v(z) + (1 − z) v(1 − z) − 1 v(0))
        else
            A_sample = z zᵀ
            b_sample = z (v(z) − v(0))
        // Welford's algorithm
        n = n + 1
        A += (A_sample − A) / n
        b += (b_sample − b) / n
        counter += 1
        A_temp += (A_sample − A_temp) / counter
        b_temp += (b_sample − b_temp) / counter
        if counter == m then
            // Get intermediate estimate
            β_m = A_temp⁻¹ (b_temp − 1 (1ᵀ A_temp⁻¹ b_temp − v(1) + v(0)) / (1ᵀ A_temp⁻¹ 1))
            estimates.append(β_m)
            counter = 0, A_temp = 0, b_temp = 0
        // Get estimates, uncertainties
        β_n = A⁻¹ (b − 1 (1ᵀ A⁻¹ b − v(1) + v(0)) / (1ᵀ A⁻¹ 1))
        Σ_β = m · Cov(estimates)       // empirical covariance
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n

Algorithm 2: Shapley value estimation with dataset sampling for stochastic cooperative games

Input: game $V$, convergence threshold $t$, intermediate samples $m$

    // Initialize
    n = 0, A = 0, b = 0
    // For tracking intermediate samples
    counter = 0, A_temp = 0, b_temp = 0, estimates = list()
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        sample u ~ p(U)
        if variance reduction then
            b_sample = (1/2) (z V(z, u) + (1 − z) V(1 − z, u) − 1 E_U[V(0, U)])
            A_sample = (1/2) (z zᵀ + (1 − z)(1 − z)ᵀ)
        else
            b_sample = z (V(z, u) − E_U[V(0, U)])
            A_sample = z zᵀ
        // Welford's algorithm
        n = n + 1
        b += (b_sample − b) / n
        A += (A_sample − A) / n
        counter += 1
        b_temp += (b_sample − b_temp) / counter
        A_temp += (A_sample − A_temp) / counter
        if counter == m then
            // Get intermediate estimate
            β_m = A_temp⁻¹ (b_temp − 1 (1ᵀ A_temp⁻¹ b_temp − E_U[V(1, U)] + E_U[V(0, U)]) / (1ᵀ A_temp⁻¹ 1))
            estimates.append(β_m)
            counter = 0, A_temp = 0, b_temp = 0
        // Get estimates, uncertainties
        β_n = A⁻¹ (b − 1 (1ᵀ A⁻¹ b − E_U[V(1, U)] + E_U[V(0, U)]) / (1ᵀ A⁻¹ 1))
        Σ_β = m · Cov(estimates)       // empirical covariance
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n
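The convergence test in Algorithms 1 and 2 reuses the intermediate estimates as an uncertainty measure. A minimal Python sketch of that test (our illustration, not the paper's codebase):

```python
import numpy as np

def converged(estimates, beta_n, n, m, t):
    """Convergence check from Algorithms 1 and 2.

    estimates: list of intermediate estimates, each computed from m
    samples; scaling their covariance by m approximates the per-sample
    covariance Sigma, and sigma_n is the standard error after n samples.
    """
    sigma_matrix = m * np.cov(np.stack(estimates), rowvar=False)
    sigma_n = np.sqrt(np.diag(sigma_matrix) / n)
    return sigma_n.max() / (beta_n.max() - beta_n.min()) < t
```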
Algorithm 3: Unbiased Shapley value estimation

Input: game $v$, convergence threshold $t$

    // Initialize
    Set A (Section 3.3)
    Set C (Eq. 13)
    n = 0, b = 0, bSSQ = 0
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        if variance reduction then
            b_sample = (1/2) (z v(z) + (1 − z) v(1 − z) − 1 v(0))
        else
            b_sample = z v(z) − E[Z] v(0)
        // Welford's algorithm
        n = n + 1
        diff = b_sample − b
        b += diff / n
        diff2 = b_sample − b
        bSSQ += outer(diff, diff2)     // outer product
        // Get estimates, uncertainties
        β_n = A⁻¹ (b − 1 (1ᵀ A⁻¹ b − v(1) + v(0)) / (1ᵀ A⁻¹ 1))
        Σ_b = bSSQ / n
        Σ_β = C Σ_b Cᵀ
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n

Algorithm 4: Unbiased Shapley value estimation for stochastic cooperative games

Input: game $V$, convergence threshold $t$

    // Initialize
    Set A (Section 3.3)
    Set C (Eq. 13)
    n = 0, b = 0, bSSQ = 0
    // Sampling loop
    converged = False
    while not converged do
        // Draw next sample
        sample z ~ p(Z)
        sample u ~ p(U)
        if variance reduction then
            b_sample = (1/2) (z V(z, u) + (1 − z) V(1 − z, u) − 1 E_U[V(0, U)])
        else
            b_sample = z V(z, u) − E[Z] E_U[V(0, U)]
        // Welford's algorithm
        n = n + 1
        diff = b_sample − b
        b += diff / n
        diff2 = b_sample − b
        bSSQ += outer(diff, diff2)     // outer product
        // Get estimates, uncertainties
        β_n = A⁻¹ (b − 1 (1ᵀ A⁻¹ b − E_U[V(1, U)] + E_U[V(0, U)]) / (1ᵀ A⁻¹ 1))
        Σ_b = bSSQ / n
        Σ_β = C Σ_b Cᵀ
        σ_n = sqrt(diag(Σ_β) / n)      // element-wise square root
        // Check for convergence
        converged = (max(σ_n) / (max(β_n) − min(β_n)) < t)
    end
    return β_n, σ_n
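The diff/diff2 updates in Algorithms 3 and 4 are Welford's method [42], generalized from running variances to a running covariance matrix. A minimal self-contained sketch (our own illustration):

```python
import numpy as np

class WelfordCovariance:
    """Streaming mean and covariance via Welford's algorithm [42]."""

    def __init__(self, d):
        self.n = 0
        self.mean = np.zeros(d)
        self.ssq = np.zeros((d, d))        # running sum-of-squares matrix

    def update(self, sample):
        self.n += 1
        diff = sample - self.mean          # deviation from the old mean
        self.mean += diff / self.n
        diff2 = sample - self.mean         # deviation from the new mean
        self.ssq += np.outer(diff, diff2)

    def covariance(self):
        return self.ssq / self.n           # matches Sigma_b = bSSQ / n above
```

In Algorithms 3 and 4, each new b sample would be passed to update, and covariance() would supply Σ_b for the uncertainty estimate Σ_β = C Σ_b Cᵀ.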