Explaining predictive models using Shapley values and non-parametric vine copulas
Kjersti Aas, Thomas Nagler, Martin Jullum, Anders Løland

February 15, 2021
Abstract
The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features in reality are dependent, this may lead to incorrect explanations. Hence, there have recently been attempts at appropriately modelling/estimating the dependence between the features. Although the proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions, able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
1 Introduction

In many applications, complex machine learning models outperform traditional regression models. It is often hard to understand why the machine learning models perform so well. Hence, during the last few years, a new line of research has emerged that focuses on interpreting the predictions from these models. One such method that has become very popular is Shapley values [20, 35, 36, 2]. This method, which is based on concepts from cooperative game theory, was originally invented for assigning payouts to players depending on their contribution towards the total payout [31]. When interpreting machine learning models, the model features are the players and the prediction is the total payout.

The original development of Shapley values for prediction explanation [20, 35, 36] relied on the assumption that the features being described were independent. [2] showed that if there is a high degree of dependence among some or all the features, this may lead to severely inaccurate Shapley value estimates and incorrect explanations. In the same paper, the authors deal with this problem, proposing three different approaches for appropriately modelling/estimating the dependence between the features: the Gaussian approach, the Gaussian copula approach, and the empirical approach. Although all three methods clearly outperform the traditional approach assuming independence, they have their weaknesses. The Gaussian approach assumes that the features are multivariate Gaussian distributed, while the Gaussian copula approach represents the marginal distributions of the features with their empirical margins and models the dependence structure by a Gaussian copula [17]. Hence, they will work well if, respectively, the distribution or the dependency structure of the features is Gaussian. The empirical approach is inspired by the kernel estimator. Like most other non-parametric density estimation approaches, this method suffers from the curse of dimensionality.
It would therefore require a large data set to be accurate in problems with many features.

In this paper, we propose two alternative approaches to estimate Shapley values. In both approaches, the multivariate joint density function of the features is represented by a vine copula [17], but they differ in the way the Shapley contribution function is evaluated. A vine copula is a multivariate copula that is constructed from a set of bivariate ones, so-called pair-copulas. All of these bivariate copulas may be selected completely freely, meaning that vine copulas are able to characterise a wide range of complex dependencies. Hence, the new approaches are expected to outperform the existing ones in cases where the feature distribution is far from Gaussian.

The main part of the methodology proposed in this paper may be used for many other applications than computing Shapley values. It may, e.g., be regarded as a contribution to the field of non-parametric conditional density estimation.

The rest of the paper is organized as follows. We begin by explaining the fundamentals of the Shapley value framework in an explanation setting in Section 2, while Section 3 reviews some of the previously proposed Shapley methods for prediction explanation. In Section 4 we introduce the two new methods for computing Shapley values based on vine copulas. Section 5 presents various simulation studies that demonstrate that our method works in a variety of settings, while Section 6 gives a real data example. Finally, in Section 7, we conclude.

2 Shapley values

Suppose we are in a cooperative game setting with M players, j = 1, . . . , M, trying to maximize a payoff. Let \mathcal{M} be the set of all players and S any subset of \mathcal{M}. Then the Shapley value [31] for the j-th player is defined as

\phi_j = \sum_{S \subseteq \mathcal{M} \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \big( v(S \cup \{j\}) - v(S) \big). \qquad (1)

Here, v(S) is the contribution function, which maps subsets of players to real numbers representing the worth or contribution of the group S, and |S| is the number of players in subset S.

In the game theory sense, each player receives \phi_j as their payout. From the formula, we see that this payout is just a weighted sum of the player's marginal contributions to each group S. Lloyd Shapley proved that distributing the total gains of the game in this way is 'fair' in the sense that it obeys certain important axioms [31].

In a machine learning setting, imagine a scenario where we have M features, x = (x_1, . . . , x_M), and a univariate response y, and have fitted the model g(x). We now want to explain the prediction g(x^*) for a specific feature vector x^*. The papers [20, 35, 36] suggest doing this with Shapley values, where the predictive model replaces the cooperative game and the features replace the players. To use (1), [20] defines the contribution function v(S) as the following expected prediction

v(S) = E[g(x) \mid x_S = x^*_S]. \qquad (2)

Here x_S denotes the features in subset S and x^*_S is the subset S of the feature vector x^* that we want to explain. Thus, v(S) denotes the expected prediction given that the features in subset S take the value x^*_S.

If the features are continuous, we can write the conditional expectation (2) as

E[g(x) \mid x_S = x^*_S] = E[g(x_{\bar{S}}, x_S) \mid x_S = x^*_S] = \int g(x_{\bar{S}}, x^*_S)\, f(x_{\bar{S}} \mid x_S = x^*_S)\, dx_{\bar{S}}, \qquad (3)

where x_{\bar{S}} is the vector of features not in S and f(x_{\bar{S}} \mid x_S = x^*_S) is the conditional density of x_{\bar{S}} given x_S = x^*_S. To compute Shapley values in practice, the conditional expectation in (3) needs to be approximated empirically. Note that in the rest of the paper we use lower case x-s for both random variables and realizations to keep the notation simple.
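To make (1) concrete, the following is a minimal sketch of the exact Shapley computation for a given contribution function v (the methods in this paper are implemented in R; Python is used here purely for illustration, and the additive game in the example is hypothetical):

```python
from itertools import combinations
from math import factorial

def shapley_values(M, v):
    """Exact Shapley values via (1); v maps a frozenset of player indices to a number."""
    phi = []
    for j in range(M):
        others = [p for p in range(M) if p != j]
        total = 0.0
        for size in range(M):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(M - size - 1) / factorial(M)
                total += weight * (v(frozenset(S) | {j}) - v(frozenset(S)))
        phi.append(total)
    return phi

# For an additive game v(S) = sum of per-player worths, player j's Shapley value
# equals its own worth, and the values sum to v of the full player set (efficiency).
worth = [1.0, 2.0, 4.0]
phi = shapley_values(3, lambda S: sum(worth[p] for p in S))
```

In the explanation setting, v would be one of the estimators of (2), evaluated for all 2^M subsets of features.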
3 Earlier approaches

3.1 The independence approach

Since the conditional probability density is rarely known and difficult to estimate, [20] replaces it with the simple (unconditional) probability density

f(x_{\bar{S}} \mid x_S = x^*_S) = f(x_{\bar{S}}). \qquad (4)

The integral is thus approximated by

E[g(x) \mid x_S = x^*_S] \approx \int g(x_{\bar{S}}, x^*_S)\, f(x_{\bar{S}})\, dx_{\bar{S}}, \qquad (5)

which is estimated by randomly drawing K times from the full training data set and calculating

v_{\mathrm{KerSHAP}}(S) = \frac{1}{K} \sum_{k=1}^{K} g(x^k_{\bar{S}}, x^*_S). \qquad (6)

Here, x^k_{\bar{S}}, k = 1, . . . , K, are the samples from the training set and g(\cdot) is the estimated prediction model. In the Shapley literature, the approximation (5) is sometimes termed the interventional conditional expectation, while (3) is denoted the observational conditional expectation. See e.g. [7] for more details. Unfortunately, when the features are not independent, [2] demonstrates that naively replacing the conditional probability function with the unconditional one leads to very inaccurate Shapley values.

3.2 The Gaussian copula approach

In [2], one of the proposed methods for estimating f(x_{\bar{S}} \mid x_S = x^*_S) without relying on the naive assumption of independence is based on the Gaussian copula. A copula is a function that characterizes the dependence in a random vector. By Sklar's theorem, any joint distribution function F with marginal cdfs F_1, . . . , F_M can be written as

F(x) = C(F_1(x_1), . . . , F_M(x_M)),

where C is the copula function. Copulas are distribution functions with uniform margins. The corresponding density is denoted by c.

There are several parametric families for the copula function. The Gaussian copula is a special case. It is derived by inverting the above display, i.e.,

C(u) = F(F_1^{-1}(u_1), . . . , F_M^{-1}(u_M)),

and taking F as a multivariate Gaussian distribution.
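For intuition, conditional simulation from a bivariate Gaussian copula, the operation that [2] uses to approximate (3), can be sketched as follows; the correlation value and conditioning point are arbitrary choices for illustration:

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def sample_cond_gaussian_copula(u_star, rho, K, rng):
    """Draw K samples of u2 | u1 = u_star under a bivariate Gaussian copula."""
    z1 = std_normal.inv_cdf(u_star)  # normal score of the conditioning value
    draws = []
    for _ in range(K):
        # For a standard bivariate normal, z2 | z1 ~ N(rho * z1, 1 - rho^2).
        z2 = rng.gauss(rho * z1, (1.0 - rho**2) ** 0.5)
        draws.append(std_normal.cdf(z2))  # map back to the copula (uniform) scale
    return draws

rng = random.Random(0)
draws = sample_cond_gaussian_copula(0.9, 0.8, 5000, rng)
```

In the actual approach, the simulated uniforms would additionally be pushed through empirical quantile functions of the features before averaging the predictions as in (7).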
The Gaussian copula gives rise to a parametric model (parametrized by a correlation matrix) that reflects Gaussian dependence, but can be combined with arbitrary marginal distributions.

To compute Shapley values, we first need an estimate of the marginal distributions and copula parameters. [2] proposed to approximate the marginals F_1, . . . , F_M by the corresponding empirical cdfs and parametrize the copula by the empirical correlation matrix of the corresponding normal scores. Together this gives us an estimated model for the joint distribution

\hat{F}(x) = \hat{C}(\hat{F}_1(x_1), . . . , \hat{F}_M(x_M)).

The conditional expectation in (3) can now be approximated by simulating conditionally from the estimated model. More precisely, let x^k_{\bar{S}}, k = 1, . . . , K, be simulated values of x_{\bar{S}} given x_S = x^*_S and compute

v_{\mathrm{KerSHAP}}(S) = \frac{1}{K} \sum_{k=1}^{K} g(x^k_{\bar{S}}, x^*_S). \qquad (7)

Conditional simulation from the Gaussian copula can be achieved in essentially the same way as for the multivariate Gaussian distribution; see [2] for more details.

The Gaussian copula model is very flexible with regard to the marginal distributions, but quite restrictive in the dependence structures it can capture. It can only represent radially symmetric dependence relationships and does not allow for tail dependence (i.e., joint occurrence of extreme events has small probability). We therefore wish to use more flexible copula models, and we shall focus on vine copula models specifically in what follows.

4 Vine copula approaches

A vine copula is a multivariate copula that is constructed from a set of bivariate ones, so-called pair-copulas. All of these bivariate copulas may be selected completely freely, as the resulting structure is guaranteed to be a valid copula. Hence, vine copulas are highly flexible, being able to characterise a wide range of complex dependencies.

Vine copulas have become very popular over the last decade. The main idea was originally proposed by [17] and further explored and discussed by [3, 4] and [19].
However, it was the paper [1], putting them in an inferential context, that really spurred a surge in empirical applications of these constructions. In this paper we use vine copulas to model the multivariate distributions involved in the Shapley framework. After a brief introduction to vine copulas in Section 4.1, we introduce two new methods for approximating the Shapley value contributions based on these structures in Sections 4.2 and 4.3. Finally, computationally efficient selection of the D-vine order is discussed in Section 4.4.
4.1 Vine copulas

In a vine copula, the copula density is decomposed into a product of pair-copula densities. This decomposition is not unique. To organize all possible decompositions, the notion of regular vines (R-vines) was introduced by [4] and described in more detail in [19]. It involves the specification of a sequence of trees, each edge of which corresponds to a pair-copula. These pair-copulas constitute the building blocks of the joint R-vine distribution. In this paper we use a special case of R-vines called D-vines [18], where each tree is a path. The density f(x_1, . . . , x_M) corresponding to a D-vine may be
written as

f(x_1, . . . , x_M) = \prod_{j=1}^{M} f_j(x_j) \times \prod_{i=1}^{M-1} \prod_{j=1}^{M-i} c_{j,j+i \mid j+1,\ldots,j+i-1}\big( F(x_j \mid x_{j+1}, \ldots, x_{j+i-1}),\, F(x_{j+i} \mid x_{j+1}, \ldots, x_{j+i-1}) \big), \qquad (8)

where the index i identifies the trees, and j runs over the edges in each tree. The second factor on the right-hand side of (8) is a product of M(M-1)/2 pair-copula densities and is denoted the D-vine copula density. Note that the arguments of the pair-copulas are conditional distributions in all trees except the first, where they are the univariate margins. Figure 1 shows a 5-dimensional D-vine with 4 trees and 10 edges.

Figure 1: A D-vine with 5 variables, 4 trees and 10 edges. Each edge may be associated with a pair-copula.

The density in (8) implies a specific order of conditioning. This order can be changed by a simple relabelling of the variables. For example, we can switch the roles of variables x_1 and x_M. Instead of the pair-copulas c_{1,2} and c_{M-1,M}, we will then get the pair-copulas c_{M,2} and c_{M-1,1} in the first tree. Each permutation of (1, 2, . . . , M) therefore gives rise to a different model. These permutations are called orders of the D-vine and will play an important role later on.

The key to the construction in (8) is that all copulas involved in the decomposition are bivariate and can belong to different families. There are no restrictions regarding the copula types that can be combined; the resulting structure is guaranteed to be valid. A further advantage of R-vine copulas is that the conditional distributions F(x \mid v) constituting the pair-copula arguments can be evaluated using a recursive formula derived in [17]:

F(x \mid v) = \frac{\partial C_{x v_j \mid v_{-j}}\big( F(x \mid v_{-j}), F(v_j \mid v_{-j}) \big)}{\partial F(v_j \mid v_{-j})}. \qquad (9)

Here C_{x v_j \mid v_{-j}} is a bivariate copula, v_j is an arbitrary component of v, and v_{-j} denotes the vector v excluding v_j.
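As an illustration of (8) and (9), the following sketch evaluates a three-dimensional simplified D-vine density with uniform margins, using the bivariate Clayton family for every pair-copula (the family and parameter value are arbitrary choices for illustration):

```python
def clayton_density(u, v, theta):
    # Bivariate Clayton copula density c(u, v).
    return ((1 + theta) * (u * v) ** (-theta - 1)
            * (u ** (-theta) + v ** (-theta) - 1) ** (-2 - 1 / theta))

def clayton_h(u, v, theta):
    # h-function: the conditional cdf F(u | v) = dC(u, v)/dv, i.e. (9) for one pair.
    return v ** (-theta - 1) * (u ** (-theta) + v ** (-theta) - 1) ** (-1 - 1 / theta)

def dvine3_density(u1, u2, u3, theta):
    # (8) for M = 3: c_{12} * c_{23} * c_{13|2}, with uniform margins f_j = 1.
    # The second-tree copula takes the conditional cdfs F(u1|u2) and F(u3|u2).
    return (clayton_density(u1, u2, theta)
            * clayton_density(u2, u3, theta)
            * clayton_density(clayton_h(u1, u2, theta),
                              clayton_h(u3, u2, theta), theta))
```

A quick sanity check: clayton_h(1.0, v, theta) equals 1 for any v, as a conditional cdf must.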
By construction, R-vines have the important characteristic that the copulas needed in (9) are always present in the preceding trees of the structure, so that they are available without extra computations.

In their general form, vine copulas can represent most continuous multivariate distributions. However, to keep them tractable for inference, one usually assumes that the pair-copulas c_{j,j+i \mid j+1,\ldots,j+i-1} are independent of the conditioning variables x_{j+1}, . . . , x_{j+i-1}, except through the conditional marginal distributions. This leads to the so-called simplified vine copulas. We will consider both parametric and non-parametric models for the pair-copulas. We use the R package rvinecopulib [22] for parameter estimation. More details about parametric and non-parametric estimation can be found in [1] and [21], respectively.

4.2 Conditional simulation method

Having determined the multivariate distribution of the explanatory variables, the next step is to compute the contribution function v(S). We propose two different methods for estimating v(S). In the first, to be described in this section, we generate samples from an estimate of the conditional distribution f(x_{\bar{S}} \mid x_S = x^*_S) and use these samples to estimate v(S). In the other, which is treated in Section 4.3, v(S) is estimated using ratios of copula densities.

To generate the samples from conditional distributions, we can use the Rosenblatt transform [27] and its inverse. The Rosenblatt transform u = T(v) of a random vector v = (v_1, . . . , v_M) \sim F is defined as

u_1 = F(v_1), \quad u_2 = F(v_2 \mid v_1), \quad \ldots, \quad u_M = F(v_M \mid v_1, \ldots, v_{M-1}),

where F(v_m \mid v_1, . . . , v_{m-1}) is the conditional distribution of v_m given v_1, . . . , v_{m-1}, for m = 2, . . . , M. The variables u_1, . . . , u_M are then independent standard uniform variables. The inverse operation,

v_1 = F^{-1}(u_1); \quad v_2 = F^{-1}(u_2 \mid u_1); \quad \ldots; \quad v_M = F^{-1}(u_M \mid u_1, \ldots, u_{M-1}),

can be used to simulate from a distribution: for any joint distribution F, if u is a vector of independent standard uniform random variables, v = T^{-1}(u) has distribution F.

In what follows, we outline the procedure for generating the k-th sample from the conditional distribution F(x_{\bar{S}} \mid x_S = x^*_S):

1. For each j \in S, let u^*_j = \hat{F}_j(x^*_j), where \hat{F}_j is the empirical distribution function of x_j.
2. Let w_{\bar{S}} be a vector with |\bar{S}| elements with arbitrary values between 0 and 1. Set u = (w_{\bar{S}}, u^*_S) and let v = T(u), where T(\cdot) is the Rosenblatt transform.
3. Generate the vector z_{\bar{S}} by sampling |\bar{S}| independent uniform U[0,1] distributed variates.
4. Replace the |\bar{S}| elements corresponding to the subset \bar{S} in v by z_{\bar{S}}.
5. Obtain u = T^{-1}(v) using the inverse Rosenblatt transform T^{-1}(\cdot).
6. Finally, for each j \in \bar{S}, let x_j = \hat{F}^{-1}_j(u_j), where \hat{F}^{-1}_j is the empirical quantile function of x_j.

Step 2 ensures that the values of the conditioning variables are the same in all samples. Having generated K samples x^1_{\bar{S}}, . . . , x^K_{\bar{S}} from the conditional distribution f(x_{\bar{S}} \mid x_S = x^*_S), we use (7) to compute v_{\mathrm{KerSHAP}}(S).

Whether or not we can compute T^{-1}(\cdot) easily for a given D-vine depends on its implied sampling orders [11]. In particular, the conditioning variables have to appear either first or last in the D-vine structure. For example, a D-vine with order 1−2−3−4 allows simulating \bar{S} = \{1, 2\} given S = \{3, 4\} and \bar{S} = \{3, 4\} given S = \{1, 2\}, but simulating \bar{S} = \{2, 3\} given S = \{1, 4\} is only possible through expensive multivariate numerical integration.

More formally, assume that we have a certain permutation \pi = (\pi_1, . . . , \pi_M) of (1, . . . , M). The corresponding D-vine may then be used to generate samples from conditional distributions f(x_{\bar{S}} \mid x_S = x^*_S) where S either is of the form S = \{\pi_1, . . . , \pi_k\} or S = \{\pi_M, . . . , \pi_{M-k+1}\}, for k = 1, . . . , M.
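For a single pair-copula, steps 2–5 reduce to evaluating and inverting an h-function. A minimal sketch for the bivariate Clayton family follows (the parameter value and conditioning point are arbitrary; real D-vines chain such inversions across the trees):

```python
import random

def clayton_h(u, v, theta):
    # Conditional cdf F(u | v) for a bivariate Clayton copula (the Rosenblatt step).
    return v ** (-theta - 1) * (u ** (-theta) + v ** (-theta) - 1) ** (-1 - 1 / theta)

def clayton_h_inv(w, v, theta):
    # Inverse of clayton_h in u (the inverse Rosenblatt step), available in closed form.
    return ((w * v ** (theta + 1)) ** (-theta / (theta + 1))
            + 1 - v ** (-theta)) ** (-1 / theta)

# Sample u1 | u2 = 0.25: feed independent uniforms through the inverse h-function.
rng = random.Random(42)
u2_star = 0.25
# uniforms drawn in (0, 1] to avoid a power of exactly zero
samples = [clayton_h_inv(1.0 - rng.random(), u2_star, 2.0) for _ in range(1000)]
```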
In Section 4.4, we use this restriction on S to search for a small set of models that allows for simulation conditionally on any viable coalition S.

4.3 Ratio method

For vine copula models, conditional simulation often involves numerical integration or inversion, which significantly slows down the algorithms. [23] proposed an alternative way to approximate conditional expectations based on copulas. The idea is to weight every sample in (7) in a way that accounts for the dependence.

It turns out that the appropriate weights are given by a ratio of copula densities. For simplicity, denote u_j = F(x_j), j = 1, . . . , M. If we have continuous variables, we can compute

v(S) = E[g(x_{\bar{S}}, x_S) \mid x_S = x^*_S] = \int g(x_{\bar{S}}, x^*_S)\, f(x_{\bar{S}} \mid x_S = x^*_S)\, dx_{\bar{S}}
     = \int g(x_{\bar{S}}, x^*_S)\, \frac{f(x_{\bar{S}}, x^*_S)}{f(x^*_S)}\, dx_{\bar{S}}
     = \int g(x_{\bar{S}}, x^*_S)\, \frac{c(u_{\bar{S}}, u^*_S) \prod_{j=1}^{M} f(x_j)}{c(u^*_S) \prod_{j \in S} f(x_j)}\, dx_{\bar{S}}
     = \int g(x_{\bar{S}}, x^*_S)\, \frac{c(u_{\bar{S}}, u^*_S)}{c(u^*_S)}\, \frac{f(x_{\bar{S}})}{c(u_{\bar{S}})}\, dx_{\bar{S}}
     = E_{x_{\bar{S}}}\!\left[ g(x_{\bar{S}}, x^*_S)\, \frac{c(u_{\bar{S}}, u^*_S)}{c(u_{\bar{S}})} \right] \Big/ c(u^*_S)
     = E_{x_{\bar{S}}}\!\left[ g(x_{\bar{S}}, x^*_S)\, \frac{c(u_{\bar{S}}, u^*_S)}{c(u_{\bar{S}})} \right] \Big/ E_{u_{\bar{S}}}\!\left[ \frac{c(u_{\bar{S}}, u^*_S)}{c(u_{\bar{S}})} \right].

The expression in the third line follows from the definition of a copula given in Section 3.2, while the one in the fourth line is obtained using

f(x_{\bar{S}}) = c(u_{\bar{S}}) \prod_{j \in \bar{S}} f(x_j).

To approximate the last line, we can estimate a vine copula model \hat{c} and replace the expectations by a sample average over a (possibly random) subset of the training data:

v_{\mathrm{KerSHAP}}(S) = \frac{\sum_{k=1}^{K} g(x^k_{\bar{S}}, x^*_S)\, \hat{c}(u^k_{\bar{S}}, u^*_S) / \hat{c}(u^k_{\bar{S}})}{\sum_{k=1}^{K} \hat{c}(u^k_{\bar{S}}, u^*_S) / \hat{c}(u^k_{\bar{S}})}. \qquad (10)

Similar expressions for discrete or mixed data, as well as theoretical guarantees for (10), can be found in [23]. Note that the formula in (10) is very similar to the one for the empirical method in [2].
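The estimator (10) can be sketched for the smallest possible case, M = 2 with S = {2}: then c(u_{\bar{S}}) is the density of a single uniform variable, so it equals 1, and the weights reduce to c(u_1^k, u_2^*). The Clayton family and all parameter values below are arbitrary choices for illustration:

```python
import random

def clayton_density(u, v, theta):
    # Bivariate Clayton copula density, playing the role of the fitted c-hat.
    return ((1 + theta) * (u * v) ** (-theta - 1)
            * (u ** (-theta) + v ** (-theta) - 1) ** (-2 - 1 / theta))

def v_ratio(g, u1_train, u2_star, theta):
    """Self-normalized weighted average (10) for M = 2, S = {2}: weights c(u1, u2*)."""
    weights = [clayton_density(u1, u2_star, theta) for u1 in u1_train]
    num = sum(w * g(u1, u2_star) for w, u1 in zip(weights, u1_train))
    return num / sum(weights)

rng = random.Random(3)
# copula-scale training sample, drawn in (0, 1] to keep the density finite
u1_train = [1.0 - rng.random() for _ in range(2000)]
estimate = v_ratio(lambda u1, u2: u1 + u2, u1_train, 0.9, 2.0)
```

By construction, the weights cancel for a constant prediction function, so the estimator reproduces constants exactly.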
While the weights in the empirical method of [2] were computed using a Gaussian kernel, they are here given as ratios of copula densities.

The joint vine copula density \hat{c}(u_{\bar{S}}, u^*_S) is easily computed from (8), irrespective of the D-vine order. However, only some of the marginals \hat{c}(u_{\bar{S}}) are available in closed form for a given D-vine. For example, a D-vine with order 1−2−3−4 has the closed-form marginals c_{1,2}, c_{2,3}, c_{3,4}, c_{1,2,3} and c_{2,3,4}, but not any other marginals. To formalize this, we again identify the D-vine structure with a permutation \pi = (\pi_1, . . . , \pi_M) of (1, . . . , M). From this permutation we may easily compute all marginals \hat{c}(u_{\bar{S}}) where \bar{S} = \{\pi_k, \pi_{k+1}, . . . , \pi_\ell\} for 1 \le k \le \ell \le M.

4.4 Selection of D-vine orders

We can use D-vine copula models in both the conditional simulation method and the ratio method. Depending on our choice of method, we need to either simulate conditionally from an estimated model or compute a ratio of copula densities. How efficiently we can do this numerically depends on the interplay of the coalition S and the order of variables in the D-vine. Generally, there are M!/2 distinct D-vines when we have M variables. Usually, when using vines, one looks for the D-vine maximising dependence in the first trees. The nature of the problem treated in this paper is a bit different from the ones previously discussed in the literature.

Let Z be the set of all conditional distributions f(x_{\bar{S}} \mid x_S = x^*_S) to be used in the conditional simulation method, or all copula marginals \hat{c}(u_{\bar{S}}) to be computed in the ratio method. In the previous two sections, we identified the conditional distributions or copula marginals that may be easily obtained for a given D-vine. In this section we propose a randomized search method that minimizes computational complexity by finding a small set of D-vine models that covers Z. The procedure is as follows:

1. Generate B random permutations of (1, . . . , M).
2. For each permutation, find the number of conditional distributions or copula marginals that may be easily obtained (see Sections 4.2 and 4.3).
3. Pick the permutation that covers most of the remaining sets in Z. Remove the covered sets from Z.
4. Go back to step 1 until no subsets are remaining.

The result is a collection of D-vine structures such that all conditional distributions/copula marginals may be easily computed. We have used B = 100 permutations, which gave fairly stable results in our experiments with 10 features. Empirically, this approach reduces the number of D-vine models to estimate from 2^M to around 2^{M-2} for conditional simulation and to 2^{M-3} for the ratio method. That is, the computational time is reduced by 75%–87.5%.

5 Simulation studies

In this section, we discuss a simulation study designed to compare different ways to estimate Shapley values. Specifically, we compare our suggested approaches with the independence estimation approach of [20] (below called independence) and the empirical, Gaussian and Gaussian copula estimation approaches of [2]. A short description of each approach is given in Table 1. For the approaches presented in this paper, we have fitted both a non-parametric and a parametric vine. The independence, empirical, Gaussian and Gaussian copula approaches are all implemented in the R package shapr [30], and the plan is to also include the approaches proposed in this paper.

The simulation model is detailed in Section 5.1, and the actual design of the experiments is given in Section 5.2. Section 5.3 describes the evaluation measure used to quantify the accuracy of the different methods, while Section 5.4 gives the results.

5.1 Simulation model

To evaluate the different approaches, we need cases for which we know the true feature distribution. Moreover, we have to use multivariate distributions that have known conditional distributions.
There are not many such distributions, but one example, which allows for heavy-tailed and skewed marginals and non-linear dependence, is the multivariate Burr distribution.

The M-dimensional Burr distribution has the density [34]

f_M(x) = \frac{\Gamma(p + M)}{\Gamma(p)} \left( \prod_{m=1}^{M} b_m r_m \right) \frac{\prod_{m=1}^{M} x_m^{b_m - 1}}{\left( 1 + \sum_{m=1}^{M} r_m x_m^{b_m} \right)^{p + M}}, \quad x_m > 0.

Here, p, b_1, . . . , b_M and r_1, . . . , r_M are the parameters of the distribution. The Burr distribution is a compound Weibull distribution with the gamma distribution as compounder [34]. It can be regarded as a special case of the Pareto IV distribution [38].

Any conditional distribution of the multivariate Burr distribution is also a multivariate Burr distribution [34]. The conditional density f(x_1, . . . , x_S \mid x_{S+1} = x^*_{S+1}, . . . , x_M = x^*_M) is an S-dimensional Burr density with parameters \tilde{p}, \tilde{b}_1, . . . , \tilde{b}_S, \tilde{r}_1, . . . , \tilde{r}_S, where \tilde{p} = p + M - S and, for all j = 1, . . . , S,

\tilde{b}_j = b_j, \qquad \tilde{r}_j = \frac{r_j}{1 + \sum_{m=S+1}^{M} r_m (x^*_m)^{b_m}}.

According to [10], the copula corresponding to the multivariate Burr distribution is a Clayton survival copula. Thus, the multivariate Burr distribution may be represented by a vine copula where the pair-copulas are bivariate Clayton survival copulas. For this reason, we have fitted both a non-parametric and a parametric vine for the two approaches presented in this paper. For the parametric vines, the models are correctly specified in this simulation example. Hence, if the non-parametric vines have similar performance to the corresponding parametric vines, it indicates that the non-parametric vines provide a satisfactory fit to the multivariate Burr distribution.

In our experiments, we simulate data from 3 different 10-dimensional Burr distributions. All three distributions have b = (2, . . .) and r = (1, . . .), while they have p equal to 0.5, 1, and 1.5, respectively. The three values of p correspond to a pair-wise Kendall's τ of 0.5, 0.33, and 0.25, respectively.

In addition to the feature distribution, we need to specify the sampling model for the response y and the machine learning approach used to fit the predictive model g(x). Inspired by [6], we chose the following non-linear and heteroscedastic function for y:

y = u_1 u_2 \exp(1.8\, u_1 u_2) + u_3 u_4 \exp(1.8\, u_3 u_4) + u_5 \exp(1.8\, u_5) + 0.5\,(u_6 + u_7 + u_8)\, \epsilon, \qquad (11)

where x is multivariate Burr distributed and u_m = F_m(x_m), with F_m(\cdot) the true parametric distribution function. Finally, \epsilon is standard normal distributed and independent of all the x_m's.

5.2 Design of the experiments

We perform 3 different experiments, with training sample sizes N_train equal to 100, 1,000 and 10,000, respectively. In each experiment we repeat the following steps 50 times for each of the 3 different Burr distributions described in Section 5.1:

1. Sample N_train training observations from the chosen Burr distribution.
2. Compute the corresponding y values using (11).
3. Fit a random forest with 500 trees to the training data, using the R package ranger [37] with default parameter settings.
4. Sample N_test = 100 test observations from the chosen distribution.
5. For all possible subsets S and all test observations x^*:
   - If one of the ratio methods: compute v_KerSHAP(S) using (10) with K = 1,000.
   - If the empirical method: compute v_KerSHAP(S) using the formula given in Section 3.3 in [2] with η = 0.95.
   - If one of the remaining methods in Table 1: generate K = 1,000 samples from the estimated conditional distribution p(x_{\bar{S}} \mid x_S = x^*_S) and compute v_KerSHAP(S) using (7).
6. For all test observations x^*, compute the Shapley values using (1) with the v_KerSHAP(S) values for all subsets S for the current test observation.

For all approaches, the multivariate model for the features is fitted using the training data.

5.3 Evaluation measure

We measure the performance of each method based on the mean absolute error (MAE), across both the features and the sample space.
MAE is defined as

\mathrm{MAE}(\text{method } q) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{M} \sum_{j=1}^{M} \left| \phi_{j,\mathrm{true}}(x_i) - \phi_{j,q}(x_i) \right|, \qquad (12)

where \phi_{j,q}(x) and \phi_{j,\mathrm{true}}(x) denote, respectively, the Shapley value estimated with method q and the corresponding true Shapley value for the prediction g(x). Further, M is the number of features and T is the number of test observations. We use Monte Carlo integration with 10,000 samples to compute the exact Shapley values.

As stated in Section 5.2, for each feature distribution and choice of N_train, we repeat the test procedure 50 times and report the average MAE over those 50 repetitions. Hence, the quality of the Shapley values is evaluated based on a total of T = 5,000 test observations. Sampling new data for each batch reduces the influence of the exact shape of the fitted predictive model.
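Sampling from the multivariate Burr distribution in step 1 of Section 5.2 can be sketched via the compound-Weibull representation mentioned in Section 5.1: draw a gamma mixing variable Z with shape p, then set X_m = (E_m / (Z r_m))^{1/b_m} with E_m standard exponential, which yields the joint survival function (1 + Σ_m r_m x_m^{b_m})^{-p}. The dimension and parameter values below are arbitrary choices for illustration:

```python
import random

def sample_mv_burr(p, b, r, rng):
    """One draw from an M-dimensional Burr distribution with parameters p, b, r (sketch)."""
    z = rng.gammavariate(p, 1.0)  # gamma mixing variable with shape p
    # Given z, each coordinate is Weibull: P(X_m > x | z) = exp(-z * r_m * x^b_m).
    return [(rng.expovariate(1.0) / (z * rm)) ** (1.0 / bm)
            for bm, rm in zip(b, r)]

rng = random.Random(11)
p, b, r = 1.0, [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]
sample = sample_mv_burr(p, b, r, rng)

# Check against the marginal survival function P(X_1 > x) = (1 + r_1 * x^b_1)^(-p),
# which equals 0.5 at x = 1 for these parameters.
draws = [sample_mv_burr(p, b, r, rng)[0] for _ in range(20000)]
frac = sum(d > 1.0 for d in draws) / len(draws)
```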
5.4 Results

The results of the simulation study are shown in Figure 2. The nine panels correspond to different combinations of sample size N_train (columns) and dependence parameter p (rows). Each bar represents the MAE achieved by a particular method, where smaller values indicate higher accuracy.

Analogous to [2], we clearly see that the independence method is not suitable for estimating Shapley values when covariates are dependent. The other methods have more similar performance, but with the vine-based methods favored overall. The parametric vine methods perform slightly better than the non-parametric ones in all scenarios. This is to be expected because, as previously stated, the true simulation model can be represented as a vine copula with survival Clayton pair-copulas. Hence, the parametric models are correctly specified. On real data sets, however, the parametric assumption will rarely hold and can be severely violated.

For very small sample sizes (N_train = 100), even the (correctly specified) parametric vine methods have only a small advantage, and the non-parametric methods are outperformed by some of their competitors when p = 1.5. For larger training sets, the advantage of the vine-based methods over their competitors grows with the sample size.

Figure 2: MAE for each combination of sample size, Burr distribution parameters and method. Each MAE value is computed from 5,000 test observations. The values 0.5, 1, and 1.5 of p correspond to pair-wise Kendall's τ of 0.5, 0.33, and 0.25, respectively.

All methods are computationally demanding; see Figure 3. This is mainly due to the fact that we have to compute a large number of different Shapley contributions v(S) (2^10 = 1,024 for 10 covariates). This can be mitigated substantially by parallelizing computations and/or using the approximate weighted least squares method proposed by [20]. The latter approach, which is thoroughly described in Section 2.3.1 in [2], requires only a subset of the Shapley contributions to be computed.

Figure 3: Average computation time (CPU hours) required for each sample size and method, on 100 test observations.
6 Real data example

In this section, we apply the methods discussed in this paper to the Abalone data set (available at http://archive.ics.uci.edu/ml/datasets/Abalone). It has previously been used in several machine learning studies, see e.g. [29, 32]. Moreover, it has been used in the related vine copula studies [13, 6, 9]. The data originate from a study by the Tasmanian Aquaculture and Fisheries Institute. An abalone is a kind of edible sea snail, the harvest of which is subject to quotas. These quotas are based partly on the age distribution of the abalones. To determine an abalone's age, one cuts the shell through the cone, stains it, and counts the number of rings through a microscope. This is a highly time-consuming task. Hence, one would like to predict the age based on physical measurements that are easier to obtain. The Abalone data set was originally used for this purpose. It consists of 4,177 samples with the following 9 variables: Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight and
Age measured by number of rings.We do not include the variable
Sex in our study since it is a discrete variable.Note that the use of regular vines does not exclude discrete data; examples of15igure 4: Pairwise scatter plots, marginal density functions and pairwise corre-lation coefficients for the explanatory variables and the response variable.discrete and mixed discrete vines may be found for instance in [26] and [33].However, many of the methods become more complicated when discrete dataare involved.Figure 4 shows the pairwise scatter plots, marginal density functions andpairwise Pearson correlation coefficients. There is clear non-linearity and het-eroscedaticity among the pairs of variables. Moreover, it can be noted that allpairwise correlations between the explanatory variables are higher than 0.775.We treat the age prediction as a regression problem. The Abalone dataset was divided into a training set and a test set, containing 4,077 and 100observations, respectively. We fitted a Random forest model with 500 treesto the training data, using the R package ranger with default parameter set-16ings. Then, this model was used to predict the age (number of rings) for theobservations in the test data set.Since the non-parametric ratio method was the fastest, most stable, and bestperforming of the new vine copula methods in the simulation study, we comparethe performance of this method with the independence, empirical, Gaussian andGaussian copula approaches. Figure 5 shows the Shapley values for two of thetest observations. The Shapley values computed by the different methods arequite different. As expected, the independence method is the one that differsmost from the other ones. Take e.g. test observation B. Here, the Shapleyvalues for all variables computed using the non-parametric ratio method havethe same sign and they are of quite similar magnitude. For the independenceapproach, however, the Shapley values show much larger variation, both whenit comes to sign and magnitude. 
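As an aside on the modelling setup: the paper fits the random forest in R with the ranger package. A rough Python sketch of the same pipeline, using scikit-learn's RandomForestRegressor, is given below. The feature matrix, response and split sizes here are synthetic stand-ins of ours, not the actual Abalone preprocessing.

```python
# Rough Python analogue of the modelling setup (the paper used R/ranger).
# Data are synthetic stand-ins for the 7 Abalone measurements; the real
# data set has 4,177 rows, split 4,077 train / 100 test.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 1000, 7                                  # smaller than 4,177 x 7 for speed
X = rng.uniform(0, 1, size=(n, d))              # stand-in physical measurements
y = X.sum(axis=1) + rng.normal(0, 0.1, size=n)  # hypothetical response (rings)

X_train, X_test = X[:-100], X[-100:]            # hold out 100 test observations
y_train = y[:-100]

model = RandomForestRegressor(n_estimators=500, random_state=0)  # 500 trees
model.fit(X_train, y_train)
pred = model.predict(X_test)                    # predictions to be explained
print(pred.shape)                               # (100,)
```

The 100 held-out predictions play the role of the test observations whose Shapley explanations are compared below.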
The Shapley values produced by the Gaussian copula method are the ones that are most similar to those produced by the non-parametric ratio method. This is as expected, since this method is the one that bears most resemblance to the vine methods. However, we observe some differences, e.g. for the variables Shucked weight and Shell weight for test observation A.

Figure 5: Shapley values for two of the test observations in the real data set, computed using the different methods (Empirical, Gaussian, Gaussian copula, Independence, Non-parametric ratio).

A problem with evaluating Shapley values for real data is that there is no ground truth. Hence, we have to justify the results in other ways. In what follows, we mainly use the framework proposed in [2]. For all approaches treated in this paper, the Shapley value is a weighted sum of differences $v(S \cup \{j\}) - v(S)$ over several subsets $S$. However, the approaches differ in how $v(S)$, or more specifically, the conditional distribution $p(x_{\bar S} \mid x_S = x^*_S)$, is estimated. Hence, if we are able to show that the samples from the conditional distributions generated using the non-parametric method are more representative than the samples generated using the previously proposed methods, it is likely that the Shapley values obtained using the non-parametric method are the most accurate.

Since there are many conditional distributions involved in the Shapley formula, we will not show all of them here. However, we have included some examples illustrating that the non-parametric method gives more accurate approximations to the true conditional distributions than the other approaches. First, Figure 6 shows plots of Length against Shell weight and of Viscera weight against Shell weight. The grey dots are the training data. The blue dots are samples from the conditional distribution of the variable on the x-axis given that Shell weight is equal to 0.1, generated using our method. The green and red dots are the corresponding samples generated using the Gaussian copula and independence approaches, respectively.

It should be noted that the non-parametric ratio method does not involve any simulation. However, the method has an implicit statistical model that we can sample from for illustrative purposes. Equation (10) can be seen as an expectation $E_P[g(x_{\bar S}, x^*_S)]$ with respect to a model $P$ that assigns probability
$$\pi_\ell = \frac{\hat c(u^\ell_{\bar S}, u^*_S)/\hat c(u^\ell_{\bar S})}{\sum_{k=1}^K \hat c(u^k_{\bar S}, u^*_S)/\hat c(u^k_{\bar S})}$$
to the $\ell$th observation $x^\ell$. Hence, we can simulate from this implicit model by drawing with replacement from the original observations $x^1, \ldots, x^K$, using $(\pi_1, \ldots, \pi_K)$ as sampling probabilities.

Figure 6: Length against Shell weight (left) and Viscera weight against Shell weight (right). The grey dots are the training data. The blue dots are samples from the conditional distribution of the variable on the x-axis given that Shell weight is equal to 0.1, generated using the non-parametric ratio method; the green dots are the corresponding samples generated using the Gaussian copula method; and the red dots are the samples generated using the independence method. Note that the red and green dots have been slightly displaced vertically to improve the visibility of the figure.

Both the Gaussian copula approach and the independence approach generate samples that are unrealistic, in the sense that they are far outside the range of what is observed in the training data. It may seem surprising that this is the case for the Gaussian copula. The reason is that we are sampling in the lower tail, where there is very strong tail dependence that the Gaussian copula misses. It is well known that evaluating predictive machine learning models far from the domain on which they have been trained can lead to spurious predictions. Thus, it is important that explanation methods evaluate the predictive model at appropriate feature combinations. The samples generated by the ratio method are inside the range of what is observed in the training data.

In Figure 7 we study three different conditional distributions involved in the Shapley formula:
• The conditional distribution of
Shell weight given all the other variables.
• The conditional distribution of Length and Shucked weight given Viscera weight and Shell weight.
• The conditional distribution of all variables except Shucked weight given Shucked weight.

For all three distributions, we generate 1,000 samples for three of the test observations using the non-parametric ratio, Gaussian copula and independence approaches. That is, we condition on four different sets of values. For each combination of test observation, conditional distribution and method, we compute the mean Mahalanobis distance between each sample and its ten nearest training samples, resulting in 1,000 different mean distances. Each panel of Figure 7 shows the probability density of these mean distances for a specific test observation and a specific conditional distribution (test observations A and B are the same as those in Figure 5). If the generated samples are realistic, we would expect the majority of the mean distances to be small.

For all conditional distributions and all test observations, the mode of the density corresponding to the independence approach is larger than those of the two other densities, indicating that the samples generated by the Gaussian copula and non-parametric ratio approaches are more realistic than those generated by the independence approach. Further, for the majority of the test observations/conditional distributions, the Mahalanobis distances corresponding to the non-parametric ratio approach are smaller than those corresponding to the independence and Gaussian copula approaches.

To summarize, we have illustrated that the Shapley values computed using the non-parametric ratio method and the previously proposed methods are different, and we have argued that this is because the non-parametric ratio method gives more accurate approximations to the true conditional distributions for this data set.
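The realism check described above can be sketched in a few lines. The following is a minimal illustration on synthetic two-dimensional data (all names and data here are ours): for each generated sample, we compute the mean Mahalanobis distance to its ten nearest training observations, with the metric given by the covariance matrix of the training data.

```python
import numpy as np

def mean_mahalanobis_to_knn(samples, train, k=10):
    """For each generated sample, the mean Mahalanobis distance to its k
    nearest training observations (nearness measured in the same metric),
    using the covariance matrix estimated from the training data."""
    vi = np.linalg.inv(np.cov(train, rowvar=False))
    diff = samples[:, None, :] - train[None, :, :]      # (n_s, n_t, d)
    d2 = np.einsum("stp,pq,stq->st", diff, vi, diff)    # squared distances
    nearest = np.sqrt(d2)
    nearest = np.sort(nearest, axis=1)[:, :k]           # k smallest per sample
    return nearest.mean(axis=1)                         # one mean distance each

rng = np.random.default_rng(1)
cov = [[1.0, 0.9], [0.9, 1.0]]                          # strongly dependent features
train = rng.multivariate_normal([0, 0], cov, size=500)
# Samples that respect the dependence structure vs. samples that ignore it:
realistic = rng.multivariate_normal([0, 0], cov, size=200)
unrealistic = rng.multivariate_normal([0, 0], [[1.0, -0.9], [-0.9, 1.0]], size=200)

# Samples respecting the dependence structure sit closer to the training data.
print(mean_mahalanobis_to_knn(realistic, train).mean()
      < mean_mahalanobis_to_knn(unrealistic, train).mean())   # True
```

The same logic, applied per test observation, conditional distribution and method, yields the distance densities shown in Figure 7.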
Figure 7: Probability densities of mean Mahalanobis distances for three different conditional distributions and three different individuals. See the text for a further description.

Shapley values are a model-agnostic method for explaining individual predictions, with a solid theoretical foundation. The original development of Shapley values for prediction explanation relied on the assumption that the features being explained were independent. If the features in reality are dependent, this may lead to incorrect explanations. Hence, there have recently been attempts at appropriately modelling/estimating the dependence between the features. Although the proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we have proposed two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions, able to characterise a wide range of complex dependencies.

We have performed a comprehensive simulation study, showing that our approaches outperform the previously proposed methods. We have also applied the different approaches to a real data set, where the predictions to be explained were produced by a random forest model designed to predict the age of an abalone (a sea snail). In this case the true Shapley values are not known, but we provide results indicating that the vine-based approaches give more sensible approximations than the previously proposed methods.

The main part of the methodology proposed in this paper may be used for many other applications than computing Shapley values. The need for expressing statistical inference in terms of conditional quantities is ubiquitous in most natural and social sciences [25]. An obvious example is the estimation of the mean of a set of response variables conditioned on explanatory variables taking specified values [6]. Other common tasks are the forecasting of volatilities or quantiles of financial time series conditioned on past history [8]. Problems of this kind often call for some sort of regression analysis, like the one presented in this paper.

The challenging issue in conditional density estimation is to circumvent the curse of dimensionality. Several methods have been proposed to estimate conditional densities: the classical kernel estimator [28], which has been refined and developed in many directions, see for example [14, 5, 16, 24]; local polynomial estimators [15, 12]; and a local Gaussian correlation estimator [25]. However, most of these methods, if not all, are computationally intractable when either $x_S$ or $x_{\bar S}$ is not univariate, or when both have dimension above 3-4. The vine-based approaches proposed in this paper work well even when both $x_S$ and $x_{\bar S}$ are high-dimensional. Hence, the methodology proposed in this paper may be regarded as a contribution to the field of non-parametric conditional density estimation.
Acknowledgements
This work is supported by the Norwegian Research Council grant 237718.