The Shapley Value of Inconsistency Measures for Functional Dependencies
Ester Livshits
Technion, Haifa, Israel
Benny Kimelfeld
Technion, Haifa, Israel
Abstract
Quantifying the inconsistency of a database is motivated by various goals, including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility for the overall inconsistency, and thereby prioritize tuples in the explanation or inspection of dirt. Therefore, inconsistency quantification and attribution have been a subject of much research in Knowledge Representation and, more recently, in Databases. As in many other fields, a conventional responsibility-sharing mechanism is the Shapley value from cooperative game theory. In this paper, we carry out a systematic investigation of the complexity of the Shapley value in common inconsistency measures for functional-dependency (FD) violations. For several measures we establish a full classification of the FD sets into tractable and intractable classes with respect to Shapley-value computation. We also study the complexity of approximation in intractable cases.
2012 ACM Subject Classification Theory of computation → Incomplete, inconsistent, and uncertain databases; Information systems → Data cleaning
Keywords and phrases
Shapley value, inconsistent databases, functional dependencies
Digital Object Identifier
© Ester Livshits and Benny Kimelfeld; licensed under Creative Commons License CC-BY. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

1 Introduction

Inconsistency measures for knowledge bases have received considerable attention in the Knowledge Representation (KR) and Logic communities [10, 14, 16–18, 21, 22, 38]. More recently, inconsistency measures have also been studied from the database viewpoint [2, 26]. Such measures quantify the extent to which the database violates a set of integrity constraints. There are multiple reasons why one might use such measures. For one, the measure can be used for estimating the usefulness or reliability of new datasets for data-centric applications such as business intelligence [6]. Inconsistency measures have also been proposed as the basis of progress indicators for data-cleaning systems [26]. Finally, the measure can be used for attributing to individual tuples a level of responsibility for the overall inconsistency [31, 37], thereby prioritizing tuples in the explanation/inspection/correction of errors.

A conventional approach to dividing the responsibility for a quantitative property (here an inconsistency measure) among entities (here the database tuples) is the Shapley value [36], which is a game-theoretic formula for wealth distribution in a cooperative game. The Shapley value has been applied in a plethora of domains, including economics [15], law [33], environmental science [24, 34], social network analysis [32], physical network analysis [30], and advertisement [5]. In data management, the Shapley value has been used for determining the relative contribution of features in machine-learning predictions [23, 29], the responsibility of tuples to database queries [4, 25, 35], and the reliability of data sources [6].

The Shapley value has also been studied in a context similar to the one we adopt in this paper: assigning a level of inconsistency to statements in an inconsistent knowledge base [18, 31, 37, 40]. Hunter and Konieczny [16–18] use the maximal Shapley value of one inconsistency measure in order to define a new inconsistency measure. Grant and Hunter [13] considered information systems distributed along data sources of different reliabilities, and apply the Shapley value to determine the expected blame of each statement for the overall inconsistency. Yet, with all the investigation that has been conducted on the Shapley value of inconsistency, we are not aware of any results or efforts regarding the computational complexity of calculating this value.

In this work, we embark on a systematic analysis of the complexity of the Shapley value of database tuples relative to inconsistency measures, where the goal is to calculate the contribution of a tuple to inconsistency. Our main results are summarized in Table 1. We consider inconsistent databases with respect to functional dependencies (FDs), and basic measures of inconsistency following Bertossi [3] and Livshits, Ilyas, Kimelfeld and Roy [26]. These measures are all adopted from the studied measures in the aforementioned KR research. More formally, we investigate the following computational problem for any fixed database signature, set of FDs, and inconsistency measure: given a database and a tuple, compute the Shapley value of the tuple with respect to the inconsistency measure. As Table 1 shows, two of these measures are computable in polynomial time: I_MI (the number of FD violations) and I_P (the number of problematic facts that participate in violations). For two other measures, we establish a full dichotomy in the complexity of the Shapley value: I_d (the drastic measure, which is 0 for consistency and 1 for inconsistency) and I_MC (the number of maximal consistent subsets, a.k.a. repairs). The dichotomy in both cases is the same: when the FD set has, up to equivalence, an lhs chain (i.e., the left-hand sides form a chain w.r.t. inclusion [27]), the Shapley value can be computed in polynomial time; in any other case, it is FP^{#P}-hard. In the case of I_R (the minimal number of tuples to delete for consistency), the problem is solvable in polynomial time in the case of an lhs chain, and NP-hard whenever it is intractable to find a cardinality repair [28]; however, the problem is open for every FD set in between, for example, the bipartite matching constraint {A → B, B → A}.

We also study the complexity of approximating the Shapley value, and obtain an interesting picture (also depicted in Table 1). First, in the case of I_d, there is a (multiplicative) fully polynomial-time randomized approximation scheme (FPRAS) for every set of FDs. In the case of I_MC, approximating the Shapley value of any intractable (non-lhs-chain) FD set is at least as hard as approximating the number of maximal matchings of a bipartite graph, a long-standing open problem [19]. In the case of I_R, we establish a full dichotomy, namely FPRAS vs. hardness of approximation, with the same separation as the problem of finding a cardinality repair.

The rest of this paper is organized as follows. After presenting the basic notation and terminology in Section 2, we formally define the studied problem and give initial observations in Section 3. In Section 4, we describe polynomial-time algorithms for I_MI and I_P. Then, we explore the measures I_d, I_R and I_MC in Sections 5, 6 and 7, respectively. We conclude and discuss future directions in Section 8. Some proofs are given in the Appendix.

2 Preliminaries

Database concepts.
By a relational schema we refer to a sequence (A_1, ..., A_n) of attributes. A database D over (A_1, ..., A_n) is a finite set of tuples, or facts, of the form (c_1, ..., c_n), where each c_i is a constant from a countably infinite domain. For a fact f and an attribute A_i, we denote by f[A_i] the value associated by f with the attribute A_i (that is, f[A_i] = c_i). Similarly, for a sequence X = (A_{j_1}, ..., A_{j_m}) of attributes, we denote by f[X] the tuple (f[A_{j_1}], ..., f[A_{j_m}]). Generally, we use letters from the beginning of the English alphabet (i.e., A, B, C, ...) to denote single attributes and letters from the end of the alphabet (i.e., X, Y, Z, ...) to denote sets of attributes. We may omit stating the relational schema of a database D when it is clear from the context or irrelevant.

Table 1: Complexity of the (exact ; approximate) Shapley value of inconsistency measures.

  measure | lhs chain | no lhs chain, PTime c-repair | other
  I_d     | PTime     | FP^{#P}-complete ; FPRAS     | FP^{#P}-complete ; FPRAS
  I_MI    | PTime     | PTime                        | PTime
  I_P     | PTime     | PTime                        | PTime
  I_R     | PTime     | ? ; FPRAS                    | NP-hard [28] ; no FPRAS
  I_MC    | PTime     | FP^{#P}-complete [27] ; ?    | FP^{#P}-complete [27] ; ?

A Functional Dependency (FD, for short) over (A_1, ..., A_n) is an expression of the form X → Y, where X, Y ⊆ {A_1, ..., A_n}. We may also write the attribute sets X and Y by concatenating the attributes (e.g., AB → C instead of {A, B} → {C}). A database D satisfies X → Y if every two facts f, g ∈ D that agree on the values of the attributes of X also agree on the values of the attributes of Y (that is, if f[X] = g[X] then f[Y] = g[Y]). A database D satisfies a set ∆ of FDs, denoted by D ⊨ ∆, if D satisfies every FD of ∆. Otherwise, D violates ∆ (denoted by D ⊭ ∆). Two FD sets over the same relational schema are equivalent if every database that satisfies one of them also satisfies the other.

Let ∆ be a set of FDs and D a database (which may violate ∆). A repair (of D w.r.t. ∆) is a maximal consistent subset of D; that is, E ⊆ D is a repair if E ⊨ ∆ but E' ⊭ ∆ for every E' with E ⊊ E' ⊆ D. A cardinality repair (or c-repair, for short) is a repair of maximum cardinality; that is, it is a repair E such that |E| ≥ |E'| for every repair E'.

▶ Example 1.
Figure 1 depicts an inconsistent database over a relational schema that stores a train schedule. For example, the fact f_1 states that train number 16 departs from the New York Penn Station at time 1030 and arrives at the Boston Back Bay Station after 315 minutes. The FD set ∆ consists of the two FDs:
◦ train time → departs
◦ train time duration → arrives
The first FD states that the departure station is determined by the train number and the departure time, and the second FD states that the arrival station is determined by the train number, the departure time, and the duration of the ride.

Observe that the database of Figure 1 violates the FDs, as all the facts refer to the same train number and departure time, but there is no agreement on the departure station. Moreover, the facts f_6 and f_7 also agree on the duration, but disagree on the arrival station. The database has five repairs: (a) {f_1, f_2}, (b) {f_3, f_4, f_5}, (c) {f_6, f_8}, (d) {f_7, f_8}, and (e) {f_9}; only the second one is a cardinality repair. ♦

  fact | train | departs | arrives | time | duration
  f_1  | 16    | NYP     | BBY     | 1030 | 315
  f_2  | 16    | NYP     | PVD     | 1030 | 250
  f_3  | 16    | PHL     | WIL     | 1030 | 20
  f_4  | 16    | PHL     | BAL     | 1030 | 70
  f_5  | 16    | PHL     | WAS     | 1030 | 120
  f_6  | 16    | BBY     | PHL     | 1030 | 260
  f_7  | 16    | BBY     | NYP     | 1030 | 260
  f_8  | 16    | BBY     | WAS     | 1030 | 420
  f_9  | 16    | WAS     | PVD     | 1030 | 390

Figure 1: The inconsistent database of our running example.

Shapley value. A cooperative game over a set A of players is a function v : P(A) → R, where P(A) is the power set of A, such that v(∅) = 0. The value v(B) should be thought of as the joint wealth obtained by the players of B when they cooperate. The Shapley value of a player a ∈ A measures the contribution of a to the total wealth v(A) of the game [36], and is formally defined by

  Shapley(A, v, a) := (1 / |A|!) Σ_{σ ∈ Π_A} ( v(σ_a ∪ {a}) − v(σ_a) )

where Π_A is the set of all permutations over the players of A and σ_a is the set of players that appear before a in the permutation σ. Intuitively, the Shapley value of a player a is the expected contribution of a in a random permutation of the players, where the contribution of a is the change to the value of v caused by the addition of a. An alternative formula for the Shapley value, which we will use in this paper, is the following:

  Shapley(A, v, a) := Σ_{B ⊆ A∖{a}} [ |B|! · (|A| − |B| − 1)! / |A|! ] ( v(B ∪ {a}) − v(B) )

Observe that |B|! · (|A| − |B| − 1)! / |A|! is the probability of drawing a permutation in which the players of B appear first, then a, and then the rest of the players.

Approximation schemes.
We discuss both exact and approximate algorithms for computing Shapley values. Recall that a Fully Polynomial-time Randomized Approximation Scheme (FPRAS, for short) for a function f is a randomized algorithm A(x, ε, δ) that returns an ε-approximation of f(x) with probability at least 1 − δ, given an input x for f and ε, δ ∈ (0, 1), in time polynomial in |x|, 1/ε, and log(1/δ). Formally, an FPRAS satisfies:

  Pr[ f(x)/(1 + ε) ≤ A(x, ε, δ) ≤ (1 + ε) f(x) ] ≥ 1 − δ.

Note that this notion of FPRAS refers to a multiplicative approximation, and we adopt this notion implicitly unless stated otherwise. We may also write "multiplicative" explicitly for stress. In cases where the function f has a bounded range, it also makes sense to discuss an additive FPRAS, where

  Pr[ f(x) − ε ≤ A(x, ε, δ) ≤ f(x) + ε ] ≥ 1 − δ.

We refer to an additive FPRAS, and explicitly state so, in cases where the Shapley value is in the range [0, 1].

3 The Shapley Value of Inconsistency Measures

In this paper, we study the Shapley value of facts with respect to measures of database inconsistency. More precisely, the cooperative game that we consider here is determined by an inconsistency measure I, and the facts of the database take the role of the players. In turn, an inconsistency measure I is a function that maps pairs (D, ∆) of a database D and a set ∆ of FDs to a number I(D, ∆) ∈ [0, ∞). Intuitively, the higher the value I(D, ∆) is, the more inconsistent (or, the less consistent) the database D is w.r.t. ∆. The Shapley value of a fact f of a database D is then defined as follows:

  Shapley(D, ∆, f, I) := Σ_{E ⊆ D∖{f}} [ |E|! · (|D| − |E| − 1)! / |D|! ] ( I(E ∪ {f}, ∆) − I(E, ∆) )    (1)

We note that the definition of the Shapley value requires the cooperative game to be zero on the empty set [36], and this is indeed the case for all of the inconsistency measures I that we consider in this work. Next, we introduce each of these measures.
◦ I_d is the drastic measure that takes the value 1 if the database is inconsistent and the value 0 otherwise [38].
◦ I_MI counts the minimal inconsistent subsets [17, 18]; in the case of FDs, these subsets are simply the pairs of tuples that jointly violate an FD.
◦ I_P is the number of problematic facts, where a fact is problematic if it belongs to a minimal inconsistent subset [11]; in the case of FDs, a fact is problematic if and only if it participates in a pair of facts that jointly violate ∆.
◦ I_R is the minimal number of facts that we need to delete from the database for ∆ to be satisfied (similarly to the concept of a cardinality repair and proximity in Property Testing) [3, 8, 12].
◦ I_MC is the number of maximal consistent subsets (i.e., repairs) [11, 14].

Table 1 summarizes the complexity results for the different measures. The first column (lhs chain) refers to FD sets that have a left-hand-side chain, a notion that was introduced by Livshits et al. [27] and that we recall in the next section. The second column (no lhs chain, PTime c-repair) refers to FD sets that do not have a left-hand-side chain, but admit a polynomial-time cardinality-repair computation according to the dichotomy of Livshits et al. [28] that we discuss in more detail in Section 6.

▶ Example 2.
Consider again the database of our running example. Since the database is inconsistent w.r.t. the FD set defined in Example 1, we have that I_d(D, ∆) = 1. As for the measure I_MI, the reader can verify that there are twenty-eight pairs of tuples that jointly violate the FDs; hence, we have that I_MI(D, ∆) = 28. Since each tuple participates in at least one violation of the FDs, it holds that I_P(D, ∆) = 9. Finally, as we have already seen in Example 1, the database has five repairs and a single cardinality repair obtained by deleting five facts. Thus, I_R(D, ∆) = 5 and I_MC(D, ∆) = 5. In the next sections, we discuss the computation of the Shapley value for each one of these measures. ♦

Preliminary analysis.
We study the data complexity of computing Shapley(D, ∆, f, I) for different inconsistency measures I. To this end, we give here two important observations that we will use throughout the paper. The first observation is that the computation of Shapley(D, ∆, f, I) can be easily reduced to the computation of the expected value of the inconsistency measure over all subsets of the database of a given size. In the following proposition, we denote by E_{D' ∼ U_m(D∖{f})}( I(D' ∪ {f}, ∆) ) the expected value of I(D' ∪ {f}, ∆) over all subsets D' of D∖{f} of a given size m, assuming a uniform distribution. Similarly, E_{D' ∼ U_m(D∖{f})}( I(D', ∆) ) is the expected value of I(D', ∆) over all such subsets D'.

▶ Proposition 3. Let I be an inconsistency measure. The following holds.

  Shapley(D, ∆, f, I) = (1 / |D|) Σ_{m=0}^{|D|−1} [ E_{D' ∼ U_m(D∖{f})}( I(D' ∪ {f}, ∆) ) − E_{D' ∼ U_m(D∖{f})}( I(D', ∆) ) ]

We prove the proposition in the Appendix. Hence, the computation of the Shapley value is polynomial-time reducible to the computation of these expectations, and our algorithms will, indeed, compute them instead of the Shapley value.
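Both Equation (1) and the reduction of Proposition 3 can be checked by brute force on tiny inputs. The following Python sketch is our own illustration (the database encoding and helper names are not from the paper): it evaluates the Shapley value of a fact under the drastic measure for a single FD A → B, once via the subset formula and once via the expectations of Proposition 3.

```python
from itertools import combinations
from math import factorial

# Toy database: facts are (A, B) pairs; the only FD is A -> B.
def violates(db):
    """Does db violate A -> B?"""
    seen = {}
    for a, b in db:
        if a in seen and seen[a] != b:
            return True
        seen[a] = b
    return False

def i_drastic(db):
    """The drastic measure I_d: 1 if inconsistent, 0 otherwise."""
    return 1 if violates(db) else 0

def shapley_subset_formula(D, f, I):
    """Equation (1): weighted sum over subsets E of D \\ {f}."""
    rest = [g for g in D if g != f]
    n = len(D)
    total = 0.0
    for m in range(len(rest) + 1):
        for E in combinations(rest, m):
            weight = factorial(m) * factorial(n - m - 1) / factorial(n)
            total += weight * (I(list(E) + [f]) - I(list(E)))
    return total

def shapley_expectations(D, f, I):
    """Proposition 3: average over m of the difference of expectations
    over uniformly chosen size-m subsets of D \\ {f}."""
    rest = [g for g in D if g != f]
    n = len(D)
    total = 0.0
    for m in range(n):  # m ranges over 0, ..., |D| - 1
        subs = list(combinations(rest, m))
        e_with = sum(I(list(E) + [f]) for E in subs) / len(subs)
        e_without = sum(I(list(E)) for E in subs) / len(subs)
        total += (e_with - e_without) / n
    return total

# Three facts; the first two jointly violate A -> B.
D = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
v1 = shapley_subset_formula(D, D[0], i_drastic)
v2 = shapley_expectations(D, D[0], i_drastic)
assert abs(v1 - v2) < 1e-9
```

On this toy instance the two computations coincide, and the values of all facts sum to I(D, ∆), as expected from the properties of the Shapley value.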
The second observation is the following. One of the basic properties of the Shapley value is efficiency: the sum of the Shapley values over all the players equals the total wealth [36]. This property implies that Σ_{f ∈ D} Shapley(D, ∆, f, I) = I(D, ∆). Thus, whenever the measure itself is computationally hard, so is the Shapley value of facts.

▶ Fact 1.
Let I be an inconsistency measure. The computation of I is polynomial-time reducible to the computation of the Shapley value for I.

This observation can be used for showing lower bounds on the complexity of the Shapley value, as we will see in the next sections.

4 I_MI and I_P

We start by discussing two tractable measures. The first measure is I_MI, which counts the minimal inconsistent subsets (i.e., pairs of facts that jointly violate at least one FD). An easy observation is that a fact f increases the value of the measure I_MI by i in a permutation σ if and only if σ_f contains exactly i facts that are in conflict with f. Hence, assuming that D contains N_f facts that conflict with f, the Shapley value for this measure can be computed in the following way (where C(n, k) denotes the binomial coefficient):

  Shapley(D, ∆, f, I_MI) = (1 / |D|!) Σ_{i=1}^{N_f} Σ_{m=i}^{|D|−1} C(N_f, i) · C(|D| − N_f − 1, m − i) · m! · (|D| − m − 1)! · i

Therefore, we immediately obtain the following result.

▶ Theorem 4. Let ∆ be a set of FDs. Computing Shapley(D, ∆, f, I_MI) can be done in polynomial time, given D and f.

The second measure that we consider here is I_P, which counts the "problematic" facts, that is, facts that participate in a violation of ∆. Here, a fact f increases the measure by i in a permutation σ if and only if σ_f contains precisely i − 1 facts that are in conflict with f, but not in conflict with any other fact of σ_f (hence, all these facts and f itself are added to the group of problematic facts). We prove the following.

▶ Theorem 5. Let ∆ be a set of FDs. Computing Shapley(D, ∆, f, I_P) can be done in polynomial time, given D and f.

Proof.
We now show how the expected values of Proposition 3 can be computed in polynomial time. We start with E_{D' ∼ U_m(D∖{f})}( I_P(D', ∆) ). We denote by X_m a random variable holding the number of problematic facts in a subset of size m of D∖{f}, and by X_m^g a random variable that holds 1 if the fact g is selected and participates in a violation of the FDs in such a subset, and 0 otherwise. Due to the linearity of expectation, we have the following.

  E_{D' ∼ U_m(D∖{f})}( I_P(D', ∆) ) = E(X_m) = E( Σ_{g ∈ D∖{f}} X_m^g ) = Σ_{g ∈ D∖{f}} E(X_m^g)

Hence, the computation of E_{D' ∼ U_m(D∖{f})}( I_P(D', ∆) ) reduces to the computation of E(X_m^g), and this value can be computed as follows.

  E(X_m^g) = Pr[g is selected in a subset of size m] × Pr[a conflicting fact is selected in a subset of size m | g is selected in the subset]
           = [ C(|D|−2, m−1) / C(|D|−1, m) ] · [ Σ_{k=1}^{N_g} C(N_g, k) · C(|D|−2−N_g, m−k−1) / C(|D|−2, m−1) ]
           = Σ_{k=1}^{N_g} C(N_g, k) · C(|D|−2−N_g, m−k−1) / C(|D|−1, m)

where N_g is the number of facts in D∖{f} that are in conflict with g.

We can similarly show that E_{D' ∼ U_m(D∖{f})}( I_P(D' ∪ {f}, ∆) ) = Σ_{g ∈ D∖{f}} E(X_{m,f}^g), where X_{m,f}^g is a random variable that holds 1 if g is selected in a subset E of size m of D∖{f} and participates in a violation of the FDs with the rest of the facts of E ∪ {f}, and 0 otherwise. For a fact g that is not in conflict with f we have that E(X_{m,f}^g) = E(X_m^g), while for a fact g that is in conflict with f it holds that:

  E(X_{m,f}^g) = Pr[g is selected in a subset of size m] = C(|D|−2, m−1) / C(|D|−1, m)

and that concludes our proof. ◀

5 I_d

In this section, we consider the drastic measure I_d.
While the measure itself is extremely simple and, in particular, computable in polynomial time (testing whether ∆ is satisfied), it might be intractable to compute the Shapley value of a fact. In particular, we prove a dichotomy for this measure, classifying FD sets into ones where the Shapley value can be computed in polynomial time and the rest, where the problem is FP^{#P}-complete. Before giving our dichotomy, we recall the definition of a left-hand-side chain (lhs chain, for short), introduced by Livshits et al. [27].

▶ Definition 6 ([27]). An FD set ∆ has a left-hand-side chain if for every two FDs X_1 → Y_1 and X_2 → Y_2 in ∆, either X_1 ⊆ X_2 or X_2 ⊆ X_1.

▶ Example 7.
The FD set of our running example (Example 1) has an lhs chain. We could also define ∆ with redundancy by adding the following FD: train time arrives → departs. The resulting FD set does not have an lhs chain, but it is equivalent to an FD set with an lhs chain. An example of an FD set that does not have an lhs chain, not even up to equivalence, is {train time → departs, train departs → time}. ♦

We prove the following.

▶ Theorem 8. Let ∆ be a set of FDs. If ∆ is equivalent to an FD set with an lhs chain, then Shapley(D, ∆, f, I_d) can be computed in polynomial time, given D and f. Otherwise, the problem is FP^{#P}-complete.

Interestingly, this is the exact same dichotomy that we obtained in prior work [27] for the problem of counting subset repairs. We also showed that this tractability criterion is decidable in polynomial time by computing a minimal cover: if ∆ is equivalent to an FD set with an lhs chain, then every minimal cover of ∆ has an lhs chain.
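The tractability test above is straightforward to implement: compute a minimal cover (via attribute closures) and check that the left-hand sides are pairwise comparable under inclusion. The following Python sketch is our own illustration; the FD encoding and helper names are not from the paper.

```python
# FDs are encoded as (lhs, rhs) pairs of attribute sets.

def closure(attrs, fds):
    """Attribute closure of `attrs` under the FD set `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimal_cover(fds):
    """A minimal cover: singleton right-hand sides, no extraneous
    lhs attributes, no redundant FDs."""
    fds = [(frozenset(l), frozenset([a])) for l, r in fds for a in r]
    reduced = []
    for lhs, rhs in fds:
        for a in sorted(lhs):  # drop extraneous lhs attributes
            smaller = lhs - {a}
            if rhs <= closure(smaller, fds):
                lhs = smaller
        reduced.append((lhs, rhs))
    cover = list(reduced)
    for fd in list(cover):  # drop redundant FDs
        rest = [g for g in cover if g != fd]
        if fd[1] <= closure(fd[0], rest):
            cover = rest
    return cover

def has_lhs_chain(fds):
    """Definition 6: all left-hand sides comparable by inclusion."""
    sides = [lhs for lhs, _ in fds]
    return all(x <= y or y <= x for x in sides for y in sides)

# The running example has an lhs chain; the matching constraint
# {A -> B, B -> A} does not, even after taking a minimal cover.
schedule = [({"train", "time"}, {"departs"}),
            ({"train", "time", "duration"}, {"arrives"})]
matching = [({"A"}, {"B"}), ({"B"}, {"A"})]
assert has_lhs_chain(minimal_cover(schedule))
assert not has_lhs_chain(minimal_cover(matching))
```

Adding the redundant FD of Example 7 (train time arrives → departs) does not change the verdict: the minimal-cover step removes the extraneous attribute, so the chain test still succeeds.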
Proof of Theorem 8.

The proof of the hardness side, which is given in the Appendix, has two steps. We first show hardness for the matching constraint {A → B, B → A} over the schema (A, B); this proof is similar to the proof of Livshits et al. [25] for the problem of computing the Shapley contribution of facts to the result of the query q() :- R(x), S(x, y), T(y). Then, from this case to the remaining cases we apply the fact-wise reductions that we established in our prior work [27]. So, in the remainder of this section we will focus on the tractability side. Recall that FP^{#P} is the class of polynomial-time functions with an oracle to a problem in #P.
X:8 The Shapley Value of Inconsistency Measures for Functional Dependencies v [0 , , , , , , v v [0 , , , v [0 , , , , , , , , , , , , , , , , , , v [0 , , , , , ,
0] [0 , , , , , , = BBYarrives v = PVDarrives duration= 120= 315duration duration= 250 = 20duration = WASarrives v v = NYParrives v = PHLarrives v = WASarrives v = BALarrives v = WILarrives [0 , , , , , , duration= 420 v duration= 260 v [0 , , , , , , v [0 , , , , , ,
0] [0 , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , [0 , , , ][0 , , , r train = 16[0 , , , , , , ]time = 1030 v departs = PHL[0 , , , , , , , , , , , , v v departs = NYP [0 , , , v departs = BBY duration= 70 Figure 2
The data structure T of our running example. As stated in Proposition 3, the computation of Shapley(
D, f, ∆ , I d ) reduces to thecomputation of the expected value of the measure over all subsets of the database of a givensize m . In this case, E D ∼ U m ( D \{ f } ) (cid:0) I d ( D ∪ { f } , ∆) (cid:1) and E D ∼ U m ( D \{ f } ) (cid:0) I d ( D , ∆) (cid:1) arethe probabilities that a uniformly chosen D ⊆ D \ { f } of size m is such that ( D ∪ { f } ) = ∆and D = ∆, respectively. Due to the structure of FD sets with an lhs chain, we can computethese probabilities efficiently, as we explain next.Our main observation is that for an FD X → Y , if we group the facts of D by X (i.e.,split D into maximal susbets of facts that agree on the values of all attributes in X ), thenthis FD and the FDs that appear later in the chain may be violated only among facts fromthe same group. Moreover, when we group by XY (i.e., further split each group of X intomaximal subsets of facts that agree on the values of all attributes in Y ), facts from differentgroups always violate this FD, and hence, violate ∆. We refer to the former groups as blocks and the latter groups as subblocks . This special structure allows us to split the problem intosmaller problems, solve each one of them separately, and then combine the solutions viadynamic programming.We define a data structure T where each vertex v is associated with a subset of D that wedenote by D [ v ]. The root r is associated with D itself, that is, D [ r ] = D . At the first level,each child c of r is associated with a block of D [ r ] w.r.t. X → Y , and each child c of c isassociated with a subblock of D [ c ] w.r.t. X → Y . At the second level, each child c of c isassociated with a block of D [ c ] w.r.t. X → Y , and each child c of c is associated with asubblock of D [ c ] w.r.t. X → Y . This continues all the way to the n th FD, where at the i thlevel, each child u of an ( i − v is associated with a block of D [ v ]w.r.t. X i → Y i and each child u of u is associated with a subblock of D [ u ] w.r.t. 
X i → Y i .We assume that the data structure T is constructed in a preprocessing phase. Clearly,the number of vertices in T is polynomial in | D | and n (recall that n is the number of FDsin ∆) as the height of the tree is 2 n , and each level contains at most | D | vertices; hence, thispreprocessing phase requires polynomial time (even under combined complexity). Then, we . Livshits and B. Kimelfeld XX:9 Algorithm 1
DrasticShapley ( D, ∆ , m, T ) for all vertices v of T in a bottom-up order do UpdateProb ( v, m ) return r. val [ m ] Subroutine 1
UpdateProb ( v, m ) for all children c of v in T do for j ∈ { m, . . . , } do v. val [ j ] = P j + j = j ≤ j ≤| D [ c ] | ≤ j ≤| D [ prev ( c )] | (cid:0) c. val [ j ] + (1 − c. val [ j ]) · v. val [ j ] (cid:1) · ( | D [ c ] | j ) · ( | D [ prev ( c )] | j )( | D [ prev ( c )] | + | D [ c ] | j ) if v is a block node then v. val [ j ] += P j + j = j An algorithm for computing E D ∼ U m ( D \{ f } ) (cid:0) I d ( D , ∆) (cid:1) for ∆ with an lhs chain. compute both E D ∼ U m ( D \{ f } ) (cid:0) I d ( D , ∆) (cid:1) and E D ∼ U m ( D \{ f } ) (cid:0) I d ( D ∪ { f } , ∆) (cid:1) by going overthe vertices of T from bottom to top, as we will explain later. Note that for the computationof theses values, we construct T from the database D \ { f } . Figure 2 depicts the datastructure T used for the computation of Shapley( D, f , ∆ , I d ) for the database D anf fact f of our running example. Next, we explain the meaning of the values stored in each vertex.Each vertex v in T stores an array v. val (that is initialized with zeros) such that v. val [ j ] = E D ∼ U j ( D [ v ]) (cid:0) I d ( D , ∆) (cid:1) at the end of the execution. For this measure, we have that: v. val [ j ] def = Pr [a random subset of size j of D [ v ] violates ∆]Our final goal is to compute r. val [ m ], where r is the root of T . For that purpose, in thealgorithm DrasticShapley , depicted in Figure 3, we go over the vertices of T in a bottom-uporder and compute the values of v. val for every vertex v in the UpdateProb subroutine.Observe that we only need one execution of DrasticShapley with m = | D | − m ∈ { , . . . , | D | − } , as we calculate all these values in ourintermediate computations.To compute v. val for a subblock vertex v , we iterate over its children in T (which are the( i + 1)th level blocks) according to an arbitrary order defined in the construction of T . 
For achild c of v , we denote by prev ( c ) the set of children of v that occur before c in that order,and by D [ prev ( c )] the database S c ∈ prev ( c ) D [ c ]. When considering c in the for loop of line 1,we compute the expected value of the measure on a subset of D [ prev ( c )] ∪ D [ c ]. Hence, whenwe consider the last child of v in the for loop of line 1, we compute the expected value of themeasure on a subset of the entire database D [ v ].For a child c of v , there are N = (cid:0) | D [ prev ( c )] | + | D [ c ] | j (cid:1) subsets of size j of all the children of v considered so far (including c itself). Each such subset consists of j facts of the current c (there are N = (cid:0) | D [ c ] | j (cid:1) possibilities) and j facts of the previously considered children (thereare N = (cid:0) | D [ prev ( c )] | j (cid:1) possibilities), for some j , j such that j + j = j , with probability N N /N . Moreover, such a subset violates ∆ if either the facts of the current c violate ∆(with probability c. val [ j ] that was computed in a previous iteration) or these facts satisfy ∆,but the facts of the previous children violate ∆ (with probability (1 − c. val [ j ]) · v. val [ j ]).Observe that since we go over the values j in reverse order in the for loop of line 2 (i.e., from X:10 The Shapley Value of Inconsistency Measures for Functional Dependencies m to 1), at each iteration of this loop, we have that v. val [ j ] (for all considered j ≤ j ) stillholds the expected value of I d over subsets of size j of the previous children of v , which isindeed the value that we need for our computation.This computation of v. val also applies to the block vertices. However, the addition ofline 5 only applies to blocks. 
Since the children of a block belong to different subblocks,and two facts from the same i th level block but different i th level subbblocks always jointlyviolate X i → Y i , a subset of size j of a block also violates the constraints if we select anon-empty subset of the current child c and a non-empty subset of the previous children,even if each of these subsets by itself is consistent w.r.t. ∆. Hence, we add this probability inline 5. Note that all the three cases that we consider are disjoint, so we sum the probabilities.Observe also that the leaves of T have no children and we do not update their probabilities,and, indeed the probability to select a subset from a leaf v that violates the constraints iszero, as all the facts of D [ v ] agree on the values of all the attributes that occur in ∆.To compute E D ∼ U m ( D \{ f } ) (cid:0) I d ( D ∪{ f } , ∆) (cid:1) , we use the algorithm DrasticShapleyF , givenin the Appendix. There, we distinguish between several types of vertices w.r.t. f , and showhow this expectation can be computed for each one of these types. (cid:73) Example 9. We now illustrate the computation of E D ∼ U m ( D \{ f } ) (cid:0) I d ( D , ∆) (cid:1) on thedatabase D and the fact f of our running example for m = 3. Inside each node of the datastructure T of Figure 2, we show the values [ v. val [0] , v. val [1] , v. val [2] , v. val [3]] used for thiscomputation. Below them, we present the corresponding values used in the computation of E D ∼ U m ( D \{ f } ) (cid:0) I d ( D ∪ { f } , ∆) (cid:1) . For the leaves v and each vertex v ∈ { v , . . . , v , v } , wehave that v. val [ j ] = 0 for every j ∈ { , , , } , as D [ v ] has a single fact. As for v , when weconsider its first child v in the for loop of line 1 of UpdateProb , all the values in v . val remain zero (since v . val [ j ] = v . val [ j ] = 0 for any j , j , and | D [ prev ( c )] | = 0). 
However, when we consider its second child, while the computation of line 3 again has no impact on the values, after the computation of line 5 we have that val[2] = 1 for this vertex. And, indeed, there is a single subset of size two of its two facts, and this subset violates the FD train time duration → arrives. This also affects the values of the vertex above it. In particular, when we consider the first child of that vertex, we get val[j] = 1 for j = 2 and val[j] = 0 for every other j. Then, when we consider its second child, it holds that val[2] = 1/3 (as the only subset of size two of the corresponding database that violates the FDs is the pair of conflicting facts, and there are three such subsets in total) and val[3] = 1 (as every subset of size three contains both conflicting facts). Finally, we obtain the values of E_{D′∼U_m(D\{f})}[I_d(D′, Δ)] at the root, and similarly E_{D′∼U_m(D\{f})}[I_d(D′ ∪ {f}, Δ)] = 1; we then conclude the value of Shapley(D, f, Δ, I_d) according to the formula of Proposition 3. ♦

Approximation. We now consider the approximate computation of the Shapley value. Using the Chernoff-Hoeffding bound, we can easily obtain an additive FPRAS for the value Shapley(D, f, Δ, I_d) by sampling O(log(1/δ)/ε²) permutations and computing the average contribution of f over the sampled permutations. As observed by Livshits et al. [25], an additive FPRAS is not necessarily a multiplicative FPRAS, unless the "gap" property holds: nonzero Shapley values are guaranteed to be large enough compared to the utility value (which is at most 1 in the case of the drastic measure); in that case, the two approximation guarantees coincide. This is also the case here, as we now prove the following gap property of Shapley(D, f, Δ, I_d).

▶ Proposition 10. There is a polynomial p such that for all databases D and facts f of D, the value Shapley(D, f, Δ, I_d) is either zero or at least 1/p(|D|).

Proof. If no fact of D is in conflict with f, then Shapley(D, f, Δ, I_d) = 0.
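The sampling-based additive approximation described above can be sketched as follows. The fact representation and the `violates` predicate are illustrative assumptions, and the sample count follows the O(log(1/δ)/ε²) bound only up to the choice of constants:

```python
import random

def drastic(db, violates):
    """I_d: 1 if some pair of facts in db jointly violates the FDs, else 0."""
    return int(any(violates(g, h) for i, g in enumerate(db) for h in db[:i]))

def approx_shapley_drastic(db, f, violates, samples=10000, seed=0):
    """Additive approximation of Shapley(D, f, Δ, I_d): sample a random
    permutation (via a uniform position for f and a uniform prefix) and
    average the marginal contribution of f.  Assumes facts are distinct."""
    rng = random.Random(seed)
    others = [g for g in db if g != f]
    total = 0
    for _ in range(samples):
        k = rng.randrange(len(db))       # position of f in the permutation
        prefix = rng.sample(others, k)   # the facts preceding f
        total += drastic(prefix + [f], violates) - drastic(prefix, violates)
    return total / samples
```

For instance, for Δ = {A → B} and the two conflicting facts (a, 1) and (a, 2), the estimate converges to 1/2, matching the symmetric split of I_d = 1 between the two facts.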
Otherwise, let g be a fact that violates an FD of Δ jointly with f. Clearly, it holds that {g} ⊨ Δ, while {g, f} ⊭ Δ. The probability of choosing a permutation σ such that σ_f is exactly {g} is (|D| − 2)!/|D|! = 1/(|D| · (|D| − 1)) (recall that σ_f is the set of facts that appear before f in σ). Therefore, we have that Shapley(D, f, Δ, I_d) ≥ 1/(|D| · (|D| − 1)), and that concludes our proof. ◀

We conclude that the following holds.

▶ Corollary 11. Shapley(D, f, Δ, I_d) has both an additive and a multiplicative FPRAS.

The Measure I_R

In this section, we study the measure I_R, based on the cost of a cardinality repair of D. Here, we refer to the number of facts that are removed from D to obtain a repair E as the cost of E. This is the first measure that we consider that is not always computable in polynomial time. Livshits et al. [28] presented a dichotomy for the problem of computing a cardinality repair, classifying FD sets into those for which the problem is solvable in polynomial time, and those for which it is NP-hard. They presented a polynomial-time algorithm, which we refer to as Simplify(), that, given an FD set Δ, attempts to obtain an empty FD set by simplifying Δ (i.e., repeatedly removing from the FDs of Δ attributes that satisfy certain properties). For completeness, the Simplify() algorithm is given in the Appendix. They showed that if it is possible to empty the FD set by repeatedly applying Simplify(), then the problem is solvable in polynomial time, and otherwise it is NP-hard. Fact 1 implies that computing Shapley(D, f, Δ, I_R) is hard whenever computing I_R(D, Δ) is hard. Hence, we immediately obtain the following.

▶ Theorem 12. Let Δ be a set of FDs. If Δ cannot be emptied by repeatedly applying Simplify(), then computing Shapley(D, f, Δ, I_R) is NP-hard.

In the remainder of this section, we focus on the tractable cases of the dichotomy of Livshits et al. [28].
In particular, we start by proving that the Shapley value can again be computed in polynomial time for an FD set that has an lhs chain.

▶ Theorem 13. Let Δ be a set of FDs. If Δ is equivalent to an FD set with an lhs chain, then computing Shapley(D, f, Δ, I_R) can be done in polynomial time, given D and f.

Our polynomial-time algorithm RShapley, depicted in Figure 4, is very similar in structure to DrasticShapley. However, to compute the expected value of I_R, we take the reduction of Proposition 3 a step further, and show, in the Appendix, that the problem of computing the expected value of the measure over subsets of size m can be reduced to the problem of computing the number of subsets of size m of D that have a cardinality repair of cost k, given m and k. In the subroutine UpdateCount, we compute this number. For each vertex v in T, we define:

v.val[j, t] := the number of subsets of size j of D[v] with a cardinality repair of cost t

For the leaves v of T, we set v.val[j, 0] = C(|D[v]|, j) for 0 ≤ j ≤ |D[v]|, as every subset of D[v] is consistent, and the cost of its cardinality repair is zero. We also set v.val[0, 0] = 1 for each v in T for the same reason. Since the cost of a cardinality repair is bounded by the size of the database, in UpdateCount(v, m) we compute the value v.val[j, t] for every 1 ≤ j ≤ m and 0 ≤ t ≤ j. To compute this number, we again go over the children of v, one by one. When we consider a child c in the for loop over the children, the value v.val[j, t] becomes the number of subsets of size j of D[prev(c)] ∪ D[c] that have a cardinality repair of cost t.

Algorithm 2 RShapley(D, Δ, m, T)
1: for all vertices v of T in a bottom-up order do
2:   UpdateCount(v, m)
3: return Σ_{k=0}^{m} k · r.val[m, k] / C(|D| − 1, m)

Subroutine 2 UpdateCount(v, m)
1: v.val[0, 0] = 1
2: if v is a leaf then v.val[j, 0] = C(|D[v]|, j) for all j ∈ {1, …, |D[v]|}
3: for all children c of v in T do
4:   for j ∈ {m, …, 1} do
5:     for t ∈ {j, …, 0} do
6:       if v is a block vertex then
7:         v.val[j, t] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, t−j1≤j2≤min{t,|D[prev(c)]|}} Σ_{t−j1≤w2≤j2} c.val[j1, t − j2] · v.val[j2, w2]
8:         v.val[j, t] += Σ_{j1+j2=j, t−j2≤j1≤min{t,|D[c]|}, 0≤j2≤|D[prev(c)]|} Σ_{t−j2<w1≤j1} c.val[j1, w1] · v.val[j2, t − j1]
9:       else
10:        v.val[j, t] = Σ_{j1+j2=j} Σ_{t1+t2=t} c.val[j1, t1] · v.val[j2, t2]

Figure 4 An algorithm for computing E_{D′∼U_m(D\{f})}[I_R(D′, Δ)] for Δ with an lhs chain.

The children of a block v are subblocks that jointly violate an FD of Δ; hence, when we consider a child c of v, a cardinality repair of a subset E of D[prev(c)] ∪ D[c] is either a cardinality repair of E ∩ D[c] (in which case we remove every fact of E ∩ D[prev(c)]) or a cardinality repair of E ∩ D[prev(c)] (in which case we remove every fact of E ∩ D[c]). The decision as to which of these cases holds is based on the following four parameters: (1) the number j1 of facts in E ∩ D[c], (2) the number j2 of facts in E ∩ D[prev(c)], (3) the cost w1 of a cardinality repair of E ∩ D[c], and (4) the cost w2 of a cardinality repair of E ∩ D[prev(c)]. In particular:
- If w1 + j2 ≤ w2 + j1, then a cardinality repair of E ∩ D[c] is preferred over a cardinality repair of E ∩ D[prev(c)], as it requires removing fewer facts from the database.
- If w1 + j2 > w2 + j1, then a cardinality repair of E ∩ D[prev(c)] is preferred over a cardinality repair of E ∩ D[c].
In fact, since we fix t in the computation of v.val[j, t], we do not need to go over all values of w1 and w2. In the first case, we have that w1 = t − j2 (hence, the total number of removed facts is (t − j2) + j2 = t), and in the second case we have that w2 = t − j1 for the same reason. Hence, in line 7 we consider the first case, where t ≤ w2 + j1, and in line 8 we consider the second case, where w1 + j2 > t.
To avoid negative costs, we add a lower bound of t − j1 on j2 and w2 in line 7, and, similarly, a lower bound of t − j2 on j1 and w1 in line 8.

For a subblock vertex v, a cardinality repair of D[v] is the union of cardinality repairs of the children of v, as facts corresponding to different children of v do not jointly violate any FD. Therefore, for such vertices, in line 10, we compute v.val by going over all j1, j2 such that j1 + j2 = j and all t1, t2 such that t1 + t2 = t, and multiplying the number of subsets of size j1 of the current child for which the cost of a cardinality repair is t1 by the number of subsets of size j2 of the previously considered children for which the cost of a cardinality repair is t2.

In the Appendix, we present a polynomial-time algorithm for computing the expectation E_{D′∼U_m(D\{f})}[I_R(D′ ∪ {f}, Δ)].

Approximation. In cases where a cardinality repair can be computed in polynomial time, we can obtain an additive FPRAS in the same way as for the drastic measure. (Note that this Shapley value is also in [0, 1].)

▶ Proposition 14. There is a polynomial p such that for all databases D and facts f of D, the value Shapley(D, f, Δ, I_R) is either zero or at least 1/p(|D|).

Livshits et al. [28] showed that the hard cases of their dichotomy for the problem of computing a cardinality repair are, in fact, APX-complete; hence, there is a polynomial-time constant-ratio approximation, but for some ε > 0 there is no (1 + ε)-approximation unless P = NP (or NP ⊆ BPP in the randomized setting). Since the Shapley value of every fact w.r.t. I_R is positive, the existence of a multiplicative FPRAS for Shapley(D, f, Δ, I_R) would imply the existence of a multiplicative FPRAS for I_R(D, Δ) (due to Fact 1), which is a contradiction to the APX-hardness. We conclude the following.

▶ Proposition 15. Let Δ be a set of FDs.
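The combination step at a block vertex can be illustrated by the following sketch. Instead of the fixed-t case split of lines 7 and 8, it loops over all four parameters (j1, w1, j2, w2) directly, which is equivalent but simpler; `combine_block` and its table layout are our own illustration, not the paper's pseudocode:

```python
def combine_block(cnt_c, cnt_prev):
    """Combine two cost tables at a block vertex.  cnt[j][t] is the number
    of size-j subsets whose cardinality repair has cost t.  Since facts of
    different subblocks always conflict, a repair keeps facts of one side
    only; a subset with j1 + j2 facts and per-side repair costs w1, w2
    therefore has overall repair cost min(w1 + j2, w2 + j1)."""
    m1, m2 = len(cnt_c) - 1, len(cnt_prev) - 1
    out = [[0] * (j + 1) for j in range(m1 + m2 + 1)]
    for j1, row1 in enumerate(cnt_c):
        for w1, n1 in enumerate(row1):
            for j2, row2 in enumerate(cnt_prev):
                for w2, n2 in enumerate(row2):
                    if n1 and n2:
                        out[j1 + j2][min(w1 + j2, w2 + j1)] += n1 * n2
    return out
```

For a subblock vertex (line 10), the same loop applies with the cost t1 + t2 in place of min(w1 + j2, w2 + j1), since the sides are repaired independently.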
If Δ can be emptied by repeatedly applying Simplify(), then Shapley(D, f, Δ, I_R) has both an additive and a multiplicative FPRAS. Otherwise, it has neither a multiplicative nor an additive FPRAS, unless NP ⊆ BPP.

Unsolved cases for I_R. Unlike the other inconsistency measures considered in this paper, we do not have a full dichotomy for the measure I_R. In particular, a basic open problem is the computation of Shapley(D, f, Δ, I_R) for Δ = {A → B, B → A}. On the one hand, Proposition 15 shows that this case belongs to the tractable side if an approximation is allowed. On the other hand, our algorithm for the exact computation of Shapley(D, f, Δ, I_R) works via counting the subsets of size m that have a cardinality repair of cost k. This approach will not work here:

▶ Proposition 16. Let Δ = {A → B, B → A} be an FD set over (A, B). Counting the subsets of size m of a given database that have a cardinality repair of cost k is NP-hard.

The proof, given in the Appendix, is by a reduction from the problem of counting the perfect matchings in a bipartite graph, which is known to be #P-complete [39].

The Measure I_MC

The final measure that we consider counts the repairs of the database. A dichotomy result from our previous work [27] states that the problem of counting repairs can be solved in polynomial time for FD sets with an lhs chain (up to equivalence), and is #P-complete otherwise; consequently, computing Shapley(D, f, Δ, I_MC) is FP^{#P}-hard whenever the FD set is not equivalent to an FD set with an lhs chain. Hence, an lhs chain is a necessary condition for tractability. We show here that it is also sufficient: if the FD set has an lhs chain, then the problem can be solved in polynomial time. Consequently, we obtain the following dichotomy.

▶ Theorem 17. Let Δ be a set of FDs. If Δ is equivalent to an FD set with an lhs chain, then computing Shapley(D, f, Δ, I_MC) can be done in polynomial time, given D and f. Otherwise, the problem is FP^{#P}-complete.
Algorithm 3 MCShapley(D, Δ, m, T)
1: for all vertices v of T in a bottom-up order do
2:   UpdateExpected(v, m)
3: return r.val[m]

Subroutine 3 UpdateExpected(v, m)
1: v.val[0] = 1
2: if v is a leaf then v.val[j] = 1 for all j ∈ {1, …, |D[v]|}
3: for all children c of v in T do
4:   for j ∈ {m, …, 1} do
5:     if v is a block vertex then
6:       v.val[j] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (c.val[j1] + v.val[j2]) · C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[c]| + |D[prev(c)]|, j)
7:     else
8:       v.val[j] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (c.val[j1] · v.val[j2]) · C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[c]| + |D[prev(c)]|, j)

Figure 5 An algorithm for computing E_{D′∼U_m(D\{f})}[I_MC(D′, Δ)] for Δ with an lhs chain.

The algorithm MCShapley, depicted in Figure 5, for computing Shapley(D, f, Δ, I_MC), has the same structure as DrasticShapley, with the only difference being the computations in the subroutine UpdateExpected (which replaces UpdateProb). For a vertex v in T, we define:

v.val[j] = E[number of repairs of a random subset of size j of D[v]]

As the number of repairs of a consistent database D is one (D itself is the repair), we set v.val[0] = 1 for every vertex v, and v.val[j] = 1 for 0 ≤ j ≤ |D[v]| for every leaf v. Now, consider a block vertex v and a child c of v. Since the children of v are subblocks, each repair consists of facts of a single child. Hence, the total number of repairs is the sum of the numbers of repairs of the children of v. Moreover, since our choice of facts from different subblocks is independent, we have the following (where MC(D, Δ) is the set of repairs of D w.r.t. Δ).
E_{D′∼U_j(D[prev(c)] ∪ D[c])}[I_MC(D′, Δ)]
  = Σ_{D′ ⊆ D[prev(c)] ∪ D[c], |D′|=j} Pr[D′] · |MC(D′, Δ)|
  = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[c]| + |D[prev(c)]|, j)) · Σ_{E1 ⊆ D[c], |E1|=j1} Σ_{E2 ⊆ D[prev(c)], |E2|=j2} Pr[E1] · Pr[E2] · (|MC(E1, Δ)| + |MC(E2, Δ)|)

where Pr[E1] and Pr[E2] are taken under the uniform distributions U_{j1}(D[c]) and U_{j2}(D[prev(c)]), respectively. Using standard mathematical manipulations (provided in the Appendix), we obtain the following result:

E_{D′∼U_j(D[prev(c)] ∪ D[c])}[I_MC(D′, Δ)] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[c]| + |D[prev(c)]|, j)) · [ E_{D′∼U_{j1}(D[c])}[I_MC(D′, Δ)] + E_{D′∼U_{j2}(D[prev(c)])}[I_MC(D′, Δ)] ]

This calculation is reflected in line 6 of the UpdateExpected subroutine. We can similarly show that for a subblock vertex v, it holds that:

E_{D′∼U_j(D[prev(c)] ∪ D[c])}[I_MC(D′, Δ)] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[c]| + |D[prev(c)]|, j)) · [ E_{D′∼U_{j1}(D[c])}[I_MC(D′, Δ)] × E_{D′∼U_{j2}(D[prev(c)])}[I_MC(D′, Δ)] ]

We use this result for the calculation of line 8. The difference between the two calculations lies in the observation that the children of a subblock are blocks that never jointly violate Δ; hence, the number of repairs is obtained by multiplying the numbers of repairs of the children of v. An algorithm for computing E_{D′∼U_m(D\{f})}[I_MC(D′ ∪ {f}, Δ)] is given in the Appendix.

Approximation. Repair counting for Δ = {A → B, B → A} is the problem of counting the maximal matchings of a bipartite graph. As the values Shapley(D, f, Δ, I_MC) are nonnegative and sum up to the number of repairs, we conclude that an FPRAS for the Shapley value would imply an FPRAS for the number of maximal matchings. To the best of our knowledge, the existence of the latter is a long-standing open problem [19].
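The two combination rules can be sketched together as follows. This is a hypothetical helper of our own; note that it treats the cases where one side is empty separately, so that a consistent empty part contributes a neutral factor rather than an extra repair:

```python
from math import comb

def combine_expected(vals_c, n_c, vals_prev, n_prev, is_block):
    """One UpdateExpected-style combination step.  vals[j] is the expected
    number of repairs of a uniform size-j subset of the corresponding part.
    At a block vertex the repairs of the two mutually conflicting non-empty
    parts add up; at a subblock vertex the parts are independent and the
    repair counts multiply."""
    out = [1.0] + [0.0] * (n_c + n_prev)   # the empty subset has one repair
    for j in range(1, n_c + n_prev + 1):
        total = comb(n_c + n_prev, j)
        s = 0.0
        for j1 in range(max(0, j - n_prev), min(j, n_c) + 1):
            j2 = j - j1
            weight = comb(n_c, j1) * comb(n_prev, j2) / total
            if j1 == 0:
                contrib = vals_prev[j2]
            elif j2 == 0:
                contrib = vals_c[j1]
            elif is_block:
                contrib = vals_c[j1] + vals_prev[j2]
            else:
                contrib = vals_c[j1] * vals_prev[j2]
            s += weight * contrib
        out[j] = s
    return out
```

For two conflicting single facts under a block vertex, the expected repair counts are [1, 1, 2]: one repair for the empty subset and for either singleton, and two repairs for the conflicting pair.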
This is also the case for any Δ that is not equivalent to an FD set with an lhs chain, since there is a fact-wise reduction from {A → B, B → A} to every such Δ [27]. (More details are given in the Appendix.)

Conclusions

We studied the complexity of calculating the Shapley value of database facts for basic inconsistency measures, focusing on FD constraints. We showed that two of them are computable in polynomial time: the number of violations (I_MI) and the number of problematic facts (I_P). In contrast, each of the drastic measure (I_d) and the number of repairs (I_MC) features a dichotomy in complexity, where the tractability condition is the possession of an lhs chain (up to equivalence). For the cost of a cardinality repair (I_R) we showed a tractable fragment and an intractable fragment, but a gap remains on certain FD sets: the ones that do not have an lhs chain, and yet admit a polynomial-time computation of a cardinality repair. We also studied the approximability of the Shapley value and showed, among other things, an FPRAS for I_d and a dichotomy in the existence of an FPRAS for I_R.

Many directions are left open for future research. First, there is the challenge of completing the picture of I_R towards a full dichotomy. In particular, the problem is open for the bipartite-matching constraint {A → B, B → A}, which, unlike the known FD sets in the intractable fragment, has an FPRAS. Moreover, our results neither imply nor refute the existence of a constant-ratio approximation (for some constant) for the Shapley value w.r.t. I_R. Second, the problems immediately extend to types of constraints other than functional dependencies, such as denial constraints, tuple-generating dependencies, and so on. Third, it would be interesting to see how the results extend to wealth-distribution functions other than Shapley, e.g., the Banzhaf power index [7].
The tractable cases remain tractable for the Banzhaf power index, but it is not clear how (and whether) our proofs for the lower bounds generalize to this function. Finally, there is the practical question of implementation: while our algorithms terminate in polynomial time, we believe that they are hardly scalable without further optimization and heuristics ad hoc to the use case; developing those is an important challenge for future research.

References

[1] Roland Bacher. Determinants of matrices related to the Pascal triangle. Journal de Théorie des Nombres de Bordeaux, 14, 2002.
[2] Leopoldo E. Bertossi. Measuring and computing database inconsistency via repairs. In SUM, volume 11142 of Lecture Notes in Computer Science, pages 368–372. Springer, 2018.
[3] Leopoldo E. Bertossi. Repair-based degrees of database inconsistency. In LPNMR, volume 11481 of Lecture Notes in Computer Science, pages 195–209. Springer, 2019.
[4] Leopoldo E. Bertossi and Floris Geerts. Data quality and explainable AI. J. Data and Information Quality, 12(2):11:1–11:9, 2020.
[5] Omar Besbes, Antoine Désir, Vineet Goyal, Garud Iyengar, and Raghav Singal. Shapley meets uniform: An axiomatic framework for attribution in online advertising. In WWW, pages 1713–1723. ACM, 2019.
[6] Laurence Cholvy, Laurent Perrussel, William Raynaut, and Jean-Marc Thévenin. Towards consistency-based reliability assessment. In AAMAS, pages 1643–1644. ACM, 2015.
[7] Pradeep Dubey and Lloyd S. Shapley. Mathematical properties of the Banzhaf power index. Mathematics of Operations Research, 4(2):99–131, 1979.
[8] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. J. ACM, 45(4):653–750, 1998.
[9] Teofilo F. Gonzalez, editor. Handbook of Approximation Algorithms and Metaheuristics. Chapman and Hall/CRC, 2007.
[10] John Grant and Anthony Hunter. Measuring inconsistency in knowledge bases. J. Intell. Inf. Syst., 27(2):159–184, 2006.
[11] John Grant and Anthony Hunter. Measuring consistency gain and information loss in stepwise inconsistency resolution. In ECSQARU, volume 6717, pages 362–373. Springer, 2011.
[12] John Grant and Anthony Hunter. Distance-based measures of inconsistency. In ECSQARU, volume 7958 of Lecture Notes in Computer Science, pages 230–241. Springer, 2013.
[13] John Grant and Anthony Hunter. Using Shapley inconsistency values for distributed information systems with uncertainty. In ECSQARU, volume 9161 of Lecture Notes in Computer Science, pages 235–245. Springer, 2015.
[14] John Grant and Anthony Hunter. Analysing inconsistent information using distance-based measures. Int. J. Approx. Reasoning, 89:3–26, 2017.
[15] Faruk Gul. Bargaining foundations of Shapley value. Econometrica, pages 81–95, 1989.
[16] Anthony Hunter and Sébastien Konieczny. Shapley inconsistency values. In KR, pages 249–259. AAAI Press, 2006.
[17] Anthony Hunter and Sébastien Konieczny. Measuring inconsistency through minimal inconsistent sets. In KR, pages 358–366. AAAI Press, 2008.
[18] Anthony Hunter and Sébastien Konieczny. On the measure of conflicts: Shapley inconsistency values. Artif. Intell., 174(14):1007–1026, 2010.
[19] Yifan Jing and Akbar Rafiey. Counting maximal near perfect matchings in quasirandom and dense graphs. CoRR, abs/1807.04803, 2018.
[20] Benny Kimelfeld. A dichotomy in the complexity of deletion propagation with functional dependencies. In PODS, pages 191–202, 2012.
[21] Kevin M. Knight. Two information measures for inconsistent sets. Journal of Logic, Language and Information, 12(2):227–248, 2003.
[22] Sébastien Konieczny, Jérôme Lang, and Pierre Marquis.
Quantifying information and contradiction in propositional logic through test actions. In IJCAI, pages 106–111. Morgan Kaufmann, 2003.
[23] Christophe Labreuche and Simon Fossier. Explaining multi-criteria decision aiding models with an extended Shapley value. In IJCAI, pages 331–339. ijcai.org, 2018.
[24] Zhenliang Liao, Xiaolong Zhu, and Jiaorong Shi. Case study on initial allocation of Shanghai carbon emission trading based on Shapley value. Journal of Cleaner Production, 103:338–344, 2015.
[25] Ester Livshits, Leopoldo E. Bertossi, Benny Kimelfeld, and Moshe Sebag. The Shapley value of tuples in query answering. In ICDT, volume 155 of LIPIcs, pages 20:1–20:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
[26] Ester Livshits, Ihab F. Ilyas, Benny Kimelfeld, and Sudeepa Roy. Principles of progress indicators for database repairing. CoRR, abs/1904.06492, 2019.
[27] Ester Livshits and Benny Kimelfeld. Counting and enumerating (preferred) database repairs. In PODS, pages 289–301. ACM, 2017.
[28] Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst., 45(1):4:1–4:46, 2020.
[29] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NIPS, pages 4765–4774, 2017.
[30] Richard T. B. Ma, Dah Ming Chiu, John Lui, Vishal Misra, and Dan Rubenstein. Internet economics: The use of Shapley value for ISP settlement. IEEE/ACM Transactions on Networking, 18(3):775–787, 2010.
[31] Kedian Mu, Weiru Liu, and Zhi Jin. Measuring the blame of each formula for inconsistent prioritized knowledge bases. Journal of Logic and Computation, 22(3):481–516, 2011.
[32] Ramasuri Narayanam and Yadati Narahari. A Shapley value-based approach to discover influential nodes in social networks.
IEEE Transactions on Automation Science and Engineering, 8(1):130–147, 2011.
[33] Tatiana Nenova. The value of corporate voting rights and control: A cross-country analysis. Journal of Financial Economics, 68(3):325–351, 2003.
[34] Leon Petrosjan and Georges Zaccour. Time-consistent Shapley value allocation of pollution cost reduction. Journal of Economic Dynamics and Control, 27(3):381–398, 2003.
[35] Alon Reshef, Benny Kimelfeld, and Ester Livshits. The impact of negation on the complexity of the Shapley value in conjunctive queries. In PODS, pages 285–297. ACM, 2020.
[36] Lloyd S. Shapley. A value for n-person games. In Harold W. Kuhn and Albert W. Tucker, editors, Contributions to the Theory of Games II, pages 307–317. Princeton University Press, Princeton, 1953.
[37] Matthias Thimm. Measuring inconsistency in probabilistic knowledge bases. In UAI, pages 530–537. AUAI Press, 2009.
[38] Matthias Thimm. On the compliance of rationality postulates for inconsistency measures: A more or less complete picture. KI, 31(1):31–39, 2017.
[39] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.
[40] Bruno Yun, Srdjan Vesic, Madalina Croitoru, and Pierre Bisquert. Inconsistency measures for repair semantics in OBDA. In IJCAI, pages 1977–1983. ijcai.org, 2018.

A Details for Section 3

We now give the missing proof of Proposition 3. For convenience, we restate the proposition here.

Proposition 3. Let I be an inconsistency measure. The following holds:

Shapley(D, Δ, f, I) = (1/|D|) Σ_{m=0}^{|D|−1} [ E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, Δ)] − E_{D′∼U_m(D\{f})}[I(D′, Δ)] ]

Proof. The reduction is as follows.

Shapley(D, Δ, f, I)
  = Σ_{D′⊆D\{f}} (|D′|! · (|D| − |D′| − 1)!/|D|!) · I(D′ ∪ {f}, Δ) − Σ_{D′⊆D\{f}} (|D′|! · (|D| − |D′| − 1)!/|D|!) · I(D′, Δ)
  = Σ_{m=0}^{|D|−1} Σ_{D′⊆D\{f}, |D′|=m} (m! · (|D| − m − 1)!/|D|!) · I(D′ ∪ {f}, Δ) − Σ_{m=0}^{|D|−1} Σ_{D′⊆D\{f}, |D′|=m} (m! · (|D| − m − 1)!/|D|!) · I(D′, Δ)
  = Σ_{m=0}^{|D|−1} (m! · (|D| − m − 1)!/|D|!) · C(|D| − 1, m) · Σ_{D′⊆D\{f}, |D′|=m} I(D′ ∪ {f}, Δ)/C(|D| − 1, m)      (2)
    − Σ_{m=0}^{|D|−1} (m! · (|D| − m − 1)!/|D|!) · C(|D| − 1, m) · Σ_{D′⊆D\{f}, |D′|=m} I(D′, Δ)/C(|D| − 1, m)
  = Σ_{m=0}^{|D|−1} (m! · (|D| − m − 1)!/|D|!) · C(|D| − 1, m) · E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, Δ)]      (3)
    − Σ_{m=0}^{|D|−1} (m! · (|D| − m − 1)!/|D|!) · C(|D| − 1, m) · E_{D′∼U_m(D\{f})}[I(D′, Δ)]
  = (1/|D|) Σ_{m=0}^{|D|−1} E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, Δ)] − (1/|D|) Σ_{m=0}^{|D|−1} E_{D′∼U_m(D\{f})}[I(D′, Δ)]
  = (1/|D|) Σ_{m=0}^{|D|−1} [ E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, Δ)] − E_{D′∼U_m(D\{f})}[I(D′, Δ)] ]

Note that in (2) we multiply and divide by the value C(|D| − 1, m). Since 1/C(|D| − 1, m) is the probability of selecting a specific subset of size m of D\{f}, we obtain the expected value of the measure in (3), under the uniform distribution. ◀

Algorithm 4 DrasticShapleyF(D, Δ, m, T, f)
1: DrasticShapley(D, Δ, m, T)
2: for all vertices v of T in a bottom-up order do
3:   UpdateProbF(v, m, f)
4: return r.val^f[m]

Subroutine 4 UpdateProbF(v, m, f)
1: if f conflicts with v then
2:   v.val^f[j] = 1 for all 1 ≤ j ≤ |D[v]|
3:   return
4: if f does not match v then
5:   v.val^f[j] = v.val[j] for all 1 ≤ j ≤ m
6:   return
7: for all children c of v in T do
8:   for j ∈ {m, …, 1} do
9:     v.val^f[j] = Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} (c.val^f[j1] + (1 − c.val^f[j1]) · v.val^f[j2]) · C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[prev(c)]| + |D[c]|, j)
10:    if v is a block vertex then
11:      v.val^f[j] += Σ_{j1+j2=j, 1≤j1≤|D[c]|, 1≤j2≤|D[prev(c)]|} (1 − c.val^f[j1]) · (1 − v.val^f[j2]) · C(|D[c]|, j1) · C(|D[prev(c)]|, j2) / C(|D[prev(c)]| + |D[c]|, j)

Figure 6 An algorithm for computing E_{D′∼U_m(D\{f})}[I_d(D′ ∪ {f}, Δ)] for Δ with an lhs chain.

B Details for Section 5

B.1 The Algorithm DrasticShapleyF

The algorithm DrasticShapleyF, used for the computation of E_{D′∼U_m(D\{f})}[I_d(D′ ∪ {f}, Δ)], is given in Figure 6. Before elaborating on the algorithm, we give some non-standard definitions. Recall that all the facts in D[v], for an ith-level block vertex v, agree on the values of all the attributes in X_1Y_1…X_i. Moreover, all the facts in D[u], for an ith-level subblock vertex u, agree on the values of all the attributes in X_1Y_1…X_iY_i. We say that f conflicts with an ith-level block vertex v if for some X_j → Y_j such that j ∈ {1, …, i − 1}, the fact f agrees with the facts of D[v] on all the values of the attributes in X_j but disagrees with them on the attributes of Y_j. Note that in this case, every fact of D[v] conflicts with f. Similarly, we say that f conflicts with an ith-level subblock vertex u if it violates an FD X_j → Y_j for some j ∈ {1, …, i} with the facts of D[u]. We also say that f matches an ith-level block or subblock vertex v if it agrees with the facts of D[v] on the values of all the attributes in X_1Y_1…X_i.

In DrasticShapleyF, we define:

v.val^f[j] := Pr[a random subset of size j of D[v] violates Δ jointly with f]

We first compute v.val for all vertices v of T using DrasticShapley, and then we use these values to compute v.val^f for some vertices v. First, we observe that for vertices v that conflict with f we have that v.val^f[j] = 1 for every 1 ≤ j ≤ |D[v]|, as every non-empty subset of D[v] violates the FDs together with f.
Note that this computation also applies to the leaves of T that are in conflict with f. For vertices v that neither conflict with f nor match f, we have that v.val^f[j] = v.val[j] for every 1 ≤ j ≤ m, as no fact of D[v] agrees with f on the left-hand side of an FD in Δ (for j = 0 we clearly have v.val^f[0] = v.val[0] = 0).

For the rest of the vertices, the arguments given in Section 5 for the computation of v.val still hold in this case; hence, the computation of v.val^f is similar. In particular, for a child c of v, a subset E of size j of D[prev(c)] ∪ D[c] is such that E ∪ {f} violates Δ if either (E ∩ D[c]) ∪ {f} violates Δ, or (E ∩ D[c]) ∪ {f} satisfies Δ but (E ∩ D[prev(c)]) ∪ {f} violates Δ. If v is a block vertex, then E ∪ {f} also violates Δ if we choose a non-empty subset from both E ∩ D[c] and E ∩ D[prev(c)]. Therefore, the main difference between the computation of v.val in DrasticShapley and the computation of v.val^f in DrasticShapleyF is the use of the value c.val^f instead of the value c.val in lines 9 and 11.

B.2 Proof of Hardness

Next, we prove the hardness side of Theorem 8. We start by proving that the problem is hard for the FD set {A → B, B → A}. As aforementioned, the proof is similar to the proof of Livshits et al. [25] for the problem of computing the Shapley contribution of facts to the result of the query q() :- R(x), S(x, y), T(y).

▶ Theorem 18. Computing Shapley(D, f, Δ, I_d) is FP^{#P}-complete for the FD set Δ = {A → B, B → A} over the relational schema (A, B).

Proof. We construct a reduction from the problem of computing the number |M(g)| of matchings in a bipartite graph g.

Figure 7 The databases constructed in the reduction of the proof of Theorem 18.
Note that we consider partial matchings; that is, subsets of edges that consist of mutually exclusive edges. Given an input graph g, we construct m + 1 input instances (D_1, f_1), …, (D_{m+1}, f_{m+1}) of our problem, where m is the number of edges in g, in the following way. For every r ∈ {1, …, m + 1}, we add one vertex v to the left-hand side of g and r + 1 vertices u_1, …, u_r, v′ to the right-hand side of g. Then, we connect the vertex v to every new vertex on the right-hand side of g. We construct the instance D_r from the resulting graph by adding a fact (a, b) for every edge (a, b) in the graph. We will compute the Shapley value of the fact f corresponding to the edge (v, v′). The reduction is illustrated in Figure 7.

In every instance D_r, the fact f increases the measure by one in a permutation σ if and only if σ_f satisfies two properties: (1) the facts of σ_f jointly satisfy the FDs in Δ, and (2) σ_f contains at least one fact that is in conflict with f. Hence, for f to affect the measure in a permutation, we have to select a set of facts corresponding to a matching of the original graph g, as well as exactly one of the facts corresponding to an edge (v, u_i) (since the facts (v, u_i) and (v, u_j) for i ≠ j jointly violate the FD A → B). We have the following:

Shapley(D_r, f, I_d) = Σ_{k=0}^{m} |M(g, k)| · r · (k + 1)! · (m + r − k − 1)! / (m + r + 1)!

where M(g, k) is the set of matchings of g containing precisely k edges. Hence, we obtain m + 1 equations from the m + 1 constructed instances; multiplying the equation obtained for D_r by (m + r + 1)!, we get the following system of equations:

⎡ 1·1!·m!          1·2!·(m−1)!       …  1·(m+1)!·0!      ⎤ ⎡ |M(g, 0)| ⎤   ⎡ (m+2)!·Shapley(D_1, f, I_d)      ⎤
⎢ 2·1!·(m+1)!      2·2!·m!           …  2·(m+1)!·1!      ⎥ ⎢ |M(g, 1)| ⎥ = ⎢ (m+3)!·Shapley(D_2, f, I_d)      ⎥
⎢      ⋮                ⋮            ⋱       ⋮           ⎥ ⎢     ⋮     ⎥   ⎢              ⋮                   ⎥
⎣ (m+1)·1!·(2m)!   (m+1)·2!·(2m−1)!  …  (m+1)·(m+1)!·m!  ⎦ ⎣ |M(g, m)| ⎦   ⎣ (2m+2)!·Shapley(D_{m+1}, f, I_d) ⎦

Let us divide each column in the above matrix by the constant (j + 1)!
(where j is thecolumn number) and each row by i + 1 (where i is the row number), and reverse the order ofthe columns. We then get the following matrix. A = 0! 1! . . . m !1! 2! . . . ( m + 1)!... ... ... ... m ! ( m + 1)! . . . m ! This matrix has coefficients a i,j = ( i + j )!, and the determinant of A is det ( A ) = Q mi =0 i ! i ! = 0; hence, the matrix is non-singular [1]. Since dividing a column by a constantdivides the determinant by a constant, and reversing the order of the columns can onlychange the sign of the determinant, the determinant of the original matrix is not zero aswell, and the matrix is non-singular. Therefore, we can solve the system of equations, andcompute the value P mk =0 M ( g, k ), which is precisely the number of matchings in g . (cid:74) Now, using the concept of a fact-wise reduction [20], we can prove that the problem ishard for any FD set that is not equivalent to an FD set with an lhs chain. We first give theformal definition of a fact-wise reduction. Let ( R, ∆) and ( R , ∆ ) be two pairs of a relationalschema and an FD set. A mapping from R to R is a function µ that maps facts over R tofacts over R . (We say that f is a fact over R if f is a fact of some database D over R .) Weextend a mapping µ to map databases D over R to databases over R by defining µ ( D ) tobe { µ ( f ) | f ∈ D } . A fact-wise reduction from ( R, ∆) to ( R , ∆ ) is a mapping Π from R to R with the following properties. Π is injective; that is, for all facts f and g over R , if Π( f ) = Π( g ) then f = g . Π preserves consistency and inconsistency; that is, for all facts f and g over R , { f, g } satisfies ∆ if and only if { Π( f ) , Π( g ) } satisfies ∆ . Π is computable in polynomial time.We have previously shown a fact-wise reduction from (( A, B ) , { A → B, B → A } ) toany ( R, ∆), where ∆ is not equivalent to an FD set with an lhs chain [27]. 
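As a sanity check, the linear system and the non-singularity argument above can be exercised on a toy instance with exact rational arithmetic. The sketch below is ours and not part of the proof: it counts |M(g, k)| by brute force, builds the matrix with entries r · (k+1)! · (m−k+r−1)!, forms the right-hand side from these counts (playing the role of the values that the reduction would obtain from a Shapley oracle), and solves the system to recover the counts; it also checks the determinant identity det[(i+j)!] = ∏(i!)² for small m. The graph and helper names are our own.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def matchings_by_size(edges, m):
    # |M(g, k)|: number of matchings (sets of pairwise vertex-disjoint edges) of size k
    counts = [0] * (m + 1)
    for k in range(m + 1):
        for sub in combinations(edges, k):
            verts = [x for e in sub for x in e]
            if len(verts) == len(set(verts)):
                counts[k] += 1
    return counts

def det(M):
    # exact determinant by cofactor expansion (fine for tiny matrices)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def solve(A, b):
    # Gauss-Jordan elimination over the rationals
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [a - M[r][col] * p for a, p in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

# determinant identity det[(i+j)!] = prod_i (i!)^2, checked for small m
for mm in range(5):
    A = [[Fraction(factorial(i + j)) for j in range(mm + 1)] for i in range(mm + 1)]
    expected = 1
    for i in range(mm + 1):
        expected *= factorial(i) ** 2
    assert det(A) == expected

# toy bipartite graph g with m = 3 edges (vertex names are ours)
edges = [("l0", "r0"), ("l0", "r1"), ("l1", "r1")]
m = len(edges)
counts = matchings_by_size(edges, m)

# row r = 1..m+1, column k = 0..m: entry r * (k+1)! * (m - k + r - 1)!
A = [[Fraction(r * factorial(k + 1) * factorial(m - k + r - 1)) for k in range(m + 1)]
     for r in range(1, m + 2)]
# right-hand side built from the true counts; it equals (m+r+1)! * Shapley(D_r, f, I_d)
b = [sum(A[r][k] * counts[k] for k in range(m + 1)) for r in range(m + 1)]
recovered = solve(A, b)
```

Since the matrix is non-singular, `recovered` equals `counts` exactly, and summing it gives the total number of matchings of g.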
Now, using the concept of a fact-wise reduction [20], we can prove that the problem is hard for any FD set that is not equivalent to an FD set with an lhs chain. We first give the formal definition of a fact-wise reduction. Let (R, ∆) and (R', ∆') be two pairs of a relational schema and an FD set. A mapping from R to R' is a function µ that maps facts over R to facts over R'. (We say that f is a fact over R if f is a fact of some database D over R.) We extend a mapping µ to map databases D over R to databases over R' by defining µ(D) to be {µ(f) | f ∈ D}. A fact-wise reduction from (R, ∆) to (R', ∆') is a mapping Π from R to R' with the following properties:
- Π is injective; that is, for all facts f and g over R, if Π(f) = Π(g) then f = g.
- Π preserves consistency and inconsistency; that is, for all facts f and g over R, {f, g} satisfies ∆ if and only if {Π(f), Π(g)} satisfies ∆'.
- Π is computable in polynomial time.

We have previously shown a fact-wise reduction from ((A, B), {A → B, B → A}) to any (R, ∆), where ∆ is not equivalent to an FD set with an lhs chain [27]. Clearly, fact-wise reductions preserve the Shapley value of facts (that is, Shapley(D, f, I, ∆) = Shapley(Π(D), Π(f), I, ∆')); hence, there is a reduction from the problem of computing the Shapley value over {A → B, B → A} to the problem of computing the Shapley value over any ∆ that has no lhs chain (even up to equivalence), and that concludes our proof. ◀

C Details for Section 6

C.1 The Simplify() Algorithm

First, we give the Simplify() algorithm presented by Livshits et al. [28]. The algorithm is depicted in Figure 8. They studied the problem of finding a cardinality repair, and established a dichotomy over the space of all the sets of FDs. In particular, they introduced an algorithm that, given an FD set ∆, decides whether:
- a cardinality repair can be computed in polynomial time; or
- finding a cardinality repair is APX-complete.
These are the only two possibilities.

Algorithm 5 Simplify(∆)
1: remove trivial FDs from ∆
2: if ∆ is not empty then
3:   find a removable pair (X, Y) of attribute sequences
4:   ∆ := ∆ − XY

Figure 8 An algorithm for deciding whether a cardinality repair w.r.t. ∆ can be computed in polynomial time [28].

The algorithm is a recursive procedure that attempts to simplify ∆ at each iteration by finding a removable pair (X, Y) of attribute sets, and removing every attribute of X and Y from every FD in ∆ (which we denote by ∆ − XY). We say that a pair (X, Y) is removable if the following hold:
- Closure_∆(X) = Closure_∆(Y),
- XY is nonempty,
- every FD in ∆ contains either X or Y on the left-hand side.
Note that X and Y may be the same, in which case the condition states that every FD contains X on the left-hand side. If we are able to transform ∆ into an empty set of FDs by repeatedly applying Simplify(), then the algorithm returns true and a cardinality repair can be computed in polynomial time. Otherwise, the algorithm returns false and the problem is APX-complete.
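To make the procedure concrete, here is a small executable sketch of the Simplify() loop. This is our own illustrative rendering, not the code of [28]: FDs are modeled as pairs of frozensets, Closure_∆ is the standard attribute closure, and removable pairs (X, Y) are searched naively over all attribute subsets (exponential in the schema size, which is fine for a fixed schema).

```python
from itertools import combinations

def closure(attrs, fds):
    # standard attribute-set closure under the FDs (each FD is a pair (lhs, rhs))
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return frozenset(closed)

def subsets(attrs):
    attrs = sorted(attrs)
    return [frozenset(c) for r in range(len(attrs) + 1) for c in combinations(attrs, r)]

def cardinality_repair_in_ptime(fds):
    # repeatedly apply Simplify(); True iff Delta can be emptied (the PTIME case),
    # False iff we get stuck (the APX-complete case)
    fds = {(frozenset(l), frozenset(r)) for l, r in fds}
    while True:
        fds = {(l, r - l) for l, r in fds if not r <= l}  # remove trivial FDs
        if not fds:
            return True
        attrs = set().union(*(l | r for l, r in fds))
        pair = None
        for X in subsets(attrs):
            for Y in subsets(attrs):
                # removable: XY nonempty, equal closures, every lhs contains X or Y
                if (X | Y) and closure(X, fds) == closure(Y, fds) \
                        and all(X <= l or Y <= l for l, _ in fds):
                    pair = X | Y
                    break
            if pair is not None:
                break
        if pair is None:
            return False
        fds = {(l - pair, r - pair) for l, r in fds}  # Delta := Delta - XY
```

For example, {A → B, B → A} simplifies to the empty set via the removable pair ({A}, {B}), while {A → B, B → C} gets stuck, matching the two sides of the dichotomy.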
(Recall that APX is the class of NP optimization problems that admit constant-ratio approximations in polynomial time; hardness in APX is via the so-called "PTAS" reductions, cf. textbooks on approximation complexity, e.g., [9].)

C.2 Reduction

We now show the reduction from the problem of computing the Shapley value for I_R to that of counting the subsets of size m of D that have a cardinality repair of cost k. In what follows, we denote by MR(D, ∆) the cost of a cardinality repair of D w.r.t. ∆. We have:

\[
\begin{aligned}
\mathrm{Shapley}(D, f, \Delta, I_R)
&= \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{\substack{D' \subseteq D \setminus \{f\} \\ |D'| = m}} \frac{1}{\binom{|D|-1}{m}}\, I_R(D' \cup \{f\}, \Delta)
\;-\; \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{\substack{D' \subseteq D \setminus \{f\} \\ |D'| = m}} \frac{1}{\binom{|D|-1}{m}}\, I_R(D', \Delta) \\
&= \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \sum_{\substack{D' \subseteq D \setminus \{f\},\ |D'| = m \\ MR(D' \cup \{f\}, \Delta) = k}} \frac{1}{\binom{|D|-1}{m}}\, I_R(D' \cup \{f\}, \Delta)
\;-\; \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \sum_{\substack{D' \subseteq D \setminus \{f\},\ |D'| = m \\ MR(D', \Delta) = k}} \frac{1}{\binom{|D|-1}{m}}\, I_R(D', \Delta) \\
&= \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \sum_{\substack{D' \subseteq D \setminus \{f\},\ |D'| = m \\ MR(D' \cup \{f\}, \Delta) = k}} \frac{k}{\binom{|D|-1}{m}}
\;-\; \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \sum_{\substack{D' \subseteq D \setminus \{f\},\ |D'| = m \\ MR(D', \Delta) = k}} \frac{k}{\binom{|D|-1}{m}} \\
&= \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \frac{k}{\binom{|D|-1}{m}} \,\big|\{ D' \subseteq D \setminus \{f\} \mid |D'| = m,\ MR(D' \cup \{f\}, \Delta) = k \}\big|
\;-\; \frac{1}{|D|} \sum_{m=0}^{|D|-1} \sum_{k=0}^{m} \frac{k}{\binom{|D|-1}{m}} \,\big|\{ D' \subseteq D \setminus \{f\} \mid |D'| = m,\ MR(D', \Delta) = k \}\big|
\end{aligned}
\]
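The chain of equalities above can be verified by brute force on a tiny database. In the sketch below (ours, not part of the reduction), ∆ = {A → B, B → A}, MR is computed by exhaustive search, and the Shapley value computed directly from the permutation definition is compared against the final count-based expression.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations
from math import comb, factorial

def consistent(facts):
    # satisfies {A -> B, B -> A}: two facts may agree on both attributes or on neither
    return all((a1 == a2) == (b1 == b2) for (a1, b1) in facts for (a2, b2) in facts)

def mr(facts):
    # MR(D, Delta): minimum number of deletions to reach consistency
    facts = list(facts)
    for keep in range(len(facts), -1, -1):
        if any(consistent(sub) for sub in combinations(facts, keep)):
            return len(facts) - keep

def shapley_by_permutations(D, f):
    # the direct definition: average marginal contribution of f over all permutations
    total = Fraction(0)
    for p in permutations(D):
        before = set(p[:p.index(f)])
        total += mr(before | {f}) - mr(before)
    return total / factorial(len(D))

def shapley_by_counts(D, f):
    # the final expression: (1/|D|) * sum_{m,k} k/C(|D|-1,m) * (N^f_{m,k} - N_{m,k}),
    # where N^f_{m,k} and N_{m,k} count the size-m subsets with MR = k (with/without f)
    n = len(D)
    rest = [g for g in D if g != f]
    total = Fraction(0)
    for m in range(n):
        with_f = Counter(mr(set(s) | {f}) for s in combinations(rest, m))
        without = Counter(mr(set(s)) for s in combinations(rest, m))
        for k in range(m + 1):
            total += Fraction(k * (with_f[k] - without[k]), comb(n - 1, m))
    return total / n

D = [(1, "a"), (1, "b"), (2, "b"), (2, "c")]
f = (1, "a")
```

On this database the two computations agree exactly, as the derivation predicts.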
C.3 The Algorithm RShapleyF

Next, we give the algorithm RShapleyF for computing E_{D' ∼ U_m(D∖{f})}(I_R(D' ∪ {f}, ∆)). The algorithm is depicted in Figure 9. We define:

v.val_f[j, t] := the number of subsets of size j of D[v] that, jointly with f, have a cardinality repair of cost t.

As in the case of DrasticShapleyF, we start with the execution of RShapley, which allows us to reuse some of the values computed in this execution. For every vertex, we set v.val_f[0, 0] = 1, as the empty set, jointly with f, has a single cardinality repair of cost zero. Then, we consider three types of vertices. For vertices v that conflict with f, we have that v.val_f[j, t] = v.val[j, t − 1] for all 1 ≤ j ≤ |D[v]| and 1 ≤ t ≤ j, as every non-empty subset of D[v] conflicts with f; hence, we have to remove f in a cardinality repair, and the cost of a cardinality repair increases by one. For vertices v that do not match f, we have that v.val_f[j, t] = v.val[j, t], as f is not in conflict with any fact of D[v]; hence, it can be added to any cardinality repair without increasing its cost. The same holds for the leaves of T that do not conflict with f.

For a block vertex v, all the arguments given in Section 6 still apply here, as for a child c of v, a cardinality repair of E ∪ {f} for a subset E of size j of D[prev(c)] ∪ D[c] is either a cardinality repair of (E ∩ D[c]) ∪ {f} (in which case we delete all facts of E ∩ D[prev(c)]) or a cardinality repair of (E ∩ D[prev(c)]) ∪ {f} (in which case we delete all facts of E ∩ D[c]).

Algorithm 6 RShapleyF(D, ∆, m, T, f)
1: RShapley(D, ∆, m, T)
2: for all vertices v of T in a bottom-up order do
3:   UpdateCountF(v, m, f)
4: return Σ_{k=0}^{m} (k / C(|D|−1, m)) · r.val_f[m, k]

Subroutine 5 UpdateCountF(v, m, f)
1: v.val_f[0, 0] := 1
2: if f conflicts with v then
3:   v.val_f[j, t] := v.val[j, t − 1] for all 1 ≤ j ≤ |D[v]| and 1 ≤ t ≤ j
4:   return
5: if f does not match v or v is a leaf then
6:   v.val_f[j, t] := v.val[j, t] for all 1 ≤ j ≤ m and 0 ≤ t ≤ j
7:   return
8: for all children c of v in T do
9:   for j ∈ {m, ..., 0} do
10:    for t ∈ {j, ..., 0} do
11:      if v is a block vertex then
12:        v.val_f[j, t] := Σ_{j1+j2=j, 0≤j1≤|D[c]|, t−j1≤j2≤min{t,|D[prev(c)]|}} Σ_{t−j1≤w≤j2} ( c.val_f[j1, t−j2] · v.val_f[j2, w] )
13:        v.val_f[j, t] += Σ_{j1+j2=j, t−j2≤j1≤min{t,|D[c]|}, 0≤j2≤|D[prev(c)]|} Σ_{t−j2≤w≤j1} ( v.val_f[j2, t−j1] · c.val_f[j1, w] )
14:      else
15:        v.val_f[j, t] := Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} Σ_{t1+t2=t} ( c.val_f[j1, t1] · v.val_f[j2, t2] )
Figure 9 An algorithm for computing E_{D' ∼ U_m(D∖{f})}(I_R(D' ∪ {f}, ∆)) for ∆ with an lhs chain.

Therefore, the only difference in the computation of v.val_f compared to the computation of v.val for such vertices is the use of c.val_f (which takes f into account) rather than c.val.

For a subblock vertex v (that does not conflict with f and, hence, matches f), the computation of v.val_f is again very similar to that of v.val, with the only difference being the use of c.val_f. Observe that in this case, the children of v correspond to different blocks. Each such block that does not match f also does not violate any FD with f; hence, when we add f to this block, a cardinality repair of the resulting group of facts does not require the removal of f. The only child of v where a cardinality repair might require the removal of f is a child that matches f, and, clearly, there is at most one such child. Therefore, we do not count the fact f twice in the computation of the value v.val_f.

C.4 Proof of Proposition 16

Finally, we prove Proposition 16. For convenience, we give it again here.

▶ Proposition 16. Let ∆ = {A → B, B → A} be an FD set over (A, B). Counting the subsets of size m of a given database that have a cardinality repair of cost k is NP-hard.

Proof. The proof is by a simple reduction from the problem of computing the number of perfect matchings in a bipartite graph, known to be #P-complete. Given a bipartite graph g = (A ∪ B, E) (where |A| = |B|), we construct a database D over (A, B) by adding a fact (a, b) for every edge (a, b) ∈ E. We then define m = |A| and k = 0. It is rather straightforward that the perfect matchings of g correspond exactly to the subsets E' of size |A| of D such that E' itself is a cardinality repair (that is, MR(E', ∆) = 0). ◀
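The correspondence in this proof can be checked by brute force on a small graph (the graph below is ours): a size-|A| subset of the database is consistent w.r.t. {A → B, B → A}, i.e., has a cardinality repair of cost 0, exactly when it is a perfect matching of g.

```python
from itertools import combinations, permutations

# a small bipartite graph g = (A ∪ B, E) with |A| = |B| = 3 (names are ours)
edges = [(0, "a"), (0, "b"), (1, "a"), (1, "b"), (2, "b"), (2, "c")]
left = sorted({u for u, _ in edges})
right = sorted({v for _, v in edges})

def consistent(facts):
    # satisfies {A -> B, B -> A}: two facts may agree on both attributes or on neither
    return all((a1 == a2) == (b1 == b2) for (a1, b1) in facts for (a2, b2) in facts)

# number of perfect matchings of g, counted directly as bijections from A to B
perfect = sum(
    all((u, assign[i]) in edges for i, u in enumerate(left))
    for assign in permutations(right)
)

# number of subsets of size |A| of the database that are themselves consistent,
# i.e., that have a cardinality repair of cost 0
zero_cost = sum(consistent(sub) for sub in combinations(edges, len(left)))
```

On this graph both counts are 2: the matchings {0-a, 1-b, 2-c} and {0-b, 1-a, 2-c}.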
D Details for Section 7

D.1 Proof of Correctness

We now prove the following.

▶ Lemma 19. For a block vertex v and a child c of v, we have that:

\[
\mathbb{E}_{D' \sim U_j(D[prev(c)] \cup D[c])}\big(I_{MC}(D', \Delta)\big)
= \sum_{\substack{j_1 + j_2 = j \\ 0 \le j_1 \le |D[c]| \\ 0 \le j_2 \le |D[prev(c)]|}}
\Big[ \mathbb{E}_{D' \sim U_{j_1}(D[c])}\big(I_{MC}(D', \Delta)\big) + \mathbb{E}_{D' \sim U_{j_2}(D[prev(c)])}\big(I_{MC}(D', \Delta)\big) \Big]
\]

Proof. Since the children of v are subblocks (that jointly violate an FD of ∆), each repair of a subset E of D[v] contains facts from a single child of v, and the number of repairs is the sum of the numbers of repairs over the children of v. Hence, we have the following (recall that MC(D, ∆) is the set of repairs of D w.r.t. ∆):

\[
\begin{aligned}
\mathbb{E}_{D' \sim U_j(D[prev(c)] \cup D[c])}\big(I_{MC}(D', \Delta)\big)
&= \sum_{\substack{D' \subseteq D[prev(c)] \cup D[c] \\ |D'| = j}} \Pr[D'] \cdot |MC(D', \Delta)| \\
&= \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_1]\Pr[E_2] \big( |MC(E_1, \Delta)| + |MC(E_2, \Delta)| \big) \\
&= \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_1]\Pr[E_2]\, |MC(E_1, \Delta)|
 + \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_1]\Pr[E_2]\, |MC(E_2, \Delta)| \\
&= \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \Pr[E_1]\, |MC(E_1, \Delta)| \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_2]
 + \sum_{j_1 + j_2 = j} \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_2]\, |MC(E_2, \Delta)| \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \Pr[E_1] \\
&= \sum_{j_1 + j_2 = j} \mathbb{E}_{D' \sim U_{j_1}(D[c])}\big(I_{MC}(D', \Delta)\big) + \sum_{j_1 + j_2 = j} \mathbb{E}_{D' \sim U_{j_2}(D[prev(c)])}\big(I_{MC}(D', \Delta)\big) \\
&= \sum_{j_1 + j_2 = j} \Big[ \mathbb{E}_{D' \sim U_{j_1}(D[c])}\big(I_{MC}(D', \Delta)\big) + \mathbb{E}_{D' \sim U_{j_2}(D[prev(c)])}\big(I_{MC}(D', \Delta)\big) \Big]
\end{aligned}
\]

Recall that in our reduction from the problem of computing the Shapley value to that of computing the expected value of the measure over subsets of a given size of the database, we considered the uniform distribution, where Pr[E] = 1/\binom{|D|}{m} for a subset E of size m of D. Therefore, we have that Σ_{E_2 ⊆ D[prev(c)], |E_2|=j_2} Pr[E_2] = Σ_{E_1 ⊆ D[c], |E_1|=j_1} Pr[E_1] = 1. ◀
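The counting facts behind this lemma, and the product version for independent parts shown next, can be checked directly: when two parts conflict pairwise across the board, the repairs (maximal consistent subsets) of the union are exactly the repairs of the individual parts, so the counts add; when no cross conflicts exist, every repair is a union of one repair per part, so the counts multiply. The conflict-graph encoding below is our own.

```python
from itertools import combinations

def repairs(facts, conflicts):
    # repairs = maximal subsets containing no conflicting pair
    cons = [frozenset(s) for r in range(len(facts) + 1)
            for s in combinations(facts, r)
            if not any(frozenset(p) in conflicts for p in combinations(s, 2))]
    return [s for s in cons if not any(s < t for t in cons)]

part1, part2 = [1, 2], [3, 4, 5]
conf1 = {frozenset({1, 2})}                              # one conflicting pair
conf2 = {frozenset(p) for p in combinations(part2, 2)}   # all pairs conflict
independent = conf1 | conf2                              # no cross conflicts
crossing = independent | {frozenset({a, b}) for a in part1 for b in part2}

n1 = len(repairs(part1, conf1))
n2 = len(repairs(part2, conf2))
product_case = len(repairs(part1 + part2, independent))  # parts are independent
sum_case = len(repairs(part1 + part2, crossing))         # parts fully conflict
```

Here n1 = 2 and n2 = 3, and indeed the independent union has 2 · 3 = 6 repairs while the fully conflicting union has 2 + 3 = 5.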
We can similarly show the following.

▶ Lemma 20. For a subblock vertex v and a child c of v, we have that:

\[
\mathbb{E}_{D' \sim U_j(D[prev(c)] \cup D[c])}\big(I_{MC}(D', \Delta)\big)
= \sum_{\substack{j_1 + j_2 = j \\ 0 \le j_1 \le |D[c]| \\ 0 \le j_2 \le |D[prev(c)]|}}
\Big[ \mathbb{E}_{D' \sim U_{j_1}(D[c])}\big(I_{MC}(D', \Delta)\big) \cdot \mathbb{E}_{D' \sim U_{j_2}(D[prev(c)])}\big(I_{MC}(D', \Delta)\big) \Big]
\]

Proof. Since the children of v are blocks (that do not jointly violate any FD of ∆), each repair of a subset E of D[v] is a union of repairs of the children of v, and the number of repairs is the product of the numbers of repairs over the children of v. Hence, we have the following:

\[
\begin{aligned}
\mathbb{E}_{D' \sim U_j(D[prev(c)] \cup D[c])}\big(I_{MC}(D', \Delta)\big)
&= \sum_{\substack{D' \subseteq D[prev(c)] \cup D[c] \\ |D'| = j}} \Pr[D'] \cdot |MC(D', \Delta)| \\
&= \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_1]\Pr[E_2] \big( |MC(E_1, \Delta)| \cdot |MC(E_2, \Delta)| \big) \\
&= \sum_{j_1 + j_2 = j} \sum_{\substack{E_1 \subseteq D[c] \\ |E_1| = j_1}} \Pr[E_1]\, |MC(E_1, \Delta)| \sum_{\substack{E_2 \subseteq D[prev(c)] \\ |E_2| = j_2}} \Pr[E_2]\, |MC(E_2, \Delta)| \\
&= \sum_{j_1 + j_2 = j} \mathbb{E}_{D' \sim U_{j_1}(D[c])}\big(I_{MC}(D', \Delta)\big) \cdot \mathbb{E}_{D' \sim U_{j_2}(D[prev(c)])}\big(I_{MC}(D', \Delta)\big)
\end{aligned}
\]
◀

D.2 The Algorithm MCShapleyF

Next, we give the algorithm MCShapleyF that computes E_{D' ∼ U_m(D∖{f})}(I_MC(D' ∪ {f}, ∆)). The algorithm is shown in Figure 10. We define:

v.val_f[j] := E[the number of repairs of E ∪ {f} for a random subset E of size j of D[v]].

First, we set v.val_f[0] = 1 for every vertex v, as when f is added to the empty set we obtain a consistent database that has a single repair, namely the whole database. Then, we again consider three possible types of vertices.
For vertices v that conflict with f, we have that v.val_f[j] = v.val[j] + 1, as f violates the FDs with every non-empty subset of D[v]; hence, for each such subset, {f} is an additional repair, and the number of repairs increases by one compared to the number of repairs without f. For a vertex v that does not match f, it holds that f does not violate the constraints with any subset of D[v]; thus, it does not affect the number of repairs, and we have that v.val_f[j] = v.val[j]. The same holds for the leaves of T that do not conflict with f.

For the rest of the vertices v, the computation is similar to the one in MCShapley. In particular, we go over the children c of v in the for loop of line 8, and compute v.val_f using dynamic programming. We observe that if a child c of a block vertex v conflicts with f, then when f is added to a subset E of D[c], none of the repairs of E ∪ {f} contains f, but {f} is an additional repair. For such children c, we use the value c.val in the calculation of line 14, where we ignore f. Hence, if all the children of v conflict with f, we compute v.val_f in the exact same way we compute v.val, while ignoring the fact f. Then, we increase the computed value by one in line 18, to reflect the additional repair {f}.

Algorithm 7 MCShapleyF(D, ∆, m, T, f)
1: MCShapley(D, ∆, m, T)
2: for all vertices v of T in a bottom-up order do
3:   UpdateExpectedF(v, m, f)
4: return r.val_f[m]

Subroutine 6 UpdateExpectedF(v, m, f)
1: v.val_f[0] := 1
2: if f conflicts with v then
3:   v.val_f[j] := v.val[j] + 1 for all 1 ≤ j ≤ |D[v]|
4:   return
5: if f does not match v or v is a leaf then
6:   v.val_f[j] := v.val[j] for all 0 ≤ j ≤ m
7:   return
8: for all children c of v in T do
9:   for j ∈ {m, ..., 0} do
10:    if v is a block vertex then
11:      if c does not conflict with f then
12:        v.val_f[j] := Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} ( c.val_f[j1] + v.val_f[j2] )
13:      else
14:        v.val_f[j] := Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} ( c.val[j1] + v.val_f[j2] )
15:    else
16:      v.val_f[j] := Σ_{j1+j2=j, 0≤j1≤|D[c]|, 0≤j2≤|D[prev(c)]|} ( c.val_f[j1] · v.val_f[j2] )
17: if all the children of v conflict with f then
18:   v.val_f[j] := v.val_f[j] + 1 for all 1 ≤ j ≤ m
Figure 10 An algorithm for computing E_{D' ∼ U_m(D∖{f})}(I_MC(D' ∪ {f}, ∆)) for ∆ with an lhs chain.

If one of the children c of v does not conflict with f (note that there is at most one such child), then we take f into account in the computation of line 12, where we use the value c.val_f rather than c.val. In this case, there is no need to increase the computed value by one, as we have already considered the addition of f, and the fact f may appear in some of the repairs of the subset of D[c] (or, again, be a repair on its own).

For a subblock vertex v, we compute v.val_f in the same way we compute v.val in MCShapley, but we use the value c.val_f in the computation. In this case, each repair of a subset E of D[c] ∪ D[prev(c)], jointly with f, is a union of a repair of (E ∩ D[c]) ∪ {f} and a repair of (E ∩ D[prev(c)]) ∪ {f}. Note that in this case, for a child c of v that does not match f, we have that c.val_f = c.val; hence, the fact f is again only taken into account when considering a child c of v that matches f, and there is no risk of counting the repair {f} twice.
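The base rule of UpdateExpectedF for conflicting vertices can be confirmed by brute force. In our sketch below (again for ∆ = {A → B, B → A}), every fact of D[v] conflicts with f, and indeed the expected number of repairs with f added exceeds the expected number without f by exactly one for every subset size j ≥ 1, while both equal 1 at j = 0.

```python
from fractions import Fraction
from itertools import combinations

def consistent(facts):
    # satisfies {A -> B, B -> A}: two facts may agree on both attributes or on neither
    return all((a1 == a2) == (b1 == b2) for (a1, b1) in facts for (a2, b2) in facts)

def num_repairs(facts):
    # number of repairs = number of maximal consistent subsets
    cons = [frozenset(s) for r in range(len(facts) + 1)
            for s in combinations(facts, r) if consistent(s)]
    return sum(1 for s in cons if not any(s < t for t in cons))

f = (1, "x")
block = [(1, "a"), (1, "b"), (1, "c")]  # every fact conflicts with f (same A-value)

def expected(j, add_f):
    # expected number of repairs over a uniformly random size-j subset of the block
    subs = list(combinations(block, j))
    tot = sum(num_repairs(set(s) | ({f} if add_f else set())) for s in subs)
    return Fraction(tot, len(subs))

vals = [(expected(j, False), expected(j, True)) for j in range(len(block) + 1)]
```

For this block, `vals[j]` is the pair (v.val[j], v.val_f[j]), matching the rule v.val_f[j] = v.val[j] + 1 for j ≥ 1.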