Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

Abstract

Data inconsistency evaluation and repairing are major concerns in data quality management. As the basic computing task, optimal subset repair is not only applied for cost estimation during database repairing, but is also directly used to evaluate database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples in an inconsistent database whose removal leaves a consistent subset. Tight bounds on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, and give a fast estimation of the size of an optimal subset repair. First, we strengthen the dichotomy for the optimal subset repair computation problem: we show that it is not only APX-complete, but also NP-hard to approximate an optimal subset repair within a factor better than $17/16$ for most cases. Second, we show a $(2 - 0.5^{\sigma-1})$-approximation whenever $\sigma$ functional dependencies are given, and a $(2 - \eta_k + \frac{\eta_k}{k})$-approximation when an $\eta_k$-portion of tuples have the $k$-quasi-Turán property for some $k > 1$. Finally, we show a sublinear estimator of the size of an optimal $S$-repair for subset queries; it outputs an estimation of a ratio $2n + \epsilon n$ with high probability, thus deriving an estimation of the FD-inconsistency degree of a ratio $2 + \epsilon$. To support a variety of subset queries for FD-inconsistency evaluation, we unify them as the $\subseteq$-oracle, which can answer membership queries and return $p$ uniformly sampled tuples whenever given a number $p$. Experiments are conducted on range queries as an implementation of the $\subseteq$-oracle, and the results show the efficiency of our FD-inconsistency degree estimator.


Dongjing Miao
Harbin Institute of Technology, P.O. Box 321, 92 Xidazhi Street, Harbin, China

Zhipeng Cai
Georgia State University, P.O. Box 5060, Atlanta, GA, USA

Jianzhong Li
Harbin Institute of Technology, P.O. Box 321, 92 Xidazhi Street, Harbin, China

Gao
Harbin Institute of Technology, P.O. Box 321, 92 Xidazhi Street, Harbin, China

Xianmin Liu
Harbin Institute of Technology, P.O. Box 321, 92 Xidazhi Street, Harbin, China

∗ Supported by NSFC xxxx, NSFC xxxx

1. INTRODUCTION

A database instance $I$ is said to be inconsistent if it violates some given integrity constraints, that is, $I$ contains conflicts or inconsistencies. Such inconsistencies can occur in various scenarios and for many reasons. A typical scenario is information integration, where data are integrated from different sources, some of which may be low-quality or imprecise, so that conflicts or inconsistencies arise.

In the principled approach to managing inconsistencies [5], the notion of repair was first introduced decades ago. A repair of an inconsistent instance $I$ is a consistent instance $I'$ obtained by performing a minimal set of operations on $I$ so as to satisfy all the given integrity constraints. Repairs can be defined under different settings of operations and integrity constraints. We follow the setting of [27], where we take functional dependencies, one of the most typical kinds, as the integrity constraints and deletions as the operations, so that a repair of $I$ here is a subset of $I$ obtained by minimal tuple deletions, and an optimal repair of $I$ is a subset obtained by deleting a minimum number of tuples. Computing an optimal subset repair with respect to functional dependencies is the major concern of this paper. It is a fundamental problem of data inconsistency management, and the motivation has been partially discussed in [27]. The significance of studying the computation of optimal repairs is twofold.

Computing optimal repairs is a basic task in data cleaning and repairing. Existing data repairing methods can be roughly categorized into two classes, fully automatic and semi-automatic [14]. In fully automatic repairing methods, optimal subset repairs are always considered as optimization objectives [15, 32, 31]. Given an inconsistent database, one needs automated methods to make it consistent, i.e., to find a repair that satisfies the constraints and minimally differs from the original input; an optimal subset repair is exactly one such choice [27]. On the other side, optimal subset repairs are also preferred candidates picked by automatic data cleaning or repairing systems when dealing with inconsistency errors. Instead of the fully automatic way, human-in-the-loop semi-automatic repairing is another prevailing way [6, 7, 18, 22]; the complement of an optimal subset repair is an ideal lower bound on the repairing cost, which can be used to estimate the amount of effort necessary to eliminate all the inconsistency, and sometimes even suggests to users how to choose specific operations.

Besides optimal repairs, measuring inconsistency motivates computing the size of optimal repairs. Intuitively, for the same schema and the same integrity constraints, given two database instances, it is natural to ask which one is more inconsistent than the other. This comparison can be accomplished by assigning a measure of inconsistency to a database. Hence, measuring database inconsistency has been revisited and generalized recently by the data management community. Bertossi [9] argued that both the admissible repair actions and how close we want to stay to the instance at hand should be taken into account when defining such a measure. To achieve this, database repairs [8] can be applied to define degrees of inconsistency. Among the numerical measures proposed in [9], the subset-repair-based inconsistency degree $\text{inc-deg}^{S}$ is the most typical one. According to [9], it is defined as the ratio of the minimum number of tuples deleted in order to remove all inconsistencies, i.e., the size of the complement of an optimal subset repair, to the size of the database. Therefore, computing an optimal subset repair is the foundation of inconsistency degree computation. Previous studies provide neither a fine-grained complexity analysis of this problem nor efficient algorithms for large databases. Thus, in this paper we give a careful analysis of the computational complexity and a fast computation of the size of an optimal subset repair. The contributions of this paper are detailed as follows.

We first study the data complexity of the optimal subset repair problem, including the lower and upper bounds, in order to understand how hard the problem is and how good an approximation we can achieve. The most recent work [27] develops a simplification algorithm OSRSucceeds and establishes a dichotomy based on it to characterize the complexity of this problem. Simply speaking, they show that, for the space of combinations of database schemas and functional dependency sets, (i) an optimal subset repair can be computed in polynomial time if the given FD set can be simplified into an empty set; (ii) the problem is APX-complete otherwise.

As the accuracy of the computed size of an optimal subset repair is crucial to our motivation, we strengthen the dichotomy in this paper by improving the lower bound to concrete constants. Specifically, we show that it is NP-hard to obtain a $(17/16 - \epsilon)$-optimal subset repair for most input cases, and a $(69246103/[\ldots] - \epsilon)$-optimal subset repair for all the others. We show that a simple reduction can unify most cases and improve the lower bound.

We then consider approximate repairing. This long-standing problem is usually treated as an equivalent vertex cover problem, and admits an upper bound of ratio 2. However, we take a step further and show that (i) a $(2 - 0.5^{\sigma-1})$-approximation of an optimal subset repair can be obtained for $\sigma$ given functional dependencies, and moreover, (ii) a $(2 - \eta_k + \frac{\eta_k}{k})$-approximation can also be found in polynomial time, which is much better if an $\eta_k$-portion of tuples have the $k$-quasi-Turán property for some $k > 2$.

Then, we turn to the most related problem, efficiently estimating the subset-repair-based FD-inconsistency degree. For an integrated database instance, it is helpful to measure the inconsistency degree of any part of it locally, in order to let users know and understand the quality of their data. Consider an inconsistent database $I$ integrated from the data of two organizations $A$ and $B$; we need to know the main cause of the conflicts. If we know that the inconsistency degree of some part $A$ is very low but that of $B$ is as high as the inconsistency degree of $I$, then we could conclude that the cause of inconsistencies is mainly on the conflicts in between, but not in any single source. That is, when we find that the inconsistency degree of some local part is approximately equal to that of the global one, it is reasonable to take this part as a primary cause of inconsistency, so that we may focus on investigating what happens in $B$.

With this motivation, in this paper we focus on fast estimation of the subset-repair-based FD-inconsistency degree. We follow the definition of the subset-repair-based measurement recently proposed by Bertossi [9], and develop an efficient method for estimating the FD-inconsistency degree of any part of the input database. At first sight this seems to be the same problem as computing an optimal repair itself, so the complexity result for optimal subset repair computing indicates that it is hard to approximate within a better ratio in polynomial time, not to mention in linear or even sublinear running time. However, we observe that the value of the inconsistency degree is a ratio to the size of the input data, say $n$; hence, an $n$-fold accuracy loss in estimating the size of an optimal subset repair is acceptable. Therefore, we develop a sample-based method to estimate the size of an optimal subset repair with an error of $(2 \pm \epsilon)n$, so as to break through the limitation of linear time complexity while achieving an approximation with an additive error $\epsilon$. To support a variety of subset queries, especially those whose result is very large, we model those queries as the $\subseteq$-oracle, which can answer membership queries and return $k$ uniformly sampled tuples whenever given a number $k$.

The rest of this paper is organized as follows. Necessary notations and definitions are formally stated in Section 2. Complexity results and LP-based approximations are shown in Section 3. Sampling-based fast FD-inconsistency degree estimation is given in Section 4. Experiment results are discussed in Section 5. Finally, we conclude our study in Section 6.

2. PROBLEM STATEMENT

The necessary definitions, notations and problem definitions are formally given in this section.

Schemas and Tables. A $k$-ary relation schema is represented by $R(A_1, \ldots, A_k)$, where $R$ is the relation name and $A_1, \ldots, A_k$ are distinct attributes of $R$. In the rest of this paper, we write $R$ for $R(A_1, \ldots, A_k)$ for simplicity. We customarily use capital letters from the beginning of the English alphabet to denote individual attributes, such as "$A, B, C$", and capital letters from the end of the English alphabet to denote sets of attributes, such as "$X, Y, Z$", sometimes with subscripts. A set of attributes is conventionally written without curly braces and commas; for example, $X$ can be written as $AB$ if $X = \{A, B\}$. We assume the domain of each attribute, $dom(A_i)$, is countably infinite; then any instance $I$ over relation $R$ is a collection of $k$-ary tuples $(a_1, a_2, \ldots, a_k)$, where each value $a_i$ is taken from the set $dom(A_i)$. Let $t.A_i$ refer to the value $a_i$ on attribute $A_i$, and $t.X$ refer to the sequence of attribute values $a_1, a_2, \ldots, a_i$ when $X = A_1 A_2 \ldots A_i$. We use $[t.X]$ to denote the set of all tuples from $I$ sharing the same value of $X$ as $t$. The size of an instance is the number of tuples in it, denoted as $|I|$. In this paper, any instance $I$ of a relation schema $R$ is a single table corresponding to $R$.

[Figure 1: dist with different $\rho$, $n$ and $\sigma$. Panels: (a) Example instance order; (b) All conflicts.]

Functional Dependencies. Let $X$ and $Y$ be two arbitrary sets of attributes in a relation schema $R$; then $Y$ is said to be functionally determined by $X$, written as $X \to Y$, if and only if each $X$-value in $R$ is associated with precisely one $Y$-value in $R$. Usually $X$ is called the determinant set and $Y$ the dependent set, but in this paper, for the sake of simplicity, we just call them determinant and dependent respectively. A functional dependency $X \to Y$ is called trivial if $Y$ is a subset of $X$.

Given a functional dependency $\varphi: X \to Y$ over $R$, an instance $I$ corresponding to $R$ is said to satisfy $\varphi$, denoted as $I \models \varphi$, if for any two tuples $s, t$ in $I$, $s.Y = t.Y$ whenever $s.X = t.X$. That is, two tuples sharing the same values of $X$ necessarily have the same values of $Y$. Otherwise, $I$ does not satisfy $\varphi$, denoted as $I \not\models \varphi$. As a special case, any two-tuple subset $J$ of $I$ is called a $\varphi$-conflict in $I$ with respect to $\varphi$ if $J \not\models \varphi$.

Let $\Sigma$ be a set of functional dependencies; we use $\sigma$ to refer to the number of functional dependencies in $\Sigma$, i.e., $\sigma = |\Sigma|$. Given a set of functional dependencies $\Sigma$, an instance $I$ is said to be consistent with respect to $\Sigma$ if $I$ satisfies every functional dependency in $\Sigma$. Otherwise, $I$ is inconsistent, denoted as $I \not\models \Sigma$. As a special case, any two-tuple subset $J$ of $I$ is called a conflict in $I$ if there is some $\varphi \in \Sigma$ such that $J$ is a $\varphi$-conflict. That is, $I$ contains one or more conflicts if $I$ is inconsistent.

Example 1.

Our running example is around the schema order(id, name, AC, PR, PN, STR, CTY, CT, ST, zip). Each tuple contains information about an item sold (a unique item id, name and price PR), and the phone number (area code AC, phone number PN) and the address of the customer who purchased the item (street STR, country CTY, city CT, state ST). An instance $I$ of the schema order is shown in Figure 1(a). Some functional dependencies on the order database include:

$fd_1$: [AC, PN] → [STR, CT, ST]
$fd_2$: [zip] → [CT, ST]
$fd_3$: [id] → [name, PR]
$fd_4$: [CT, STR] → [zip]

The database of Figure 1(a) is inconsistent since there are 13 fd-conflicts in total, as listed in Figure 1(b). The meaning of the number assigned to each conflict will be clarified later.

Equivalence Class. Given an FD $\varphi: X \to Y$, an instance $I$ can be partitioned horizontally into several determinant equivalence classes according to the $X$-values, that is, tuples in each determinant equivalence class share the same value of $X$. Moreover, any determinant equivalence class can be further partitioned into several determinant-dependent equivalence classes according to the $Y$-values, denoted as $[xy]$, that is, tuples in each determinant-dependent equivalence class share the same value of $XY$. Obviously, for an FD $\varphi: X \to Y$ and any instance $I$, two tuples $s$ and $t$ that lie in different determinant equivalence classes $[s.X]$ and $[t.X]$ cannot form a $\varphi$-conflict in $I$.

Example 2.

With respect to $fd_2$: [zip] → [CT, ST], instance $I$ can be partitioned into one determinant equivalence class $[\ldots] = \{t_1, t_2, t_3, t_4, t_5, t_6\}$ and 4 determinant-dependent equivalence classes $\{t_{\ldots}, t_{\ldots}\}$, $\{t_{\ldots}, t_{\ldots}\}$, $\{t_{\ldots}\}$ and $\{t_{\ldots}\}$.

Repair. Let $I$ be an instance over a relation schema $R$; a subset of $I$ is an instance $J$ obtained from $I$ by eliminating some tuples. If $J$ is a subset of $I$, then the distance from $J$ to $I$, denoted $dist_{sub}(J, I)$, is the number of tuples of $I$ missing from $J$. Since $J \subseteq I$,

$$dist_{sub}(J, I) = |I \setminus J| = |I| - |J|.$$

Let $I$ be an instance over schema $R$, and let $\Sigma$ be a set of FDs. A consistent subset of $I$ with respect to $\Sigma$ is a subset $J$ of $I$ such that $J \models \Sigma$. A subset repair (s-repair, for short) is a consistent subset that is not strictly contained in any other consistent subset. An optimal subset repair of $I$ is a consistent subset $J$ of $I$ such that $dist_{sub}(J, I)$ is minimum among all consistent subsets of $I$. Note that each optimal subset repair is a repair, but not necessarily vice versa. Clearly, any consistent subset can be transformed in polynomial time into a subset repair with no increase of distance. Unless explicitly stated otherwise, in this paper we do not distinguish between a subset repair and a consistent subset.

Example 3.

Both $S_1 = \{t_{\ldots}\}$ and $S_2 = \{t_{\ldots}, t_{\ldots}\}$ are s-repairs of $I$. It is easy to verify that $S_2$ is an optimal s-repair such that $dist_{sub}(S_2, I) = 4$.

Now, we formally define the first problem studied in this paper as follows.

Definition 1 (OSR Computing).

Given as input an instance $I$ over a relation schema $R$ and a functional dependency set $\Sigma$, the OSR computing problem is to compute an optimal s-repair $J$ of $I$ with respect to $\Sigma$.

Inconsistency Measurement. Computing an optimal s-repair helps in estimating the FD-inconsistency degree of a database. As in [9], given a functional dependency set $\Sigma$, one of the subset-repair-based measurements of the FD-inconsistency degree of an input database $I$ is defined as follows:

$$incDeg(I, \Sigma) = \min_{J \subseteq I,\, J \models \Sigma} \left\{ \frac{dist_{sub}(J, I)}{|I|} \right\} = \frac{dist_{sub}(J_{opt}, I)}{|I|}$$

Moreover, this measurement can also be applied to any part $H$ of the input database $I$ in order to evaluate its corresponding FD-inconsistency degree:

$$incDeg(I, H, \Sigma) = \min_{J \subseteq H,\, J \models \Sigma} \left\{ \frac{dist_{sub}(J, H)}{|H|} \right\} = incDeg(H, \Sigma)$$

The local degree does not depend on the whole of the input data, which gives the right-hand equation. Since the FD-inconsistency degree of any part is defined locally, we use the notation $incDeg(H, \Sigma)$ instead of $incDeg(I, H, \Sigma)$, omitting the first parameter. We now formally define the second problem studied in this paper as follows.

Definition 2 (FD-inconsistency Evaluation).

Given as input a relation schema $R$, an FD set $\Sigma$, an instance $I$ over $R$ and a subset query $Q$ on $I$, FD-inconsistency evaluation is to compute $incDeg(Q(I), \Sigma)$ of the query result $Q(I)$ with respect to $\Sigma$.

Example 4. As mentioned in Example 3, $S_2 = \{t_{\ldots}, t_{\ldots}\}$ is an optimal s-repair of $I$, so $incDeg(I, \Sigma) = \frac{dist_{sub}(S_2, I)}{|I|} = \frac{4}{6}$. Given a range query $Q = [15, 45]$ on attribute PR in order, the result set is $Q(I) = \{t_{\ldots}, t_{\ldots}, t_{\ldots}, t_{\ldots}\}$. $S_2$ is also an optimal s-repair of $Q(I)$, so $incDeg(Q(I), \Sigma) = \frac{dist_{sub}(S_2, Q(I))}{|Q(I)|} = \frac{2}{4}$.

Approximation. We follow the conventional definition of approximation to define the approximation of optimal repairs explicitly. For a constant $c \ge 1$, a $c$-optimal s-repair is an s-repair $J$ of $I$ such that

$$dist_{sub}(J, I) \le c \cdot dist_{sub}(J', I)$$

for all s-repairs $J'$ of $I$. In particular, an optimal s-repair is the same as a 1-optimal s-repair.

According to the definition of the subset-repair-based FD-inconsistency degree, for arbitrary $0 \le \epsilon \le 1$ and $c \ge 1$, $\widetilde{incDeg}(I, \Sigma)$ is a $(c, \epsilon)$-approximation of $incDeg(I, \Sigma)$ if

$$incDeg(I, \Sigma) \le \widetilde{incDeg}(I, \Sigma) \le c \cdot incDeg(I, \Sigma) + \epsilon$$

Complexity. The conventional measure of data complexity is adopted for the computational complexity analysis of the optimal subset repair computing problem in this paper. That is, the relation schema $R(A_1, \ldots, A_k)$ and the functional dependency set $\Sigma$ are fixed in advance, and the instance data $I$ over $R$ is the only input. Therefore, a polynomial running time may have an exponential dependency on $k$ and $|\Sigma|$. In the context of data complexity, each distinct setting of $R(A_1, \ldots, A_k)$ and $\Sigma$ indicates a distinct problem of finding an optimal repair, so that different settings may indicate different complexities. Recall that, under combined complexity, the relation schema and the functional dependency set are considered as inputs; hence, the hardness of the OSR computing problem equals that of the vertex cover problem. However, this is not the case under data complexity.

After showing the hardness, we still adopt data complexity as the measure of running times and approximation ratios; the difference is that we fix only the size of the functional dependency set $\Sigma$, but not the set itself or the schema. Note that this is reasonable in practice: the input functional dependencies may vary with time, but the number of given functional dependencies is always much smaller than the size of the input data $I$, so that we can consider it to be bounded by some constant.
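The definitions of this section can be made concrete with a small sketch. The code below is illustrative only: the toy table, the attribute names in it, and the helper names `is_conflict`, `conflicts`, `optimal_s_repair` and `inc_deg` are assumptions introduced here and are not the paper's Figure 1 or its algorithms. The optimal s-repair is found by exhaustive search, which is exponential and only meant to mirror the definition on tiny inputs.

```python
from itertools import combinations

def is_conflict(s, t, fd):
    """True if s and t agree on the determinant X but differ on the dependent Y."""
    lhs, rhs = fd
    return all(s[a] == t[a] for a in lhs) and any(s[a] != t[a] for a in rhs)

def conflicts(instance, sigma):
    """Index pairs (i, j), i < j, of tuples violating at least one FD in sigma."""
    return [(i, j) for i, j in combinations(range(len(instance)), 2)
            if any(is_conflict(instance[i], instance[j], fd) for fd in sigma)]

def optimal_s_repair(instance, sigma):
    """Largest consistent subset, by exhaustive search (exponential time)."""
    pairs = conflicts(instance, sigma)
    n = len(instance)
    for size in range(n, -1, -1):            # try the largest subsets first
        for keep in combinations(range(n), size):
            kept = set(keep)
            if not any(i in kept and j in kept for i, j in pairs):
                return [instance[i] for i in keep]

def inc_deg(instance, sigma):
    """dist_sub(J_opt, I) / |I| for a non-empty instance."""
    repair = optimal_s_repair(instance, sigma)
    return (len(instance) - len(repair)) / len(instance)

# Hypothetical toy data, not the instance of Figure 1.
toy = [
    {"id": 1, "zip": "10001", "CT": "NYC", "ST": "NY"},
    {"id": 2, "zip": "10001", "CT": "NYC", "ST": "NY"},
    {"id": 3, "zip": "10001", "CT": "Albany", "ST": "NY"},
    {"id": 4, "zip": "30303", "CT": "Atlanta", "ST": "GA"},
]
sigma = [(("zip",), ("CT", "ST"))]           # the single FD zip -> CT, ST

print(conflicts(toy, sigma))                 # [(0, 2), (1, 2)]
print(inc_deg(toy, sigma))                   # 0.25
```

In this toy table the third tuple conflicts with the first two, so deleting it alone yields an optimal s-repair and the FD-inconsistency degree is 1/4.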

3. COMPUTING AN OPTIMAL S-REPAIR

In this section, we show the improved lower bound and upper bound of OSR.

Livshits et al. gave a procedure OSRSucceed($\Sigma$) [27] to simplify a given functional dependency set $\Sigma$. Any functional dependency set can either be simplified in polynomial time into a set containing only trivial functional dependencies, or not. The procedure OSRSucceed($\Sigma$) returns true in the former case and false otherwise. OSR is polynomially tractable for functional dependency sets that can be simplified into trivial ones. For all the other functional dependency sets, the OSR computing problem is not only NP-hard but also APX-complete.

Specifically, any functional dependency set that cannot be simplified further can be classified into one of five certain classes of functional dependency sets, and OSR is shown to be APX-complete for any such functional dependency set by fact-wise reductions from one of the following four fixed schemas.

$$\Sigma_{A \to B \to C} = \{A \to B,\ B \to C\}$$
$$\Sigma_{A \to B \leftarrow C} = \{A \to B,\ C \to B\}$$
$$\Sigma_{AB \to C \to B} = \{AB \to C,\ C \to B\}$$
$$\Sigma_{AB \leftrightarrow AC \leftrightarrow BC} = \{AB \to C,\ AC \to B,\ BC \to A\}$$

By showing the inapproximability of these four schemas, the following dichotomy follows immediately.

Theorem 1 (Dichotomy for OSR computing [27]). Let $\Sigma$ be a set of FDs, then
• An optimal subset repair can be computed in polynomial time, if OSRSucceed($\Sigma$) returns true;
• Computing an optimal subset repair is APX-complete, if OSRSucceed($\Sigma$) returns false.

In this paper, we give a more careful analysis to show a concrete constant for each of the four schemas, thus strengthening this dichotomy.

Lemma 1. For FD sets $\Sigma_{A \to B \to C}$, $\Sigma_{A \to B \leftarrow C}$, and $\Sigma_{AB \to C \to B}$, there is no polynomial-time $(17/16 - \epsilon)$-approximation algorithm for computing an optimal subset repair for any $\epsilon > 0$, unless NP=P.

Proof. We show three similar gap-preserving reductions (i.e., $\prec_G$) from MAX NM-E3SAT to establish the lower bound. Note that in MAX NM-E3SAT, (i) no variable $x$ occurs more than once in any clause, and (ii) each clause is monotone, of the form either $(x + y + z)$ or $(\bar{x} + \bar{y} + \bar{z})$. Each of the following reductions generates a corresponding table instance $I$ with schema $\langle A, B, C \rangle$.

MAX NM-E3SAT $\prec_G$ $\Sigma_{A \to B \to C}$. For each clause $c_i$, build three tuples. If $c_i$ contains a positive literal of variable $x_j$, then build $(i, j, j)$. If $c_i$ contains a negative literal of variable $x_j$, then build $(i, j, \bar{j})$. Intuitively, $A \to B$ guarantees that exactly one of the three tuples survives once the corresponding clause is satisfied, and $B \to C$ ensures the consistent assignment of each variable.

MAX NM-E3SAT $\prec_G$ $\Sigma_{A \to B \leftarrow C}$. By simply exchanging the columns $B$ and $C$, we get the second reduction. Concretely, for each clause $c_i$, build three tuples. If $c_i$ contains a positive literal of variable $x_j$, then build $(i, j, j)$. If $c_i$ contains a negative literal of variable $x_j$, then build $(i, \bar{j}, j)$. Intuitively, $A \to B$ guarantees that exactly one of the three tuples survives once the corresponding clause is satisfied, and $C \to B$ ensures the consistent assignment of each variable.

MAX NM-E3SAT $\prec_G$ $\Sigma_{AB \to C \to B}$. By slightly modifying the way tuples are generated, we get the third reduction. Concretely, (i) for each variable $x_i$, build two $x$-tuples $(x_i, 1, x_i)$ and $(x_i, 0, x_i)$; (ii) for each clause $c_i$, build three $c$-tuples: if $c_i$ contains a positive literal of variable $x_j$, then build $(c_i, 1, x_j)$; if $c_i$ contains a negative literal of variable $x_j$, then build $(c_i, 0, x_j)$. Intuitively, there are $2n + 3m$ tuples created, $AB \to C$ guarantees that exactly one of the three tuples survives once the corresponding clause is satisfied, and $C \to B$ ensures the consistent assignment of each variable.

Lower bound. We show the proof for $\Sigma_{AB \to C \to B}$; the other two are similar. For any instance $\phi$ of the MAX NM-E3SAT problem, we denote the corresponding table built by the reduction as $I_\phi$. Let $\tau(\phi)$ be the number of clauses satisfied by an assignment $\tau$ on $\phi$, and $\tau_{max}(\phi)$ be the number of clauses satisfied by an optimal assignment $\tau_{max}$ on $\phi$.

Claim 1. Any tuple deletion $\Delta$ should contain at least two of the three tuples having the same value $i$ on the attribute $A$, for any $1 \le i \le m$.

Claim 2. Any tuple deletion $\Delta$ should contain either the set of tuples $(c_i, 1, x_j)$ or the set of tuples $(c_i, 0, x_j)$, for any $1 \le i \le m$, $1 \le j \le n$.

FD $AB \to C$ guarantees the first claim, and FD $C \to B$ ensures that an assignment $\tau$ can always be derived from $I \setminus \Delta$, such that

$$\tau(x_i) = \begin{cases} 1, & \text{if } (x_i, 1, x_i) \in I \setminus \Delta, \\ 0, & \text{otherwise.} \end{cases}$$

Claim 3. Let $\Delta_{min}$ be a minimum tuple deletion; then $\Delta_{min}$ does not contain $(x_i, 1, x_i)$ and $(x_i, 0, x_i)$ simultaneously for any variable $x_i$.

Proof by contradiction. Suppose not; then there must exist another solution $\Delta'$ obtained by returning tuple $(x_i, 1, x_i)$ or $(x_i, 0, x_i)$ from $\Delta_{min}$ into $I \setminus \Delta_{min}$ without producing any inconsistency, thus resulting in a solution $\Delta'$ smaller than the optimal one.

Based on the three claims, we have

$$\tau_{max}(\phi) = |I_\phi| - |\Delta_{min}| - n$$

and, for any solution $\Delta$ of $I_\phi$,

$$\tau(\phi) \ge |I_\phi| - |\Delta| - n.$$

Additionally, we have the fact that

$$|I_\phi| = 2n + 3m.$$

Now, suppose $\Delta$ is an $r$-approximation ($r > 1$) of $\Delta_{min}$ such that $|\Delta| \le r \cdot |\Delta_{min}|$; then

$$\frac{\tau(\phi)}{\tau_{max}(\phi)} \ge \frac{|I_\phi| - |\Delta| - n}{|I_\phi| - |\Delta_{min}| - n} \ge \frac{|I_\phi| - r \cdot |\Delta_{min}| - n}{|I_\phi| - |\Delta_{min}| - n} = 1 + (1 - r) \cdot \frac{|\Delta_{min}|}{|I_\phi| - |\Delta_{min}| - n} \tag{1}$$

Since each clause has exactly 3 literals, we have

$$|\Delta_{min}| \ge n + 2 \cdot \frac{|I_\phi| - 2n}{3}$$

$$\frac{|\Delta_{min}|}{|I_\phi| - |\Delta_{min}| - n} \ge \frac{2|I_\phi| - n}{|I_\phi| - 2n} = 2 + \frac{3n}{|I_\phi| - 2n}$$

and $|I_\phi| = 2n + 3m > 2n$; therefore we get

$$\frac{|\Delta_{min}|}{|I_\phi| - |\Delta_{min}| - n} > 2, \qquad \frac{\tau(\phi)}{\tau_{max}(\phi)} > 3 - 2r.$$

That is, if there is an $r$-approximation of OSR, then MAX NM-E3SAT can be approximated within $3 - 2r$. However, if OSR can be approximated in polynomial time within $17/16$, then there exists a polynomial-time approximation better than $7/8$ for the MAX NM-E3SAT problem, which contradicts the hardness result shown in [23].

One can verify the lower bounds for $\Sigma_{A \to B \to C}$ and $\Sigma_{A \to B \leftarrow C}$ in the same way, and the lemma follows immediately. One can refer to the appendix for more detail.

To deal with the last case, by carefully merging the four $L_{\alpha,\beta}$-reductions { MAX B29-3SAT $\prec_L$ […] [24], […] $\prec_L$ MAX 3SC [24], MAX 3SC $\prec_L$ Triangle [4], Triangle $\prec_L$ $\Sigma_{AB \leftrightarrow AC \leftrightarrow BC}$ [27] }, we have the following lemma.

Lemma 2. For FD set $\Sigma_{AB \leftrightarrow AC \leftrightarrow BC}$, there is no polynomial-time $(69246103/[\ldots] - \epsilon)$-approximation algorithm for computing an optimal subset repair for any $\epsilon > 0$, unless NP=P.

Proof. By merging the $L_{\alpha,\beta}$-reductions mentioned above, if computing an OSR for FD set $\Sigma_{AB \leftrightarrow AC \leftrightarrow BC}$ could be approximated within the ratio stated in the lemma, then there would exist a polynomial-time approximation for the MAX B29-3SAT problem whose ratio contradicts the hardness result shown in [17].

Lemmas 1 and 2 together with Theorem 1 yield the strengthened dichotomy for OSR computing stated in Theorem 2.
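The first reduction in the proof of Lemma 1 (from MAX NM-E3SAT to $\Sigma_{A \to B \to C}$) can be checked on a toy formula. The sketch below is an illustration under assumptions made here: clauses are encoded as `(positive, (x, y, z))`, the value $j$ or $\bar{j}$ in column $C$ is encoded as the pair `(j, positive)`, and the function names are hypothetical. Both the largest consistent subset and the best assignment are found by brute force, so this only works for very small formulas.

```python
from itertools import combinations, product

def build_instance(clauses):
    """One tuple (A, B, C) = (i, j, j or j-bar) per literal of clause i."""
    return [(i, j, (j, positive))
            for i, (positive, variables) in enumerate(clauses)
            for j in variables]

def consistent(tuples):
    """Check A -> B and B -> C on every pair of tuples."""
    return all(not (s[0] == t[0] and s[1] != t[1]) and
               not (s[1] == t[1] and s[2] != t[2])
               for s, t in combinations(tuples, 2))

def max_consistent_subset_size(instance):
    """Size of an optimal s-repair, by exhaustive search."""
    for size in range(len(instance), -1, -1):
        if any(consistent(sub) for sub in combinations(instance, size)):
            return size

def max_sat(clauses):
    """Maximum number of clauses satisfied by any truth assignment."""
    variables = sorted({v for _, vs in clauses for v in vs})
    best = 0
    for bits in product([False, True], repeat=len(variables)):
        tau = dict(zip(variables, bits))
        # A positive clause needs a true variable, a negative one a false variable.
        sat = sum(any(tau[v] == positive for v in vs) for positive, vs in clauses)
        best = max(best, sat)
    return best

# Hypothetical monotone E3SAT formula:
# (x1 + x2 + x3), (~x1 + ~x2 + ~x3), (x2 + x3 + x4)
clauses = [(True, (1, 2, 3)), (False, (1, 2, 3)), (True, (2, 3, 4))]
instance = build_instance(clauses)
print(len(instance), max_consistent_subset_size(instance), max_sat(clauses))
```

On this formula the number of surviving tuples in an optimal s-repair coincides with the maximum number of simultaneously satisfiable clauses, which is the correspondence the reduction relies on.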

Theorem 2 (A strengthened dichotomy for OSR). Let $\Sigma$ be a set of FDs, then
• An optimal subset repair can be computed in polynomial time, if OSRSucceed($\Sigma$) returns true;
• There is no polynomial-time $(69246103/[\ldots] - \epsilon)$-approximation to compute an optimal subset repair, if OSRSucceed($\Sigma$) returns false and $\Sigma$ can be classified into the class having a fact-wise reduction from $\Sigma_{AB \leftrightarrow AC \leftrightarrow BC}$ to itself;
• There is no polynomial-time $(17/16 - \epsilon)$-approximation to compute an optimal subset repair, otherwise.

For the polynomial-intractable side, one can simply verify that if the size of the FD set is unbounded, then OSR computing is as hard as the classical vertex cover problem on general inputs, which is hard to approximate within $2 - \epsilon$ for any $\epsilon > 0$ under the Unique Games Conjecture. A simple approximation algorithm can provide a ratio of 2 when the input FD set is unbounded.

However, in practice, the size of the FD set is usually much smaller than the size of the data, so that it can be treated as fixed, especially in the context of big data. Unfortunately, it is still unclear how good a ratio can be reached when the size of the FD set is bounded. Therefore, to study the upper bound of its data complexity, we next give a carefully designed approximation that achieves a ratio of $2 - 0.5^{\sigma-1}$ when the number of given FDs is $\sigma$, or sometimes even better.

To investigate the upper bound of the optimal s-repair computing problem, we start from a basic linear program that provides a ratio of $2 - \frac{2}{\mathcal{X}(I)}$ for an input instance $I$ over a given relation schema $R$ and an input FD set $\Sigma$, where $\mathcal{X}(I)$ is the number of all possible determinant-dependent equivalence classes of the input instance $I$ with respect to the input FD set $\Sigma$. Then, an improved approximation ratio of $2 - 0.5^{\sigma-1}$ can be derived by means of triad elimination. Finally, we find another $(2 - \eta_k + \frac{\eta_k}{k})$-approximation, which is sometimes, but not always, better than $2 - 0.5^{\sigma-1}$, based on a $k$-quasi-Turán characterization of the input inconsistent instance with respect to the input $\Sigma$.

We start from the basic linear programming model, which is equivalent to the classical one for solving the minimum vertex cover problem.

Let $x_i$ be a 0-1 variable indicating the elimination of tuple $t_i$: $x_i = 1$ if $t_i$ is eliminated and $x_i = 0$ otherwise. Then we formulate the OSR computing problem as follows:

$$\text{minimize} \quad \sum_{t_i \in I} x_i \tag{2}$$
$$\text{s.t.} \quad x_i + x_j \ge 1, \quad \forall \{t_i, t_j\} \not\models \Sigma, \tag{3}$$
$$x_i \ge 0, \quad \forall t_i \in I \tag{4}$$

It is well known that every extreme point of this model takes value 0, 0.5 or 1; hence, we can relax it with the condition $x_i \in \{0, 0.5, 1\}$, thus getting

$$OPT_{relax} \le OPT$$
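For comparison, the factor-2 baseline mentioned above can be sketched with the textbook maximal-matching heuristic for vertex cover applied to the conflict pairs. This is a standard technique swapped in for illustration, not necessarily the simple algorithm the authors have in mind; the function name and the toy conflict list are assumptions. Every deleted pair is a conflict of which any feasible deletion must remove at least one tuple, and the selected pairs are disjoint, which gives the factor 2.

```python
def matching_based_deletion(conflict_pairs):
    """Delete both tuples of each conflict not yet broken by earlier deletions."""
    deleted = set()
    for i, j in conflict_pairs:
        if i not in deleted and j not in deleted:
            deleted.update((i, j))       # (i, j) joins a maximal matching
    return deleted

# Hypothetical conflicts forming a path t0 - t1 - t2 - t3.
pairs = [(0, 1), (1, 2), (2, 3)]
deleted = matching_based_deletion(pairs)
print(sorted(deleted))                   # [0, 1, 2, 3]
```

Here the heuristic deletes four tuples while deleting $t_1$ and $t_2$ alone would suffice, so the ratio of 2 is attained exactly on this input.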

A trivial rounding derives a ratio of 2 immediately. However, based on a partition of instance $I$ with respect to FD set $\Sigma$, a better ratio depending on the size of the partition can be obtained.

Obviously, for any FD $\varphi_i: X_i \to Y_i$ of $\Sigma$ with size $\sigma$, each tuple $t$ belongs to one and only one determinant-dependent equivalence class with respect to $\varphi_i$, say $[t.X_iY_i]$; then we have

$$t \in [t.X_1Y_1] \cap \cdots \cap [t.X_\sigma Y_\sigma] = [t.Z],$$

where $Z = X_1 \cup \cdots \cup X_\sigma \cup Y_1 \cup \cdots \cup Y_\sigma$. Hence, we observe that if any two tuples $s$ and $t$ are in some conflict, then there must be

$$s \notin [t.Z], \quad t \notin [s.Z],$$

and vice versa, since they disagree on at least one attribute in some $Y_i$ but agree on all the attributes in $X_i$.

Furthermore, another observation is that all the tuples in conflict with $t$ are included in the determinant equivalence classes

$$[t.X] = [t.X_1] \cup \cdots \cup [t.X_\sigma]$$

Because all tuples in each $[t.X_i]$ may, at worst, be inconsistent with each other, every tuple in $[t.X]$ may be inconsistent with at most $|[t.X_1]| \times \cdots \times |[t.X_\sigma]| - 1$ tuples. Let $\mathcal{X}(t)$ be the number of tuples that are in conflict with $t$, and $\mathcal{X}(I)$ be the number of classes that $I$ can be partitioned into such that each class is consistent. This observation implies the following claim immediately.

Claim. $\mathcal{X}(I) \le \max_{t \in I}\{\mathcal{X}(t)\} \le \max_{t \in I}\{|[t.X_1]| \times \cdots \times |[t.X_\sigma]|\}$

This claim implies that all the tuples in $I$ can be partitioned into at most $\mathcal{X}(I)$ classes such that the tuples in each class are consistent with each other.

Then, we improve the ratio by using the $\mathcal{X}(I)$ partitions of the input instance. Based on a rounding technique similar to that of [29], an improved approximation algorithm can be stated as follows.

Algorithm 1

Baseline LP-OSR

Input: $n$-tuple instance $I$ over schema $R$, FD set $\Sigma$
Output: approximately optimal subset repair $\hat{J}$ of $I$ with respect to $\Sigma$

1: Solve the linear programming (2)-(4) to obtain a solution $x_1 \ldots x_n$ such that $x_i \in \{0, 0.5, 1\}$ for all $1 \le i \le n$.
2: Let $P_j$ be the set of tuples of some consistent partition of $I$ with respect to $\Sigma$
3: $j \leftarrow \arg\max_j |\{x_i \mid t_i \in P_j \wedge x_i = 0.5\}|$
4: for each $t_i \in I$ do
5: if $x_i = 1$ or ($x_i = 0.5$ and $t_i \notin P_j$) then
6: add $t_i$ into $\hat{\Delta}$
7: end if
8: end for
9: $\hat{J} \leftarrow I \setminus \hat{\Delta}$
10: return $\hat{J}$
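The rounding of steps 3 to 9 can be sketched as follows, assuming that a half-integral solution `x` and a consistent partition are already available. The function name `baseline_round` and the encoding of the partition as one class label per tuple are hypothetical choices for this illustration; ties between classes are broken by the smallest label.

```python
def baseline_round(x, part):
    """Round a half-integral LP solution as in steps 3-9 of Algorithm 1.

    x    -- list with x[i] in {0, 0.5, 1} for tuple t_i
    part -- part[i] is the label of the consistent class containing t_i
    Returns (kept tuple indices, removed tuple indices).
    """
    # Count the 0.5-valued tuples of every class P_j.
    half_count = {}
    for xi, p in zip(x, part):
        if xi == 0.5:
            half_count[p] = half_count.get(p, 0) + 1

    # Step 3: the class holding the most 0.5-valued tuples is spared.
    best = max(sorted(half_count), key=half_count.get) if half_count else None

    # Steps 4-8: remove every 1-valued tuple and every 0.5-valued tuple
    # that lies outside the spared class.
    removed = {i for i, (xi, p) in enumerate(zip(x, part))
               if xi == 1 or (xi == 0.5 and p != best)}
    kept = [i for i in range(len(x)) if i not in removed]
    return kept, removed


if __name__ == "__main__":
    # Tuples 0, 1, 2 conflict pairwise; tuples 3 and 4 conflict with each other.
    conflicts = [(0, 1), (1, 2), (0, 2), (3, 4)]
    x = [0.5, 0.5, 0.5, 0, 1]
    part = [0, 1, 2, 0, 1]          # every class is conflict free
    kept, removed = baseline_round(x, part)
    assert all(not (i in kept and j in kept) for i, j in conflicts)
    print(kept, sorted(removed))
```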

$OPT_{relax}$ can be returned in polynomial time as shown in [33], and it is easy to see that $\hat{J}$ is an s-repair. In fact, if a tuple $t_i$ with $x_i = 0.5$ is kept, then every tuple in conflict with $t_i$ must be added into $\hat{\Delta}$, because such tuples are not in partition $P_j$ and the sum of the two variables of the tuples in any conflict is no less than 1. Therefore, we claim that the approximation ratio is $2 - \frac{2}{\mathcal{X}(I)}$.

Lemma. Algorithm 1 returns a $\left(2 - \frac{2}{\mathcal{X}(I)}\right)$-optimal subset repair.

Proof. Define notations $S_1$, $S_{0.5}$ and $S_{P_j}$ as follows,

$$S_1 = \{x_i \mid t_i \in I, x_i = 1\}, \quad S_{0.5} = \{x_i \mid t_i \in I, x_i = 0.5\}, \quad S_{P_j} = \{x_i \mid t_i \in P_j, x_i = 0.5\}.$$

Then, the following holds obviously,

$$OPT_{relax} := \sum_{t_i \in S_1 \cup S_{0.5}} x_i \le OPT \le |\Delta_{min}| = dist_{sub}(J_{opt}, I).$$

Second, we have that

$$\begin{aligned}
dist_{sub}(\hat{J}, I) = |\hat{\Delta}| &\le |S_1| + |S_{0.5}| - |S_{P_j}| \\
&= \sum_{i \in S_1} x_i + 2 \sum_{i \in S_{0.5}} x_i - 2 \sum_{i \in S_{P_j}} x_i \\
&\le \sum_{i \in S_1} x_i + 2 \sum_{i \in S_{0.5}} x_i - 2 \cdot \frac{1}{\mathcal{X}(I)} \sum_{i \in S_{0.5}} x_i \\
&\le \sum_{i \in S_1} x_i + \left(2 - \frac{2}{\mathcal{X}(I)}\right) \sum_{i \in S_{0.5}} x_i \\
&\le \left(2 - \frac{2}{\mathcal{X}(I)}\right) \sum_{i \in S_1 \cup S_{0.5}} x_i \\
&\le \left(2 - \frac{2}{\mathcal{X}(I)}\right) \cdot OPT \\
&\le \left(2 - \frac{2}{\mathcal{X}(I)}\right) \cdot dist_{sub}(J_{opt}, I)
\end{aligned}$$

Therefore, $\hat{J}$ is a $\left(2 - \frac{2}{\mathcal{X}(I)}\right)$-optimal subset repair of $I$.

The number $\mathcal{X}(I)$ is unbounded; in the worst case it could be as large as $|I|$, so it is a factor depending on the size of the input.

Reducing the number of consistent partitions will improve the approximation. We introduce triad elimination in this section to decrease the partition number to a factor that is independent of the size of the input and depends only on the number $\sigma = |\Sigma|$ of input functional dependencies.

Data reduction. Let $r$, $s$, $t$ be three tuples in $I$; they are called a triad if any two of them are in conflict with respect to $\Sigma$. An important observation is that any s-repair, in particular an optimal s-repair, contains at most one tuple of a triad in $I$; hence, any triad elimination yields a ratio of 1.5.

Algorithm 2

TE LP-OSR

Input: $n$-tuple instance $I$ over schema $R$, FD set $\Sigma$
Output: approximately optimal subset repair $\hat{J}$ of $I$ with respect to $\Sigma$

1: Find a maximal tuple set of disjoint triads $\Delta$ from $I$
2: $I' \leftarrow I \setminus \Delta$
3: $\hat{J} \leftarrow$ Baseline LP-OSR($I'$)
4: return $\hat{J}$

Let $\sigma$ be the number of functional dependencies in $\Sigma$; then we claim that TE LP-OSR returns a better approximation, as stated in the following theorem.
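Step 1 of Algorithm 2 can be sketched with hashing, restricted to triads with respect to one functional dependency at a time. This is a minimal sketch under assumptions: tuples are modelled as dictionaries, an FD is a pair of attribute-name tuples `(lhs, rhs)`, and the name `eliminate_triads` is hypothetical. Within one determinant equivalence class, three tuples from three different determinant-dependent classes always form a triad, so the loop removes such triples until at most two dependent classes remain.

```python
from collections import defaultdict


def eliminate_triads(tuples, fds):
    """Greedily remove disjoint triads with respect to each single FD.

    tuples -- list of dicts mapping attribute name to value
    fds    -- list of (lhs, rhs) pairs, each a tuple of attribute names
    Returns (sorted surviving indices, list of removed triads).
    """
    alive = set(range(len(tuples)))
    removed = []
    for lhs, rhs in fds:
        # Hash the surviving tuples by determinant value, then by dependent value.
        classes = defaultdict(lambda: defaultdict(list))
        for i in sorted(alive):
            t = tuples[i]
            classes[tuple(t[a] for a in lhs)][tuple(t[a] for a in rhs)].append(i)
        for groups in classes.values():
            stacks = list(groups.values())
            while True:
                stacks = [g for g in stacks if g]      # non-empty dependent classes
                if len(stacks) < 3:
                    break                               # no triad left in this class
                triad = [g.pop() for g in stacks[:3]]   # one tuple from three classes
                removed.append(tuple(triad))
                alive.difference_update(triad)
    return sorted(alive), removed


if __name__ == "__main__":
    data = [
        {"zip": "1", "city": "A"}, {"zip": "1", "city": "B"},
        {"zip": "1", "city": "C"}, {"zip": "1", "city": "A"},
        {"zip": "2", "city": "D"},
    ]
    print(eliminate_triads(data, [(("zip",), ("city",))]))
```

After the loop, every determinant equivalence class of every FD contains at most two dependent classes among the surviving tuples, which is the property used in the proof of the theorem.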

Theorem. Algorithm TE LP-OSR returns a $\left(2 - 0.5^{\sigma-1}\right)$-optimal subset repair.

Proof.

Algorithm 2 does find an s-repair: all the triads are eliminated from $I$, so conflicts involving any tuple in $\Delta$ are removed from $I$, and all conflicts in the reduced data $I \setminus \Delta$ are removed by Baseline LP-OSR. Moreover, any optimal s-repair of $I$ contains at most one tuple of a triad, which yields a ratio of 1.5 on the eliminated part of $I$; formally, we have

$$dist_{sub}(\emptyset, \Delta) \le 1.5 \cdot dist_{sub}(\Delta_{opt}, \Delta).$$

Additionally, consider each determinant equivalence class with respect to any single FD: there is no triad in it, that is, the reduced instance can be partitioned into 2 consistent classes with respect to this FD. It implies that the reduced instance can be partitioned into $2^{\sigma}$ consistent classes with respect to $\Sigma$. Due to lemma 3, $\hat{J}$ is a $\left(2 - 0.5^{\sigma-1}\right)$-optimal s-repair of $I \setminus \Delta$. Let $J'_{opt}$ be the optimal s-repair of $I \setminus \Delta$; then

$$dist_{sub}(\hat{J}, I \setminus \Delta) \le \left(2 - 0.5^{\sigma-1}\right) \cdot dist_{sub}(J'_{opt}, I \setminus \Delta).$$

And $\hat{J} \cup \emptyset$ is an s-repair of $(I \setminus \Delta) \cup \Delta = I$; hence,

$$dist_{sub}(\hat{J} \cup \emptyset, I) \le \max\{1.5,\ 2 - 0.5^{\sigma-1}\} \cdot dist_{sub}(J_{opt}, I).$$

Without loss of generality, we have $\sigma \ge 2$; then Algorithm 2 returns a $\left(2 - 0.5^{\sigma-1}\right)$-approximation.

Remarks. Note that this ratio depends only on the size of the functional dependency set rather than on the scale of the input data. Therefore, a simple corollary implies a ratio of 1.5 for $\Sigma_{A \rightarrow B \rightarrow C}$, $\Sigma_{A \rightarrow B \leftarrow C}$, and $\Sigma_{AB \rightarrow C \rightarrow B}$, and 1.75 for $\Sigma_{AB \leftrightarrow AC \leftrightarrow BC}$, no matter how large the input data is.

A naive enumeration of triads is time-consuming. In our algorithm, as in the proof of theorem 3, it is not necessary to eliminate as many disjoint triads as possible. Instead, to obtain a good ratio, it only needs to eliminate all disjoint triads with respect to each single functional dependency. Then, for each single functional dependency, sorting or hashing techniques can be used to speed up the triad elimination, and the search for triads across different functional dependencies can be skipped.

$k$-quasi-Turán property

Triad elimination based

TE LP-OSR does not capture the characteristics of the input data instance. We next give another approximation algorithm, QT LP-OSR. In fact, we found that constraints can be derived to strengthen the LP formulation. Intuitively, for each functional dependency, each determinant equivalence class contains several determinant-dependent equivalence classes, say $k$; hence, the tuples in at least $k - 1$ of these classes have to be removed from $I$ to obtain an s-repair. Therefore, constraints can be introduced to bound from below the number of variables taking value 1 according to these $k - 1$ classes. Consider a determinant equivalence class $[p]$ containing $m$ determinant-dependent equivalence classes $[pq_1], \ldots, [pq_m]$; hence,

$$|[p]| = |[pq_1]| + \cdots + |[pq_m]|.$$

$k$-quasi-Turán. Given $k > 1$, a tuple $t \in I$ is of $k$-quasi-Turán property if and only if there is some functional dependency $\varphi$ and a determinant equivalence class $[p]$ with respect to $\varphi$ such that

$$t \in [p], \quad m \ge 2, \quad \forall i,\ 1 \le i \le m,\ |[p]| - |[pq_i]| \ge k\,|[pq_i]|.$$

Example 5.

As mentioned in Example 2, given the functional dependency fd : [zip] → [CT, ST], one determinant equivalence class (a single zip value) is partitioned into $m = 4$ determinant-dependent equivalence classes. It is easy to verify that this class is 2-quasi-Turán.

Then we characterize the data with the parameter $\eta_k$, which is the portion of $k$-quasi-Turán tuples in $I$. A strengthened LP can be formulated as follows,

$$\begin{aligned}
\text{minimize} \quad & \sum_{t_i \in I} x_i \\
\text{s.t.} \quad & x_i \ge 0, \quad \forall t_i \in I \\
& x_i + x_j \ge 1, \quad \forall \{t_i, t_j\} \not\models \Sigma, \\
& \sum_{t_i \in [p]} x_i > |[p]| - \max_j |[pq_j]| - \epsilon, \quad \text{for every } [p].
\end{aligned}$$

In this model, pick a small enough $\epsilon > 0$; the inequality guarantees that in any integral solution at least $|[p]| - \max_j |[pq_j]|$ variables take a value of 1. However, in a fractional solution we cannot limit the number of 1-variables; for example, a sloped line cannot distinguish the point $(0.5, 0.5, 1)$ from the point $(1, \ldots)$.

Claim. Every extreme point of any solution to the linear programming is in $\{0, 0.5, 1\}$.

One can simply verify the correctness and prove it by contradiction; we omit the proof here.

Every solution of this strengthened linear programming still admits the half-integral property; hence, we take the basic rounding strategy

$$x_i = \begin{cases} 0, & \text{if } x_i = 0, \\ 1, & \text{if } x_i = 0.5, \\ 1, & \text{if } x_i = 1. \end{cases}$$

Then, for the tuples in each determinant equivalence class $[p]$, at most $\max_j |[pq_j]|$ variables will be rounded to 1 wrongly. Formally, for each determinant equivalence class $[p]$, define $S^{[p]}_1$ and $S^{[p]}_{0.5}$ as follows,

$$S^{[p]}_1 := \{x_i \mid t_i \in [p], x_i = 1\}, \quad S^{[p]}_{0.5} := \{x_i \mid t_i \in [p], x_i = 0.5\},$$

then we have the following lemma.

Lemma. $\left|S^{[p]}_1\right| \ge |[p]| - 2\max_j |[pq_j]|$, and $\left|S^{[p]}_{0.5}\right| \le 2\max_j |[pq_j]|$.

Proof. Due to the constraint

$$\sum_{t_i \in [p]} x_i > |[p]| - \max_j |[pq_j]| - \epsilon,$$

we have

$$\left|S^{[p]}_1\right| + 0.5 \left|S^{[p]}_{0.5}\right| > |[p]| - \max_j |[pq_j]| - \epsilon.$$

The worst case is that $|S^{[p]}_{0.5}| \le |[p]| - |S^{[p]}_1|$; then we have

$$\left|S^{[p]}_1\right| + 0.5 \left(|[p]| - \left|S^{[p]}_1\right|\right) > |[p]| - \max_j |[pq_j]| - \epsilon,$$

that is,

$$\left|S^{[p]}_1\right| > |[p]| - 2\max_j |[pq_j]| - 2\epsilon.$$

Pick a small $\epsilon$ such that $\epsilon < 0.5$; then this lemma follows.

This lemma derives a ratio depending on the portion $\eta_k$ of $k$-quasi-Turán tuples, where $0 < \eta_k \le 1$ and $k > 1$.

Theorem. QT LP-OSR returns a $(2 - \eta_k + \frac{\eta_k}{k})$-optimal subset repair.

Proof. Let $OPT$ be the fractional optimal solution of the strengthened LP, thus $OPT = |S_1| + 0.5\,|S_{0.5}|$. Consider the subset $H$ of all the $k$-quasi-Turán tuples, and let the part of the solution intersecting with $H$ be $OPT_H = \left|S^H_1\right| + 0.5 \left|S^H_{0.5}\right|$. Let $\hat{J}$ be the approximated s-repair; then in $H$, the number of tuples rounded out of the approximated s-repair is

$$|(I \setminus \hat{J}) \cap H| = OPT_H + 0.5 \left|S^H_{0.5}\right|, \quad \text{and} \quad |(I \setminus J_{opt}) \cap H| \ge OPT_H,$$

then we derive the ratio as follows,

$$\frac{|(I \setminus \hat{J}) \cap H|}{|(I \setminus J_{opt}) \cap H|} \le \frac{OPT_H + 0.5 \left|S^H_{0.5}\right|}{OPT_H} \le 1 + \max_{[p]} \left\{ \frac{0.5 \cdot 2\,|[pq_{max}]|}{|[p]| - |[pq_{max}]|} \right\} \le 1 + \frac{1}{k}.$$

Then for the other part,

$$\frac{|(I \setminus \hat{J}) \cap (I \setminus H)|}{|(I \setminus J_{opt}) \cap (I \setminus H)|} \le 2.$$

Hence,

$$\frac{dist_{sub}(\hat{J}, I)}{dist_{sub}(J_{opt}, I)} \le \eta_k \left(1 + \frac{1}{k}\right) + 2(1 - \eta_k) = 2 - \eta_k + \frac{\eta_k}{k}.$$

Combining the approximations based on the strengthened LP and the triad elimination provides a better approximation. Note that a best pair $(k, \eta_k)$, which captures the data characteristics as well as possible, can be found in polynomial time, so as to improve the ratio as much as possible.
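To make the role of the pair $(k, \eta_k)$ concrete, the following minimal sketch marks the $k$-quasi-Turán tuples of an instance and evaluates the resulting ratio $2 - \eta_k + \eta_k / k$. The function name `quasi_turan_fraction`, the dictionary representation of tuples and the `(lhs, rhs)` encoding of an FD are assumptions for illustration. The sketch tests only the size inequality of the definition; for $k > 1$ that inequality already forces a class to contain at least three dependent classes.

```python
from collections import Counter, defaultdict


def quasi_turan_fraction(tuples, fds, k):
    """Return (eta_k, 2 - eta_k + eta_k / k) for the given instance.

    A tuple is marked if, for some FD, its determinant equivalence class [p]
    satisfies |[p]| - |[pq_i]| >= k * |[pq_i]| for every dependent class [pq_i],
    which is equivalent to testing the largest dependent class only.
    """
    marked = set()
    for lhs, rhs in fds:
        classes = defaultdict(list)
        for i, t in enumerate(tuples):
            classes[tuple(t[a] for a in lhs)].append(i)
        for members in classes.values():
            sizes = Counter(tuple(tuples[i][a] for a in rhs) for i in members)
            largest = max(sizes.values())
            if len(members) - largest >= k * largest:
                marked.update(members)
    eta = len(marked) / len(tuples)
    return eta, 2 - eta + eta / k


if __name__ == "__main__":
    data = [{"zip": "1", "city": c} for c in "ABCD"]
    data += [{"zip": "2", "city": "E"}, {"zip": "2", "city": "E"}]
    # The zip "1" class has four dependent classes of size one: 4 - 1 >= 2 * 1.
    print(quasi_turan_fraction(data, [(("zip",), ("city",))], k=2))
```

Scanning a range of candidate values of `k` with this function is one straightforward way to pick the pair with the smallest ratio.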

4. FAST ESTIMATE FD-INCONSISTENCY DEGREE

The hardness of OSR computing implies that FD-inconsistency degree evaluation is also hard. Therefore, we make an effort to find an approximation of this degree. Fortunately, an observation is that we aim to compute the ratio, not any OSR itself; hence, to achieve a constant relative ratio, a relaxation of the approximation ratio with an $O(n)$ factor is allowed. In this section, we show a fast FD-inconsistency evaluation of subset query results. To obtain a good approximation in sublinear complexity, we allow a relative ratio of 2 and an additional additive error $\epsilon$ where $0 < \epsilon < 1$, i.e., given an FD set $\Sigma$ and a subset query $Q$ on an instance $I$, the algorithm computes an estimation $\widetilde{incDeg}(Q(I), \Sigma)$ such that, with high constant probability,

$$incDeg(Q(I), \Sigma) \le \widetilde{incDeg}(Q(I), \Sigma) \le 2 \cdot incDeg(Q(I), \Sigma) + \epsilon.$$

Because subset queries are diverse, we model them as a $\subseteq$-oracle, so that the query complexity of the algorithm can be analyzed in terms of the operations supported by the oracle. The remaining work is to find a way of implementing the $\subseteq$-oracle for a specific subset query. The time complexity of FD-inconsistency evaluation for this kind of subset query can then be derived by combining the query complexity and the time complexity of the oracle.

Given an instance $I$ of a relation schema $R$ and a subset query $Q$, the corresponding $\subseteq$-oracle $\mathcal{O}(I, Q)$ is required to answer three queries about the result $Q(I)$:

$\mathcal{O}(I, Q)$.sample_tuple(). Since the algorithm introduced later is sample-based, the oracle has to provide a uniform sample of the result set $Q(I)$. But sampling after the evaluation of $Q$ cannot yield a sublinear approximation, since the retrieval of $Q(I)$ takes at least linear time. A novel method of sampling is essential to implement the oracle.

$\mathcal{O}(I, Q)$.in_result($t$). It checks the membership of a tuple $t$ in $Q(I)$: it returns true if the input tuple $t$ belongs to $Q(I)$; otherwise, it returns false. As we show in the next subsection, it is mostly used to check whether $Q(I)$ contains the conflict $\{t, s\}$.

$\mathcal{O}(I, Q)$.size(). Recall the definition of $incDeg(Q(I), \Sigma)$: the result size is in the denominator. This operation only returns the number of tuples in $Q(I)$. Obviously, it is intolerable to compute the size by evaluating the query.

As an example, we next show a concrete implementation of the $\subseteq$-oracle for range queries.

An implementation of $\subseteq$-oracle. In the following, we present an indexing-based implementation of the $\subseteq$-oracle for range queries. Without loss of generality, let $[low, high]$ be the query range of attribute $A$; hence, the corresponding query result consists of all tuples $t$ such that $low \le t.A \le high$. Then a B$^+$-tree index alone is sufficient to implement the $\subseteq$-oracle. As mentioned before, the main challenge is to implement the three operations in sublinear time.

Recall the general structure of a B$^+$-tree: each node maintains a list of key-pointer pairs. In addition, every node records two counters for each key $k$ in it: the number $N^k_<$ of tuples whose keys are less than $k$, and the number $N^k_=$ of tuples whose keys are equal to $k$. Given a range query $Q$, say $[l, h]$, on an instance $I$ with $n$ tuples, the $\subseteq$-oracle first queries the boundary (leaf) nodes $l$ and $h$ to get the size of $Q(I)$ as $N^h_= + N^h_< - N^l_<$. Then the $\subseteq$-oracle can sample an integer $d$ uniformly from $[N^l_<, N^h_= + N^h_<]$, and fetch the tuple in the B$^+$-tree by performing a binary search for $d$, where the offsets of $d$ to the counters are taken as the keys. As for verifying whether a tuple $t$ is in $Q(I)$, it is easy to compare it with the boundaries of $Q$. At last, the size of $Q(I)$ can be easily calculated, since it is equal to the length of the integer interval. Therefore, all three operations are tractable in logarithmic time.

In fact, another implementation is much more straightforward. Note that tuples are arranged in a specific order as a list, and the result of a range query is always a consecutive part of it. Then, label each tuple with a distinct id, so that each label represents its corresponding tuple. All the ids are consecutive and sorted in the order induced by the selection condition. Then it is easy to verify that all three kinds of queries can be answered in $O(1)$ time.

Moreover, based on our model, one is also freed from considering the materialization of the query result; as we have shown for range queries, the materialization of the query result can be avoided.

$(2, \epsilon)$-Estimation

Recall that a two-tuple subset $J = \{t, s\}$ is a conflict if and only if $s \in [t.X] \setminus [t.XY]$ with respect to some FD $X \rightarrow Y$ in $\Sigma$. Let $C$ be the set of all conflicts in an instance $I$; then an s-repair $S$ of $I$ can be derived in the following way: rank all the conflicts of $C$ in an ascending order $\Pi$, pick the current first conflict $J = \{t, s\}$ and remove tuples $t$ and $s$ from $I$, then eliminate all the conflicts containing $t$ or $s$ from $C$; repeat the pick-remove-eliminate procedure until no conflict is left in $C$; then the remaining $I$ is a repair $S$. We claim that $S$ is a 2-optimal s-repair of $I$. The proof is quite straightforward: observe that any repair has to eliminate at least one tuple of each conflict picked in this procedure; then we have

$$\frac{1}{2}\, dist_{sub}(S, I) = \frac{1}{2} (|I| - |S|) \le dist_{sub}(S_{opt}, I),$$

thus achieving a 2-optimal s-repair.

By applying the Chernoff bound, if we uniformly sample $p = \Theta(\frac{1}{\epsilon^2})$ tuples and count the number of tuples not in $S$, say $q$, then with high probability

$$|S| - \frac{\epsilon}{2} n \le \frac{p - q}{p} \cdot n \le |S| + \frac{\epsilon}{2} n.$$

Hence we can obtain a $(2, \epsilon)$-approximation of $incDeg(I, \Sigma)$ as defined previously.
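The pick-remove-eliminate procedure described above amounts to a single scan over the ranked conflicts. A minimal sketch follows, assuming tuples are identified by the integers 0 to n-1 and the conflict list is already sorted by ascending rank; the name `greedy_repair` is hypothetical.

```python
def greedy_repair(n, ranked_conflicts):
    """Scan conflicts in ascending rank and remove both tuples of every
    conflict whose tuples are both still present.

    Returns (tuples kept in the repair S, removed tuples).
    """
    removed = set()
    for t, s in ranked_conflicts:
        # A conflict touching an already removed tuple has been eliminated.
        if t not in removed and s not in removed:
            removed.update((t, s))
    kept = [i for i in range(n) if i not in removed]
    return kept, removed


if __name__ == "__main__":
    # Conflicts form the path 0 - 1 - 2 - 3; an optimal repair removes {1, 2}.
    print(greedy_repair(4, [(1, 2), (0, 1), (2, 3)]))   # removes 2 tuples
    print(greedy_repair(4, [(0, 1), (1, 2), (2, 3)]))   # removes 4 tuples
```

The two rankings of the same conflicts show both ends of the guarantee: the first removes exactly as many tuples as an optimal repair, the second removes twice as many.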
Observe that each ranking $\Pi$ determines a 2-optimal s-repair $S$, so we could scan the ranking once, verify the membership of each tuple in $S$, and count the number $q$; however, this trivial approach takes linear time. In the next subsection, we will give a sublinear-time implementation of the verification step.

We argue that the sampling method mentioned above still works for the 2-optimal s-repair of $Q(I)$ with the same probabilistic error bound. Consider a subset query $Q$ on $I$: a conflict $J$ in any $Q(I)$ is still a conflict in $I$, and the ranking induced by any $Q(I)$ from $\Pi$ is still ascending. Continuing with Example 4, given a range query $Q = [15, 45]$ on PR, the result set $Q(I)$ contains four tuples. As shown in figure 2, all conflicts in $Q(I)$ are shaded in yellow, and the ranking induced by them from $\Pi$ is still ascending. Let $S$ be an optimal s-repair of $Q(I)$; then the ranking $\Pi$ can be reused to compute a $(2, \epsilon)$-approximation of $incDeg(Q(I), \Sigma)$.

We first settle the preprocessing method, then show an efficient implementation of the sample-and-verify method for subset queries.

Figure 2: Ranking of conflicts in $I$ and ranking of conflicts in $Q(I)$ (yellow shaded)

The ranking-based method mentioned above implies that a predefined rank can be reused for the FD-inconsistency degree evaluation of every subset query result. Therefore, we discover all the conflicts in $I$ with respect to $\Sigma$ in the preprocessing step, and assign a unified rank to $C$ in advance. That is, a distinct rank $r$ is assigned to each conflict $J = \{t, s\}$ as shown in figure 1(b).

Algorithm 3

Preprocessing Procedure

Input: An instance $I$ of a relation schema $R$, and a set of functional dependencies $\Sigma$
Output: Set $C$ of conflicts in $I$ and an ascending ranking $\Pi$ on $C$

1: for each two-tuple subset $J = \{t, s\}$ of $I$ do
2: if $J$ is a $\varphi$-conflict for some $\varphi \in \Sigma$ in $I$ then
3: Generate a unique ranking $r$ with no duplicates;
4: Append $\langle \{t, s\}, r \rangle$ to $C$;
5: end if
6: end for
7: Sort $C$ according to $r$;

Algorithm 3 illustrates the preprocessing procedure. Let $n$ be the size of $I$. The running time of Preprocessing is at worst $O(n^2 \log n)$: because there are at most $O(n^2)$ tuple pairs of $I$ and we consider data complexity in this paper, it takes $O(n^2)$ time to find all conflicts of $I$, and Step 7 may take $O(n^2 \log n)$ time. Note that the number of conflicts is usually not that large in practice, and techniques like hash-based partitioning can be used to find all possible conflicts, so that the time cost of preprocessing can be lowered further, but we do not emphasize them in this paper.

Recall the sampling-and-verification procedure: for any tuple $t \in Q(I)$ sampled uniformly, it checks whether $t$ is in the 2-optimal s-repair of $Q(I)$ derived from the given ranking $\Pi$. During this procedure, every conflict $J$ in $Q(I)$ eliminated in the checking procedure has the lowest rank when we turn to check it. That is, any conflict $J'$ in $Q(I)$ intersecting with $J$ is either already eliminated or has a rank higher than $J$. This is easy for a sequential implementation that scans the ranking from its beginning. However, it is difficult without scanning the entire ranking, since the sampled tuple may not be located at the beginning.

To enable a sublinear evaluation, we need a start-from-anywhere implementation. Fortunately, the locality of a ranking can be utilized to avoid scanning the entire ranking. We employ a recursive verification starting from the conflicts involving the currently sampled tuple.

Basically, we begin with the sampled tuple $t$ and check the conflicts in $Q(I)$ caused by $t$ in turn, from the lowest rank to the highest. For each conflict $J$ we are currently considering, $J$ should be eliminated if one of the following conditions holds,
- $J$ is the lowest among all the conflicts in $Q(I)$ intersecting with it,
- Every conflict $J'$ in $Q(I)$ intersecting with $J$ and lower than $J$ is already known to be eliminated.
Otherwise, recursively check $J'$ by the same procedure.
At last, if none of the conflicts in $Q(I)$ containing $t$ is decided to be eliminated, then $t$ should stay in $S$; otherwise it should not, and it is counted into $q$. The correctness of this method is obvious; the only concern is the running time, which depends on the number of recursive calls. We next formally describe it and bound the number of recursive calls.

Algorithm 4 Fast-IncDeg performs $s$ sampling-and-verifying operations and counts the number of tuples sampled but not in $S$ by calling the function NotInSR($t$, $C$). Since $S$ is a 2-optimal s-repair of $I$, Fast-IncDeg outputs a $(2, \epsilon)$-estimation of the FD-inconsistency degree of $Q(I)$.

Algorithm 4 Fast-IncDeg

Input: Set $C$ of conflicts in $I$, subset query $Q$, error $\epsilon$
Output: $\widetilde{incDeg}(Q(I), \Sigma)$

1: $\widetilde{dist}_{sub} := 0$;
2: for $i := 1$ to $8/\epsilon^2$ do
3: $t := \mathcal{O}(I, Q)$.sample_tuple();
4: if NotInSR($t$, $C$) then
5: $\widetilde{dist}_{sub} := \widetilde{dist}_{sub} + 1$;
6: end if
7: end for
8: return $\frac{\widetilde{dist}_{sub}}{\mathcal{O}(I, Q).size()} + \frac{\epsilon}{2}$;

Subroutine 1 implements the verification by calling Eliminate recursively. Namely, given a conflict $J = \{t, s\}$, Eliminate considers all conflicts which include $t$ or $s$ and have a lower ranking. If there are no such conflicts, it returns true. Otherwise, it performs recursive calls on these conflicts in the order of their ranking. If any one-step recursion returns true, it returns false; otherwise, it returns true. With the help of Eliminate, Subroutine 1 checks whether a tuple $t$ belongs to $S$. Concretely, it performs Eliminate on all conflicts in $Q(I)$ including $t$ in the order of their rankings, and if there exists a conflict such that Eliminate returns true, it returns true; otherwise, it returns false.

Now, we bound the number of recursive calls of Eliminate. First, we derive an important corollary from [30]. For a ranking injection $\pi : J \rightarrow r$ of all conflicts of an instance $I$ and a tuple $t \in I$, let $N(\pi, t)$ denote the number of conflicts that a call Eliminate($J$) was made on in the course of the computation of NotInSR($t$). Let $\Pi$ denote the set of all ranking injections $\pi$ over the conflicts of $I$. Given a tuple $t$, let $\delta_t$ be the number of conflicts containing $t$. Then we define the maximum conflict number of $I$ as

$$\delta_I = \max_{t \in I} \{\delta_t\}.$$

Subroutine 1 NotInSR($t$, $C$)

1: Let $\langle \{t, t_1\}, r_1 \rangle, \cdots, \langle \{t, t_l\}, r_l \rangle$ be the tuples in $C$ including $t$ in order of increasing $r$;
2: for $i := 1$ to $l$ do
3: if $\mathcal{O}(I, Q)$.in_result($t_i$) and Eliminate($\{t, t_i\}$) then
4: return true;
5: end if
6: end for
7: return false;
8: function Eliminate($\{t, s\}$)
9: Let $\langle \{t_1, s_1\}, r_1 \rangle, \cdots, \langle \{t_l, s_l\}, r_l \rangle$ be the tuples in $C$ such that $t_i \in \{t, s\}$ in order of increasing $r$;
10: while $r_i < r$ do
11: if $\mathcal{O}(I, Q)$.in_result($s_i$) and Eliminate($\{t_i, s_i\}$) then
12: return false;
13: end if
14: $i := i + 1$;
15: end while
16: return true;
17: end function

The average value of $N(\pi, t)$ taken over all ranking injections $\pi$ and tuples $t$ is $O(\delta_I)$, i.e.,

$$\frac{1}{m! \cdot n} \cdot \sum_{\pi \in \Pi} \sum_{t \in I} N(\pi, t) = O(\delta_I), \qquad (5)$$

where $m$ is the number of conflicts in $I$ and $n$ is the number of tuples in $I$.

Theorem. Algorithm Fast-IncDeg returns an estimate $\widetilde{incDeg}(Q(I), \Sigma)$ such that, with a probability of at least $2/3$,

$$incDeg(Q(I), \Sigma) \le \widetilde{incDeg}(Q(I), \Sigma) \le 2 \cdot incDeg(Q(I), \Sigma) + \epsilon.$$

The average query complexity taken over all rankings $\pi$, subset queries $Q$ and tuples $t$ of $Q(I)$ is $O(\frac{\delta_I}{\epsilon^2})$, where the algorithm uses only queries supported by the $\subseteq$-oracle.

Proof.

By applying an additive Chernoff bound, suppose that $s = \Theta(\frac{1}{\epsilon^2})$ tuples $t$ are sampled uniformly and independently from $Q(I)$; then with probability more than $2/3$,

$$\frac{dist_{sub}(S, Q(I))}{|Q(I)|} - \frac{\epsilon}{2} \le \frac{\widetilde{dist}_{sub}}{|Q(I)|} \le \frac{dist_{sub}(S, Q(I))}{|Q(I)|} + \frac{\epsilon}{2}. \qquad (6)$$

And with the fact that $S$ is a 2-optimal s-repair, we obtain

$$incDeg(Q(I), \Sigma) \le \widetilde{incDeg}(Q(I), \Sigma) \le 2 \cdot incDeg(Q(I), \Sigma) + \epsilon. \qquad (7)$$

For the query complexity, we first bound the number of calls of Eliminate(). Given the result $Q(I)$ of a subset query $Q$, let $n'$ be the number of tuples in $Q(I)$, $m'$ be the number of conflicts contained in $Q(I)$, and $\delta_{Q(I)}$ be the maximum conflict number of $Q(I)$. Now, consider the set $\Pi'$ of rankings induced by $Q(I)$ from $\Pi$; then equation (5) implies

$$\frac{1}{m'! \cdot n'} \cdot \sum_{\pi \in \Pi'} \sum_{t \in Q(I)} N(\pi, t) = O(\delta_{Q(I)}).$$

Notice that since the conflicts in $Q(I)$ are a subset of the conflicts in $I$, for each $\pi' \in \Pi'$ there are $\frac{m!}{m'!}$ rankings $\pi \in \Pi$ that produce the same ranking on the conflicts of $Q(I)$. Group $\Pi$ into $m'!$ groups $\{\Pi_1, \cdots, \Pi_{m'!}\}$; for each $\pi \in \Pi_i$ and a fixed $t \in Q(I)$, $N(\pi, t)$ has the same value. So we have

$$\begin{aligned}
\frac{1}{m!} \sum_{\pi \in \Pi} \frac{1}{|Q(I)|} \sum_{t \in Q(I)} N(\pi, t)
&= \frac{1}{m!} \sum_{\pi \in \{\Pi_1, \cdots, \Pi_{m'!}\}} \frac{1}{|Q(I)|} \sum_{t \in Q(I)} N(\pi, t) \\
&= \frac{1}{m!} \cdot \frac{m!}{m'!} \sum_{\pi \in \Pi'} \frac{1}{|Q(I)|} \sum_{t \in Q(I)} N(\pi, t) \\
&= O(\delta_{Q(I)}).
\end{aligned}$$

Then we can derive the query complexity. Let $\mathcal{Q}$ be the space of queries; for any $Q(I)$ we have $\delta_{Q(I)} \le \delta_I$, so the average query complexity is

$$\begin{aligned}
\frac{1}{m!} \sum_{\pi \in \Pi} \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{1}{|Q(I)|} \sum_{t \in Q(I)} N(\pi, t)
&= \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{1}{m!} \sum_{\pi \in \Pi} \frac{1}{|Q(I)|} \sum_{t \in Q(I)} N(\pi, t) \\
&= \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} O(\delta_{Q(I)}) \le O(\delta_I).
\end{aligned}$$

Obviously, there are $O(\frac{1}{\epsilon^2})$ calls to sample a tuple from the result set. For each sampled tuple $t$, the average number of calls to Eliminate is $O(\delta_I)$. So the number of calls to in_result() is $O(\frac{\delta_I}{\epsilon^2})$.

In addition, inspired by the methodology proposed in [30], the predefined ranking of the preprocessing step can be replaced by ranking on the fly; that is, we only need to discover all conflicts in the preprocessing step and assign ranks whenever they are required. Since the basic idea is similar to our method, we omit the details here; instead, we compare the two implementations in our experiments to show the efficiency of our method.
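The pieces of this section can be put together in a small sketch: a $\subseteq$-oracle for range queries over tuples stored in key order, the recursive verification of Subroutine 1, and the sampling loop of Algorithm 4. All names are hypothetical, and three points are assumptions of this sketch rather than statements about the paper's implementation: tuple ids are positions in the sorted key list, results of Eliminate are memoized per query, and the final estimate divides the counter by the number of samples before adding $\epsilon/2$.

```python
import math
import random
from bisect import bisect_left, bisect_right


class RangeOracle:
    """Oracle for the range query low <= key <= high; a tuple id is its
    position in the sorted key list, so Q(I) is the id interval [lo, hi)."""

    def __init__(self, sorted_keys, low, high):
        self.lo = bisect_left(sorted_keys, low)
        self.hi = bisect_right(sorted_keys, high)

    def size(self):
        return self.hi - self.lo

    def in_result(self, t):
        return self.lo <= t < self.hi

    def sample_tuple(self, rng):
        return rng.randrange(self.lo, self.hi)   # raises ValueError if Q(I) is empty


def build_incidence(ranked_conflicts):
    """Map each tuple to its conflicts as (rank, other tuple), ascending by rank."""
    incident = {}
    for rank, (t, s) in enumerate(ranked_conflicts):
        incident.setdefault(t, []).append((rank, s))
        incident.setdefault(s, []).append((rank, t))
    return incident


def eliminate(t, s, rank, incident, oracle, memo):
    """True iff conflict {t, s} is picked: no lower-ranked conflict of Q(I)
    that shares a tuple with it is picked."""
    if rank not in memo:
        lower = sorted((r, end, other) for end in (t, s)
                       for r, other in incident[end] if r < rank)
        memo[rank] = not any(
            oracle.in_result(other)
            and eliminate(end, other, r, incident, oracle, memo)
            for r, end, other in lower)
    return memo[rank]


def not_in_repair(t, incident, oracle, memo):
    """True iff tuple t of Q(I) is removed by the ranking-based repair."""
    return any(oracle.in_result(other)
               and eliminate(t, other, r, incident, oracle, memo)
               for r, other in incident.get(t, []))


def fast_inc_deg(incident, oracle, eps, rng):
    """Sample ceil(8 / eps^2) tuples and return the removed fraction + eps / 2."""
    samples = math.ceil(8 / eps ** 2)
    memo = {}
    hits = sum(not_in_repair(oracle.sample_tuple(rng), incident, oracle, memo)
               for _ in range(samples))
    return hits / samples + eps / 2


if __name__ == "__main__":
    keys = [10, 20, 30, 40, 50]                        # tuple ids 0 .. 4
    inc = build_incidence([(0, 1), (1, 2), (2, 3), (3, 4)])
    oracle = RangeOracle(keys, 20, 50)                 # Q(I) = ids 1 .. 4
    print(fast_inc_deg(inc, oracle, 0.5, random.Random(0)))
```

In the example, restricting the ranking to the query result picks the conflicts {1, 2} and {3, 4}, so every sampled tuple is counted as removed.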

5. EXPERIMENTS

This section experimentally evaluates the performance of our algorithms for OSR computing and FD-inconsistency evaluation.

All experiments are conducted on a machine with eight16-core Intel Xeon processors and 3072GB of memory.

Dataset.

We used two datasets to experimentally evaluate the performance of the algorithms for OSR computing and FD-inconsistency evaluation.

Dataset 1: order data is an instance of the schema order shown in Example 1. Our set $\Sigma$ consists of 4 FDs taken from Example 1. To populate the relation we scraped product information from Amazon and collected real-life data: the zip and area codes for major cities and towns of all US states, and street information for all of the United States (http://results.openaddresses.io/). We generated datasets of various sizes, ranging from 10M to 100M tuples.

Dataset 2: dblp data was extracted from the dblp Bibliography. It consists of 40M tuples and the format is as follows:

dblp(title, authors, year, publication, pages, ee, url)

Each dblp tuple contains the title of an article, the authors and the publication information (year, publication venue, pages, electronic edition, and url). We designed 4 FDs for dblp:

fd: [title] → [author]
fd: [ee] → [title]
fd: [year, publication, pages] → [title, ee, url]
fd: [url] → [title]

To add noise to a dataset, we randomly selected an attribute of a "correct" tuple and changed it either to a close value or to an existing value taken from another tuple. We appended such "dirty" tuples, each of which violates at least one functional dependency, to the dataset. We set a parameter $\rho$ ranging from 1% to 10% to control the noise rate.

Methods.

We implemented the following algorithms: (a) the basic approximation algorithm BL LP-OSR and the improved approximation algorithm by triad elimination, TE LP-OSR, for OSR computing; (b) the sublinear estimation algorithm Fast-IncDeg, based on two implementations of the $\subseteq$-oracle for range queries with $O(1)$ and $O(\log n)$ time complexity respectively, and its on-the-fly ranking variation Fast-IncDeg-ol for FD-inconsistency evaluation. Hence, there are 4 implementations of Algorithm 4 for range queries in total, denoted by Fast-IncDeg_c, Fast-IncDeg_log, Fast-IncDeg-ol_c and Fast-IncDeg-ol_log respectively.

Metrics.

Since a dataset $I$ with $n$ tuples is polluted by appending $\rho n$ dirty tuples, where $\rho$ is the noise rate, the number of tuples in the optimal repair $S_{opt}$ must be larger than $n$, i.e., $dist_{sub}(S_{opt}, I) \le \rho n$. Hence, we calculate $dist_{sub}(\hat{J}, I)$ of BL LP-OSR and TE LP-OSR and use $2\rho n$ to evaluate their approximation ratio. Moreover, according to the definition of FD-inconsistency degree, we treat $2\rho + \epsilon$ as the upper bound of the FD-inconsistency degree to check the correctness of Algorithm 4. To evaluate the efficiency of Algorithm 4, we issue 300 queries for each algorithm and each parameter set, and record the average query time.

We report our findings concerning the accuracy and efficiency of our algorithms.

Accuracy. We first show the accuracy of BL LP-OSR, TE LP-OSR and QT LP-OSR. In figure 3, we ran them on datasets consisting of 10K to 40K tuples with noise rate $\rho$ ranging from 1% to 10%, and calculated $UB = 2\rho n$. The $dist_{sub}(\hat{J}, I)$ of BL LP-OSR, TE LP-OSR and QT LP-OSR are much less than $UB$. Because the approximation ratio only bounds the relation between the worst-case output of an algorithm and the optimal solution, BL LP-OSR sometimes performs better than the other two improved algorithms. Moreover, we found that after triad elimination the share of variables taking value 0.5 becomes much smaller, since they only appear when some conflicts form a cycle of odd length.

We also evaluate the accuracy of Fast-IncDeg. We ran Fast-IncDeg_c with parameter $\epsilon = 0.01$ on the same datasets and calculated $UB = (2\rho + \epsilon) n$. As shown in figure 3, the value returned by Fast-IncDeg_c is less than this bound, and mostly even less than the bound $2\rho n$, since the upper bound is loose. And it is greater than the values of BL LP-OSR, TE LP-OSR and QT LP-OSR since it returns an estimate with an additive error.

Footnote (source of the dblp data): https://dblp.org/xml/

Eﬃciency.

We evaluate the efficiency of our algorithms for FD-inconsistency evaluation. We first ran Fast-IncDeg_c, Fast-IncDeg_log, Fast-IncDeg-ol_c and Fast-IncDeg-ol_log on datasets of various sizes, with noise rate $\rho$ ranging from 1% to 10% and $\epsilon = 0.01$. 300 large queries were issued per dataset and the average query time was recorded.

Figures 4(a), 4(b), 4(e), and 4(f) indicate that the average query time increases with the number of tuples and the noise rate, since both of them influence the maximum conflict number in $Q(I)$. Further experiments were performed to evaluate the impact of the maximum conflict number $\delta_{Q(I)}$ on the average query time. Since queries were generated randomly, we only bounded the maximum conflict number $\delta_D$ of the dataset. Therefore, the average query time shown in figures 4(d) and 4(h) remains basically the same as $\delta_D$ increases. As shown in figures 4(c) and 4(g), the average query time of Fast-IncDeg_c and Fast-IncDeg-ol_c changes slightly due to the impact of the number of tuples on $\delta_{Q(I)}$, and the average query time of Fast-IncDeg_log and Fast-IncDeg-ol_log grows with the number of tuples.

Figure 4 also illustrates that no matter how the $\subseteq$-oracle is implemented, Fast-IncDeg performs better than Fast-IncDeg-ol. This is because in Fast-IncDeg-ol the ranking is assigned on the fly when a conflict is queried, and it is expensive to keep the ranking consistent in every tuple that the conflict concerns. In addition, an efficient $\subseteq$-oracle indeed makes the average query time drop considerably.

6. RELATED WORK

As a principled approach managing inconsistency, Arenaset al. [5] introduced the notions of repairs to deﬁne con-sistent query answering. The deﬁnitions of repair diﬀer insettings of integrity constraints and operation gain [3]. Themost general form of integrity constraints are denial con-straints [21], they are able to express the classic functionaldependencies [1], inclusion dependencies [25], and so on.Data complexities of computing optimal repairs are widelystudied in the past. The complexity of tuple-level deletionbased subset repair [13, 27] is studied respectively in thepast. And the complexity of cell-level update based v-repairis also studied in [27, 26]. APXcompleteness of both optimalsubset repair and v-repair computation has been shown forin these works.For the upper bound, the best approximation on subsetrepair is still 2 obtained by solving the corresponding vertexcover problem [27] without the limitation on the number ofgiven FDs. For the setting of ﬁxed number of FDs, thereare still no existing algorithmic result.For the data repairing frameworks [2], there are two kindsof works which are based on FDs, they both aim to directlyresolve the inconsistency of database. One kind of methodsis to repair data based on minimizing the repair cost, e.g. , [5,11, 16, 28, 34].Given the data edit operations (including tuple-level andcell-level), minimum cost repair will output repaired datawith minimizing the diﬀerence between it and the originalone. But these work also do not provide us tight lower andupper bounds for data repairing. There are some other typeof repairs not related with this paper, such as minimumdescription length [12], relative trust [10] and so on.12 d i s t tuple size (k) BL LP-OSR TE LP-OSR QT LP-OSR Fast-IncDeg UB UB (a) order ρ = 0 .

[Figure 3: dist with different ρ, n and σ. y-axis: dist. Panels: (a), (b) order and (e), (f) dblp, against tuple size (k) at fixed ρ; (c), (d) order and (g), (h) dblp, against noise rate (%) at n = 10K and n = 40K. Legend: BL LP-OSR, TE LP-OSR, QT LP-OSR, Fast-IncDeg, UB.]

[Figure 4: average query time with different n, ρ and d. y-axis: avg query time (s). Panels: (a) order and (e) dblp, against tuple size (M) at fixed ρ; (b) order, n = 10M and (f) dblp, n = 4M, against noise rate (%); (c) order and (g) dblp, against tuple size (M) at d = 50 and fixed ρ; (d) order, n = 10M and (h) dblp, n = 4M, against max conflict number at fixed ρ. Legend: Fast-IncDeg log, Fast-IncDeg-ol log, Fast-IncDeg c, Fast-IncDeg-ol c.]

For inconsistency detection, there exist techniques that detect errors efficiently. SQL techniques for detecting FD violations were given in [13], practical algorithms for detecting violations of FDs in fragmented and distributed relations were provided in [19], and an incremental detection algorithm was proposed in [20]. In contrast to inconsistency detection, inconsistency evaluation needs to compute a quantified dirtiness value of the data, rather than finding all violations.
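To make the detection task concrete, the following is a small sketch of one common way to find violations of a single FD: group the tuples by their left-hand-side values and report every group that carries more than one distinct right-hand-side value. It is not taken from [13], [19] or [20]; the function name `fd_violation_groups` and the list-of-dictionaries representation are assumptions made for illustration.

```python
from collections import defaultdict

def fd_violation_groups(tuples, lhs, rhs):
    """Return the LHS groups that violate the FD lhs -> rhs.

    Maps each offending LHS value combination to the set of distinct
    RHS value combinations that occur with it.
    """
    groups = defaultdict(set)
    for t in tuples:
        key = tuple(t[a] for a in lhs)
        groups[key].add(tuple(t[a] for a in rhs))
    # A group is a violation when its LHS value maps to several RHS values.
    return {k: v for k, v in groups.items() if len(v) > 1}

rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "02139", "city": "Cambridge"},
]
violations = fd_violation_groups(rows, lhs=("zip",), rhs=("city",))
```

This sketch lists the violating groups in one linear scan. Evaluation, by contrast, asks for a single number such as the size of an optimal s-repair, which is hard to compute exactly in most cases.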

7. CONCLUSIONS

We revisit the problem of computing an optimal s-repair and the fast estimation of the s-repair based FD-inconsistency degree of subset query results. For the lower bound, we improve the inapproximability of the optimal s-repair computing problem over most cases of FDs and schemas. For the upper bound, we develop two LP-based algorithms that compute a near-optimal s-repair, based on different characterizations of the input FDs and schemas respectively. The complexity results imply that it is hard to obtain a good approximation in polynomial time, let alone in sublinear time for large data. For the FD-inconsistency degree, we present a fast (2, ϵ)-estimation with an average sublinear query complexity, which achieves a sublinear time complexity whenever it incorporates a sublinear implementation of the subset query oracle. These results give a way to estimate the FD-inconsistency degree efficiently with a theoretical guarantee.

8. REFERENCES

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[2] F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: Algorithms and complexity. 2009.
[3] F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: Algorithms and complexity. In Proceedings of the 12th International Conference on Database Theory, ICDT '09, pages 31–41, New York, NY, USA, 2009. ACM.
[4] O. Amini, S. Pérennes, and I. Sau. Hardness and approximation of traffic grooming. Theoretical Computer Science, 410(38-40):3751–3760, 2009.
[5] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In Proceedings of the 18th ACM Symposium on Principles of Database Systems, pages 68–79. ACM, 1999.
[6] A. Assadi, T. Milo, and S. Novgorodov. Dance: data cleaning with constraints and experts. In , pages 1409–1410. IEEE, 2017.
[7] M. Bergman, T. Milo, S. Novgorodov, and W.-C. Tan. Qoco: A query oriented data cleaning system with oracles. Proceedings of the VLDB Endowment, 8(12):1900–1903, 2015.
[8] L. Bertossi. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011.
[9] L. Bertossi. Repair-based degrees of database inconsistency. In Logic Programming and Nonmonotonic Reasoning, pages 195–209. Springer, Cham, 2019.
[10] G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. arXiv preprint arXiv:1207.5226, 2012.
[11] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 143–154. ACM, 2005.
[12] F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, 2011.
[13] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2), 2005.
[14] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 2201–2206, New York, NY, USA, 2016. Association for Computing Machinery.
[15] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 315–326. VLDB Endowment, 2007.
[16] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 315–326. VLDB Endowment, 2007.
[17] P. Crescenzi. A short guide to approximation preserving reductions. In Proceedings of Computational Complexity. Twelfth Annual IEEE Conference, pages 262–273. IEEE, 1997.
[18] M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 541–552. ACM, 2013.
[19] W. Fan, F. Geerts, S. Ma, and H. Müller. Detecting inconsistencies in distributed data. 2010.
[20] W. Fan, J. Li, N. Tang, and W. Yu. Incremental detection of inconsistencies in distributed data. IEEE Trans. on Knowl. and Data Eng., 26(6), 2014.
[21] T. Gaasterland, P. Godfrey, and J. Minker. An overview of cooperative answering. Journal of Intelligent Information Systems, 1(2):123–157, 1992.
[22] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. Proceedings of the VLDB Endowment, 6(9):625–636, 2013.
[23] V. Guruswami and S. Khot. Hardness of max 3sat with no mixed clauses. In Proceedings of the 20th Annual IEEE Conference on Computational Complexity, pages 154–162. IEEE Computer Society, 2005.
[24] V. Kann. Maximum bounded 3-dimensional matching is max snp-complete. Information Processing Letters, 37(1):27–35, 1991.
[25] H. Koehler and S. Link. Inclusion dependencies and their interaction with functional dependencies in sql. J. Comput. Syst. Sci., 85(C):104–131, 2017.
[26] S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. ICDT, 2009.
[27] E. Livshits, B. Kimelfeld, and S. Roy. Computing optimal repairs for functional dependencies. In Proceedings of the 37th ACM Symposium on Principles of Database Systems, pages 225–237. ACM, 2018.
[28] A. Lopatenko and L. Bravo. Efficient approximation algorithms for repairing inconsistent databases. In , pages 216–225. IEEE, 2007.
[29] G. L. Nemhauser and L. E. Trotter. Vertex packings: Structural properties and algorithms. Mathematical Programming, 8(4):232–248, 1975.
[30] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1123–1131. Society for Industrial and Applied Mathematics, 2012.
[31] C. D. Sa, I. F. Ilyas, B. Kimelfeld, C. Ré, and T. Rekatsinas. A formal framework for probabilistic unclean databases. In , pages 6:1–6:18, 2019.
[32] B. Salimi, L. Rodriguez, B. Howe, and D. Suciu. Interventional fairness: Causal database repair for algorithmic fairness. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, pages 793–810, New York, NY, USA, 2019. Association for Computing Machinery.
[33] D. P. Williamson and D. B. Shmoys. The design of approximation algorithms. Cambridge University Press, Cambridge, England, 2011.
[34] W. E. Winkler. Methods for evaluating and creating data quality.