Computing Optimal Repairs for Functional Dependencies
Ester Livshits
Technion, Haifa 32000, Israel
Benny Kimelfeld
Technion, Haifa 32000, Israel
Sudeepa Roy
Duke University, Durham, NC 27708, USA
ABSTRACT
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) that is obtained by a minimum number of value (cell) updates. For computing an optimal S-repair, we present a polynomial-time algorithm that succeeds on certain sets of FDs and fails on others. We prove the following about the algorithm. When it succeeds, it can also incorporate weighted tuples and duplicate tuples. When it fails, the problem is NP-hard, and in fact, APX-complete (hence, cannot be approximated better than some constant). Thus, we establish a dichotomy in the complexity of computing an optimal S-repair. We present general analysis techniques for the complexity of computing an optimal U-repair, some based on the dichotomy for S-repairs. We also draw a connection to a past dichotomy in the complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side; the case of general FDs was left open, and we show how our dichotomy provides the missing generalization and thereby settles the open problem.
1. INTRODUCTION
Database inconsistency arises in a variety of scenarios and for different reasons. For instance, data may be collected from imprecise sources (social encyclopedias/networks, sensors attached to appliances, cameras, etc.) via imprecise procedures (natural-language processing, signal processing, image analysis, etc.). Inconsistency may arise when integrating databases of different organizations with conflicting information, or even consistent information in conflicting formats. Arenas et al. [5] introduced a principled approach to managing inconsistency via the notions of repairs and consistent query answering. An inconsistent database is a database D that violates integrity constraints, a repair is a consistent database D′ obtained from D by a minimal sequence of operations, and the consistent answers to a query are the answers given in every repair D′.

Instantiations of the repair framework differ in their definitions of integrity constraints, operations, and minimality [1]. Common types of constraints are denial constraints [18] that include the classic functional dependencies (FDs), and inclusion dependencies [11] that include the referential (foreign-key) constraints. An operation can be a deletion of a tuple, an insertion of a tuple, and an update of an attribute (cell) value. Minimality can be either local (no strict subset of the operations achieves consistency) or global (no smaller, or cheaper, subset achieves consistency). For example, if only tuple deletions are allowed, then a subset repair [12] corresponds to a local minimum (restoring any deleted tuple causes inconsistency) and a cardinality repair [27] corresponds to a global minimum (consistency cannot be gained by fewer tuple deletions). The cost of operations may differ between tuples; this can represent different levels of trust that we have in the tuples [24, 27]. In this paper, we focus on global minima under FDs via tuple deletions and value updates.
Each tuple is associated with a weight that determines the cost of its deletion or a change of a single value. We study the complexity of computing a minimum repair in two settings: (a) only tuple deletions are allowed, that is, we seek a (weighted) cardinality repair, and (b) only value updates are allowed, that is, we seek what Kolahi and Lakshmanan [24] refer to as an "optimum V-repair." We refer to the two challenges as computing an optimal subset repair (optimal S-repair) and computing an optimal update repair (optimal U-repair).

The importance of computing an optimal repair arises in the challenge of data cleaning [17]: eliminate errors and dirt (manifested as inconsistencies) from the database. Specifically, our motivation is twofold. The obvious motivation is in fully automated cleaning, where an optimal repair is the best candidate, assuming the system is aware of only the constraints and tuple weights. The second motivation comes from the more realistic practice of iterative, human-in-the-loop cleaning [6, 9, 13, 19]. There, the cost of the optimal repair can serve as an educated estimate for the extent to which the database is dirty and, consequently, the amount of effort needed for completion of cleaning.

As our integrity constraints are FDs, it suffices to consider a database with a single relation, which we call here a table. In a general database, our results can be applied to each relation individually. A table T conforms to a relational schema R(A_1, ..., A_k) where each A_i is an attribute. Integrity is determined by a set ∆ of FDs. Our complexity analysis focuses primarily on data complexity, where R(A_1, ..., A_k) and ∆ are considered fixed and only T is considered input. Hence, we have infinitely many optimization problems, one for each combination of R(A_1, ..., A_k) and ∆. Table records have identifiers, as we wish to be able to determine easily which cells are updated in a repair.
Consequently, we allow duplicate tuples (with distinct identifiers).

We begin with the problem of computing an optimal S-repair. The problem is known to be computationally hard for denial constraints [27]. As we discuss later, complexity results can be inferred from prior work [20] for FDs with a single attribute on the left hand side (lhs for short). For general FDs, we present the algorithm OptSRepair (Algorithm 1). The algorithm seeks opportunities for simplifying the problem by eliminating attributes and FDs, until no FDs are left (and then the problem is trivial). For example, if all FDs share an attribute A on the left hand side, then we can partition the table according to A and solve the problem separately on each partition; but now, we can ignore A. We refer to this simplification as "common lhs." Two additional simplifications are the "consensus" and the "lhs marriage." Importantly, the algorithm terminates in polynomial time, even under combined complexity. However, OptSRepair may fail by reaching a nonempty set of FDs where no simplification can be applied. We prove two properties of the algorithm. The first is soundness: if the algorithm succeeds, then it returns an optimal S-repair. More interesting is the property of completeness: if the algorithm fails, then the problem is NP-hard. In fact, in this case the problem is APX-complete; that is, for some ε > 0 it is NP-hard to compute an S-repair whose cost is less than (1 + ε) times the minimum, but a (1 + ε′)-optimal S-repair, for some ε′ > 0, is achievable in polynomial time. More so, the problem remains APX-complete if we assume that the table does not contain duplicates, and all tuples have a unit weight (in which case we say that T is unweighted). Consequently, we establish the following dichotomy in complexity for the space of combinations of schemas R(A_1, ..., A_k) and FD sets ∆.

● If we can eliminate all FDs in ∆ with the three simplifications, then an optimal S-repair can be computed in polynomial time using
OptSRepair.
● Otherwise, the problem is APX-complete, even for unweighted tables without duplicates.

We then continue to the problem of computing an optimal U-repair. Here we do not establish a full dichotomy, but we make substantial progress. We have found that proving hardness results for updates is far more subtle than for deletions. We identify conditions where the complexity of computing an optimal U-repair and that of computing an optimal S-repair coincide. One such condition is the common lhs (i.e., all FDs share a left-hand-side attribute). Hence, in this case, our dichotomy provides the precise test of tractability. We also show decomposition techniques that extend the opportunities of using the dichotomy. As an example, consider
Purchase(product, price, buyer, email, address) and ∆ = {product → price, buyer → email}. We can decompose this problem into ∆_1 = {product → price} and ∆_2 = {buyer → email}, and consider each ∆_i, for i = 1, 2, separately. The complexity under each ∆_i is the same in both variants of optimal repairs, and so, an optimal U-repair can be computed in polynomial time. Yet, these results do not cover all sets of FDs. For example, let ∆_3 = {email → buyer, buyer → address}. Kolahi and Lakshmanan [24] proved that under ∆_3, computing an optimal U-repair is NP-hard. Our dichotomy shows that it is also NP-hard (and also APX-complete) to compute an S-repair under ∆_3. Yet, this FD set does not fall in our coincidence cases.

The above defined ∆ is an example where an optimal U-repair can be computed in polynomial time, but computing an optimal S-repair is APX-complete. We also show an example in the reverse direction, namely ∆_4 = {buyer → email, email → buyer, buyer → address}. This FD set falls in the positive side of our dichotomy for optimal S-repairs, but computing an optimal U-repair is APX-complete. The proof of APX-hardness is inspired by, but considerably more involved than, the hardness proof of Kolahi and Lakshmanan [24] for ∆_3.

Finally, we consider approximate repairing. For the case of an optimal S-repair, the problem easily reduces to that of weighted vertex cover, and hence, we get a polynomial-time 2-approximation due to Bar-Yehuda and Even [7]. To approximate optimal U-repairs, we show an efficient reduction to S-repairs, where the loss in approximation is linear in the number of attributes. Hence, we obtain a constant-ratio approximation, where the constant has a linear dependence on the number of attributes. Kolahi and Lakshmanan [24] also gave an approximation for optimal U-repairs, but their worst-case approximation can be quadratic in the number of attributes. We show an infinite sequence of FD sets where this gap is actually realized.
On the other hand, we also show an infinite sequence where our approximation is linear in the number of attributes, but theirs remains constant. Hence, in general, the two approximations are incomparable, and we can combine the two by running both approximations and taking the best.

Stepping outside the framework of repairs, a different approach to data cleaning is probabilistic [4, 20, 28]. The idea is to define a probability space over possible clean databases, where the probability of a database is determined by the extent to which it satisfies the integrity constraints. The goal is to find a most probable database that, in turn, serves as the clean outcome. As an instantiation, Gribkoff, Van den Broeck, and Suciu [20] identify probabilistic cleaning as the "Most Probable Database" problem (MPD): given a tuple-independent probabilistic database [14, 30] and a set of FDs, find the most probable database among those satisfying the FDs (or, put differently, condition the probability space on the FDs). They show a dichotomy for unary FDs (i.e., FDs with a single attribute on the left hand side). The case of general (not necessarily unary) FDs has been left open. It turns out that there are reductions from MPD to computing an optimal S-repair and vice versa. Consequently, we are able to generalize their dichotomy to all FDs, and hence, fully settle the open problem.

2. PRELIMINARIES

We first present some basic terminology and notation that we use throughout the paper.
An instance of our data model is a single table where each tuple is associated with an identifier and a weight that states how costly it is to change or delete the tuple. Such a table corresponds to a relation schema that we denote by R(A_1, ..., A_k), where R is the relation name and A_1, ..., A_k are distinct attributes. We say that R(A_1, ..., A_k) is k-ary since it has k attributes. When there is no risk of confusion, we may refer to R(A_1, ..., A_k) by simply R.

We use capital letters from the beginning of the English alphabet (e.g., A, B, C), possibly with subscripts and/or superscripts, to denote individual attributes, and capital letters from the end of the English alphabet (e.g., X, Y, Z), possibly with subscripts and/or superscripts, to denote sets of attributes. We follow the convention of writing sets of attributes without curly braces and without commas (e.g., ABC).

We assume a countably infinite domain
Val of attribute values. By a tuple we mean a sequence of values in
Val. A table T over R(A_1, ..., A_k) has a collection ids(T) of (tuple) identifiers and it maps every identifier i to a tuple in Val^k and a positive weight; we denote this tuple by T[i] and this weight by w_T(i). For i ∈ ids(T) we refer to T[i] as a tuple of T. We denote by T[*] the set of all tuples of T. We say that T is:

● duplicate free if distinct tuples disagree on at least one attribute, that is, T[i] ≠ T[j] whenever i ≠ j;
● unweighted if all tuple weights are equal, that is, w_T(i) = w_T(j) for all identifiers i and j.

We use |T| to denote the number of tuple identifiers of T, that is, |T| := |ids(T)|. Let t = (a_1, ..., a_k) be a tuple of T. We use t.A_j to refer to the value a_j. If X = A_{i_1}, ..., A_{i_ℓ} is a sequence of attributes in {A_1, ..., A_k}, then t[X] denotes the tuple (t.A_{i_1}, ..., t.A_{i_ℓ}).

Example.
Consider the schema Office(facility, room, floor, city), describing the location of offices in an organization; Figure 1 shows tables over this schema. For example, the tuple T[1] corresponds to an office in room 322, in the third floor of the headquarters (HQ) building, located in Paris. The meaning of the yellow background color will be clarified later. The identifier of each tuple is shown on the leftmost (gray shaded) column, and its weight on the rightmost column (also gray shaded). Note that table S_1 is duplicate free and unweighted, table S_2 is duplicate free but not unweighted, and table U_1 is neither duplicate free nor unweighted.

Let R(A_1, ..., A_k) be a schema. As usual, an FD (over R) is an expression of the form X → Y where X and Y are sequences of attributes of R. We refer to

Figure 1: For
Office(facility, room, floor, city) and FDs facility → city and facility room → floor, a table T, consistent subsets S_1, S_2 and S_3, and consistent updates U_1, U_2 and U_3. Changed values are marked in yellow.

X as the left-hand side, or lhs for short, and to Y as the right-hand side, or rhs for short. A table T satisfies X → Y if every two tuples that agree on X also agree on Y; that is, for all t, s ∈ T[*], if t[X] = s[X] then t[Y] = s[Y]. We say that T satisfies a set ∆ of FDs if T satisfies each FD in ∆; otherwise, T violates ∆. An FD X → Y is entailed by ∆, denoted ∆ ⊧ X → Y, if every table T that satisfies ∆ also satisfies the FD X → Y. The closure of ∆, denoted cl(∆), is the set of all FDs over R that are entailed by ∆. The closure of an attribute set X (w.r.t. ∆), denoted cl_∆(X), is the set of all attributes A such that the FD X → A is entailed by ∆. Two sets ∆_1 and ∆_2 of FDs are equivalent if they have the same closure (or in other words, each FD in ∆_1 is entailed by ∆_2 and vice versa, or put differently, every table that satisfies one also satisfies the other). An FD X → Y is trivial if Y ⊆ X; otherwise, it is nontrivial. Note that a trivial FD belongs to the closure of every set of FDs (including the empty one). We say that ∆ is trivial if ∆ does not contain any nontrivial FDs (e.g., it is empty); otherwise, ∆ is nontrivial.

Next, we give some non-standard notation that we need for this paper. A common lhs of an FD set ∆ is an attribute A such that A ∈ X for all FDs X → Y in ∆. An FD set ∆ is a chain if for every two FDs X_1 → Y_1 and X_2 → Y_2 it is the case that X_1 ⊆ X_2 or X_2 ⊆ X_1. Livshits and Kimelfeld [26] proved that the class of chain FD sets consists of precisely the FD sets in which the subset repairs, which we define in Section 2.3, can be counted in polynomial time (assuming FP ≠ #P).

Example. Our running example uses the FD set ∆ that consists of the following FDs:

● facility → city: a facility belongs to a single city.
● facility room → floor: a room in a facility does not go beyond one floor.

Note that the FDs allow for the same room number to occur in different facilities (possibly on different floors, in different cities). The attribute facility is a common lhs. Moreover, ∆ is a chain FD set, since {facility} ⊆ {facility, room}. Table T (Figure 1(a)) violates ∆, and the other tables (Figures 1(b)–1(g)) satisfy ∆.

An FD X → Y might be such that X is empty, and then we denote it by ∅ → Y and call it a consensus FD. Satisfying the consensus FD ∅ → Y means that all tuples agree on Y, or in other words, the column that corresponds to each attribute in Y consists of copies of the same value. For example, ∅ → city means that all tuples have the same city. A consensus attribute (of ∆) is an attribute in cl_∆(∅), that is, an attribute A such that ∅ → A is implied by ∆. We say that ∆ is consensus free if it has no consensus attributes.

Let R(A_1, ..., A_k) be a schema, and let T be a table. A subset of T is a table S that is obtained from T by eliminating tuples. More formally, table S is a subset of T if ids(S) ⊆ ids(T) and for all i ∈ ids(S) we have S[i] = T[i] and w_S(i) = w_T(i). If S is a subset of T, then the distance from S to T, denoted dist_sub(S, T), is the weighted sum of the tuples missing from S; that is,

dist_sub(S, T) := ∑_{i ∈ ids(T)∖ids(S)} w_T(i).

A value update of T (or just update of T for short) is a table U that is obtained from T by changing attribute values. More formally, a table U is an update of T if ids(U) = ids(T) and for all i ∈ ids(U) we have w_U(i) = w_T(i). We adopt the definition of Kolahi and Lakshmanan [24] for the distance from U to T. Specifically, if u and t are tuples of tables over R, then the Hamming distance H(u, t) is the number of attributes on which u and t disagree, that is, H(u, t) = |{j | u.A_j ≠ t.A_j}|.
If U is an update of T, then the distance from U to T, denoted dist_upd(U, T), is the weighted Hamming distance between U and T (where every changed value counts as the weight of the tuple); that is,

dist_upd(U, T) := ∑_{i ∈ ids(T)} w_T(i) · H(T[i], U[i]).

Let R(A_1, ..., A_k) be a schema, let T be a table, and let ∆ be a set of FDs. A consistent subset (of T w.r.t. ∆) is a subset S of T such that S ⊧ ∆, and a consistent update (of T w.r.t. ∆) is an update U of T such that U ⊧ ∆. A subset repair, or just S-repair for short, is a consistent subset that is not strictly contained in any other consistent subset. An update repair, or just
U-repair for short, is a consistent update that becomes inconsistent if any set of updated values is restored to the original values in T. An optimal subset repair of T, or just optimal S-repair for short, is a consistent subset S of T such that dist_sub(S, T) is minimal among all consistent subsets of T. Similarly, an optimal update repair of T, or just optimal U-repair for short, is a consistent update U of T such that dist_upd(U, T) is minimal among all consistent updates of T. When there is risk of ambiguity, we may stress that the optimal S-repair (or U-repair) is of T and under ∆, or under R and ∆.

Every optimal (S- or U-) repair is a repair, but not necessarily vice versa. Clearly, a consistent subset (respectively, update) can be transformed into a (not necessarily optimal) S-repair (respectively, U-repair), with no increase of distance, in polynomial time. In fact, we do not really need the concept of a repair per se, and the definition is given mainly for compatibility with the literature (e.g., [1]). Therefore, unless explicitly stated otherwise, we do not distinguish between an S-repair and a consistent subset, and between a U-repair and a consistent update.

We also define approximations of optimal repairs in the obvious ways, as follows. For a number α ≥ 1, an α-optimal S-repair is an S-repair S of T such that dist_sub(S, T) ≤ α · dist_sub(S′, T) for all S-repairs S′ of T, and an α-optimal U-repair is a U-repair U of T such that dist_upd(U, T) ≤ α · dist_upd(U′, T) for all U-repairs U′ of T. In particular, an optimal S-repair (resp., optimal U-repair) is the same as a 1-optimal S-repair (resp., 1-optimal U-repair).

Example. In Figure 1, S_1, S_2 and S_3 are consistent subsets, and U_1, U_2 and U_3 are consistent updates. For clarity, we marked with yellow shading the values that were changed for constructing each U_i. We have dist_sub(S_1, T) = dist_sub(S_2, T) = 2 and dist_sub(S_3, T) =
3. The reader can verify that S_1 and S_2 are optimal S-repairs. However, S_3 is not an optimal S-repair, since its distance to T is greater than the minimum. Nevertheless, S_3 is a 1.5-optimal S-repair, since 3/2 = 1.5. We also have dist_upd(U_1, T) = 2 and dist_upd(U_2, T) =
3, and dist_upd(U_3, T) = 4 (U_3 is obtained by changing two values from a tuple of weight 2).

It should be noted that the values of an update U of a table T are not necessarily taken from the active domain (i.e., values that occur in T). An example is the value F01 of table U_1 in Figure 1(e). This has implications on the complexity of computing optimal U-repairs. We discuss a restriction on the allowed update values in Section 5.

We adopt the conventional measure of data complexity, where the schema R(A_1, ..., A_k) and dependency set ∆ are assumed to be fixed, and only the table T is considered input. In particular, a "polynomial" running time may have an exponential dependency on k, as in O(|T|^k). Hence, each combination of R(A_1, ..., A_k) and ∆ defines a distinct problem of finding an optimal repair (of the relevant type), and different combinations may feature different computational complexities.

For the complexity of approximation, we use the following terminology. In an optimization problem P, each input x has a space of solutions y, each associated with a cost cost(x, y). Given x, the goal is to compute a solution y with a minimum cost. For α ≥
1, an α-approximation for P is an algorithm that, for input x, produces an α-optimal solution y, which means that cost(x, y) ≤ α · cost(x, y′) for all solutions y′. The complexity class APX consists of all optimization problems that have a polynomial-time constant-factor approximation. A polynomial-time reduction f from an optimization problem Q to an optimization problem P is a strict reduction if for all α ≥
1, any α-optimal solution for f(x) can be transformed in polynomial time into an α-optimal solution for x [25]; it is a PTAS (Polynomial-Time Approximation Scheme) reduction if for all α > 1 there exists β > 1 such that any β-optimal solution for f(x) can be transformed in polynomial time into an α-optimal solution for x. A strict reduction is also a PTAS reduction, but not necessarily vice versa. A problem P is APX-hard if there is a PTAS reduction to P from every problem in APX; it is APX-complete if, in addition, it is in APX. If P is APX-hard, then there is a constant α_P > 1 such that P cannot be approximated better than α_P, or else P = NP.
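To make the preceding definitions concrete, here is a minimal Python sketch of the data model and the two distance measures. The representation is ours, not the paper's: a table is a dict from identifiers to (tuple, weight) pairs, and an FD X → Y is a pair of tuples of attribute positions.

```python
from itertools import combinations

def satisfies(table, fds):
    """Check whether every pair of tuples agrees with every FD (lhs, rhs)."""
    for (t, _), (s, _) in combinations(table.values(), 2):
        for lhs, rhs in fds:
            if all(t[a] == s[a] for a in lhs) and any(t[a] != s[a] for a in rhs):
                return False
    return True

def dist_sub(S, T):
    """dist_sub(S, T): weighted sum of the tuples of T missing from the subset S."""
    return sum(w for i, (_, w) in T.items() if i not in S)

def hamming(u, t):
    """H(u, t): number of attributes on which u and t disagree."""
    return sum(1 for a, b in zip(u, t) if a != b)

def dist_upd(U, T):
    """dist_upd(U, T): weighted Hamming distance between an update U of T and T."""
    return sum(w * hamming(U[i][0], t) for i, (t, w) in T.items())
```

Here satisfies tests all tuple pairs against every FD (quadratic time, which suffices for illustration); the weights used in any concrete run are the caller's choice.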
3. COMPUTING AN OPTIMAL S-REPAIR
In this section, we study the problem of computing an optimal S-repair. We begin with some conventions.
Assumptions and Notation.
Throughout this section we assume that every FD has a single attribute on its right-hand side, that is, it has the form X → A. Clearly, this is not a limiting assumption, since replacing X → Y Z with X → Y and X → Z preserves equivalence.

Let ∆ be a set of FDs. If X is a set of attributes, then we denote by ∆ − X the set ∆′ of FDs that is obtained from ∆ by removing each attribute of X from every lhs and rhs of every FD in ∆. Hence, no attribute in X occurs in ∆ − X. If A is an attribute, then we may write ∆ − A instead of ∆ − {A}.

An lhs marriage of an FD set ∆ is a pair (X_1, X_2) of distinct lhs of FDs in ∆ with the following properties.

● cl_∆(X_1) = cl_∆(X_2);
● the lhs of every FD in ∆ contains either X_1 or X_2 (or both).

Example 3.1. Consider the FD set

∆_{A↔B→C} := {A → B, B → A, B → C}. (1)

The pair ({A}, {B}) is an lhs marriage of ∆_{A↔B→C}. As another example, consider the following FD set:

∆ := {ssn → first, ssn → last, first last → ssn, ssn → address, ssn office → phone, ssn office → fax}.

In ∆ the pair ({ssn}, {first, last}) is an lhs marriage.

Finally, if S is a subset of a table T, then we denote by w_T(S) the sum of weights of the tuples of S, that is, w_T(S) := ∑_{i ∈ ids(S)} w_T(i).

We now describe an algorithm for finding an optimal S-repair. The algorithm terminates in polynomial time, even under combined complexity, yet it may fail. If it succeeds, then the result is guaranteed to be an optimal S-repair. We later discuss the situations in which the algorithm fails. The algorithm,
OptSRepair, is shown as Algorithm 1. The input is a set ∆ of FDs and a table T, both over the same relation schema (that we do not need to refer to explicitly). In the remainder of this section, we fix ∆ and T, and describe the execution of OptSRepair on ∆ and T.

The algorithm handles four cases. The first is where ∆ is trivial. Then, T is itself an optimal S-repair. The second case is where ∆ has a common lhs A. Then, the algorithm groups the tuples by A, finds an optimal S-repair for each group (via a recursive call to OptSRepair), this time by ignoring A (i.e., removing A from the FDs of ∆), and returns the union of the optimal S-repairs. The precise description is in the subroutine CommonLHSRep (Subroutine 1). The third case is where ∆ has a consensus FD ∅ → A. Similarly to the second case, the algorithm groups the tuples by A and finds an optimal S-repair for each group. This time, however, the algorithm returns the optimal S-repair with the maximal weight. The precise description is in the subroutine ConsensusRep (Subroutine 2). The fourth (last) case is the most involved. This is the case where ∆ has an lhs marriage (X_1, X_2). In this case

Algorithm 1 OptSRepair(∆, T)
  if ∆ is trivial then return T  ▷ successful termination
  remove trivial FDs from ∆
  if ∆ has a common lhs then return CommonLHSRep(∆, T)
  if ∆ has a consensus FD then return ConsensusRep(∆, T)
  if ∆ has an lhs marriage then return MarriageRep(∆, T)
  fail  ▷ cannot find a minimum repair

Subroutine 1
CommonLHSRep(∆, T)
  A := a common lhs of ∆
  return ⋃_{(a) ∈ π_A T[*]} OptSRepair(∆ − A, σ_{A=a} T)

Subroutine 2
ConsensusRep(∆, T)
  select a consensus FD ∅ → A in ∆
  for all a ∈ π_A T[*] do
    S_a := OptSRepair(∆ − A, σ_{A=a} T)
  a_max := argmax_{(a) ∈ π_A T[*]} w_T(S_a)
  return S_{a_max}

Subroutine 3
MarriageRep(∆, T)
  select an lhs marriage (X_1, X_2) of ∆
  for all (a_1, a_2) ∈ π_{X_1 X_2} T[*] do
    S_{a_1,a_2} := OptSRepair(∆ − X_1 X_2, σ_{X_1=a_1, X_2=a_2} T)
    w(a_1, a_2) := w_T(S_{a_1,a_2})
  V_i := π_{X_i} T[*] for i = 1, 2
  E := {(a_1, a_2) | (a_1, a_2) ∈ π_{X_1 X_2} T[*]}
  G := the weighted bipartite graph (V_1, V_2, E, w)
  E_max := a maximum matching of G
  return ⋃_{(a_1,a_2) ∈ E_max} S_{a_1,a_2}

the problem is reduced to finding a maximum weighted matching of a bipartite graph. The graph, which we denote by G = (V_1, V_2, E, w), consists of two disjoint node sets V_1 and V_2, an edge set E that connects nodes from V_1 to nodes from V_2, and a weight function w that assigns a weight w(v_1, v_2) to each edge (v_1, v_2). For i = 1,
2, the node set V_i is the set of tuples in the projection of T to X_i. To determine the weight w(v_1, v_2), we select from T the subset T_{v_1,v_2} that consists of the tuples that agree with v_1 and v_2 on X_1 and X_2, respectively. We then find an optimal S-repair for T_{v_1,v_2}, after we remove from ∆ every attribute in either X_1 or X_2. Then, the weight w(v_1, v_2) is the weight of this optimal S-repair. Next, we find a maximum matching E_max of

(Footnote: In principle, it may be the case that the same tuple occurs in both V_1 and V_2, since the tuple is in both projections. Nevertheless, we still treat the two occurrences of the tuple as distinct nodes, and so effectively assume that V_1 and V_2 are disjoint.)

Algorithm 2
OSRSucceeds(∆)
  while ∆ is nontrivial do
    remove trivial FDs from ∆
    if ∆ has a common lhs A then ∆ := ∆ − A
    else if ∆ has a consensus FD ∅ → A then ∆ := ∆ − A
    else if ∆ has an lhs marriage (X_1, X_2) then ∆ := ∆ − X_1 X_2
    else return false
  return true

G. Note that E_max is a subset of E such that no node appears more than once. The returned result is then the disjoint union of the optimal S-repairs of T_{v_1,v_2} over all (v_1, v_2) in E_max. The precise description is in the subroutine MarriageRep (Subroutine 3).

The following theorem states the correctness and efficiency of
OptSRepair.

Theorem 3.2.
Let ∆ and T be a set of FDs and a table, respectively, over a relation schema R(A_1, ..., A_k). If OptSRepair(∆, T) succeeds, then it returns an optimal S-repair. Moreover, OptSRepair(∆, T) terminates in polynomial time in k, |∆|, and |T|.

What about the cases where
OptSRepair(∆, T) fails? We discuss this in the next section.

Approximation.
An easy observation is that the computation of an optimal subset is easily reducible to the weighted vertex-cover problem: given a graph G where nodes are assigned nonnegative weights, find a vertex cover (i.e., a set C of nodes that intersects with all edges) with a minimal sum of weights. Indeed, given a table T, we construct the graph G that has ids(T) as the set of nodes, and an edge between every i and j such that T[i] and T[j] contradict one or more FDs in ∆. Given a vertex cover C for G, we obtain a consistent subset S by deleting from T every tuple with an identifier in C. Clearly, this reduction is strict. As weighted vertex cover is 2-approximable in polynomial time [7], we conclude the same for optimal subset repairing.

Proposition 3.3.
For all FD sets ∆, a 2-optimal S-repair can be computed in polynomial time.

While Proposition 3.3 is straightforward, it is of practical importance, as it limits the severity of the lower bounds we establish in the next section. Moreover, we will later show that the proposition has implications on the problem of approximating an optimal U-repair.
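As an illustration (not the paper's code), the reduction and the Bar-Yehuda–Even local-ratio algorithm [7] can be sketched as follows, reusing the dict-based table representation from our Section 2 sketch: tuple identifiers are nodes, conflicting pairs are edges, and repeatedly "paying" the smaller residual weight on both endpoints of an uncovered edge yields a vertex cover, hence a consistent subset, of cost at most twice the optimum.

```python
from itertools import combinations

def conflict_edges(T, fds):
    """Edges between identifiers of tuple pairs that jointly violate some FD.
    T maps an identifier to a (tuple, weight) pair; an FD is a (lhs, rhs)
    pair of attribute-position tuples."""
    edges = []
    for (i, (t, _)), (j, (s, _)) in combinations(T.items(), 2):
        if any(all(t[a] == s[a] for a in lhs) and any(t[a] != s[a] for a in rhs)
               for lhs, rhs in fds):
            edges.append((i, j))
    return edges

def two_approx_vertex_cover(T, edges):
    """Local-ratio 2-approximation for weighted vertex cover: pay the minimum
    residual weight on both endpoints of each edge; nodes driven to zero
    residual weight form the cover."""
    residual = {i: w for i, (_, w) in T.items()}
    for i, j in edges:
        if residual[i] > 0 and residual[j] > 0:
            pay = min(residual[i], residual[j])
            residual[i] -= pay
            residual[j] -= pay
    return {i for i in residual if residual[i] == 0}

def two_opt_srepair(T, fds):
    """Delete a 2-approximate vertex cover of the conflict graph."""
    cover = two_approx_vertex_cover(T, conflict_edges(T, fds))
    return {i: T[i] for i in T if i not in cover}
```

Since FD violations are pairwise, removing any vertex cover of the conflict graph indeed leaves a consistent subset, and the local-ratio argument bounds its cost by twice the optimum.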
The reader can observe that the success or failure of
OptSRepair(∆, T) depends only on ∆, and not on T. The algorithm OSRSucceeds(∆), depicted as Algorithm 2, tests whether ∆ is such that OptSRepair succeeds, by simulating the cases and the corresponding changes to ∆. The next theorem shows that, under conventional complexity assumptions,
OptSRepair covers all sets ∆ such that an optimal S-repair can be found in polynomial time. Hence, we establish a dichotomy in the complexity of computing an optimal S-repair.
Theorem 3.4.
Let ∆ be a set of FDs.

● If OSRSucceeds(∆) returns true, then an optimal S-repair can be computed in polynomial time by executing OptSRepair(∆, T) on the input T.
● If OSRSucceeds(∆) returns false, then computing an optimal S-repair is APX-complete, and remains APX-complete on unweighted, duplicate-free tables.

Moreover, the execution of OSRSucceeds(∆) terminates in polynomial time in |∆|.

Recall that a problem in APX has a constant-factor approximation and, under the assumption that P ≠ NP, an APX-hard problem cannot be approximated better than some constant factor (that may depend on the problem itself).
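Since the outcome of OSRSucceeds depends only on ∆, the dichotomy test is easy to implement. The following Python sketch is our own rendering of Algorithm 2 (the helper names are ours): FDs are given as (lhs, rhs) pairs of attribute-name collections, and the helpers compute the attribute closure cl_∆(X), the ∆ − X operation, and the three simplification tests.

```python
def closure(X, fds):
    """cl_Delta(X): all attributes A such that X -> A is entailed by fds."""
    closed = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def minus(fds, X):
    """Delta - X: drop the attributes of X from every lhs and rhs."""
    return {(lhs - X, rhs - X) for lhs, rhs in fds}

def osr_succeeds(fds):
    """Algorithm 2: True iff OptSRepair succeeds on this FD set."""
    fds = {(frozenset(l), frozenset(r)) for l, r in fds}
    while True:
        fds = {(l, r) for l, r in fds if not r <= l}  # remove trivial FDs
        if not fds:                                   # Delta became trivial
            return True
        # common lhs: an attribute occurring in every lhs
        common = frozenset.intersection(*(l for l, _ in fds))
        # consensus FD: an FD with an empty lhs
        consensus = next((a for l, r in fds if not l for a in r), None)
        # lhs marriage: distinct lhs with equal closures covering every lhs
        lhss = {l for l, _ in fds}
        marriage = next(
            ((x1, x2) for x1 in lhss for x2 in lhss
             if x1 != x2 and closure(x1, fds) == closure(x2, fds)
             and all(x1 <= l or x2 <= l for l, _ in fds)),
            None)
        if common:
            fds = minus(fds, common)
        elif consensus is not None:
            fds = minus(fds, frozenset([consensus]))
        elif marriage is not None:
            fds = minus(fds, marriage[0] | marriage[1])
        else:
            return False
```

For instance, osr_succeeds([(("A",), ("B",)), (("B",), ("C",))]) returns False, matching the APX-complete side of the dichotomy, while the running example's ∆ yields True.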
Example 3.5. Consider the FD set ∆ of our running example.
OSRSucceeds(∆) transforms ∆ as follows:

{facility → city, facility room → floor}
  (common lhs) ⇛ {∅ → city, room → floor}
  (consensus) ⇛ {room → floor}
  (common lhs) ⇛ {∅ → floor}
  (consensus) ⇛ {}

Hence,
OSRSucceeds(∆) is true, and hence, an optimal S-repair can be found in polynomial time.

Next, consider the FD set ∆_{A↔B→C} from Example 3.1. OSRSucceeds(∆_{A↔B→C}) executes as follows:

{A → B, B → A, B → C}
  (lhs marriage) ⇛ {∅ → C}
  (consensus) ⇛ {}

Hence, this is again an example of an FD set on the tractable side of the dichotomy. As the last positive example we consider the FD set ∆ of Example 3.1:

{ssn → first, ssn → last, first last → ssn, ssn → address, ssn office → phone, ssn office → fax}
  (lhs marriage) ⇛ {∅ → address, office → phone, office → fax}
  (consensus) ⇛ {office → phone, office → fax}
  (common lhs) ⇛ {∅ → phone, ∅ → fax}
  (consensus) ⇛ {}

On the other hand, for ∆ = {A → B, B → C}, none of the conditions of OSRSucceeds(∆) is true, and therefore, the algorithm returns false. It thus follows from Theorem 3.4 that computing an optimal S-repair is APX-complete (even if all tuple weights are the same and there are no duplicate tuples). The same applies to ∆ = {A → B, C → D}.

Table 1: FD sets over R(A, B, C) used in the proof of hardness of Theorem 3.4.

  Name            | FDs
  ∆_{A→B→C}       | A → B, B → C
  ∆_{A→C←B}       | A → C, B → C
  ∆_{AB→C→B}      | AB → C, C → B
  ∆_{AB↔AC↔BC}    | AB → C, AC → B, BC → A

As another example, the following corollary of Theorem 3.4 generalizes the tractability of our running example to general chain FD sets.
Corollary 3.6. If ∆ is a chain FD set, then an optimal S-repair can be computed in polynomial time.

Proof. The reader can easily verify that when ∆ is a chain FD set,
OSRSucceeds(∆) will reduce it to emptiness by repeatedly removing consensus attributes and common-lhs attributes, as done in our running example.

In this section we discuss the proof of Theorem 3.4. (The full proof is in the Appendix.) The positive side is a direct consequence of Theorem 3.2. For the negative side, membership in APX is due to Proposition 3.3. The proof of hardness is based on the concept of a fact-wise reduction [22], as previously done for proving dichotomies on sets of FDs [16, 22, 23, 26]. In our setup, a fact-wise reduction is defined as follows. Let R and R′ be two relation schemas. A tuple mapping from R to R′ is a function µ that maps tuples over R to tuples over R′. We extend µ to map tables T over R to tables over R′ by defining µ(T) to be {µ(t) ∣ t ∈ T}. Let ∆ and ∆′ be sets of FDs over R and R′, respectively. A fact-wise reduction from (R, ∆) to (R′, ∆′) is a tuple mapping Π from R to R′ with the following properties: (a) Π is injective, that is, for all tuples t₁ and t₂ over R, if Π(t₁) = Π(t₂) then t₁ = t₂; (b) Π preserves consistency and inconsistency, that is, Π(T) satisfies ∆′ if and only if T satisfies ∆; and (c) Π is computable in polynomial time. The following lemma is straightforward.
Lemma 3.7. Let R and R′ be relation schemas and ∆ and ∆′ FD sets over R and R′, respectively. If there is a fact-wise reduction from (R, ∆) to (R′, ∆′), then there is a strict reduction from the problem of computing an optimal S-repair under R and ∆ to that of computing an optimal S-repair under R′ and ∆′.

In the remainder of this section, we describe the way we use Lemma 3.7. Our proof consists of four steps.
1. We first prove APX-hardness for each of the FD sets in Table 1 over R(A, B, C). For ∆_{A→B→C} and ∆_{A→C←B} we adapt reductions by Gribkoff et al. [20] in a work that we discuss in Section 3.4. For ∆_{AB→C→B} we show a reduction from MAX-non-mixed-SAT [21]. Most intricate is the proof for ∆_{AB↔AC↔BC}, where we devise a nontrivial adaptation of a reduction by Amini et al. [3] to triangle packing in graphs of bounded degree.
2. Next, we prove that whenever OSRSucceeds simplifies ∆ into ∆′, there is a fact-wise reduction from (R, ∆′) to (R, ∆), where R is the underlying relation schema.
3. Then, we consider an FD set ∆ that cannot be further simplified (that is, ∆ does not have a common lhs, a consensus FD, or an lhs marriage). We show that ∆ can be classified into one of five certain classes of FD sets (that we discuss next).
4. Finally, we prove that for each FD set ∆ in one of the five classes there exists a fact-wise reduction from one of the four schemas of Table 1.

The most challenging part of the proof is identifying the classes of FD sets in Step 3 in such a way that we are able to build the fact-wise reductions in Step 4. We first identify that if an FD set ∆ cannot be simplified, then there are at least two distinct local minima X₁ → Y₁ and X₂ → Y₂ in ∆. By a local minimum we mean an FD with a set-minimal lhs, that is, an FD X → Y such that no FD Z → W in ∆ satisfies that Z is a strict subset of X.
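The notions used in the classification that follows (attribute closure, local minima, and the sets cl∆(Xᵢ) ∖ Xᵢ) are directly computable; a small sketch:

```python
def closure(fds, attrs):
    """Attribute closure cl(attrs) under fds, given as (lhs, rhs) pairs."""
    out = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= out and not set(rhs) <= out:
                out |= set(rhs)
                changed = True
    return out

def local_minima(fds):
    """FDs whose lhs is set-minimal among all lhs's in fds."""
    lhss = [set(l) for l, _ in fds]
    return [(l, r) for l, r in fds
            if not any(x < set(l) for x in lhss)]

# For instance, for the FD set {A -> BC, B -> D} with X1 = {A}, X2 = {B}:
fds = [(("A",), ("B", "C")), (("B",), ("D",))]
x1, x2 = {"A"}, {"B"}
x1hat = closure(fds, x1) - x1   # cl({A}) - {A}
x2hat = closure(fds, x2) - x2   # cl({B}) - {B}
```

Here both FDs are local minima, and the computed sets x1hat and x2hat are the X̂₁ and X̂₂ used in the class analysis.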
We pick any two local minima from ∆. Then, we divide the FD sets into five classes based on the relationships between X₁, X₂, cl∆(X₁) ∖ X₁, which we denote by X̂₁, and cl∆(X₂) ∖ X₂, which we denote by X̂₂. The classes are illustrated in Figure 2. Each line in Figure 2 represents one of the sets X₁, X₂, X̂₁ or X̂₂. If two lines do not overlap, it means that we assume that the corresponding two sets are disjoint. For example, the sets X̂₁ and X̂₂ in class (1) have an empty intersection. Overlapping lines represent sets that have a nonempty intersection, an example being the sets X̂₁ and X̂₂ in class (2). If two dashed lines overlap, it means that we do not assume anything about their intersection. As an example, the sets X₁ and X₂ can have an empty or a nonempty intersection in each one of the classes. Finally, if a line covers another line, it means that the set corresponding to the first line contains the set corresponding to the second line. For instance, the set X̂₂ in class (4) contains the set X₁ ∖ X₂, while in class (5) it holds that (X₁ ∖ X₂) ⊄ X̂₂. We remark that Figure 2 well covers the important cases that we need to analyze, but it misses a few cases. (As previously said, full details are in the Appendix.)

Example 3.8.
Class 1. ∆ = {A → B, C → D}. In this case X₁ = {A}, X₂ = {C}, X̂₁ = {B} and X̂₂ = {D}. Thus, X̂₁ ∩ X₂ = ∅, X̂₂ ∩ X₁ = ∅ and X̂₁ ∩ X̂₂ = ∅, and indeed the only overlapping lines in (1) are the dashed lines corresponding to X₁ and X₂.

Class 2. ∆ = {A → CD, B → CE}. It holds that X₁ = {A}, X₂ = {B}, X̂₁ = {C, D} and X̂₂ = {C, E}. Hence, X̂₁ ∩ X₂ = ∅ and X̂₂ ∩ X₁ = ∅, but X̂₁ ∩ X̂₂ ≠ ∅, and the difference from (1) is that the lines corresponding to X̂₁ and X̂₂ in (2) overlap.

Figure 2: Classes of FD sets that cannot be simplified.
Class 3. ∆ = {A → BC, B → D}. Here, it holds that X₁ = {A}, X₂ = {B}, X̂₁ = {B, C, D} and X̂₂ = {D}. Thus, X̂₁ ∩ X₂ ≠ ∅, but X̂₂ ∩ X₁ = ∅. The difference from (2) is that now the lines corresponding to X₂ and X̂₁ overlap, and we do not assume anything about the intersection between X̂₁ and X̂₂.

Class 4. ∆ = {AB → C, AC → B, BC → A}. In this case we have three local minima. We pick two of them: AB → C and AC → B. Now, X₁ = {A, B}, X₂ = {A, C}, X̂₁ = {C} and X̂₂ = {B}. Thus, X̂₁ ∩ X₂ ≠ ∅ and X̂₂ ∩ X₁ ≠ ∅. The difference from (3) is that now the lines corresponding to X₁ and X̂₂ overlap. Moreover, the line corresponding to X̂₁ covers the entire line corresponding to X₂ ∖ X₁ and the line corresponding to X̂₂ covers the entire line corresponding to X₁ ∖ X₂. This means that we assume that (X₂ ∖ X₁) ⊆ X̂₁ and (X₁ ∖ X₂) ⊆ X̂₂.

Class 5. ∆ = {AB → C, C → AD}. Here, X₁ = {A, B}, X₂ = {C}, X̂₁ = {C, D} and X̂₂ = {A, D}; therefore X̂₁ ∩ X₂ ≠ ∅ and X̂₂ ∩ X₁ ≠ ∅. The difference from (4) is that now we assume that (X₁ ∖ X₂) ⊄ X̂₂.

In this section, we draw a connection to the
Most Probable Database problem (MPD) [20]. A table in our setting can be viewed as a relation of a tuple-independent database [14] if each weight is in the interval [0, 1]. In that case, we view the weight as the probability of the corresponding tuple, and we call the table a probabilistic table. Such a table T represents a probability space over the subsets of T, where a subset is selected by considering each tuple T[i] independently and selecting it with the probability w_T(i), or equivalently, deleting it with the probability 1 − w_T(i). Hence, the probability of a subset S, denoted Pr_T(S), is given by:

Pr_T(S) def= ( ∏_{i ∈ ids(S)} w_T(i) ) × ( ∏_{i ∈ ids(T)∖ids(S)} (1 − w_T(i)) )    (2)

Given a constraint ϕ over the schema of T, MPD for ϕ is the problem of computing a subset S that satisfies ϕ and has the maximal probability among all such subsets. Here, we consider the case where ϕ is a set ∆ of FDs. Hence, MPD for ∆ is the problem of computing argmax_{S ⊆ T, S ⊧ ∆} Pr_T(S).

Gribkoff, Van den Broeck, and Suciu [20] proved the following dichotomy for unary
FDs, which are FDs of the form A → X, having a single attribute on their lhs.

Theorem 3.9. Let ∆ be a set of unary FDs over a relational schema. MPD for ∆ is either solvable in polynomial time or NP-hard.

The question of whether such a dichotomy holds for general (not necessarily unary) FDs has been left open. The following corollary of Theorem 3.4 fully resolves this question.
Theorem 3.10. Let ∆ be a set of FDs over a relational schema. If OSRSucceeds(∆) is true, then MPD for ∆ is solvable in polynomial time; otherwise, it is NP-hard.

Proof.
We first show a reduction from MPD to the problem of computing an optimal S-repair. Let T be an input for MPD. By a certain tuple we refer to a tuple identifier i ∈ ids(T) such that w_T(i) = 1. We assume that the set of certain tuples satisfies ∆ collectively, since otherwise the probability of any consistent subset is zero (and we can select, e.g., the empty subset as a most likely solution). We can then replace each probability 1 with a probability that is smaller than, yet close enough to, 1, so that every consistent subset that excludes a certain fact is less likely than any subset that includes all certain facts. In addition, as observed by Gribkoff et al. [20], tuples with probability at most 1/2 can be deleted up front, since excluding such a tuple never decreases the probability of a subset. Hence, we can assume that 1/2 < w_T(i) < 1 for all i ∈ ids(T). From (2) we conclude the following.

Pr_T(S) = ( ∏_{i ∈ ids(S)} w_T(i)/(1 − w_T(i)) ) × ( ∏_{i ∈ ids(T)} (1 − w_T(i)) )
        ∝ ∏_{i ∈ ids(S)} w_T(i)/(1 − w_T(i))

The reason for the proportionality (∝) is that all consistent subsets share the same right factor. Hence, we construct a table T′ that is the same as T, except that w_{T′}(i) = log(w_T(i)/(1 − w_T(i))) for all i ∈ ids(T′), and then a most likely database of T is the same as an optimal S-repair of T′. (We do not need to assume infinite precision to work with logarithms, since the algorithms we use for computing an optimal S-repair can replace addition and subtraction with multiplication and division, respectively.)

For the "otherwise" part, we show a reduction from the problem of computing an optimal S-repair of an unweighted table to MPD. The reduction is straightforward: given T, we set the weight w_T(i) of each tuple to the same constant probability strictly between 1/2 and 1.
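The weight transformation in the proof can be checked by brute force on a tiny probabilistic table (a sketch; the table and probabilities are invented for illustration):

```python
import math
from itertools import combinations

table = {1: {"A": 1, "B": 1}, 2: {"A": 1, "B": 2}, 3: {"A": 2, "B": 1}}
w = {1: 0.9, 2: 0.6, 3: 0.8}          # all strictly between 1/2 and 1
fds = [(("A",), ("B",))]               # the single FD A -> B

def consistent(ids):
    for lhs, rhs in fds:
        seen = {}
        for i in ids:
            key = tuple(table[i][a] for a in lhs)
            val = tuple(table[i][a] for a in rhs)
            if seen.setdefault(key, val) != val:
                return False
    return True

subsets = [frozenset(c) for r in range(len(table) + 1)
           for c in combinations(table, r) if consistent(c)]

def pr(s):  # Pr_T(S) as in equation (2)
    return math.prod(w[i] if i in s else 1 - w[i] for i in table)

def transformed(s):  # total log-odds weight of the kept tuples
    return sum(math.log(w[i] / (1 - w[i])) for i in s)

most_probable = max(subsets, key=pr)
max_weight = max(subsets, key=transformed)
```

Maximizing the probability and maximizing the total transformed weight pick the same consistent subset, which is exactly what the reduction exploits.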
Comment. The FD set ∆_{A↔B→C} defined in (1) is classified as polynomial time in our dichotomy, while it is classified as NP-hard by Gribkoff et al. [20]. This is due to a gap in their proof of hardness.
4. COMPUTING AN OPTIMAL U-REPAIR
In this section, we focus on the problem of finding an optimal U-repair and an approximation thereof. We devise general tools for analyzing the complexity of this problem, compare it to the problem of finding an optimal S-repair (discussed in the previous section), and identify sufficient conditions for efficient reductions between the two problems. Yet, unlike S-repairs, the existence of a full dichotomy for computing an optimal U-repair remains an open problem.
Notation.
Let ∆ be a set of FDs. An lhs cover of ∆ is a set C of attributes that hits every lhs, that is, X ∩ C ≠ ∅ for every X → Y in ∆. We denote the minimum cardinality of an lhs cover of ∆ by mlc(∆). For instance, if ∆ is nonempty and has a common lhs (e.g., Figure 1), then mlc(∆) = 1. For a set ∆ of FDs, attr(∆) denotes the set of attributes that appear in ∆ (i.e., the union of the lhs and rhs over all the FDs in ∆). Two FD sets ∆₁ and ∆₂ (over the same schema) are attribute disjoint if attr(∆₁) and attr(∆₂) are disjoint. For example, {A → BC, C → D} and {E → FG} are attribute disjoint.

In this section we show two reductions between sets of FDs. The following theorem implies that to determine the complexity of the union of two attribute-disjoint FD sets, it suffices to look at each set separately.
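Before turning to the theorem, note that mlc(∆) itself is easy to compute by exhaustive search for the small, fixed FD sets considered here (a sketch; finding a minimum lhs cover in general amounts to hitting set, so this is exponential in the number of attributes):

```python
from itertools import combinations

def mlc(fds):
    """Minimum cardinality of an lhs cover: a set of attributes hitting
    every (nonempty) lhs. Exhaustive search over candidate covers."""
    lhss = [set(l) for l, _ in fds if l]
    if not lhss:
        return 0
    attrs = sorted(set().union(*lhss))
    for k in range(len(attrs) + 1):
        for cand in combinations(attrs, k):
            if all(set(cand) & l for l in lhss):
                return k
```

For example, an FD set with a common lhs has mlc = 1, while {A → B, C → D} needs a cover of size 2.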
Theorem 4.1. Suppose that ∆ = ∆₁ ∪ ∆₂, where ∆₁ and ∆₂ are attribute disjoint. The following are equivalent for all α ≥ 1.
1. An α-optimal U-repair can be computed in polynomial time under ∆.
2. An α-optimal U-repair can be computed in polynomial time under each of ∆₁ and ∆₂.

If we have a polynomial-time algorithm to compute an α-optimal solution for ∆, it can be used to obtain an α-optimal solution for ∆₁ by setting all attributes not in attr(∆₁) to 0 and running the algorithm (similarly for ∆₂). In the reverse direction, it can be shown that simply composing α-optimal solutions of ∆₁ and ∆₂ gives an α-optimal solution for ∆.

(The gap in the hardness proof of [20] mentioned in the comment above has been established in a private communication with the authors of [20].)

Example 4.2. Consider the following set of FDs.

∆ def= {item → cost, buyer → address}
We will later show that if ∆ consists of a single FD, then an optimal U-repair can be computed in polynomial time. Hence, we can compute an optimal U-repair under ∆₁ = {item → cost} and under ∆₂ = {buyer → address}. Then, Theorem 4.1 implies that an optimal U-repair can be computed in polynomial time under ∆ as well. Now consider the following set of FDs.

∆′ def= {item → cost, buyer → address, address → state}

Kolahi and Lakshmanan [24] proved that it is NP-hard to compute an optimal U-repair for {A → B, B → C}, by a reduction from the problem of finding a minimum vertex cover of a graph G. Their reduction is, in fact, a PTAS reduction if we use vertex cover in a graph of bounded degree [2]. Hence, computing an optimal U-repair is APX-hard for this set of FDs. Theorem 4.1 then implies that it is also APX-hard for ∆′.

Next, we discuss the problem in the presence of consensus FDs. The following theorem states that such FDs do not change the complexity of the problem. Recall that, for a set ∆ of FDs and a set X of attributes, the set ∆ − X denotes the set of FDs that is obtained from ∆ by removing each attribute of X from the lhs and rhs of every FD. Also recall that cl∆(∅) is the set of all consensus attributes.
Theorem 4.3. Let ∆ be a set of FDs. There is a strict reduction from computing an optimal U-repair for ∆ to computing an optimal U-repair for ∆ − cl∆(∅), and vice versa.

The proof (given in the Appendix) uses Theorem 4.1 and a special treatment of the case where all FDs are consensus FDs. As an example of applying Theorem 4.3, if ∆ consists of only consensus FDs, then an optimal U-repair can be computed in polynomial time, since ∆ − cl∆(∅) is empty. As another example, if ∆ is the set {∅ → D, AD → B, B → CD}, then ∆ − cl∆(∅) = {A → B, B → C} and, according to Theorem 4.3, computing an optimal U-repair is APX-hard, since this problem is hard for {A → B, B → C} due to Kolahi and Lakshmanan [24], as explained in Example 4.2.

In this section we establish several results that enable us to infer complexity results for the problem of computing an optimal U-repair from that of computing an optimal S-repair via polynomial-time reductions. These results are based on the following proposition, which shows that we can transform a consistent update into a consistent subset (with no extra cost) and, in the absence of consensus FDs, a consistent subset into a consistent update (with some extra cost). We give the proof here, as it shows the actual constructions.
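Before the proposition, the ∆ − cl∆(∅) step in the example above can be checked mechanically (a sketch):

```python
def closure(fds, attrs):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    out = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= out and not set(rhs) <= out:
                out |= set(rhs)
                changed = True
    return out

def minus(fds, x):
    """Delta - X: drop the attributes of X from every lhs and rhs,
    discarding FDs whose rhs becomes empty."""
    out = {(frozenset(l) - set(x), frozenset(r) - set(x)) for l, r in fds}
    return {(l, r) for l, r in out if r}

delta = [((), ("D",)), (("A", "D"), ("B",)), (("B",), ("C", "D"))]
reduced = minus(delta, closure(delta, set()))
```

Here cl∆(∅) = {D}, and removing it indeed leaves {A → B, B → C}, as claimed in the example.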
Proposition 4.4. Let ∆ be a set of FDs and T a table. The following can be done in polynomial time.
1. Given a consistent update U, construct a consistent subset S such that dist_sub(S, T) ≤ dist_upd(U, T).
2. Given a consistent subset S, and assuming that ∆ is consensus free, construct a consistent update U such that dist_upd(U, T) ≤ mlc(∆) · dist_sub(S, T).

Proof. We construct S from U by excluding every i ∈ ids(T) such that T[i] has at least one attribute updated in U (i.e., H(T[i], U[i]) ≥ 1). In the other direction, we construct U from S as follows. Let C be an lhs cover of minimum cardinality mlc(∆). The tuple of each i ∈ ids(S) is left intact, and for i ∈ ids(T) ∖ ids(S) we update the value of T[i].A, for each attribute A ∈ C, to a fresh constant from our infinite domain Val. Since C is an lhs cover and there are no consensus FDs, for all X → Y in ∆ it holds that two distinct tuples in U that agree on X must correspond to intact tuples; hence, U is consistent (as S is consistent).

As we discuss later in Section 4.4, Proposition 4.4, combined with Proposition 3.3, reestablishes the result of Kolahi and Lakshmanan [24] stating that computing an optimal U-repair is in APX. We also establish the following additional consequences of Proposition 4.4. The first is an immediate corollary (that we refer to later on) about the relationship between optimal repairs.
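Before the corollary, the two constructions in the proof above can be sketched directly (fresh constants are simulated with tagged values; the table is invented for illustration):

```python
def update_to_subset(table, update):
    """Part 1: keep exactly the identifiers whose tuple is left intact."""
    return {i: t for i, t in table.items() if t == update[i]}

def subset_to_update(table, subset_ids, lhs_cover):
    """Part 2 (consensus-free case): leave the tuples of the subset
    intact, and give every other tuple a fresh value on each attribute
    of an lhs cover, so no two distinct tuples agree on any lhs."""
    fresh, update = 0, {}
    for i, t in table.items():
        update[i] = dict(t)
        if i not in subset_ids:
            for a in lhs_cover:
                update[i][a] = ("fresh", fresh)  # outside the active domain
                fresh += 1
    return update
```

The update produced for a deleted tuple changes exactly ∣C∣ = mlc(∆) cells, which is where the mlc(∆) factor in Part 2 comes from.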
Corollary 4.5. Let ∆ be a set of FDs, T a table, S* an optimal S-repair of T, and U* an optimal U-repair of T. Then dist_sub(S*, T) ≤ dist_upd(U*, T). Moreover, if ∆ is consensus free, then dist_upd(U*, T) ≤ mlc(∆) · dist_sub(S*, T).

The second consequence relates to FD sets ∆ with a common lhs, that is, mlc(∆) = 1.
Corollary 4.6. Let ∆ be an FD set with a common lhs. There is a strict reduction from the problem of computing an optimal S-repair to that of computing an optimal U-repair, and vice versa.

For example, if ∆ consists of a single FD, then an optimal U-repair can be computed in polynomial time. Additional examples follow.
Example 4.7. Consider again the FD set ∆ of our running example (Figure 1). Since ∆ has a common lhs (and, as shown in Example 3.5, passes the test of OSRSucceeds), we get that an optimal U-repair can also be computed in polynomial time for ∆.

As another illustration, consider the following FD set.

∆ def= {id country → passport, id passport → country}

Again, as ∆ has a common lhs, and ∆ passes the test of OSRSucceeds (by applying common lhs followed by an lhs marriage), from Theorem 3.4 we conclude that an optimal U-repair can be found in polynomial time. Finally, consider the following set of FDs.

∆ def= {state city → zip, state zip → country}

The reader can verify that ∆ fails OSRSucceeds, and therefore, from Theorem 3.4 we conclude that computing an optimal U-repair is APX-complete.

By combining Theorem 4.3, Corollary 4.6, and Corollary 3.6, we conclude the following.
Corollary 4.8. If ∆ is a chain FD set, then an optimal U-repair can be computed in polynomial time.

Proof. If ∆ is a chain FD set, then so is ∆ − cl∆(∅). Theorem 4.3 states that computing an optimal U-repair has the same complexity under the two FD sets. Moreover, if ∆ − cl∆(∅) is nonempty, then it has at least one common lhs. From Corollary 4.6 we conclude that the problem then strictly reduces to computing an optimal S-repair, which, by Corollary 3.6, can be done in polynomial time.

Hence, for chain FD sets, an optimal repair can be computed for both the subset and update variants. Corollaries 4.6 and 4.8 state cases where computing an optimal S-repair has the same complexity as computing an optimal U-repair. A basic case (among others) that the corollaries do not cover is in the following proposition, where again both variants have the same (polynomial-time) complexity.
Proposition 4.9. Under ∆ = {A → B, B → A}, an optimal U-repair can be computed in polynomial time.

For ∆ = {A → B, B → A}, although mlc(∆) = 2, we show that dist_upd(U*, T) = dist_sub(S*, T) for an optimal U-repair U* and an optimal S-repair S* for any table T over R. Since ∆ passes the test of OSRSucceeds (by applying an lhs marriage), from Theorem 3.4 an optimal S-repair can be computed in polynomial time, and therefore an optimal U-repair under {A → B, B → A} can also be computed in polynomial time.

Do the two variants of optimal repairs feature the same complexity for every set of FDs? Next, we answer this question in a negative way. We have already seen an example of an FD set ∆ where an optimal U-repair can be computed in polynomial time, but finding an optimal S-repair is APX-complete. Indeed, Example 4.2 shows that {A → B, C → D} is a tractable case for optimal U-repairs; yet, it fails the test of OSRSucceeds, and is therefore hard for optimal S-repairs (Theorem 3.4). Showing an example of the other direction is more involved. The FD set in this example is ∆_{A↔B→C} from Example 3.1. In Example 3.5 we showed that ∆_{A↔B→C} passes the test of OSRSucceeds, and therefore, an optimal S-repair can be computed in polynomial time. This is not the case for an optimal U-repair.
Theorem 4.10. For the relation schema R(A, B, C) and the FD set ∆_{A↔B→C}, computing an optimal U-repair is APX-complete, even on unweighted, duplicate-free tables.

The proof of hardness in Theorem 4.10 is inspired by, yet different from, the reduction of Kolahi and Lakshmanan [24] for showing hardness for {A → B, B → C}. The reduction is from the problem of finding a minimum vertex cover of a graph G(V, E). Every edge {u, v} ∈ E gives rise to the tuples (u, v, 0) and (v, u, 0). In addition, each vertex v ∈ V gives rise to the tuple (v, v, 1). The proof shows that there is a consistent update of cost at most 2∣E∣ + k (assuming a unit weight for each tuple) if and only if G has a vertex cover of size at most k. To establish a PTAS reduction, we use the APX-hardness of vertex cover when G is of bounded degree [2]. The challenging (and interesting) part of the proof is in showing that a consistent update of cost 2∣E∣ + k can be transformed into a vertex cover of size k; this part is considerably more involved than the corresponding proof of Kolahi and Lakshmanan [24]. We conclude with the following corollary, stating the existence of FD sets where the two variants of the problem feature different complexities.
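Before the corollary, the gadget table of the reduction can be sketched (a sketch following the description above; the exact third-column constants only need to distinguish edge tuples from vertex tuples):

```python
def gadget(vertices, edges):
    """Table over R(A, B, C) built from a graph: two tuples per edge and
    one per vertex, as in the reduction sketched above."""
    t = []
    for u, v in edges:
        t.append((u, v, 0))
        t.append((v, u, 0))
    for v in vertices:
        t.append((v, v, 1))
    return t

# Path graph a - b - c: 2|E| + |V| = 4 + 3 = 7 tuples, all distinct.
T = gadget(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Note that the construction produces an unweighted, duplicate-free table, matching the strengthened form of the hardness statement.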
Corollary 4.11. There exist FD sets ∆₁ and ∆₂ such that:
1. Under ∆₁ an optimal S-repair can be computed in polynomial time, but computing an optimal U-repair is APX-complete.
2. Under ∆₂ an optimal U-repair can be computed in polynomial time, but computing an optimal S-repair is APX-complete.

In this section we discuss approximations for optimal U-repairs. The combination of Propositions 3.3 and 4.4 gives the following.
Theorem 4.12. Let ∆ be a set of FDs, and let α = 2 · mlc(∆). An α-optimal U-repair can be computed in polynomial time.

Note that the approximation ratio can be further improved by applying Theorem 4.1: if ∆ is the union of attribute-disjoint FD sets ∆₁ and ∆₂, then an α-optimal U-repair can be computed (under ∆) where α = 2 · max{mlc(∆₁), mlc(∆₂)}.

Kolahi and Lakshmanan [24] have also given a constant-factor approximation algorithm for U-repairs (assuming ∆ is fixed). To compare their ratio with ours, we first explain their ratio. Let ∆ be a set of FDs and assume (without loss of generality) that the rhs of each FD consists of a single attribute. Kolahi and Lakshmanan [24] define two measures on ∆. By
MFS(∆) we denote the maximum number of attributes involved in the lhs of any FD in ∆. An implicant of an attribute A is a set X of attributes such that X → A is in the closure of ∆. A core implicant of A is a set C of attributes that hits every implicant of A (i.e., X ∩ C ≠ ∅ whenever X → A is in the closure of ∆). A minimum core implicant of A is a core implicant of A with the smallest cardinality. By MCI(∆) we denote the size of the largest minimum core implicant over all attributes A.

Theorem 4.13. For any set ∆ of FDs, an α-optimal U-repair can be computed in polynomial time where α = (MCI(∆) + 1) · (MFS(∆) − 1).

In both Theorems 4.12 and 4.13, the approximation ratios are constants under data complexity, but depend on ∆. It is still unknown whether there is a constant α that applies to all FD sets ∆. Yet, it is known that a constant-ratio approximation cannot be obtained in polynomial time under combined complexity (where R, T, and ∆ are all given as input) [24]. Although the proof of Theorem 4.12 is much simpler than the nontrivial proof of Theorem 4.13 given in [24], it can be noted that the approximation ratios in these two theorems are not directly comparable. If k is the number of attributes, then the worst-case approximation ratio in Theorem 4.13 is quadratic in k, while the worst-case approximation ratio in Theorem 4.12 is linear in k (precisely, linear in min(k, ∣∆∣)). Moreover, an easy observation is that the ratio between the two approximation ratios can be at most linear in k. In the remainder of this section, we illustrate the difference between the approximations with examples.

First, we show an infinite sequence of FD sets where the approximation ratio of Theorem 4.12 is Θ(k) and that of Theorem 4.13 is Θ(k²). For a natural number k ≥ 1, we define ∆_k as follows.

∆_k def= {A₁⋯A_k → B₁, B₁ → C, B₁ → A₁, . . . , B_k → A₁}

The approximation ratio for ∆_k given by Theorem 4.12 is 2(k + 1). For the approximation ratio of Theorem 4.13, we have MFS(∆_k) = k (due to the FD A₁⋯A_k → B₁) and MCI(∆_k) = k (since the minimum core implicant of A₁ is {B₁, . . . , B_k}). Hence, the approximation ratio of Theorem 4.13 grows quadratically with k (i.e., it is Θ(k²)). On the other hand, following is a sequence of FD sets in which the approximation ratio of Theorem 4.12 grows linearly with k, while that of Theorem 4.13 is a constant.

∆′_k def= {A₁A₂ → B₁, A₂A₃ → B₂, . . . , A_kA_{k+1} → B_k}

Here, the approximation ratio of Theorem 4.12 is Θ(k) (since mlc(∆′_k) is ⌊(k + 1)/2⌋), but that of Theorem 4.13 is constant, since MFS(∆′_k) = 2 and MCI(∆′_k) = 1. Finally, we show that computing an optimal U-repair under each of ∆_k and ∆′_k is a hard problem; thus, an approximation is, indeed, needed.

Theorem 4.14.
Let k ≥ 2 be fixed. Computing an optimal U-repair is APX-complete for:
1. R(A₁, . . . , A_k, B₁, . . . , B_k, C) and ∆_k;
2. R(A₁, . . . , A_{k+1}, B₁, . . . , B_k) and ∆′_k.

The proof for ∆_k is by a reduction from computing an optimal U-repair under {A → B, B → C} (see Example 4.2). For ∆′_k, we first show that the problem is APX-hard for k = 2. In this case, the FD set contains the two FDs A₁A₂ → B₁ and A₂A₃ → B₂; thus, A₂ is a common lhs, and Corollary 4.6, combined with the fact that computing an optimal S-repair under {A → B, C → D} is APX-hard (Theorem 3.4), implies that computing an optimal U-repair is APX-hard as well. Then, we construct a reduction from computing an optimal U-repair under ∆′_k for k = 2 to that under ∆′_k for k > 2.
5. DISCUSSION AND FUTURE WORK
We investigated the complexity of computing an optimal S-repair and an optimal U-repair. For the former, we established a dichotomy over all sets of FDs (and schemas). For the latter, we developed general techniques for complexity analysis, showed concrete complexity results, and explored the connection to the complexity of S-repairs. We presented approximation results and, in the case of U-repairs, compared them to the approximation of Kolahi and Lakshmanan [24]. In the case of S-repairs, we drew a direct connection to probabilistic database repairs, and completed a dichotomy by Gribkoff et al. [20] to the entire space of FDs. Quite a few directions are left for future investigation, and we conclude with a discussion of some of these.

As our results are restricted to FDs, an obvious important direction is to extend our study to other types of integrity constraints, such as denial constraints [18], conditional FDs [10], referential constraints [15], and tuple-generating dependencies [8]. Moreover, the repair operations we considered are either exclusively tuple deletions or exclusively value updates. Hence, another clear direction is to allow mixtures of deletions, insertions and updates, where the cost depends on the operation type, the involved tuple, and the involved attribute (in the case of updates).

Our understanding of the complexity of computing an optimal U-repair is considerably more restricted than that of an optimal S-repair. We would like to complete our complexity analysis for optimal U-repairs into a full dichotomy. More fundamentally, we would like to incorporate restrictions on the allowed value updates. Our results are heavily based on the ability to update any cell with any value from an infinite domain. A natural restriction on the update repairs is to allow revising only certain attributes, possibly using a finite (small) space of possible new values.
It is not clear how to incorporate such a restriction in our results and proof techniques. In the case of S-repairs, we are interested in incorporating preferences, as in the framework of prioritized repairing by Staworko et al. [29]. There, priorities among tuples allow us to eliminate subset repairs that are inferior to others (where "inferior" has several possible interpretations). It may be the case that priorities are rich enough to clean the database unambiguously [23]. A relevant question is, then, what is the minimal number of tuples that we need to delete in order to have an unambiguous repair? Alternatively, how many preferences are needed for this cause?

6. REFERENCES
[1] F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In
ICDT, pages 31–41. ACM, 2009.
[2] P. Alimonti and V. Kann. Some APX-completeness results for cubic graphs.
Theor. Comput. Sci., 237(1-2):123–134, 2000.
[3] O. Amini, S. Pérennes, and I. Sau. Hardness and approximation of traffic grooming.
Theor.Comput. Sci. , 410(38-40):3751–3760, 2009.[4] P. Andritsos, A. Fuxman, and R. J. Miller. Cleananswers over dirty databases: A probabilisticapproach. In
Proceedings of the 22ndInternational Conference on Data Engineering,ICDE 2006, 3-8 April 2006, Atlanta, GA, USA ,page 30. IEEE Computer Society, 2006.[5] M. Arenas, L. E. Bertossi, and J. Chomicki.Consistent query answers in inconsistentdatabases. In
PODS, pages 68–79. ACM, 1999.
[6] A. Assadi, T. Milo, and S. Novgorodov. DANCE: data cleaning with constraints and experts. In ICDE, pages 1409–1410. IEEE Computer Society, 2017.
[7] R. Bar-Yehuda and S. Even. A linear-time approximation algorithm for the weighted vertex cover problem.
J. Algorithms , 2(2):198–203, 1981.[8] C. Beeri and M. Y. Vardi. Formal systems fortuple and equality generating dependencies.
SIAMJ. Comput. , 13(1):76–98, 1984.[9] M. Bergman, T. Milo, S. Novgorodov, andW. Tan. QOCO: A query oriented data cleaningsystem with oracles.
PVLDB , 8(12):1900–1903,2015.[10] P. Bohannon, W. Fan, F. Geerts, X. Jia, andA. Kementsietsidis. Conditional functionaldependencies for data cleaning. In
ICDE , pages746–755. IEEE, 2007.[11] M. A. Casanova, R. Fagin, and C. H.Papadimitriou. Inclusion dependencies and theirinteraction with functional dependencies.
J.Comput. Syst. Sci. , 28(1):29–59, 1984.[12] J. Chomicki and J. Marcinkowski.Minimal-change integrity maintenance using tupledeletions.
Inf. Comput. , 197(1-2):90–121, 2005.[13] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K.Elmagarmid, I. F. Ilyas, M. Ouzzani, andN. Tang. NADEEF: a commodity data cleaningsystem. In
SIGMOD , pages 541–552. ACM, 2013.[14] N. N. Dalvi and D. Suciu. Efficient queryevaluation on probabilistic databases. In
VLDB , pages 864–875. Morgan Kaufmann, 2004.[15] C. J. Date. Referential integrity. In
VLDB , pages2–12. VLDB Endowment, 1981.[16] R. Fagin, B. Kimelfeld, and P. G. Kolaitis.Dichotomies in the complexity of preferredrepairs. In
PODS , pages 3–15. ACM, 2015.[17] W. Fan and F. Geerts.
Foundations of DataQuality Management . Synthesis Lectures on DataManagement. Morgan & Claypool Publishers,2012.[18] T. Gaasterland, P. Godfrey, and J. Minker. Anoverview of cooperative answering.
J. Intell. Inf.Syst. , 1(2):123–157, 1992.[19] F. Geerts, G. Mecca, P. Papotti, and D. Santoro.The LLUNATIC data-cleaning framework.
PVLDB , 6(9):625–636, 2013.[20] E. Gribkoff, G. V. den Broeck, and D. Suciu. Themost probable database problem. In
BUDA, 2014.
[21] J. Håstad. Some optimal inapproximability results.
J. ACM , 48(4):798–859, 2001.[22] B. Kimelfeld. A dichotomy in the complexity ofdeletion propagation with functionaldependencies. In
PODS , pages 191–202, 2012.[23] B. Kimelfeld, E. Livshits, and L. Peterfreund.Detecting ambiguity in prioritized databaserepairing. In
ICDT , volume 68 of
LIPIcs , pages17:1–17:20. Schloss Dagstuhl - Leibniz-Zentrumfuer Informatik, 2017.[24] S. Kolahi and L. V. S. Lakshmanan. Onapproximating optimum repairs for functionaldependency violations. In
ICDT , volume 361,pages 53–62. ACM, 2009.[25] M. W. Krentel. The complexity of optimizationproblems.
J. Comput. Syst. Sci. , 36(3):490–509,1988.[26] E. Livshits and B. Kimelfeld. Counting andenumerating (preferred) database repairs. In
PODS , pages 289–301, 2017.[27] A. Lopatenko and L. E. Bertossi. Complexity ofconsistent query answering in databases undercardinality-based and incremental repairsemantics. In
ICDT, pages 179–193, 2007.
[28] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference.
PVLDB , 10(11):1190–1201, 2017.[29] S. Staworko, J. Chomicki, and J. Marcinkowski.Prioritized repairing and consistent queryanswering in relational databases.
Ann. Math.Artif. Intell. , 64(2-3):209–246, 2012.[30] D. Suciu, D. Olteanu, R. Christopher, andC. Koch.
Probabilistic Databases. Morgan & Claypool Publishers, 1st edition, 2011.
APPENDIX
A. DETAILS FROM SECTION 3
In this section, we prove Theorem 3.4.
Theorem 3.4. Let ∆ be a set of FDs.
● If OSRSucceeds(∆) returns true, then an optimal S-repair can be computed in polynomial time by executing OptSRepair(∆, T) on the input T.
● If OSRSucceeds(∆) returns false, then computing an optimal S-repair is APX-complete, and remains APX-complete on unweighted, duplicate-free tables.
Moreover, the execution of OSRSucceeds(∆) terminates in polynomial time in ∣∆∣.
We start by proving the positive side of Theorem 3.4. Figure 3 illustrates our proof. The idea is the following: we start with a table T and an FD set ∆, for which we want to find an optimal S-repair. Then, we apply the simplifications (represented by the red arrows in the figure) until it is no longer possible. If at that point we are left with a trivial (possibly empty) set of FDs, we can find an optimal S-repair in polynomial time. In order to prove that, we show for each one of the simplifications (common lhs, consensus FD, and lhs marriage) that if the problem after the simplification can be solved in polynomial time, then we can use this solution to solve the original problem (before the simplification). The lemmas that appear next to the black arrows in the figure contain our proofs.

Lemma A.1. Let T be a table, and let ∆ be a set of FDs, such that ∆ has a common lhs A. If for each a ∈ π_A T[∗], OptSRepair(∆ − A, σ_{A=a}T) returns an optimal S-repair of σ_{A=a}T w.r.t. ∆ − A, then OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.

Proof.
Assume that for each a ∈ π_A T[∗], it holds that OptSRepair(∆ − A, σ_{A=a}T) returns an optimal S-repair of σ_{A=a}T w.r.t. ∆ − A. We contend that the algorithm OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆. Since the condition of line 4 of OptSRepair is satisfied, the subroutine CommonLHSRep will be called. Thus, we have to prove that CommonLHSRep(∆, T) returns an optimal S-repair of T.

Let J be the result of CommonLHSRep(∆, T). We start by proving that J is consistent. Assume, by way of contradiction, that J is not consistent. Then, there are two tuples t_1 and t_2 in J that violate an FD Z → W in ∆. Since A ∈ Z (as A is a common lhs attribute), the tuples t_1 and t_2 agree on the value of attribute A. Assume that t_1.A = t_2.A = a. By definition, there is an FD (Z ∖ {A}) → (W ∖ {A}) in ∆ − A. Clearly, the tuples t_1 and t_2 agree on all the attributes in Z ∖ {A}, and since they also agree on the attribute A, there exists an attribute B ∈ (W ∖ {A}) such that t_1.B ≠ t_2.B. Thus, t_1 and t_2 violate an FD in ∆ − A, which is a contradiction to the fact that OptSRepair(∆ − A, σ_{A=a}T) returns an optimal S-repair of σ_{A=a}T that contains both t_1 and t_2.

Next, we prove that J is an optimal S-repair of T. Assume, by way of contradiction, that this is not the case. That is, there is another consistent subset J′ of T, such that the weight of the tuples in T ∖ J′ is lower than the weight of the tuples in T ∖ J. In this case, there exists at least one value a′ of attribute A, such that the weight of the tuples t ∈ (T ∖ J′) for which t.A = a′ is lower than the weight of such tuples in T ∖ J. Let H = {h_1, ..., h_r} be the set of tuples from T for which h_j.A = a′. Let {f_1, ..., f_n} be the set of tuples in H ∩ J, and let {g_1, ..., g_m} be the set of tuples in H ∩ J′. It holds that w_T(H ∖ J) > w_T(H ∖ J′). We claim that {g_1, ..., g_m} is a consistent subset of σ_{A=a′}T, which is a contradiction to the fact that {f_1, ..., f_n} is an optimal S-repair of σ_{A=a′}T. Assume, by way of contradiction, that {g_1, ..., g_m} is not a consistent subset of σ_{A=a′}T. Then, there exist two tuples g_{j_1} and g_{j_2} in {g_1, ..., g_m} that violate an FD Z → W in ∆ − A. By definition, there is an FD (Z ∪ {A}) → (W ∪ Y) in ∆, where Y ⊆ {A}, and since g_{j_1} and g_{j_2} agree on the value of attribute A, they clearly violate this FD, which is a contradiction to the fact that they both appear in J′ (which is a consistent subset of T).

Lemma A.2. Let T be a table, and let ∆ be a set of FDs, such that ∆ contains a consensus FD ∅ → X. If for each a ∈ π_X T[∗], OptSRepair(∆ − X, σ_{X=a}T) returns an optimal S-repair of σ_{X=a}T w.r.t. ∆ − X, then OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.

Proof.
Assume that for each a ∈ π_X T[∗], it holds that OptSRepair(∆ − X, σ_{X=a}T) returns an optimal S-repair of σ_{X=a}T w.r.t. ∆ − X. We contend that the algorithm OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆. Note that the condition of line 4 cannot be satisfied, since there is no attribute that appears on the left-hand side of ∅ → X. Since the condition of line 6 of OptSRepair is satisfied, the subroutine ConsensusRep will be called. Thus, we have to prove that ConsensusRep(∆, T) returns an optimal S-repair of T.

Let J be the result of ConsensusRep(∆, T). We start by proving that J is consistent. Assume, by way of contradiction, that J is not consistent. Then, there are two tuples t_1 and t_2 in J that violate an FD Z → W in ∆. Note that t_1 and t_2 agree on the value of the attributes in X (since ConsensusRep(∆, T) always returns a set of tuples that agree on the value of the attributes in X). Assume that t_1[X] = t_2[X] = a. Then, there exists an attribute A′ ∈ W such that t_1.A′ ≠ t_2.A′. By definition, there is an FD (Z ∖ X) → (W ∖ X) in ∆ − X. Clearly, the tuples t_1 and t_2 agree on all the attributes in Z ∖ X, but do not agree on the attribute A′ ∈ (W ∖ X). Thus, t_1 and t_2 violate an FD in ∆ − X, which is a contradiction to the fact that OptSRepair(∆ − X, σ_{X=a}T) returns an optimal S-repair of σ_{X=a}T that contains both t_1 and t_2.

Next, we prove that J is an optimal S-repair of T. Assume, by way of contradiction, that this is not the case. That is, there is another consistent subset J′ of T, such that the weight of the tuples in T ∖ J′ is lower than the weight of the tuples in T ∖ J. Clearly, each consistent subset of T only contains tuples that agree on the value of the attributes in X (that is, tuples that belong to σ_{X=a}T for some value a). The instance J is an optimal S-repair of σ_{X=a′}T for some value a′. If J′ ⊆ σ_{X=a′}T, then we get a contradiction to the fact that J is an optimal S-repair of σ_{X=a′}T. Thus, J′ ⊆ σ_{X=a″}T for some other value a″. In this case, the optimal S-repair of σ_{X=a″}T has a higher weight than the optimal S-repair of σ_{X=a′}T, which is a contradiction to the fact that ConsensusRep returns an optimal S-repair of σ_{X=a}T that has the highest weight among all such optimal S-repairs.

Figure 3: An illustration of our proof of the positive side of Theorem 3.4. We start with an FD set ∆ and apply simplifications to it, until we get a trivial set of FDs. The red arrows represent simplifications and the dashed blue arrows represent renaming of a set of FDs. A black arrow from ∆′ to ∆ means that if we can find an optimal S-repair for ∆′ in polynomial time, then we can find an optimal S-repair for ∆ in polynomial time. The proof is in the lemma that appears next to the corresponding black arrow.

Lemma A.3. Let T be a table, and let ∆ be a set of FDs, such that ∆ has an lhs marriage (X_1, X_2) and does not have a common lhs. If OptSRepair(∆ − X_1X_2, σ_{X_1=a_1,X_2=a_2}T) returns an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T w.r.t. ∆ − X_1X_2 for each pair (a_1, a_2) ∈ π_{X_1X_2} T[∗], then OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.

Proof.
Assume that for each (a_1, a_2) ∈ π_{X_1X_2} T[∗], it holds that OptSRepair(∆ − X_1X_2, σ_{X_1=a_1,X_2=a_2}T) returns an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T w.r.t. ∆ − X_1X_2. We contend that the algorithm OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆. We assumed that ∆ does not have a common lhs, thus the condition of line 4 is not satisfied. The condition of line 6 cannot be satisfied either, since neither X_1 ⊆ ∅ nor X_2 ⊆ ∅ (as we remove trivial FDs from ∆ in line 3). The condition of line 8, on the other hand, is satisfied, thus the algorithm will call the subroutine MarriageRep and return the result. Thus, we have to prove that MarriageRep(∆, T) returns an optimal S-repair of T.

Let us denote by J the result of MarriageRep(∆, T). We start by proving that J is consistent. Let t_1 and t_2 be two tuples in J. Note that it cannot be the case that t_1[X_1] ≠ t_2[X_1] but t_1[X_2] = t_2[X_2] (or vice versa), since in this case the matching that we found for the graph G contains two edges (a_1, a_2) and (a_1′, a_2), which is impossible. Moreover, if it holds that t_1[X_1] ≠ t_2[X_1] and t_1[X_2] ≠ t_2[X_2], then t_1 and t_2 do not agree on the left-hand side of any FD in ∆ (since we assumed that for each FD Z → W in ∆ it either holds that X_1 ⊆ Z or X_2 ⊆ Z). Thus, {t_1, t_2} satisfies all the FDs in ∆. Now, assume, by way of contradiction, that J is not consistent. Then, there are two tuples t_1 and t_2 in J that violate an FD Z → W in ∆. That is, t_1 and t_2 agree on all the attributes in Z, but do not agree on at least one attribute B ∈ W. As mentioned above, the only possible case is that t_1[X_1] = t_2[X_1] = a_1 and t_1[X_2] = t_2[X_2] = a_2. In this case, t_1 and t_2 both belong to σ_{X_1=a_1,X_2=a_2}T, and they do not agree on an attribute B ∈ (W ∖ (X_1 ∪ X_2)). The FD (Z ∖ (X_1 ∪ X_2)) → (W ∖ (X_1 ∪ X_2)) belongs to ∆ − X_1X_2, and clearly t_1 and t_2 also violate this FD, which is a contradiction to the fact that J only contains an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T and does not contain any other tuples from σ_{X_1=a_1,X_2=a_2}T (recall that we assumed that OptSRepair(∆ − X_1X_2, σ_{X_1=a_1,X_2=a_2}T) returns an optimal S-repair for each (a_1, a_2) ∈ π_{X_1X_2} T[∗]).

Next, we prove that J is an optimal S-repair of T. Assume, by way of contradiction, that this is not the case. That is, there is another consistent subset J′ of T, such that the weight of the tuples in T ∖ J′ is lower than the weight of the tuples in T ∖ J; that is, J′ is an optimal S-repair of T. Note that the weight of the matching corresponding to J is the total weight of the tuples in J (since the weight of each edge (a_1, a_2) is the weight of the optimal S-repair of σ_{X_1=a_1,X_2=a_2}T, and J contains the optimal S-repair of each σ_{X_1=a_1,X_2=a_2}T such that the edge (a_1, a_2) belongs to the matching). Let t_1 and t_2 be two tuples in J′. Note that it cannot be the case that t_1[X_1] = t_2[X_1] but t_1[X_2] ≠ t_2[X_2], since in this case {t_1, t_2} violates the FD X_1 → cl_∆(X_1), which is implied by ∆, and this is a contradiction to the fact that J′ satisfies ∆ (we recall that cl_∆(X_1) = cl_∆(X_2), and since X_2 ⊆ cl_∆(X_2), it holds that X_2 ⊆ cl_∆(X_1)). Similarly, it cannot be the case that t_1[X_1] ≠ t_2[X_1] but t_1[X_2] = t_2[X_2]. Hence, it either holds that t_1[X_1] = t_2[X_1] and t_1[X_2] = t_2[X_2], or t_1[X_1] ≠ t_2[X_1] and t_1[X_2] ≠ t_2[X_2]. Therefore, J′ clearly corresponds to a matching of G as well (the matching contains an edge (a_1, a_2) if there is a tuple t ∈ J′ such that t[X_1] = a_1 and t[X_2] = a_2).

Next, we claim that for each edge (a_1, a_2) that belongs to the above matching (the matching that corresponds to J′), the subinstance J′ contains an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T w.r.t. ∆ − X_1X_2. Clearly, J′ cannot contain two tuples t_1 and t_2 from σ_{X_1=a_1,X_2=a_2}T that violate an FD Z → W from ∆ − X_1X_2 (otherwise, t_1 and t_2 would also violate the FD (Z ∪ Y_1) → (W ∪ Y_2) from ∆, where Y_1 ⊆ (X_1 ∪ X_2) and Y_2 ⊆ (X_1 ∪ X_2), which is a contradiction to the fact that J′ is a consistent subset of T). Thus, J′ contains a consistent set of tuples from σ_{X_1=a_1,X_2=a_2}T. If this set of tuples is not an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T, then we can replace it with an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T. This will not break the consistency of J′, since these tuples do not agree on the attributes in either X_1 or X_2 with any other tuple in J′, and each FD Z → W in ∆ is such that X_1 ⊆ Z or X_2 ⊆ Z. The result is a consistent subset of T with a higher weight than J′ (that is, the weight of the removed tuples is lower), which is a contradiction to the fact that J′ is an optimal S-repair of T. Therefore, for each edge (a_1, a_2) that belongs to the above matching, J′ contains tuples with a total weight of w_T(S_{a_1,a_2}), where S_{a_1,a_2} is an optimal S-repair of σ_{X_1=a_1,X_2=a_2}T, and the weight of this matching is the total weight of the tuples in J′. In this case, we have found a matching of G with a higher weight than the matching corresponding to J, which is a contradiction to the fact that J corresponds to the maximum-weight matching of G.

Next, we prove Theorem 3.2.

Theorem 3.2. Let ∆ and T be a set of FDs and a table, respectively, over a relation schema R(A_1, ..., A_k). If OptSRepair(∆, T) succeeds, then it returns an optimal S-repair. Moreover, OptSRepair(∆, T) terminates in polynomial time in k, ∣∆∣, and ∣T∣.

Proof.
We prove the theorem by induction on n, the number of simplifications that OptSRepair applies to ∆. We start by proving the basis of the induction, that is, n = 0. In this case, OptSRepair will only succeed if ∆ = ∅ or if ∆ is trivial. Clearly, in this case, T is consistent w.r.t. ∆ and an optimal S-repair of T is T itself. And indeed, OptSRepair(∆, T) will return T in polynomial time.

For the inductive step, we need to prove that if the claim is true for all n = 0, ..., k − 1, it is also true for n = k. In this case, OptSRepair(∆, T) will start by applying some simplification to the schema. Clearly, the result is a set of FDs ∆′, such that OptSRepair(∆′, T′) will apply n − 1 simplifications to ∆′. One of the following holds:
● ∆ has a common lhs A. In this case, the condition of line 4 is satisfied and the subroutine CommonLHSRep will be called. Note that OptSRepair(∆, T) will succeed only if OptSRepair(∆ − A, σ_{A=a}T) succeeds for each a ∈ π_A T[∗]. We know from the inductive assumption that if OptSRepair(∆ − A, σ_{A=a}T) succeeds, then it returns an optimal S-repair. Thus, Lemma A.1 implies that OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.
● ∆ has a consensus FD ∅ → X. In this case, the condition of line 4 is not satisfied, but the condition of line 6 is satisfied and the subroutine ConsensusRep will be called. Again, OptSRepair(∆, T) will succeed only if OptSRepair(∆ − X, σ_{X=a}T) succeeds for each a ∈ π_X T[∗]. We know from the inductive assumption that if OptSRepair(∆ − X, σ_{X=a}T) succeeds, then it returns an optimal S-repair. Thus, Lemma A.2 implies that OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.
● ∆ does not have a common lhs, but has an lhs marriage (X_1, X_2). In this case, the conditions of line 4 and line 6 are not satisfied, but the condition of line 8 is satisfied and the subroutine MarriageRep will be called. As in the previous cases, OptSRepair(∆, T) will succeed only if OptSRepair(∆ − X_1X_2, σ_{X_1=a_1,X_2=a_2}T) succeeds for each (a_1, a_2) ∈ π_{X_1X_2} T[∗]. We know from the inductive assumption that if OptSRepair(∆ − X_1X_2, σ_{X_1=a_1,X_2=a_2}T) succeeds, then it returns an optimal S-repair. Thus, Lemma A.3 implies that OptSRepair(∆, T) returns an optimal S-repair of T w.r.t. ∆.

It is only left to prove that OptSRepair terminates in polynomial time in k, ∣∆∣, and ∣T∣. We will first explain how checking each one of the conditions can be done in polynomial time; then, we will provide the recurrence relation for each one of the subroutines of the algorithm. The algorithm will first check if ∆ is trivial. This can be done in polynomial time, as we can go over the FDs and check for each one of them whether every attribute that appears on the rhs also appears on the lhs. Then, the algorithm checks if ∆ has a common lhs. This can also be done in polynomial time, by going over the attributes A_1, ..., A_k and checking for each one of them whether the lhs of each FD in ∆ contains it. If this condition does not hold, then the algorithm will check if ∆ has a consensus FD. This can be done in polynomial time by going over the FDs in ∆ and checking for each one of them whether the lhs is empty. Finally, if this condition does not hold as well, the algorithm will check if ∆ has an lhs marriage. To do that in polynomial time, we can go over the FDs in ∆ and for each one of them find the closure of its lhs w.r.t. ∆ (it is known that this can be done in polynomial time). Then, for each pair of FDs in ∆ that agree on the closure of their lhs, we have to check whether one of their lhs is contained in the lhs of each other FD in ∆.

If the condition of line 4 holds, then the subroutine CommonLHSRep will be called.
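For intuition, the condition checks just described admit a direct implementation. The following is a minimal Python sketch under our own naming (it is not the paper's pseudocode); an FD X → Y is represented as a pair of frozensets of attribute names, and the attribute closure is computed by the standard fixpoint iteration.

```python
# A hedged sketch of the polynomial-time condition checks described above.
# An FD X -> Y is represented as a pair (frozenset, frozenset).

def closure(attrs, fds):
    """Attribute closure of attrs w.r.t. fds, via the standard fixpoint."""
    cl = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= cl and not rhs <= cl:
                cl |= rhs
                changed = True
    return frozenset(cl)

def is_trivial(fds):
    # Trivial: every rhs attribute already appears on the lhs.
    return all(rhs <= lhs for lhs, rhs in fds)

def common_lhs(fds, attributes):
    # An attribute contained in the lhs of every FD, if one exists.
    for a in attributes:
        if all(a in lhs for lhs, _ in fds):
            return a
    return None

def consensus_fd(fds):
    # An FD with an empty lhs, if one exists.
    for lhs, rhs in fds:
        if not lhs:
            return (lhs, rhs)
    return None
```

The lhs-marriage check described above would additionally compare the `closure` of the left-hand sides of pairs of FDs, which this sketch already makes possible.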
Since we have already found a common lhs A (when we were checking if the condition holds), it is only left to divide T into blocks of tuples that agree on the value of A (which again can be done in polynomial time, by going over the tuples in T and checking the value of attribute A for each one of them) and make a recursive call to OptSRepair for each one of the blocks. The recurrence relation for this subroutine is:

F(k, ∣T∣, ∣∆∣) = Σ_{a ∈ π_A T[∗]} F(k − 1, ∣σ_{A=a}T∣, ∣∆ − A∣) + poly(k, ∣T∣, ∣∆∣)   (3)

If the condition of line 6 holds, then the subroutine ConsensusRep will be called. This case is very similar to the previous one. Since we have already found a consensus FD ∅ → X, it is only left to divide T into blocks of tuples that agree on the value of the attributes in X and make a recursive call to OptSRepair for each one of the blocks. The recurrence relation for this subroutine is:

F(k, ∣T∣, ∣∆∣) = Σ_{a ∈ π_X T[∗]} F(k − ∣X∣, ∣σ_{X=a}T∣, ∣∆ − X∣) + poly(k, ∣T∣, ∣∆∣)   (4)

If the condition of line 8 holds, then the subroutine MarriageRep will be called. In this case, we go over each pair of values (a_1, a_2) that appear in the attributes (X_1, X_2) in some tuple of T (note that there are at most ∣T∣ such pairs). Then, we make a recursive call to OptSRepair for each one of these pairs and calculate the weight of the result S_{a_1,a_2} (clearly, this can be done in polynomial time in the size of S_{a_1,a_2}). Finally, we build the graph G and find a maximum-weight matching for the graph (which can be done in polynomial time with the Hungarian algorithm). The recurrence relation for this subroutine is:

F(k, ∣T∣, ∣∆∣) = Σ_{(a_1,a_2) ∈ π_{X_1X_2} T[∗]} F(k − ∣X_1 ∪ X_2∣, ∣σ_{X_1=a_1,X_2=a_2}T∣, ∣∆ − X_1X_2∣) + poly(k, ∣T∣, ∣∆∣)   (5)

Since finding an optimal S-repair for a trivial FD set can be done in polynomial time in the size of T, and since in each one of the three recurrence relations the tables in the middle argument form a partition of T, a standard analysis of F shows that it is bounded by a polynomial.

A.2 Hardness Side
As mentioned above, our proof of hardness is based on the concept of a fact-wise reduction [22]. We first prove, for each one of the FD sets in Table 1 (over the schema R(A, B, C)), that computing an optimal S-repair is APX-complete. Then, we prove the existence of fact-wise reductions from these FD sets over the schema R(A, B, C) to other sets of FDs over other schemas.

Figure 4 illustrates our proof. The idea is the following: we start with a table T and an FD set ∆, for which we want to find an optimal S-repair. Then, we apply the simplifications (represented by the red arrows in the figure) until it is no longer possible. We prove, for each one of the simplifications (common lhs, consensus FD, and lhs marriage), that there is a fact-wise reduction from the problem after the simplification to the problem before the simplification. The problem is APX-complete if at this point we have a set of FDs that is not trivial. In this case, we pick two local minima from the FD set. We recall that an FD X → Y in ∆ is a local minimum if there is no FD Z → W in ∆ such that Z ⊂ X. These two local minima belong to one of five classes, which we discuss in Section 3 and later in this section. Finally, we prove for each one of the classes that there is a fact-wise reduction to this class from one of four FD sets. To prove APX-hardness for these four sets (over the relation schema R(A, B, C)), we construct reductions from problems that are known to be APX-hard. A black arrow in Figure 4 represents a reduction that we construct in the lemma that appears next to the arrow.

A.2.1 Hard Schemas
We start by proving that computing an optimal S-repair for the FD sets in Table 1 is APX-complete. Proposition 3.3 implies that the problem is in APX for each one of these sets. Thus, it is only left to show that the problem is also APX-hard. Gribkoff et al. [20] prove that the MPD problem is NP-hard for both ∆_{A→B→C} and ∆_{A→C←B}. Their hardness proof also holds for the problem of computing an optimal S-repair. More formally, the following hold.

Lemma A.4. [20] Computing an optimal S-repair for ∆_{A→C←B} is NP-hard.
Figure 4: An illustration of our proof of the negative side of Theorem 3.4. We start with an FD set ∆ and apply simplifications to it, until we get a nontrivial set of FDs, which we classify into one of five classes. The red arrows represent simplifications and the dashed blue arrows represent renaming. A black arrow represents a reduction that we construct in the lemma that appears next to the arrow.

Lemma
A.5. [20] Computing an optimal S-repair for ∆_{A→B→C} is NP-hard.

They prove both of these results by showing a reduction from the MAX-2-SAT problem: given a 2-CNF formula ϕ, determine the maximum number of clauses in ϕ that can be simultaneously satisfied. In their reductions, it holds that the maximum number of clauses that can be simultaneously satisfied is exactly the size of an optimal S-repair of the constructed table T (which is unweighted and duplicate-free). The problem MAX-2-SAT is known to be APX-hard. However, this is not enough to prove APX-hardness for our problem, as we would like to approximate the minimum number of tuples to delete, rather than the size of the optimal S-repair. Thus, we will now strengthen their results by proving that the complement problem of MAX-2-SAT is APX-hard. Clearly, their reductions are strict reductions from the complement problem (as the number of clauses that are not satisfied is exactly the number of tuples that are deleted from the table), thus this will complete our proof of APX-completeness for ∆_{A→B→C} and ∆_{A→C←B}.

Lemma A.6. The complement problem of MAX-2-SAT is APX-hard.

Proof. It is known that there is always an assignment that satisfies at least half of the clauses in the formula. Thus, the solution to the complement problem contains at most half of the clauses. Let ψ be a 2-CNF formula. Let m be the number of clauses in ψ, let OPT_P be an optimal solution to MAX-2-SAT and let OPT_CP be an optimal solution to the complement problem. Then, the following hold: (a) ∣OPT_P∣ ≥ (1/2)⋅m, (b) ∣OPT_CP∣ ≤ (1/2)⋅m, and (c) ∣OPT_CP∣/∣OPT_P∣ ≤ 1. Now, assume, by way of contradiction, that the complement problem is not APX-hard. That is, for every ε > 0, there exists a (1 + ε)-optimal solution to that problem. Let S_CP be such a solution. Then,

∣S_CP∣ − ∣OPT_CP∣ ≤ ε ⋅ ∣OPT_CP∣   (6)

We will now show that in this case, S_P (which consists of the clauses that do not belong to S_CP, thus ∣S_P∣ = m − ∣S_CP∣) is a (1 + ε)-optimal solution for MAX-2-SAT.

∣S_P∣ − ∣OPT_P∣ = (m − ∣S_CP∣) − (m − ∣OPT_CP∣) = ∣OPT_CP∣ − ∣S_CP∣ ≥ −ε ⋅ ∣OPT_CP∣ ≥ −ε ⋅ ∣OPT_P∣   (7)

It follows that for each ε > 0, we can get a (1 + ε)-optimal solution for MAX-2-SAT by finding a (1 + ε)-optimal solution to the complement problem and selecting all of the clauses that do not belong to this solution, which is a contradiction to the fact that MAX-2-SAT is APX-hard.

We can now conclude that the following hold.

Lemma
A.7. Computing an optimal S-repair for ∆_{A→C←B} is APX-complete.

Figure 5: An illustration of the reduction used by Amini et al. [3] to prove APX-hardness for the problem of finding the maximum number of edge-disjoint triangles in a tripartite graph with a bounded degree.
Proof. This is straightforward based on Lemma A.4, Lemma A.6, and our observation that the reduction of Gribkoff et al. [20] is a strict reduction from the complement problem of MAX-2-SAT.
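To make the optimization target of these hardness results concrete, the following is a hedged brute-force sketch of our own (exponential time, for intuition only, and not the reduction used in the proofs). It computes the minimum number of tuple deletions that yields a consistent subset, instantiated here with ∆_{A→C←B}, i.e., the FDs A → C and B → C.

```python
from itertools import combinations

# FDs are given as pairs of attribute-position tuples over R(A, B, C),
# with positions 0, 1, 2 standing for A, B, C.

def consistent(tuples, fds):
    """True if no pair of tuples violates any FD: whenever two tuples
    agree on the lhs positions, they must agree on the rhs positions."""
    return all(
        t1[k] == t2[k]
        for t1, t2 in combinations(tuples, 2)
        for lhs, rhs in fds
        if all(t1[i] == t2[i] for i in lhs)
        for k in rhs
    )

def min_deletions(table, fds):
    """Exhaustive search, trying the fewest deletions first."""
    for d in range(len(table) + 1):
        for removed in combinations(range(len(table)), d):
            kept = [t for i, t in enumerate(table) if i not in removed]
            if consistent(kept, fds):
                return d

# Delta_{A->C<-B}: the FDs A -> C and B -> C.
fds = [((0,), (2,)), ((1,), (2,))]
table = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a2", "b2", "c3")]
```

On this three-tuple example, the middle tuple participates in a violation of A → C with the first tuple and of B → C with the third, so deleting it alone yields a consistent subset: the optimum is one deletion.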
Lemma A.8. Computing an optimal S-repair for ∆_{A→B→C} is APX-complete.

Proof. This is straightforward based on Lemma A.5, Lemma A.6, and our observation that the reduction of Gribkoff et al. [20] is a strict reduction from the complement problem of MAX-2-SAT.

Next, we prove that computing an optimal S-repair for ∆_{AB↔AC↔BC} is APX-complete as well. To do that, we construct a reduction from the problem of finding the maximum number of edge-disjoint triangles in a tripartite graph with a bounded degree B. Amini et al. [3] (who refer to this problem as MECT-B) proved that this problem is APX-complete. Again, in our reduction, it holds that the maximum number of edge-disjoint triangles is exactly the size of an optimal S-repair of the constructed table T. Thus, our reduction is a strict reduction from the complement problem. Hence, we first prove that the complement problem of MECT-B is APX-hard for tripartite graphs that satisfy a specific property. We use the reduction of Amini et al. [3] to prove that. They build a reduction from the problem of finding a maximum bounded covering by 3-sets: given a collection of subsets of a given set that contain exactly three elements each, such that each element appears in at most B subsets, find the maximum number of disjoint subsets. In their reduction, they construct a tripartite graph, such that for each subset S_i = (x, y, z), they add to the graph the structure from Figure 5. Note that the nodes a_i[⋅] are unique for this subset, while the nodes x[1], x[2], y[1], y[2], z[1], z[2] will appear only once in the graph, even if they appear in more than one subset. Thus, we can build a set of edge-disjoint triangles for the constructed tripartite graph by selecting, for each subset, six out of the thirteen triangles (the even ones). This is true since the even triangles do not share an edge with any other triangle. We can now conclude that in their reduction, they construct a tripartite graph with the following property: the maximum number of edge-disjoint triangles in the graph is at least 6/13 of the total number of triangles. We call a graph that satisfies this property a 6/13-tripartite graph. Thus, we can conclude the following:

Lemma
A.9. The problem MECT-B for 6/13-tripartite graphs is APX-hard.
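For concreteness, the MECT objective itself can be stated in a few lines of code. This is a hedged brute-force sketch of our own (exponential time, for intuition only), not part of the reductions discussed here.

```python
from itertools import combinations

# The MECT objective: the maximum number of pairwise edge-disjoint
# triangles among a given set of triangles. A triangle (a, b, c) uses
# the edges {a, b}, {a, c}, and {b, c}.

def edges(tri):
    a, b, c = tri
    return {frozenset(p) for p in ((a, b), (a, c), (b, c))}

def max_edge_disjoint(triangles):
    """Exhaustive search, trying the largest subsets first."""
    for r in range(len(triangles), 0, -1):
        for subset in combinations(triangles, r):
            used = [edges(t) for t in subset]
            if all(u1.isdisjoint(u2) for u1, u2 in combinations(used, 2)):
                return r
    return 0

# Two triangles sharing the edge (a1, b1), plus one disjoint triangle.
tris = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c3")]
```

On this example, the first two triangles share an edge, so at most one of them can be selected together with the third triangle, and the optimum is 2.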
We will now prove that the complement problem is APX-hard as well.
Lemma A.10. The complement problem of MECT-B for 6/13-tripartite graphs is APX-hard.
Proof. Let g be a 6/13-tripartite graph. Then, the solution to the complement problem contains at most 7/13 of the total number of triangles. Let m be the number of triangles in g, let OPT_P be an optimal solution to MECT-B for g and let OPT_CP be an optimal solution to the complement problem. Thus, the following hold: (a) ∣OPT_P∣ ≥ (6/13)⋅m, (b) ∣OPT_CP∣ ≤ (7/13)⋅m, and (c) ∣OPT_CP∣/∣OPT_P∣ ≤ 7/6. Now, assume, by way of contradiction, that the complement problem is not APX-hard. That is, for every ε > 0, we can find a (1 + ε)-optimal solution to that problem. Let S_CP be such a solution. Then,

∣S_CP∣ − ∣OPT_CP∣ ≤ ε ⋅ ∣OPT_CP∣   (8)

We will now show that in this case, S_P (which consists of the triangles that do not belong to S_CP, thus ∣S_P∣ = m − ∣S_CP∣) is a (1 + (7/6)⋅ε)-optimal solution for MECT-B.

∣S_P∣ − ∣OPT_P∣ = (m − ∣S_CP∣) − (m − ∣OPT_CP∣) = ∣OPT_CP∣ − ∣S_CP∣ ≥ −ε ⋅ ∣OPT_CP∣ ≥ −(7/6) ⋅ ε ⋅ ∣OPT_P∣   (9)

It follows that for each ε′ > 0, we can get a (1 + ε′)-optimal solution for MECT-B by finding a (1 + (6/7)⋅ε′)-optimal solution to the complement problem, which is a contradiction to the fact that MECT-B is APX-hard for 6/13-tripartite graphs, as implied by Lemma A.9.

Next, we introduce our reduction from MECT-B to the problem of computing an optimal S-repair for ∆_{AB↔AC↔BC}.

Lemma
A.11. Computing an optimal S-repair for ∆_{AB↔AC↔BC} is APX-complete.

Proof. We construct a reduction from the problem of finding the maximum number of edge-disjoint triangles in a tripartite graph with a bounded degree. The input to this problem is a tripartite graph g with a bounded degree B. The goal is to determine the maximum number of edge-disjoint triangles in g (that is, no two triangles share an edge). We assume that g contains three sets of nodes: {a_1, ..., a_n}, {b_1, ..., b_l} and {c_1, ..., c_r}. Given such an input, we construct the input T for our problem as follows: for each triangle in g that consists of the nodes a_i, b_j, and c_k, T will contain the tuple (a_i, b_j, c_k). We will now prove that there are at least m edge-disjoint triangles in g if and only if there is a consistent subset of T that contains at least m tuples.

The "if" direction. Assume that there is a consistent subset J of T that contains at least m tuples. The FD AB → C implies that a consistent subset cannot contain two tuples (a_i, b_j, c_{k_1}) and (a_i, b_j, c_{k_2}) such that c_{k_1} ≠ c_{k_2}. Moreover, the FD AC → B implies that it cannot contain two tuples (a_i, b_{j_1}, c_k) and (a_i, b_{j_2}, c_k) such that b_{j_1} ≠ b_{j_2}, and the FD BC → A implies that it cannot contain two tuples (a_{i_1}, b_j, c_k) and (a_{i_2}, b_j, c_k) such that a_{i_1} ≠ a_{i_2}. Thus, the two triangles in g that correspond to two distinct tuples in J do not share an edge (they can only share a single node). Hence, there are at least m edge-disjoint triangles in g.

The "only if" direction. Assume that there are at least m edge-disjoint triangles {t_1, ..., t_m} in g. We can build a consistent subset J of T as follows: for each triangle (a_i, b_j, c_k) in {t_1, ..., t_m}, we add the tuple (a_i, b_j, c_k) to J. Thus, J will contain m tuples. It is only left to show that J is consistent. Assume, by way of contradiction, that J is not consistent. That is, there are two tuples (a_1, b_1, c_1) and (a_2, b_2, c_2) in J that violate an FD in ∆_{AB↔AC↔BC}. If the tuples violate the FD AB → C, it holds that a_1 = a_2 and b_1 = b_2. Thus, the corresponding two triangles from {t_1, ..., t_m} share the edge (a_1, b_1), which is a contradiction to the fact that {t_1, ..., t_m} is a set of edge-disjoint triangles. Similarly, if the tuples violate the FD AC → B, then the corresponding two triangles share an edge (a_1, c_1), and if they violate the FD BC → A, the corresponding two triangles share an edge (b_1, c_1). To conclude, there exists a consistent subset of T that contains at least m tuples, and that concludes our proof.

Clearly, our reduction is a strict reduction from the complement problem of MECT-B (as the number of triangles that are not part of the set of edge-disjoint triangles is exactly the number of tuples that are deleted from T), thus Lemma A.10 implies that our problem is indeed APX-complete.

Finally, we construct a reduction from the problem MAX-non-mixed-SAT to the problem of computing an optimal S-repair for ∆_{AB→C→B}. Note that the following holds.

Lemma
A.12. The complement problem of MAX-non-mixed-SAT is APX-hard.

Proof. The proof is identical to the proof of Lemma A.6.

Thus, if we construct a reduction that is a strict reduction from the complement problem, this will conclude our proof of APX-completeness for ∆_{AB→C→B}.

Lemma
A.13. Computing an optimal S-repair for ∆_{AB→C→B} is APX-complete.

Proof. We construct a reduction from MAX-non-mixed-SAT to the problem of computing an optimal S-repair for ∆_{AB→C→B}. The input to the first problem is a formula ψ with the free variables x_1, ..., x_n, such that ψ has the form c_1 ∧ ⋯ ∧ c_m, where each c_j is a clause. Each clause is a disjunction of literals from one of the following sets: (a) {x_i ∶ i = 1, ..., n} or (b) {¬x_i ∶ i = 1, ..., n} (that is, each clause either contains only positive variables or only negative variables). The goal is to determine the maximum number of clauses in the formula ψ that can be simultaneously satisfied. Given such an input, we construct the input T for our problem as follows. For each i = 1, ..., n and j = 1, ..., m, T will contain the following tuples:
● (c_j, 1, x_i), if c_j contains only positive variables and x_i appears in c_j.
● (c_j, 0, x_i), if c_j contains only negative variables and ¬x_i appears in c_j.
The weight of each tuple will be 1 (that is, T is an unweighted, duplicate-free table). We will now prove that there is an assignment that satisfies at least k clauses in ψ if and only if there is a consistent subset of the constructed table T that contains at least k tuples.

The "if" direction.
Assume that there is a consistent subset J of T that contains at least k tuples. The FD AB → C implies that no consistent subset of T contains two tuples (cj , bj , xi1) and (cj , bj , xi2) such that xi1 ≠ xi2. Thus, each consistent subset contains at most one tuple (cj , bj , xi) for each cj. We now define an assignment τ as follows: τ(xi) := bj if there exists a tuple (cj , bj , xi) in J for some cj. Note that the FD C → B implies that no consistent subset contains two tuples (cj1 , 1, xi) and (cj2 , 0, xi), thus the assignment is well defined. Finally, as mentioned above, J contains a tuple (cj , bj , xi) for k clauses cj from ψ. If xi appears in cj without negation, it holds that bj = 1, thus τ(xi) = 1 and cj is satisfied. Similarly, if xi appears in cj with negation, it holds that bj = 0, thus τ(xi) = 0 and cj is satisfied. Thus, each one of these k clauses is satisfied by τ and we conclude that there exists an assignment that satisfies at least k clauses in ψ.

The "only if" direction. Assume that τ ∶ {x1, . . . , xn} → {0, 1} is an assignment that satisfies at least k clauses in ψ. We claim that there exists a consistent subset of T that contains at least k tuples. Since τ satisfies at least k clauses, for each one of these clauses cj there exists a variable xi in cj such that τ(xi) = 1 if xi appears in cj without negation, or τ(xi) = 0 if xi appears in cj with negation. Let us build a consistent subset J as follows. For each cj that is satisfied by τ, we choose exactly one variable xi that satisfies the above and add the tuple (cj , bj , xi) (where τ(xi) = bj) to J. Since there are at least k satisfied clauses, J contains at least k tuples, thus it is only left to prove that J is consistent. Let us assume, by way of contradiction, that J is not consistent. Since J contains one tuple for each satisfied cj, no two tuples violate the FD AB → C. Thus, J contains two tuples (cj1 , 1, xi) and (cj2 , 0, xi), but this is a contradiction to the fact that τ is an assignment (that is, it cannot be the case that τ(xi) = 1 and τ(xi) = 0).

Clearly, our reduction is a strict reduction from the complement problem of MAX-non-mixed-SAT (as the number of unsatisfied clauses is exactly the number of tuples that are deleted from T), thus Lemma A.12 implies that our problem is indeed APX-complete.

A.2.2 Fact-Wise Reductions
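All reductions in this subsection follow a single template: a map Π on tuples that is well defined, injective, and preserves both consistency and inconsistency of pairs of tuples. As an illustration only (not part of the proofs; the dict-based tuple and FD encodings are our own), the properties such a Π must satisfy can be checked mechanically on a finite sample of tuples:

```python
# A tuple is a dict {attribute: value}; an FD is a pair (lhs, rhs) of
# sets of attribute names. Illustration of the fact-wise reduction
# template, not code from the paper.

def pair_consistent(t1, t2, fds):
    """Does the two-tuple set {t1, t2} satisfy every FD in fds?"""
    for lhs, rhs in fds:
        if all(t1[a] == t2[a] for a in lhs) and any(t1[a] != t2[a] for a in rhs):
            return False
    return True

def looks_fact_wise(pi, sample, src_fds, tgt_fds):
    """Check injectivity and pair-wise (in)consistency preservation of pi
    on a finite sample of source tuples (a necessary condition only)."""
    images = [tuple(sorted(pi(t).items())) for t in sample]
    if len(set(images)) != len(sample):
        return False  # pi is not injective on the sample
    return all(
        pair_consistent(t1, t2, src_fds)
        == pair_consistent(pi(t1), pi(t2), tgt_fds)
        for i, t1 in enumerate(sample)
        for t2 in sample[i + 1:]
    )
```

For instance, with src_fds = tgt_fds = [({"A"}, {"C"}), ({"B"}, {"C"})] (i.e., ∆A→C←B) the identity map trivially passes this check.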
Let R be a schema and let ∆ be a set of FDs over R. Recall that an FD set ∆ is a chain if for every two FDs X1 → Y1 and X2 → Y2 it is the case that X1 ⊆ X2 or X2 ⊆ X1. Note that as long as ∆ is a chain, it either contains a common lhs or a consensus FD. Thus, if we reach a point where we cannot apply any simplification to the problem, the set of FDs is not a chain. In the rest of this section, we use the following definition: an FD X → Y is a local minimum of ∆ if there is no other FD Z → W in ∆ such that Z ⊂ X. If ∆ is not a chain, then it contains at least two distinct local minima X1 → Y1 and X2 → Y2 (by distinct we mean that X1 ≠ X2). Thus, if no simplification can be applied to ∆, one of the following holds:

● (cl∆(X1) ∖ X1) ∩ cl∆(X2) = ∅ and (cl∆(X2) ∖ X2) ∩ cl∆(X1) = ∅.
● (cl∆(X1) ∖ X1) ∩ (cl∆(X2) ∖ X2) ≠ ∅, (cl∆(X1) ∖ X1) ∩ X2 = ∅ and (cl∆(X2) ∖ X2) ∩ X1 = ∅.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and (cl∆(X2) ∖ X2) ∩ X1 = ∅.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and (cl∆(X2) ∖ X2) ∩ X1 ≠ ∅, and also (X1 ∖ X2) ⊆ (cl∆(X2) ∖ X2) and (X2 ∖ X1) ⊆ (cl∆(X1) ∖ X1). In this case, ∆ contains at least one more local minimum. Otherwise, for every FD Z → W in ∆ it holds that either X1 ⊆ Z or X2 ⊆ Z. If X1 ∩ X2 ≠ ∅, then ∆ contains a common lhs (an attribute from X1 ∩ X2). If X1 ∩ X2 = ∅, then ∆ contains an lhs marriage.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and (cl∆(X2) ∖ X2) ∩ X1 ≠ ∅, and also (X1 ∖ X2) /⊆ (cl∆(X2) ∖ X2).

We now prove that for each one of these cases there is a fact-wise reduction from one of the hard schemas we discussed above.

Lemma
A.14.
Let R be a schema and let ∆ be an FD set over R that does not contain trivial FDs. Suppose that ∆ contains two distinct local minima X1 → Y1 and X2 → Y2, and the following hold:

● (cl∆(X1) ∖ X1) ∩ cl∆(X2) = ∅,
● (cl∆(X2) ∖ X2) ∩ cl∆(X1) = ∅.

Then, there is a fact-wise reduction from (R(A, B, C), ∆A→C←B) to (R, ∆).

Proof.
We define a fact-wise reduction Π ∶ (R(A, B, C), ∆A→C←B) → (R, ∆), using the FDs X1 → Y1 and X2 → Y2 and the constant ⊙ ∈ Const. Let t = (a, b, c) be a tuple over R(A, B, C) and let {A1, . . . , An} be the set of attributes in R. We define Π as follows:

Π(t).Ak :=
● ⊙ if Ak ∈ X1 ∩ X2
● a if Ak ∈ X1 ∖ X2
● b if Ak ∈ X2 ∖ X1
● ⟨a, c⟩ if Ak ∈ cl∆(X1) ∖ X1
● ⟨b, c⟩ if Ak ∈ cl∆(X2) ∖ X2
● ⟨a, b⟩ otherwise

It is left to show that Π is a fact-wise reduction. To do so, we prove that Π is well defined, injective, and preserves consistency and inconsistency.

Π is well defined.
This is straightforward from the definition and the facts that (cl∆(X1) ∖ X1) ∩ cl∆(X2) = ∅ and (cl∆(X2) ∖ X2) ∩ cl∆(X1) = ∅.

Π is injective.
Let t, t′ be two tuples, such that t = (a, b, c) and t′ = (a′, b′, c′). Assume that Π(t) = Π(t′). Let us denote Π(t) = (x1, . . . , xn) and Π(t′) = (x′1, . . . , x′n). Note that X1 ∖ X2 and X2 ∖ X1 are not empty since X1 ≠ X2. Moreover, since both FDs are minimal, X1 /⊂ X2 and X2 /⊂ X1. Therefore, there are l and p such that Π(t).Al = a and Π(t).Ap = b. Furthermore, since X1 → Y1 and X2 → Y2 are not trivial, there are q and r such that Π(t).Aq = ⟨a, c⟩ and Π(t).Ar = ⟨b, c⟩. Hence, Π(t) = Π(t′) implies that Π(t).Al = Π(t′).Al, Π(t).Ap = Π(t′).Ap, Π(t).Aq = Π(t′).Aq and also Π(t).Ar = Π(t′).Ar. We obtain that a = a′, b = b′ and c = c′, which implies t = t′.

Π preserves consistency.
Let t = ( a, b, c ) and t ′ = ( a ′ , b ′ , c ′ ) be two distinct tuples. We contend that the set { t , t ′ } is consistent w.r.t. ∆ A → C ← B if and only if the set { Π ( t ) , Π ( t ′ )} is consistent w.r.t. ∆. The “if ” direction.
Assume that {t, t′} is consistent w.r.t. ∆A→C←B. We prove that {Π(t), Π(t′)} is consistent w.r.t. ∆. First, note that each FD that contains an attribute Ak /∈ (cl∆(X1) ∪ cl∆(X2)) on its left-hand side is satisfied by {Π(t), Π(t′)}, since t and t′ cannot agree on both A and B (otherwise, the FD A → C implies that t = t′). Thus, from now on we only consider FDs that do not contain an attribute Ak /∈ (cl∆(X1) ∪ cl∆(X2)) on their left-hand side. The FDs in ∆A→C←B imply that if t and t′ agree on one of {A, B} then they also agree on C, thus one of the following holds:

● a ≠ a′, b = b′ and c = c′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2 or Ak ∈ X2 ∖ X1 or Ak ∈ cl∆(X2) ∖ X2. That is, they agree only on the attributes Ak such that Ak ∈ cl∆(X2). Thus, each FD that contains an attribute Ak /∈ cl∆(X2) on its left-hand side is satisfied. Moreover, any FD that contains only attributes Ak ∈ cl∆(X2) on its left-hand side also contains only attributes Ak ∈ cl∆(X2) on its right-hand side (by definition of a closure), thus Π(t) and Π(t′) agree on both the left-hand side and the right-hand side of such FDs and {Π(t), Π(t′)} satisfies all the FDs in ∆.
● a = a′, b ≠ b′ and c = c′. This case is symmetric to the previous one, thus a similar proof applies for this case as well.
● a ≠ a′, b ≠ b′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2. Since X1 → Y1 and X2 → Y2 are local minima, there is no FD in ∆ that contains only attributes Ak such that Ak ∈ X1 ∩ X2 on its left-hand side (as if there were an FD Z → W in ∆ such that Z ⊆ X1 ∩ X2, then Z ⊂ X1, in contradiction to the fact that X1 → Y1 is a local minimum). Thus, Π(t) and Π(t′) do not agree on the left-hand side of any FD in ∆ and {Π(t), Π(t′)} is consistent w.r.t. ∆.

This concludes our proof of the "if" direction.
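To make this Π concrete, here is a toy instantiation in code (our own example, not from the paper): a hypothetical target schema R(A1, . . . , A5) with ∆ = {A1 → A3, A2 → A4}, so that X1 = {A1}, X2 = {A2}, cl∆(X1) ∖ X1 = {A3}, cl∆(X2) ∖ X2 = {A4}, and A5 falls into the "otherwise" case. An exhaustive check over a small domain confirms that Π maps (in)consistent pairs to (in)consistent pairs:

```python
from itertools import product

# Source: R(A, B, C) with FDs A -> C and B -> C, tuples as triples.
# Target: hypothetical schema R(A1,...,A5) with FDs A1 -> A3, A2 -> A4;
# positions play the roles X1 = {A1}, X2 = {A2}, cl(X1)\X1 = {A3},
# cl(X2)\X2 = {A4}, "otherwise" = {A5}.  FDs are (lhs, rhs) position pairs.
SRC_FDS = [((0,), (2,)), ((1,), (2,))]
TGT_FDS = [((0,), (2,)), ((1,), (3,))]

def pi(t):
    a, b, c = t
    # a for X1\X2, b for X2\X1, <a,c>, <b,c>, and <a,b> for the rest
    return (a, b, (a, c), (b, c), (a, b))

def pair_consistent(t1, t2, fds):
    # An FD is satisfied if the tuples disagree on the lhs or agree on the rhs.
    return all(
        any(t1[i] != t2[i] for i in lhs) or all(t1[i] == t2[i] for i in rhs)
        for lhs, rhs in fds
    )

# Exhaustive check over the domain {0, 1} for each of A, B, C:
ok = all(
    pair_consistent(t1, t2, SRC_FDS) == pair_consistent(pi(t1), pi(t2), TGT_FDS)
    for t1 in product(range(2), repeat=3)
    for t2 in product(range(2), repeat=3)
)
```

Here the target FD A1 → A3 fails on Π(t), Π(t′) exactly when a = a′ and c ≠ c′, and A2 → A4 fails exactly when b = b′ and c ≠ c′, mirroring the two FDs of ∆A→C←B.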
The “only if ” direction.
Assume {t, t′} is inconsistent w.r.t. ∆A→C←B. We prove that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. Since {t, t′} is inconsistent w.r.t. ∆A→C←B, it either holds that a = a′ and c ≠ c′, or b = b′ and c ≠ c′ (or both). In the first case, Π(t) and Π(t′) agree on the attributes on the left-hand side of the FD X1 → Y1, but do not agree on at least one attribute on its right-hand side (since the FD is not trivial). Similarly, in the second case, Π(t) and Π(t′) agree on the attributes on the left-hand side of the FD X2 → Y2, but do not agree on at least one attribute on its right-hand side. Thus, {Π(t), Π(t′)} does not satisfy at least one of these FDs and {Π(t), Π(t′)} is inconsistent w.r.t. ∆.

Lemma
A.15.
Let R be a schema and let ∆ be an FD set over R that does not contain trivial FDs. Suppose that ∆ contains two distinct local minima X1 → Y1 and X2 → Y2, and one of the following holds:

● (cl∆(X1) ∖ X1) ∩ (cl∆(X2) ∖ X2) ≠ ∅, (cl∆(X1) ∖ X1) ∩ X2 = ∅ and (cl∆(X2) ∖ X2) ∩ X1 = ∅,
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and (cl∆(X2) ∖ X2) ∩ X1 = ∅.

Then, there is a fact-wise reduction from (R(A, B, C), ∆A→B→C) to (R, ∆).

Proof.
We define a fact-wise reduction Π ∶ (R(A, B, C), ∆A→B→C) → (R, ∆), using X1 → Y1 and X2 → Y2 and the constant ⊙ ∈ Const. Let t = (a, b, c) be a tuple over R(A, B, C) and let {A1, . . . , An} be the set of attributes in R. We define Π as follows:

Π(t).Ak :=
● ⊙ if Ak ∈ X1 ∩ X2
● a if Ak ∈ X1 ∖ X2
● b if Ak ∈ X2 ∖ X1
● ⟨a, c⟩ if Ak ∈ cl∆(X1) ∖ X1 ∖ cl∆(X2)
● ⟨b, c⟩ if Ak ∈ cl∆(X2) ∖ X2
● a otherwise

It is left to show that Π is a fact-wise reduction. To do so, we prove that Π is well defined, injective, and preserves consistency and inconsistency.

Π is well defined.
This is straightforward from the definition and the fact that (cl∆(X2) ∖ X2) ∩ X1 = ∅ in both cases.

Π is injective.
Let t, t′ be two tuples, such that t = (a, b, c) and t′ = (a′, b′, c′). Assume that Π(t) = Π(t′). Let us denote Π(t) = (x1, . . . , xn) and Π(t′) = (x′1, . . . , x′n). Note that X1 ∖ X2 and X2 ∖ X1 are not empty since X1 ≠ X2. Moreover, since both FDs are minimal, X1 /⊂ X2 and X2 /⊂ X1. Therefore, there are l and p such that Π(t).Al = a and Π(t).Ap = b. Furthermore, since X2 → Y2 is not trivial, there is at least one m such that Π(t).Am = ⟨b, c⟩. Hence, Π(t) = Π(t′) implies that Π(t).Al = Π(t′).Al, Π(t).Ap = Π(t′).Ap and Π(t).Am = Π(t′).Am. We obtain that a = a′, b = b′ and c = c′, which implies t = t′.

Π preserves consistency.
Let t = ( a, b, c ) and t ′ = ( a ′ , b ′ , c ′ ) be two distinct tuples. We contend that the set { t, t ′ } is consistent w.r.t. ∆ A → B → C if and only if the set { Π ( t ) , Π ( t ′ )} is consistent w.r.t. ∆. The “if ” direction.
Assume that {t, t′} is consistent w.r.t. ∆A→B→C. We prove that {Π(t), Π(t′)} is consistent w.r.t. ∆. First, note that each FD that contains an attribute Ak /∈ (cl∆(X1) ∪ cl∆(X2)) on its left-hand side is satisfied by {Π(t), Π(t′)}, since t and t′ cannot agree on A (otherwise, the FDs A → B and B → C imply that t = t′). Thus, from now on we only consider FDs that do not contain an attribute Ak /∈ (cl∆(X1) ∪ cl∆(X2)) on their left-hand side. One of the following holds:

● a ≠ a′, b = b′ and c = c′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2 or Ak ∈ X2 ∖ X1 or Ak ∈ cl∆(X2) ∖ X2. That is, they agree only on the attributes Ak such that Ak ∈ cl∆(X2). Thus, each FD that contains an attribute Ak /∈ cl∆(X2) on its left-hand side is satisfied. Moreover, any FD that contains only attributes Ak ∈ cl∆(X2) on its left-hand side also contains only attributes Ak ∈ cl∆(X2) on its right-hand side (by definition of a closure), thus Π(t) and Π(t′) agree on both the left-hand side and the right-hand side of such FDs and {Π(t), Π(t′)} satisfies all the FDs in ∆.
● a ≠ a′, b ≠ b′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2. Since X1 → Y1 and X2 → Y2 are minimal, there is no FD in ∆ that contains only attributes Ak such that Ak ∈ X1 ∩ X2 on its left-hand side. Thus, Π(t) and Π(t′) do not agree on the left-hand side of any FD in ∆ and {Π(t), Π(t′)} is consistent w.r.t. ∆.

This concludes our proof of the "if" direction.

The "only if" direction.
Assume {t, t′} is inconsistent w.r.t. ∆A→B→C. We prove that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. Since {t, t′} is inconsistent w.r.t. ∆A→B→C, one of the following holds:

● a = a′ and b ≠ b′. In the first case of this lemma, since (cl∆(X1) ∖ X1) ∩ (cl∆(X2) ∖ X2) ≠ ∅, at least one attribute Ak ∈ (cl∆(X1) ∖ X1) also belongs to cl∆(X2) ∖ X2 and it holds that Π(t).Ak = ⟨b, c⟩. In the second case of this lemma, since (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅, at least one attribute Ak ∈ (cl∆(X1) ∖ X1) also belongs to X2 and it holds that Π(t).Ak = b. Moreover, by definition of a closure, the FD X1 → Ak is implied by ∆. In both cases, the tuples Π(t) and Π(t′) agree on the attributes on the left-hand side of the FD X1 → Ak, but do not agree on the right-hand side of this FD. If two tuples do not satisfy an FD that is implied by a set ∆ of FDs, they also do not satisfy ∆, thus {Π(t), Π(t′)} is inconsistent w.r.t. ∆.
● a = a′, b = b′ and c ≠ c′. In the first case of this lemma, as mentioned above, there is an attribute Ak ∈ (cl∆(X1) ∖ X1) such that Π(t).Ak = ⟨b, c⟩. Moreover, by definition of a closure, the FD X1 → Ak is implied by ∆. The tuples Π(t) and Π(t′) agree on the attributes on the left-hand side of the FD X1 → Ak, but do not agree on the right-hand side of this FD. In the second case of this lemma, since (cl∆(X2) ∖ X2) ∩ X1 = ∅ and since the FD X2 → Y2 is not trivial, there is at least one attribute Ak ∈ (cl∆(X2) ∖ X2) such that Π(t).Ak = ⟨b, c⟩. Furthermore, the FD X2 → Ak is implied by ∆. The tuples Π(t) and Π(t′) again agree on the attributes on the left-hand side of the FD X2 → Ak, but do not agree on the right-hand side of this FD. If two tuples do not satisfy an FD that is implied by a set ∆ of FDs, they also do not satisfy ∆, thus {Π(t), Π(t′)} is inconsistent w.r.t. ∆.
● a ≠ a′, b = b′ and c ≠ c′. In this case, Π(t) and Π(t′) agree on the attributes on the left-hand side of the FD X2 → Y2, but do not agree on the right-hand side of this FD (since the FD is not trivial and contains at least one attribute Ak such that Π(t).Ak = ⟨b, c⟩ on its right-hand side). Therefore, {Π(t), Π(t′)} is inconsistent w.r.t. ∆.

This concludes our proof of the "only if" direction.

Lemma
A.16.
Let R be a schema and let ∆ be an FD set over R that does not contain trivial FDs. Suppose that ∆ contains three distinct local minima X1 → Y1, X2 → Y2 and X3 → Y3. Then, there is a fact-wise reduction from (R(A, B, C), ∆AB↔AC↔BC) to (R, ∆).

Proof.
We define a fact-wise reduction Π ∶ (R(A, B, C), ∆AB↔AC↔BC) → (R, ∆), using X1 → Y1, X2 → Y2 and X3 → Y3 and the constant ⊙ ∈ Const. Let t = (a, b, c) be a tuple over R(A, B, C) and let {A1, . . . , An} be the set of attributes in R. We define Π as follows:

Π(t).Ak :=
● ⊙ if Ak ∈ X1 ∩ X2 ∩ X3
● a if Ak ∈ (X1 ∩ X3) ∖ X2
● b if Ak ∈ (X1 ∩ X2) ∖ X3
● c if Ak ∈ (X2 ∩ X3) ∖ X1
● ⟨a, b⟩ if Ak ∈ X1 ∖ X2 ∖ X3
● ⟨a, c⟩ if Ak ∈ X3 ∖ X1 ∖ X2
● ⟨b, c⟩ if Ak ∈ X2 ∖ X1 ∖ X3
● ⟨a, b, c⟩ otherwise

It is left to show that Π is a fact-wise reduction. To do so, we prove that Π is well defined, injective, and preserves consistency and inconsistency.

Π is well defined.
This is straightforward from the definition.
Π is injective.
Let t, t′ be two tuples, such that t = (a, b, c) and t′ = (a′, b′, c′). Assume that Π(t) = Π(t′). Let us denote Π(t) = (x1, . . . , xn) and Π(t′) = (x′1, . . . , x′n). Note that X1 contains at least one attribute that does not belong to X2 (otherwise, it holds that X1 ⊆ X2, which is a contradiction to the fact that X2 → Y2 is a local minimum). Thus, there exists an attribute Al such that either Π(t).Al = a or Π(t).Al = ⟨a, b⟩. Similarly, X2 contains at least one attribute that does not belong to X3. Thus, there exists an attribute Ap such that either Π(t).Ap = b or Π(t).Ap = ⟨b, c⟩. Finally, X3 contains at least one attribute that does not belong to X1. Thus, there exists an attribute Ar such that either Π(t).Ar = c or Π(t).Ar = ⟨a, c⟩. Hence, Π(t) = Π(t′) implies that Π(t).Al = Π(t′).Al, Π(t).Ap = Π(t′).Ap and Π(t).Ar = Π(t′).Ar. We obtain that a = a′, b = b′ and c = c′, which implies t = t′.

Π preserves consistency.
Let t = ( a, b, c ) and t ′ = ( a ′ , b ′ , c ′ ) be two distinct tuples. We contend that the set { t, t ′ } is consistent w.r.t. ∆ AB ↔ AC ↔ BC if and only if the set { Π ( t ) , Π ( t ′ )} is consistent w.r.t. ∆. The “if ” direction.
Assume that {t, t′} is consistent w.r.t. ∆AB↔AC↔BC. We prove that {Π(t), Π(t′)} is consistent w.r.t. ∆. Note that t and t′ cannot agree on more than one attribute (otherwise, they would violate at least one FD in ∆AB↔AC↔BC). Thus, Π(t) and Π(t′) may only agree on attributes that appear in X1 ∩ X2 ∩ X3 and in one of (X1 ∩ X2) ∖ X3, (X1 ∩ X3) ∖ X2 or (X2 ∩ X3) ∖ X1. As mentioned above, X1 contains at least one attribute that does not belong to X2, thus no FD in ∆ contains only attributes from X1 ∩ X2 ∩ X3 and (X1 ∩ X2) ∖ X3 on its left-hand side (otherwise, X1 would not be minimal). Similarly, no FD in ∆ contains only attributes from X1 ∩ X2 ∩ X3 and (X1 ∩ X3) ∖ X2 on its left-hand side, and no FD in ∆ contains only attributes from X1 ∩ X2 ∩ X3 and (X2 ∩ X3) ∖ X1 on its left-hand side. Therefore, Π(t) and Π(t′) do not agree on the left-hand side of any FD in ∆, and {Π(t), Π(t′)} is consistent w.r.t. ∆.

The "only if" direction.
Assume {t, t′} is inconsistent w.r.t. ∆AB↔AC↔BC. We prove that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. Since {t, t′} is inconsistent w.r.t. ∆AB↔AC↔BC, t and t′ agree on two attributes, but do not agree on the third one. Thus, one of the following holds:

● a = a′, b = b′ and c ≠ c′. In this case, Π(t) and Π(t′) agree on all of the attributes that appear on the left-hand side of X1 → Y1. Since this FD is not trivial, it must contain on its right-hand side an attribute Ak such that Ak /∈ X1. That is, there is at least one attribute Ak that appears on the right-hand side of X1 → Y1 such that one of the following holds: (a) Π(t).Ak = c, (b) Π(t).Ak = ⟨a, c⟩, (c) Π(t).Ak = ⟨b, c⟩ or (d) Π(t).Ak = ⟨a, b, c⟩. Hence, Π(t) and Π(t′) do not satisfy the FD X1 → Y1 and {Π(t), Π(t′)} is inconsistent w.r.t. ∆.
● a = a′, b ≠ b′ and c = c′. This case is symmetric to the first one. Π(t) and Π(t′) agree on all of the attributes that appear on the left-hand side of X3 → Y3, but do not agree on at least one attribute that appears on the right-hand side of the FD.
● a ≠ a′, b = b′ and c = c′. This case is also symmetric to the first one. Here, Π(t) and Π(t′) agree on the left-hand side, but not on the right-hand side, of the FD X2 → Y2.

Lemma
A.17.
Let R be a schema and let ∆ be an FD set over R that does not contain trivial FDs. Suppose that ∆ contains two distinct local minima X1 → Y1 and X2 → Y2, and the following hold:

● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and (cl∆(X2) ∖ X2) ∩ X1 ≠ ∅,
● (X1 ∖ X2) /⊆ (cl∆(X2) ∖ X2).

Then, there is a fact-wise reduction from (R(A, B, C), ∆AB→C→B) to (R, ∆).

Proof. We define a fact-wise reduction Π ∶ (R(A, B, C), ∆AB→C→B) → (R, ∆), using X1 → Y1, X2 → Y2 and the constant ⊙ ∈ Const. Let t = (a, b, c) be a tuple over R(A, B, C) and let {A1, . . . , An} be the set of attributes in R. We define Π as follows:

Π(t).Ak :=
● ⊙ if Ak ∈ X1 ∩ X2
● c if Ak ∈ X2 ∖ X1
● b if Ak ∈ (X1 ∖ X2) ∩ (cl∆(X2) ∖ X2)
● ⟨a, b⟩ if Ak ∈ (X1 ∖ X2) ∖ (cl∆(X2) ∖ X2)
● ⟨b, c⟩ if Ak ∈ (cl∆(X2) ∖ X2) ∖ (X1 ∖ X2)
● ⟨a, b, c⟩ otherwise

It is left to show that Π is a fact-wise reduction. To do so, we prove that Π is well defined, injective, and preserves consistency and inconsistency.

Π is well defined.
This is straightforward from the definition.
Π is injective.
Let t, t′ be two tuples, such that t = (a, b, c) and t′ = (a′, b′, c′). Assume that Π(t) = Π(t′). Let us denote Π(t) = (x1, . . . , xn) and Π(t′) = (x′1, . . . , x′n). Since the FD X1 → Y1 is a local minimum, it holds that X2 /⊆ X1. Thus, there is an attribute that appears in X2, but does not appear in X1. Moreover, it holds that (X1 ∖ X2) /⊆ (cl∆(X2) ∖ X2), thus X1 ∖ X2 contains at least one attribute that does not appear in cl∆(X2) ∖ X2. Therefore, there are l and p such that Π(t).Al = c and Π(t).Ap = ⟨a, b⟩. Hence, Π(t) = Π(t′) implies that Π(t).Al = Π(t′).Al and Π(t).Ap = Π(t′).Ap. We obtain that a = a′, b = b′ and c = c′, which implies t = t′.

Π preserves consistency.
Let t = ( a, b, c ) and t ′ = ( a ′ , b ′ , c ′ ) be two distinct tuples. We contend that the set { t, t ′ } is consistent w.r.t. ∆ AB → C → B if and only if the set { Π ( t ) , Π ( t ′ )} is consistent w.r.t. ∆. The “if ” direction.
Assume that {t, t′} is consistent w.r.t. ∆AB→C→B. We prove that {Π(t), Π(t′)} is consistent w.r.t. ∆. One of the following holds:

● b ≠ b′ and c ≠ c′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2. Since X1 → Y1 and X2 → Y2 are local minima, there is no FD in ∆ that contains on its left-hand side only attributes Ak such that Ak ∈ X1 ∩ X2. Thus, Π(t) and Π(t′) do not agree on the left-hand side of any FD in ∆ and {Π(t), Π(t′)} is consistent w.r.t. ∆.
● a ≠ a′, b = b′ and c = c′. Note that in this case Π(t) and Π(t′) agree on all of the attributes that belong to cl∆(X2), and only on these attributes. Any FD in ∆ that contains only attributes from cl∆(X2) on its left-hand side also contains only attributes from cl∆(X2) on its right-hand side (by definition of a closure). Thus, Π(t) and Π(t′) satisfy all the FDs in ∆.
● a ≠ a′, b = b′ and c ≠ c′. In this case, Π(t) and Π(t′) agree only on the attributes Ak such that Ak ∈ X1 ∩ X2 or Ak ∈ (X1 ∖ X2) ∩ (cl∆(X2) ∖ X2). Since the FD X1 → Y1 is a local minimum, and since X1 contains an attribute that does not belong to cl∆(X2), no FD in ∆ contains on its left-hand side only attributes Ak such that Ak ∈ X1 ∩ X2 or Ak ∈ (X1 ∖ X2) ∩ (cl∆(X2) ∖ X2). Thus, Π(t) and Π(t′) do not agree on the left-hand side of any FD in ∆ and {Π(t), Π(t′)} is consistent w.r.t. ∆.

This concludes our proof of the "if" direction.

The "only if" direction.
Assume {t, t′} is inconsistent w.r.t. ∆AB→C→B. We prove that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. Since {t, t′} is inconsistent w.r.t. ∆AB→C→B, one of the following holds:

● a = a′, b = b′ and c ≠ c′. In this case, Π(t) and Π(t′) agree on all of the attributes that appear in X1. Since X1 → Y1 is not trivial, there is an attribute Ak in Y1 that does not belong to X1. That is, one of the following holds: (a) Π(t).Ak = c, (b) Π(t).Ak = ⟨b, c⟩ or (c) Π(t).Ak = ⟨a, b, c⟩. Thus, {Π(t), Π(t′)} violates the FD X1 → Y1 and it is inconsistent w.r.t. ∆.
● b ≠ b′ and c = c′. In this case, Π(t) and Π(t′) agree on all of the attributes that appear in X2. Since X2 → Y2 is not trivial, there is an attribute Ak in Y2 that does not belong to X2. That is, one of the following holds: (a) Π(t).Ak = b, (b) Π(t).Ak = ⟨a, b⟩, (c) Π(t).Ak = ⟨b, c⟩ or (d) Π(t).Ak = ⟨a, b, c⟩. Thus, {Π(t), Π(t′)} violates the FD X2 → Y2 and it is again inconsistent w.r.t. ∆.

This concludes our proof of the "only if" direction.

Next, we show that whenever OptSRepair simplifies an FD set ∆ into an FD set ∆′, there is a fact-wise reduction from (R, ∆′) to (R, ∆), where R is the underlying relation schema. Note that for each one of the simplifications, we just remove a set of attributes from the schema and the FDs in ∆. Thus, we will show that there is a fact-wise reduction from (R, ∆ − X) to (R, ∆), where X is a set of attributes.

Lemma
A.18.
Let R(A1, . . . , Am) be a relation schema and let ∆ be an FD set over R. Let X ⊆ {A1, . . . , Am} be a set of attributes. Then, there is a fact-wise reduction from (R, ∆ − X) to (R, ∆).

Proof. We define a fact-wise reduction Π ∶ (R, ∆ − X) → (R, ∆), using the constant ⊙ ∈ Const. Let t be a tuple over R. We define Π as follows:

Π(t).Ak := ⊙ if Ak ∈ X, and Π(t).Ak := t.Ak otherwise.

It is left to show that Π is a fact-wise reduction. To do so, we prove that Π is well defined, injective, and preserves consistency and inconsistency.

Π is well defined.
This is straightforward from the definition.
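Since this Π merely masks the removed attributes with ⊙, it can be sketched in a few lines (an illustration only, with tuples as dicts; the fresh constant ⊙ is modeled by a unique sentinel object):

```python
MARK = object()  # stands in for the constant ⊙

def pi(t, removed):
    """Map a tuple over R to one that is constant (⊙) on the removed
    attributes X and copies t everywhere else."""
    return {a: (MARK if a in removed else v) for a, v in t.items()}
```

Two tuples that differ outside X keep a differing attribute under Π, which is exactly the injectivity argument that follows.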
Π is injective.
Let t, t′ be two distinct tuples over R. Since t ≠ t′, there exists an attribute Aj in {A1, . . . , Am} ∖ X such that t.Aj ≠ t′.Aj. Thus, it also holds that Π(t).Aj ≠ Π(t′).Aj, and hence Π(t) ≠ Π(t′).

Π preserves consistency.
Let t , t ′ be two distinct tuples over R . We contend that the set { t , t ′ } is consistentw.r.t. ∆ − X if and only if the set { Π ( t ) , Π ( t ′ )} is consistent w.r.t. ∆. The “if ” direction.
Assume that {t, t′} is consistent w.r.t. ∆ − X. We prove that {Π(t), Π(t′)} is consistent w.r.t. ∆. Let us assume, by way of contradiction, that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. That is, there exists an FD Z → W in ∆ such that Π(t) and Π(t′) agree on all the attributes in Z, but do not agree on at least one attribute B in W. Clearly, it holds that B /∈ X (since Π(t).Ai = Π(t′).Ai = ⊙ for each attribute Ai ∈ X). Note that the FD (Z ∖ X) → (W ∖ X) belongs to ∆ − X. Since Π(t).Ak = t.Ak and Π(t′).Ak = t′.Ak for each attribute Ak /∈ X, the tuples t and t′ also agree on all the attributes on the left-hand side of the FD (Z ∖ X) → (W ∖ X), but do not agree on the attribute B that belongs to W ∖ X. Thus, t and t′ violate an FD in ∆ − X, which is a contradiction to the fact that the set {t, t′} is consistent w.r.t. ∆ − X.
Assume {t, t′} is inconsistent w.r.t. ∆ − X. We prove that {Π(t), Π(t′)} is inconsistent w.r.t. ∆. Since {t, t′} is inconsistent w.r.t. ∆ − X, there exists an FD Z → W in ∆ − X such that t and t′ agree on all the attributes on the left-hand side of the FD, but do not agree on at least one attribute B on its right-hand side. The FD (Z ∪ X′) → (W ∪ X″) (for some X′, X″ ⊆ X) belongs to ∆, and since Π(t).Ak = t.Ak and Π(t′).Ak = t′.Ak for each attribute Ak /∈ X, and Π(t).Ak = Π(t′).Ak = ⊙ for each attribute Ak ∈ X, it holds that Π(t) and Π(t′) agree on all the attributes in Z ∪ X′, but do not agree on the attribute B ∈ (W ∪ X″), thus {Π(t), Π(t′)} is inconsistent w.r.t. ∆.

The following lemmas are straightforward based on Lemma A.18.

Lemma
A.19.
Let R be a relation schema and let ∆ be an FD set over R . If ∆ has a common lhs A , then thereis a fact-wise reduction from ( R, ∆ − A ) to ( R, ∆ ) . Lemma
A.20.
Let R be a relation schema and let ∆ be an FD set over R . If ∆ has a consensus FD, ∅ → X , thenthere is a fact-wise reduction from ( R, ∆ − X ) to ( R, ∆ ) . Lemma
A.21.
Let R be a relation schema and let ∆ be an FD set over R. If ∆ has an lhs marriage (X1, X2), then there is a fact-wise reduction from (R, ∆ − X1X2) to (R, ∆).

Lemma
A.22.
Let R be a relation schema and let ∆ be an FD set over R . If no simplification can be applied to ∆ and OSRSucceeds ( ∆ ) returns false, then computing an optimal S-repair for ∆ is APX-complete. Proof.
If no simplification can be applied to ∆ and OSRSucceeds(∆) returns false, then ∆ is neither trivial nor empty. Note that in this case, ∆ cannot be a chain. Otherwise, ∆ contains a global minimum, which is either an FD of the form ∅ → X, in which case ∆ has a consensus FD, or an FD of the form X → Y where X ≠ ∅, in which case it holds that X ⊆ Z for each FD Z → W in ∆ and ∆ has a common lhs. Thus, ∆ contains at least two local minima X1 → Y1 and X2 → Y2 (that is, no FD Z → W in ∆ is such that Z ⊂ X1 or Z ⊂ X2). Note that we always remove trivial FDs from ∆ before applying a simplification, thus we can assume that ∆ does not contain trivial FDs. One of the following holds:

1. (cl∆(X2) ∖ X2) ∩ X1 = ∅. We divide this case into three subcases:
● (cl∆(X1) ∖ X1) ∩ cl∆(X2) = ∅. In this case, Lemma A.7 and Lemma A.14 imply that computing an optimal S-repair is APX-hard.
● (cl∆(X1) ∖ X1) ∩ (cl∆(X2) ∖ X2) ≠ ∅ and (cl∆(X1) ∖ X1) ∩ X2 = ∅. In this case, Lemma A.8 and Lemma A.15 imply that computing an optimal S-repair is APX-hard.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅. In this case, Lemma A.8 and Lemma A.15 imply that computing an optimal S-repair is APX-hard.

2. (cl∆(X2) ∖ X2) ∩ X1 ≠ ∅. We divide this case into three subcases:
● (cl∆(X1) ∖ X1) ∩ X2 = ∅. In this case, Lemma A.8 and Lemma A.15 (with the roles of X1 and X2 exchanged) imply that computing an optimal S-repair is APX-hard.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and it holds that (X1 ∖ X2) ⊆ (cl∆(X2) ∖ X2) and (X2 ∖ X1) ⊆ (cl∆(X1) ∖ X1). In this case, ∆ contains at least one more local minimum. Otherwise, for every FD Z → W in ∆ it holds that either X1 ⊆ Z or X2 ⊆ Z. If X1 ∩ X2 ≠ ∅, then ∆ has a common lhs, which is an attribute from X1 ∩ X2. If X1 ∩ X2 = ∅, then ∆ does not have a common lhs, but has an lhs marriage. In both cases, we get a contradiction to the fact that no simplification can be applied to ∆. Thus, Lemma A.11 and Lemma A.16 imply that computing an optimal S-repair is APX-hard.
● (cl∆(X1) ∖ X1) ∩ X2 ≠ ∅ and it holds that (X1 ∖ X2) /⊆ (cl∆(X2) ∖ X2). In this case, Lemma A.13 and Lemma A.17 imply that computing an optimal S-repair is APX-hard.

Proposition 3.3 implies that computing an optimal S-repair is always in APX, thus the problem is actually APX-complete in each one of these cases. This concludes our proof of the lemma.

Finally, we prove the following.

Lemma
A.23.
Let R be a relation schema and let ∆ be an FD set over R . If OSRSucceeds ( ∆ ) returns false, thencomputing an optimal S-repair for ∆ is APX-complete. Proof.
We prove the lemma by induction on n, the number of simplifications that are applied to ∆ by OSRSucceeds. The basis of the induction is n = 0. In this case, Lemma A.22 implies that computing an optimal S-repair is indeed APX-complete. For the inductive step, we need to prove that if the claim is true for all n = 0, . . . , k − 1, then it is also true for n = k. In this case, OSRSucceeds(∆) starts by applying a simplification to the problem. One of the following holds:

● ∆ has a common lhs A. Note that OSRSucceeds(∆) will return false only if OSRSucceeds(∆ − A) returns false. From the induction hypothesis we know that if OSRSucceeds(∆ − A) returns false, then computing an optimal S-repair for ∆ − A is APX-complete. Thus, Lemma A.19 implies that computing an optimal S-repair for ∆ is APX-hard.
● ∆ has a consensus FD ∅ → X. The algorithm OSRSucceeds(∆) will return false only if OSRSucceeds(∆ − X) returns false. From the induction hypothesis we know that if OSRSucceeds(∆ − X) returns false, then computing an optimal S-repair for ∆ − X is APX-complete. Thus, Lemma A.20 implies that computing an optimal S-repair for ∆ is APX-hard.
● ∆ has an lhs marriage (X1, X2). Again, OSRSucceeds(∆) will return false only if OSRSucceeds(∆ − X1X2) returns false. From the induction hypothesis we know that if OSRSucceeds(∆ − X1X2) returns false, then computing an optimal S-repair for ∆ − X1X2 is APX-complete. Thus, Lemma A.21 implies that computing an optimal S-repair for ∆ is APX-hard.

Proposition 3.3 implies that computing an optimal S-repair is always in APX, thus the problem is APX-complete in each one of these cases. This concludes our proof of the lemma.

B. DETAILS FROM SECTION 4

B.1 Proof of Theorem 4.1
In this section we prove Theorem 4.1.
Theorem
4.1. Suppose that ∆ = ∆₁ ∪ ∆₂ where ∆₁ and ∆₂ are attribute disjoint. The following are equivalent for all α ≥ 1:
1. An α-optimal U-repair can be computed in polynomial time under ∆.
2. An α-optimal U-repair can be computed in polynomial time under each of ∆₁ and ∆₂.
First we show the following proposition:
Proposition
B.1.
For a schema R and a table T over R, if U*, U₁*, U₂* denote optimal U-repairs for ∆, ∆₁, ∆₂ respectively, where ∆ = ∆₁ ∪ ∆₂ and ∆₁, ∆₂ are attribute disjoint, then dist_upd(U*, T) = dist_upd(U₁*, T) + dist_upd(U₂*, T).

Proof.
Let us denote by H_P(T[i], U*[i]) the Hamming distance between T[i] and U*[i] for a tuple id i ∈ ids(T) with respect to a subset of attributes P ⊆ attr(∆) (i.e., the number of attributes in P where T[i] and U*[i] differ). Since U* is an optimal U-repair for ∆, no attribute ∉ attr(∆) is updated in U* (otherwise we could change U* in polynomial time so that all attributes ∉ attr(∆) keep their values from T, which would reduce the distance, contradicting the optimality of U*). Clearly, for any i ∈ ids(T), H(T[i], U*[i]) = H_attr(∆)(T[i], U*[i]) = H_attr(∆₁)(T[i], U*[i]) + H_attr(∆₂)(T[i], U*[i]). Hence from U* we can create two repairs U₁*, U₂*, one for ∆₁ and one for ∆₂, by taking the attribute values in attr(∆₁) (resp. attr(∆₂)) from U* and keeping the remaining values unchanged as in T, leading to dist_upd(U*, T) = dist_upd(U₁*, T) + dist_upd(U₂*, T). Note that U₁* is an optimal U-repair for ∆₁; otherwise, combining a better repair with U₂* would give a U-repair for ∆ of smaller distance than U*, contradicting the optimality of U*. Similarly, U₂* is an optimal U-repair for ∆₂.

Proof of Theorem 4.1.
We now prove the theorem.

(1) → (2). Suppose that an α-optimal U-repair can be computed in polynomial time under ∆ for any schema R and any table T on R by an algorithm P_∆(T). We create another table T′ from T by keeping the values of the attributes of attr(∆₁) the same for all tuples, and changing the values of all other attributes of all tuples to 0. Let U*, U₁*, U₂* denote optimal solutions for ∆, ∆₁, ∆₂ on T′. From Proposition B.1, for T′,
dist_upd(U*, T′) = dist_upd(U₁*, T′) + dist_upd(U₂*, T′) (10)
Since no FD of ∆₂ is violated in T′, we have T′ = U₂* and dist_upd(U₂*, T′) = 0. Therefore,
dist_upd(U*, T′) = dist_upd(U₁*, T′) (11)
Then we run the algorithm P_∆ on T′ to obtain an α-optimal U-repair U of T′ in polynomial time with respect to ∆, i.e.,
dist_upd(U, T′) ≤ α dist_upd(U*, T′) (12)
We obtain a U-repair U₁ for ∆₁ and T′ from U by copying the attribute values in attr(∆₁) from U and keeping the other attributes unchanged as in T′. Hence
dist_upd(U₁, T′) ≤ dist_upd(U, T′) ≤ α dist_upd(U*, T′) = α dist_upd(U₁*, T′) (13)
Since the attributes ∉ attr(∆₁) are unchanged in U₁ and U₁*, and the attribute values of attr(∆₁) are the same in T and T′, U₁ is also a U-repair of T with dist_upd(U₁, T) = dist_upd(U₁, T′), and U₁* is also an optimal U-repair of T with dist_upd(U₁*, T) = dist_upd(U₁*, T′); hence,
dist_upd(U₁, T) ≤ α dist_upd(U₁*, T) (14)
This gives an α-optimal U-repair of T for ∆₁ in polynomial time (a similar argument holds for ∆₂).

(2) → (1). This direction is simpler, and we argue that an α-optimal U-repair for ∆ can be obtained by composing α-optimal U-repairs U₁, U₂ of ∆₁, ∆₂ for a table T over schema R. Suppose that U*, U₁*, U₂* are optimal U-repairs of ∆, ∆₁, ∆₂ respectively. Since U₁, U₂ are α-optimal U-repairs,
dist_upd(U₁, T) ≤ α dist_upd(U₁*, T) and dist_upd(U₂, T) ≤ α dist_upd(U₂*, T) (15)
We construct a U-repair U of ∆ by choosing the attribute values in attr(∆₁) from U₁, the values in attr(∆₂) from U₂, and all other attribute values from T. Hence
dist_upd(U, T) = dist_upd(U₁, T) + dist_upd(U₂, T) ≤ α (dist_upd(U₁*, T) + dist_upd(U₂*, T)) (from (15)) = α dist_upd(U*, T) (from Proposition B.1)
Hence U is an α-optimal U-repair for ∆.

B.2 Proof of Theorem 4.3
In this section we prove Theorem 4.3.
Theorem
4.3. Let ∆ be a set of FDs. There is a strict reduction from computing an optimal U-repair for ∆ to computing an optimal U-repair for ∆ − cl_∆(∅), and vice versa. To prove this theorem, we first show the following proposition:
Proposition
B.2.
For a consensus FD ∅ → A , an optimal U-repair can be computed in polynomial time. Proof.
Consider a table T over a schema R that violates the FD ∅ → A, i.e., at least two tuples in T have distinct values of A. For every distinct value a of A in T, let M_a denote the set of tuples t such that t.A = a, i.e., M_a = σ_{A=a} T. We obtain a repair of T as follows: for every distinct value a of A, compute the total weight of the tuples in M_a, i.e., compute W_a = Σ_{t ∈ M_a} w_t (slightly abusing the notation for weights). Choose the value a₀ having the maximum value of W_a among all such a's. Keep the tuples in M_{a₀} unchanged, and update every other tuple t ∈ M_a, a ≠ a₀, so that t.A = a₀. Clearly, this new table, say U, is a consistent update, since now every tuple t in T has t.A = a₀.
To see that U is an optimal U-repair, first note that a repair with a better distance cannot be obtained by setting the A values to a fresh constant from the infinite domain, since choosing a value from the active domain instead saves the cost of the repair for at least one tuple in T. Now assume that some other value a′ ≠ a₀ has been chosen for all tuples in T in a repair U′ such that dist_upd(U′, T) < dist_upd(U, T). Then dist_upd(U′, T) = Σ_{i ∈ ids(T)} w_T(i) · H(T[i], U′[i]) = Σ_{a ≠ a′} Σ_{t ∈ M_a} w_t (since the Hamming distance is 1 for all a ≠ a′, and 0 for a′) = Σ_{a ≠ a′} W_a = (Σ_{t ∈ T} w_t) − W_{a′}, whereas dist_upd(U, T) = (Σ_{t ∈ T} w_t) − W_{a₀}. Since W_{a₀} ≥ W_{a′}, we get dist_upd(U′, T) ≥ dist_upd(U, T), contradicting the assumption that dist_upd(U′, T) < dist_upd(U, T). Hence, U is an optimal U-repair.
From Theorem 4.1 and Proposition B.2, the corollary below follows:

Corollary
B.3.
For the consensus FD ∅ → cl_∆(∅), where cl_∆(∅) is the set of all consensus attributes, an optimal U-repair can be computed in polynomial time. In addition, the following proposition is straightforward.
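The plurality rule behind Proposition B.2 can be sketched as follows. This is our own minimal illustration, not code from the paper; the dictionary encoding of tuples, the `weight` callback, and all names are assumptions:

```python
from collections import defaultdict

def consensus_repair(table, attr, weight=lambda t: 1):
    """Repair the consensus FD {} -> attr (Proposition B.2): set attr in
    every tuple to the value a0 of maximum total weight W_a."""
    totals = defaultdict(float)
    for t in table:
        totals[t[attr]] += weight(t)
    best = max(totals, key=totals.get)  # the value a0 maximizing W_a
    # the distance is the total weight of the tuples whose attr value changes
    cost = sum(weight(t) for t in table if t[attr] != best)
    repaired = [dict(t, **{attr: best}) for t in table]
    return repaired, cost

# Example: three tuples disagree on A; the plurality value 1 wins,
# so only the single tuple with A = 2 is updated.
table = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "z"}]
repaired, cost = consensus_repair(table, "A")
```

With unit weights this realizes dist_upd(U, T) = (Σ_t w_t) − W_{a₀} from the proof above.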
Proposition
B.4.
The two sets of FDs ∆ and ∆′ = {∅ → cl_∆(∅)} ∪ (∆ − cl_∆(∅)) are equivalent, and for any table T, an α-optimal solution for ∆ is also an α-optimal solution for ∆′, and vice versa. Using the above results, we prove Theorem 4.3.
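The attribute-disjoint composition from Theorem 4.1, which both directions below rely on, can be sketched as follows. The dictionary encoding of tuples and all function names are our own illustration under the assumption of unweighted tables, not the paper's code:

```python
def compose_repairs(table, repair1, attrs1, repair2, attrs2):
    """Combine U-repairs of two attribute-disjoint FD sets (Theorem 4.1):
    take the attrs1 values from repair1, the attrs2 values from repair2,
    and every remaining attribute value from the original table."""
    combined = []
    for t, r1, r2 in zip(table, repair1, repair2):
        u = dict(t)  # attributes outside attrs1 and attrs2 stay as in T
        for a in attrs1:
            u[a] = r1[a]
        for a in attrs2:
            u[a] = r2[a]
        combined.append(u)
    return combined

def hamming_dist(table, update):
    """dist_upd for unweighted tables: the number of changed cells."""
    return sum(t[a] != u[a] for t, u in zip(table, update) for a in t)

# Example: repairs for {A -> B} and {C -> D} act on disjoint attributes,
# so they compose, and the distances add up as in Proposition B.1.
T  = [{"A": 1, "B": 1, "C": 1, "D": 1}, {"A": 1, "B": 2, "C": 1, "D": 2}]
U1 = [{"A": 1, "B": 1, "C": 1, "D": 1}, {"A": 1, "B": 1, "C": 1, "D": 2}]  # fixes A -> B
U2 = [{"A": 1, "B": 1, "C": 1, "D": 1}, {"A": 1, "B": 2, "C": 1, "D": 1}]  # fixes C -> D
U = compose_repairs(T, U1, {"A", "B"}, U2, {"C", "D"})
```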
Proof.
To show strict reductions, we show the following two directions:

From an α-optimal U-repair for ∆ to an α-optimal U-repair for ∆ − cl_∆(∅). Using (i) Proposition B.4 and (ii) the fact that {∅ → cl_∆(∅)} and (∆ − cl_∆(∅)) are attribute disjoint, Theorem 4.1 implies that an α-optimal U-repair for ∆ − cl_∆(∅) can be computed in polynomial time if an α-optimal U-repair for ∆ can be computed in polynomial time.

From an α-optimal U-repair for ∆ − cl_∆(∅) to an α-optimal U-repair for ∆. Note that an optimal U-repair for {∅ → cl_∆(∅)} (which is also an α-optimal U-repair for any α ≥ 1) can be computed in polynomial time (by Corollary B.3). Therefore, using Proposition B.4 and the fact that {∅ → cl_∆(∅)} and (∆ − cl_∆(∅)) are attribute disjoint, this direction also follows from Theorem 4.1.

B.3 Proof of Proposition 4.9
Proposition
4.9. Under ∆ = {A → B, B → A}, an optimal U-repair can be computed in polynomial time. For the FD set ∆ = {A → B, B → A}, the goal is to show that an optimal U-repair can be computed in polynomial time. Note that the FDs {A → B, B → A} imply that in a consistent update, one value of A cannot be associated with multiple values of B, and vice versa.

Proof.
For ∆ = {A → B, B → A}, although mlc(∆) = 2, we argue below that still dist_upd(U*, T) = dist_sub(S*, T) for an optimal U-repair U* and an optimal S-repair S* for a table T over R. From Corollary 4.5,
dist_sub(S*, T) ≤ dist_upd(U*, T) (16)
Given S*, we can create a consistent update U by keeping the tuples in S* unchanged and updating the A or B values of all tuples deleted from S* as follows. Consider any tuple t ∈ T ∖ S*. There must exist a tuple s ∈ S* with either t.A = s.A or t.B = s.B; otherwise, t could be included in S*, violating the optimality of S*. Suppose, without loss of generality, that there is a tuple s ∈ S* with t.A = s.A. Then we change the value of t.B to s.B, with Hamming distance 1. Hence we get a consistent update U with
dist_sub(S*, T) ≥ dist_upd(U, T) (17)
Combining (16) and (17), U is an optimal U-repair. Since ∆ passes the test of OSRSucceeds (by applying an lhs marriage), Theorem 3.4 implies that an optimal S-repair S* can be computed in polynomial time, from which an optimal U-repair for {A → B, B → A} can be computed in polynomial time as described above.

B.4 Proof of Theorem 4.10
In this section we prove Theorem 4.10.
Theorem
4.10. For the relation schema R(A, B, C) and the FD set ∆_{A↔B→C}, computing an optimal U-repair is APX-complete, even on unweighted, duplicate-free tables. As discussed in Section 4.3, we give a reduction from the vertex cover problem on bounded-degree graphs (known to be APX-complete), showing that the size of the minimum vertex cover is k if and only if the distance of the optimal U-repair for the constructed instance is 2|E| + k (where E is the set of edges of the original graph). This proof is inspired by the NP-hardness proof for ∆ = {A → B, B → C} in [24], but needs some additional machinery to extend the reduction to ∆_{A↔B→C}.

Construction.
The input to the minimum vertex cover problem is a graph G(V, E). The goal is to find a vertex cover C of G (a set of vertices that touches every edge of G) such that no other vertex cover C′ of G contains fewer vertices than C. This problem is known to be NP-hard even for bounded-degree graphs. Given such an input, we construct the input I for our problem as follows. Let V = {v₁, . . . , v_n} be the set of vertices of G and let E = {e₁, . . . , e_m} be the set of edges of G. For each edge e_r = (v_i, v_j) in E, the instance I will contain the following tuples:
● (v_i, v_j, 0),
● (v_j, v_i, 0).
In addition, for each vertex v_i ∈ V, the instance I will contain the following tuple:
● (v_i, v_i, 1).
Let C_min be a minimum vertex cover of G. We will now prove the following results:
1. For each vertex cover C of G of size k, there is a U-repair of I with distance 2|E| + k.
2. No U-repair of I has distance 2|E| + k for some k < |C_min|.
Then, we can conclude that the size of the minimum vertex cover of G is m if and only if the distance of the optimal U-repair of I is 2|E| + m.

Proof of (1).
We will now prove that for each vertex cover C of G of size k, there is a U-repair of I with distance 2|E| + k. Assume that C is a vertex cover of G that contains k vertices. We can build a U-repair J of I as follows: for each edge e_r = (v_i, v_j), if v_i ∈ C, then we change the tuples (v_i, v_j, 0) and (v_j, v_i, 0) to (v_i, v_i, 0). Otherwise, we change these two tuples to (v_j, v_j, 0) (note that in this case, since v_i does not belong to the vertex cover, v_j does). Moreover, for each vertex v_i that belongs to C, we change the tuple (v_i, v_i, 1) to (v_i, v_i, 0). Note that if a vertex v_i does not belong to C, the instance J will not contain any tuple of the form (v_i, v_i, 0); thus the tuple (v_i, v_i, 1) does not agree on the value of A or B with any other tuple in J, and is not in conflict with any other tuple. On the other hand, if a vertex v_i does belong to C, the instance J will only contain tuples of the form (v_i, v_i, 0) for v_i, since in this case we change the tuple (v_i, v_i, 1) to (v_i, v_i, 0). Thus, J is indeed a U-repair. Finally, for each edge (v_i, v_j) ∈ E we update one cell in each of the tuples (v_i, v_j, 0) and (v_j, v_i, 0) (2|E| cell updates), and for each vertex v_i ∈ C we update one cell in the tuple (v_i, v_i, 1) (k cell updates); thus the distance of J is 2|E| + k.

Proof of (2).
In order to prove this part of the lemma (that is, that no U-repair of I has distance 2|E| + k for some k < |C_min|), we first have to prove the following lemma, which is a nontrivial part of the proof.

Lemma
B.5.
Let J be a U-repair of I that updates t tuples of the form (v_i, v_j, 0) for some t < 2|E|. Then, there is a U-repair J′ of I that updates at least t + 1 tuples of the form (v_i, v_j, 0), such that the distance of J′ is lower than or equal to the distance of J.

Proof.
Since it holds that t < 2|E|, there is at least one tuple (v₁, v₂, 0), for some v₁, v₂ ∈ V, that is not updated by J. We will now show that we can build another U-repair J′ that updates the tuple (v₁, v₂, 0) (as well as every tuple that is updated by J), such that the distance of J′ is lower than or equal to the distance of J. That is, J′ will update at least t + 1 tuples of the form (v_i, v_j, 0).
The tuple t₁ = (v₁, v₂, 0) is in conflict with the tuples t₂ = (v₁, v₁, 1) and t₃ = (v₂, v₂, 1). Moreover, these two tuples are in conflict with the tuple t₄ = (v₂, v₁, 0). We will now show that since the tuple t₁ is not updated by J and J is consistent, J must make at least four changes to the tuples in {t₂, t₃, t₄}. The following holds:
1. To resolve the conflict between t₁ and t₂ (which agree only on the value of attribute A), we have to either change the value of attribute A in t₂ to another value, or change the values of both attributes B and C in t₂ to v₂ and 0, respectively. In the first case, since t₂ is still in conflict with t₄ (as they then agree only on the value of attribute B), we also have to change the value of attribute B in one of t₂ or t₄. In the second case, t₂ and t₄ are no longer in conflict, thus we do not have to make any more changes. In both cases, the distance of the update is 2.
2. To resolve the conflict between t₁ and t₃ (which agree only on the value of attribute B), we have to either change the value of attribute B in t₃ to another value, or change the values of both attributes A and C in t₃ to v₁ and 0, respectively. In the first case, since t₃ is still in conflict with t₄ (as they then agree only on the value of attribute A), we also have to change the value of attribute A in one of t₃ or t₄. In the second case, t₃ and t₄ are no longer in conflict, thus we do not have to make any more changes.
In both cases, the cost of the update is 2.
Note that the updates presented in (1) and (2) above are independent (as they change different attributes in different tuples), thus we have to combine them to resolve all of the conflicts among the tuples in {t₁, t₂, t₃, t₄}, and the total cost of updating the tuples t₂, t₃ and t₄ in J is at least 4. Moreover, note that it is not necessary to change the value of attribute B in both t₂ and t₄, or to change the value of attribute A in both t₃ and t₄, since changing just one of them resolves the conflict between the tuples. If J updates both of these values, then the cost of updating the tuples t₂, t₃ and t₄ is increased by one for each pair of tuples (t₂ and t₄, or t₃ and t₄) updated together. We will use this observation later in the proof.
Now, we will build an instance J′ that updates each one of the tuples updated by J, as well as the tuple t₁. We will prove that the distance of J′ is not higher than the distance of J. We will start by adding these four tuples to J′:
1. Instead of the tuple t₁ ((v₁, v₂, 0)), we insert the tuple (v₁, v₁, 0).
2. Instead of the tuple t₂ ((v₁, v₁, 1)), we insert the tuple (v₁, v₁, 0).
3. Instead of the tuple t₃ ((v₂, v₂, 1)), we insert the tuple (v₂, v₂, 0).
4. Instead of the tuple t₄ ((v₂, v₁, 0)), we insert the tuple (v₂, v₂, 0).
So far we have only updated one cell in each of the four tuples discussed above; thus, the current cost is 4. As mentioned above, the cost of changing these tuples in J is at least 4, thus so far we did not exceed the cost of J. From now on, we will look at the rest of the tuples in J and insert each one of them into J′ (in some cases, only after updating them to resolve some conflict).
5.
For each tuple t that appears in J and is not in conflict with one of (v₁, v₁, 0) or (v₂, v₂, 0), we insert t into J′. Clearly, J′ is consistent at this point, and since those tuples are updated by J′ in exactly the same way they were updated by J, we did not exceed the distance of J. Each one of the remaining tuples is in conflict with one of (v₁, v₁, 0) or (v₂, v₂, 0), thus we have to update it before inserting it into J′.
6. Each tuple t such that t.A = v₁ is in conflict with the tuple (v₁, v₁, 0). Since the tuple (v₁, v₂, 0) belongs to J and J is consistent, it holds that t = (v₁, v₂, 0). There are a few possible cases:
● If the original tuple t (the tuple in the instance I) was of the form (v_i, v_j, 0) for some v_i ≠ v₁ and v_j ≠ v₂, then J updated two cells in this tuple. We will replace this tuple in J′ with the tuple (v₁, v₁, 0). Note that the cost is the same (since we again update two cells).
● If the original tuple t was of the form (v₁, v_j, 0) for some v_j ≠ v₂, then J updated one cell in this tuple. We will replace this tuple in J′ with the tuple (v₁, v₁, 0). Again, the cost is the same.
● If the original tuple was of the form (v_i, v₂, 0) for some v_i ≠ v₁, we will replace it with the tuple (v₂, v₂, 0) with no additional cost.
● If the original tuple was of the form (v_i, v_i, 1) for some v_i ∉ {v₁, v₂} (we have already handled the case where v_i is one of v₁ or v₂), then J updated three cells in this tuple (the values of all of the attributes are different from the values in the tuple (v₁, v₂, 0)), and we will replace it with the tuple (v₁, v₁, 0) with no additional cost.
7. Each tuple t such that t.B = v₂ is in conflict with the tuple (v₂, v₂, 0). Again, since the tuple (v₁, v₂, 0) belongs to J and J is consistent, it holds that t = (v₁, v₂, 0).
This case is symmetric to the previous one, thus we will not elaborate on all of the possible cases.
So far, we have inserted into J′ one tuple (either (v₁, v₁, 0) or (v₂, v₂, 0)) for each of the four tuples in {t₁, t₂, t₃, t₄}, one tuple for each tuple in J that is in conflict with neither (v₁, v₁, 0) nor (v₂, v₂, 0), and one tuple (again, either (v₁, v₁, 0) or (v₂, v₂, 0)) for each tuple in J that agrees with (v₁, v₁, 0) only on the value of attribute A or agrees with (v₂, v₂, 0) only on the value of attribute B. It is left to insert each tuple that agrees with (v₁, v₁, 0) on the value of attribute B or agrees with (v₂, v₂, 0) on the value of attribute A.
8. Each tuple t such that t.A = v₂ is in conflict with the tuple (v₂, v₂, 0). Note that in this case it holds that t.B ≠ v₂, since J is consistent and also contains the tuple (v₁, v₂, 0). Assume that t is the tuple (v₂, v_i, x) for some v_i ∈ V. Again, there are a few possible cases:
● If the original tuple t (the tuple in the instance I) was of the form (v_j, v′_i, y) for some v_j ≠ v₂ and v′_i ≠ v_i, then J updated at least two cells in this tuple. We will replace this tuple in J′ with the tuple (v₂, v₂, 0), with no additional cost.
● If the original tuple t was of the form (v₂, v′_i, 0) for some v′_i ≠ v_i, then J updated at least one cell in this tuple (the value of attribute B). We will replace this tuple in J′ with the tuple (v₂, v₂, 0), with no additional cost.
● If the original tuple t was of the form (v_j, v_i, x) for some v_j ≠ v₂, then J updated at least one cell in this tuple (the value of attribute A). We will replace this tuple in J′ with the tuple (y_i, v_i, x), for some new value y_i that depends on i, with no additional cost.
9. Each tuple t such that t.B = v₁ is in conflict with the tuple (v₁, v₁, 0).
This case is symmetric to the previous one, thus we will not elaborate on all of the possible cases.
At this point, J′ contains tuples of the form (v₁, v₁, 0) and (v₂, v₂, 0), tuples from J that are not in conflict with these tuples, and tuples of the form (y_i, v_i, x). Note that we insert a tuple (y_i, v_i, x) into J′ only if there is a tuple (v₂, v_i, x) in J for some v_i ≠ v₂. Since the tuple (v₂, v_i, x) belongs to J, no other tuple t′ such that t′.B = v_i also belongs to J, unless the tuples are identical, and all of these identical tuples are replaced by the same tuple. Thus, there is no tuple t′′ in J that is not in conflict with one of (v₁, v₁, 0) or (v₂, v₂, 0) such that t′′.B = v_i, and the tuple (y_i, v_i, x) agrees on the value of attribute B only with other tuples that look exactly the same. Moreover, the tuple (y_i, v_i, x) agrees on the value of attribute A only with other tuples that look exactly the same, since we use a new value y_i that does not appear in any other tuple in the instance. Thus, no tuple (y_i, v_i, x) is in conflict with the rest of the tuples in J′, and J′ is consistent. Moreover, since we only replaced tuples with no additional cost, so far we did not exceed the distance of J.
Finally, there are just two more cases that we have not covered yet. Assume that t is the tuple (v₂, v_i, x) for some v_i ≠ v₂, and assume that the original tuple t was the tuple (v₂, v_i, 0) with v_i ≠ v₁ (if v_i = v₁ then we have already covered this case). Then, it may be the case that the tuple was not updated by J at all. Since the tuple (v₂, v_i, x) belongs to J and J is consistent, no other tuple t′ ∈ J is such that t′.A = v₂ (unless they are identical). Thus, J updated the value of attribute A in both t₃ (the tuple (v₂, v₂, 1)) and t₄ (the tuple (v₂, v₁, 0)).
As mentioned in the observation at the beginning of this proof, if J updated both of these values, then the cost of updating the tuples t₂, t₃ and t₄ is at least 5 (and not at least 4 as we assumed so far). Thus, in this case, we can use the additional cost that we have to change one more cell in the tuple (v₂, v_i, x). Similarly, if J contains a tuple t of the form (v_i, v₁, x) for some v_i ≠ v₁, and the original tuple t was the tuple (v_i, v₁, 0) with v_i ≠ v₂, then J updated the value of attribute B in both t₂ (the tuple (v₁, v₁, 1)) and t₄ (the tuple (v₂, v₁, 0)). This again gives us an additional cost to update one more cell in the tuple (v_i, v₁, x). Thus, we will do the following:
10. If the tuple (v₂, v_i, x) for some v_i ∉ {v₁, v₂} belongs to J, and the original tuple was (v₂, v_i, 0), then we replace this tuple in J′ with the tuple (v₂, v₂, 0) with an additional cost of at most one.
11. If the tuple (v_j, v₁, x) for some v_j ∉ {v₁, v₂} belongs to J, and the original tuple was (v_j, v₁, 0), then we replace this tuple in J′ with the tuple (v₁, v₁, 0) with an additional cost of at most one.
At this point, we have inserted into J′ one tuple for each tuple in J, and as explained above, J′ is now consistent. Moreover, we never exceeded the distance of J; thus, we have found a new U-repair J′ that updates each tuple updated by J, as well as the tuple (v₁, v₂, 0), and that concludes our proof of the lemma.
Lemma B.5 implies that for each U-repair J of I that updates t tuples of the form (v_i, v_j, 0) for some t < 2|E|, there is a U-repair J′ of I that updates all of the tuples of the form (v_i, v_j, 0), such that the distance of J′ is lower than or equal to the distance of J. Thus, each U-repair of I has distance at least 2|E|.
Finally, we can prove that no U-repair of I has distance 2|E| + k for some k < |C_min|. The proof of this part of the lemma is very similar to the proof of [24].
Each U-repair of I has a distance 2|E| + k for some k ≥ 0. Let us assume, by way of contradiction, that there is a U-repair of I that updates 2|E| + k cells, for some k < |C_min| (where C_min is a minimum vertex cover of G). Let J be the optimal U-repair of I (clearly, in this case, J updates x = 2|E| + k′ cells, for some k′ < |C_min|). Lemma B.5 implies that we can assume that J updates all of the tuples of the form (v_i, v_j, 0). Let {f₁, . . . , f_l} be the set of tuples of the form (v_i, v_i, 1) that are updated by J, and let C′ = {v_{i₁}, . . . , v_{i_l}} be the set of the vertices that appear in those facts. Since J updates all of the 2|E| tuples of the form (v_i, v_j, 0), the distance of J is greater than or equal to 2|E| + |C′|; thus we get that 2|E| + |C_min| > 2|E| + k′ ≥ 2|E| + |C′|, and |C′| < |C_min|. Therefore, C′ cannot be a vertex cover of G (since the size of the minimum vertex cover is |C_min|). Thus, there are at least |C_min| − |C′| edges (v_i, v_j) in G such that neither v_i ∈ C′ nor v_j ∈ C′ (that is, J updates neither the tuple (v_i, v_i, 1) nor the tuple (v_j, v_j, 1)). For each one of these |C_min| − |C′| edges, we have to change at least 4 cells in the tuples (v_i, v_j, 0) and (v_j, v_i, 0) (the values of attributes A and B); thus the distance of J is greater than or equal to 4(|C_min| − |C′|) + 2(|E| − (|C_min| − |C′|)) + |C′| = 2|E| + 2|C_min| − |C′|. Finally, we get that 2|E| + |C_min| > 2|E| + 2|C_min| − |C′|, that is, |C′| > |C_min|, which is a contradiction to the fact that |C′| < |C_min|.
The last step is to show that our reduction is a PTAS reduction. Since our reduction is from the problem of finding a minimum vertex cover in a graph of bounded degree B, the size of the minimum vertex cover is at least |E|/B (as each vertex covers at most B edges).
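As an illustration only (our own encoding, not part of the proof), the instance construction above and the repair built from a vertex cover in part (1) can be sketched as:

```python
def build_instance(vertices, edges):
    """Encode a graph as a table over R(A, B, C): two tuples per edge
    (with C = 0) and one tuple per vertex (with C = 1)."""
    rows = [(u, v, 0) for (u, v) in edges] + [(v, u, 0) for (u, v) in edges]
    rows += [(v, v, 1) for v in vertices]
    return rows

def repair_from_cover(vertices, edges, cover):
    """The U-repair of distance 2|E| + |cover| described in part (1)."""
    rows = []
    for (u, v) in edges:
        w = u if u in cover else v   # some endpoint is in the cover
        rows.append((w, w, 0))       # one cell changed in (u, v, 0)
    for (u, v) in edges:
        w = u if u in cover else v
        rows.append((w, w, 0))       # one cell changed in (v, u, 0)
    rows += [(v, v, 0) if v in cover else (v, v, 1) for v in vertices]
    return rows

def distance(table, update):
    """Number of updated cells between a table and its update."""
    return sum(x != y for t, u in zip(table, update) for x, y in zip(t, u))

# A triangle has a minimum vertex cover of size 2, e.g. {1, 2},
# and the resulting repair has distance 2|E| + 2.
V, E = [1, 2, 3], [(1, 2), (2, 3), (1, 3)]
I = build_instance(V, E)
J = repair_from_cover(V, E, {1, 2})
```

The repair J contains only tuples (w, w, 0) for covering vertices w and untouched tuples (v, v, 1) for the rest, so no two tuples share an A or B value without being identical.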
Now, let us assume, by way of contradiction, that the problem of computing an optimal U-repair under {A → B, B → A, A → C} admits a PTAS. That is, for every ϵ > 0, a (1 + ϵ)-optimal solution to that problem can be computed in polynomial time. Let S_U be such a solution and let OPT_U be an optimal solution. Then,
dist(S_U) − dist(OPT_U) ≤ ϵ · dist(OPT_U) (18)
We will now show that in this case, S_C (which is the vertex cover corresponding to the U-repair S_U) is a (1 + (1 + 2B)ϵ)-optimal solution to the minimum vertex cover problem. Let OPT_C be an optimal solution to that problem. We recall that |OPT_C| ≥ |E|/B. Then,
|S_C| − |OPT_C| = (dist(S_U) − 2|E|) − (dist(OPT_U) − 2|E|) = dist(S_U) − dist(OPT_U) ≤ ϵ · dist(OPT_U) = ϵ · (|OPT_C| + 2|E|) ≤ ϵ · (|OPT_C| + 2B|OPT_C|) = (1 + 2B)ϵ · |OPT_C| (19)
It follows that for each ϵ > 0, we can get a (1 + ϵ)-optimal solution for the minimum vertex cover problem by computing a (1 + ϵ/(1 + 2B))-optimal solution to the problem of computing an optimal U-repair, which is a contradiction to the fact that the minimum vertex cover problem on bounded-degree graphs is APX-hard.

B.5 Proof of Theorem 4.14
In this section we prove Theorem 4.14.
Theorem
4.14. Let k ≥ 2 be fixed. Computing an optimal U-repair is APX-complete for:
1. R(A₁, . . . , A_k, B₁, . . . , B_k, C) and ∆_k;
2. R(A₁, . . . , A_{k+1}, B₁, . . . , B_k) and ∆′_k.
We start by proving the first part of the theorem; that is, we prove the following.
Lemma
B.6.
Let k ≥ 2 be fixed. For the schema R(A₁, . . . , A_k, B₁, . . . , B_k, C) and the FD set ∆_k = {A₁⋯A_k → B₁, B₁ → C, B₂ → A₁, . . . , B_k → A₁}, computing an optimal U-repair is APX-complete.

Proof.
We reduce from the problem of computing an optimal U-repair for {A → B, B → C} on a schema S(A, B, C), which has been shown by Kolahi and Lakshmanan [24] to be NP-hard, via a reduction from vertex cover, even if all tuples have the same weight and there are no duplicate tuples. Since there exists a constant-factor approximation for computing an optimal U-repair, and vertex cover on degree-bounded graphs is known to be APX-hard, this reduction suffices to prove that computing an optimal U-repair under ∆_k is APX-complete.

Construction.
Given a table T_S on S(A, B, C), we construct a table T_R on R(A₁, ⋯, A_k, B₁, . . . , B_k, C), where for every tuple s = (a, b, c) in T_S, we create a tuple r = (0, a, 0, ⋯, 0, b, 0, ⋯, 0, c) in T_R; i.e., r.A₂ = s.A, r.B₁ = s.B, r.C = s.C, and r has value 0 in the remaining columns A₁, A₃, ⋯, A_k and B₂, ⋯, B_k. We claim that T_S has a consistent update of distance ≤ M if and only if T_R has a consistent update of distance ≤ M.

The "if" direction.
Suppose that T_S has a consistent update U_S of distance M. We can obtain a consistent update U_R of T_R of the same distance by updating the A₂, B₁, C values of T_R as in T_S, and leaving A₁, A₃, ⋯, A_k and B₂, ⋯, B_k unchanged in T_R. If B₁ → C were violated in U_R for two tuples r₁, r₂, then B → C would also be violated for the corresponding tuples s₁, s₂ in U_S. If A₁⋯A_k → B₁ were violated for two tuples r₁, r₂ in U_R, they would have different values of B₁ but the same value of A₁⋯A_k, i.e., the same value of A₂ = A in U_S (since A₁, A₃, ⋯, A_k are all 0 in U_R), violating the FD A → B in U_S and contradicting that it is a consistent update of T_S. The FDs B₂ → A₁, . . . , B_k → A₁ are satisfied in U_R since all of A₁ and B₂, ⋯, B_k have value 0.

The "only if" direction.
Suppose that we have a consistent update U_R of T_R of distance M. We will transform U_R into another consistent update U′_R such that only the A₂, B₁, C attributes are updated, and all of B₂, ⋯, B_k and A₁, A₃, ⋯, A_k are unchanged from T_R (all tuples have value 0 in all of these attributes in U′_R).
To achieve this, consider the subset of tuple identifiers M_R ⊆ ids(T_R) such that for any tuple t = T_R[i], i ∈ M_R, at least one attribute from B₂, ⋯, B_k or A₁, A₃, ⋯, A_k has been updated in U_R (i.e., is not equal to 0), and therefore H(T_R[i], U_R[i]) ≥ 1. For all i ∈ M_R, we update U_R to U′_R by (i) keeping all attributes in B₂, ⋯, B_k and A₁, A₃, ⋯, A_k as 0 (unchanged from T_R), and (ii) assigning a fresh constant from the infinite domain Val to attribute A₂ in each such tuple (changed from T_R), while B₁ and C keep their values from U_R. The attributes in (i) and (ii) now contribute exactly 1 to H(T_R[i], U′_R[i]), whereas they contribute at least 1 to H(T_R[i], U_R[i]); therefore H(T_R[i], U′_R[i]) ≤ H(T_R[i], U_R[i]) for all i ∈ M_R. Since the tuples with identifiers ∉ M_R remain the same in U_R and U′_R, we get dist_upd(U′_R, T_R) ≤ dist_upd(U_R, T_R); i.e., this transformation does not increase the distance.
Next we argue that U′_R is a consistent update for T_R where all tuples have value 0 in all attributes in B₂, ⋯, B_k and A₁, A₃, ⋯, A_k. (a) Since U_R was a consistent update, it satisfied the FD B₁ → C, and since neither B₁ nor C is changed in U′_R from U_R, this FD is still satisfied in U′_R. (b) For the FDs B₂ → A₁, ⋯, B_k → A₁, since all tuples in U′_R have value 0 in all of these attributes, all of these FDs are satisfied in U′_R as well. (c) For the final FD A₁⋯A_k → B₁: for tuples with identifiers ∉ M_R, this FD is satisfied since they have the same values of the attributes A₁⋯A_k, B₁ in U_R and U′_R. The tuples with identifiers ∈ M_R received a fresh constant in U′_R for attribute A₂, so they do not interfere with each other or with the tuples with identifiers ∉ M_R, and therefore the FD A₁⋯A_k → B₁ is satisfied in U′_R. Hence, U′_R is a consistent update for T_R.
Hence, we have a new consistent update U′_R of T_R with distance ≤ M where only A₂, B₁, C are updated. Suppose U_S = ρ_{(A,B,C)}[π_{A₂B₁C} U′_R] (project U′_R to A₂B₁C and rename A₂ to A and B₁ to B). It can be seen that U_S satisfies the FD B → C, since B₁ → C is maintained in U′_R. It also satisfies the FD A → B, since otherwise there are two tuples s₁, s₂ ∈ U_S with the same value of A and different values of B. This implies that the corresponding tuples r₁, r₂ in U′_R have the same value of A₂ and different values of B₁. Since both r₁, r₂ have the same values of all attributes in A₁, A₃, ⋯, A_k (all 0), together with A₂ they have the same values of A₁⋯A_k but different values of B₁, violating the FD A₁⋯A_k → B₁ and contradicting the fact that U′_R is a consistent update of T_R. Hence we get a consistent update U_S of T_S of distance at most M.
Next, we prove the second part of the theorem. That is, we prove the following.

Lemma
B.7. Let k ≥ 2 be fixed. For the schema R(A_1, ..., A_{k+1}, B_1, ..., B_k) and the FD set Δ′_k = {A_1A_2 → B_1, A_2A_3 → B_2, ..., A_kA_{k+1} → B_k}, computing an optimal U-repair is APX-complete.

Proof.
We have previously established membership in APX for the problem of computing an optimal U-repair (see Section 4); thus, it is only left to show that the problem is APX-hard. We start by proving that computing an optimal U-repair under Δ′_k for k = 2 is APX-hard. In this case, the FD set consists of A_1A_2 → B_1 and A_2A_3 → B_2. This FD set has a common lhs A_2; thus, Corollary 4.6, combined with the fact that computing an optimal S-repair for an FD set of the form {A → B, C → D} is APX-hard, implies that computing an optimal U-repair is APX-hard as well.

Next, we construct a reduction from computing an optimal U-repair under Δ′_k for k = 2 to computing an optimal U-repair under Δ′_k for k > 2. Given a table T over the schema R(A_1, A_2, A_3, B_1, B_2), we construct a table T′ over the schema R(A_1, ..., A_{k+1}, B_1, ..., B_k), where for every tuple t = (a_1, a_2, a_3, b_1, b_2) in T, we create a tuple t′ = (a_1, a_2, a_3, ⊙, ..., ⊙, b_1, b_2, ⊙, ..., ⊙) in T′. That is, t′.A = t.A for every A ∈ {A_1, A_2, A_3, B_1, B_2}, and t′.A = ⊙ for the rest of the attributes. We claim that T has a consistent update of distance ≤ m if and only if T′ has a consistent update of distance ≤ m.

The “if” direction.
Suppose that T has a consistent update U of distance m. We can obtain a consistent update U′ of T′ that has the same distance by updating the values of A_1, A_2, A_3, B_1, B_2 in T′ in exactly the same way we update these values in T, and leaving the values of the rest of the attributes of T′ unchanged. Clearly, each FD that is not one of A_1A_2 → B_1 or A_2A_3 → B_2 is satisfied by U′ (since we did not change the values of the attributes in {B_3, ..., B_k}, all the tuples in U′ have the same value ⊙ in these attributes, and hence they agree on the rhs of each of these FDs). Moreover, if two tuples t′_1 and t′_2 of U′ violate one of A_1A_2 → B_1 or A_2A_3 → B_2, then the corresponding two tuples in U also violate these FDs, which is a contradiction to the fact that U is a consistent update of T. Clearly, the distance of both updates is the same.

The “only if” direction.
Suppose that we have a consistent update U′ of T′ of distance m. We can obtain a consistent update U of T that has a lower or equal distance by updating the values of A_1, A_2, A_3, B_1, B_2 in T in exactly the same way we update these values in T′. Let us assume, by way of contradiction, that U is inconsistent. In this case, there are two tuples t_1 and t_2 in U that violate one of A_1A_2 → B_1 or A_2A_3 → B_2. There is a tuple t′_1 in U′ that agrees with t_1 on the value of each one of the attributes in {A_1, A_2, A_3, B_1, B_2}. Similarly, there is a tuple t′_2 in U′ that agrees with t_2 on the value of each one of the attributes in {A_1, A_2, A_3, B_1, B_2}. Clearly, these two tuples also violate one of the FDs A_1A_2 → B_1 or A_2A_3 → B_2, which is a contradiction to the fact that U′ is a consistent update of T′. Finally, the distance of U is at most m (it can be lower than m if U′ also updates values in the attributes not in {A_1, A_2, A_3, B_1, B_2}).
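The padding reduction from Δ′_2 to Δ′_k above is mechanical enough to sketch in code. The following Python fragment is an illustration only, not part of the paper: the function names, the encoding of tuples as Python tuples, the use of the string "⊙" for the padding symbol, and the representation of an FD as a pair of column-index lists are all our own choices.

```python
from itertools import combinations

PAD = "⊙"  # padding symbol used by the reduction (our encoding choice)

def reduce_table(T, k):
    """Pad each tuple (a1, a2, a3, b1, b2) of T into a tuple over
    R(A1..A_{k+1}, B1..B_k): attributes A4..A_{k+1} and B3..B_k get ⊙."""
    return [(a1, a2, a3) + (PAD,) * (k - 2)   # A4 .. A_{k+1}
            + (b1, b2) + (PAD,) * (k - 2)     # B3 .. B_k
            for (a1, a2, a3, b1, b2) in T]

def dist_upd(U, T):
    """Update distance: number of cells on which U and T differ
    (tuples are matched by position, i.e., by identifier)."""
    return sum(u != t for row_u, row_t in zip(U, T)
                      for u, t in zip(row_u, row_t))

def satisfies(table, fds):
    """Check every FD, given as a pair (lhs, rhs) of column-index lists:
    no two tuples may agree on lhs while disagreeing on rhs."""
    for lhs, rhs in fds:
        for r1, r2 in combinations(table, 2):
            if all(r1[i] == r2[i] for i in lhs) and \
               any(r1[j] != r2[j] for j in rhs):
                return False
    return True

def delta_prime(k):
    """Δ'_k = {A_i A_{i+1} -> B_i : 1 <= i <= k}, with columns
    A1..A_{k+1} at indices 0..k and B1..B_k at indices k+1..2k."""
    return [([i, i + 1], [k + 1 + i]) for i in range(k)]
```

On small examples this mirrors the two directions of the proof: a table that is consistent under Δ′_2 remains consistent after padding under Δ′_k, and a violation of A_1A_2 → B_1 in T is inherited by T′, since the padded attributes never cause a disagreement on the rhs of the extra FDs.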