Database Repairing with Soft Functional Dependencies
Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, Muhammad Tibi
Nofar Carmeli
Technion, Haifa, Israel
Martin Grohe
RWTH Aachen University, Germany
Benny Kimelfeld
Technion, Haifa, Israel
Ester Livshits
Technion, Haifa, Israel
Muhammad Tibi
Technion, Haifa, Israel
Abstract
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a “cardinality repair” of an inconsistent database; in soft interpretations, this subset corresponds to a “most probable world” of a probabilistic database, a “most likely intention” of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of this problem in the more general soft semantics. This paper makes significant progress in this direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases: a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal “almost matching” of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy.
Information systems → Data cleaning; Theory of computation → Incomplete, inconsistent, and uncertain databases
Keywords and phrases
Soft constraints, soft repairs, functional dependencies
© Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, and Muhammad Tibi; licensed under Creative Commons License CC-BY. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

Soft variants of database constraints (also referred to as weak or approximate constraints) have been a building block of various challenges in data management. In constraint discovery and mining, for instance, the goal is to find constraints, such as Functional Dependencies (FDs) [3, 8, 11] and beyond [2, 12, 16], that generally hold in the database but not necessarily in a perfect manner. There, the reason for the violations might be rare events (e.g., agreement on the zip code but not the state) or noise (e.g., mistyping). Soft constraints also arise when reasoning about uncertain data [6, 9, 18, 19]: the database is viewed as a probabilistic space over possible worlds, and the violation of a weak constraint in a possible world is viewed as evidence that affects the world’s probability.

Our investigation concerns the latter application of soft constraints. To be more precise, the semantics is that of a parametric factor graph: the probability of a possible world is the product of factors where every violation of the constraint contributes one factor; in turn, this factor is a weight that is assigned upfront to the constraint. This approach is highly inspired by successful concepts such as the Markov Logic Network (MLN) [17]. The computational challenges are the typical ones of probabilistic modeling: marginal inference (compute the probability of a query answer) and maximum likelihood (find the most probable world), the latter being the problem that we focus on here.

More specifically, we investigate the complexity of finding a most probable world in the case where the constraints are FDs. By taking the logarithms of the factors, this problem can be formally defined as follows. We are given a database D and a set ∆ of FDs, where every tuple and every FD has a weight (a nonnegative number). We would like to obtain a cleaner subset E of D by deleting tuples. The cost of E includes a penalty for every deleted tuple and a penalty for every violation of (i.e., pair of tuples that violates) an FD; the penalties are the weights of the tuple and the FD, respectively. The goal is to find a subset E with a minimal cost. In what follows, we refer to such E as an optimal subset and to the optimization problem of finding an optimal subset as soft repairing. The optimal subset corresponds to the “most likely intention” in the Probabilistic Unclean Database (PUD) framework of De Sa, Ilyas, Kimelfeld, Ré and Rekatsinas [18] in a restricted case that is studied in their work, and to the “most probable world” in the probabilistic database model of Sen, Deshpande and Getoor [19]. In the special case where the FDs are hard constraints (i.e., their weight is infinite or just too large to pay), an optimal subset is simply what is known as a “cardinality repair” [15] or, equivalently [14], a “most probable database” [6].

The computational challenge of soft repairing is that there are exponentially many candidate subsets. We investigate the data complexity of the problem, where the database schema and the FD set are fixed, and the input consists of the database D and all involved weights. Moreover, we assume that D consists of a single relation; this is done without loss of generality, since the problem boils down to soft repairing each relation independently (since an FD does not involve more than one relation).

The complexity of the problem is very well understood in the case of hard constraints (cardinality repairs). Gribkoff, Van den Broeck and Suciu [6] established complexity results for the case of unary FDs (having a single attribute on the left-hand side), and Livshits, Kimelfeld and Roy [14] completed the picture to a full (effective) dichotomy over all possible sets of FDs. For example, the problem is solvable in polynomial time for the FD sets { A → B }, { A → B, B → A } and { A → B, B → A, B → C }, but is NP-hard for { A → B, B → C }.

In contrast, very little is known about the more general case where the FDs are soft (and violations are allowed), where the problem seems to be fundamentally harder, both to solve and to reason about. Clearly, for every ∆ where it is intractable to find a cardinality repair, the soft version is also intractable. But the other direction is false (under conventional complexity assumptions). For example, soft repairing is hard for ∆ = { A → B, B → A, B → C }, for the following reason. We can set the weights of A → B and B → C to be very high, making each of them a hard constraint in effect, and the weight of B → A very low, making it ignorable in effect. Hence, an optimal subset is a cardinality repair for { A → B, B → C } that, as said above, is hard to compute.

So, which sets of FDs have a tractable soft repairing? The only polynomial-time algorithm we are aware of is that of De Sa et al. [18] for the special case of a single key constraint, that is, ∆ = { X → Y } where XY contain all of the schema attributes; they have left the more general case (that we study here) open. In this work, we make substantial progress in answering this question by presenting algorithms for two types of FD sets: (a) a single FD and (b) a matching constraint.

The first type generalizes the tractability of De Sa et al. [18] from a key constraint to an arbitrary FD (as long as it is the only FD in ∆). Like theirs, our algorithm employs dynamic programming, but in a more involved fashion. This is because their algorithm is based on the fact that in a key constraint X → Y, any two tuples that agree on X are necessarily conflicting; for an arbitrary FD, two such tuples may also agree on Y and hence be consistent. We also show that our algorithm can be generalized to additional sets of FDs. For example, it turns out that the FD set { name → address, name address → email } is tractable as well. (Note that the address attribute on the left-hand side of the second FD is not redundant, as it would be in the ordinary semantics, since the FDs are treated as soft constraints.) In Section 4 we phrase the more general condition that this FD set satisfies.

The second type, matching constraints, refers to FD sets $\Delta = \{X_1 \to Y_1, X_2 \to Y_2\}$ over a schema with the attributes $A_1, \dots, A_k$ where $X_1 \cup Y_1 = X_2 \cup Y_2 = X_1 \cup X_2 = \{A_1, \dots, A_k\}$. The simplest example is { A → B, B → A } over the binary schema (A, B) that represents a bipartite graph, and the problem is that of finding the best “almost matching” of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy. A more involved example is { fn ln → addr, fn addr → ln } over the schema (fn, ln, addr). Our algorithm is based on a reduction to the Minimum Cost Maximum Flow (MCMF) problem [4].

Whether our algorithms cover all of the tractable cases remains an open problem for future investigation. (In the Conclusions we discuss the simplest FD sets where the question is left unsolved.) We do show, however, that there is a polynomial-time approximation algorithm with an approximation factor 3, that is, an algorithm that returns a subset whose penalty is at most three times the optimum.

The rest of the paper is organized as follows. We give the formal setup and the problem definition in Section 2. We then discuss the complexity of the general problem and its relationship to past results in Section 3. We describe our algorithm for soft repairing in Sections 4 and 5 for a single FD and a matching constraint, respectively, and conclude in Section 6. For lack of space, some of the proofs are given in the Appendix.
We begin with preliminary definitions and terminology that we use throughout the paper.

A relation schema $R(A_1, \dots, A_k)$ consists of a relation symbol $R$ and a set $\{A_1, \dots, A_k\}$ of attributes. A database D over R is a set of facts f of the form $R(c_1, \dots, c_k)$, where each $c_i$ is a constant. We denote by $f[A_i]$ the value that the fact f associates with attribute $A_i$ (i.e., $f[A_i] = c_i$). Similarly, if $X = B_1 \cdots B_k$ is a sequence of attributes from $\{A_1, \dots, A_k\}$, then $f[X]$ is the tuple $(f[B_1], \dots, f[B_k])$.

A Functional Dependency (FD) over the relation schema $R(A_1, \dots, A_k)$ is an expression ϕ of the form X → Y where $X, Y \subseteq \{A_1, \dots, A_k\}$. A violation of an FD in a database D is a pair { f, g } of tuples from D that agrees on the left-hand side (i.e., $f[X] = g[X]$) but disagrees on the right-hand side (i.e., $f[Y] \neq g[Y]$). An FD X → Y is trivial if Y ⊆ X. We denote by vio(D, ϕ) the set of all the violations of the FD ϕ in D. We say that D satisfies ϕ, denoted $D \models \varphi$, if it has no violations (i.e., vio(D, ϕ) is empty). The database D satisfies a set ∆ of FDs, denoted by $D \models \Delta$, if D satisfies every FD in ∆; otherwise, D violates ∆ (denoted $D \not\models \Delta$).

When there is no risk of ambiguity, we may omit the specification of the relation schema $R(A_1, \dots, A_k)$ and simply assume that the involved databases and constraints are all over the same schema.

Let D be a database and let ∆ be a set of FDs. A repair (of D w.r.t. ∆) is a maximal consistent subset E; that is, E ⊆ D and $E \models \Delta$, and moreover, $E' \not\models \Delta$ for every $E'$ such that $E \subsetneq E' \subseteq D$. Note that the number of repairs can be exponential in the number of facts of D. A cardinality repair is a repair E of a maximal cardinality (i.e., $|E| \geq |E'|$ for every repair $E'$).
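To make the definition of a violation concrete, the following Python sketch enumerates vio(D, ϕ) by brute force and checks whether a database satisfies a set of FDs. The representation is an assumption made only for illustration: facts are dictionaries from attribute names to constants, an FD X → Y is a pair of attribute tuples, and the function names `violations` and `satisfies` are hypothetical.

```python
from itertools import combinations

def violations(facts, fd):
    """Return vio(D, fd): unordered pairs of facts that agree on X but not on Y."""
    lhs, rhs = fd
    result = []
    for f, g in combinations(facts, 2):
        agree_lhs = all(f[a] == g[a] for a in lhs)
        agree_rhs = all(f[a] == g[a] for a in rhs)
        if agree_lhs and not agree_rhs:
            result.append((f, g))
    return result

def satisfies(facts, fds):
    """D satisfies a set of FDs iff no FD in the set has a violation in D."""
    return all(not violations(facts, fd) for fd in fds)

# Toy database over R(A, B) (assumed data, not taken from the paper).
D = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "y"}]
fd_ab = (("A",), ("B",))   # A -> B
fd_ba = (("B",), ("A",))   # B -> A

print(len(violations(D, fd_ab)))   # 1: the first two facts share A but differ on B
print(satisfies(D[1:], [fd_ab]))   # True: the last two facts have different A values
```

The quadratic pairwise scan mirrors the definition directly; grouping facts by their X values first would avoid comparing facts that cannot conflict.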
We define the concept of soft constraints (or weak constraints or weighted rules) in the standard way of “penalizing” the database for every missing fact, on the one hand, and every violation, on the other hand. This is the concept adopted in past work such as the parfactors of De Sa et al. [18], the soft keys of Jha et al. [9], and the PrDB model of Sen et al. [19]. The concept can be viewed as a special case of the Markov Logic Network (MLN) [17].

Formally, let D be a database and ∆ a set of FDs. We assume that every fact f ∈ D and every FD ϕ ∈ ∆ have a nonnegative weight, hereafter denoted $w_f$ and $w_\varphi$, respectively. (The weight of a fact is sometimes viewed as the log of a validity/existence probability [6, 19].) The cost of a subset E of a database D is then defined as follows.

$$\mathrm{cost}(E \mid D) \;\overset{\mathrm{def}}{=}\; \sum_{f \in (D \setminus E)} w_f \;+\; \sum_{\varphi \in \Delta} w_\varphi \, |\mathrm{vio}(E, \varphi)| \tag{1}$$

As for the computational model, we assume that every weight is a rational number r/q that is represented using the numerator and the denominator, namely (r, q), where each of the two is an integer represented in the standard binary manner.

The problem we study in this paper, referred to as soft repairing, is the optimization problem of finding a database subset with a minimal cost. Since we consider the data complexity of the problem, we associate with each relation schema and set of FDs a separate computational problem.
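As a rough illustration of Equation (1), the sketch below evaluates the cost of a candidate subset and finds a minimum-cost subset by exhaustive search. It is only a baseline for tiny inputs, not one of the algorithms of this paper; the data layout (facts as dictionaries, `kept` as a set of indices into D, FDs paired with their weights) and the names `count_violations`, `cost` and `brute_force_optimal_subset` are assumptions chosen for the example.

```python
from itertools import combinations

def count_violations(facts, fd):
    """|vio(E, fd)|: unordered pairs agreeing on the lhs and differing on the rhs."""
    lhs, rhs = fd
    return sum(
        1
        for f, g in combinations(facts, 2)
        if all(f[a] == g[a] for a in lhs) and any(f[a] != g[a] for a in rhs)
    )

def cost(kept, facts, fact_weights, weighted_fds):
    """cost(E | D) as in Equation (1); `kept` holds the indices of D that form E."""
    deletion = sum(w for i, w in enumerate(fact_weights) if i not in kept)
    subset = [facts[i] for i in sorted(kept)]
    penalty = sum(w * count_violations(subset, fd) for fd, w in weighted_fds)
    return deletion + penalty

def brute_force_optimal_subset(facts, fact_weights, weighted_fds):
    """Try all 2^|D| subsets; exponential, so only usable on tiny inputs."""
    n = len(facts)
    best_cost, best_kept = None, None
    for size in range(n + 1):
        for kept in combinations(range(n), size):
            c = cost(set(kept), facts, fact_weights, weighted_fds)
            if best_cost is None or c < best_cost:
                best_cost, best_kept = c, set(kept)
    return best_cost, best_kept

# Toy instance over R(A, B) with soft FDs A -> B (weight 5) and B -> A (weight 1).
D = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}, {"A": 2, "B": "y"}]
weights = [2, 4, 4]
fds = [((("A",), ("B",)), 5), ((("B",), ("A",)), 1)]

best_cost, best_kept = brute_force_optimal_subset(D, weights, fds)
print(best_cost, sorted(best_kept))   # 3 [1, 2]
```

In this toy instance the cheapest subset deletes the first fact (weight 2) and keeps one violation of B → A (weight 1), so the optimum is not consistent; this is exactly the behavior that distinguishes the soft semantics from cardinality repairs.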
Problem 1 (Soft Repairing). Let $R(A_1, \dots, A_k)$ be a relation schema and ∆ a set of FDs. Soft repairing (for $R(A_1, \dots, A_k)$ and ∆) is the following optimization problem: Given a database D, find an optimal subset of D, that is, a subset E of D with a minimal cost(E | D).

Note that a cardinality repair is an optimal subset in the special case where the weight $w_\varphi$ of every FD ϕ is ∞ (or just higher than the cost of deleting the entire database), and the weight $w_f$ of every fact f is 1. Livshits et al. [14] studied the complexity of finding a weighted cardinality repair, which is the same as a cardinality repair but the weight $w_f$ of every fact f can be arbitrary. Hence, both types of cardinality repairs are consistent (i.e., the constraints are strictly satisfied). In contrast, an optimal subset in the general case may violate one or more of the FDs. In the next section we recall the known complexity results for cardinality and weighted cardinality repairs.

Example 2. Our running example is based on the database of Figure 1 over the relation schema Flights(Flight, Airline, Date, Origin, Destination, Airplane) that contains information about domestic flights in the United States. The weight of each tuple appears in the rightmost column. The FD set ∆ consists of the following FDs:
Flight → Airline: a flight is associated with a single airline.
Flight Airline Date → Destination: a flight on a certain date has a single destination.
(a) D

| Flight | Airline | Date | Origin | Destination | Airplane |
|---|---|---|---|---|---|
| UA123 | United Airlines | 01/01/2021 | LA | NY | N652NW |
| UA123 | United Airlines | 01/01/2021 | NY | UT | N652NW |
| UA123 | Delta | 01/01/2021 | LA | NY | N652NW |
| DL456 | Southwest | 02/01/2021 | NC | MA | N713DX |
| DL456 | Southwest | 03/01/2021 | NJ | FL | N245DX |
| DL456 | Delta | 03/01/2021 | CA | IL | N819US |

(b) $E_1$

| Flight | Airline | Date | Origin | Destination | Airplane |
|---|---|---|---|---|---|
| UA123 | United Airlines | 01/01/2021 | NY | UT | N652NW |
| DL456 | Southwest | 02/01/2021 | NC | MA | N713DX |
| DL456 | Southwest | 03/01/2021 | NJ | FL | N245DX |

(c) $E_2$

| Flight | Airline | Date | Origin | Destination | Airplane |
|---|---|---|---|---|---|
| UA123 | United Airlines | 01/01/2021 | LA | NY | N652NW |
| DL456 | Delta | 03/01/2021 | CA | IL | N819US |

(d) $E_3$

| Flight | Airline | Date | Origin | Destination | Airplane |
|---|---|---|---|---|---|
| UA123 | United Airlines | 01/01/2021 | LA | NY | N652NW |
| UA123 | United Airlines | 01/01/2021 | NY | UT | N652NW |
| DL456 | Delta | 03/01/2021 | CA | IL | N819US |

Figure 1 For the relation Flights(Flight, Airline, Date, Origin, Destination, Airplane) and the FDs Flight → Airline (with $w_\varphi = 5$) and Flight Airline Date → Destination (with $w_\varphi = 1$), a database D, a cardinality repair $E_1$, a weighted cardinality repair $E_2$, and an optimal subset $E_3$.

We assume that the weight of the first FD is 5, and the weight of the second FD is 1 (as the same flight number can be reused for different flights).

The database $E_1$ of Figure 1 is a cardinality repair of D as no repair of D can be obtained by removing fewer than three facts. However, $E_1$ is not a weighted cardinality repair, since its cost is eight, while the cost of $E_2$ is six. The reader can easily verify that $E_2$ is a weighted cardinality repair of D. Finally, $E_3$ is not a repair of D in the traditional sense as it contains a violation of the second FD, but it is an optimal subset of D with $\mathrm{cost}(E_3 \mid D) = 5$.

We consider the data complexity of the problem of computing an optimal subset. We assume that the schema and the set of FDs are fixed, and the input consists of the database. Livshits et al. [14] studied the problems of finding a cardinality repair and a weighted cardinality repair, and established a dichotomy over the space of all the sets of functional dependencies. In particular, they introduced an algorithm that, given a set ∆ of FDs, decides whether:
A weighted cardinality repair can be computed in polynomial time; or
Finding a (weighted) cardinality repair is APX-complete.
No other possibility exists.

Algorithm 1 Simplify
  Remove trivial FDs from ∆
  if ∆ is not empty then
    find a removable pair (X, Y) of attribute sequences:
      Closure_∆(X) = Closure_∆(Y)
      XY is nonempty
      Every FD in ∆ contains either X or Y on the left-hand side
    ∆ := ∆ − XY

The algorithm, which is depicted here as Algorithm 1, is a recursive procedure that attempts to simplify ∆ at each iteration by finding a removable pair (X, Y) of attribute sets, and removing every attribute of X and Y from all the FDs in ∆ (which we denote by ∆ − XY). Note that X and Y may be the same, and then the condition states that every FD contains X on the left-hand side. If we are able to transform ∆ to an empty set of FDs by repeatedly applying simplification, then the algorithm returns true and finding an optimal consistent subset is solvable in polynomial time. Otherwise, the algorithm returns false and the problem is APX-complete. We state their result for later reference.

Theorem 3. [14] Let ∆ be a set of FDs. If ∆ can be emptied via Simplify() steps, then a weighted cardinality repair can be computed in polynomial time; otherwise, finding a cardinality repair is APX-complete.

The hardness side of Theorem 3 immediately implies the hardness of the more general soft-repairing problem. Yet, the other direction (tractability generalizes) is not necessarily true. As discussed in the Introduction, if ∆ = { A → B, B → A, B → C }, then ∆, as a set of hard constraints, is classified as tractable according to Algorithm 1; however, this is not the case for soft constraints. We can generalize this example by stating that if ∆ contains a subset that is hard according to Theorem 3, then soft repairing is hard. (This does not hold when considering only hard constraints, as the example shows that there exists an easy ∆ with a hard subset.) In the following sections, we are going to discuss tractable cases of FD sets. Before that, we will show that the problem becomes tractable if one settles for an approximation.

The following theorem shows that soft repairing admits a constant-ratio approximation, for the constant three, in polynomial time. This means that there is a polynomial-time algorithm for finding a subset with a cost of at most three times the minimum.
Theorem 4.
For all FD sets, soft repairing admits a 3-approximation in polynomial time.
Proof.
We reduce soft repairing to the problem of finding a minimum weighted set cover where every element belongs to 3 sets. A simple greedy algorithm finds a 3-approximation to this problem in linear time [7].

We set the elements to be $\{(\{f, g\}, \delta) \mid f, g \in D,\ \delta \in \Delta,\ f \text{ and } g \text{ contradict } \delta\}$. Each element $(\{f, g\}, \delta)$ belongs to three sets: f with weight $w_f$, g with weight $w_g$, and $(\{f, g\}, \delta)$ with weight $w_\delta$. Each minimal solution to this set cover problem can be translated to a soft repair: the selected sets that correspond to tuples are removed in the repair. Indeed, a minimal set cover of such a construction has to resolve each conflict by either paying for the removal of at least one of the tuples or paying for the violation.

In terms of formal complexity, Theorem 4 implies that the problem of soft repairing is in APX (for every set of FDs). (Recall that APX is the class of NP optimization problems that admit constant-ratio approximations in polynomial time. Hardness in APX is via the so called “PTAS” reductions; cf. textbooks on approximation complexity, e.g., [5].) From this, from Theorem 3 and from the discussion that follows Theorem 3, we conclude the following.
Corollary 5.
Let ∆ be a set of FDs. Soft repairing for ∆ is in APX. Moreover, if some subset of ∆ cannot be emptied via Simplify() steps, then soft repairing is APX-complete for ∆.

In this section, we consider the case of a single functional dependency, and present a polynomial-time algorithm for soft repairing. Hence, we establish the following result.
Theorem 6.
In the case of a single FD, soft repairing can be solved in polynomial time.
Next, we prove Theorem 6 by presenting an algorithm. Later, we also generalize the argument and result beyond a single FD (Theorem 7).

We assume that the single FD is $\varphi \overset{\mathrm{def}}{=} X \to Y$ and that our input database is D. We split D into blocks and subblocks, as we explain next. The blocks of D are the maximal subsets of D that agree on the X values. Denote these blocks by $D_1, \dots, D_m$. Note that there are no conflicts across blocks; hence, we can solve the problem separately for each block and then an optimal subset E is simply the union of optimal subsets $E_i$ of the blocks $D_i$:

$$E = \bigcup_{i=1}^{m} E_i$$

The subblocks of a block $D_i$ are the maximal subsets of $D_i$ that agree on the Y values (in addition to the X values). We denote these subblocks by $D_{i,1}, \dots, D_{i,q_i}$. Note that two facts from the same subblock are consistent, while two facts from different subblocks are conflicting.

From here we continue with dynamic programming. For a number $j \in \{0, \dots, q_i\}$, where $q_i$ is the number of subblocks of $D_i$, and a number $k \in \{0, \dots, |D_{i,1} \cup \dots \cup D_{i,j}|\}$ of facts, we define the following values that we are going to compute:
$C[i, j, k]$ is the cost of an optimal subset of $D_{i,1} \cup \dots \cup D_{i,j}$ (i.e., the union of the first j subblocks) with precisely k facts.
$F[i, j, k]$ is a subset of $D_{i,1} \cup \dots \cup D_{i,j}$ that realizes $C[i, j, k]$, that is,

$$|F[i, j, k]| = k \;\wedge\; \mathrm{cost}(F[i, j, k] \mid D_{i,1} \cup \dots \cup D_{i,j}) = C[i, j, k]$$

(If multiple choices of $F[i, j, k]$ exist, we select an arbitrary one.)

Once we compute the $F[i, q_i, k]$, we are done since it then suffices to return the best subset over all k:

$$E_i = F[i, q_i, k] \quad \text{for} \quad k = \operatorname{argmin}_k C[i, q_i, k]$$
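The split into blocks and subblocks described above can be sketched in Python as follows. The dictionary-based fact representation and the function name `blocks_and_subblocks` are assumptions made for illustration; the sample data reuses the six Flights tuples of Figure 1, restricted to three attributes, with the FD Flight → Airline.

```python
from collections import defaultdict

def blocks_and_subblocks(facts, lhs, rhs):
    """Map each X-value to its block, itself a map from Y-values to subblocks."""
    grouped = defaultdict(lambda: defaultdict(list))
    for f in facts:
        x_value = tuple(f[a] for a in lhs)
        y_value = tuple(f[a] for a in rhs)
        grouped[x_value][y_value].append(f)
    # Convert nested defaultdicts to plain dicts for a stable, printable result.
    return {x: dict(subblocks) for x, subblocks in grouped.items()}

# The six Flights tuples of Figure 1, restricted to three attributes.
flights = [
    {"Flight": "UA123", "Airline": "United Airlines", "Destination": "NY"},
    {"Flight": "UA123", "Airline": "United Airlines", "Destination": "UT"},
    {"Flight": "UA123", "Airline": "Delta", "Destination": "NY"},
    {"Flight": "DL456", "Airline": "Southwest", "Destination": "MA"},
    {"Flight": "DL456", "Airline": "Southwest", "Destination": "FL"},
    {"Flight": "DL456", "Airline": "Delta", "Destination": "IL"},
]

blocks = blocks_and_subblocks(flights, ("Flight",), ("Airline",))
for x_value, subblocks in blocks.items():
    print(x_value, {y: len(members) for y, members in subblocks.items()})
```

For this FD the database has two blocks (one per flight number), each with two subblocks; facts in different subblocks of the same block are exactly the conflicting pairs.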
Algorithm 2 L/C-Simplify()
1: remove trivial FDs from ∆
2: if ∆ is not empty then
3:   find A such that in each FD, A is either an lhs or a consensus attribute
4:   ∆ := ∆ − A

It remains to compute $C[i, j, k]$ and $F[i, j, k]$. We will focus on the former, as the latter is obtained by straightforward bookkeeping. The key observation is that if we decide to delete t facts from $D_{i,j}$, then we always prefer to delete the t facts with the minimal weight. We use this observation as follows.

For a subblock $D_{i,j}$ and $t \in \{0, \dots, |D_{i,j}|\}$, denote by $\mathrm{top}(t, D_{i,j})$ an arbitrary subset of $D_{i,j}$ with t facts of the highest weight. Hence, $\mathrm{top}(t, D_{i,j})$ is obtained by taking a prefix of size t when sorting the tuples of $D_{i,j}$ from the heaviest to the lightest. Then $C[i, j, k]$ is computed as follows.

$$C[i, j, k] = \begin{cases} 0 & j = 0 \text{ and } k = 0; \\ \infty & j = 0 \text{ and } k > 0; \\ \min_t \Big( C[i, j-1, k-t] + t(k-t)\, w_\varphi + \sum_{f \in D_{i,j} \setminus \mathrm{top}(t, D_{i,j})} w_f \Big) & \text{otherwise.} \end{cases}$$

The correctness of the above computation is due to the definition of the cost in Equation (1). In particular, in the third case, we go over all options for the number t of facts taken from the subblock $D_{i,j}$ and choose an option with the minimum cost. This cost consists of the following components:
$C[i, j-1, k-t]$ is the cost of the best choice of $k - t$ facts from the remaining $j - 1$ subblocks.
$t(k-t)\, w_\varphi$ is the cost of the violations in which the jth subblock participates: any combination of a fact from $D_{i,j}$ and a fact from the other subblocks is a violation of ϕ.
$\sum_{f \in D_{i,j} \setminus \mathrm{top}(t, D_{i,j})} w_f$ is the cost of removing every fact that is not in $\mathrm{top}(t, D_{i,j})$ from the jth subblock.

This completes the description of the algorithm. From this description, the correctness should be a straightforward conclusion.

In this section, we generalize the idea from the previous section.
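For reference, the single-FD recurrence of the previous section can be sketched for one block as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `solve_block` is hypothetical, each subblock is given only by the list of its fact weights (the facts themselves are abstracted away), and the table `C` is indexed by k alone because the subblock index j is handled by the outer loop.

```python
import math

def solve_block(subblocks, fd_weight):
    """Minimum cost for one block under a single soft FD X -> Y.

    subblocks: list of lists of fact weights, one inner list per subblock.
    Returns (min_cost, kept), where kept[j] is the number of heaviest facts
    retained from subblock j.
    """
    sorted_blocks = [sorted(s, reverse=True) for s in subblocks]

    # removal[j][t] = cost of deleting every fact outside top(t, subblock j).
    removal = []
    for s in sorted_blocks:
        costs = [sum(s)]
        for w in s:
            costs.append(costs[-1] - w)
        removal.append(costs)

    C = [0]       # j = 0: keeping k = 0 facts costs 0, any k > 0 is infeasible
    choice = []   # choice[j][k] = t chosen for subblock j when keeping k facts
    for j, s in enumerate(sorted_blocks):
        max_k = len(C) - 1 + len(s)
        new_C = [math.inf] * (max_k + 1)
        new_choice = [0] * (max_k + 1)
        for k in range(max_k + 1):
            for t in range(min(k, len(s)) + 1):
                if k - t > len(C) - 1:
                    continue  # earlier subblocks cannot supply k - t facts
                cand = C[k - t] + t * (k - t) * fd_weight + removal[j][t]
                if cand < new_C[k]:
                    new_C[k], new_choice[k] = cand, t
        C = new_C
        choice.append(new_choice)

    best_k = min(range(len(C)), key=lambda k: C[k])

    # Backtrack to recover how many facts are kept from each subblock.
    kept = [0] * len(sorted_blocks)
    k = best_k
    for j in range(len(sorted_blocks) - 1, -1, -1):
        kept[j] = choice[j][k]
        k -= kept[j]
    return C[best_k], kept

print(solve_block([[3, 1], [2]], 5))   # (2, [2, 0])
```

With a high FD weight the cheaper subblock is dropped entirely; lowering `fd_weight` to 0.5 in the same call makes it cheaper to keep all three facts and pay for the two cross-subblock violations.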
An attribute A is an lhs attribute of an FD X → Y if A ∈ X, and it is a consensus attribute of X → Y if X = ∅ and A ∈ Y (hence, X → Y states that all tuples should have the same A value). The simplification step of Algorithm 2 removes an attribute A if for every FD in ∆, it is either an lhs or a consensus attribute. We prove the following.

Theorem 7.
Let ∆ be a set of FDs. If ∆ can be emptied via L/C-Simplify() steps, then soft repairing for ∆ is solvable in polynomial time.

Note that whenever ∆ can be emptied via L/C-Simplify() steps, it can also be emptied via Simplify() steps. Indeed, if L/C-Simplify() eliminates the attribute A, then we can take: (a) X = { A } and Y = ∅ in Algorithm 1 if A is a consensus attribute of some FD, or (b) X = Y = { A } if A is an lhs attribute of every FD. This is expected due to Theorems 3 and 7, and the observation of Section 3 that soft repairing is hard whenever computing a cardinality repair is hard.

Example 8. Consider the database and the FD set of our running example (Example 2). This FD set, which we denote here by $\Delta_1$, can be emptied via L/C-Simplify() steps, by selecting attributes in the following order:

{ Flight → Airline, Flight Airline Date → Destination }
Flight: { ∅ → Airline, Airline Date → Destination }
Airline: { Date → Destination }
Date: { ∅ → Destination }
Destination: {}

Hence, Theorem 7 implies that soft repairing can be solved in polynomial time for $\Delta_1$.

Next, consider the FD set $\Delta_2$ consisting of the following FDs: Flight → Airline and Flight Date → Destination. This FD set is logically equivalent to $\Delta_1$; hence, they both entail the exact same cardinality repairs. However, these sets are no longer equivalent when considering soft repairing. In particular, two facts that agree on the values of the Flight and Date attributes, but disagree on the values of the Airline and Destination attributes, violate only one FD in $\Delta_1$ but two FDs in $\Delta_2$, which affects the cost of keeping these two tuples in the database. In fact, the FD set $\Delta_2$ cannot be emptied via L/C-Simplify() steps, as after removing the Flight attribute, no other attribute is either an lhs or a consensus attribute of the remaining FDs. The complexity of soft repairing for $\Delta_2$ remains an open problem.

Next, we prove Theorem 7 by presenting a polynomial-time algorithm for soft repairing in the case where ∆ can be emptied via L/C-Simplify() steps. Our algorithm generalizes the idea of the algorithm for a single FD, and we again use dynamic programming.

The main observation is as follows. Let A be an attribute chosen by L/C-Simplify(), and let $D_1, \dots, D_m$ be the maximal subsets of D that agree on the value of A, which we refer to as blocks (w.r.t. A). Two facts from different blocks violate all of the FDs wherein A is a consensus attribute and none of the FDs wherein A is an lhs attribute. Therefore, to compute the cost of a soft repair, each pair of facts from different blocks is charged with the violation of all FDs wherein A is a consensus attribute. Then, we can remove A from all FDs and continue the computation separately for each block.

Now, let ∆ be an FD set that can be emptied via L/C-Simplify() steps, and let $A_1, \dots, A_n$ be the attributes in the order of such an elimination process. For each $\ell \in \{1, \dots, n+1\}$, we denote by $\Delta_\ell$ the FD set in line 2 of the $\ell$th iteration of this execution (after removing the trivial FDs). Thus, $\Delta_1$ contains every non-trivial FD of ∆, and $\Delta_{n+1}$ is empty. We also denote by $w_\ell$ the total weight of the FDs in $\Delta_\ell$ of which $A_\ell$ is a consensus attribute (if there are no such FDs, then $w_\ell = 0$).

In the algorithm for a single FD, the recursion steps were with respect to the block $D_i$ (which determines the value of X), and so the value of i was a parameter. Here, we need to maintain the assignment τ to all previously handled attributes, and we use τ and $\ell$ as parameters. Given $1 \leq \ell \leq n+1$, if τ is an assignment to the attributes $A_1, \dots, A_{\ell-1}$, then $D^\tau$ denotes the database $\sigma_\tau D$ (i.e., the database that contains all the tuples that agree with τ on the values of the attributes $A_1, \dots, A_{\ell-1}$). We denote by $D^\tau_1, \dots, D^\tau_{q^\tau_\ell}$ the blocks of $D^\tau$ w.r.t. $A_\ell$. Moreover, we denote by $\tau \wedge (A_\ell = j)$ the assignment to the attributes $A_1, \dots, A_\ell$ that agrees with block $D^\tau_j$ on the value assigned to $A_\ell$ and agrees with τ on all other values. We denote by $F[\ell, \tau, j, k]$ an optimal subset of $D^\tau_1 \cup \dots \cup D^\tau_j$ of size k w.r.t. $\Delta_\ell$. We also denote by $C[\ell, \tau, j, k]$ the cost of $F[\ell, \tau, j, k]$. According to Equation (1), our goal is to compute $F[1, \emptyset, q^\emptyset_1, k]$ for $k = \operatorname{argmin}_k C[1, \emptyset, q^\emptyset_1, k]$.

We again focus on the computation of $C[\ell, \tau, j, k]$ that can be done as follows.

$$C[\ell, \tau, j, k] = \begin{cases} \sum_{f \in D^\tau \setminus \mathrm{top}(k, D^\tau)} w_f & \ell = n+1, \\ 0 & j = 0,\ k = 0, \\ \infty & j = 0,\ k > 0, \\ \min_t \Big( C[\ell, \tau, j-1, k-t] + t(k-t)\, w_\ell + C[\ell+1,\ \tau \wedge (A_\ell = j),\ q^{\tau \wedge (A_\ell = j)}_{\ell+1},\ t] \Big) & \text{otherwise.} \end{cases}$$

The first line (where $\ell = n+1$) refers to the case where ∆ is empty. Since there are no FDs that need to be taken into account, the optimal subset of $D^\tau$ of size k consists of the k facts of the highest weight. In the fourth case, we go over all options for the number t of facts taken from the block $D^\tau_j$ and choose an option with the minimum cost. This cost consists of the following components:
$C[\ell, \tau, j-1, k-t]$ is the cost of the best choice of $k - t$ facts from the remaining $j - 1$ blocks.
$t(k-t)\, w_\ell$ is the cost of the violations in which the jth block participates: any combination of a fact from $D^\tau_j$ and a fact from the other blocks $D^\tau_1 \cup \dots \cup D^\tau_{j-1}$ is a violation of the FDs in which $A_\ell$ is a consensus attribute.
$C[\ell+1,\ \tau \wedge (A_\ell = j),\ q^{\tau \wedge (A_\ell = j)}_{\ell+1},\ t]$ is the cost of the further repairing needed following the elimination of $A_\ell$ (i.e., repairing with respect to $\Delta_{\ell+1}$) applied to the current block (the t facts from $D^\tau_j$).

The given recursion can be computed in polynomial time via dynamic programming; thus, this proves Theorem 7.

Next, we consider the case of a “matching” constraint, where the FD set ∆ states two keys that cover all of the attributes. (We give the precise definition in Section 5.1.) We present a polynomial-time algorithm for soft repairing in this case.
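As a small illustration of what "two keys that cover all of the attributes" means, the following Python sketch tests the condition as it is phrased in the Introduction: two FDs $X_1 \to Y_1$ and $X_2 \to Y_2$ with $X_1 \cup Y_1 = X_2 \cup Y_2 = X_1 \cup X_2$ equal to the full attribute set. The function name `is_matching_constraint` and the encoding of an FD as a pair of attribute tuples are assumptions for the example; the precise definition in Section 5.1 takes precedence if it differs.

```python
def is_matching_constraint(attributes, fds):
    """Check X1 u Y1 = X2 u Y2 = X1 u X2 = all attributes, for exactly two FDs."""
    if len(fds) != 2:
        return False
    all_attrs = set(attributes)
    (x1, y1), (x2, y2) = [(set(lhs), set(rhs)) for lhs, rhs in fds]
    return (x1 | y1) == all_attrs and (x2 | y2) == all_attrs and (x1 | x2) == all_attrs

# The two examples from the Introduction.
print(is_matching_constraint(("A", "B"), [(("A",), ("B",)), (("B",), ("A",))]))
print(is_matching_constraint(
    ("fn", "ln", "addr"),
    [(("fn", "ln"), ("addr",)), (("fn", "addr"), ("ln",))],
))
# A non-example: { A -> B, B -> C } over (A, B, C).
print(is_matching_constraint(("A", "B", "C"), [(("A",), ("B",)), (("B",), ("C",))]))
```

The first two calls return `True` and the last returns `False`, since A → B alone does not mention C and is therefore not a key of the schema (A, B, C).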
For the sake of presentation, we first describe the algorithm for the special case where the schema is R(A, B) and $\Delta \overset{\mathrm{def}}{=} \{A \to B, B \to A\}$. Later in the section, we generalize it to the case of two keys. So, we begin by proving the following lemma.

Lemma 9.
Soft repairing is solvable in polynomial time for R ( A, B ) and ∆ = { A → B, B → A } . In the remainder of this section, we assume the input D over R ( A, B ). We begin with anobservation. For E ⊆ D it holds that: X f ∈ ( D \ E ) w f = X f ∈ D w f − X f ∈ E w f . Carmeli, M. Grohe, B. Kimelfeld, E. Livshits, and M. Tibi XX:11 A Bf a b f a b f a b f a b f a b f a b (a) Database f f f f f f (b) Conflict graph
Figure 2
A database over R ( A, B ) and its conflict graph w.r.t. { A → B, B → A } . Since the value P f ∈ D w f does not depend on the choice of E , minimizing the value (cid:16)P f ∈ ( D \ E ) w f (cid:17) + (cid:16)P ϕ ∈ ∆ w ϕ | vio ( E, ϕ ) | (cid:17) is the same as minimizing the value (cid:16)P f ∈ E − w f (cid:17) + (cid:16)P ϕ ∈ ∆ w ϕ | vio ( E, ϕ ) | (cid:17) . We use the following notation: w D ( E ) = X f ∈ E − w f + X ϕ ∈ ∆ w ϕ | vio ( E, ϕ ) | To solve the problem, we construct a reduction to the
Minimum Cost Maximum Flow (MCMF) problem. The input to MCMF is a flow network $N$, that is, a directed graph $(V, E)$ with a source node $s$ having no incoming edges and a sink node $t$ having no outgoing edges. Each edge $e \in E$ is associated with a capacity $c_e$ and a cost $c(e)$. A flow $f$ of $N$ is a function $f : E \to \mathbb{R}$ such that $0 \le f(e) \le c_e$ for every $e \in E$, and moreover, for every node $v \in V \setminus \{s, t\}$ it holds that $\sum_{e \in I_v} f(e) = \sum_{e \in O_v} f(e)$, where $I_v$ and $O_v$ are the sets of incoming and outgoing edges of $v$, respectively. A maximum flow is a flow $f$ that maximizes the value $\sum_{(s,v) \in E} f(s, v)$, and a minimum cost maximum flow is a maximum flow $f$ with a minimal cost, where the cost of a flow is defined by $\sum_{e \in E} f(e) \cdot c(e)$. We say that $f$ is integral if all values $f(e)$ are integers. It is known that, whenever the capacities are integral (i.e., natural numbers, as will be in our case), an integral minimum cost maximum flow exists and, moreover, can be found in polynomial time [1, Chapter 9].

From $D$ we construct $n$ instances $N_1, \dots, N_n$ of the MCMF problem, where $n$ is the number of facts in $D$, in the following way. First, we denote the FD $A \to B$ by $\varphi_1$ and the FD $B \to A$ by $\varphi_2$. We also denote by $D.A$ the set of values occurring in attribute $A$ in $D$ (that is, $D.A = \{a \mid \exists f \in D\ (f[A] = a)\}$). We do the same for attribute $B$ and denote by $D.B$ the set of values that occur in attribute $B$ in $D$. For each value $a \in D.A$ we denote by
$D.A(a)$ the number of appearances of the value $a$ in attribute $A$ (i.e., the number of facts $f \in D$ such that $f[A] = a$). Similarly, we denote by $D.B(b)$ the number of appearances of the value $b$ in attribute $B$ in $D$. Observe that

$$|vio(D, \varphi_1)| = \frac{1}{2} \cdot \sum_{a \in D.A} \left[ D.A(a) \cdot (D.A(a) - 1) \right],$$

since every fact $R(a, b)$ violates $\varphi_1$ with every fact $R(a, c)$ where $b \neq c$. Similarly, it holds that

$$|vio(D, \varphi_2)| = \frac{1}{2} \cdot \sum_{b \in D.B} \left[ D.B(b) \cdot (D.B(b) - 1) \right].$$

We now describe the network $N_k$. Our construction for the database of Figure 2a is illustrated in Figure 3. Note that Figure 2b depicts the conflict graph of the database of Figure 2a w.r.t. $\Delta = \{A \to B, B \to A\}$, which contains a vertex for each fact in the database and an edge between two vertices if the corresponding facts jointly violate an FD of ∆. The blue edges in the conflict graph are violations of the FD $A \to B$ and the red edges are violations of the FD $B \to A$.

Figure 3: The network $N_k$ constructed from the database of Figure 2a. The capacity of all edges is 1, except for the edge $(s, s')$ that has capacity $k$.

For each $k \in \{1, \dots, n\}$ we construct the network $N_k$ that consists of the set $\{s, s', t\} \cup V \cup A \cup B \cup U$ of nodes where:

$A = \{v_a \mid a \in D.A\}$
$B = \{u_b \mid b \in D.B\}$
$V = \{v^i_a \mid a \in D.A,\ 1 \le i \le D.A(a)\}$
$U = \{u^i_b \mid b \in D.B,\ 1 \le i \le D.B(b)\}$

$N_k$ contains the following edges:

$(s, s')$, with cost $c(s, s') = 0$
$(s', v^i_a)$ for every $v^i_a \in V$, with cost $c(s', v^i_a) = 0$
$(v^i_a, v_a)$ for every value $a \in D.A$ and every $1 \le i \le D.A(a)$, with cost $c(v^i_a, v_a) = (i - 1) \cdot w_{\varphi_1}$
$(v_a, u_b)$ for every $a \in D.A$ and $b \in D.B$ such that $f = R(a, b)$ occurs in $D$, with cost $c(v_a, u_b) = -w_f$
$(u_b, u^i_b)$ for every value $b \in D.B$ and every $1 \le i \le D.B(b)$, with cost $c(u_b, u^i_b) = (i - 1) \cdot w_{\varphi_2}$
$(u^i_b, t)$ for every $u^i_b \in U$, with cost $c(u^i_b, t) = 0$

The capacity of the edge $(s, s')$ is $k$ and the capacity of the other edges is 1. The intuition for the construction is as follows. A network with edges of the form $(v_a, u_b)$ that are connected to a source on one side and a target on the other corresponds to a matching, which in turn corresponds to a traditional repair. To allow violations of $A \to B$, we add the vertices $v^i_a$. The cost of a violation of this FD is defined by the cost of the edges $(v^i_a, v_a)$. In particular, if we keep $m$ facts of the form $R(a, \cdot)$ for some $a \in D.A$, we pay $\sum_{i=1}^{m} (i - 1)\, w_{\varphi_1} = \frac{1}{2} m (m - 1)\, w_{\varphi_1}$ for violations of $\varphi_1$. We include the vertices $u^i_b$ to similarly allow violations of $B \to A$. The discarding of facts is discouraged by offering gain for the edges $(v_a, u_b)$.
Finally, to prevent the case where the flow always fills the entire network (which corresponds to taking all facts and paying for all violations), we introduce the edge $(s, s')$, which limits the capacity of the network and enables us to find the minimum cost flow of a given size $k$. We will show that for every $k$, the cost of the solution to the MCMF problem on $N_k$ will be the cost of the “cheapest” subinstance of $D$ of size $k$. Hence, the solution to our problem is the cost of the minimal solution among all the instances $N_1, \dots, N_n$.

Given an integral flow $f$ in $N_k$, the repair $f[D]$ induced by $f$ is the set of facts $R(a, b)$ corresponding to edges of the form $(v_a, u_b)$ such that $f(v_a, u_b) = 1$. Moreover, given a subinstance $E$ of $D$ of size $k$, we denote by $f_E$ the integral flow in $N_k$ defined as follows.

$f_E(s, s') = k$
$f_E(s', v^i_a) = 1$ for $1 \le i \le E.A(a)$ and $f_E(s', v^i_a) = 0$ for $i > E.A(a)$, for every $a \in E.A$
$f_E(v^i_a, v_a) = 1$ for $1 \le i \le E.A(a)$ and $f_E(v^i_a, v_a) = 0$ for $i > E.A(a)$, for every $a \in E.A$
$f_E(v_a, u_b) = 1$ if $R(a, b) \in E$ and $f_E(v_a, u_b) = 0$ otherwise
$f_E(u_b, u^i_b) = 1$ for $1 \le i \le E.B(b)$ and $f_E(u_b, u^i_b) = 0$ for $i > E.B(b)$, for every $b \in E.B$
$f_E(u^i_b, t) = 1$ for $1 \le i \le E.B(b)$ and $f_E(u^i_b, t) = 0$ for $i > E.B(b)$, for every $b \in E.B$
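The construction of the network can be made concrete with a small sketch. The function name `soft_repair_cost`, the node labels, and the choice of successive shortest paths with Bellman-Ford are illustrative assumptions, not the method prescribed by the paper (which only requires some polynomial-time MCMF solver). Instead of building each network separately, the sketch gives the edge from `s` to `s'` capacity n and pushes one unit of flow at a time; after the k-th push the flow is a minimum cost flow of value k, which plays the role of the solution on the k-th network.

```python
from collections import Counter, defaultdict

def soft_repair_cost(facts, w_ab, w_ba):
    """facts maps (a, b) -> deletion weight w_f; w_ab, w_ba are the weights of
    A -> B and B -> A. Returns the minimum total penalty over all subsets."""
    edges = []               # edges[i] = [head, residual capacity, cost]; i ^ 1 is its reverse
    out = defaultdict(list)  # node -> indices of residual edges leaving it

    def add_edge(u, v, cap, cost):
        out[u].append(len(edges)); edges.append([v, cap, cost])
        out[v].append(len(edges)); edges.append([u, 0, -cost])

    n = len(facts)
    count_a = Counter(a for a, _ in facts)
    count_b = Counter(b for _, b in facts)
    add_edge("s", "s'", n, 0)  # unit-by-unit pushes below emulate capacity k = 1..n
    for a, cnt in count_a.items():
        for i in range(1, cnt + 1):
            add_edge("s'", ("v", a, i), 1, 0)
            add_edge(("v", a, i), ("v", a), 1, (i - 1) * w_ab)
    for (a, b), w in facts.items():
        add_edge(("v", a), ("u", b), 1, -w)
    for b, cnt in count_b.items():
        for i in range(1, cnt + 1):
            add_edge(("u", b), ("u", b, i), 1, (i - 1) * w_ba)
            add_edge(("u", b, i), "t", 1, 0)

    best = cost = 0  # the empty subset has w_D = 0
    for _ in range(n):
        dist = {node: float("inf") for node in out}
        dist["s"], pred = 0, {}
        changed = True
        while changed:  # Bellman-Ford, since fact edges have negative cost
            changed = False
            for u in out:
                if dist[u] == float("inf"):
                    continue
                for idx in out[u]:
                    v, cap, c = edges[idx]
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v], pred[v], changed = dist[u] + c, idx, True
        if dist["t"] == float("inf"):
            break
        v = "t"
        while v != "s":  # push one unit along the cheapest residual path
            idx = pred[v]
            edges[idx][1] -= 1
            edges[idx ^ 1][1] += 1
            v = edges[idx ^ 1][0]
        cost += dist["t"]
        best = min(best, cost)  # best value of w_D over the sizes seen so far
    return sum(facts.values()) + best

# A hypothetical six-fact database whose conflict graph is a 6-cycle.
cycle = {("a1", "b1"): 1, ("a1", "b2"): 1, ("a2", "b2"): 1,
         ("a2", "b3"): 1, ("a3", "b3"): 1, ("a3", "b1"): 1}
print(soft_repair_cost(cycle, 2, 2))  # 3: keep three pairwise non-conflicting facts
```

The returned value adds the constant sum of all fact weights back to the minimum of $w_D$, so it is the total penalty (deletions plus violations) rather than $w_D$ itself.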
The reader can easily verify that $f_E$ is indeed an integral flow in $N_k$. Clearly, the value of the flow is $k$.

We have the following lemmas. The first is proved in the Appendix and the second follows straightforwardly from the construction of $N_k$ and the definition of $f_E$.

▶ Lemma 10. Every integral solution $f$ to MCMF on $N_k$ satisfies $cost(f) = w_D(f[D])$.

▶ Lemma 11. Every subinstance $E$ of $D$ satisfies $cost(f_E) = w_D(E)$.

Now, let $E$ be an optimal subset of $D$ w.r.t. ∆ and assume that $|E| = k$. Let $f^*$ be a solution with the minimum cost among all the solutions to MCMF on $N_1, \dots, N_n$. Lemma 11 implies that there is an integral flow $f_E$ in $N_k$ such that $cost(f_E) = w_D(E)$. Hence, we have that $cost(f^*) \le w_D(E)$. By applying Lemma 10 to $f^*$, there is another subinstance $E'$ of $D$ such that $w_D(E') = cost(f^*)$. Since $E$ is an optimal subset, we have that $w_D(E) \le w_D(E')$. Overall, we have that $cost(f^*) \le w_D(E) \le w_D(E') = cost(f^*)$, and we conclude that $cost(f^*) = w_D(E)$. Therefore, by taking the solution with the lowest cost among all solutions to MCMF on $N_1, \dots, N_n$, we indeed find a solution to our problem, and that concludes our proof of Lemma 9.

▶ Example 12.
Consider again the database of Figure 2a. Assume that:

$w_{\varphi_1} = w_{\varphi_2} = 2$, $w_{f_1} = w_{f_2} = w_{f_3} = w_{f_4} = w_{f_5} = w_{f_6} = 1$

Since the cost of a violation is “too high” in this case (i.e., it is always cheaper to delete a fact involved in a violation than to keep the violation), an optimal subset in this case is, in fact, an optimal repair in the traditional sense (that is, when the constraints are assumed to be hard constraints). One possible optimal repair in this case is { f , f , f }. The flow corresponding to this repair in the network $N_3$ is illustrated in Figure 4a.

Now, assume that:

$w_{\varphi_1} = w_{\varphi_2} = 1$, $w_{f_1} = w_{f_2} = w_{f_3} = w_{f_4} = w_{f_5} = w_{f_6} = 3$

In this case, the cost of deleting a fact is “too high”, since each fact is involved in at most two violations, and the cost of keeping the violation is lower than the cost of removing facts involved in the violation. Therefore, the database itself is an optimal subset, and the corresponding flow in the network $N_6$ is illustrated in Figure 4b.

Figure 4: The flow in the network $N_k$ corresponding to an optimal subset of the database of Figure 2a for different weights. (a) $k = 3$. (b) $k = 6$. (c) $k = 4$. (d) $k = 3$.

As another example, assume that:

$w_{\varphi_1} = w_{\varphi_2} = 1$, w f = w f = w f = 2, w f = w f = 1, w f = 3

Here an optimal subset consists of the facts in { f , f , f , f }, and the corresponding flow in the network $N_4$ is illustrated in Figure 4c. If we modify the weight of ϕ and define w ϕ = 4, while keeping the rest of the weights intact, it is now cheaper to delete the fact f rather than keep the violations it is involved in with f and f; hence, an optimal subset in this case is { f , f , f }, and the corresponding flow in the network $N_3$ is illustrated in Figure 4d. ◀

Note that the FD set $\{A \to B\}$ over $R(A, B)$ is in fact a special case of the result of Theorem 14, as we can compute an optimal subset for this FD set using the algorithm described above by defining $w_{B \to A} = 0$. However, this algorithm works only for the case where the single FD is a key and fails to compute the correct solution when the schema contains attributes that do not appear in the FD. The algorithm described in the proof of Theorem 6, on the other hand, can handle this case and does not assume anything about the underlying schema.

By a “matching constraint” we refer to the case of $\hat\Delta = \{X_1 \to Y_1, X_2 \to Y_2\}$ over a schema $\hat R(A_1, \dots, A_k)$ where $X_1 \cup Y_1 = X_2 \cup Y_2 = X_1 \cup X_2 = \{A_1, \dots, A_k\}$. An example follows.

▶ Example 13.
Consider the database of our running example (Figure 1), and the following FDs:

Flight Airline Date → Origin Destination Airplane,
Origin Destination Airplane Date → Flight Airline.

The reader can easily verify that these two FDs form a matching constraint. On the other hand, consider the set consisting of the following two FDs:

Flight Date → Airline Origin Destination Airplane,
Origin Destination Airplane Date → Flight Airline.

Here, we do not have a matching constraint: while it holds that $X_1 \cup Y_1 = X_2 \cup Y_2 = \{$Flight, Airline, Date, Origin, Destination, Airplane$\}$, the set $X_1 \cup X_2$ misses the Airline attribute. ◀
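The condition checked in Example 13 is mechanical, and a short sketch may help in verifying other FD pairs. The function name `is_matching_constraint` and the representation of an FD as a pair of attribute lists are assumptions made for illustration only.

```python
def is_matching_constraint(schema, fd1, fd2):
    """Each FD is a pair (lhs, rhs) of attribute lists. The pair is a matching
    constraint when X1 ∪ Y1, X2 ∪ Y2 and X1 ∪ X2 all equal the schema."""
    attrs = set(schema)
    x1, y1 = set(fd1[0]), set(fd1[1])
    x2, y2 = set(fd2[0]), set(fd2[1])
    return (x1 | y1) == attrs and (x2 | y2) == attrs and (x1 | x2) == attrs

schema = ["Flight", "Airline", "Date", "Origin", "Destination", "Airplane"]
second_fd = (["Origin", "Destination", "Airplane", "Date"], ["Flight", "Airline"])
with_airline = (["Flight", "Airline", "Date"], ["Origin", "Destination", "Airplane"])
without_airline = (["Flight", "Date"], ["Airline", "Origin", "Destination", "Airplane"])

print(is_matching_constraint(schema, with_airline, second_fd))     # True
print(is_matching_constraint(schema, without_airline, second_fd))  # False
```

The second call fails only on the third equality, since neither left-hand side contains Airline.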
The generalization of Lemma 9 from $\Delta = \{A \to B, B \to A\}$ over $R(A, B)$ to the general case of a matching constraint is fairly straightforward. Given an input $\hat D$ for soft repairing over $\hat R$ and $\hat\Delta$, we construct an input $D$ over $R$ and ∆ by defining unique values $a(\pi_{X_1}(\hat f))$ and $b(\pi_{X_2}(\hat f))$ for the projections $\pi_{X_1}(\hat f)$ and $\pi_{X_2}(\hat f)$ over $X_1$ and $X_2$, respectively, of every fact $\hat f$ of $\hat D$. Then, the database $D$ is simply the set of all the pairs $a(\pi_{X_1} \hat f)$ and $b(\pi_{X_2} \hat f)$ for all facts $\hat f$ of $\hat D$:

$$D \stackrel{\text{def}}{=} \{ (a(\pi_{X_1} \hat f), b(\pi_{X_2} \hat f)) \mid \hat f \in \hat D \}$$

In addition, we define $w_f \stackrel{\text{def}}{=} w_{\hat f}$ whenever $f = (a(\pi_{X_1} \hat f), b(\pi_{X_2} \hat f))$, and $w_{A \to B} \stackrel{\text{def}}{=} w_{X_1 \to Y_1}$ and $w_{B \to A} \stackrel{\text{def}}{=} w_{X_2 \to Y_2}$. Note that the mapping $f \mapsto \hat f$ is reversible since $X_1 \cup X_2 = \{A_1, \dots, A_k\}$. So, in order to solve soft repairing for $\hat D$, we solve it for $D$ and transform every fact $f$ of $D$ into the corresponding fact $\hat f$ of $\hat D$. We get the following result. The proof (given in the Appendix) is by showing the correctness of the reduction.

▶ Theorem 14.
Soft repairing is solvable in polynomial time whenever ∆ is a pair of FDs that constitutes a matching constraint.

We studied the complexity of soft repairing for functional dependencies, where the goal is to find an optimal subset under penalties of deletion and constraint violation. The problem is harder than that of computing a cardinality repair, and we have developed two new, nontrivial algorithms solving natural special cases. A full classification of the FD sets remains an open challenge for future research; specifically, the question is what fragment of the positive side of the dichotomy of Livshits et al. [14] remains positive when softness is allowed. We have also shown that the problem becomes tractable if we settle for a 3-approximation.
Open Problems
Several directions are left open for future work. A direct open problem is to characterize the class of tractable FDs via a full dichotomy. The simplest sets of FDs where the complexity of soft repairing is open are the following:

$\{A \to B, A \to C\}$. Note that this problem is different from $\{A \to BC\}$ that consists of a single FD.
$\{A \to B, B \to A\}$ in the case where the schema has attributes different from $A$ and $B$, starting with $R(A, B, C)$.
$\{\emptyset \to A, B \to C\}$.

The problem is also open for classes of constraints that are more general than FDs, including equality-generating dependencies (EGDs), denial constraints, and inclusion dependencies. Yet, the problem for these types of dependencies is open already in the case of cardinality repairs, with the exception of some cases of EGDs [13]. Another clear direction is that of update repairs, where we are allowed to change cell values instead of (or in addition to) deleting tuples and where complexity results are known for hard constraints [10, 14].
References

[1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network flows - theory, algorithms and applications. Prentice Hall, 1993.
[2] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, 2013.
[3] Carlo Combi, Matteo Mantovani, Alberto Sabaini, Pietro Sala, Francesco Amaddeo, Ugo Moretti, and Giuseppe Pozzi. Mining approximate temporal functional dependencies with pure temporal grouping in clinical databases. Comp. in Bio. and Med., 62:306–324, 2015. doi:10.1016/j.compbiomed.2014.08.004.
[4] Andrew V. Goldberg and Robert E. Tarjan. Finding minimum-cost circulations by successive approximation. Math. Oper. Res., 15(3):430–466, August 1990. doi:10.1287/moor.15.3.430.
[5] Teofilo F. Gonzalez, editor. Handbook of Approximation Algorithms and Metaheuristics. Chapman and Hall/CRC, 2007. doi:10.1201/9781420010749.
[6] Eric Gribkoff, Guy Van den Broeck, and Dan Suciu. The most probable database problem. In BUDA, 2014.
[7] Dorit S Hochbaum. Approximation algorithms for the set covering and vertex cover problems. SIAM Journal on computing, 11(3):555–556, 1982.
[8] Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J., 42(2):100–111, 1999. doi:10.1093/comjnl/42.2.100.
[9] Abhay Kumar Jha, Vibhor Rastogi, and Dan Suciu. Query evaluation with soft-key constraints. In PODS, pages 119–128, 2008.
[10] Solmaz Kolahi and Laks V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, volume 361 of ACM International Conference Proceeding Series, pages 53–62. ACM, 2009.
[11] Weibang Li, Zhanhuai Li, Qun Chen, Tao Jiang, and Zhilei Yin. Discovering approximate functional dependencies from distributed big data. In APWeb, pages 289–301, 2016. doi:10.1007/978-3-319-45817-5_23.
[12] Ester Livshits, Alireza Heidari, Ihab F. Ilyas, and Benny Kimelfeld. Approximate denial constraints. Proc. VLDB Endow., 13(10):1682–1695, 2020.
[13] Ester Livshits, Ihab F. Ilyas, Benny Kimelfeld, and Sudeepa Roy. Principles of progress indicators for database repairing. CoRR, abs/1904.06492, 2019.
[14] Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst., 45(1):4:1–4:46, 2020. doi:10.1145/3360904.
[15] Andrei Lopatenko and Leopoldo E. Bertossi. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In ICDT, volume 4353 of Lecture Notes in Computer Science, pages 179–193. Springer, 2007.
[16] Eduardo H. M. Pena, Eduardo Cunha de Almeida, and Felix Naumann. Discovery of approximate (and exact) denial constraints. Proc. VLDB Endow., 13(3):266–278, 2019. doi:10.14778/3368289.3368293.
[17] Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107–136, February 2006. URL: http://dx.doi.org/10.1007/s10994-006-5833-1, doi:10.1007/s10994-006-5833-1.
[18] Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A formal framework for probabilistic unclean databases. In ICDT, volume 127 of LIPIcs, pages 6:1–6:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
[19] Prithviraj Sen, Amol Deshpande, and Lise Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J., 18(5):1065–1090, 2009.
A Details for Section 5
In this section, we provide the missing proofs of Section 5. For convenience, we give the results again here.

▶ Lemma 10. Every integral solution $f$ to MCMF on $N_k$ satisfies $cost(f) = w_D(f[D])$.

Proof. First, note that it cannot be the case that $f(s', v^j_a) = 0$ while $f(s', v^i_a) = 1$ for some $j < i$ and $i \in \{1, \dots, D.A(a)\}$. Otherwise, we can construct a different integral flow $f'$ with $f'(s', v^j_a) = f'(v^j_a, v_a) = 1$, $f'(s', v^i_a) = f'(v^i_a, v_a) = 0$, and $f'(e) = f(e)$ for every other edge $e$. It holds that $cost(f') = cost(f) - c(v^i_a, v_a) + c(v^j_a, v_a)$, and since $c(v^i_a, v_a) > c(v^j_a, v_a)$ we will have that $cost(f') < cost(f)$, in contradiction to the fact that $f$ is a solution to MCMF on $N_k$. Therefore, for every $a \in D.A$, if the flow entering the node $v_a$ is $\ell$, then $f(s', v^i_a) = f(v^i_a, v_a) = 1$ if $i \le \ell$ and $f(s', v^i_a) = f(v^i_a, v_a) = 0$ otherwise. Thus, the total cost of the edges of the form $(v^i_a, v_a)$ is

$$\sum_{i=1}^{\ell} \left[ (i - 1)\, w_{\varphi_1} \right] = \frac{1}{2}\, \ell (\ell - 1)\, w_{\varphi_1}.$$

By the definition of $f[D]$, there are $f[D].A(a)$ edges of the form $(v_a, u_b)$ for which $f(v_a, u_b) = 1$. By the definition of a flow, this is also the flow entering the node $v_a$, and we have that $\ell = f[D].A(a)$. We conclude that the total cost of the flow on edges of the form $(v^i_a, v_a)$ is

$$\sum_{a \in f[D].A} \left[ \frac{1}{2} \cdot f[D].A(a) \cdot (f[D].A(a) - 1) \cdot w_{\varphi_1} \right] = |vio(f[D], \varphi_1)| \cdot w_{\varphi_1}.$$

The same argument shows that the total cost of the flow on edges of the form $(u_b, u^i_b)$ is $|vio(f[D], \varphi_2)| \cdot w_{\varphi_2}$. Finally, the total cost of the edges of the form $(v_a, u_b)$ is $\sum_{g \in f[D]} (-w_g)$ by the definition of $f[D]$ and the construction of the network. We conclude that:

$$cost(f) = \sum_{g \in f[D]} (-w_g) + |vio(f[D], \varphi_1)| \cdot w_{\varphi_1} + |vio(f[D], \varphi_2)| \cdot w_{\varphi_2}$$

and $cost(f) = w_D(f[D])$ by definition. ◀

▶ Theorem 14.
Soft repairing is solvable in polynomial time whenever ∆ is a pair of FDs that constitutes a matching constraint.

Proof. We prove that $D$ has a subset $E$ with $cost(E \mid D) = k$ if and only if $\hat D$ has a subset $\hat E$ with $cost(\hat E \mid \hat D) = k$. Let $E$ be a subset of $D$ with cost $k$. Let $\hat E$ be a subset of $\hat D$ that includes the fact $\hat f$ for every $f \in E$. By definition, we have that $\sum_{f \in (D \setminus E)} w_f = \sum_{\hat f \in (\hat D \setminus \hat E)} w_{\hat f}$; hence, it is left to show that $\sum_{\varphi \in \Delta} w_\varphi |vio(E, \varphi)| = \sum_{\hat\varphi \in \hat\Delta} w_{\hat\varphi} |vio(\hat E, \hat\varphi)|$. Let $f, g \in E$ such that $\{f, g\} \not\models (A \to B)$. Hence, it holds that $f[A] = g[A]$ while $f[B] \neq g[B]$. From the construction of $D$, we have that $\pi_{X_1} \hat f = \pi_{X_1} \hat g$, while $\pi_{X_2} \hat f \neq \pi_{X_2} \hat g$. Thus, there is an attribute $A_i \in X_2$ such that $\hat f[A_i] \neq \hat g[A_i]$, and since $A_i \notin X_1$ and $X_1 \cup Y_1 = \{A_1, \dots, A_k\}$, it holds that $A_i \in Y_1$. We conclude that $\{\hat f, \hat g\} \not\models (X_1 \to Y_1)$. We can similarly prove that if $\{f, g\} \not\models (B \to A)$, then $\{\hat f, \hat g\} \not\models (X_2 \to Y_2)$. Finally, because $w_{A \to B} = w_{X_1 \to Y_1}$ and $w_{B \to A} = w_{X_2 \to Y_2}$, it holds that $\sum_{\varphi \in \Delta} w_\varphi |vio(E, \varphi)| = \sum_{\hat\varphi \in \hat\Delta} w_{\hat\varphi} |vio(\hat E, \hat\varphi)|$.

For the other direction, let $\hat E$ be a subset of $\hat D$, and let $E$ be the subset of $D$ that includes the fact $f$ for every $\hat f \in \hat E$. It is again straightforward that $\sum_{f \in (D \setminus E)} w_f = \sum_{\hat f \in (\hat D \setminus \hat E)} w_{\hat f}$. Now, let $\hat f, \hat g \in \hat E$ such that $\{\hat f, \hat g\} \not\models (X_1 \to Y_1)$. We have that $\hat f[A_i] = \hat g[A_i]$ for every $A_i \in X_1$; thus, $\pi_{X_1} \hat f = \pi_{X_1} \hat g$, and from the construction of $D$, it holds that $f[A] = g[A]$. On the other hand, the fact that $\hat f[A_i] \neq \hat g[A_i]$ for some $A_i \in Y_1$, together with the fact that $X_1 \cup Y_1 = X_1 \cup X_2 = \{A_1, \dots, A_k\}$, implies that $\pi_{X_2} \hat f \neq \pi_{X_2} \hat g$ and $f[B] \neq g[B]$. Hence, $\{f, g\} \not\models (A \to B)$. We can similarly prove that if $\{\hat f, \hat g\} \not\models (X_2 \to Y_2)$, then $\{f, g\} \not\models (B \to A)$, which again implies that $\sum_{\varphi \in \Delta} w_\varphi |vio(E, \varphi)| = \sum_{\hat\varphi \in \hat\Delta} w_{\hat\varphi} |vio(\hat E, \hat\varphi)|$, and that concludes our proof.
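The reduction whose correctness is argued above can be sketched in a few lines. The function name `reduce_to_binary`, the dictionary-based fact representation, and the toy schema below are assumptions for illustration; the tuples of projected values simply stand in for the unique values $a(\cdot)$ and $b(\cdot)$.

```python
def reduce_to_binary(facts, x1, x2):
    """facts: list of (row, weight), where row maps attribute -> value.
    Returns (binary, back): binary maps (a, b) -> weight over R(A, B), and
    back maps (a, b) -> the original row, so a repair can be translated back."""
    binary, back = {}, {}
    for row, weight in facts:
        a = tuple(row[attr] for attr in x1)  # stands in for a(pi_{X1} f^)
        b = tuple(row[attr] for attr in x2)  # stands in for b(pi_{X2} f^)
        binary[(a, b)] = weight
        back[(a, b)] = row
    return binary, back

# Hypothetical schema R^(A1, A2, A3) with X1 = {A1, A2}, X2 = {A2, A3}.
x1, x2 = ["A1", "A2"], ["A2", "A3"]
facts = [({"A1": 1, "A2": "x", "A3": "p"}, 2),
         ({"A1": 1, "A2": "x", "A3": "q"}, 1),   # agrees with the first on X1 only
         ({"A1": 2, "A2": "x", "A3": "q"}, 3)]   # agrees with the second on X2 only
binary, back = reduce_to_binary(facts, x1, x2)
print(len(binary) == len(facts))  # True: no two facts collapse, as X1 ∪ X2 covers the schema
```

Two original facts agree on $X_1$ exactly when their images agree on $A$, and likewise for $X_2$ and $B$, which is the property the proof relies on.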