Size Bounds for Conjunctive Queries with General Functional Dependencies
arXiv preprint [cs.DB]
Gregory Valiant and Paul Valiant
University of California, Berkeley
May 29, 2018
Abstract
This paper extends the work of Gottlob, Lee, and Valiant (PODS 2009) [9], and considers worst-case bounds for the size of the result Q(D) of a conjunctive query Q to a database D given an arbitrary set of functional dependencies. The bounds in [9] are based on a "coloring" of the query variables. In order to extend the previous bounds to the setting of arbitrary functional dependencies, we leverage tools from information theory to formalize the original intuition that each color used represents some possible entropy of that variable, and bound the maximum possible size increase via a linear program that seeks to maximize how much more entropy is in the result of the query than in the input. This new view allows us to precisely characterize the entropy structure of worst-case instances for conjunctive queries with simple functional dependencies (keys), providing new insights into the results of [9]. We extend these results to the case of general functional dependencies, providing upper and lower bounds on the worst-case size increase. We identify the fundamental connection between the gap in these bounds and a central open question in information theory.

Finally, we show that, while both the upper and lower bounds are given by exponentially large linear programs, one can distinguish in polynomial time whether the result of a query with an arbitrary set of functional dependencies can be any larger than the input database.

1 Introduction

In this paper, we are concerned with deriving worst-case size bounds for the result of a conjunctive query in terms of the structural properties of the query, and those of the input relations.
This paper addresses the main open question left by Gottlob, Lee, and Valiant (PODS 2009) [9], extending size bounds to the case where the query is applied to a database that has an arbitrary set of general functional dependencies (as opposed to just 'simple' functional dependencies, those whose left-hand sides consist of a single variable, as was done in [9]).

Conjunctive queries are the most fundamental and most widely used database queries, forming the core of relational algebra [5, 15, 1]. Conjunctive queries also correspond to nonrecursive datalog rules of the form

R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m),

where R_i is a relation name of the underlying database D, R is the output relation, each argument u_i is a list of |u_i| variables, where |u_i| is the arity of the corresponding relation, and the same variable can occur multiple times in one or more argument lists. We allow a single relation R_i to appear several times in the query, thus m ≥ n. Throughout this paper we adopt this datalog rule representation for conjunctive queries.

In general, the result of a conjunctive query can be exponentially large in the input size. Even in the case of bounded arities, the result can be substantially larger than the input relations. In the worst case, the output size is r^k, where r is the size of the largest input relation and k is the arity of the output relation. Queries with very large outputs are sometimes unavoidable, but in most cases they are either ill-posed or otherwise undesirable, as they can be disruptive to a multi-user DBMS. It is thus useful to recognize such queries whenever possible. Obtaining good worst-case bounds for conjunctive queries is, moreover, relevant to view management [15] and data integration [14, 15], as well as to data exchange [8, 13], where data is transferred from a source database to a target database according to schema mappings that are specified via conjunctive queries.
In this latter context, good bounds on the result size of a conjunctive query may be used for estimating the amount of data that needs to be materialized at the target site.

In the area of query optimization, models for predicting the size of the output of a conjunctive query based on selectivity indices for relational operators have been developed [22, 12, 6]. The selectivity indices are obtained via sampling techniques (see, e.g., [19, 11]) from existing database instances. Worst-case bounds may be obtained by setting each selectivity index to 1, thus assuming the maximum selectivity for each operator. Unfortunately, the resulting bounds are then often trivial (akin to the above r^k bound).

A new and very interesting characterization of the worst-case output size of join queries was very recently developed by Atserias, Grohe, and Marx [3]. Their result is based on the notion of fractional edge cover [10], and the associated concept of fractional edge-cover number ρ*(Q) of a join query Q. In particular, in [10] it was shown that

|Q(D)| ≤ rmax(Q, D)^{ρ*(Q)},     (1)

where rmax(Q, D) represents the size of the largest relation among R_1, …, R_n in D. In [3] it was shown that this bound is essentially tight.

In [9], these results were extended beyond join queries, to general conjunctive queries (containing projections), and also to the setting in which the input relations satisfy simple functional dependencies. This work introduced a new coloring scheme for query variables, and, accordingly, the association of a color number C(Q) with each query Q.
Roughly, a valid coloring assigns a set L(X) of colors to each query variable X and requires that for each functional dependency XY → Z, the colors of Z are contained in the union of the colors of X and Y. The color number C(Q) of Q is the maximum over all valid colorings of Q of the quotient of the number of colors appearing in the output (i.e., head) variables of Q by the maximum number of colors appearing in the variables of any input (i.e., body) atom of Q. It was shown that for a query Q and database D with a set of simple functional dependencies, |Q(D)| ≤ rmax(Q, D)^{C(Q)}.

In this paper, we attempt to extend these results to the case where we have a general set of functional dependencies (including compound functional dependencies of the form
X, Y, Z → W). In this setting, while the lower bound given by the color number holds, we illustrate that the color number no longer provides an upper bound on the worst-case size increase. In fact, we provide a family of instances demonstrating that there is a super-constant gap between the true size increase and the bound given by the color number.

In order to provide size bounds in this general setting we require machinery beyond the color number. We use tools from information theory developed to analyze the precise interactions of multivariate distributions. In some sense, this approach formalizes the original intuition of the coloring scheme: that each color used represents some possible entropy of that variable. We construct a linear program with entropies as the variables and the exponent of the worst-case size increase as the solution. Functional dependencies can be encoded as constraints in the linear program. The difficulty is determining which additional constraints must be added to the linear program to ensure that the solution is realizable as a database instance.

This question, as it turns out, is crucially related to an old and ongoing investigation at the heart of information theory: "which entropy structures can be instantiated in multivariate distributions?" [20, 24, 25, 18, 17, 7]. We cannot show that our upper bound is tight in this general setting, and believe that an explicit (even exponential-sized) characterization of the worst-case size increase is unlikely without significant advances in information theory.

Nevertheless, the formalism and tools from information theory shed significant light on the setting in which all functional dependencies are simple, the case considered in [9]. We revisit the color number, and the tight bounds on the size increase for queries with simple functional dependencies, providing an alternative formulation of the color number as the solution to a linear program whose variables are entropies.
This formulation allows us to show that the settings for which we have tight bounds on the size increase have worst-case instances with particularly simple entropy structures; specifically, all associated mutual information measures are nonnegative.

Finally, while both our upper and lower bounds are given by linear programs that have exponentially many variables, we show that we can decide in polynomial time whether a query and set of functional dependencies is sparsity-preserving. In particular, we can efficiently decide whether the result of a query can be any larger than the input database.

This paper is organized as follows. In Section 2 we state some useful definitions of database terms, define the coloring scheme and the color number of a query, and provide definitions of the basic information theory quantities and the Shannon information inequalities. In Section 3 we identify the connection between entropy and worst-case instances, and prove our linear programming size bound. In Section 4 we provide an alternative definition of the color number in terms of entropies, and identify the simple entropy structure of worst-case instances in the settings in which we have tight size bounds (the setting with simple functional dependencies). We leverage this understanding of the entropy structure of these instances to construct a family of instances that demonstrates a super-constant gap between our upper and lower bounds. Finally, in Section 5, we show that we can efficiently decide whether a query and set of functional dependencies can admit any size increase.

2 Preliminaries

We begin by giving basic definitions pertaining to database theory. We then define the color number, and state the size bounds of [9]. Finally, we define some information theoretic quantities, and define the Shannon information inequalities.
As already stated in the Introduction, a conjunctive query has the form R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), where each u_i is a list of (not necessarily distinct) variables of length |u_i| = arity(R_i). Each variable occurring in the query head R(u) must also occur in the body of the query. The set of all variables occurring in Q is denoted by var(Q). It is important to recall that a single relation R_i might appear several times in the query, and thus m could be larger than n. A finite structure or database D = (U_D, R_1, …, R_k) consists of a finite universe U_D and relations R_1, …, R_k over U_D. The answer Q(D) of query Q over database D consists of the structure (U_D, R) whose unique relation R contains precisely all tuples θ(u) such that θ : var(Q) → U_D is a substitution such that for each atom R_i(u_j) appearing in the query body, θ(u_j) ∈ R_i. For ease of notation, we define rmax(Q, D) to be the number of tuples in the largest relation among R_1, …, R_n in D.

A (simple) attribute of a relation R identifies a column of R. An attribute list consists of a list (without repetition) of attributes of a relation R. A compound attribute is an attribute list with at least two attributes. A list consisting of a unique attribute A is identified with A. The list of all attributes of R is denoted by attr(R). If V is a list of attributes of R and t ∈ R a tuple of R, then the V-value of t, denoted by t[V], consists of the tuple obtained as the ordered list of all values in V-positions of t.

If V and W are (possibly compound) attributes of R, then a functional dependency (FD) V → W on relation R expresses that for each t, t′ ∈ R, t[V] = t′[V] implies that t[W] = t′[W]. Thus each functional dependency V → W is equivalent to a set containing a FD V → A for each element A of W. If A and B are single attributes, then the FD A → B is called a simple FD.
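As an aside, the definition of an FD translates directly into a check on a concrete relation; the following is a minimal sketch (the representation, function name, and sample relation are our own, for illustration):

```python
# Check a functional dependency V -> W on a relation given as a list of
# tuples, with V, W given as lists of column indices (attribute positions).
def satisfies_fd(rel, V, W):
    seen = {}
    for t in rel:
        v = tuple(t[i] for i in V)
        w = tuple(t[i] for i in W)
        # the first W-value seen for this V-value must be the only one
        if seen.setdefault(v, w) != w:
            return False
    return True

R = [(1, "a", "x"), (1, "a", "y"), (2, "b", "x")]
assert satisfies_fd(R, [0], [1])        # column 0 determines column 1
assert not satisfies_fd(R, [0], [2])    # but not column 2
```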
A (possibly compound) attribute K of R is a key iff K → attr(R) holds. Such a key is called a simple key if K is a simple attribute; otherwise it is called a compound key. An argument position in an atom that corresponds to a simple key attribute is referred to as a keyed position.

Definition 2.1.
Given a conjunctive query Q = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), we define chase(Q) to be the result of iteratively performing the following replacements:

• Given two atoms R_i(u_j) and R_i(u_k) of the same relation, with the p-th position a key for relation R_i, if the variable at the p-th position of u_j is the same as the variable at the p-th position of u_k, then for each h ∈ 1, …, |u_j|, let X be the variable that occurs at position h in u_j. We replace every instance of X that occurs anywhere in the query by the variable occurring at position h of u_k, and proceed with the updated u_i's. Finally, we remove the atom R_i(u_j) from the conjunctive query.

Note: We do not require compound keys to be minimal.
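For intuition, the replacement step can be sketched as follows for simple keys (the representation of atoms and keys is our own, hypothetical choice; the real chase also handles compound keys and head variables):

```python
# body: list of (relation_name, tuple_of_variables) atoms.
# keys[rel]: the keyed position of rel (a single simple-key position).
def chase(body, keys):
    changed = True
    while changed:
        changed = False
        for a in range(len(body)):
            for b in range(a + 1, len(body)):
                (rel1, u1), (rel2, u2) = body[a], body[b]
                p = keys.get(rel1)
                if rel1 != rel2 or p is None or u1[p] != u2[p]:
                    continue
                # same relation, same key variable: unify the two atoms,
                # replacing each variable of u1 by the one in u2
                sub = dict(zip(u1, u2))
                body = [(r, tuple(sub.get(v, v) for v in u))
                        for i, (r, u) in enumerate(body) if i != a]
                changed = True
                break
            if changed:
                break
    return body

body = [("R", ("X", "Y")), ("R", ("X", "Z")), ("S", ("Y", "W"))]
result = chase(body, {"R": 0})   # position 0 of R is a key
# the two R-atoms merge, renaming Y to Z everywhere
```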
Fact 2.2. [16, 2, 4] For any instance, the result of applying the query chase(Q) is identical to the result of applying Q.

We restate the definitions from [9] of a valid coloring and the color number C(Q) of a query, and state the size bounds of [9].

Definition 2.3.
Given a conjunctive query Q = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), and the set of functional dependencies for each input relation, a valid coloring of Q with c colors is a mapping assigning to each variable X ∈ var(Q) a set of colors L(X) ⊆ {1, …, c}, consisting of zero or more colors, such that the following condition is satisfied:

• For each functional dependency X_1, …, X_k → Y, L(Y) ⊆ ⋃_i L(X_i).

Definition 2.4.
The color number of a query Q = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), denoted C(Q), is the maximum over valid colorings of Q of the ratio of the total number of colors appearing in the output variables u, to the maximum number of colors appearing in any given u_i, for i ≥ 1. Formally:

C(Q) := max_{colorings} |⋃_{X_j ∈ u} L(X_j)| / max_{i ≥ 1} |⋃_{X_j ∈ u_i} L(X_j)|.

The main theorem of [9] is that the color number yields a tight bound on the worst-case size increase of general conjunctive queries either without functional dependencies, or with a set of simple functional dependencies (or simple keys). Formally, the following theorem is proven:

Theorem (Theorem 4.7 from [9]). Given a query Q = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m) and a set of simple functional dependencies,

|Q(D)| ≤ rmax(Q, D)^{C(chase(Q))}.

Furthermore, this bound is essentially tight: for any
N > 1, there exists a database D with rmax(Q, D) ≤ rep(Q) · N, and |Q(D)| = N^{C(Q)}, where rep(Q) is the maximum number of times any specific relation R_i appears in Q.

Additionally, it was shown that, in the setting in which general functional dependencies are given, the color number yields a lower bound. Specifically:
Proposition (Proposition 6.3 from [9]). Given a query Q = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m) and a set of functional dependencies, there exists an instance D in which

|Q(D)| ≥ (rmax(Q, D) / rep(Q))^{C(chase(Q))}.

The proof of the above proposition is via a construction. This construction provides some insight into the relationship between the colorings of the variables and conditional entropies, and we give a simplified proof in the case that m = n in Appendix A.

In this section we state the basic definitions of conditional entropy and information measures, and then state some facts about Shannon and non-Shannon information inequalities, which will prove useful in the remainder of the paper.
Definition 2.5.
For discrete random variables
X, Y with respective supports 𝒳, 𝒴, the conditional entropy of X given Y, denoted by H(X|Y), is given by

H(X|Y) := Σ_{y ∈ 𝒴} p(y) H(X | Y = y) = − Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log(p(x|y)).

The following fact follows from the above definition:
Fact 2.6.
For discrete random variables
X, Y with respective supports 𝒳, 𝒴,

H(X, Y) = H(X) + H(Y|X).

Definition 2.7. For discrete random variables
X, Y, as above, the mutual information between X and Y is

I(X; Y) := Σ_{x ∈ 𝒳, y ∈ 𝒴} p(x, y) log [ p(x, y) / (p(x) p(y)) ].

The following fact follows from the above definition:
Fact 2.8.
For discrete random variables
X, Y as above,

I(X; Y) = I(Y; X) = H(X) + H(Y) − H(X, Y) = H(X) − H(X|Y).

Definition 2.9.
For discrete random variables X_1, …, X_n with respective supports 𝒳_1, …, 𝒳_n, and n ≥ 3, we recursively define their mutual information as

I(X_1; …; X_n) = I(X_1; …; X_{n−1}) − I(X_1; …; X_{n−1} | X_n),

where the conditional mutual information is defined as

I(X_1; …; X_{n−1} | X_n) = Σ_{x_n ∈ 𝒳_n} p(x_n) (I(X_1; …; X_{n−1}) | X_n = x_n),

and where for n = 2, mutual information is as defined in Definition 2.7.

Unsurprisingly, the above information measures have a set-theoretic structure, and can be represented in an information diagram, from which basic relations between information measures can easily be read off. Figure 1 illustrates a general information diagram for three variables. The following facts follow from the previous definitions, and can easily be seen by considering the associated information diagram. (We refer the reader to Chapter 3 of [23] for proofs of these facts and a rigorous definition of the set-theoretic structure of information measures.)
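The identities above are easy to check numerically; the following is a small self-contained sketch (the joint distribution is an arbitrary example of ours):

```python
import math

# Numerically check Facts 2.6 and 2.8 on a small joint distribution
# of (X, Y); the distribution itself is an arbitrary illustrative choice.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(idx):
    m = {}
    for t, p in joint.items():
        key = tuple(t[i] for i in idx)
        m[key] = m.get(key, 0.0) + p
    return m

HX, HY, HXY = H(marginal((0,))), H(marginal((1,))), H(joint)

# H(Y|X) computed from its definition, sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(
    px * H({y: joint.get((x, y), 0.0) / px for y in (0, 1)})
    for (x,), px in marginal((0,)).items())

assert abs(HX + H_Y_given_X - HXY) < 1e-12      # Fact 2.6, chain rule
I_XY = HX + HY - HXY                            # Fact 2.8
assert abs(I_XY - (HX - (HXY - HY))) < 1e-12    # = H(X) - H(X|Y)
assert I_XY >= 0                                # a Shannon inequality
```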
Fact 2.10.
For discrete random variables X_1, …, X_n, and any disjoint sets K, K′ ⊆ [n]:

H(X_K | X_{K′}) = Σ_{S : S∩K ≠ ∅, S∩K′ = ∅} I(S | X_{[n]−S}),

I(K | X_{K′}) = Σ_{S : S ⊇ K, S∩K′ = ∅} I(S | X_{[n]−S}),

where I(S | X_{S′}) denotes I(X_1; …; X_j | X_{S′}), for S = [j]. Note that we avoid the notation I(X_S | X_{S′}), which has the interpretation of I(X_1, …, X_j | X_{S′}) = H(X_S | X_{S′}).

Figure 1: An information diagram for three random variables X, Y, Z.
Note that the set-theoretic properties of these information measures allow various information equalities to be read off from such a diagram; for example, I(X; Y) = I(X; Y; Z) + I(X; Y | Z), and H(Z) = I(X; Y; Z) + I(X; Z | Y) + I(Y; Z | X) + H(Z | X, Y).

We now define the basic information inequalities.
Definition 2.11.
For discrete random variables X_1, …, X_n as above, and for a subset K ⊆ [n], denoting by X_K the tuple of all X_i for i ∈ K, the Shannon information inequalities consist of all inequalities of the form H(X_i | X_{[n]−{i}}) ≥ 0, for all i ∈ [n], and I(X_i; X_j | X_K) ≥ 0, for all i ≠ j ∈ [n] and K ⊆ [n] − {i, j}.

We note that, as above, the mutual information expressions can be reexpressed in terms of entropies. For example, I(X_i; X_j | X_K) = H(X_i | X_K) − H(X_i | X_j, X_K) = H(X_i, X_K) + H(X_j, X_K) − H(X_K) − H(X_i, X_j, X_K). (See [23], Chapter 14, for further discussion of the Shannon inequalities.)

The Shannon information inequalities are well understood and were, initially, hypothesized to essentially capture the space of valid entropy configurations. However, in a breakthrough work in 1998, Zhang and Yeung showed that there are fundamental constraints on this space that are not captured by the Shannon inequalities, even for as few as four random variables [25]. This accounts for the lack of tightness in our upper bound.

3 Size Bounds
We begin by giving our linear programming upper bound for the worst-case size increase. Throughout this section, we admit a slight abuse of notation, and refer to the entropy of a set of attributes of a database, interpreted in the natural way: given a database table with attribute set A = {X_1, …, X_k}, some fixed probability distribution D over the tuples of the table, and two subsets S, S′ ⊆ A, we refer to the conditional entropy H_D(S | S′), where S, S′ respectively are interpreted to be the discrete random variables whose possible values consist of the |S|-, respectively |S′|-tuples of values that the corresponding variables have in the tuples of the database table, with probabilities given according to D.

Theorem 3.1.
Given a query Q = chase(Q) = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), with var(Q) = {X_1, …, X_k}, and a set of arbitrary functional dependencies, for any database D,

|Q(D)| ≤ rmax(Q, D)^{s(Q)},

where rmax(Q, D) is the size of the largest relation among R_1, …, R_n in D, and s(Q) is the solution to the following linear program:

maximize h(u) subject to
h(u_i) ≤ 1  ∀ i ≥ 1
h(x_t | x_{i_1}, …, x_{i_j}) = 0  for each f.d. X_{i_1}, …, X_{i_j} → X_t
h(x_i | x_{[k]−{i}}) ≥ 0  ∀ i ∈ [k]
I(x_i; x_j | x_S) ≥ 0  ∀ i ≠ j ∈ [k] and S ⊆ [k] − {i, j},

where the variables of the linear program are the (unconditional) entropies h(x_S) for all S ⊆ [k], and the expressions involving mutual information or conditional entropies appearing in the constraints are implicitly considered to stand in for the corresponding linear expressions of these variables (as described in Section 2.3).

Proof. The first step in the proof is to establish the connection between entropy and worst-case size increases. Given our query Q and database D, let c be such that |Q(D)| = rmax(Q, D)^c. Let Q′ = R′(var(Q)) ← R_1(u_1) ∧ … ∧ R_n(u_m) be the query derived from Q by including all query variables in the output, and define the distribution D over the tuples of Q′(D) to be such that the marginal distribution D_u over the values of the |u|-tuples corresponding to variables in u is the uniform distribution. Note that such a choice for D is not necessarily unique, unless u = var(Q). Let H_D(u_i) denote the entropy of the projection of the distribution D onto the positions labeled by the variables of u_i. Observe that for any i ∈ [m],

H_D(u) / H_D(u_i) ≥ H_D(u) / H_{unif_i}(u_i) ≥ log(|Q(D)|) / log(|R_i(D)|) ≥ c,     (2)

where unif_i is the uniform distribution over the tuples of R_i(D).
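As a minimal sanity check of this entropy/size correspondence (our own toy instance): for the cross-product query Q(X, Y) ← R_1(X) ∧ R_2(Y), with no functional dependencies, the uniform distribution over the output attains the entropy ratio c = 2 exactly:

```python
import math
from itertools import product

# Cross-product query Q(X, Y) <- R1(X) /\ R2(Y): here |Q(D)| = rmax^c
# with c = 2, and the uniform output distribution has H(u)/H(u_i) = c.
N = 10
R1, R2 = list(range(N)), list(range(N))
QD = list(product(R1, R2))          # Q(D): all (x, y) pairs
rmax = max(len(R1), len(R2))

c = math.log(len(QD)) / math.log(rmax)       # exponent of size increase
H_u = math.log2(len(QD))            # entropy of the uniform output dist
H_ui = math.log2(len(R1))           # its marginal on u_1 is also uniform
assert abs(H_u / H_ui - c) < 1e-12  # Equation (2) holds with equality
```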
This provides the motivation for the form of our linear program: maximizing the entropy of u while bounding the entropies of each u_i.

To see that the value of the above linear program provides an upper bound on c, note that for any set S ⊆ [k], the quantity H_D(S) / max_{i≥1} H_D(u_i) satisfies all the constraints that the corresponding variable h(S) is subject to in the linear program, including the last two sets of constraints that represent the Shannon information inequalities, and thus by Equation (2) the value of the solution to the linear program must be at least c.

In order to make the size bound given by the solution to the linear program of Theorem 3.1 tight, we would need to add additional constraints so as to enforce the non-Shannon information inequalities. Unfortunately, it was recently shown that even for just four variables, there are infinitely many independent such inequalities [17].

We note that the jump in difficulty of establishing tight size bounds occurs when the left-hand sides of functional dependencies go from having single variables to having two variables. It is not hard to show that any size bounds for the case where functional dependencies have left-hand sides with at most two variables can be extended to work for arbitrary functional dependencies, via the following proposition.
Proposition 3.2.
Given a query Q = chase(Q) and set of functional dependencies, there exists a query Q′ with the following properties:

• each functional dependency of Q′ has at most two variables on its left-hand side,
• Q′ = chase(Q′),
• the set of functional dependencies of Q′ is at most polynomially larger than that of Q,
• the description of Q′ is at most polynomially larger than that of Q,
• the worst-case size increases of Q and Q′ are identical, and C(Q) = C(Q′).

Proof. We shall iteratively remove functional dependencies from Q that have 3 or more variables occurring on their left-hand sides, via the addition of a (polynomial) number of additional variables, relations, and functional dependencies.

Given a functional dependency X_1 … X_k → Y, we add a relation R(X_1 X_2 Z), with the new variable Z, together with the functional dependencies X_1 X_2 → Z, Z → X_1, Z → X_2. We then add the relation R′(Z X_3 … X_k Y), together with the functional dependency Z X_3 … X_k → Y. Finally, we remove the functional dependency X_1 … X_k → Y from the set of functional dependencies.

Iteratively applying the above procedure until there are no more functional dependencies (other than implied ones) with more than two variables on their left-hand sides clearly results in a query Q′ with at most a polynomially longer description, and polynomially more functional dependencies. Additionally, since all new relations are distinct, and all original functional dependencies are implied by the new set of functional dependencies, chase(Q′) = Q′. To see that the size increase of Q′ is the same as that of Q, note that after each single iteration of the above procedure the size increase must remain unchanged, as the values taken by variables X_1, X_2 dictate the value taken by Z, and vice versa, defining a bijection between tuples of Q(D) and tuples of the result of the query generated after one step of the procedure.
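The iteration in this proof is easy to mechanize; the following sketch (our own representation: FDs as (left-hand-side tuple, right-hand-side variable) pairs) performs only the FD bookkeeping, eliding the accompanying relations R(X_1 X_2 Z) and R′(Z X_3 … X_k Y):

```python
# Each step replaces X1 X2 ... Xk -> Y (k >= 3) by a fresh variable Z
# with X1 X2 -> Z, Z -> X1, Z -> X2, and Z X3 ... Xk -> Y, as above.
def split_fds(fds):
    todo, out, fresh = list(fds), [], 0
    while todo:
        lhs, rhs = todo.pop()
        if len(lhs) <= 2:
            out.append((lhs, rhs))
            continue
        fresh += 1
        z = "_Z%d" % fresh               # hypothetical fresh variable
        x1, x2, *rest = lhs
        out += [((x1, x2), z), ((z,), x1), ((z,), x2)]
        todo.append(((z, *rest), rhs))   # may itself need splitting
    return out

result = split_fds([(("A", "B", "C", "D"), "E")])
assert all(len(lhs) <= 2 for lhs, _ in result)
```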
To conclude, there is a natural mapping between valid colorings of Q and valid colorings of the query obtained after one step of the above procedure, namely L(Z) ↔ L(X_1) ∪ L(X_2).

4 The Color Number Revisited

We now reexamine the color number in an effort to better understand the types of entropy structures that it can capture. As the following theorem shows, the color number can be defined via the linear program of Theorem 3.1 with the addition of some extra constraints on the entropies. In particular, we require extra constraints that enforce that all mutual information measures be nonnegative. (Note that the Shannon inequalities imply that all mutual information measures of two variables are nonnegative; however, as Figure 2 depicts, the mutual information of more than two variables can be negative.)
Theorem 4.1.
Given a query Q = chase(Q) = R(u) ← R_1(u_1) ∧ … ∧ R_n(u_m), with var(Q) = {X_1, …, X_k}, and a set of arbitrary functional dependencies, C(Q) is equal to the solution to the following linear program:

maximize h(u) subject to
h(u_i) ≤ 1  ∀ i ≥ 1
h(x_t | x_{i_1}, …, x_{i_j}) = 0  for each f.d. X_{i_1}, …, X_{i_j} → X_t
I(x_{i_1}; …; x_{i_j} | x_{[k]−{i_1,…,i_j}}) ≥ 0  ∀ sets S = {i_1, …, i_j} ⊆ [k],

where the variables of the linear program are the (unconditional) entropies h(x_S) for all S ⊆ [k], and the expressions involving mutual information or conditional entropies appearing in the constraints are implicitly considered to stand in for the corresponding linear expressions of these variables (as described in Section 2.3).

Proof. We first show that given any valid coloring achieving color number C(Q), we can find a feasible point for the linear program with value C(Q). Given a valid coloring in which at most r colors occur together in the labels of any input atom, for every set S ⊆ [k], we set

I(S | x_{[k]−S}) = |⋂_{i∈S} L(X_i) − ⋃_{i∉S} L(X_i)| / r,

where I(S | x_{[k]−S}) denotes I(x_{i_1}; …; x_{i_j} | x_{[k]−S}), with S = {i_1, …, i_j}. Note that these mutual information values are sufficient to determine the values of all variables in the linear program. In particular, these mutual information measures are the values that would appear in an information diagram. From Fact 2.10, for any disjoint sets T, T′ ⊆ [k], we will now express I(T | x_{T′}) in terms of the color labels. We note that for distinct sets S_1, S_2, the corresponding sets of labels ⋂_{i∈S_j} L(X_i) − ⋃_{i∉S_j} L(X_i) will be disjoint, because these sets consist of exactly those colors appearing in the labels of each element of S_j and not in any of the labels of elements not in S_j.
Thus the sum in Fact 2.10 may be expressed in terms of the size of the union of these sets over S containing T and disjoint from T′. It is straightforward to see that this union consists of exactly those colors appearing in the labels of each element of T and not in any of the labels of elements of T′, yielding:

I(T | x_{T′}) = |⋂_{i∈T} L(X_i) − ⋃_{i∈T′} L(X_i)| / r.

It is now easy to see that this construction yields a feasible point for the linear program. First observe that all the information inequalities are trivially satisfied, since for every set S ⊆ [k], I(S | x_{[k]−S}) ≥ 0 in our construction. To see that the equality constraints given by the functional dependencies are observed, note that the dependency X_1, …, X_j → X_{j+1} implies that L(X_{j+1}) − ⋃_{i∈[j]} L(X_i) = ∅, and thus I(x_{j+1} | x_{[j]}) = 0, as desired. (Note that, by definition, h(x_{j+1} | x_{[j]}) = I(x_{j+1} | x_{[j]}).) Finally, to see that the first set of constraints is observed, note that for any j ≤ k,

h(x_{[j]}) = Σ_{S : S∩[j] ≠ ∅} I(S | x_{[k]−S}),

which, by our construction, is precisely |⋃_{i∈[j]} L(X_i)| / r; this is bounded by 1 whenever the index set in question is that of an input atom u_i (since at most r colors occur together in any input atom), and it equals C(Q) when the index set is that of u, by the definition of the color number.

For the other direction, given a rational feasible point for the linear program with objective function value v, where all variables have values r_i/q, for integers r_i, q, with q being the common denominator, we will construct a valid coloring witnessing C(Q) ≥ v. The final set of constraints of the LP implies that for any set S ⊆ [k], I(S | x_{[k]−S}) = r_S/q ≥ 0. Furthermore, since our feasible point is rational, r_S ∈ ℕ. To populate our coloring, we begin with the empty coloring, and then for each S ⊆ [k], we add q · I(S | x_{[k]−S}) unique colors to the labels of all X_i for which i ∈ S.
To see that this coloring obeys the functional dependencies, note that for X_1, …, X_j → X_{j+1}, we have that I(x_{j+1} | x_{[j]}) = 0, and thus by Fact 2.10, for any S ⊆ [k] − [j] such that j+1 ∈ S, I(S | x_{[k]−S}) = 0, from which it follows that in our construction L(X_{j+1}) ⊆ ⋃_{i∈[j]} L(X_i). Finally, to see that the color number is at least the value v of the linear program, note that by Fact 2.10, a total of

Σ_{S ⊆ [k] s.t. S∩K ≠ ∅} q · I(S | x_{[k]−S}) = q · h(x_K)

unique colors are assigned to the variables indexed by each set K, and thus the color number is at least h(u), as desired.

Remark 4.2.
From the above characterization of the color number, it follows that for all the settings in which the color number yields a tight bound on the worst-case size increase (i.e., when no functional dependencies are specified, or only simple dependencies), there exist worst-case instances whose corresponding information diagrams have only nonnegative entries.
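With general (compound) functional dependencies, by contrast, negative mutual information arises; the standard XOR example (our own numerical check) already exhibits it, and this is exactly the phenomenon exploited in the construction that follows:

```python
import math

# X, Y uniform bits and Z = X xor Y: pairwise independent variables
# that are jointly dependent, giving negative triple mutual information.
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

def H(*idx):
    marg = {}
    for t, p in joint.items():
        key = tuple(t[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

I_xy   = H(0) + H(1) - H(0, 1)                    # = 0: X, Y independent
I_xy_z = H(0, 2) + H(1, 2) - H(2) - H(0, 1, 2)    # = 1 bit given Z
assert abs(I_xy) < 1e-12 and abs(I_xy_z - 1) < 1e-12
assert I_xy - I_xy_z < 0     # I(X;Y;Z) = I(X;Y) - I(X;Y|Z) is negative
```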
Leveraging the understanding of the entropy structures that are compatible with the color number given by the previous theorem, we now show that there is a super-constant gap between the exponent of the true worst-case size increase and the color number (in the case of general functional dependencies). We suspect, however, that in the majority of practical applications, this gap between the upper and lower bounds will be small.
Theorem 4.3.
For any fixed constant α ∈ ℝ, there exists a conjunctive query Q and set of functional dependencies, and a database D, such that

|Q(D)| > rmax(Q, D)^{α·C(chase(Q))}.

Figure 2: The information diagram of the variables X_{1,1}, …, X_{4,1} in our construction for k = 4. Note that any set of size 2 or more contains all the entropy of all four variables. The negative mutual information I(X_{1,1}; X_{2,1}; X_{3,1}; X_{4,1}) < 0 suggests that no valid coloring can closely approximate the entropy structure, which is leveraged in our construction to yield a super-constant gap between the color number and worst-case size increase.

Proof.
We shall construct a family of queries, and associated databases, whose color numbers fall short of the true size increase by a super-constant factor. (Our construction is a generalization of a construction suggested to us by Daniel Marx.) Fix an even integer k, and consider the following query Q over k^2/2 variables X_{i,j}, for i ∈ {1, ..., k} and j ∈ {1, ..., k/2}:

Q = R(X_{1,1}, ..., X_{i,j}, ..., X_{k,k/2}) ← ∧_{i=1}^{k/2} R_i(X_{1,i}, ..., X_{k,i}) ∧ ∧_{i=1}^{k} T_i(X_{i,1}, ..., X_{i,k/2}).

Additionally, for each j ∈ {1, ..., k/2} we impose the following functional dependencies: given any set S ⊂ {X_{1,j}, ..., X_{k,j}} with |S| ≥ k/2, for any i, S → X_{i,j}.

Intuitively, the above construction has k/2 groups of k variables, such that amongst any group, any set of k/2 of those variables suffices to recover the remaining k/2 variables in that group. The information diagram of one group of the construction in the case k = 4 is depicted in Figure 2.

Given any integer N, we will construct a database D such that for all i ∈ [k/2], j ∈ [k], we have |R_i(D)| = N^{k/2} = |T_j(D)|. The values assigned to positions labeled by X_{i,j} and X_{i′,j′} will be disjoint whenever j ≠ j′; i.e., the values assigned to each of the k/2 groups are disjoint. Each of the N^{k/2} tuples of R_i(D) will be constructed so as to be Shamir (k/2, k) secret shares [21]. That is, given the values of any k/2 of the attributes X_{1,i}, ..., X_{k,i}, the values of the remaining k/2 attributes can be uniquely determined, and for S ⊂ {X_{1,i}, ..., X_{k,i}},

|π_S(R_i(D))| = N^{|S|} if |S| ≤ k/2, and N^{k/2} if |S| ≥ k/2.

Since Q(D) consists of the complete join of each R_i, |Q(D)| = (N^{k/2})^{k/2} = N^{k^2/4}, whereas the size of the largest input relation is rmax(Q, D) = N^{k/2}.
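The construction can be checked concretely for k = 4 and N = 5 (the choice of GF(5) as the field for the Shamir shares is ours; any sufficiently large field would do). The sketch below builds one group R_i(D) as the set of (2, 4) share vectors, verifies the projection-size equation, and confirms that the interaction information of the four shares is negative, as Figure 2 indicates:

```python
import itertools, math

# One group of the k = 4 instance: R_i(D) is the set of (2, 4) Shamir share
# vectors over GF(5) (the field is our choice), so N = 5 and |R_i(D)| = N^2.
N, k = 5, 4
R = {tuple((a0 + a1 * x) % N for x in range(1, k + 1))
     for a0 in range(N) for a1 in range(N)}
assert len(R) == N ** (k // 2)

# |pi_S(R_i(D))| = N^|S| if |S| <= k/2, and N^(k/2) if |S| >= k/2.
for r in range(1, k + 1):
    for S in itertools.combinations(range(k), r):
        proj = {tuple(t[i] for i in S) for t in R}
        assert len(proj) == N ** min(r, k // 2)

def H(indices):
    """Entropy (bits) of a uniform projection of R."""
    return math.log2(len({tuple(t[i] for i in indices) for t in R}))

# Interaction information of the four shares, via inclusion-exclusion;
# it is negative, matching the information diagram of Figure 2.
I4 = sum((-1) ** (r + 1) * H(S)
         for r in range(1, k + 1)
         for S in itertools.combinations(range(k), r))
print(I4 < 0)  # True
```

Here |Q(D)| = (N^{k/2})^{k/2} = 625 while rmax(Q, D) = 25, and the gap in the exponents grows with k. (We use McGill's sign convention for multivariate mutual information; conventions differ, but only the negativity matters here.)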
We now show that C(chase(Q)) = C(Q) ≤ 2, which will complete our proof of the theorem.

First observe that it suffices to consider the case that for j ≠ j′, L(X_{i,j}) ∩ L(X_{i′,j′}) = ∅: assuming otherwise, if a common color c lay in the intersection, then by removing the color c from the labels L(X_{i″,j}) for all i″, we would still have a valid coloring (since there are no functional dependencies between groups), and the color number could only have increased. Let r_i = |∪_{j=1}^{k} L(X_{j,i})| and t_i = |∪_{j=1}^{k/2} L(X_{i,j})| = Σ_{j=1}^{k/2} |L(X_{i,j})| denote the number of colors assigned to the variables of each input atom. Thus in any optimal coloring, we have

|∪_{X_{i,j}} L(X_{i,j})| = Σ_{i=1}^{k/2} |∪_{j=1}^{k} L(X_{j,i})| = Σ_{i=1}^{k/2} r_i.

Next, observe that each element of L(X_{i,j}) must occur in the labels of at least k/2 other variables X_{i′,j}; if this were not the case, then there would exist a set S ⊂ {X_{1,j}, ..., X_{k,j}} of size |S| ≥ k/2 such that L(X_{i,j}) ⊄ ∪_{X_{i′,j} ∈ S} L(X_{i′,j}), which violates one of the functional dependencies. Thus it follows that

Σ_{i=1}^{k} |L(X_{i,j})| ≥ (k/2) · r_j.

To conclude, putting the above equations together, we have

Σ_{i=1}^{k} t_i = Σ_{X_{i,j}} |L(X_{i,j})| ≥ (k/2) · Σ_{i=1}^{k/2} r_i,

and thus there must be at least one i such that t_i ≥ ((k/2) Σ_{i=1}^{k/2} r_i)/k = (1/2) Σ_{i=1}^{k/2} r_i, and thus C(Q) ≤ 2. Hence, for any fixed α, choosing k > 4α yields |Q(D)| = N^{k^2/4} = rmax(Q, D)^{k/2} > rmax(Q, D)^{2α} ≥ rmax(Q, D)^{α · C(chase(Q))}, as desired.

Complexity Considerations
From a complexity standpoint, the results of the previous sections are not encouraging. Both the upper bound and the lower bound are given as the solutions to exponentially large linear programs. This prompts the question of whether one can efficiently determine anything about the size of the result in this setting with general functional dependencies. (It is shown in [9] that when one only has simple functional dependencies, tight size bounds can be efficiently computed.) With general functional dependencies, even computing chase(Q) can be intractable. Nevertheless, we show that when chase(Q) is given, or can be efficiently computed (for example, when all the input relations have bounded arities), we can efficiently decide whether the result of the query with a set of general functional dependencies can be any larger than the input relations. The proof relies on a proposition from [9], and then reduces the question at hand to the satisfiability of a sequence of tractable SAT instances, one for each input relation.

Theorem 5.1.
Given a conjunctive query Q = R(u) ← R_1(u_1) ∧ ... ∧ R_n(u_m) with an arbitrary set of functional dependencies, such that Q = chase(Q), it can be efficiently decided whether the result of Q can be larger than the input relations, in which case there exists an instance D with |Q(D)| ≥ (rmax(Q, D)/rep(Q))^{m/(m−1)}.

The proof of the theorem relies on the following proposition:
Proposition (Proposition 6.1 from [9]). A query Q = R(u) ← R_1(u_1) ∧ ... ∧ R_n(u_m) with arbitrary functional dependencies is sparsity preserving if, and only if, C(chase(Q)) = 1. Equivalently, for any database D, |Q(D)| ≤ rmax(Q, D) if, and only if, C(chase(Q)) = 1. Furthermore, if C(chase(Q)) > 1, then C(chase(Q)) ≥ m/(m−1).

Proof of Theorem 5.1:
By the above proposition, it suffices to show that one can decide whether C(Q) > 1 in polynomial time. First observe that a necessary and sufficient condition for C(Q) > 1 is the existence of some valid coloring such that for each relation R_i, with i ≥ 1, there is a color c_i such that c_i ∈ ∪_{X_j ∈ u} L(X_j), but c_i ∉ ∪_{X_j ∈ u_i} L(X_j). We will represent this condition as a set of n tractable SAT expressions, one for each input relation, as follows. Our set of SAT variables will be {x_1, ..., x_{|var(Q)|}}, in natural correspondence with the set of query variables V = {X_1, ..., X_{|var(Q)|}}. From Proposition 3.2 it suffices to prove our theorem in the case that all functional dependencies have at most two variables on their left-hand sides. Given the p functional dependencies X_{j_1} X_{k_1} → X_{m_1}, ..., X_{j_p} X_{k_p} → X_{m_p}, our SAT expression for relation i will have the form

SAT_i = ∧_{X_j ∈ u_i} ¬x_j ∧ ∨_{X_j ∈ u} x_j ∧ (x_{j_1} ∨ x_{k_1} ∨ ¬x_{m_1}) ∧ ... ∧ (x_{j_p} ∨ x_{k_p} ∨ ¬x_{m_p}).

Any satisfying assignment of
SAT_i yields a valid coloring of Q that uses exactly 1 color, and has the property that no variable in u_i has the color, but at least one variable in u has the color; such a coloring is given by assigning all variables that are set to false to not have the color, and all variables set to true to have the color. To see this, note that the first part of SAT_i ensures that no variable occurring in u_i can be true in a satisfying assignment; the second part of SAT_i ensures that at least one variable in the output projection will be colored; and the third part of SAT_i ensures that the functional dependencies are respected. Since any set of valid colorings can be combined to yield a valid coloring (by letting L_{1,2}(X_i) = L_1(X_i) ∪ L_2(X_i)), it follows that if, for all i = 1, ..., n, SAT_i is satisfiable, then there exists a coloring with n colors, yielding C(Q) ≥ n/(n−1) > 1. Conversely, if, for some i, SAT_i is not satisfiable, then there is no valid coloring of the variables in which some color appears in the output projection but not in the coloring of a variable of u_i, in which case C(Q) = 1. What remains is to verify that
SAT_i can be solved efficiently. We start by decomposing SAT_i into its three basic components: SAT_i = C_1 ∧ C_2 ∧ C_3, where C_1 = ∧_{X_j ∈ u_i} ¬x_j, C_2 = ∨_{X_j ∈ u} x_j, and C_3 = ∧_{h=1,...,p} (x_{j_h} ∨ x_{k_h} ∨ ¬x_{m_h}). We first remove from C_2 all variables x_i that appear negated in C_1. Then, we simplify SAT_i via a series of at most |V| 'passes'. In each pass, we traverse each clause (x_{j_h} ∨ x_{k_h} ∨ ¬x_{m_h}) of C_3; if x_{m_h} occurs in C_1, then we remove the clause (x_{j_h} ∨ x_{k_h} ∨ ¬x_{m_h}) from C_3 and proceed. Otherwise, if either x_{j_h} or x_{k_h} occurs in C_1, we remove the occurring variable(s) from this clause in C_3 and proceed. Finally, if a clause of C_3 consists of a single negated literal ¬x_i, we remove that clause from C_3, and add the literal to C_1. If no new variable is added to C_1 during a pass, this means that no additional passes will alter the clauses, so we halt.

It is not hard to see that each pass does not alter the satisfiability of the expression C_1 ∧ C_2 ∧ C_3. Furthermore, since each pass either adds at least one variable to C_1, or is the last pass, there will be at most |V| passes. If at any point a clause in C_3 becomes a single literal x_i that also occurs in C_1, or C_2 consists of a subset of the variables occurring in C_1, then SAT_i is clearly not satisfiable; if this does not occur, then no additional passes will alter the clauses, and a satisfying assignment for SAT_i is given by setting all the variables in C_1 to be false, and all other variables to be true. □

Conclusions
We view the main contribution of this work as establishing a firm connection between worst-case size bounds and multivariate entropy structures, allowing the tools of information theory to be leveraged towards database analysis. This connection suggests two main lines of future work. The first direction is investigating whether one can explicitly characterize the worst-case size increase, even if that characterization is exponentially large. It is also conceivable that, while exactly characterizing the size increase might not be possible, one can explicitly (and possibly even efficiently) compute an approximation of the worst-case size increase. This seems like a deep and challenging question, and such a result would likely involve a significant advance in the understanding of the structure of non-Shannon-type information inequalities.

The second direction is investigating which types of entropy structures arise from databases and their associated queries in practice. Such an investigation would help determine where practical instances lie on the spectrum between the basic color-number bounds and the more intricate bounds of Theorem 3.1. Database measures such as sparsity and treewidth were introduced with corresponding goals in mind, and have proved effective at succinctly capturing the ease with which certain database operations can be carried out. We propose the following measure of the entropy structure of a database and associated query, in the hope that it will succinctly capture this new facet of database complexity, as suggested by the results of this paper:
Definition 6.1.
The knitted complexity of a database with respect to a query is the ratio of the sum of the absolute values of the mutual informations of all subsets of the query variables to the sum of the (signed) mutual informations of all subsets of the query variables.
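To make the definition concrete, the following sketch (our own illustration; we read "mutual informations of all subsets" as the atoms of the information diagram, i.e., the I-measure) computes the knitted complexity of one k = 4 group of the construction of Section 4:

```python
import itertools, math

# One k = 4 group: the (2, 4) Shamir share vectors over GF(5) (the field
# is our illustrative choice). The joint distribution is uniform over the
# 25 share vectors.
P, n = 5, 4
tuples = {tuple((a0 + a1 * x) % P for x in range(1, n + 1))
          for a0 in range(P) for a1 in range(P)}

def H(indices):
    """Entropy (bits) of the projection onto the given coordinates."""
    return math.log2(len({tuple(t[i] for i in indices) for t in tuples}))

def f(S):
    """Conditional entropy H(X_S | X_{[n]-S})."""
    rest = tuple(i for i in range(n) if i not in S)
    return H(range(n)) - H(rest)

# The atoms of the information diagram, by Mobius inversion over subsets.
subsets = [S for r in range(1, n + 1)
           for S in itertools.combinations(range(n), r)]
atom = {}
for S in subsets:
    atom[S] = f(S) - sum(atom[T] for T in subsets if set(T) < set(S))

total = sum(atom.values())  # equals H(X_1, ..., X_n)
knitted = sum(abs(v) for v in atom.values()) / total
print(round(knitted, 6))
```

Note that the denominator is just the joint entropy of all the variables, so a knitted complexity of 1 corresponds to a diagram with no negative atoms, while larger values indicate the kind of negative-entropy structure exploited in Section 4.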
Acknowledgments
We are deeply grateful to Daniel Marx, who first pointed out to us that the color number does not provide an upper bound on the worst-case size increase in the setting with general functional dependencies.
References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] A. V. Aho, Y. Sagiv, and J. D. Ullman. Equivalence of relational expressions. SIAM J. of Computing, 8(2):218–246, May 1979.
[3] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. In IEEE FOCS'08, 2008.
[4] C. Beeri and M. Y. Vardi. A proof procedure for data dependencies. J. ACM, 31(4):718–741, 1984.
[5] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In ACM STOC, 1977.
[6] S. Chaudhuri. An overview of query optimization in relational systems. In PODS, 1998.
[7] R. Dougherty, C. Freiling, and K. Zeger. Networks, matroids, and non-Shannon information inequalities. IEEE Transactions on Information Theory, 53(6):1949–1969, 2007.
[8] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. In ICDT, 2003.
[9] G. Gottlob, S. T. Lee, and G. J. Valiant. Size and treewidth bounds for conjunctive queries. In PODS, 2009.
[10] M. Grohe and D. Marx. Constraint solving via fractional edge covers. In SODA, 2006.
[11] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami. Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci., 52(3):550–569, 1996.
[12] M. Jarke and J. Koch. Query optimization in database systems. ACM Comput. Surv., 16(2):111–152, 1984.
[13] P. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS, 2005.
[14] M. Lenzerini. Data integration: a theoretical perspective. In PODS, 2002.
[15] A. Y. Levy, A. O. Mendelzon, and Y. Sagiv. Answering queries using views. In PODS, 1995.
[16] D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependencies. ACM Trans. Database Syst., 4(4):455–469, 1979.
[17] F. Matúš. Infinitely many information inequalities. In ISIT, Nice, France, 2007.
[18] F. Matúš. Two constructions on limits of entropy functions. IEEE Transactions on Information Theory, 53(1):320–330, 2007.
[19] F. Olken and D. Rotem. Random sampling from database files: A survey. In Proc. of Stat. and Scientific Database Management, 1990.
[20] N. Pippenger. What are the laws of information theory? In Special Problems on Communication and Computation Conference, Palo Alto, CA, 1986.
[21] A. Shamir. How to share a secret. Commun. ACM, 22(11):612–613, 1979.
[22] A. N. Swami and K. B. Schiefer. On the estimation of join result sizes. In Advances in Database Technology - EDBT'94, 4th Int. Conf. on Extending Database Technology, 1994.
[23] R. W. Yeung. Information Theory and Network Coding. Springer Publishing Company, Incorporated, 2008.
[24] Z. Zhang and R. W. Yeung. A non-Shannon-type conditional inequality of information quantities. IEEE Transactions on Information Theory, 43(6):1982–1986, 1997.
[25] Z. Zhang and R. W. Yeung. On characterization of entropy function via information inequalities. IEEE Transactions on Information Theory, 44(4):1440–1452, 1998.
A Simplified Proof of Proposition 6.3 from [9]
For clarity, we state and prove the proposition in the case that each input relation occurs only once in the query, and thus Q = chase(Q).

Proposition A.1.
Given a query Q = R(u) ← R_1(u_1) ∧ ... ∧ R_n(u_n) and set of functional dependencies, there exists an instance D in which |Q(D)| ≥ (rmax(Q, D))^{C(Q)}.

Proof.
Given an integer N, and any valid coloring with d colors, with d′ ≤ d colors appearing in the labels of the output variables, such that the coloring achieves color number C(Q), we shall construct an instance D with the property that |Q(D)| = N^{d′} and rmax(Q, D) ≤ N^{d′/C(Q)}.

Consider a table of arity d, with attributes C_1, ..., C_d, corresponding to each of the d colors. We construct the table T to have N^d tuples, such that the projection π_{C_{i_1},...,C_{i_k}}(T) of T onto any k attributes C_{i_1}, ..., C_{i_k} has size N^k. We denote the N values that a given attribute C_i may take by i_1, ..., i_N. (Thus T is just the total join of the d columns of size N.)

Next, we populate a given relation R_j that has variables X_1, ..., X_k in the corresponding atom u_j. Assume, without loss of generality, that in the given coloring of Q, ∪_{i=1,...,k} L(X_i) = {1, ..., q}. We populate R_j with N^q tuples derived from the N^q tuples in π_{C_1,...,C_q}(T), where the value that attribute X_i takes is given by an ordered list of the values taken by the C_i's that are in L(X_i). To illustrate, say q = 3, and (1_a, 2_b, 3_c) is a tuple of π_{C_1,...,C_q}(T); if R_j(X_1 X_2) appears in Q, and L(X_1) = {1, 2}, L(X_2) = {2, 3}, then we add the tuple ([1_a, 2_b], [2_b, 3_c]) to R_j, with the value [1_a, 2_b] appearing in the first attribute of R_j. From the definition of valid coloring, it follows that the constructed database satisfies all functional dependencies. Additionally, by construction, if all variables appeared in the output, all N^d tuples would appear in the output, and thus |Q(D)| = N^{d′}. For each input relation R_i, we have |R_i(D)| = N^k, where k = |∪_{X ∈ u_i} L(X)|; since, by the definition of the color number, |∪_{X ∈ u_i} L(X)| ≤ d′/C(Q) for every i, it follows that rmax(Q, D) ≤ N^{d′/C(Q)}, as desired. □
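A minimal executable sketch of this construction (the coloring, relation, and sizes below are illustrative choices, not from the text): build the color table T as the total join of d columns of N values, then populate a relation whose variables carry labels L(X) by grouping each tuple's values by label:

```python
import itertools

# Color table T: total join of d columns of N values each, so |T| = N^d.
N, d = 3, 3
T = list(itertools.product(range(N), repeat=d))
assert len(T) == N ** d

# A two-attribute relation R_j whose variables have (illustrative) labels;
# colors are indexed 0..d-1 here.
L = {"X1": (0, 1), "X2": (1, 2)}

# Each tuple of T contributes one tuple to R_j: the value of variable X is
# the ordered list of values taken by the colors in L(X).
R_j = {tuple(tuple(t[c] for c in L[X]) for X in ("X1", "X2")) for t in T}

# |R_j(D)| = N^q, where q is the number of colors in the union of labels.
q = len(set(L["X1"]) | set(L["X2"]))
print(len(R_j) == N ** q)  # True
```

With d′ = d = 3 this gives |Q(D)| = N^{d′} = 27 while each relation has at most N^q tuples, mirroring the counting in the proof above.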