Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
Nofar Carmeli
Technion, Israel
Nikolaos Tziavelis
Northeastern University, USA
Wolfgang Gatterbauer
Northeastern University, USA
Benny Kimelfeld
Technion, Israel
Mirek Riedewald
Northeastern University, USA
ABSTRACT
We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task of enumerating query answers of free-connex acyclic CQs, so that the access time is logarithmic. Moreover, it follows from past results that within the class of CQs without self-joins, being free-connex acyclic is necessary for the existence of such a construction (under conventional assumptions in fine-grained complexity). In this work, we embark on the challenge of identifying the answer orderings that allow for ranked direct access with the above complexity guarantees. We begin with the class of lexicographic orderings and give a decidable characterization of the class of feasible such orderings for every CQ without self-joins. We then continue to the more general case of orderings by the sum of attribute scores. As it turns out, in this case ranked direct access is feasible only in trivial cases. Hence, to better understand the computational challenge at hand, we consider the more modest task of providing access to only one single answer (i.e., finding the answer at a given position). We indeed achieve a quasilinear-time algorithm for a subset of the class of full CQs without self-joins, by adopting a solution of Frederickson and Johnson to the classic problem of selection over sorted matrices. We further prove that none of the other queries in this class admit such an algorithm.
When can we allow for direct access to a ranked list of answers to a database query without (and considerably faster than) materializing all answers? To illustrate the concrete instantiation of this question, assume the following simple relational schema for information about pandemic spread and relevant activity of residents:

Visits(person, age, city)    Cases(city, date, #cases)

Here, Visits mentions, for each person, the cities that the person visits regularly (e.g., for work and relatives) and the age of the person (for risk assessment); the relation Cases specifies the number of new infection cases in specific cities at specific dates (a measure that is commonly used for spread assessment albeit being sensitive to the amount of testing).
Conference'17, July 2017, Washington, DC, USA
Suppose that we wish to efficiently compute the natural join Visits ⋈ Cases based on equality of the city attribute, so that we have all combinations of people (with their age), the cities they regularly visit, and the city's daily new cases. For example, (Anna, [...], Boston, [...], [...]). While the number of such answers could be quadratic in the size of the database, the seminal work of Bagan, Durand, and Grandjean [3] has established that the join can be evaluated using an enumeration algorithm with a constant delay between consecutive answers, after a linear-time preprocessing phase. This is due to the fact that this join is a special case of a free-connex acyclic Conjunctive Query (CQ). In the case of CQs without self-joins, being free-connex acyclic is a sufficient and necessary condition for such efficient evaluation [3, 7]. The necessity requires conventional assumptions in fine-grained complexity (for the sake of simplicity, throughout this section we make all of these complexity assumptions; Section 2 gives their formal statements) and it holds even if we multiply the preprocessing and delay by a logarithmic factor in the size of the database (we refer to those as quasilinear preprocessing and log delay, respectively).

To realize the constant (or logarithmic) delay, the preprocessing phase constructs a structure that allows for efficient iteration over the answers in the enumeration phase. Brault-Baron [7] showed that in the linear preprocessing phase, we can construct a structure with better guarantees: not only log-delay enumeration, but even log-time direct access (also widely known as "random access"), that is, a structure that allows us to directly retrieve the $k$th answer in the enumeration, given $k$, without needing to enumerate the preceding $k-1$ answers. Later, Carmeli et al. [9] showed how such a structure can be used for enumerating answers in a random order (random permutation; not to be confused with "random access") with the statistical guarantee that the order is uniformly distributed. In particular, in the above example we can enumerate the answers of Visits ⋈ Cases in a provably uniform random permutation (hence, ensuring statistical validity of each prefix) with logarithmic delay, after a linear-time preprocessing phase. Their direct-access structure also allows for inverted access: given an answer, return the index $k$ of that answer (or determine that it is not a valid answer).

The direct-access structures of Brault-Baron [7] and Carmeli et al. [9] have the byproduct that they allow the answers to be sorted by some lexicographic order. For instance, in our Visits ⋈ Cases example the structure could be such that the tuples are in the (descending) order of [...] ($k$th answer in order) and determine the position of a tuple inside the sorted list. From this we can also conclude (fairly easily) that we can enumerate the answers ordered by age where ties are broken randomly, again provably uniformly. Carmeli et al. [9] have also shown how the order of the answers can be useful for generalizing direct-access algorithms from CQs to UCQs. Note that direct access to the sorted list of answers is a stronger requirement than ranked enumeration that has been studied in previous work [10, 24, 25, 27].

Yet, the choice of which lexicographic order is taken is an artefact of the structure construction (e.g., the elimination order [7] or the join tree [9]). If the application desires any specific lexicographic order, we can only hope to find a matching construction, which is not necessarily possible. For example, could we construct in (quasi)linear time a direct-access structure for Visits ⋈ Cases ordered by [...]
Contributions.
Our first main result is an algorithm for direct access for lexicographic orders, including ones that are not achievable by past structures. We further show that within the class of CQs without self-joins, our algorithm covers all the tractable cases (in the sense adopted here), and we establish a decidable (and easy to test) classification of the lexicographic orders over the free variables into tractable and intractable ones. For instance, in the case of Visits ⋈ Cases the lexicographic order (#cases, age, city, date, person) is intractable. It is classified as such because #cases and age are not neighbors (they do not appear together in an atom), yet both are neighbors of city, which appears after them in the order; this is what we call a disruptive trio. The lexicographic order (#cases, age) is also intractable since the query Visits ⋈ Cases is not {#cases, age}-connex. In contrast, the lexicographic order (#cases, city, age) is tractable. We also show that within the tractable side, the structure we construct allows for inverted access in constant time. (One could argue that, in reality, this example involves functional dependencies, such as person → age, which could invalidate the lower bounds. Indeed, our classification does not account for constraints. Yet, all hardness statements mentioned about this example in this section can be shown to follow from the results of this paper. We further discuss constraints in the Conclusions (Section 7).)

Our classification is proved in two steps. We begin by considering the complete lexicographic orders (that involve all variables). We show that for free-connex CQs without self-joins, the absence of a disruptive trio is a sufficient and necessary condition for tractability. We then generalize to partial lexicographic orders over a subset $L$ of the variables. There, the condition is that there is no disruptive trio and that the query is $L$-connex. Interestingly, it turns out that a partial lexicographic order is tractable if and only if it is the prefix of a complete tractable lexicographic order.

A lexicographic order is a special case of an ordering by the sum of attribute scores, where every database value is mapped to some number. Hence, a natural question now is which CQs have a tractable direct access by the order of sum. For example, what about Visits ⋈ Cases with the order ($\alpha$ · #cases + $\beta$ · age)? It is easy to see that this order is intractable because the lexicographic order (#cases, age) is intractable. In fact, it is easy to show that ordering by sum is intractable whenever some lexicographic order is intractable (e.g., there is a disruptive trio). However, the situation is worse: the only tractable case is the one where the CQ is acyclic and there is an atom that contains all of the free variables. In particular, ordering by sum is intractable already for the Cartesian product $Q(c_1, d, x, p, a, c_2) \,{:}{-}\, \mathrm{Visits}(p, a, c_2), \mathrm{Cases}(c_1, d, x)$, even though every lexicographic order is tractable (according to our aforementioned classification). This daunting hardness also emphasizes how ranked direct access is fundamentally harder than ranked enumeration where, in the case of the sum of attributes, the answers of every full CQ can be enumerated with logarithmic delay after a quasilinear preprocessing time [24].

To understand the root cause of the hardness of sum, we narrow our question to a considerably weaker guarantee. Our notion of tractability so far requires the construction of a structure in quasilinear time and a direct access in logarithmic time. In particular, if our goal is to compute just a single quantile, say the $k$th answer, then it takes quasilinear time. Computing a single quantile is known as the selection problem [6]. The question we ask is to what extent selection is a weaker requirement than direct access in the case of CQs. That is, how much larger is the class of CQs with quasilinear selection than that of CQs with a quasilinear construction of a logarithmic-access structure?

We answer the above question for the class of full CQs without self-joins by establishing the following dichotomy for the order by sum (again assuming fine-grained hypotheses): the selection problem can be solved in $O(n \log n)$ time, where $n$ is the size of the database, if and only if the hypergraph of the CQ contains at most two maximal hyperedges (w.r.t. containment). The tractable side is applicable even in the presence of self-joins, and it is achieved by adopting an algorithm by Frederickson and Johnson [14]. For illustration, the selection problem is solvable in quasilinear time for the query Visits ⋈ Cases ordered by sum.
Outline.
The remainder of the paper is organized as follows. Section 2 gives the necessary background. In Section 3 we consider direct access by lexicographic orders that include all the free variables, and Section 4 extends the results to partial ones. We move on to the (for the most part) negative results for direct access by sum orderings in Section 5 and then study the selection problem in Section 6. Section 7 concludes and gives some directions for future work. Due to space constraints, some proofs are in the Appendix.
Database. A schema $\mathcal{S}$ is a set of relational symbols $\{R_1, \ldots, R_m\}$. We use $ar(R)$ for the arity of a relational symbol $R$. A database instance $I$ contains a finite relation $R^I \subseteq dom^{ar(R)}$ for each $R \in \mathcal{S}$, where $dom$ is a set of constant values called the domain. We use $n$ for the size of the database, i.e., the total number of tuples.

Queries. A conjunctive query (CQ) $Q$ over schema $\mathcal{S}$ is an expression of the form $Q(\vec{X}_f) \,{:}{-}\, R_1(\vec{X}_1), \ldots, R_\ell(\vec{X}_\ell)$, where the tuples $\vec{X}_f, \vec{X}_1, \ldots, \vec{X}_\ell$ hold variables, every variable in $\vec{X}_f$ appears in some $\vec{X}_1, \ldots, \vec{X}_\ell$, and $R_1, \ldots, R_\ell \in \mathcal{S}$. Each $R_i(\vec{X}_i)$ is called an atom of the query $Q$, and $atoms(Q)$ denotes the set of all atoms. We use $var(e)$ or $var(Q)$ for the set of variables that appear in an atom $e$ or query $Q$, respectively. The variables $\vec{X}_f$ are called free and are denoted by $free(Q)$. A CQ is full if $free(Q) = var(Q)$ and Boolean if $free(Q) = \emptyset$. Sometimes, we say that CQs that are not full have projections. A repeated occurrence of a relational symbol is a self-join and if no self-joins exist, a CQ is called self-join-free. A homomorphism $\mu$ from a CQ $Q$ to a database $I$ is a mapping of $var(Q)$ to constants from $dom$, such that every atom of $Q$ maps to a tuple in the database $I$. A query answer $q$ is such a homomorphism followed by a projection of $\mu$ on the free variables, denoted by $\pi_{free(Q)}(\mu)$. The answer to a Boolean CQ is whether such a homomorphism exists. The set of query answers is $Q(I)$.

Hypergraphs. A hypergraph $\mathcal{H} = (V, E)$ is a set $V$ of vertices and a set $E$ of subsets of $V$ called hyperedges. Two vertices in a hypergraph are neighbors if they appear in the same edge.
A path of $\mathcal{H}$ is a sequence of vertices such that every two succeeding vertices are neighbors. A chordless path is a path in which no two non-succeeding vertices appear in the same hyperedge (in particular, no vertex appears twice). A join tree of a hypergraph $\mathcal{H} = (V, E)$ is a tree $T$ where the nodes are the hyperedges of $\mathcal{H}$ and the running intersection property holds, namely: for all $u \in V$ the set $\{e \in E \mid u \in e\}$ forms a (connected) subtree in $T$. An equivalent phrasing of the running intersection property is that given two nodes $e_1, e_2$ of the tree, for any node $e_3$ on the simple path between them, we have that $e_1 \cap e_2 \subseteq e_3$. A hypergraph $\mathcal{H}$ is acyclic if there exists a join tree for $\mathcal{H}$. We associate a hypergraph $\mathcal{H}(Q) = (V, E)$ to a CQ $Q$ where the vertices are the variables of $Q$, and every atom of $Q$ corresponds to a hyperedge with the same set of variables. Stated differently, $V = var(Q)$ and $E = \{var(e) \mid e \in atoms(Q)\}$. With a slight abuse of notation, we identify atoms of $Q$ with hyperedges of $\mathcal{H}(Q)$. A CQ $Q$ is acyclic if $\mathcal{H}(Q)$ is acyclic, otherwise it is cyclic.

Free-connex CQs.
A hypergraph $\mathcal{H}'$ is an inclusive extension of $\mathcal{H}$ if every edge of $\mathcal{H}$ appears in $\mathcal{H}'$, and every edge of $\mathcal{H}'$ is a subset of some edge in $\mathcal{H}$. Given a subset $L$ of the vertices of $\mathcal{H}$, a tree $T$ is an ext-$L$-connex tree (i.e., extension-$L$-connex tree) for a hypergraph $\mathcal{H}$ if: (1) $T$ is a join tree of an inclusive extension of $\mathcal{H}$, and (2) there is a subtree $T'$ of $T$ that contains exactly the vertices $L$ [3]. We say that a hypergraph is $L$-connex if it has an ext-$L$-connex tree [3]. A hypergraph is $L$-connex iff it is acyclic and it remains acyclic after the addition of a hyperedge containing exactly $L$ [7]. Given a hypergraph $\mathcal{H}$ and a subset $L$ of its vertices, an $L$-path is a chordless path $(x, z_1, \ldots, z_k, y)$ in $\mathcal{H}$ with $k \geq 1$, such that $x, y \in L$, and $z_1, \ldots, z_k \notin L$. A hypergraph is $L$-connex iff it has no $L$-path [3]. A CQ $Q$ is free-connex if $\mathcal{H}(Q)$ is $free(Q)$-connex [3].

Orders of Answers.
For a CQ $Q$ and database instance $I$, a ranking function $rank : Q(I) \times Q(I) \to Q(I)$ compares two query answers and returns the smaller one according to some underlying total order. (Without loss of generality, we assume that the order is ascending, but all results hold if rank returns the bigger (max) instead of the smaller (min).) We consider two types of orders in this paper. Assuming that the domain values are ordered, a lexicographic order $L$ is an ordering of $free(Q)$ such that $rank(q_1, q_2)$ first compares $q_1, q_2$ on the value of the first variable in $L$, and if they are equal, on the value of the second variable in $L$, and so on. A lexicographic order is called partial if the variables in $L$ are a subset of $free(Q)$.

The second type of order assumes a given weight function that assigns a real-valued weight to the domain values of each variable. More precisely, for a variable $x$, we define $w_x : dom \to \mathbb{R}$ and then the weight of a query answer is computed by aggregating the weights of the assigned values of free variables. In a sum-of-weights order, denoted by $\Sigma w$, we have $w_Q(q) = \sum_{x \in free(Q)} w_x(q(x))$ for $q \in Q(I)$, and $rank(q_1, q_2)$ compares $w_Q(q_1)$ with $w_Q(q_2)$. To simplify notation, we refer to all $w_x$ and $w_Q$ together as one weight function $w$. If two query answers have the same weight, then we break ties arbitrarily but consistently, e.g., according to a lexicographic order on their assigned values.

Attribute Weights vs. Tuple Weights.
Notice that in the definition above, we assume that the input weights are assigned to the domain values of the attributes. Alternatively, the input weights could be assigned to the relation tuples, a convention that has been used in past work on ranked enumeration [24]. Since there are several reasonable semantics for interpreting a tuple-weight ranking for CQs with projections and/or self-joins, we elect to present our results for the case of attribute weights. For self-join-free CQs, attribute weights can easily be transformed to tuple weights in linear time such that the weights of the query answers remain the same. This works by assigning each variable to one of the atoms that it appears in, and computing the weight of a tuple by aggregating the weights of the assigned attribute values. Therefore, our hardness results for sum-of-weights orders directly extend to the case of tuple weights. Moreover, note that our positive results on selection (Section 6.2) rely on algorithms that innately operate on tuple weights, thus we cover that case too.
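The transformation from attribute weights to tuple weights can be sketched in Python as follows. This is only an illustration under our own assumptions: the encoding of atoms as (relation name, variable tuple) pairs and the names `assign_variables` and `tuple_weights` are hypothetical and do not come from the paper, and each free variable is arbitrarily assigned to the first atom that contains it.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]      # (relation name, variables); hypothetical encoding
Row = Tuple[Hashable, ...]


def assign_variables(atoms: List[Atom], free_vars: Set[str]) -> Dict[str, str]:
    """Assign every free variable to the first atom that contains it."""
    owner: Dict[str, str] = {}
    for name, variables in atoms:
        for v in variables:
            if v in free_vars:
                owner.setdefault(v, name)
    return owner


def tuple_weights(
    atoms: List[Atom],
    free_vars: Set[str],
    database: Dict[str, List[Row]],
    weight_fns: Dict[str, Callable[[Hashable], float]],
) -> Dict[str, Dict[Row, float]]:
    """Give each tuple the sum of the attribute weights of the variables it owns.

    Assumes a self-join-free CQ, so a relation name identifies a single atom.
    Runs in one pass over the database (the query size is treated as constant).
    """
    owner = assign_variables(atoms, free_vars)
    result: Dict[str, Dict[Row, float]] = {}
    for name, variables in atoms:
        # Positions whose variable is owned by this atom (first occurrence only).
        owned, seen = [], set()
        for i, v in enumerate(variables):
            if owner.get(v) == name and v not in seen:
                owned.append(i)
                seen.add(v)
        result[name] = {
            row: sum(weight_fns[variables[i]](row[i]) for i in owned)
            for row in database[name]
        }
    return result
```

Because every free variable is owned by exactly one atom, summing the tuple weights along any homomorphism yields the same value as summing the attribute weights of the answer, which is the invariant the paragraph above relies on.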
Direct access vs. Selection.
In the problem of direct access by an underlying order, we are given as an input a query $Q$ and a database $I$, and the goal is to construct a data structure which then allows us to support accesses on the sorted array of query answers. Specifically, an access asks for the query answer at index $k$ on the (implicit) array containing $Q(I)$ sorted via rank comparisons, for a given integer $k$. This data structure is built in a preprocessing phase, after which we have to be able to support multiple such accesses. Our goal is to achieve efficient access (in polylogarithmic time) with a preprocessing phase whose running time is significantly smaller than the size of $Q(I)$ (quasilinear in the database size).

The problem of selection [6, 12, 13] is a computationally easier task that requires only a single direct access, hence does not make a distinction between preprocessing and access phases. For example, a special case of the problem is to find the median query result.

We measure asymptotic complexity in terms of the size of the database $n$, while the size of the query is considered constant. The model of computation is the RAM model with uniform cost measure. In particular, it allows for linear-time construction of lookup tables, which can be accessed in constant time. We would like to point out that some past works [3, 9] have assumed that in certain variants of the model, sorting can be done in linear time [17]. Since we consider problems related to summation and sorting [14] where a linear-time sort would improve otherwise optimal bounds, we adopt a more standard assumption that sorting is comparison-based and possible only in quasilinear time. As a consequence, some upper bounds mentioned in this paper are weaker than the original sources which assumed linear-time sorting [7, 9].

Hardness Hypotheses.
Denote by sparseBMM the hypothesis that two Boolean matrices $A$ and $B$, represented as lists of their non-zero entries, cannot be multiplied in time $m^{1+o(1)}$, where $m$ is the number of non-zero entries in $A$, $B$, and $AB$. A consequence of this hypothesis is that we cannot answer the query $Q(x, z) \,{:}{-}\, R(x, y), S(y, z)$ with quasilinear preprocessing and polylogarithmic delay. In more general terms, any self-join-free acyclic non-free-connex CQ cannot be enumerated with quasilinear preprocessing time and polylogarithmic delay assuming the sparseBMM hypothesis [3, 5].

A $(k+1, k)$-hyperclique is a set of $k+1$ vertices such that every $k$-element subset is a hyperedge. Denote by Hyperclique the hypothesis that for every $k \geq 2$ there is no $O(m \,\mathrm{polylog}\, m)$ algorithm for deciding the existence of a $(k+1, k)$-hyperclique in a $k$-uniform hypergraph with $m$ hyperedges. When $k = 2$, this follows from the $\delta$-Triangle hypothesis [1] for any $\delta > 0$. When $k \geq 3$, this is a special case of the $(\ell, k)$-Hyperclique Hypothesis [19]. A known consequence is that Boolean cyclic and self-join-free CQs cannot be answered in quasilinear time [7]. Moreover, cyclic and self-join-free CQs do not admit enumeration with quasilinear preprocessing time and polylogarithmic delay assuming the Hyperclique hypothesis [7].

In its simplest form, the 3SUM problem asks for three distinct real numbers $a, b, c$ from a set $S$ with $n$ elements that satisfy $a + b + c = 0$. There is a simple $O(n^2)$ algorithm for the problem, but it is conjectured that in general, no truly subquadratic solution exists [23]. The significance of this conjecture has been highlighted by many conditional lower bounds for problems in computational geometry [15] and within the P class in general [26]. Note that the problem remains hard even for integers provided that they are sufficiently large (i.e., in the order of $O(n^3)$) [23]. We denote by 3sum the following equivalent hypothesis [4] that uses three different sets of numbers: deciding whether there exist $a \in A, b \in B, c \in C$ from three sets of integers $A, B, C$ such that $a + b + c = 0$ cannot be done in time $O(n^{2-\epsilon})$ for any $\epsilon > 0$. This lower bound has been confirmed in some restricted models of computation [2, 11].
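For concreteness, the quadratic upper bound for the three-set variant can be sketched as below. The paper only states that a simple quadratic algorithm exists; the sort-and-two-pointer scan shown here is one standard way to obtain it, and the function name `three_sum_exists` is our own.

```python
from typing import Iterable


def three_sum_exists(A: Iterable[int], B: Iterable[int], C: Iterable[int]) -> bool:
    """Decide whether some a in A, b in B, c in C satisfy a + b + c == 0.

    After sorting B and C, each a is handled by one linear two-pointer scan,
    giving quadratic time overall for three sets of size n.
    """
    b_sorted = sorted(B)
    c_sorted = sorted(C)
    for a in A:
        i, j = 0, len(c_sorted) - 1      # i walks B upwards, j walks C downwards
        while i < len(b_sorted) and j >= 0:
            total = a + b_sorted[i] + c_sorted[j]
            if total == 0:
                return True
            if total < 0:
                i += 1                   # sum too small: take a larger b
            else:
                j -= 1                   # sum too large: take a smaller c
    return False
```

The 3sum hypothesis asserts that no algorithm improves on this by a polynomial factor.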
Known Results for CQs.
We now provide some background that relates to the efficient handling of CQs. For a query with projections, a standard strategy is to reduce it to an equivalent one where techniques for acyclic full CQs can be leveraged. The following proposition, which is widely known and used [5], shows that this is possible for free-connex CQs.

Proposition 2.1 (Folklore).
Given a CQ $Q$ over a database $I$, a join tree $T$ of an inclusive extension of $Q$ and a subtree $T'$ of $T$ that contains all free variables, it is possible to compute in linear time a database $I'$ over the schema of the CQ $Q'$ consisting of the nodes of $T'$ such that $Q(I) = Q'(I')$.

This reduction is done by first creating a relation for every node in $T$ using projections of existing relations, then performing the classic semi-join reduction by Yannakakis [28] to filter the relations of $T'$ according to the relations of $T$, after which we can simply ignore all relations that do not appear in $T'$ and obtain the same answers. Afterwards, the remaining relations can be handled efficiently, e.g., their answers can be enumerated with constant delay [3].

For direct access, past work has identified the tractable queries, yet there is no guarantee on the order of the query answers.

Theorem 2.2 ([7, 9]). Let $Q$ be a CQ. If $Q$ is free-connex, then direct access (in some order) is possible with $O(n \log n)$ preprocessing and $O(\log n)$ delay. Otherwise, if it is also self-join-free, then direct access (in any order) is not possible with $O(n \,\mathrm{polylog}\, n)$ preprocessing and $O(\mathrm{polylog}\, n)$ delay, assuming sparseBMM and Hyperclique. (Works in the literature typically phrase this as linear, yet any logarithmic factor increase is still covered by the hypotheses.)

The established direct access algorithms are allowed to internally choose any order, while in this paper, we receive a desired order as input. Even though these algorithms do not explicitly discuss the order of the answers, a closer look shows that they produce a lexicographic order. The algorithm of Carmeli et al. [9, Algorithm 3] assumes that a join tree is given with the CQ, and the variable order is imposed by the join tree. Specifically, it is the one achieved by a preorder depth-first traversal of the tree. The algorithm of Brault-Baron [7, Algorithm 4.3] assumes that an elimination order is given along with the CQ.
The resulting lexicographic order is affected by that elimination order, but is not necessarily the same. Moreover, there exist orders (which we show in this paper to be tractable) that these algorithms cannot produce. For instance, these include lexicographic orders that interleave variables from different atoms, such as the order $\langle \cdot, c_1, \cdot, \cdot, c_2, x \rangle$ for the query $Q(c_1, d, x, p, a, c_2) \,{:}{-}\, \mathrm{Visits}(p, a, c_2), \mathrm{Cases}(c_1, d, x)$ of Section 1.

In this section, we answer the following question: for which underlying lexicographic orders can we achieve "tractable" direct access to ranked CQ answers, i.e., with quasilinear preprocessing and polylogarithmic time per answer?
Example 3.1 (No direct access).
Consider the lexicographic order $L = \langle v_1, v_2, v_3 \rangle$ for the query $Q(v_1, v_2, v_3) \,{:}{-}\, R(v_1, v_3), S(v_3, v_2)$. Direct access to the query answers according to that order would allow us to "jump over" the $v_3$ values via binary search and essentially enumerate the answers to $Q'(v_1, v_2) \,{:}{-}\, R(v_1, v_3), S(v_3, v_2)$. However, we know that $Q'$ is not free-connex and that it is impossible to achieve enumeration with quasilinear preprocessing and polylogarithmic delay (if sparseBMM holds). Therefore, the bounds we are hoping for are out of reach for the given query and order. The core difficulty is that the joining variable $v_3$ appears after the other two in the lexicographic order.

We formalize this notion of "variable in the middle" in order to detect similar situations in more complex queries.

Definition 3.2 (Disruptive Trio).
Let $Q$ be a CQ and $L$ a lexicographic order of its free variables. We say that three free variables $u_1, u_2, u_3$ are a disruptive trio in $Q$ with respect to $L$ if $u_1$ and $u_2$ are not neighbors (i.e., they do not appear together in an atom), $u_3$ is a neighbor of both $u_1$ and $u_2$, and $u_3$ appears after $u_1$ and $u_2$ in $L$.

As it turns out, when considering free-connex and self-join-free CQs, the tractable CQs are precisely captured by this simple criterion. Regarding self-join-free CQs that are not free-connex, their known intractability of enumeration implies that direct access is also intractable. This leads to the following dichotomy:

Theorem 3.3. Let $Q$ be a CQ and $L$ be a lexicographic order.
• If $Q$ is free-connex and does not have a disruptive trio with respect to $L$, then direct access by $L$ is possible with $O(n \log n)$ preprocessing and $O(\log n)$ time per access.
• Otherwise, if $Q$ is also self-join-free, then direct access by $L$ is not possible with $O(n \,\mathrm{polylog}\, n)$ preprocessing and $O(\mathrm{polylog}\, n)$ time per access, assuming sparseBMM and Hyperclique.

Remark 1.
On the positive side of Theorem 3.3, the preprocessing time is dominated by sorting the input relations, which we assume requires $O(n \log n)$ time. If we assume instead that sorting takes linear time (as assumed in some related work [7, 9, 17]), then the time required for preprocessing is only $O(n)$ instead of $O(n \log n)$.

In Section 3.1, we provide an algorithm for this problem for full acyclic CQs that have a particular join tree that we call layered. Then, we show how to find such a layered join tree whenever there is no disruptive trio in Section 3.2. In Section 3.3, we explain how to adapt our solution for CQs with projections, and in Section 3.4 we prove a lower bound which establishes that our algorithm applies to all cases where direct access is tractable.
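The disruptive-trio condition of Definition 3.2 is easy to test by brute force, since the query size is treated as a constant. The following Python sketch is our own illustration, not the paper's procedure: atoms are encoded as tuples of variable names, `order` lists the free variables in lexicographic order, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Set, Tuple


def neighbor_map(atoms: Sequence[Sequence[str]]) -> Dict[str, Set[str]]:
    """Two variables are neighbors if they appear together in some atom."""
    adj: Dict[str, Set[str]] = {}
    for variables in atoms:
        for v in variables:
            adj.setdefault(v, set()).update(u for u in variables if u != v)
    return adj


def find_disruptive_trio(
    atoms: Sequence[Sequence[str]], order: Sequence[str]
) -> Optional[Tuple[str, str, str]]:
    """Return some disruptive trio (u1, u2, u3) w.r.t. `order`, or None."""
    adj = neighbor_map(atoms)
    pos = {v: i for i, v in enumerate(order)}
    for u3 in order:
        # Neighbors of u3 that precede it in the order, in order position.
        earlier = sorted(
            (u for u in adj.get(u3, set()) if u in pos and pos[u] < pos[u3]),
            key=pos.__getitem__,
        )
        for u1, u2 in combinations(earlier, 2):
            if u2 not in adj[u1]:        # u1 and u2 are not neighbors
                return (u1, u2, u3)
    return None
```

For a two-atom path query whose join variable comes last in the order, as in Example 3.1, the function reports the trio formed by the two outer variables and the join variable; placing the join variable between them yields `None`. Note that this check covers only the trio condition; the free-connex (or $L$-connex) requirement of the classification has to be tested separately.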
Before we explain the algorithm, we first define one of its main components. A layered join tree is a join tree of an inclusive extension of a hypergraph, where each node belongs to a layer. The layer number matches the position in the lexicographic order of the last variable that the node contains. Intuitively, "peeling" off the outermost (largest) layers must result in a valid join tree (for a hypergraph with fewer variables).
Definition 3.4 (Layered Join Tree).
Let $Q$ be a full acyclic CQ, and let $L = \langle v_1, \ldots, v_f \rangle$ be a lexicographic order. A layered join tree for $Q$ with respect to $L$ is a join tree of an inclusive extension of $\mathcal{H}(Q)$ where (1) every vertex $V$ is assigned to layer $\max\{i \mid v_i \in V\}$, (2) there is exactly one vertex for each layer, and (3) for all $j \leq f$ the induced subgraph with only the vertices that belong to the first $j$ layers is a tree.

Example 3.5.
Consider the CQ $Q(v_1, v_2, v_3, v_4) \,{:}{-}\, R(v_1, v_3), S(v_2, v_4)$ and the lexicographic order $\langle v_1, v_2, v_3, v_4 \rangle$. To support that order, we first take an inclusive extension of its hypergraph, shown in Figure 1a. Notice that we added two hyperedges that are strictly contained in the existing ones. A layered join tree constructed from that hypergraph is depicted in Figure 1b. There are four layers, one for each vertex of the join tree. The layer of the vertex containing $\{v_1, v_3\}$ is 3 because $v_3$ appears after $v_1$ in the order and it is the third variable. If we remove the last layer, then we obtain a join tree for the induced hypergraph where the last variable $v_4$ is removed.

[Figure 1: Constructing a layered join tree for the query $Q(v_1, v_2, v_3, v_4) \,{:}{-}\, R(v_1, v_3), S(v_2, v_4)$ and order $\langle v_1, v_2, v_3, v_4 \rangle$. (a) A hypergraph that is an inclusive extension of $\mathcal{H}(Q)$. (b) A layered join tree for $Q$ w.r.t. the lexicographic order.]

We now describe an algorithm that takes as an input a CQ $Q$, a lexicographic order $L$, and a corresponding layered join tree and provides direct access to the query answers after a preprocessing phase. For preprocessing, we leverage a construction from Carmeli et al. [9, Algorithm 2] and apply it to our layered join tree. For completeness, we briefly explain how it works below. Subsequently, we describe the access phase that takes into account the layers of the tree to accommodate the provided lexicographic order. Thus, the way we access the structure is different than that of past work [9]. This allows us to support lexicographic orders that were impossible for the existing algorithms (e.g., that of Example 3.5).

Preprocessing.
The preprocessing phase (1) creates a relation for every vertex of the tree, (2) removes dangling tuples, (3) sorts the relations, (4) partitions the relations into buckets, and (5) uses dynamic programming on the tree to compute and store certain counts. After preprocessing, we are guaranteed that for all $i$, the vertex of layer $i$ has a corresponding relation where each tuple participates in at least one query answer; this relation is partitioned into buckets by the assignment of the variables preceding $v_i$. In each bucket, we sort the tuples lexicographically by $v_i$. Each tuple is given a weight that indicates the number of different answers this tuple agrees with when only joining its subtree. The weight of each bucket is the sum of its tuple weights. We denote both by the function weight. Moreover, for every tuple $t$, we compute the sum of weights of the preceding tuples in the bucket, denoted by $start(t)$. We use $end(t)$ for the sum that corresponds to the tuple following $t$ in the same bucket; if $t$ is last, we set this to be the bucket weight. If we think of the query answers in the subtree sorted in the order of $v_i$ values, then start and end distribute the indices between 0 and the bucket weight to tuples. The number of indices within the range of each tuple corresponds to its weight.

Example 3.6 (Continued).
The result of the preprocessing phase on an example database for our query $Q$ is shown in Figure 2. Notice that $R$ has been split into two buckets according to the values of its parent $R'$, one for value $a_1$ and one for $a_2$. For tuple $(a_1) \in R'$, we have $\mathrm{weight}((a_1)) = 2 \cdot 4 = 8$ because the left subtree has 2 answers which can be combined with any of the 4 possible answers of the right subtree. The start index of tuple $(b_1, d_3) \in S$ is the sum of the previous weights within the bucket: $\mathrm{start}((b_1, d_3)) = \mathrm{weight}((b_1, d_1)) + \mathrm{weight}((b_1, d_2)) = 1 + 1 = 2$. Not shown in the figure is that every bucket stores the sum of weights it contains.
Figure 2: Example 3.6: The result of the preprocessing phase on $Q$, the layered join tree (Figure 1b) and an example database. The weight and start index for each tuple are abbreviated in the figure as $w$ and $s$ respectively.
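The bookkeeping of the preprocessing phase (tuple weights, start and end offsets, bucket weights) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the construction of [9]: the names `build_buckets` and `child_weight` are hypothetical, dangling tuples are assumed to be removed already, every tuple is assumed to list the bucket-key variables first and the layer's own variable last, and the small relations are made up for illustration.

```python
from collections import defaultdict

def build_buckets(tuples, key_len, child_weight):
    """Partition `tuples` into buckets by their first `key_len` fields and
    annotate every tuple with (weight, start, end).

    `child_weight(t)` is a hypothetical callback that returns the product of
    the weights of the child buckets selected by tuple t (1 for a leaf).
    """
    buckets = defaultdict(list)
    for t in sorted(tuples):                 # lexicographic sort
        buckets[t[:key_len]].append(t)
    result = {}
    for key, members in buckets.items():
        rows, running = [], 0
        for t in members:
            w = child_weight(t)
            rows.append((t, w, running, running + w))  # tuple, weight, start, end
            running += w
        result[key] = {"rows": rows, "weight": running}
    return result

# Hypothetical leaf relation S(v2, v4): every tuple has weight 1.
S = [("b1", "d1"), ("b1", "d2"), ("b1", "d3"), ("b2", "d1")]
S_buckets = build_buckets(S, key_len=1, child_weight=lambda t: 1)

# Hypothetical parent S'(v2) with a single bucket: each tuple inherits the
# weight of the bucket it selects in S.
S_prime = build_buckets([("b1",), ("b2",)], key_len=0,
                        child_weight=lambda t: S_buckets[t]["weight"])
```

Processing the relations bottom-up in this way corresponds to the dynamic-programming step (5); sorting dominates the cost.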
Access.
The access phase works by going through the tree layer by layer. When resolving a layer $i$, we select a tuple from its corresponding relation, which sets a value for the $i$th variable in $L$, and also determines a bucket for each child. Then, we erase the vertex of layer $i$ and its outgoing edges.
The access algorithm maintains a directed forest and an assignment to a prefix of the variables. Each tree in the forest represents the answers obtained by joining its relations. Each root contains a single bucket that agrees with the already assigned values, thus every answer agrees on the prefix. Due to the running intersection property, different trees cannot share unassigned variables. As a consequence, any combination of answers from different trees can be added to the prefix assignment to form an answer to $Q$. The answers obtained this way are exactly the answers to $Q$ that agree with the already set assignment. Since we start with a layered join tree, we are guaranteed that at each step, the next layer (which corresponds to the variable following the prefix for which we have an assignment) appears as a root in the forest.
Recall that from the preprocessing phase, the weight of each root is the number of answers in its tree. When we are at layer $i$, we have to take into account the weights of all the other roots in order to compute the number of query answers for a particular tuple. More specifically, the number of answers to $Q$ containing the already selected attributes (smaller than $i$) and some $v_i$ value contained in a tuple is found by multiplying the tuple weight with the weights of all other roots. That is because the answers from all trees can be combined into a query answer. Let $t$ be the selected tuple when resolving the $i$th layer.
The number of answers to $Q$ that have a value of $L[i]$ smaller than that of $t$ and a value of $L[j]$ equal to that of $t$ for all $j < i$ is then:
$$\sum_{t'} \Bigl( \mathrm{weight}(t') \prod_{r \in \mathrm{roots}} \mathrm{weight}(r) \Bigr)$$
where $t'$ ranges over tuples preceding $t$ in its bucket and $r$ ranges over the other roots. Denote by factor the product of the weights of these other roots. Then we can rewrite as:
$$\Bigl( \sum_{t'} \mathrm{weight}(t') \Bigr) \Bigl( \prod_{r \in \mathrm{roots}} \mathrm{weight}(r) \Bigr) = \mathrm{start}(t) \cdot \mathrm{factor}.$$
Therefore, when resolving layer $i$ we select the last tuple $t$ such that the index we want to access is at least $\mathrm{start}(t) \cdot \mathrm{factor}$. Algorithm 1
Lexicographic Random-Access
 1: if $k \ge \mathrm{weight}(\mathrm{root})$ then
 2:     return out-of-bound
 3: $\mathrm{bucket}[1] = \mathrm{root}$
 4: $\mathrm{factor} = \mathrm{weight}(\mathrm{root})$
 5: for $i = 1, \ldots, f$ do
 6:     $\mathrm{factor} = \mathrm{factor} / \mathrm{weight}(\mathrm{bucket}[i])$
 7:     pick $t \in \mathrm{bucket}[i]$ s.t. $\mathrm{start}(t) \cdot \mathrm{factor} \le k < \mathrm{end}(t) \cdot \mathrm{factor}$
 8:     $k = k - \mathrm{start}(t) \cdot \mathrm{factor}$
 9:     for child $V$ of layer $i$ do
10:         get the bucket $b \in V$ agreeing with the selected tuples
11:         $\mathrm{bucket}[\mathrm{layer}(V)] = b$
12:         $\mathrm{factor} = \mathrm{factor} \cdot \mathrm{weight}(b)$
13: return the answer agreeing with the selected tuples
Algorithm 1 summarizes the process we described, where $k$ is the index to be accessed and $f$ is the number of variables. Iteration $i$ resolves layer $i$. Pointers to the selected buckets from the roots are kept in a bucket array. The product of the weights of all roots is kept in a factor variable. In each iteration, the variable $k$ is updated to the index that should be accessed among the answers that agree with the already selected attribute values. Note that $\mathrm{bucket}[i]$ is always initialized when accessed since layer $i$ is guaranteed to be a child of a smaller layer. Example 3.7 (Continued).
We demonstrate how the access algorithm works for index $k = 12$. When resolving $R'$, the tuple $(a_2)$ is chosen since $8 \cdot 1 \le 12 < 16 \cdot 1$; then, the single bucket in $S'$ and the bucket containing $a_2$ in $R$ are selected. The next iteration resolves $S'$. When it reaches line 7, $k = 12 - 8 = 4$ and $\mathrm{factor} = 2$ (the weight of the selected bucket of $R$). As $0 \cdot 2 \le 4 < 3 \cdot 2$, the tuple $(b_1)$ is selected. Next, $R$ is resolved, which we depict in Figure 3. The current index is $k = 4 - 0 = 4$. The weights of the other roots (only $S$ here) give us $\mathrm{factor} = 3$. To make our choice in $R$, we multiply the weights of the tuples by $\mathrm{factor} = 3$. Then, we find that the index $k$ we are looking for falls into the range of $(a_2, c_2)$ because $1 \cdot 3 \le 4 < 2 \cdot 3$. Next, $S$ is resolved, $k = 4 - 1 \cdot 3 = 1$, and $\mathrm{factor} = 1$. As $1 \cdot 1 \le 1 < 2 \cdot 1$, the tuple $(b_1, d_2)$ is selected. Overall, answer number 12 (the 13th answer) is $(a_2, b_1, c_2, d_2)$.
Figure 3: Example 3.7: Illustration of an iteration of the access phase where the layer corresponding to $R$ is resolved. The weight of the selected bucket of $S$ is $1 + 1 + 1 = 3$ answers, and $k = 4$.
Lemma 3.8. Let $Q$ be a full acyclic CQ, and $L = \langle v_1, \ldots, v_f \rangle$ be a lexicographic order. If there is a layered join tree for $Q$ with respect to $L$, then direct access is possible with $O(n \log n)$ preprocessing and $O(\log n)$ time per access.
Proof. The correctness of Algorithm 1 follows from the discussion above. For the time complexity, note that it contains a constant number of operations (assuming the number of attributes $f$ is fixed). Line 7 can be done in logarithmic time using binary search, while all other operations only require constant time in the RAM model. Thus, we obtain direct access in logarithmic time per answer after the quasilinear preprocessing (dominated by sorting). □
With minor modifications, the algorithm we presented in this section can be used for the (reverse) task of inverted access. We describe this variation in Appendix B.
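The access phase of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the dictionary layout of `layers`, the helpers `make_bucket` and `leaf_buckets`, and the database instance are all assumptions made here (layers are 0-based, and the instance is hypothetical, chosen to be small enough to check by brute force).

```python
from bisect import bisect_right

def make_bucket(rows):
    """rows: list of (value, weight) pairs, already sorted by value."""
    starts, total = [], 0
    for _, w in rows:
        starts.append(total)          # start(t) = sum of preceding weights
        total += w
    return {"values": [v for v, _ in rows], "starts": starts, "weight": total}

def access(layers, k):
    """Return the k-th answer (0-based) in lexicographic order, or None."""
    root = layers[0]["buckets"][()]
    if k >= root["weight"]:
        return None                   # out-of-bound
    answer = [None] * len(layers)
    bucket = {0: root}
    factor = root["weight"]
    for i, layer in enumerate(layers):
        b = bucket[i]
        factor //= b["weight"]        # product of the weights of the other roots
        # Binary search for the last tuple with start(t) * factor <= k.
        j = bisect_right(b["starts"], k // factor) - 1
        answer[i] = b["values"][j]
        k -= b["starts"][j] * factor
        for c in layer["children"]:
            key = tuple(answer[v] for v in layers[c]["key"])
            bucket[c] = layers[c]["buckets"][key]
            factor *= bucket[c]["weight"]
    return tuple(answer)

def leaf_buckets(rel):
    """Buckets of a binary leaf relation, keyed by its first attribute."""
    groups = {}
    for key, val in sorted(rel):
        groups.setdefault((key,), []).append((val, 1))
    return {k: make_bucket(rows) for k, rows in groups.items()}

# Hypothetical instance for Q(v1,v2,v3,v4) :- R(v1,v3), S(v2,v4).
R = [("a1", "c1"), ("a1", "c2"), ("a2", "c1"), ("a2", "c2")]
S = [("b1", "d1"), ("b1", "d2"), ("b1", "d3"), ("b2", "d1")]
R_b, S_b = leaf_buckets(R), leaf_buckets(S)
S1 = make_bucket([(b, S_b[(b,)]["weight"]) for b in sorted({b for b, _ in S})])
R1 = make_bucket([(a, R_b[(a,)]["weight"] * S1["weight"])
                  for a in sorted({a for a, _ in R})])
layers = [
    {"key": [], "children": [1, 2], "buckets": {(): R1}},   # R'(v1), the root
    {"key": [], "children": [3], "buckets": {(): S1}},      # S'(v2)
    {"key": [0], "children": [], "buckets": R_b},           # R(v1, v3)
    {"key": [1], "children": [], "buckets": S_b},           # S(v2, v4)
]
```

The comparison `start(t) * factor <= k` is evaluated as `start(t) <= k // factor`, which is equivalent for non-negative integers and lets the standard `bisect` module perform the binary search of line 7.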
We now have an algorithm that can be applied whenever we have a layered join tree. We next show that the existence of such a join tree relies on the disruptive trio condition we introduced earlier. In particular, if no disruptive trio exists, we are able to construct a layered join tree for full acyclic CQs.
Lemma 3.9. Let $Q$ be a full acyclic CQ, and $L$ be a lexicographic order. If $Q$ does not have a disruptive trio with respect to $L$, then there is a layered join tree for $Q$ with respect to $L$.
Proof. We show by induction on $i$ that there exists a layered join tree for the hypergraph containing the hyperedges $\{e \cap \{v_1, \ldots, v_i\} \mid e \in \mathrm{atoms}(Q)\}$ with respect to the prefix of $L$ containing its first $i$ elements. The induction base is the tree that contains the vertex $\{v_1\}$ and no edges.
In the inductive step, we assume a layered join tree with $i - 1$ layers for $\{e \cap \{v_1, \ldots, v_{i-1}\} \mid e \in \mathrm{atoms}(Q)\}$, and we build a layer on top of it. Denote by $\mathcal{V}$ the sets of $\{e \cap \{v_1, \ldots, v_i\} \mid e \in \mathrm{atoms}(Q)\}$ that contain $v_i$ (these are the sets that need to be included in the new layer). First note that $\mathcal{V}$ is acyclic. Indeed, by the running intersection property, the join tree for $H(Q)$ has a subtree with all the vertices that contain $v_i$. This subtree forms a join tree for $\mathcal{V}$ after projecting out all variables that occur after $v_i$ in the ordering.
We next claim that some set in $\mathcal{V}$ contains all the others; that is, there exists $e_m \in \mathcal{V}$ such that for all $e \in \mathcal{V}$, we have that $e \subseteq e_m$. Consider a join tree for $\mathcal{V}$. Every variable of $\mathcal{V}$ defines a subtree induced by the vertices that contain this variable. If two variables are neighbors, their subtrees share a vertex. It is known that every collection of subtrees of a tree satisfies the Helly property [16]: if every two subtrees share a vertex, then some vertex is shared by all subtrees. In particular, since $\mathcal{V}$ is acyclic, if every two variables of $\mathcal{V}$ are neighbors, then some element of $\mathcal{V}$ contains all variables that appear in (elements of) $\mathcal{V}$. Thus, if, by way of contradiction, there is no such $e_m$, there exist two non-neighboring variables $v_a$ and $v_b$ that appear in (elements of) $\mathcal{V}$. Since $v_i$ appears in all elements of $\mathcal{V}$, this means that there exist $e_a, e_b \in \mathcal{V}$ with $\{v_a, v_i\} \subseteq e_a$ and $\{v_b, v_i\} \subseteq e_b$. Since $v_a$ and $v_b$ are not neighbors, these three variables are a disruptive trio: $v_a$ and $v_b$ are both neighbors of the later variable $v_i$. The existence of a disruptive trio contradicts the assumption of the lemma we are proving, and so we conclude that there is $e_m \in \mathcal{V}$ such that for all $e \in \mathcal{V}$, we have that $e \subseteq e_m$.
With $e_m$ at hand, we can now add the additional layer to the tree given by the inductive hypothesis. Insert $e_m$ with an edge to a vertex containing $e_m \setminus \{v_i\}$ (which exists by the inductive hypothesis). This results in the join tree we need: (1) the hyperedges $\{e \cap \{v_1, \ldots, v_i\} \mid e \in \mathrm{atoms}(Q)\}$ are all contained in vertices, since the ones that do not appear in the tree from the inductive hypothesis are contained in the new vertex; (2) it is a tree since we add one leaf to an existing tree; and (3) the running intersection property holds since the added vertex is connected to all of its variables that already appear in the tree. □
Lemmas 3.8 and 3.9 give a direct-access algorithm for full acyclic CQs and lexicographic orders without disruptive trios.
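The disruptive-trio condition used in the proof above (two non-neighboring variables that are both neighbors of a later variable in the order) is easy to test directly. The following Python sketch is only an illustration; the function name `find_disruptive_trio` and the representation of atoms as tuples of variable names are assumptions made here.

```python
from itertools import combinations

def find_disruptive_trio(atoms, order):
    """Return (v1, v2, v3) where v1 and v2 are non-neighbors that both
    neighbor the later variable v3, or None if no such trio exists.

    atoms: iterable of tuples of variable names (one tuple per atom).
    order: list of variables forming the (possibly partial) order L.
    """
    pos = {v: i for i, v in enumerate(order)}
    neighbors = {v: set() for v in order}
    for atom in atoms:
        ordered_vars = [v for v in atom if v in pos]
        for u, v in combinations(ordered_vars, 2):
            if u != v:
                neighbors[u].add(v)
                neighbors[v].add(u)
    for v3 in order:
        earlier = sorted((u for u in neighbors[v3] if pos[u] < pos[v3]),
                         key=pos.get)
        for v1, v2 in combinations(earlier, 2):
            if v2 not in neighbors[v1]:
                return (v1, v2, v3)
    return None

# The 2-path query R(x, y), S(y, z) under three different orders.
two_path = [("x", "y"), ("y", "z")]
trio_xzy = find_disruptive_trio(two_path, ["x", "z", "y"])
trio_xyz = find_disruptive_trio(two_path, ["x", "y", "z"])
trio_zy = find_disruptive_trio(two_path, ["z", "y"])
```

For a fixed query the check runs in time polynomial in the number of variables and atoms, independently of the database.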
Next, we show how to support CQs that have projections. A free-connex CQ can be efficiently reduced to a full acyclic CQ using Proposition 2.1. We next show that the resulting CQ contains no disruptive trio if the original CQ does not.
Lemma 3.10. Given a database instance and a free-connex CQ with no disruptive trio, an equivalent pair of database instance and full acyclic CQ with no disruptive trio can be computed in linear time, and the new CQ does not depend on the database instance.
By combining Lemmas 3.8 to 3.10, we obtain an efficient algorithm for CQs and orders with no disruptive trios. The next lemma summarizes our results so far.
Lemma 3.11.
Let $Q$ be a CQ, and $L$ be a lexicographic order. If $Q$ does not have a disruptive trio with respect to $L$, direct access by $L$ is possible with $O(n \log n)$ preprocessing and $O(\log n)$ access time.
Next, we show that our algorithm supports all feasible cases (for self-join-free CQs); we prove that all unsupported cases are intractable.
Lemma 3.12. Let $Q$ be a self-join-free CQ, and $L$ be a lexicographic order. If $Q$ has a disruptive trio with respect to $L$, then direct access by $L$ is not possible with $O(n \operatorname{polylog} n)$ preprocessing and $O(\operatorname{polylog} n)$ time per access, assuming sparseBMM.
Lemma 3.12 is a special case of the more general Lemma 4.5 that we prove later when we discuss partial lexicographic orders. Since $Q$ has a disruptive trio, two non-neighboring variables $u_1, u_2$ are both neighbors of a later variable $u_3$ in $L$. Thus, $u_1, u_3, u_2$ is a chordless path, and Lemma 4.5 implies the correctness of Lemma 3.12.
By combining Lemma 3.11 and Lemma 3.12 together with the known hardness results for non-free-connex CQs (Theorem 2.2), we prove the dichotomy given in Theorem 3.3: direct access by a lexicographic order for a self-join-free CQ is possible with quasilinear preprocessing and polylogarithmic time per answer if and only if the query is free-connex and does not have a disruptive trio with respect to the required order.
We now investigate the case where the desired lexicographic order is partial, i.e., it contains only some of the free variables. This means that there is no particular order requirement for the rest of the variables. One way to achieve direct access to a partial order is to complete it into a full lexicographic order and then leverage the results of the previous section. If such a completion is impossible, we have to consider cases where tie breaking between the non-ordered variables is done in an arbitrary way. However, we will show in this section that the tractable partial orders are precisely those that can be completed into a full lexicographic order. In particular, we will prove the following dichotomy, which also gives an easy-to-detect criterion for the tractability of direct access.
Theorem 4.1.
Let $Q$ be a CQ and $L$ be a partial lexicographic order.
• If $Q$ is free-connex and $L$-connex and does not have a disruptive trio with respect to $L$, then direct access by $L$ is possible with $O(n \log n)$ preprocessing and $O(\log n)$ time per access.
• Otherwise, if $Q$ is also self-join-free, then direct access by $L$ is not possible with $O(n \operatorname{polylog} n)$ preprocessing time and $O(\operatorname{polylog} n)$ time per access, assuming the sparseBMM and Hyperclique hypotheses.
Example 4.2. Consider the CQ $Q \mathrel{:-} R(x, y), S(y, z)$. If the free variables are exactly $x$ and $z$, then the query is not free-connex, and so it is intractable. Next assume that all variables are free. If $L = \langle x, z \rangle$, then the query is not $L$-connex, and so it is intractable. If $L = \langle x, z, y \rangle$, then $x, z, y$ is a disruptive trio, thus the query is intractable. However, if $L = \langle x, y, z \rangle$ or $L = \langle z, y \rangle$, then the query is free-connex, $L$-connex and has no disruptive trio, so it is tractable.
For the positive side, we can solve our problem efficiently if the CQ is free-connex and there is a completion of the lexicographic order to all free variables with no disruptive trio. Lemma 4.4 identifies these cases with a connexity criterion. To prove it, we first need a way to combine two different connexity properties. The proof of the following proposition uses ideas from a proof of the characterization of free-connex CQs in terms of the acyclicity of the hypergraph obtained by including a hyperedge with the free variables [5].
Proposition 4.3.
If a CQ $Q$ is both $L_1$-connex and $L_2$-connex where $L_2 \subseteq L_1$, then there exists a join tree $T$ of an inclusive extension of $Q$ with a subtree $T_1$ containing exactly the variables $L_1$ and a subtree $T_2$ of $T_1$ containing exactly the variables $L_2$.
We are now in position to show the following:
Lemma 4.4.
Let $Q$ be a CQ and $L$ be a partial lexicographic order. If $Q$ is free-connex and $L$-connex and does not have a disruptive trio with respect to $L$, then there is an ordering $L^+$ of $\mathrm{free}(Q)$ that starts with $L$ such that $Q$ has no disruptive trio with respect to $L^+$.
Proof. Take a tree $T$ for $Q$ given by Proposition 4.3 with a subtree $T_{\mathrm{free}}$ containing exactly the free variables, and a subtree $T_L$ of $T_{\mathrm{free}}$ containing exactly the variables $L$. We assume that $T_L$ contains at least one vertex; otherwise (this can only happen in case $L$ is empty), we can introduce a vertex with no variables to all of $T$, $T_{\mathrm{free}}$ and $T_L$ and connect it to any one vertex of $T_{\mathrm{free}}$. We describe a process of extending $L$ while traversing $T_{\mathrm{free}}$. Consider the vertices of $T_L$ as handled, and initialize $L^+ = L$. Then, repeatedly handle a neighbor of a handled vertex until all vertices are handled. When handling a vertex, append to $L^+$ all of its variables that are not already there. We prove by induction that $Q$ has no disruptive trio w.r.t. any prefix of $L^+$. The base case is guaranteed by the premises of this lemma since $L$ (hence all of its prefixes) has no disruptive trio.
Let $v_p$ be a new variable added to a prefix $v_1, \ldots, v_{p-1}$ of $L^+$. Let $T^+$ be the subtree of $T_{\mathrm{free}}$ with the handled vertices when adding $v_p$ to $L^+$ and let $V \notin T^+$ be the vertex being handled. Note that, since $v_p$ is being added, $v_p \in V$ but $v_p$ is not in any vertex of $T^+$.
We first claim that every neighbor $v_j$ of $v_p$ with $j < p$ is in $V$. Our arguments are illustrated in Figure 4. Since $v_j$ and $v_p$ are neighbors, they appear together in a node $V_{j,p}$ outside of $T^+$. Let $V_j$ be a node in $T^+$ containing $v_j$ (such a node exists since $v_j$ appears before $v_p$ in $L^+$). Consider the path from $V_{j,p}$ to $V_j$. Let $V_\ell$ be the last node of this path not in $T^+$. If $V_\ell \neq V$, the path between $V_\ell$ and $V$ goes only through vertices of $T^+$ (except for the end-points). Thus, concatenating the path from $V_{j,p}$ to $V_\ell$ with the path from $V_\ell$ to $V$ results in a simple path. By the running intersection property, all vertices on this path contain $v_p$. In particular, the vertex following $V_\ell$ contains $v_p$, in contradiction to the fact that $v_p$ does not appear in $T^+$. Therefore, $V_\ell = V$. By the running intersection property, since $V$ is on the path between $V_j$ and $V_{j,p}$, we have that $V$ contains $v_j$.
We now prove the induction step. We know by the inductive hypothesis that $v_1, \ldots, v_{p-1}$ have no disruptive trio. Assume by way of contradiction that appending $v_p$ introduces a disruptive trio. Then, there are two variables $v_j, v_k$ with $j < k < p$ such that $v_j, v_p$ are neighbors, $v_k, v_p$ are neighbors, but $v_j, v_k$ are not neighbors. As we proved, since $v_j$ and $v_k$ are neighbors of $v_p$ preceding it, we have that all three of them appear in the handled vertex $V$. This is a contradiction to the fact that $v_j$ and $v_k$ are not neighbors. □
The positive side of Theorem 4.1 is obtained by combining Lemma 4.4 with Theorem 3.3.
Figure 4: The induction step in Lemma 4.4. Left panel: we get a contradiction in the case where $V \neq V_\ell$. Right panel: if $v_j$ is a neighbor of $v_p$ with $j < p$, then $v_j \in V$.
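The order-completion process in the proof of Lemma 4.4 can be sketched as a breadth-first traversal. This Python sketch makes several assumptions that are not in the text: the join tree $T_{\mathrm{free}}$ of Proposition 4.3 is given as input (adjacency lists plus a variable set per vertex), the vertices of $T_L$ are passed explicitly, and the function name `complete_order` and the alphabetical tie-breaking inside a vertex are arbitrary choices.

```python
from collections import deque

def complete_order(L, tree_adj, vertex_vars, handled):
    """Extend the partial order L to all variables of the tree.

    tree_adj:    dict vertex -> list of neighboring vertices in T_free
    vertex_vars: dict vertex -> set of variables stored in that vertex
    handled:     vertices of the subtree T_L (treated as already handled)
    """
    order = list(L)
    seen = set(order)
    done = set(handled)
    queue = deque(handled)
    while queue:
        u = queue.popleft()
        for w in tree_adj[u]:
            if w in done:
                continue
            done.add(w)                      # handle a neighbor of a handled vertex
            for x in sorted(vertex_vars[w]): # append its variables not yet in L+
                if x not in seen:
                    seen.add(x)
                    order.append(x)
            queue.append(w)
    return order

# Hypothetical join tree for R(x, y), S(y, z) with all variables free.
adj = {"R": ["S"], "S": ["R"]}
variables = {"R": {"x", "y"}, "S": {"y", "z"}}
L_plus = complete_order(["z", "y"], adj, variables, handled=["S"])
```

Any traversal that only ever handles a neighbor of an already handled vertex would do; breadth-first search is just one convenient choice.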
For the negative part, we prove a generalization of Lemma 3.12. For that, we use the hardness of Boolean matrix multiplication with a construction that is similar to that of Bagan et al. [3] for the hardness of enumeration on acyclic CQs that are not free-connex.
Lemma 4.5. Let $Q$ be a self-join-free CQ and $L$ be a partial lexicographic order. If there is a chordless path $u_1, z_1, \ldots, z_k, u_2$ such that $u_1$ and $u_2$ appear in $L$ and no variable $z_i$ appears in $L$ before any of them, then direct access by $L$ is not possible with $O(n \operatorname{polylog} n)$ preprocessing and $O(\operatorname{polylog} n)$ time per access, assuming sparseBMM.
Proof. Let $Z = \{z_1, \ldots, z_k\}$. We encode Boolean matrix multiplication with $Q$ such that, in the answers to $Q$, the assignments to $u_1$ and $u_2$ form the answers to the given matrix multiplication instance, the assignments to variables of $Z$ can be skipped using binary search (given direct access), and all other variables are assigned a constant value $\bot$.
Let $A$ and $B$ be Boolean $n \times n$ matrices represented as binary relations. That is, $A \subseteq \{1, \ldots, n\}^2$, and $(a, b) \in A$ means that the entry in the $a$th row and $b$th column is 1. We define a partition of the atoms of $Q$ where $\mathcal{R}_A$ is the set of all atoms that contain $u_1$, and $\mathcal{R}_B$ holds all other atoms. Note that no atom in $\mathcal{R}_A$ contains $u_2$ (since $u_1$ and $u_2$ are not neighbors) and no atom in $\mathcal{R}_B$ contains $u_1$. Given three values $(a, b, c)$, we define a function $\tau_{(a,b,c)} : \mathrm{var}(Q) \to \{a, b, c, \bot\}$ as follows:
$$\tau_{(a,b,c)}(v) = \begin{cases} a & \text{if } v = u_1, \\ b & \text{if } v \in Z, \\ c & \text{if } v = u_2, \\ \bot & \text{otherwise.} \end{cases}$$
For a vector $\vec{v}$, we denote by $\tau_{(a,b,c)}(\vec{v})$ the vector obtained by element-wise application of $\tau_{(a,b,c)}$. We define a database instance $I$ over $Q$ as follows: For every atom $R(\vec{v})$, if $R(\vec{v}) \in \mathcal{R}_A$ we set $R^I = \{\tau_{(a,b,\bot)}(\vec{v}) \mid (a, b) \in A\}$, and if $R(\vec{v}) \in \mathcal{R}_B$ we set $R^I = \{\tau_{(\bot,b,c)}(\vec{v}) \mid (b, c) \in B\}$. Note that we do not define relations twice since $\mathcal{R}_A$ and $\mathcal{R}_B$ are disjoint and $Q$ is self-join-free.
Since $Z$ is connected, our construction guarantees that in every answer to $Q$ all $Z$ variables are assigned the same value. Since $u_1$ and $z_1 \in Z$ are neighbors, we are guaranteed that there is an atom that contains them both in $\mathcal{R}_A$. The same holds for $z_k \in Z$ and $u_2$ in $\mathcal{R}_B$. Therefore, the answers to $Q(I)$ describe the matrix multiplication. Consider a query answer $\mu$. We have that $\mu(u_1) = a$, $\mu(z_i) = b$ for all $z_i \in Z$ and $\mu(u_2) = c$ for some $(a, b) \in A$ and $(b, c) \in B$. All other variables are mapped to the constant $\bot$. Note that the answers projected to $u_1$ and $u_2$ are the answers to the matrix multiplication problem.
Assume, by way of contradiction, that direct access to the answers of $Q$ by a lexicographic order in which no variable of $Z$ occurs before any of $u_1$ and $u_2$ is possible with $O(n \operatorname{polylog} n)$ preprocessing and $O(\operatorname{polylog} n)$ time per access. We show how to find all the unique values of $u_1$ and $u_2$ in the answers efficiently. Perform the following starting with $i = 0$: access the answer at index $i$ and print its assignment to $(u_1, u_2)$. Then, set $i$ to be the index of the next answer which assigns $(u_1, u_2)$ to different values and repeat. Finding the next index can be done using binary search with a logarithmic number of direct accesses, each taking polylogarithmic time. Overall, we solve Boolean matrix multiplication in $O(m \operatorname{polylog} m)$ time, where $m$ counts the non-zero entries of $A$, $B$, and their product, contradicting sparseBMM. □
The negative part of the dichotomy has three cases. First, if $Q$ is not free-connex, then we know that direct access by any order is intractable according to Theorem 2.2. Next, if $Q$ has a disruptive trio $u_1, u_2, u_3$ with respect to $L$, then $u_1, u_3, u_2$ is a chordless path satisfying the conditions of Lemma 4.5. The last case is that $Q$ is not $L$-connex. In this case, there is an $L$-path, and this path satisfies the conditions of Lemma 4.5. Therefore, we obtain that the last two cases are hard too, assuming the sparseBMM hypothesis.
We now consider direct access for the more general orderings based on $\Sigma w$ (the sum of attribute weights).
As with lexicographic orderings, we are able to exhaustively characterize the class of self-join-free CQs, even those with projections, in terms of tractability. We will show that direct access for $\Sigma w$ is significantly harder and tractable only for a small class of queries.
The complexity of direct access depends on the ability of the query to express certain combinations of weights. If the query contains independent free variables, then its answers may contain all possible combinations of their corresponding attribute weights. Our characterization is based on this independence measure.
Definition 5.1 (Independent free variables). A set of vertices $V_I \subseteq V$ of a hypergraph $H(V, E)$ is called independent iff no pair of these vertices appears in the same hyperedge, i.e., $|V_I \cap e| \le 1$ for all $e \in E$. For a CQ $Q$, we denote by $\alpha_{\mathrm{free}}(Q)$ the maximum number of variables among $\mathrm{free}(Q)$ that are independent in $H(Q)$.
Intuitively, we can construct a database instance where each independent free variable is assigned to $n$ different domain values with $n$ different weights. By appropriately choosing the assignment of the other variables, all possible $n^{\alpha_{\mathrm{free}}(Q)}$ combinations of these weights will appear in the query answers. Example 5.2.
For $Q(x, y, z) \mathrel{:-} R(x, z), S(z, y), T(y, u)$, we have $\alpha_{\mathrm{free}}(Q) = 2$, namely for variables $\{x, y\}$. If the database instance is $R = [1, n] \times \{0\}$, $S = \{0\} \times [1, n]$, $T = [1, n] \times \{0\}$, then the $n^2$ query answers are $[1, n] \times [1, n] \times \{0\}$.
The main result of this section is a dichotomy for direct access by $\Sigma w$ ordering:
Theorem 5.3. Let $Q$ be a CQ and $w$ be a weight function.
• If $Q$ is acyclic and $\alpha_{\mathrm{free}}(Q) \le 1$, then direct access by $\Sigma w$ is possible with $O(n \log n)$ preprocessing and $O(1)$ time per answer.
• Otherwise, if $Q$ is also self-join-free, direct access by $\Sigma w$ is not possible with $O(n \operatorname{polylog} n)$ preprocessing and $O(\operatorname{polylog} n)$ time per answer, assuming 3sum and Hyperclique.
For the hardness results, we rely mainly on the 3sum hypothesis. To more easily relate our direct-access problem to 3sum, which asks for the existence of a particular sum of weights, it is useful to define an auxiliary problem:
Definition 5.4 (weight lookup). Given a CQ $Q$, weight function $w$, and $\lambda \in \mathbb{R}$, weight lookup by $\Sigma w$ returns the first position of a query answer $q$ of weight $w(q) = \lambda$ in the sorted array of answers.
The following lemma associates direct access with weight lookup via binary search on the query answers:
Lemma 5.5. If the $k$th query answer according to some ranking function can be directly accessed in $O(T(n))$ time for every $k$, then weight lookup can be performed in $O(T(n) \log n)$.
Lemma 5.5 implies that whenever we are able to support efficient direct access on the sorted array of query answers, weight lookup increases time complexity only by a logarithmic factor, i.e., it is also efficient. The main idea behind our reductions is that via weight lookups on a CQ with an appropriately constructed database, we can decide the existence of a zero-sum triplet over three distinct sets of numbers, thus hardness follows from 3sum. First, we consider the case of three independent variables that are free. These three variables are able to simulate a three-way Cartesian product in the query answers. This allows us to directly encode the 3sum triplets using attribute weights, obtaining a lower bound for direct access.
Lemma 5.6.
If a CQ $Q$ is self-join-free and $\alpha_{\mathrm{free}}(Q) \ge 3$, then direct access by $\Sigma w$ is not possible with $O(n^{2-\epsilon})$ preprocessing and $O(n^{2-\epsilon})$ time per access for any $\epsilon > 0$ assuming 3sum.
Proof. Assume for the sake of contradiction that the lemma does not hold. We show that this would imply an $O(n^{2-\epsilon'})$-time algorithm for 3sum. To this end, consider an instance of 3sum with integer sets $A$, $B$, and $C$ of size $n$, given as arrays. We reduce 3sum to direct access over the appropriate query and input instance by using a construction similar to Example 5.2. Let $x$, $y$, and $z$ be free and independent variables of $Q$, which exist because $\alpha_{\mathrm{free}}(Q) \ge 3$. Then $Q$ contains at least 3 atoms $R_x$, $R_y$, and $R_z$, with variable $x$, $y$, and $z$, respectively. Note that variables other than $x$, $y$, and $z$ may exist in these or other atoms. We create a database instance where $x$, $y$, and $z$ take on each value in $[1, n]$, while all the other attributes have value 0. This ensures that $Q$ has exactly $n^3$ answers, one for each $(x, y, z)$ combination in $[1, n]^3$, no matter the number of other atoms and other attributes in any of the atoms (including in $R_x$, $R_y$, and $R_z$). To see this, note that since $x$, $y$, and $z$ are independent, they never appear together in a relation. Thus each relation either contains 1 tuple (if none of $x$, $y$, and $z$ is present) or $n$ tuples (if one of $x$, $y$, or $z$ is present). No matter on which attributes these relations are joined (including Cartesian products), the output result is always the "same" set $[1, n]^3 \times \{0\}^f$ of size $n^3$, where $f$ is the number of free variables other than $x$, $y$, and $z$. (We use the term "same" loosely for the sake of simplicity. Clearly, for different values of $f$ the query-result schema changes, e.g., consider Example 5.2 with $z$ removed from the head. However, this only affects the number of additional 0s in each of the $n^3$ answer tuples, therefore it does not impact our construction.)
For the reduction from 3sum, weights are assigned to the attribute values as $w_x(i) = A[i]$, $w_y(i) = B[i]$, $w_z(i) = C[i]$, $i \in [1, n]$, and $w_u(0) = 0$ for every other attribute $u$. By our weight assignment, the weights of the answers are $A[i] + B[j] + C[k]$, $i, j, k \in [1, n]$, and thus in one-to-one correspondence with the possible value combinations in the 3sum problem. We first perform the preprocessing for direct access in $O(n^{2-\epsilon})$, which enables direct access to any position in the sorted array of query answers in $O(n^{2-\epsilon})$. By Lemma 5.5, weight lookup for a query result with zero weight is possible in $O(n^{2-\epsilon} \log n)$. Thus, we answer the original 3sum problem in $O(n^{2-\epsilon'})$ for any $0 < \epsilon' < \epsilon$, violating the 3sum hypothesis. □
For queries that do not have three independent free variables we need a different construction. We show next that two variables are sufficient to encode partial 3sum solutions (i.e., pairs of elements), enabling a full solution of 3sum via weight lookups. This yields a weaker lower bound than Lemma 5.6, but still is sufficient to prove intractability according to our yardstick.
Lemma 5.7.
If a CQ $Q$ is self-join-free and $\alpha_{\mathrm{free}}(Q) = 2$, then direct access by $\Sigma w$ is not possible with $O(n^{2-\epsilon})$ preprocessing and $O(n^{1-\epsilon})$ time per access for any $\epsilon > 0$ assuming 3sum.
A special case of Lemma 5.7 is closely related to the problem of selection in X+Y [18], where we want to access the $k$th smallest sum of pairs between two sets $X$ and $Y$. This is equivalent to accessing the answers to $Q_{XY}(x, y) \mathrel{:-} R(x), S(y)$ by $\Sigma w$ ordering. It has been shown that if $X$ and $Y$ are given sorted, then selection is possible even in linear time [14, 20]. Thus, for $Q_{XY}$ direct access by $\Sigma w$ is possible with $O(n \log n)$ preprocessing (where we simply sort the input relations) and $O(n)$ per access.
Next, we show that the remaining acyclic CQs (those with $\alpha_{\mathrm{free}}(Q) \le 1$) are tractable. For these queries, a single relation contains all the answers, and so direct access can be supported by simply sorting that relation.
Lemma 5.8.
If a CQ $Q$ is acyclic and $\alpha_{\mathrm{free}}(Q) \le 1$, then direct access by $\Sigma w$ is possible with $O(n \log n)$ preprocessing and $O(1)$ time per answer.
Combining these lemmas with the hardness of Boolean self-join-free cyclic CQs based on Hyperclique gives a proof of Theorem 5.3.
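The binary search behind Lemma 5.5, and its use for deciding 3sum as in the proof of Lemma 5.6, can be sketched in Python as follows. The sketch is illustrative only: the function name `weight_lookup`, its parameters, and the assumption that the number of answers is known are choices made here, and direct access is simulated by a materialized sorted list, which a real direct-access structure would of course avoid.

```python
from itertools import product

def weight_lookup(access, weight, num_answers, target):
    """Return the first position whose answer has weight `target`, or None.

    access(k):   direct access to the k-th answer in sorted order (0-based)
    weight(ans): total weight of an answer
    num_answers: number of query answers (assumed known in this sketch)
    """
    lo, hi = 0, num_answers
    while lo < hi:                       # O(log num_answers) direct accesses
        mid = (lo + hi) // 2
        if weight(access(mid)) < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < num_answers and weight(access(lo)) == target:
        return lo
    return None

# Toy 3sum instance: is there (i, j, k) with A[i] + B[j] + C[k] == 0?
A, B, C = [1, 2], [-5, 4], [3, 1]

def weight(t):
    return A[t[0]] + B[t[1]] + C[t[2]]

# Stand-in for a direct-access structure: the answers sorted by weight.
answers = sorted(product(range(2), repeat=3), key=weight)
pos = weight_lookup(answers.__getitem__, weight, len(answers), 0)
```

Since the number of answers is polynomial in $n$, the number of direct accesses is $O(\log n)$, which is the source of the logarithmic factor in Lemma 5.5.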
Given that direct access by $\Sigma w$ order with quasilinear preprocessing and polylogarithmic delay is possible only in very few cases, we next investigate the tractability of a simpler version of the problem: When is selection, i.e., direct access to a single query answer, possible in quasilinear time? We further simplify the problem by not allowing any projections in the query, i.e., we limit our attention to full CQs. Our main result is a dichotomy theorem that covers all full self-join-free CQs. We show that the simplifications move only a narrow set of queries to the tractable side. For example, the 2-path query $Q(x, y, z) \mathrel{:-} R(x, y), S(y, z)$ is tractable for selection (single direct access), even though it is not for direct access.
We first introduce necessary terminology. For a CQ $Q$ with hypergraph $H(Q) = (V, E)$, the number of maximal hyperedges w.r.t. containment is $\mathrm{mh}(Q)$, i.e., $\mathrm{mh}(Q) = |\{e \in E \mid \nexists e' \in E : e \subset e'\}|$. An atom $R_1$ is absorbed by an atom $R_2$ if $\mathrm{var}(R_1) \subseteq \mathrm{var}(R_2)$. A query $Q'$ is a contraction of $Q$ if every atom of $Q'$ appears in $Q$, and all the rest of the atoms of $Q$ are absorbed by some atom of $Q'$. $Q_m$ is a maximal contraction of $Q$ if it is a contraction and there is no $Q''$ that is a contraction of $Q_m$ except itself. It is easy to see that the number of atoms of $Q_m$ is $\mathrm{mh}(Q)$. Example 6.1.
Consider $Q(x, y, z) \mathrel{:-} R(x, y), S(y), T(y, z), U(x, y)$. Here, $S(y)$ is absorbed by $R(x, y)$ and $U(x, y)$, and the latter two absorb each other. There are two maximal contractions that we can obtain from $Q$: either $Q_m \mathrel{:-} R(x, y), T(y, z)$ or $Q_m \mathrel{:-} T(y, z), U(x, y)$. The number of maximal hyperedges of $Q$ is $\mathrm{mh}(Q) = 2$.
The dichotomy for selection classifies a query $Q$ based on $\mathrm{mh}(Q)$:
Theorem 6.2. Let $Q$ be a full CQ and $w$ be a weight function.
• If $\mathrm{mh}(Q) \le 2$, then selection by $\Sigma w$ is possible in $O(n \log n)$.
• Otherwise, if $Q$ is also self-join-free, then selection by $\Sigma w$ is not possible in $O(n \operatorname{polylog} n)$, assuming 3sum and Hyperclique.
We prove the positive part of the theorem in Section 6.2 and the negative part in Section 6.3.
Example 6.3.
For the query $Q(x, y, z) \mathrel{:-} R(x, y), S(y, z)$ we have already shown in Section 5 that direct access by $\Sigma w$ is intractable. However, a single access (i.e., selection) is in fact possible in $O(n \log n)$.

Absorbed atoms.
As evident from Theorem 6.2, adding atoms to a query that are absorbed by existing ones does not affect the complexity of selection. We prove this claim first and use it later in our analysis in order to treat queries that contain absorbed atoms.

Lemma 6.4.
Selection on a CQ $Q$ is possible in $O(T_S(n))$ if selection on a maximal contraction $Q_m$ of $Q$ is possible in $O(T_S(n))$. The converse is also true if $Q$ is self-join-free.

In this section, we provide tractability results for full CQs with $\mathrm{mh}(Q) \le 2$. First, we consider the trivial case of $\mathrm{mh}(Q) = 1$, where the maximal contraction $Q_m$ has only one atom. The lemma below is a direct consequence of the linear-time array selection algorithm of Blum et al. [6].

Lemma 6.5. For a full CQ $Q$ with $\mathrm{mh}(Q) = 1$, selection by $\Sigma w$ is possible in $O(n)$.

For the $\mathrm{mh}(Q) = 2$ case, we rely on the algorithm of Frederickson and Johnson [14] for selection over sorted matrices, which covers the classic $X + Y$ selection problem. If two arrays $X$ and $Y$ are given sorted, then the pairwise sums can be represented as a sorted matrix. A sorted matrix $M$ contains a sequence of non-decreasing elements in every row and every column. For the $X + Y$ problem, a cell $M[i, j]$ contains the sum $X[i] + Y[j]$. Even though the matrix $M$ has quadratically many cells, there is no need to construct it in advance given that we can compute each cell in constant time. Selection on a union of such matrices $\{M_1, \ldots, M_\ell\}$ asks for the $k$-th smallest cell among the cells of all matrices.

Theorem 6.6 ([14]). Selection on a union of sorted matrices $\{M_1, \ldots, M_\ell\}$, where $M_i$ has dimension $p_i \times q_i$ with $p_i \ge q_i$, is possible in time $O\big(\sum_{i=1}^{\ell} q_i \log(2 p_i / q_i)\big)$.

Leveraging this algorithm, we provide our next positive result:

Lemma 6.7.
If a full CQ $Q$ has $\mathrm{mh}(Q) = 2$, selection by $\Sigma w$ is possible in $O(n \log n)$.

Proof. The maximal contraction of queries with $\mathrm{mh}(Q) = 2$ is $Q(\vec{x}, \vec{y}) \mathrel{:-} R(\vec{x}), S(\vec{y})$, with $\vec{x} \neq \vec{y}$, thus by Lemma 6.4, it is enough to prove an $O(n \log n)$ bound for this query. As before, we turn the attribute weights into tuple weights. Since a variable may occur in both atoms, we make sure to assign each attribute weight to only one relation to avoid double-counting. Thus, we compute $w(r) = \sum_{x \in \vec{x}} w_x(r(x))$ and $w(s) = \sum_{y \in (\vec{y} \setminus \vec{x})} w_y(s(y))$ for all $r \in R$ and $s \in S$, respectively. Since the query is full, the weights of the query answers are in one-to-one correspondence with the pairwise sums of weights of joining tuples from $R$ and $S$.

Let $\vec{z} = \vec{x} \cap \vec{y}$. We next group the $R$ and $S$ tuples by their $\vec{z}$ values: we create $\ell$ buckets of tuples where all tuples $t$ within a bucket have equal $t(z)$ values, $z \in \vec{z}$. This can be done in linear time. If $\vec{z} = \emptyset$, i.e., the query is the Cartesian product, then we place all tuples in a single bucket. For each assignment of $\vec{z}$ values, the query answers with those values are formed by the Cartesian product of $R$ and $S$ tuples inside that bucket. Also, if the size of bucket $i$ is $m_i$, then $m_1 + \ldots + m_\ell = |R| + |S|$. We sort the tuples in the buckets according to their weight in $O(n \log n)$ time. Assume $R_i$ and $S_i$ are the partitions of $R$ and $S$ in bucket $i$, and $R_i[j]$ denotes the $j$-th tuple of $R_i$ in sorted order (similarly for $S_i[k]$). We define a union of sorted matrices $\{M_1, \ldots, M_\ell\}$ as follows: For bucket $i$, we have $M_i[j, k] = w(R_i[j]) + w(S_i[k])$. Selection on these matrices is equivalent to selection on the query answers of $Q$.
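The construction up to this point can be sketched in code for the 2-path query $Q(x, y, z) \mathrel{:-} R(x, y), S(y, z)$. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the dictionary `w` of attribute weight functions, and the 0-based index `k` are hypothetical. In particular, the final selection step below is a simple best-first traversal of the implicit matrices with a heap, which costs $O(k \log n)$ after sorting; it is a stand-in and not the algorithm of [14] that yields the quasilinear bound.

```python
import heapq
from collections import defaultdict


def select_two_path(R, S, w, k):
    """Weight of the k-th smallest (0-based) answer of
    Q(x, y, z) :- R(x, y), S(y, z) under sum-of-weights order, or None."""
    # Tuple weights per bucket (join value y). The weight of y is charged
    # to R only, so it is not double-counted.
    buckets = defaultdict(lambda: ([], []))
    for (x, y) in R:
        buckets[y][0].append(w["x"](x) + w["y"](y))
    for (y, z) in S:
        buckets[y][1].append(w["z"](z))

    # Sorting both sides of bucket i defines the implicit sorted matrix
    # M_i[j][l] = rs[j] + ss[l]; the matrix itself is never materialized.
    mats = []
    for rs, ss in buckets.values():
        if rs and ss:
            rs.sort()
            ss.sort()
            mats.append((rs, ss))

    # Stand-in selection: pop cells in non-decreasing order of value.
    heap = [(rs[0] + ss[0], i, 0, 0) for i, (rs, ss) in enumerate(mats)]
    heapq.heapify(heap)
    seen = {(i, 0, 0) for i in range(len(mats))}
    while heap:
        val, i, j, l = heapq.heappop(heap)
        if k == 0:
            return val
        k -= 1
        rs, ss = mats[i]
        for nj, nl in ((j + 1, l), (j, l + 1)):
            if nj < len(rs) and nl < len(ss) and (i, nj, nl) not in seen:
                seen.add((i, nj, nl))
                heapq.heappush(heap, (rs[nj] + ss[nl], i, nj, nl))
    return None  # fewer than k + 1 answers


# Usage with identity weights: the answers have weights 10, 12, 13, 22, 23.
R = [(1, 1), (2, 1), (3, 2)]
S = [(1, 10), (1, 20), (2, 5)]
w = {"x": lambda v: v, "y": lambda v: v, "z": lambda v: v}
print(select_two_path(R, S, w, 2))  # 13
```

Replacing the heap traversal with selection over sorted matrices is what removes the dependence on $k$.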
By Theorem 6.6, if matrix $M_i$ has dimension $p_i \times q_i$ with $p_i \ge q_i$, we can achieve selection in $O\big(\sum_{i=1}^{\ell} q_i \log(2 p_i / q_i)\big) = O\big(\sum_{i=1}^{\ell} q_i \cdot 2 p_i / q_i\big) = O\big(\sum_{i=1}^{\ell} p_i\big) = O\big(\sum_{i=1}^{\ell} m_i\big) = O(n)$. Overall, the time spent is $O(n \log n)$ because of sorting. □

Though selection is a special case of direct access, we show that for most full CQs the tractable time complexity $O(n \operatorname{polylog} n)$ is still unattainable. We start from the cases covered by Lemma 5.6. To extend that result to the selection problem, note that a selection algorithm can be repeatedly applied for solving direct access. For queries with three free and independent variables, an $O(n^{2-\epsilon})$ selection algorithm would imply a direct access algorithm with $O(n^{2-\epsilon})$ preprocessing and delay, which we showed to be impossible. Therefore, the following immediately follows from Lemma 5.6:

Corollary 6.8. If a full CQ $Q$ is self-join-free and $\alpha_{\mathrm{free}}(Q) \ge 3$, then selection by $\Sigma w$ is impossible in $O(n^{2-\epsilon})$ for any $\epsilon > 0$, assuming 3sum.

This leaves only a small fraction of full CQs to be covered: acyclic queries with two or fewer independent variables and three or more maximal hyperedges. We next show that these queries are essentially variants of the general three-path query template where three atoms are organized in a chain.

Lemma 6.9.
The full acyclic CQs $Q$ that satisfy $\alpha_{\mathrm{free}}(Q) < 3$ and $\mathrm{mh}(Q) > 2$ are $Q_{3p}(\vec{x}, \vec{y}, \vec{z}, \vec{u}) \mathrel{:-} R(\vec{x}, \vec{y}), S(\vec{y}, \vec{z}), T(\vec{z}, \vec{u})$ for non-empty $\vec{x}, \vec{y}, \vec{z}, \vec{u}$, up to atom absorption.

Now that we have established the precise form of the queries we want to characterize, we proceed to prove their intractability. We approach this in a different way than the other hardness proofs: instead of relying on the 3sum hypothesis, we show that tractable selection would lead to unattainable bounds for Boolean cyclic queries.

Lemma 6.10.
Selection by $\Sigma w$ is not possible in $O(n \operatorname{polylog} n)$ for $Q_{3p}(\vec{x}, \vec{y}, \vec{z}, \vec{u}) \mathrel{:-} R(\vec{x}, \vec{y}), S(\vec{y}, \vec{z}), T(\vec{z}, \vec{u})$, assuming Hyperclique.

Proof. We will show that if selection for $Q_{3p}$ can be done in $O(n \operatorname{polylog} n)$, then the Boolean triangle query can be evaluated in the same time bound, which contradicts the Hyperclique hypothesis. Let $Q_\triangle() \mathrel{:-} R'(x', y'), S'(y', z'), T'(z', x')$ be a query over a database $I$. We will construct a database $I'$ for $Q_{3p}$, and via weight lookups we will be able to answer $Q_\triangle$ over $I$. Let $x \in \vec{x}$, $y \in \vec{y}$, $z \in \vec{z}$, $u \in \vec{u}$. For $I'$, we extend relation $R'$ to $R$ by assigning $x = x'$, $y = y'$ and setting the values of all the other attributes $(\vec{x} \cup \vec{y}) \setminus \{x, y\}$ to a fixed domain value $\bot$. We repeat the same process for the other relations: For $S$ we assign $y = y'$, $z = z'$, and for $T$ we assign $z = z'$, $u = x'$. Consider a query result $q \in Q_{3p}(I')$. If $\pi_u(q) = \pi_x(q)$, then by our construction $\pi_{xyz}(q)$ satisfies $R'$, $S'$ and $T'$, and thus $Q_\triangle$ is true over $I$.

We now assign weights as follows: If $\mathrm{dom} \subseteq \mathbb{R}$, then $w_x(c) = c$, $w_u(c) = -c$, and for all other attributes $t$, $w_t(c) = 0$. Otherwise, it is also easy to assign $w_x$ and $w_u$ in a way s.t. $w_x(c) = w_x(c')$ if and only if $c = c'$, and $w_u(c) = -w_x(c)$. This is done by maintaining a lookup table for all the domain values that we map to some arbitrary real number. Then, we perform a weight lookup for $Q_{3p}$ to identify if a query result with zero weight exists. If it does for some result $q$, then $w_x(\pi_x(q)) + \ldots + w_u(\pi_u(q)) = 0$, which implies $\pi_x(q) = \pi_u(q)$, and $Q_\triangle$ is true; otherwise it is false. Since accessing the sorted array of $Q_{3p}$ answers takes $O(n \operatorname{polylog} n)$, by Lemma 5.5, weight lookup also takes $O(n \operatorname{polylog} n)$. □

The negative part of Theorem 6.2 for acyclic queries is proved by combining Corollary 6.8 and Lemma 6.10 together with Lemma 6.9 and Lemma 6.4, which show that we cover all cases. For self-join-free cyclic CQs, we once again resort to the hardness of their Boolean version based on Hyperclique.
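The reduction in the proof of Lemma 6.10 can be illustrated with a small sketch. Everything here is an assumption for illustration: the function name is hypothetical, every vector consists of a single variable (so no padding value is needed), and the domain is numeric. The nested loops are a brute-force stand-in for the weight lookup of Lemma 5.5; they only show why a zero-weight answer of the three-path query corresponds to a triangle, not how to find it efficiently.

```python
from collections import defaultdict


def has_triangle_via_three_path(R_tri, S_tri, T_tri):
    """Decide the Boolean triangle query R'(x', y'), S'(y', z'), T'(z', x')
    by searching for a zero-weight answer of R(x, y), S(y, z), T(z, u)."""
    # Database I': R and S are copies of R' and S'; every tuple T'(z', x')
    # is stored as T(z, u) with u = x'.
    R, S, T = set(R_tri), set(S_tri), set(T_tri)

    # Attribute weights: w_x(c) = c, w_u(c) = -c, and 0 for y and z.
    def w_x(c):
        return c

    def w_u(c):
        return -c

    S_by_y = defaultdict(list)
    for (y, z) in S:
        S_by_y[y].append(z)
    T_by_z = defaultdict(list)
    for (z, u) in T:
        T_by_z[z].append(u)

    # Brute-force stand-in for weight lookup: an answer (x, y, z, u) has
    # weight zero exactly when x = u, i.e., when it closes a triangle.
    for (x, y) in R:
        for z in S_by_y.get(y, ()):
            for u in T_by_z.get(z, ()):
                if w_x(x) + w_u(u) == 0:
                    return True
    return False


print(has_triangle_via_three_path({(1, 2)}, {(2, 3)}, {(3, 1)}))  # True
print(has_triangle_via_three_path({(1, 2)}, {(2, 3)}, {(3, 4)}))  # False
```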
We investigated the task of constructing a random-access data structure to the output of a query with an ordering over the answers. We presented algorithms for fragments of the class of CQs for lexicographic orders and sum. In these algorithms, the construction time is quasilinear in the size of the database, and the access time is logarithmic. We showed that within the class of CQs without self-joins, our algorithms cover all the cases where these complexity guarantees are feasible, assuming conventional hypotheses in the theory of fine-grained complexity. In the case of sum, where the tractable fragment is limited, we also studied the restriction of the problem to accessing a single answer (the selection problem) and established a corresponding classification.

This work opens up several directions for future work, including the generalization to more expressive queries (CQs with self-joins, unions of CQs, negation, etc.), other kinds of orders (e.g., min/max over the tuple entries), and a continuum of complexity guarantees (beyond quasilinear/logarithmic time). It would also be important to understand how integrity constraints, such as functional dependencies, change the frontier of tractability, as they have in the case of enumeration [8]. Generalizing the question posed at the beginning of the Introduction, we view this work as part of the bigger challenge that continues the line of research on factorized representations in databases [21, 22]: how can we represent the output of a query in a way that, compared to the explicit representation, is fundamentally more compact and efficiently computable, yet equally useful to downstream operations?
REFERENCES
[1] Amir Abboud and Virginia Vassilevska Williams. 2014. Popular Conjectures Imply Strong Lower Bounds for Dynamic Problems. In FOCS. 434–443.
[2] Nir Ailon and Bernard Chazelle. 2005. Lower Bounds for Linear Degeneracy Testing. J. ACM 52, 2 (March 2005), 157–171.
[3] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. 2007. On Acyclic Conjunctive Queries and Constant Delay Enumeration. In CSL. 208–222.
[4] Ilya Baran, Erik D. Demaine, and Mihai Pătraşcu. 2005. Subquadratic Algorithms for 3SUM. In Algorithms and Data Structures. 409–421.
[5] Christoph Berkholz, Fabian Gerhardt, and Nicole Schweikardt. 2020. Constant delay enumeration for conjunctive queries: a tutorial. SIGLOG 7, 1 (2020), 4–33.
[6] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. 1973. Time bounds for selection. JCSS 7, 4 (1973), 448–461.
[7] Johann Brault-Baron. 2013. De la pertinence de l'énumération: complexité en logiques propositionnelle et du premier ordre. Ph.D. Dissertation. U. de Caen.
[8] Nofar Carmeli and Markus Kröll. 2020. Enumeration Complexity of Conjunctive Queries with Functional Dependencies. TCS 64, 5 (2020), 828–860.
[9] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. 2020. Answering (Unions of) Conjunctive Queries Using Random Access and Random-Order Enumeration. In PODS (PODS'20). 393–409.
[10] Shaleen Deep and Paraschos Koutris. 2019. Ranked Enumeration of Conjunctive Query Results. CoRR abs/1902.02698 (2019).
[11] Jeff Erickson. 1995. Lower Bounds for Linear Satisfiability Problems. In SODA (SODA '95). USA, 388–395.
[12] Robert W. Floyd and Ronald L. Rivest. 1975. Expected Time Bounds for Selection. Commun. ACM 18, 3 (March 1975), 165–172.
[13] Greg N. Frederickson. 1993. An Optimal Algorithm for Selection in a Min-Heap.
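Inf. Comput.
[14] Greg N. Frederickson and Donald B. Johnson. 1984. Generalized Selection and Ranking: Sorted Matrices. SIAM J. Comput. 13, 1 (Feb. 1984), 14–30.
[15] Anka Gajentaan and Mark H Overmars. 1995. On a class of O(n^2) problems in computational geometry.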
Computational Geometry 5, 3 (1995), 165–185.
[16] Martin Charles Golumbic. 1980. Chapter 4 - Triangulated Graphs. Academic Press, 81–104.
[17] Etienne Grandjean. 1996. Sorting, linear time and the satisfiability problem. Annals of Mathematics and Artificial Intelligence 16, 1 (1996), 183–236.
[18] Donald B Johnson and Tetsuo Mizoguchi. 1978. Selecting the Kth element in X + Y and X_1 + X_2 + ... + X_m. SIAM J. Comput. 7, 2 (1978), 147–153.
[19] Andrea Lincoln, Virginia Vassilevska Williams, and R. Ryan Williams. 2018. Tight Hardness for Shortest Cycles and Paths in Sparse Graphs. In SODA. 1236–1252.
[20] A. Mirzaian and E. Arjomandi. 1985. Selection in X + Y and matrices with sorted rows and columns. Inform. Process. Lett. 20, 1 (1985), 13–17.
[21] Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Rec. 45, 2 (2016), 5–16.
[22] Dan Olteanu and Jakub Zavodny. 2012. Factorised representations of query results: size bounds and readability. In ICDT. ACM, 285–298.
[23] Mihai Patrascu. 2010. Towards polynomial lower bounds for dynamic problems. In STOC. 603.
[24] Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, and Xiaofeng Yang. 2020. Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries. PVLDB 13, 9 (2020), 1582–1597.
[25] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. 2020. Optimal Join Algorithms Meet Top-k. In SIGMOD. 2659–2665.
[26] Virginia Vassilevska Williams. 2015. Hardness of Easy Problems: Basing Hardness on Popular Conjectures such as the Strong Exponential Time Hypothesis (Invited Talk). In IPEC, Vol. 43. Dagstuhl, Germany, 17–29.
[27] Xiaofeng Yang, Mirek Riedewald, Rundong Li, and Wolfgang Gatterbauer. 2018. Any-k Algorithms for Exploratory Analysis with Conjunctive Queries. In ExploreDB. 1–3.
[28] Mihalis Yannakakis. 1981. Algorithms for Acyclic Database Schemes. In VLDB (VLDB '81). VLDB Endowment, 82–94.
A ADDITIONAL PROOFS
A.1 Proof of Lemma 3.10
Let $Q$ be a free-connex CQ, and let $T$ be an ext-$\mathrm{free}(Q)$-connex tree for $Q$ where $T'$ is the subtree of $T$ that contains exactly the free variables.

First, we claim that two free variables are neighbors in $T$ iff they are neighbors in $T'$. The "if" direction is immediate since $T'$ is contained in $T$. We show the other direction. Let $u$ and $v$ be free variables of $Q$ that are neighbors in $T$. That is, there is a node $V$ in $T$ that contains them both. Consider the unique path from $V$ to any node in $T'$ such that only the last node on the path, which we denote $V'$, is in $T'$. Since both variables appear in $T'$ and in $V$, by the running intersection property, both variables appear in $V'$. Thus, $u$ and $v$ are also neighbors in $T'$.

Since the definition of disruptive trios depends only on neighboring pairs of free variables, an immediate consequence of the claim from the previous paragraph is that there is a disruptive trio in $T$ iff there is a disruptive trio in $T'$. Next, we can simply use Proposition 2.1 to reduce $Q$ to the full acyclic CQ where the atoms are exactly the nodes of $T'$.

A.2 Proof Sketch of Proposition 4.3
We describe a construction of the required tree. Figure 5 demonstrates our construction. We use two different characterizations of connexity. Since $Q$ is $L_2$-connex, it has an ext-$L_2$-connex tree $T_2$. Since $Q$ is $L_1$-connex, there is a join-tree $T_1$ for the atoms of $Q$ and its head. Let $T_2[L_1]$ be $T_2$ where the variables that are not in $L_1$ are deleted from all vertices. That is, for every vertex $V \in T_2$, its variables are replaced with $\mathrm{var}(V) \cap L_1$. Denote by $\mathcal{V}$ all neighbors of the head in $T_1$, and denote by $T_1^-$ the graph $T_1$ after the deletion of the head vertex. Taking both $T_2[L_1]$ and $T_1^-$ and connecting every vertex $V_1 \in \mathcal{V}$ with a vertex $V_2$ of $T_2[L_1]$ such that $\mathrm{var}(V_1) \cap L_1 = \mathrm{var}(V_2)$ gives us the tree we want. Such a vertex exists in $T_2[L_1]$ since every vertex of $T_1^-$ represents an atom of $Q$, and every atom of $Q$ is contained in some vertex of $T_2$. The subtree $T_2[L_1]$ contains exactly $L_1$, and since this subtree comes from an ext-$L_2$-connex tree, it has a subtree containing exactly $L_2$. It is easy to verify that the result is a tree, and we can show that the running intersection property holds in the united graph since it holds for $T_1$ and $T_2$.

A.3 Proof of Lemma 5.5
We use binary search on the sorted array of query answers. Each direct access returns a query answer whose weight can be computed in $O(1)$. Thus, in a logarithmic number of accesses we can find the first occurrence of the desired weight. Since the number of answers is polynomial in $n$, the number of accesses is $O(\log n)$ and each one takes $O(T_d)$ time, i.e., the time of a single direct access.

A.4 Proof of Lemma 5.7
We show that the contrary contradicts the 3sum hypothesis. Let $A$, $B$, and $C$ be three integer arrays of a 3sum instance of size $n$. We construct a database instance with attribute weights like in the proof of Lemma 5.6, but now with only 2 free and independent variables $x$ and $y$. Hence the weights of the $n^2$ query results are in one-to-one correspondence with the corresponding sums $A[i] + B[j]$, $i, j \in [1, n]$. We run the preprocessing phase for direct access in $O(n^{2-\epsilon})$, which allows us to access the sorted array of query results in $O(n^{1-\epsilon})$. For each value $C[k]$ in $C$, we perform a weight lookup on $Q$ for weight $-C[k]$, which takes time $O(n^{1-\epsilon} \log n)$ (Lemma 5.5). If that returns a valid index, then there exists a pair $(i, j)$ of $A$ and $B$ with sum $A[i] + B[j] = -C[k]$, which implies $A[i] + B[j] + C[k] = 0$. Since there are $n$ values in $C$, the total time complexity is $O(n \cdot n^{1-\epsilon} \log n) = O(n^{2-\epsilon} \log n)$. This procedure solves 3sum in $O(n^{2-\epsilon'})$ for any $0 < \epsilon' < \epsilon$, violating the 3sum hypothesis.

[Figure 5: Example for the construction from Proposition 4.3 for a CQ $Q(x, y, z)$ with $L_1 = \{x, y, z\}$ and $L_2 = \{y\}$.]

A.5 Proof of Lemma 5.8
For $\alpha_{\mathrm{free}}(Q) = 1$, we claim that there is an atom $R_m(\vec{x}_m)$ which contains all the free variables. First note that for $|\mathrm{free}(Q)| = 1$ the claim holds trivially. For $|\mathrm{free}(Q)| > 1$, let $R(\vec{x})$ be an atom that contains the maximum number of free variables and assume for the sake of contradiction that there exists a free variable $y$ with $y \notin \vec{x}$. We use $V_y$ to denote the set of nodes in the join tree that contain variable $y$; thus $R \notin V_y$. From $Q$ being acyclic it follows that the nodes in $V_y$ form a connected graph and there exists a node $R' \in V_y$ that lies on every path from $R$ to a node in $V_y$. Since $\alpha_{\mathrm{free}}(Q) = 1$, each variable $x \in \vec{x}$ must appear together with $y$ in some query atom, implying that $x$ appears in some node $R'' \in V_y$. From that and the running intersection property it follows that $x$ must also appear in $R'$, since $R'$ lies on the path from $R$ to any such $R''$. Hence $R'$ contains $y$ and all the variables in $\vec{x}$, violating the maximality assumption for $R$. Since all free variables appear in $\vec{x}_m$, we can apply the linear-time semi-join reduction of Yannakakis [28] to remove the dangling tuples, and then compute $Q$ by projecting $R_m$ on all free variables and sorting the query answers by $\Sigma w$. This takes total time $O(n \log n)$ for preprocessing and enables constant-time direct access to individual answers.

For $\alpha_{\mathrm{free}}(Q) = 0$, $Q$ is a Boolean query. Since $Q$ is also acyclic, the algorithm of Yannakakis answers it and direct access is trivial.

A.6 Proof of Lemma 6.4
For the "if" direction, we can eliminate absorbed atoms from $Q$ to obtain $Q_m$ after making sure that the tuples in the database satisfy those atoms. Thus, to remove an atom $S(\vec{y})$ which is absorbed by $R(\vec{x})$, we filter the relation $R$ based on the tuples of $S$. Then, $Q_m$ over the filtered database has the same answers as $Q$ over the original one. For the "only if" direction, each atom $S(\vec{y})$ that appears in $Q$ but not in $Q_m$ is absorbed by some $R(\vec{x})$. We create the relation $S$ by copying $\pi_{\vec{y}}(R)$ into it, essentially making the atom $S(\vec{y})$ obsolete. Note that we are allowed to create $S$ without restrictions because $Q$ has no self-joins, hence the database does not already contain the relation. Then, $Q$ over the extended database has the same answers as $Q_m$ over the original one. The above reductions take linear time, which is dominated by $T_S(n)$ since $T_S(n)$ is trivially in $\Omega(n)$ for the selection problem.

A.7 Proof of Lemma 6.5
By Lemma 6.4, it suffices to solve selection on the query $Q(\vec{x}) \mathrel{:-} R(\vec{x})$, which is a maximal contraction of all queries with $\mathrm{mh}(Q) = 1$. Initially, we turn the attribute weights into tuple weights. For each tuple $r \in R$, we compute its weight as $w(r) = \sum_{x \in \vec{x}} w_x(r(x))$. Thus, the weights $w(r)$ are the weights of the query answers. This takes $O(n)$ for the $O(n)$ tuples of $R$. Then, applying linear-time selection [6] on $R$ gives us the $k$-th smallest tuple.

A.8 Proof of Lemma 6.9
First, it is easy to see that for $\alpha_{\mathrm{free}}(Q) = 1$ we get $\mathrm{mh}(Q) = 1$, so we only need to consider $\alpha_{\mathrm{free}}(Q) = 2$. Let $x$ and $u$ be the two independent variables. Because they do not appear together in the same atom, there exist two different atoms $R$, $T$ such that $R$ contains $x$ but not $u$ and $T$ contains $u$ but not $x$. Without loss of generality, we can further assume that the hyperedges $R$ and $T$ are not contained in others (if they are, we can choose those instead). We also have at least one more maximal hyperedge $S$ that is not absorbed by $R$ or $T$ because $\mathrm{mh}(Q) > 2$. For $S$, we claim that $\mathrm{var}(S) \subseteq (\mathrm{var}(R) \cup \mathrm{var}(T))$. Suppose that $S$ contains a variable $t$ s.t. $t \notin (\mathrm{var}(R) \cup \mathrm{var}(T))$. Then because $t$ cannot be independent, there must exist an atom $P$ that contains $x$ and $t$ (or equivalently $t$ and $u$). However, in that case, $R$, $S$, $P$ (or equivalently $T$, $S$, $P$) create a cycle, violating the acyclicity of $Q$. Let $\vec{y}$ be the variables in $\mathrm{var}(R) \cap \mathrm{var}(S)$ and $\vec{z}$ those in $\mathrm{var}(S) \cap \mathrm{var}(T)$. We have $\vec{y} \neq \emptyset$ and $\vec{z} \neq \emptyset$, otherwise $S$ would be absorbed by $T$ or $R$, respectively. Conversely, $\mathrm{var}(S) \not\subseteq \mathrm{var}(R)$ because otherwise $S$ would be absorbed by $R$, and the same is true for $T$. At this point, the other atoms of the query can only be absorbed by the existing ones, otherwise we introduce an independent variable or a cycle.

B INVERTED ACCESS BY LEXICOGRAPHIC ORDER
A straightforward adaptation of Algorithm 1 can be used to achieve inverted access: given a query result as the input, we return its index according to the lexicographic order. Algorithm 2 is almost the same algorithm as Algorithm 1, except that the choices in each iteration are made according to the given answer and the corresponding index is constructed (instead of the opposite). The algorithm runs in constant time per answer since every operation can be done within that time (unlike Algorithm 1, there is no need for binary search here).
Algorithm 2 Lexicographic Inverted-Access
    k = 0
    bucket[1] = root
    factor = weight(root)
    for i = 1, ..., f do
        factor = factor / weight(bucket[i])
        select t ∈ bucket[i] agreeing with the answer
        if no such t exists then
            return not-an-answer
        k = k + start(t) · factor
        for child V of layer i do
            get the bucket b ∈ V agreeing with the answer
            bucket[layer(V)] = b
            factor = factor · weight(b)
    return k
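To give a feel for inverted access, the following sketch handles only the special case of the 2-path query $Q(x, y, z) \mathrel{:-} R(x, y), S(y, z)$ under the lexicographic order $\langle x, y, z \rangle$. It is not Algorithm 2 itself: the function names and the two hash tables are assumptions made for illustration, and the general layered structure with buckets and factors is collapsed into a single prefix count per $R$ tuple plus the position of $z$ within its group.

```python
from collections import defaultdict


def build_inverted_index(R, S):
    """Preprocessing for Q(x, y, z) :- R(x, y), S(y, z), order <x, y, z>."""
    # Group S by y, sort the z values, and remember each z's position.
    zs = defaultdict(list)
    for (y, z) in set(S):
        zs[y].append(z)
    z_pos = {}
    for y, values in zs.items():
        values.sort()
        for pos, z in enumerate(values):
            z_pos[(y, z)] = pos

    # R tuples with at least one join partner, in lexicographic order.
    r_sorted = sorted(t for t in set(R) if t[1] in zs)

    # start[(x, y)] = number of answers preceding all answers with prefix (x, y).
    start = {}
    total = 0
    for (x, y) in r_sorted:
        start[(x, y)] = total
        total += len(zs[y])
    return start, z_pos


def inverted_access(index, answer):
    """Index of `answer` in the sorted answer list, or None if not an answer."""
    start, z_pos = index
    x, y, z = answer
    if (x, y) not in start or (y, z) not in z_pos:
        return None
    return start[(x, y)] + z_pos[(y, z)]


# Sorted answers: (1,1,5), (1,1,7), (1,2,3), (2,1,5), (2,1,7).
index = build_inverted_index([(1, 1), (2, 1), (1, 2)], [(1, 5), (1, 7), (2, 3)])
print(inverted_access(index, (1, 2, 3)))  # 2
print(inverted_access(index, (2, 1, 7)))  # 4
print(inverted_access(index, (2, 2, 3)))  # None
```

Each lookup performs two hash-table probes, which matches the intuition that no binary search is needed once the answer itself tells us which tuple to follow at each step.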