Beyond Equi-joins: Ranking, Enumeration and Factorization
Nikolaos Tziavelis [email protected] University, Boston, Massachusetts, USA
Wolfgang Gatterbauer [email protected] University, Boston, Massachusetts, USA
Mirek Riedewald [email protected] University, Boston, Massachusetts, USA
ABSTRACT
We study full acyclic join queries with general join predicates that involve conjunctions and disjunctions of inequalities, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach offers strong time and space complexity guarantees in the standard RAM model of computation, getting surprisingly close to those of equi-joins. With $n$ denoting the number of tuples in the database, we guarantee that for every value of $k$, the $k$ top-ranked answers are returned in $\mathcal{O}(n \operatorname{polylog} n + k \log k)$ time and $\mathcal{O}(n \operatorname{polylog} n + k)$ space. This is within a polylogarithmic factor of the best-known guarantee for equi-joins and even of $\mathcal{O}(n + k)$, the time it takes to look at the input and output $k$ answers. The key ingredient is an $\mathcal{O}(n \operatorname{polylog} n)$-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. As a side benefit, our techniques are also applicable to unranked enumeration (where answers can be returned in any order) for joins with inequalities, returning $k$ answers in $\mathcal{O}(n \operatorname{polylog} n + k)$. This guarantee improves over the state of the art for large values of $k$. In an experimental study, we show that our ranked-enumeration approach is not only theoretically interesting, but also fast and memory-efficient in practice.

Join result enumeration.
Join processing is one of the most fundamental topics in database research. While queries with joins have been optimized for decades, they have received renewed attention from an algorithmic perspective [51, 61, 64, 65]. Such efforts provide asymptotic guarantees that shield against large intermediate results and ensure predictable performance, no matter the given database instance. Similarly, work on constant-delay enumeration [7, 17, 44, 75] strives to pre-process the database for a given query on-the-fly so that the first answer is returned in linear time (in database size), followed by all other answers with constant (i.e., independent of database size) delay between them. (Linear pre-processing and constant delay guarantee that all answers are returned in time linear in input and output size, which is optimal.)
Ranked enumeration.
Requiring join answers to be returned in a specific order gives rise to ranked enumeration [29, 31, 78, 79, 85, 86], where the main goal is to quickly return the most important answers without having to materialize and sort the entire output. Ranked enumeration generalizes the well-known top-$k$ paradigm, removing the requirement of having to specify $k$ in advance. Besides, non-trivial complexity guarantees for top-$k$ join algorithms [47] had been derived only for the "middleware" cost model of the celebrated Threshold Algorithm [36], which only accounts for the number of distinct data items accessed [79]. In contrast, work on enumeration, including this paper, accounts for all the steps taken by the algorithm, using the standard RAM model of computation.
Beyond equi-joins.
Existing work on ranked enumeration has focused on equi-joins, but big-data analytics often also requires other join conditions [30, 33, 52, 57] such as inequalities (e.g., S.age < T.age), non-equalities (e.g., S.id ≠ T.id), and band predicates (e.g., |S.time - T.time| < $\varepsilon$). Handling these more general predicates efficiently is challenging. If one batch-produces and sorts the full join output, then for a join of $\ell$ relations of size $\mathcal{O}(n)$, sorting alone takes $\mathcal{O}(\ell n^{\ell} \log n)$ time (because output size is $\mathcal{O}(n^{\ell})$). As we discuss later, a direct $\mathcal{O}(n^2)$ reduction from theta-joins to equi-joins allows us to leverage equi-join enumeration techniques [78]. That approach delivers the top-ranked answer in $\mathcal{O}(n^2)$ for acyclic join queries. Our goal is to reduce this to $\mathcal{O}(n \operatorname{polylog} n)$ (see Fig. 1).

Example 1. Consider an ornithologist studying interactions between bird species using a bird observation dataset B(Species, Family, ObsCount, Latitude, Longitude). For her analysis, she decides to extract pairs of observations for birds of different species from the same larger family that have been spotted in the same region. Pairs with higher ObsCount should also appear first:

    SELECT *, B1.ObsCount + B2.ObsCount as Weight
    FROM B B1, B B2
    WHERE B1.Family = B2.Family
    AND ABS(B1.Latitude - B2.Latitude) < 1
    AND ABS(B1.Longitude - B2.Longitude) < 1
    AND B1.Species <> B2.Species
    ORDER BY Weight DESC LIMIT 1000

With $n$ denoting the number of tuples in $B$, no existing approach can guarantee to return the top-1000 results in sub-quadratic time complexity $o(n^2)$. In this paper, we show how to achieve $\mathcal{O}(n \log^3 n)$ even if the size of the output is $\mathcal{O}(n^2)$. After returning the top-1000 answers, our approach is also capable of returning more answers in order without having to restart the query. The exponent of the logarithm is determined by the number of join predicates that are not equalities (3 here). Interestingly, this guarantee is not affected by the number of relations joined, e.g., if we look for triplets of bird observations, because the complexity is determined only by the pairwise join with the most predicates that are not equalities.

This work.
We provide the first comprehensive study on enumeration for joins with conditions that go beyond equality: (C0) general theta-join conditions, (C1) inequality, (C2) non-equality, (C3) band conditions, as well as (C4) general expressions thereof as a DNF formula. We focus on acyclic join queries, which are the most common in practice. Our time and space complexity results are stated in the standard RAM model of computation and we give non-trivial guarantees for space complexity and Time-to-$k$-th answer ($\mathrm{TT}(k)$). Following common practice, we treat query size (intuitively, the length of the SQL string) as a constant. This corresponds to the classic notion of data complexity [81], where one is interested in scalability in the size of the input data, not the size of the query (because users do not write arbitrarily large query expressions).

arXiv [cs.DB], Feb

    Join Condition                 Example                                   Time P(n)                               Space S(n)
    (C0) Theta                     booleanUDF(S.A, T.C)                      $\mathcal{O}(n^2)$                      $\mathcal{O}(n^2)$
    (C1) Inequality                S.A < T.B                                 $\mathcal{O}(n \log n)$                 $\mathcal{O}(n \log\log n)$
    (C2) Non-equality              S.A ≠ T.B                                 $\mathcal{O}(n \log n)$                 $\mathcal{O}(n \log\log n)$
    (C3) Band                      |S.A − T.B| < $\varepsilon$               $\mathcal{O}(n \log n)$                 $\mathcal{O}(n \log\log n)$
    (C4) DNF of (C1), (C2), (C3)   (S.A < T.B ∧ S.A < T.C) ∨ (S.A ≠ T.D)     $\mathcal{O}(n \operatorname{polylog} n)$   $\mathcal{O}(n \operatorname{polylog} n)$

Figure 1: Preprocessing time and space complexity of our approach for various join types. Ranked and unranked enumeration return the first $k$ results in time $\mathcal{O}(P(n) + k \log k)$ and $\mathcal{O}(P(n) + k)$, respectively, using $\mathcal{O}(S(n) + k)$ space.

Key novelty.
We propose a Tuple-Level Factorization Graph (TLFG) to compactly represent the results of acyclic theta-joins in general, and of joins with inequality conditions in particular. Even though a join of $\ell$ relations of size $n$ can have $n^{\ell}$ answers, TLFG size is guaranteed to be within a polylogarithmic factor of input size and it is amenable to (ranked) enumeration. We achieve this by reducing redundancy in the join output, which is similar in motivation to factorized databases [69]. In fact, a TLFG can be directly interpreted as a union-product formula, i.e., an algebraic representation. Since work on factorized databases had mostly been concerned with equality predicates, our approach complements and extends that line of work.
Contributions.
Our main contributions (see also Figure 1) are:
(1) We propose TLFG, a succinct factorized representation of the output of any acyclic theta-join, where the join condition can be any Boolean function over the attributes of a pair of relations. For the join of $\ell$ input relations with $\mathcal{O}(n)$ tuples each, it has size and construction time $\mathcal{O}(n^2)$, even if the output is of size $n^{\ell}$.
(2) We propose a specialized TLFG for join conditions that are a DNF of inequalities (C1) […] $\mathcal{O}(n \operatorname{polylog} n)$. For non-equalities (C2) and bands (C3) […]
(3) […] ranked enumeration of the results of acyclic joins. For every value of $k$, we return the $k$ top-ranked results in $\mathrm{TT}(k) = \mathcal{O}(n^2 + k \log k)$ and $\mathrm{TT}(k) = \mathcal{O}(n \operatorname{polylog} n + k \log k)$ for general theta-joins and DNF of inequality conditions, respectively. The latter is within a polylogarithmic factor of the equi-join case where the optimal $\mathrm{TT}(k)$ is $\mathcal{O}(n + k \log k)$ [78]. Note that TLFG can also be used for unranked enumeration, where it achieves $\mathrm{TT}(k) = \mathcal{O}(n \operatorname{polylog} n + k)$. This yields an asymptotic improvement for large values of $k$ over any approach that relies on indexes for range search, such as range trees [27], which give $\mathrm{TT}(k) = \mathcal{O}(n \operatorname{polylog} n + k \operatorname{polylog} n)$. (For sufficiently large $k$, e.g., $k = n^2$, the second term dominates.)
(4) Our experiments demonstrate the practical feasibility of our ranked enumeration, improving over the competition by orders of magnitude on synthetic and real datasets.
Due to space constraints, formal proofs and several details of improvements to our core techniques (Section 6) are in the appendix.

Let $[m]$ denote the set of integers $\{1, \ldots, m\}$. Instead of SQL, we use the more concise Datalog notation to express joins. A theta-join query is a formula of the type

$Q(\mathbf{Z}) = R_1(\mathbf{X}_1) \wedge \ldots \wedge R_\ell(\mathbf{X}_\ell) \wedge \theta_1(\mathbf{Y}_1) \wedge \ldots \wedge \theta_q(\mathbf{Y}_q)$

where $R_i$ are relational symbols, $\mathbf{X}_i$ are lists of variables (or attributes), $\mathbf{Z}$, $\mathbf{Y}_j$ are subsets of $\bigcup \mathbf{X}_i$, $i \in [\ell]$, $j \in [q]$, and $\theta_j$ are Boolean formulas called join predicates. The terms $R_i(\mathbf{X}_i)$ are called the atoms of the query. Repeated occurrences of a variable in different atoms encode equality predicates. If no predicates $\theta_j$ are present, then $Q$ is an equi-join. The size of the query $|Q|$ is equal to the number of symbols in the formula. We use $\mathrm{ar}(R_i)$ to denote the arity of the relational symbol $R_i$ and $R_i.A$ for one of its attributes $A$.

Query Evaluation.
Join queries are evaluated over a database that consists of finite relations (or tables) $R_i$. These draw values from a domain which we assume to be $\mathbb{R}$ for simplicity. Thus, each relation $R_i$ is a set of tuples $r \in \mathbb{R}^{\mathrm{ar}(R_i)}$ which assign values to the attributes of $R_i$. The total number of tuples in the database is $n$. We write $r.A$ to reference the value of a specific attribute $A$ in tuple $r$. For the purposes of this paper and without loss of generality, we assume that relational symbols in different atoms are distinct since self-joins can be handled with linear overhead by copying a relation to a new one. Conceptually, the query results (or answers) are found by taking the Cartesian product between the relations, then selecting based on equi-join conditions and join predicates $\theta_j$, and finally projecting on the attributes $\mathbf{Z}$. In this paper, we consider only full queries, i.e., $\mathbf{Z} = \bigcup \mathbf{X}_i$. This means that the query has no projections and returns the assigned values of all variables. Consequently, a query result can be represented as a valid combination of input tuples, one from each table $R_i$. While our approach can handle queries with projections, the stronger guarantees we prove only hold for full queries. Yet note that it is straightforward to extend the guarantees we prove to the class of free-connex queries [7, 10, 15] (which support certain projections), but the details are beyond the scope of this paper. We assume there are no predicates on individual relations since they can be removed in linear time by filtering the corresponding input tables.
Atomic Join Predicates.
We define the following types of predicates between attributes $S.A$ and $T.B$: an inequality is $S.A < T.B$, $S.A > T.B$, $S.A \le T.B$, or $S.A \ge T.B$, a non-equality is $S.A \ne T.B$ and a band is $|S.A - T.B| < \varepsilon$ for some $\varepsilon > 0$. Our approach also supports numerical expressions over input tuples, e.g., $f(S.A_1, \ldots, S.A_I) < g(T.B_1, \ldots, T.B_J)$, with $f$ and $g$ arbitrary $\mathcal{O}(1)$-time computable functions that map to $\mathbb{R}$. We say that pair $(s \in S, t \in T)$ satisfies predicate $\theta$ if $\theta(s, t) = \mathrm{true}$.
Join Trees.
Extending the usual definition of (alpha-)acyclicity [40, 76, 88] from equi-joins to theta-joins, we say that a join query is acyclic if we can construct a join tree with the atoms (relations) as the nodes where (1) for every attribute $A$, the nodes containing $A$ form a connected subtree and (2) every join predicate $\theta_j$ is assigned to one parent-child pair of nodes $(S, T)$ such that $S$ and $T$ contain all the attributes referenced in $\theta_j$. Notice that condition (1) alone is the standard definition of a join tree for equi-joins [15]. For parent $S$ and child $T$ we write $S \bowtie_\theta T$, where $\theta$ is a conjunction of all predicates $\theta_j$ assigned to the pair $(S, T)$ and one equality predicate $S.A = T.A$ for every attribute $A$ that appears in both $S$ and $T$. We call $\theta$ the join condition between $S$ and $T$. Since we only consider acyclic queries, the join tree is given as the input to our algorithm.

Footnote: Our approach naturally extends to other domains such as strings or vectors, as long as the corresponding join predicates are well-defined and computable in $\mathcal{O}(1)$ for a pair of input tuples.

Figure 2: Examples 2 and 4: The join tree of the query $Q(A, B, C, D, E) = R(D, E) \wedge S(A, D) \wedge T(B, C) \wedge (A < B) \wedge (A > E)$ before (a) and after (b) applying QuadEqi. (a) Join tree of query $Q$, with nodes S(A, D), T(B, C), R(D, E) and predicates A < B and A > E. (b) Join tree of the equi-join $Q_E$, with nodes S(A, D), T(B, C), R(D, E), $P_{ST}$(A, B, C, D), $P_{SR}$(A, D, E).

Example 2. Consider the query $Q(A, B, C, D, E) = R(D, E) \wedge S(A, D) \wedge T(B, C) \wedge (A < B) \wedge (A > E)$.[a] This query is acyclic since we can construct a join tree (Fig. 2a). Notice that the nodes containing the same attribute $D$ are connected and each inequality has been assigned to a parent-child pair that contains all the referenced attributes. For example, $A < B$ has been assigned to $(S, T)$ ($S$ contains $A$ and $T$ contains $B$). We write $S \bowtie_\theta T$ with $\theta = S.A < T.B$ and $S \bowtie_\theta R$ with $\theta = S.A > R.E \wedge S.D = R.D$.
[a] SELECT * FROM R, S, T WHERE R.D = S.D AND S.A < T.B AND S.A > R.E
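To make the semantics of Example 2 concrete, the following minimal Python sketch evaluates the query exactly as the conceptual definition describes: Cartesian product, then selection on the equality and the two inequalities. The tuples in `R`, `S`, and `T` and the function name `evaluate_example2` are hypothetical and chosen only for illustration; the cubic-time loop is a reference semantics, not the approach proposed in this paper.

```python
from itertools import product

def evaluate_example2(R, S, T):
    """Reference semantics of Q(A,B,C,D,E) = R(D,E), S(A,D), T(B,C), A < B, A > E.

    Takes the Cartesian product and keeps the combinations that satisfy the
    equality on D (repeated variable) and both inequality predicates.
    """
    answers = []
    for (d_r, e), (a, d_s), (b, c) in product(R, S, T):
        if d_r == d_s and a < b and a > e:
            answers.append((a, b, c, d_s, e))
    return answers

# Hypothetical example data (not taken from the paper's figures).
R = [(1, 0), (1, 5), (2, 1)]   # tuples (D, E)
S = [(2, 1), (3, 2), (6, 1)]   # tuples (A, D)
T = [(3, 7), (4, 8)]           # tuples (B, C)

answers = evaluate_example2(R, S, T)
```

Each answer is a valid combination of one tuple per relation, which is the view of a full query result used throughout the paper.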
Enumeration is a process that produces the solutions to a (usually combinatorial) problem one by one without duplicates.
Unranked Enumeration. In unranked enumeration, a preprocessing phase builds data structures which then allow the enumeration phase to return query results. Full acyclic equi-joins admit linear preprocessing and constant delay between results [7, 10].

Ranked Enumeration. In ranked enumeration there is also a given ranking function that imposes a total order on the query answers. In general, the complexity of the problem depends on the ranking function. We focus on the case where tuple weights are combined and compared with the two operators of a selective dioid [38], which is a semiring with an additional ordering property that is known to be monotonic [59]. This includes ranking based on the sum of real-valued weights in ascending order and lexicographic ordering. For these ranking functions, the results of full acyclic equi-joins can be enumerated in ranked order with linear preprocessing and logarithmic delay between results [78].

We analyze all algorithms in the Random Access Machine (RAM) model with uniform cost measure in terms of their data complexity, i.e., we assume the query size $|Q|$ to be constant. In line with previous work [12, 20, 39], we assume that it is possible to create in linear time an index that supports tuple lookups in constant time. In practice, hashing achieves those guarantees in an expected, amortized sense. We include all index construction times and index sizes in our analysis.

Figure 3: Enumeration algorithms with $\mathrm{TT}(k) = \mathcal{O}(P(n) + k)$. (Timelines of the output sequences of algorithms A and B after preprocessing $P(n)$.)

For the time complexity of enumeration algorithms, we measure the time until the $k$-th result is returned ($\mathrm{TT}(k)$) for all values of $k$. For unranked enumeration, our goal is to achieve $\mathrm{TT}(k) = \mathcal{O}(P(n) + k)$ with the lowest possible preprocessing time $P(n)$. The majority of papers on enumeration [7, 17, 45, 75] have traditionally focused instead on constant delay after $P(n)$ preprocessing. This is desirable because it implies the same guarantee $\mathrm{TT}(k) = \mathcal{O}(P(n)) + k \cdot \mathcal{O}(1) = \mathcal{O}(P(n) + k)$. However, setting constant delay as the goal can lead to misjudgments about practical performance, as we illustrate next:

Example 3. Consider an enumeration problem where the output consists of the integers $1, 2, \ldots, n$, but algorithms produce duplicates that have to be filtered out on-the-fly. Assume that two algorithms A and B spend preprocessing $P(n)$, then generate a sequence of results with constant delay. For A, let this sequence be $1, 2, \ldots, n/2, 1, 2, \ldots, n/2, n/2+1, \ldots$ and for B it is $1, 1, 2, 2, \ldots, n/2, n/2, n/2+1, \ldots$ (see Fig. 3). Even though both achieve $\mathrm{TT}(k) = \mathcal{O}(P(n) + k)$, due to duplicate filtering the worst-case delay of A is $\mathcal{O}(n)$ (between $n/2$ and $n/2+1$), while B has $\mathcal{O}(1)$ delay. However, B is clearly slower than A by a factor of 2 for all $k \in [n/2]$. Since A outputs all these values earlier than B, we could make A simulate the delay of B for $k \in [n/2]$ by storing the computed values on even iterations and returning them later.

As the example illustrates, for a preprocessing cost of $\mathcal{O}(P(n))$, the ultimate goal is to guarantee $\mathrm{TT}(k) = \mathcal{O}(P(n) + k)$. Constant-delay enumeration is a sufficient condition for achieving this goal, but not a necessary one. Similarly, for ranked enumeration, we aim for $\mathrm{TT}(k) = \mathcal{O}(P(n) + k \log k)$. Since we do not assume any given indexes, a trivial lower bound on $P(n)$ is $\Omega(n)$, because the input has to be read at least once. Our algorithms achieve that lower bound up to a polylogarithmic factor. For space complexity, we use $\mathrm{MEM}(k)$ to denote the required memory until the $k$-th result is returned.

We begin by describing a direct $\mathcal{O}(n^2)$ reduction from a theta-join to an equi-join and then introduce a graph representation that will enable more efficient algorithms for inequality conditions.

We can transform any acyclic theta-join query into an equivalent acyclic equi-join query with additional "predicate relations" that encode all general predicates besides equality. For every parent-child join $S \bowtie_\theta T$ in the join tree, we materialize a new relation $P_{S,T} = \{(s, t) \mid s \in S \wedge t \in T \wedge \theta(s, t)\}$ that contains all pairs of $S$- and $T$-tuples that satisfy the join condition. Clearly, the size of $P_{S,T}$ is $\mathcal{O}(n^2)$ and it can be computed in that time.

Figure 4: Example 4: Direct transformation from a theta-join to an equi-join by the QuadEqi approach. (a) Original database with relations S(A, D), T(B, C), R(D, E) and join conditions according to the join tree of Fig. 2a: $S.A < T.B$ and $S.A > R.E \wedge S.D = R.D$. (b) Quadratically large relations $P_{ST}$ and $P_{SR}$ that resolve all the predicates that are not equalities, joined on $P_{ST}.A = P_{SR}.A \wedge P_{ST}.D = P_{SR}.D$.
Theta-join $S \bowtie_\theta T$ is then replaced by the equivalent equi-join $S \bowtie P_{S,T} \bowtie T$. Intuitively, this introduces a new node between $S$ and $T$ in the join tree. ($S$ and $T$ could now even be deleted from the join tree, but this does not affect asymptotic cost.) We proceed analogously with the remaining join-tree edges that have theta-join conditions. Once the entire join tree has been reduced to equi-joins, we can apply existing (ranked) enumeration algorithms [78]. Due to its quadratic (in input size) time and space complexity for preprocessing, we refer to this reduction algorithm as QuadEqi.

Example 4 (Example 2 continued). Figure 2b depicts our example query $Q$ after transforming it to an equi-join $Q_E$ by the QuadEqi approach. There are two new "predicate relations" $P_{ST}$ and $P_{SR}$, one for each child-parent pair in the original join tree. As an optimization, we remove from the query the original relations $R$, $S$, and $T$ and connect $P_{ST}$ directly to $P_{SR}$, obtaining the same query results. Figure 4b shows this direct connection, for the example database of Fig. 4a. The tuples in $P_{SR}$ are the $\mathcal{O}(n^2)$ pairs of tuples from $S$ and $R$ that satisfy the join condition, i.e., they have the same $D$ value and $S.A > R.E$. Notice that after materializing these relations, all join conditions are equalities.
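The QuadEqi reduction can be sketched in a few lines of Python. The sketch below materializes the predicate relations by testing all pairs and then combines them with a plain equality on the shared $S$-tuple. The helper name `predicate_relation`, the lambda predicates, and the data are assumptions made for illustration; the final hash-based combination stands in for the existing equi-join enumeration algorithms, which are not reproduced here.

```python
def predicate_relation(S, T, theta):
    """Materialize P_{S,T} = {(s, t) | s in S, t in T, theta(s, t)}.

    Tests all |S| * |T| pairs, hence quadratic time and (worst-case) size.
    """
    return [(s, t) for s in S for t in T if theta(s, t)]

# Hypothetical data in the schema of Example 2.
S = [(2, 1), (3, 2), (6, 1)]   # tuples (A, D)
T = [(3, 7), (4, 8)]           # tuples (B, C)
R = [(1, 0), (1, 5), (2, 1)]   # tuples (D, E)

# One predicate relation per parent-child pair of the join tree.
P_ST = predicate_relation(S, T, lambda s, t: s[0] < t[0])                   # S.A < T.B
P_SR = predicate_relation(S, R, lambda s, r: s[1] == r[0] and s[0] > r[1])  # S.D = R.D and S.A > R.E

# Only an equality remains: both relations must agree on the S-tuple.
r_by_s = {}
for s, r in P_SR:
    r_by_s.setdefault(s, []).append(r)
answers = [(s[0], t[0], t[1], s[1], r[1]) for s, t in P_ST for r in r_by_s.get(s, [])]
```

The quadratic cost sits entirely in `predicate_relation`, which is the step that the graph representation introduced next is meant to avoid.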
To improve over QuadEqi, we need to devise an approach where the $\mathcal{O}(n^2)$ joining pairs between parent-child relations in the join tree are represented more compactly.

To find an efficient representation that also admits efficient enumeration, we look back at the well-studied case of equi-joins. Specifically, we build upon our recent approach [78] that encodes the joining pairs as paths in a directed graph.
The Equality Case.
An equi-join between two relations of size $n$ can output $n^2$ pairs, but it can still be represented in $\mathcal{O}(n)$ space. The key insight is that tuples with the same join value can be grouped together, associating the input tuples with the corresponding "group token". This is a fundamental insight that underlies many efficient equi-join processing techniques, from hash partitioning to the elimination of redundancy in factorized databases [69]. In terms of the graph representation, we introduce artificial "nodes in the middle," one for each join value [78]. This creates a shared structure, where different source-target paths use common edges.

Example 5. Figure 5a depicts an example constructed from relations $S$ and $T$ of Fig. 4a where the join condition is $S.D = T.C$. For simplicity, node labels show only the relevant join values (of attributes $D$, $C$) instead of the entire tuple. Directly connecting all the joining pairs creates $\mathcal{O}(n^2)$ edges. A more careful construction is shown in Fig. 5b, where three intermediate nodes encode the three join values. The size of that graph is $\mathcal{O}(n)$ but the pairs of $S$ and $T$ tuples connected by a path are the same as before.

We next formalize the graph construction so that we can apply it to other join conditions:

Definition 6. A Tuple-Level Factorization Graph (TLFG) of a theta-join $S \bowtie_\theta T$ of two relations $S, T$ is a directed acyclic graph $G(V, E)$ where:
(1) $V$ contains a distinct node $v_s$ for each tuple $s \in S$ and a distinct node $v_t$ for each tuple $t \in T$ and
(2) for each $s \in S$, $t \in T$, there exists a path from $v_s$ to $v_t$ in $G$ if and only if $s$ and $t$ satisfy join condition $\theta$.
The size of a TLFG $G(V, E)$ is $|V| + |E|$ and its depth is the maximum length of any path in $G$.

Enumeration with TLFGs.
Enumeration algorithms run directly on a TLFG representation. Unranked enumeration can be achieved with a DFS-type traversal on the graph. For example, in Fig. 5b we start from the first $S$ tuple and in two hops (through an intermediate node $v$) reach the first $T$ tuple. Since they are connected, these tuples are joinable according to the equality condition. To complete a join result, we continue in the subtree of $T$ (according to the join tree) if more relations exist. Thereafter, the enumeration proceeds with the next $T$ tuple that is reachable from $v$.

Ranked enumeration requires prioritizing the paths [78], naturally incurring a logarithmic factor in the traversal. Intuitively, the size of the TLFG has an impact on preprocessing (to construct it), while depth, i.e., the length of the longest path from any $S$ tuple to any $T$ tuple, affects the time complexity of the enumeration phase (when traversing it). As we will see in Section 6, there is often a tradeoff between the two measures. Ideally, we want to find TLFGs of constant depth such that the delay between results is independent of the database size.
Duplicates.
Our TLFG definition does not require a one-to-one correspondence between paths and join results, i.e., there could be multiple paths between an $S$ tuple and a $T$ tuple it joins with. This leads to duplicate query answers in the enumeration phase. For certain join conditions, it might not be possible to find a representation that is both efficient in terms of size and depth, and also free of duplicate paths. Among the join conditions examined in this paper, this only happens for disjunctions of predicates (discussed in Section 4.3) where each answer is duplicated only $\mathcal{O}(1)$ times, thus our complexity results are unaffected by this complication. We refer to the maximum number of paths from $v_s$ to $v_t$ among all $s \in S$, $t \in T$ pairs as the duplication factor of the TLFG. Ideally, the duplication factor is 1, meaning the TLFG is duplicate-free.
Direct TLFGs.
For any theta-join, a naive way to construct a TLFG is to directly connect each source node with all its matching target nodes, as in Fig. 5c. This $\mathcal{O}(n^2)$-size TLFG is equivalent to the QuadEqi approach in that it materializes all joining pairs. Intuitively, the edges of the TLFG are the entries of the relations introduced by QuadEqi for theta-join conditions (e.g., see Fig. 4b). Our goal is to reduce this size by replacing the direct edges with slightly longer paths that can be composed from fewer edges.
Factorization Formulas.
Typically, factorization refers to the process of compacting an algebraic formula by factoring out common sub-expressions using the distributivity property [24]. Under that perspective, factorized databases [69] represent the results of an equi-join efficiently, treating them as a formula built with product and union. Besides distributivity, d-representations [72] replace shared sub-expressions with variables, further improving succinctness through memoization [26]. Our TLFGs directly give a representation of that nature, complementing known results on join factorization. (Note that in addition to supporting joins with non-equality conditions, in TLFG the atomic unit of the formulas is a database tuple (hence Tuple-Level), while in previous work on factorized databases it is an attribute value.) We illustrate this with Example 7 below.
Example 7.
Consider the inequality join $S \bowtie_{A<B} T$ with the relations $S, T$ of Fig. 4a. A naive TLFG is shown in Fig. 5c. The join results can be expressed with the "flat" representation:

Φ = ( × ) ∪ ( × ) ∪ … ∪ ( × ) ∪ ( × ) ∪ … ∪ ( × ) ∪ …

where for convenience we refer to tuples by their $A$ or $B$ value, and × and ∪ denote Cartesian product and union respectively. The flat representation has one term for each query result, separated by the union operator. In terms of the TLFG, × corresponds to path concatenation, and ∪ to branching. To make the formula more compact, we can factor out tuples that appear multiple times and reuse common subexpressions by giving them a variable name. Equivalently, the size of the TLFG can be reduced if we introduce intermediate nodes, making the different paths share the same edges. Such a factorized representation is shown in Fig. 5d. We can write the corresponding algebraic formula by defining new variables $v_i$, $i \in [\ ]$ for the intermediate nodes:

Φ = ( × v ) ∪ ( × v ) ∪ ( × v ) ∪ ( × v ), …, ( × v )
v = ( ∪ ), v = ( ), v = ( ∪ ∪ ), …, v = ( ).

We will explain how to construct this TLFG in detail in Section 4.1. Notice that the total size of these formulas is asymptotically the same as the TLFG size.
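As a complement to Definition 6 and the equality case of Example 5, the following Python sketch builds a TLFG with one intermediate node per join value and enumerates the joining pairs with a DFS-style traversal. It is an illustration under stated assumptions: the node encoding (`("s", i)`, `("v", value)`, `("t", j)`), the function names, and the data are hypothetical, and ranking is not covered.

```python
from collections import defaultdict

def equality_tlfg(S, T, s_key, t_key):
    """TLFG for an equality condition: source -> group node -> target.

    One intermediate node ("v", value) per join value, so the graph has
    O(|S| + |T|) edges and every source-target path has length 2.
    """
    edges = defaultdict(list)
    t_values = {t_key(t) for t in T}
    for i, s in enumerate(S):
        if s_key(s) in t_values:          # skip source tuples without a partner
            edges[("s", i)].append(("v", s_key(s)))
    for j, t in enumerate(T):
        edges[("v", t_key(t))].append(("t", j))
    return edges

def enumerate_pairs(edges, num_sources):
    """Unranked enumeration: DFS from each source node, one answer per path."""
    for i in range(num_sources):
        stack = [("s", i)]
        while stack:
            node = stack.pop()
            if node[0] == "t":
                yield i, node[1]
            else:
                stack.extend(edges.get(node, []))

# Hypothetical data: S(A, D) and T(B, C) joined on S.D = T.C.
S = [(1, 1), (2, 1), (3, 2), (4, 3)]
T = [(5, 1), (6, 2), (7, 2), (8, 4)]

tlfg = equality_tlfg(S, T, s_key=lambda s: s[1], t_key=lambda t: t[1])
pairs = sorted(enumerate_pairs(tlfg, len(S)))
num_edges = sum(len(targets) for targets in tlfg.values())
```

Read as a formula, each group node plays the role of a variable whose definition is the union of its outgoing target tuples, and each source tuple contributes one product with that variable.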
We now show how to construct TLFGs that have $\mathcal{O}(n \operatorname{polylog} n)$ size and $\mathcal{O}(1)$ depth when the join condition $\theta$ in a join $S \bowtie_\theta T$ is a DNF of inequalities (and also equalities). First, we consider the simple case of a single inequality and present a simple partitioning approach. Then we generalize it to conjunctions and finally, show how to combine the conjuncts of a DNF formula.

Footnote: Converting an arbitrary formula to DNF may increase query size exponentially. This does not affect data complexity, because query size is still a constant.
Equality conditions naturally group the tuples into disjoint equivalence classes. That is the property that allowed us to derive efficient TLFGs for equi-joins. However, it is missing for inequality conditions, making the problem more challenging. In this section, we develop an approach for inequalities that achieves almost the same asymptotic guarantees as the prior equality approach.
Baseline.
A naive approach is to materialize all paths explicitly as edges. This is roughly equivalent to the QuadEqi approach which materializes all matching tuple pairs. Consider the example of Figure 5c where we have $\mathcal{O}(n^2)$ joining tuples: each edge between a source node and a target node represents one joining pair of tuples. Our goal is to find a representation of paths that leverages the shared structure of the inequality.
Binary partitioning.
We now develop a representation that is inspired by the way quicksort partitions an array based on a pivot element [41]. We call this approach binary partitioning. Suppose that we have a less-than condition $S.A < T.B$. We pick a pivot value $v$ and then partition both relations $S$ and $T$ s.t. $s.A < v$ for $s \in S_1$ and $s.A \ge v$ for $s \in S_2$, and similarly $t.B < v$ for $t \in T_1$ and $t.B \ge v$ for $t \in T_2$. Thereby, we know that all values in $S_1$ are strictly less than those in $T_2$. Thus, we connect them in the graph via a single intermediate node. Then, we continue on the two horizontal partitions $(S_1, T_1)$ and $(S_2, T_2)$ recursively. Since $S_2$ tuples cannot join with $T_1$ tuples by construction, we do not miss any joining pairs. Importantly, the intermediate node we create will never be used again in subsequent recursive calls, therefore the depth of the TLFG will be 2.

In all recursive steps we pick the median of the distinct values as our pivot. For example, for a multiset with three distinct values whose middle value is 2, the pivot is 2 regardless of how often each value occurs. This pivot is easy to find in $\mathcal{O}(n)$ if the relations have been sorted on the appropriate attributes beforehand. If all tuples contain different values, then partitioning with the median creates two roughly even partitions of sizes $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$. Thus, each recursive step cuts the input by half and with $\mathcal{O}(\log n)$ recursive steps we reach the base case of just one input tuple. However, if the same attribute value appears in multiple input tuples, the two partitions we create might be uneven. Still, the number of distinct values $d$ drops by half in each recursive call. The number of steps needed to reach the base case of a single distinct value ($d = 1$) is then $\mathcal{O}(\log d) = \mathcal{O}(\log n)$ because $d \le n$. When that happens with a strictly less-than inequality ($<$), we stop because all the tuples share the same value. Overall, the time and size of this approach is $\mathcal{O}(n \log n)$, and the depth is 2.

Example 8. Figure 5d illustrates the approach for the same example as before, with dotted lines showing how the relations are partitioned. Initially, we create two partitions of three values each. The source nodes of the first partition are connected to the target nodes of the second partition via an intermediate node. The first partition is then recursively split into one partition with a single value and one with two values. Even though these new partitions are uneven in their number of nodes, they contain roughly the same number of distinct values (plus or minus one).

Other inequality types.
Our approach for less-than ($<$) is straightforward to generalize to greater-than ($>$), since it is exactly symmetrical: we simply connect the partitions in the opposite direction, i.e., $S_2$ connects to $T_1$ (instead of $S_1$ to $T_2$). For inequality predicates with equality ($\leq$, $\geq$), only a minor change in the base case of the algorithm is needed: instead of simply returning from the recursive call when only 1 distinct value remains, we connect all the source-target nodes that contain that (equal) value. This modification does not affect any of our guarantees.

[Figure 5: Factorization of Equality and Inequality conditions with our TLFGs. The S and T node labels indicate the values of the joining attributes. All TLFGs shown here have $O(1)$ depth. (a) Equality ($S.D = T.C$): naive construction with edges between all joining pairs; $O(n^2)$ size. (b) Equality ($S.D = T.C$): grouping tuples with common join values together; $O(n)$ size. (c) Inequality ($S.A < T.B$): naive construction with edges between all joining pairs; $O(n^2)$ size. (d) Inequality ($S.A < T.B$): binary partitioning, with dotted lines indicating partitioning steps; $O(n \log n)$ size.]

Lemma 9. Let $\theta$ be an inequality predicate for relations $S, T$ of total size $n$. A duplicate-free TLFG of $S \bowtie_{\theta} T$ of size $O(n \log n)$ and depth 2 can be constructed in $O(n \log n)$ time.

Proof. Correctness is easy to establish by induction: each recursive step connects precisely the joining pairs between the two partitions, and the graph within each partition is correct inductively. For the running time, we begin by sorting the relations in $O(n \log n)$. We analyze the recursion in terms of its recursion tree. Each recursive step with size $|S| + |T| = n$ requires $O(n)$ to partition the sorted relations. Then, we materialize one intermediate node and, for each source and target node, at most one edge. We then invoke 2 recursive calls with sizes $n_1 + n_2 = n$. Therefore, in every level of the recursion tree, the sizes of all the subproblems add up to $n$. Since we spend linear time per recursive step, the total work per level of the recursion tree is $O(n)$. We always cut the distinct values (roughly) in half, thus the height $L$ of the tree is $O(\log d) = O(\log n)$. Overall, the time spent on the recursion is $O(nL) = O(n \log n)$, which also bounds the size of the TLFG. Across all recursive steps, edges are created either from source nodes to intermediate nodes or from intermediate nodes to target nodes. Thus, the length of all paths from source to target nodes is 2.
The invariant property which ensures that the TLFG is duplicate-free is that whenever a recursive step is called on a set of $S', T'$ tuples, no path exists between $v_{s'}$ and $v_{t'}$ for $s' \in S'$, $t' \in T'$. □

For a conjunction of predicates, we have to make sure that each path in the TLFG satisfies all predicates. If we were to construct a TLFG for each predicate individually, then it would be hard to combine the graphs into a single one that has that property. Instead, we propose an approach that handles the conjunction by considering the predicates in sequence: whenever a set of source nodes would be connected to a set of target nodes according to one predicate, we demand that they additionally satisfy the remaining predicates. Thus, each predicate acts as a filter that keeps only certain pairs of source-target nodes and passes them on to the next predicate. When no predicates remain, we simply connect the two sets.
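As an illustration of the binary partitioning behind Lemma 9, on which the conjunction handling builds, the following Python sketch factorizes a single strict inequality $S.A < T.B$. It is a minimal sketch under simplifying assumptions, not the authors' implementation: the names `binary_partition_tlfg` and `expand_paths` are hypothetical, tuples are represented only by their join values, and the distinct values are re-sorted in every call instead of sorting once up front, so the sketch does not attain the $O(n \log n)$ time bound.

```python
def binary_partition_tlfg(s_vals, t_vals):
    """Factorize the join S.A < T.B into a depth-2 graph.

    s_vals[i] / t_vals[j] are the join values of source i / target j.
    Returns (src_edges, tgt_edges): lists of (source index, node id)
    and (node id, target index) pairs.
    """
    src_edges, tgt_edges = [], []
    next_node = 0  # id of the next intermediate node (hypothetical numbering)

    def recurse(S, T):
        nonlocal next_node
        if not S or not T:
            return  # one side is empty, so nothing can join
        distinct = sorted({s_vals[i] for i in S} | {t_vals[j] for j in T})
        if len(distinct) == 1:
            return  # base case for '<': equal values never join
        pivot = distinct[len(distinct) // 2]  # median of the distinct values
        S1 = [i for i in S if s_vals[i] < pivot]
        S2 = [i for i in S if s_vals[i] >= pivot]
        T1 = [j for j in T if t_vals[j] < pivot]
        T2 = [j for j in T if t_vals[j] >= pivot]
        if S1 and T2:
            # Every value in S1 is below the pivot and every value in T2 is
            # at least the pivot, so one intermediate node links all pairs.
            x = next_node
            next_node += 1
            src_edges.extend((i, x) for i in S1)
            tgt_edges.extend((x, j) for j in T2)
        recurse(S1, T1)  # S2 never joins with T1, so that pair is skipped
        recurse(S2, T2)

    recurse(list(range(len(s_vals))), list(range(len(t_vals))))
    return src_edges, tgt_edges


def expand_paths(src_edges, tgt_edges):
    """List the (source, target) pair of every source -> node -> target path."""
    targets = {}
    for x, j in tgt_edges:
        targets.setdefault(x, []).append(j)
    return [(i, j) for i, x in src_edges for j in targets.get(x, [])]
```

Expanding the paths for a small input and comparing them against a nested-loop join is a quick way to check that every joining pair appears exactly once, which is the duplicate-free property of the lemma.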
Equalities. First, we show that all equality predicates in the conjunction can be treated with virtually no overhead. The property that we rely on is that equalities (irrespective of their number) create disjoint partitions of tuples (see Fig. 5b). Since these partitions are independent in the graph, we simply create a TLFG for each partition separately for all remaining predicates.

Example 10. Suppose $\theta \equiv (S.D = T.C) \wedge (S.A < T.B)$. We first process the equality predicate and then the inequality. For the example of Fig. 5b, the equality creates three disjoint partitions: $(S_1, T_1)$ with value 1, $(S_2, T_2)$ with value 2, and $(S_3, T_3)$ with value 3. Rather than connecting the source-target nodes within each partition via an intermediate node, as in the case where we only have equalities, we now have three inequality subproblems, one for each partition: for $(S_i, T_i)$, we construct a TLFG with the algorithm of Section 4.1. Source-target nodes connected by the latter satisfy both predicates, since they belong to the same equality partition.

Lemma 11. Let $\theta$ be a conjunction of predicates between relations $S, T$ of total size $n$, and $\theta'$ be that conjunction with all the equality predicates removed. If for $S', T'$ with $|S'| + |T'| = n'$ we can construct a TLFG of the join $S' \bowtie_{\theta'} T'$ of size $O(f(n'))$, depth $d$, and duplication factor $u$ in time $O(g(n'))$, and $f, g$ are superadditive functions, then we can construct a TLFG of the join $S \bowtie_{\theta} T$ of size $O(f(n))$, depth $d$, and duplication factor $u$ in time $O(g(n) + n)$.

Inequalities.
We generalize the same idea to the case of multiple inequality predicates. To handle each inequality, we use the binary partitioning approach we developed in Section 4.1. Algorithm 1 shows the pseudocode of our approach. The two partitions are connected via an intermediate node only when no other predicates remain (Lines 15 to 17), otherwise they are passed on to the next predicate (Line 20). Overall, we perform two recursions simultaneously. In one direction, we make recursive calls on smaller partitions of the data and the same set of predicates (Lines 22 and 23). In the other direction, when the current predicate is satisfied for some source-target nodes, nextPredicate() is called with one less predicate (Line 20). The recursion stops either when we are left with 1 join value (base case for binary partitioning) or when we exhaust the predicate list (base case for conjunction). Finally, notice that each time a new predicate is encountered, the nodes have to be sorted according to the new attributes (Line 6).

[Figure 6: Example 12: Steps of the conjunction algorithm for two inequality predicates on $S(A, B)$, $T(C, D)$. Node labels depict $A, B$ values (left) or $C, D$ values (right). (a) Binary partitioning and recursions. (b) Handling the next predicate.]

Example 12. Consider two inequalities $S.A < T.C \wedge S.B > T.D$ for relations $S(A, B)$, $T(C, D)$ as shown in Fig. 6a. The algorithm initially processes the first inequality and splits the relations into $(S_1, T_1)$, $(S_2, T_2)$ as per the binary partitioning method (see Section 4.1). The recursive calls on these two partitions (depicted with horizontal edges) are made with the same list of predicates. While for one inequality $S_1$ and $T_2$ would be connected via an intermediate node, we now make a third recursive call (depicted with a diagonal edge) that processes the next inequality $S.B > T.D$. The result of this recursive call is shown in Fig. 6b. Only some pairs of these nodes satisfy this second predicate and are eventually connected. Also notice that inside this recursive call, we had to sort on attributes $B, D$ before using binary partitioning.
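The two simultaneous recursions described above (on smaller partitions with the same predicates, and on the cross-pivot pairs with one predicate fewer) can be sketched in Python as follows. This is a hedged illustration rather than the paper's Algorithm 1: the function name `conjunction_tlfg` and the encoding of predicates as `(a, op, b)` triples are assumptions made for the example, only strict `<` and `>` are handled, and the distinct values are recomputed in every call instead of sorting once per predicate.

```python
def conjunction_tlfg(s_rows, t_rows, preds):
    """Factorize S join T under a non-empty conjunction of inequalities.

    preds is a list of (a, op, b) triples meaning s[a] op t[b], where op
    is '<' or '>'. Returns (src_edges, tgt_edges) of a depth-2 graph.
    """
    if not preds or any(op not in ("<", ">") for _, op, _ in preds):
        raise ValueError("expected a non-empty list of '<' or '>' predicates")
    src_edges, tgt_edges = [], []
    next_node = 0  # id of the next intermediate node

    def part(S, T, preds):
        nonlocal next_node
        if not S or not T:
            return
        (a, op, b), rest = preds[0], preds[1:]
        distinct = sorted({s_rows[i][a] for i in S} | {t_rows[j][b] for j in T})
        if len(distinct) == 1:
            return  # base case for binary partitioning
        pivot = distinct[len(distinct) // 2]
        S1 = [i for i in S if s_rows[i][a] < pivot]
        S2 = [i for i in S if s_rows[i][a] >= pivot]
        T1 = [j for j in T if t_rows[j][b] < pivot]
        T2 = [j for j in T if t_rows[j][b] >= pivot]
        # Cross-pivot pairs that are guaranteed to satisfy the current predicate.
        left, right = (S1, T2) if op == "<" else (S2, T1)
        if left and right:
            if rest:
                part(left, right, rest)  # filter by the remaining predicates
            else:
                x = next_node  # base case for the conjunction: connect
                next_node += 1
                src_edges.extend((i, x) for i in left)
                tgt_edges.extend((x, j) for j in right)
        part(S1, T1, preds)  # same predicates, smaller partitions
        part(S2, T2, preds)

    part(list(range(len(s_rows))), list(range(len(t_rows))), preds)
    return src_edges, tgt_edges


# Two inequalities S.A < T.C and S.B > T.D on hypothetical rows.
s_rows = [(1, 5), (2, 3), (3, 4), (4, 1)]  # (A, B)
t_rows = [(2, 2), (3, 6), (5, 0), (4, 3)]  # (C, D)
src, tgt = conjunction_tlfg(s_rows, t_rows, [(0, "<", 0), (1, ">", 1)])
```

Each predicate in the list adds one level of nested partitioning, which is where the additional logarithmic factor per inequality comes from.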
Lemma 13. Let $\theta$ be a conjunction of $p$ inequality and any number of equality predicates for relations $S, T$ of total size $n$. A duplicate-free TLFG of $S \bowtie_{\theta} T$ of size $O(n \log^p n)$ and depth 2 can be constructed in $O(n \log^p n)$ time.

Algorithm 1: Factorizing a conjunction of inequalities
 1  Input: Relations S, T, nodes v_s, v_t for s ∈ S, t ∈ T, Conjunction θ as list of conditions [p | θL]
 2  Output: A TLFG of the join S ⋈_θ T
 3  nextPredicate(S, T, θ)
 4
 5  Procedure nextPredicate(S, T, θ = [p | θL])
 6      S', T' = S, T sorted by the attributes of p
 7      if (p == S.A < T.B) then
 8          partIneqBinary(S', T', [p | θL])
 9
10  Procedure partIneqBinary(S, T, [p = S.A < T.B | θL])
11      d = vals(S ∪ T)   //Number of distinct A, B values
12      if d == 1 then return   //Base case for binary partitioning
13      Partition (S ∪ T) into (S1 ∪ T1), (S2 ∪ T2) with median distinct value as pivot
14      if θL == [] then   //Base case for S1 to T2
15          Materialize intermediate node x
16          foreach s in S1 do Create edge v_s → x
17          foreach t in T2 do Create edge x → v_t
18      else
19          //Check S1 → T2 against the rest of the predicates
20          nextPredicate(S1, T2, θL)
21      //Recursive calls on horizontal partitions, same predicates
22      partIneqBinary(S1, T1, [p | θL])
23      partIneqBinary(S2, T2, [p | θL])

We now consider disjunctions so that we can combine the TLFGs constructed from each of the conjunctions of a DNF formula. As long as we can construct a TLFG for each term of the disjunction, we can then put them together so that the final TLFG contains the union of the TLFG paths. This approach may create duplicate paths since the predicates in the disjunction may be satisfied by the same pairs of tuples. However, the number of these duplicates is bounded by the number of different TLFGs we assemble, which in turn depends only on the size of the query.

Lemma 14. Let $\theta$ be a disjunction of predicates $\theta_1, \ldots, \theta_m$ for relations $S, T$. If for each $\theta_i$, $i \in [m]$ we can construct a duplicate-free TLFG of $S \bowtie_{\theta_i} T$ of size $O(\mathcal{S}_i)$ and depth $d_i$ in $O(\mathcal{T}_i)$ time, then we can construct a TLFG of $S \bowtie_{\theta} T$ of size $O(\sum_i \mathcal{S}_i)$ and depth $\max_i d_i$ in $O(\sum_i \mathcal{T}_i)$ time. The duplication factor of the latter is at most $m$.

We now apply our techniques on enumeration problems for acyclic queries that contain inequalities. A baseline approach is to apply QuadEqi to transform the query into an equi-join (see Section 2.1). This allows us to then use any enumeration algorithm designed for equi-joins. However, QuadEqi materializes relations that are quadratically large, hence the preprocessing time (and memory) for ranked or unranked enumeration will be $P(n) = O(n^2)$. Instead, we leverage our factorizations to derive enumeration algorithms whose preprocessing time is only $P(n) = O(n\,\mathrm{polylog}\,n)$.

The idea is to represent the matching tuples for each parent-child pair of relations $S, T$ in the join tree with our TLFGs. Consequently, matching tuples across the entire tree are connected and each matching combination corresponds to one query result. As discussed in Section 3.2, enumeration algorithms operate directly on that structure, e.g., with the any-$k$ Tree-DP framework we have proposed [78]. Since the details of that approach are beyond the limits of this paper, we instead present our results here via an alternative route. From the factorized graph representation we create a relational representation by transforming the graph into tables. Thus, the inequality-join is transformed into an equi-join over relations of size $O(n\,\mathrm{polylog}\,n)$. Then, we apply equi-join enumeration to the new query as a black-box procedure. This approach is similar to QuadEqi (or the equivalent direct TLFG) but more efficient, since the materialized relations are asymptotically smaller. This highlights the generality of our approach for contexts besides enumeration: it can be used to reduce an acyclic query with general join conditions to an equi-join on relations that are only $O(\mathrm{polylog}\,n)$ larger.

[Figure 7: Example 15: Using our TLFGs to transform the inequality join of Fig. 2a into an equi-join. (a) Relations constructed from the TLFG of Fig. 5d. (b) The join tree of the equi-join $Q_F$ that we obtain with our method, over $S(A, D)$, $T(B, C)$, $R(D, E)$ and the edge relations $E_1(A, D, V_1)$, $E_2(V_1, B, C)$, $E_3(A, D, V_2)$, $E_4(V_2, D, E)$.]

Example 15 (Example 4 continued). For the join of Fig. 4a, we construct a TLFG for $S$-$T$ and $S$-$R$ respectively. The former is depicted in Fig. 5d.
We transform the TLFG into a relational representation by adding new domain values for the intermediate nodes. More specifically, we introduce a new attribute $V_1$ whose domain values are the intermediate nodes of the TLFG, and store the edges from source nodes to $V_1$ nodes in a relation $E_1$, shown in Fig. 7a. Similarly, a relation $E_2$ contains the edges from $V_1$ nodes to target nodes. The relations are now joined with equi-join conditions only via the new attribute $V_1$. After repeating the process for relations $S, R$, we get a new join tree shown in Fig. 7b that corresponds to the equi-join query $Q_F(A, B, C, D, E, V_1, V_2) = R(D, E) \wedge S(A, D) \wedge T(B, C) \wedge E_1(A, D, V_1) \wedge E_2(V_1, B, C) \wedge E_3(A, D, V_2) \wedge E_4(V_2, D, E)$. Notice that the size of the new relations $E_1, E_2, E_3, E_4$ is only $O(n \log n)$ by the binary partitioning method.

By applying this transformation and using known techniques for the enumeration of equi-joins, we establish the following:

Theorem 16.
Let $Q$ be a full acyclic theta-join query over a database $D$ of size $n$ where all the join conditions are DNF formulas of equality and inequality predicates. Let $p$ be the maximum number of inequalities in a conjunction in every join condition $\theta$ of the join tree. Ranked enumeration of the answers to $Q$ over $D$ can be performed with $TT(k) = O(n \log^p n + k \log k)$, while unranked enumeration can be performed with $TT(k) = O(n \log^p n + k)$. The space requirement in both cases is $MEM(k) = O(n \log^p n + k)$.

We now work on improving the efficiency of our approach. In particular, we study alternative factorization methods (beyond binary partitioning) and then specialize our treatment of common inequality-type predicates, namely non-equalities and bands. These allow us to then strengthen the guarantees of Theorem 16. Due to space constraints, we only present the high-level ideas, but all the details can be found in the full version [80].

[Figure 8: Different factorization methods. (a) Inequality ($S.A < T.B$): shared ranges. Middle nodes indicate a range. $O(n)$ size, $O(n)$ depth. (b) Tradeoff between size and depth of the TLFG for a single inequality, comparing Direct, Binary Partitioning, Multiway Partitioning, Shared Ranges, and Equi-join Grouping. Ideally, we want to achieve $O(n)$ size and $O(1)$ depth, which is possible for equi-joins.]
In this section, we explore different factorization methods for inequalities to asymptotically improve the size of the TLFG.
Shared ranges.
A different idea than partitioning, based on shared ranges, is depicted in Figure 8a. This method creates a hierarchy of intermediate nodes, each one representing a range of values. Each range is entirely contained in all those that are higher in the hierarchy, thus we connect the intermediate nodes in a chain. Even though this approach yields $O(n)$ size for the TLFG, the depth is unfortunately $O(n)$ (e.g., see the path from left node 1 to right node 6). This is particularly undesirable for enumeration algorithms that traverse the TLFG, since the depth affects the enumeration delay. To achieve guarantees of the type $TT(k) = O(P(n) + k \log k)$ (see Section 2.3), we want instead TLFGs of $O(1)$ depth. Moreover, it is unclear how to generalize this idea to a conjunction of inequalities.

Multiway partitioning.
When only one inequality predicate is present, it is possible to improve the partitioning method so that the size of the representation becomes $O(n \log \log n)$ instead of $O(n \log n)$. The idea is to split the tuples into multiple partitions per step instead of two, hence we call this improved approach multiway partitioning. Intuitively, it does more work in each recursive step (but still linear) so that the total number of steps decreases. One complication of this approach is that we need 2 layers of intermediate nodes to appropriately connect the partitions, hence the depth is 3 (instead of 2 for the binary partitioning method). This tradeoff between size and depth of the TLFG is depicted in Fig. 8b. We also emphasize that this improvement is only possible for the special case of one predicate, while for conjunctions of inequalities we still rely on binary partitioning.

We now consider the case of non-equality and band predicates in the join condition. These predicates can be translated into a DNF of inequalities that we already handle with our techniques: a non-equality is a disjunction and a band is a conjunction of 2 inequalities. However, a specialized construction that leverages the specific structure of these predicates gives a more efficient representation. Specifically, we show that these two predicates can be handled as efficiently as an inequality predicate.

Combining the multiway partitioning method with the predicate-specific techniques gives us the following result:

Lemma 17.
Let $\theta$ be an inequality, non-equality, or band predicate for relations $S, T$ of total size $n$. A duplicate-free TLFG of $S \bowtie_{\theta} T$ of size $O(n \log \log n)$ and depth 3 can be constructed in $O(n \log n)$ time.

We are now in position to state our main result. Compared to Theorem 16, the theorem below relies on multiway partitioning for the base case of the conjunction algorithm (when one predicate remains) and on the specialized TLFGs for non-equalities and bands.

Theorem 18 (Main Result).
Let $Q$ be a full acyclic theta-join query over a database $D$ of size $n$ where all the join conditions are DNF formulas of equality, inequality, non-equality, or band predicates. Let $p$ be the maximum number of predicates excluding equalities in a conjunction in every join condition $\theta$ of the join tree. Ranked enumeration of the answers to $Q$ over $D$ can be performed with $TT(k) = O(n \log^p n + k \log k)$, while unranked enumeration can be performed with $TT(k) = O(n \log^p n + k)$. The space requirement in both cases is $MEM(k) = O(n \log^{p-1} n \cdot \log \log n + k)$.

We now validate the effectiveness of our approach for ranked enumeration in practice. In particular, we are interested in the time taken by various approaches to return the $k$-th ranked result for different queries and data.

Algorithms.
We compare 4 approaches for ranked enumeration of join queries with inequality-type predicates: (1) Factorized Any-K is our factorization approach. (2) QuadEqi is the approach described in Section 3.1, which materializes quadratically large relations that resolve the inequality predicates. Given the many possible ways to handle the inequalities, we elect to compare against an idealistic implementation that cannot be surpassed by any possible real-world implementation. In our experiments, QuadEqi receives its materialized tables without measuring the time required for their computation. We thus obtain a lower bound on the running time. (3) Batch produces the entire output of the query and then sorts it. It serves as a yardstick for any approach that materializes all results. Batch is given access to all the query results without having to compute them; we only measure the time it takes to sort and thus again obtain a lower bound on the running time. We note that for $\ell =$

Data.
(S) Our synthetic data generator creates relations $S_i(A_i, A_{i+1}, W)$, $i \geq 1$, with values $A_i$ from a fixed domain of integers [ . . . − ] uniformly at random while discarding duplicates. The weights $W$ are reals drawn from [ , ).
(R) For real data, we use a temporal graph RedditTitles [55] where the edges are posts from a source community to a target community identified by a hyperlink in the post title. The schema is Reddit(From, To, Timestamp, Sentiment, Readability).
(B) We also use OceaniaBirds [1], a dataset of bird observations from the Oceania continent. The schema is Birds(ID, Latitude, Longitude, IndivCount). We keep only the observations that have a non-empty IndivCount attribute.
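To make the synthetic setup concrete, the following Python sketch generates one relation in the spirit of the description above. It is only an illustration under stated assumptions: the function name `generate_relation`, the parameters `domain_size` and `max_weight`, and the reading of "discarding duplicates" as rejecting repeated attribute pairs are all hypothetical choices, and the concrete domain and weight bounds used in the experiments are not assumed here.

```python
import random


def generate_relation(n, domain_size, max_weight, seed=None):
    """Return n tuples (a, b, w) with distinct (a, b) pairs.

    a and b are drawn uniformly from range(domain_size); w is a real
    weight drawn uniformly at random between 0 and max_weight.
    """
    if n > domain_size * domain_size:
        raise ValueError("not enough distinct value pairs in the domain")
    rng = random.Random(seed)
    seen = set()
    rows = []
    while len(rows) < n:
        pair = (rng.randrange(domain_size), rng.randrange(domain_size))
        if pair in seen:
            continue  # discard duplicates and draw again
        seen.add(pair)
        rows.append((pair[0], pair[1], rng.random() * max_weight))
    return rows


# Hypothetical sizes, chosen only for the example.
s1 = generate_relation(n=1000, domain_size=100, max_weight=10.0, seed=7)
```

Fixing the seed makes a generated instance reproducible across runs, which helps when the same input has to be fed to several competing methods.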
Queries.
We test queries with various join conditions and sizes. For each type of query, we give below the corresponding SQL query that would produce the entire sorted result for a binary join. We parameterize the queries with the number of relations $\ell$. For longer queries ($\ell > 2$), we organize the relations in a chain and repeat the given join conditions between the $i$-th and $(i+1)$-th relations.

Query $Q_{S1}$ joins our synthetic tables with a single inequality.

    SELECT *, S1.W + S2.W as Weight
    FROM S1, S2
    WHERE S1.A2 < S2.A3
    ORDER BY Weight ASC

Query $Q_{S2}$ has a more complicated join condition that is a conjunction of a band and a non-equality.

    SELECT *, S1.W + S2.W as Weight
    FROM S1, S2
    WHERE ABS(S1.A2 - S2.A3) < 50 AND S1.A1 <> S2.A4
    ORDER BY Weight ASC

Query $Q_{R1}$ computes temporal paths [83] on RedditTitles, and ranks them by a measure of sentiment such that sequences of negative posts are retrieved first.

    SELECT *, R1.Sentiment + R2.Sentiment as Weight
    FROM Reddit R1, Reddit R2
    WHERE R2.Timestamp > R1.Timestamp
    ORDER BY Weight ASC

Query $Q_{R2}$ uses the sentiment in the join condition, keeping only paths along which the negative sentiment increases. For ranking, we use a measure of readability to focus on posts of higher quality.

    SELECT *, R1.Readability + R2.Readability as Weight
    FROM Reddit R1, Reddit R2
    WHERE R2.Timestamp > R1.Timestamp AND
          R2.Sentiment < R1.Sentiment
    ORDER BY Weight DESC

Last, $Q_B$ is a spatial band join on OceaniaBirds that finds pairs of populous bird sightings that are close to each other. The band width $\varepsilon$ is given as a parameter.

    SELECT *, B1.IndivCount + B2.IndivCount as Weight
    FROM Birds B1, Birds B2
    WHERE ABS(B2.Latitude - B1.Latitude) < ε AND
          ABS(B2.Longitude - B1.Longitude) < ε
    ORDER BY Weight DESC

Details.
The implementation of our algorithms is in Java 8 and the experiments are conducted on an Intel Xeon E5-2643 CPU running Ubuntu Linux. The query execution is in main memory, and the Java VM is allocated 100GB of RAM. If that is exceeded, we indicate it with an Out-Of-Memory (OOM) annotation. For Factorized Any-K and QuadEqi, which require an equi-join ranked enumeration algorithm, we use any-k Lazy [21, 78], which was found to outperform others in previous work. The version of PostgreSQL is 9.5.24. We set its parameters such that it is optimized for main-memory execution and system overhead related to logging or concurrency is minimized, as is standard in the literature [9, 78]. To enable input caching for PSQL, each execution is performed twice and we only time the second one. Additionally, we create appropriate indexes on the input relations beforehand, while our methods do not receive these indexes. Even though the task is ranked enumeration, we still give PSQL a LIMIT clause whenever we measure a specific $TT(k)$, and thus allow it to leverage the $k$ value. All data points we show are the median of 5 measurements.

Methods
First, we compare the performance of ranked enumeration using the three different factorization methods we proposed. Since only Binary Partitioning is applicable to all the types of join conditions considered in this paper, we test the different methods on the queries that have only one inequality-type predicate ($Q_{S1}$, $Q_{R1}$). Figure 9 depicts $TT(k)$ for the first 10 results. Shared Ranges yields an enumeration delay that is linear in the database size, hence it quickly deteriorates as the number of returned results $k$ increases. However, its TLFG is constructed in a single pass (after sorting), which makes preprocessing slightly faster than the partitioning approaches. Specifically, for $Q_{R1}$ (Fig. 9b) it starts returning results sooner than the partitioning methods, but loses this lead as $k$ increases. This is a consequence of the size-depth tradeoff of the TLFGs (Fig. 8b). We also report the size of the constructed TLFGs (Fig. 10). In line with our analysis, Shared Ranges is significantly more compact than Binary Partitioning across all tested queries (by a factor of roughly 2.1 to 4.1 according to Figure 10). Multiway Partitioning also succeeds, to a lesser extent, in reducing the size of Binary Partitioning (by a factor of roughly 1.1 to 1.8).

[Figure 9: First results for our different factorization methods (Binary Partitioning, Multiway Partitioning, Shared Ranges) on queries with one single inequality-type predicate; the horizontal axis is $k$. (a) Query $Q_{S1}$. (b) Query $Q_{R1}$. (Recall that only Binary Partitioning generalizes to arbitrary inequality conjunctions.)]

Figure 10: Representation size measured as the sum of nodes and edges of the TLFG.

    Method                   Q_S1 (smaller n)   Q_S1 (larger n)   Q_R1
    Multiway Partitioning    35.433k            27.007M           9.455M
    Binary Partitioning      38.967k            47.307M           10.255M
    Shared Ranges            14.306k            11.593M           4.868M

In the following, we set (1b) Multiway Partitioning as the factorization method of (1) Factorized Any-K for the single-predicate cases and (1a) Binary Partitioning for all others. We will show that our approach has a significant advantage over the competition when the size of the output is sufficiently large. We test three distinct scenarios in which a large output can occur: (1) the size of the database grows, (2) the length of the query increases, and (3) the parameter of a band join increases.
Summary.
We run queries $Q_{S1}$, $Q_{S2}$ for different input sizes $n$ and two distinct query lengths. (We note that $n$ in this section refers to the size of one relation, in contrast to the entire database size as in previous sections.) Figure 11 depicts the time to return the top $k$ results. The value of $k$ is chosen to achieve a balance between returning a few top results and the special case of a top-1 computation. We also plot how the size of the output grows with increasing $n$. Even though QuadEqi and Batch are given precomputed results and do not even have to resolve complicated join predicates, they still require a large amount of memory to store those. Thus, they quickly run out of memory even for relatively small inputs (Figure 11a). PSQL does not face a memory problem because it can resort to secondary storage, yet becomes unacceptably slow. In contrast, our Factorized Any-K approach scales smoothly across all tests and requires much less memory. For instance, in Figure 11b QuadEqi fails at an input size that we can easily handle, together with even larger inputs. For very small input sizes, the lower bounds of QuadEqi and Batch are sometimes lower, but their real running times can be much higher than that. $Q_{S2}$ has more join predicates and thus a more restricted output size (Figures 11c and 11d). Our advantage is smaller in this case, yet still significant for large values of $n$.

[Figure 11: Section 7.2.1: Synthetic data with a growing database size $n$, comparing Factorized Any-k, QuadEqi Lower Bound, Batch Lower Bound, and PSQL; each panel plots $TT(k)$ in seconds and the total output size against $n$, with OOM annotations. (a), (b) Query $Q_{S1}$ for two query lengths $\ell$. (c), (d) Query $Q_{S2}$ for two query lengths $\ell$. While all three alternative methods either run out of memory or exceed a reasonable running time, our method scales quasilinearly ($O(n\,\mathrm{polylog}\,n)$) with $n$.]

Next, we test the effect of query length on RedditTitles. We plot $TT(k)$ for three distinct values of $k$ when the length is small ($\ell = 2, 3$) and one value of $k$ for longer queries. Figure 12 depicts our results for queries $Q_{R1}$, $Q_{R2}$. Increasing the value of $k$ makes the enumeration phase longer, but the relative standing of the approaches is not affected. For a binary join $Q_{R1}$, our Factorized Any-K is faster than the lower bounds of the other methods (Figure 12a), and its advantage increases for longer queries, since the output also grows (Figures 12b and 12c). Batch runs out of memory for $\ell = 3$, while PSQL did not terminate after 3 hours. Query $Q_{R2}$ has an additional join predicate, hence its output size is more restricted. Thus, the lower bound of QuadEqi is slightly better than our approach for short queries.

[Figure 12: Section 7.2.2: a, b, c, e, f, g; Section 7.2.3: d, h. Temporal paths of different lengths on RedditTitles (left), and spatial band-join on OceaniaBirds (right), comparing Factorized Any-k, QuadEqi Lower Bound, Batch Lower Bound, and PSQL. (a), (b) Query $Q_{R1}$ for two lengths $\ell$, $TT(k)$ against $k$. (c) Query $Q_{R1}$, different lengths $\ell$. (d) Query $Q_B$, fixed $\varepsilon$. (e), (f) Query $Q_{R2}$ for two lengths $\ell$, $TT(k)$ against $k$. (g) Query $Q_{R2}$, different lengths $\ell$. (h) Query $Q_B$, different bands $\varepsilon$. Our method is robust to increasing query sizes or band-join ranges.]

We now test the band-join $Q_B$ on the OceaniaBirds dataset with various band parameters $\varepsilon$. Figure 12d shows that Factorized Any-K is superior for all tested $k$ values when we fix $\varepsilon = 0.01$. Increasing the band parameter yields more joining pairs and causes the size of the output to grow (Figure 12h). Hence, QuadEqi and Batch consume more and more memory and cannot handle $\varepsilon \geq 0.16$. On the other hand, the performance of Factorized Any-K is only mildly affected by increasing $\varepsilon$. PSQL did not terminate after 6 hours, even for the smallest $\varepsilon$. We found that band joins are poorly handled in that system, resulting in a nested loop execution even when an index is available.

(Ranked) Enumeration of Query Results. If projections are involved, unranked enumeration for equi-joins can be performed with linear preprocessing and constant delay for the class of free-connex acyclic queries [7]. In fact, no other equi-join query (excluding those with self-joins) admits such an enumeration algorithm, under fine-grained complexity assumptions [7, 14]. Similar dichotomies have been pursued by later works in more general settings, by allowing updates on the database [11], unions of queries [17], or functional dependencies [18]. Multiple surveys exist on this topic [10, 32, 75]. If we require a specific order on the output, then the more demanding task of ranked enumeration requires logarithmic delay, which is unavoidable [29] assuming the $X + Y$ conjecture [16]. Drawing and extending ideas from $k$ shortest paths [34, 42, 48], Dynamic Programming [13, 26], and $k$-best enumeration [35, 56], we recently presented [78] two competing algorithmic approaches to this problem. Focusing on binary equi-joins, Ding et al. [31] approach ranked enumeration from a mainly practical perspective. The related problems of enumerating in provably random order [20] and directly accessing any result via its rank [19] have also been considered, but are limited to equi-joins.

Non-Equality Predicates ($\neq$). Works that target non-equality predicates mainly rely on color-coding [5], a technique that was originally developed for subgraph isomorphism. Papadimitriou and Yannakakis [73] apply color-coding to conjunctive queries, establishing that non-equalities can be removed by paying an $O(\log n)$ factor (in data complexity), even if they create cycles.
The same core idea is leveraged by the (unranked) enumeration algorithm of Bagan et al. [7] and by Koutris et al. [53], who offer batch algorithms for non-equalities (which compute the entire set of results). Queries with negation can be answered by rewriting them with not-all-equal predicates [50], a generalization of non-equality. Compared to the color-coding technique, our approach (1) offers a unified perspective for a wide range of predicates and not just non-equalities, (2) gives a factorization of the query results, and (3) is asymptotically faster by $O(\log n)$ in the case of one non-equality predicate (and arbitrary equalities) per pairwise join.

Inequality Predicates ($<$). Khayyat et al. [52] provide optimized and distributed batch algorithms for up to two inequalities per join. In general, inequality predicates can be resolved by building data structures (indexes) that return the matching (right) tuples when probed with a (left) tuple [45]. Accessing these data structures requires time that is $O(\log n)$ or higher (e.g., $O(\log^{p-1} n)$ for range trees [27] with $p > 1$). Unranked enumeration for conjunctions of inequalities is possible with these data structures, yet the delay between results is (poly)logarithmic. Idris et al. [45] also report an $O(\log n)$ delay for unranked enumeration by sorting lexicographically on the inequality attributes. In contrast, our approach achieves constant delay for conjunctions of inequalities (because of the constant-depth TLFGs). For ranked enumeration, it is not clear how any of these prior data structures can be used other than the QuadEqi approach, which is our baseline.

Factorized Databases.
Factorized representations of query results [8, 69] have been proposed for equi-joins as a way to eliminate redundancies while still being useful for other tasks. These include result enumeration [71, 72], aggregate computation [8], and even machine learning [3, 54, 68, 74]. We provide factorized representations for general join predicates and leverage them for (ranked) enumeration. Factorizations have also been proposed for (equi-join) provenance polynomials [70, 71] where the atomic unit of the factorization is a tuple, similarly to our work. Other representation schemes are also being explored [28, 49]. For probabilistic databases, factorization of non-equalities [66] and inequalities [67] is possible with OBDDs. Although these are designed for a different purpose, the latter exploits the transitivity of inequality, similarly to our "shared ranges" approach (Figure 8a). We remind the reader that the latter is lacking in terms of enumeration delay and it is unclear how to generalize it to conjunctions of inequalities.

Footnote (on non-equality predicates): A non-equality is sometimes referred to as an inequality [53] or disequality [7] in the literature.
Footnote (on the delay reported by Idris et al. [45]): According to our understanding of their method, the number of binary searches needed for each result is not constant, hence the worst-case delay is larger than the reported O(log n) by that factor.

Top-k Joins.
Top-k queries are a special case of ranked enumeration where the value of k is given explicitly and its knowledge can be exploited by the algorithm. Fagin et al. [36] present the Threshold Algorithm, which has surprisingly strong optimality properties in terms of the number of tuples that are fetched from some external source, also known as the "middleware" cost model. Since this algorithm assumes restricted key-to-key joins, later works generalize it to more general joins [37, 46, 58, 84], which may even involve theta-joins [60]. However, they retain the middleware cost model, hence do not provide any non-trivial guarantees when the actual join cost is taken into account [79]. Ilyas et al. [47] survey some of these approaches, along with some related ones such as building top-k indexes [22, 77] or views [25, 43].

Optimal Join Algorithms.
Significant progress has been made towards join algorithms that achieve asymptotic guarantees as close as possible to some lower bound. Acyclic equi-joins are evaluated optimally in O(n + |out|) by the Yannakakis algorithm [87], where |out| is the output size. This bound is unattainable for cyclic queries [64], thus worst-case optimal join (WCOJ) algorithms [61, 64, 65, 82] settle for the AGM bound [6], which is the worst-case output size. Improvements over WCOJ algorithms have been made by applying (hyper)tree decompositions to get output-sensitive guarantees [4, 39], while a geometric perspective has led to even stronger notions of optimality [51, 63]. Ngo [62] recounts the development of these ideas. That line of work focuses on producing all the query results or on Boolean queries, while our work focuses on (ranked) enumeration. The enumeration guarantees we achieve for acyclic queries are within logarithmic factors (which are often ignored in this area) of the lower bound TT(k) = O(n + k).

We developed O(n polylog n)-size representations for acyclic join queries with inequality-type predicates. We then leveraged them for enumeration algorithms with or without a ranking. While for a theta-join with a black-box join condition it seems unavoidable to pay the O(n^2) cost, it would be interesting to identify other types of joins that admit efficient representations. Moreover, it remains open whether our "shared ranges" factorization (which is more compact but leads to higher delay) can be generalized to conjunctions of inequalities. Last, we leave as future work the application of our techniques to cyclic joins.

Acknowledgements.
This work was supported in part by the National Science Foundation (NSF) under award numbers CAREER IIS-1762268 and IIS-1956096.
REFERENCES
PODS. 414–431. https://doi.org/10.1145/3294052.3319694
[3] Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-database learning with sparse tensors. In
PODS .325β340. https://doi.org/10.1145/3196959.3196960[4] Mahmoud Abo Khamis, Hung Q Ngo, and Dan Suciu. 2017. What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do withone another?. In
PODS . 429β444. https://doi.org/10.1145/3034786.3056105[5] Noga Alon, Raphael Yuster, and Uri Zwick. 1995. Color-coding.
J. ACM
42, 4(1995), 844β856. https://doi.org/10.1145/210332.210337[6] Albert Atserias, Martin Grohe, and DΓ‘niel Marx. 2013. Size Bounds and QueryPlans for Relational Joins.
SIAM J. Comput.
42, 4 (2013), 1737β1767. https://doi.org/10.1137/110859440[7] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. 2007. On acyclicconjunctive queries and constant delay enumeration. In
International Workshopon Computer Science Logic (CSL) . 208β222. https://doi.org/10.1007/978-3-540-74915-8_18[8] Nurzhan Bakibayev, TomΓ‘Ε‘ KoΔiskΓ½, Dan Olteanu, and Jakub ZΓ‘vodnΓ½. 2013.Aggregation and Ordering in Factorised Databases.
PVLDB
6, 14 (2013), 1990β2001. https://doi.org/10.14778/2556549.2556579[9] Nurzhan Bakibayev, Dan Olteanu, and Jakub ZΓ‘vodnΓ½. 2012. FDB: A QueryEngine for Factorised Relational Databases.
PVLDB
5, 11 (2012), 1232β1243.https://doi.org/10.14778/2350229.2350242[10] Christoph Berkholz, Fabian Gerhardt, and Nicole Schweikardt. 2020. ConstantDelay Enumeration for Conjunctive Queries: A Tutorial.
ACM SIGLOG News
7, 1(2020), 4β33. https://doi.org/10.1145/3385634.3385636[11] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2017. AnsweringConjunctive Queries Under Updates. In
PODS . 303β318. https://doi.org/10.1145/3034786.3034789[12] Christoph Berkholz and Nicole Schweikardt. 2019. Constant Delay Enumerationwith FPT-Preprocessing for Conjunctive Queries of Bounded Submodular Width.In , Vol. 138. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 58:1β58:15. https://doi.org/10.4230/LIPIcs.MFCS.2019.58[13] Dimitri P. Bertsekas. 2005.
Dynamic Programming and Optimal Control
De la pertinence de lβΓ©numΓ©ration: complexitΓ© enlogiques propositionnelle et du premier ordre . Ph.D. Dissertation. UniversitΓ© deCaen. https://hal.archives-ouvertes.fr/tel-01081392[15] Johann Brault-Baron. 2016. Hypergraph Acyclicity Revisited.
ACM Comput. Surv.
49, 3, Article 54 (Dec. 2016), 26 pages. https://doi.org/10.1145/2983573[16] David Bremner, Timothy M Chan, Erik D Demaine, Jeff Erickson, Ferran Hurtado,John Iacono, Stefan Langerman, and Perouz Taslakian. 2006. Necklaces, con-volutions, and X+ Y. In
European Symposium on Algorithms . Springer, 160β171.https://doi.org/10.1007/s00453-012-9734-3[17] Nofar Carmeli and Markus KrΓΆll. 2019. On the Enumeration Complexity ofUnions of Conjunctive Queries. In
PODS . 134β148. https://doi.org/10.1145/3294052.3319700[18] Nofar Carmeli and Markus KrΓΆll. 2020. Enumeration Complexity of ConjunctiveQueries with Functional Dependencies.
Theory Comput. Syst.
64, 5 (2020), 828β860.https://doi.org/10.1007/s00224-019-09937-9[19] Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, andMirek Riedewald. 2020. Tractable Orders for Direct Access to Ranked Answersof Conjunctive Queries.
CoRR abs/2012.11965 (2020). arXiv:2012.11965[20] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and NicoleSchweikardt. 2020. Answering (Unions of) Conjunctive Queries Using RandomAccess and Random-Order Enumeration. In
PODS . 393β409. https://doi.org/10.1145/3375395.3387662[21] Lijun Chang, Xuemin Lin, Wenjie Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin.2015. Optimal enumeration: Efficient top- π tree matching. PVLDB
8, 5 (2015),533β544. https://doi.org/10.14778/2735479.2735486[22] Yuan-Chi Chang, Lawrence Bergman, Vittorio Castelli, Chung-Sheng Li, Ming-Ling Lo, and John R Smith. 2000. The onion technique: indexing for linearoptimization queries. In
SIGMOD . 391β402. https://doi.org/10.1145/342009.335433[23] Bernard Chazelle. 1988. Functional approach to data structures and its use inmultidimensional searching.
SIAM J. Comput.
17, 3 (1988), 427β462. https://doi.org/10.1137/0217026[24] Yves Crama and Peter L Hammer. 2011.
Boolean functions: Theory, algo-rithms, and applications . Cambridge University Press. https://doi.org/10.1017/CBO9780511852008[25] Gautam Das, Dimitrios Gunopulos, Nick Koudas, and Dimitris Tsirogiannis. 2006.Answering top-k queries using views. In
VLDB . 451β462. https://dl.acm.org/doi/10.5555/1182635.1164167[26] Sanjoy Dasgupta, Christos H Papadimitriou, and Umesh Virkumar Vazirani. 2008.
Algorithms . McGraw-Hill Higher Education. https://dl.acm.org/doi/book/10.5555/1177299[27] Mark De Berg, Marc Van Kreveld, Mark Overmars, and Otfried Schwarzkopf.1997. Computational geometry. In
Computational geometry . Springer, 1β17.https://doi.org/10.1007/978-3-540-77974-2 [28] Shaleen Deep and Paraschos Koutris. 2018. Compressed representations ofconjunctive query results. In
PODS . 307β322. https://doi.org/10.1145/3196959.3196979[29] Shaleen Deep and Paraschos Koutris. 2021. Ranked Enumeration of ConjunctiveQuery Results. In
ICDT . http://arxiv.org/abs/1902.02698[30] David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. 1991. AnEvaluation of Non-Equijoin Algorithms. In
VLDB . 443β452. https://dl.acm.org/doi/10.5555/645917.672320[31] Mengsu Ding, Shimin Chen, Nantia Makrynioti, and Stefan Manegold. 2021.Progressive Join Algorithms Considering User Preference. In
CIDR . https://ir.cwi.nl/pub/30501/30501.pdf[32] Arnaud Durand. 2020. Fine-Grained Complexity Analysis of Queries: FromDecision to Counting and Enumeration. In
PODS . 331β346. https://doi.org/10.1145/3375395.3389130[33] Jost Enderle, Matthias Hampel, and Thomas Seidl. 2004. Joining Interval Datain Relational Databases. In
SIGMOD . 683β694. https://doi.org/10.1145/1007568.1007645[34] David Eppstein. 1998. Finding the π shortest paths. SIAM J. Comput.
28, 2 (1998),652β673. https://doi.org/10.1137/S0097539795290477[35] David Eppstein. 2016. k-Best Enumeration . Springer, Encyclopedia of Algorithms,1003β1006. https://doi.org/10.1007/978-1-4939-2864-4_733[36] Ronald Fagin, Amnon Lotem, and Moni Naor. 2003. Optimal aggregation al-gorithms for middleware.
J. Comput. System Sci.
66, 4 (2003), 614β656. https://doi.org/10.1016/S0022-0000(03)00026-6[37] Jonathan Finger and Neoklis Polyzotis. 2009. Robust and efficient algorithmsfor rank join evaluation. In
SIGMOD . 415β428. https://doi.org/10.1145/1559845.1559890[38] Michel Gondran and Michel Minoux. 2008.
Graphs, Dioids and Semirings: NewModels and Algorithms (Operations Research/Computer Science Interfaces Series) .Springer. https://doi.org/10.1007/978-0-387-75450-5[39] Georg Gottlob, Gianluigi Greco, Nicola Leone, and Francesco Scarcello. 2016.Hypertree Decompositions: Questions and Answers. In
PODS . 57β74. https://doi.org/10.1145/2902251.2902309[40] M.H. Graham. 1979.
On the universal relation . Technical Report. Univ. of Toronto.[41] C. A. R. Hoare. 1962. Quicksort.
Comput. J.
5, 1 (01 1962), 10β16. https://doi.org/10.1093/comjnl/5.1.10[42] Walter Hoffman and Richard Pavley. 1959. A Method for the Solution of the π thBest Path Problem. J. ACM
6, 4 (1959), 506β514. https://doi.org/10.1145/320998.321004[43] Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. 2001. PREFER: Asystem for the efficient execution of multi-parametric ranked queries.
SIGMODRecord
30, 2 (2001), 259β270. https://doi.org/10.1145/375663.375690[44] Muhammad Idris, MartΓn Ugarte, Stijn Vansummeren, Hannes Voigt, and Wolf-gang Lehner. 2019. Efficient Query Processing for Dynamically ChangingDatasets.
SIGMOD Record
48, 1 (2019), 33β40. https://doi.org/10.1145/3371316.3371325[45] Muhammad Idris, MartΓn Ugarte, Stijn Vansummeren, Hannes Voigt, and Wolf-gang Lehner. 2020. General dynamic Yannakakis: conjunctive queries with thetajoins under updates.
VLDB J.
29 (2020), 619β653. https://doi.org/10.1007/s00778-019-00590-9[46] Ihab F Ilyas, Walid G Aref, and Ahmed K Elmagarmid. 2004. Supporting top- π join queries in relational databases. VLDB J.
13, 3 (2004), 207β221. https://doi.org/10.1007/s00778-004-0128-2[47] Ihab F Ilyas, George Beskales, and Mohamed A Soliman. 2008. A survey of top- π query processing techniques in relational database systems. Comput. Surveys πΎ shortest paths: Anew algorithm and an experimental comparison. In International Workshop onAlgorithm Engineering (WAE) . Springer, 15β29. https://doi.org/10.1007/3-540-48318-7_4[49] Ahmet Kara and Dan Olteanu. 2018. Covers of Query Results. In
ICDT . 16:1β16:22.https://doi.org/10.4230/LIPIcs.ICDT.2018.16[50] Mahmoud Abo Khamis, Hung Q. Ngo, Dan Olteanu, and Dan Suciu. 2019. BooleanTensor Decomposition for Conjunctive Queries with Negation. In
ICDT . 21:1β21:19. https://doi.org/10.4230/LIPIcs.ICDT.2019.21[51] Mahmoud Abo Khamis, Hung Q. Ngo, Christopher RΓ©, and Atri Rudra. 2016.Joins via Geometric Resolutions: Worst Case and Beyond.
TODS
41, 4, Article 22(2016), 45 pages. https://doi.org/10.1145/2967101[52] Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti,Jorge-Arnulfo QuianΓ©-Ruiz, Nan Tang, and Panos Kalnis. 2017. Fast and scalableinequality joins.
VLDB J.
26, 1 (2017), 125β150. https://doi.org/10.1007/s00778-016-0441-6[53] Paraschos Koutris, Tova Milo, Sudeepa Roy, and Dan Suciu. 2017. AnsweringConjunctive Queries with Inequalities.
Theory of Computing Systems
61, 1 (2017),2β30. https://doi.org/10.1007/s00224-016-9684-2[54] Arun Kumar, Jeffrey Naughton, and Jignesh M Patel. 2015. Learning generalizedlinear models over normalized data. In
SIGMOD. 1969–1984. https://doi.org/10.1145/2723372.2723713
[55] Srijan Kumar, William L Hamilton, Jure Leskovec, and Dan Jurafsky. 2018. Community interaction and conflict on the web. https://snap.stanford.edu/data/soc-RedditHyperlinks.html. In
WWW . 933β943.[56] Eugene L Lawler. 1972. A procedure for computing the k best solutions todiscrete optimization problems and its application to the shortest path problem.
Management science
18, 7 (1972), 401β405. https://doi.org/10.1287/mnsc.18.7.401[57] Rundong Li, Wolfgang Gatterbauer, and Mirek Riedewald. 2020. Near-OptimalDistributed Band-Joins through Recursive Partitioning. In
SIGMOD . 2375β2390.https://doi.org/10.1145/3318464.3389750[58] Nikos Mamoulis, Man Lung Yiu, Kit Hung Cheng, and David W Cheung. 2007.Efficient top- π aggregation of ranked inputs. TODS
32, 3 (2007), 19. https://doi.org/10.1145/1272743.1272749[59] Mehryar Mohri. 2002. Semiring Frameworks and Algorithms for Shortest-distance Problems.
J. Autom. Lang. Comb.
7, 3 (Jan. 2002), 321β350. http://dl.acm.org/citation.cfm?id=639508.639512[60] Apostol Natsev, Yuan-Chi Chang, John R Smith, Chung-Sheng Li, and Jeffrey ScottVitter. 2001. Supporting incremental join queries on ranked inputs. In
VLDB
ICDT , Vol. 155. 21:1β21:21. https://doi.org/10.4230/LIPIcs.ICDT.2020.21[62] Hung Q Ngo. 2018. Worst-case optimal join algorithms: Techniques, results, andopen problems. In
PODS . 111β124. https://doi.org/10.1145/3196959.3196990[63] Hung Q Ngo, Dung T Nguyen, Christopher Re, and Atri Rudra. 2014. Beyondworst-case analysis for joins with minesweeper. In
PODS . 234β245. https://doi.org/10.1145/2594538.2594547[64] Hung Q Ngo, Ely Porat, Christopher RΓ©, and Atri Rudra. 2018. Worst-case optimaljoin algorithms.
J. ACM
65, 3 (2018), 16. https://doi.org/10.1145/3180143[65] Hung Q Ngo, Christopher RΓ©, and Atri Rudra. 2014. Skew Strikes Back: NewDevelopments in the Theory of Join Algorithms.
SIGMOD Record
42, 4 (Feb. 2014),5β16. https://doi.org/10.1145/2590989.2590991[66] Dan Olteanu and Jiewen Huang. 2008. Using OBDDs for efficient query evaluationon probabilistic databases. (2008), 326β340. https://doi.org/10.1007/978-3-540-87993-0_26[67] Dan Olteanu and Jiewen Huang. 2009. Secondary-storage confidence computationfor conjunctive queries with inequalities. In
SIGMOD . 389β402. https://doi.org/10.1145/1559845.1559887[68] Dan Olteanu and Maximilian Schleich. 2016. F: Regression Models over FactorizedViews.
PVLDB
9, 13 (2016), 1573β1576. https://doi.org/10.14778/3007263.3007312[69] Dan Olteanu and Maximilian Schleich. 2016. Factorized databases.
SIGMODRecord
45, 2 (2016). https://doi.org/10.1145/3003665.3003667[70] Dan Olteanu and Jakub ZΓ‘vodn`y. 2011. On factorisation of provenance poly-nomials. In
TaPP
ICDT . 285β298. https://doi.org/10.1145/2274576.2274607[72] Dan Olteanu and Jakub ZΓ‘vodn`y. 2015. Size bounds for factorised representationsof query results.
TODS
40, 1 (2015), 2. https://doi.org/10.1145/2656335[73] Christos H. Papadimitriou and Mihalis Yannakakis. 1999. On the complexity ofdatabase queries.
J. Comput. System Sci.
58, 3 (1999), 407β427. https://doi.org/10.1006/jcss.1999.1626[74] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. 2016. Learning linearregression models over factorized joins. In
SIGMOD . 3β18. https://doi.org/10.1145/2882903.2882939[75] Luc Segoufin. 2015. Constant Delay Enumeration for Conjunctive Queries.
SIG-MOD Record
44, 1 (2015), 10β17. https://doi.org/10.1145/2783888.2783894[76] Robert E Tarjan and Mihalis Yannakakis. 1984. Simple linear-time algorithms totest chordality of graphs, test acyclicity of hypergraphs, and selectively reduceacyclic hypergraphs.
SIAM J. Comput.
13, 3 (1984), 566β579. https://doi.org/10.1137/0213035[77] Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, andDivesh Srivastava. 2003. Ranked join indices. In
ICDE . IEEE, 277β288. https://doi.org/10.1109/ICDE.2003.1260799[78] Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, andXiaofeng Yang. 2020. Optimal Algorithms for Ranked Enumeration of Answersto Full Conjunctive Queries.
PVLDB
13, 9 (2020), 1582β1597. https://doi.org/10.14778/3397230.3397250[79] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. 2020. OptimalJoin Algorithms Meet Top-k. In
SIGMOD . 2659β2665. https://doi.org/10.1145/3318464.3383132[80] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. 2021. BeyondEqui-joins: Ranking, Enumeration and Factorization.
CoRR abs/2101.12158 (2021).arXiv:2101.12158[81] Moshe Y. Vardi. 1982. The Complexity of Relational Query Languages (ExtendedAbstract). In
STOC . 137β146. https://doi.org/10.1145/800070.802186[82] Todd L. Veldhuizen. 2014. Triejoin: A Simple, Worst-Case Optimal Join Algorithm.In
ICDT . 96β106. https://doi.org/10.5441/002/icdt.2014.13 [83] Huanhuan Wu, James Cheng, Silu Huang, Yiping Ke, Yi Lu, and Yanyan Xu.2014. Path Problems in Temporal Graphs.
PVLDB
7, 9 (2014), 721β732. https://doi.org/10.14778/2732939.2732945[84] Minji Wu, Laure Berti-Equille, AmΓ©lie Marian, Cecilia M Procopiuc, and DiveshSrivastava. 2010. Processing top-k join queries.
PVLDB
3, 1 (2010), 860β870.https://doi.org/10.14778/1920841.1920951[85] Xiaofeng Yang, Deepak Ajwani, Wolfgang Gatterbauer, Patrick K Nicholson,Mirek Riedewald, and Alessandra Sala. 2018. Any- π : Anytime Top- π Tree PatternRetrieval in Labeled Graphs. In
WWW . 489β498. https://doi.org/10.1145/3178876.3186115[86] Xiaofeng Yang, Mirek Riedewald, Rundong Li, and Wolfgang Gatterbauer. 2018.Any- π Algorithms for Exploratory Analysis with Conjunctive Queries. In
Inter-national Workshop on Exploratory Search in Databases and the Web (ExploreDB) .1β3. https://doi.org/10.1145/3214708.3214711[87] Mihalis Yannakakis. 1981. Algorithms for Acyclic Database Schemes. In
VLDB .82β94. https://dl.acm.org/doi/10.5555/1286831.1286840[88] Clement Tak Yu and Meral Z Ozsoyoglu. 1979. An algorithm for tree-querymembership of a distributed query. In
COMPSAC. IEEE, 306–312. https://doi.org/10.1109/CMPSAC.1979.762509
A NOMENCLATURE
Symbol Definition π Join query
π , π,π
Relations
π΄, π΅,πΆ
Attributes X , Y , Z Lists of attributes π, π , π‘
Tuples π Join Predicate π β²β³ π π Join between
π,π on predicate ππ Total number of tuples π Number of distinct values β Number of relations π Number of predicates in the query πΊ ( π , πΈ ) Graph with nodes π and edges πΈπ£ π , π£ π‘ Nodes corresponding to tuples π β π, π‘ β π S Size of TLFG π Depth of TLFG π’ Duplication factor of TLFG π Number of conjuncts or disjuncts π Number of partitions in equality/inequality factorization π π Partition in inequality factorization π Number of groups in band factorization π» π Group in band factorizationTT ( π ) Time-to- π th resultMEM ( π ) Memory until the π th result T Time for constructing a TLFG
P ( π ) Time for preprocessing π , π (Computable) functions
B MULTIWAY PARTITIONING
We provide more details on the multiway partitioning method discussed in Section 6.1. Recall that it constitutes an improvement over the binary partitioning method of Section 4.1 for the case of a single inequality predicate. More specifically, it creates a TLFG of size O(n log log n) instead of O(n log n), while only increasing the depth to 3 from 2 (see Fig. 8b).

The main idea is to create more data partitions per recursive step. In particular, we pick ρ partitions of nodes with a roughly equal number of distinct values. Fig. 13b depicts how the partitions are connected for a less-than (<) predicate. Each source partition S_i, i ∈ [1, ρ − 1], is connected to all target partitions T_j, j ∈ [i + 1, ρ], since all values in S_i are guaranteed to be smaller than all values in T_j. The ideal number of partitions is Θ(√d), so that the connections between them can be built in O((√d)^2) = O(d), i.e., the same time that binary partitioning needs per recursive step. The advantage of the multiple partitions is that we can reach the base case d = 1 in fewer recursive steps.

Figure 13: Binary vs. multiway partitioning for inequalities. (a) 2 partitions: depth 2 with one intermediate node. (b) ρ partitions: depth 3 with 2ρ intermediate nodes.

Lemma 19. Let θ be an inequality predicate between relations S, T of total size n. A duplicate-free TLFG of the join S ⋈_θ T of size O(n log log n) and depth 3 can be constructed in O(n log n) time.

Algorithm 2: Multiway partitioning
 1  Input: Relations S, T, nodes v_s, v_t for s ∈ S, t ∈ T, predicate θ ≡ S.A < T.B
 2  Output: A TLFG of the join S ⋈_θ T
 3  Sort S, T according to attributes A, B
 4  partIneqMulti(S, T, θ)
 5  Procedure partIneqMulti(S, T, θ)
 6    d = vals(S ∪ T) //Number of distinct A, B values
 7    if d == 1 then
 8      return //Base case
 9    ρ = ⌈√d⌉ //Number of partitions
10    Partition (S ∪ T) into (S_1 ∪ T_1), ..., (S_ρ ∪ T_ρ) with ρ-quantiles of distinct values as pivots
11    for i ← 1 to ρ do
12      Materialize intermediate nodes x_i, y_i
13      foreach s in S_i do Create edge v_s → x_i
14      foreach t in T_i do Create edge y_i → v_t
15      for j ← 1 to i − 1 do Create edge x_j → y_i
16      partIneqMulti(S_i, T_i, θ) //Recursive call

Proof. The arguments for correctness and the duplicate-free property are similar to the case of binary partitioning (Lemma 9). For the depth, notice that all the edges we create are either from the source nodes to a layer of x nodes (Line 13), or from x nodes to a layer of y nodes (Line 15), or from y nodes to target nodes (Line 14). Thus, all paths from source to target nodes have a length of 3.

The running time is dominated by the O(n log n) initial sorting of the relations, but the recursion (which bounds the space consumption) is now more efficient than in the binary partitioning case. Each recursive step with size |S| + |T| = n requires O(n) to partition the sorted relations. Then, we materialize O(n) edges for source and target nodes, O(√d) intermediate nodes and O((√d)^2) = O(d) edges between them. This adds up to O(n) because d ≤ n. We then invoke ρ = ⌈√d⌉ = O(√d) recursive calls with sizes n_1 + n_2 + ... + n_ρ = n. Therefore, in every level of the recursion tree, the sizes of all the subproblems add up to n. Since we spend linear time per problem, the total work per level of the recursion tree is O(n).

The height L of the tree is the number of times we have to take the square root of d (and then the ceil function) in order to reach d = 1, which is O(log log d) = O(log log n). To see this, observe that d^((1/2)^L) = 2 ⇒ (1/2)^L log d = 1 ⇒ L = log log d. Overall, the time spent on the recursion and thus, the size of the TLFG is bounded by O(n log log n). □

C NON-EQUALITY PREDICATES
A non-equality condition S.A ≠ T.B is satisfied if either S.A < T.B or S.A > T.B. Even though it can be modeled as a disjunction of two inequalities, we now establish that (in contrast to arbitrary disjunctions) it does not increase the TLFG duplication factor. The main observation is that the pairs which satisfy one of the inequalities cannot satisfy the other one. Therefore, if we union the two inequality TLFGs, no path will be duplicated. The guarantees we obtain are the same as in the inequality case by using multiway partitioning (once for each inequality).

Lemma 20. Let θ be a non-equality predicate between relations S, T of total size n. A duplicate-free TLFG of the join S ⋈_θ T of size O(n log log n) and depth 3 can be constructed in O(n log n) time.

Proof. We sort once in O(n log n) and then call the inequality multiway partitioning algorithm twice. Thus, we have to spend two times O(n log log n) time and space. The depth of the final TLFG is still 3 since the two TLFGs are constructed independently. It also remains duplicate-free since the two inequality conditions cannot hold simultaneously. Suppose that the calls to partIneqMulti(S, T, S.A < T.B) and partIneqMulti(S, T, S.A > T.B) both create a path between v_s and v_t for two tuples s ∈ S, t ∈ T. Then, the two tuples would have to satisfy s.A < t.B and s.A > t.B, which is impossible. □

D BAND PREDICATES
In this section, we target band predicates of the type |S.A − T.B| < ε. We provide an algorithm that leverages the structure of the band to achieve asymptotically the same guarantees as the inequality case. If a band condition is handled as a generic conjunction of inequalities, then the time spent, as well as the TLFG size, are higher than with our specialized construction.

Our algorithm translates the band problem into a set of inequality problems for smaller groups of tuples, which can then be solved independently. First, we describe the intuition. The band predicate consists of two inequalities (S.A < T.B + ε) and (S.A > T.B − ε) that need to hold simultaneously. If for some source-target tuples we can guarantee that one of the two inequalities is always satisfied, then it suffices to use the inequality algorithm we developed in Section 4.1 for the other one. Therefore, the idea is to create groups of tuples with that property and cover all the possible joining pairs with these groups.

The first step is to sort the input relations and group the tuples of the target relation into maximal ε-intervals. More specifically, we start from the first T tuple and group together all those whose B values are at most ε apart from it. We then repeat the same process starting from the T tuple that is immediately after the group, creating r ≤ n groups, whose range of B values is at most ε. A source tuple is assigned to a group if it joins with at least one target tuple in the group. Since the groups represent ε-intervals of target tuples, each source tuple can be assigned to at most three groups.

Example 21. Figure 14 depicts an example with ε = 4. Notice that as the number of tuples grows, the output is O(n^2), e.g., if the domain is fixed or if ε grows together with the domain size. Initially, we group the target tuples by ε-intervals (Fig. 14a). Thus, the first group starts with the first T tuple and ends before the first T tuple whose B value is more than ε = 4 larger. This process creates three groups of target tuples, each one having a range of B values bounded by 4. Then, a source tuple is assigned to a group by comparing its A value with the limits of the group. For instance, one of the source tuples is assigned to the middle group because its A value is larger than the lower limit of the group minus 4 and smaller than the upper limit plus 4, hence it joins with at least one target tuple in that group.

After the assignment of tuples to groups, we work on each group separately. For example, consider the middle group depicted in Fig. 14b. One source tuple joins with the top T tuple, which means that this pair satisfies both inequalities. From that we can infer that the source tuple satisfies the less-than inequality with all the target tuples in the group, since their B values are at least that of the top T tuple. Thus, we can handle it by using our inequality algorithm for the greater-than condition (S.A > T.B − ε). Conversely, another source tuple joins with the bottom T tuple, thus it satisfies the greater-than inequality with all the target tuples in the group. For that tuple, we only have to handle the less-than inequality (S.A < T.B + ε). Notice that all the source tuples in the group are covered by at least one of the above scenarios.

Figure 14: Example 21: TLFG construction for band conditions, with predicate |S.A − T.B| < 4. (a) Edges between all O(n^2) joining pairs and grouping with ε-intervals. (b) Edges within a group can be modeled as two inequalities (S.A > T.B − 4 and S.A < T.B + 4).

For each group of source-target tuples we created, there are three cases for the S tuples: (1) those that join with the top target tuple but not the bottom, (2) those that join with the bottom target tuple but not the top, (3) those that join with all the target tuples. These are the only three cases since by construction of the group, the distance between the target tuples is at most ε. Case (1) can be handled as a greater-than TLFG, case (2) as a less-than, and case (3) as either one of them. As Algorithm 3 shows, partIneqMulti() is called twice for each group.

Lemma 22. Let θ be a band predicate between relations S, T of total size n. A duplicate-free TLFG of the join S ⋈_θ T of size O(n log log n) and depth 3 can be constructed in O(n log n) time.

Proof. First, we create disjoint T groups based on ε-intervals and assign each S tuple to all groups where it has joining partners (Lines 9 to 16). This can be done with binary search in O(n log n). Each T tuple is assigned to a single group. An S tuple cannot be assigned to more than three consecutive groups since their values span a range of at least 2ε.
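The grouping and assignment step described so far can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in the paper: the function names band_groups and assign_sources are hypothetical, the relations are reduced to plain lists of A and B values with T already sorted, and a source value is assigned to a group only if it actually has a joining partner there (found with two binary searches), rather than by the group-limit test of Algorithm 3.

```python
from bisect import bisect_left, bisect_right


def band_groups(t_vals, eps):
    """Split sorted T.B values into maximal groups spanning at most eps.

    Returns one (first, last) index pair into t_vals per group.
    """
    groups = []
    start = 0
    for i in range(1, len(t_vals) + 1):
        # Close the group at the end of the input, or when the next value
        # lies more than eps above the first value of the current group.
        if i == len(t_vals) or t_vals[i] > t_vals[start] + eps:
            groups.append((start, i - 1))
            start = i
    return groups


def assign_sources(s_vals, t_vals, eps):
    """Assign each S.A value to every group where it has a joining partner.

    A pair (a, b) joins if |a - b| < eps. t_vals must be sorted.
    Returns the groups and, per source value, a list of group indices.
    """
    groups = band_groups(t_vals, eps)
    # group_of[i] is the index of the group that contains t_vals[i].
    group_of = [g for g, (first, last) in enumerate(groups)
                for _ in range(first, last + 1)]
    assignment = []
    for a in s_vals:
        lo = bisect_right(t_vals, a - eps)  # first index with value > a - eps
        hi = bisect_left(t_vals, a + eps)   # first index with value >= a + eps
        if lo < hi:
            # Joining partners are t_vals[lo:hi]; they cover consecutive groups.
            assignment.append(list(range(group_of[lo], group_of[hi - 1] + 1)))
        else:
            assignment.append([])
    return groups, assignment


if __name__ == "__main__":
    groups, assignment = assign_sources([7, 1, 20], [0, 4, 5, 10], eps=4)
    print(groups)      # [(0, 1), (2, 2), (3, 3)]
    print(assignment)  # [[0, 1, 2], [0], []]
```

In this toy input (made up for illustration), the source value 7 joins with the target values 4, 5 and 10 and is therefore assigned to three consecutive groups, which is the maximum that the argument above allows.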
Within each group π» π = ( π π βͺ π π ) , thecorrectness of our algorithm follows from the fact that the π π tuples eyond Equi-joins: Ranking, Enumeration and Factorization Algorithm 3:
Handling a band predicate Input : Relations
π,π , nodes π£ π , π£ π‘ for π β π, π‘ β π , predicate π β‘ | π.π΄ β π .π΅ | < π Output : A TLFG of the join π β²β³ π π Sort
π,π according to attributes
π΄, π΅ foreach ( π π , π π , π π ) in bandToIneq ( π , π , π ) do partIneqMulti( π π , π π , π π ) Function bandToIneq(
π,π, π ) ineqs = [] //Find the limits of the groups on the right π» . start = π‘ .π΅, π = for π β to | π | do if π‘ [ π ] .π΅ > π» π . start + π then π» π . end = π [ π β ] .π΅ π + + π» π . start = π [ π ] .π΅ π» π . end = π [ π ] .π΅ foreach π» π in [ π» , . . . , π» π ] do //Assign tuples to the group π π = [ π β π | π» π . start β π β€ π .π΄ β€ H j . end + π ] π π = [ π‘ β π | π» π . start β€ π‘.π΅ β€ H j . end ] //Greater-than inequality π > = [ π β π π | π .π΄ < π» π . start + π ] ineqs.add(( π < ,π π , π.π΄ > π .π΅ β π )) //Less-than inequality ineqs.add(( π π β π > ,π π , π.π΄ < π .π΅ + π )) return ineqs are at most π apart on the π΅ attribute. Since all the assigned π π tupleshave at least one joining partner in π π , they have to join either withthe first π π tuple (in sorted π΅ order) or with the last one. Recall thatthe band condition can be rewritten as ( π.π΄ < π .π΅ + π ) β§ ( π.π΄ > π .π΅ β π ) , i.e., two inequality conditions that both have to be satisfied.In case some π β π π joins with the first π π tuple, then we know thatthe less-than condition is always satisfied for π within the group π» π .Thus, we just need to connect π£ π with all π£ π‘ for π‘ β π π that satisfythe greater-than condition. We argue similarly for the case when π joins with the last tuple of π π , where we have to take care only ofthe less-than condition. Finally, there is also the possibility that π joins with all π π tuples. In that case, both inequality conditions aresatisfied β we assign those tuples to only one of the inequalitieswhich ensures the duplicate-free property. For the running time,the total size of the groups we create is π + π + . . . + π π β€ π . If fora problem of size | π | + | π | = π where the relations have been sorted, T π΅ ( π ) is the time for factorizing a band condition and T πΌ ( π ) for aninequality, we have T π΅ ( π ) = O( π )+ T πΌ ( π )+ T πΌ ( π )+ . . . + T πΌ ( π π ) ,since we call the inequality algorithm twice within each group. 
For $T_I(n) = O(n \log \log n)$, we get $T_B(n) = O(n \log \log n)$ (by the superadditivity of $n \log \log n$ and the linear total size of the groups), which also bounds the size of the TLFG. Each call to the inequality algorithm involves different $s, t$ pairs, giving us the duplicate-free property and the same depth as the inequality TLFG. $\square$
E ADDITIONAL PROOFS
E.1 Proof of Lemma 11
To construct the TLFG for $S \bowtie_\theta T$, we gather all the equality predicates and use hashing to create partitions of tuples that correspond to equal joining values for the equality predicates. This takes $O(n)$. We then construct the TLFG for each partition independently with the conditions $\theta'$ through some algorithm $\mathcal{A}$. If $\mathcal{A}$ elects to connect two nodes, then they satisfy $\theta'$, and also the equalities since they belong to the same partition. Conversely, two nodes that remain disconnected at the end of the process either do not belong to the same equality partition or were not connected by $\mathcal{A}$, thus do not satisfy $\theta$.
Assume that the number of tuples in each partition is $n_i$, $i \in [m]$, with $n_1 + \ldots + n_m = n$. The total time spent on all partitions is $O(f(n_1) + \ldots + f(n_m))$, which by the superadditivity property of $f$ is $O(f(n_1 + \ldots + n_m)) = O(f(n))$. The same argument applies to the size, giving us $O(f(n))$. Since the partitions are disjoint, we cannot create additional duplicate paths apart from the ones created by $\mathcal{A}$, or increase the depth of each TLFG.
E.2 Proof of Lemma 13
As a first step, all the equality predicates are handled by Lemma 11. Since the time and size guarantees we show are $O(n \log^p n)$ and $n \log^p n$ is a superadditive function, they are unaffected by this step. The remaining inequality predicates are handled by Algorithm 1. We denote by $T_I(n, p)$ the running time for $n$ tuples and $p$ inequality predicates. We proceed by induction on the number of predicates $p$ to show that $T_I(n, p) \le g(p)\, n \log^p n$ for some function $g$ and sufficiently large $n$. First, assume that all the predicates are inequalities. For the base case $p = 1$, the analysis is the same as in the proof of Lemma 9: the height of the recursion tree is $O(\log n)$ and the total time is $O(n \log n)$ together with sorting once. In other words, we have $T_I(n, 1) \le c\, n \log n$ for sufficiently large $n$. For the inductive step, we assume that $T_I(n, p-1) \le g(p-1)\, n \log^{p-1} n$. The inequality at the head of the list creates a recursion tree where every node has a subset of the tuples of size $n'$ and calls the next inequality, thus is computed in $T_I(n', p-1)$. The problem sizes in some level of the tree add up to $n_1 + \ldots + n_t = n$. Thus, the work per level is bounded by $T_I(n_1, p-1) + \ldots + T_I(n_t, p-1) \le g(p-1)\, n_1 \log^{p-1} n_1 + \ldots + g(p-1)\, n_t \log^{p-1} n_t \le g(p-1)\, n \log^{p-1} n$. The height of the tree is $O(\log n)$, thus the total work in the tree is bounded by $c' \log n \cdot g(p-1)\, n \log^{p-1} n = c' g(p-1)\, n \log^p n$. We also take into account the time for sorting according to the attributes of the current inequality, which is bounded by $c'' n \log n$. Thus, we get that $T_I(n, p) \le c' g(p-1)\, n \log^p n + c'' n \log n$. If we pick a function $g$ such that $g(1) \ge c$ and $g(p) \ge c' g(p-1) + c'' \log^{1-p} n$, then $T_I(n, p) \le g(p)\, n \log^p n$. Since $\log^{1-p} n \le 1$ for $p \ge 1$ and sufficiently large $n$, it suffices to choose $g(p) \ge c' g(p-1) + c''$, which depends only on $p$. This completes the induction, establishing that $T_I(n, p) = O(n \log^p n)$ in data complexity.
The size of the TLFG cannot exceed the running time, thus it is also $O(n \log^p n)$. The depth is 2 because in all cases we use the binary partitioning method, and the duplication factor is 1 because we only connect tuples in the base case of one predicate, $p = 1$.
E.3 Proof of Lemma 14
Correctness follows from the fact that the set of paths in the constructed TLFG is the union of the paths in the TLFGs for $S \bowtie_{\theta_i} T$. For the depth, note that each $\theta_i$ is processed independently, thus the component TLFGs do not share any nodes or edges other than the endpoints. A path from $v_s$ to $v_t$ for $s \in S$, $t \in T$ may only be duplicated by different TLFG constructions since each one is duplicate-free. Thus, the duplication factor cannot exceed the number of predicates $p$.
E.4 Proof of Theorem 16
For each parent $S$ and child $T$ in the join tree, we construct a TLFG. According to Lemmas 13 and 14, the depth of each TLFG is $d = 2$. We introduce a new variable $V$ and create two new relations $E_1$, $E_2$. Let $\mathcal{A}_S$, $\mathcal{A}_T$ be the attributes of $S$ and $T$ respectively. Relation $E_1$ contains attributes $\mathcal{A}_S \cup V$, and $E_2$ contains $V \cup \mathcal{A}_T$. We add to the two new relations the edges from source to intermediate layer and from intermediate to target layer respectively. If an edge $(v_s, v)$ exists for $s \in S$ and an intermediate node $v$, we add a tuple $(s, v)$ to relation $E_1$. Similarly, if an edge $(v, v_t)$ exists for $t \in T$ and an intermediate node $v$, we add a tuple $(v, t)$ to relation $E_2$.
The size of the new relations and the time required for the entire construction follow directly from the TLFG guarantees.
We run enumeration on a query $Q_F$ that is created by removing from $Q$ all predicates and adding one atom for each new relation. Each answer to $Q_F$ corresponds to an answer to $Q$ because of the correctness of the TLFGs. The duplication factor is 1, except if we have disjunctions (Lemma 14). Let $u_{max}$ be the maximum duplication factor among the constructed TLFGs. The number of "duplicate" answers we get in $Q_F$ that correspond to the same answer $q$ of $Q$ is bounded by $u_{max}^{\ell}$, where $\ell$ is the number of $Q$ atoms. That depends only on the query size, which we consider as constant, thus it is $O(1)$.
We enumerate the answers to $Q_F$ and project out the values of the new $V$ variable as a post-processing step. First assume that we have no duplicates. In that case, the guarantees of this theorem follow immediately from known results on equi-join enumeration [7, 78]. If we have duplicates, then we can filter them by maintaining a lookup table. If the time for $k$ answers without the filtering is $TT'(k)$, then we have that $TT(k) = O(TT'(k \cdot u_{max}^{\ell})) = O(TT'(k))$, since the number of duplicates per answer is $O(1)$.
E.5 Proof of Lemma 17
Lemmas 19, 20 and 22 together prove Lemma 17.
E.6 Proof of Theorem 18
The proof is the same as that of Theorem 16, but this time we use (1) multiway partitioning for the base case of $p = 1$ ... $n \log^p n$ and $n \log^{p-1} n \cdot \log \log n$ are superadditive functions.
In the conjunction algorithm, we use multiway partitioning for $p = 1$ and binary partitioning for $p > 1$. Therefore $T_I(n, 1) \le c\, n \log \log n$, resulting in $T_I(n, p) = O(n \log^{p-1} n \cdot \log \log n)$ overall. Non-equalities and bands are translated into inequalities by using the techniques we developed in Appendices C and D: a non-equality results in two inequalities on the same sets of nodes, while a band creates multiple inequality subproblems. We use the same arguments as in the proofs of Lemmas 20 and 22. We denote by $T_I(n, p)$, $T_N(n, p)$, $T_B(n, p)$ the running time for $n$ tuples and $p$ predicates when the head of the list of predicates is an inequality, non-equality or band respectively. We have $T_N(n, p) = O(T_I(n, p) + T_I(n, p))$ and $T_B(n, p) = O(n) + 2T_I(n_1, p) + 2T_I(n_2, p) + \ldots + 2T_I(n_j, p)$ for $n_1 + n_2 + \ldots + n_j = O(n)$. By these formulas, and since $T_I(n, p) = O(n \log^{p-1} n \cdot \log \log n)$, it is easy to show the same bound for the other two. This bounds the size of the TLFGs, and thus proves the space bound of the theorem.
We force all paths in the constructed TLFGs to have length 3. Multiway partitioning already has that property. Whenever we use binary partitioning (which creates source-target paths of length 2), we insert artificial nodes between the intermediate layer and the target nodes. Specifically, for each edge $(v, v_t)$ where $t \in T$, we introduce a new node $v'$ and replace that edge with two edges $(v, v')$, $(v', v_t)$. It is easy to see that the source-target paths remain the same after this modification and that we can do it in time linear in the TLFG size, hence it does not affect any of our guarantees.
By the above, all intermediate nodes in our TLFGs can be assigned to two layers according to their distance from a source node. Our TLFGs are then translated to a relational representation. We introduce variables $V_1$, $V_2$, and create new relations $E_1$, $E_2$, $E_3$. Let $\mathcal{A}_S$, $\mathcal{A}_T$ be the attributes of $S$ and $T$ respectively. Relation $E_1$ contains attributes $\mathcal{A}_S \cup V_1$, $E_2$ contains $V_1 \cup V_2$, and $E_3$ contains $V_2 \cup \mathcal{A}_T$. Each relation contains the edges between the corresponding layers of nodes. This gives us an equi-join query $Q_F$. The arguments for the enumeration guarantees are the same as in the proof of Theorem 16.
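The padding step and the translation into three edge relations can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the edge-list encoding, the `("aux", i)` labels for the artificial nodes, and the function names `pad_to_length3`, `paths2` and `paths3` are all assumptions introduced for the example.

```python
from collections import defaultdict


def pad_to_length3(src_mid, mid_tgt):
    """Turn a depth-2 graph into three edge relations E1, E2, E3.

    src_mid: list of (s, v) edges from source to intermediate nodes.
    mid_tgt: list of (v, t) edges from intermediate to target nodes.
    Every (v, t) edge is replaced by (v, aux), (aux, t) with a fresh node.
    """
    E1 = list(src_mid)
    E2, E3 = [], []
    for i, (v, t) in enumerate(mid_tgt):
        aux = ("aux", i)  # one artificial node per replaced edge
        E2.append((v, aux))
        E3.append((aux, t))
    return E1, E2, E3


def paths2(src_mid, mid_tgt):
    """Source-target pairs connected by a path of length 2."""
    by_mid = defaultdict(list)
    for v, t in mid_tgt:
        by_mid[v].append(t)
    return sorted((s, t) for s, v in src_mid for t in by_mid[v])


def paths3(E1, E2, E3):
    """Equi-join E1(S, V1), E2(V1, V2), E3(V2, T), projecting out V1 and V2."""
    e2 = defaultdict(list)
    for v1, v2 in E2:
        e2[v1].append(v2)
    e3 = defaultdict(list)
    for v2, t in E3:
        e3[v2].append(t)
    return sorted((s, t) for s, v1 in E1 for v2 in e2[v1] for t in e3[v2])
```

The number of added nodes and edges equals the number of replaced edges, so the padding is linear in the size of the graph, and the joined pairs (with multiplicities) are the same before and after padding.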