Fast Distributed Complex Join Processing
Hao Zhang∗, Miao Qiao†, Jeffrey Xu Yu∗, Hong Cheng∗
∗The Chinese University of Hong Kong {hzhang, yu, hcheng}@se.cuhk.edu.hk
†The University of Auckland {miao.qiao}@auckland.ac.nz

Abstract—Big data analytics often requires processing complex join queries in parallel in distributed systems such as Hadoop, Spark, and Flink. Previous works consider the main bottleneck of processing complex join queries to be the communication cost incurred by shuffling intermediate results, and propose to cut such shuffling cost to zero with a one-round multi-way join algorithm. The one-round multi-way join algorithm is built on a one-round communication-optimal algorithm for data shuffling over servers and a worst-case optimal computation algorithm for sequential join evaluation on each server. These works focus on optimizing the communication bottleneck, while neglecting the fact that the query could be computationally intensive. With the communication cost well optimized, the computation cost may become a bottleneck. To reduce the computation bottleneck, one way is to trade computation for communication by pre-computing some partial results, but this can make communication or pre-computing become the bottleneck. With only one of the three costs considered at a time, the combined lowest cost may not be achieved. Thus, the question left unanswered is how much should be traded such that the combined cost of computation, communication, and pre-computing is minimal.

In this work, we study the problem of co-optimizing communication, pre-computing, and computation cost in one-round multi-way join evaluation. We propose a multi-way join approach ADJ (Adaptive Distributed Join) for complex joins, which finds one optimal query plan to process by exploring cost-effective partial results in terms of the trade-off between pre-computing, communication, and computation. We analyze the input relations for a given join query and find one optimal plan over a set of query plans in some specific form, with high-quality cost estimation by sampling. Our extensive experiments confirm that ADJ outperforms the existing multi-way join methods by up to orders of magnitude.
I. INTRODUCTION
Join query processing is one of the important issues in query processing, and join queries over relations based on equality on common attributes are commonly used in many real applications. Large-scale data analytics engines such as Spark [1], Flink [2], Hive [3], F1 [4], and Myria [5] use massive parallelism to enable efficient query processing on large data sets. Recently, data analytics engines are used beyond traditional OLAP queries, which usually consist of star-joins with aggregates. Such new kinds of workloads [6] contain complex FK-FK joins, where multiple large tables are joined or where the query graph has cycles, and have seen many applications, such as querying knowledge graphs [7], finding triangles and other complex patterns in graphs [8], and analyzing the local topology around each node in a graph, which serves as powerful discriminative features for statistical relational learning tasks such as link prediction, relational classification, and recommendation [9], [10].

However, data analytics engines process complex joins by decomposing them into smaller join queries and combining intermediate relations in multiple rounds, which suffers from expensive shuffling of intermediate results. To address such inefficiency, the one-round multi-way join HCubeJ is proposed [11], which requires no shuffling after the initial data exchange. The one-round multi-way join processes a join query in two stages, namely, data shuffling and join processing. In the data shuffling stage, HCubeJ shuffles the input relations by an optimal one-round data shuffling method HCube [12], [13]. In the join processing stage, HCubeJ uses an in-memory sequential algorithm Leapfrog [14] at each server to join the data received. It can be seen in Fig. 1(a) that the one-round multi-way join outperforms the multi-round binary join significantly, regarding the number of shuffled tuples, for complex join queries.

However, the one-round multi-way join algorithm has a deficiency: it gives reducing the communication cost higher priority than reducing the computation cost, by considering the communication cost as the dominating factor, which is not always true. The main reason is that the computation of a complex multi-way join can be inherently difficult. We tested the communication-first strategy of HCubeJ in our prototype system using optimized HCube for data shuffling and Leapfrog for join processing. As shown in the first two bars for each of the two queries (see Sec. VII-A) in Fig. 1(b), the communication cost can be small, but the computation cost can be high. Overall, the performance may not be the best as expected.

In this paper, we study how to reduce the total cost by introducing pre-computed partial results, with communication, computation, and pre-computing cost considered at the same time. As shown in Fig. 1(b), our approach can reduce the computation cost significantly with some additional overhead in communication and pre-computing cost. This problem is challenging since reducing one cost may enlarge another, and the search space of potential pre-computed partial results is huge. The main contributions are given as follows.

• We identify the performance issue of processing a query Q using HCubeJ due to the imbalance between computation and communication cost, and propose a simple mechanism to trade computation cost for communication and pre-computing cost such that the total cost is reduced for a multi-way join query Q.
• We study how to effectively find cost-effective pre-computed partial results from an overwhelmingly large search space, and join them and the rest of the relations in an optimal order. To find such an optimal query plan, we reduce the search space of query plans to filter ineffective query plans early, and propose a heuristic approach to explore cost-effective pre-computed partial results and join orders.
• We propose a simple yet effective distributed sampling process with a theoretical guarantee to provide accurate cardinality estimation for query optimization.
• We implement a prototype system ADJ and propose several implementation optimizations that significantly improve the performance of HCube, reduce the storage cost, and eliminate some redundant computation of HCubeJ.
• We conduct extensive performance studies, and confirm that our approach can be orders of magnitude faster than the previous approaches in terms of the total cost.

Fig. 1: Comparisons using two join queries (refer to Sec. VII-A) over the LJ dataset (refer to Table I): (a) One-Round vs Multi-Round; (b) Comm-First vs Co-Opt. Here, "Comm" denotes communication cost, "Comp" denotes computation cost, and "Pre+Comm" denotes pre-computing cost plus communication cost.

The paper is organized as follows. We give the preliminaries of this work, and discuss the HCube and Leapfrog algorithms and the main issues we study, in Section II. We outline our approach in Section III, and discuss how to perform cardinality estimation via distributed sampling in Section IV. In Section V, we discuss the implementation optimizations of our prototype system. In Section VI, we discuss the related work, and in Section VII we report our experimental studies. We conclude our work in Section VIII.

II. PRELIMINARIES
A database D is a collection of relations. Here, a relation R with schema {A_1, A_2, ..., A_n} is a set of tuples, (a_1, a_2, ..., a_n), where a_i is a value taken from the domain of an attribute A_i, denoted as dom(A_i), for 1 ≤ i ≤ n. Below, we use attrs(R) to denote the schema (the set of attributes) of R. A relation R with the schema attrs(R) is a subset of the Cartesian product dom(A_1) × dom(A_2) × ... × dom(A_n) for A_i ∈ attrs(R).

Fig. 2: The hypergraph of query Q (Eq. (2)), and an example of database D.

Fig. 3: (a) The tuples shuffled to the server S_1 with the hypercube of coordinate (0, 0, 0, 0, 0). (b) Leapfrog at the server S_1 with the hypercube of coordinate (0, 0, 0, 0, 0).

We focus on natural join queries (or simply join queries). A natural join query, Q, is defined over a set of m relations, R = {R_1, R_2, ..., R_m}, for m ≥ 2, in the form of

Q(attrs(Q)) :- R_1(attrs(R_1)) ⊲⊳ ... ⊲⊳ R_m(attrs(R_m)). (1)

Here, the schema of Q, denoted as attrs(Q), is the union of the schemas in R, i.e., attrs(Q) = ∪_{R_i ∈ R} attrs(R_i). For simplicity, we assume there is an arbitrary order among the attributes of Q, denoted as ord, and A_i denotes the i-th attribute in ord. We also use R(Q) to denote the set of relations in Q. A resulting tuple of Q is a tuple, τ, such that there exists a non-empty tuple t_i in R_i, for every R_i ∈ R, where the projection of τ on attrs(R_i) is equal to t_i (i.e., Π_{attrs(R_i)} τ = t_i). The result of a join Q is a relation that contains all such resulting tuples. A join query Q over m relations R can be represented as a hypergraph H = (V, E), where V and E are the set of hypernodes and the set of hyperedges, respectively; V represents the attributes attrs(Q) and E represents the m schemas. As an example, consider the following join query Q over five relations,

Q(a, b, c, d, e) :- R_1(a, b, c) ⊲⊳ R_2(a, d) ⊲⊳ R_3(c, d) ⊲⊳ R_4(b, e) ⊲⊳ R_5(c, e) (2)

Its hypergraph representation H is shown in Fig. 2 together with the 5 relations. Here, V = attrs(Q) = {a, b, c, d, e}, and E = {e_1, e_2, e_3, e_4, e_5} for e_1 = attrs(R_1), e_2 = attrs(R_2), e_3 = attrs(R_3), e_4 = attrs(R_4), and e_5 = attrs(R_5). In the following, we also use V(H) and E(H) to denote the set of hypernodes and the set of hyperedges of a hypergraph H.

A. Leapfrog and HCube Join Algorithms
We discuss HCubeJ [11], which computes join queries in a distributed system over a cluster of servers, where the database D is maintained at the servers disjointly. HCubeJ is built on two algorithms, namely, HCube [12], [13] and Leapfrog [14]. Here, HCube is a one-round communication-optimal shuffling method that shuffles data to every server in the cluster, and Leapfrog is a fast in-memory sequential multi-way join algorithm that processes the join query at each server over the data shuffled to it. For a join query Q over m relations, R = {R_1, R_2, ..., R_m}, HCube is proven in theory to be the optimal method in the worst-case sense for transmitting the tuples to servers such that each server can evaluate the query on its own without further data exchange.
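To make the communication cost concrete, the following minimal sketch (ours, not the paper's implementation) enumerates share vectors p whose product equals the number of servers and picks the one minimizing the total number of shuffled tuples, assuming the replication rule used by hypercube shuffles: a tuple of R is replicated once per combination of partitions of the attributes R does not contain. The relation sizes (four tuples each) are hypothetical.

```python
from itertools import product

def replication(p, query_attrs, rel_attrs):
    # A tuple of a relation is replicated once per combination of
    # partitions of the attributes the relation does NOT contain.
    factor = 1
    for a in query_attrs - rel_attrs:
        factor *= p[a]
    return factor

def best_shares(num_servers, sizes, schemas, query_attrs):
    # Enumerate share vectors p with prod(p) == num_servers and return
    # (cost, p) minimizing the total number of shuffled tuples.
    attrs = sorted(query_attrs)
    best = None
    for combo in product(range(1, num_servers + 1), repeat=len(attrs)):
        prod = 1
        for c in combo:
            prod *= c
        if prod != num_servers:
            continue
        p = dict(zip(attrs, combo))
        cost = sum(sizes[r] * replication(p, query_attrs, schemas[r])
                   for r in schemas)
        if best is None or cost < best[0]:
            best = (cost, p)
    return best

# Hypothetical sizes: four tuples per relation of the query in Eq. (2).
schemas = {'R1': {'a', 'b', 'c'}, 'R2': {'a', 'd'}, 'R3': {'c', 'd'},
           'R4': {'b', 'e'}, 'R5': {'c', 'e'}}
sizes = {r: 4 for r in schemas}
cost, p = best_shares(4, sizes, schemas, {'a', 'b', 'c', 'd', 'e'})
```

Several share vectors tie at the minimum here, including p = (1, 2, 2, 1, 1), the vector used in Example 2 below; a real optimizer would also use cardinality estimates rather than raw relation sizes.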
Leapfrog [14] is proven in theory to be the optimal method in the worst-case sense to evaluate a join query Q, while binary joins could be sub-optimal. Also, Leapfrog is an iterator-based algorithm, which leaves little footprint in memory when processing the query.

Leapfrog Join [14] is one of the state-of-the-art sequential join algorithms for a join query Q over m relations, R = {R_1, R_2, ..., R_m} (Eq. (1)). Let attrs(Q) be the schema of Q with n = |attrs(Q)|. Leapfrog is designed to evaluate Q based on the attribute order ord using iterators. Let t_i be an i-tuple that has the i attributes A_1 to A_i. The Leapfrog algorithm finds the t_{i+1} tuples by joining the tuple t_i with an additional A_{i+1} value, recursively, until it finds all n attribute values for Q.

The Leapfrog algorithm is illustrated in Algorithm 1 for a given input i-tuple t_i. The initial call of Leapfrog is with an empty input tuple t_0. Below, we explain the algorithm assuming that the input is a non-empty i-tuple, t_i, for i > 0. Let R_{i+1} be the set of relations R in Q such that R contains the (i+1)-th attribute A_{i+1} in order, i.e., R_{i+1} = {R | A_{i+1} ∈ attrs(R) and R is a relation appearing in Q} (line 4). Finding all A_{i+1} values that can join the input i-tuple t_i, denoted as val(t_i → A_{i+1}) (line 5), is done as follows. Here, for simplicity and without loss of generality, we assume R_{i+1} = {R, R′}. First, for R, let As be all the attributes that appear in both attrs(R) and attrs(t_i); it projects the A_{i+1} attribute value from every tuple t ∈ R that can join with the i-tuple on all the attributes As. Let T_{i+1} be a relation containing all A_{i+1} values found. Second, for R′, repeat the same, and let T′_{i+1} be a relation containing all A_{i+1} values found. The result of val(t_i → A_{i+1}) is the intersection of T_{i+1} and T′_{i+1}. At lines 6-7, for every value, v, in val(t_i → A_{i+1}), it calls Leapfrog recursively with an (i+1)-tuple, t_{i+1} = t_i ∥ v, obtained by concatenating t_i and v. At lines 1-2, if i = |attrs(Q)|, the tuple t_i is emitted through the iterator. It is important to note that the main cost of Leapfrog is the cost of the intersections.
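The recursion of Algorithm 1 can be sketched as follows. This is a simplified in-memory version that materializes the per-relation candidate sets and intersects them with set operations; the real Leapfrog instead intersects sorted iterators without materialization. The triangle query and its data are hypothetical.

```python
def leapfrog(query, order, t=()):
    # query: list of (attrs, tuples). Extend the i-tuple t with every value
    # of the next attribute that survives the intersection over all
    # relations containing that attribute (lines 4-7 of Algorithm 1).
    if len(t) == len(order):            # lines 1-2: emit a full binding
        yield t
        return
    bound = dict(zip(order, t))
    nxt = order[len(t)]
    candidates = None
    for attrs, tuples in query:
        if nxt not in attrs:            # line 4: only relations containing nxt
            continue
        vals = {row[attrs.index(nxt)] for row in tuples
                if all(row[attrs.index(a)] == bound[a]
                       for a in attrs if a in bound)}
        candidates = vals if candidates is None else candidates & vals
    for v in sorted(candidates or ()):  # lines 6-7: recurse on each value
        yield from leapfrog(query, order, t + (v,))

# Hypothetical triangle query R(a, b) ⊲⊳ S(b, c) ⊲⊳ T(a, c).
triangle = [(('a', 'b'), [(1, 2), (1, 3), (2, 3)]),
            (('b', 'c'), [(2, 3), (3, 4)]),
            (('a', 'c'), [(1, 3), (1, 4), (2, 4)])]
result = list(leapfrog(triangle, ('a', 'b', 'c')))  # three triangles
```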
Example 1: Fig. 3(b) shows the steps of Leapfrog on the server S_1 with the relations shown in Fig. 3(a). The input for the initial Leapfrog call is the empty tuple t_0. Assume the order among attrs(Q) is ord = a ≺ b ≺ c ≺ d ≺ e. First, Leapfrog projects the values for the first attribute a by attempting to join with t_0. At the server S_1, both relations, R_1 and R_2, have the attribute a. Since t_0 is empty, it projects the a values from R_1 and from R_2, and the result of the intersection is {1}, as shown in the relation T_1, whose schema is {a}, in Fig. 3(b).

Second, for the tuple t_1 = (1) in T_1, it calls Leapfrog, in which the 2nd attribute b in order is considered. Note that both relations, R_1 and R_4, have the attribute b. It projects the b attribute values from the tuples t ∈ R_1 that can join with the input tuple t_1 on the attribute a, and it projects all b attribute values from R_4, since R_4 does not have the attribute a to join with t_1. The intersection of the b attribute values from the two relations is a single value, as shown in the relation T_2 on the schema (a, b) in Fig. 3(b); that tuple on the schema (a, b) becomes the new t_2 used in the next Leapfrog call. Fig. 3(b) shows the results for T_1 to T_5 by Leapfrog at the server S_1. Here, the join result for the assigned hypercube is in T_5. It is worth noting that Leapfrog is implemented as a series of iterators to avoid recursive function calls, and every newly generated tuple t_{i+1} ∈ T_{i+1} is used immediately to generate tuples t_{i+2} without being stored in memory.

HCube Shuffle [12], [13] is one of the state-of-the-art communication methods to evaluate a join query Q in a distributed system by shuffling data in one round. The main idea is to divide the output of a join query Q into hypercubes with coordinates, and to assign one or more hypercubes to one of the N* servers to process, by shuffling to that server the tuples whose hash values partially match the coordinate of the given hypercube.
Given a vector p = (p_1, p_2, ..., p_n), where p_i is the number of partitions for the attribute A_i under ord and n = |attrs(Q)|, P = p_1 × ... × p_n hypercubes are constructed. It is worth mentioning that P can be larger than N*. Here, a hypercube is identified by a coordinate C = (c_1, ..., c_n) in [p_1] × ... × [p_n], where [l] represents the range from 0 to l − 1. Each machine can be assigned one or more hypercubes. HCube distributes the tuples of each relation to machines via shuffling by hashing. For example, assume p = (1, 2, 2, 1, 1), which specifies four hypercubes with coordinates (0, 0, 0, 0, 0), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0), and (0, 1, 1, 0, 0). The first tuple, (1, 2, 1), that appears at the top of the relation R_1(a, b, c), will be shuffled to the servers that are assigned hypercubes with coordinate (0, 0, 1, ⋆, ⋆), since h_a(1) = 0, h_b(2) = 0, h_c(1) = 1, where h_{A_i} denotes the hash function h_i for the attribute A_i, and ⋆ means any integer.

Example 2: Consider the join query Q (Eq. (2)) and the 5 relations in Fig. 2. Here, attrs(Q) = {a, b, c, d, e}. Let P = N* = 4, and assume the order among the attributes of Q is ord = a ≺ b ≺ c ≺ d ≺ e. Suppose the vector p = (p_1, p_2, p_3, p_4, p_5) = (1, 2, 2, 1, 1) is obtained by the optimizer, where p_i denotes the number of partitions for the attribute A_i. For example, p_1 is for the attribute a because a is the first attribute in ord. The hypercubes based on p are [1] × [2] × [2] × [1] × [1]. Note that [l] represents a range from 0 to l − 1. The 4 hypercubes to be assigned to the 4 servers, S_1, S_2, S_3, and S_4, are the hypercubes with coordinates C_1 = (0, 0, 0, 0, 0), C_2 = (0, 0, 1, 0, 0), C_3 = (0, 1, 0, 0, 0), and C_4 = (0, 1, 1, 0, 0). The tuples of any of the 5 relations will be sent to some hypercubes. Here, suppose that a hash function, h_i(·), is designed for the i-th attribute A_i, and the hash function is h_i(x) = x % p_i for this example. The first tuple, (1, 2, 1), that appears at the top of the relation R_1(a, b, c), will be sent to the servers with hypercubes of coordinate (0, 0, 1, ⋆, ⋆), since h_a(1) = 0, h_b(2) = 0, h_c(1) = 1. The tuples of the 5 relations that are sent to the server S_1 are shown in Fig. 3(a).

Algorithm 1: Leapfrog(t_i, Q)
Input: an i-tuple t_i, the query Q
Output: tuples of Q emitted through iterators
1 if i = |attrs(Q)| then
2   Emit(t_i);
3 else
4   let R_{i+1} be the set of relations R in Q such that R contains the (i+1)-th attribute A_{i+1} in order;
5   find all A_{i+1} values that can join the input tuple t_i, denoted as val(t_i → A_{i+1});
6   for each attribute value v in val(t_i → A_{i+1}) do
7     Leapfrog(t_i ∥ v, Q);

Fig. 4: A query candidate Q_1 which returns the same result as Q in Fig. 2 by replacing R_4 and R_5 with R_4 ⊲⊳ R_5.

After HCube completes its shuffling by hashing, each server can compute the data assigned to it using an in-memory multi-way join algorithm independently, i.e., Leapfrog, and the union of the results from the servers is the answer to the join query Q.

Remark. Given the two main costs, namely, the communication cost (shuffling cost) and the computation cost, HCubeJ is designed to put the communication cost at a higher priority and minimizes the communication cost first by optimizing p. HCube has no concern for the computation cost of Leapfrog, which does its best to process the query Q over the data shuffled to it. However, the query Q could be inherently computationally difficult, and the communication cost may not be the dominating factor in distributed join processing, as shown in Fig. 1(b). A key question we ask is which cost should be minimized. There are several options: (1) the communication cost, (2) the computation cost, and (3) both. HCubeJ takes the first option. However, it is highly likely that the minimization of the communication cost leads to high computation cost. In this paper, we study how to optimize a query Q by converting it into an equivalent query Q_i with potentially higher communication cost and lower computation cost, plus some additional pre-computing cost, such that the total cost is minimal.

III. ADAPTIVE MULTI-WAY JOIN
In this paper, we study how to minimize the total cost of communication and computation together with some additional pre-computing cost. To achieve this, we need a mechanism that allows us to balance the three costs, with the condition that the mechanism is cost-effective in achieving the goal of minimizing the total cost.

We discuss our main idea using an example. Consider the join query Q = R_1 ⊲⊳ R_2 ⊲⊳ R_3 ⊲⊳ R_4 ⊲⊳ R_5 (refer to Eq. (2) for the details) over the database D shown in Fig. 2. Let it be executed by HCubeJ, where HCube shuffles the database D, and Leapfrog is deployed on each server to compute the data shuffled to it. Assume the system finds that the time spent by HCube on shuffling tuples is relatively small, while a considerable amount of time is spent by Leapfrog on each server. Furthermore, suppose the system finds that the computation cost of Leapfrog can be reduced for the same query Q if R_4 ⊲⊳ R_5 has already been joined as one relation. In other words, let Q_1 = R_1 ⊲⊳ R_2 ⊲⊳ R_3 ⊲⊳ R_45, where R_45 = R_4 ⊲⊳ R_5; instead of executing Q directly, we pre-compute R_45 first, then execute Q_1. Though it would be more expensive to do the pre-computing and shuffle the tuples of R(Q_1), as shown in Fig. 4, it is still preferable to execute the new query Q_1 instead of Q, trading communication cost and pre-computing cost for computation cost, which is the bottleneck. The message of this example is: there is a way to reduce the computation cost at the expense of increased communication cost and some pre-computing cost, and it is possible to minimize the total cost by balancing the computation, communication, and pre-computing costs.

We give our problem statement below based on the idea presented in the example. Consider a join query Q = R_1 ⊲⊳ R_2 ⊲⊳ ... ⊲⊳ R_m (refer to Eq. (1)). Let Q be a collection of query candidates, Q = {Q_1, Q_2, ..., Q_|Q|}, where Q_i = R′_1 ⊲⊳ R′_2 ⊲⊳ ... ⊲⊳ R′_l. Here, Q_i is equivalent to Q in that Q_i and Q return the same results, attrs(Q_i) = attrs(Q), l ≤ m, and a relation R′_j in R(Q_i) is either a relation R_k in R(Q) or a relation obtained by joining some relations in R(Q). Let a query plan be a pair (Q_i, ord) that consists of a query candidate Q_i ∈ Q, which specifies how to pre-compute relations, and an attribute order ord for the attributes of Q_i, which specifies how to join the relations of the new query Q_i using Leapfrog. The problem is to find a query plan such that the total cost of communication, pre-computing, and computation is minimized. This problem is challenging due to the huge search space. For example, there exist 2^m possible combinations of joins to construct a single relation R′_j, where m is the number of relations in Q, and n! possibilities to order the attributes of Q_i.

In this paper, we propose a prototype system (ADJ) that explores cost-effective query plans from a reduced search space. The workflow of our system (ADJ) is as follows. First, we shrink the search space according to an optimal hypertree T constructed for the query Q, such that the search space of candidate relations and attribute orders ord is reduced based on T. Then, we explore cost-effective query plans derived from T by considering, with the cost model, the cost-effectiveness of trading computation for the communication and pre-computing of each pre-computed candidate relation. The cardinality estimation is done via a distributed sampler. Given an optimal query plan (Q_i, ord), first, for each relation R′_j ∈ Q_i that needs to be joined, we pre-compute and store it. After every R′_j is computed, we execute Q_i = R′_1 ⊲⊳ R′_2 ⊲⊳ ... ⊲⊳ R′_l. As shown in Fig. 1(b), our approach can significantly reduce the total cost.

Fig. 5: Hypertree T and candidate relations R_45 = R_4 ⊲⊳ R_5 and R_23 = R_2 ⊲⊳ R_3.

Next, in Sec. III-A, we explain how to reduce the search space. Then, in Sec. III-B, we show how to explore cost-effective query plans based on the hypertree T. How to estimate the cardinality via distributed sampling is shown in Sec. IV.

A. The Reduced Search Space
To reduce the search space for selecting an optimal query plan from the collection of query candidates Q_i ∈ Q and the possible attribute orders, we only consider a limited number of joins, such that a join (e.g., R_4 ⊲⊳ R_5) is as small as possible and could lower the join cost of Q. More specifically, we find query candidates that are almost acyclic queries and can be easily transformed from Q. Our intuition is that the computation cost of evaluating an acyclic query is usually significantly smaller than that of evaluating an equivalent cyclic query. Thus, an almost acyclic query Q_i could be easier to evaluate than Q.

This is done as follows. First, we represent a given join query Q by its hypergraph representation, H = (V, E). Second, for the hypergraph H, we find a hypertree representation, T = (V, E), where V(T) is a set of hypernodes and E(T) is a set of hyperedges. Recall that, in the hypergraph H, a hypernode represents an attribute, and a hyperedge represents a relation schema. The corresponding hypertree T represents the same information. (1) A hypernode in V(T) represents a subset of the hyperedges (i.e., relation schemas) in E(H), and it also corresponds to a potential pre-computed relation, which can be computed by joining the corresponding relations of the relation schemas it contains. (2) The hyperedges E(T) of T are constructed such that the hypernodes in T that contain a common attribute A must be connected in the hypertree T.

There are many possible hypertrees for a given hypergraph; we use the one for which the maximal size of the pre-computed relation of each hypernode is minimal. This requirement ensures that, for any subset of hypernodes V′(T) ⊆ V(T) to be pre-computed, the resulting relations do not incur too much pre-computing and communication overhead in the later join query Q_i. We find such a hypertree T using GHD (Generalized HyperTree Decomposition) [15].
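The connectedness requirement (2) can be checked mechanically. Below is a small sketch of such a check, using as bags the attribute sets of the hypernodes of Fig. 5 (node names and bag contents are our reconstruction of the example); the second edge set deliberately violates the requirement, since the nodes containing b would not be connected.

```python
def is_connected_for_every_attr(bags, edges):
    # For every attribute, the hypernodes whose bags contain it must
    # induce a connected subgraph of the tree.
    adj = {v: set() for v in bags}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for attr in set().union(*bags.values()):
        nodes = {v for v, bag in bags.items() if attr in bag}
        seen, stack = set(), [next(iter(nodes))]
        while stack:                      # DFS restricted to 'nodes'
            v = stack.pop()
            if v in seen or v not in nodes:
                continue
            seen.add(v)
            stack.extend(adj[v])
        if seen != nodes:                 # induced subgraph is disconnected
            return False
    return True

# v_a = attrs(R1), v_b = attrs(R2 join R3), v_c = attrs(R4 join R5)
bags = {'va': {'a', 'b', 'c'}, 'vb': {'a', 'c', 'd'}, 'vc': {'b', 'c', 'e'}}
ok = is_connected_for_every_attr(bags, [('va', 'vb'), ('va', 'vc')])
bad = is_connected_for_every_attr(bags, [('va', 'vb'), ('vb', 'vc')])
```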
To bound the maximum size of the pre-computed relation of each hypernode in the worst-case sense, in theory, we can select the hypertree with minimal fhw (fractional hypertree width) [16]. Such a hypertree T found by GHD satisfies that max_{v ∈ V(T)} |R_max|^fhw is the lowest among all hypertrees, where |R_max| = max_{R ∈ R(Q)} |R|. In other words, the size of every pre-computed relation of each hypernode is upper bounded by |R_max|^fhw for the chosen T, and this bound is the lowest among all possible T.

Example 3: Consider the join query Q = R_1(a, b, c) ⊲⊳ R_2(a, d) ⊲⊳ R_3(c, d) ⊲⊳ R_4(b, e) ⊲⊳ R_5(c, e) (Eq. (2)). Its hypergraph is shown in Fig. 2, and its hypertree T is shown leftmost in Fig. 5. The hypertree T has hypernodes v_a, v_b, and v_c, which represent R_1(a, b, c), R_2(a, d) ⊲⊳ R_3(c, d), and R_4(b, e) ⊲⊳ R_5(c, e), respectively. The hyperedges {(v_a, v_b), (v_a, v_c)} ensure that 1) T is a hypertree, and 2) for any attribute A ∈ {a, b, c, d, e}, e.g., a, the hypernodes that contain it, e.g., v_a and v_b, are connected.

As shown in Example 3, the hypertree T found from the hypergraph representation of a given join query, Q = R_1 ⊲⊳ R_2 ⊲⊳ ... ⊲⊳ R_m, has two implications regarding the reduced search space for finding the optimal Q_i = R′_1 ⊲⊳ R′_2 ⊲⊳ ... ⊲⊳ R′_l, namely, the number of joins and the attribute order.

Reducing the Number of Candidate Relations. Instead of considering every possible join to replace a single relation R′_j in Q_i, we only consider the joins represented as hypernodes in the hypertree T. By pre-computing such joins, the query Q_i is almost acyclic. Consider the hypertree, T, shown leftmost in Fig. 5 for Q = R_1 ⊲⊳ R_2 ⊲⊳ R_3 ⊲⊳ R_4 ⊲⊳ R_5. The hypertree T has three hypernodes that represent R_1(a, b, c), R_2(a, d) ⊲⊳ R_3(c, d), and R_4(b, e) ⊲⊳ R_5(c, e), respectively. Here, R_1(a, b, c) is a relation appearing in Q, and there is no need to join. For the other two hypernodes, there are only 4 choices, namely, not to pre-compute either join, to pre-compute the join R_2(a, d) ⊲⊳ R_3(c, d), to pre-compute the join R_4(b, e) ⊲⊳ R_5(c, e), or to pre-compute both joins. In other words, by the hypertree T, for this example, we only need to consider 4 possible query candidates, which decide whether R_23 and R_45 should be pre-computed. The search space of query candidates is significantly reduced to 2^|V(T)|.

Reducing the Choice of Attribute Orders. Leapfrog needs to determine the optimal attribute order for expanding from an i-tuple to an (i+1)-tuple. For a query Q with n attributes, n = |attrs(Q)|, there are n! possible attribute orders to consider for any query Q_i in Q, which incurs high selection cost. With the hypertree T, we can reduce the search space by determining an attribute order that follows a traversal order (≺) of the hypernodes of the hypertree T. Consider any hypernodes, u and v, in T, where u appears before v (e.g., u ≺ v) in the traversal order. First, an attribute that appears in u will appear before any attribute in v that does not appear in u. Second, the order of the attributes within a hypernode v can vary if they do not appear in u, and can be determined via [11]. For the hypertree T shown leftmost in Fig. 5, assume the traversal order among the hypernodes is v_a ≺ v_b ≺ v_c. A valid attribute order is a ≺ b ≺ c ≺ d ≺ e, and an invalid attribute order is a ≺ b ≺ e ≺ d ≺ c. The rationale behind such a reduction is that the attributes inside a hypernode are tightly constrained by each other, while attributes between two hypernodes are loosely constrained; thus, when following a traversal order, the attributes A_1, ..., A_{n−1} are more likely to be tightly constrained, which results in fewer intermediate tuples t_1, ..., t_{n−1} of T_1, ..., T_{n−1}, respectively, during Leapfrog. An experimental study in Sec. VII confirms this intuition. By adopting such an order, the search space of attribute orders is reduced from O(n!) to O(|V(T)|!), where |V(T)| < n.

B. Finding The Plan
In this section, we discuss how to find a good plan fromthe reduced search space.
The Optimizer.
Let n ∗ = | V ( T ) | , a naive approach findsthe optimal plan by considering every combination of querycandidates that form from candidate relations and every traver-sal orders, which are O (2 n ∗ × n ∗ !) plans in total. It is worthmentioning that calculating the cost for each plan could becostly as well. Thus finding plans by such a naive approachis not feasible.We propose an approach to find good plans by exploringeffective candidate relations in terms of trading the compu-tation with communication. Recall that, pre-computing candi-date relations could reduce the computation cost but increasethe communication cost, and bring additional pre-computingcost. By finding the candidate relations that have a largepositive utility in terms of reducing computation cost, we caneffectively trade the computation cost with communicationcost.Let C be the set of candidate relations to pre-compute, O bethe traversal orders, cost M ( C ) , cost C ( C ) , and cost iE ( C, O ) be the cost of pre-computing cost, communication cost, andthe computation cost of steps that extends to attributes of i -thtraversed nodes in Leapfrog . It is worth noting that in complexjoin, the last few steps of
Leapfrog usually dominate the entirecomputation cost due to a large number of partial bindings toextend [11], and reducing such cost by pre-computing R v usu-ally has maximum benefits in terms of reducing computationcost. An example is also shown in Fig. 6. Assuming we havean empty C and empty O . For each candidate relations R v ,where v ∈ V ( T ) , we try to explore its maximum utility bysetting last traversed node of O to v . Then we compare thecost of pre-computing R v and not pre-computing R v , whichare cost M ( R v ) + cost C ( C ∪ R v ) + cost n ∗ E ( C ∪ R v , O ) and cost C ( C )+ cost n ∗ E ( C, O ) respectively, with the cost of currentoptimal candidate relation R v ∗ in terms of cost. We onlyconsider computation cost last steps of Leapfrog , as it usuallydominates the entire computation cost. After that, we canproceed to the next round of selecting R u from the remainingcandidate relations in a similar fashion and determining whichnode u the ( n − -th traversed node and whether R u shouldbe pre-computed.The detailed procedure is described in Alg. 2. Here, in lines3-14, we gradually determine all candidate relations and thetraversal order in reverse order. In lines 5-13, we find thenext candidate relations. The if condition in line 6 is usedto ensure that only O that could be extended to valid traversal Algorithm 2:
Optimizer(Q, D)
Input: Query Q
Output:
The optimal query plan (Q_i, ord)

1:  find the optimal hypertree T for Q
2:  let C = ∅, O = ∅, V = V(T), i = n∗
3:  while V ≠ ∅ do
4:      C∗ = C, O∗ = O, cost = inf, v∗ = null
5:      for v ∈ V do
6:          if any two nodes in V \ v are connected then
7:              O′ = O.add(v), C′ = C ∪ R_v
8:              cost′ = cost_C(C) + cost_E^i(C, O′)
9:              cost′′ = cost_M(R_v) + cost_C(C′) + cost_E^i(C′, O′)
10:             if cost′ < cost then
11:                 C∗ = C, O∗ = O′, cost = cost′, v∗ = v
12:             else if cost′′ < cost then
13:                 C∗ = C′, O∗ = O′, cost = cost′′, v∗ = v
14:     i = i − 1, V.remove(v∗), C = C∗, O = O∗
15: convert C, O.reverse() to (Q_i, ord)
16: return (Q_i, ord)

Fig. 6: Percentages of intermediate tuples to extend while traversing the n-th node, the (n − 1)-th node, and the rest of the nodes, using two join queries (same as Fig. 1).

order, which is described in the last section, is considered. In lines 7-13, we compare the cost of pre-computing R_v and of not pre-computing R_v with the cost of the current optimal candidate relation R_{v∗}. Notice that, in the i-th iteration, we only need to compute cost_E^i(C′, O′), as the computation cost cost_E^{i′}(C′, O′) is the same for all candidate relations for i′ > i.

Lemma 1:
The cost of Alg. 2 is O((2n∗)(2n∗ − 1)L), where L is a large constant factor related to the cost of estimating cost_M, cost_C, and cost_E.

Computing the Cost.
Next, we discuss how to compute the pre-computing cost cost_M, the communication cost cost_C, and the computation cost of the i-th step in Leapfrog, cost_E^i. We focus on computing cost_C and cost_E, as cost_M is just a combination of cost_C and cost_E.
cost_C(C) measures the communication cost of shuffling the pre-computed relations R_v ∈ C and the remaining relations R_u, u ∈ V(T), that are not pre-computed, in terms of the seconds needed to transmit them across servers. Let us denote this collection of relations by R_C. Recall that HCube has a parameter p, which determines the number of partitions on each attribute A ∈ attrs(Q) and governs how tuples are shuffled to the servers. Given p, for each relation R ∈ R_C, each tuple t ∈ R will be sent to dup(R, p) = ∏_{A ∈ attrs(Q) \ attrs(R)} p_A servers following the rules of HCube, where p_A denotes the number of partitions on attribute A. Thus, we can represent cost_C(C) as (∑_{R ∈ R_C} |R| × dup(R, p)) / α, where α is the number of tuples transmitted per second. Here, p is a parameter of HCube and needs to be optimized to minimize cost_C(C) under the constraints that 1) the number of partitions for each attribute should be ≥
1; 2) on average, the total amount of data a server receives should be less than the memory size M of the server, which translates to M − ∑_{R ∈ R_C} |R| × frac(R, p) ≥ 0. Here, frac(R, p) denotes the average fraction of R that will be sent to a server, which is 1 / ∏_{A ∈ attrs(R)} p_A. The optimization program is as follows:

    minimize  cost_C(C)
    s.t.      p_A − 1 ≥ 0, for each A ∈ attrs(Q)
              M − ∑_{R ∈ R_C} |R| × frac(R, p) ≥ 0        (3)

By solving the above optimization program, we obtain cost_C(C) = (∑_{R ∈ R_C} |R| × dup(R, p)) / α.
cost_E^i(C, O) measures the computation cost of the steps that extend to the attributes of the i-th traversed node in Leapfrog. Recall that
Leapfrog gradually extends an i-tuple t_i ∈ T_i to (i + 1)-tuples t_{i+1}, (i + 2)-tuples, ..., n-tuples. As a single node v ∈ V(T) might contain several attributes, extending one node v might correspond to extending several attributes. For simplicity, we use T_{v_i} to denote the partial bindings whose attributes are from v_1, v_2, ..., v_i, where v_i is the i-th traversed node. Thus, we can represent the cost of extending to the attributes of the i-th traversed node in Leapfrog, cost_E^i(C, O), as |T_{v_{i−1}}| / (β_i × N∗), where |T_{v_{i−1}}| is the number of partial bindings whose attributes are from v_1, ..., v_{i−1}, β_i is the number of partial bindings extended per second per server, and N∗ is the number of servers. Notice that β_i can be significantly higher if v_i is pre-computed.
cost_M measures the pre-computing cost of R_v. Let λ(v) be the relations of a node v in V(T); cost_M consists of the communication cost of shuffling λ(v) and the computation cost of ⊲⊳ λ(v), which can be computed using the above methods for computing cost_C(C) and cost_E^i(C, O).
In the above calculation, α can be regarded as a constant that measures the communication performance of the cluster. More specifically, we can measure it by randomly generating k tuples to be shuffled to random servers in the cluster and recording the time t to shuffle the k tuples to their destinations, so that α = k/t. β can be estimated by sampling some partial bindings, extending them, and taking the average of their extension time. More specifically, if v_i is pre-computed, the main cost of extending a partial binding is querying the trie for candidate values; thus β_i is a constant that can be pre-measured as k/t by recording the time t to perform k queries on a trie of size |R_{v_i}|. It is worth noting that we can pre-measure β_i on tries of various sizes. If v_i is not pre-computed, we set β_i by reusing statistics gathered during sampling, which is to be explained in the next section.
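The cost formulas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; all function and variable names are hypothetical, relations are given as (attribute-set, size) pairs, and the shares p are a dict from attribute to its number of partitions.

```python
from functools import reduce

def dup(rel_attrs, query_attrs, p):
    """Number of servers each tuple of R is sent to under HCube:
    the product of shares p_A over attributes A not in attrs(R)."""
    return reduce(lambda acc, a: acc * p[a],
                  (a for a in query_attrs if a not in rel_attrs), 1)

def cost_comm(relations, query_attrs, p, alpha):
    """cost_C: total number of tuples shipped, divided by the measured
    throughput alpha (tuples transmitted per second)."""
    return sum(size * dup(attrs, query_attrs, p)
               for attrs, size in relations) / alpha

def cost_comp_step(num_partial_bindings, beta_i, num_servers):
    """cost_E^i: partial bindings to extend, divided by the per-server
    extension rate beta_i and the number of servers N*."""
    return num_partial_bindings / (beta_i * num_servers)
```

For instance, for a query over attributes {a, b, c} with shares p_a = 2, p_b = 2, p_c = 3, each tuple of a relation over {a, b} is duplicated p_c = 3 times, matching the dup(R, p) formula.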
More specifically, let the total number of extensions performed during sampling be k and the aggregated extension time be t; we set β_i = k/t.

IV. ESTIMATING CARDINALITY VIA DISTRIBUTED SAMPLING
In this section, we discuss how we perform cardinality estimation via distributed sampling and why we choose a sampling-based approach to estimate cardinality.
Why Sampling.
An accurate cardinality estimation is crucial for the optimizer to choose a good query plan [17]. Currently, there are two styles of cardinality estimation: 1) sketch-based approaches and 2) sampling-based approaches.
Theoretical [18] as well as empirical [17] work has shown that existing sketch-based approaches, which utilize fixed-size, per-attribute summary statistics (histograms) with strong assumptions (uniformity, independence, inclusion, ad hoc constants) to estimate cardinalities, often return estimations with large errors, especially on complex joins with more than 2 relations. Such errors have been shown to lead to sub-optimal plans that are substantially slower than optimal plans in an empirical study [17]. For sampling-based approaches, promising results in [19] show that they can produce estimations that are orders of magnitude more accurate than sketch-based approaches in a reasonable time by performing a sequence of index joins with samples. In summary, sketch-based approaches often incur less overhead than sampling-based approaches when performing estimations, but sampling-based approaches usually return estimations with much smaller errors.
As our work targets complex joins, which usually are long-running tasks where the additional cost brought by sampling is negligible compared to its benefit in reducing queries' running time, we choose to estimate cardinality via sampling.

Estimating Cardinality Via Sampling.
Given a query Q whose result is T, we want to estimate |T|. Let T_{A=a} be the result tuples in T whose value on attribute A is a. We can express |T| as follows:

    |T| = ∑_{a ∈ val(A)} |T_{A=a}| = |val(A)| × avg_a |T_{A=a}|        (4)

where val(A) is the collection of values of A in T, and avg_a |T_{A=a}| = (∑_{a ∈ val(A)} |T_{A=a}|) / |val(A)|. Suppose |val(A)| is known; then we need to estimate avg_a |T_{A=a}| to obtain an estimation of |T|.
To estimate it, let a be a value selected uniformly at random from val(A). Let X be the random variable |T_{A=a}|, so that μ = E[X] = avg_a |T_{A=a}|.
To estimate μ, we simply choose k independent values a_1, a_2, ..., a_k from val(A) with associated random variables X_1, X_2, ..., X_k, and define X̄ = (1/k) ∑_{i=1}^{k} X_i.
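As a minimal single-machine sketch of this estimator (the names are hypothetical, and count_results_for(a) stands in for running the join restricted to A = a):

```python
import random

def estimate_cardinality(val_A, count_results_for, k, seed=0):
    """Estimate |T| as |val(A)| times the mean of |T_{A=a}| over k
    values a sampled uniformly from val(A).

    count_results_for(a) must return |T_{A=a}|, e.g. by evaluating
    the join with attribute A fixed to a.
    """
    rng = random.Random(seed)
    samples = [rng.choice(val_A) for _ in range(k)]
    x_bar = sum(count_results_for(a) for a in samples) / k  # X-bar
    return len(val_A) * x_bar
```

When every value of A contributes the same number of result tuples, the estimate is exact; otherwise its accuracy improves with the sample budget k.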
A naive approach parallelizes the sampling process described above by utilizing
HCube directly. More specifically, it first shuffles the relations of Q onto servers using HCube such that each server can perform the sampling on its own based on the tuples it holds; then, on each server, the sampling process described in the above paragraph is performed. However, such a naive approach shuffles many unnecessary tuples during
HCube, as only a small fraction of val(A) is sampled, and performing Leapfrog for the samples will probably not involve all tuples of every relation in Q.
We can reduce such costs by reducing the database before all its relations are shuffled by HCube. First, we find all relations R in the database whose schemas contain A, and compute a projected relation Π_A R for each R ∈ R. Then, for all R ∈ R, we shuffle their Π_A R so that we can compute their intersection and obtain val(A). Then, from val(A), we randomly select some samples S′. Next, we reduce the original database by performing a semi-join between S′ and each R ∈ R to filter unpromising tuples. Finally, we shuffle the reduced database instead of the original database, and perform sampling on it.

V. IMPLEMENTATION
We implemented a prototype system in Spark, which is the de-facto platform for large-scale analytic tasks.
Optimizing HCube.
Previously,
HCube is implemented as a sequence of a map and a reduce stage [12], where the map stage marks the destination coordinates for each tuple and the reduce stage shuffles each tuple to its corresponding servers. Such an implementation suffers from significant performance loss due to the overwhelming number of tuples being shuffled. To reduce the cost of
HCube, the key is to reduce the cost of the expensive shuffling. A solution is to pull tuples in blocks from remote machines directly instead of shuffling tuples one by one, which bypasses the shuffling process. The new
HCube proceeds in two steps:
• Group all tuples from the same relation that have the same hash values under the
HCube's hash function into a block, and tag that block with those hash values.
• Each server pulls, in blocks from remote machines, the entire block of each relation whose hash values "fit" its own coordinate.
We next use an example to better illustrate the idea.
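The two steps can also be sketched in code. This is a simplified, single-process illustration with hypothetical helper names; it uses value-mod-share as the HCube hash function and represents tuples as dicts from attribute to value.

```python
from collections import defaultdict

def build_blocks(relation, rel_attrs, p):
    """Step 1: group a relation's tuples into blocks, one block per
    combination of hash values on attrs(R); p maps each attribute to
    its number of partitions (its share)."""
    blocks = defaultdict(list)
    for t in relation:
        key = tuple(t[a] % p[a] for a in rel_attrs)  # hash = value mod share
        blocks[key].append(t)
    return blocks

def pull_blocks(blocks, rel_attrs, coordinate):
    """Step 2: a server with the given coordinate pulls the block whose
    hash values fit its coordinate on attrs(R)."""
    key = tuple(coordinate[a] for a in rel_attrs)
    return blocks.get(key, [])
```

With shares p_c = 2, p_d = 1, a relation R(c, d) is split into at most two blocks, and a server only fetches the block matching its coordinate on (c, d), rather than receiving tuples one by one.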
Example 4:
Let us take the query in Fig. 2, whose share is p = (1, …), which results in four servers with corresponding coordinates. For the relation R with schema R(c, d), its tuples are split into two blocks according to their hash values on (c, d), and each server pulls the block whose hash values match its own coordinate on c and d.
A further benefit of this new HCube implementation is that it allows us to do some preprocessing work at the block level. More specifically, we can reduce the cost of constructing the trie of the local database on each machine by pre-building the trie for each block of every relation.

VI. RELATED WORK
Our work is related to previous works from three areas: multi-way join on a single machine, distributed multi-way join, and cardinality estimation.
Multi-Way Join on a Single Machine.
Optimizing the computation cost of a multi-way join has been studied for decades. Traditional multi-way join [21] is based on relational algebra (RA) — an RA expression of a multi-way join represents a sequence of binary joins, e.g., sort-merge joins. The recently emerged AGM bound [22], [23] on the worst-case output size of a multi-way join provides a standard to evaluate the computation efficiency of a join algorithm. In the worst case, using traditional binary joins is suboptimal, while worst-case optimal join algorithms such as NPRR [24], Generic Join [25], and Leapfrog [14] are optimal. To improve the efficiency of worst-case optimal join algorithms in the general case rather than the worst case, EmptyHeaded [26] combines binary joins and worst-case optimal joins via tree decomposition [15], [16] and the Yannakakis algorithm [27], which improves the computation efficiency at a great cost of memory consumption. To overcome the memory issue of EmptyHeaded, CacheTrieJoin [28] is proposed, which incorporates a multi-level cache into
Leapfrog. However, it is difficult to set the size of the cache for each level and the total amount of cache.
Distributed Multi-Way Join.
Traditional multi-way joins on distributed platforms such as Spark [1] consist of a sequence of distributed binary joins, such as distributed sort-merge joins. They suffer from a high communication cost for shuffling intermediate results when processing complex join queries. Such heavy communication cost can be reduced by the one-round multi-way join method
HCube [12], [13], which avoids the shuffling of intermediate results. The combination of
HCube and
Leapfrog forms the
HCubeJ [11], which processes complex join queries effectively. However, once the communication cost has been well optimized, the computation cost becomes the new bottleneck. Also, simply combining
HCube

Fig. 7: Queries.
Dataset        WB     AS     WT     LJ     EN      OK
|R| (×10^6)    13.2   22.1   50.9   69.4   183.9   234.4
Size (MB)      101.5  169.3  388.2  529.2  1370.0  1788.1

TABLE I: Datasets.

and an optimized version of
Leapfrog, such as CacheTrieJoin, helps little, as it prioritizes the memory usage for
HCube over the memory usage for CacheTrieJoin. Compared to previous work, we try to co-optimize pre-computing, communication, and computation cost by introducing effective partial results.
Cardinality Estimation.
Cardinality estimation methods can be roughly classified into two classes: 1) sketch-based, which use statistics of the database to estimate the cardinality of the query, see [17] as an entry point; 2) sampling-based, which estimate the cardinality by sampling over the database according to the query, see [29], [19] as entry points. It has been shown that the estimations of sketch-based methods can deviate from the ground truth by orders of magnitude [19], [17] on complex joins.

VII. EXPERIMENTS
A. Setup
Queries.
We study complex join queries used in previous work [11], [28], [8]. The queries are subgraph queries with 3-5 nodes, as shown in Fig. 7. We report the experimental studies for the representative queries that are not easy to compute, and omit the results for the remaining queries, as they can be computed fast and their performance is very similar among the approaches being tested. The reported queries are:

Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(a, c)
Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(c, d) ⊲⊳ R(d, a) ⊲⊳ R(a, c) ⊲⊳ R(b, d)
Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(c, d) ⊲⊳ R(d, e) ⊲⊳ R(e, a) ⊲⊳ R(b, d) ⊲⊳ R(b, e) ⊲⊳ R(c, a) ⊲⊳ R(c, e) ⊲⊳ R(a, d)
Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(c, d) ⊲⊳ R(d, e) ⊲⊳ R(e, a) ⊲⊳ R(b, e)
Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(c, d) ⊲⊳ R(d, e) ⊲⊳ R(e, a) ⊲⊳ R(b, e) ⊲⊳ R(b, d)
Q :- R(a, b) ⊲⊳ R(b, c) ⊲⊳ R(c, d) ⊲⊳ R(d, e) ⊲⊳ R(e, a) ⊲⊳ R(b, e) ⊲⊳ R(b, d) ⊲⊳ R(c, e)

Datasets.
Following [11], [28], we construct the databases using real large graphs, where each graph is regarded as a relation with two attributes. For each test-case, which consists of a database and a query, the database is constructed by allocating each relation of the query a copy of the graph. We select commonly used graphs from various domains. WB (web-BerkStan) is a web graph of Berkeley and Stanford. AS (as-Skitter) is an internet topology graph from traceroutes run daily in 2005. WT (wiki-Talk) is a Wikipedia talk (communication) network. LJ (com-LiveJournal) is the LiveJournal online social network. EN (en-wiki2013) represents a snapshot of the English part of Wikipedia as of late February 2013. OK (com-Orkut) is the Orkut online social network. Their statistics are listed in Table I. EN can be downloaded from http://law.di.unimi.it/webdata/enwiki-2013/, while the rest of the graphs can be downloaded from SNAP (https://snap.stanford.edu/data/index.html).

Competing Methods.
We compare
ADJ with four state-of-the-art multi-way join methods in the distributed environment.
• SparkSQL [1]: The state-of-the-art multi-round multi-way join framework on Spark, which performs a multi-way join by decomposing the query into smaller join queries and combining intermediate relations in a pairwise way.
• HCubeJ [11]: The state-of-the-art one-round multi-way join framework that utilizes the one-round shuffling method
HCube and the worst-case optimal join
Leapfrog.
• HCubeJ + Cache [28]: The state-of-the-art one-round multi-way join framework that utilizes the one-round shuffling method
HCube and adopt an optimized
Leapfrog with cache[28]. • BigJoin [8]: The state-of-the-art multi-round distributedmulti-round multi-way join framework, which parallelizes
Leapfrog . Evaluation Metrics.
We used wall-clock time to measure the cost of an algorithm, with the time for starting up the system and loading the database into memory excluded. If an approach failed on a test-case due to insufficient memory, the figure shows a blank space instead of a bar at the corresponding location. If an approach failed to complete a test-case within 12 hours, we show a bar reaching the frame top.
Parameter Setting.
We set α of ADJ by pre-measuring the communication performance of the cluster based on Sec. III-B. We set the number of samples to a value that achieves a balance between accuracy and cost based on our experiments. We set β for each test-case based on Sec. III-B by reusing statistics gathered during sampling of that test-case. For competing methods, we use their default settings.

Distributed Settings.
All experiments are conducted on a cluster of one master server and several slave servers (Intel Xeon E5-2680 v4 CPUs, interconnected via 10 Gigabit Ethernet). All methods are deployed on Spark 2.2.0. For Spark, we create workers from the slave servers, where each worker is assigned a fixed number of cores and a fixed amount of memory. Each core of a worker can be assigned a hypercube coordinate in HCube.
Fig. 8: Effectiveness of attribute order pruning.
B. The Performance of ADJ
In this section, we investigate the performance of
ADJ.

Effectiveness of Attribute Order Pruning.
In this test, we compare the number of intermediate tuples generated during
Leapfrog under valid and invalid attribute orders on test-cases over all datasets; we omit the queries whose numbers of intermediate tuples are constant under any attribute order. The results are shown in Fig. 8, where Invalid-Max denotes the attribute order that results in the maximum number of intermediate tuples among all invalid orders.
Valid-Max denotes the attribute order that results in the maximum number of intermediate tuples among all valid attribute orders.
All-Selected denotes the attribute order selected by
HCubeJ [11], which selects the attribute order from among all attribute orders.
Valid-Selected denotes the attribute orders selected by
ADJ. It can be seen that, in terms of the maximum number of intermediate tuples produced, valid attribute orders perform better than invalid attribute orders across all test-cases. Also, selecting the attribute order from only valid attribute orders produces a better attribute order than considering all attribute orders. This experiment confirms the effectiveness of our heuristic in selecting good attribute orders and pruning non-effective ones.
Effectiveness of Optimizations on HCube.
In this test, we compare the effectiveness of the techniques proposed for optimizing the performance of
HCube. We denote the original
HCube implementation by
Push , our optimized
HCube implementation by
Pull , and our optimized
HCube implementation with tries pre-constructed by
Merge. We run test-cases that consist of all datasets and one query, and compare the communication cost and the computation cost; the results are shown in Fig. 9. In terms of communication cost, Pull and
Merge outperform
Push by up to two orders of magnitude. And,
Merge outperforms
Pull, as a block that contains one trie, which can be implemented using three arrays, is easier to serialize and deserialize than a block that contains many tuples. In terms of computation cost,
Push and
Pull are similar,and
Merge outperforms the other two methods by up to an order of magnitude, as the tries have already been pre-constructed before
HCube. This experiment shows that our proposed techniques for
HCube can significantly reduce the communication cost and some of the computation cost.
Cost and Accuracy of Sampling Process.
In this test, we show that a relatively small number of samples is enough for
Fig. 9: Comparison of different implementations of
HCube ((a) communication cost, (b) computation cost).

Fig. 10: Cost and accuracy of the sampling process ((a) time, (b) max D).

an accurate estimation of cardinality. For a query Q whose result is T, let the real cardinality of T be |T| and the estimated one be |T̃|. Let D = max(|T̃|, |T|) / min(|T̃|, |T|) be an indicator that measures their relative difference; the closer D is to 1, the better. We conduct experiments on test-cases that consist of the dataset LJ and three queries. For each test-case, we vary the number of samples and plot the maximum relative difference over all estimated cardinalities and the aggregated sampling time. The results are shown in Fig. 10. We can see that after the sampling budget is increased beyond a certain point, the maximum relative difference converges to 1, which indicates that there is almost no difference between the estimated value and the real value. In terms of sampling cost, below this sampling budget, the cost stays almost the same. This experiment confirms the efficiency and accuracy of our sampling-based cardinality estimation approach.

The Cost and Effectiveness of Co-optimization.
In this test, we show that co-optimization can effectively trade computation for communication with a low query optimization cost, which includes the cost of sampling. We conduct experiments on test-cases that consist of the datasets AS, LJ, OK and three queries, and measure the cost of Optimization, Pre-Computing, Communication, Computation and
Total. The results are shown in Tables II-IV. From them, we can see that on almost all test-cases, when the Co-Optimization strategy is used, with mildly increased
Pre − Computing and
Communication cost, the
Computation cost is drastically reduced. Also, there are test-cases, such as one on OK, whose Communication cost decreases as well. The reason is that introducing pre-computed relations increases the size of the input database, but it also changes the query itself and alters the share p of HCube, which can result in a smaller
Communication cost. From Tables II-IV, it can also be seen that although
the Optimization cost of the Co-Optimization
Optimization cost of
the Communication-First Optimization strategy, it is still small compared to the total cost. This experiment confirms the

Fig. 11: Speed-up factor of
ADJ under different numbers of workers.

effectiveness of the Co-Optimization strategy and the relatively low query
Optimization cost of the Co-Optimization strategy.
Scalability.
In Fig. 11, we show the speedup of our system when varying the number of workers of Spark on test-cases that consist of LJ and all queries. It can be seen that our system has a near-linear speedup on most queries. For the simplest query, the scalability is limited as the overhead of the system gradually becomes the dominating cost. For one query, the limited scalability is due to skewness, where the "last straggler" effect plays a bigger role in determining the elapsed time.

C. Comparison with Other Join Approaches
In this section, we compare
ADJ against state-of-the-artmethods.
Varying Dataset.
In this test, we compare the methods on test-cases where the query is fixed and the dataset varies. The results are shown in Fig. 12 (a)-(c). It can be seen that the multi-round methods SparkSQL and
BigJoin fail on many of the test-cases due to overwhelming intermediate results, while the one-round methods successfully tackle most of them, as the shuffling of intermediate results is avoided. Also,
BigJoin isbetter than
SparkSQL as the worst-case optimal join
Leapfrog that it parallelizes generates fewer intermediate tuples. Also, it can be seen that as the input database size increases,
HCubeJ and HCubeJ + Cache spend a larger portion of time on
HCube, and on test-cases on the larger datasets (LJ, EN, OK), they have a difficult time shuffling the tuples using the original HCube implementation. In comparison,
ADJ can successfully process all test-cases and spends significantly less time shuffling the relations on test-cases that involve complex queries or the large datasets EN, OK.

Varying Query.
In this test, we compare the methods on test-cases where the datasets are fixed to AS, LJ, OK and the query varies. The results are shown in Fig. 12 (d)-(f). SparkSQL can only handle the simplest query and fails on all other queries due to overwhelming intermediate results. BigJoin can only handle two of the queries. For the simpler queries, HCubeJ and
HCubeJ + Cache perform similarly, and
ADJ has a large lead due to the optimized
HCube. For the more complex queries, HCubeJ + Cache performs better than
HCubeJ , and
HCubeJ + Cache has similar performance to
ADJ on dataset AS, as AS is relatively small and there is abundant remaining memory on each server to use for caching. On the LJ dataset, HCubeJ + Cache is significantly outperformed by
ADJ , as
HCubeJ + Cache is a method that prioritizes communication cost over computation cost, and it uses up all memory for shuffling and storing the tuples during
HCube, which leaves little memory for caching. On the OK dataset, both HCubeJ and
HCubeJ + Cache failed, as the original
HCube implementation shuffles too many tuples, which causes memory overflow. It can be seen that in almost all test-cases,
ADJ can effectively balance the computation cost and the communication cost by adopting a co-optimization strategy.

VIII. CONCLUSION
This paper studies the problem of co-optimizing communication and computation cost in one-round multi-way join evaluation and proposes a prototype system
ADJ for processing complex join queries. To find an effective query plan in a huge search space in terms of total cost, this paper studies how to restrict the search space based on an optimal hypertree T and how to explore cost-effective query plans based on the hypertree T. Extensive experiments have shown the effectiveness of the various optimizations proposed in ADJ. In future work, we shall explore co-optimizing computation, pre-computing, and communication for queries that consist of selection, projection, and join.

ACKNOWLEDGEMENT
This work is supported by the Research Grants Council of Hong Kong, China under No. 14203618, No. 14202919, No. 14205520, No. 14205617, No. 14205618, and NSFC Grant No. U1936205.

REFERENCES

[1] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, "Spark SQL: Relational Data Processing in Spark," in
Proc. of SIGMOD’15 ,pp. 1383–1394, 2015.[2] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, andK. Tzoumas, “Apache flink: Stream and batch processing in a singleengine,”
IEEE TCDE , vol. 36, no. 4, 2015.[3] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu,P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,”
Proc. of VLDB’09 , vol. 2, no. 2, pp. 1626–1629,2009.[4] J. Shute, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, H. Apte,R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea,K. Littlefield, and D. Menestrina, “F1: a distributed SQL database thatscales,”
PVLDB , vol. 6, no. 11, pp. 1068–1079, 2013.[5] J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe,D. Hutchison, S. Jain, R. Maas, P. Mehta, et al. , “The myria big datamanagement and analytics system and cloud services.,” in
CIDR’17 ,2017.[6] Y.-M. N. Nam, D. H. Han, and M.-S. K. Kim, “Sprinter: A fast n-aryjoin query processing method for complex olap queries,” in
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 2055–2070, 2020. [7] B. Elliott, E. Cheng, C. Thomas-Ogbuji, and Z. M. Ozsoyoglu, "A complete translation from SPARQL into efficient SQL," in
Proc. of IDEAS’09 ,pp. 31–42, 2009.[8] K. Ammar, F. McSherry, S. Salihoglu, and M. Joglekar, “DistributedEvaluation of Subgraph Queries Using Worst-case Optimal Low-memory Dataflows,”
PVLDB , vol. 11, no. 6, pp. 691–704, 2018. o − Optimization ( sec ) Communication − First Optimization ( sec ) Optimization Pre − Computing Communication Computation Total Optimization Communication Computation Total Q
107 12 66 1276 1461 3 21 > > Q
90 24 50 907 1071 4 36 > > Q
63 12 19 18 112 4 47 30426 30477
TABLE II: The comparison between co-optimization and communication-first optimization strategy in AS dataset Co − Optimization ( sec ) Communication − First Optimization ( sec ) Optimization Pre − Computing Communication Computation Total Optimization Communication Computation Total Q
106 22 132 1282 1542 8 62 > > Q
132 44 103 222 501 9 112 > > Q
105 22 147 350 624 12 204 > > TABLE III: The comparison between co-optimization and communication-first optimization strategy in LJ dataset Co − Optimization ( sec ) Communication − First Optimization ( sec ) Optimization Pre − Computing Communication Computation Total Optimization Communication Computation Total Q
218 71 712 13214 14215 37 1050 > > Q
265 142 422 877 1706 46 1566 > > Q
278 71 1189 516 2054 42 2067 > > TABLE IV: The comparison between co-optimization and communication-first optimization strategy in OK dataset WB AS WT LJ EN OKtime (sec) (a) Q WB AS WT LJ EN OKtime (sec) (b) Q WB AS WT LJ EN OKtime (sec) (c) Q Q Q Q Q Q Q time (sec) (d) AS Q Q Q Q Q Q time (sec) (e) LJ Q Q Q Q Q Q time (sec) (f) OK Fig. 12: Comparison of methods by varying datasets or queries [9] N. N. Liu, L. He, and M. Zhao, “Social temporal collaborative rankingfor context aware movie recommendation,”
ACM TIST, vol. 4, no. 1, pp. 1–26, 2013. [10] R. A. Rossi, L. K. McDowell, D. W. Aha, and J. Neville, "Transforming graph data for statistical relational learning,"
Journal of ArtificialIntelligence Research , vol. 45, pp. 363–441, 2012.[11] S. Chu, M. Balazinska, and D. Suciu, “From Theory to Practice: EfficientJoin Query Evaluation in a Parallel Database System,” in
Proc. of SIGMOD'15, pp. 63–78, 2015. [12] F. N. Afrati and J. D. Ullman, "Optimizing Multiway Joins in a Map-Reduce Environment,"
TKDE , vol. 23, no. 9, 2011.[13] P. Beame, P. Koutris, and D. Suciu, “Communication steps for parallelquery processing,” in
Proc. of SIGMOD’13 , pp. 273–284, 2013.[14] T. L. Veldhuizen, “Leapfrog triejoin: A simple, worst-case optimal joinalgorithm,” arXiv preprint arXiv:1210.0481 , 2012.[15] G. Gottlob, N. Leone, and F. Scarcello, “Hypertree Decompositions andTractable Queries,”
Journal of Computer and System Sciences , vol. 64,no. 3, pp. 579–627, 2002.[16] G. Gottlob, G. Greco, N. Leone, and F. Scarcello, “Hypertree Decom-positions: Questions and Answers,” in
Proc. of PODS’16 , pp. 57–74,2016.[17] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neu-mann, “How good are query optimizers, really?,”
PVLDB , vol. 9, no. 3,pp. 204–215, 2015.[18] Y. E. Ioannidis and S. Christodoulakis, “On the propagation of errors inthe size of join results,” in
Proc. of SIGMOD'1991, pp. 268–277, 1991. [19] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann, "Cardinality estimation done right: Index-based join sampling," in
CIDR’17 ,2017.[20] W. Hoeffding, “Probability inequalities for sums of bounded randomvariables,” in
The Collected Works of Wassily Hoeffding , pp. 409–426,1994.[21] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G.Price, “Access Path Selection in a Relational Database ManagementSystem,” in
Proc. of SIGMOD’1979 , pp. 23–34, 1979.[22] A. Atserias, M. Grohe, and D. Marx, “Size Bounds and Query Plans forRelational Joins,” in
Proc. of FOCS’08 , pp. 739–748, 2008.[23] M. Grohe and D. Marx, “Constraint Solving via Fractional EdgeCovers,”
ACM Trans. Algorithms , vol. 11, no. 1, pp. 4:1–4:20, 2014.[24] H. Q. Ngo, E. Porat, C. R´e, and A. Rudra, “Worst-case Optimal JoinAlgorithms: [Extended Abstract],” in
Proc. of PODS’12 , pp. 37–48,2012.[25] H. Q. Ngo, C. R´e, and A. Rudra, “Skew strikes back: new developmentsin the theory of join algorithms,”
ACM SIGMOD Record , vol. 42, no. 4,pp. 5–16, 2014.[26] C. R. Aberger, A. Lamb, S. Tu, A. N¨otzli, K. Olukotun, and C. R´e,“EmptyHeaded: A Relational Engine for Graph Processing,”
ACM Trans.Database Syst. , vol. 42, no. 4, pp. 20:1–20:44, 2017.[27] M. Yannakakis, “Algorithms for Acyclic Database Schemes,” in
Proc.of VLDB’1981 , pp. 82–94, 1981.[28] O. Kalinsky, Y. Etsion, and B. Kimelfeld, “Flexible caching in trie joins,” arXiv preprint arXiv:1602.08721 , 2016.[29] Y. Chen and K. Yi, “Two-Level Sampling for Join Size Estimation,” in