DBL: Efficient Reachability Queries on Dynamic Graphs (Complete Version)
Qiuyi Lyu (Shandong University), Yuchen Li (Singapore Management University), Bingsheng He (National University of Singapore), and Bin Gong (Shandong University)
Abstract.
Reachability query is a fundamental problem on graphs, which has been extensively studied in academia and industry. Since graphs are subject to frequent updates in many applications, it is essential to support efficient graph updates while offering good performance in reachability queries. Existing solutions compress the original graph with the Directed Acyclic Graph (DAG) and propose efficient query processing and index update techniques. However, they focus on optimizing the scenarios where the Strongly Connected Components (SCCs) remain unchanged and have overlooked the prohibitively high cost of the DAG maintenance when SCCs are updated. In this paper, we propose DBL, an efficient DAG-free index to support the reachability query on dynamic graphs with insertion-only updates. DBL builds on two complementary indexes: Dynamic Landmark (DL) label and Bidirectional Leaf (BL) label. The former leverages landmark nodes to quickly determine reachable pairs whereas the latter prunes unreachable pairs by indexing the leaf nodes in the graph. We evaluate DBL against the state-of-the-art approaches on dynamic reachability index with extensive experiments on real-world datasets. The results demonstrate that DBL achieves orders of magnitude speedup in terms of index update, while still producing competitive query efficiency.
Given a graph G and a pair of vertices u and v, reachability query (denoted as q(u, v)) is a fundamental graph operation that answers whether there exists a path from u to v on G. This operation is a core component in supporting numerous applications in practice, such as those in social networks, biological complexes, knowledge graphs, and transportation networks. A plethora of index-based approaches have been developed over a decade [24,5,22,17,25,18,20,26,21,9] and demonstrated great success in handling reachability query on static graphs with millions of vertices and edges. However, in many cases, graphs are highly dynamic [23]: new friendships continuously form on social networks like Facebook and Twitter; knowledge graphs are constantly updated with new entities and relations; and transportation networks are subject to changes when road constructions and temporary traffic controls occur. In those applications, it is essential to support efficient graph updates while offering good performance in reachability queries.

There have been some efforts in developing reachability index to support graph updates [4,6,8,10,15,16,17]. However, there is a major assumption made in those works: the Strongly Connected Components (SCCs) in the underlying graph remain unchanged after the graph gets updated. The Directed Acyclic Graph (DAG) collapses the SCCs into vertices and the reachability query is then processed on a significantly smaller graph than the original. The state-of-the-art solutions [27,24] thus rely on the DAG to design an index for efficient query processing, yet their index maintenance mechanisms only support updates that do not trigger SCC merge/split in the DAG. However, such an assumption can be invalid in practice, as edge insertions could lead to updates of the SCCs in the DAG. In other words, the overhead of the DAG maintenance has been mostly overlooked in the previous studies.

One potential solution is to adopt existing DAG maintenance algorithms such as [26]. Unfortunately, this DAG maintenance is a prohibitively time-consuming process, as also demonstrated in the experiments. For instance, in our experiments, the time taken to update the DAG on one edge insertion in the LiveJournal dataset is two-fold more than the query processing time of the state-of-the-art methods. Therefore, we need a new index scheme with a low maintenance cost while efficiently answering reachability queries.

In this paper, we propose a DAG-free dynamic reachability index framework (DBL) that enables efficient index update and supports fast query processing at the same time on large scale graphs. We focus on insert-only dynamic graphs with new edges and vertices continuously added. This is because the number of deletions is often significantly smaller than the number of insertions, and deletions are handled with lazy updates in many graph applications [2,3]. Instead of maintaining the DAG, we index the reachability information around two sets of vertices: the “landmark” nodes with high centrality and the “leaf” nodes with low centrality (e.g., nodes with zero in-degree or out-degree). As the reachability information of the landmark nodes and the leaf nodes remains relatively stable against graph updates, it enables efficient index update opportunities compared with approaches using the DAG. Hence, DBL is built on top of two simple and effective index components: (1) a Dynamic Landmark (DL) label, and (2) a Bidirectional Leaf (BL) label. Combining DL and BL in DBL ensures efficient index maintenance while achieving competitive query processing performance.
Efficient query processing: DL is inspired by the landmark index approach [5]. The proposed DL label maintains a small set of the landmark nodes as the label for each vertex in the graph. Given a query q(u, v), if the DL labels of u and v share a common landmark node that u reaches and that reaches v, we can immediately determine that u reaches v. Otherwise, we need to invoke Breadth-First Search (BFS) to process q(u, v). We devise BL label to quickly prune vertex pairs that are not reachable, which limits the number of costly BFS. BL complements DL and it focuses on building labels around the leaf nodes in the graph. The leaf nodes form an exclusive set apart from the landmark node set. BL label of a vertex u is defined to be the leaf nodes that can either reach u or be reached by u. Hence, u does not reach v if there exists one leaf node in u's BL label which does not appear in the BL label of v. In summary, DL can quickly determine reachable pairs while BL, which complements DL, prunes disconnected pairs to remedy the ones that cannot be immediately determined by DL.

Efficient index maintenance: Both DL and BL labels are lightweight indexes where each vertex only stores a constant size label. When new edges are inserted, an efficient pruned BFS is employed and only the vertices whose labels need an update will be visited. In particular, once the label of a vertex is unaffected by the edge updates, we safely prune the vertex as well as its descendants from the
BFS, which enables efficient index update.

To better utilize the computation power of modern architectures, we implement DL and BL with simple and compact bitwise operations. Our implementations are based on OpenMP and CUDA in order to exploit the parallel architectures of multi-core CPUs and GPUs (Graphics Processing Units), respectively.

We summarize the contributions as follows:

– We introduce the DBL framework which combines two complementary DL and BL labels to enable efficient reachability query processing on large graphs.
– We propose novel index update algorithms for DL and BL. To the best of our knowledge, this is the first solution for dynamic reachability index without maintaining the DAG. In addition, the algorithms can be easily implemented with parallel interfaces.
– We conduct extensive experiments to validate the performance of DBL in comparison with the state-of-the-art dynamic methods [27,24]. DBL achieves competitive query performance and orders of magnitude speedup for index update. We also implement DBL on multi-core and GPU-enabled systems and demonstrate a significant performance boost compared with our sequential implementation.

The remaining part of this paper is organized as follows. Section 2 presents the preliminaries and background. Section 3 presents the related work. Section 4 presents the index definition as well as query processing. Section 5 demonstrates the update mechanism of DL and BL labels. Section 6 reports the experimental results. Finally, we conclude the paper in Section 7.

A directed graph is defined as G = (V, E), where V is the vertex set and E is the edge set with n = |V| and m = |E|. We denote an edge from vertex u to vertex v as (u, v). A path from u to v in G is denoted as Path(u, v) = (u, w_1, w_2, w_3, ..., v) where w_i ∈ V and the adjacent vertices on the path are connected by an edge in G. We say that v is reachable by u when there exists a Path(u, v) in G. In addition, we use Suc(u) to denote the direct successors of u and the direct predecessors of u are denoted as Pre(u). Similarly, we denote all the ancestors of u (including u) as Anc(u) and all the descendants of u (including u) as Des(u). We denote the reversed graph of G as G′ = (V, E′) where all the edges of G are in the opposite direction of G′. In this paper, the forward direction refers to traversing on the edges in G. Symmetrically, the backward direction refers to traversing on the edges in G′. We denote q(u, v) as a reachability query from u to v. In this paper, we study the dynamic scenario where edges can be inserted into the graph. Common notations are summarized in Table 1.

Table 1: Common notations in this paper

  Notation     Description
  G(V, E)      the vertex set V and the edge set E of a directed graph G
  G′           the reverse graph of G
  n            the number of vertices in G
  m            the number of edges in G
  Suc(u)       the set of u's out-neighbors
  Pre(u)       the set of u's in-neighbors
  Des(u)       the set of u's descendants including u
  Anc(u)       the set of u's ancestors including u
  Path(u, v)   a path from vertex u to vertex v
  q(u, v)      the reachability query from u to v
  k            the size of DL label for one vertex
  k′           the size of BL label for one vertex
  DL_in(u)     the label that keeps all the landmark nodes that could reach u
  DL_out(u)    the label that keeps all the landmark nodes that could be reached by u
  BL_in(u)     the label that keeps the hash value of the leaf nodes that could reach u
  BL_out(u)    the label that keeps the hash value of the leaf nodes that could be reached by u
  h(u)         the hash function that hashes node u to a value

There have been some studies on dynamic graph [4,6,8,10,15,16,17]. Yildirim et al. propose
DAGGER [26], which maintains the graph as a DAG after insertions and deletions. The index is constructed on the DAG to facilitate reachability query processing. The main operation for the DAG maintenance is the merge and split of the Strongly Connected Components (SCCs). Unfortunately, it has been shown that DAGGER exhibits unsatisfactory query processing performance on handling large graphs (even with just millions of vertices [27]).

The state-of-the-art approaches, TOL [27] and IP [24], follow the maintenance method for the DAG from DAGGER and propose novel dynamic indexes on the DAG to improve the query processing performance. We note that TOL and IP are only applicable to the scenarios where the SCCs in the DAG remain unchanged against updates. In the case of SCC merges/collapses, DAGGER is still required to recover the SCCs. For instance, TOL and IP can handle edge insertions (v, v) in Figure 1(a), without invoking DAGGER. However, when inserting (v, v), two SCCs {v} and {v, v, v} will be merged into one larger SCC {v, v, v, v}. For such cases, TOL and IP rely on DAGGER for maintaining the DAG first and then perform their respective methods for index maintenance and query processing. However, the overheads of the SCC maintenance are excluded in their experiments [27,24] and such overheads are in fact non-negligible [26,14].
Fig. 1: A running example of graph G. (a) Graph G; (b) DL label for G; (c) BL label for G.

In this paper, we propose the DBL framework which only maintains the labels for all vertices in the graph without constructing the DAG. That means DBL can effectively avoid the costly DAG maintenance upon graph updates. DBL achieves competitive query processing performance with the state-of-the-art solutions (i.e., TOL and IP) while offering orders of magnitude speedup in terms of index updates.

The DBL framework consists of DL and BL labels, which have their independent query and update components. In this section, we introduce the DL and BL labels. Then, we devise the query processing algorithm that builds upon the DBL index.
We propose the DBL framework that consists of two index components: DL and BL.

Definition 1 (DL label). Given a landmark vertex set L ⊂ V and |L| = k, we define two labels for each vertex v ∈ V: DL_in(v) and DL_out(v). DL_in(v) is the subset of nodes in L that could reach v and DL_out(v) is the subset of nodes in L that v could reach.

It is noted that DL label is a subset of the 2-hop label [5]. In fact, 2-hop label is a special case of DL label when the landmark set L = V. Nevertheless, we find that maintaining 2-hop label in the dynamic graph scenario leads to index explosion. Thus, we propose to only choose a subset of vertices as the landmark set L to index DL label. In this way, DL label has up to O(n|L|) space complexity and the index size can be easily controlled by tuning the selection of L. The following lemma shows an important property of DL label for reachability query processing.

Lemma 1. Given two vertices u, v and their corresponding DL label, DL_out(u) ∩ DL_in(v) ≠ ∅ deduces u reaches v but not vice versa.
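As a small illustration of Lemma 1, the following Python sketch tests the label intersection on a toy graph that is not the running example. The vertex names, the hand-written labels and the helper `dl_intersec` are assumptions made for this sketch only, and a landmark is assumed to appear in its own labels.

```python
# Toy graph (hypothetical): a -> l -> b and c -> d, with landmark set L = {l}.
# DL_OUT[x] holds the landmarks x can reach; DL_IN[x] holds the landmarks reaching x.
DL_OUT = {"a": {"l"}, "l": {"l"}, "b": set(), "c": set(), "d": set()}
DL_IN = {"a": set(), "l": {"l"}, "b": {"l"}, "c": set(), "d": set()}


def dl_intersec(u, v):
    """True iff some landmark is reached by u and reaches v (Lemma 1)."""
    return bool(DL_OUT[u] & DL_IN[v])


print(dl_intersec("a", "b"))  # the path a -> l -> b passes through the landmark
print(dl_intersec("c", "d"))  # c reaches d, but no landmark lies on that path
```

The second call returns false even though c reaches d, which is why an empty intersection alone cannot answer a query negatively.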
Algorithm 1 DL label Batch Construction
Input: Graph G(V, E), Landmark Set D
Output: DL label for G
 1: for i = 0; i < k; i++ do
 2:   // Forward BFS
 3:   S ← D[i]
 4:   enqueue S to an empty queue Q
 5:   while Q not empty do
 6:     p ← pop Q
 7:     for x ∈ Suc(p) do
 8:       DL_in(x) ← DL_in(x) ∪ {S}
 9:       enqueue x to Q
10:   // Symmetrical Backward BFS is performed.
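The batch construction can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function names `adjacency` and `build_dl` are hypothetical, a visited set is added so that the BFS terminates on cyclic graphs, and each landmark is assumed to be stored in its own labels.

```python
from collections import deque


def adjacency(edges):
    """Build successor and predecessor lists from a list of (u, v) edges."""
    succ, pred = {}, {}
    for u, v in edges:
        for w in (u, v):
            succ.setdefault(w, [])
            pred.setdefault(w, [])
        succ[u].append(v)
        pred[v].append(u)
    return succ, pred


def build_dl(edges, landmarks):
    """Return (dl_in, dl_out), each mapping a vertex to a set of landmarks."""
    succ, pred = adjacency(edges)
    dl_in = {v: set() for v in succ}
    dl_out = {v: set() for v in succ}

    def spread(adj, label, s):
        label[s].add(s)          # assumption: a landmark labels itself
        seen = {s}               # assumption: visited set, absent from Algorithm 1
        queue = deque([s])
        while queue:
            p = queue.popleft()
            for x in adj[p]:
                if x not in seen:
                    seen.add(x)
                    label[x].add(s)
                    queue.append(x)

    for s in landmarks:
        spread(succ, dl_in, s)   # forward BFS fills DL_in
        spread(pred, dl_out, s)  # backward BFS fills DL_out
    return dl_in, dl_out
```

For instance, `build_dl([("a", "l"), ("l", "b"), ("c", "d")], ["l"])` places l in DL_out(a) and DL_in(b) and leaves the labels of c and d empty.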
Example 1. We show a running example in Figure 1(a). Assuming the landmark set is chosen as {v, v}, the corresponding DL label is shown in Figure 1(b). q(v, v) returns true since DL_out(v) ∩ DL_in(v) = {v}. However, the labels cannot give a negative answer to q(v, v) despite DL_out(v) ∩ DL_in(v) = ∅. This is because the intermediate vertex v on the path from v to v is not included in the landmark set.

To achieve good query processing performance, we need to select a set of vertices as the landmarks such that they cover most of the reachable vertex pairs in the graph, i.e., DL_out(u) ∩ DL_in(v) contains at least one landmark node for any reachable vertex pair u and v. The optimal landmark selection has been proved to be NP-hard [13]. In this paper, we adopt a heuristic method for selecting DL label nodes following existing works [1,13]. In particular, we rank vertices with M(u) = |Pre(u)| · |Suc(u)| to approximate their centrality and select the top-k vertices. Other landmark selection methods are also discussed in Section 6.2.

Definition 2 (BL label). BL introduces two labels for each vertex v ∈ V: BL_in(v) and BL_out(v). BL_in(v) contains all the zero in-degree vertices that can reach v, and BL_out(v) contains all the zero out-degree vertices that could be reached by v. For convenience, we refer to vertices with either zero in-degree or zero out-degree as the leaf nodes.

Lemma 2. Given two vertices u, v and their corresponding BL label, u does not reach v in G if BL_out(v) ⊈ BL_out(u) or BL_in(u) ⊈ BL_in(v).

BL label can give a negative answer to q(u, v). This is because if u could reach v, then u could reach all the leaf nodes that v could reach, and all the leaf nodes that reach u should also reach v. DL label is efficient for giving a positive answer to a reachability query whereas BL label plays a complementary role by pruning unreachable pairs. In this paper, we take vertices with zero in-degree/out-degree as the leaf nodes. We also discuss other leaf selection methods in Section 6.2.
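The containment test of Lemma 2 can be illustrated with a short Python sketch on a toy graph that is not the running example. The vertex names, the hand-written (unhashed) labels and the helper `may_reach` are assumptions made for this sketch only.

```python
# Toy graph (hypothetical): r1 -> x -> s1, r1 -> y -> s1, r2 -> z -> s2.
# r1 and r2 have zero in-degree; s1 and s2 have zero out-degree.
BL_IN = {"x": {"r1"}, "y": {"r1"}, "z": {"r2"}}
BL_OUT = {"x": {"s1"}, "y": {"s1"}, "z": {"s2"}}


def may_reach(u, v):
    """False means u certainly does not reach v (Lemma 2); True is inconclusive."""
    return BL_IN[u] <= BL_IN[v] and BL_OUT[v] <= BL_OUT[u]


print(may_reach("x", "z"))  # containment fails, so x cannot reach z
print(may_reach("x", "y"))  # containment holds although x does not reach y
```

The second call shows the one-sided nature of the label: passing the containment test never proves reachability.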
Example 2. Figure 1(c) shows BL label for the running example. BL label gives a negative answer to q(v, v) since BL_in(v) is not contained by BL_in(v). Intuitively, vertex v reaches vertex v but cannot reach v, which indicates v should not reach v. BL label cannot give a positive answer. Take q(v, v) for an example: the labels satisfy the containment condition but a positive answer cannot be given.

The number of BL label nodes could be huge. To develop efficient index operations, we build a hash set of size k′ for BL as follows. Both BL_in and BL_out are subsets of {0, 1, ..., k′ − 1} where k′ is a user-defined label size, and they are stored in bit vectors. A hash function is used to map the leaf nodes to a corresponding bit. For our example, the leaves are {v, v, v, v, v}. When k′ = 2, all leaves are hashed to two unique values. Assume h(v) = h(v) = 0, h(v) = h(v) = h(v) = 1. We show the hashed BL label set in Figure 1(c), denoted as h(BL_in) and h(BL_out). In the rest of the paper, we directly use BL_in and BL_out to denote the hash sets of the corresponding labels. It is noted that one can still use Lemma 2 to prune unreachable pairs with the hashed BL label.

We briefly discuss the batch index construction of DBL as the focus of this work is on the dynamic scenario. The construction of DL label is presented in Algorithm 1, which follows existing works on 2-hop label [5]. For each landmark node D[i], we start a BFS from S (Line 4) and include S in the DL_in label of every vertex that S can reach (Lines 5-9). For constructing DL_out, we execute a BFS on the reversed graph G′ symmetrically (Line 10). To construct BL label, we simply replace the landmark set D with the leaf set D′ and replace S with all leaf nodes that are hashed to bucket i (Line 3) in Algorithm 1. The complexity of building DBL is O((k + k′)(m + n)).

Note that although we use [5] for offline index construction, the contribution of our work is that we construct DL and BL as complementary indices for efficient query processing. Furthermore, we are the first work to support efficient dynamic reachability index maintenance without assuming SCCs remain unchanged.
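The hashed BL construction described above can be sketched as one multi-source BFS per bucket, with each label stored as an integer bit mask. This is a minimal sketch under stated assumptions: vertices are integers 0..n-1, the hash `v % k_prime` is a stand-in for h, a leaf is assumed to carry its own bucket, and the names `build_bl` and `bl_contain` are hypothetical.

```python
from collections import deque


def build_bl(n, edges, k_prime):
    """Return (bl_in, bl_out): one k_prime-bit integer mask per vertex."""
    succ = [[] for _ in range(n)]
    pred = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def bucket(v):
        return v % k_prime  # hypothetical hash function h(v)

    def spread(adj, leaves):
        label = [0] * n
        for i in range(k_prime):
            sources = [s for s in leaves if bucket(s) == i]
            bit = 1 << i
            for s in sources:
                label[s] |= bit  # assumption: a leaf carries its own bucket
            seen = set(sources)
            queue = deque(sources)
            while queue:         # multi-source BFS for bucket i
                p = queue.popleft()
                for x in adj[p]:
                    if x not in seen:
                        seen.add(x)
                        label[x] |= bit
                        queue.append(x)
        return label

    roots = [v for v in range(n) if not pred[v]]  # zero in-degree
    sinks = [v for v in range(n) if not succ[v]]  # zero out-degree
    return spread(succ, roots), spread(pred, sinks)


def bl_contain(bl_in, bl_out, u, v):
    """Bitwise form of BL_in(u) ⊆ BL_in(v) and BL_out(v) ⊆ BL_out(u)."""
    return (bl_in[u] & ~bl_in[v]) == 0 and (bl_out[v] & ~bl_out[u]) == 0
```

Each bucket costs one traversal of the graph, which matches the O(k′(m + n)) share of the construction cost stated above.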
Space complexity. The space complexities of DL and BL labels are O(kn) and O(k′n), respectively.

With the two indexes, Algorithm 2 illustrates the query processing framework of DBL. Given a reachability query q(u, v), we return the answer immediately if the labels are sufficient to determine the reachability (Lines 6-9). By the definitions of DL and BL labels, u reaches v if their DL labels overlap (Line 6), whereas u does not reach v if their BL labels fail the containment test (Line 8). Furthermore, there are two early termination rules implemented in Lines 10 and 12, respectively. Line 10 makes use of the property that all vertices in an SCC carry the same landmark nodes in their DL labels. Line 12 takes advantage of the scenario when either u or v shares the same SCC with a landmark node l; then u reaches v if and only if l appears in the DL labels of u and v. We prove their correctness in Theorem 1 and Theorem 2, respectively. Otherwise, we turn to BFS search with efficient pruning. The pruning within BFS is performed as follows. Upon visiting a vertex x, the procedure determines whether x should be enqueued in Lines 20 and 22. BL and DL labels judge whether the destination vertex v can be in Des(x). If not, x is pruned from the BFS, so that the query is answered without traversing the rest of the graph.

Algorithm 2 Query Processing Framework for DBL
Input: Graph G(V, E), DL label, BL label, q(u, v)
Output: Answer of the query.
 1: function DL_Intersec(x, y)
 2:   return (DL_out(x) ∩ DL_in(y));
 3: function BL_Contain(x, y)
 4:   return (BL_in(x) ⊆ BL_in(y) and BL_out(y) ⊆ BL_out(x));
 5: procedure Query(u, v)
 6:   if DL_Intersec(u, v) then
 7:     return true;
 8:   if not BL_Contain(u, v) then
 9:     return false;
10:   if DL_Intersec(v, u) then
11:     return false;
12:   if DL_Intersec(u, u) or DL_Intersec(v, v) then
13:     return false;
14:   Enqueue u for BFS;
15:   while queue not empty do
16:     w ← pop queue;
17:     for vertex x ∈ Suc(w) do
18:       if x = v then
19:         return true;
20:       if DL_Intersec(u, x) then
21:         continue;
22:       if not BL_Contain(x, v) then
23:         continue;
24:       Enqueue x;
25:   return false;

Theorem 1.
In Algorithm 2, when DL_Intersec(x, y) returns false and DL_Intersec(y, x) returns true, then x cannot reach y.

Proof. DL_Intersec(y, x) returning true indicates that vertex y reaches x. If vertex x reaches vertex y, then y and x must be in the same SCC (according to the definition of the SCC). As all the vertices in the SCC are reachable from each other, the landmark nodes in DL_out(y) ∩ DL_in(x) should also be included in the DL_out and DL_in labels of all vertices in the same SCC. This means DL_Intersec(x, y) should return true. Therefore x cannot reach y; otherwise it contradicts the fact that DL_Intersec(x, y) returns false.

Theorem 2.
In Algorithm 2, if DL_Intersec(x, y) returns false and DL_Intersec(x, x) or DL_Intersec(y, y) returns true, then vertex x cannot reach y.
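To make the interplay between the label tests and the two early-termination rules of Theorems 1 and 2 concrete, the following is a minimal Python sketch of the query procedure of Algorithm 2. It is not the authors' implementation: labels are plain Python sets, the function name `query` and its argument layout are hypothetical, a visited set is added so that the BFS terminates on cyclic graphs, and a vertex is assumed to reach itself.

```python
from collections import deque


def query(succ, dl_in, dl_out, bl_in, bl_out, u, v):
    """Answer q(u, v) from the labels, falling back to a pruned BFS."""

    def dl_intersec(x, y):
        return bool(dl_out[x] & dl_in[y])

    def bl_contain(x, y):
        return bl_in[x] <= bl_in[y] and bl_out[y] <= bl_out[x]

    if u == v:
        return True                      # assumption: a vertex reaches itself
    if dl_intersec(u, v):
        return True                      # Lemma 1
    if not bl_contain(u, v):
        return False                     # Lemma 2
    if dl_intersec(v, u):
        return False                     # Theorem 1
    if dl_intersec(u, u) or dl_intersec(v, v):
        return False                     # Theorem 2
    seen = {u}                           # assumption: visited set
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for x in succ[w]:
            if x == v:
                return True
            if x in seen:
                continue
            seen.add(x)
            if dl_intersec(u, x):
                continue                 # v is not below x, else Lemma 1 had fired
            if not bl_contain(x, v):
                continue                 # x certainly does not reach v
            queue.append(x)
    return False
```

The pruning on `dl_intersec(u, x)` is safe because, if x reached v, the landmark shared by u and x would also reach v and the very first test would already have succeeded.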
Proof. If DL_Intersec(x, x) returns true, it means that vertex x is a landmark or x is in the same SCC with a landmark. If x is in the same SCC with landmark l, vertex x and vertex l should have the same reachability information. Since landmark l pushes its label element l to the DL_out label of all the vertices in Anc(l) and to the DL_in label of all the vertices in Des(l), the reachability information of landmark l is fully covered. It means that x's reachability information is also fully covered. Thus DL label is enough to answer the query without BFS. Hence y is not reachable by x if DL_Intersec(x, y) returns false. The proof is similar for the case when DL_Intersec(y, y) returns true.

Algorithm 3 DL_in label update for edge insertion
Input: Graph G(V, E), DL label, Inserted edge (u, v)
Output: Updated DL label
1: if DL_out(u) ∩ DL_in(v) == ∅ then
2:   Initialize an empty queue and enqueue v
3:   while queue is not empty do
4:     p ← pop queue
5:     for vertex x ∈ Suc(p) do
6:       if DL_in(u) ⊈ DL_in(x) then
7:         DL_in(x) ← DL_in(x) ∪ DL_in(u)
8:         enqueue x

Query complexity.
Given a query q(u, v), the time complexity is O(k + k′) when the query can be directly answered by DL and BL labels. Otherwise, we turn to the pruned BFS search, which has a worst case time complexity of O((k + k′)(m + n)). Let ρ denote the ratio of vertex pairs whose reachability could be directly answered by the label. The amortized time complexity is O(ρ(k + k′) + (1 − ρ)(k + k′)(m + n)). Empirically, ρ is over 95% according to our experiments (Table 4 in Section 6), which implies efficient query processing.

When inserting a new edge (u, v), all vertices in Anc(u) can reach all vertices in Des(v). On a high level, all landmark nodes that could reach u should also reach vertices in Des(v). Symmetrically, all the landmark nodes that could be reached by v should also be reached by vertices in Anc(u). Thus, we update the label by 1) adding DL_in(u) into DL_in(x) for all x ∈ Des(v); and 2) adding DL_out(v) into DL_out(x) for all x ∈ Anc(u).

Algorithm 3 depicts the edge insertion scenario for DL_in. We omit the update for DL_out, which is symmetrical to DL_in. If DL label can determine that vertex v is reachable by vertex u in the original graph before the edge insertion, the insertion will not trigger any label update (Line 1). Lines 2-8 describe a BFS process with pruning. For a visited vertex x, we prune x without traversing Des(x) iff DL_in(u) ⊆ DL_in(x), because all the vertices in Des(x) are deemed to be unaffected as their DL_in labels are supersets of DL_in(x).

Fig. 2: Label update for inserting edge (v, v). (a) Insert edge (v, v); (b) DL_in label update; (c) BL_in label update.

Example 3.
Figure 2(a) shows an example of edge insertion. Figure 2(b) shows the corresponding DL_in label update process. DL_in label is presented with brackets. Given an inserted edge (v, v), DL_in(v) is copied to DL_in(v). Then an inspection is performed on DL_in(v) and DL_in(v). Since DL_in(v) is a subset of DL_in(v) and DL_in(v), vertex v and vertex v are pruned from the BFS. The update process is then terminated.

DL label only gives a positive answer to a reachability query. In poorly connected graphs, DL will degrade to expensive BFS search. Thus, we employ the Bidirectional Leaf (BL) label to complement DL and quickly identify vertex pairs which are not reachable. We omit the update algorithm of BL, as it is very similar to that of DL, except that the updates are applied to BL_in and BL_out labels. Figure 2(c) shows the update of BL_in label. Similar to the DL label, the update process terminates early as BL_in(v) is totally unaffected after the edge insertion. Thus, no BL_in label is updated in this case.

Update complexity of DBL. In the worst case, all the vertices that reach or are reachable from the updated edge will be visited. Thus, the time complexity of updating DL and BL is O((k + k′)(m + n)) where (m + n) is the cost of the BFS. Empirically, as the BFS procedure prunes a large number of vertices, the actual update process is much more efficient than a plain BFS.

Table 2: Dataset statistics
  Dataset    |V|         |E|          d_avg   Diameter   Connectivity (%)   DAG-|V|     DAG-|E|     DAG construct (ms)
  LJ         4,847,571   68,993,773   14.23   16         78.9               971,232     1,024,140   2368
  Web        875,713     5,105,039    5.83    21         44.0               371,764     517,805     191
  Email      265,214     420,045      1.58    14         13.8               231,000     223,004     17
  Wiki       2,394,385   5,021,410    2.09    9          26.9               2,281,879   2,311,570   360
  BerkStan   685,231     7,600,595    11.09   514        48.8               109,406     583,771     1134
  Pokec      1,632,803   30,622,564   18.75   11         80.0               325,892     379,628     86
  Twitter    2,881,151   6,439,178    2.23    24         1.9                2,357,437   3,472,200   481
  Reddit     2,628,904   57,493,332   21.86   15         69.2               800,001     857,716     1844
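Returning to the edge-insertion update of Section 5, the DL_in procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: labels are plain Python sets, the function name `insert_edge_dl_in` is hypothetical, and the sketch also copies DL_in(u) into DL_in(v) itself, as Example 3 describes, before exploring the successors of v. The symmetrical DL_out update on the reversed graph is omitted.

```python
from collections import deque


def insert_edge_dl_in(succ, dl_in, dl_out, u, v):
    """Insert edge (u, v) and propagate DL_in(u) to the descendants of v."""
    succ[u].append(v)
    if dl_out[u] & dl_in[v]:
        return                 # Line 1: the labels already show that u reaches v
    new = set(dl_in[u])        # copy, so later label changes cannot alias it
    if new <= dl_in[v]:
        return                 # assumption: v is pruned like any other vertex
    dl_in[v] |= new
    queue = deque([v])
    while queue:
        p = queue.popleft()
        for x in succ[p]:
            if not new <= dl_in[x]:   # Line 6: x still misses some landmark
                dl_in[x] |= new
                queue.append(x)
            # otherwise x and all of Des(x) are pruned
```

Since a vertex is enqueued only when its label actually grows, the traversal stops as soon as it reaches the part of the graph that already knows the landmarks of u.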
In this section, we conduct experiments by comparing the proposed DBL framework with the state-of-the-art approaches on reachability query for dynamic graphs.

Our experiments are conducted on a server with an Intel Xeon CPU E5-2640 v4 2.4GHz, 256GB RAM and a Tesla P100 PCIe version GPU.
Datasets: We conduct experiments on 8 real-world datasets (see Table 2). We have collected the following datasets from SNAP [11]. LJ and Pokec are two social networks, which are power-law graphs in nature. BerkStan and Web are web graphs in which nodes represent web pages and directed edges represent hyperlinks between them. Wiki and Email are communication networks. Reddit and Twitter are two social network datasets obtained from [23].

Table 3: Query time (ms) for different landmark nodes selection. A = max(|Pre(·)|, |Suc(·)|); B = min(|Pre(·)|, |Suc(·)|); C = |Pre(·)| + |Suc(·)|; D is the betweenness centrality; ours = |Pre(·)| · |Suc(·)|

  Dataset    A        B        C        D        ours
  LJ         125.10   127.84   105.88   113.34   108.51
  Web        202.16   144.13   142.16   140.79   139.64
  Email      37.02    37.01    36.14    38.53    36.38
  Wiki       156.21   159.74   153.66   155.45   157.12
  Pokec      37.69    64.57    36.96    50.66    34.78
  BerkStan   1890     6002     1883     1252     1590
  Twitter    719.31   849.78   685.31   727.59   693.71
  Reddit     99.21    65.06    62.68    69.62    60.48

DL selects the landmark nodes by heuristically approximating the centrality of a vertex u as M(u) = |Pre(u)| · |Suc(u)|. Here, we evaluate different heuristic methods for landmark nodes selection. The results are shown in Table 3. Overall, our adopted heuristic (|Pre(u)| · |Suc(u)|) achieves competitive performance across the datasets. For Email and Wiki, all the methods share a similar performance. |Pre(·)| + |Suc(·)| and |Pre(·)| · |Suc(·)| get a better performance in other datasets. Finally, |Pre(·)| + |Suc(·)| (degree centrality) and |Pre(·)| · |Suc(·)| deliver similar performance for most datasets and the latter is superior in the BerkStan dataset. Thus, we adopt |Pre(·)| · |Suc(·)| for approximating the centrality.
It is worth mentioning that, although the betweenness centrality delivers only medium overall performance, it shows the best performance in the BerkStan dataset.

In the main body of this paper, we restrict the leaf nodes to be the ones with either zero in-degree or zero out-degree. Nevertheless, our proposed method does not require such a restriction and could potentially select any vertex as a leaf node. Following the approach with which we select DL label nodes, we use M(u) = |Pre(u)| · |Suc(u)| to approximate the centrality of vertex u and select vertex u as a BL label node if M(u) ≤ r where r is a tuning parameter. Assigning r = 0 produces the special case presented in the main body of this paper. The algorithms for query processing as well as index update of the new BL label remain unchanged. Figure 3 shows the query performance of DBL when we vary the threshold r. With a higher r, more vertices are selected as the leaf nodes, which should theoretically improve the query processing efficiency. However, since we employ the hash function for BL label, more leaf nodes lead to higher collision rates. This explains why we do not observe a significant improvement in query performance.

Fig. 3: BL label node selection (query time in ms against the threshold r, for LJ, Web, Email, Wiki, Pokec, BerkStan, Twitter and Reddit)

Table 4 shows the percentages of queries answered by DL label, BL label (when the other label is disabled) and DBL label. All the queries are randomly generated. The results show that DL is effective for dense and highly connected graphs (LJ, Pokec and Reddit) whereas BL is effective for sparse and poorly connected graphs (Email, Wiki and Twitter). However, we still incur the expensive BFS if one of the labels is disabled. By combining the merits of both indexes, our proposal leads to a significantly better performance. DBL could answer many more queries than DL or BL label alone. The results have validated our claim that DL and BL are complementary to each other. We note that the query processing for DBL is able to handle one million queries with sub-second latency for most datasets, which shows outstanding performance.
Impact of Label Size on the Query Processing of DBL. There are two labels in DBL: both DL and BL store labels in bit vectors. The size of DL label depends on the number of selected landmark nodes whereas the size of BL label is determined by how many hash values are chosen to index the leaf nodes. We evaluate all the datasets to show the performance trend of varying DL and BL label sizes per vertex (by processing 1 million queries) in Table 5.

Table 4: Percentages of queries answered by DL label, BL label (when the other label is disabled) and DBL label respectively. We also include the time for DBL to process 1 million queries

  Dataset    DL Label   BL Label   DBL Label   DBL time
  LJ         97.5%      20.8%      99.8%       108ms
  Web        79.5%      54.3%      98.3%       139ms
  Email      31.9%      85.4%      99.2%       36ms
  Wiki       10.6%      94.3%      99.6%       157ms
  Pokec      97.6%      19.9%      99.9%       35ms
  BerkStan   87.5%      43.3%      95.0%       1590ms
  Twitter    6.6%       94.8%      96.7%       709ms
  Reddit     93.7%      30.6%      99.9%       61ms

When varying DL label size k, the performance of most datasets remains stable before a certain size (e.g., 64) and deteriorates thereafter. This means that extra landmark nodes will cover little extra reachability information. Thus, selecting more landmark nodes does not necessarily lead to better overall performance since processing the additional bits incurs additional cache misses. BerkStan benefits from increasing the DL label size to 128 since 64 landmarks are not enough to cover enough reachable pairs.

Compared with DL label, some of the datasets reach a sweet spot when varying the size of BL label. This is because there are two conflicting factors which affect the overall performance. With increasing BL label size and more hash values incorporated, we can quickly prune more unreachable vertex pairs by examining BL label without traversing the graph with BFS. Besides, a larger BL size also provides better pruning power for the BFS even if it fails to directly answer the query (Algorithm 2). Nevertheless, the cost of label processing increases with increased BL label size. According to our parameter study, we set Wiki's DL and BL label sizes as 64 and 256, and BerkStan's DL and BL label sizes as 128 and 64. For the remaining datasets, both DL and BL label sizes are set as 64.

In this section, we evaluate
DBL ’s performance on general graph update. As
DAGGER is the only method that could handle general update, we compare
DBL against
DAGGER in Figure 4. Ten thousand edge insertion and 1 million queriesare randomly generated and performed, respectively. Different from
DAGGER , DBL (a) Varying BL label sizes Dataset 16 32 64 128 256LJ 136.1 131.9 108.1 107.4 110.3Web 177.2 128.5 152.9 156.6 174.3Email 77.4 53.9 38.3 41.1 44.4Wiki 911.6 481.4 273.7 181.3 157.4Pokec 54.8 43.7 38.6 40.6 53.6BerkStan 4876.1 4958.9 4862.9 5099.1 5544.3Twitter 1085.3 845.7 708.2 652.7 673.2Reddit 117.1 80.4 67.3 63.5 67.9 (b) Varying DL label sizes Dataset 16 32 64 128 256LJ 108.2 110.3 106.9 120.2 125.5Web 154.0 152.5 151.1 158.8 167.8Email 37.9 39.5 35.8 39.8 43.7Wiki 274.5 282.6 272.4 274.8 281.1Pokec 38.1 40.6 36.3 49.7 55.6BerkStan 6369.8 5853.1 4756.3 1628.3 1735.2Twitter 716.1 724.4 695.3 707.1 716.9Reddit 64.6 65.9 62.9 75.4 81.4
Table 5: Query performance(ms) with varying DL and BL label sizes LJ Web Email Wiki Pokec BerkStan Twitter Reddit10 T i m e ( m s ) DAGGER DBL
Fig. 4: The execution time for insert 10000 edges as well as 1 million queriesdon’t need to maintain the
DAG , thus, in all the datasets,
DBL could achieve greatperformance lift compared with
DAGGER . For both edge insertion and query,
DBL is orders of magnitude faster than
DAGGER . The minimum performance gap liesin BerkStan. This is because BerkStan has a large diameter. As
DBL rely on
BFS traversal to update the index. The traversal overheads is crucial for it’sperformance. BerkStan’s diameter is large, it means, during index update,
DBL need to traversal extra hops to update the index which will greatly degrade theperformance.
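To make the dependence of the update cost on the diameter concrete, the following is a minimal sketch of a BFS-based DL label update for an inserted edge (u, v). It assumes that `dl_in` and `dl_out` are per-vertex bit vectors of landmarks and that the traversal is pruned wherever a label gains no new bit; the helper names `propagate` and `insert_edge` are hypothetical and the sketch omits the BL label.

```python
from collections import deque


def propagate(label, neighbors, start, bits):
    """OR `bits` into label[start] and push changes along `neighbors`.

    The traversal is pruned at every vertex whose label does not change.
    Returns the number of label updates performed.
    """
    if label[start] | bits == label[start]:
        return 0  # nothing new to propagate
    label[start] |= bits
    updates = 1
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in neighbors.get(x, ()):
            if label[y] | label[x] != label[y]:
                label[y] |= label[x]
                updates += 1
                queue.append(y)
    return updates


def insert_edge(u, v, succ, pred, dl_in, dl_out):
    """Insert edge (u, v) and refresh the DL labels in both directions."""
    succ.setdefault(u, []).append(v)
    pred.setdefault(v, []).append(u)
    # Landmarks reaching u now also reach v and everything below v.
    forward = propagate(dl_in, succ, v, dl_in[u])
    # Landmarks reachable from v are now reachable from u and its ancestors.
    backward = propagate(dl_out, pred, u, dl_out[v])
    return forward + backward


# Toy graph: landmark a (bit 0) is isolated from the chain b -> c -> d.
succ = {"b": ["c"], "c": ["d"]}
pred = {"c": ["b"], "d": ["c"]}
dl_in = {"a": 1, "b": 0, "c": 0, "d": 0}
dl_out = {"a": 1, "b": 0, "c": 0, "d": 0}
first = insert_edge("a", "b", succ, pred, dl_in, dl_out)   # walks b, c, d
second = insert_edge("a", "b", succ, pred, dl_in, dl_out)  # pruned at once
```

The first insertion has to walk the whole chain below b, one hop per BFS level, whereas the repeated insertion is pruned immediately. On a graph with a long chain of affected vertices, as a large diameter suggests, the number of levels grows accordingly.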
In this section, we compare our method with IP and TOL. Unlike DBL, which can handle real-world updates, IP and TOL can only handle synthetic edge updates that do not trigger DAG maintenance. Thus, for IP and TOL, we follow the experimental setups described in their papers [27,24]. Specifically, we randomly select 10,000 edges from the DAG and delete them. Then, we insert the same edges back. In this way, we obtain the edge insertion performance without triggering DAG maintenance. For DBL, we stick to general graph updates: the edge insertions are randomly generated and performed, and one million queries are executed afterwards. It should be noted that, although both IP and TOL claim to handle dynamic graphs, their special precondition makes them of limited use in real-world scenarios.

The results are shown in Figure 5. DBL outperforms the other baselines in most cases, except on three datasets (Wiki, BerkStan and Twitter) where IP achieves better performance. Nevertheless, DBL outperforms IP and TOL by 4.4x and 21.2x, respectively, in terms of geometric mean performance. We now analyze why DBL can be slower than IP on Wiki, BerkStan and Twitter. As mentioned above, DBL relies on the pruned BFS to update the index, so the BFS traversal speed determines the worst-case update performance. BerkStan has the largest diameter (514) and Twitter has the second largest (24), which dramatically slows down the update procedure in DBL. For Wiki, DBL still achieves better update performance than IP. However, IP is much more efficient in query processing, which leads to its better overall performance.

Although this experimental scenario has been used in previous studies, the comparison is unfair to DBL. As both IP and TOL rely on the DAG to process queries and updates, their synthetic updates exclude the DAG maintenance procedure and its overheads from the experiments. However, DAG maintenance is essential for their methods to handle real-world edge updates and, as we have shown in Figure 4, the overhead is non-negligible.

Fig. 5: The execution time for inserting 10,000 edges as well as 1 million queries; for TOL and IP, the updates are synthetic and do not trigger the SCC update (time in ms for TOL, IP and DBL on LJ, Web, Email, Wiki, Pokec, BerkStan, Twitter and Reddit)
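For readers unfamiliar with the aggregate used above, the geometric mean speedup over a set of datasets can be computed as sketched below. The timings in the example are made up purely for illustration and are not measurements from Figure 5; the function name is hypothetical.

```python
import math


def geometric_mean_speedup(baseline_ms, ours_ms):
    """Geometric mean of the per-dataset ratios baseline / ours."""
    ratios = [b / o for b, o in zip(baseline_ms, ours_ms)]
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))


# Made-up timings (ms) on three datasets: the baseline is 2x slower,
# 2x faster and 9x slower than our method, respectively.
baseline = [200.0, 50.0, 900.0]
ours = [100.0, 100.0, 100.0]
speedup = geometric_mean_speedup(baseline, ours)  # cube root of 2 * 0.5 * 9
```

The example also shows why a method can lose on individual datasets and still have a geometric mean speedup above 1: the mean multiplies the per-dataset ratios, so a few large gains outweigh a small loss.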
We implement DBL with OpenMP and CUDA (DBL-P and DBL-G, respectively) to demonstrate that deployment on multi-core CPUs and GPUs achieves encouraging speedups for query processing. We follow the existing GPU-based graph processing pipeline by batching the queries and updates [7,19,12]. Note that the transfer time can be overlapped with GPU processing to minimize data communication costs. Both CPU and GPU implementations are based on the vertex-centric framework.

To validate the scalability of the parallel approach, we vary the number of threads used in DBL-P and show its performance trend in Figure 6. DBL-P achieves almost linear scalability with an increasing number of threads (note that the y-axis is plotted in log scale). The linear trend tends to disappear when the number of threads exceeds 14. We attribute this observation to the memory-bandwidth-bound nature of the processing tasks. DBL invokes the BFS traversal once the labels are unable to answer the query, and the efficiency of the BFS is largely bounded by CPU memory bandwidth. This memory bandwidth issue of CPUs can be mitigated by using GPUs, which provide higher memory bandwidth.

Fig. 6: Scalability of DBL on CPU (time in ms against the number of threads for LJ, Web, Email, Wiki, Pokec, BerkStan, Twitter and Reddit)

The query processing performance is compared in Table 7. Bidirectional BFS (B-BFS) is listed as a baseline. We also compare our parallel solutions with a home-grown OpenMP implementation of IP (denoted as IP-P). Twenty threads are used in the OpenMP implementation. We note that IP has to invoke a pruned DFS if its labels fail to determine the query result. DFS is a sequential process in nature and cannot be efficiently parallelized. For our parallel implementation IP-P, we assign one thread to handle each query. We have the following observations.

First, DBL is built on the pruned BFS, which can be efficiently parallelized with the vertex-centric paradigm. We have observed significant performance improvement from parallel execution. DBL-P (CPUs) gets 4x to 10x speedup across all datasets. DBL-G (GPUs) shows even better performance. In contrast, as DFS incurs frequent random accesses in IP-P, its performance is bounded by memory bandwidth. Thus, parallelization does not bring much performance gain to IP-P compared with its sequential counterpart.

Second, DBL provides competitive efficiency against IP-P, but DBL can be slower than TOL and IP when comparing single-thread performance. However, TOL and IP achieve this by assuming the DAG structure, and the DAG-based approaches incur a prohibitively high cost of index update, as we demonstrated in the previous subsections. In contrast, DBL achieves sub-second query processing performance for handling 1 million queries while still supporting efficient updates without using the DAG.

Third, there are cases where DBL-P outperforms DBL-G, i.e., Web, BerkStan and Twitter. This is because these datasets have a larger diameter than the rest of the datasets and the pruned BFS needs to traverse extra hops to determine reachability. Thus, we incur more random accesses, which do not suit the GPU architecture.

Fig. 7: The query performance (ms) on CPU and GPU architectures. B-BFS means the bidirectional BFS

Dataset   TOL    IP     IP-P   DBL     DBL-P   DBL-G   B-BFS
LJ        46.6   50.7   24.9   108.1   16.4
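The B-BFS baseline and the reason BFS parallelizes well can be illustrated with a level-synchronous bidirectional search. This is a sequential sketch with hypothetical names, not our OpenMP or CUDA code: within one level every frontier vertex is expanded independently, which is the loop that a vertex-centric framework would distribute over threads.

```python
def expand(frontier, neighbors, seen, other_seen):
    """Expand one BFS level; report whether the two searches have met."""
    next_frontier = set()
    for x in frontier:  # iterations are independent: parallelizable loop
        for y in neighbors.get(x, ()):
            if y in other_seen:
                return True, next_frontier
            if y not in seen:
                seen.add(y)
                next_frontier.add(y)
    return False, next_frontier


def bidirectional_bfs(succ, pred, u, v):
    """Return True iff v is reachable from u in the directed graph."""
    if u == v:
        return True
    fwd_seen, bwd_seen = {u}, {v}
    fwd, bwd = {u}, {v}
    while fwd and bwd:
        # Always expand the smaller frontier to limit the work per level.
        if len(fwd) <= len(bwd):
            met, fwd = expand(fwd, succ, fwd_seen, bwd_seen)
        else:
            met, bwd = expand(bwd, pred, bwd_seen, fwd_seen)
        if met:
            return True
    return False  # one side is exhausted without meeting the other


# Toy graph: a -> b -> c -> d and e -> d
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "e": ["d"]}
pred = {"b": ["a"], "c": ["b"], "d": ["c", "e"]}
```

The number of iterations of the outer loop grows with the distance between the two endpoints, so graphs with a large diameter need more synchronized levels with little work in each, which is in line with the third observation above.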
In this work, we propose DBL, an indexing framework to support dynamic reachability query processing on incremental graphs. To the best of our knowledge, DBL is the first solution that avoids maintaining the DAG structure to build the reachability index. DBL leverages two complementary index components: DL and BL labels. The DL label is built on the landmark nodes to determine reachable vertex pairs that are connected by the landmarks, whereas the BL label prunes unreachable pairs by examining their reachability information on the leaf nodes in the graph. The experimental evaluation has demonstrated that the sequential version of DBL outperforms the state-of-the-art solutions by orders of magnitude in terms of index update while exhibiting competitive query processing performance. The parallel implementations of DBL on multi-cores and GPUs further boost the performance over our sequential implementation. As future work, we are interested in extending DBL to support deletions, which can be supported lazily in many applications.
Acknowledgement. Yuchen Li's work was supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (Award No.: MOE2019-T2-2-065).
References
1. Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 349–360. ACM (2013)
2. Akiba, T., Iwata, Y., Yoshida, Y.: Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling. In: Proceedings of the 23rd international conference on World wide web. pp. 237–248 (2014)
3. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.U.: Complex networks: Structure and dynamics. Physics reports (4-5), 175–308 (2006)
4. Bramandia, R., Choi, B., Ng, W.K.: Incremental maintenance of 2-hop labeling of large graphs. TKDE (5), 682–698 (2010)
5. Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. SICOMP (5), 1338–1355 (2003)
6. Demetrescu, C., Italiano, G.F.: Fully dynamic all pairs shortest paths with real edge weights. In: FOCS. pp. 260–267. IEEE (2001)
7. Guo, W., Li, Y., Sha, M., Tan, K.L.: Parallel personalized pagerank on dynamic graphs. PVLDB (1), 93–106 (2017)
8. Henzinger, M.R., King, V.: Fully dynamic biconnectivity and transitive closure. In: FOCS. pp. 664–672. IEEE (1995)
9. Hotz, M., Chondrogiannis, T., Wörteler, L., Grossniklaus, M.: Experiences with implementing landmark embedding in neo4j. In: Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). pp. 1–9 (2019)
10. Jin, R., Ruan, N., Xiang, Y., Wang, H.: Path-tree: An efficient reachability indexing scheme for large directed graphs. TODS (1), 7 (2011)
11. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data (Jun 2014)
12. Li, Y., Zhu, Q., Lyu, Z., Huang, Z., Sun, J.: Dycuckoo: Dynamic hash tables on gpus. In: ICDE (2021)
13. Potamias, M., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance estimation in large networks. In: CIKM. pp. 867–876. ACM (2009)
14. Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., Zhou, J.: Real-time constrained cycle detection in large dynamic graphs. Proceedings of the VLDB Endowment, 1876–1888 (08 2018). https://doi.org/10.14778/3229863.3229874
15. Roditty, L.: Decremental maintenance of strongly connected components. In: SODA. pp. 1143–1150. SIAM (2013)
16. Roditty, L., Zwick, U.: A fully dynamic reachability algorithm for directed graphs with an almost linear update time. SICOMP (3), 712–733 (2016)
17. Schenkel, R., Theobald, A., Weikum, G.: Efficient creation and incremental maintenance of the hopi index for complex xml document collections. In: ICDE. pp. 360–371. IEEE (2005)
18. Seufert, S., Anand, A., Bedathur, S., Weikum, G.: Ferrari: Flexible and efficient reachability range assignment for graph indexing. In: ICDE. pp. 1009–1020. IEEE (2013)
19. Sha, M., Li, Y., He, B., Tan, K.L.: Accelerating dynamic graph analytics on gpus. PVLDB (1) (2017)
20. Su, J., Zhu, Q., Wei, H., Yu, J.X.: Reachability querying: can it be even faster? TKDE (1), 1–1 (2017)
21. Valstar, L.D., Fletcher, G.H., Yoshida, Y.: Landmark indexing for evaluation of label-constrained reachability queries. In: Proceedings of the 2017 ACM International Conference on Management of Data. pp. 345–358 (2017)
22. Wang, H., He, H., Yang, J., Yu, P.S., Yu, J.X.: Dual labeling: Answering graph reachability queries in constant time. In: ICDE. pp. 75–75. IEEE (2006)
23. Wang, Y., Fan, Q., Li, Y., Tan, K.L.: Real-time influence maximization on dynamic social streams. Proceedings of the VLDB Endowment (7), 805–816 (2017)
24. Wei, H., Yu, J.X., Lu, C., Jin, R.: Reachability querying: an independent permutation labeling approach. VLDBJ (1), 1–26 (2018)
25. Yildirim, H., Chaoji, V., Zaki, M.J.: Grail: Scalable reachability index for large graphs. PVLDB3