Efficient Maintenance of Distance Labelling for Incremental Updates in Large Dynamic Graphs
Muhammad Farhan
Australian National University, Canberra
Qing Wang
Australian National University, Canberra
ABSTRACT
Finding the shortest path distance between an arbitrary pair of vertices is a fundamental problem in graph theory. A tremendous amount of research has been successfully attempted on this problem, most of which is limited to static graphs. Due to the dynamic nature of real-world networks, there is a pressing need to address this problem for dynamic networks undergoing changes. In this paper, we propose an online incremental method to efficiently answer distance queries over very large dynamic graphs. Our proposed method incorporates incremental update operations, i.e. edge and vertex additions, into a highly scalable framework of answering distance queries. We theoretically prove the correctness of our method and the preservation of labelling minimality. We have also conducted extensive experiments on 12 large real-world networks to empirically verify the efficiency, scalability, and robustness of our method.
© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Given a very large graph with billions of vertices and edges, how efficiently can we find the shortest path distance between any two vertices? If such a graph is dynamically changing over time (e.g. inserting edges or vertices), how can we not only efficiently but also accurately find the shortest path distance between any two vertices? These questions are intimately related to distance queries on dynamic graphs. As one of the most fundamental operations on graphs, distance queries have a wide range of real-world applications that operate on increasingly large dynamic graphs, such as context-aware search in web graphs [19], social network analysis in social networks [5, 20], management of resources in computer networks [6], and so on. Many of these applications use distance queries as a building block to realise more complicated tasks, and require distance queries to be answered instantly, e.g. in the order of milliseconds.

Previous studies have primarily focused on distance queries on static graphs [1–3, 10, 11, 13, 22], with little attention being paid to dynamics on graphs. To speed up query response time, a key technique is to precompute a data structure called distance labelling that satisfies certain properties such as 2-hop cover [8], and then use this data structure to answer distance queries efficiently. However, when a graph dynamically changes, its distance labelling needs to be changed accordingly; otherwise, distance queries may yield overestimated distances. Although it is possible to recompute a distance labelling from scratch, this leads to inefficiency. As shown in Figure 1, the percentage of affected vertices by a single change often ranges from 10 − % to 10% in various real-world networks, so recomputing distance labelling from scratch for each single change not only wastes computing resources, but also may generate inaccurate query results during the recomputing process. The question arising is thus how to efficiently and accurately change distance labelling on dynamic graphs in order to support distance queries?

[Figure 1 plot: percentage of affected vertices for Indochina, IT, Twitter, Friendster, UK and Clueweb09.]
Figure 1: Distribution of affected vertices by a single graph change in various networks, where the results for 1000 graph changes are sorted in the descending order.

In this paper, we aim to develop an online incremental method that can dynamically maintain distance labelling on graphs being changed by edge and vertex insertions. Typically, real-world dynamic networks are more vulnerable to insertions than removals, and a plethora of such real-world networks are large and frequently updated, primarily accommodating insertions [15, 21]. Thus, an online incremental method for dynamic graphs should possess the following desirable characteristics: (1) time efficiency: it can answer distance queries and update distance labelling efficiently (in the order of milliseconds); (2) space efficiency: it guarantees the minimum size of distance labelling to reduce storage costs; (3) scalability: it can scale to very large networks with billions of vertices and edges.
Challenges.
Designing online incremental methods for distance queries on dynamic graphs is known to be challenging [4]. When an edge or a vertex is inserted into a graph, outdated and redundant entries of distance labelling may occur. It was reported that removing such entries is a complicated task [4] because affected vertices need to be precisely identified so as to update their labels without violating the original properties of a distance labelling such as minimality. Further, although query time and update time are both critical for answering distance queries on dynamic graphs, it is not easy (if not impossible) to design a solution that is efficient in both. This requires us to find new insights into dynamic properties of a distance labelling, as well as a good trade-off between query time and update time. Last but not least, scaling distance queries to dynamic graphs with billions of nodes and edges is hard. Previous work [4, 12] mostly considered 2-hop labelling, which has very high space requirements and index construction time; as a result, their query and update performance is dramatically degraded on large-scale dynamic graphs. Ideally, the labelling size of a graph should be much smaller than its original size. However, the state-of-the-art distance labelling technique, i.e. the pruned landmark labeling method (PLL) [4], still yields a distance labelling whose size is 20-30 times larger than the original size of a dataset.

Contributions.
Our contributions are summarised as follows:
• Our method overcomes the challenge of eliminating outdated and redundant distance entries. None of the previous studies have addressed this challenge because detecting those entries is too costly [4, 9]. When an edge or a vertex is inserted, previous studies only add new distance entries or modify existing distance entries. This would however lead to an ever increasing size of labelling, particularly when a graph is frequently updated by newly added edges or vertices. Accordingly, both query performance and space efficiency would deteriorate over time.
• We prove the correctness of our proposed method and show that it preserves the desirable property of minimality on our distance labelling. Due to a property called highway cover [10], the minimal size of a distance labelling in this work is much smaller than the size of a 2-hop labelling in previous work [4, 12]. Preserving minimality on a distance labelling thus improves space efficiency and query performance, as well as update performance. We also provide a complexity analysis of our proposed method.
• We conducted experiments using 12 real-world large networks across different domains to show the efficiency, scalability and robustness of our method. Particularly, our method can perform updates in under one second, on average, even on billion-scale networks, while still answering queries efficiently in the order of milliseconds and guaranteeing the labelling size of a graph to be much smaller.
Answering shortest-path distance queries in graphs has been an active research topic for many years. Traditionally, a distance query can be answered using Dijkstra's algorithm [18] on positively weighted graphs or the Breadth-First Search (BFS) algorithm on unweighted graphs. However, these traditional algorithms fail to achieve the desired response time for distance queries on large graphs. Later, labelling-based methods emerged as an attractive way of accelerating response time to distance queries [1–3, 8, 10, 11, 13], among which Akiba et al. [3] proposed pruned landmark labeling (PLL) to precompute a 2-hop cover distance labelling [8]. This method serves as the state-of-the-art for labelling-based distance queries and can handle graphs with hundreds of millions of edges.

So far, only a few attempts have been made to study distance queries over dynamic graphs [4, 12], which are all based on the idea of 2-hop distance labelling or its variants. Akiba et al. [4] studied the problem of updating a pruned landmark labelling for incremental updates (i.e. vertex additions and edge additions). This work however does not remove redundant entries in distance labels because the authors considered that detecting such outdated entries is too costly. This inevitably breaks the minimality of pruned landmark labelling, leading to an ever-increasing labelling size and deteriorated query performance over time.

To accelerate shortest-path distance queries on large networks, another line of research is to combine a partial distance labelling with online shortest-path searches. Hayashi et al. [12] proposed a fully dynamic approach that selects a small set of landmarks $R$ and precomputes a shortest-path tree (SPT) rooted at each $r \in R$. Then, an online search is conducted on a sparsified graph under an upper distance bound being computed via the SPTs. Nevertheless, this method still fails to construct labelling on networks with billions of vertices. Following the same line, a recent work by Farhan et al. [10] introduced a highway-cover labelling method (HL), which can provide fast response time (milliseconds) for distance queries even on billion-scale graphs. However, this approach only works for static graphs.

Let $G = (V, E)$ be an undirected graph where $V$ is a set of vertices and $E$ is a set of edges. We denote by $N(v)$ the set of neighbors of a vertex $v \in V$, i.e. $N(v) = \{u \in V \mid (u, v) \in E\}$. Given two vertices $u$ and $v$ in $G$, the distance between $u$ and $v$, denoted as $d_G(u, v)$, is the length of the shortest path from $u$ to $v$. If there does not exist a path from $u$ to $v$, then $d_G(u, v) = \infty$. We use $P_G(u, v)$ to denote the set of all shortest paths between $u$ and $v$ in $G$.

Given a graph $G = (V, E)$, an edge insertion is to add an edge $(a, b)$ into $G$ where $\{a, b\} \subseteq V$ and $(a, b) \notin E$. Accordingly, a node insertion is to add a new node $v$ into $G$ together with a set of edge insertions that connect $v$ to existing vertices in $G$. The following fact is critical for designing algorithms for an edge insertion.

Fact 3.1. Let $G' = (V, E \cup \{(u, v)\})$ be the graph after inserting an edge $(u, v)$ into $G = (V, E)$. Then for any two vertices $s, t \in V$, $d_G(s, t) \geq d_{G'}(s, t)$.

That is, the distance between any two vertices never increases after inserting edges or vertices in a graph.
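Fact 3.1 can be checked on a small example. The Python sketch below is only an illustration: the toy path graph and the helper names `bfs_distances` and `all_pairs` are assumptions made here, not part of the paper. It computes all-pairs BFS distances before and after one edge insertion and confirms that no distance grows.

```python
from collections import deque
import math


def bfs_distances(adj, source):
    """Hop distances from `source` to every vertex of an unweighted graph."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if dist[w] == math.inf:      # first visit gives the BFS distance
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def all_pairs(adj):
    """d_G(s, t) for every pair, as a dict of dicts."""
    return {s: bfs_distances(adj, s) for s in adj}


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
before = all_pairs(adj)

# Edge insertion (a, b) = (0, 4); both endpoints exist and the edge is new.
adj[0].add(4)
adj[4].add(0)
after = all_pairs(adj)

# Fact 3.1: d_G(s, t) >= d_G'(s, t) for all s, t.
never_increases = all(after[s][t] <= before[s][t] for s in adj for t in adj)
print(before[0][4], after[0][4], never_increases)
```

Here the distance between vertices 0 and 4 drops from 4 to 1, and some other pairs (for example 0 and 3) also get closer, which is exactly why labels may need repairing after an insertion.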
Highway cover labelling.
Unlike the previous work [4, 9, 12] that uses 2-hop cover labelling [8], we develop our method using a highly scalable labelling approach, called highway cover labelling [10]. Let $R \subseteq V$ be a small set of landmarks in a graph $G = (V, E)$. For each vertex $v \in V$, the label of $v$ is a set of distance entries $L(v) = \{(r_1, \delta_L(r_1, v)), \ldots, (r_n, \delta_L(r_n, v))\}$, where $r_i \in R$ and $\delta_L(r_i, v) = d_G(r_i, v)$. We call $L = \{L(v)\}_{v \in V}$ a distance labelling over $G$, whose size is defined as $size(L) = \sum_{v \in V} |L(v)|$. A highway $H = (R, \delta_H)$ consists of a set $R$ of landmarks and a distance decoding function $\delta_H : R \times R \to \mathbb{N}^+$ such that, for any two landmarks $r_1, r_2 \in R$, $\delta_H(r_1, r_2) = d_G(r_1, r_2)$ holds.

Definition 3.2. A highway cover labelling is a pair $\Gamma = (H, L)$ where $H$ is a highway and $L$ is a distance labelling s.t. for any vertex $v \in V \setminus R$ and $r \in R$, we have:

$$d_G(r, v) = \min \{\delta_L(r_i, v) + \delta_H(r, r_i) \mid (r_i, \delta_L(r_i, v)) \in L(v)\}. \tag{1}$$

Highway cover labelling enjoys several nice theoretical properties, such as minimality and order independence. A minimal highway cover labelling can be efficiently constructed, independently of the order of applying landmarks [10].

Given a highway cover labelling $\Gamma = (H, L)$, an upper bound on the distance between any two vertices $u, v \in V \setminus R$ is computed as:

$$d^{\top}_{uv} = \min \{\delta_L(r_i, u) + \delta_H(r_i, r_j) + \delta_L(r_j, v) \mid (r_i, \delta_L(r_i, u)) \in L(u), (r_j, \delta_L(r_j, v)) \in L(v)\}. \tag{2}$$

An exact distance query $Q(u, v, \Gamma)$ can be answered by conducting a distance-bounded shortest-path search over a sparsified graph $G[V \setminus R]$ (i.e., removing all landmarks in $R$ from $G$) under the upper bound $d^{\top}_{uv}$ such that:

$$Q(u, v, \Gamma) = \begin{cases} d_{G[V \setminus R]}(u, v) & \text{if } d_{G[V \setminus R]}(u, v) \leq d^{\top}_{uv}, \\ d^{\top}_{uv} & \text{otherwise.} \end{cases}$$

Problem definition.
In this work, we study the problem of answering distance queries over a graph that is dynamically changed by edge and vertex insertions over time. Since a vertex insertion can be treated as a set of edge insertions, without loss of generality, below we define the problem based on edge insertions.
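To make the query $Q(u, v, \Gamma)$ concrete before the dynamic setting is introduced, here is a minimal Python sketch of Equation (2) followed by a bounded search over $G[V \setminus R]$. It rests on several assumptions that are not taken from the paper: the function names (`upper_bound`, `bounded_search`, `query`) are hypothetical, a plain one-directional BFS stands in for whatever bounded search an actual implementation uses, and the toy graph with its hand-derived labels is made up for illustration.

```python
from collections import deque
import math


def upper_bound(labels, highway, u, v):
    """Equation (2): cheapest route u -> r_i -> r_j -> v through the highway."""
    best = math.inf
    for ri, du in labels[u].items():
        for rj, dv in labels[v].items():
            hop = 0 if ri == rj else highway[ri][rj]
            best = min(best, du + hop + dv)
    return best


def bounded_search(adj, landmarks, u, v, bound):
    """BFS distance in G[V minus R]; gives up once the depth would exceed bound."""
    if u == v:
        return 0
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] + 1 > bound:          # deeper vertices cannot beat the bound
            break
        for y in adj[x]:
            if y in landmarks or y in dist:
                continue                 # landmarks are removed from the graph
            dist[y] = dist[x] + 1
            if y == v:
                return dist[y]
            queue.append(y)
    return math.inf


def query(adj, landmarks, labels, highway, u, v):
    """Q(u, v, Gamma): the smaller of the bounded search result and the bound."""
    bound = upper_bound(labels, highway, u, v)
    return min(bounded_search(adj, landmarks, u, v, bound), bound)


# Hypothetical toy graph with landmarks 0 and 4 (adjacent to each other).
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 6}, 4: {0, 5}, 5: {4, 6}, 6: {3, 5}}
landmarks = {0, 4}
# Labels derived by hand: an entry for r is kept only if no shortest path
# between r and the vertex passes through the other landmark.
labels = {1: {0: 1}, 2: {0: 2}, 3: {0: 3, 4: 3}, 5: {4: 1}, 6: {4: 2}}
highway = {0: {4: 1}, 4: {0: 1}}

q15 = query(adj, landmarks, labels, highway, 1, 5)  # best path uses both landmarks
q23 = query(adj, landmarks, labels, highway, 2, 3)  # neighbours in the sparsified graph
q16 = query(adj, landmarks, labels, highway, 1, 6)  # sparsified path beats the bound
print(q15, q23, q16)
```

For the pair (1, 5) the bound is 1 + 1 + 1 = 3 while the landmark-free path has length 4, so the bound is returned; for (1, 6) the landmark-free path of length 3 beats the bound of 4.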
Definition 3.3. Let $G \hookrightarrow G'$ denote that a graph $G$ is changed to a graph $G'$ by an edge insertion. The dynamic distance querying problem is, given any two vertices $u$ and $v$ in the changed graph $G'$, to efficiently compute the distance $d_{G'}(u, v)$.

In this section, we propose an algorithm IncHL+ to incrementally update labelling to reflect graph changes. Algorithm 1 describes the main steps of IncHL+. Below, we discuss them in detail.

When an update operation occurs on a graph $G = (V, E)$, there exists a subset of "affected" vertices in $V$ whose labels need to be updated as a consequence of this update operation on the graph.

Definition 4.1. A vertex $v \in V$ is affected by $G \hookrightarrow G'$ iff $P_G(v, r) \neq P_{G'}(v, r)$ for at least one $r \in R$; unaffected otherwise.

We use $\Lambda_r$ to denote the set of all affected vertices w.r.t. a landmark $r$ and $\Lambda = \bigcup_{r \in R} \Lambda_r$ the set of all affected vertices.

Example 4.2. Consider Figure 2(a) in which 0 and 10 are two landmarks. After inserting an edge $(2, 5)$, $\Lambda_0$ consists of the six affected vertices shown in Figure 2(b) and $\Lambda_{10}$ of the three affected vertices shown in Figure 2(d).

The following lemma states how affected vertices relate to an edge being inserted.

Lemma 4.3. When $G \hookrightarrow G'$ for an edge insertion $(a, b)$, a vertex $v \in \Lambda_r$ iff there exists a shortest path between $v$ and $r$ in $G'$ passing through $(a, b)$.

Following Lemma 4.3, we can reduce the search space of affected vertices by eliminating landmarks $r$ with $d_G(r, a) = d_G(r, b)$ since $\Lambda_r = \emptyset$ in such a case. Thus, we assume that $d_G(r, b) > d_G(r, a)$ w.r.t. a landmark $r$ in the rest of this section w.l.o.g. Further, by the lemma below, we can also reduce the search space by "jumping" from the root of a BFS to vertex $b$.

Lemma 4.4. When $G \hookrightarrow G'$ with an inserted edge $(a, b)$, we have $d_G(v, r) \geq d_G(a, r) + 1$ for any affected vertex $v \in \Lambda_r$.

Proof. By Lemma 4.3, there exists a shortest path from any affected vertex $v$ to $r$ going through the edge $(a, b)$ and thus through $a$. Since $a$ is unaffected and the distance from $a$ to $v$ is equal to or greater than 1, $d_G(v, r) \geq d_G(a, r) + 1$. □

Algorithm 2 describes our algorithm for finding affected vertices. Given a graph $G$ with an inserted edge $(a, b)$ and a highway cover labelling $\Gamma = (H, L)$ over $G$, we conduct a jumped BFS w.r.t. a landmark $r$ starting from the vertex $b$ with its new depth $\pi = Q(r, a, \Gamma) + 1$. For each $(v, \pi) \in \mathcal{Q}$, we enqueue all the neighbors of $v$ that are affected into $\mathcal{Q}$ with new distances $\pi + 1$, and add $v$ to $\Lambda_r$ as an affected vertex (Line 9). This process continues until $\mathcal{Q}$ is empty.

Example 4.5. Figure 2 illustrates how our algorithm finds affected vertices as a result of inserting an edge $(2, 5)$. The BFS rooted at landmark 0 is depicted in Figure 2(b), which jumps to vertex 5 and finds six affected vertices. Similarly, the BFS rooted at landmark 10 is depicted in Figure 2(d), which jumps to vertex 2 and finds three affected vertices.

Algorithm 1: Incremental algorithm (IncHL+).
Input: $G$, $G'$, $(a, b)$, $\Gamma = (H, L)$
Output: $\Gamma' = (H', L')$
1  foreach $r \in R$ do
2      $\Lambda_r \leftarrow$ FindAffected($G$, $(a, b)$, $r$, $\Gamma$)
3      RepairAffected($G'$, $(a, b)$, $\Lambda_r$, $r$, $\Gamma$)

Algorithm 2: Finding affected vertices.
1  Function FindAffected($G$, $(a, b)$, $r$, $\Gamma$)
2      $\mathcal{Q} \leftarrow \emptyset$, $\Lambda_r \leftarrow \emptyset$
3      $\pi \leftarrow Q(r, a, \Gamma) + 1$
4      Enqueue $(b, \pi)$ to $\mathcal{Q}$
5      while $\mathcal{Q}$ is not empty do
6          Dequeue $(v, \pi)$ from $\mathcal{Q}$
7          foreach $w \in N(v)$ s.t. $Q(r, w, \Gamma) \geq \pi + 1$ do
8              Enqueue $(w, \pi + 1)$ to $\mathcal{Q}$
9          $\Lambda_r = \Lambda_r \cup \{v\}$
10     return $\Lambda_r$

Now we propose a repair strategy to efficiently update the labels of affected vertices in order to reflect graph changes. The key idea is that, instead of conducting a full BFS on all vertices, we conduct a partial BFS from $b$ only on affected vertices. Further, to avoid unnecessary computations, we distinguish two kinds of affected vertices: (1) affected vertices that are covered by other landmarks and can thus be easily repaired by removing an entry from their labels; (2) affected vertices whose labels need to be repaired with accurately calculated distances on a changed graph. The following lemma characterizes the first kind according to the definition of highway cover labelling.

Lemma 4.6. An affected vertex $v \in \Lambda_r$ is covered by a landmark $r' \in R \setminus \{r\}$ iff $r'$ exists in $P_{G'}(v, r)$. If an affected vertex $v \in \Lambda_r$ is covered by $r'$, then any affected vertex $v' \in \Lambda_r$ satisfying $d_{G'}(r, v') = d_{G'}(r, v) + d_{G'}(v, v')$ must also be covered by $r'$.

By Lemma 4.6, we can efficiently repair affected vertices $v \in \Lambda_r$ as follows. If $v$ is covered by a landmark $r' \in R \setminus \{r\}$ (i.e., one of the unaffected parents of $v$ does not contain $r$ in its label) and is also a landmark, we only update the highway; otherwise, we remove the entry of $r$ from $L(v)$. If $v$ is not covered by any $r' \in R \setminus \{r\}$, we add/modify the entry of $r$ in $L(v)$. If $v$ is a descendant of covered vertices, we simply remove the entry of $r$ from $L(v)$ (if it exists).

Algorithm 3 describes our algorithm for repairing affected vertices. Given a graph $G$ with an inserted edge $(a, b)$ and a set of affected vertices $\Lambda_r$, we conduct a BFS w.r.t. a landmark $r$ starting from the vertex $b$ with its new distance $\pi = d_G(r, a) + 1$. We use two queues $\mathcal{Q}_{uncovered}$ and $\mathcal{Q}_{covered}$ to process uncovered and covered vertices, respectively. If $b$ is covered, we enqueue $(b, \pi)$ to $\mathcal{Q}_{covered}$ and remove the entry of $r$ from the labels of affected vertices (Line 25). Otherwise, we enqueue $(b, \pi)$ to $\mathcal{Q}_{uncovered}$ and start processing vertices in $\mathcal{Q}_{uncovered}$ (Line 5). For each vertex $v \in \mathcal{Q}_{uncovered}$ at depth $\pi$, we examine its affected neighbors $w$ at depth $\pi + 1$. If $w$ is covered, then if $w$ is a landmark, we update the highway (Line 10); otherwise we remove the entry of $r$ from $L(w)$ (Line 12) because there must exist another landmark in the shortest path from $w$ to $r$, and we add $(w, \pi + 1)$ to $\mathcal{Q}_{covered}$ (Line 13). Otherwise, we add/modify the entry of $r$ with the new distance $\pi + 1$ in $L(w)$ and enqueue $w$ to $\mathcal{Q}_{uncovered}$ (Lines 15-16). After that, we remove $w$ from $\Lambda_r$ (Line 17). Then, for each $(v, \pi) \in \mathcal{Q}_{covered}$, we remove $r$ from the labels of affected neighbors of $v$, remove these affected vertices from $\Lambda_r$ and enqueue them to $\mathcal{Q}_{covered}$ (Lines 19-24). We process these two queues, one after the other, until $\mathcal{Q}_{uncovered}$ is empty. Finally, we remove the entry of $r$ from the labels of the remaining vertices in $\mathcal{Q}_{covered}$ (Line 25).

[Figure 2: panels (a)-(e).]
Figure 2: An illustration of our online incremental algorithm IncHL+: (a) a graph with three landmarks 0, 4 and 10 (colored in yellow); (b) and (d) the BFSs for finding affected vertices (colored in green) w.r.t. landmarks 0 and 10, respectively; (c) and (e) the BFSs for repairing affected vertices w.r.t. landmarks 0 and 10, respectively, where vertices with added/modified entries are colored in blue, and vertices with removed entries are colored in red.

Algorithm 3: Repairing affected vertices.
1  Function RepairAffected($G'$, $(a, b)$, $\Lambda_r$, $r$, $\Gamma$)
2      $\mathcal{Q}_{uncovered} \leftarrow \emptyset$, $\mathcal{Q}_{covered} \leftarrow \emptyset$
3      $\pi \leftarrow d_G(r, a) + 1$
4      Enqueue $(b, \pi)$ to $\mathcal{Q}_{covered}$ if covered; otherwise to $\mathcal{Q}_{uncovered}$
5      while $\mathcal{Q}_{uncovered}$ is not empty do
6          while $(v, \pi) \in \mathcal{Q}_{uncovered}$ at depth $\pi$ do
7              forall $w \in N(v)$ s.t. $w \in \Lambda_r$ at depth $\pi + 1$ do
8                  if $covered(w, \pi + 1)$ then
9                      if $w$ is a landmark then
10                         $\delta_H(r, w) \leftarrow \pi + 1$
11                     else
12                         Remove $r$ from $L(w)$
13                     Enqueue $(w, \pi + 1)$ to $\mathcal{Q}_{covered}$
14                 else
15                     Add/Modify $\{(r, \pi + 1)\}$ in $L(w)$
16                     Enqueue $(w, \pi + 1)$ to $\mathcal{Q}_{uncovered}$
17                 Remove $w$ from $\Lambda_r$
18             Dequeue $(v, \pi)$ from $\mathcal{Q}_{uncovered}$
19         while $(v, \pi) \in \mathcal{Q}_{covered}$ at depth $\pi$ do
20             forall $w \in N(v)$ s.t. $w \in \Lambda_r$ at depth $\pi + 1$ do
21                 Remove $r$ from $L(w)$
22                 Remove $w$ from $\Lambda_r$
23                 Enqueue $(w, \pi + 1)$ to $\mathcal{Q}_{covered}$
24             Dequeue $(v, \pi)$ from $\mathcal{Q}_{covered}$
25     Remove entry $r$ from remaining vertices in $\mathcal{Q}_{covered}$

Example 4.7. Figure 2 illustrates how our algorithm repairs labels as a result of inserting an edge $(2, 5)$. The BFS for landmark 0 is depicted in Figure 2(c), which jumps to vertex 5 and repairs three affected vertices; the other three affected vertices are covered by landmarks 4 and 10. Similarly, the BFS for landmark 10 is depicted in Figure 2(e), in which two vertices are repaired and vertex 1 is covered by landmarks 0 and 4.

Proof of correctness.
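As an informal complement to the formal argument, the following Python sketch runs a simplified variant of Algorithms 1-3 on a toy graph and compares the repaired labelling with one rebuilt from scratch. Everything in it is an illustrative assumption rather than the authors' implementation: the helper names (`bfs`, `covered`, `build`, `insert_edge`) and the graph are hypothetical, a full BFS distance map per landmark stands in for the label-based query $Q(r, \cdot, \Gamma)$ and for the highway, a visited check is added to the jumped BFS, and the repair step walks the affected vertices once in BFS order instead of using the two queues of Algorithm 3.

```python
from collections import deque

INF = float("inf")


def bfs(adj, src):
    """Hop distances from src in an unweighted graph."""
    dist = {v: INF for v in adj}
    dist[src] = 0
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if dist[w] == INF:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def covered(adj, d, labels, R, r, w):
    """True if some shortest r-w path contains a landmark other than r.

    Assumes the parents of w (neighbours one step closer to r) already
    carry their final r-entries.
    """
    if w in R:
        return w != r
    return any(d[p] + 1 == d[w] and p != r and (p in R or r not in labels[p])
               for p in adj[w])


def build(adj, R):
    """From-scratch labelling; vertices are handled by increasing distance."""
    dist = {r: bfs(adj, r) for r in R}
    labels = {v: {} for v in adj if v not in R}
    for r in R:
        for w in sorted(labels, key=lambda v: dist[r][v]):
            if dist[r][w] < INF and not covered(adj, dist[r], labels, R, r, w):
                labels[w][r] = dist[r][w]
    return dist, labels


def insert_edge(adj, R, dist, labels, a, b):
    """Simplified IncHL+ step for inserting the edge (a, b)."""
    adj[a].add(b)
    adj[b].add(a)
    for r in R:
        d = dist[r]
        if d[a] == d[b]:
            continue                      # no shortest path from r uses (a, b)
        x, y = (a, b) if d[a] < d[b] else (b, a)
        # Finding affected vertices: BFS that jumps to y and keeps a neighbour
        # only if its old distance is at least the depth via the new edge.
        new = {y: d[x] + 1}
        queue = deque([y])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in new and d[w] >= new[v] + 1:
                    new[w] = new[v] + 1
                    queue.append(w)
        # Repairing: `new` is in BFS order, so parents are repaired first.
        d.update(new)
        for w in new:
            if w in R:
                continue                  # landmark: only its distance changes
            if covered(adj, d, labels, R, r, w):
                labels[w].pop(r, None)    # redundant entry is removed
            else:
                labels[w][r] = d[w]       # entry is added or modified


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 5 - 4 with landmarks {0, 4}.
R = {0, 4}
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 5}, 5: {3, 4}, 4: {5}}
dist, labels = build(adj, R)
size_before = sum(len(entries) for entries in labels.values())

insert_edge(adj, R, dist, labels, 0, 4)   # closes the path into a 6-cycle
size_after = sum(len(entries) for entries in labels.values())

rebuilt_dist, rebuilt_labels = build(adj, R)
print(labels == rebuilt_labels, size_before, size_after)
```

On this example the insertion shrinks the labelling from 8 to 4 entries. Vertex 3 is a case worth noting: its distance to landmark 0 stays at 3, but it gains a shortest path through landmark 4, so its entry for landmark 0 is removed. Agreement with the rebuilt labelling on one small graph is only a sanity check and does not replace the proof.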
For $G \hookrightarrow G'$ where our method IncHL+ updates a highway cover labelling $\Gamma$ over $G$ into a highway cover labelling $\Gamma'$ over $G'$, we consider IncHL+ to be correct iff, whenever $Q(u, v, \Gamma) = d_G(u, v)$ holds for any two vertices $u$ and $v$ in $G$, then $Q(u', v', \Gamma') = d_{G'}(u', v')$ also holds for any two vertices $u'$ and $v'$ in $G'$. We prove the theorem below for IncHL+.

Theorem 5.1. IncHL+ is correct.

Proof. First, we prove that FindAffected returns the set of all affected vertices $\Lambda_r$ as a result of an edge insertion. IncHL+ (Lines 7-8 of Algorithm 2) guarantees that any vertex being added to $\mathcal{Q}$ has one shortest path to a landmark $r$ which goes through the inserted edge $(a, b)$. By Lemma 4.3, such vertices are affected vertices, and thus a vertex $v$ is added to $\mathcal{Q}$ in Algorithm 2 iff $v \in \Lambda_r$. Then, we prove that RepairAffected repairs $\Gamma = (H, L)$ s.t. (1) $(r, d_{G'}(r, v)) \in L(v)$ for $v \in \Lambda_r$, iff $P_{G'}(r, v)$ contains only one landmark $r$; (2) $\delta_H(r, r') = d_{G'}(r, r')$ for any $r' \in R \setminus \{r\}$. Starting from $b$ with new distance $\pi$, the distances of affected vertices in $\Lambda_r$ are iteratively inferred on $G'$ and reflected into their labels via $\mathcal{Q}_{uncovered}$ if these affected vertices are not covered (Lines 15-16 of Algorithm 3). If an affected vertex $v$ is covered, it is kept in $\mathcal{Q}_{covered}$; if $v$ is also a landmark, $\delta_H(r, v)$ in $H$ is updated (Lines 9-10). Thus, the distance entry of $r$ is removed from the labels of affected vertices appearing in $\mathcal{Q}_{covered}$, whereas any vertex $v$ appearing in $\mathcal{Q}_{uncovered}$ must have $(r, d_{G'}(r, v)) \in L(v)$. □

Preservation of minimality.
It has been reported in [10] that, given a graph $G$, a minimal highway cover labelling $\Gamma = (H, L)$ of $G$ can be constructed using an algorithm proposed in their work, i.e., $size(L') \geq size(L)$ holds for any $\Gamma' = (H, L')$ of $G$. For $G \hookrightarrow G'$ where IncHL+ updates $\Gamma$ over $G$ into $\Gamma'$ over $G'$, we prove that IncHL+ preserves the minimality of labelling.

Theorem 5.2. If $\Gamma$ is minimal on $G$, then $\Gamma'$ is minimal on $G'$.

Proof. By Lemma 4.6, $(r, d_{G'}(r, v)) \in L(v)$ for $v \in \Lambda_r$ iff $P_{G'}(r, v)$ does not contain any other landmark in $R \setminus \{r\}$; otherwise we remove the entry of $r$ from the label of $v$ (Lines 12, 21 and 25 of Algorithm 3). Thus, the labels of all affected vertices must be minimal after applying IncHL+. For unaffected vertices, their labels remain unchanged. Hence, $\Gamma'$ must be minimal. □

Complexity analysis.
Let $m$ be the total number of affected vertices, $l$ be the average size of labels (i.e. $l = size(L)/|V|$), and $d$ be the average degree of vertices. For a landmark, Algorithm 2 takes $O(mdl)$ time to find all affected vertices and Algorithm 3 takes $O(md)$ time to repair the labels of all affected vertices. We omit $l$ from $O(md)$ for Algorithm 3 because distances for all unaffected neighbors of affected vertices are stored in Algorithm 2. Therefore, IncHL+ has time complexity $O(|R| \times mdl)$. In our experiments, we notice that $m$ is usually orders of magnitude smaller than $|V|$ and $l$ is also significantly smaller than $|R|$.

Directed and weighted graphs.
For directed graphs, we can store sets of forward and backward labels, namely $L_f(v)$ and $L_b(v)$, for each vertex $v$, which contain pairs $(r_i, \delta_{r_i v})$ from forward and backward BFSs w.r.t. each landmark. Accordingly, we can store forward and backward highways $H_f$ and $H_b$. Then, we conduct two BFSs to update these labels and highways: one in the forward direction and the other in the backward direction. Our method can also be easily extended to handle weighted graphs by using Dijkstra's algorithm instead of BFSs.

We have evaluated our method to answer the following questions: (Q1) How efficiently can our method perform against state-of-the-art methods? (Q2) How does the number of landmarks affect the performance of our method? (Q3) How does our method scale to perform updates occurring rapidly in large dynamic networks?

Table 1: Comparing the update time, query time and labelling size of our method with the baseline methods.

| Dataset | Update Time (ms): IncHL+ | IncFD | IncPLL | Query Time (ms): IncHL+ | IncFD | IncPLL | Labelling Size: IncHL+ | IncFD | IncPLL |
|---|---|---|---|---|---|---|---|---|---|
| Skitter | 0.194 | 0.444 | 2.05 | 0.027 | 0.019 | 0.047 | 42 MB | 153 MB | 2.44 GB |
| Flickr | 0.006 | 0.074 | 1.73 | 0.007 | 0.012 | 0.064 | 34 MB | 152 MB | 3.69 GB |
| Hollywood | 0.031 | 0.101 | 48 | 0.027 | 0.037 | 0.109 | 27 MB | 263 MB | 12.58 GB |
| Orkut | 2.026 | 2.049 | - | 0.101 | 0.103 | - | 70 MB | 711 MB | - |
| Enwiki | 0.134 | 0.163 | 5.91 | 0.054 | 0.035 | 0.071 | 82 MB | 608 MB | 12.57 GB |
| Livejournal | 0.245 | 0.268 | - | 0.044 | 0.046 | - | 122 MB | 663 MB | - |
| Indochina | 5.443 | 158 | 2018 | 0.737 | 0.839 | 0.063 | 81 MB | 838 MB | 18.64 GB |
| IT | 95.92 | 224 | - | 1.069 | 1.013 | - | 854 MB | 4.74 GB | - |
| Twitter | 0.027 | 0.134 | - | 0.863 | 0.177 | - | 1.14 GB | 3.83 GB | - |
| Friendster | 0.159 | 0.419 | - | 0.814 | 0.904 | - | 2.43 GB | 9.14 GB | - |
| UK | 11.49 | 384 | - | 3.443 | 5.858 | - | 1.78 GB | 11.8 GB | - |
| Clueweb09 | 40.68 | - | - | 16.93 | - | - | 163 GB | - | - |

Table 2: Summary of datasets.

| Dataset | Network | $|V|$ | $|E|$ | avg. deg | avg. dist |
|---|---|---|---|---|---|
| Skitter | comp (u) | 1.7M | 11M | 13.081 | 5.1 |
| Flickr | social (u) | 1.7M | 16M | 18.133 | 5.3 |
| Hollywood | social (u) | 1.1M | 114M | 98.913 | 3.9 |
| Orkut | social (u) | 3.1M | 117M | 76.281 | 4.2 |
| Enwiki | social (d) | 4.2M | 101M | 43.746 | 3.4 |
| Livejournal | social (d) | 4.8M | 69M | 17.679 | 5.6 |
| Indochina | web (d) | 7.4M | 194M | 40.725 | 7.7 |
| IT | web (d) | 41M | 1.2B | 49.768 | 7.0 |
| Twitter | social (d) | 42M | 1.5B | 57.741 | 3.6 |
| Friendster | social (u) | 66M | 1.8B | 55.056 | 5.0 |
| UK | web (d) | 106M | 3.7B | 62.772 | 6.9 |
| Clueweb09 | web (d) | 1.7B | 7.8B | 9.27 | 7.4 |
Datasets.
We used 12 large real-world networks as detailed in Table 2. These networks are accessible at Stanford Network Analysis Project [16], Laboratory for Web Algorithmics [7], Koblenz Network Collection [14], and Network Repository [17]. We treated these networks as undirected and unweighted graphs.
Updates and queries.
For each network, we randomly sampled 1,000 pairs of vertices as edge insertions, denoted as $E_I$, where $E_I \cap E = \emptyset$, to evaluate the average update time. Further, we evaluated the average query time with 100,000 randomly sampled pairs of vertices from each network and report the labelling size after reflecting all the updates.

Baseline methods.
We compared our method (IncHL+) with the state-of-the-art methods: (1) IncPLL: an online incremental algorithm proposed in [4] which is based on the 2-hop cover labelling to answer distance queries; (2) IncFD: an online incremental algorithm proposed in [12] which combines a 2-hop cover labelling with a graph traversal algorithm to answer distance queries. The codes of these methods were provided by their authors and implemented in C++. We used the same parameter settings for these methods as suggested by their authors unless otherwise stated. For a fair comparison, following [12] we set $|R| = 20$ for IncFD and our methods, except for Clueweb09 which has $|R| = 150$ due to its billion-scale vertices. Our methods were implemented in C++11 and compiled using gcc 5.5.0 with the -O3 option. We performed all the experiments using a single thread on a Linux server (Intel Xeon W-2175 with 2.50GHz and 512GB of main memory).
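The update workload $E_I$ described under "Updates and queries" could be generated along the following lines. This is a hedged Python sketch, not the authors' procedure: the function name `sample_edge_insertions`, the fixed seed, and the rejection-sampling strategy are all assumptions, since the text only states that pairs of vertices are sampled at random with $E_I \cap E = \emptyset$.

```python
import random


def sample_edge_insertions(adj, k, seed=0):
    """Sample k distinct vertex pairs that are not edges of the graph.

    Rejection sampling: draw two distinct vertices uniformly at random and
    keep the pair only if it is neither an existing edge nor a repeat.
    """
    n = len(adj)
    num_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    if k > n * (n - 1) // 2 - num_edges:
        raise ValueError("not enough non-edges to sample from")
    rng = random.Random(seed)            # seeded so that runs are repeatable
    vertices = sorted(adj)
    sampled = set()
    while len(sampled) < k:
        u, v = rng.sample(vertices, 2)   # two distinct vertices
        if v not in adj[u]:
            sampled.add((min(u, v), max(u, v)))
    return sorted(sampled)


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4 - 5 (5 edges, 10 non-edges).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
insertions = sample_edge_insertions(adj, k=4)
print(insertions)
```

Rejection sampling is a reasonable choice for sparse graphs such as those in Table 2, where almost every random pair is a non-edge; on dense graphs it would waste many draws.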
Table 1 shows that the average update time of our method IncHL+ outperforms the state-of-the-art methods IncFD and IncPLL on all datasets. This is due to a novel repair strategy utilized by IncHL+. Further, only IncHL+ can scale to very large networks with billions of vertices and edges. IncFD fails to scale to Clueweb09, and IncPLL fails for 7 out of 12 datasets due to very high preprocessing time and memory requirements.

From Table 1, we see that IncHL+ has significantly smaller labelling sizes than IncFD and IncPLL. When updates occur on a graph, the labelling sizes of IncFD and IncHL+ remain stable because their average label sizes are bounded by the size of the landmark set (i.e. $|R|$). Moreover, IncFD stores complete shortest path trees w.r.t. landmarks, while IncHL+ stores pruned shortest-path trees which lead to labelling of much smaller sizes than IncFD. For IncPLL, the labelling sizes may increase because IncPLL does not remove outdated and redundant entries.

In Table 1 the query times of IncHL+ are comparable with IncFD and IncPLL. It has been shown in [9] that query time depends on labelling size. As discussed in Section 6.1.2, the update operations do not considerably affect the labelling sizes of IncFD and IncHL+, and thus their query times remain stable. However, the query times for IncPLL may increase over time because of the presence of outdated and redundant entries, which result in labelling of increasing size.

[Figure 3 plot: update time (ms) per dataset for 10, 20, 30, 40 and 50 landmarks.]
Figure 3: Average update time of our method IncHL+ (in colored bars) and the baseline method IncFD (in colored plus grey bars) under 10-50 landmarks. There are no results of IncFD for Clueweb09 due to the scalability issue.

[Figure 4 plots: update time (sec.) per dataset, with series Construction and IncHL+.]
Figure 4: Update time of IncHL+ for performing up to 10,000 updates against construction time of labelling from scratch.

Figure 3 shows the average update time of our method IncHL+ against the baseline method IncFD under varying landmarks, i.e., $|R| \in [10, 20, 30, 40, 50]$. As we can see, IncHL+ outperforms IncFD on all the datasets against almost every selection of landmarks. We can also see the performance gap remains stable for most of the datasets when increasing the number of landmarks. This empirically verifies the efficiency of our repair strategy.

We conducted a scalability test on the update time of our method IncHL+, by starting with 500 updates and then iteratively adding 500 updates each time until 10,000 updates. Figure 4 shows the results. We observe that the update time of IncHL+ on almost all the datasets is considerably below the construction time of labelling. On Indochina and IT, IncHL+ performs relatively worse because these networks have large average distances as depicted in Table 2, which lead to high percentages of affected vertices as shown in Figure 1. In contrast, IncHL+ performs well on graphs with small average distances such as Twitter. Overall, IncHL+ can scale to perform a large number of updates efficiently.

This paper has studied the problem of answering distance queries on large dynamic networks. Our proposed algorithm exploits properties of a recent labelling technique called highway cover labelling [10] to efficiently process incremental graph updates, and can preserve the minimality property of labelling after each update operation. We have empirically evaluated the efficiency and scalability of the proposed algorithm. The results show that our proposed algorithm outperforms the state-of-the-art methods. In future, we plan to further investigate the effects of decremental updates on graphs since they are also commonly used in practice.
REFERENCES
[1] Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck. 2011. A hub-based labeling algorithm for shortest paths in road networks. In SEA. 230–241.
[2] Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck. 2012. Hierarchical hub labelings for shortest paths. In ESA. 24–35.
[3] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. 2013. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In ACM SIGMOD. 349–360.
[4] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. 2014. Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling. In WWW. 237–248.
[5] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: membership, growth, and evolution. In ACM SIGKDD. 44–54.
[6] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-U Hwang. 2006. Complex networks: Structure and dynamics. Physics reports
[7] … WWW. 595–601.
[8] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. 2003. Reachability and distance queries via 2-hop labels. SIAM J. Comput. 32, 5 (2003), 1338–1355.
[9] Gianlorenzo D'angelo, Mattia D'emidio, and Daniele Frigioni. 2019. Fully Dynamic 2-Hop Cover Labeling. JEA 24, 1 (2019), 1–6.
[10] Muhammad Farhan, Qing Wang, Yu Lin, and Brendan McKay. 2019. A Highly Scalable Labelling Approach for Exact Distance Queries in Complex Networks. In EDBT. 13–24.
[11] Ada Wai-Chee Fu, Huanhuan Wu, James Cheng, and Raymond Chi-Wing Wong. 2013. Is-label: an independent-set based labeling scheme for point-to-point distance querying. VLDB 6, 6 (2013), 457–468.
[12] Takanori Hayashi, Takuya Akiba, and Ken-ichi Kawarabayashi. 2016. Fully Dynamic Shortest-Path Distance Query Acceleration on Massive Networks. In CIKM. 1533–1542.
[13] Ruoming Jin, Ning Ruan, Yang Xiang, and Victor Lee. 2012. A highway-centric labeling approach for answering distance queries on large sparse graphs. In ACM SIGMOD. 445–456.
[14] Jérôme Kunegis. 2013. Konect: the koblenz network collection. In WWW. 1343–1350.
[15] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. ACM TKDD 1, 1 (2007), 2–es.
[16] Jure Leskovec and Andrej Krevl. 2015. SNAP Datasets: Stanford Large Network Dataset Collection. (2015).
[17] Ryan Rossi and Nesreen Ahmed. 2015. The network data repository with interactive graph analytics and visualization. In AAAI.
[18] Robert Endre Tarjan. 1983. Data structures and network algorithms. Vol. 44. Siam.
[19] Antti Ukkonen, Carlos Castillo, Debora Donato, and Aristides Gionis. 2008. Searching the wikipedia with contextual information. In CIKM. 1351–1352.
[20] Monique V Vieira, Bruno M Fonseca, Rodrigo Damazio, Paulo B Golgher, Davi de Castro Reis, and Berthier Ribeiro-Neto. 2007. Efficient search ranking in social networks. In CIKM. 563–572.
[21] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P Gummadi. 2009. On the evolution of user interaction in facebook. In Proceedings of the 2nd ACM workshop on Online social networks. 37–42.
[22] Fang Wei. 2010. TEDI: efficient shortest path query answering on graphs. In