Evaluating Complex Queries on Streaming Graphs
Anil Pacaci
University of [email protected]
Angela Bonifati
Lyon 1 [email protected]
M. Tamer Özsu
University of [email protected]
ABSTRACT
In this paper, we study the problem of evaluating persistent queries over streaming graphs in a principled fashion. These queries need to be evaluated over unbounded and very high speed graph streams. We define a streaming graph model and a streaming graph query model incorporating navigational queries, subgraph queries and paths as first-class citizens. To support this full-fledged query model we develop a streaming graph algebra that describes the precise semantics of persistent graph queries with their complex constructs. We present transformation rules and describe query formulation and plan generation for persistent graph queries over streaming graphs. Our implementation of a streaming graph query processor based on the dataflow computational model shows the feasibility of our approach and allows us to gauge the high performance gains obtained for query processing over streaming graphs.
Modern applications in many domains now operate on very high speed streaming graphs. For example, Twitter’s recommendation system ingests 12K events/sec on average [34], and Alibaba’s transaction graph processes 30K edges/sec at its peak [65]. A recent survey [67] reports that these workloads are prevalent in real applications, and efficient querying of these streaming graphs is a crucial task for applications that monitor complex patterns and relationships.
Existing graph DBMSs follow the traditional database paradigm where data is persistent and queries are transient; consequently, they do not support persistent query semantics where queries are registered to the system and results are generated incrementally as the graph edges arrive. Persistent queries on streaming graphs enable users to continuously obtain new results on rapidly changing data, supporting online analysis and real-time query processing as demonstrated in the following two real examples.
Example 1 (Real-time Notification).
In many online social networking applications users post original content, sometimes link this to other people’s content and react to each other’s posts; these interactions are modeled as a graph. We say that a user $u_1$ is a recent liker for another user $u_2$ if $u_1$ recently likes posts that are created by $u_2$ and $u_1$ and $u_2$ are connected through friends. The goal of the recommendation service is to notify users, in real-time, of new contents that are posted by others that are connected by a path of recent liker pattern. This real-time notification task is a persistent query over the streaming graph of user interactions that returns the recommended content in real-time, as depicted in Figure 1a.
Example 2 (Physical Contact Tracking).
There are a number of Covid-19 contact tracing applications that model interactions as graphs. In these applications, people are represented as vertices and two people are said to be in contact if they visit the same space in the last 14 days (this is a simplification). The goal is to notify people with a potential chain of contacts when they are linked to someone who tested positive. As shown in Figure 1b, the task of contact tracing is also a persistent graph query that returns the chain of contacts on a sliding window over this streaming graph of people’s contacts.
[Figure 1: Graph patterns representing queries in (a) Example 1, and (b) Example 2. Panels: (a) Real-time Notification; (b) Contact Tracing.]
As demonstrated by these examples, real-world applications that feature complex graph patterns require:
• R1: subgraph queries that find matches of a given graph pattern;
• R2: path navigation queries that traverse paths based on user specified constraints; and
• R3: the ability to manipulate and return paths (i.e., treat paths as first-class citizens of the data model).
These requirements are only partially addressed by existing graph DBMSs in the context of one-time queries over static graphs. Query languages of these graph DBMSs (e.g., PGQL, SPARQL v1.1, Cypher) address the first two issues by replacing edge labels of a subgraph pattern with regular expressions; this is known as unions of conjunctive RPQs (UCRPQ) [16, 78]. However, UCRPQ lacks algebraic closure and composability, limiting optimization opportunities. Furthermore, the output of a path navigation query is typically a set of pairs of vertices that are connected by a path under the constraints of a given regular expression.
Consequently, UCRPQ-based query languages limit path navigation queries to boolean reachability, lacking the ability to return and manipulate paths. G-CORE [7] addresses these limitations at the language level and introduces a query language proposal that heavily influences the standardization efforts for a query language for graph DBMSs.
[Footnote: Relational algebra is closed over a set of relational operators, and common relational techniques such as query rewriting and view-based query optimization utilize the algebraic closure of operators of the relational algebra.]
To the best of our knowledge, there is no work that uniformly addresses all three requirements in the context of persistent query processing over streaming graphs. A number of specialized algorithms focus on evaluating subgraph pattern queries on streaming graphs [6, 21, 40, 49, 65], and our previous work focuses on path navigation queries and introduces the first streaming algorithms for RPQ evaluation over streaming graphs [62]. Our objective in this paper is to introduce a general-purpose framework that addresses the above discussed requirements of real-world applications (which feature complex graph patterns) in a uniform and principled manner.
Querying streaming data in real-time imposes novel requirements in addition to challenges of graph processing:
• R4: graph streams are unbounded, making it infeasible to employ batch algorithms on the entire stream; and
• R5: graph edges arrive at a very high rate and real-time answers are required as the graph emerges.
Most existing work on graph querying employs the snapshot model, which assumes that graphs are static and fully available, and that ad hoc queries reflect the current state of the database (e.g., [23, 43, 70, 72, 76, 79, 81]). Our experiments show that graph DBMSs that are based on the snapshot model are not able to keep up with the high arrival rates [63].
The unsuitability of existing graph DBMSs for querying streaming data has motivated the design of specialized applications that process streaming graphs (e.g., [21, 34, 40, 49, 65]). However, due to the lack of a principled approach that has benefitted relational DBMSs, these applications rely on ad hoc solutions that are tailored for a particular task. The lack of systematic support for query processing over streaming graphs hinders the development of a general-purpose query processor for streaming graphs.
In this paper, we introduce a general-purpose query processing framework for streaming graphs that consists of: (i) a streaming graph model and algebra with well-founded semantics, and (ii) a dataflow-based query processor with efficient physical operators for persistent query processing over streaming graphs. Unlike existing work on streaming graphs that relies on ad hoc algorithms, our algebraic approach provides the foundational framework to precisely describe the semantics of complex persistent graph queries over streaming graphs and to optimize such queries by query optimizers. Built on the Regular Query (RQ) model, our proposed Streaming Graph Algebra (SGA) unifies path navigation and subgraph pattern queries in a structured manner (R1 & R2), i.e., it attains composability by properly closing UCRPQ under recursion. In addition, our proposed framework treats paths as first-class citizens of its data model, enabling the proposed algebra to express queries that return and manipulate paths (R3). Yet, such representation with well-founded semantics is only half of the answer. We also provide a prototype system on top of Timely Dataflow [59] for efficient evaluation of these queries based on our proposed SGA.
[Footnote: RQ is a computationally well-behaved subset of Datalog that has tractable query evaluation and decidable query containment, similar to relational algebra [66].]
In particular, we first introduce efficient incremental algorithms that utilize temporal patterns of sliding window movements over the streaming graph as physical operators for SGA primitives (R4 & R5), then describe how to cast persistent graph queries as dataflow computations that consist of these physical operators. Ours, to the best of our knowledge, is the first work to precisely describe semantics of persistent query evaluation over streaming graphs that combines subgraph pattern and path navigation queries in a principled manner with the notion of paths as first-class citizens, and to provide efficient algorithms to evaluate such queries consistent with the semantics. Our contributions are:
• a streaming graph model (Section 3), a query model (Section 4), and a streaming graph algebra (Section 5.1) providing precise semantics of persistent graph queries with subgraph patterns, path navigations and windowing constructs;
• incorporation of paths as first-class citizens, enabling the algebra to express queries that return and manipulate paths;
• development of algebraic transformation rules for the SGA and a query formulation and plan generation methodology;
• physical operators for the proposed SGA (Section 6) and a streaming graph query processor based on the dataflow computational model (Section 7.1.1); and
• extensive experimental analysis on real and synthetic datasets to demonstrate the feasibility and the performance of our approach for query processing over streaming graphs (Section 7).
Early research on stream processing primarily adapts the relational model and its query operators to the streaming setting (e.g., STREAM [10], Aurora [3], Borealis [2]). In contrast, modern Data Stream Processing Systems (DSPS) such as Storm [73], Heron [45], Flink [20] do not necessarily offer a full set of DBMS functionality. Existing literature (as surveyed by Hirzel et al.
[38]) heavily focuses on general-purpose systems and does not consider core graph querying functionality such as subgraph pattern matching and path navigation. Similarly, there has been a significant interest in RDF streams (e.g., stream reasoning [9, 42], data exchange [18, 54]). Some of the streaming RDF systems include SPARQL extensions for persistent query evaluation over RDF streams such as C-SPARQL [14], CQELS [48], SPARQL$_{stream}$ [19] and the W3C proposal RSP-QL [24]. These are the most similar to our problem setting. However, these RDF systems are designed for SPARQL v1.0 and cannot formulate path expressions such as RPQs that cover more than 99% of all recursive queries abundantly found in massive Wikidata query logs [17]. Furthermore, query processing engines of these systems do not employ incremental operators, except Sparkwave [42] that focuses on stream reasoning. Our proposed framework supports complex graph patterns arising in existing graph query languages, including SPARQL v1.1 property paths, and introduces non-blocking operators that are optimized for streaming workloads.
Existing work on streaming graph systems, by and large, focuses on either (i) maintenance of graph snapshots under a stream of updates for iterative graph analytics workloads, or (ii) specialized systems for persistent query workloads that are tailored for the task at hand. One of the earlier systems in the first category, STINGER [25], proposes an adjacency list-based data structure optimized for fast ingestion of streaming graphs. GraphOne [46, 47] uses a novel versioning scheme to support concurrent reads and writes on the most recent snapshot of the graph. Analytical engines such as GraphIn [69] and GraphTau [39] extend the popular vertex-centric model with incremental computation primitives to minimize redundant computation across consecutive snapshots.
More recently, systems such as GraPu [71] and GraphBolt [53] introduce novel dependency tracking schemes to transparently maintain results of graph analytics workloads by utilizing structural properties such as monotonicity. This line of research primarily focuses on building and maintaining graph snapshots from streaming graphs for iterative graph analytics workloads with little or no focus on graph querying functionalities.
A persistent query over sliding windows can be formulated as an Incremental View Maintenance (IVM) problem, where the view definition is the query itself and window movements correspond to updates to the underlying database. In the IVM model, the goal is to incrementally maintain the view (the results of a persistent query) upon changes to the underlying database, namely insertions into and expirations from a sliding window. The classical
Counting [35] algorithm maintains the number of alternative derivations for each derived tuple in a Select-Project-Join view to determine when a tuple no longer belongs to the view. DBToaster [41] introduces the concept of higher-order views for group-by aggregates and represents each view definition using a hierarchy of views that reduce the overall maintenance cost. F-IVM [61] further extends higher-order views with a factorized representation of these views to reduce the amount of state and the computation cost. ViewDF [80] extends existing IVM techniques with windowing constructs to speed up query processing over sliding windows. Although conceptually similar, these techniques are not suitable for recursive graph queries that are addressed in this paper, primarily because of the potentially infinite results for recursive graph queries.
The classical
DRed algorithm [35] adapts the semi-naive strategy to support recursive views: it first deletes all derived tuples that depend on the deleted tuple, then re-derives the tuples that still have an alternative derivation after the deletion. DRed might over-estimate the set of deleted tuples and might re-derive the entire view. Storing the how-provenance (the set of all tuples that might be used to derive a tuple) might prevent over-estimation; however, it significantly increases the amount of state that the algorithm needs to maintain. The provenance information can be encoded in the form of boolean polynomials and the boolean absorption law can be used to reduce the amount of additional information that needs to be maintained [52]. Thus, it is possible to adapt recursive IVM techniques to evaluate streaming graph queries, but they ignore the structure of graph queries and inherent temporal patterns of streaming graphs. Our methods, in contrast, exploit the structure of the problem to minimize the cost of persistent graph query evaluation over streaming graphs.
Temporal Graph Algebra (TGA) [58] adapts temporal relational operators in PGM to provide systematic support for analytics over evolving graphs. Its implementation on Spark introduces physical operators for graph analytics at different levels of resolutions [5]. Similar to ours, TGA introduces a set of time-aware algebraic primitives, but it focuses on exploratory graph analytics over the entire history of changes. In contrast, the primary objective of our framework is to continuously evaluate graph queries as the underlying (potentially unbounded) graph changes, i.e., persistent query processing over streaming graphs.
The closest to ours are specialized incremental algorithms on dynamic and streaming graphs [6, 21, 40, 49, 62, 65]. Some of these [21, 40] study the problem of subgraph pattern matching under a stream of edge updates. Their focus is developing efficient incremental algorithms to maintain matches of a given subgraph pattern as the underlying graph changes. Similarly, Ammar et al. [6] present worst-case-optimal join-based algorithms for distributed evaluation of subgraph pattern queries. Li et al. [49] present an efficient algorithm for subgraph isomorphism search over streaming graphs with timing-order constraints. GraphS [65] introduces efficient index structures that are optimized for cycle detection queries. These are all specialized algorithms and systems for particular query workloads. In previous work [62], we study the design space of algorithms for path navigation over streaming graphs and provide algorithms for persistent evaluation of Regular Path Queries, a tiny subset of the class of queries that we address in this paper. To the best of our knowledge, this is the first work to describe the precise semantics of generalized streaming graph queries.
We provide a unified framework to represent and optimize persistent graph queries over streaming graphs featuring both path navigation and subgraph pattern matching as well as windowing constructs that are commonly used in practice.
Definition 1 (Graph).
A directed labeled graph is a quintuple $G = (V, E, \Sigma, \psi, \phi)$ where $V$ is a set of vertices, $E$ is a set of edges, $\Sigma$ is a set of labels, $\psi : E \to V \times V$ is an incidence function and $\phi : E \to \Sigma$ is an edge labelling function.
Definition 2 (Path and Path Label).
Given $u, v \in V$, a path $p$ from $u$ to $v$ in graph $G$ is a sequence of edges $u \xrightarrow{p} v : \langle e_1, \cdots, e_n \rangle$. The label sequence of a path $p$ is defined as the concatenation of edge labels, i.e., $\phi^p(p) = \phi(e_1) \cdots \phi(e_n) \in \Sigma^*$.
We use $\mathcal{T} = (\mathcal{T}, \leq)$ to define a discrete, totally ordered time domain and use timestamps $t \in \mathcal{T}$ to denote time instants. Without loss of generality, the rest of the paper uses non-negative integers to represent timestamps.
Definition 3 (Streaming Graph Edge). A streaming graph edge (sge) is a quadruple $(src, trg, l, t)$ where $src, trg \in V$ are endpoints of an edge $e \in E$, $l \in \Sigma$ is the label of the edge $e$, and $t \in \mathcal{T}$ is the event (application) timestamp assigned by the source, i.e., $\psi(e) = (src, trg)$ and $\phi(e) = l$.
Definition 4 (Input Graph Stream).
An input graph stream is a continuously growing sequence of streaming graph edges $S_I = [sge_1, sge_2, \cdots]$ where each sge represents an edge in graph $G$ and sges are non-decreasingly ordered by their timestamps.
Input graph streams represent external sources that provide the system with the graph-structured data. Our proposed framework uses a different format that generalizes Definition 4 to also represent intermediate results and outputs of persistent queries (Definition 7).
Definition 5 (Validity Interval).
A validity interval is a half-open time interval $[ts, exp)$ consisting of all distinct time instants $t \in \mathcal{T}$ for which $ts \leq t < exp$.
Timestamps are commonly used to represent the time instant in which the interaction represented by the sge occurred [49, 62, 65]. Alternatively, we use intervals to represent the period of validity of sges. In this paper, we argue that using validity intervals leads to a succinct representation and simplifies operator semantics by separating the specification of window constructs from operator implementation. As an example, each sge with timestamp $t$ can be assigned a validity interval $[t, t + 1)$ that corresponds to a single time unit with smallest granularity that cannot be decomposed into smaller time units. Similarly, an sge $e = (u, v, l, [ts, exp))$ with a validity interval is equivalent to a set of sges $\{(u, v, l, t_1), \cdots, (u, v, l, t_n)\}$ where $t_1 = ts$ and $t_n = exp - 1$. Windowing operators (to be precisely defined in Section 5.1) are used to assign validity intervals based on the windowing specifications of a given query.
We now describe the logical representation of streaming graphs that is used throughout the paper. First, we extend the directed labeled graph model with materialized paths to represent paths as first-class citizens of the data model. As per Definition 2, a path between vertices $u$ and $v$ is a sequence of edges $u \xrightarrow{p} v : \langle e_1, \cdots, e_n \rangle$ that connects vertices $u$ and $v$, i.e., the path $p$ defines a higher-order relationship between vertices $u$ and $v$ through a sequence of edges. By treating paths as first-class citizens like vertices and edges, the materialized path graph model enables queries to have paths as inputs and outputs. In addition, it enables edges and paths to be stitched together to form complex graph patterns as will be shown in Section 5.1.
Definition 6 (Materialized Path Graph).
A materialized path graph is a 7-tuple $G = (V, E, P, \Sigma, \psi, \rho, \phi)$ where $V$ is a set of vertices, $E$ is a set of edges, $P$ is a set of paths, $\Sigma$ is a set of labels, $\psi : E \to V \times V$ is an incidence function, $\rho : P \to E \times \cdots \times E$ is a total function that assigns each path to a finite, ordered sequence of edges in $E$, and $\phi : (E \cup P) \to \Sigma$ is a labeling function, where images of $E$ and $P$ under $\phi$ are disjoint, i.e., $\phi(E) \cap \phi(P) = \emptyset$.
[Footnote: We assume that sges are generated by a single source and arrive in order, and leave out-of-order arrival as future work.]
[Footnote: We use $[\,]$ to denote ordered streams throughout the paper.]
[Footnote: Commonly referred to as NOW windows, as described in Section 5.1.]
[Figure 2: An excerpt of the streaming graph of the real-time notification query in Example 1.]
The function $\rho$ assigns to each $p : u \xrightarrow{p} v \in P$ an actual path $\langle e_1, \cdots, e_n \rangle$ in graph $G$ satisfying: for every $i \in [1, \cdots, n)$, $\psi(e_i) = (src_i, trg_i)$, $trg_i = src_{i+1}$, and $src_1 = u$, $trg_n = v$. The materialized path graph is a strict generalization of the directed labeled graph model (Definition 1), i.e., each directed labeled graph $G$ is also a materialized path graph where $P = \emptyset$. We now generalize the notion of streaming graph edges (Definition 3) as follows:
Definition 7 (Streaming Graph Tuple). A streaming graph tuple (sgt) is a quintuple $sgt = (src, trg, l, [ts, exp), \mathcal{D})$ where the distinguished (explicit) attributes $src, trg \in V$ are endpoints of an edge $e \in E$ or a path $p \in P$ in graph $G$ and $l \in \Sigma$ is its label, and the non-distinguished (implicit) attributes are $[ts, exp) \in \mathcal{T} \times \mathcal{T}$, a half-open time interval representing the tuple's validity, and $\mathcal{D}$, a payload consisting of the edges in $E$ that participated in the generation of the tuple.
Streaming graph tuples generalize sges (Definition 3) to represent, in addition to input graph edges, derived edges (new edges as operator and query results that are not necessarily part of the input graph) and paths (sequences of edges as operator and query results). We use the notation $E_I \subset E$ to denote the set of input graph edges, and $\phi(E_I)$ to denote the fixed set of labels that are reserved for input graph edges. Additionally, the non-distinguished (implicit) attribute $\mathcal{D}$ of an sgt $t$ captures the path $p$, i.e., the sequence of edges, in case the sgt $t$ represents a path. Otherwise, $\mathcal{D}$ is the edge $e$ that the sgt $t$ represents.
Definition 8 (Streaming Graph). A streaming graph $S$ is a continuously growing sequence of streaming graph tuples $S = [t_1, t_2, \cdots]$ in which each tuple $t_i$ arrives at a particular time $t_i$ ($t_i < t_j$ for $i < j$).
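To make Definitions 6 and 7 concrete, the following is a minimal Python sketch of a streaming graph tuple and of the contiguity condition that $\rho$ imposes on a path payload. The class name `SGT`, the helper `is_well_formed`, the `(src, label, trg)` edge encoding and the sample values are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical encoding (an assumption, not from the paper): an edge is (src, label, trg).
Edge = Tuple[str, str, str]


@dataclass(frozen=True)
class SGT:
    """Streaming graph tuple (src, trg, l, [ts, exp), D) in the spirit of Definition 7."""
    src: str
    trg: str
    label: str
    ts: int    # inclusive start of the validity interval
    exp: int   # exclusive end of the validity interval
    payload: Tuple[Edge, ...]  # D: a single edge, or the edge sequence of a path

    def valid_at(self, t: int) -> bool:
        # Half-open interval test: ts <= t < exp.
        return self.ts <= t < self.exp


def is_well_formed(sgt: SGT) -> bool:
    """Check that the payload is a contiguous edge sequence from src to trg."""
    if sgt.ts >= sgt.exp or not sgt.payload:
        return False
    if sgt.payload[0][0] != sgt.src or sgt.payload[-1][2] != sgt.trg:
        return False
    # The target of each edge must be the source of the next one.
    return all(a[2] == b[0] for a, b in zip(sgt.payload, sgt.payload[1:]))


# Illustrative values only: an input edge and a two-edge derived path.
edge_sgt = SGT("u", "v", "follows", 7, 31, (("u", "follows", "v"),))
path_sgt = SGT("y", "v", "RLP", 29, 31, (("y", "RL", "u"), ("u", "RL", "v")))
```

Keeping the payload as an immutable tuple lets the same structure represent both cases of Definition 7: a one-element payload for an edge and a longer one for a path.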
Figure 2 depicts an excerpt of the streaming graph of the social networking application given in Example 1, where sgts represent input graph edges (validity intervals are assigned by the windowing operator as explained in Section 5.1). The input graph edge from $u$ to $v$ with label $likes$ is modelled as the sgt $t = (u, a, likes, [ , ), \mathcal{D} = \{(u, likes, a)\})$.
Unless otherwise specified, we consider streaming graphs to be append-only, i.e., each sgt represents an insertion, and use the direct approach to process expirations due to window movements. Explicit deletions of previously arrived sgts can be supported by explicitly manipulating the validity interval of a previously arrived sgt [44]. This corresponds to the negative tuple approach [29, 33]. Section 6 describes the processing of insertions, deletions and expirations under alternative window semantics for physical operator implementations.
Definition 9 (Logical Partitioning). A logical partitioning of a streaming graph $S$ is a label-based partitioning of its tuples and it produces a set of disjoint streaming graphs $\{S_{l_1}, \cdots, S_{l_n}\}$ where each $S_{l_i}$ consists of sgts of $S$ with the label $l_i$, i.e., $S = \bigcup_{l \in \Sigma} (S_l)$.
This label-based partitioning of streaming graphs provides a coherent representation for inputs and outputs of operators in the logical algebra (Section 5.1). At the logical level, it can be performed by the filter operator of the logical algebra (precisely defined in Definition 17), and logical operators of our algebra process logically partitioned streaming graphs as their inputs and outputs unless otherwise specified.
Definition 10 (Value-Equivalence).
Sgts $t_1 = (u_1, v_1, l_1, [ts_1, exp_1), \mathcal{D}_1)$ and $t_2 = (u_2, v_2, l_2, [ts_2, exp_2), \mathcal{D}_2)$ are value-equivalent iff their distinguished attributes are equal, i.e., they both represent the same edge or the same path possibly with different validity intervals. Formally, $t_1$ and $t_2$ are value-equivalent $\Leftrightarrow u_1 = u_2, v_1 = v_2, l_1 = l_2$.
Value-equivalence is used for temporal coalescing of tuples with adjacent or overlapping validity intervals [51]. We extend the coalesce primitive from temporal database literature [22] to sgts with an aggregation function over the non-distinguished payload attribute, $\mathcal{D}$, as shown below:
Definition 11 (Coalesce Primitive). The coalesce primitive merges a set of value-equivalent sgts $\{t_1, \cdots, t_n\}$, $t_i = (src, trg, l, [ts_i, exp_i), \mathcal{D}_i)$ for $1 \leq i \leq n$ with overlapping or adjacent validity intervals using an operator-specific aggregation function $f_{agg}$ over the payload attribute $\mathcal{D}$:
$$\mathrm{coalesce}_{f_{agg}}(\{t_1, \cdots, t_n\}) = \big(src, trg, l, [\min_{1 \leq i \leq n}(ts_i), \max_{1 \leq i \leq n}(exp_i)), f_{agg}(\mathcal{D}_1, \cdots, \mathcal{D}_n)\big)$$
Distinguished attributes $src, trg$ and the label $l$ of sgts in a streaming graph $S$ represent the topology of a materialized path graph. Hence, a finite subset of a streaming graph $S$ corresponds to a materialized path graph over the set of edges and paths that are in the streaming graph and the set of vertices that are adjacent to these. We now use this to define snapshot graphs and the property of snapshot reducibility.
Definition 12 (Snapshot Graph). A snapshot graph $G_t$ of a streaming graph $S$ is defined by a mapping from each time instant in $\mathcal{T}$ to a finite set of sgts in $S$. At any given time $t$, the content of a mapping $\tau_t(S)$ defines a snapshot graph $G_t = (V_t, E_t, P_t, \Sigma_t, \psi, \rho, \phi)$ where $E_t = \{e_i \mid e_i.ts \leq t < e_i.exp\}$ is the set of all edges that are valid at time $t$, $P_t = \{p_i \mid p_i.ts \leq t < p_i.exp\}$, and $V_t$ is the set of all vertices that are endpoints of edges and paths in $E_t$ and $P_t$, respectively.
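The coalesce primitive (Definition 11) and the snapshot mapping $\tau_t$ (Definition 12) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the flat tuple encoding `(src, trg, label, ts, exp, payload)`, the function names `coalesce` and `snapshot`, and the choice to raise an error when the inputs violate the preconditions are all hypothetical and not taken from the paper.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical flat encoding of an sgt: (src, trg, label, ts, exp, payload).
Sgt = Tuple[str, str, str, int, int, tuple]


def coalesce(sgts: List[Sgt], f_agg: Callable[[List[tuple]], tuple]) -> Sgt:
    """Merge value-equivalent sgts whose validity intervals overlap or are adjacent."""
    if not sgts:
        raise ValueError("nothing to coalesce")
    ordered = sorted(sgts, key=lambda s: s[3])  # order by start timestamp
    src, trg, label = ordered[0][:3]
    reach = ordered[0][4]  # largest expiry covered so far
    for s in ordered[1:]:
        if s[:3] != (src, trg, label):
            raise ValueError("sgts are not value-equivalent")
        if s[3] > reach:
            raise ValueError("validity intervals leave a gap")
        reach = max(reach, s[4])
    # Result interval is [min ts, max exp); payloads are aggregated by f_agg.
    return (src, trg, label, ordered[0][3], reach,
            f_agg([s[5] for s in ordered]))


def snapshot(stream: Iterable[Sgt], t: int) -> Set[Tuple[str, str, str]]:
    """Edges and paths valid at time t; a Python set gives set semantics."""
    return {(s[0], s[1], s[2]) for s in stream if s[3] <= t < s[4]}
```

For example, coalescing two value-equivalent tuples with intervals `[28, 37)` and `[37, 52)` yields a single tuple with interval `[28, 52)`, and `snapshot` returns the corresponding edge once for any `t` in that range.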
Value-equivalence (Definition 10) and the coalesce primitive (Definition 11) ensure that snapshot graphs have set semantics, i.e., at any point in time $t$, a vertex, edge or path exists at most once in the snapshot graph $G_t$ of a streaming graph $S$.
Next, we define the notion of snapshot reducibility that enables us to precisely define the semantics of streaming queries and operators using their non-streaming counterparts. Snapshot reducibility is a commonly used concept in temporal databases to generalize non-temporal queries and operators to temporal ones [22].
Definition 13 (Snapshot-Reducibility). Let $S$ be a streaming graph, $\mathcal{Q}$ a streaming graph query and $\mathcal{Q}_O$ its non-streaming, one-time counterpart. Snapshot reducibility states that each snapshot of the result of applying $\mathcal{Q}$ to $S$ is equal to the result of applying its non-streaming version $\mathcal{Q}_O$ to the corresponding snapshot of $S$, i.e., $\forall t \in \mathcal{T}, \tau_t\big(\mathcal{Q}(S)\big) = \mathcal{Q}_O\big(\tau_t(S)\big)$.
This section presents our streaming graph query (SGQ) model. We formulate streaming graph queries based on a streaming generalization of the Regular Query (RQ) model. RQ provides a good basis for building a general-purpose framework for persistent query evaluation over streaming graphs, because (i) unlike UCRPQ, it is closed under transitive closure and composable, (ii) it has the expressive power of existing graph query languages such as SPARQL v1.1, Cypher and PGQL (RQ strictly subsumes UCRPQ on which these are based), and (iii) its query evaluation and containment complexity is reasonable [66]. Due to its well-defined semantics and computational behaviour, RQ is gaining popularity as a logical foundation for graph queries, both in theory [15, 16] and in practice [7].
Definition 14 (Regular Queries (RQ) – Following [66]).
The class of Regular Queries is the subset of non-recursive Datalog with a finite set of rules where each rule is of the form:
$$head \leftarrow body_1, \cdots, body_n$$
Each $body_i$ is either (i) a binary predicate $l(src, trg)$ where $l \in \Sigma$ is a label, or (ii) $(l^*(src, trg) \text{ as } d)$, which is a transitive closure over $l(src, trg)$ for a label $l \in \Sigma$, $d \in \Sigma \setminus \phi(E)$, and each head predicate ($head$) is a binary predicate $d(src, trg)$ for a label $d \in \Sigma \setminus \phi(E)$, except the reserved predicate $Answer \notin \Sigma$ of an arbitrary arity (not necessarily binary).
In other words, an RQ is a binary, non-recursive Datalog program extended with the transitive closure of binary predicates where input graph edges with a label $l \in \phi(E_I)$ correspond to instances of the extensional schema (EDB) and derived edges and paths with a label $l \in \Sigma \setminus \phi(E_I)$ correspond to instances of the intensional schema (IDB). EDBs are predicates that appear only on the right-hand side of the rules, which correspond to stored relations in Datalog [4]. Similarly, we define IDBs as predicates that appear in the rule heads, which correspond to output relations in Datalog.
Example 3 (Regular Query).
Consider the real-time notification query in Example 1 and its graph pattern given in Figure 1a. The one-time query for the same graph pattern is represented as the following RQ:
$$RL(u_1, u_2) \leftarrow l(u_1, m_1), f(u_1, u_2), p(u_2, m_1)$$
$$Notify(u, m) \leftarrow RL^+(u, u_2) \text{ as } RLP, p(u_2, m)$$
$$Answer(u, m) \leftarrow Notify(u, m)$$
where predicates $l, f, p, RL, RLP$ represent labels likes, follows, posts, recent liker and recent liker path, respectively.
[Footnote: We say a Datalog program is recursive iff its dependency graph contains cycles. The dependency graph of a Datalog program is a directed graph whose vertices are predicates of its rules and edges represent dependencies between predicates, i.e., there is an edge from $p$ to $q$ if $q$ appears in the body of a rule with head predicate $p$.]
In the following, we define the semantics of SGQ using snapshot reducibility. It is known that for many operations such as joins and aggregation, exact results cannot be computed with a finite memory over unbounded streams [11]. In streaming systems, a general solution for bounding the space requirement is to evaluate queries on a window of data from the stream. In a large number of applications, focusing on the most recent data is highly desirable. The windowed evaluation model not only provides a tool to process unbounded streams with bounded memory but also restricts the scope of queries to recent data, a desired feature in many streaming applications. Additionally, as opposed to streaming approximation techniques that trade off exact answers in favour of bounding the space requirements, window-based query evaluation enables us to provide exact query answers w.r.t. window specifications. Hence, we adopt the time-based sliding window for streaming graph queries in the rest of the paper.
Definition 15 (Streaming Graph Query (SGQ)). A streaming graph query $Q$ is a Regular Query defined over a streaming graph $S$ and a time-based sliding window $\mathcal{W}_T$.
We define the semantics of an SGQ $Q$ using the corresponding one-time Regular Query $Q_O$ and the notion of snapshot reducibility (Definition 13):
$$\forall t \in \mathcal{T}, \quad \tau_t\big(Q(S, \mathcal{W}_T)\big) = Q_O\Big(\tau_t\big(\mathcal{W}_T(S)\big)\Big)$$
In other words, the snapshot of the resulting streaming graph of an SGQ $Q$ at time $t$, i.e., $\tau_t\big(Q(S, \mathcal{W}_T)\big)$, is equivalent to the result of the corresponding one-time query $Q_O$ over the snapshot of the input streaming graph at time $t$, i.e., $Q_O\big(\tau_t(\mathcal{W}_T(S))\big)$. Consequently, the output streaming graph of an SGQ can be obtained from the sequence of snapshots that is the result of a repeated evaluation of the corresponding one-time query at every time instant: an sgt $(u, v, l, [ts, exp), \mathcal{D})$ is in the resulting streaming graph of the SGQ, $\tau_t\big(Q(S, \mathcal{W}_T)\big)$, if there is an edge $e = (u, v, l)$ or a path $p : u \xrightarrow{p} v$ with $l = \phi^p(p)$ in the resulting snapshot graph of the corresponding one-time query $G_t = Q_O\big(\tau_t(\mathcal{W}_T(S))\big)$ for $ts \leq t < exp$.
The specification of an SGQ is built upon that of a one-time RQ, with an additional window specification that slides over the input graph stream. For instance, the SGQ for the real-time notification application in Example 1 is defined as the RQ given in Example 3 with a window specification of 24 hours over the logical partitioning of its input graph stream $S = S_{I,likes} \cup S_{I,posts} \cup S_{I,follows}$.
In this section, we present the logical foundation of our streaming graph query processing framework. We first introduce the streaming graph algebra (SGA) and the semantics of its operators (Section 5.1). We subsequently describe how to transform a given SGQ (Definition 15) into its canonical SGA expression and illustrate logical query plans (Section 5.2), and discuss its closedness and composability (Section 5.3). Finally, we present transformation rules holding in our algebra (Section 5.4).
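The snapshot-reducible semantics of Definition 15 can be illustrated by a naive reference evaluator that, for a given time instant, windows the input stream, takes the snapshot, and runs the one-time Regular Query of Example 3 over it. The sketch below is only a specification-style illustration in Python, not the incremental evaluation strategy proposed in the paper. The function names, the sample stream `STREAM` (loosely modelled on the kind of edges in Figure 2, with label assignments chosen for illustration) and the integer time unit are assumptions.

```python
from typing import Iterable, Set, Tuple

Sge = Tuple[str, str, str, int]  # (src, trg, label, t), as in Definition 3


def transitive_closure(pairs: Set[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Naive fixpoint computation of the one-or-more-steps closure (RL+)."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in pairs if b == c} - closure
        if not new:
            return closure
        closure |= new


def notify_one_time(edges: Set[Tuple[str, str, str]]) -> Set[Tuple[str, str]]:
    """One-time evaluation of the three rules of Example 3 on (src, label, trg) edges."""
    likes = {(s, d) for (s, l, d) in edges if l == "likes"}
    follows = {(s, d) for (s, l, d) in edges if l == "follows"}
    posts = {(s, d) for (s, l, d) in edges if l == "posts"}
    # RL(u1, u2) <- likes(u1, m1), follows(u1, u2), posts(u2, m1)
    rl = {(u1, u2) for (u1, m1) in likes for (u2, m2) in posts
          if m1 == m2 and (u1, u2) in follows}
    rlp = transitive_closure(rl)  # RL+ as RLP
    # Notify(u, m) <- RLP(u, u2), posts(u2, m); Answer simply projects Notify.
    return {(u, m) for (u, u2) in rlp for (u3, m) in posts if u2 == u3}


def answer_at(stream: Iterable[Sge], window: int, t: int) -> Set[Tuple[str, str]]:
    """Snapshot at time t of the windowed stream, then the one-time query."""
    snap = {(s, l, d) for (s, d, l, ts) in stream if ts <= t < ts + window}
    return notify_one_time(snap)


# Hypothetical input stream; timestamps are integers in an assumed hourly unit.
STREAM = [("u", "v", "follows", 7), ("v", "b", "posts", 10),
          ("y", "u", "follows", 13), ("v", "c", "posts", 17),
          ("u", "a", "posts", 22), ("y", "a", "likes", 28),
          ("u", "b", "likes", 29), ("u", "c", "likes", 30)]
```

With a window of 24 time units, calling `answer_at(STREAM, 24, t)` for successive values of `t` shows results appearing as edges arrive and disappearing as edges expire, which is the behaviour an incremental query processor has to reproduce without re-evaluating the query from scratch.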
Inputs to each SGA operator are one or more logically partitioned streaming graphs $S_a$ (Definition 9), where each $S_a$ contains sgts that represent edges or paths with the same label $a \in \Sigma$. The output of each operator is also a streaming graph $S_o$ where each sgt has the label $o \in \Sigma \setminus \phi(E_I)$. In the remainder, we formally define the semantics of each operator. The possible physical implementations of these operators are discussed in Section 6. SGA contains the following operators: windowing (Definition 16), filter (Definition 17), union (Definition 18), subgraph pattern (Definition 19), and path navigation (Definition 20).

[Figure 3, panels (a) $S_{RL}$ and (b) $S_{RLP}$]
Figure 3: (a) Resulting streaming graph of a subgraph pattern operator for the diamond graph pattern of Example 1 on the streaming graph given in Figure 2, and (b) resulting streaming graph of a path navigation operator for the same query over the streaming graph given in Figure 3a.

Definition 16 (WSCAN). The windowing operator $\mathcal{W}$ transforms a given input graph stream $S_I$ to a streaming graph $S$ by adjusting the validity interval of each sgt based on the window size $\mathcal{T}$ and the optional slide interval $\beta$, i.e.,

$$\mathcal{W}_{\mathcal{T},\beta}(S_I) := S : \big[\, (u, v, l, [t, exp), \mathcal{D} : e(u, v, l)) \;\big|\; (u, v, l, t) \in S_I \wedge exp = \lfloor t / \beta \rfloor \cdot \beta + \mathcal{T} \,\big].$$

The window size $\mathcal{T}$ determines the length of the validity interval of sgts and the slide interval $\beta$ controls the granularity at which the time-based sliding window progresses, as defined in [10, 62]. In case $\beta$ is not provided, it can be treated as $\beta = 1$, i.e., a single time instant with the smallest granularity, and it defines a sliding window that progresses every time instant.

The WSCAN operator precisely defines the semantics of time-based sliding windows. Informally, WSCAN acts as an interface between the source and the query plans and it is responsible for providing data from input graph streams to the system, similar to the scan operator in relational systems. WSCAN manipulates the implicit temporal attribute and associates a time interval to each sgt representing its validity. Our model of representing streaming graphs (Definition 8) provides a concise representation of validity intervals and enables operators to treat time differently than the data stored in the graph. This enables us to distinguish operator semantics from window semantics and eliminates the redundancy caused by integrating sliding window constructs into each operator of the algebra. SGA operators access and manipulate validity intervals implicitly, thus they generalize their non-streaming counterparts with implicit handling of time.
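To make the interval assignment of Definition 16 concrete, the following is a minimal Python sketch, not the paper's implementation. The names `SGT`, `wscan` and `snapshot`, the integer timestamps, and the sample edges are all assumptions made for illustration; the payload attribute is omitted for brevity.

```python
from typing import Iterable, Iterator, NamedTuple, Set, Tuple

class SGT(NamedTuple):
    """Hypothetical streaming graph tuple: an edge plus its validity interval [ts, exp)."""
    src: str
    trg: str
    label: str
    ts: int
    exp: int

def wscan(edges: Iterable[Tuple[str, str, str, int]], window: int, slide: int = 1) -> Iterator[SGT]:
    """Assign exp = floor(t / slide) * slide + window to every input edge (u, v, l, t)."""
    for src, trg, label, t in edges:
        exp = (t // slide) * slide + window
        yield SGT(src, trg, label, t, exp)

def snapshot(sgts: Iterable[SGT], t: int) -> Set[Tuple[str, str, str]]:
    """Edges that are valid at time instant t, i.e. ts <= t < exp."""
    return {(s.src, s.trg, s.label) for s in sgts if s.ts <= t < s.exp}

# Made-up input graph stream: (source, target, label, timestamp).
stream = [("x", "y", "likes", 3), ("y", "z", "follows", 7), ("x", "z", "likes", 12)]
windowed = list(wscan(stream, window=10, slide=5))
print(windowed)
print(snapshot(windowed, 9))
```

Because the window only rewrites the validity interval, every downstream operator in this sketch can ignore the window specification and work on `[ts, exp)` alone, which is the separation the paragraph above describes.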
Example 4. Consider the real-time notification query of Example 1 that specifies a 24-hour window of interest. WSCAN $\mathcal{W}_{24}$ adjusts the validity intervals of sges of the input graph stream and produces a streaming graph where each sgt is valid for 24 hours, as shown in Figure 2.

($\phi(E_I) \subset \Sigma$ is reserved for input graph edges and cannot be used by operators as labels for resulting sgts. In other words, $\phi(E_I) \subset \Sigma$ corresponds to EDBs in Datalog as described in Section 4.)

Two special cases of windows are commonly adopted in the literature: NOW windows that capture only the current time by assigning the interval $[t, t+1)$ to a tuple with timestamp $t$, and UNBOUNDED windows that capture the entire history of the streaming graph by assigning the interval $[t, \infty)$ to a tuple with timestamp $t$ [10, 44]. Applied over a streaming graph, NOW windows capture all tuples that emerge at the current time instant and produce the identity, whereas UNBOUNDED windows accumulate all tuples that have appeared in the streaming graph so far. We use NOW windows whenever a window specification is omitted from a given SGQ (Definition 15).

Definition 17 (FILTER). The filter operator $\sigma_{\Phi}(S)$ is defined over a streaming graph $S$ and a boolean predicate $\Phi$ involving the distinguished attributes of sgts, and its output stream consists of the sgts of $S$ on which $\Phi$ evaluates to true. Formally:

$$\sigma_{\Phi}(S) = \big[\, (u, v, l, [ts, exp), \mathcal{D}) \;\big|\; (u, v, l, [ts, exp), \mathcal{D}) \in S \wedge \Phi((u, v, l, \mathcal{D})) \,\big].$$

Definition 18 (UNION). Union $\cup_{[d]}$ with an optional output label parameter $d \in \Sigma \setminus \phi(E_I)$ merges the sgts of two streaming graphs $S_1$ and $S_2$, and assigns the new label $d$ if provided. Formally:

$$S_1 \cup_{[d]} S_2 = \big[\, t \;\big|\; t \in S_1 \vee t \in S_2 \,\big].$$

Definition 19 (PATTERN). The streaming subgraph pattern operator is defined as $\bowtie^{src,trg,d}_{\Phi}(S_{l_1}, \cdots, S_{l_n})$ where each $S_{l_i}$ is a streaming graph, $\Phi$ is a conjunction of a finite number of terms of the form $pos_i = pos_j$ for $pos_i, pos_j \in \{src_1, trg_1, \cdots, src_n, trg_n\}$ where $src_i, trg_i$ correspond to the source and target of sgts in $S_{l_i}$, $src, trg \in \{src_1, trg_1, \cdots, src_n, trg_n\}$ represent the source and target of the resulting sgts, and $d \in \Sigma \setminus \phi(E_I)$ represents the label of the resulting sgts. Formally:

$$
\begin{aligned}
\bowtie^{src,trg,d}_{\Phi}(S_{l_1}, \cdots, S_{l_n}) = \big[\, & (u, v, d, [ts, exp), \mathcal{D} : e(u, v, l)) \;\big| \\
& \exists\, t_i = (src_i, trg_i, l_i, [ts_i, exp_i), \mathcal{D}_i) \in S_{l_i},\; 1 \le i \le n \\
& \wedge\; \Phi((src_1, trg_1, \cdots, src_n, trg_n)) \;\wedge\; u = src \;\wedge\; v = trg \\
& \wedge\; \bigcap_{1 \le i \le n} [ts_i, exp_i) \ne \emptyset \\
& \wedge\; ts = \max_{1 \le i \le n}(ts_i) \;\wedge\; exp = \min_{1 \le i \le n}(exp_i) \,\big].
\end{aligned}
$$

Given a subgraph pattern expressed as a conjunctive query, PATTERN finds a mapping from vertices in the stream to free variables where (i) all query predicates hold over the mapping, and (ii) there exists a time instant at which each edge in the mapping is valid.
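As an illustration of Definition 19, the sketch below evaluates the recent-liker pattern of the running example with a naive nested-loop join. It is an assumed, non-incremental toy and not the operator the paper implements: the function name `pattern_rl`, the tuple layout `(src, trg, ts, exp)` and the sample data are hypothetical. It shows only how the join predicate and the interval conditions (non-empty intersection, `ts = max`, `exp = min`) combine.

```python
from itertools import product

def pattern_rl(likes, posts, follows):
    """Join likes(u1, m1), posts(u2, m1), follows(u1, u2) into RL(u1, u2) tuples.

    Each input tuple is (src, trg, ts, exp) with validity interval [ts, exp).
    """
    out = []
    for (s1, t1, ts1, e1), (s2, t2, ts2, e2), (s3, t3, ts3, e3) in product(likes, posts, follows):
        # Predicate: trg1 = trg2 (same message), src1 = src3 and src2 = trg3 (follow edge).
        if t1 == t2 and s1 == s3 and s2 == t3:
            ts, exp = max(ts1, ts2, ts3), min(e1, e2, e3)
            if ts < exp:  # the three validity intervals share at least one time instant
                out.append((s1, s2, "RL", ts, exp))
    return out

# Made-up windowed inputs.
likes = [("u", "m", 5, 15)]
posts = [("v", "m", 2, 12)]
follows = [("u", "v", 8, 18), ("u", "v", 13, 23)]
result = pattern_rl(likes, posts, follows)
print(result)
```

The second `follows` tuple satisfies the join predicate but starts after the `posts` tuple has expired, so it contributes no result; this is condition (ii) of the definition.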
Example 5. Consider the real-time notification query given in Example 1; the recent liker relationship defined in the form of a triangle pattern can be represented with PATTERN $\bowtie^{src_1, src_2, RL}_{\phi}$ where $\phi = (trg_1 = trg_2 \wedge src_1 = src_3 \wedge src_2 = trg_3)$. Figure 3a shows its output over the streaming graph given in Figure 2. It consists of sgts (y, RL, u, [ , ), (y, RL, u)) and (u, RL, v, [ , ), (u, RL, v)) that correspond to derived edges with label recent liker.

SGA operators may produce multiple value-equivalent sgts with adjacent or overlapping validity intervals. Unless otherwise specified, the coalesce primitive (Definition 11) is applied to their outputs to maintain the set semantics of streaming graphs and their snapshots. To illustrate, consider PATTERN in the above example: over the streaming graph given in Figure 2, the PATTERN operator finds two distinct subgraphs with vertices $(u, b, v)$ and $(u, c, v)$. Consequently, it produces two value-equivalent tuples (u, RL, v, [ , ), (u, RL, v)) and (u, RL, v, [ , ), (u, RL, v)), which are coalesced into a single sgt by merging their validity intervals, as shown in Figure 3a.

Definition 20 (PATH). The streaming path navigation operator is defined as $\mathcal{P}^{d}_{R}(S_{l_1}, \cdots, S_{l_n})$ where $R$ is a regular expression over the alphabet $\{l_1, \cdots, l_n\} \subseteq \Sigma$, and $d \in \Sigma \setminus \phi(E_I)$ designates the label of the resulting sgts. The streaming graph tuple $t = (u, v, l, [ts, exp), \mathcal{D} : p)$ is an answer for a path navigation query $\mathcal{P}^{l}_{R}$ if there exists a path $p$ between $u$ and $v$ in the snapshot of $S$ at time $t$, i.e., $p : u \xrightarrow{p} v \in \tau_t(S) = G_t$, and the label sequence of the path $p$, $\phi^p(p)$, is a word in the regular language $L(R)$. Formally:

$$
\begin{aligned}
\mathcal{P}^{d}_{R}(S_{l_1}, \cdots, S_{l_n}) = \big[\, & (u, v, d, [ts, exp), \mathcal{D}) \;\big|\; \exists\, p : u \xrightarrow{p} v \\
& \wedge\; \forall e_i \in p,\ \exists\, t_i = (src_i, trg_i, l_i, [ts_i, exp_i), \mathcal{D}_i) \in S_{l_i} \\
& \wedge\; \phi^p(p) \in L(R) \;\wedge\; \bigcap_{t \in p} [t.ts, t.exp) \ne \emptyset \\
& \wedge\; ts = \max_{t \in p}(t.ts) \;\wedge\; exp = \min_{t \in p}(t.exp) \;\wedge\; \mathcal{D} = p \,\big].
\end{aligned}
$$

PATH finds pairs of vertices that are connected by a path where (i) each edge in the path is valid at the same time instant, and (ii) the path label is a word in the regular language defined by the query. This closely follows the Regular Path Query (RPQ) model where path constraints are expressed using a regular expression over the set of labels [78]. Path navigation queries in the RPQ model are evaluated under arbitrary and simple path semantics. The former allows a path to traverse the same vertex multiple times, whereas under the latter semantics a path cannot traverse the same vertex more than once [8, 12, 78]. In this paper, we adopt the arbitrary path semantics due to its widespread adoption in modern graph query languages [7, 8, 75], and the tractability of the corresponding evaluation problem [12].
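The following Python sketch illustrates RPQ evaluation under arbitrary path semantics on a single snapshot graph, using a breadth-first traversal of the product of the graph and a DFA. It is a generic textbook-style illustration under assumed names (`rpq_pairs`, the dictionary-encoded DFA, the sample edges), not the streaming operator defined in this paper, and it ignores validity intervals.

```python
from collections import deque

def rpq_pairs(edges, delta, start, finals):
    """Return all (x, v) such that some path from x to v spells a word accepted by the DFA.

    edges:  iterable of (source, label, target) triples of one snapshot graph
    delta:  DFA transition function as a dict {(state, label): state}
    """
    adj = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))
    vertices = {u for u, _, _ in edges} | {v for _, _, v in edges}
    answers = set()
    for x in vertices:
        seen = {(x, start)}            # visited (vertex, state) pairs for this root
        queue = deque(seen)
        while queue:
            u, s = queue.popleft()
            for label, v in adj.get(u, []):
                t = delta.get((s, label))
                if t is None:
                    continue
                if t in finals:
                    answers.add((x, v))
                if (v, t) not in seen:  # vertices may repeat, (vertex, state) pairs may not
                    seen.add((v, t))
                    queue.append((v, t))
    return answers

# DFA for RL+ : state 0 --RL--> state 1 --RL--> state 1, accepting state 1.
delta = {(0, "RL"): 1, (1, "RL"): 1}
pairs = rpq_pairs([("y", "RL", "u"), ("u", "RL", "v")], delta, 0, {1})
print(sorted(pairs))
```

Tracking visited (vertex, state) pairs rather than visited vertices is what keeps the traversal finite on cyclic graphs while still allowing a path to pass through the same vertex more than once.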
Example 6. Consider the same running example given in Figure 1; the path navigation over the derived recent liker edges is represented by PATH $\mathcal{P}^{RLP}_{RL^{+}}$. Figure 3b shows its resulting streaming graph over the output streaming graph (Figure 3a) of the PATTERN of Example 5. It consists of sgts (y, RLP, u, [ , ), (y, RL, u)), (u, RLP, v, [ , ), (u, RL, v)), and (y, RLP, v, [ , ), ⟨(y, RL, u), (u, RL, v)⟩) that correspond to materialized paths with label recent liker path of length one and two.

Most existing work on the RPQ model focuses on the problem of determining reachability between pairs of vertices in a graph connected by a path conforming to a given regular expression [43, 50, 57, 62]. By adapting the materialized path graph model (Definition 6), we pinpoint that PATH is equipped with the ability to return paths, i.e., each resulting sgt contains the actual sequence of edges that form the path with a label sequence conforming to the given regular expression.

SGA builds on the Regular Property Graph Algebra (RPGA) [16], which is itself based on Regular Queries (RQ). Of course, both RPGA and RQ formulate graph queries over static property graphs, while SGA allows persistent graph queries over streaming property graphs. SGA operators are defined over streaming graphs (Definition 8), and they access and manipulate validity intervals implicitly, thus they generalize their non-streaming counterparts with implicit handling of time. This follows from the fact that the semantics of SGA operators are defined through snapshot reducibility (Definition 13), that is, the snapshot of the result of a streaming operator on a streaming graph $S$ at time $t$ is equal to the result of the corresponding non-streaming operator on the snapshot of the streaming graph $S$ at time $t$.

SGA can express all queries that can be specified by SGQ (Section 4).
In this section, we describe a set of rules that can be applied to convert an SGQ to an SGA expression.

Given an SGQ $Q(S, \mathcal{W}_{\mathcal{T}})$ and the label-based logical partitioning (Definition 9) of its input streaming graph $S = \bigcup_{l \in \Sigma}(S_l)$, Algorithm SGQParser produces the canonical SGA expression. In brief, Algorithm SGQParser processes the predicates of a given SGQ and generates the corresponding SGA expression in a bottom-up manner: each EDB $l$ corresponds to a WSCAN over an input streaming graph $S^I_l$, each application of transitive closure corresponds to a PATH, and each IDB $d$ corresponds to a UNION or PATTERN based on the body of the corresponding rule.

Algorithm SGQParser:
  input : Streaming Graph Query $Q(S, \mathcal{W}_{\mathcal{T}})$
  output: SGA Expression $e$
   1  $G_Q \leftarrow$ Graph($Q$)                          // dependency graph
   2  $[r_1, \cdots, r_n] \leftarrow$ TopSort($G_Q$)        // topological sort
   3  $exp \leftarrow []$                                   // empty mapping
   4  for $1 \le i \le n$ do
   5      switch $r_i$ do
   6          case $l(src, trg)$, $l \in \phi(E_I)$ do
   7              $exp[l] \leftarrow \mathcal{W}_{\mathcal{T}}(S_l)$
   8          case $l^{*}(x, y)$ as $d$ do
   9              $exp[d] \leftarrow \mathcal{P}^{d}_{l^{*}}(exp[l])$
  10          otherwise do
  11              $d(src, trg) \leftarrow r_i.head$, $[b_1, \cdots, b_n] \leftarrow r_i.body$
  12              $\Phi \leftarrow$ GenPred($r_i.body$)
  13              $e \leftarrow \bowtie^{src,trg,d}_{\Phi}(exp[b_1], \cdots, exp[b_n])$
  14              if $exp[d] \ne \emptyset$ then
  15                  $exp[d] \leftarrow exp[d] \cup e$
  16              else
  17                  $exp[d] \leftarrow e$
  18  return $exp[Answer]$

Theorem 7.
Streaming graph queries (SGQ) can be expressed in the proposed SGA, i.e., there exists an SGA expression $e \in SGA$ for any $Q \in SGQ$.

Proof. We prove that Algorithm SGQParser is guaranteed to terminate and to create a canonical SGA expression for a given SGQ $Q$. The dependency graph of an RQ is acyclic as RQs are non-recursive (Definition 14); hence, Line 2 is guaranteed to define a partial order over $Q$'s predicates. Algorithm SGQParser generates an SGA expression for each predicate in this order and caches it in the $exp$ array. In particular, Line 7 generates an SGA expression for each EDB predicate and Line 9 generates a PATH expression for each body predicate with a Kleene star. For each rule $d(src, trg) := l_1(src_1, trg_1), \cdots, l_n(src_n, trg_n)$, Line 13 generates a PATTERN expression. Finally, Line 14 generates a UNION expression if there are multiple rules with the same head predicate $d$. As each predicate is processed based on the partial order defined by the dependency graph $G_Q$ (Line 4), $exp$ is guaranteed to have SGA expressions for each predicate $r_j$, $1 \le j \le i$, when processing the predicate $r_i$. Once all predicates are processed, Line 18 returns the SGA expression of the reserved $Answer$ predicate. Hence, we conclude that Algorithm SGQParser correctly constructs an SGA expression for a given SGQ. □

The complexity of evaluating SGA expressions is the same as that of RQ given their relationship as noted above: it is NP-complete in combined complexity and NLogspace-complete in data complexity [16, 66].
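The bottom-up translation of Algorithm SGQParser can be sketched in a few lines of Python. This is an illustrative assumption-laden toy, not the authors' parser: the rule encoding (dictionaries with hypothetical keys `kind`, `label`, `as`, `head`, `body`, `phi`), the string rendering of operators, and the `answer` parameter are invented here, and the rules are assumed to arrive already topologically sorted. The sample rules mirror the shape of the canonical expression for the running example.

```python
def sgq_to_sga(rules, window, answer="Answer"):
    """Translate topologically sorted rules into a nested SGA expression string."""
    exp = {}
    for rule in rules:
        if rule["kind"] == "edb":        # input label: WSCAN over its partition
            exp[rule["label"]] = f"W_{window}(S_{rule['label']})"
        elif rule["kind"] == "closure":  # transitive closure: PATH over the cached expression
            exp[rule["as"]] = f"PATH[{rule['as']}, {rule['label']}+]({exp[rule['label']]})"
        else:                            # ordinary rule: PATTERN, plus UNION on a repeated head
            args = ", ".join(exp[b] for b in rule["body"])
            e = f"PATTERN[{rule['head']}, {rule['phi']}]({args})"
            head = rule["head"]
            exp[head] = f"UNION({exp[head]}, {e})" if head in exp else e
    return exp[answer]

rules = [
    {"kind": "edb", "label": "l"},
    {"kind": "edb", "label": "p"},
    {"kind": "edb", "label": "f"},
    {"kind": "closure", "label": "f", "as": "FP"},
    {"kind": "rule", "head": "RL", "body": ["l", "p", "FP"],
     "phi": "trg1=trg2 & src1=src3 & src2=trg3"},
    {"kind": "closure", "label": "RL", "as": "RLP"},
    {"kind": "rule", "head": "Notify", "body": ["RLP", "p"], "phi": "trg1=src2"},
]
expr = sgq_to_sga(rules, window=24, answer="Notify")
print(expr)
```

Caching each predicate's expression in `exp` is what lets a later rule reuse an earlier one, which is also why the processing order must respect the dependency graph.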
Example 8 (Canonical Translation). Consider the real-time notification task in Example 1 and its corresponding RQ given in Example 3. Algorithm SGQParser generates the following canonical SGA expression for its corresponding SGQ with a sliding window of 24 hours:

$$\bowtie^{(src_1, trg_2, notify)}_{\phi_2}\bigg( \mathcal{P}^{RLP}_{RL^{+}}\Big( \bowtie^{src_1, src_2, RL}_{\phi_1}\big( \mathcal{W}_{24}(S_l),\ \mathcal{W}_{24}(S_p),\ \mathcal{P}^{FP}_{f^{+}}\big(\mathcal{W}_{24}(S_f)\big) \big) \Big),\ \mathcal{W}_{24}(S_p) \bigg)$$

$$\phi_1 = (trg_1 = trg_2 \wedge src_1 = src_3 \wedge src_2 = trg_3), \qquad \phi_2 = (trg_1 = src_2)$$

Figure 4a illustrates the logical plan for the same SGQ that consists of logical operators of SGA.

[Figure 4, panels (a) logical plan for Example 8 and (b) join tree for PATTERN]
Figure 4: (a) Logical query plan for the real-time notification task (Example 1) based on its canonical SGA expression (Example 8), and (b) the join tree that consists of binary hash joins for PATTERN in (a).

SGA operators proposed in this paper are closed over streaming graphs, that is, the output of an SGA operator is a valid streaming graph if its inputs are valid streaming graphs. The fact that SGA is a closed query language over the streaming graph model (Section 3) means that SGA queries (expressions) are composable, i.e., the output of one query (expression) can be used as input of another query (expression).

Given Theorem 7, the SGQ language is also closed: each query takes one or more streaming graphs as input and produces a streaming graph as output. Furthermore, it is composable as the output of a query can be the input of the subsequent query. This is in contrast to existing graph query languages such as SPARQL and Cypher that are not composable and may not be closed. Cypher 9 requires graphs as input but produces tables as output, so the language is neither closed nor composable: the results of a Cypher query cannot be used as input to a subsequent one without additional processing. SPARQL can generate graphs as output using the CONSTRUCT clause, and is therefore closed; however, it requires query results to be made persistent and is therefore not easily composable [16]. Closedness should be a required property of any algebra as it enables query rewriting (see next section) and query optimization. Composability is a desired feature for a declarative query language as it facilitates query decomposition, view-based query evaluation, query rewriting, etc.
As noted above, closedness of an algebra is important for query rewriting to explore the space of equivalent relational algebra expressions; this is a key component of query optimization. Although we do not address the optimization problems in this paper (that is future work), we briefly highlight some of the possible transformation rules for generating equivalent query plans to demonstrate the possibilities provided by SGA.

Some of the traditional relational transformation strategies such as join ordering and predicate push-down are applicable to UNION, FILTER and PATTERN due to snapshot reducibility. UNION and FILTER operators are streaming generalizations of the corresponding relational union and selection operators, and the PATTERN operator can be represented using a series of equijoins.

Below, we describe transformation rules involving the other novel SGA operators:

Transformation Rules Involving WSCAN: The WSCAN operator $\mathcal{W}_{\mathcal{T}}$ commutes with operators that do not alter the validity intervals of sgts, i.e., UNION and FILTER. Pushing FILTER down the WSCAN operator can potentially reduce the rate of sgts and consequently the amount of state the windowing operator needs to maintain. Formally:
(1) $\mathcal{W}_{\mathcal{T}}(\sigma_{\phi}(S)) = \sigma_{\phi}(\mathcal{W}_{\mathcal{T}}(S))$
(2) $\mathcal{W}_{\mathcal{T}}(S_1 \cup S_2) = \mathcal{W}_{\mathcal{T}}(S_1) \cup \mathcal{W}_{\mathcal{T}}(S_2)$

Transformation Rules Involving PATH: We identify two transformation rules of the PATH operator for regular expressions with alternation and concatenation:
(1) Alternation: $\mathcal{P}^{d}_{a \mid b}(S_a, S_b) = \bigcup_{d}(S_a, S_b)$
(2) Concatenation: $\mathcal{P}^{d}_{a \cdot b}(S_a, S_b) = \bowtie^{src_1, trg_2, d}_{trg_1 = src_2}(S_a, S_b)$

These transformation rules enable the exploration of a rich plan space for SGQs that are represented in the proposed SGA. In particular, the novel PATH operator and its transformation rules enable the integration of existing approaches for RPQ evaluation with standard optimization techniques such as join ordering and pushing down selection in a principled manner. Traditionally, path query evaluation follows two main approaches: graph traversals guided by finite automata, or relational algebra extended with transitive closure, i.e., $\alpha$-RA [12, 27, 43, 62, 68, 79]. Yakovets et al. introduce a hybrid approach (Waveguide) and model the cost factors that impact the efficiency of RPQ evaluation on static graphs [79]. SGA enables the representation of these approaches in a uniform manner, and the above transformation rules enable us to explore a plan space that subsumes these existing plans. Section 7.4 shows an example application of these transformation rules and presents a micro-benchmark demonstrating the potential benefits of query optimizations through the exploration of the rich plan space due to SGA's transformation rules.
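The two WSCAN rules can be checked on a toy stream with the Python sketch below. The helper names (`wscan`, `select`, `union`), the list-based stream representation, and the sample edges are assumptions for illustration only; the check exercises a single example and is not a proof of the equivalences.

```python
def wscan(edges, window, slide=1):
    """Attach the validity interval [t, exp) with exp = floor(t / slide) * slide + window."""
    return [(u, v, l, t, (t // slide) * slide + window) for (u, v, l, t) in edges]

def select(tuples, predicate):
    """FILTER on the distinguished attributes (src, trg, label); time is left untouched."""
    return [x for x in tuples if predicate(x[0], x[1], x[2])]

def union(s1, s2):
    """UNION without relabeling."""
    return s1 + s2

def is_like(u, v, label):
    return label == "likes"

s1 = [("x", "m", "likes", 3), ("y", "x", "follows", 4)]
s2 = [("y", "m", "likes", 9)]

# Rule (1): windowing a filtered stream equals filtering the windowed stream.
lhs1 = wscan(select(s1, is_like), window=10, slide=5)
rhs1 = select(wscan(s1, window=10, slide=5), is_like)

# Rule (2): windowing distributes over union.
lhs2 = wscan(union(s1, s2), window=10, slide=5)
rhs2 = union(wscan(s1, window=10, slide=5), wscan(s2, window=10, slide=5))

print(lhs1 == rhs1, lhs2 == rhs2)
```

Both rules hold here because `select` and `union` never read or modify the interval that `wscan` attaches, which is the condition stated in the text.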
This section introduces a physical operator algebra for SGA that consists of non-blocking algorithms as physical operator implementations. The intuition behind our physical algebra is the following: in the absence of explicit deletions, expirations from windows and result sets exhibit a temporal pattern in the way that streaming graph tuples are inserted. We show that utilizing this pattern enables novel algorithms for query processing and window maintenance. We also describe how to handle explicit deletions using negative tuples for applications that require explicit deletions of previously inserted edges.

As noted earlier, our focus is on incremental evaluation of persistent queries where the goal is to avoid re-computation of the entire result by only computing the changes to the output as new input arrives. It is desired that physical operators of a streaming system have non-blocking behaviour, i.e., they do not need the entire input to be available before producing the first result. Operators with blocking behaviour, such as set difference and nested loop joins, can be unblocked by restricting their range to a finite subset of an unbounded stream [11, 31]. Hence, sliding windows that are used to restrict the scope of queries to recent data, a desired feature in many streaming applications, provide a tool for incremental computation of such operators over unbounded streams. The algorithms we describe here constitute non-blocking, incremental implementations of operators in our streaming graph algebra consistent with the semantics (Section 5.1). Of course, these are not the only physical operator implementations that are possible for SGA; other implementations can possibly be developed, and these are exemplars to demonstrate the implementability of the SGA operators. They are also what we use in the experiments.

The standard implementations of the stateless operators FILTER and UNION can be directly used in SGA. Similarly, WSCAN is a stateless operator that can be implemented using the standard map operator that adjusts the validity interval of an sgt according to the given window specification.

We focus on the stateful operators PATTERN and PATH that need to maintain an internal operator state that is accessed during query processing. This state is updated as new sgts enter the window and old sgts expire. As discussed earlier, time-based sliding windows ensure that the portion of the input that may contribute to any future result is finite, making incremental, non-blocking computation possible.
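To make the notion of operator state concrete, here is a minimal Python sketch of a non-blocking binary join that keeps one hash table per input and derives result validity intervals from its inputs, in the spirit of a symmetric hash join. The class name `SymmetricHashJoin`, its methods, and the sample tuples are assumptions for illustration; this is not the operator used in the paper's prototype.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Two-input equijoin over tuples that carry a validity interval [ts, exp)."""

    def __init__(self):
        # One hash table per input side: join key -> [(payload, ts, exp)].
        self.tables = (defaultdict(list), defaultdict(list))

    def insert(self, side, key, payload, ts, exp):
        """Store the arriving tuple, probe the other side, and return new join results."""
        self.tables[side][key].append((payload, ts, exp))
        results = []
        for other, ots, oexp in self.tables[1 - side].get(key, []):
            lo, hi = max(ts, ots), min(exp, oexp)
            if lo < hi:  # both inputs are valid at some common time instant
                pair = (payload, other) if side == 0 else (other, payload)
                results.append((pair, lo, hi))
        return results

    def purge(self, now):
        """Drop stored tuples whose validity interval has ended by time `now`."""
        for table in self.tables:
            for key in list(table):
                table[key] = [e for e in table[key] if e[2] > now]
                if not table[key]:
                    del table[key]

join = SymmetricHashJoin()
first = join.insert(0, "m", "u", 5, 15)    # side 0: u likes message m, valid in [5, 15)
second = join.insert(1, "m", "v", 8, 18)   # side 1: v posts message m, valid in [8, 18)
print(first, second)
```

Each result carries the intersection of its inputs' intervals, so a consumer can tell when the result expires without the join having to process the expiration of the input tuples explicitly.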
The implementation of PATTERN models subgraph patterns as conjunctive queries that can be evaluated using a series of joins. There is a rich literature of streaming join implementations that can be used. Symmetric hash join [77] is commonly used to implement non-blocking joins in the streaming model: a hash table is built for each input stream and upon arrival (expiration) of a tuple, it is inserted into (removed from) its corresponding hash table and the other tables are probed for matches [32, 74]. This produces an append-only stream of results for internal windows that do not invalidate previously reported results upon expiration of their participating tuples [30]. For external windows that require eviction of old results as the windows slide forward, expired results can be determined by maintaining expiration timestamps: a join result expires when one of its participating tuples expires. Our use of validity intervals enables the user or the application to adopt both window semantics without the need for explicit processing of expired input tuples.

Given a subgraph pattern, we take the standard approach of creating a binary join tree where leaves represent streaming graphs as input streams and internal nodes represent pipelined hash join operators. For instance, Figure 4b shows the join tree for the PATTERN of the SGQ $Q$ whose logical plan is given in Figure 4a. We leave the problem of finding efficient join plans (e.g., using worst-case optimal joins [60]) for subgraph pattern queries over streaming graphs as future work and use the ordering of subgraph pattern predicates given in the query.

The semantics of
PATH follows the RPQ model, which specifies path constraints as a regular expression over the alphabet of edge labels and checks whether a path exists with a label that satisfies the given regular expression [12, 57]. We propose the Streaming Path Navigation (S-PATH) algorithm as a physical operator implementation of PATH. S-PATH follows the automata-based RPQ evaluation strategy [57, 62] using the arbitrary path semantics as discussed in Section 5.1.

Algorithm S-PATH:
  input : Input streaming graph $S$, regular expression $R$, output label $o$
  output: Output streaming graph $S_O$
   1  $A(S, \Sigma, \delta, s_0, F) \leftarrow$ ConstructDFA($R$)
   2  Initialize Δ-PATH
   3  $S_O \leftarrow \emptyset$
   4  $\mathcal{R} \leftarrow \emptyset$
   5  foreach $(u, v, l, [ts, exp), \mathcal{D}) \in S$ do
   6      foreach $s, t \in S$ where $t = \delta(s, l)$ do
   7          if $s = s_0 \wedge T_u \notin$ Δ-PATH then
   8              add $T_u$ with root node $(u, s_0)$
   9          $\mathcal{T} \leftarrow$ ExpandableTrees(Δ-PATH, $(u, s)$, $ts$)
  10          foreach $T_x \in \mathcal{T}$ do
  11              if $(v, t) \notin T_x$ then
  12                  $\mathcal{R} \leftarrow \mathcal{R}$ + Expand($T_x$, $(u, s)$, $(v, t)$, $e(u, v)$)
  13              else if $(v, t).exp < min((u, s).exp, exp)$ then
  14                  $\mathcal{R} \leftarrow \mathcal{R}$ + Propagate($T_x$, $(u, s)$, $(v, t)$, $e(u, v)$)
  15      foreach sgt $t \in \mathcal{R}$ do
  16          push $t$ to $S_O$

Algorithm Expand:
  input : Spanning tree $T_x$ rooted at $(x, s_0)$, parent $(u, s)$, child $(v, t)$, edge $e(u, v)$
  output: Set of results $\mathcal{R}$
   1  $\mathcal{R} \leftarrow \emptyset$
   2  Insert $(v, t)$ as $(u, s)$'s child
   3  $(v, t).ts = max(e.ts, (u, s).ts)$
   4  $(v, t).exp = min(e.exp, (u, s).exp)$
   5  if $t \in F$ then
   6      $p \leftarrow$ PATH($T_x$, $(v, t)$)
   7      $\mathcal{R} \leftarrow \mathcal{R} + (x, v, o, [(v, t).ts, (v, t).exp), p)$
   8  foreach edge $e(v, w) \in G_{ts}$ s.t. $\delta(t, \phi(e)) = q$ do
   9      if $(w, q) \notin T_x$ then
  10          $\mathcal{R} \leftarrow \mathcal{R}$ + Expand($T_x$, $(v, t)$, $(w, q)$, $e(v, w)$)
  11      else if $(w, q).exp < min((v, t).exp, e.exp)$ then
  12          $\mathcal{R} \leftarrow \mathcal{R}$ + Propagate($T_x$, $(v, t)$, $(w, q)$, $e(v, w)$)
  13  return $\mathcal{R}$

Algorithm Propagate:
  input : Spanning tree $T_x$ rooted at $(x, s_0)$, parent $(u, s)$, child $(v, t)$, edge $e(u, v)$
  output: Set of results $\mathcal{R}$
   1  $\mathcal{R} \leftarrow \emptyset$
   2  $(v, t).pt = (u, s)$
   3  $(v, t).ts = min((v, t).ts, max(e.ts, (u, s).ts))$
   4  $(v, t).exp = max((v, t).exp, min(e.exp, (u, s).exp))$
   5  if $t \in F$ then
   6      $p \leftarrow$ PATH($T_x$, $(v, t)$)
   7      $\mathcal{R} \leftarrow \mathcal{R} + (x, v, o, [(v, t).ts, (v, t).exp), p)$
   8  foreach edge $e = (v, w) \in G_{ts}$ s.t. $\delta(t, \phi(e)) = q$ do
   9      if $(w, q).exp < min((v, t).exp, e.exp)$ then
  10          $\mathcal{R} \leftarrow \mathcal{R}$ + Propagate($T_x$, $(v, t)$, $(w, q)$, $e(v, w)$)
  11  return $\mathcal{R}$

Algorithm S-PATH incrementally performs a traversal of the underlying snapshot graph under the constraints of a given RPQ as sgts arrive. It first constructs a DFA from the regular expression of a PATH operator, and initializes a spanning forest-based data structure, called Δ-PATH, that is used as the internal operator state during query processing. Δ-PATH is used to maintain a path segment, i.e., a partial result, between each pair of vertices in the form of a spanning forest under the constraints of a given RPQ, consistent with Definition 20. Upon the arrival of an sgt, Algorithm S-PATH probes Δ-PATH to retrieve partial path segments that can be extended with the edge (or a path segment) of the incoming sgt. Each partial path segment is extended with the incoming sgt, and Algorithm
S-PATH traverses the snapshot graph $G_t$ until no further expansion is possible.

Definition 21 (Spanning Tree $T_x$). Given an automaton $A$ for the regular expression $R$ of a PATH operator $\mathcal{P}^{d}_{R}$ and a streaming graph $S$ at time $t$, a spanning tree $T_x$ forms a compact representation of valid path segments that are reachable from the vertex $x \in G_t$ under the constraints of a given RPQ, i.e., a vertex-state pair $(u, s)$ is in $T_x$ at time $t$ if there exists a path $p \in G_t$ from $x$ to $u$ with label $\phi^p(p)$ such that $s = \delta^{*}(s_0, \phi^p(p))$.

A node $(u, s) \in T_x$ indicates that there is a path $p$ in the snapshot graph with label $\phi^p(p)$ such that $s = \delta^{*}(s_0, \phi^p(p))$, and this path can simply be constructed by following parent pointers ($(u, s).pt$) in $T_x$. Under the arbitrary path semantics, there are potentially infinitely many path segments between a pair of vertices that conform to a given RPQ due to the presence of cycles in the snapshot graph and a Kleene star in the given RPQ. Among those, S-PATH materializes the path segment with the largest expiry timestamp, that is, the path segment that will expire furthest in the future. Consequently, for each node $(u, s) \in T_x$, the sequence of vertices in the path from the root node to $(u, s)$ corresponds to the path from $x$ to $u$ in the snapshot graph with the largest expiry timestamp. This is achieved by the coalesce primitive (Definition 11) with an aggregation function max over the expiry timestamp of path segments. (Arbitrary path semantics provides the flexibility for the aggregation function $f_{agg}$ of the coalesce primitive.) Upon expiration of a node $(u, s)$ in $T_x$ and its corresponding path segment in the snapshot graph, this guarantees that there cannot be an alternative path segment between $x$ and $u$ that has not yet expired. Hence, we can directly find expired tuples based on their expiry timestamps. This is based on the observation that expirations have a temporal order unlike explicit deletions, and S-PATH utilizes these temporal patterns to simplify window maintenance.

Definition 22 (Δ-PATH Index). Given an automaton $A$ for the regular expression $R$ of a PATH operator $\mathcal{P}^{d}_{R}$ and a streaming graph $S$ at time $t$, Δ-PATH is a collection of spanning trees (Definition 21) where each tree $T_x$ is rooted at a vertex $x \in G_t$ for which there is an sgt $t \in S(t)$ with a label $l$ such that $\delta(s_0, l) \ne \emptyset$ and $src = x$.

Δ-PATH encodes a single entry for each pair of vertices under the constraints of a given query, consistent with the set semantics of snapshot graphs (Section 3). Due to the spanning-tree construction (Definition 21), actual paths can easily be recovered by following the parent pointers; hence, Δ-PATH constitutes a compact representation of intermediate results for path navigation queries over materialized path graphs. Our implementation models Δ-PATH as a hash-based inverted index from vertex-state pairs to spanning trees, enabling quick look-up to locate all spanning trees that contain a particular vertex-state pair. Upon arrival of an sgt $t = (u, v, l, [ts, exp), \mathcal{D})$, Algorithm S-PATH probes this inverted index of Δ-PATH to retrieve all path segments that can be extended with the incoming sgt, that is, spanning trees that have the node $(u, s)$ with an expiry timestamp larger than $ts$ for any state $s \in \{s \in S \mid \delta(s, l) \ne \emptyset\}$ (Line 9). If the target node $(v, t)$ for $t = \delta(s, l)$ is not in the spanning tree $T_x$, Algorithm Expand is invoked to expand the existing path segment from $(x, s_0)$ to $(u, s)$ with the node $(v, t)$ and to create a new leaf node as a child of $(u, s)$. In case there already exists a path segment between vertices $(x, s_0)$ and $(v, t)$ in Δ-PATH, i.e., the target node $(v, t)$ is already in $T_x$, Algorithm S-PATH compares its expiry timestamp with the new candidate (Line 13). If the extension of the existing path segment from $(x, s_0)$ to $(u, s)$ with $(v, t)$ results in a larger expiry timestamp than $(v, t).exp$, Algorithm Propagate is invoked to update the expiry timestamp of $(v, t)$ and its children in $T_x$. Algorithms Expand and Propagate traverse the snapshot graph until no further update is possible. The following example illustrates the behaviour of Algorithm S-PATH on our running example.
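Before turning to the example, the interplay of Expand, Propagate and expiry-based purging can be sketched in Python as below. This is a simplified illustration under assumptions, not the paper's implementation: the class `SPathSketch`, the merged helper `_relax`, the list-based tree nodes, and the sample stream are hypothetical; the inverted index is replaced by a plain scan over all trees, and payloads are reduced to vertex sequences.

```python
class SPathSketch:
    """One spanning tree per root vertex; each node keeps the segment that expires last."""

    def __init__(self, delta, start, finals, out_label):
        self.delta = delta        # DFA transitions: (state, label) -> state
        self.start = start
        self.finals = finals
        self.out_label = out_label
        self.adj = {}             # vertex -> [(label, target, ts, exp)]
        self.trees = {}           # root x -> {(vertex, state): [parent, ts, exp]}

    def insert(self, u, v, label, ts, exp):
        """Process one arriving sgt and return the result sgts it derives."""
        self.adj.setdefault(u, []).append((label, v, ts, exp))
        results = []
        for (s, l), t in self.delta.items():
            if l != label:
                continue
            if s == self.start and u not in self.trees:
                # Root node (u, s0); it never expires.
                self.trees[u] = {(u, s): [None, float("-inf"), float("inf")]}
            for x, tree in self.trees.items():
                node = tree.get((u, s))
                if node is not None and node[2] > ts:  # (u, s) is still valid at ts
                    self._relax(x, tree, (u, s), (v, t), ts, exp, results)
        return results

    def _relax(self, x, tree, parent, child, e_ts, e_exp, results):
        """Expand a new node or propagate a larger expiry, then continue the traversal."""
        new_ts = max(e_ts, tree[parent][1])
        new_exp = min(e_exp, tree[parent][2])
        if new_ts >= new_exp:
            return                        # empty validity interval
        old = tree.get(child)
        if old is not None:
            if old[2] >= new_exp:
                return                    # the stored segment expires no earlier
            new_ts = min(old[1], new_ts)  # coalesce intervals, as in Propagate
        tree[child] = [parent, new_ts, new_exp]
        v, t = child
        if t in self.finals:
            results.append((x, v, self.out_label, new_ts, new_exp))
        for label, w, ts, exp in self.adj.get(v, []):
            q = self.delta.get((t, label))
            if q is not None:
                self._relax(x, tree, child, (w, q), ts, exp, results)

    def path(self, x, node):
        """Recover the vertex sequence from root x to `node` via parent pointers."""
        tree, vertices = self.trees[x], []
        while node is not None:
            vertices.append(node[0])
            node = tree[node][0]
        return vertices[::-1]

    def purge(self, now):
        """Direct approach: drop every node whose expiry timestamp has passed."""
        for tree in self.trees.values():
            for node in [n for n, (_, _, exp) in tree.items() if exp <= now]:
                del tree[node]

sp = SPathSketch({(0, "RL"): 1, (1, "RL"): 1}, start=0, finals={1}, out_label="RLP")
sp.insert("x", "z", "RL", 1, 11)
sp.insert("z", "u", "RL", 2, 12)
sp.insert("x", "y", "RL", 3, 13)
late = sp.insert("y", "u", "RL", 6, 16)
print(late, sp.path("x", ("u", 1)))
```

In this made-up stream the last edge gives `u` a route from `x` that expires later than the one through `z`, so the node for `u` is re-parented under `y`; a subsequent `purge` then needs only the stored expiry timestamps to drop the node for `z`, with no search for alternative paths.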
Example 9.
Consider the real-time notification query in Example1 whose SGA expression is given in Example 8. Figure 5a shows anexcerpt of the streaming graph input to the
PATH operator P 𝑅𝐿 + 𝑅𝐿𝑃 .Figure 5b depicts a spanning tree 𝑇 𝑥 ∈ Δ − PATH at 𝑡 = . Uponarrival of the sgt ( 𝑦, 𝑢, 𝑅𝐿, [ , ) , D = {( 𝑦, 𝑅𝐿, 𝑢 )}) at 𝑡 = ,Algorithm S-PATH extends the path segment from ( 𝑥, ) to ( 𝑦, ) with ( 𝑢, ) , and compares its expiry timestamp with that of ( 𝑢, ) that is already in 𝑇 𝑥 . As the new extension has larger expiry times-tamp, the validity interval and the parent pointer of ( 𝑢, ) ∈ 𝑇 𝑥 isupdated (Line 13 in Algorithm S-PATH ). Then, incoming sgts attimes 𝑡 = and 𝑡 = are processed by Algorithm Expand ascorresponding target nodes ( 𝑣, ) and ( 𝑠, ) are not in 𝑇 𝑥 , adding ( 𝑣, ) and ( 𝑠, ) as children of ( 𝑢, ) . At 𝑡 = , the incoming sgt ( 𝑤, 𝑣, 𝑅𝐿, [ , ) , D = {( 𝑤, 𝑅𝐿, 𝑣 )}) might extend the path seg-ment from ( 𝑥, ) to ( 𝑣, ) with expiry timestamp 33. However, Al-gorithm S-PATH does not make any modification to Δ − PATH , as ( 𝑣, ) is already in 𝑇 𝑥 with a larger timestamp (Line 13). Figure 5cdepicts the resulting spanning tree 𝑇 𝑥 at 𝑡 = .As described earlier, Δ − PATH stores a parent pointer for eachnode pointing to its parent node in the corresponding spanningtree, and Algorithm
Propagate updates these pointers during pro-cessing. By traversing these parent pointers for each resulting sgt,Algorithm
Expand can construct the actual path (Line 7) and return it as a part of the resulting sgt, i.e., it populates the implicit payload attribute D of the resulting sgt with the sequence of edges that forms the resulting path. The cost of this operation is O(𝑙), where 𝑙 is the length of the resulting path.

Δ − PATH guarantees that the expiry timestamp of a node (𝑢, 𝑠) in 𝑇_𝑥 is equal to the largest expiry timestamp of all paths between 𝑥 and 𝑢 in the snapshot graph with a label 𝑙 such that 𝑠 = 𝛿 ∗ ( 𝑠 , 𝑙 ). Consequently, for a node (𝑢, 𝑠) ∈ 𝑇_𝑥 with an expiry timestamp smaller than 𝑡, there cannot be another path from 𝑥 to 𝑢 with an equivalent label that is valid at time 𝑡. Consider the spanning tree given in Figure 5c. Algorithm S-PATH can directly determine, without additional processing, that nodes (𝑧, ) and (𝑡, ) are expired as their expiry timestamp is 31. Thus, at any given time 𝑡, Algorithm S-PATH can simply ignore a node (𝑢, 𝑠) ∈ 𝑇_𝑥 with an expiry timestamp smaller than 𝑡 (Line 13), and such nodes can be removed from Δ − PATH. To prevent Δ − PATH from growing unboundedly due to expired tuples, a background process periodically purges expired tuples from Δ − PATH.

[Figure 5 panels: (a) Streaming Graph 𝑆; (b) 𝑡 = 27; (c) 𝑡 = 30; (d) 𝑡 = 30 for [62]]

Figure 5: (a) A streaming graph 𝑆 𝑅𝐿𝑃 as the input for the PATH operator, (b) spanning tree 𝑇_𝑥 at 𝑡 = 27, (c) spanning tree 𝑇_𝑥 at 𝑡 = 30 of the proposed algorithm following the direct approach, and (d) spanning tree 𝑇_𝑥 at 𝑡 = 30 of [62] following the negative tuple approach.

Our strategy of utilizing the validity intervals of path segments is in contrast to our previous work [62], which studies the design space of streaming RPQ evaluation under arbitrary and simple path semantics. Algorithms in [62] are based on the negative tuple approach; expirations due to window movements are processed using the same machinery as explicit deletions. Upon expiration (deletion) of an edge, their algorithm first finds all results that are affected by the expiration (deletion), then it traverses the snapshot graph to ensure that there is no alternative path leading to the same result. This corresponds to the re-derivation step of
DRed [35], optimized for RPQ evaluation on streaming graphs. Although its cost can be amortized over a slide interval, using the negative tuple approach for window management incurs a non-negligible overhead, as analytically and empirically presented in [62]. Instead, our proposed algorithm for the PATH operator utilizes the temporal pattern of sliding window movements and adopts the direct approach, i.e., it can directly determine expired tuples based on their validity intervals. This is possible due to the separation of the implementation of sliding windows from operator semantics via an explicit WSCAN operator. We argue that such an approach can greatly simplify window management in the absence of explicit deletions, i.e., sliding windows over append-only streaming graphs. The following example illustrates how the negative tuple and direct approaches differ, for [62] and our proposed algorithm, respectively.
Example 10.
Consider the same real-time notification query as in Example 9. Both approaches behave similarly until 𝑡 = as all vertex-state pairs in 𝑇_𝑥 have a single derivation at 𝑡 = (Figure 5b). Upon arrival of the sgt (𝑦, 𝑢, 𝐻𝐼, [ , ), D = {(𝑦, 𝐻𝐼, 𝑢)}) at 𝑡 = , the negative tuple approach of [62] does not update 𝑇_𝑥 as (𝑢, ) is already in 𝑇_𝑥, whereas the direct approach used in this paper updates the validity interval and the parent pointer of (𝑢, ) ∈ 𝑇_𝑥 (Line 13 in Algorithm S-PATH). Then, incoming sgts at times 𝑡 = and 𝑡 = are processed similarly, adding (𝑣, ) and (𝑠, ) as children of (𝑢, ). Figures 5c and 5d depict the corresponding spanning trees at 𝑡 = for the direct and the negative tuple approaches, respectively. Note that in Figure 5c, the validity intervals of nodes in the subtree rooted at node (𝑢, ) reflect the newly discovered path from 𝑥 to 𝑢 through 𝑦 in 𝐺 . The negative tuple and the direct approaches differ at 𝑡 = as multiple nodes expire. The negative tuple approach used in [62] marks the entire subtree of (𝑧, ) as potentially expired (Figure 5d), and performs a traversal of the snapshot graph 𝐺 to find alternative, valid paths for expired nodes. These traversals undo the effect of expired sgts via explicit deletions. Upon discovering alternative paths for nodes (𝑢, ), (𝑣, ) and (𝑠, ) that are valid at time 𝑡 = , they are re-inserted into 𝑇_𝑥. Instead, our proposed algorithm can directly determine the expired nodes based on the validity intervals (nodes (𝑧, ) and (𝑡, ) as shown in Figure 5c) without additional processing.

Physical implementations of
PATTERN and
PATH rely on the direct approach that utilizes the temporal pattern of sliding windows for state maintenance; the expiration timestamps are used to directly locate expired tuples. On append-only streaming graphs, existing sgts only expire due to window movements. Albeit rare, certain applications might require explicit deletions of previously inserted sgts, which necessitates the use of the negative tuple approach to handle such explicit deletions. A deleted sgt with an additional flag to denote deletion, i.e., a negative tuple, is used to undo the effect of the original sgt on the operator state and to invalidate previously reported results, if necessary. For pipelined hash join, which is used in the implementation of PATTERN, processing of negative tuples is the same as for original input tuples: a negative tuple is removed from its corresponding hash table and other tables are probed to find corresponding deleted results.

The negative tuple approach can also be used to signal expirations through an explicit deletion of the corresponding sgt to undo its effect [29, 33, 62]. Indeed, the negative tuple approach can be used for incremental evaluation of an arbitrary computation, whereas the direct approach is applicable to negation-free queries over append-only streaming graphs [33].

Similarly, we adopt the negative tuple approach for explicit deletions of sgts in the PATH operator. In brief, upon explicit deletion of an sgt, we first identify tree-edges that disconnect spanning trees in Δ − PATH. For each such tree-edge, we first mark the nodes in the subtree that are disconnected due to the explicit deletion. Then, for each node in this subtree, we use a Dijkstra-based traversal over the snapshot graph to find an alternative path with the largest expiry timestamp. Dijkstra's algorithm guarantees that we can efficiently find the path with the largest expiry timestamp for each marked node, consistent with Definition 22. A marked node is removed from the spanning tree only if there is no alternative valid path. Deletion of a non-tree edge does not require any modification as it leaves spanning trees unchanged. The use of negative tuples for explicit deletions in the context of RPQ evaluation was first proposed by Pacaci et al. [62]. However, that technique uses the negative tuple approach for both explicit deletions and window management, i.e., processing of expired sgts due to sliding window movements. This is in contrast to our strategy of utilizing the validity intervals of path segments in the Δ − PATH index to directly locate expired tuples, which simplifies window management by eliminating the need for explicit processing of expired sgts.
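The Dijkstra-based re-derivation described above can be illustrated with a minimal sketch. It rests on assumptions that are not confirmed by this excerpt: the expiry timestamp of a path is taken to be the minimum expiry of its edges, the automaton states of the product graph are omitted, and the function name `max_expiry_paths` and the edge format are hypothetical.

```python
import heapq
from collections import defaultdict


def max_expiry_paths(edges, source, now):
    """Return {node: (best_expiry, parent)} for nodes reachable from `source`.

    edges: iterable of (u, v, expiry). A path's expiry is assumed to be the
    minimum expiry over its edges; the traversal maximizes that value
    (a "widest path" variant of Dijkstra's algorithm).
    """
    adj = defaultdict(list)
    for u, v, expiry in edges:
        if expiry > now:  # edges already expired at `now` cannot be on a valid path
            adj[u].append((v, expiry))

    best = {source: (float("inf"), None)}
    heap = [(-float("inf"), source)]  # max-heap on expiry via negation
    while heap:
        neg_expiry, u = heapq.heappop(heap)
        expiry_u = -neg_expiry
        if expiry_u < best[u][0]:
            continue  # stale heap entry, a better path to u was already found
        for v, edge_expiry in adj[u]:
            candidate = min(expiry_u, edge_expiry)
            if v not in best or candidate > best[v][0]:
                best[v] = (candidate, u)  # record parent pointer for path recovery
                heapq.heappush(heap, (-candidate, v))
    return best
```

A marked node that does not appear in the returned dictionary has no alternative valid path and would be removed from the spanning tree; otherwise its validity interval and parent pointer would be replaced by the returned values.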
Our objective in this section is to demonstrate the feasibility and the performance of the algebraic framework we propose in this paper. We first describe a prototype streaming graph query processor that includes the algorithms presented in Section 6 as physical operators of the proposed SGA (Section 7.1.1). Using this query processor, we provide an end-to-end performance analysis of our algebraic approach for persistent evaluation of streaming graph queries (Section 7.2). Then, we assess the scalability by varying the window size T and the slide interval 𝛽 (Section 7.3). Finally, we highlight the practical benefits of the proposed SGA by exploring the rich plan space through SGA's transformation rules and demonstrate the potential performance improvements of exploring this plan space (Section 7.4).

We emphasize that the design of the "best" physical operators for the defined algebra or full-fledged query optimization is not the purpose of this paper. We provide the physical implementations in Section 6 and their performance in this section to demonstrate that it is possible to implement SGA efficiently. Development of alternative physical implementations and query optimization is part of our future work.

The highlights of our experimental analysis are as follows:
• The algebraic framework we propose in this paper provides a principled approach to formulate and evaluate streaming graph queries with complex graph patterns, and it can offer significant performance gains for persistent evaluation of streaming graph queries compared to a general-purpose solution.
• Utilizing structural properties of streaming graph queries and temporal patterns of sliding windows yields efficient physical operators for the proposed SGA. In particular, the novel S-PATH algorithm for PATH can improve the performance of recursive queries by simplifying the processing of expired tuples.
• Physical implementations of our stateful operators based on the direct approach provide robust performance w.r.t. varying slide intervals, i.e., they can provide fresh results at a fine granularity regardless of 𝛽.
• The algebraic approach for persistent evaluation of SGQ that is advocated in this paper enables the exploration of a rich plan space for SGQ, offering significant computational gains for persistent evaluation of SGQ.
We implement a streaming graph query processor based on the algebraic framework we propose in this paper as a part of the S-graffito Streaming Graph Management System. Our streaming graph query processor is implemented as a layer on top of Timely Dataflow (TD) [59]. TD is a distributed, data-parallel dataflow engine for high-throughput, low-latency computations that are modeled as possibly cyclic dataflows. We implement the algorithms presented in Section 6 as TD operators, and, given an SGQ, we construct a dataflow computation that consists of these operators.

Our prototype adopts the standard implementations of the stateless UNION and FILTER from TD, and implements WSCAN using the standard map operator. As described in Section 6.1, PATTERN is implemented as a series of symmetric hash joins based on the direct approach [33, 36, 37]. Finally, PATH is implemented using the novel S-PATH algorithm that we propose in this paper. TD's novel progress tracking mechanism enables us to coordinate the logical time between the operators of the dataflow graph, eliminating the inconsistency and delayed-result issues that might arise with the direct approach [29, 33].

Our current implementation transforms the operator tree of a given SGQ into a dataflow graph to be executed on TD. We first obtain the canonical SGA expression and the corresponding logical query plan for a given SGQ (Section 5.2). Physical query plans are directly derived from their logical counterparts by substituting each logical operator with its physical counterpart (Section 6). Consequently, the resulting dataflow graphs are tree-shaped, similar to logical plans that are based on the canonical SGA expressions. Each dataflow, i.e., physical plan, consists of one or more windowing operators that are placed at the sources and a sink operator that pushes results back to the user incrementally, as they are generated.
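To make the direct approach for a symmetric hash join concrete, the following is a minimal sketch in plain Python rather than the prototype's Timely Dataflow code. It assumes tuples arrive in timestamp order and carry a validity interval [start, expiry); the class `SymmetricHashJoin`, its method names, and the tuple layout are hypothetical. Expired tuples are located purely from their expiry timestamps while probing, so no negative tuples are needed.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Two-way join of left edges (x, y) with right edges (y, z) on y."""

    def __init__(self):
        self.left = defaultdict(list)   # join key (trg) -> [(src, trg, start, expiry)]
        self.right = defaultdict(list)  # join key (src) -> [(src, trg, start, expiry)]

    @staticmethod
    def _probe(table, key, now):
        # Direct approach: keep only tuples still valid at `now`,
        # lazily purging the expired ones from the bucket.
        live = [t for t in table[key] if t[3] > now]
        table[key] = live
        return live

    def insert_left(self, src, trg, start, expiry):
        self.left[trg].append((src, trg, start, expiry))
        # Result validity is the intersection of the two input intervals.
        return [(src, r_trg, max(start, r_start), min(expiry, r_exp))
                for _, r_trg, r_start, r_exp in self._probe(self.right, trg, start)]

    def insert_right(self, src, trg, start, expiry):
        self.right[src].append((src, trg, start, expiry))
        return [(l_src, trg, max(start, l_start), min(expiry, l_exp))
                for l_src, _, l_start, l_exp in self._probe(self.left, src, start)]
```

For example, inserting a left tuple valid over [1, 11) and then a matching right tuple arriving at time 5 yields a result valid over [5, 11), whereas a matching right tuple arriving at time 12 yields nothing, because the stored left tuple is recognized as expired from its timestamp alone.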
Experiments are run on a Linux server with 32 physical cores and 256GB memory. We use the slide interval 𝛽 to control the granularity at which the time-based sliding window slides. Consequently, 𝛽 controls the size of input batches and the temporal granularity of the output. For each query and configuration, we report the tail latency of each window slide, i.e., the total time to process all arriving and expired sgts upon window movement and to produce new results, and the average throughput after ten minutes of processing on warm caches.

Stackoverflow (SO) is a temporal graph of user interactions on this website containing 63M interactions (edges) of 2.2M users (vertices), spanning 8 years [64]. Each directed edge (𝑢, 𝑣) with timestamp 𝑡 denotes an interaction: (i) user 𝑢 answered user 𝑣's questions at time 𝑡, (ii) 𝑢 commented on 𝑣's question, or (iii) comment at time 𝑡. SO is more homogeneous and more cyclic than other graphs we use as it contains only a single type of vertex and 3 different edge labels. Its highly dense and cyclic nature causes a high number of intermediate results and resulting paths, so it is the most challenging one for the proposed algorithms. We set the window size T to 1 month and the slide interval 𝛽 to 1 day unless specified otherwise.

LDBC SNB is a synthetic social network graph that is designed to simulate real-world interactions in an online social network [26]. We extract the update stream of the LDBC workload, which exhibits 8 different types of interactions. We use a scale factor of 10 with approximately 7.2M users and posts (vertices) and 40M user interactions (edges). In our experiments, we use replyOf, hasCreator and likes edges between users and posts, and knows edges between users. The LDBC update stream spans 3.5 months of user activity and we set |𝑊| = month and 𝛽 = day unless specified otherwise.

(S-graffito: https://dsg-uwaterloo.github.io/s-graffito/)

To the best of our knowledge, no current benchmark exists featuring RQ for graph databases.
The existing benchmarks are limited to UCRPQ, thus not reaching the full expressivity of RQ even for static graphs. Although there exist specialized streaming RDF benchmarks such as LSBench [1] and Stream WatDiv [28], they only focus on SPARQL v1.0 (thus not even including simple RPQs), and their workloads do not contain any recursive queries. Hence, we formulate a set of streaming graph queries from existing UCRPQ-based workloads as follows: we first collect a set of query graphs in the form of UCRPQ from previously published benchmarks and studies [13, 17, 26, 62, 79], and we compose a set of complex graph patterns from this collection by applying a Kleene star over each graph pattern. Table 1 lists the set of graph patterns of increasing expressivity (from RPQ to complex RQ with complex graph patterns) that we use to define streaming graph queries. 𝑄1–𝑄4 correspond to commonly used RPQ in existing studies [17, 62, 79], and we use those to test our PATH operator. 𝑄5 & 𝑄6 are CRPQ-based complex graph patterns based on LDBC SNB queries IS7 and
IC7 [26]. For instance, 𝑄6 (LDBC SNB query IC7), with edge labels knows, likes and hasCreator, asks for recent likers of a person's messages that are connected by a path of friends. 𝑄7 & 𝑄8 (Examples 1 & 2, respectively) are the most expressive RQ-based complex graph patterns that we construct to demonstrate the abilities of our proposed SGA to unify subgraph pattern and path navigation queries in a structured manner and to treat paths as first-class citizens. For instance, 𝑄7 defines a path query over the complex graph pattern of 𝑄6; it finds arbitrary-length paths where users are connected by the recent liker pattern. Note that this query cannot be expressed in existing graph query languages such as Cypher and SPARQL (thus it cannot be evaluated on static graphs by corresponding offline engines without additional processing). Finally, for each dataset, we instantiate the query workload from these graph patterns by choosing appropriate bindings, i.e., edge labels, for each query edge and by setting the duration of time-based sliding windows W T based on characteristics of the particular streaming graph used.

Table 1: 𝑄1–𝑄4 correspond to common RPQ observed in real-world query logs [17], and 𝑄5–𝑄8 are Datalog encodings of RQ-based complex graph patterns that we use to define streaming graph queries. 𝑄5 and 𝑄6 correspond to complex graph patterns of LDBC SNB queries IS7 and IC7 [26], respectively, 𝑄7 corresponds to the complex graph pattern given in Example 1 that is defined as a recursive path query over the graph pattern of 𝑄6, and 𝑄8 corresponds to the complex graph pattern given in Example 2. 𝑎, 𝑏 and 𝑐 correspond to edge predicates that are instantiated based on the dataset characteristics.

Name  Query
𝑄1    ?𝑥, ?𝑦 ← ?𝑥 𝑎* ?𝑦
𝑄2    ?𝑥, ?𝑦 ← ?𝑥 𝑎 ∘ 𝑏* ?𝑦
𝑄3    ?𝑥, ?𝑦 ← ?𝑥 𝑎 ∘ 𝑏* ∘ 𝑐* ?𝑦
𝑄4    ?𝑥, ?𝑦 ← ?𝑥 (𝑎 ∘ 𝑏 ∘ 𝑐)+ ?𝑦
𝑄5    𝑅𝑅(𝑚 , 𝑚 ) ← 𝑎(𝑥, 𝑦), 𝑏(𝑚 , 𝑥), 𝑏(𝑚 , 𝑦), 𝑐(𝑚 , 𝑚 )
𝑄6    𝑅𝐿(𝑥, 𝑦) ← 𝑎+(𝑥, 𝑦), 𝑏(𝑥, 𝑚), 𝑐(𝑚, 𝑦)
𝑄7    𝑅𝐿(𝑥, 𝑦) ← 𝑎+(𝑥, 𝑦), 𝑏(𝑥, 𝑚), 𝑐(𝑚, 𝑦)
      𝐴𝑛𝑠(𝑥, 𝑚) ← 𝑅𝐿+(𝑥, 𝑦), 𝑐(𝑚, 𝑦)
𝑄8    𝑃+(𝑥, 𝑦) ← 𝑎(𝑥, 𝑧), 𝑎(𝑦, 𝑧)
      𝐴𝑛𝑠(𝑥, 𝑚) ← 𝑃+(𝑥, 𝑦)

Table 2 (SGA) shows the aggregated throughput and tail latency of our streaming graph query processor for all queries in Table 1. We discard each streaming graph edge whose label is not in a given SGQ. Tail latencies reflect the 𝑡ℎ percentile latency of processing a window slide and producing the corresponding resulting sgts. Across queries, the performance is lower for the SO graph because it is dense and cyclic. The throughput ranges from hundreds of edges per second for SO to hundreds of thousands of edges per second for LDBC.

Existing work on query processing over streaming data such as DSMSs and stream RDF systems cannot process queries in Table 1 as they focus on relational queries and SPARQL v1.0, respectively (as discussed in §1). Differential Dataflow (DD) [56] is a state-of-the-art system built atop TD for incremental maintenance of arbitrary dataflows. In addition to standard relational operators such as map, join, distinct etc., DD provides an iterate operator for fixed-point computations, which enables DD to maintain cyclic dataflows over evolving datasets under arbitrary changes. Consequently, DD can be used to evaluate SGQ on time-based sliding windows over a streaming graph by maintaining the window content as an evolving collection where each window slide triggers: (i) the insertion of new sgts, and (ii) the deletion of old sgts that expire from the window. In contrast to the physical operators of SGA, which are based on the direct approach that utilizes the temporal patterns of time-based sliding windows, this corresponds to the negative tuple approach that relies on explicit deletions for expirations (§6).
DD constitutes a strong baseline for evaluation and enables us to assess the potential benefits of our algebraic framework.

Given a canonical SGA expression of an SGQ, we construct a dataflow graph of DD operators: PATTERN is modeled as a series of joins, and PATH is modeled as a fixed-point computation over the underlying path pattern. Each time-based sliding window is represented as an evolving collection; incoming (expiring) sgts are inserted into (deleted from) this collection as the window slides. In particular, we extend the WSCAN operator to emit a negative tuple when an sgt expires, similar to
SEQ-WINDOW of CQL [10]. The distinct operator is used to provide the set semantics over resulting streaming graphs (Def. 12). Table 2 (DD) reports the throughput and tail latency of DD dataflows for all queries in Table 1.

Table 2: (Tput) The throughput (edges/s) and (TL) the tail latency (s) of SGA and DD for queries in Table 1 on SO and LDBC-SF10 graphs with |𝑊| = 30 days and 𝛽 = .

Graph  System  Tput  TL  (one Tput/TL pair per query)
SO     SGA     177  348  94.9
LDBC   SGA     95903  1.4  244653  1.8  224342  1.9  278647  0.4
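The way a time-based sliding window can be fed to a system that expects explicit insertions and deletions, as in the DD baseline, can be sketched as follows. This is an illustrative approximation rather than the extended WSCAN operator itself: it assumes integer timestamps, assigns an edge arriving at time ts to epoch ts // slide, and retracts it at epoch (ts + window) // slide. The function name `window_deltas` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def window_deltas(edges, window, slide):
    """Turn timestamped edges into per-epoch (edge, +1 / -1) changes.

    edges: iterable of (src, label, trg, ts) with integer timestamps.
    Each edge produces a positive tuple in its arrival epoch and a negative
    tuple in the epoch at which it leaves the window.
    """
    deltas = defaultdict(list)
    for src, label, trg, ts in edges:
        edge = (src, label, trg)
        deltas[ts // slide].append((edge, +1))             # insertion on arrival
        deltas[(ts + window) // slide].append((edge, -1))  # negative tuple on expiry
    return dict(sorted(deltas.items()))
```

Under these assumptions, every input edge is processed twice by downstream operators, once as an insertion and once as a deletion, which is the overhead that validity-interval-based expiration avoids.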
Overall, our SGA-based query processor outperforms the DD baseline for the majority of the queries on SO and provides competitive performance on the LDBC dataset. Due to the highly cyclic structure of SO, there are many alternative paths between each pair of vertices, and Algorithm S-PATH for PATH manages to utilize the temporal patterns of sliding window movements to simplify expirations by maintaining a compact representation of valid path segments (§6.2). The DD-based query processor provides better performance on linear path queries 𝑄 – 𝑄 on LDBC, but not on others. This is due to the tree-shaped structure of replyOf edges in LDBC, where there is only one path between a pair of vertices, so Algorithm S-PATH's optimizations do not apply. Performance variations on LDBC suggest optimization opportunities for recursive graph queries when selecting physical operator implementations, as is the case for streaming relational joins [33]. In summary, these results demonstrate the feasibility of our algebraic approach for evaluating SGQ and of our physical operator implementations. In particular, employing the direct approach by utilizing the temporal patterns of sliding window movements has significant performance advantages for evaluating recursive queries on cyclic graphs.
In this section, we analyze the impact of the window size T and the slide interval 𝛽 on the end-to-end query performance of the proposed streaming graph query processor. We use the SO graph for this experiment as its dense, cyclic structure stresses our operator implementations. Figure 6a reports the aggregate throughput and the tail latency for each query across various window sizes. As expected, the throughput of all tested queries decreases with increasing T, as a larger window size increases the 𝛽 on performance.

As previously mentioned, the slide interval 𝛽 controls the time granularity at which the sliding window progresses, and our prototype implementation uses 𝛽 to control the input batch size. Figure 6b shows that the aggregate throughput and the tail latency for each query remain stable across varying slide intervals. This is due to the tuple-oriented implementation of the physical operators of SGA; SGA operators are designed to process each incoming tuple eagerly in favour of minimizing tuple-processing latency, and they do not utilize batching to improve throughput with larger batch sizes. Consequently, the tail latency of window movements increases with increasing slide interval. This is in contrast to DD, whose throughput increases with increasing 𝛽 as shown in Figure 7. DD and its underlying indexing mechanism, i.e., shared arrangements [55], are designed to utilize batching and improve throughput with increasing batch size: all sgts that arrive within one interval are batched together with a single logical timestamp (epoch), and DD operators can explore the latency vs. throughput trade-off by changing the granularity of each epoch. The investigation of batching within SGA operators and the identification of other optimization opportunities is a topic of future work.

On LDBC, 𝑄 & 𝑄 do not have the Kleene plus over 𝑎 as it causes DD to time out.
SGA proposed in this paper enables a rich foundation for logical SGQ optimization through query rewriting, as previously discussed in §5.4. Although we compile physical query plans directly from the canonical SGA expression of a given SGQ without addressing the optimization issues in this paper, we design a micro-benchmark to highlight the possibilities provided by SGA. In particular, we choose 𝑄4 (Table 1), as its linear pattern combined with a Kleene plus demonstrates the potential benefits of SGA transformation rules involving SGA's novel PATH operator. Fig. 8 demonstrates the throughput and the tail latency of different plans obtained from the following equivalent SGA expressions for 𝑄4:
• SGA: $\mathcal{P}^{l}_{d^{+}}\left(\bowtie^{src,\,trg,\,d}_{trg = src \,\wedge\, trg = src}(S_a, S_b, S_c)\right)$
• P1: $\mathcal{P}^{l}_{(a \cdot b \cdot c)^{+}}(S_a, S_b, S_c)$
• P2: $\mathcal{P}^{l}_{(a \cdot d)^{+}}\left(S_a, \bowtie^{src,\,trg,\,d}_{trg = src}(S_b, S_c)\right)$
• P3: $\mathcal{P}^{l}_{(d \cdot c)^{+}}\left(S_c, \bowtie^{src,\,trg,\,d}_{trg = src}(S_a, S_b)\right)$

The first expression, SGA, is the canonical SGA expression for 𝑄4 and is generated by Algorithm SGQParser. It corresponds to a fixed-point computation over the linear pattern (𝑎 · 𝑏 · 𝑐) (as employed by DD), and such plans are called loop-caching in the literature as they enable re-use of the intermediate results for the base pattern (𝑎 · 𝑏 · 𝑐) [79]. P1, P2 and P3 are obtained from the canonical SGA expression using the transformation rules given in §5.4, and represent novel plans that are possible due to the novel PATH operator of the proposed SGA. Fig. 8 clearly illustrates the potential benefits of exploring the rich plan space offered by SGA: some of the newly computed plans provide up to a 60% increase in throughput and a 60% reduction in latency. We observe a similar behaviour on other path queries 𝑄 and 𝑄 (up to 50% difference in throughput) as shown in Figure 9. These results suggest further opportunities for logical query optimization, as query rewrites that are generated by SGA transformation rules can provide significant performance benefits for evaluating SGQ over streaming graphs.
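The equivalence of the four plans can be checked on a toy graph with the sketch below. It is a simplification under stated assumptions: it works on a static snapshot with set semantics, ignores validity intervals and output labels, and uses two hypothetical helpers, `pattern_join` as a stand-in for PATTERN on a linear pattern and `path_plus` as a stand-in for PATH over the Kleene plus of a concatenation.

```python
from collections import deque


def pattern_join(relations):
    """Chain-join binary relations r1, r2, ... on trg(r_i) = src(r_{i+1})."""
    result = set(relations[0])
    for rel in relations[1:]:
        result = {(x, z) for (x, y) in result for (y2, z) in rel if y == y2}
    return result


def path_plus(relations):
    """Pairs (s, w) joined by a path whose labels match (r1 . r2 ... rk)+.

    Traverses the product of the graph with a cyclic k-state automaton.
    """
    k = len(relations)
    adj = [dict() for _ in range(k)]
    for i, rel in enumerate(relations):
        for x, y in rel:
            adj[i].setdefault(x, set()).add(y)
    answers = set()
    for s in {x for x, _ in relations[0]}:
        seen = {(s, 0)}
        queue = deque([(s, 0)])
        while queue:
            v, i = queue.popleft()
            for w in adj[i].get(v, ()):
                nxt = (w, (i + 1) % k)
                if nxt[1] == 0:
                    answers.add((s, w))  # one more full repetition completed
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return answers


a = {("v1", "v2"), ("v4", "v5")}
b = {("v2", "v3"), ("v5", "v6")}
c = {("v3", "v4"), ("v6", "v1")}

canonical = path_plus([pattern_join([a, b, c])])  # d+ with d = a, b, c joined
p1 = path_plus([a, b, c])                         # (a . b . c)+
p2 = path_plus([a, pattern_join([b, c])])         # (a . d)+ with d = b, c joined
p3 = path_plus([pattern_join([a, b]), c])         # (d . c)+ with d = a, b joined
```

All four variables hold the same answer set on this input; the plans differ only in how much work is done by the join and how much by the path traversal, which is the source of the performance differences reported for the real operators.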
[Figure 6 plots: tail latency (ms) and throughput (edges/s) for Q1 to Q8; (a) Window Size, 10D to 50D; (b) Slide, 3H to 4D.]
Figure 6: The tail latency of each window slide and the aggregate throughput of the proposed streaming graph query processorwith increasing (a) window size T and (b) slide interval 𝛽 on SO graph.
[Figure 7 plots: tail latency (ms) and throughput (edges/s) for Q1 to Q8 over slide sizes 3H to 4D.]
Figure 7: The tail latency of each window slide and the aggregate throughput of SGQ evaluation on DD with increasing slide interval 𝛽 on SO graph.

[Figure 8 plots: throughput (edges/s) and tail latency (s) of DD, SGA, P1, P2 and P3 on (a) SO and (b) LDBC-SF10.]

Figure 8: The throughput and tail latency of 𝑄4 on (a) SO and (b) LDBC graphs for DD, SGA and three other equivalent physical plans generated via SGA transformation rules (Section 5.4).

[Figure 9 plots: throughput (edges/s) and tail latency (s) of DD, SGA and P1 for (a) 𝑄 and (b) 𝑄 .]

Figure 9: The throughput and tail latency of (a) 𝑄 , and (b) 𝑄 on SO for DD, SGA with the default plan, and an alternative equivalent physical plan generated via SGA transformation rules (Section 5.4).

This paper introduces a general-purpose query processing framework for streaming graphs that consists of: (i) a streaming graph query model and algebra with well-founded semantics, and (ii) a prototype system that consists of physical operator implementations as an embodiment of the proposed framework. The proposed SGA treats paths as first-class citizens and it provides the foundational framework to precisely describe the semantics of complex streaming graph queries that combine path navigation and subgraph pattern queries in a uniform manner. Experimental analyses on real-world and synthetic streaming graphs demonstrate the feasibility and the potential performance gains of our framework. Future research directions we consider are: (i) to design an SGA-based query optimizer that adaptively improves the query execution performance w.r.t.
changing system conditions, and (ii) to extendour framework with attribute-based predicates to fully support theproperty graph model. There is, of course, much work to be donein developing alternative physical operator implementations. REFERENCES [1] 2012. LSBench Code. https://code.google.com/archive/p/lsbench/[2] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Çetintemel, MitchCherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin,Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. 2005. TheDesign of the Borealis Stream Processing Engine. In
Proc. 2nd Biennial Conf. onInnovative Data Systems Research . 277–289.[3] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Con-vey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003.Aurora: a new model and architecture for data stream management.
VLDB J.
Foundations of databases .Vol. 8. Addison Wesley.[5] Amir Aghasadeghi, Vera Zaychik Moffitt, Sebastian Schelter, and Julia Stoy-anovich. 2020. Zooming Out on an Evolving Graph.. In
Proc. 23rd Int. Conf. onExtending Database Technology . 25–36.[6] Khaled Ammar, Frank McSherry, Semih Salihoglu, and Manas Joglekar. 2018.Distributed Evaluation of Subgraph Queries Using Worst-case Optimal and Low-Memory Dataflows.
Proc. VLDB Endowment
11, 6 (2018), 691–704. https://doi.org/10.14778/3184470.3184473[7] Renzo Angles, Marcelo Arenas, Pablo Barcelo, Peter Boncz, George Fletcher,Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, JuanSequeda, et al. 2018. G-CORE: A core for future graph query languages. In
Proc.ACM SIGMOD Int. Conf. on Management of Data . 1421–1432.[8] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Do-magoj Vrgoč. 2017. Foundations of modern query languages for graph databases.
ACM Comput. Surv.
50, 5 (2017), 68.[9] Darko Anicic, Paul Fodor, Sebastian Rudolph, and Nenad Stojanovic. 2011. EP-SPARQL: a unified language for event processing and stream reasoning. In
Proc.20th Int. World Wide Web Conf.
VLDB J.
15, 2 (2006), 121–142.[11] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. 2002. Models and Issuesin Data Stream Systems. In
Proc. ACM SIGACT-SIGMOD Symp. on Principles ofDatabase Systems . 1–16.[12] Pablo Barceló Baeza. 2013. Querying graph databases. In
Proc. 32nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems . 175–188.[13] Guillaume Bagan, Angela Bonifati, Radu Ciucanu, George HL Fletcher, AurélienLemay, and Nicky Advokaat. 2016. gMark: schema-driven generation of graphsand queries.
IEEE Trans. Knowl. and Data Eng.
29, 4 (2016), 856–869.[14] Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle,and Michael Grossniklaus. 2009. C-SPARQL: SPARQL for continuous querying.In
Proc. 18th Int. World Wide Web Conf.
ACM SIGMOD Rec.
47, 4 (2019), 5–16.[16] Angela Bonifati, George Fletcher, Hannes Voigt, and Nikolay Yakovets. 2018.Querying Graphs.
Synthesis Lectures on Data Management
10, 3 (2018), 1–184.[17] Angela Bonifati, Wim Martens, and Thomas Timm. 2019. Navigating the Mazeof Wikidata Query Logs. In
Proc. 28th Int. World Wide Web Conf.
WSP/WOMoCoE@ ISWC . 66–73.[19] Jean-Paul Calbimonte, Oscar Corcho, and Alasdair JG Gray. 2010. Enablingontology-based access to streaming data sources. In
Proc. 9th Int. Semantic WebConf.
Q. Bull. IEEE TC on Data Eng.
38, 4 (2015), 28–38. http://sites.computer.org/debull/A15dec/p28.pdf[21] Sutanay Choudhury, Lawrence B. Holder, George Chin Jr, Khushbu Agarwal, andJohn Feo. 2015. A Selectivity based approach to Continuous Pattern Detectionin Streaming Graphs. In
Proc. 18th Int. Conf. on Extending Database Technology .157–168. https://doi.org/10.5441/002/edbt.2015.15[22] Christian S Jensen James Clifford, Ramez Elmasri, Curtis Dyreson, Fabio GrandiWolfgang K&fer Nick Kline, Nikos Lorentzos, Yamzis Mitsopoulos, Angelo Mon-tanari, Daniel Nonen Elisa Peressi Barbara Pernici, John F Roddick Nandlal LSarda, and Maria Rita Scalas Arie Segev. 1994. A consensus glossary of temporaldatabase concepts.
ACM SIGMOD Rec.
23, 1 (1994).[23] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. 2003. Reachabilityand Distance Queries via 2-Hop Labels.
SIAM J. on Comput.
32, 5 (2003), 1338.[24] Daniele Dell’Aglio, Jean-Paul Calbimonte, Emanuele Della Valle, and Oscar Cor-cho. 2015. Towards a unified language for RDF stream query processing. In
Proc. 12th Extended Semantic Web Conf.
Proc. 2012 IEEE Conf. onHigh Performance Extreme Comp.
IEEE, 1–5.[26] Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev,Arnau Prat, Minh-Duc Pham, and Peter Boncz. 2015. The LDBC Social Net-work Benchmark: Interactive Workload. In
Proc. ACM SIGMOD Int. Conf. onManagement of Data . 619–630. https://doi.org/10.1145/2723372.2742786[27] Wenfei Fan, Chunming Hu, and Chao Tian. 2017. Incremental graph computa-tions: Doable and undoable. In
Proc. ACM SIGMOD Int. Conf. on Management ofData . 155–169.[28] Libo Gao, Lukasz Golab, M. Tamer Özsu, and Gunes Aluc. 2018. Stream WatDiv– A Streaming RDF Benchmark. In
Proc. ACM SIGMOD Workshop on Semantic Big Data. 3:1–3:6.
[29] Thanaa M Ghanem, Moustafa A Hammad, Mohamed F Mokbel, Walid G Aref, and Ahmed K Elmagarmid. 2006. Incremental evaluation of sliding-window queries over data streams. 19, 1 (2006), 57–72.
[30] L. Golab. 2006. Sliding Window Query Processing over Data Streams. Ph.D. Dissertation. University of Waterloo.
[31] Lukasz Golab and M. Tamer Özsu. 2003. Issues in data stream management. ACM SIGMOD Rec. 32, 2 (2003), 5–14.
[32] Lukasz Golab and M. Tamer Özsu. 2003. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. In Proc. 29th Int. Conf. on Very Large Data Bases. 500–511.
[33] Lukasz Golab and M. Tamer Özsu. 2005. Update-Pattern-Aware Modeling and Processing of Continuous Queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 658–669.
[34] Ajeet Grewal, Jerry Jiang, Gary Lam, Tristan Jung, Lohith Vuddemarri, Quannan Li, Aaditya Landge, and Jimmy Lin. [n. d.]. RecService: Multi-Tenant Distributed Real-Time Graph Processing at Twitter. In Proc. 10th USENIX Workshop on Hot Topics in Cloud Computing.
[35] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. 1993. Maintaining Views Incrementally. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 157–166.
[36] M. Hammad, W. Aref, M. Franklin, M. Mokbel, and A. Elmagarmid. 2003. Efficient Execution of Sliding Window Queries over Data Streams. Technical Report CSD TR 03-035. Purdue University.
[37] M. Hammad, M. Mokbel, M. Ali, W. Aref, A. Catlin, A. Elmagarmid, M. Eltabakh, M. Elfeky, T. Ghanem, R. Gwadera, I. Ilyas, M. Marzouk, and X. Xiong. 2004. Nile: a query processing engine for data streams. In Proc. 20th Int. Conf. on Data Engineering. 851.
[38] Martin Hirzel, Guillaume Baudart, Angela Bonifati, Emanuele Della Valle, Sherif Sakr, and Akrivi Vlachou. 2018. Stream Processing Languages in the Big Data Era.
ACM SIGMOD Rec. 47, 2 (2018), 29–40.
[39] Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, and Ion Stoica. 2016. Time-evolving graph processing at scale. In Proc. 4th Int. Workshop on Graph Data Management Experiences and Systems. 1–6.
[40] Kyoungmin Kim, In Seo, Wook-Shin Han, Jeong-Hoon Lee, Sungpack Hong, Hassan Chafi, Hyungyu Shin, and Geonhwa Jeong. 2018. TurboFlux: A Fast Continuous Subgraph Matching System for Streaming Graph Data. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 411–426.
[41] Christoph Koch, Yanif Ahmad, Oliver Kennedy, Milos Nikolic, Andres Nötzli, Daniel Lupei, and Amir Shaikhha. 2014. DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views. VLDB J. 23, 2 (2014), 253–278.
[42] Srdjan Komazec, Davide Cerri, and Dieter Fensel. 2012. Sparkwave: continuous schema-enhanced pattern matching over RDF data streams. In Proc. 6th Int. Conf. Distributed Event-Based Systems. 58–68.
[43] André Koschmieder and Ulf Leser. 2012. Regular path queries on large graphs. In SSDBM12. 177–194.
[44] Jürgen Krämer and Bernhard Seeger. 2009. Semantics and implementation of continuous sliding window queries over data streams. ACM Trans. Database Syst. 34, 1 (2009), 1–49.
[45] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In
Proc. ACM SIGMOD Int. Conf. on Management of Data. 239–250. https://doi.org/10.1145/2723372.2742788
[46] Pradeep Kumar and H Howie Huang. 2019. GraphOne: A data store for real-time analytics on evolving graphs. In Proc. 17th USENIX Conf. on File and Storage Technologies. 249–263.
[47] Pradeep Kumar and H Howie Huang. 2020. GraphOne: A Data Store for Real-time Analytics on Evolving Graphs. ACM Trans. Storage 15, 4 (2020), 1–40.
[48] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. 2011. A native and adaptive approach for unified processing of linked streams and linked data. In Proc. 10th Int. Semantic Web Conf.
Proc. 35th Int. Conf. on Data Engineering. IEEE Press, 1082–1093.
[50] Leonid Libkin and Domagoj Vrgoč. 2012. Regular path queries on graphs with data. In Proc. 15th Int. Conf. on Database Theory. 74–85.
[51] Ling Liu and M. Tamer Özsu (Eds.). 2009. Encyclopedia of Database Systems. Springer.
[52] Mengmeng Liu, N.E. Taylor, Wenchao Zhou, Z.G. Ives, and Boon Thau Loo. 2009. Recursive Computation of Regions and Connectivity in Networks. In Proc. 25th Int. Conf. on Data Engineering. 1108–1119. https://doi.org/10.1109/ICDE.2009.36
[53] Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-driven synchronous processing of streaming graphs. In Proc. 14th ACM SIGOPS/EuroSys European Conf. on Comp. Syst.
Proc. 15th Int. Semantic Web Conf.
Proc. VLDB Endowment 13, 10 (2020), 1793–1806.
[56] Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential Dataflow. In Proc. 6th Biennial Conf. on Innovative Data Systems Research.
[57] Alberto O Mendelzon and Peter T Wood. 1995. Finding regular simple paths in graph databases. SIAM J. on Comput. 24, 6 (1995), 1235–1258.
[58] Vera Zaychik Moffitt and Julia Stoyanovich. 2017. Temporal graph algebra. In Proc. 16th Int. Symposium on Database Programming Languages. 1–12.
[59] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In Proc. 24th ACM Symp. on Operating System Principles. 439–455.
[60] Hung Q Ngo, Christopher Ré, and Atri Rudra. 2014. Skew strikes back.
ACM SIGMOD Rec. 42, 4 (Feb 2014), 5–16. https://doi.org/10.1145/2590989.2590991
[61] Milos Nikolic and Dan Olteanu. 2018. Incremental view maintenance with triple lock factorization benefits. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 365–380.
[62] Anil Pacaci, Angela Bonifati, and M. Tamer Özsu. 2020. Regular Path Query Evaluation on Streaming Graphs. In Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM, 1415–1430. https://doi.org/10.1145/3318464.3389733
[63] Anil Pacaci, Alice Zhou, Jimmy Lin, and M. Tamer Özsu. 2017. Do We Need Specialized Graph Databases?: Benchmarking Real-Time Social Networking Applications. In Proc. 5th Int. Workshop on Graph Data Management Experiences and Systems. Article 12, 7 pages. https://doi.org/10.1145/3078447.3078459
[64] Ashwin Paranjape, Austin R Benson, and Jure Leskovec. 2017. Motifs in temporal networks. In Proc. 10th ACM Int. Conf. Web Search and Data Mining. 601–610.
[65] Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and Jingren Zhou. 2018. Real-time constrained cycle detection in large dynamic graphs. Proc. VLDB Endowment 11, 12 (2018), 1876–1888.
[66] Juan L Reutter, Miguel Romero, and Moshe Y Vardi. 2017. Regular queries on graph databases. Theory of Comput. Syst. 61, 1 (2017), 31–83.
[67] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. 2020. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. VLDB J.
29 (2020), 595–618. https://doi.org/10.1007/s00778-019-00548-x
[68] Semih Salihoglu and Nikolay Yakovets. 2019. Graph query processing. Encyclopedia of Big Data Technologies (2019), 890–898.
[69] Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L Willke, Jeffrey Young, Matthew Wolf, and Karsten Schwan. 2016. Graphin: An online high performance incremental graph processing framework. In European Conference on Parallel Processing. Springer, 319–333.
[70] S. Seufert, A. Anand, S. Bedathur, and G. Weikum. 2013. FERRARI: Flexible and efficient reachability range assignment for graph indexing. In Proc. 29th Int. Conf. on Data Engineering. 1009–1020. https://doi.org/10.1109/ICDE.2013.6544893
[71] Feng Sheng, Qiang Cao, Haoran Cai, Jie Yao, and Changsheng Xie. 2018. GraPU: Accelerate streaming graph analysis through preprocessing buffered updates. In Proc. 9th ACM Symp. on Cloud Computing. 301–312.
[72] Jiao Su, Qing Zhu, Hao Wei, and Jeffrey Xu Yu. 2016. Reachability querying: can it be even faster? IEEE Trans. Knowl. and Data Eng. 29, 3 (2016), 683–697.
[73] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm@twitter. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 147–156.
[74] Tolga Urhan and M.J. Franklin. 2000. XJoin: A Reactively-Scheduled Pipelined Join Operator. Q. Bull. IEEE TC on Data Eng. 23 (2000), 27.
[75] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. 2016. PGQL: a property graph query language. In Proc. 4th Int. Workshop on Graph Data Management Experiences and Systems. 7.
[76] Sarisht Wadhwa, Anagh Prasad, Sayan Ranu, Amitabha Bagchi, and Srikanta Bedathur. 2019. Efficiently Answering Regular Simple Path Queries on Large Labeled Networks. In
Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD ’19). New York, NY, USA, 1463–1480. https://doi.org/10.1145/3299869.3319882
[77] A.N. Wilschut and P.M.G. Apers. 1991. Dataflow query execution in a parallel main-memory environment. In Proc. 1st Int. Conf. on Parallel and Distributed Information Systems. 68–77.
[78] Peter T Wood. 2012. Query languages for graph databases. ACM SIGMOD Rec. 41, 1 (2012), 50–60.
[79] Nikolay Yakovets, Parke Godfrey, and Jarek Gryz. 2016. Query planning for evaluating SPARQL property paths. In Proc. ACM SIGMOD Int. Conf. on Management of Data. 1875–1889.
[80] Yuke Yang, Lukasz Golab, and M. Tamer Özsu. 2017. ViewDF: Declarative Incremental View Maintenance for Streaming Data. Inf. Syst. 71 (2017), 55–67. https://doi.org/10.1016/j.is.2017.07.002
[81] Hilmi Yildirim, Vineet Chaoji, and Mohammed J. Zaki. 2010. GRAIL: scalable reachability index for large graphs. Proc. VLDB Endowment 3, 1 (2010), 276–284. Issue 1-2. http://dl.acm.org/citation.cfm?id=1920841.1920879