A+ Indexes: Tunable and Space-Efficient Adjacency Lists in Graph Database Management Systems
Amine Mhedhbi, Pranjal Gupta, Shahid Khaliq, Semih Salihoglu
University of Waterloo {amine.mhedhbi, pranjal.gupta, shahid.khaliq, semih.salihoglu}@uwaterloo.ca
ABSTRACT
Graph database management systems (GDBMSs) are highly optimized to perform very fast joins of vertices by indexing the neighbourhoods of vertices in adjacency list indexes. However, existing GDBMSs have system-specific and fixed adjacency list index structures, which makes each system highly efficient on only a fixed set of workloads. We describe a highly flexible and lightweight indexing sub-system for GDBMSs, coupled with materialized view capability, that we call A+ indexes. A+ indexes comprise three components. Default A+ indexes give users the flexibility to index neighbourhoods of vertices using arbitrary nested secondary partitioning and sorting criteria. This allows users to optimize a system for a variety of workloads with no or minimal memory overheads. Secondary vertex- and edge-bound A+ indexes are, respectively, views over edges and 2-paths. Edge-bound indexes partition views over 2-paths by edge IDs and store the neighbourhoods of edges instead of vertices. Our secondary indexes are designed to have a very lightweight implementation based on a technique we call offset lists. A+ indexes allow a wider range of applications to benefit from GDBMSs' fast join capabilities. We demonstrate the flexibility, efficiency, and low memory overheads of A+ indexes through extensive experiments on a variety of applications.
1. INTRODUCTION
The term graph database management system (GDBMS) in its contemporary usage refers to data management software such as Neo4j [33], JanusGraph [24], TigerGraph [46], and GraphflowDB [25, 32] that adopt the property graph data model [34]. In this model, application data is represented as a set of vertices, which represent the entities in the application, directed edges, which represent the connections between entities, and arbitrary key-value properties on the vertices and edges. GDBMSs have lately gained popularity among a wide range of applications, from fraud detection and risk assessment in financial services to recommendations in e-commerce and social networks [41].

One reason GDBMSs appeal to users is that they are highly optimized to perform very fast joins of vertices. While systems use traditional B+ trees to access vertices with certain properties in scan operations, join operators access neighbourhoods of vertices through adjacency list indexes [11]. Adjacency list indexes are constant-depth data structures that partition the edge records into lists by source or destination vertex IDs, and sometimes also by additional criteria, e.g., the labels of the edges, and provide very fast access to neighbourhoods of vertices. Some systems further sort these lists according to some properties. This contrasts with tree-based indexes, such as B+ trees, which have logarithmic depth in the size of the data they index.
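To make the constant-depth access concrete, the following is a minimal sketch (not any particular system's implementation) of a forward adjacency-list index in compressed-sparse-row (CSR) form: edges are partitioned by source vertex ID, so a vertex's neighbourhood is one offset lookup away, independent of the graph's size.

```python
def build_forward_csr(num_vertices, edges):
    """edges: list of (src, dst) pairs. Returns (offsets, neighbours),
    where neighbours[offsets[v]:offsets[v+1]] is v's forward list."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    neighbours = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        neighbours[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbours

def forward_list(offsets, neighbours, v):
    # Constant-depth access: one slice bounded by two offsets.
    return neighbours[offsets[v]:offsets[v + 1]]
```

A tree-based index would instead need a number of node traversals logarithmic in the number of indexed edges to reach the same list.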
Although GDBMSs give their users the flexibility to index vertices in different B+ trees, they do not provide any flexibility in indexing the neighbourhoods of vertices. Specifically, GDBMSs make different but fixed choices about the partitioning and sorting criteria of their adjacency list indexes, making each system highly efficient on only a fixed set of workloads. This creates physical data dependence, as users have to model their data, e.g., pick their edge labels, according to the fixed partitioning and sorting criteria of their systems.

We address the following question: How can the fast join capabilities of GDBMSs be expanded to a much wider set of workloads? We describe a highly flexible and lightweight adjacency list indexing sub-system for GDBMSs, coupled with materialized view capability, that we call
A+ indexes. We observe that lists in existing adjacency list indexes are effectively local views over the edges that have fast access paths and are used by systems to evaluate queries. We first give users flexibility in selecting the partitioning and sorting criteria of the system's default A+ indexes, which provides access to a wider set of local views over the edges. Then, we support defining two types of global views: (i) views over edges that satisfy arbitrary predicates, which are stored in secondary vertex-bound A+ indexes; and (ii) views over 2-paths, which are stored in secondary edge-bound A+ indexes and extend the notion of neighbourhood from vertices to edges. We describe a very lightweight implementation of our secondary indexes through a technique we call offset lists, which often take one or two bytes per indexed edge.

We next review the adjacency list indexes of existing systems and then give an overview of A+ indexes. Figure 1 shows an example financial graph that we use as a running example. The graph contains vertices with Customer and Account labels. Customer vertices have name properties, and Account vertices have city and accountType (acc) properties. From customers to accounts are edges with Owns (O) labels, and between accounts are transfer edges with Dir-Deposit (DD) and Wire (W) labels, with amount (amt), currency, and date properties. We omit dates in the figure and give each transfer edge an ID ti such that ti.date < tj.date if i < j.

GDBMSs perform joins of vertices with join operators, such as the EXPAND operator in Neo4j or EXTEND/INTERSECT in GraphflowDB. GDBMSs employ two broad techniques to provide fast access to adjacency lists while performing these joins:

(1) Partitioning: Every GDBMS partitions its edges first by their source or destination vertex IDs, respectively in forward and backward indexes. We call this the primary partitioning criteria.

EXAMPLE 1. Consider the following query, written in openCypher [38], that finds 2-paths starting from a vertex with name "Alice". Below, ai and rj are variables for, respectively, the query vertices and query edges.

MATCH a1-[r1]->a2-[r2]->a3
WHERE a1.name = 'Alice'

In every GDBMS we know of, this query is evaluated in three steps: (1) scan the vertices, find a vertex with name "Alice", and match a1 (in our example graph, v7 would match a1); (2) access v7's forward adjacency list, often with one lookup, to match a1→a2 edges; and (3) access the forward lists of the matched a2's to match a1→a2→a3 paths.

Some GDBMSs employ secondary partitioning on these lists, e.g., Neo4j [33] further partitions each vertex's list by edge labels. This allows accessing more granular lists in constant or close to constant time without running any predicates.

EXAMPLE 2. Consider the following query that returns all Wire transfers made from the accounts Alice Owns:

MATCH a1-[r1:O]->a2-[r2:W]->a3
WHERE a1.name = 'Alice'

The "r1:O" is syntactic sugar in Cypher for the r1.label = Owns predicate. A system with lists partitioned by vertex IDs and edge labels can evaluate this query as follows. First, find v7, with name "Alice", and then access v7's Owns edges, often with a constant number of lookups and without running any predicates, to match the a2's. Finally, access the Wire edges of each a2 to match the a3's.
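The label-based secondary partitioning can be sketched as follows (a minimal nested-dictionary stand-in with hypothetical names, not GraphflowDB's actual layout): each vertex's forward list is further partitioned by edge label, so, e.g., only the Owns edges of a vertex are fetched, with no label predicate evaluated at query time.

```python
from collections import defaultdict

def build_label_partitioned_index(edges):
    """edges: (edge_id, src, dst, label) tuples. Builds a forward
    index whose per-vertex lists are further partitioned by label."""
    index = defaultdict(lambda: defaultdict(list))
    for eid, src, dst, label in edges:
        # One constant-depth level per partitioning criterion:
        # first by source vertex, then by edge label.
        index[src][label].append((eid, dst))
    return index
```

With this structure, `index["v7"]["O"]` returns only v7's Owns edges; the Wire edges of each matched account are reached the same way.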
(2) Final List Sorting: Some systems further sort their most granular lists according to an edge property [24] or the IDs of the neighbours in the lists [4, 32]. Sorting enables systems to access parts of lists in time logarithmic in the size of the lists. Similar to major and minor sorts in traditional indexes, partitioning and sorting keep the edges in a sorted order, allowing systems to use fast intersection-based join algorithms, such as worst-case optimal (WCO) joins [36, 37] or sort-merge joins.

A+ Index      | Global View       | Pr. Part. | Stored Lists
Default       | Edges             | vertex ID | ID Lists
Secondary VB  | σ(Edges)          | vertex ID | Offset Lists
Secondary EB  | σ(Edges ⋈ Edges)  | edge ID   | Offset Lists

Table 1: Three types of A+ indexes. VB, EB, and Pr. Part. stand for vertex-bound, edge-bound, and primary partitioning, respectively. All indexes allow nested secondary partitioning on categorical properties of the indexed adjacent edges and neighbours. The partitioning criteria determine the final local views that each list stored in an index corresponds to. In addition, all indexes allow sorting on the indexed adjacent edges' and neighbours' properties.

EXAMPLE 3. Consider the following query that finds all cyclical wire transfers with 3 edges involving Alice's account v1:

MATCH a1-[r1:W]->a2-[r2:W]->a3, a1-[r3:W]->a3
WHERE a1.ID = v1

In systems that implement WCO joins, such as EmptyHeaded [4] or GraphflowDB [32], this query is evaluated by scanning each v1→a2 Wire edge and intersecting the pre-sorted Wire lists of v1 and a2 to match the a3 vertices.

To provide very fast access to each list, lists are accessed through data structures that have constant depth, instead of the logarithmic depths of traditional tree-based indexes. This is achieved by having one level in the index for each partitioning criterion, so levels in the index are not constrained to have a fixed order, e.g., k as in a k-ary tree.
This makes GDBMSs very fast when accessing the appropriate neighbourhoods of vertices as they perform certain joins. However, existing GDBMSs adopt fixed, system-specific partitioning and possibly sorting criteria, which has two main shortcomings: (1) users need to model their data, e.g., pick vertex and edge labels, considering the system's physical design decisions, creating physical data dependence; and (2) systems can provide fast joins for only the workloads that have equality predicates on the system-specific properties that are used as partitioning and sorting criteria.

Table 1 summarizes the high-level properties of the three components of A+ indexes, which we next review.

Default A+ Indexes: Instead of system-specific fixed criteria, users can provide arbitrary nested partitioning and sorting on the system's default indexes. The edges are then indexed in a nested compressed-sparse-row-like data structure that has as many levels as there are partitioning criteria. The most granular lists are then sorted according to the given sorting criteria. Figure 2a shows an example default A+ index on our running example that has two nested partitioning levels on top of the primary vertex ID partitioning: (i) by edge labels; and (ii) by the currency property. This allows the system to provide fast joins for workloads that access edges satisfying other equality predicates, without data remodelling and with no or negligible memory costs, leading to significant performance gains in some settings.

Secondary A+ Indexes: We observe that each level of existing indexes identifies a sub-list, which is effectively a view over edges that is limited to satisfying a different set of equality predicates (determined by the partitioning criteria). Figure 2a shows the nested sub-lists with different types of boxes.
For example, the red dashed box corresponds to the view σ_{srcID=v1}(E), while the blue dash-dotted box corresponds to σ_{srcID=v1 & e.label=Owns}(E), where E is the set of all edges. Therefore, query processors of existing systems effectively evaluate queries using views when they probe adjacency lists to access these lists. We refer to the views that sub-lists correspond to as local views.

Figure 2: Example A+ indexes on our running example. (a) An example default A+ index and a secondary vertex-bound A+ index. (b) An example secondary edge-bound A+ index.

One can also think of the entire adjacency list index as one global view, which in Figure 2a is simply the set of edges, shown in a solid grey box. To support access to a larger set of views, we first allow users to define two other types of global views that are indexed in our secondary A+ indexes.
Our choice of the views we support allows us to provide a lightweight implementation with an appropriate partitioning, which we discuss momentarily.

(i) Global views over the edges that satisfy arbitrary predicates, such as transfers above a given amount, which are stored in secondary vertex-bound A+ indexes and partitioned primarily by vertex IDs.

(ii) Global views over 2-paths, which are stored in secondary edge-bound A+ indexes and partitioned primarily by edge IDs. Edge-bound indexes extend the notion of adjacency from vertices to edges, which is highly beneficial for some applications.

EXAMPLE 4. Consider the following query, which is the core of an important class of queries in financial fraud detection (based on communication with developers of GaussDB [18], a DBMS supporting a graph data model and used by multiple financial institutions in production):

MATCH a1-[r1]->a2-[r2]->a3-[r3]->a4
WHERE r1.eID = t13, r1.date < r2.date, r2.date < r3.date,
      r1.amt > r2.amt, r2.amt > r3.amt

Starting from a specific transfer t13, the query searches for paths in which each next transfer (Wire or Dir-Deposit) happens at a later date and for a smaller amount, with a cut of at most α, simulating money flowing through the network with intermediate hops taking cuts.

The predicates of this query compare properties of an edge on a path with the previous edge on the same path. A system matches r1 to t13, which is from vertex v2 to v5. Existing systems have to read transfer edges from v5 and filter those that have a later date value than t13 and also have the appropriate amount value. Instead, when the next query edge to match, r2, has predicates depending on the query edge r1, these queries can be evaluated much faster if adjacency lists are partitioned by edge IDs: a system can directly access the forward adjacency list of t13, i.e., the edges whose srcID is v5 that satisfy the predicates on the amount and date properties depending on t13, and perform the extension. Our edge-bound indexes allow the system to generate plans that perform this much faster processing.
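The edge-bound lookup described above can be sketched as follows: a destination-forward index keyed by a bound edge's ID that stores only the adjacent edges of its destination vertex satisfying the inter-edge predicates. This is a minimal sketch under assumed property names (`date`, `amt`), not the paper's storage layout.

```python
def build_destination_forward_index(edges):
    """edges: dict of edge_id -> (src, dst, props). For each bound
    edge e_b = (v_s, v_d), stores the IDs of adjacent edges e_adj
    leaving v_d with e_b.date < e_adj.date and e_b.amt > e_adj.amt."""
    by_src = {}
    for eid, (src, _dst, _props) in edges.items():
        by_src.setdefault(src, []).append(eid)
    index = {}
    for eid, (_src, dst, props) in edges.items():
        # Neighbourhood of an edge: qualifying forward edges of its
        # destination vertex.
        index[eid] = [adj for adj in by_src.get(dst, [])
                      if props["date"] < edges[adj][2]["date"]
                      and props["amt"] > edges[adj][2]["amt"]]
    return index
```

At query time, a plan extends from t13 by reading `index["t13"]` directly, instead of scanning all of v5's transfers and filtering them.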
Lightweight Offset Lists: Storing secondary A+ indexes requires data duplication and consumes extra memory. Our approach to addressing the memory footprint of secondary indexes is based on two important observations: (1) each list in every secondary index is a subset of some list in the default index, which is a result of our design of secondary indexes; and (2) each list in the default indexes contains a very small number of edges, which is a result of the sparsity, i.e., small average degrees, of real-world graphs. Therefore, instead of duplicating the globally identifiable edge and neighbour IDs that need to be stored in the ID lists of default indexes, secondary indexes can be stored with much smaller, often one- or two-byte, list-level identifiable pointers into the system's default lists. We refer to these lists as offset lists. Figures 2a and 2b show example offset lists for a secondary vertex-bound and an edge-bound A+ index, respectively. When join operators read actual edges through offset lists, accesses are made to non-consecutive but very close memory locations, achieving high CPU cache locality. We demonstrate that secondary vertex-bound and edge-bound indexes can have very small memory overheads, as low as a few percentage points, making them highly practical.

2. A+ INDEXES

This section describes our indexing sub-system. There are three types of indexes in our indexing sub-system: (i) default A+ indexes; (ii) secondary vertex-bound A+ indexes; and (iii) secondary edge-bound A+ indexes. Each index, both in our solution and in existing systems, stores a set of adjacency lists, each of which stores a set of edges. We refer to the edges that are stored in the lists as adjacent edges, and the vertices that adjacent edges point to as neighbour vertices. So in vertex-ID-partitioned lists, neighbours refer to destination vertices in forward indexes and source vertices in backward indexes. We next give an overview of each index.
We cover the lightweight implementation of secondary A+ indexes in Section 3.

Default A+ indexes are the primary, and by default the only, indexes in our indexing sub-system. These indexes are required to contain each edge in the graph; otherwise the system would not be able to answer some queries. Similar to the adjacency lists of existing systems, there are two default indexes, one forward and one backward, partitioned primarily by the source and destination vertex IDs of the edges, respectively. In our implementation in GraphflowDB, by default we adopt secondary partitioning by edge labels and sorting according to the IDs of the neighbours, which optimizes the system for queries with edge labels and for matching cyclic subgraphs using intersection-based join plans. However, unlike existing systems, users can reconfigure the secondary partitioning and sorting criteria of the system's defaults. This reconfiguration has no or very minor memory costs and can make the system significantly faster on a variety of workloads. As we explain in Sections 2.2.1 and 2.2.2 momentarily, the default indexes are also the reference indexes to which secondary indexes point.

Default A+ indexes can contain nested secondary partitioning criteria on any categorical property of adjacent edges as well as neighbour vertices, such as edge or neighbour vertex labels, or the currency property on the edges in our running example. In our implementation we allow integers, or enums that are mapped to a small number of integers, as categorical values. Because graph data is not structured, not all edges need to contain the properties on which the secondary partitionings happen. Edges with null property values form a special partition. Each provided secondary partitioning adds one new layer to the index, storing offsets to a particular slice of the next layer, where the last level contains the final list containing the neighbourhood of one vertex (because the primary partitioning is by vertex ID).
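A minimal sketch of this layered structure follows, using a plain nested dictionary as a stand-in for the CSR-like layout, with partitioning by edge label and then currency and final lists sorted by the neighbour's city, as in Figure 2a.

```python
from collections import defaultdict

def build_default_aplus(edges, city_of):
    """edges: (edge_id, src, dst, label, currency) tuples; city_of
    maps a vertex to its city. One layer per secondary partitioning
    criterion; the most granular lists are sorted by neighbour city."""
    idx = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for eid, src, dst, label, currency in edges:
        idx[src][label][currency].append((eid, dst))
    # Apply the sorting criterion to each most granular list.
    for per_label in idx.values():
        for per_currency in per_label.values():
            for final_list in per_currency.values():
                final_list.sort(key=lambda e: city_of[e[1]])
    return idx
```

With this structure, `idx[v]["W"]["USD"]` yields v's USD wire transfers, sorted by the destination account's city, without evaluating any predicate.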
We refer to the final lists in the default indexes as ID lists, as they store the IDs of the edges and neighbour vertices. This is effectively a nested compressed-sparse-row-like compact physical storage that ensures that the full neighbourhood of each vertex is stored consecutively in memory.

EXAMPLE 5. Consider querying all wire transfers made in USD currency from Alice's account and the destination accounts of these transfers:

MATCH a1-[r1:O]->a2-[r2:W]->a3
WHERE a1.name = 'Alice', r2.currency = USD

Here, the query plans of existing systems that partition by edge labels will read all Wire edges from Alice's account and, for each edge, read its currency property and run a predicate to verify whether or not it is in USD. Instead, if queries with equality predicates on the currency property are important and frequent for an application, users can reconfigure their default A+ indexes to provide a secondary partitioning based on currency:

RECONFIGURE DEFAULT INDEX
PARTITION BY e_adj.label, e_adj.currency
SORT BY v_nbr.city

In index creation and modification commands, we use the reserved keywords e_adj and v_nbr to refer to adjacent edges and neighbours, respectively. The above command will reconfigure the adjacency lists to have two levels of secondary partitioning, first by the edge labels and then by the currency property of these edges, which will point to sub-lists that are sorted by the city property of the neighbour vertices (discussed momentarily). Figure 2a shows the final physical design this generates on our running example. For the query in Example 5, the system's join operator can now directly access these more granularly partitioned lists, first by Wire and then by USD, as it accesses Alice's neighbourhood, without running any predicates.

Note that each level of the index can be used to access a different sub-list in the final lists, i.e., a set of edges that satisfy different properties.
For example, the more granularly partitioned lists in Figure 2a still allow access to only the Wire edges of a vertex, because the part of the ID list that contains Wire edges is still contiguous and its offsets can be found by inspecting the offsets in the first and second levels of the index.

The most granular sub-lists can be sorted according to one or more arbitrary properties of the adjacent edges or neighbour vertices, e.g., the date property of transfer edges and the city property of the Account vertices of our running example. Similar to partitioning, edges with null values on the properties on which the sorting is performed are ordered last. Secondary partitioning and sorting criteria together store the neighbourhoods of vertices in a particular sort order, allowing a system to generate intersection-based join plans for a wider set of queries.

EXAMPLE 6. Consider the following query that searches for a three-branched money transfer tree, consisting of wire and direct deposit transfers, emanating from an account with vID v5 and ending in three sink accounts in the same city:

MATCH a1-[:W]->a2-[:W]->a3, a1-[:W]->a4,
      a1-[:DD]->a5-[:DD]->a6
WHERE a1.ID = v5, a3.city = a4.city = a6.city

If Wire and Dir-Deposit lists are partitioned or sorted by city, as in Figure 2a, then after matching a1→a2 and a1→a5, a plan can directly intersect the two Wire lists of a1 and a2 and the one Dir-Deposit list of a5 in a single operation to find the flows that end up in accounts in the same city. Such plans are not possible with the adjacency list indexes of existing systems.

Observe that the ability to reconfigure the system's default A+ indexes provides more physical data independence. Users do not have to model their datasets according to the system's default physical design, and changes in the workloads can be addressed simply with index reconfigurations. We will demonstrate the benefits of this flexibility and the minor memory overheads of default index reconfigurations in our evaluations.
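The single-operation intersection that such plans rely on can be sketched as a merge-style multiway intersection over lists kept in a shared sort order (over sort keys such as neighbour city values); this is a generic sketch, not the system's actual operator.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two lists kept in the same sort
    order; runs in O(len(a) + len(b)) with no predicate evaluation."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_many(lists):
    """Multiway intersection, as in a single intersect operation over
    several sorted adjacency lists."""
    result = lists[0]
    for nxt in lists[1:]:
        result = intersect_sorted(result, nxt)
    return result
```

Keeping every list pre-sorted on the same criterion is what makes this merge possible; with unsorted lists, a system would instead need hashing or per-element predicate checks.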
Many indexes in DBMSs can be thought of as data structures that give fast access to views. In our context, each sub-list in the default indexes is effectively a local view over edges. For example, the dashed red box in Figure 2a is the σ_{srcID=v1 & e.label=Wire}(Edges) local view, while the dotted green box encloses a more granular local view corresponding to σ_{srcID=v1 & e.label=Wire & curr=USD}(Edges). One can also think of the entire index as indexing a global view, which for default indexes is simply the Edges table. Therefore the local views that can be obtained through the system's default A+ indexes are constrained to views over the edges that contain an equality predicate on the source or destination ID (due to vertex ID partitioning) and one equality predicate for each secondary partitioning criterion. To provide access to an even wider set of local views, a system should support more general materialized views and index these in adjacency list indexes. We next describe two types of secondary A+ indexes, which are coupled with two types of global views and two different ways of partitioning these views. These specific global views and their partitioning allow us to provide a lightweight implementation, which we describe in Section 3.

(i) Secondary vertex-bound A+ indexes index global views over edges with arbitrary predicates. These views are primarily partitioned by vertex IDs.

(ii) Secondary edge-bound A+ indexes index global views over 2-paths. These views are partitioned by edge IDs and effectively store neighbourhoods of edges.

In the rest of the paper, when a particular adjacency list is bound to a vertex, say v, or an edge, we refer to that vertex or edge as the bound vertex or bound edge, respectively.

Secondary vertex-bound indexes index global views over edges that contain arbitrary selection predicates. These views cannot contain other operators, such as group-bys, aggregations, or projections, so their outputs are a subset of the original edges.
The predicates used in these global views can depend on the bound vertex, the adjacent edge, or the neighbour vertex. Secondary vertex-bound A+ indexes store these global views with a primary partitioning on vertex IDs and the same partitioning and sorting flexibility provided in default A+ indexes. In order to use secondary vertex-bound A+ indexes, users first define the global view over the edges and then define the structure of the secondary vertex-bound A+ index.

EXAMPLE 7. Consider a fraud detection application that searches money flow patterns with high amounts of transfers, say over 10000 USD. We can create a secondary vertex-bound index that indexes those edges in lists, partitioned first by vertices and then possibly by other properties, and in a sorted manner as before:

CREATE EDGE VIEW LargeUSDTrnx
MATCH v_s-[e_adj]->v_d
WHERE e_adj.currency = USD, e_adj.amt > 10000
INDEX AS FW-BW
PARTITION BY e_adj.label
SORT BY v_nbr.ID

Above, v_s and v_d are keywords that refer to the source and destination vertices, whose properties can be accessed in the WHERE clause. FW and BW are keywords to build the index in the forward or backward direction, a partitioning option given to users. FW-BW indicates doubly indexing the edges in both the forward and backward directions. The most granular sub-lists of the resulting secondary vertex-bound A+ index effectively materialize a local view of the form σ_{srcID=* & e.label=* & curr=USD & amt>10000}(Edges). If such views, or views that correspond to other levels of the index, appear as part of the subgraph patterns, systems can directly access these views and avoid evaluating the predicates in these views.

Vertex-bound adjacency lists store the edges that are immediately in the neighbourhood of each vertex. Our edge-bound indexes extend the notion to define neighbourhoods of edges.
This can benefit applications in which the searched patterns concern relations between two adjacent, i.e., consecutive, edges, as in the money flow patterns from Example 4. Specifically, secondary edge-bound indexes index global views over 2-paths. As before, these views cannot contain other operators, such as group-bys, aggregations, or projections, so their outputs are a subset of the 2-paths. The view has to specify a predicate, and that predicate has to access properties of both edges in the 2-paths; we explain this requirement momentarily. Secondary edge-bound indexes store these global views with a primary partitioning on edge IDs and, as before, the same partitioning and sorting flexibility provided in default A+ indexes. There are three possible 2-paths, →→, →←, and ←←, which, when partitioned by different edges, give four unique ways in which an edge's neighbourhood can be defined:

(i) Destination-Forward: v_s-[e_b]->v_d-[e_adj]->v_nbr
(ii) Destination-Backward: v_s-[e_b]->v_d<-[e_adj]-v_nbr
(iii) Source-Forward: v_nbr-[e_adj]->v_s-[e_b]->v_d
(iv) Source-Backward: v_nbr<-[e_adj]-v_s-[e_b]->v_d

Here, e_b is the edge that the adjacency lists will be bound to, and v_s and v_d refer to the source and destination vertices of e_b, respectively. For example, we refer to the first 2-path view as destination-forward because partitioning those paths by e_b stores the forward edges of the destination vertex of the bound edge.

EXAMPLE 8. Consider creating an index for the sequence of adjacent edges in the money flow queries from Example 4:

CREATE 2PATH VIEW MoneyFlow
MATCH v_s-[e_b]->v_d-[e_adj]->v_nbr
WHERE e_b.date < e_adj.date, e_b.amt > e_adj.amt

3. LIGHTWEIGHT OFFSET LISTS

The predominant memory cost of default indexes is the storage of the IDs of the adjacent edges and neighbour vertices in the ID lists.
Because the IDs in these lists globally identify vertices and edges, their sizes need to be logarithmic in the number of edges and vertices in the graph, and they are often stored as 4- to 8-byte integers in systems. For example, in our implementation, edge IDs take 8 bytes and neighbour IDs take 4 bytes.

In our indexing sub-system, default A+ index reconfiguration does not result in data duplication, and its only memory overheads (or benefits) come from changes in the group values, which are minimal. In contrast, secondary indexes require extra memory and may have significant memory overheads. However, the lists in both secondary vertex-bound and edge-bound indexes have the important property that they are subsets of some list in the default adjacency lists: (i) a secondary vertex-bound list for v_i is a subset of v_i's default ID list; (ii) an edge-bound list for e_j = (v_s, v_d) is a subset of either v_s's or v_d's default ID list, depending on the direction of the index, e.g., v_d's list for a destination-forward list. Recall that in our compressed-sparse-row-like implementation of the default indexes, the final lists of each vertex, i.e., the sub-lists after secondary partitionings, are contiguous. Therefore, instead of storing (edge ID, neighbour ID) pairs, we can store offsets into an appropriate ID list. We call these lists offset lists.

The average size of the lists is proportional to the average degree in the graph, which is often very small, in the order of tens or hundreds, in many real-world graph datasets. This important property of real-world graphs has two advantages:

1. Offsets need only be list-level identifiable and take a small number of bytes. Specifically, in our implementation, offsets take less than two bytes on average. Naturally, the final memory consumption of ID and offset lists depends on other optimizations system designers make, such as ID compression schemes. For example, on many of our graphs, the edge IDs can be compressed to 4 or 5 bytes instead of 8, and the offsets in offset lists can be compressed to a few bits instead of a few bytes. Importantly, irrespective of these optimizations, globally identifiable IDs require sizes logarithmic in the number of edges and vertices in the graph (tens or hundreds of millions), while list-level identifiable offsets require sizes logarithmic in the average list sizes (tens or hundreds in many real-world graphs).

2. Reading the original edge and neighbour IDs through offset lists requires an indirection and leads to reading not-necessarily-consecutive locations in memory. However, because the list sizes are small, we still get very good CPU cache locality. We demonstrate this benefit momentarily.

We implement each secondary index in one of two possible ways, depending on whether the index contains any predicates and whether its secondary partitioning structure matches the secondary partitioning structure of the default A+ indexes.

• With no predicates and the same secondary partitioning: The lists of the secondary index store the same lowest-level local views as the lists of the default index, but in a different sort order. Therefore, we only store the offset lists of the index, which contain the same number of elements as the ID lists, and share the secondary partitioning layers with the default index. This effectively shares physical data structures across indexes and saves space. Figure 2a gives an example: the bottom offset lists are for a secondary vertex-bound index, which consists only of offset lists and no partitioning layers. Recall that since edge-bound indexes need to contain predicates between adjacent edges, this storage can only be used for vertex-bound indexes.
• With predicates or a different secondary partitioning: In this case, the local views of the secondary index are different from the local views of the default index, and we need to store new partitioning layers and an offset list layer, as shown in Figure 2b. This storage layout is used for edge-bound indexes as well as vertex-bound indexes that contain predicates.
We give the details of the memory page structures that store ID and offset lists in Section 4. In our evaluations, we demonstrate that for many applications, the memory footprint of secondary indexes can be very low, sometimes as low as a few percentage points.
We next address this question: how much slower is reading ID lists through offset list indirections compared to sequential reads if the IDs were copied over (thus requiring more memory)? We address this question in the context of an in-memory setting because our implementation is in an in-memory system. However, even in disk-based systems, lists are often brought to memory inside a single page and then read from memory during operations. So our offset list choice would primarily affect the speed of reading once an appropriate page is in memory. We performed the following demonstrative experiment. We took the popular and relatively large LiveJournal dataset, which contains 68M edges, and performed 5-hop enumeration queries from a random set of 100 source vertices. These queries form a stress test for our question because the main operation they perform is reading the IDs in adjacency lists and copying them over to tuples that are passed between operators. We kept the graph unlabelled, so added a single label to edges, and did not add any properties to the graph.
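The two storage layouts above share one core idea: a secondary list stores only positions into a default ID list. The following is a minimal sketch of that idea (our own illustration, not the system's code; names such as default_id_list and edge_time are made up, and the real implementation stores offsets in variable-length byte arrays and shares partitioning layers across indexes rather than using Python lists):

```python
# A default adjacency list of one vertex: (edge ID, neighbour ID) pairs,
# sorted on neighbour ID as in the system's default A+ index.
default_id_list = [(800, 2), (801, 5), (802, 9), (803, 11)]

# A hypothetical edge property we want a secondary sort order on.
edge_time = {800: 40, 801: 10, 802: 30, 803: 20}

# A secondary vertex-bound index stores only list-local offsets into
# the default ID list, here sorted on the time property. Offsets are
# bounded by the vertex's degree, not by |E|, so they fit in one or two
# bytes rather than the 4-8 bytes of globally identifiable IDs.
offset_list = sorted(range(len(default_id_list)),
                     key=lambda pos: edge_time[default_id_list[pos][0]])

# Reading through the indirection recovers the (edge ID, neighbour ID)
# pairs in time order without duplicating the IDs themselves.
by_time = [default_id_list[pos] for pos in offset_list]
print(by_time)  # [(801, 5), (803, 11), (802, 9), (800, 2)]
```

Here the offset list plays the role of a secondary vertex-bound index sorted on an edge property; with four elements, each offset fits in a single byte, whereas copying the IDs themselves would duplicate 12 bytes per pair.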
We then evaluated the queries in three different ways:
(i) Sequential: Sequentially reading the system's default ID lists. This forms a baseline for the best cache locality we can obtain in our storage.
(ii) List-level indirection: Reading the system's default ID lists through a vertex-bound index that sorts the edges in each list randomly. This achieves an indirection that is limited to within a list, so we expect good CPU cache locality.
(iii) Graph-level indirection: To form a baseline for very poor cache locality, we separately implemented a new index that shuffles the adjacency lists into a single list and provides an indirection to each edge and neighbour ID pair. This effectively simulates an indirection where the random reads are not constrained to a list but can span 68M edges.
Sequential reads took 6.7s/query, reads through list-level indirections took 12.4s/query, and reads through graph-level indirections took 63.3s/query. Therefore, even in this stress test, the list-level indirections were only 1.85x slower than reading directly from ID lists (and 5.1x faster than graph-level indirections). So despite reading through an indirection, we obtain very good cache locality. As a reference for comparing memory consumption, implementing a secondary index that copies over IDs would double the storage in this experiment. Instead, the overhead of our vertex-bound index in this experiment is 1.13x. This is a very reasonable memory vs. performance tradeoff, especially given that queries often perform other operations, e.g., reads of edge and vertex properties, aggregations, or predicate evaluations, for which the performance slowdown will be smaller, as those operations are not affected by this indirection.

4. IMPLEMENTATION DETAILS

4.1 Query Optimizer and Processor

A+ indexes are used in evaluating the subgraph pattern component of queries, which is where the queries' joins are described. We give an overview of the relevant join operators and the optimizer of the system.
Reference [32] describes the details of the EXTEND/INTERSECT operator and the dynamic programming join optimizer of the system in the absence of the A+ indexes sub-system.
Join Operators: EXTEND/INTERSECT (E/I) is the primary join operator of the system. Given a query Q(V_Q, E_Q) and an input graph G(V, E), let a partial k-match of Q be a set of vertices of V assigned to the projection of Q onto a set of k query vertices. We denote a sub-query with k query vertices as Q_k. E/I is configured to intersect z >= 1 adjacency lists that are sorted on neighbour IDs. The operator takes as input (k-1)-matches of Q, performs a z-way intersection, and extends them by a single query vertex to k-matches. For each (k-1)-match t, the operator intersects the z adjacency lists that are bound to the vertices of t and extends t with each vertex in the result of this intersection to produce k-matches. If z is one, no intersection is performed, and the operator simply extends t to each vertex in the adjacency list. The system uses E/I to generate plans that contain worst-case optimal join-style multiway intersections.
To generate plans that use A+ indexes, we extended E/I to take adjacency lists that can be bound to edges as well as vertices. We then added a variant of E/I that we call the MULTI-EXTEND operator, which performs intersections of adjacency lists that are sorted by properties other than neighbour IDs and extends partial matches by more than one query vertex. Specifically, the operator is configured with z >= 2 adjacency lists to intersect and takes partial (k-z)-matches as input. For each (k-z)-match t, the operator intersects the z adjacency lists that are bound to either the edges or vertices of t and produces k-matches. This allows us to have intersection-based query plans also for structurally acyclic queries.
Dynamic Programming Optimizer: GraphflowDB has a dynamic programming-based join optimizer.
For each k = 1, ..., m = |V_Q|, in order, the optimizer finds the lowest-cost plan for each sub-query Q_k in two ways: (i) by considering extending every possible sub-query Q_{k-1}'s (lowest-cost) plan by an E/I operator; and (ii) if Q has an equality predicate involving z >= 2 query edges, by considering extending smaller sub-queries Q_{k-z} by a MULTI-EXTEND operator. At each step, the optimizer considers the edge and vertex labels and other predicates together, since secondary A+ indexes may be storing local views that contain predicates other than edge label equality. When considering possible Q_{k-z} to Q_k extensions, the optimizer queries the INDEX STORE to find both vertex- and edge-bound indexes that can be used and that satisfy part or all of the predicates involved in the extension. Then, for each possible index combination retrieved, the optimizer enumerates a plan. After adding an E/I or MULTI-EXTEND operator, if there are any predicates that can be evaluated on Q_k but are not satisfied during the extension to Q_k by the local views used, the optimizer adds a FILTER operator (effectively pushing down filters).
The system's cost metric is intersection cost (i-cost), which is the total size of the adjacency lists that the system estimates will be accessed by the E/I and MULTI-EXTEND operators in a plan. The system uses a subgraph catalogue [32] that estimates the average lengths of different lists, e.g., the forward list of each vertex or the forward Wire list of each vertex. When estimating i-cost, if an adjacency list used in an extension contains a predicate p other than edge or vertex label equality, we multiply the average length returned by the subgraph catalogue by the estimated selectivity of p.
There is a second join operator, HASH JOIN, which takes two sets of partial matches and hash-joins them on their common query vertices. HASH JOIN does not use A+ indexes, so it is not relevant to our work in this paper.
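To make the core operation of these operators concrete, the following is a minimal sketch of a z-way intersection over adjacency lists sorted on neighbour IDs (our own illustration, not the system's code; intersect_sorted is a name we made up, and the real operators run over bound adjacency list pages rather than Python lists):

```python
def intersect_sorted(lists):
    """z-way intersection of adjacency lists sorted on neighbour ID.

    One cursor per list is advanced toward the current maximum value,
    as in worst-case optimal join-style multiway intersections."""
    if not lists:
        return []
    cursors = [0] * len(lists)
    result = []
    # Stop as soon as any list is exhausted: a value missing from one
    # list cannot be in the intersection.
    while all(c < len(lst) for c, lst in zip(cursors, lists)):
        vals = [lst[c] for c, lst in zip(cursors, lists)]
        hi = max(vals)
        if all(v == hi for v in vals):
            result.append(hi)
            cursors = [c + 1 for c in cursors]
        else:
            # Advance every cursor whose value lags behind the maximum.
            cursors = [c + (lst[c] < hi) for c, lst in zip(cursors, lists)]
    return result

# Extending a partial match: intersect the sorted lists bound to the
# already-matched vertices to find candidates for the next query vertex.
print(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 5, 9], [2, 3, 9]]))  # [3, 9]
```

The total number of cursor advances is proportional to the sizes of the lists touched, which matches the intuition behind the i-cost metric.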
We implemented an INDEX STORE component that stores both the predicate and the sorting criterion of each A+ index in the system. Every A+ index in the system, its type, secondary partitioning, and sorting criteria, as well as the additional predicates of secondary indexes, are maintained in the INDEX STORE. The INDEX STORE is queried by the system's optimizer to find possible extensions of Q_{k-z} sub-queries to Q_k. Specifically, the optimizer asks for the existence of possible vertex- or edge-bound indexes, e.g., a vertex-bound index that satisfies edge label and currency equality predicates. The INDEX STORE inspects the predicates that are satisfied by the local views corresponding to each secondary partitioning level of each index and returns all indexes that can be used.
Default and secondary vertex-bound A+ indexes are implemented using the same data structure, which groups vertices into groups of 64 and allocates one data page for each group. Vertex IDs are assigned consecutively starting from 0, so given an ID, with a division and a mod operation we can access the first secondary partitioning level of a vertex. The nested secondary partitioning is implemented using a CSR-like format, which points to either ID lists in the case of the default A+ indexes or offset lists in the case of secondary A+ indexes. The neighbour vertex and edge ID lists are stored as 4 byte integer and 8 byte long arrays, respectively. In contrast, offset lists are stored as byte arrays by default. Offsets are variable-length, and we encode all offsets in an offset list with the maximum number of bytes needed for any offset in the list. This encoding size is stored as a single-byte header at the beginning of each offset list.
Edge-bound indexes are partitioned by edge IDs, but access to the list of an edge requires not only the edge ID but also either the source or destination vertex ID of the edge that the offset lists will point to.
For example, if e's offset list points to v's list, we need both e and v to access e's list. This vertex ID is always part of the intermediate tuple t that will be extended and contains e. Specifically, we store all the edges that point to the ID list of a vertex v_i in a single page, which is accessed by v_i's ID. All edge IDs in this page form the first partitioning layer on the page. The reason for this design is that, when updates arrive and, say, v's ID list gets updated, we need to find all the edge ID lists that need to be updated, so we can directly use v's ID to access all these edge-bound lists.
Each vertex-bound data page, storing ID lists or offset lists, is accompanied by an update buffer. Each edge addition e = (u, v) is first applied to the update buffers for u's and v's pages. Then we go over each vertex-bound A+ index I in the INDEX STORE. If I's global view contains a predicate p, we first apply p to see if e passes the predicate. If so, or if I does not contain a predicate, we update the necessary update buffers for the offset list pages of u and/or v. The update buffers' sizes are by default 20% of the sizes of their data page buffers, and they are merged into the actual data pages when the buffer is full. Edge deletions are handled by adding a "tombstone" for the location of the deletion until a merge is triggered.
Maintenance of an edge-bound A+ index EB is more involved. For an edge insertion e = (u, v), we perform two separate operations. First, we check whether e should be inserted into the adjacency list of an existing edge e_b by running the predicate p of EB on e and e_b. For example, if EB is defined as Destination-Forward, we loop through all the backward adjacent edges of u using the system's default vertex-bound index. This is equivalent to running two delta-queries, as described in references [6, 25], for a continuous 2-path query.
Second, we create a new list for e and loop through another set of adjacency lists (in our example, v's forward adjacency list in D) and insert edges into e's list.

5. EVALUATION

The goal of our experiments is two-fold. First, we demonstrate the flexibility and efficiency of our indexing sub-system on three very different popular applications that GDBMSs support: (i) labelled subgraph queries; (ii) recommendations; and (iii) financial fraud detection. Existing adjacency list indexes of systems are not optimized to perform very fast joins on any of these applications. By either reconfiguring the system's default indexes or using new ones, we improve the performance of the system significantly, with low memory overheads. Second, we evaluate the performance and memory overhead tradeoffs offered by different A+ indexes on these workloads. We also present experiments benchmarking our index maintenance performance and baseline comparisons against Neo4j [33] and TigerGraph [46], which are two popular commercial GDBMSs.
We use a single machine that has two Intel E5-2670 @2.6GHz CPUs and 512 GB of RAM. The machine has 16 physical cores and 32 logical cores. For all experiments, we use a single physical core. We set the maximum JVM heap size to 500GB. Table 2 shows the datasets used. Our datasets include social, web, and Wikipedia knowledge graphs, which have a variety of graph topologies and sizes ranging from several million edges to over a hundred million edges. A dataset G, denoted as G_{i,j}, has i and j randomly generated vertex and edge labels, respectively. We omit i and j when they are set to 1. We use query workloads drawn from real-world applications: (i) edge- and vertex-labelled subgraph queries; (ii) the MagicRecs recommendation engine from Twitter [19]; and (iii) fraud detection in financial transaction networks. The details of these applications and queries are explained in subsequent sections.
We first demonstrate the benefit and overhead tradeoff of reconfiguring default A+ indexes in two different ways: (i) by only changing the sorting criteria; and (ii) by adding a new secondary partitioning. We used a popular subgraph query workload in graph processing that consists of labelled subgraph queries. In all systems work we are aware of, this workload consists of one of two variants: (a) queries with only edge labels [26, 32]; or (b) queries with only vertex labels [3, 9, 17, 22, 30, 45]. We take a natural and common third variant where both edges and vertices have labels. We followed the data and subgraph query generation methodology from several prior works [10, 22, 32]. We took the 14 queries from reference [32] (omitted due to space reasons), which contain acyclic and cyclic queries with dense and sparse connectivity, with up to 7 vertices and 21 edges. For each query, we fixed the vertex and edge labels. We picked the number of labels for each dataset to ensure that queries would take time in the order of seconds to several minutes. Then we ran GraphflowDB on our workload on each of our datasets under three index configurations:
(i) D: the system's default configuration, where edges are partitioned by edge labels and sorted by neighbour IDs.
(ii) D_s: keeps D's secondary partitioning but sorts edges first by neighbour vertex labels and then on neighbour IDs.
(iii) D_p: keeps D's sorting criteria and edge label partitioning but adds a new secondary partitioning on neighbour vertex labels.
Table 3 shows our results. First, observe that D_s outperforms D on all of the 52 settings, by up to 10.38x, and without any memory overheads, as D_s simply changes the sorting criteria of the indexes. Next, observe that by adding an additional partitioning level on D, the joins get even faster consistently across all queries, e.g., SQ improves from 2.36x to 3.84x on Ork, as the system can directly access edges with a particular edge label and neighbour label using D_p.
In contrast, under D_s, the system performs binary searches inside lists to access the same set of edges. Even though D_p is a reconfiguration, and so does not index new edges, it still has minor memory overheads, ranging from 1.05x to 1.15x, because of the cost of storing the new partitioning layer. This demonstrates the flexibility A+ indexes give users to optimize the system to perform much faster on a different workload without any data remodelling, and with no or little memory overhead. Note that the consistent performance improvements we gain through reconfiguration also demonstrate that index reconfiguration does not hinder the quality of the plans our optimizer generates.
We next study the tradeoffs offered by secondary vertex-bound indexes. We use two sets of workloads drawn from real-world applications that benefit from using both the system's default A+ indexes and a secondary vertex-bound A+ index. Our two applications highlight two separate benefits users get from vertex-bound A+ indexes: (i) decreasing the amount of predicate evaluation; and (ii) allowing the system to generate new WCO-style join plans that are not possible with the default indexes only.
In this experiment, we take a set of queries drawn from the MagicRecs workload described in reference [19]. MagicRecs is a recommendation engine that was developed at Twitter and that looks for the following pattern: for a given user a, it searches for users a_1, ..., a_k that a has started following recently, and finds their common followers. These common followers are then recommended to a. We vary k; the resulting queries, MR_1, ..., MR_3, are shown in Figure 3. These queries have a time predicate on the edges starting from a, which can benefit from indexes that sort on time. The second and third queries are also structurally cyclic, so they can benefit from sorting on neighbour IDs, which is the default sorting order of our default A+ indexes. We evaluate our queries on all of our datasets on two index configurations.
First is the system's default A+ indexes D, as before. Second is:
(i) D+VB_t: adds a new secondary vertex-bound index VB_t in the forward direction that: (i) has the same secondary partitioning as the default forward A+ indexes, and so shares the same partitioning layers as the default index; and (ii) sorts the most granular sub-lists on the time property of edges.
Q14 is omitted as the query contained very few output tuples.
Table 3: Runtime (in seconds) and memory usage in MBs (Mm) evaluating subgraph queries using three different index configurations: D, D_s, and D_p introduced in Section 5.2.
Figure 3: MagicRec (MR) workload queries.
In our queries, we set the value of α in the time predicate to have a 5% selectivity.
For MR, on datasets LJ and Ork, we fix the query vertex a to 10000 and 7000 vertices, respectively, for the queries to run within a reasonable time. Table 4 shows our results. First, observe that despite indexing all of the edges again, our secondary index has only a 1.08x memory overhead, because it stores lightweight offset lists and shares the partitioning layers. Creating a separate index would have given overheads closer to 50% (recall that VB_t is only a forward index). In return, we see up to 11.3x performance benefits. We note that our system uses exactly the same plans under both index configurations: plans that start by reading a, extend to its neighbours, and finally perform a multiway intersection (except for MR, where this is followed by a simple extension). The only difference is that under D+VB_t the first set of extensions requires fewer predicate evaluations, because the plan accesses a's adjacency list in VB_t, which is sorted on time. Overall, this memory-performance tradeoff demonstrates that with the minimal overheads of an additional index, users obtain significant performance benefits on applications like MagicRecs that require fast response time.
Table 4: Runtime (in seconds) and memory usage in MBs (Mm) evaluating the MagicRec queries using index configurations D and D+VB_t introduced in Section 5.3.1.
We next evaluate the benefit and overhead tradeoff of secondary vertex-bound indexes on an application where a secondary vertex-bound index can allow the system to generate new WCO join-style query plans that are not in the plan space of the system with default indexes. We take a set of queries drawn from cyclic fraudulent money flows that have been reported in prior literature [40], as well as acyclic patterns that contain the money flow paths from our running examples.
Figure 4 shows our queries MF_1, ..., MF_5. As an example, MF searches for a cyclical flow that starts and ends in the same chequing account, where two of the accounts in the path are in the same city. We will focus on MF_1 to MF_4 for now and use MF_5 in the next section. These four queries have equality conditions on the city property of the vertices, so they can benefit from multiway intersections on city. We evaluate these queries on two index configurations. First is the system's default A+ indexes D, as before. Second is:
(i) D+VB_c: adds a new secondary vertex-bound index VB_c in both forward and backward directions that: (i) has the same secondary partitioning as the default A+ indexes; and (ii) sorts the most granular lists on the neighbour's city property.
Figure 4: Money flow (MF) workload queries.
Figure 5: Plan for MF from Figure 4c with WCO join-style intersection using two VB_c indexes and one EB_c index introduced in Sections 5.3.2 and 5.4.
(i) D+VB_c+EB_c: adds the edge-bound index from Example 8 in Section 2.2.2. We change the grouping to be on v_adj.acc instead of edge labels and add the predicate e_b.amt < e_nbr.amt + α.
We set the α "intermediate cut" value in our examples to have 5% selectivity. Table 5 shows our results. First, we observe that the addition of EB_c only allows new plans to be generated for three of the queries, so we report numbers only for those queries. We see improvements ranging from 8.99x to 72.2x for a 2.22x memory overhead. The performance improvements are primarily due to producing significantly more efficient plans that use the 2-path views in the EB_c index. For example, the system can now generate a new query plan for MF. The plan is shown in Figure 5.
The plan evaluates the query as follows: (1) scan a nodes; (2) backward extend to match a's; (3) use MULTI-EXTEND to perform an intersection, using a's list in VB_c twice and e's list in EB_c. This is a highly complex plan that uses a mix of vertex- and edge-bound indexes and performs a 3-way intersection on a custom vertex property, and it is automatically generated by our system. Such plans are not in the plan spaces of any system we are aware of.
The improvements and tradeoffs we report are examples and demonstrative in nature. The actual benefits and overheads of our indexes will naturally vary across workloads. For example, we have set the selectivity of the α parameter in the money flow patterns to 5% in our evaluations. This selectivity can have a significant impact on the actual benefits and overheads of edge-bound indexes. To demonstrate this, we took the simplest money flow query, which just searches for a single-step flow from a vertex, varied α between a very small selectivity of 0.05% and a very high selectivity of 25%, and built an accompanying edge-bound index. For each selectivity, we benchmarked using the edge-bound index to evaluate the query it satisfies and compared against the performance of plans that use the system's default index. Table 6 shows our results. As expected, the lower the selectivity: (1) the higher the performance benefits; and (2) the lower the memory footprint. For example, when the selectivity is 0.05%, the benefits are as high as 35.6x while the memory overheads are only 1.28x.
At a very high 25% selectivity, the benefits decrease to 16.7x while the overheads increase to 3.45x (albeit indexing 12.5x more edges than stored in the default A+ indexes).
Table 5: Runtime (in seconds) and memory usage (MB) evaluating the money flow queries using index configurations D, D+VB_c, and D+VB_c+EB_c introduced in Section 5.3.2. The runtime speedups and memory sizes shown in parentheses are in comparison to D.
Table 6: Runtime and memory comparing index configurations D and D+EB_c introduced in Section 5.4. We report the number of indexed edges and memory (Mm) of D+EB_c and compare it to D, of size 1612 MB.
We next benchmark the maintenance speed of each type of A+ index on a micro-benchmark. We use two datasets, LJ and Brk. We load 50% of the dataset from the MagicRec application, insert the remaining 50% of the edges one at a time, and evaluate the speed of 5 index configurations, each requiring progressively more maintenance work: (i) D_s has no partitioning and sorts by the adjacent vertices' IDs; (ii) D_p partitions each adjacency list on the adjacent edge's label; (iii) D_ps sorts each partition in D_p by the adjacent vertices' IDs; (iv) D_ps+VB_t creates a secondary adjacency list index on time for D_ps; and finally (v) D_ps+EB_t: an edge-bound adjacency list index with the same grouping and sorting as VB_t for the query v_s<-[e_b]-v_d-[e_adj]->v_adj, with predicate e_b.time < e_adj.time + α, which has a 1% selectivity. We ran our benchmark on LJ and Brk.
Using a single thread, we were able to maintain the following update rates per second (reported respectively for LJ and Brk): 1.203M and 2.108M for D_s, 1.024M and 1.892M for D_p, 1.081M and 1.832M for D_ps, 706K and 1.691M for D_ps+VB_t, and 41K and 110K for D_ps+EB_t. Our update rate gets slower with additional complexity, but we are able to maintain insert rates of between 41K-110K edges/s for our edge-bound index and between 706K-2.1M edges/s for our vertex-bound indexes, using a single thread.
We finish our evaluation by presenting GraphflowDB's performance against two popular GDBMSs. These experiments do not demonstrate the benefits and overheads of A+ indexes and are presented for completeness of our work and for interested readers.
For interested readers, we note that our guiding design principle in GraphflowDB is to implement it as a read-optimized system that takes bulk data ingests, instead of fast streams of transactional writes. This is informed by a recent user survey of GDBMSs [41], in which we observed that GDBMSs are rarely the transactional stores in enterprises. Instead, they are systems that often ingest data from relational systems to develop read-heavy applications that require complex and fast join processing.
Table 7: Graphflow (GF), TigerGraph (TG), and Neo4j (N4) runtimes in secs. Graphflow adjacency lists are grouped and sorted on edge and vertex labels, respectively. TL indicates > 30 mins.
We ran four of our labelled subgraph queries on LJ and WT on Neo4j and TigerGraph, using their default configurations, and using the D_p configuration from Section 5.2 for GraphflowDB. Table 7 shows our results. GraphflowDB was faster on all queries except Q13 on WT, where TigerGraph was faster. SQ13 is a long 5-edge path query.
We are unaware of any publication that describes TigerGraph's internals, but our email communication with TigerGraph developers indicates that the system is highly optimized for long path queries.

6. RELATED WORK

We reviewed adjacency lists in existing GDBMSs in our introductory section. We first review the Kaskade [16] query optimization framework, which also uses materialized graph views. Then, we review related work in three areas: (i) indexes in RDF systems, another important class of DBMSs that support a graph-structured data model; (ii) adjacency lists in graph analytics systems; and (iii) indexes for queries other than the subgraph queries with predicates that we consider in this work.
As we observed earlier, our use of A+ indexes during query processing can be thought of as a specific form of query processing using views. There is a rich literature on using views to answer queries. We do not review this literature here and refer the reader to reference [20] for a survey.
Kaskade [16] (KSK) is a graph query optimization framework that uses materialized graph views to speed up query evaluation. Specifically, KSK takes as input a query workload Q and an input graph G. Then, KSK enumerates possible views for Q, which are other graphs G' that contain a subset of the vertices in G and edges that can represent multi-hop connections in G. For example, if G is a data provenance graph with job and file vertices, and consumes and produces relationships between jobs and files, a graph view G' could store only the job vertices and their dependencies through files if some queries only need these 2-hop relationships. KSK is a framework that selects a set of views for a workload, materializes them in Neo4j, and then translates queries over G to the appropriate graphs (views) stored in Neo4j, which is the final system that answers queries. Therefore, the overall framework is limited by Neo4j's adjacency lists.
There are significant differences between the views A+ indexes effectively provide access to and KSK's views. First, KSK's views are based on "constraints" that are mined from G's schema based only on vertex/edge labels and not properties. For example, KSK can mine "job vertices connect to jobs in 2-hops but not to file vertices" constraints, but not "accounts connect to accounts in 2-hops with later dates and lower amounts", which is the predicate in our edge-bound index (let alone partitioning these 2-hops by edge IDs). Therefore, KSK cannot enumerate a useful view for our money flow queries. Second, KSK views do not support flexible groupings, predicates, or sorting, and are only vertex ID partitioned (because graphs in Neo4j are only vertex ID partitioned). Finally, the overall framework is limited by Neo4j's query processor, which does not support WCO-style plans.
Indexes in RDF Systems: RDF systems support the RDF data model, in which data is represented as a set of (subject, predicate, object) triples. Because each triple can be seen as a labelled edge between a subject and an object, RDF is a graph-structured model. Prior work has introduced numerous architectural approaches to developing RDF systems, such as: (1) using an underlying existing RDBMS [1, 2, 8, 12]; (2) storing and then indexing one large triple table [35, 48]; and (3) developing a system based on native graph storage, such as adjacency lists [51]. A comprehensive review of the designs of these systems is beyond the scope of our paper, and we refer the reader to reference [39] for a survey of these approaches. These systems have different designs to further index these tables or their adjacency lists. For example, RDF-3X [35] indexes an RDF dataset in multiple B+ tree indexes for the six possible sort orders. As another example, the gStore system encodes vertices in fixed-length bit strings that capture information about the neighbourhoods of vertices.
These encodings are then stored in an index called the VS*-tree, which is used to prune parts of the graph during query processing. Similar to the GDBMSs we reviewed, these works also define fixed indexes for RDF triples. A+ indexes instead give users flexibility by providing a mechanism for deciding which edges to index in adjacency lists, so that they can tailor a GDBMS to the requirements of their workloads.

Indexes in Graph Analytics Systems: There are numerous graph analytics systems [7, 13, 23, 31, 42] that are designed for batch analytics, such as decomposing a graph into connected components. These systems use native graph storage formats, such as adjacency lists or sparse matrices. Work in this space generally focuses on optimizing the physical layout of the edges in memory. For systems storing edges in adjacency list structures, a common technique is to store them in a compressed sparse row (CSR) format [11], which we used in implementing the secondary partitioning in A+ indexes. References [15, 42] study CSR-like partitioning techniques for large lists, and reference [50] proposes segmenting a graph stored in a CSR-like format to achieve better cache locality. This line of work is complementary to ours: within the scope of this paper, we did not study how to optimize the adjacency lists with which we implemented A+ indexes, and alternative physical storage structures can be used to store the edges in our A+ indexes. Finally, for analytics in the distributed setting, there is a large body of work on different ways of partitioning adjacency lists to reduce communication between workers performing graph analytics. We do not review this work here and refer the reader to reference [27] for an introductory overview.
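For readers unfamiliar with CSR, the following minimal Python sketch (our own illustration of the general layout, not the paper's implementation) shows the core idea: all adjacency lists are packed into one contiguous neighbours array, and an offsets array of length |V|+1 marks where each vertex's list begins, so a vertex's neighbourhood is a constant-time slice:

```python
def build_csr(num_vertices, edges):
    """edges: list of (src, dst) pairs with vertex IDs in [0, num_vertices).
    Returns (offsets, neighbours) in compressed sparse row layout."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # offsets[v] is the start of v's list; offsets[v + 1] is one past its end.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    neighbours = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot in each vertex's list
    for src, dst in edges:
        neighbours[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbours

def neighbours_of(offsets, neighbours, v):
    # A vertex's adjacency list is the contiguous slice between its offsets.
    return neighbours[offsets[v]:offsets[v + 1]]

offsets, nbrs = build_csr(3, [(0, 1), (0, 2), (2, 0)])
print(neighbours_of(offsets, nbrs, 0))  # [1, 2]
```

Secondary partitioning can reuse this layout by ordering the packed neighbours array by the partitioning criterion and keeping one offsets array per partition level.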
Indexes for Advanced Queries: Prior work has introduced advanced indexes for several classes of queries that we do not consider in our work:

Complex subgraph queries: Much prior algorithmic work on evaluating subgraph queries [14, 28, 29] has proposed auxiliary indexes that index subgraphs more complex than edges, such as paths, stars, or cliques. This line of work effectively demonstrates that indexing such subgraphs can speed up subgraph query evaluation. It is worth mentioning that A+ indexes can be generalized to index more complex subgraphs that are partitioned by vertices or edges. Within the scope of our work, we designed A+ indexes to bring enough flexibility to the applications we are aware of that use GDBMSs, without this additional complexity. Other algorithmic work focuses on evaluating highly complex subgraph queries, e.g., those that contain up to hundreds of vertices and edges. Some of these works, such as CFL [10], DP-iso [21], and TurboISO [22], develop query-specific auxiliary indexes. These indexes are often more complex than A+ indexes and are tightly designed for a specific subgraph matching algorithm. Existing GDBMSs do not process queries with those algorithms and instead rely on traditional relational operators, such as joins, filters, and scans. Therefore, it seems difficult to integrate these indexes into GDBMSs without changing their query processors. Instead, as we demonstrated, our A+ indexes are easy to integrate into existing GDBMSs. Whether one can decompose these complex queries into traditional DBMS operators to integrate these complex algorithms into GDBMSs is an interesting research direction.

Indexes for Recursive Queries: Several works also develop specialized indexes for recursive queries, such as shortest paths, reachability, and regular path queries (RPQs).
We designed A+ indexes for (fixed) subgraph queries with arbitrary predicates, so the indexes proposed in this line of work are less related to ours, and we omit a detailed review of them. As an example, using landmark vertices is a popular technique that has been used for shortest paths [5, 43], reachability [49], as well as RPQs [47]. Intuitively, these indexes store paths to a set of central vertices in the graph and use these indexed paths during query processing.

7. CONCLUSION

Ted Codd, the inventor of the relational model, criticized the GDBMSs of his time as being restrictive because they only performed a set of "predefined joins" [44], which causes physical data dependence and contrasts with relational systems that can join arbitrary tables on arbitrary columns with the same data type. This is indeed still true to a good extent for contemporary GDBMSs, which are designed to join vertices only with their neighbourhoods, which are predefined to the system as edges. However, this is also precisely the major appeal of GDBMSs, as they are highly optimized to perform these joins very fast, primarily by using adjacency list indexes to store input edges. Our work was motivated by the shortcoming that existing GDBMSs do not provide any flexibility in their adjacency list structures, so that a wider range of applications could benefit from their fast join capabilities. As a solution, we described a new indexing sub-system, A+ indexes, which is coupled with a limited set of materialized views that are conducive to a very lightweight implementation. We described our design and implementation of A+ indexes, demonstrated their flexibility, and evaluated the performance and memory tradeoffs they offer on a variety of applications drawn from popular real-world applications that use GDBMSs.

REFERENCES

[1] D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management. The VLDB Journal, 18(2), 2009.
[2] D. J.
Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, 2007.
[3] E. Abdelhamid, I. Abdelaziz, P. Kalnis, Z. Khayyat, and F. Jamour. ScaleMine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph. In SC, 2016.
[4] C. R. Aberger, A. Lamb, S. Tu, A. Nötzli, K. Olukotun, and C. Ré. EmptyHeaded: A Relational Engine for Graph Processing. TODS, 42(4), 2017.
[5] T. Akiba, Y. Iwata, and Y. Yoshida. Fast Exact Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling. In ACM SIGMOD, 2013.
[6] K. Ammar, F. McSherry, S. Salihoglu, and M. Joglekar. Distributed Evaluation of Subgraph Queries Using Worst-case Optimal and Low-Memory Dataflows. PVLDB, 11(6), 2018.
[7] Apache Giraph. https://giraph.apache.org.
[8] Apache Jena. https://jena.apache.org.
[9] B. Bhattarai, H. Liu, and H. H. Huang. CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching. In ACM SIGMOD, 2019.
[10] F. Bi, L. Chang, X. Lin, L. Qin, and W. Zhang. Efficient Subgraph Matching by Postponing Cartesian Products. In ACM SIGMOD, 2016.
[11] A. Bonifati, G. H. L. Fletcher, H. Voigt, and N. Yakovets. Querying Graphs. Morgan & Claypool Publishers, 2018.
[12] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an Efficient RDF Store Over a Relational Database. In ACM SIGMOD, 2013.
[13] A. Buluç and J. R. Gilbert. The Combinatorial BLAS: Design, Implementation, and Applications. International Journal of High Performance Computing Applications, 25(4), 2011.
[14] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast Graph Pattern Matching. In ICDE, 2008.
[15] F. Claude and G. Navarro. Extended Compact Web Graph Representations. In Algorithms and Applications: Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday. Springer-Verlag, 2010.
[16] J. M. F. da Trindade, K. Karanasos, C. Curino, S. Madden, and J. Shun.
Kaskade: Graph Views for Efficient Graph Analytics. CoRR, abs/1906.05162, 2019.
[17] M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. GraMi: Frequent Subgraph and Pattern Mining in a Single Large Graph. PVLDB, 7(7), 2014.
[18] GaussDB. https://e.huawei.com/en/solutions/cloud-computing/big-data/gaussdb-distributed-database.
[19] P. Gupta, V. Satuluri, A. Grewal, S. Gurumurthy, V. Zhabiuk, Q. Li, and J. Lin. Real-time Twitter Recommendation: Online Motif Detection in Large Dynamic Graphs. PVLDB, 7(13), 2014.
[20] A. Y. Halevy. Answering Queries Using Views: A Survey. The VLDB Journal, 2001.
[21] M. Han, H. Kim, G. Gu, K. Park, and W.-S. Han. Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together. In ACM SIGMOD, 2019.
[22] W.-S. Han, J. Lee, and J.-H. Lee. TurboISO: Towards Ultrafast and Robust Subgraph Isomorphism Search in Large Graph Databases. In ACM SIGMOD, 2013.
[23] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: A DSL for Easy and Efficient Graph Analysis. In ASPLOS, 2012.
[24] JanusGraph. https://janusgraph.org.
[25] C. Kankanamge, S. Sahu, A. Mhedhbi, J. Chen, and S. Salihoglu. Graphflow: An Active Graph Database. In ACM SIGMOD, 2017.
[26] K. Kim, I. Seo, W.-S. Han, J.-H. Lee, S. Hong, H. Chafi, H. Shin, and G. Jeong. TurboFlux: A Fast Continuous Subgraph Matching System for Streaming Graph Data. In ACM SIGMOD, 2018.
[27] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Addison-Wesley, 1994.
[28] L. Lai, L. Qin, X. Lin, and L. Chang. Scalable Subgraph Enumeration in MapReduce. PVLDB, 8(10), 2015.
[29] L. Lai, L. Qin, X. Lin, Y. Zhang, L. Chang, and S. Yang. Scalable Distributed Subgraph Enumeration. PVLDB, 10(3), 2016.
[30] L. Lai, Z. Qing, Z. Yang, X. Jin, Z. Lai, R. Wang, K. Hao, X. Lin, L. Qin, W. Zhang, Y. Zhang, Z. Qian, and J. Zhou. Distributed Subgraph Matching on Timely Dataflow. PVLDB, 12(10), 2019.
[31] G. Malewicz, M. H.
Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In ACM SIGMOD, 2010.
[32] A. Mhedhbi and S. Salihoglu. Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins. PVLDB, 12(11), 2019.
[33] Neo4j. https://neo4j.com.
[34] Neo4j Property Graph Model. https://neo4j.com/developer/graph-database, 2019.
[35] T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. PVLDB, 1(1), 2008.
[36] H. Ngo, C. Ré, and A. Rudra. Skew Strikes Back: New Developments in the Theory of Join Algorithms. SIGMOD Record, 42(4), 2014.
[37] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. Worst-case Optimal Join Algorithms. In PODS, 2012.
[38] openCypher.
[39] M. T. Özsu. A Survey of RDF Data Management Systems. Frontiers of Computer Science, 10(3), 2016.
[40] X. Qiu, W. Cen, Z. Qian, Y. Peng, Y. Zhang, X. Lin, and J. Zhou. Real-Time Constrained Cycle Detection in Large Dynamic Graphs. PVLDB, 11(12), 2018.
[41] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey. The VLDB Journal, 2019.
[42] J. Shun and G. E. Blelloch. Ligra: A Lightweight Graph Processing Framework for Shared Memory. ACM SIGPLAN Notices, 48(8), 2013.
[43] C. Sommer. Shortest-Path Queries in Static Networks. ACM Computing Surveys, 46(4), 2014.
[44] Edgar F. ("Ted") Codd Turing Award Lecture. https://amturing.acm.org/award_winners/codd_1000892.cfm.
[45] C. H. C. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga. Arabesque: A System for Distributed Graph Mining. In SOSP, 2015.
[46] TigerGraph.
[47] L. D. J. Valstar, G. H. L. Fletcher, and Y. Yoshida. Landmark Indexing for Evaluation of Label-Constrained Reachability Queries. In ACM SIGMOD, 2017.
[48] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. PVLDB, 1(1), 2008.
[49] Y. Yano, T. Akiba, Y.
Iwata, and Y. Yoshida. Fast and Scalable Reachability Queries on Graphs by Pruned Labeling with Landmarks and Paths. In CIKM, 2013.
[50] Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. Making Caches Work for Graph Analytics. In IEEE Big Data, 2017.
[51] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao. gStore: A Graph-Based SPARQL Query Engine.