Adaptive Low-level Storage of Very Large Knowledge Graphs
Jacopo Urbani [email protected] Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Ceriel Jacobs [email protected] Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
ABSTRACT
The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structure. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads.

ACM Reference Format:
Jacopo Urbani and Ceriel Jacobs. 2020. Adaptive Low-level Storage of Very Large Knowledge Graphs. In
Proceedings of The Web Conference 2020 (WWW’20), April 20–24, 2020, Taipei, Taiwan.
ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3366423.3380246
Motivation.
Currently, a large wealth of knowledge is published on the Web in the form of interlinked Knowledge Graphs (KGs). These KGs cover different fields (e.g., biomedicine [19, 65], encyclopedic or commonsense knowledge [86, 92], etc.), and are actively used to enhance tasks such as entity recognition [80], query answering [96], or, more generally, Web search [15, 30, 31]. As the size of KGs keeps growing and their usefulness expands to new scenarios, applications increasingly need to access large KGs for different purposes. For instance, a search engine might need to query a KG using SPARQL [39], enrich the results using embeddings of the graph [64], and then compute some centrality metrics for ranking the answers [46, 87]. In such cases, the storage engine must not only efficiently handle large KGs, but also allow the execution of multiple types of computation so that the same KG does not have to be loaded in multiple systems.
Problem.
In this paper, we focus on providing an efficient, scalable, and general-purpose storage solution for large KGs on centralized architectures. A large amount of recent research has focused on distributed architectures [20, 27, 28, 33, 34, 52, 74, 79], because they offer many cores and a large storage space. However, these benefits
come at the price of higher communication cost and increased system complexity [71]. Moreover, sometimes distributed solutions cannot be used due to financial or privacy-related constraints. Centralized architectures, in contrast, do not have network costs, are commonly affordable, and provide enough resources to load all-but-the-largest graphs. Some centralized storage engines have demonstrated that they can handle large graphs, but they focus primarily on supporting one particular type of workload (e.g., Ringo [71] supports graph analytics, while RDF engines like Virtuoso [67] or RDFox [60] focus on SPARQL [39]). To the best of our knowledge, we still lack a single storage solution that can handle very large KGs as well as support multiple workloads.
Our approach.
In this paper, we fill this gap by presenting Trident, a novel storage architecture that can store very large KGs on centralized architectures, supports multiple workloads, such as SPARQL querying, reasoning, or graph analytics, and is resource-savvy. Therefore, it meets our goal of combining scalability and general-purpose computation.
We started the development of Trident by studying which are the most frequent access types performed during the execution of tasks like SPARQL answering, reasoning, etc. Some of these access types are node-centric (i.e., they access subsets of the nodes), while others are edge-centric (i.e., they access subsets of the edges). From this study, we distilled a small set of low-level primitives that can be used to implement more complex tasks. Then, the research focused on designing an architecture that supports the execution of these primitives as efficiently as possible, resulting in Trident.
At its core, Trident uses a dedicated data structure (a B+Tree or an in-memory array) to support fast access to the nodes, and a series of binary tables to store subsets of the edges. Since there can be many binary tables – possibly billions with the largest KGs – handling them with a relational DBMS can be problematic. To avoid this problem, we introduce a light-weight storage scheme where the tables are serialized on byte streams with only a little overhead per table. In this way, tables can be quickly loaded from secondary storage without expensive pre-processing, and offloaded in case the size of the database exceeds the amount of available RAM.
Another important benefit of our approach is that it allows us to exploit the topology of the graph to reduce its physical storage. To this end, we introduce a novel procedure that analyses each binary table and decides, at loading time, whether the table should be stored in a row-by-row, column-by-column, or cluster-based fashion. In this way, the storage engine effectively adapts to the input. Finally, we introduce other dynamic procedures that decide, at loading time, whether some tables can be ignored due to their small sizes or whether the content of some tables can be aggregated to further reduce the space.
Since Trident offers low-level primitives, we built interfaces to several engines (RDF3X [63], VLog [89], SNAP [50]) to evaluate the performance of SPARQL query answering, datalog reasoning, and graph analytics on various types of graphs. Our comparison against the state-of-the-art shows that our approach can be highly competitive in multiple scenarios.
Contribution.
We identify the following as the main contributions of this paper:
• We propose a new architecture to store very large KGs on a centralized system. In contrast to other engines that store the KG in a few data structures (e.g., relational tables), our architecture exhaustively decomposes the storage into many binary tables such that it supports both node- and edge-centric operations via a small number of primitives;
• Our storage solution adapts to the KG, as it uses different layouts to store the binary tables depending on its topology. Moreover, some binary tables are either skipped or aggregated to save further space. The adaptation of the physical storage is, as far as we know, a unique feature which is particularly useful for highly heterogeneous graphs, such as the KGs on the Web;
• We present an evaluation with multiple workloads, and the results indicate highly competitive performance while maintaining good scalability. In some of our largest experiments, Trident was able to load and process KGs with up to 10^11 (100B) edges with hardware that costs less than $5K.
The source code of Trident is freely available with an open source license at https://github.com/karmaresearch/trident, along with links to the datasets and instructions to replicate our experiments.

A graph G = (V, E, L, φ_V, φ_E) is a tuple where V, E, L represent the sets of nodes, edges, and labels respectively, φ_V is a bijection that maps each node to a label in L, while φ_E is a function that maps each edge to a label in L. We assume that there is at most one edge with the same label between any pair of nodes. Throughout, we use the notation r(s, d) to indicate the edge with label r from the node with label s (source) to the node with label d (destination). We say that the graph is undirected if r(s, d) ∈ E implies that also r(d, s) ∈ E. Otherwise, the graph is directed. A graph is unlabeled if all edges map to the same label. In this paper, we will mostly focus on labeled directed graphs since undirected or unlabeled graphs are special cases of labeled directed graphs.
In practice, it is inefficient to store the graph using the raw labels as identifiers. The most common strategy, which is the one we also follow, consists of assigning a numerical ID to each label in L, and storing each edge r(s, d) as the tuple ⟨ι_s, ι_r, ι_d⟩ where ι_s, ι_r, and ι_d are the IDs associated to s, r, and d respectively.
The numerical IDs allow us to sort the edges, and by permuting ι_s, ι_r, ι_d we can define six possible ordering criteria. We use strings of three characters over the alphabet {s, r, d} to identify these orderings, e.g., srd specifies that the edges are ordered by source, relation, and destination. We denote with R = {srd, sdr, ...} the collection of the six orderings, while R′ = {s, r, d, sr, rs, sd, ds, dr, rd} specifies all partial orderings. We use the function isprefix to check whether string a is a prefix of b, i.e., isprefix(a, b) = [true|false], and the operator − to remove all characters of one string from another one (e.g., if a = srd and b = sd, then a − b = r).
Let V be a set of variables. A simple graph pattern (or triple pattern) is an instance of (L ∪ V) × (L ∪ V) × (L ∪ V) and we denote it as (X, Y, Z) where X, Y, Z ∈ L ∪ V. A graph pattern is a finite set of simple graph patterns. Let σ : V → L be a partial function from variables to labels. With a slight abuse of notation, we also use σ as a postfix operator that replaces each occurrence of the variables in σ with the corresponding node.
Given the graph G and a simple graph pattern q, the answers for q on G correspond to the set ans(G, q) = {r(s, d) | r(s, d) ∈ E ∧ qσ = (s, r, d)}. The function bound(p) returns the positions of the labels in the simple graph pattern p left-to-right, i.e., if p = (X, a, b) where X ∈ V and a, b ∈ L, then bound(p) = rd.
A Knowledge Graph (KG) is a directed labeled graph where nodes are entities and edges establish semantic relations between them, e.g., ⟨Sadiq_Khan, mayorOf, London⟩. Usually, KGs are published on the Web using the RDF data model [45]. In this model, data is represented as a set of triples of the form ⟨subject, predicate, object⟩ drawn from (I ∪ B) × I × (I ∪ L) where I, B, L denote the sets of IRIs, blank nodes, and literals respectively. Let T = I ∪ B ∪ L be the set of all RDF terms. RDF triples can be trivially seen as a graph where the subjects and objects are the nodes, triples map to edges labeled with their predicate name, and L = T.
SPARQL [39] is a language for querying knowledge graphs which has been standardized by the W3C. It offers many SQL-like operators like UNION, FILTER, and DISTINCT to specify complex queries and to further process the answers. Every query contains at its core a graph pattern, which is called a
Basic Graph Pattern (BGP) in SPARQL terminology. SPARQL graph patterns are defined over T ∪ V, and their answers are mappings σ from V to T. Therefore, answering a SPARQL graph pattern P = {p_1, ..., p_|P|} over a KG G corresponds to computing ans(G, p_1) ∩ ... ∩ ans(G, p_|P|) and retrieving the corresponding labels.

Example 1. An example of a SPARQL query is:
SELECT ?s ?o { ?s isA ?o . ?s livesIn Rome . }
If the KG contains the RDF triples ⟨Eli, isA, Professor⟩ and ⟨Eli, livesIn, Rome⟩, then one answer to the query is {(?s → Eli, ?o → Professor)}.

We start our discussion with a description of the low-level primitives that we wish to support. We distilled these primitives considering four types of workloads:
SPARQL [39] query answering, since SPARQL is the most popular language for querying KGs; Rule-based reasoning [7], which is an important task in the Semantic Web to infer new knowledge from KGs; Algorithms for graph analytics, or network analysis, since these are widely applied on KGs either to study characteristics like the graph's topology or degree distribution, or within more complex pipelines; and Statistical relational models [64], which are effective techniques to make predictions using the KG as prior evidence.
If we take a closer look at the computation performed in these tasks, we can make a first broad distinction between edge-centric and node-centric operations. The first ones can be defined as operations that retrieve subsets of edges that satisfy some constraints. In contrast, operations of the second type retrieve various data about the nodes, like their degree. Some tasks, like SPARQL query answering, depend more heavily on edge-centric operations while others depend more on node-centric operations (e.g., random walks).

Name                        Output
f1:  lbl_n(G, n)            Label of node n (equals φ_V(n)).
f2:  lbl_e(G, e)            Label of edge e (equals φ_E(e)).
f3:  nodid(G, l)            ι_l, i.e., the ID of the node with label l.
f4:  edgid(G, l)            ι_l, i.e., the ID of the edge label l.
f5:  edg_srd(G, p)          ans(G, p) sorted by srd.
f6:  edg_sdr(G, p)          ans(G, p) sorted by sdr.
f7:  edg_drs(G, p)          ans(G, p) sorted by drs.
f8:  edg_dsr(G, p)          ans(G, p) sorted by dsr.
f9:  edg_rsd(G, p)          ans(G, p) sorted by rsd.
f10: edg_rds(G, p)          ans(G, p) sorted by rds.
f11: grp_s(G, p)            All s of ans(G, p).
f12: grp_r(G, p)            All r of ans(G, p).
f13: grp_d(G, p)            All d of ans(G, p).
f14: grp_{sr,sd}(G, p)      Aggr. (s, r) / (s, d) of ans(G, p).
f15: grp_{rs,rd}(G, p)      Aggr. (r, s) / (r, d) of ans(G, p).
f16: grp_{ds,dr}(G, p)      Aggr. (d, s) / (d, r) of ans(G, p).
f17: count(f5 | ... | f16)  Cardinality of f5, ..., f16.
f18: pos_srd(G, p, i)       i-th edge returned by edg_srd(G, p).
f19: pos_sdr(G, p, i)       i-th edge returned by edg_sdr(G, p).
f20: pos_drs(G, p, i)       i-th edge returned by edg_drs(G, p).
f21: pos_dsr(G, p, i)       i-th edge returned by edg_dsr(G, p).
f22: pos_rsd(G, p, i)       i-th edge returned by edg_rsd(G, p).
f23: pos_rds(G, p, i)       i-th edge returned by edg_rds(G, p).
Table 1: Graph primitives
Graph Primitives.
Following a RISC-like approach, we identified a small number of low-level primitives that can act as basic building blocks for implementing both node- and edge-centric operations. These primitives are reported in Table 1 and are described below.
f1–f4. These primitives retrieve the numerical IDs associated with labels and vice-versa. The primitives f1 and f2 retrieve the labels associated with nodes and edges respectively. The primitives f3 and f4 retrieve the numerical IDs associated with labels.
f5–f10. Function edg_ω(G, p) retrieves the subset of the edges in G that matches the simple graph pattern p and returns it sorted according to ω. Primitives in this group are particularly important for the execution of SPARQL queries since they encode the core operation of retrieving the answers of a SPARQL triple pattern.
f11–f16. This group of primitives returns an aggregated version of the output of f5–f10. For instance, grp_s(G, p) returns the list ⟨(x_1, c_1), ..., (x_n, c_n)⟩ of all distinct sources in the edges ans(G, p) with the respective counts of the edges that share them. Let D be a set of edges, A(x, D) = {r(x, d) ∈ D} and B(D) = {(s, c) | r(s, d) ∈ A(s, D) ∧ c = |A(s, D)|}. Then, grp_s(G, p) returns the list of all tuples in B(ans(G, p)) sorted by the numerical ID of the first field. The other primitives are defined analogously.
f17. This primitive returns the cardinality of the output of f5, ..., f16. This computation is useful in a number of cases: for instance, it can be used to optimize the computation of SPARQL queries by rearranging the join ordering depending on the cardinalities of the triple patterns, or to compute the degree of nodes in the graph.
f18–f23. These primitives return the i-th edge that would be returned by the corresponding primitives edg_*. In practice, this operation is needed in several graph analytics algorithms or for mini-batching during the training of statistical relational models.
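To fix ideas, the following C++ sketch shows one possible shape of these primitives as an interface; all type and method names here are ours for illustration, not Trident's actual API.

#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Id = uint64_t;                            // numerical ID of a label
struct Edge { Id s, r, d; };                    // <source, relation, destination>
struct Pattern { std::optional<Id> s, r, d; };  // set = constant, empty = variable
enum class Order { SRD, SDR, DRS, DSR, RSD, RDS };

// Hypothetical low-level interface mirroring Table 1 (f1 and f2 are collapsed
// into a single lbl method here for brevity).
class GraphStore {
public:
    virtual std::string lbl(Id id) = 0;               // f1-f2: ID => label
    virtual Id nodid(const std::string& label) = 0;   // f3: node label => ID
    virtual Id edgid(const std::string& label) = 0;   // f4: edge label => ID
    virtual std::vector<Edge> edg(const Pattern& p, Order o) = 0;  // f5-f10
    virtual std::vector<std::pair<Id, uint64_t>> grp_s(const Pattern& p) = 0;
                                                      // f11 (f12-f16 analogous)
    virtual uint64_t count(const Pattern& p) = 0;     // f17: cardinality
    virtual Edge pos(const Pattern& p, Order o, uint64_t i) = 0;   // f18-f23
    virtual ~GraphStore() = default;
};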
Figure 1: Architectural overview of Trident
Example 2.
We show how we can use the primitives in Table 1 to answer the SPARQL query of Example 1, assuming that the KG is called I.
• First, we retrieve the IDs of the labels isA, livesIn, and Rome. To this end, we can use the primitives f3 and f4.
• Then, we create two simple graph patterns p1 and p2 which map to the first and second triple patterns respectively. Then, we execute edg_rsd(I, p1) and edg_drs(I, p2) so that the edges are returned in an order suitable for a merge join (see the sketch below).
• We invoke the primitive f1 to retrieve the labels of the nodes which are returned by the join algorithm. These labels are then used to construct the answers of the query.

One straightforward way to implement the primitives in Table 1 is to store the KG in many independent data structures that provide optimal access for each function. However, such a solution would require a large amount of space, and updates would be slow. It is challenging to design a storage engine that uses fewer data structures without excessively compromising the performance. Moreover, KGs are highly heterogeneous objects where some subgraphs have a completely different topology than others. The storage engine should take advantage of this diversity and potentially store different parts of the KG in different ways, effectively adapting to its structure. This adaptation is lacking in current engines, which treat the KG as a single object to store.
Our architecture addresses these two problems with a compact storage layer that supports the execution of primitives f1, ..., f23 with a minimal compromise in terms of performance, and in such a way that the engine can adapt to the input KG, selecting the best strategy to store its parts.
Figure 1 gives a graphical view of our approach. It uses a series of interlinked data structures that can be grouped in three components. The first one contains data structures for the mappings ID ⇔ label. The second component is called edge-centric storage and contains data structures for providing fast access to the edges. The third one is called node-centric storage and offers fast access to the nodes. Section 4.1 describes these components in more detail. Section 4.2 discusses how they allow an efficient execution of the primitives, while Section 4.3 focuses on loading and updating the database.
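To make Example 2 concrete, the following sketch implements the merge join against the hypothetical GraphStore interface above; it is our illustration, not Trident's actual join code.

#include <utility>
#include <vector>

// Answer Example 1: join p1 = (?s, isA, ?o) and p2 = (?s, livesIn, Rome) on ?s.
// edg(p1, RSD) fixes the relation, so its answers are sorted by source;
// edg(p2, DRS) fixes destination and relation, so its answers are sorted by
// source as well. Both sides sorted on the join key enables a merge join.
std::vector<std::pair<Id, Id>> answer_example_query(GraphStore& g) {
    Pattern p1{{}, g.edgid("isA"), {}};                   // (?s, isA, ?o)
    Pattern p2{{}, g.edgid("livesIn"), g.nodid("Rome")};  // (?s, livesIn, Rome)
    std::vector<Edge> a1 = g.edg(p1, Order::RSD);
    std::vector<Edge> a2 = g.edg(p2, Order::DRS);
    std::vector<std::pair<Id, Id>> out;                   // (?s, ?o) ID bindings
    size_t i = 0, j = 0;
    while (i < a1.size() && j < a2.size()) {
        if (a1[i].s < a2[j].s) { ++i; }
        else if (a1[i].s > a2[j].s) { ++j; }
        else {
            Id s = a1[i].s;   // matching source: emit all its ?o values
            while (i < a1.size() && a1[i].s == s) { out.push_back({s, a1[i].d}); ++i; }
            while (j < a2.size() && a2[j].s == s) { ++j; }
        }
    }
    return out;   // finally, map the IDs back to labels with f1 (lbl)
}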
Dictionary.
We store the labels on a block-based byte stream on disk. We use one B+Tree called DICT_ι to index the mappings ID ⇒ label and another one called DICT_l for label ⇒ ID. Using B+Trees here is usual, so we will not discuss it further. It is important to note that assigning a certain ID to a term rather than another one might have a significant impact on the performance. For instance, Urbani et al. [88] have shown that a careful choice of the IDs can introduce important speedups due to the improved data locality. Typically, current graph engines assign unique IDs to all labels, irrespective of whether a label is used as an entity or as a relation. This is desirable for SPARQL query answering because all data joins can operate on the IDs directly. There are cases, however, where unique ID assignments are not optimal. For instance, most implementations of techniques for creating KG embeddings (e.g., TransE [16]) store the embeddings for the entities and the ones for the relations in two contiguous vectors, and use offsets in the vectors as IDs. If the labels for the relations share the IDs with the entities, then the two vectors must have the same number of elements. This is highly inefficient because KGs have many fewer relations than entities, which means that much space in the second vector will be unused. To avoid this problem, we can assign IDs to entities and relationships in an independent manner. In this way, no space is wasted in storing the embeddings. Note that Trident supports both global ID assignments and independent entity/relationship assignments, with an additional index specifically for the relation labels. The first type of assignment is needed for tasks like SPARQL query answering while the second is useful for operations like learning graph embeddings [64].
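A minimal sketch of such an independent assignment (class and method names are ours, and the dictionary is simplified to an in-memory map):

#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical dictionary that assigns entity and relation IDs independently,
// so each embedding vector can be indexed densely by ID without gaps.
class DualDictionary {
    std::unordered_map<std::string, uint64_t> entities, relations;
public:
    uint64_t entity_id(const std::string& label) {
        auto it = entities.find(label);
        if (it == entities.end())
            it = entities.emplace(label, entities.size()).first;  // dense 0..n-1
        return it->second;
    }
    uint64_t relation_id(const std::string& label) {
        auto it = relations.find(label);
        if (it == relations.end())
            it = relations.emplace(label, relations.size()).first;
        return it->second;
    }
};
// With this scheme, the embedding table for relations needs only as many rows
// as there are relation labels, instead of one row per term in the whole KG.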
Edge-centric storage.
In order to adapt to the complex and non-uniform topology of current KGs, we do not store all edges in a single data structure, but store subsets of the edges independently. These subsets correspond to the edges which share a specific entity/relation. More specifically, let us assume that we must store the graph G = (V, E, L, φ_V, φ_E). For each l ∈ L, we consider three types of subsets: E_s(l) = {r(l, d) ∈ E}, E_r(l) = {l(s, d) ∈ E}, and E_d(l) = {r(s, l) ∈ E}, i.e., the subsets of edges that have l as source, edge label, or destination respectively.
The choice of separating the storage of the various subsets allows us to choose the best data structure for a specific subset, but it hinders the execution of inter-table scans, i.e., scans where the content of multiple tables must be taken into account. To alleviate this problem, we organize the physical storage in such a way that all edges can still be retrieved by scanning a contiguous memory location.
We proceed as follows: First, we compute E_s(l), E_r(l), and E_d(l) for every l ∈ L. Let Ω be the collection of all these sets. For each E_x(l) ∈ Ω, we construct two sets of tuples, F_x(l) and G_x(l), by extracting the free fields left-to-right and right-to-left respectively. For instance, the set E_s(l) results in the sets F_s(l) = {⟨r, d⟩ | r(l, d) ∈ E} and G_s(l) = {⟨d, r⟩ | r(l, d) ∈ E}. Since these sets contain pairs of elements, we view them as binary tables. These are grouped into the following six sets:
• T_s = {F_s(l) | l ∈ L} and T′_s = {G_s(l) | l ∈ L}
• T_r = {F_r(l) | l ∈ L} and T′_r = {G_r(l) | l ∈ L}
• T_d = {F_d(l) | l ∈ L} and T′_d = {G_d(l) | l ∈ L}
The content of these six sets is serialized on disk in corresponding byte streams called TS, TS′, TR, TR′, TD, and TD′ respectively (see the middle section of Figure 1). The serialization is done by first sorting the binary tables by their defining label IDs, and then serializing each table one-by-one. For instance, if F_s(l_1), F_s(l_2) ∈ T_s, then F_s(l_1) is serialized before F_s(l_2) iff ι_{l_1} < ι_{l_2}. At the beginning of the byte stream, we store the list of all IDs associated to the tables, pointers to the tables' physical location, and instructions to parse them.
Since the binary tables and tuples are serialized on the byte stream in a specific order, we can retrieve all edges sorted with any ordering in R with a single scan of the corresponding byte stream, using the content stored at the beginning of the stream to decode the binary tables in it. For instance, we can scan TS to retrieve all edges sorted according to srd. The IDs stored at the beginning of the stream specify the sources of the edges (s) while the content of the tables specifies the remaining relations and destinations (r and d).
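A sketch of this decomposition, reusing the types from the interface sketch above (simplified to in-memory maps; Trident serializes these tables onto byte streams):

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using BinaryTable = std::vector<std::pair<Id, Id>>;

// Decompose the edge set into the per-source tables F_s(l) and G_s(l); the
// tables for relations (F_r/G_r) and destinations (F_d/G_d) are analogous.
void build_source_tables(const std::vector<Edge>& edges,
                         std::map<Id, BinaryTable>& Fs,
                         std::map<Id, BinaryTable>& Gs) {
    for (const Edge& e : edges) {
        Fs[e.s].push_back({e.r, e.d});  // free fields left-to-right
        Gs[e.s].push_back({e.d, e.r});  // free fields right-to-left
    }
    // std::map iterates in increasing label ID, matching the order in which
    // tables are serialized on the byte stream; rows are sorted within tables.
    for (auto& entry : Fs) std::sort(entry.second.begin(), entry.second.end());
    for (auto& entry : Gs) std::sort(entry.second.begin(), entry.second.end());
}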
Node-centric storage.
In order to provide fast access to the nodes, we map each ID ι_l (i.e., the ID assigned to label l) to a tuple M_l that contains 15 fields:
• the cardinalities |E_s(l)|, |E_r(l)|, and |E_d(l)|;
• six pointers p_1, ..., p_6 to the physical storage of F_s(l), G_s(l), F_r(l), G_r(l), F_d(l), and G_d(l);
• six bytes m_1, ..., m_6 that contain instructions to read the data structures pointed to by p_1, ..., p_6. These instructions are necessary because the tables are stored in different ways (see Section 5).
We index all M_* tuples by their numerical IDs using one global data structure called NM (Node Manager), shown on the left side of Figure 1. This data structure is implemented either with an on-disk B+Tree or with an in-memory sorted vector (the choice is made at loading time). The B+Tree is preferable if the engine is used for edge-based computation because the B+Tree does not need to load all nodes in main memory and the nodes are accessed infrequently anyway. In contrast, the sorted vector provides much faster access (O(1) vs. O(log |L|)) but it requires that the entire vector is stored in main memory. Thus, it is suitable only if the application accesses the nodes very frequently and there are enough hardware resources.
Note that the coordinates of the binary tables are stored both in NM and in the meta-data in front of the byte streams. This means that a table can be accessed either by consulting NM, or by scanning the beginning of the byte stream. In our implementation, we consult NM when we need to answer graph patterns with at least one constant element (e.g., for answering the query in Example 1). In contrast, the meta-content at the beginning of the stream is used when we must perform a full scan.
The way we store the binary tables in six byte streams resembles six-permutation indexing schemes such as those proposed in engines like RDF3X [63] or Hexastore [93]. There are, however, two important differences: First, in our approach the edges are stored in multiple independent binary tables rather than a single series of ternary tuples (as, for instance, in RDF3X [63]). This division is important because it allows us to choose different serialization strategies for subgraphs or to avoid storing some tables (Section 5.3). The second difference is that in our case most access patterns go through a single B+Tree instead of six different data structures. This allows us to save space and to store additional information about the nodes, e.g., their degree, which is useful, for instance, for traversal algorithms like PageRank, or random walks.
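A direct transcription of the 15-field tuple into a struct (field names are ours; the on-disk encoding is certainly more compact):

#include <cstdint>

// One node-manager entry M_l, as described above.
struct NodeEntry {
    uint64_t card_s, card_r, card_d;  // |E_s(l)|, |E_r(l)|, |E_d(l)|
    uint64_t ptr[6];   // p1..p6: offsets of F_s, G_s, F_r, G_r, F_d, G_d
                       // within their byte streams
    uint8_t meta[6];   // m1..m6: how to read each table (layout and byte
                       // widths, cf. Algorithm 1 in Section 5)
};
// NM maps label IDs to NodeEntry values, either via an on-disk B+Tree
// (O(log |L|) lookups, small memory footprint) or an in-memory sorted vector
// (O(1) access, but fully resident in RAM).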
We now discuss how we can implement the primitives in Table 1 with our architecture.
Primitives f1–f4 (lbl_*, nodid, edgid). These are executed by consulting either DICT_l or DICT_ι. Thus, the time complexity follows in a straightforward manner.
Proposition 1. Let G = (V, E, L, φ_V, φ_E). The time complexity of computing f1, ..., f4 is O(log(|L|)).

Primitives f5–f10 (edg_*). Let edg_ω(G, p) be a generic invocation of one of f5, ..., f10. First, we need to retrieve the numerical IDs associated to the labels in p (if any). Then, we select an ordering that allows us to 1) retrieve the answers for p with a range scan, and 2) comply with ω. The orderings that satisfy 1) are

Ω = {ω′ | ω′ ∈ R ∧ isprefix(bound(p), ω′) = true}    (1)

An ordering ω′ ∈ Ω which also satisfies 2) is one for which ω′ − bound(p) = ω − bound(p).

Example 3. Consider the execution of edg_srd(G, p) where p = (X, Y, a). In this case, bound(p) = d, Ω = {drs, dsr}, and ω′ = dsr.

The selected ω′ is associated to one byte stream. If p contains one or more constants, then we can query NM to retrieve the appropriate binary table from that byte stream and (range-)scan it to retrieve the answers of p. In contrast, if p only contains variables, the results can be obtained by scanning all tables in the byte stream. Note that the cost of retrieving the IDs for the labels in p is O(log |L|) since we use B+Trees for the dictionary. This is an operation that is applied any time the input contains a graph pattern. If we ignore this cost and look at the remaining computation, then we can make the following observation.
Proposition 2. Let G = (V, E, L, φ_V, φ_E). The time complexity of edg_ω(G, p) is O(|E|) if p only contains variables, and O(log(|L|) + |E|) otherwise.

Primitives f11–f16 (grp_*). Let grp_ω(G, p) be a general call to one of these primitives. Note that in this case ω ∈ R′, i.e., it is a partial ordering. These functions can be implemented by invoking f5, ..., f10 and then returning an aggregated version. Thus, they have the same cost as the previous ones. However, there are special cases where the computation is quicker, as shown in the next example.

Example 4. Consider a call to grp_s(G, p) where p = ⟨a, X, Y⟩. In this case, we can query NM with a and return at most one tuple with the cardinality stored in M_a, which has a cost of O(log(|L|)).

If ω has length two or p contains a repeated variable, then we also need to access one or more binary tables, similarly as before.
Proposition 3. Let G = (V, E, L, φ_V, φ_E). The time complexity of grp_ω(G, p) ranges between O(log(|L|)) and O(log(|L|) + |E|), depending on p and ω.

Primitive f17 (count). This primitive returns the cardinality of the output of f5, ..., f16. Therefore, it can be simply implemented by iterating over the results returned by these functions. However, there are cases when we can avoid this iteration. Some of such cases are the ones below:
• If the input is edg_ω(G, p) and p contains no constants nor repeated variables. In this case the output is |E|.
• If the input is edg_ω(G, p) and p contains only one constant c and no repeated variables. In this case the cardinality is stored in M_c.
• If the input is grp_ω(G, p), isprefix(ω, ω′) = true, and p contains at most one constant and no repeated variables, then the output can be obtained either by consulting NM or the metadata of one of the byte streams.
Otherwise, we also need to access one binary table to compute the results, which, in the worst case, takes O(|E|).
Proposition 4. Let G = (V, E, L, φ_V, φ_E).
The time complexity of executing count(·) ranges between O(log(|L|)) and O(log(|L|) + |E|).

Primitives f18–f23 (pos_*). In order to efficiently support these primitives, we need to provide fast random access to the edges. Given a generic pos_ω(G, p, i), we distinguish four cases:
• C1: If p contains repeated variables, then we iterate over the results and return the i-th edge;
• C2: If p contains only one constant, then the search space is restricted to a single binary table. In this case, the computation depends on how the content of the table is serialized on the byte stream. If it allows random access to the rows, then the cost reduces to O(log(|L|)), i.e., querying NM. Otherwise, we also need to iterate through the table and count until the i-th row;
• C3: If p contains more than one constant, then we need to search through the table for the right interval, and then scan until we retrieve the i-th row;
• C4: Finally, if p does not contain any constants or repeated variables, we must consider all edges stored in one byte stream. In this case, we first search for the binary table that contains the i-th edge. This operation requires a scan of the metadata associated to the byte stream, which can take up to O(|L|). Then, the complexity depends on whether the physical storage of the table allows random access, as in C2 and C3. Since a scan over the metadata takes O(|L|), this last case represents the worst case in terms of complexity as it sums to O(|L| + |E|). Note that in this case, simply going through all edges is faster as it takes O(|E|). However, in practice tables have more than one row, so we can advance more quickly despite the higher worst-case complexity.
Proposition 5. Let G = (V, E, L, φ_V, φ_E). The time complexity of executing pos_ω(G, p, i) ranges between O(log(|L|)) and O(|L| + |E|).
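Returning to the edg_* primitives, the ordering selection of Equation (1) and Example 3 can be sketched as follows (helper names are ours):

#include <string>
#include <vector>

// bound(p): positions of the constants in p, left to right (cf. Section 2).
std::string bound(const Pattern& p) {
    std::string b;
    if (p.s) b += 's';
    if (p.r) b += 'r';
    if (p.d) b += 'd';
    return b;
}

// The '-' operator of Section 2: remove from a all characters occurring in b.
std::string remove_chars(const std::string& a, const std::string& b) {
    std::string out;
    for (char c : a)
        if (b.find(c) == std::string::npos) out += c;
    return out;
}

// Equation (1): pick w' in R that has bound(p) as prefix (so the answers lie
// in one contiguous range) and whose free part matches the requested order w.
std::string select_ordering(const Pattern& p, const std::string& w) {
    static const std::vector<std::string> R =
        {"srd", "sdr", "drs", "dsr", "rsd", "rds"};
    const std::string b = bound(p);
    for (const std::string& w2 : R)
        if (w2.compare(0, b.size(), b) == 0 &&          // isprefix(b, w')
            remove_chars(w2, b) == remove_chars(w, b))  // w' - b == w - b
            return w2;
    return "";  // unreachable for well-formed inputs
}
// Example 3 revisited: p = (X, Y, a) gives bound(p) = "d"; for w = "srd" the
// candidates are "drs" and "dsr", and select_ordering returns "dsr".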
Bulk Loading.
Loading a large KG can be a lengthy process, especially if the resources are constrained. In Trident, we developed a loading routine which exploits the multi-core architecture and maximizes the (limited) I/O bandwidth.
Figure 2: Bulk loading in Trident
The main operations are shown in Figure 2. Our implementation can receive the input KG in multiple formats. Currently, we consider the N-Triples format (popular in the Semantic Web) and the SNAP format [49] (used for generic graphs). The first operation is encoding the graph, i.e., assigning unique IDs to the entities and relation labels. For this task, we adapted the MapReduce technique presented in [90] to work in a multi-core environment. This technique first deconstructs the triples, then assigns unique IDs to all the terms, and finally reconstructs the triples. If the graph is already encoded, then our procedure skips the encoding and proceeds to the second operation of the loading, the creation of the database. The creation of the binary tables requires that the triples are pre-sorted according to a given ordering. We use a disk-based parallel merge sort algorithm for this purpose. The tables are serialized one-by-one, selecting the most efficient layout for each of them. After all the tables are created, the loading procedure creates the NM and the B+Trees with the dictionaries. The encoding and sorting procedures are parallelized using threads, which might need to communicate with the secondary storage. Modern architectures can have >64 cores, but such a number of threads can easily saturate the disk bandwidth and cause serious slowdowns. To avoid this problem, we have two types of threads: processing threads, which perform computation like sorting, and I/O threads, which only read and write from disk. In this way, we can control the maximum number of concurrent accesses to the disks.
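A minimal sketch of how such a split between processing and I/O threads might look, with a bounded queue capping the in-flight disk work; all names here are ours, not Trident's:

#include <condition_variable>
#include <mutex>
#include <queue>

// Bounded queue between processing threads (sorting, encoding) and a small,
// fixed pool of I/O threads; its capacity caps concurrent disk accesses.
template <typename T>
class BoundedQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable not_full, not_empty;
    size_t cap;
public:
    explicit BoundedQueue(size_t capacity) : cap(capacity) {}
    void push(T item) {
        std::unique_lock<std::mutex> l(m);
        not_full.wait(l, [&] { return q.size() < cap; });
        q.push(std::move(item));
        not_empty.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> l(m);
        not_empty.wait(l, [&] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop();
        not_full.notify_one();
        return item;
    }
};
// Processing threads push serialized buffers; e.g., two I/O threads loop on
// pop() and write to disk, so at most two disk writes are in flight at once.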
Updates.
To avoid a complete re-loading of the entire KG after each change, our implementation supports incremental updates. Our procedure is built following the well-known advice by Jim Gray [29] that discourages in-place updates, and it is inspired by the idea of differential indexing [63], which proposes to create additional indices and perform a lazy merging with the main database when the number of indices becomes too high.
Our procedure first encodes the update, which can be either an addition or a removal, and then stores it in a smaller "delta" database with its own NM and byte streams. Multiple updates are stored in multiple databases, which are timestamped to remember the order of the updates. Also, updates create an extra dictionary if they introduce new terms. Whenever the primitives are executed, the content of the updates is combined with the main KG so that the execution returns an updated view of the graph.
In contrast to differential indexing, our merging does not copy the updates into the main database, but only groups them in two updates, one for the additions and one for the removals. This is to avoid the process of rebuilding binary tables with possibly different layouts. If the size of the merged updates becomes too large, then we proceed with a full reload of the entire database.

The binary tables can be serialized in different ways. For instance, we can store them row-by-row or column-by-column. Using a single serialization strategy for the entire KG is inefficient because the tables can be very different from each other, so one strategy may be efficient with one table but inefficient with another. Our approach addresses this inefficiency by choosing the best serialization strategy for each table depending on its size and content.
For example, consider two tables T_1 and T_2. Table T_1 contains all the edges with label "isA", while T_2 contains all the edges with label "isbnValue". These two tables are not only different in terms of size, but also in the number of duplicated values. In fact, the second column of T_1 is likely to contain many more duplicate values than the second column of T_2, because there are (typically) many more instances than classes, while "isbnValue" is a functional property, which means that every entity in the first column is associated with a unique ISBN code. In this case, it makes sense to serialize T_1 in a column-by-column fashion so that we can apply run-length encoding (RLE) [1], a well-known compression scheme for repeated values, to save space when storing the second column. This type of compression would be ineffective with T_2 since there each value appears only once. Therefore, T_2 can be stored row-by-row.
In our approach, we consider three different serialization strategies, which we call serialization layouts (or simply layouts), and employ an ad-hoc procedure to select, for each binary table, the best layout among these three. We refer to the three layouts as the row, column, and cluster layouts respectively. The first layout stores the content row-by-row, the second column-by-column, while the third uses an intermediate representation.
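For illustration, a minimal sketch of RLE over one sorted column, as exploited by the column layout described below:

#include <cstdint>
#include <utility>
#include <vector>

// Run-length-encode one sorted column: each run of equal values is stored
// once with its length.
std::vector<std::pair<uint64_t, uint64_t>> rle(const std::vector<uint64_t>& col) {
    std::vector<std::pair<uint64_t, uint64_t>> runs;  // (value, run length)
    for (uint64_t v : col) {
        if (!runs.empty() && runs.back().first == v)
            ++runs.back().second;
        else
            runs.push_back({v, 1});
    }
    return runs;
}
// In the "isA" example, a column holding one class ID per instance collapses
// into one (class, count) pair per class; for "isbnValue", where every value
// is unique, every run has length 1 and RLE only adds overhead.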
Row layout. Let T = ⟨⟨t′_1, t′′_1⟩, ..., ⟨t′_n, t′′_n⟩⟩ be a binary table that contains n sorted pairs of elements. With this layout, the pairs are stored one after the other. In terms of space consumption, this layout is optimal if the two columns do not contain any duplicated values. Moreover, if each row takes a fixed number of bytes, then it is possible to perform binary search or random access to a subset of rows. The disadvantage is that with this layout all values are explicitly written on the stream, while the other layouts allow us to compress duplicate values.
Column layout. With this layout, the elements in T are serialized as ⟨t′_1, ..., t′_n⟩, ⟨t′′_1, ..., t′′_n⟩. The space consumption required by this layout is equal to the previous one, but with the difference that here we can use RLE to reduce the space of ⟨t′_1, ..., t′_n⟩. In fact, if t′_1 = t′_2 = ... = t′_n, then we can simply write t′_1 × n. Also this layout allows binary search and random access to the table. However, it is slightly less efficient than the row layout for full scans because here one row is not stored at contiguous locations, and the system needs to "jump" between columns in order to return the entire pair. On the other hand, this layout is more suitable than the row layout for aggregate reads (required, for instance, for executing grp primitives) because in this case we only need to read the content of one column, which is stored at contiguous locations.
Cluster layout. Let g_t = ⟨⟨t, t′′_k⟩, ..., ⟨t, t′′_l⟩⟩ be the longest subsequence of pairs in T which share the first term t. With this layout, all groups are first ordered in the sequence ⟨g_{t_1}, ..., g_{t_i}, g_{t_{i+1}}, ..., g_{t_m}⟩ such that t_i ≤ t_{i+1} for all 1 ≤ i < m. Then, they are serialized one-by-one. Each group g_t is serialized by first writing t, then |g_t|, and finally the list t′′_k, ..., t′′_l. This layout needs less space than the row layout if the groups contain multiple elements. Otherwise, it uses more space because it also stores the size of the groups, and this takes an extra ⌈log n⌉ bits. Another disadvantage is that with this layout binary search is only possible within one group.

The procedure for selecting the best layout for each table is reported in Algorithm 1. Its goal is to select the layout which leads to the best compression without excessively compromising the performance. In our implementation, Algorithm 1 is applied by default, but the user can disable it and use one layout for all tables.

Algorithm 1: selectlayout(T)
1:  U ≔ {u | ⟨u, v⟩ ∈ T}
2:  if n ≤ τ and |U| ≤ υ then
3:      m1 ≔ 0, m2 ≔ 0, m3 ≔ 0
4:      foreach u ∈ U do
5:          Z ≔ {v | ⟨u, v⟩ ∈ T}
6:          if u > m1 then m1 ≔ u
7:          if |Z| > m3 then m3 ≔ |Z|
8:          foreach z ∈ Z do
9:              if z > m2 then m2 ≔ z
10:     t_c ≔ |U| × (sizeof(m1) + sizeof(m3)) + |T| × sizeof(m2)
11:     t_r ≔ |T| × (sizeof(m1) + sizeof(m2))
12:     if t_r ≤ t_c then return ⟨ROW, sizeof(m1), sizeof(m2), 0⟩
13:     else return ⟨CLUSTER, sizeof(m1), sizeof(m2), sizeof(m3)⟩
14: else
15:     return ⟨COLUMN, 0, 0, 0⟩

The procedure receives as input a binary table T with n rows and returns a tuple that specifies the layout that should be chosen. It proceeds as follows. First, it distinguishes tables that have at most τ rows (the default value of τ is 1M) and contain at most υ unique elements in the first column from tables that do not (line 2). We make this distinction because 1) if the number of rows is too high, then searching for the most optimal layout becomes expensive, and 2) if the number of unique pairs is too high, then the cluster layout should not be used due to its lack of support for binary search. With small tables, this is not a problem because it is well known that in these cases linear search is faster than binary search due to a better usage of the CPU's cache memory. The value for υ is automatically determined with a small routine that performs some micro-benchmarks to identify the threshold after which binary search becomes faster. In our experiments, this value ranged between 16 and 64 elements.
If the table satisfies the condition of line 2, then the algorithm selects either the ROW or the CLUSTER layout.
The COLUMN layout is not considered because its main benefit over the other two is better compression (e.g., RLE), but this is limited anyway if the table is small. The procedure scans the table and keeps track of the largest numbers and groups used in the table (m1, m2, m3). Then, the function invokes the subroutine sizeof(·) to retrieve the number of bytes needed to store these numbers. It uses this information to compute the total number of bytes that would be needed to store the table with the ROW and CLUSTER layouts respectively (variables t_r and t_c). Then, it selects the layout that leads to the maximum compression.
If the condition in line 2 fails, then either the ROW or the COLUMN layout can be selected. An exact computation would be too expensive given the size of the table. Therefore, we always choose COLUMN, since the other one cannot be compressed with RLE.
Next to choosing the best layout, Algorithm 1 also returns the maximum number of bytes needed to store the values in the two fields of the table (m1 and m2) and (optionally) also for storing the cluster size (m3; this last value is only needed for CLUSTER). The reason for doing so is that it would be wasteful to use four- or eight-byte integers to store small IDs. In the worst case, we assume that all IDs in both fields can be stored with five bytes, which means we can store up to 2^40 − 1 different values. The tuple returned by selectlayout contains the information necessary to properly read the content of the table from the byte stream. The first field is the chosen layout, while the other fields are the number of bytes that should be used to store the entries of the table. We store this tuple both in NM (in one of the m_* fields) and at the beginning of the byte stream.
With Algorithm 1, the system adapts to the KG while storing a single table. We now discuss two other forms of compression that consider multiple tables and decide whether some tables should be skipped or stored in aggregated form.

On-the-fly reconstruction (OFR).
Every table in one stream T_x maps to another table in T′_x where the first column is swapped with the second column. If the tables are sufficiently small, one of them can be re-constructed on-the-fly from the other whenever needed. While this operation introduces some computational overhead, the saving in terms of space may justify it. Furthermore, the overhead can be limited to the first access by serializing the table on disk after the first re-construction.
We refer to this strategy as on-the-fly reconstruction (OFR). If the user selects it at loading time, the system will not store any binary table in T′_x which has fewer than η rows, η being a value passed by the user (the default value is 20, determined after microbenchmarking).

Dataset    Type   |E|    |V|
LUBM       KG     Var.   Var.
YAGO2S     KG     76M    37M
DBPedia    KG     1B     233M
Google     Dir.   5.1M   875k
Wikidata   KG     1.1B   299M
Twitter    Dir.   1.7M   81k
Uniprot    KG     168M   177M
Astro      Undi.  198k   18k
BTC2012    KG     1B     367M
Table 2: Details about the used datasets

Aggregate Indexing.
Finally, we can construct aggregate indices to further reduce the storage space. The usage of aggregate indices is not novel for KG storage [93]. Here, we limit their usage to the tables in T′_r, and only if they lead to a reduction of the storage space.
To illustrate the main idea, consider a generic table t that contains the set of tuples F′_r(isA). This table stores all the ⟨object, subject⟩ pairs of the triples with the predicate isA. Since there are typically many more instances than classes, the first column of t (the classes) will contain many duplicate values. If we range-partition t on the first field, then we can identify a copy of the values in the second field of t in the partitions of the tables in T′_d where the first term is isA. With this technique, we avoid storing the same sequence of values twice; instead, we store a pointer to the partition in the other table.

Trident is developed in C++, is freely available, and works under Windows, Linux, and MacOS. Trident is also released in the form of a Docker image. The user can interact via the command line, a web interface, or HTTP requests according to the SPARQL standard.
Integration with other systems.
Since our system offers low-level primitives, we integrated it with the following other engines via simple wrappers, to evaluate our engine in multiple scenarios:
• RDF3X [63]. RDF3X is one of the fastest and most well-known SPARQL engines. We replaced its storage layer with ours so that we can reuse its SPARQL operators and query optimizations.
• SNAP [50]. The Stanford Network Analysis Platform (SNAP) is a high-performance open-source library that implements over 100 different graph algorithms. As with RDF3X, we removed the SNAP storage layer and added an interface to our own engine.
• VLog [89]. VLog is one of the most scalable datalog reasoners. We implemented an interface allowing VLog to reason using our system as the underlying database.
We also implemented a native procedure to answer basic graph patterns (BGPs) that applies greedy query optimization based on cardinalities, and uses either merge joins or index loop joins if the first cannot be used.
Testbed.
We used a Linux machine (kernel 3.10, GCC 6.4, page size 4k) with dual Intel E5-2630v3 eight-core CPUs at 2.4 GHz, 64 GB of memory, and two 4TB SATA hard disks in RAID-0 mode. Its commercial value is well below $5K. We compared against RDF3X and SNAP with their native storages, TripleBit [97], an in-memory
state-of-the-art RDF database (in contrast to RDF3X, which uses disks), and SYSTEM_A, a widely used commercial SPARQL engine (we hide its real name as it is a commercial product, as is usual in database research). As inputs, we considered a selection of real-world and artificial KGs, and other non-KG graphs from SNAP [49] (see Table 2 for statistics).
• KGs.
LUBM [32], a well-known artificial benchmark that creates KGs of arbitrary sizes. The KG is in the domain of universities and each university contributes ca. 100k new triples. Henceforth, we write LUBMX to indicate a KG with X universities, e.g., LUBM10 contains 1M triples; DBPedia [14], YAGO2S [84], and Wikidata [92], three widely used KGs with encyclopedic knowledge; Uniprot [75], a KG that contains biomedical knowledge; and BTC2012 [41], a collection of crawled interlinked KGs.
• Other graphs. We considered the graphs Google, a Web graph from Google; Twitter, which contains social circles; and Astro, a collaboration network in Physics.
Trident was configured to use the B+Tree for NM, and table pruning was disabled, unless otherwise stated.

During the loading procedure, Trident applies Algorithm 1 to determine the best layout for each table. Figure 3a shows the number of tables of each type for the KGs. The vast majority of tables is stored either with the
ROW or CLUSTER layout. Only a few tables are stored with the COLUMN layout. These are mostly the ones in the TR and TR′ byte streams. It is interesting to note that the number of tables varies differently among different KGs. For instance, the number of row tables is twice the number of cluster tables with LUBM. In contrast, with Wikidata there are more cluster tables than row ones. These differences show to what extent Trident adapted its physical storage to the structure of the KG.
One key operation of our system is to retrieve the answers of simple triple patterns. First, we generated all possible triple patterns that return non-empty answers from YAGO2S. We considered five types of patterns. Patterns of type 0 are full scans; patterns of type 2 contain one constant and two variables (e.g., (X, type, Y)); and patterns of type 4 contain two constants and one variable (e.g., (X, type, person)). These patterns are answered with edg_*. Patterns of type 1 request an aggregated version of a full scan (e.g., retrieve all subjects) while patterns of type 3 request an aggregation where the pattern contains one constant (e.g., return all objects of the predicate type). These two patterns are answered with grp_*.
[Figure 3 panels: (a) number of tables (in thousands) of each layout (row, cluster, column) for the various KGs; (b) median runtimes of triple-pattern lookups for Default, With OFR, With AGGR, Only ROW, Only COLUMN, and RDF3X (best ones in bold); (c) size of the database with Trident with/without optimizations (SYSTEM_A: 6.3GB). Most values were lost in extraction.]
Figure 3: Statistics using various layouts/configurations and runtimes of triple pattern lookups

The number, types of patterns, and average number of answers per type are reported in Table 3.

Type / Orderings   Example          N.          Avg.
0 / all            X Y Z            1           75,999,246
1 / srd-sdr        X ∗ ∗            1           8,617,963
1 / drs-dsr        ∗ ∗ X            1           29,835,479
1 / rds-rsd        ∗ X ∗            1           99
2 / srd-sdr        a X Y            8,617,963   8
2 / drs-dsr        X Y a            29,835,479  2
2 / rds-rsd        X a Y            99          767,669
3 / srd-sdr        a X ∗ / a ∗ X    8,617,963   4/8
3 / drs-dsr        ∗ X a / X ∗ a    29,835,479  1/2
3 / rds-rsd        ∗ a X / X a ∗    99          369,011/423,335
4 / srd-rsd        a b X            41,910,232  1
4 / drs-rds        a X b            36,532,121  2
4 / sdr-dsr        X b a            69,564,969  1
Table 3: Types of patterns (0-4) on several orderings on YAGO2S. X, Y, Z are variables, a, b are constants, ∗ means the column is ignored.

The first column reports the type of pattern and the orderings we can apply when we retrieve it. The second column reports an example pattern of this type. The third column contains the number of different queries that we can construct of this type. The fourth column reports the average number of answers that we get if we execute queries of this type. For example, the first row describes the pattern of type 0, which is a full scan. For this type of pattern, we can retrieve the answers with all the orderings in R. There is only one possible query of this type (column 3) and if we execute it then we obtain about 76M answers (column 4). Patterns of type 1 correspond to full aggregated scans. An example pattern of this type is shown in the second row. If this query is executed, the system will return the list of all subjects with the count of triples that share each subject. With this input, this query will return about 8M results (i.e., the number of subjects). We can construct a similar query if we consider the variables in the second or third position. Details for these two cases are reported in the third and fourth rows.
Patterns of type 2 have one constant and two variables. Like before, the constant can appear in three positions. Note that in this case we can construct many more queries by using different constants. For instance, we can construct 8.6M queries if the constant appears as subject, and 99 if it appears as predicate. Similarly, Table 3 reports such details also for queries of types 3 and 4. By testing our system on all these types of patterns, we are effectively evaluating the performance over all possible queries of these types which would return non-empty answers.
We used the primitives to retrieve the answers for these patterns with various configurations of our system, and compared against RDF3X, which was the system with the fastest runtimes. The median warm runtimes of all executions are reported in Figure 3b.
The row "Default" reports the results with the adaptive storage selected by Algorithm 1 but without table pruning.
The rows "With OFR" and "With AGGR" use Algorithm 1 and the two techniques for table pruning discussed in Section 5.3, respectively. The rows "Only ROW (COLUMN)" use only the ROW and COLUMN layouts (the CLUSTER layout is not competitive alone due to the lack of binary search). From the table, we see that if the two pruning strategies are enabled, then the runtimes increase, especially with OFR. This was expected, since these two techniques trade speed for space. Their benefit is that they reduce the size of the database, as shown in Figure 3c. In particular, OFR is very effective, and they can reduce the size by 35%. Therefore, they should only be used if space is critical. The ROW layout returns competitive performance if used alone, but then the database size is about 9% larger due to the suboptimal compression. Figure 3c also reports the sizes of the databases of the other systems as a reference. Note that the reported numbers for Trident do not include the size of the dictionary (764MB). This size should be added to the reported numbers for a fair comparison with the other systems' databases.
A comparison against
RDF3X shows that the latter is faster with full scans (patterns of type 0) because our approach has to visit more tables stored with different configurations. However, our approach has comparable performance with the second pattern type and performs significantly better when the scan is limited to a single table, with, in the best case, improvements of more than two orders of magnitude (pattern 3). Note that in contexts like SPARQL query answering, patterns that contain at least one constant are much more frequent than full scans (e.g., see Table 2 of [76]).
Table 4 reports the average of five cold and warm runtimes with our system and with other state-of-the-art engines. For LUBM, DBPedia, Uniprot, and BTC2012, we considered queries used to evaluate previous systems [97]. For Wikidata, we designed five example queries of various complexity looking at published examples. The queries are reported in Appendix A. Unfortunately, we could not load Wikidata and BTC2012 with SYSTEM_A due to exceptions raised during the loading phase.
We can make a few observations from the obtained results. First, a direct comparison against TripleBit is problematic because sometimes TripleBit crashed or returned wrong results (checked after manual inspection). Looking at the other systems, we observe that our approach returned the best cold runtimes for 20 out of 25 queries, counting in both the executions with our native SPARQL engine and the integration with RDF3X. If we compare the warm runtimes, our system is faster 20 out of 25 times. Furthermore, we observe that Trident/N is faster than Trident/R mostly with selective queries that require only a few joins. Otherwise the second is faster. The reason is that RDF3X uses a sophisticated query optimizer that builds multiple plans in a bottom-up fashion. This procedure is costly if applied to simple queries, but it pays off for more complex ones because it can detect a better execution plan.
[Table 4 data: per-query answer counts and cold/warm runtimes (ms) of TN, TR, RDF3X, TripleBit, and SYSTEM_A on LUBM, Uniprot, DBPedia, BTC, and Wikidata; most cell values were lost in extraction.]
Table 4: Average runtimes of SPARQL queries. Column "TN" reports the runtime of our approach with the native SPARQL implementation, while "TR" is the runtime with the RDF3X SPARQL engine. LUBM8k is a generated database with about 1B RDF triples. A red background means that TripleBit returned wrong answers ('x' means it crashed); "N.A." means that the experiment was not possible due to failure at loading time.
Table 5: Runtime of various graph analytics algorithms (HITS, PageRank, BFS, Random Walks, Triangle Counting, MaxWCC, MaxSCC, Diameter, ClustCoef, MOD) on the ASTRO, GOOGLE, and TWITTER datasets, comparing SNAP with our engine.
Table 6: Runtime of reasoning and learning. Top: Datalog reasoning on LUBM1k (130M triples) with rulesets from [60] (LUBM-L), comparing VLog+Ours against VLog. Bottom: runtime of training 10 epochs with TransE on YAGO (parameters: batch size 100, learning rate 0.001, 50 dimensions, Adagrad, margin 1), with our implementation and with OpenKE [35] (18.72s).
Graph analytics.
Algorithms for graph analytics are used for path analysis (e.g., finding shortest paths), community analysis (e.g., triangle counting), or to compute centrality metrics (e.g., PageRank). They frequently use the primitives pos* and count to traverse the graph or to obtain the degree of nodes. For these experiments, we used the sorted list as NODEMGR, since these algorithms are node-centric. We selected ten well-known algorithms: HITS and PageRank compute centrality metrics; Breadth First Search (BFS) performs a search; MOD computes the modularity of the network, which is used for community detection; Triangle Counting counts all triangles; Random Walks extracts random paths; MaxWCC and MaxSCC compute the largest weakly and strongly connected components, respectively; Diameter computes the diameter of the graph, while ClustCoeff computes the clustering coefficient.

We executed these algorithms using the original SNAP library and in combination with our engine. Note that the implementation of the algorithms is the same; only the storage changes. Table 5 reports the runtimes. From it, we see that our engine is faster in most cases; only with random walks is our approach slower. From these results, we conclude that our approach leads to competitive runtimes also for this type of computation.
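To make the access pattern concrete, the sketch below shows BFS written against an abstract neighbor-enumeration callback. The callback stands in for a pos*-style scan (a count-style call would similarly return a node's degree); the interface is our own hypothetical one, not Trident's or SNAP's actual API.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// A neighbor-enumeration primitive: given a node, invoke `visit` on each of
// its successors. This stands in for a pos*-style scan over one binary table.
using NeighborFn =
    std::function<void(uint64_t, const std::function<void(uint64_t)>&)>;

// Node-centric BFS: one neighbor scan per dequeued node, no materialized
// adjacency lists. Returns the hop distance from root (-1 = unreachable).
std::vector<int64_t> bfs(uint64_t root, uint64_t numNodes,
                         const NeighborFn& neighbors) {
    std::vector<int64_t> dist(numNodes, -1);
    std::queue<uint64_t> frontier;
    dist[root] = 0;
    frontier.push(root);
    while (!frontier.empty()) {
        uint64_t u = frontier.front();
        frontier.pop();
        neighbors(u, [&](uint64_t v) {
            if (dist[v] == -1) {
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        });
    }
    return dist;
}
```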
Reasoning and Learning.
We also tested the performance of our system for rule-based reasoning. In this task, rules are used to materialize all possible derivations from the KG. First, we computed the materialization with Trident and VLog, using LUBM and two popular rulesets (LUBM-L and LUBM-LE) [60, 89]. Then, we repeated the process with the native storage of VLog. The runtimes, reported in Table 6, show that our engine improves the performance (48% faster in the best case).

Finally, we considered statistical relational learning as another class of problems that could benefit from our engine. These techniques associate each entity and relation label in the KG with a numerical vector (called an embedding) and then learn optimal values for the embeddings so that the truth values of unseen triples can be computed via algebraic operations on the vectors. We implemented TransE [16], one of the most popular techniques of this kind, on top of Trident and compared the training runtime against that of OpenKE [35], a state-of-the-art library. Table 6 reports the runtime to train a model using as input a subset of YAGO that was used in other works [69]. The results indicate competitive performance also in this case.
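For concreteness, the following is a minimal sketch of a TransE update step. It is our own illustration, not OpenKE's or our system's code, and it uses plain SGD where the experiment above used Adagrad. TransE scores a triple (h, r, t) by the distance ||h + r - t|| and adjusts the embeddings so that true triples score lower than corrupted ones by at least a margin.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Minimal TransE (illustrative sketch): each entity and relation label has a
// d-dimensional embedding; training pushes dist(h,r,t) below dist(h,r,t') for
// a corrupted tail t', by at least `margin`. Plain SGD instead of Adagrad.
struct TransE {
    int dim;
    double lr, margin;
    std::vector<std::vector<double>> ent, rel; // embedding tables

    TransE(int nEnt, int nRel, int d, double learningRate, double m)
        : dim(d), lr(learningRate), margin(m),
          ent(nEnt, std::vector<double>(d)), rel(nRel, std::vector<double>(d)) {
        for (auto* table : {&ent, &rel})                  // small random init
            for (auto& v : *table)
                for (auto& x : v)
                    x = (std::rand() / (double)RAND_MAX - 0.5) / d;
    }

    double dist(int h, int r, int t) const {              // ||h + r - t||
        double s = 0.0;
        for (int i = 0; i < dim; ++i) {
            double d = ent[h][i] + rel[r][i] - ent[t][i];
            s += d * d;
        }
        return std::sqrt(s);
    }

    // One SGD step on a positive triple (h,r,t) and its corruption (h,r,t2).
    void step(int h, int r, int t, int t2) {
        double pos = dist(h, r, t), neg = dist(h, r, t2);
        if (pos + margin <= neg) return;                  // margin satisfied
        for (int i = 0; i < dim; ++i) {
            double gp = (ent[h][i] + rel[r][i] - ent[t][i])  / (pos + 1e-9);
            double gn = (ent[h][i] + rel[r][i] - ent[t2][i]) / (neg + 1e-9);
            ent[h][i]  -= lr * (gp - gn);
            rel[r][i]  -= lr * (gp - gn);
            ent[t][i]  += lr * gp;
            ent[t2][i] -= lr * gn;
        }
    }
};
```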
Universities   Q1     Q2     Q3      Q4      Q5
10k (1.3B)     0.05   0.09   25m     11m     6s
20k (2.6B)     0.05   0.09   52m     41m     12s
40k (5B)       0.05   0.09   1h50m   1h42s   25s
80k (10B)      0.05   0.09   3h52m   3h1m    56s
160k (21B)     0.05   0.09   >8h     6h49m   1m51s
800k (100B)    0.05   0.09   >8h     >8h     12m

Table 7: Runtimes of LUBM queries Q1-Q5 on KGs whose size ranges between 1B and 100B triples
Figure 4: Warm runtimes on Wikidata and LUBM8k after adding/removing updates with 1M triples each

Op      Wikidata   LUBM8k
ADD     308s       175s
ADD     386s       230s
ADD     404s       242s
ADD     477s       261s
ADD     431s       260s
Merge   200s       114s
DEL     399s       222s
DEL     465s       278s
DEL     501s       319s
DEL     531s       342s
DEL     566s       369s
Merge   291s       181s

(a) Runtime of updates

(b) CPU/RAM usage loading LUBM80k

System            Runtime
Ours (seq)        20min
Ours (par)        6min
RDF3X (seq)       24min
TripleBit (par)   9min
SYSTEM_A (par)    1h9min

(c) Loading runtime of LUBM1k (130M triples)

Figure 5: Loading and update runtimes
We executed the five LUBM queries using our native SPARQL procedure on KGs of different sizes (between 1B and 100B triples). For these experiments, we used another machine with 256GB of RAM (which also costs < $5K), due to a lack of disk space. The warm runtimes are shown in Table 7. The runtime of the first two queries remains constant. This was expected, since their selectivity does not decrease as the size of the KG increases. In contrast, the runtime of the other queries increases as the KG becomes larger.

Figure 4 shows the runtime of four SPARQL queries after we added five new sets of triples to the KG, merged them, removed five other sets of triples, and merged again. Each set of added triples does not contain triples included in previous updates. Similarly, each set of removed triples contains only triples in the original KG and not in previous updates. We selected the queries so that the content of the updates is considered. We observe that the runtime increases (because more deltas are considered) and that it drops after they are merged into a single update. Figure 5a reports the runtime to process five additions of ca. 1M novel triples, one merge, five removals of ca. 1M existing triples, and another merge. As we can see, with both datasets the runtime is much smaller than re-creating the database from scratch (>1h). The runtime with LUBM8k is faster than with Wikidata because the updates for the latter KG contained 4X more new entities.

In Figure 5b, we show the trace of the resource consumption during the loading of LUBM80k (10B triples). We plot the CPU usage (100% means all physical cores are used) and the RAM usage. From it, we see that most of the runtime is spent on dictionary encoding, sorting the edges, and creating the binary tables.

In general, Trident has competitive loading times. Figure 5c shows the loading time of our system and others on LUBM1k. With larger KGs, RDF3X becomes significantly slower than ours (e.g., it takes ca. 7 hours to load LUBM8k on our smaller machine, while Trident needs 1 hour and 18 minutes) due to its lack of parallelism. TripleBit is an in-memory database and thus cannot scale to some of our largest inputs. In some of our largest experiments, Trident could load LUBM400k (50B triples) in about 48 hours, a size that other systems cannot handle. If the graph is already encoded, then loading is faster: we loaded the Hyperlink Graph [58], a graph with 128B edges, in about 13 hours (with the larger machine), and the database required 1.4TB of space.

RELATED WORK
In this section, we describe the works most relevant to our problem. For a broader introduction to graph and RDF processing, we refer the reader to existing surveys [3, 24, 57, 59, 68, 77, 95]. Current approaches can be classified either as native (i.e., designed for this task) or non-native (adapting pre-existing technology). Native engines have better performance [17] but fewer functionalities [17, 23]. Our approach belongs to the first category.

Research on native systems has focused on advanced indexing structures. The most popular approach is to extensively materialize a dedicated index for each permutation. This was initially proposed by YARS [43] and further explored by RDF3X and others [10, 12, 13, 26, 63]. Hexastore [93] also proposes six-way permutation-based indexing, but implements it using hierarchical in-memory Java hash maps. Instead, we use on-disk data structures and can therefore scale to larger inputs.
Recently, other types of indices, based on 2D or 3D bit matrices [8, 97], hash maps [60], or data structures used for graph matching [47, 99], have been proposed. Compared with these works, our approach uses a novel layout of data structures and uses multiple layouts to store the subgraphs.

Non-native approaches offload the indexing to external engines (mostly DBMSs). Here, the challenge is to find efficient partitioning/replication criteria to exploit the multi-table nature of relational engines. Existing partitioning criteria group the triples either by predicate [2, 38, 51, 56, 62, 72, 73], by clusters of predicates [21], or by other entity-based splitting criteria [17]. The various partitioning schemes are designed to create few tables in order to meet the constraints of relational engines [81]. Our approach differs because
we group the edges at a much finer granularity, generating a number of binary tables that is too large for such engines.

Some popular commercial systems for graph processing are Virtuoso [67], BlazeGraph [85], Titan [22], Neo4J [61], Sparksee [83], and InfiniteGraph [66]. We compared Trident against one such leading commercial system and observed that ours has very competitive performance; other comparisons are presented in [5, 81]. In general, a direct comparison is challenging because these systems provide end-to-end solutions tailored for specific tasks, while we offer general-purpose low-level APIs.

Finally, many works have focused on distributed graph processing [4, 6, 9, 33, 36, 38, 48, 70, 78, 98]. We do not view these approaches as competitors, since they operate on different hardware architectures. Instead, we view ours as a potential complement that they can employ to speed up distributed processing.

In our approach, we use numerical IDs to store the terms. This form of compression has been the subject of several studies. Some systems use the hash code of the strings as ID [37, 38, 40]. Most systems, however, use counters to assign new IDs [18, 42, 43, 53, 63]. It has been shown in [88] that assigning some IDs rather than others can improve query answering due to data locality, and it is straightforward to include such procedures in our system (the counter-based variant is sketched below). Finally, some approaches have focused on compressing RDF collections [54] and on the management of the strings [11, 55, 82]. We adopted a conventional approach to store such strings; replacing our dictionary with these proposals is an interesting direction for future work.
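The counter-based scheme mentioned above can be as simple as the following sketch (a generic illustration, not our system's actual dictionary): the first occurrence of a term draws the next ID from a counter, and a reverse table supports decoding query answers back into strings.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Counter-based dictionary encoding (generic sketch): each distinct term gets
// the next available numeric ID; the reverse table maps IDs back to terms.
class Dictionary {
    std::unordered_map<std::string, uint64_t> toId;
    std::vector<std::string> toTerm;
public:
    uint64_t encode(const std::string& term) {
        auto it = toId.find(term);
        if (it != toId.end()) return it->second;
        uint64_t id = toTerm.size();       // counter: IDs assigned in arrival order
        toId.emplace(term, id);
        toTerm.push_back(term);
        return id;
    }
    const std::string& decode(uint64_t id) const { return toTerm[id]; }
};
```

A locality-aware encoder like KOGNAC [88] instead chooses the IDs so that related terms receive nearby numbers, which is what improves query answering through data locality.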
CONCLUSION
We proposed a novel centralized architecture for the low-level storage of very large KGs which provides both node- and edge-centric access to the KG. One of the main novelties of our approach is that it exhaustively decomposes the storage of the KG into many binary tables, serializing them in multiple byte streams to facilitate inter-table scanning, akin to permutation-based approaches. Another main novelty is that the storage effectively adapts to the KG by choosing a different layout for each table depending on the graph topology. Our empirical evaluation in multiple scenarios shows that our approach offers competitive performance and can load very large graphs without expensive hardware.

Future work is necessary to apply or adapt our architecture to additional scenarios. In particular, we believe that our system can be used to support Triple Pattern Fragments [91], an emerging paradigm to query RDF datasets, and GraphQL [44], a more complex graph query language. Finally, it is also interesting to study whether integrating additional compression techniques, like locality-based dictionary encoding [88] or HDT [25], can further improve the runtime and/or reduce the storage space.
Acknowledgments.
We would like to thank (in alphabetical order) Peter Boncz, Martin Kersten, Stefan Manegold, and Gerhard Weikum for discussing and providing comments to improve this work. This project was partly funded by the NWO research programme 400.17.605 (VWData) and NWO VENI project 639.021.335.
A SPARQL QUERIES
A.1 LUBM
A.2 DBPedia
A.3 BTC2012
A.4 Uniprot
A.5 Wikidata
REFERENCES
[1] Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating Compression and Execution in Column-Oriented Database Systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (Chicago, IL, USA) (SIGMOD '06). Association for Computing Machinery, New York, NY, USA, 671–682. https://doi.org/10.1145/1142473.1142548
[2] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. 2009. SW-Store: a vertically partitioned DBMS for Semantic Web data management. The VLDB Journal 18, 2 (2009), 385–406.
[3] Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, and Panos Kalnis. 2017. A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment 10, 13 (Sept. 2017), 2049–2060. https://doi.org/10.14778/3151106.3151109
[4] I. Abdelaziz, R. Harbi, S. Salihoglu, and P. Kalnis. 2017. Combining vertex-centric graph processing with SPARQL for large-scale RDF data analytics. IEEE Transactions on Parallel and Distributed Systems 28, 12 (Dec. 2017), 3374–3388.
[5] Günes Aluç, M. Tamer Özsu, Khuzaima Daudjee, and Olaf Hartig. 2015. Executing queries over schemaless RDF databases. In 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE Computer Society, Seoul, South Korea, 807–818. https://doi.org/10.1109/ICDE.2015.7113335
[6] Bernd Amann, Olivier Curé, and Hubert Naacke. 2018. Distributed SPARQL Query Processing: a Case Study with Apache Spark. John Wiley & Sons, Ltd, Chapter 2, 21–55. https://doi.org/10.1002/9781119528227.ch2
[7] Grigoris Antoniou, Sotiris Batsakis, Raghava Mutharaju, Jeff Z. Pan, Guilin Qi, Ilias Tachmazidis, Jacopo Urbani, and Zhangquan Zhou. 2018. A survey of large-scale reasoning on the Web of data. The Knowledge Engineering Review 33 (2018), 1–43.
[8] Medha Atre, Vineet Chaoji, Mohammed J. Zaki, and James A. Hendler. 2010. Matrix "Bit" Loaded: A Scalable Lightweight Join Query Processor for RDF Data. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, North Carolina, USA) (WWW '10). Association for Computing Machinery, New York, NY, USA, 41–50. https://doi.org/10.1145/1772690.1772696
[9] A. Azzam, S. Kirrane, and A. Polleres. 2018. Towards Making Distributed RDF Processing FLINKer. In 2018 International Conference on Big Data Innovations and Applications (Innovate-Data). IEEE Computer Society, Los Alamitos, CA, USA, 9–16. https://doi.org/10.1109/Innovate-Data.2018.00009
[10] Liu Baolin and Hu Bo. 2007. HPRD: A High Performance RDF Database. In IFIP International Conference on Network and Parallel Computing (Lecture Notes in Computer Science), Keqiu Li, Chris Jesshope, Hai Jin, and Jean-Luc Gaudiot (Eds.). Springer Berlin Heidelberg, Dalian, China, 364–374.
[11] Hamid R. Bazoobandi, Steven de Rooij, Jacopo Urbani, Annette ten Teije, Frank van Harmelen, and Henri Bal. 2015. A Compact In-memory Dictionary for RDF Data. In The Semantic Web. Latest Advances and New Domains. Springer-Verlag New York, Inc., Portoroz, Slovenia, 205–220.
[12] David Beckett. 2001. The Design and Implementation of the Redland RDF Application Framework. In Proceedings of the 10th International Conference on World Wide Web (Hong Kong) (WWW '01). Association for Computing Machinery, New York, NY, USA, 449–456. https://doi.org/10.1145/371920.372099
[13] Barry Bishop, Atanas Kiryakov, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, and Ruslan Velkov. 2011. OWLIM: A family of scalable semantic repositories. Semantic Web 2, 1 (2011), 33–42.
[14] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web 7, 3 (2009), 154–165.
[15] Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, and Nicolas Torzec. 2013. Entity Recommendations in Web Search. In The Semantic Web – ISWC 2013. Springer Berlin Heidelberg, Berlin, Heidelberg, 33–48.
[16] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., Lake Tahoe, Nevada, USA, 2787–2795.
[17] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, and Bishwaranjan Bhattacharjee. 2013. Building an Efficient RDF Store over a Relational Database. In SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 121–132.
[18] Jeen Broekstra, Arjohn Kampman, and Frank Van Harmelen. 2002. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In The Semantic Web – ISWC 2002. Springer, Sardinia, Italia, 54–68.
[19] Alison Callahan, José Cruz-Toledo, Peter Ansell, and Michel Dumontier. 2013. Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. In The Semantic Web: Semantics and Big Data (ESWC 2013). Springer, Montpellier, France, 200–212.
[20] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. 2015. One Trillion Edges: Graph Processing at Facebook-Scale. Proceedings of the VLDB Endowment 8, 12 (Aug. 2015), 1804–1815. https://doi.org/10.14778/2824032.2824077
[21] Eugene Inseok Chong, Souripriya Das, George Eadon, and Jagannathan Srinivasan. 2005. An Efficient SQL-Based RDF Querying Scheme. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, Trondheim, Norway, 1216–1227.
[22] DATASTAX, Inc. 2019. Titan: Distributed Graph Database. http://titan.thinkaurelius.com/
[23] Jing Fan, Adalbert Gerald Soosai Raj, and Jignesh M. Patel. 2015. The Case Against Specialized Graph Analytics Engines. In The 7th Biennial Conference on Innovative Data Systems Research (CIDR 2015).
[24] David C. Faye, Olivier Curé, and Guillaume Blin. 2012. A survey of RDF storage approaches. Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées 15 (2012), 11–35.
[25] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. 2013. Binary RDF representation for publication and exchange (HDT). Journal of Web Semantics 19 (March 2013), 22–41.
[26] George H.L. Fletcher and Peter W. Beck. 2009. Scalable Indexing of RDF Graphs for Efficient Join Processing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong) (CIKM '09). Association for Computing Machinery, New York, NY, USA, 1513–1516. https://doi.org/10.1145/1645953.1646159
[27] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). USENIX, Hollywood, CA, 17–30.
[28] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 599–613.
[29] Jim Gray. 1981. The Transaction Concept: Virtues and Limitations (Invited Paper). In Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings. VLDB Endowment, Cannes, France, 144–154.
[30] Mark Greaves and Peter Mika. 2008. Semantic Web and Web 2.0. Web Semantics: Science, Services and Agents on the World Wide Web 6, 1 (2008), 1–3.
[31] R. Guha, Rob McCool, and Eric Miller. 2003. Semantic Search. In Proceedings of the 12th International Conference on World Wide Web (Budapest, Hungary) (WWW '03). Association for Computing Machinery, New York, NY, USA, 700–709. https://doi.org/10.1145/775152.775250
[32] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3, 2 (2005), 158–182.
[33] Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. 2014. TriAD: A Distributed Shared-Nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 289–300. https://doi.org/10.1145/2588555.2610511
[34] Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014. An Experimental Comparison of Pregel-like Graph Processing Systems. Proceedings of the VLDB Endowment 7, 12 (2014), 1047–1058. https://doi.org/10.14778/2732977.2732980
[35] Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. 2018. OpenKE: An Open Toolkit for Knowledge Embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, Brussels, Belgium, 139–144.
[36] Razen Harbi, Ibrahim Abdelaziz, Panos Kalnis, Nikos Mamoulis, Yasser Ebrahim, and Majed Sahli. 2016. Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25, 3 (2016), 355–380.
[37] Stephen Harris and Nicholas Gibbins. 2003. 3store: Efficient Bulk RDF Storage. In 1st International Workshop on Practical and Scalable Semantic Systems (PSSS1). ePrints Soton, Sanibel Island, FL, USA, 1–15.
[38] Steve Harris, Nick Lamb, and Nigel Shadbolt. 2009. 4store: The Design and Implementation of a Clustered RDF Store. In International Conference on Web Information Systems Engineering. Springer, New York, NY, USA, 235–244.
[41] Andreas Harth. 2012. Billion Triples Challenge data set. http://km.aifb.kit.edu/projects/btc-2012/
[42] Andreas Harth and Stefan Decker. 2005. Optimized Index Structures for Querying RDF from the Web. In Proceedings of the Third Latin American Web Congress (LA-WEB '05). IEEE Computer Society, Washington, DC, USA, 71–80.
[43] Andreas Harth, Jürgen Umbrich, Aidan Hogan, and Stefan Decker. 2007. YARS2: A Federated Repository for Querying Graph Structured Data from the Web. In The 6th International Semantic Web Conference (Lecture Notes in Computer Science). Springer Berlin Heidelberg, Busan, South Korea, 211–224.
[44] Olaf Hartig and Jorge Pérez. 2018. Semantics and Complexity of GraphQL. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW '18). IEEE, Cancun, Mexico, 953–962.
[47] Jinha Kim, Hyungyu Shin, Wook-Shin Han, Sungpack Hong, and Hassan Chafi. 2015. Taming Subgraph Isomorphism for RDF Query Processing. Proceedings of the VLDB Endowment 8, 11 (2015), 1238–1249.
[48] Kisung Lee and Ling Liu. 2013. Scaling queries over big RDF graphs with semantic hash partitioning. Proceedings of the VLDB Endowment 6, 14 (2013), 1894–1905.
[49] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data
[50] Jure Leskovec and Rok Sosič. 2016. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST) 8, 1 (2016), 1.
[51] Li Ma, Zhong Su, Yue Pan, Li Zhang, and Tao Liu. 2004. RStar: An RDF Storage and Query System for Enterprise Resource Management. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (Washington, D.C., USA) (CIKM '04). Association for Computing Machinery, New York, NY, USA, 484–491. https://doi.org/10.1145/1031171.1031264
[52] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-Scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (Indianapolis, Indiana, USA) (SIGMOD '10). Association for Computing Machinery, New York, NY, USA, 135–146. https://doi.org/10.1145/1807167.1807184
[53] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas. 2012. Compression of RDF Dictionaries. In Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, Trento, Italy, 340–347.
[54] Miguel A. Martínez-Prieto, Javier D. Fernández, and Rodrigo Cánovas. 2012. Querying RDF dictionaries in compressed space. SIGAPP Appl. Comput. Rev.
[55] Ruslan Mavlyutov, Marcin Wylot, and Philippe Cudré-Mauroux. 2015. A Comparison of Data Structures to Manage URIs on the Web of Data. In The Semantic Web. Latest Advances and New Domains. Springer-Verlag New York, Inc., Portoroz, Slovenia, 137–151.
[56] Brian McBride. 2001. Jena: Implementing the RDF Model and Syntax Specification. In SemWeb'01: Proceedings of the Second International Conference on Semantic Web - Volume 40. CEUR-WS.org, Hong Kong, 23–28.
[57] Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR) 48, 2 (2015), 25.
[58] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. 2015. The Graph Structure in the Web – Analyzed on Different Aggregation Levels. The Journal of Web Science 1, 1 (2015), 33–47.
[59] G. E. Modoni, M. Sacco, and W. Terkaj. 2014. A Survey of RDF Store Solutions. In 2014 International Conference on Engineering, Technology and Innovation (ICE). IEEE, Bergamo, Italy, 1–7.
[60] Boris Motik, Yavor Nenov, Robert Piro, Ian Horrocks, and Dan Olteanu. 2014. Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada. AAAI Press, Québec, Canada, 129–137.
[61] Neo4j, Inc. 2019. Neo4j Graph Platform. https://neo4j.com/
[62] Thomas Neumann and Guido Moerkotte. 2011. Characteristic sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins. In 2011 IEEE 27th International Conference on Data Engineering (ICDE). IEEE, Hannover, Germany, 984–994.
[63] Thomas Neumann and Gerhard Weikum. 2010. The RDF-3X engine for scalable management of RDF data. The VLDB Journal 19, 1 (2010), 91–113.
[64] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2016), 11–33.
[65] Nucleic Acids Research.
[68] M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers of Computer Science 10, 3 (2016), 418–432.
[69] Soumajit Pal and Jacopo Urbani. 2017. Enhancing Knowledge Graph Completion By Embedding Correlations. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06-10, 2017. ACM, New York, NY, USA, 2247–2250.
[70] Peng Peng, Lei Zou, M. Tamer Özsu, Lei Chen, and Dongyan Zhao. 2016. Processing SPARQL queries over distributed RDF graphs. The VLDB Journal 25, 2 (2016), 243–268.
[71] Yonathan Perez, Rok Sosič, Arijit Banerjee, Rohan Puttagunta, Martin Raison, Pararth Shah, and Jure Leskovec. 2015. Ringo: Interactive Graph Analytics on Big-Memory Machines. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). Association for Computing Machinery, New York, NY, USA, 1105–1110. https://doi.org/10.1145/2723372.2735369
[72] Minh-Duc Pham and Peter Boncz. 2016. Exploiting Emergent Schemas to Make RDF Systems More Efficient. In The 15th International Semantic Web Conference – ISWC 2016 (Lecture Notes in Computer Science). Springer International Publishing, Kobe, Japan, 463–479.
[73] Minh-Duc Pham, Linnea Passing, Orri Erling, and Peter Boncz. 2015. Deriving an Emergent Relational Schema from RDF Data. In Proceedings of the 24th International Conference on World Wide Web (WWW '15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 864–874.
[74] Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, and Xuemin Lin. 2014. Scalable Big Graph Processing in MapReduce. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). Association for Computing Machinery, New York, NY, USA, 827–838. https://doi.org/10.1145/2588555.2593661
[75] Nicole Redaschi and UniProt Consortium. 2009. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. Nature Precedings (2009).
[76] Laurens Rietveld and Rinke Hoekstra. 2014. YASGUI: Feeling the Pulse of Linked Data. In Knowledge Engineering and Knowledge Management - 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings (Lecture Notes in Computer Science). Springer International Publishing, Linköping, Sweden, 441–452.
[77] Sherif Sakr and Ghazi Al-Naymat. 2010. Relational Processing of RDF Queries: A Survey. SIGMOD Record 38, 4 (2010), 23–28. https://doi.org/10.1145/1815948.1815953
[78] Alexander Schätzle, Martin Przyjaciel-Zablocki, Simon Skilevic, and Georg Lausen. 2016. S2RDF: RDF Querying with SPARQL on Spark. Proceedings of the VLDB Endowment 9, 10 (2016), 804–815.
[79] Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: A Distributed Graph Engine on a Memory Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD '13). Association for Computing Machinery, New York, NY, USA, 505–516. https://doi.org/10.1145/2463676.2467799
[80] W. Shen, J. Wang, and J. Han. 2015. Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2015), 443–460.
[81] Lefteris Sidirourgos, Romulo Goncalves, Martin Kersten, Niels Nes, and Stefan Manegold. 2008. Column-Store Support for RDF Data Management: not all swans are white. Proceedings of the VLDB Endowment 1, 2 (2008), 1553–1563.
[82] Gurkirat Singh, Dhawal Upadhyay, and Medha Atre. 2018. Efficient RDF Dictionaries with B+ Trees. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD '18). ACM, New York, NY, USA, 128–136.
[83] Sparsity Technologies. 2019. Sparksee. http://sparsity-technologies.com/
[84] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web 6, 3 (2008), 203–217.
[85] Systap. 2019. BlazeGraph. https://blazegraph.com/
[86] Niket Tandon, Gerard de Melo, Fabian M. Suchanek, and Gerhard Weikum. 2014. WebChild: Harvesting and Organizing Commonsense Knowledge from the Web. In Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014. ACM, New York, NY, USA, 523–532.
[87] Alberto Tonon, Michele Catasta, Roman Prokofyev, Gianluca Demartini, Karl Aberer, and Philippe Cudre-Mauroux. 2016. Contextualized ranking of entity types based on knowledge graphs. Web Semantics: Science, Services and Agents on the World Wide Web 37 (2016), 170–183.
[88] Jacopo Urbani, Sourav Dutta, Sairam Gurajada, and Gerhard Weikum. 2016. KOGNAC: Efficient Encoding of Large Knowledge Graphs. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016. AAAI Press, New York, NY, USA, 3896–3902.
[89] Jacopo Urbani, Ceriel Jacobs, and Markus Krötzsch. 2016. Column-Oriented Datalog Materialization for Large Knowledge Graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, Phoenix, Arizona, USA, 258–264.
[90] Jacopo Urbani, Jason Maassen, Niels Drost, Frank Seinstra, and Henri Bal. 2013. Scalable RDF data compression with MapReduce. Concurrency and Computation: Practice and Experience 25, 1 (2013), 24–39.
[91] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert. 2016. Triple Pattern Fragments: a low-cost knowledge graph interface for the Web. Web Semantics: Science, Services and Agents on the World Wide Web 37 (2016), 184–206.
[92] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledge base. Commun. ACM 57, 10 (2014), 78–85.
[93] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 2008. Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment 1, 1 (2008), 1008–1019.
[94] Hugh E. Williams and Justin Zobel. 1999. Compressing Integers for Fast File Access. Comput. J. 42, 3 (1999), 193–201. https://doi.org/10.1093/comjnl/42.3.193
[95] Marcin Wylot, Manfred Hauswirth, Philippe Cudré-Mauroux, and Sherif Sakr. 2018. RDF Data Storage and Query Processing Schemes: A Survey. ACM Computing Surveys (CSUR) 51, 4 (2018), 84:1–84:36.
[96] Mohamed Yahya, Denilson Barbosa, Klaus Berberich, Qiuyue Wang, and Gerhard Weikum. 2016. Relationship Queries on Extended Knowledge Graphs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 605–614.
[97] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. 2013. TripleBit: a fast and compact system for large scale RDF data. Proceedings of the VLDB Endowment 6, 7 (2013), 517–528.
[98] Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. 2013. A distributed graph engine for web scale RDF data. Proceedings of the VLDB Endowment 6, 4 (2013), 265–276.
[99] Lei Zou, M. Tamer Özsu, Lei Chen, Xuchuan Shen, Ruizhe Huang, and Dongyan Zhao. 2014. gStore: a graph-based SPARQL query engine. The VLDB Journal 23, 4 (2014), 565–590.