PHD-Store: An Adaptive SPARQL Engine with Dynamic Partitioning for Distributed RDF Repositories
Razen Al-Harbi, KAUST, Saudi Arabia
Yasser Ebrahim, EPFL, Switzerland (yasser.ibrahim@epfl.ch)
Panos Kalnis, KAUST, Saudi Arabia
ABSTRACT
Many repositories utilize the versatile RDF model to publish data. Repositories are typically distributed and geographically remote, but data are interconnected (e.g., the Semantic Web) and queried globally by a language such as SPARQL. Due to the network cost and the nature of the queries, the execution time can be prohibitively high. Current solutions attempt to minimize the network cost by redistributing all data in a preprocessing phase, but there are two drawbacks: (i) redistribution is based on heuristics that may not benefit many of the future queries; and (ii) the preprocessing phase is very expensive even for moderate size datasets.

In this paper we propose PHD-Store, a SPARQL engine for distributed RDF repositories. Our system does not assume any particular initial data placement and does not require prepartitioning; hence, it minimizes the startup cost. Initially, PHD-Store answers queries using a potentially slow distributed semi-join algorithm, but adapts dynamically to the query load by incrementally redistributing frequently accessed data. Redistribution is done in a way that future queries can benefit from fast hash-based parallel execution. Our experiments with synthetic and real data verify that PHD-Store scales to very large datasets and many repositories, converges to comparable or better quality of partitioning than existing methods, and executes large query loads 1 to 2 orders of magnitude faster than our competitors.
1. INTRODUCTION
RDF [3] datasets consist of ⟨subject, predicate, object⟩ triples, where the predicate represents a relationship between two entities: the subject and the object. They can be viewed as directed labeled graphs, where vertices and edge labels correspond to entities and predicates, respectively. Figure 1 shows an example RDF graph of students and professors in an academic network. The RDF model does not require a predefined schema and is a versatile way to represent information from diverse sources. It is used in the Semantic Web and in a variety of applications including social networks, online shopping, scientific databases, etc.

Figure 1: Example RDF graph. An edge with its associated vertices corresponds to an RDF triple; e.g., ⟨Prof.Williams, worksFor, Stanford-CS⟩. The dotted line depicts a MinCut partitioning.

SPARQL is the standard query language for RDF. Queries consist of a set of RDF triple patterns, where some of the columns are variables. For example, let Q_prof be:

SELECT ?x WHERE {
  ?x worksFor Stanford-CS
  Lisa advisor ?x }

Q_prof returns Lisa's advisors who work for Stanford CS. The query corresponds to the graph of Figure 2(a). The answer is the set of bindings of ?x that render the query graph isomorphic to subgraphs in the data. In our example, ?x ∈ {Prof.Williams, Prof.James} (see Figure 1).

Let the data be stored in a table D(s, p, o), where rows are RDF triples ⟨s, p, o⟩. To answer Q_prof, first decompose it into two subqueries and answer them independently by scanning table D: q1 ≡ σ_{p=worksFor ∧ o=Stanford-CS}(D) and q2 ≡ σ_{s=Lisa ∧ p=advisor}(D). Then, join the intermediate results on the subject and object attributes: q1 ⋈_{q1.s = q2.o} q2. If all data are on the same server, the plan can be executed efficiently by a system like RDF-3X [25], which indexes all combinations of the three attributes of D.
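To make the decomposition concrete, the following is a minimal, self-contained C++ sketch (not part of PHD-Store) that evaluates Q_prof over a small in-memory triple table by applying the two selections and the subject-object join; a real engine such as RDF-3X would use its indexes instead of scans.

#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

int main() {
    // A fragment of the RDF graph of Figure 1.
    std::vector<Triple> D = {
        {"Prof.Williams", "worksFor", "Stanford-CS"},
        {"Prof.James",    "worksFor", "Stanford-CS"},
        {"Lisa",          "advisor",  "Prof.Williams"},
        {"Lisa",          "advisor",  "Prof.James"},
        {"Lisa",          "gradFrom", "MIT"},
    };

    // q1 = sigma_{p=worksFor AND o=Stanford-CS}(D)
    std::vector<Triple> q1;
    for (const auto& t : D)
        if (t.p == "worksFor" && t.o == "Stanford-CS") q1.push_back(t);

    // q2 = sigma_{s=Lisa AND p=advisor}(D)
    std::vector<Triple> q2;
    for (const auto& t : D)
        if (t.s == "Lisa" && t.p == "advisor") q2.push_back(t);

    // Join on q1.s = q2.o; the matching subjects of q1 are the bindings of ?x.
    for (const auto& a : q1)
        for (const auto& b : q2)
            if (a.s == b.o) std::cout << "?x = " << a.s << "\n";
    // Prints Prof.Williams and Prof.James, as in the example.
}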
In many practical applications D is distributed among many geographically remote repositories. For instance, our example involves two universities, Stanford and MIT. It is natural to allow each university to handle its own data. Therefore, triple ⟨Lisa, advisor, Prof.Williams⟩ is expected to be in Stanford's server, whereas MIT will store ⟨Lisa, gradFrom, MIT⟩. Consider query Q_stud = "Find the students who graduated from MIT and are advised by Prof. Williams", depicted in Figure 2(b). None of the servers has all necessary triples to execute the join. Therefore, intermediate results must be transferred from Stanford to MIT, or vice-versa. In practice, the intermediate results can be very large and the bandwidth between remote servers may fluctuate considerably. Due to the communication cost, the resulting response time can be unacceptably long.

Figure 2: SPARQL queries: (a) Q_prof: find Lisa's advisors who work for Stanford CS; (b) Q_stud: find the advisees of Prof. Williams who graduated from MIT.

A possible solution could be based on each server caching its partial results. If the same query is asked again, each server returns its cached results to the master, which consolidates them into the final answer. However, this approach suffers from the following drawbacks: (i) it only works well if each query is asked many times; (ii) a cached query cannot answer another query that has the same pattern but different variables.

Another approach to minimize the communication cost is to prepartition the data. Previous work [15, 18, 27] introduced a preprocessing step where the entire dataset is partitioned among the servers by hashing on the subject or object column. In our example, let subject be the hash key, forcing triples ⟨Lisa, advisor, Prof.Williams⟩ and ⟨Lisa, gradFrom, MIT⟩ to move to the same server. Then, Q_stud can be answered without communication. This is also true for any star-shaped query where subject is the variable. Unfortunately, because of the different hash keys, ⟨Lisa, advisor, Prof.Williams⟩ and ⟨Prof.Williams, worksFor, Stanford-CS⟩ are likely to be in different servers; therefore, query Q_prof still requires considerable communication.

A recent work [20] employs a MinCut [22] algorithm to partition the graph during preprocessing. This step generates partitions with a roughly balanced number of vertices and a minimal number of edges between different partitions. The dotted line in Figure 1 depicts such a partitioning. Observe that all necessary triples for Q_prof are now in the lower partition; therefore, the query can be executed without communication. This is true for Q_stud as well, assuming that triples follow the placement of their subject vertex (e.g., ⟨Lisa, gradFrom, MIT⟩ will be placed at the lower partition, because of Lisa).
Nevertheless, there still exist numerous queries that cross partition boundaries and require significant communication, such as: ⟨?x, subOrgOf, ?y⟩ AND ⟨?y, type, University⟩.

Current partitioning approaches have three drawbacks: (i) Static partitioning (hash-based or MinCut) is not necessarily a good fit for many queries. As explained above, there exist workloads that access many servers no matter how partitioning is done. Hash partitioning, in particular, is useful only for star-shaped queries. (ii) The preprocessing step is expensive. MinCut partitioning, in particular, requires transferring the entire dataset to a central location; running MinCut, which needs several hours on many CPUs and hundreds of GB of RAM even for moderate size graphs; and transferring the resulting partitions back to the servers. (iii) The partitioning cost is paid for the entire dataset, even if future queries will access only a small subset of the graph.

We propose PHD-Store, a SPARQL engine for distributed, geographically remote RDF repositories. PHD-Store optimizes the execution of distributed joins without relying on static partitioning. Instead, it adapts by dynamically redistributing portions of the graph that are accessed by the query load. Consider again Q_prof in Figure 2(a). Our system selects a vertex in the query graph as a core vertex, say ?x. PHD-Store places the bindings of ⟨?x, worksFor, Stanford-CS⟩ using the values of the core (?x) as hash keys. In this case, ⟨Prof.Williams, worksFor, Stanford-CS⟩ and ⟨Prof.James, worksFor, Stanford-CS⟩ are placed in two different servers (since the keys are different). ⟨Lisa, advisor, Prof.Williams⟩ and ⟨Lisa, advisor, Prof.James⟩ follow the placement of Prof.Williams and Prof.James, respectively. No further triples are moved. In general, vertices that bind to the core will have their neighbors copied to the same server, in a recursive fashion that results in a tree-shaped distribution. We call this Propagating Hash Distribution (PHD). Our method achieves two goals: (i) Minimal communication: Q_prof can now be executed without exchanging intermediate results between the servers; and (ii) Parallel mode: the required data are redistributed in multiple servers that can work in parallel to minimize response time.

PHD-Store maintains an index of patterns that have been redistributed, and a query optimizer that combines parts of multiple previously redistributed queries to answer a new query in parallel mode. It also accepts any initial placement of data (including random) and starts processing queries immediately. By avoiding the upfront cost and adopting a pay-as-you-go approach, our system can execute tens of thousands of queries within the time it takes our competitors to partition even a small graph. More importantly, the quality of the PHD partitioning is, in general, better, since it is guided by the actual query load. Therefore, PHD-Store scales to very large graphs and many RDF repositories. In summary, our contributions are:

• We introduce PHD-Store, a SPARQL engine for distributed RDF repositories, that does not require expensive preprocessing.

• We propose PHD, an adaptive technique that redistributes data dynamically, in a way that future queries can be executed in parallel mode.

• We evaluate our system using synthetic and real data on a cluster of 21 machines. PHD-Store is initialized in around one minute, whereas our competitors need up to 22 hours. Consequently, PHD-Store can execute a large workload 1 to 2 orders of magnitude faster than existing approaches.

The rest of this paper is organized as follows: Section 2 discusses the related work. Section 3 presents the architecture of PHD-Store. Section 4 discusses the adaptive indexing mechanism, whereas Section 5 explains the query indexing and distributed data management. Section 6 discusses how updates are managed in PHD-Store. Section 7 contains the experimental results and Section 8 concludes the paper.

2. RELATED WORK
Centralized RDF stores.
Early approaches, such as RDFSuite [6] and Sesame [10], store RDF triples ⟨s, p, o⟩ as large tables in a relational database, usually with indices on all three columns. Jena [30] adds support for rich features, such as inference. Abadi et al. [4] use a collection of smaller tables, one for each distinct predicate value, and employ a column-based DBMS. More recent systems, such as YARS2 [18] and HPRD [23], implement native RDF stores with specialized indexing. Both, however, lack an efficient query optimizer. RDF-3X [25] is one of the most promising native RDF stores. It maintains indices that cover all permutations of the three columns of RDF triples, uses rigorous byte-level compression, and its query optimizer favors fast sort-merge joins. Similar ideas are implemented in Hexastore [29].

Materialized views.
Recent works attempt to speed up the execution of SPARQL queries by selecting a set of views to materialize based on a given workload [11, 17]; or by materializing a set of path expressions based again on the workload [14]; and by introducing query rewriting techniques that use the materialized views [12, 17]. In our approach, we do not generate materialized views or perform any query rewriting. Instead, we redistribute and possibly replicate the data accessed by queries in a way that these queries can be executed in parallel mode. We also introduce a mechanism for indexing queries and managing the redistributed data. Nevertheless, because of replication, we share with materialized views the consistency maintenance problem.
Hash-based distributed RDF stores.
All existing distributed solutions need to prepartition the data in order to minimize communication during query execution. Several systems [15, 18, 27] use hashing to distribute all data during a preprocessing step. As explained previously, hash partitioning works well only for 1-hop star-shaped queries: servers work in parallel on their local data without exchanging intermediate results (i.e., no communication and no delay due to synchronization barriers), and on average the load is balanced among servers. Unfortunately, hash partitioning is inefficient for queries that are not star-shaped or that span more than one hop; such queries are common in SPARQL. Our work extends the idea of hash partitioning to support parallel execution of complex queries.
Optimized partitioning and replication.
Recent systems partition the graph by applying a minimum cut algorithm, such as METIS [22]. Intuitively, if fewer edges cross partitions, the probability of answering a query without communication among servers is increased. There are two drawbacks: (i) min-cut is extremely expensive; and (ii) as discussed in Section 1, there are still a lot of queries that require communication. Huang et al. [20] remove the high-degree vertices prior to partitioning to reduce the complexity of min-cut. They also enforce the so-called k-hop guarantee: vertices are replicated among partitions, such that any query with radius k or less (recall that queries are represented as graphs) can be executed without communication. (To minimize replication, high-degree vertices utilize a 1-hop instead of a k-hop guarantee.) Unfortunately, partitioning still takes several hours even for moderate size graphs. Moreover, replication increases exponentially with k; therefore k must be kept small (e.g., k ≤ 2). If the query radius is larger than k, or the query splits around a high-degree vertex (both cases are common in SPARQL), then the query is answered by a series of MapReduce jobs.

There also exist relevant systems that focus on data models other than RDF. Schism [13] deals with the problem of data placement for distributed OLTP RDBMS. Using a sample of the workload, Schism minimizes the number of distributed transactions by populating a graph of co-accessed tuples. The graph is partitioned by METIS and data are redistributed to servers accordingly. Tuples accessed in the same transaction are put in the same server. This is not appropriate for SPARQL queries because some queries access large parts of the data that would overwhelm a single machine. Instead, PHD-Store exploits parallelism by executing such a query across all machines in parallel without communication. H-Store [28] is an in-memory distributed OLTP RDBMS that uses a data partitioning technique similar to ours and has been extended [26] to handle skew in the data and workloads. Nevertheless, H-Store assumes that the schema and complete workload are specified in advance and assumes no ad-hoc queries. Although these are valid assumptions for OLTP databases, they are not for RDF data stores. NuoDB [2] is a commercial, ACID compliant database that supports SQL. NuoDB does not employ sharding for partitioning the database; rather, it smartly caches atoms in multiple servers based on the workload.

Another recent work by Yang et al. [31] focuses on general graphs but can be applied to RDF data. The entire graph is replicated several times and each replica is partitioned in a different way. Each replica runs an instance of Pregel [24]. The query optimizer directs each query to the most suitable replica that minimizes communication. The method has two drawbacks: (i) there is excessive replication; and (ii) data must be localized in order to build the adjacency list of each vertex, required by Pregel, which is a very inefficient process.

Eventual indexing.
Idreos et al. [21] introduce the concept of reducing the data-to-query time for relational data. They avoid building indices during data loading; instead, they reorder tuples incrementally based on the query ranges. A recent work [5] focuses on executing queries on raw files, and similarly builds an index incrementally for future queries by amortizing the cost among past file accesses. In PHD-Store, we extend the concept of eventual indexing to dynamic and adaptive graph partitioning. In our problem, graph prepartitioning is very expensive; hence, the potential benefits of eliminating the data-to-query time are large.
3. SYSTEM ARCHITECTURE
PHD-Store organizes a large number of independent and geographically remote RDF repositories into a federation, allowing users to pose queries over the union of the entire collection. The system architecture is depicted in Figure 3.

Figure 3: System architecture. Workers correspond to independent RDF repositories.
Master.
The master node receives queries from the users, generates an execution plan, coordinates the workers, collects the final result and returns it to the user. The master contains a query index that stores information about the query patterns that have been redistributed. The query index is used by the query planner to decide whether a new query can be executed in parallel using the current distribution. The query index will be explained in Section 5. The master also contains a statistics manager that maintains useful statistics about the RDF graph.

Worker.
Each RDF repository is called a worker. Let there be N workers in the system: w_1, w_2, ..., w_N. Each worker w_i stores locally a set D_i of triples. The entire dataset D is defined as D = ∪_{i=1}^{N} D_i.

Each worker stores its local set of triples in an in-memory structure, called the main index. It contains a single storage module that consists of the following indices: (i) Predicate: given a predicate p, return a list of all ⟨subject, object⟩ pairs. (ii) Predicate-Subject: given a predicate p and a subject s, return a list of all ⟨s, object⟩ pairs. (iii) Predicate-Object: given a predicate p and an object o, return a list of all ⟨subject, o⟩ pairs. The predicate index is implemented as a hash map and the other two as nested hash maps. Typically, the number of predicates in RDF datasets is small compared to the number of triples. To eliminate redundant repetitions of the predicate, the indices store ⟨subject, object⟩ pairs instead of RDF triples. Each pair is stored only once, and only pointers to these pairs are stored in the indices. Figure 4 shows an example. The main index is used to answer queries that cannot be executed in parallel mode.

Figure 4: Structure of the main index of a worker: (a) raw local RDF data at the worker; (b) predicate, predicate-subject and predicate-object indices.
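A minimal sketch of how such a main index could be organized, assuming simple STL containers; the class and method names are illustrative and not the actual PHD-Store implementation. Each ⟨subject, object⟩ pair is allocated once and the three indices keep shared pointers to it.

#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Pair { std::string subject, object; };
using PairPtr = std::shared_ptr<Pair>;

class MainIndex {
    // Predicate index: p -> all pairs
    std::unordered_map<std::string, std::vector<PairPtr>> byP;
    // Predicate-Subject index: p -> s -> pairs
    std::unordered_map<std::string,
        std::unordered_map<std::string, std::vector<PairPtr>>> byPS;
    // Predicate-Object index: p -> o -> pairs
    std::unordered_map<std::string,
        std::unordered_map<std::string, std::vector<PairPtr>>> byPO;
public:
    void insert(const std::string& s, const std::string& p, const std::string& o) {
        auto pr = std::make_shared<Pair>(Pair{s, o});   // the pair is stored once
        byP[p].push_back(pr);                           // the indices store pointers only
        byPS[p][s].push_back(pr);
        byPO[p][o].push_back(pr);
    }
    const std::vector<PairPtr>& lookupPO(const std::string& p, const std::string& o) const {
        static const std::vector<PairPtr> empty;
        auto i = byPO.find(p);
        if (i == byPO.end()) return empty;
        auto j = i->second.find(o);
        return j == i->second.end() ? empty : j->second;
    }
};

int main() {
    MainIndex idx;  // local triples of one worker (fragment of Figure 1)
    idx.insert("Prof.Williams", "worksFor", "Stanford-CS");
    idx.insert("Prof.James",    "worksFor", "Stanford-CS");
    for (const auto& pr : idx.lookupPO("worksFor", "Stanford-CS"))
        std::cout << pr->subject << "\n";   // bindings of ?x for Q_prof's first pattern
}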
Each worker also contains an in-memory replica index that stores copies of subsets of the entire dataset D. The replica index is used when answering queries in parallel mode. Initially, this index contains no data, but it is maintained and updated dynamically by the PHD redistribution process, which may replicate in worker w_i triples from several workers. The process will be explained in Section 4.

PHD-Store uses the degrees of vertices in the data to plan for query execution and redistribution, as we explain in Section 4.1. Keeping this information for each vertex in the entire data is prohibitively expensive. PHD-Store solves the problem by focusing on predicates rather than vertices. For each unique predicate p, we calculate the corresponding subject and object scores, defined as follows:

Definition 1 (Predicate scores). Let p be a predicate. (i) The subject score of p, denoted p_S, is the average degree of all vertices s such that ⟨s, p, ?x⟩ ∈ D. (ii) The object score of p, denoted p_O, is the average degree of all vertices o such that ⟨?x, p, o⟩ ∈ D.

Figure 5: Statistics calculation for p = worksFor, based on the graph of Figure 1: p_S = (3 + 4 + 3)/3; p_O = (4 + 3)/2.

Since D is split among many workers, vertices are typically replicated. Identifying unique vertices, in order to calculate predicate scores, is very expensive due to the communication cost. Therefore, PHD-Store maintains only approximate statistics. Each worker calculates independently its predicate scores and sends them to the master. The master calculates the global scores as the average of the local ones. The process is fast and provides an adequate approximation for query optimization and adaptivity purposes.

PHD-Store does not require global indexing and can start from any initial data partitioning, including random. To start a federation of RDF repositories, each worker builds independently its main index and collects statistics. The process is very efficient since it involves a single scan of the local data. Afterwards, PHD-Store can start answering queries immediately.

A user submits a SPARQL query Q to the master. The query planner at the master consults the query index to decide whether Q is processable in parallel mode, or if distributed semi-joins must be used.

Distributed mode (semi-joins).
If Q cannot be answered in parallel mode, PHD-Store executes the query by a series of distributed semi-joins; the process is described in Algorithm 1.

Algorithm 1: Distributed semi-join on N workers. Each worker executes this algorithm.
Input: Query Q consisting of subqueries {q1, q2}
Result: Answer of query Q
1:  Let q1 and q2 be joined on subject s and object o, respectively
2:  RS1 ← answerSubquery(q1)
3:  RS2 ← answerSubquery(q2)
4:  RS1[s] ← π_s(RS1)   // projection on s
5:  Send RS1[s] to all workers
6:  foreach worker w_i, i: 1 → N do
7:    Let RS_{wi}[s] denote the RS1[s] received from w_i
8:    Let CRS_{wi} be the candidate triples of RS2 that join with RS_{wi}[s]
9:    CRS_{wi} ← RS2 ⋈_{RS2.o = RS_{wi}[s].s} RS_{wi}[s]
10:   Send CRS_{wi} to worker w_i
11: Let RS_{wi} be the CRS received from worker w_i
12: Let RES_{wi} be the result of joining with the data received from worker w_i
13: RES_{wi} ← RS1 ⋈_{RS1.s = RS_{wi}.o} RS_{wi}
14: q1 ⋈ q2 ← RES_{w1} ∪ RES_{w2} ∪ ... ∪ RES_{wN}
15: Send the partial result q1 ⋈ q2 to the master

Any worker may contain relevant data. Each worker w sends to all other workers a projection on the join column of the relevant triples (line 5). All workers perform the semi-join on the received data (line 9) and send the results back to w (line 10). w finalizes the join (line 13) and returns the partial answer to the master, which forwards it to the user without further processing. Lines 9 and 13 are implemented as local hash-joins, using the main index in each worker. Since Q may consist of multiple subqueries, say {q1, q2, q3}, the query is evaluated by joining q1 and q2, then joining the result with q3; each join uses Algorithm 1. Note that the master is not involved in the distributed join, and all communication is done among workers.

Since our data are memory resident, each machine uses hash joins, as they prove to be competitive to more sophisticated methods [8]. The hash join consists of two phases: build and probe. Our data are already hash-indexed, so we do not need the build phase; therefore, the optimizer tries to minimize the number of probes. Currently, the optimizer generates a right-deep join tree, starting with the subquery with the least cardinality. More sophisticated methods like the one discussed in [9] are orthogonal to our work.
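The exchange pattern of Algorithm 1 can be illustrated with a toy, single-process C++ sketch in which the network transfer is replaced by in-memory vectors (the actual system uses MPI). The data follow the Q_stud example, where the advisor triple resides on Stanford's worker and the gradFrom triple on MIT's worker; the join here is subject-subject rather than subject-object, but the projection, semi-join and ship-back steps mirror lines 5, 9, 10 and 13.

#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

int main() {
    // Local results of the two subqueries on each worker:
    // q1 = <?s, advisor, Prof.Williams> (Stanford), q2 = <?s, gradFrom, MIT> (MIT).
    std::vector<Triple> rs1_stanford = {{"Lisa", "advisor", "Prof.Williams"}};
    std::vector<Triple> rs2_mit      = {{"Lisa", "gradFrom", "MIT"}};

    // Line 5: Stanford sends the projection of its q1 results on the join column.
    std::vector<std::string> projection;
    for (const auto& t : rs1_stanford) projection.push_back(t.s);

    // Line 9 (at MIT): keep only the q2 triples that join with the received projection.
    std::vector<Triple> crs;
    for (const auto& t : rs2_mit)
        for (const auto& key : projection)
            if (t.s == key) crs.push_back(t);

    // Line 10: MIT sends the candidate triples back; line 13 (at Stanford):
    // finalize the join locally and forward the partial answer to the master.
    for (const auto& a : rs1_stanford)
        for (const auto& b : crs)
            if (a.s == b.s) std::cout << "?s = " << a.s << "\n";   // Lisa
}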
Parallel mode.
In this case, Q is executed in parallel without communication. The master broadcasts Q to all workers. Each worker uses its replica index to construct a partial result, which is sent to the master; no communication among workers is needed. Finally, the master forwards the partial results to the user; no processing is required at the master. Locally, each worker uses hash joins, as discussed above.
Dynamic redistribution.
PHD-Store monitors the frequency of each query. When the frequency of a query increases above a system-wide threshold, the redistribution process for that query is triggered. After redistribution, the query can be executed in parallel mode. Note that redistribution does not benefit only the query that triggered it; future queries can utilize a subset of the past redistributed data, and can run in parallel mode.
4. PHD-STORE ADAPTIVITY
The dynamic redistribution model of PHD-Store is a combination of hash partitioning and k-hop replication; however, it is guided by the query load rather than the data itself. Specifically, given a frequent query Q, our system selects a special vertex in the query graph called the core vertex. The system groups the data accessed by the query around the bindings of this core vertex. To do so, the system decomposes the query into a redistribution tree rooted at the core. Then, starting from the core vertex, first-hop triples are hash distributed based on the core bindings. Next, triples that bind to the second level subqueries are collocated, and so on. A redistributed query can be executed in parallel without communication. Moreover, queries that have not been redistributed can combine data from already redistributed queries and then run in parallel mode.

The choice of the core has a significant impact on the amount of replicated data as well as on the query execution performance. Consider query Q = ⟨?stu, gradFrom, ?uni⟩. Assume there are two workers, w1 and w2, and refer to the graph of Figure 1; MIT is the only university satisfying the query. If ?uni is the core, then MIT is hashed to w1. Lisa and John are also propagated to w1 (see Figure 6(a)); therefore there is no replication. On the other hand, if ?stu is the core, then Lisa and John are hashed to w1 and w2, respectively. Then MIT is propagated to the core, therefore there are replicas in both workers (see Figure 6(b)). The problem becomes more pronounced when a query has more hops. Consider Q' = Q AND ⟨?dept, subOrgOf, ?uni⟩ and choose ?stu as core. Because MIT is replicated, all the sub-organizations of MIT will also be replicated. This is a significant cost because replication cost grows exponentially with the number of hops [20].

Figure 6: Effect of the choice of core on replication: (a) core is ?uni, there is no replication; (b) core is ?stu, MIT is replicated in workers w1 and w2.

Intuitively, if random walks start from two random vertices (e.g., students), the probability to reach the same well-connected vertex (e.g., university) within a few hops is higher than reaching the same student from two universities. In order to minimize replication, we must avoid reaching the same vertex when starting from the core. Therefore, it is reasonable to select a well-connected vertex as the core. In the literature there are many definitions of what constitutes a well-connected vertex, many of which are based on complex data mining algorithms. In contrast, we employ a definition that poses minimal computational overhead: we assume that well-connectivity is proportional to the degree (i.e., in-degree plus out-degree) of the vertex.

Nonetheless, many RDF datasets follow the power-law distribution in which few vertices are of extremely high degrees. Treating such vertices as cores is problematic because they cause many vertices to be placed in the same worker. Vertices that appear as objects in triples with the rdf:type predicate are also problematic. Selecting these vertices to be cores will cause all vertices of the same type to be placed in one worker. This would overwhelm the worker and would not take advantage of parallelism [20].

Recall from Section 3.2 that we maintain statistics p_S and p_O for each predicate p ∈ P, where P is the set of all predicates in the data.
Let P_S and P_O be the sets of all p_S and p_O values, respectively. The following three conditions are checked for each predicate p: (i) p is the reserved rdf:type predicate; (ii) p_S is more than three standard deviations away from the arithmetic mean of all p_S ∈ P_S; (iii) p_O is more than three standard deviations away from the arithmetic mean of all p_O ∈ P_O. The first condition prevents the type vertices from being selected as core vertices. Similarly, the other two conditions treat extremely high degree vertices as outliers and prevent them from being cores (Huang et al. [20] also use the same cut-off threshold). If any of the previous conditions is satisfied, then we set p_S = p_O = −∞; otherwise we use the p_S and p_O values as computed in Section 3.2. Now, we can compute a score for each vertex as follows:

Definition 2 (Vertex score).
For a query vertex v, let E_out(v) be the set of outgoing edges (i.e., predicates where v appears as subject), and E_in(v) the set of incoming edges (i.e., predicates where v appears as object). Also, let S be the set of all p_S values for the E_out(v) edges and all p_O values for the E_in(v) edges. The vertex score of v is defined as score(v) = max(S).

Figure 7: Example of vertex scores: numbers correspond to p_S and p_O values; assigned vertex scores are shown in bold.

Figure 7 shows an example. For vertex ?d, E_out(?d) = {subOrgOf, type} and E_in(?d) = {memberOf}. The corresponding predicate scores are 3, 1 and 4. Therefore, the score of ?d is 4.
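The following C++ sketch computes vertex scores for the query of Figure 7. The values 3, 1 and 4 come from the example above; the remaining statistics (memberOf p_S = 2, subOrgOf p_O = 5) are made-up values chosen only so that ?u obtains the highest score, as in the paper, and the rdf:type object score is already set to −∞ by the outlier rules.

#include <algorithm>
#include <iostream>
#include <limits>
#include <map>
#include <string>
#include <vector>

struct QueryEdge { std::string subject, predicate, object; };

int main() {
    const double NEG_INF = -std::numeric_limits<double>::infinity();
    // Approximate predicate statistics {p_S, p_O}; "assumed" values are illustrative.
    std::map<std::string, std::pair<double,double>> stats = {
        {"memberOf", {2.0 /* assumed */, 4.0}},
        {"subOrgOf", {3.0, 5.0 /* assumed */}},
        {"type",     {1.0, NEG_INF}},   // rdf:type objects are never cores
    };
    // Query of Figure 7: ?s -memberOf-> ?d, ?d -subOrgOf-> ?u, ?d -type-> department.
    std::vector<QueryEdge> query = {
        {"?s", "memberOf", "?d"},
        {"?d", "subOrgOf", "?u"},
        {"?d", "type",     "department"},
    };
    std::map<std::string, double> score;
    auto update = [&](const std::string& v, double val) {
        auto it = score.find(v);
        if (it == score.end()) score[v] = val;
        else it->second = std::max(it->second, val);
    };
    // score(v) = max of p_S over outgoing edges and p_O over incoming edges.
    for (const auto& e : query) {
        update(e.subject, stats[e.predicate].first);    // v appears as subject
        update(e.object,  stats[e.predicate].second);   // v appears as object
    }
    auto core = std::max_element(score.begin(), score.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    std::cout << "score(?d) = " << score["?d"]
              << ", core vertex = " << core->first << "\n";   // prints 4 and ?u
}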
Definition 3 (Core vertex). Given a query Q, the vertex v* with the highest score is called the core vertex.

Let Q be a frequent query that PHD-Store decided to redistribute. Our goal is to generate a redistribution tree that minimizes the expected amount of replication. In Section 4.1 we explained why starting from the vertex with the highest score has the potential to minimize replication. Intuitively, the same idea applies recursively to each level of the redistribution. Therefore, our query redistribution tree spans all the edges of the query graph, such that every child node in the tree potentially has a lower (or equal) score than its parent. Each of the edges in the query graph should appear exactly once in the tree; vertices may be repeated.

Using the scoring function discussed in the previous section, we transform Q into a vertex-weighted, undirected graph G. The vertex with the highest score is selected as the core vertex. Then, G is decomposed into the redistribution tree using Algorithm 2.
Algorithm 2: Generate Redistribution Tree
Input: G = {V, E}: a vertex-weighted, undirected query graph; v*: the core vertex
Result: A tree T
  Let core_edges be all edges incident to v*
  foreach edge e in core_edges do
    e.parent ← φ
    add e to pendingList
  while pendingList is not empty do
    // e is the edge connected to the highest score vertex
    e ← getHighestScoreEdge(pendingList)
    remove e from pendingList
    // extend the path of e.parent with e
    appendToPath(e, e.parent)
    adj ← getAdjacentEdgesTo(e)
    foreach α in adj do
      if α is NOT explored then
        α.parent ← e
        add α to pendingList
  T.root ← v*

The algorithm keeps exploring edges starting from high-score vertices towards lower-score ones. All edges incident to the core vertex v* are inserted in a pending edges set. Then, the algorithm gradually keeps exploring new edges by removing the edge with the highest vertex score first from the set, and inserting all its adjacent edges into the pending edges. Note that the direction of traversal of the graph is independent from the actual edge directions of the query. The result is a tree with the core vertex v* as root. As an example, consider the query in Figure 7. Having the highest score, ?u is chosen as core, and the query is decomposed into the tree shown in Figure 8. Note that the directions of the edges in Figure 8 are only used to define the join columns, i.e., subject-subject, object-object or subject-object; they do not influence the tree traversal.

Our redistribution algorithm is a hybrid of hash partitioning and k-hop replication. Given a redistribution tree, PHD-Store distributes the data along paths from the root to the leaves using breadth-first traversal. The algorithm proceeds in two phases. First, it distributes triples that contain the core vertex to workers using a hash function H(·). Let t be such a triple and denote by t.v* its core vertex (the core can be either the subject or the object of t). Let w_1, w_2, ..., w_N be the workers. t will be replicated in worker w_j, where j = H(t.v*) mod N.

In Figure 8, consider the first hop ⟨?d, subOrgOf, ?u⟩ of the highlighted path. Using the previous formula, the core ?u determines the placement of t1, t2 and t3 (see Table 1). Assuming two workers, t1 and t2 are replicated in w2 (because of Stanford), whereas t3 is replicated in w1 (because of MIT). ?u is called the source column of these triples.
Definition 4 (Source column). The source column of a triple ⟨s, p, o⟩ is the column (subject or object) that is used to determine its placement.

The second phase of PHD places triples of the remaining levels of the tree in the workers that contain their parent, through a series of distributed semi-joins. The column at the opposite end of the source column of the previous step becomes the propagating column; in our example, the propagating column is ?d.

Definition 5 (Propagating column). The propagating column of a triple ⟨s, p, o⟩ is the column (subject or object) that is at the opposite end of the corresponding source column. It is the join attribute that determines the placement of the next level of triples.

Figure 8: Redistribution tree for the query in Figure 7. The selected (bold) part of the tree is a path from the core vertex ?u to the leaf ?s.

Table 1: Triples from Figure 1 matching the highlighted path of Figure 8
  t1 = ⟨Stanford-CS, subOrgOf, Stanford⟩      worker w2
  t2 = ⟨Stanford-ENG, subOrgOf, Stanford⟩     worker w2
  t3 = ⟨MIT-CS, subOrgOf, MIT⟩                worker w1
  t4 = ⟨Ben, memberOf, Stanford-CS⟩           worker w2
  t5 = ⟨Prof.James, memberOf, Stanford-ENG⟩   worker w2
  t6 = ⟨John, memberOf, Stanford-ENG⟩         worker w2
  t7 = ⟨Peter, memberOf, MIT-CS⟩              worker w1
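A compact C++ sketch of the two placement phases for the highlighted path, over the triples of Table 1. std::hash stands in for the system's hash function H(·), so the concrete worker numbers printed by the sketch need not match the w1/w2 labels of Table 1; what matters is that level-1 triples are hashed on the core binding and level-2 triples follow their parents.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

int main() {
    const int N = 2;                 // number of workers
    std::hash<std::string> H;

    // Phase 1: triples matching <?d, subOrgOf, ?u> are hashed on the core ?u.
    std::vector<Triple> level1 = {
        {"Stanford-CS",  "subOrgOf", "Stanford"},   // t1
        {"Stanford-ENG", "subOrgOf", "Stanford"},   // t2
        {"MIT-CS",       "subOrgOf", "MIT"},        // t3
    };
    std::vector<std::pair<Triple,int>> placed;
    for (const auto& t : level1) {
        int j = static_cast<int>(H(t.o) % N);       // source column is ?u (the object)
        placed.push_back({t, j});
        std::cout << t.s << " -> worker " << j << "\n";
    }

    // Phase 2: triples matching <?s, memberOf, ?d> follow the placement of the
    // level-1 triple they join with (propagating column ?d of the parent).
    std::vector<Triple> level2 = {
        {"Ben",        "memberOf", "Stanford-CS"},   // t4
        {"Prof.James", "memberOf", "Stanford-ENG"},  // t5
        {"John",       "memberOf", "Stanford-ENG"},  // t6
        {"Peter",      "memberOf", "MIT-CS"},        // t7
    };
    for (const auto& t : level2)
        for (const auto& [parent, j] : placed)
            if (t.o == parent.s)                     // join on ?d
                std::cout << t.s << " follows " << parent.s << " -> worker " << j << "\n";
}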
In Figure 8, the second subquery of the highlighted path is ⟨?s, memberOf, ?d⟩. ?d from the previous level now becomes the source column. Triples t4 to t7 (see Table 1) match the subquery and are joined with triples t1 to t3. Therefore, t4, t5 and t6 are propagated to worker w2, whereas t7 is propagated to w1. The process is formally described in Algorithm 3. The algorithm runs in parallel on all workers. Lines 5-8 perform hash distribution of all subqueries incident to the core; we call this propagation level 0. Propagation to the next levels is done through a series of semi-joins between each level in the path and the level directly before it. Only triples that satisfy the join condition with the previous level are kept. This procedure causes triples on level i to follow the placement of the triples in the parent level i − 1.

Algorithm 3: Performing PHD on a given tree T that resulted from the decomposition of query Q
Input: Query graph decomposed as a labeled tree T; L is the number of levels in T
Result: Data propagated from the root toward the leaves
1:  Let source_nodes be the set of source vertices
2:  Let propagation_nodes be the set of propagation vertices
3:  addToSourceNodes(root)
4:  addToPropagationNodes(all children of root)
    // hash-distributing (core-adjacent) edges
5:  foreach node v in propagation_nodes do
6:    sub_v ← constructSubQuery(root, v, label)    // subquery defined by root, v and the label of the edge between them
7:    if sub_v is not previously distributed then  // use the replica index to check
8:      hash-distribute all bindings of sub_v on root
9:  foreach level l: 2 → L do
10:   source_nodes ← propagation_nodes
11:   clear propagation_nodes; addToPropagationNodes(all children of source_nodes)
12:   foreach node v in propagation_nodes do
13:     sub_v ← constructSubQuery(v.parent, v, label)
14:     if sub_v is not previously distributed then  // use the replica index to check
15:       sub_parent ← constructSubQuery(v.parent, v.grandParent, label)
16:       qualified ← sub_v ⋈ sub_parent             // join sub_v and sub_parent using a distributed semi-join
17:       addDataToIndex(qualified)

So far, PHD-Store has been presented as a reactive system, where only frequent queries are redistributed and hence optimized for. Non-frequent queries, like the ones in Figure 9(a), that share the same pattern but have different cores (i.e., Stanford, MIT and ?u) would be executed using the expensive distributed semi-join, even if they share a common frequent pattern. PHD-Store solves this problem by assigning such queries to the same query template and then redistributing the template.
Definition 6 (Query Template). A query template is the query pattern that results from replacing all the constants in a query Q with variables.

Template vertices store the values of all matching query vertices and their counts. If a query matches an existing
template, the template vertices are updated to count their matched values; otherwise, a new template is created and initialized. For example, assume queries Q1, Q2, and Q3 in Figure 9(a) were executed in this order. After executing Q1, the template in Figure 9(b) will be created with the vertex values shown in T1. The template vertex values shown in T2 reflect the template state after executing all three queries. The frequency of a query template is the number of queries mapped to it. Once the template's frequency increases above a system-wide threshold, the redistribution process is triggered. Redistributing a template results in a replicated index that solves future matching queries efficiently. Nevertheless, redistributing a template involves shuffling large amounts of data and results in significant replication since all template vertices are variables. PHD-Store introduces a proactivity threshold that is used to balance the expected performance gain and the excessive cost of redistributing a template. If the number of unique values in a template vertex is greater than the proactivity threshold, the vertex is kept as a variable. Otherwise, only the most frequent value is assigned to the template vertex. In our example, assuming a proactivity threshold of 2, vertices V1 and V2 in Figure 9(b) are replaced by ?d and dept, respectively.

Figure 9: Similar queries in (a) are assigned to the query template in (b).
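A small C++ sketch of the template bookkeeping described above; the data structures, the observed values and the threshold are illustrative, not the actual implementation. It records the constants matched by one hypothetical template vertex and applies the proactivity rule.

#include <iostream>
#include <map>
#include <string>
#include <vector>

bool isConstant(const std::string& v) { return !v.empty() && v[0] != '?'; }

int main() {
    const size_t proactivity = 2;   // illustrative threshold value
    // Counts of constants matched by one template vertex across incoming queries.
    std::map<std::string, int> vertexValues;
    std::vector<std::string> observed = {"Stanford", "MIT", "Stanford", "?u"};
    for (const auto& v : observed)
        if (isConstant(v)) ++vertexValues[v];

    // Proactivity rule: keep the vertex as a variable if it matched more unique
    // values than the threshold; otherwise pin it to the most frequent value.
    if (vertexValues.size() > proactivity) {
        std::cout << "keep as variable\n";
    } else {
        auto best = vertexValues.begin();
        for (auto it = vertexValues.begin(); it != vertexValues.end(); ++it)
            if (it->second > best->second) best = it;
        std::cout << "pin to " << best->first << "\n";   // "Stanford" in this toy run
    }
}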
5. REPLICA AND QUERY INDEX
Replica index.
Each worker has a replica index consisting of a set of trees that have been redistributed, together with the replicated data. Consider the following queries:
Q1: SELECT ?u WHERE {
      ?d subOrgOf ?u
      ?d type department
      ?s memberOf ?d }

Q2: SELECT ?s WHERE {
      ?s undergradFrom ?u
      ?u type university }
The query index is only created and main-tained by the master. It has exactly the same structure asthe replica indices of the workers, but the query index doesnot include storage modules and does not store any data.Instead, it is used by the query planner to check if a querycan be executed in parallel mode. When a new query Q isposed, the planner decomposes Q into its redistribution tree τ . If τ shares the same root with a tree in the the queryindex and all of τ ’s edges exist in the query index, then Q can be answered in parallel mode; otherwise, Q is answeredusing distributed semi-joins. For example, as a result of re-distributing Q and Q , the entire query in Figure 8 can beexecuted in parallel mode since its redistribution tree existsin the query index. Conflicting redistributions.
Conflicting redistributions.
Conflicts arise when a subquery appears at two different levels in the query index. For example, suppose that after redistributing Q1 and Q2, we want to redistribute Q3 = ⟨?d, subOrgOf, ?u⟩ AND ⟨?p, worksFor, ?d⟩, and let ?p be the core of Q3. An edge associated with ⟨?d, subOrgOf, ?u⟩ will be created in the second level of the query and replica indices. Recall that a similar edge was created in the first level because of Q1. This may cause some triples to be replicated in two levels. In terms of correctness, this is not a problem for PHD-Store, because conflicting triples (if any) are stored separately using two different storage modules. To answer Q3 in parallel mode, the subquery ⟨?d, subOrgOf, ?u⟩ is answered using the data stored in the second level. This approach avoids the burden of any housekeeping and duplicates management. The trade-off, however, is more memory consumption. The system currently has a parameter that the administrator can set to enforce a maximum replication ratio; PHD-Store will redistribute queries as long as the replication ratio is lower than this bound.

Unbounded predicates.
Queries with unbounded predicates, such as ?p in ⟨?x, ?p, Prof.David⟩, pose a challenge when it comes to data locality. For instance, long path queries with unbounded predicates require replication of multiple hops to ensure parallel execution. In this case, replication grows exponentially [20] and would consume the memory. PHD-Store can still answer such queries using distributed semi-joins. However, we redistribute only queries with bounded predicates.
6. UPDATES
PHD-Store supports efficient batch updates in the form of insertion and deletion of RDF triples. The process is explained below. Note that the structure of the query and replica indices never changes because of update operations. Only the main index and the data in the storage modules of the replica index are affected.
Deletion.
Delete operations can run in parallel mode without communication among workers. The triples to be removed are sent to each worker, which checks if any of the triples exists in its main index and deletes it accordingly. Note that a triple t may exist in the main index of one worker only, but it may appear in many replica indices. Therefore, each worker traverses its replica index using depth-first search and deletes any instances of t. It also removes from the replica index all triples that are associated with t. Suppose we want to delete t1 = ⟨MIT-CS, subOrgOf, MIT⟩ and t2 = ⟨MIT, type, university⟩ from the replica indices shown in Figure 10. Each worker traverses its replica index to remove these triples. t1 only exists in worker w1, stored in the subOrgOf edge. After deleting t1, there will be no other triples that share the value MIT-CS for variable ?d. Therefore, the triples that have MIT-CS as a binding of ?d are removed as well; these are ⟨Peter, memberOf, MIT-CS⟩ at edge memberOf and ⟨MIT-CS, type, department⟩ at edge type. The process continues recursively to the next levels. The same process applies to t2, but it will not cause any additional cleanup because university is a leaf node. The replica index of worker w2 is not affected. Figure 11(a) shows the state of the replica index of w1 after deleting t1 and t2.

Figure 11: On worker 1: (a) the replica index after deletions; (b) the replica index after insertions.
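The cascading cleanup can be sketched as follows; the structures are simplified stand-ins for the per-edge storage modules, and the fragment mirrors the deletion of t1 on worker w1.

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct EdgeLevel {
    std::string name;                               // e.g., "subOrgOf", "memberOf"
    std::multimap<std::string, std::string> pairs;  // parent binding -> child binding
    std::vector<EdgeLevel*> children;
};

// Remove every pair under 'parentBinding' at this level and recurse on the
// child bindings that lose their last supporting triple.
void cascade(EdgeLevel& level, const std::string& parentBinding) {
    auto range = level.pairs.equal_range(parentBinding);
    std::set<std::string> orphaned;
    for (auto it = range.first; it != range.second; ++it) orphaned.insert(it->second);
    level.pairs.erase(parentBinding);
    for (const auto& child : orphaned)
        if (!std::any_of(level.pairs.begin(), level.pairs.end(),
                         [&](const auto& p) { return p.second == child; }))
            for (EdgeLevel* c : level.children) cascade(*c, child);
}

int main() {
    // Replica-index fragment of worker w1 (Figure 10): ?u -> subOrgOf -> ?d,
    // then ?d -> memberOf -> ?s and ?d -> type -> department.
    EdgeLevel memberOf{"memberOf", {{"MIT-CS", "Peter"}}, {}};
    EdgeLevel type{"type", {{"MIT-CS", "department"}}, {}};
    EdgeLevel subOrgOf{"subOrgOf", {{"MIT", "MIT-CS"}}, {&memberOf, &type}};

    // Deleting t1 = <MIT-CS, subOrgOf, MIT> removes the support of MIT-CS, so the
    // triples bound to ?d = MIT-CS at the lower levels are removed as well.
    cascade(subOrgOf, "MIT");
    std::cout << "memberOf pairs left: " << memberOf.pairs.size() << "\n";   // 0
}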
Insertion.
To insert a new triple t, the master first assigns it randomly to a worker, which inserts t in its main index. Next, each worker traverses its replica index to ensure consistency. Continuing our example, assume that after the previous deletions have been executed, we want to insert three triples: t3 = ⟨MIT-CS, subOrgOf, EECS⟩, t4 = ⟨MIT, type, university⟩ and t5 = ⟨MIT-CS, subOrgOf, MIT⟩. Assume that the value EECS hashes to worker w1. Then, t3 is inserted at the subOrgOf edge in the replica index of w1. At that edge, no other triples share with t3 the same value MIT-CS for the ?d vertex. Therefore, the subtree rooted at ?d has to be validated and made consistent. This is done by sending requests to all other workers. As a result, ⟨Peter, memberOf, MIT-CS⟩ and ⟨MIT-CS, type, department⟩ are propagated to w1 and stored at edges memberOf and type, respectively. Similarly, assuming that MIT hashes to w1, t4 is inserted in the ?u–type–university edge of the replica index. No validation is required because university is a leaf. Triple t5 is inserted in w1 in the same way as t3. Note that, because t5 shares the same value MIT-CS with t3, no validation is required since it was already done after inserting t3. Figure 11(b) shows the state of the replica index of worker w1 after executing all insertions.
7. EXPERIMENTAL EVALUATION
We implemented: (i) PHD-Store in C++ using MPI for synchronization and communication; and (ii) SemiJoin, a baseline approach that uses distributed semi-joins to execute queries, without redistribution or replication. Both systems work on randomly partitioned data. Our closest competitor is the k-hop system by Huang et al. [20]. For k-hop, we partitioned the graph using the parallel version of METIS [22], and used Hadoop and the code provided by the authors for triple placement. For fairness, instead of a slow disk-based store, we implemented an in-memory query engine for k-hop using the same data structures as PHD-Store. We used the 2-hop configuration from Huang et al. as it performs better than hash partitioning and the 1-hop guarantee [20].

We used two popular datasets: the synthetic LUBM [1] benchmark and the real YAGO2 dataset [19]. For LUBM, we generated a dataset of 2,000 universities, resulting in almost 50GB of raw data or 267M triples. We used 12 queries from the benchmark that contain at least one join (only patterns are used; inference is outside our scope). We also added one complex query (QS) that is more than 2 hops long. From these 13 queries, we generated 12K similar ones that have the same patterns but different constants. We constructed 5 workloads. Each workload consists of 20K queries which are randomly selected from the 12K queries. YAGO2 is a real dataset derived from Wikipedia, WordNet and GeoNames. From the native YAGO2 format we extracted around 30GB of raw data, or around 300M triples. We generated a sequence of 1,000 queries randomly from queries A1, A2, B1, B2 and B3 defined in Binna et al. [7]. All queries are available online at http://cloud.kaust.edu.sa/Pages/queries.aspx.

We used a cluster with 21 machines; one is the master and the remaining 20 are workers. Each machine has an Intel i5-660 3.3GHz CPU (dual core), 16GB RAM and 2x2TB (7200 RPM, 6.0Gb/s) hard disks in RAID-0 configuration. The machines run a 64-bit 2.6.38-8 Linux kernel and are connected via a 1GBps Juniper switch. We restricted the bandwidth of the switch to 10MBps to simulate typical WAN speeds.

This experiment measures the time it takes PHD-Store and k-hop to prepare the data prior to answering queries. The results are shown in Table 2 for LUBM and YAGO2. 2-hop spends a lot of time preprocessing the data, the most expensive step being graph partitioning. Data loading and indexing also take more time in 2-hop because of the existence of replicas. PHD-Store, on the other hand, incurs some cost for statistics collection. In total, PHD-Store needs almost 3 orders of magnitude less time than its competitor. PHD-Store can start answering queries in around 1 minute, whereas 2-hop needs up to 22 hours.

Table 2: Startup time (sec); LUBM and YAGO2
                            LUBM               YAGO2
                         PHD     2-hop      PHD     2-hop
Preprocessing              0    47,700        0    78,372
Loading & indexing        65       176       73       249
Statistics collection      6         0        7         0
TOTAL                     71    47,876       80    78,621

For fair comparison, we reconfigured Hadoop so intermediate HDFS files are written into a memory mounted partition. Then, we re-evaluated the preprocessing phase of 2-hop. As expected, the preprocessing time dropped to 25,508 and 28,115 seconds for LUBM and YAGO2, respectively. Nonetheless, the startup cost for PHD-Store is still 500 times less than 2-hop.

The frequency and proactivity thresholds control the triggering of redistributions. They are highly correlated and both influence the execution time and the amount of replication. In this experiment, we select the threshold values based on one of the random LUBM workloads. We first set the proactivity threshold to infinity and execute the workload by varying the frequency threshold values. The execution time and the resulting replication ratio are shown in Figures 12(a) and 12(b), respectively. As the frequency threshold increases, the execution time increases as most of the queries use expensive semi-joins. At the same time, the higher the frequency threshold, the lower the replication ratio because fewer queries are redistributed. For this reason, we select a frequency threshold of 3 as the default value in all experiments. Next, we vary the proactivity threshold. Based on the results shown in Figures 12(c) and 12(d), we set the proactivity threshold to 10 for all the experiments.

Figure 12: Sensitivity analysis for frequency and proactivity thresholds: (a) execution time vs. frequency threshold; (b) replication ratio vs. frequency threshold; (c) execution time vs. proactivity threshold; (d) replication ratio vs. proactivity threshold.
Entire workload.
Next we measure the cumulative execution time during the execution of an entire workload. The cumulative time includes the preprocessing cost for k-hop and the cost of dynamic redistribution for PHD-Store. Figure 13(a) shows the average cumulative time for the 5 random query loads generated from the 12K queries. k-hop pays most of the cost at the preprocessing phase. PHD-Store, on the other hand, amortizes the redistribution cost during the actual query processing; the spikes in the graph are due to redistribution. Note that the proactive version of PHD-Store is at least 2 times faster than the reactive version. After roughly 10,000 queries the system converges. In all cases PHD-Store is 1 to 2 orders of magnitude faster than k-hop. Figure 14(a) shows the results for YAGO2. Because of the limited number of queries, both reactive and proactive PHD-Store have the same performance. Again, PHD-Store is 2 orders of magnitude faster than k-hop.

Single queries.
The next experiment measures the effectiveness of the partitioning. We allow all systems to converge (i.e., preprocessing and dynamic redistribution costs are excluded); then we measure the execution time for each of the 13 LUBM queries. The results are shown in Figure 13(b). PHD-Store achieves significantly better (see queries 2 and QS), or comparable, performance to 2-hop. The only exception is query 8, where 2-hop is 10msec faster. The radius of query 8 is at most 2, therefore 2-hop executes it in parallel mode using all workers. PHD-Store, on the other hand, selects a constant (i.e., university0) as core vertex. Therefore the redistribution process sends all relevant triples to the same worker. The query is still answered without communication overhead, but there is no parallelism, resulting in slower execution. SemiJoin, the baseline method that does not redistribute any data, is 1 to 3 orders of magnitude slower than PHD-Store. This demonstrates the significance of good data placement.

Figure 14(b) shows the results for the YAGO2 dataset. Most of the vertices in the query graphs are not connected to an rdf:type predicate. Without this information, 2-hop cannot distinguish whether it can process a certain query in parallel mode, even if the radius of the query is indeed at most 2. Therefore, most queries are executed by expensive distributed semi-joins. PHD-Store is 1 to 2 orders of magnitude faster for all queries.

Figure 13(c) shows the average net replication ratio as a percentage of the original data size, after completing the 5 LUBM workloads. The LUBM workloads contain queries with long chains (more than 2 hops in radius), which are very costly when executed by 2-hop and require the extension to 3-hop to be answered efficiently. In contrast, by only redistributing what is needed, PHD-Store performs better while incurring less replication. Proactive PHD-Store results in more replication than the reactive version as non-frequent queries may be redistributed. For YAGO2, the net replication of PHD-Store is lower than that of 2-hop. Figure 14(c) shows that 2-hop results in replication that is almost 2 times the original data.

Recall from Section 5 that the administrator can limit the maximum replication ratio. Figure 15(a) shows the effect of varying this upper bound on the execution time, for an LUBM workload. Constraining the replication to low values results in blocking query redistributions. Accordingly, many frequent queries run using expensive distributed semi-joins. On the other hand, when the bound increases, many of the frequent queries are redistributed and hence run efficiently in parallel mode.

PHD-Store does not redistribute all parts of the data graph; therefore, and for fair comparison, we examine the influence of the workload coverage on the performance of the k-hop system. We extracted only the triples that are relevant to one of the 5 LUBM workloads. Then, we partitioned this data using the 2-hop guarantee. Afterwards, we executed the same workload on the partitioned data. The query load we selected covers around 61% of the original data. As shown in Table 3, it took k-hop 11.7 hours to partition the covered data; roughly 1.5 hours less than partitioning the whole dataset. The wall time for executing the workload (excluding preprocessing) dropped by 21%. The gain in query performance is masked by the expensive preprocessing overhead. Moreover, such a partitioning is only useful for answering this specific workload. Repartitioning the entire dataset is needed for different query loads.
In the following experiment, we measure the scalability of PHD-Store, using the LUBM workloads. We used the LUBM generator to generate datasets of different sizes. We vary the number of universities from 500 to 2,000; the resulting datasets contain from 67M to 267M triples. Figure 15(b) shows that scalability is linear, demonstrating that PHD-Store can scale to very large datasets.

Figure 13: Execution time and replication cost for LUBM: (a) cumulative execution time; (b) execution time for each query; (c) net replication ratio.

Figure 14: Execution time and replication cost for YAGO2: (a) cumulative execution time; (b) execution time for each query; (c) net replication ratio.

Figure 15: Cumulative execution time, scalability and load balancing (LUBM workload): (a) cumulative execution time using different replication upper bounds; (b) scalability versus data size; (c) Gini coefficient (load balancing).

Table 3: Workload coverage effect
                         Partial      Full
Preprocessing (sec)       42,156    47,700
Load execution (sec)       3,816     4,831
TOTAL                     45,972    52,531

In the final experiment, we measure how balanced the data placement is after redistribution. We use the Gini coefficient [16], whose values range from 0 (i.e., perfect balance) to 1. Large values of the coefficient imply that some workers will be overwhelmed in terms of storage requirements and computational time during query processing. Figure 15(c) shows the average Gini coefficient after executing the LUBM and YAGO2 workloads. PHD-Store achieves a balanced distribution. For LUBM, the Gini coefficient of PHD-Store is almost an order of magnitude lower than 2-hop. Similarly, for YAGO2, PHD-Store's Gini coefficient is at least 2 orders of magnitude better than 2-hop.
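For reference, a standard formulation of the Gini coefficient over the per-worker loads x_1, ..., x_N (a textbook definition; the paper does not spell out the exact estimator it uses) is:

G = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \lvert x_i - x_j \rvert}{2N\sum_{i=1}^{N} x_i}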
In this experiment, we show the performance of PHD-Store when executing update operations. As discussed in Section 5, updating the main index poses no overhead on the system; however, maintaining replica consistency can be overwhelming. Therefore, we execute update queries after executing a whole query workload. To evaluate the delete operation, we allow the system to execute one of the LUBM workloads and redistribute queries from that workload. Then, we delete 20% of the data (≈ ...) from the system. Deletions and insertions were carried out in batches; each batch consists of ≈ ... triples.

Table 4: PHD-Store update operation throughput
8. CONCLUSION
This paper presented PHD-Store, a SPARQL engine for federations of many independent RDF repositories. As more organizations publish their data in the versatile RDF format, rendering the deep web accessible to search engines and end users, the demand for federating RDF repositories is expected to increase. Without proper data placement, executing queries against such federations is very costly. PHD-Store follows an adaptive approach that allows it to start processing queries immediately, thus minimizing the data-to-query time, while it dynamically distributes and indexes only those parts of the graph that benefit the most frequent query patterns. The experimental results verify that PHD-Store achieves better partitioning and replicates less data than its competitors. More importantly, PHD-Store scales to very large RDF graphs, whereas existing methods are limited to much smaller datasets by prohibitively expensive preprocessing. Currently we are working on a disk-based version of our system in order to support even larger datasets. We are also investigating the possibility of utilizing PHD-Store for general (i.e., non-RDF) graphs, and operators such as graph traversals or reachability queries.
9. REFERENCES
[1] LUBM SPARQL Benchmark. http://swat.cse.lehigh.edu/projects/lubm/.
[2] NuoDB.
[3] RDF Primer.
[4] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.
[5] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB in action: adaptive query processing on raw data. PVLDB, 5(12), 2012.
[6] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle. The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases. In SemWeb, 2001.
[7] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: A Native Main Memory Approach for Graph Storage. In Grundlagen von Datenbanken, volume 733, 2011.
[8] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In SIGMOD, 2011.
[9] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient RDF store over a relational database. In SIGMOD, 2013.
[10] J. Broekstra, A. Kampman, and F. V. Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In ISWC, 2002.
[11] R. Castillo and U. Leser. Selecting materialized views for RDF data. In ICWE, 2010.
[12] Z. Chong, H. Chen, Z. Zhang, H. Shu, G. Qi, and A. Zhou. RDF pattern matching using sortable views. In CIKM, 2012.
[13] C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3(1-2), 2010.
[14] V. Dritsou, P. Constantopoulos, A. Deligiannakis, and Y. Kotidis. Optimizing query shortcuts in RDF databases. In ESWC, 2011.
[15] O. Erling. Towards Web Scale RDF. In SSWS, 2008.
[16] C. Gini. Concentration and Dependency Ratios (1909, in Italian). English translation in Rivista di Politica Economica, 87, 1997.
[17] F. Goasdoué, K. Karanasos, J. Leblay, and I. Manolescu. View selection in Semantic Web databases. PVLDB, 5(2), 2011.
[18] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A Federated Repository for Querying Graph Structured Data from the Web. In ISWC/ASWC, volume 4825, 2007.
[19] J. Hoffart, F. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, and G. Weikum. YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proc. WWW, 2011.
[20] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. PVLDB, 4(11), 2011.
[21] S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking. In CIDR, 2007.
[22] G. Karypis and V. Kumar. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, 2009.
[23] B. Liu and B. Hu. HPRD: a high performance RDF database. Int. J. Parallel Emerg. Distrib. Syst., 2010.
[24] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a System for Large-scale Graph Processing. In SIGMOD, 2010.
[25] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, 1(1), 2008.
[26] A. Pavlo, C. Curino, and S. Zdonik. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. In SIGMOD, 2012.
[27] K. Rohloff and R. E. Schantz. High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In Programming Support Innovations for Emerging Distributed Applications, 2010.
[28] M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era: (It's Time for a Complete Rewrite). In VLDB, 2007.
[29] C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. PVLDB, 1(1), 2008.
[30] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131-150, 2003.
[31] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In