A Framework for Federated SPARQL Query Processing over Heterogeneous Linked Data Fragments
Lars Heling
[email protected]
Karlsruhe Institute of Technology
Karlsruhe, Germany
Maribel Acosta
[email protected]
Ruhr-University Bochum
Bochum, Germany
ABSTRACT
Linked Data Fragments (LDFs) refer to Web interfaces that allow for accessing and querying Knowledge Graphs on the Web. These interfaces, such as SPARQL endpoints or Triple Pattern Fragment servers, differ in the SPARQL expressions they can evaluate and the metadata they provide. Client-side query processing approaches have been proposed and are tailored to evaluate queries over individual interfaces. Moreover, federated query processing has focused on federations with a single type of LDF interface, typically SPARQL endpoints. In this work, we address the challenges of SPARQL query processing over federations with heterogeneous LDF interfaces. To this end, we formalize the concept of federations of Linked Data Fragments and propose a framework for federated querying over heterogeneous federations with different LDF interfaces. The framework comprises query decomposition, query planning, and physical operators adapted to the particularities of different LDF interfaces. Further, we propose an approach for each component of our framework and evaluate them in an experimental study on the well-known FedBench benchmark. The results show a substantial improvement in performance achieved by devising these interface-aware approaches that exploit the capabilities of heterogeneous interfaces in federations.
1 INTRODUCTION

The increasing number and size of Knowledge Graphs published as Linked Data led to the development of different interfaces to support querying Knowledge Graphs on the Web [7, 12, 15, 27]. These interfaces mainly differ in their expressivity, server availability, and client cost, as shown in Figure 1. The Linked Data Fragment (LDF) framework provides a uniform way to describe these interfaces regarding their querying expressivity and the metadata they provide [13, 27]. The query expressivity of these interfaces ranges from triple patterns in Triple Pattern Fragment (TPF) servers to the full fragment of SPARQL in SPARQL endpoints. To support efficient querying, these developments also drove the research in the area of client-side SPARQL query processing tailored to the individual interfaces [3, 16, 26]. As a result, most of the existing approaches focus on querying data from a single dataset or through a federation of sources with the same interface. However, the problem of evaluating SPARQL queries over heterogeneous federations of such LDF interfaces has not gained much attention thus far. To devise efficient query plans in this scenario, it is not sufficient to combine and fuse existing solutions, because the capabilities and limitations of the interfaces have to be taken into account altogether. An effective solution, thus, requires re-defining the notions of the main tasks of federated engines – i.e., query decomposition,
planning, and execution – to integrate the different capabilities of the interfaces in a single query plan.

Figure 1: Linked Data Fragment Spectrum (based on [27]). The spectrum ranges from data dumps (high client cost, high availability, low expressivity) over Triple Pattern Fragment (TPF) and bindings-restricted Triple Pattern Fragment (brTPF) servers to SPARQL endpoints (EP) (low client cost, low availability, high expressivity).

In this work, we formalize federations of LDF interfaces and propose a framework that serves as a foundation for devising efficient solutions for querying heterogeneous federations. At the core of the framework, we present novel concepts for query decomposition, planning, and execution in heterogeneous federations. These components include the definitions of (i) interface-compliant subexpressions and the semantics of their evaluation, (ii) interface-aware query planning, and (iii) polymorphic physical operators that implement different execution strategies according to the contacted interfaces. To accompany our theoretical contributions, we propose simple yet novel approaches to query heterogeneous LDF federations. Each approach addresses the particularities of the interfaces and is designed to reduce the query execution times and the load on the members of the federations. Our results show the effectiveness of our framework and illustrate how leveraging the interfaces' capabilities in a single plan can substantially improve query execution. In summary, the contributions of this work are
• a general definition of Linked Data Fragment (LDF) federations,
• a framework for querying heterogeneous LDF federations addressing query decomposition, planning, and physical operators,
• a practical solution for each component of the framework, and
• an experimental evaluation of a prototypical implementation of the solutions on heterogeneous federations.
The remainder of this work is structured as follows. Section 2 presents a motivating example, and in Section 3, we present our definition of federations of LDFs. Our framework is presented in Section 4 and evaluated in Section 5.
We discuss related work in Section 6 and conclude our work in Section 7.

arXiv Preprint, 2021. Heling and Acosta.

2 MOTIVATING EXAMPLE

As a motivating example, consider the query shown in Listing 1 that retrieves American presidents, the political party they are a member of, as well as their predecessors and successors.
Listing 1: Example query. Prefixes as in http://prefix.cc
SELECT * WHERE {
  ?x wdt:P39 wd:Q11696 .              (tp1)
  ?x wdt:P102 ?party .                (tp2)
  ?y owl:sameAs ?x .                  (tp3)
  ?y dbo:predecessor ?predecessor .   (tp4)
  ?y dbo:successor ?successor .       (tp5)
}

Let us assume we want to evaluate the query over a federation that consists of the SPARQL endpoint of Wikidata (https://query.wikidata.org/sparql) and the Triple Pattern Fragment (TPF) server [27] of DBpedia (http://fragments.dbpedia.org/2016-04/en). The Wikidata endpoint provides solutions to triple patterns tp1, tp2, and tp3, and the DBpedia TPF server to tp3, tp4, and tp5. As the members of the federation implement different Linked Data Fragment (LDF) interfaces, we denote such a federation heterogeneous. In this example, we are not able to apply an existing query decomposition approach from query processing over SPARQL endpoint federations, as these approaches do not consider the capabilities of the LDF interfaces. For instance, FedX [24] would group triple patterns tp4 and tp5 into a subquery, which is not compliant with the DBpedia TPF interface. On the contrary, a naive decomposition that evaluates the query on the triple pattern level at the relevant sources would be compliant with the interfaces in the federation, since they are all able to evaluate triple patterns. However, such a decomposition leads to inefficient query plans on SPARQL endpoints, as it produces an excessive number of requests on the server and, thus, leads to long execution times. In our example, this approach would fail to evaluate the subexpression (tp1 And tp2) at the Wikidata endpoint. Their individual evaluation at the endpoint leads to an overhead in requests and intermediate results transferred that could be avoided. The example illustrates the challenges that arise in heterogeneous federations and motivates our research to address those challenges. In this work, we propose a framework that is tailored to leverage the capabilities of the different interfaces in heterogeneous federations.
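The request arithmetic behind this observation can be sketched as follows. This is our own back-of-the-envelope illustration, not a measurement from the paper; the page size and binding counts are hypothetical.

```python
# Illustrative sketch (ours): a SPARQL endpoint can evaluate a whole BGP in a
# single request, while a TPF server is contacted once per triple pattern and
# once per page of matching triples; joins then happen client-side.

def endpoint_requests(num_patterns):
    # the BGP is shipped to the endpoint as one query
    return 1

def tpf_requests(num_patterns, triples_per_pattern, page_size=100):
    # one paginated scan per triple pattern
    pages = -(-triples_per_pattern // page_size)   # ceiling division
    return num_patterns * pages

# hypothetical numbers: 2 patterns with 250 matching triples each
assert endpoint_requests(2) == 1
assert tpf_requests(2, 250) == 6   # 2 patterns x 3 pages each
```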
Based on this framework, our implementation reduces the number of requests by almost 25%, leading to a tenfold decrease in query execution time over the naive approach for the example query.

3 FEDERATIONS OF LINKED DATA FRAGMENTS

Following existing works [13, 27], we introduce a formalization of Linked Data Fragment (LDF) interfaces based on the SPARQL expressions they are able to evaluate and the metadata they provide. Verborgh et al. [27] define a Linked Data Fragment (LDF) for an RDF graph 𝐺 as a tuple consisting of a URI, a selector function, a set of RDF triples that are the result of applying the selector function over 𝐺, metadata in the form of a set of RDF triples, and a set of hypermedia controls. Based on this work, Hartig et al. [13] propose a formal framework for comparing LDF interfaces in terms of expressiveness, complexity, and performance when evaluating SPARQL queries over different interfaces. The concept of LDF interfaces by Hartig et al. comprises the following: i) a notion of a server language to differentiate between different capabilities of LDF interfaces, and ii) an evaluation function in which an LDF interface provides a set of SPARQL solution mappings upon a request. We adapt the server language definition from [13] to be based on the SPARQL expressions an LDF interface can evaluate. Therefore, we first revise the SPARQL expressions considered in the literature.
Let the sets of RDF terms 𝑈, 𝐵, and 𝐿 be pairwise disjoint sets of URIs, blank nodes, and Literals, and let 𝑉 be a set of variables disjoint from 𝑈, 𝐵, and 𝐿. A triple (𝑠, 𝑝, 𝑜) ∈ (𝑈 ∪ 𝐵) × 𝑈 × (𝑈 ∪ 𝐵 ∪ 𝐿) is called an RDF triple. A set of RDF triples is an RDF graph 𝐺, and the universe of RDF graphs is denoted as G. Following the notation by Pérez et al. [18] and Schmidt et al. [23], SPARQL expressions are constructed using the operators And, Union, Optional, Filter, and
Values and can be defined recursively as follows.
Definition 3.1 (SPARQL Expression).
A SPARQL expression is recursively defined as follows.
(1) A triple pattern 𝑡𝑝 ∈ (𝑈 ∪ 𝑉) × (𝑈 ∪ 𝑉) × (𝑈 ∪ 𝐿 ∪ 𝑉) is a SPARQL expression [18],
(2) if 𝑃1 and 𝑃2 are SPARQL expressions, then the expressions (𝑃1 And 𝑃2), (𝑃1 Union 𝑃2), and (𝑃1 Optional 𝑃2) are SPARQL expressions (conjunctive expression, union expression, optional expression) [18],
(3) if 𝑃 is a SPARQL expression and 𝑅 is a SPARQL filter condition, then the expression 𝑃 Filter 𝑅 is a SPARQL expression (filter expression) [18],
(4) if 𝑃 is a SPARQL expression and 𝐷 is a SPARQL values datablock, the expression 𝑃 Values 𝐷 is a SPARQL expression (values expression), and
(5) if 𝑃 is a SPARQL expression and 𝑆 ⊂ 𝑉 is a set of variables, the expression Select_𝑆(𝑃) is an expression (select query) [23].
Furthermore, we denote the universe of SPARQL expressions as P and 𝑣𝑎𝑟𝑠(𝑃) ⊂ 𝑉 as the set of variables in the expression 𝑃. We can define the interface languages of different LDF interfaces by means of the fragment of SPARQL expressions they can evaluate.

Definition 3.2 (Interface Language).
Let L be the universe of interface languages; an interface language 𝐿 ∈ L is the fragment of SPARQL expressions that an interface can evaluate. Moreover, we write 𝑃 ∈ 𝐿 if a SPARQL expression 𝑃 is part of an interface language 𝐿. Common interface languages can, thus, be defined in the following way.
• 𝐿CoreSparql: any SPARQL expression defined in Def. 3.1,
• 𝐿Bgp: conjunctive expressions (𝑃1 And 𝑃2) where 𝑃1 and 𝑃2 are either conjunctive expressions or triple patterns,
• 𝐿Tp: triple patterns,
• 𝐿Tp+Values: triple patterns and values expressions of the form 𝑃 Values 𝐷, where 𝑃 is a triple pattern.
The definition of interface languages based on SPARQL expressions also allows for defining containment relations between the different languages according to their expressiveness.

Definition 3.3 (Interface Language Containment).
Let 𝐿1 and 𝐿2 be two interface languages. We say that 𝐿1 is contained in 𝐿2 if all SPARQL expressions in 𝐿1 are also in 𝐿2: 𝐿1 ⊆ 𝐿2, if ∀𝑃 ∈ 𝐿1 ⇒ 𝑃 ∈ 𝐿2.
For example, we can state the following containment relations for the previously introduced languages: 𝐿Tp ⊆ 𝐿Bgp ⊆ 𝐿CoreSparql, or 𝐿Tp ⊆ 𝐿Tp+Values. With this formalism, we can define the languages of common LDF interfaces and compare them according to their expressiveness. Triple Pattern Fragment (TPF) servers support querying triple patterns (𝐿Tp) [27], bindings-restricted TPF servers support triple patterns and Values expressions (𝐿Tp+Values) [12], and SPARQL endpoints support any expression (𝐿CoreSparql).
To complement the definition of LDF interfaces, we introduce the concept of interface metadata. For a SPARQL expression 𝑃, an interface may provide interface-specific metadata 𝑀(𝑃) describing the data obtained from the RDF graph. The interface metadata may range from simple statistics, such as the number of expected results, to more elaborate metadata describing statistics, provenance, and licensing information. Similar to [27], we assume the metadata provided for a given expression 𝑃 to be an RDF graph, that is, 𝑀 : P → G. Examples of common interface metadata are:
• SPARQL endpoints 𝑀Ep: 𝑀(𝑃) = ∅, ∀𝑃 ∈ 𝐿CoreSparql,
• Triple Pattern Fragments 𝑀Tpf: 𝑀(𝑃) is an RDF graph that contains an estimate of the number of triples that match the expression 𝑃, ∀𝑃 ∈ 𝐿Tp,
• Bindings-restricted Triple Pattern Fragments 𝑀brTpf: 𝑀(𝑃) is an RDF graph that contains an estimate of the number of triples that match the expression 𝑃, ∀𝑃 ∈ 𝐿Tp+Values.
Besides enabling a more fine-grained distinction between LDF interfaces, the metadata may impact the potential querying strategies employed by a client.
Finally, combining interface language and metadata, we define a Linked Data Fragment interface as follows.
Definition 3.4 (Linked Data Fragment Interface).
A Linked Data Fragment interface is a 2-tuple 𝑓 = (𝐿𝑓, 𝑀𝑓), where
• 𝐿𝑓 ∈ L is the interface language, and
• 𝑀𝑓 : P → G is the interface metadata for an expression 𝑃.
Conceptually, we distinguish LDF interfaces, which define the interface language and metadata, and LDF services, which are Web servers that implement a specific interface.

Definition 3.5 (Linked Data Fragment Service).
A Linked Data Fragment service 𝑐 ∈ 𝑈 is a Web service that supports the evaluation of SPARQL expressions and provides metadata according to the LDF interface 𝑖𝑛𝑡(𝑐) = (𝐿𝑐, 𝑀𝑐) that it implements.
We reuse the function 𝑒𝑝 : 𝑈 → G [6] that maps an LDF service to the RDF graph 𝑒𝑝(𝑐) available at the service. The evaluation of a SPARQL expression 𝑃 over an LDF service 𝑐 is then given as

⟦𝑃⟧𝑐 := ⟦𝑃⟧𝑒𝑝(𝑐) if 𝑃 ∈ 𝐿𝑐, and ⟦𝑃⟧𝑐 := ∅ otherwise. (1)

Note the difference in the subscripts 𝑐 and 𝑒𝑝(𝑐) to distinguish between the solution mappings produced by an LDF service, ⟦·⟧𝑐, regarding its interface language and the solution mappings for evaluating any expression over the graph available at the LDF service, ⟦·⟧𝑒𝑝(𝑐). Combining these previous definitions, we define the concept of federations of LDF services as follows.

Definition 3.6 (Federation of Linked Data Fragment Services).
A federation of Linked Data Fragment services is a 3-tuple 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝), where
• 𝐶 = {𝑐1, . . . , 𝑐𝑛} ⊂ 𝑈 is a set of URIs for LDF services,
• 𝑖𝑛𝑡 is a function that maps an LDF service to its interface, and
• 𝑒𝑝 is a function that maps each LDF service to the graph available at that service.
Federations in which all LDF services implement the same LDF interface are called homogeneous, and heterogeneous otherwise. For practical reasons, in the remainder of this work, we only consider graphs in the federation without blank nodes and focus on federations in which all members are at least able to evaluate triple patterns of any form: 𝐿Tp ⊆ 𝐿𝑐, ∀𝑐 ∈ 𝐶.

Example 3.7.
We can define the federation from our motivating example as 𝐹𝑒𝑥 = ({𝑐1, 𝑐2}, 𝑖𝑛𝑡, 𝑒𝑝) with 𝑐1 = wikidata:sparql, 𝑐2 = dbpedia:tpf, 𝑖𝑛𝑡(𝑐1) = (𝐿CoreSparql, 𝑀Ep), 𝑖𝑛𝑡(𝑐2) = (𝐿Tp, 𝑀Tpf), 𝑒𝑝(𝑐1) = 𝐺Wikidata, and 𝑒𝑝(𝑐2) = 𝐺DBpedia.
Following the notation by Acosta et al. [2], we denote the evaluation of a SPARQL expression over a federation of LDF interfaces 𝐹 as ⟦·⟧𝐹 and define its semantics in the following way.

Definition 3.8 (Set Semantics of SPARQL Query Processing over LDF Service Federations).
Given a SPARQL expression 𝑃 and a federation 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝), the result set of evaluating 𝑃 over 𝐹 is given as ⟦𝑃⟧𝐹 := ⟦𝑃⟧𝐺, with 𝐺 = ⋃𝑐∈𝐶 𝑒𝑝(𝑐).

4 FRAMEWORK

In the presence of heterogeneous LDF service federations, novel challenges arise that cannot be addressed by existing approaches. Therefore, we propose a framework for heterogeneous federations that addresses the central components of federated query processing: (§4.1) query decomposition, (§4.2) query planning, and (§4.3) physical operators. Furthermore, for each component of the framework, we propose an approach aiming to obtain efficient query plans.
4.1 Query Decomposition

The goal of query decomposition is grouping the query into subexpressions such that the evaluation of the subexpressions over the members of the federation minimizes execution time while ensuring that all expected answers are produced. Existing decomposition approaches assume that all federation members are able to evaluate any SPARQL expression. Since this assumption is not valid in heterogeneous federations, we propose interface-compliant query decompositions and their evaluation over such federations.
Given a SPARQL query 𝑃 and a federation 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝), query decomposition aims to group the query into subexpressions such that they can be answered by the relevant sources in the federation. The first step to achieve this goal is source selection, that is, selecting the relevant sources 𝑟(𝑡𝑝𝑖) for all triple patterns 𝑡𝑝𝑖 in 𝑃, with 𝑟(𝑡𝑝𝑖) = {𝑐 ∈ 𝐶 | ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐) ≠ ∅}. Because we require all LDF services to at least evaluate triple patterns of any form (𝐿Tp ⊆ 𝐿𝑐, ∀𝑐 ∈ 𝐶), in principle, the relevant sources can be selected by evaluating each triple pattern at each service. (This also means that we do not include data dumps, even though they are considered Linked Data Fragments in other works [13, 27].) Typically, the capabilities of the services allow for more efficient source selection strategy implementations, such as
Ask queries for SPARQL endpoints, leveraging the metadata (𝑢, void:triples, 𝑐𝑛𝑡) ∈ 𝑀(𝑡𝑝𝑖) for TPF and brTPF servers, or using pre-computed data catalogues.
Once the relevant sources are identified, the query engine decomposes the query into subexpressions to be evaluated at the services in the federation. If there exists a triple pattern 𝑡𝑝 in a basic graph pattern (BGP) 𝑃 with no relevant source, 𝑟(𝑡𝑝) = ∅, the evaluation of 𝑃 over the federation is the empty set. In the following, we focus on query decompositions for BGPs where all triple patterns have at least one relevant source. For simplicity, we extend the notation and consider a BGP 𝑃 = (𝑡𝑝1 And . . . And 𝑡𝑝𝑛) also as a set of 𝑛 triple patterns: 𝑃 = {𝑡𝑝1, . . . , 𝑡𝑝𝑛}.

Definition 4.1 (Query Decomposition).
Given a BGP 𝑃 and an LDF service federation 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝), a query decomposition 𝐷(𝑃, 𝐹) = {𝑑1, . . . , 𝑑𝑚} is a set of tuples 𝑑𝑖 = (𝑆𝐸𝑖, 𝑆𝑖), where
• 𝑆𝐸𝑖 is a subexpression of 𝑃, and
• 𝑆𝑖 ⊆ 𝐶 is a non-empty subset of services over which 𝑆𝐸𝑖 is evaluated,
such that ⋃(𝑆𝐸𝑖,𝑆𝑖)∈𝐷(𝑃,𝐹) 𝑆𝐸𝑖 = 𝑃.
Because the query decomposition as such does not consider the interface language of the services, it is possible that for a valid query decomposition 𝐷: ∃𝑑𝑖 ∈ 𝐷 : ∃𝑐 ∈ 𝑆𝑖 : 𝑆𝐸𝑖 ∉ 𝐿𝑐. Considering the query from our motivating example, a valid decomposition would be 𝐷(𝑃, 𝐹) = {((𝑡𝑝1 And 𝑡𝑝2), {𝑐1}), ((𝑡𝑝3 And 𝑡𝑝4 And 𝑡𝑝5), {𝑐2})}, even though 𝑐2 is a TPF server that can only evaluate triple patterns. Therefore, we introduce an evaluation function 𝜃 for the interface-compliant evaluation of SPARQL expressions.

Definition 4.2 (Interface-compliant Evaluation of an Expression).
Given a BGP 𝑃 and an LDF service 𝑐 ∈ 𝑈, the interface-compliant evaluation of 𝑃 over 𝑐 is given as follows.

𝜃𝑐(𝑃) := ⟦𝑃⟧𝑐 if 𝑃 ∈ 𝐿𝑐, (2)
𝜃𝑐(𝑃) := ⟦𝑃1⟧𝑐 ⊲⊳ . . . ⊲⊳ ⟦𝑃𝑙⟧𝑐 otherwise, (3)

for some 𝑃1, . . . , 𝑃𝑙 with 𝑃 = (𝑃1 And . . . And 𝑃𝑙) in Equation (3), such that the following conditions hold:
• Equivalence: 𝜃𝑐(𝑃) ≡ ⟦𝑃⟧𝑒𝑝(𝑐)
• Compliance: 𝑃𝑖 ∈ 𝐿𝑐, ∀𝑃𝑖 ∈ {𝑃1, . . . , 𝑃𝑙}
The intuition of the interface-compliant evaluation is as follows. If the BGP 𝑃 is in the language of the service, then 𝑃 can be evaluated directly at the service. Otherwise, the original expression 𝑃 is split into subexpressions such that each subexpression is in the language of the service. As a result, joining the solutions of evaluating the individual subexpressions at the service yields the same solutions as evaluating 𝑃 over the graph of the service, 𝑒𝑝(𝑐). We denote the number of subexpressions in the interface-compliant evaluation as |𝜃𝑐(𝑃)|. A compliant evaluation of 𝑆𝐸 = (𝑡𝑝3 And 𝑡𝑝4 And 𝑡𝑝5) from our previous example at the DBpedia TPF server 𝑐2 would be 𝜃𝑐2(𝑆𝐸) = ⟦𝑡𝑝3⟧𝑐2 ⊲⊳ ⟦𝑡𝑝4⟧𝑐2 ⊲⊳ ⟦𝑡𝑝5⟧𝑐2. With this notion, we define the interface-compliant evaluation of a query decomposition.

Definition 4.3 (Interface-compliant Evaluation of a Query Decomposition).
Given a query decomposition 𝐷(𝑃, 𝐹) for the BGP 𝑃 and federation 𝐹, the evaluation of 𝑃 following the query decomposition, 𝜃𝐷(𝑃,𝐹)(𝑃), is given as the conjunction (⊲⊳) of the subexpressions 𝑆𝐸𝑖 evaluated at all (∪) services in 𝑆𝑖:

𝜃𝐷(𝑃,𝐹)(𝑃) := ⊲⊳(𝑆𝐸𝑖,𝑆𝑖)∈𝐷(𝑃,𝐹) (⋃𝑐𝑗∈𝑆𝑖 𝜃𝑐𝑗(𝑆𝐸𝑖))

After defining query decompositions and their interface-compliant evaluation over the LDF services in a federation, the problem of finding a suitable query decomposition for a given query arises. The common goal of query decomposition approaches is finding a decomposition that yields complete answers according to the assumed semantics, while the cost of executing the decomposition by the query engine is minimized [10, 28]. However, these approaches do not explicitly measure the expected answer completeness of query decompositions. For instance, in [28] the answer completeness is encoded implicitly in the query decomposition cost by considering the number of non-selected endpoints: if fewer relevant endpoints are contacted according to a decomposition, its cost is higher, and vice versa. Extending existing approaches, we propose the concept of query decomposition density as a measure to estimate and compare the expected answer completeness of different decompositions. In contrast to [28], our density measure not only considers the non-selected endpoints but also how triple patterns are grouped into subexpressions that are evaluated jointly at the services.
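The interface-compliant evaluation of Definition 4.2 can be sketched for a TPF-like service: since only triple patterns are in 𝐿Tp, a BGP is split into its patterns, each is evaluated individually, and the client joins the solution mappings. The in-memory graph and the encoding of solution mappings as dictionaries are our assumptions, not the paper's implementation.

```python
# Minimal sketch (ours): per-triple-pattern evaluation at a TPF-like service
# followed by a client-side join of the solution mappings (Eq. (3)).

def match_tp(tp, graph):
    """Evaluate one triple pattern over a graph (set of string triples)."""
    out = []
    for triple in graph:
        mu, ok = {}, True
        for q, t in zip(tp, triple):
            if q.startswith("?"):
                if mu.get(q, t) != t:   # repeated variable must bind equally
                    ok = False
                    break
                mu[q] = t
            elif q != t:
                ok = False
                break
        if ok:
            out.append(mu)
    return out

def join(left, right):
    """Join solution mappings on their shared variables."""
    return [{**m1, **m2} for m1 in left for m2 in right
            if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())]

def theta_tpf(bgp, graph):
    """Interface-compliant evaluation at a TPF server: only triple patterns
    are shipped; the conjunction is computed by the client."""
    result = match_tp(bgp[0], graph)
    for tp in bgp[1:]:
        result = join(result, match_tp(tp, graph))
    return result

g = {("dbr:A", "dbo:predecessor", "dbr:B"),
     ("dbr:A", "dbo:successor", "dbr:C")}
bgp = [("?y", "dbo:predecessor", "?p"), ("?y", "dbo:successor", "?s")]
assert theta_tpf(bgp, g) == [{"?y": "dbr:A", "?p": "dbr:B", "?s": "dbr:C"}]
```

By the equivalence condition, this yields the same mappings as evaluating the whole BGP over `g` directly.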
Query Decomposition Density.
The query decomposition density is a proxy for the expected answer completeness of a decomposition. We define density as a relative measure with respect to a decomposition that guarantees answer completeness, i.e., the atomic decomposition. The atomic decomposition evaluates every single triple pattern as its own subexpression at all of its relevant sources and, thus, guarantees answer completeness.
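Source selection and the atomic decomposition can be sketched over toy service graphs as follows; the ASK-style membership test and the toy graphs are our assumptions standing in for real requests.

```python
# Sketch (ours): r(tp) = services whose graph yields a non-empty evaluation of
# tp; the atomic decomposition sends every triple pattern to all of them.

def relevant_sources(tp, ep):
    def matches(tp, triple):
        return all(q.startswith("?") or q == t for q, t in zip(tp, triple))
    return frozenset(c for c, g in ep.items()
                     if any(matches(tp, t) for t in g))

def atomic_decomposition(bgp, ep):
    return [(tp, relevant_sources(tp, ep)) for tp in bgp]

ep = {  # hypothetical graphs standing in for the Wikidata/DBpedia services
    "c1": {("wd:Q76", "wdt:P39", "wd:Q11696")},
    "c2": {("dbr:Barack_Obama", "owl:sameAs", "wd:Q76")},
}
bgp = [("?x", "wdt:P39", "wd:Q11696"), ("?y", "owl:sameAs", "?x")]
assert atomic_decomposition(bgp, ep) == [
    (bgp[0], frozenset({"c1"})), (bgp[1], frozenset({"c2"}))]
```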
Definition 4.4 (Atomic Decomposition).
Given a federation 𝐹 = ({𝑐1, . . . , 𝑐𝑘}, 𝑖𝑛𝑡, 𝑒𝑝) and a BGP 𝑃 = (𝑡𝑝1 And . . . And 𝑡𝑝𝑛), the atomic decomposition is given as 𝐷∗(𝑃, 𝐹) = {(𝑡𝑝1, 𝑟(𝑡𝑝1)), . . . , (𝑡𝑝𝑛, 𝑟(𝑡𝑝𝑛))}.

Lemma 4.5.
The evaluation of 𝑃 following 𝐷∗(𝑃, 𝐹) yields complete answers, that is:

𝜃𝐷∗(𝑃,𝐹)(𝑃) = ⟦𝑃⟧𝐹 (4)

Proof.
We provide a direct proof starting from the left-hand side of Eq. (4). Since we require all services to be able to evaluate triple patterns and 𝐷∗ is composed of triple patterns only, the evaluation of 𝐷∗(𝑃, 𝐹) is given by Def. 4.2 and Def. 4.3 as

𝜃𝐷∗(𝑃,𝐹)(𝑃) := ⊲⊳(𝑆𝐸𝑖,𝑆𝑖)∈𝐷∗(𝑃,𝐹) (⋃𝑐𝑗∈𝑆𝑖 ⟦𝑆𝐸𝑖⟧𝑐𝑗) (5)

with 𝑆𝐸𝑖 = 𝑡𝑝𝑖. By Def. 4.4, 𝑆𝑖 corresponds to the relevant sources of 𝑡𝑝𝑖, which are given by 𝑟(𝑡𝑝𝑖) = {𝑟𝑖1, . . . , 𝑟𝑖𝑚𝑖}. Next, we expand Eq. (5) with 𝑆𝑖 in the following way.

(⟦𝑡𝑝1⟧𝑟11 ∪ · · · ∪ ⟦𝑡𝑝1⟧𝑟1𝑚1) ⊲⊳ . . . ⊲⊳ (⟦𝑡𝑝𝑛⟧𝑟𝑛1 ∪ · · · ∪ ⟦𝑡𝑝𝑛⟧𝑟𝑛𝑚𝑛) (6)

Next, we show that we can evaluate all triple patterns at all sources (relevant and non-relevant). By definition, we have that the evaluation of a triple pattern over a non-relevant source is the empty set: ⟦𝑡𝑝𝑖⟧𝑐 = ∅, ∀𝑐 ∉ 𝑟(𝑡𝑝𝑖). Further, since (⟦𝑡𝑝𝑖⟧𝑟𝑖𝑗 ∪ ∅) = ⟦𝑡𝑝𝑖⟧𝑟𝑖𝑗, we can expand Eq. (6) to

(⟦𝑡𝑝1⟧𝑐1 ∪ · · · ∪ ⟦𝑡𝑝1⟧𝑐𝑘) ⊲⊳ . . . ⊲⊳ (⟦𝑡𝑝𝑛⟧𝑐1 ∪ · · · ∪ ⟦𝑡𝑝𝑛⟧𝑐𝑘) (7)
According to Eq. (1) and the fact that triple patterns are in the interface language of all services, we have ⟦𝑡𝑝𝑖⟧𝑐𝑗 = ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐𝑗) and can rewrite Eq. (7) as

(⟦𝑡𝑝1⟧𝑒𝑝(𝑐1) ∪ · · · ∪ ⟦𝑡𝑝1⟧𝑒𝑝(𝑐𝑘)) ⊲⊳ . . . ⊲⊳ (⟦𝑡𝑝𝑛⟧𝑒𝑝(𝑐1) ∪ · · · ∪ ⟦𝑡𝑝𝑛⟧𝑒𝑝(𝑐𝑘)) (8)

Because we assume set semantics, the following equality holds:

(⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐1) ∪ · · · ∪ ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐𝑘)) = ⟦𝑡𝑝𝑖⟧⋃𝑐∈𝐶 𝑒𝑝(𝑐) (9)

With Eq. (9) we can reformulate Eq. (8) as

⟦𝑡𝑝1⟧⋃𝑐∈𝐶 𝑒𝑝(𝑐) ⊲⊳ . . . ⊲⊳ ⟦𝑡𝑝𝑛⟧⋃𝑐∈𝐶 𝑒𝑝(𝑐) (10)

and according to Def. 3.8 and Definition 4 in [23], we have the following equality:

⟦𝑡𝑝1⟧⋃𝑐∈𝐶 𝑒𝑝(𝑐) ⊲⊳ . . . ⊲⊳ ⟦𝑡𝑝𝑛⟧⋃𝑐∈𝐶 𝑒𝑝(𝑐) = ⟦𝑡𝑝1⟧𝐹 ⊲⊳ . . . ⊲⊳ ⟦𝑡𝑝𝑛⟧𝐹 = ⟦𝑡𝑝1 And . . . And 𝑡𝑝𝑛⟧𝐹 = ⟦𝑃⟧𝐹 □

Exclusive groups [24] are another type of structure that preserves completeness; these are subexpressions of a query that can only be answered by a single source. They are defined as follows.
Definition 4.6 (Exclusive Group).
Given a federation 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝), a BGP 𝑋 is called an exclusive group if for all triple patterns 𝑡𝑝𝑖 ∈ 𝑋 there exists only one relevant source 𝑐𝑋 ∈ 𝐶: 𝑋 = {𝑡𝑝𝑖 | 𝑟(𝑡𝑝𝑖) = {𝑐𝑋}}.
We represent query decompositions by decomposition graphs and compute the relative density with respect to the decomposition graph of the atomic decomposition 𝐷∗(𝑃, 𝐹) as a measure of completeness. More edges in the graph of a given decomposition yield a higher density and, thus, higher expected answer completeness.

Definition 4.7 (Query Decomposition Graph).
Let 𝐷(𝑃, 𝐹) be a query decomposition for the BGP 𝑃 and federation 𝐹 = (𝐶, 𝑖𝑛𝑡, 𝑒𝑝). The decomposition graph 𝐺𝐷(𝑃,𝐹) = (𝑉, 𝐸) of 𝐷(𝑃, 𝐹) is given as follows. The set of vertices is 𝑉 = {𝑡𝑝𝑖 ∈ 𝑃} ∪ ⋃𝑡𝑝𝑖∈𝑃 𝑟(𝑡𝑝𝑖). The set of edges 𝐸 ⊆ 𝑉 × 𝑉 is given by the following rules.
Rule I: Add an edge between a triple pattern 𝑡𝑝𝑖 ∈ 𝑃 and a relevant source 𝑟𝑖𝑗 ∈ 𝑟(𝑡𝑝𝑖), if 𝑡𝑝𝑖 is part of a subexpression 𝑆𝐸 that is evaluated at 𝑟𝑖𝑗: ∃(𝑆𝐸, 𝑆) ∈ 𝐷(𝑃, 𝐹) with 𝑡𝑝𝑖 ∈ 𝑆𝐸 ∧ 𝑟𝑖𝑗 ∈ 𝑆.
Rule II: Add an edge between two triple patterns 𝑡𝑝𝑖 and 𝑡𝑝𝑗, if they do not co-occur in a subexpression 𝑆𝐸 in 𝐷: (𝑡𝑝𝑖, 𝑡𝑝𝑗) ∈ 𝐸, if ∄(𝑆𝐸, 𝑆) ∈ 𝐷(𝑃, 𝐹) : 𝑡𝑝𝑖 ∈ 𝑆𝐸 ∧ 𝑡𝑝𝑗 ∈ 𝑆𝐸.
Footnote (proof of the equality in Eq. (9)): We prove this equality by contradiction. Consider 𝐺 = ⋃𝑐∈𝐶 𝑒𝑝(𝑐). Assume that there exists a solution mapping 𝜇 s.t. 𝜇 ∈ (⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐1) ∪ · · · ∪ ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐𝑘)) and 𝜇 ∉ ⟦𝑡𝑝𝑖⟧𝐺. This means that the evaluation of a subexpression over some source, e.g., ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐𝑗), produces additional answers w.r.t. the evaluation over the union of all RDF graphs. This could only happen if 𝑒𝑝(𝑐𝑗) ⊈ 𝐺; however, this contradicts the definition of 𝐺. Now assume that 𝜇 ∈ ⟦𝑡𝑝𝑖⟧𝐺 but 𝜇 ∉ (⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐1) ∪ · · · ∪ ⟦𝑡𝑝𝑖⟧𝑒𝑝(𝑐𝑘)). Without loss of generality, assume that 𝜇 was produced from matching an RDF triple 𝑡 ∈ 𝐺 s.t. 𝑡 ∉ 𝑒𝑝(𝑐) for all services 𝑐 ∈ 𝐶 in the federation. This is again a contradiction with the definition of 𝐺.
Rule III: Add an edge between two triple patterns 𝑡𝑝𝑖 and 𝑡𝑝𝑗, if they are part of the same exclusive group 𝑋: (𝑡𝑝𝑖, 𝑡𝑝𝑗) ∈ 𝐸, if 𝑡𝑝𝑖 ∈ 𝑋 ∧ 𝑡𝑝𝑗 ∈ 𝑋.
Rule IV: Add an edge between all triple patterns, if the decomposition is composed of just a single subexpression to be evaluated at one source: 𝐷(𝑃, 𝐹) = {(𝑆𝐸, 𝑆)} ∧ |𝑆| = 1.
The rules for adding edges to the graph are designed in such a way that the maximum number of edges is present in the decomposition graph of the atomic decomposition, 𝐺𝐷∗(𝑃,𝐹) = (𝑉∗, 𝐸∗). This is because each triple pattern is connected to each relevant source (Rule I) and there is an edge between each pair of triple patterns (Rule II). If a decomposition contacts fewer sources, the decomposition graph will have fewer edges according to Rule I. Further, if more triple patterns are grouped together into subexpressions in a query decomposition, its graph will also have fewer edges according to Rule II. The rationale of this rule is that grouping triple patterns could potentially miss solution mappings that are only produced by joining data from two different sources. The remaining rules are introduced to handle the following exceptions. Rule III handles exclusive groups: triple patterns of exclusive groups can be grouped into a single subexpression without negatively impacting the answer completeness. Finally, Rule IV handles the following case: if the decomposition only has a single subexpression that is evaluated at a single source, how these triple patterns are grouped into subexpressions does not have an impact on the completeness. In contrast to Rule III, in Rule IV, even though 𝑆𝐸 is evaluated at just a single source, 𝑆𝐸 does not need to be an exclusive group and can have other relevant sources that are not in 𝑆. With these rules, we can measure the density of a decomposition graph relative to the maximum number of possible edges, as given by the atomic decomposition graph.
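Rules I–IV, together with the density measure defined next, can be sketched as follows. The encoding (triple-pattern ids, frozensets of service ids) is our assumption, the exclusive-group test of Rule III is applied pairwise, and the relevant sources mirror those used in Example 4.10 below.

```python
from itertools import combinations

def graph_edges(D, r):
    """Edge set of the decomposition graph of D under Rules I-IV (Def. 4.7).
    D is a list of (subexpression, services); r maps tp id -> relevant services."""
    E = set()
    # Rule I: tp -- source, if tp occurs in a subexpression evaluated there
    for se, S in D:
        for tp in se:
            for c in S & r[tp]:
                E.add((tp, c))
    single = len(D) == 1 and len(D[0][1]) == 1            # Rule IV premise
    for ti, tj in combinations(sorted(r), 2):
        grouped = any(ti in se and tj in se for se, _ in D)
        excl = r[ti] == r[tj] and len(r[ti]) == 1          # same exclusive group
        if not grouped or excl or single:                  # Rules II, III, IV
            E.add((ti, tj))
    return E

def density(D, D_star, r):
    return len(graph_edges(D, r)) / len(graph_edges(D_star, r))

# Relevant sources as in Example 4.10 (c1 = Wikidata EP, c2 = DBpedia TPF)
r = {"tp1": frozenset({"c1"}), "tp2": frozenset({"c1"}),
     "tp3": frozenset({"c1", "c2"}), "tp4": frozenset({"c2"})}
D_star = [({"tp1"}, {"c1"}), ({"tp2"}, {"c1"}),
          ({"tp3"}, {"c1", "c2"}), ({"tp4"}, {"c2"})]
D1 = [({"tp1", "tp2"}, {"c1"}), ({"tp3"}, {"c1", "c2"}), ({"tp4"}, {"c2"})]
D2 = [({"tp1", "tp2"}, {"c1"}), ({"tp3", "tp4"}, {"c2"})]
D3 = [({"tp1", "tp2", "tp3"}, {"c1"}), ({"tp4"}, {"c2"})]

assert density(D1, D_star, r) == 1.0     # exclusive group keeps density at 1
assert density(D2, D_star, r) > density(D3, D_star, r)
```

Grouping the exclusive group {tp1, tp2} costs no edges (Rule III), whereas grouping tp3, which has two relevant sources, removes edges under Rules I and II.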
Definition 4.8 (Density of a Query Decomposition).
Given a query decomposition 𝐷(𝑃, 𝐹) and the corresponding graph 𝐺𝐷(𝑃,𝐹) = (𝑉, 𝐸), the density 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) is computed as: 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = |𝐸| / |𝐸∗| ∈ [0, 1].

Theorem 4.9.
The evaluation of a query decomposition 𝐷(𝑃, 𝐹) over a federation 𝐹 yields complete answers if 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1:

𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1 =⇒ 𝜃𝐷(𝑃,𝐹)(𝑃) = ⟦𝑃⟧𝐹 (11)

Proof.
We prove the implication in Eq. (11) by contradiction. We assume 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1 and 𝜃𝐷(𝑃,𝐹)(𝑃) ≠ ⟦𝑃⟧𝐹. According to Def. 4.8, 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1 holds only if the decomposition graph of 𝐷(𝑃, 𝐹) has the same number of edges as the decomposition graph of the atomic decomposition: |𝐸| = |𝐸∗|. Following Def. 4.4 and Def. 4.7, in 𝐸∗ there is an edge between each triple pattern and its relevant sources (Rule I) and an edge between every pair of triple patterns (Rule II). The maximum number of edges is

|𝐸∗| = Σ𝑡𝑝𝑖∈𝑃 |𝑟(𝑡𝑝𝑖)| (Rule I) + 0.5 · |𝑃| · (|𝑃| − 1) (Rule II).

Since we prove completeness, we focus on the case when a decomposition yields fewer answers: 𝜃𝐷(𝑃,𝐹)(𝑃) ⊂ ⟦𝑃⟧𝐹. This can occur in two cases:
Case 1: A part of the query is not evaluated at a relevant source. Without loss of generality, consider that a triple pattern 𝑡𝑝𝑖 ∈ 𝑃 is not evaluated at a relevant source 𝑐𝑗 and ⟦𝑡𝑝𝑖⟧𝑐𝑗 contributes to the answers of 𝑃. In this case, the decomposition graph 𝐺𝐷(𝑃,𝐹) does not have an edge (𝑡𝑝𝑖, 𝑐𝑗) according to Rule I and, therefore, |𝐸| < |𝐸∗|. This contradicts the assumption that 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1.
Case 2: Triple patterns with several relevant sources are grouped into subexpressions. Consider the solution mapping 𝜇 ∈ ⟦𝑃⟧𝐹, with 𝜇 = 𝜇1 ∪ 𝜇2 such that 𝜇1 ∈ ⟦𝑡𝑝1⟧𝑐1, 𝜇2 ∈ ⟦𝑡𝑝2⟧𝑐2, and 𝜇1 ∼ 𝜇2, and without loss of generality assume that 𝑟(𝑡𝑝1) = 𝑟(𝑡𝑝2) = {𝑐1, 𝑐2}. Such a solution mapping 𝜇 does not exist in 𝜃𝐷(𝑃,𝐹)(𝑃) in the case that the two triple patterns are evaluated jointly at the sources 𝑐1 and 𝑐2, that is, ((𝑡𝑝1 And 𝑡𝑝2), {𝑐1, 𝑐2}) ∈ 𝐷(𝑃, 𝐹). In this case, the edge (𝑡𝑝1, 𝑡𝑝2) does not exist in 𝐸 according to Rule II, but the edge exists in 𝐸∗ because (𝑡𝑝1, {𝑐1, 𝑐2}), (𝑡𝑝2, {𝑐1, 𝑐2}) ∈ 𝐷∗(𝑃, 𝐹). Therefore, we have |𝐸| < |𝐸∗|, which contradicts the assumption that 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1. □

We can prove that a decomposition density of 1 implies answer completeness; however, the inverse (i.e., 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) = 1 ⇐= 𝜃𝐷(𝑃,𝐹)(𝑃) = ⟦𝑃⟧𝐹) cannot be guaranteed. For example, there might be a triple pattern 𝑡𝑝1 with two relevant sources 𝑐1 and 𝑐2 with just source 𝑐1 contributing to the final answers. A decomposition 𝐷(𝑃, 𝐹) where 𝑡𝑝1 is not evaluated at 𝑐2 might still yield complete answers, but 𝑑𝑒𝑛𝑠𝑖𝑡𝑦(𝐷(𝑃, 𝐹)) < 1 according to Rule I. Therefore, the decomposition density is a measure for the expected completeness, based on the assumptions that answer completeness is negatively affected by i) contacting fewer relevant sources, and ii) grouping triple patterns that can be evaluated at several sources into subexpressions. Estimating the true completeness more accurately would require additional information on the data provided by the LDF services beyond just the relevant sources. Such additional information could be used to improve the effectiveness of our measure, for example, by weighting the edges in the decomposition graph according to their importance. However, such an extension is out of the scope of this work.

Example 4.10.
Let us consider the BGP P = (tp1 And tp2 And tp3 And tp4 And tp5) from the SPARQL query of the motivating example in Section 2 and the federation F_ex = ({c1, c2}, int, ep). The relevant sources are r(tp1) = {c1}, r(tp2) = {c1}, r(tp3) = {c1, c2}, and r(tp4) = r(tp5) = {c2}. The atomic query decomposition is D*(P, F) = {(tp1, {c1}), (tp2, {c1}), (tp3, {c1, c2}), (tp4, {c2}), (tp5, {c2})}, and the corresponding graph is shown in Fig. 2a. In P, the triple patterns tp1 and tp2 form an exclusive group, as they are both only answerable by service c1; the same holds for tp4 and tp5 at service c2. Therefore, we can combine them into single subexpressions without reducing the expected completeness in D1(P, F) = {((tp1 And tp2), {c1}), (tp3, {c1, c2}), ((tp4 And tp5), {c2})}. The corresponding graph shown in Fig. 2b is identical to G_{D*(P,F)} and thus its expected completeness is density(D1(P, F)) = 6/6 = 1. Alternatively, we can choose to evaluate tp3 only at service c1 with D2(P, F) = {((tp1 And tp2 And tp3), {c1}), ((tp4 And tp5), {c2})} (Fig. 2c) or to evaluate tp3 only at service c2 with D3(P, F) = {((tp1 And tp2), {c1}), ((tp3 And tp4 And tp5), {c2})} (Fig. 2d). Since both corresponding decomposition graphs have fewer edges than G_{D*(P,F)}, we expect fewer answers because density(D*(P, F)) > density(D2(P, F)) and density(D*(P, F)) > density(D3(P, F)).

Query Decomposition Cost.
The example illustrates how query decompositions can have different levels of expected completeness. Ideally, one would always choose the atomic query decomposition to guarantee complete answers. However, there are also costs associated with the evaluation of a decomposition, induced by the amount of data transferred for intermediate results during query execution as well as by the number of services that need to be contacted. In federations of SPARQL endpoints, both goals are achieved by i) decomposing the query into as few subexpressions as possible, and ii) reducing the number of endpoints contacted by selecting just those sources which are likely to contribute to the final answer of the query. In contrast, when facing heterogeneous federations of LDF services, the languages of the LDF services need to be considered as well. The reason is that the interface-compliant evaluation might incur additional costs in cases where subexpressions cannot be evaluated by a service as a whole. There might be several interface-compliant evaluations for an expression, because the original expression could be split in different ways into subexpressions that can be evaluated by the service. We denote an interface-compliant evaluation of an expression P with the minimal number of subexpressions as θ*_c(P), which is the evaluation of P that requires separating the expression into the fewest subexpressions to be interface-compliant. Note that |θ*_c(P)| = 1 if P ∈ L_c. We propose a lower bound for the query decomposition cost that considers the number of services contacted and the number of subexpressions in an interface-compliant evaluation of the decomposition. In particular, this lower bound combines: (1) the number of sources |S| to be contacted per subexpression, and (2) the number of additional subexpressions (|θ*_c(SE)| − 1) required for an interface-compliant evaluation of each subexpression at all corresponding sources.

Definition 4.11 (Cost of a Query Decomposition).
The cost of evaluating a query decomposition D(P, F) is given by

cost(D(P, F)) = Σ_{(SE,S) ∈ D(P,F)} |S| + Σ_{(SE,S) ∈ D(P,F)} Σ_{c ∈ S} (|θ*_c(SE)| − 1).

Note that the proposed query decomposition cost provides a lower bound for evaluating a decomposition, while computing the exact cost requires knowledge about the technical configurations of the services in the federation. For instance, obtaining solutions from TPF servers might require several requests for paginating the results, while a single request might suffice at a SPARQL endpoint.
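To make the definition concrete, the lower bound can be computed in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the interface model below (a SPARQL endpoint evaluates a whole BGP as one subexpression, a (br)TPF server one triple pattern at a time) and the reconstructed subscripts are assumptions.

```python
def min_subexpressions(se, interface):
    # |theta*_c(SE)|: an endpoint evaluates SE as a whole, while a (br)TPF
    # server needs one subexpression per triple pattern (assumed model).
    return 1 if interface == "sparql" else len(se)

def decomposition_cost(decomp, interface_of):
    """Lower bound of Def. 4.11: sources contacted per subexpression plus the
    extra subexpressions needed for an interface-compliant evaluation."""
    contacted = sum(len(sources) for _, sources in decomp)
    extra = sum(min_subexpressions(se, interface_of[c]) - 1
                for se, sources in decomp for c in sources)
    return contacted + extra

# Decomposition D3 from Example 4.10 (reconstructed subscripts):
# tp1, tp2 evaluated at c1 and tp3, tp4, tp5 at c2.
D3 = [(("tp1", "tp2"), {"c1"}), (("tp3", "tp4", "tp5"), {"c2"})]
print(decomposition_cost(D3, {"c1": "sparql", "c2": "sparql"}))  # 2
print(decomposition_cost(D3, {"c1": "sparql", "c2": "tpf"}))     # 4
```

If c2 is a TPF server, the second term contributes |θ*_{c2}(SE)| − 1 = 2 additional subexpressions, which is the situation discussed in Example 4.12 below.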
Example 4.12.
Let us consider the decomposition D3(P, F) from Example 4.10 and the subexpression SE = (tp3 And tp4 And tp5) to be evaluated at source S = {c2}. In contrast to its density, the cost of evaluating D3(P, F) depends on the LDF interface that c2 implements. If c2 is a SPARQL endpoint, i.e., int(c2) = (L_CoreSparql, M_Ep), the evaluation ⟦SE⟧_{c2} is interface-compliant and thus |θ*_{c2}(SE)| = 1. However, if c2 is a TPF server, i.e., int(c2) = (L_Tp, M_Tpf), the interface-compliant evaluation of SE requires evaluating the triple patterns individually with θ*_{c2}(SE) = ⟦tp3⟧_{c2} ⋈ ⟦tp4⟧_{c2} ⋈ ⟦tp5⟧_{c2} and thus |θ*_{c2}(SE)| = 3. Hence, the evaluation at the TPF server requires two additional subexpressions to be evaluated. This may lead to higher execution costs, as there are potentially more intermediate results to be transferred and the service needs to be contacted at least two additional times.

Figure 2: Query decomposition graphs for the decompositions from Example 4.10: (a) G_{D*(P,F)}, (b) G_{D1(P,F)}, (c) G_{D2(P,F)}, (d) G_{D3(P,F)}. The rules for adding edges are indicated in green.

Finally, we can combine both the density and the cost of a query decomposition into the query decomposition problem, which aims to obtain a query decomposition that maximizes the expected answer completeness while minimizing the execution cost.

Definition 4.13 (Query Decomposition Problem).
Given a BGP P and a federation F = (C, int, ep), the query decomposition problem is to find a query decomposition D(P, F) that minimizes the execution cost while maximizing its density:

D(P, F) = arg max_D density(D(P, F)) ∧ arg min_D cost(D(P, F)).

Note that this is a multi-objective optimization problem, where there might not be a single best solution but rather a set of optimal trade-off solutions, i.e., Pareto-optimal solutions.
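Selecting the Pareto-optimal decompositions from candidate (density, cost) pairs can be sketched as follows; the numeric values below are made up purely for illustration and are not taken from the paper.

```python
def pareto_optimal(candidates):
    """Keep candidates not dominated in (density, cost): a candidate is
    dominated if another one has density >= and cost <=, with at least one
    of the two strictly better."""
    front = []
    for d1, c1, name in candidates:
        dominated = any((d2 >= d1 and c2 <= c1) and (d2 > d1 or c2 < c1)
                        for d2, c2, _ in candidates)
        if not dominated:
            front.append(name)
    return front

# Hypothetical (density, cost) values for four candidate decompositions.
cands = [(1.0, 9, "D*"), (1.0, 8, "D1"), (0.83, 6, "D2"), (0.83, 7, "D3")]
print(pareto_optimal(cands))  # ['D1', 'D2']
```

Here D* is dominated by D1 (same density, lower cost) and D3 by D2, leaving two trade-off solutions; which one to execute depends on whether the use case favors answers or cost.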
Example 4.14.
Consider two alternative example federations that differ in the LDF interfaces of their services c1 and c2: F1 = ({c1, c2}, int1, ep) with int1(c1) = int1(c2) = (L_CoreSparql, M_Ep), and F2 = ({c1, c2}, int2, ep) with int2(c1) = (L_Tp, M_Tpf) and int2(c2) = (L_CoreSparql, M_Ep). The density and cost for the query decompositions are given in the following table, where the best values are indicated in bold.

Table: for each federation (F1, F2) and each decomposition (D*, D1, D2, D3), the rows list Σ|S|, Σ(|θ*_c(SE)| − 1), the resulting cost, and the completeness comp.
The decomposition cost in the example shows how both the number of subexpressions and the number of sources they are evaluated at (Σ|S|), as well as the capabilities of the interfaces (Σ(|θ*_c(SE)| − 1)), impact the overall cost. Further, it shows the trade-off between the two conflicting objectives, density and cost. In both federations, the decompositions that yield the highest density also have the highest cost, and vice versa. Approaches to solving the query decomposition problem need to determine solutions that yield a suitable (depending on the use case) trade-off between the number of answers and the execution cost. According to Def. 4.8, two main factors impact the density. First, the triple patterns should be evaluated at as many relevant sources as possible (Rule I). Second, the more fine-grained the subexpressions for triple patterns that have several relevant sources in common, the higher the density (Rule II). Similarly, the costs of decompositions originate from two main aspects: first, contacting fewer sources with larger subexpressions reduces the cost and, second, decomposing the query into subexpressions that are interface-compliant reduces the cost. One way of pruning sources without affecting answer completeness is to determine the relevant sources that do not contribute to the final answers of the query [20]. However, this can be very challenging for queries with triple patterns that contain terms from common ontologies (e.g., RDF/S, OWL), as they can be answered by many of the sources in the federation. For this purpose, some approaches rely on pre-computed statistics/catalogues [17, 20, 21] and/or the query capabilities of SPARQL endpoints, such as Ask queries [28]. We propose a query decomposition approach that can be combined with a heuristic-based source pruning method and can be applied to any heterogeneous federation.
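Under one plausible reading of Rules I and II, the density can be computed directly from the decomposition graph. The sketch below uses the reconstructed subscripts from Example 4.10 and is illustrative rather than the authors' code.

```python
from itertools import combinations

def graph_edges(decomp, relevant):
    """Edges of a decomposition graph: Rule I adds a (tp, c) edge per assigned
    source; Rule II adds a (tp_i, tp_j) edge for separately evaluated triple
    patterns sharing at least two relevant sources (one plausible reading)."""
    edges = set()
    for se, sources in decomp:
        for tp in se:
            for c in sources:
                edges.add((tp, c))
    for (se_i, _), (se_j, _) in combinations(decomp, 2):
        for tp_i in se_i:
            for tp_j in se_j:
                if len(relevant[tp_i] & relevant[tp_j]) >= 2:
                    edges.add(frozenset((tp_i, tp_j)))
    return edges

def density(decomp, relevant):
    atomic = [((tp,), relevant[tp]) for tp in relevant]   # D*(P, F)
    return len(graph_edges(decomp, relevant)) / len(graph_edges(atomic, relevant))

relevant = {"tp1": {"c1"}, "tp2": {"c1"}, "tp3": {"c1", "c2"},
            "tp4": {"c2"}, "tp5": {"c2"}}
D1 = [(("tp1", "tp2"), {"c1"}), (("tp3",), {"c1", "c2"}), (("tp4", "tp5"), {"c2"})]
D2 = [(("tp1", "tp2", "tp3"), {"c1"}), (("tp4", "tp5"), {"c2"})]
print(density(D1, relevant))  # 6/6 = 1.0
print(density(D2, relevant))  # 5/6, fewer edges than the atomic graph
```

Merging the two exclusive groups (D1) keeps all six atomic edges, whereas restricting tp3 to a single source (D2) drops one edge and thus lowers the expected completeness.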
Query Decomposition Approach.
We propose an approach that does not rely on specific statistics about the federation members and has two central goals: (1) maximize the density by evaluating all triple patterns at the relevant sources that contribute to the final answers, and (2) reduce the execution cost by obtaining subexpressions that leverage the capabilities of the services as much as possible. Furthermore, we add an optional source pruning step to further decrease the cost by reducing the number of sources contacted. The decomposer is outlined in Algorithm 1. Its inputs are a BGP P and a federation F = (C, int, ep). First, the algorithm creates the atomic decomposition by iterating over each triple pattern tp in P, determining the set of relevant sources S, and adding (tp, S) to the decomposition D (Line 2 - Line 5). Next, the relevant sources per triple pattern can be pruned in Line 6. This pruning step is not

Figure 3: Query planning steps for the query decomposition D1(P, F_ex) = {((tp1 And tp2), {c1}), (tp3, {c1, c2}), ((tp4 And tp5), {c2})}: (a) join ordering and union expressions, (b) interface-compliant subquery plans, (c) placing physical operators.

Algorithm 1: Interface-aware Query Decomposer
Input: BGP P = {tp1, ..., tpn}, Federation F = (C, int, ep)
 1: D = ∅
 2: for tp ∈ P do
 3:     S = relevantSources(tp)
 4:     D = D ∪ {(tp, S)}
 5: end
 6: D = pruneSources(D)
 7: do
 8:     updated = False
 9:     for ∀(SE_i, S_i), (SE_j, S_j) ∈ D ∧ SE_i ≠ SE_j do
10:         if |vars(SE_i) ∩ vars(SE_j)| > 0
11:            ∧ |S_i ∪ S_j| = 1
12:            ∧ (SE_i And SE_j) ∈ L_c, ∀c ∈ S_i then
13:             D = D \ {(SE_i, S_i), (SE_j, S_j)}
14:             D = D ∪ {((SE_i And SE_j), S_i)}
15:             updated = True
16:             break
17:     end
18: while updated
19: return D

required; however, it allows for reducing the decomposition cost by i) reducing the number of sources to be contacted, and ii) allowing more triple patterns to be grouped into subexpressions in the following steps. The source pruning approach is interchangeable, and we detail our source pruning heuristic in the next paragraph. After pruning the sources, the algorithm tries to merge as many subexpressions in the decomposition D as possible. All possible combinations of subexpressions (SE_i, S_i) and (SE_j, S_j) are considered and merged if they fulfill the following three conditions:

Condition I: Both subexpressions have variables in common: |vars(SE_i) ∩ vars(SE_j)| > 0 (Line 10).
Condition II: Both subexpressions have exactly one source in common: |S_i ∪ S_j| = 1 (Line 11).
Condition III: The common source c can evaluate the conjunction of both expressions: (SE_i And SE_j) ∈ L_c (Line 12).

If two subexpressions fulfill all conditions, the individual subexpressions are removed from the decomposition D and their conjunction is added to D. This process is repeated until no more subexpressions can be merged (updated = False). A central property of the query decompositions generated by the algorithm is that the evaluation of all subexpressions is compliant with all corresponding sources. That is, ∀(SE, S) ∈ D(P, F): ∀c ∈ S: θ_c(SE) = ⟦SE⟧_c. As a result, the interface-compliant evaluation (Def. 4.3) of all decompositions generated by Algorithm 1 is given as

θ_{D(P,F)}(P) := ⋈_{(SE_i, S_i) ∈ D(P,F)} ( ∪_{c_j ∈ S_i} ⟦SE_i⟧_{c_j} ).

Note that this property does not require the query planner to find the subexpression-minimizing evaluation θ*_c(SE_i).

Source Pruning Approach.
We propose a heuristic that leverages the atomic decomposition graph G_{D*(P,F)} = (V*, E*) and does not rely on data statistics. Our approach iterates over the source vertices c_i ∈ V* by non-increasing out-degree (i.e., starting with the most popular source). For each triple pattern tp_j connected to c_i ((c_i, tp_j) ∈ E*), the edges to all other sources are removed for tp_j: E* = E* \ {(c_k, tp_j) ∈ E* | c_k ≠ c_i}. In addition, the relevant sources for triple patterns with the same common subject are not pruned, to maximize completeness. The rationale for this is the observation that RDF datasets typically follow entity-centric descriptions, where the URI of an entity appears in the subject of triples in the authoritative dataset. For example, triples with subject dbr:Berlin are all part of the DBpedia dataset.

Query Planning.

The main tasks of the query planner are finding an efficient logical plan and placing physical operators such that the execution time of the query plan is minimized. For both tasks, common cost-based query planners leverage statistics on the data of the members in the federation. In heterogeneous federations, however, query planning approaches cannot always rely on the same level of statistics from all sources and need to adjust to the statistics available at the individual sources. For instance, obtaining fine-grained statistics might require access to the entire dataset of a source for efficient computation [14], or require the services to be able to execute complex SPARQL expressions, such as aggregate queries. Furthermore, in the case that the interface language of an LDF service does not support the evaluation of a subexpression from the decomposition, the planner needs to obtain an efficient subplan for evaluating the subexpression over that service.
In this section, we first discuss the steps necessary to obtain efficient query plans and, thereafter, propose a query planner for query decompositions that respects the interface restrictions in heterogeneous federations.
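Since the planner operates on decompositions produced by Algorithm 1, the decomposer's merge loop can be summarized in Python. This is a minimal, illustrative sketch: the triple-pattern representation, the language test, and the (here disabled) pruning hook are assumptions, not the authors' implementation.

```python
from itertools import combinations

def vars_of(se):
    # Variables of a subexpression; triple patterns are 3-tuples of strings
    # and variables start with '?' (assumed representation).
    return {t for tp in se for t in tp if t.startswith("?")}

def decompose(bgp, relevant, in_language, prune=lambda d: d):
    """Sketch of Algorithm 1: start from the atomic decomposition, optionally
    prune sources, then repeatedly merge pairs satisfying Conditions I-III."""
    D = [((tp,), frozenset(relevant[tp])) for tp in bgp]
    D = prune(D)
    updated = True
    while updated:
        updated = False
        for (se_i, s_i), (se_j, s_j) in combinations(list(D), 2):
            if (vars_of(se_i) & vars_of(se_j)                 # Condition I
                    and len(s_i | s_j) == 1                   # Condition II
                    and all(in_language(se_i + se_j, c) for c in s_i)):  # Condition III
                D.remove((se_i, s_i))
                D.remove((se_j, s_j))
                D.append((se_i + se_j, s_i))
                updated = True
                break
    return D

# Hypothetical federation: c1 is an endpoint (any BGP), c2 a TPF server.
langs = {"c1": "sparql", "c2": "tpf"}
ok = lambda se, c: langs[c] == "sparql" or len(se) == 1

tp1 = ("?m", "ex:director", "?d")
tp2 = ("?m", "ex:label", "?l")
tp3 = ("?d", "ex:born", "?p")
D = decompose([tp1, tp2, tp3], {tp1: {"c1"}, tp2: {"c1"}, tp3: {"c2"}}, ok)
print(sorted(len(se) for se, _ in D))  # [1, 2]: tp1 and tp2 merged at c1
```

The exclusive group at c1 is merged into one subexpression, while tp3 stays alone at the TPF server, so every subexpression remains interface-compliant.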
Join Ordering with Union Expressions.
The query planner determines a join ordering for the subexpressions in a decomposition that minimizes the number of intermediate results. The challenge lies in estimating the size of intermediate results from subexpressions and joins. This is particularly difficult with heterogeneous interfaces due to two factors. First, the methods to estimate cardinalities depend on the interface languages and the metadata supported by the interfaces. For example, determining the cardinality of a subexpression comprised of two triple patterns could be achieved by a Count query if the interface, e.g., a SPARQL endpoint, supports the evaluation of such expressions. However, this could be an expensive operation on the server and thus time-consuming for the client. Moreover, for other interfaces, such as TPF servers, this would not be possible, and the cardinality would need to be estimated according to the metadata of the triple patterns. If available, statistics on the data distribution could alternatively be used to estimate the number of intermediate results [8, 21]. Second, federated plans comprise union operators to combine data from alternative relevant sources. In this case, estimating the number of intermediate results from a union operator that will contribute to a join is more difficult due to the different data distributions in each source. Therefore, the planner must devise appropriate join orderings in the presence of unions from different sources. Fig. 3a shows a join ordering with unions for a query decomposition from the query and federation of our motivating example: D1(P, F_ex) = {((tp1 And tp2), {c1}), (tp3, {c1, c2}), ((tp4 And tp5), {c2})}.

Interface-compliant Subexpression Plans.
If a decomposer does not provide decompositions in which the subexpressions SE are interface-compliant, the query planner additionally needs to find subplans that evaluate the subexpressions in an interface-compliant manner. In those cases, the query planner needs to break down SE into subexpressions that minimize the cost of the interface-compliant evaluation θ*_c(SE). Since the resulting interface-compliant evaluation consists of several joins, the query planner also needs to determine the join ordering for θ*_c(SE). For example, if the service is a TPF server, this requires first splitting the subexpression into its individual triple patterns and, thereafter, finding an appropriate join ordering. The latter could rely on existing query planning approaches for TPF servers [3, 27]. Fig. 3b shows the interface-compliant evaluation of ⟦(tp4 And tp5)⟧_{c2} over the DBpedia TPF server (c2) for decomposition D1(P, F_ex). The evaluation is given by θ*_{c2}(tp4 And tp5) = ⟦tp4⟧_{c2} ⋈ ⟦tp5⟧_{c2}, and it introduces an additional join operation in the query plan.

Placing Physical Operators.
Finally, the query planner selects physical operators to obtain an executable physical query plan. This includes placing access operators that retrieve the solution mappings from the services, as well as physical join and union operators to process the intermediate results. The access operators transform the subexpressions into requests that can be processed by the corresponding LDF services. Ideally, the access operators leverage the querying capabilities of the interfaces such that the results are obtained efficiently. For example, traditional federated query engines for SPARQL endpoints require only access operators that adhere to the SPARQL protocol to get solution mappings from the endpoints. In heterogeneous federations, however, appropriate access operators for each LDF interface in the federation need to be implemented and placed accordingly by the planner. Moreover, physical join operators that implement different join strategies, such as symmetric hash join or bind join, need to be placed effectively, as they incur different costs. Finally, the planner needs to place the appropriate physical union operators in the plan such that the semantics of the query language are respected. Fig. 3c shows an example of a physical plan for decomposition D1(P, F_ex), where service c1 is a Wikidata SPARQL endpoint and service c2 the DBpedia TPF server.

Algorithm 2: Query Planning Algorithm
Input: Decomposition D(P, F) = {(SE1, S1), ..., (SEn, Sn)}
 1: L = empty list
 2: for (SE_i, S_i) ∈ D(P, F) do
 3:     card_i = estimateCardinality(SE_i, S_i)
 4:     L.append((SE_i, S_i, card_i))
 5: end
 6: L = sort(L, card_i)  // sort L by non-decreasing card_i
 7: d = L.get(0)
 8: L.remove(0)
 9: T = AccessPlan(d)
10: while |L| > 0 do
11:     d = L.get(0); idx = 0
12:     for i = 0; i < |L|; i++ do
13:         (SE_i, S_i, card_i) = L.get(i)
14:         if |vars(T) ∩ vars(SE_i)| > 0 then
15:             d = (SE_i, S_i, card_i); idx = i
16:             break
17:     end
18:     L.remove(idx)
19:     T′ = AccessPlan(d)
20:     O = getPhysicalOperator(T, T′)
21:     T = JoinPlan(T, T′, O)
22: end
23: return T

Query Planning Approach.
We now present a heuristic-based query planner for heterogeneous federations. In particular, it relies on the decompositions obtained by Algorithm 1. First, we present the overall planning approach and, thereafter, details of our prototypical implementation. The query planner is outlined in Algorithm 2. It starts by estimating the cardinality of each subexpression in the decomposition (Line 3) and creates a list L in which the subexpressions are sorted by non-decreasing cardinality (Line 6). The query planner starts building the query plan with the subexpression d with the lowest cardinality and creates the corresponding access plan T (Line 9). It then iterates over the remaining subexpressions in L and determines the next subexpression to join T with. This is either a remaining subexpression with the lowest cardinality and a common variable (Line 15) or, if there is no join remaining in the BGP, the subexpression with the lowest cardinality (Line 11). Once the subexpression d is selected, the access plan T′ for d is created (Line 19) and the appropriate physical join operator O is determined (Line 20). Finally, T becomes the JoinPlan of T and T′ (Line 21). When L is empty, the final plan T is returned (Line 23). (The access plan for d_i = (SE_i, S_i) refers to the union of evaluating subexpression SE_i at each source in S_i.)

After presenting the generic planning approach, we now provide details on the specific steps in our prototypical implementation. The current implementation focuses on three well-known LDF interfaces: Triple Pattern Fragments (TPF), Bindings-Restricted Triple Pattern Fragments (brTPF), and SPARQL endpoints. Further, it relies on the properties of the decompositions generated by our interface-aware query decomposer presented in Algorithm 1, that is, each subexpression SE_i is interface-compliant for all sources in S_i. For each service c ∈ S_i, estimateCardinality (Line 3) obtains the estimated cardinality card_i^c of the subexpression SE_i at the service c, in line with the interface language and the metadata of c. As evaluating SE_i at several sources reflects a union operation, it then sums up the individual cardinalities to obtain the total cardinality of SE_i over all sources: card_i = Σ_{c ∈ S_i} card_i^c. If SE_i is a triple pattern and the source is a brTPF or TPF server, we request the triple pattern and use the void:count metadata as the cardinality estimate. If SE_i is a BGP or a triple pattern and the source is a SPARQL endpoint, we use a Count query to estimate the cardinality. Further, we estimate the join cardinality of two subexpressions SE_i and SE_j as the minimum of their cardinalities. Next, we implement appropriate access operators for all three interfaces. Since all subexpressions are compliant with the interfaces, we do not need to first obtain an interface-compliant evaluation in the AccessPlans. Finally, we determine the physical join operator according to the estimated number of requests needed to execute the join. We distinguish between two common join strategies: symmetric hash join and bind join. The reason to use the number of requests to determine the join strategy is two-fold: i) the number of requests has a direct effect on the execution time, and ii) fewer requests lead to a reduced load on the services in the federation. Thus, we compare the number of requests necessary when placing a bind join or a symmetric hash join and choose the operator that yields fewer requests. The request estimations depend on the implementation of the physical join operator, which we detail in the following section.
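The greedy strategy of Algorithm 2 (cardinality-sorted subexpressions, joined left-deep with a preference for shared variables) can be sketched as follows; physical operator selection is omitted and all names and the estimator are illustrative assumptions.

```python
def vars_of(se):
    # Variables of a subexpression; triple patterns are 3-tuples of strings
    # and variables start with '?' (assumed representation).
    return {t for tp in se for t in tp if t.startswith("?")}

def plan(decomp, est_card):
    """Greedy left-deep planning: sort by non-decreasing total cardinality
    (summed over sources, modelling the union) and prefer join partners
    sharing a variable with the current plan."""
    L = sorted(((se, s, sum(est_card(se, c) for c in s)) for se, s in decomp),
               key=lambda entry: entry[2])
    se, s, _ = L.pop(0)
    tree, tree_vars = ("access", se, tuple(sorted(s))), vars_of(se)
    while L:
        idx = next((i for i, (se_i, _, _) in enumerate(L)
                    if tree_vars & vars_of(se_i)), 0)  # joinable, else cheapest
        se, s, _ = L.pop(idx)
        tree = ("join", tree, ("access", se, tuple(sorted(s))))
        tree_vars |= vars_of(se)
    return tree

tp1, tp2 = ("?m", "ex:p1", "?x"), ("?x", "ex:p2", "?y")
decomp = [((tp1,), {"c1", "c2"}), ((tp2,), {"c1"})]
cards = {("c1", tp1): 100, ("c2", tp1): 50, ("c1", tp2): 10}
tree = plan(decomp, lambda se, c: cards[(c, se[0])])
print(tree)  # left-deep tree starting from the cheaper tp2 access plan
```

With these hypothetical estimates, tp2 (cardinality 10) seeds the plan and tp1 (cardinality 150, summed over c1 and c2) is joined next.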
Physical Operators.

The heterogeneity of LDF interfaces in a federation introduces challenges but also opens opportunities for implementing novel physical operators. Access operators that retrieve answers from LDF services need to be implemented in efficient ways, reducing the load on the LDF services and the time for obtaining results, to improve query execution time. For instance, TPF servers have a page size configuration that limits the number of answers returned for a requested triple pattern. Additionally, many public SPARQL endpoints are configured with fair use policies that can lead to empty or incomplete query results [25]. Consequently, implementations of access operators for SPARQL endpoints should not overload the endpoints and should adhere to their usage policies. Moreover, physical join operators can be designed to simultaneously handle different LDF interfaces and follow different join strategies depending on the capabilities of the underlying services. We call these kinds of operators polymorphic and present a novel Polymorphic Bind Join tailored to TPF, brTPF, and SPARQL interfaces.
Polymorphic Bind Join.
The Polymorphic Bind Join (PBJ) implements a Nested Loop Join algorithm that is able to adjust its join strategy to the LDF interface. It simultaneously executes a tuple- and a block-based nested loop join according to the supported interface language. Our current implementation supports the languages L_Tp, L_Tp+Values, and L_CoreSparql. By leveraging the capabilities of each service, the PBJ reduces the number of requests when accessing more capable sources using the block-based approach. In particular, the PBJ is designed for cases where the inner relation is either an access operator or the union of access operators. For each LDF interface f, a block size B_f is defined. During the execution, the operator keeps a reservoir per service that is filled with tuples from the outer relation. When the reservoir reaches the block size B_f of the corresponding LDF interface, the bindings from the reservoir are requested at the service. For example, when querying a TPF server in a nested loop join, each solution mapping of the outer relation is used to instantiate and resolve the triple pattern of the inner relation, hence B_Tpf = 1. However, as the interface languages of brTPF servers and SPARQL endpoints support SPARQL values expressions, the PBJ changes its operation accordingly by requesting a triple pattern or a subexpression with several bindings. The number of bindings that can be sent to a brTPF server, B_brTpf, depends on the server configuration [12]. For SPARQL endpoints, B_Ep is not limited; yet, too many values may lead to long runtimes at the endpoint and potentially incomplete results.

The proposed query planner selects a Symmetric Hash Join (SHJ) or a Polymorphic Bind Join (PBJ) in getPhysicalOperator (Line 20) depending on the estimated number of requests. The number of requests to execute the SHJ or PBJ depends on the sub-plans T1 and T2. If T_i is an AccessPlan, the number of requests to obtain the tuples of T_i is determined by its cardinality card_{T_i} and the interfaces over which T_i is evaluated. Otherwise, if T_i is a JoinPlan, no additional requests are necessary to obtain the tuples for T_i. For the first case, the requests R_acc(T) depend on the maximum number of tuples that can be obtained per request from the corresponding LDF service, which we denote as Max_Ep, Max_brTpf, and Max_Tpf:

R_acc(T) = Σ_{c ∈ S ∧ int(c)=Ep} ⌈card_T^c / Max_Ep⌉ + Σ_{c ∈ S ∧ int(c)=brTpf} ⌈card_T^c / Max_brTpf⌉ + Σ_{c ∈ S ∧ int(c)=Tpf} ⌈card_T^c / Max_Tpf⌉

As a result, we can compute the number of requests for the SHJ as the sum of the requests for the two sub-plans:

R_SHJ(T1, T2) = R_acc(T1) + R_acc(T2)

For the PBJ, we need to determine the number of requests to be performed on the inner relation, R_bind(T1, T2), which depends on the cardinality card_{T1} of the outer relation T1 and the block sizes for the services in the inner relation:

R_bind(T1, T2) = Σ_{c ∈ S ∧ int(c)=Ep} ⌈card_{T1} / B_Ep⌉ + Σ_{c ∈ S ∧ int(c)=brTpf} ⌈card_{T1} / B_brTpf⌉ + Σ_{c ∈ S ∧ int(c)=Tpf} ⌈card_{T1} / B_Tpf⌉

The overall number of requests for the PBJ is R_PBJ(T1, T2) = R_acc(T1) + R_bind(T1, T2).

Experimental Evaluation.

We evaluate a prototypical implementation of the interface-compliant query decomposer, the query planner, and the polymorphic bind join. The goal is to investigate the impact of these components on the performance when querying heterogeneous federations of LDF interfaces.
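The request-count model above, which drives getPhysicalOperator, can be sketched numerically. The Max and B constants below are illustrative assumptions, not the paper's configured values.

```python
import math

MAX_TUPLES = {"sparql": 10000, "brtpf": 100, "tpf": 100}  # tuples per request (assumed)
BLOCK_SIZE = {"sparql": 50, "brtpf": 30, "tpf": 1}        # bindings per bind request (assumed)

def r_acc(sources):
    """Requests to materialize an access plan T; `sources` maps each service
    to (interface, estimated cardinality of T at that service)."""
    return sum(math.ceil(card / MAX_TUPLES[i]) for i, card in sources.values())

def r_shj(t1_sources, t2_sources):
    # Symmetric hash join: both sub-plans are fully materialized.
    return r_acc(t1_sources) + r_acc(t2_sources)

def r_pbj(t1_sources, t1_card, t2_interfaces):
    """Polymorphic bind join: the outer relation T1 is materialized and its
    card(T1) tuples are pushed to the inner services in per-interface blocks."""
    bind = sum(math.ceil(t1_card / BLOCK_SIZE[i]) for i in t2_interfaces)
    return r_acc(t1_sources) + bind

t1 = {"c1": ("sparql", 400)}           # hypothetical cardinality estimates
t2 = {"c2": ("tpf", 50000)}
shj = r_shj(t1, t2)                    # 1 + 500 = 501 requests
pbj = r_pbj(t1, 400, ["tpf"])          # 1 + 400 = 401 requests
print("PBJ" if pbj < shj else "SHJ")   # -> PBJ
```

With these numbers, binding the 400 outer tuples is cheaper than paging through the large inner fragment, so the planner would place a PBJ; with a brTPF or endpoint inner service, the block sizes shrink the bind term even further.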
Datasets and Queries.
We use the well-known FedBench benchmark [22], which comprises 9 datasets and is tailored to assess the performance of federated SPARQL querying strategies. We use a total of 25 queries from Cross Domain (CD1-7), Life Science (LS1-7), and Linked Data (LD1-11) in our evaluation. In our implementation, we set the block sizes B_brTpf and B_Ep such that the number of requests is reduced while not overloading the endpoint. Further, we set Max_brTpf and Max_Tpf according to the respective server implementations [12, 27], and Max_Ep to the most common value reported at https://sparqles.ai.wu.ac.at/.
Figure 4: Average runtimes [s] (log-scale) and total number of requests (log-scale) for each query and both federations: (a) Fed-I, (b) Fed-II.

Table 1: Heterogeneous federations Fed-I and Fed-II with the LDF interface of each dataset.

         DBpedia  NYTimes  LinkedMDB  Jamendo  GeoNames  SWDF    KEGG   Drugbank  ChEBI
Fed-I    Sparql   brTpf    brTpf      Tpf      Sparql    Tpf     brTpf  Tpf       Sparql
Fed-II   Tpf      brTpf    brTpf      Sparql   Tpf       Sparql  brTpf  Sparql    Tpf
Federations.
We evaluate our approach on the two heterogeneous federations Fed-I and Fed-II shown in Table 1 to study the performance in different scenarios. The central difference between the federations is that in Fed-I the three largest datasets are accessible via SPARQL endpoints, while in Fed-II they are accessible via TPF servers. The other datasets are accessible via TPF or brTPF servers.
Implementation.
We implemented a prototypical federated query engine for heterogeneous federations that implements the proposed query planner, decomposer, source pruning (PS), and polymorphic bind join (PBJ) operator. Our implementation is based on CROP [?] and written in Python 2.7.13. As the Baseline, we execute the query plans obtained from our query planner for the atomic decompositions; the decomposer, source pruning, and PBJ are disabled in the Baseline. Note that, while Comunica [26] can query heterogeneous interfaces, its performance is currently not competitive, as it does not implement query decomposition, source pruning, or polymorphic join operators. Therefore, we do not consider Comunica in our evaluation. We use Server.js v2.2.3 (https://github.com/LinkedDataFragments/Server.js) and the original Java brTPF server implementation [12] to deploy the TPF and brTPF servers with HDT [?] backends. We used Virtuoso v07.20.3229 with the default virtuoso.ini (cf. supplemental material). All LDF services and the client were executed on a single Debian Jessie server (2x 16-core Intel(R) Xeon(R) E5-2670 2.60GHz CPU; 256GB RAM) to avoid network latency. The timeout was set to 900 seconds. After a warm-up run, the queries were executed five times. The source code, experimental results, and additional material are provided in the supplemental material of this submission.

Table 2: Average total runtime Σr, number of requests Σreq., and answers Σans. per run, as well as the mean decomposition completeness comp and decomposition cost, for Baseline, Decomposer, Decomposer+PS, and Decomposer+PS+PBJ over Fed-I and Fed-II.
Metrics.
We evaluated the performance using the following metrics: (i) Runtime: elapsed time spent by the engine evaluating a query. (ii) Number of Requests: total number of requests submitted to the LDF services during query execution. (iii) Number of Answers: total number of answers produced. (iv) Diefficiency: continuous efficiency as the answers are produced over time [5].
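Diefficiency can be computed as the area under the answer-trace curve (cumulative answers over time). The sketch below assumes this formulation of dief@t; see [5] for the exact definition.

```python
def dief_at_t(trace, t):
    """dief@t: area under the cumulative-answers step function up to time t;
    higher values mean answers arrive earlier. `trace` is a sorted list of
    answer timestamps (seconds)."""
    pts = [ts for ts in trace if ts <= t]
    area = 0.0
    for i, ts in enumerate(pts):
        # i + 1 answers have been produced from ts until the next event
        nxt = pts[i + 1] if i + 1 < len(pts) else t
        area += (i + 1) * (nxt - ts)
    return area

print(dief_at_t([1, 2, 3, 4], 5))  # -> 10.0
```

For the same number of answers and the same runtime, an engine that produces answers earlier (e.g., a non-blocking join) yields a larger area and thus a higher dief@t.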
We start by providing an overview of the performance of the different components. Fig. 4a and Fig. 4b show the mean runtimes and number of requests per query for Fed-I and Fed-II; the values are also summarized in Table 2. Considering the impact of the individual components, the results show that enabling the decomposer without source pruning and without the PBJ (Decomposer) provides only a slight improvement in runtime over the Baseline, even though all queries yield the same number of requests or fewer. This is because, without source pruning, only exclusive groups can be merged by the decomposer. The results when adding the source pruning approach (Decomposer+PS) show that pruning sources considerably reduces both the runtime and the number of requests for the majority of queries. The reasons for the improvement are two-fold: i) the decomposer can create more and larger subexpressions, and ii) fewer services are contacted during the execution of the query plan. Finally, with the polymorphic join operator (Decomposer+PS+PBJ), we observe the lowest overall runtimes and numbers of requests in both federations. In Fed-I, executing all queries with Decomposer+PS+PBJ is more than 34 times faster than the Baseline, and 6 times faster in Fed-II. The results show that our interface-aware federated querying approaches, which adjust to the specifics of heterogeneous interfaces, can greatly improve performance in terms of runtime. Simultaneously, they reduce the load on the servers by requiring fewer requests. The results also show that the interfaces present in the federation (Fed-I vs. Fed-II) substantially impact the querying performance when the interfaces' capabilities are not considered (Baseline). Yet, our interface-aware solution (Decomposer+PS+PBJ) achieves similar performance results regardless of the interfaces.
Query Decomposition.
The results show the effectiveness of the proposed 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 measure as a proxy for completeness and of the 𝑐𝑜𝑠𝑡 measure as a means to assess the expected execution cost of query decompositions. In Table 2, we can observe that, in both federations, the decomposer without source pruning yields complete answers with 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 = 1.0, since only exclusive groups are merged (Rule III); the cost can only be slightly reduced. However, adding the source pruning (Decomposer+PS) enables decompositions with about half the cost. Contacting fewer services reduces the cost but also leads to a reduction in the expected completeness (𝑑𝑒𝑛𝑠𝑖𝑡𝑦 = 0.77) and to fewer answers (∑𝑎𝑛𝑠) being produced; still, 97% of all answers are produced when sources are pruned. These results show that the improvement achieved by the decomposer, in its ability to leverage the interfaces' capabilities, depends on the source pruning.
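To make the normalized cost concrete, the following is an illustrative sketch only: the paper defines 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 and 𝑐𝑜𝑠𝑡 formally earlier, whereas here we assume a toy cost of one request per (subexpression, service) pair and normalize by the cost of a baseline decomposition D*(P, F) that evaluates every triple pattern at every matching service. All names and the example federation are hypothetical.

```python
def cost(decomposition):
    """Toy cost: total number of (subexpression, service) requests."""
    return sum(len(services) for _, services in decomposition)

def normalized_cost(decomposition, baseline):
    """Cost relative to the baseline, cf. cost(D(P,F)) / cost(D*(P,F))."""
    return cost(decomposition) / cost(baseline)

# Baseline: three triple patterns, each sent to both services -> cost 6.
baseline = [("tp1", ["endpoint", "tpf"]),
            ("tp2", ["endpoint", "tpf"]),
            ("tp3", ["endpoint", "tpf"])]

# Merging tp1 and tp2 into an exclusive group on the endpoint and
# pruning the TPF server for tp3 -> cost 2, a normalized cost of 1/3.
merged = [("tp1 . tp2", ["endpoint"]), ("tp3", ["endpoint"])]
assert normalized_cost(merged, baseline) == 2 / 6
```

The sketch mirrors the trade-off reported above: merging subexpressions and pruning sources cuts the number of requests, but pruning a source that contributes answers would also lower completeness.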
Polymorphic Bind Join.
The results in Fig. 4 and Table 2 reveal that, in both federations, adding the Polymorphic Bind Join (Decomposer+PS+PBJ) reduces the number of requests by more than 25% and, as a consequence, reduces the overall runtimes. We investigate the diefficiency to better understand the impact of the PBJ. In Fig. 5, we show the diefficiency plots for four example queries. The plot for LS3 in Fig. 5a shows that the performance of the PBJ is similar to a regular NLJ when it cannot leverage the capabilities of the interfaces, as for LS3, where only TPF servers are contacted. However, if the capabilities of the services can be leveraged, the PBJ allows for producing the results at a higher rate, as shown for queries LD3 (Fig. 5b) and LS6 (Fig. 5c). In the latter, all answers are produced at once. Nonetheless, the semi-blocking nature of the PBJ can also have a detrimental effect on diefficiency and runtime, as observed for query LS8 in Fig. 5d. Here, the production of the answers is delayed because the block size of the inner relation (which consumes data from a brTPF server) is not reached until all results of the outer relation in the PBJ are produced. Future work could study an adaptive PBJ with variable block sizes, determined according to the expected number of tuples of the outer relation.

Summarizing our experimental evaluation, the results show the effectiveness of our interface-aware techniques for query decomposition, planning, and physical operators. Furthermore, the results illustrate that our techniques can cope with heterogeneous federations that are composed of different combinations of interfaces. (Cost values in Table 2 are normalized: 𝑐𝑜𝑠𝑡(𝐷(𝑃,𝐹))/𝑐𝑜𝑠𝑡(𝐷*(𝑃,𝐹)). In Fed-I, the Baseline does not yield all answers due to a timeout in LD7.)

Limitations.
The central assumption of our framework is access to high-level information about the federation (e.g., interfaces and relevant sources), while fine-grained statistics (e.g., data distributions) are not available. Therefore, the proposed framework components are limited to devising approximate solutions to the problems of query decomposition, planning, and execution. While our experimental results show substantial improvements on the FedBench benchmark, these improvements might not hold in other federations. Yet, our framework is a foundation for federated query processing in heterogeneous federations, and its components can be refined when additional statistics are available, for instance, by weighting edges in the decomposition graph according to the probability of sources contributing to the answers of a query.
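The block-wise request behavior behind the PBJ's semi-blocking nature, discussed above for query LS8, can be sketched as follows. This is a minimal illustration of bind joining against a brTPF-style interface; the function names and the `request` callback are our own illustrative assumptions, not the engine's actual API.

```python
def bind_join_blocks(outer_bindings, request, block_size):
    """Yield join results, sending one request per block of outer bindings."""
    block = []
    for binding in outer_bindings:
        block.append(binding)
        if len(block) == block_size:  # block full: ship it in one request
            yield from request(block)
            block = []
    if block:                         # ship the trailing partial block
        yield from request(block)

# Toy 'server': joins each binding {'x': v} with an attribute y = v * 10.
def fake_request(block):
    return [{**b, "y": b["x"] * 10} for b in block]

results = list(bind_join_blocks([{"x": 1}, {"x": 2}, {"x": 3}],
                                fake_request, block_size=2))
# Two requests are sent: one for the full block [x=1, x=2] and one
# for the trailing partial block [x=3].
assert results == [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
```

A larger `block_size` means fewer requests, but no answer from a block can appear before the block fills or the outer relation is exhausted; this is exactly the delay observed for LS8, and an adaptive variant would choose `block_size` from the expected outer cardinality.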
RELATED WORK
Query processing over homogeneous federations of SPARQL endpoints has been broadly studied, and existing approaches address different challenges. For instance, [1, 24] leverage requests during runtime to obtain efficient query plans, while [8, 11, 17, 19, 20] implement cost models that rely on pre-computed statistics, and [4] focuses on runtime adaptivity. Furthermore, approaches that specifically study query decomposition have been proposed. Vidal et al. [28] formalize the query decomposition problem such that it can be mapped to the vertex coloring problem and propose the heuristic
Fed-DSATUR to solve it. Similarly, Endris et al. [10] formalize the query decomposition problem for federated SPARQL querying and present a decomposition approach that relies on RDF Molecule Templates, which represent metadata obtained by executing SPARQL queries over endpoints. Different from our work, these approaches assume all members in the federation to be SPARQL endpoints, and thus, the proposed solutions rely on their querying capabilities.

Additional Linked Data Fragment (LDF) interfaces and corresponding SPARQL clients have been proposed. They range from less expressive interfaces, such as Triple Pattern Fragments [27] and their bindings-restricted variant [12], to more expressive interfaces such as SaGe [15] or smart-KG [7]. To study the expressiveness of LDF interfaces, Hartig et al. [13] propose Linked Data Fragment Machines as a formal framework that includes client demand, server demand, and communication cost when executing queries over these interfaces. Similar to their work, we also formalize the concept of a server language to distinguish the capabilities of different interfaces in the federation. Yet, our work goes beyond individual interfaces and studies the problem of heterogeneous LDF federations.

Lastly, a few approaches have addressed the problem of heterogeneous interfaces. Comunica [26] is a client able to query heterogeneous LDF federations. In contrast to our work, however, Comunica does not support interface-aware query decomposition and handles the query execution on a triple pattern level, even if different interfaces are present. Moreover, its physical join operators, such as the nested loop join, do not adapt to the different interfaces. In a recent paper, Cheng and Hartig [9] study query plans in heterogeneous federations. Similar to our work, they conceptualize different interfaces and federation members implementing those interfaces. They focus on a formal language for logical query plans over such federations but, in contrast to our work, they do not propose specific solutions to devise such plans and derive physical plans to be evaluated by an engine. Montoya et al. [16] propose a client to query replicas of datasets via heterogeneous interfaces (brTPF servers and SPARQL endpoints) to exploit their characteristics. Different from our work, they focus on different interfaces for single datasets but do not investigate federated querying.

Figure 5: Example diefficiency plots for the approach with the PBJ (green) and without the PBJ (dotted). (a) Fed-I: LS3, (b) Fed-I: LD3, (c) Fed-II: LS6, (d) Fed-II: LS8.
CONCLUSION
We formalize the concept of federations of Linked Data Fragment services and present the challenges that querying approaches over heterogeneous federations face. In particular, we present a theoretical framework and practical solutions for query decomposition, query planning, and physical operators tailored to heterogeneous LDF federations. In our experimental study, we evaluated a prototypical implementation of our proposed solutions. The results show a substantial improvement in performance achieved by devising interface-aware strategies to exploit the capabilities of TPF, brTPF, and SPARQL endpoints during federated query processing. Future work may focus on extending the proposed framework to other LDF interfaces and studying how state-of-the-art query decomposition, planning, and source pruning approaches from federated SPARQL engines can be applied to heterogeneous federations.
REFERENCES
[1] Ibrahim Abdelaziz, Essam Mansour, Mourad Ouzzani, Ashraf Aboulnaga, and Panos Kalnis. 2017. Lusail: A System for Querying Linked Data at Scale. Proc. VLDB Endow. 11, 4 (2017), 485–498. https://doi.org/10.1145/3186728.3164144
[2] Maribel Acosta, Olaf Hartig, and Juan F. Sequeda. 2019. Federated RDF Query Processing. In Encyclopedia of Big Data Technologies. https://doi.org/10.1007/978-3-319-63962-8_228-1
[3] Maribel Acosta and Maria-Esther Vidal. 2015. Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I. 111–127. https://doi.org/10.1007/978-3-319-25007-6_7
[4] Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo, and Edna Ruckhaus. 2011. ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints. In The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I. 18–34. https://doi.org/10.1007/978-3-642-25073-6_2
[5] Maribel Acosta, Maria-Esther Vidal, and York Sure-Vetter. 2017. Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II. 3–19. https://doi.org/10.1007/978-3-319-68204-4_1
[6] Carlos Buil Aranda, Marcelo Arenas, and Óscar Corcho. 2011. Semantics and Optimization of the SPARQL 1.1 Federation Extension. In The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29 - June 2, 2011, Proceedings, Part II. 1–15. https://doi.org/10.1007/978-3-642-21064-8_1
[7] Amr Azzam, Javier D. Fernández, Maribel Acosta, Martin Beno, and Axel Polleres. 2020. SMART-KG: Hybrid Shipping for SPARQL Querying on the Web. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. 984–994. https://doi.org/10.1145/3366423.3380177
[8] Angelos Charalambidis, Antonis Troumpoukis, and Stasinos Konstantopoulos. 2015. SemaGrow: optimizing federated SPARQL queries. In Proceedings of the 11th International Conference on Semantic Systems, SEMANTICS 2015, Vienna, Austria, September 15-17, 2015. 121–128. https://doi.org/10.1145/2814864.2814886
[9] Sijin Cheng and Olaf Hartig. 2020. FedQPL: A Language for Logical Query Plans over Heterogeneous Federations of RDF Data Sources (Extended Version). arXiv:2010.01190 [cs.DB]
[10] Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, and Sören Auer. 2017. MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates. In Database and Expert Systems Applications - 28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part I. 3–18. https://doi.org/10.1007/978-3-319-64468-4_1
[11] Olaf Görlitz and Steffen Staab. 2011. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In Proceedings of the Second International Workshop on Consuming Linked Data (COLD2011), Bonn, Germany, October 23, 2011. http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf
[12] Olaf Hartig and Carlos Buil Aranda. 2016. Bindings-Restricted Triple Pattern Fragments. In On the Move to Meaningful Internet Systems: OTM 2016 Conferences - Confederated International Conferences: CoopIS, C&TC, and ODBASE 2016, Rhodes, Greece, October 24-28, 2016, Proceedings. 762–779. https://doi.org/10.1007/978-3-319-48472-3_48
[13] Olaf Hartig, Ian Letter, and Jorge Pérez. 2017. A Formal Framework for Comparing Linked Data Fragments. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part I. 364–382. https://doi.org/10.1007/978-3-319-68288-4_22
[14] Lars Heling and Maribel Acosta. 2020. Estimating Characteristic Sets for RDF Dataset Profiles Based on Sampling. In The Semantic Web - 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31-June 4, 2020, Proceedings. 157–175. https://doi.org/10.1007/978-3-030-49461-2_10
[15] Thomas Minier, Hala Skaf-Molli, and Pascal Molli. 2019. SaGe: Web Preemption for Public SPARQL Query Services. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019. 1268–1278. https://doi.org/10.1145/3308558.3313652
[16] Gabriela Montoya, Christian Aebeloe, and Katja Hose. 2018. Towards Efficient Query Processing over Heterogeneous RDF Interfaces. In Emerging Topics in Semantic Technologies - ISWC 2018 Satellite Events. 39–53. https://doi.org/10.3233/978-1-61499-894-5-39
[17] Gabriela Montoya, Hala Skaf-Molli, and Katja Hose. 2017. The Odyssey Approach for Optimizing Federated SPARQL Queries. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part I. 471–489. https://doi.org/10.1007/978-3-319-68288-4_28
[18] Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. 2009. Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34, 3 (2009), 16:1–16:45. https://doi.org/10.1145/1567274.1567278
[19] Bastian Quilitz and Ulf Leser. 2008. Querying Distributed RDF Data Sources with SPARQL. In The Semantic Web: Research and Applications, 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008, Proceedings. 524–538. https://doi.org/10.1007/978-3-540-68234-9_39
[20] Muhammad Saleem and Axel-Cyrille Ngonga Ngomo. 2014. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. In The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014, Proceedings. 176–191. https://doi.org/10.1007/978-3-319-07443-6_13
[21] Muhammad Saleem, Alexander Potocki, Tommaso Soru, Olaf Hartig, and Axel-Cyrille Ngonga Ngomo. 2018. CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation. In Proceedings of the 14th International Conference on Semantic Systems, SEMANTICS 2018, Vienna, Austria, September 10-13, 2018. 163–174. https://doi.org/10.1016/j.procs.2018.09.016
[22] Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I. 585–600. https://doi.org/10.1007/978-3-642-25073-6_37
[23] Michael Schmidt, Michael Meier, and Georg Lausen. 2010. Foundations of SPARQL query optimization. In Database Theory - ICDT 2010, 13th International Conference, Lausanne, Switzerland, March 23-25, 2010, Proceedings. 4–33. https://doi.org/10.1145/1804669.1804675
[24] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. 2011. FedX: Optimization Techniques for Federated Query Processing on Linked Data. In The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I. 601–616. https://doi.org/10.1007/978-3-642-25073-6_38
[25] Arnaud Soulet and Fabian M. Suchanek. 2019. Anytime Large-Scale Analytics of Linked Open Data. In The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part I. 576–592. https://doi.org/10.1007/978-3-030-30793-6_33
[26] Ruben Taelman, Joachim Van Herwegen, Miel Vander Sande, and Ruben Verborgh. 2018. Comunica: A Modular SPARQL Query Engine for the Web. In The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part II. 239–255. https://doi.org/10.1007/978-3-030-00668-6_15
[27] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert. 2016. Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Semant.