Trav-SHACL: Efficiently Validating Networks of SHACL Constraints
Mónica Figuera, Philipp D. Rohde, and Maria-Esther Vidal

L3S Research Center, Leibniz University of Hannover, Germany; University of Bonn, Germany; TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
[email protected], {philipp.rohde,maria.vidal}@tib.eu

Abstract
Knowledge graphs have emerged as expressive data structures for Web data. The potential of knowledge graphs and the demand for ecosystems to facilitate their creation, curation, and understanding is testified in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assessments of knowledge graphs, SHACL is rapidly gaining attention in real-world scenarios. SHACL models integrity constraints as a network of shapes, where a shape contains the constraints to be fulfilled by the same entities. The validation of a SHACL shape schema can face the issue of tractability. To facilitate full adoption, efficient computational methods are required. We present Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shape schema in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL reorders the shapes in a shape schema for efficient validation and rewrites target and constraint queries for the fast detection of invalid entities. Trav-SHACL is empirically evaluated on 27 testbeds executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL steadily exhibits high performance and reduces validation time by a factor of up to 28.93 compared to the state of the art.
Keywords
SHACL Validation, Quality Assessment, Knowledge Graph Constraints
Web data keeps enduring an exponential growth rate, and knowledge graphs [12] have emerged as expressive data structures that provide a unified view of myriad data sources. Knowledge graphs make possible a holistic description of real-world entities as structured and factual statements. Scientific and industrial communities [3, 18] are considering them as fundamental building blocks in a new knowledge-driven era of science and technology. The adoption of knowledge graphs in large IT companies [18], industrial data spaces (e.g., the International Data Space (IDS)), and domain-specific applications like in biomedicine [10] testifies not only their potential but also the demand for ecosystems of tools to facilitate their creation, curation, and understanding. In this direction, the W3C has actively contributed with standards for declaratively representing knowledge graphs and the whole process of creation and curation. Nevertheless, the full adoption of W3C standards demands that the Web community develop efficient tools to scale up and resist the forecast avalanche of Web data.

The Shapes Constraint Language (SHACL) is the W3C recommendation language for the declarative specification of data quality assessment over RDF knowledge graphs. SHACL is rapidly gaining attention in real-world scenarios, and has been adopted in industrial consortiums (e.g., the International Data Space (IDS)) to represent integrity constraints in reference architectures. SHACL models integrity constraints as a network of shapes (a.k.a. shape schema).

Figure 1: University System Example. (a) A simple SHACL shape schema for professors, departments, courses, and universities. (b) A random traversal of the network leading to a high validation time. (c) A traversal strategy following links to connected shapes; slight improvement in validation time. (d) A sophisticated traversal strategy that exploits knowledge gained from previous shapes, leading to a reduction in validation time by a factor of 15.96 compared to a random traversal.
A shape specifies constraints against the attributes of a specific RDF class, while integrity requirements on properties associating two classes are expressed with links between shapes. Albeit exhibiting clarity and readability, SHACL shape schemas can face tractability issues during validation. The problem is in general intractable [7], and the current algorithms do not scale well when the size of the shape schema or the knowledge graph grows. Thus, despite the encouraging evidence about the acceptance of SHACL and other W3C standards, efficient computational tools are required to facilitate full adoption.
Problem Statement and Objectives.
We address the problem of scaling up the validation of SHACL shape schemas against large knowledge graphs. Although there are significant contributions (e.g., the identification of tractable SHACL fragments [7] and algorithms to effectively validate these fragments [9]), knowledge graph size imposes strict data management requirements to launch these SHACL engines into real-world scenarios. We formalize the SHACL validation process as an optimization problem. Given a shape schema, an equivalent one that minimizes the traversal and validation time corresponds to a solution to the problem.
Our Proposed Solution. We present Trav-SHACL, a SHACL engine capable of planning a shape schema traversal and executing the shapes in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL implements a two-fold approach. First, it resorts to graph-based metrics – describing a shape network's connectivity – to identify a seed shape and the traversal validation strategy. Building on related work, Trav-SHACL implements an algorithm that interleaves data collection from the knowledge graph with constraint validation. These two steps are named inter- and intra-shape validation, respectively. Trav-SHACL also performs query rewriting techniques that exploit knowledge about the entities (in)validated so far and early identifies newly invalidated entities in these entities' neighborhoods. We have empirically evaluated Trav-SHACL in 27 testbeds built from knowledge graphs generated using the Lehigh University Benchmark (LUBM) [11]. The study comprises shape schemas and knowledge graphs of various sizes and percentages of invalid entities to ensure the results' reproducibility. Furthermore, we measure the performance of Trav-SHACL in terms of execution time and during an elapsed time period – or diefficiency. The results indicate savings by a factor of up to 28.93 with respect to SHACL2SPARQL [9, 8], the state-of-the-art SHACL engine. Moreover, the query rewriting techniques and interleaved execution of Trav-SHACL allow for an effective forecast of invalid entities and produce the first result ahead of other engines. More importantly, Trav-SHACL exhibits high-performance continuous behavior and keeps generating results incrementally.
Contributions. i) Trav-SHACL, a SHACL engine that resorts to query rewriting and interleaved execution strategies to early identify invalid entities and avoid unnecessary constraint validations. ii) A set of testbeds that include various parameters that impact the SHACL shape schema validation efficiency. iii) An empirical evaluation of the performance of nine configurations of Trav-SHACL and two different implementations of SHACL2SPARQL. We run these 11 SHACL engine configurations over 164 constraints validated against nine knowledge graphs whose sizes range from 1M to 34M triples. Results indicate reductions of execution time by a factor of up to 28.93, while the continuous behavior is considerably improved. Trav-SHACL is available as open source, and it will be publicly published together with the experimental configuration to ensure reproducibility.

The remainder of this paper is organized as follows. Section 2 motivates our work with an example. Then Section 3 discusses the planning and execution techniques implemented in Trav-SHACL. Section 4 empirically evaluates the approach. Afterwards, Section 5 places our work within the related work. Finally, we close the paper in Section 6 with conclusions and an overview of future work.
Consider a set of SHACL shapes representing constraints over a knowledge graph of a university system, as presented in Fig. 1a. In SHACL terms, a shape is a set of integrity constraints that apply to the same entities, e.g., the shape called Professor in the example. Professors have precisely one name, at least one e-mail address, at least one doctoral degree from a University, i.e., an instance of that shape that meets all constraints, and they work for at least one Department. We call constraints that refer to other shapes inter-shape constraints. Analogously, we call constraints that do not refer to other shapes intra-shape constraints. The constraints from the example can be represented with SHACL's min and max constraints, which restrict the minimal and maximal occurrence of a particular pattern, respectively. The constraint on the names of professors is represented as a min and a max constraint, both with the value 1. The existential constraints can be transformed into min constraints with a minimum of 1. This example comprises nine min constraints and five max constraints. The nine min constraints are distributed as follows: one for the shape University, five for Professor, two for Department, and one for the shape representing Course. Universities are restricted by one max constraint, professors by one, departments by two, and courses by one. Shapes can have a target definition, i.e., the shape applies to all entities of a particular class in the RDF data set that is to be validated. A shape is valid over the data set if and only if all entities in the data set that satisfy the target definition of the shape also fulfill all constraints of the shape.

The strategy followed to traverse the SHACL shapes impacts the validation time. To illustrate this issue, consider the strategy presented in Fig. 1b. In a random traversal order, all the data needs to be loaded. In the presence of inter-shape constraints, a particular shape, e.g., Professor, can only be validated after visiting the shape that is referred to. Based on the traversal order, this might be a previous shape, the next shape, or even a shape that is scheduled after several others. To reduce the time spent waiting until all the required information is available, the strategy depicted in Fig. 1c follows the links to connected shapes. This assessment strategy allows for a minor improvement in validation time since some of the needed information might already be available at a later stage. In this example, the validation of the shape Professor can make use of the validation of the universities and departments. Following the idea of reusing existing knowledge, if the algorithm keeps track of the valid entities and the violations, this knowledge can be used in the next steps to speed up the validation by identifying invalid entities fast. The traversal strategy in Fig. 1d follows this approach by starting with shapes that do not depend on other shapes but on which other shapes depend, i.e., the universities in the example. Moreover, the number of constraints to be checked can be dramatically decreased whenever knowledge concerning the validated shapes is exploited to invalidate entities in cascade. Hence, the quality assessment time of the set of SHACL shapes can be reduced significantly. As shown in the example, the validation time is sped up by a factor of 15.96 compared to the random traversal.
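To make the mapping from cardinality constraints to SPARQL concrete, the following Python sketch builds a query that retrieves the entities violating a min/max constraint. The query template and the IRIs in the usage example are our own simplifications for illustration; they are not taken from Trav-SHACL or the shape schema in Fig. 1a.

```python
def cardinality_violation_query(target_class, path, min_count=None, max_count=None):
    """Build a SPARQL query retrieving entities of `target_class` that
    violate a min/max cardinality constraint on the property `path`.
    The OPTIONAL pattern ensures that entities with zero occurrences of
    the property are also reported for min violations."""
    conditions = []
    if min_count is not None:
        conditions.append(f"COUNT(?v) < {min_count}")
    if max_count is not None:
        conditions.append(f"COUNT(?v) > {max_count}")
    having = " || ".join(conditions)
    return (
        f"SELECT ?x WHERE {{\n"
        f"  ?x a <{target_class}> .\n"
        f"  OPTIONAL {{ ?x <{path}> ?v }}\n"
        f"}}\n"
        f"GROUP BY ?x\n"
        f"HAVING ({having})"
    )

# Professors must have exactly one name: min = 1 and max = 1
# (the IRIs below are hypothetical placeholders).
q = cardinality_violation_query("http://example.org/Professor",
                                "http://example.org/name",
                                min_count=1, max_count=1)
```

Running the query against an endpoint would return exactly the professors that violate the name constraint, i.e., the invalid entities of the shape with respect to this constraint.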
Table 1: Summary of Trav-SHACL Notation

S = ⟨S, targ, def⟩ : Shape schema. S is a set of shape names, targ assigns a SPARQL query to a shape s in S, and def maps s to a constraint φ.
G = ⟨V_G, E_G⟩ : RDF graph modeling subject, predicate, and object statements. V_G is a set of subjects and objects, and labelled edges in E_G represent the RDF triples of G.
γ(def(s)) : SPARQL query representing the evaluation of def(s).
[[Q]]_G : Set of mappings from variables in the SPARQL query Q to entities in G representing the evaluation of Q over G.
Φ_S : Dependency graph of the shapes in S.
σ(v, s) : Boolean assignment from V_G and S. σ(v, s) = T represents that an entity v belongs to the target of s, i.e., there is a variable mapping that includes v in [[targ(s)]]_G; F, otherwise.
[φ]_{G,v,σ} : Boolean function denoting whether a constraint φ is satisfied by an entity v from G according to an assignment σ.
[S]_G : Entities in V_G such that there exists an assignment σ and, for each v in [S]_G, there is a shape s in S such that [def(s)]_{G,v,σ} is true.
Time(S, G) : Cost function representing the time required to evaluate S over G.
σ^{S,G}_minFix(v, s) : Boolean assignment from V_G and S, s.t. σ^{S,G}_minFix(v, s) ≡ [def(s)]_{G,v,σ^{S,G}_minFix}.
Γ_{S,G} : Union of the entities from G in the result of the evaluation of the SPARQL queries of targ(.) and γ(def(.)) in S.
Figure 2: The Trav-SHACL Architecture. Trav-SHACL receives a shape schema S = ⟨S, targ, def⟩ and an RDF graph G = ⟨V_G, E_G⟩, and outputs [S]_G, the entities in V_G that satisfy shapes in S. The inter-shape planner resorts to graph metrics computed on the dependency graph Φ_S; it orders the shapes in S in a way that invalid entities are identified sooner. The intra-shape planner and execution optimizes the queries targ(s) and γ(def(s)) at the time S is traversed. So-far (in)validated entities are considered to filter out entities linked to these entities; query rewriting decisions (e.g., pushing filters, partitioning of non-selective queries, and query reordering) are made based on invalid entities' cardinalities and query selectivity. Rewritten queries are executed against SPARQL endpoints. Query answers [[targ(s)]]_G and [[γ(def(s))]]_G, and truth value assignments σ^{S,G}_minFix, are exchanged during query rewriting and interleaved execution. They are utilized – in a bottom-up fashion – for constraint rule grounding and saturation. The intra-shape planner and execution component runs until a fixed-point in σ^{S,G}_minFix is reached.

Trav-SHACL is a data quality assessment engine that resorts to constraints expressed as a shape schema S = ⟨S, targ, def⟩ to validate the quality of an RDF graph G = ⟨V_G, E_G⟩. The evaluation of S results in a set of entities in V_G (a.k.a. [S]_G) that satisfy the constraints in S. Trav-SHACL converts a shape schema S into an equivalent S′ = ⟨S, targ′, def⟩ whose validation identifies the same entities but in less time (a.k.a. Time(S, G)). We formally define this optimization problem and present a heuristic-based approach that identifies low-cost strategies for solving this validation problem. Table 1 summarizes the notation utilized in this section.

The Resource Description Framework (RDF) is the W3C standard for publishing and exchanging data over the Web.
It is commonly utilized to represent knowledge graphs as a set of triples that consist of three parts: (i) subject – an entity or resource, (ii) predicate – a relation between subject and object, and (iii) object – an entity or resource, just like a sentence. Subjects and predicates are always Universal Resource Identifiers (URIs). In addition, objects can also be represented as a literal instead of a URI to use data formats like strings, integers, or dates. In an RDF graph G = ⟨V_G, E_G⟩, nodes correspond to subjects and objects, while predicates are the labels of directed edges from subjects to objects.

The Shapes Constraint Language (SHACL) is the W3C recommendation language for representing integrity constraints over an RDF graph; we follow the abstract syntax and semantics defined by Corman et al. [7, 9]. A shape schema, defined as S = ⟨S, targ, def⟩, represents the set S of shape names and two functions targ and def that map a shape to a target query and to a constraint, respectively. A target query states the RDF class – in an RDF graph G – of the entities for which the corresponding shape will be validated. We assume target queries are expressed in SPARQL. The result of the evaluation of a target query Q over G (a.k.a. [[Q]]_G) corresponds to a set of mappings from the variables in Q to the entities in G that satisfy the graph patterns in Q [20]. In case Q only includes the variable ?x in the select clause, [[Q]]_G = {{(?x, v_1)}, ..., {(?x, v_m)}}, where v_1, ..., v_m are entities in V_G. A dependency graph Φ_S of a shape schema S is a directed graph where shapes in S are represented as nodes, and an edge (s_i, s_j) indicates that s_j appears in the constraint of s_i. An assignment σ is a function that assigns a Boolean value to the entities v in V_G and the shapes in S. The interpretation of a constraint φ of a shape s, in an entity v in V_G according to an assignment σ (a.k.a. [φ]_{G,v,σ}), is a Boolean function that indicates whether v satisfies φ given σ; [φ]_{G,v,σ} is inductively defined on the structure of φ, where the base case corresponds to the value of σ(v, s). The entities in V_G satisfy a shape schema S (a.k.a. [S]_G) iff there is an assignment σ such that, for each v in [S]_G, there is a shape s in S and [def(s)]_{G,v,σ} is true. In general, the problem of determining whether the entities of an RDF graph G satisfy a shape schema S in a given assignment σ is NP-complete [7]. Nevertheless, Corman et al. [9] have identified three fragments of SHACL that are tractable: L_non-rec only enables non-recursive shapes, L_s does not allow negations through recursive shapes, and L_+∨ does not allow negations but allows disjunction. More importantly, Corman et al. [9] propose a computational method that performs the inference process required to construct the set of entities in an RDF graph G that satisfies a shape schema S. This computational method is grounded in results from deductive databases [5] to compute the minimal model of the constraints of the shapes in S for the entities in G that correspond to the instantiations of the target queries of these shapes. This minimal model is defined in terms of the fixed-point assignment σ^{S,G}_minFix, which assigns to an entity v in a shape s the same truth value as the value of satisfaction of v in the constraint of s according to σ^{S,G}_minFix, i.e., σ^{S,G}_minFix(v, s) ≡ [def(s)]_{G,v,σ^{S,G}_minFix}. For the fragments L_non-rec, L_s, and L_+∨, σ^{S,G}_minFix can be computed in polynomial time in the size of the result of all the queries mapped by targ(.) and those that define the constraints assigned by def(.); the set with the union of all these entities is named Γ_{S,G}. We propose query optimization techniques that exploit knowledge about the invalid entities identified during the execution of the shapes evaluated so far, as well as the semantics encoded in the RDF graph, to rewrite the queries in targ(.) and def(.). Thus, the set of the entities retrieved from the RDF graph that will invalidate the shape schema is minimized.

Problem Statement:
Given an RDF graph G = ⟨V_G, E_G⟩ and a shape schema S = ⟨S, targ, def⟩, the problem of validating S over G is to find a shape schema S′ = ⟨S, targ′, def⟩ that meets the following conditions:

• S and S′ are equivalent when evaluated over G, i.e., the set of entities in G that validate S and S′ is the same: [S]_G = [S′]_G.

• The time required to evaluate S′ is minimal, i.e., if Z_{G,S} is the set of all shape schemas equivalent to S, then S′ is the schema in Z_{G,S} that minimizes the evaluation time:

    S′ = argmin_{S″ ∈ Z_{G,S}} Time(S″, G)    (1)

Solution.
Trav-SHACL implements a heuristic-based approach. It relies on the assumption that the minimal retrieval of the entities required to validate a shape s leads to collecting only the entities needed to assess the shapes mentioned in the constraint of s, i.e., def(s). Trav-SHACL follows a two-fold strategy that is guided by heuristics to perform inter- and intra-shape optimizations. The shapes of S are reordered; the inter-shape optimizations aim at deciding a traversal strategy where the shapes that invalidate the highest number of entities are validated first. Then, S is executed following the selected order, and intra-shape optimizations are performed on the fly. According to the number of invalid entities identified during the execution of the previously executed shapes, Trav-SHACL rewrites the target and constraint queries, i.e., targ(s) and γ(def(s)), to filter out entities linked to the entities invalidated so far. As soon as a query answer is retrieved, the collected entities are used to ground the rules that represent def(s). Grounded rules fire a bottom-up evaluation process, named saturation, to generate new truth values of entities in σ^{S,G}_minFix. Saturation is interleaved with data collection and grounding; it finalizes when a fixed-point on σ^{S,G}_minFix is reached. Shape reordering together with the execution of the rewritten queries leads to grounding a small number of constraint rules. More importantly, the interleaved evaluation of the saturation process enables the inference of invalid entities (i.e., false assignments in σ^{S,G}_minFix) as soon as these entities are collected. As a result, the execution time of the new shape schema S′, Time(S′, G), is minimized, and [S]_G is created incrementally.

Fig. 2 depicts the Trav-SHACL architecture. Given a shape schema S = ⟨S, targ, def⟩ and an RDF graph G = ⟨V_G, E_G⟩, Trav-SHACL outputs the set of entities in V_G that satisfy S.
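The grounding and saturation loop described above can be sketched as a fixed-point iteration over grounded rules. The rule encoding and the truth-value propagation below are our own minimal simplification (e.g., the handling of stratified negation is omitted) and are not Trav-SHACL's actual implementation.

```python
def saturate(rules, assignment):
    """Fixed-point saturation over grounded rules.

    `rules` maps a head atom, e.g., ('Professor', 'p1'), to a list of rule
    bodies; each body is a list of (negated, atom) literals.  `assignment`
    holds the truth values inferred so far (an approximation of
    sigma_minFix); atoms missing from it are still undetermined.
    A head becomes True as soon as one body is fully satisfied, and False
    once every body contains a literal that is already violated."""
    changed = True
    while changed:
        changed = False
        for head, bodies in rules.items():
            if head in assignment:
                continue  # truth value already fixed; skip re-evaluation
            satisfied = any(
                all(assignment.get(atom) is (not negated) for negated, atom in body)
                for body in bodies
            )
            violated = all(
                any(assignment.get(atom) is negated for negated, atom in body)
                for body in bodies
            )
            if satisfied:
                assignment[head] = True
                changed = True
            elif violated:
                assignment[head] = False
                changed = True
    return assignment

# Hypothetical grounding: professor p1 is valid only if university u1
# and department d1 are valid.
rules = {
    ("Professor", "p1"): [[(False, ("University", "u1")),
                           (False, ("Department", "d1"))]],
}
sigma = {("University", "u1"): False, ("Department", "d1"): True}
print(saturate(rules, sigma)[("Professor", "p1")])  # → False
```

Because a head is assigned False as soon as all of its bodies are violated, invalid entities propagate in cascade without evaluating the remaining rules, mirroring the early-invalidation behavior described above.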
Trav-SHACL follows a two-fold approach composed of two main components: i) Inter-shape planner and ii) Intra-shape planner and execution. In the first stage, measures computed from the dependency graph Φ_S, together with statistics about G, feed the heuristic-based approach implemented by the Inter-shape planner. In the second stage, the Intra-shape planner and execution component rewrites and executes SPARQL queries; it also infers, from the collected answers, the entities that validate S. Both components are detailed next.

Figure 3: Running Example. Numbers retrieved for the shape schema presented in Fig. 1a, following Trav-SHACL's traversal order which allows for knowledge exploitation, as depicted in Fig. 1d. Assume SHACL2SPARQL follows the same traversal order. Both engines (in)validate 10,542 entities in total. Nevertheless, Trav-SHACL grounds a factor of 22.25 fewer rules in memory by making use of its Inter-shape planner and Intra-shape planner and execution components.

Inter-shape planner. Trav-SHACL exploits graph-based measures (i.e., in- and out-degree distributions) to determine the connectivity of a shape in the dependency graph Φ_S of S. As a result, Trav-SHACL decides the traversal's seed shape and the best search strategy, e.g., depth-first search (DFS) or breadth-first search (BFS). The natural intuition that shapes s with high in-degree values should be evaluated first enables a considerable reduction in the number of retrieved entities during the evaluation of the neighbors of s, whenever a large number of entities invalidate s. The seed shape selection is guided by heuristics. First, all shapes s of S that have an empty targ(s), i.e., no class in the RDF graph is assigned to the shape, are discarded from the possible seed shapes. Following the above-mentioned intuition, Trav-SHACL selects – from this list of possible shapes – the one with the highest in-degree as the seed shape of the traversal. Assuming that there are still at least two shapes that qualify for the seed shape, the one with the most constraints is chosen, based on the intuition that more constraints to be met by a single entity increase the chances of a higher number of invalidated entities. (All min constraints are evaluated in one query, whereas only one max constraint per query is allowed, as formalized by Corman et al. [9].) In the example in Fig. 1, all shapes have a target definition, so no shape is discarded. The
University shape has the highest in-degree, i.e., the references from Professor and Department to it. Therefore, the other shapes are omitted. Since only one possible seed shape is left, the heuristic considering the number of constraints is not taken into account in this example. Given a seed shape s, the dependency graph Φ_S, and the traversal strategy (e.g., DFS or BFS), Trav-SHACL's shape ordering generation starts applying the traversal strategy on Φ_S at s, ignoring edge directions, in order to create an enumeration of the shapes in S. This enumeration is a preordering of the vertices in Φ_S. The approach keeps track of all the nodes in Φ_S that are not yet visited and the ones that have been visited already. In the case of recursions, when reaching an already visited node n, the search continues with the first unvisited neighbor of n in Φ_S. If there is no such neighbor, the first node in the list of not yet visited nodes is used for continuation. Since in this approach shapes are not visited more than once, the complexity is equal to the worst-case complexity of depth-first search for explicit graphs traversed without repetition, i.e., O(|V| + |E|), where V are the vertices in Φ_S and E the edges in Φ_S. Trav-SHACL starts to explore the dependency graph Φ_S from University. The shape is connected to Professor and Department. Due to the internal representation of the graph, the second node visited is the Department node. From Department there is an edge to the unvisited node Professor; hence, the third shape to be evaluated are the professors. In the neighborhood of Professor there is only one unvisited node left, i.e., Course. Once Course is scheduled to be evaluated at the fourth position, all nodes in the dependency graph Φ_S have been visited and the enumeration of all nodes in Φ_S is complete. The final traversal order is University, Department, Professor, Course and, therefore, the one that exploits knowledge in Fig. 1d.
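The seed selection and shape ordering can be sketched as follows. The dependency graph mirrors the university example; the adjacency ordering (which stands in for the "internal representation of the graph"), the direction of the Course edge, and the plain recursive DFS are our own simplifying assumptions.

```python
def traversal_order(dependency_graph, constraint_counts):
    """Pick a seed shape by highest in-degree (ties broken by number of
    constraints) and enumerate all shapes by a DFS over the undirected
    view of the dependency graph, visiting each shape exactly once."""
    # In-degree: how often a shape is referenced by other shapes.
    in_degree = {s: 0 for s in dependency_graph}
    for s, targets in dependency_graph.items():
        for t in targets:
            in_degree[t] += 1
    seed = max(dependency_graph, key=lambda s: (in_degree[s], constraint_counts[s]))

    # Undirected view: add the reverse of every edge.
    neighbors = {s: list(dependency_graph[s]) for s in dependency_graph}
    for s, targets in dependency_graph.items():
        for t in targets:
            if s not in neighbors[t]:
                neighbors[t].append(s)

    order, visited = [], set()

    def dfs(node):
        visited.add(node)
        order.append(node)
        for n in neighbors[node]:
            if n not in visited:
                dfs(n)

    dfs(seed)
    # Shapes disconnected from the seed are appended afterwards.
    for s in dependency_graph:
        if s not in visited:
            dfs(s)
    return order

# Edges point from a shape to the shapes referenced in its constraints.
graph = {
    "University": [],
    "Department": ["University"],
    "Professor": ["University", "Department"],
    "Course": ["Professor"],
}
counts = {"University": 2, "Department": 4, "Professor": 6, "Course": 2}
print(traversal_order(graph, counts))
# → ['University', 'Department', 'Professor', 'Course']
```

With this adjacency ordering, the sketch reproduces the traversal described above: University is the seed (in-degree 2), Department is visited second, then Professor, and finally Course.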
Intra-shape planner and execution. Once a traversal is decided, Trav-SHACL starts the shape schema's execution and performs query rewriting and interleaved executions while the shape schema is traversed. The query rewriting component performs intra-shape optimizations to increase the selectivity of both target and constraint queries. During Pushing FILTERs, the lists of entities (in)validated so far by the neighbor shapes are used as filters in the queries targ(s) and γ(def(s)). Trav-SHACL pushes filters down by making use of the SPARQL VALUES and FILTER NOT IN clauses. Thus, for every shape s in S, and given a list of entities validated per neighboring shape in the dependency graph Φ_S, Trav-SHACL prioritizes the smallest list – valid or invalid entities – to be included in a query filter. The Partition of non-selective queries is applied whenever the cardinality of a query exceeds the SPARQL endpoint's limit of maximal answers. Query offsets are evaluated to continuously retrieve slices of the answers; the SPARQL modifiers LIMIT and OFFSET are defined according to thresholds specified in the configuration of Trav-SHACL. Furthermore, if the list to be included as a query filter is very large and the rewritten query exceeds the maximum number of characters allowed by a SPARQL endpoint, the query is rewritten into several queries, and the union of the answers of these queries corresponds to the answer of the original one. The maximum number of rewritten queries is given by a threshold set up according to the configuration of Trav-SHACL. If the number of generated queries exceeds the threshold, the query rewriting is not applied to avoid overhead, and only one query is generated. Lastly, Query reordering aims to execute the most selective queries first to increase the validation process' continuous behavior.

Concurrently, the Interleaved execution component interleaves the verification of the constraints with the execution of the queries. Thus, entities can be validated as soon as they are retrieved, allowing Trav-SHACL to produce results incrementally. First, each entity v in [[targ(s)]]_G is used to ground all the constraints of def(s). Trav-SHACL follows the approach proposed by [7, 9] and represents the constraints as a theory T of safe stratified rules. These rules are of the form l_1 ∧ ... ∧ l_n ⇒ s(v), where each l_i corresponds to s_i(v_i) or ¬s_i(v_i), s_i is a shape in S, and v_i belongs to V_G. T is built in such a way that every model of T in the entities of V_G corresponds to the entities in [S]_G, i.e., the entities that satisfy S. During saturation, grounded rules are utilized to infer which entities validate T. Rules representing the constraints in def(s) are validated, and once an entity v in [[targ(s)]]_G invalidates a constraint in γ(def(s)), Trav-SHACL skips the evaluation of the remaining rules where s(v) appears in the body. Trav-SHACL also adds a false value to s(v) in σ^{S,G}_minFix. The component Intra-shape planner and execution keeps executing the Query Rewriting and Interleaved execution components until a fixed-point in σ^{S,G}_minFix is reached.

Fig. 1d and its corresponding running example in Fig. 3 illustrate the result of executing the Inter-shape planner and Intra-shape planner and execution. Further, Fig. 3 reports on the results of the state-of-the-art SHACL2SPARQL empowered with the traversal identified by Trav-SHACL. Trav-SHACL not only traverses the shapes in the selected order, but also pushes into the SPARQL queries the corresponding filters to avoid interleaving entities linked to already invalidated ones. This is depicted in Fig. 3, where using the knowledge of the eight valid universities allows rewriting targ(Professor) to filter out 1,260 invalid entities from grounding. This inter- and intra-shape strategy collects 9,282 entities that result in grounding 16,604 rules versus 369,472 rules grounded by SHACL2SPARQL. Consequently, Trav-SHACL's execution time is 525 ms, with savings of a factor of 22.25 in the number of grounded rules with respect to SHACL2SPARQL. Going back to the motivating example, in the shape schemas S′ and S″ in Fig. 1b and Fig. 1c only inter-shape traversal decisions have been taken into account. The number of retrieved entities of S (i.e., Γ_{S,G}) is 10,542 and remains the same in Γ_{S′,G} and Γ_{S″,G}. Nevertheless, because of the ordering in which the shapes are evaluated, entities are invalidated as soon as they are retrieved, and the constraints of S′ and S″ are grounded and validated over fewer entities. Thus, Γ_{S′,G} and Γ_{S″,G} ground all the constraints in the retrieved entities and end up validating 369,472 and 337,430 rules, respectively. As a result, differences in Time(S′, G) and Time(S″, G) are observed.
Table 2: Data Statistics. Shape schema 1 (|S| = 3, C = 16), shape schema 2 (|S| = 7, C = 36), and shape schema 3 (|S| = 14, C = 112) are each evaluated over small (SKGs), medium (MKGs), and large knowledge graphs (LKGs), yielding the data sets D1-D9, D10-D18, and D19-D27, respectively. For each data set, the table reports the number of inter- and intra-shape invalid entities and the percentage of invalid entities (%inv).
We empirically study the behavior of Trav-SHACL and compare it to the state-of-the-art SHACL shapes validator SHACL2SPARQL [8, 9]. We aim to answer the following research questions: RQ1) What is the effect of validating the shapes following different traversal strategies? RQ2) Using a shape network, can the knowledge gained from previously validated shapes be exploited to improve the performance? RQ3) What is the impact of the size of the data sources, i.e., do the approaches scale up? RQ4) What is the impact of the topology of the constraint network and the selectivity of the shapes? Source code and experiment scripts are available at GitHub (https://github.com/SDM-TIB/Trav-SHACL). In the following, the experimental configuration is described; then, the results are analyzed.

Data Sets and Shapes.
To the best of our knowledge, there are no benchmarks to evaluate the performance of SHACL validators. Therefore, we build our testbeds based on the accepted and commonly used Lehigh University Benchmark (LUBM) [11]. The LUBM Data Generator (available at http://swat.cse.lehigh.edu/projects/lubm/) is used to create data of different sizes: the small, medium, and large knowledge graphs in Table 2. Based on the classes and properties available in the data, we create one shape per class (shape schema 3 in Table 2). From this full shape schema we also evaluate subsets, referred to as shape schema 1 and shape schema 2. The originally generated data is modified in such a way that, for each of the shape schemas, each knowledge graph size has three different percentages of invalid entities. For ease of discussion of the results, we call each combination of shape schema, knowledge graph size, and percentage of invalid entities a data set. This leads to a total of 27 data sets to be evaluated. For more detailed information about the sizes of the shape schemas and other statistics on the data sets, we refer the reader to Table 2.
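The paper does not detail how the generated data is modified to reach a target percentage of invalid entities; as an illustration, the selection of the entities to invalidate could be sketched as follows (the function name, the fixed seed, and the percentage-based sampling are assumptions, not the actual test-bed generator):

```python
import random

def select_entities_to_invalidate(entities, percent_invalid, seed=42):
    """Sample the entities whose triples will be altered so that
    `percent_invalid` percent of them violate their shape.
    A fixed seed keeps the generated test beds reproducible."""
    rng = random.Random(seed)
    k = round(len(entities) * percent_invalid / 100)
    # sort first so the sample does not depend on set iteration order
    return set(rng.sample(sorted(entities), k))
```

The sampled entities would then be corrupted, e.g., by removing a property their shape requires, which is one plausible way to obtain the three %inv levels per knowledge graph size reported in Table 2.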
SHACL Engines.
The baseline of our comparison is the state-of-the-art SHACL shape validator SHACL2SPARQL [8, 9]. Trav-SHACL is implemented in Python 3.6 and SHACL2SPARQL in Java 8. Due to the performance differences of the two programming languages, we add a Python implementation of the SHACL2SPARQL approach to our study, named SHACL2SPARQL-py. We compare nine different configurations of Trav-SHACL. As depicted in Table 3, they differ in the heuristic used for the seed shape selection and the traversal strategy used to generate the order in which the shapes of the shape schema are evaluated. In total, we compare eleven different traversal and validation strategies. To ensure determinism during the experimental study, the engines add an ORDER BY clause to all queries. This is necessary for the case that an engine does not receive all results of an unselective query due to reaching the maximal answer limit of the SPARQL endpoint.

Table 3: Trav-SHACL Configurations. The traversal strategy is used to find the order in which the shapes are validated. The seed shape is selected from the shapes with the highest in- or outdegree. In case two or more shapes belong to this group, the shape with the most or fewest constraints is chosen. Trav-SHACL 9 works completely at random.

Name           Traversal Strategy   Connectivity     Constraints
Trav-SHACL 1   BFS                  high indegree    many
Trav-SHACL 2   BFS                  high indegree    few
Trav-SHACL 3   BFS                  high outdegree   many
Trav-SHACL 4   BFS                  high outdegree   few
Trav-SHACL 5   DFS                  high indegree    many
Trav-SHACL 6   DFS                  high indegree    few
Trav-SHACL 7   DFS                  high outdegree   many
Trav-SHACL 8   DFS                  high outdegree   few
Trav-SHACL 9   random               -                -
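The configurations in Table 3 can be read as a small planning procedure over the shape dependency graph. The following is a hedged sketch, not Trav-SHACL's actual implementation; the edge-list representation and helper names are assumptions:

```python
from collections import deque

def choose_seed(edges, num_constraints, by="indegree", constraints="many"):
    """Pick the seed shape: among the shapes with the highest in- or
    outdegree, choose the one with the most (or fewest) constraints."""
    degree = {s: 0 for s in num_constraints}
    for src, dst in edges:  # an edge (src, dst) means shape src references shape dst
        degree[dst if by == "indegree" else src] += 1
    top = max(degree.values())
    candidates = [s for s in degree if degree[s] == top]
    pick = max if constraints == "many" else min
    return pick(candidates, key=lambda s: num_constraints[s])

def bfs_order(edges, seed):
    """BFS over the undirected view of the shape dependency graph,
    yielding the order in which the shapes are validated."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        shape = queue.popleft()
        order.append(shape)
        for nxt in sorted(neighbors.get(shape, ())):  # sorted for determinism
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order
```

Under this reading, Trav-SHACL 1 corresponds to `choose_seed(..., by="indegree", constraints="many")` followed by `bfs_order`, while Trav-SHACL 5 replaces the BFS with a DFS.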
Metrics.
We report the following metrics: a) Average Validation Time: the average time elapsed between starting the validation of a data set and the engine finishing the validation; it corresponds to absolute wall-clock system time in seconds. b) Standard Deviation: the standard deviation of the validation time. c) dief@t: a measurement for the continuous efficiency of an engine in the first t time units of the validation [1]. The diefficiency is computed as the area under the curve (AUC) of the answer distribution function. Hence, approaches that produce answers faster in a certain time period show higher diefficiency values.

Figure 4: Overview of Result Plots. On the left, validation time in seconds for each data set. In the middle, answer traces showing the incremental generation of validation results, comparing the continuous behavior of the best approach with the baseline. On the right, diefficiency at time t; a higher value means a steadier answer production.

Experimental setup.
Each of the 297 experiments is run five times. An experiment is the validation of a data set Di, i ∈ [1, 27], with a particular engine. All caches are flushed between two consecutive experiments. All components of the experiment are run in dedicated Docker containers to ensure reproducibility. We use Virtuoso 7.20.3229 as SPARQL endpoint for querying the data sets. All containers are run on the same server; hence, network cost can be neglected. The experiments are executed on an Ubuntu 18.04.4 LTS 64-bit machine with an Intel Xeon E5-1630v4 CPU (four physical cores, eight threads) and 64 GiB DDR4 RAM. The Virtuoso endpoints are configured to use up to 32 GiB, while the containers for the SHACL validators are limited to 24 GiB.
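For reference, dief@t as used in the result analysis can be computed from an answer trace as the area under the cumulative-answer step function. Below is a minimal sketch following the definition of Acosta et al. [1]; the trace format (a sorted list of answer timestamps) is an assumption:

```python
def dief_at_t(timestamps, t):
    """dief@t: area under the cumulative-answer curve in [0, t].
    `timestamps` are the sorted times at which validation results
    were produced, in the same time unit as t."""
    answers = 0
    area = 0.0
    prev = 0.0
    for ts in timestamps:
        if ts > t:
            break
        area += answers * (ts - prev)  # count stays flat between two answers
        answers += 1
        prev = ts
    area += answers * (t - prev)  # tail of the step function up to t
    return area
```

An engine that produces the same answers earlier accumulates a larger area and hence a higher dief@t, which is exactly the "steadier answer production" reading used in the radar plots.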
Fig. 4 gives an overview of the plots presented for the result analysis. First, we report on the average validation time, including a visualization of the standard deviation. Each of these plots shows the validation time of all eleven engines validating three data sets that only differ in the percentage of invalid entities as depicted in Table 2, i.e., the data sets are of the same size. During the experiments, we collect the timestamp for each validation decision made. From those timestamps, we generate the answer trace. The traces show the continuous generation of answers for the engines. For the sake of readability, we only include the overall best configuration of Trav-SHACL, namely Trav-SHACL 5. Finally, we present the diefficiency at time t (dief@t). All measurements presented in the radar plot are 'higher is better'. On top, the inverse time for the first produced answer is noted; hence, a higher value means the first answer was produced faster. To the left, the inverse validation time is reported; again, a higher value implies earlier termination of the task. In the left bottom corner, Comp refers to the number of validations performed, i.e., the sum of all validated and invalidated entities. Next, the throughput T indicates the speed with which the answers are produced; a higher value implies shorter intervals between the report of two validation results. Finally, regarding the diefficiency at time t: if engine e1 has a higher value for dief@t than e2, then e1 exhibits a better performance until t in terms of diefficiency.

Figure 5: Validation Time of the Experiments in Seconds. Panels (a)-(i) cover shape schemas 1-3 over SKGs, MKGs, and LKGs. The validation time increases with the size of the data and shape schema. Apart from case (d), Trav-SHACL outperforms SHACL2SPARQL by a factor of up to 28.93.

Validation Time.
The validation time of the eleven engines for each of the 27 data sets is depicted in Fig. 5. Fig. 5a reports on the validation time for the small knowledge graphs against shape schema 1. As can be seen, the Python version of SHACL2SPARQL performs worst. Trav-SHACL outperforms SHACL2SPARQL-py by a factor of 31.58 and SHACL2SPARQL by a factor of 14.47. There is no considerable difference among the configurations of Trav-SHACL. The time needed to validate shape schema 1 against the medium-sized knowledge graphs is presented in Fig. 5b. We observe a very similar behavior compared to the smaller knowledge graphs. For the large knowledge graphs, shown in Fig. 5c, we start to see a small difference in the times reported for the configurations of Trav-SHACL. Also the factors by which we outperform the other approaches change: Trav-SHACL outperforms SHACL2SPARQL by a factor of 28.93 and SHACL2SPARQL-py by 40.06. The validation of shape schema 2 against small knowledge graphs shows a different behavior, as can be seen in Fig. 5d. This is because of the small number of invalid entities in these data sets (see Table 2). As a consequence, SHACL2SPARQL outperforms Trav-SHACL by a factor of 1.18, since Trav-SHACL cannot take advantage of its capability of discovering invalid entities fast. For the medium-sized knowledge graphs, Trav-SHACL is able to exploit its capabilities and performs similarly; Trav-SHACL 5 is the best approach in that case. However, it outperforms SHACL2SPARQL only by a factor of 1.04. As in the previous shape schema, the difference in validation time becomes clearer in the presence of large knowledge graphs. Trav-SHACL outperforms SHACL2SPARQL by a factor of up to 2.33. The high variance in validation time for Trav-SHACL in Fig. 5f is caused by a single outlier in the five runs for each of the configurations: it is always the same query for the big shape UndergraduateStudent that takes about twice as long as in the other runs. We observe this behavior at the SPARQL endpoint even though all caches are flushed between experiments. Moving to the validation of shape schema 3, Trav-SHACL outperforms the other engines. In the case of small knowledge graphs, Trav-SHACL is up to 1.57 times faster than SHACL2SPARQL. When validating the shape schema against a medium-sized knowledge graph, we perform up to 2.20 times faster. Considering large knowledge graphs, Trav-SHACL outperforms SHACL2SPARQL by a factor of up to 2.46. To sum up, Trav-SHACL outperforms SHACL2SPARQL, and the factor by which it performs better increases with larger knowledge graphs. However, the size of the knowledge graph is not the only impacting factor: the properties of the shape network, e.g., the cardinality of the shapes and the percentage of invalid entities, play an important role. The best engine overall is Trav-SHACL 5.

Figure 6: Continuous Behavior for the Experiments. Panels (a)-(i) cover shape schemas 1-3 over SKGs, MKGs, and LKGs. In all cases, Trav-SHACL produces the first result ahead of the other engines. In most cases, Trav-SHACL finishes the validation the fastest. With one exception in (d), Trav-SHACL outperforms the other engines in terms of continuously delivering results. The difference increases with larger knowledge graphs. (TFFT)^-1 - inverse time for the first answer produced, (ET)^-1 - inverse validation time (a.k.a. execution time), Comp - sum of validated and invalidated entities, T - throughput, and dief@t - continuous efficiency at time t.
Continuous Behavior.
We report the continuous behavior of the eleven engines for each group of data sets instead of for the single data sets, since the behavior within a group does not change. As discussed above, the results are visualized using radar plots in Fig. 6. Fig. 6a to Fig. 6c present the results for shape schema 1. Trav-SHACL outperforms the other engines in all measurements; all the engines identify the same number of (in)valid entities (Comp) except in the LKGs. Even though the shape schema is not very complex, SHACL2SPARQL and SHACL2SPARQL-py perform similarly except in the diefficiency at time t. This observation leads to the conclusion that the Java implementation is able to produce the validated answers at a faster rate. Moving to shape schema 2, Fig. 6d shows that Trav-SHACL is able to produce the first result faster than SHACL2SPARQL. However, due to the low number of invalid entities, Trav-SHACL is not able to exploit its advantages, leading to a worse validation time and diefficiency at t. When validating the shape schema against medium-sized knowledge graphs, the execution times are competitive, but Trav-SHACL produces the answers more steadily than SHACL2SPARQL. In the case of large knowledge graphs, Trav-SHACL outperforms SHACL2SPARQL in all reported measurements. Considering shape schema 3, Trav-SHACL outperforms all engines in all measurements for all knowledge graph sizes (see Fig. 6g - Fig. 6i). The difference between Trav-SHACL and SHACL2SPARQL increases with larger knowledge graphs. The results show that the Java implementation of SHACL2SPARQL outperforms the Python implementation of the same approach. To sum up, Trav-SHACL exhibits a better performance in terms of the diefficiency at time t than SHACL2SPARQL.

Correctness of Validation.
All engines produce the same results for all shape schemas validated against SKGs and MKGs. However, SHACL2SPARQL fails to correctly validate the shape schemas over LKGs. In order to enable the comparison of the (in)validated entities between the three prototypical implementations studied in this paper, we order the constraint queries by the common subject of all query triples using the ORDER BY clause. This is necessary since the number of answers retrieved from the SPARQL endpoint for a query is bounded. When evaluating large knowledge graphs like D25 (see Table 2), SHACL2SPARQL is not able to identify 41.10% of the valid entities with respect to Trav-SHACL, but classifies them as invalid instead. This is due to the lack of selectivity in the definition of the constraint queries. On the contrary, Trav-SHACL faces the same external limitations imposed by the SPARQL endpoint, but it leverages its query rewriting strategy to retrieve more relevant data for the validation process.
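The bounded answer sets mentioned above are the usual motivation for deterministic paging: an ORDER BY clause imposes a total order, so LIMIT/OFFSET windows partition the result set instead of returning an arbitrary truncated subset per request. The sketch below illustrates the idea; the helper names and the stand-in `run_query` callback are assumptions, not Trav-SHACL's actual code:

```python
def paged_query(select_query, order_var, page_size, page):
    """Append a deterministic ORDER BY and a LIMIT/OFFSET window."""
    offset = page * page_size
    return f"{select_query}\nORDER BY {order_var}\nLIMIT {page_size} OFFSET {offset}"

def fetch_all(select_query, order_var, run_query, page_size=10000):
    """Collect all rows by paging until a short page signals the end.
    `run_query` stands in for the actual SPARQL endpoint call."""
    rows, page = [], 0
    while True:
        batch = run_query(paged_query(select_query, order_var, page_size, page))
        rows.extend(batch)
        if len(batch) < page_size:  # last (possibly empty) page
            return rows
        page += 1
```

Without the ORDER BY, two requests for the same window may see different row orders, which is exactly the source of the misclassifications observed for SHACL2SPARQL on the LKGs.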
Answer to RQ1.
From the analysis of the results, it is clear that the traversal strategy impacts the validation time. Naturally, the difference is more prominent in shape schemas with more shapes, but the size of the data also matters, as the difference increases with larger knowledge graphs. The analysis shows that, overall, Trav-SHACL 5 performs best. Its seed shape is selected as the one with the most constraints amongst the shapes with the highest indegree. That allows reusing the validation results of the first shape for many of the following shapes. The results of our study confirm the intuition mentioned when describing the Inter-shape planner.

Answer to RQ2.
The knowledge gained from previously validated shapes can be exploited to improve the performance of the validation engine. This can be seen in the substantial savings of Trav-SHACL compared to SHACL2SPARQL-py and in the decreasing validation time of Trav-SHACL with an increasing number of invalid entities.
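At the query level, exploiting previously validated shapes amounts to pushing known-invalid referenced entities into the target query of the next shape, so that entities linked to them are never retrieved or grounded. Below is a hedged sketch of such a rewriting; the FILTER NOT IN strategy, the inlining threshold, and all names are illustrative assumptions rather than Trav-SHACL's actual rewriter:

```python
def rewrite_target(target_query, link_var, invalid_entities, max_inlined=256):
    """Push known-invalid referenced entities into a target query as a
    FILTER, so entities linked to them are excluded before grounding."""
    if not invalid_entities or len(invalid_entities) > max_inlined:
        return target_query  # fall back to the unconstrained target query
    values = ", ".join(f"<{e}>" for e in sorted(invalid_entities))
    filter_clause = f"FILTER ({link_var} NOT IN ({values}))"
    # insert the filter just before the closing brace of the WHERE block
    head, _, _ = target_query.rpartition("}")
    return head + filter_clause + " }"
```

In the running example, the analogous rewriting of targ(Professor) with the knowledge about the universities is what keeps the 1,260 invalid entities out of the grounding; depending on which set is smaller, one can equally inline the valid entities with a VALUES clause instead.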
Answer to RQ3.
Trav-SHACL scales up, and its performance advantage over SHACL2SPARQL increases with larger knowledge graphs. Trav-SHACL is able to produce correct results, whereas SHACL2SPARQL fails to correctly classify the entities in the large knowledge graphs. This is caused by a non-selective constraint query; therefore, SHACL2SPARQL does not receive all the answers needed for the validation, as discussed in Correctness of Validation.

Answer to RQ4.
The validation time increases with an increasing number of shapes in a shape schema. However, more important than the number of shapes is the cardinality of the shapes, i.e., the number of entities in [[targ(s)]]_G. For example, the difference in validation time of shape schema 2 and shape schema 3 against the same knowledge graph is relatively small, taking into consideration that shape schema 3 consists of twice as many shapes. This is due to the fact that the two biggest shapes, in terms of entities assigned to them, are already included in shape schema 2. Those two shapes comprise almost 70% of all entities in the data. Therefore, validating many selective shapes is faster than validating a single shape with many entities assigned to it. Trav-SHACL also benefits in cases where many shapes refer to the same shape. To conclude, the ideal case is many selective shapes that all depend on one other shape or, alternatively, a chain of shapes.

Constraint Languages.
Several approaches have been developed towards the definition and evaluation of expressive constraints on Semantic Web models. Initial works correspond to the definition of integrity constraint semantics in OWL by using the closed-world assumption [16, 17, 23]. However, OWL was originally designed to model incomplete data under the open-world assumption and is, hence, not well-suited for expressing integrity constraints. A next step was the SPARQL Inferencing Notation (SPIN) [13], a W3C member submission that suggested the use of SPARQL queries as constraints on top of RDF graphs. The next generation of SPIN is the W3C recommendation language SHACL [14], a language that allows representing integrity constraints over RDF graphs in RDF. ShEx [24] is a constraint language for RDF inspired by schema languages for XML. ShEx is similar to SHACL, but the semantics of ShEx builds on regular bag expressions. While SHACL semantics allows multiple possible assignments, a single assignment is created when validating an RDF graph with ShEx. It is worth mentioning that any graph that is valid with respect to a set of ShEx shapes will be valid with respect to an equivalent constraint definition in SHACL, while this does not hold in the other direction. Validation algorithms with semantics for recursion with stratified negation have been proposed by Boneva et al. [4] for ShEx. Since SHACL is the W3C recommendation, we focus on the validation of integrity constraints expressed in SHACL.
Shape Modeling.
The above-mentioned languages for representing integrity constraints over RDF graphs are a foundation for quality assessment. As a first step in a quality assessment pipeline, a shape schema for the data to validate needs to be found. Much work has been done on automatic shape generation from the data and on systems that guide the data administrator in generating the constraints in a semi-automatic manner. ABSTAT [22] is an online semantic profiling tool for data-driven extraction of ontology patterns and data statistics. Data consumers can benefit from ABSTAT to better understand the data. Spahiu et al. [21] propose a methodology to transform the ABSTAT profiles into SHACL for quality assessment. The SHACL constraints are based on the patterns discovered by ABSTAT; therefore, most of the constraints are cardinality constraints or domain/range constraints. In contrast to ABSTAT, Astrea [6] generates SHACL shapes from ontologies only. Astrea uses mappings between ontology patterns and SHACL patterns for automatic shape generation. The ontology patterns are extracted from OWL 2, RDFS, and XSD. The modeling of a shape schema for a given RDF graph is beyond the scope of this paper.
SHACL Validation.
Another important step in a quality assessment pipeline is the actual execution of the validation of the shape schema against the knowledge graph. The validation of recursive shape schemas is left undefined in the specification of SHACL [14]. Corman et al. [7] introduce a semantics for the validation of recursive SHACL. They also show that the validation of full SHACL is NP-hard. Based on those findings, they propose fragments of SHACL that are tractable, together with a basic algorithm for validating a shape schema using SPARQL [9]. Andreşel et al. [2] introduce a stricter semantics for recursive SHACL based on stable models known from Answer Set Programming (ASP). This approach allows representing SHACL constraints as logic programs and using existing ASP solvers for the validation of the shape schema. Another advantage is that negation in recursion is possible following the proposed approach. In contrast to these logic-based approaches to the problem, we use query optimization techniques to improve the incremental behavior and scalability.
Satisfiability and Containment.
Recent work focuses on the satisfiability and containment of SHACL. Leinberger et al. [15] propose to use description logics for deciding the containment of one shape in another. They study standard entailment for different fragments of SHACL. Pareti et al. [19] propose a new fragment of first-order logic (FOL), extended with counting quantifiers and a transitive closure operator, called SCL, to decide the satisfiability of a shape schema and the containment of a shape schema in another. This fragment only covers the core constraint components of the SHACL specification and excludes recursion, which makes both problems decidable; they are not decidable for full SHACL. While the problem of shape schema containment is important for data integration, the satisfiability problem is of more interest for quality assessment.
We addressed the problem of minimizing the execution time of data quality assessment constraints expressed in a SHACL shape schema. Given the increasing acceptance of knowledge graphs at industrial and scientific organizations and the W3C recommendation of SHACL as the language to define integrity constraints over RDF graphs, scalable SHACL engines are necessary for global adoption. We presented Trav-SHACL as an effective data management tool to fulfill the scalability requirements of novel knowledge-driven developments. Trav-SHACL selects the traversal plans over the shapes and rewrites the target and constraint queries for the fast detection of invalid entities. In doing so, Trav-SHACL is able to reduce the number of constraints that need to be checked during the validation and to produce results incrementally. As a result, the total execution time is reduced by a factor of up to 28.93, and results are delivered continuously. Thus, Trav-SHACL broadens the repertoire of tools for declaratively creating and curating knowledge graphs. We hope that our reported results encourage the diverse communities to develop applications where these results can be reproduced and generalized in real-world scenarios. Since Trav-SHACL aims at invalidating entities fast, it does not perform as well on data sets with a small percentage of invalid entities as on low-quality data sets. In the future, we plan to investigate adaptive planning techniques able to adjust shape validation schedules to the characteristics of both the shape schema and the knowledge graph. Lastly, the incorporation of Trav-SHACL into real-world pipelines of knowledge graph management is part of our future agenda.
Acknowledgements
This work has been partially supported by the EU H2020 projects iASiS (No727658) and QualiChain (No 822404), and the ERAMed project P4-LUCAT(No 53000015).
References

[1] Maribel Acosta, Maria-Esther Vidal, and York Sure-Vetter. "Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches". In: The Semantic Web – ISWC 2017. Lecture Notes in Computer Science, vol. 10588. Cham: Springer, 2017, pp. 3–19. doi: 10.1007/978-3-319-68204-4_1.
[2] Medina Andreşel et al. "Stable Model Semantics for Recursive SHACL". In: Proceedings of The Web Conference 2020 (WWW '20). New York, NY, USA: ACM, 2020, pp. 1570–1580. doi: 10.1145/3366423.3380229.
[3] Sören Auer et al. "Towards a Knowledge Graph for Science". In: Proceedings of the International Conference on Web Intelligence, Mining and Semantics (WIMS 2018). 2018.
[4] Iovka Boneva, Jose E. Labra Gayo, and Eric G. Prud'hommeaux. "Semantics and Validation of Shapes Schemas for RDF". In: The Semantic Web – ISWC 2017. Ed. by Claudia d'Amato et al. Cham: Springer International Publishing, 2017, pp. 104–120.
[5] Stefano Ceri, Georg Gottlob, and Letizia Tanca. "What you Always Wanted to Know About Datalog (And Never Dared to Ask)". In: IEEE Transactions on Knowledge and Data Engineering (1989).
[6] Andrea Cimmino, Alba Fernández-Izquierdo, and Raúl García-Castro. "Astrea: Automatic Generation of SHACL Shapes from Ontologies". In: The Semantic Web (ESWC 2020). Ed. by Andreas Harth et al. Cham: Springer International Publishing, 2020, pp. 497–513. isbn: 978-3-030-49461-2. doi: 10.1007/978-3-030-49461-2_29.
[7] Julien Corman, Juan L. Reutter, and Ognjen Savković. "Semantics and Validation of Recursive SHACL". In: The Semantic Web – ISWC 2018. Ed. by Denny Vrandečić et al. Cham: Springer International Publishing, 2018, pp. 318–336.
[8] Julien Corman et al. "SHACL2SPARQL: Validating a SPARQL Endpoint against Recursive SHACL Constraints". In: ISWC 2019 Satellites. Ed. by Mari Carmen Suárez-Figueroa et al. Vol. 2456. Aachen: CEUR Workshop Proceedings (CEUR-WS.org), 2019, pp. 165–168.
[9] Julien Corman et al. "Validating SHACL Constraints over a SPARQL Endpoint". In: The Semantic Web – ISWC 2019. Ed. by Chiara Ghidini et al. Cham: Springer International Publishing, 2019, pp. 145–163.
[10] David N. Nicholson and Casey S. Greene. "Constructing knowledge graphs and their biomedical applications". In: Computational and Structural Biotechnology Journal (2020).
[11] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. "LUBM: A Benchmark for OWL Knowledge Base Systems". In: Journal of Web Semantics (2005).
[12] Aidan Hogan et al. "Knowledge Graphs". In: CoRR abs/2003.02320 (2020).
[13] Holger Knublauch, James A. Hendler, and Kingsley Idehen. SPIN – Overview and Motivation. W3C Member Submission. Feb. 2011.
[14] Holger Knublauch and Dimitris Kontokostas. Shapes Constraint Language (SHACL). W3C Recommendation. July 2017.
[15] Martin Leinberger et al. "Deciding SHACL Shape Containment through Description Logics Reasoning". In: The Semantic Web – ISWC 2020. Lecture Notes in Computer Science. Cham: Springer, 2020.
[16] Boris Motik, Ian Horrocks, and Ulrike Sattler. "Adding Integrity Constraints to OWL". In: OWLED 2007 – OWL: Experiences and Directions. Ed. by Christine Golbreich, Aditya Kalyanpur, and Bijan Parsia. Vol. 258. Aachen: CEUR Workshop Proceedings (CEUR-WS.org), 2007.
[17] Boris Motik, Ian Horrocks, and Ulrike Sattler. "Bridging the Gap between OWL and Relational Databases". In: Web Semantics: Science, Services and Agents on the World Wide Web (2009).
[18] Natalya Fridman Noy et al. "Industry-scale knowledge graphs: lessons and challenges". In: Communications of the ACM (2019).
[19] Paolo Pareti et al. "SHACL Satisfiability and Containment". In: The Semantic Web – ISWC 2020. Lecture Notes in Computer Science. Cham: Springer, 2020.
[20] Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. "Semantics and complexity of SPARQL". In: ACM Transactions on Database Systems (2009).
[21] Blerina Spahiu et al. "Towards Improving the Quality of Knowledge Graphs with Data-driven Ontology Patterns and SHACL". In: Emerging Topics in Semantic Technologies – ISWC 2018 Satellite Events. Ed. by Elena Demidova, Amrapali J. Zaveri, and Elena Simperl. 2018, pp. 103–117.
[22] Blerina Spahiu et al. "ABSTAT: Ontology-Driven Linked Data Summaries with Pattern Minimalization". In: The Semantic Web – ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 – June 2, 2016, Revised Selected Papers. Ed. by Harald Sack et al. Cham: Springer, 2016.
[23] Jiao Tao et al. "Integrity Constraints in OWL". In: Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10). 2010.
[24] Katherine Thornton et al. "Using Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with Rigorous Validation". In: The Semantic Web (ESWC 2019). Ed. by Pascal Hitzler et al. Cham: Springer International Publishing, 2019, pp. 606–620. doi: 10.1007/978-3-030-21348-0_39.