Killing Two Birds with One Stone -- Querying Property Graphs using SPARQL via GREMLINATOR
Harsh Thakkar
University of Bonn, Bonn, Germany [email protected]
Dharmen Punjani
National and Kapodistrian University of Athens, Athens, Greece [email protected]
Jens Lehmann
University of Bonn & Fraunhofer IAIS, Bonn, Germany [email protected]
Sören Auer
TIB & Leibniz University of Hannover, Hannover, Germany [email protected]
ABSTRACT
Knowledge graphs have become popular over the past decade and frequently rely on the Resource Description Framework (RDF) or Property Graph (PG) databases as data models. However, the query languages for these two data models, SPARQL for RDF and the PG traversal language Gremlin, lack interoperability. We present Gremlinator, the first translator from SPARQL, the W3C standardised language for RDF, to Gremlin, a popular property graph traversal language. Gremlinator translates SPARQL queries to Gremlin path traversals for executing graph pattern matching queries over graph databases. This allows a user who is well versed in SPARQL to access and query a wide variety of Graph Data Management Systems (DMSs), avoiding the steep learning curve of adapting to a new Graph Query Language (GQL). Gremlin is a graph computing system-agnostic traversal language (covering both OLTP graph databases and OLAP graph processors), making it a desirable choice for supporting interoperability for querying Graph DMSs. Gremlinator currently supports the translation of a subset of SPARQL 1.0, specifically the SPARQL SELECT queries.
KEYWORDS
Property Graph, SPARQL, Gremlin, Graph Traversal, Gremlinator
ACM Reference format:
Harsh Thakkar, Dharmen Punjani, Jens Lehmann, and Sören Auer. 2016. Killing Two Birds with One Stone – Querying Property Graphs using SPARQL via Gremlinator. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17),
Knowledge graphs model the real world in terms of entities and relations between them. They became popular because they are an intuitive and simple data model, which allows many types of queries to be executed efficiently and can serve as a foundation for a range of Artificial Intelligence applications. The Resource Description Framework (RDF) and Property Graphs (PGs) are popular data models for knowledge graphs. For RDF, the SPARQL query language was standardized by W3C, whereas for PGs several languages are frequently used, including Gremlin [11].

PGs and RDF have evolved from different origins and still have largely disjoint user communities. RDF is part of the Semantic Web initiative with a focus on expressive data modelling as well as data publication and linking. PGs originate from the database community with a focus on efficient execution of graph traversals.

With Gremlinator, we build a bridge between both communities and research the interoperability of RDF and PG query languages [16]. Moreover, we allow combining the best of both worlds: powerful modelling capabilities as well as data publication and interlinking methods, combined with efficient graph traversal execution. In particular, Gremlinator has the following advantages: (1) Existing SPARQL-based applications can switch to property graphs in a non-intrusive way. (2) It provides the foundation for a hybrid use of RDF triple stores and property graph DMSs: a system could detect which DMS is more efficient for answering a particular query [5] and redirect the query accordingly. In particular, property graph databases have been shown to work very well for a wide range of queries which benefit from locality in a graph. Rather than performing expensive joins, property graph databases use micro indices to perform traversals. (3) Users familiar with the W3C standardized SPARQL query language do not need to learn another query language.

Overall, we make the following contributions:
• A novel approach for mapping SPARQL queries to Gremlin pattern matching traversals, Gremlinator, which is, to the best of our knowledge, the first work converting an RDF query language to a property graph query language.
• An openly available implementation for executing SPARQL queries over a plethora of third-party graph DMSs such as Neo4J, Sparksee, OrientDB, etc., using the Apache TinkerPop framework.

The remainder of the article is organized as follows: Section 2 summarizes the related work. Section 3 sheds light on the importance of Gremlin and briefly discusses the Gremlinator approach and its limitations. Section 4 presents the demonstration details and the value Gremlinator offers to its users. Finally, Section 5 concludes the article and describes the future work.

We present a brief summary of related work with regard to techniques and tools that support the translation and execution of formal query languages, addressing the interoperability issue.
SPARQL → SQL: A substantial amount of work has been done on converting SPARQL queries to SQL queries, such as Ontop [3], R2RML [12], Elliott et al. [6], Chebotko et al. [4], Zemke et al. [18], and Priyatna et al. [9]. Ontop [3], one of the most popular systems, exposes relational databases as virtual RDF graphs by linking the terms (classes and properties) in the ontology to the data sources through mappings. This virtual RDF graph can then be queried using SPARQL.

SQL → SPARQL: RETRO [10] presents a formal semantics-preserving translation from SQL to SPARQL. It follows a schema and query mapping approach rather than transforming the data physically. The schema mapping derives a domain-specific relational schema from RDF data. Query mapping transforms an SQL query over the schema into an equivalent SPARQL query, which in turn is executed against the RDF store.

SQL → CYPHER: CYPHER is the graph query language used to query the Neo4j graph database. There has been no work yet on using SQL on top of CYPHER. However, there are some examples that show the equivalent CYPHER queries for certain SQL queries.

In this section we discuss why we chose Gremlin as a property graph query language and briefly describe the Gremlinator approach.
Gremlin is a system-agnostic query language developed by Apache TinkerPop. It supports both pattern matching (declarative) and graph traversal (imperative) styles of querying over property graphs.

Figure 1: The Gremlin Traversal Language and Machine.

CYPHER Query Language (https://neo4j.com/developer/cypher-query-language/)
Neo4j (https://neo4j.com/)
SQL to CYPHER (https://neo4j.com/developer/guide-sql-to-cypher/)
Gremlin: Apache TinkerPop’s graph traversal language and machine (https://tinkerpop.apache.org/)

Gremlin is more general than, e.g., CYPHER, as it provides, in addition to a query language, a common execution platform for supporting any graph computing system (including both OLTP and OLAP graph processors), thereby addressing the querying interoperability issue (see Figure 1 (a)). Because Gremlin, together with the Apache TinkerPop framework, is both a language and a virtual machine, it is possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM); see Figure 1 (b). Gremlin provides the declarative (SPARQL style) pattern matching querying construct using the .match()-step.

For brevity, we abstain from delving into the definitions and formal semantics of Gremlin, and rather point the interested reader to the literature [11, 17]. Furthermore, one may also refer to [16], where the complete SPARQL to Gremlin translation approach is discussed in detail.
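The difference between the imperative (traversal) and declarative (pattern matching) styles can be illustrated with a minimal Python sketch. It is not Gremlin itself: the in-memory graph, its data, and all function names are assumptions made for illustration only. In Gremlin, the declarative variant would be expressed with the .match()-step.

```python
# Toy property graph (hypothetical data): vertices carry properties,
# edges are (source, label, target) triples.
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "josh", "age": 32},
}
edges = [(1, "knows", 2), (1, "knows", 3)]


def out(vertex_id, label):
    """Imperative step: follow outgoing edges carrying the given label."""
    return [dst for src, lbl, dst in edges if src == vertex_id and lbl == label]


def imperative_friends_of(name):
    """Traversal style: locate a start vertex, then walk step by step."""
    result = []
    for vid, props in vertices.items():
        if props["name"] == name:
            for friend in out(vid, "knows"):
                result.append(vertices[friend]["name"])
    return result


def declarative_friends_of(name):
    """Pattern matching style: find bindings (a, b) such that
    a.name == name and a -knows-> b, then project b.name."""
    return [
        vertices[b]["name"]
        for a, lbl, b in edges
        if lbl == "knows" and vertices[a]["name"] == name
    ]


if __name__ == "__main__":
    print(imperative_friends_of("marko"))
    print(declarative_friends_of("marko"))
```

Both functions compute the same answer; the declarative form only states the constraints the bindings must satisfy, which is the shape a SPARQL graph pattern already has.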
We now present the architectural overview of Gremlinator in Figure 2 and discuss each of the four steps of its execution pipeline.

Step 1. The input SPARQL query is first parsed using the Jena ARQ module, thereby (i) validating the query and (ii) generating its abstract syntax tree (AST) representation.

Step 2. From the obtained AST of the parsed SPARQL query, Gremlinator then visits each basic graph pattern (BGP), mapping it to the corresponding Gremlin single step traversal (SST). An SST in Gremlin is an atomic traversal step (ψ_s), which we describe in detail in [16].

Step 3. Thereafter, depending on the operator precedence obtained from the AST of the parsed SPARQL query, each SPARQL keyword is mapped to its corresponding instruction step from the Gremlin instruction library. A final conjunctive traversal (Ψ) is then generated by appending the SSTs and instruction steps. This is analogous to the SPARQL query language, wherein a set of BGPs forms a single complex graph pattern (CGP).

Step 4. This final conjunctive traversal (Ψ) is used to generate bytecode, which can be used on multiple language and platform variants of the Apache TinkerPop Gremlin family.

Figure 2: The architectural overview of Gremlinator.
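The pipeline above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Gremlinator implementation (which is built on Jena ARQ): parsing (Step 1) is assumed to have already produced a list of triple patterns, the e:/v: predicate prefixes follow the convention used in this paper, and the particular mapping rules, the function names, and the handling of LIMIT are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]        # (subject, predicate, object)
Instruction = Tuple[str, List[str]]  # (operator, flattened arguments)


def var(term: str) -> str:
    """Strip the leading '?' from a SPARQL variable name."""
    return term[1:]


def to_sst(triple: Triple) -> str:
    """Step 2: map one triple pattern to a single step traversal (SST)."""
    s, p, o = triple
    if not s.startswith("?"):
        raise ValueError(f"this sketch expects a variable subject: {s}")
    kind, _, name = p.partition(":")
    if kind == "e" and o.startswith("?"):   # edge: traverse to another vertex
        return f"__.as('{var(s)}').out('{name}').as('{var(o)}')"
    if kind == "v" and o.startswith("?"):   # property bound to a variable
        return f"__.as('{var(s)}').values('{name}').as('{var(o)}')"
    if kind == "v":                         # property compared to a literal
        return f"__.as('{var(s)}').has('{name}', {o})"
    raise ValueError(f"unsupported pattern: {s} {p} {o}")


def translate(bgp: List[Triple], projection: List[str],
              limit: Optional[int] = None) -> List[Instruction]:
    """Steps 3-4: append SSTs and modifier steps into one ordered
    instruction list (a bytecode-like form of the traversal)."""
    instructions: List[Instruction] = [
        ("V", []),
        ("match", [to_sst(t) for t in bgp]),
        ("select", [f"'{var(v)}'" for v in projection]),
    ]
    if limit is not None:                   # SPARQL LIMIT -> limit() step
        instructions.append(("limit", [str(limit)]))
    return instructions


def render(instructions: List[Instruction]) -> str:
    """Render the instruction list as a Gremlin-style traversal string."""
    return "g." + ".".join(f"{op}({', '.join(args)})" for op, args in instructions)


if __name__ == "__main__":
    # SELECT ?n WHERE { ?a v:name "marko" . ?a e:knows ?b . ?b v:name ?n } LIMIT 2
    bgp = [("?a", "v:name", '"marko"'), ("?a", "e:knows", "?b"), ("?b", "v:name", "?n")]
    print(render(translate(bgp, ["?n"], limit=2)))
```

Keeping the result as an ordered list of (operator, arguments) pairs until the last moment mirrors the idea that the traversal is language-neutral and only rendered into a concrete Gremlin variant at the end.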
Considerations. We encode the prefixes of SPARQL queries within the Gremlinator implementation. In order to aid the SPARQL to Gremlin translation process, we define custom prefixes preserving the categories of Gremlin instruction steps. For instance, the standard rdfs:label prefix (which is generally a predicate) is represented as e:label or v:label (where e = edge and v = vertex). (Bytecode, as generated in Step 4, is simply a serialized representation of a traversal, i.e., a list of ordered instructions where an instruction is a string operator and a (flattened) array of arguments.)

Table 1: Query feature and description

Query Id.   Feature   Description
C1-C3       CGPs      Queries with mixed number of BGPs
F1-F3       FILTER    CGPs with a combination of ≥ …
…                     … (≥ 10 BGPs)

For the demonstration of Gremlinator, we provide a set of 30 pre-defined SPARQL queries for reference, for each dataset, covering 10 different SPARQL query features (i.e., three queries per feature with a combination of various modifiers), as shown in Table 1. These features were selected after a systematic study of SPARQL query semantics [1, 8, 13] and from the BSBM [2] explore use cases and Watdiv query templates. Furthermore, we encourage the end user to write and execute custom SPARQL queries for both datasets, for further exploration.

Gremlinator is an on-going effort for achieving seamless translation of SPARQL queries to Gremlin traversals. The current version of Gremlinator supports SPARQL 1.0 SELECT queries with the following exceptions:
(1) REGEX (regular expressions) in the FILTER (restrictions) of a graph pattern is currently not supported.
(2) Gremlinator does not support variables for the property predicate, i.e., the predicate {p} in a graph pattern {s p o .} has to be defined or known for the traversal to be generated. This is because traversing a graph is not possible without knowing the precise traversal operation to the destination (vertex or edge) from the source (vertex or edge).
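The two exceptions can be checked before any translation is attempted. The following Python sketch is a hypothetical pre-check, not part of Gremlinator: it assumes the query has already been decomposed into triple patterns and FILTER expression strings, and the function name and message wording are illustrative.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def unsupported_features(bgp: List[Triple], filters: List[str]) -> List[str]:
    """Return a description of every construct that falls outside the
    supported fragment: variable predicates and REGEX inside FILTER."""
    problems = []
    for s, p, o in bgp:
        if p.startswith("?"):  # the predicate must be known, not a variable
            problems.append(f"variable predicate in pattern: {s} {p} {o}")
    for expr in filters:
        # Match the REGEX function call case-insensitively.
        if re.search(r"\bregex\s*\(", expr, flags=re.IGNORECASE):
            problems.append(f"REGEX in FILTER: {expr}")
    return problems


if __name__ == "__main__":
    print(unsupported_features([("?s", "?p", "?o")], ["regex(?n, '^ma')"]))
    print(unsupported_features([("?a", "e:knows", "?b")], ["?age > 30"]))
```

Reporting all problems at once, rather than failing on the first, gives the user a complete picture of what must be rewritten in the query.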
As a part of the demonstration of our system Gremlinator, we provide (i) an online screencast, (ii) a web application (see Figure 3), and (iii) a desktop application of Gremlinator (a standalone .jar bundle), which requires the Java 1.8 JRE installed on the host machine and is downloadable from the web demo website.

The demonstration workflow for all the above-mentioned Gremlinator versions is identical: (i) the user selects a dataset (Northwind or BSBM) from the corresponding drop-down menu; (ii) the user selects a query (one of the ten SPARQL query features) from the corresponding drop-down menu; (iii) the user executes the query; (iv) Gremlinator returns the selected SPARQL query, the translated Gremlin traversal and the result of the traversal execution; (v) the user can also edit or write custom SPARQL queries and execute them on the selected dataset using the integrated query editor.

Figure 3: Gremlinator Web application demonstration screenshot.

BSBM Explore Use Cases (https://goo.gl/y1ObNN)
Watdiv Query Features (http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml)
Gremlinator Demo Screencast (https://youtu.be/Z0ETx2IBamw)
Gremlinator Web Demo (http://gremlinator.iai.uni-bonn.de:8080/Demo)

We will present the live demonstration of Gremlinator using a pre-configured laptop with all the resources, including the SPARQL queries and datasets. In order to demonstrate the correctness of our approach, we will provide a custom Docker-based OpenLink Virtuoso SPARQL endpoint, pre-loaded with the datasets, for a one-to-one query result comparison (for interested visitors).
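A one-to-one result comparison of this kind can be sketched as follows. This is an illustrative Python helper, not part of the demo: it assumes both result sets have already been fetched and converted to lists of variable-to-value dictionaries, and the function name is hypothetical. Rows are compared as multisets, since SELECT results without DISTINCT may contain duplicates and their order is not significant unless ORDER BY is used.

```python
from collections import Counter
from typing import Dict, Hashable, List

Row = Dict[str, Hashable]  # variable name -> bound value


def same_results(sparql_rows: List[Row], gremlin_rows: List[Row]) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order."""
    def normalise(rows: List[Row]) -> Counter:
        # A row becomes a sorted tuple of (variable, value) pairs so that
        # it is hashable and independent of dictionary key order.
        return Counter(tuple(sorted(row.items())) for row in rows)
    return normalise(sparql_rows) == normalise(gremlin_rows)


if __name__ == "__main__":
    from_sparql = [{"n": "vadas"}, {"n": "josh"}]
    from_gremlin = [{"n": "josh"}, {"n": "vadas"}]
    print(same_results(from_sparql, from_gremlin))
```

For queries with ORDER BY, a plain list comparison would be the stricter and more appropriate check.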
Value. Gremlinator will serve as a user-friendly medium to (i) execute SPARQL queries over property graphs, bridging the query interoperability gap; (ii) conduct performance analysis of query results and comparisons of SPARQL vs. Gremlin traversal operations using frameworks such as LITMUS [14, 15]; and (iii) enable querying a spectrum of graph databases via the SPARQL 1.0 query fragment (ref. Figure 4), leveraging the advantages of the Apache TinkerPop framework.
We presented a demonstration of Gremlinator, a novel approach for supporting the execution of SPARQL queries on property graphs using Gremlin traversals. Gremlinator has obtained clearance from the Apache TinkerPop development team and is currently in the production phase, to be released as a plugin during TinkerPop’s next framework cycle. Gremlinator has also been integrated into the SANSA Stack [7] (v0.3) framework as an experimental plugin. Furthermore, Gremlinator is freely available under the Apache 2.0 license for public use from the Maven Central repository.

Figure 4: Gremlinator powered by Apache TinkerPop will enable querying a variety of Graph databases.

As future work, we are working on (i) adding support for REGEX in restrictions (FILTERs) and for variables as property predicates, and (ii) supporting the translation of SPARQL 1.1 query features, such as property paths, in the upcoming releases.
Harsh Thakkar is a Marie Skłodowska-Curie Ph.D. student at the University of Bonn, Germany. He earned his M.Tech. in Computer Science from NIT Surat, India. His research interests include Graph and RDF Data Management, Benchmarking, Graph Query Languages and Question Answering.

Dharmen Punjani is a Marie Skłodowska-Curie Ph.D. student at the National and Kapodistrian University of Athens, Greece. He earned his M.Tech. in Computer Science from NIT Surat, India. His research interests include Geo-Spatial and RDF Data Management, N.L.P., and Question Answering.

Jens Lehmann is professor for Software and Data Engineering, leads the Smart Data Analytics (SDA) research group at the University of Bonn and is a lead scientist at the Enterprise Information Systems (EIS) department at Fraunhofer IAIS. His main research interests are semantic technologies and machine learning.

Sören Auer is professor for Data Science and Digital Libraries at the University of Hannover and director of the TIB Leibniz Information Center for Science and Technology. His research interests revolve around semantic technologies, scholarly communication and digital libraries.
ACKNOWLEDGEMENTS
This work is supported by the funding received from EU-H2020 WDAqua ITN (GA. 642795). We would like to thank Dr. Marko Rodriguez and Mr. Daniel Kuppitz, of the Apache TinkerPop project, for their support and quality insights in developing Gremlinator.
REFERENCES
[1] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. 2016. Foundations of Modern Graph Query Languages. CoRR abs/1610.06264 (2016).
[2] Christian Bizer and Andreas Schultz. 2009. The Berlin SPARQL Benchmark. (2009).
[3] Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. 2017. Ontop: Answering SPARQL queries over relational databases. Semantic Web 8, 3 (2017), 471–487.
[4] Artem Chebotko, Shiyong Lu, and Farshad Fotouhi. 2009. Semantics preserving SPARQL-to-SQL translation. Data & Knowledge Engineering 68, 10 (2009), 973–1000.
[5] Souripriya Das, Jagannathan Srinivasan, Matthew Perry, Eugene Inseok Chong, and Jayanta Banerjee. 2014. A Tale of Two Graphs: Property Graphs as RDF in Oracle. In EDBT. 762–773.
[6] Brendan Elliott, En Cheng, Chimezie Thomas-Ogbuji, and Z Meral Ozsoyoglu. 2009. A complete translation from SPARQL into efficient SQL. In Proceedings of the 2009 International Database Engineering & Applications Symposium. ACM, 31–42.
[7] Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Nilesh Chakraborty, Muhammad Saleem, and Axel-Cyrille Ngonga Ngomo. 2017. Distributed Semantic Analytics using the SANSA Stack. In Proceedings of the 16th International Semantic Web Conference (ISWC). Springer, 147–155.
[8] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. 2006. Semantics and Complexity of SPARQL. In International semantic web conference. Springer, 30–43.
[9] Freddy Priyatna, Oscar Corcho, and Juan Sequeda. 2014. Formalisation and experiences of R2RML-based SPARQL to SQL query translation using Morph. In Proceedings of the 23rd international conference on World wide web. ACM, 479–490.
[10] Jyothsna Rachapalli, Vaibhav Khadilkar, Murat Kantarcioglu, and Bhavani Thuraisingham. 2011. RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation. The University of Texas at Dallas 800 (2011), 75080–3021.
[11] Marko A. Rodriguez. 2015. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages, Pittsburgh, PA, USA, October 25-30, 2015. 1–10.
[12] Mariano Rodriguez-Muro and Martin Rezk. 2015. Efficient SPARQL-to-SQL with R2RML mappings. Web Semantics: Science, Services and Agents on the World Wide Web 33 (2015), 141–169.
[13] Michael Schmidt, Michael Meier, and Georg Lausen. 2010. Foundations of SPARQL query optimization. In Proceedings of the 13th International Conference on Database Theory. ACM, 4–33.
[14] Harsh Thakkar. 2017. Towards an Open Extensible Framework for Empirical Benchmarking of Data Management Solutions: LITMUS. In The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part II. 256–266.
[15] Harsh Thakkar, Yashwant Keswani, Mohnish Dubey, Jens Lehmann, and Sören Auer. 2017. Trying Not to Die Benchmarking: Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS. In Proceedings of the 13th International Conference on Semantic Systems, SEMANTICS 2017, Amsterdam, The Netherlands, September 11-14, 2017. 120–127.
[16] Harsh Thakkar, Dharmen Punjani, Yashwant Keswani, Jens Lehmann, and Sören Auer. 2018. A Stitch in Time Saves Nine – SPARQL querying of Property Graphs using Gremlin Traversals. CoRR abs/1801.02911 (2018).
[17] Harsh Thakkar, Dharmen Punjani, Maria-Esther Vidal, and Sören Auer. 2017. Towards an Integrated Graph Algebra for Graph Pattern Matching with Gremlin. In Proceedings of the 28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part I. Springer, 81–91.
[18] F Zemke. 2006.