Enhancing OBDA Query Translation over Tabular Data with Morph-CSV
David Chaves-Fraga, Edna Ruckhaus, Freddy Priyatna, Maria-Esther Vidal, and Oscar Corcho
Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
{dchaves,fpriyatna,eruckhaus,ocorcho}@fi.upm.es
TIB Leibniz Information Centre for Science and Technology, Germany
[email protected]
Abstract.
Ontology-Based Data Access (OBDA) has traditionally focused on providing a unified view of heterogeneous datasets (e.g., relational databases, CSV, JSON), either by materializing integrated data into RDF or by performing on-the-fly integration via SPARQL-to-SQL query translation. In the specific case of tabular datasets comprised of several CSV or Excel files, query translation approaches have been applied taking as input a lightweight schema with table and column names, and considering each source as a single table that can be loaded into a relational database system (RDB). This naïve approach does not consider implicit constraints in this type of data, e.g., referential integrity among data sources, datatypes, or data integrity; thus, the completeness and performance of query processing can be affected. Our work focuses on explicitly enforcing implicit constraints during OBDA query translation over tabular data. We propose Morph-CSV, a framework that enforces constraints and can be used together with any SPARQL-to-SQL OBDA engine. Morph-CSV resorts to both a Constraints component and a set of operators that apply each type of constraint to the input with the aim of enhancing query completeness and performance. We evaluate Morph-CSV against a set of real-world open tabular datasets in the domain of public transport; Morph-CSV is compared with existing approaches in terms of query result completeness and performance.
Keywords:
OBDA · Tabular Data · Mapping Languages
1 Introduction

Following Open Data principles, governments and private organizations regularly publish large amounts of public data in open data portals. For example, almost a million datasets are available in the European Open Data Portal (EODP). Most of these datasets are available in tabular formats (e.g., CSV, Excel). This is common in most open data portals, see Table 1. The main reason why this format is so popular is its simplicity: many common office tools (e.g., Excel, Calc) are available to facilitate their generation and consumption.
Table 1. Most commonly used formats, and their percentage over the total number of datasets, to expose data in mature EU open data portals in October 2019. Each dataset may be shared in different formats.
Data Portal   1st Format   2nd Format      3rd Format
Spain         CSV (50%)    XLS (35%)       JSON (33%)
Norway        CSV (77%)    GEOJSON (17%)   JSON (14%)
Italy         CSV (76%)    JSON (35%)      XML (25%)
Croatia       XLS (63%)    CSV (40%)       HTML (33%)

However, advanced users (e.g., developers, data scientists) face significant challenges when consuming tabular data. The lack of a unified way to query tabular data presented in other formats (e.g., RDB, JSON, XML) hinders the integration of sources that have datatype inconsistencies. Moreover, data may not be normalized, and metadata about relationships or column names are not always descriptive or homogeneous. Hence, data consumers are usually forced to apply ad-hoc or manual data wrangling processes to make use of the data accessible via open data portals.

Following Linked Data [2] and FAIR initiatives [30], data providers are encouraged to make data available using an RDF-based representation following the 5-star linked data principles (https://5stardata.info/en/). The Ontology-Based Data Access (OBDA) [23] paradigm facilitates the transformation of tabular data into RDF. OBDA uses ontologies as a unified view over a set of data sources, and mappings that describe the relationships between the ontology model and the data sources. The two main OBDA techniques are: materialization, where data are transformed into RDF [13], and virtualization, where SPARQL queries are translated into the underlying data source query language [3,24]. Materialization approaches have been enhanced with features such as functions in mappings [9,18] and source metadata, i.e., annotations [28], so as to deal with the aforementioned challenges of tabular data. In the virtualization approach, the main problem in the context of tabular data is that traditional OBDA query translation engines usually load the tabular data directly into an SQL-based system (e.g., MySQL, Apache Drill, Spark SQL, Presto; see https://github.com/oeg-upm/morph-rdb/wiki/Usage and https://github.com/ontop/ontop/wiki/MappingDesignTips) in order to reuse the proposed query translation techniques, but they do not consider semantics implicit in this type of data, such as relationships among sources, datatypes, or valid data constraints; thus, the completeness and performance of query execution are affected. Completeness is affected because of heterogeneity issues of the data sources (e.g., date columns are treated as simple strings), and performance is affected because the usual relational constraints are not defined over the data sources (i.e., primary and foreign key constraints are not defined in the lightweight schema), so the optimization techniques proposed in the query translation process [3] do not take effect.

Implicit constraints over data sources are usually explicitly defined in mappings and tabular metadata, e.g., the W3C recommendation to annotate tabular data, CSVW [28]. This information has not been included in the majority of OBDA query translation engines [24,15], and those engines that have included it [22] are not fully documented for their use. Examples of these constraints are the standardization of the column format (e.g., dates), integrity constraints, or datatypes. Exploiting these constraints on the fly improves the query translation process in terms of query execution and number of results.
Problem and Proposed Solution.
We address the problems of current OBDA query translation techniques over tabular data. Our main goal is to extend the definition of OBDA with the inclusion of Constraints, and to define a set of operators that apply each type of constraint to the input in order to improve query completeness and performance. Additionally, we describe a set of new steps over the workflow of SPARQL-to-SQL query translation, implement them, and compare the obtained results with previous proposals.
Contributions:
The main contributions of the paper are:

1. Extension of the current OBDA definition [23] with a Constraints component that comprises all of the data restrictions that are represented in mappings and tabular metadata.
2. A workflow for the application of constraints. The proposed framework, Morph-CSV, takes as input a dataset, a set of mappings, tabular metadata, and a SPARQL query. With this input, it automatically: (i) selects the sources and attributes needed to answer the query, (ii) transforms and normalizes the selected sources, and (iii) defines an SQL schema to ensure that the optimizations included in OBDA techniques will be effective.
3. Development, as a proof of concept, of Morph-CSV, an engine that implements source selection and the application of constraints through the exploitation of well-known declarative proposals (RML+FnO [13,9] and CSVW [28] annotations) in combination with mapping translation [7]. The engine can be used by any SPARQL-to-SQL OBDA tool.
4. Evaluation of the impact of the extended OBDA-with-Constraints approach over two well-known open source engines, Morph-RDB [24] and Ontop [3], in the domain of public transport.

The rest of the paper is structured as follows: Section 2 gives a motivating example, in the transport domain, of the problem of OBDA query translation over tabular data. Section 3 describes the identified challenges for querying and integrating tabular data. Section 4 presents Morph-CSV, an approach for enhancing OBDA query translation over tabular data by applying a set of constraints on the fly. Section 5 reports on the results of our empirical study. We present the related work in Section 6, and our conclusions and future work in Section 7.
2 Motivating Example
SELECT ?stop_name ?date1 ?date2
WHERE { ?stop1 gtfs:sameStop ?stop2 .
        ?stop1 gtfs:name ?stop_name .
        ?stop1 gtfs:close_date ?date1 .
        ?stop2 gtfs:close_date ?date2 .
        FILTER (?date1 != ?date2) }

metro_stop.csv                       bus_stop.csv
id,name,date,wheelchair              id,name,date
1,Colonia_jardin,20191225,0          10,Colonia Jardin,20191225-20191231
2,Plaza_de_españa,20200101,1         40,Plaza De España,20200101-20200106
3,Noviciado,20191225,0               500,Noviciado,20191231-20200101

Mapping rules (described in the figure in a Datalog-like notation):
BusStop(w(id)) ← bus_stop(id,name,date)
MetroStop(w(id)) ← metro_stop(id,name,date,wheelchair)
name(w(id),name) ← metro_stop(id,name,date,wheelchair)
close_date(w(id),close_date) ← metro_stop(id,name,date,wheelchair)
name(w(id),name) ← bus_stop(id,name,date)
close_date(w(id),close_date) ← bus_stop(id,name,date)
wheelchair(w(id),wheelchair) ← metro_stop(id,name,date,wheelchair)
sameStop(w(id),u(id)) ← metro_stop(name), bus_stop(name)

Obtained result (?stop_name, ?date1, ?date2):
Noviciado        20191225    20191231-20200101

Expected result (?stop_name, ?date1, ?date2):
Colonia_Jardin   2019-12-25  2019-12-31
Plaza_de_españa  2020-01-01  2020-01-06
Noviciado        2019-12-25  2019-12-31
Noviciado        2019-12-25  2020-01-01
Fig. 1. Motivating example. SPARQL query evaluation over two tabular data files in the transport domain through a common OBDA approach, which loads the files as single tables in an SQL-based system and uses the mapping rules for query translation. The number of results differs from the expected results due to the heterogeneity of the raw data. Additionally, query performance may be affected by the join condition between the two tables, the absence of indexes, and the loading of columns that are not needed to answer the input query (wheelchair).
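To make the completeness gap concrete, the following minimal Python sketch (not part of Morph-CSV; file and column names are taken from Figure 1, and the helper functions are illustrative) shows how normalizing stop names and splitting multi-valued date cells recovers the four expected answers that the naïve join misses:

import csv
import re

def normalize_name(name):
    # Collapse spaces/underscores and lower-case, so that "Colonia Jardin"
    # and "Colonia_jardin" compare equal.
    return re.sub(r"[\s_]+", " ", name).strip().lower()

def normalize_dates(cell):
    # Split multi-valued cells such as "20191231-20200101" and rewrite
    # each value into the standard yyyy-MM-dd format.
    return [f"{d[:4]}-{d[4:6]}-{d[6:]}" for d in cell.split("-") if len(d) == 8]

def load(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Mirrors the SPARQL query of Figure 1: sameStop via normalized names,
# and pairs of closing dates that differ after normalization.
for m in load("metro_stop.csv"):
    for b in load("bus_stop.csv"):
        if normalize_name(m["name"]) == normalize_name(b["name"]):
            for d1 in normalize_dates(m["date"]):
                for d2 in normalize_dates(b["date"]):
                    if d1 != d2:
                        print(m["name"], d1, d2)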
Consider the de-facto standard for publishing open data in the transport domain, GTFS (https://developers.google.com/transit/gtfs/reference/). This model provides information such as schedules, stops, and routes using 15 different inter-related CSV files, called a GTFS feed. Each feed usually specifies the information of one type of transportation mode (e.g., metro, train, or tram). Linking these feeds based on their stops enables route planners to offer multi-modal routes, i.e., routes that can be travelled using various types of transportation. The GTFS feeds from the metro and the buses of the city of Madrid have several stops and stations in common; they are created by different transport authorities, and the names of their stops are defined in different manners. Figure 1 depicts a SPARQL query asking for bus and metro stops with the same name, and information related to their closing dates during holidays. Since GTFS uses temporal identifiers for its resources, links have to be established by joining stop names. However, as is usual in open datasets, stop names do not follow a standard structure (e.g., "Colonia Jardin" in bus_stop.csv and "Colonia_jardin" in metro_stop.csv). A similar issue is present in closing dates, where there are multi-valued cells and their format is not the standard one (e.g., yyyy-MM-dd). Following the approach commonly employed by typical OBDA engines, the two files would be loaded into an SQL system and treated as separate tables. The obtained result set only contains one answer, where the stop names in the two data sources are identical ("Noviciado"). However, the expected result set should include more answers, obtained by performing an improved join among the stop names of the bus and metro feeds and through the normalization of the multi-valued date columns.

3 Challenges for Querying Tabular Data

In this section, we describe challenges that are relevant for querying tabular data using an OBDA approach, and that have not yet been addressed by current proposals. We motivate the reader on the need to deal with these challenges and explain how they can have a negative effect in terms of completeness and performance of query evaluation:

• Selection: Existing approaches load all of the files specified as sources in the mapping rules into an SQL database before executing the query-translation process. This step has to be repeated every time a SPARQL query is evaluated to ensure up-to-date results, resulting in unnecessarily longer loading times and thus affecting performance.

• Normalization: Tabular data formats do not provide restrictions on how to structure data. As a result, cells may contain multiple values, and files may represent multiple entities. Having non-normalized tables may affect the completeness of the query. When a CSV file with multi-valued cells is loaded into an RDB table, each cell's value is interpreted by the RDBMS as an atomic value, thus reducing completeness for queries that filter or "join" on the corresponding column. Representing several entities in a single file may lead to duplicate answers, and in turn decrease query performance.

• Heterogeneity: Tabular data normally contain values that need to be transformed before query evaluation (e.g., insertion of column default values or normalization of date formats). This issue directly impacts query completeness: there may be different formats for the same datatype, or default values may not have been included in the data.

• Lightweight Schema: Most tabular data only provide minimal information about their underlying schema in the form of column names in the header, if at all present. Also, although there is implicit information on keys and relationships among sources, there is no way to specify primary key or foreign key constraints. The same can be said of indexes and datatypes. This type of information is used by OBDA engines for performing semantic optimization in their query evaluation techniques; thus, the lack of this information affects the performance of OBDA engines.

In the next section, we explain how our approach extracts a set of functions from the input (query, mappings, and annotations), and applies them to improve query completeness and performance in an OBDA query translation approach.
4 The Morph-CSV Framework
The formal framework presented in [31] defines an OBDA specification as a tuple P = ⟨O, S, M⟩, where O is an ontology, S is a relational database schema, and M is a set of mappings. In the following, we define our extended specification:

Definition 1. An OBDAT specification is a tuple P_T = ⟨O, S_lw, M_tabular, C⟩ where O is an ontology, S_lw is a lightweight tabular schema described in terms of metadata (e.g., file and column names), M_tabular is a set of mapping rules for tabular data, and C is a set of constraints.

Constraints are conjunctive rules specified for tabular data that restrict the valid data in one or more tables. C is a set of constraints, where each constraint c is a logical statement that expresses the condition that needs to be satisfied by the data in order to be valid. Each constraint is applied through a function. For example, tabular metadata allow expressing a primary key constraint for a table: primaryKey(t, a) is a function that applies this constraint to a source t and a set of columns a in the metadata, and generates a primary key constraint in the output schema. An example of the function with bound variables is primaryKey(metro_stops, {stop_id}).

According to [31], an OBDA instance is defined as I = ⟨P, D⟩, where P is an OBDA specification and D are the instances of the RDB. Similarly, an OBDAT instance is defined as follows:
Definition 2. An OBDAT instance is a tuple I_T = ⟨P_T, D_tabular⟩ where P_T is an OBDAT specification and D_tabular is a tabular dataset composed of a set of data sources, defined as D_tabular = {S_1, ..., S_n}, where each S_i has a set of columns {A_i1, ..., A_im}.

Given an OBDA instance I = ⟨P, D⟩, the function eval(Q, I) retrieves a SPARQL certain answer set [23] that is the result of translating Q from SPARQL to SQL using the mapping rules M defined in P, and then evaluating the query directly over D. The time required to evaluate this function is the sum of the execution times of the different phases of a typical OBDA system, as defined in [19], i.e., the starting, query rewriting, query translation, and execution phases. In previous proposals [22,24], when D = D_tabular, the starting phase time is negatively impacted since the system has to load the data sources into an RDBMS before executing the rest of the phases. Additionally, the performance is also affected during the execution phase due to the absence of integrity constraints over the schema. In the case of an OBDAT instance, the function eval_t(Q, I_T) extends the evaluation function of OBDA through the application of constraints to improve query execution time and answer completeness.
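For concreteness, the motivating example of Figure 1 can be cast as an OBDAT instance as follows (a sketch: the constraint set shown is illustrative, not the complete set Morph-CSV would extract from the mappings and metadata):

\begin{align*}
\mathcal{P}_T &= \langle\, O_{gtfs},\ S_{lw},\ M_{tabular},\ C \,\rangle\\
S_{lw} &= \{\, bus\_stop(id, name, date),\ metro\_stop(id, name, date, wheelchair) \,\}\\
C &\supseteq \{\, primaryKey(metro\_stop, \{id\}),\ datatype(date, metro\_stop, xsd{:}date) \,\}\\
\mathcal{I}_T &= \langle\, \mathcal{P}_T,\ D_{tabular} \,\rangle, \qquad D_{tabular} = \{\, bus\_stop.csv,\ metro\_stop.csv \,\}
\end{align*}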
Problem statement:
Given a function eval_t(Q, I_T), the problem of OBDA query translation over tabular data is defined as the problem of exploiting the knowledge encoded in the constraints defined in I_T such that:

• The number of results obtained in the evaluation of Q over I_T is equal to or greater than the number of results in the evaluation of the same query Q over I, i.e., answ(eval_t(Q, I_T)) ≥ answ(eval(Q, I)), where I is an OBDA instance with D = D_tabular.

• The total execution time of evaluating a SPARQL query Q over I_T is minimized compared to the evaluation of the same query Q over I, i.e., time(eval_t(Q, I_T)) ≤ time(eval(Q, I)), where I is an OBDA instance with D = D_tabular.
Fig. 2. The Morph-CSV framework. Morph-CSV extends the starting phase of a typical OBDA system with a set of steps for dealing with the identified tabular data querying challenges. The framework first extracts the constraints from the query, mappings, and tabular metadata, and then implements them in a set of operators that are run before executing the query translation and query execution phases. The mapping rules for tabular data are translated accordingly into mapping rules for relational databases to allow access to the transformed data sources.
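The workflow of Figure 2 can be summarized by the following Python sketch (a hypothetical driver, not Morph-CSV's actual API; the step functions stand for the operators described in the remainder of this section):

# Hypothetical top-level driver for the extended starting phase of Figure 2.
def starting_phase(query, tabular_mappings, tabular_metadata, sources):
    # Step 1: keep only the files, columns, and rules the query needs.
    sources, tabular_mappings = select_sources(query, tabular_mappings, sources)
    # Step 2: apply split/cut to obtain normalized (2NF/3NF) sources.
    sources, tabular_mappings = normalize(sources, tabular_mappings, tabular_metadata)
    # Step 3: apply sub/create to resolve value-level heterogeneity.
    sources = prepare_data(sources, tabular_mappings, tabular_metadata)
    # Step 4: create the SQL schema (keys, datatypes, indexes) and load the data.
    rdb = create_schema_and_load(sources, tabular_metadata, query)
    # Translate the tabular mappings (e.g., RML+FnO) into R2RML so that any
    # SPARQL-to-SQL OBDA engine can run the query translation over the RDB.
    r2rml_mappings = translate_mappings(tabular_mappings)
    return rdb, r2rml_mappings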
Proposed solution:
We propose Morph-CSV, an alternative to the traditional OBDA workflow for query translation when the input is a tabular dataset. Morph-CSV interprets the function eval_t(Q, I_T) as eval(t(Q, I_T)), where t is a function that pushes down the application of the constraints directly over I_T. In more detail, t(Q, I_T) = I′, where I′ is an enriched OBDA instance obtained after the application of the constraints; hence, this proposal can also reuse the optimizations proposed for query translation over the common OBDA specification [3,24].

We show the workflow of the framework in Figure 2. Morph-CSV receives an OBDAT instance I_T = ⟨P_T, D_tabular⟩ and a query Q, and extends the starting phase of a typical OBDA workflow by including the following steps: source selection, normalization, data preparation, and schema generation. Morph-CSV extracts the constraints used during these steps from the mapping rules M_tabular, represented in a mapping language for tabular data that includes declarative transformation functions, e.g., RML+FnO [10], and from metadata represented as annotations for tabular data, e.g., CSVW [28]. Additionally, Morph-CSV performs a mapping translation step [7] to transform the mapping rules M_tabular into standard OBDA mappings M (e.g., RML+FnO to R2RML), allowing any OBDA system to incorporate this workflow.

We now describe the steps performed in Morph-CSV. The input to all of these steps is an OBDAT instance I_T = ⟨P_T, D_tabular⟩ and a SPARQL query Q. We describe each step in detail together with its corresponding constraints. These constraints, implemented as a set of functions over I_T, address the challenges defined in Section 3 and are independent from the input sources.

Step 1: Source Selection.
Before starting the application of constraints, Morph-CSV takes as input the SPARQL query, the set of mapping rules, and the tabular dataset, and uses the information encoded in the query and in the mapping rules to select only the files and columns from S_lw that are required to answer the query. After that, Morph-CSV also removes the irrelevant rules from the mapping document. The output of this step is a new OBDAT instance I_T′ where the number of results of eval_t(Q, I_T) is the same as that of eval_t(Q, I_T′). This function tackles the Selection challenge identified in Section 3. Note that the following steps are executed over the selected data sources and columns.
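A minimal sketch of this step (the rule representation is a simplification of an RML triples map; the names are illustrative, not Morph-CSV's internal API):

from dataclasses import dataclass

@dataclass
class MappingRule:
    predicate: str      # ontology property the rule produces
    source: str         # CSV file the rule reads
    columns: frozenset  # columns referenced by subject/object templates

def select_sources(query_predicates, rules):
    # Keep only the rules whose predicate occurs in the SPARQL query ...
    relevant = [r for r in rules if r.predicate in query_predicates]
    # ... and, from those, only the files and columns needed to answer it.
    needed = {}
    for r in relevant:
        needed.setdefault(r.source, set()).update(r.columns)
    return needed, relevant

# e.g., for the query of Figure 1, the wheelchair column of metro_stop.csv
# is not referenced by any relevant rule and is therefore never loaded.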
Step 2: Data Normalization.
There are two functions for performing data normalization. The first one is the treatment of multiple values in a column. In this case, Morph-CSV performs the function split(A_i, S_j, sep), where A_i is the multi-valued column of source S_j and sep is the character defined in the metadata that divides the values (e.g., split(date, stops, -)). The application of this function is known as the normalization step for 2NF [6]. The second function is the treatment of multiple entities in the same source. Morph-CSV takes the mapping rules and the lightweight schema and executes the function cut(M_tabular, S_lw). This function analyzes the mapping rules M_tabular and performs a 3NF [6] normalization step over S_lw when there are two sets of mapping rules that have the same source and the intersection of their references is empty. Additionally, both functions rewrite the mapping rules according to the changes performed over the lightweight tabular schema. These functions tackle the Normalization challenge, and their output is a new OBDAT instance I_T′.
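A possible implementation of the split operator (a sketch; Morph-CSV applies it over the selected sources and rewrites the mappings accordingly):

def split(rows, column, sep):
    # One output row per value of the multi-valued column: the 2NF step.
    for row in rows:
        for value in str(row[column]).split(sep):
            yield {**row, column: value}

# e.g., splitting the date column of Figure 1 with separator "-":
# {"id": "500", "name": "Noviciado", "date": "20191231-20200101"} becomes
# two rows, one with date "20191231" and one with date "20200101".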
Step 3: Data Preparation. In this step, Morph-CSV addresses the challenge of Heterogeneity and performs two different functions: sub and create. The first one is defined as sub(exp(A_i), S_j, val), where exp(A_i) is a boolean expression over column A_i of source S_j such that, when it is true, the value of A_i is substituted by val. For example, in the GTFS specification, in the exception_type column of the calendar_dates.csv file, the value 2 means false. To prepare the data to be queried, Morph-CSV extracts that information from the tabular metadata and performs the function sub(exception_type = 2, calendar_dates.csv, false). There are multiple substitution functions that Morph-CSV executes, such as default values, null values, and date formats. The second function creates a new column in a specific source S_j. It is defined as create(c(A_n, ..., A_m), S_j), where c(A_n, ..., A_m) denotes the application of a set of transformation functions over a set of selected columns A_n, ..., A_m from source S_j. This function is used when ad-hoc transformation functions are needed, and it is usually defined inside the mapping rules [18,10]. For example, to give the same structure to stop names in GTFS, the function is defined as create(lower(replace(name, \s+, _)), stops.csv). The output of both functions is a new OBDAT instance I_T′.
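Possible implementations of both operators (a sketch; the concrete conditions and transformations are taken from the tabular metadata and the mapping rules, and the underscore target in the create example below is an assumption matching the replace expression above):

import re

def sub(rows, condition, column, value):
    # Substitute the cell of `column` by `value` whenever `condition` holds,
    # e.g., sub(rows, lambda r: r["exception_type"] == "2",
    #            "exception_type", "false").
    for row in rows:
        yield {**row, column: value} if condition(row) else row

def create(rows, new_column, transform):
    # Materialize an ad-hoc transformation as a new column, e.g.,
    # create(rows, "norm_name",
    #        lambda r: re.sub(r"\s+", "_", r["name"]).lower())
    for row in rows:
        yield {**row, new_column: transform(row)}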
Summary of constraints and corresponding functions applied by Morph-CSV.The proposed steps and their relation with the input, the output performed by eachfunction, and the addressed challenge.
Step Constraint Function Input Output Challenge
DataNormalization 2NF split Column, Source,Separator Schema, Data,Mappings Normalization3NF cut Mappings, Schema SchemaDataPreparation Standardization sub Expression, Column,Source, Value Data Heterogeneitycreate Expression, Columns,Source Schema, DataQuerySchemaCreation andLoad Primary Key primaryKey Column, Source,Query Schema LightweightSchemaForeign Key foreignKey Column, Source,Query SchemaDataType datatype Column, Source,Query SchemaIndex index Column, Source,Query Schema functions are needed and it is usually defined inside the mapping rules [18,10].For example, to give the same structure to stop names in GTFS, the function isdefined as create ( lower ( replace ( name, \ s + , )) , stops.csv ). The output of bothfunctions is a new OBDAT instance IT (cid:48) . Step 4: Schema Creation and Load.
The final step before translating and executing the query is to create an SQL schema by applying a set of constraints, together with a function to load the tabular data sources. The constraints that Morph-CSV incorporates in this step are primaryKey(A_i, S_j), foreignKey(A_i, S_j), index(A_i, S_j), and dataType(A_i, S_j, type), which are typical constraints applied in SQL-based systems. However, Morph-CSV exploits the information encoded in the query to apply a set of heuristics in this process with the aim of improving query execution. First, the integrity constraints (primaryKey, foreignKey) are only applied in the case that the query performs a join among the references of these constraints. Additionally, the application of the index constraints is based on the selectivity of a column A and the query filters, i.e., the engine decides on the fly whether to create an index based on these inputs. Finally, the data are loaded into an RDB following the newly created schema. This step tackles the problem of the Lightweight Schema, and its output is an OBDA instance I′ that can be the input to any SPARQL-to-SQL OBDA engine.

The summary of the steps and the constraints applied is shown in Table 2. In terms of implementation, Morph-CSV accepts as inputs the mapping rules specified in RML [13] together with the extension of the Function Ontology [9]; the tabular annotations follow the W3C recommendation CSV on the Web [28]. The output of the engine is an RDB instance together with R2RML [8] mapping rules, the typical inputs of SPARQL-to-SQL OBDA engines.
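A sketch of how the schema-creation heuristics could emit DDL (the query-inspection helpers joins_on and filtered_columns are hypothetical placeholders; the selectivity-based index decision is omitted for brevity):

def create_schema(table, columns, datatypes, pk, fks, query):
    cols = ", ".join(f"{c} {datatypes.get(c, 'TEXT')}" for c in columns)
    stmts = [f"CREATE TABLE {table} ({cols});"]
    # primaryKey/foreignKey are only enforced if the query joins on them.
    if pk and query.joins_on(table, pk):
        stmts.append(f"ALTER TABLE {table} ADD PRIMARY KEY ({', '.join(pk)});")
    for col, (ref_table, ref_col) in fks.items():
        if query.joins_on(table, [col]):
            stmts.append(f"ALTER TABLE {table} ADD FOREIGN KEY ({col}) "
                         f"REFERENCES {ref_table}({ref_col});")
    # Indexes are created for the columns the query filters on.
    for col in query.filtered_columns(table):
        stmts.append(f"CREATE INDEX idx_{table}_{col} ON {table} ({col});")
    return stmts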
Table 3. Query evaluation performance (time in seconds) over multiple sizes of a GTFS dataset (the number indicates the scale factor: 1, 10, 100, and 1000). The absence of a value means that the OBDA engine does not support the features of the SPARQL query. Execution time is a lower-is-better metric; best results are highlighted in bold.

[Table body partially lost in extraction; rows report Morph-RDB, Morph-CSV & Morph-RDB, Ontop, and Morph-CSV & Ontop over queries Q1, Q2, Q4, Q6, Q7, Q9, Q12, Q13, Q17, plus the geometric mean, for each scale factor. Recoverable entries: GTFS-1, Morph-RDB: 6.94, 3.04, 2.78, and 2.78 s on four queries, with timeouts on at least two others. GTFS-10, Morph-RDB: three timeouts; Morph-CSV & Morph-RDB: 23.99, 5.01, 4.20, 3.84, 4.87, 93.72, 9.58, 4.92, and 5.50 s (geometric mean 8.49). GTFS-100, Morph-RDB: six timeouts; Morph-CSV & Morph-RDB: one timeout; Morph-CSV & Ontop: 127.06 s on one query. GTFS-1000, Morph-RDB: timeouts on all queries; Morph-CSV & Morph-RDB: timeouts on its supported queries; Morph-CSV & Ontop: timeouts on Q1 and Q2, 274.93 s on Q13, and 1252.40 s on Q17 (geometric mean 2055.46).]
5 Evaluation

In this section, we present the empirical evaluation conducted to test the effect of enforcing constraints, on the fly, during OBDA query translation over tabular data. Our aim is to answer the following research questions: RQ1) What is the effect of combining different types of constraints over a tabular dataset? RQ2) What is the impact of the constraints when the tabular dataset size increases? To answer these questions, we set up the following experimental studies:
Datasets and queries.
The GTFS-Madrid Benchmark (paper under review; resources available at https://github.com/oeg-upm/gtfs-bench) consists of an ontology, an initial dataset of the metro system of Madrid following the GTFS model, a set of mappings in several specifications, a set of queries according to the ontology that cover relevant features of the SPARQL query language, and a data generator based on a state-of-the-art proposal [20]. We select the tabular sources of this benchmark (i.e., the CSV files) and scale up the original data in several instances, using the scale factors 10, 100, and 1000. The resources provided by the benchmark address all of the issues in the challenges we identified for querying tabular data in OBDA. We select the queries from the benchmark that include SPARQL features supported by at least one of the OBDA engines used during the evaluation (Q1, Q2, Q4, Q6, Q7, Q9, Q12, Q13, Q17). The description and features of each query are also available online (https://github.com/oeg-upm/gtfs-bench/tree/master/queries).
Table 4. Query completeness over multiple sizes of a GTFS dataset (the number indicates the scale factor: 1, 10, 100, and 1000). The absence of a value means that the OBDA engine does not support the features of the SPARQL query.

[Table body partially lost in extraction; rows report the number of answers of Virtuoso (gold standard), Morph-RDB, Morph-CSV & Morph-RDB, Ontop, and Morph-CSV & Ontop over queries Q1, Q2, Q4, Q6, Q7, Q9, Q12, Q13, Q17 for each scale factor. Recoverable entries: Morph-RDB alone times out on an increasing number of queries as the scale factor grows (all nine at GTFS-1000); at GTFS-1000, Ontop times out on its supported queries, while Morph-CSV & Ontop obtains 420666 answers on Q13 and 855000 on Q17.]

Engines.
We use as our baselines two open-source OBDA engines: Ontop v3.0.0 (https://github.com/ontop/ontop) and Morph-RDB v3.9.15 (https://github.com/oeg-upm/morph-rdb). To evaluate the naïve approach, we manually generate a relational database schema without constraints and measure the load and query execution times. In order to measure the impact of the additional steps proposed by Morph-CSV, we integrate our solution on top of the two OBDA engines. To ensure the reproducibility of the experiments, we also provide all of the resources in a Docker image (https://zenodo.org/badge/latestdoi/219717229). In order to test the number of answers, the benchmark also provides a gold-standard dataset in RDF that we loaded into a Virtuoso triple store, where we run the selected queries (http://gtfs-bench.linkeddata.es/sparql).

Metrics.
We measure the total query execution time, including all of the steps proposed by Morph-CSV in the starting phase, and the number of answers obtained. Each query was executed 5 times with a timeout of 2 hours in cold mode. The experiments were run on an Intel(R) Xeon(R) machine with a CPU E5-2603 v3 @ 1.60GHz, 20 cores, and 100G of memory, running Ubuntu 16.04 LTS.

The experimental evaluation of the query execution time is shown in Table 3. When analyzing the results, we generally observe that the incorporation of Morph-CSV in the workflow of the OBDA engines enhances the performance of the queries. With respect to the results for each query, we observe, on the one hand, that the behaviour of the engines over simple queries (Q1, Q2, Q3, Q4, Q6) is similar. This is understandable, as the data sources selected to answer the query do not require the application of several constraints (e.g., there are no joins in the query). On the other hand, complex queries such as Q7, Q9, Q12, Q13, and Q17, where the tabular sources needed to answer the query have several implicit constraints, have a negative impact on the naïve approach. For example, in the case of query Q7, the naïve approach is not able to answer the question in time (less than 2 hours), while in the case of Morph-CSV, applying a large number of constraints over the tabular data sources and then running the query with the same engine, the query is answered in reasonable time for all of the datasets.

If we analyze the results obtained for each dataset, we observe that for small datasets (GTFS-1) the cost of applying the proposed steps of Morph-CSV impacts the total execution time. However, when the size of the dataset increases, the naïve approach is penalized by having to load all of the input data sources into the RDB before executing the query: low performance is reported for GTFS-100, and timeouts are reported for all of the queries against GTFS-1000. Thanks to the application of the constraints and to the source selection step, Morph-CSV together with Morph-RDB or Ontop returns the results of the queries with high performance most of the time. In the cases where Morph-CSV reports a timeout (e.g., Q1 in GTFS-1000), it is because the number of results that should be obtained is so high that the OBDA engine used cannot process them.

In terms of query completeness, the results are shown in Table 4. We observe that the numbers of answers are similar. However, one of the main benefits of including Morph-CSV in the OBDA query translation process is that the user can decide which engine to use based on the support of features in the input SPARQL queries. For example, in this case, queries Q4, Q5, Q7, Q9, and Q12 contain features that are only supported by Morph-RDB; hence, the user will obtain a larger number of results for the benchmark if this engine is used. Another important characteristic of Morph-CSV that these results show is that, although it performs several steps selecting, modifying, and normalizing the data sources, there is no negative impact on the number of answers, obtaining the same results as the gold standard.
6 Related Work

In this section, we first refer to previous works on integration systems that precede the OBDA approach; then we refer to the general techniques used in systems that handle raw data; next, we describe current Ontology-Based Data Integration (OBDI) systems that handle tabular data; and finally, we describe existing tabular annotation languages and the use of transformation functions in mappings.

The most relevant concept that predates the OBDA data integration approach is that of the mediator [29], defined in the early 90's by Wiederhold. In the proposed architecture for information systems, mediators form a middle layer that makes user applications independent of the data resources. The idea is to transform heterogeneous data sources into a common data model, which can then be processed and integrated. Classical examples of systems that implemented the original mediator architecture were TSIMMIS [5], Information Manifold [25], and GARLIC [26]. Through the years, these ideas were formalized, evolving from the use of description logics [4] to the use of ontologies as a common model for data access [3]. Morph-CSV follows a data integration approach that not only uses ontologies but also exploits additional information from mappings, tabular annotations, and the query.

Related to our work are those approaches that allow querying information stored directly in flat files: [16], Drill (https://drill.apache.org/), and NoDB [1]. These systems provide a layer where "raw" data is queried; they adaptively load and store the data, and then execute the query using an assortment of strategies. [16] extends a column store, whereas [1] extends a relational DBMS, and Drill loads the data into NoSQL databases. Current Ontology-Based Data Integration (OBDI) open-source systems that take tabular data as input are Ontario [15] and Squerall [22]. Ontario is a federated query processing approach for heterogeneous data sources. For source selection, Ontario uses source descriptions named RDF Molecule Templates [14], which keep information on the sources. The system handles tabular data among other formats, and implements a virtualization approach to query answering, focusing on techniques for efficient execution. Similarly, Squerall is a system that implements OBDI for heterogeneous data sources. It takes input data and mappings and offers a middleware that is able to aggregate the intermediate results in a distributed manner. Although the aforementioned systems evaluate queries against raw tabular data and exploit some information encoded in the query, they do not exploit the constraints declared in annotations or mapping rules to enhance this process.

CSV on the Web (CSVW) is a W3C proposal for the definition of metadata on CSV files, such as datatypes, valid values, data transformations, and primary and foreign key constraints. A related W3C proposal defines procedures and rules for the generation of RDF from tabular data, and there are a few implementations that refer to this proposal. The CSV2RDF tool is presented in [21]: the authors define algorithms to transform CSV data into RDF using CSVW metadata annotations, and their experimental study uses datasets from the CSVW Implementation Report (https://w3c.github.io/csvw/tests/reports/index.html). Another tool, COW: Converter for CSV on the Web (https://csvw-converter.readthedocs.io/en/latest/), allows the conversion of datasets expressed in CSV and uses a JSON schema expressed in an extended version of the CSVW standard. Both tools focus on RDF materialization.
To the best of our knowledge, no existing system exploits the information in CSVW annotations for querying tabular data in an OBDA approach.

Another area related to our work is the definition and application of data transformation functions. An approach independent of a specific implementation context is described in [11]. It enables the description, publication, and exploration of functions, and the instantiation of associated implementations. The proposed model is the Function Ontology, and the publication method follows the Linked Data principles. Previous works related to this topic focused on developing ad-hoc and programmed functions. For example, R2RML-F [12] allows using functions in the value of the rr:objectMap property, so as to modify the values of the table columns from a relational database. KR2RML [27], used in Karma, also extends R2RML by adding transformation functions in order to deal with nested values. OpenRefine allows such transformations with the usage of GREL functions, which can also be used in its RDF extension. Morph-CSV uses the extension of RML together with the Function Ontology [10], which allows incorporating ad-hoc transformation functions over the data sources in a declarative way.

7 Conclusions and Future Work

In this paper, we have presented an extension of the common OBDA specification, OBDAT, to address the problem of query translation over tabular data. We describe and evaluate Morph-CSV, a framework that exploits the information of the constraints defined in the OBDAT specification. It pushes down the application of these elements in order to improve query evaluation and query completeness. One of the main contributions of this proposal is that it can be used together with any OBDA framework.

The definition, application, and optimization of new functions and constraints to address other challenges for querying tabular data is one of the main lines of future work [17]. We also want to study the performance of the proposed workflow over OBDA distributed query systems such as the ones proposed in [15,22]. In addition, we will study the challenges for querying other data formats (e.g., XML, JSON) in an OBDA context and extend our approach to incorporate them. Finally, we will adapt this proposal for a materialization process, and study its effects in comparison with previous proposals.
References
1. I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: efficient query execution on raw data files. In Proc. ACM SIGMOD, pages 241–252. ACM, 2012.
2. C. Bizer, T. Heath, and T. Berners-Lee. Linked data: The story so far. In Semantic services, interoperability and web applications: emerging concepts, pages 205–227. IGI Global, 2011.
3. D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-Muro, and G. Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3):471–487, 2017.
4. D. Calvanese, G. De Giacomo, and M. Lenzerini. Description Logics for Information Integration, pages 41–60. Springer Berlin Heidelberg, 2002.
5. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Information Processing Society of Japan (IPSJ 1994), 1994.
6. E. F. Codd. Extending the database relational model to capture more meaning. ACM Transactions on Database Systems (TODS), 4(4):397–434, 1979.
7. O. Corcho, F. Priyatna, and D. Chaves-Fraga. Towards a New Generation of Ontology Based Data Access. Semantic Web Journal, 2019.
8. S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF Mapping Language. W3C Recommendation, 27 September 2012.
9. B. De Meester, A. Dimou, R. Verborgh, and E. Mannens. An ontology to semantically declare and describe functions. In ISWC, pages 46–49. Springer, 2016.
10. B. De Meester, W. Maroy, A. Dimou, R. Verborgh, and E. Mannens. Declarative data transformations for Linked Data generation: the case of DBpedia. In ESWC, pages 33–48. Springer, 2017.
11. B. De Meester, T. Seymoens, A. Dimou, and R. Verborgh. Implementation-independent function reuse. Future Generation Computer Systems, 2019.
12. C. Debruyne and D. O'Sullivan. R2RML-F: Towards Sharing and Executing Domain Logic in R2RML Mappings. In LDOW @ WWW, 2016.
13. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In LDOW, 2014.
14. K. M. Endris, M. Galkin, I. Lytra, M. N. Mami, M.-E. Vidal, and S. Auer. Querying interlinked data by bridging RDF molecule templates. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX, pages 1–42. Springer, 2018.
15. K. M. Endris, P. D. Rohde, M.-E. Vidal, and S. Auer. Ontario: Federated Query Processing Against a Semantic Data Lake. In International Conference on Database and Expert Systems Applications, pages 379–395. Springer, 2019.
16. S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki. Here are my data files. Here are my queries. Where are my results? In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, 2011.
17. A. Iglesias-Molina, D. Chaves-Fraga, F. Priyatna, and O. Corcho. Enhancing the Maintainability of the Bio2RDF Project Using Declarative Mappings. In Proceedings of the 12th International Conference on Semantic Web Applications and Tools for Healthcare and Life Sciences, 2019.
18. A. C. Junior, C. Debruyne, R. Brennan, and D. O'Sullivan. FunUL: a method to incorporate functions into uplift mapping languages. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, pages 267–275. ACM, 2016.
19. D. Lanti, M. Rezk, M. Slusnys, G. Xiao, and D. Calvanese. The NPD benchmark for OBDA systems. CEUR Electronic Workshop Proceedings, 2014.
20. D. Lanti, G. Xiao, and D. Calvanese. VIG: Data scaling for OBDA benchmarks. Semantic Web, (Preprint):1–21.
21. S. M. H. Mahmud, M. Hossin, H. Jahan, S. Noori, and M. Hossain. CSV2RDF: Generating RDF data from CSV files using semantic web technologies. Journal of Theoretical and Applied Information Technology, 96, 2018.
22. M. N. Mami, D. Graux, S. Scerri, H. Jabeen, S. Auer, and J. Lehmann. Squerall: virtual ontology-based access to heterogeneous and large data sources. In International Semantic Web Conference, pages 229–245. Springer, 2019.
23. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. In Journal on Data Semantics X, pages 133–173. Springer, 2008.
24. F. Priyatna, O. Corcho, and J. Sequeda. Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph. In Proc. of WWW, pages 479–490. ACM, 2014.
25. A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of VLDB, 1996.
26. M. T. Roth and P. M. Schwarz. Don't scrap it, wrap it! A wrapper architecture for legacy data sources. In VLDB, volume 97, pages 25–29, 1997.
27. J. Slepicka, C. Yin, P. A. Szekely, and C. A. Knoblock. KR2RML: An Alternative Interpretation of R2RML for Heterogeneous Sources. In COLD, 2015.
28. J. Tennison, G. Kellogg, and I. Herman. Model for tabular data and metadata on the web. W3C Recommendation. World Wide Web Consortium (W3C), 2015.
29. G. Wiederhold. Mediators in the architecture of future information systems. Computer, 25(3):38–49, 1992.
30. M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.
31. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Zakharyaschev. Ontology-based data access: A survey. In Proceedings of IJCAI, 2018.