The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

Abstract

In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently, the majority of these solutions have been replaced with the more modern JSON-based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately, current research lacks a clear comparison of the scalability and performance of database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison of selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences, we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.


Ciprian-Octavian Truică a,1,2,∗, Elena-Simona Apostol a,1,∗, Jérôme Darmont c, Torben Bach Pedersen d

a Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Romania
b Department of Computer Science, Aarhus University, Aarhus, Denmark
c Université de Lyon, Lyon 2, ERIC UR 3083, France
d Center for Data Intensive Systems, Aalborg University, Aalborg, Denmark


Keywords: XML Database Management Systems; JSON Database Management Systems; Document-Oriented Database Management Systems; Benchmark

∗ Corresponding author.
These authors contributed equally to this article.
Part of this work was done at Aarhus University.

Preprint submitted to Elsevier February 5, 2021

1. Introduction

With the emergence of Big Data and the Internet of Things (IoT) and the increasing amount of semi-structured information generated daily, new technologies have arisen for storing, managing, and extracting information and patterns from such data. The new technologies for storing data have been labeled with the name NoSQL and were initially developed to solve very specific problems. Currently, they provide different trade-offs and functionality (e.g., choosing high-availability over consistency) to be as generic as their counterparts, Relational Database Management Systems (RDBMSes). Due to the semi-structured nature of data, NoSQL Database Management Systems (DBMSes) have been classified based on the data model used for storing information [1], i.e., key-value, document-oriented, wide column, and graph databases.

In this paper, we particularly study NoSQL Document-Oriented Database Systems (DODBMSes) that encode data using the XML or JSON formats. We further focus on two subcategories of DODBMSes with respect to the data model used to encode documents: i) DODBMSes that encode data using the XML format are Native XML Database Management Systems (XDBMSes), and ii) DODBMSes that encode data using the JSON format are JSON Database Management Systems (JDBMSes).

NoSQL DBMSes became very popular with the increasing need for data storage, management, and analysis systems that scale with the volume. To address these needs, many NoSQL DBMSes compromise consistency to offer high-availability, partition tolerance, improved analytics, and high-throughput. These features are also a requirement for real-time web applications and Big Data processing and analysis and are available in JDBMSes as well.

XDBMSes started to emerge after the eXtensible Markup Language (XML) was standardized and became the common format for exchanging data between different applications running on the Web. Their primary use was to facilitate secure storage and fast querying of XML documents. Besides their primary use, XDBMSes prove useful for OLAP (Online Analytical Processing) style analysis and decision support systems that incorporate a time dimension and encode data in the XML format [2], thus removing the need to use ETL (Extract Transform Load) processes to transform XML documents into a relational model. XML query languages and technologies, including XDBMSes, had been around before the NoSQL trend, and have been forgotten during the Big Data hype. In the field of relational databases, the XML format is used as a data type, e.g., in Oracle, DB2, PostgreSQL, etc. Currently, with the rise of the NoSQL movement, XDBMSes have become a subcategory of DODBMSes. However, with the emergence of processing platforms that use Big Data or IoT technologies, where the data are transferred over computer networks in formats such as XML and JSON, XDBMSes can be seen as a viable solution for storing and manipulating computer-generated semi-structured data.

We hypothesize that the more classical XDBMSes may still be useful in the Big Data era. Thus, in this study we want to address and use as guidelines the following research questions:

Q1: Are XDBMSes obsolete and should they be replaced by JDBMSes?

Q2: Are XDBMSes a viable candidate for Big Data management?

Q3: Do JDBMSes outperform XDBMSes when using complex filtering and aggregation queries with different scale factors, on large and heterogeneous datasets?

To test our hypothesis and answer our research questions, we consider the following research objectives: i) discuss XDBMSes and compare their capabilities and features with several popular JDBMS solutions; ii) propose a benchmark that evaluates the current needs and workloads available in Big Data and compare performance between the selected DODBMSes; iii) evaluate the performance of the selected DODBMSes using complex filtering and aggregation queries with different scale factors, on large and heterogeneous datasets.

For testing and analyzing with our proposed benchmark, we utilize several XDBMS and JDBMS solutions that are free to use and whose licenses do not forbid benchmarking. Thus, we chose BaseX, eXist-db, and Sedna as representative XDBMSes and MongoDB, CouchDB, and Couchbase as JDBMS solutions.

As a result of our research and as a response to Q1, we claim that the more classical XML-based DODBMSes may still be useful in the Big Data era. To demonstrate this and answer Q2, we propose a new benchmark for comprehensive DODBMS analysis using a large dataset. Thereby, we present a qualitative and quantitative performance comparison between XDBMSes and the more modern JDBMSes to answer Q3.

This paper is structured as follows. Section 2 presents an overview of different NoSQL DBMS models, surveys, and benchmarks. Section 3 offers an in-depth overview and comparison of DODBMSes, focusing on the XDBMS and JDBMS subcategories. Section 4 introduces the proposed benchmark specification and discusses the data and workload models, while Section 5 discusses the database physical implementation and presents the description of the benchmark’s queries. Section 6 thoroughly details the experiments performed on the selected DODBMSes using our benchmark and discusses the results in detail. Finally, Section 7 concludes the paper, summarizes the results, and provides future research perspectives.

2. Related Works

The NoSQL Database Management Systems (DBMSes) emerged as an alternative to Relational Database Management Systems (RDBMSes) in order to store and process huge amounts of heterogeneous data. However, NoSQL DBMSes did not appear as a replacement for RDBMSes, but as a solution to specific problems that require additional features (e.g., replication, high-availability, etc.) that are not handled well by traditional means [3]. The reasons commonly given to develop and use NoSQL DBMSes are summarized as follows [4]: avoidance of unneeded complexity, high throughput, horizontal scalability, running on commodity hardware, avoidance of expensive object-relational mapping, lowering the complexity and the cost of setting up a cluster, compromising reliability for better performance, and adapting to the requirements of cloud computing.

NoSQL DBMSes are usually classified by taking into account either the persistence model or the data and query model. Using the persistence model, NoSQL DBMSes are classified as follows [4]:

i) In-Memory Databases [5] are very fast because the data currently in use are stored in memory, with optional subsequent disk flushes triggered at given periods or when the in-memory data are not used. Evidently, the size of the currently in-use data that can be stored is limited to the amount of memory. This problem can be resolved using vertical scaling only to some degree, as there is a limit to the amount of memory a system can hold. Moreover, durability may become a problem if data are lost between subsequent disk flushes or if data persistence is disabled. A solution to this problem is data replication.

ii) Memtables and SSTables Databases [6] buffer operations in memory using a Memtable after they have been written to an append-only commit log to ensure durability. After a certain amount of writes, the Memtable gets flushed to disk as a whole into an SSTable. These DBMSes have performance characteristics comparable to those of In-Memory Databases but solve the durability problem.

iii) B-trees Databases [7] use the B-tree self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time [8].

NoSQL DBMSes are also classified by using the data and query model as follows [1, 9]:

i) Wide Column Databases are used to store, retrieve, and manage data using column families. Each record can have different numbers of cells and columns, making a row sparse without storing NULLs.

ii) Graph Databases are used to store, retrieve, and manage information using a graph. Therefore, an object is modeled as a node and the edges between nodes become the relationships between the objects.

iii) Key-Value Databases (KVDBMSes) are data storage systems designed for storing, retrieving, and managing associative arrays, i.e., dictionaries or hash tables.

iv) Document-Oriented Databases (DODBMSes) have evolved from KVDBMSes and are used to store, retrieve, and manage semi-structured data, i.e., documents, encoded using JSON, BSON, XML, or YAML formats.

There are multiple surveys on NoSQL DBMSes; below, we present the ones most relevant to our analysis. Article [10] provides a comparison regarding the performance and flexibility of KVDBMSes and DODBMSes over RDBMSes. The NoSQL DBMSes prove to be a better choice for high throughput applications that require data modeling flexibility and horizontal scaling. The authors of [1] offer a classification of NoSQL DBMSes by data model, and also present the current and most popular solutions.
In [11], the authors provide a comparison and overview of NoSQL data models, query types, concurrency controls, partitioning, and replication. Article [12] presents a top-down overview of the NoSQL database field and proposes a comparative classification model that relates functional and non-functional requirements to techniques and algorithms employed in these systems. The authors of [13] provide an overview of XML data manipulation techniques employed in conventional and temporal XDBMSes and study the support of such functionality in mainstream commercial DBMSes. Unfortunately, the paper presents only a general discussion about XDBMSes and other DBMSes with XML manipulation capabilities, and no evaluation is provided. Thus, we can conclude that none of these surveys presents an in-depth discussion and comparison of different subcategories of DODBMSes.

In the literature there are many data-centric benchmarks for Big Data distributed systems and NoSQL DBMSes that focus either on structured data or on specific applications, such as MapReduce-based applications, rather than on unstructured data or data variety. In [14], the authors present a comprehensive survey and analysis of benchmarks for different types of Big Data systems, including NoSQL systems. The authors of [15] present a new benchmark for textual data for distributed systems, including MongoDB. None of the current literature presents benchmarks for modern native XDBMSes.

XDBMS benchmarks are application-oriented and domain-specific, e.g., OpenEHR XML medical records [16], XMark, which contains documents extracted from electronic commerce sites and content providers [17], or Transaction Processing over XML (TPoX) [18], which simulates a financial multi-user workload with XML data conforming to the FIXML standard. These benchmarks are used for testing the performance of DBMSes that are capable of storing, searching, modifying, and retrieving XML data. Unfortunately, the majority of these benchmarks use rather small collections. Even for the benchmarks where the XML or JSON document size is up to the order of Gigabytes (GBs), the contained information is mostly homogeneous. Our proposed benchmark solution uses large heterogeneous collections with over 6 million records to test the scalability, filtering, and aggregation performance of complex queries for the current native XDBMSes.

Given the lack of current literature on XDBMSes, in this paper we analyze the performance and functionality of DODBMS solutions, while focusing on two distinct subclasses that use the JSON or XML formats to encode data.
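To make the Memtable and SSTable persistence model described earlier in this section more concrete, the following is a minimal sketch in Python. It is an illustration only and is not taken from any of the surveyed systems: the class name `TinyLSMStore`, the `memtable_limit` parameter, and the use of in-memory lists in place of files on disk are all assumptions made for brevity.

```python
import json
from typing import Dict, List, Optional, Tuple


class TinyLSMStore:
    """Toy store (hypothetical): commit log -> Memtable -> sorted SSTables."""

    def __init__(self, memtable_limit: int = 3) -> None:
        self.memtable_limit = memtable_limit  # entries held before a flush
        self.commit_log: List[str] = []       # stands in for an append-only file
        self.memtable: Dict[str, str] = {}    # in-memory write buffer
        self.sstables: List[List[Tuple[str, str]]] = []  # flushed, sorted runs

    def put(self, key: str, value: str) -> None:
        # 1) Append to the commit log first, so the write is not lost
        #    if the in-memory buffer disappears.
        self.commit_log.append(json.dumps({"k": key, "v": value}))
        # 2) Buffer the write in the Memtable.
        self.memtable[key] = value
        # 3) Flush the Memtable as a whole once it holds enough entries.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # The SSTable is written in one go, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # the logged writes now live in the SSTable

    def get(self, key: str) -> Optional[str]:
        # The newest data win: check the Memtable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

A real implementation would write the log and the SSTables to disk and would merge SSTables in the background; the sketch only shows the order of operations that gives in-memory write speed while keeping a durable record of each write.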

3. Document-Oriented Databases

Document-Oriented Database Management Systems (DODBMSes) have evolved from Key-Value Databases [1]. DODBMSes are used for storing, retrieving, and managing semi-structured data. They have a schema-less data representation, thus providing more flexibility for data modeling [19]. DODBMSes store data as documents encoded in formats such as XML or JSON. The flexibility provided by XML and JSON makes it easier to manipulate the information than it is for tables in Relational Database Management Systems (RDBMSes). Usually, documents are stored in collections. A Native XML Database Management System (XDBMS) uses the XML (eXtensible Markup Language) data structure to encode documents and defines a hierarchical logical model based on the elements of this markup language [20, 21]. A JSON Database Management System (JDBMS) uses the JSON structure for modeling documents and storing them in collections.

In DODBMSes, labels are used in storing the information. These labels describe the data and values in a record. New information can be added directly to a record without the need to modify the entire schema, as is the case for RDBMSes.

One of the benefits of using a DODBMS solution is the flexibility of modeling the data [22]. Data from the web, mobile, social, and IoT devices change the nature of the application’s data model. In an RDBMS, these changes impose the modification of the schema by altering tables and adding or removing columns. In contrast, the flexibility of DODBMSes eliminates the need to force-fit the data into predefined attributes and tables.

Another benefit of a DODBMS is the fast write performance. Some DODBMSes prioritize high availability over strict data consistency. This ensures that both read and write operations will always be executed even if there is a hardware or network failure. In case of failure, the replication and eventual consistency mechanisms ensure that the environment will function.

Fast query performance is another benefit of a DODBMS. Most DODBMSes provide powerful query engines for CRUD (Create, Read, Update, and Delete) operations and use indices and secondary indices to improve data retrieval. Additionally, the majority of DODBMS solutions support aggregation frameworks, either native or using MapReduce, for Data Analysis and Business Intelligence.

In this subsection, we present several examples of XDBMSes that use standardized XPath and XQuery. Although there are multiple DBMS solutions that incorporate XML as a data type (e.g., Oracle, PostgreSQL, DB2, MS SQL, to name just a few), the majority of them fall outside the NoSQL movement. Furthermore, some have licenses that explicitly forbid benchmarking, e.g., commercial XDBMSes such as MarkLogic Server and Oracle Berkeley DB XML. Thus, for our comparison and benchmark, we chose the following three XDBMSes: BaseX, eXist-db, and Sedna.

BaseX

BaseX is an XDBMS written in Java that stores the data using a schema-free hierarchical model. Transactions in BaseX respect the ACID (Atomicity, Consistency, Isolation, and Durability) properties, enabling the concurrent access of multiple readers and writers [23]. Documents are stored either persistently on disk or in the main memory. BaseX runs as a single instance; replication and data partitioning are not available.

BaseX provides CRUD operations and ad-hoc queries, including aggregation, using XQuery 3.1 and XPath 3.1 [24]. Although it works with various APIs such as XML DB or JAX-RX, it was not designed to work with a MapReduce framework.

BaseX supports multiple structural and value indices [23]. Structural indices are automatically created and include: i) name indices to reference the names of all elements and attributes, ii) path indices to store distinct paths of the documents in the database, and iii) document indices to reference all document nodes. Value indices are user-defined. They include: i) text indices for documents’ text nodes to improve the performance of exact and range queries, ii) attribute indices to speed up comparisons on attribute values, iii) token indices to improve queries on multi-token attribute values, and iv) full-text indices to index normalized tokens of text nodes and speed up queries which contain text expressions.

eXist-db

eXist-db [25] is an XDBMS implemented in Java that stores documents in the XML format. It stores data in-memory using Document Object Model (DOM) trees.

Although eXist-db does not have support for database-level transaction control, it has transactions internally, transparent to the user, and also has a persistent journal that is used to ensure the durability and consistency of the stored data. Database consistency is checked automatically or by using a sanity checker that detects inconsistencies or damage in the core database files [26].

eXist-db supports primary-secondary data replication, thus allowing applications to be distributed over multiple servers through the use of the Java Message Service (JMS) API. Although replication is available, data partitioning or sharding and distributing queries across multiple servers are not.

eXist-db provides CRUD operations and ad-hoc queries for filtering and aggregation using XQuery 3.1 and XPath 3.1 [24]. Unfortunately, it does not have MapReduce functionality, which would offer more flexibility to the aggregation queries.

eXist-db supports four types of indices [27]: i) range indices that provide range and field-based searches, ii) text indices for full-text search, iii) n-gram indices for improving the performance of n-gram search, and iv) spatial indices for querying data using geometric characteristics, although this feature is currently experimental.

Sedna

Sedna is an XDBMS written in C that stores documents in the XML format [28]. Sedna provides ACID transactions, indexing, and persistent storage [29]. It uses the main memory to improve query performance [30]. Replication and partitioning are not implemented in Sedna.

Like the other XDBMSes, Sedna provides CRUD operations and ad-hoc queries for filtering and aggregation, using XQuery 1.0 and XPath 2.0. However, it does not provide MapReduce functionality for working with these queries.

Value indices are used to index elements’ content and attributes. Full-text indices can be created in Sedna to facilitate full-text search using XQuery.
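As an illustration of the kind of path-based filtering and aggregation that the XDBMSes above expose through XPath and XQuery, the following sketch uses Python's standard `xml.etree.ElementTree` module, which supports only a limited XPath subset. It does not use BaseX, eXist-db, or Sedna, and the sample document, its element names, and the helper function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written sample; element names are assumptions.
SAMPLE = """
<dblp>
  <article>
    <author>Author A</author><author>Author B</author>
    <title>Indexing XML documents</title><year>2019</year>
  </article>
  <article>
    <author>Author A</author>
    <title>JSON stores at scale</title><year>2020</year>
  </article>
  <book>
    <author>Author B</author>
    <title>XML query processing</title><year>2020</year>
  </book>
</dblp>
"""

root = ET.fromstring(SAMPLE)


def titles_containing(term):
    # Path step "./*/title" selects the title of every record type.
    # ElementTree has no contains() function, so the substring test
    # is done in Python instead of inside the path expression.
    return [t.text for t in root.findall("./*/title") if term in t.text]


def records_per_author():
    # Aggregation: count how many records each author appears in.
    return Counter(a.text for a in root.findall("./*/author"))


def records_in_year(year):
    # Predicate [year='...'] keeps records whose <year> child has that text.
    return root.findall(f"./*[year='{year}']")
```

In a full XQuery engine the substring filter and the grouping would be expressed inside the query itself; here they are split between the path expression and plain Python because of the limited XPath support in the standard library.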

DODBMSes are designed for storing, retrieving, managing, and processing semi-structured data in the form of documents. With the rise of the NoSQL movement, multiple DODBMS solutions, both proprietary and open-source, have been implemented. An important subcategory of these systems is JDBMS, which consists of systems that use the JSON format for document encoding. For our comparison, we chose three of the more popular open-source JDBMSes: MongoDB, CouchDB, and Couchbase.

MongoDB

MongoDB is a DODBMS developed in C++ that focuses on combining the critical capabilities of RDBMSes with the innovations of NoSQL DBMSes. MongoDB uses a flexible, dynamic schema to store data. A record is stored in a document and multiple documents are stored in a collection. Documents in a collection do not necessarily have the same structure, so the number of attributes and their data types can differ from one record to another. In practice, documents usually model objects from a high-level programming language. Although the database allows documents with a different number of attributes and different data types for the same attributes, records in a collection usually have almost the same structure [31].

MongoDB stores the data in BSON documents. BSON is a binary-encoded serialization of JSON-like documents. This format is easily parsed and lightweight with respect to the overhead needed to store data.

Transactions in MongoDB respect the BASE (Basically Available, Soft state, Eventual consistency) transaction model, which ensures that all the modification operations will propagate to all the nodes in an asynchronous way. MongoDB uses Causal Consistency, which enables operations to logically depend on preceding operations [32], and in-memory functionalities to improve the query execution time. Furthermore, this JDBMS supports multi-document transactions with ACID data integrity guarantees.

To achieve redundancy and data availability, MongoDB uses Replica Sets for primary-secondary replication. A replica set is a group of MongoDB instances that store the same dataset. To partition the data and distribute it across multiple machines, MongoDB uses Sharding. Sharding is a horizontal scaling mechanism that partitions and balances the data on multiple nodes or replica sets.

MongoDB supports CRUD operations and ad-hoc querying through the use of a JavaScript API available in the MongoDB client. The Aggregation Pipeline framework is a multi-stage pipeline that transforms documents into aggregated results using the concepts of data processing pipelines. Aggregation can also be achieved using the MapReduce framework.

MongoDB supports primary and secondary indexing. These indices can be single-field, compound (multikey), geospatial, hashed, or text indices. Text indices enable full-text search.
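The idea of a multi-stage pipeline that turns documents into aggregated results can be sketched in plain Python. The code below is an emulation written for illustration and is not the MongoDB API: the stage builders `match`, `unwind`, and `group_count` are hypothetical helpers loosely modeled on MongoDB's `$match`, `$unwind`, and `$group` stages, and the sample documents are invented.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Doc = Dict[str, Any]
Stage = Callable[[Iterable[Doc]], Iterable[Doc]]


def match(predicate: Callable[[Doc], bool]) -> Stage:
    # Keep only the documents that satisfy the predicate.
    return lambda docs: (d for d in docs if predicate(d))


def unwind(field: str) -> Stage:
    # Emit one copy of the document per element of a list-valued field.
    def stage(docs: Iterable[Doc]) -> Iterable[Doc]:
        for d in docs:
            for value in d.get(field, []):
                yield {**d, field: value}
    return stage


def group_count(field: str) -> Stage:
    # Group by the value of a field and count the documents in each group.
    def stage(docs: Iterable[Doc]) -> Iterable[Doc]:
        counts: Dict[Any, int] = defaultdict(int)
        for d in docs:
            counts[d[field]] += 1
        return [{"_id": key, "count": n} for key, n in sorted(counts.items())]
    return stage


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    # Each stage consumes the output of the previous one.
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Documents in one collection need not share the same attributes.
collection = [
    {"title": "XML indexing", "authors": ["Author A", "Author B"], "year": 2019},
    {"title": "JSON stores", "authors": ["Author A"], "year": 2020, "pages": "1-10"},
    {"title": "Old survey", "authors": ["Author B"], "year": 2001},
]

report = run_pipeline(
    collection,
    [match(lambda d: d["year"] >= 2019), unwind("authors"), group_count("authors")],
)
```

Because every stage takes and returns a stream of documents, stages can be reordered or extended independently, which is the design property the pipeline concept relies on.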

CouchDB

CouchDB is an open-source DODBMS developed in Erlang that provides a schema-free model for storing self-contained data using the JSON format [33].

DB-Engines ranking: https://db-engines.com/en/ranking/document+store

Couchbase

Couchbase is a highly-scalable DODBMS that stores documents using the JSON encoding. It offers high availability, horizontal scaling, and high transaction throughput [37].

Transactions in Couchbase respect the ACID properties and rely on Eventual Consistency and Immediate Consistency. Couchbase has in-memory capabilities and keeps records in buckets. The buckets are of the following types: i) Couchbase buckets, used to store data persistently and in-memory, ii) Ephemeral buckets, used when persistence is not required, and iii) Memcached buckets, used to cache frequently-used data and minimize the number of queries a database server must perform.

Couchbase uses a shared-nothing architecture and provides primary-primary and primary-secondary replication, as well as partitioning through the use of sharding. Couchbase scales horizontally in a cluster.

Ad-hoc data querying is achieved using a JavaScript API or a SQL-like language, i.e., N1QL (Non-1NF Query Language) [38]. These languages enable Couchbase to have OLTP (Online Transaction Processing) CRUD operations and ETL (Extract Transform Load) capabilities [39]. JavaScript MapReduce Views can be developed and stored on the server side to specify complex indexing and aggregation queries [40].

Couchbase provides multiple types of indices [40]: i) composite indices to index multiple attributes, ii) covering indices to index the information needed for querying without accessing the data, iii) filtered (partial) indices to index a subset of the data used by the WHERE clause, iv) function-based indices that compute the value of an expression over a range of documents, v) sub-document indices to index embedded structures, vi) incremental MapReduce views to index the results of complex queries that perform sorting and aggregation to support real-time analytics over very large datasets, vii) spatial views using Spatial MapReduce to index multi-dimensional numeric data, and viii) full-text indices used for full-text search capabilities.

Table 1 summarizes the main features of the presented databases. BaseX, Sedna, and Couchbase offer ACID-compliant transactions, whereas MongoDB offers BASE-compliant multi-document isolation transactions and CouchDB offers document-level ACID transactions with MVCC. XDBMSes support transaction consistency, while MongoDB and CouchDB support causal consistency and eventual consistency, respectively. Couchbase supports both eventual and immediate consistency. A disadvantage of XDBMSes is that they do not have replication or partitioning mechanisms, except for eXist-db, which offers primary-secondary replication. An advantage of XDBMSes is the use of XQuery and XPath for querying the data, which makes ad-hoc querying an easy task. Although XDBMSes support aggregation queries, they do not provide MapReduce frameworks as a result of the lack of distribution capabilities. Another advantage of XDBMSes is that they offer different types of indices, including text indices for full-text search. As can be seen from Table 1, the chosen JDBMS solutions also offer different types of indices, but, in contrast to JDBMSes, the indices used in XDBMSes can also be added on properties and paths, not only on keys and values.

Table 1: DODBMS comparison

| Feature | BaseX | eXist-db | Sedna | MongoDB | CouchDB | Couchbase |
|---|---|---|---|---|---|---|
| DBMS type | XDBMS | XDBMS | XDBMS | JDBMS | JDBMS | JDBMS |
| Data format | XML | XML | XML | BSON (Binary JSON) | JSON | JSON |
| Implementation | Java | Java | C | C++ | Erlang | C/C++, Go, Erlang |
| Transaction | ACID | Isolation safe | ACID | BASE; Multi-document isolation | Document-level ACID with MVCC | ACID |
| Consistency | Transaction Consistency | Automatic consistency; Sanity checker | Transaction Consistency | Causal Consistency | Eventual Consistency | Eventual Consistency; Immediate Consistency |
| In-memory | Yes | Yes | Yes | Yes | No | Yes |
| Replication | No | Primary-Secondary | No | Primary-Secondary | Primary-Primary; Primary-Secondary | Primary-Primary; Primary-Secondary |
| Partitioning | No | No | No | Sharding | Sharding | Sharding |
| Ad-hoc queries | XQuery 3.1; XPath 3.1 | XQuery 3.1; XPath 3.1 | XQuery 1.0; XPath 2.0 | JavaScript | Mango | N1QL; JavaScript |
| MapReduce | No | No | No | Yes | Yes | Yes |
| Secondary indices | Yes | Yes | Yes | Yes | Yes | Yes |
| Geospatial indices | No | No | Yes | Yes | Yes | Yes |
| Text indices | Yes | Yes | Yes | Yes | Yes | Yes |

4. Benchmark Specifications

For our benchmark, we proposed a heterogeneous entity-relationship schema that can be easilyexpanded with more complex relationships and new entities. Figure 1 presents the proposedschema. The model’s entities are described below.•

Authors is the entity that stores information about authors. Besides the unique identifier for each author, AuthorID, the attribute Name is used for storing the name of each author.

• Records contains information about the published work of one or more authors. It stores the Title, the URL for quick access on the web, and the publishing Year. The many-to-many relationship WrittenBy correlates each record with its authors. A record can be published either as a book (or book chapter) or as an article (conference or journal). The relationship IsA is used for denoting the sub-type of a record.

• Books is the first sub-type of a record. This entity stores the following information: i) the unique book identifier ISBN, ii) the pages of a record using the attribute Pages, iii) the book editors using the multi-valued attribute Editors, and iv) the type of a record of this sub-type, i.e., book or book chapter, using the attribute Type. The one-to-many relationship PublishedBy is used to correlate each record of sub-type Book to a Publisher.

• Articles is the second sub-type of a record. Besides the unique identifier of a record in this sub-type, the entity Articles stores information about i) the pages of a record using the attribute Pages, and ii) the type of a record of this sub-type, i.e., conference or journal article, using the attribute Type. The one-to-many relationship PublishedIn is used to correlate each article to a journal.

• Journals is the entity that stores information about an article's publication venue. The attributes are: i) ISSN used as the unique identifier, ii) Type used to determine if the publication is a journal, proceedings, or special issue, iii) Title used for keeping the title of the journal or the conference name, iv) Volume used to store the number of years since the first publication, and v) Issue used to store how many times the journal has been published during a year. The one-to-many relationship PublishedBy is used to correlate each record of sub-type Journal to a Publisher.

• Publishers is the entity that stores a unique identifier and the Name of a publishing house.
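As a minimal sketch, the entities and relationships listed above could be modeled with Python dataclasses. The class and field names below (e.g., `record_type`, `venue_type`, `publisher_id`) are our own illustrative assumptions and are not taken from the benchmark implementation; the IsA relationship is approximated with inheritance.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Publisher:
    publisher_id: int  # unique identifier of the publishing house
    name: str


@dataclass
class Author:
    author_id: int
    name: str


@dataclass
class Journal:
    issn: str  # unique identifier
    venue_type: str  # "journal", "proceedings", or "special issue"
    title: str
    volume: int
    issue: int
    publisher: Optional[Publisher] = None  # PublishedBy


@dataclass
class Record:
    record_id: int
    title: str
    url: str
    year: int
    authors: List[Author] = field(default_factory=list)  # WrittenBy (many-to-many)


@dataclass
class Book(Record):  # IsA sub-type of Record
    isbn: str = ""
    pages: str = ""
    editors: List[str] = field(default_factory=list)  # multi-valued attribute
    record_type: str = "book"  # "book" or "book chapter"
    publisher: Optional[Publisher] = None  # PublishedBy


@dataclass
class Article(Record):  # IsA sub-type of Record
    pages: str = ""
    record_type: str = "journal"  # "conference" or "journal"
    journal: Optional[Journal] = None  # PublishedIn
```

Inheritance is only one possible way to express the sub-type relationship; a `record_type` discriminator on a single class would work equally well for the denormalized documents used later.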

Figure 1: Database entity-relational diagram

4.2. Workload Model

The workload model follows two analysis directions: i) selection queries for filtering the corpus and extracting subsamples, and ii) aggregation queries for creating reports.

For the selection queries, a constraint $c_i = contains(\mathit{Records.Title}, t_i)$ is used to extract the records whose title contains a given term $t_i$. The constraint $c_i$ utilizes the $contains(\cdot, \cdot)$ function, which verifies whether a substring $t_i \in \{ t \mid t \in vocabulary \}$ belongs to a string. In this case, the vocabulary is the set of terms extracted from each title using tokenization.

Aggregation queries are used to create reports about the publishing activity of each author. These reports are created by counting the number of published records using attributes for grouping. To achieve this, we apply the aggregation operator $\gamma_L$ with $L = (F, G)$, where $F$ is the list of aggregation functions, and $G$ is the list of attributes in the GROUP BY clause. We use the Authors.Name attribute in the GROUP BY clause to create an overview report of the publication activity for each author over their entire academic life. To determine the publishing patterns by year of each author, we use the Records.Year attribute, which adds a time dimension to the previous report. For a more in-depth analysis of each published topic by author, we also use the $c_i$ constraint to filter the dataset by keywords before counting the number of articles.
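The two workload directions can be illustrated with a small in-memory sketch. The toy records, the case-insensitive matching, and the helper names (`contains`, `select`, `count_by`) are assumptions made for illustration only; they are not the benchmark's code.

```python
from collections import Counter

# Hypothetical toy records; titles, authors, and years are invented.
RECORDS = [
    {"title": "Text mining in practice", "authors": ["A. One", "B. Two"], "year": 2019},
    {"title": "Database tuning", "authors": ["A. One"], "year": 2020},
    {"title": "Mining database logs", "authors": ["B. Two"], "year": 2020},
]


def contains(title: str, term: str) -> bool:
    """Constraint c_i: True when term occurs as a substring of title.

    Case-insensitive matching is an assumption of this sketch.
    """
    return term.lower() in title.lower()


def select(records, term):
    """Selection: keep the records whose title satisfies the constraint."""
    return [r for r in records if contains(r["title"], term)]


def count_by(records, key_fn):
    """Aggregation with F = {count} and the grouping attributes G given by key_fn.

    Each record contributes one row per author, mirroring the Authors-Records join.
    """
    counts = Counter()
    for record in records:
        for author in record["authors"]:
            counts[key_fn(author, record)] += 1
    return dict(counts)


per_author = count_by(RECORDS, lambda a, r: a)
per_author_year = count_by(select(RECORDS, "mining"), lambda a, r: (a, r["year"]))
```

Here `per_author` corresponds to grouping by Authors.Name only, while `per_author_year` first filters by keyword and then adds Records.Year as a second grouping attribute.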

5. Benchmark Implementation

The conceptual entity-relational diagram described in Section 4 must be translated into the XML and JSON formats (Figure 2). For the XML representation (Figure 2a), the attributes of entities are directly encoded in the elements' names, e.g., the Article.Type is directly encoded into the journal label. In the case of the Authors entity, the authors associated with an article are represented as multiple tags with the same name, i.e., author. For the JSON representation (Figure 2b), the Authors entity becomes a list of values, i.e., the label authors. The information regarding an article is stored directly in the document using labels, e.g., type, publication year, etc. Using this representation, both schemes are greatly simplified and the need for relationships between entities disappears.
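The translation described above can be sketched with the Python standard library. The record content and the exact label names are hypothetical and do not reproduce Figure 2; the sketch only shows the two ideas from the text: repeated `author` elements in XML versus a single `authors` list in JSON.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; all field values are invented for illustration.
record = {
    "type": "article",
    "title": "An example title",
    "year": 2020,
    "authors": ["First Author", "Second Author"],
    "journal": "Example Journal",
}


def to_xml(rec: dict) -> str:
    """Encode the record type in the element name and repeat <author> per author."""
    root = ET.Element(rec["type"])
    for name in rec["authors"]:
        ET.SubElement(root, "author").text = name
    ET.SubElement(root, "title").text = rec["title"]
    ET.SubElement(root, "year").text = str(rec["year"])
    ET.SubElement(root, "journal").text = rec["journal"]
    return ET.tostring(root, encoding="unicode")


def to_json(rec: dict) -> str:
    """Keep the authors as a list of values under a single `authors` label."""
    return json.dumps(rec)


xml_doc = to_xml(record)
json_doc = to_json(record)
```

In both encodings the many-to-many WrittenBy relationship is embedded in the document itself, which is why no separate relationship structure is needed.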

Figure 2: Document representation in XML and JSON: (a) XML document, (b) JSON document

5.2. Query Description

The proposed benchmark features nine queries with different complexity and selectivity, i.e., $Q_1$ to $Q_9$. The first five queries are used to filter the dataset based on different constraints, whereas the last four queries are used to filter and group the data in order to obtain aggregated results.

The first set of queries selects the records that respect a given constraint.

The first query ($Q_1^{i}$) uses the constraint $c_i$ to extract the documents which contain in their title a given term $t_i$ (Equation (1)). The projection for the query, which specifies the set of selected attributes following the query execution, is $\Pi = \{\mathit{Records.Title}\}$.

$$Q_1^{i} = \pi_{\Pi}(\sigma_{c_i}(\mathit{Records})) \tag{1}$$

The second query ($Q_2^{ij}$) extracts the records that contain in their title two terms (Equation (2)). It uses the constraints $c_s$, $s \in \{i, j\}$ with $i \neq j$. The query is written using the INTERSECTION operator between the results returned by $Q_1^{i}$ for term $t_i$ and $Q_1^{j}$ for term $t_j$. Due to the nature of the filtering conditions, we can combine the separate conditions into a single conditional expression using the logical and operator ($\wedge$), i.e., $c_i \wedge c_j$. As in the case of the first query, the projection remains $\Pi$.

$$Q_2^{ij} = Q_1^{i} \cap Q_1^{j} = \pi_{\Pi}(\sigma_{c_i}(\mathit{Records})) \cap \pi_{\Pi}(\sigma_{c_j}(\mathit{Records})) = \pi_{\Pi}(\sigma_{c_i \wedge c_j}(\mathit{Records})) \tag{2}$$

The third query ($Q_3^{ij}$) extracts the records that contain in their title at least one of the terms given through the $c_i$ or $c_j$ constraints, with $i \neq j$ (Equation (3)). The query is written using the UNION operator between the results returned by $Q_1^{i}$ for term $t_i$ and $Q_1^{j}$ for term $t_j$. The projection remains $\Pi$. As for query $Q_2^{ij}$, the conditions can be combined into a single conditional expression using the logical or operator ($\vee$), i.e., $c_i \vee c_j$.

$$Q_3^{ij} = Q_1^{i} \cup Q_1^{j} = \pi_{\Pi}(\sigma_{c_i}(\mathit{Records})) \cup \pi_{\Pi}(\sigma_{c_j}(\mathit{Records})) = \pi_{\Pi}(\sigma_{c_i \vee c_j}(\mathit{Records})) \tag{3}$$

The fourth query ($Q_4$) filters the Records entity and extracts the documents that contain in their title the terms $t_i$, $t_j$, and $t_k$ (Equation (4)). As for the previous queries, the projection attributes are given using $\Pi$. The query is written using the INTERSECTION operator between the results obtained by $Q_1^{i}$, $Q_1^{j}$, and $Q_1^{k}$ for terms $t_i$, $t_j$, and $t_k$, respectively. Due to the nature of the filtering conditions, they can be combined into one constraint $c_i \wedge c_j \wedge c_k$.

$$\begin{aligned} Q_4^{ijk} &= Q_1^{i} \cap Q_1^{j} \cap Q_1^{k} \\ &= \pi_{\Pi}(\sigma_{c_i}(\mathit{Records})) \cap \pi_{\Pi}(\sigma_{c_j}(\mathit{Records})) \cap \pi_{\Pi}(\sigma_{c_k}(\mathit{Records})) \\ &= \pi_{\Pi}(\sigma_{c_i}(\mathit{Records}) \cap \sigma_{c_j}(\mathit{Records}) \cap \sigma_{c_k}(\mathit{Records})) \\ &= \pi_{\Pi}(\sigma_{c_i \wedge c_j \wedge c_k}(\mathit{Records})) \end{aligned} \tag{4}$$

The last selection query ($Q_5$) extracts the documents that contain in their title one or more of the searched terms $t_s$, $s \in \{i, j, k\}$ with $i \neq j \wedge i \neq k \wedge j \neq k$. The query is written using the UNION operator between the results obtained by each $Q_1^{s}$ for the $t_s$ terms. The nature of the filtering constraints permits the query to be written using one constraint $c_i \vee c_j \vee c_k$ and the projection $\Pi$ (Equation (5)).

$$\begin{aligned} Q_5^{ijk} &= Q_1^{i} \cup Q_1^{j} \cup Q_1^{k} \\ &= \pi_{\Pi}(\sigma_{c_i}(\mathit{Records})) \cup \pi_{\Pi}(\sigma_{c_j}(\mathit{Records})) \cup \pi_{\Pi}(\sigma_{c_k}(\mathit{Records})) \\ &= \pi_{\Pi}(\sigma_{c_i}(\mathit{Records}) \cup \sigma_{c_j}(\mathit{Records}) \cup \sigma_{c_k}(\mathit{Records})) \\ &= \pi_{\Pi}(\sigma_{c_i \vee c_j \vee c_k}(\mathit{Records})) \end{aligned} \tag{5}$$

The last four queries use aggregation to count the number of articles using different filtering constraints and attributes in the GROUP BY clause.

The sixth query ($Q_6$) uses aggregation to determine the number of articles written by each author (Equation (6)). It uses a JOIN operation between the Records and Authors entities. Because there is a many-to-many relationship between the two entities, the JOIN also traverses WrittenBy. The projection attributes are $\Pi = \{\mathit{Authors.Name}, count\}$. To determine the number of articles for each author, we use the aggregation operator $\gamma_L$, where $L = (F, G)$. The list of aggregation functions is $F = \{count(\mathit{Records.RecordID})\}$, where $count$ is the counting aggregation function. The set of attributes in the GROUP BY clause is $G = \{\mathit{Authors.Name}\}$.

$$Q_6 = \pi_{\Pi}(\gamma_L(\mathit{Authors} \bowtie \mathit{Records})) \tag{6}$$

The seventh query ($Q_7$) counts the number of articles published by an author for each year (Equation (7)). The query makes use of a JOIN operation between the Records and Authors entities, as in the case of query $Q_6$. The projection uses the attributes $\Pi = \{\mathit{Authors.Name}, \mathit{Records.Year}, count\}$. To determine the number of articles written in a year by each author, we use the aggregation operator $\gamma_L$, where $L = (F, G)$. For query $Q_7$, the list of aggregation functions is $F = \{count(\mathit{Records.RecordID})\}$, where $count$ is the counting function used for determining the number of articles written in a year by each author. The set of attributes in the GROUP BY clause is $G = \{\mathit{Authors.Name}, \mathit{Records.Year}\}$.

$$Q_7 = \pi_{\Pi}(\gamma_L(\mathit{Authors} \bowtie \mathit{Records})) \tag{7}$$

The eighth query ($Q_8$) extracts the documents that contain in their title all of the searched terms, and then it counts the number of articles grouped by author and year. As in the previous aggregation queries, the JOIN operation is between the Records and Authors entities. The query is written using the INTERSECTION operator. The filtering is done using the constraints $c_i$, $c_j$, $c_k$, which ensure that the title contains all terms $t_i$, $t_j$, and $t_k$ with $i \neq j \wedge i \neq k \wedge j \neq k$. The projection attributes and the aggregation operator remain the same as in the case of $Q_7$, i.e., $\Pi$ and $\gamma_L$. Due to the nature of the filtering conditions, the query can be rewritten using only one constraint $c_i \wedge c_j \wedge c_k$.

$$\begin{aligned} Q_8 &= \pi_{\Pi}(\gamma_L(\sigma_{c_i}(\mathit{Records} \bowtie \mathit{Authors}) \cap \sigma_{c_j}(\mathit{Records} \bowtie \mathit{Authors}) \cap \sigma_{c_k}(\mathit{Records} \bowtie \mathit{Authors}))) \\ &= \pi_{\Pi}(\gamma_L(\sigma_{c_i \wedge c_j \wedge c_k}(\mathit{Records} \bowtie \mathit{Authors}))) \end{aligned} \tag{8}$$

The last query ($Q_9$) extracts the documents that contain in their title one or more of the searched terms $t_s$, $s \in \{i, j, k\}$ and $i \neq j \wedge i \neq k \wedge j \neq k$, by filtering through the use of the constraints $c_s$. The JOIN operator is used once again between the Records and Authors entities, as in the previous aggregation queries. The projection attributes and the aggregation operator remain unchanged, i.e., $\Pi$ and $\gamma_L$. The filtering constraints $c_i$, $c_j$, $c_k$ are applied on the Records entity. The query uses the UNION operator between the relations obtained after filtering. Due to the nature of the filtering, the query can be rewritten using one constraint $c_i \vee c_j \vee c_k$.

$$\begin{aligned} Q_9 &= \pi_{\Pi}(\gamma_L(\sigma_{c_i}(\mathit{Records} \bowtie \mathit{Authors}) \cup \sigma_{c_j}(\mathit{Records} \bowtie \mathit{Authors}) \cup \sigma_{c_k}(\mathit{Records} \bowtie \mathit{Authors}))) \\ &= \pi_{\Pi}(\gamma_L(\sigma_{c_i \vee c_j \vee c_k}(\mathit{Records} \bowtie \mathit{Authors}))) \end{aligned} \tag{9}$$
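The rewriting steps in the equations above (set operators replaced by a single combined constraint) and the shape of the aggregation queries can be checked on a toy corpus. This is only a sketch: the records, the lowercase matching, and the helper names (`c`, `sigma`, `pi_title`, `join_authors`, `gamma_count`) are assumptions introduced here, not the benchmark's implementation.

```python
from collections import Counter

# Hypothetical toy corpus; every value is invented for illustration.
RECORDS = [
    {"id": 1, "title": "Text mining for database logs", "year": 2019, "authors": ["Ana", "Bo"]},
    {"id": 2, "title": "Database text search", "year": 2020, "authors": ["Ana"]},
    {"id": 3, "title": "Graph mining", "year": 2020, "authors": ["Bo", "Cy"]},
    {"id": 4, "title": "Query optimization", "year": 2020, "authors": ["Cy"]},
]


def c(term):
    """Build the constraint c_i = contains(Records.Title, t_i)."""
    return lambda rec: term in rec["title"].lower()


def sigma(pred, records):
    """Selection: keep the records that satisfy pred."""
    return [r for r in records if pred(r)]


def pi_title(records):
    """Projection on Records.Title, returned as a set so set operators apply."""
    return {r["title"] for r in records}


def join_authors(records):
    """Records joined with Authors: one (author, record) row per authorship."""
    return [(a, r) for r in records for a in r["authors"]]


def gamma_count(rows, group_key):
    """Aggregation with F = {count(RecordID)} and G given by group_key."""
    return dict(Counter(group_key(a, r) for a, r in rows))


c1, c2, c3 = c("database"), c("text"), c("mining")

# Two-term intersection: set form versus single AND constraint.
q2_sets = pi_title(sigma(c1, RECORDS)) & pi_title(sigma(c2, RECORDS))
q2_pred = pi_title(sigma(lambda r: c1(r) and c2(r), RECORDS))

# Three-term union: set form versus single OR constraint.
q5_sets = (pi_title(sigma(c1, RECORDS)) | pi_title(sigma(c2, RECORDS))
           | pi_title(sigma(c3, RECORDS)))
q5_pred = pi_title(sigma(lambda r: c1(r) or c2(r) or c3(r), RECORDS))

# Count per author, and count per (author, year) after the three-term AND filter.
q6 = gamma_count(join_authors(RECORDS), lambda a, r: a)
q8 = gamma_count(join_authors(sigma(lambda r: c1(r) and c2(r) and c3(r), RECORDS)),
                 lambda a, r: (a, r["year"]))
```

On this toy data both forms of each selection query return the same titles, which is the property that lets each DODBMS evaluate a single combined predicate instead of separate sub-queries.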

6. Experiments

All tests were run on an IBM System x3550 M4 with 64GB of RAM and an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. The XDBMSes used for benchmarking are BaseX, eXist-db, and Sedna. For comparison, we also use three JDBMSes: MongoDB, CouchDB, and Couchbase. We chose these DODBMSes because they are free to use and because their licenses do not forbid benchmarking.

The versions of the deployed DODBMSes are listed in Table 2. The proposed benchmark, the results, and the dataset used are publicly available online (GitHub sources: https://github.com/cipriantruica/The-Forgotten-DODBMSes).

As the chosen XDBMS solutions do not have partitioning, we could not distribute them. Therefore, we deployed and tested them on a single-instance environment. Moreover, for comparison, we also used a single-instance environment for MongoDB, CouchDB, and Couchbase.

Table 2: Benchmarked DODBMSes

DODBMS      Version
BaseX       9.3.3
eXist-db    5.2.0
Sedna       3.5
MongoDB     4.2.7
CouchDB     3.1.0
Couchbase   6.5.1

The query parameterization is presented in Table 3. Each term $t_i$ ($i = \overline{1,3}$) is used for filtering the records through the constraint $c_i$. Thus, for the first set of queries, i.e., $Q^{i}$, $Q^{ij}$, and $Q^{ijk}$, the $i$, $j$, and $k$ indices ($i \neq j \wedge i \neq k \wedge j \neq k$) represent the index $i' \in \overline{1,3}$ of the term $t_{i'}$ used for filtering.

Table 3: Query parameter values

Parameter   Value
$t_1$       database
$t_2$       text
$t_3$       mining

The experiments are performed on 6 150 738 records extracted from DBLP (http://dblp.org/). The initial dataset is split into different subsets to test the scalability of each DODBMS w.r.t. the number of records. These subsets contain 768 842, 1 537 685, 3 075 369, and 6 150 738 records, respectively. Each subset allows scaling experiments and is associated with a scale factor (SF) parameter, where $SF \in \{0.125, 0.25, 0.5, 1\}$. Table 4 presents the size of the subsets, both as raw data and as the resulting DODBMS collection size.

Table 4: Dataset

Columns: SF, No. Records, Raw XML, Raw JSON, and the DB size for BaseX, eXist-db, Sedna, MongoDB, CouchDB, and Couchbase.
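The scale factors follow directly from the subset sizes reported above: each subset is a fixed fraction of the full corpus. A short check, in which only the record counts come from the text and the function name is our own:

```python
TOTAL_RECORDS = 6_150_738
SUBSET_SIZES = [768_842, 1_537_685, 3_075_369, 6_150_738]


def scale_factor(subset_size: int, total: int = TOTAL_RECORDS) -> float:
    """Return the fraction of the full corpus, rounded to three decimals."""
    return round(subset_size / total, 3)


scale_factors = [scale_factor(n) for n in SUBSET_SIZES]
```

Each subset therefore doubles the previous one, so a linear increase in response time should show up as a doubling of runtime from one SF to the next.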

Data are stored within each DODBMS using a denormalized schema; thus, one-to-many and many-to-many relationships are encapsulated inside the same document. To achieve denormalization, JDBMSes employ nested documents, lists, and lists of nested documents, while XDBMSes use the hierarchical structure of the XML format. To normalize the information and apply filtering and aggregation operations and functions, we use the native syntax, operators, query language clauses, and frameworks provided by each DODBMS. Table 5 presents the implementation language and operators.

Table 5: Filtering and aggregation queries

Database    Filtering Query     Aggregation Queries
BaseX       XQuery 3.1          XQuery 3.1 syntax for sorting and grouping
eXist-db    XQuery 3.1          XQuery 3.1 syntax for sorting and grouping
Sedna       XQuery 1.0          XQuery 1.1 syntax for sorting and grouping
MongoDB     JavaScript          JavaScript Aggregation Pipeline with unwind operator
CouchDB     JavaScript/Mango    JavaScript/Mango Materialized Views
Couchbase   N1QL                N1QL with UNNEST clause

For the XDBMSes, we implemented the queries using XQuery. The aggregation queries for BaseX and eXist-db use the XQuery 3.1 syntax for sorting and grouping, i.e., FOR ... WHERE ... GROUP BY ... ORDER BY ... . For Sedna, we use the XQuery 1.1 syntax for sorting and grouping, i.e., FOR ... WHERE ... LET ... ORDER BY ... . We used the native Command Line Interfaces to run these queries.

The aggregation queries in MongoDB are implemented using its Aggregation Pipeline framework. To deal with nested documents, the unwind operator is used to flatten an array field of nested documents. This operator is useful when trying to normalize the one-to-many and many-to-many relationships, which through denormalization are stored in the JSON format as nested documents or lists of nested documents. We used the native Command Line Interfaces to run these queries.

CouchDB uses Materialized Views for aggregation and to deal with nested documents and lists of nested documents. These views are implemented using CouchDB's MapReduce framework. The mapper function is used to flatten nested documents and filter the field. The reducer function is used for applying an aggregation function and returning the final result. We used cURL to run these queries.

To manipulate nested arrays in Couchbase, N1QL offers developers the UNNEST clause. This clause is used to flatten the arrays in the parent document. Thus, the UNNEST clause conceptually performs a JOIN operation between nested arrays and the parent document. As data are stored using the JSON format, the JOIN operation increases the runtime and decreases the overall retrieval performance. For Couchbase, we used the native Command Line Interfaces to run these queries.
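The flattening step shared by the unwind operator, the UNNEST clause, and the MapReduce mapper can be emulated in a few lines. The sketch below runs entirely in memory and does not talk to any DODBMS; the documents, the field name `authors`, and the helper functions are assumptions. The `PIPELINE` value only shows the general shape of a MongoDB aggregation pipeline built from the `$unwind` and `$group` stages.

```python
from collections import defaultdict

# Hypothetical documents; field names and values are assumptions.
DOCS = [
    {"title": "Text mining", "year": 2019, "authors": ["Ana", "Bo"]},
    {"title": "Database systems", "year": 2020, "authors": ["Ana"]},
]

# Shape of a MongoDB pipeline that counts records per author (field names assumed).
PIPELINE = [
    {"$unwind": "$authors"},
    {"$group": {"_id": "$authors", "count": {"$sum": 1}}},
]


def unwind(docs, field):
    """Emulate unwind / UNNEST: one output document per element of the array field."""
    for doc in docs:
        for value in doc[field]:
            flat = dict(doc)
            flat[field] = value
            yield flat


def group_count(docs, field):
    """Emulate a grouping stage that counts documents per value of field."""
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[field]] += 1
    return dict(counts)


def map_fn(doc):
    """Mapper in the MapReduce style: emit one (key, value) pair per author."""
    for author in doc["authors"]:
        yield author, 1


def reduce_fn(pairs):
    """Reducer: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


via_unwind = group_count(unwind(DOCS, "authors"), "authors")
via_mapreduce = reduce_fn(pair for doc in DOCS for pair in map_fn(doc))
```

Both routes produce the same per-author counts; the difference discussed in the text is when the work happens, at query time for the pipeline style and at view-build time for materialized views.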

Selectivity, i.e., the amount of retrieved data ($n(Q)$) w.r.t. the total amount of available data ($N$), depends on the number of attributes in the WHERE and GROUP BY clauses. The selectivity formula used for a query $Q$ is $S(Q) = 1 - \frac{n(Q)}{N}$. For the selection queries, we set $N$ equal to the cardinality of the Records entity, i.e., $N = \|\mathit{Records}\|$. Table 6 presents the filtering queries' selectivity w.r.t. the SF. The queries with more restrictive conditions return a smaller number of records and the selectivity is higher, e.g., $Q_2^{ij}$. The queries with more inclusive restrictions return a higher number of records and the selectivity is lower, e.g., $Q_3^{ij}$.

Table 6: Filter queries selectivity

Columns: SF and the eleven filtering queries.

For the aggregation queries, we set $N$ equal to the number of records returned by joining the entities Records with Authors, i.e., $N = \|\mathit{Authors} \bowtie \mathit{Records}\|$. Table 7 shows the aggregation queries' selectivity w.r.t. the SF factor. Query Q is the most restrictive query. Because of the filtering and grouping conditions, Q returns a small number of records, and its selectivity is almost equal to 1. The most inclusive query is Q, and it has a low selectivity w.r.t. SF. Because of the less restrictive filtering and grouping conditions, the selectivity of this query is less than . . The selectivity of Q increases w.r.t. SF, meaning that the number of records returned by the query increases more gradually than the size of the corpus.

Table 7: Aggregation queries selectivity

Columns: SF and the four aggregation queries.

We use the query response time as the only metric for the benchmark. It is symbolized for each query by $t(Q_i^{*})$, $\forall i \in [1, 9]$. All queries are executed times, which is sufficient according to the central limit theorem. Additionally, all executions are warm runs, i.e., either the caching mechanisms are deactivated, or a cold run is performed first, in which each query is executed once (but not taken into account in the benchmark's results) to fill the cache. Queries must be written in the native scripting language of the target DODBMS and executed directly inside the specified system using the command line interpreter. Lastly, the average response time and standard deviation are computed for each $t(Q_i^{*})$.

Figure 3 presents the results of $Q_1^{i}$, where $i = \overline{1,3}$ is used to denote the keyword $t_i$. MongoDB and BaseX offer the fastest time performance among the DODBMSes that encode documents using JSON and XML, respectively, regardless of the keyword w.r.t. SF. For the Q query, which has the lowest selectivity of the three $Q_1^{i}$ queries, CouchDB is ∼ x faster than eXist-db w.r.t. SF. The time performance of CouchDB and eXist-db for Q and Q tends to become the same w.r.t. SF, i.e., the performance difference factor between CouchDB and eXist-db at SF = 0 . is ∼ . x, which increases to ∼ . x for SF = 1. CouchDB is ∼ . x faster than Couchbase for all the $Q_1^{i}$ queries regardless of SF. Couchbase and eXist-db have similar performance for query Q and SF = 1. Sedna's performance is almost constant regardless of query selectivity w.r.t. SF. The overall best performance is achieved by MongoDB.

Figure 4 presents the results of the $Q_2^{ij}$ and $Q_3^{ij}$ queries, where $i$ and $j$ indicate the $t_i$ and $t_j$ keywords used for filtering (Table 3) with $i = \overline{1,3}$, $j = \overline{1,3}$, and $i \neq j$. For this set of queries, MongoDB has the best overall time performance regardless of the SF factor.
BaseX achieves the second overall best performance and the best performance among the tested XDBMSes, regardless of the SF. For the $Q_2^{ij}$ set of queries, MongoDB is between ∼ . x and ∼ . x faster than BaseX w.r.t. SF. For the $Q_3^{ij}$ set of queries, MongoDB is between ∼ . x and ∼ . x faster than BaseX w.r.t. SF.

Couchbase presents the highest execution time for the $Q^{ij}$ queries regardless of SF, followed by the execution time of CouchDB. CouchDB is ∼ . x and ∼ . x faster than Couchbase for the $Q_2^{ij}$ and $Q_3^{ij}$ queries, respectively, regardless of SF. The eXist-db XDBMS has the worst performance for the $Q^{ij}$ set of queries regardless of the SF. For the $Q_2^{ij}$ set of queries, Sedna is ∼ x faster than CouchDB and x slower than eXist-db. For the $Q_3^{ij}$ set of queries, Sedna is ∼ . x faster than CouchDB and ∼ x slower than BaseX.

Figure 5 presents the time performance of the $Q_4$ and $Q_5$ queries for each DODBMS w.r.t. SF. The time performance trend for $Q_4$ and $Q_5$ remains similar to the ones for $Q_2^{ij}$ and $Q_3^{ij}$, respectively. CouchDB is ∼ . x and ∼ . x faster than Couchbase for the $Q_2^{ij}$ and $Q_3^{ij}$ queries, respectively, regardless of SF. MongoDB achieves the overall best time performance for both queries. BaseX has the second-best time performance among the tested DODBMSes and the best performance among the XDBMSes.

Figure 3: Response time for $Q_1^{i}$ (panels (a) to (c), one per term; axes: SF versus response time (s); series: BaseX, eXist-db, Sedna, MongoDB, CouchDB, Couchbase)

Figure 6 shows the results for the aggregation queries, i.e., $Q_6$ to $Q_9$.
For the queries Q, Q, and Q, BaseX has the best time performance and significantly outperforms MongoDB and CouchDB by a factor of ∼ x, regardless of the SF. For the Q query, CouchDB achieved the best query execution time, while Couchbase achieved the worst. MongoDB has the second-best query response time among the studied DODBMSes for Q, Q, and Q. MongoDB's response time for these queries is almost on par with the response time of CouchDB w.r.t. SF, although MongoDB executes the aggregation functions at runtime.

For Q, Couchbase has a large standard deviation. During testing, some runs of this query finished with the error "Index scan timed out". The tests that finished with the status "success" returned fluctuating time performance for each run. This abnormal behavior of the Couchbase system can sometimes be observed for complex queries on large collections.

For Q, which has the highest selectivity, CouchDB holds the best time performance. We attribute this result to the mechanism used by CouchDB to store aggregation functions. Aggregation functions are stored in materialized views, also named indices in CouchDB. Using this technique, CouchDB manages to outperform BaseX and MongoDB, which execute aggregation functions at runtime, for queries with high selectivity.
With Couchbase, the complexity and selectivity, together with the UNNEST clause required to extract the nested documents in order to filter and group the information, increase the runtime significantly while decreasing the overall query performance.

Figure 4: Response time for $Q_2^{ij}$ and $Q_3^{ij}$ (panels (a), (c), (e): intersection queries $Q_2^{ij} = Q_1^{i} \cap Q_1^{j}$; panels (b), (d), (f): union queries $Q_3^{ij} = Q_1^{i} \cup Q_1^{j}$; axes: SF versus response time (s); series: BaseX, eXist-db, Sedna, MongoDB, CouchDB, Couchbase)

The aggregation queries did not work on Sedna. When executing these queries, the XDBMS remained unresponsive for days, and we had to manually stop the system, the related services, and the background processes. We note that Sedna also executes aggregation functions at runtime. We suspect that one reason for Sedna's failure to execute the aggregation queries is also the outdated XQuery 1.0 query language.

Figure 5: Response time for $Q_4$ and $Q_5$ (panel (a): $Q_4 = Q_1^{1} \cap Q_1^{2} \cap Q_1^{3}$; panel (b): $Q_5 = Q_1^{1} \cup Q_1^{2} \cup Q_1^{3}$; axes: SF versus response time (s); series: BaseX, eXist-db, Sedna, MongoDB, CouchDB, Couchbase)

The eXist-db XDBMS has the highest query time for the Q, Q, and Q queries. The execution is done at runtime. For this XDBMS, query Q worked only for SF = 0 . . For other SF values, the query returned memory errors, although we have tuned this XDBMS to work with the same parameters as the other DODBMSes. Thus, eXist-db is highly dependent on the JVM (Java Virtual Machine) memory allocation mechanism.
Figure 6: Response time for aggregation queries (panels (a) to (d), one per aggregation query; axes: SF versus response time (s); series: BaseX, eXist-db, MongoDB, CouchDB, Couchbase)

6.7. Discussions on the Experimental Design Choices

In this study, we present our findings regarding the performance of filtering and aggregation queries on a large dataset for XDBMSes and JDBMSes w.r.t. different scale factors. We observe that the XDBMSes perform as well as JDBMSes for specific use cases, with BaseX even outperforming the more popular JDBMSes on three out of the four aggregation queries. Among the JDBMSes, MongoDB has the overall best performance.

For our comparison, we do not take into account horizontal scalability through sharding and replication, as not all of the analyzed DBMSes have such a functionality. Furthermore, it is essential first to understand single-node performance before considering horizontal scaling. Thus, the aim of the paper is to examine single-instance deployments.

There are many real-world scenarios where such single-instance deployment is preferred. As a first example, XDBMSes can be used for fast application development, analyzing and querying log data, or storing and retrieving IoT sensor data. XDBMSes are good candidates for storing large documents, managing long-running transactions, and querying hierarchical data structures in environments that require rapidly evolving schemes. Furthermore, these DBMSes are lightweight and do not require dedicated hardware, software, or a lot of resources. They thus lower resource costs at the data center site and enable on-site data analysis and decision making. Therefore, they can be utilized in Edge and Fog Computing with ease.

The creation of network islands due to faulty nodes is very common in the Edge/Fog environment.
Even in the presence of well-defined recovery mechanisms, the formation of temporary network islands is unfavorable for sharding, as the overall latency increases if nodes go down and then up again. Hence, single-instance deployments are favored in these environments.

Another real-world scenario where such single-instance deployment can be used is for small to medium scale document management systems. These management systems are useful to smaller enterprises, where data is kept in the company due to GDPR (the European Union's General Data Protection Regulation). Moreover, as in many cases most of the data is in semi-structured formats, such as XML and JSON, single-instance DODBMSes are a good candidate for storing and managing such documents. This removes the maintenance of a data center from the company's costs.

It is also important to mention that the focus of our benchmark is on data retrieval and not on write operations because, in real-world applications, multiple techniques can be put in place to balance the write operations and minimize the workload. Moreover, data persistence can be achieved much later within a DBMS, depending on the workload and the system's write and logging mechanisms.

Furthermore, we loaded the data into the database using different methods. Because not all of the tested DODBMSes have their own data load tools, we developed our own data loading programs. By utilizing our data load programs and not native load DBMS functionalities, we added a new layer of complexity which decreases write performance. This makes the loading process dependent on external DBC (database connector) implementations, and not on the DODBMS internal functionalities.
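The selectivity formula and the measurement protocol used in the experiments (an untimed cold run to fill the cache, followed by timed warm runs summarized by mean and standard deviation) can be sketched as follows. The function names and the stand-in query are assumptions; the actual benchmark runs each query through the command line interface of the respective DODBMS.

```python
import statistics
import time


def selectivity(n_returned: int, n_total: int) -> float:
    """S(Q) = 1 - n(Q) / N."""
    if n_total <= 0:
        raise ValueError("N must be positive")
    return 1.0 - n_returned / n_total


def measure(run_query, runs: int, warmup: int = 1):
    """Execute run_query `warmup` times untimed, then time `runs` executions.

    Returns the mean and the sample standard deviation of the response times.
    """
    for _ in range(warmup):
        run_query()  # cold run: fills caches, not included in the results
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev
```

A query that returns a quarter of the available rows has `selectivity(25, 100) == 0.75`, which matches the convention that more restrictive queries have higher selectivity.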

7. Conclusion

In this paper, we present an overview and comparison of DODBMSes that encode information using XML and JSON formats and propose a benchmark using filtering and aggregation queries on a heterogeneous dataset. For our experiments we chose three XDBMSes, i.e., BaseX, eXist-db, and Sedna, and three JDBMSes, i.e., MongoDB, CouchDB, and Couchbase. These DODBMSes are open-source and free-to-use systems, whose licenses do not forbid benchmarking.

Our comparison focuses on key functionalities required by Big Data and IoT systems for storing and extracting information from large volumes of data. For this comparison, we also consider the transaction properties of each DODBMS, their in-memory capabilities, and how these systems deal with atomicity, consistency, isolation, and durability with regard to operations such as accessing, modifying, and saving documents. We also present for each DODBMS its support for replication and partitioning of data and how it manages these Big Data requirements. Furthermore, we present the querying languages used for extracting information as well as the different types of indices provided by each DODBMS to improve retrieval response time.

The proposed benchmark uses different queries to emphasize the time performance of DODBMSes and highlights the capabilities of XDBMSes and JDBMSes. Furthermore, our solution proves its portability, scalability, and relevance by its design. The benchmark is portable, as it works on multiple systems. For this purpose, we compare the performance of several DODBMSes, i.e., BaseX, eXist-db, Sedna, MongoDB, CouchDB, and Couchbase. To demonstrate the scalability of our solution, we introduced SF, the scaling factor that generates an incremental growth in the data volume for our experiments. By increasing the queries' complexity together with the SF factor, we analyze the behavior of the systems from the scaling perspective. We observe that all the DODBMSes have a linear increase in runtime.
Furthermore, BaseX proves to be a good choice when dealing with aggregations. Finally, our experimental results show that our benchmark is indeed relevant in comparing the runtime performance of different DODBMSes.

The performance tests provide some interesting and unexpected results. Among the XDBMSes, BaseX has the best overall performance. BaseX even outperforms the JDBMSes selected for this benchmark, i.e., MongoDB, CouchDB, and Couchbase, for three out of the four aggregation queries proposed. We observe that Couchbase has the overall worst performance among the JDBMSes. Sedna outperforms CouchDB and Couchbase when dealing with filtering queries, but does not work for the aggregation queries. MongoDB has the overall best time performance for the filtering queries and it outperforms BaseX only for the aggregation query Q. eXist-db shows some unexpected behavior when dealing with both filtering and aggregation queries. Also, it is highly dependent on the JVM, which needs to be tuned for each query, making this XDBMS hard to work with. However, we can assume that eXist-db works well on a query-by-query basis.

Following the results obtained by the benchmark, we can answer the three research questions and conclude that XDBMSes are still useful: their performance is as good as that of JDBMSes and they are good candidates for Big Data Management. Furthermore, XDBMSes are well-suited for several current real-world scenarios. Firstly, XDBMSes are reliable systems for storing large documents, managing long-running transactions, and querying hierarchical data structures in Edge/Fog environments (e.g., smart agriculture, healthcare wearables, etc.), as these types of DODBMSes are lightweight and do not require dedicated hardware, software, or a lot of resources. Secondly, XDBMSes can be used as small to medium scale document management systems in smaller enterprises, where data are kept in the company due to GDPR.
Thirdly, in the case of Big Data analysis, they prove to be well-suited when the documents are in XML format, by removing the ETL (Extract, Transform, Load) processes from the storing, managing, and analysis pipeline.

As future work, we plan to improve the support for OLAP queries [41] on XML data and XML data in combination with other data [42, 43], both in terms of performance and functionality. This includes designing new sampling strategies and supporting more aggregation queries [42]. The sampling methods will include constraints on other labels and values contained in the records. Also, we aim to add more dimensions for grouping [42], to boost the performance by lowering the query selectivity and performing query rewriting [43], and to add further grouping functionality [42].

Acknowledgement

The research presented in this paper was supported in part by the Danish Independent Research Council, through the SEMIOTIC project, and the Robots and Society: Cognitive Systems for Personal Robots and Autonomous Vehicles (ROBIN) project, CCCDI-UEFISCDI grant No. PN-III-P1-1.2-PCCDI-2017-0734.

References

[1] J. Han, H. E, G. Le, J. Du, Survey on NoSQL database, in: International Conference on Pervasive Computing and Applications, IEEE, 2011, pp. 363–366. doi:10.1109/icpca.2011.6106531.
[2] B.-K. Park, H. Han, I.-Y. Song, XML-OLAP: A multidimensional analysis framework for XML warehouses, in: Data Warehousing and Knowledge Discovery, Springer, 2005, pp. 32–42. doi:10.1007/11546849_4.
[3] M. Stonebraker, U. Çetintemel, "one size fits all": An idea whose time has come and gone, in: International Conference on Data Engineering, IEEE, 2005, pp. 1–10. doi:10.1109/icde.2005.1.
[4] C. Strauch, Nosql databases, Tech. rep., Stuttgart Media University (2011).
[5] T. Zhu, D. Wang, H. Hu, W. Qian, X. Wang, A. Zhou, Interactive transaction processing for in-memory database system, in: Database Systems for Advanced Applications, Springer, Cham, 2018, pp. 228–246. doi:10.1007/978-3-319-91458-9_14.
[6] M. A. Qader, S. Cheng, V. Hristidis, A comparative study of secondary indexing techniques in lsm-based nosql databases, in: International Conference on Management of Data, SIGMOD2018, ACM, 2018, pp. 551–566. doi:10.1145/3183713.3196900.
[7] A. Petrov, Algorithms behind modern storage systems, Queue 16 (2) (2018) 30:31–30:51. doi:10.1145/3212477.3220266.
[8] D. Comer, Ubiquitous b-tree, ACM Computing Surveys 11 (2) (1979) 121–137. doi:10.1145/356770.356776.
[9] R. Cattell, Scalable SQL and NoSQL data stores, ACM SIGMOD Record 39 (4) (2011) 12. doi:10.1145/1978915.1978919.
[10] M. Stonebraker, SQL databases v. NoSQL databases, Communications of the ACM 53 (4) (2010) 10. doi:10.1145/1721654.1721659.
[11] R. Hecht, S. Jablonski, NoSQL evaluation: A use case oriented survey, in: International Conference on Cloud and Service Computing, IEEE, 2011, pp. 336–341. doi:10.1109/csc.2011.6138544.
[12] F. Gessert, W. Wingerath, S. Friedrich, N. Ritter, NoSQL database systems: a survey and decision guidance, Computer Science - Research and Development 32 (3-4) (2016) 353–365. doi:10.1007/s00450-016-0334-3.
[13] Z. Brahmia, H. Hamrouni, R. Bouaziz, XML data manipulation in conventional and temporal XML databases: A survey, Computer Science Review 36 (2020) 100231. doi:10.1016/j.cosrev.2020.100231.
[14] F. Bajaber, S. Sakr, O. Batarfi, A. Altalhi, A. Barnawi, Benchmarking big data systems: A survey, Computer Communications 149 (2020) 241–251. doi:10.1016/j.comcom.2019.10.002.
[15] C.-O. Truică, E.-S. Apostol, J. Darmont, I. Assent, TextBenDS: a generic textual data benchmark for distributed systems, Information Systems Frontiers. doi:10.1007/s10796-020-09999-y.
[16] S. M. Freire, E. Sundvall, D. Karlsson, P. Lambrix, Performance of xml databases for epidemiological queries in archetype-based ehrs, in: Scandinavian Conference on Health Informatics, Linköping University Electronic Press, 2012, pp. 51–57.
[17] A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, R. Busse, XMark: A benchmark for xml data management, in: International Conference on Very Large Databases VLDB, Elsevier, 2002, pp. 974–985. doi:10.1016/b978-155860869-6/50096-2.
[18] M. Nicola, I. Kogan, B. Schiefer, An XML transaction processing benchmark, in: ACM SIGMOD International Conference on Management of data, ACM Press, 2007, pp. 937–948. doi:10.1145/1247480.1247590.
[19] P. Atzeni, F. Bugiotti, L. Cabibbo, R. Torlone, Data modeling in the NoSQL world, Computer Standards & Interfaces 67 (2020) 103149. doi:10.1016/j.csi.2016.10.003.
[20] T. Fiebig, S. Helmer, C.-C. Kanne, G. Moerkotte, J. Neumann, R. Schiele, T. Westmann, Anatomy of a native XML base management system, The VLDB Journal The International Journal on Very Large Data Bases 11 (4) (2002) 292–314. doi:10.1007/s00778-002-0080-y.
[21] G. Pavlović-Lažetić, Native xml databases vs. relational databases in dealing with xml documents, Kragujevac Journal of Mathematics 30 (2007) 181–199.
[22] E.
Gallinucci, M. Golfarelli, S. Rizzi, Schema proﬁling of document-oriented databases,Information Systems 75 (2018) 13 – 25. doi:10.1016/j.is.2018.02.007 .[23] BaseX, Basex documentation (2020).URL http://docs.basex.org/wiki/Main_Page [24] C. Grün, S. Gath, A. Holupirek, M. H. Scholl, XQuery full text implementation in BaseX,Database and XML Technologies (2009) 114–128 doi:10.1007/978-3-642-03555-5_10 .[25] W. Meier, exist: An open source native xml database, in: Web, Web-Services, and DatabaseSystems, Springer, 2003, pp. 169–183. doi:10.1007/3-540-36560-5_13 .[26] E. Siegel, A. Retter, eXist: A NoSQL Document Database and Application Platform, O’ReillyMedia, Inc., 2014.[27] eXistdb, exist-db documentation (2020).URL https://exist-db.org/exist/apps/doc/documentation [28] A. Fomichev, M. Grinev, S. Kuznetsov, Sedna: A native xml dbms, in: SOFSEM 2006: Theoryand Practice of Computer Science, Springer, 2006, pp. 272–281. doi:10.1007/11611257_25 .[29] Sedna, Sedna documentation (2020).URL [30] I. Taranov, I. Shcheklein, A. Kalinin, L. Novak, S. Kuznetsov, R. Pastukhov, A. Boldakov,D. Turdakov, K. Antipin, A. Fomichev, P. Pleshachkov, P. Velikhov, N. Zavaritski, M. Grinev,M. Grineva, D. Lizorkin, Sedna: Native xml database management system (internalsoverview), in: ACM SIGMOD International Conference on Management of Data, SIGMOD’10, ACM, 2010, pp. 1037–1046. doi:10.1145/1807167.1807282 .[31] K. Banker, P. Bakkum, S. Verch, D. Garrett, T. Hawkins, MongoDB in Action, 2nd Edition,Manning Publications Co., 2011.[32] MongoDB, Inc., Mongodb documentation (2020).URL https://docs.mongodb.com/ [33] Apache CouchDB, Couchdb documentation (2020).URL https://docs.couchdb.org/en/stable/ doi:10.1016/j.ygeno.2012.05.006 .[37] M. Brown, Getting started with Couchbase server, Oreilly, 2012.[38] D. Vohra, Pro Couchbase Development, Apress, 2015. doi:10.1007/978-1-4842-1434-3 .[39] M. A. Hubail, A. Alsuliman, M. Blow, M. Carey, D. Lychagin, I. Maxon, T. 
Westmann,Couchbase analytics, VLDB Endowment 12 (12) (2019) 2275–2286. doi:10.14778/3352063.3352143 .[40] Apache Couchbase, Couchbase documentation (2020).URL https://docs.couchbase.com/home/index.htmlhttps://docs.couchbase.com/home/index.html