Featured Researches

Databases

An Algorithm for Context-Free Path Queries over Graph Databases

RDF (Resource Description Framework) is a standard language to represent graph databases. Query languages for RDF databases usually include primitives to support path queries, linking pairs of vertices of the graph that are connected by a path of labels belonging to a given language. Languages such as SPARQL include support for paths defined by regular languages (by means of Regular Expressions). A context-free path query is a path query whose language can be defined by a context-free grammar. Context-free path queries can be used to implement queries such as the "same generation queries", that are not expressible by Regular Expressions. In this paper, we present a novel algorithm for context-free path query processing. We prove the correctness of our approach and show its run-time and memory complexity. We show the viability of our approach by means of a prototype implemented in Go. We run our prototype using the same cases of study as proposed in recent works, comparing our results with another, recently published algorithm. The experiments include both synthetic and real RDF databases. Our algorithm can be seen as a step forward, towards the implementation of more expressive query languages.

Read more
Databases

An Algorithm for the Discovery of Independence from Data

For years, independence has been considered as an important concept in many disciplines. Nevertheless, we present the first research that investigates the discovery problem of independence in data. In its arguably simplest form, independence is a statement between two sets of columns expressing that for every two rows in a table there is also a row in the table that coincides with the first row on the first set of columns and with the second row on the second set of columns. We show that the problem of deciding whether there is an independence statement that holds on a given table is not only NP-complete but W[3] -complete in its arguably most natural parameter, namely its arity. We establish the first algorithm to discover all independence statement that hold on a given table. We illustrate in experiments with benchmark data that our algorithm performs well within the limits established by our hardness results. In practice, it is often useful to determine the ratio with which independence statements hold on a given table. For that purpose, we show that our treatment of independence and the design of our algorithm enables us to extend our findings to approximate independence. In our final experiments, we provide some insight into the trade-off between run time and the approximation ratio. Naturally, the smaller the ratio, the more approximate independence statements hold, and the more time it takes to discover all of them. While this research establishes first insight into the computational properties of discovering independence from data, we hope to initiate research into more sophisticated notions of independence, including embedded multivalued dependencies, as well as their context-specific and probabilistic variants.

Read more
Databases

An Analysis of Concurrency Control Protocols for In-Memory Databases with CCBench (Extended Version)

This paper presents yet another concurrency control analysis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods and enables the configuration of seven workload parameters. We analyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classified the optimization methods on the basis of three performance factors: CPU cache, delay on conflict, and version lifetime. Analyses using CCBench and 224 threads, produced six insights. (I1) The performance of optimistic concurrency control protocol for a read only workload rapidly degrades as cardinality increases even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using invisible reads optimization. (I3) The effectiveness of two approaches to coping with conflict (wait and no-wait) depends on the situation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can improve performance. (I5) Mixing different implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if there is a single long transaction mixed into the workload. On the basis of I4, we defined the read phase extension optimization in which an artificial delay is added to the read phase. On the basis of I6, we defined the aggressive garbage collection optimization in which even visible versions are collected. The code for CCBench and all the data in this paper are available online at GitHub.

Read more
Databases

An Efficient Index Method for the Optimal Route Query over Multi-Cost Networks

Smart city has been consider the wave of the future and the route recommendation in networks is a fundamental problem in it. Most existing approaches for the shortest route problem consider that there is only one kind of cost in networks. However, there always are several kinds of cost in networks and users prefer to select an optimal route under the global consideration of these kinds of cost. In this paper, we study the problem of finding the optimal route in the multi-cost networks. We prove this problem is NP-hard and the existing index techniques cannot be used to this problem. We propose a novel partition-based index with contour skyline techniques to find the optimal route. We propose a vertex-filtering algorithm to facilitate the query processing. We conduct extensive experiments on six real-life networks and the experimental results show that our method has an improvement in efficiency by an order of magnitude compared to the previous heuristic algorithms.

Read more
Databases

An Efficient Index for Contact Tracing Query in a Large Spatio-Temporal Database

In this paper, we study a novel contact tracing query (CTQ) that finds users who have been in direct contact with the query user or in contact with the already contacted users in subsequent timestamps from a large spatio-temporal database. The CTQ is of paramount importance in the era of new COVID-19 pandemic world for finding possible list of potential COVID-19 exposed patients. A straightforward way to answer the CTQ is using traditional spatio-temporal indexes. However, these indexes cannot serve the purpose as each user covers a large area within the time-span of potential disease spreading and thus they can hardly use efficient pruning techniques. We propose a multi-level index, namely QR-tree, that consider both space coverage and the co-visiting patterns to group users so that users who are likely to meet the query user are grouped together. More specifically, we use a quadtree to partition user movement traces w.r.t. space and time, and then exploit these space-time mapping of user traces to group users using an R-tree. The QR-tree facilitates efficient pruning and enables accessing only potential sets of user who can be the candidate answers for the CTQ. Experiments with real datasets show the effectiveness of our approach.

Read more
Databases

An Efficient Secure Dynamic Skyline Query Model

It is now cost-effective to outsource large dataset and perform query over the cloud. However, in this scenario, there exist serious security and privacy issues that sensitive information contained in the dataset can be leaked. The most effective way to address that is to encrypt the data before outsourcing. Nevertheless, it remains a grand challenge to process queries in ciphertext efficiently. In this work, we shall focus on solving one representative query task, namely dynamic skyline query, in a secure manner over the cloud. However, it is difficult to be performed on encrypted data as its dynamic domination criteria require both subtraction and comparison, which cannot be directly supported by a single encryption scheme efficiently. To this end, we present a novel framework called SCALE. It works by transforming traditional dynamic skyline domination into pure comparisons. The whole process can be completed in single-round interaction between user and the cloud. We theoretically prove that the outsourced database, query requests, and returned results are all kept secret under our model. Moreover, we also present an efficient strategy for dynamic insertion and deletion of stored records. Empirical study over a series of datasets demonstrates that our framework improves the efficiency of query processing by nearly three orders of magnitude compared to the state-of-the-art.

Read more
Databases

An Efficient and Wear-Leveling-Aware Frequent-Pattern Mining on Non-Volatile Memory

Frequent-pattern mining is a common approach to reveal the valuable hidden trends behind data. However, existing frequent-pattern mining algorithms are designed for DRAM, instead of persistent memories (PMs), which can lead to severe performance and energy overhead due to the utterly different characteristics between DRAM and PMs when they are running on PMs. In this paper, we propose an efficient and Wear-leveling-aware Frequent-Pattern Mining scheme, WFPM, to solve this problem. The proposed WFPM is evaluated by a series of experiments based on realistic datasets from diversified application scenarios, where WFPM achieves 32.0% performance improvement and prolongs the NVM lifetime of header table by 7.4x over the EvFP-Tree.

Read more
Databases

An Empirical Study on the Design and Evolution of NoSQL Database Schemas

We study how software engineers design and evolve their domain model when building applications against NoSQL data stores. Specifically, we target Java projects that use object-NoSQL mappers to interface with schema-free NoSQL data stores. Given the source code of ten real-world database applications, we extract the implicit NoSQL database schema. We capture the sizes of the schemas, and investigate whether the schema is denormalized, as is recommended practice in data modeling for NoSQL data stores. Further, we analyze the entire project history, and with it, the evolution history of the NoSQL database schema. In doing so, we conduct the so far largest empirical study on NoSQL schema design and evolution.

Read more
Databases

An Intelligent Edge-Centric Queries Allocation Scheme based on Ensemble Models

The combination of Internet of Things (IoT) and Edge Computing (EC) can assist in the delivery of novel applications that will facilitate end users activities. Data collected by numerous devices present in the IoT infrastructure can be hosted into a set of EC nodes becoming the subject of processing tasks for the provision of analytics. Analytics are derived as the result of various queries defined by end users or applications. Such queries can be executed in the available EC nodes to limit the latency in the provision of responses. In this paper, we propose a meta-ensemble learning scheme that supports the decision making for the allocation of queries to the appropriate EC nodes. Our learning model decides over queries' and nodes' characteristics. We provide the description of a matching process between queries and nodes after concluding the contextual information for each envisioned characteristic adopted in our meta-ensemble scheme. We rely on widely known ensemble models, combine them and offer an additional processing layer to increase the performance. The aim is to result a subset of EC nodes that will host each incoming query. Apart from the description of the proposed model, we report on its evaluation and the corresponding results. Through a large set of experiments and a numerical analysis, we aim at revealing the pros and cons of the proposed scheme.

Read more
Databases

An Ontology for the Materials Design Domain

In the materials design domain, much of the data from materials calculations are stored in different heterogeneous databases. Materials databases usually have different data models. Therefore, the users have to face the challenges to find the data from adequate sources and integrate data from multiple sources. Ontologies and ontology-based techniques can address such problems as the formal representation of domain knowledge can make data more available and interoperable among different systems. In this paper, we introduce the Materials Design Ontology (MDO), which defines concepts and relations to cover knowledge in the field of materials design. MDO is designed using domain knowledge in materials science (especially in solid-state physics), and is guided by the data from several databases in the materials design field. We show the application of the MDO to materials data retrieved from well-known materials databases.

Read more

Ready to get started?

Join us today