Featured Research

Databases

A GPU-friendly Geometric Data Model and Algebra for Spatial Queries: Extended Version

The availability of low-cost sensors has led to an unprecedented growth in the volume of spatial data. However, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries, but existing GPU-based solutions have two important drawbacks: they are often tightly coupled to the specific query types they target, making it hard to adapt them for other queries; and since their design is based on CPU-based approaches, it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects. We demonstrate the expressiveness of the proposed algebra by formulating standard spatial queries as algebraic expressions. We also present a proof-of-concept prototype that supports a subset of the operators and show that it is at least two orders of magnitude faster than a CPU-based implementation. This performance gain is obtained both with a discrete Nvidia mobile GPU and with the less powerful integrated GPUs common in commodity laptops.
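
A minimal sketch of the composable-operator idea, not the paper's actual algebra: spatial data held as flat coordinate arrays and a range query built from small data-parallel primitives. NumPy stands in for a GPU array library; the operator names are illustrative.

```python
# Sketch: column-oriented points plus composable, data-parallel operators.
# The same map/filter shape is what makes such operators GPU-friendly.
import numpy as np

def make_points(xs, ys):
    """Column-oriented point set: one array per coordinate."""
    return {"x": np.asarray(xs, dtype=np.float64),
            "y": np.asarray(ys, dtype=np.float64)}

def inside_box(points, xmin, ymin, xmax, ymax):
    """Data-parallel predicate: boolean mask of points inside a rectangle."""
    return ((points["x"] >= xmin) & (points["x"] <= xmax) &
            (points["y"] >= ymin) & (points["y"] <= ymax))

def select(points, mask):
    """Materialize the points chosen by a mask."""
    return {"x": points["x"][mask], "y": points["y"][mask]}

# Compose the primitives into a simple range query.
pts = make_points([0.5, 2.0, 3.5, 1.2], [1.0, 0.2, 2.5, 1.8])
result = select(pts, inside_box(pts, 0.0, 0.0, 2.0, 2.0))
print(result["x"], result["y"])  # points falling inside the query box
```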

Read more
Databases

A GeoSPARQL Compliance Benchmark

We propose a series of tests that check the compliance of RDF triplestores with the GeoSPARQL standard. The purpose of the benchmark is to test how many of the requirements outlined in the standard a given system supports, and to push triplestores toward full GeoSPARQL compliance. This matters because support for GeoSPARQL varies greatly between triplestore implementations, and such support is of great importance for the domain of geospatial RDF data. Additionally, we present a comprehensive comparison of triplestores, providing insight into their current GeoSPARQL support.
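
A hedged sketch of what a single compliance-style check could look like, not the benchmark's actual test suite: send a query that exercises one GeoSPARQL function to a SPARQL endpoint and report whether it evaluates without error. The endpoint URL is a placeholder.

```python
# Sketch: probe whether an endpoint supports the geof:sfIntersects filter function.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT ?f WHERE {
  ?f geo:hasGeometry/geo:asWKT ?wkt .
  FILTER(geof:sfIntersects(?wkt,
    "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))"^^geo:wktLiteral))
}
"""

def supports_sf_intersects(endpoint_url):
    """Return True if the endpoint evaluates geof:sfIntersects without error."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    try:
        sparql.query().convert()
        return True
    except Exception:
        # A parse or evaluation failure suggests the function is unsupported.
        return False

if __name__ == "__main__":
    print(supports_sf_intersects("http://localhost:3030/ds/sparql"))  # placeholder endpoint
```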

Read more
Databases

A Graph-Based Platform for Customer Behavior Analysis using Applications' Clickstream Data

Clickstream analysis has been receiving more attention with the increased usage of e-commerce sites and applications. Besides analyzing customers' purchase behavior, there are also attempts to analyze customer behavior in relation to the quality of web or application design. In general, clickstream data can be considered a sequence of log events collected at different levels of web/app usage. The analysis of clickstream data can be performed directly as sequence analysis or by extracting features from the sequences. In this work, we show how representing and saving the sequences with their underlying graph structures can provide a platform for customer behavior analysis. Our main idea is that clickstream data, which contain sequences of actions within an application, are walks on the corresponding finite state automaton (FSA) of that application. Our hypothesis is that the customers of an application normally do not take all possible walks through that FSA, and that the number of actual walks is much smaller than the total number of possible walks through the FSA. Such walks normally consist of a finite number of cycles on the FSA graph. Identifying and matching these cycles in classical sequence analysis is not straightforward. We show that representing the sequences through their underlying graph structures not only groups the sequences automatically but also provides a compressed representation of the original sequences.
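
A minimal sketch of the core idea, with illustrative event names rather than the paper's implementation: collapse each clickstream into the multiset of transitions it induces, i.e. the edges of its walk on the application's FSA. Sequences that differ only in how often they repeat a cycle share the same underlying graph, and the edge-count form is a compressed representation of the raw sequence.

```python
# Sketch: map clickstream sequences to the counted edges of their FSA walks.
from collections import Counter

def walk_edges(clickstream):
    """Return the counted transitions (edges) of a sequence's walk on the FSA graph."""
    return Counter(zip(clickstream, clickstream[1:]))

# Two sessions that repeat the search <-> item cycle a different number of times.
s1 = ["home", "search", "item", "search", "item", "cart", "checkout"]
s2 = ["home", "search", "item", "search", "item", "search", "item", "cart", "checkout"]

g1, g2 = walk_edges(s1), walk_edges(s2)
print(g1)                  # e.g. ('search', 'item'): 2 -- a compressed view of s1
print(set(g1) == set(g2))  # True: same underlying graph, walks differ only in cycle repetitions
```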

Read more
Databases

A Lazy Approach for Efficient Index Learning

Learned indices using neural networks have been shown to outperform traditional indices such as B-trees in both query time and memory. However, learning the distribution of a large dataset can be expensive, and updating learned indices is difficult, thus hindering their usage in practical applications. In this paper, we address the efficiency and update issues of learned indices through agile model reuse. We pre-train learned indices over a set of synthetic (rather than real) datasets and propose a novel approach to reuse these pre-trained models for a new (real) dataset. The synthetic datasets are created to cover a large range of different distributions. Given a new dataset DT, we select the learned index of a synthetic dataset similar to DT to index DT. We show a bound on the indexing error when a pre-trained index is selected. We further show how our techniques handle data updates and bound the resulting indexing errors. Experimental results on synthetic and real datasets confirm the effectiveness and efficiency of our proposed lazy (model reuse) approach.
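
A hedged sketch of the reuse idea; the model form (a toy linear key-to-position fit) and the similarity measure (a quantile profile) are illustrative assumptions, not the paper's method. Pre-train one model per synthetic distribution, then index a new dataset with the model whose distribution profile is closest to it.

```python
# Sketch: pre-train toy learned indexes on synthetic data, reuse the closest one.
import numpy as np

def train_index(keys):
    """Fit position ~ a*key + b on sorted keys (a toy learned index)."""
    keys = np.sort(np.asarray(keys, dtype=float))
    pos = np.arange(len(keys), dtype=float)
    a, b = np.polyfit(keys, pos, deg=1)
    return (a, b)

def quantile_profile(keys, q=16):
    """Normalized quantile sketch used to compare data distributions."""
    qs = np.quantile(np.asarray(keys, dtype=float), np.linspace(0, 1, q))
    span = qs[-1] - qs[0] or 1.0
    return (qs - qs[0]) / span

# Pre-training phase over synthetic datasets covering different distributions.
rng = np.random.default_rng(0)
synthetic = {"uniform": rng.uniform(0, 1, 10_000),
             "lognormal": rng.lognormal(0, 1, 10_000)}
pretrained = {name: (train_index(d), quantile_profile(d)) for name, d in synthetic.items()}

# "Lazy" indexing of a new dataset: pick the pre-trained model with the closest profile.
new_data = rng.lognormal(0, 1, 5_000)
profile = quantile_profile(new_data)
best = min(pretrained, key=lambda n: np.linalg.norm(pretrained[n][1] - profile))
print("reused model:", best)  # positions would then be rescaled/corrected for the new data
```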

Read more
Databases

A Migratory Near Memory Processing Architecture Applied to Big Data Problems

Servers produced by mainstream vendors are inefficient in processing Big Data queries due to bottlenecks inherent in the fundamental architecture of these systems. Current server blades contain multicore processors connected to DRAM memory and disks by an interconnection chipset. The multicore processor chips perform all the computations, while the DRAM and disks store the data but have no processing capability. To perform a database query, data must be moved back and forth between DRAM and a small cache, as well as between DRAM and disks. For Big Data applications this data movement is onerous. Migratory Near Memory Servers address this bottleneck by placing large numbers of lightweight processors directly into the memory system. These processors operate directly on the relations, vertices and edges of Big Data applications in place, without having to shuttle large quantities of data back and forth between DRAM, cache and heavyweight multicore processors. This paper addresses the application of such an architecture to relational database SELECT and JOIN queries. Preliminary results indicate orders-of-magnitude end-to-end speedups.
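
A conceptual software analogy, not the hardware architecture itself: partition a relation across simulated "memory modules" that each evaluate a SELECT predicate locally, so only qualifying tuples cross the interconnect instead of every tuple being shipped to a central processor. Sizes and the predicate are illustrative.

```python
# Sketch: count how many tuples would cross the interconnect with and without
# near-memory filtering of a SELECT predicate.
import random

random.seed(1)
relation = [(i, random.randint(0, 99)) for i in range(100_000)]  # (key, value) tuples
partitions = [relation[i::8] for i in range(8)]                   # 8 simulated memory modules

def local_select(partition, predicate):
    """Work done in place by a lightweight near-memory processor."""
    return [row for row in partition if predicate(row)]

pred = lambda row: row[1] < 5
shipped_near_memory = sum(len(local_select(p, pred)) for p in partitions)  # only results move
shipped_centralized = len(relation)                                        # all tuples move first

print("tuples moved, near-memory:", shipped_near_memory)
print("tuples moved, centralized:", shipped_centralized)
```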

Read more
Databases

A Multi-Dimensional Big Data Storing System for Generated COVID-19 Large-Scale Data using Apache Spark

The ongoing outbreak of coronavirus disease (COVID-19) began in Wuhan, China, in December 2019. COVID-19 is caused by a new virus that had not previously been identified in humans, and the outbreak was followed by a widespread and rapid spread of the epidemic throughout the world. Every day, the number of confirmed cases increases rapidly, the number of suspected cases grows based on the symptoms that accompany the disease, and unfortunately the number of deaths also rises. With these increases in case numbers around the world, it becomes hard to manage the information for all these cases and their different situations, such as whether a patient is confirmed or only suspected and which symptoms they present. Therefore, there is a critical need for a multi-dimensional system to store and analyze the generated large-scale data. In this paper, a Comprehensive Storing System for COVID-19 data using Apache Spark (CSS-COVID) is proposed to handle and manage the problems caused by the daily growth of COVID-19 data. CSS-COVID decreases the processing time for querying and storing daily COVID-19 data. CSS-COVID consists of three stages, namely inserting and indexing, storing, and querying. In the inserting stage, data is divided into subsets and each subset is then indexed separately. The storing stage uses a set of storing nodes to store the data, while the querying stage is responsible for handling query processing. Using Apache Spark in CSS-COVID improves performance when dealing with the large-scale data generated by the daily-growing number of COVID-19 cases. A set of experiments on real COVID-19 datasets demonstrates the efficiency of CSS-COVID in indexing large-scale data.
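
A hedged PySpark sketch of the three-stage shape described above (insert and index, store, query). The column names, file paths, and partitioning key are illustrative assumptions, not CSS-COVID's actual design.

```python
# Sketch: partition (index) a daily COVID-19 feed, store it partitioned, then query a slice.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid-storing-sketch").getOrCreate()

# Inserting/indexing stage: split the daily feed into subsets by a key.
daily = spark.read.csv("/data/covid/daily_cases.csv", header=True, inferSchema=True)
indexed = daily.repartition("country")

# Storing stage: persist each subset so later scans touch only the relevant files.
indexed.write.mode("append").partitionBy("country", "report_date") \
       .parquet("/data/covid/store")

# Querying stage: partition pruning keeps the query on a small slice of the data.
store = spark.read.parquet("/data/covid/store")
confirmed = (store.filter((F.col("country") == "Italy") &
                          (F.col("report_date") == "2020-04-01"))
                  .agg(F.sum("confirmed").alias("confirmed_cases")))
confirmed.show()
```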

Read more
Databases

A Note On Operator-Level Query Execution Cost Modeling

External query execution cost modeling using query execution feedback has found its way into various database applications such as admission control and query scheduling. Existing techniques generally fall into two categories: plan-level cost modeling and operator-level cost modeling. It has been shown in the literature that operator-level cost modeling can often significantly outperform plan-level cost modeling. In this paper, we study operator-level cost modeling from a robustness perspective. We address two main challenges that arise in practice: limited execution feedback (for certain operators) and mixed cost estimates due to the use of multiple cost modeling techniques. We propose a framework that deals with these issues and present a comprehensive analysis of this framework. We further provide a case study demonstrating the efficacy of our framework in the context of index tuning, which is itself a new application of external cost modeling techniques.
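
A minimal sketch of operator-level cost modeling from execution feedback; the features, model choice, and numbers are illustrative, not the paper's framework. One regression is fit per operator type, and a plan's cost is estimated as the sum of its operators' predictions.

```python
# Sketch: per-operator regressions trained on execution feedback, summed per plan.
import numpy as np
from sklearn.linear_model import LinearRegression

# Feedback rows: operator type, features (estimated rows, build-side rows), elapsed ms.
feedback = [
    ("scan", [1_000, 8], 4.1), ("scan", [10_000, 8], 39.0), ("scan", [50_000, 8], 198.0),
    ("hash_join", [1_000, 500], 6.5), ("hash_join", [10_000, 5_000], 61.0),
    ("hash_join", [50_000, 20_000], 300.0),
]

models = {}
for op in {op for op, _, _ in feedback}:
    X = np.array([f for o, f, _ in feedback if o == op])
    y = np.array([t for o, _, t in feedback if o == op])
    models[op] = LinearRegression().fit(X, y)

def plan_cost(operators):
    """Estimate plan cost as the sum of per-operator model predictions."""
    return sum(float(models[op].predict(np.array([feat]))[0]) for op, feat in operators)

print(plan_cost([("scan", [20_000, 8]), ("hash_join", [20_000, 8_000])]))
```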

Read more
Databases

A Novel Approach for Generating SPARQL Queries from RDF Graphs

This work was done as part of a research master's thesis project. The goal is to generate SPARQL queries from user-supplied keywords in order to query RDF graphs. To do this, we first transform the input ontology into an RDF graph that reflects the semantics represented in the ontology. We then store this RDF graph in the Neo4j graph database to ensure efficient and persistent management of the RDF data. At query time, we study the different possible and desired interpretations of the query originally posed by the user. We also propose a transformation between the two query languages SPARQL and Cypher, the latter being specific to Neo4j. This allows us to implement our system's architecture over a wide variety of RDF databases, each providing its own query language, without changing any of the other components of the system. Finally, we tested and evaluated our tool on different test bases, and the results show that it is comprehensive, effective, and sufficiently powerful.
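
A toy sketch of the kind of SPARQL-to-Cypher rewriting mentioned above, covering only a single basic triple pattern; how RDF terms map onto Neo4j labels and relationship types here is an assumption of this example, not the thesis' design.

```python
# Sketch: translate one SPARQL triple pattern into a Cypher MATCH clause.
import re

def triple_to_cypher(sparql_triple):
    """Translate '?s <p> ?o' or '?s <p> "lit"' into a Cypher MATCH ... RETURN."""
    m = re.match(r'(\?\w+|\S+)\s+<([^>]+)>\s+(\?\w+|".*")', sparql_triple.strip())
    if not m:
        raise ValueError("unsupported pattern")
    s, p, o = m.groups()
    rel = p.rsplit("/", 1)[-1].rsplit("#", 1)[-1]          # use the URI's local name as rel type
    subj = "(s)" if s.startswith("?") else f'(s {{uri: "{s}"}})'
    if o.startswith("?"):
        return f"MATCH {subj}-[:{rel}]->(o:Resource) RETURN s, o"
    return f"MATCH {subj}-[:{rel}]->(o:Literal {{value: {o}}}) RETURN s"

print(triple_to_cypher('?person <http://xmlns.com/foaf/0.1/knows> ?friend'))
# MATCH (s)-[:knows]->(o:Resource) RETURN s, o
```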

Read more
Databases

A Philosophy of Data

We argue that while the discourse on data ethics is of critical importance, it is missing one fundamental point: If more and more efforts in business, government, science, and our daily lives are data-driven, we should pay more attention to what exactly we are driven by. Therefore, we need more debate on what fundamental properties constitute data. In the first section of the paper, we work from the fundamental properties necessary for statistical computation to a definition of statistical data. We define a statistical datum as the coming together of substantive and numerical properties and differentiate between qualitative and quantitative data. Subsequently, we qualify our definition by arguing that for data to be practically useful, it needs to be commensurable in a manner that reveals meaningful differences that allow for the generation of relevant insights through statistical methodologies. In the second section, we focus on what our conception of data can contribute to the discourse on data ethics and beyond. First, we hold that the need for useful data to be commensurable rules out an understanding of properties as fundamentally unique or equal. Second, we argue that practical concerns lead us to increasingly standardize how we operationalize a substantive property; in other words, how we formalize the relationship between the substantive and numerical properties of data. Thereby, we also standardize the interpretation of a property. With our increasing reliance on data and data technologies, these two characteristics of data affect our collective conception of reality. Statistical data's exclusion of the fundamentally unique and equal influences our perspective on the world, and the standardization of substantive properties can be viewed as a profound ontological practice, entrenching ever more pervasive interpretations of phenomena in our everyday lives.

Read more
Databases

A Pluggable Learned Index Method via Sampling and Gap Insertion

Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of indexes, called learned indexes, has been proposed; these learn hidden yet useful data distributions and incorporate such information into the index, leading to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In this paper, we propose a formal machine learning based framework to quantify the index learning objective, and study two general and pluggable techniques to enhance the learning efficiency and learning effectiveness of learned indexes. With the guidance of the formal learning objective, we can learn indexes efficiently by incorporating the proposed sampling technique, and learn precise indexes with the enhanced generalization ability brought by the proposed result-driven gap insertion technique. We conduct extensive experiments on real-world datasets and compare several indexing methods from the perspective of the index learning objective. The results show the ability of the proposed framework to help design suitable indexes for different scenarios. Further, we demonstrate the effectiveness of the proposed sampling technique, which achieves up to 78x construction speedup while maintaining non-degraded indexing performance. Finally, we show that the gap insertion technique can enhance both the static and dynamic indexing performance of existing learned index methods, with up to 1.59x query speedup. We will release our code and processed data for further study, which can enable more exploration of learned indexes from the perspectives of both machine learning and databases.
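
A hedged sketch of the two techniques named in the abstract. Sampling: fit the key-to-position model on a small subset of keys. Gap insertion: lay the keys out over a larger array of slots so predicted positions keep slack for updates. The model form, sample rate, and gap factor here are illustrative assumptions, not the paper's method.

```python
# Sketch: learn a toy index from a 1% sample and place keys with gaps.
import numpy as np

rng = np.random.default_rng(42)
keys = np.sort(rng.lognormal(0, 1, 100_000))

# Sampling: train the key -> position model on 1% of the keys.
sample = np.sort(rng.choice(len(keys), size=len(keys) // 100, replace=False))
a, b = np.polyfit(keys[sample], sample.astype(float), deg=1)

# Gap insertion: spread the keys over 1.5x as many slots as keys.
gap_factor = 1.5
n_slots = int(len(keys) * gap_factor)
predicted_slot = np.clip((a * keys + b) * gap_factor, 0, n_slots - 1)

# At lookup time a real learned index would search a small neighborhood around
# the predicted slot; the mean error below bounds how far that search must go.
true_slot = np.arange(len(keys)) * gap_factor
print("mean slot error with a 1% sample:",
      round(float(np.abs(predicted_slot - true_slot).mean()), 1))
```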

Read more
