Featured Research

Databases

A Big Data Based Framework for Executing Complex Query Over COVID-19 Datasets (COVID-QF)

COVID-19's rapid global spread has driven the development of innovative Big Data analytics tools, which have helped organizations across the health industry track and minimize the effects of the virus. Researchers rely on artificial intelligence, machine learning, and natural language processing to detect coronaviruses and to gain a more complete understanding of the disease. COVID-19 data originates in countries all over the world, so only big data applications and NoSQL databases are suitable for managing it. A number of platforms, such as Spark, H2O, and Hadoop HDFS/MapReduce, are used to process NoSQL database models and are well suited to controlling and managing this enormous amount of data. Programmers of large applications face many challenges, especially when working with COVID-19 databases that span hybrid data models with different APIs and query languages. In this context, this paper proposes a storage framework named COVID-QF that handles both SQL and NoSQL databases for COVID-19 datasets, aiming to mitigate the problems caused by the worldwide spread of the virus by reducing processing times. For NoSQL databases, COVID-QF uses Hadoop HDFS/MapReduce and Apache Spark. COVID-QF consists of three layers: a data collection layer, a storage layer, and a query processing layer. Data is gathered in the data collection layer. The storage layer divides the data into a collection of data-saving and processing blocks and connects the Spark connector to the different database engines to reduce saving and retrieval times, while the query processing layer executes the requested queries and returns the results. Three growing COVID-19 datasets (COVID-19-Merging, COVID-19-inside-Hubei, and COVID-19-ex-Hubei) were used in the experiments of this study. The results obtained confirm the superiority of the COVID-QF framework.
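The paper's own connector code is not reproduced here, but the pattern it describes -- loading COVID-19 records through Spark and answering SQL-style queries over them -- can be illustrated with a short PySpark sketch. The HDFS path, file layout, and column names below are hypothetical assumptions, not COVID-QF internals:

```python
# Minimal sketch of the Spark-over-HDFS query pattern described above.
# Assumes a running Spark installation; the path and columns are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("covid-query-sketch")
         .getOrCreate())

# Load case records from HDFS into a DataFrame (hypothetical dataset).
cases = spark.read.csv("hdfs:///data/covid19/cases.csv",
                       header=True, inferSchema=True)

# Register the data with the SQL engine, then run an analytical query.
cases.createOrReplaceTempView("cases")
result = spark.sql("""
    SELECT country, COUNT(*) AS confirmed
    FROM cases
    WHERE outcome = 'confirmed'
    GROUP BY country
    ORDER BY confirmed DESC
""")
result.show()
```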

Read more
Databases

A Case Study on Visualizing Large Spatial Datasets in a Web-based Map Viewer

Lately, many companies have been using Mobile Workforce Management technologies combined with information collected by sensors on mobile devices in order to improve their business processes. Even for small companies, the information that needs to be handled grows at a high rate, and most of the collected data has a geographic dimension. Being able to visualize this data in real time within a map viewer is very important for these companies. In this paper we focus on this topic, presenting a case study on visualizing large spatial datasets. In particular, since most Mobile Workforce Management software is web-based, we propose a solution suitable for this environment.
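The paper's concrete solution is not detailed in this abstract, but one widely used ingredient for rendering large point sets in a web map viewer is server-side aggregation: collapsing raw points into one marker per grid cell before anything is sent to the browser. The following sketch illustrates that general idea only; the cell size and coordinates are illustrative assumptions:

```python
# Grid-based aggregation of a large point set, a common server-side
# technique for web map viewers: the client renders one marker per
# occupied cell instead of every raw point.
from collections import defaultdict

def grid_aggregate(points, cell_deg=0.05):
    """points: iterable of (lat, lon) pairs; returns aggregated markers."""
    cells = defaultdict(int)
    for lat, lon in points:
        cells[(round(lat / cell_deg), round(lon / cell_deg))] += 1
    # One representative coordinate per cell, with the point count.
    return [(cx * cell_deg, cy * cell_deg, n)
            for (cx, cy), n in cells.items()]

markers = grid_aggregate([(43.36, -8.41), (43.37, -8.40), (40.42, -3.70)])
print(markers)  # two aggregated markers instead of three raw points
```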

Read more
Databases

A Comparative Analysis of Knowledge Graph Query Performance

As Knowledge Graphs (KGs) continue to gain widespread momentum for use in different domains, storing the relevant KG content and efficiently executing queries over it are becoming increasingly important. A range of Data Management Systems (DMSs) have been employed to process KGs. This paper provides an in-depth, fine-grained comparative analysis of query performance across four major DMS types, namely row-, column-, graph-, and document-stores, against major query types, namely subject-subject, subject-object, tree-like, and optional joins. In particular, we analyzed the performance of row-store Virtuoso, column-store Virtuoso, Blazegraph (a graph-store), and MongoDB (a document-store) using five well-known benchmarks, namely BSBM, WatDiv, FishMark, BowlognaBench, and BioBench-Allie. Our results show that no single DMS displays superior query performance across the four query types. In particular, row- and column-store Virtuoso are a factor of 3-8 faster for tree-like joins, Blazegraph performs around one order of magnitude faster for subject-object joins, and MongoDB performs over one order of magnitude faster for highly selective queries.
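Two of the benchmarked join shapes can be made concrete with tiny SPARQL queries. The sketch below uses the rdflib Python library and a made-up three-triple graph to show a subject-subject join (two patterns sharing a subject variable) versus a subject-object join (one pattern's object feeding another's subject); it says nothing about how Virtuoso, Blazegraph, or MongoDB execute them:

```python
# Subject-subject vs. subject-object joins on a toy RDF graph.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:alice ex:worksAt ex:acme .
ex:bob   ex:worksAt ex:acme .
""", format="turtle")

# Subject-subject join: both triple patterns share the subject ?s.
ss = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?s WHERE { ?s ex:knows ?o . ?s ex:worksAt ?c . }
""")

# Subject-object join: the object ?o of the first pattern is the
# subject of the second.
so = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?s ?c WHERE { ?s ex:knows ?o . ?o ex:worksAt ?c . }
""")

print(list(ss))  # alice (knows someone and works somewhere)
print(list(so))  # (alice, acme) via bob
```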

Read more
Databases

A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism

There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of intra-parallelism -- a key component used to boost the performance of OLAP queries in practice -- on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of tuning the degree of parallelism (DOP) via ML techniques in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We explore the design space and perform an extensive experimental study comparing different models against a list of performance metrics, testing how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
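The regression formulation is easy to picture with a toy sketch: featurize (plan, DOP) pairs, fit a model, and pick the DOP whose predicted latency is lowest. Everything below -- the three features, the synthetic latency function, and the random forest model -- is an illustrative assumption, not the paper's featurization or SQL Server behavior:

```python
# DOP tuning cast as regression: predict latency from plan features plus
# a candidate DOP, then choose the DOP minimizing the prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical features per example: [num_operators, estimated_rows, dop].
X = rng.uniform([1, 1e3, 1], [20, 1e7, 64], size=(500, 3))
# Synthetic latency: work shrinks with DOP, with diminishing returns.
y = X[:, 1] / 1e4 / np.sqrt(X[:, 2]) + X[:, 0]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# For a new plan, predict latency at each candidate DOP and keep the best.
plan_features = [8, 2e6]
candidates = [1, 2, 4, 8, 16, 32, 64]
preds = [model.predict([plan_features + [d]])[0] for d in candidates]
print("chosen DOP:", candidates[int(np.argmin(preds))])
```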

Read more
Databases

A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an oracle and refining the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to enable concrete guidelines for practitioners as to which active learning combinations will work well for EM. Towards this, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latencies. Our most surprising result is that active learning with fewer labels can learn a classifier of quality comparable to that of supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10x without affecting the quality of the model.
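The core loop the framework benchmarks -- select an informative unlabeled pair, ask the oracle, retrain -- can be sketched in a few lines. The snippet below uses generic uncertainty sampling with logistic regression on synthetic pair features; the paper's actual learners and selection algorithms differ, so treat this as an illustration of the loop's shape only:

```python
# Generic active learning loop for entity matching: label the pair the
# current classifier is least certain about, then retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                  # similarity features per pair
true_labels = (X.sum(axis=1) > 0).astype(int)  # stands in for the oracle

# Seed with a few labeled matches and non-matches.
labeled = list(np.where(true_labels == 1)[0][:5]) + \
          list(np.where(true_labels == 0)[0][:5])
pool = [i for i in range(300) if i not in labeled]

for _ in range(40):
    clf = LogisticRegression().fit(X[labeled], true_labels[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain pair
    labeled.append(pick)                              # "oracle" labels it
    pool.remove(pick)

print("accuracy with", len(labeled), "labels:", clf.score(X, true_labels))
```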

Read more
Databases

A Dichotomy for the Generalized Model Counting Problem for Unions of Conjunctive Queries

We study the generalized model counting problem, defined as follows: given a database, and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in {0, 1/2, 1}. Previous work has established a dichotomy for Unions of Conjunctive Queries (UCQ) when the probabilities are arbitrary rational numbers, showing that, for each query, its complexity is either in polynomial time or #P-hard. The query is called safe in the first case, and unsafe in the second case. Here, we strengthen the hardness proof, by proving that an unsafe UCQ query remains #P-hard even if the probabilities are restricted to {0, 1/2, 1}. This requires a complete redesign of the hardness proof, using new techniques. A related problem is the model counting problem, which asks for the probability of the query when the input probabilities are restricted to {0, 1/2}. While our result does not extend to model counting for all unsafe UCQs, we prove that model counting is #P-hard for a class of unsafe queries called Type-I forbidden queries.
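The definition is concrete enough to check by brute force on a toy instance. The sketch below counts the sub-databases that contain a given deterministic tuple and satisfy the Boolean query Q: exists x, y such that R(x) and S(x, y); the relations and the deterministic set are made up for illustration, and no such enumeration scales, since the problem is #P-hard in general:

```python
# Brute-force generalized model counting for Q: exists x,y. R(x) and S(x,y).
from itertools import chain, combinations

db = [("R", ("a",)), ("R", ("b",)),
      ("S", ("a", "1")), ("S", ("b", "2"))]
deterministic = {("R", ("a",))}          # tuples every subset must include

def satisfies(sub):
    r = {t for rel, t in sub if rel == "R"}
    s = {t for rel, t in sub if rel == "S"}
    return any((x,) in r for x, _ in s)  # exists x,y: R(x) and S(x,y)

optional = [t for t in db if t not in deterministic]
subsets = chain.from_iterable(combinations(optional, k)
                              for k in range(len(optional) + 1))
count = sum(1 for sub in subsets if satisfies(set(sub) | deterministic))
print(count)  # sub-databases containing the deterministic tuple
              # and satisfying Q (5 of the 8 candidates here)
```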

Read more
Databases

A Foundation for Spatio-Textual-Temporal Cube Analytics (Extended Version)

Large amounts of spatial, textual, and temporal data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing spatial, textual, and temporal data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube structure to enable effective and efficient combined analytical queries over spatial, textual, and temporal data. Our novel data model over spatio-textual-temporal objects enables novel joint and integrated spatial, textual, and temporal insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of spatio-textual-temporal measures with associated novel spatio-textual-temporal-OLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for the exact and approximate computation of spatio-textual-temporal measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost by between 97% and 99.9% compared to the Full Materialization baseline, while adding only negligible overhead in Spatio-Textual-Temporal Cube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to No Materialization.
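The pre-aggregation idea -- materialize a measure once at a fine granularity and answer coarser queries by rolling it up -- can be illustrated with a small pandas example. The dimensions, data, and measure below are illustrative stand-ins, not the paper's cube model or its Twitter dataset:

```python
# Pre-aggregation sketch: roll coarse queries up from a materialized
# fine-grained aggregate instead of rescanning raw objects.
import pandas as pd

objects = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "topic":  ["flu", "flu", "flu", "sports"],
    "day":    ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-01"],
    "count":  [10, 7, 3, 5],
})

# Materialize one fine-grained aggregate (region x topic x day).
fine = objects.groupby(["region", "topic", "day"],
                       as_index=False)["count"].sum()

# A coarser query (totals per topic) rolls up from the materialized
# table, which is what makes responses fast at scale.
print(fine.groupby("topic")["count"].sum())
```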

Read more
Databases

A Framework for Federated SPARQL Query Processing over Heterogeneous Linked Data Fragments

Linked Data Fragments (LDFs) refer to Web interfaces that allow for accessing and querying Knowledge Graphs on the Web. These interfaces, such as SPARQL endpoints or Triple Pattern Fragment servers, differ in the SPARQL expressions they can evaluate and the metadata they provide. Client-side query processing approaches have been proposed that are tailored to evaluate queries over individual interfaces, while federated query processing has focused on federations with a single type of LDF interface, typically SPARQL endpoints. In this work, we address the challenges of SPARQL query processing over federations with heterogeneous LDF interfaces. To this end, we formalize the concept of federations of Linked Data Fragments and propose a framework for federated querying over heterogeneous federations with different LDF interfaces. The framework comprises query decomposition, query planning, and physical operators adapted to the particularities of different LDF interfaces. Further, we propose an approach for each component of our framework and evaluate them in an experimental study on the well-known FedBench benchmark. The results show a substantial improvement in performance, achieved by devising interface-aware approaches that exploit the capabilities of heterogeneous interfaces in federations.
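One of the framework's components, query decomposition, amounts to deciding which federation member can answer each triple pattern. The sketch below does this with a hand-written predicate catalog; real engines build such knowledge from ASK probes or interface metadata, and the sources and patterns here are purely illustrative:

```python
# Toy query decomposition: map each triple pattern to the interfaces
# whose (assumed) catalog advertises its predicate.
catalog = {
    "sparql-endpoint-A": {"foaf:name", "foaf:knows"},
    "tpf-server-B":      {"dbo:birthPlace"},
}

query_patterns = [
    ("?p", "foaf:name", "?n"),
    ("?p", "dbo:birthPlace", "?city"),
]

def decompose(patterns, catalog):
    """Assign each triple pattern to candidate sources by predicate."""
    return {p: [src for src, preds in catalog.items() if p[1] in preds]
            for p in patterns}

for pattern, sources in decompose(query_patterns, catalog).items():
    print(pattern, "->", sources)
```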

Read more
Databases

A Framework for Plant Topology Extraction Using Process Mining and Alarm Data

Industrial plants are prone to faults. To notify the operator of a fault occurrence, alarms are utilized as a basic part of modern computer-controlled plants. However, due to the interconnections between different parts of a plant, a single fault often propagates through the plant and triggers a (sometimes large) number of alarms. A graphical plant topology can help operators, process engineers, and maintenance experts find the root cause of a plant upset or discover the propagation path of a fault. In this paper, a method is developed to extract plant topology from alarm data. The method is based on process mining, a collection of concepts and algorithms that model a process (not necessarily an engineering one) based on recorded events. The event-based nature of alarm data, as well as the chronological order of recorded alarms, makes it suitable for process mining. The methodology developed in this paper is based on preparing alarm data for process mining and then using suitable process mining algorithms to extract the plant topology. The extracted topology is represented by the familiar Petri net, which can be used for root cause analysis and for discovering fault propagation paths. Methods to evaluate the extracted topology are also discussed. A case study on the well-known Tennessee Eastman process demonstrates the utility of the proposed method.
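The pipeline -- alarm records in, discovered Petri net out -- can be sketched with the open-source pm4py process mining library. The alarm tags, timestamps, and the grouping of alarms into cases below are illustrative assumptions; the paper's own data preparation is more involved:

```python
# Alarm log -> event log -> Petri net, sketched with pm4py.
import pandas as pd
import pm4py

alarms = pd.DataFrame({
    "case_id":   ["upset1", "upset1", "upset1", "upset2", "upset2"],
    "alarm_tag": ["FI101-HI", "TI205-HI", "PI310-HI",
                  "FI101-HI", "PI310-HI"],
    "time": pd.to_datetime([
        "2023-05-01 10:00", "2023-05-01 10:02", "2023-05-01 10:05",
        "2023-05-02 14:00", "2023-05-02 14:03"]),
})

# Treat each alarm as an event and each plant upset as a case.
log = pm4py.format_dataframe(alarms, case_id="case_id",
                             activity_key="alarm_tag",
                             timestamp_key="time")

# Discover a Petri net capturing the order in which alarms occur.
net, im, fm = pm4py.discover_petri_net_inductive(log)
pm4py.view_petri_net(net, im, fm)
```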

Read more
Databases

A Fully Dynamic Algorithm for k-Regret Minimizing Sets

Selecting a small set of representatives from a large database is important in many applications such as multi-criteria decision making, web search, and recommendation. The k-regret minimizing set (k-RMS) problem was recently proposed for representative tuple discovery. Specifically, for a large database P of tuples with multiple numerical attributes, the k-RMS problem returns a size-r subset Q of P such that, for any possible ranking function, the score of the top-ranked tuple in Q is not much worse than the score of the k-th-ranked tuple in P. Although the k-RMS problem has been extensively studied in the literature, existing methods are designed for the static setting and cannot maintain the result efficiently when the database is updated. To address this issue, we propose the first fully-dynamic algorithm for the k-RMS problem that can efficiently provide the up-to-date result with respect to any insertion and deletion in the database with a provable guarantee. Experimental results on several real-world and synthetic datasets demonstrate that our algorithm runs up to four orders of magnitude faster than existing k-RMS algorithms while returning results of nearly equal quality.
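What the algorithm maintains is a subset whose maximum k-regret ratio stays small. The quantity itself can be estimated by sampling linear ranking functions, as in the sketch below; the data, the choice of Q, and the sampling are illustrative (exact k-RMS algorithms do not rely on sampling):

```python
# Estimate the maximum k-regret ratio of a subset Q of database P
# over sampled nonnegative linear ranking functions.
import numpy as np

rng = np.random.default_rng(2)
P = rng.uniform(size=(1000, 3))   # 1000 tuples, 3 numerical attributes
Q = P[:20]                        # some candidate representative subset
k = 2

def k_regret_ratio(P, Q, k, n_utils=5000):
    utils = np.abs(rng.normal(size=(n_utils, P.shape[1])))
    utils /= np.linalg.norm(utils, axis=1, keepdims=True)
    scores_P = P @ utils.T                     # scores of all tuples
    kth_best = np.sort(scores_P, axis=0)[-k]   # k-th ranked score in P
    best_Q = (Q @ utils.T).max(axis=0)         # best achievable within Q
    return float(np.maximum(0, (kth_best - best_Q) / kth_best).max())

print("max k-regret ratio of Q:", k_regret_ratio(P, Q, k))
```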

Read more
