Featured Research

Databases

Complete and Sufficient Spatial Domination of Multidimensional Rectangles

Rectangles are used to approximate objects, or sets of objects, in a plethora of applications, systems, and index structures. Many tasks, such as nearest neighbor search and similarity ranking, require deciding whether objects in one rectangle A may, must, or must not be closer to objects in a second rectangle B than objects in a third rectangle R are. It can be shown that minimum and maximum distances alone are often insufficient to decide this relation of "spatial domination". This spatial gem provides a necessary and sufficient decision criterion for spatial domination that can be computed efficiently even in higher-dimensional space. In addition, this spatial gem provides an example, pseudocode, and an implementation in Python.
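A minimal Python sketch of such a criterion, based on the corner-based, per-dimension test known from the spatial pruning literature (the function name and exact formulation are assumptions here, not the gem's own code): A dominates B with respect to R when every point of A is closer to every point of R than any point of B is, and per dimension only R's two interval endpoints need to be checked.

```python
def dominates(A, B, R, p=2):
    """Decide spatial domination under the L_p norm: does every point of
    rectangle A lie closer to every point of R than any point of B does?
    Rectangles are sequences of (lo, hi) intervals, one per dimension.
    Sketch of a corner-based criterion (assumed, not the gem's exact code)."""
    total = 0.0
    for (a_lo, a_hi), (b_lo, b_hi), (r_lo, r_hi) in zip(A, B, R):
        worst = float("-inf")
        for r in (r_lo, r_hi):  # only R's interval endpoints matter per dimension
            max_d = max(abs(a_lo - r), abs(a_hi - r))   # per-dim MaxDist(A, r)
            min_d = max(b_lo - r, r - b_hi, 0.0)        # per-dim MinDist(B, r)
            worst = max(worst, max_d ** p - min_d ** p)
        total += worst
    return total < 0  # negative even in the worst case: A must be closer

# A near R, B far away: A dominates B with respect to R.
print(dominates(A=[(0, 1), (0, 1)], B=[(8, 9), (8, 9)], R=[(2, 3), (2, 3)]))  # True
```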

Read more
Databases

Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

Data inconsistency evaluation and repair are major concerns in data quality management. As a basic computing task, optimal subset repair is not only used for cost estimation during database repairing, but is also directly used to evaluate database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples whose removal from an inconsistent database leaves a consistent subset. Tight complexity bounds and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, and give a fast estimation of the size of an optimal subset repair. We first strengthen the dichotomy for the optimal subset repair computation problem: we show that it is not only APX-complete, but also NP-hard to approximate within a factor better than 17/16 for most cases. Second, we show a (2 − 0.5^(σ−1))-approximation when given σ functional dependencies, and a (2 − η_k + η_k/k)-approximation when an η_k-portion of the tuples has the k-quasi-Turán property for some k > 1. We finally give a sublinear estimator of the size of the optimal S-repair for subset queries: with high probability, it outputs an estimate within a factor of 2 plus an additive error of εn, thus deriving an estimate of the FD-inconsistency degree within a ratio of 2 + ε. To support a variety of subset queries for FD-inconsistency evaluation, we unify them under a ⊆-oracle that can answer membership queries and return p uniformly sampled tuples for any given number p. Experiments are conducted on range queries as an implementation of the ⊆-oracle, and the results show the efficiency of our FD-inconsistency degree estimator.
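The ⊆-oracle interface can be pictured as follows. This is a minimal Python sketch of the range-query instantiation mentioned in the abstract; the class and method names (SubsetOracle, member, sample) are illustrative, not from the paper.

```python
import bisect
import random

class SubsetOracle:
    """Toy ⊆-oracle over a range query [lo, hi] on a sorted attribute:
    supports membership tests and uniform sampling of p tuples."""

    def __init__(self, tuples, key, lo, hi):
        self.tuples = sorted(tuples, key=key)
        keys = [key(t) for t in self.tuples]
        self.l = bisect.bisect_left(keys, lo)    # first index in range
        self.r = bisect.bisect_right(keys, hi)   # one past last index in range
        self.key, self.lo, self.hi = key, lo, hi

    def member(self, t):
        # Membership query: does tuple t belong to the subset?
        return self.lo <= self.key(t) <= self.hi

    def sample(self, p):
        # Return p tuples sampled uniformly at random (with replacement);
        # assumes the subset is non-empty.
        return [self.tuples[random.randrange(self.l, self.r)] for _ in range(p)]

oracle = SubsetOracle([("a", 3), ("b", 7), ("c", 9)], key=lambda t: t[1], lo=5, hi=10)
print(oracle.member(("b", 7)), oracle.sample(2))
```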

Read more
Databases

Compressed Key Sort and Fast Index Reconstruction

In this paper we propose an index key compression scheme based on the notion of distinction bits, proving that the distinction bits of index keys are sufficient to determine their sorted order correctly. While the actual compression ratio varies with the characteristics of the dataset (we observed an average compression ratio of 2.76 to one in our experiments), the scheme leads to significant performance improvements during the reconstruction of large-scale indexes. Our index key compression can be used effectively in database replication and index recovery in modern main-memory database systems.
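As a toy illustration of the distinction-bit idea (a simplified sketch, not the paper's actual scheme): for fixed-width integer keys, keeping just the leading bits, down to the most significant position at which some pair of sort-adjacent keys first differs, preserves both distinctness and sorted order.

```python
def distinction_width(keys, bits=64):
    """Number of leading bits sufficient to keep distinct keys distinct
    (and therefore sorted).  Keys must fit in `bits` bits."""
    ks = sorted(set(keys))
    width = 1
    for a, b in zip(ks, ks[1:]):
        # (a ^ b).bit_length() locates the highest differing bit; convert
        # it to a 1-based position counted from the most significant bit.
        width = max(width, bits - (a ^ b).bit_length() + 1)
    return width

def compress(keys, bits=64):
    d = distinction_width(keys, bits)
    return [k >> (bits - d) for k in keys], d

keys = [0x1234_0000_0000_0000, 0x1234_8000_0000_0000, 0xFFFF_0000_0000_0000]
small, d = compress(keys)   # here d == 17: 64-bit keys shrink to 17 bits
assert sorted(small) == [k >> (64 - d) for k in sorted(keys)]  # order preserved
```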

Read more
Databases

Computing Local Sensitivities of Counting Queries with Joins

Local sensitivity of a query Q given a database instance D, i.e., how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications, including query analysis, outlier detection, and differential privacy. However, computing the local sensitivity of a conjunctive query is NP-hard in the size of the query, even for the class of acyclic queries. Although the complexity is polynomial when the query size is fixed, naive algorithms are not efficient for large databases and queries involving multiple joins. In this paper, we present a novel approach to compute the local sensitivity of counting queries involving join operations by tracking and summarizing tuple sensitivities, i.e., the maximum change a tuple can cause in the query result when it is added or removed. We give join-tree-based algorithms for full acyclic join queries that run in polynomial time in both the size of the database and the query for an interesting subclass of queries, which we call 'doubly acyclic queries' and which includes path queries, and in polynomial time in combined complexity when the maximum degree in the join tree is bounded. Our algorithms can be extended to certain non-acyclic queries using generalized hypertree decompositions. We evaluate our approach experimentally and show applications of our algorithms that improve results for differential privacy by orders of magnitude.
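For intuition, a hedged toy example rather than the paper's general join-tree algorithm: for a counting query over a single two-relation join, the tuple sensitivity of a tuple with join value b is the degree of b on the other side, so the local sensitivity is the maximum such degree.

```python
from collections import Counter

def local_sensitivity_join_count(R, S):
    """Local sensitivity of Q(D) = |R JOIN S on B|, where the join key is
    the last field of R-tuples and the first field of S-tuples.  Adding or
    removing one tuple with join value b changes the count by the degree
    of b on the other side, so LS is the maximum degree overall."""
    deg_R = Counter(t[-1] for t in R)   # degree of each join value in R
    deg_S = Counter(t[0] for t in S)    # degree of each join value in S
    return max(max(deg_R.values(), default=0), max(deg_S.values(), default=0))

R = [("a", 1), ("b", 1), ("c", 2)]
S = [(1, "x"), (2, "y"), (2, "z")]
# Join size is 2*1 + 1*2 = 4; a new S-tuple with key 1 would add 2 results.
print(local_sensitivity_join_count(R, S))  # 2
```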

Read more
Databases

Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems

The reliability and proper function of data-driven applications hinge on the data's continued conformance to the applications' initial design. When data deviates from this initial profile, system behavior becomes unpredictable. Data profiling techniques such as functional dependencies and denial constraints encode patterns in the data that can be used to detect deviations. But traditional methods typically focus on exact constraints and categorical attributes, and are ill-suited for tasks such as determining whether the prediction of a machine learning system can be trusted or quantifying data drift. In this paper, we introduce data invariants, a new data-profiling primitive that models arithmetic relationships involving multiple numerical attributes within a (noisy) dataset and complements existing data-profiling techniques. We propose a quantitative semantics to measure the degree of violation of a data invariant, and establish that strong data invariants can be constructed from observations with low variance on the given dataset. A concrete instance of this principle gives the surprising result that the low-variance components of a principal component analysis (PCA), which are usually discarded, generate better invariants than the high-variance components. We demonstrate the value of data invariants on two applications: trusted machine learning and data drift. We empirically show that data invariants can (1) reliably detect tuples on which the prediction of a machine-learned model should not be trusted, and (2) quantify data drift more accurately than state-of-the-art methods. Additionally, we present four case studies in which an intervention-centric explanation tool uses data invariants to explain causes of tuple non-conformance.
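A minimal numpy sketch of the PCA principle stated above: the projection of conforming tuples onto the lowest-variance principal direction stays near constant, and the deviation of a new tuple's projection measures its violation. Thresholds and the paper's exact quantitative semantics are simplified away here.

```python
import numpy as np

def fit_invariant(X):
    """Learn a simple data invariant: the projection of X onto its
    lowest-variance principal direction stays near a constant."""
    mu = X.mean(axis=0)
    # Eigendecomposition of the covariance; eigh sorts eigenvalues
    # ascending, so column 0 is the lowest-variance direction.
    _, V = np.linalg.eigh(np.cov((X - mu).T))
    v = V[:, 0]
    proj = (X - mu) @ v
    return mu, v, proj.std()

def violation(x, inv):
    """Degree of violation: deviation in units of the training std."""
    mu, v, s = inv
    return abs((x - mu) @ v) / (s + 1e-12)

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.01, size=1000)])
inv = fit_invariant(X)                        # captures x2 ≈ 2*x1
print(violation(np.array([1.0, 2.0]), inv))   # small: conforming tuple
print(violation(np.array([1.0, 5.0]), inv))   # large: non-conforming tuple
```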

Read more
Databases

Conquery: an open source application to analyze high content healthcare data

Background: Big data in healthcare must be exploited to achieve substantial gains in efficiency and competitiveness. The analysis of patient-related data in particular holds huge potential to improve decision-making processes in the healthcare sector. However, most analytical approaches used today are highly time- and resource-consuming. The software solution presented here, Conquery, is an open-source tool providing advanced but intuitive data analysis without the need for specialized statistical training. Results: We developed a highly scalable, column-oriented, distributed timeseries database and analysis platform. Its main application is the analysis of per-person medical records by non-technical medical professionals. Complex analyses can be performed in a web frontend without deep knowledge of the underlying data and its structure. Queries are evaluated by a bespoke distributed query engine for medical records. To achieve low response times, we present a custom compression scheme that uses precomputed as well as online-calculated metadata and statistics. Conclusions: Conquery enables users to query and analyze large datasets quickly without requiring technical expertise. This reduces the technical burden in the decision-making process and facilitates better data utilization in the healthcare sector. As the only open-source software in Germany that explicitly addresses the stringent requirements on the use and analysis of health data, Conquery is of great value to the healthcare community.

Read more
Databases

Consistency and Certain Answers in Relational to RDF Data Exchange with Shape Constraints

We investigate data exchange from relational databases to RDF graphs, inspired by R2RML with the addition of target shape schemas. We study the problems of consistency, i.e., checking that every source instance admits a solution, and certain query answering, i.e., finding answers present in every solution. We identify the class of constructive relational-to-RDF data exchange, which uses IRI constructors and full tgds (with no existential variables) in its source-to-target dependencies. We show that the consistency problem is coNP-complete. We introduce the notion of a universal simulation solution, which allows computing certain answers for any class of queries that is robust under simulation. One such class is that of forward nested regular expressions (NREs), i.e., NREs that do not use the inverse operator. Using a universal simulation solution renders the computation of certain answers to forward NREs tractable in data complexity. Finally, we present a number of results showing that relaxing the restrictions of the proposed framework leads to an increase in complexity.
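As an illustrative example (an assumption for concreteness, not taken from the paper), a full source-to-target dependency with an IRI constructor might map a relational Person table to RDF triples:

```latex
\mathit{Person}(x, y) \;\rightarrow\; \mathit{Triple}\bigl(f_{\mathrm{person}}(x),\; \mathtt{:name},\; y\bigr)
```

Here f_person is an IRI constructor, e.g., mapping a key x to the IRI http://example.org/person/x. The dependency is full because its right-hand side contains no existential variables, which is exactly the restriction defining the constructive class.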

Read more
Databases

Consistency, Acyclicity, and Positive Semirings

In several different settings, one comes across situations in which the objects of study are locally consistent but globally inconsistent. Earlier work about probability distributions by Vorob'ev (1962) and about database relations by Beeri, Fagin, Maier, Yannakakis (1983) produced characterizations of when local consistency always implies global consistency. Towards a common generalization of these results, we consider K-relations, that is, relations over a set of attributes such that each tuple in the relation is associated with an element from an arbitrary, but fixed, positive semiring K. We introduce the notions of projection of a K-relation, consistency of two K-relations, and global consistency of a collection of K-relations; these notions are natural extensions of the corresponding notions about probability distributions and database relations. We then show that a collection of sets of attributes has the property that every pairwise consistent collection of K-relations over those attributes is globally consistent if and only if the sets of attributes form an acyclic hypergraph. This generalizes the aforementioned results by Vorob'ev and by Beeri et al., and demonstrates that K-relations over positive semirings constitute a natural framework for the study of the interplay between local and global consistency. In the course of the proof, we introduce a notion of join of two K-relations and argue that it is the "right" generalization of the join of two database relations. Furthermore, to show that non-acyclic hypergraphs yield pairwise consistent K-relations that are globally inconsistent, we generalize a construction by Tseitin (1968) in his study of hard-to-prove tautologies in propositional logic.
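A minimal Python sketch of projection and join of K-relations under the usual semiring-annotated semantics, where projection combines annotations with the semiring addition and join combines them with the semiring multiplication. This follows the provenance-semiring literature; the paper's exact definitions, in particular of join, may differ.

```python
from collections import defaultdict

def project(R, attrs, keep, plus, zero):
    """Projection of a K-relation (dict: tuple -> annotation): annotations
    of tuples that collapse onto the same projection are added in K."""
    idx = [attrs.index(a) for a in keep]
    out = defaultdict(lambda: zero)
    for t, k in R.items():
        key = tuple(t[i] for i in idx)
        out[key] = plus(out[key], k)
    return dict(out)

def join(R, attrs_R, S, attrs_S, times):
    """Join of two K-relations: annotations of matching tuples multiply in K."""
    shared = [a for a in attrs_R if a in attrs_S]
    extra = [i for i, a in enumerate(attrs_S) if a not in attrs_R]
    out = {}
    for t, k in R.items():
        for u, m in S.items():
            if all(t[attrs_R.index(a)] == u[attrs_S.index(a)] for a in shared):
                out[t + tuple(u[i] for i in extra)] = times(k, m)
    return out

# Over the counting semiring (N, +, *, 0, 1), K-relations are bags:
R = {("a", 1): 2, ("b", 1): 1}   # attributes (A, B), annotations = multiplicities
S = {(1, "x"): 3}                # attributes (B, C)
print(join(R, ["A", "B"], S, ["B", "C"], lambda x, y: x * y))
# {('a', 1, 'x'): 6, ('b', 1, 'x'): 3}
print(project(R, ["A", "B"], ["B"], lambda x, y: x + y, 0))   # {(1,): 3}
```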

Read more
Databases

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection, query optimization, and data integration. The estimation problem is especially challenging for large-scale high-dimensional data due to the curse of dimensionality, the large variance of selectivity across different queries, and the need to make the estimator consistent (i.e., the selectivity must be non-decreasing in the threshold). We propose a new deep learning-based model that learns a query-dependent piecewise linear function as the selectivity estimator, which is flexible enough to fit the selectivity curve of any distance function and query object while guaranteeing that the output is non-decreasing in the threshold. To improve accuracy on large datasets, we partition the dataset into multiple disjoint subsets and build a local model on each of them. Experiments on real datasets show that the proposed model consistently outperforms state-of-the-art models in accuracy while remaining efficient, and is useful for real applications.
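One standard way to guarantee a non-decreasing piecewise linear output is to predict non-negative per-segment increments and accumulate them. The sketch below illustrates this construction; it is an assumption about how such a guarantee can be achieved, not necessarily the paper's architecture.

```python
import numpy as np

def monotone_pwl(raw_increments, knots, thresholds):
    """Non-decreasing piecewise linear selectivity estimate.
    raw_increments: unconstrained per-segment outputs of a model; a
    softplus makes them non-negative, and the cumulative sum makes the
    resulting function non-decreasing in the query threshold."""
    inc = np.logaddexp(0.0, raw_increments)            # softplus >= 0
    heights = np.concatenate([[0.0], np.cumsum(inc)])  # values at the knots
    return np.interp(thresholds, knots, heights)       # linear in between

knots = np.linspace(0.0, 1.0, 6)                # 5 segments on [0, 1]
raw = np.array([-1.0, 0.5, 2.0, 0.0, -3.0])     # would come from the network
print(monotone_pwl(raw, knots, np.array([0.1, 0.4, 0.9])))  # non-decreasing
```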

Read more
Databases

Constant delay enumeration with FPT-preprocessing for conjunctive queries of bounded submodular width

Marx (STOC 2010, J. ACM 2013) introduced the notion of submodular width of a conjunctive query (CQ) and showed that for any class Φ of Boolean CQs of bounded submodular width, the model-checking problem for Φ on the class of all finite structures is fixed-parameter tractable (FPT). Note that for non-Boolean queries, the size of the query result may be far too large to be computed entirely within FPT time. We investigate the free-connex variant of submodular width and generalise Marx's result to non-Boolean queries as follows: for every class Φ of CQs of bounded free-connex submodular width, within FPT-preprocessing time we can build a data structure that allows us to enumerate, without repetition and with constant delay, all tuples of the query result. Our proof builds upon Marx's splitting routine to decompose the query result into a union of results, but we have to tackle the additional technical difficulty of ensuring that these can be enumerated efficiently.
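A toy illustration of the preprocessing-then-enumeration regime (far simpler than the paper's machinery, and only an analogy): for the full path join Q(x, y, z) = R(x, y) AND S(y, z), a semi-join reduction during preprocessing removes dangling tuples, after which a generator emits each answer with constant delay.

```python
from collections import defaultdict

def preprocess(R, S):
    """Preprocessing for Q(x, y, z) = R(x, y) AND S(y, z): index S by the
    join value y and drop dangling R-tuples (semi-join reduction), so
    enumeration never hits a dead end."""
    S_by_y = defaultdict(list)
    for y, z in S:
        S_by_y[y].append(z)
    R_reduced = [(x, y) for x, y in R if y in S_by_y]
    return R_reduced, S_by_y

def enumerate_results(R_reduced, S_by_y):
    """Yields answers without repetition; every inner iteration emits a
    tuple, so the delay between consecutive answers is O(1)."""
    for x, y in R_reduced:
        for z in S_by_y[y]:
            yield (x, y, z)

R = [(1, "a"), (2, "a"), (3, "b"), (4, "dangling")]
S = [("a", 10), ("a", 11), ("b", 12)]
print(list(enumerate_results(*preprocess(R, S))))
```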

Read more
