Featured Research

Databases

Evolution of the ROOT Tree I/O

The ROOT TTree data format encodes hundreds of petabytes of High Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts ("branches") that are actually used in a given analysis need to be read from storage. Its unique feature is its seamless C++ integration, which allows users to directly store their event classes without explicitly defining data schemas. In this contribution, we present the status of and plans for the future ROOT 7 event I/O. Along with the ROOT 7 interface modernization, we aim for robust and, where possible, compile-time-safe C++ interfaces to read and write event data. On the performance side, we show first benchmarks using ROOT's new experimental I/O subsystem, which combines the best of TTrees with recent advances in columnar data formats. A core ingredient is a strong separation of the high-level logical data layout (C++ classes) from the low-level physical data layout (storage-backed nested vectors of simple types). We show how the new, optimized physical data layout speeds up serialization and deserialization and facilitates parallel, vectorized, and bulk operations. This lets ROOT I/O run optimally on upcoming ultra-fast NVRAM storage devices, as well as on file-less storage systems such as object stores.
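
To illustrate the core idea, here is a minimal, hypothetical Python sketch (not ROOT's actual C++ implementation) of the split between logical and physical layout: nested per-event vectors are decomposed into a flat value column plus an offset column, so one event's data can be deserialized without touching the rest.

```python
# Logical layout: each event holds a variable-length list of track momenta.
# Physical layout: one offset column and one flat value column.
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    tracks: List[float]  # nested vector, as the user-facing C++ class would hold

def to_columns(events):
    """Decompose nested vectors into (offsets, values) columns."""
    offsets, values = [0], []
    for ev in events:
        values.extend(ev.tracks)
        offsets.append(len(values))
    return offsets, values

def read_event(offsets, values, i):
    """Deserialize only event i's tracks; no other data is touched."""
    return values[offsets[i]:offsets[i + 1]]

events = [Event([1.2, 3.4]), Event([]), Event([5.6, 7.8, 9.0])]
offsets, values = to_columns(events)
assert read_event(offsets, values, 2) == [5.6, 7.8, 9.0]
```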

Read more
Databases

ExSample: Efficient Searches on Video Repositories through Adaptive Sampling

Capturing and processing video is increasingly common as cameras become cheaper to deploy. At the same time, rich video-understanding methods have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art object detection methods are slow and expensive, our ability to process even simple ad-hoc object search queries ('find 100 traffic lights in dashcam video') over this accumulated data lags far behind our ability to collect it. Processing video at reduced sampling rates is a reasonable default strategy for these types of queries; however, the ideal sampling rate is both data- and query-dependent. We introduce ExSample, a low-cost framework for object search over unindexed video that quickly processes search queries by adapting the amount and location of sampled frames to the particular data and query being processed. ExSample prioritizes the processing of frames in a video repository so that processing is focused on the portions of video most likely to contain objects of interest, and it continually re-prioritizes processing based on feedback from previously processed frames. On large, real-world datasets, ExSample reduces processing time by up to 6x over an efficient random sampling baseline and by several orders of magnitude over state-of-the-art methods that train specialized per-query surrogate models. ExSample is thus a key component in building cost-efficient video data management systems.
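
The following is a rough Python sketch of the adaptive-sampling idea (not ExSample's published algorithm): the video is split into chunks, each chunk's hit rate gets a Beta posterior, and Thompson sampling picks which chunk to sample next. Here `detect` is a hypothetical stand-in for an expensive object detector.

```python
import random

def detect(chunk_id):
    # Hypothetical stand-in for the detector: True if a frame sampled
    # from this chunk contains an object of interest.
    true_rates = {0: 0.05, 1: 0.4, 2: 0.01}
    return random.random() < true_rates[chunk_id]

def search(n_chunks=3, budget=200, target=30):
    hits = [1] * n_chunks    # Beta(1, 1) prior per chunk
    misses = [1] * n_chunks
    found = 0
    for _ in range(budget):
        # Thompson sampling: draw from each chunk's posterior, sample the best.
        draws = [random.betavariate(hits[c], misses[c]) for c in range(n_chunks)]
        c = max(range(n_chunks), key=lambda i: draws[i])
        if detect(c):
            hits[c] += 1
            found += 1
            if found >= target:
                break
        else:
            misses[c] += 1
    return found

random.seed(0)
print(search())  # budget concentrates on the chunk with the high hit rate
```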

Read more
Databases

Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches

Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good-enough results with faster performance. Locality Sensitive Hashing (LSH) is a very popular technique for finding approximate nearest neighbors in high-dimensional spaces. Apart from providing theoretical guarantees on the query results, one of the main benefits of LSH techniques is their good scalability to large datasets, because they are external-memory based. The dominant costs of existing LSH techniques are the algorithm time and the index I/Os required to find candidate points, yet existing works do not compare both of these costs in their evaluations. In this experimental survey paper, we show the impact of both costs on the overall performance of an LSH technique. We compare three state-of-the-art techniques on four real-world datasets and show that, in contrast to recent works, C2LSH is still the state-of-the-art algorithm in terms of performance while achieving accuracy similar to that of its recent competitors.
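
For readers unfamiliar with LSH, here is a minimal sketch of the classic hash family for Euclidean distance, h(v) = floor((a.v + b) / w); this is textbook LSH, not the C2LSH algorithm evaluated in the paper. Nearby points land in the same bucket with higher probability, so exact distances need only be computed for candidates sharing the query's bucket.

```python
import random
from collections import defaultdict

class L2LSHTable:
    def __init__(self, dim, w=4.0, k=4):
        self.w = w
        # k random projections: Gaussian direction a, uniform shift b in [0, w).
        self.hashes = [([random.gauss(0, 1) for _ in range(dim)],
                        random.uniform(0, w)) for _ in range(k)]
        self.buckets = defaultdict(list)

    def _key(self, v):
        # Concatenate k hash values h(v) = floor((a.v + b) / w) into one key.
        return tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // self.w)
                     for a, b in self.hashes)

    def insert(self, point_id, v):
        self.buckets[self._key(v)].append(point_id)

    def candidates(self, q):
        # Only points colliding with the query need exact distance checks.
        return self.buckets.get(self._key(q), [])

random.seed(1)
table = L2LSHTable(dim=8)
data = {i: [random.gauss(0, 1) for _ in range(8)] for i in range(1000)}
for i, v in data.items():
    table.insert(i, v)
print(42 in table.candidates(data[42]))  # identical point always collides
```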

Read more
Databases

Explainable Queries over Event Logs

Added value can be extracted from event logs generated by business processes in various ways. However, although complex computations can be performed over event logs, the result of such computations is often difficult to explain; in particular, it is hard to determine what parts of an input log actually matter in the production of that result. This paper describes how an existing log processing library, called BeepBeep, can be extended to provide a form of provenance: individual output events produced by a query can be precisely traced back to the data elements of the log that contribute to (i.e., "explain") the result.
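
A sketch of the general provenance idea in Python (BeepBeep itself is a Java library; this is not its API): each event flowing through a small query pipeline carries the set of input-log positions it derives from, so any output can be traced back to the log entries that explain it.

```python
def source(log):
    # Wrap each raw event with its own position as initial provenance.
    for i, e in enumerate(log):
        yield e, {i}

def filt(stream, pred):
    # Filtering passes each event's provenance through unchanged.
    for e, prov in stream:
        if pred(e):
            yield e, prov

def window_sum(stream, n):
    # Sliding sum over n events; output provenance is the inputs' union.
    buf = []
    for e, prov in stream:
        buf.append((e, prov))
        if len(buf) == n:
            total = sum(v for v, _ in buf)
            explanation = set().union(*(p for _, p in buf))
            yield total, explanation
            buf.pop(0)

log = [3, 9, 1, 7, 2]
for value, explains in window_sum(filt(source(log), lambda x: x > 2), 2):
    print(value, "explained by log positions", sorted(explains))
```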

Read more
Databases

Explaining Inference Queries with Bayesian Optimization

Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation than those found by state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at this https URL.
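
A hedged sketch of the kind of black-box objective involved (column names and the scoring heuristic are invented for illustration): a candidate predicate is scored by how much removing its matching tuples shifts the aggregate result. BOExplain searches this space with Bayesian optimization; the sketch below substitutes plain random search for brevity.

```python
import random

random.seed(0)
rows = [{"age": random.randint(18, 70), "pred": random.random()} for _ in range(500)]
for r in rows:
    if r["age"] > 60:
        r["pred"] += 0.5  # inject a region of inflated predictions to be found

def query(data):
    # The aggregate whose unexpected value we want to explain.
    return sum(r["pred"] for r in data) / len(data)

def objective(threshold):
    # Candidate explanation: "remove tuples with age > threshold".
    kept = [r for r in rows if r["age"] <= threshold]
    if not kept:
        return 0.0
    shift = query(rows) - query(kept)      # how much the result moves
    removed = 1 - len(kept) / len(rows)
    return shift - 0.1 * removed           # favor small, high-impact predicates

# Random search as a stand-in for the BO loop.
best = max((random.uniform(18, 70) for _ in range(50)), key=objective)
print(f"explanation: remove tuples with age > {best:.1f}")
```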

Read more
Databases

Explaining Natural Language Query Results

Multiple lines of research have developed Natural Language (NL) interfaces for formulating database queries. We build upon this work, but focus on presenting a highly detailed form of the answers in NL. Importantly, the answers that we present are based on the provenance of tuples in the query result, detailing not only the results but also their explanations. We develop a novel method for transforming provenance information into NL by leveraging the structure of the original NL query. Furthermore, since provenance information is typically large and complex, we present two solutions for its effective presentation as NL text: one based on provenance factorization, with novel desiderata relevant to the NL case, and one based on summarization. We have implemented our solution in an end-to-end system supporting questions, answers, and provenance, all expressed in NL. Our experiments, including a user study, indicate the quality of our solution and its scalability.
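
As a rough illustration of provenance factorization (not the paper's system; the tuple names are invented), the sketch below factors a shared tuple out of a result's alternative derivations so that the generated sentence does not repeat it.

```python
def factorize(derivations):
    """Pull out a tuple common to every derivation, if one exists."""
    common = set.intersection(*derivations)
    if not common:
        return None, derivations
    t = sorted(common)[0]
    return t, [d - {t} for d in derivations]

def to_text(derivations):
    # Render factorized provenance as (stilted but compact) English.
    shared, rest = factorize(derivations)
    alts = " or ".join(" and ".join(sorted(d)) for d in rest)
    if shared:
        return f"because of {shared}, together with {alts}"
    return f"because of {alts}"

prov = [{"paper(OOPSLA)", "author(Alice)"},
        {"paper(OOPSLA)", "author(Bob)"}]
print(to_text(prov))
# because of paper(OOPSLA), together with author(Alice) or author(Bob)
```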

Read more
Databases

Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data

The demand for high-performance anomaly detection techniques for IoT data is becoming urgent, especially in industrial settings. Identifying and explaining anomalies in time-series data is an essential task in IoT data mining. Because existing anomaly detection techniques focus on identifying anomalies, the problem of explaining them remains poorly solved. We address the anomaly explanation problem for multivariate IoT data and propose a self-contained, three-step method in this paper. We formalize and exploit domain knowledge in our method and identify anomalies as violations of constraints. We propose set-cover-based anomaly explanation algorithms to discover the anomaly events reflected by violated features, and we further develop knowledge-update algorithms to improve the original knowledge set. Experimental results on real datasets from large-scale IoT systems verify that our method computes high-quality explanations of anomalies. Our work provides a guide to explainable anomaly detection in both IoT fault diagnosis and temporal data cleaning.
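
A minimal sketch of the set-cover view of explanation (the paper's exact formulation may differ, and the event and constraint names are invented): each candidate event explains a set of violated constraints, and a greedy cover selects a small set of events that together account for every observed violation.

```python
def greedy_cover(violations, candidates):
    """Greedily pick candidate events until all violations are covered."""
    uncovered = set(violations)
    explanation = []
    while uncovered:
        # Pick the event covering the most remaining violations.
        best = max(candidates, key=lambda e: len(candidates[e] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining violations cannot be explained
        explanation.append(best)
        uncovered -= candidates[best]
    return explanation, uncovered

violations = {"temp_spike", "pressure_drop", "flow_stall"}
candidates = {
    "valve_fault":  {"pressure_drop", "flow_stall"},
    "sensor_drift": {"temp_spike"},
    "pump_restart": {"flow_stall"},
}
events, unexplained = greedy_cover(violations, candidates)
print(events, unexplained)  # ['valve_fault', 'sensor_drift'] set()
```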

Read more
Databases

Extending Databases to Support Data Manipulation with Functional Dependencies: a Vision Paper

In this paper, we propose to fuse together stored data (tables) and their functional dependencies (FDs) inside a DBMS. We aim to make FDs first-class citizens: objects that can be queried and used to query data. Our idea is to allow analysts to explore both data and functional dependencies through the database interface. For example, an analyst may be interested in tasks such as: "find all rows which prevent a given functional dependency from holding", "for a given table, find all functional dependencies that involve a given attribute", or "project all attributes that functionally determine a specified attribute". For this purpose, we propose: (1) an SQL-based query language for querying a collection of functional dependencies; (2) an extension of the SQL SELECT clause that supports FD-based predicates, including approximate ones; and (3) a special data structure intended to contain mined FDs and act as a mediator between user queries and the underlying data. We describe the proposed extensions, demonstrate their use cases, and, finally, discuss implementation details and their impact on query processing.
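
The first example task, "find all rows which prevent a given functional dependency from holding", can be sketched in plain Python for an FD X -> Y (the paper envisions doing this inside the DBMS via SQL extensions; the table here is made up): group rows by the determinant attributes and report the groups whose dependent values disagree.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return all rows belonging to groups that violate the FD lhs -> rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return [row
            for grp in groups.values()
            if len({tuple(r[a] for a in rhs) for r in grp}) > 1
            for row in grp]

employees = [
    {"dept": "db", "building": "B1", "name": "ann"},
    {"dept": "db", "building": "B2", "name": "bob"},
    {"dept": "ml", "building": "B3", "name": "eve"},
]
# The two 'db' rows together prevent dept -> building from holding.
for row in fd_violations(employees, lhs=["dept"], rhs=["building"]):
    print(row)
```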

Read more
Databases

Extensible Data Skipping

Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial data, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products/services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, each requiring only around 30 lines of additional code. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings at very low development cost.
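
A hedged sketch of the extensibility idea (the real framework is a Spark integration; this tiny API is invented for illustration): each index type summarizes an object's values into metadata and decides, given a query predicate, whether the whole object can be skipped.

```python
class MinMaxIndex:
    """Built-in-style index: skip objects whose [min, max] range
    cannot satisfy a range predicate."""
    def collect(self, values):
        return (min(values), max(values))

    def can_skip(self, meta, pred_lo, pred_hi):
        lo, hi = meta
        return hi < pred_lo or lo > pred_hi

class ValueSetIndex:
    """User-defined index for set-membership predicates, e.g. log levels."""
    def collect(self, values):
        return frozenset(values)

    def can_skip(self, meta, wanted):
        return not (meta & wanted)

objects = {"part-0": [1, 5, 9], "part-1": [40, 42, 47]}
idx = MinMaxIndex()
meta = {name: idx.collect(vals) for name, vals in objects.items()}
# Query: value BETWEEN 10 AND 30 -- both objects are skipped, zero I/O.
print([n for n, m in meta.items() if not idx.can_skip(m, 10, 30)])

levels = {"part-0": ["INFO", "WARN"], "part-1": ["ERROR"]}
vidx = ValueSetIndex()
vmeta = {n: vidx.collect(v) for n, v in levels.items()}
# Query: level IN ('ERROR') -- only part-1 needs to be read.
print([n for n, m in vmeta.items() if not vidx.can_skip(m, {"ERROR"})])
```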

Read more
Databases

Extracting Multiple Viewpoint Models from Relational Databases

Much time in process mining projects is spent on finding and understanding data sources and extracting the needed event data. As a result, only a fraction of the time is spent actually applying techniques to discover, control, and predict the business process. Moreover, current process mining techniques assume a single case notion, whereas in real-life processes different case notions are often intertwined. For example, events of the same order-handling process may refer to customers, orders, order lines, deliveries, and payments. We therefore propose Multiple Viewpoint (MVP) models, which relate events through objects and relate activities through classes. The required event data are much closer to what existing relational databases already hold. MVP models provide a holistic view of the process, but also allow for the extraction of classical event logs using different viewpoints, so existing process mining techniques can be used for each viewpoint without new data extractions and transformations. We provide a toolchain for discovering MVP models (annotated with performance and frequency information) from relational databases. Moreover, we demonstrate that classical process mining techniques can be applied to any selected viewpoint.
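
A minimal sketch of viewpoint extraction (not the paper's toolchain; the event table is invented): events may refer to several object types at once, and choosing one object type as the case notion yields a classical event log without a new extraction from the database.

```python
from collections import defaultdict

events = [
    {"activity": "create order", "order": "o1", "customer": "c1"},
    {"activity": "pick item",    "order": "o1", "item": "i1"},
    {"activity": "pick item",    "order": "o1", "item": "i2"},
    {"activity": "pay",          "order": "o1", "customer": "c1"},
]

def event_log(events, case_notion):
    """Group events into traces by the chosen object type."""
    traces = defaultdict(list)
    for e in events:
        if case_notion in e:
            traces[e[case_notion]].append(e["activity"])
    return dict(traces)

print(event_log(events, "order"))     # one trace for o1 with four steps
print(event_log(events, "customer"))  # same data, customer viewpoint
```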

Read more
