Flow-Loss: Learning Cardinality Estimates That Matter

Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data.
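For readers unfamiliar with the baseline, Q-Error is conventionally defined as the factor by which an estimate is off in either direction, max(est/true, true/est). The sketch below shows only that baseline metric, not Flow-Loss itself. Clamping cardinalities to 1 and the sample values are assumptions made for illustration.

```python
def q_error(estimate: float, truth: float) -> float:
    """Factor by which an estimate is off, regardless of direction."""
    # Assumption of this sketch: clamp to 1 so empty results do not divide by zero.
    est = max(estimate, 1.0)
    true = max(truth, 1.0)
    return max(est / true, true / est)


# Hypothetical sub-plan cardinalities, not taken from the benchmark.
estimates = [100.0, 5_000.0, 20.0]
truths = [1_000.0, 5_000.0, 10.0]
errors = [q_error(e, t) for e, t in zip(estimates, truths)]
```

Every sub-plan contributes equally to this metric, whether or not its estimate changes the plan the optimizer picks; that is the property the abstract contrasts with Flow-Loss.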

ForkBase: Immutable, Tamper-evident Storage Substrate for Branchable Applications

Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler for collaborative coding, has proven its success in countless projects around the world. Hence, applying the Git philosophy to general data collaboration beyond coding is appealing. We call it Git for data. However, the original Git design handles data at file granularity, which is considered too coarse-grained for many database applications. We argue that Git for data should be co-designed with database systems. To this end, we developed ForkBase to make Git for data practical. ForkBase is a distributed, immutable storage system designed for data version management and collaborative data operations. In this demonstration, we show how ForkBase can greatly facilitate collaborative data management and how its novel data deduplication technique can improve storage efficiency for archiving massive data versions.

From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees

We introduce BOURBON, a log-structured merge (LSM) tree that utilizes machine learning to provide fast lookups. We base the design and implementation of BOURBON on empirically-grounded principles that we derive through careful analysis of LSM design. BOURBON employs greedy piecewise linear regression to learn key distributions, enabling fast lookup with minimal computation, and applies a cost-benefit strategy to decide when learning will be worthwhile. Through a series of experiments on both synthetic and real-world datasets, we show that BOURBON improves lookup performance by 1.23x-1.78x as compared to state-of-the-art production LSMs.
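To illustrate the idea of learning a key distribution with greedy piecewise linear regression, here is a minimal sketch. It assumes a "shrinking cone" variant with an error bound `delta`; the abstract does not specify BOURBON's exact algorithm, and the names `greedy_plr`, `predict` and `delta` are hypothetical. Keys must be sorted and distinct.

```python
from bisect import bisect_right


def _pick_slope(lo, hi):
    # A one-point segment has no slope constraint; use 0 in that case.
    return 0.0 if hi == float("inf") else (lo + hi) / 2


def greedy_plr(keys, delta):
    """Return segments (start_key, start_pos, slope) such that each key's
    predicted position is within delta of its true position."""
    segments = []
    x0, y0 = keys[0], 0
    lo, hi = float("-inf"), float("inf")  # feasible slope range for this segment
    for pos in range(1, len(keys)):
        dx = keys[pos] - x0
        new_lo = max(lo, (pos - delta - y0) / dx)
        new_hi = min(hi, (pos + delta - y0) / dx)
        if new_lo > new_hi:
            # No single slope fits this point too: close the segment, start anew.
            segments.append((x0, y0, _pick_slope(lo, hi)))
            x0, y0 = keys[pos], pos
            lo, hi = float("-inf"), float("inf")
        else:
            lo, hi = new_lo, new_hi
    segments.append((x0, y0, _pick_slope(lo, hi)))
    return segments


def predict(segments, key):
    """Estimate the position of key using the segment that covers it."""
    starts = [s[0] for s in segments]
    i = max(bisect_right(starts, key) - 1, 0)
    x0, y0, slope = segments[i]
    return y0 + slope * (key - x0)
```

A lookup would then probe only the positions within `delta` of `predict(segments, key)` instead of searching the whole sorted run, which is where the reduced computation comes from.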

FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation

Data has grown exponentially in recent years, and knowledge graphs constitute powerful formalisms to integrate a myriad of existing data sources. Transformation functions -- specified with function-based mapping languages like FunUL and RML+FnO -- can be applied to overcome interoperability issues across heterogeneous data sources. However, the absence of engines to efficiently execute these mapping languages hinders their global adoption. We propose FunMap, an interpreter of function-based mapping languages; it relies on a set of lossless rewriting rules to push down and materialize the execution of functions in the initial steps of knowledge graph creation. Although applicable to any function-based mapping language that supports joins between mapping rules, FunMap's feasibility is shown on RML+FnO. FunMap reduces data redundancy, e.g., duplicates and unused attributes, and converts RML+FnO mappings into a set of equivalent rules executable on RML-compliant engines. We evaluate FunMap's performance over real-world testbeds from the biomedical domain. The results indicate that FunMap reduces the execution time of RML-compliant engines by up to a factor of 18, thus providing a scalable solution for knowledge graph creation.

GGDs: Graph Generating Dependencies

We propose Graph Generating Dependencies (GGDs), a new class of dependencies for property graphs. Extending the expressivity of state-of-the-art constraint languages, GGDs can express both tuple- and equality-generating dependencies on property graphs, both of which find broad application in graph data management. We provide the formal definition of GGDs, analyze the validation problem for GGDs, and demonstrate the practical utility of GGDs.

Generative Datalog with Continuous Distributions

Arguing for the need to combine declarative and probabilistic programming, Bárány et al. (TODS 2017) recently introduced a probabilistic extension of Datalog as a "purely declarative probabilistic programming language." We revisit this language and propose a more principled approach towards defining its semantics based on stochastic kernels and Markov processes - standard notions from probability theory. This allows us to extend the semantics to continuous probability distributions, thereby settling an open problem posed by Bárány et al. We show that our semantics is fairly robust, allowing both parallel execution and arbitrary chase orders when evaluating a program. We cast our semantics in the framework of infinite probabilistic databases (Grohe and Lindner, ICDT 2020), and show that the semantics remains meaningful even when the input of a probabilistic Datalog program is an arbitrary probabilistic database.

GeoCMS: Towards a Geo-Tagged Media Management System

In this paper, we propose the design and implementation of a new geo-tagged media management system. A large amount of geo-tagged media data is generated daily by users' smartphones, mobile devices, dash cams and cameras. Geo-tagged media, such as geovideos and geophotos, can be captured with spatio-temporal information such as time, location, visible area, camera direction, moving direction and visible distance. Due to the increase in geo-tagged multimedia data, research on efficiently managing and mining geo-tagged multimedia is expected to become a new area in databases and data mining. This paper proposes a geo-tagged media management system called Open GeoCMS (Geotagged media Contents Management System). Open GeoCMS is a new framework to manage geo-tagged media data on the web. Our framework supports various types: moving point, moving photo (a sequence of photos taken by a drone), moving double and moving video. GeoCMS also has a label viewer and editor for photos and videos. Open GeoCMS has been developed as an open-source system.

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data. Besides Flink, other scalable spatial data processing platforms, including GeoSpark and Spatial Hadoop, do not support streaming workloads and can only handle static/batch workloads. To fill this gap, we present GeoFlink, which extends Apache Flink to support spatial data types, indexes and continuous queries over spatial data streams. To enable efficient processing of spatial continuous queries and effective data distribution across Flink cluster nodes, a grid-based index is introduced. GeoFlink currently supports spatial range, spatial kNN and spatial join queries on the point data type. An extensive experimental study on real spatial data streams shows that GeoFlink achieves significantly higher query throughput than ordinary Flink processing.
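The role of a grid-based index in a spatial range query can be sketched as follows. This is a plain in-memory illustration under assumed names (`GridIndex`, `cell_of`, `range_query`) and a uniform square grid; it is not GeoFlink's API and does not model streaming or distribution across cluster nodes.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Uniform grid over 2D points; each point is stored under its cell key."""

    def __init__(self, min_x, min_y, cell_size):
        self.min_x = min_x
        self.min_y = min_y
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def cell_of(self, x, y):
        # Map a coordinate to integer (column, row) cell indices.
        return (floor((x - self.min_x) / self.cell_size),
                floor((y - self.min_y) / self.cell_size))

    def insert(self, point_id, x, y):
        self.cells[self.cell_of(x, y)].append((point_id, x, y))

    def range_query(self, qx, qy, radius):
        """Return ids of points within radius of (qx, qy)."""
        cx_lo, cy_lo = self.cell_of(qx - radius, qy - radius)
        cx_hi, cy_hi = self.cell_of(qx + radius, qy + radius)
        hits = []
        # Only cells overlapping the query's bounding box are inspected.
        for cx in range(cx_lo, cx_hi + 1):
            for cy in range(cy_lo, cy_hi + 1):
                for pid, x, y in self.cells.get((cx, cy), ()):
                    if (x - qx) ** 2 + (y - qy) ** 2 <= radius ** 2:
                        hits.append(pid)
        return hits
```

Because the cell key is a pure function of the coordinates, the same key could also serve as a partitioning key for distributing points across workers, which is one plausible reading of the data-distribution benefit the abstract mentions.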

GeoSPARQL+: Syntax, Semantics and System for Integrated Querying of Graph, Raster and Vector Data -- Technical Report

We introduce an approach to semantically represent and query raster data in a Semantic Web graph. We extend the GeoSPARQL vocabulary and query language to support raster data as a new type of geospatial data. We define new filter functions and illustrate our approach using several use cases on real-world data sets. Finally, we describe a prototypical implementation and validate the feasibility of our approach.

Grammars for Document Spanners

We propose a new grammar-based language for defining information extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.

