Featured Research

Databases

Constant-Delay Enumeration for Nondeterministic Document Spanners

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms, but with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.
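
The algorithmic contract here is the interesting part: pay a preprocessing cost once, then emit each result with delay independent of the document. Below is a minimal Python sketch of that contract for a deliberately trivial extraction task (one variable bound to occurrences of a fixed string), not the paper's actual VA-based algorithm:

```python
# Toy illustration of the enumeration contract, not the paper's algorithm:
# preprocessing is a single linear scan of the document; enumeration then
# yields each mapping with delay independent of the document length.

def preprocess(document: str, pattern: str):
    """Linear scan collecting every span [i, j) where pattern occurs."""
    spans = []
    i = document.find(pattern)
    while i != -1:
        spans.append((i, i + len(pattern)))
        i = document.find(pattern, i + 1)  # allow overlapping matches
    return spans

def enumerate_mappings(spans):
    """Constant delay per result: each step does O(1) work."""
    for span in spans:
        yield {"x": span}  # the single capture variable x bound to a span

index = preprocess("abcabcabc", "abc")       # preprocessing phase
for mapping in enumerate_mappings(index):    # enumeration phase
    print(mapping)   # {'x': (0, 3)}, {'x': (3, 6)}, {'x': (6, 9)}
```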

Databases

Continuous Prefetch for Interactive Data Applications

Interactive data visualization and exploration (DVE) applications are often network-bottlenecked due to bursty request patterns, large response sizes, and heterogeneous deployments over a range of networks and devices. This makes it difficult to ensure consistently low response times (< 100ms). Khameleon is a framework for DVE applications that uses a novel combination of prefetching and response tuning to dynamically trade off response quality for low latency. Khameleon exploits DVE's approximation tolerance: immediate lower-quality responses are preferable to waiting for complete results. To this end, Khameleon progressively encodes responses, and runs a server-side scheduler that proactively streams portions of responses using available bandwidth to maximize the user's perceived interactivity. The scheduler involves a complex optimization based on available resources, predicted user interactions, and response quality levels; yet decisions must be made in real time. To overcome this, Khameleon uses a fast greedy approximation which closely mimics the optimal approach. Using image exploration and visualization applications with real user interaction traces, we show that across a wide range of network and client resource conditions, Khameleon outperforms classic prefetching approaches that benefit from perfect prediction models: response latencies with Khameleon are never higher, and are typically 2 to 3 orders of magnitude lower, while response quality remains within 50%-80%.
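
To fix ideas, here is a hedged sketch of the greedy scheduling idea as the abstract describes it: push chunks of progressively encoded responses in order of probability-weighted marginal quality gain until the bandwidth budget is exhausted. The function names and the utility model are our assumptions, not Khameleon's API:

```python
import heapq

# Hypothetical greedy prefetch scheduler in the spirit of Khameleon.
# Each candidate request has a predicted probability and a progressively
# encoded response; quality[k] is the fraction of full quality reached
# once k chunks have been delivered.

def schedule(predictions, quality, budget_chunks):
    """predictions: {request_id: probability}; returns chunks to stream."""
    heap = []                      # max-heap on the next chunk's expected gain
    sent = {r: 0 for r in predictions}
    for r, p in predictions.items():
        heapq.heappush(heap, (-p * (quality[1] - quality[0]), r))
    plan = []
    for _ in range(budget_chunks):
        if not heap:
            break
        _, r = heapq.heappop(heap)
        k = sent[r]
        plan.append((r, k))              # stream chunk k of request r
        sent[r] = k + 1
        if k + 2 < len(quality):         # more chunks remain after this one
            gain = predictions[r] * (quality[k + 2] - quality[k + 1])
            heapq.heappush(heap, (-gain, r))
    return plan

# Two predicted interactions; a 4-chunk progressive encoding.
print(schedule({"tile_A": 0.7, "tile_B": 0.3},
               quality=[0.0, 0.5, 0.8, 0.95, 1.0],
               budget_chunks=5))
# [('tile_A', 0), ('tile_A', 1), ('tile_B', 0), ('tile_A', 2), ('tile_B', 1)]
```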

Databases

Controlling Entity Integrity with Key Sets

Codd's rule of entity integrity stipulates that every table has a primary key. Hence, the attributes of the primary key carry unique and complete value combinations. In practice, data cannot always meet such requirements. Previous work proposed the superior notion of key sets for controlling entity integrity. We establish a linear-time algorithm for validating whether a given key set holds on a given data set, and demonstrate its efficiency on real-world data. We establish a binary axiomatization for the associated implication problem, and prove its coNP-completeness. However, the implication of unary by arbitrary key sets has better properties. The fragment enjoys a unary axiomatization and is decidable in quadratic time. Hence, we can minimize overheads before validating key sets. While perfect models do not always exist in general, we show how to compute them for any instance of our fragment. This provides computational support towards the acquisition of key sets.
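
As a point of reference, here is a naive quadratic check of one common reading of key set semantics: a key set holds if every pair of distinct tuples is separated by some key in the set on which both tuples are complete and carry different values. This is only the definition spelled out in Python; the paper's contribution is a linear-time validation algorithm, which this sketch does not reproduce:

```python
from itertools import combinations

def separates(key, t1, t2):
    """True if both tuples are complete (no None) on key and differ on it."""
    vals1 = [t1[a] for a in key]
    vals2 = [t2[a] for a in key]
    return None not in vals1 and None not in vals2 and vals1 != vals2

def key_set_holds(key_set, relation):
    """Naive pairwise check: every pair must be separated by some key."""
    return all(
        any(separates(key, t1, t2) for key in key_set)
        for t1, t2 in combinations(relation, 2)
    )

rows = [
    {"ssn": "123", "email": "a@x.org", "name": "Ann"},
    {"ssn": "456", "email": None,      "name": "Ann"},
    {"ssn": None,  "email": "b@x.org", "name": "Bob"},
]
# No single key works (each has a null somewhere), but the key set does:
# ssn separates rows 1-2, email rows 1-3, name rows 2-3.
print(key_set_holds([["ssn"], ["email"], ["name"]], rows))  # True
```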

Databases

Convolutional Embedding for Edit Distance

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
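
A hedged sketch of the training objective as the abstract describes it, in PyTorch: a small 1-D CNN embeds one-hot-encoded strings, and the loss combines the triplet loss with an approximation-error term tying embedding distance to true edit distance. The architecture and hyperparameters below are illustrative guesses, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = "acgt"
MAX_LEN = 16

def one_hot(s: str) -> torch.Tensor:
    x = torch.zeros(len(ALPHABET), MAX_LEN)
    for i, ch in enumerate(s[:MAX_LEN]):
        x[ALPHABET.index(ch), i] = 1.0
    return x

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance (one rolling row).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

class CNNEmbed(nn.Module):
    """Fixed-length embedding of one-hot strings via Conv1d + max-pool."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv1d(len(ALPHABET), 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, dim)

    def forward(self, x):                                # (batch, |Σ|, L)
        h = F.relu(self.conv(x)).max(dim=2).values       # global max-pool
        return self.fc(h)

model = CNNEmbed()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
anchor, pos, neg = "acgtacgt", "acgaacgt", "ttttcccc"
za, zp, zn = model(torch.stack([one_hot(s) for s in (anchor, pos, neg)]))

triplet = F.triplet_margin_loss(za[None], zp[None], zn[None], margin=1.0)
# Approximation term: embedding distance should match true edit distance.
approx = (torch.dist(za, zp) - edit_distance(anchor, pos)) ** 2
loss = triplet + 0.1 * approx
loss.backward()
opt.step()
print(float(loss))
```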

Databases

CorDEL: A Contrastive Deep Learning Approach for Entity Linkage

Entity linkage (EL) is a critical problem in data cleaning and integration. In the past several decades, EL has typically been done by rule-based systems or traditional machine learning models with hand-curated features, both of which heavily depend on manual human input. With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost of EL associated with the traditional models. Existing explorations of DL models for EL strictly follow the well-known twin-network architecture. However, we argue that the twin-network architecture is sub-optimal for EL, leading to inherent drawbacks in existing models. To address these drawbacks, we propose a novel and generic contrastive DL framework for EL. The proposed framework is able to capture both syntactic and semantic matching signals and pays attention to subtle but critical differences. Based on the framework, we develop a contrastive DL approach for EL, called CorDEL, with three powerful variants. We evaluate CorDEL with extensive experiments conducted on both public benchmark datasets and a real-world dataset. CorDEL outperforms previous state-of-the-art models by 5.2% on public benchmark datasets. Moreover, CorDEL yields a 2.4% improvement over the current best DL model on the real-world dataset, while reducing the number of training parameters by 97.6%.
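
The abstract does not spell out CorDEL's architecture, so the following is only a textbook contrastive pair objective for entity linkage, included to make the general idea concrete: matching record pairs are pulled together in embedding space and non-matches are pushed at least a margin apart:

```python
import torch
import torch.nn.functional as F

# Generic contrastive pair loss for entity linkage (not CorDEL's model).
def contrastive_loss(z1, z2, match, margin=1.0):
    d = F.pairwise_distance(z1, z2)
    pos = match * d.pow(2)                         # pull matches together
    neg = (1 - match) * F.relu(margin - d).pow(2)  # push non-matches apart
    return (pos + neg).mean()

# Toy record-pair embeddings (e.g., outputs of a shared text encoder).
z1 = torch.randn(4, 32, requires_grad=True)
z2 = torch.randn(4, 32)
match = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = same real-world entity
loss = contrastive_loss(z1, z2, match)
loss.backward()
print(float(loss))
```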

Databases

Cornus: One-Phase Commit for Cloud Databases with Storage Disaggregation

Two-phase commit (2PC) has been widely used in distributed databases to ensure atomicity for distributed transactions. However, 2PC suffers from two limitations. First, it incurs long latency, requiring two logging operations on the critical path. Second, when a coordinator fails, a participant may be blocked waiting for the coordinator's decision, leading to indefinitely long latency and low throughput. We make the key observation that modern cloud databases feature a storage disaggregation architecture, which allows a transaction's final decision to be made without relying on a central coordinator. We propose Cornus, a one-phase commit (1PC) protocol specifically designed for this architecture. Cornus solves the two problems above by leveraging the fact that all compute nodes are able to access and modify the log data on any storage node. We present Cornus in detail, formally prove its correctness, develop optimization techniques, and evaluate it against 2PC on YCSB and TPC-C workloads. The results show that Cornus can achieve a 1.5x speedup in latency.
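
A hedged sketch of the protocol idea as we read the abstract: every participant logs its vote directly to highly available shared storage in a single round, and the commit decision is a pure function of those logs, so any node can compute (or force) the outcome without waiting on a coordinator. The log_once primitive (an atomic write-if-absent per slot) is our assumption, not necessarily Cornus's actual API:

```python
import threading

class SharedLog:
    """Shared storage holding one vote slot per (txn, participant) pair."""
    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def log_once(self, txn, participant, value):
        """Atomic write-if-absent: returns the winning value for the slot."""
        with self._lock:
            self._slots.setdefault((txn, participant), value)
            return self._slots[(txn, participant)]

def decide(storage, txn, participants):
    """Any node can run this. Forcing ABORT into missing slots means a
    slow or failed participant can never block the decision."""
    votes = [storage.log_once(txn, p, "ABORT") for p in participants]
    return "COMMIT" if all(v == "VOTE-YES" for v in votes) else "ABORT"

storage = SharedLog()
participants = ["node1", "node2", "node3"]
for p in participants:                        # the single logging round
    storage.log_once("txn8", p, "VOTE-YES")
print(decide(storage, "txn8", participants))  # COMMIT

for p in ("node1", "node2"):                  # node3 is slow or crashed
    storage.log_once("txn9", p, "VOTE-YES")
print(decide(storage, "txn9", participants))  # ABORT: node3's slot is forced
```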

Databases

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model therefore translates into better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer? To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., the number of containers, during query planning. And third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate and 20x more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.
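
The "many specialized models plus a fallback" pattern from the first contribution can be sketched in a few lines of scikit-learn. The features and model class below are our illustrative choices, not Cleo's:

```python
from sklearn.ensemble import GradientBoostingRegressor

def train(history):
    """history: list of (template_id, features, actual_cost). Learns one
    model per recurring job template plus a global fallback model."""
    by_tpl, global_X, global_y = {}, [], []
    for tpl, x, y in history:
        by_tpl.setdefault(tpl, ([], []))
        by_tpl[tpl][0].append(x)
        by_tpl[tpl][1].append(y)
        global_X.append(x)
        global_y.append(y)
    per_template = {tpl: GradientBoostingRegressor().fit(X, y)
                    for tpl, (X, y) in by_tpl.items()}
    fallback = GradientBoostingRegressor().fit(global_X, global_y)
    return per_template, fallback

def predict(models, tpl, x):
    per_template, fallback = models
    return per_template.get(tpl, fallback).predict([x])[0]

# Features: (input size GB, operator count, container count) -- illustrative.
history = [("q1", [10, 5, 4], 120.0), ("q1", [20, 5, 8], 150.0),
           ("q1", [40, 5, 8], 260.0), ("q2", [5, 12, 2], 300.0),
           ("q2", [9, 12, 2], 500.0), ("q2", [7, 12, 4], 240.0)]
models = train(history)
print(predict(models, "q1", [30, 5, 8]))   # specialized model
print(predict(models, "q9", [15, 8, 4]))   # unseen template -> fallback
```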

Databases

Cost-based Query Rewriting Techniques for Optimizing Aggregates Over Correlated Windows

Window aggregates are ubiquitous in stream processing. In Azure Stream Analytics (ASA), a stream processing service hosted by Microsoft's Azure cloud, we see many customer queries that contain aggregate functions (such as MIN and MAX) over multiple correlated windows (e.g., tumbling windows of length five minutes and ten minutes) defined on the same event stream. In this paper, we present a cost-based optimization framework for optimizing such queries by sharing computation among multiple windows. Since our optimization techniques are at the level of query rewriting, they can be implemented on any stream processing system that supports a declarative, SQL-like query language without changing the underlying query execution engine. We formalize the shared computation problem, present the optimization techniques in detail, and report evaluation results over synthetic workloads.
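
A toy instance of the sharing idea makes the rewriting concrete: a ten-minute tumbling MAX can be answered by combining adjacent five-minute partial MAXes, so the per-event aggregation runs only once. The rewrite below is generic, not ASA's implementation:

```python
# Computation sharing across correlated windows: events feed one
# five-minute tumbling MAX, and the ten-minute tumbling MAX is derived
# from the five-minute partials instead of re-scanning the event stream.

def tumble(events, width):
    """events: iterable of (timestamp_sec, value); yields (window_start, max)."""
    out = {}
    for ts, v in events:
        start = ts - ts % width
        out[start] = max(out.get(start, float("-inf")), v)
    return sorted(out.items())

events = [(30, 4.0), (200, 9.0), (400, 2.0), (700, 7.0)]
five = tumble(events, 300)            # base aggregation: one pass over events
ten = {}
for start, m in five:                 # derived: max of two 5-minute partials
    s10 = start - start % 600
    ten[s10] = max(ten.get(s10, float("-inf")), m)
print(five)                 # [(0, 9.0), (300, 2.0), (600, 7.0)]
print(sorted(ten.items()))  # [(0, 9.0), (600, 7.0)]
```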

Databases

Counting Query Answers over a DL-Lite Knowledge Base (extended version)

Counting answers to a query is an operation supported by virtually all database management systems. In this paper we focus on counting answers over a Knowledge Base (KB), which may be viewed as a database enriched with background knowledge about the domain under consideration. In particular, we place our work in the context of Ontology-Mediated Query Answering/Ontology-based Data Access (OMQA/OBDA), where the language used for the ontology is a member of the DL-Lite family and the data is a (usually virtual) set of assertions. We study the data complexity of query answering, for different members of the DL-Lite family that include number restrictions, and for variants of conjunctive queries with counting that differ with respect to their shape (connected, branching, rooted). We improve upon existing results by providing PTIME and coNP lower bounds, and PTIME and LOGSPACE upper bounds. For the LOGSPACE case, we define a novel technique for rewriting queries into first-order logic with counting.

Databases

Covering the Relational Join

In this paper, we initiate a theoretical study of what we call the join covering problem. We are given a natural join query instance Q on n attributes and m relations (R_1, ..., R_m). Let J_Q = R_1 ⋈ ⋯ ⋈ R_m denote the join output of Q. In addition to Q, we are given a parameter Δ with 1 ≤ Δ ≤ n, and our goal is to compute the smallest subset T_{Q,Δ} ⊆ J_Q such that every tuple in J_Q is within Hamming distance Δ − 1 of some tuple in T_{Q,Δ}. The join covering problem captures, as special cases, both computing the natural join from database theory and constructing a covering code with covering radius Δ − 1 from coding theory. We consider the combinatorial version of the join covering problem, where our goal is to determine the worst-case |T_{Q,Δ}| in terms of the structure of Q and the value of Δ. One obvious approach to upper bound |T_{Q,Δ}| is to exploit a distance property of Hamming distance from coding theory and combine it with the worst-case bounds on the output size of natural joins (the AGM bound hereon) due to Atserias, Grohe and Marx [SIAM J. Computing '13]. Somewhat surprisingly, this approach is not tight even when the input relations have arity at most two. Instead, we show that using the polymatroid degree-based bound of Abo Khamis, Ngo and Suciu [PODS '17] in place of the AGM bound gives a bound on |T_{Q,Δ}| that is tight up to constant factors for the arity-two case. We prove lower bounds on |T_{Q,Δ}| using well-known classes of error-correcting codes, e.g., Reed-Solomon codes. We can extend our results from the arity-two case to general arity, with a polynomial gap between our upper and lower bounds.
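
To make the definition concrete, here is a brute-force illustration in Python: materialize the join output, then greedily keep tuples until every output tuple is within Hamming distance Δ − 1 of a kept one. The paper's subject is worst-case bounds on the cover size, which this sketch does not reproduce:

```python
# Brute-force illustration of the join covering definition only.

def hamming(t1, t2):
    return sum(a != b for a, b in zip(t1, t2))

def join_cover(join_output, delta):
    """Greedily pick cover tuples; a tuple is covered once some kept
    tuple is within Hamming distance delta - 1 of it."""
    cover, uncovered = [], set(join_output)
    while uncovered:
        t = uncovered.pop()
        cover.append(t)
        uncovered = {u for u in uncovered if hamming(t, u) >= delta}
    return cover

# Tiny binary join R(a, b) ⋈ S(b, c) over attributes (a, b, c).
R = [(0, 0), (0, 1), (1, 0), (1, 1)]
S = [(0, 0), (0, 1), (1, 0), (1, 1)]
J = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
print(len(J), "join tuples")      # 8 join tuples
print(join_cover(J, delta=2))     # each kept tuple covers radius 1
```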
