Featured Research

Databases

Distributed Nonblocking Commit Protocols for Many-Party Cross-Blockchain Transactions

Interoperability across multiple blockchains will play a critical role in future blockchain-based data management paradigms. Existing techniques either work only for two blockchains or require a centralized component to govern cross-blockchain transaction execution, neither of which meets the scalability requirement. This paper proposes a new distributed commit protocol, namely cross-blockchain transaction (CBT), for conducting transactions across an arbitrary number of blockchains without any centralized component. The key idea of CBT is to extend the two-phase commit protocol with a heartbeat mechanism that ensures the liveness of CBT without introducing additional nodes or blockchains. We have implemented CBT and compared it to state-of-the-art protocols, demonstrating CBT's low overhead (3.6% between two blockchains, less than 1% among 32 or more blockchains) and high scalability (linear scaling on up to 64-blockchain transactions). In addition, we developed a graphical user interface for users to visually monitor the status of cross-blockchain transactions.
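
To make the heartbeat idea concrete, the sketch below extends a toy two-phase commit with a liveness check: a chain that stops heartbeating before the decision is treated as failed, and the transaction aborts. The class names, timeout value, and single-process setup are illustrative assumptions, not the actual CBT protocol.

```python
# Minimal sketch of two-phase commit guarded by heartbeat liveness checks.
# All names and timing parameters are illustrative, not taken from the paper.
import time

HEARTBEAT_TIMEOUT = 2.0  # seconds without a heartbeat before a peer is presumed failed

class Chain:
    """Stands in for one blockchain participating in a cross-chain transaction."""
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.time()
        self.state = "INIT"

    def heartbeat(self):
        self.last_heartbeat = time.time()

    def alive(self):
        return time.time() - self.last_heartbeat < HEARTBEAT_TIMEOUT

    def prepare(self):
        # Vote to commit only if still live; a real chain would lock assets here.
        self.state = "PREPARED" if self.alive() else "ABORTED"
        return self.state == "PREPARED"

    def finish(self, commit):
        self.state = "COMMITTED" if commit else "ABORTED"

def cross_chain_commit(chains):
    """Phase 1: collect prepare votes. Phase 2: commit only if every chain
    voted yes and no chain stopped heartbeating in between."""
    votes = all(c.prepare() for c in chains)
    decision = votes and all(c.alive() for c in chains)
    for c in chains:
        c.finish(decision)
    return decision

chains = [Chain(f"chain-{i}") for i in range(4)]
for c in chains:
    c.heartbeat()
print("committed" if cross_chain_commit(chains) else "aborted")
```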

Read more
Databases

Distributed Processing of k Shortest Path Queries over Dynamic Road Networks

The problem of identifying the k shortest paths (KSPs for short) in a dynamic road network is essential to many location-based services. Road networks are dynamic in the sense that the weights of the edges in the corresponding graph constantly change over time, representing evolving traffic conditions. Very often such services have to process numerous KSP queries over large road networks at the same time, so there is a pressing need for distributed solutions to this problem. However, most existing approaches are designed to identify KSPs on a static graph in a sequential manner (i.e., the (i+1)-th shortest path is generated based on the i-th shortest path), restricting their scalability and applicability in a distributed setting. We therefore propose KSP-DG, a distributed algorithm for identifying the k shortest paths in a dynamic graph. It partitions the entire graph into smaller subgraphs and reduces the problem of determining KSPs to the computation of partial KSPs in relevant subgraphs, which can execute in parallel on a cluster of servers. A distributed two-level index called DTLP is developed to facilitate the efficient identification of relevant subgraphs. A salient feature of DTLP is that it indexes a set of virtual paths that are insensitive to varying traffic conditions, leading to very low maintenance cost in dynamic road networks. This is the first treatment of the problem of processing KSP queries over dynamic road networks. Extensive experiments conducted on real road networks confirm the superiority of our proposal over baseline methods.
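
As a flavor of the per-subgraph work a KSP worker might perform, here is a minimal sketch of a k-shortest-paths primitive: a Dijkstra variant that settles each node up to k times and yields paths as walks. The partitioning, DTLP index, and cross-subgraph merging of partial KSPs are not reproduced here.

```python
# Minimal k-shortest-paths primitive (paths as walks, possibly revisiting
# nodes): Dijkstra with each node settled up to k times. Illustrates the
# per-subgraph building block only, not the KSP-DG algorithm itself.
import heapq
from collections import defaultdict

def k_shortest_paths(graph, src, dst, k):
    """graph: {u: [(v, weight), ...]}; returns up to k (cost, path) pairs."""
    pops = defaultdict(int)          # how many times each node was settled
    heap = [(0.0, src, [src])]       # (cost so far, node, path taken)
    results = []
    while heap and len(results) < k:
        cost, node, path = heapq.heappop(heap)
        pops[node] += 1
        if node == dst:
            results.append((cost, path))
            continue
        if pops[node] > k:           # settling a node more than k times is useless
            continue
        for nbr, w in graph.get(node, []):
            heapq.heappush(heap, (cost + w, nbr, path + [nbr]))
    return results

road = {"a": [("b", 2), ("c", 5)], "b": [("c", 1), ("d", 4)], "c": [("d", 1)], "d": []}
print(k_shortest_paths(road, "a", "d", k=3))
```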

Read more
Databases

Distributed Spatial-Keyword kNN Monitoring for Location-aware Pub/Sub

Recent applications employ publish/subscribe (Pub/Sub) systems so that publishers can easily attract the attention of customers and subscribers can monitor useful information generated by publishers. Due to the prevalence of smart devices and social networking services, a large number of objects containing both spatial and keyword information are generated continuously, and the number of subscribers also continues to increase. This poses a challenge to Pub/Sub systems: they need to continuously extract useful information from massive objects for each subscriber in real time. In this paper, we address the problem of k nearest neighbor (kNN) monitoring on a spatial-keyword data stream for a large number of subscriptions. To scale well to massive objects and subscriptions, we propose a distributed solution, namely DkM-SKS. Given m workers, DkM-SKS divides a set of subscriptions into m disjoint subsets based on a cost model so that each worker incurs almost the same kNN-update cost, thereby maintaining load balance. DkM-SKS allows an arbitrary approach to updating the kNN results of each subscription, so with a suitable in-memory index it can accelerate updates by pruning irrelevant subscriptions for a given new object. We conduct experiments on real datasets, and the results demonstrate the efficiency and scalability of DkM-SKS.
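
The load-balancing step can be pictured as greedy bin packing. The sketch below assigns subscriptions to m workers so that estimated kNN-update costs stay roughly equal; the greedy largest-first rule and the cost values stand in for DkM-SKS's actual cost model, which the abstract does not detail.

```python
# Greedy balanced assignment of subscriptions to m workers: always give the
# next-costliest subscription to the currently least-loaded worker. A stand-in
# heuristic, not DkM-SKS's actual cost model.
import heapq

def partition_subscriptions(costs, m):
    """costs: {subscription_id: estimated kNN-update cost}; returns m buckets."""
    buckets = [(0.0, i, []) for i in range(m)]   # (total cost, worker id, subs)
    heapq.heapify(buckets)
    for sub, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, wid, subs = heapq.heappop(buckets)  # least-loaded worker
        subs.append(sub)
        heapq.heappush(buckets, (load + cost, wid, subs))
    return {wid: subs for _, wid, subs in buckets}

costs = {"s1": 9.0, "s2": 7.0, "s3": 4.0, "s4": 4.0, "s5": 3.0, "s6": 1.0}
print(partition_subscriptions(costs, m=3))
```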

Read more
Databases

Distributed Tera-Scale Similarity Search with MPI: Provably Efficient Similarity Search over Billions without a Single Distance Computation

We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte-scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how to augment the LSH algorithm with heavy-hitter sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we show mathematically that, under realistic data assumptions, we can identify the near neighbor of a given query q in a sublinear (≪ O(n)) number of simple sketch aggregation operations only. To make such a system practical, we offer a novel design and sketching solution that reduces the inter-machine communication overhead exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark, a widely adopted distributed implementation of the LSH algorithm for large datasets that is deployed in commercial platforms. Finally, we show how our system scales to the tera-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3-terabyte dataset over 20 nodes in under an hour, with query times of a fraction of a millisecond. To the best of our knowledge, there is no open-source system that can index and perform a similarity search on Criteo with a commodity cluster.
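
The core idea, on a single machine, is that each LSH table casts votes for candidate ids and the heaviest hitter across tables is returned without computing a single distance. In the sketch below, random-hyperplane LSH and a plain Counter stand in for SLASH's sketches, and the MPI distribution layer is omitted entirely.

```python
# Single-machine sketch of the SLASH idea: answer a near-neighbor query by
# aggregating LSH bucket "votes" and returning the heaviest hitter, with no
# distance computation. A Counter stands in for the paper's sketches.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
DIM, TABLES, BITS = 32, 16, 8
planes = rng.standard_normal((TABLES, BITS, DIM))   # one hyperplane set per table

def bucket(x, t):
    # Sign pattern of x against table t's hyperplanes -> integer bucket id.
    bits = (planes[t] @ x > 0).astype(int)
    return int("".join(map(str, bits)), 2)

data = rng.standard_normal((1000, DIM))
tables = [{} for _ in range(TABLES)]
for i, x in enumerate(data):
    for t in range(TABLES):
        tables[t].setdefault(bucket(x, t), []).append(i)

def query(q):
    votes = Counter()
    for t in range(TABLES):
        votes.update(tables[t].get(bucket(q, t), []))
    return votes.most_common(1)[0][0] if votes else None  # heaviest hitter

q = data[42] + 0.05 * rng.standard_normal(DIM)  # a noisy copy of point 42
print(query(q))  # likely 42: it collides with q in the most tables
```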

Read more
Databases

Distribution Constraints: The Chase for Distributed Data

This paper introduces a declarative framework to specify and reason about distributions of data over computing nodes in a distributed setting. More specifically, it proposes distribution constraints, which are tuple- and equality-generating dependencies (tgds and egds) extended with node variables ranging over computing nodes. In particular, they can express co-partitioning constraints and constraints about range-based data distributions by using comparison atoms. The main technical contribution is the study of the implication problem for distribution constraints. While implication is undecidable in general, relevant fragments of so-called data-full constraints are exhibited for which the corresponding implication problems are complete for EXPTIME, PSPACE, and NP. These results yield bounds on deciding parallel-correctness for conjunctive queries in the presence of distribution constraints.
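
To make the notion concrete, one plausible way to write a co-partitioning constraint is as an egd whose equated variables range over computing nodes: R-tuples and S-tuples that agree on the join key must reside on the same node. The notation below is illustrative; the paper's exact syntax may differ.

```latex
% An illustrative co-partitioning constraint: R-tuples and S-tuples that
% agree on the join key x must be placed on the same computing node.
% (Hypothetical syntax; the paper's notation may differ.)
\forall x \, \forall \bar{y} \, \forall \bar{z} \, \forall n \, \forall m \;
  \bigl( R(x, \bar{y}) @ n \,\wedge\, S(x, \bar{z}) @ m \;\rightarrow\; n = m \bigr)
```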

Read more
Databases

Diversifying Anonymized Data with Diversity Constraints

Recently introduced privacy legislation aims to restrict and control the amount of personal data published by companies and shared with third parties. Much of this real data is not only sensitive, requiring anonymization, but also contains characteristic details from a variety of individuals. This diversity is desirable in many applications ranging from Web search to drug and product development. Unfortunately, data anonymization techniques have largely ignored diversity in their published results. This inadvertently propagates underlying bias into subsequent data analysis. We study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints. We formalize diversity constraints and study their foundations, such as implication and satisfiability. We show that determining the existence of a diverse, anonymized instance can be done in PTIME, and we present a clustering-based algorithm. We conduct extensive experiments using real and synthetic data, showing the effectiveness of our techniques and improvement over existing baselines. Our work aligns with recent trends toward responsible data science by coupling diversity with privacy-preserving data publishing.
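
As one plausible instantiation, a diversity constraint might require every anonymization group to contain at least a minimum number of distinct values of a sensitive attribute. The sketch below checks such a constraint over a toy instance; this constraint form is an illustrative guess, not the paper's exact formalism.

```python
# Minimal check of a diversity constraint over an anonymized instance: each
# anonymization group must contain at least `min_distinct` distinct values
# of a sensitive attribute. Illustrative constraint form only.
from collections import defaultdict

def satisfies_diversity(rows, group_key, attr, min_distinct):
    groups = defaultdict(set)
    for row in rows:
        groups[row[group_key]].add(row[attr])
    return all(len(vals) >= min_distinct for vals in groups.values())

rows = [
    {"group": "g1", "condition": "flu"},
    {"group": "g1", "condition": "asthma"},
    {"group": "g2", "condition": "flu"},
    {"group": "g2", "condition": "flu"},  # g2 lacks diversity
]
print(satisfies_diversity(rows, "group", "condition", min_distinct=2))  # False
```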

Read more
Databases

DrugDBEmbed : Semantic Queries on Relational Database using Supervised Column Encodings

Traditional relational databases contain a lot of latent semantic information that has largely remained untapped due to the difficulty of extracting it automatically. Recent works have proposed unsupervised machine learning approaches to extract such hidden information by textifying the database columns and then projecting the text tokens onto a fixed-dimensional semantic vector space. However, in certain databases, task-specific class labels may be available, which unsupervised approaches are unable to leverage in a principled manner. Also, when embeddings are generated at the individual-token level, the encoding of a multi-token text column has to be computed by averaging the vectors of the tokens present in that column for any given row. Such an averaging approach may not produce the best semantic vector representation of the multi-token text column, as observed when encoding paragraphs or documents in the natural language processing domain. With these shortcomings in mind, we propose a supervised machine learning approach that uses a Bi-LSTM based sequence encoder to directly generate column encodings for multi-token text columns of the DrugBank database, which contains gold-standard drug-drug interaction (DDI) labels. Our text-data-driven encoding approach achieves very high accuracy on the supervised DDI prediction task for some columns, and we use those supervised column encodings to simulate and evaluate analogy SQL queries on relational data, demonstrating the efficacy of our technique.
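
A minimal PyTorch sketch of the encoder shape is given below: a Bi-LSTM maps the token sequence of one column value to a single fixed-size vector, rather than averaging per-token embeddings. The dimensions and the final-state pooling are illustrative choices, and the supervised DDI classification head that would sit on top is omitted.

```python
# Bi-LSTM column encoder sketch: one multi-token column value -> one vector.
# Dimensions, vocabulary handling, and pooling are illustrative, not the
# paper's exact setup; a DDI classifier head would consume the output.
import torch
import torch.nn as nn

class ColumnEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids)                  # (batch, seq_len, emb_dim)
        _, (h, _) = self.lstm(x)                   # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)     # concat final fwd/bwd states

enc = ColumnEncoder(vocab_size=10_000)
ids = torch.randint(0, 10_000, (4, 12))            # a batch of 4 column values
print(enc(ids).shape)                              # torch.Size([4, 256])
```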

Read more
Databases

Duoquest: A Dual-Specification System for Expressive SQL Queries

Querying a relational database is difficult because it requires users both to know the SQL language and to be familiar with the schema. On the other hand, many users possess enough domain familiarity or expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of these alternatives face certain pitfalls: natural language queries (NLQs) are often ambiguous, even for human interpreters, while current PBE approaches require either low-complexity queries, user schema knowledge, exact example tuples from the user, or a closed-world assumption to be tractable. Consequently, we propose dual-specification query synthesis, which consumes both an NLQ and an optional PBE-like table sketch query that enables users to express varied levels of domain-specific knowledge. We introduce the novel dual-specification Duoquest system, which leverages guided partial query enumeration to efficiently explore the space of possible queries. We present results from user studies in which Duoquest demonstrates a 62.5% absolute increase in query construction accuracy over a state-of-the-art NLI and comparable accuracy to a PBE system on a more limited workload supported by the PBE system. In a simulation study on the prominent Spider benchmark, Duoquest demonstrates a >2x increase in top-1 accuracy over both the NLI and the PBE system.
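
One way to picture the dual specification as data is a natural language query paired with an optional table sketch naming the expected output column types and any example tuples the user can supply. All field names below are hypothetical; Duoquest's actual input format may differ.

```python
# Hypothetical data shape for a dual specification: an NLQ plus an optional
# table sketch. Field names are illustrative, not Duoquest's real API.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TableSketch:
    column_types: list[str]                      # e.g. ["text", "number"]
    example_tuples: list[tuple] = field(default_factory=list)

@dataclass
class DualSpec:
    nlq: str
    sketch: Optional[TableSketch] = None         # the sketch is optional

spec = DualSpec(
    nlq="names of departments with more than 50 employees",
    sketch=TableSketch(column_types=["text"], example_tuples=[("Physics",)]),
)
print(spec)
```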

Read more
Databases

Duplication Detection in Knowledge Graphs: Literature and Tools

In recent years, an increasing number of knowledge graphs (KGs) have been created as a means to store cross-domain knowledge and billions of facts, which form the basis of customer-facing applications like search engines. However, KGs inevitably contain inconsistencies such as duplicates that may generate conflicting property values. Duplication detection (DD) aims to identify duplicated entities and resolve their conflicting property values effectively and efficiently. In this paper, we perform a literature review of DD methods and tools, together with an evaluation of them. Our main contributions are a performance evaluation of DD tools in KGs, improvement suggestions, and a DD workflow to support the future development of DD tools, all based on desirable features identified through this study.
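
The two steps DD tools typically combine can be sketched generically: flag entity pairs whose property overlap crosses a threshold as duplicates, then resolve conflicting property values when merging. The threshold and the keep-the-better-supported-value rule below are illustrative; the surveyed tools differ widely on both.

```python
# Generic duplication-detection sketch: (1) flag a pair as duplicates when
# property overlap crosses a threshold; (2) resolve conflicts by keeping the
# value backed by more sources. Both rules are illustrative stand-ins.
def overlap(a, b):
    keys = set(a) | set(b)
    same = sum(1 for k in keys if a.get(k) == b.get(k))
    return same / len(keys) if keys else 0.0

def merge(a, b, support):
    """Resolve conflicts by keeping the value with higher source support."""
    out = dict(a)
    for k, v in b.items():
        if k not in out or support.get((k, v), 0) > support.get((k, out[k]), 0):
            out[k] = v
    return out

e1 = {"name": "Berlin", "country": "Germany", "population": "3644826"}
e2 = {"name": "Berlin", "country": "Germany", "population": "3769495"}
if overlap(e1, e2) >= 0.5:   # 2 of 3 properties agree -> treat as duplicates
    print(merge(e1, e2, support={("population", "3769495"): 3,
                                 ("population", "3644826"): 1}))
```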

Read more
Databases

Durable Top-K Instant-Stamped Temporal Records with User-Specified Scoring Functions

A way of finding interesting or exceptional records from instant-stamped temporal data is to consider their "durability," or, intuitively speaking, how well they compare with other records that arrived earlier or later, and how long they retain their supremacy. For example, people are naturally fascinated by claims with long durability, such as: "On January 22, 2006, Kobe Bryant dropped 81 points against the Toronto Raptors. Since then, this scoring record has yet to be broken." In general, given a sequence of instant-stamped records, suppose that we can rank them by a user-specified scoring function f, which may consider multiple attributes of a record to compute a single score for ranking. This paper studies "durable top-k queries", which find records whose scores were within the top-k among those records within a "durability window" of given length, e.g., a 10-year window starting/ending at the timestamp of the record. The parameter k, the length of the durability window, and the parameters of the scoring function (which capture user preference) can all be given at query time. We illustrate why this problem formulation yields more meaningful answers in some practical situations than other similar types of queries considered previously. We propose new algorithms for solving this problem, and provide a comprehensive theoretical analysis of the complexities of the problem itself and of our algorithms. Our algorithms vastly outperform various baselines (by up to two orders of magnitude on real and synthetic datasets).
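
The query semantics can be pinned down with a brute-force reference implementation, shown below: a record is durable if its score ranks within the top-k among all records in a durability window starting at its own timestamp. This only specifies the problem; the paper's algorithms answer it far more efficiently.

```python
# Brute-force reference for the query semantics only: a record is durable if
# its score was within the top-k among all records whose timestamps fall in a
# window of length `window` starting at the record's own timestamp.
def durable_topk(records, f, k, window):
    """records: list of (timestamp, record); f: user-specified scoring function."""
    answers = []
    for t, r in records:
        in_window = [f(r2) for t2, r2 in records if t <= t2 < t + window]
        rank = sum(1 for s in in_window if s > f(r))   # records strictly above r
        if rank < k:
            answers.append((t, r))
    return answers

games = [(2006, 81), (2007, 65), (2009, 61), (2011, 55), (2014, 62)]
# Score is just the point total here; top-1 within a 10-year window.
print(durable_topk(games, f=lambda pts: pts, k=1, window=10))
```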

Read more
