Featured Research

Databases

Dealer: End-to-End Data Marketplace with Model-based Pricing

Data-driven machine learning (ML) has achieved great success across a variety of application domains. Since ML model training relies crucially on large amounts of data, there is growing demand for collecting high-quality data for model training. From the data owners' perspective, however, contributing their data is risky. To incentivize data contribution, ideally their data would be used only under their preset restrictions and they would be paid for their contribution. In this paper, we take a formal data market perspective and propose the first end-to-end data marketplace with model-based pricing (Dealer), towards answering the question: how can the broker assign value to data owners based on their contribution to the models, so as to incentivize more data contribution, and determine the pricing of a series of models for various model buyers, so as to maximize revenue with an arbitrage-free guarantee? For the former, we introduce a Shapley value-based mechanism to quantify each data owner's value towards all the models trained from the contributed data. For the latter, we design a pricing mechanism based on the models' privacy parameters to maximize revenue. More importantly, we study how the data owners' data usage restrictions affect market design, a striking difference between our approach and existing methods. Furthermore, we present a concrete realization, DP-Dealer, which provably satisfies the desired formal properties. Extensive experiments show that DP-Dealer is efficient and effective.
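
The Shapley value underlying the first mechanism has a compact definition: an owner's share is their marginal contribution to model quality, averaged over all orderings of owners. As a rough illustration only (this is not the paper's DP-Dealer mechanism, and the utility function below is a made-up stand-in for trained-model quality), a minimal Python sketch:

    from itertools import permutations

    def shapley_values(owners, utility):
        """Shapley value of each owner, where utility(frozenset) is,
        e.g., the accuracy of a model trained on that coalition's data."""
        values = {o: 0.0 for o in owners}
        perms = list(permutations(owners))
        for perm in perms:
            coalition = frozenset()
            for o in perm:
                # marginal contribution of o given the owners before it
                values[o] += utility(coalition | {o}) - utility(coalition)
                coalition = coalition | {o}
        return {o: v / len(perms) for o, v in values.items()}

    # Toy utility: diminishing returns in the number of contributed records.
    sizes = {"alice": 100, "bob": 300, "carol": 600}
    utility = lambda c: sum(sizes[o] for o in c) ** 0.5
    print(shapley_values(list(sizes), utility))

Exact enumeration is exponential in the number of owners; practical data-valuation systems approximate the same quantity, e.g., by Monte Carlo sampling of permutations.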

Read more
Databases

Deep Entity Matching with Pre-Trained Language Models

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem, which lets us leverage such models with a simple architecture and fine-tuning. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data-augmentation technique for text to EM, augmenting the training data with (difficult) examples; this forces Ditto to learn "harder" and improves the model's matching capability. These optimizations further boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half as much labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task: matching two company datasets of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
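
The sequence-pair formulation is easy to picture with the Hugging Face transformers API. In this hedged sketch, the "COL ... VAL ..." serialization, the example records, and the model choice are illustrative assumptions, and the classification head is untrained here, so the scores are meaningless until the model is fine-tuned on labeled match/no-match pairs:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)  # no-match / match

    def serialize(record):
        # Flatten a record into "COL name VAL value ..." style text
        # (an illustrative serialization, not Ditto's exact format).
        return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

    a = {"name": "Apple iPhone 12 64GB", "price": "699"}
    b = {"name": "iPhone 12 (64 GB)", "price": "699.00"}

    # The tokenizer packs the two serialized records as one sequence pair.
    inputs = tokenizer(serialize(a), serialize(b),
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(torch.softmax(logits, dim=-1))  # P(no-match), P(match)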

Read more
Databases

DeepSampling: Selectivity Estimation with Predicted Error and Response Time

The rapid growth of spatial data urges the research community to find efficient processing techniques for interactive queries on large volumes of data. Approximate Query Processing (AQP) is the most prominent technique for providing real-time answers to ad-hoc queries based on a random sample. Unfortunately, existing AQP methods provide an answer without any accuracy metrics, due to the complex relationship between the sample size, the query parameters, the data distribution, and the result accuracy. This paper proposes DeepSampling, a deep-learning-based model that predicts the accuracy of a sample-based AQP algorithm, especially for selectivity estimation, given the sample size, the input distribution, and the query parameters. The model can also be inverted to estimate the sample size that would produce a desired accuracy. DeepSampling is the first system that provides a reliable tool for existing spatial databases to control the accuracy of AQP.
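
As a stand-in for the idea (not DeepSampling's actual architecture or feature set), the sketch below fits a small regressor mapping (sample ratio, query selectivity) to an error estimate on synthetic data, then inverts it by grid search to find the smallest sample ratio meeting an error target; the features and the toy error law are assumptions:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    # Synthetic training data: (sample_ratio, query_selectivity) -> error.
    X = rng.uniform([0.01, 0.0], [1.0, 1.0], size=(5000, 2))
    y = 0.05 / np.sqrt(X[:, 0] * 5000 * np.maximum(X[:, 1], 0.01))  # toy law

    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500).fit(X, y)

    def min_sample_ratio(selectivity, max_error,
                         grid=np.linspace(0.01, 1.0, 100)):
        # "Reverse" the model: scan ratios, keep the first that is accurate enough.
        feats = np.column_stack([grid, np.full_like(grid, selectivity)])
        ok = grid[model.predict(feats) <= max_error]
        return ok[0] if len(ok) else None

    print(min_sample_ratio(selectivity=0.2, max_error=0.01))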

Read more
Databases

Designing a Bit-Based Model to Accelerate Query Processing Over Encrypted Databases in Cloud

Database users have started moving toward cloud computing as a service because it provides computation and storage at affordable prices. However, for most users privacy is a major concern, as they cannot control data access once their data are outsourced, especially if the cloud provider is curious about their data. Data encryption is an effective way to address privacy concerns, but executing queries over encrypted data is a problem that needs attention. In this research, we introduce a bit-based model to execute different relational algebra operators over encrypted databases in the cloud without decrypting the data. To encrypt the data, we use a randomized encryption algorithm (Advanced Encryption Standard in CBC mode) to provide the maximum security level. The idea is based on classifying attributes as sensitive and non-sensitive, where only sensitive attributes are encrypted. For each sensitive attribute, the table owner predefines the possible partition domains, over which the tuples are encoded into bit vectors before encryption. We store the bit vectors in one or more additional columns of the encrypted table in the cloud and use them to retrieve only the encrypted records that are candidates for a specific query. We implemented and evaluated our model and found it practical: it succeeds in reducing the retrieved encrypted records to less than 30 percent of all encrypted records in a table.
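
A minimal sketch of the bit-vector filtering idea, assuming a single sensitive salary attribute with owner-defined range partitions; Fernet (which is AES-CBC-based) stands in for the paper's AES-CBC encryption, and in a real deployment the exact predicate would be re-checked after decryption on the client:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()
    f = Fernet(key)

    # Table owner predefines partition domains for the sensitive attribute.
    salary_partitions = [(0, 30000), (30000, 60000),
                         (60000, 100000), (100000, 10**9)]

    def encode_bits(value, partitions):
        # One bit per partition; set the bit whose range contains the value.
        return [1 if lo <= value < hi else 0 for lo, hi in partitions]

    rows = [("alice", 25000), ("bob", 72000), ("carol", 95000)]
    cloud_table = [(name,
                    f.encrypt(str(salary).encode()),          # ciphertext
                    encode_bits(salary, salary_partitions))   # extra column
                   for name, salary in rows]

    # Point query salary = 70000: the cloud matches bit vectors only, and
    # returns candidate rows without ever decrypting anything itself.
    query_bits = encode_bits(70000, salary_partitions)
    candidates = [r for r in cloud_table
                  if any(a & b for a, b in zip(r[2], query_bits))]
    print([(name, int(f.decrypt(enc))) for name, enc, _ in candidates])

Here both "bob" and "carol" fall in the queried partition, so both are returned as candidates; the bit vectors narrow the retrieval set without revealing exact values.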

Read more
Databases

Detecting Opportunities for Differential Maintenance of Extracted Views

Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that can be applied to a wide range of extraction programs. In this work, we focus on extracted view maintenance, a problem that is well-motivated and thoroughly addressed in the relational setting. In particular, we formalize and address the problem of keeping extracted relations consistent with source documents that can be arbitrarily updated. We formally characterize three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect to a given extractor. Finally, we propose algorithms to detect pseudo-irrelevant document updates with respect to extractors that are expressed as document spanners, a model of information extraction inspired by SystemT.
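
As a toy illustration (a brute-force observational check, not the paper's detection algorithms, and a regex rather than a full CPSL/AQL spanner), one can at least test whether a given update happens to leave the extracted relation unchanged even though character spans shift:

    import re

    extractor = re.compile(r"[\w.]+@[\w.]+")  # toy spanner: e-mail mentions

    def extracted_view(doc):
        return sorted(m.group() for m in extractor.finditer(doc))

    doc     = "Contact alice@example.com for data, bob@example.org for code."
    updated = "Contact alice@example.com for data and bob@example.org for code."

    # Spans shift, but the extracted relation is identical, so this update
    # behaves pseudo-irrelevantly for this extractor.
    print(extracted_view(doc) == extracted_view(updated))  # True

The point of the paper is to detect such updates statically, without re-running the extractor over the whole document.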

Read more
Databases

Differential Privacy of Hierarchical Census Data: An Optimization Approach

This paper is motivated by the needs of a census bureau interested in releasing aggregate socio-economic data about a large population without revealing sensitive information about any individual. The released information can be the number of individuals living alone, the number of cars they own, or their salary brackets. Recent events have identified some of the privacy challenges faced by these organizations. To address them, this paper presents a novel differential-privacy mechanism for releasing hierarchical counts of individuals. The counts are reported at multiple granularities (e.g., the national, state, and county levels) and must be consistent across all levels. The core of the mechanism is an optimization model that redistributes the noise introduced to achieve differential privacy so as to meet the consistency constraints between the hierarchical levels. The key technical contribution of the paper is to show that this optimization problem can be solved in polynomial time by exploiting the structure of its cost functions. Experimental results on very large, real datasets show that the proposed mechanism provides improvements of up to two orders of magnitude in computational efficiency and accuracy over other state-of-the-art techniques.
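
The two-step shape of such mechanisms is easy to sketch for a single parent-child level: add Laplace noise to each count, then project the noisy counts onto the constraint that the counties sum to the state. The closed-form even-spreading below is the ordinary least-squares projection for one parent; the paper's optimization model handles full hierarchies and richer cost functions:

    import numpy as np

    rng = np.random.default_rng(7)
    county_true = np.array([1200.0, 800.0, 450.0])
    state_true = county_true.sum()

    epsilon = 1.0  # privacy budget per released count (sensitivity 1)
    noisy_counties = county_true + rng.laplace(0, 1 / epsilon, county_true.size)
    noisy_state = state_true + rng.laplace(0, 1 / epsilon)

    # Least-squares projection with one parent: spread the gap evenly.
    gap = noisy_state - noisy_counties.sum()
    n = county_true.size
    counties = noisy_counties + gap / (n + 1)
    state = noisy_state - gap / (n + 1)

    assert abs(state - counties.sum()) < 1e-9  # consistency restored
    print(state, counties)

Because the projection is post-processing of already-private outputs, it costs no additional privacy budget.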

Read more
Databases

Discovering Business Area Effects to Process Mining Analysis Using Clustering and Influence Analysis

A common challenge in improving business processes in large organizations is that the business people in charge of operations lack a fact-based understanding of the execution details, process variants, and exceptions taking place in business operations. While existing process mining methodologies can discover these details from event logs, it is challenging to communicate the process mining findings to business people. In this paper, we present a novel methodology for discovering business areas that have a significant effect on process execution details. Our method uses clustering to group similar cases based on process-flow characteristics, and then influence analysis to detect the business areas that correlate most with the discovered clusters. Our analysis serves as a bridge between BPM people and business people, facilitating knowledge sharing between these groups. We also present an example analysis based on publicly available real-life purchase order process data.
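
A hedged sketch of the pipeline shape, with made-up flow features (event count, rework count, duration) and "region" standing in for a business-area dimension: cluster cases on flow characteristics, then score how far each business area's cluster mix deviates from the overall mix. The paper's influence analysis is more elaborate than this simple deviation score:

    import pandas as pd
    from sklearn.cluster import KMeans

    cases = pd.DataFrame({
        "n_events":   [5, 6, 5, 14, 15, 13, 6, 14],
        "n_reworks":  [0, 0, 1, 4, 5, 4, 0, 5],
        "duration_d": [2, 3, 2, 20, 25, 18, 3, 22],
        "region":     ["EU", "EU", "US", "US", "US", "APAC", "EU", "US"],
    })

    # Cluster cases by process-flow characteristics only.
    flow = cases[["n_events", "n_reworks", "duration_d"]]
    cases["cluster"] = KMeans(n_clusters=2, n_init=10,
                              random_state=0).fit_predict(flow)

    # Influence proxy: deviation of each region's cluster share from average.
    overall = cases["cluster"].mean()
    for region, grp in cases.groupby("region"):
        print(region, round(grp["cluster"].mean() - overall, 2))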

Read more
Databases

Discovering Domain Orders through Order Dependencies

Much real-world data comes with explicitly defined domain orders; e.g., lexicographic order for strings, numeric order for integers, and chronological order for time. Our goal is to discover implicit domain orders that we do not already know; for instance, that the order of months in the Chinese Lunar calendar is Corner < Apricot < Peach. To do so, we enhance data profiling methods by discovering implicit domain orders in data through order dependencies. We enumerate tractable special cases and proceed towards the most general case, which we prove is NP-complete. We show that the general case can nevertheless be handled effectively by a SAT solver. We also devise an interestingness measure to rank the discovered implicit domain orders, which we validate with a user study. Based on an extensive suite of experiments with real-world data, we establish the efficacy of our algorithms and the utility of the discovered domain orders, demonstrating significant added value in three applications (data profiling, query optimization, and data mining).
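
For the tractable flavor of the problem, a simplification is easy to sketch: when an order dependency ties the unordered column to a column whose order is known, each pair of tuples yields a precedence constraint, and a topological sort produces a candidate domain order. The paper's general case requires a SAT encoding instead; the companion month-number column below is an illustrative assumption:

    from graphlib import TopologicalSorter

    # (month_name, month_number) pairs: the numeric order of month_number
    # induces name1 < name2 precedence constraints on the names.
    rows = [("Corner", 1), ("Apricot", 2), ("Peach", 3),
            ("Apricot", 2), ("Peach", 3)]
    pairs = {(a, b) for (a, i) in rows for (b, j) in rows if i < j}

    graph = {}
    for a, b in pairs:                 # b depends on a, i.e., a precedes b
        graph.setdefault(b, set()).add(a)

    order = list(TopologicalSorter(graph).static_order())
    print(" < ".join(order))           # Corner < Apricot < Peach

With conflicting or incomplete constraints the graph may be cyclic or admit many linearizations, which is where the hard cases (and the SAT encoding) come in.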

Read more
Databases

Discovering High Utility-Occupancy Patterns from Uncertain Data

It is widely known that much useful information is hidden in big data, leading to the new saying that "data is money." Thus, it is common for individuals to mine crucial information for use in many real-world applications. Past studies have mostly considered frequency, which unfortunately neglects other aspects, such as utility, interest, or risk. It is therefore sensible to discover high-utility itemsets (HUIs) in transaction databases, utilizing not only the quantity but also a predefined utility. To find patterns that can represent the supporting transaction, a recent study mined high utility-occupancy patterns, whose contribution to the utility of the entire transaction is greater than a certain value. Moreover, in realistic applications, patterns may not appear deterministically in transactions but instead be associated with an existence probability. In this paper, a novel algorithm called High-Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM) is proposed. The patterns found by the algorithm are called Potential High Utility Occupancy Patterns (PHUOPs). The algorithm divides user preferences into three factors: support, probability, and utility occupancy. To reduce memory cost and time consumption and to prune the search space, a probability-utility-occupancy list (PUO-list) and a probability-frequency-utility table (PFU-table) are used, which help provide the downward closure property. Furthermore, an original tree structure, called the support count tree (SC-tree), is constructed as the search space of the algorithm. Finally, substantial experiments were conducted to evaluate the performance of the proposed UHUOPM algorithm on both real-life and synthetic datasets, particularly in terms of effectiveness and efficiency.
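
The measures the algorithm combines can be illustrated on a toy uncertain database (all quantities, unit utilities, and probabilities below are made up, and the actual UHUOPM relies on PUO-lists, PFU-tables, and the SC-tree rather than this brute-force pass): expected support sums the existence probabilities of supporting transactions, and utility occupancy is the pattern's share of each supporting transaction's utility:

    transactions = [
        # items: {item: (quantity, unit_utility)}, existence probability
        ({"a": (2, 5), "b": (1, 3), "c": (4, 1)}, 0.9),
        ({"a": (1, 5), "c": (2, 1)},              0.6),
        ({"b": (3, 3), "c": (1, 1)},              0.8),
    ]

    def occupancy(pattern, items):
        # Share of the transaction's utility contributed by pattern items.
        total = sum(q * u for q, u in items.values())
        part = sum(q * u for it, (q, u) in items.items() if it in pattern)
        return part / total

    pattern = {"a", "c"}
    hits = [(occupancy(pattern, it), p) for it, p in transactions
            if pattern <= set(it)]
    exp_support = sum(p for _, p in hits)
    avg_uo = sum(o * p for o, p in hits) / exp_support
    print(exp_support, round(avg_uo, 3))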

Read more
Databases

Distributed Cross-Blockchain Transactions

Interoperability across multiple, or many, blockchains will play a critical role in the forthcoming blockchain-based data management paradigm. In particular, how to ensure the ACID properties of transactions across an arbitrary number of blockchains remains an open problem in both academia and industry: existing solutions either work for only two blockchains or require a centralized component, neither of which meets the scalability requirements of practice. This short paper shares our vision and some early results toward scalable cross-blockchain transactions. Specifically, we design two distributed commit protocols and demonstrate their effectiveness both analytically and experimentally.
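
For contrast with the paper's goal, here is a minimal sketch of the classic two-phase-commit baseline over abstract chain clients; note that the coordinator itself is exactly the centralized component the paper's distributed protocols are designed to avoid, and all class and method names here are hypothetical:

    class Chain:
        def __init__(self, name):
            self.name, self.locked = name, False
        def prepare(self, tx):   # e.g., lock assets via a smart contract
            self.locked = True
            return True
        def commit(self, tx):
            print(f"{self.name}: committed {tx}")
        def abort(self, tx):
            self.locked = False
            print(f"{self.name}: aborted {tx}")

    def cross_chain_tx(chains, tx):
        if all(c.prepare(tx) for c in chains):  # phase 1: all chains vote
            for c in chains:                    # phase 2: commit everywhere
                c.commit(tx)
            return True
        for c in chains:                        # any "no" vote aborts all
            c.abort(tx)
        return False

    cross_chain_tx([Chain("ethereum"), Chain("fabric")], "transfer#42")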

Read more
