Publication


Featured research published by Fei Chiang.


Very Large Data Bases | 2008

Discovering data quality rules

Fei Chiang; Renée J. Miller

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
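
As a toy illustration of the kind of rule the tool discovers, the sketch below checks whether a candidate CFD holds over a list of records and returns the non-conformant ones; the relation, attribute names, and pattern are hypothetical, and the actual discovery algorithm searches and prunes the space of such candidates rather than testing a single hand-picked rule.

    from collections import defaultdict

    def check_cfd(records, condition, lhs, rhs):
        """Check a candidate CFD: for records matching `condition`, the `lhs`
        attributes must functionally determine the `rhs` attributes.
        Returns the records that violate the dependency."""
        groups = defaultdict(list)
        for rec in records:
            if all(rec.get(attr) == val for attr, val in condition.items()):
                groups[tuple(rec[a] for a in lhs)].append(rec)

        violations = []
        for key, recs in groups.items():
            rhs_values = {tuple(r[a] for a in rhs) for r in recs}
            if len(rhs_values) > 1:          # same LHS, conflicting RHS values
                violations.extend(recs)
        return violations

    # Hypothetical course data and the example rule from the abstract:
    # for CS graduate courses, (course_no, term) -> (room, instructor).
    courses = [
        {"dept": "CS", "level": "grad", "course_no": 751, "term": "F08",
         "room": "ITB 222", "instructor": "Miller"},
        {"dept": "CS", "level": "grad", "course_no": 751, "term": "F08",
         "room": "ITB 137", "instructor": "Miller"},   # conflicting room: potentially dirty
    ]
    print(check_cfd(courses, {"dept": "CS", "level": "grad"},
                    ("course_no", "term"), ("room", "instructor")))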


Very Large Data Bases | 2009

Framework for evaluating clustering algorithms in duplicate detection

Oktie Hassanzadeh; Fei Chiang; Hyun Chul Lee; Renée J. Miller

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplicate detection or record linkage, is used as part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system, which provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general-purpose duplicate detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never before been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
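
One of the simplest unconstrained clustering strategies such a framework can evaluate is transitive closure over the similarity pairs produced by an approximate join: any two records connected by a chain of high-similarity matches land in the same duplicate cluster. Below is a minimal union-find sketch, with illustrative record ids and match pairs.

    def cluster_by_transitive_closure(record_ids, similar_pairs):
        """Group records into duplicate clusters: connected components over the
        graph whose edges are the pairs returned by an approximate join."""
        parent = {r: r for r in record_ids}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        for a, b in similar_pairs:
            parent[find(a)] = find(b)           # union the two components

        clusters = {}
        for r in record_ids:
            clusters.setdefault(find(r), []).append(r)
        return list(clusters.values())

    # Illustrative output of an approximate string join over four records.
    print(cluster_by_transitive_closure([1, 2, 3, 4], [(1, 2), (2, 3)]))
    # -> [[1, 2, 3], [4]]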


International Conference on Data Engineering | 2011

A unified model for data and constraint repair

Fei Chiang; Renée J. Miller

Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data by finding minimal or lowest-cost changes that make the data consistent with the constraints. Such techniques are appropriate for the old world where data changes but schemas and their constraints remain fixed. In many modern applications, however, constraints may evolve over time as application or business rules change, as data is integrated with new data sources, or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear whether there is an error in the data (and the data should be repaired) or whether the constraints have evolved (and the constraints should be repaired). In this work, we present a novel unified cost model that allows data and constraint repairs to be compared on an equal footing. We consider repairs over a database that is inconsistent with respect to a set of rules modeled as functional dependencies (FDs). FDs are the most common type of constraint and are known to play an important role in maintaining data quality. We evaluate the quality and scalability of our repair algorithms over synthetic data and present a qualitative case study using a well-known real dataset. The results show that our repair algorithms not only scale well for large datasets, but also accurately capture and correct inconsistencies, and correctly decide when a data repair versus a constraint repair is best.
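
To give a flavour of comparing repairs on a single scale, the hedged sketch below uses a made-up unit-cost model: a data repair pays one unit per tuple changed, while a constraint repair pays a fixed cost (say, for generalizing the FD by adding an attribute to its left-hand side); whichever is cheaper is recommended. This is only an illustration of the idea, not the paper's actual cost model.

    from collections import defaultdict

    def data_repair_cost(records, lhs, rhs):
        """Number of tuples to modify so that lhs -> rhs holds: within each
        LHS group, keep the most common RHS value and fix the rest."""
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[a] for a in lhs)].append(tuple(rec[a] for a in rhs))
        cost = 0
        for values in groups.values():
            most_common = max(set(values), key=values.count)
            cost += sum(v != most_common for v in values)
        return cost

    def recommend_repair(records, lhs, rhs, constraint_repair_cost=3):
        """Compare a data repair and a constraint repair on the same scale.
        The constraint repair cost here is an arbitrary illustrative constant."""
        d_cost = data_repair_cost(records, lhs, rhs)
        return "repair data" if d_cost <= constraint_repair_cost else "repair constraint"

    # Hypothetical employee tuples violating dept -> mgr in one place only.
    emps = [{"dept": "Sales", "mgr": "Ann"}, {"dept": "Sales", "mgr": "Ann"},
            {"dept": "Sales", "mgr": "Bob"}]
    print(recommend_repair(emps, ("dept",), ("mgr",)))   # few violations: repair data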


International Conference on Data Engineering | 2014

Continuous data cleaning

Maksims Volkovs; Fei Chiang; Jaroslaw Szlichta; Renée J. Miller

In declarative data cleaning, data semantics are encoded as constraints, and errors arise when the data violates the constraints. Various forms of statistical and logical inference can be used to reason about and repair inconsistencies (errors) in data. Recently, unified approaches that repair both errors in data and errors in semantics (the constraints) have been proposed. However, both data-only approaches and unified approaches are by and large static, in that they apply cleaning to a single snapshot of the data and constraints. We introduce a continuous data cleaning framework that can be applied to dynamic data and constraint environments. Our approach permits both the data and its semantics to evolve and suggests repairs based on the accumulated evidence to date. Importantly, our approach uses not only the data and constraints as evidence, but also considers the past repairs chosen and applied by a user (user repair preferences). We introduce a repair classifier that predicts the type of repair needed to resolve an inconsistency, and that learns from past user repair preferences to recommend more accurate repairs in the future. Our evaluation shows that our techniques achieve high prediction accuracy and generate high-quality repairs. Of independent interest, our work makes use of a set of data statistics that are shown to be predictive of particular repair types.
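
The repair-classifier idea can be sketched as follows: featurize each inconsistency (for example, the fraction of tuples violating the constraint and how long the constraint has been stable), train on the repairs the user actually chose in the past, and predict whether the next violation calls for a data repair or a constraint repair. The feature names and the choice of logistic regression are assumptions for illustration, not the paper's exact statistics or model.

    # pip install scikit-learn
    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per past inconsistency:
    # [fraction of tuples violating the FD, days since the FD last changed]
    X_past = [[0.01, 400], [0.02, 380], [0.40, 10], [0.55, 5], [0.03, 300], [0.60, 2]]
    # Repair the user actually applied: 0 = data repair, 1 = constraint repair.
    y_past = [0, 0, 1, 1, 0, 1]

    clf = LogisticRegression().fit(X_past, y_past)

    # New inconsistency: many violations and a recently changed constraint
    # suggest the semantics evolved, so a constraint repair is recommended.
    print(clf.predict([[0.5, 7]]))   # e.g. array([1])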


Very Large Data Bases | 2015

Combining quantitative and logical data cleaning

Nataliya Prokoshyna; Jaroslaw Szlichta; Fei Chiang; Renée J. Miller; Divesh Srivastava

Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems, while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths. We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Mover's Distance. We show that the problem of computing a repair with minimal statistical distortion is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has small statistical distortion, using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is coNP-complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
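
Two ingredients are easy to illustrate in isolation: a metric FD tolerates small differences in the right-hand side (only spreads beyond a threshold count as violations), and the statistical distortion of a candidate repair can be measured as the Earth Mover's Distance between the original and repaired value distributions. The attribute names and tolerance below are hypothetical; for one-dimensional numeric data, SciPy's wasserstein_distance computes the EMD.

    # pip install scipy
    from collections import defaultdict
    from scipy.stats import wasserstein_distance

    def metric_fd_violations(records, lhs, rhs, tol):
        """Metric FD lhs -> rhs: within each LHS group, numeric RHS values may
        differ by at most `tol`; larger spreads are flagged as inconsistencies."""
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[a] for a in lhs)].append(rec[rhs])
        return {k: v for k, v in groups.items() if max(v) - min(v) > tol}

    # Hypothetical movie records: small runtime differences are tolerated.
    movies = [{"title": "Up", "runtime": 96}, {"title": "Up", "runtime": 97},
              {"title": "Up", "runtime": 140}]
    print(metric_fd_violations(movies, ("title",), "runtime", tol=5))

    # Statistical distortion of a candidate repair: EMD between the original
    # and repaired distributions of the runtime attribute (smaller is better).
    print(wasserstein_distance([96, 97, 140], [96, 97, 96]))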


International Conference on Management of Data | 2008

An XML index advisor for DB2

Iman Elghandour; Ashraf Aboulnaga; Daniel C. Zilio; Fei Chiang; Andrey Balmin; Kevin S. Beyer; Calisto Zuzarte

XML database systems are expected to handle increasingly complex queries over increasingly large and highly structured XML databases. An important problem that needs to be solved for these systems is how to choose the best set of indexes for a given workload. We have developed an XML Index Advisor that solves this XML index recommendation problem and is tightly coupled with the query optimizer of the database system. We have implemented our XML Index Advisor for DB2. In this demonstration we showcase the new query optimizer modes that we added to DB2, the index recommendation process, and the effectiveness of the recommended indexes.


International Conference on Data Engineering | 2012

AutoDict: Automated Dictionary Discovery

Fei Chiang; Periklis Andritsos; Erkang Zhu; Renée J. Miller

An attribute dictionary is a set of attributes together with a set of common values of each attribute. Such dictionaries are valuable in understanding unstructured or loosely structured textual descriptions of entity collections, such as product catalogs. Dictionaries provide the supervised data for learning product or entity descriptions. In this demonstration, we will present AutoDict, a system that analyzes input data records, and discovers high quality dictionaries using information theoretic techniques. To the best of our knowledge, AutoDict is the first end-to-end system for building attribute dictionaries. Our demonstration will showcase the different information analysis and extraction features within AutoDict, and highlight the process of generating high quality attribute dictionaries.


International Conference on Data Engineering | 2008

XML Index Recommendation with Tight Optimizer Coupling

Iman Elghandour; Ashraf Aboulnaga; Daniel C. Zilio; Fei Chiang; Andrey Balmin; Kevin S. Beyer; Calisto Zuzarte

XML database systems are expected to handle increasingly complex queries over increasingly large and highly structured XML databases. An important problem that needs to be solved for these systems is how to choose the best set of indexes for a given workload. In this paper, we present an XML Index Advisor that solves this XML index recommendation problem and has the key characteristic of being tightly coupled with the query optimizer. We rely on the optimizer to enumerate index candidates and to estimate the benefit gained from potential index configurations. We expand the set of candidate indexes obtained from the query optimizer to include more general indexes that can be useful for queries other than those in the training workload. To recommend an index configuration, we introduce two new search algorithms. The first algorithm finds the best set of indexes for the specific training workload, and the second algorithm finds a general set of indexes that can benefit the training workload as well as other similar workloads. We have implemented our XML Index Advisor in a prototype version of IBM® DB2® 9, which supports both relational and XML data, and we experimentally demonstrate the effectiveness of our advisor using this implementation.
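
The search over candidate index configurations can be illustrated with a simple greedy heuristic: repeatedly pick the candidate with the best optimizer-estimated benefit per unit of size until the disk budget is exhausted. The candidate paths, benefit numbers, and budget below are invented, and the paper's two search algorithms are more sophisticated than this sketch.

    def greedy_index_selection(candidates, budget_mb):
        """candidates: (index_name, estimated_benefit, size_mb) triples, where
        benefits would come from the optimizer's what-if cost estimates.
        Greedily pick indexes by benefit density until the space budget is used."""
        chosen, used = [], 0.0
        for name, benefit, size in sorted(candidates,
                                          key=lambda c: c[1] / c[2], reverse=True):
            if used + size <= budget_mb:
                chosen.append(name)
                used += size
        return chosen

    # Illustrative candidate path indexes with made-up benefit/size estimates.
    candidates = [("/catalog/book/@isbn", 120.0, 40.0),
                  ("/catalog/book/title",  90.0, 25.0),
                  ("//review/rating",      30.0, 60.0)]
    print(greedy_index_selection(candidates, budget_mb=70))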


International Journal of Information Quality | 2014

Repairing integrity rules for improved data quality

Fei Chiang; Yu Wang

Integrity constraints are the primary tool used to capture business rules and domain constraints in data management systems. When these constraints are not strictly enforced, poor data quality often arises, as inconsistencies occur between the data and the set of constraints. To resolve these inconsistencies, organisations often implement specific, sometimes manual, cleansing routines to fix the errors. As modern systems are expected to handle increasing amounts of highly heterogeneous data, often in dynamic data environments where the data and the constraints may change, manual cleansing routines are insufficient to handle this increased scale and heterogeneity. In this work, we present a set of new constraint repair operations that can be incorporated into a data quality tool that provides automated support for both data and constraint repair and management. Our holistic approach is designed to facilitate the curation and maintenance of both the data and the constraints. We focus on discovering trends, contextual information, and data patterns to understand how a business rule (constraint) has evolved. We also investigate how to find a minimal set of constraints that contain non-redundant information since enforcing extraneous constraints is costly and can negatively affect system performance. We conduct two case studies using real business datasets that demonstrate the quality and usefulness of our techniques.
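
One maintenance task mentioned above, pruning a constraint set down to a non-redundant core, can be sketched with the standard attribute-closure test for FD implication: an FD is redundant if the remaining FDs already imply it. The schema and FDs below are illustrative.

    def closure(attrs, fds):
        """Attribute closure of `attrs` under the FDs (lhs, rhs) in `fds`."""
        closed = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= closed and not set(rhs) <= closed:
                    closed |= set(rhs)
                    changed = True
        return closed

    def remove_redundant_fds(fds):
        """Drop every FD that is implied by the others (non-redundant cover)."""
        kept = list(fds)
        for fd in fds:
            rest = [f for f in kept if f != fd]
            if set(fd[1]) <= closure(fd[0], rest):   # fd is implied by the rest
                kept = rest
        return kept

    # Illustrative constraint set: A -> C is implied by A -> B and B -> C.
    fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"A"}, {"C"})]
    print(remove_redundant_fds(fds))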


Procedia Computer Science | 2013

An Algebraic Approach Towards Data Cleaning

Ridha Khedri; Fei Chiang; Khair Eddin Sabri

There has been a proliferation in the amount of data being generated and collected in the past several years. One of the leading factors contributing to this increased data scale is cheaper commodity storage, making it easier for organisations to house large data stores containing massive amounts of historical data. To effectively analyse these data sets, a preprocessing step is often required, as most real data sets are inherently dirty and inconsistent. Existing data cleaning tools have focused on cleaning the errors at hand. In this paper, we take a more formal approach and propose the use of information algebra as a general theory to describe structured data sets and data cleaning. We formally define the notions of association rule and association function, and we present results relating these concepts. We also propose an algorithm for generating association rules from a given structured data set.

Collaboration


Dive into Fei Chiang's collaborations.

Top Co-Authors


Jaroslaw Szlichta

University of Ontario Institute of Technology

Yu Huang

Chinese Academy of Sciences
