Xibei Jia | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Xibei Jia is active.

Explore More

Publication

Featured researches published by Xibei Jia.

ACM Transactions on Database Systems | 2008

Conditional functional dependencies for capturing data inconsistencies

Wenfei Fan; Floris Geerts; Xibei Jia; Anastasios Kementsietsidis

We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of transitional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrongs axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.

international conference on data engineering | 2007

Conditional Functional Dependencies for Data Cleaning

Philip Bohannon; Wenfei Fan; Floris Geerts; Xibei Jia; Anastasios Kementsietsidis

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantic ally related values. For CFDs we provide an inference system analogous to Armstrongs axioms for FDs, as well as consistency analysis. Since CFDs allow data bindings, a large number of individual constraints may hold on a table, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query. We experimentally evaluate the performance of our CFD-based methods for inconsistency detection. This not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.

very large data bases | 2009

Reasoning about record matching rules

Wenfei Fan; Xibei Jia; Shuai Ma

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

international conference on data engineering | 2007

Rewriting Regular XPath Queries on XML Views

Wenfei Fan; Floris Geerts; Xibei Jia; Anastasios Kementsietsidis

We study the problem of answering queries posed on virtual views of XML documents, a problem commonly encountered when enforcing XML access control and integrating data. We approach the problem by rewriting queries on views into equivalent queries on the underlying document, and thus avoid the overhead of view materialization and maintenance. We consider possibly recursively defined XML views and study the rewriting of both XPath and regular XPath queries. We show that while rewriting is not always possible for XPath over recursive views, it is for regular XPath; however, the rewritten query may be of exponential size. To avoid this prohibitive cost we propose a rewriting algorithm that characterizes rewritten queries as a new form of automata, and an efficient algorithm to evaluate the automaton-represented queries. These allow us to answer queries on views in linear time. We have fully implemented a prototype system, SMOQE, which yields the first regular XPath engine and a practical solution for answering queries over possibly recursively defined XML views.

very large data bases | 2011

Dynamic constraints for record matching

Wenfei Fan; Hong Gao; Xibei Jia; Shuai Ma

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.

very large data bases | 2008

Semandaq: a data quality system based on conditional functional dependencies

Wenfei Fan; Floris Geerts; Xibei Jia

We present Semandaq, a prototype system for improving the quality of relational data. Based on the recently proposed conditional functional dependencies (CFDs), it detects and repairs errors and inconsistencies that emerge as violations of these constraints. We demonstrate the following functionalities supported by Semandaq: (a) an interface for specifying CFDs; (b) a visual tool for automated detection of CFD violations in relational data, leveraging efficient SQL-based techniques; (c) extensive visual data exploration capabilities that provide the user with various measures of the quality of the data; (d) repair (cleaning) functionality without excess human interaction, built upon CFD-based cleaning algorithms; we show how Semandaq allows for a natural exploration of the quality of the obtained repairs. Semandaq is a promising tool that provides easy access and user-friendly data quality facilities for any relational database system.

very large data bases | 2008

A revival of integrity constraints for data cleaning

Wenfei Fan; Floris Geerts; Xibei Jia

Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.

british national conference on databases | 2009

Conditional Dependencies: A Principled Approach to Improving Data Quality

Wenfei Fan; Floris Geerts; Xibei Jia

Real-life data is often dirty and costs billions of pounds to businesses worldwide each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies by enforcing bindings of semantically related data values. It accurately identifies records from unreliable data sources by leveraging relative candidate keys, an extension of keys for relations by supporting similarity and matching operators across relations. In contrast to traditional dependencies that were developed for improving the quality of schema, the revised constraints are proposed to improve the quality of data. These constraints yield practical techniques for data repairing and record matching in a uniform framework.

conference on information and knowledge management | 2004

Composable XML integration grammars

Wenfei Fan; Minos N. Garofalakis; Ming Xiong; Xibei Jia

The proliferation of XML as a standard for data representation and exchange in diverse, next-generation Web applications has created an emphatic need for effective XML data-integration tools. For several real-life scenarios, such XML data integration needs to be <i>DTD-directed</i> -- in other words, the target, integrated XML database must conform to a prespecified, user- or application-defined DTD. In this paper, we propose a novel formalism, <i>XML Integration Grammars (XIGs)</i>, for specifying DTD-directed integration of XML data. Abstractly, an XIG maps data from multiple XML sources to a target XML document that conforms to a predefined DTD. An XIG extracts source XML data via queries expressed in a fragment of XQuery, and controls target document generation with tree-valued attributes and the target DTD. The novelty of XIGs consists in not only their automatic support for DTD-conformance but also in their: an XIG may embed local and remote XIGs in its definition, and invoke these XIGs during its evaluation. This yields an important modularity property for our XIGs that allows one to divide a complex integration task into manageable sub-tasks and conquer each of them separately. To efficiently evaluate XIGs we provide algorithms for merging XML queries in an XIG and for scheduling queries and embedded XIGs. These lead to an effective framework, as well as a design tool for XQuery, for effectively specifying and computing complex, DTD-directed XML integration.

very large data bases | 2004

A uniform system for publishing and maintaining XML data

Byron Choi; Wenfei Fan; Xibei Jia; Arek Kasprzyk

XML has become the prime standard for data exchange on the Web. To exchange data residing in Databases, one needs to publish it in XML. Data publishing is often done with a predefined “schema.” The chapter proposes a new approach for schema-directed publishing of relational data in XML, based on the novel notion of attribute transformation grammars (ATGs). ATGs provide guidance on how to define views conforming to target schemas (DTDs), and automatically ensure schema conformance. The chapter also develops an incremental algorithm for maintaining XML views produced by ATGs, based on new incremental computation techniques that capitalize on the hierarchical structure of XML data and unique features of the ATGs. Taking real-life data from Gene Ontology (GO), this chapter also demonstrates how this system can efficiently publish the GO data in XML with respect to a predefined recursive DTD, and how it incrementally updates the target XML data in response to changes to the underlying GO database.

Explore More