Wenyuan Yu
University of Edinburgh
Publication
Featured research published by Wenyuan Yu.
Very Large Data Bases | 2010
Wenfei Fan; Shuai Ma; Nan Tang; Wenyuan Yu
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
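As a concrete illustration of how editing rules, certain regions, and master data fit together, here is a minimal Python sketch; the rule encoding, attribute names, and master tuples are illustrative assumptions rather than the paper's formalism.

```python
# Minimal sketch: an editing rule fires when the tuple agrees with a
# master tuple on the rule's match attributes, all of which must lie in
# the certain region (attributes the user asserts are correct); it then
# copies the master value into the fix attribute, which becomes certain.
EDITING_RULES = [
    {"match": ["zip"], "fix": "city"},
    {"match": ["zip", "street"], "fix": "area_code"},
]

MASTER_DATA = [
    {"zip": "EH8 9AB", "street": "Crichton St",
     "city": "Edinburgh", "area_code": "0131"},
]

def apply_editing_rules(tup, certain_region, master, rules):
    """Fix attributes of tup whose correctness is warranted by the
    certain region and master data; newly fixed attributes may in turn
    enable further rules, so iterate to a fixpoint."""
    fixed, region = dict(tup), set(certain_region)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if not all(a in region for a in rule["match"]):
                continue  # rule not warranted: match attributes not certain
            for m in master:
                if all(fixed[a] == m[a] for a in rule["match"]):
                    if fixed.get(rule["fix"]) != m[rule["fix"]]:
                        fixed[rule["fix"]] = m[rule["fix"]]
                        region.add(rule["fix"])
                        changed = True
    return fixed

t = {"zip": "EH8 9AB", "street": "Crichton St",
     "city": "Edinburh", "area_code": None}
print(apply_editing_rules(t, {"zip", "street"}, MASTER_DATA, EDITING_RULES))
# {'zip': 'EH8 9AB', 'street': 'Crichton St',
#  'city': 'Edinburgh', 'area_code': '0131'}
```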
International Conference on Management of Data | 2013
Yang Cao; Wenfei Fan; Wenyuan Yu
The relative accuracy problem is to determine, given tuples t1 and t2 that refer to the same entity e, whether t1[A] is more accurate than t2[A], i.e., t1[A] is closer to the true value of the A attribute of e than t2[A]. This has been a longstanding issue for data quality, and is challenging when the true values of e are unknown. This paper proposes a model for determining relative accuracy. (1) We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy. (2) We identify and study several fundamental problems for relative accuracy. Given a set Ie of tuples pertaining to the same entity e and a set of accuracy rules, these problems are to decide whether the chase process terminates, is Church-Rosser, and leads to a unique target tuple te composed of the most accurate values from Ie for all the attributes of e. (3) We propose a framework for inferring accurate values with user interaction. (4) We provide algorithms underlying the framework to find the unique target tuple te whenever possible; when there is not enough information to decide a complete te, we compute top-k candidate targets based on a preference model. (5) Using real-life and synthetic data, we experimentally verify the effectiveness and efficiency of our method.
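The toy Python sketch below conveys the flavor of deducing a target tuple with accuracy rules; the rule encoding and the single-pass deduction are simplifying assumptions, not the paper's chase-based inference system.

```python
# Toy sketch: tuples pertaining to one entity, without true values.
tuples = [
    {"name": "W. Yu",      "year": 2011},
    {"name": "Wenyuan Yu", "year": 2013},
]

# An accuracy rule compares two tuples on one attribute and returns the
# value it deems more accurate.
def fuller_name(t1, t2):
    # assumption: an unabbreviated name is more accurate than a shorter one
    return max(t1["name"], t2["name"], key=len)

def later_year(t1, t2):
    return max(t1["year"], t2["year"])

RULES = {"name": fuller_name, "year": later_year}

def deduce_target(tuples, rules):
    """Deduce the most accurate value per attribute across all tuples;
    attributes with no applicable rule stay undecided, in which case a
    complete target tuple cannot be formed."""
    target = {}
    for attr, rule in rules.items():
        best = tuples[0]
        for t in tuples[1:]:
            if rule(best, t) != best[attr]:
                best = t
        target[attr] = best[attr]
    return target

print(deduce_target(tuples, RULES))  # {'name': 'Wenyuan Yu', 'year': 2013}
```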
Journal of Data and Information Quality | 2014
Wenfei Fan; Shuai Ma; Nan Tang; Wenyuan Yu
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing aims to make a database consistent by fixing errors in the data using integrity constraints. These are typically treated as separate processes in current data cleaning systems, based on heuristic solutions. This article studies a new problem in connection with data cleaning, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we provide a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules, and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination, and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete and coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analyses, respectively, which are more accurate than fixes generated by heuristics. Heuristic fixes are produced only when deterministic or reliable fixes are unavailable. We experimentally verify, using real-life and synthetic data, that our techniques can significantly improve the accuracy of record matching and data repairing taken as separate processes.
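To see, under simplified assumptions, why the interaction matters, the sketch below repairs records with a constant conditional functional dependency first, which lets a matching rule fire that would otherwise miss; the records, the CFD, and the matching rule are all illustrative.

```python
records = [
    {"id": 1, "phone": "0131 650 1000", "zip": "EH8 9AB", "city": "Edinburg"},
    {"id": 2, "phone": "0131 650 1000", "zip": "EH8 9AB", "city": "Edinburgh"},
]

# Repairing rule (a constant CFD): zip "EH8 9AB" implies city "Edinburgh".
CFD = {"EH8 9AB": "Edinburgh"}

def repair(recs):
    """Deterministic fix: enforce the constant CFD on every record."""
    for r in recs:
        expected = CFD.get(r["zip"])
        if expected and r["city"] != expected:
            r["city"] = expected

def match(recs):
    """Matching rule: records agreeing on phone and city are taken to
    refer to the same entity; return the matched id pairs."""
    return [(a["id"], b["id"])
            for i, a in enumerate(recs)
            for b in recs[i + 1:]
            if a["phone"] == b["phone"] and a["city"] == b["city"]]

print(match(records))  # [] -- the typo "Edinburg" blocks the match
repair(records)        # repairing first ...
print(match(records))  # [(1, 2)] -- ... lets the matching rule fire
```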
International Conference on Data Engineering | 2013
Wenfei Fan; Floris Geerts; Nan Tang; Wenyuan Yu
This paper introduces a new approach to conflict resolution: given a set of tuples pertaining to the same entity, the goal is to identify a single tuple in which each attribute takes the latest, consistent value in the set. This problem is important in data integration, data cleaning, and query answering. It is challenging, however, since reliable timestamps are often absent in practice, among other things. We propose a model for conflict resolution, by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution, and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, by integrating data currency and consistency inferences into a single process, and by interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data.
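A toy sketch of one ingredient, under illustrative assumptions: a partial currency order on a single attribute, from which the values not superseded by any other available value are selected; the encoding of currency constraints is simplified for exposition.

```python
# Values observed for one attribute of one entity; no timestamps exist.
observed_titles = ["PhD student", "Researcher"]

# Partial currency order: (older, newer) pairs; a PhD student later
# becomes a researcher, so "Researcher" is more current.
TITLE_ORDER = [("PhD student", "Researcher")]

def most_current(values, order):
    """Keep the values that no other available value is more current
    than, under the given partial order."""
    newer = {old: new for old, new in order}
    return [v for v in values
            if v not in newer or newer[v] not in values]

print(most_current(observed_titles, TITLE_ORDER))  # ['Researcher']
```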
International Conference on Data Engineering | 2012
Wenfei Fan; Nan Tang; Wenyuan Yu
This paper investigates incremental detection of errors in distributed data. Given a distributed database D, a set Σ of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the goal is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. The need for this study is evident, since real-life data is often dirty, distributed, and frequently updated, and it is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for a database D that is partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.
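The single-site Python sketch below conveys the boundedness idea: maintaining the violations of one dependency with work proportional to the update rather than to the database; the encoding of the CFD as a zip-city agreement check is illustrative, and the distributed and data-shipment aspects are omitted.

```python
from collections import defaultdict

# CFD (variable form): tuples that agree on "zip" must agree on "city".
def build_index(db):
    """One-off pass over D: index the cities seen per zip."""
    idx = defaultdict(set)
    for t in db:
        idx[t["zip"]].add(t["city"])
    return idx

def incremental_detect(idx, inserts):
    """Compute the new violations (part of ΔV) caused by inserted tuples
    (ΔD), updating the index in place; the work is linear in |ΔD|,
    independent of the size of D."""
    delta_v = []
    for t in inserts:
        cities = idx[t["zip"]]
        if cities and t["city"] not in cities:
            delta_v.append(t)  # disagrees with some existing tuple
        cities.add(t["city"])
    return delta_v

db = [{"zip": "EH8 9AB", "city": "Edinburgh"}]
idx = build_index(db)
delta = [{"zip": "EH8 9AB", "city": "Glasgow"}]  # inconsistent insert
print(incremental_detect(idx, delta))
# [{'zip': 'EH8 9AB', 'city': 'Glasgow'}]
```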
In Search of Elegance in the Theory and Practice of Computation | 2013
Wenfei Fan; Floris Geerts; Shuai Ma; Nan Tang; Wenyuan Yu
Recent work on data quality has primarily focused on data repairing algorithms for improving data consistency and record matching methods for data deduplication. This paper accentuates several other challenging issues that are essential to developing data cleaning systems, namely, error correction with performance guarantees, unification of data repairing and record matching, relative information completeness, and data currency. We provide an overview of recent advances in the study of these issues, and advocate the need for developing a logical framework for a uniform treatment of these issues.
International Conference on the Computer Processing of Oriental Languages | 2009
Wenyuan Yu; Cheng Wang; Wenxin Li; Zhuoqun Xu
We present the idea of Named Entity Semantic Identification, that is, identifying a named entity in a knowledge base, and give a definition of this idea. We then introduce PKUNEI, an approach for Chinese product named entity semantic identification. This approach divides the whole process into two separate phases: a role-model-based NER phase and a query-driven semantic identification phase. We describe the model of the NER phase, the automatic construction of the knowledge base, and the implementation of the semantic identification phase. The experimental results demonstrate that our approach is effective for the semantic identification task.
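A schematic Python sketch of the two-phase pipeline, under strong simplifying assumptions: a regular expression stands in for the role-model-based NER phase and a dictionary lookup for the query-driven knowledge-base phase; neither reflects the actual PKUNEI implementation.

```python
import re

# Toy knowledge base mapping surface forms to entity identifiers; the
# entries and identifier scheme are invented for illustration.
KNOWLEDGE_BASE = {
    "thinkpad x61": "product/lenovo-thinkpad-x61",
    "ipod nano": "product/apple-ipod-nano",
}

def ner_phase(text):
    """Stand-in for phase 1 (role-model-based NER): extract candidate
    product mentions via a crude pattern."""
    return re.findall(r"[A-Za-z]+ [a-zA-Z]*\d+\w*|iPod \w+", text)

def semantic_identification(mentions, kb):
    """Phase 2 (query-driven): map each recognized mention to a knowledge
    base entity, or None if it cannot be identified."""
    return {m: kb.get(m.lower()) for m in mentions}

text = "She compared the ThinkPad X61 with an iPod Nano."
print(semantic_identification(ner_phase(text), KNOWLEDGE_BASE))
# {'ThinkPad X61': 'product/lenovo-thinkpad-x61',
#  'iPod Nano': 'product/apple-ipod-nano'}
```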
International Conference on Management of Data | 2011
Wenfei Fan; Shuai Ma; Nan Tang; Wenyuan Yu
Very Large Data Bases | 2011
Wenfei Fan; Shuai Ma; Nan Tang; Wenyuan Yu
IEEE Data(base) Engineering Bulletin | 2017
Wenfei Fan; Jingbo Xu; Xiaojian Luo; Yinghui Wu; Wenyuan Yu; Ruiqi Xu