2021 IEEE 37th International Conference on Data Engineering (ICDE) | 2021

From Minimum Change to Maximum Density: On S-Repair under Integrity Constraints


Abstract


Integrity constraints are often employed to clean dirty data. A typical S-repair model removes a minimal set of tuples (to avoid excessive removal and information loss) such that the remaining tuples no longer violate the integrity constraints. However, multiple candidate minimal removal sets exist, and it is difficult to determine which one to apply. We observe that a clean tuple often has more close neighbors (i.e., higher density) than a dirty tuple. Accordingly, our study proposes to return, among the various minimal removal sets, the S-repair under integrity constraints with the highest density. We analyze the hardness of maximizing S-repair density under integrity constraints and provide an efficient approximation. Extensive experiments over real datasets collected from industry with real-world errors show that our proposal achieves higher accuracy in cleaning dirty tuples than state-of-the-art methods.
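
The idea of choosing the densest repair among several minimal removal sets can be illustrated with a small brute-force sketch. This is not the paper's algorithm or its approximation. It assumes a single functional dependency (key determines value) as the integrity constraint, treats "minimal" as minimum cardinality, and defines density as the number of ordered tuple pairs whose numeric measurement differs by at most `eps`. The relation `TUPLES`, the function names, and the value of `eps` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy relation: (key, value, measurement).
# Assumed constraint: the functional dependency key -> value.
TUPLES = [
    ("A", "x", 1.0),
    ("A", "y", 5.0),  # conflicts with the first tuple (same key, different value)
    ("B", "z", 1.1),
    ("C", "w", 1.2),
]


def violates_fd(s, t):
    """Two tuples violate key -> value if they share a key but differ in value."""
    return s[0] == t[0] and s[1] != t[1]


def is_consistent(rows):
    """A set of tuples is consistent if no pair violates the dependency."""
    return not any(violates_fd(s, t) for s, t in combinations(rows, 2))


def density(rows, eps=0.5):
    """Assumed density: number of ordered pairs within eps on the measurement."""
    return sum(
        1
        for i, s in enumerate(rows)
        for j, t in enumerate(rows)
        if i != j and abs(s[2] - t[2]) <= eps
    )


def densest_minimum_repair(rows, eps=0.5):
    """Exhaustively find the minimum-size removal sets and keep the densest repair.

    Exponential in the number of tuples, so usable only on toy inputs.
    """
    n = len(rows)
    for k in range(n + 1):  # try removal sets of increasing size
        candidates = []
        for removed in combinations(range(n), k):
            kept = [r for i, r in enumerate(rows) if i not in removed]
            if is_consistent(kept):
                candidates.append((density(kept, eps), removed, kept))
        if candidates:  # first k with any consistent repair is the minimum
            return max(candidates, key=lambda c: c[0])
    # Not reached: removing all n tuples always yields a consistent (empty) set.


score, removed, kept = densest_minimum_repair(TUPLES)
print(score, removed, kept)
```

In this toy relation, removing either of the two conflicting tuples restores consistency with a single deletion, so minimum change alone cannot decide between them. The density criterion prefers removing the tuple whose measurement lies far from the others.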

Pages 1943-1948
DOI 10.1109/ICDE51399.2021.00181
Language English
Journal 2021 IEEE 37th International Conference on Data Engineering (ICDE)
