8th ACM IKDD CODS and 26th COMAD | 2021

Detecting Data Accuracy Issues in Textual Geographical Data by a Clustering-based Approach

 
 
 

Abstract


Data are published to encourage their exploitation. However, data quality issues hinder data consumption and force data consumers to invest time and effort in data cleansing. Focusing on textual geographical data, we aim to detect inaccurate values, such as typos and truncated values, and to propose corrections through a clustering-based approach. Our method relies mainly on a dictionary of correct values, agglomerative clustering to group the data, and Levenshtein distance and fuzzy string matching to compute word similarity. We test our approach on real open datasets published by the Campania region, which are heterogeneous in topic, size, and type of errors. The results show that combining Levenshtein distance and fuzzy matching with clustering is effective in detecting and correcting quality issues in textual geographical data. These results are useful for data producers and consumers, in both academia and industry, in any application domain.
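To make the pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a normalized Levenshtein ratio as the similarity measure (standing in for the fuzzy matching step), single-linkage agglomerative clustering, and a similarity threshold of 0.75. The function names, the threshold, and the sample values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def agglomerative_clusters(values, threshold):
    """Single-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per distinct value and repeatedly merges the
    two most similar clusters until no pair reaches the threshold.
    """
    clusters = [[v] for v in dict.fromkeys(values)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            link = max(similarity(x, y)
                       for x in clusters[i] for y in clusters[j])
            if best is None or link > best[0]:
                best = (link, i, j)
        link, i, j = best
        if link < threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def propose_corrections(values, dictionary, threshold=0.75):
    """Map each suspicious value to the dictionary entry closest to its cluster."""
    corrections = {}
    for cluster in agglomerative_clusters(values, threshold):
        score, target = max((similarity(v, d), d)
                            for v in cluster for d in dictionary)
        if score < threshold:
            continue  # no dictionary entry is close enough to this cluster
        for value in cluster:
            if value != target:
                corrections[value] = target
    return corrections


if __name__ == "__main__":
    # Hypothetical sample: province names with typos and truncated values.
    dictionary = ["Napoli", "Salerno", "Avellino", "Caserta", "Benevento"]
    observed = ["Napoli", "Napolli", "Napol", "Salerno", "Salern", "Avellino"]
    print(propose_corrections(observed, dictionary))
```

In this sketch the dictionary decides which member of a cluster counts as correct, while the clustering lets a value that is far from every dictionary entry still be corrected through a closer neighbour in its cluster. Single linkage can chain unrelated values together at low thresholds, so the threshold would need tuning on real data.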

DOI 10.1145/3430984.3431031
Language English
Journal 8th ACM IKDD CODS and 26th COMAD

Full Text