Journal of Statistical Software | 2019

dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R

 
 

Abstract


Data cleaning and validation are important steps in any data analysis, as the validity of the conclusions from the analysis hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. Ideally, a human investigator should go through each variable in the dataset and look for potential errors, both in input values and in codings, but that process can be very time-consuming, expensive, and error-prone in itself. We describe an R package, dataMaid, which implements an extensive and customizable suite of quality assessment aids that can be applied to a dataset in order to identify potential problems in its variables. The results are presented in an auto-generated, nontechnical, stand-alone overview document intended to be perused by an investigator with an understanding of the variables in the data, but not necessarily knowledge of R. Thereby, dataMaid aids the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data quality screening. Moreover, the dataMaid solution changes the data screening process from the usual ad hoc approach to a systematic, well-documented endeavor. dataMaid also provides a suite of more typical R tools for interactive data quality assessment and screening, where the data inspections are executed directly in the R console.

Volume: 90
Pages: 1-38
DOI: 10.18637/jss.v090.i06
Language: English
Journal: Journal of Statistical Software
