Data Validation
Mark P.J. van der Loo* and Edwin de Jonge

Statistics Netherlands

* Corresponding author: [email protected]

This is the submitted version of the following book chapter: "Van der Loo, M.P.J. and De Jonge, E. (2020) Data Validation. In WileyStatsRef: Statistics Reference Online", which has been published in final form at https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat08255.

Abstract
Data validation is the activity where one decides whether or not a particular data set is fit for a given purpose. Formalizing the requirements that drive this decision process allows for unambiguous communication of the requirements, automation of the decision process, and opens up ways to maintain and investigate the decision process itself. The purpose of this article is to formalize the definition of data validation and to demonstrate some of the properties that can be derived from this definition. In particular, it is shown how a formal view of the concept permits a classification of data quality requirements, allowing them to be ordered in increasing levels of complexity. Some subtleties arising from combining possibly many such requirements are pointed out as well.
Informally, data validation is the activity where one decides whether or not a particular data set is fit for a given purpose. The decision is based on testing observed data against prior expectations that a plausible data set is assumed to satisfy. Examples of prior expectations range widely. They include natural limits on variables (weight cannot be negative), restrictions on combinations of multiple variables (a man cannot be pregnant), combinations of multiple entities (a mother cannot be younger than her child), and combinations of multiple data sources (the import value of country A from country B must equal the export value of country B to country A). Besides the strict logical constraints mentioned in the examples, there are often 'softer' constraints based on human experience. For example, one may not expect a certain economic sector to grow more than 5% in a quarter. Here, the 5% limit does not represent a physical impossibility but rather a limit based on past experience. Since one must decide in the end whether a data set is usable for its intended purpose, we treat such assessments on equal footing.

The purpose of this paper is to formalize the definition of data validation and to demonstrate some of the properties that can be derived from this definition. In particular, it is shown how a formal view of the concept permits a classification of data validation rules (assertions), allowing them to be ordered in increasing levels of 'complexity'. Here, the term 'complexity' refers to the amount of different types of information necessary to evaluate a validation rule. A formal definition also permits development of tools for automated validation and automated reasoning about data validation (Zio et al., 2015; Van der Loo and De Jonge, 2018, 2019). Finally, some subtleties arising from combining validation rules are pointed out.

Defining data validation

Intuitively, a validation activity classifies a dataset as acceptable or not acceptable. A straightforward formalization is to define it as a function from the collection of data sets that could have been observed to {True, False}. One only needs to be careful in defining the 'collection of data sets', to avoid a 'set of all sets' which recursively holds itself. To avoid such paradoxes, a data set is defined as a set of key-value pairs, where the keys come from a finite set and the values from some domain.

Definition 1. A data point is a pair (k, x) ∈ K × D, where k is a key, selected from a finite set K, and x is a value from a domain D.

In applications the identifier k makes the value interpretable as the property of a real-world entity or event. The domain D is the set of all possible values that x can take, and it therefore depends on the circumstances in which the data is obtained.

As an example, consider an administrative system holding age and job status of persons. It is assumed that 'job' takes values in {"employed", "unemployed"} and that 'age' is an integer. However, if the data entry system performs no validation on entered data, numbers may end up in the job field, and job values may end up in the age field. In an example where the database contains data on two persons identified as 1 and 2, this gives

K = {1, 2} × {"age", "job"},
D = ℕ ∪ {"employed", "unemployed"}.        (1)
This definition allows for the occurrence of missing values by defining a special value for them, say 'NA' (not available), and adding it to the domain. Once K and D are fixed it is possible to define the set of all observable datasets.

Definition 2. A dataset is a subset of K × D where every key in K occurs exactly once.

Another way to interpret this is to say that a data set is a total function K → D. The set of all observable data sets is denoted D^K. In the example, one possible dataset is

{((1, "age"), 23), ((1, "job"), "unemployed"), ((2, "age"), "employed"), ((2, "job"), 54)}.

Observe that the key consists of a person identifier and a variable identifier. Since type checking is a common part of data validation, these definitions deliberately leave open the possibility that variables assume a value of the wrong type. Data validation can now be formally defined as follows.
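To make the definitions concrete, the following sketch stores the example data set in R as a set of key-value pairs and checks the defining property of Definition 2 (every key occurs exactly once). The layout and names are illustrative, not part of the formal framework, and the numeric values are chosen for the example.

library(stats)  # base R only; no packages required

# The example data set from Equation (1) as key-value pairs. A list column
# is used for the values, since the domain D mixes numbers and job statuses.
d <- data.frame(
  unit     = c(1, 1, 2, 2),
  variable = c("age", "job", "age", "job")
)
# Note that unit 2 has its values swapped between the two fields.
d$value <- list(23, "unemployed", "employed", 54)

# Definition 2: a data set is a subset of K x D in which every key
# (unit, variable) occurs exactly once.
key <- paste(d$unit, d$variable)
stopifnot(!anyDuplicated(key))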
Definition 3. A data validation function is a surjective function v : D^K ↠ {False, True}.

A data set d ∈ D^K for which v(d) = True is said to satisfy v. A data set for which v(d) = False is said to fail v. Recall that surjective means that there is at least one d ∈ D^K for which v(d) = False and at least one d ∈ D^K for which v(d) = True. A validation function has to be surjective to have any meaning as a data validation activity. Suppose that v′ : D^K → {False, True} is a non-surjective function. If there is no data set d for which v′(d) = False, then v′ is always true and it does not validate anything. If, on the other hand, there is no d for which v′(d) = True, then no data set can satisfy v′. In short, a function that is not surjective on {False, True} does not separate valid from invalid data.

A data validation function is reminiscent of a predicate as defined in first-order logic. Informally, a predicate is a statement about variables that can be True or False. The variables can take values in a predefined set (referred to as the domain of the predicate). Since validation functions map elements of D^K to {False, True}, it is tempting to equate a validation function with a predicate over D^K. However, the elements of D^K are instances of possible data sets, and validation is based on statements involving variables of a single observed data set. It is therefore more convenient to adopt the following definition.

Definition 4. A validation rule is a predicate over an instance d ∈ D^K that is neither a tautology nor a contradiction.

The elements of D^K are sufficiently similar so that any validation rule over a particular data set d ∈ D^K is also a validation rule over another data set d′ ∈ D^K, where the truth value of a validation rule depends on the chosen data set. This also allows us to interpret a tautology as a predicate that is True for every element of D^K and a contradiction as a predicate that is False for every element of D^K.

The equivalence between an assertion about a data set and a function classifying possible data sets as valid or invalid instances is a rather obvious conclusion. The actual value of the above exercise is the strict definition of a data point as a key-value pair. As will be shown below, inclusion of the key permits a useful classification of data validation rules.

To demonstrate that many types of validation rules can be expressed in this framework, a few examples based on the example of Equation (1) will be considered. The following rule states that all ages must be integer:

∀((u, "age"), x) ∈ d : x ∈ ℕ.

Here, ((u, "age"), x) runs over the person-value pairs where the value is supposed to represent an age. Similarly, it is possible to express a nonnegativity check on the age variable:

∀((u, "age"), x) ∈ d : x ≥ 0.

In these examples a data set fails a validation rule when not all elements satisfy a predicate. Such rules are not very informative when it comes to pinpointing what elements of the data set cause the violation. It is customary to perform data validation on a finer level, for example by checking nonnegativity element by element. Based on the definitions introduced here this is done by fixing the key completely:

∀((1, "age"), x) ∈ d : x ≥ 0,
∀((2, "age"), x) ∈ d : x ≥ 0.
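As an illustration, here is a minimal sketch of a validation function v : D^K → {False, True} in R, implementing the rule that every age must be a nonnegative integer. It operates on the key-value layout of the earlier sketch; the helper names are ad hoc.

# v maps a key-value data set to TRUE or FALSE. It implements
# 'for all ((u, "age"), x) in d : x is a nonnegative integer'.
is_nonneg_int <- function(x) {
  is.numeric(x) && length(x) == 1 && x == round(x) && x >= 0
}
v <- function(d) {
  ages <- d$value[d$variable == "age"]
  all(vapply(ages, is_nonneg_int, logical(1)))
}

v(d)  # FALSE: the "age" field of unit 2 contains "employed"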
Checks involving multiple variables of the same person are expressed by fixing the variable part of several keys. For example, the following rule states that an employed person must be at least 15 years of age:

∀((u, "age"), x), ((u, "job"), y) ∈ d : y = "employed" ⇒ x ≥ 15,

where ⇒ denotes logical implication (if y = "employed" then x ≥ 15 must hold). Validation rules may also aggregate over multiple data points. For example, the demand that the mean age is at least some lower bound a can be expressed as

( Σ_{((u,"age"),x) ∈ d} x ) / ( Σ_{((u,X),x) ∈ d} δ(X, "age") ) ≥ a,

where δ(X, Y) = 1 when X = Y and 0 otherwise.

In practical applications validation rules are often expressed more directly in terms of variable names, such as age ≥ 0. Such expressions are specializations where one silently assumes extra properties, such as that the rule will be evaluated for the entry for age in every record.
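In R, this style of rule definition is implemented in the validate package of Van der Loo and De Jonge (2019). The sketch below is indicative: the data is a toy record layout with one row per person, and the rule thresholds are the illustrative ones used above.

library(validate)  # CRAN package by Van der Loo and De Jonge

# One record per person; rules refer to variables by name and are
# evaluated for every record.
persons <- data.frame(
  age = c(23, -1),
  job = c("unemployed", "employed")
)

rules <- validator(
  age >= 0,                         # nonnegativity check
  if (job == "employed") age >= 15  # conditional, multi-variable rule
)
summary(confront(persons, rules))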
Remark 1.
In Definition 3 (and also Definition 4) a validation function is defined as a surjection D^K ↠ {False, True}. In practical applications it is often useful to also allow the value NA (not available) for cases where one or more of the data points necessary for evaluating the validation rule are missing. In that case the domain D in Equation (1) must be extended to D ∪ {NA}. See Van der Loo and De Jonge (2019) for an implementation.
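For example, with the validate package the sketch below yields NA for a record whose age is missing; whether NA should subsequently count as a pass or a fail can be decided as a separate policy.

# Continuing the earlier sketch: a missing age makes the rule undecidable,
# so the check returns NA rather than TRUE or FALSE.
persons_na <- data.frame(age = c(23, NA), job = c("unemployed", "employed"))
values(confront(persons_na, validator(age >= 0)))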
Remark 2.

The current formalization of data validation excludes checking for completeness of the key set: it is assumed by definition that each data set in D^K is a complete set of key-value pairs where the keys cover all of K. The above definitions therefore create a clean distinction between what is metadata (the keys and their interpretation) and data (key-value pairs). It is possible to check for uniqueness of variables.
Remark 3.

The formal definition of data validation rules also allows a formalization of data validation software tools or domain specific languages, such as in Van der Loo and De Jonge (2019).
Remark 4.
In official (government) statistics, validation rules are referred to as edit rules rather than data validation rules. See De Waal et al. (2011) and references therein.
Classification of validation rules
Validation rules defined by domain experts may be easy for humans to interpret, but can be complex to implement. Take as an example the assertion 'the average price of a certain type of goods this year should not differ more than ten percent from last year's average'. To evaluate this assertion one needs to combine prices of multiple goods collected over two time periods. A practical question is therefore whether the 'complexity of evaluation', in terms of the amount of information necessary, can somehow be measured. In this section a classification of data validation rules is derived that naturally leads to an ordering of validation rules in terms of the variety of information that is necessary for their evaluation.

One way to measure the amount of information is to look at the predicate defining a data validation rule and to determine how many different (k, x) pairs are needed to evaluate it. This is not very useful for comparing complexity across different data sets that are not from the same D^K, since the numbers will depend on the size of the key set K. One measure that does generalize to different D^K is the measure 'does a rule need one, or several, (k, x) pairs to be evaluated?'.

This measure is not very precise, but it can be refined when a key consists of multiple meaningful parts, such as in the running example where the key consists of the id of a person and the name of a variable. One can then classify a rule according to two questions: 'are multiple person ids necessary for evaluation?' and 'are multiple variables necessary for evaluation?'. This gives a four-way classification: one class where both questions are answered with 'no', two where one of the questions is answered with 'yes', and one where both questions are answered with 'yes'. Although this refinement improves the accuracy of the classification, it only allows for comparing validation rules over data sets with the exact same key structure. It would therefore be useful to have a generic key structure that can be reused in many situations. One such structure can be found by considering in great generality how a data point is obtained in practice.

A data point usually represents the value of an attribute of an object or event in the real world: a person, a web site, an e-mail, a radio broadcast, a country, a sensor, a company, or anything else that has observable attributes. In what follows, an object or event is referred to as a 'unit' for brevity. A data point is created by observing an attribute X from a unit u of a certain type U at time τ, as in Figure 1.

Figure 1: A unit u of type U exists from t_u to t′_u. From t_u onward it has an attribute X with its value x possibly changing over time. At some time τ this value is observed. The observed value is thus fully characterized by the quartet (U, τ, u, X).

Using the same reasoning as above, this yields a 2⁴ = 16-way classification of validation rules: for each element U, τ, u, X a validation rule requires a single or multiple instances to be evaluated. However, there are some restrictions. Any unit u can only be of one type, so evaluating a validation rule will never require multiple types and a single unit. Second, the type of a unit fixes its properties, so a validation rule will never need a single variable for multiple types. With these restrictions considered, the number of possible classes of validation rules reduces from sixteen to ten.

To distinguish the classes, the following notation is introduced. For each element U, τ, u, X, assign an s when a validation rule pertains to a single value of that element and an m when a validation rule pertains to multiple values of that element. For example, a validation rule of class sssm needs a single type, a single measurement, a single object, and multiple variables to be evaluated.

The ten possible classes can themselves be grouped into validation levels, according to whether a class is labeled with no, one, two, three, or four m's. A higher validation level indicates that we need a larger variety of information in order to evaluate the rule. The classification, and the grouping into validation levels, is summarized in Table 1.

Table 1: The ten possible classes of validation rules, grouped into 'validation levels'. A higher level indicates that a wider variety of information is necessary to evaluate a validation rule.

    Validation level
    0      1      2      3      4
    ssss   sssm   ssmm   smmm   mmmm
           ssms   smsm   msmm
           smss   smms

Going from level 0 to 4 corresponds to a workflow that is common in practice. One starts with simple tests such as range checks. These are of level zero since a range check can be performed on a single data point. That is, one only needs a value that corresponds to a single type, measurement, unit, and variable. Next, more involved checks are performed, for instance involving multiple variables (sssm, e.g. the ratio between two properties of the same unit must be in a given range), multiple units (ssms, e.g. the mean of a variable over multiple units must be within a specified range), or multiple measurements (smss, e.g. the current value of a property of a unit cannot differ too much from a past value of the same property of the same unit). Going up in level, even more complex rules are found, until rule evaluation involves multiple variables of multiple units of multiple types measured at multiple instances (mmmm).

This classification also has an immediate interpretation for data stored in a database that is normalized in the sense of Codd (1970). In such a database, records represent units, columns represent variables, and tables represent types. The 'time of measurement' is represented as a variable as well. The classification indicates whether a rule implementation needs to access multiple records, columns, or tables.
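To connect the classification to practice, the sketch below writes one rule of each of several classes in the syntax of the validate R package. The thresholds and reference data are illustrative; the reference-data mechanism (the ref argument of confront()) follows the package documentation. The unit type is fixed here (persons), so the first position is s throughout.

library(validate)

persons <- data.frame(age = c(23, 34), job = c("unemployed", "employed"))
persons_last_year <- data.frame(age = c(22, 33))  # same persons, earlier time

rules <- validator(
  ssss = age >= 0,                          # single unit, single variable
  sssm = if (job == "employed") age >= 15,  # multiple variables of one unit
  ssms = mean(age, na.rm = TRUE) >= 10,     # one variable over multiple units
  smss = age >= ref$age                     # same unit and variable, two times
)
confront(persons, rules, ref = list(ref = persons_last_year))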
Combining validation rules

Definition 4 implies that a validation rule is a predicate over a data set that is neither a tautology nor a contradiction. This means that combining two validation rules with ∧ or ∨ does not automatically yield a new validation rule. Consider for example the rules x ≥ 0 and x ≤ 1. Their disjunction x ≥ 0 ∨ x ≤ 1 is a tautology, and the conjunction x ≥ 1 ∧ x ≤ −1 is a contradiction. The fact that validation rules are not closed under conjunction (∧) or disjunction (∨) has practical consequences. After all, defining a set of validation rules amounts to conjoining them into a single rule, since a data set is valid only when all validation rules are satisfied. A set of rules may be such that their conjunction is a contradiction. Such a rule set is called infeasible. More subtle problems involve unintended consequences, including partial infeasibility (Bruni and Bianchi, 2012) and the introduction of fixed values or range restrictions. Other problems involve the introduction of several types of redundancies (Dillig et al., 2010; Paulraj and Sumathi, 2010), which make rule sets both harder to maintain and hamper solving problems such as error localization (Bruni, 2005; De Jonge and Van der Loo, 2019a). In the following, some examples of unintended effects and redundancies are discussed. The examples shown here are selected from a more extensive discussion in Van der Loo and De Jonge (2018, Chapter 8).
For simplicity of presentation the rules are expressed as simple clauses, neglecting the key-value pair representation. Partial inconsistency is (often) an unintended consequence implied by a pair of rules. For example, the rule set

gender = "male" ⇒ income > 0,
gender = "male" ⇒ income < 0,

is feasible, but it can only be satisfied when gender ≠ "male". Thus, the combination of rules (unintentionally) excludes an otherwise valid gender category.

A simple redundancy is introduced when one rule defines a subset of valid values with respect to another rule. For example, if both x ≥ 0 and x ≥ 1 are imposed, then x ≥ 0 is redundant since it is implied by x ≥ 1.
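Problems of this kind can be detected algorithmically. The sketch below uses the validatetools package (De Jonge and Van der Loo, 2019b); the function names follow the package documentation as the authors of this rewrite recall it, so treat the exact API as an assumption to be checked.

library(validate)
library(validatetools)

# Partially inconsistent rule set: feasible as a whole, but unsatisfiable
# for records with gender == "male".
partial <- validator(
  r1 = if (gender == "male") income > 0,
  r2 = if (gender == "male") income < 0
)
is_infeasible(partial)  # FALSE: satisfiable, but only when gender != "male"

# Simple redundancy: x >= 0 is implied by x >= 1.
redundant <- validator(r1 = x >= 0, r2 = x >= 1)
remove_redundancy(redundant)  # returns a rule set containing only r2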
More complex cases arise in sets with multiple rules. A more subtle redundancy, called a 'nonrelaxing clause', occurs in the following situation:

x ≥ 0 ⇒ y ≥ 0,
x ≥ 0.

Here, the second rule implies that the condition in the first rule must always be true. Hence the rule set can be simplified to

y ≥ 0,
x ≥ 0.
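The validatetools package also attempts this type of simplification; assuming its simplify_conditional() function behaves as documented, the nonrelaxing clause above would be rewritten along these lines.

library(validate)
library(validatetools)

nonrelaxing <- validator(
  r1 = if (x >= 0) y >= 0,
  r2 = x >= 0
)
# r2 forces the condition of r1 to hold, so the implication can be
# dropped; the simplified set is equivalent to { y >= 0, x >= 0 }.
simplify_conditional(nonrelaxing)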
Another subtle redundancy, called a 'nonconstraining clause', occurs in the following situation:

x > 0 ⇒ y > 0,
x < 1 ⇒ y > 1.

Letting x vary from −∞ to ∞, the rule set first implies that y > 1. As x becomes positive, the rule set implies that y > 0 until x reaches ∞. In other words, the rule set implies that y must be positive regardless of x, and the first rule can therefore be replaced with y > 0:

y > 0,
x < 1 ⇒ y > 1.

Conclusion

Data validation can be formalized equivalently in terms of certain predicates over a data set or as a surjective Boolean function over a well-defined set of observable data sets. It is possible to define a very general classification of validation rules, based on a generic decomposition of the metadata. Combining validation rules into a set can lead to subtle and unintended consequences that can be solved in some cases with algorithmic methods.
Related articles

• stat04233
• stat06088
• stat06776
• stat04018

References
Bruni, R. (2005). Error correction for massive datasets. Optimization Methods and Software 20(2-3), 297–316.

Bruni, R. and G. Bianchi (2012). A formal procedure for finding contradictions into a set of rules. Applied Mathematical Sciences 6(126), 6253–6271.

Chandru, V. and J. Hooker (1999). Optimization methods for logical inference, Volume 34 of Wiley Series in Discrete Mathematics and Optimization. John Wiley & Sons.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM 13(6), 377–387.

Daalmans, J. (2018). Constraint simplification for data editing of numerical variables. Journal of Official Statistics 34(1), 27–39.

De Jonge, E. and M. Van der Loo (2019a). Error Localization. John Wiley & Sons, Inc. To be published.

De Jonge, E. and M. Van der Loo (2019b). validatetools: Checking and Simplifying Validation Rule Sets. R package version 0.4.6.

De Waal, T., J. Pannekoek, and S. Scholtus (2011). Handbook of statistical data editing and imputation. Wiley Handbooks in Survey Methodology. John Wiley & Sons.

Dillig, I., T. Dillig, and A. Aiken (2010). Small formulas for large programs: On-line constraint simplification in scalable static analysis. In International Static Analysis Symposium, pp. 236–252. Springer.

Hooker, J. (2000). Logic-based methods for optimization: combining optimization and constraint satisfaction, Volume 2 of Wiley Series in Discrete Mathematics and Optimization. John Wiley & Sons.

Paulraj, S. and P. Sumathi (2010). A comparative study of redundant constraints identification methods in linear programming problems. Mathematical Problems in Engineering 2010.

Van der Loo, M. and E. De Jonge (2018). Statistical data cleaning with applications in R. New York: John Wiley & Sons, Inc.

Van der Loo, M. and E. De Jonge (2019). Data validation infrastructure for R. Journal of Statistical Software. To appear.