Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems
Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, Alexandra Meliou
Technical Report∗

Anna Fariha†‡, University of Massachusetts Amherst, MA, [email protected]
Ashish Tiwari‡, Arjun Radhakrishna, Sumit Gulwani, Microsoft, {astiwar,arradha,sumitg}@microsoft.com
Alexandra Meliou, University of Massachusetts Amherst, MA, [email protected]
ABSTRACT
The reliability of inferences made by data-driven systems hinges on the data's continued conformance to the systems' initial settings and assumptions. When serving data (on which we want to apply inference) deviates from the profile of the initial training data, the outcome of inference becomes unreliable. We introduce conformance constraints, a new data profiling primitive tailored towards quantifying the degree of non-conformance, which can effectively characterize whether inference over a tuple is untrustworthy. Conformance constraints are constraints over certain arithmetic expressions (called projections) involving the numerical attributes of a dataset, which existing data profiling primitives such as functional dependencies and denial constraints cannot model.

The key finding we present is that projections that incur low variance on a dataset construct effective conformance constraints. This principle yields the surprising result that low-variance components of a principal component analysis, which are usually discarded for dimensionality reduction, generate stronger conformance constraints than the high-variance components. Based on this result, we provide a highly scalable and efficient technique (linear in data size and cubic in the number of attributes) for discovering conformance constraints for a dataset. To measure the degree of a tuple's non-conformance with respect to a dataset, we propose a quantitative semantics that captures how much a tuple violates the conformance constraints of that dataset. We demonstrate the value of conformance constraints on two applications: trusted machine learning and data drift. We empirically show that conformance constraints offer mechanisms to (1) reliably detect tuples on which the inference of a machine-learned model should not be trusted, and (2) quantify data drift more accurately than the state of the art.
∗An earlier version of this paper had a different title: “Data Invariants: On Trust in Data-Driven Systems”.
†Work done while the author was an intern at Microsoft.
‡Both authors contributed equally to this research.

1 INTRODUCTION

Data is central to modern systems in a wide range of domains, including healthcare, transportation, and finance. The core of modern data-driven systems typically comprises models learned from large datasets, and they are usually optimized to target particular data and workloads. While these data-driven systems have seen wide adoption and success, their reliability and proper function hinge on the data's continued conformance to the systems' initial settings and assumptions. If the serving data (on which the system operates) deviates from the profile of the initial data (on which the system was trained), then system performance degrades and system behavior becomes unreliable. A mechanism to assess the trustworthiness of a system's inferences is paramount, especially for systems that perform safety-critical or high-impact operations.

A machine-learned (ML) model typically works best if the serving dataset follows the profile of the dataset the model was trained on; when it doesn't, the model's inference can be unreliable. One can profile a dataset in many ways, such as by modeling the data distribution of the dataset, or by finding the (implicit) constraints that the dataset satisfies. Distribution-oriented approaches learn data likelihood (e.g., joint or conditional distribution) from the training data, and can be used to check if the serving data is unlikely. However, an unlikely tuple does not necessarily imply that the model would fail for it. The problem with distribution-oriented approaches is that they tend to overfit, and thus, are overly conservative towards unseen tuples, leading them to report many such false positives.

We argue that certain constraints offer a more effective and robust mechanism to quantify trust in a model's inference on a serving tuple.
The reason is that learning systems implicitly exploit such constraints during model training, and build models that assume that the constraints will continue to hold for serving data. For example, when there exist high correlations among attributes in the training data, learning systems will likely reduce the weights assigned to redundant attributes that can be deduced from others, or eliminate them altogether through dimensionality reduction. If the serving data preserves the same correlations, such operations are inconsequential; otherwise, we may observe model failure.

In this paper, we characterize datasets with a new data-profiling primitive, conformance constraints, and we present a mechanism to identify strong conformance constraints, whose violation indicates unreliable inference. Conformance constraints specify constraints over arithmetic relationships involving multiple numerical attributes of a dataset. We argue that a tuple's conformance to the conformance constraints is more critical for accurate inference than its conformance to the training data distribution. This is because any violation of conformance constraints is likely to result in a catastrophic failure of a learned model that is built upon the assumption that the conformance constraints will always hold. Thus, we can use a tuple's deviation from the conformance constraints as a proxy for the trust in a learned model's inference for that tuple. We proceed to describe a real-world example of conformance constraints, drawn from our case-study evaluation on trusted machine learning (TML).

[Figure 1 here: sample of the airlines dataset with attributes Departure Date, Departure Time [DT], Arrival Time [AT], and Duration (min) [DUR], over tuples t1 (May 2), t2 (July 22), t3 (June 6), t4 (May 19), and t5 (April 7).]

Figure 1: Sample of the airlines dataset (details are in Section 6.1), showing departure, arrival, and duration only. The dataset does not report arrival date, but an arrival time earlier than departure time (e.g., last row) indicates an overnight flight. All times are in 24-hour format and in the same time zone. There is some noise in the values.

Example 1.
We used a dataset with flight information that includes data on departure and arrival times, flight duration, etc. (Fig. 1) to train a linear regression model to predict flight delays. The model was trained on a subset of the data that happened to include only daytime flights (such as the first four tuples). In an empirical evaluation of the regression accuracy, we found that the mean absolute error of the regression output more than quadruples for overnight flights (such as the last tuple t5), compared to daytime flights. The reason is that tuples representing overnight flights deviate from the profile of the training data that only contained daytime flights. Specifically, daytime flights satisfy the conformance constraint that “arrival time is later than departure time and their difference is very close to the flight duration”, which does not hold for overnight flights. Note that this constraint is just based on the covariates (predictors) and does not involve the target attribute delay. Critically, although this conformance constraint is unaware of the regression task, it was still a good proxy of the regressor's performance. In contrast, approaches that model data likelihood may report long daytime flights as unlikely, since all flights in the training data (t1–t4) were also short flights, resulting in false alarms, as the model works very well for most daytime flights, regardless of the duration (i.e., for both short and long daytime flights).

Example 1 demonstrates that when training data has coincidental relationships (e.g., the one between AT, DT, and DUR for daytime flights), then ML models may implicitly assume them as invariants. Conformance constraints can capture such data invariants and flag non-conforming tuples (overnight flights) during serving.
Conformance constraints.
Conformance constraints complement the existing data profiling literature, as the existing constraint models, such as functional dependencies and denial constraints, cannot model arithmetic relationships. For example, the conformance constraint of Example 1 is: −ε1 ≤ AT − DT − DUR ≤ ε2, where ε1 and ε2 are small values. Conformance constraints can capture complex linear dependencies across attributes within a noisy dataset. For example, if the flight departure and arrival data reported the hours and the minutes across separate attributes, the constraint would be on a different arithmetic expression: (60 · arrHour + arrMin) − (60 · depHour + depMin) − duration.

The core component of a conformance constraint is the arithmetic expression, called projection, which is obtained by a linear combination of the numerical attributes. There is an unbounded number of projections that we can use to form arbitrary conformance constraints. For example, for the projection AT, we can find a broad range [ε1, ε2], such that all training tuples in Example 1 satisfy the conformance constraint ε1 ≤ AT ≤ ε2. However, this constraint is too inclusive and a learned model is unlikely to exploit such a weak constraint. In contrast, the projection AT − DT − DUR leads to a stronger conformance constraint with a narrow range as its bounds, which is selectively permissible, and thus, more effective.
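For illustration, the split-attribute form of the constraint can be checked directly. This is our own toy sketch; the ε bounds below are illustrative placeholders, not values from the paper:

```python
def conforms(dep_hour, dep_min, arr_hour, arr_min, duration,
             eps1=-5.0, eps2=5.0):
    """Check eps1 <= (60*arrHour + arrMin) - (60*depHour + depMin)
    - duration <= eps2, the split-attribute conformance constraint,
    with illustrative epsilon bounds."""
    proj = (60 * arr_hour + arr_min) - (60 * dep_hour + dep_min) - duration
    return eps1 <= proj <= eps2

print(conforms(9, 0, 11, 5, 125))    # daytime flight, projection = 0
print(conforms(21, 30, 1, 10, 220))  # overnight flight, projection = -1440
```

The first call evaluates the projection to 0 (inside the ε-zone); the second, an overnight flight whose arrival clock time precedes departure, falls far outside and is flagged as non-conforming.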
Challenges and solution sketch.
The principal challenge is to discover an effective set of conformance constraints that are likely to affect a model's inference implicitly. We first characterize “good” projections (that construct effective constraints) and then propose a method to discover them. We establish through theoretical analysis two important results: (1) A projection is good over a dataset if it is almost constant (i.e., has low variance) for all tuples in that dataset. (2) A set of projections, collectively, is good if the projections have small pair-wise correlations. We show that low-variance components of a principal component analysis (PCA) on a dataset yield such a set of projections. Note that this is different from, and in fact completely opposite to, the traditional approaches (e.g., [63]) that perform multidimensional analysis based on the high-variance principal components, after reducing dimensionality using PCA.
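As a concrete illustration of this principle, the following sketch (our hypothetical numpy illustration, not the authors' implementation) recovers a low-variance projection from synthetic data shaped like Fig. 1, where AT ≈ DT + DUR up to noise:

```python
import numpy as np

def low_variance_projections(X, k=1):
    """Return the k lowest-variance principal directions of X, plus the
    mean and standard deviation of each resulting projection; these are
    candidates for strong conformance constraints."""
    cov = np.cov(X, rowvar=False)            # covariance over attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # O(m^3); eigenvalues ascending
    W = eigvecs[:, :k]                       # lowest-variance directions
    proj = X @ W
    return W, proj.mean(axis=0), proj.std(axis=0)

# Synthetic data mimicking Fig. 1: AT ~ DT + DUR, plus noise.
rng = np.random.default_rng(0)
DT = rng.uniform(360, 1200, 1000)            # departure time, minutes
DUR = rng.uniform(60, 300, 1000)             # duration, minutes
AT = DT + DUR + rng.normal(0, 2, 1000)       # arrival time, noisy
X = np.column_stack([DT, AT, DUR])

W, mu, sigma = low_variance_projections(X, k=1)
coeffs = W[:, 0] / W[1, 0]   # rescale so AT's coefficient is 1
# The recovered direction is approximately <-1, 1, -1> over <DT, AT, DUR>,
# i.e., the projection AT - DT - DUR; its standard deviation is small
# (on the order of the injected noise), yielding a tight constraint.
print(np.round(coeffs, 2), float(sigma[0]))
```

The high-variance components (the ones dimensionality reduction would keep) only describe where the data spreads; it is the near-constant direction that encodes the arithmetic relationship a model may silently rely on.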
Scope.
Fig. 2 summarizes prior work on related problems, but the scope of our setting differs significantly. Specifically, we can detect if a serving tuple is non-conforming with respect to the training dataset only based on its predictor attributes, and require no knowledge of the ground truth. This setting is essential in many practical applications where we observe extreme verification latency [74], i.e., ground truths for serving tuples are not immediately available. For example, consider a self-driving car that is using a trained controller to generate actions based on readings of velocity, relative positions of obstacles, and their velocities. In this case, we need to determine, only based on the sensor readings (predictors), when the driver should be alerted to take over vehicle control, as we cannot use ground truths to generate an alert.

Furthermore, we do not assume access to the model, i.e., the model's predictions on a given tuple. This setting is necessary for (1) safety-critical applications, where the goal is to quickly alert the user, without waiting for the availability of the prediction, (2) auditing and privacy-preserving applications where the prediction cannot be shared, and (3) cases where we are unaware of the detailed functionality of the system due to privacy concerns or lack of jurisdiction, but only have some meta-information, such as that the system trains some linear model over the training data.

We focus on identifying tuple-level non-conformance as opposed to dataset-level non-conformance, which usually requires observing the entire data's distribution. However, our tuple-level approach trivially extends (by aggregation) to the entire dataset.
Contrast with prior art.
We now discuss where conformance constraints fit with respect to the existing literature (Fig. 2) on data profiling and on modeling trust in data-driven inferences.
Data profiling techniques.
Conformance constraints fall under the umbrella of data profiling, which refers to the task of extracting technical metadata about a given dataset [5]. A key task in data profiling is to learn relationships among attributes. Functional dependencies (FD) [59] and their variants only capture if a relationship exists between two sets of attributes, but do not provide a closed-form (parametric) expression of the relationship.

[Figure 2 here: a matrix comparing constraint models against the properties they support. Legend: HP: hyper parameter; FD: functional dependency; DC: denial constraint; ★: supports via extension; !: partially; ⊥: not applicable. Compared dimensions: parametric, arithmetic, approximate, conditional, notion of weight, interpretable, continuous, tuple-wise, noisy data, numerical attributes, categorical attributes, no thresholds, no distance metric, no HP tuning, scalable, task-agnostic, and no access to model. Data-profiling rows: conformance constraints; FD [59]; approximate FD [50]; metric FD [48]; conditional FD [23]; pattern FD [62]; soft FD [38]; relaxed FD [16]; FDX [93]; differential dependency [72]; DC [13, 17]; approximate DC [53, 61]; statistical constraint [91]. Learning rows: ordinary least square; total least square; auto-encoder [20]; Schelter et al. [68] (requires additional information); Jiang et al. [41]; Hendrycks et al. [31]; model's prediction probability.]

Figure 2: Conformance constraints complement existing data profiling primitives and provide an efficient mechanism to quantify trust in prediction, with minimal assumptions about the setting.

Using the FD {AT, DT} → {DUR} to model the constraint of Example 1 suffers from several limitations. First, since the data is noisy, no exact FD can be learned. Metric FDs [48] allow small variations in the data (similar attribute values are considered identical), but hinge on appropriate distance metrics and thresholds. For example, if time is split across two attributes (hour and minute), the distance metric is non-trivial: it needs to encode that two values a couple of minutes apart across an hour boundary (e.g., ⟨hour = 4, min = 59⟩ and ⟨hour = 5, min = 1⟩) are similar, while two values within the same hour but many minutes apart (e.g., ⟨hour = 4, min = 1⟩ and ⟨hour = 4, min = 59⟩) are not. In contrast, conformance constraints can model the composite attribute (60 · hour + minute) by automatically discovering the coefficients 60 and 1 for such a composite attribute.

Denial constraints (DC) [13, 17, 53, 61] encapsulate a number of different data-profiling primitives such as FDs and their variants (e.g., [23]). Exact DCs can adjust to noisy data by adding predicates until the constraint becomes exact over the entire dataset, but this can make the constraint extremely large and complex, which might even fail to provide the desired generalization. For example, a finite DC, whose language is limited to universally-quantified first-order logic, cannot model the constraint of Example 1, which involves an arithmetic expression (addition and multiplication with a constant). Expressing conformance constraints requires a richer language that includes linear arithmetic expressions. Pattern functional dependencies (PFD) [62] move towards addressing this limitation of DCs, but they focus on text attributes: they are regex-based and treat digits as characters. However, modeling arithmetic relationships of numerical attributes requires interpreting digits as numbers.

To adjust for noise, FDs and DCs either relax the notion of constraint violation or allow a user-defined fraction of tuples to violate the (strict) constraint [16, 36, 38, 48, 50, 53, 61].
Some approaches [38, 91, 93] use statistical techniques to model other types of data profiles, such as correlations and conditional dependencies. However, they require additional parameters such as noise and violation thresholds and distance metrics. In contrast, conformance constraints do not require any parameter from the user and work on noisy datasets.

Existing data profiling techniques are not designed to learn what ML models exploit, and they are sensitive to noise in the numerical attributes. Moreover, data constraint discovery algorithms typically search over an exponential set of candidates, and hence, are not scalable: their complexity grows exponentially with the number of attributes or quadratically with data size. In contrast, our technique for deriving conformance constraints is highly scalable (linear in data size) and efficient (cubic in the number of attributes). It does not explicitly explore the candidate space, as PCA, which lies at the core of our technique, performs the search implicitly by iteratively refining weaker constraints to stronger ones.

Learning techniques.
While ordinary least square finds the lowest-variance projection, it minimizes observational error on only the target attribute, and thus, does not apply to our setting.
Total least square offers a partial solution to our problem as it takes observational errors on all predictor attributes into account. However, it finds only a single projection (the lowest-variance one) that best fits the data tuples. But there may exist other projections with slightly higher variances, and we consider them all. As we show empirically in Section 6.2, constraints derived from multiple projections, collectively, capture various aspects of the data, and result in an effective data profile targeted towards certain tasks such as data-drift quantification. (More discussion is in the Appendix.)
Contributions.
We make the following contributions:
• We ground the motivation of our work with two case studies on trusted machine learning (TML) and data drift. (Section 2)
• We introduce and formalize conformance constraints, a new data profiling primitive that specifies constraints over arithmetic relationships among numerical attributes of a dataset. We describe a conformance language to express conformance constraints, and a quantitative semantics to quantify how much a tuple violates the conformance constraints. In applications of constraint violations, some violations may be more or less critical than others. To capture that, we consider a notion of constraint importance, and weigh violations against constraints accordingly. (Section 3)
• We formally establish that strong conformance constraints are constructed from projections with small variance and small mutual correlation on the given dataset. Beyond simple linear constraints (e.g., the one in Example 1), we derive disjunctive constraints, which are disjunctions of linear constraints. We achieve this by dividing the dataset into disjoint partitions, and learning linear constraints for each partition. We provide an efficient, scalable, and highly parallelizable algorithm for computing a set of linear conformance constraints and disjunctions over them. We also analyze its runtime and memory complexity. (Section 4)
• We formalize the notion of unsafe tuples in the context of trusted machine learning and provide a mechanism to detect unsafe tuples using conformance constraints. (Section 5)
• We empirically analyze the effectiveness of conformance constraints in our two case-study applications, TML and data-drift quantification. We show that conformance constraints can reliably predict the trustworthiness of linear models and quantify data drift precisely, outperforming the state of the art. (Section 6)
Like other data-profiling primitives, conformance constraints have general applicability in understanding and describing datasets. But their true power lies in quantifying the degree of a tuple's non-conformance with respect to a reference dataset. Within the scope of this paper, we focus on two case studies in particular to motivate our work: trusted machine learning and data drift. We provide an extensive evaluation over these applications in Section 6.
Trusted machine learning (TML) refers to the problem of quantifying trust in the inference made by a machine-learned model on a new serving tuple [41, 64, 67, 80, 86]. This is particularly useful in case of extreme verification latency [74], where ground-truth outputs for new serving tuples are not immediately available to evaluate the performance of a learned model, when auditing models for trustworthiness, and in privacy-preserving applications where even the model's predictions cannot be shared. When a model is trained using a dataset, the conformance constraints for that dataset specify a safety envelope [80] that characterizes the tuples for which the model is expected to make trustworthy predictions. If a serving tuple falls outside the safety envelope (violates the conformance constraints), then the model is likely to produce an untrustworthy inference. Intuitively, the higher the violation, the lower the trust. Some classifiers produce a confidence measure along with the class prediction, typically by applying a softmax function to the raw numeric prediction values. However, such a confidence measure is not well-calibrated [28, 41], and therefore, cannot be reliably used as a measure of trust in the prediction. Additionally, a similar mechanism is not available for inferences made by regression models. In the context of TML, we formalize the notion of unsafe tuples, on which the prediction may be untrustworthy. We establish that conformance constraints provide a sound and complete procedure for detecting unsafe tuples, which indicates that the search for conformance constraints should be guided by the class of models considered by the corresponding learning system (Section 5).
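To make the safety-envelope idea concrete, here is a toy sketch (our illustration with made-up flight records, not the paper's data) that flags an overnight flight as unsafe under the projection AT − DT − DUR learned from daytime-only training data:

```python
import numpy as np

# Training tuples (daytime flights only): columns <DT, AT, DUR>, in minutes.
D = np.array([
    [540.0, 665.0, 124.0],    # 9:00 -> 11:05
    [600.0, 732.0, 130.0],
    [420.0, 541.0, 122.0],
    [780.0, 903.0, 121.0],
])
coeffs = np.array([-1.0, 1.0, -1.0])      # projection F = AT - DT - DUR

FD = D @ coeffs
mu, sigma, C = FD.mean(), FD.std(), 4.0
lb, ub = mu - C * sigma, mu + C * sigma   # the safety envelope for F

def unsafe(t):
    """A serving tuple whose projection leaves the envelope is flagged:
    the model's inference on it should not be trusted."""
    f = float(coeffs @ t)
    return bool(not (lb <= f <= ub))

daytime = np.array([530.0, 660.0, 128.0])     # F = 2, inside the envelope
overnight = np.array([1290.0, 70.0, 220.0])   # 21:30 -> 1:10 next day
print(unsafe(daytime), unsafe(overnight))     # False True
```

Note that the check needs neither the model nor a ground-truth delay: it looks only at the predictor attributes, matching the setting described above.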
Data drift [10, 27, 51, 63] specifies a significant change in a dataset with respect to a reference dataset, which typically requires that systems be updated and models retrained. Aggregating tuple-level non-conformances over a dataset gives us a dataset-level non-conformance, which is an effective measurement of data drift. To quantify how much a dataset D′ drifted from a reference dataset D, our three-step approach is: (1) compute conformance constraints for D, (2) evaluate the constraints on all tuples in D′ and compute their violations (degrees of non-conformance), and (3) finally, aggregate the tuple-level violations to get a dataset-level violation. If all tuples in D′ satisfy the constraints, then we have no evidence of drift. Otherwise, the aggregated violation serves as the drift quantity.

While we focus on these two applications here, we mention other applications of conformance constraints in the Appendix.

In this section, we define conformance constraints, which allow us to capture complex arithmetic dependencies involving numerical attributes of a dataset. Then we propose a language for representing them. Finally, we define quantitative semantics over conformance constraints, which allows us to quantify their violation.
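The three-step drift-quantification recipe described above can be sketched end to end. This is a simplified, hypothetical illustration: each raw attribute stands in for a projection, with mean ± 4·std bounds, in place of the paper's PCA-derived projections:

```python
import numpy as np

def learn_bounds(D, C=4.0):
    """(1) Learn mean +/- C*std bounds per projection on the reference
    dataset D (here each attribute itself serves as a projection)."""
    mu, sigma = D.mean(axis=0), D.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1e-9)
    return mu - C * sigma, mu + C * sigma, sigma

def tuple_violation(t, lb, ub, sigma):
    """(2) Degree of non-conformance of one tuple, averaged over
    projections: eta(alpha * max(0, F(t)-ub, lb-F(t))), alpha = 1/sigma,
    eta(z) = 1 - exp(-z)."""
    dev = np.maximum(0.0, np.maximum(t - ub, lb - t))
    return float(np.mean(1.0 - np.exp(-dev / sigma)))

def drift(D_ref, D_new):
    """(3) Aggregate tuple-level violations of D_new against D_ref."""
    lb, ub, sigma = learn_bounds(D_ref)
    return float(np.mean([tuple_violation(t, lb, ub, sigma) for t in D_new]))

rng = np.random.default_rng(1)
D = rng.normal(0, 1, (500, 3))        # reference dataset
same = rng.normal(0, 1, (500, 3))     # drawn from the same distribution
shifted = rng.normal(6, 1, (500, 3))  # drifted distribution
print(round(drift(D, same), 3), round(drift(D, shifted), 3))
```

Data drawn from the reference distribution yields a drift score near zero, while the shifted sample produces a clearly larger aggregated violation.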
Basic notations.
We use R(A1, A2, . . . , Am) to denote a relation schema, where Ai denotes the i-th attribute of R. We use Domi to denote the domain of attribute Ai. Then the set Dom^m = Dom1 × · · · × Domm specifies the domain of all possible tuples. We use t ∈ Dom^m to denote a tuple in the schema R. A dataset D ⊆ Dom^m is a specific instance of the schema R. For ease of notation, we assume some order of tuples in D, and we use ti ∈ D to refer to the i-th tuple and ti.Aj ∈ Domj to denote the value of the j-th attribute of ti.

Conformance constraint.
A conformance constraint Φ characterizes a set of allowable or conforming tuples and is expressed through a conformance language (Section 3.1). We write Φ(t) and ¬Φ(t) to denote that t satisfies and violates Φ, respectively.

Definition 2 (Conformance constraint). A conformance constraint for a dataset D ⊆ Dom^m is a formula Φ : Dom^m ↦→ {True, False} such that |{t ∈ D | ¬Φ(t)}| ≪ |D|.

The set {t ∈ D | ¬Φ(t)} denotes atypical tuples in D that do not satisfy the conformance constraint Φ. In our work, we do not need to know the set of atypical tuples, nor do we need to purge the atypical tuples from the dataset. Our techniques derive constraints in ways that ensure there are very few atypical tuples (Section 4).

Projection.
A central concept in our conformance language is the projection. Intuitively, a projection is a derived attribute that specifies a “lens” through which we look at the tuples. More formally, a projection is a function F : Dom^m ↦→ R that maps a tuple t ∈ Dom^m to a real number F(t) ∈ R. In our language for conformance constraints, we only consider projections that correspond to linear combinations of the numerical attributes of a dataset. Specifically, to define a projection, we need a set of numerical coefficients for all attributes of the dataset, and the projection is defined as a sum over the attributes, weighted by their corresponding coefficients. We extend a projection F to a dataset D by defining F(D) to be the sequence of reals obtained by applying F on each tuple in D individually.

Grammar.
Our language for conformance constraints consists of formulas Φ generated by the following grammar:

φ := lb ≤ F(A⃗) ≤ ub | ∧(φ, . . . , φ)
ψ_A := ∨((A = c1) ▷ φ, (A = c2) ▷ φ, . . .)
Ψ := ψ_A | ∧(ψ_A1, ψ_A2, . . .)
Φ := φ | Ψ

The language consists of (1) bounded constraints lb ≤ F(A⃗) ≤ ub, where F is a projection on Dom^m, A⃗ is the tuple of formal parameters (A1, A2, . . . , Am), and lb, ub ∈ R are reals; (2) equality constraints A = c, where A is an attribute and c is a constant in A's domain; and (3) operators (▷, ∧, and ∨) that connect the constraints. Intuitively, ▷ is a switch operator that specifies which constraint φ applies based on the value of the attribute A, ∧ denotes conjunction, and ∨ denotes disjunction. Formulas generated by φ and Ψ are called simple constraints and compound constraints, respectively. Note that a formula generated by ψ_A only allows equality constraints on a single attribute, namely A, among all the disjuncts.

Example 3. Consider the dataset D consisting of the first four tuples {t1, t2, t3, t4} of Fig. 1. A simple constraint for D is:

φ1 : lb1 ≤ AT − DT − DUR ≤ ub1

Here, the projection is F(A⃗) = AT − DT − DUR, with attribute coefficients ⟨1, −1, −1⟩ for ⟨AT, DT, DUR⟩, and lb1 and ub1 are small constants around zero, learned from the data. A compound constraint is:

ψ1 : M = “May” ▷ lb2 ≤ AT − DT − DUR ≤ ub2
   ∨ M = “June” ▷ lb3 ≤ AT − DT − DUR ≤ ub3
   ∨ M = “July” ▷ lb4 ≤ AT − DT − DUR ≤ ub4

For ease of exposition, we assume that all times are converted to minutes (e.g., a time value h:m is converted to 60 · h + m) and that M denotes the departure month, extracted from Departure Date.

Note that arithmetic expressions that specify a linear combination of numerical attributes (such as AT − DT − DUR above) are disallowed in denial constraints [17], which only allow raw attributes and constants (more details are in the Appendix).
Conformance constraints have a natural Boolean semantics: a tuple either satisfies a constraint or it does not. However, Boolean semantics is of limited use in practice, because it does not quantify the degree of constraint violation. We interpret conformance constraints using a quantitative semantics, which quantifies violations, and reacts to noise more gracefully than Boolean semantics. The quantitative semantics [[Φ]](t) is a measure of the violation of Φ on a tuple t, with a value of 0 indicating no violation and a value greater than 0 indicating some violation. In Boolean semantics, if Φ(t) is True, then [[Φ]](t) will be 0; and if Φ(t) is False, then [[Φ]](t) will be 1. Formally, [[Φ]] is a mapping from Dom^m to [0, 1].

Quantitative semantics of simple constraints.
We build upon the ε-insensitive loss [85] to define the quantitative semantics of simple constraints, where the bounds lb and ub define the ε-insensitive zone:

[[lb ≤ F(A⃗) ≤ ub]](t) := η(α · max(0, F(t) − ub, lb − F(t)))
[[∧(φ1, . . . , φK)]](t) := Σ_{k=1}^{K} γk · [[φk]](t)

Below, we describe the parameters of the quantitative semantics, and provide further details on them in the Appendix.
Scaling factor α ∈ R+. Projections are unconstrained functions, and different projections can map the same tuple to vastly different values. We use a scaling factor α to standardize the values computed by a projection F, and to bring the values of different projections to the same comparable scale. The scaling factor is automatically computed as the inverse of the standard deviation: α = 1/σ(F(D)). We set α to a large positive number when σ(F(D)) = 0.

Normalization function η(·) : R≥0 ↦→ [0, 1). The normalization function maps values in the range [0, ∞) to the range [0, 1). While any monotone mapping from R≥0 to [0, 1) can be used, we pick η(z) = 1 − e^(−z).

Importance factors γk ∈ R+, with Σ_{k=1}^{K} γk = 1. The weights γk control the contribution of each bounded-projection constraint in a conjunctive formula. This allows for prioritizing constraints that are more significant than others within the context of a particular application. In our work, we derive the importance factor of a constraint automatically, based on its projection's standard deviation over D.

(For a target value y, predicted value ŷ, and a parameter ε, the ε-insensitive loss is 0 if |y − ŷ| < ε and |y − ŷ| − ε otherwise.)

Quantitative semantics of compound constraints.
Compound constraints are first simplified into simple constraints, and they get their meaning from the simplified form. We define a function simp(ψ, t) that takes a compound constraint ψ and a tuple t and returns a simple constraint. It is defined recursively as follows:

simp(∨((A = c1) ▷ φ1, (A = c2) ▷ φ2, . . .), t) := φk if t.A = ck
simp(∧(ψ_A1, ψ_A2, . . .), t) := ∧(simp(ψ_A1, t), simp(ψ_A2, t), . . .)

If the condition in the definition above does not hold for any ck, then simp(ψ, t) is undefined, and simp(∧(. . . , ψ, . . .), t) is also undefined. If simp(ψ, t) is undefined, then [[ψ]](t) := 1. When simp(ψ, t) is defined, the quantitative semantics of ψ is given by:

[[ψ]](t) := [[simp(ψ, t)]](t)

Since compound constraints simplify to simple constraints, we mostly focus on simple constraints. Even there, we pay special attention to bounded-projection constraints (φ) of the form lb ≤ F(A⃗) ≤ ub, which lie at the core of simple constraints.

Example 4. Consider the constraint φ1 from Example 3. For every t ∈ D, [[φ1]](t) = 0, since φ1 is satisfied by all tuples in D; moreover, the standard deviation of the projection F over D, σ(F(D)), is small. Now consider the last tuple t5 ∉ D: F(t5) falls far below the lower bound lb1 of φ1. Computing how much t5 violates φ1, we get [[φ1]](t5) = [[lb1 ≤ F(A⃗) ≤ ub1]](t5) = η(α · max(0, F(t5) − ub1, lb1 − F(t5))), which evaluates to a value close to 1. Intuitively, this implies that t5 strongly violates φ1.

In this section, we describe our techniques for deriving conformance constraints. We start with the synthesis of simple constraints (the φ constraints in our language specification), followed by compound constraints (the Ψ constraints in our language specification). Finally, we analyze the time and memory complexity of our algorithm.

Synthesizing simple conformance constraints involves (a) discovering the projections, and (b) discovering the lower and upper bounds for each projection. We start by discussing (b), followed by the principle to identify effective projections, based on which we solve (a).
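The simplification and semantics of compound constraints can be sketched as a lookup followed by the simple-constraint violation formula. This is an illustrative encoding of ours (the bounds are made-up placeholders; the paper does not prescribe a data structure):

```python
import math

# Compound constraint psi over switch attribute M (departure month):
# each disjunct (M = c_k) |> phi_k maps a month to bounds (lb, ub)
# for the projection F = AT - DT - DUR. Bounds are made-up placeholders.
psi = {"May": (-3.0, 3.0), "June": (-2.0, 4.0), "July": (-4.0, 2.0)}

def simp(psi, t):
    """simp(psi, t): pick the simple constraint whose equality guard
    matches t's value of M; None means simp is undefined for t."""
    return psi.get(t["M"])

def violation(psi, t, F, alpha=1.0):
    """[[psi]](t): 1 when simp is undefined; otherwise the simple
    constraint's semantics eta(alpha * max(0, F(t)-ub, lb-F(t)))
    with eta(z) = 1 - exp(-z)."""
    c = simp(psi, t)
    if c is None:
        return 1.0
    lb, ub = c
    return 1.0 - math.exp(-alpha * max(0.0, F(t) - ub, lb - F(t)))

F = lambda t: t["AT"] - t["DT"] - t["DUR"]   # projection from Example 1
t1 = {"M": "May", "DT": 540, "AT": 665, "DUR": 125}    # F(t1) = 0
t2 = {"M": "March", "DT": 540, "AT": 665, "DUR": 125}  # no disjunct matches
print(violation(psi, t1, F), violation(psi, t2, F))    # 0.0 1.0
```

The tuple with a matching disjunct and an in-bounds projection incurs violation 0, while the tuple whose switch value matches no disjunct receives the maximal violation 1, mirroring the undefined case of simp.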
Fix a projection 𝐹 and consider the bounded-projection constraint 𝜙: lb ≤ 𝐹(𝐴⃗) ≤ ub. Given a dataset 𝐷, a trivial choice for bounds that are valid on all tuples in 𝐷 is lb = min(𝐹(𝐷)) and ub = max(𝐹(𝐷)). However, this choice is very sensitive to noise: adding a single atypical tuple to 𝐷 can produce very different constraints. Instead, we use a more robust choice:

lb = 𝜇(𝐹(𝐷)) − 𝐶 · 𝜎(𝐹(𝐷)),  ub = 𝜇(𝐹(𝐷)) + 𝐶 · 𝜎(𝐹(𝐷))

Here, 𝜇(𝐹(𝐷)) and 𝜎(𝐹(𝐷)) denote the mean and standard deviation of the values in 𝐹(𝐷), respectively, and 𝐶 is a positive constant. With these bounds, [[𝜙]](𝑡) = 0 iff 𝐹(𝑡) is within 𝐶 · 𝜎(𝐹(𝐷)) of the mean 𝜇(𝐹(𝐷)). In our experiments, we set 𝐶 = 4, which ensures that, in expectation, very few tuples in 𝐷 violate the constraint for many distributions of the values in 𝐹(𝐷). Specifically, if 𝐹(𝐷) follows a normal distribution, then 99.99% of the population is expected to lie within 4 standard deviations of the mean. Note that we make no assumption on the original data distribution of each attribute.

Setting the bounds lb and ub to be 𝐶 · 𝜎(𝐹(𝐷)) away from the mean, and the scaling factor 𝛼 to 1/𝜎(𝐹(𝐷)), guarantees the following property for our quantitative semantics:

Lemma 5. Let 𝐷 be a dataset and let 𝜙_𝑘 be lb_𝑘 ≤ 𝐹_𝑘(𝐴⃗) ≤ ub_𝑘 for 𝑘 = 1, 2. Then, for any tuple 𝑡, if |𝐹_1(𝑡) − 𝜇(𝐹_1(𝐷))| / 𝜎(𝐹_1(𝐷)) ≥ |𝐹_2(𝑡) − 𝜇(𝐹_2(𝐷))| / 𝜎(𝐹_2(𝐷)), then [[𝜙_1]](𝑡) ≥ [[𝜙_2]](𝑡).

This means that larger deviation from the mean (proportionally to the standard deviation) results in a higher degree of violation under our semantics. The proof follows from the fact that the normalization function 𝜂(·) is monotonically increasing, and hence [[𝜙_𝑘]](𝑡) is a monotonically non-decreasing function of |𝐹_𝑘(𝑡) − 𝜇(𝐹_𝑘(𝐷))| / 𝜎(𝐹_𝑘(𝐷)).

We start by investigating what makes one constraint more effective than another. An effective constraint (1) should not overfit the data, but rather generalize by capturing the properties of the data, and (2) should not underfit the data, because then it would be too permissive and fail to identify deviations effectively. Our flexible bounds (Section 4.1.1) serve to avoid overfitting. In this section, we focus on identifying the principles that help us avoid underfitting. We first illustrate the key technical ideas for characterizing effective projections through examples, and then proceed to formalization.

Example 6.
Let 𝐷 be a dataset of three tuples {(1, 1.1), (2, 1.7), (3, 3.2)} over two attributes 𝑋 and 𝑌. Consider two trivial projections: 𝑋 and 𝑌. For 𝑋: 𝜇(𝑋(𝐷)) = 2 and 𝜎(𝑋(𝐷)) ≈ 0.82. So the bounds for its conformance constraint are lb = 𝜇 − 4𝜎 ≈ −1.27 and ub = 𝜇 + 4𝜎 ≈ 5.27, giving the conformance constraint −1.27 ≤ 𝑋 ≤ 5.27. Similarly, for 𝑌, we get the conformance constraint −1.53 ≤ 𝑌 ≤ 5.53. Fig. 3(a) shows the conformance zone (clear region) defined by these two conformance constraints; the shaded region depicts the non-conformance zone. The conformance zone is large and too permissive: it admits many tuples that are atypical with respect to 𝐷.

A natural question arises: are there other projections that can better characterize conformance with respect to the tuples in 𝐷? The answer is yes, and next we show another pair of projections that shrinks the conformance zone significantly.

Example 7. In Fig. 3(b), the clear region is defined by the conformance constraints −0.86 ≤ 𝑋 − 𝑌 ≤ 0.86 and −2.75 ≤ 𝑋 + 𝑌 ≤ 10.75, over the projections 𝑋 − 𝑌 and 𝑋 + 𝑌, respectively. This region is indeed much smaller than the one in Fig. 3(a) and admits far fewer atypical tuples.

How can we derive the projection 𝑋 − 𝑌 from the projections 𝑋 and 𝑌, given 𝐷? Note that 𝑋 and 𝑌 are highly correlated in 𝐷. In Lemma 11, we show that two highly-correlated projections can be linearly combined to construct another projection with lower standard deviation, which generates a stronger constraint. We proceed to formalize stronger constraints—which define whether one constraint is more effective than another in quantifying violation—and incongruous tuples—which help us estimate the subset of the data domain on which a constraint is stronger than others.

Figure 3: Clear and shaded regions depict conformance and non-conformance zones, respectively. (a) Correlated projections 𝑋 and 𝑌 yield conformance constraints forming a large conformance zone. (b) Uncorrelated (orthogonal) projections 𝑋 − 𝑌 and 𝑋 + 𝑌 yield conformance constraints forming a smaller conformance zone.

Definition 8 (Stronger constraint).
A conformance constraint 𝜙_1 is stronger than another conformance constraint 𝜙_2 on a subset 𝐻 ⊆ Dom^𝑚 if ∀𝑡 ∈ 𝐻, [[𝜙_1]](𝑡) ≥ [[𝜙_2]](𝑡).

Given a dataset 𝐷 ⊆ Dom^𝑚 and a projection 𝐹, for any tuple 𝑡, let Δ𝐹(𝑡) = 𝐹(𝑡) − 𝜇(𝐹(𝐷)). For projections 𝐹_1 and 𝐹_2, the correlation coefficient 𝜌_{𝐹_1,𝐹_2} (over 𝐷) is defined as (1/|𝐷|) Σ_{𝑡∈𝐷} Δ𝐹_1(𝑡) Δ𝐹_2(𝑡) / (𝜎(𝐹_1(𝐷)) 𝜎(𝐹_2(𝐷))).

Definition 9 (Incongruous tuple). A tuple 𝑡 is incongruous w.r.t. a projection pair ⟨𝐹_1, 𝐹_2⟩ on 𝐷 if Δ𝐹_1(𝑡) · Δ𝐹_2(𝑡) · 𝜌_{𝐹_1,𝐹_2} < 0.

Informally, an incongruous tuple for a pair of projections does not follow the general trend of correlation between the projection pair. For example, if 𝐹_1 and 𝐹_2 are positively correlated (𝜌_{𝐹_1,𝐹_2} > 0), then 𝑡 is incongruous if it deviates in opposite directions from the means of the two projections (Δ𝐹_1(𝑡) · Δ𝐹_2(𝑡) < 0).

In Example 6, 𝑋 and 𝑌 are positively correlated, with 𝜌_{𝑋,𝑌} ≈ 0.97. Any tuple 𝑡 with 𝑋(𝑡) < 𝜇(𝑋(𝐷)) = 2 but 𝑌(𝑡) > 𝜇(𝑌(𝐷)) = 2 is incongruous w.r.t. ⟨𝑋, 𝑌⟩. Intuitively, incongruous tuples do not behave like the tuples in 𝐷 when viewed through the projections 𝑋 and 𝑌. Note that the narrow conformance zone of Fig. 3(b) excludes a vast majority of the incongruous tuples that the conformance zone of Fig. 3(a) admits.

We proceed to state Lemma 11, which informally says that any two correlated projections can be linearly combined to construct a new projection that yields a stronger constraint. We write 𝜙_𝐹 to denote the conformance constraint lb ≤ 𝐹(𝐴⃗) ≤ ub synthesized from 𝐹. (All proofs are in the Appendix.)

Lemma 11. Let 𝐷 be a dataset and 𝐹_1, 𝐹_2 be two projections on 𝐷 s.t. |𝜌_{𝐹_1,𝐹_2}| > 0. Then, ∃𝛽_1, 𝛽_2 ∈ ℝ with 𝛽_1² + 𝛽_2² = 1 s.t. for the new projection 𝐹 = 𝛽_1𝐹_1 + 𝛽_2𝐹_2:
(1) 𝜎(𝐹(𝐷)) < 𝜎(𝐹_1(𝐷)) and 𝜎(𝐹(𝐷)) < 𝜎(𝐹_2(𝐷)), and
(2) 𝜙_𝐹 is stronger than both 𝜙_{𝐹_1} and 𝜙_{𝐹_2} on the set of tuples that are incongruous w.r.t. ⟨𝐹_1, 𝐹_2⟩.

We now extend the result to multiple projections in Theorem 12.

Theorem 12 (Low Standard Deviation Constraints).
Given a dataset 𝐷, let F = {𝐹_1, . . . , 𝐹_𝐾} denote a set of projections on 𝐷 s.t. ∃𝐹_𝑖, 𝐹_𝑗 ∈ F with |𝜌_{𝐹_𝑖,𝐹_𝑗}| > 0. Then, there exist a nonempty subset 𝐼 ⊆ {1, . . . , 𝐾} and a projection 𝐹 = Σ_{𝑘∈𝐼} 𝛽_𝑘 𝐹_𝑘, where 𝛽_𝑘 ∈ ℝ, s.t.
(1) ∀𝑘 ∈ 𝐼, 𝜎(𝐹(𝐷)) < 𝜎(𝐹_𝑘(𝐷)),
(2) ∀𝑘 ∈ 𝐼, 𝜙_𝐹 is stronger than 𝜙_{𝐹_𝑘} on the subset 𝐻, where 𝐻 = {𝑡 | ∀𝑘 ∈ 𝐼 (𝛽_𝑘 Δ𝐹_𝑘(𝑡) ≥ 0) ∨ ∀𝑘 ∈ 𝐼 (𝛽_𝑘 Δ𝐹_𝑘(𝑡) ≤ 0)}, and
(3) ∀𝑘 ∉ 𝐼, 𝜌_{𝐹,𝐹_𝑘} = 0.

The theorem establishes that, to detect violations for tuples in 𝐻: (1) projections with low standard deviations define stronger constraints (and are thus preferable), and (2) a set of constraints with highly-correlated projections is suboptimal (as those projections can be linearly combined to generate stronger constraints). Note that 𝐻 is a conservative estimate of the set of tuples on which 𝜙_𝐹 is stronger than each 𝜙_{𝐹_𝑘}; there exist tuples outside 𝐻 for which 𝜙_𝐹 is stronger.

Algorithm 1: Procedure to generate linear projections.
Input: A dataset 𝐷 ⊂ Dom^𝑚
Output: A set {(𝐹_1, 𝛾_1), . . . , (𝐹_𝐾, 𝛾_𝐾)} of projections and importance factors
1: 𝐷_𝑁 ← 𝐷 after dropping non-numerical attributes
2: 𝐷′_𝑁 ← [1⃗ ; 𝐷_𝑁]
3: {𝑤⃗_1, . . . , 𝑤⃗_𝐾} ← eigenvectors of 𝐷′_𝑁ᵀ 𝐷′_𝑁
4: foreach 1 ≤ 𝑘 ≤ 𝐾 do
5:   𝑤⃗′_𝑘 ← 𝑤⃗_𝑘 with first element removed
6:   𝐹_𝑘 ← 𝜆𝐴⃗ : 𝐴⃗ᵀ 𝑤⃗′_𝑘 / ||𝑤⃗′_𝑘||
7:   𝛾_𝑘 ← 1 / (1 + 𝜎(𝐹_𝑘(𝐷_𝑁)))
8: return {(𝐹_1, 𝛾_1/𝑍), . . . , (𝐹_𝐾, 𝛾_𝐾/𝑍)}, where 𝑍 = Σ_𝑘 𝛾_𝑘

Bounded projections vs. convex polytope.
Bounded projections (Example 7) relate to the computation of convex polytopes [83]. A convex polytope is the smallest convex set of tuples that includes all training tuples; any tuple falling outside the polytope would be considered non-conforming. The problem with convex polytopes is that they overfit to the training tuples and are extremely sensitive to outliers. For example, consider a training dataset over attributes 𝑋 and 𝑌 whose tuples all lie on the line 𝑌 = 𝑋. The convex polytope in this case is a line segment spanning the training tuples, and any tuple that lies on the same line but beyond the segment's endpoints falls outside it. Unlike a convex polytope—whose goal is to find the smallest possible "inclusion zone" that includes all training tuples—our goal is to find a "conformance zone" that reflects the trend of the training tuples. This is inspired by the fact that ML models aim to generalize to tuples outside the training set; thus, conformance constraints also need to capture trends and avoid overfitting. Our definition of good conformance constraints (low variance and low mutual correlation) balances overfitting and overgeneralization. Therefore, beyond the minimal bounding hyper-box (i.e., a convex polytope) over the training tuples, we take into consideration the distribution (variance and concentration) of the interactions among attributes (trends). For the above example, conformance constraints model the interaction trend 𝑌 = 𝑋, admitting tuples that follow the same trend as the training tuples even when they lie beyond the training range.

Theorem 12 sets the requirements for good projections (see also [51, 56, 84], which make similar observations in different ways). It indicates that we could start with arbitrary projections and iteratively improve them. However, we can obtain the desired set of best projections in one shot, using an algorithm inspired by principal component analysis (PCA). PCA relies on computing eigenvectors.
There exist different algorithms for computing eigenvectors. The general mechanism applies numerical methods that iteratively converge to the eigenvectors (up to a desired precision), as no analytical solution exists in general. Our algorithm returns projections that correspond to the principal components of a slightly modified version of the given dataset. Algorithm 1 details our approach for discovering projections for constructing conformance constraints:
Line 1: Drop all non-numerical attributes from 𝐷 to get the numeric dataset 𝐷_𝑁. This is necessary because PCA applies only to numerical values. Instead of dropping, one could also use embedding techniques to convert non-numerical attributes to numerical ones.

Line 2: Add a new column consisting of the constant 1 to 𝐷_𝑁, to obtain the modified dataset 𝐷′_𝑁 := [1⃗ ; 𝐷_𝑁], where 1⃗ denotes the column vector of all 1s.

Line 3: Compute the 𝐾 eigenvectors of the square matrix 𝐷′_𝑁ᵀ 𝐷′_𝑁, where 𝐾 denotes the number of columns of 𝐷′_𝑁. These eigenvectors provide the coefficients for constructing projections.

Lines 5–6: Remove the first element (the coefficient of the newly added constant column) of each eigenvector and normalize the result to generate a projection. We no longer need the constant element of the eigenvectors, since we can appropriately adjust the bounds lb and ub for each projection by evaluating it on 𝐷_𝑁.

Line 7: Compute the importance factor of each projection. Since projections with smaller standard deviations are more discerning (stronger), as discussed in Section 3.2, we assign each projection an importance factor (𝛾) that is inversely proportional to its standard deviation over 𝐷_𝑁.

Line 8: Return the linear projections with their normalized importance factors.

We now claim that the projections returned by Algorithm 1 include the projection with minimum standard deviation, and that the correlation between any two returned projections is 0. This indicates that we cannot further improve the projections; they are, in this sense, optimal.

Theorem 13 (Correctness of Algorithm 1).
Given a numerical dataset 𝐷 over the schema R, let F = {𝐹_1, 𝐹_2, . . . , 𝐹_𝐾} be the set of linear projections returned by Algorithm 1, and let 𝜎* = min_{𝑘=1}^{𝐾} 𝜎(𝐹_𝑘(𝐷)). If 𝜇(𝐴_𝑘(𝐷)) = 0 for all attributes 𝐴_𝑘 in R, then
(1) 𝜎* ≤ 𝜎(𝐹(𝐷)) for every projection 𝐹 = 𝐴⃗ᵀ𝑤⃗ with ||𝑤⃗|| ≥ 1, and
(2) ∀𝐹_𝑗, 𝐹_𝑘 ∈ F s.t. 𝐹_𝑗 ≠ 𝐹_𝑘, 𝜌_{𝐹_𝑗,𝐹_𝑘} = 0.

(When the condition ∀𝐴_𝑘 𝜇(𝐴_𝑘(𝐷)) = 0 does not hold, slightly modified variants of the claim hold. However, by normalizing 𝐷—i.e., by subtracting the attribute mean 𝜇(𝐴_𝑘(𝐷)) from each 𝐴_𝑘(𝐷)—it is always possible to satisfy the condition.)

Using the projections 𝐹_1, . . . , 𝐹_𝐾 and importance factors 𝛾_1, . . . , 𝛾_𝐾 returned by Algorithm 1, we generate the simple (conjunctive) constraint with 𝐾 conjuncts: ⋀_𝑘 lb_𝑘 ≤ 𝐹_𝑘(𝐴⃗) ≤ ub_𝑘. We compute the bounds lb_𝑘 and ub_𝑘 following Section 4.1.1 and use the importance factor 𝛾_𝑘 for the 𝑘-th conjunct in the quantitative semantics.

Example 14. Algorithm 1 finds the projection of the conformance constraint of Example 1, but in a different form. The actual airlines dataset has an attribute 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 (𝐷𝐼𝑆) that represents the miles traveled by a flight. In our experiments, we found a conformance constraint of the following form over the dataset of daytime flights:

𝑐_1 × 𝐴𝑇 − 𝑐_2 × 𝐷𝑇 − 𝑐_3 × 𝐷𝑈𝑅 − 𝑐_4 × 𝐷𝐼𝑆 ≈ 0 (1)

This constraint is not quite interpretable by itself, but it is in fact a linear combination of two expected and interpretable constraints:

𝐴𝑇 − 𝐷𝑇 − 𝐷𝑈𝑅 ≈ 0 (2)
𝐷𝑈𝑅 − 𝑐 × 𝐷𝐼𝑆 ≈ 0 (3)

Here, (2) is the constraint mentioned in Example 1, and (3) follows from the fact that the average aircraft speed is roughly constant, so a flight's duration is approximately proportional to the distance it travels (𝑐_1, . . . , 𝑐_4 and 𝑐 are dataset-specific constants). A suitable weighted combination of (2) and (3) yields exactly the conformance constraint (1).
Algorithm 1 found the optimal projection of (1), which is a linear combination of the projections of (2) and (3). The reason is that the projections of (2) and (3) are correlated over the dataset (Theorem 12). One possible explanation of this correlation is that whenever there is an error in the reported duration of a tuple, the tuple violates both (2) and (3). Due to this natural correlation, Algorithm 1 returned the optimal projection of (1), which "covers" both the projection of (2) and that of (3).
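The full synthesis pipeline—Algorithm 1's projections, the bounds of Section 4.1.1, and the importance-weighted quantitative semantics—can be sketched compactly in NumPy. This is our own rendering, not the paper's released code; the function names, the toy data, and the 𝜂/𝛼 choices follow the assumptions stated earlier.

```python
import numpy as np

def discover_projections(D_N):
    """Algorithm 1: D_N is an n x m numeric array. Returns a list of
    (w, gamma): unit coefficient vectors and normalized importance factors."""
    n, m = D_N.shape
    D_prime = np.hstack([np.ones((n, 1)), D_N])       # Line 2: prepend constant-1 column
    _, eigvecs = np.linalg.eigh(D_prime.T @ D_prime)  # Line 3: eigenvectors
    out = []
    for k in range(m + 1):
        w = eigvecs[1:, k]                            # Lines 5-6: drop constant's coefficient
        if np.linalg.norm(w) == 0:
            continue
        w = w / np.linalg.norm(w)
        gamma = 1.0 / (1.0 + (D_N @ w).std())         # Line 7: importance factor
        out.append((w, gamma))
    Z = sum(g for _, g in out)
    return [(w, g / Z) for w, g in out]               # Line 8: normalize factors

def fit_bounds(D_N, projections, C=4):
    """Per-projection bounds mu +/- C*sigma (Section 4.1.1)."""
    fitted = []
    for w, gamma in projections:
        vals = D_N @ w
        mu, sigma = vals.mean(), vals.std()
        fitted.append((w, gamma, mu - C * sigma, mu + C * sigma, sigma))
    return fitted

def violation(t, fitted):
    """Importance-weighted violation of a tuple t (1-D array)."""
    total = 0.0
    for w, gamma, lb, ub, sigma in fitted:
        v = t @ w
        gap = max(0.0, lb - v, v - ub)
        total += gamma * (1.0 - np.exp(-gap / max(sigma, 1e-12)))
    return total

# Toy dataset obeying DUR ≈ 0.5 * DIS: the low-variance projection captures the trend.
rng = np.random.default_rng(0)
dis = np.linspace(100.0, 1000.0, 50)
dur = 0.5 * dis + rng.normal(0.0, 2.0, 50)
D = np.column_stack([dis, dur])
fitted = fit_bounds(D, discover_projections(D))
print(violation(D[0], fitted))                       # 0.0: conforming training tuple
print(violation(np.array([500.0, 480.0]), fitted))   # close to 1: far off the trend
```

On this toy data, one discovered projection is (up to sign) proportional to 𝐷𝑈𝑅 − 0.5·𝐷𝐼𝑆 and has near-zero standard deviation, so it dominates the weighted violation score.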
The quality of our PCA-based simple linear constraints relies on how many low-variance linear projections we are able to find on the given dataset. For many datasets, we may find very few, or even no, such linear projections. In these cases, it is fruitful to search for compound constraints; we first focus on disjunctive constraints (defined by 𝜓_𝐴 in our language grammar).

The PCA-based approach fails when different piecewise linear trends exist within the data, as it then yields low-quality constraints with very high variance. In such cases, partitioning the dataset and learning constraints separately on each partition significantly improves the learned constraints. A disjunctive constraint is a compound constraint of the form ⋁_𝑘 ((𝐴 = 𝑐_𝑘) ▷ 𝜙_𝑘), where each 𝜙_𝑘 is a constraint for a specific partition of 𝐷. Finding disjunctive constraints involves horizontally partitioning the dataset 𝐷 into smaller disjoint datasets 𝐷_1, 𝐷_2, . . . , 𝐷_𝐿. Our strategy for partitioning 𝐷 is to use categorical attributes with a small domain; in our implementation, we use those attributes 𝐴_𝑗 for which |{𝑡.𝐴_𝑗 | 𝑡 ∈ 𝐷}| ≤ 50. If 𝐴_𝑗 is such an attribute with values 𝑣_1, 𝑣_2, . . . , 𝑣_𝐿, we partition 𝐷 into 𝐿 disjoint datasets 𝐷_1, 𝐷_2, . . . , 𝐷_𝐿, where 𝐷_𝑙 = {𝑡 ∈ 𝐷 | 𝑡.𝐴_𝑗 = 𝑣_𝑙}. Let 𝜙_1, 𝜙_2, . . . , 𝜙_𝐿 be the 𝐿 simple conformance constraints we learn for 𝐷_1, 𝐷_2, . . . , 𝐷_𝐿 using Algorithm 1, respectively. We compute the following disjunctive conformance constraint for 𝐷:

((𝐴_𝑗 = 𝑣_1) ▷ 𝜙_1) ∨ ((𝐴_𝑗 = 𝑣_2) ▷ 𝜙_2) ∨ · · · ∨ ((𝐴_𝑗 = 𝑣_𝐿) ▷ 𝜙_𝐿)

We repeat this process to partition 𝐷 across multiple attributes, generating a compound disjunctive constraint for each attribute. We then generate the final compound conjunctive conformance constraint (Ψ) for 𝐷, which is the conjunction of all these disjunctive constraints. Intuitively, this final conformance constraint forms a set of overlapping hyper-boxes around the data tuples. (For ease of exposition, we write 𝐹(𝐴⃗) ≈ 0 to denote −𝜖 ≤ 𝐹(𝐴⃗) ≤ 𝜖 for a small 𝜖 ≥ 0. Interpretability is not our explicit goal, but we have developed a tool [24] to explain causes of non-conformance; more discussion and case studies are in the Appendix.)

Computing simple constraints involves two computational steps: (1) computing 𝑋ᵀ𝑋, where 𝑋 is an 𝑛 × 𝑚 matrix with 𝑛 tuples (rows) and 𝑚 attributes (columns), which takes O(𝑛𝑚²) time, and (2) computing the eigenvalues and eigenvectors of an 𝑚 × 𝑚 positive definite matrix, which has complexity O(𝑚³) [58]. Once we obtain the linear projections from these two steps, we compute the mean and variance of each projection over the original dataset, which takes O(𝑛𝑚²) time in total. In summary, the overall procedure is cubic in the number of attributes and linear in the number of tuples. For computing disjunctive constraints, we greedily pick attributes that take at most 𝐿 (typically small) distinct values, and then run the above procedure for simple constraints at most 𝐿 times. This adds just a constant factor of overhead per attribute. The procedure can be implemented in O(𝑚²) space.
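The space bound comes from never materializing 𝑋 in memory: the Gram matrix can be accumulated one tuple at a time. A minimal sketch (names ours):

```python
import numpy as np

def gram_streaming(rows, m):
    """Accumulate X^T X one tuple at a time, keeping only an m x m matrix
    (and one row) in memory."""
    G = np.zeros((m, m))
    for t in rows:            # t: length-m array, one row of X
        G += np.outer(t, t)   # add t_i t_i^T to the running sum
    return G

X = np.arange(12.0).reshape(4, 3)
assert np.allclose(gram_streaming(iter(X), 3), X.T @ X)
```

The same decomposition enables the row-wise parallel variant: each worker computes the partial sum for its partition, and the partial Gram matrices are added at the end.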
The key observation is that 𝑋ᵀ𝑋 can be computed as Σ_{𝑖=1}^{𝑛} 𝑡_𝑖 𝑡_𝑖ᵀ, where 𝑡_𝑖 is the 𝑖-th tuple in the dataset. Thus, 𝑋ᵀ𝑋 can be computed incrementally by loading only one tuple at a time into memory, computing 𝑡_𝑖 𝑡_𝑖ᵀ, and adding it to a running sum, which can be stored in O(𝑚²) space. Note that instead of such an incremental computation, this can also be done in an embarrassingly parallel way, where we horizontally partition the data (row-wise) and process each partition in parallel.

Definition 8 gives us a notion of implication between conformance constraints: for a dataset 𝐷, satisfying a constraint 𝜙_1 that is stronger than 𝜙_2 implies that 𝐷 satisfies 𝜙_2 as well. Lemma 11 and Theorem 12 associate redundancy with correlation: correlated projections can be combined to construct a new projection that makes the correlated projections redundant. Theorem 13 shows that our PCA-based procedure finds a non-redundant (orthogonal and uncorrelated) set of projections. For disjunctive constraints, it is possible to observe redundancy across partitions; however, our quantitative semantics ensures that redundancy does not affect the violation score. Another notion relevant to data profiles (e.g., FDs) is minimality. In this work, we do not focus on finding a minimal set of conformance constraints; towards that goal, a future direction is to explore techniques for optimal data partitioning. Note, however, that our approach computes only 𝑚 conformance constraints for each partition. Further, for a single tuple, only 𝑚_𝑁 · 𝑚_𝐶 conformance constraints are applicable, where 𝑚_𝑁 and 𝑚_𝐶 are the numbers of numerical and categorical attributes in 𝐷 (i.e., 𝑚 = 𝑚_𝑁 + 𝑚_𝐶); the quantity 𝑚_𝑁 · 𝑚_𝐶 is upper-bounded by 𝑚².

In this section, we provide a theoretical justification of why conformance constraints are effective in identifying tuples for which learned models are likely to make incorrect predictions.
To that end, we define unsafe tuples and show that an "ideal" conformance constraint provides a sound and complete mechanism to detect unsafe tuples. In Section 4, we showed that low-variance projections construct strong conformance constraints, which yield a small conformance zone. We now make a similar argument in a slightly different way: we show that projections with zero variance give us equality constraints that are useful for trusted machine learning. We start with an example to provide the intuition.

Example 15. Consider the airlines dataset 𝐷 and assume that all tuples in 𝐷 satisfy the equality constraint 𝜙 := (𝐴𝑇 − 𝐷𝑇 − 𝐷𝑈𝑅 = 0), i.e., lb = ub = 0. Note that for an equality constraint, the corresponding projection has zero variance—the lowest possible variance. Now suppose that the task is to learn some function 𝑓(𝐴𝑇, 𝐷𝑇, 𝐷𝑈𝑅). If the above constraint holds for 𝐷, then the ML model can instead learn the function 𝑔(𝐴𝑇, 𝐷𝑇, 𝐷𝑈𝑅) = 𝑓(𝐷𝑇 + 𝐷𝑈𝑅, 𝐷𝑇, 𝐷𝑈𝑅). 𝑔 performs just as well as 𝑓 on 𝐷: in fact, it produces exactly the same output as 𝑓 on 𝐷. If a new serving tuple 𝑡 satisfies 𝜙, then 𝑔(𝑡) = 𝑓(𝑡), and the prediction will be correct. However, if 𝑡 does not satisfy 𝜙, then 𝑔(𝑡) will likely differ significantly from 𝑓(𝑡). Hence, violation of the conformance constraint is a strong indicator of performance degradation of the learned prediction model. Note that 𝑓 need not be a linear function: as long as 𝑔 is also in the class of models that the learning procedure searches over, the above argument holds.

Based on the intuition of Example 15, we proceed to formally define unsafe tuples. We use [𝐷; 𝑌] to denote the annotated dataset obtained by appending the target attribute 𝑌 to a dataset 𝐷, and coDom to denote 𝑌's domain.

Definition 16 (Unsafe tuple).
Given a class C of functions with signature Dom^𝑚 ↦→ coDom, and an annotated dataset [𝐷; 𝑌] ⊂ (Dom^𝑚 × coDom), a tuple 𝑡 ∈ Dom^𝑚 is unsafe w.r.t. C and [𝐷; 𝑌] if ∃𝑓, 𝑔 ∈ C s.t. 𝑓(𝐷) = 𝑔(𝐷) = 𝑌 but 𝑓(𝑡) ≠ 𝑔(𝑡).

Intuitively, 𝑡 is unsafe if there exist two different predictor functions 𝑓 and 𝑔 that agree on all tuples in 𝐷 but disagree on 𝑡. Since we can never be sure whether the model learned 𝑓 or 𝑔, we should be cautious about the prediction on 𝑡. Example 15 suggests that 𝑡 can be unsafe when all tuples in 𝐷 satisfy the equality conformance constraint 𝑓(𝐴⃗) − 𝑔(𝐴⃗) = 0 but 𝑡 does not. Hence, we can use the following approach for trusted machine learning:
(1) Learn conformance constraints Φ for the dataset 𝐷.
(2) Declare 𝑡 unsafe if 𝑡 does not satisfy Φ.
This approach is sound and complete for characterizing unsafe tuples, thanks to the following proposition.

Proposition 17. There exists a conformance constraint Φ for 𝐷 s.t. the following statement holds: "¬Φ(𝑡) iff 𝑡 is unsafe w.r.t. C and [𝐷; 𝑌]", for all 𝑡 ∈ Dom^𝑚. The required conformance constraint Φ is:

∀𝑓, 𝑔 ∈ C : 𝑓(𝐷) = 𝑔(𝐷) = 𝑌 ⇒ 𝑓(𝐴⃗) − 𝑔(𝐴⃗) = 0

Intuitively, only when all possible pairs of functions that agree on 𝐷 also agree on 𝑡 can the prediction on 𝑡 be trusted. (More discussion is in the Appendix.)

Generalization to noisy setting.
While our analysis and formalization of conformance constraints for TML focused on the noise-free setting, the intuition generalizes to noisy data. Specifically, suppose that 𝑓 and 𝑔 are two possible functions a model may learn over 𝐷; then we expect the difference 𝑓 − 𝑔 to have small variance over 𝐷, and thus to be a good conformance constraint. In turn, violation of this constraint means that 𝑓 and 𝑔 diverge on a tuple 𝑡 (making 𝑡 unsafe); since we are oblivious of which function the model learned, the prediction on 𝑡 is untrustworthy.

False positives.
Conformance constraints are designed to work in a model-agnostic setting. Although this setting is of great practical importance, designing a perfect mechanism for quantifying trust in ML model predictions, while remaining completely model-agnostic, is challenging. It raises the concern of false positives: conformance constraints may incorrectly flag tuples for which the model's prediction is in fact correct. This may happen when the model ignores a trend that conformance constraints learn. Since we are oblivious of the prediction task and the model, it is preferable for conformance constraints to behave conservatively, so that users can be cautious about potentially unsafe tuples. Moreover, even if a model ignores some attributes (or their interactions) during training, it is still useful to learn conformance constraints over them: in case of concept drift [81], the ground truth may start depending on those attributes, and by learning conformance constraints over all attributes, we can better detect potential model failures.
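The two-function argument of Example 15, which underlies both this concern and Definition 16, is easy to simulate concretely. This is a toy sketch of our own, not the paper's experiment: any 𝑔 obtained from 𝑓 by substituting 𝐴𝑇 := 𝐷𝑇 + 𝐷𝑈𝑅 agrees with 𝑓 wherever the equality constraint holds, and diverges where it does not.

```python
import numpy as np

def f(at, dt, dur):
    # Some fixed predictor over (AT, DT, DUR); the argument is agnostic to its form.
    return 2.0 * at + 0.5 * dur + 3.0

def g(at, dt, dur):
    # Substitute AT := DT + DUR, valid wherever AT - DT - DUR = 0 holds.
    return f(dt + dur, dt, dur)

# Training data satisfying the equality constraint AT - DT - DUR = 0:
DT = np.linspace(0.0, 1200.0, 100)
DUR = np.linspace(30.0, 500.0, 100)
AT = DT + DUR

assert np.allclose(f(AT, DT, DUR), g(AT, DT, DUR))  # f and g agree on all of D

# A serving tuple violating the constraint (AT - DT - DUR = -200) is unsafe:
print(f(500.0, 400.0, 300.0))  # 1153.0
print(g(500.0, 400.0, 300.0))  # 1553.0
```

Since the training data cannot distinguish 𝑓 from 𝑔, a prediction on the violating tuple could be either value—exactly the situation Definition 16 flags.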
False negatives.
Another concern involving conformance constraints is false negatives: linear conformance constraints may miss nonlinear constraints, and thus fail to identify some unsafe tuples. However, the linear dependencies modeled by conformance constraints persist even after sophisticated (nonlinear) attribute transformations. Therefore, violation of conformance constraints remains a strong indicator of potential failure of a possibly nonlinear model.
Modeling nonlinear constraints.
While linear conformance constraints are the most common, our framework can be easily extended to support nonlinear conformance constraints using kernel functions [69], which offer an efficient, scalable, and powerful mechanism to learn nonlinear decision boundaries for support vector machines (the "kernel trick"). Briefly, instead of explicitly augmenting the dataset with transformed nonlinear attributes—whose number grows exponentially with the desired polynomial degree—kernel functions enable an implicit search for nonlinear models. The same idea applies to PCA, where it is known as kernel PCA [9, 41]. While we limit our evaluation to the linear kernel, nonlinear kernels—e.g., polynomial kernels or the radial basis function (RBF) kernel [45]—can be plugged into our framework to model nonlinear conformance constraints.

In general, our conformance language is not guaranteed to model all possible functions that an ML model can potentially learn, and thus is not guaranteed to find the best conformance constraint. However, our empirical evaluation on real-world datasets shows that our language models conformance constraints effectively.
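As a concrete (if naive) alternative to the kernel trick, one can expand the attributes with low-degree monomials and reuse the same low-variance machinery: a quadratic trend such as 𝑌 = 𝑋² then surfaces as a linear constraint over the expanded attributes. A sketch of our own illustration—the explicit expansion below is exactly what kernel PCA avoids at higher degrees:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
D = np.column_stack([x, x ** 2])        # tuples obey the nonlinear trend Y = X^2

# Degree-2 monomial expansion of the attributes: (X, Y, X^2, X*Y, Y^2).
E = np.column_stack([D, D[:, 0] ** 2, D[:, 0] * D[:, 1], D[:, 1] ** 2])
E = E - E.mean(axis=0)                  # center attributes (cf. Theorem 13)

eigvals, eigvecs = np.linalg.eigh(E.T @ E)
w = eigvecs[:, 0]                       # lowest-variance direction
print((E @ w).std())                    # ~0: Y - X^2 = 0 found as a linear constraint
```

The lowest-variance direction puts opposite weights on the 𝑌 and 𝑋² columns, recovering the quadratic dependence as a near-equality constraint.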
We now present an experimental evaluation demonstrating the effectiveness of conformance constraints on our two case-study applications (Section 2): trusted machine learning and data drift. Our experiments target the following research questions:
• How effective are conformance constraints for trusted machine learning? Is there a relationship between the constraint violation score and an ML model's prediction accuracy? (Section 6.1)
• Can conformance constraints be used to quantify data drift? How do they compare to other state-of-the-art drift-detection techniques? (Section 6.2)
Efficiency.
In all our experiments, our algorithms for deriving conformance constraints were extremely fast, taking only a few seconds even for datasets with 6 million rows. The number of attributes was reasonably small in all cases.
We created an open-source imple-mentation of conformance constraints and our method for synthe-sizing them, CCSynth, in Python 3. Experiments were run on aWindows 10 machine (3.60 GHz processor and 16GB RAM).
Datasets
Airlines [8] contains data about flights, with 14 attributes: year, month, day, day of week, departure time, arrival time, carrier, flight number, elapsed time, origin, destination, distance, diverted, and arrival delay. We used a subset of the data containing all flight information for the year 2008. In this dataset, most attributes follow a uniform distribution (e.g., month, day, arrival and departure time); elapsed time and distance follow skewed distributions concentrated towards small values (implying that shorter flights are more common); and arrival delay follows a slightly skewed Gaussian distribution, implying that most flights are on time, a few arrive late, and even fewer arrive early. The training and serving datasets contain 5.4M and 0.4M rows, respectively.
Human Activity Recognition (HAR) [78] is a real-world dataset about activities of 15 individuals, 8 males and 7 females, with varying fitness levels and BMIs. We use data from two sensors—accelerometer and gyroscope—attached to 6 body locations: head, shin, thigh, upper arm, waist, and chest. We consider 5 activities: lying, running, sitting, standing, and walking. The dataset contains 36 numerical attributes (2 sensors × 6 body locations × 3 axes per sensor).

Extreme Verification Latency (EVL) [74] is a widely used benchmark for evaluating drift-detection algorithms in non-stationary environments under extreme verification latency. It contains 16 synthetic datasets with incremental and gradual concept drifts over time. The number of attributes of these datasets varies from 2 to 6, and each dataset has one categorical attribute.
We now demonstrate the applicability of conformance constraints to the TML problem. We show that serving tuples that violate the training data's conformance constraints are unsafe, and that an ML model is therefore more likely to perform poorly on those tuples.
Airlines. We design a regression task of predicting the arrival delay and train a linear regression model for the task. Our goal is to observe whether the mean absolute error of the predictions (positively) correlates with the constraint violation on the serving tuples. In a process analogous to the one described in Example 1, our training dataset (Train) comprises a subset of daytime flights—flights whose arrival time is later than their departure time (in 24-hour format). We design three serving sets: (1) Daytime: another subset of daytime flights, similar to Train; (2) Overnight: flights whose arrival time is earlier than their departure time (the dataset does not explicitly report the date of arrival); and (3) Mixed: a mixture of Daytime and Overnight. A few sample tuples of this dataset are shown in Fig. 1.

Our experiment involves the following steps: (1) CCSynth computes conformance constraints Φ over Train, ignoring the target attribute delay. (2) We compute the average constraint violation against Φ for all four datasets—Train, Daytime, Overnight, and Mixed (first row of Fig. 4). (3) We train a linear regression model over Train—including delay—that learns to predict arrival delay. (4) We compute the mean absolute error (MAE) of the regressor's predictions over the four datasets (second row of Fig. 4). We find that constraint violation is a very good proxy for prediction error, as the two vary in a similar manner across the four datasets. The reason is that the model implicitly assumes that the constraints derived by CCSynth (e.g., 𝐴𝑇 − 𝐷𝑇 − 𝐷𝑈𝑅 ≈ 0) will always hold, and thus deteriorates when the assumption no longer holds.

Figure 4: Average constraint violation (in percentage) and MAE (for linear regression) of four data splits on the airlines dataset. The constraints were learned on Train, excluding the target attribute, delay.

To observe the rates of false positives and false negatives, we investigate the relationship between constraint violation and prediction error at tuple-level granularity. We sample 1000 tuples from Mixed and organize them in decreasing order of violation (Fig. 5). For all the tuples (on the left) that incur high constraint violation, the regression model incurs high error as well. This implies that CCSynth reports no false positives. There are some false negatives (right part of the graph), where violation is low but prediction error is high; nevertheless, such false negatives are very few.

Figure 5: Constraint violation strongly correlates with the absolute error of delay prediction of a linear regression model.
HAR.
On the HAR dataset, we design a supervised classification task to identify persons from their activity data, which contains 36 numerical attributes. We construct train_x with data for sedentary activities (lying, standing, and sitting), and train_y with the corresponding person-IDs. We learn conformance constraints on train_x, and train a Logistic Regression (LR) classifier using the annotated dataset [train_x; train_y]. During serving, we mix mobile activities (walking and running) with held-out data for sedentary activities and observe how the classifier's mean accuracy-drop (i.e., how much the mean prediction accuracy decreases compared to the mean prediction accuracy over the training data) relates to the average constraint violation. To avoid any artifact due to sampling bias, we repeat this experiment 10 times for different subsets of the data, randomly sampling 5000 data points for each of training and serving. Fig. 6(a) depicts our findings: classification degradation has a clear positive correlation with violation (PCC = 0.99 with p-value = 0).

Noise sensitivity.
Intuitively, noise weakens conformance constraints by increasing variance in the training data, which results in reduced violations for the serving data. However, this is desirable: more noise makes machine-learned models less likely to overfit and, thus, more robust. In our experiment for observing the noise sensitivity of conformance constraints, we use only mobile activity data as the serving set and start with sedentary data as the training set. Then we gradually introduce noise into the training set by mixing in mobile activity data. As Fig. 6(b) shows, when more noise is added to the training data, conformance constraints get weaker; this leads to a reduction in violations. However, the classifier also becomes more robust with more noise, which is evident from the gradual decrease in accuracy-drop (i.e., increase in accuracy). Therefore, even in the presence of noise, the positive correlation between classification degradation and violation persists (PCC = 0.82 with p-value = 0.002).
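A minimal numeric illustration of this effect (synthetic two-column data, with a single low-variance projection standing in for the full constraint set): mixing serving-like noise into the training data inflates the projection's variance, so the same serving set violates the now-weaker constraint far less:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
# Clean training data obeys a tight hidden constraint: col1 ≈ col0.
train_clean = np.column_stack([x, x + 0.05 * rng.normal(size=n)])
# The serving set breaks that constraint by a fixed offset.
serving = np.column_stack([x, x + 2.0])

def min_var_violation(train, serve):
    """Violation of `serve` under the lowest-variance projection of `train`."""
    mu = train.mean(axis=0)
    _, s, vt = np.linalg.svd(train - mu, full_matrices=False)
    w, sigma = vt[-1], s[-1] / np.sqrt(len(train))
    return float(np.mean(np.abs((serve - mu) @ w) / sigma))

# Mixing noisy rows into training inflates the projection's variance,
# so the unchanged serving set violates the (weaker) constraint less.
train_noisy = np.vstack([train_clean,
                         train_clean + rng.normal(0, 1.0, train_clean.shape)])
print(min_var_violation(train_clean, serving),
      min_var_violation(train_noisy, serving))
```

The violation of the fixed serving set drops by an order of magnitude once the training set is contaminated, matching the downward violation trend in Fig. 6(b).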
Key takeaway:
CCSynth derives conformance constraints whose violation is a strong proxy for model prediction accuracy. This correlation persists even in the presence of noise.
We now present results on using conformance constraints for drift detection; specifically, for quantifying drift in data. Given a baseline dataset D and a new dataset D′, the drift is measured as the average violation of tuples in D′ on the conformance constraints learned for D.

HAR.
We perform two drift-quantification experiments on HAR:
Gradual drift.
For observing how CCSynth detects gradual drift, we introduce drift in an organic way. The initial training dataset contains data of exactly one activity for each person. This is a realistic scenario: one can think of it as taking a snapshot of what a group of people are doing during a reasonably small time window. We introduce gradual drift into the initial dataset by altering the activity of one person at a time. To control the amount of drift, we use a parameter K. When K = 1, the first person switches their activity, i.e., we replace the tuples corresponding to the first person performing activity A with new tuples that correspond to the same person performing another activity B. When K = 2, the second person switches their activity in a similar fashion, and so on. As we increase K from 1 to 15, we expect a gradual increase in the drift magnitude compared to the initial training data. When K = 15, all persons have switched their activities from the initial setting, and we expect to observe maximum drift. We repeat this experiment 10 times and display the average constraint violation in Fig. 6(c): the drift magnitude (violation) indeed increases as more people alter their activities.
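The drift measure defined above, i.e., the average violation of tuples in D′ against the constraints learned on D, can be sketched as follows (a simplified stand-in for CCSynth: plain PCA projections with a 1/(1+σ) importance weighting, on synthetic data):

```python
import numpy as np

def learn_constraints(D):
    """PCA over the baseline dataset: each principal direction, with the
    mean and std. dev. of its projection, acts as one conformance constraint."""
    mu = D.mean(axis=0)
    _, s, vt = np.linalg.svd(D - mu, full_matrices=False)
    stds = s / np.sqrt(len(D))
    gamma = 1.0 / (1.0 + stds)     # low-variance projections get high weight
    return mu, vt, stds, gamma / gamma.sum()

def drift(D_new, cons):
    """Average violation of tuples in D_new w.r.t. the baseline's constraints."""
    mu, vt, stds, weights = cons
    z = np.abs((D_new - mu) @ vt.T) / stds   # per-tuple, per-projection z-scores
    return float(np.mean(z @ weights))

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 3))
base[:, 2] = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=1000)  # hidden constraint
shifted = base.copy()
shifted[:, 2] += 5.0                                                 # break it
cons = learn_constraints(base)
print(drift(base, cons), drift(shifted, cons))   # small vs. large
```

The baseline scores near 1 against its own constraints, while the shifted dataset, which breaks the hidden linear relationship, scores orders of magnitude higher.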
Figure 6: (a) As a higher fraction of mobile activity data is mixed with sedentary activity data, conformance constraints are violated more, and the classifier's mean accuracy-drop increases. (b) As more noise is added during training, conformance constraints get weaker, leading to less violation and decreased accuracy-drop. (c) CCSynth detects the gradual local drift on the HAR dataset as more people start changing their activities. In contrast, weighted-PCA (W-PCA) fails to detect drift in the absence of a strong global drift.

Figure 7: Inter-person constraint violation heat map. Each person has a very low self-violation.

The baseline weighted-PCA approach (W-PCA) fails to model local constraints (who is doing what), and learns some weaker global constraints indicating only that "a group of people are performing some activities". Thus, it fails to detect the gradual local drift. CCSynth can detect drift when individuals switch activities, as it learns disjunctive constraints that encode who is doing what.
Inter-person drift.
The goal of this experiment is to observe how effectively conformance constraints can model the representation of an entity, and whether such learned representations can be used to accurately quantify drift between two entities. We use half of each person's data to learn the constraints, and compute violation on the held-out data. CCSynth learns disjunctive constraints for each person over all activities, and then we use the violation w.r.t. the learned constraints to measure how much the other persons drift. While computing drift between two persons, we compute activity-wise constraint violation scores and then average them. In Fig. 7, the violation score at row p1 and column p2 denotes how much p2 drifts from p1. As one would expect, we observe very low self-drift along the diagonal. Interestingly, our results also show that some people are more different from others, which appears to have some correlation with (the hidden ground truth) fitness and BMI values. This asserts that the constraints we learn for each person are an accurate abstraction of that person's activities, as people do not deviate too much from their usual activity patterns.

Figure 8: In the EVL benchmark, CCSynth quantifies drift correctly for all cases, outperforming other approaches. PCA-SPLL fails to detect drift in a few cases by discarding all principal components; CD-MKL and CD-Area are too sensitive to small drift and detect spurious drifts. (Panels shown: UG-2C-2D, MG-2C-2D, FG-2C-2D, UG-2C-3D, UG-2C-5D, GEARS-2C-2D.)
EVL.
We now compare CCSynth against other state-of-the-art drift-detection approaches on the EVL benchmark.
Baselines.
We use two drift-detection baselines, described below:

(1) PCA-SPLL [51], similar to us, also argues that principal components with lower variance are more sensitive to a general drift, and uses those for dimensionality reduction. It then models a multivariate distribution over the reduced dimensions and applies semi-parametric log-likelihood (SPLL) to detect drift between two multivariate distributions. However, PCA-SPLL discards all high-variance principal components and does not model disjunctive constraints.

(2) CD (Change Detection) [63] is another PCA-based approach for drift detection in data streams. But unlike PCA-SPLL, it ignores low-variance principal components. CD projects the data onto the top k high-variance principal components, which results in multiple univariate distributions. We compare against two variants of CD: CD-Area, which uses the intersection area under the curves of two density functions as a divergence metric, and CD-MKL, which uses maximum KL-divergence as a symmetric divergence metric, to compute divergence between the univariate distributions.

Fig. 8 depicts how CCSynth compares against CD-MKL, CD-Area, and PCA-SPLL on 16 datasets in the EVL benchmark. For PCA-SPLL, we retain principal components that contribute to a cumulative explained variance below 25%. Beyond drift detection, which just detects whether drift exceeds some threshold, we focus on drift quantification. A point (x, y) in the plots denotes that the drift magnitude for the dataset at the x-th time window, w.r.t. the dataset at the first time window, is y. Since different approaches report drift magnitudes on different scales, we normalize the drift values within [0, 1]. Additionally, since different datasets have different numbers of time windows, for ease of exposition we normalize the time-window indices. Below we state our key findings from this experiment:

CCSynth's drift quantification matches the ground truth.
In all of the datasets in the EVL benchmark, CCSynth is able to correctly quantify the drift, matching the ground truth [2] exceptionally well. In contrast, since CD focuses on detecting the drift point, it is ill-equipped to precisely quantify the drift, as demonstrated in several cases (e.g., 2CHT), where CD fails to distinguish the deviation in drift magnitudes while both PCA-SPLL and CCSynth correctly quantify the drift. Since CD only retains high-variance principal components, it is more susceptible to noise and considers noise in the dataset as significant drift, which leads to incorrect drift quantification. In contrast, PCA-SPLL and CCSynth ignore the noise and capture only the general notion of drift. In all of the EVL datasets, we found CD-Area to work better than CD-MKL, which also agrees with the authors' experiments.

(SPLL source code: github.com/LucyKuncheva/Change-detection; CD source code: mine.kaust.edu.sa/Pages/Software.aspx)
CCSynth models local drift.
When the dataset contains instances from multiple classes, the drift may be just local, and not global (e.g., the 4CR dataset, as shown in the Appendix). In such cases, PCA-SPLL fails to detect drift (4CR, 4CRE-V2, and FG-2C-2D). In contrast, CCSynth learns disjunctive constraints and quantifies local drifts accurately.
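A toy illustration of why per-class (disjunctive) constraints catch local drift that a single global profile misses (synthetic two-class data; `fit` learns only one lowest-variance projection, a simplification of CCSynth's full constraint sets):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

def fit(D):
    """One conformance constraint: the lowest-variance PCA projection of D."""
    mu = D.mean(axis=0)
    _, s, vt = np.linalg.svd(D - mu, full_matrices=False)
    return mu, vt[-1], s[-1] / np.sqrt(len(D))

def viol(D, cons):
    mu, w, sigma = cons
    return np.abs((D - mu) @ w) / sigma       # per-tuple z-scores

x = rng.normal(size=n)
y = rng.normal(size=n)
A = np.column_stack([x,  x + 0.05 * rng.normal(size=n)])   # class A: col1 ≈  col0
B = np.column_stack([y, -y + 0.05 * rng.normal(size=n)])   # class B: col1 ≈ -col0

global_cons = fit(np.vstack([A, B]))     # one profile, class labels ignored
local_A = fit(A)                         # the disjunct learned for class A

# Local drift: class-A tuples silently adopt class B's relationship; the
# pooled distribution barely changes, so the global profile sees nothing.
A_drift = np.column_stack([x, -x + 0.05 * rng.normal(size=n)])

print(float(np.mean(viol(A_drift, global_cons))))   # modest: invisible globally
print(float(np.mean(viol(A_drift, local_A))))       # huge: caught by A's disjunct
```

Because the two classes have opposite linear relationships, the pooled data has no low-variance direction and the global score stays flat, while the class-A disjunct flags the drifted tuples immediately.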
Key takeaways:
CCSynth can effectively detect data drift, both global and local; it is robust across drift patterns and significantly outperforms the state-of-the-art methods.
There is extensive literature on data-profiling [5] primitives that model relationships among data attributes, such as functional dependencies (FD) [59, 93] and their variants (metric [48], conditional [23], soft [38], approximate [36, 50], relaxed [16], pattern [62], etc.), differential dependencies [72], denial constraints [13, 17, 53, 61], statistical constraints [91], etc. However, none of them focus on learning approximate arithmetic relationships that involve multiple numerical attributes in a noisy setting, which is the focus of our work. Some variants of FDs [16, 36, 38, 48, 50] consider a noisy setting, but they require noise parameters to be explicitly specified by the user. In contrast, we do not require any explicit noise parameter.

The issue of trust, resilience, and interpretability of artificial intelligence (AI) systems has been a theme of increasing interest recently [39, 42, 67, 86], particularly for safety-critical data-driven AI systems [80, 87]. A standard way to decide whether to trust a classifier is to use the classifier-produced confidence score. However, as prior work [41] argues, this is not always effective, since the classifier's confidence scores are not well-calibrated. While some recent techniques [20, 31, 41, 68] aim at validating the inferences made by machine-learned models on unseen tuples, they usually require knowledge of the inference task, access to the model, and/or expected cases of data shift, which we do not. Furthermore, they usually require costly hyper-parameter tuning and do not generate closed-form data profiles like conformance constraints (Fig. 2). Prior work on data drift, change detection, and covariate shift [7, 11, 14, 18, 19, 21, 22, 25, 26, 33, 34, 37, 44, 46, 66, 70, 71, 73, 82, 88] relies on modeling data distribution.
However, data distribution does not capture constraints, which are the primary focus of our work.

A few works [20, 31, 54] use an autoencoder's [32, 65] input reconstruction error to determine if a new data point is out-of-distribution. Our approach is similar to outlier detection [49] and one-class classification [79]. However, conformance constraints differ from these approaches, as they operate under the additional requirement of generalizing the data in a way that is exploited by a given class of ML models. In general, there is a clear gap between representation learning (which models data likelihood) [6, 32, 43, 65] and the (constraint-oriented) data-profiling techniques for addressing the problem of trusted AI, and our aim is to bridge this gap.

We introduced conformance constraints and the notion of unsafe tuples for trusted machine learning. We presented an efficient and highly scalable approach for synthesizing conformance constraints, and demonstrated their effectiveness in tagging unsafe tuples and quantifying data drift. The experiments validate our theory and our principle of using low-variance projections to generate effective conformance constraints. We have studied only two use cases from a large pool of potential applications of linear conformance constraints. In the future, we want to explore more powerful nonlinear conformance constraints using autoencoders. Moreover, we plan to explore approaches to learn conformance constraints in a decision-tree-like structure, where categorical attributes will guide the splitting conditions and leaves will contain simple conformance constraints. Further, we envision a mechanism—built on top of conformance constraints—to explore differences between datasets.
REFERENCES
VLDBJ, 24(4):557–581, 2015.
[6] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
[7] C. C. Aggarwal. A framework for diagnosing changes in evolving data streams. In SIGMOD, pages 575–586, 2003.
[8] Airlines dataset, 2019. http://kt.ijs.si/elena_ikonomovska/data.html.
[9] C. Alzate and J. A. Suykens. Kernel component analysis using an epsilon-insensitive robust loss function. IEEE Transactions on Neural Networks, 19(9):1583–1598, 2008.
[10] J. P. Barddal, H. M. Gomes, F. Enembreck, and B. Pfahringer. A survey on feature drift adaptation: Definition, benchmark, challenges and future directions. Journal of Systems and Software, 127:278–294, 2017.
[11] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SDM, pages 443–448, 2007.
[12] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: massive online analysis. J. Mach. Learn. Res., 11:1601–1604, 2010.
[13] T. Bleifuß, S. Kruse, and F. Naumann. Efficient denial constraint discovery with hydra. PVLDB, 11(3):311–323, 2017.
[14] L. Bu, C. Alippi, and D. Zhao. A pdf-free change detection test based on density difference estimation. IEEE Trans. Neural Netw. Learning Syst., 29(2):324–334, 2018.
[15] G. Buehrer, B. W. Weide, and P. A. G. Sivilotti. Using parse tree validation to prevent SQL injection attacks. In International Workshop on Software Engineering and Middleware, SEM, pages 106–113, 2005.
[16] L. Caruccio, V. Deufemia, and G. Polese. On the discovery of relaxed functional dependencies. In Proceedings of the 20th International Database Engineering & Applications Symposium, pages 53–61, 2016.
[17] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, 2013.
[18] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Symp. on the Interface of Statistics, Computing Science, and Applications, 2006.
[19] R. F. de Mello, Y. Vaz, C. H. G. Ferreira, and A. Bifet. On learning guarantees to unsupervised concept drift detection on data streams. Expert Syst. Appl., 117:90–102, 2019.
[20] T. Denouden, R. Salay, K. Czarnecki, V. Abdelzad, B. Phan, and S. Vernekar. Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance. CoRR, abs/1812.02765, 2018.
[21] D. M. dos Reis, P. A. Flach, S. Matwin, and G. E. A. P. A. Batista. Fast unsupervised online drift detection using incremental kolmogorov-smirnov test. In SIGKDD, pages 1545–1554, 2016.
[22] W. J. Faithfull, J. J. R. Diez, and L. I. Kuncheva. Combining univariate approaches for ensemble change detection in multivariate data. Information Fusion, 45:202–214, 2019.
[23] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong. Discovering conditional functional dependencies. In ICDE, pages 1231–1234, 2009.
[24] A. Fariha, A. Tiwari, A. Radhakrishna, and S. Gulwani. ExTuNe: Explaining tuple non-conformance. In SIGMOD, pages 2741–2744, 2020.
[25] M. M. Gaber and P. S. Yu. Classification of changes in evolving data streams using online clustering result deviation. In Proc. of International Workshop on Knowledge Discovery in Data Streams, 2006.
[26] J. Gama, P. Medas, G. Castillo, and P. P. Rodrigues. Learning with drift detection. In Advances in Artificial Intelligence - SBIA, pages 286–295, 2004.
[27] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, 2014.
[28] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321–1330, 2017.
[29] J. H. Hayes and A. J. Offutt. Increased software reliability through input validation analysis and testing. In International Symposium on Software Reliability Engineering, ISSRE, pages 199–209, 1999.
[30] A. Heise, J. Quiané-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4):301–312, 2013.
[31] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
[32] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[33] S. Ho. A martingale framework for concept change detection in time-varying data streams. In ICML, pages 321–327, 2005.
[34] B. Hooi and C. Faloutsos. Branch and border: Partition-based change detection in multivariate time series. In SDM, pages 504–512, 2019.
[35] R. A. Horn and C. R. Johnson. Matrix Analysis. New York, NY, USA, 2nd edition, 2012.
[36] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999.
[37] D. Ienco, A. Bifet, B. Pfahringer, and P. Poncelet. Change detection in categorical evolving data streams. In SAC, pages 792–797, 2014.
[38] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647–658, 2004.
[39] S. Jha. Trust, resilience and interpretability of AI models. In Numerical Software Verification - 12th International Workshop, NSV@CAV, pages 3–25, 2019.
[40] H. Jiang, S. G. Elbaum, and C. Detweiler. Inferring and monitoring invariants in robotic systems. Auton. Robots, 41(4):1027–1046, 2017.
[41] H. Jiang, B. Kim, M. Y. Guan, and M. R. Gupta. To trust or not to trust a classifier. In NeurIPS, pages 5546–5557, 2018.
[42] D. Kang, D. Raghavan, P. Bailis, and M. Zaharia. Model assertions for monitoring and improving ML models. In MLSys, 2020.
[43] T. Karaletsos, S. Belongie, and G. Rätsch. Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011, 2015.
[44] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SDM, pages 389–400, 2009.
[45] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
[46] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In PVLDB, pages 180–191, 2004.
[47] R. Koch. Sql database performance tuning for developers. https://moa.cms.waikato.ac.nz/datasets/, Sep 2013.
[48] N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275–1278, 2009.
[49] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in arbitrarily oriented subspaces. In , pages 379–388, 2012.
[50] S. Kruse and F. Naumann. Efficient discovery of approximate dependencies. PVLDB, 11(7):759–772, 2018.
[51] L. I. Kuncheva and W. J. Faithfull. PCA feature extraction for change detection in multidimensional unlabeled data. IEEE Trans. Neural Netw. Learning Syst., 25(1):69–80, 2014.
[52] P. Langer and F. Naumann. Efficient order dependency detection. VLDB J., 25(2):223–241, 2016.
[53] E. Livshits, A. Heidari, I. F. Ilyas, and B. Kimelfeld. Approximate denial constraints. CoRR, abs/2005.08540, 2020.
[54] H. Lu, H. Xu, N. Liu, Y. Zhou, and X. Wang. Data sanity check for deep learning systems via learnt assertions. CoRR, abs/1909.03835, 2019.
[55] F. D. Marchi, S. Lopes, and J. Petit. Unary and n-ary inclusion dependency discovery in relational databases. J. Intell. Inf. Syst., 32(1):53–73, 2009.
[56] T. Martin and I. K. Glad. Online detection of sparse changes in high-dimensional data streams using tailored projections. arXiv preprint arXiv:1908.02029, 2019.
[57] Z. Ouyang, Y. Gao, Z. Zhao, and T. Wang. Study on the classification of data streams with concept drift. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), volume 3, pages 1673–1677, 2011.
[58] V. Y. Pan and Z. Q. Chen. The complexity of the matrix eigenproblem. In ACM Symposium on Theory of Computing, pages 507–516, 1999.
[59] T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. PVLDB, 8(10):1082–1093, 2015.
[60] T. Papenbrock, S. Kruse, J.-A. Quiané-Ruiz, and F. Naumann. Divide & conquer-based inclusion dependency discovery. PVLDB, 8(7):774–785, 2015.
[61] E. H. Pena, E. C. de Almeida, and F. Naumann. Discovery of approximate (and exact) denial constraints. PVLDB, 13(3):266–278, 2019.
[62] A. Qahtan, N. Tang, M. Ouzzani, Y. Cao, and M. Stonebraker. Pattern functional dependencies for data cleaning. PVLDB, 13(5):684–697, 2020.
[63] A. A. Qahtan, B. Alharbi, S. Wang, and X. Zhang. A pca-based change detection framework for multidimensional data streams: Change detection in multidimensional data streams. In SIGKDD, pages 935–944, 2015.
[64] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In SIGKDD, pages 1135–1144, 2016.
[65] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[66] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda. A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learning Syst., 26(5):1048–1059, 2015.
[67] S. Saria and A. Subbaswamy. Tutorial: Safe and reliable machine learning. CoRR, abs/1904.07204, 2019.
[68] S. Schelter, T. Rukat, and F. Bießmann. Learning to validate the predictions of black box classifiers on unseen data. In SIGMOD, pages 1289–1299, 2020.
[69] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. 2002.
[70] T. S. Sethi and M. M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Syst. Appl., 82:77–99, 2017.
[71] T. S. Sethi, M. M. Kantardzic, and E. Arabmakki. Monitoring classification blindspots to detect drifts from unlabeled data. In IEEE International Conference on Information Reuse and Integration, IRI, pages 142–151, 2016.
[72] S. Song and L. Chen. Differential dependencies: Reasoning and discovery. ACM Transactions on Database Systems (TODS), 36(3):1–41, 2011.
[73] X. Song, M. Wu, C. M. Jermaine, and S. Ranka. Statistical change detection for multi-dimensional data. In SIGKDD, pages 667–676, 2007.
[74] V. M. A. Souza, D. F. Silva, J. Gama, and G. E. A. P. A. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SDM, pages 873–881, 2015.
[75] A. Subbaswamy, P. Schulam, and S. Saria. Preventing failures due to dataset shift: Learning predictive models that transport. In International Conference on Artificial Intelligence and Statistics, AISTATS, pages 3118–3127, 2019.
[76] C. A. Sutton, T. Hobson, J. Geddes, and R. Caruana. Data diff: Interpretable, executable summaries of changes in distributions for data wrangling. In SIGKDD, pages 2279–2288, 2018.
[77] J. Szlichta, P. Godfrey, L. Golab, M. Kargar, and D. Srivastava. Effective and complete discovery of order dependencies via set-based axiomatization. PVLDB, 10(7):721–732, 2017.
[78] T. Sztyler and H. Stuckenschmidt. On-body localization of wearable devices: An investigation of position-aware activity recognition. In PerCom, pages 1–9, 2016.
[79] D. M. J. Tax and K. Müller. Feature extraction for one-class classification. In ICANN/ICONIP, pages 342–349, 2003.
[80] A. Tiwari, B. Dutertre, D. Jovanovic, T. de Candia, P. Lincoln, J. M. Rushby, D. Sadigh, and S. A. Seshia. Safety envelope for security. In HiCoNS, pages 85–94, 2014.
[81] A. Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
[82] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Handling local concept drift with dynamic integration of classifiers: Domain of antibiotic resistance in nosocomial infections. In IEEE International Symposium on Computer-Based Medical Systems (CBMS), pages 679–684, 2006.
[83] F. Turchini, L. Seidenari, and A. Del Bimbo. Convex polytope ensembles for spatio-temporal anomaly detection. In International Conference on Image Analysis and Processing, pages 174–184, 2017.
[84] M. Tveten. Which principal components are most sensitive to distributional changes? arXiv preprint arXiv:1905.06318, 2019.
[85] V. Vapnik, S. E. Golowich, and A. J. Smola. Support vector method for function approximation, regression estimation and signal processing. In NeurIPS, pages 281–287, 1997.
[86] K. R. Varshney. Trustworthy machine learning and artificial intelligence. ACM Crossroads, 25(3):26–29, 2019.
[87] K. R. Varshney and H. Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big Data, 5(3):246–255, 2017.
[88] H. Wang and Z. Abraham. Concept drift detection for imbalanced stream data. CoRR, abs/1504.01044, 2015.
[89] G. Wassermann and Z. Su. Sound and precise analysis of web applications for injection vulnerabilities. In PLDI, pages 32–41, 2007.
[90] L. White and E. I. Cohen. A domain strategy for computer program testing. IEEE Trans. Software Engineering, SE-6(3):247–257, 1980.
[91] J. N. Yan, O. Schulte, M. Zhang, J. Wang, and R. Cheng. SCODED: statistical constraint oriented data error detection. In SIGMOD, pages 845–860, 2020.
[92] S. Yu, Z. Abraham, H. Wang, M. Shah, Y. Wei, and J. C. Príncipe. Concept drift detection and adaptation with hierarchical hypothesis testing. J. Franklin Institute, 356(5):3187–3215, 2019.
[93] Y. Zhang, Z. Guo, and T. Rekatsinas. A statistical perspective on discovering functional dependencies in noisy data. In SIGMOD, pages 861–876, 2020.
A SYSTEM PARAMETERS
Our technique for deriving the (unnormalized) importance factor γ_k, for the bounded-projection constraint on projection F_k, uses the mapping 1/(1 + σ(F_k(D))). This mapping correctly translates our principles for quantifying violation by putting high weight on conformance constraints constructed from low-variance projections, and low weight on conformance constraints constructed from high-variance projections. While this mapping works extremely well across a large set of applications (including the ones shown in our experimental results), our quantitative semantics are not limited to any specific mapping. In fact, the function to compute importance factors for bounded-projections can be user-defined (but we do not require it from the user). Specifically, a user can plug in any custom function to derive the (unnormalized) importance factors. Furthermore, our technique to compute the bounds lb and ub can also be customized (but again, we do not require it from the user). Depending on the application requirements, one can apply techniques used in the machine learning literature (e.g., cross-validation) to tighten or loosen the conformance constraints by tuning these parameters. However, we found our technique for deriving these parameters, even without any cross-validation, to be very effective in most practical applications.

B PROOF OF LEMMA 11
Proof. Pick β1, β2 s.t. β1² + β2² = 1 and

sign(ρ_{F1,F2}) β1 σ(F1(D)) + β2 σ(F2(D)) = 0.    (4)

Let t be any tuple that is incongruous w.r.t. ⟨F1, F2⟩, and let F = β1 F1 + β2 F2. Now, we compute how far t is from the mean of the projection F on D:

|ΔF(t)| = |F(t) − μ(F(D))|
        = |β1 F1(t) + β2 F2(t) − μ(β1 F1(D) + β2 F2(D))|
        = |β1 ΔF1(t) + β2 ΔF2(t)|
        = |β1 ΔF1(t)| + |β2 ΔF2(t)|

The last step is correct only when β1 ΔF1(t) and β2 ΔF2(t) have the same sign. We prove this by cases:

(Case 1) ρ_{F1,F2} ≥ 1/2. In this case, β1 and β2 have different signs due to Equation 4. Moreover, since t is incongruous w.r.t. ⟨F1, F2⟩, ΔF1(t) and ΔF2(t) have different signs. Hence, β1 ΔF1(t) and β2 ΔF2(t) have the same sign.

(Case 2) ρ_{F1,F2} ≤ −1/2. In this case, β1 and β2 have the same sign due to Equation 4. Moreover, since t is incongruous w.r.t. ⟨F1, F2⟩, ΔF1(t) and ΔF2(t) have the same sign. Hence, β1 ΔF1(t) and β2 ΔF2(t) have the same sign.

Next, we compute the variance of F on D:

σ²(F(D)) = (1/|D|) Σ_{t∈D} (β1 ΔF1(t) + β2 ΔF2(t))²
         = β1² σ²(F1(D)) + β2² σ²(F2(D)) + 2 β1 β2 ρ_{F1,F2} σ(F1(D)) σ(F2(D))
         = β1² σ²(F1(D)) + β2² σ²(F2(D)) − 2 β1² |ρ_{F1,F2}| σ²(F1(D))
         = 2 β1² σ²(F1(D)) (1 − |ρ_{F1,F2}|)

Hence, σ(F(D)) = √(2(1 − |ρ_{F1,F2}|)) |β1| σ(F1(D)), which is also equal to √(2(1 − |ρ_{F1,F2}|)) |β2| σ(F2(D)). Since √(2(1 − |ρ_{F1,F2}|)) ≤ 1 and |β_k| < 1, we conclude that σ(F(D)) < σ(F_k(D)).

Now, we bound |ΔF(t)| / σ(F(D)) using the above-derived facts about |ΔF(t)| and σ(F(D)):

|ΔF(t)| / σ(F(D)) > |β1 ΔF1(t)| / (√(2(1 − |ρ_{F1,F2}|)) |β1| σ(F1(D)))
                  = |ΔF1(t)| / (√(2(1 − |ρ_{F1,F2}|)) σ(F1(D)))
                  ≥ |ΔF1(t)| / σ(F1(D))

The last step uses the fact that |ρ_{F1,F2}| ≥ 1/2. Similarly, we also get |ΔF(t)| / σ(F(D)) > |ΔF2(t)| / σ(F2(D)). Hence, φ_F is stronger than both φ_{F1} and φ_{F2} on D, using Lemma 5. This completes the proof. □

C PROOF OF THEOREM 12
Proof. First, we use Lemma 11 on F_i, F_j to construct F. We initialize I := {i, j}. Next, we repeatedly do the following: we iterate over all F_k, where k ∉ I, and check if |ρ_{F,F_k}| ≥ 1/2 for some k. If yes, we use Lemma 11 (on F and F_k) to update F to be the new projection returned by the lemma. We update I := I ∪ {k}, and continue the iterations. If |ρ_{F,F_k}| < 1/2 for all k ∉ I, then we stop. The final F and index set I can easily be seen to satisfy the claims of the theorem. □

D PROOF OF THEOREM 13
We first provide some additional details regarding the statement of the theorem. Since standard deviation is not scale invariant, if there is no constraint on the norm of a linear projection, then the projection can be scaled down to make its standard deviation arbitrarily small. Therefore, claim (1) cannot hold for arbitrary linear projections, but only for linear projections whose 2-norm is not too "small". Hence, we restate the theorem with some additional technical conditions.

Given a numerical dataset $D$, let $\mathcal{F} = \{F_1, F_2, \ldots, F_K\}$ be the set of linear projections returned by Algorithm 1. Let $\sigma^* = \min_{k=1}^{K} \sigma(F_k(D))$. WLOG, assume that $\sigma^* = \sigma(F_1(D))$, where $F_1 = \vec{A}^T \vec{w}^*$. Assume that the attribute mean is zero for all attributes in $D$ (call this Condition 1). Then:

(1) $\sigma^* \le \sigma(F(D))$ for every linear projection $F = \vec{A}^T \vec{w}$ whose 2-norm is sufficiently large, i.e., we require $\|\vec{w}\| \ge \|\vec{w}^*\|$. If we do not assume Condition 1, the requirement changes to $\|\vec{w}_e\| \ge \|\vec{w}^*_e\|$, where $\vec{w}_e = \begin{bmatrix} -\mu(D^T \vec{w}) \\ \vec{w} \end{bmatrix}$ and $\vec{w}^*_e$ is the vector constructed by augmenting a dimension to $\vec{w}^*$ to turn it into an eigenvector of $D_e^T D_e$, where $D_e = [\vec{1} \;\; D^T]$.

(2) $\forall F_j, F_k \in \mathcal{F}$ s.t. $F_j \ne F_k$: $\rho_{F_j,F_k} = 0$. If we do not assume Condition 1, then the correlation coefficient is close to $0$ for those $F_j, F_k$ whose corresponding eigenvalues are much smaller than $|D|$.

Proof. The proof uses the following facts:

(Fact 1) If we add a constant $c$ to each element of a set $S$ of real values to obtain a new set $S'$, then $\sigma(S) = \sigma(S')$.

(Fact 2) The Courant-Fischer min-max theorem [35] states that, for any matrix $M$, the vector $\vec{w}$ that minimizes $\|M\vec{w}\|/\|\vec{w}\|$ is the eigenvector of $M^T M$ corresponding to the lowest eigenvalue.

(Fact 3) With $D_e := [\vec{1} \;\; D^T]$ (the data matrix with tuples as rows, augmented with a column of ones), by definition, $\sigma(D^T \vec{w}) = \frac{\|D_e \vec{w}_e\|}{\sqrt{|D|}}$, where $\vec{w}_e = \begin{bmatrix} -\mu(D^T \vec{w}) \\ \vec{w} \end{bmatrix}$.

(Fact 4) By the definition of variance, $\sigma^2(S) = \frac{\|S\|^2}{|S|} - \mu^2(S)$.

Let $F = \vec{A}^T \vec{w}$ be an arbitrary linear projection. Since $D$ is numerical, $D = D_N$. (We use the subscript $e$ to denote the augmented vector/matrix.) Then:

$\sigma(D^T \vec{w}) = \sigma(D^T \vec{w} - \mu\vec{1})$ (Fact 1), where $\mu = \mu(D^T \vec{w})$
$= \frac{\|D_e \vec{w}_e\|}{\sqrt{|D|}}$ (Fact 3), where $\vec{w}_e = \begin{bmatrix} -\mu \\ \vec{w} \end{bmatrix}$
$\ge \frac{\|D_e \vec{w}^*_e\| \cdot \|\vec{w}_e\|}{\sqrt{|D|} \cdot \|\vec{w}^*_e\|}$ (Fact 2)
$= \sqrt{\sigma^2(D_e \vec{w}^*_e) + b^2} \cdot \frac{\|\vec{w}_e\|}{\|\vec{w}^*_e\|}$ (Fact 4), where $b = \mu(D_e \vec{w}^*_e)$
$= \sqrt{\sigma^2(D^T \vec{w}^* + c\vec{1}) + b^2} \cdot \frac{\|\vec{w}_e\|}{\|\vec{w}^*_e\|}$ (expanding $D_e \vec{w}^*_e$, where $c$ is the augmented component of $\vec{w}^*_e$)
$= \sqrt{\sigma^2(D^T \vec{w}^*) + b^2} \cdot \frac{\|\vec{w}_e\|}{\|\vec{w}^*_e\|}$ (Fact 1)
$= \sqrt{(\sigma^*)^2 + b^2} \cdot \frac{\|\vec{w}_e\|}{\|\vec{w}^*_e\|}$ (definition of $\sigma^*$)
$\ge \sigma^*$ (by the assumption $\|\vec{w}_e\|/\|\vec{w}^*_e\| \ge 1$)

For the last step, we use the technical condition that the norm of the extension of $\vec{w}$ (extended by augmenting the negated mean of $D^T \vec{w}$) is at least as large as the norm of the extension of $\vec{w}^*$ (extended to make it an eigenvector of $D_e^T D_e$). When Condition 1 holds, $\|\vec{w}_e\| = \|\vec{w}\|$ (because $\mu(F(D))$ is $0$, and therefore $\vec{w}_e = \begin{bmatrix} 0 \\ \vec{w} \end{bmatrix}$), and likewise $\|\vec{w}^*_e\| = \|\vec{w}^*\|$; so the requirement $\|\vec{w}_e\|/\|\vec{w}^*_e\| \ge 1$ reduces to $\|\vec{w}\| \ge \|\vec{w}^*\|$. This proves claim (1).

For claim (2), let $F_i = \vec{A}^T \vec{w}_i$ for all $i$, where $\vec{w}_i$ are the coefficients of the linear projection $F_i$, and let $c_i = \mu(F_i(D))$.

(Fact 5) If Condition 1 holds, then $c_i = 0$ for all $i$.

Since the $F_i$'s are returned by Algorithm 1, we know that $\vec{w}_i$ can be extended to an eigenvector $\begin{bmatrix} d_i \\ \vec{w}_i \end{bmatrix}$ of $D_e^T D_e$ (with corresponding eigenvalue $\lambda_i$). In general,

(Fact 6) it is easy to work out that $d_i = \frac{-c_i}{1 - \lambda_i/|D|}$.

Thus, we have:

$\rho_{F_j,F_k} = \frac{\sum_{t \in D} \Delta F_j(t)\, \Delta F_k(t)}{|D|\, \sigma(F_j(D))\, \sigma(F_k(D))}$ (definition of $\rho$)
$= \frac{(D^T \vec{w}_j - c_j\vec{1})^T (D^T \vec{w}_k - c_k\vec{1})}{|D|\, \sigma(F_j(D))\, \sigma(F_k(D))}$
$= \frac{(D_e \vec{w}_{e_j})^T (D_e \vec{w}_{e_k})}{|D|\, \sigma(F_j(D))\, \sigma(F_k(D))}$, where $\vec{w}_{e_i} = \begin{bmatrix} -c_i \\ \vec{w}_i \end{bmatrix}$
$= \frac{\vec{w}_{e_j}^T D_e^T D_e \vec{w}_{e_k}}{|D|\, \sigma(F_j(D))\, \sigma(F_k(D))}$
$= \frac{\vec{w}_{e_j}^T \lambda_k \vec{w}_{e_k}}{|D|\, \sigma(F_j(D))\, \sigma(F_k(D))}$ (Facts 5 and 6: under Condition 1, $d_k = -c_k = 0$, so $\vec{w}_{e_k}$ is an eigenvector and $D_e^T D_e \vec{w}_{e_k} = \lambda_k \vec{w}_{e_k}$)
$= 0$,

since eigenvectors of the symmetric matrix $D_e^T D_e$ corresponding to distinct eigenvalues are orthogonal. If Condition 1 does not hold but $|\lambda_i/|D||$ is close to $0$, then $d_i$ is close to $-c_i$, each $\vec{w}_{e_i}$ is close to an eigenvector, and $\rho_{F_j,F_k}$ is close to $0$. □

E PROOF OF PROPOSITION 17
Proof. We show that the conformance constraint

$$\Phi := \forall f, g \in \mathcal{C} : f(D) = g(D) = Y \Rightarrow f(t) - g(t) = 0$$

is the required conformance constraint over $D$ for detecting tuples that are unsafe with respect to $\mathcal{C}$ and $[D; Y]$.

First, we claim that $\Phi$ is a conformance constraint for $D$; that is, every tuple in $D$ satisfies $\Phi$. Consider any $t' \in D$. We need to prove that $f(t') = g(t')$ for all $f, g \in \mathcal{C}$ s.t. $f(D) = g(D) = Y$. Since $t' \in D$ and $f(D) = g(D)$, it follows that $f(t') = g(t')$. This shows that $\Phi$ is satisfied by every tuple in $D$.

Next, we claim that $\Phi$ is violated by exactly those tuples that are unsafe w.r.t. $\mathcal{C}$ and $[D; Y]$. Consider any $t'$ such that $\neg\Phi(t')$. By the definition of $\Phi$, there exist $f, g \in \mathcal{C}$ s.t. $f(D) = g(D) = Y$ but $f(t') \ne g(t')$. This is equivalent to saying that $t'$ is unsafe, by definition. □

F MOTIVATION FOR DISJUNCTIVE CONFORMANCE CONSTRAINTS
We now provide an example motivating the need for disjunctive conformance constraints.

Example 18. The PCA-based approach fails when different piecewise linear trends exist within the data. If we apply PCA to learn conformance constraints over the entire dataset of Fig. 9(a), it will learn two low-quality constraints with very high variance. In contrast, partitioning the dataset into three partitions (Fig. 9(b)) and then learning constraints separately over each partition yields a significant improvement in the quality of the learned constraints.
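To make the contrast concrete, here is a minimal sketch in Python (a synthetic piecewise-linear dataset stands in for Fig. 9, and a fixed split on the X attribute stands in for the partitioner): the lowest-variance projection learned globally is far looser than the projections learned per partition.

```python
import numpy as np

# Sketch: piecewise-linear data (as in Fig. 9) where one global PCA
# constraint is loose, but per-partition constraints are tight.
rng = np.random.default_rng(2)

def lowest_sigma(X):
    """Standard deviation of the lowest-variance principal projection of X."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))  # ascending order
    return np.sqrt(max(eigvals[0], 0.0))

x = rng.uniform(0, 3, size=900)
slope = np.where(x < 1, 1.0, np.where(x < 2, -1.0, 1.0))  # three linear pieces
y = slope * x + 0.01 * rng.normal(size=900)
D = np.column_stack([x, y])

global_sigma = lowest_sigma(D)
part_sigmas = [lowest_sigma(D[(x >= lo) & (x < hi)])
               for lo, hi in [(0, 1), (1, 2), (2, 3)]]

# Partition-local constraints are far tighter than the single global one.
assert max(part_sigmas) < global_sigma
```

Each partition's lowest-variance projection has a standard deviation near the noise level, while the global projection must absorb all three conflicting trends at once.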
G IMPLICATION FOR UNSAFE TUPLES
Here, we provide justification for our definition of unsafe tuples.

Proposition 19. If $t \in \mathrm{Dom}^m$ is an unsafe tuple w.r.t. $\mathcal{C}$ and $[D; Y]$, then for any $f \in \mathcal{C}$ s.t. $f(D) = Y$, there exists a $g \in \mathcal{C}$ s.t. $g(D) = Y$ but $f(t) \ne g(t)$.

Proof. By the definition of unsafe tuples, there exist $g, g' \in \mathcal{C}$ s.t. $g(D) = g'(D) = Y$ but $g(t) \ne g'(t)$. Now, given a function $f \in \mathcal{C}$ s.t. $f(D) = Y$, the value $f(t)$ can be equal to at most one of $g(t)$ and $g'(t)$. WLOG, say $f(t) \ne g(t)$. Then we have found a function $g$ s.t. $g(D) = Y$ but $f(t) \ne g(t)$, which completes the proof. □

Note that even when we mark $t$ as unsafe, it is possible that the learned model makes the correct prediction on that tuple. However, there is a good chance that it makes a different prediction; hence, it is useful to be aware of unsafe tuples.

Conformance Constraints as Preconditions for Trusted Machine Learning
Let $\mathcal{C}$ denote a class of functions. Given a dataset $D$, suppose that a tuple $t$ is unsafe. This means that there exist $f, g \in \mathcal{C}$ s.t. $f(t) \ne g(t)$ but $f(D) = g(D)$. Now, consider the logical claim that $f(D) = g(D)$. Clearly, $f$ is not identical to $g$, since $f(t) \ne g(t)$. Therefore, there is a nontrivial "proof" (in some logic) of the fact that "for all tuples $t \in D$: $f(t) = g(t)$". This proof will use some properties of $D$; let $\phi$ be the formula denoting these facts. If $\phi_D$ is the characteristic function for $D$, then the above argument can be written as

$$\phi_D(\vec{A}) \Rightarrow \phi(\vec{A}), \quad\text{and}\quad \phi(\vec{A}) \Rightarrow f(\vec{A}) = g(\vec{A}),$$

where $\Rightarrow$ denotes logical implication. In words, $\phi$ is a conformance constraint for $D$, and it serves as a precondition in the "correctness proof" showing that (a potentially machine-learned model) $f$ is equal to (a potential ground truth) $g$. If a tuple $t$ fails to satisfy the precondition $\phi$, then it is possible that the prediction of $f$ on $t$ will not match the ground truth $g(t)$.

Example 20. Let $D = \{(1,1), (2,2), (3,3)\}$ be a dataset with two attributes $A_1, A_2$, and let the outputs $Y$ be $1$, $2$, and $3$, respectively. Let $\mathcal{C} \subseteq ((\mathbb{R} \times \mathbb{R}) \mapsto \mathbb{R})$ be the class of linear functions over the two variables $A_1$ and $A_2$. Consider a new tuple $(4, 5)$. It is unsafe, since there exist two different functions, namely $f(A_1, A_2) = A_1$ and $g(A_1, A_2) = \frac{A_1 + A_2}{2}$, that agree with each other on $D$ but disagree on $(4, 5)$. In contrast, $(4, 4)$ is not unsafe, because there is no function in $\mathcal{C}$ that maps $D$ to $Y$ but produces an output different from $4$. We apply Proposition 17 to the sets $D$, $Y$, and $\mathcal{C}$. Here, $\mathcal{C}$ is the set of all linear functions given by $\{\alpha A_1 + (1 - \alpha) A_2 \mid \alpha \in \mathbb{R}\}$. The conformance constraint $\Phi$, whose negation characterizes the unsafe tuples w.r.t. $\mathcal{C}$ and $[D; Y]$, is $\forall \alpha_1, \alpha_2 : \alpha_1 A_1 + (1 - \alpha_1) A_2 = \alpha_2 A_1 + (1 - \alpha_2) A_2$, which is equivalent to $A_1 = A_2$.
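A minimal sketch of this unsafe-tuple check follows. The concrete values (a dataset satisfying $A_1 = A_2$ with labels $Y = A_1$, and the class of affine combinations $\alpha A_1 + (1-\alpha)A_2$) are one consistent instantiation of the example, assumed here for illustration; the finite set of sampled $\alpha$ values stands in for quantifying over the whole class.

```python
# Sketch of the unsafe-tuple check: a tuple is unsafe when two members of
# the class C agree on the labeled dataset [D; Y] but disagree on it.
# Assumed toy instantiation: D satisfies A1 = A2, Y equals A1, and
# C = { a*A1 + (1-a)*A2 : a real }.
D = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
Y = [1.0, 2.0, 3.0]

def model(a):
    return lambda t: a * t[0] + (1 - a) * t[1]

def fits(f):
    """Does f reproduce the labels Y on all of D?"""
    return all(abs(f(t) - y) < 1e-9 for t, y in zip(D, Y))

def is_unsafe(t, alphas=(0.0, 0.5, 1.0, 2.0, -1.0)):
    """Sampled stand-in for: exists f, g in C with f(D)=g(D)=Y, f(t)!=g(t)."""
    preds = {round(model(a)(t), 9) for a in alphas if fits(model(a))}
    return len(preds) > 1

assert is_unsafe((4.0, 5.0))       # A1 != A2: fitting models disagree
assert not is_unsafe((4.0, 4.0))   # A1 == A2: every fitting model agrees
```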
Figure 9: Learning PCA-based constraints globally results in low-quality constraints when data satisfies strong local constraints. (a) PCA. (b) Disjoint PCA.
Sufficient Check for Unsafe Tuples
In practice, finding conformance constraints that are both necessary and sufficient for detecting whether a tuple is unsafe is difficult. Hence, we focus on weaker constraints whose violation is sufficient, but not necessary, to classify a tuple as unsafe. Such constraints yield a procedure that can have false negatives (it may fail to detect some unsafe tuples) but no false positives (it never identifies a tuple as unsafe when it is not).
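A sketch of such a sufficient check, using the if-then-else construction from the proof of Theorem 22 below on a toy setting (the dataset, labels, constraint $F(A_1, A_2) = A_1 - A_2$, and model $f$ are illustrative assumptions): a tuple violating the equality constraint lets us build a second model $g$ that agrees with $f$ on all of $D$ but not on that tuple.

```python
# Sketch: the equality constraint F(A1, A2) = A1 - A2 = 0 holds on D, and
# a model f with f(D) = Y is transformed, via the if-then-else trick, into
# a g that also maps D to Y yet disagrees with f on a violating tuple.
# (The relevance condition of Theorem 22 is what guarantees g stays in the
# model class; here we check only the agreement/disagreement behavior.)
D = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
Y = [1.0, 2.0, 3.0]

F = lambda t: t[0] - t[1]          # F(D) = {0}
f = lambda t: t[0]                 # a model with f(D) = Y
t1 = D[0]                          # a tuple of D with f(t1) = y1

def ite(r, t_c, t):
    # Returns t_c when r == 1 and t when r == 0 (free otherwise),
    # realized as t + r*(t_c - t), assuming tuples admit arithmetic.
    return tuple(ti + r * (ci - ti) for ti, ci in zip(t, t_c))

def make_g(t_new):
    # g(tau) = f(ite(F(tau)/F(t_new), t1, tau)); requires F(t_new) != 0.
    return lambda tau: f(ite(F(tau) / F(t_new), t1, tau))

t_new = (4.0, 5.0)                 # F(t_new) = -1 != 0: violates the constraint
g = make_g(t_new)

assert [g(t) for t in D] == Y      # g agrees with f on all of D...
assert g(t_new) != f(t_new)        # ...but disagrees on the violating tuple
```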
Model Transformation using Equality Constraints.
For certain conformance constraints, we can prove that a constraint violation by $t$ implies that $t$ is unsafe, by showing that those constraints can transform a model $f$ that works on $D$ into a different model $g$ that also works on $D$, but with $f(t) \ne g(t)$. We claim that equality constraints (of the form $F(\vec{A}) = 0$) are useful in this regard. First, we make this point using the scenario of Example 20.

Example 21. Consider the class $\mathcal{C}$ of functions and the annotated dataset $[D; Y]$ from Example 20. The two functions $f$ and $g$, where $f(A_1, A_2) = A_1$ and $g(A_1, A_2) = \frac{A_1 + A_2}{2}$, are equal when restricted to $D$; that is, $f(D) = g(D)$. What property of $D$ suffices to prove $f(A_1, A_2) = g(A_1, A_2)$, i.e., $A_1 = \frac{A_1 + A_2}{2}$? It is $A_1 = A_2$. Going the other way, if we have $A_1 = A_2$, then $f(A_1, A_2) = A_1 = \frac{A_1 + A_2}{2} = g(A_1, A_2)$. Therefore, we can use the equality constraint $A_1 = A_2$ to transform the model $f$ into the model $g$ in such a way that $g$ continues to match the behavior of $f$ on $D$. Thus, an equality constraint can be exploited to produce multiple different models starting from one given model. Moreover, if $t$ violates the equality constraint, then the models $f$ and $g$ do not agree on their prediction on $t$; for example, this happens for $t = (4, 5)$.

More generally, let $F(\vec{A}) = 0$ be an equality constraint for $D$. If a learned model $f$ returns a real number, then it can be transformed into another model $f + F$, which agrees with $f$ only on tuples $t$ where $F(t) = 0$. Thus, in the presence of equality constraints, a learner can return $f$ or its transformed version $f + F$ (if both models are in the class $\mathcal{C}$). This condition is a "relevancy" condition, saying that $F$ is "relevant" for the class $\mathcal{C}$. If the model does not return a real number, we can still use equality constraints to modify the model under some assumptions that include "relevancy" of the constraint.

A Theorem for Sufficient Check for Unsafe Tuples.
We first formalize the notions of nontrivial datasets (annotated datasets in which at least two output labels differ) and relevant constraints (constraints that can be used to transform models in a class into other models in the same class).
Nontrivial. An annotated dataset $[D; Y]$ is nontrivial if there exist $i, j$ s.t. $y_i \ne y_j$.

Relevant.
A constraint $F(\vec{A}) = 0$ is relevant to a class $\mathcal{C}$ of models if, whenever $f \in \mathcal{C}$, the function $\lambda t : f(\mathsf{ite}(\alpha F(t), t_c, t))$ is also in $\mathcal{C}$, for any constant tuple $t_c$ and real number $\alpha$. The if-then-else function $\mathsf{ite}(r, t_c, t)$ returns $t_c$ when $r = 1$, returns $t$ when $r = 0$, and is free to return anything otherwise. If tuples admit addition, subtraction, and scaling, then one such if-then-else function is $t + r \cdot (t_c - t)$.

We now state a sufficient condition for identifying unsafe tuples.

Theorem 22 (Sufficient Check for Unsafe Tuples). Let $[D; Y] \subset \mathrm{Dom}^m \times \mathrm{coDom}$ be an annotated dataset, $\mathcal{C}$ be a class of functions, and $F$ be a projection on $\mathrm{Dom}^m$ s.t.
A1. $F(\vec{A}) = 0$ is a strict constraint for $D$,
A2. $F(\vec{A}) = 0$ is relevant to $\mathcal{C}$,
A3. $[D; Y]$ is nontrivial, and
A4. there exists $f \in \mathcal{C}$ s.t. $f(D) = Y$.
For $t \in \mathrm{Dom}^m$, if $F(t) \ne 0$, then $t$ is unsafe.

Proof. WLOG, let $t_1, t_2$ be two tuples in $D$ s.t. $y_1 \ne y_2$ (A3). Since $f(D) = Y$ (A4), it follows that $f(t_1) = y_1 \ne y_2 = f(t_2)$. Let $t$ be a new tuple s.t. $F(t) \ne 0$. Clearly, $f(t)$ cannot be equal to both $y_1$ and $y_2$; WLOG, suppose $f(t) \ne y_1$. Now, consider the function $g$ defined by $\lambda \tau : f(\mathsf{ite}(F(\tau)/F(t), t_1, \tau))$. By (A2), we know that $g \in \mathcal{C}$. Note that $g(D) = Y$, since for any tuple $t_i \in D$ we have $F(t_i) = 0$, and hence $g(t_i) = f(\mathsf{ite}(0, t_1, t_i)) = f(t_i) = y_i$. Thus, we have two models, $f$ and $g$, s.t. $f(D) = g(D) = Y$. To prove that $t$ is an unsafe tuple, we have to show that $f(t) \ne g(t)$. Note that $g(t) = f(\mathsf{ite}(1, t_1, t)) = f(t_1) = y_1$ (by the definition of $g$). Since we already had $f(t) \ne y_1$, it follows that $f(t) \ne g(t)$. This completes the proof. □

We caution that our definition of unsafe is liberal: the existence of even one pair of functions $f, g$ that differ on $t$ but agree on the training set $D$ is sufficient to classify $t$ as unsafe. It ignores issues related to the probability that a learning procedure actually finds such models. Our intended use of Theorem 22 is to guide the choice of the class of constraints, given the class $\mathcal{C}$ of models, so that we can use the violation of a constraint in that class as an indication for caution. For most classes of models, linear arithmetic constraints are relevant. Our formal development has so far ignored the fact that data (in machine learning applications) is noisy, so exact equality constraints are unlikely to exist. However, the development above extends to the noisy case by replacing exact equality with approximate equality. For example, when learning from a dataset $D$ and ground truth $f'$, we may not always learn an $f$ that exactly matches $f'$ on $D$, but only one that is close (in some metric) to $f'$. Similarly, equality constraints need not require $F(t) = 0$ for all $t \in D$, but only $F(t) \approx 0$.

Consider the annotated dataset $[D; Y]$ and the class $\mathcal{C}$ from Example 21. Consider the equality constraint $F(A_1, A_2) = 0$, where the projection $F$ is defined as $F(A_1, A_2) = A_1 - A_2$. Clearly, $F(D) = \{0\}$, and hence $F(A_1, A_2) = 0$ is a constraint for $D$. The constraint is also relevant to the class $\mathcal{C}$ of linear models. Clearly, $[D; Y]$ is nontrivial, since $y_1 = 1 \ne 2 = y_2$. Also, there exists $f \in \mathcal{C}$ (e.g., $f(A_1, A_2) = A_1$) s.t. $f(D) = Y$. Now, consider the tuple $t = (4, 5)$. Since $F(t) = -1 \ne 0$, Theorem 22 implies that $t$ is unsafe.

SQL Check Constraints
Because the conformance language that expresses conformance constraints is simple, the constraints can easily be enforced as SQL check constraints to prevent the insertion of unsafe tuples into a database.
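For instance, a conformance constraint over a projection can be attached as a CHECK clause so that the database itself rejects non-conforming tuples at insertion time. The sketch below uses SQLite via Python; the table schema, the projection $AT - DT - DUR$ (echoing Example 3), and the bounds are illustrative assumptions.

```python
import sqlite3

# Sketch: a conformance constraint over the projection AT - DT - DUR,
# with hypothetical bounds, enforced as a SQL CHECK constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flights (
        DT  REAL,   -- departure time
        AT  REAL,   -- arrival time
        DUR REAL,   -- scheduled duration
        CHECK (AT - DT - DUR BETWEEN -5 AND 5)
    )
""")
conn.execute("INSERT INTO flights VALUES (0.0, 3.0, 3.0)")    # conforming

try:
    conn.execute("INSERT INTO flights VALUES (0.0, 20.0, 3.0)")  # violates
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

assert rejected
assert conn.execute("SELECT COUNT(*) FROM flights").fetchone()[0] == 1
```

The non-conforming tuple never reaches the table; only the conforming row is stored.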
H APPLICATIONS OF CONFORMANCE CONSTRAINTS
In database systems, conformance constraints can be used to detect changes in data and query workloads, which can help in database tuning [47]. They have applications in data cleaning (error detection and missing-value imputation): the violation score serves as a measure of error, and missing values can be imputed by exploiting the relationships among attributes that conformance constraints capture. Conformance constraints can detect outliers by exposing tuples that significantly violate them. Another interesting data management application is data-diff [76] for exploring differences between two datasets: our disjunctive constraints can explain how different partitions of two datasets vary.

In machine learning, conformance constraints can be used to suggest when to retrain a machine-learned model. Further, given a pool of machine-learned models and their corresponding training datasets, we can use conformance constraints to synthesize a new model for a new dataset. A simple way to achieve this is to pick the model such that the constraints learned from its training data are minimally violated by the new dataset. Finally, identifying non-conforming tuples is analogous to input validation, which performs sanity checks on an input before an application processes it.
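The model-selection idea above can be sketched as follows. The single lowest-variance projection envelope is a simplified stand-in for the full constraint-discovery algorithm, and the model names, pools, and datasets are synthetic assumptions.

```python
import numpy as np

# Sketch: pick a model from a pool by learning a lowest-variance projection
# envelope from each model's training data, then choosing the model whose
# envelope the new dataset violates least.
def learn_envelope(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc)     # ascending eigenvalue order
    w = vecs[:, 0]                          # lowest-variance projection
    p = X @ w
    return w, p.mean(), p.std()

def avg_violation(X, env, k=3.0):
    w, mu, sd = env
    return np.maximum(np.abs(X @ w - mu) - k * sd, 0.0).mean()

def pick_model(new_data, pools):
    """pools: dict mapping model name -> that model's training dataset."""
    envs = {name: learn_envelope(X) for name, X in pools.items()}
    return min(envs, key=lambda name: avg_violation(new_data, envs[name]))

rng = np.random.default_rng(6)
a, b = rng.normal(size=300), rng.normal(size=300)
pools = {
    "model_sum":  np.column_stack([a, b, a + b + 0.01 * rng.normal(size=300)]),
    "model_diff": np.column_stack([a, b, a - b + 0.01 * rng.normal(size=300)]),
}
a2, b2 = rng.normal(size=100), rng.normal(size=100)
serving = np.column_stack([a2, b2, a2 + b2 + 0.01 * rng.normal(size=100)])

# The serving data obeys A3 ~ A1 + A2, so "model_sum" is selected.
assert pick_model(serving, pools) == "model_sum"
```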
I VISUALIZATION OF LOCAL DRIFT
When the dataset contains instances from multiple classes, drift may be only local, not global. Fig. 10 demonstrates such a scenario for the 4CR dataset from the EVL benchmark. If we ignore the color/shape of the tuples, we do not observe any significant drift across the time steps.
J MORE DATA-DRIFT EXPERIMENTS
Inter-activity drift.
Similar to inter-person constraint violation, we also compute inter-activity constraint violation over the HAR dataset (Fig. 11). Note the asymmetry of the violation scores between activities: e.g., running violates the constraints of standing much more than the other way around. A closer look reveals that all mobile activities violate all sedentary activities more than the other way around. This is because the mobile activities behave as a "safety envelope" for the sedentary activities: while a person walks, they also stand (for brief moments), but the opposite does not happen.
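The asymmetry can be reproduced on synthetic data. A simplified per-group envelope (mean plus or minus a few standard deviations of the lowest-variance projection) stands in for the learned conformance constraints, and the group names are only an analogy to the HAR activities.

```python
import numpy as np

# Sketch of an asymmetric cross-violation score: a wide-ranging group
# (cf. walking) covers a narrow one (cf. standing), but not vice versa.
rng = np.random.default_rng(3)

def envelope(X):
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc)
    w = vecs[:, 0]                         # lowest-variance projection
    p = X @ w
    return w, p.mean(), p.std()

def violation(X_serve, X_train, k=3.0):
    """Average amount by which X_serve falls outside X_train's envelope."""
    w, mu, sd = envelope(X_train)
    excess = np.maximum(np.abs(X_serve @ w - mu) - k * sd, 0.0)
    return excess.mean()

narrow = rng.normal(size=(400, 2)) * [1.0, 0.05]   # "standing"
wide = rng.normal(size=(400, 2)) * [1.0, 1.0]      # "walking"

# The wide activity violates the narrow one's constraints far more.
assert violation(wide, narrow) > violation(narrow, wide)
```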
K EXPLAINING NON-CONFORMANCE
When a serving dataset is determined to have sufficiently deviated or drifted from the training set, the next step often is to characterize the difference. A common way of characterizing these differences is to perform a causality or responsibility analysis to determine which attributes are most responsible for the observed drift (non-conformance). We use the violation values produced by conformance constraints, along with well-established principles of causality, to quantify responsibility for non-conformance.

Figure 10: Snapshots over time for the 4CR dataset with local drift. It reaches maximum drift from the initial distribution at time step 3 and returns to the initial distribution at time step 5.

Figure 11: Inter-activity constraint-violation heat map. Mobile activities violate the constraints of the sedentary activities more.
ExTuNe.
We built a tool, ExTuNe [24], on top of CCSynth to compute responsibility values as described next. Given a training dataset $D$ and a non-conforming tuple $t \in \mathrm{Dom}^m$, we measure the responsibility of the $i$-th attribute $A_i$ towards the non-conformance as follows: (1) We intervene on $t.A_i$ by altering its value to the mean of $A_i$ over $D$, obtaining the tuple $t^{(i)}$. (2) Starting from $t^{(i)}$, we compute how many additional attributes need to be altered to obtain a tuple with no violation. If $K$ additional attributes need to be altered, $A_i$ has responsibility $\frac{1}{K+1}$. (3) This responsibility value for each tuple $t$ is averaged over the entire serving dataset to obtain an aggregate responsibility value for $A_i$. Intuitively, for each tuple, we "fix" the value of $A_i$ with a "more typical" value and check how close (in terms of additional fixes required) this takes us to a conforming tuple. The larger the number of additional fixes required, the lower the responsibility of $A_i$.

Datasets.
We use four datasets for this evaluation: (1) Cardiovascular Disease [1] is a real-world dataset that contains information about cardiovascular patients, with attributes such as height, weight, cholesterol level, glucose level, and systolic and diastolic blood pressures. (2) Mobile Prices [4] is a real-world dataset that contains information about mobile phones, with attributes such as RAM, battery power, and talk time. (3) House Prices [3] is a real-world dataset that contains information about houses for sale, with attributes such as basement area, number of bathrooms, and year built. (4) LED (Light Emitting Diode) [12] is a synthetic benchmark. The dataset has a digit attribute, ranging from 0 to 9, 7 binary attributes (each representing one of the 7 LEDs relevant to the digit attribute), and 17 irrelevant binary attributes. This dataset introduces gradual concept drift every 25,000 rows.

Figure 12: Responsibility assignment over attributes for drift on (a) Cardiovascular Disease: trained on patients with no disease and served on patients with disease; (b) Mobile Prices: trained on cheap mobiles and served on expensive mobiles; and (c) House Prices: trained on houses with prices below a threshold and served on houses with prices above it.

Case studies.
ExTuNe produces bar charts of responsibility values, as depicted in Fig. 12. Figures 12(a), 12(b), and 12(c) show the explanation results for the Cardiovascular Disease, Mobile Prices, and House Prices datasets, respectively. For the Cardiovascular Disease dataset, the training and serving sets consist of data for patients without and with cardiovascular disease, respectively. For the House Prices and Mobile Prices datasets, the training and serving sets consist of houses and mobiles with prices below and above a certain threshold, respectively. The non-conformance responsibility bar charts yield many useful insights, such as: "abnormal (high or low) blood pressure is a key cause of non-conformance of patients with cardiovascular disease w.r.t. normal people", "RAM is a distinguishing factor between expensive and cheap mobiles", and "the reason for houses being expensive depends holistically on several attributes".

Fig. 12(d) shows a similar result on the LED dataset. Instead of one serving set, we use 20 serving sets (the first is also used as the training set to learn conformance constraints). We call each serving set a window, where each window contains 5,000 tuples. This dataset introduces gradual concept drift every 25,000 rows (5 windows) by making a subset of the LEDs malfunction. As one can clearly see, during the initial 5 windows, no drift is observed. In the next 5 windows, LED 4 and LED 5 start malfunctioning; in the 5 windows after that, LED 1 and LED 3 start malfunctioning; and so on.
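ExTuNe's responsibility computation can be sketched as follows. A greedy reset order stands in for finding the minimal number of additional fixes, and the conformance test and training means are toy stand-ins for the learned conformance constraints.

```python
import numpy as np

# Sketch of the responsibility assignment: replace attribute A_i of a
# non-conforming tuple by its training mean, count how many additional
# attributes K must also be reset before the tuple conforms, and assign
# responsibility 1 / (K + 1).
def responsibility(t, i, train_mean, conforms):
    t = np.array(t, dtype=float)
    t[i] = train_mean[i]                       # intervene on A_i
    if conforms(t):
        return 1.0                             # K = 0
    order = [j for j in range(len(t)) if j != i]
    for K, j in enumerate(order, start=1):     # greedy stand-in for minimal K
        t[j] = train_mean[j]
        if conforms(t):
            return 1.0 / (K + 1)
    return 1.0 / len(t)                        # all attributes were reset

# Toy conformance test: every attribute must lie within [-1, 1].
conforms = lambda t: bool(np.all(np.abs(t) <= 1.0))
mean = np.zeros(3)

assert responsibility((5.0, 0.0, 0.0), 0, mean, conforms) == 1.0   # K = 0
assert responsibility((5.0, 5.0, 0.0), 0, mean, conforms) == 0.5   # K = 1
```

Averaging these per-tuple scores over the serving set gives the aggregate bar heights of Fig. 12.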
L CONTRAST WITH PRIOR ART
Simple conformance constraints vs. least-squares techniques.
Note that the lowest-variance principal component of $[D_N; \vec{Y}]$ is related to the ordinary least squares (OLS) estimate, commonly known as linear regression, for predicting $\vec{Y}$ from $D_N$; but OLS minimizes error for the target attribute only. Our PCA-inspired approach is closer to total least squares (TLS), also known as orthogonal regression, which minimizes observational errors on all predictor attributes. However, TLS returns only the lowest-variance projection (Fig. 13(d)). In contrast, PCA offers multiple projections at once (Figs. 13(b), 13(c), and 13(d)) for a set of tuples (Fig. 13(a)); these range from low to high variance and have low mutual correlation (since they are orthogonal to each other). Intuitively, the conformance constraints constructed from all projections returned by PCA capture various aspects of the data, as they form a bounding hyper-box around the data tuples. To capture the relative importance of conformance constraints, the quantitative semantics weighs them inversely to the variances of their projections.
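The contrast can be checked numerically on synthetic data; `eigh` on the centered Gram matrix stands in for PCA. Both OLS and the lowest-variance principal direction recover the underlying linear trend, but PCA additionally returns the orthogonal, high-variance direction that yields the weaker bounding constraint.

```python
import numpy as np

# Sketch: OLS vs the TLS/PCA view on data near y = 2x.
rng = np.random.default_rng(5)
x = rng.normal(size=1000)
y = 2.0 * x + 0.1 * rng.normal(size=1000)
Z = np.column_stack([x, y])
Zc = Z - Z.mean(axis=0)

# OLS slope for predicting y from x (minimizes error on y only).
ols_slope = (Zc[:, 0] @ Zc[:, 1]) / (Zc[:, 0] @ Zc[:, 0])

# TLS slope from the lowest-variance principal direction w:
# the constraint w[0]*x + w[1]*y = const implies slope -w[0]/w[1].
eigvals, vecs = np.linalg.eigh(Zc.T @ Zc)   # ascending eigenvalue order
w = vecs[:, 0]
tls_slope = -w[0] / w[1]

assert abs(ols_slope - 2.0) < 0.05
assert abs(tls_slope - 2.0) < 0.05
# PCA also yields the orthogonal high-variance direction (vecs[:, 1]),
# i.e., a second, weaker conformance constraint that TLS does not return.
assert eigvals[1] > eigvals[0]
```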
If we try to express the compound constraint $\psi$ of Example 3 using the notation of traditional denial constraints [17] (under a closed-world semantics), where $M$ always takes values from {"May", "June", "July"}, we get the following (writing the numeric bounds of $\psi$ as $\ell_i$ and $u_i$):

$\Delta: \neg((M = \text{"May"}) \wedge \neg(\ell_1 \le AT - DT - DUR \le u_1))$
$\;\wedge\; \neg((M = \text{"June"}) \wedge \neg(\ell_2 \le AT - DT - DUR \le u_2))$
$\;\wedge\; \neg((M = \text{"July"}) \wedge \neg(\ell_3 \le AT - DT - DUR \le u_3))$

Note, however, that arithmetic expressions that specify a linear combination of numerical attributes (the fragment $AT - DT - DUR$, signifying a projection) are disallowed in denial constraints, which only allow raw attributes and constants within the constraints. Furthermore, existing techniques that compute denial constraints offer no mechanism to discover constraints involving such a composite attribute (projection). Under an open-world assumption, conformance constraints are more conservative, and therefore more suitable for certain tasks such as TML, than denial constraints. For example, a new tuple with $M = $ "August" satisfies the above constraint $\Delta$ but not the compound conformance constraint $\psi$ of Example 3.

Figure 13: (a) 3D view of a set of tuples projected onto the space of principal components (PCs). (b) The first PC gives the projection with the highest standard deviation and thus constructs the weakest conformance constraint, with a very broad range for its bounds. (c) The second PC gives a projection with moderate standard deviation and constructs a relatively stronger conformance constraint. (d) The third PC gives the projection with the lowest standard deviation and constructs the strongest conformance constraint.

Data profiling.
Conformance constraints, just like other constraint models, fall under the umbrella of data profiling using metadata [5]. There is extensive literature on data-profiling primitives that model relationships among data attributes, such as unique column combinations [30], functional dependencies (FDs) [59, 93] and their variants (metric [48], conditional [23], soft [38], approximate [36, 50], relaxed [16], etc.), differential dependencies [72], order dependencies [52, 77], inclusion dependencies [55, 60], denial constraints [13, 17, 53, 61], and statistical constraints [91]. However, none of them focuses on learning approximate arithmetic relationships that involve multiple numerical attributes in a noisy setting, which is the focus of our work.

Soft FDs [38] model correlation and generalize traditional FDs by allowing uncertainty, but they are limited to modeling relationships between a pair of attributes. Metric FDs [48] allow small variations in the data, but the existing work focuses on verification only, not on the discovery of metric FDs. Some variants of FDs [16, 36, 48, 50] consider a noisy setting, but they require the allowable noise parameters to be explicitly specified by the user, and determining the right settings for these parameters is nontrivial. Most existing approaches treat constraint violation as Boolean and do not measure the degree of violation. In contrast, we require no explicit noise parameter and provide a way to quantify the degree of violation of conformance constraints.

Conditional FDs [23] require FDs to hold conditionally (e.g., one FD may hold for US residents and a different one for Europeans). Denial constraints (DCs) are a universally quantified first-order-logic formalism [17] that can adjust to noisy data by adding predicates until the constraint becomes exact over the entire dataset. However, this can make DCs large, complex, and uninterpretable.
While approximate denial constraints [61] exist, they, like approximate-FD techniques, rely on the user to provide an error threshold.
Input validation.
Our work here contributes to, while also building upon, work from machine learning, programming languages, and software engineering. In software engineering, input validation has been used to improve reliability [90]; it is especially common in web applications, where static and dynamic analysis of the code that processes the input is used to detect vulnerabilities [89]. For monitoring deployed systems, a few prior works exploit constraints [40, 80]. To prevent unwanted outcomes, software systems use input-validation techniques [15, 29]. However, such mechanisms are usually implemented through deterministic rules or constraints that domain experts provide. In contrast, we learn conformance constraints in an unsupervised manner.
Trusted AI.
The trust, resilience, and interpretability of artificial intelligence (AI) systems have been themes of increasing interest recently [39, 67, 86], particularly for high-stakes and safety-critical data-driven AI systems [80, 87]. A standard way to decide whether to trust a classifier is to use the classifier-produced confidence score. However, unlike classifiers, regressors lack a natural way to produce such confidence scores. To evaluate model performance, regression diagnostics check whether the assumptions the model made during training remain valid for the serving data; however, they require knowledge of the ground truths for the serving data, which is often unavailable.
Data drift.
Prior work on data drift, change detection, and covariate shift [7, 14, 18, 19, 21, 22, 33, 34, 37, 44, 46, 70, 73] relies on modeling the data distribution, where change is detected when the distribution changes. However, the data distribution does not capture constraints, which are the primary focus of our work. Instead of detecting drift globally, only a handful of works model local concept drift [82] or drift in imbalanced data [88]. A few data-drift detection mechanisms rely on the availability of classification accuracy [11, 25, 26, 66] or classification "blindspots" [71]. Some works focus on adapting to change in data, i.e., learning in an environment where change is expected [11, 27, 57, 75, 92]. Such adaptive techniques are useful for obtaining better performance on specific tasks, but their goal is orthogonal to ours.
Representation learning, outlier detection, and one-class classification.
A few works [20, 31] related to our conformance-constraint-based approach use an autoencoder's [32, 65] input-reconstruction error to determine whether a new data point is out of distribution. Another mechanism [54] learns data assertions via autoencoders towards effective detection of invalid serving inputs. However, such approaches are task-specific and need a specific system (e.g., a deep neural network) to begin with. Our approach is similar to outlier-detection approaches [49] that define outliers as points deviating from a generating mechanism, such as local correlations. We also share similarity with one-class classification [79], where the training data contains tuples from only one class. In general, there is a clear gap between representation-learning approaches (which model data likelihood) [6, 32, 43, 65] and the (constraint-oriented) data-profiling techniques for addressing the problem of trusted AI. Our aim is to bridge this gap by introducing conformance constraints: more abstract, yet informative, descriptions of data, tailored towards characterizing trust in ML predictions.