An Epistemic Approach to the Formal Specification of Statistical Machine Learning
Yusuke Kawamoto
Abstract
We propose an epistemic approach to formalizing statistical properties of machine learning. Specifically, we introduce a formal model for supervised learning based on a Kripke model where each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalize various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we show relationships among properties of classifiers, and the relevance between classification performance and robustness. As far as we know, this is the first work that uses epistemic models and logical formulas to express statistical properties of machine learning, and would be a starting point to develop theories of formal specification of machine learning.
Keywords
Modal logic · Possible world semantics · Machine learning · Classification performance · Robustness · Fairness
Yusuke Kawamoto, AIST, Tsukuba, Japan. ORCID: 0000-0002-2151-9560. This work was supported by the New Energy and Industrial Technology Development Organization (NEDO), by the ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JST, and by Inria under the project LOGIS.

1 Introduction

With the increasing use of machine learning in real-life applications, the safety and security of learning-based systems have been of great interest. In particular, many recent studies [40,12] have found vulnerabilities in the robustness of deep neural networks (DNNs) to malicious inputs, which can lead to disasters in security-critical systems, such as self-driving cars. To find these vulnerabilities in advance, there has been research on formal verification and testing methods for the robustness of DNNs in recent years [23,26,35,41]. However, relatively little attention has been paid to the formal specification of machine learning [38].

In the research field of formal specification and verification, logical approaches have been shown useful to characterize desired properties and to develop theories to discuss those properties. For example, temporal logic [36] is a branch of modal logic for expressing time-dependent propositions, and has been widely used to describe requirements of hardware and software systems. For another example, epistemic logic [44] is a modal logic for knowledge and belief that has been employed as a formal policy language for distributed systems (e.g., for the authentication [8] and the anonymity [39] of security protocols). As far as we know, however, no prior work has employed logical formulas to rigorously describe various statistical properties of machine learning, although there are some papers that (often informally) list various desirable properties of machine learning [38].

In this paper, we present a first logical formalization of statistical properties of machine learning.
To describe the statistical properties in a simple and abstract way, we extend statistical epistemic logic (StatEL) [27], which has recently been proposed to describe statistical knowledge and has been applied to formalize statistical hypothesis testing and statistical privacy of databases.

A key idea in our modeling of statistical machine learning is that we formalize logical aspects at the syntax level, and statistical distances and dataset operations at the semantics level by using accessibility relations of a Kripke model [30]. In this model, we formalize supervised learning and some of its desirable properties, including performance, robustness, and fairness. More specifically, classification performance and robustness are described as the differences between the correct class label and the classifier's prediction, whereas fairness is expressed as a conditional indistinguishability between different groups.
Our contributions.
The main contributions of this work are as follows:

– We propose a logical approach to formalizing statistical properties of machine learning in a simple and abstract way. Specifically, we introduce a principle that logical aspects of statistical properties are described at the syntax level, and statistical distances and datasets are formalized at the semantics level.

– We formalize supervised learning models and test datasets (used to check whether the learning models satisfy specifications) by employing a distributional Kripke model [27] where each possible world corresponds to a possible test dataset, and modal operators are interpreted as transformation and testing on datasets. Then we show how the sampling from a dataset and non-deterministic adversarial inputs are formalized in the distributional Kripke model.

– We propose an extension of statistical epistemic logic (StatEL) as a formal language to describe various properties of machine learning models, including the performance, robustness, and fairness of statistical classifiers. Then the satisfaction of logical formulas representing those properties is associated with their testing using a test dataset. As far as we know, this is the first work that uses logical formulas to formalize various statistical properties of machine learning, and that provides an epistemic view on those properties.

– We show some relationships among properties of classifiers, such as different levels of robustness. We also present certain relationships between classification performance and robustness, which suggest robustness-related properties that have not been formalized in the literature as far as we know.
Cautions and limitations.
In this paper, we focus on formalizing properties of supervised learning models that may be tested by using a dataset; i.e., we do not deal with unsupervised learning, reinforcement learning, the properties of learning algorithms, quality of training data (e.g., sample bias), quality of testing (e.g., coverage criteria), explainability, temporal properties, or system-level specification. It should be noted that most of the properties formalized in this paper have been known in the machine learning literature, and the novelty of this work lies in the logical formulation of those statistical properties.

We also highlight that this work aims to provide a logical approach to the modeling of statistical properties tested with a dataset, and does not present methods for checking, guaranteeing, or improving the performance/robustness/fairness of machine learning models. As for the satisfiability of logical formulas, we leave the development of testing and (statistical) model checking algorithms as future work, since the research area on the testing and verification of machine learning is relatively new and needs further techniques to improve scalability. Moreover, in some applications such as image recognition, some atomic formulas (e.g., representing whether an input image is a panda) cannot be defined mathematically, and require additional techniques based on experiments. Nevertheless, we demonstrate that describing various properties using logical formulas is useful to explore desirable properties and to discuss their relationships within a single framework.

Finally, we emphasize that our work is the first attempt to use epistemic models and logical formulas to express statistical properties of machine learning models, and would be a starting point to develop theories of formal specification of machine learning in future research.
Relationship with the preliminary version.
The main novelties of this paper with respect to the preliminary version [28] are as follows:

– We add how the satisfaction of a formula at a possible world can be regarded as the testing of a specification using a test dataset (Sect. 3.1).

– We show how modal operators are used to model the transformation and testing on datasets. For example, data preparation T (e.g., data cleaning, data augmentation) can also be formalized as a modal operator ∆_T (Sect. 3.2).

– We re-interpret the non-classical implication ⊃ for conditional probabilities in StatEL as a modal operator associated with a conditioning relation (Sect. 3.3).

– We introduce a modal operator ∼^{ε,D}_x for conditional indistinguishability (Sect. 3.4). Then we provide a more comprehensible formalization of the fairness of supervised learning (Sect. 7) without using counterfactual epistemic operators [28], because the formalization using these operators requires an additional formula and makes the presentation more complicated and unintuitive.

– We add a formalization of generalization error to capture how accurately a classifier is able to classify previously unseen input data (Sect. 5.3).

– We add formalizations of other fairness notions called separation (Sect. 7.3) and sufficiency (Sect. 7.4) so that this paper covers all three categories of fairness notions [5].

– We show a running example of pedestrian detection to illustrate the formalization of various notions of performance, robustness, and fairness.
Paper organization.
The rest of this paper is organized as follows. Sect. 2 presents notations used in this paper and provides background on statistical distances and statistical epistemic logic (StatEL). Sect. 3 introduces a different view on the modal operators in StatEL and extends the logic with additional operators. Sect. 4 introduces a formal model for describing the behaviors of statistical classifiers and non-deterministic adversarial inputs. Sects. 5, 6, and 7 respectively formalize various notions of the performance, robustness, and fairness of classifiers by using our extension of StatEL. Sect. 8 presents related work and Sect. 9 concludes.
In this section we introduce some notations, and review background on statistical distance notions and the syntax and semantics of statistical epistemic logic (StatEL), introduced in [27].

2.1 Notations

Let R≥0 be the set of non-negative real numbers, and [0, 1] be the set of non-negative real numbers not greater than 1. We denote by DO the set of all probability distributions over a finite set O. Given a finite set O and a probability distribution µ ∈ DO, the probability of sampling a value v from µ is denoted by µ[v]. For a subset R ⊆ O, let µ[R] = Σ_{v∈R} µ[v]. For a distribution µ over a finite set O, its support is defined by supp(µ) = {v ∈ O : µ[v] > 0}.

2.2 Statistical Distance

We recall two popular notions of distance between probability distributions: the total variation and the ∞-Wasserstein distance.

Informally, the total variation between two distributions µ1 and µ2 over a set O represents the largest difference between the probabilities that µ1 and µ2 assign to an identical subset R of O.

Definition 1 (Total variation) For a finite set O, the total variation D_tv of two distributions µ1, µ2 ∈ DO is defined by:

  D_tv(µ1 ‖ µ2) def= sup_{R⊆O} | µ1[R] − µ2[R] |.

We then recall the ∞-Wasserstein metric [43]. Intuitively, the ∞-Wasserstein metric W_d(µ1, µ2) between two distributions µ1, µ2 represents the minimum largest move between points in a transportation from µ1 to µ2.

Definition 2 (∞-Wasserstein metric) Let O be a finite set and d : O × O → R≥0 be a metric over O. The ∞-Wasserstein metric W_d w.r.t. d between two distributions µ1, µ2 ∈ DO is defined by:

  W_d(µ1, µ2) = min_{µ ∈ cp(µ1,µ2)} max_{(v1,v2) ∈ supp(µ)} d(v1, v2)

where cp(µ1, µ2) is the set of all couplings of µ1 and µ2.¹

¹ A coupling of two distributions µ1, µ2 ∈ DO is a joint distribution µ ∈ D(O × O) such that µ1 and µ2 are µ's marginal distributions, i.e., for each v1 ∈ O, µ1[v1] = Σ_{v′∈O} µ[v1, v′] and for each v2 ∈ O, µ2[v2] = Σ_{v′∈O} µ[v′, v2]. For a coupling µ, the support supp(µ) is the maximum subset of O × O whose elements are assigned non-zero probabilities in µ.

2.3 Syntax of StatEL

We next recall the syntax of statistical epistemic logic (StatEL) [27], which has two levels of formulas: static and epistemic formulas. Intuitively, a static formula describes a proposition satisfied at a (deterministic) state, while an epistemic formula describes a proposition satisfied at a probability distribution of states. In this paper, the former is used only to define the latter.

Formally, let Mes be a set of symbols called measurement variables, and Γ be a set of atomic formulas of the form γ(x1, x2, …, xn) for a predicate symbol γ, n ≥ 0, and x1, x2, …, xn ∈ Mes. Let I ⊆ [0, 1] be a finite union of disjoint intervals, and A be a finite set of indices (e.g., associated with statistical divergences). Then the formulas are defined by:

  Static formulas:    ψ ::= γ(x1, x2, …, xn) | ¬ψ | ψ ∧ ψ
  Epistemic formulas: ϕ ::= P_I ψ | ¬ϕ | ϕ ∧ ϕ | ψ ⊃ ϕ | K_a ϕ

where a ∈ A. We denote by F the set of all epistemic formulas. Note that we have no quantifiers over measurement variables. (See Sect. 2.5 for more details.)

The probability quantification P_I ψ represents that a static formula ψ is satisfied with a probability belonging to a set I. For instance, P_{(0.95,1]} ψ represents that ψ holds with a probability greater than 0.95. By ψ ⊃ P_I ψ′ we represent that the conditional probability of ψ′ given ψ is included in a set I. The epistemic knowledge K_a ϕ expresses that we know ϕ when our capability of observation is denoted by a ∈ A.

As syntactic sugar, we use disjunction ∨, classical implication →, and epistemic possibility P_a, defined as usual by: ϕ1 ∨ ϕ2 ::= ¬(¬ϕ1 ∧ ¬ϕ2), ϕ1 → ϕ2 ::= ¬ϕ1 ∨ ϕ2, and P_a ϕ ::= ¬K_a ¬ϕ. When I is a singleton {i}, we abbreviate P_I as P_i.

2.4 Distributional Kripke Model

Next we recall the notion of a distributional Kripke model [27], where each possible world is associated with a probability distribution over a set of states, and with a stochastic assignment of data to measurement variables.

Definition 3 (Distributional Kripke model)
Let A be a finite set of indices (typically associated with operations and tests on datasets), S be a finite set of states, and O be a finite set of data, called a data domain. A distributional Kripke model is a tuple M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) consisting of:

– a non-empty set W of multisets of states belonging to S;
– for each a ∈ A, an accessibility relation R_a ⊆ W × W;
– for each s ∈ S, a valuation V_s : Γ → P(O^k) that maps each k-ary predicate γ to a set V_s(γ) of k-tuples of data.

The set W is called a universe, and its elements are called possible worlds. A world is said to be finite if it is a finite multiset, i.e., it has a finite number of (possibly duplicated) elements. A world is said to be infinite if it is an infinite multiset.

The relation R_a determines an accessibility between two worlds. For example, (w, w′) ∈ R_a means that a world w′ is accessible from a world w when our capability of distinguishing possible worlds is denoted by a ∈ A. The valuation V_s may give a possibly different interpretation of a predicate γ at a different state s. We assume that all measurement variables range over the same data domain O in every world. The interpretation of measurement variables at a state s is given by a deterministic assignment σ_s defined below.

Definition 4 (Deterministic assignment) For any distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}), we assume that each world w ∈ W is associated with a function ρ_w : Mes × S → O that maps each measurement variable x to its value ρ_w(x, s) observed at a state s belonging to the world w. We also assume that each state s in a world w is associated with the deterministic assignment σ_s : Mes → O defined by σ_s(x) = ρ_w(x, s).

Since each world w is a multiset of states, we abuse notation and denote by w[s] the probability that a state s is randomly chosen from w (i.e., the number of occurrences of s in the multiset w, divided by the total number of elements in w). Here we regard each world w as the probability distribution over states that corresponds to the multiset.

The probability that a measurement variable x ∈ Mes has a value v ∈ O is σ_w(x)[v] = Σ_{s∈w, σ_s(x)=v} w[s]. Note that σ_w : Mes → DO maps each measurement variable x to a probability distribution σ_w(x) over the data domain O. Hence σ_w represents the joint probability distribution of all variables in Mes, and is called the stochastic assignment at w. When a state s is uniformly drawn from a multiset w of states, a datum σ_s(x) is sampled from the distribution σ_w(x).

In later sections, a possible world corresponds to a dataset (i.e., a multiset of data tuples) from which data are sampled. For example, suppose that we have only three measurement variables Mes = {x, y, z}. Then for each state s in a world w, the deterministic assignment σ_s : Mes → O represents the tuple of data (σ_s(x), σ_s(y), σ_s(z)). Hence each state s corresponds to a tuple of data, and the world w corresponds to the dataset {(σ_s(x), σ_s(y), σ_s(z)) | s ∈ w}.

2.5 Stochastic Semantics of StatEL

Now we recall the stochastic semantics [27] for the StatEL formulas over a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) with W = DS.

The interpretation of a static formula ψ at a state s is given by:

  s ⊨ γ(x1, …, xk)  iff  (σ_s(x1), …, σ_s(xk)) ∈ V_s(γ)
  s ⊨ ¬ψ            iff  s ⊭ ψ
  s ⊨ ψ ∧ ψ′        iff  s ⊨ ψ and s ⊨ ψ′.

The restriction w|ψ of a world w to a static formula ψ is defined by w|ψ[s] = w[s] / Σ_{s′ : s′⊨ψ} w[s′] if s ⊨ ψ, and w|ψ[s] = 0 otherwise. Note that w|ψ is undefined if there is no state s that satisfies ψ and has a non-zero probability in w.

Then the interpretation of epistemic formulas in a world w is defined by:

  M, w ⊨ P_I ψ   iff  Pr[ s ←$ w : s ⊨ ψ ] ∈ I
  M, w ⊨ ¬ϕ      iff  M, w ⊭ ϕ
  M, w ⊨ ϕ ∧ ϕ′  iff  M, w ⊨ ϕ and M, w ⊨ ϕ′
  M, w ⊨ ψ ⊃ ϕ   iff  w|ψ is defined and M, w|ψ ⊨ ϕ
  M, w ⊨ K_a ϕ   iff  for every w′ s.t. (w, w′) ∈ R_a, M, w′ ⊨ ϕ,

where s ←$ w represents that a state s is sampled from the distribution w.

Then M, w ⊨ ψ ⊃ P_I ψ′ represents that the conditional probability of satisfying a static formula ψ′ given another ψ is included in a set I at a world w.

In each world w, measurement variables can be interpreted using σ_w. This allows us to assign different values to different occurrences of a variable in a formula; e.g., in ϕ(x) → K_a ϕ′(x), the x occurring in ϕ(x) is interpreted by σ_w in a world w, while the x in ϕ′(x) is interpreted by σ_{w′} in another world w′ s.t. (w, w′) ∈ R_a.

Finally, the interpretation of an epistemic formula ϕ in M is given by:

  M ⊨ ϕ  iff  for every world w in M, M, w ⊨ ϕ.

Hereafter we mainly focus on satisfaction local to a possible world, and M may be omitted when it is clear from the context.

In this section we introduce a different view on the modal operators in statistical epistemic logic (StatEL), and define additional modal operators that are used to formalize various properties of machine learning in Sects. 5 to 7.

3.1 Checking Satisfaction at a World as Testing with a Dataset

We first show how we regard the satisfaction of a formula ϕ as testing a system's specification expressed by ϕ.

As explained in Sect. 2.4, a possible world corresponds to a possible dataset. Thus, given a model M, a world w, and a formula ϕ, checking the satisfaction M, w ⊨ ϕ can be regarded as testing whether the specification ϕ of a system (e.g., a machine learning model we formalize in Sect. 4) is satisfied when the dataset w provides inputs to the system. For example, let ϕ be a formula representing that a machine learning task (e.g., classification) C fails with probability at most 5%. Then M, w ⊨ ϕ represents that when the learning task C is performed using a test dataset w, it fails for at most 5% of the test data in w.

For simplicity, we discuss the satisfaction of formulas ϕ in which neither K_a nor P_a occurs. For each state (namely, data tuple) s ∈ w and for each static sub-formula ψ of ϕ, we can efficiently check whether s ⊨ ψ. When the dataset w is finite (i.e., it is a finite multiset of data tuples), we can check the satisfaction w ⊨ ϕ in finite time, more precisely, in linear time in the number of elements in w. When the dataset w is infinite, however, we cannot check whether w ⊨ ϕ in general.
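For a finite world, the linear-time check mentioned above can be sketched as follows. This is only an illustrative sketch, not from the paper: we represent a world as a Python list of dicts (a multiset of states), a static formula as a predicate on states, and the names `world`, `sigma_w`, and `prob` are ours.

```python
from fractions import Fraction

# A finite world is a multiset of states; each state s is a deterministic
# assignment sigma_s : Mes -> O, modeled here as a dict.  The variables
# and toy values (x, y, y_hat, "img1", ...) are illustrative only.
world = [
    {"x": "img1", "y": "cat", "y_hat": "cat"},
    {"x": "img2", "y": "cat", "y_hat": "dog"},
    {"x": "img3", "y": "dog", "y_hat": "dog"},
    {"x": "img3", "y": "dog", "y_hat": "dog"},  # duplicate states are allowed
]

def sigma_w(world, x):
    """sigma_w(x): distribution of x's values under a uniform draw of a state."""
    dist = {}
    for s in world:
        dist[s[x]] = dist.get(s[x], Fraction(0)) + Fraction(1, len(world))
    return dist

def prob(world, psi):
    """Pr[s <-$ w : s |= psi] for a static formula psi, given as a predicate."""
    return Fraction(sum(1 for s in world if psi(s)), len(world))

correct = lambda s: s["y"] == s["y_hat"]  # a static formula as a predicate

print(sigma_w(world, "y"))   # cat and dog each have probability 1/2
print(prob(world, correct))  # 3/4
# w |= P_{(1/2,1]} correct: the probability lies in the interval (1/2, 1]
print(Fraction(1, 2) < prob(world, correct) <= 1)  # True
```

Each state is visited once per static sub-formula, which is the linear-time bound stated above.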
For example, suppose that w is the infinite dataset representing a true distribution from which data are sampled and observed. When we cannot learn w itself, we usually obtain a finite dataset w_fin by sampling data from w repeatedly and independently, and check a specification ϕ only with this test dataset w_fin.

Hereafter, we mainly deal with distributional Kripke models M that have infinite numbers of finite worlds. In the following sections except Sect. 6, we deal only with formulas without K_a or P_a, hence we can check their satisfaction at a finite world in finite time.²

² The testing of a formula ϕ is not feasible when an epistemic operator K_a or P_a occurs in ϕ and the model M has a large number of possible worlds. A detailed analysis of the time complexity of StatEL is out of the scope of this paper, and should be included in the journal version of our paper [27] that proposed StatEL. As we will discuss in Sect. 6, the robustness of machine learning is formalized using these epistemic operators, hence cannot be tested in practical time unless M is comprised of a small number of worlds.

3.2 Modal Operators for Dataset Transformation

In the rest of Sect. 3, we show that modal operators can be used to model the transformation and testing on datasets.

First, we introduce modal operators for dataset transformation. The modal operator ∆_T defined below is unary (i.e., it takes a single formula as argument), and is parameterized with a transformation T between datasets. Intuitively, w ⊨ ∆_T ϕ represents that a formula ϕ is satisfied for the dataset w′ that is obtained by transforming the current dataset w by T. Formally, the modal operator ∆_T is interpreted as follows.

Definition 5 (Modality ∆_T for a dataset transformation T) Given a function T : W → W, we define an accessibility relation as R_T def= {(w, w′) | w′ = T(w)}. Then we define the interpretation of ∆_T by:

  M, w ⊨ ∆_T ϕ  iff  there is a w′ s.t. (w, w′) ∈ R_T and M, w′ ⊨ ϕ.

For example, machine learning often requires data preparation to manipulate a given raw dataset into a form that makes a machine learning task feasible and more effective (e.g., data cleaning, data augmentation). For a dataset w and two ways of data preparation T1 and T2, w ⊨ ∆_{T1} ϕ ∧ ∆_{T2} ϕ represents that a property ϕ holds for the two prepared datasets T1(w) and T2(w).

For another example, the security of machine learning often assumes a certain malicious adversary that can manipulate a given dataset to make a machine learning task fail. Such adversarial operations T on datasets can also be formalized using a different modal operator corresponding to T, as we will explain in Sect. 6.

In the next section, we show that the logical connective ⊃ can be re-interpreted as the modality ∆_T for some dataset transformation T.

3.3 Modality for Conditioning

We then present another interpretation of the logical connective ⊃ (defined in Sect. 2.5) used to express conditional probabilities in Sects. 5 and 6. Roughly speaking, we regard the restriction w|ψ of a world w to a static formula ψ as a transformation R_ψ of w. Then we redefine ⊃ as a modal operator associated with R_ψ, and call it the conditioning operator. Formally, the interpretation of ⊃ is defined as follows.

Definition 6 (Conditioning operator ⊃) Assume that the universe W includes all sub-multisets of each w ∈ W. Given a static formula ψ, we define an accessibility relation as the conditioning relation R_ψ def= {(w, w|ψ) | w ∈ W}. Then the interpretation of the conditioning operator ⊃ is given by:

  M, w ⊨ ψ ⊃ ϕ  iff  there is a w′ s.t. (w, w′) ∈ R_ψ and M, w′ ⊨ ϕ.

Intuitively, w ⊨ ψ ⊃ ϕ corresponds to the two operations: (i) transforming the given dataset w to the sub-dataset w|ψ and (ii) testing whether a property ϕ holds for the sub-dataset w|ψ.
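Both modalities read as dataset operations: ∆_T transforms and then tests, while ⊃ restricts and then tests. A minimal sketch, assuming worlds are lists of record dicts and dataset formulas are Python predicates (the names `delta`, `implies`, `T_clean`, and the records are illustrative, not from the paper):

```python
# Worlds as multisets of states (lists of dicts); formulas on datasets
# as Python predicates.

def delta(T, phi, world):
    """w |= Delta_T phi: transform the dataset by T, then test phi on T(w)."""
    return phi(T(world))

def implies(psi, phi, world):
    """w |= psi 'conditioned into' phi: restrict w to the states satisfying
    psi, then test phi on the sub-dataset w|psi (False here when the
    restriction is undefined, i.e., empty)."""
    sub = [s for s in world if psi(s)]
    return bool(sub) and phi(sub)

world = [
    {"x": "img1", "label": "cat"},
    {"x": "img2", "label": None},   # a record that needs data cleaning
    {"x": "img3", "label": "dog"},
]

# A hypothetical data-preparation step T: drop unlabeled records.
T_clean = lambda w: [s for s in w if s["label"] is not None]
all_labeled = lambda w: all(s["label"] is not None for s in w)

print(all_labeled(world))                  # False: the raw dataset fails
print(delta(T_clean, all_labeled, world))  # True: Delta_T holds after cleaning
print(implies(lambda s: s["label"] == "cat", lambda w: len(w) == 1, world))  # True
```

The uniform restrict-then-test shape of `implies` is exactly why the paper can treat ψ ⊃ ϕ as the modality ∆_T ϕ with T(w) = w|ψ.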
When no data in the dataset w satisfy the property ψ, we can describe this as M, w ⊨ ψ ⊃ ⊥ by using the propositional constant falsum ⊥.

Note that the conditioning ψ ⊃ ϕ can be regarded as the modal formula ∆_T ϕ with the dataset transformation T where T(w) = w|ψ for all w ∈ W.

In Sects. 5 and 6, we show concrete examples using the conditioning operator ⊃, i.e., the classification performance and robustness of statistical classifiers.

3.4 Modality for Conditional Indistinguishability

Next, we introduce a modal operator that is used to formalize the fairness of machine learning in Sect. 7.

Given two static formulas ψ1, ψ2 (e.g., representing male and female), w|ψ1(x) (resp. w|ψ2(x)) represents the probability distribution of the values of a measurement variable x generated from the sub-dataset w|ψ1, e.g., the sub-dataset about males (resp. w|ψ2, e.g., about females). To formalize a certain similarity between x's values generated from the two sub-datasets (e.g., between the benefits for males and for females), we introduce a modal operator ∼^{ε,D}_x for conditional indistinguishability as follows. We write ψ1 ∼^{ε,D}_x ψ2 to represent that the two distributions w|ψ1(x) and w|ψ2(x) are indistinguishable up to a threshold ε in terms of a divergence or distance D. Formally, this modality is defined as follows.

Definition 7 (Conditional indistinguishability operator ∼^{ε,D}_x) Assume that the universe W includes all sub-multisets of each w ∈ W. Given an x ∈ Mes, an ε ∈ R≥0, and a divergence or distance D : DO × DO → R≥0, we define an accessibility relation by:

  R^{ε,D}_x def= {(w1, w2) ∈ W × W | D(σ_{w1}(x) ‖ σ_{w2}(x)) ≤ ε}.

Then for static formulas ψ1 and ψ2, we define the interpretation of ψ1 ∼^{ε,D}_x ψ2 by:

  M, w ⊨ ψ1 ∼^{ε,D}_x ψ2  iff  there exist w1, w2 s.t. (w, w1) ∈ R_{ψ1}, (w, w2) ∈ R_{ψ2}, and (w1, w2) ∈ R^{ε,D}_x,

where R_{ψ1} and R_{ψ2} are the conditioning relations in Definition 6.

Note that two worlds are related by R^{ε,D}_x if they have close probability distributions of the values of x. Intuitively, w ⊨ ψ1 ∼^{ε,D}_x ψ2 corresponds to the two operations: (i) transforming the given dataset w into the two sub-datasets w|ψ1 and w|ψ2, and (ii) testing whether the probability distribution of x generated by the dataset w|ψ1 is indistinguishable from the distribution generated by the dataset w|ψ2.

When ε = 0, the operator ∼^{0,D}_x represents the identity of two distributions.³

³ The semantics for the (binary) composite operator in the arrow logic [7] resembles that for ∼^{ε,D}_x in Definition 7, although it has a totally different meaning and motivation.

Proposition 1
For a world w, static formulas ψ1, ψ2, and a measurement variable x, w ⊨ ψ1 ∼^{0,D}_x ψ2 iff the distribution w|ψ1(x) is identical to w|ψ2(x).

This proposition is immediate from the following lemma.
Lemma 1
For a world w, static formulas ψ1, ψ2, and a measurement variable x, w ⊨ ψ1 ∼^{ε,D}_x ψ2 iff D(σ_{w|ψ1}(x) ‖ σ_{w|ψ2}(x)) ≤ ε.

Proof
Let w1 = w|ψ1 and w2 = w|ψ2. Then by Definition 6, we have (w, w1) ∈ R_{ψ1} and (w, w2) ∈ R_{ψ2}. Hence this lemma follows from Definition 7. ⊓⊔

In Sect. 7, we present examples using the conditional indistinguishability operator, i.e., we formalize various notions of fairness in machine learning by using this operator and the above proposition and lemma.

3.5 Summary on the Modal Language

In summary, modal operators are used to represent transformation and testing on datasets. The unary modal operator ∆_T is regarded as a transformation T on datasets, while the binary modal operators ⊃ and ∼^{ε,D}_x are regarded as transforming-then-testing on datasets.

Now the syntax of the formulas is given by:

  Static formulas:  ψ ::= γ(x1, x2, …, xn) | ¬ψ | ψ ∧ ψ
  Dataset formulas: ϕ ::= P_I ψ | ¬ϕ | ϕ ∧ ϕ | ∆_T ϕ | ψ ⊃ ϕ | ψ ∼^{ε,D}_x ψ | K_a ϕ,

where the epistemic formulas with the additional modalities are called dataset formulas, since they are interpreted in a world that corresponds to a dataset.

When multiple transformations/tests are sequentially applied to datasets, we can use dataset formulas in which different modal operators are nested. For example, w ⊨ ∆_T(ψ ⊃ ϕ) represents that after applying a data preparation T to a dataset w, a property ϕ holds for the sub-dataset T(w)|ψ that satisfies ψ.

In this section we introduce a formal model for supervised learning. Specifically, we employ a distributional Kripke model (Definition 3), and formalize the behavior of a classifier C and a non-deterministic input x from an adversary in the model. In this formalization, we focus only on the testing of supervised learning models, and do not formalize the training of supervised learning models or learning algorithms themselves.

4.1 Classification Problems

Multiclass classification is the problem of classifying a given input into one of multiple classes.
Let L be a finiteset of class labels , and D be a finite set of input data (called feature vectors ) that we want to classify. Then a classifier is a function C : D → L that receives an inputdatum v and predicts which class (among L ) the input v belongs to. In this work, we deal with a situationwhere some classifier C has already been obtained andits properties should be evaluated, and do not model orreason about how classifiers are trained from a trainingdataset.We assume a scoring function f : D × L → R thatgives a score f ( v, (cid:96) ) of predicting the class of an inputdatum (feature vector) v as a label (cid:96) . Then for eachinput v ∈ D , we denote by H ( v ) = (cid:96) to represent thata label (cid:96) maximizes f ( v, (cid:96) ). For example, when the in-put v is an image of an animal and (cid:96) is the animal’sname, then H ( v ) = (cid:96) may represent that an oracle (ora “human”) classifies the image v as (cid:96) .4.2 Modeling the Behaviors of ClassifiersA classifier is formalized on a distributional Kripkemodel M = ( W , ( R a ) a ∈A , ( V s ) s ∈S ) with W = D S . Then W is an infinite set of possible worlds that correspondsto all possible datasets from which the classifier can re-ceive input data. We denote by w test ∈ W a real worldthat corresponds to a test dataset. Recall that eachworld w ∈ W is a multiset of states over S and is as-sociated with a stochastic assignment σ w : Mes → D O that is consistent with the deterministic assignments σ s for all s ∈ w , as explained in Sect. 2.4.We present an overview of our formalization in Fig. 1.We denote by x ∈ Mes an input datum given to theclassifier C (and to the oracle H ), by y ∈ Mes a correctlabel given by the oracle H , and by ˆ y ∈ Mes a labelpredicted by C . We assume that the input variable x (resp. the output variables y, ˆ y ) ranges over the set D ofinput data (resp. 
the set L of labels); i.e., the deterministic assignment σ_s at each state s ∈ S has the range O = D ∪ L and satisfies σ_s(x) ∈ D and σ_s(y), σ_s(ŷ) ∈ L. (Regression can be regarded as a classification problem whose label ranges over the real numbers, and hence can be formalized using a distributional Kripke model analogously; for simplicity, however, we deal only with classification problems in this paper.)

A key idea in our modeling is that we describe the logical aspects of statistical properties at the syntax level by using logical formulas, and model statistical distances and dataset operations at the semantics level by using accessibility relations in the distributional Kripke model. In this way, we can formalize various statistical properties of classifiers in a simple and abstract way.

Fig. 1: A world w is chosen non-deterministically and corresponds to a test dataset. With probability w[s_i], the world w is in a deterministic state s_i where the classifier C receives the input value σ_{s_i}(x) and returns the output value σ_{s_i}(ŷ). Each state s_i can be regarded as a tuple (σ_{s_i}(x), σ_{s_i}(y), σ_{s_i}(ŷ)) ∈ D × L × L consisting of an input datum, an actual label, and a predicted label.

To formalize the classifier C, we introduce a static formula ψ(x, ŷ) representing that C classifies a given input x as a class ŷ. We also introduce a static formula h(x, y) representing that y is the actual class of an input x. As abbreviations, we write ψ_ℓ(x) (resp. h_ℓ(x)) to denote ψ(x, ℓ) (resp. h(x, ℓ)). Formally, these static formulas are interpreted at each state s ∈ S as follows:

  s ⊨ ψ(x, ŷ) iff C(σ_s(x)) = σ_s(ŷ),
  s ⊨ h(x, y) iff H(σ_s(x)) = σ_s(y).

The model M can formalize an input x that is probabilistically chosen from a given dataset. As explained in Sect. 2.4, each world w corresponds to a test dataset. When a state s is drawn from the multiset w of states, an input value σ_s(x) is sampled from the distribution σ_w(x) and assigned to the measurement variable x. The set of all possible probability distributions of inputs is represented by Λ ≝ { σ_w(x) | w ∈ W }, which is possibly an infinite set.

For example, let us consider testing the classifier C with the actual test dataset σ_{w_test}(x). When C classifies an input x as a label ℓ with probability 0.2, i.e., Pr[ v ←$ σ_{w_test}(x) : C(v) = ℓ ] = 0.2, then this can be expressed by M, w_test ⊨ P_{0.2} ψ_ℓ(x).

Next we observe that our model can formalize a non-deterministic input x from an adversary as follows. Although each state s in a possible world w is assigned the probability w[s], each world w itself is not assigned a probability. Thus each input distribution σ_w(x) ∈ Λ is also not assigned a probability; that is, our model assumes no probability distribution over Λ. In other words, we assume that a world w, and thus an input distribution σ_w(x), is chosen non-deterministically. This is useful for modeling an adversary that provides malicious inputs to the classifier C to make its prediction fail, because we usually have no prior knowledge of the probability distribution of malicious inputs from adversaries, and need to reason about the worst cases caused by the attack. In Sect. 6, this formalization of non-deterministic inputs is used to express the robustness of classifiers.

Finally, it should be noted that we cannot enumerate all possible adversarial inputs, hence cannot enumerate all possible datasets to construct the universe W. Since W can be an infinite set and is unspecified, we cannot check whether a formula expressing a security property against an adversary is satisfied in all possible worlds of W. Nevertheless, as shown in later sections, describing various properties using our extension of StatEL is useful to explore desirable properties and to discuss relationships among them.

5 Classification Performance

In this section we show a formalization of classification performance using our extension of StatEL. We formalize popular measures of classification performance, including precision, recall, and accuracy, as well as measures for evaluating overfitting, such as the generalization error. See Fig. 2 for the basic ideas behind these formalizations.
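Before turning to the individual measures, the basic satisfaction check used above, M, w_test ⊨ P_I ψ_ℓ(x), reduces on a finite test dataset to counting how often C predicts ℓ. A minimal sketch in Python; the classifier `C`, the label, and the dataset below are illustrative stand-ins, not part of the paper's formalism:

```python
# Sketch: checking M, w_test |= P_I psi_l(x) on a finite test dataset.
# The classifier C, the label "pos", and the dataset are illustrative.

def prob_predicts(classifier, dataset, label):
    """Empirical probability Pr[v <-$ sigma_w(x) : C(v) = label]."""
    hits = sum(1 for v in dataset if classifier(v) == label)
    return hits / len(dataset)

def satisfies_P_I(classifier, dataset, label, interval):
    """Does the empirical probability fall within the interval I?"""
    lo, hi = interval
    return lo <= prob_predicts(classifier, dataset, label) <= hi

# Toy example: a threshold classifier over integer "feature vectors".
C = lambda v: "pos" if v >= 5 else "neg"
w_test = [1, 3, 5, 7, 9]   # multiset of states, each fixing sigma_s(x)

p = prob_predicts(C, w_test, "pos")   # 3 of 5 inputs classified "pos"
print(p)                              # 0.6
print(satisfies_P_I(C, w_test, "pos", (0.5, 0.7)))   # True
```

The same counting view underlies all the measures of this section, with the probability operator restricted by a conditioning formula where needed.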
Table 1: Logical description of the table of confusion
(Rows: the classifier's prediction ψ_ℓ(x) / ¬ψ_ℓ(x); columns: the actual class h_ℓ(x) / ¬h_ℓ(x).)

  Prevalence_{ℓ,I}(x)  ≝  P_I h_ℓ(x)
  Accuracy_{ℓ,I}(x)    ≝  P_I ( ψ_ℓ(x) ↔ h_ℓ(x) )

  tp(x) ≝ ψ_ℓ(x) ∧ h_ℓ(x)          fp(x) ≝ ψ_ℓ(x) ∧ ¬h_ℓ(x)
  fn(x) ≝ ¬ψ_ℓ(x) ∧ h_ℓ(x)         tn(x) ≝ ¬ψ_ℓ(x) ∧ ¬h_ℓ(x)

  Precision_{ℓ,I}(x)   ≝  ψ_ℓ(x) ⊃ P_I h_ℓ(x)       FDR_{ℓ,I}(x)      ≝  ψ_ℓ(x) ⊃ P_I ¬h_ℓ(x)
  FOR_{ℓ,I}(x)         ≝  ¬ψ_ℓ(x) ⊃ P_I h_ℓ(x)      NPV_{ℓ,I}(x)      ≝  ¬ψ_ℓ(x) ⊃ P_I ¬h_ℓ(x)
  Recall_{ℓ,I}(x)      ≝  h_ℓ(x) ⊃ P_I ψ_ℓ(x)       FallOut_{ℓ,I}(x)  ≝  ¬h_ℓ(x) ⊃ P_I ψ_ℓ(x)
  MissRate_{ℓ,I}(x)    ≝  h_ℓ(x) ⊃ P_I ¬ψ_ℓ(x)      Specificity_{ℓ,I}(x) ≝ ¬h_ℓ(x) ⊃ P_I ¬ψ_ℓ(x)

The terms positive/negative represent the result of the classifier's prediction, and the terms true/false represent whether the classifier predicts correctly or not. Then the following terminologies are commonly used:

– true positive (tp): both the prediction and the actual class are positive;
– true negative (tn): both the prediction and the actual class are negative;
– false positive (fp): the prediction is positive but the actual class is negative;
– false negative (fn): the prediction is negative but the actual class is positive.

These terminologies can be formalized using static formulas as shown in Table 1. For example, when an input x shows a true positive at a state s, this can be expressed as s ⊨ ψ_ℓ(x) ∧ h_ℓ(x). Note that the value of the measurement variable x is uniquely determined by the assignment σ_s at the state s.
True negative, false positive (type I error), and false negative (type II error) are respectively expressed as s ⊨ ¬ψ_ℓ(x) ∧ ¬h_ℓ(x), s ⊨ ψ_ℓ(x) ∧ ¬h_ℓ(x), and s ⊨ ¬ψ_ℓ(x) ∧ h_ℓ(x).

5.2 Precision, Recall, Accuracy, and Other Performance Measures

Next we formalize three popular measures of binary classification performance: precision, recall, and accuracy. Table 1 summarizes the formalization of various notions of classification performance using our static formulas.

In theory, these notions should be formalized with the infinite dataset w_true representing the true distribution. However, we usually cannot obtain w_true or test the performance measures using it. Hence we often sample a finite test dataset w_test from the true distribution and regard it as an approximation of w_true. (Since the test dataset w_test is finite, there can be missing data that are not included in w_test but are sampled from the true distribution w_true with a very small probability.)

Given a test dataset w_test, precision (positive predictive value) is defined as the conditional probability that the prediction is correct given that the prediction is positive; i.e., precision = tp / (tp + fp). Since the probability distribution of the input x in the world w_test is expressed by σ_{w_test}(x) as explained in Sect. 4.3, the precision being within an interval I is given by:

  Pr[ v ←$ σ_{w_test}(x) : H(v) = ℓ | C(v) = ℓ ] ∈ I,

which can be written as:

  Pr[ s ←$ w_test : s ⊨ h_ℓ(x) | s ⊨ ψ_ℓ(x) ] ∈ I.

Using StatEL, this can be formalized as M, w_test ⊨ Precision_{ℓ,I}(x), where:

  Precision_{ℓ,I}(x) ≝ ψ_ℓ(x) ⊃ P_I h_ℓ(x).

Here ⊃ is the conditioning operator defined in Sect. 3.3. The value of precision depends on the test dataset w_test, and can be computed in finite time since w_test is finite.

Symmetrically, recall (true positive rate) is defined as the conditional probability that the prediction is correct given that the actual class is positive; i.e., recall = tp / (tp + fn). Then the recall being within I is formalized as:

  Recall_{ℓ,I}(x) ≝ h_ℓ(x) ⊃ P_I ψ_ℓ(x).

Finally, accuracy is the probability that the classifier predicts correctly; i.e., accuracy = (tp + tn) / (tp + tn + fp + fn). Then the accuracy being within I is formalized as:

  Accuracy_{ℓ,I}(x) ≝ P_I ( ψ_ℓ(x) ↔ h_ℓ(x) ).
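On a finite test dataset, the Table-1 measures reduce to conditional frequencies over the four confusion counts. A minimal sketch, with an illustrative oracle `H` and classifier `C` (all names and data are stand-ins, not part of the formalism):

```python
# Sketch: precision, recall, and accuracy as conditional probabilities
# over a finite test dataset w_test, pairing the classifier C with an
# oracle H.  C, H, and the data are illustrative.

def confusion_counts(C, H, dataset, label):
    """Counts of tp, fp, fn, tn for the label, over all states."""
    tp = fp = fn = tn = 0
    for v in dataset:
        pred, actual = C(v) == label, H(v) == label
        if pred and actual:       tp += 1
        elif pred and not actual: fp += 1
        elif not pred and actual: fn += 1
        else:                     tn += 1
    return tp, fp, fn, tn

def precision(C, H, dataset, label):
    tp, fp, _, _ = confusion_counts(C, H, dataset, label)
    return tp / (tp + fp)          # Pr[h_l(x) | psi_l(x)]

def recall(C, H, dataset, label):
    tp, _, fn, _ = confusion_counts(C, H, dataset, label)
    return tp / (tp + fn)          # Pr[psi_l(x) | h_l(x)]

def accuracy(C, H, dataset, label):
    tp, fp, fn, tn = confusion_counts(C, H, dataset, label)
    return (tp + tn) / (tp + fp + fn + tn)   # Pr[psi_l(x) <-> h_l(x)]

H = lambda v: "pos" if v > 4 else "neg"    # oracle ("human") labels
C = lambda v: "pos" if v >= 6 else "neg"   # slightly conservative classifier
w_test = [2, 4, 5, 6, 8]

print(precision(C, H, w_test, "pos"))  # 1.0 (both positive predictions correct)
print(recall(C, H, w_test, "pos"))     # ~0.667 (misses v = 5)
print(accuracy(C, H, w_test, "pos"))   # 0.8
```

Checking membership of each value in an interval I then decides the corresponding formula, exactly as for P_I above.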
Fig. 2: The classification performance compares the oracle H's output with that of the classifier C, while the evaluation of overfitting compares the expected loss on the test dataset with that on the training dataset.

The accuracy can also be defined as P_I ( tp(x) ∨ tn(x) ). When we measure the accuracy after applying a data preparation operation T (e.g., data cleaning) to the test dataset w_test, this can be represented by w_test ⊨ ∆_T Accuracy_{ℓ,I}(x).

Example 1 (Performance of pedestrian detection)
Let us consider an autonomous car that uses a machine learning classifier to detect a person crossing the road. For the sake of simplicity, we formalize an example of a binary classifier C that detects whether or not a pedestrian is crossing the road in a photo image in a test dataset w_test. We write sunny(x) (resp. snowy(x)) to represent that a photo x was taken on a sunny (resp. snowy) day. Let ψ_ℓ(x) (resp. h_ℓ(x)) represent that the classifier C (resp. the human) detects a pedestrian crossing the road in an image x.

We empirically measure recall (i.e., the conditional probability that C detects a pedestrian crossing the road when the input image x actually includes one) by using the data collected on sunny days. When C achieves a recall of at least 0.95 on sunny days, this is represented by w_test ⊨ sunny(x) ⊃ Recall_{ℓ,[0.95,1]}(x).

Since C should detect a pedestrian also on a snow-covered road, it should be tested with the data collected on snowy days. If C achieves a recall within some interval I′ on snowy days, this is represented by w_test ⊨ snowy(x) ⊃ Recall_{ℓ,I′}(x). More generally, if the classifier C achieves a recall above some threshold c under each of m different conditions γ_1, γ_2, …, γ_m, this can be represented by w_test ⊨ ⋀_{i=1}^{m} ( γ_i(x) ⊃ Recall_{ℓ,(c,1]}(x) ).

5.3 Generalization Error

We next formalize the generalization error of a classifier, i.e., a measure of how accurately a classifier is able to predict the class of previously unseen input data. Since a classifier has been trained on a finite sample training dataset w_train, it may be overfitted to w_train and have worse classification performance on new input data that are not included in w_train.

To formalize the generalization error, we introduce a formula λ_L(y, ŷ) representing that, given a correct label y and a predicted label ŷ, the expected value of the losses (i.e., real numbers representing the penalty for incorrect classification) is at most a non-negative real number L. Formally, the semantics of λ_L(y, ŷ) is given by:

  w ⊨ λ_L(y, ŷ) iff E_{(v,v̂) ∼ σ_w(y,ŷ)} loss(v, v̂) ≤ L,

where loss is a loss function selected according to the data domain O, and a pair (v, v̂) of a correct label and a predicted label follows the joint distribution σ_w(y, ŷ). Now the generalization error being L or smaller at a true distribution w_true is written as w_true ⊨ GE_L(x, y, ŷ) where:

  GE_L(x, y, ŷ) ≝ ( h(x, y) ∧ ψ(x, ŷ) ) ⊃ λ_L(y, ŷ).
Since we usually cannot obtain the true distribution w_true, and hence cannot check the satisfaction w_true ⊨ GE_L(x, y, ŷ), we often compute an empirical error (as an approximation of the generalization error) by using a finite test dataset w_test that is believed to approximate w_true. This testing can be expressed as w_test ⊨ GE_L(x, y, ŷ).

On the other hand, given a training dataset w_train, the training error being at most L_train is represented by w_train ⊨ GE_{L_train}(x, y, ŷ). Then the overfitting of the classifier can be evaluated by comparing the empirical error L with the training error L_train. When the empirical error is smaller than L_train + ε for some error bound ε > 0, this is expressed as w_test ⊨ GE_{L_train+ε}(x, y, ŷ).
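The empirical counterpart of GE_L with the 0-1 loss, together with the overfitting check against L_train + ε, can be sketched as follows (the classifier, the datasets, and the bound below are illustrative, not taken from the paper):

```python
# Sketch: empirical error w_test |= GE_L(x,y,yhat) with the 0-1 loss,
# and the overfitting check "empirical error < L_train + eps".
# Datasets are lists of (input, actual label) pairs; all names illustrative.

def empirical_error(C, labeled_data, loss):
    """Expected loss of C's predictions over a finite labeled dataset."""
    return sum(loss(y, C(v)) for v, y in labeled_data) / len(labeled_data)

zero_one = lambda y, yhat: 0.0 if y == yhat else 1.0   # 0-1 loss function

C = lambda v: "pos" if v >= 5 else "neg"
w_train = [(1, "neg"), (6, "pos"), (7, "pos"), (4, "neg")]
w_test  = [(2, "neg"), (5, "pos"), (4, "pos"), (9, "pos")]

L_train = empirical_error(C, w_train, zero_one)   # 0.0: fits the training data
L_test  = empirical_error(C, w_test, zero_one)    # 0.25: one mistake (v = 4)

eps = 0.3
print(L_test < L_train + eps)   # True: no overfitting beyond the bound eps
```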
C σ s (ˆ y ) input outputsampling σ s (cid:48) ( x ) Classifier
C σ s (cid:48) (ˆ y ) R ε, W d x Robustness
Fig. 3: The robustness compares the conditional probability in the test dataset w_test with that in another possible world w′ that is close to w_test in terms of R^{ε,W_d}_x. Note that an adversary's choice of the input distribution σ_{w′}(x) is formalized as a non-deterministic choice of the possible world w′.

6 Robustness

Many recent studies have found attacks on machine learning in which a malicious adversary manipulates the input to cause a malfunction in a machine learning task [12]. Such input data, called adversarial examples [40], are designed to make a classifier fail to predict the actual class ℓ of the input, while still being recognized to belong to ℓ by human eyes. In computer vision, for example, Goodfellow et al. [20] create an adversarial example by adding undetectable noise to a panda's photo so that humans can still recognize the perturbed image as a panda, but a classifier misclassifies it as a gibbon. To prevent or mitigate such attacks, the classifier should be robust against perturbed inputs, i.e., it should return similar predicted labels given similar input data.

In this section we formalize robustness notions for classifiers by using epistemic operators in StatEL (see Fig. 3 for an overview of the formalization). Furthermore, we show certain relationships between classification performance and robustness, and suggest a class of robustness properties that, as far as we know, have not been formalized in the literature. We present an overview of these formalizations and relationships in Fig. 4.

6.1 Total Correctness of Classifiers

We first note that the total correctness of classifiers could be formalized as a classification performance (e.g., precision, recall, or accuracy) in the presence of all possible inputs from adversaries.
For example, the total correctness could be formalized as M ⊨ Recall_{ℓ,I}(x), which represents that Recall_{ℓ,I}(x) is satisfied in all possible worlds of M.

In practice, however, it is not possible or tractable to test whether the classification performance is achieved for all possible test datasets (corresponding to an infinite number of possible worlds in M). Hence we need a weaker form of a correctness notion, which may be verified or tested in some way. In the following sections, we deal with robustness notions that are weaker than total correctness.

6.2 Accessibility Relation for Robustness

To formalize robustness notions, we introduce an accessibility relation R^{ε,W_d}_x that relates two worlds having close inputs, as follows.

Definition 8 (Accessibility relation for robustness)
We define an accessibility relation R^{ε,W_d}_x ⊆ W × W by:

  R^{ε,W_d}_x ≝ { (w, w′) ∈ W × W | W_d( σ_w(x), σ_{w′}(x) ) ≤ ε },

where W_d is the ∞-Wasserstein distance w.r.t. a metric d in Definition 2.

Then (w, w′) ∈ R^{ε,W_d}_x represents that the two distributions σ_w(x) and σ_{w′}(x) of inputs to the classifier C are close in terms of the distance W_d. Intuitively, for example, W_d measures the distance between two image datasets σ_w(x) and σ_{w′}(x) when the distance between individual images is measured by a metric d. (That is, W_d( σ_w(x), σ_{w′}(x) ) ≤ ε expresses that each value of the input x from the dataset w is close to the corresponding value of x from w′ in terms of the metric d between individual data; for example, each input image x in the dataset w looks similar to the corresponding image in w′ to human eyes.)

An epistemic formula K^{ε,W_d} ϕ then represents that we are confident that ϕ is true even when the input data are perturbed by noise of level ε or smaller.

6.3 Probabilistic Robustness against Targeted Attacks

When an adversary aims to make the classifier misclassify inputs as a specific target label ℓ̂_tar, the attack is called a targeted attack. For instance, in the above-mentioned attack by [20], a gibbon is the target into which a panda's photo is misclassified.

In this section, we discuss how we formalize robustness using the epistemic operator K^{ε,W_d}. We denote by v ∈ D an original input image in the test dataset w_test, and by ṽ ∈ D an image obtained by perturbing the original image v by noise.

A first definition of robustness against targeted attacks might be:

  For any v, ṽ ∈ D, if H(v) = panda and d(v, ṽ) ≤ ε, then C(ṽ) ≠ gibbon,

which represents that when an image ṽ is obtained by perturbing a panda's photo v by noise, it will never be classified as the target label gibbon. This can be formalized using StatEL by:

  M, w_test ⊨ h_panda(x) ⊃ K^{ε,W_d} P_0 ψ_gibbon(x).
However, this notion does not tolerate even a negligible probability of misclassification, and does not cover the case where the human cannot recognize the perturbed image ṽ as a panda (e.g., when the perturbed image ṽ is obtained by linear displacement, rescaling, and rotation [2], then H(ṽ) ≠ panda may hold).

To overcome these issues, we introduce the following definition, which allows some conditional probability δ of misclassification.

Definition 9 (Targeted robustness)
Let δ ∈ [0, 1]. Given a test dataset w_test, a classifier C satisfies probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input ṽ ∈ D s.t. d(v, ṽ) ≤ ε, we have:

  Pr[ C(ṽ) = ℓ̂_tar | H(ṽ) = ℓ ] ≤ δ.  (1)

For instance, when the actual class ℓ is panda and the target label ℓ̂_tar is gibbon, the classifier C misclassifies a panda's photo as a gibbon with only a small probability δ.

Now we express this robustness notion with I = [1 − δ, 1] by using StatEL.
Proposition 2 (Targeted robustness)
Let I ⊆ [0, 1]. The probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar under a given test dataset w_test is expressed by w_test ⊨ TRobust_{ℓ,ℓ̂_tar,I}(x) where:

  TRobust_{ℓ,ℓ̂_tar,I}(x) ≝ K^{ε,W_d} ( h_ℓ(x) ⊃ P_I ¬ψ_{ℓ̂_tar}(x) ).

Proof
Let w′ be a possible world such that (w_test, w′) ∈ R^{ε,W_d}_x. Then w′ corresponds to the dataset obtained by perturbing each datum in w_test. Let ṽ ∈ supp(σ_{w′}(x)). Then ṽ represents a perturbed input. Let w′′ = w′|_{h_ℓ(x)}. Then (1) is logically equivalent to w′′ ⊨ P_{[0,δ]} ψ_{ℓ̂_tar}(x). By Definition 6, w′ ⊨ h_ℓ(x) ⊃ P_{[0,δ]} ψ_{ℓ̂_tar}(x). By I = [1 − δ, 1], w′ ⊨ h_ℓ(x) ⊃ P_I ¬ψ_{ℓ̂_tar}(x). Therefore this proposition follows from the semantics for K^{ε,W_d}. ⊔⊓

Since the L_p-distances are often regarded as reasonable approximations of human perceptual distances [10], they are used as distance constraints on the perturbation in much of the research on targeted attacks (e.g., [40, 20, 10]). Our model can represent the robustness against these attacks by using the L_p-distance as the metric d for R^{ε,W_d}_x.

6.4 Probabilistic Robustness against Non-Targeted Attacks

In this section we formalize non-targeted attacks [33, 32], in which adversaries try to make inputs be misclassified as arbitrary incorrect labels (i.e., not as a specific label like a gibbon). Compared to targeted attacks, these attacks are easier to mount but harder to defend against. We first define the notion of robustness against non-targeted attacks as follows.

Definition 10 (Non-targeted robustness)
Let δ ∈ [0, 1]. Given a test dataset w_test, a classifier C satisfies probabilistic non-targeted robustness w.r.t. an actual label ℓ if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input ṽ ∈ D s.t. d(v, ṽ) ≤ ε, we have:

  Pr[ C(ṽ) = ℓ | H(ṽ) = ℓ ] > 1 − δ.

Now we express this robustness notion with I = [1 − δ, 1] by using StatEL.
Proposition 3 (Non-targeted robustness)
Let I ⊆ [0, 1]. The probabilistic non-targeted robustness under a test dataset w_test is expressed by w_test ⊨ Robust_{ℓ,I}(x) where:

  Robust_{ℓ,I}(x) ≝ K^{ε,W_d} ( h_ℓ(x) ⊃ P_I ψ_ℓ(x) ) = K^{ε,W_d} Recall_{ℓ,I}(x).

Proof
The proof is analogous to that for Proposition 2. ⊔⊓

(The L_p-distance between n-dimensional real vectors x and x′ is written ‖x − x′‖_p, where the p-norm is defined by ‖v‖_p = ( Σ_{i=1}^{n} |v_i|^p )^{1/p}.)
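On finite data, Robust_{ℓ,I}(x) = K^{ε,W_d} Recall_{ℓ,I}(x) amounts to checking Recall_{ℓ,I} in every accessible perturbed world. Since the ε-neighbourhood of w_test cannot be enumerated in general (Sect. 4.3), the sketch below checks the property against a given finite family of perturbed datasets standing in for the accessible worlds; the classifier `C`, oracle `H`, and datasets are illustrative assumptions:

```python
# Sketch: testing non-targeted robustness K^{eps,W_d} Recall_{l,I}(x)
# against a finite family of perturbed datasets (a stand-in for the
# eps-neighbourhood of w_test).  C, H, and the data are illustrative.

def recall_in(C, H, dataset, label, interval):
    """Does Recall_{l,I} hold in the world given by this dataset?"""
    relevant = [v for v in dataset if H(v) == label]
    p = sum(1 for v in relevant if C(v) == label) / len(relevant)
    lo, hi = interval
    return lo <= p <= hi

def robust(C, H, perturbed_worlds, label, interval):
    """Recall_{l,I} must hold in every accessible (perturbed) world."""
    return all(recall_in(C, H, w, label, interval) for w in perturbed_worlds)

H = lambda v: "pos" if v > 4 else "neg"
C = lambda v: "pos" if v >= 5 else "neg"

# Each world perturbs the test inputs by noise of level at most eps.
worlds = [[5, 6, 3], [5.2, 6.1, 2.9], [4.9, 5.8, 3.1]]

print(robust(C, H, worlds, "pos", (0.5, 1.0)))   # True
print(robust(C, H, worlds, "pos", (1.0, 1.0)))   # False: last world misses v = 4.9
```

Replacing `recall_in` by the analogous check for precision or accuracy yields tests for the precision robustness and accuracy robustness discussed in Sect. 6.5.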
Fig. 4: Robustness notions and their relationships: the probabilistic non-targeted robustness Robust_{ℓ,I}(x) (Sect. 6.4) implies both the probabilistic targeted robustness TRobust_{ℓ,ℓ̂_tar,I}(x) (Sect. 6.3, Proposition 4(1)) and the recall Recall_{ℓ,I}(x) (Sect. 5.2, Proposition 4(2)).

6.5 Relationships among Robustness Notions

In this section we present relationships among the notions of robustness and performance, and discuss properties related to robustness. We first present the following proposition, which is immediate from the definitions.
Proposition 4 (Relationships among notions)
Let I ⊆ [0, 1] and ℓ, ℓ̂_tar ∈ L. Then we have:

1. w_test ⊨ Robust_{ℓ,I}(x) implies w_test ⊨ TRobust_{ℓ,ℓ̂_tar,I}(x).
2. w_test ⊨ Robust_{ℓ,I}(x) implies M, w_test ⊨ Recall_{ℓ,I}(x).

The first claim means that probabilistic non-targeted robustness is not weaker than probabilistic targeted robustness for the same I. The second claim means that probabilistic non-targeted robustness implies recall without perturbation noise; note that this is immediate from the reflexivity of R^{ε,W_d}_x.

Next we remark that our extension of StatEL can be used to describe a situation in which adversarial attacks are mitigated. When we apply some mechanism T that preprocesses a given input to mitigate attacks on robustness, the resulting robustness is expressed as w_test ⊨ ∆_T Robust_{ℓ,I}(x), where ∆_T is the modality for the dataset transformation T.

Finally, we recall that by Proposition 3, robustness can be regarded as recall in the presence of perturbation noise. This implies that for each property ϕ in the table of confusion (Table 1), we could consider K^{ε,W_d} ϕ as a property evaluating the classification performance in the presence of adversarial inputs, although, as far as we recognize, this has not been formalized in the literature on the robustness of machine learning. For example, the precision robustness K^{ε,W_d} Precision_{ℓ,I}(x) represents that in the presence of perturbation noise, the prediction is correct with a probability within I given that it is positive. For another example, the accuracy robustness K^{ε,W_d} Accuracy_{ℓ,I}(x) represents that in the presence of perturbation noise, the prediction is correct (whether it is positive or negative) with a probability within I.
We illustrate robustness notions using the pedestrian detection of Example 1 in Sect. 5.2. We deal with a binary classifier C that detects whether a pedestrian is crossing the road in a photo image x.

The non-targeted robustness K^{ε,W_d} Recall_{ℓ,I}(x) represents that, even in the presence of perturbation noise on the input image x, the classifier C detects a person crossing the road with a probability within I whenever a human can actually recognize one. This robustness is crucial for an autonomous car not to hit a pedestrian.

The precision robustness K^{ε,W_d} Precision_{ℓ,I}(x) represents that, in the presence of perturbation noise on x, a pedestrian is actually crossing the road with a probability within I whenever C detects one. This type of robustness is important for an autonomous car to avoid stopping suddenly due to a false alarm (and thus avoid being hit by the car behind).

7 Fairness

Many studies have proposed and investigated various notions of fairness in machine learning [5]. Informally, these fairness notions mean that the results of machine learning tasks are independent of sensitive attributes, e.g., gender, age, race, disease, or political/religious views. In the last few years, there have also been studies on testing methods for the fairness of machine learning [18, 1, 42].

In this section, we formalize popular notions of fairness of supervised learning by using our extension of StatEL. Here we focus on the fairness that should be maintained in the impact (i.e., the results of machine learning tasks) rather than in the treatment (i.e., the process of machine learning tasks).
This is because previous research shows that many seemingly neutral features have statistical relationships with sensitive attributes, and hence just ignoring or removing sensitive attributes in the process of data preparation and training is often ineffective, or even harmful, for achieving the fairness and performance of learning tasks.

7.1 Basic Ideas and Notations

Various notions of fairness in supervised learning are classified into three categories: independence, separation, and sufficiency [5]. All of these have the form of (conditional) independence or a relaxation thereof, and thus can be formalized using the modal operator ∼^{ε,D}_x for conditional indistinguishability (defined in Sect. 3.4) in our extension of StatEL.

In the formalization of fairness notions, we use a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}). Recall that x, y, and ŷ are measurement variables denoting, respectively, the input datum, the actual class label (given by the oracle H), and the predicted label (output by the classifier C). Given a real world w_test (corresponding to a given test dataset), σ_{w_test}(x) is the probability distribution of C's test input over D, σ_{w_test}(y) is the distribution of the actual label over L, and σ_{w_test}(ŷ) is the distribution of C's output over L.

Fairness notions are usually defined in terms of some sensitive attribute (e.g., gender, age, race, disease, political/religious view), which is defined as a tuple of subsets of the input data domain D. For example, a sensitive attribute based on age can be defined as a pair of groups G_0 (input data with ages 21 to 60) and G_1 (ages 61 to 100). For each group G ⊆ D of inputs, we introduce a static formula η_G(x) representing that an input x belongs to G. Formally, this is interpreted as follows: for each state s ∈ S, s ⊨ η_G(x) iff σ_s(x) ∈ G.
Roughly speaking, a machine learning task is said to be fair if the performance of the task for one group G_0's inputs is similar to that for another group G_1's inputs. (Some fairness notions, e.g., equal opportunity, assume G_1 = D \ G_0.) In the following sections, we formalize the three categories of fairness of classifiers and their relaxations. A summary of this formalization is presented in Table 2.

Two remarks are in order. First, fairness in the treatment, i.e., unawareness, requires that sensitive attributes not be explicitly used in the learning process; StatEL may not be suited to formalizing this requirement. Second, compared to the preliminary version [28] of this paper, we corrected errors and changed the formalization into a more comprehensible form by introducing the operator ∼^{ε,D}_x and by removing the counterfactual epistemic operators and a formula ξ_d representing that the input is drawn from a dataset d.

7.2 Independence (a.k.a. Group Fairness)

In this section we formalize independence [9], also known as group fairness [15], and its relaxed notion. (In previous literature, independence has also been referred to by other terminologies, such as statistical parity, demographic parity, and disparate impact.) Intuitively, independence means that the predicted label ŷ does not have statistical relationships with the membership in a sensitive group. For example, independence does not allow a bank's lending rate to be correlated with a sensitive attribute such as gender.

We first present the definition of a relaxed notion of independence, called group fairness up to bias ε [15]. Intuitively, this is the property that the output distributions of the classifier are roughly identical when input data belong to different groups. Formally, this fairness notion is defined as follows.

Definition 11 (Independence, group fairness)
Let G_0, G_1 ⊆ D be the sets of input data constituting a sensitive attribute. For each b = 0, 1, let µ_{G_b} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v is sampled from a test dataset w_test and belongs to G_b; i.e., for each ℓ̂ ∈ L,

  µ_{G_b}[ℓ̂] ≝ Pr[ C(v) = ℓ̂ | v ←$ σ_{w_test}(x) and v ∈ G_b ].  (2)

Then a classifier C satisfies the group fairness between groups G_0 and G_1 up to bias ε if D_tv( µ_{G_0} ∥ µ_{G_1} ) ≤ ε, where D_tv is the total variation between distributions (defined in Sect. 2.2). A classifier C satisfies independence w.r.t. groups G_0 and G_1 if it satisfies the group fairness between G_0 and G_1 up to bias 0.

Now we express this fairness notion using our extension of StatEL as follows.

Proposition 5 (Independence, group fairness)
The group fairness between groups G_0 and G_1 up to bias ε under a given test dataset w_test is expressed as w_test ⊨ GrpFair_ε(x, ŷ) where:

  GrpFair_ε(x, ŷ) ≝ ( η_{G_0}(x) ∧ ψ(x, ŷ) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) ).

Independence (without bias ε) is expressed by w_test ⊨ GrpFair_0(x, ŷ).

Proof Let w_b = w_test|_{η_{G_b}(x) ∧ ψ(x,ŷ)}. It follows from (2) that for each ℓ̂ ∈ L,

  µ_{G_b}[ℓ̂] = Pr[ σ_s(ŷ) = ℓ̂ | s ←$ w_b ],

hence µ_{G_b} = σ_{w_b}(ŷ). Thus, by Definition 11, the group fairness between groups G_0 and G_1 up to bias ε is given by D_tv( σ_{w_0}(ŷ) ∥ σ_{w_1}(ŷ) ) ≤ ε. Therefore, this proposition follows from Lemma 1. ⊔⊓

Table 2: Popular notions of fairness of machine learning
Sect. 7.2 — Independence (a.k.a. group fairness):
  GrpFair_ε(x, ŷ) ≝ ( η_{G_0}(x) ∧ ψ(x, ŷ) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) )

Sect. 7.3 — Separation (a.k.a. equalized odds) and equal opportunity:
  EqOdds_ε(x, ŷ) ≝ ⋀_{ℓ∈L} ( ( η_{G_0}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) )
  EqOpp(x, ŷ) ≝ ( η_G(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) ∼^{0,D_tv}_ŷ ( ¬η_G(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) )

Sect. 7.4 — Sufficiency:
  Sufficiency_ε(x, y) ≝ ⋀_{ℓ̂∈L} ( ( η_{G_0}(x) ∧ ψ_ℓ̂(x) ∧ h(x, y) ) ∼^{ε,D_tv}_y ( η_{G_1}(x) ∧ ψ_ℓ̂(x) ∧ h(x, y) ) )
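Group fairness up to bias ε (Definition 11) can be checked empirically by comparing the classifier's output distributions for the two groups under the total variation distance. A minimal sketch; the classifier, groups, and labels below are illustrative stand-ins:

```python
# Sketch: group fairness up to bias eps (Definition 11) on a finite
# test dataset: the total variation between the classifier's output
# distributions for two groups must be at most eps.  All names illustrative.

def output_dist(C, group_inputs, labels):
    """Distribution mu_{G_b} of C's predicted label over a group's inputs."""
    n = len(group_inputs)
    return {l: sum(1 for v in group_inputs if C(v) == l) / n for l in labels}

def total_variation(mu0, mu1):
    """D_tv between two distributions over the same label set."""
    return 0.5 * sum(abs(mu0[l] - mu1[l]) for l in mu0)

def group_fair(C, group0, group1, labels, eps):
    """Group fairness between group0 and group1 up to bias eps."""
    mu0 = output_dist(C, group0, labels)
    mu1 = output_dist(C, group1, labels)
    return total_variation(mu0, mu1) <= eps

C = lambda v: "approve" if v >= 5 else "deny"
G0 = [4, 5, 6, 8]   # test inputs whose sensitive attribute is in group 0
G1 = [3, 5, 7, 9]   # test inputs whose sensitive attribute is in group 1
labels = ["approve", "deny"]

# Both groups are approved at rate 3/4, so the bias is 0.
print(group_fair(C, G0, G1, labels, eps=0.0))   # True
```

Conditioning the two groups additionally on the actual label h_ℓ(x), as in Table 2, turns the same check into one for separation (equalized odds).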
Weillustrate independence using the pedestrian detectionin Example 1 in Section 5.2. We deal with a binaryclassifier C that detects whether or not a pedestrian iscrossing the road in an image x . We write η m ( x ) (resp. η w ( x )) to represent that an image x includes a man(resp. woman) that may or not be crossing the road.Let ψ ( x, ˆ y ) represent that given an input image x , theclassifier C returns ˆ y (that is either the detection of aperson crossing the road or not).Then the independence between men and women GrpFair ( x, ˆ y ) def = (cid:0) η m ( x ) ∧ ψ ( x, ˆ y ) (cid:1) ∼ , D tv ˆ y (cid:0) η w ( x ) ∧ ψ ( x, ˆ y ) (cid:1) means that the probability of detecting a pedestriancrossing the road is the same between men and women.This fairness guarantees that men and women are equallydetectable as pedestrians, hence equally safe against anautonomous car. Here independence does not rely onthe actual label y , i.e., on whether there is a pedestriancrossing the road that can be detected by human eyes.7.3 Separation (a.k.a. Equalized Odds) and itsRelaxation (Equal Opportunity)In this section we explain and formalize the notionof separation [5] , which is well-known as equalizedodds [22], and its relaxed notion called equal opportu-nity [22]. The motivation behind these notions is to cap-ture typical scenarios in which sensitive characteristicsmay have statistical relationships with the actual classlabel. For instance, even when some sensitive attributeis correlated with an actual default rate on loans, banksmight want to have a different lending rate for peoplewho have a higher default rate. However, independence In previous literature, separation has been referred to alsoas disparate mistreatment [46] and conditional procedure ac-curacy equality [6]. 
(group fairness) does not allow this, since it requires that the lending rate be statistically independent of the sensitive attribute.

To overcome this problem, the notion of separation allows statistical relationships between a sensitive attribute and the predicted label ŷ output by the classifier C, to the extent that they are justified by the actual class label y. More precisely, separation means that the predicted label ŷ is conditionally independent of membership in a sensitive group, given the actual class label y.

Formally, separation is defined as the property that recall (true positive rate) and specificity (true negative rate, explained in Table 1) are the same for all groups, and equal opportunity is defined as the special case of separation restricted to an advantageous class label.

Definition 12 (Separation & equal opportunity)
Given a group G_b ⊆ D and an actual class label ℓ, let μ_{G_b, ℓ} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v ∈ G_b is sampled from a test dataset w_test and is associated with the actual label ℓ; i.e., for each ℓ̂ ∈ L,

μ_{G_b, ℓ}[ℓ̂]  def=  Pr[ C(v) = ℓ̂ | v $← σ_{w_test}(x), v ∈ G_b, H(v) = ℓ ].   (3)

A classifier C satisfies separation between two groups G_0 and G_1 if μ_{G_0, ℓ} = μ_{G_1, ℓ} holds for all ℓ ∈ L. A classifier C satisfies equal opportunity of an advantageous label ℓ_1 w.r.t. a group G_1 if μ_{G_0, ℓ_1} = μ_{G_1, ℓ_1}, where G_0 = D \ G_1.

Now we express these two notions using our extension of StatEL as follows.

Proposition 6 (Separation)
Let γ(x, ℓ, ŷ)  def=  ψ(x, ŷ) ∧ h_ℓ(x). The separation between two groups G_0 and G_1 under a given test dataset w_test is expressed as w_test ⊨ EqOdds_0(x, ŷ), where:

EqOdds_ε(x, ŷ)  def=  ⋀_{ℓ ∈ L} ( (η_{G_0}(x) ∧ γ(x, ℓ, ŷ)) ∼^{ε, D_tv}_{ŷ} (η_{G_1}(x) ∧ γ(x, ℓ, ŷ)) ).

Proof
Let ℓ ∈ L and w_{b,ℓ} = w_test |_{η_{G_b}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x)}. It follows from (3) that:

μ_{G_b, ℓ}[ℓ̂] = Pr[ σ_s(ŷ) = ℓ̂ | s $← w_{b,ℓ} ],

hence μ_{G_b, ℓ} = σ_{w_{b,ℓ}}(ŷ). Thus, by Definition 12, the separation between G_0 and G_1 is given by σ_{w_{0,ℓ}}(ŷ) = σ_{w_{1,ℓ}}(ŷ) for all ℓ ∈ L. Therefore, this proposition follows from Proposition 1. □

It should be noted that for ε > 0, EqOdds_ε(x, ŷ) represents a relaxation of separation up to bias ε in terms of the total variation D_tv.

Proposition 7 (Equal opportunity)
Let γ(x, ℓ, ŷ)  def=  ψ(x, ŷ) ∧ h_ℓ(x). The equal opportunity of a label ℓ_1 w.r.t. a group G_1 under a given test dataset w_test is expressed as w_test ⊨ EqOpp(x, ŷ), where:

EqOpp(x, ŷ)  def=  (η_{G_1}(x) ∧ γ(x, ℓ_1, ŷ)) ∼^{0, D_tv}_{ŷ} (¬η_{G_1}(x) ∧ γ(x, ℓ_1, ŷ)).

Proof
The proof of this proposition is similar to that of Proposition 6. Let G_0 = D \ G_1. By μ_{G_b, ℓ_1} = σ_{w_{b,ℓ_1}}(ŷ), the equal opportunity of ℓ_1 w.r.t. G_1 is given by σ_{w_{0,ℓ_1}}(ŷ) = σ_{w_{1,ℓ_1}}(ŷ). Therefore, this proposition follows from Proposition 1. □

Example 4 (Separation in pedestrian detection)
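On empirical data, Definition 12 amounts to comparing, for each actual label, the groups' conditional distributions of predicted labels. The sketch below (function names and the toy dataset are ours, not from the paper) checks both separation and its relaxation, equal opportunity:

```python
from collections import Counter

def pred_dist(samples, group, actual):
    """mu_{G_b, l}: empirical distribution of predicted labels among
    samples of group `group` whose actual label is `actual`.
    Each sample is a (group, actual_label, predicted_label) triple."""
    preds = [p for (g, y, p) in samples if g == group and y == actual]
    c, n = Counter(preds), len(preds)
    return {lab: c[lab] / n for lab in c}

def separation(samples, labels=("pos", "neg")):
    """Separation: mu_{G0, l} == mu_{G1, l} for every actual label l."""
    return all(pred_dist(samples, 0, l) == pred_dist(samples, 1, l)
               for l in labels)

def equal_opportunity(samples, adv_label="pos"):
    """Equal opportunity: the same condition restricted to the
    advantageous label only."""
    return pred_dist(samples, 0, adv_label) == pred_dist(samples, 1, adv_label)

# Toy test set: both groups have recall 0.5 on the advantageous label,
# but group 1 has a false positive on the negative label.
samples = [(0, "pos", "pos"), (0, "pos", "neg"), (0, "neg", "neg"),
           (1, "pos", "pos"), (1, "pos", "neg"), (1, "neg", "pos")]
print(equal_opportunity(samples))  # True: true positive rates coincide
print(separation(samples))         # False: specificities differ
```

This also illustrates why equal opportunity is a relaxation of separation: the toy dataset violates separation (the groups' specificities differ) yet satisfies equal opportunity, which only constrains the advantageous label.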
We illustrate separation using the pedestrian detection in Example 3, where a binary classifier C detects whether a pedestrian is crossing the road in an image x. Let ψ(x, ŷ) (resp. h(x, y)) represent that, given an image x, the classifier C (resp. a human) returns ŷ (resp. y), representing either detection or not.

The inherent technical difficulty of detecting a female pedestrian may differ from that of detecting a male pedestrian, because, for example, physical appearance may tend to differ between women and men. If we take this possible difference into account, separation may be better suited than independence.

The separation EqOdds_0(x, ŷ) between men and women guarantees that the conditional probability of detecting a pedestrian crossing the road, given that a human can actually recognize one, is the same for men and women. This fairness implies that (from the viewpoint of a pedestrian crossing the road) male and female pedestrians may be hit by an autonomous car as fairly as by a human-driven car.

7.4 Sufficiency (a.k.a. Conditional Use Accuracy Equality)

In this section we explain and formalize the notion of sufficiency [5], which is also known as conditional use accuracy equality [6].

While separation guarantees the equality of recall among different groups, sufficiency requires the equality of precision. More precisely, sufficiency is defined as the property that precision (positive predictive value) and negative predictive value (presented as NPV in Table 1) are the same for all groups, as follows.

Definition 13 (Sufficiency)
Given a group G_b ⊆ D and a predicted label ℓ̂, let μ_{G_b, ℓ̂} ∈ 𝔻L be the probability distribution of the actual class label ℓ when an input v ∈ G_b is sampled from a test dataset w_test and the classifier C outputs the predicted label ℓ̂; i.e., for each ℓ ∈ L,

μ_{G_b, ℓ̂}[ℓ]  def=  Pr[ H(v) = ℓ | v $← σ_{w_test}(x), v ∈ G_b, C(v) = ℓ̂ ].   (4)

A classifier C satisfies sufficiency between two groups G_0 and G_1 if μ_{G_0, ℓ̂} = μ_{G_1, ℓ̂} holds for all ℓ̂ ∈ L.

Then this notion can be expressed using our extension of StatEL as follows.

Proposition 8 (Sufficiency)
Let γ′(x, y, ℓ̂)  def=  ψ_{ℓ̂}(x) ∧ h(x, y). The sufficiency between two groups G_0 and G_1 under a given test dataset w_test is expressed as w_test ⊨ Sufficiency_0(x, y), where:

Sufficiency_ε(x, y)  def=  ⋀_{ℓ̂ ∈ L} ( (η_{G_0}(x) ∧ γ′(x, y, ℓ̂)) ∼^{ε, D_tv}_{y} (η_{G_1}(x) ∧ γ′(x, y, ℓ̂)) ).

Proof
Let ℓ̂ ∈ L and w_{b,ℓ̂} = w_test |_{η_{G_b}(x) ∧ ψ_{ℓ̂}(x) ∧ h(x, y)}. It follows from (4) that:

μ_{G_b, ℓ̂}[ℓ] = Pr[ σ_s(y) = ℓ | s $← w_{b,ℓ̂} ],

hence μ_{G_b, ℓ̂} = σ_{w_{b,ℓ̂}}(y). Thus, by Definition 13, the sufficiency between G_0 and G_1 is given by σ_{w_{0,ℓ̂}}(y) = σ_{w_{1,ℓ̂}}(y) for all ℓ̂ ∈ L. Therefore, this proposition follows from Proposition 1. □

It should be noted that for ε > 0, Sufficiency_ε(x, y) represents a relaxation of sufficiency up to bias ε in terms of the total variation D_tv.

Example 5 (Sufficiency in pedestrian detection)
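Dually to separation, Definition 13 conditions on the predicted label and compares the groups' distributions of actual labels. A sketch under the same toy encoding of (group, actual, predicted) triples (the names and data are ours):

```python
from collections import Counter

def actual_dist(samples, group, predicted):
    """mu_{G_b, l_hat}: empirical distribution of actual labels among
    group members that the classifier mapped to label `predicted`."""
    actuals = [y for (g, y, p) in samples if g == group and p == predicted]
    c, n = Counter(actuals), len(actuals)
    return {lab: c[lab] / n for lab in c}

def sufficiency(samples, labels=("pos", "neg")):
    """Sufficiency: mu_{G0, l_hat} == mu_{G1, l_hat} for every predicted
    label, i.e. precision and NPV agree across the two groups."""
    return all(actual_dist(samples, 0, lh) == actual_dist(samples, 1, lh)
               for lh in labels)

# Toy test set: precision 0.5 and NPV 1.0 in both groups.
samples = [(0, "pos", "pos"), (0, "neg", "pos"), (0, "neg", "neg"),
           (1, "pos", "pos"), (1, "neg", "pos"), (1, "neg", "neg")]
print(sufficiency(samples))  # True
```

Replacing the exact dictionary equality with a total-variation bound would give the ε-relaxed variant in the same way as for the other notions.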
We illustrate sufficiency using the pedestrian detection in Example 3, where a classifier C detects whether a pedestrian is crossing the road in an image x. As mentioned in Example 4, the inherent technical difficulty of detecting a male pedestrian may differ from that of detecting a female pedestrian. Whereas separation guarantees the equality of recall between men and women, sufficiency guarantees that of precision.

The sufficiency
Sufficiency_0(x, y) between men and women implies that the conditional probability that there is no pedestrian crossing the road when C detects one is the same for men and women. From the viewpoint of the car driver, when C raises a false alarm and stops the car suddenly, we have no bias regarding which of men and women are more likely to trigger false alarms and to be blamed for them.

In this section, we provide a brief overview of related work on the specification of statistical machine learning and on epistemic logic for describing specification.
Desirable properties of statistical machine learning.
There have been a large number of papers on attacks and defences for deep neural networks [40,12]. In comparison, however, not much work has been done to explore the formal specification of various properties of machine learning. Seshia et al. [38] present a list of desirable properties of DNNs (deep neural networks), although most of the properties are presented informally, without mathematical formulas. As for robustness, Dreossi et al. [13] propose a unifying formalization of adversarial input generation in a rigorous and organized manner, although they formalize and classify attacks (as optimization problems) rather than define the robustness notions themselves.

Concerning fairness notions, Barocas et al. [5] survey various fairness notions and classify them into three categories: independence, separation, and sufficiency. Gajane [17] surveys the formalization of fairness notions for machine learning and presents some justification based on the social science literature.
Epistemic logic for describing specification.
Epistemic logic [44] has been studied to represent and reason about knowledge and belief [16,21], and has been applied to describe various properties of distributed systems. The BAN logic [8], proposed by Burrows, Abadi, and Needham, is a notable example of epistemic logic used to model and verify authentication in cryptographic protocols. To improve the formalization of protocols' behaviors, some epistemic approaches integrate process calculi [24,11].

Epistemic logic has also been used to formalize and reason about privacy properties, including anonymity [39,19,29], the receipt-freeness of electronic voting protocols [25], and privacy policies for social network services [34]. Temporal epistemic logic is used to express information flow security policies [3].

Concerning the formalization of fairness notions, previous work in formal methods has modeled different kinds of fairness involving timing by using temporal logic rather than epistemic logic. As far as we know, no previous work has formalized fairness notions of machine learning by using modal logic.
Formalization of statistical properties.
In studies of philosophical logic, Lewis [31] presents the idea that when a random value has various possible probability distributions, those distributions should be represented on distinct possible worlds. Bana [4] puts Lewis's idea in a mathematically rigorous setting. Recently, a modal logic called statistical epistemic logic (StatEL) [27] has been proposed and used to formalize statistical hypothesis testing and the notion of differential privacy [14].

To describe statistical properties of machine learning models, this work uses StatEL to formalize the probabilistically chosen input to a learning model and the non-deterministically chosen dataset. However, we could possibly employ other logics (e.g., fuzzy logic [45] or Markov logic networks [37]) by extending them to deal with statistical sampling and non-deterministic inputs. Exploring the possibility of different formalizations using other logics is left for future work.
In this paper, we proposed an epistemic approach to the modeling of supervised learning and its desirable properties. Specifically, we employed a distributional Kripke model in which each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalized various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we clarified relationships among properties of classifiers, and the relevance between classification performance and robustness.

We emphasize that this is the first attempt to use epistemic models and logical formulas to describe statistical properties of machine learning, and that it should serve as a starting point for developing theories of the formal specification of machine learning.

In future work, we plan to extend our framework to formally reason about system-level properties of learning-based systems. We are also interested in developing a more general framework for the formal specification of machine learning associated with testing methods, as well as in implementing a prototype tool. Our future work will also include an extension of StatEL to formalize unsupervised learning and reinforcement learning.
Acknowledgements
I would like to thank the reviewers for their helpful and insightful comments. I am also grateful to Gergei Bana for his useful comments on part of a preliminary manuscript.
References
1. Angell, R., Johnson, B., Brun, Y., Meliou, A.: Themis: automatically testing software for discrimination. In: Proc. ESEC/SIGSOFT FSE, pp. 871–875. ACM (2018). DOI 10.1145/3236024.3264590
2. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: Proc. ICML, pp. 284–293 (2018)
3. Balliu, M., Dam, M., Guernic, G.L.: Epistemic temporal logic for information flow security. In: Proc. of PLAS, p. 6 (2011). DOI 10.1145/2166956.2166962
4. Bana, G.: Models of objective chance: An analysis through examples. In: Making it Formally Explicit, pp. 43–60. Springer International Publishing (2017). DOI 10.1007/978-3-319-55486-0_3
5. Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairmlbook.org (2019)
6. Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research (2018). DOI 10.1177/0049124118782533
7. Blackburn, P., de Rijke, M., Venema, Y.: Modal Logic. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press (2001). DOI 10.1017/CBO9781107050884
8. Burrows, M., Abadi, M., Needham, R.M.: A logic of authentication. ACM Trans. Comput. Syst. 8(1), 18–36 (1990). DOI 10.1145/77648.77649
9. Calders, T., Verwer, S.: Three naive Bayes approaches for discrimination-free classification. Data Min. Knowl. Discov. 21(2), 277–292 (2010). DOI 10.1007/s10618-010-0190-x
10. Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. In: Proc. S&P, pp. 39–57 (2017). DOI 10.1109/SP.2017.49
11. Chadha, R., Delaune, S., Kremer, S.: Epistemic logic for the applied pi calculus. In: Proc. of FMOODS/FORTE, pp. 182–197 (2009). DOI 10.1007/978-3-642-02138-1
12. […] CoRR abs/1810.00069 (2018). URL http://arxiv.org/abs/1810.00069
13. Dreossi, T., Ghosh, S., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: A formalization of robustness for deep neural networks. In: Proc. VNN (2019)
14. Dwork, C.: Differential privacy. In: Proc. of ICALP, pp. 1–12 (2006)
15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Proc. of ITCS, pp. 214–226. ACM (2012)
16. Fagin, R., Halpern, J., Moses, Y., Vardi, M.: Reasoning about Knowledge. The MIT Press (1995)
17. Gajane, P.: On formalizing fairness in prediction with machine learning. CoRR abs/1710.03184 (2017). URL http://arxiv.org/abs/1710.03184
18. Galhotra, S., Brun, Y., Meliou, A.: Fairness testing: testing software for discrimination. In: Proc. ESEC/FSE, pp. 498–510. ACM (2017). DOI 10.1145/3106237.3106277
19. Garcia, F.D., Hasuo, I., Pieters, W., van Rossum, P.: Provable anonymity. In: Proc. of FMSE, pp. 63–72 (2005). DOI 10.1145/1103576.1103585
20. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: Proc. of ICLR (2015)
21. Halpern, J.Y.: Reasoning about Uncertainty. The MIT Press (2003)
22. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Proc. NIPS, pp. 3315–3323 (2016)
23. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Proc. CAV, pp. 3–29 (2017). DOI 10.1007/978-3-319-63387-9
24. […] (1), 3–36 (2004)
25. Jonker, H.L., Pieters, W.: Receipt-freeness as a special case of anonymity in epistemic logic. In: Proc. Workshop On Trustworthy Elections (WOTE'06) (2006)
26. Katz, G., Barrett, C.W., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: Proc. CAV, pp. 97–117 (2017). DOI 10.1007/978-3-319-63387-9
27. […] LNCS, vol. 11760, pp. 344–362. Springer (2019). DOI 10.1007/978-3-030-31175-9
28. […] LNCS, vol. 11724, pp. 293–311. Springer (2019). DOI 10.1007/978-3-030-30446-1
29. […] (4), 559–576 (2007). DOI 10.11540/jsiamt.17.4
30. […] (5-6), 67–96 (1963)
31. Lewis, D.: A subjectivist's guide to objective chance. In: Studies in Inductive Logic and Probability, Volume II, pp. 263–293. Berkeley: University of California Press (1980)
32. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: Proc. ICLR (2018)
33. Moosavi-Dezfooli, S., Fawzi, A., Frossard, P.: DeepFool: A simple and accurate method to fool deep neural networks. In: Proc. CVPR, pp. 2574–2582 (2016). DOI 10.1109/CVPR.2016.282
34. Pardo, R., Schneider, G.: A formal privacy policy framework for social networks. In: Proc. SEFM, pp. 378–392 (2014). DOI 10.1007/978-3-319-10431-7
37. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–136 (2006). DOI 10.1007/s10994-006-5833-1
38. Seshia, S.A., Desai, A., Dreossi, T., Fremont, D.J., Ghosh, S., Kim, E., Shivakumar, S., Vazquez-Chanlatte, M., Yue, X.: Formal specification for deep neural networks. In: Proc. ATVA, pp. 20–34 (2018). DOI 10.1007/978-3-030-01090-4
43. […] (3), 64–72 (1969)
44. von Wright, G.H.: An Essay in Modal Logic. Amsterdam: North-Holland Pub. Co. (1951)
45. Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)