An Epistemic Approach to the Formal Specification of Statistical Machine Learning
Yusuke Kawamoto
Abstract
We propose an epistemic approach to formalizing statistical properties of machine learning. Specifically, we introduce a formal model for supervised learning based on a Kripke model where each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalize various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we show relationships among properties of classifiers, and the relevance between classification performance and robustness. As far as we know, this is the first work that uses epistemic models and logical formulas to express statistical properties of machine learning, and would be a starting point to develop theories of formal specification of machine learning.
Keywords
Modal logic · Possible world semantics · Machine learning · Classification performance · Robustness · Fairness
Yusuke Kawamoto, AIST, Tsukuba, Japan. ORCID: 0000-0002-2151-9560. This work was supported by the New Energy and Industrial Technology Development Organization (NEDO), by the ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JST, and by Inria under the project LOGIS.

1 Introduction

With the increasing use of machine learning in real-life applications, the safety and security of learning-based systems have been of great interest. In particular, many recent studies [40,12] have found vulnerabilities in the robustness of deep neural networks (DNNs) to malicious inputs, which can lead to disasters in security-critical systems, such as self-driving cars. To find these vulnerabilities in advance, there has been research on formal verification and testing methods for the robustness of DNNs in recent years [23,26,35,41]. However, relatively little attention has been paid to the formal specification of machine learning [38].

In the research field of formal specification and verification, logical approaches have been shown useful to characterize desired properties and to develop theories to discuss those properties. For example, temporal logic [36] is a branch of modal logic for expressing time-dependent propositions, and has been widely used to describe requirements of hardware and software systems. For another example, epistemic logic [44] is a modal logic for knowledge and belief that has been employed as a formal policy language for distributed systems (e.g., for the authentication [8] and the anonymity [39] of security protocols). As far as we know, however, no prior work has employed logical formulas to rigorously describe various statistical properties of machine learning, although there are some papers that (often informally) list various desirable properties of machine learning [38].

In this paper, we present a first logical formalization of statistical properties of machine learning.
To describe the statistical properties in a simple and abstract way, we extend statistical epistemic logic (StatEL) [27], which has recently been proposed to describe statistical knowledge and has been applied to formalize statistical hypothesis testing and statistical privacy of databases.

A key idea in our modeling of statistical machine learning is that we formalize logical aspects at the syntax level, and statistical distances and dataset operations at the semantics level by using accessibility relations of a Kripke model [30]. In this model, we formalize supervised learning and some of its desirable properties, including performance, robustness, and fairness. More specifically, classification performance and robustness are described as the differences between the correct class label and the classifier's prediction, whereas fairness is expressed as a conditional indistinguishability between different groups.
Our contributions.
The main contributions of this work are as follows:

– We propose a logical approach to formalizing statistical properties of machine learning in a simple and abstract way. Specifically, we introduce a principle that logical aspects of statistical properties are described at the syntax level, and statistical distances and datasets are formalized at the semantics level.

– We formalize supervised learning models and test datasets (used to check whether the learning models satisfy specifications) by employing a distributional Kripke model [27] where each possible world corresponds to a possible test dataset, and modal operators are interpreted as transformation and testing on datasets. Then we show how the sampling from a dataset and non-deterministic adversarial inputs are formalized in the distributional Kripke model.

– We propose an extension of statistical epistemic logic (StatEL) as a formal language to describe various properties of machine learning models, including the performance, robustness, and fairness of statistical classifiers. Then the satisfaction of logical formulas representing those properties is associated with their testing using a test dataset. As far as we know, this is the first work that uses logical formulas to formalize various statistical properties of machine learning, and that provides an epistemic view on those properties.

– We show some relationships among properties of classifiers, such as different levels of robustness. We also present certain relationships between classification performance and robustness, which suggest robustness-related properties that have not been formalized in the literature as far as we know.
Cautions and limitations.
In this paper, we focus on formalizing properties of supervised learning models that may be tested by using a dataset; i.e., we do not deal with unsupervised learning, reinforcement learning, the properties of learning algorithms, quality of training data (e.g., sample bias), quality of testing (e.g., coverage criteria), explainability, temporal properties, or system-level specification. It should be noted that most of the properties formalized in this paper have been known in the machine learning literature, and the novelty of this work lies in the logical formulation of those statistical properties.

We also highlight that this work aims to provide a logical approach to the modeling of statistical properties tested with a dataset, and does not present methods for checking, guaranteeing, or improving the performance/robustness/fairness of machine learning models. As for the satisfiability of logical formulas, we leave the development of testing and (statistical) model checking algorithms as future work, since the research area on the testing and verification of machine learning is relatively new and needs further techniques to improve scalability. Moreover, in some applications such as image recognition, some atomic formulas (e.g., representing whether an input image is a panda) cannot be defined mathematically, and require additional techniques based on experiments. Nevertheless, we demonstrate that describing various properties using logical formulas is useful to explore desirable properties and to discuss their relationships within a single framework.

Finally, we emphasize that our work is the first attempt to use epistemic models and logical formulas to express statistical properties of machine learning models, and would be a starting point to develop theories of formal specification of machine learning in future research.
Relationship with the preliminary version.
The main novelties of this paper with respect to the preliminary version [28] are as follows:

– We add how the satisfaction of a formula at a possible world can be regarded as the testing of a specification using a test dataset (Sect. 3.1).

– We show how modal operators are used to model the transformation and testing on datasets. For example, data preparation T (e.g., data cleaning, data augmentation) can also be formalized as a modal operator ∆_T (Sect. 3.2).

– We re-interpret the non-classical implication ⊃ for conditional probabilities in StatEL as a modal operator associated with a conditioning relation (Sect. 3.3).

– We introduce a modal operator ∼^{ε,D}_x for conditional indistinguishability (Sect. 3.4). Then we provide a more comprehensible formalization of the fairness of supervised learning (Sect. 7) without using counterfactual epistemic operators [28], because the formalization using these operators requires an additional formula and makes the presentation more complicated and unintuitive.

– We add a formalization of generalization error to capture how accurately a classifier is able to classify previously unseen input data (Sect. 5.3).

– We add formalizations of other fairness notions called separation (Sect. 7.3) and sufficiency (Sect. 7.4) so that this paper covers all three categories of fairness notions [5].

– We show a running example of pedestrian detection to illustrate the formalization of various notions of performance, robustness, and fairness.
Paper organization.
The rest of this paper is organized as follows. Sect. 2 presents notations used in this paper and provides background on statistical distances and statistical epistemic logic (StatEL). Sect. 3 introduces a different view on the modal operators in StatEL and extends the logic with additional operators. Sect. 4 introduces a formal model for describing the behaviors of statistical classifiers and non-deterministic adversarial inputs. Sects. 5, 6, and 7 respectively formalize various notions of the performance, robustness, and fairness of classifiers by using our extension of StatEL. Sect. 8 presents related work and Sect. 9 concludes.
In this section we introduce some notations, and review background on statistical distance notions and the syntax and semantics of statistical epistemic logic (StatEL), introduced in [27].

2.1 Notations

Let R≥0 be the set of non-negative real numbers, and [0, 1] be the set of non-negative real numbers not greater than 1. We denote by DO the set of all probability distributions over a finite set O. Given a finite set O and a probability distribution µ ∈ DO, the probability of sampling a value v from µ is denoted by µ[v]. For a subset R ⊆ O, let µ[R] = Σ_{v∈R} µ[v]. For a distribution µ over a finite set O, its support is defined by supp(µ) = {v ∈ O : µ[v] > 0}.

2.2 Statistical Distance

We recall two popular notions of distance between probability distributions: the total variation and the ∞-Wasserstein distance.

Informally, the total variation between two distributions µ1 and µ2 over a set O represents the largest difference between the probabilities that µ1 and µ2 assign to an identical subset R of O.

Definition 1 (Total variation) For a finite set O, the total variation D_tv of two distributions µ1, µ2 ∈ DO is defined by:

  D_tv(µ1 ‖ µ2) def= sup_{R⊆O} | µ1[R] − µ2[R] |.

We then recall the ∞-Wasserstein metric [43]. Intuitively, the ∞-Wasserstein metric W_d(µ1, µ2) between two distributions µ1, µ2 represents the minimum largest move between points in a transportation from µ1 to µ2.

Definition 2 (∞-Wasserstein metric) Let O be a finite set and d : O × O → R≥0 be a metric over O. The ∞-Wasserstein metric W_d w.r.t. d between two distributions µ1, µ2 ∈ DO is defined by:

  W_d(µ1, µ2) = min_{µ ∈ cp(µ1,µ2)} max_{(v1,v2) ∈ supp(µ)} d(v1, v2)

where cp(µ1, µ2) is the set of all couplings of µ1 and µ2.¹

¹ A coupling of two distributions µ1, µ2 ∈ DO is a joint distribution µ ∈ D(O × O) such that µ1 and µ2 are µ's marginal distributions, i.e., for each v1 ∈ O, µ1[v1] = Σ_{v′∈O} µ[v1, v′] and for each v2 ∈ O, µ2[v2] = Σ_{v′∈O} µ[v′, v2]. For a coupling µ, the support supp(µ) is the maximum subset of O × O whose elements are assigned non-zero probabilities in µ.

2.3 Syntax of StatEL

We next recall the syntax of statistical epistemic logic (StatEL) [27], which has two levels of formulas: static and epistemic formulas. Intuitively, a static formula describes a proposition satisfied at a (deterministic) state, while an epistemic formula describes a proposition satisfied at a probability distribution of states. In this paper, the former is used only to define the latter.

Formally, let Mes be a set of symbols called measurement variables, and Γ be a set of atomic formulas of the form γ(x1, x2, …, xn) for a predicate symbol γ, n ≥ 0, and x1, x2, …, xn ∈ Mes. Let I ⊆ [0, 1] be a finite union of disjoint intervals, and A be a finite set of indices (e.g., associated with statistical divergences). Then the formulas are defined by:

  Static formulas:    ψ ::= γ(x1, x2, …, xn) | ¬ψ | ψ ∧ ψ
  Epistemic formulas: ϕ ::= P_I ψ | ¬ϕ | ϕ ∧ ϕ | ψ ⊃ ϕ | K_a ϕ

where a ∈ A. We denote by F the set of all epistemic formulas. Note that we have no quantifiers over measurement variables. (See Sect. 2.5 for more details.)

The probability quantification P_I ψ represents that a static formula ψ is satisfied with a probability belonging to a set I. For instance, P_{(0.95,1]} ψ represents that ψ holds with a probability greater than 0.95. By ψ ⊃ P_I ψ′ we represent that the conditional probability of ψ′ given ψ is included in a set I. The epistemic knowledge K_a ϕ expresses that we know ϕ when our capability of observation is denoted by a ∈ A.

As syntactic sugar, we use disjunction ∨, classical implication →, and epistemic possibility P_a, defined as usual by: ϕ1 ∨ ϕ2 ::= ¬(¬ϕ1 ∧ ¬ϕ2), ϕ1 → ϕ2 ::= ¬ϕ1 ∨ ϕ2, and P_a ϕ ::= ¬K_a ¬ϕ. When I is a singleton {i}, we abbreviate P_I as P_i.

2.4 Distributional Kripke Model

Next we recall the notion of a distributional Kripke model [27], where each possible world is associated with a probability distribution over a set of states, and with a stochastic assignment of data to measurement variables.

Definition 3 (Distributional Kripke model)
Let A be a finite set of indices (typically associated with operations and tests on datasets), S be a finite set of states, and O be a finite set of data, called a data domain. A distributional Kripke model is a tuple M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) consisting of:

– a non-empty set W of multisets of states belonging to S;
– for each a ∈ A, an accessibility relation R_a ⊆ W × W;
– for each s ∈ S, a valuation V_s : Γ → P(O^k) that maps each k-ary predicate γ to a set V_s(γ) of k-tuples of data.

The set W is called a universe, and its elements are called possible worlds. A world is said to be finite if it is a finite multiset, i.e., it has a finite number of (possibly duplicated) elements. A world is said to be infinite if it is an infinite multiset.

The relation R_a determines an accessibility between two worlds. For example, (w, w′) ∈ R_a means that a world w′ is accessible from a world w when our capability of distinguishing possible worlds is denoted by a ∈ A. The valuation V_s may give a possibly different interpretation of a predicate γ at a different state s. We assume that all measurement variables range over the same data domain O in every world. The interpretation of measurement variables at a state s is given by a deterministic assignment σ_s defined below.

Definition 4 (Deterministic assignment) For any distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}), we assume that each world w ∈ W is associated with a function ρ_w : Mes × S → O that maps each measurement variable x to its value ρ_w(x, s) observed at a state s belonging to the world w. We also assume that each state s in a world w is associated with the deterministic assignment σ_s : Mes → O defined by σ_s(x) = ρ_w(x, s).

Since each world w is a multiset of states, we abuse notation and denote by w[s] the probability that a state s is randomly chosen from w (i.e., the number of occurrences of s in the multiset w, divided by the total number of elements in w). Here we regard each world w as the probability distribution over states that corresponds to the multiset.

The probability that a measurement variable x ∈ Mes has a value v ∈ O is σ_w(x)[v] = Σ_{s∈w, σ_s(x)=v} w[s]. Note that σ_w : Mes → DO maps each measurement variable x to a probability distribution σ_w(x) over the data domain O. Hence σ_w represents the joint probability distribution of all variables in Mes, and is called the stochastic assignment at w. When a state s is uniformly drawn from a multiset w of states, a datum σ_s(x) is sampled from the distribution σ_w(x).

In later sections, a possible world corresponds to a dataset (i.e., a multiset of data tuples) from which data are sampled. For example, suppose that we have only three measurement variables Mes = {x, y, z}. Then for each state s in a world w, the deterministic assignment σ_s : Mes → O represents the tuple of data (σ_s(x), σ_s(y), σ_s(z)). Hence each state s corresponds to a tuple of data, and the world w corresponds to the dataset {(σ_s(x), σ_s(y), σ_s(z)) | s ∈ w}.

2.5 Stochastic Semantics of StatEL

Now we recall the stochastic semantics [27] for the StatEL formulas over a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) with W = DS.

The interpretation of a static formula ψ at a state s is given by:

  s ⊨ γ(x1, …, xk)  iff  (σ_s(x1), …, σ_s(xk)) ∈ V_s(γ)
  s ⊨ ¬ψ            iff  s ⊭ ψ
  s ⊨ ψ ∧ ψ′        iff  s ⊨ ψ and s ⊨ ψ′.

The restriction w|ψ of a world w to a static formula ψ is defined by w|ψ[s] = w[s] / Σ_{s′ : s′⊨ψ} w[s′] if s ⊨ ψ, and w|ψ[s] = 0 otherwise. Note that w|ψ is undefined if there is no state s that satisfies ψ and has a non-zero probability in w.

Then the interpretation of epistemic formulas in a world w is defined by:

  M, w ⊨ P_I ψ   iff  Pr[ s ←$ w : s ⊨ ψ ] ∈ I
  M, w ⊨ ¬ϕ      iff  M, w ⊭ ϕ
  M, w ⊨ ϕ ∧ ϕ′  iff  M, w ⊨ ϕ and M, w ⊨ ϕ′
  M, w ⊨ ψ ⊃ ϕ   iff  w|ψ is defined and M, w|ψ ⊨ ϕ
  M, w ⊨ K_a ϕ   iff  for every w′ s.t. (w, w′) ∈ R_a, M, w′ ⊨ ϕ,

where s ←$ w represents that a state s is sampled from the distribution w.

Then M, w ⊨ ψ ⊃ P_I ψ′ represents that the conditional probability of satisfying a static formula ψ′ given another ψ is included in a set I at a world w.

In each world w, measurement variables can be interpreted using σ_w. This allows us to assign different values to different occurrences of a variable in a formula; e.g., in ϕ(x) → K_a ϕ′(x), the x occurring in ϕ(x) is interpreted by σ_w in a world w, while the x in ϕ′(x) is interpreted by σ_{w′} in another world w′ s.t. (w, w′) ∈ R_a.

Finally, the interpretation of an epistemic formula ϕ in M is given by:

  M ⊨ ϕ  iff  for every world w in M, M, w ⊨ ϕ.

Hereafter we mainly focus on satisfaction local to a possible world, and M may be omitted when it is clear from the context.

In this section we introduce a different view on the modal operators in statistical epistemic logic (StatEL), and define additional modal operators that are used to formalize various properties of machine learning in Sects. 5 to 7.

3.1 Checking Satisfaction at a World as Testing with a Dataset

We first show how we regard the satisfaction of a formula ϕ as testing a system's specification expressed by ϕ.

As explained in Sect. 2.4, a possible world corresponds to a possible dataset. Thus, given a model M, a world w, and a formula ϕ, checking the satisfaction M, w ⊨ ϕ can be regarded as testing whether the specification ϕ of a system (e.g., a machine learning model we formalize in Sect. 4) is satisfied when the dataset w provides inputs to the system. For example, let ϕ be a formula representing that a machine learning task (e.g., classification) C fails with probability at most 5%. Then M, w ⊨ ϕ represents that when the learning task C is performed using a test dataset w, it fails for at most 5% of the test data in w.

For simplicity, we discuss the satisfaction of formulas ϕ in which neither K_a nor P_a occurs. For each state (namely, data tuple) s ∈ w and for each static sub-formula ψ of ϕ, we can efficiently check whether s ⊨ ψ. When the dataset w is finite (i.e., it is a finite multiset of data tuples), we can check the satisfaction w ⊨ ϕ in finite time, more precisely, in linear time in the number of elements in w. When the dataset w is infinite, however, we cannot check whether w ⊨ ϕ in general.
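For a finite world, the linear-time check mentioned above can be sketched as follows. This is only an illustrative sketch, not from the paper: we represent a world as a Python list of dicts (a multiset of states), a static formula as a predicate on states, and the names `world`, `sigma_w`, and `prob` are ours.

```python
from fractions import Fraction

# A finite world is a multiset of states; each state s is a deterministic
# assignment sigma_s : Mes -> O, modeled here as a dict.  The variables
# and toy values (x, y, y_hat, "img1", ...) are illustrative only.
world = [
    {"x": "img1", "y": "cat", "y_hat": "cat"},
    {"x": "img2", "y": "cat", "y_hat": "dog"},
    {"x": "img3", "y": "dog", "y_hat": "dog"},
    {"x": "img3", "y": "dog", "y_hat": "dog"},  # duplicate states are allowed
]

def sigma_w(world, x):
    """sigma_w(x): distribution of x's values under a uniform draw of a state."""
    dist = {}
    for s in world:
        dist[s[x]] = dist.get(s[x], Fraction(0)) + Fraction(1, len(world))
    return dist

def prob(world, psi):
    """Pr[s <-$ w : s |= psi] for a static formula psi, given as a predicate."""
    return Fraction(sum(1 for s in world if psi(s)), len(world))

correct = lambda s: s["y"] == s["y_hat"]  # a static formula as a predicate

print(sigma_w(world, "y"))   # cat and dog each have probability 1/2
print(prob(world, correct))  # 3/4
# w |= P_{(1/2,1]} correct: the probability lies in the interval (1/2, 1]
print(Fraction(1, 2) < prob(world, correct) <= 1)  # True
```

Each state is visited once per static sub-formula, which is the linear-time bound stated above.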
For example, suppose that w is the infinite dataset representing a true distribution from which data are sampled and observed. When we cannot learn w itself, we usually obtain a finite dataset w_fin by sampling data from w repeatedly and independently, and check a specification ϕ only with this test dataset w_fin.

Hereafter, we mainly deal with distributional Kripke models M that have infinite numbers of finite worlds. In the following sections except Sect. 6, we deal only with formulas without K_a or P_a, hence we can check their satisfaction at a finite world in finite time.²

² The testing of a formula ϕ is not feasible when an epistemic operator K_a or P_a occurs in ϕ and the model M has a large number of possible worlds. A detailed analysis of the time complexity of StatEL is out of the scope of this paper, and should be included in the journal version of our paper [27] that proposed StatEL. As we will discuss in Sect. 6, the robustness of machine learning is formalized using these epistemic operators, hence cannot be tested in practical time unless M is comprised of a small number of worlds.

3.2 Modal Operators for Dataset Transformation

In the rest of Sect. 3, we show that modal operators can be used to model the transformation and testing on datasets.

First, we introduce modal operators for dataset transformation. The modal operator ∆_T defined below is unary (i.e., it takes a single formula as argument), and is parameterized with a transformation T between datasets. Intuitively, w ⊨ ∆_T ϕ represents that a formula ϕ is satisfied for the dataset w′ that is obtained by transforming the current dataset w by T. Formally, the modal operator ∆_T is interpreted as follows.

Definition 5 (Modality ∆_T for a dataset transformation T) Given a function T : W → W, we define an accessibility relation as R_T def= {(w, w′) | w′ = T(w)}. Then we define the interpretation of ∆_T by:

  M, w ⊨ ∆_T ϕ  iff  there is a w′ s.t. (w, w′) ∈ R_T and M, w′ ⊨ ϕ.

For example, machine learning often requires data preparation to manipulate a given raw dataset into a form that makes a machine learning task feasible and more effective (e.g., data cleaning, data augmentation). For a dataset w and two ways of data preparation T1 and T2, w ⊨ ∆_{T1} ϕ ∧ ∆_{T2} ϕ represents that a property ϕ holds for the two prepared datasets T1(w) and T2(w).

For another example, the security of machine learning often assumes a certain malicious adversary that can manipulate a given dataset to make a machine learning task fail. Such adversarial operations T on datasets can also be formalized using a different modal operator corresponding to T, as we will explain in Sect. 6.

In the next section, we show that the logical connective ⊃ can be re-interpreted as the modality ∆_T for some dataset transformation T.

3.3 Modality for Conditioning

We then present another interpretation of the logical connective ⊃ (defined in Sect. 2.5) used to express conditional probabilities in Sects. 5 and 6. Roughly speaking, we regard the restriction w|ψ of a world w to a static formula ψ as a transformation R_ψ of w. Then we redefine ⊃ as a modal operator associated with R_ψ, and call it the conditioning operator. Formally, the interpretation of ⊃ is defined as follows.

Definition 6 (Conditioning operator ⊃) Assume that the universe W includes all sub-multisets of each w ∈ W. Given a static formula ψ, we define an accessibility relation as the conditioning relation R_ψ def= {(w, w|ψ) | w ∈ W}. Then the interpretation of the conditioning operator ⊃ is given by:

  M, w ⊨ ψ ⊃ ϕ  iff  there is a w′ s.t. (w, w′) ∈ R_ψ and M, w′ ⊨ ϕ.

Intuitively, w ⊨ ψ ⊃ ϕ corresponds to the two operations: (i) transforming the given dataset w to the sub-dataset w|ψ and (ii) testing whether a property ϕ holds for the sub-dataset w|ψ.
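Both modalities read as dataset operations: ∆_T transforms and then tests, while ⊃ restricts and then tests. A minimal sketch, assuming worlds are lists of record dicts and dataset formulas are Python predicates (the names `delta`, `implies`, `T_clean`, and the records are illustrative, not from the paper):

```python
# Worlds as multisets of states (lists of dicts); formulas on datasets
# as Python predicates.

def delta(T, phi, world):
    """w |= Delta_T phi: transform the dataset by T, then test phi on T(w)."""
    return phi(T(world))

def implies(psi, phi, world):
    """w |= psi 'conditioned into' phi: restrict w to the states satisfying
    psi, then test phi on the sub-dataset w|psi (False here when the
    restriction is undefined, i.e., empty)."""
    sub = [s for s in world if psi(s)]
    return bool(sub) and phi(sub)

world = [
    {"x": "img1", "label": "cat"},
    {"x": "img2", "label": None},   # a record that needs data cleaning
    {"x": "img3", "label": "dog"},
]

# A hypothetical data-preparation step T: drop unlabeled records.
T_clean = lambda w: [s for s in w if s["label"] is not None]
all_labeled = lambda w: all(s["label"] is not None for s in w)

print(all_labeled(world))                  # False: the raw dataset fails
print(delta(T_clean, all_labeled, world))  # True: Delta_T holds after cleaning
print(implies(lambda s: s["label"] == "cat", lambda w: len(w) == 1, world))  # True
```

The uniform restrict-then-test shape of `implies` is exactly why the paper can treat ψ ⊃ ϕ as the modality ∆_T ϕ with T(w) = w|ψ.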
When no data in the dataset w satisfy the property ψ, we can describe this as M, w ⊨ ψ ⊃ ⊥ by using the propositional constant falsum ⊥.

Note that the conditioning ψ ⊃ ϕ can be regarded as the modal formula ∆_T ϕ with the dataset transformation T where T(w) = w|ψ for all w ∈ W.

In Sects. 5 and 6, we show concrete examples using the conditioning operator ⊃, i.e., the classification performance and robustness of statistical classifiers.

3.4 Modality for Conditional Indistinguishability

Next, we introduce a modal operator that is used to formalize the fairness of machine learning in Sect. 7.

Given two static formulas ψ1, ψ2 (e.g., representing male and female), w|ψ1(x) (resp. w|ψ2(x)) represents the probability distribution of the values of a measurement variable x generated from the sub-dataset w|ψ1, e.g., the sub-dataset about males (resp. w|ψ2, e.g., about females). To formalize a certain similarity between x's values generated from the two sub-datasets (e.g., between the benefits for males and for females), we introduce a modal operator ∼^{ε,D}_x for conditional indistinguishability as follows. We write ψ1 ∼^{ε,D}_x ψ2 to represent that the two distributions w|ψ1(x) and w|ψ2(x) are indistinguishable up to a threshold ε in terms of a divergence or distance D. Formally, this modality is defined as follows.

Definition 7 (Conditional indistinguishability operator ∼^{ε,D}_x) Assume that the universe W includes all sub-multisets of each w ∈ W. Given an x ∈ Mes, an ε ∈ R≥0, and a divergence or distance D : DO × DO → R≥0, we define an accessibility relation by:

  R^{ε,D}_x def= {(w1, w2) ∈ W × W | D(σ_{w1}(x) ‖ σ_{w2}(x)) ≤ ε}.

Then for static formulas ψ1 and ψ2, we define the interpretation of ψ1 ∼^{ε,D}_x ψ2 by:

  M, w ⊨ ψ1 ∼^{ε,D}_x ψ2  iff  there exist w1, w2 s.t. (w, w1) ∈ R_{ψ1}, (w, w2) ∈ R_{ψ2}, and (w1, w2) ∈ R^{ε,D}_x,

where R_{ψ1} and R_{ψ2} are the conditioning relations in Definition 6.

Note that two worlds are related by R^{ε,D}_x if they have close probability distributions of the values of x. Intuitively, w ⊨ ψ1 ∼^{ε,D}_x ψ2 corresponds to the two operations: (i) transforming the given dataset w into the two sub-datasets w|ψ1 and w|ψ2, and (ii) testing whether the probability distribution of x generated by the dataset w|ψ1 is indistinguishable from the distribution generated by the dataset w|ψ2.

When ε = 0, the operator ∼^{0,D}_x represents the identity of two distributions.³

³ The semantics for the (binary) composite operator in the arrow logic [7] resembles that for ∼^{ε,D}_x in Definition 7, although it has a totally different meaning and motivation.

Proposition 1
For a world w, static formulas ψ1, ψ2, and a measurement variable x, w ⊨ ψ1 ∼^{0,D}_x ψ2 iff the distribution w|ψ1(x) is identical to w|ψ2(x).

This proposition is immediate from the following lemma.
Lemma 1
For a world w, static formulas ψ1, ψ2, and a measurement variable x, w ⊨ ψ1 ∼^{ε,D}_x ψ2 iff D(σ_{w|ψ1}(x) ‖ σ_{w|ψ2}(x)) ≤ ε.

Proof
Let w1 = w|ψ1 and w2 = w|ψ2. Then by Definition 6, we have (w, w1) ∈ R_{ψ1} and (w, w2) ∈ R_{ψ2}. Hence this lemma follows from Definition 7. ⊓⊔

In Sect. 7, we present examples using the conditional indistinguishability operator, i.e., we formalize various notions of fairness in machine learning by using this operator and the above proposition and lemma.

3.5 Summary on the Modal Language

In summary, modal operators are used to represent transformation and testing on datasets. The unary modal operator ∆_T is regarded as a transformation T on datasets, while the binary modal operators ⊃ and ∼^{ε,D}_x are regarded as transforming-then-testing on datasets.

Now the syntax of the formulas is given by:

  Static formulas:  ψ ::= γ(x1, x2, …, xn) | ¬ψ | ψ ∧ ψ
  Dataset formulas: ϕ ::= P_I ψ | ¬ϕ | ϕ ∧ ϕ | ∆_T ϕ | ψ ⊃ ϕ | ψ ∼^{ε,D}_x ψ | K_a ϕ,

where the epistemic formulas with the additional modalities are called dataset formulas, since they are interpreted in a world that corresponds to a dataset.

When multiple transformations/tests are sequentially applied to datasets, we can use dataset formulas in which different modal operators are nested. For example, w ⊨ ∆_T(ψ ⊃ ϕ) represents that after applying a data preparation T to a dataset w, a property ϕ holds for the sub-dataset T(w)|ψ that satisfies ψ.

In this section we introduce a formal model for supervised learning. Specifically, we employ a distributional Kripke model (Definition 3), and formalize the behavior of a classifier C and a non-deterministic input x from an adversary in the model. In this formalization, we focus only on the testing of supervised learning models, and do not formalize the training of supervised learning models or learning algorithms themselves.

4.1 Classification Problems

Multiclass classification is the problem of classifying a given input into one of multiple classes.
Let L be a finiteset of class labels , and D be a finite set of input data (called feature vectors ) that we want to classify. Then a classifier is a function C : D → L that receives an inputdatum v and predicts which class (among L ) the input v belongs to. In this work, we deal with a situationwhere some classifier C has already been obtained andits properties should be evaluated, and do not model orreason about how classifiers are trained from a trainingdataset.We assume a scoring function f : D × L → R thatgives a score f ( v, (cid:96) ) of predicting the class of an inputdatum (feature vector) v as a label (cid:96) . Then for eachinput v ∈ D , we denote by H ( v ) = (cid:96) to represent thata label (cid:96) maximizes f ( v, (cid:96) ). For example, when the in-put v is an image of an animal and (cid:96) is the animal’sname, then H ( v ) = (cid:96) may represent that an oracle (ora “human”) classifies the image v as (cid:96) .4.2 Modeling the Behaviors of ClassifiersA classifier is formalized on a distributional Kripkemodel M = ( W , ( R a ) a ∈A , ( V s ) s ∈S ) with W = D S . Then W is an infinite set of possible worlds that correspondsto all possible datasets from which the classifier can re-ceive input data. We denote by w test ∈ W a real worldthat corresponds to a test dataset. Recall that eachworld w ∈ W is a multiset of states over S and is as-sociated with a stochastic assignment σ w : Mes → D O that is consistent with the deterministic assignments σ s for all s ∈ w , as explained in Sect. 2.4.We present an overview of our formalization in Fig. 1.We denote by x ∈ Mes an input datum given to theclassifier C (and to the oracle H ), by y ∈ Mes a correctlabel given by the oracle H , and by ˆ y ∈ Mes a labelpredicted by C . We assume that the input variable x (resp. the output variables y, ˆ y ) ranges over the set D ofinput data (resp. 
the set L of labels); i.e., the deterministic assignment σ_s at each state s ∈ S has the range O = D ∪ L and satisfies σ_s(x) ∈ D and σ_s(y), σ_s(ŷ) ∈ L. (Regression can be regarded as a classification problem whose label ranges over the real numbers, and hence can be formalized using a distributional Kripke model analogously; for simplicity, however, we deal only with classification problems in this paper.)

A key idea in our modeling is that we describe the logical aspects of statistical properties at the syntax level by using logical formulas, and model statistical distances and dataset operations at the semantics level by using accessibility relations in the distributional Kripke model. In this way, we can formalize various statistical properties of classifiers in a simple and abstract way.

Fig. 1: A world w is chosen non-deterministically and corresponds to a test dataset. With probability w[s_i], the world w is in a deterministic state s_i where the classifier C receives the input value σ_{s_i}(x) and returns the output value σ_{s_i}(ŷ). Each state s_i can be regarded as a tuple (σ_{s_i}(x), σ_{s_i}(y), σ_{s_i}(ŷ)) ∈ D × L × L consisting of an input datum, an actual label, and a predicted label.

To formalize the classifier C, we introduce a static formula ψ(x, ŷ) representing that C classifies a given input x as a class ŷ. We also introduce a static formula h(x, y) representing that y is the actual class of an input x. As abbreviations, we write ψ_ℓ(x) (resp. h_ℓ(x)) to denote ψ(x, ℓ) (resp. h(x, ℓ)). Formally, these static formulas are interpreted at each state s ∈ S as follows:

  s ⊨ ψ(x, ŷ) iff C(σ_s(x)) = σ_s(ŷ),
  s ⊨ h(x, y) iff H(σ_s(x)) = σ_s(y).

The model M can formalize an input x that is probabilistically chosen from a given dataset. As explained in Sect. 2.4, each world w corresponds to a test dataset. When a state s is drawn from the multiset w of states, an input value σ_s(x) is sampled from the distribution σ_w(x) and assigned to the measurement variable x. The set of all possible probability distributions of inputs is represented by Λ ≝ { σ_w(x) | w ∈ W }, which is possibly an infinite set.

For example, let us consider testing the classifier C with the actual test dataset σ_{w_test}(x). When C classifies an input x as a label ℓ with probability 0.2, i.e., Pr[ v ←$ σ_{w_test}(x) : C(v) = ℓ ] = 0.2, then this can be expressed by M, w_test ⊨ P_{0.2} ψ_ℓ(x).

Next we observe that our model can formalize a non-deterministic input x from an adversary as follows. Although each state s in a possible world w is assigned the probability w[s], each world w itself is not assigned a probability. Thus each input distribution σ_w(x) ∈ Λ is also not assigned a probability; that is, our model assumes no probability distribution over Λ. In other words, we assume that a world w, and thus an input distribution σ_w(x), is chosen non-deterministically. This is useful for modeling an adversary that provides malicious inputs to the classifier C to make its prediction fail, because we usually have no prior knowledge of the probability distribution of malicious inputs from adversaries, and need to reason about the worst cases caused by the attack. In Sect. 6, this formalization of non-deterministic inputs is used to express the robustness of classifiers.

Finally, it should be noted that we cannot enumerate all possible adversarial inputs, hence cannot enumerate all possible datasets to construct the universe W. Since W can be an infinite set and is unspecified, we cannot check whether a formula expressing a security property against an adversary is satisfied in all possible worlds of W. Nevertheless, as shown in later sections, describing various properties using our extension of StatEL is useful to explore desirable properties and to discuss relationships among them.

5 Classification Performance

In this section we show a formalization of classification performance using our extension of StatEL. We formalize popular measures of classification performance, including precision, recall, and accuracy, as well as measures for evaluating overfitting, such as the generalization error. See Fig. 2 for the basic ideas behind these formalizations.
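Before turning to the individual measures, the basic satisfaction check used above, M, w_test ⊨ P_I ψ_ℓ(x), reduces on a finite test dataset to counting how often C predicts ℓ. A minimal sketch in Python; the classifier `C`, the label, and the dataset below are illustrative stand-ins, not part of the paper's formalism:

```python
# Sketch: checking M, w_test |= P_I psi_l(x) on a finite test dataset.
# The classifier C, the label "pos", and the dataset are illustrative.

def prob_predicts(classifier, dataset, label):
    """Empirical probability Pr[v <-$ sigma_w(x) : C(v) = label]."""
    hits = sum(1 for v in dataset if classifier(v) == label)
    return hits / len(dataset)

def satisfies_P_I(classifier, dataset, label, interval):
    """Does the empirical probability fall within the interval I?"""
    lo, hi = interval
    return lo <= prob_predicts(classifier, dataset, label) <= hi

# Toy example: a threshold classifier over integer "feature vectors".
C = lambda v: "pos" if v >= 5 else "neg"
w_test = [1, 3, 5, 7, 9]   # multiset of states, each fixing sigma_s(x)

p = prob_predicts(C, w_test, "pos")   # 3 of 5 inputs classified "pos"
print(p)                              # 0.6
print(satisfies_P_I(C, w_test, "pos", (0.5, 0.7)))   # True
```

The same counting view underlies all the measures of this section, with the probability operator restricted by a conditioning formula where needed.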
Table 1: Logical description of the table of confusion
(Rows: the classifier's prediction ψ_ℓ(x) / ¬ψ_ℓ(x); columns: the actual class h_ℓ(x) / ¬h_ℓ(x).)

  Prevalence_{ℓ,I}(x)  ≝  P_I h_ℓ(x)
  Accuracy_{ℓ,I}(x)    ≝  P_I ( ψ_ℓ(x) ↔ h_ℓ(x) )

  tp(x) ≝ ψ_ℓ(x) ∧ h_ℓ(x)          fp(x) ≝ ψ_ℓ(x) ∧ ¬h_ℓ(x)
  fn(x) ≝ ¬ψ_ℓ(x) ∧ h_ℓ(x)         tn(x) ≝ ¬ψ_ℓ(x) ∧ ¬h_ℓ(x)

  Precision_{ℓ,I}(x)   ≝  ψ_ℓ(x) ⊃ P_I h_ℓ(x)       FDR_{ℓ,I}(x)      ≝  ψ_ℓ(x) ⊃ P_I ¬h_ℓ(x)
  FOR_{ℓ,I}(x)         ≝  ¬ψ_ℓ(x) ⊃ P_I h_ℓ(x)      NPV_{ℓ,I}(x)      ≝  ¬ψ_ℓ(x) ⊃ P_I ¬h_ℓ(x)
  Recall_{ℓ,I}(x)      ≝  h_ℓ(x) ⊃ P_I ψ_ℓ(x)       FallOut_{ℓ,I}(x)  ≝  ¬h_ℓ(x) ⊃ P_I ψ_ℓ(x)
  MissRate_{ℓ,I}(x)    ≝  h_ℓ(x) ⊃ P_I ¬ψ_ℓ(x)      Specificity_{ℓ,I}(x) ≝ ¬h_ℓ(x) ⊃ P_I ¬ψ_ℓ(x)

The terms positive/negative represent the result of the classifier's prediction, and the terms true/false represent whether the classifier predicts correctly or not. Then the following terminologies are commonly used:

– true positive (tp): both the prediction and the actual class are positive;
– true negative (tn): both the prediction and the actual class are negative;
– false positive (fp): the prediction is positive but the actual class is negative;
– false negative (fn): the prediction is negative but the actual class is positive.

These terminologies can be formalized using static formulas as shown in Table 1. For example, when an input x shows a true positive at a state s, this can be expressed as s ⊨ ψ_ℓ(x) ∧ h_ℓ(x). Note that the value of the measurement variable x is uniquely determined by the assignment σ_s at the state s.
True negative, false positive (type I error), and false negative (type II error) are respectively expressed as s ⊨ ¬ψ_ℓ(x) ∧ ¬h_ℓ(x), s ⊨ ψ_ℓ(x) ∧ ¬h_ℓ(x), and s ⊨ ¬ψ_ℓ(x) ∧ h_ℓ(x).

5.2 Precision, Recall, Accuracy, and Other Performance Measures

Next we formalize three popular measures of binary classification performance: precision, recall, and accuracy. Table 1 summarizes the formalization of various notions of classification performance using our static formulas.

In theory, these notions should be formalized with the infinite dataset w_true representing the true distribution. However, we usually cannot obtain w_true or test the performance measures using it. Hence we often sample a finite test dataset w_test from the true distribution and regard it as an approximation of w_true. (Since the test dataset w_test is finite, there can be missing data that are not included in w_test but are sampled from the true distribution w_true with a very small probability.)

Given a test dataset w_test, precision (positive predictive value) is defined as the conditional probability that the prediction is correct given that the prediction is positive; i.e., precision = tp / (tp + fp). Since the probability distribution of the input x in the world w_test is expressed by σ_{w_test}(x) as explained in Sect. 4.3, the precision being within an interval I is given by:

  Pr[ v ←$ σ_{w_test}(x) : H(v) = ℓ | C(v) = ℓ ] ∈ I,

which can be written as:

  Pr[ s ←$ w_test : s ⊨ h_ℓ(x) | s ⊨ ψ_ℓ(x) ] ∈ I.

Using StatEL, this can be formalized as M, w_test ⊨ Precision_{ℓ,I}(x), where:

  Precision_{ℓ,I}(x) ≝ ψ_ℓ(x) ⊃ P_I h_ℓ(x).

Here ⊃ is the conditioning operator defined in Sect. 3.3. The value of precision depends on the test dataset w_test, and can be computed in finite time since w_test is finite.

Symmetrically, recall (true positive rate) is defined as the conditional probability that the prediction is correct given that the actual class is positive; i.e., recall = tp / (tp + fn). Then the recall being within I is formalized as:

  Recall_{ℓ,I}(x) ≝ h_ℓ(x) ⊃ P_I ψ_ℓ(x).

Finally, accuracy is the probability that the classifier predicts correctly; i.e., accuracy = (tp + tn) / (tp + tn + fp + fn). Then the accuracy being within I is formalized as:

  Accuracy_{ℓ,I}(x) ≝ P_I ( ψ_ℓ(x) ↔ h_ℓ(x) ).
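On a finite test dataset, the Table-1 measures reduce to conditional frequencies over the four confusion counts. A minimal sketch, with an illustrative oracle `H` and classifier `C` (all names and data are stand-ins, not part of the formalism):

```python
# Sketch: precision, recall, and accuracy as conditional probabilities
# over a finite test dataset w_test, pairing the classifier C with an
# oracle H.  C, H, and the data are illustrative.

def confusion_counts(C, H, dataset, label):
    """Counts of tp, fp, fn, tn for the label, over all states."""
    tp = fp = fn = tn = 0
    for v in dataset:
        pred, actual = C(v) == label, H(v) == label
        if pred and actual:       tp += 1
        elif pred and not actual: fp += 1
        elif not pred and actual: fn += 1
        else:                     tn += 1
    return tp, fp, fn, tn

def precision(C, H, dataset, label):
    tp, fp, _, _ = confusion_counts(C, H, dataset, label)
    return tp / (tp + fp)          # Pr[h_l(x) | psi_l(x)]

def recall(C, H, dataset, label):
    tp, _, fn, _ = confusion_counts(C, H, dataset, label)
    return tp / (tp + fn)          # Pr[psi_l(x) | h_l(x)]

def accuracy(C, H, dataset, label):
    tp, fp, fn, tn = confusion_counts(C, H, dataset, label)
    return (tp + tn) / (tp + fp + fn + tn)   # Pr[psi_l(x) <-> h_l(x)]

H = lambda v: "pos" if v > 4 else "neg"    # oracle ("human") labels
C = lambda v: "pos" if v >= 6 else "neg"   # slightly conservative classifier
w_test = [2, 4, 5, 6, 8]

print(precision(C, H, w_test, "pos"))  # 1.0 (both positive predictions correct)
print(recall(C, H, w_test, "pos"))     # ~0.667 (misses v = 5)
print(accuracy(C, H, w_test, "pos"))   # 0.8
```

Checking membership of each value in an interval I then decides the corresponding formula, exactly as for P_I above.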
Fig. 2: The classification performance compares the oracle H's output with that of the classifier C, while the evaluation of overfitting compares the expected loss on the test dataset with that on the training dataset.

The accuracy can also be defined as P_I ( tp(x) ∨ tn(x) ). When we measure the accuracy after applying a data preparation operation T (e.g., data cleaning) to the test dataset w_test, this can be represented by w_test ⊨ ∆_T Accuracy_{ℓ,I}(x).

Example 1 (Performance of pedestrian detection)
Let us consider an autonomous car that uses a machine learning classifier to detect a person crossing the road. For the sake of simplicity, we formalize an example of a binary classifier C that detects whether or not a pedestrian is crossing the road in a photo image in a test dataset w_test. We write sunny(x) (resp. snowy(x)) to represent that a photo x was taken on a sunny (resp. snowy) day. Let ψ_ℓ(x) (resp. h_ℓ(x)) represent that the classifier C (resp. the human) detects a pedestrian crossing the road in an image x.

We empirically measure recall (i.e., the conditional probability that C detects a pedestrian crossing the road when the input image x actually includes one) by using the data collected on sunny days. When C achieves a recall of at least 0.95 on sunny days, this is represented by w_test ⊨ sunny(x) ⊃ Recall_{ℓ,[0.95,1]}(x).

Since C should detect a pedestrian also on a snow-covered road, it should be tested with the data collected on snowy days. If C achieves a recall within some interval I′ on snowy days, this is represented by w_test ⊨ snowy(x) ⊃ Recall_{ℓ,I′}(x). More generally, if the classifier C achieves a recall above some threshold c under each of m different conditions γ_1, γ_2, …, γ_m, this can be represented by w_test ⊨ ⋀_{i=1}^{m} ( γ_i(x) ⊃ Recall_{ℓ,(c,1]}(x) ).

5.3 Generalization Error

We next formalize the generalization error of a classifier, i.e., a measure of how accurately a classifier is able to predict the class of previously unseen input data. Since a classifier has been trained on a finite sample training dataset w_train, it may be overfitted to w_train and have worse classification performance on new input data that are not included in w_train.

To formalize the generalization error, we introduce a formula λ_L(y, ŷ) representing that, given a correct label y and a predicted label ŷ, the expected value of the losses (i.e., real numbers representing the penalty for incorrect classification) is at most a non-negative real number L. Formally, the semantics of λ_L(y, ŷ) is given by:

  w ⊨ λ_L(y, ŷ) iff E_{(v,v̂) ∼ σ_w(y,ŷ)} loss(v, v̂) ≤ L,

where loss is a loss function selected according to the data domain O, and a pair (v, v̂) of a correct label and a predicted label follows the joint distribution σ_w(y, ŷ). Now the generalization error being L or smaller at a true distribution w_true is written as w_true ⊨ GE_L(x, y, ŷ) where:

  GE_L(x, y, ŷ) ≝ ( h(x, y) ∧ ψ(x, ŷ) ) ⊃ λ_L(y, ŷ).
Since we usually cannot obtain the true distribution w_true, and hence cannot check the satisfaction w_true ⊨ GE_L(x, y, ŷ), we often compute an empirical error (as an approximation of the generalization error) by using a finite test dataset w_test that is believed to approximate w_true. This testing can be expressed as w_test ⊨ GE_L(x, y, ŷ).

On the other hand, given a training dataset w_train, the training error being at most L_train is represented by w_train ⊨ GE_{L_train}(x, y, ŷ). Then the overfitting of the classifier can be evaluated by comparing the empirical error L with the training error L_train. When the empirical error is smaller than L_train + ε for some error bound ε > 0, this is expressed as w_test ⊨ GE_{L_train+ε}(x, y, ŷ).
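The empirical counterpart of GE_L with the 0-1 loss, together with the overfitting check against L_train + ε, can be sketched as follows (the classifier, the datasets, and the bound below are illustrative, not taken from the paper):

```python
# Sketch: empirical error w_test |= GE_L(x,y,yhat) with the 0-1 loss,
# and the overfitting check "empirical error < L_train + eps".
# Datasets are lists of (input, actual label) pairs; all names illustrative.

def empirical_error(C, labeled_data, loss):
    """Expected loss of C's predictions over a finite labeled dataset."""
    return sum(loss(y, C(v)) for v, y in labeled_data) / len(labeled_data)

zero_one = lambda y, yhat: 0.0 if y == yhat else 1.0   # 0-1 loss function

C = lambda v: "pos" if v >= 5 else "neg"
w_train = [(1, "neg"), (6, "pos"), (7, "pos"), (4, "neg")]
w_test  = [(2, "neg"), (5, "pos"), (4, "pos"), (9, "pos")]

L_train = empirical_error(C, w_train, zero_one)   # 0.0: fits the training data
L_test  = empirical_error(C, w_test, zero_one)    # 0.25: one mistake (v = 4)

eps = 0.3
print(L_test < L_train + eps)   # True: no overfitting beyond the bound eps
```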
C σ s (ˆ y ) input outputsampling σ s (cid:48) ( x ) Classifier
C σ s (cid:48) (ˆ y ) R ε, W d x Robustness
Fig. 3: The robustness compares the conditional probability in the test dataset w_test with that in another possible world w′ that is close to w_test in terms of R^{ε,W_d}_x. Note that an adversary's choice of the input distribution σ_{w′}(x) is formalized as a non-deterministic choice of the possible world w′.

6 Robustness

Many recent studies have found attacks on machine learning in which a malicious adversary manipulates the input to cause a malfunction in a machine learning task [12]. Such input data, called adversarial examples [40], are designed to make a classifier fail to predict the actual class ℓ of the input, while still being recognized to belong to ℓ by human eyes. In computer vision, for example, Goodfellow et al. [20] create an adversarial example by adding undetectable noise to a panda's photo so that humans can still recognize the perturbed image as a panda, but a classifier misclassifies it as a gibbon. To prevent or mitigate such attacks, the classifier should be robust against perturbed inputs, i.e., it should return similar predicted labels given similar input data.

In this section we formalize robustness notions for classifiers by using epistemic operators in StatEL (see Fig. 3 for an overview of the formalization). Furthermore, we show certain relationships between classification performance and robustness, and suggest a class of robustness properties that, as far as we know, have not been formalized in the literature. We present an overview of these formalizations and relationships in Fig. 4.

6.1 Total Correctness of Classifiers

We first note that the total correctness of classifiers could be formalized as a classification performance (e.g., precision, recall, or accuracy) in the presence of all possible inputs from adversaries.
For example, the total correctness could be formalized as M ⊨ Recall_{ℓ,I}(x), which represents that Recall_{ℓ,I}(x) is satisfied in all possible worlds of M.

In practice, however, it is not possible or tractable to test whether the classification performance is achieved for all possible test datasets (corresponding to an infinite number of possible worlds in M). Hence we need a weaker form of a correctness notion, which may be verified or tested in some way. In the following sections, we deal with robustness notions that are weaker than total correctness.

6.2 Accessibility Relation for Robustness

To formalize robustness notions, we introduce an accessibility relation R^{ε,W_d}_x that relates two worlds having close inputs, as follows.

Definition 8 (Accessibility relation for robustness)
We define an accessibility relation R^{ε,W_d}_x ⊆ W × W by:

  R^{ε,W_d}_x ≝ { (w, w′) ∈ W × W | W_d( σ_w(x), σ_{w′}(x) ) ≤ ε },

where W_d is the ∞-Wasserstein distance w.r.t. a metric d in Definition 2.

Then (w, w′) ∈ R^{ε,W_d}_x represents that the two distributions σ_w(x) and σ_{w′}(x) of inputs to the classifier C are close in terms of the distance W_d. Intuitively, for example, W_d measures the distance between two image datasets σ_w(x) and σ_{w′}(x) when the distance between individual images is measured by a metric d. (That is, W_d( σ_w(x), σ_{w′}(x) ) ≤ ε expresses that each value of the input x from the dataset w is close to the corresponding value of x from w′ in terms of the metric d between individual data; for example, each input image x in the dataset w looks similar to the corresponding image in w′ to human eyes.)

An epistemic formula K^{ε,W_d} ϕ then represents that we are confident that ϕ is true even when the input data are perturbed by noise of level ε or smaller.

6.3 Probabilistic Robustness against Targeted Attacks

When an adversary aims to make the classifier misclassify inputs as a specific target label ℓ̂_tar, the attack is called a targeted attack. For instance, in the above-mentioned attack by [20], a gibbon is the target into which a panda's photo is misclassified.

In this section, we discuss how we formalize robustness using the epistemic operator K^{ε,W_d}. We denote by v ∈ D an original input image in the test dataset w_test, and by ṽ ∈ D an image obtained by perturbing the original image v by noise.

A first definition of robustness against targeted attacks might be:

  For any v, ṽ ∈ D, if H(v) = panda and d(v, ṽ) ≤ ε, then C(ṽ) ≠ gibbon,

which represents that when an image ṽ is obtained by perturbing a panda's photo v by noise, it will never be classified as the target label gibbon. This can be formalized using StatEL by:

  M, w_test ⊨ h_panda(x) ⊃ K^{ε,W_d} P_0 ψ_gibbon(x).
However, this notion does not tolerate even a negligible probability of misclassification, and does not cover the case where the human cannot recognize the perturbed image ṽ as a panda (e.g., when the perturbed image ṽ is obtained by linear displacement, rescaling, and rotation [2], then H(ṽ) ≠ panda may hold).

To overcome these issues, we introduce the following definition, which allows some conditional probability δ of misclassification.

Definition 9 (Targeted robustness)
Let δ ∈ [0, 1]. Given a test dataset w_test, a classifier C satisfies probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input ṽ ∈ D s.t. d(v, ṽ) ≤ ε, we have:

  Pr[ C(ṽ) = ℓ̂_tar | H(ṽ) = ℓ ] ≤ δ.  (1)

For instance, when the actual class ℓ is panda and the target label ℓ̂_tar is gibbon, the classifier C misclassifies a panda's photo as a gibbon with only a small probability δ.

Now we express this robustness notion with I = [1 − δ, 1] by using StatEL.
Proposition 2 (Targeted robustness)
Let I ⊆ [0, 1]. The probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar under a given test dataset w_test is expressed by w_test ⊨ TRobust_{ℓ,ℓ̂_tar,I}(x) where:

  TRobust_{ℓ,ℓ̂_tar,I}(x) ≝ K^{ε,W_d} ( h_ℓ(x) ⊃ P_I ¬ψ_{ℓ̂_tar}(x) ).

Proof
Let w′ be a possible world such that (w_test, w′) ∈ R^{ε,W_d}_x. Then w′ corresponds to the dataset obtained by perturbing each datum in w_test. Let ṽ ∈ supp(σ_{w′}(x)). Then ṽ represents a perturbed input. Let w′′ = w′|_{h_ℓ(x)}. Then (1) is logically equivalent to w′′ ⊨ P_{[0,δ]} ψ_{ℓ̂_tar}(x). By Definition 6, w′ ⊨ h_ℓ(x) ⊃ P_{[0,δ]} ψ_{ℓ̂_tar}(x). By I = [1 − δ, 1], w′ ⊨ h_ℓ(x) ⊃ P_I ¬ψ_{ℓ̂_tar}(x). Therefore this proposition follows from the semantics for K^{ε,W_d}. ⊔⊓

Since the L_p-distances are often regarded as reasonable approximations of human perceptual distances [10], they are used as distance constraints on the perturbation in much of the research on targeted attacks (e.g., [40, 20, 10]). Our model can represent the robustness against these attacks by using the L_p-distance as the metric d for R^{ε,W_d}_x.

6.4 Probabilistic Robustness against Non-Targeted Attacks

In this section we formalize non-targeted attacks [33, 32], in which adversaries try to make inputs be misclassified as arbitrary incorrect labels (i.e., not as a specific label like a gibbon). Compared to targeted attacks, these attacks are easier to mount but harder to defend against. We first define the notion of robustness against non-targeted attacks as follows.

Definition 10 (Non-targeted robustness)
Let δ ∈ [0, 1]. Given a test dataset w_test, a classifier C satisfies probabilistic non-targeted robustness w.r.t. an actual label ℓ if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input ṽ ∈ D s.t. d(v, ṽ) ≤ ε, we have:

  Pr[ C(ṽ) = ℓ | H(ṽ) = ℓ ] > 1 − δ.

Now we express this robustness notion with I = [1 − δ, 1] by using StatEL.
Proposition 3 (Non-targeted robustness)
Let I ⊆ [0, 1]. The probabilistic non-targeted robustness under a test dataset w_test is expressed by w_test ⊨ Robust_{ℓ,I}(x) where:

  Robust_{ℓ,I}(x) ≝ K^{ε,W_d} ( h_ℓ(x) ⊃ P_I ψ_ℓ(x) ) = K^{ε,W_d} Recall_{ℓ,I}(x).

Proof
The proof is analogous to that for Proposition 2. ⊔⊓

(The L_p-distance between n-dimensional real vectors x and x′ is written ‖x − x′‖_p, where the p-norm is defined by ‖v‖_p = ( Σ_{i=1}^{n} |v_i|^p )^{1/p}.)
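On finite data, Robust_{ℓ,I}(x) = K^{ε,W_d} Recall_{ℓ,I}(x) amounts to checking Recall_{ℓ,I} in every accessible perturbed world. Since the ε-neighbourhood of w_test cannot be enumerated in general (Sect. 4.3), the sketch below checks the property against a given finite family of perturbed datasets standing in for the accessible worlds; the classifier `C`, oracle `H`, and datasets are illustrative assumptions:

```python
# Sketch: testing non-targeted robustness K^{eps,W_d} Recall_{l,I}(x)
# against a finite family of perturbed datasets (a stand-in for the
# eps-neighbourhood of w_test).  C, H, and the data are illustrative.

def recall_in(C, H, dataset, label, interval):
    """Does Recall_{l,I} hold in the world given by this dataset?"""
    relevant = [v for v in dataset if H(v) == label]
    p = sum(1 for v in relevant if C(v) == label) / len(relevant)
    lo, hi = interval
    return lo <= p <= hi

def robust(C, H, perturbed_worlds, label, interval):
    """Recall_{l,I} must hold in every accessible (perturbed) world."""
    return all(recall_in(C, H, w, label, interval) for w in perturbed_worlds)

H = lambda v: "pos" if v > 4 else "neg"
C = lambda v: "pos" if v >= 5 else "neg"

# Each world perturbs the test inputs by noise of level at most eps.
worlds = [[5, 6, 3], [5.2, 6.1, 2.9], [4.9, 5.8, 3.1]]

print(robust(C, H, worlds, "pos", (0.5, 1.0)))   # True
print(robust(C, H, worlds, "pos", (1.0, 1.0)))   # False: last world misses v = 4.9
```

Replacing `recall_in` by the analogous check for precision or accuracy yields tests for the precision robustness and accuracy robustness discussed in Sect. 6.5.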
Fig. 4: Robustness notions and their relationships: the probabilistic non-targeted robustness Robust_{ℓ,I}(x) (Sect. 6.4) implies both the probabilistic targeted robustness TRobust_{ℓ,ℓ̂_tar,I}(x) (Sect. 6.3, Proposition 4(1)) and the recall Recall_{ℓ,I}(x) (Sect. 5.2, Proposition 4(2)).

6.5 Relationships among Robustness Notions

In this section we present relationships among the notions of robustness and performance, and discuss properties related to robustness. We first present the following proposition, which is immediate from the definitions.
Proposition 4 (Relationships among notions)
Let I ⊆ [0, 1] and ℓ, ℓ̂_tar ∈ L. Then we have:

1. w_test ⊨ Robust_{ℓ,I}(x) implies w_test ⊨ TRobust_{ℓ,ℓ̂_tar,I}(x).
2. w_test ⊨ Robust_{ℓ,I}(x) implies M, w_test ⊨ Recall_{ℓ,I}(x).

The first claim means that probabilistic non-targeted robustness is not weaker than probabilistic targeted robustness for the same I. The second claim means that probabilistic non-targeted robustness implies recall without perturbation noise; note that this is immediate from the reflexivity of R^{ε,W_d}_x.

Next we remark that our extension of StatEL can be used to describe a situation in which adversarial attacks are mitigated. When we apply some mechanism T that preprocesses a given input to mitigate attacks on robustness, the resulting robustness is expressed as w_test ⊨ ∆_T Robust_{ℓ,I}(x), where ∆_T is the modality for the dataset transformation T.

Finally, we recall that by Proposition 3, robustness can be regarded as recall in the presence of perturbation noise. This implies that for each property ϕ in the table of confusion (Table 1), we could consider K^{ε,W_d} ϕ as a property evaluating the classification performance in the presence of adversarial inputs, although, as far as we recognize, this has not been formalized in the literature on the robustness of machine learning. For example, the precision robustness K^{ε,W_d} Precision_{ℓ,I}(x) represents that in the presence of perturbation noise, the prediction is correct with a probability within I given that it is positive. For another example, the accuracy robustness K^{ε,W_d} Accuracy_{ℓ,I}(x) represents that in the presence of perturbation noise, the prediction is correct (whether it is positive or negative) with a probability within I.
We illustrate robustness notions using the pedestrian detection of Example 1 in Sect. 5.2. We deal with a binary classifier C that detects whether a pedestrian is crossing the road in a photo image x.

The non-targeted robustness K^{ε,W_d} Recall_{ℓ,I}(x) represents that, even in the presence of perturbation noise on the input image x, the classifier C detects a person crossing the road with a probability within I whenever a human can actually recognize one. This robustness is crucial for an autonomous car not to hit a pedestrian.

The precision robustness K^{ε,W_d} Precision_{ℓ,I}(x) represents that, in the presence of perturbation noise on x, a pedestrian is actually crossing the road with a probability within I whenever C detects one. This type of robustness is important for an autonomous car to avoid stopping suddenly due to a false alarm (and thus avoid being hit by the car behind).

7 Fairness

Many studies have proposed and investigated various notions of fairness in machine learning [5]. Informally, these fairness notions mean that the results of machine learning tasks are independent of sensitive attributes, e.g., gender, age, race, disease, or political/religious views. In the last few years, there have also been studies on testing methods for the fairness of machine learning [18, 1, 42].

In this section, we formalize popular notions of fairness of supervised learning by using our extension of StatEL. Here we focus on the fairness that should be maintained in the impact (i.e., the results of machine learning tasks) rather than in the treatment (i.e., the process of machine learning tasks).
This is because previous research shows that many seemingly neutral features have statistical relationships with sensitive attributes, and hence just ignoring or removing sensitive attributes in the process of data preparation and training is often ineffective, or even harmful, for achieving the fairness and performance of learning tasks.

7.1 Basic Ideas and Notations

Various notions of fairness in supervised learning are classified into three categories: independence, separation, and sufficiency [5]. All of these have the form of (conditional) independence or a relaxation thereof, and thus can be formalized using the modal operator ∼^{ε,D}_x for conditional indistinguishability (defined in Sect. 3.4) in our extension of StatEL.

In the formalization of fairness notions, we use a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}). Recall that x, y, and ŷ are measurement variables denoting, respectively, the input datum, the actual class label (given by the oracle H), and the predicted label (output by the classifier C). Given a real world w_test (corresponding to a given test dataset), σ_{w_test}(x) is the probability distribution of C's test input over D, σ_{w_test}(y) is the distribution of the actual label over L, and σ_{w_test}(ŷ) is the distribution of C's output over L.

Fairness notions are usually defined in terms of some sensitive attribute (e.g., gender, age, race, disease, political/religious view), which is defined as a tuple of subsets of the input data domain D. For example, a sensitive attribute based on age can be defined as a pair of groups G_0 (input data with ages 21 to 60) and G_1 (ages 61 to 100). For each group G ⊆ D of inputs, we introduce a static formula η_G(x) representing that an input x belongs to G. Formally, this is interpreted as follows: for each state s ∈ S, s ⊨ η_G(x) iff σ_s(x) ∈ G.
Roughly speaking, a machine learning task is said to be fair if the performance of the task for one group G_0's inputs is similar to that for another group G_1's inputs. (Some fairness notions, e.g., equal opportunity, assume G_1 = D \ G_0.) In the following sections, we formalize the three categories of fairness of classifiers and their relaxations. A summary of this formalization is presented in Table 2.

Two remarks are in order. First, fairness in the treatment, i.e., unawareness, requires that sensitive attributes not be explicitly used in the learning process; StatEL may not be suited to formalizing this requirement. Second, compared to the preliminary version [28] of this paper, we corrected errors and changed the formalization into a more comprehensible form by introducing the operator ∼^{ε,D}_x and by removing the counterfactual epistemic operators and a formula ξ_d representing that the input is drawn from a dataset d.

7.2 Independence (a.k.a. Group Fairness)

In this section we formalize independence [9], also known as group fairness [15], and its relaxed notion. (In previous literature, independence has also been referred to by other terminologies, such as statistical parity, demographic parity, and disparate impact.) Intuitively, independence means that the predicted label ŷ does not have statistical relationships with the membership in a sensitive group. For example, independence does not allow a bank's lending rate to be correlated with a sensitive attribute such as gender.

We first present the definition of a relaxed notion of independence, called group fairness up to bias ε [15]. Intuitively, this is the property that the output distributions of the classifier are roughly identical when input data belong to different groups. Formally, this fairness notion is defined as follows.

Definition 11 (Independence, group fairness)
Let G_0, G_1 ⊆ D be the sets of input data constituting a sensitive attribute. For each b = 0, 1, let µ_{G_b} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v is sampled from a test dataset w_test and belongs to G_b; i.e., for each ℓ̂ ∈ L,

  µ_{G_b}[ℓ̂] ≝ Pr[ C(v) = ℓ̂ | v ←$ σ_{w_test}(x) and v ∈ G_b ].  (2)

Then a classifier C satisfies the group fairness between groups G_0 and G_1 up to bias ε if D_tv( µ_{G_0} ∥ µ_{G_1} ) ≤ ε, where D_tv is the total variation between distributions (defined in Sect. 2.2). A classifier C satisfies independence w.r.t. groups G_0 and G_1 if it satisfies the group fairness between G_0 and G_1 up to bias 0.

Now we express this fairness notion using our extension of StatEL as follows.

Proposition 5 (Independence, group fairness)
The group fairness between groups G_0 and G_1 up to bias ε under a given test dataset w_test is expressed as w_test ⊨ GrpFair_ε(x, ŷ) where:

  GrpFair_ε(x, ŷ) ≝ ( η_{G_0}(x) ∧ ψ(x, ŷ) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) ).

Independence (without bias ε) is expressed by w_test ⊨ GrpFair_0(x, ŷ).

Proof Let w_b = w_test|_{η_{G_b}(x) ∧ ψ(x,ŷ)}. It follows from (2) that for each ℓ̂ ∈ L,

  µ_{G_b}[ℓ̂] = Pr[ σ_s(ŷ) = ℓ̂ | s ←$ w_b ],

hence µ_{G_b} = σ_{w_b}(ŷ). Thus, by Definition 11, the group fairness between groups G_0 and G_1 up to bias ε is given by D_tv( σ_{w_0}(ŷ) ∥ σ_{w_1}(ŷ) ) ≤ ε. Therefore, this proposition follows from Lemma 1. ⊔⊓

Table 2: Popular notions of fairness of machine learning
Sect. 7.2 — Independence (a.k.a. group fairness):
  GrpFair_ε(x, ŷ) ≝ ( η_{G_0}(x) ∧ ψ(x, ŷ) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) )

Sect. 7.3 — Separation (a.k.a. equalized odds) and equal opportunity:
  EqOdds_ε(x, ŷ) ≝ ⋀_{ℓ∈L} ( ( η_{G_0}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) ∼^{ε,D_tv}_ŷ ( η_{G_1}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) )
  EqOpp(x, ŷ) ≝ ( η_G(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) ) ∼^{0,D_tv}_ŷ ( ¬η_G(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x) )

Sect. 7.4 — Sufficiency:
  Sufficiency_ε(x, y) ≝ ⋀_{ℓ̂∈L} ( ( η_{G_0}(x) ∧ ψ_ℓ̂(x) ∧ h(x, y) ) ∼^{ε,D_tv}_y ( η_{G_1}(x) ∧ ψ_ℓ̂(x) ∧ h(x, y) ) )
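Group fairness up to bias ε (Definition 11) can be checked empirically by comparing the classifier's output distributions for the two groups under the total variation distance. A minimal sketch; the classifier, groups, and labels below are illustrative stand-ins:

```python
# Sketch: group fairness up to bias eps (Definition 11) on a finite
# test dataset: the total variation between the classifier's output
# distributions for two groups must be at most eps.  All names illustrative.

def output_dist(C, group_inputs, labels):
    """Distribution mu_{G_b} of C's predicted label over a group's inputs."""
    n = len(group_inputs)
    return {l: sum(1 for v in group_inputs if C(v) == l) / n for l in labels}

def total_variation(mu0, mu1):
    """D_tv between two distributions over the same label set."""
    return 0.5 * sum(abs(mu0[l] - mu1[l]) for l in mu0)

def group_fair(C, group0, group1, labels, eps):
    """Group fairness between group0 and group1 up to bias eps."""
    mu0 = output_dist(C, group0, labels)
    mu1 = output_dist(C, group1, labels)
    return total_variation(mu0, mu1) <= eps

C = lambda v: "approve" if v >= 5 else "deny"
G0 = [4, 5, 6, 8]   # test inputs whose sensitive attribute is in group 0
G1 = [3, 5, 7, 9]   # test inputs whose sensitive attribute is in group 1
labels = ["approve", "deny"]

# Both groups are approved at rate 3/4, so the bias is 0.
print(group_fair(C, G0, G1, labels, eps=0.0))   # True
```

Conditioning the two groups additionally on the actual label h_ℓ(x), as in Table 2, turns the same check into one for separation (equalized odds).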
Weillustrate independence using the pedestrian detectionin Example 1 in Section 5.2. We deal with a binaryclassifier C that detects whether or not a pedestrian iscrossing the road in an image x . We write η m ( x ) (resp. η w ( x )) to represent that an image x includes a man(resp. woman) that may or not be crossing the road.Let ψ ( x, ˆ y ) represent that given an input image x , theclassifier C returns ˆ y (that is either the detection of aperson crossing the road or not).Then the independence between men and women GrpFair ( x, ˆ y ) def = (cid:0) η m ( x ) ∧ ψ ( x, ˆ y ) (cid:1) ∼ , D tv ˆ y (cid:0) η w ( x ) ∧ ψ ( x, ˆ y ) (cid:1) means that the probability of detecting a pedestriancrossing the road is the same between men and women.This fairness guarantees that men and women are equallydetectable as pedestrians, hence equally safe against anautonomous car. Here independence does not rely onthe actual label y , i.e., on whether there is a pedestriancrossing the road that can be detected by human eyes.7.3 Separation (a.k.a. Equalized Odds) and itsRelaxation (Equal Opportunity)In this section we explain and formalize the notionof separation [5] , which is well-known as equalizedodds [22], and its relaxed notion called equal opportu-nity [22]. The motivation behind these notions is to cap-ture typical scenarios in which sensitive characteristicsmay have statistical relationships with the actual classlabel. For instance, even when some sensitive attributeis correlated with an actual default rate on loans, banksmight want to have a different lending rate for peoplewho have a higher default rate. However, independence In previous literature, separation has been referred to alsoas disparate mistreatment [46] and conditional procedure ac-curacy equality [6]. 
(group fairness) does not allow this, since it requires that the lending rate be statistically independent of the sensitive attribute.

To overcome this problem, the notion of separation allows statistical relationships between a sensitive attribute and the predicted label ŷ output by the classifier C, to the extent that they are justified by the actual class label y. More precisely, separation means that the predicted label ŷ is conditionally independent of membership in a sensitive group, given the actual class label y.

Formally, separation is defined as the property that recall (true positive rate) and specificity (true negative rate, explained in Table 1) are the same for all groups, and equal opportunity is defined as the special case of separation restricted to an advantageous class label.

Definition 12 (Separation & equal opportunity)
Given a group G_b ⊆ D and an actual class label ℓ, let μ_{G_b, ℓ} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v ∈ G_b is sampled from a test dataset w_test and is associated with the actual label ℓ; i.e., for each ℓ̂ ∈ L,

μ_{G_b, ℓ}[ℓ̂]  def=  Pr[ C(v) = ℓ̂ | v $← σ_{w_test}(x), v ∈ G_b, H(v) = ℓ ].   (3)

A classifier C satisfies separation between two groups G_0 and G_1 if μ_{G_0, ℓ} = μ_{G_1, ℓ} holds for all ℓ ∈ L. A classifier C satisfies equal opportunity of an advantageous label ℓ_1 w.r.t. a group G_1 if μ_{G_0, ℓ_1} = μ_{G_1, ℓ_1}, where G_0 = D \ G_1.

Now we express these two notions using our extension of StatEL as follows.

Proposition 6 (Separation)
Let γ(x, ℓ, ŷ)  def=  ψ(x, ŷ) ∧ h_ℓ(x). The separation between two groups G_0 and G_1 under a given test dataset w_test is expressed as w_test ⊨ EqOdds_0(x, ŷ), where:

EqOdds_ε(x, ŷ)  def=  ⋀_{ℓ ∈ L} ( (η_{G_0}(x) ∧ γ(x, ℓ, ŷ)) ∼^{ε, D_tv}_{ŷ} (η_{G_1}(x) ∧ γ(x, ℓ, ŷ)) ).

Proof
Let ℓ ∈ L and w_{b,ℓ} = w_test |_{η_{G_b}(x) ∧ ψ(x, ŷ) ∧ h_ℓ(x)}. It follows from (3) that:

μ_{G_b, ℓ}[ℓ̂] = Pr[ σ_s(ŷ) = ℓ̂ | s $← w_{b,ℓ} ],

hence μ_{G_b, ℓ} = σ_{w_{b,ℓ}}(ŷ). Thus, by Definition 12, the separation between G_0 and G_1 is given by σ_{w_{0,ℓ}}(ŷ) = σ_{w_{1,ℓ}}(ŷ) for all ℓ ∈ L. Therefore, this proposition follows from Proposition 1. □

It should be noted that for ε > 0, EqOdds_ε(x, ŷ) represents a relaxation of separation up to bias ε in terms of the total variation D_tv.

Proposition 7 (Equal opportunity)
Let γ(x, ℓ, ŷ)  def=  ψ(x, ŷ) ∧ h_ℓ(x). The equal opportunity of a label ℓ_1 w.r.t. a group G_1 under a given test dataset w_test is expressed as w_test ⊨ EqOpp(x, ŷ), where:

EqOpp(x, ŷ)  def=  (η_{G_1}(x) ∧ γ(x, ℓ_1, ŷ)) ∼^{0, D_tv}_{ŷ} (¬η_{G_1}(x) ∧ γ(x, ℓ_1, ŷ)).

Proof
The proof of this proposition is similar to that of Proposition 6. Let G_0 = D \ G_1. By μ_{G_b, ℓ_1} = σ_{w_{b,ℓ_1}}(ŷ), the equal opportunity of ℓ_1 w.r.t. G_1 is given by σ_{w_{0,ℓ_1}}(ŷ) = σ_{w_{1,ℓ_1}}(ŷ). Therefore, this proposition follows from Proposition 1. □

Example 4 (Separation in pedestrian detection)
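On empirical data, Definition 12 amounts to comparing, for each actual label, the groups' conditional distributions of predicted labels. The sketch below (function names and the toy dataset are ours, not from the paper) checks both separation and its relaxation, equal opportunity:

```python
from collections import Counter

def pred_dist(samples, group, actual):
    """mu_{G_b, l}: empirical distribution of predicted labels among
    samples of group `group` whose actual label is `actual`.
    Each sample is a (group, actual_label, predicted_label) triple."""
    preds = [p for (g, y, p) in samples if g == group and y == actual]
    c, n = Counter(preds), len(preds)
    return {lab: c[lab] / n for lab in c}

def separation(samples, labels=("pos", "neg")):
    """Separation: mu_{G0, l} == mu_{G1, l} for every actual label l."""
    return all(pred_dist(samples, 0, l) == pred_dist(samples, 1, l)
               for l in labels)

def equal_opportunity(samples, adv_label="pos"):
    """Equal opportunity: the same condition restricted to the
    advantageous label only."""
    return pred_dist(samples, 0, adv_label) == pred_dist(samples, 1, adv_label)

# Toy test set: both groups have recall 0.5 on the advantageous label,
# but group 1 has a false positive on the negative label.
samples = [(0, "pos", "pos"), (0, "pos", "neg"), (0, "neg", "neg"),
           (1, "pos", "pos"), (1, "pos", "neg"), (1, "neg", "pos")]
print(equal_opportunity(samples))  # True: true positive rates coincide
print(separation(samples))         # False: specificities differ
```

This also illustrates why equal opportunity is a relaxation of separation: the toy dataset violates separation (the groups' specificities differ) yet satisfies equal opportunity, which only constrains the advantageous label.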
We illustrate separation using the pedestrian detection in Example 3, where a binary classifier C detects whether a pedestrian is crossing the road in an image x. Let ψ(x, ŷ) (resp. h(x, y)) represent that, given an image x, the classifier C (resp. a human) returns ŷ (resp. y), representing either detection or not.

The inherent technical difficulty of detecting a female pedestrian may differ from that of detecting a male pedestrian, because, for example, physical appearance may tend to differ between women and men. If we take this possible difference into account, separation may be better suited than independence.

The separation EqOdds_0(x, ŷ) between men and women guarantees that the conditional probability of detecting a pedestrian crossing the road, given that a human can actually recognize one, is the same for men and women. This fairness implies that (from the viewpoint of a pedestrian crossing the road) male and female pedestrians may be hit by an autonomous car as fairly as by a human-driven car.

7.4 Sufficiency (a.k.a. Conditional Use Accuracy Equality)

In this section we explain and formalize the notion of sufficiency [5], which is also known as conditional use accuracy equality [6].

While separation guarantees the equality of recall among different groups, sufficiency requires the equality of precision. More precisely, sufficiency is defined as the property that precision (positive predictive value) and negative predictive value (presented as NPV in Table 1) are the same for all groups, as follows.

Definition 13 (Sufficiency)
Given a group G_b ⊆ D and a predicted label ℓ̂, let μ_{G_b, ℓ̂} ∈ 𝔻L be the probability distribution of the actual class label ℓ when an input v ∈ G_b is sampled from a test dataset w_test and the classifier C outputs the predicted label ℓ̂; i.e., for each ℓ ∈ L,

μ_{G_b, ℓ̂}[ℓ]  def=  Pr[ H(v) = ℓ | v $← σ_{w_test}(x), v ∈ G_b, C(v) = ℓ̂ ].   (4)

A classifier C satisfies sufficiency between two groups G_0 and G_1 if μ_{G_0, ℓ̂} = μ_{G_1, ℓ̂} holds for all ℓ̂ ∈ L.

Then this notion can be expressed using our extension of StatEL as follows.

Proposition 8 (Sufficiency)
Let γ′(x, y, ℓ̂)  def=  ψ_{ℓ̂}(x) ∧ h(x, y). The sufficiency between two groups G_0 and G_1 under a given test dataset w_test is expressed as w_test ⊨ Sufficiency_0(x, y), where:

Sufficiency_ε(x, y)  def=  ⋀_{ℓ̂ ∈ L} ( (η_{G_0}(x) ∧ γ′(x, y, ℓ̂)) ∼^{ε, D_tv}_{y} (η_{G_1}(x) ∧ γ′(x, y, ℓ̂)) ).

Proof
Let ℓ̂ ∈ L and w_{b,ℓ̂} = w_test |_{η_{G_b}(x) ∧ ψ_{ℓ̂}(x) ∧ h(x, y)}. It follows from (4) that:

μ_{G_b, ℓ̂}[ℓ] = Pr[ σ_s(y) = ℓ | s $← w_{b,ℓ̂} ],

hence μ_{G_b, ℓ̂} = σ_{w_{b,ℓ̂}}(y). Thus, by Definition 13, the sufficiency between G_0 and G_1 is given by σ_{w_{0,ℓ̂}}(y) = σ_{w_{1,ℓ̂}}(y) for all ℓ̂ ∈ L. Therefore, this proposition follows from Proposition 1. □

It should be noted that for ε > 0, Sufficiency_ε(x, y) represents a relaxation of sufficiency up to bias ε in terms of the total variation D_tv.

Example 5 (Sufficiency in pedestrian detection)
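Dually to separation, Definition 13 conditions on the predicted label and compares the groups' distributions of actual labels. A sketch under the same toy encoding of (group, actual, predicted) triples (the names and data are ours):

```python
from collections import Counter

def actual_dist(samples, group, predicted):
    """mu_{G_b, l_hat}: empirical distribution of actual labels among
    group members that the classifier mapped to label `predicted`."""
    actuals = [y for (g, y, p) in samples if g == group and p == predicted]
    c, n = Counter(actuals), len(actuals)
    return {lab: c[lab] / n for lab in c}

def sufficiency(samples, labels=("pos", "neg")):
    """Sufficiency: mu_{G0, l_hat} == mu_{G1, l_hat} for every predicted
    label, i.e. precision and NPV agree across the two groups."""
    return all(actual_dist(samples, 0, lh) == actual_dist(samples, 1, lh)
               for lh in labels)

# Toy test set: precision 0.5 and NPV 1.0 in both groups.
samples = [(0, "pos", "pos"), (0, "neg", "pos"), (0, "neg", "neg"),
           (1, "pos", "pos"), (1, "neg", "pos"), (1, "neg", "neg")]
print(sufficiency(samples))  # True
```

Replacing the exact dictionary equality with a total-variation bound would give the ε-relaxed variant in the same way as for the other notions.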
We illustrate sufficiency using the pedestrian detection in Example 3, where a classifier C detects whether a pedestrian is crossing the road in an image x. As mentioned in Example 4, the inherent technical difficulty of detecting a male pedestrian may differ from that of detecting a female pedestrian. Whereas separation guarantees the equality of recall between men and women, sufficiency guarantees that of precision.

The sufficiency
Sufficiency_0(x, y) between men and women implies that the conditional probability that there is no pedestrian crossing the road when C detects one is the same for men and women. From the viewpoint of the car driver, when C raises a false alarm and stops the car suddenly, we have no bias regarding which of men and women are more likely to trigger false alarms and to be blamed for them.

In this section, we provide a brief overview of related work on the specification of statistical machine learning and on epistemic logic for describing specification.
Desirable properties of statistical machine learning.
There have been a large number of papers on attacks and defences for deep neural networks [40,12]. In comparison, however, not much work has been done to explore the formal specification of various properties of machine learning. Seshia et al. [38] present a list of desirable properties of DNNs (deep neural networks), although most of the properties are presented informally, without mathematical formulas. As for robustness, Dreossi et al. [13] propose a unifying formalization of adversarial input generation in a rigorous and organized manner, although they formalize and classify attacks (as optimization problems) rather than define the robustness notions themselves.

Concerning fairness notions, Barocas et al. [5] survey various fairness notions and classify them into three categories: independence, separation, and sufficiency. Gajane [17] surveys the formalization of fairness notions for machine learning and presents some justification based on the social science literature.
Epistemic logic for describing specification.
Epistemic logic [44] has been studied to represent and reason about knowledge and belief [16,21], and has been applied to describe various properties of distributed systems. The BAN logic [8], proposed by Burrows, Abadi, and Needham, is a notable example of epistemic logic used to model and verify authentication in cryptographic protocols. To improve the formalization of protocols' behaviors, some epistemic approaches integrate process calculi [24,11].

Epistemic logic has also been used to formalize and reason about privacy properties, including anonymity [39,19,29], the receipt-freeness of electronic voting protocols [25], and privacy policies for social network services [34]. Temporal epistemic logic is used to express information flow security policies [3].

Concerning the formalization of fairness notions, previous work in formal methods has modeled different kinds of fairness involving timing by using temporal logic rather than epistemic logic. As far as we know, no previous work has formalized fairness notions of machine learning by using modal logic.
Formalization of statistical properties.
In studies of philosophical logic, Lewis [31] presents the idea that when a random value has various possible probability distributions, those distributions should be represented on distinct possible worlds. Bana [4] puts Lewis's idea in a mathematically rigorous setting. Recently, a modal logic called statistical epistemic logic (StatEL) [27] has been proposed and used to formalize statistical hypothesis testing and the notion of differential privacy [14].

To describe statistical properties of machine learning models, this work uses StatEL to formalize the probabilistically chosen input to a learning model and the non-deterministically chosen dataset. However, we could possibly employ other logics (e.g., fuzzy logic [45] or Markov logic networks [37]) by extending them to deal with statistical sampling and non-deterministic inputs. Exploring the possibility of different formalizations using other logics is left for future work.
In this paper, we proposed an epistemic approach to the modeling of supervised learning and its desirable properties. Specifically, we employed a distributional Kripke model in which each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalized various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we clarified relationships among properties of classifiers, and the relevance between classification performance and robustness.

We emphasize that this is the first attempt to use epistemic models and logical formulas to describe statistical properties of machine learning, and that it should serve as a starting point for developing theories of the formal specification of machine learning.

In future work, we plan to extend our framework to formally reason about system-level properties of learning-based systems. We are also interested in developing a more general framework for the formal specification of machine learning associated with testing methods, as well as in implementing a prototype tool. Our future work will also include an extension of StatEL to formalize unsupervised learning and reinforcement learning.
Acknowledgements
I would like to thank the reviewers for their helpful and insightful comments. I am also grateful to Gergei Bana for his useful comments on part of a preliminary manuscript.
References
1. Angell, R., Johnson, B., Brun, Y., Meliou, A.: Themis: automatically testing software for discrimination. In: Proc. ESEC/SIGSOFT FSE, pp. 871–875. ACM (2018). DOI 10.1145/3236024.3264590
2. Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: Proc. ICML, pp. 284–293 (2018)
3. Balliu, M., Dam, M., Guernic, G.L.: Epistemic temporal logic for information flow security. In: Proc. of PLAS, p. 6 (2011). DOI 10.1145/2166956.2166962
4. Bana, G.: Models of objective chance: An analysis through examples. In: Making it Formally Explicit, pp. 43–60. Springer International Publishing (2017). DOI 10.1007/978-3-319-55486-0_3
5. Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairmlbook.org (2019)
6. Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research (2018). DOI 10.1177/0049124118782533
7. Blackburn, P., de Rijke, M., Venema, Y.: Modal Logic. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press (2001). DOI 10.1017/CBO9781107050884
8. Burrows, M., Abadi, M., Needham, R.M.: A logic of authentication. ACM Trans. Comput. Syst. 8(1), 18–36 (1990). DOI 10.1145/77648.77649
9. Calders, T., Verwer, S.: Three naive Bayes approaches for discrimination-free classification. Data Min. Knowl. Discov. 21(2), 277–292 (2010). DOI 10.1007/s10618-010-0190-x
10. Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. In: Proc. S&P, pp. 39–57 (2017). DOI 10.1109/SP.2017.49
11. Chadha, R., Delaune, S., Kremer, S.: Epistemic logic for the applied pi calculus. In: Proc. of FMOODS/FORTE, pp. 182–197 (2009). DOI 10.1007/978-3-642-02138-1
12. […] CoRR abs/1810.00069 (2018). URL http://arxiv.org/abs/1810.00069
13. Dreossi, T., Ghosh, S., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: A formalization of robustness for deep neural networks. In: Proc. VNN (2019)
14. Dwork, C.: Differential privacy. In: Proc. of ICALP, pp. 1–12 (2006)
15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Proc. of ITCS, pp. 214–226. ACM (2012)
16. Fagin, R., Halpern, J., Moses, Y., Vardi, M.: Reasoning about Knowledge. The MIT Press (1995)
17. Gajane, P.: On formalizing fairness in prediction with machine learning. CoRR abs/1710.03184 (2017). URL http://arxiv.org/abs/1710.03184
18. Galhotra, S., Brun, Y., Meliou, A.: Fairness testing: testing software for discrimination. In: Proc. ESEC/FSE, pp. 498–510. ACM (2017). DOI 10.1145/3106237.3106277
19. Garcia, F.D., Hasuo, I., Pieters, W., van Rossum, P.: Provable anonymity. In: Proc. of FMSE, pp. 63–72 (2005). DOI 10.1145/1103576.1103585
20. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: Proc. of ICLR (2015)
21. Halpern, J.Y.: Reasoning about Uncertainty. The MIT Press (2003)
22. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Proc. NIPS, pp. 3315–3323 (2016)
23. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Proc. CAV, pp. 3–29 (2017). DOI 10.1007/978-3-319-63387-9
24. […] (1), 3–36 (2004)
25. Jonker, H.L., Pieters, W.: Receipt-freeness as a special case of anonymity in epistemic logic. In: Proc. Workshop On Trustworthy Elections (WOTE'06) (2006)
26. Katz, G., Barrett, C.W., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: Proc. CAV, pp. 97–117 (2017). DOI 10.1007/978-3-319-63387-9
27. […] LNCS, vol. 11760, pp. 344–362. Springer (2019). DOI 10.1007/978-3-030-31175-9
28. […] LNCS, vol. 11724, pp. 293–311. Springer (2019). DOI 10.1007/978-3-030-30446-1
29. […] (4), 559–576 (2007). DOI 10.11540/jsiamt.17.4
30. […] (5-6), 67–96 (1963)
31. Lewis, D.: A subjectivist's guide to objective chance. In: Studies in Inductive Logic and Probability, Volume II, pp. 263–293. Berkeley: University of California Press (1980)
32. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: Proc. ICLR (2018)
33. Moosavi-Dezfooli, S., Fawzi, A., Frossard, P.: DeepFool: A simple and accurate method to fool deep neural networks. In: Proc. CVPR, pp. 2574–2582 (2016). DOI 10.1109/CVPR.2016.282
34. Pardo, R., Schneider, G.: A formal privacy policy framework for social networks. In: Proc. SEFM, pp. 378–392 (2014). DOI 10.1007/978-3-319-10431-7
37. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–136 (2006). DOI 10.1007/s10994-006-5833-1
38. Seshia, S.A., Desai, A., Dreossi, T., Fremont, D.J., Ghosh, S., Kim, E., Shivakumar, S., Vazquez-Chanlatte, M., Yue, X.: Formal specification for deep neural networks. In: Proc. ATVA, pp. 20–34 (2018). DOI 10.1007/978-3-030-01090-4
43. […] (3), 64–72 (1969)
44. von Wright, G.H.: An Essay in Modal Logic. Amsterdam: North-Holland Pub. Co. (1951)
45. Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)