Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning
Luiz F. O. Chamon, Santiago Paternain, Alejandro Ribeiro

Abstract
Prediction credibility measures, in the form of confidence intervals or probability distributions, are fundamental in statistics and machine learning to characterize model robustness, detect out-of-distribution samples (outliers), and protect against adversarial attacks. To be effective, these measures should (i) account for the wide variety of models used in practice, (ii) be computable for trained models or at least avoid modifying established training procedures, (iii) forgo the use of data, which can expose them to the same robustness issues and attacks as the underlying model, and (iv) be followed by theoretical guarantees. These principles underlie the framework developed in this work, which expresses credibility as a risk-fit trade-off, i.e., a compromise between how much fit can be improved by perturbing the model input and the magnitude of this perturbation (risk). Using a constrained optimization formulation and duality theory, we analyze this compromise and show that this balance can be determined counterfactually, without having to test multiple perturbations. This results in an unsupervised, a posteriori method of assigning prediction credibility for any (possibly non-convex) differentiable model, from RKHS-based solutions to any architecture of (feedforward, convolutional, graph) neural network. Its use is illustrated in data filtering and defense against adversarial attacks.
1. Introduction
Assigning credibility to predictions is a fundamental problem in statistics and machine learning (ML) with both practical and societal impact, finding applications in model robustness, outlier detection, and defense against adversarial attacks.

Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Luiz F. O. Chamon
2. Related work
Credibility is a central concept in statistics, arising in the form of confidence intervals, probability distributions, or sensitivity analysis. Typically, these metrics are obtained for specific models, such as linear or generalized linear models, and become more intricate to compute as the model complexity increases. Confidence intervals, for instance, can be obtained in closed form only for simple models for which the asymptotic distribution of error statistics is available (Smithson, 2002). Otherwise, bootstrapping techniques must be used. Though more widely applicable, they require several models to be trained over different data sets, which can be prohibitive in certain large-scale applications (Efron & Tibshirani, 1994; Dietterich, 2000; Li et al., 2018). Sensitivity analysis is an alternative that does away with randomness by focusing instead on how much sample point or input values influence fit and/or model coefficients. The leverage score and the linear regression coefficients are good examples of this approach (Hawkins, 1980; Hastie et al., 2001). Bayesian models directly embed uncertainty measures by learning probability distributions instead of point estimates. Though Bayesian inference has been deployed in a wide variety of model classes, learning these models remains challenging, especially in the complex, large-scale settings common in deep learning and CNNs (Friedman et al., 2001; Blundell et al., 2015; Gal & Ghahramani, 2015; Shridhar et al., 2019; Heek & Kalchbrenner, 2019). Approximate measures based on specific probabilistic models, e.g., Monte Carlo dropout (Gal & Ghahramani, 2016), have been proposed to address these issues. However, they do not approximate the posterior distribution of the model given the training data that is needed to assess uncertainty in the Bayesian framework.
Other empirically motivated credibility scores, e.g., based on k-nearest neighbors embeddings in the learned representation space, require retraining with modified cost functions to be effective (Mandelbaum & Weinshall, 2017). Credibility measures can be used to solve a variety of statistical problems. For instance, they are effective in assessing the robustness and performance of models or detecting out-of-distribution samples or outliers, both during training and deployment (Chandola et al., 2009; Hodge & Austin, 2004; Hawkins, 1980; Aggarwal, 2015). Selective classifiers are often used to this end (Flores, 1958; Cortes et al., 2016; El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017). Realizing that in critical applications it is often better to admit one's limitations than to provide uncertain answers, filtered classifiers are given the additional option of not classifying an input. These inputs can be flagged for human inspection. Their performance is quantified in terms of coverage (how many samples the model chooses to classify) and filtered accuracy (accuracy over the classified samples), two quantities typically at odds with each other (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017). Measures of credibility are also often used in the related context of adversarial noise (Dhillon et al., 2018; Wong & Kolter, 2018; Sheikholeslami et al., 2019).

Finally, this work leverages duality to derive both theoretical results and algorithms. It is worth noting that the dual variables of convex optimization programs have a well-known sensitivity interpretation. Indeed, the dual variable corresponding to a constraint determines the variation of the objective if that constraint were to be tightened or relaxed (Boyd & Vandenberghe, 2004; Bonnans & Shapiro, 2000). Since this work also considers non-convex models for which these results do not hold, it develops a novel local sensitivity theory of compromise (Section 5).
3. Problem formulation
This work considers the problem of assigning credibility to the predictions of a classifier. Formally, given a model $\phi_o: \mathbb{R}^p \to \mathbb{R}^m$, a sample $x_o \in \mathbb{R}^p$, and a set $\mathcal{K}$ of possible classes, we wish to assign a number $c_k \in \mathbb{R}$ to each class $k \in \mathcal{K}$ representing our credence that $x_o$ belongs to $k$. By credence, we mean that if $c_{k'} \leq c_k$, then we would rather bet on the proposition "$x$ belongs to $k$" than "$x$ belongs to $k'$". For convenience, we fix an order of $\mathcal{K}$ and often refer to the vector $c \in \mathbb{R}^{|\mathcal{K}|}$ collecting the $c_k$ as the credibility profile. We use $|\mathcal{A}|$ to denote the cardinality of a set $\mathcal{A}$. This problem is therefore equivalent to ranking the classes that the input $x_o$ could belong to from most to least credible. Though certain credibility measures impose additional structure, as is the case of probabilities, they are minimally required to induce a weak ordering of the set $\mathcal{K}$ (Roberts & Tesman, 2009).

The output of any classifier induces a measure of credibility through the training loss function. Let $\ell_k: \mathbb{R}^m \to \mathbb{R}_+$ denote a loss function with respect to class $k \in \mathcal{K}$, such as the logistic negative log-likelihood, the cross-entropy loss, or the hinge loss (Hastie et al., 2001). Then, the loss of the model output with respect to the class $k \in \mathcal{K}$ induces a measure of incredibility. Conversely, we can obtain the credibility measure

$$c_k = -\ell_k\big(\phi_o(x_o)\big), \quad k \in \mathcal{K}. \tag{1}$$

In this sense, classifying a sample as $\operatorname{argmax}_{k \in \mathcal{K}} c_k$ (with ties broken arbitrarily if the set is not a singleton) is the same as assigning it to the most credible class. Note that since losses are non-negative, credibility is a non-positive number that increases as the loss decreases, i.e., as the goodness-of-fit improves.

Nevertheless, (1), and the model output $\phi_o(x_o)$ for that matter, are generally unreliable credibility metrics.
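To fix ideas, (1) can be evaluated directly from a model's output logits. A minimal sketch with the cross-entropy loss, where the helper name and toy logits are our own illustration:

```python
import numpy as np

def credences(logits):
    # c_k = -ell_k(phi_o(x_o)) with the cross-entropy loss
    # ell_k(y) = -log softmax(y)_k, so c_k = log softmax(y)_k <= 0
    z = logits - logits.max()           # numerically stable softmax
    return z - np.log(np.exp(z).sum())  # log-softmax = credences

c = credences(np.array([2.0, 1.0, -1.0]))  # toy model output phi_o(x_o)
top = int(np.argmax(c))                    # most credible class
```

As noted above, all entries of `c` are non-positive and ranking by credence reproduces the classifier's decision.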
Indeed, since models are optimized by focusing only on the most credible class, they do not incorporate ranking information on the other ones. In fact, classifier outputs often resemble singletons (see, e.g., Figs. 2a and 2c). The resulting miscalibration can skew the credences of all but the top class (Guo et al., 2017; Platt, 1999). Though there are exceptions, most notably Bayesian models, their use is not as widespread in many application domains due to their training complexity (Blundell et al., 2015; Shridhar et al., 2019; Gal & Ghahramani, 2015; Heek & Kalchbrenner, 2019). Fit and model outputs are also susceptible to robustness issues due to overfitting, poor or lacking training data, and/or the use of inadequate parametrizations (e.g., choice of NN architecture or kernel bandwidth). What is more, the input $x_o$ can be an outlier or have been contaminated by noise, adversarial or not. Complex models such as (C)NNs have proven to be particularly vulnerable to several of these issues (Szegedy et al., 2014; Goodfellow et al., 2014; Dhillon et al., 2018; Wong & Kolter, 2018).

Thus, the model output can only provide a sound credibility measure if both the model and the input are reliable. In fact, the issues associated with (1) all but vanish if the input $x_o$ is trustworthy and the model output $\phi_o(x_o)$ is robust to perturbations. Based on this observation, the next section puts forward a fit-based credibility measure that incorporates robustness to perturbations.

Remark 1.
It is worth contrasting the problem of assigning credibility to model predictions described in this section with its classical learning version, in which credibility is assigned based on a data set $\mathcal{D}$ of labeled pairs $(x_n, y_n) \in \mathbb{R}^p \times \mathcal{K}$. The current problem is in fact at least as hard as the learning one, both computationally and statistically, since the learning problem can be reduced to it. Additionally, the learning setting is considerably more restrictive in practice. Indeed, it may not be possible, due to privacy or security concerns, to access the raw training data from which a model was obtained. Even if it were, any solution that requires the complete retraining of every deployed model is infeasible in many applications, especially if complex, modified training procedures, such as those used to train Bayesian NNs, must be deployed. This is particularly critical for large-scale, non-convex models.
4. Credibility as a fit-risk trade-off
In Section 3, we argued that a loss function induces a credibility measure through (1), but that it is reliable only if the classifier output $\phi_o(x_o)$ is insensitive to perturbations. In what follows, we show how we can filter the input $x_o$ in order to produce such reliable credibility profiles.

We do not use filter here to mean removing noise or artifacts (though this may be a byproduct of the procedure), but obtaining a modified input $x^\star$ such that $\phi_o(x^\star)$ is robust to perturbations. However, since our goal is to ultimately use $\phi_o(x^\star)$ to produce credibility profiles, the direction and magnitude of these perturbations are not dictated by a fixed disturbance model as in robust statistics and adversarial noise applications. Instead, we obtain task- and sample-specific settings using the underlying data properties captured by the model $\phi_o$ during training.

Indeed, we consider perturbations that improve fit (increase credibility), not with respect to a specific class, but with respect to all classes simultaneously. We do so because we do not seek a single most credible class for the input, but the best credibility profile with respect to all classes. Any perturbation of $x_o$ that worsens the overall fit leads to an altogether less confident classification and thus to a poor measure of credibility. At the same time, $x^\star$ must be anchored at $x_o$: the further we are from the original input, the more we have modified the original classification problem. By allowing arbitrarily large perturbations, the original image of a dog could be entirely replaced by that of a cat, at which point we are no longer assigning credibility to predictions of dog pictures.
The magnitude of the perturbation thus relates to the risk of not solving the desired problem. In summary, the "filtered" $x^\star$ must balance two conflicting objectives: improving fit while remaining close to $x_o$. Indeed, while smaller perturbations of $x_o$ are preferable to reduce risk, larger perturbations are allowed if the fit payoff is of comparable magnitude. We formalize this concept in the following definition, where $W$ is a diagonal, positive definite matrix and $\|z\|_W^2 = z^T W z$ denotes the $W$-weighted Euclidean norm.

Definition 1 (Credibility profile). Given a model $\phi_o$ and an input $x_o$, the credibility profile of the predictions $\phi_o(x_o)$ is a vector $c^\star$ that satisfies

$$\|x^\star - x_o\|^2 - \|x' - x_o\|^2 \leq \|c'\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2, \tag{2}$$

where the credibility profiles are obtained from (1) as $c^\star = \big[-\ell_k\big(\phi_o(x^\star)\big)\big]_{k \in \mathcal{K}}$ and $c' = \big[-\ell_k\big(\phi_o(x')\big)\big]_{k \in \mathcal{K}}$.

In other words, the credibility profile $c^\star$ for the predictions of the model $\phi_o$ with respect to the input $x_o$ is obtained using the modified input $x^\star$ that has better fit than any other input closer to $x_o$ and is closer to $x_o$ than any input with better overall fit. In particular, note that for $x' = x_o$, (2) becomes

$$\|x^\star - x_o\|^2 \leq \|c_o\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2, \tag{3}$$

where $c_o = \big[-\ell_k\big(\phi_o(x_o)\big)\big]_{k \in \mathcal{K}}$ are the credences induced by the original input $x_o$. Hence, the perturbation that generates $x^\star$ is at most as large as the overall credence gains [recall from (1) that $c_k \leq 0$].
These gains are weighted by the matrix $W$, which can be used either to control risk aversion, e.g., by taking $W = \gamma I$ and using $\gamma$ to control the magnitude of the right-hand side of (3), or to embed a priori knowledge on the value of the credibilities (see Section 6). While (3) may be easier to interpret, note that it is not well-posed since it always holds for $x^\star = x_o$. Def. 1 is therefore strictly stronger, as (2) must hold for any reference input $x'$, not only the original $x_o$.

In the next section, we analyze the properties of the compromise (2). In particular, we are interested in those properties that do not depend on the convexity of $\phi_o$, so that they are applicable to a wider class of models including, e.g., (C)NNs. We first show that the perturbation needed to achieve a credence profile can be determined by solving a constrained mathematical program (Section 5.1). Leveraging this formulation and results from duality theory, we then obtain equivalent formulations of Def. 1 (Section 5.2) that are used to show that $c^\star$ is in fact a MAP estimate (Section 6) and to provide an algorithm that simultaneously determines $x^\star$ and $c^\star$ counterfactually, i.e., without testing multiple inputs or credences (Section 7).

While this work focuses on perturbations of the input $x_o$, similar definitions and results also apply to perturbations of the model $\phi_o$ with minor modifications. We leave the details of this alternative formulation for future work.
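Under the squared-norm reading of Def. 1, rearranging (2) says that $x^\star$ globally minimizes the composite objective $\|x - x_o\|^2 + \|c\|_{W^{-1}}^2$, so the bound (3) can be checked numerically. A brute-force sketch on a toy two-class linear softmax model, where the matrix `A`, input `x_o`, and `gamma` are arbitrary choices of ours:

```python
import numpy as np

gamma = 2.0                                # W = gamma * I (risk aversion)
A = np.array([[1.0, -0.5], [-1.0, 0.8]])   # toy linear model phi_o(x) = A x
x_o = np.array([0.3, -0.2])

def credences(x):
    z = A @ x
    z = z - z.max()
    return z - np.log(np.exp(z).sum())     # c_k = log softmax_k (<= 0)

def objective(x):
    # ||x - x_o||^2 + ||c||^2_{W^{-1}} with W = gamma * I
    c = credences(x)
    return np.sum((x - x_o) ** 2) + np.sum(c ** 2) / gamma

# brute-force search for x* over a grid of perturbations around x_o
grid = np.linspace(-1.5, 1.5, 61)
candidates = [x_o + np.array([dx, dy]) for dx in grid for dy in grid]
x_star = min(candidates, key=objective)

c_o, c_star = credences(x_o), credences(x_star)
lhs = np.sum((x_star - x_o) ** 2)                       # perturbation risk
rhs = (np.sum(c_o ** 2) - np.sum(c_star ** 2)) / gamma  # credence gain
```

Since the unperturbed $x_o$ is itself a candidate, the minimizer always satisfies `lhs <= rhs`, which is exactly (3).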
5. A counterfactual theory of compromise
Def. 1 established credibility in terms of a compromise between fit and risk. It is not immediate, however, whether such a compromise exists or if it can be determined efficiently. The goal of this section is to address these questions by developing a (local) counterfactual theory of compromise. By counterfactual, we mean that the main result of this section (Theorem 2) characterizes properties of (PI) that would turn arbitrary credences into credibilities that satisfy (2). In Section 7, we leverage this result to put forward an algorithm that directly finds local solutions of (PI) satisfying (2) without repeatedly testing profiles $c$.

The first step in our derivations is to express the trade-off (2) solely in terms of credences. To do so, we formalize the relation between a credence profile $c$ and the magnitude $r^\star(c)$ of the smallest perturbation of $x_o$ required to find an input that achieves it. We describe this relation by means of the constrained optimization problem

$$r^\star(c) = \min_{x \in \mathbb{R}^p} \|x - x_o\|^2 \quad \text{subject to} \quad \ell_k\big(\phi_o(x)\big) \leq -c_k, \; k \in \mathcal{K}. \tag{PI}$$

Define $r^\star(c) = +\infty$ if the program is infeasible, i.e., if there exists no $x$ such that $\ell_k\big(\phi_o(x)\big) \leq -c_k$ for all $k \in \mathcal{K}$. Problem (PI) seeks the input $x$ closest to $x_o$ whose fit matches the credences $c$. Its optimal value therefore describes the risk (perturbation magnitude) of any given credence profile. Note that due to the conflicting natures of fit and risk, the inequality constraints in (PI) typically hold with equality at the compromise $c^\star$. Immediately, we can write (2) as

$$r^\star(c^\star) - r^\star(c') \leq \|c'\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2. \tag{4}$$

The compromise in Def.
1 is explicit in (4): the prediction credibilities of a model/input pair are given, not by the credences induced by the model output in (1), but by those that increase confidence in each class at least as much as they increase risk, as measured here by the perturbation magnitude.

The main issue with (4) is that evaluating $r^\star$ involves solving the optimization problem (PI), which may not be computationally tractable. This is not an issue when (PI) is a convex problem (e.g., if $\phi_o$ is convex and $\ell_k$ is convex and non-decreasing). Yet, typical ML models are non-convex functions of their input, e.g., (C)NNs or RKHS-based methods. To account for cases in which finding the minimum of (PI) is hard, we consider a local version of Def. 1 induced by (4):

Definition 2 (Local credibility profile). Let $x^\dagger(c)$ be a local minimizer of (PI) with credences $c$ and consider its value $r^\dagger(c) = \|x^\dagger(c) - x_o\|^2$. A local credibility profile of the predictions $\phi_o(x_o)$ is a vector $c^\dagger$ that satisfies

$$r^\dagger(c^\dagger) - \|x' - x_o\|^2 \leq \|c'\|_{W^{-1}}^2 - \|c^\dagger\|_{W^{-1}}^2, \tag{5}$$

for all $x'$ in a neighborhood of $x^\dagger(c^\dagger)$ that are feasible for (PI) with credences $c'$.

While the following results are derived for the local compromise in (5), they also hold for Def. 1 by replacing $\dagger$ with $\star$. The following assumptions are used in the sequel:
Assumption 1.
The loss functions $\ell_k$, $k \in \mathcal{K}$, and the model $\phi_o$ are differentiable.

Assumption 2.
Let $x^\dagger$ be a local minimizer of (PI). There exists a neighborhood $\mathcal{N}$ of $x^\dagger$ such that

$$\ell_k\big(\phi_o(x)\big) \geq \ell_k\big(\phi_o(x^\dagger)\big) + (x - x^\dagger)^T \nabla \ell_k\big(\phi_o(x^\dagger)\big) - \big[\ell_k\big(\phi_o(x)\big) - \ell_k\big(\phi_o(x^\dagger)\big)\big]^2 \, \ell_k\big(\phi_o(x^\dagger)\big), \tag{6}$$

for all $x \in \mathcal{N}$ and $k \in \mathcal{K}$.

Assumption 2 restricts the composite function $\ell_k\big(\phi_o(\cdot)\big)$ to be non-convex only up to a quadratic in a neighborhood of a local minimum. Additionally, we will only consider credibility profiles for which there exist strictly feasible solutions of (PI), i.e., $c \in \mathcal{C}$ for

$$\mathcal{C} = \big\{ c \in \mathbb{R}^{|\mathcal{K}|} \;\big|\; \exists\, x \in \mathbb{R}^p \text{ such that } \ell_k\big(\phi_o(x)\big) < -c_k \text{ for all } k \in \mathcal{K} \big\}. \tag{7}$$

Note that $\mathcal{C}$ is arbitrarily close to the set of all achievable credences.

To proceed, let us recall the following properties of minimizers:

Theorem 1 (KKT conditions, (Boyd & Vandenberghe, 2004, Section 5.5.3)). Let $x^\dagger$ be a local minimizer of (PI) for the credences $c \in \mathcal{C}$. Under Assumption 1, there exist $\lambda^\dagger \in \mathbb{R}_+^{|\mathcal{K}|}$, known as dual variables, such that

$$2(x^\dagger - x_o) + \sum_{k \in \mathcal{K}} \lambda_k^\dagger \nabla_x \ell_k\big(\phi_o(x^\dagger)\big) = 0, \tag{8a}$$

$$\lambda_k^\dagger \big[\ell_k\big(\phi_o(x^\dagger)\big) + c_k\big] = 0, \quad k \in \mathcal{K}. \tag{8b}$$

If $\ell_k\big(\phi_o(\cdot)\big)$ is convex, (8) are both necessary and sufficient for global optimality. Additionally, the function $r^\dagger$ (or in this case, $r^\star$) is then differentiable and its derivative with respect to $c$ is given by the dual variables. In fact, $\lambda^\dagger(c)$ then quantifies the change in the risk $r^\dagger(c)$ if the credences increase or decrease by a small amount. This sensitivity interpretation is a classical result from the convex optimization literature (Boyd & Vandenberghe, 2004, Section 5.6.2).

In general, however, $\ell_k\big(\phi_o(\cdot)\big)$ is non-convex.
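The conditions (8) can be verified on a toy convex instance of (PI) with a single class, where a linear function stands in for $\ell_k\big(\phi_o(\cdot)\big)$; all numbers below are hypothetical choices of ours, and the closed-form solution is the projection onto the feasible halfspace:

```python
import numpy as np

# toy convex instance of (PI): min ||x - x_o||^2  s.t.  a.x + b <= -c
# (a linear stand-in for ell(phi_o(x)) with a single class)
a, b, c = np.array([1.0, 2.0]), 0.5, -0.2
x_o = np.array([1.0, 1.0])

slack = a @ x_o + b + c                 # constraint violation at x_o
lam = 2 * max(slack, 0) / (a @ a)       # dual variable (projection formula)
x_dag = x_o - (lam / 2) * a             # projection onto the halfspace

stationarity = 2 * (x_dag - x_o) + lam * a      # left-hand side of (8a)
comp_slack = lam * (a @ x_dag + b + c)          # left-hand side of (8b)
```

Both residuals vanish: the constraint is active at the solution, so complementary slackness (8b) holds with $\lambda^\dagger > 0$, and the dual variable exactly cancels the objective gradient in (8a).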
Still, Theorem 1 allows us to obtain a fixed-point condition that turns credences into credibilities.

Theorem 2.
Take $c \in \mathcal{C}$ and let $x^\dagger(c)$ be a local minimizer of (PI) with value $r^\dagger(c)$ associated with the dual variables $\lambda^\dagger(c)$. Under Assumptions 1 and 2, the local credibility profile $c^\dagger$ from Def. 2 exists and satisfies

$$c = -W \lambda^\dagger(c) \;\Rightarrow\; c \text{ satisfies (5), i.e., } c = c^\dagger. \tag{9}$$

Proof.
See Appendix 11.

Theorem 2 shows that Theorem 1 can be used to certify a local credibility measure $c^\dagger$ without repeatedly solving (PI). Explicitly, if (8) hold with $\lambda^\dagger = -W^{-1} c$, then $\big(x^\dagger(c), c\big)$ satisfies (5). In other words, the dual variables associated with solutions of (PI) provide counterfactuals of the form "if the credence $c_k$ assigned to class $k$ had been $-w_k \lambda_k^\dagger$ for all $k \in \mathcal{K}$, then $c$ would have been a local credibility profile." Considering the sensitivity interpretation of $\lambda^\dagger$ in the convex case, the compromise (5) also causes the credence $c_k$ for classes in which $x_o$ is harder to fit to decrease (become more negative) in order to manage risk.

In the sequel, we leverage the results from Theorems 1 and 2 to first show that credibility as defined in Def. 2 has a Bayesian interpretation as the MAP of a specific failure model (Section 6) and then put forward an algorithm to efficiently compute the credibility profile $c^\dagger$ without repeatedly solving (PI) for different credence profiles.
6. A Bayesian formulation of credibility
An interesting consequence of Theorem 2 is that the fit-risk trade-off credibility definitions in Def. 1 and 2 can also be treated as a Bayesian inference problem. This problem formalizes the intuition that formulating (2) [and more generally (5)] in terms of Euclidean norms is equivalent to modeling the uncertainties related to the input and credibility profiles as Gaussian random variables (RVs). Consider the likelihood
$$\Pr(x \mid c) = \mathcal{N}\big(x \mid x_o, (2t)^{-1} I\big) \times \prod_{c_k \neq 0} \mathcal{SE}\Big[\ell_k\big(\phi_o(x)\big) \;\Big|\; -c_k,\; -\frac{t}{c_k w_k}\Big], \tag{10a}$$

defined for a parameter $t > 0$, where $\mathcal{N}(z \mid \bar{z}, \Sigma)$ represents the density of a normal random vector $z$ with mean $\bar{z}$ and covariance $\Sigma$, and $\mathcal{SE}(z \mid \bar{z}, \eta)$ the density of an exponentially distributed RV $z$ with rate $\eta$ shifted to $\bar{z}$. This likelihood represents our belief that, though $x_o$ is representative of the model input, it may be corrupted. This uncertainty is described by the first term of (10a).

The other terms account for the failure of $x$ to meet the credences $c$ in the constraints of (PI). If an input $x$ violates any of the credences, i.e., $\ell_k\big(\phi_o(x)\big) > -c_k$, then its likelihood is penalized independently of how close it is to the mean $x_o$. This penalty follows a constant failure rate that depends on the credence $c_k$ itself. Hence, the probability of the violation increasing by $\epsilon$ does not depend on how much the constraint has already been violated. Explicitly,

$$\Pr\big[\ell_k\big(\phi_o(x)\big) > -c_k + z + \epsilon \;\big|\; \ell_k\big(\phi_o(x)\big) > -c_k + z\big] = \Pr\big[\ell_k\big(\phi_o(x)\big) > -c_k + \epsilon\big].$$

To obtain the joint distribution of $(x, c)$, we use a normal prior on the credence values, namely

$$\Pr(c) = \mathcal{N}\big(c \mid 0, (2t)^{-1} W\big). \tag{10b}$$

Observe from (10b) that $W$ in (2) [and (5)] can be interpreted both as a weighting matrix that modifies the geometry of the credence space and as a measure of uncertainty over the prediction credibilities. It can therefore be used to incorporate prior information, e.g., on the relative frequency of classes. The following proposition characterizes a local maximum of the joint distribution $\Pr(x, c)$:

Proposition 1.
Let the pair $(x^\dagger, c^\dagger)$ satisfy the local credibility compromise (5). Then, there exists $T > 0$ such that it is a local maximum of the joint distribution $\Pr(x, c) = \Pr(x \mid c) \Pr(c)$ defined by (10) for $t \geq T$.

Proof. See Appendix 12.

Proposition 1 shows that the local credibility profiles in (5) are asymptotic local maxima of the probabilistic model (10) as the variance of its components decreases, i.e., as $t$ increases and the joint probability distribution becomes more concentrated. In fact, every critical point of $\Pr(x, c)$ satisfies the KKT conditions of Theorem 1 and the counterfactual condition of Theorem 2. Recall that if $\phi_o$ is convex, then the unique mode $(x^\star, c^\star)$ of this joint distribution provides a credibility profile according to Def. 1. Hence, though motivated as deterministic fit-risk trade-offs, the credibility metrics proposed in this work can also be viewed as MAP estimates.

In the next section, we conclude our analysis of the credibility profile in Def. 2 by showing how it can be computed efficiently from (PI) at no additional complexity cost. We

Algorithm 1
Counterfactual optimization algorithm
Let $x^{(0)} = x_o$ and $\lambda^{(0)} = 0$.
for $t = 0, 1, 2, \ldots$ do
  $g_x^{(t)} = 2\big(x^{(t)} - x_o\big) + \sum_{k \in \mathcal{K}} \lambda_k^{(t)} \nabla_x \ell_k\big(\phi_o(x^{(t)})\big)$
  $x^{(t+1)} = x^{(t)} - \eta_x g_x^{(t)}$
  $\lambda_k^{(t+1)} = \Big[\lambda_k^{(t)} + \eta_\lambda \Big(\ell_k\big(\phi_o(x^{(t)})\big) - w_k \lambda_k^{(t)}\Big)\Big]_+$
end for

do so by modifying the Arrow-Hurwicz algorithm (Arrow et al., 1958) using Theorem 2.
7. A modified Arrow-Hurwicz algorithm
Theorems 1 and 2 suggest a way to exploit the information in the dual variables to solve (PI) directly for the (local) credibility $c^\dagger$ without testing multiple credence profiles. Indeed, start by considering that the credences $c$ are fixed and define the Lagrangian associated with (PI) as

$$L(x, \lambda, c) = \|x - x_o\|^2 + \sum_{k \in \mathcal{K}} \lambda_k \big[\ell_k\big(\phi_o(x)\big) + c_k\big]. \tag{11}$$

Observe that the KKT necessary conditions (8) for $x^\dagger$ to be a local minimizer can be written in terms of (11) as $\nabla_x L(x^\dagger, \lambda^\dagger, c) = 0$ and $\lambda_k^\dagger \big[\nabla_\lambda L(x^\dagger, \lambda^\dagger, c)\big]_k = 0$ for $k \in \mathcal{K}$, where $[z]_k$ indicates the $k$-th element of the vector $z$. The classical Arrow-Hurwicz algorithm (Arrow et al., 1958) is a procedure inspired by these relations that seeks a KKT point by alternating between the updates

$$x^+ = x - \eta_x \nabla_x L(x, \lambda, c) = x - \eta_x \Big[ 2(x - x_o) + \sum_{k \in \mathcal{K}} \lambda_k \nabla_x \ell_k\big(\phi_o(x)\big) \Big], \tag{12a}$$

$$\lambda_k^+ = \Big[\lambda_k + \eta_\lambda \big[\nabla_\lambda L(x, \lambda, c)\big]_k\Big]_+ = \Big[\lambda_k + \eta_\lambda \Big(\ell_k\big(\phi_o(x)\big) + c_k\Big)\Big]_+, \tag{12b}$$

where $\eta_x, \eta_\lambda > 0$ are step sizes and $[z]_+ = \max(z, 0)$ denotes the projection onto the non-negative orthant of $\mathbb{R}^{|\mathcal{K}|}$. To understand the intuition behind this algorithm, note that (12a) updates $x$ by descending along a weighted combination of gradients of the objective and the constraints so as to reduce the value of all functions. The weight of each constraint is given by its respective dual variable $\lambda_k$. If the $k$-th constraint is satisfied, then $\ell_k\big(\phi_o(x)\big) + c_k \leq 0$ and its influence on the update of $x$ is decreased by (12b) until it vanishes. On the other hand, if the constraint is violated, then $\ell_k\big(\phi_o(x)\big) + c_k > 0$ and the value of $\lambda_k$ increases.
Figure 1. (a) Perturbation magnitude $\|x^\dagger - x_o\|$ and (b) overall credibility $\|c^\dagger\|$ for $\gamma \in \{100, 200, 400\}$.
The relative strength of each gradient in the update (12a) is therefore related to the history of violation of each constraint.

The main drawback of (12) is that it seeks a KKT point of (PI) for a given, fixed credence profile $c$, while the credibility profile $c^\dagger$ from (5) is not known a priori. To overcome this issue, we can use the counterfactual result (9) in (12b) to obtain

$$\lambda_k^+ = \Big[\lambda_k + \eta_\lambda \Big(\ell_k\big(\phi_o(x)\big) - w_k \lambda_k\Big)\Big]_+. \tag{13}$$

The complete counterfactual optimization procedure is collected in Algorithm 1. If the dynamics of Algorithm 1 converge to $(x^\infty, \lambda^\infty)$, then $x^\infty$ is a local minimizer of (PI) for the credences $c = -W\lambda^\infty$, which form a local credibility profile $c^\dagger$ according to Theorem 2.

It is worth noting that in the general, non-convex case, Algorithm 1 need not converge. This will happen if the gradient descent procedure in (12a) is unable to find an $x$ that fits the credences imposed by $\lambda$. For rich model classes, this is less likely to happen and there is considerable empirical evidence from the adversarial examples literature that gradient methods such as (12a) do converge (Szegedy et al., 2014; Goodfellow et al., 2014; Madry et al., 2018). This is also what we observed in our numerical experiments, during which we found no instance in which Algorithm 1 diverged (Section 8). When $\ell_k$ is convex and non-decreasing and $\phi_o$ is convex, Algorithm 1 can be shown to converge to the global optimum $(x^\star, \lambda^\star, c^\star)$ of (PI) through classical arguments, as in (Cherukuri et al., 2016; Nagurney & Zhang, 2012). Details of this result are beyond the scope of this paper.
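As a sanity check, Algorithm 1 can be run on a small synthetic problem. The sketch below uses a toy two-class softmax model $\phi_o(x) = Ax$ with cross-entropy losses; the matrix `A`, input `x_o`, step sizes, and `W = diag(w)` are arbitrary choices of ours, not values from the paper's experiments:

```python
import numpy as np

A = np.array([[1.0, -0.5], [-1.0, 0.8]])   # toy model phi_o(x) = A x
x_o = np.array([0.3, -0.2])
w = np.array([2.0, 2.0])                   # diagonal of W
eta_x, eta_lam = 0.05, 0.05                # step sizes

def losses(x):                             # ell_k(phi_o(x)) = -log softmax_k
    z = A @ x
    z = z - z.max()
    return -(z - np.log(np.exp(z).sum()))

def grad_losses(x):                        # row k: grad_x ell_k(phi_o(x))
    z = A @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    return (p[None, :] - np.eye(2)) @ A    # d ell_k / dx = (p - e_k)^T A

x, lam = x_o.copy(), np.zeros(2)
for _ in range(5000):
    g_x = 2 * (x - x_o) + lam @ grad_losses(x)                    # (12a)
    x = x - eta_x * g_x
    lam = np.maximum(lam + eta_lam * (losses(x) - w * lam), 0.0)  # (13)

c_dag = -w * lam    # counterfactual credibility profile via Theorem 2
```

At a fixed point, the dual update (13) enforces $\ell_k\big(\phi_o(x)\big) = w_k \lambda_k$ whenever $\lambda_k > 0$, which is exactly the counterfactual condition $c = -W\lambda^\dagger$ certified by Theorem 2.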
8. Numerical experiments
To showcase the properties and uses of this trade-off approach to credibility, we use two CNN architectures, a ResNet18 (He et al., 2016) and a DenseNet169 (Huang et al., 2017), trained to classify images from the CIFAR-10 dataset. The models were trained in mini-batches using Adam with the default parameters from (Kingma & Ba, 2017) and weight decay. Random cropping and flipping were used to augment the data during training. Throughout this section, the loss function $\ell_k$ is the cross-entropy loss and all experiments were performed over random samples from the CIFAR-10 test set.

We start by illustrating the effect of the trade-off between fit and risk (perturbation magnitude). To do so, we compute the credibility $c^\dagger$ for $W = \gamma I$ with $\gamma \in \{100, 200, 400\}$ using the ResNet18 CNN as $\phi_o$ (Figure 1). Observe that as $\gamma$ increases, the compromise (2) becomes more risk averse. Hence, the magnitude of the perturbations decreases (Figure 1a). As this happens, the norm of the credibility profiles $\|c^\dagger\|$ increases (Figure 1b). Yet, note that the trade-off in (3) continues to hold for $\gamma^{-1}\|c^\dagger\|^2$. In the sequel, we proceed using $\gamma = 200$.

Figure 2. Credibility profiles (normalized $c_o$ and $c^\dagger$, sorted by class rank) for the ResNet18 [(a), (b)] and DenseNet169 [(c), (d)].

Given the perturbation-stability interpretation of $c^\dagger$, it can be used to analyze the robustness of different classifiers. By studying the relation between the ordering induced on the classes $\mathcal{K}$ and the values of $c_o$ and $c^\dagger$ (Fig. 2), we can evaluate how input perturbations alter the model outputs. Recall that the perturbation magnitude is adapted to both the model and the input. Hence, while perturbation levels might be different, they are the limit of the compromise in (5), so that any larger perturbation would not incur a significantly larger change in the model output. Rather than assessing robustness to a set of perturbations, this analysis evaluates the failure modes of the model, i.e., when and how models fail.

In Figure 2, we display the values of $c_o$ and $c^\dagger$ sorted
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 3.
Performance of softmax-based filtered classifiers. Shades display best- and worst-case results over restarts.

in decreasing order for the pretrained ResNet18 and DenseNet169 CNNs. To help visualization, the credibility profiles were normalized so as to lie in the unit interval. Notice that, in contrast to the DenseNet, the perturbed profiles of the ResNet (Figure 2b) are considerably different from the original ones (Figure 2a). This implies that the ResNet output can be considerably modified by perturbations that are small compared to their effect (as in Def. 2). On the other hand, note that the DenseNet retains a good amount of certainty on some samples even after perturbation (Figure 2d). While in c_o this certainty can be artificial (e.g., due to miscalibration), the fact that c† ≈ c_o means that modifying the classifier decision (whether correct or not) would require perturbations too large to be warranted. In this case, the DenseNet model displays signs of stability that the ResNet does not. It is worth pointing out that both architectures have similar numbers of parameters. We proceed with the experiments using the ResNet.

Another use of credibility is in the context of filtering. In this case, credibility is used to assess whether the classifier is confident enough to make a prediction. If not, then it should abstain from doing so. The performance of such filtered classifiers is evaluated in terms of their coverage (how many samples the model chooses to classify) and filtered accuracy (accuracy over the classified samples). Ideally, we would like a model that never abstains (full coverage) and is always accurate. Yet, these two quantities are typically conflicting: filtered accuracy can often be improved by reducing coverage. The issue is then whether this can be done smoothly.
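The coverage and filtered accuracy of an abstaining classifier can be computed directly from its score matrix. The sketch below implements one possible selection rule, a ratio test of the kind used in these experiments (retain a sample when the runner-up score is at most (1 − α) times the top score); the function name and interface are our own illustration, not the paper's code.

```python
import numpy as np

def filtered_metrics(scores, labels, alpha):
    """Coverage and filtered accuracy of a classifier that abstains unless
    the second-largest score is below (1 - alpha) times the largest."""
    top2 = np.sort(scores, axis=1)[:, -2:]          # (runner-up, top) score per sample
    keep = top2[:, 0] <= (1 - alpha) * top2[:, 1]   # confidence-margin test
    preds = scores.argmax(axis=1)
    coverage = keep.mean()                          # fraction of samples classified
    accuracy = (preds[keep] == labels[keep]).mean() if keep.any() else float("nan")
    return coverage, accuracy
```

Sweeping α then traces coverage-accuracy curves of the kind shown in Figures 3 and 4 (the scores are assumed non-negative, e.g., softmax outputs or normalized credibility profiles).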
Suppose that the model decides to classify a sample if its second largest credibility is less than the largest one by a factor of at least 1 − α, where α is chosen to achieve specific coverage levels. Figures 3 and 4 compare the results of directly using the softmax output of the
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 4.
Performance of credibility-based filtered classifiers. Shades display best- and worst-case results over restarts.
(Bar plot: accuracy of softmax- and credibility-based classifiers, for the original data and PGD attacks of three strengths.)
Figure 5.
Accuracy of softmax- and credibility-based classifiers.

model and the credibility profile c†. Notice that, when classifying every sample, the classifier based on c† has lower accuracy (see also Figure 5). However, it is able to discriminate between certain and uncertain predictions in a way that the output of the model cannot. This is straightforward given that the model output typically has a single strong component (see Figure 2a). This shows that the pretrained model is, in many cases, overconfident about its predictions, a phenomenon related to the issue of miscalibration in NNs (Guo et al., 2017).

This overconfidence becomes clear when the data are attacked using adversarial noise. To illustrate this effect, we perturb the test images using the projected gradient descent (PGD) attack for maximum perturbations ε ∈ { . , . , . } (Madry et al., 2018). It is worth noting that, for the largest ε, the perturbations are noticeable in the image. PGD was run for iterations, with step size . ( . for ε = 0 . ), and the figures show the range of results obtained over restarts.

The accuracy of the classifiers decreases considerably after the PGD attack (Figure 5). However, while the performance of the softmax classifier degrades by almost , that of the credibility-based ones drops by . When the magnitude of the perturbation reaches ε = 0 . , the output of the model was only able to correctly classify a single image in half of the restarts. Using credibility, however, accuracy remained above . It is worth noting that this result is achieved without any modification to the model, including retraining or preprocessing. What is more, filtering does not improve upon these results due to the miscalibration issue. Indeed, even for ε = 0 .
, the softmax-based filtered classifier must give up over of its coverage to recover the performance that the credibility-based filtered classifier achieves when classifying every sample. These experiments illustrate the trade-off between robustness and nominal accuracy noted in (Tsipras et al., 2019).
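For reference, the PGD attack of (Madry et al., 2018) used above repeatedly takes signed-gradient ascent steps on the loss and projects the perturbation back onto the ℓ∞ ball of radius ε. The sketch below applies it to a toy linear softmax model; the model, step size, and iteration count are illustrative stand-ins, not the experimental configuration of the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pgd_attack(W, x, y, eps, step, iters=40):
    """l_inf PGD on the cross-entropy loss of a linear softmax model."""
    x_adv = x.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(iters):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - onehot)                  # gradient of -log p_y w.r.t. x
        x_adv = x_adv + step * np.sign(grad)       # signed ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project onto the eps-ball
    return x_adv
```

In practice, the attack is restarted from several random points inside the ε-ball and the worst case is reported, which is what the shaded ranges in the figures summarize.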
9. Conclusion
This work introduced a credibility measure based on a compromise between how much the model fit can be improved by perturbing its input and how much this perturbation modifies the original input (risk). By formulating this problem in the language of constrained optimization, it showed that this trade-off can be determined counterfactually, without testing multiple perturbations. Leveraging these results, it put forward a practical method to assign credibilities that (i) can be computed for any (possibly non-convex) differentiable model, from RKHS-based solutions to any (C)NN architecture, (ii) can be obtained for models that have already been trained, (iii) does not rely on any form of training data, and (iv) has formal guarantees. Future work includes devising local sensitivity results for non-convex optimization programs, analyzing the model perturbation problem, and applying the counterfactual compromise result to different learning problems, such as outlier detection. The Bayesian formulation from Section 6 also suggests an independent line of work linking constrained optimization and MAP estimates.
References
Aggarwal, C. C. Outlier analysis. In Data Mining, pp. 237–263. Springer, 2015.
Arrow, K., Hurwicz, L., and Uzawa, H. Studies in Linear and Non-Linear Programming. Stanford University Press, 1958.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Bonnans, J. and Shapiro, A. Perturbation Analysis of Optimization Problems. Springer, 2000.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
Cherukuri, A., Mallada, E., and Cortés, J. Asymptotic convergence of constrained primal–dual dynamics. Systems & Control Letters, 87:10–15, 2016.
Cortes, C., DeSalvo, G., and Mohri, M. Boosting with abstention. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, pp. 1660–1668, 2016.
Dhillon, G. S., Azizzadenesheli, K., Bernstein, J. D., Kossaifi, J., Khanna, A., Lipton, Z. C., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018.
Dietterich, T. G. Ensemble methods in machine learning. In Multiple Classifier Systems, LBCS-1857, pp. 1–15. Springer, 2000.
Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. CRC Press, 1994.
El-Yaniv, R. and Wiener, Y. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010.
Flores, I. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-7(2):180–180, 1958.
Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
Gal, Y. and Ghahramani, Z. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, pp. 4878–4887, 2017.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. CoRR, 2014.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330, 2017.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, 2001.
Hawkins, D. M. Identification of Outliers, volume 11. Springer, 1980.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In , 2016.
Heek, J. and Kalchbrenner, N. Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491, 2019.
Hodge, V. and Austin, J. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
Huang, G., Liu, Z., v. d. Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In , 2017.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980v9, 2017.
Li, H., Wang, X., and Ding, S. Research and development of neural network ensembles: a survey. Artificial Intelligence Review, 49(4):455–479, 2018.
MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Mandelbaum, A. and Weinshall, D. Distance-based confidence score for neural network classifiers. arXiv preprint arXiv:1709.09844, 2017.
Nagurney, A. and Zhang, D. Projected Dynamical Systems and Variational Inequalities with Applications. Springer, 2012.
Neal, R. M. Bayesian Learning for Neural Networks. Springer, 1996.
Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74, 1999.
Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2005.
Roberts, F. and Tesman, B. Applied Combinatorics. Chapman & Hall/CRC, 2nd edition, 2009.
Sheikholeslami, F., Jain, S., and Giannakis, G. B. Efficient randomized defense against adversarial attacks in deep convolutional neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3277–3281, 2019.
Shridhar, K., Laumann, F., and Liwicki, M. A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731, 2019.
Smithson, M. Confidence Intervals, volume 140. Sage Publications, 2002.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295, 2018.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
(Plots: normalized c_o and c† versus class rank.)
Figure 6.
Credibility profiles for ResNet18 on Fashion MNIST
10. Additional numerical experiments
In these additional results, we repeat the experiments from the main paper on a different dataset (namely, Fashion MNIST (Xiao et al., 2017)) to show how they carry over to another application. Here, we perform the experiments using the ResNet18 (He et al., 2016) architecture trained over epochs in mini-batches of samples using Adam with the default parameters from (Kingma & Ba, 2017) and weight decay of − . Without data augmentation, the final classifier achieves an accuracy of over the test set. Once again, the loss function ℓ used is the cross-entropy loss and all experiments were performed over a random sample of images from the Fashion MNIST test set. In the sequel, we take W = γI with γ = 200.

We once again begin by leveraging the perturbation stability interpretation of c† to analyze the robustness of the ResNet on this new dataset, looking at the normalized values of c_o and c† sorted in decreasing order (Figure 6). Notice that, in contrast to the ResNet trained on CIFAR10, the perturbed profiles here are much noisier. This classifier is therefore less robust to perturbations of the input: its output can be considerably modified by comparably small perturbations. Notice that this analysis does not apply to the ResNet architecture in general, but to the specific instance used to classify these images. While these differences can be due to a larger sensitivity of the data (Fashion MNIST pictures are black-and-white whereas CIFAR10 has colored images), they can also be due to the fact that we trained for fewer epochs and did not use data augmentation. The power of using credibility profiles to analyze robustness is exactly that the analysis holds for the specific instance and application, in contrast to an average analysis.

Results for the filtering classifiers are shown in Figures 7 and 8.
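As a rough numerical illustration (not the paper's implementation), a credibility profile of this kind can be approximated by descending a penalized surrogate of the risk-fit compromise for W = γI: minimize ‖x − x_o‖² + γ⁻¹ Σ_k ℓ_k(φ_o(x))² and read off c†_k ≈ −ℓ_k at the perturbed input, since the constraints ℓ_k(φ_o(x)) ≤ −c_k are active at the optimum. A toy softmax model stands in for the CNN, and all names below are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def losses(W, x):
    # per-class cross-entropy losses ell_k = -log p_k at the model output
    return -np.log(softmax(W @ x) + 1e-12)

def num_grad(f, x, h=1e-5):
    # central finite differences (enough for a low-dimensional sketch)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def credibility_profile(W, x_o, gamma=200.0, lr=0.05, steps=500):
    """Approximate c† by minimizing ||x - x_o||^2 + (1/gamma) * sum_k ell_k(x)^2."""
    def obj(z):
        return np.sum((z - x_o) ** 2) + np.sum(losses(W, z) ** 2) / gamma

    x = x_o.astype(float).copy()
    for _ in range(steps):
        x -= lr * num_grad(obj, x)
    return -losses(W, x), x  # c†_k ≈ -ell_k at the perturbed input x†
```

Consistent with the discussion in the main paper, increasing γ makes the compromise more risk averse, so the returned perturbation ‖x† − x_o‖ shrinks.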
Once again, the model classifies samples only if the second largest credibility (or softmax output entry) is less than the largest one by a factor of at least 1 − α, where α is chosen to achieve specific coverage levels. We show results both using the softmax output of the model and the credibility profile c†. Once again, we notice a trade-off between robustness and performance, as in (Tsipras et al., 2019). When classifying every sample (full coverage), the classifier based on c† has lower accuracy (Figure 9). However, it can discriminate between certain and uncertain predictions in a way that the softmax output of the model cannot (as seen for c_o in Figure 6). In order to achieve the same accuracy as the unmodified model, the credibility-based filtered classifier must reduce its coverage to approximately . However, the overconfidence of the model output in its predictions makes it susceptible to perturbations. We illustrate this point by applying the different classifiers to test images corrupted by adversarial noise. The attacks shown here use the same parameters as described in the paper, except for the perturbation magnitudes, which are twice as large. Still, we observe the same pattern: even for the weakest attack, the accuracy of the softmax output drops by almost (see Figure 9), to the point that it would need to reduce its coverage by approximately to recover the accuracy of the credibility-based classifier with full coverage (Figure 7).
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 7.
Performance of softmax-based filtered classifiers.
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 8.
Performance of credibility-based filtered classifier
(Bar plot: accuracy of softmax- and credibility-based classifiers, for the original data and PGD attacks of three strengths.)
Figure 9.
Accuracy of softmax- and credibility-based classifiers
11. Proof of counterfactual results
In this section, we prove different versions of the counterfactual result in Section 5.2. We begin with a simple lemma showing the existence of a local credibility profile satisfying (5).
Lemma 1.
There exists a local credibility profile c† that satisfies (5).

Proof. Let (x̄, c̄) be a local minimizer pair of the optimization problem

minimize_{x ∈ ℝ^p, c ∈ ℝ^|K|}  ‖x − x_o‖² + ‖c‖²_{W⁻¹}
subject to  ℓ(φ_o(x), k) ≤ −c_k,  k ∈ K.    (P-A)

If no credibility profile c satisfies (5), then arbitrarily close to c̄ there exists (x′, c′) feasible for (PI) such that

‖x̄ − x_o‖² + ‖c̄‖²_{W⁻¹} = r†(c̄) + ‖c̄‖²_{W⁻¹} > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹}.    (14)

However, observe that the feasibility set of (PI) is contained in the feasibility set of (P-A), so that (x′, c′) is also (P-A)-feasible. This leads to a contradiction, since (14) violates the fact that (x̄, c̄) is a local minimizer of (P-A).

Thus, the existence of a local credibility profile is not an issue. Let us then complete the proof of Theorem 2.
Proof of Theorem 2.
Under Assumption 1 and for c ∈ C satisfying the hypothesis in (9), the x†(c) that achieves the local minimum r†(c) satisfies the KKT conditions (Theorem 1) with λ_k = −w_k⁻¹ c_k. Explicitly,

(x†(c) − x_o) − Σ_{k=1}^{|K|} (c_k/w_k) ∇_x ℓ_k(φ_o(x†(c))) = 0  and  (c_k/w_k) [ℓ_k(φ_o(x†(c))) + c_k] = 0, for all k ∈ K.    (15)

Taking (x′, c′) to be any feasible pair for (PI), i.e., ℓ_k(φ_o(x′)) ≤ −c′_k for all k ∈ K, with x′ ∈ N as in Assumption 2, we can combine (15) to get

r†(c) = ‖x†(c) − x_o‖² − Σ_{k∈K} (c_k/w_k) [ℓ_k(φ_o(x†(c))) + c_k] + 2 (x′ − x†(c))ᵀ [ (x†(c) − x_o) − Σ_{k∈K} (c_k/w_k) ∇_x ℓ_k(φ_o(x†(c))) ],

which can then be rearranged to read

r†(c) = ‖x†(c) − x_o‖² + 2 (x′ − x†(c))ᵀ (x†(c) − x_o) − Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x†(c))) + (x′ − x†(c))ᵀ ∇_x ℓ_k(φ_o(x†(c))) + c_k ].    (16)

To proceed, use the convexity of ‖·‖² to bound the first two terms in (16) and obtain

r†(c) ≤ ‖x′ − x_o‖² − Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x†(c))) + (x′ − x†(c))ᵀ ∇_x ℓ_k(φ_o(x†(c))) + c_k ].    (17)
Then, since x′ ∈ N from Assumption 2, we can use (6) on the bracketed quantity in (17) to write the inequality

r†(c) − ‖x′ − x_o‖² ≤ −Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x′)) + [ℓ_k(φ_o(x′)) − ℓ_k(φ_o(x†(c)))]² / ℓ_k(φ_o(x†(c))) + c_k ].    (18)

Expanding (18) yields

r†(c) − ‖x′ − x_o‖² ≤ −Σ_{k∈K} c_k ℓ_k(φ_o(x′))² / (w_k ℓ_k(φ_o(x†(c)))) − Σ_{k∈K} c_k ℓ_k(φ_o(x†(c))) / w_k − ‖c‖²_{W⁻¹},

where we used the fact that ‖c‖²_{W⁻¹} = Σ_k w_k⁻¹ c_k². Since ℓ_k(φ_o(x†(c))) ≤ −c_k and ℓ_k(φ_o(x′)) ≤ −c′_k, we get

r†(c) − ‖x′ − x_o‖² ≤ ‖c′‖²_{W⁻¹} − ‖c‖²_{W⁻¹}.

Hence, if −W⁻¹c is a dual variable of (PI), then c satisfies (5), i.e., c = c†.

In the particular case in which (PI) is a convex program, we can show that (9) is both sufficient and necessary:
Assume that ℓ_k is convex and non-decreasing for all k ∈ K and that φ_o is a convex function of x (e.g., linear). Let r†(c) be a local minimum of (PI) for c ∈ C, achieved by x†(c) and associated with the dual variables λ†(c). Then, under Assumption 1, the local credibility profile c† from (5) is a global profile satisfying (2) and

c = −W λ⋆(c)  ⇔  c satisfies (2), i.e., c = c⋆.    (19)

Proof.
The first part of the theorem is immediate: since φ_o is convex and ℓ_k is a non-decreasing convex function, (PI) is a strongly convex problem. Hence, any local minimizer x† is a global minimizer x⋆ (Boyd & Vandenberghe, 2004).

We proceed by proving necessity (⇒). Since (PI) is a strongly convex, strictly feasible problem [c ∈ C in (7)], it is strongly dual and its dual variable λ⋆(c) is a subgradient of its perturbation function r⋆ (Boyd & Vandenberghe, 2004, Section 5.6.2). We therefore obtain

r⋆(c′) ≥ r⋆(c) + λ⋆(c)ᵀ (c′ − c)  and  ‖c′‖²_{W⁻¹} ≥ ‖c‖²_{W⁻¹} + 2 cᵀ W⁻¹ (c′ − c),    (20)

which, when summed, read

r⋆(c) − r⋆(c′) ≤ ‖c′‖²_{W⁻¹} − ‖c‖²_{W⁻¹} + [λ⋆(c) + 2 W⁻¹ c]ᵀ (c − c′).    (21)

Using the hypothesis in (19) yields λ⋆(c) + 2 W⁻¹ c = 0, and (21) reduces to (2), which implies c = c⋆.

To prove the sufficiency part (⇐) in (19), notice from (P-A) that any c⋆ that satisfies (2) is the minimizer of the strongly convex function q(c) = r⋆(c) + ‖c‖²_{W⁻¹}. As such, it must be that ∇q(c⋆) = 0 ⇔ ∇r⋆(c⋆) + 2 W⁻¹ c⋆ = 0. Once again leveraging the fact that (PI) is strongly dual, it holds that λ⋆(c) = ∇r⋆(c) for all c (Boyd & Vandenberghe, 2004, Section 5.6.3), thus concluding the proof.
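As a sanity check on this counterfactual relation, consider a one-dimensional convex instance with a single linear "loss" ℓ(x) = a·x + b, for which both the joint compromise and the fixed-credibility problem solve in closed form. The instance and normalization below are our own: with the standard KKT convention (the squared objective contributing a factor 2 to the gradient), the dual variable of the fixed-credibility problem recovers the optimal credibility up to the multiplier scaling fixed by Theorem 1, here c⋆ = −(w/2) λ⋆(c⋆).

```python
import numpy as np

# One-dimensional compromise: minimize (x - x_o)^2 + c^2 / w
# subject to a*x + b <= -c, with a single loss ell(x) = a*x + b.
a, b, x_o, w = 1.0, 0.0, 2.0, 3.0

# Joint problem: the constraint is active, so c = -(a*x + b); minimizing
# (x - x_o)^2 + (a*x + b)^2 / w over x gives the closed forms below.
x_star = (w * x_o - a * b) / (w + a ** 2)
c_star = -w * (a * x_o + b) / (w + a ** 2)

# Fixed-credibility problem at c = c_star: minimize (x - x_o)^2 subject to
# a*x + b + c <= 0; the active constraint gives x† and, from stationarity
# 2*(x† - x_o) + lam*a = 0, the dual variable lam.
x_dag = -(b + c_star) / a
lam = 2.0 * (x_o - x_dag) / a
```

The fixed-credibility minimizer coincides with the joint one (x† = x⋆), and the dual variable is proportional to −c⋆, so the compromise never has to be re-solved for multiple credibility values.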
12. Proof of Proposition 1
Recall the Bayesian formulation from Section 6:
Pr(x | c) = N(x | x_o, (2t)⁻¹ I) × ∏_{c_k ≠ 0} SE[ ℓ(φ_o(x), k) | c_k, c_k t / w_k ],
Pr(c) = N(c | 0, (2t)⁻¹ W).

We then obtain the joint distribution
Pr(x, c) as

Pr(x, c) ∝ exp(−t ‖x − x_o‖²) × ∏_{c_k ≠ 0} (c_k t / w_k) exp( −(c_k t / w_k) [ℓ_k(φ_o(x)) + c_k] ) · exp(−t ‖c‖²_{W⁻¹})
= exp( −t ‖x − x_o‖² − t Σ_{c_k ≠ 0} (c_k/w_k) [ℓ_k(φ_o(x)) + c_k] − t ‖c‖²_{W⁻¹} + Σ_{c_k ≠ 0} log(c_k t / w_k) ).    (22)

Notice that, since x†(c†) achieves the local minimum r†(c†) that satisfies Def. 2, the pair (x†(c†), c†) is a local minimizer of (P-A). Indeed, (5) can be rearranged as

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} ≤ ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹},

for x′ in a neighborhood of x†(c†) and (x′, c′) (PI)-feasible. Hence, the pair satisfies the KKT conditions (Theorem 1) for (P-A). Indeed, note that (P-A) always has a strictly feasible pair (x̂, ĉ) obtained by taking [ĉ]_k = ℓ_k(φ_o(x̂)). Explicitly, there exists λ such that

(x† − x_o) + Σ_{k=1}^{|K|} λ_k ∇_x ℓ_k(φ_o(x†(c†))) = 0,    (23)
W⁻¹ c† + λ = 0,    (24)
λ_k [ℓ_k(φ_o(x†(c†))) + c_k] = 0, for all k ∈ K.    (25)

Observe that (23) and (24) arise by applying (8a), taking derivatives with respect to x and c respectively, and (25) is the complementary slackness condition (8b). Using (24) and (25), we additionally conclude that

Σ_{k∈K} (c_k/w_k) [ℓ_k(φ_o(x†(c†))) + c_k] = 0.    (26)
Using (26), the joint probability distribution (22) evaluated at (x†(c†), c†) reduces to

Pr(x†(c†), c†) ∝ exp( −t ‖x†(c†) − x_o‖² − t ‖c†‖²_{W⁻¹} + Σ_{c_k ≠ 0} log(c_k t / w_k) ).    (27)

Suppose now that there exists another point (x′, c′) arbitrarily close to (x†(c†), c†) such that Pr(x†(c†), c†) < Pr(x′, c′) for all t. This would imply that

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} − t⁻¹ Σ_{c_k ≠ 0} log(c_k t / w_k) > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹} − t⁻¹ Σ_{c′_k ≠ 0} log(c′_k t / w_k)

for all t. Since the log sums grow only as o(t), the last term on each side vanishes as t → ∞, and we eventually get

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹},

which violates (5). Hence, (x†(c†), c†) becomes a local maximum of the joint distribution (22) as t → ∞.