Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning
Luiz F. O. Chamon, Santiago Paternain, Alejandro Ribeiro

Abstract
Prediction credibility measures, in the form of confidence intervals or probability distributions, are fundamental in statistics and machine learning to characterize model robustness, detect out-of-distribution samples (outliers), and protect against adversarial attacks. To be effective, these measures should (i) account for the wide variety of models used in practice, (ii) be computable for trained models or at least avoid modifying established training procedures, (iii) forgo the use of data, which can expose them to the same robustness issues and attacks as the underlying model, and (iv) be followed by theoretical guarantees. These principles underlie the framework developed in this work, which expresses credibility as a risk-fit trade-off, i.e., a compromise between how much fit can be improved by perturbing the model input and the magnitude of this perturbation (risk). Using a constrained optimization formulation and duality theory, we analyze this compromise and show that this balance can be determined counterfactually, without having to test multiple perturbations. This results in an unsupervised, a posteriori method of assigning prediction credibility for any (possibly non-convex) differentiable model, from RKHS-based solutions to any architecture of (feedforward, convolutional, graph) neural network. Its use is illustrated in data filtering and defense against adversarial attacks.
1. Introduction
Assigning credibility to predictions is a fundamental problem in statistics and machine learning (ML) with both practical and societal impact, finding applications in model robustness, outlier detection, and defense against adversarial attacks.

Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Luiz F. O. Chamon
2. Related work
Credibility is a central concept in statistics, arising in the form of confidence intervals, probability distributions, or sensitivity analysis. Typically, these metrics are obtained for specific models, such as linear or generalized linear models, and become more intricate to compute as the model complexity increases. Confidence intervals, for instance, can be obtained in closed form only for simple models for which the asymptotic distribution of error statistics is available (Smithson, 2002). Otherwise, bootstrapping techniques must be used. Though more widely applicable, they require several models to be trained over different data sets, which can be prohibitive in certain large-scale applications (Efron & Tibshirani, 1994; Dietterich, 2000; Li et al., 2018). Sensitivity analysis is an alternative that does away with randomness by focusing instead on how much sample point or input values influence fit and/or model coefficients. The leverage score and the linear regression coefficients are good examples of this approach (Hawkins, 1980; Hastie et al., 2001). Bayesian models directly embed uncertainty measures by learning probability distributions instead of point estimates. Though Bayesian inference has been deployed in a wide variety of model classes, learning these models remains challenging, especially in the complex, large-scale settings common in deep learning and CNNs (Friedman et al., 2001; Blundell et al., 2015; Gal & Ghahramani, 2015; Shridhar et al., 2019; Heek & Kalchbrenner, 2019). Approximate measures based on specific probabilistic models, e.g., Monte Carlo dropout (Gal & Ghahramani, 2016), have been proposed to address these issues. However, they do not approximate the posterior distribution of the model given the training data that is needed to assess uncertainty in the Bayesian framework.
Other empirically motivated credibility scores, e.g., based on k-nearest neighbors embeddings in the learned representation space, require retraining with modified cost functions to be effective (Mandelbaum & Weinshall, 2017). Credibility measures can be used to solve a variety of statistical problems. For instance, they are effective in assessing the robustness and performance of models or detecting out-of-distribution samples or outliers, both during training and deployment (Chandola et al., 2009; Hodge & Austin, 2004; Hawkins, 1980; Aggarwal, 2015). Selective classifiers are often used to this end (Flores, 1958; Cortes et al., 2016; El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017). Realizing that in critical applications it is often better to admit one's limitations than to provide uncertain answers, filtered classifiers are given the additional option of not classifying an input. These inputs can be flagged for human inspection. Their performance is quantified in terms of coverage (how many samples the model chooses to classify) and filtered accuracy (accuracy over the classified samples), two quantities typically at odds with each other (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017). Measures of credibility are also often used in the related context of adversarial noise (Dhillon et al., 2018; Wong & Kolter, 2018; Sheikholeslami et al., 2019).

Finally, this work leverages duality to derive both theoretical results and algorithms. It is worth noting that the dual variables of convex optimization programs have a well-known sensitivity interpretation. Indeed, the dual variable corresponding to a constraint determines the variation of the objective if that constraint were to be tightened or relaxed (Boyd & Vandenberghe, 2004; Bonnans & Shapiro, 2000). Since this work also considers non-convex models for which these results do not hold, it develops a novel local sensitivity theory of compromise (Section 5).
3. Problem formulation
This work considers the problem of assigning credibility to the predictions of a classifier. Formally, given a model $\phi_o: \mathbb{R}^p \to \mathbb{R}^m$, a sample $x_o \in \mathbb{R}^p$, and a set $\mathcal{K}$ of possible classes, we wish to assign a number $c_k \in \mathbb{R}$ to each class $k \in \mathcal{K}$ representing our credence that $x_o$ belongs to $k$. By credence, we mean that if $c_{k'} \leq c_k$, then we would rather bet on the proposition "$x$ belongs to $k$" than "$x$ belongs to $k'$". For convenience, we fix an order of $\mathcal{K}$ and often refer to the vector $c \in \mathbb{R}^{|\mathcal{K}|}$ collecting the $c_k$ as the credibility profile. We use $|\mathcal{A}|$ to denote the cardinality of a set $\mathcal{A}$. This problem is therefore equivalent to ranking the classes that the input $x_o$ could belong to from most to least credible. Though certain credibility measures impose additional structure, as is the case of probabilities, they are minimally required to induce a weak ordering of the set $\mathcal{K}$ (Roberts & Tesman, 2009).

The output of any classifier induces a measure of credibility through the training loss function. Let $\ell_k: \mathbb{R}^m \to \mathbb{R}_+$ denote a loss function with respect to class $k \in \mathcal{K}$, such as the logistic negative log-likelihood, the cross-entropy loss, or the hinge loss (Hastie et al., 2001). Then, the loss of the model output with respect to the class $k \in \mathcal{K}$ induces a measure of incredibility. Conversely, we can obtain the credibility measure

$$c_k = -\ell_k\big(\phi_o(x_o)\big), \quad k \in \mathcal{K}. \tag{1}$$

In this sense, classifying a sample as $\operatorname{argmax}_{k \in \mathcal{K}} c_k$ (with ties broken arbitrarily if the set is not a singleton) is the same as assigning it to the most credible class. Note that since losses are non-negative, credibility is a non-positive number that increases as the loss decreases, i.e., as the goodness-of-fit improves.

Nevertheless, (1), and the model output $\phi_o(x_o)$ for that matter, are generally unreliable credibility metrics.
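To fix ideas, (1) can be evaluated directly from a model's output logits. A minimal sketch with the cross-entropy loss, where the helper name and toy logits are our own illustration:

```python
import numpy as np

def credences(logits):
    # c_k = -ell_k(phi_o(x_o)) with the cross-entropy loss
    # ell_k(y) = -log softmax(y)_k, so c_k = log softmax(y)_k <= 0
    z = logits - logits.max()           # numerically stable softmax
    return z - np.log(np.exp(z).sum())  # log-softmax = credences

c = credences(np.array([2.0, 1.0, -1.0]))  # toy model output phi_o(x_o)
top = int(np.argmax(c))                    # most credible class
```

As noted above, all entries of `c` are non-positive and ranking by credence reproduces the classifier's decision.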
Indeed, since models are optimized by focusing only on the most credible class, they do not incorporate ranking information on the other ones. In fact, classifier outputs often resemble singletons (see, e.g., Figs. 2a and 2c). The resulting miscalibration can skew the credences of all but the top class (Guo et al., 2017; Platt, 1999). Though there are exceptions, most notably Bayesian models, their use is not as widespread in many application domains due to their training complexity (Blundell et al., 2015; Shridhar et al., 2019; Gal & Ghahramani, 2015; Heek & Kalchbrenner, 2019). Fit and model outputs are also susceptible to robustness issues due to overfitting, poor or lacking training data, and/or the use of inadequate parametrizations (e.g., choice of NN architecture or kernel bandwidth). What is more, the input $x_o$ can be an outlier or have been contaminated by noise, adversarial or not. Complex models such as (C)NNs have proven to be particularly vulnerable to several of these issues (Szegedy et al., 2014; Goodfellow et al., 2014; Dhillon et al., 2018; Wong & Kolter, 2018).

Thus, the model output can only provide a sound credibility measure if both the model and the input are reliable. In fact, the issues associated with (1) all but vanish if the input $x_o$ is trustworthy and the model output $\phi_o(x_o)$ is robust to perturbations. Based on this observation, the next section puts forward a fit-based credibility measure that incorporates robustness to perturbations.

Remark 1.
It is worth contrasting the problem of assigning credibility to model predictions described in this section with its classical learning version, in which credibility is assigned based on a data set $\mathcal{D}$ of labeled pairs $(x_n, y_n) \in \mathbb{R}^p \times \mathcal{K}$. The current problem is in fact at least as hard as the learning one, both computationally and statistically, since the learning problem can be reduced to it. Additionally, the learning setting is considerably more restrictive in practice. Indeed, it may not be possible, due to privacy or security concerns, to access the raw training data from which a model was obtained. Even if it were, any solution that requires the complete retraining of every deployed model is infeasible in many applications, especially if complex, modified training procedures, such as those used to train Bayesian NNs, must be deployed. This is particularly critical for large-scale, non-convex models.
4. Credibility as a fit-risk trade-off
In Section 3, we argued that a loss function induces a credibility measure through (1), but that it is reliable only if the classifier output $\phi_o(x_o)$ is insensitive to perturbations. In what follows, we show how we can filter the input $x_o$ in order to produce such reliable credibility profiles.

We do not use filter here to mean removing noise or artifacts (though this may be a byproduct of the procedure), but obtaining a modified input $x^\star$ such that $\phi_o(x^\star)$ is robust to perturbations. However, since our goal is to ultimately use $\phi_o(x^\star)$ to produce credibility profiles, the direction and magnitude of these perturbations are not dictated by a fixed disturbance model as in robust statistics and adversarial noise applications. Instead, we obtain task- and sample-specific settings using the underlying data properties captured by the model $\phi_o$ during training.

Indeed, we consider perturbations that improve fit (increase credibility), not with respect to a specific class, but with respect to all classes simultaneously. We do so because we do not seek a single most credible class for the input, but the best credibility profile with respect to all classes. Any perturbation of $x_o$ that worsens the overall fit leads to an altogether less confident classification and thus to a poor measure of credibility. At the same time, $x^\star$ must be anchored at $x_o$: the further we are from the original input, the more we have modified the original classification problem. By allowing arbitrarily large perturbations, the original image of a dog could be entirely replaced by that of a cat, at which point we are no longer assigning credibility to predictions of dog pictures.
The magnitude of the perturbation thus relates to the risk of not solving the desired problem. In summary, the "filtered" $x^\star$ must balance two conflicting objectives: improving fit while remaining close to $x_o$. Indeed, while smaller perturbations of $x_o$ are preferable to reduce risk, larger perturbations are allowed if the fit payoff is of comparable magnitude. We formalize this concept in the following definition, where $W$ is a diagonal, positive definite matrix and $\|z\|_W^2 = z^T W z$ denotes the $W$-weighted Euclidean norm.

Definition 1 (Credibility profile). Given a model $\phi_o$ and an input $x_o$, the credibility profile of the predictions $\phi_o(x_o)$ is a vector $c^\star$ that satisfies

$$\|x^\star - x_o\|^2 - \|x' - x_o\|^2 \leq \|c'\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2, \tag{2}$$

where the credibility profiles are obtained from (1) as $c^\star = \big[-\ell_k\big(\phi_o(x^\star)\big)\big]_{k \in \mathcal{K}}$ and $c' = \big[-\ell_k\big(\phi_o(x')\big)\big]_{k \in \mathcal{K}}$.

In other words, the credibility profile $c^\star$ for the predictions of the model $\phi_o$ with respect to the input $x_o$ is obtained using the modified input $x^\star$ that has better fit than any other input closer to $x_o$ and is closer to $x_o$ than any input with better overall fit. In particular, note that for $x' = x_o$, (2) becomes

$$\|x^\star - x_o\|^2 \leq \|c_o\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2, \tag{3}$$

where $c_o = \big[-\ell_k\big(\phi_o(x_o)\big)\big]_{k \in \mathcal{K}}$ are the credences induced by the original input $x_o$. Hence, the perturbation that generates $x^\star$ is at most as large as the overall credence gains [recall from (1) that $c_k \leq 0$].
These gains are weighted by the matrix $W$, which can be used either to control risk aversion, e.g., by taking $W = \gamma I$ and using $\gamma$ to control the magnitude of the right-hand side of (3), or to embed a priori knowledge on the value of the credibilities (see Section 6). While (3) may be easier to interpret, note that it is not well-posed since it always holds for $x^\star = x_o$. Def. 1 is therefore strictly stronger, as (2) must hold for any reference input $x'$, not only the original $x_o$.

In the next section, we analyze the properties of the compromise (2). In particular, we are interested in those properties that do not depend on the convexity of $\phi_o$, so that they are applicable to a wider class of models including, e.g., (C)NNs. We first show that the perturbation needed to achieve a credence profile can be determined by solving a constrained mathematical program (Section 5.1). Leveraging this formulation and results from duality theory, we then obtain equivalent formulations of Def. 1 (Section 5.2) that are used to show that $c^\star$ is in fact a MAP estimate (Section 6) and to provide an algorithm that simultaneously determines $x^\star$ and $c^\star$ counterfactually, i.e., without testing multiple inputs or credences (Section 7).

While this work focuses on perturbations of the input $x_o$, similar definitions and results also apply to perturbations of the model $\phi_o$ with minor modifications. We leave the details of this alternative formulation for future work.
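Under the squared-norm reading of Def. 1, rearranging (2) says that $x^\star$ globally minimizes the composite objective $\|x - x_o\|^2 + \|c\|_{W^{-1}}^2$, so the bound (3) can be checked numerically. A brute-force sketch on a toy two-class linear softmax model, where the matrix `A`, input `x_o`, and `gamma` are arbitrary choices of ours:

```python
import numpy as np

gamma = 2.0                                # W = gamma * I (risk aversion)
A = np.array([[1.0, -0.5], [-1.0, 0.8]])   # toy linear model phi_o(x) = A x
x_o = np.array([0.3, -0.2])

def credences(x):
    z = A @ x
    z = z - z.max()
    return z - np.log(np.exp(z).sum())     # c_k = log softmax_k (<= 0)

def objective(x):
    # ||x - x_o||^2 + ||c||^2_{W^{-1}} with W = gamma * I
    c = credences(x)
    return np.sum((x - x_o) ** 2) + np.sum(c ** 2) / gamma

# brute-force search for x* over a grid of perturbations around x_o
grid = np.linspace(-1.5, 1.5, 61)
candidates = [x_o + np.array([dx, dy]) for dx in grid for dy in grid]
x_star = min(candidates, key=objective)

c_o, c_star = credences(x_o), credences(x_star)
lhs = np.sum((x_star - x_o) ** 2)                       # perturbation risk
rhs = (np.sum(c_o ** 2) - np.sum(c_star ** 2)) / gamma  # credence gain
```

Since the unperturbed $x_o$ is itself a candidate, the minimizer always satisfies `lhs <= rhs`, which is exactly (3).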
5. A counterfactual theory of compromise
Def. 1 established credibility in terms of a compromise between fit and risk. It is not immediate, however, whether such a compromise exists or if it can be determined efficiently. The goal of this section is to address these questions by developing a (local) counterfactual theory of compromise. By counterfactual, we mean that the main result of this section (Theorem 2) characterizes properties of (PI) that would turn arbitrary credences into credibilities that satisfy (2). In Section 7, we leverage this result to put forward an algorithm that directly finds local solutions of (PI) satisfying (2) without repeatedly testing profiles $c$.

The first step in our derivations is to express the trade-off (2) solely in terms of credences. To do so, we formalize the relation between a credence profile $c$ and the magnitude $r^\star(c)$ of the smallest perturbation of $x_o$ required to find an input that achieves it. We describe this relation by means of the constrained optimization problem

$$r^\star(c) = \min_{x \in \mathbb{R}^p} \|x - x_o\|^2 \quad \text{subject to} \quad \ell_k\big(\phi_o(x)\big) \leq -c_k, \; k \in \mathcal{K}. \tag{PI}$$

Define $r^\star(c) = +\infty$ if the program is infeasible, i.e., if there exists no $x$ such that $\ell_k\big(\phi_o(x)\big) \leq -c_k$ for all $k \in \mathcal{K}$. Problem (PI) seeks the input $x$ closest to $x_o$ whose fit matches the credences $c$. Its optimal value therefore describes the risk (perturbation magnitude) of any given credence profile. Note that due to the conflicting natures of fit and risk, the inequality constraints in (PI) typically hold with equality at the compromise $c^\star$. Immediately, we can write (2) as

$$r^\star(c^\star) - r^\star(c') \leq \|c'\|_{W^{-1}}^2 - \|c^\star\|_{W^{-1}}^2. \tag{4}$$

The compromise in Def.
1 is explicit in (4): the prediction credibilities of a model/input pair are given, not by the credences induced by the model output in (1), but by those that increase confidence in each class at least as much as they increase risk, as measured here by the perturbation magnitude.

The main issue with (4) is that evaluating $r^\star$ involves solving the optimization problem (PI), which may not be computationally tractable. This is not an issue when (PI) is a convex problem (e.g., if $\phi_o$ is convex and $\ell_k$ is convex and non-decreasing). Yet, typical ML models are non-convex functions of their input, e.g., (C)NNs or RKHS-based methods. To account for cases in which finding the minimum of (PI) is hard, we consider a local version of Def. 1 induced by (4):

Definition 2 (Local credibility profile). Let $x^\dagger(c)$ be a local minimizer of (PI) with credences $c$ and consider its value $r^\dagger(c) = \|x^\dagger(c) - x_o\|^2$. A local credibility profile of the predictions $\phi_o(x_o)$ is a vector $c^\dagger$ that satisfies

$$r^\dagger(c^\dagger) - \|x' - x_o\|^2 \leq \|c'\|_{W^{-1}}^2 - \|c^\dagger\|_{W^{-1}}^2, \tag{5}$$

for all $x'$ in a neighborhood of $x^\dagger(c^\dagger)$ that are feasible for (PI) with credences $c'$.

While the following results are derived for the local compromise in (5), they also hold for Def. 1 by replacing $\dagger$ with $\star$. The following assumptions are used in the sequel:
Assumption 1.
The loss functions $\ell_k$, $k \in \mathcal{K}$, and the model $\phi_o$ are differentiable.

Assumption 2.
Let $x^\dagger$ be a local minimizer of (PI). There exists a neighborhood $\mathcal{N}$ of $x^\dagger$ such that

$$\ell_k\big(\phi_o(x)\big) \geq \ell_k\big(\phi_o(x^\dagger)\big) + (x - x^\dagger)^T \nabla \ell_k\big(\phi_o(x^\dagger)\big) - \big[\ell_k\big(\phi_o(x)\big) - \ell_k\big(\phi_o(x^\dagger)\big)\big]^2 \, \ell_k\big(\phi_o(x^\dagger)\big), \tag{6}$$

for all $x \in \mathcal{N}$ and $k \in \mathcal{K}$.

Assumption 2 restricts the composite function $\ell_k\big(\phi_o(\cdot)\big)$ to be non-convex only up to a quadratic in a neighborhood of a local minimum. Additionally, we will only consider credibility profiles for which there exist strictly feasible solutions of (PI), i.e., $c \in \mathcal{C}$ for

$$\mathcal{C} = \big\{ c \in \mathbb{R}^{|\mathcal{K}|} \;\big|\; \exists\, x \in \mathbb{R}^p \text{ such that } \ell_k\big(\phi_o(x)\big) < -c_k \text{ for all } k \in \mathcal{K} \big\}. \tag{7}$$

Note that $\mathcal{C}$ is arbitrarily close to the set of all achievable credences.

To proceed, let us recall the following properties of minimizers:

Theorem 1 (KKT conditions, (Boyd & Vandenberghe, 2004, Section 5.5.3)). Let $x^\dagger$ be a local minimizer of (PI) for the credences $c \in \mathcal{C}$. Under Assumption 1, there exist $\lambda^\dagger \in \mathbb{R}_+^{|\mathcal{K}|}$, known as dual variables, such that

$$2(x^\dagger - x_o) + \sum_{k \in \mathcal{K}} \lambda_k^\dagger \nabla_x \ell_k\big(\phi_o(x^\dagger)\big) = 0, \tag{8a}$$

$$\lambda_k^\dagger \big[\ell_k\big(\phi_o(x^\dagger)\big) + c_k\big] = 0, \quad k \in \mathcal{K}. \tag{8b}$$

If $\ell_k\big(\phi_o(\cdot)\big)$ is convex, (8) are both necessary and sufficient for global optimality. Additionally, the function $r^\dagger$ (or in this case, $r^\star$) is then differentiable and its derivative with respect to $c$ is given by the dual variables. In fact, $\lambda^\dagger(c)$ then quantifies the change in the risk $r^\dagger(c)$ if the credences increase or decrease by a small amount. This sensitivity interpretation is a classical result from the convex optimization literature (Boyd & Vandenberghe, 2004, Section 5.6.2).

In general, however, $\ell_k\big(\phi_o(\cdot)\big)$ is non-convex.
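The conditions (8) can be verified on a toy convex instance of (PI) with a single class, where a linear function stands in for $\ell_k\big(\phi_o(\cdot)\big)$; all numbers below are hypothetical choices of ours, and the closed-form solution is the projection onto the feasible halfspace:

```python
import numpy as np

# toy convex instance of (PI): min ||x - x_o||^2  s.t.  a.x + b <= -c
# (a linear stand-in for ell(phi_o(x)) with a single class)
a, b, c = np.array([1.0, 2.0]), 0.5, -0.2
x_o = np.array([1.0, 1.0])

slack = a @ x_o + b + c                 # constraint violation at x_o
lam = 2 * max(slack, 0) / (a @ a)       # dual variable (projection formula)
x_dag = x_o - (lam / 2) * a             # projection onto the halfspace

stationarity = 2 * (x_dag - x_o) + lam * a      # left-hand side of (8a)
comp_slack = lam * (a @ x_dag + b + c)          # left-hand side of (8b)
```

Both residuals vanish: the constraint is active at the solution, so complementary slackness (8b) holds with $\lambda^\dagger > 0$, and the dual variable exactly cancels the objective gradient in (8a).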
Still, Theorem 1 allows us to obtain a fixed-point condition that turns credences into credibilities.

Theorem 2.
Take $c \in \mathcal{C}$ and let $x^\dagger(c)$ be a local minimizer of (PI) with value $r^\dagger(c)$ associated with the dual variables $\lambda^\dagger(c)$. Under Assumptions 1 and 2, the local credibility profile $c^\dagger$ from Def. 2 exists and satisfies

$$c = -W \lambda^\dagger(c) \;\Rightarrow\; c \text{ satisfies (5), i.e., } c = c^\dagger. \tag{9}$$

Proof.
See Appendix 11.

Theorem 2 shows that Theorem 1 can be used to certify a local credibility measure $c^\dagger$ without repeatedly solving (PI). Explicitly, if (8) hold with $\lambda^\dagger = -W^{-1} c$, then $\big(x^\dagger(c), c\big)$ satisfies (5). In other words, the dual variables associated with solutions of (PI) provide counterfactuals of the form "if the credence $c_k$ assigned to class $k$ had been $-w_k \lambda_k^\dagger$ for all $k \in \mathcal{K}$, then $c$ would have been a local credibility profile." Considering the sensitivity interpretation of $\lambda^\dagger$ in the convex case, the compromise (5) also causes the credence $c_k$ for classes in which $x_o$ is harder to fit to decrease (become more negative) in order to manage risk.

In the sequel, we leverage the results from Theorems 1 and 2 to first show that credibility as defined in Def. 2 has a Bayesian interpretation as the MAP of a specific failure model (Section 6) and then put forward an algorithm to efficiently compute the credibility profile $c^\dagger$ without repeatedly solving (PI) for different credence profiles.
6. A Bayesian formulation of credibility
An interesting consequence of Theorem 2 is that the fit-risk trade-off credibility definitions in Def. 1 and 2 can also be treated as a Bayesian inference problem. This problem formalizes the intuition that formulating (2) [and more generally (5)] in terms of Euclidean norms is equivalent to modeling the uncertainties related to the input and credibility profiles as Gaussian random variables (RVs). Consider the likelihood
$$\Pr(x \mid c) = \mathcal{N}\big(x \mid x_o, (2t)^{-1} I\big) \times \prod_{c_k \neq 0} \mathcal{SE}\Big[\ell_k\big(\phi_o(x)\big) \;\Big|\; -c_k,\; -\frac{t}{c_k w_k}\Big], \tag{10a}$$

defined for a parameter $t > 0$, where $\mathcal{N}(z \mid \bar{z}, \Sigma)$ represents the density of a normal random vector $z$ with mean $\bar{z}$ and covariance $\Sigma$, and $\mathcal{SE}(z \mid \bar{z}, \eta)$ the density of an exponentially distributed RV $z$ with rate $\eta$ shifted to $\bar{z}$. This likelihood represents our belief that, though $x_o$ is representative of the model input, it may be corrupted. This uncertainty is described by the first term of (10a).

The other terms account for the failure of $x$ to meet the credences $c$ in the constraints of (PI). If an input $x$ violates any of the credences, i.e., $\ell_k\big(\phi_o(x)\big) > -c_k$, then its likelihood is penalized independently of how close it is to the mean $x_o$. This penalty follows a constant failure rate that depends on the credence $c_k$ itself. Hence, the probability of the violation increasing by $\epsilon$ does not depend on how much the constraint has already been violated. Explicitly,

$$\Pr\big[\ell_k\big(\phi_o(x)\big) > -c_k + z + \epsilon \;\big|\; \ell_k\big(\phi_o(x)\big) > -c_k + z\big] = \Pr\big[\ell_k\big(\phi_o(x)\big) > -c_k + \epsilon\big].$$

To obtain the joint distribution of $(x, c)$, we use a normal prior on the credence values, namely

$$\Pr(c) = \mathcal{N}\big(c \mid 0, (2t)^{-1} W\big). \tag{10b}$$

Observe from (10b) that $W$ in (2) [and (5)] can be interpreted both as a weighting matrix that modifies the geometry of the credence space and as a measure of uncertainty over the prediction credibilities. It can therefore be used to incorporate prior information, e.g., on the relative frequency of classes. The following proposition characterizes a local maximum of the joint distribution $\Pr(x, c)$:

Proposition 1.
Let the pair $(x^\dagger, c^\dagger)$ satisfy the local credibility compromise (5). Then, there exists $T > 0$ such that it is a local maximum of the joint distribution $\Pr(x, c) = \Pr(x \mid c) \Pr(c)$ defined by (10) for $t \geq T$.

Proof. See Appendix 12.

Proposition 1 shows that the local credibility profiles in (5) are asymptotic local maxima of the probabilistic model (10) as the variance of its components decreases, i.e., as $t$ increases and the joint probability distribution becomes more concentrated. In fact, every critical point of $\Pr(x, c)$ satisfies the KKT conditions of Theorem 1 and the counterfactual condition of Theorem 2. Recall that if $\phi_o$ is convex, then the unique mode $(x^\star, c^\star)$ of this joint distribution provides a credibility profile according to Def. 1. Hence, though motivated as deterministic fit-risk trade-offs, the credibility metrics proposed in this work can also be viewed as MAP estimates.

In the next section, we conclude our analysis of the credibility profile in Def. 2 by showing how it can be computed efficiently from (PI) at no additional complexity cost. We

Algorithm 1
Counterfactual optimization algorithm
Let $x^{(0)} = x_o$ and $\lambda^{(0)} = 0$.
for $t = 0, 1, 2, \ldots$ do
  $g_x^{(t)} = 2\big(x^{(t)} - x_o\big) + \sum_{k \in \mathcal{K}} \lambda_k^{(t)} \nabla_x \ell_k\big(\phi_o(x^{(t)})\big)$
  $x^{(t+1)} = x^{(t)} - \eta_x g_x^{(t)}$
  $\lambda_k^{(t+1)} = \Big[\lambda_k^{(t)} + \eta_\lambda \Big(\ell_k\big(\phi_o(x^{(t)})\big) - w_k \lambda_k^{(t)}\Big)\Big]_+$
end for

do so by modifying the Arrow-Hurwicz algorithm (Arrow et al., 1958) using Theorem 2.
7. A modified Arrow-Hurwicz algorithm
Theorems 1 and 2 suggest a way to exploit the information in the dual variables to solve (PI) directly for the (local) credibility $c^\dagger$ without testing multiple credence profiles. Indeed, start by considering that the credences $c$ are fixed and define the Lagrangian associated with (PI) as

$$L(x, \lambda, c) = \|x - x_o\|^2 + \sum_{k \in \mathcal{K}} \lambda_k \big[\ell_k\big(\phi_o(x)\big) + c_k\big]. \tag{11}$$

Observe that the KKT necessary conditions (8) for $x^\dagger$ to be a local minimizer can be written in terms of (11) as $\nabla_x L(x^\dagger, \lambda^\dagger, c) = 0$ and $\lambda_k^\dagger \big[\nabla_\lambda L(x^\dagger, \lambda^\dagger, c)\big]_k = 0$ for $k \in \mathcal{K}$, where $[z]_k$ indicates the $k$-th element of the vector $z$. The classical Arrow-Hurwicz algorithm (Arrow et al., 1958) is a procedure inspired by these relations that seeks a KKT point by alternating between the updates

$$x^+ = x - \eta_x \nabla_x L(x, \lambda, c) = x - \eta_x \Big[ 2(x - x_o) + \sum_{k \in \mathcal{K}} \lambda_k \nabla_x \ell_k\big(\phi_o(x)\big) \Big], \tag{12a}$$

$$\lambda_k^+ = \Big[\lambda_k + \eta_\lambda \big[\nabla_\lambda L(x, \lambda, c)\big]_k\Big]_+ = \Big[\lambda_k + \eta_\lambda \Big(\ell_k\big(\phi_o(x)\big) + c_k\Big)\Big]_+, \tag{12b}$$

where $\eta_x, \eta_\lambda > 0$ are step sizes and $[z]_+ = \max(z, 0)$ denotes the projection onto the non-negative orthant of $\mathbb{R}^{|\mathcal{K}|}$. To understand the intuition behind this algorithm, note that (12a) updates $x$ by descending along a weighted combination of gradients of the objective and the constraints so as to reduce the value of all functions. The weight of each constraint is given by its respective dual variable $\lambda_k$. If the $k$-th constraint is satisfied, then $\ell_k\big(\phi_o(x)\big) + c_k \leq 0$ and its influence on the update of $x$ is decreased by (12b) until it vanishes. On the other hand, if the constraint is violated, then $\ell_k\big(\phi_o(x)\big) + c_k > 0$ and the value of $\lambda_k$ increases.
Figure 1. (a) Perturbation magnitude $\|x^\dagger - x_o\|$ and (b) overall credibility $\|c^\dagger\|$ for $\gamma \in \{100, 200, 400\}$.
The relative strength of each gradient in the update (12a) is therefore related to the history of violation of each constraint.

The main drawback of (12) is that it seeks a KKT point of (PI) for a given, fixed credence profile $c$, while the credibility profile $c^\dagger$ from (5) is not known a priori. To overcome this issue, we can use the counterfactual result (9) in (12b) to obtain

$$\lambda_k^+ = \Big[\lambda_k + \eta_\lambda \Big(\ell_k\big(\phi_o(x)\big) - w_k \lambda_k\Big)\Big]_+. \tag{13}$$

The complete counterfactual optimization procedure is collected in Algorithm 1. If the dynamics of Algorithm 1 converge to $(x^\infty, \lambda^\infty)$, then $x^\infty$ is a local minimizer of (PI) for the credences $c = -W\lambda^\infty$, which form a local credibility profile $c^\dagger$ according to Theorem 2.

It is worth noting that in the general, non-convex case, Algorithm 1 need not converge. This will happen if the gradient descent procedure in (12a) is unable to find an $x$ that fits the credences imposed by $\lambda$. For rich model classes, this is less likely to happen and there is considerable empirical evidence from the adversarial examples literature that gradient methods such as (12a) do converge (Szegedy et al., 2014; Goodfellow et al., 2014; Madry et al., 2018). This is also what we observed in our numerical experiments, during which we found no instance in which Algorithm 1 diverged (Section 8). When $\ell_k$ is convex and non-decreasing and $\phi_o$ is convex, Algorithm 1 can be shown to converge to the global optimum $(x^\star, \lambda^\star, c^\star)$ of (PI) through classical arguments, as in (Cherukuri et al., 2016; Nagurney & Zhang, 2012). Details of this result are beyond the scope of this paper.
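As a sanity check, Algorithm 1 can be run on a small synthetic problem. The sketch below uses a toy two-class softmax model $\phi_o(x) = Ax$ with cross-entropy losses; the matrix `A`, input `x_o`, step sizes, and `W = diag(w)` are arbitrary choices of ours, not values from the paper's experiments:

```python
import numpy as np

A = np.array([[1.0, -0.5], [-1.0, 0.8]])   # toy model phi_o(x) = A x
x_o = np.array([0.3, -0.2])
w = np.array([2.0, 2.0])                   # diagonal of W
eta_x, eta_lam = 0.05, 0.05                # step sizes

def losses(x):                             # ell_k(phi_o(x)) = -log softmax_k
    z = A @ x
    z = z - z.max()
    return -(z - np.log(np.exp(z).sum()))

def grad_losses(x):                        # row k: grad_x ell_k(phi_o(x))
    z = A @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    return (p[None, :] - np.eye(2)) @ A    # d ell_k / dx = (p - e_k)^T A

x, lam = x_o.copy(), np.zeros(2)
for _ in range(5000):
    g_x = 2 * (x - x_o) + lam @ grad_losses(x)                    # (12a)
    x = x - eta_x * g_x
    lam = np.maximum(lam + eta_lam * (losses(x) - w * lam), 0.0)  # (13)

c_dag = -w * lam    # counterfactual credibility profile via Theorem 2
```

At a fixed point, the dual update (13) enforces $\ell_k\big(\phi_o(x)\big) = w_k \lambda_k$ whenever $\lambda_k > 0$, which is exactly the counterfactual condition $c = -W\lambda^\dagger$ certified by Theorem 2.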
8. Numerical experiments
To showcase the properties and uses of this trade-off approach to credibility, we use two CNN architectures, a ResNet18 (He et al., 2016) and a DenseNet169 (Huang et al., 2017), trained to classify images from the CIFAR-10 dataset. The models were trained in mini-batches using Adam with the default parameters from (Kingma & Ba, 2017) and weight decay. Random cropping and flipping were used to augment the data during training. Throughout this section, the loss function $\ell_k$ is the cross-entropy loss and all experiments were performed over random samples from the CIFAR-10 test set.

We start by illustrating the effect of the trade-off between fit and risk (perturbation magnitude). To do so, we compute the credibility $c^\dagger$ for $W = \gamma I$ with $\gamma \in \{100, 200, 400\}$ using the ResNet18 CNN as $\phi_o$ (Figure 1). Observe that as $\gamma$ increases, the compromise (2) becomes more risk averse. Hence, the magnitude of the perturbations decreases (Figure 1a). As this happens, the norm of the credibility profiles $\|c^\dagger\|$ increases (Figure 1b). Yet, note that the trade-off in (3) continues to hold for $\gamma^{-1}\|c^\dagger\|^2$. In the sequel, we proceed using $\gamma = 200$.

Figure 2. Credibility profiles (normalized $c_o$ and $c^\dagger$, sorted by class rank) for the ResNet18 [(a), (b)] and DenseNet169 [(c), (d)].

Given the perturbation-stability interpretation of $c^\dagger$, it can be used to analyze the robustness of different classifiers. By studying the relation between the ordering induced on the classes $\mathcal{K}$ and the values of $c_o$ and $c^\dagger$ (Fig. 2), we can evaluate how input perturbations alter the model outputs. Recall that the perturbation magnitude is adapted to both the model and the input. Hence, while perturbation levels might be different, they are the limit of the compromise in (5), so that any larger perturbation would not incur a significantly larger change in the model output. Rather than assessing robustness to a set of perturbations, this analysis evaluates the failure modes of the model, i.e., when and how models fail.

In Figure 2, we display the values of $c_o$ and $c^\dagger$ sorted
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 3.
Performance of softmax-based filtered classifiers. Shades display best- and worst-case results over restarts.

in decreasing order for the pretrained ResNet18 and DenseNet169 CNNs. To help visualization, the credibility profiles were normalized so as to lie in the unit interval. Notice that, in contrast to the DenseNet, the perturbed profiles of the ResNet (Figure 2b) are considerably different from the original ones (Figure 2a). This implies that the ResNet output can be considerably modified by perturbations that are small compared to their effect (as in Def. 2). On the other hand, note that the DenseNet retains a good amount of certainty on some samples even after perturbation (Figure 2d). While in c_o this certainty can be artificial (e.g., due to miscalibration), the fact that c† ≈ c_o means that modifying the classifier decision (whether correct or not) would require perturbations too large to be warranted. In this case, the DenseNet model displays signs of stability that the ResNet does not. It is worth pointing out that both architectures have similar numbers of parameters. We proceed with the experiments using the ResNet.

Another use of credibility is in the context of filtering. In this case, credibility is used to assess whether the classifier is confident enough to make a prediction. If not, then it should abstain from doing so. The performance of such filtered classifiers is evaluated in terms of their coverage (how many samples the model chooses to classify) and filtered accuracy (accuracy over the classified samples). Ideally, we would like a model that never abstains (full coverage) and is always accurate. Yet, these two quantities are typically conflicting: filtered accuracy can often be improved by reducing coverage. The issue is then whether this can be done smoothly.
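The coverage and filtered accuracy of an abstaining classifier can be computed directly from its score matrix. The sketch below implements one possible selection rule, a ratio test of the kind used in these experiments (retain a sample when the runner-up score is at most (1 − α) times the top score); the function name and interface are our own illustration, not the paper's code.

```python
import numpy as np

def filtered_metrics(scores, labels, alpha):
    """Coverage and filtered accuracy of a classifier that abstains unless
    the second-largest score is below (1 - alpha) times the largest."""
    top2 = np.sort(scores, axis=1)[:, -2:]          # (runner-up, top) score per sample
    keep = top2[:, 0] <= (1 - alpha) * top2[:, 1]   # confidence-margin test
    preds = scores.argmax(axis=1)
    coverage = keep.mean()                          # fraction of samples classified
    accuracy = (preds[keep] == labels[keep]).mean() if keep.any() else float("nan")
    return coverage, accuracy
```

Sweeping α then traces coverage-accuracy curves of the kind shown in Figures 3 and 4 (the scores are assumed non-negative, e.g., softmax outputs or normalized credibility profiles).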
Suppose that the model decides to classify a sample if its second largest credibility is less than the largest one by a factor of at least 1 − α, where α is chosen to achieve specific coverage levels. Figures 3 and 4 compare the results of directly using the softmax output of the
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 4.
Performance of credibility-based filtered classifiers. Shades display best- and worst-case results over restarts.
(Bar plot: accuracy of softmax- and credibility-based classifiers, for the original data and PGD attacks of three strengths.)
Figure 5.
Accuracy of softmax- and credibility-based classifiers.

model and the credibility profile c†. Notice that, when classifying every sample, the classifier based on c† has lower accuracy (see also Figure 5). However, it is able to discriminate between certain and uncertain predictions in a way that the output of the model cannot. This is straightforward given that the model output typically has a single strong component (see Figure 2a). This shows that the pretrained model is, in many cases, overconfident about its predictions, a phenomenon related to the issue of miscalibration in NNs (Guo et al., 2017).

This overconfidence becomes clear when the data are attacked using adversarial noise. To illustrate this effect, we perturb the test images using the projected gradient descent (PGD) attack for maximum perturbations ε ∈ { . , . , . } (Madry et al., 2018). It is worth noting that, for the largest ε, the perturbations are noticeable in the image. PGD was run for iterations, with step size . ( . for ε = 0 . ), and the figures show the range of results obtained over restarts.

The accuracy of the classifiers decreases considerably after the PGD attack (Figure 5). However, while the performance of the softmax classifier degrades by almost , that of the credibility-based ones drops by . When the magnitude of the perturbation reaches ε = 0 . , the output of the model was only able to correctly classify a single image in half of the restarts. Using credibility, however, accuracy remained above . It is worth noting that this result is achieved without any modification to the model, including retraining or preprocessing. What is more, filtering does not improve upon these results due to the miscalibration issue. Indeed, even for ε = 0 .
, the softmax-based filtered classifier must give up over of its coverage to recover the performance that the credibility-based filtered classifier achieves when classifying every sample. These experiments illustrate the trade-off between robustness and nominal accuracy noted in (Tsipras et al., 2019).
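For reference, the PGD attack of (Madry et al., 2018) used above repeatedly takes signed-gradient ascent steps on the loss and projects the perturbation back onto the ℓ∞ ball of radius ε. The sketch below applies it to a toy linear softmax model; the model, step size, and iteration count are illustrative stand-ins, not the experimental configuration of the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pgd_attack(W, x, y, eps, step, iters=40):
    """l_inf PGD on the cross-entropy loss of a linear softmax model."""
    x_adv = x.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(iters):
        p = softmax(W @ x_adv)
        grad = W.T @ (p - onehot)                  # gradient of -log p_y w.r.t. x
        x_adv = x_adv + step * np.sign(grad)       # signed ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project onto the eps-ball
    return x_adv
```

In practice, the attack is restarted from several random points inside the ε-ball and the worst case is reported, which is what the shaded ranges in the figures summarize.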
9. Conclusion
This work introduced a credibility measure based on a compromise between how much the model fit can be improved by perturbing its input and how much this perturbation modifies the original input (risk). By formulating this problem in the language of constrained optimization, it showed that this trade-off can be determined counterfactually, without testing multiple perturbations. Leveraging these results, it put forward a practical method to assign credibilities that (i) can be computed for any (possibly non-convex) differentiable model, from RKHS-based solutions to any (C)NN architecture, (ii) can be obtained for models that have already been trained, (iii) does not rely on any form of training data, and (iv) has formal guarantees. Future work includes devising local sensitivity results for non-convex optimization programs, analyzing the model perturbation problem, and applying the counterfactual compromise result to different learning problems, such as outlier detection. The Bayesian formulation from Section 6 also suggests an independent line of work linking constrained optimization and MAP estimates.
References
Aggarwal, C. C. Outlier analysis. In Data Mining, pp. 237–263. Springer, 2015.
Arrow, K., Hurwicz, L., and Uzawa, H. Studies in Linear and Non-Linear Programming. Stanford University Press, 1958.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Bonnans, J. and Shapiro, A. Perturbation Analysis of Optimization Problems. Springer, 2000.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
Cherukuri, A., Mallada, E., and Cortés, J. Asymptotic convergence of constrained primal–dual dynamics. Systems & Control Letters, 87:10–15, 2016.
Cortes, C., DeSalvo, G., and Mohri, M. Boosting with abstention. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, pp. 1660–1668, 2016.
Dhillon, G. S., Azizzadenesheli, K., Bernstein, J. D., Kossaifi, J., Khanna, A., Lipton, Z. C., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018.
Dietterich, T. G. Ensemble methods in machine learning. In Multiple Classifier Systems, LBCS-1857, pp. 1–15. Springer, 2000.
Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. CRC Press, 1994.
El-Yaniv, R. and Wiener, Y. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010.
Flores, I. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, EC-7(2):180–180, 1958.
Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
Gal, Y. and Ghahramani, Z. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, pp. 4878–4887, 2017.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. CoRR, 2014.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330, 2017.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, 2001.
Hawkins, D. M. Identification of Outliers, volume 11. Springer, 1980.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In , 2016.
Heek, J. and Kalchbrenner, N. Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491, 2019.
Hodge, V. and Austin, J. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
Huang, G., Liu, Z., v. d. Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In , 2017.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980v9, 2017.
Li, H., Wang, X., and Ding, S. Research and development of neural network ensembles: a survey. Artificial Intelligence Review, 49(4):455–479, 2018.
MacKay, D. J. C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Mandelbaum, A. and Weinshall, D. Distance-based confidence score for neural network classifiers. arXiv preprint arXiv:1709.09844, 2017.
Nagurney, A. and Zhang, D. Projected Dynamical Systems and Variational Inequalities with Applications. Springer, 2012.
Neal, R. M. Bayesian Learning for Neural Networks. Springer, 1996.
Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74, 1999.
Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2005.
Roberts, F. and Tesman, B. Applied Combinatorics. Chapman & Hall/CRC, 2nd edition, 2009.
Sheikholeslami, F., Jain, S., and Giannakis, G. B. Efficient randomized defense against adversarial attacks in deep convolutional neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3277–3281, 2019.
Shridhar, K., Laumann, F., and Liwicki, M. A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731, 2019.
Smithson, M. Confidence Intervals, volume 140. Sage Publications, 2002.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295, 2018.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
(Plots: normalized c_o and c† versus class rank.)
Figure 6.
Credibility profiles for ResNet18 on Fashion MNIST
10. Additional numerical experiments
In these additional results, we repeat the experiments from the main paper on a different dataset (namely, Fashion MNIST (Xiao et al., 2017)) to show how they carry over to another application. Here, we perform the experiments using the ResNet18 (He et al., 2016) architecture trained over epochs in mini-batches of samples using Adam with the default parameters from (Kingma & Ba, 2017) and weight decay of − . Without data augmentation, the final classifier achieves an accuracy of over the test set. Once again, the loss function ℓ used is the cross-entropy loss and all experiments were performed over a random sample of images from the Fashion MNIST test set. In the sequel, we take W = γI with γ = 200.

We once again begin by leveraging the perturbation stability interpretation of c† to analyze the robustness of the ResNet on this new dataset, looking at the normalized values of c_o and c† sorted in decreasing order (Figure 6). Notice that, in contrast to the ResNet trained on CIFAR10, the perturbed profiles here are much noisier. This classifier is therefore less robust to perturbations of the input: its output can be considerably modified by comparably small perturbations. Notice that this analysis does not apply to the ResNet architecture in general, but to the specific instance used to classify these images. While these differences can be due to a larger sensitivity of the data (Fashion MNIST pictures are black-and-white whereas CIFAR10 has colored images), they can also be due to the fact that we trained for fewer epochs and did not use data augmentation. The power of using credibility profiles to analyze robustness is exactly that the analysis holds for the specific instance and application, in contrast to an average analysis.

Results for the filtering classifiers are shown in Figures 7 and 8.
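As a rough numerical illustration (not the paper's implementation), a credibility profile of this kind can be approximated by descending a penalized surrogate of the risk-fit compromise for W = γI: minimize ‖x − x_o‖² + γ⁻¹ Σ_k ℓ_k(φ_o(x))² and read off c†_k ≈ −ℓ_k at the perturbed input, since the constraints ℓ_k(φ_o(x)) ≤ −c_k are active at the optimum. A toy softmax model stands in for the CNN, and all names below are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def losses(W, x):
    # per-class cross-entropy losses ell_k = -log p_k at the model output
    return -np.log(softmax(W @ x) + 1e-12)

def num_grad(f, x, h=1e-5):
    # central finite differences (enough for a low-dimensional sketch)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def credibility_profile(W, x_o, gamma=200.0, lr=0.05, steps=500):
    """Approximate c† by minimizing ||x - x_o||^2 + (1/gamma) * sum_k ell_k(x)^2."""
    def obj(z):
        return np.sum((z - x_o) ** 2) + np.sum(losses(W, z) ** 2) / gamma

    x = x_o.astype(float).copy()
    for _ in range(steps):
        x -= lr * num_grad(obj, x)
    return -losses(W, x), x  # c†_k ≈ -ell_k at the perturbed input x†
```

Consistent with the discussion in the main paper, increasing γ makes the compromise more risk averse, so the returned perturbation ‖x† − x_o‖ shrinks.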
Once again, the model classifies samples only if the second largest credibility (or softmax output entry) is less than the largest one by a factor of at least 1 − α, where α is chosen to achieve specific coverage levels. We show results both using the softmax output of the model and the credibility profile c†. Once again, we notice a trade-off between robustness and performance, as in (Tsipras et al., 2019). When classifying every sample (full coverage), the classifier based on c† has lower accuracy (Figure 9). However, it can discriminate between certain and uncertain predictions in a way that the softmax output of the model cannot (as seen for c_o in Figure 6). In order to achieve the same accuracy as the unmodified model, the credibility-based filtered classifier must reduce its coverage to approximately . However, the overconfidence of the model output in its predictions makes it susceptible to perturbations. We illustrate this point by applying the different classifiers to test images corrupted by adversarial noise. The attacks shown here use the same parameters as described in the paper, except for the perturbation magnitudes, which are twice as large. Still, we observe the same pattern: even for the weakest attack, the accuracy of the softmax output drops by almost (see Figure 9), to the point that it would need to reduce its coverage by approximately to recover the accuracy of the credibility-based classifier with full coverage (Figure 7).
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 7.
Performance of softmax-based filtered classifiers.
(Plot: filtered accuracy versus coverage, for the original data and PGD attacks of three strengths.)
Figure 8.
Performance of credibility-based filtered classifier
(Bar plot: accuracy of softmax- and credibility-based classifiers, for the original data and PGD attacks of three strengths.)
Figure 9.
Accuracy of softmax- and credibility-based classifiers
11. Proof of counterfactual results
In this section, we prove different versions of the counterfactual result in Section 5.2. We begin with a simple lemma showing the existence of a local credibility profile satisfying (5).
Lemma 1.
There exists a local credibility profile c† that satisfies (5).

Proof. Let (x̄, c̄) be a local minimizer pair of the optimization problem

minimize_{x ∈ ℝ^p, c ∈ ℝ^|K|}  ‖x − x_o‖² + ‖c‖²_{W⁻¹}
subject to  ℓ(φ_o(x), k) ≤ −c_k,  k ∈ K.    (P-A)

If no credibility profile c satisfies (5), then arbitrarily close to c̄ there exists (x′, c′) feasible for (PI) such that

‖x̄ − x_o‖² + ‖c̄‖²_{W⁻¹} = r†(c̄) + ‖c̄‖²_{W⁻¹} > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹}.    (14)

However, observe that the feasibility set of (PI) is contained in the feasibility set of (P-A), so that (x′, c′) is also (P-A)-feasible. This leads to a contradiction, since (14) violates the fact that (x̄, c̄) is a local minimizer of (P-A).

Thus, the existence of a local credibility profile is not an issue. Let us then complete the proof of Theorem 2.
Proof of Theorem 2.
Under Assumption 1 and for c ∈ C satisfying the hypothesis in (9), the x†(c) that achieves the local minimum r†(c) satisfies the KKT conditions (Theorem 1) with λ_k = −w_k⁻¹ c_k. Explicitly,

(x†(c) − x_o) − Σ_{k=1}^{|K|} (c_k/w_k) ∇_x ℓ_k(φ_o(x†(c))) = 0  and  (c_k/w_k) [ℓ_k(φ_o(x†(c))) + c_k] = 0, for all k ∈ K.    (15)

Taking (x′, c′) to be any feasible pair for (PI), i.e., ℓ_k(φ_o(x′)) ≤ −c′_k for all k ∈ K, with x′ ∈ N as in Assumption 2, we can combine (15) to get

r†(c) = ‖x†(c) − x_o‖² − Σ_{k∈K} (c_k/w_k) [ℓ_k(φ_o(x†(c))) + c_k] + 2 (x′ − x†(c))ᵀ [ (x†(c) − x_o) − Σ_{k∈K} (c_k/w_k) ∇_x ℓ_k(φ_o(x†(c))) ],

which can then be rearranged to read

r†(c) = ‖x†(c) − x_o‖² + 2 (x′ − x†(c))ᵀ (x†(c) − x_o) − Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x†(c))) + (x′ − x†(c))ᵀ ∇_x ℓ_k(φ_o(x†(c))) + c_k ].    (16)

To proceed, use the convexity of ‖·‖² to bound the first two terms in (16) and obtain

r†(c) ≤ ‖x′ − x_o‖² − Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x†(c))) + (x′ − x†(c))ᵀ ∇_x ℓ_k(φ_o(x†(c))) + c_k ].    (17)
Then, since x′ ∈ N from Assumption 2, we can use (6) on the bracketed quantity in (17) to write the inequality

r†(c) − ‖x′ − x_o‖² ≤ −Σ_{k∈K} (c_k/w_k) [ ℓ_k(φ_o(x′)) + [ℓ_k(φ_o(x′)) − ℓ_k(φ_o(x†(c)))]² / ℓ_k(φ_o(x†(c))) + c_k ].    (18)

Expanding (18) yields

r†(c) − ‖x′ − x_o‖² ≤ −Σ_{k∈K} c_k ℓ_k(φ_o(x′))² / (w_k ℓ_k(φ_o(x†(c)))) − Σ_{k∈K} c_k ℓ_k(φ_o(x†(c))) / w_k − ‖c‖²_{W⁻¹},

where we used the fact that ‖c‖²_{W⁻¹} = Σ_k w_k⁻¹ c_k². Since ℓ_k(φ_o(x†(c))) ≤ −c_k and ℓ_k(φ_o(x′)) ≤ −c′_k, we get

r†(c) − ‖x′ − x_o‖² ≤ ‖c′‖²_{W⁻¹} − ‖c‖²_{W⁻¹}.

Hence, if −W⁻¹c is a dual variable of (PI), then c satisfies (5), i.e., c = c†.

In the particular case in which (PI) is a convex program, we can show that (9) is both sufficient and necessary:
Assume that ℓ_k is convex and non-decreasing for all k ∈ K and that φ_o is a convex function of x (e.g., linear). Let r†(c) be a local minimum of (PI) for c ∈ C, achieved by x†(c) and associated with the dual variables λ†(c). Then, under Assumption 1, the local credibility profile c† from (5) is a global profile satisfying (2) and

c = −W λ⋆(c)  ⇔  c satisfies (2), i.e., c = c⋆.    (19)

Proof.
The first part of the theorem is immediate: since φ_o is convex and ℓ_k is a non-decreasing convex function, (PI) is a strongly convex problem. Hence, any local minimizer x† is a global minimizer x⋆ (Boyd & Vandenberghe, 2004).

We proceed by proving necessity (⇒). Since (PI) is a strongly convex, strictly feasible problem [c ∈ C in (7)], it is strongly dual and its dual variable λ⋆(c) is a subgradient of its perturbation function r⋆ (Boyd & Vandenberghe, 2004, Section 5.6.2). We therefore obtain

r⋆(c′) ≥ r⋆(c) + λ⋆(c)ᵀ (c′ − c)  and  ‖c′‖²_{W⁻¹} ≥ ‖c‖²_{W⁻¹} + 2 cᵀ W⁻¹ (c′ − c),    (20)

which, when summed, read

r⋆(c) − r⋆(c′) ≤ ‖c′‖²_{W⁻¹} − ‖c‖²_{W⁻¹} + [λ⋆(c) + 2 W⁻¹ c]ᵀ (c − c′).    (21)

Using the hypothesis in (19) yields λ⋆(c) + 2 W⁻¹ c = 0, and (21) reduces to (2), which implies c = c⋆.

To prove the sufficiency part (⇐) in (19), notice from (P-A) that any c⋆ that satisfies (2) is the minimizer of the strongly convex function q(c) = r⋆(c) + ‖c‖²_{W⁻¹}. As such, it must be that ∇q(c⋆) = 0 ⇔ ∇r⋆(c⋆) + 2 W⁻¹ c⋆ = 0. Once again leveraging the fact that (PI) is strongly dual, it holds that λ⋆(c) = ∇r⋆(c) for all c (Boyd & Vandenberghe, 2004, Section 5.6.3), thus concluding the proof.
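As a sanity check on this counterfactual relation, consider a one-dimensional convex instance with a single linear "loss" ℓ(x) = a·x + b, for which both the joint compromise and the fixed-credibility problem solve in closed form. The instance and normalization below are our own: with the standard KKT convention (the squared objective contributing a factor 2 to the gradient), the dual variable of the fixed-credibility problem recovers the optimal credibility up to the multiplier scaling fixed by Theorem 1, here c⋆ = −(w/2) λ⋆(c⋆).

```python
import numpy as np

# One-dimensional compromise: minimize (x - x_o)^2 + c^2 / w
# subject to a*x + b <= -c, with a single loss ell(x) = a*x + b.
a, b, x_o, w = 1.0, 0.0, 2.0, 3.0

# Joint problem: the constraint is active, so c = -(a*x + b); minimizing
# (x - x_o)^2 + (a*x + b)^2 / w over x gives the closed forms below.
x_star = (w * x_o - a * b) / (w + a ** 2)
c_star = -w * (a * x_o + b) / (w + a ** 2)

# Fixed-credibility problem at c = c_star: minimize (x - x_o)^2 subject to
# a*x + b + c <= 0; the active constraint gives x† and, from stationarity
# 2*(x† - x_o) + lam*a = 0, the dual variable lam.
x_dag = -(b + c_star) / a
lam = 2.0 * (x_o - x_dag) / a
```

The fixed-credibility minimizer coincides with the joint one (x† = x⋆), and the dual variable is proportional to −c⋆, so the compromise never has to be re-solved for multiple credibility values.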
12. Proof of Proposition 1
Recall the Bayesian formulation from Section 6:
Pr(x | c) = N(x | x_o, (2t)⁻¹ I) × ∏_{c_k ≠ 0} SE[ ℓ(φ_o(x), k) | c_k, c_k t / w_k ],
Pr(c) = N(c | 0, (2t)⁻¹ W).

We then obtain the joint distribution
Pr(x, c) as

Pr(x, c) ∝ exp(−t ‖x − x_o‖²) × ∏_{c_k ≠ 0} (c_k t / w_k) exp( −(c_k t / w_k) [ℓ_k(φ_o(x)) + c_k] ) · exp(−t ‖c‖²_{W⁻¹})
= exp( −t ‖x − x_o‖² − t Σ_{c_k ≠ 0} (c_k/w_k) [ℓ_k(φ_o(x)) + c_k] − t ‖c‖²_{W⁻¹} + Σ_{c_k ≠ 0} log(c_k t / w_k) ).    (22)

Notice that, since x†(c†) achieves the local minimum r†(c†) that satisfies Def. 2, the pair (x†(c†), c†) is a local minimizer of (P-A). Indeed, (5) can be rearranged as

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} ≤ ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹},

for x′ in a neighborhood of x†(c†) and (x′, c′) (PI)-feasible. Hence, the pair satisfies the KKT conditions (Theorem 1) for (P-A). Indeed, note that (P-A) always has a strictly feasible pair (x̂, ĉ) obtained by taking [ĉ]_k = ℓ_k(φ_o(x̂)). Explicitly, there exists λ such that

(x† − x_o) + Σ_{k=1}^{|K|} λ_k ∇_x ℓ_k(φ_o(x†(c†))) = 0,    (23)
W⁻¹ c† + λ = 0,    (24)
λ_k [ℓ_k(φ_o(x†(c†))) + c_k] = 0, for all k ∈ K.    (25)

Observe that (23) and (24) arise by applying (8a), taking derivatives with respect to x and c respectively, and (25) is the complementary slackness condition (8b). Using (24) and (25), we additionally conclude that

Σ_{k∈K} (c_k/w_k) [ℓ_k(φ_o(x†(c†))) + c_k] = 0.    (26)
Using (26), the joint probability distribution (22) evaluated at (x†(c†), c†) reduces to

Pr(x†(c†), c†) ∝ exp( −t ‖x†(c†) − x_o‖² − t ‖c†‖²_{W⁻¹} + Σ_{c_k ≠ 0} log(c_k t / w_k) ).    (27)

Suppose now that there exists another point (x′, c′) arbitrarily close to (x†(c†), c†) such that Pr(x†(c†), c†) < Pr(x′, c′) for all t. This would imply that

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} − t⁻¹ Σ_{c_k ≠ 0} log(c_k t / w_k) > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹} − t⁻¹ Σ_{c′_k ≠ 0} log(c′_k t / w_k)

for all t. Since the log sums grow only as o(t), the last term on each side vanishes as t → ∞, and we eventually get

‖x†(c†) − x_o‖² + ‖c†‖²_{W⁻¹} > ‖x′ − x_o‖² + ‖c′‖²_{W⁻¹},

which violates (5). Hence, (x†(c†), c†) becomes a local maximum of the joint distribution (22) as t → ∞.