Counterfactual Explanations & Adversarial Examples -- Common Grounds, Essential Differences, and Potential Transfers
Timo Freiesleben a,b
a Munich Center for Mathematical Philosophy, LMU, Ludwigstrasse 31, Munich, Germany
b Graduate School of Systemic Neurosciences, LMU, Munich, Germany
Abstract
The same optimization problem underlies counterfactual explanations (CEs) and adversarial examples (AEs). While this is well known, the relationship between the two at the conceptual level remains unclear. The present paper provides exactly this missing conceptual link. We compare CEs and AEs with respect to their philosophical basis, aims, and modeling techniques. We argue that CEs are a more general object class than AEs. In particular, we introduce the conceptual distinction between feasible and contesting CEs and show that AEs correspond to the latter.
Keywords:
Counterfactual Explanation, Adversarial Example, XAI, AI-Safety, Causality
∗ Corresponding author
Email address:
[email protected] (Timo Freiesleben)

1. Introduction

With the emergence of more and more flexible models in machine learning, such as deep neural networks or random forests, some new problems arose. One problem is the lack of interpretability (Doshi-Velez and Kim, 2017; Rudin, 2019), which has evolved into an area called eXplainable Artificial Intelligence (XAI) or Interpretable Machine Learning (IML). A variety of model-agnostic interpretation techniques have been proposed, e.g., ICE curves (Goldstein et al., 2015), LIME (Ribeiro et al., 2016), and Shapley values (Štrumbelj and Kononenko, 2014). These techniques have the advantage of not posing any assumptions on the employed model (Molnar, 2019). Counterfactual Explanation (CE) (Wachter et al., 2017) is one of these model-agnostic methods and aims to explain decisions of machine learning classifiers to end-users.
Another problem with highly flexible algorithms is their vulnerability to attacks and their lack of robustness. Such an attack is called an adversarial example (AE) (Szegedy et al., 2014). Researchers constructed successful attacks (largely) for computer vision (Goodfellow et al., 2015) but also for other tasks (Yuan et al., 2019). AEs are specific inputs that machine learning algorithms misclassify. Thereby, AEs aim to deceive these algorithms and exploit their weaknesses.
Given these entirely different purposes, it is surprising that CEs and AEs share the same mathematical framework. This similarity on the model level has been frequently noted throughout the literature. Wachter et al. (2017) describe AEs as CEs by a different name. They mention that methods are transferable but neither discuss the relationship in detail nor specify the transferable techniques. Molnar (2019) describes AEs as CEs with the aim of deception and points out the similarity as a single-objective optimization problem. Sharma et al. (2020) use counterfactuals in their robustness measure against adversarial attacks called CERScore and use the terms counterfactual/adversarial interchangeably. Tomsett et al. (2018) and Ignatiev et al. (2019) both discuss the relationship between AEs and interpretability, however, without referring to CEs. Sokol and Flach (2019) discuss CEs in the context of AI safety and note that there is "a fine line between counterfactual explanations and adversarial examples" that needs further analysis.
This paper aims to study and explicate the "fine line" between counterfactual explanations and adversarial examples. Besides a detailed mathematical analysis of the relationship between CEs and AEs, we will also conceptually compare the two and examine their common use contexts. In order to compare CEs and AEs, we need to analyze each of the two fields. For this analysis, it is important to focus on aspects that are sufficient to describe both fields (Beaney, 2018). Moreover, the aspects should allow leading an informed discussion about their relationship. Hence, we have selected the following aspects:
i) conceptual basis,
ii) aim, role, and use cases,
iii) models and implementations.
The conceptual basis concerns the theoretical and philosophical ideas behind a concept. It describes a foundation on other, more basic ideas within the conceptual realm. The aim, role, and use cases define the motivation with a focus on the use contexts. Aspect iii) is crucial as it describes the very definition of a concept in precise mathematical terms. Moreover, it represents the state of current AI research on the topic.
(One might say old problems.) Note that our analysis is not a standard conceptual analysis as discussed by Carnap (1998) or Russell (1905), where concepts are defined logically by more basic concepts. Instead, we will concentrate on a holistic picture, which also includes aspects such as the respective roles of the concepts, their use cases, the state of the art, etc.
In Section 2, we point out three misconceptions of CEs and AEs present in current discussions. In Section 3, we analyze and compare CEs and AEs with respect to the aspects introduced above. In the course of this, we also discuss possible transfers between the two fields. Section 4 introduces a conceptual division into two types of CEs, namely feasible and contesting CEs. We argue that contesting CEs are similar to AEs. In Section 5, we reconsider the misconceptions discussed in Section 2 in the light of our analysis.
2. Three Misconceptions
An analysis of CEs, AEs, and their relationship on the conceptual level is urgently needed. We see several misconceptions that have already led and possibly will lead to severe confusion. Thus, our analysis's primary goal is to resolve these confusions and lay the foundation for well-guided future research on both CEs and AEs.
CE is equal to AE.
The first conceptual misunderstanding is to assume that CEs and AEs are just two terms for the same objects. This misconception leads authors like Sharma et al. (2020) to use the terms counterfactual and adversarial interchangeably. The misunderstanding concerns the very basis of CEs and AEs. Not only do CEs and AEs have some non-overlapping functions, but AEs also require an additional definitional constraint that CEs do not - misclassification.
This first misconception leads to a false interpretation of robustness against attacks. Therefore, it worsens performance and societal acceptance of machine learning applications. We examine the misclassification requirement thoroughly in Section 3.1 and Section 3.3.
Feasible CEs are all relevant CEs.
The second misconception appears in the context of CEs. It is the assumption that only actionable/feasible CEs are relevant to end-users. Researchers focusing on feasibility/actionability of CEs (Poyiadzi et al., 2020; Mahajan et al., 2019; Karimi et al., 2020) do not explicitly claim that. However, they neither discuss other types of CEs nor the limitations of feasible CEs. Thus, one type that is overlooked are contesting CEs, which provide grounds for end-users to contest a decision. Contesting CEs show a remarkable resemblance to AEs.
Focusing just on feasible CEs leads to hiding biased algorithmic decisions behind the facade of an explanation. Feasibility and contestability are discussed in Section 3.2. Section 4 introduces the distinction between feasible and contesting CEs.

Transfers, yes or no?.
The third misconception is twofold. It is either over- or underestimating the transfer opportunities between CEs and AEs. In the case of overestimation, methods can be misused, or hidden assumptions can be adopted. In the case of underestimation, already known techniques are potentially rediscovered. Neither the adaptation of unsuited techniques nor the reinvention of successful approaches is desirable in research. To avoid these transfer problems, we will discuss conceptually permissible transfers between CEs and AEs extensively in Section 3.3.
We will come back to these three misconceptions in Section 5.
(E.g., generating counterfactuals based on AE surrogate techniques (Guidotti et al., 2018) uses local approximations and therefore faces the same critiques as LIME (Molnar, 2019). E.g., evolutionary algorithms for mixed-data CEs (Sharma et al., 2020).)

3. CEs & AEs: A Comparison

To give the reader an intuition on CEs and AEs, we start with two standard examples. The first describes a loan application scenario and a potential CE in that situation. The second example illustrates AEs in image recognition tasks.
CE Example.
Assume person P wants to obtain a loan and applies for it through the bank's online portal. The portal uses an automated, algorithmic decision system, which decides that person P will not receive the loan. P wants an explanation for that decision. An example of a CE would be: If P had a higher salary and one outstanding loan less, her loan application would have been accepted.
AE Example.
Look at Figure 1 from Papernot et al. (2017). The images in row two are a slight modification of the images from row one. However, row two shows AEs, since these subtle changes have changed the classification to the wrong classes. The image recognition algorithm was successfully tricked.
Figure 1: In the first row, we can see five images (the first two are from the MNIST dataset, the other three are from the GTSRB dataset) that are classified correctly. We see the same five pictures in the row below, but slightly modified by some noise added to the pictures. Here, the algorithm misclassifies them.
3.1. Conceptual Basis

The first aspect we investigate is the conceptual basis of CEs and AEs. We start by considering each of them in separation, and then we conduct our comparison.
Counterfactual Explanations
CEs have a strong philosophical basis and tradition. Here, we confine ourselves to counterfactuals in the form of subjunctive conditionals. Let S and Q be propositions. Then, counterfactual sentences are conditionals of the form:

If S was true, Q would have been true.   (1)

(The difference between indicative and subjunctive conditionals is that the antecedent must be false for the latter (Starr, 2019). From now on, whenever we talk about counterfactual statements/sentences/explanations/conditionals, we mean subjunctive ones.)

In the actual world, S is false. A counterfactual explanation is a counterfactual sentence that is true. What makes counterfactual statements true is hotly debated in philosophy, and no solution has been agreed upon generally (Starr, 2019). The solution taken up in the computer science approach builds on the work of Lewis (1973). In Lewis's framework, Equation (1) holds if and only if, in the closest possible world ω′ ∈ Ω to the actual world ω ∈ Ω in which S is true, Q is also true. Due to the under-specified notion of similarity between possible worlds, Lewis's proposal is highly controversial (Starr, 2019).
A good counterfactual explanation in a specific situation is a CE relevant to the explainee. That means that the CE is easy to comprehend and has an interesting antecedent/consequent for the explainee (Miller, 2019). In XAI applications, the antecedent describes a change in features from a given input, and the consequent describes a change in the outcome of the classification.
Note that Lewis aimed to describe causality via counterfactuals (Menzies and Beebee, 2019), which is not directly the goal of CEs in XAI. The CE approach allows us to make causal claims about the machine learning model only (e.g., which features does the algorithm take to be causally relevant) but not about the corresponding real-world objects (Molnar et al., 2020).
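In the simplified form used here (our notation; we write d for the underspecified similarity measure between possible worlds and assume a unique closest S-world exists), Lewis's truth condition reads

\[
S \;\Box\!\!\rightarrow\; Q \text{ is true at } \omega
\iff
Q \text{ is true at } \omega^{*}, \qquad
\omega^{*} = \operatorname*{arg\,min}_{\omega' \in \Omega,\; S \text{ true at } \omega'} d(\omega, \omega').
\]

Written this way, the structural parallel to the CE optimization problem in Equation (2) below is immediate: the antecedent plays the role of the constraint, and the similarity between worlds plays the role of the distance measure on the input space.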
Adversarial Examples

Adversarial examples are inputs that an algorithm assigns the wrong class/value to. A wrong class is defined by the fact that the class deviates from a ground truth given by humans. Not for all inputs are there such ground truths. Especially for large feature spaces, there are many entirely meaningless inputs, which are not considered AEs. What defines AEs is that the given inputs appear similar (or identical) to real-world data or algorithm training data. Generally, this is achieved by modifying a real-world input.
Good AEs are those that can potentially be exploited by an attacker.
AEs are based on a classical scenario from game theory, in which a deceiver-agent tries to trick a discriminator-agent (Dalvi et al., 2004). An AE denotes a successful deception in this game. Adversarial attacks are not specific to neural networks but apply to any complex system (Papernot et al., 2016a); not even humans are immune against deception (Kahneman et al., 1982; Chabris and Simons, 2010; Ioannou et al., 2015). What is remarkable about modern AEs is that they often transfer from one model to another and are hard to get rid of (Yuan et al., 2019). The origin of this effect is still open to debate (Goodfellow et al., 2015; Ilyas et al., 2019).

Comparison
Both fields rely on counterfactual reasoning. CEs describe a variation of the actual situation/world; AEs are a real-world input variation. Also, in both cases, the variation changes the result. Furthermore, both approaches search for relevant variations in regions close to the real world (input), fulfilling certain constraints. The crucial difference lies in these constraints. For CEs as given in Equation (1), the alteration to the actual world described by a predicate S has to satisfy that predicate Q applies. For AEs, taking Q to be the predicate 'misclassified', we see that the latter is a specific case of the former. Misclassification is not demanded of CEs. Conversely, CEs often demand that Q describes a specific outcome, such as 'loan acceptance'.
(As mentioned above, S is false in ω; Ω denotes the set of possible worlds. Section 3.2 specifies the intuitive notion of interestingness. This is contrary to Pearl (2009), who introduces counterfactuals with causal meaning; an XAI version of this type of CE is presented by Karimi et al. (2020) in the form of algorithmic recourse. Moreover, counterfactuals can also account for non-causal explanations, as discussed in Reutlinger (2018). From now on, we will mainly talk about misclassification and classifying; however, this is only to simplify our language usage, as AEs are not restricted to classification tasks but also work on regression problems. The picture of the two opponents is also the basis of generative adversarial networks (Radford et al., 2016).)

3.2. Aim, Role, and Use Cases

Now, we investigate the aim, role, and use cases of CEs and AEs.
Counterfactual Explanations
The CE approach in XAI aims to generate local explanations. Here, local means that the explanations are generated for individual "decisions" of the algorithm. According to Wachter et al. (2017) and Miller (2019), these explanations have three intuitive aims, which make the difference between (only) a CE and a good CE. The aims are to
i) raise understanding,
ii) give guidance for future actions, and
iii) allow to contest decisions.
A good CE does not have to meet all three of these goals. In many contexts, explanations focus on only one or two of them.

Aim i).
The target audience of CEs are laypersons who are neither experts in machine learning nor have unlimited time resources (Wheeler, 2020). If we want to improve a person's understanding, we must respect these resource limitations and focus on the few main reasons for a decision. To achieve this degree of simplicity, we must be economical concerning the presented number of reasons. Therefore, sparsity is one aim discussed in the literature.
Aim ii).
Explanations should serve as a guideline for future actions. Hence, the alternative that achieves the desired output (e.g., obtaining the loan) should be reachable for the explainee. For example, a loan applicant cannot become younger to obtain a loan, even if age is one of the bank's criteria for justified reasons. Thus, it is not reasonable to propose a reduction in age for obtaining a loan. Researchers summarize such limitations under the term feasibility.

Aim iii).
Explanations provide grounds for the appealability of decisions. End-users can contest a decision if the presented reasons are insufficient or discriminatory. That may be because the decision is based on features that should not play a role (e.g., skin color, gender, etc.) or on features that should play a role but are expected to have a different effect (e.g., if a high salary correlates negatively with obtaining a loan). All in all, humans want to be treated fairly and demand explanations to uncover unfair judgments (Kusner et al., 2017; Asher et al., 2020). If we feel unfairly judged, this is because we would have expected a different decision. Thus, the explanation we generate should focus on features that the explainee expected to have a different effect on the decision. In other words, explanations should be informative.
(By "decisions", we usually mean classification or regression tasks the algorithm performs. Páez (2019) argues that counterfactuals as introduced by Wachter et al. (2017) cannot, even in principle, meet this requirement. The condition of informativeness also aids aim number one, which is to raise understanding; in the context of psychology, informativeness is discussed under the name abnormality (Miller, 2019).)

Role.

Among XAI researchers, CEs became very popular. One reason is that they are model-agnostic and, therefore, applicable to any kind of algorithm. Secondly, there is a one-to-one correspondence to contrastive explanations, which are the type of explanations that people use most in everyday life (Miller, 2019). Thirdly, CEs are compatible with the right to explanation in the European General Data Protection Regulation (GDPR) (Wachter et al., 2017). All these advantages allow the CE approach to play an essential role in XAI.
Use Cases.
The use cases of CEs are generally unlimited, as the CE framework applies to any kind of algorithm. The only requirements are interpretable input and output spaces and reasonable distance measures on these spaces. In the literature, however, CEs are considered exclusively in connection with classification tasks on tabular data. Unsupervised/reinforcement learning, image/audio classification, and regression problems are still under-explored in this respect. Whether CEs should be applied on a broad scale is at least questioned (Laugel et al., 2019; Barocas et al., 2020).
Adversarial Examples
Even though AEs represent only single instances in which the algorithm fails, they also point to the algorithm's global problems. If the algorithm classifies a stop sign as a right-of-way sign, one becomes extra cautious about everything the algorithm does. The aims behind AEs depend strongly on the specific use cases and employers. Nevertheless, we find three principal aims that all perspectives share:
i) to fool the system,
ii) to do this imperceptibly, and
iii) to do this effectively.
What do these aims imply for generating AEs?

Aim i).
Fooling the system is the main aim of AEs. Thus, we look for misclassifications. The system usually performs reasonably well on training data and similar inputs. In unseen regions, on the other hand, it performs poorly. If we randomly pick an example from unseen regions of our input space, we most likely choose a meaningless data point. So we need to find an input that is close enough to a meaningful input and yet in a region where the algorithm performs poorly.
Aim ii).
One condition under which we must search for adversarials is imperceptibility. AEs should not be easy to detect for a human, i.e., input changes should be below the human perception threshold. This property guarantees the highest chance of deceiving successfully. Since human perception directs attention to certain features and expectations, imperceptibility can be achieved in two ways: either by changes in unattended features or by distributed low-intensity changes.
Aim iii).
The effectiveness of an AE depends strongly on the context. Attackers want to exploit mistakes in the most profitable way possible (e.g., monetary gain or system damage). Engineers want to defend themselves against such attacks and make their systems more stable against them (e.g., fixing bugs or detecting attacks). Researchers working on AEs strive for a deeper understanding of learning algorithms, depicting real-world dangers in employing algorithms, and high research impact.
(One counterexample to the exclusive focus of CEs on tabular data is the work on the MNIST dataset by Van Looveren and Klaise (2019). AEs can also aim at a global level and a variety of algorithms, as shown by Moosavi-Dezfooli et al. (2017). Imperceptibility is especially relevant when considering images.)

Role.

AEs are both a blessing and a curse. They can indeed cause significant harm to individuals, companies, and society as a whole. The more social or ethical consequences the task we assign to a machine learning algorithm has, the worse the effect of misclassification. A stop sign classified as a right-of-way sign can cause accidents, and a rifle misclassified as a turtle can facilitate terrorist attacks at airports. The trust we have in AI systems is and will be closely linked to the extent to which adversarial attacks are possible. On the positive side, AEs can help us understand how the algorithm works (Ignatiev et al., 2019; Tomsett et al., 2018). Knowing where the algorithm has problems helps us understand what the algorithm is really learning (Lu et al., 2017). Moreover, by adversarial training, AEs can even concretely improve models (Bekoulis et al., 2018; Stutz et al., 2019).

Use Cases.
AEs are mostly built for image and sometimes audio recognition tasks. Reasons for that are uncontroversial ground truths, the boom in computer vision, and the resemblance with optical illusions (Elsayed et al., 2018).
Comparison
Both fields help to understand what algorithms have learned. Moreover, both contribute to identifying biases and even offer methods to eliminate these biases through adversarial or counterfactual training (Bekoulis et al., 2018; Sharma et al., 2020). However, while improving understanding and highlighting algorithmic problems is usually only a byproduct of AEs, it is the focus of CEs. The deception of a system, on the other hand, is essential for AEs, but a potential byproduct of CEs in cases where they disclose too much information about the algorithm (Sokol and Flach, 2019).
Making modifications imperceptible is crucial for AEs. In the case of CEs, however, the modifications form the core of the given explanation. This is more a difference in presentation and less one in the type of modifications. Modifications to achieve the imperceptibility of AEs show a great similarity with those in CEs that aim at informativeness. Imperceptible AEs result from modifications in unnoticed/unanticipated but effective features. These surprisingly effective changes are precisely those which are most informative to humans, as Section 4 discusses.
For the feature permutations in CEs to make sense, the input space must provide a certain degree of interpretability. This demand is irrelevant for AEs since the changes are hidden and not highlighted.
The two approaches play a similar role within the machine learning landscape, as both will strongly affect people's trust in machine learning systems in the future. Also, both fields have gained increasing legal relevance. To be legally applicable (e.g., in autonomous driving, airport security (Athalye et al., 2018), etc.), machine learning algorithms must be both robust against AEs and provide explanations to end-users as specified in the GDPR. A significant difference is that AEs, by definition, can only point to the mistakes of the algorithm. Hence, emerging AEs mainly have a negative role, while CEs can also raise trust in the system.
Concerning the use cases, we see that AEs mainly apply to computer vision tasks, whereas CEs are considered only on tabular data.
(Ballet et al. (2019) give an example of AEs for tabular data classification. Note that giving guidance for future actions and deceiving are only compatible for immoral agents. One difference is that CEs only tell us how an algorithm works in a very local region, while AEs can affect humans' confidence in the system as a whole; however, in cases where the CEs reveal racist, sexist, or causally unjustified reasons, this will also reduce humans' confidence in the system as a whole, not just locally. Furthermore, there are the above-mentioned counterexamples to this division of use cases (Van Looveren and Klaise, 2019; Ballet et al., 2019).)

3.3. Models and Implementations

The last conceptual aspect we investigate concerns the models and implementations of CEs and AEs.
Counterfactual Explanations
There are a variety of formulations of the CE framework; the version presented here orients at Wachter et al. (2017). Assume there is a learning algorithm, which we represent by a function f : I → O mapping a vector x from an interpretable (potentially high-dimensional) input space I to a vector y in an output space O. Assume the desired classification for x would be y′ ≠ y. Then, a counterfactual vector x_{c,y′} to x is a vector that minimizes the term ‖x_{c,y′} − x‖ for which f(x_{c,y′}) = y′. Often it is sufficient that x_{c,y′} is close to x. Also, having f(x_{c,y′}) close to y′ can be sufficient, as y′ might be in principle impossible or very difficult to reach. Thus, the standard formulation as a single-objective optimization problem is

argmin_{x′ ∈ I}  ‖x − x′‖ + λ |||f(x′) − y′|||   (2)

where ‖·‖ and |||·||| are induced by some measures of distance on I and O, respectively. The scalar λ trades off between a more similar counterfactual and a vector closer to the desired output. The counterfactual explanation is derived by putting into words the difference between the original input and the counterfactual vector we generated.
Consider, for instance, the loan application scenario from Section 3 and assume that x_{c,y′} − x has a value of +2000€ at the feature salary and a value of −1 at the feature number of outstanding loans. The corresponding CE would be:
• If P earned 2000€ more per year and had one outstanding loan less, her loan application would have been accepted.
(The learning algorithm is usually already trained. The distance measures do not necessarily have to be norms.)
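To make Equation (2) concrete, the following sketch (our illustration, not code from Wachter et al. (2017); model, weights, and feature values are made up) performs plain gradient descent on the objective for a differentiable toy classifier:

```python
import numpy as np

# Toy differentiable classifier: logistic regression with hand-set weights
# over two standardized features (salary, number of outstanding loans).
w = np.array([0.8, -0.4])
b = -1.0

def f(x):
    """Predicted probability of loan acceptance."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def counterfactual(x, y_target, lam=5.0, lr=0.05, steps=2000):
    """Gradient descent on Eq. (2): ||x' - x||^2 + lam * (f(x') - y_target)^2."""
    x_cf = x.copy()
    for _ in range(steps):
        p = f(x_cf)
        grad_output = 2.0 * (p - y_target) * p * (1.0 - p) * w   # chain rule through the sigmoid
        grad_input = 2.0 * (x_cf - x)                            # stay close to the original input
        x_cf -= lr * (grad_input + lam * grad_output)
    return x_cf

x = np.array([0.2, 2.0])                    # rejected applicant
x_cf = counterfactual(x, y_target=1.0)
print("feature changes:", x_cf - x, "new score:", f(x_cf))
```

Putting the resulting feature changes into words yields the counterfactual explanation; in practice, the squared distance terms would be replaced by the measures discussed next.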
Distance Measures.

As in Lewis's framework from Section 3.1, the main difficulty is to define a reasonable distance measure on the input space. As discussed in Section 3.2, good CEs are sparse, feasible, and informative.
Sparsity is a hot topic in the general machine learning literature (Bach, 2010). In the field of CE, Wachter et al. (2017) gain sparsity by using the normalized Manhattan metric. Other ways to attain sparsity include setting features as not permutable (Moore et al., 2019), using the L0 metric to directly penalize high numbers of changed features, using multi-objective optimization with the number of changed features as one objective (Dandl et al., 2020), or taking into account the causal structure of the real world, where changing a few features via an action has consequences for several others (Karimi et al., 2020).
Feasibility can be achieved in many ways and depends on the problem under consideration. One possibility is to declare some features immutable, making them irrelevant for the explanation (Moore et al., 2019; Sokol and Flach, 2019). A second way focuses on the problem that some possible input vectors represent highly improbable or unreachable combinations of features in the real world; they should, therefore, not be suggested as good CEs. The literature suggests many ways to deal with this problem.
Informativeness requires knowledge about the explainee's priors/expectations, which is usually not available. However, some solved this problem by asking the user questions about her preferences (Sokol and Flach, 2019). Another option is to focus on the features that the average human usually over- or underestimates. Best would be a combination of the two, i.e., to set the average human's prior characteristics and then update this prior via feedback from the human agent.
(Not only on the input space might it be challenging to find a suitable measure of distance, but also on the output space. Consider a classification problem where the output space is not just a set of options but a set of probability density functions. If the desired output is "loan application accepted", it is unclear whether it suffices that it has the highest value among the categories, more than fifty percent, or even the value 100%. Moreover, some outcomes might be more similar to the desired outcome than others, e.g., obtaining a smaller loan is better than obtaining no loan. Standard measures like KL-divergence or cross-entropy are ignorant of such similarity differences. Moore et al. (2019) also introduce the idea to show a range of explanations with a diverse number of changed features.)
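The normalized Manhattan metric of Wachter et al. (2017) weights each feature's change by the inverse of its median absolute deviation over the data, so that heterogeneous feature scales do not distort the comparison. A minimal sketch of such a measure (our illustration, not the authors' code):

```python
import numpy as np

def normalized_manhattan(x, x_cf, X_data):
    """Sum of absolute feature changes, each scaled by the feature's median
    absolute deviation (MAD) estimated from a reference data set."""
    mad = np.median(np.abs(X_data - np.median(X_data, axis=0)), axis=0)
    mad = np.where(mad == 0, 1.0, mad)   # guard against constant features
    return np.sum(np.abs(x - x_cf) / mad)

X_data = np.random.default_rng(0).normal(size=(500, 2))
print(normalized_manhattan(np.array([0.2, 2.0]), np.array([1.1, 1.0]), X_data))
```

Because every feature is measured on a comparable scale, the changes concentrate on the few features that move the prediction most, which favors sparse explanations.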
Solution Methods.

The solution strategy for the optimization problem depends on the model knowledge. As the employers of interpretation techniques are usually the designers of the inspected algorithm, full model access is common. Given such a white box, the problem can be solved by gradient-based methods (Wachter et al., 2017; Mothilal et al., 2020; Mahajan et al., 2019). An alternative for mixed numeric/categorical data are mixed-integer linear program solvers (Ustun et al., 2019; Russell, 2019; Kanamori et al., 2020). Genetic algorithms are a solution method that does not require model knowledge (Sharma et al., 2020; Dandl et al., 2020). A more controversial technique for black-box scenarios is to train a surrogate model on the original model and then transfer the CEs from the surrogate to the original model (Guidotti et al., 2018). (Problems occur if the surrogate model is not faithful to the original model; in such cases, the generated CEs are simply false and potentially misleading.)

Selection Problem.
The solution to the optimization problem will generally not be unique. There can be a high number of equally close CEs for the same input vector. Worse, these different CEs may provide explanations that are pairwise incompatible, such as the following two:
• If P earned 2000€ more per year, her loan application would have been accepted.
• If P earned 2000€ less per year, her loan application would have been accepted.
Such cases arise since the decision boundaries do not follow classical monotonicity constraints. For example, the state may subsidize loan applications from people below a certain salary level. Some therefore propose to present several different CEs (Mothilal et al., 2020; Moore et al., 2019; Wachter et al., 2017; Dandl et al., 2020). However, then the question arises: how many and which ones? Others propose to select a certain CE according to relevance (Fernández-Loría et al., 2020) or a quality standard set by the user, such as complexity or particularly interesting features (Sokol and Flach, 2019). The question remains open as to how this so-called Rashomon effect can be solved.

Adversarial Examples
There are a variety of formulations of the AE framework. The version presented here orients at Yuan et al. (2019). Since the framework is basically the same as for CEs, we will mainly focus on its deviations. Again, the learning algorithm is represented by a function f : I → O mapping a vector x from an input space I to a vector y in an output space O. AEs do not require an interpretable input space. We distinguish between a targeted and a non-targeted attack. For a targeted attack, a particular alternative output y′ ≠ y is desired, as in the case of CEs. For a non-targeted attack, the alternative output just has to differ from y. For a non-targeted attack, an AE x_a to x is generated by searching for an x_a that minimizes ‖x_a − x‖ and for which f(x_a) ≠ f(x). In case we do have a particular alternative output in mind, the adversarial x_{a,y′} to x is a vector that minimizes ‖x_{a,y′} − x‖ for which f(x_{a,y′}) = y′. Equation (2) presents one formulation as a single-objective optimization problem. As in the case of CEs, the minimality might not be as important, and it is enough to find inputs close enough to x that change the classification in the desired way. Considering more distal inputs might be computationally easier and more interesting in some cases (Elsayed et al., 2018).
Of major importance is that the input is misclassified, which is guaranteed by none of the above-mentioned optimization problems. To achieve misclassification, we must add the condition that the alternative input is incorrectly classified. In other words, for the adversarial x_a, respectively x_{a,y′}, it has to hold that f(x_a) ≠ y_true, respectively f(x_{a,y′}) ≠ y_true. Here, y_true denotes the actually correct label for the adversarial example. This true label is usually the same as for the original input x, namely y.
The optimization problem presented here is only one among many (Yuan et al., 2019). Also, formulating an optimization problem is not the only way of finding AEs. The fast gradient sign method of Goodfellow et al. (2015) is an example of how to generate AEs directly.
(In the case of a regression problem, misclassification could correspond to being far outside the range of reasonable output values, see Balda et al. (2019).)
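As an illustration of direct generation, the following sketch applies a single fast-gradient-sign step, x_a = x + ε · sign(∇_x L(f(x), y)), to a toy logistic model (our illustration with made-up weights, not code from Goodfellow et al. (2015)):

```python
import numpy as np

w = np.array([0.8, -0.4])   # illustrative logistic-regression weights
b = -1.0

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm(x, y, eps=0.2):
    """One fast-gradient-sign step: shift every feature by +-eps in the
    direction that increases the cross-entropy loss for the label y."""
    p = predict_proba(x)
    grad_x = (p - y) * w          # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad_x)

x = np.array([1.5, 0.5])                       # input scored as class 1
x_adv = fgsm(x, y=1.0)
print(predict_proba(x), predict_proba(x_adv))  # the score for class 1 drops
```

Note that, exactly as discussed above, nothing in this procedure guarantees that x_adv is misclassified relative to the ground truth; the step only pushes the model's output away from the original label.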
Distance Measures.

One of the most critical problems in creating an AE is the reasonable definition of a distance measure on the input space; both computational and qualitative aspects have to be considered. This leads us back to the aims of misclassification, imperceptibility, and effectiveness.
Again, minimizing the difference between x and its adversarial x_a and flipping the algorithm's assignment does not guarantee to attain an AE. A switch in classification due to a small variation may be justified. However, since we are often dealing with image data, a tiny variation rarely justifies a switch in classification and, therefore, makes the input an AE. If we look at image data, there are usually infinitely many meaningless data points between two proper classes. Hence, following the gradient (Goodfellow et al., 2015), the Jacobian (Papernot et al., 2016b), or any other reasonable procedure (Yuan et al., 2019) may easily lead to an AE. Section 4 gives further insights into the problem of misclassification.
Imperceptibility is realized in various ways in the literature. Some change very few or even one feature strongly by optimizing for the L0 norm (Su et al., 2019). Others alter more features to a smaller amount with the L2 norm (Carlini et al., 2018), or work in a variety of real-world contexts and scenarios, like Brown et al. (2017); Athalye et al. (2018). The standard way to gain imperceptibility is to alter all features slightly via the L∞ norm on the input space (Goodfellow et al., 2015; Szegedy et al., 2014). Basically, any p-norm can be reasonably applied (Yuan et al., 2019). More interesting are measures that take into account what humans consider as "close" inputs (Rozsa et al., 2016; Athalye et al., 2018). This approach leads to an overlap between human and machine deceivability (Elsayed et al., 2018). For tabular data classification, it is much harder to define imperceptibility. Ballet et al. (2019) solved this via defining critical and non-critical features, based on expert evaluation. Since the algorithm uses both types of features in the classification, they modify only non-critical features to attain a change in assignment. This example shows how imperceptibility and misclassification go hand in hand.
Effectiveness is not so much a question of defining the distance measure but rather a question of which example we use to build our AE.
(Consider a case where a loan application of a 17-year-old is rejected while an 18-year-old with the same characteristics would receive the loan. Moreover, the variation leading to an AE is often not only small in the sense of a p-norm; it is structureless noise.)
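For concreteness, the perturbation budgets named above measure quite different things; a small illustration with a made-up perturbation vector:

```python
import numpy as np

delta = np.array([0.0, 0.02, -0.02, 0.5])       # hypothetical perturbation x_a - x
print("L0  :", np.count_nonzero(delta))          # number of changed features
print("L2  :", np.linalg.norm(delta, 2))         # overall energy of the change
print("Linf:", np.linalg.norm(delta, np.inf))    # largest single-feature change
```

An attack optimized under an L0 budget changes few features strongly, while an L∞ budget spreads small changes over all features; which of the two is harder to perceive depends on the data type, as the tabular example of Ballet et al. (2019) shows.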
Solution Methods.

The community's main focus is on the algorithmic generation of AEs, which again differs depending on model knowledge. There are gradient-based methods for white boxes. For black-box settings, there are methods that rely only on model output differences (Chen et al., 2017) or evolutionary algorithms (Guo et al., 2019; Alzantot et al., 2019; Su et al., 2019). Due to the transferability of AEs, it is often also possible to build an AE for a surrogate model and then apply the AE to the original model (Papernot et al., 2017). Generally, non-targeted attacks are computationally less costly and transfer more easily to other systems than targeted attacks (Yuan et al., 2019).

Selection Problems.
If we want to generate an AE, we face two selection problems. Probably due to their irrelevance in practice, neither of them has been discussed in the literature. The first selection problem is about selecting the initial input vector the AE is based on. The second selection problem is about selecting the final AE among the solutions to the optimization problem. Both selection problems relate to effectiveness and depend on the application, the goal, and the system's weaknesses. The AEs that best suit the employer's needs should be picked.

Comparison
There is no need for researchers to reinvent the wheel. For this reason, this section discusses, in addition to the conceptual comparison, the potential transfers between the fields (Figure 2). The common ground with regard to the mathematical model is evident. In Appendix A, we show that AEs are special solutions to a (non-targeted) CE optimization problem.
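The theorem below is stated in terms of sets of (targeted) CEs and AEs within an ε-environment of x. The precise definitions are given in Appendix A (not reproduced here); a plausible reading, consistent with the definitions of Section 3.3 and writing y_true(x′) for the ground-truth label of x′, is

\[
\begin{aligned}
CE(x,\epsilon,\|\cdot\|) &= \{\, x' \in I : f(x') \neq f(x),\ \|x'-x\| \leq \epsilon \,\},\\
AE(x,\epsilon,\|\cdot\|) &= \{\, x' \in CE(x,\epsilon,\|\cdot\|) : f(x') \neq y_{\mathrm{true}}(x') \,\},
\end{aligned}
\]

with the targeted variants TCE and TAE additionally requiring f(x′) = y′, and the ε-free variants taking the minimizers of ‖x′ − x‖ instead of all points in an ε-ball. Under this reading, the inclusions below follow directly, since an AE satisfies every defining condition of a CE plus the misclassification constraint.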
Theorem (Every (targeted, ε-) AE is a (targeted, ε-) CE). For all x ∈ I, f(x) ≠ y′ ∈ O, ε > 0, and distance measures ‖·‖, it holds that:
i) AE(x, ε, ‖·‖) ⊆ CE(x, ε, ‖·‖)
ii) TAE(x, y′, ε, ‖·‖) ⊆ TCE(x, y′, ε, ‖·‖)
iii) AE(x, ‖·‖) ⊆ CE(x, ‖·‖)
iv) TAE(x, y′, ‖·‖) ⊆ TCE(x, y′, ‖·‖)
This means that (targeted) AEs are (targeted) CEs that are misclassified. As we show, this holds also for AEs and CEs in a given ε-environment. However, it is important to point out that non-targeted attacks are common, while non-targeted counterfactuals are rather rare. Moreover, some formulations as optimization problems already encode the respective aims, as in the case of, e.g., Dandl et al. (2020); Van Looveren and Klaise (2019) for CEs and, e.g., Carlini and Wagner (2017) for AEs. For formulations that are targeted and where the respective aims are not encoded in the optimization problem, transfers between the fields are permissible. Interestingly, we do not necessarily need to formulate an optimization problem to generate AEs (Goodfellow et al., 2015). Direct generation methods are theoretically also possible for counterfactuals, even though the generated CEs will be much harder to justify conceptually.
The different aims are mostly not encoded in the optimization problem but in the distance measure. For that reason, we find the most significant differences between the fields if we consider the distance measures. However, there are also similarities to be found. Notions of distance that realize sparsity show commonalities with those that realize imperceptibility. A change in few among lots of features is often difficult to spot (Su et al., 2019), especially when the change is not highlighted. Distance measures that favor sparsity can, therefore, be desirable to transfer between the fields. Moreover, distributed changes to achieve the imperceptibility of AEs in, e.g., images are not per se irrelevant for CEs. The sparsity of CEs is only relevant for changes in interpretable features. For non-interpretable features, distributed changes can often be described as sparse changes in more abstract interpretable features.
The CE aim of informativeness and the AE aim of imperceptibility also align, and Section 4 discusses this in further detail. Changing unexpected but effective features will often lead to imperceptible changes. As people assume these features to be non-effective, they pay little attention to them. The same holds vice versa. Hence, we can expect fruitful transfers. Feasibility, by contrast, counteracts the goal of misclassification. Feasibility in CEs requires that the explainee can realistically reach the alternative data point generated. Realistic data points are those that are generally well represented in the training data. Hence, the algorithm usually performs well in such cases. However, it may be possible to reverse distance measures that encode feasibility to aid in misclassification.
Furthermore, there are cases where feasibility can be relevant for AEs, such as anomaly detection or the generation of realistic AEs.
Due to their similarity in the optimization problem, the two approaches also use similar solution methods. We can observe this parallelism in the development of the fields. Both started with gradient-based methods, proceeded with evolutionary algorithms, and then considered surrogate models. If applicable, solution methods developed for CEs are also suited to generate AEs. The opposite direction might be more problematic. For CEs, approximately good solutions are often not good enough because they lead to bad/misleading explanations. This problem becomes particularly apparent when we look at the surrogate model approaches that are highly popular among AE researchers. If the surrogate model is not faithful enough to the original model, the generated CEs will end up being wrong and, in the worst case, misleading. (Interestingly, the two areas concentrate on different solution methods: while the literature on AEs mainly discusses white-box solvers, the literature on CEs focuses on black-box solvers. This observation is surprising since AEs are usually considered from an attacker's perspective without access to the model, while CEs are generated by the model engineers. A look at the use cases explains this paradox: for low-dimensional tabular data and standard algorithms, simple black-box attacks are perfectly feasible; for high-dimensional image data and deep convolutional neural networks, on the other hand, black-box attacks explode computationally.)
It is already noteworthy that both fields face selection problems. Moreover, in both approaches, the CEs/AEs among all the solutions to the optimization problem must be selected. Nevertheless, the differences prevail. First, for AEs, an initial input has to be selected, whereas for CEs, this input is given by the end-user. Second, the solution space of non-targeted AEs contains vectors from different classes, not one as for CEs. Third, the number of presentable CEs is limited by humans' capacity to process information, while the number of AEs to try out is unlimited.

Figure 2: On the left-hand side, there is the counterfactual realm, and on the right-hand side the corresponding adversarial concepts. Solid arrows between two items mean that a transfer is allowed in that direction. Dashed arrows mean that a transfer is possible under additional conditions, specified below the arrows.
4. The Two Types of CEs
Until now, we have discussed the similarities, differences, and possible transfers between CEs and AEs. However, we left out the relationship at the level of individual instances. Does a good CE make a good AE or vice versa? In this section, we present a conceptual division into two types of CEs. We call them feasible CEs and contesting CEs. Feasible CEs are reasonable explanations that allow for deriving future actions. Contesting CEs are explanations that provide a basis for challenging an automated decision. AEs link closely to the latter. We believe that defining and analyzing these two types clarifies existing misunderstandings regarding the relationship between CEs and AEs. In our analysis, we will look at several real-world scenarios.
For all the presented scenarios, we presuppose the following:
• There is a trained supervised-learning algorithm represented by a function f : I → O mapping a vector x from a (potentially high-dimensional) input space I to a vector y in an output space O.
Thiscausal graph will help us in depicting the relation between feasible and contesting CEs in di ff erentscenarios. The setting is inspired by Ballet et al. (2019) who built AEs for tabular data.salarydogs loan Figure 3: The causal graph contains three variables, salary, the number of dogs, and a binary variable for the loan application status.
Both types of CEs we introduce here indicate the features the algorithm finds relevant for the decision process. However, they differ in the kind of features they change. Feasible CEs permute causally relevant features. Contesting CEs, on the other side, point out which causally irrelevant features played a role in the decision process. We also discuss mixed-type CEs with both causally relevant and irrelevant features permuted.

4.1. Feasible CEs

To make the division maximally clear, we consider a scenario where only feasible CEs exist. These cases occur in the presence of perfect algorithms. A perfect algorithm describes a case where the algorithm's decisions match the ground truth in all scenarios where such a ground truth exists. That is, if we consider an input x ∈ I, then f(x) is correctly classified. In this case, no AE exists since misclassification is a necessary condition for an AE. CEs, on the other side, do exist.

Scenario.
Assume the algorithm f is perfect. Thus, for any combination of the number of dogs and the salary for which a ground truth exists, the algorithm maps exactly to that ground truth. Hence, the algorithm learned that all that is relevant for loan acceptance is the salary. Given the salary surpasses a certain threshold t, the algorithm grants the loan. If the applicant has a salary s below t and a given number of dogs d, the counterfactual vector would be (t, d). The corresponding CE would be:
• If P's salary was (t − s)€ higher, her loan application would have been accepted.
This explanation would indeed be good because it aids understanding and guides future actions. These are precisely the functions of feasible CEs - they permute causally relevant properties to the right amount.
AEs do not exist for perfect algorithms. As we will see later, contesting CEs are just like AEs. Thus, we need to take the step from the exceptional case of perfect algorithms to imperfect algorithms.
4.2. Contesting CEs

Imperfect algorithms make some assignments that do not match the ground truth. There are two kinds of reasons for this. First, classical reasons such as over-/under-fitting, biased/lacking training data, missing features, etc. (Bishop, 2006). The second kind of reason is more principled than the first. Supervised learning algorithms lack the ability to distinguish between causes and correlations (Pearl and Mackenzie, 2018). Therefore, variables that only correlate but have no causal relationship with the target variable do, in fact, impact the classification. While there are various ways to get rid of the first kind of problems (Claeskens et al., 2008; Good and Hardin, 2012; Jabbar and Khan, 2015), getting causality into supervised learning is much harder (Schölkopf, 2019). Both kinds of reasons lead to classifications that mismatch the ground truth. Thus, in both cases, some features have an undesired impact on the target variable. This mismatch can be used in creating good AEs but also good CEs. The following example will show a scenario where only contesting CEs exist but no feasible CEs.

Scenario.
For the sake of argument, assume that a bank collected data from the members of two clubs. The first club is a dog club in Zurich (Switzerland), and the second is an animal protection club in Ukraine. It is clear that this data collection is biased. Let us also assume that the model trained by the bank is a single-layer decision tree. The algorithm has learned that the number of dogs is the only important feature for deciding on a loan application. If a person has two or more dogs, the algorithm offers the loan.
Assume the loan applicant has a low salary s and one dog. In this case, the loan application would be rejected. This decision would be correct according to the ground truth since the salary was too low. However, the algorithm is right for the wrong reasons. It rejects the application because the applicant does not have at least two dogs. A CE, in this case, would be:
• If P had one more dog, her loan application would have been accepted.
(This again points out that the class of CEs is broader than that of AEs; see Appendix A and B.)
This is a contesting CE: it reveals that the causally irrelevant number of dogs drove the decision. Moreover, the corresponding counterfactual vector (s, 2) is misclassified, so it is also an AE and can even have the same function - deceiving the system.
4.3. Mixed CEs

Usually, we neither deal with the perfect algorithms from Section 4.1 nor the terrible algorithms of Section 4.2. Instead, we often deal with algorithms that mostly focus on relevant features to the right amount but sometimes make misclassifications. In such scenarios, we can have feasible, contesting, and also mixed CEs.
Scenario.
Consider again an imperfect algorithm, as discussed in Section 4.2. However, this time, the model selection and the data collection were carried out more carefully, and all potential fallacies of the first kind have been avoided. As a result, the algorithm has learned that the salary is relevant for loan acceptance. Also, dogs are expensive, and therefore only people with a comparatively high salary can afford dogs. Since the algorithm only matches patterns and cannot tell non-causal dependencies and actual causes apart, it will learn that the number of dogs is (slightly) relevant for loan acceptance.
Now, consider an applicant with a salary s very close to reaching the decision boundary t of loan acceptance and zero dogs. In accordance with the ground truth, the bank rejects the loan application since s is below t. However, the algorithm is not perfect. It learned that, additionally to the salary, the number of dogs is marginally relevant for the applicant obtaining the loan. There is potentially a variety of CEs. Three possible CEs could be:
i) If P's salary was (t − s)€ higher, her loan application would have been accepted.
ii) If P had two more dogs, her loan application would have been accepted.
iii) If P's salary was (t − s)/2€ higher and she had one more dog, her loan application would have been accepted.
All of them would be good CEs. All of them would provide information for a better understanding of the algorithm. i) would be a feasible CE and point to the most relevant feature that is also causally relevant. ii) is a contesting CE. It points to a causally irrelevant feature that is important according to the algorithm. Moreover, it is the same vector as one possible good AE. iii) is a mixed-type CE. It gives information about the most important feature but at the same time also about a secondary feature that should not matter but does. Similar to contesting CEs, it allows contestability but potentially also feasibility. It would also be an AE; however, not a good one, because even though it is misclassified, it shows perceptible changes in the salary feature.
These examples raise the question of whether every good AE makes a good contesting CE and vice versa. Indeed, the two classes have a significant overlap. First, they both share the potential function of deception. Second, both provide grounds to contest the judgment of a machine learning algorithm. One difference is that, as all CEs do, contesting CEs highlight the changes. AEs, on the other hand, try to hide the changes as well as possible.
Every contesting CE is an AE, as it must be misclassified to contest the decision for justified reasons. If there is a misclassification, there will potentially be contexts in which an attacker can exploit this bug. Hence, the most interesting contesting CEs will also be good AEs.
What about the opposite direction? There might be cases where a vector is a good AE but a not so good contesting CE. This case occurs when many causally irrelevant features are changed to achieve an alternative classification. However, it is unclear whether sparsity is a mandatory prerequisite for a good CE. In particular, if the agent aims to deceive a system via a contesting CE, she might not care too much about sparsity. Also, distributed changes in AEs are mainly relevant in the context of computer vision. However, as we have mentioned, for CEs, the interpretability of features is essential.
In the case of images, one could argue that a minor change in all features means a change in only one interpretable feature, namely the image's coloration. The change of coloration as the only feature is, in fact, sparse, and it can rightly be argued that it is not causally relevant for classification. Thus, contesting CEs and AEs have at least a considerable overlap, and it is difficult to find convincing cases that fit in one class but not in the other.
Many recently proposed papers on CEs have focused on feasibility and actionability for generating counterfactuals. They mostly achieved this aim by incorporating causal domain knowledge (Poyiadzi et al., 2020; Mahajan et al., 2019; Karimi et al., 2020). Since explanations should guide our future actions, this indeed makes sense. However, if an algorithm uses questionable features in its decisions or misuses features, we may want explanations that faithfully reflect such flaws. In such cases, we are interested in contesting CEs (which resemble AEs), which point to causally irrelevant features that have influenced the learning algorithm's decision.
Summarized, we can say that distance measures used to build good contesting CEs (AEs) assign small values to changes in impactful but causally irrelevant features. This setting implies that only causally irrelevant features are changed, since we minimize the distance to the alternative input. Adequate distance measures for feasible CEs, on the other hand, assign low values to causally relevant and actionable features, whereby these features are altered strongest. Hence, we can say that there are at least two classes of interesting CEs that can, in some cases, be mixed. Class one are feasible CEs. They guide the explainee's future actions by the given recommendations and stand in accordance with the real-world causal structure. Class two are contesting CEs. They allow us to contest the decisions of algorithms or deceive them, just like AEs.
Can this idea of the two complementary approaches be transferred from tabular data to images? We believe that this is possible. The only difference is that the features we find causally relevant for the classification are composed of the input features the algorithm receives, namely pixels. Changing all pixels a little bit or a few pixels strongly is potentially correlated with a different classification; however, it is causally irrelevant. This idea relates to the paper of Ilyas et al. (2019), where they discuss predictive but non-robust features. They argue that these features are the reason for the occurrence of AEs. We think that one important subclass of such features are correlated, non-causal features.
5. Discussion
CEs and AEs are strongly related approaches. Our conceptual comparison has shown that their commonalities go deeper than the mathematical similarity alone. Did our analysis shed light on the three misconceptions discussed in Section 2?
The first misconception was to consider CE and AE as synonyms. Our analysis has shown that every (targeted) AE is a (targeted) CE, but not vice versa (for an example of a CE that is not an AE, see Appendix B). The essential difference is that AEs, by definition, must
be misclassified, whereas CEs are in this respect agnostic. The second misconception was to consider feasible CEs as the only relevant type of CEs. We showed that contesting CEs are another important type of CEs different from feasible CEs. The difference in function between the two types appears in a difference in their notion of similarity between inputs. Contesting CEs target misclassified inputs and therefore show remarkable similarity with AEs. The third misconception concerned unjustified or missing transfers between the fields. We discussed under which conditions fruitful interactions are possible. While transfers of the optimization problem or the solution methods are mostly permissible, transfers of the respective distance notions are more demanding. Specifically, we argued that feasibility and misclassification are contrary aims, whereas informativeness and imperceptibility go well together.
6. Outlook
In addition to clarifying some misconceptions, this paper opens various directions for future research. First and foremost, it points to a preference-based selection of feasible and contesting CEs, including degrees of feasibility/contestability. Second, as suggested, many concepts from AEs can be transferred and used in generating CEs. Such transfers will be especially beneficial where the domains show greater overlap, e.g., AEs for tabular data and CEs for image/audio data. Conceptually, the relation between the paradigm of supervised learning and the transferability of AEs needs further research.
Acknowledgements
Funding: This work was supported by the Graduate School of Systemic Neuroscience (GSN) of the LMU Munich. Big thanks to Stephan Hartmann, Christoph Molnar, Gunnar König, and the GSN Neurophilosophy group for their helpful comments on the manuscript, the fruitful discussions about the concepts, and their hints to related literature.
References
Alzantot, M., Sharma, Y., Chakraborty, S., Zhang, H., Hsieh, C.J., Srivastava, M.B., 2019. GenAttack: Practical black-box attacks with gradient-free optimization, in: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1111–1119.
Asher, N., Paul, S., Russell, C., 2020. Adequate and fair explanations. arXiv preprint arXiv:2001.07578.
Athalye, A., Engstrom, L., Ilyas, A., Kwok, K., 2018. Synthesizing robust adversarial examples, in: International Conference on Machine Learning, PMLR, pp. 284–293.
Bach, F., 2010. Sparse methods for machine learning, in: Tutorial of IEEE-CS Conference on Computer Vision and Pattern Recognition (CVPR).
Balda, E.R., Behboodi, A., Mathar, R., 2019. Perturbation analysis of learning algorithms: generation of adversarial examples from classification to regression. IEEE Transactions on Signal Processing 67, 6078–6091.
Ballet, V., Renard, X., Aigrain, J., Laugel, T., Frossard, P., Detyniecki, M., 2019. Imperceptible adversarial attacks on tabular data. arXiv preprint arXiv:1911.03274.
Barocas, S., Selbst, A.D., Raghavan, M., 2020. The hidden assumptions behind counterfactual explanations and principal reasons, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, NY, USA, pp. 80–89. URL: https://doi.org/10.1145/3351095.3372830.
Beaney, M., 2018. Analysis, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy. Summer 2018 ed., Metaphysics Research Lab, Stanford University.
Bekoulis, G., Deleu, J., Demeester, T., Develder, C., 2018. Adversarial training for multi-context joint entity and relation extraction. arXiv preprint arXiv:1808.06876.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J., 2017. Adversarial patch. arXiv preprint arXiv:1712.09665.
Carlini, N., Katz, G., Barrett, C., Dill, D.L., 2018. Ground-truth adversarial examples. URL: https://openreview.net/forum?id=Hki-ZlbA-.
Carlini, N., Wagner, D., 2017. Towards evaluating the robustness of neural networks, in: 2017 IEEE Symposium on Security and Privacy, IEEE, pp. 39–57.
Carnap, R., 1998. Der logische Aufbau der Welt. volume 514. Felix Meiner Verlag.
Chabris, C.F., Simons, D.J., 2010. The Invisible Gorilla: And Other Ways Our Intuitions Deceive Us. Harmony.
Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J., 2017. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, in: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26.
Claeskens, G., Hjort, N.L., et al., 2008. Model Selection and Model Averaging. Cambridge Books.
Dalvi, N., Domingos, P., Sanghai, S., Verma, D., 2004. Adversarial classification, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 99–108.
Dandl, S., Molnar, C., Binder, M., Bischl, B., 2020. Multi-objective counterfactual explanations. arXiv preprint arXiv:2004.11165.
Doshi-Velez, F., Kim, B., 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Elsayed, G., Shankar, S., Cheung, B., Papernot, N., Kurakin, A., Goodfellow, I., Sohl-Dickstein, J., 2018. Adversarial examples that fool both computer vision and time-limited humans, in: Advances in Neural Information Processing Systems, pp. 3910–3920.
Fernández-Loría, C., Provost, F., Han, X., 2020. Explaining data-driven decisions made by AI systems: The counterfactual approach. arXiv:2001.07417.
Goldstein, A., Kapelner, A., Bleich, J., Pitkin, E., 2015. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24, 44–65.
Good, P.I., Hardin, J.W., 2012. Common Errors in Statistics (and How to Avoid Them). John Wiley & Sons.
Goodfellow, I., Shlens, J., Szegedy, C., 2015. Explaining and harnessing adversarial examples, in: International Conference on Learning Representations. URL: http://arxiv.org/abs/1412.6572.
Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., Giannotti, F., 2018. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820.
Guo, C., Gardner, J.R., You, Y., Wilson, A.G., Weinberger, K.Q., 2019. Simple black-box adversarial attacks. arXiv preprint arXiv:1905.07121.
Ignatiev, A., Narodytska, N., Marques-Silva, J., 2019. On relating explanations and adversarial examples, in: Advances in Neural Information Processing Systems, pp. 15883–15893.
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., Madry, A., 2019. Adversarial examples are not bugs, they are features, in: Advances in Neural Information Processing Systems, pp. 125–136.
Ioannou, C.I., Pereda, E., Lindsen, J.P., Bhattacharya, J., 2015. Electrical brain responses to an auditory illusion and the impact of musical expertise. PLoS One 10.
Jabbar, H., Khan, R.Z., 2015. Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Computer Science, Communication and Instrumentation Devices.
Jehee, J.F., Brady, D.K., Tong, F., 2011. Attention improves encoding of task-relevant features in the human visual cortex. Journal of Neuroscience 31, 8210–8219.
Kahneman, D., Slovic, S.P., Slovic, P., Tversky, A., 1982. Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press.
Kanamori, K., Takagi, T., Kobayashi, K., Arimura, H., 2020. DACE: Distribution-aware counterfactual explanation by mixed-integer linear optimization, in: Bessiere, C. (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, pp. 2855–2862.
Karimi, A.H., Schölkopf, B., Valera, I., 2020. Algorithmic recourse: from counterfactual explanations to interventions, in: 37th International Conference on Machine Learning (ICML).
Kusner, M.J., Loftus, J., Russell, C., Silva, R., 2017. Counterfactual fairness, in: Advances in Neural Information Processing Systems, pp. 4066–4076.
Laugel, T., Lesot, M.J., Marsala, C., Renard, X., Detyniecki, M., 2019. The dangers of post-hoc interpretability: Unjustified counterfactual explanations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, pp. 2801–2807. URL: https://doi.org/10.24963/ijcai.2019/388.
Lewis, D.K., 1973. Counterfactuals. Blackwell.
Lu, J., Issaranon, T., Forsyth, D., 2017. SafetyNet: Detecting and rejecting adversarial examples robustly, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 446–454.
Mahajan, D., Tan, C., Sharma, A., 2019. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv preprint arXiv:1912.03277.
Menzies, P., Beebee, H., 2019. Counterfactual theories of causation, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy. Winter 2019 ed., Metaphysics Research Lab, Stanford University.
Miller, T., 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267, 1–38.
Molnar, C., 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
Molnar, C., König, G., Herbinger, J., Freiesleben, T., Dandl, S., Scholbeck, C.A., Casalicchio, G., Grosse-Wentrup, M., Bischl, B., 2020. Pitfalls to avoid when interpreting machine learning models. arXiv:2007.04131.
Moore, J., Hammerla, N., Watkins, C., 2019. Explaining deep learning models with constrained adversarial examples, in: Pacific Rim International Conference on Artificial Intelligence, Springer, pp. 43–56.
Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P., 2017. Universal adversarial perturbations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773.
Mothilal, R.K., Sharma, A., Tan, C., 2020. Explaining machine learning classifiers through diverse counterfactual explanations, in: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency.
Navalpakkam, V., Itti, L., 2005. Modeling the influence of task on attention. Vision Research 45, 205–231.
Páez, A., 2019. The pragmatic turn in explainable artificial intelligence (XAI). Minds and Machines 29, 441–459.
Papernot, N., McDaniel, P., Goodfellow, I., 2016a. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A., 2017. Practical black-box attacks against machine learning, in: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519.
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A., 2016b. The limitations of deep learning in adversarial settings, in: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, pp. 372–387.
Pearl, J., 2009. Causality. Cambridge University Press.
Pearl, J., Mackenzie, D., 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
Poyiadzi, R., Sokol, K., Santos-Rodriguez, R., De Bie, T., Flach, P., 2020. FACE: Feasible and actionable counterfactual explanations, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 344–350.
Radford, A., Metz, L., Chintala, S., 2016. Unsupervised representation learning with deep convolutional generative adversarial networks, in: Bengio, Y., LeCun, Y. (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. URL: http://arxiv.org/abs/1511.06434.
Reutlinger, A., 2018. Extending the counterfactual theory of explanation. pp. 74–95.
Ribeiro, M.T., Singh, S., Guestrin, C., 2016. Why should I trust you?: Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1135–1144.
Rozsa, A., Rudd, E.M., Boult, T.E., 2016. Adversarial diversity and hard positive generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32.
Rudin, C., 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 206–215.
Russell, B., 1905. On denoting. Mind 14, 479–493.
Russell, C., 2019. Efficient search for diverse coherent explanations, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, NY, USA, pp. 20–28. URL: https://doi.org/10.1145/3287560.3287569.
Sabour, S., Cao, Y., Faghri, F., Fleet, D.J., 2016. Adversarial manipulation of deep representations, in: Bengio, Y., LeCun, Y. (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. URL: http://arxiv.org/abs/1511.05122.
Schölkopf, B., 2019. Causality for machine learning. arXiv preprint arXiv:1911.10500.
Sharma, S., Henderson, J., Ghosh, J., 2020. CERTIFAI: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. URL: http://dx.doi.org/10.1145/3375627.3375812.
Sokol, K., Flach, P.A., 2019. Counterfactual explanations of machine learning predictions: Opportunities and challenges for AI safety, in: Proceedings of the AAAI Workshop on Artificial Intelligence Safety.
Starr, W., 2019. Counterfactuals, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy. Fall 2019 ed., Metaphysics Research Lab, Stanford University.
Štrumbelj, E., Kononenko, I., 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 647–665.
Stutz, D., Hein, M., Schiele, B., 2019. Confidence-calibrated adversarial training: Generalizing to unseen attacks. arXiv preprint arXiv:1910.06259.
Su, J., Vargas, D.V., Sakurai, K., 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23, 828–841.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R., 2014. Intriguing properties of neural networks, in: International Conference on Learning Representations. URL: http://arxiv.org/abs/1312.6199.
Tomsett, R., Widdicombe, A., Xing, T., Chakraborty, S., Julier, S., Gurram, P., Rao, R., Srivastava, M., 2018. Why the failure? How adversarial examples can provide insights for interpretable machine learning, in: 21st International Conference on Information Fusion (FUSION), IEEE, pp. 838–845.
Ustun, B., Spangher, A., Liu, Y., 2019. Actionable recourse in linear classification, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10–19.
Van Looveren, A., Klaise, J., 2019. Interpretable counterfactual explanations guided by prototypes. arXiv preprint arXiv:1907.02584.
Wachter, S., Mittelstadt, B., Russell, C., 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31, 841.
Wheeler, G., 2020. Bounded rationality, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy. Spring 2020 ed., Metaphysics Research Lab, Stanford University.
Yuan, X., He, P., Zhu, Q., Li, X., 2019. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems 30, 2805–2824.
Appendix A. Formal Proof for AEs ⊆ CEs
In this section, we consider the relation between CEs and AEs in purely mathematical terms. For all the following, assume there is a function f : I → O mapping a vector x from a (potentially high-dimensional) input space I to a vector y in an output space O.
Definition. Let x ∈ I, f(x) = y ∈ O, and y′ ∈ O.
• We call x′_x ∈ I an alternative to x if y ≠ f(x′_x).
• We call x′_{x,y′} ∈ I a targeted alternative to x if f(x′_{x,y′}) = y′ ≠ y.
Definition. A distance measure ‖·‖ on a space I is defined as a function ‖·‖ : I → ℝ⁺ ∪ {∞}.
Definition. Let x ∈ I be a vector, x′_x ∈ I be an alternative vector, ε > 0, and ‖·‖ be a distance measure on I.
• We call x′_{x,ε} an ε-alternative to x with respect to ‖·‖ if ‖x − x′_x‖ < ε.
• We call x′_{x,y′,ε} a targeted-ε-alternative to x with respect to ‖·‖ and target class y′ if x′_{x,y′,ε} is a targeted alternative and an ε-alternative.
Definition. Let x ∈ I, f(x) ≠ y′ ∈ O, ε > 0, and ‖·‖ be a distance measure on I.
• Let A(x) be the set of all alternatives to x.
• Let TA(x, y′) be the set of all targeted alternatives to x with target class y′.
• Let A(x, ε, ‖·‖) be the set of all ε-alternatives to x with respect to ‖·‖.
• Let TA(x, y′, ε, ‖·‖) be the set of all targeted ε-alternatives to x with respect to ‖·‖ and the target class y′.
Theorem. For all x ∈ I, ε > 0, f(x) ≠ y′ ∈ O, and distance measures ‖·‖ on I it holds that:
i) A(x, ε, ‖·‖) ⊆ A(x)
ii) TA(x, y′, ε, ‖·‖) ⊆ TA(x, y′)
iii) TA(x, y′, ε, ‖·‖) ⊆ A(x, ε, ‖·‖)
iv) TA(x, y′) ⊆ A(x)
v) ∀ δ > ε: A(x, ε, ‖·‖) ⊆ A(x, δ, ‖·‖)
vi) ∀ δ > ε: TA(x, y′, ε, ‖·‖) ⊆ TA(x, y′, δ, ‖·‖)
Proof.
i) Let x ∈ I, ε > 0, ‖·‖, and z ∈ A(x, ε, ‖·‖) be arbitrary. Then f(z) ≠ f(x). Thus z ∈ A(x).
ii) Let x ∈ I, ε > 0, f(x) ≠ y′ ∈ O, ‖·‖, and z ∈ TA(x, y′, ε, ‖·‖) be arbitrary. Then f(z) = y′ ≠ f(x). Thus z ∈ TA(x, y′).
iii) Let x ∈ I, ε > 0, f(x) ≠ y′ ∈ O, ‖·‖, and z ∈ TA(x, y′, ε, ‖·‖) be arbitrary. Then f(z) = y′ ≠ f(x) and ‖x − z‖ < ε. Thus z ∈ A(x, ε, ‖·‖).
iv) Let x ∈ I, f(x) ≠ y′ ∈ O, and z ∈ TA(x, y′) be arbitrary. Then f(z) = y′ ≠ f(x). Thus z ∈ A(x).
v) Let x ∈ I, ε < δ, ‖·‖ be arbitrary and z ∈ A(x, ε, ‖·‖). Then, by definition, ‖x − z‖ < ε and f(z) ≠ f(x). Together with transitivity in the real numbers it follows that ‖x − z‖ < δ. Thus z ∈ A(x, δ, ‖·‖).
vi) Let x ∈ I, f(x) ≠ y′ ∈ O, ε < δ, ‖·‖ be arbitrary and z ∈ TA(x, y′, ε, ‖·‖). Then, by definition, ‖x − z‖ < ε and f(z) = y′ ≠ f(x). Together with transitivity in the real numbers it follows that ‖x − z‖ < δ. Thus z ∈ TA(x, y′, δ, ‖·‖). ∎
Definition. Let x ∈ I, ε > 0, and ‖·‖ be a distance measure on I.
• We call x_c a non-targeted ε counterfactual to x with respect to ‖·‖ if x_c ∈ A(x, ε, ‖·‖). We call x_c a non-targeted counterfactual if it is a non-targeted ε counterfactual and for all δ < ε it holds that A(x, δ, ‖·‖) = ∅.
• We call x_{c,y′} a targeted ε counterfactual to x with respect to ‖·‖ and target class y′ if x_{c,y′} ∈ TA(x, y′, ε, ‖·‖). We call x_{c,y′} a targeted counterfactual if it is a targeted ε counterfactual and for all δ < ε it holds that TA(x, y′, δ, ‖·‖) = ∅.
Definition. Let x ∈ I, f(x) ≠ y′ ∈ O, and ‖·‖ be a distance measure:
• We call x_a a non-targeted (ε) adversarial example to x with respect to ‖·‖ if x_a is a non-targeted (ε) counterfactual and there exists a ground truth y_GT ∈ O for x_a such that f(x_a) ≠ y_GT.
• We call x_{a,y′} a targeted (ε) adversarial example to x with respect to ‖·‖ and target class y′ if x_{a,y′} is a targeted (ε) counterfactual and there exists a ground truth y_GT ∈ O for x_{a,y′} such that y′ = f(x_{a,y′}) ≠ y_GT.
Definition. Let x ∈ I, f(x) ≠ y′ ∈ O, ε > 0, and ‖·‖ be a distance measure on I.
• Let CE(x, ε, ‖·‖) be the set of all non-targeted ε counterfactuals to x with respect to ‖·‖. Let CE(x, ‖·‖) be the set of all non-targeted counterfactuals to x with respect to ‖·‖.
• Let TCE(x, y′, ε, ‖·‖) be the set of all targeted ε counterfactuals to x with respect to ‖·‖ and target class y′. Let TCE(x, y′, ‖·‖) be the set of all targeted counterfactuals to x with respect to ‖·‖ and target class y′.
• Let AE(x, ε, ‖·‖) be the set of all non-targeted ε adversarial examples to x with respect to ‖·‖. Let AE(x, ‖·‖) be the set of all non-targeted adversarial examples to x with respect to ‖·‖.
• Let TAE(x, y′, ε, ‖·‖) be the set of all targeted ε adversarial examples to x with respect to ‖·‖ and target class y′. Let TAE(x, y′, ‖·‖) be the set of all targeted adversarial examples to x with respect to ‖·‖ and target class y′.
Theorem (Every (targeted ε) AE is a (targeted ε) CE). For all x ∈ I, f(x) ≠ y′ ∈ O, ε > 0, and distance measures ‖·‖ it holds that:
i) AE(x, ε, ‖·‖) ⊆ CE(x, ε, ‖·‖)
ii) TAE(x, y′, ε, ‖·‖) ⊆ TCE(x, y′, ε, ‖·‖)
iii) AE(x, ‖·‖) ⊆ CE(x, ‖·‖)
iv) TAE(x, y′, ‖·‖) ⊆ TCE(x, y′, ‖·‖)
Proof.
All statements follow directly from the definition of adversarial examples given above. ∎
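As an informal illustration of these definitions and of the theorem (not part of the original proof), the following sketch turns the non-targeted case into predicates and spot-checks the inclusion AE(x, ε, ‖·‖) ⊆ CE(x, ε, ‖·‖) on random samples. The classifier f, the ground-truth rule, and the sampling scheme are assumptions chosen only for illustration.

```python
# A minimal sketch (not part of the appendix): the non-targeted definitions as
# predicates plus a random spot-check of AE(x, eps, ||.||) ⊆ CE(x, eps, ||.||).
# The classifier f, the ground-truth rule, and the sampling are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Learned classifier on R^2: leaks a small weight onto the second feature.
    return int(x[0] + 0.1 * x[1] >= 1.0)

def ground_truth(x):
    # True labelling rule: only the first feature matters.
    return int(x[0] >= 1.0)

def dist(x, z):
    return float(np.linalg.norm(x - z, ord=1))

def is_eps_counterfactual(x, z, eps):
    # Non-targeted eps counterfactual: different prediction than x, within eps.
    return f(z) != f(x) and dist(x, z) < eps

def is_eps_adversarial(x, z, eps):
    # Non-targeted eps adversarial example: additionally misclassified with
    # respect to the ground truth.
    return f(z) != f(x) and dist(x, z) < eps and f(z) != ground_truth(z)

eps = 0.5
n_ce = n_ae = 0
for _ in range(10_000):
    x = rng.uniform(0.0, 2.0, size=2)
    z = x + rng.uniform(-eps, eps, size=2)  # a nearby candidate point
    if is_eps_adversarial(x, z, eps):
        assert is_eps_counterfactual(x, z, eps)  # the inclusion AE ⊆ CE
        n_ae += 1
    if is_eps_counterfactual(x, z, eps):
        n_ce += 1
print(f"eps-counterfactuals: {n_ce}, eps-adversarial examples among them: {n_ae}")
```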
This theorem is not true if we consider an ε-environment for (non-)targeted CEs and a δ-environment for (non-)targeted AEs where ε ≠ δ.
Appendix B. Example CE but not AE