Counterfactual State Explanations for Reinforcement Learning Agents via Generative Deep Learning
Matthew L. Olson∗, Roli Khanna, Lawrence Neal, Fuxin Li, Weng-Keen Wong
Oregon State University, OR, USA
Abstract
Counterfactual explanations, which deal with “why not?” scenarios, can provide insightful explanations to an AI agent’s behavior [Miller, 2019]. In this work, we focus on generating counterfactual explanations for deep reinforcement learning (RL) agents which operate in visual input environments like Atari. We introduce counterfactual state explanations, a novel example-based approach to counterfactual explanations based on generative deep learning. Specifically, a counterfactual state illustrates what minimal change is needed to an Atari game image such that the agent chooses a different action. We also evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our first user study investigates if humans can discern if the counterfactual state explanations are produced by the actual game or produced by a generative deep learning approach. Our second user study investigates if counterfactual state explanations can help non-expert participants identify a flawed agent; we compare against a baseline approach based on a nearest neighbor explanation which uses images from the actual game. Our results indicate that counterfactual state explanations have sufficient fidelity to the actual game images to enable non-experts to more effectively identify a flawed RL agent compared to the nearest neighbor baseline and to having no explanation at all.
Keywords:
Deep Learning, Reinforcement Learning, Explainable AI, Interpretable AI

∗ Corresponding author at: 1148 Kelley Engineering Center, Corvallis, OR 97331-5501, USA. Tel.: +1 541 737 3617.
Email address: [email protected] (Matthew L. Olson)
Figure 1: A counterfactual example in the game of Space Invaders that demonstrates an agent’s action changing by the removal of an enemy. Left: The game state in which an agent takes action “move left and shoot”. Right: The counterfactual state where the agent will take the action “move right”.
1. Introduction
Despite the impressive advances made by deep reinforcement learning (RL) agents, their decision-making process is challenging for humans to understand. This limitation is a serious concern for settings in which trust and reliability are critical, and deploying RL agents in these settings requires ensuring that they are making decisions for the right reasons. To solve this problem, researchers are developing techniques to provide human-understandable answers to explanatory questions about the agent’s decision-making.

Explanatory questions can be classified into three types [Miller, 2019, Pearl and Mackenzie, 2018]: “What?” (associative reasoning), “How?” (interventionist reasoning) and “Why?” (counterfactual reasoning). Of the three types, “Why?” questions are the most challenging, as they require counterfactual reasoning [Lewis, 1973, Wachter et al., 2017], which involves reasoning about alternate outcomes that have not happened; counterfactual reasoning in turn requires both associative and interventionist reasoning [Miller, 2019]. In our work, we present a counterfactual explanation method to tackle the “Why?” question in Miller’s classification. More specifically, we answer the “Why not?” question by using a deep generative model that can visually change the current state to produce alternate outcomes.

Underlying a RL agent is the mathematical framework of a Markov Decision Process (MDP) [Puterman, 1994], which models an agent making a sequence of decisions as it interacts with a stochastic environment. In the notation to follow in this section and in the rest of the manuscript, vectors, matrices and sets are in boldface while scalars are not. Formally, a MDP is a tuple (𝑺, A, 𝑇, 𝑅, 𝛾), where 𝑺 is a set of states, A is a set of actions, 𝑇(𝒔′, 𝒔, 𝑎) is a transition function capturing the probability of moving from state 𝒔 to 𝒔′ when action 𝑎 is performed in state 𝒔, 𝑅(𝒔, 𝑎) is a reward function returning a reward for being in state 𝒔 and performing action 𝑎, and 𝛾 is a discount factor (where 0 ≤ 𝛾 ≤ 1) which weights the importance of future rewards.

Using the MDP framework, we introduce the concept of a counterfactual state as a counterfactual explanation. More precisely, for an agent in state 𝒔 performing action 𝑎 according to its learned policy, a counterfactual state 𝒔′ is a state that involves a minimal change to 𝒔 such that the agent’s policy chooses action 𝑎′ instead of 𝑎. For example, a counterfactual state can be seen in Figure 1 for the video game Space Invaders [Brockman et al., 2016]. In this game, an agent exchanges fire with approaching enemies while taking cover underneath three barriers.

Our approach is intended for deep RL agents that operate in visual input environments such as Atari. The main role of deep learning in these environments is to learn a lower dimensional representation of the state that captures the salient aspects needed to learn a successful policy. Our approach investigates how changes to the state cause the agent to choose a different action. As such, we do not focus on explaining the long term, sequential decision-making effects of following a learned policy, though this is a direction of interest for future work.

Our end goal is a tool for acceptance testing for end users of a deep RL agent. We envision counterfactual states being used in a replay environment in which a human user observes the agent as it executes its learned policy. At key frames in the replay, the user can ask the agent to generate counterfactual states which help the user determine if the agent has captured relevant aspects of the visual input for its decision making. Our approach relies on a novel deep generative architecture to create counterfactual states. (An early version of this work appeared in Olson et al. [2019].) We evaluate this approach through user studies designed around the following research questions:
1. RQ1: Can deep generative models produce high-fidelity counterfactual states that appear as if they are generated by the Atari game?
2. RQ2: Can counterfactual states help human users, who are non-experts in machine learning, understand enough of an agent’s decision making to identify a flawed agent?
3. RQ3: Can counterfactual states be more effective for helping users understand an agent’s decision-making process than a nearest neighbor baseline technique?

Our contributions are thus twofold. First, we introduce a new deep generative approach to generate counterfactual states to provide insight into a RL agent’s decision making. Second, we present results of user studies that investigate these research questions. Our results indicate that counterfactual state explanations are indeed useful. In our studies, they have sufficient fidelity to aid non-experts in identifying flawed RL agents.
2. Related Work
The literature on explainable AI is vast and we briefly summarize only the most directly related work. Much of the past work on explaining machine learning has focused on explaining what features or regions of visual input were important for a prediction / action. A large class of approaches of this type fall under saliency map techniques, which use properties of the gradient to estimate the effect of pixels on the output (e.g. [Simonyan et al., 2013, Springenberg et al., 2014, Zeiler and Fergus, 2014, Selvaraju et al., 2017, Fong and Vedaldi, 2017, Shrikumar et al., 2017, Dabkowski and Gal, 2017, Sundararajan et al., 2017, Zhang et al., 2018, Greydanus et al., 2018, Qi et al., 2019]). Recent work, however, has found some saliency map techniques to be problematic. For instance, Adebayo et al. [2018] found that some saliency map techniques still produced the same results even if the model parameters or the data labels were randomized. In addition, Atrey et al. [2020] used counterfactual reasoning to evaluate if saliency maps were true explanations of an RL agent’s behavior. Their findings indicated a negative result, namely that saliency maps, by themselves, could lead to incorrect inferences by humans and should not be used as an explanation of an agent’s behavior. Other explanation techniques include extracting a simpler interpretable model from a more complex model [Craven and Shavlik, 1995], using locally interpretable models (e.g. [Marco Tulio Ribeiro and Guestrin, 2018, Ribeiro et al., 2016]), generating plots from Generalized Additive Models with pairwise interaction terms [Caruana et al., 2015] and using influence functions to determine which training data instances most affect the prediction [Koh and Liang, 2017].

These methods, however, do not specifically identify changes in the current data instance that would result in a different outcome (or classification). These changes are a key part of the counterfactual reasoning needed to answer a “Why?” or “Why Not?” question. One of the first methods to do so was the Contrastive Explanations Method (CEM) [Dhurandhar et al., 2018], which identified critical features or differences that would cause a data instance to be classified as another class. We found the hyperparameters for CEM to be difficult to tune to create high-fidelity counterfactuals for high-dimensional data like Atari images. As we will show in Section 5.2.1, CEM produced counterfactuals for Atari games that were filled with “snow” artifacts. CEM has also been extended to explain differences between policies in reinforcement learning [van der Waa et al., 2018]. This approach focused on differences between trajectories in the environments rather than on the visual elements of a state, which is the focus of our work.

Two other recent approaches focused on producing counterfactuals for images. Chang et al. [2019] introduced the FIDO algorithm which generates counterfactuals for images by determining which regions, when filled in with values produced by a generative model, would most change the predicted class of the image. The focus of the FIDO algorithm was on producing saliency maps and they used existing generative models for the infilling. In contrast, we develop a novel generative model to produce counterfactual state explanations; the goal of our method is to generate a realistic version of the entire counterfactual state (e.g. the whole Atari game frame image) in addition to producing difference highlights which are similar to saliency maps. Furthermore, Chang et al.
[2019] did not evaluate their counterfactual explanations on human users, while our user study results are one of our key contributions.

Goyal et al. [2019] generated counterfactual visual explanations for images by finding the minimal number of region swaps between the original image 𝑰 with class 𝑐 and a distractor image 𝑰′ with class 𝑐′ such that the class of 𝑰 would change to 𝑐′. This method suffered from the problem that their counterfactual explanations could generate images with swapped regions that looked odd, e.g. due to pose misalignment between the two images. Their user study also focused on machine teaching, which is different from our focus of assessing agents for acceptance testing.

Past work on explaining RL has focused on explaining different aspects of the RL formulation. Techniques for explaining policies include explaining policies from Markov Decision Processes with logic-based templates [Khan et al., 2009], state abstractions created through t-SNE embeddings [Mnih et al., 2015, Zahavy et al., 2016], human-interpretable predicates [Hayes and Shah, 2017], high-level, domain-specific programming languages [Verma et al., 2018] and finite state machines for RNN policies [Koul et al., 2019]. Juozapaitis et al. [2019] explained decisions made by RL agents by decomposing reward functions into simpler but semantically meaningful components. Finally, Mott et al. [2019] used an attention mechanism to identify relevant parts of the game environment for decision making.

Another category of techniques for explaining RL used machine teaching to help end-users understand an agent’s goals. Huang et al. [2019] taught end-users about an agent’s reward function using example trajectories chosen by an approximate-inference inverse RL algorithm. Lage et al. [2019] investigated using both inverse RL and imitation learning to produce summaries of an agent’s policy; their work highlighted the need for personalized summarization techniques as end-users varied in their preference of one technique over the other.

Other methods looked at summarizing an agent’s behavior by presenting key moments of trajectories executed by a trained agent [Amir and Amir, 2018, Huang et al., 2018, Sequeira and Gervasio, 2020]. These key moments were intended to demonstrate an agent’s capabilities, which could improve end-user trust. Key moments could be chosen by importance [Amir and Amir, 2018], i.e. the largest difference in q-value for a given state [Torrey and Taylor, 2013], or by critical states in which the q-value for one action was clearly superior to others [Huang et al., 2018]. Sequeira and Gervasio [2020] explored interestingness based on the four dimensions of frequency, uncertainty, predictability and contradiction. For a summary, rather than presenting a single moment, they presented a sequence of states that varied according to a particular dimension.

These methods are all fundamentally different, yet complementary to our counterfactual approach of generating explanations. More specifically, our work can be used as an explanation technique to demonstrate an agent’s proficiency once a key interaction moment has been chosen, such as by one of the aforementioned approaches.
As our counterfactuals are produced by a deep generative model, we briefly discuss related work on generative deep learning. Generative deep learning methods model the process that generates the data, thereby allowing never-before-seen data instances to be produced. Generative methods include auto-encoders [Ballard, 1987], which encode an input feature vector into a lower-dimensional latent representation, and then decode that latent representation back to the original input space. Once the auto-encoder is trained, a common method to generate novel instances is to move about in the latent space and then decode the resulting latent space representation. However, these modifications in the latent space often result in unrealistic outputs [Bengio et al., 2013] due to “holes” in the learned latent space. This issue can be addressed by incorporating an additional loss function term that makes the latent representation match a pre-defined distribution [Kingma and Welling, 2013, Makhzani et al., 2015, Tolstikhin et al., 2018].

Another class of generative deep models are adversarial networks, which have gained increased attention due to their novel applications in modeling high-resolution data, especially generating faces that do not exist [Goodfellow et al., 2014]. Adversarial networks have been used to remove information predictive of the class label from a latent space. For example, Fader Networks [Lample et al., 2017] encoded an image of a flower to a lower dimensional latent representation that retained its shape and background, but did not contain information regarding its color (where color is the class label). The class label could then be combined with the latent representation to fully reconstruct the original data image, but crucially, the class label did not need to be the original one. This method could recreate many different versions of the same input that retained some properties, but had the characteristics relevant to the label changed. Thus, in this example, we can use Fader Networks to create a flower image with a specific shape and background, but with a different color from the original label.
3. Methodology: A Generative Deep Learning Model for Counterfactual States
The goal of this work is to shed some light into the decision making of a trained deep RL agent through counterfactual explanations. We are specifically interested in gaining some insight into what aspects of the visual input state 𝒔 inform the choice of action 𝑎. Given a query state 𝒔, we generate a counterfactual state 𝒔′ that minimally differs in some sense from 𝒔, but results in the agent performing action 𝑎′ rather than action 𝑎. We refer to 𝑎′ as the counterfactual action.

Figure 2: The components of a pre-trained agent.

We partition the layers of the agent’s pre-trained policy network into two parts (Figure 2). The first partition, which we denote as 𝐴, takes a state 𝒔 and maps it to a latent representation 𝒛 = 𝐴(𝒔). The vector 𝒛 corresponds to the latent representation of 𝒔 in the second to last fully connected layer in the network. The second partition of network layers, which we denote as 𝝅, takes 𝒛 and converts it to an action distribution 𝝅(𝒛), i.e. a vector of probabilities for each action. Typically, 𝝅 consists of a fully connected linear layer followed by a softmax. We use 𝝅(𝒛, 𝑎) to refer to the probability of action 𝑎 in the action distribution 𝝅(𝒛). We highlight the distinction in our Atari setting between a state 𝒔, which is a raw Atari game image (also called a game frame), and the latent state 𝒛, which is obtained from the second to last fully connected layer of the policy network. This latent layer, which we call 𝒁, is important in our diagnosis because it is used by the agent to inform its choice of actions. (The agent may have other components such as the value function network. Our current work only uses the policy network, but we would like to apply similar ideas to the value function network.)

Our generative model is trained using a training dataset 𝐗 = {(𝒔₁, 𝒂₁), …, (𝒔_N, 𝒂_N)} of 𝑁 state-action pairs, where the action vectors 𝒂ᵢ are action distributions obtained from the trained agent as it executes its learned policy. In summary, the agent can be viewed as the mapping 𝝅(𝐴(𝒔)).

Our approach to counterfactual explanations is to create counterfactual states using a deep generative model, which have been shown to produce realistic images [Radford et al., 2015]. Our strategy is to encode the query state 𝒔 to a latent representation. Then, from this latent representation, we move in the latent space 𝒁 in a direction that increases the probability of performing the counterfactual action 𝑎′. However, as previously noted by prior work, the latent space of a standard auto-encoder is filled with “holes”, and counterfactual states generated from these holes would look unrealistic [Bengio et al., 2013]. To produce a latent space that is more amenable to creating representative outputs, we create a novel architecture that involves an adversarial auto-encoder [Makhzani et al., 2015] and a Wasserstein auto-encoder [Tolstikhin et al., 2018].

Figure 3: An overview of our architecture, which consists of the encoder 𝐸, generator 𝐺, discriminator 𝐷 and pre-trained agent (grey).

Figure 3 depicts the architecture that we use during training. The RL agent is shaded gray to indicate that it has already been trained. First, we describe the Encoder (𝐸), the Discriminator (𝐷) and the Generator (𝐺), which act together to produce counterfactual state images that vary depending on an input action distribution. Second, we describe the Wasserstein auto-encoder (𝐸_𝑤, 𝐷_𝑤), which produces a new latent space based on the agent’s latent space 𝒁; this new latent space enables perturbations within this space to produce meaningful counterfactual states. Each of these components contributes a loss term to the overall loss function used to train the network.
Auto-encoder Loss. The encoder 𝐸 and generator 𝐺 act as an encoder-decoder pair. 𝐸 is a deep convolutional neural network that maps an input state 𝒔 to a lower dimensional latent representation 𝐸(𝒔). We note that the Encoder 𝐸 is different from the encoder used by the agent’s policy network and thus has a different latent space. 𝐺 is a deep convolutional generative neural network that creates an Atari image given its latent representation 𝐸(𝒔) and a policy vector 𝝅(𝒛) (where 𝒛 = 𝐴(𝒔)). The auto-encoding loss function of 𝐸 and 𝐺 is the mean squared error (MSE) function:

L_{AE} = \frac{1}{|\mathbf{X}|} \sum_{(\mathbf{s},\mathbf{a}) \in \mathbf{X}} \lVert G(E(\mathbf{s}), \boldsymbol{\pi}(A(\mathbf{s}))) - \mathbf{s} \rVert^{2}    (1)

To generate counterfactual states, we want to create a new image by changing the action distribution 𝝅(𝐴(𝒔)) to reflect the desired counterfactual action 𝑎′. However, in our experiments, we found that having only the loss function 𝐿_AE by itself will cause 𝐺 to ignore 𝝅(𝐴(𝒔)) and use only 𝐸(𝒔); this behavior occurs because the loss function encourages reconstruction of 𝒔, which can be achieved with only the encoding 𝐸(𝒔) and without 𝝅(𝐴(𝒔)). In order to make the Generator conditioned on the action distribution, we add an adversarial loss term using a discriminator 𝐷.
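To make the data flow of Eq. (1) concrete, the following is a minimal PyTorch sketch of the reconstruction term. The linear stand-ins for 𝐸, 𝐺 and the frozen agent, the flattened 80x80 frames, and the 6-action space are illustrative assumptions, not the released implementation (which uses the convolutional networks described in Section 3.4).

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real E, G and agent are the conv nets of Section 3.4.
E = nn.Linear(80 * 80, 16)                      # encoder -> 16-d action-invariant code
G = nn.Linear(16 + 6, 80 * 80)                  # generator conditioned on pi(A(s)); 6 actions assumed
A = nn.Linear(80 * 80, 256)                     # frozen agent body A(s) -> z
pi_head = nn.Sequential(nn.Linear(256, 6), nn.Softmax(dim=-1))  # frozen policy head pi(z)
for p in list(A.parameters()) + list(pi_head.parameters()):
    p.requires_grad_(False)

s = torch.rand(32, 80 * 80)                     # a batch of flattened game frames
with torch.no_grad():
    pi_z = pi_head(A(s))                        # action distribution from the pre-trained agent

recon = G(torch.cat([E(s), pi_z], dim=-1))      # G(E(s), pi(A(s)))
L_AE = ((recon - s) ** 2).mean()                # Eq. (1): mean squared reconstruction error
```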
Discriminator Loss. In order to ensure that 𝝅(𝒛) is not ignored, we cause the encoder to create an action-invariant representation 𝐸(𝒔). By action-invariant, we mean that the representation 𝐸(𝒔) no longer captures aspects of the state 𝒔 that inform the choice of action. By doing so, adding 𝝅(𝒛) as an input to 𝐺, along with 𝐸(𝒔), will provide the necessary information that will allow 𝐺 to recreate the effects of 𝝅. In order to create an action-invariant representation, we perform adversarial training on the latent space, similar to the approach taken by Lample et al. [2017].

We thus add a discriminator 𝐷 that is trained to predict the full action distribution 𝝅(𝒛) given 𝐸(𝒔). The action-invariant latent representation is learned by 𝐸 such that 𝐷 is unable to predict the true 𝝅(𝒛) from our agent. As in Generative Adversarial Networks (GANs) [Goodfellow et al., 2014], this setting corresponds to a two-player game where 𝐷 aims at maximizing its ability to identify the action distribution, and 𝐸 aims at preventing 𝐷 from being a good discriminator. The discriminator 𝐷 approximates 𝝅(𝒛) given the encoded state 𝐸(𝒔), and is trained with MSE loss as shown below:

L_{D} = \frac{1}{|\mathbf{X}|} \sum_{(\mathbf{s},\mathbf{a}) \in \mathbf{X}} \lVert D(E(\mathbf{s})) - \boldsymbol{\pi}(A(\mathbf{s})) \rVert^{2}    (2)
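A hedged sketch of the discriminator update in Eq. (2) is shown below. As before, the small linear modules and the 6-action space are placeholder assumptions, and the agent's action distribution is replaced with random values purely for illustration.

```python
import torch
import torch.nn as nn

E = nn.Linear(80 * 80, 16)                                # stand-in encoder
D = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                  nn.Linear(64, 6), nn.Softmax(dim=-1))   # D predicts an action distribution
opt_D = torch.optim.Adam(D.parameters())

s = torch.rand(32, 80 * 80)                               # batch of flattened frames
pi_z = torch.softmax(torch.rand(32, 6), dim=-1)           # pi(A(s)) from the frozen agent (faked here)

# Eq. (2): D tries to recover the agent's action distribution from E(s).
# E(s) is detached because this step updates D only; E is trained against D
# with a separate, opposing objective described next.
L_D = ((D(E(s).detach()) - pi_z) ** 2).mean()
opt_D.zero_grad()
L_D.backward()
opt_D.step()
```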
Adversarial Loss. The objective of the encoder 𝐸 is now to learn a latent representation that optimizes two objectives. The first objective causes the generator to reconstruct the state 𝒔 given 𝐸(𝒔) and 𝝅(𝐴(𝒔)), but the second objective causes the discriminator to be unable to predict 𝝅(𝐴(𝒔)) given 𝐸(𝒔). To accomplish this behavior in 𝐷, we want to maximize the entropy 𝐻(𝐷(𝐸(𝒔))), where 𝐻(𝒑) = −Σᵢ 𝑝ᵢ log(𝑝ᵢ). Therefore, the adversarial loss can be written as:

L_{Adv} = \frac{\lambda}{|\mathbf{X}|} \sum_{(\mathbf{s},\mathbf{a}) \in \mathbf{X}} - H(D(E(\mathbf{s})))    (3)

The hyper-parameter 𝜆 > 0 controls the strength of this term. A large 𝜆 amplifies the importance of a high entropy 𝝅(𝒛), which in turn reduces the amount of action-related information in 𝐸(𝒔) and, if pushed to the extreme, results in the generator 𝐺 producing unrealistic game frames. On the other hand, small values of 𝜆 lower 𝐺’s reliance on the input 𝝅(𝒛), resulting in small changes to the game state when 𝝅(𝒛) is modified. For analysis of the effects of varying 𝜆, see Appendix A.

The counterfactual states require a notion of closeness between the query state 𝒔 and the counterfactual state 𝒔′. This notion of closeness can be measured in terms of distance in the agent’s latent space 𝒁. We want to create a counterfactual state in the latent space 𝒁 because it directly influences the action distribution 𝝅. We perform gradient descent in this feature space with respect to our target action 𝑎′ to produce a new 𝝅 that has an increased probability of the counterfactual action 𝑎′. However, as previously mentioned, moving about in the standard autoencoder’s latent representation can result in unrealistic counterfactuals [Bengio et al., 2013]. To avoid this problem, we re-represent 𝒁 as a lower-dimensional manifold 𝒁_𝑾 that is more compact and better-behaved for producing representative counterfactuals.

We use a Wasserstein auto-encoder (WAE) to learn a mapping function from the agent’s original latent space to a well-behaved manifold [Tolstikhin et al., 2018]. By using the concept of optimal transport, WAEs have shown that they can learn not just a low dimensional embedding, but also one where data points retain their concept of closeness from their original feature space, so that likely data points are close together.

Figure 4: The Wasserstein auto-encoder (shown as the pair 𝐸_𝑊 and 𝐷_𝑊) approximates the distribution of internal agent states 𝒛.

The closeness-preserving nature of the WAE plays an important role when creating an action distribution vector 𝝅(𝒛). In our counterfactual setting, we want to investigate the effect of performing action 𝑎′. However, we cannot simply convert 𝑎′ to an action distribution vector and assign a probability of 1 to the corresponding component in this vector, as this approach could result in unrepresentative and low fidelity images. Instead, we follow a gradient in the 𝒁_𝑾 space, which produces action distribution vectors that are more representative of those produced by the RL agent. This process, in turn, enables the Generator 𝐺 to produce more realistic images.

We train a WAE, with encoder 𝐸_𝑊 and decoder 𝐷_𝑊, on data instances represented in the agent’s latent space 𝒁 (see Figure 4).
We use MSE loss regularized by Maximum Mean Discrepancy (MMD):

L_{WAE} = \frac{1}{|S|} \sum_{\mathbf{s}} \lVert D_{W}(E_{W}(A(\mathbf{s}))) - A(\mathbf{s}) \rVert^{2} + MMD_{k}(D_{W}, E_{W})    (4)

where

MMD_{k}(D_{Z}, E_{Z}) = \left\lVert \int_{Z} k(z, \cdot)\, dD_{Z}(z) - \int_{Z} k(z, \cdot)\, dE_{Z}(z) \right\rVert_{\mathcal{H}_{k}}    (5)

Here \mathcal{H}_{k} is a reproducing kernel Hilbert space, and in our work, an inverse multiquadratic kernel is used [Tolstikhin et al., 2018].

3.2.3. Training

We let a pre-trained agent play the game with 𝜖-greedy exploration and train with the resulting dataset 𝐗 = {(𝒔₁, 𝒂₁), …, (𝒔_N, 𝒂_N)}. We train with the overall loss function equal to 𝐿 = 𝐿_AE + 𝐿_D + 𝐿_Adv + 𝐿_WAE. The loss function is minimized at each game time step with stochastic gradient descent using an ADAM optimizer [Kingma and Ba, 2014].
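The MMD penalty of Eqs. (4)-(5) can be estimated from samples. The sketch below is one plausible reading, assuming a batch estimate of MMD with the inverse multiquadratic kernel and unit-norm codes; the kernel constant, the prior used for comparison, and any layer sizes beyond those given in Section 3.4 are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

E_w = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))  # z -> z_w
D_w = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))  # z_w -> z

def imq_kernel(x, y, c=2.0 * 128):
    """Inverse multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    return c / (c + torch.cdist(x, y) ** 2)

def mmd(codes, prior_samples):
    """Simple (biased) batch estimate of MMD between encoded codes and prior samples."""
    return (imq_kernel(codes, codes).mean()
            + imq_kernel(prior_samples, prior_samples).mean()
            - 2.0 * imq_kernel(codes, prior_samples).mean())

z = torch.randn(64, 256)                                   # agent latents A(s) for a batch of states
z_w = E_w(z)
z_w = z_w / z_w.norm(dim=-1, keepdim=True)                 # unit-norm codes (Section 3.4)
prior = torch.randn_like(z_w)                              # assumed prior over the WAE latent space
prior = prior / prior.norm(dim=-1, keepdim=True)

L_WAE = ((D_w(z_w) - z) ** 2).mean() + mmd(z_w, prior)     # Eq. (4): MSE + MMD regularizer
```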
Generative models have been shown to have great difficulty in retaining small objects [Alvernaz and Togelius, 2017]. We follow [Kaiser et al., 2020] by using loss clipping, which is defined as max(𝐿𝑜𝑠𝑠, 𝐶) for a constant 𝐶. This clipping is only applied to our auto-encoder, and it is critical because the many small gradients for each easy-to-predict background pixel outweigh the cost of mispredicting the hard-to-encode small objects. In our setting, we find that this loss clipping ensures the retention of small but key objects during auto-encoding and the creation of these objects when generating counterfactual states, such as the bullets in the Atari game Space Invaders.

Our goal is to generate counterfactual images that closely resemble real states of the game environment, but result in the agent taking action 𝑎′ instead of action 𝑎. In order to identify the necessary elements of the state that would need to be changed, we require that the generated counterfactual state 𝒔′ is minimally changed from the original query state 𝒔. Similar to Neal et al. [2018], we formulate this process as an optimization:

minimize \lVert E_{w}(A(\mathbf{s})) - \mathbf{z}^{*}_{\mathbf{w}} \rVert   subject to   \arg\max_{a \in \mathbf{A}} \boldsymbol{\pi}(D_{W}(\mathbf{z}^{*}_{\mathbf{w}}), a) = a'

where 𝒔 is the given query state, A is the set of actions, and 𝒛*_𝒘 is a latent point representing a possible internal state of the agent. This optimization can be relaxed as follows:

\mathbf{z}^{*}_{\mathbf{w}} = \arg\min_{\mathbf{z}_{\mathbf{w}}} \left\{ \lVert \mathbf{z}_{\mathbf{w}} - E_{w}(A(\mathbf{s})) \rVert + \log\left(1 - \boldsymbol{\pi}(D_{W}(\mathbf{z}_{\mathbf{w}}), a')\right) \right\}    (6)

where 𝝅(𝒛, 𝑎) is the probability of the agent taking a discrete action 𝑎 on the counterfactual state representation 𝒛. By minimizing the second term, we aim to increase the probability of taking action 𝑎′ and reduce the probability of taking all other actions.

To generate a counterfactual state, we select a state from the training set, then encode the state to a Wasserstein latent point 𝒛_𝒘 = 𝐸_𝑊(𝐴(𝒔)). We then minimize Equation 6 through gradient descent with respect to 𝒛_𝒘 to find 𝒛*_𝒘, then decode the latent point to create a new 𝝅(𝒛), which is passed to the generator, along with 𝐸(𝒔), to create the counterfactual state 𝒔′.
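A minimal sketch of this gradient-descent relaxation of Eq. (6) is given below. It assumes access to the trained WAE decoder D_w and the agent's policy head, both frozen; the step count, learning rate, and the small epsilon added for numerical stability are illustrative choices rather than the paper's settings.

```python
import torch

def counterfactual_latent(z_w0, target_action, D_w, pi_head, steps=200, lr=1e-2):
    """Start from z_w0 = E_w(A(s)) and move z_w so that the decoded point makes
    the agent prefer `target_action`, while staying close to the original code."""
    z_w = z_w0.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([z_w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        p = pi_head(D_w(z_w))                                   # pi(D_w(z_w))
        loss = ((z_w - z_w0) ** 2).sum() \
               + torch.log(1.0 - p[..., target_action] + 1e-6).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():
            if pi_head(D_w(z_w)).argmax(dim=-1).eq(target_action).all():
                break                                           # counterfactual action reached
    return z_w.detach()
```

The returned latent point plays the role of 𝒛*_𝒘: decoding it yields a new action distribution that is passed to the generator together with 𝐸(𝒔), as described above.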
The pre-trained agent is a deep convolutional feed-forward network trained with Asynchronous Advantage Actor-Critic (A3C) [Mnih et al., 2015] to maximize score in an Atari game. Games are played with a fixed frame-skip of 8 (7 for Space Invaders). The network takes a set of 4 concatenated monochrome frames as input and is trained to maximize game score using the A3C algorithm. We decompose the agent into two functions: 𝐴(𝒔), which takes as input 4 concatenated video frames and produces a 256-dimensional vector 𝒛, and 𝝅(𝒛), which outputs a distribution among actions. The frames are downsampled and cropped to 80x80, with normalized values [0,1]. This input is processed by 4 convolutional layers (each with 32 filters, kernel sizes of 3, strides of 2, and paddings of 1), followed by a fully connected layer, sized 256, and a last fully connected layer of size |A| + 1, where |A| is the action space size. We apply a softmax activation to the first |A| neurons to obtain the action distribution 𝝅(𝒔) and use the last neuron to predict the value, 𝑉(𝒔).

The A3C RL algorithm was trained with a learning rate of 𝛼, a discount factor of 𝛾 = 0.99, and computed loss on the policy using Generalized Advantage Estimation with 𝜆 = 1.0. We find that convergence is more difficult with such a large frame skip, so each policy was trained asynchronously for a total of 50 million frames. During training, we do not downscale or greyscale the game state. We pass in the current game time step as a 3-channel RGB image. To generate the dataset 𝐗, we set the 𝜖 exploration value to 0.

The encoder 𝐸 consists of 6 convolutional layers followed by 2 fully-connected layers with LeakyReLU activations and batch normalization. The output 𝐸(𝒔) is a 16-dimensional vector. For most of our agents, we find a value of 𝜆 = 50 enforces a good trade-off between state reconstruction and reliance on 𝝅(𝒛). The output of the network is referred to in the text as 𝐸(𝒔).

The generator 𝐺 consists of one fully-connected layer followed by 6 transposed convolutional layers, all with LeakyReLU activations and batch normalization. The encoded state 𝐸(𝒔) and the action distribution 𝝅(𝒛) are fed to the first layer of the generator. Additionally, following the recommendation of Lample et al. [2017], 𝝅(𝒛) is appended as an additional input channel to each subsequent layer, which ensures 𝐺 learns to depend on the values of 𝝅(𝒛) for image creation when 𝝅(𝒛) is modified during counterfactual generation.

The discriminator 𝐷 consists of two fully-connected layers followed by a softmax function, and outputs a distribution among actions with the same dimensionality as 𝝅(𝒛).

The Wasserstein encoder 𝐸_𝑤 consists of 3 fully-connected layers mapping 𝒛 to a 128-dimensional vector 𝒛_𝑤, normalized such that ‖𝒛_𝒘‖ = 1. Each layer has the same dimensionality of 256, except the output of the 3rd layer which is 128. Additionally, the first two layers are followed by batch normalization and leaky ReLU with a leak of 0.2. The corresponding Wasserstein decoder 𝐷_𝑤 is symmetric to 𝐸_𝑤, with batch normalization and leaky ReLU after the first two layers, and maps 𝒛_𝒘 back to 𝒛.

The encoder, generator, and discriminator are all trained through stochastic gradient descent using an Adam optimizer with 𝛽₂ = 0.9. These networks were typically trained for 25 million game states to achieve high fidelity reconstructions, but we found even a tenth of the game states to be enough to produce meaningful counterfactual states. We use a fixed max loss clipping constant 𝐶. The Wasserstein auto-encoder was trained with the default 𝛽 parameters. Training was performed for 15 million frames, upon which we found selecting actions from 𝝅(𝐷_𝑤(𝐸_𝑤(𝐴(𝒔)))) consistently achieved the same average game score as the original agent.

All models are constructed and trained using PyTorch [Paszke et al., 2019]. For more information about our architecture and training parameters, our code can be accessed at: https://github.com/mattolson93/counterfactual-state-explanations/

3.4.3. Creating counterfactual state highlights

A counterfactual state often contains small changes that are difficult to notice without careful inspection, so we mimic the saliency map generation process in Greydanus et al. [2018] to highlight the difference between the original and counterfactual state. We take the absolute difference between the original state 𝒔 and counterfactual state 𝒔′ to create a counterfactual mask 𝒎_𝒄 = ‖𝒔 − 𝒔′‖. For further clarity of the changes, we apply a Gaussian blur over the mask. Lastly, we set the blurred mask to a single color channel and combine this color mask with the original state to get the highlights. In our experiments, the highlights are in different colors for different games (e.g. blue for Space Invaders and red for Qbert) as we want colors that are a stark contrast from the color scheme of the game.
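The highlight images of Section 3.4.3 can be reproduced with a few lines of NumPy/SciPy. The blur width, the blending weight, and the choice of color channel below are illustrative values; the paper only specifies the overall recipe (absolute difference, Gaussian blur, single color channel).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def highlight(s, s_cf, channel=2, sigma=3.0, alpha=0.7):
    """Overlay the blurred |s - s'| mask onto the query state in one color channel.
    s, s_cf: H x W x 3 uint8 RGB frames; channel/sigma/alpha are illustrative choices."""
    s = s.astype(np.float32) / 255.0
    s_cf = s_cf.astype(np.float32) / 255.0
    mask = np.abs(s - s_cf).mean(axis=-1)         # counterfactual mask m_c = |s - s'|
    mask = gaussian_filter(mask, sigma=sigma)     # blur for clarity
    mask = mask / (mask.max() + 1e-8)             # normalize to [0, 1]
    out = s.copy()
    out[..., channel] = np.clip(out[..., channel] + alpha * mask, 0.0, 1.0)
    return (out * 255).astype(np.uint8)
```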
4. Methodology: User Studies
In general, evaluating explanations is a challenging problem, and counterfactual explanations are particularly difficult. A good counterfactual explanation helps humans understand why an agent performed a particular action. This human-based criterion is infeasible to capture with quantitative metrics. For instance, using the probability 𝝅(𝒔′, 𝑎′) as a quantitative metric for a counterfactual state 𝒔′ is misleading because this probability can be high for some Atari images that humans can immediately recognize as not generated by the game itself, and also high for adversarial examples with imperceptible changes to the original state 𝒔.

Since evaluating counterfactuals requires human inspection, we designed two user studies. In the first user study, we evaluated the fidelity of our counterfactual states to the game. By fidelity, we refer to how well the counterfactual images appear to be generated by the game itself rather than by a generative deep learning model. In the second user study, we investigated if our counterfactual states could help humans understand enough of an agent’s decision making so that they could perform a downstream task of identifying a flawed RL agent.

In order to evaluate the fidelity of our counterfactual states, we needed to create baseline methods for comparison. First, we experimented with using pertinent negatives from the Contrastive Explanation Method (CEM) [Dhurandhar et al., 2018] as counterfactuals. These pertinent negatives highlight absent features that would cause the agent to select an alternate action. We generated pertinent negatives from Atari states with pixels as features, and interpreted them as counterfactual states. We performed an extensive search over hyper-parameters to generate high-fidelity states, but found CEM very difficult to tune due to the high-dimensional nature of Atari images. The generated counterfactual states were either identical to the original query state or they had obvious “snow” artifacts as shown in Figure 5, making them too low quality to serve as a reasonable baseline for our user study.

We then created a baseline method consisting of counterfactual images from an ablated version of our generative model. In the ablated version of the network, the encoder, discriminator, and Wasserstein autoencoder were removed, and the generator was trained with MSE loss to reconstruct 𝒔 given 𝒛 as input. Counterfactual images were generated by performing gradient descent with respect to 𝒛 to maximize 𝝅(𝒛, 𝑎′) for a counterfactual action 𝑎′. We found that counterfactual states generated in this way did not always construct a perfectly convincing game state, as shown in Figure 6, but were of sufficient quality to use as a baseline in our user study. Appendix B details other ablation experiments, which reveal the negative effects of removing any specific component from our architecture.

Figure 5: Counterfactual states generated using the Contrastive Explanation Method with three choices of parameters on different states. Images are in black and white because the original CEM source code operates on direct input to the agent, which are down-scaled, grey images.

Figure 6: Three examples of counterfactual states generated using the ablated model.

Finally, we also included images from the game itself.
In summary, the images in our first user study were generated by three different sources: 10 from the actual game, 10 from our counterfactual state explanation method, and 10 from our ablated network. These images were randomly sorted for each user.

We evaluated our counterfactual state explanations through a user study in our lab with 30 participants (20 male, 10 female) who were not experts in machine learning; participants included undergraduates and members of the local community. Approximately half were undergraduates and the others were from the community. 80% were between the ages of 18-30, 10% were between 30-50, and the other 10% were between 50-60. We chose to focus our study on Space Invaders because it is straightforward to learn for a participant unfamiliar with video games. To familiarize participants with Space Invaders, we started the study by having participants play the game for 5 minutes. Participants then rated the fidelity of 30 randomly ordered game images on a Likert scale from 1 to 6: (1) Completely Fake, (2) Most parts fake, (3) More than half fake, (4) More than half real, (5) Most parts real and (6) Completely real.

Our second user study was intended to evaluate the effectiveness of our counterfactual state explanations. Our focus was on a real world setting in which a user, who was not a machine learning expert, needed to assess a RL agent that was about to be deployed. We designed an objective task that relied on the user’s understanding of the agent’s decision making process from the counterfactual explanations. The task required participants to identify which of two RL agents was flawed based on the counterfactual explanations provided. As in the first user study, we chose Space Invaders since it was quick to learn and the optimal strategy was not immediately obvious. Since we recruited non-experts in AI or machine learning, we henceforth refer to the RL agent as an AI agent in our user study for simplicity.

The counterfactual explanation’s effectiveness was measured by a (2×2)×2 mixed factorial design, as we have both a within-subjects comparison and a between-subjects comparison. The within-subjects comparison involved the two independent variables of RL agent type (flawed versus normal) and explanation presence (with and without explanation). Thus, all participants were shown the behavior of the flawed and the normal agents, both with and without counterfactual explanations. The between-subjects comparison involved comparing counterfactual explanation methods; one group of participants was shown a baseline counterfactual explanation method based on nearest neighbors and the other group was shown our counterfactual state explanation method.

4.2.1. Experimental Design
The participants were presented with the task of identifying which of the two agents was flawed. We designed our two agents such that their average score on the game was almost equal, and the score could not be used to determine which agent was flawed. In addition, humans could not identify the flawed agent by simply watching the agents play the game. Consequently, the counterfactual explanations were the main source of insight into the agent’s decision-making for the participants.

An alternative approach for evaluating the effectiveness of counterfactual explanations was to have participants predict an agent’s action in a new state. While action prediction may be feasible in some environments (e.g. Madumal et al. [2020]), it can also be challenging in other environments such as Atari games and real-time strategy games. Anderson et al. [2020] showed that using explanations to predict future actions was difficult, sometimes even worse than random guessing, because AI agents could be successful in these games in ways that were unintuitive to humans.

Our "normal" agent was the agent described in Section 3.4. For the flawed agent, we tried designing flawed agents that were blind to different parts of the game, but many of these possibilities were easy to detect by humans. Blocking half of the screen resulted in the agent only playing in the visible half. Removing the barriers had no effect as the agent eventually learned their locations during training. Removing the bullets caused a noticeable behavior change as the agent hid under the barriers for the majority of the game. Finally, we were unable to train an agent that performed well by removing the enemies from the observations.

We ultimately settled on a flawed Space Invaders agent by masking the region of the screen containing the green ship, effectively making the agent unaware of its own ship’s position. This flaw was subtle and difficult to detect without the aid of counterfactual explanations.

This flawed agent was harder to train than a normal agent playing Space Invaders and thus required 160 million game steps to achieve sufficiently good performance. In addition, for our flawed agent, we set the adversarial loss hyperparameter 𝜆 = 100 to make the generated counterfactual states have visually evident changes from the original query state.
This study involved two conditions corresponding to different counterfactual explanation methods. The first condition used a naive baseline based on a simple nearest neighbor approach. The second condition involved our counterfactual state explanations.
Nearest Neighbor Counterfactual Explanations (NNCE). For this approach, the agent played the game for 𝑁 = 25 million time steps with 𝜖-greedy exploration to produce a game trace dataset 𝐃, which we used for nearest neighbor selection. For each step we stored in 𝐃 the state 𝒔, the representation 𝒛 = 𝐴(𝒔), and the action taken 𝑎, resulting in a dataset 𝐃 = {(𝒔₁, 𝒛₁, 𝑎₁), …, (𝒔_N, 𝒛_N, 𝑎_N)}. To generate a counterfactual from this dataset, the agent played a new game, and on the desired query state 𝒔 we found the nearest latent point 𝒛* ∈ 𝐃 to the current point 𝒛 = 𝐴(𝒔) where the agent took the desired action 𝑎′; we used 𝐿₂ distance to determine closeness. We then displayed the associated state 𝒔* from the triplet (𝒔*, 𝒛*, 𝑎′) as the closest counterfactual state where the agent took a different action 𝑎′. Note that the images from the nearest neighbor approach were always faithful to the game as they were actual game frames from the Atari game. However, even with a very large game trace dataset of size 25 million, the nearest neighbor approach did not always retrieve a game state that was “close” to the query state. In contrast, our counterfactual state explanations were always close to the query state by design, but they may not always have complete fidelity to the game.
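As a sketch, the NNCE lookup reduces to a nearest-neighbour query over the stored latents; the array layout and use of plain NumPy below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nnce(query_z, query_action, Z, S, actions):
    """Return the stored frame whose latent is closest (L2) to the query latent,
    among steps where the agent took an action different from `query_action`.
    Z: (N, d) latents A(s); S: (N, ...) game frames; actions: (N,) actions taken."""
    keep = actions != query_action
    dists = np.linalg.norm(Z[keep] - query_z, axis=1)
    idx = np.flatnonzero(keep)[np.argmin(dists)]
    return S[idx], actions[idx]
```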
Choosing counterfactual query states and counterfactual actions. The specific images, serving as query states to present to participants for our counterfactual state explanations, were objectively chosen using a heuristic based on the entropy of the policy vector 𝝅(𝐴(𝒔)) of state 𝒔; this entropy score has been used in the past for choosing key frames for establishing trust [Huang et al., 2018]. For diversity, if an image at time 𝑡 was selected, we do not allow images to be selected until after time 𝑡 + 10. This restriction was especially important for diversity in the counterfactual states chosen for the flawed agent, as it had very low entropy in its policy vector in the initial states, but higher entropy later on. As Space Invaders is a relatively simple game in which the aliens move faster as time progresses, we only considered diversity in terms of the progression of time within a round, and we selected query states at different points in time. All query states used in the study can be seen in Appendix figures C.22 - C.25. We thus emphasize the fact that the counterfactual states and corresponding actions we presented to participants were not hand-picked; rather, they were selected objectively by our heuristic.

For our counterfactual state explanations, once a query state was selected, we chose the counterfactual action 𝑎′ as the one that involved the largest 𝐿₂ change between the original Wasserstein latent state 𝒛_𝒘 and the counterfactual Wasserstein latent state 𝒛_𝒘′ (ignoring the no-operation action).

For NNCE in our user study, we use the same entropy-based state selection heuristic to determine which query states to show to the participant, thereby ensuring that query states are identical between the two conditions. What varies between the two conditions is the explanation process, which selects the counterfactual action 𝑎′ and the resulting counterfactual state 𝒔′. The method we used for selecting the counterfactual action 𝑎′ for NNCE differs from the heuristic used by our counterfactual state explanations. To select the counterfactual action in NNCE, we find the closest nearest neighbor in latent space 𝒛 (via 𝐿₂ distance) where the agent performs a different action 𝑎′ ≠ 𝑎.

The action selection heuristics were slightly different between the two conditions in order to maximize the quality of the counterfactual states selected by the different methods. The two methods differed in how far the counterfactual images were from the query state, due to the different latent spaces used by the two methods and also due to the granularity of their movement in their respective latent spaces. Our counterfactual state explanations operated in a Wasserstein latent space. Because they were created by a generative process and not retrieved from a dataset, using the closest Wasserstein point with a different action often caused very little change or no change at all. In contrast, the NNCE method operated in the latent space of the pre-trained agent, which does not have a Wasserstein latent space. The NNCEs used pre-existing images from the dataset 𝐃, which were usually further away from the query state (visually) than most counterfactual states created by our method. Had we chosen the counterfactual action that involved the largest 𝐿₂ change in latent space, NNCEs would have produced images that were often dramatically different from the query state, which would likely have produced worse results in our user study. Instead, to give the NNCE condition the best possible counterfactual images (based on visual inspection), we ultimately selected the counterfactual action to be the action (different from the original action 𝑎) associated with the nearest neighbor with the closest 𝐿₂ distance in latent space.

We recruited 60 participants at Oregon State University, with 30 participants in each condition. The target audience for our user study was people who were not experts in machine learning. Approximately half were undergraduates and the others were from the community.
All participants were between the ages of 18-40, 40% of whom were women and 60% were men. This study consisted of 6 sections:

1. Gameplay
2. Agents Analysis (pre-evaluation)
3. Tutorial
4. Evaluation (main task)
5. Agents Analysis (post-evaluation)
6. Reflection
1. Gameplay.
A facilitator started the study with a guided tutorial about game rules and described the task to be performed, after which the participants were allowed to use the system. To be able to understand the game better, all the participants first played the Atari 2600 video game Space Invaders for 5 minutes.
2. Agents Analysis (pre-evaluation).
After having enough hands-on experience with the game, each participant watched a video of the normal agent and a video of the flawed agent playing one complete episode of the game from start to finish. The identities of each agent were hidden from participants. The videos were selected such that the agent cleared all enemies before they reached the bottom while avoiding all incoming bullets. We randomized the order of presentation of the normal and flawed agent. For concreteness, we described the flawed agent to the participants as an agent with a malfunction in its sensors. After viewing the videos, we asked the participants, “Which of the two AI do you believe has a malfunction?”, with their choices being “AI one”, “AI two”, or “CAN’T TELL”. We then asked the participants if they could identify which part of the game the flawed AI was blind to: the yellow aliens, the white bullets, the green ship, or the orange barriers. After answering both questions, participants were placed on a waiting screen to ensure the next section occurred simultaneously for everyone. At this point, participants were unable to change their answers to the previous questions and were unable to view the videos for the rest of the study.

The answers from this section formed the data corpus of the participants’ descriptive analysis of the AI agents before they saw the explanation.

Figure 7: The explanation tool used for displaying counterfactual states to participants in our user study.

3. Tutorial.
The facilitator then gave a detailed guided tutorial to describe the counterfactual explanation display tool, particularly since counterfactuals are an esoteric topic for most non-experts. The display tool consisted of a tuple of 3 images: the original state where an agent took its preferred action 𝑎, the changed state where the agent took the action 𝑎′, and the image highlights. The original state was an interactive UI element, where if clicked it would change into a GIF that sequentially displayed the current and three previous game steps to give context for the query state. We omitted the term “counterfactual” as we found the additional vocabulary to be confusing in our preliminary studies.

Our approach of presenting the original image, highlights and the counterfactual alongside each other ties in with the causal connection of abductive reasoning and inference [Miller, 2019], where the highlights served the purpose of “fixing the eyeballs” of the participants, and the counterfactuals offered reasoning for the AI’s actions. The exact script for the tutorial is provided in Appendix C.1.
4. Evaluation (main task).
Following the tutorial, the participants evaluated 20 screenshots of the game in the display tool: 10 states for each agent selected via the heuristic described above. All 10 states were selected from the single game episode shown in the videos from earlier. An example of a screen in the explanation tool can be seen in Figure 7.

For each screenshot, we asked the participant two questions. The first question was “What objects do you think this AI pays attention to?”, with 4 check boxes that could be selected (aliens, bullets, ship, barriers), and the second question was “Which explanation did you use for making your decision?”, with 6 options (Only Highlights, Mostly Highlights, Highlights and Changed State Equally, Mostly Changed State, Only Changed State, or Neither). We presented the 20 explanations (of normal and flawed agents) in randomized order to avoid biasing the participants.

The answers from this section formed the data corpus of the participants’ descriptive analysis of the AI agents after they saw the explanation.
5. Agents Analysis (post-evaluation).
Once the participants finished evaluating the 20 explanations, we summarized the results of their own responses to the question “What objects do you think this AI pays attention to?” in a table and a chart, separating the two different agents and tallying the number of times the participant selected each object. We then re-asked the same questions from study section two: which AI is malfunctioning and in what way is it malfunctioning. An example of the final results screen can be seen in Figure 8, where each vertical element of the UI was hidden until the user clicked a “continue” button, to guide participants through the summary data one step at a time. We found that only showing the tallied results before re-asking the questions was the best way to get participants to focus on the explanations. In our preliminary experiments leading to the design of the final study, we found that participants were overwhelmed with data if they could go back to either look at the individual examples or re-watch the videos of the agents playing the game.

Figure 8: An example of the results screen a user would see after completing the evaluation for the Counterfactual States condition.

6. Reflection.
We ended the study by asking the participants to perform a short written reflection after they submitted their answer, to gauge their understanding of the explanations and to elicit their opinion on the explanation. The questions included “Which parts of the explanation tool influenced your decision in determining the malfunctioning AI?”, to understand what participants found helpful in the explanation and what contributed to successfully finding the flawed agent. We also asked the participants to describe components of the explanation, “In your own words, can you briefly explain what the 3rd image from the explanation tool is (the images titled: ‘AI Response Changed State’)?”, to gauge if participants even understood the concept of a counterfactual reasonably well, and how they had done in the main task if they did not. Appendix D describes the content analysis applied to these two questions.
5. Results
We now show examples of counterfactual states for pre-trained agents in various Atari games; these examples include both high and low quality counterfactuals. In Figures 9 to 12, we show sets of images in which the left image is the original query state where the agent would take action 𝑎 according to its policy, the right image is the counterfactual state where the agent would take the selected action 𝑎′, and the center image is the highlighted difference between the two.

Q∗bert

In this game, the agent controls the orange character Q*bert, who starts each game with 3 lives at the top of a pyramid and has 5 actions: hopping diagonally from cube to cube (or staying still). Landing on a cube causes it to change color, and changing every cube to the target color allows the agent to progress to the next stage. The agent must avoid purple enemies or lose a life upon contact. Green enemies that revert cube color changes can be stopped via contact.

Figure 9: Each row shows an example of a counterfactual state explanation for Q∗bert: Query state with action 𝑎 (left), counterfactual state with action 𝑎′ (right), and red highlights (center). Top row: 𝑎 = MoveUpRight, 𝑎′ = MoveUpLeft. Bottom row: 𝑎 = MoveUpRight, 𝑎′ = MoveDownLeft.

In the top row of Figure 9, the counterfactual shows that if the up-right square were yellow (already visited), Qbert would move up-left. In the bottom row of Figure 9, if Qbert had been higher up on the structure, the agent will jump down and left; in this example, the Qbert image is not perfectly realistic but enough to give a sense of the agent’s decision making.
Seaquest

In this game, an agent must shoot torpedoes at oncoming enemies while rescuing friendly divers. In Figure 10 (top row), a new enemy must appear to the left in order for the agent to take an action that turns the submarine around while firing. Thus, the agent has an understanding about enemy spawns and submarine direction. The middle row of Figure 10 shows a scenario (best viewed on a computer) where the agent would move up and left but not shoot, because the agent would not be fully aligned with the enemy fish on the left to hit it; in addition, the submarine has already shot its torpedo in anticipation of an enemy fish appearing on the bottom right, and there can only be one torpedo on the screen at a time. Note that the torpedo is actually highlighted in red, but due to the size of the image in Figure 10, these highlights are imperceptible.

Figure 10: Each row shows an example counterfactual state explanation for Seaquest: Query state with action 𝑎 (left), counterfactual state with action 𝑎′ (right), and highlights (center). Top row: 𝑎 = MoveUpRightAndShoot, 𝑎′ = MoveUpLeftAndShoot. Middle row: 𝑎 = MoveUpLeftAndShoot, 𝑎′ = MoveUpLeft. Bottom row: 𝑎 = MoveLeftAndShoot, 𝑎′ = MoveDown.

Figure 10 (bottom row) shows an unrealistic counterfactual, where despite never seeing two submarines in the training data, the best prediction of the Generator (given the counterfactual inputs) is to place a submarine at both locations.
Crazy Climber

In this game, an agent must climb up a building while avoiding various obstacles. Figure 11 (top row) shows the original state in which the agent is in a position to move horizontally, whereas the counterfactual state shows the climber in a ready state to move vertically, as indicated by the position of its legs. Figure 11 (bottom row) demonstrates how the agent will climb up once the enemy is no longer above it. For both examples, because the climber stays in a fixed vertical position with the entire tower itself moving down, the highlights are difficult to interpret. These examples show the importance of using both the highlights and the counterfactual states, as in some cases the counterfactual states are much easier to understand than the highlights.

Figure 11: Each row shows an example counterfactual state explanation for Crazy Climber: query state with action a (left), counterfactual state with action a′ (right), and highlights (center). Top row: a = MoveRight, a′ = MoveBodyUp. Bottom row: a = MoveLeft, a′ = MoveArmsUp.

Space Invaders

In this game, an agent exchanges fire with approaching enemies while taking cover underneath three barriers. Figure 12 depicts the example, which was also used in our user study. This example reveals that the agent has learned to prefer specific locations for safely lining up shots, selectively choosing enemies to shoot.

We also include an example of a counterfactual state explanation with the flawed agent from our second user study. Figure 13 shows that, in the generated counterfactual state explanation, the flawed agent does not move the ship as it is blind to its own ship's location; in fact, the flawed agent never moves the ship in any of our counterfactual state explanations.

Figure 12: An example of a counterfactual state explanation for Space Invaders with the "normal" agent. Here, action a = MoveRightAndShoot (left), counterfactual state where action a′ = MoveRight (right), and the highlighted difference (center).

Figure 13: An example of a counterfactual state explanation for Space Invaders with the flawed agent from our second user study. Here, action a = MoveLeftAndShoot (left), counterfactual state where action a′ = MoveRight (right), and the highlighted difference (center).

In terms of fidelity, the average ratings on the 6-point Likert scale are shown in Table 1. The differences between the fidelity ratings for the counterfactual states and the real states were not statistically significant (α = 0.05, p-value = 0.458, one-sided Wilcoxon signed-rank test). These results show that our counterfactual states were on average close to appearing faithful to the game states, but they were not perfect. In the next section, we will show that despite these imperfections, the counterfactuals were still useful to participants.

Table 1: Average ratings on a 6-point Likert scale from the fidelity user study for the ablated version, the counterfactual state explanations, and the actual game.
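For readers who want to reproduce this kind of paired comparison, the sketch below shows how a one-sided Wilcoxon signed-rank test can be run with SciPy. The paired ratings are invented purely for illustration and are not the study's data.

```python
from scipy.stats import wilcoxon

# Hypothetical paired 6-point Likert ratings (one pair per rated image):
# perceived realism of a counterfactual state vs. an actual game state.
counterfactual_ratings = [5, 4, 6, 5, 3, 5, 4, 6, 5, 4]
actual_game_ratings    = [5, 5, 6, 6, 4, 5, 5, 6, 6, 4]

# One-sided test: are the counterfactual ratings systematically lower?
statistic, p_value = wilcoxon(counterfactual_ratings, actual_game_ratings,
                              alternative="less")
print(f"W = {statistic}, p = {p_value:.3f}")
```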
Participants were significantly more successful at identifying the flawed agent when provided with counterfactual explanations, both for the counterfactual state explanations (α = 0.05, p-value = 0.0011, Pearson's Chi-square test) and for the NNCEs (α = 0.05, p-value = 0.0009, Pearson's Chi-square test). This finding was further reinforced when all participants in both conditions self-reported in the evaluation section that they found the explanation useful. Only 1 participant out of 60 stated that the video in the Agent Analysis section was useful. Instead, participants found the highlights and the counterfactuals to be more useful than the video.

Figure 14: The total selection count over all participants and all explanations regarding the self-reported usefulness of each explanation component.
In the evaluation section, for each explanation from a given counterfactual method, we asked participants to rate the usefulness of each component of the explanation on a 5-point Likert scale (1: Highlights only, 2: Mostly Highlights, 3: Both Equally, 4: Mostly Counterfactuals, 5: Counterfactuals only). For counterfactual state explanations, "Mostly Highlights" was the most common response (204/600 times; 34%) for helping participants identify the flaw in the AI. For NNCEs, "Both Equally" was the most common response (236/600 times; 39%). The full response distribution for each condition is shown in Figure 14. These results indicate that neither component in isolation was ideal. Most of the time, participants preferred having both, but with varying degrees of usefulness. We also found this result in the qualitative data from the post-task questionnaire, wherein participants in both conditions overwhelmingly self-reported using the highlights as a supporting artefact for the counterfactual explanations, and vice versa:
Participant 43 in Counterfactual State condition: "I used the highlights tool primarily because it was the easiest way to see what was changing from the original state. Then, I would reference the changed state tool to see how the original changed."

Participant 14 in Nearest Neighbor condition: "Mostly highlights, I used changed state sparingly to cement assertions from the highlights."

Participants in both conditions found the summary chart to be helpful in consolidating their ideas and facilitating recall. For instance, two participants commented in their responses:

Participant 35 in Counterfactual State condition: "The bar graph at the end of my responses for both AIs influenced it the most."

Participant 16 in Nearest Neighbor condition: "The charts at the end heavily influenced my decision, because I thought the malfunctioning AI couldn't see the barriers because they had more damage on the ship side of the barriers than the alien side, but the charts showed that that was a poor assumption because almost every time I evaluated the barriers as something they could see."
Without explanation: incorrect identification 10 (33%), correct identification 17 (57%), can't tell 3 (10%).

Table 2: The number of participants, with and without counterfactual state explanations, who incorrectly identified the normal AI, correctly identified the flawed AI, and who were unable to tell the difference.
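The reported test can be reproduced from a contingency table, as sketched below. The "without explanation" row uses the counts from Table 2; the "with explanation" row is a hypothetical split that is merely consistent with the 90% correct figure reported in the text, so the resulting statistics are illustrative only.

```python
from scipy.stats import chi2_contingency

# Rows: without explanation, with counterfactual state explanations.
# Columns: incorrect identification, correct identification, can't tell.
table = [
    [10, 17, 3],   # without explanation (Table 2)
    [2, 27, 1],    # with explanation (hypothetical split, 90% correct)
]

chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```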
Participants that were provided with counterfactual state explanations identified the flawed AI with a far higher success rate than those given NNCEs. Without any explanations, 57% of the participants correctly identified the flawed agent (Table 2). With counterfactual state explanations, this percentage improved to 90%, a significant improvement (α = 0.05, p-value = 10−, Pearson's Chi-square test). In addition, none of these participants were able to correctly determine the specific flaw in the agent in the first Agent Analysis section. However, after using our counterfactual state explanations, 60% of participants accurately diagnosed the specific flaw, which is statistically significant (α = 0.05, p-value = 0, Pearson's Chi-square test).
Table 3: The number of participants, with and without NNCEs, who incorrectly identified the normal AI, correctly identified the flawed AI, and who were unable to tell the difference.
In contrast, NNCEs often confused participants. 63% of the participants identified the flawed agent correctly with just the video (Table 3), but after viewing the explanations, this percentage dropped to 47% (α = 0.05, p-value = 0.1432, Pearson's Chi-square test). Figure 15 contains an aggregate comparison of how well participants that were shown NNCEs versus counterfactual state explanations could identify the specific flaw. The histogram in Figure 15 depicts the difference in object counts over all participants, where the object refers to the Space Invaders element that a participant determines the agent pays attention to. The difference is computed as the total object counts for the flawed agent minus the total object counts for the normal agent. Participants for both counterfactual approaches were able to pick up on the correct flaw, but participants that were shown counterfactual state explanations did so with much higher numbers than participants that were shown NNCEs.

Figure 15: The total difference in object counts over all participants, where the object refers to the Space Invaders element that a participant determines the agent pays attention to. Here, the y-axis measures the difference between the total object counts for the flawed agent and the total object counts for the normal agent. Positive numbers indicate that participants consider the flawed agent to pay more attention to that object than the normal agent, whereas negative numbers indicate that participants consider the flawed agent to pay less attention to that object.

One of the main reasons for this decrease is that the NNCEs are inconsistent in quality, as the quality depended on the existence of an instance in the game trace dataset D that was reasonably close (in latent space) to the query state. Despite a very large game trace dataset (25 million game frames) as a pool for the NNCEs, a suitable instance that can serve as a counterfactual may not exist, resulting in odd changes to the query state (e.g. an extra alien appearing on the opposite side from the agent) or counterfactuals that were extremely different from the current state (e.g. a reset of the game has occurred or many enemies have been added/removed). Examples of low quality nearest neighbor counterfactuals can be seen in Figures 16 and 17. Note that both examples have a large number of highlights. In addition, the NNCE in Figure 17 actually moves the ship, which obscures the true flaw in the agent. This reliance on finding a suitable counterfactual in the game trace dataset is a major disadvantage of nearest neighbor counterfactuals. It is likely infeasible to generate a large enough dataset to facilitate retrieval of a reasonable counterfactual for any arbitrary game state in a sufficiently complex game. In contrast, our counterfactual state explanation generates the game frame on the fly, and even though it is not perfectly faithful to the game, it has sufficient fidelity to give meaningful insight to participants.

Figure 16: An example of the nearest neighbor counterfactual explanation method for the normally trained agent with query state where action a = MoveRightAndShoot (left), counterfactual state where action a′ = MoveLeftAndShoot (right), and the highlighted difference (center).

Figure 17: An example of the nearest neighbor counterfactual explanation method for the flawed agent with query state where action a = MoveLeftAndShoot (left), counterfactual state where action a′ = MoveLeft (right), and the highlighted difference (center).

The inconsistency in quality of the NNCEs likely contributed to the confusion of participants. In the retrospective reviews from participants who were given the nearest neighbor counterfactual explanations, participants frequently self-reported being confused by the highlights or by the counterfactual itself. For instance, when asked if the highlights helped make their decision:

Participant 17 in Nearest Neighbor condition: "Sometimes, but it got confusing sometimes as I couldn't tell what highlight belonged to what image, so I couldn't get the AI's thoughts from it"

Similarly, when asked if the counterfactual state image helped make their decision:
Participant 26 in Nearest Neighbor condition: "No, I was not sure how it related to the decision to move or shoot."

The NNCEs seemed to be useful to only a small group of participants, as 17% of the participants were able to accurately diagnose the specific flaw in the flawed agent, up from none (α = 0.05, p-value = 0.014, Pearson's Chi-square test). This percentage is only slightly higher than 12.5%, which is the probability of correctly guessing the flawed agent and the correct flaw purely by chance.
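To make the nearest neighbor baseline more concrete, the sketch below illustrates the kind of retrieval it relies on: among stored states where the agent chose the requested counterfactual action, return the one closest to the query in latent space. The Euclidean distance and the precomputed latent/action arrays are assumptions; the actual baseline's encoder and distance measure may differ.

```python
import numpy as np

def nearest_neighbor_counterfactual(query_latent: np.ndarray,
                                    target_action: int,
                                    dataset_latents: np.ndarray,
                                    dataset_actions: np.ndarray) -> int:
    """Index of the stored state closest (in latent space) to the query among
    states where the agent's chosen action equals the target action."""
    candidates = np.flatnonzero(dataset_actions == target_action)
    if candidates.size == 0:
        raise ValueError("no stored state with the requested action")
    distances = np.linalg.norm(dataset_latents[candidates] - query_latent, axis=1)
    return int(candidates[np.argmin(distances)])
```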
6. Discussion
In summary, participants in our first study found that our counterfactual state explanations generated game frames that were close in terms of fidelity, though not perfectly so. In our second user study, these counterfactual state explanations were of sufficient fidelity that 90% of our participants could use them to identify which agent was flawed, and 60% of the participants were able to use them to perform the more challenging task of diagnosing the specific flaw. During this study, participants specifically mentioned the highlights and summary chart as being particularly useful in their decision-making, thereby suggesting that these visual elements greatly enhance counterfactual explanations.

Our counterfactual state explanations were also much more effective in our user study than the nearest neighbor baseline. Participants using the counterfactual state explanations were much more successful at identifying the flawed agent as well as the specific flaw than participants using the NNCEs. In addition, even though the NNCEs were 100% faithful to the game, they were not always close to the query state. Participants found that our counterfactual state explanations, which produced images that were "close" to the original query state, were more insightful despite not having 100% fidelity. Our study also indicated that "no explanation is better than a bad explanation", as participants using the NNCEs were often confused and the number of participants correctly identifying the flawed agent actually decreased after seeing the NNCEs.

There are a few issues with our approach that remain an open area of investigation. First, our deep generative approach adds some artefacts when creating counterfactual states, which impacts the faithfulness of our explanation. Empirically, we found most artefacts were minor, such as blurry images, and did not seem to be a major roadblock for our participants. One of the more noticeable artefacts is how small objects, such as the shot in Space Invaders, occasionally disappear. While this issue is somewhat alleviated with max loss clipping, small objects are difficult to preserve in the counterfactual generation process. These small objects, however, could be important for other domains (e.g. Pong). It is likely that some of these artefacts could be fixed by training longer, with more data, and with better architectures. This problem also raises an open question in representation learning about preserving small, but important, objects in images.

A second issue is how to select query states from a replay such that the counterfactual states, and actions, provide the most insight to a human. Our criterion was based on heuristics, and a deeper investigation is needed as other criteria could be used, such as those used by other methods for selecting key moments [Amir and Amir, 2018, Sequeira and Gervasio, 2020]. Moreover, we chose the counterfactuals to present to the participants using a heuristic rather than allowing the participants to interactively explore the space of counterfactuals. We made this choice because many of the counterfactual actions result in no change to the image, and users need more guidance as to which counterfactual actions and states are useful. We recognize that this choice directly affects the diversity among the counterfactuals that the participants see, and may hinder the building of a sufficiently well-rounded mental model [Mothilal et al., 2020].

Another area for future work is in choosing the way a counterfactual state is generated.
Our counterfactual state generation was based on finding a state s′ that was minimally changed (in the latent space) from the query state s and that would result in a different action a′ from a. The reason for the minimal change was to identify the necessary aspects of a state that would produce the action a′, without distracting the user with other irrelevant elements in the image. This minimal change criterion is similar to approaches used by other recent methods for generating counterfactuals, such as the minimal-edit approach for replacing regions in an image [Goyal et al., 2019] and the search for smallest deletion / supporting regions [Chang et al., 2019]. However, we could use other criteria besides minimal change to define the space of modifications to the query state. For instance, we could permit changes that lead to specific properties on future time steps or allow the user to help define the space of changes allowed.

Finally, we recognize that our findings are specific to the visual input environment of Atari, and the success of a generative deep learning method for producing counterfactuals in other visual environments is an open question. In particular, the fidelity of the counterfactual states depends on the amount of training data available and the ability of the deep neural network to capture salient aspects of the images from that domain. While the primary application of our work is for Atari-like domains, more sophisticated auto-encoding training methods have been shown to produce high quality images in more visually rich environments [Nie et al., 2020]. Therefore, we believe that some findings in our study could be more broadly applicable. The general framework of our explanation, namely presenting original–highlights–counterfactual, could be effective in many domains. Moreover, our results also indicate that perfect fidelity may not be necessary. Counterfactual images with sufficient fidelity could give enough insight in other domains, and even the highlights by themselves might be sufficiently insightful for other visual environments.
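To make the minimal change criterion discussed above concrete, the sketch below performs gradient descent on a latent code so that a policy head prefers the target action while the code stays close to the query's code. It is only a schematic stand-in for the full procedure in Section 3.3: policy_head is a placeholder for whatever maps latent codes to action logits, and the loss weighting is illustrative (a larger lam pushes harder toward the new action and therefore permits more change).

```python
import torch
import torch.nn.functional as F

def counterfactual_latent(z_query: torch.Tensor, policy_head, target_action: int,
                          lam: float = 1.0, steps: int = 200, lr: float = 0.05) -> torch.Tensor:
    """Find a latent code near z_query for which the policy prefers target_action."""
    z = z_query.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    target = torch.tensor([target_action])
    for _ in range(steps):
        optimizer.zero_grad()
        logits = policy_head(z)                      # (1, num_actions) action scores
        action_loss = F.cross_entropy(logits, target)
        closeness_loss = F.mse_loss(z, z_query)      # stay near the query state
        (lam * action_loss + closeness_loss).backward()
        optimizer.step()
    return z.detach()  # decode with the generator to obtain the counterfactual image
```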
7. Conclusion
We introduced a deep generative model to produce counterfactual state explanations as a way to provide insight into a deep RL agent's decision making. The counterfactual states showed what minimal changes needed to occur to a state to produce a different action by the trained RL agent. Results from our first user study showed that these counterfactual state explanations had sufficient fidelity to the actual game. Results from our second user study demonstrated that, despite having some artefacts, these counterfactual state explanations were indeed useful for identifying the flawed agent in our study as well as the specific flaw in the agent. In comparison, the nearest neighbor counterfactual explanations confused participants and resulted in fewer participants identifying the correct agent after they were shown the explanation. Furthermore, only a small proportion of participants were able to identify the specific flaw. Our study also demonstrated that the highlights and summary chart were important elements to accompany counterfactual explanations.

Our results suggest that perfect fidelity may not be necessary for counterfactual state explanations to give non-machine learning experts sufficient understanding of an agent's decision making in order to use this knowledge for a downstream task. While our study focused on Atari agents, we believe this approach is promising and could apply more broadly to domains beyond Atari with more complex visual input, though more investigation is needed. Moreover, using counterfactual state explanations in conjunction with other established and complementary explanation techniques could form a formidable toolset to help non-experts understand decisions made by deep RL agents.
8. Acknowledgements
This work was supported by DARPA under the grant N66001-17-2-4030. We would like to thank Andrew Anderson, Margaret Burnett, Jonathan Dodge, Alan Fern, Stefan Lee, Neale Ratzlaff, and Janet Schmidt for their expertise and helpful comments.

References
Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 9525–9536, Red Hook, NY, USA, 2018. Curran Associates Inc.

Samuel Alvernaz and Julian Togelius. Autoencoder-augmented neuroevolution for visual doom playing. IEEE, 2017.

Dan Amir and Ofra Amir. Highlights: Summarizing agent behavior to people. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1168–1176, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems.

Andrew Anderson, Jonathan Dodge, Amrita Sadarangani, Zoe Juozapaitis, Evan Newman, Jed Irvine, Souti Chattopadhyay, Matthew Olson, Alan Fern, and Margaret Burnett. Mental models of mere mortals with explanations of reinforcement learning. ACM Transactions on Interactive Intelligent Systems (TiiS), 10(2):1–37, 2020.

Akanksha Atrey, Kaleigh Clary, and David Jensen. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkl3m1BFDB.

Dana H Ballard. Modular learning in neural networks. In AAAI, 1987.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

Michel Besserve, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. Counterfactuals uncover the modular structure of deep generative models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJxDDpEKvH.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450336642.

Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.

Mark W. Craven and Jude W. Shavlik. Extracting tree-structured representations of trained networks. In Proceedings of the 8th International Conference on Neural Information Processing Systems, pages 24–30, Cambridge, MA, USA, 1995. MIT Press.

Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, 2017.

Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, 2018.

Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In International Conference on Machine Learning (ICML), 2019.

Samuel Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and understanding Atari agents. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Bradley Hayes and Julie A Shah. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, 2017.

Hsiu-Fang Hsieh and Sarah E. Shannon. Three approaches to qualitative content analysis. Qualitative Health Research, 2005.

Sandy H. Huang, Kush Bhatia, Pieter Abbeel, and Anca D. Dragan. Establishing appropriate trust via critical states. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3929–3936, 2018. doi: 10.1109/IROS.2018.8593649.

Sandy H. Huang, David Held, Pieter Abbeel, and Anca D. Dragan. Enabling robots to communicate their objectives. Autonomous Robots, 43(2):309–326, February 2019.

Paul Jaccard. Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat., 44, 1908.

Ali Jahanian, Lucy Chai, and Phillip Isola. On the "steerability" of generative adversarial networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HylsTT4FvB.

Zoe Juozapaitis, Anurag Koul, Alan Fern, Martin Erwig, and Finale Doshi-Velez. Explainable reinforcement learning via reward decomposition. In Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, 2019.

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Błażej Osiński, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model based reinforcement learning for Atari. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1xCPJHtDB.

Omar Zia Khan, Pascal Poupart, and James P. Black. Minimal sufficient explanations for factored Markov decision processes. In Proceedings of the Nineteenth International Conference on Automated Planning and Scheduling, 2009.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1885–1894. JMLR.org, 2017.

Anurag Koul, Alan Fern, and Sam Greydanus. Learning finite state representations of recurrent policy networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1gOpsCctm.

Isaac Lage, Daphna Lifschitz, Finale Doshi-Velez, and Ofra Amir. Exploring computational user models for agent policy summarization. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1401–1407. ijcai.org, 2019.

Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, 2017.

David Lewis. Counterfactuals. John Wiley & Sons, 1973.

Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. Explainable reinforcement learning through a causal lens. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders. CoRR, abs/1511.05644, 2015. URL http://arxiv.org/abs/1511.05644.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.

Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020.

Alex Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, and Danilo Jimenez Rezende. Towards interpretable reinforcement learning using attention augmented agents. In NeurIPS, 2019.

Lawrence Neal, Matthew Olson, Xiaoli Fern, Weng-Keen Wong, and Fuxin Li. Open set learning with counterfactual images. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

Weili Nie, Tero Karras, Animesh Garg, Shoubhik Debhath, Anjul Patney, Ankit B Patel, and Anima Anandkumar. Semi-supervised StyleGAN for disentanglement learning. arXiv preprint, 2020.

Matthew Olson, Lawrence Neal, Fuxin Li, and Weng-Keen Wong. Counterfactual states for Atari agents via generative deep learning. In Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence, 2019.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.

Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 1st edition, 2018.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Zhongang Qi, Saeed Khorram, and Fuxin Li. Visualizing deep networks by optimizing with integrated gradients. CoRR, abs/1905.00954, 2019. URL http://arxiv.org/abs/1905.00954.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

Pedro Sequeira and Melinda Gervasio. Interestingness elements for explainable reinforcement learning: Understanding agents' capabilities and limitations. Artificial Intelligence, 288:103367, November 2020. doi: 10.1016/j.artint.2020.103367. URL http://dx.doi.org/10.1016/j.artint.2020.103367.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3319–3328. JMLR.org, 2017.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n1-0b.

Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1053–1060, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems.

Jasper van der Waa, Jurriaan van Diggelen, Karel van den Bosch, and Mark Neerincx. Contrastive explanations for reinforcement learning in terms of expected consequences. In Proceedings of the IJCAI/ECAI 2018 Workshop on Explainable AI, 2018.

Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. CoRR, abs/1804.02477, 2018. URL http://arxiv.org/abs/1804.02477.

Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 2017.

Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. In International Conference on Machine Learning, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.

Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 2018.

Appendix A. Tuning the λ Parameter
Figure A.18: An example of different models trained with varying λ parameters for the normally trained agent. The original state with action a = MoveLeftAndShoot, for which to generate a counterfactual, is shown in the top left, with the rest of the images being counterfactual states where the agent would take counterfactual action a′ = Fire.

Figure A.18 shows the effects of varying the λ parameter. As λ increases, so does the amount of change in the counterfactual state, with low λ values causing nearly imperceptible changes and high λ values producing distorted, low quality states. From our first user study, we found that non-experts were able to clearly identify poor fidelity images caused by λ parameters that were too high. Given a set of images produced by different λ values, we feel that finding a "sweet spot" between too high and too low should be manageable for a non-expert viewer, as there is a fairly wide range of λ values that produce reasonably high quality counterfactuals. Automating the process of selecting a λ for high fidelity counterfactual production is beyond the scope of this work, but is an area of future interest.
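One simple way to operationalize this "sweet spot" search is sketched below: generate a candidate for each λ and keep only those whose fraction of changed pixels is neither negligible nor excessive, then let a human pick among the survivors. The generation function is passed in as a placeholder and the thresholds are illustrative; they are not values used in this work.

```python
from typing import Callable, Dict, Sequence
import numpy as np

def sweep_lambda(query_frame: np.ndarray,
                 target_action: int,
                 generate_counterfactual: Callable[[np.ndarray, int, float], np.ndarray],
                 lambda_values: Sequence[float] = (0.01, 0.05, 0.1, 0.5, 1.0, 5.0),
                 min_changed: float = 0.001,
                 max_changed: float = 0.05) -> Dict[float, np.ndarray]:
    """Generate one candidate counterfactual per lambda value and keep those
    whose fraction of changed pixels falls in a plausible range for human review."""
    kept = {}
    for lam in lambda_values:
        candidate = generate_counterfactual(query_frame, target_action, lam)
        changed_fraction = np.mean(np.any(candidate != query_frame, axis=-1))
        if min_changed <= changed_fraction <= max_changed:
            kept[lam] = candidate
    return kept
```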
Appendix B. Ablation Experiments for the Counterfactual State Neural Network Architecture

In this section, we describe many different ablation experiments using the neural network architecture described in Section 3.2 for our counterfactual state explanations.

Table B.4: An overview of which elements are given as input to the generator for each ablation experiment: (1) π(A(s)); (2) z; (3) z, π(A(s)); (4) z_w; (5) z_w, π(A(s)); (6) E(s), π(A(s)); (7) E(s), z; (8) E(s), z, π(A(s)); (9) E(s), z_w, π(A(s)); (10) E(s), z_w, π(A(s)).
These ablations illustrate how each component in our architecture is necessary to achieve high quality counterfactual images. A generator is always used with MSE reconstruction loss, but what we pass into the generator changes for each ablation experiment. We provide an overview of the different ablations in Table B.4, and Figures B.19 and B.20 contain images which are representative of the issues for each ablation experiment. Next, we discuss each ablation experiment in detail:
Figure B.19: From left to right: ablations 1–5. In every column, the top image is the original state where a = MoveRight, the center image is the auto-encoded reconstruction of the original state, and the bottom image is a counterfactual state where a′ = MoveRightAndFire.

Figure B.20: From left to right: ablations 6–10. In each column, the top image is an original state where a = MoveRight, the middle image is an auto-encoded reconstruction, and the bottom is a counterfactual state where a′ = MoveRightAndFire.
1. We investigate the effect of only using the agent's policy to generate reconstructed states and hand-modifying it to create counterfactual states: we remove all parts from our model except the agent and the generator, passing solely π(z), where z = A(s), into the generator. This gives us reconstructed states in the form of G(π(z)). We modify the policy vector π(z) by selecting a counterfactual action a′, setting π(z, a′) = π(z, a) ∗ 1.01, and normalizing the probabilities back to 1. This hand modification is clearly not representative of the agent. As shown in Figure B.19, the reconstructed and counterfactual states are extremely low quality.

2. We investigate the effect of only using the agent's learned representation to generate both reconstructed states and counterfactual states. We removed all parts from our model except the agent and the generator, passing z into the generator. This gives us reconstructed states in the form of G(z) and counterfactual states by modifying z with gradient descent as described in Section 3.3 to get a z*. As shown in Figure B.19, the counterfactual states are quite unrealistic, but surprisingly the reconstructed states are accurate.

3. We removed all parts from our model except the agent and the generator, this time passing both z and π(z) into the generator. This gives us reconstructed states in the form of G(z, π(z)) and counterfactual states by modifying z with gradient descent as described in Section 3.3 to get a z*. As shown in Figure B.19, the counterfactual states are quite unrealistic, but surprisingly the reconstructed states are accurate.

4. We investigate using only the Wasserstein auto-encoder. Here we pass only z_w into the generator, where z_w = E_w(A(s)) is the latent representation of the state in Wasserstein space. This gives us reconstructed states in the form of G(z_w) and counterfactual states by modifying z_w with gradient descent as described in Section 3.3 to get a z*_w. As shown in Figure B.19, both the reconstructed and counterfactual states are quite unrealistic.

5. We removed all parts from our model except the agent, the Wasserstein auto-encoder, and the generator. Here we pass both z_w and π(z) into the generator. This gives us reconstructed states in the form of G(z_w, π(z)) and counterfactual states by modifying z_w with gradient descent as described in Section 3.3 to get a z*_w. As shown in Figure B.19, both the reconstructed and counterfactual states improve relative to the previous ablation, but are still quite unrealistic.

6. Here we investigate the effect of keeping the encoder and discriminator, but hand-modifying the policy input to the generator instead of using the Wasserstein auto-encoder or gradient descent. The input to the generator is equivalent to our work described in Section 3. We hand-modify the policy vector π(z) by selecting a counterfactual action a′, setting π(z, a′) = π(z, a) ∗ 1.01, and normalizing the probabilities back to 1. This hand modification may, or may not, be representative of what the agent does. As shown in Figure B.20, the states have the same generated quality as our method and the counterfactual state has a small, but meaningful, change.

7. This ablation is similar to the previous ablation, but instead of passing the policy vector π(z) to the generator, we input the agent's latent space z. As with previous ablations, we generate counterfactual states by modifying z with gradient descent as described in Section 3.3 to get a z*. As shown in Figure B.20, the states have decent quality, but the counterfactual states have relatively large changes and a couple of artefacts.

8. Similar to the previous ablation, but instead of passing just z, we pass both the policy vector π(z) and z to the generator. As with previous ablations, we generate counterfactual states by modifying z with gradient descent as described in Section 3.3 to get a z*. As shown in Figure B.20, the states are better quality than when passing in z alone, but the counterfactuals are lower quality than our method.

9. We add the Wasserstein auto-encoder back into the previous ablation. Instead of passing the agent's latent space z to the generator, we pass in the Wasserstein representation z_w = E_w(A(s)). As described in Section 3.3, we generate counterfactual states by modifying z_w to get a z*_w. As shown in Figure B.20, the states are high quality, but the counterfactual states typically have no changes.

10. This experiment is an ablation in the sense that we remove the disconnection between the generation and z_w. In other words, we take our original method and add z_w as input to the generator. When counterfactual states are generated, z*_w is passed into the generator along with E(s) and π(z*_w). As shown in Figure B.20, the states are high quality and the counterfactual states are interesting. We were not able to find a difference in quality for generated states between this ablation and our method. Since this ablation is more complex, and requires more parameters, we decided not to use it for our purposes.
Appendix C. Details for User Study 2

In this section, we provide further details on the second user study. Specifically, we include the tutorial script and the images used in the second user study.
Appendix C.1. User Study Tutorial Script
In this tutorial we will introduce you to the tool for finding the malfunctioning AI. This tool shows the AI's response to specific "What if" questions. Both the functioning and malfunctioning AI provide answers to the "What if" questions. For this study, we have selected 20 different screen shots from the videos. After learning how to use the tool, you will examine the selected screen shots to collect data on the two AIs. The identity of the AIs will remain anonymous until the final evaluation. At this time, please click the checkbox, then the continue button.

Figure C.21: The user study tutorial examples used to describe counterfactual states, where the top row of images is one potential counterfactual explanation and the bottom is another. Query state with action a = TurnRight where a self-driving car is taking you home (left), counterfactual state where action a′ = GoStraight (right), and the highlighted difference (center).

For each selected screen shot, you will see three images arranged in a table. We will now go over how the table is arranged. Please click Next.

The first image is a screen shot from the original videos. Please click Next. In this column, you will also see context for the original screen shot with a short gif. Please click Next. Click on the image to change it into a gif. The gif shows the three previous game states. Then click again to return to the image. In the column, you will also see the original action the AI decided it will take at that moment in the video. Please click Next.

In this example, the AI originally decided it would take the "shoot" action. We then asked the AI, "What would the current screen need to look like for you to perform the "move right" action?" To answer this question, the AI will only evaluate the current moment in the game, not the past or the future. Please click Next.

For a more concrete example, consider the following. Imagine there is a red self-driving car that is taking you home. It approaches an intersection and it wants to turn right to take you to your destination. (Reveal Figure C.21 top-left)

Now imagine a situation where the red car would choose to go straight instead of turning right. There are various reasons why this could happen. One example is if the brown tree fell over and blocked the road. (Reveal Figure C.21 top-right)

In this example, an answer to the question of "what would need to change" right now for the car to choose go straight at this intersection (point to left image) would be "the brown tree fell over, which blocks the right turn" (point to right image). Is that clear? Excellent. Now in the examples you will look at, the AI will answer the question of "what needs to change" by responding with 2 images. Please click Next.

The first response is the changed state. This response shows the smallest amount of change in the game needed to take the different action of "move right". Back to the car example: if the original image was the intersection (point to left image), the following response image would be the intersection with the fallen brown tree (point to right image). Please click Next.

In the third column, note how the game has subtly changed in two ways: the ship is under the barrier and the barrier is fully armored. Please click Next.

The second AI response is image highlights, which takes the original screenshot and adds blue highlights to the changes. This response shows where the AI is looking for change to occur.
Using the car example, this response would look like the original intersection with a blue highlight where the brown tree has moved. (Reveal Figure C.21 top-center) Is that clear? Excellent.

It is also possible for multiple objects to influence an AI's decision. (Reveal Figure C.21 bottom-left) In this example, two things influence the red self-driving car's decision to take the move straight action. The first is if the brown tree has fallen over, but also if the red car's position has changed such that it is past the intersection. (Reveal entirety of Figure C.21) The highlights for this example show both the red car and the brown tree highlighted in blue. Is this second example clear? Excellent. Let us continue with the table and please click Next.

Note how the changed objects are highlighted in blue: the repaired barrier, and the new ship location. As you are viewing the table for each selected screen shot, you will be asked two questions. The first question is: "What objects in the game do you think the AI pays attention to?" Please click Next to view this question. You do not need to select an answer for this tutorial. Do note that you can select more than one checkbox, or no checkboxes at all.

The second question you will be asked is: which AI response or responses did you use for making your decision? Please click Next. Again, you do not need to answer this for the tutorial. You will be asked these same questions for every selected screen shot.

This is the full tool you will be using to analyze each screen shot presented in random order. This section will take about 10 to 15 minutes. For each set of images you will be asked to spend at least 30 seconds. There will be a timer on the screen. After you have finished examining the 20 randomized screen shots, you will use the data to complete the second evaluation. Your results from the tool will be displayed in both a table and a chart. Additionally, we will reveal to you which examples were from AI one and which were from AI two.

With this information, you will re-answer the question: "Which AI is malfunctioning and what objects in the game can it not see?" And finally, after you have submitted the 2nd evaluation, we ask you to perform a short written reflection. When you are ready, click "Finish tutorial" to begin viewing the 20 selected screen shots. I will leave the tutorial example on the projector and the car example on the whiteboard. You may begin.

Appendix C.2. Images
In this section we show a further selection of explanations from our user study. Figures C.22 and C.23 show explanations for the normally trained agent for both the counterfactual state explanations and the nearest neighbor counterfactual explanations, sorted by game time step. Figures C.24 and C.25 similarly show explanations for the flawed agent. These figures show how the nearest neighbor counterfactual explanations often show the green ship's position changing for the flawed agent, whereas our counterfactual state explanations never change the ship's position.
Appendix D. User study data analysis
For answering research questions 2 and 3, two researchers collectively applied content analysis [Hsieh and Shannon, 2005] to the post-task questionnaire data corpus. They developed the codes shown in Table D.5. These codes were defined by having the two researchers code 20% of the data corpus individually, achieving inter-rater reliability (IRR) of at least 90% (calculated using the Jaccard Index [Jaccard, 1908]) with all the data sets.
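For concreteness, this agreement measure can be computed per response as the Jaccard similarity between the two researchers' code sets; the example values below are hypothetical.

```python
def jaccard_index(codes_a: set, codes_b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two coders' code sets."""
    if not codes_a and not codes_b:
        return 1.0
    return len(codes_a & codes_b) / len(codes_a | codes_b)

# Hypothetical example: both coders tagged a response "Helpful"; one also tagged "Problematic".
print(jaccard_index({"Helpful"}, {"Helpful", "Problematic"}))  # 0.5
```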
Helpful: The participant found the artifact to be helpful to the main task, and it helped them better understand and evaluate the agent. Example: "Yes the third image played a role in helping me make my decision."

Problematic: The participant found the artifact to be a hindrance and problematic in the main task. Example: "The changed state portion confused me because I wasn't sure if that was the next action the AI took or the action it thought about taking given the highlighted circumstances."
Table D.5: The qualitative codes used in our analysis.

Figure C.22: The first five explanations for the normally trained agent used in the user study. (Center) The original state s where the agent took action a. (Left) The counterfactual state explanation where the agent takes action a′. (Right) The nearest neighbor counterfactual state where the agent takes action a′_nn. (Center Left/Right) The highlighted difference between each counterfactual state and the original state. Actions, top row to bottom row: a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveRight; a = MoveRightAndShoot, a′ = MoveRight, a′_nn = Shoot; a = Shoot, a′ = MoveRight, a′_nn = MoveRightAndShoot; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveLeftAndShoot; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveLeftAndShoot.

Figure C.23: Explanations 6 through 10 for the normally trained agent used in the user study. (Center) The original state s where the agent took action a. (Left) The counterfactual state explanation where the agent takes action a′. (Right) The nearest neighbor counterfactual state where the agent takes action a′_nn. (Center Left/Right) The highlighted difference between each counterfactual state and the original state. Actions, top row to bottom row: a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = Shoot; a = MoveRight, a′ = Shoot, a′_nn = MoveRightAndShoot; a = MoveRight, a′ = Shoot, a′_nn = MoveRightAndShoot; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveRight; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveRight.

Figure C.24: The first five explanations for the flawed agent used in the user study. (Center) The original state s where the agent took action a. (Left) The counterfactual state explanation where the agent takes action a′. (Right) The nearest neighbor counterfactual state where the agent takes action a′_nn. (Center Left/Right) The highlighted difference between each counterfactual state and the original state. Actions, top row to bottom row: a = MoveLeft, a′ = MoveRight, a′_nn = MoveLeftAndShoot; a = Shoot, a′ = MoveRight, a′_nn = StayStill; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveRight; a = MoveRightAndShoot, a′ = MoveLeft, a′_nn = MoveRight; a = MoveRightAndShoot, a′ = MoveRight, a′_nn = Shoot.

Figure C.25: Explanations 6 through 10 for the flawed agent used in the user study. (Center) The original state s where the agent took action a. (Left) The counterfactual state explanation where the agent takes action a′. (Right) The nearest neighbor counterfactual state where the agent takes action a′_nn. (Center Left/Right) The highlighted difference between each counterfactual state and the original state. Actions, top row to bottom row: a = MoveRightAndShoot, a′ = MoveRight, a′_nn = Shoot; a = MoveLeftAndShoot, a′ = MoveRight, a′_nn = MoveLeft; a = MoveLeftAndShoot, a′ = MoveRight, a′_nn = MoveLeft; a = MoveLeftAndShoot, a′ = MoveRight, a′_nn = MoveLeft; a = MoveLeftAndShoot, a′ = MoveRight, a′_nn = MoveLeft.