"I Don't Think So": Disagreement-Based Policy Summaries for Comparing Agents
Yotam Amitai, Ofra Amir

Abstract
With Artificial Intelligence on the rise, human interaction with autonomous agents becomes more frequent. Effective human-agent collaboration requires that the human understands the agent's behavior, as failing to do so may lead to reduced productivity, misuse, frustration and even danger. Agent strategy summarization methods are used to describe the strategy of an agent to its intended user through demonstration. The summary's purpose is to maximize the user's understanding of the agent's aptitude by showcasing its behaviour in a set of world states, chosen by some importance criterion. While shown to be useful, we show that these methods are limited in supporting the task of comparing agent behavior, as they independently generate a summary for each agent. In this paper, we propose a novel method for generating contrastive summaries that highlight the differences between agents' policies by identifying and ranking states in which the agents disagree on the best course of action. We conduct a user study in which participants face an agent selection task. Our results show that the novel disagreement-based summaries lead to improved user performance compared to summaries generated using HIGHLIGHTS, a previous strategy summarization algorithm.
1. Introduction
With the maturing of reinforcement learning (RL) methods, RL-based agents are being trained to perform complex tasks in various domains, including robotics, healthcare and transportation. Importantly, these agents do not operate in a vacuum: people interact with agents in a wide range of settings. Effective interaction between an agent and a user requires that the user be able to anticipate and understand the agent's behavior. E.g., a clinician will need to understand the treatment regime recommended by an agent to determine whether it aligns with the patient's preferences. To facilitate improved user understanding of agents' behavior and reasoning, a range of explainable RL methods have been developed (Puiutta & Veith, 2020). These include two types of explanations: (1) "local" explanation approaches explaining why an agent chose a particular action in a given state, e.g., saliency maps highlighting the information an agent attends to (Greydanus et al., 2017), and (2) "global" explanation methods that describe the policy of an agent more generally, such as strategy summaries that demonstrate the agent's behavior in a selected set of states (Amir et al., 2019). While these approaches have been shown to improve people's understanding of agent behavior, they are typically not optimized for a particular user task.

In this work, we aim to address the problem of supporting users' ability to distinguish between the behavior of agents. This problem arises when people need to select an agent from a set of alternatives. E.g., a user might need to choose from a variety of smart home assistants or self-driving cars available on the market. Importantly, there is often no clear "ground truth" for a superior agent, as different agents may prioritize different outcomes and different users may have different preferences. For example, some people may prefer self-driving cars that value safety very highly, while others might be willing to relax safety considerations a little to enable faster driving. The ability to distinguish policies is also important for model developers, as different configurations of reward functions and algorithm parameters can lead to different behaviors in unexpected ways, especially in domains where the reward function is not obvious, as in healthcare (Gottesman et al., 2019).

One possible approach to helping users distinguish the policies of agents is to use strategy summarization methods (Amir et al., 2018; 2019). Using these methods, a summary can be generated for each agent, and the user could compare the agents' behavior. However, these approaches are not optimized for the task of agent comparison, as each summary is generated independently. For instance, the HIGHLIGHTS algorithm (Amir & Amir, 2018) selects states based on their importance as determined by the differences in Q-values for alternative actions in a state. If two high-quality agents are compared, it is possible that they will consider the same states as most important, and will choose the same (optimal) action in those states, resulting in similar strategy summaries. In such a case, even if the agents' policies differ in many other parts of the state-space, it will not be apparent from the summaries.

In this work, we present the DISAGREEMENTS algorithm, which is optimized for comparing agent policies. Our algorithm compares agents by simulating them in parallel and noting disagreements between them, i.e., situations where the agents' policies differ.
These disagreement states constitute the behavioral differences between agents and are used to generate a visual summary optimized towards showcasing the most prominent strategy conflicts, thus providing contrastive information. Our approach assumes access to the agent's strategy, described as a Markov Decision Process (MDP) policy, and quantifies the importance of disagreements by making use of the agents' Q-values.

To test DISAGREEMENTS, we generated summaries comparing agents playing the game of Frogger (Sequeira & Gervasio, 2020) and evaluated them in a human-subject experiment. We compare these summaries to ones produced by HIGHLIGHTS, which we use as a baseline. In the experiment, participants were shown summaries of different Frogger agents which varied in their performance. They were asked to select the better performing agent and answered explanation satisfaction questions. Results show that DISAGREEMENTS led to improved user performance on the agent selection task, compared to HIGHLIGHTS.

The contributions of this work are threefold: i) we introduce and formalize the problem of comparing agent policies; ii) we develop DISAGREEMENTS, an algorithm for generating contrastive visual summaries of agents' behavioral conflicts; and iii) we conduct a human-subject experiment, showing that summaries generated by DISAGREEMENTS led to improved user performance compared to HIGHLIGHTS summaries.
2. Related Work
In recent years, explainable AI has regained interest, initially focusing mainly on explaining supervised models. More recently, a line of research has begun exploring explanations of reinforcement learning agents (Puiutta & Veith, 2020). Some explanation methods attempt to convey the agent's reasoning when selecting actions, e.g., using decision trees (Liu et al., 2018) or causal models (Madumal et al., 2020). In this work, we focus on global explanations that aim to describe the policy of the agent rather than explain a particular action. Specifically, we develop a new method for strategy summaries; therefore, in this section we describe strategy summary methods (Amir et al., 2019) in more depth.

Strategy summaries convey agent behavior by demonstrating the actions taken by the agent in a selected set of world states. The key question in this approach is how to recognize meaningful agent situations. One such approach, called HIGHLIGHTS (Amir & Amir, 2018), extracts important states from execution traces of the agent. Intuitively, a state is considered important if the decision made in that state has a substantial impact on the agent's utility. To illustrate, a car reaching an intended highway exit would be an important state because taking a different action (continuing on the highway) will cause a significant delay. The algorithm also considers the diversity of states in the summary to avoid showing redundant information. HIGHLIGHTS has been shown to support people's ability to understand the capabilities of agents and develop mental models of their behavior (Amir & Amir, 2018; Huber et al., 2020).

Sequeira & Gervasio (2020) extended the HIGHLIGHTS approach with additional criteria for selecting states, which they refer to as
Interestingness Elements: domain-independent criteria obtained through statistical analysis of an agent's interaction history with the environment over the training sessions. In a user study they did not reach clear conclusions as to which criterion is best, but found that a combination of elements enables the most accurate understanding of an agent's aptitude in a task.

Other works have proposed a different approach for generating summaries, which is based on machine teaching (Huang et al., 2017; Lage et al., 2019). The idea underlying these methods is to select a set of states that is optimized to allow the reconstruction of the agent's policy using imitation learning or inverse reinforcement learning methods. While these approaches are computationally appealing, as they rely on computational principles rather than heuristics, in practice they were shown to have limitations. Specifically, people's performance in predicting agent behavior was lower than that of computational approaches, because people often use different state representations and computational models from those assumed by summary generation algorithms (Lage et al., 2019).

Common to all previous policy summarization approaches is that each summary is generated specifically for a single agent policy, and is thus independent of other agents' behavior. This can hinder users' ability to compare agents, as the summaries might show regions of the state-space where the agents act similarly, failing to capture useful information with respect to where the agent policies diverge. For example, HIGHLIGHTS focuses on "important" states, and it is likely that the states found to be of most importance to one agent may be considered important by another agent as well. These could be inherently important stages of the domain, such as reaching the goal or bypassing a dangerous state. If the agents act similarly in these important states, the HIGHLIGHTS summaries of the agents could portray similar behavior, even for agents that have different policies in many other parts of the state-space. Additionally, when the summaries of different agents cover different regions of the state-space, they do not convey to the user what an agent would have done had it encountered situations shown in the other agent's summary. To address these limitations, we propose a new approach that is specifically optimized for supporting users' ability to distinguish between policies.
3. Background
For the purpose of this work, we assume a Markov Decision Process (MDP) setting. Formally, an MDP is a tuple $\langle S, A, R, Tr \rangle$, where $S$ is the set of world states; $A$ is the set of possible actions available to the agent; $R : S \times A \rightarrow \mathbb{R}$ is a reward function mapping each state-action pair to a reward; and $Tr : S \times A \times S \rightarrow [0, 1]$ is the transition function. A solution to an MDP is a policy, denoted $\pi$. An agent's policy is a mapping from states to actions such that for each world state, a designated action is assigned.

Summaries. A summary, denoted $S$, consists of a set of trajectories $T$, as opposed to singular states, in order to supply the user with richer context for the agent's behaviour in each shown state. Each trajectory is a sequence of $l$ consecutive state-action pairs $t = \langle (s_i, a_i), \ldots, (s_{i+l}, a_{i+l}) \rangle$ extracted from the agent's execution traces. In this work we discuss various methods for summary extraction. To this end, we formally define the summary extraction process of an agent's policy given an arbitrary importance function $Im$, mapping state-action pairs to numerical scores.

Definition 1 (Summary Extraction). Given an agent's execution traces, a summary trajectory budget $k$, and an importance function $Im$, the agent's summary $S$ is the set of trajectories $T = \{t_1, \ldots, t_k\}$ that maximizes the importance function:
$$S = \max_{T} Im(T) \quad (1)$$

In this paper, our baseline will be the HIGHLIGHTS algorithm, which computes importance as a function of the $Q$-values in a given state. Specifically, we implement the importance function from Huber et al. (2020), an extension to HIGHLIGHTS, which determines the importance of a state based on the difference between the highest and second-highest $Q$-values. Formally:
$$Im(s) = \max_{a} Q^{\pi}(s, a) - \operatorname{second\ highest}_{a} Q^{\pi}(s, a) \quad (2)$$

The trajectory is then the sequence of states preceding and succeeding the important state.
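To make the baseline concrete, here is a minimal sketch of the importance computation in Eq. (2) given a state's Q-values; the function name and array layout are illustrative assumptions, not the original implementation.

```python
import numpy as np

def highlights_importance(q_values: np.ndarray) -> float:
    """Importance of a state as the gap between the best and
    second-best Q-values (Eq. 2), following Huber et al. (2020)."""
    if q_values.size < 2:
        return 0.0  # a single available action offers no meaningful gap
    top_two = np.sort(q_values)[-2:]       # two largest Q-values, ascending
    return float(top_two[1] - top_two[0])  # max minus second-highest

# Example: a state where one action clearly dominates is "important".
print(highlights_importance(np.array([0.1, 0.9, 0.2, 0.15])))  # 0.7
```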
4. Disagreement-Based Summaries
We propose a new summary method which aims to support the comparison of alternative agents by explicitly highlighting the disagreements between the agents. These can be thought of as contrastive summaries, as they show where the agents' policies diverge. This approach is in line with the literature on explanations from the social sciences, which shows that people prefer contrastive explanations (Miller, 2018). We note that while contrastive explanations typically refer to "why not?" questions and consider counterfactuals, in our case the contrast is between the decisions made by two different policies.

We next describe our proposed disagreement-based summary method. Specifically, we formalize the notion of agent disagreement, describe how to measure the importance of a disagreement state and a disagreement trajectory, and finally describe the DISAGREEMENTS algorithm for generating a joint summary of the behavior of two agents.
There are two main dimensions on which agents can differ: their valuations of states, and their policies, i.e., their choice of action. These two dimensions are of course related, as different state valuations will naturally lead to different actions. Our definition of disagreements focuses on the policy, and later we also consider the value function for assessing the extent of a disagreement. In other words, any state s for which different agents choose different actions will be considered a disagreement state. We will use these states to analyze and describe how agents differ from one another in their behaviors. Formally:

Definition 2 (Disagreement State). Given two agents $Ag_1$ and $Ag_2$ with policies $\pi_1, \pi_2$, respectively, a state $s$ is a disagreement state iff
$$\pi_1(s) \neq \pi_2(s) \quad (3)$$

The set of all disagreement states $\mathcal{D}$ is then
$$\mathcal{D} = \{ s \in S \mid \pi_1(s) \neq \pi_2(s) \} \quad (4)$$

For a compact MDP where every state may be enumerated, this definition would suffice. Alas, for more complex settings containing a continuous or vast state space, it is not feasible to compare all states. The proposed method must be able to overcome this difficulty.

Identifying Agent Disagreements Through Parallel Online Execution.
Given two alternative agents to compare, we initiate an online execution of both agents simultaneously such that we follow the first (denoted the Leader, or L for short, with policy $\pi_L$), while querying the second agent (denoted the Disagree-er, or D for short, with policy $\pi_D$) for the action it would have chosen in each state. Both agents are constrained to deterministically choose the action for which they have the highest estimated Q-value as their next action.

Upon reaching a disagreement state, we allow the Disagree-er to "split off" from following the Leader and continue independently for a limited number of steps while recording the states it reaches for later analysis. Once the step limit is reached, we store the disagreement trajectory and bring the Disagree-er back to the disagreement state, from which it continues to follow the Leader until the next disagreement state is reached and the process is repeated.

Upon execution termination, we are left with the task of conveying the information obtained. A naive approach would be to simply display all disagreements found. However, such a solution would be infeasible, as there may be numerous such states and we do not wish to burden the user with the need to examine all of them. Therefore, we require a method for ranking these states. Thus, we next describe approaches for quantifying the importance of a disagreement.
Various methods can be used for determining disagreement importance; however, all require that we first define the notion of value for a given state based on multiple agents.
State Value
Each agent possesses an internal evaluation function assigning values to states or state-action pairs. We assume the use of Q-based agents with a Q-function quantifying their evaluation of state-action pairs, denoted $Q(s, a) \rightarrow \mathbb{R}$. These values are calculated and adjusted during the training phase of the agent and depend on the algorithm used as well as on the specification of the reward function. Therefore, the values assigned to state-action pairs by different agents may vary greatly. This renders each individual agent's assessment of a state-action pair its own estimate rather than a ground truth. The values themselves may not even be on the same scale. To allow for comparison between values, we normalize each agent's state-action values by dividing them by the maximal absolute Q-value, thus rendering the value itself a function of how good a state-action value is compared to the best one, as viewed by the agent. Formally:
$$Q'(s, a) = \frac{Q(s, a)}{\max_{s,a} |Q(s, a)|}; \quad \forall s, a : Q'(s, a) \in [-1, 1] \quad (5)$$

Since our agents are constrained to always choose the action yielding the highest value in a state, we can denote the value of state $s$ as the highest state-action value associated with it.

Definition 3 (Agent State Value). Given an agent $Ag$, its Q-function $Q_{Ag}$ and a state $s$, we define the value of $s$ as:
$$V_{Ag}(s) = \max_{a} Q'_{Ag}(s, a) \quad (6)$$
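As an illustration of Eqs. (5) and (6), the following is a minimal sketch assuming an agent's Q-values are available as a state-by-action array; the names are placeholders rather than the paper's code.

```python
import numpy as np

def normalize_q(q_table: np.ndarray) -> np.ndarray:
    """Scale an agent's Q-values into [-1, 1] by the maximal absolute value (Eq. 5)."""
    max_abs = np.max(np.abs(q_table))
    return q_table / max_abs if max_abs > 0 else q_table

def state_value(q_norm: np.ndarray, state: int) -> float:
    """Agent state value: the best normalized Q-value available in the state (Eq. 6)."""
    return float(np.max(q_norm[state]))

# q_table[state, action]: toy table for a 3-state, 2-action agent.
q_table = np.array([[2.0, -1.0], [0.5, 4.0], [-3.0, 1.0]])
q_norm = normalize_q(q_table)
print(state_value(q_norm, 1))  # 1.0, since (1, 1) holds the maximal Q-value
```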
Disagreement Extent. Any state where the next action of the compared agents differs is a disagreement state, but not all disagreements are of the same magnitude. For instance, consider a scenario where both agents wish to arrive at a location that is diagonally above them and to the right. One agent chooses to first go up while the other chooses to go right. It is reasonable then to assume that each agent's assessment of the other's action was not far from that of the action it chose to take. We would therefore like to define disagreement extent as a metric for predicting the impact of a specific disagreement. We define the notion of an agent's confidence in an action, $C_{Ag}(s, a)$, as the probability that this action is the optimal one to take in the current state, proportional to all other action possibilities. Formally:
$$C_{Ag}(s, a) = \frac{Q'_{Ag}(s, a)}{\sum_{i=1}^{|A|} Q'_{Ag}(s, a_i)} \quad (7)$$

Using this notion, we propose a measure for disagreement extent based on the confidence of each agent in its choice of action compared to the action selected by the other agent.

BetTY (Better Than You): Given the Leader and Disagree-er agents L, D, a disagreement state $s \in \mathcal{D}$, and the actions chosen by each of the agents, denoted $a_L = \pi_L(s)$ and $a_D = \pi_D(s)$, we propose measuring the significance of the disagreement by querying each agent not solely about its confidence in its own choice but, in addition, about how confident it is that the rival agent's choice of action is poor. The BetTY importance is then, formally:
$$Im_{BetTY}(s) = |\Delta(C_L(s, a_L), C_L(s, a_D))| + |\Delta(C_D(s, a_D), C_D(s, a_L))| \quad (8)$$

Now we are able to quantify the extent of the disagreement as dependent on both agents' evaluations.

Alas, measuring only the importance of the disagreement state has its limitations. Without ground truth regarding the quality of the disagreeing actions, we are left only with the agents' estimations, which may be flawed. Suppose our agents reach a disagreement state where the disagreement extent is high, i.e., in the case of BetTY, where both agents think the other is making a lousy choice. They each go their separate ways but reunite at a shared state after taking several actions with minimal to no impact on the succeeding execution. This realization led us to formulate another approach for determining the importance of a disagreement state.
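A minimal sketch of the confidence measure (Eq. 7) and the BetTY disagreement extent (Eq. 8) might look as follows, assuming both agents' normalized Q-values for the disagreement state are available as vectors; the names and toy values are illustrative, not the paper's implementation.

```python
import numpy as np

def confidence(q_norm_row: np.ndarray, action: int) -> float:
    """Agent's confidence in an action: its normalized Q-value relative to
    the sum over all actions in the state (Eq. 7)."""
    return float(q_norm_row[action] / np.sum(q_norm_row))

def betty_importance(qL_row, qD_row, a_leader, a_disagreer) -> float:
    """BetTY disagreement extent (Eq. 8): each agent's confidence gap between
    its own choice and the rival's choice, summed over both agents."""
    leader_gap = abs(confidence(qL_row, a_leader) - confidence(qL_row, a_disagreer))
    disagreer_gap = abs(confidence(qD_row, a_disagreer) - confidence(qD_row, a_leader))
    return leader_gap + disagreer_gap

# Toy example (positive normalized Q-values assumed for readability).
qL_row = np.array([0.9, 0.1, 0.3])   # Leader strongly prefers action 0
qD_row = np.array([0.2, 0.8, 0.4])   # Disagree-er strongly prefers action 1
print(betty_importance(qL_row, qD_row, a_leader=0, a_disagreer=1))
```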
An alternate approach to determining the importance of a disagreement state is to compare the trajectories that emerge from the conflict in actions. In other words, we can formulate metrics for evaluating trajectories stemming from the disagreement state. To do so, we constrain ourselves to observing and evaluating trajectories of similar length for the two agents.

A trajectory $Tr^{\pi}_h(s) = \langle s_{+1}, \ldots, s_{+h} \rangle$ denotes the sequence of states encountered when following state $s$ for $h$ steps according to a policy $\pi$. There are many ways in which the importance of a trajectory can be quantified, e.g., by averaging the values of the encountered states, summing them, etc. Since we are interested in quantifying the extent of a disagreement, we focus on the difference in the values of the last state encountered in the trajectory by each of the agents. This reflects how "far off" from each other the disagreement has led the two agents, L and D.

Last State Importance: We define the importance of a disagreement trajectory as the difference between the values of the last states reached by each of the agents. Since each agent evaluates states differently, we consider the sum of the two valuations, i.e., $V(s) = V_L(s) + V_D(s)$. Then:
$$Im(Tr^{\pi_L}_h(s), Tr^{\pi_D}_h(s)) = |\Delta(V(s^{\pi_L}_{+h}), V(s^{\pi_D}_{+h}))| \quad (9)$$

Using these methods we are able to acquire the set of disagreement states $\mathcal{D}$ ordered by importance, and for each, their corresponding trajectories. These are stitched together to create a summary for displaying to the user. To increase the coverage of the summary and avoid showing redundant trajectories, we restrict the generated summary to not contain i) multiple trajectories that end or begin at the same state, ii) trajectories where the Leader and Disagree-er share the same state before last, and iii) overlapping trajectories which share more than a predefined number of states (denoted overlapLim).
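A possible sketch of the last-state importance (Eq. 9), assuming each agent exposes a state-value function and the two h-step trajectories are available as state sequences; the signature is illustrative, not the paper's implementation.

```python
from typing import Callable, Sequence

def last_state_importance(
    leader_traj: Sequence,     # states visited by the Leader after the disagreement
    disagreer_traj: Sequence,  # states visited by the Disagree-er after the disagreement
    value_leader: Callable,    # V_L(s): Leader's normalized state value
    value_disagreer: Callable, # V_D(s): Disagree-er's normalized state value
) -> float:
    """Last-state importance (Eq. 9): how far apart the two h-step trajectories
    end up, where each end state is valued by the sum of both agents' valuations."""
    def joint_value(state):
        return value_leader(state) + value_disagreer(state)  # V(s) = V_L(s) + V_D(s)

    return abs(joint_value(leader_traj[-1]) - joint_value(disagreer_traj[-1]))
```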
The DISAGREEMENTS algorithm pseudo-code is supplied in Algorithm 1, and the parameters it uses are summarized in Table 1. DISAGREEMENTS takes as input two agents, one which will act as the Leader L, and another as the Disagree-er D. These are used to determine the agents' actions and state evaluations and to progress the simulation. k is a budget that determines the number of trajectories desired in the output summary, and l is the length of each trajectory. Each summary trajectory includes states preceding and succeeding the disagreement state; the number of subsequent states to include in a trajectory is determined by h. In addition, numSim is the number of simulations to collect disagreement states from, and overlapLim is the limit on the number of states that may overlap between trajectories. The importance method used for ranking the disagreements, impMeth, is supplied as input as well. The algorithm outputs a summary S of the agents' most important disagreements, which is a set of trajectories.

Table 1. The DISAGREEMENTS algorithm parameters. Values in parentheses were used in the experiments.

Parameter  | Description (value used in experiments)
k          | Summary budget, i.e., number of trajectories (5)
l          | Length of each trajectory (10)
h          | Number of states following s to include in the trajectory (5)
numSim     | Number of simulations (episodes) run by the DISAGREEMENTS algorithm (10)
overlapLim | Maximal number of shared states allowed between two trajectories in the summary (3)
impMeth    | Importance method used for evaluating disagreements (Last State)
The Algorithm.
First, three lists are initialized for the Leader traces, disagreement states and Disagree-er trajectories (lines 4–6). Then, we run simulations of the agents (lines 7–27), where in each simulation we collect all states seen by the Leader during the execution (line 24), disagreement states (line 13), and the Disagree-er trajectories (lines 14–19). At each step of the simulation, both agents are queried for their choice of next action (lines 10–11); if they do not agree on the action, a disagreement state is added (line 13) and a disagreement trajectory is created (lines 14–19), after which the simulation is restored to the last disagreement state (line 21). After all simulations are completed, all disagreement trajectories (coupled pairs of Leader and Disagree-er trajectories) are obtained (line 28) and the most important ones are returned as output (line 29).
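The final selection step (topImpTraj, line 29) keeps the k most important disagreement trajectory pairs while enforcing the diversity constraints described above. A possible greedy sketch, assuming each candidate carries an importance score and the set of states it covers (this data layout is an assumption, not the paper's implementation):

```python
def top_important_trajectories(candidates, k, overlap_lim):
    """Greedily pick the k highest-importance disagreement trajectory pairs,
    skipping candidates that share more than overlap_lim states with an
    already-selected trajectory (the diversity constraint of Section 4).

    Each candidate is a dict with an 'importance' score and a 'states' set."""
    selected = []
    for cand in sorted(candidates, key=lambda c: c["importance"], reverse=True):
        too_similar = any(
            len(cand["states"] & chosen["states"]) > overlap_lim
            for chosen in selected
        )
        if not too_similar:
            selected.append(cand)
        if len(selected) == k:
            break
    return selected
```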
5. Empirical Methodology
Empirical Domain.
To evaluate our algorithm, we generated summaries of agents playing the game of Frogger (Sequeira & Gervasio, 2020). The objective of the game is to guide a frog from the bottom of the screen to an empty lily-pad at the top of the screen. The agent controls the frog and can initiate the following four movement actions: up, down, left or right, causing the frog to hop in that direction. To reach the goal, the agent must lead the frog across a road with moving cars while avoiding being run over; then, the agent must cross the river by jumping on passing logs. For an in-depth description of the game environment, state representations and configurations, we refer the reader to Sequeira & Gervasio (2020). A screenshot from the user study, displayed in Figure 1, includes a snapshot of the game.
Agents.
We made use of the framework developed by Sequeira & Gervasio (2020) to test the DISAGREEMENTS algorithm on multiple configurable agents of varying capabilities. Three different agents were trained based on the configuration provided by the framework:

• Expert (E):
• Mid-range (M):
• LimitedVision (LV):

In terms of performance, E > LV > M. All HIGHLIGHTS and DISAGREEMENTS summaries were generated for fully trained agents, thus reflecting their final policies.

Algorithm 1. The DISAGREEMENTS algorithm.

Input: π_L, π_D, k, l, h, overlapLim, numSim, impMeth
Output: S
  L_Tr ← empty list
  D ← empty list
  D_T ← empty list
  for i = 1 to numSim do
    sim, s ← InitializeSimulation()
    while not sim.ended() do
      a_πL ← sim.getAction(π_L(s))
      a_πD ← sim.getAction(π_D(s))
      if a_πL ≠ a_πD then
        D.add(s)
        d_t ← empty list
        for j = 1 to h do
          s_πD ← sim.advanceState(π_D)
          a_πD ← sim.getAction(π_D(s_πD))
          d_t.add(s_πD)
        end for
        D_T.add(d_t)
        sim, s ← reloadSimulation(D[-1])
      end if
      s ← sim.advanceState(π_L)
      L_Tr.add(s)
    end while
    runs ← runs + 1
  end for
  DA_T ← disagreementTrajPairs(D, L_Tr, D_T, l, h)
  S ← topImpTraj(DA_T, k, overlapLim, impMeth)
Experimental Conditions and Hypotheses.
We used a between-subject experimental design with two experimental conditions that varied in the method used to generate summaries. Participants were randomly assigned to a condition. In one condition, summaries were generated using the DISAGREEMENTS algorithm, while in the baseline condition summaries were generated using HIGHLIGHTS (Amir & Amir, 2018). We chose HIGHLIGHTS as a baseline as it was previously shown to support people's ability to assess the skills of different agents. We further verified, in an experiment run prior to this study, that it works well for supporting users' comparison of agents that differ substantially in their skills (see supplementary materials for more information).

Figure 1. A screenshot of the agent selection task.

The parameter values for the algorithms used to generate the Frogger summaries are listed in Table 1. All summaries were composed of five trajectories (k = 5) made up of ten sequential states (l = 10) containing the important state, four preceding states and five succeeding states (h = 5). A summary length of 5 was found useful in previous experiments with HIGHLIGHTS (Amir & Amir, 2018); an examination of this parameter's sensitivity appears in the Appendix. Video clips of the summaries were generated to present to the users, and a fade-in and fade-out effect was added to further call attention to the transition between trajectories. Summaries generated by DISAGREEMENTS portray both agents simultaneously on the screen. To help reduce the cognitive load of following two agents at once, the DISAGREEMENTS summary videos were slowed down. Summary videos are provided in the supplementary material.

The objectives of the experiment were twofold: first, to support our claims regarding the limitations of the HIGHLIGHTS algorithm for comparing between agents, and second, to compare the DISAGREEMENTS algorithm to HIGHLIGHTS and show its added value.
Hypotheses.
In general, we hypothesized that the summaries generated by the HIGHLIGHTS algorithm are limited in their ability to help users distinguish between agents and that the DISAGREEMENTS algorithm is more suited for this task. More specifically, we state the following hypotheses:

• H1: Participants shown summaries generated by HIGHLIGHTS for agents of similar skill will struggle to identify the better performing agent.
• H2: Participants shown summaries generated by the DISAGREEMENTS algorithm will exhibit a higher success rate for identifying the better performing agent, compared to ones shown HIGHLIGHTS summaries.
• H3: Explanation satisfaction of participants shown DISAGREEMENTS summaries will not be lower than that of participants shown HIGHLIGHTS summaries (i.e., DISAGREEMENTS does not harm explanation satisfaction).
Participants.
74 participants were recruited through Amazon Mechanical Turk (27 female, mean age = 36, STD = 9), each receiving $3 for completing the Human Intelligence Task (HIT). To incentivize participants to make an effort, they were given a bonus of 10 cents for each correct answer in the agent selection task.

Procedure.
Participants were first introduced to the game of Frogger and the concept of AI agents. Each explanation was followed by a short quiz to ensure understanding before advancing to the task. Next, participants were randomly split into one of two conditions and were shown summary videos of pairs of different agents generated using either DISAGREEMENTS or HIGHLIGHTS.

Participants in both groups were first introduced to the summary method they would be shown and were required to pass a quiz to ensure their understanding. Participants were then asked to choose the better performing agent based on the summary videos. They were able to pause, play and replay the summary videos without restrictions, allowing them the freedom to fully inspect the summary before deciding which agent they believe is more skillful. Participants were also asked to provide a textual explanation for their selection and to rate their confidence in the agent selection decision on a 7-point Likert scale (1 - not at all confident to 7 - very confident). Overall, there were three pairs of agent comparisons: ⟨E, M⟩, ⟨E, LV⟩ and ⟨LV, M⟩. The ordering of the agent pairs was randomized to avoid learning effects, and participants were not told whether the same agent appeared in multiple comparisons; that is, they made each decision independently of other decisions.

Participants in the HIGHLIGHTS condition were shown a HIGHLIGHTS summary of each agent (i.e., two separate videos, one for each agent), while participants in the DISAGREEMENTS group were shown two configurations of the DISAGREEMENTS summaries: one summary where the first agent is the Leader and the second is the Disagree-er, and the opposite summary, where the first agent is the Disagree-er and the second agent is the Leader. That is, in both conditions participants were provided with two videos, each with the same number of trajectories: in the HIGHLIGHTS condition each video showed only one of the agents, while in the DISAGREEMENTS condition each video showed both agents (each with different Leader and Disagree-er roles). Upon conclusion, all participants were required to answer a series of explanation satisfaction questions adapted from Hoffman et al. (2018). The complete surveys are provided in the supplementary materials.
Evaluation Metrics and Analyses.
The main evaluation metric of interest was the success rate in selecting the better performing Frogger agent with each summary method. We compare this metric across the different agent selection tasks given to participants. We also compare participants' confidence in their agent selections. To compare the explanation satisfaction ratings given to the summaries, we averaged the values of the different items, with items normalized such that higher values always mean that the summary is more helpful. In all analyses we used the non-parametric Mann-Whitney U test and computed effect sizes using rank-biserial correlation. In all plots, the error bars depict bootstrapped confidence intervals (Efron & Tibshirani, 1994).
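As an illustration of this analysis pipeline (not the authors' code), the sketch below runs a Mann-Whitney U test, derives a rank-biserial effect size from the U statistic, and bootstraps a confidence interval for the difference in success rates; the data and names are placeholders.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Placeholder per-participant outcomes (1 = correct agent selection, 0 = incorrect).
da_correct = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1])   # DISAGREEMENTS condition
hl_correct = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])   # HIGHLIGHTS condition

# Mann-Whitney U test between the two independent groups.
u_stat, p_value = mannwhitneyu(da_correct, hl_correct, alternative="two-sided")

# Rank-biserial correlation derived from U: r = 1 - 2U / (n1 * n2).
r_rb = 1 - (2 * u_stat) / (len(da_correct) * len(hl_correct))

# Bootstrapped 95% confidence interval for the difference in success rates.
diffs = [
    rng.choice(da_correct, da_correct.size, replace=True).mean()
    - rng.choice(hl_correct, hl_correct.size, replace=True).mean()
    for _ in range(10_000)
]
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"U={u_stat:.1f}, p={p_value:.3f}, rank-biserial r={r_rb:.2f}, "
      f"95% CI for rate difference: [{ci_low:.2f}, {ci_high:.2f}]")
```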
6. Results
We report the main experimental results with respect to our hypotheses regarding participants' success rate in identifying the better performing agents and their explanation satisfaction ratings. We also present results regarding participants' confidence in their choices, for which we did not have an a priori hypothesis. In the following graphs, the notation DA stands for the DISAGREEMENTS algorithm, while HL represents the HIGHLIGHTS method. Figure 2 shows the percentage of participants who were able to correctly identify the better performing agent in each of the experiment conditions, for each agent pair combination.

(H1) Participants in the HIGHLIGHTS condition struggled to identify the better performing agent in the comparison task. When all agents are of decent performance, the difficulty of distinguishing between them manifests itself in a poor success rate. Based on participants' textual explanations of their choice of agent, it seems they were concerned that agent E was indecisive, e.g., "[Agent E] seems very indecisive while ... [Agent M] seems to have a plan and is going with it.", "... [Agent E] behaved more like a human." We hypothesize that these responses are a consequence of a single trajectory in agent E's summary where the frog is seen leaping between logs in a seemingly indecisive manner. These results emphasize the limitations of independently generated summaries.

(H2) Participants in the DISAGREEMENTS condition were more successful in the agent comparison task. Participants in the DISAGREEMENTS group showed a vast improvement in the ability to identify the better performing Frogger agent (see Figure 2). The differences in success rate between conditions were statistically significant and substantial for all agent comparisons (E vs. LV: p = 1.−, r = 0.; E vs. M: p = 1.−, r = 0.; LV vs. M: p = 0., r = 0.). Textual explanations provide insights regarding how the contrastive nature of the DISAGREEMENTS summaries helped participants decide which agent to choose, e.g., "I preferred the path that ... [Agent E] was taking"; "I felt that ... [Agent E] was making slightly stronger moves, and pushing ahead further".
Figure 2. Agent selection success percentage.

(H3) Explanation satisfaction of participants shown DISAGREEMENTS summaries was similar to that of participants shown HIGHLIGHTS summaries. The distributions of participants' scores for the explanation satisfaction questions are shown in Figure 3. Participants exposed to the DISAGREEMENTS summaries reported satisfaction measures slightly higher than those who were shown the HIGHLIGHTS summaries for the same agents. The differences in satisfaction were not statistically significant (p = 0.), supporting H3.

Figure 3. User satisfaction of summary methods.
Confidence of correct participants.
Figure 4 shows the confidence ratings of participants who made the correct choice. (We compare the confidence ratings of correct participants, as being confident in a wrong choice is not desirable.) Ratings were slightly higher in two of the comparisons for the DISAGREEMENTS condition; however, none of the differences were statistically significant.

Figure 4. Confidence of correct participants.
7. Discussion and Future Work
With the maturing of AI, people are likely to encounter situations in which they will need to choose between alternative solutions available in the market. Thus, the necessity of distinguishing between agent behaviors becomes ever more clear. Moreover, distinguishing between policies is also important for agent developers in their analysis of different algorithms and environment configurations.

In this paper, we proposed a new approach for comparing RL agents by generating summaries of the disagreements between their policies. Our experimental results show that such summaries improved people's ability to identify the superior agent compared to summaries that were generated separately for each agent.

As for future work, we note the following possible directions: i) expanding DISAGREEMENTS to enable comparison of more than two agents; ii) testing additional state and trajectory importance methods; iii) further enhancing the diversity between trajectories in the summary; and iv) formulating and defining disagreement "types" for generating further user-specific summaries.

References
Amir, D. and Amir, O. HIGHLIGHTS: Summarizing agent behavior to people. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2018.

Amir, O., Doshi-Velez, F., and Sarne, D. Agent strategy summarization. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1203–1207. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Amir, O., Doshi-Velez, F., and Sarne, D. Summarizing agent strategies. Autonomous Agents and Multi-Agent Systems, 33(5):628–644, 2019.

Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. CRC Press, 1994.

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019.

Greydanus, S., Koul, A., Dodge, J., and Fern, A. Visualizing and understanding Atari agents. arXiv preprint arXiv:1711.00138, 2017.

Hoffman, R. R., Mueller, S. T., Klein, G., and Litman, J. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608, 2018.

Huang, S. H., Held, D., Abbeel, P., and Dragan, A. D. Enabling robots to communicate their objectives. Autonomous Robots, pp. 1–18, 2017.

Huber, T., Weitz, K., André, E., and Amir, O. Local and global explanations of agent behavior: Integrating strategy summaries with saliency maps. CoRR, abs/2005.08874, 2020.

Lage, I., Lifschitz, D., Doshi-Velez, F., and Amir, O. Exploring computational user models for agent policy summarization. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 1401–1407. International Joint Conferences on Artificial Intelligence Organization, 2019.

Liu, G., Schulte, O., Zhu, W., and Li, Q. Toward interpretable deep reinforcement learning with linear model U-trees. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 414–429. Springer, 2018.

Madumal, P., Miller, T., Sonenberg, L., and Vetere, F. Explainable reinforcement learning through a causal lens. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 2493–2500, 2020.

Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 2018.

Puiutta, E. and Veith, E. M. S. P. Explainable reinforcement learning: A survey. CoRR, abs/2005.06247, 2020.

Sequeira, P. and Gervasio, M. Interestingness elements for explainable reinforcement learning: Understanding agents' capabilities and limitations. Artificial Intelligence, 2020.