Generation of Policy-Level Explanations for Reinforcement Learning
Nicholay Topin and Manuela Veloso
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
{ntopin, veloso}@cs.cmu.edu

Abstract
Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, $O(|F|^2 |tr\_samples|)$. By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.

Introduction

Recent advances in neural networks have led to powerful function approximators, which have been successfully used to support Reinforcement Learning (RL) techniques to solve difficult problems. However, the deployment of RL systems in real-world use cases is hampered by the difficulty of verifying and predicting the behavior of RL agents. In the context of RL, autonomous agents learn to operate in an environment through repeated interaction. After training, the agent is able to make decisions in any given state, but is unable to provide a plan or rule-based system for determining which action to take. Generally, a policy which selects actions ($\pi(s) = a$) is available along with its value function ($V^\pi(s) \in \mathbb{R}$), which predicts future reward from a state. However, neither the outcome of the actions nor the sequence of future actions taken is available. Without these, a human operator must blindly trust an RL agent's evaluation.
Existing techniques for explaining Deep Reinforcement Learning agents borrow techniques used for explaining neural network predictions, so they focus on explaining one state at a time. These techniques pinpoint the features of the state that influence the agent's decision, but do not provide an explanation incorporating expected future actions. Therefore, the explanation is insufficient for a human supervisor to decide whether to trust the system. Likewise, no whole-policy view is available, so evaluating the agent's overall competency (as opposed to single-state evaluation) is impossible. For these reasons, we are interested in explaining policies as a whole: giving the context for action explanations and providing an abstraction of an entire policy.

To address the aforementioned issues, we propose the creation of a full-policy abstraction, which is then used as the basis for generating local explanations. We introduce Abstract Policy Graphs (APGs) as such a full-policy abstraction. Each APG is effectively a graph where each node is an abstract state and each edge is an action with an associated transition probability between two abstract states. Using a mapping from states to abstract states, one can identify which groups of states the agent treats similarly, as well as predict the sequence of actions the agent will take. This explanation provides local explanations along with a global context.

Additionally, we propose an algorithm, APG Gen, for creating an APG given a policy, a learned value function, and a set of transitions. Starting with a single abstract state which encompasses the full state-space, APG Gen uses a feature importance measure to repeatedly divide abstract states along important features. These abstract states are then used to create an APG. The splitting procedure additionally identifies which features are important within each abstract state. Notably, this general procedure is compatible with existing methods for learning a policy and value function.

The main contributions of this work are as follows: (1) we introduce a novel representation, Abstract Policy Graphs, for summarizing policies to enable explanations of individual decisions in the context of future transitions; (2) we propose a process, APG Gen, for creating an APG from a policy and learned value function; (3) we prove that APG Gen's runtime is favorable ($O(|F|^2 |tr\_samples|)$, where $F$ is the set of features and $tr\_samples$ is the set of provided transitions); and (4) we empirically evaluate APG Gen's capability to create the desired explanations.

Related Work
Prior work in explaining Deep Learning systems focuses on explaining individual predictions. Explaining individual predictions in terms of input pixels has been done using saliency maps based on model gradients, as in (Simonyan, Vedaldi, and Zisserman 2013) and (Samek, Wiegand, and Müller 2017). Alternatively, local explanations are learned for regions around an input point to identify relevant pixels (Ribeiro, Singh, and Guestrin 2016).

(Iyer et al. 2018) leverage these pixel-level explanations and use an object detector to produce object saliency maps, which are explanations for the behavior of deep RL agents in terms of objects. Unlike our method, their method explains a single decision without the context of potential future decisions. (Ehsan et al. 2018) create natural language explanations for each action the agent performs. These explanations are learned using a human-provided corpus of explanations. These explanations are only for individual $(s, a, s')$ tuples, so their method does not produce policy-wide explanations.

Existing RL-specific methods with policy-level explanations impose additional constraints. Some works explain agent behavior, but require the agent to use a specific, interpretable model. For example, Genetic Programming for Reinforcement Learning (Hein, Udluft, and Runkler 2017) uses a genetic algorithm to learn a policy which is inherently explainable. Unlike our method, this method is incompatible with arbitrary RL systems due to its reliance on learning inherently small policies using a genetic algorithm.

Other methods require a ground-truth or learned model of the environment, which may be more complicated than the learned policy. (Khan, Poupart, and Black 2009) produce contrastive explanations which compare the agent's action to a proposed alternative, but require a known factored MDP. (Hayes and Shah 2017) explain robot behavior using natural language. These explanations are formed using a model learned from demonstrations and based on operator-specified "important program state variables" and "important functions." Other work avoids automatically identifying patterns in agent behavior and relies on a human to manually identify similar sets of states. (Zahavy, Ben-Zrihem, and Mannor 2016) embed states into a space where states within certain regions of this space behave similarly. They group these states based on region to produce explanations. However, a human operator must form these groups and identify within-group similarities.

An overview of previous methods for creating abstractions for Markov Decision Processes can be found in (Li, Walsh, and Littman 2006). These methods focus on creating an abstraction for use with an RL agent. To that end, they create an abstract Markov Decision Process, which is usually done before or during learning. This differs from our use case, where we seek only to explain the transitions that occur under a specific policy, so our abstraction is instead a Markov chain created after a policy has been learned.

Background

In the context of RL, an agent acts in an environment defined by a Markov decision process (MDP). We use a six-tuple MDP formulation:
$\langle S, A, P, R, \gamma, T \rangle$, where $S$ is the set of states, $A$ is the set of actions, $P$ is the transition function, $R$ is the reward function, $\gamma$ is the discount factor, and $T$ is the set of terminal states, which may be the empty set (Sutton and Barto 1998). In this work, we assume all states consist of an assignment to features. Specifically, each state consists of a value assignment to each feature $f \in F$. An agent ultimately seeks to learn a policy, the function $\pi(s_t) = a_t$, which maximizes total discounted reward. Note that the policy need not be deterministic, but we only consider the deterministic case in this paper. In the process of learning a policy, an RL agent generally approximates the state-value function or the action-value function. The state-value function is the expected future discounted reward from the state $s$ when policy $\pi$ is followed:

$$V^\pi(s) = E\left( \sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t), s_{t+1}) \right). \quad (1)$$

The action-value function is the expected future discounted reward from the state $s$ given that the agent takes action $a$ and then follows policy $\pi$. Note that the state-value function can be obtained from the action-value function: $V^\pi(s) = Q^\pi(s, \pi(s))$. These are used in methods based on Q-Learning, Sarsa($\lambda$), and actor-critic approaches (Sutton and Barto 1998). Therefore, the value function is generally available alongside the policy of a trained agent.

We use an importance measure for grouping states from an original MDP into abstract states. An importance measure is a function $I_f(c)$ which represents the importance of feature $f$ in determining how a system treats a set of inputs (e.g., states), $c$. If $f$ takes on the same value for all $s \in c$ or its value does not influence the system's output, then $f$ is not important. We use the Feature Importance Ranking Measure (FIRM) (Zien et al. 2009) since it is fast to compute exactly for binary features and can be meaningfully interpreted.

To calculate importance, FIRM uses $q_f(v)$, the conditional expected score of $s$ for a feature $f$ with respect to an arbitrary function $g(s)$. This score is the average value of $g(s)$ for all $s$ within the set $c$ where feature $f$ takes value $v$:

$$q_f(v) = E(g(s) \mid s[f] = v). \quad (2)$$

Intuitively, if $q_f(v)$ is a flat function, then $v$, the value of $f$, has no impact on the average value of $g(s)$ over $s \in c$, so it provides little information. However, if $v$ significantly impacts $g(s)$, then the value of $q_f(v)$ will vary. This motivates FIRM's importance measure $I_f(c)$, the square root of the variance of the conditional expected score:

$$I_f(c) = \sqrt{V(q_f(s[f]))}. \quad (3)$$

In specific cases, the exact value of $I_f(c)$ can be computed quickly. One such case is when $f$ is a binary feature. For a binary feature, the importance measure is given by

$$I_f(c) = (q_{f_1}(c) - q_{f_0}(c)) \sqrt{p_{f_0}(c)\, p_{f_1}(c)}, \quad p_{f_v}(c) = P(s[f] = v), \quad q_{f_v}(c) = E(g(s) \mid s[f] = v). \quad (4)$$

In the case of binary features, FIRM corresponds to the expected change if the feature switches from 0 to 1. Conveniently, sign is preserved in the binary case, showing magnitude of importance as well as direction of effect.

Approach

In Section 4.1, we describe Abstract Policy Graphs, our representation for explaining a policy. In Section 4.2, we propose APG Gen, a method for constructing such explanations. In Section 4.3, we describe the local explanations we produce from our policy explanations. Finally, in Section 4.4, we show that our method has favorable asymptotic runtime: quadratic in the number of features and linear in the number of transition tuples considered, where there are usually few features and runtime sub-linear in the number of transitions is unattainable.
Abstract Policy Graphs

To create a policy-level explanation, we express the policy as a Markov chain over abstract states where edges are transitions induced by a single action from the original MDP, which we term an Abstract Policy Graph (APG). We present an example in Figure 1. Consider a mapping function, $l(s)$, which maps states in the original MDP (grounded states) to abstract states. In effect, each abstract state represents a set of grounded states from the original MDP. We use the phrase "an agent is in abstract state $b$" to mean that the current state of the domain, $s$, maps to $b$ (i.e., $l(s) = b$). Let each set contain all states which are interchangeable under the agent's policy, in the sense that the agent behaves similarly when starting from any state in the set. As a result, states in which the agent behaves similarly lead the agent to states in which the agent also behaves similarly. If the agent's transitions between grounded states are approximated using a Markov chain between these abstract states, then states which are treated similarly are readily identified and the agent's transitions between abstract states can be predicted.

Figure 1: An example Abstract Policy Graph with edge labels indicating transition probabilities. The abstract state identifier is shown within each node, and the action taken is written adjacent to the node.

For example, let the agent's distribution of actions be approximately equal for all future time-steps for all grounded states in the set:

$$E_{s_{1,t}} P(\pi(s_{1,t}) = a) \approx E_{s_{2,t}} P(\pi(s_{2,t}) = a) \quad \forall a, t \quad (5)$$

for all $s_1$ and $s_2$ within a set, where $s_{i,t}$ is the state reached from $s_i$ in $t$ further time-steps using policy $\pi$.

If the transition function is deterministic, then the agent takes the same sequence of actions from each grounded state in the set because there is only a single $s_{i,t}$ for each $i$ and $t$. In addition, since no two sets could combine to form an interchangeable set, the abstract states for all $s_{i,t}$ are identical, too. Since the probability of transitioning from one abstract state to another after one action is then either zero or one (regardless of grounded state), the agent is effectively traversing a Markov chain of abstract states induced by its policy.

However, most interesting domains have stochastic transition functions. Under a stochastic transition function, two grounded states can satisfy Equation 5 while having different transition probabilities to future abstract states. The transition probability from one abstract state to another can be approximated as the average transition probability for grounded states in the source abstract state. For stochastic policies, the probability of taking any given action can be similarly approximated as the average over all grounded states in the source abstract state. The transition probability is no longer exact for any given grounded state in the set but is the transition probability for a randomly chosen state in the set. To make predictions for a series of transitions, we make a simplifying Markov assumption: the abstract state reached, $b_{t+1}$, when performing an action depends only on the current abstract state, $b_t$. This assumption leads to approximation error but works well in practice, as shown in Section 6.2.

The abstract states now form a Markov chain, as desired. This final product allows human examination of higher-level behavior (e.g., looking at often-used trajectories and checking for loops), prediction of future trajectories (along with accompanying probabilities), and verification of agent abstraction (e.g., ensuring the agent's behavior is invariant to certain features being changed).

APG Gen

We propose an algorithm for creating APGs, APG Gen. It first divides states into sets to form abstract states, then computes transition probabilities between them.
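As a concrete target for what APG Gen produces, one possible in-memory representation is sketched below (Python; the class and attribute names are ours and are not prescribed by the method). It pairs the mapping function $l$ with a row-stochastic transition matrix over abstract states:

import numpy as np

class APG:
    """Sketch of an Abstract Policy Graph: abstract states 0..K-1 plus a
    terminal state K, the action taken in each abstract state, and a
    row-stochastic transition matrix between abstract states."""
    def __init__(self, lookup, action_of, transition):
        self.lookup = lookup          # grounded state (tuple of features) -> abstract state id
        self.action_of = action_of    # abstract state id -> action taken there
        self.transition = transition  # transition[i, j] = P(next abstract state j | abstract state i)

    def predict_distribution(self, state, n):
        # Distribution over abstract states n steps after visiting `state`,
        # under the Markov assumption described above.
        dist = np.zeros(self.transition.shape[0])
        dist[self.lookup[tuple(state)]] = 1.0
        for _ in range(n):
            dist = dist @ self.transition
        return dist

Weighting each abstract state's recorded action by its probability mass then gives a predicted action distribution $n$ steps ahead, the quantity evaluated in Section 6.2.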
Importance Measure
APG Gen is compatible with arbitrary interchangeability measures. We choose $V^\pi(s)$ as it is readily available, but the method does not rely on this choice. Using an interchangeability measure based on Equation 5 is difficult since it would depend on an expectation over all future states, which is often computationally expensive and requires knowing transitions for all states. We note that the definition of $V^\pi(s)$ in Equation 1 also includes an expectation over all future states, as well as a dependency on the policy. Since the full state-value function is generally available and does not require computing additional expectations, we use it as our measure of interchangeability. The intuition is that two states with similar state-values lead to similar future outcomes in terms of reward, so they are likely treated similarly by the agent. A different measure could be used instead.

With an importance measure $I_f(c)$ for $V^\pi(s)$, a set of states which is interchangeable under the agent's policy should have low $I_f(c)$ for all $f$. Consider the case of $c_1 \cup c_2$, a set containing the original MDP states which should be contained in two abstract states. At least one $f$ should have high $I_f(c_1 \cup c_2)$ because the grounded states from the two abstract states are treated differently. If there is no such $f$, then the two sets are treated the same and therefore belong to the same abstract state.
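A small numerical illustration of this argument (the numbers are hypothetical and chosen only to show the computation in Equation 4): suppose $c_1$ contains states with $V^\pi(s) = 10$ that all have $f = 1$, and $c_2$ contains an equal number of states with $V^\pi(s) = 0$ that all have $f = 0$. Within $c_1$ or $c_2$ alone, $f$ is constant, so $I_f(c_1) = I_f(c_2) = 0$. Over the union, however, $p_{f_0} = p_{f_1} = 0.5$, $q_{f_1} = 10$, and $q_{f_0} = 0$, so

$$I_f(c_1 \cup c_2) = (10 - 0)\sqrt{0.5 \cdot 0.5} = 5,$$

and splitting on $f$ correctly separates the two groups.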
Splitting Binary Features

In the case where all features are binary, if the set is split based on the value of that $f$ (into one subset if $f = 0$ and the other if $f = 1$), then both subsets will have $I_f$ be 0, since $f$ has a constant value within each subset. This holds for any set of grounded states which will ultimately form several abstract states. Therefore, this splitting procedure can be repeatedly performed to create abstract states from initially larger sets until all features have low importance.

Since each binary feature can only be important once and it is straightforward to split along a binary feature, the use of binary features allows quick computation. Therefore, in cases where the original MDP does not have solely binary features, pre-processing can be done to create binary features for APG Gen. Note that these features are not used when evaluating $V^\pi(s)$ (i.e., an unmodified, arbitrary model can be used for approximating $V^\pi(s)$). The binary features are instead used when deciding to which set a specific tuple belongs while performing APG Gen.
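The paper does not prescribe a particular binarization scheme; as a sketch of the kind of pre-processing meant here (the thresholds and encodings below are assumptions for illustration), categorical and continuous features can be mapped to binary indicator features used only for splitting, while the unmodified state is still fed to the value-function model:

# Sketch: derive binary indicator features for APG Gen's splitting step.
# The value function still sees the raw state; only the splitting uses these.
def binarize_state(raw_state, thresholds, categories):
    """raw_state: dict feature_name -> value.
    thresholds: dict mapping a continuous feature to a list of cut points (assumed).
    categories: dict mapping a categorical feature to its possible values (assumed)."""
    binary = {}
    for name, value in raw_state.items():
        if name in thresholds:                       # continuous -> threshold indicators
            for cut in thresholds[name]:
                binary[f"{name}>{cut}"] = int(value > cut)
        elif name in categories:                     # categorical -> one-hot indicators
            for option in categories[name]:
                binary[f"{name}={option}"] = int(value == option)
        else:                                        # already binary
            binary[name] = int(value)
    return binary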
Abstract State Division

Since binary features allow efficient splitting, our approach is to initially form sets based on the action taken under the current policy and then repeatedly split the set which has the greatest $I_f$ value, as computed within that set. When the importances of all features for all abstract states are sufficiently low, the abstract states consist of sets of states which are interchangeable under the agent's policy.

The pseudocode for our method is given in Algorithm 1. To perform the procedure, we require a set of sample transitions. As mentioned in Section 3.1, RL agents generally learn through interacting with a domain, meaning a set of $(s, a, s')$ transitions is generally available. The notation we use for this set is a vector $tr\_samples$ consisting of entries $t$ where action $t_a$ is taken in state $t_s$, leading to a transition to state $t_{s'}$, an observed reward $t_r$, and a termination flag $t_t$ (0 or 1). The policy is used in line 5 to discard transition tuples where the provided policy would perform a different action from the action in the stored tuple. This is done so that the generated explanation reflects only the current policy and not transition tuples observed under past policies.

Lines 2-6 separate the tuples based on the action taken. We pre-compute the feature importance for each set and save it in lines 7-8. Line 9 forms the core procedure, where abstract states are divided until no feature has importance greater than $\epsilon$. The abstract state with the most important feature is found, then divided based on the most important feature. The importance of each feature is then re-computed in lines 18 and 20.

Algorithm 1: Compute abstract states based on transition samples and learned policy.
1:  procedure DivAbsStates(tr_samples, policy)
2:    for i in {1, ..., |A|} do
3:      c[i] ← ∅                                   ▷ initially, all sets empty
4:    for t in tr_samples do                        ▷ separate by action
5:      if policy(t_s) = t_a then
6:        c[t_a] ← c[t_a] ∪ t
7:    for i in {1, ..., |c|} do                     ▷ pre-compute feature importances
8:      m[i] ← [|I_f(c[i])| for f ∈ {1, ..., |F|}]
9:    while max_i max_j (m[i][j]) > ε do
10:     i_max ← argmax_i max_j (m[i][j])
11:     j_max ← argmax_j (m[i_max][j])
12:     c_n0, c_n1 ← ∅, ∅
13:     for t in c[i_max] do                        ▷ split on most important feature
14:       if t_s[j_max] = 0 then
15:         c_n0 ← c_n0 ∪ t
16:       else
17:         c_n1 ← c_n1 ∪ t
18:     m[i_max] ← [|I_f(c_n0)| for f ∈ {1, ..., |F|}]
19:     c[i_max] ← c_n0
20:     m[|c| + 1] ← [|I_f(c_n1)| for f ∈ {1, ..., |F|}]
21:     c[|c| + 1] ← c_n1
22:   return c
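For readers who prefer running code to pseudocode, the following Python sketch mirrors Algorithm 1 (it assumes binary feature vectors stored as NumPy arrays, a callable policy, and the state-value function as the scoring function g; all names are ours):

import numpy as np

def firm_importances(states, g):
    # |FIRM| per binary feature over a set of states (rows of a 0/1 matrix).
    scores = np.array([g(s) for s in states])
    p1 = states.mean(axis=0)
    imps = np.zeros(states.shape[1])
    for f in range(states.shape[1]):
        if 0.0 < p1[f] < 1.0:
            q1 = scores[states[:, f] == 1].mean()
            q0 = scores[states[:, f] == 0].mean()
            imps[f] = abs(q1 - q0) * np.sqrt(p1[f] * (1.0 - p1[f]))
    return imps

def divide_abstract_states(tr_samples, policy, g, eps):
    # Group on-policy tuples by the action taken (lines 2-6 of Algorithm 1).
    by_action = {}
    for t in tr_samples:                     # t = (s, a, s_next, r, terminal)
        if policy(t[0]) == t[1]:
            by_action.setdefault(t[1], []).append(t)
    sets = list(by_action.values())
    imps = [firm_importances(np.array([t[0] for t in c]), g) for c in sets]
    # Split the set with the most important feature until all fall below eps (line 9 on).
    while True:
        i = max(range(len(sets)), key=lambda k: imps[k].max())
        if imps[i].max() <= eps:
            break
        f = int(imps[i].argmax())
        zeros = [t for t in sets[i] if t[0][f] == 0]
        ones = [t for t in sets[i] if t[0][f] == 1]
        sets[i], imps[i] = zeros, firm_importances(np.array([t[0] for t in zeros]), g)
        sets.append(ones)
        imps.append(firm_importances(np.array([t[0] for t in ones]), g))
    return sets

The sketch re-scans all sets to find the largest importance; the runtime analysis below instead assumes a max-heap over the per-set maxima, which removes that linear scan.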
APG Edge Creation

Once the abstract state sets have been created, we create the mapping function $l$ and the Markov chain transition matrix using Algorithm 2 (a sparse matrix can be created in an almost identical fashion). In lines 2-4, the contents of each set are used to create the necessary entries in a lookup table for the mapping function. Simultaneously, the transition matrix is initialized to be zero-valued by lines 5-6. Then, in lines 7-13, the mapping function is used in conjunction with the transition tuples within each abstract state set to compute transition probabilities. That is, if a transition tuple $t$ is in set $c[i]$, then the origin state, $t_s$, is in the abstract state represented by $c[i]$. The destination state, $t_{s'}$, then indicates a connection between $c[i]$ and $l(t_{s'})$. The transition probability from $c[i]$ to $c[n]$ is the portion of tuples in $c[i]$ which lead to a state in $c[n]$, so each tuple in $c[i]$ should increment $transition(i, l(t_{s'}))$ by $1/|c[i]|$, as done in line 13. Terminal transitions are identified in line 9 and instead lead to the special $b_T$ abstract state. This abstract state represents termination and is represented by the highest-numbered row and column in $transition$. There will be no incoming edges to $b_T$ if the set of terminal states, $T$, is empty. Line 14 sets $b_T$ to have an edge to itself to create a valid Markov chain.

Algorithm 2: Create mapping function and transition matrix based on policy graph.
1:  procedure ComputeGraphInfo(c)
2:    for i in {1, ..., |c| + 1} do
3:      for t in c[i] do
4:        lookup[t_s] ← i                           ▷ create lookup table
5:      for n in {1, ..., |c| + 1} do               ▷ zero matrix
6:        transition(i, n) ← 0
7:    for i in {1, ..., |c|} do
8:      for t in c[i] do
9:        if t_t = 1 then                           ▷ terminal t goes to dummy b_T
10:         n ← |c| + 1
11:       else                                      ▷ others go to abstract state of next state
12:         n ← lookup[t_s']
13:       transition(i, n) += 1 / |c[i]|
14:   transition(|c| + 1, |c| + 1) ← 1               ▷ add b_T self-loop
15:   return lookup, transition

Local Explanations

The policy graph algorithm presented creates a summary of the overall policy out of abstract states, which are each defined by a set of states from the original MDP. Due to the process which we use to create the abstract states, we can also create a characterization of the states which are in each abstract state's set. Note that Algorithm 1 splits an abstract state into two based on a feature $f$ because $f$ is "important" based on the chosen function $g$. These $f$s can be trivially recorded and stored for each abstract state. Once the final abstract states have been created, these $f$s indicate which features were previously important. From this, the important features of an abstract state can be determined.

For any state in the transition set, $s_n$, and a specific abstract state, $b$, if $\pi(s_n) = \pi(s)$ and $s_n[f] = s[f]$ for any $s$ in $b$'s set and for all $f$ which were used to create $b$, then $s_n$ will also be in $b$'s set. Similarly, if $s_n[f] \neq s[f]$ (for similarly defined $s$ and $f$), then $s_n$ cannot be in $b$'s set. These feature-value assignments are necessary and sufficient to be part of $b$, so this creates an "if and only if" relationship. As a result, for any chosen state $s$, based on the features used to create its abstract state, the "relevant" features can be determined.
If the value for any of these features changes, then $s$ would be in a different abstract state and treated differently. Similarly, the agent is oblivious to changes in the other features given the values assigned to the relevant features. This relationship allows a human supervisor to determine which features affect how an agent treats a specific state. In addition, a summary of an abstract state can be formed using these same feature-value assignments.
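A sketch of how such a local explanation can be assembled, assuming the splitting procedure recorded, for each abstract state, the features it was split on (the bookkeeping names here are ours):

def explain_state(state, lookup, split_features, policy):
    # Read off a local explanation: the features whose values determine which
    # abstract state `state` belongs to (recorded during the splitting procedure).
    b = lookup[tuple(state)]
    relevant = split_features[b]
    return {
        "abstract_state": b,
        "action": policy(state),
        "relevant_features": {f: state[f] for f in relevant},
    }

Any state taking the same action and matching these recorded feature values maps to the same abstract state, which is the "if and only if" relationship above; the remaining features can be reported as ones to which the agent is oblivious in this context.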
Computing FIRM

Since $p_{f_1}(c) = 1 - p_{f_0}(c)$ and $q_{f_1}(c) = (E(g(s)) - p_{f_0}(c)\, q_{f_0}(c)) / p_{f_1}(c)$, computing $p_{f_0}(c)$ and $q_{f_0}(c)$ is enough to calculate the importance. These can be computed for all $f$ with a single pass through the set of states, as shown in Algorithm 3.

Algorithm 3: Compute feature importance for all features for a given set of transitions.
1:  procedure FIRM(tuples)
2:    q_tot ← 0                                     ▷ expected value over full set
3:    for f in {1, ..., |F|} do
4:      p0[f] ← 0                                   ▷ ratio of set with s[f] = 0
5:      q0[f] ← 0                                   ▷ E(g(s) | s[f] = 0) for s in set
6:    for t in tuples do
7:      g_val ← g(t_s)
8:      q_tot += g_val                              ▷ store sum for q_tot
9:      for f in {1, ..., |F|} do
10:       if t_s[f] = 0 then                        ▷ p0 is a tally, q0 is a sum
11:         p0[f] += 1
12:         q0[f] += g_val
13:   q_tot ← q_tot / |tuples|                      ▷ convert sum to average
14:   for f in {1, ..., |F|} do                     ▷ intermediate terms
15:     q0[f] ← q0[f] / p0[f]
16:     p0[f] ← p0[f] / |tuples|
17:     p1[f] ← 1 − p0[f]
18:     q1[f] ← (q_tot − p0[f] q0[f]) / p1[f]
19:     q_diff[f] ← q1[f] − q0[f]
20:   return [q_diff[f] · sqrt(p0[f] p1[f]) for f in {1, ..., |F|}]

The bulk of the computation is performed in lines 6 to 12. Here, every transition in the set is separately considered. Only a single evaluation of $g$ is required regardless of the number of features. This evaluation is used to calculate $E(g(s))$ and $q_{f_0}$ for each feature where $s[f] = 0$. The overall complexity of Algorithm 3 is therefore $O(|F||tuples|)$, where $|F|$ is the number of features and $|tuples|$ is the number of transitions over which FIRM is computed.
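A direct Python transcription of Algorithm 3 (a sketch under the same assumptions as before: binary feature vectors and an arbitrary scoring function g):

def firm_all_features(tuples, g):
    # Signed FIRM for every feature, computed in a single pass over the tuples.
    n_features = len(tuples[0][0])
    n = len(tuples)
    q_tot = 0.0
    count0 = [0] * n_features          # tally of tuples with s[f] = 0
    q0_sum = [0.0] * n_features        # sum of g(s) over tuples with s[f] = 0
    for t in tuples:
        s = t[0]
        g_val = g(s)                   # one evaluation of g per tuple
        q_tot += g_val
        for f in range(n_features):
            if s[f] == 0:
                count0[f] += 1
                q0_sum[f] += g_val
    q_tot /= n                         # E(g(s)) over the whole set
    importances = []
    for f in range(n_features):
        p0 = count0[f] / n
        p1 = 1.0 - p0
        if p0 == 0.0 or p1 == 0.0:     # constant feature: zero importance
            importances.append(0.0)
            continue
        q0 = q0_sum[f] / count0[f]
        q1 = (q_tot - p0 * q0) / p1    # recover q1 from the overall mean
        importances.append((q1 - q0) * (p0 * p1) ** 0.5)
    return importances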
APG Gen Runtime

The runtime of Algorithm 1 is quadratic in the number of features and linear in the number of provided transitions, $O(|F|^2 |tr\_samples|)$.

Creating the initial abstract states (i.e., those based only on the action taken) takes time $O(|A| + |tr\_samples|)$, where we assume $|A| \leq |tr\_samples|$. Computing FIRM for all of these abstract states takes $O(|F| |tr\_samples|)$ time. The while loop in lines 9 to 21 forms the bulk of the algorithm, which we will analyze last. Creating the lookup and transition tables takes $O(|tr\_samples|)$, assuming a zero matrix can be created in constant time for line 6.

For lines 9 to 21, during each iteration of the while loop, the runtime is $O(\log(|c|))$ to insert the new entries if a max-heap is used to store the per-set maximum importances, $O(|c[i_{max}]|)$ to partition the set $c[i_{max}]$, and $O(|F| |c[i_{max}]|)$ to compute FIRM for both new sets. Note that each time a set is divided, the number of features within that set with non-fixed values (and therefore positive importance) is reduced by one. Therefore, any given tuple may only be part of an evaluated set $|F|$ times. As a result, over all iterations of the loop, the set division and FIRM computation take at most $O(|F|^2 |tr\_samples|)$ time. Since every abstract state is non-empty, there are at most $|tr\_samples|$ divisions, so the max-heap insertions contribute at most $O(|tr\_samples| \log(|tr\_samples|))$ time.

The overall worst-case runtime for APG Gen is then on the order of $O(|F|^2 |tr\_samples|)$. This is favorable since the runtime must be at least linear in $|tr\_samples|$ and $F$ is generally small compared to the number of tuples.

Experimental Setup

We evaluate APG Gen on a novel domain with a scalable state space and controllable stochasticity. We describe this domain, PrereqWorld, in Section 5.1. Experimental settings are described in Section 5.2.
PrereqWorld

We introduce the PrereqWorld domain for evaluating our approach. This domain is an abstraction of a production task where the agent is to create a specific, multi-component item using a number of manufacturing steps. The size of the state-space for an instance of this domain is controlled by the number of unique items, $m$. The agent may only have one of each item at a time. Production of each item may require some prerequisites, a subset of the other items, but no cycle of dependencies is permitted. In producing an item, the prerequisite items are usually lost. A domain parameter, $\rho$, controls the probability that an item is lost.

For ease of notation, we assume that the items are numbered according to their place in a topological sort (i.e., an item's prerequisites must be higher-numbered). Let $i_d$ refer to the desired final item. For each item $i_j$, let $C_j$ be the set of prerequisite items which $i_j$ requires. A sample MDP is shown in Figure 2. Note how the goal is to make $i_1$ and it requires having $i_3$ and $i_4$. In turn, $i_3$ also requires $i_4$.

A state consists of $m$ binary features, where the binary feature $f_j$ corresponds to whether the agent has an item $i_j$. Any state where $s[f_d] = 1$ is a terminal state. The distribution of initial states is uniform over all possible non-terminal states. The reward is $-1$ for transitioning to a non-terminal state and $0$ for transitioning to a terminal state. For simplicity, we take $\gamma$ to be $1$, but the optimal policies for any domain instance remain optimal for any $\gamma$ in the interval $(0, 1]$. There are $m$ actions, where the action $a_j$ corresponds to attempting to produce item $i_j$. Actions for currently possessed items or for items with unmet prerequisites have no effect. That is, $P(s \mid s, a_j) = 1$ when feature $s[f_j] = 1$ or there is an $i_k \in C_j$ such that $s[f_k] = 0$. When an action is successful, $f_j$ is set to $1$ and each of item $i_j$'s prerequisites is used with probability $1 - \rho$. That is, for all $i_k \in C_j$, $f_k$ is independently set to $0$ with probability $(1 - \rho)$ and left as $1$ with probability $\rho$.

For the MDP in Figure 2, note that transitions are deterministic ($\rho = 0$) for simplicity and we do not show the transition function for states where $s[f_1] = 1$ ($i_1$ is present), since all such states are terminal. Notice how the domain can be solved optimally from the starting position (no items present) using the action sequence $[a_4, a_3, a_4, a_1]$. This ensures that an $i_4$ is present before $i_3$ is made, and another $i_4$ is created as a prerequisite to creating $i_1$. This domain is suitable for explanation as it has inherent dependencies and sets of states which are treated identically.

An example APG made by APG Gen for an instance of PrereqWorld is given in Figure 3. APG Gen additionally describes each abstract state. For example, one abstract state corresponds to all states where features 2 and 3 are 1. This corresponds to always taking the same action whenever an $i_2$ and $i_3$ are present, which corresponds to a prerequisite set of $\{i_2, i_3\}$ in this domain instance. This correspondence between the domain constraints and the explanation would allow a human operator to verify that an agent is behaving as expected.

Figure 2: MDP for an example PrereqWorld instance.
$m = 4$, $\rho = 0$, $i_d = i_1$
$C_1 = \{i_3, i_4\}$, $C_2 = \{i_3, i_4\}$, $C_3 = \{i_4\}$, $C_4 = \{\}$
$S = \{0000, 0001, \ldots, 1111\}$, $A = \{a_1, \ldots, a_4\}$, $T = \{1000, 1001, \ldots, 1111\}$
$R(s, a, s') = 0$ for $s' \in T$; $R(s, a, s') = -1$ for $s' \notin T$; $\gamma = 1$
$P(0001 \mid 0000, a_4) = 1$, $P(0010 \mid 0001, a_3) = 1$, $P(0011 \mid 0010, a_4) = 1$, $P(1000 \mid 0011, a_1) = 1$, $P(0100 \mid 0011, a_2) = 1$, $P(0101 \mid 0100, a_4) = 1$, $P(0110 \mid 0101, a_3) = 1$, $P(0111 \mid 0110, a_4) = 1$, $P(1100 \mid 0111, a_1) = 1$;
for other $s$ and $a$, $P(s' \mid s, a) = 0$ when $s \neq s'$ and $P(s' \mid s, a) = 1$ when $s = s'$.
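The environment itself is simple to simulate; a minimal sketch follows (Python; the class and parameter names are ours, not from the paper):

import random

class PrereqWorld:
    """State: tuple of m binary features; feature j = 1 iff item i_{j+1} is held."""
    def __init__(self, prereqs, desired=0, rho=0.0):
        self.prereqs = prereqs        # prereqs[j] = set of item indices required to make item j
        self.desired = desired        # index of the goal item i_d
        self.rho = rho                # probability a consumed-able prerequisite is kept
        self.m = len(prereqs)

    def is_terminal(self, state):
        return state[self.desired] == 1

    def step(self, state, action):
        """Attempt to produce item `action`; returns (next_state, reward)."""
        s = list(state)
        if s[action] == 0 and all(s[k] == 1 for k in self.prereqs[action]):
            s[action] = 1
            for k in self.prereqs[action]:
                if random.random() > self.rho:    # each prerequisite consumed w.p. 1 - rho
                    s[k] = 0
        s = tuple(s)
        return s, (0.0 if self.is_terminal(s) else -1.0)

For instance, PrereqWorld(prereqs={0: {2, 3}, 1: {2, 3}, 2: {3}, 3: set()}, desired=0, rho=0.0) matches the Figure 2 instance, with zero-based item indices.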
APG Inputs
For consistency, we use value iteration (Sutton and Barto 1998) to create the policies and value functions used for the experiments, but other methods could be used instead. We iterate until the state-value function no longer changes. To generate the transitions, we generate trajectories from random starting states until the maximum number of transitions is reached.
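A sketch of how such a transition set can be collected (it assumes the PrereqWorld sketch above and a callable greedy policy; the sampling details are ours):

import random

def collect_transitions(env, policy, max_samples):
    # Roll out trajectories from random non-terminal start states until
    # max_samples transition tuples have been gathered.
    samples = []
    while len(samples) < max_samples:
        state = tuple(random.randint(0, 1) for _ in range(env.m))
        if env.is_terminal(state):
            continue                                  # resample a non-terminal start
        while not env.is_terminal(state) and len(samples) < max_samples:
            action = policy(state)
            next_state, reward = env.step(state, action)
            terminal = int(env.is_terminal(next_state))
            samples.append((state, action, next_state, reward, terminal))
            state = next_state
    return samples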
APG Gen Stopping Criterion ($\epsilon$)

In the case of binary features, FIRM corresponds to the expected change should the feature be changed from 0 to 1. Conveniently, sign is also preserved in the binary case, showing magnitude of importance as well as direction of effect. As a result, if no feature for any abstract state has FIRM magnitude greater than $\epsilon$, then changing any given feature is not expected to change the value of $g(s)$ by more than $\epsilon$ (e.g., $E_{s \in c}(|g(s; s_f = 0) - g(s; s_f = 1)|) < \epsilon\ \forall c$). We use this as a guideline for setting $\epsilon$: we set $\epsilon$ to be the minimum difference in action-value between the best action and the second-best action. For the PrereqWorld domain, this is $\epsilon = 1$.
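When the action-value function can be enumerated (as with value iteration on PrereqWorld), this guideline can be computed directly; a sketch, assuming q_values maps each state to its list of action-values:

def stopping_epsilon(q_values):
    # Smallest gap between the best and second-best action-value over all states.
    gaps = []
    for q in q_values.values():
        best, second = sorted(q, reverse=True)[:2]
        gaps.append(best - second)
    return min(gaps)

For PrereqWorld this gap is 1, matching the $\epsilon = 1$ used here.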
Trials

For each plotted data-point, we generate 100 different PrereqWorld instances. We evaluate each instance 1,000 times (i.e., we compute the feature importance for 1,000 different states or predict the $n$th action for 1,000 different trajectories), except for the points in Figure 6, since the explanation size is fixed per APG.

Domain Generation
Each domain instance is parameterized by $\rho$ and $m$ as specified in Section 6. For simplicity, $i_d$ is always $i_1$. For each instance, we randomly add prerequisite relationships by selecting an item $i_j$ uniformly at random and then an item $i_k$ uniformly at random such that $k > j$. When adding prerequisite relationships, we constrain the expected number of actions to reach a terminal state to be within 10% of $m$. This ensures that the domain can be solved in a reasonable amount of time using value iteration.

Figure 3: An example APG made by APG Gen for a small PrereqWorld domain instance with $m = 8$ and $\rho = 0$. All edges have transition probability 1. The abstract state identifier is shown within each node, and the action taken is written adjacent to the node.

Feature Importance Prediction Evaluation

Based on the way we construct our abstract states, we can create "if and only if" conditions for a state in the transition sample set to be part of an abstract state's set, as described in Section 4.3. From this, we can create a local explanation consisting of the set of features which are important in that state. To evaluate how well APG Gen can generalize when predicting important features, we generate APGs using a set of transitions smaller than the full set of non-terminal states (i.e., we provide a set of transitions where no $(s, a, s')$ tuple shares an $s$, such that only a portion of the non-terminal states appear as $s$). We then evaluate the local explanations by comparing to a ground truth computed for individual PrereqWorld instances with a domain parameter of $m = 15$ ($|S| = 2^{15}$).

The portions of correct feature classifications (important vs. not important) are shown in Figure 4. APG Gen almost always correctly identifies the important features. Even with only 10% of the states, the prediction is correct over 93% of the time. When given 80% of the states, predictions are correct 98.7% of the time for both the stochastic and deterministic environments, which suggests that the model is able to identify genuine patterns in the policy. We believe the errors the system makes are caused by the splitting order induced by APG Gen's greedy splitting strategy.

Figure 4: Comparison of feature importance prediction accuracy for increasing portion of non-terminal states.

Figure 5: Action prediction for increasing time horizon.

n-hop Prediction Evaluation

An APG is able to predict the actions an agent will take, but this ability comes from an assumption made in Section 4.1. For each pair of abstract states, we produce a transition probability: the probability that the agent will be in the second abstract state, assuming the agent is following a transition tuple chosen at random from the first abstract state. This holds for a single action for states in the provided transition sample set, but not for arbitrary states and not when performing several of these predictions in sequence.

To evaluate the error caused by making this assumption, we have APG Gen predict the distribution of actions the agent will take $n$ time-steps in the future. We compare it to the true computed distribution and report the portion of actions for which the true and predicted distributions agree. This is for a domain parameter of $m = 15$ ($|S| = 2^{15}$). The size of the transition sample set is half the size of the set of non-terminal states.

The action prediction is consistently correct when the domain is deterministic, so we report results for two stochastic domains in Figure 5. Even with a small $\rho$, the prediction is less accurate as the number of steps increases, as is expected.
However, there is no dramatic decrease, suggesting that the Markovian assumption made in Section 4.2 is reasonable. The steady decline is likely due to computing transition probabilities as an average over the transition sample set.

Figure 6: Comparison of explanation versus state-space size.

Explanation Size Evaluation

The purpose of APGs is to be more human-interpretable than a Markov chain made from the base MDP. Therefore, the number of nodes in an APG should be much lower than the number of grounded states in the base MDP. To test this, we construct domains with a number of states ranging from 32 to 1,073,741,824 and count the number of abstract states in the corresponding APG. As in Section 6.2, for each generated APG, the size of the transition sample set is half the size of the set of non-terminal states. The results are presented in Figure 6. Note that the x-axis is in log-scale.

The explanation size grows sub-linearly in $m$ while the state-space size grows exponentially in $m$. This suggests that the explanation size is based more on the number of actions required to reach a terminal state than on the number of states, which indicates that compact policy representations are being automatically extracted.

Conclusion

We introduced Abstract Policy Graphs, a whole-policy explanation from which state-specific explanations can be extracted. In addition, we presented APG Gen, an algorithm for creating an APG given a policy, learned value function, and set of transitions, without constraints on how these are created. We showed that APG Gen runs in time quadratic in the number of features and linear in the number of transitions provided, $O(|F|^2 |tr\_samples|)$. Additionally, we demonstrated empirical results showing the small size of the APGs relative to the original MDPs, as well as the types and quality of explanations which can be extracted. Together, these show that APG Gen can produce concise policy-level explanations in a tractable amount of time. Future work includes restructuring the explanations extracted from an APG to be better understood by a non-expert. To address this, we are in the process of conducting a user study to evaluate the usefulness of APG explanations in different presentation formats.

Acknowledgments

This material is based upon work supported by DARPA grants FA87501720152 and FA87501620042. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
References

[Ehsan et al. 2018] Ehsan, U.; Harrison, B.; Chan, L.; and Riedl, M. 2018. Rationalization: A neural machine translation approach to generating natural language explanations.

[Hayes and Shah 2017] Hayes, B., and Shah, J. A. 2017. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction.

[Hein, Udluft, and Runkler 2017] Hein, D.; Udluft, S.; and Runkler, T. A. 2017. Interpretable policies for reinforcement learning by genetic programming. CoRR abs/1712.04170.

[Iyer et al. 2018] Iyer, R.; Li, Y.; Li, H.; Lewis, M.; Sundar, R.; and Sycara, K. 2018. Transparency and explanation in deep reinforcement learning neural networks.

[Khan, Poupart, and Black 2009] Khan, O. Z.; Poupart, P.; and Black, J. P. 2009. Minimal sufficient explanations for factored Markov decision processes. In Proceedings of the Nineteenth International Conference on Automated Planning and Scheduling.

[Li, Walsh, and Littman 2006] Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a unified theory of state abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics.

[Ribeiro, Singh, and Guestrin 2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[Samek, Wiegand, and Müller 2017] Samek, W.; Wiegand, T.; and Müller, K.-R. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.

[Simonyan, Vedaldi, and Zisserman 2013] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. The MIT Press.

[Zahavy, Ben-Zrihem, and Mannor 2016] Zahavy, T.; Ben-Zrihem, N.; and Mannor, S. 2016. Graying the black box: Understanding DQNs. In Proceedings of the 33rd International Conference on Machine Learning.

[Zien et al. 2009] Zien, A.; Krämer, N.; Sonnenburg, S.; and Rätsch, G. 2009. The feature importance ranking measure. In