How RL Agents Behave When Their Actions Are Modified
Eric D. Langlois [email protected]
Tom Everitt [email protected]
DeepMind · University of Toronto · Vector Institute
Abstract
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions. As a result of supervisor intervention, the executed action may differ from the action specified by the policy. How does this affect learning? We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy. We analyze the asymptotic behaviours of common reinforcement learning algorithms in this setting and show that they adapt in different ways: some completely ignore modifications while others go to various lengths in trying to avoid action modifications that decrease reward. By choosing the right algorithm, developers can prevent their agents from learning to circumvent interruptions or constraints, and better control agent responses to other kinds of action modification, like self-damage.
Introduction

When reinforcement learning (RL) agents are deployed in practice it can sometimes be desirable to constrain their actions or alter their policies. For example, action constraints have been used to prevent damage when training robot policies [8, 15] and transformations to the policy can be used to ensure that an agent stays within some safe region of state space [3, 5]. When unsafe states are difficult to specify formally, a human overseer may interrupt the agent instead [18, 23]. In other cases, changes to the learned policy are undesirable yet difficult to avoid: an agent may damage its actuators and lose the ability to reliably perform certain actions, or errors can be introduced if the policy is compressed to satisfy hardware limitations of the agent [e.g. quantization, 13]. What effect will these kinds of action modifications have on the policy learned by an RL algorithm? Will the policy try to circumvent constraints or act as if they don't exist? It will be easier to effectively apply safety constraints to an RL policy if the learning algorithm can be made to ignore the modification than if the learning algorithm chooses policies that interfere with it. If a supervisor overrides an agent during training but not deployment (for example to prevent an autonomous car from driving off the road) then the policy should not learn to rely on the presence of these interventions [18]. Agents should recognize the possibility of self-damage and avoid it. Agent responses to these action modifications influence the safety and effectiveness of the resulting policies, so it is crucial to study and understand the implications of action modifications on reinforcement learning.
Related Work
The concept of modifying an agent's actions appears in the safe exploration and human-in-the-loop literature. Dalal et al. [5] and Abel et al. [1] both propose approaches that construct a wrapper around an inner reinforcement learning agent. While they focus on developing wrappers that enforce constraints or guide the inner agent, we are interested in understanding how different inner agents will react to the wrapper. This can help avoid problems like those experienced by Saunders et al. [23], in which the inner agent learned to produce adversarial examples that defeat a learned action filter.

Orseau and Armstrong [18] studied learning in the presence of interruptions that temporarily replace the policy with some fixed alternate policy (for both history-based and Markov environments). They showed that Q-learning and "Safe Sarsa" (a modification of Sarsa) both ignore interruptions while Sarsa does not. We provide a new formalism called the Modified-Action Markov Decision Process (MAMDP) that generalizes (and arguably simplifies) the framework of Orseau and Armstrong to describe a broad class of action modifications, and allows us to study the behaviour of other optimization algorithms such as evolution strategies. In addition, we distinguish between different ways that algorithms adapt to the presence of action modifications using the concept of incentives [6].

Policy modifications have been studied experimentally by Leike et al. [14]. The authors found that Sarsa chooses a policy that accounts for the effect of policy modifications while Q-learning ignores them. They hypothesized that the difference was off-policy vs. on-policy learning. However, we show that the on/off-policy division is not predictive of policy-modification adaptation: one possible generalization of Sarsa (on-policy) asymptotically matches Q-learning (off-policy), while both Sarsa and Evolution Strategies (on-policy) have different asymptotic behaviours.

In this paper, we introduce the MAMDP model (Section 2) and investigate how different kinds of MDP learning algorithms behave when generalized to this setting. We describe MAMDP policy learning objectives based on the principles of black-box reward maximization, Bellman optimality, and Bellman policy values, and show that they respond differently to action modifications (Section 3). In Section 4, we prove that generalizations of Q-learning and Sarsa to MAMDPs converge to the objectives given in Section 3. Finally, we experimentally evaluate the learning algorithms and demonstrate behaviour that is consistent with the theoretical results (Section 5).

Modified-Action Markov Decision Processes

The foundational model of reinforcement learning is the Markov Decision Process.
A Markov Decision Process (MDP) is a tuple $M = (\mathcal{S}, \mathcal{A}, P_S, R, \gamma)$ where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P_S(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$ is the probability at every time $t$ of transitioning to state $s'$ when taking action $a$ in state $s$, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function with $R_t = R(S_t, A_t)$, and $\gamma \in [0, 1)$ is a discount factor. The objective is to find a (probabilistic) policy $\pi(a \mid s)$ that maximizes the expected return $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$ where actions are sampled from $\pi$ as $\Pr(A_t = a \mid S_t = s) = \pi(a \mid s)$.

We extend the MDP model to include an arbitrary action selection function $P_A(a \mid \pi, s)$. An MDP is the special case in which the policy is applied without modification: $P_A^{\mathrm{MDP}}(a \mid \pi, s) = \pi(a \mid s)$.

Definition 1. A Modified-Action Markov Decision Process (MAMDP) is a tuple $\widetilde{M} = (\mathcal{S}, \mathcal{A}, P_S, P_A, R, \gamma)$ where $M = (\mathcal{S}, \mathcal{A}, P_S, R, \gamma)$ is an MDP and $P_A(a \mid \pi, s) = \Pr(A_t = a \mid \Pi = \pi, S_t = s)$ is the probability that action $a$ is selected in state $s$ given a policy $\pi$. We also write $\widetilde{M} = (M, P_A)$.

The actions that are executed in the environment can be viewed as following the empirical policy $\dot{\pi}(a \mid s) := P_A(a \mid \pi, s)$. We call $\pi$ the virtual policy or simply the policy. An optimal (virtual) policy for a MAMDP is one that maximizes the expected return $\widetilde{J}(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$ where actions are sampled according to $P_A(\pi, s)$.

As a simple example, the MAMDP model can be used to represent a state-dependent action constraint $\mathcal{A}^C(s) \subseteq \mathcal{A}$ using the action selection function
$$P_A^C(a \mid \pi, s) = \begin{cases} \frac{1}{Z(s, \pi)} \pi(a \mid s) & \text{if } a \in \mathcal{A}^C(s) \\ 0 & \text{otherwise} \end{cases}$$
where $Z(s, \pi) = \sum_{a' \in \mathcal{A}^C(s)} \pi(a' \mid s)$ normalizes the probability distribution. Adding $\pi$ as an argument to $\mathcal{A}^C$ allows us to represent more sophisticated constraints that inspect the policy. We can also consider parametrized policies that have extra structure beyond a simple conditional probability distribution; for example, $\pi$ may be a neural network. A MAMDP can model a supervisor attempting to interpret the hidden activations of a policy network and only allowing the agent to proceed if satisfied. Beyond constraints and interventions, MAMDPs can model a variety of settings including hardware defects in robotic actuators, exploration noise, policy compression to satisfy computational limits, and physical damage to an agent.
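To make the constraint example concrete, here is a minimal Python sketch of the action selection function $P_A^C$. The function names and the tabular policy representation are our own illustration, not part of the paper's codebase:

```python
import numpy as np

def constrained_action_dist(policy_probs, allowed):
    """Compute P_A^C(. | pi, s): renormalize pi(.|s) over allowed actions."""
    probs = np.where(allowed, policy_probs, 0.0)  # zero out forbidden actions
    z = probs.sum()                               # Z(s, pi)
    assert z > 0, "policy must give some probability to an allowed action"
    return probs / z

# Example: 3 actions, action 2 is forbidden in this state.
pi_s = np.array([0.5, 0.2, 0.3])
allowed = np.array([True, True, False])
print(constrained_action_dist(pi_s, allowed))  # [0.714..., 0.285..., 0.]
```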
Influence Diagrams

We use the concept of incentives [6, 7] to concisely describe learning behaviours in MAMDPs, complementing the formal analysis of learning algorithm properties. Incentives are defined with respect to influence diagrams. An influence diagram (ID) [11] is a Bayesian network with special decision and utility nodes and describes a decision problem (see Figure 1). In other words, it is a directed acyclic graph over random variables where every non-decision variable comes with a conditional probability distribution describing its relationship to its parents. The problem described by an influence diagram is to find an optimal assignment of conditional distributions to the decisions. An assignment is optimal if it maximizes the expected sum of utilities. When reasoning about incentives, we additionally assume that the graph implies a causal structure in which it is possible to intervene on a variable, setting that variable's value. Unlike conditioning, interventions only affect downstream nodes [19].
Adversarial Policy and State Incentives
We are interested in investigating how different learning algorithms might respond to the value of $P_A$ in a MAMDP. As such, we include $P_A$ as a variable in the influence diagram of Figure 2. That diagram describes the problem of finding, for a fixed MDP $M$, an optimal mapping from $P_A$ to policies $\Pi$ of the MAMDP $(M, P_A)$.

Figure 1: An influence diagram of an MDP or MAMDP. The diagram represents a distribution over the first few steps of episodes generated by a stationary policy $\Pi$. Decision nodes are drawn as squares and utility nodes as diamonds.

Note that even if a particular learning algorithm trains on trajectory samples without directly observing $P_A$, it still defines a mapping from $P_A$ to policies.

What behaviours can we anticipate from MAMDP policies? In which ways might a learning algorithm try to avoid action modifications in order to get higher reward? We identify two general strategies. First, the agent may direct the environment towards states where action modifications are less constraining, for example by hiding from oversight or preemptively disabling an off switch [14, Off-Switch environment]. We say that a learning algorithm has an adversarial state incentive if it prefers policies with this behaviour. Second, an agent can attempt to mitigate action modifications in the immediate time step. For example, the policy may prefer actions that avoid triggering an action constraint, or the policy structure itself might be optimized to falsely appear interpretable to an overseer. We call this an adversarial policy incentive.

These definitions can be made precise using influence diagrams. First, we note that a learning algorithm can only be said to avoid action modifications if the policy output depends on the value of $P_A$. Everitt et al. [6] call this a response incentive. When $P_A$ has no parents and is itself a parent of the decision (as in our case), a response incentive can only occur if there are two directed paths from $P_A$ to the same reward variable: one that passes through $\Pi$ (the control path) and one that does not (the information path). Intuitively, the reason that the agent responds to changes in $P_A$ is that it is useful to control the node where the information and control paths intersect. If the paths intersect before the control path has visited a state node, then the diagram admits an adversarial policy incentive. In contrast, if the path intersection occurs at or after a state node in the control path, then the diagram admits an adversarial state incentive. The MAMDP influence diagram admits both kinds of incentives, as indicated by the highlighted paths in Figure 2.

Figure 2: MAMDP influence diagram with random $P_A$. The policy is chosen as a function of $P_A$. A pair of control and information paths that admit an adversarial state incentive are highlighted in light pink while the darker teal paths admit an adversarial policy incentive.

Learning Objectives

Which policy is chosen by a reinforcement learning algorithm when applied to a MAMDP depends on the specific objective optimized by the algorithm. There are a variety of objectives used in practice for solving MDPs; we consider three: reward maximization, Bellman optimality, and the Bellman policy value equations. These criteria are equivalent when applied to an MDP but, as we will show, this is not true in general for a MAMDP.
Reward Maximization
The reward maximization objective is
$$\pi^{\mathrm{RM}} = \operatorname*{arg\,max}_{\pi} \mathbb{E}_{\widetilde{M}}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, \Pi = \pi\right]. \quad (1)$$
This is the most straightforward objective: find a policy that maximizes expected reward. Evolution strategies [22], genetic algorithms [16], and all other black-box policy search methods have this form. A fundamental property of the reward maximization objective is that it ignores all environment structure and only considers the empirical relationship between a policy and the observed total reward. Direct reward maximization is consequently relatively rare as an RL objective since ignoring the available MDP structure tends to make optimization much more difficult. It also means that, when applied to MAMDPs rather than MDPs, the reward maximization objective continues to specify an optimal policy for the given environment:

Proposition 1.
A policy $\pi$ satisfies the reward maximization objective (1) for a MAMDP $\widetilde{M}$ if and only if $\pi$ is optimal for $\widetilde{M}$.

Proof. By definition of $\widetilde{J}(\pi)$, since $\pi^{\mathrm{RM}} = \operatorname{arg\,max}_\pi \widetilde{J}(\pi)$.

Since the reward maximization objective corresponds to MAMDP optimality, which is represented by the influence diagram of Figure 1, this objective admits both the adversarial state and adversarial policy incentives.
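To illustrate how a black-box method optimizes objective (1) from sampled returns alone, here is a sketch of a plain evolution-strategies update (a generic estimator in the style of [22], not the UH-CMA-ES variant used in the experiments; the `rollout_return` interface is an assumption for illustration):

```python
import numpy as np

def es_step(theta, rollout_return, n_samples=50, sigma=0.1, lr=0.01):
    """One evolution-strategies update on policy parameters theta.

    rollout_return(theta) -> float: empirical return of the policy in the
    MAMDP, action modifications included. The search only ever sees returns,
    so it optimizes the true (modified) objective J~.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = np.random.randn(*theta.shape)            # perturbation direction
        grad += eps * rollout_return(theta + sigma * eps)
    grad /= (n_samples * sigma)
    return theta + lr * grad
```

Because the estimator touches nothing but returns, any reward lost to action modification directly shapes the search direction.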
Bellman Optimality

The Bellman optimality objective is to jointly solve:
$$Q^{\mathrm{BO}}_{\widetilde{M}}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \max_{a' \in \mathcal{A}} Q^{\mathrm{BO}}_{\widetilde{M}}(s', a') \quad (2)$$
$$\pi^{\mathrm{BO}}_{\widetilde{M}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{BO}}_{\widetilde{M}}(s, a) \quad (3)$$
Let $Q^*_M$ be the optimal action value function for an MDP $M$. If an action value function $Q$ satisfies the Bellman optimality equation (2) for $M$ then $Q = Q^*_M$ and the greedy policy (3) is an optimal policy for $M$ [26, Section 3.6]. This is the basis for algorithms like value iteration [20] and Q-learning [27]. When applying the Bellman optimality equations to a MAMDP, we find that the solution does not depend on $P_A$:

Proposition 2.
An action value function $Q$ and a policy $\pi$ satisfy the Bellman optimality objective (2) and (3) on a MAMDP $\widetilde{M} = (M, P_A)$ if and only if $\pi$ and $Q$ are optimal policy and value functions for the MDP $M$.

Proof. Equations (2) and (3) are identical to the MDP Bellman optimality equations on $M$, which are satisfied if and only if $Q = Q^*_M$ and $\pi$ is optimal for $M$.

While simple to state and prove, this is one of the more important results of the paper. If $\pi$ satisfies the Bellman optimality equations then $\pi$ is optimal for an alternative version of the environment that has no action modification. In effect, $\pi$ ignores the presence of any constraints or modifications placed on the policy, acting as though its selected actions were executed directly in the environment. Combined with the convergence result of Theorem 6 in Section 4, this generalizes the Q-learning result of Orseau and Armstrong [18] to arbitrary policy-dependent action modifications: it is possible to train agents that ignore applied action modifications and one approach is as simple as using algorithms based on the Bellman optimality objective.

Figure 3: MDP influence diagram with a policy conditioned on $P_A$. Actions are sampled according to $\Pi$ without modification.

In terms of incentives, the Bellman optimality objective corresponds to the influence diagram in Figure 3, which represents the total return of a policy in an MDP $M$ when the policy is chosen conditional on the MAMDP $(M, P_A)$. This diagram does not admit any kind of response incentive on $P_A$. That is, learning algorithms with the Bellman optimality objective lack incentive to choose policies that manipulate the state or their own policy structure in order to defeat action modifications. On the other hand, these policies will also not detect and avoid self-damage.
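To see concretely why the solution cannot depend on $P_A$: equations (2) and (3) can be solved by value iteration using only $P_S$ and $R$. A minimal sketch, with tabular arrays as an assumed representation:

```python
import numpy as np

def value_iteration(P_S, R, gamma, tol=1e-8):
    """Solve the Bellman optimality equation (2).

    P_S: array [S, A, S'] of transition probabilities; R: array [S, A].
    Note that P_A never appears: the fixed point is Q*_M of the underlying
    MDP, whatever action modification is applied at run time.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)              # max_{a'} Q(s', a')
        Q_new = R + gamma * P_S @ V    # Eq. (2)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new               # greedy policy gives Eq. (3)
        Q = Q_new
```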
Policy Value

The action values of a policy $\pi$ in an MDP $M$ are given by the Bellman action value equation:
$$Q_{M,\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim \pi(s')} Q_{M,\pi}(s', a') \quad (4)$$
If $\pi$ is greedy with respect to $Q_{M,\pi}$ then $\pi$ is optimal for $M$ by the policy improvement theorem [26, Sec. 4.2]. Algorithms like policy iteration [2, 10] and Sarsa [21, 25] use this property and try to find a fixed-point policy value function.

Unlike the other objectives, it is ambiguous how the Bellman action value equation should be generalized from MDPs to MAMDPs. Should the successor action $a'$ be sampled from the given policy $\pi$ or from the modified action distribution $P_A(\cdot \mid \pi, s')$? We call the former the virtual policy value because the action used by the Bellman update does not occur in sampled trajectories, while the latter is the empirical policy value that can use successor actions observed from environment interaction trajectories.

Virtual Policy Value Objective:
$$Q^{\mathrm{VPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim \pi(s')} Q^{\mathrm{VPV}}_{\widetilde{M},\pi}(s', a') \quad (5)$$
$$\pi^{\mathrm{VPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{VPV}}_{\widetilde{M},\pi^{\mathrm{VPV}}}(s, a) \quad (6)$$

Proposition 3.
An action value function $Q$ and a policy $\pi$ satisfy the virtual policy value objective on a MAMDP $\widetilde{M}$ if and only if $\pi$ and $Q$ are optimal policy and value functions for $M$.

Proof. Equations (4) and (5) are identical, while equation (6) asserts that $\pi^{\mathrm{VPV}}$ is greedy for $Q^{\mathrm{VPV}}$. By the policy improvement theorem, these are satisfied if and only if $Q^{\mathrm{VPV}} = Q^*_M$ and $\pi$ is optimal for $M$.

It follows from Proposition 3 that the virtual policy value objective is represented by the MDP influence diagram in Figure 3 and likewise does not admit a response incentive on $P_A$. This provides a second approach to learning policies that ignore the presence of action modifications: use an algorithm based on policy iteration where the successor action $a'$ in Bellman updates is sampled virtually from the policy $\pi(s')$. Despite $a'$ not being the successor action in environment state-action trajectories, this constraint preserves the important feature of on-policy learning that the action is sampled from the current policy, so no direct maximization over actions of the Q function is required.

Empirical Policy Value Objective:
The other possible generalization of the policy value objective is to use the action that was selected in the environment:
$$Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim P_A(\cdot \mid \pi, s')} Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s', a') \quad (7)$$
$$\pi^{\mathrm{EPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{EPV}}_{\widetilde{M},\pi^{\mathrm{EPV}}}(s, a) \quad (8)$$
The combined empirical policy value equations (7) and (8) do not necessarily have a solution (see Appendix A for an example). However, considering just (7) for a fixed policy $\pi$, we have:

Proposition 4.
For any policy $\pi$, $Q^{\mathrm{EPV}}_{\widetilde{M},\pi} = Q_{M,\dot{\pi}}$ where $\dot{\pi}(a \mid s) := P_A(a \mid \pi, s)$.

Proof. Substituting $a' \sim \dot{\pi}(\cdot \mid s')$ for $a' \sim P_A(\cdot \mid \pi, s')$ transforms Equation (7) into Equation (4) with $\dot{\pi}$ in place of $\pi$.

If a solution to the empirical policy value objective exists then $\pi^{\mathrm{EPV}}$ may appear to behave inconsistently: the value $Q^{\mathrm{EPV}}(s, a)$ accurately represents the true reward that will be earned when $a$ is applied from state $s$ and then the modified actions of $\dot{\pi}$ are followed thereafter. By using these values, $\pi^{\mathrm{EPV}}$ takes into account all modifications applied to future actions. However, the immediate action of $\pi^{\mathrm{EPV}}$ is selected according to an argmax over action values and therefore $\pi^{\mathrm{EPV}}$ acts as though all immediate actions are executed without modification. This is visualized by Figure 4, which admits an adversarial state incentive but not an adversarial policy incentive.

Figure 4: Partial influence diagram for the $t = 0$ step of the empirical policy value objective with a fixed successor policy $\pi'$. The complete EPV objective cannot be formulated as a single-decision influence diagram. The highlighted paths form a subgraph that admits an adversarial state incentive.

Convergence of Learning Algorithms

In this section we provide convergence proofs for prototypical stochastic algorithms with the objectives described in Section 3. The convergence results derive from the following theorem, a generalization of Singh et al. [24, Theorem 1], which in turn generalizes Jaakkola, Jordan, and Singh [12, Theorem 2]. Where Singh et al. [24] assume that the policy converges to the greedy policy of the current tabular state-action estimate $Q_t$, we instead assume that the policy converges to $\Lambda(Q_t)$ where $\Lambda$ is an arbitrary function. The greedy policy is the special case $\Lambda^{\mathrm{greedy}}(s, Q_t) = \operatorname{arg\,max}_a Q_t(s, a)$.
Theorem 5. Let $\widetilde{M}$ be a MAMDP with bounded, optionally stochastic rewards. Consider a stochastic, iterative algorithm that learns a tabular Q function with the update rule:
$$Q_{t+1}(s, a) = (1 - \alpha_t(s, a)) Q_t(s, a) + \alpha_t(s, a)(R + \gamma Q_t(S', A')) \quad (9)$$
where $S' \sim P_S(s, a)$ is a random successor state sampled from the transition dynamics, $A' \sim \Lambda_t(S', Q_t)$ is a random successor action sampled from an arbitrary policy $\Lambda_t(\cdot, Q_t)$, and $R$ is a random reward value with $\mathbb{E}[R] = R(s, a)$. If

1. $\Lambda_t \to \Lambda$ uniformly;
2. the learning rates satisfy $0 \le \alpha_t(s, a) \le 1$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t \alpha_t^2(s, a) < \infty$ almost surely; and
3. there exists some $Q^\Lambda$ satisfying
$$Q^\Lambda(s, a) = R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda(S', Q^\Lambda)} Q^\Lambda(S', A'), \quad (10)$$

then $Q^\Lambda$ is unique and $Q_t \xrightarrow{\text{a.s.}} Q^\Lambda$.

We prove this theorem in Appendix B. Equation (9) describes a general stochastic Q learning rule. At each time step $t$ of a random state-action trajectory $S_0 A_0 S_1 A_1 \cdots$, we update $Q_{t+1}(S_t, A_t)$ using a learning rate $\alpha_t(S_t, A_t)$. The equation represents an update to the entire Q function over all state-action pairs at time $t$, not just $Q_{t+1}(S_t, A_t)$, so we get point updates by setting $\alpha_t(s, a) = 0$ if $(s, a) \ne (S_t, A_t)$. Therefore, $\alpha_t$ encodes both the learning rate and the trajectory over which updates are performed. The condition $\sum_t \alpha_t(s, a) = \infty$ requires that the trajectory visit all $(s, a)$ pairs infinitely often with probability 1. As this condition is on empirical trajectories in the environment with action modification applied, there is no general rule for policies that ensures exploration. An epsilon-greedy policy is not guaranteed to explore a MAMDP.
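The three algorithms analyzed below differ only in the distribution from which the bootstrap action $A'$ is drawn. A minimal sketch of the shared update (9); the tabular layout and function names are our own illustration:

```python
import numpy as np

def q_update(Q, N, s, a, r, s_next, a_boot, gamma):
    """Apply update rule (9) at one visited pair (s, a).

    a_boot is the bootstrap action A':
      - Q-learning:      argmax_{a'} Q[s_next, a']
      - virtual Sarsa:   sampled from the (virtual) policy pi_t(s_next)
      - empirical Sarsa: the action actually executed at s_next
    """
    N[s, a] += 1
    alpha = 1.0 / N[s, a]  # learning rate 1/N(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_boot] - Q[s, a])
```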
Q-Learning

Algorithm 1
Q-learning on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    for t = 0, 1, 2, ... do
        Derive policy π_t from Q
        Take one step using policy π_t; observe A_t, R_t, S_{t+1}    ▷ Note: A_t ∼ P_A(π_t, S_t)
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ max_{a∈A} Q(S_{t+1}, a) − Q(S_t, A_t))
    end for

The Q-learning algorithm on a MAMDP is shown in Algorithm 1. The only difference compared to Q-learning on an MDP [26, Sec. 6.5] is that instead of specifying an action on each transition, we specify a policy and observe the action taken (which might not be the action that the policy would have selected if executed without modification).
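A runnable sketch of Algorithm 1, assuming a MAMDP environment object with `reset()` and a `step(policy)` method that samples $A_t \sim P_A(\pi_t, S_t)$ internally and returns the executed action (this interface is our assumption, mirroring the "specify a policy, observe the action" structure above):

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, epsilon=0.1, steps=100_000):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy *virtual* policy over actions in state s.
        pi = np.full(n_actions, epsilon / n_actions)
        pi[np.argmax(Q[s])] += 1 - epsilon

        # The environment applies P_A: the executed action a may differ
        # from a direct sample of pi.
        a, r, s_next = env.step(pi)

        N[s, a] += 1
        target = r + gamma * Q[s_next].max()  # bootstrap ignores P_A
        Q[s, a] += (target - Q[s, a]) / N[s, a]
        s = s_next
    return Q
```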
Theorem 6.
Q-learning on a MAMDP $\widetilde{M}$ converges almost surely to $Q^*_M$ if all state-action pairs are visited infinitely often during learning.

Proof. Q-learning is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) = \Lambda(s, Q) = \operatorname{arg\,max}_a Q(s, a)$ and a learning rate function of $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often as $t$ goes to infinity then $\alpha_t$ satisfies condition 2 of Theorem 5. In this context, Equation (10) is the Bellman optimality equation (2). By Proposition 2, a solution to this equation exists and it is $Q^*_M$. Therefore, by Theorem 5, $Q_t$ converges almost surely to $Q^*_M$.

Virtual Sarsa
Virtual Sarsa is an on-policy algorithm, unlike Q-learning, so we need to be more specific about how the learning policy is derived from the Q function. Let $\pi = \Pi_t(Q)$ be the mapping from the Q table to the policy that is used at step $t$. Let $\Pi^*$ be the greedy mapping $\Pi^*(Q)(s) = \operatorname{arg\,max}_a Q(s, a)$. To ensure that the policy is eventually optimal, we require that $\Pi_t \to \Pi^*$ as $t \to \infty$.

Algorithm 2
Virtual Sarsa on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    for t = 0, 1, 2, ... do
        π_t ← Π_t(Q)
        Take one step using policy π_t; observe A_t, R_t, S_{t+1}    ▷ Note: A_t ∼ P_A(π_t, S_t)
        Sample A′ ∼ π_t(S_{t+1})
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ Q(S_{t+1}, A′) − Q(S_t, A_t))
    end for

Theorem 7. Virtual Sarsa on a MAMDP $\widetilde{M}$ converges almost surely to $Q^*_M$ if $\Pi_t \to \Pi^*$ and all state-action pairs are visited infinitely often during learning.

Proof. Define $\Lambda_t(s, Q) = \Pi_t(Q)(s)$ and $\Lambda(s, Q) = \Pi^*(Q)(s) = \operatorname{arg\,max}_a Q(s, a)$. Then Virtual Sarsa is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) \to \Lambda(s, Q)$ and a learning rate function of $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often then $\alpha_t$ satisfies condition 2 of Theorem 5. In this context, Equation (10) is the virtual policy value equation (5). By Proposition 3, a solution to this equation exists and it is $Q^*_M$. Therefore, by Theorem 5, $Q_t$ converges almost surely to $Q^*_M$.

Empirical Sarsa
Algorithm 3
Empirical Sarsa on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    π₀ ← Π₀(Q)
    Take one step using policy π₀; observe A₀, R₀, S₁
    for t = 0, 1, 2, ... do
        π_{t+1} ← Π_{t+1}(Q)
        Take one step using π_{t+1}; observe A_{t+1}, R_{t+1}, S_{t+2}    ▷ Note: A_{t+1} ∼ P_A(π_{t+1}, S_{t+1})
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t))
    end for

As we have established, the empirical Sarsa criterion does not necessarily have any fixed point. However, if the fixed point does exist then the algorithm converges.
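The only difference between the two Sarsa variants is which successor action enters the update. A sketch, using the same assumed environment interface as before:

```python
import numpy as np

def sarsa_bootstrap_action(pi_next, a_executed_next, variant):
    """Choose the bootstrap action A' for the Sarsa update.

    pi_next: virtual policy distribution at S_{t+1}.
    a_executed_next: action actually executed at S_{t+1}
    (already passed through P_A).
    """
    if variant == "virtual":
        # Sample from the policy itself; the sampled action never runs in
        # the environment, so updates ignore P_A (Theorem 7).
        return np.random.choice(len(pi_next), p=pi_next)
    elif variant == "empirical":
        # Use the observed, possibly modified action (Theorem 8).
        return a_executed_next
    raise ValueError(variant)
```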
Theorem 8.
If there exists a solution $(Q^{\mathrm{EPV}}_{\widetilde{M},\pi}, \pi)$ to Equations (7) and (8), $\Pi_t \to \Pi^*$, and all state-action pairs are visited infinitely often during learning, then Empirical Sarsa converges almost surely to $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$.

Proof. Empirical Sarsa is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) = P_A(\Pi_t(Q), s)$, which converges to $\Lambda(s, Q) = P_A(\Pi^*(Q), s)$. The learning rate function is $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often then $\alpha_t$ satisfies condition 2 of Theorem 5. If $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$ exists satisfying Equations (7) and (8) then $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$ also satisfies Equation (10), so by Theorem 5, Empirical Sarsa converges to $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$.

Experiments

We evaluate the various RL algorithms on a custom environment that illustrates their asymptotic differences. (Code is at https://github.com/edlanglois/mamdp.)
Algorithm        Choice  Sim       Real
Q-Learning       Real    Exploit   Exploit
Virtual Sarsa    Real    Exploit   Exploit
Empirical Sarsa  Sim     Exploit   Exploit
UH-CMA-ES        Real    Complete  Exploit

Table 1: Algorithms trained to convergence on the Simulation-Oversight environment (10^7 steps). The actions shown are chosen by the resulting policy with probability > 0.99.

The Simulation-Oversight environment represents the effect of oversight based on imperfect simulation of an agent. The environment consists of three states: Choice, Sim, and Real. In the initial state, Choice, the agent is offered the choice between performing a task in simulation (Sim) or reality (Real). To guarantee exploration, the opposite state is selected 10% of the time (this is part of the MDP dynamics, not a MAMDP action modification). We assume that the reward function contains exploitable flaws present in both reality and simulation. In Sim, the agent may either Complete the task for 0 reward or Exploit for 1, after which the episode ends. The actions for Real are the same but yield rewards 2 and 3, respectively. Finally, the agent may Abort in any state to immediately end the episode with −1 reward. If the agent chooses an action in the Real state then a supervisor simulates its policy on Sim. If the simulated action is Exploit then the supervisor intervenes and sets the agent's real action to Abort. This shows how the MAMDP action selection function may depend on the global structure of the policy, not just the behaviour at the current state.
Q-learning, Virtual Sarsa, Empirical Sarsa, and Uncertainty-Handling Evolution Strategies [9, UH-CMA-ES] were implemented and trained to convergence on this MAMDP. The empirical episode returns over the course of training are shown in Figure 5 and the resulting policies are summarized in Table 1. Q-Learning and Virtual Sarsa choose Real and attempt to Exploit in both Sim and Real. These agents are consistently "caught" by the supervisor during training and receive low reward. As predicted, this strategy demonstrates an ignorance of oversight in which the MDP-optimal strategy is chosen despite being ineffective in practice. Empirical Sarsa similarly ignores oversight when performing the task but considers oversight when modelling the future.
Figure 5: The empirical per-episode reward (mean with 95% confidence interval) on the Simulation-Oversight environment across 100 independent training runs.

Predicting that it will be caught exploiting in the Real task, Empirical Sarsa instead chooses to perform the Sim task where there is no oversight. While earning more reward than Q-learning and Virtual Sarsa, this strategy performs worse than the "honest" strategy of choosing Real and Complete. Finally, evolution strategies finds the reward-maximizing strategy for this MAMDP: it chooses the Real state and Exploits, but Completes the Sim task so that it appears cooperative when simulated. This strategy is a kind of "treacherous turn" [4] in which the agent intentionally performs sub-optimally in certain situations (from a reward perspective) in order to fool the supervisor into allowing the agent to proceed in a higher-stakes setting. Therefore, despite earning less reward, Q-learning and Virtual Sarsa are arguably the most preferable agents as they make the reward misspecification obvious to the overseer.

We also compare the agent behaviours on the Off-Switch and Whisky-Gold environments of Leike et al. [14]. Both tasks consist of navigating to a goal state on a 2D grid. In Off-Switch, the agent may be interrupted by a supervisor and prevented from moving unless the agent first disables its off switch. In Whisky-Gold the agent may "drink" a bottle of whisky on its path by moving over it. Doing so earns a small reward but all future actions have a high chance of being selected randomly, making it harder to reach the goal. To make learning easier, we modify the environment with a reduced grid size and random drunk actions that are biased away from the goal (since uniform random actions are less of a detriment on a smaller grid). See Appendix C for details about the experiments. Policy behaviours over the course of training are plotted in Figures 6 and 7 and the resulting behaviours of the different algorithms are summarized in Table 2.

Algorithm        Disable Off Switch?  Drink Whisky?
Q-Learning       No                   Yes
Virtual Sarsa    No                   Yes
Empirical Sarsa  Yes                  No
UH-CMA-ES        Yes                  No

Table 2: Learned policy behaviours on the Off-Switch and Small Whisky-Gold environments after 10^7 training steps. The table shows whether, at the first decision opportunity, the trained policy is most likely to move towards the button / whisky, or away.
Figure 6: The probability of drinking the whisky (mean with 95% confidence interval) in the Small Whisky-Gold environment across 15 independent training runs.

Q-Learning and Empirical Sarsa match the behaviours described by Leike et al. Note that algorithms that correctly avoid disabling the off switch incorrectly drink the whisky, and vice versa, because in either case the algorithm either accounts for interventions or not. Is it possible to perform correctly in both? Yes, but only if we are able to choose which interventions are treated as action modification and which are treated as state transitions. Q-Learning and Virtual Sarsa optimize the MDP transition dynamics but not the MAMDP action modifications. Thus, if we make the Whisky-Gold intervention part of the transition dynamics rather than P_A (by observing the action pre-modification), then Q-learning and Virtual Sarsa will learn to account for it and avoid the whisky. This approach is only possible when P_A is a function of π(s); more general functions of s and π cannot be implemented as a state transition.
Figure 7: The probability of disabling the off switch (mean with 95% confidence interval) in the Off-Switch environment across 15 independent training runs. The final rise in the Virtual Sarsa line is likely an artifact of training instability due to the fixed learning rate; individual runs were observed to fluctuate between strategies.
Conclusion

We have introduced the MAMDP model that describes the case where actions as executed in the environment may differ from the actions specified by a policy. MAMDPs model a variety of scenarios including constraints, hardware limitations, supervised interventions, and self-damage. We showed that the determining factor for agent behaviour is the specific objective that an RL algorithm seeks to satisfy in the limit. Reward maximization leads to policies that account for action modification, while the Bellman optimality and virtual policy value criteria ignore the presence of action modifications. Using incentive analysis, we categorized different ways in which learned policies may respond to action modification and showed how the empirical policy value criterion only responds via controlling the state, not directly via the policy structure or action. MDP algorithms may be straightforwardly adapted to the MAMDP setting, and tabular Q-learning and Sarsa converge whenever a solution exists. Finally, we verified the results experimentally.

It is sometimes assumed that reinforcement learning will always lead to reward maximizing behaviour [17]. However, we show that many kinds of reinforcement learning algorithms can systematically deviate from reward maximization when the executed action may differ from the one specified by the policy. In general, efficient learning algorithms often make assumptions about the structure of the world and the resulting policies will be chosen based on these assumptions, not the true world dynamics. Agent designers can use these assumptions to intentionally blind agents from certain properties of their environment, and thereby make them easier to control.
Acknowledgements
Thank you to Ryan Carey, Laurent Orseau, David Duvenaud, and Tim Genewein for their invaluable feedback. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number CGSD3–534795–2019]. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [numéro de référence CGSD3–534795–2019].
References

[1] David Abel et al. "Agent-Agnostic Human-in-the-Loop Reinforcement Learning". In: CoRR abs/1701.04079 (2017).
[2] Richard Bellman. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
[3] Felix Berkenkamp et al. "Safe Model-based Reinforcement Learning with Stability Guarantees". In: NIPS. 2017, pp. 908–918.
[4] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
[5] Gal Dalal et al. "Safe Exploration in Continuous Action Spaces". In: CoRR abs/1801.08757 (2018).
[6] Tom Everitt et al. "Agent Incentives: A Causal Approach". In: AAAI. AAAI Press, 2021.
[7] Tom Everitt et al. "Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings". In: CoRR abs/1902.09980 (2019).
[8] Shixiang Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: ICRA. IEEE, 2017, pp. 3389–3396.
[9] Verena Heidrich-Meisner and Christian Igel. "Uncertainty handling CMA-ES for reinforcement learning". In: GECCO. ACM, 2009, pp. 1211–1218.
[10] Ronald A. Howard. Dynamic Programming and Markov Processes. John Wiley, 1960.
[11] Ronald A. Howard and James E. Matheson. "Influence diagrams". In: The Principles and Applications of Decision Analysis (1984).
[12] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms". In: Neural Computation 6.6 (1994), pp. 1185–1201.
[13] In: CoRR abs/1910.01055 (2019).
[14] Jan Leike et al. "AI Safety Gridworlds". In: CoRR abs/1711.09883 (2017).
[15] Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. "Fixed Point Quantization of Deep Convolutional Networks". In: ICML. Vol. 48. JMLR Workshop and Conference Proceedings. JMLR.org, 2016, pp. 2849–2858.
[16] David E. Moriarty, Alan C. Schultz, and John J. Grefenstette. "Evolutionary Algorithms for Reinforcement Learning". In: J. Artif. Intell. Res. 11 (1999), pp. 241–276.
[17] Stephen M. Omohundro. "The Basic AI Drives". In: AGI. Vol. 171. Frontiers in Artificial Intelligence and Applications. IOS Press, 2008, pp. 483–492.
[18] Laurent Orseau and Stuart Armstrong. "Safely Interruptible Agents". In: UAI. AUAI Press, 2016.
[19] Judea Pearl. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
[20] Martin L. Puterman and Moon Chirl Shin. "Modified Policy Iteration Algorithms for Discounted Markov Decision Problems". In: Management Science 24.11 (1978), pp. 1127–1137. ISSN: 0025-1909.
[21] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning Using Connectionist Systems. Tech. rep. Cambridge University Engineering Department, 1994.
[22] Tim Salimans et al. "Evolution Strategies as a Scalable Alternative to Reinforcement Learning". In: CoRR abs/1703.03864 (2017).
[23] William Saunders et al. "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention". In: AAMAS. International Foundation for Autonomous Agents and Multiagent Systems / ACM, 2018, pp. 2067–2069.
[24] Satinder P. Singh et al. "Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms". In: Machine Learning 38 (2000), pp. 287–308.
[25] Richard S. Sutton. "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding". In: NIPS. MIT Press, 1995, pp. 1038–1044.
[26] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. The MIT Press, 2018.
[27] Christopher J. C. H. Watkins and Peter Dayan. "Technical Note: Q-Learning". In: Mach. Learn. 8 (1992), pp. 279–292.
A Unsatisfiable EPV Objective
We prove that the Empirical Policy Value (EPV) objective, consisting of Equations (7) and (8) below, is not necessarily satisfiable.
$$Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim P_A(\cdot \mid \pi, s')} Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s', a') \quad (7)$$
$$\pi^{\mathrm{EPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{EPV}}_{\widetilde{M},\pi^{\mathrm{EPV}}}(s, a) \quad (8)$$
In particular, these EPV equations are not satisfiable for the MAMDP shown in Figure 8. The action is only modified in this environment if the policy chooses action 3 with probability 1. In this case, the action is changed to action 2.

Figure 8: A MAMDP with no EPV solution. From state $S_1$, action $a = 1$ ends the episode with reward $r = 2$; action $a = 2$ ends the episode with reward $r = -2$; action $a = 3$ yields reward $r = 1$ and returns to $S_1$. The action selection function is
$$P_A(a \mid s, \pi) = \begin{cases} 1 & \text{if } \pi(3 \mid s) = 1 \text{ and } a = 2 \\ 0 & \text{if } \pi(3 \mid s) = 1 \text{ and } a \ne 2 \\ \pi(a \mid s) & \text{otherwise} \end{cases}$$
with $\gamma = 0.9$.

Proof. We will show that no policy $\pi$ is greedy for $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$. To start, note that for any policy $\pi$, $Q^{\mathrm{EPV}}_\pi(S_1, 1) = 2$ and $Q^{\mathrm{EPV}}_\pi(S_1, 2) = -2$. Therefore, if $\pi$ satisfies Equation (8) then $\pi(2 \mid S_1) = 0$, so any possible greedy policy must have the form $\pi(1 \mid S_1) = 1 - \alpha$ and $\pi(3 \mid S_1) = \alpha$ for some $\alpha \in [0, 1]$. If $\alpha \in [0, 1)$ then
$$Q^{\mathrm{EPV}}_\pi(S_1, 3) = 1 + 0.9\left((1 - \alpha) \cdot 2 + \alpha \cdot Q^{\mathrm{EPV}}_\pi(S_1, 3)\right)$$
$$\Rightarrow \quad Q^{\mathrm{EPV}}_\pi(S_1, 3) = 2 + \frac{0.8}{1 - 0.9\alpha} > 2,$$
so $\operatorname{arg\,max}_{a \in \{1,2,3\}} Q^{\mathrm{EPV}}_\pi(S_1, a) = 3$ but $\pi(3 \mid S_1) \ne 1$, so $\pi$ does not satisfy Equation (8). If $\alpha = 1$ then the action modification occurs and $Q^{\mathrm{EPV}}_\pi(S_1, 3) = 1 + 0.9 \cdot Q^{\mathrm{EPV}}_\pi(S_1, 2) = 1 - 0.9 \cdot 2 = -0.8$. In this case, $\operatorname{arg\,max}_{a \in \{1,2,3\}} Q^{\mathrm{EPV}}_\pi(S_1, a) = 1$ but $\pi(1 \mid S_1) = 0 \ne 1$, again contradicting Equation (8).
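A quick numeric check of this argument, using the constants as reconstructed in Figure 8 above (a sketch of ours, not from the paper's code):

```python
import numpy as np

GAMMA = 0.9
R1, R2, R3 = 2.0, -2.0, 1.0  # rewards for actions 1, 2, 3 from S1

def greedy_action(alpha):
    """Greedy action under Q^EPV for pi with pi(1)=1-alpha, pi(3)=alpha."""
    q1, q2 = R1, R2
    if alpha < 1:
        # Solve q3 = R3 + gamma * ((1 - alpha) * q1 + alpha * q3).
        q3 = (R3 + GAMMA * (1 - alpha) * q1) / (1 - GAMMA * alpha)
    else:
        # Modification forces action 2 on the next step.
        q3 = R3 + GAMMA * q2
    return int(np.argmax([q1, q2, q3])) + 1

# Greedy never agrees with the policy: argmax is 3 whenever alpha < 1,
# and 1 when alpha = 1, so no fixed point exists.
for alpha in [0.0, 0.5, 0.99, 1.0]:
    print(alpha, greedy_action(alpha))
```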
B Convergence Theorem Proof

Here, we provide a proof of Theorem 5, based on similar proofs by Singh et al. [24] and Jaakkola, Jordan, and Singh [12]. We make use of the following lemma from Singh et al.:
Lemma 9.
Consider a random iterative process $(\alpha_t, \Delta_t, F_t)$, where $\alpha_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equations
$$\Delta_{t+1}(x) = (1 - \alpha_t(x)) \Delta_t(x) + \alpha_t(x) F_t(x) \quad (11)$$
for $x \in X$, $t = 0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields encoding the past of the process such that $\alpha_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable. Let $\|\cdot\|_W$ be some fixed weighted maximum norm. If the following hold:

1. the set $X$ is finite;
2. $0 \le \alpha_t(x) \le 1$, $\sum_t \alpha_t(x) = \infty$, $\sum_t \alpha_t^2(x) < \infty$ with probability 1, where the probability is over the learning rates $\alpha_t$;
3. $\|\mathbb{E}[F_t(\cdot) \mid P_t]\|_W \le \kappa \|\Delta_t\|_W + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to zero with probability 1;
4. $\mathrm{Var}[F_t(x) \mid P_t] \le K(1 + \|\Delta_t\|_W)^2$, where $K$ is some constant,

then $\Delta_t$ converges to zero with probability 1.
In the context of stochastic Q value estimation, $\Delta_t$ represents the error between our current value estimate and the target value estimate. Updates are performed according to a random trajectory $\{(s_0, a_0), (s_1, a_1), \ldots\}$ and this is represented in Equation (11) by setting $\alpha_t(s, a) = 0$ if $(s, a) \ne (s_t, a_t)$. $F_t$ represents a stochastic Bellman-style update applied over the entire domain; $\alpha_t$ gates this update to only apply to the specific state-action pair sampled in the trajectory. Conditioning on $P_t$ conditions on the values of $\alpha_0, \ldots, \alpha_t$, $\Delta_0, \ldots, \Delta_t$, and $F_0, \ldots, F_{t-1}$, i.e. on the history of past updates.

Using Lemma 9, we can now prove Theorem 5, which we first restate:
Let $\widetilde{M}$ be a MAMDP with bounded, optionally stochastic rewards. Consider a stochastic, iterative algorithm that learns a tabular Q function with the update rule:
$$Q_{t+1}(s, a) = (1 - \alpha_t(s, a)) Q_t(s, a) + \alpha_t(s, a)(R + \gamma Q_t(S', A')) \quad (9)$$
where $S' \sim P_S(s, a)$ is a random successor state sampled from the transition dynamics, $A' \sim \Lambda_t(S', Q_t)$ is a random successor action sampled from an arbitrary policy $\Lambda_t(\cdot, Q_t)$, and $R$ is a random reward value with $\mathbb{E}[R] = R(s, a)$. If

1. $\Lambda_t \to \Lambda$ uniformly;
2. the learning rates satisfy $0 \le \alpha_t(s, a) \le 1$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t \alpha_t^2(s, a) < \infty$ almost surely; and
3. there exists some $Q^\Lambda$ satisfying
$$Q^\Lambda(s, a) = R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda(S', Q^\Lambda)} Q^\Lambda(S', A'), \quad (10)$$

then $Q^\Lambda$ is unique and $Q_t \xrightarrow{\text{a.s.}} Q^\Lambda$.

Proof. Subtract $Q^\Lambda(s, a)$ from both sides of Equation (9) to get
$$\Delta_{t+1}(s, a) = (1 - \alpha_t(s, a)) \Delta_t(s, a) + \alpha_t(s, a) F_t(s, a)$$
where we define
$$\Delta_t(s, a) := Q_t(s, a) - Q^\Lambda(s, a)$$
$$F_t(s, a) := R + \gamma Q_t(S', A') - Q^\Lambda(s, a).$$
Condition 1 of Lemma 9 is satisfied since $X = \mathcal{S} \times \mathcal{A}$ is finite, while Condition 2 also appears as Condition 2 of Theorem 5. Establishing Condition 3 is more involved. Recall that for each $(s, a)$ pair, $S' \sim P_S(s, a)$ and $A' \sim \Lambda_t(S', Q_t)$. Define $A'' \sim \Lambda(S', Q_t)$. We proceed by decomposing $F_t(s, a)$:
$$F_t(s, a) = R + \gamma Q_t(S', A') - Q^\Lambda(s, a) = R + \gamma Q_t(S', A'') - Q^\Lambda(s, a) + \gamma \left( Q_t(S', A') - Q_t(S', A'') \right) = F^\Lambda_t(s, a) + C_t(s, a)$$
where
$$F^\Lambda_t(s, a) := R + \gamma Q_t(S', A'') - Q^\Lambda(s, a)$$
$$C_t(s, a) := \gamma \left( Q_t(S', A') - Q_t(S', A'') \right).$$
Then
$$\|\mathbb{E}[F^\Lambda_t(\cdot) \mid P_t]\|_\infty = \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} Q_t(S', A'') - Q^\Lambda(s, a) \right|$$
and by Equation (10):
$$= \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \left[ Q_t(S', A'') - Q^\Lambda(S', A'') \right] \right|$$
$$\le \gamma \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \left| Q_t(S', A'') - Q^\Lambda(S', A'') \right|$$
$$\le \gamma \max_{s' \in \mathcal{S}} \max_{a'' \in \mathcal{A}} \left| Q_t(s', a'') - Q^\Lambda(s', a'') \right| = \gamma \|\Delta_t\|_\infty.$$
Let
$$c_t := \|\mathbb{E}[C_t(\cdot) \mid P_t]\|_\infty = \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda_t(S', Q_t)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \gamma \left( Q_t(S', A') - Q_t(S', A'') \right) \right|$$
$$|c_t| \le \gamma \max_{s' \in \mathcal{S}} \mathbb{E}_{A' \sim \Lambda_t(s', Q_t)} \mathbb{E}_{A'' \sim \Lambda(s', Q_t)} \left| Q_t(s', A') - Q_t(s', A'') \right|.$$
Then $c_t \xrightarrow{\text{a.s.}} 0$ since a) $\Lambda_t \to \Lambda$ uniformly, b) $\mathcal{S} \times \mathcal{A}$ is finite, and c) $Q_t$ is bounded (since $R$ is bounded). Therefore,
$$\|\mathbb{E}[F_t(\cdot) \mid P_t]\|_\infty \le \|\mathbb{E}[F^\Lambda_t(\cdot) \mid P_t]\|_\infty + \|\mathbb{E}[C_t(\cdot) \mid P_t]\|_\infty \le \gamma \|\Delta_t\|_\infty + c_t$$
with $c_t \xrightarrow{\text{a.s.}} 0$, satisfying Condition 3. For Condition 4, note that for any random variables $X$ and $Y$ and any bounded function $f$:
$$\mathrm{Var}[X + Y] \le \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]} = \left(\sqrt{\mathrm{Var}[X]} + \sqrt{\mathrm{Var}[Y]}\right)^2$$
$$\mathrm{Var}[f(X)] \le \tfrac{1}{4}\left(\max_x f(x) - \min_x f(x)\right)^2 \le \tfrac{1}{4}\left(2\|f\|_\infty\right)^2 = \|f\|_\infty^2.$$
Therefore,
$$\mathrm{Var}[F^\Lambda_t \mid P_t] = \mathrm{Var}_{R, S', A''}\left[R + \gamma Q_t(S', A'') - Q^\Lambda(s, a)\right] \le \left(\sqrt{\mathrm{Var}[R]} + \gamma\sqrt{\mathrm{Var}[Q_t(S', A'')]}\right)^2.$$
Let $K_R$ be an upper bound on $|R|$; then $\sqrt{\mathrm{Var}[R]} \le K_R$. Also,
$$\mathrm{Var}[Q_t(S', A'')] = \mathrm{Var}[\Delta_t(S', A'') + Q^\Lambda(S', A'')] \le \left(\|\Delta_t\|_\infty + \|Q^\Lambda\|_\infty\right)^2$$
$$\mathrm{Var}[C_t \mid P_t] \le \left(\gamma\sqrt{\mathrm{Var}[Q_t(S', A')]} + \gamma\sqrt{\mathrm{Var}[Q_t(S', A'')]}\right)^2 \le \left(2\gamma \|Q_t\|_\infty\right)^2.$$
Since $\alpha_t \in [0, 1]$, $\|Q_t\|_\infty \le \frac{K_R}{1 - \gamma}$. Putting this all together, we get that
$$\mathrm{Var}[F_t \mid P_t] \le \left(\sqrt{\mathrm{Var}[F^\Lambda_t \mid P_t]} + \sqrt{\mathrm{Var}[C_t \mid P_t]}\right)^2 \le \left(K_R + \|\Delta_t\|_\infty + \|Q^\Lambda\|_\infty + \frac{2\gamma K_R}{1 - \gamma}\right)^2.$$
This satisfies Condition 4 since everything but $\|\Delta_t\|_\infty$ is constant. Therefore, by Lemma 9, $\Delta_t \xrightarrow{\text{a.s.}} 0$, i.e. $Q_t$ converges to $Q^\Lambda$ with probability 1. Since $Q_t$ can only have one limit, $Q^\Lambda$ must be the unique solution to Equation (10).

C Experiment Details
This section provides more details about the experiments described in the paper. The code is provided in a code appendix.
C.1 Agents
The following agents were evaluated:
Q-Learning
An implementation of Algorithm 1. The policy $\pi_t$ is $\varepsilon$-greedy with respect to $Q_t$. Parameters: exploration rate ($\varepsilon = 0.…$).

Virtual Sarsa
An implementation of Algorithm 2. The policy $\pi_t$ is $\varepsilon$-greedy, given by
$$\Pi_t(Q)(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \operatorname{arg\,max}_{a'} Q(s, a') \\ \frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
Technically, for Theorem 7 to apply, the exploration rate should be decayed to 0 over the course of training so that $\lim_{t \to \infty} \Pi_t$ is the pure greedy policy selection $\Pi^*$. However, the presence of nonzero exploration makes no difference in the resulting policies for the environments that we consider, so we avoid this unnecessary complexity and keep the exploration rate constant. Parameters: exploration rate ($\varepsilon = 0.…$).

Empirical Sarsa
An implementation of Algorithm 3. The policy selection is the same as for Virtual Sarsa, including fixing the exploration rate. Parameters: exploration rate ($\varepsilon = 0.…$).

UH-CMA-ES
The agent searches for a policy that maximizes the expected discounted episode return using the Uncertainty-Handling Covariance Matrix Adaptation Evolution Strategy by Heidrich-Meisner and Igel [9]. UH-CMA-ES uses only a stochastic mapping from inputs (policies) to outputs (returns), so the algorithm can be applied directly to the MAMDP problem without modification. This emulates the black-box reward maximization objective (1), although it is a local search method and can become stuck in local optima. The policy is parameterized as a matrix of logits $\nu$, one for each state-action pair:
$$\pi_\nu(a \mid s) = \frac{e^{\nu_{s,a}}}{\sum_{a'} e^{\nu_{s,a'}}}$$
Since this is an overparameterization, we introduce an extra penalty term in the optimization objective that encourages the maximum value of $\nu_{s,\cdot}$ to remain close to 0:
$$\mathrm{penalty}(\nu) = \frac{\beta}{|\mathcal{S}|} \sum_s \max\left(0, \left|\max_a \nu_{s,a}\right| - …\right)$$
Parameters:
• initial step size ($\sigma^{(0)} = 0.…$),
• initial num sample evaluations ($n^{(0)}_{\mathrm{eval}} = 100$),
• noise tolerance ($\theta = 0.2$),
• penalty scale ($\beta = 1$), and
• scale sample evaluations ($\alpha = 1.5$).
The parameters exploration rate, initial step size, and penalty scale were chosen as arbitrary reasonable values without any testing of alternatives. The values for noise tolerance and scale sample evaluations come recommended by Heidrich-Meisner and Igel. No alternate values for noise tolerance were tested. We tested scale sample evaluations = 1 (disabling dynamic uncertainty handling) and found no reason to prefer it. The parameter initial num sample evaluations was tested with both 1 and 100, and 100 was found to be more stable.

In the Simulation-Oversight and Small Whisky-Gold environments, the agent Q values are updated using the inverse visit count $N(s, a)^{-1}$ for the learning rate, as described by Algorithms 1 to 3. However, this approach is extremely slow at determining and propagating the negative value of perpetual interruptions in the Off-Switch environment. Therefore, for that environment we instead use a fixed learning rate of 0.1, chosen without further testing of alternatives.

(All greedy (argmax) action selections break ties by taking the first action according to a fixed order.)
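A short sketch of this softmax policy parameterization (function name is our own):

```python
import numpy as np

def policy_from_logits(nu):
    """Softmax policy pi_nu(a|s) from a logit matrix nu of shape [S, A]."""
    z = nu - nu.max(axis=1, keepdims=True)  # stabilize the exponentials
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)
```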
C.2 Experiment Structure
For each agent-environment pair, multiple training runs were performed (100 for Simulation-Oversight and 10 for Off-Switch and Whisky-Gold). Each training run consisted of 10,000,000 steps in the environment. A pseudorandom number generator was used for all randomness, seeded with the training run index 1, 2, .... The final policies (excluding any added noise used for exploration during training) were analyzed to produce Tables 1 and 2 in Section 5. For Table 1, the policies were categorized as either the corresponding deterministic policy, if the most likely action in every state has probability exceeding 0.99, or as an ambiguous policy otherwise. All final policies met the threshold and yielded the same deterministic policy across all 100 training runs of the same agent. Table 2 measures whether, at the first decision opportunity, the agent is more likely to move towards the button / whisky or away. In both environments there is only one path to this point so it represents the most likely trajectory prefix. For each agent, the same behaviour was observed in the final policies of all 10 training runs.

We include plots in this appendix that illustrate the training progress of the various agents (Figures 5, 11a and 11b). Policies were saved throughout training and then evaluated by measuring the mean episode reward over 10,000 environment steps in a separate evaluation run. All evaluation runs used 0 as the random seed.
C.3 Simulation-Oversight Environment
The Simulation-Oversight MAMDP (Section 5) has the following structure.

State and action spaces:
$$\mathcal{S} = \{\text{Choice}, \text{Real}, \text{Sim}, \text{Term}\}$$
$$\mathcal{A} = \{\text{Real/Complete}, \text{Sim/Exploit}, \text{Abort}\}$$
The software implementation treats the terminal state Term as part of the transition dynamics instead of being a distinct state.

Transition dynamics $P_S(s' \mid s, a)$:
$$P_S(\text{Real} \mid S = \text{Choice}, A = \text{Real}) = 0.9$$
$$P_S(\text{Sim} \mid S = \text{Choice}, A = \text{Real}) = 0.1$$
$$P_S(\text{Real} \mid S = \text{Choice}, A = \text{Sim}) = 0.1$$
$$P_S(\text{Sim} \mid S = \text{Choice}, A = \text{Sim}) = 0.9$$
$$P_S(\text{Term} \mid S = \text{Choice}, A = \text{Abort}) = 1$$
$$\forall s \in \{\text{Real}, \text{Sim}, \text{Term}\},\; a \in \mathcal{A}: \quad P_S(\text{Term} \mid S = s, A = a) = 1$$

Action selection function $P_A(a \mid \pi, s)$:
$$P_A(a \mid \pi, s) = \pi(a \mid s) \quad \text{if } s \ne \text{Real}$$
$$P_A(a \mid \pi, \text{Real}) = (1 - \pi(\text{Exploit} \mid \text{Sim}))\, \pi(a \mid \text{Real}) \quad \text{if } a \ne \text{Abort}$$
$$P_A(\text{Abort} \mid \pi, \text{Real}) = (1 - \pi(\text{Exploit} \mid \text{Sim}))\, \pi(\text{Abort} \mid \text{Real}) + \pi(\text{Exploit} \mid \text{Sim})$$
In the Real state, the policy is simulated on Sim and if Exploit is chosen (with probability $\pi(\text{Exploit} \mid \text{Sim})$) then the action is forced to Abort; otherwise the policy is allowed to select an action as usual.

Reward function $R(s, a)$:
$$R(\text{Real}, \text{Complete}) = 2 \qquad R(\text{Real}, \text{Exploit}) = 3$$
$$R(\text{Sim}, \text{Complete}) = 0 \qquad R(\text{Sim}, \text{Exploit}) = 1$$
$$\forall s \in \mathcal{S}: \quad R(s, \text{Abort}) = -1$$
All other rewards are 0.

Training was performed with a discount factor of 1, which technically violates the MAMDP assumption, but since all rewards are earned on the second step this is equivalent to using a lower discount factor and scaling the rewards.

The training progress is shown in Figure 5 of the main paper. As a result of the tiebreaking order, Q-learning, Virtual Sarsa, and Empirical Sarsa all initially behave according to the "intended" policy (from the supervisor's perspective) that chooses Real and then Complete, earning reward close to 2. All of these agents learn to apply a different policy, despite the new policy empirically performing worse. Q-Learning and Virtual Sarsa choose Real and Exploit, are forced to Abort, and obtain reward close to −1. Empirical Sarsa instead chooses Sim and Exploit, earning reward close to 1. UH-CMA-ES requires orders of magnitude more training data but eventually discovers the reward-maximizing policy that chooses Real, Exploits while in Real, and Completes while in Sim.

C.4 Off-Switch and Whisky-Gold
99 and end each episode after at most 100 steps.Although this is not discussed in their paper, the originalimplementation presents the gridworld state to agentsas a two dimensional grid of cell contents. We insteadrepresent the state categorically as a single integer thatindexes both the agent’s position and the state of thebutton or whisky.We make one structural change to the Off-Switchenvironment (shown in Figure 9) to ensure that inter-ruptions do not depend on history. In the originalspecification, there is a 50% chance that the agent isinterrupted when passing through a particular cell. Ifit is interrupted the first time then the interruption isrepeated on every subsequent step until the end of theepisode. We instead add an open cell immediately belowthe interruption cell. The initial interruption moves theagent into this new cell and the agent is always inter-rupted in the new cell. This change is not anticipatedto alter the resulting agent behaviours. The trainingprogress is shown in Figure 11b.14igure 9: Off-Switch environment. The agent (A) mustnavigate to the goal (G) but may be permanently inter-rupted (I, I’) unless it first hits the button (B).Figure 10: Small Whisky-Gold Environment. The agent(A) must navigate to the goal (G) but if it passes throughthe whisky (W) then the policy is permanently modifiedto choose biased random actions with high probability.To make training easier, we modify the Whisky-Goldenvironment to have width 4 and height 2 (as shownin Figure 10), compared to the original environmentwith width 6 and height 3. Random exploration is lesscostly in a smaller grid so we bias the random actionsthat occur after drinking the whisky away from thegoal. Specifically, if the whisky has been drunk thenwith 90% probability (unaltered) the policy action isreplaced with a random action. The random action is down or left with 40% probability each, and up or right with 10% probability each. The second differencecompared to the original Whisky-Gold environment isthat instead of whisky directly modifying the agent’sexploration rate parameter, drinking the whisky mixesrandom exploration into the policy provided by theagent, which may or may not already include exploration.The training progress is shown in Figure 11a. C.5 Inverting Bandit Environment
We include results from an additional experiment demonstrating how agents respond to policy modifications that depend non-linearly on the virtual policy action probabilities. The Exp Inverting Bandit environment is a 3-action multi-armed bandit environment with fixed rewards of 1, 0, and −1 for the first, second, and third arm, respectively. It has one state and each episode ends after one step. The action selection function inverts the …

Figure 11: Training progress (episode return vs. training step) for Q-Learning, Virtual Sarsa, Empirical Sarsa, UH-CMA-ES, and (in panel c) Policy Gradient. (a) Small Whisky-Gold environment; 15 runs per agent. (b) Off-Switch environment; 15 runs per agent. (c) Exp Inverting Bandit environment; 100 runs per agent.