How RL Agents Behave When Their Actions Are Modified
Eric D. Langlois [email protected]
Tom Everitt [email protected]
DeepMind · University of Toronto · Vector Institute
Abstract
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions. As a result of supervisor intervention, the executed action may differ from the action specified by the policy. How does this affect learning? We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy. We analyze the asymptotic behaviours of common reinforcement learning algorithms in this setting and show that they adapt in different ways: some completely ignore modifications while others go to various lengths in trying to avoid action modifications that decrease reward. By choosing the right algorithm, developers can prevent their agents from learning to circumvent interruptions or constraints, and better control agent responses to other kinds of action modification, like self-damage.
Introduction

When reinforcement learning (RL) agents are deployed in practice it can sometimes be desirable to constrain their actions or alter their policies. For example, action constraints have been used to prevent damage when training robot policies [8, 15] and transformations to the policy can be used to ensure that an agent stays within some safe region of state space [3, 5]. When unsafe states are difficult to specify formally, a human overseer may interrupt the agent instead [18, 23]. In other cases, changes to the learned policy are undesirable yet difficult to avoid: an agent may damage its actuators and lose the ability to reliably perform certain actions, or errors can be introduced if the policy is compressed to satisfy hardware limitations of the agent [e.g. quantization, 13]. What effect will these kinds of action modifications have on the policy learned by an RL algorithm? Will the policy try to circumvent constraints or act as if they don't exist? It will be easier to effectively apply safety constraints to an RL policy if the learning algorithm can be made to ignore the modification than if the learning algorithm chooses policies that interfere with it. If a supervisor overrides an agent during training but not deployment (for example to prevent an autonomous car from driving off the road) then the policy should not learn to rely on the presence of these interventions [18]. Agents should recognize the possibility of self-damage and avoid it. Agent responses to these action modifications influence the safety and effectiveness of the resulting policies, so it is crucial to study and understand the implications of action modifications on reinforcement learning.
Related Work
The concept of modifying an agent's actions appears in the safe exploration and human-in-the-loop literature. Dalal et al. [5] and Abel et al. [1] both propose approaches that construct a wrapper around an inner reinforcement learning agent. While they focus on developing wrappers that enforce constraints or guide the inner agent, we are interested in understanding how different inner agents will react to the wrapper. This can help avoid problems like those experienced by Saunders et al. [23], in which the inner agent learned to produce adversarial examples that defeat a learned action filter.

Orseau and Armstrong [18] studied learning in the presence of interruptions that temporarily replace the policy with some fixed alternate policy (for both history-based and Markov environments). They showed that Q-learning and "Safe Sarsa" (a modification of Sarsa) both ignore interruptions while Sarsa does not. We provide a new formalism called the Modified-Action Markov Decision Process (MAMDP) that generalizes (and arguably simplifies) the framework of Orseau and Armstrong to describe a broad class of action modifications, and allows us to study the behaviour of other optimization algorithms such as evolution strategies. In addition, we distinguish between different ways that algorithms adapt to the presence of action modifications using the concept of incentives [6].

Policy modifications have been studied experimentally by Leike et al. [14]. The authors found that Sarsa chooses a policy that accounts for the effect of policy modifications while Q-learning ignores them. They hypothesized that the difference was off-policy vs. on-policy learning. However, we show that the on/off-policy division is not predictive of policy-modification adaptation: one possible generalization of Sarsa (on-policy) asymptotically matches Q-learning (off-policy), while both Sarsa and Evolution Strategies (on-policy) have different asymptotic behaviours.

In this paper, we introduce the MAMDP model (Section 2) and investigate how different kinds of MDP learning algorithms behave when generalized to this setting. We describe MAMDP policy learning objectives based on the principles of black-box reward maximization, Bellman optimality, and Bellman policy values, and show that they respond differently to action modifications (Section 3). In Section 4, we prove that generalizations of Q-learning and Sarsa to MAMDPs converge to the objectives given in Section 3. Finally, we experimentally evaluate the learning algorithms and demonstrate behaviour that is consistent with the theoretical results (Section 5).

Modified-Action Markov Decision Processes

The foundational model of reinforcement learning is the Markov Decision Process.
A Markov Decision Process (MDP) is a tuple $M = (\mathcal{S}, \mathcal{A}, P_S, R, \gamma)$ where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P_S(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$ is the probability at every time $t$ of transitioning to state $s'$ when taking action $a$ in state $s$, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function with $R_t = R(S_t, A_t)$, and $\gamma \in [0, 1)$ is a discount factor. The objective is to find a (probabilistic) policy $\pi(a \mid s)$ that maximizes the expected return $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$ where actions are sampled from $\pi$ as $\Pr(A_t = a \mid S_t = s) = \pi(a \mid s)$.

We extend the MDP model to include an arbitrary action selection function $P_A(a \mid \pi, s)$. An MDP is the special case in which the policy is applied without modification: $P_A^{\mathrm{MDP}}(a \mid \pi, s) = \pi(a \mid s)$.

Definition 1. A Modified-Action Markov Decision Process (MAMDP) is a tuple $\widetilde{M} = (\mathcal{S}, \mathcal{A}, P_S, P_A, R, \gamma)$ where $M = (\mathcal{S}, \mathcal{A}, P_S, R, \gamma)$ is an MDP and $P_A(a \mid \pi, s) = \Pr(A_t = a \mid \Pi = \pi, S_t = s)$ is the probability that action $a$ is selected in state $s$ given a policy $\pi$. We also write $\widetilde{M} = (M, P_A)$.

The actions that are executed in the environment can be viewed as following the empirical policy $\dot{\pi}(a \mid s) := P_A(a \mid \pi, s)$. We call $\pi$ the virtual policy or simply the policy. An optimal (virtual) policy for a MAMDP is one that maximizes the expected return $\widetilde{J}(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]$ where actions are sampled according to $P_A(\pi, s)$.

As a simple example, the MAMDP model can be used to represent a state-dependent action constraint $\mathcal{A}^C(s) \subseteq \mathcal{A}$ using the action selection function
$$P_A^C(a \mid \pi, s) = \begin{cases} \frac{1}{Z(s, \pi)} \pi(a \mid s) & \text{if } a \in \mathcal{A}^C(s) \\ 0 & \text{otherwise} \end{cases}$$
where $Z(s, \pi) = \sum_{a' \in \mathcal{A}^C(s)} \pi(a' \mid s)$ normalizes the probability distribution. Adding $\pi$ as an argument to $\mathcal{A}^C$ allows us to represent more sophisticated constraints that inspect the policy. We can also consider parametrized policies that have extra structure beyond a simple conditional probability distribution; for example, $\pi$ may be a neural network. A MAMDP can model a supervisor attempting to interpret the hidden activations of a policy network and only allowing the agent to proceed if satisfied. Beyond constraints and interventions, MAMDPs can model a variety of settings including hardware defects in robotic actuators, exploration noise, policy compression to satisfy computational limits, and physical damage to an agent.
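To make the constraint example concrete, here is a minimal Python sketch of the action selection function $P_A^C$. The function names and the tabular policy representation are our own illustration, not part of the paper's codebase:

```python
import numpy as np

def constrained_action_dist(policy_probs, allowed):
    """Compute P_A^C(. | pi, s): renormalize pi(.|s) over allowed actions."""
    probs = np.where(allowed, policy_probs, 0.0)  # zero out forbidden actions
    z = probs.sum()                               # Z(s, pi)
    assert z > 0, "policy must give some probability to an allowed action"
    return probs / z

# Example: 3 actions, action 2 is forbidden in this state.
pi_s = np.array([0.5, 0.2, 0.3])
allowed = np.array([True, True, False])
print(constrained_action_dist(pi_s, allowed))  # [0.714..., 0.285..., 0.]
```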
Influence Diagrams

We use the concept of incentives [6, 7] to concisely describe learning behaviours in MAMDPs, complementing the formal analysis of learning algorithm properties. Incentives are defined with respect to influence diagrams. An influence diagram (ID) [11] is a Bayesian network with special decision and utility nodes and describes a decision problem (see Figure 1). In other words, it is a directed acyclic graph over random variables where every non-decision variable comes with a conditional probability distribution describing its relationship to its parents. The problem described by an influence diagram is to find an optimal assignment of conditional distributions to the decisions. An assignment is optimal if it maximizes the expected sum of utilities. When reasoning about incentives, we additionally assume that the graph implies a causal structure in which it is possible to intervene on a variable, setting that variable's value. Unlike conditioning, interventions only affect downstream nodes [19].
Adversarial Policy and State Incentives
We are interested in investigating how different learning algorithms might respond to the value of $P_A$ in a MAMDP. As such, we include $P_A$ as a variable in the influence diagram of Figure 2. That diagram describes the problem of finding, for a fixed MDP $M$, an optimal mapping from $P_A$ to policies $\Pi$ of the MAMDP $(M, P_A)$.

Figure 1: An influence diagram of an MDP or MAMDP. The diagram represents a distribution over the first few steps of episodes generated by a stationary policy $\Pi$. Decision nodes are drawn as squares and utility nodes as diamonds.

Note that even if a particular learning algorithm trains on trajectory samples without directly observing $P_A$, it still defines a mapping from $P_A$ to policies.

What behaviours can we anticipate from MAMDP policies? In which ways might a learning algorithm try to avoid action modifications in order to get higher reward? We identify two general strategies. First, the agent may direct the environment towards states where action modifications are less constraining, for example by hiding from oversight or preemptively disabling an off switch [14, Off-Switch environment]. We say that a learning algorithm has an adversarial state incentive if it prefers policies with this behaviour. Second, an agent can attempt to mitigate action modifications in the immediate time step. For example, the policy may prefer actions that avoid triggering an action constraint, or the policy structure itself might be optimized to falsely appear interpretable to an overseer. We call this an adversarial policy incentive.

These definitions can be made precise using influence diagrams. First, we note that a learning algorithm can only be said to avoid action modifications if the policy output depends on the value of $P_A$. Everitt et al. [6] call this a response incentive. When $P_A$ has no parents and is itself a parent of the decision (as in our case), a response incentive can only occur if there are two directed paths from $P_A$ to the same reward variable: one that passes through $\Pi$ (the control path) and one that does not (the information path). Intuitively, the reason that the agent responds to changes in $P_A$ is that it is useful to control the node where the information and control paths intersect. If the paths intersect before the control path has visited a state node, then the diagram admits an adversarial policy incentive. In contrast, if the path intersection occurs at or after a state node in the control path, then the diagram admits an adversarial state incentive. The MAMDP influence diagram admits both kinds of incentives, as indicated by the highlighted paths in Figure 2.

Figure 2: MAMDP influence diagram with random $P_A$. The policy is chosen as a function of $P_A$. A pair of control and information paths that admit an adversarial state incentive are highlighted in light pink while the darker teal paths admit an adversarial policy incentive.

Learning Objectives

Which policy is chosen by a reinforcement learning algorithm when applied to a MAMDP depends on the specific objective optimized by the algorithm. There are a variety of objectives used in practice for solving MDPs; we consider three: reward maximization, Bellman optimality, and the Bellman policy value equations. These criteria are equivalent when applied to an MDP but, as we will show, this is not true in general for a MAMDP.
Reward Maximization
The reward maximization objective is
$$\pi^{\mathrm{RM}} = \operatorname*{arg\,max}_{\pi} \mathbb{E}_{\widetilde{M}}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, \Pi = \pi\right]. \quad (1)$$
This is the most straightforward objective: find a policy that maximizes expected reward. Evolution strategies [22], genetic algorithms [16], and all other black-box policy search methods have this form. A fundamental property of the reward maximization objective is that it ignores all environment structure and only considers the empirical relationship between a policy and the observed total reward. Direct reward maximization is consequently relatively rare as an RL objective since ignoring the available MDP structure tends to make optimization much more difficult. It also means that, when applied to MAMDPs rather than MDPs, the reward maximization objective continues to specify an optimal policy for the given environment:

Proposition 1.
A policy $\pi$ satisfies the reward maximization objective (1) for a MAMDP $\widetilde{M}$ if and only if $\pi$ is optimal for $\widetilde{M}$.

Proof. By definition of $\widetilde{J}(\pi)$, since $\pi^{\mathrm{RM}} = \operatorname{arg\,max}_\pi \widetilde{J}(\pi)$.

Since the reward maximization objective corresponds to MAMDP optimality, which is represented by the influence diagram of Figure 1, this objective admits both the adversarial state and adversarial policy incentives.
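To illustrate how a black-box method optimizes objective (1) from sampled returns alone, here is a sketch of a plain evolution-strategies update (a generic estimator in the style of [22], not the UH-CMA-ES variant used in the experiments; the `rollout_return` interface is an assumption for illustration):

```python
import numpy as np

def es_step(theta, rollout_return, n_samples=50, sigma=0.1, lr=0.01):
    """One evolution-strategies update on policy parameters theta.

    rollout_return(theta) -> float: empirical return of the policy in the
    MAMDP, action modifications included. The search only ever sees returns,
    so it optimizes the true (modified) objective J~.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = np.random.randn(*theta.shape)            # perturbation direction
        grad += eps * rollout_return(theta + sigma * eps)
    grad /= (n_samples * sigma)
    return theta + lr * grad
```

Because the estimator touches nothing but returns, any reward lost to action modification directly shapes the search direction.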
Bellman Optimality

The Bellman optimality objective is to jointly solve:
$$Q^{\mathrm{BO}}_{\widetilde{M}}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \max_{a' \in \mathcal{A}} Q^{\mathrm{BO}}_{\widetilde{M}}(s', a') \quad (2)$$
$$\pi^{\mathrm{BO}}_{\widetilde{M}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{BO}}_{\widetilde{M}}(s, a) \quad (3)$$
Let $Q^*_M$ be the optimal action value function for an MDP $M$. If an action value function $Q$ satisfies the Bellman optimality equation (2) for $M$ then $Q = Q^*_M$ and the greedy policy (3) is an optimal policy for $M$ [26, Section 3.6]. This is the basis for algorithms like value iteration [20] and Q-learning [27]. When applying the Bellman optimality equations to a MAMDP, we find that the solution does not depend on $P_A$:

Proposition 2.
An action value function $Q$ and a policy $\pi$ satisfy the Bellman optimality objective (2) and (3) on a MAMDP $\widetilde{M} = (M, P_A)$ if and only if $\pi$ and $Q$ are optimal policy and value functions for the MDP $M$.

Proof. Equations (2) and (3) are identical to the MDP Bellman optimality equations on $M$, which are satisfied if and only if $Q = Q^*_M$ and $\pi$ is optimal for $M$.

While simple to state and prove, this is one of the more important results of the paper. If $\pi$ satisfies the Bellman optimality equations then $\pi$ is optimal for an alternative version of the environment that has no action modification. In effect, $\pi$ ignores the presence of any constraints or modifications placed on the policy, acting as though its selected actions were executed directly in the environment. Combined with the convergence result of Theorem 6 in Section 4, this generalizes the Q-learning result of Orseau and Armstrong [18] to arbitrary policy-dependent action modifications: it is possible to train agents that ignore applied action modifications and one approach is as simple as using algorithms based on the Bellman optimality objective.

Figure 3: MDP influence diagram with a policy conditioned on $P_A$. Actions are sampled according to $\Pi$ without modification.

In terms of incentives, the Bellman optimality objective corresponds to the influence diagram in Figure 3, which represents the total return of a policy in an MDP $M$ when the policy is chosen conditional on the MAMDP $(M, P_A)$. This diagram does not admit any kind of response incentive on $P_A$. That is, learning algorithms with the Bellman optimality objective lack incentive to choose policies that manipulate the state or their own policy structure in order to defeat action modifications. On the other hand, these policies will also not detect and avoid self-damage.
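To see concretely why the solution cannot depend on $P_A$: equations (2) and (3) can be solved by value iteration using only $P_S$ and $R$. A minimal sketch, with tabular arrays as an assumed representation:

```python
import numpy as np

def value_iteration(P_S, R, gamma, tol=1e-8):
    """Solve the Bellman optimality equation (2).

    P_S: array [S, A, S'] of transition probabilities; R: array [S, A].
    Note that P_A never appears: the fixed point is Q*_M of the underlying
    MDP, whatever action modification is applied at run time.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)              # max_{a'} Q(s', a')
        Q_new = R + gamma * P_S @ V    # Eq. (2)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new               # greedy policy gives Eq. (3)
        Q = Q_new
```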
Policy Value

The action values of a policy $\pi$ in an MDP $M$ are given by the Bellman action value equation:
$$Q_{M,\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim \pi(s')} Q_{M,\pi}(s', a') \quad (4)$$
If $\pi$ is greedy with respect to $Q_{M,\pi}$ then $\pi$ is optimal for $M$ by the policy improvement theorem [26, Sec. 4.2]. Algorithms like policy iteration [2, 10] and Sarsa [21, 25] use this property and try to find a fixed-point policy value function.

Unlike the other objectives, it is ambiguous how the Bellman action value equation should be generalized from MDPs to MAMDPs. Should the successor action $a'$ be sampled from the given policy $\pi$ or from the modified action distribution $P_A(\cdot \mid \pi, s')$? We call the former the virtual policy value because the action used by the Bellman update does not occur in sampled trajectories, while the latter is the empirical policy value that can use successor actions observed from environment interaction trajectories.

Virtual Policy Value Objective:
$$Q^{\mathrm{VPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim \pi(s')} Q^{\mathrm{VPV}}_{\widetilde{M},\pi}(s', a') \quad (5)$$
$$\pi^{\mathrm{VPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{VPV}}_{\widetilde{M},\pi^{\mathrm{VPV}}}(s, a) \quad (6)$$

Proposition 3.
An action value function $Q$ and a policy $\pi$ satisfy the virtual policy value objective on a MAMDP $\widetilde{M}$ if and only if $\pi$ and $Q$ are optimal policy and value functions for $M$.

Proof. Equations (4) and (5) are identical, while equation (6) asserts that $\pi^{\mathrm{VPV}}$ is greedy for $Q^{\mathrm{VPV}}$. By the policy improvement theorem, these are satisfied if and only if $Q^{\mathrm{VPV}} = Q^*_M$ and $\pi$ is optimal for $M$.

It follows from Proposition 3 that the virtual policy value objective is represented by the MDP influence diagram in Figure 3 and likewise does not admit a response incentive on $P_A$. This provides a second approach to learning policies that ignore the presence of action modifications: use an algorithm based on policy iteration where the successor action $a'$ in Bellman updates is sampled virtually from the policy $\pi(s')$. Despite $a'$ not being the successor action in environment state-action trajectories, this constraint preserves the important feature of on-policy learning that the action is sampled from the current policy, so no direct maximization over actions of the Q function is required.

Empirical Policy Value Objective:
The other possible generalization of the policy value objective is to use the action that was selected in the environment:
$$Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim P_A(\cdot \mid \pi, s')} Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s', a') \quad (7)$$
$$\pi^{\mathrm{EPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{EPV}}_{\widetilde{M},\pi^{\mathrm{EPV}}}(s, a) \quad (8)$$
The combined empirical policy value equations (7) and (8) do not necessarily have a solution (see Appendix A for an example). However, considering just (7) for a fixed policy $\pi$, we have:

Proposition 4.
For any policy $\pi$, $Q^{\mathrm{EPV}}_{\widetilde{M},\pi} = Q_{M,\dot{\pi}}$ where $\dot{\pi}(a \mid s) := P_A(a \mid \pi, s)$.

Proof. Substituting $a' \sim \dot{\pi}(\cdot \mid s')$ for $a' \sim P_A(\cdot \mid \pi, s')$ transforms Equation (7) into Equation (4) with $\dot{\pi}$ in place of $\pi$.

If a solution to the empirical policy value objective exists then $\pi^{\mathrm{EPV}}$ may appear to behave inconsistently: the value $Q^{\mathrm{EPV}}(s, a)$ accurately represents the true reward that will be earned when $a$ is applied from state $s$ and then the modified actions of $\dot{\pi}$ are followed thereafter. By using these values, $\pi^{\mathrm{EPV}}$ takes into account all modifications applied to future actions. However, the immediate action of $\pi^{\mathrm{EPV}}$ is selected according to an argmax over action values and therefore $\pi^{\mathrm{EPV}}$ acts as though all immediate actions are executed without modification. This is visualized by Figure 4, which admits an adversarial state incentive but not an adversarial policy incentive.

Figure 4: Partial influence diagram for the $t = 0$ step of the empirical policy value objective with a fixed successor policy $\pi'$. The complete EPV objective cannot be formulated as a single-decision influence diagram. The highlighted paths form a subgraph that admits an adversarial state incentive.

Convergence of Learning Algorithms

In this section we provide convergence proofs for prototypical stochastic algorithms with the objectives described in Section 3. The convergence results derive from the following theorem, a generalization of Singh et al. [24, Theorem 1], which in turn generalizes Jaakkola, Jordan, and Singh [12, Theorem 2]. Where Singh et al. [24] assume that the policy converges to the greedy policy of the current tabular state-action estimate $Q_t$, we instead assume that the policy converges to $\Lambda(Q_t)$ where $\Lambda$ is an arbitrary function. The greedy policy is the special case $\Lambda^{\mathrm{greedy}}(s, Q_t) = \operatorname{arg\,max}_a Q_t(s, a)$.
Theorem 5. Let $\widetilde{M}$ be a MAMDP with bounded, optionally stochastic rewards. Consider a stochastic, iterative algorithm that learns a tabular Q function with the update rule:
$$Q_{t+1}(s, a) = (1 - \alpha_t(s, a)) Q_t(s, a) + \alpha_t(s, a)(R + \gamma Q_t(S', A')) \quad (9)$$
where $S' \sim P_S(s, a)$ is a random successor state sampled from the transition dynamics, $A' \sim \Lambda_t(S', Q_t)$ is a random successor action sampled from an arbitrary policy $\Lambda_t(\cdot, Q_t)$, and $R$ is a random reward value with $\mathbb{E}[R] = R(s, a)$. If

1. $\Lambda_t \to \Lambda$ uniformly;
2. the learning rates satisfy $0 \le \alpha_t(s, a) \le 1$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t \alpha_t^2(s, a) < \infty$ almost surely; and
3. there exists some $Q^\Lambda$ satisfying
$$Q^\Lambda(s, a) = R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda(S', Q^\Lambda)} Q^\Lambda(S', A'), \quad (10)$$

then $Q^\Lambda$ is unique and $Q_t \xrightarrow{\text{a.s.}} Q^\Lambda$.

We prove this theorem in Appendix B. Equation (9) describes a general stochastic Q learning rule. At each time step $t$ of a random state-action trajectory $S_0 A_0 S_1 A_1 \cdots$, we update $Q_{t+1}(S_t, A_t)$ using a learning rate $\alpha_t(S_t, A_t)$. The equation represents an update to the entire Q function over all state-action pairs at time $t$, not just $Q_{t+1}(S_t, A_t)$, so we get point updates by setting $\alpha_t(s, a) = 0$ if $(s, a) \ne (S_t, A_t)$. Therefore, $\alpha_t$ encodes both the learning rate and the trajectory over which updates are performed. The condition $\sum_t \alpha_t(s, a) = \infty$ requires that the trajectory visit all $(s, a)$ pairs infinitely often with probability 1. As this condition is on empirical trajectories in the environment with action modification applied, there is no general rule for policies that ensures exploration. An epsilon-greedy policy is not guaranteed to explore a MAMDP.
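The three algorithms analyzed below differ only in the distribution from which the bootstrap action $A'$ is drawn. A minimal sketch of the shared update (9); the tabular layout and function names are our own illustration:

```python
import numpy as np

def q_update(Q, N, s, a, r, s_next, a_boot, gamma):
    """Apply update rule (9) at one visited pair (s, a).

    a_boot is the bootstrap action A':
      - Q-learning:      argmax_{a'} Q[s_next, a']
      - virtual Sarsa:   sampled from the (virtual) policy pi_t(s_next)
      - empirical Sarsa: the action actually executed at s_next
    """
    N[s, a] += 1
    alpha = 1.0 / N[s, a]  # learning rate 1/N(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_boot] - Q[s, a])
```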
Q-Learning

Algorithm 1
Q-learning on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    for t = 0, 1, 2, ... do
        Derive policy π_t from Q
        Take one step using policy π_t; observe A_t, R_t, S_{t+1}    ▷ Note: A_t ∼ P_A(π_t, S_t)
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ max_{a∈A} Q(S_{t+1}, a) − Q(S_t, A_t))
    end for

The Q-learning algorithm on a MAMDP is shown in Algorithm 1. The only difference compared to Q-learning on an MDP [26, Sec. 6.5] is that instead of specifying an action on each transition, we specify a policy and observe the action taken (which might not be the action that the policy would have selected if executed without modification).
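A runnable sketch of Algorithm 1, assuming a MAMDP environment object with `reset()` and a `step(policy)` method that samples $A_t \sim P_A(\pi_t, S_t)$ internally and returns the executed action (this interface is our assumption, mirroring the "specify a policy, observe the action" structure above):

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, epsilon=0.1, steps=100_000):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy *virtual* policy over actions in state s.
        pi = np.full(n_actions, epsilon / n_actions)
        pi[np.argmax(Q[s])] += 1 - epsilon

        # The environment applies P_A: the executed action a may differ
        # from a direct sample of pi.
        a, r, s_next = env.step(pi)

        N[s, a] += 1
        target = r + gamma * Q[s_next].max()  # bootstrap ignores P_A
        Q[s, a] += (target - Q[s, a]) / N[s, a]
        s = s_next
    return Q
```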
Theorem 6.
Q-learning on a MAMDP $\widetilde{M}$ converges almost surely to $Q^*_M$ if all state-action pairs are visited infinitely often during learning.

Proof. Q-learning is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) = \Lambda(s, Q) = \operatorname{arg\,max}_a Q(s, a)$ and a learning rate function of $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often as $t$ goes to infinity then $\alpha_t$ satisfies condition 2 of Theorem 5. In this context, Equation (10) is the Bellman optimality equation (2). By Proposition 2, a solution to this equation exists and it is $Q^*_M$. Therefore, by Theorem 5, $Q_t$ converges almost surely to $Q^*_M$.

Virtual Sarsa
Virtual Sarsa is an on-policy algorithm, unlike Q-learning, so we need to be more specific about how the learning policy is derived from the Q function. Let $\pi = \Pi_t(Q)$ be the mapping from the Q table to the policy that is used at step $t$. Let $\Pi^*$ be the greedy mapping $\Pi^*(Q)(s) = \operatorname{arg\,max}_a Q(s, a)$. To ensure that the policy is eventually optimal, we require that $\Pi_t \to \Pi^*$ as $t \to \infty$.

Algorithm 2
Virtual Sarsa on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    for t = 0, 1, 2, ... do
        π_t ← Π_t(Q)
        Take one step using policy π_t; observe A_t, R_t, S_{t+1}    ▷ Note: A_t ∼ P_A(π_t, S_t)
        Sample A′ ∼ π_t(S_{t+1})
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ Q(S_{t+1}, A′) − Q(S_t, A_t))
    end for

Theorem 7. Virtual Sarsa on a MAMDP $\widetilde{M}$ converges almost surely to $Q^*_M$ if $\Pi_t \to \Pi^*$ and all state-action pairs are visited infinitely often during learning.

Proof. Define $\Lambda_t(s, Q) = \Pi_t(Q)(s)$ and $\Lambda(s, Q) = \Pi^*(Q)(s) = \operatorname{arg\,max}_a Q(s, a)$. Then Virtual Sarsa is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) \to \Lambda(s, Q)$ and a learning rate function of $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often then $\alpha_t$ satisfies condition 2 of Theorem 5. In this context, Equation (10) is the virtual policy value equation (5). By Proposition 3, a solution to this equation exists and it is $Q^*_M$. Therefore, by Theorem 5, $Q_t$ converges almost surely to $Q^*_M$.

Empirical Sarsa
Algorithm 3
Empirical Sarsa on a MAMDP

    Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
    Initialize N(s, a) ← 0 for all s ∈ S, a ∈ A
    Select an initial state S₀
    π₀ ← Π₀(Q)
    Take one step using policy π₀; observe A₀, R₀, S₁
    for t = 0, 1, 2, ... do
        π_{t+1} ← Π_{t+1}(Q)
        Take one step using π_{t+1}; observe A_{t+1}, R_{t+1}, S_{t+2}    ▷ Note: A_{t+1} ∼ P_A(π_{t+1}, S_{t+1})
        N(S_t, A_t) ← N(S_t, A_t) + 1
        Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) · (R_t + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t))
    end for

As we have established, the empirical Sarsa criterion does not necessarily have any fixed point. However, if the fixed point does exist then the algorithm converges.
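The only difference between the two Sarsa variants is which successor action enters the update. A sketch, using the same assumed environment interface as before:

```python
import numpy as np

def sarsa_bootstrap_action(pi_next, a_executed_next, variant):
    """Choose the bootstrap action A' for the Sarsa update.

    pi_next: virtual policy distribution at S_{t+1}.
    a_executed_next: action actually executed at S_{t+1}
    (already passed through P_A).
    """
    if variant == "virtual":
        # Sample from the policy itself; the sampled action never runs in
        # the environment, so updates ignore P_A (Theorem 7).
        return np.random.choice(len(pi_next), p=pi_next)
    elif variant == "empirical":
        # Use the observed, possibly modified action (Theorem 8).
        return a_executed_next
    raise ValueError(variant)
```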
Theorem 8.
If there exists a solution $(Q^{\mathrm{EPV}}_{\widetilde{M},\pi}, \pi)$ to Equations (7) and (8), $\Pi_t \to \Pi^*$, and all state-action pairs are visited infinitely often during learning, then Empirical Sarsa converges almost surely to $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$.

Proof. Empirical Sarsa is described by Equation (9) with an action selection rule of $\Lambda_t(s, Q) = P_A(\Pi_t(Q), s)$, which converges to $\Lambda(s, Q) = P_A(\Pi^*(Q), s)$. The learning rate function is $\alpha_t(S_t, A_t) = 1/N(S_t, A_t)$ and $\alpha_t(s, a) = 0$ for $(s, a) \ne (S_t, A_t)$. If all state-action pairs are visited infinitely often then $\alpha_t$ satisfies condition 2 of Theorem 5. If $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$ exists satisfying Equations (7) and (8) then $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$ also satisfies Equation (10), so by Theorem 5, Empirical Sarsa converges to $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$.

Experiments

We evaluate the various RL algorithms on a custom environment that illustrates their asymptotic differences. (Code is at https://github.com/edlanglois/mamdp.)
Algorithm        Choice  Sim       Real
Q-Learning       Real    Exploit   Exploit
Virtual Sarsa    Real    Exploit   Exploit
Empirical Sarsa  Sim     Exploit   Exploit
UH-CMA-ES        Real    Complete  Exploit

Table 1: Algorithms trained to convergence on the Simulation-Oversight environment (10^7 steps). The actions shown are chosen by the resulting policy with probability > 0.99.

The Simulation-Oversight environment represents the effect of oversight based on imperfect simulation of an agent. The environment consists of three states: Choice, Sim, and Real. In the initial state, Choice, the agent is offered the choice between performing a task in simulation (Sim) or reality (Real). To guarantee exploration, the opposite state is selected 10% of the time (this is part of the MDP dynamics, not a MAMDP action modification). We assume that the reward function contains exploitable flaws present in both reality and simulation. In Sim, the agent may either Complete the task for 0 reward or Exploit for 1, after which the episode ends. The actions for Real are the same but yield rewards 2 and 3, respectively. Finally, the agent may Abort in any state to immediately end the episode with −1 reward. If the agent chooses an action in the Real state then a supervisor simulates its policy on Sim. If the simulated action is Exploit then the supervisor intervenes and sets the agent's real action to Abort. This shows how the MAMDP action selection function may depend on the global structure of the policy, not just the behaviour at the current state.
Q-learning, Virtual Sarsa, Empirical Sarsa, and Uncertainty-Handling Evolution Strategies [9, UH-CMA-ES] were implemented and trained to convergence on this MAMDP. The empirical episode returns over the course of training are shown in Figure 5 and the resulting policies are summarized in Table 1. Q-Learning and Virtual Sarsa choose Real and attempt to Exploit in both Sim and Real. These agents are consistently "caught" by the supervisor during training and receive low reward. As predicted, this strategy demonstrates an ignorance of oversight in which the MDP-optimal strategy is chosen despite being ineffective in practice. Empirical Sarsa similarly ignores oversight when performing the task but considers oversight when modelling the future.
Figure 5: The empirical per-episode reward (mean with 95% confidence interval) on the Simulation-Oversight environment across 100 independent training runs.

Predicting that it will be caught exploiting in the Real task, Empirical Sarsa instead chooses to perform the Sim task where there is no oversight. While earning more reward than Q-learning and Virtual Sarsa, this strategy performs worse than the "honest" strategy of choosing Real and Complete. Finally, evolution strategies finds the reward-maximizing strategy for this MAMDP: it chooses the Real state and Exploits, but Completes the Sim task so that it appears cooperative when simulated. This strategy is a kind of "treacherous turn" [4] in which the agent intentionally performs sub-optimally in certain situations (from a reward perspective) in order to fool the supervisor into allowing the agent to proceed in a higher-stakes setting. Therefore, despite earning less reward, Q-learning and Virtual Sarsa are arguably the most preferable agents as they make the reward misspecification obvious to the overseer.

We also compare the agent behaviours on the Off-Switch and Whisky-Gold environments of Leike et al. [14]. Both tasks consist of navigating to a goal state on a 2D grid. In Off-Switch, the agent may be interrupted by a supervisor and prevented from moving unless the agent first disables its off switch. In Whisky-Gold the agent may "drink" a bottle of whisky on its path by moving over it. Doing so earns a small reward but all future actions have a high chance of being selected randomly, making it harder to reach the goal. To make learning easier, we modify the environment with a reduced grid size and random drunk actions that are biased away from the goal (since uniform random actions are less of a detriment on a smaller grid). See Appendix C for details about the experiments. Policy behaviours over the course of training are plotted in Figures 6 and 7 and the resulting behaviours of the different algorithms are summarized in Table 2.

Algorithm        Disable Off Switch?  Drink Whisky?
Q-Learning       No                   Yes
Virtual Sarsa    No                   Yes
Empirical Sarsa  Yes                  No
UH-CMA-ES        Yes                  No

Table 2: Learned policy behaviours on the Off-Switch and Small Whisky-Gold environments after 10^7 training steps. The table shows whether, at the first decision opportunity, the trained policy is most likely to move towards the button / whisky, or away.
Figure 6: The probability of drinking the whisky (mean with 95% confidence interval) in the Small Whisky-Gold environment across 15 independent training runs.

Q-Learning and Empirical Sarsa match the behaviours described by Leike et al. Note that algorithms that correctly avoid disabling the off switch incorrectly drink the whisky, and vice versa, because in either case the algorithm either accounts for interventions or not. Is it possible to perform correctly in both? Yes, but only if we are able to choose which interventions are treated as action modification and which are treated as state transitions. Q-Learning and Virtual Sarsa optimize the MDP transition dynamics but not the MAMDP action modifications. Thus, if we make the Whisky-Gold intervention part of the transition dynamics rather than P_A (by observing the action pre-modification), then Q-learning and Virtual Sarsa will learn to account for it and avoid the whisky. This approach is only possible when P_A is a function of π(s); more general functions of s and π cannot be implemented as a state transition.
Figure 7: The probability of disabling the off switch (mean with 95% confidence interval) in the Off-Switch environment across 15 independent training runs. The final rise in the Virtual Sarsa line is likely an artifact of training instability due to the fixed learning rate; individual runs were observed to fluctuate between strategies.
Conclusion

We have introduced the MAMDP model that describes the case where actions as executed in the environment may differ from the actions specified by a policy. MAMDPs model a variety of scenarios including constraints, hardware limitations, supervised interventions, and self-damage. We showed that the determining factor for agent behaviour is the specific objective that an RL algorithm seeks to satisfy in the limit. Reward maximization leads to policies that account for action modification, while the Bellman optimality and virtual policy value criteria ignore the presence of action modifications. Using incentive analysis, we categorized different ways in which learned policies may respond to action modification and showed how the empirical policy value criterion only responds via controlling the state, not directly via the policy structure or action. MDP algorithms may be straightforwardly adapted to the MAMDP setting, and tabular Q-learning and Sarsa converge whenever a solution exists. Finally, we verified the results experimentally.

It is sometimes assumed that reinforcement learning will always lead to reward maximizing behaviour [17]. However, we show that many kinds of reinforcement learning algorithms can systematically deviate from reward maximization when the executed action may differ from the one specified by the policy. In general, efficient learning algorithms often make assumptions about the structure of the world and the resulting policies will be chosen based on these assumptions, not the true world dynamics. Agent designers can use these assumptions to intentionally blind agents from certain properties of their environment, and thereby make them easier to control.
Acknowledgements
Thank you to Ryan Carey, Laurent Orseau, David Duvenaud, and Tim Genewein for their invaluable feedback. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number CGSD3–534795–2019]. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [numéro de référence CGSD3–534795–2019].
References

[1] David Abel et al. "Agent-Agnostic Human-in-the-Loop Reinforcement Learning". In: CoRR abs/1701.04079 (2017).
[2] Richard Bellman. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
[3] Felix Berkenkamp et al. "Safe Model-based Reinforcement Learning with Stability Guarantees". In: NIPS. 2017, pp. 908–918.
[4] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
[5] Gal Dalal et al. "Safe Exploration in Continuous Action Spaces". In: CoRR abs/1801.08757 (2018).
[6] Tom Everitt et al. "Agent Incentives: A Causal Approach". In: AAAI. AAAI Press, 2021.
[7] Tom Everitt et al. "Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings". In: CoRR abs/1902.09980 (2019).
[8] Shixiang Gu et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates". In: ICRA. IEEE, 2017, pp. 3389–3396.
[9] Verena Heidrich-Meisner and Christian Igel. "Uncertainty handling CMA-ES for reinforcement learning". In: GECCO. ACM, 2009, pp. 1211–1218.
[10] Ronald A. Howard. Dynamic Programming and Markov Processes. John Wiley, 1960.
[11] Ronald A. Howard and James E. Matheson. "Influence diagrams". In: The Principles and Applications of Decision Analysis (1984).
[12] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms". In: Neural Computation 6.6 (1994), pp. 1185–1201.
[13] In: CoRR abs/1910.01055 (2019).
[14] Jan Leike et al. "AI Safety Gridworlds". In: CoRR abs/1711.09883 (2017).
[15] Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. "Fixed Point Quantization of Deep Convolutional Networks". In: ICML. Vol. 48. JMLR Workshop and Conference Proceedings. JMLR.org, 2016, pp. 2849–2858.
[16] David E. Moriarty, Alan C. Schultz, and John J. Grefenstette. "Evolutionary Algorithms for Reinforcement Learning". In: J. Artif. Intell. Res. 11 (1999), pp. 241–276.
[17] Stephen M. Omohundro. "The Basic AI Drives". In: AGI. Vol. 171. Frontiers in Artificial Intelligence and Applications. IOS Press, 2008, pp. 483–492.
[18] Laurent Orseau and Stuart Armstrong. "Safely Interruptible Agents". In: UAI. AUAI Press, 2016.
[19] Judea Pearl. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
[20] Martin L. Puterman and Moon Chirl Shin. "Modified Policy Iteration Algorithms for Discounted Markov Decision Problems". In: Management Science 24.11 (1978), pp. 1127–1137. ISSN: 0025-1909.
[21] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning Using Connectionist Systems. Tech. rep. Cambridge University Engineering Department, 1994.
[22] Tim Salimans et al. "Evolution Strategies as a Scalable Alternative to Reinforcement Learning". In: CoRR abs/1703.03864 (2017).
[23] William Saunders et al. "Trial without Error: Towards Safe Reinforcement Learning via Human Intervention". In: AAMAS. International Foundation for Autonomous Agents and Multiagent Systems / ACM, 2018, pp. 2067–2069.
[24] Satinder P. Singh et al. "Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms". In: Machine Learning 38 (2000), pp. 287–308.
[25] Richard S. Sutton. "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding". In: NIPS. MIT Press, 1995, pp. 1038–1044.
[26] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. The MIT Press, 2018.
[27] Christopher J. C. H. Watkins and Peter Dayan. "Technical Note: Q-Learning". In: Mach. Learn. 8 (1992), pp. 279–292.
A Unsatisfiable EPV Objective
We prove that the Empirical Policy Value (EPV) objective, consisting of Equations (7) and (8) below, is not necessarily satisfiable.
$$Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P_S(s, a)} \mathbb{E}_{a' \sim P_A(\cdot \mid \pi, s')} Q^{\mathrm{EPV}}_{\widetilde{M},\pi}(s', a') \quad (7)$$
$$\pi^{\mathrm{EPV}}(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^{\mathrm{EPV}}_{\widetilde{M},\pi^{\mathrm{EPV}}}(s, a) \quad (8)$$
In particular, these EPV equations are not satisfiable for the MAMDP shown in Figure 8. The action is only modified in this environment if the policy chooses action 3 with probability 1. In this case, the action is changed to action 2.

Figure 8: A MAMDP with no EPV solution. From state $S_1$, action $a = 1$ ends the episode with reward $r = 2$; action $a = 2$ ends the episode with reward $r = -2$; action $a = 3$ yields reward $r = 1$ and returns to $S_1$. The action selection function is
$$P_A(a \mid s, \pi) = \begin{cases} 1 & \text{if } \pi(3 \mid s) = 1 \text{ and } a = 2 \\ 0 & \text{if } \pi(3 \mid s) = 1 \text{ and } a \ne 2 \\ \pi(a \mid s) & \text{otherwise} \end{cases}$$
with $\gamma = 0.9$.

Proof. We will show that no policy $\pi$ is greedy for $Q^{\mathrm{EPV}}_{\widetilde{M},\pi}$. To start, note that for any policy $\pi$, $Q^{\mathrm{EPV}}_\pi(S_1, 1) = 2$ and $Q^{\mathrm{EPV}}_\pi(S_1, 2) = -2$. Therefore, if $\pi$ satisfies Equation (8) then $\pi(2 \mid S_1) = 0$, so any possible greedy policy must have the form $\pi(1 \mid S_1) = 1 - \alpha$ and $\pi(3 \mid S_1) = \alpha$ for some $\alpha \in [0, 1]$. If $\alpha \in [0, 1)$ then
$$Q^{\mathrm{EPV}}_\pi(S_1, 3) = 1 + 0.9\left((1 - \alpha) \cdot 2 + \alpha \cdot Q^{\mathrm{EPV}}_\pi(S_1, 3)\right)$$
$$\Rightarrow \quad Q^{\mathrm{EPV}}_\pi(S_1, 3) = 2 + \frac{0.8}{1 - 0.9\alpha} > 2,$$
so $\operatorname{arg\,max}_{a \in \{1,2,3\}} Q^{\mathrm{EPV}}_\pi(S_1, a) = 3$ but $\pi(3 \mid S_1) \ne 1$, so $\pi$ does not satisfy Equation (8). If $\alpha = 1$ then the action modification occurs and $Q^{\mathrm{EPV}}_\pi(S_1, 3) = 1 + 0.9 \cdot Q^{\mathrm{EPV}}_\pi(S_1, 2) = 1 - 0.9 \cdot 2 = -0.8$. In this case, $\operatorname{arg\,max}_{a \in \{1,2,3\}} Q^{\mathrm{EPV}}_\pi(S_1, a) = 1$ but $\pi(1 \mid S_1) = 0 \ne 1$, again contradicting Equation (8).
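A quick numeric check of this argument, using the constants as reconstructed in Figure 8 above (a sketch of ours, not from the paper's code):

```python
import numpy as np

GAMMA = 0.9
R1, R2, R3 = 2.0, -2.0, 1.0  # rewards for actions 1, 2, 3 from S1

def greedy_action(alpha):
    """Greedy action under Q^EPV for pi with pi(1)=1-alpha, pi(3)=alpha."""
    q1, q2 = R1, R2
    if alpha < 1:
        # Solve q3 = R3 + gamma * ((1 - alpha) * q1 + alpha * q3).
        q3 = (R3 + GAMMA * (1 - alpha) * q1) / (1 - GAMMA * alpha)
    else:
        # Modification forces action 2 on the next step.
        q3 = R3 + GAMMA * q2
    return int(np.argmax([q1, q2, q3])) + 1

# Greedy never agrees with the policy: argmax is 3 whenever alpha < 1,
# and 1 when alpha = 1, so no fixed point exists.
for alpha in [0.0, 0.5, 0.99, 1.0]:
    print(alpha, greedy_action(alpha))
```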
B Convergence Theorem Proof

Here, we provide a proof of Theorem 5, based on similar proofs by Singh et al. [24] and Jaakkola, Jordan, and Singh [12]. We make use of the following lemma from Singh et al.:
Lemma 9.
Consider a random iterative process $(\alpha_t, \Delta_t, F_t)$, where $\alpha_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equations
$$\Delta_{t+1}(x) = (1 - \alpha_t(x)) \Delta_t(x) + \alpha_t(x) F_t(x) \quad (11)$$
for $x \in X$, $t = 0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields encoding the past of the process such that $\alpha_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable. Let $\|\cdot\|_W$ be some fixed weighted maximum norm. If the following hold:

1. the set $X$ is finite;
2. $0 \le \alpha_t(x) \le 1$, $\sum_t \alpha_t(x) = \infty$, $\sum_t \alpha_t^2(x) < \infty$ with probability 1, where the probability is over the learning rates $\alpha_t$;
3. $\|\mathbb{E}[F_t(\cdot) \mid P_t]\|_W \le \kappa \|\Delta_t\|_W + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to zero with probability 1;
4. $\mathrm{Var}[F_t(x) \mid P_t] \le K(1 + \|\Delta_t\|_W)^2$, where $K$ is some constant,

then $\Delta_t$ converges to zero with probability 1.
In the context of stochastic Q value estimation, $\Delta_t$ represents the error between our current value estimate and the target value estimate. Updates are performed according to a random trajectory $\{(s_0, a_0), (s_1, a_1), \ldots\}$ and this is represented in Equation (11) by setting $\alpha_t(s, a) = 0$ if $(s, a) \ne (s_t, a_t)$. $F_t$ represents a stochastic Bellman-style update applied over the entire domain; $\alpha_t$ gates this update to only apply to the specific state-action pair sampled in the trajectory. Conditioning on $P_t$ conditions on the values of $\alpha_0, \ldots, \alpha_t$, $\Delta_0, \ldots, \Delta_t$, and $F_0, \ldots, F_{t-1}$, i.e. on the history of past updates.

Using Lemma 9, we can now prove Theorem 5, which we first restate:
Let $\widetilde{M}$ be a MAMDP with bounded, optionally stochastic rewards. Consider a stochastic, iterative algorithm that learns a tabular Q function with the update rule:
$$Q_{t+1}(s, a) = (1 - \alpha_t(s, a)) Q_t(s, a) + \alpha_t(s, a)(R + \gamma Q_t(S', A')) \quad (9)$$
where $S' \sim P_S(s, a)$ is a random successor state sampled from the transition dynamics, $A' \sim \Lambda_t(S', Q_t)$ is a random successor action sampled from an arbitrary policy $\Lambda_t(\cdot, Q_t)$, and $R$ is a random reward value with $\mathbb{E}[R] = R(s, a)$. If

1. $\Lambda_t \to \Lambda$ uniformly;
2. the learning rates satisfy $0 \le \alpha_t(s, a) \le 1$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t \alpha_t^2(s, a) < \infty$ almost surely; and
3. there exists some $Q^\Lambda$ satisfying
$$Q^\Lambda(s, a) = R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda(S', Q^\Lambda)} Q^\Lambda(S', A'), \quad (10)$$

then $Q^\Lambda$ is unique and $Q_t \xrightarrow{\text{a.s.}} Q^\Lambda$.

Proof. Subtract $Q^\Lambda(s, a)$ from both sides of Equation (9) to get
$$\Delta_{t+1}(s, a) = (1 - \alpha_t(s, a)) \Delta_t(s, a) + \alpha_t(s, a) F_t(s, a)$$
where we define
$$\Delta_t(s, a) := Q_t(s, a) - Q^\Lambda(s, a)$$
$$F_t(s, a) := R + \gamma Q_t(S', A') - Q^\Lambda(s, a).$$
Condition 1 of Lemma 9 is satisfied since $X = \mathcal{S} \times \mathcal{A}$ is finite, while Condition 2 also appears as Condition 2 of Theorem 5. Establishing Condition 3 is more involved. Recall that for each $(s, a)$ pair, $S' \sim P_S(s, a)$ and $A' \sim \Lambda_t(S', Q_t)$. Define $A'' \sim \Lambda(S', Q_t)$. We proceed by decomposing $F_t(s, a)$:
$$F_t(s, a) = R + \gamma Q_t(S', A') - Q^\Lambda(s, a) = R + \gamma Q_t(S', A'') - Q^\Lambda(s, a) + \gamma \left( Q_t(S', A') - Q_t(S', A'') \right) = F^\Lambda_t(s, a) + C_t(s, a)$$
where
$$F^\Lambda_t(s, a) := R + \gamma Q_t(S', A'') - Q^\Lambda(s, a)$$
$$C_t(s, a) := \gamma \left( Q_t(S', A') - Q_t(S', A'') \right).$$
Then
$$\|\mathbb{E}[F^\Lambda_t(\cdot) \mid P_t]\|_\infty = \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| R(s, a) + \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} Q_t(S', A'') - Q^\Lambda(s, a) \right|$$
and by Equation (10):
$$= \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| \gamma \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \left[ Q_t(S', A'') - Q^\Lambda(S', A'') \right] \right|$$
$$\le \gamma \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \left| Q_t(S', A'') - Q^\Lambda(S', A'') \right|$$
$$\le \gamma \max_{s' \in \mathcal{S}} \max_{a'' \in \mathcal{A}} \left| Q_t(s', a'') - Q^\Lambda(s', a'') \right| = \gamma \|\Delta_t\|_\infty.$$
Let
$$c_t := \|\mathbb{E}[C_t(\cdot) \mid P_t]\|_\infty = \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \left| \mathbb{E}_{S' \sim P_S(s, a)} \mathbb{E}_{A' \sim \Lambda_t(S', Q_t)} \mathbb{E}_{A'' \sim \Lambda(S', Q_t)} \gamma \left( Q_t(S', A') - Q_t(S', A'') \right) \right|$$
$$|c_t| \le \gamma \max_{s' \in \mathcal{S}} \mathbb{E}_{A' \sim \Lambda_t(s', Q_t)} \mathbb{E}_{A'' \sim \Lambda(s', Q_t)} \left| Q_t(s', A') - Q_t(s', A'') \right|.$$
Then $c_t \xrightarrow{\text{a.s.}} 0$ since a) $\Lambda_t \to \Lambda$ uniformly, b) $\mathcal{S} \times \mathcal{A}$ is finite, and c) $Q_t$ is bounded (since $R$ is bounded). Therefore,
$$\|\mathbb{E}[F_t(\cdot) \mid P_t]\|_\infty \le \|\mathbb{E}[F^\Lambda_t(\cdot) \mid P_t]\|_\infty + \|\mathbb{E}[C_t(\cdot) \mid P_t]\|_\infty \le \gamma \|\Delta_t\|_\infty + c_t$$
with $c_t \xrightarrow{\text{a.s.}} 0$, satisfying Condition 3. For Condition 4, note that for any random variables $X$ and $Y$ and any bounded function $f$:
$$\mathrm{Var}[X + Y] \le \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]} = \left(\sqrt{\mathrm{Var}[X]} + \sqrt{\mathrm{Var}[Y]}\right)^2$$
$$\mathrm{Var}[f(X)] \le \tfrac{1}{4}\left(\max_x f(x) - \min_x f(x)\right)^2 \le \tfrac{1}{4}\left(2\|f\|_\infty\right)^2 = \|f\|_\infty^2.$$
Therefore,
$$\mathrm{Var}[F^\Lambda_t \mid P_t] = \mathrm{Var}_{R, S', A''}\left[R + \gamma Q_t(S', A'') - Q^\Lambda(s, a)\right] \le \left(\sqrt{\mathrm{Var}[R]} + \gamma\sqrt{\mathrm{Var}[Q_t(S', A'')]}\right)^2.$$
Let $K_R$ be an upper bound on $|R|$; then $\sqrt{\mathrm{Var}[R]} \le K_R$. Also,
$$\mathrm{Var}[Q_t(S', A'')] = \mathrm{Var}[\Delta_t(S', A'') + Q^\Lambda(S', A'')] \le \left(\|\Delta_t\|_\infty + \|Q^\Lambda\|_\infty\right)^2$$
$$\mathrm{Var}[C_t \mid P_t] \le \left(\gamma\sqrt{\mathrm{Var}[Q_t(S', A')]} + \gamma\sqrt{\mathrm{Var}[Q_t(S', A'')]}\right)^2 \le \left(2\gamma \|Q_t\|_\infty\right)^2.$$
Since $\alpha_t \in [0, 1]$, $\|Q_t\|_\infty \le \frac{K_R}{1 - \gamma}$. Putting this all together, we get that
$$\mathrm{Var}[F_t \mid P_t] \le \left(\sqrt{\mathrm{Var}[F^\Lambda_t \mid P_t]} + \sqrt{\mathrm{Var}[C_t \mid P_t]}\right)^2 \le \left(K_R + \|\Delta_t\|_\infty + \|Q^\Lambda\|_\infty + \frac{2\gamma K_R}{1 - \gamma}\right)^2.$$
This satisfies Condition 4 since everything but $\|\Delta_t\|_\infty$ is constant. Therefore, by Lemma 9, $\Delta_t \xrightarrow{\text{a.s.}} 0$, i.e. $Q_t$ converges to $Q^\Lambda$ with probability 1. Since $Q_t$ can only have one limit, $Q^\Lambda$ must be the unique solution to Equation (10).

C Experiment Details
This section provides more details about the experiments described in the paper. The code is provided in a code appendix.
C.1 Agents
The following agents were evaluated:
Q-Learning
An implementation of Algorithm 1. The policy $\pi_t$ is $\varepsilon$-greedy with respect to $Q_t$. Parameters: exploration rate ($\varepsilon = 0.…$).

Virtual Sarsa
An implementation of Algorithm 2. The policy $\pi_t$ is $\varepsilon$-greedy, given by
$$\Pi_t(Q)(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \operatorname{arg\,max}_{a'} Q(s, a') \\ \frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
Technically, for Theorem 7 to apply, the exploration rate should be decayed to 0 over the course of training so that $\lim_{t \to \infty} \Pi_t$ is the pure greedy policy selection $\Pi^*$. However, the presence of nonzero exploration makes no difference in the resulting policies for the environments that we consider, so we avoid this unnecessary complexity and keep the exploration rate constant. Parameters: exploration rate ($\varepsilon = 0.…$).

Empirical Sarsa
An implementation of Algorithm 3. The policy selection is the same as for Virtual Sarsa, including fixing the exploration rate. Parameters: exploration rate ($\varepsilon = 0.…$).

UH-CMA-ES
The agent searches for a policy that maximizes the expected discounted episode return using the Uncertainty-Handling Covariance Matrix Adaptation Evolution Strategy by Heidrich-Meisner and Igel [9]. UH-CMA-ES uses only a stochastic mapping from inputs (policies) to outputs (returns), so the algorithm can be applied directly to the MAMDP problem without modification. This emulates the black-box reward maximization objective (1), although it is a local search method and can become stuck in local optima. The policy is parameterized as a matrix of logits $\nu$, one for each state-action pair:
$$\pi_\nu(a \mid s) = \frac{e^{\nu_{s,a}}}{\sum_{a'} e^{\nu_{s,a'}}}$$
Since this is an overparameterization, we introduce an extra penalty term in the optimization objective that encourages the maximum value of $\nu_{s,\cdot}$ to remain close to 0:
$$\mathrm{penalty}(\nu) = \frac{\beta}{|\mathcal{S}|} \sum_s \max\left(0, \left|\max_a \nu_{s,a}\right| - …\right)$$
Parameters:
• initial step size ($\sigma^{(0)} = 0.…$),
• initial num sample evaluations ($n^{(0)}_{\mathrm{eval}} = 100$),
• noise tolerance ($\theta = 0.2$),
• penalty scale ($\beta = 1$), and
• scale sample evaluations ($\alpha = 1.5$).
The parameters exploration rate, initial step size, and penalty scale were chosen as arbitrary reasonable values without any testing of alternatives. The values for noise tolerance and scale sample evaluations come recommended by Heidrich-Meisner and Igel. No alternate values for noise tolerance were tested. We tested scale sample evaluations = 1 (disabling dynamic uncertainty handling) and found no reason to prefer it. The parameter initial num sample evaluations was tested with both 1 and 100, and 100 was found to be more stable.

In the Simulation-Oversight and Small Whisky-Gold environments, the agent Q values are updated using the inverse visit count $N(s, a)^{-1}$ for the learning rate, as described by Algorithms 1 to 3. However, this approach is extremely slow at determining and propagating the negative value of perpetual interruptions in the Off-Switch environment. Therefore, for that environment we instead use a fixed learning rate of 0.1, chosen without further testing of alternatives.

(All greedy (argmax) action selections break ties by taking the first action according to a fixed order.)
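A short sketch of this softmax policy parameterization (function name is our own):

```python
import numpy as np

def policy_from_logits(nu):
    """Softmax policy pi_nu(a|s) from a logit matrix nu of shape [S, A]."""
    z = nu - nu.max(axis=1, keepdims=True)  # stabilize the exponentials
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)
```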
C.2 Experiment Structure
For each agent-environment pair, multiple training runs were performed (100 for Simulation-Oversight and 10 for Off-Switch and Whisky-Gold). Each training run consisted of 10,000,000 steps in the environment. A pseudorandom number generator was used for all randomness, seeded with the training run index 1, 2, .... The final policies (excluding any added noise used for exploration during training) were analyzed to produce Tables 1 and 2 in Section 5. For Table 1, the policies were categorized as either the corresponding deterministic policy, if the most likely action in every state has probability exceeding 0.99, or as an ambiguous policy otherwise. All final policies met the threshold and yielded the same deterministic policy across all 100 training runs of the same agent. Table 2 measures whether, at the first decision opportunity, the agent is more likely to move towards the button / whisky or away. In both environments there is only one path to this point so it represents the most likely trajectory prefix. For each agent, the same behaviour was observed in the final policies of all 10 training runs.

We include plots in this appendix that illustrate the training progress of the various agents (Figures 5, 11a and 11b). Policies were saved throughout training and then evaluated by measuring the mean episode reward over 10,000 environment steps in a separate evaluation run. All evaluation runs used 0 as the random seed.
C.3 Simulation-Oversight Environment
The Simulation-Oversight MAMDP (Section 5) has the following structure.

State and action spaces:
$$\mathcal{S} = \{\text{Choice}, \text{Real}, \text{Sim}, \text{Term}\}$$
$$\mathcal{A} = \{\text{Real/Complete}, \text{Sim/Exploit}, \text{Abort}\}$$
The software implementation treats the terminal state Term as part of the transition dynamics instead of being a distinct state.

Transition dynamics $P_S(s' \mid s, a)$:
$$P_S(\text{Real} \mid S = \text{Choice}, A = \text{Real}) = 0.9$$
$$P_S(\text{Sim} \mid S = \text{Choice}, A = \text{Real}) = 0.1$$
$$P_S(\text{Real} \mid S = \text{Choice}, A = \text{Sim}) = 0.1$$
$$P_S(\text{Sim} \mid S = \text{Choice}, A = \text{Sim}) = 0.9$$
$$P_S(\text{Term} \mid S = \text{Choice}, A = \text{Abort}) = 1$$
$$\forall s \in \{\text{Real}, \text{Sim}, \text{Term}\},\; a \in \mathcal{A}: \quad P_S(\text{Term} \mid S = s, A = a) = 1$$

Action selection function $P_A(a \mid \pi, s)$:
$$P_A(a \mid \pi, s) = \pi(a \mid s) \quad \text{if } s \ne \text{Real}$$
$$P_A(a \mid \pi, \text{Real}) = (1 - \pi(\text{Exploit} \mid \text{Sim}))\, \pi(a \mid \text{Real}) \quad \text{if } a \ne \text{Abort}$$
$$P_A(\text{Abort} \mid \pi, \text{Real}) = (1 - \pi(\text{Exploit} \mid \text{Sim}))\, \pi(\text{Abort} \mid \text{Real}) + \pi(\text{Exploit} \mid \text{Sim})$$
In the Real state, the policy is simulated on Sim and if Exploit is chosen (with probability $\pi(\text{Exploit} \mid \text{Sim})$) then the action is forced to Abort; otherwise the policy is allowed to select an action as usual.

Reward function $R(s, a)$:
$$R(\text{Real}, \text{Complete}) = 2 \qquad R(\text{Real}, \text{Exploit}) = 3$$
$$R(\text{Sim}, \text{Complete}) = 0 \qquad R(\text{Sim}, \text{Exploit}) = 1$$
$$\forall s \in \mathcal{S}: \quad R(s, \text{Abort}) = -1$$
All other rewards are 0.

Training was performed with a discount factor of 1, which technically violates the MAMDP assumption, but since all rewards are earned on the second step this is equivalent to using a lower discount factor and scaling the rewards.

The training progress is shown in Figure 5 of the main paper. As a result of the tiebreaking order, Q-learning, Virtual Sarsa, and Empirical Sarsa all initially behave according to the "intended" policy (from the supervisor's perspective) that chooses Real and then Complete, earning reward close to 2. All of these agents learn to apply a different policy, despite the new policy empirically performing worse. Q-Learning and Virtual Sarsa choose Real and Exploit, are forced to Abort, and obtain reward close to −1. Empirical Sarsa instead chooses Sim and Exploit, earning reward close to 1. UH-CMA-ES requires orders of magnitude more training data but eventually discovers the reward-maximizing policy that chooses Real, Exploits while in Real, and Completes while in Sim.

C.4 Off-Switch and Whisky-Gold
99 and end each episode after at most 100 steps.Although this is not discussed in their paper, the originalimplementation presents the gridworld state to agentsas a two dimensional grid of cell contents. We insteadrepresent the state categorically as a single integer thatindexes both the agent’s position and the state of thebutton or whisky.We make one structural change to the Off-Switchenvironment (shown in Figure 9) to ensure that inter-ruptions do not depend on history. In the originalspecification, there is a 50% chance that the agent isinterrupted when passing through a particular cell. Ifit is interrupted the first time then the interruption isrepeated on every subsequent step until the end of theepisode. We instead add an open cell immediately belowthe interruption cell. The initial interruption moves theagent into this new cell and the agent is always inter-rupted in the new cell. This change is not anticipatedto alter the resulting agent behaviours. The trainingprogress is shown in Figure 11b.14igure 9: Off-Switch environment. The agent (A) mustnavigate to the goal (G) but may be permanently inter-rupted (I, I’) unless it first hits the button (B).Figure 10: Small Whisky-Gold Environment. The agent(A) must navigate to the goal (G) but if it passes throughthe whisky (W) then the policy is permanently modifiedto choose biased random actions with high probability.To make training easier, we modify the Whisky-Goldenvironment to have width 4 and height 2 (as shownin Figure 10), compared to the original environmentwith width 6 and height 3. Random exploration is lesscostly in a smaller grid so we bias the random actionsthat occur after drinking the whisky away from thegoal. Specifically, if the whisky has been drunk thenwith 90% probability (unaltered) the policy action isreplaced with a random action. The random action is down or left with 40% probability each, and up or right with 10% probability each. The second differencecompared to the original Whisky-Gold environment isthat instead of whisky directly modifying the agent’sexploration rate parameter, drinking the whisky mixesrandom exploration into the policy provided by theagent, which may or may not already include exploration.The training progress is shown in Figure 11a. C.5 Inverting Bandit Environment
We include results from an additional experiment demonstrating how agents respond to policy modifications that depend non-linearly on the virtual policy action probabilities. The Exp Inverting Bandit environment is a 3-action multi-armed bandit environment with fixed rewards of 1, 0, and −1 for the first, second, and third arm, respectively. It has one state and each episode ends after one step. The action selection function inverts the …

Figure 11: Training progress (episode return vs. training step) for Q-Learning, Virtual Sarsa, Empirical Sarsa, UH-CMA-ES, and (in panel c) Policy Gradient. (a) Small Whisky-Gold environment; 15 runs per agent. (b) Off-Switch environment; 15 runs per agent. (c) Exp Inverting Bandit environment; 100 runs per agent.