Avoiding Side Effects in Complex Environments
Alexander Matt Turner ∗ Neale Ratzlaff ∗ Prasad Tadepalli
Oregon State University {turneale@, ratzlafn@, tadepall@eecs.}oregonstate.edu
Abstract
Reward function specification can be difficult, even in simple environments. Realistic environments contain millions of states. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway's Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead, completes the specified task, and avoids side effects.
Reward function specification can be difficult, even when the desired behavior seems clear-cut. Rewarding progress in a race leads an agent to collect checkpoint reward instead of completing the race [15]. We want to minimize the negative side effects of misspecification: from a manufacturing robot which breaks expensive equipment, to content recommendation systems which radicalize their users, to potential future AI systems which negatively transform the world [4, 21].

Side effect avoidance poses a version of the "frame problem": each action can have many effects, and it is impractical to explicitly penalize all of the bad ones [5]. For example, a housekeeping agent should clean a dining room without radically rearranging furniture, and a manufacturing agent should assemble widgets without breaking equipment. A general, transferable solution to side effect avoidance would ease reward specification: the agent's designers could just positively specify what should be done, as opposed to negatively specifying what should not be done.

Breaking equipment is bad because it hampers future optimization of the true objective (which includes our preferences about the factory). That is, there often exists a reward function R_true which fully specifies the agent's task within its deployment context. In the factory setting, R_true might encode "assemble widgets, but don't spill the paint, break the conveyor belt, injure workers, etc." We want the agent to preserve optimal value for this true reward function. While we can accept suboptimal actions (e.g. pacing the factory floor), we cannot accept the destruction of value for the true task. By avoiding negative side effects which decrease value for the true task, the designers can correct any misspecification and eventually achieve low regret for R_true.

Despite being unable to directly specify R_true, we demonstrate a method for preserving its optimal value anyway. In Turner et al. [25]'s toy environments, preserving optimal value for many randomly generated reward functions often preserves the optimal value for R_true. In this paper, we generalize this approach to combinatorially complex environments and evaluate it in the chaotic and challenging SafeLife test suite [27]. We show the rather surprising result that by preserving optimal value for a single randomly generated reward function, AUP preserves optimal value for R_true and thereby avoids negative side effects.

∗ Equal contribution.

Prior work
AUP avoids negative side effects in small gridworld environments while preserving optimal value for randomly generated reward functions [25]. Penalizing decrease in (discounted) state reachability achieves similar results [14]. However, this approach has difficulty scaling: naively estimating all reachability functions is a task quadratic in the size of the state space.

In the supplementary material, proposition 4 shows that preserving initial state reachability [10] bounds the maximum decrease in optimal value for R_true. Unfortunately, due to irreversible dynamics, initial state reachability often cannot be preserved.

Everitt et al. [9] frame reward misspecification as the composition of a corruption function with R_true. Shah et al. [24] exploit information contained in the initial state of the environment to infer which side effects are negative; for example, if vases are present, humans must have gone out of their way to avoid them, so the agent should as well. Christiano et al. [8] infer human preference information from solicited trajectory comparisons. Hadfield-Menell et al. [13] consider the provided reward function to only suggest the designer's true preferences on the training distribution.

Robust optimization selects a trajectory which maximizes the minimum return achieved under a feasible set of reward functions [19]. However, we do not assume we can specify the feasible set. In constrained MDPs, the agent obeys constraints while maximizing the observed reward function [2, 1, 28]. Exhaustively specifying constraints is difficult.

In the multi-agent setting, empathic deep Q-learning preserves optimal value for another agent in the environment [6]. Schaul et al. [22] demonstrate a value function predictor which generalizes across both states and goals.

Safe reinforcement learning focuses on avoiding catastrophic mistakes during training [18, 11, 3, 7], while this work only considers the consequences of the learned policy.

Consider a Markov decision process (MDP) ⟨S, A, T, R, γ⟩ with finite state space S, finite action space A, transition function T : S × A → Δ(S), reward function R : S × A → ℝ, and discount factor γ. We assume the agent may take a no-op action ∅ ∈ A. We refer to V*_R(s) as the optimal value or attainable utility of reward function R at state s.

To define AUP's pseudo-reward function, the designer provides a finite reward function set R ⊂ ℝ^S, hereafter referred to as the auxiliary set. This set does not necessarily contain R_true. Each auxiliary reward function R_i ∈ R has a learned Q-function Q_i.

AUP penalizes average change in ability to optimize the auxiliary reward functions. The motivation is that by not changing optimal value for a wide range of auxiliary reward functions, the agent also does not decrease optimal value for R_true.

Definition (AUP reward function [25]). Let λ ≥ 0. Then

  R_AUP(s, a) := R(s, a) − (λ / |R|) Σ_{R_i ∈ R} | Q*_i(s, a) − Q*_i(s, ∅) |.   (1)

The regularization parameter λ controls penalty severity. In practice, the learned auxiliary Q_i is a stand-in for the optimal Q-function Q*_i.
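To make eq. (1) concrete, here is a minimal sketch of the penalty computation, assuming learned auxiliary Q-values are already available; the function and argument names are ours, not the paper's released code.

```python
import numpy as np

def aup_reward(primary_reward, aux_q_values, noop_q_values, lam):
    """Compute R_AUP(s, a) as in eq. (1).

    primary_reward: scalar R(s, a).
    aux_q_values:   Q_i(s, a) for each auxiliary reward function R_i.
    noop_q_values:  Q_i(s, no-op), in the same order.
    lam:            penalty coefficient lambda >= 0.
    """
    aux_q_values = np.asarray(aux_q_values, dtype=float)
    noop_q_values = np.asarray(noop_q_values, dtype=float)
    # Average absolute change in auxiliary attainable utility, relative to inaction.
    penalty = np.mean(np.abs(aux_q_values - noop_q_values))
    return primary_reward - lam * penalty

# Example: a single auxiliary Q-function (|R| = 1), as in the default AUP condition.
r_aup = aup_reward(primary_reward=1.0, aux_q_values=[0.42], noop_q_values=[0.55], lam=0.5)
```

With |R| = 1 (the paper's default), the penalty reduces to λ |Q_1(s, a) − Q_1(s, ∅)|.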
To an approximation, Wainwright and Eckersley [27]'s SafeLife evolves according to the transition rules of Conway's Game of Life [12]. Cells endure, spawn, or die depending on how many living neighbors they have. In the eight cells surrounding the agent, no cells spawn or die; the agent can disturb dynamic patterns merely by approaching them.

Figure 1 compares AUP with Schulman et al. [23]'s Proximal Policy Optimization (PPO) in a simple scenario. While PPO optimizes the primary reward R, AUP also preserves the optimal value for a single auxiliary reward function (|R| = 1).

Figure 1: (a) PPO trajectory; (b) AUP trajectory. The agent receives 1 primary reward for entering the goal. The agent can move in the cardinal directions, destroy cells in the cardinal directions, or do nothing. Walls are not movable. The right end of the screen wraps around to the left. (a): The learned trajectory for the misspecified primary reward function R destroys fragile green cells. (b): Starting from the same state, AUP's trajectory preserves the green cells.

It is important to note that we did not hand-select an informative auxiliary reward function to induce the trajectory of fig. 1b. Instead, the auxiliary reward was the output of a one-dimensional observation encoder, corresponding to a continuous Bernoulli variational autoencoder [17] trained through random exploration (see section 5).

Our theorems provide intuition about how the AUP penalty term works. Proofs and additional results are in appendix B.
Definition. Let D be a continuous distribution over reward functions bounded [0, 1], with probability measure F. The attainable utility distance between state distributions Δ, Δ′ ∈ Δ(S) is

  d_AU(Δ, Δ′) := ∫_D | E_Δ[V*_R(s)] − E_Δ′[V*_R(s′)] | dF(R).   (2)

Theorem 1. d_AU is a distance metric on Δ(S).

Viewing the designer as sampling auxiliary reward functions from distribution D, the AUP penalty term is the Monte Carlo integration of λγ · d_AU(T(s, a), T(s, ∅)):

  (λ / |R|) Σ_{R_i ∈ R} | Q*_i(s, a) − Q*_i(s, ∅) | = (λγ / |R|) Σ_{R_i ∈ R} | E_{T(s,a)}[V*_i(s_a)] − E_{T(s,∅)}[V*_i(s_∅)] |.   (3)

Insofar as the Monte Carlo integration approximates d_AU, the attainable utility distance sheds light on the attainable utility penalty term.

Theorem 2 (Movement penalties are small). Let Δ ≠ Δ′. Suppose that all states in the support of Δ can deterministically reach in one step all states in the support of Δ′, and vice versa. Then 0 < d_AU(Δ, Δ′) < 1. In general, we only have 0 ≤ d_AU(Δ, Δ′) < 1/(1 − γ) (corollary 8 in appendix B.2).

The intuitive notion of "power" corresponds to the ability to achieve goals in general, which can be formalized as average optimal value. For example, resources increase average optimal value, while immobility decreases it.

Definition (Average optimal value [26]). V*_avg(s) := E_D[V*_R(s)].

AUP significantly penalizes change in expected power compared to inaction. In fig. 1b, AUP heavily penalized the destruction of green cells. Destroying these cells reduces power.

Theorem 3 (Power-shift penalties are large). d_AU(Δ, Δ′) ≥ | E_Δ[V*_avg(s)] − E_Δ′[V*_avg(s′)] |.

In Conway's Game of Life, cells are alive or dead. Depending on how many live neighbors surround a cell, the cell comes to life, dies, or retains its state. Even simple initial conditions can evolve into complex and chaotic patterns, and the Game of Life is Turing-complete when played on an infinite grid [20]. SafeLife turns the Game of Life into an actual game. An autonomous agent moves freely through the world, which is a large finite grid. There are many colors and kinds of cells, many of which have unique effects (see fig. 2).
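For reference, a minimal NumPy sketch of the standard Game of Life update rule described above, on a wrap-around grid (our own illustration; SafeLife's engine additionally freezes the cells adjacent to the agent and adds many other cell types):

```python
import numpy as np

def life_step(alive):
    """One step of Conway's Game of Life on a toroidal (wrap-around) grid.

    alive: 2D boolean array; True marks a living cell.
    """
    # Count living neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(alive, dr, axis=0), dc, axis=1)
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A dead cell with exactly 3 neighbors spawns; a living cell with 2 or 3 survives.
    return (neighbors == 3) | (alive & (neighbors == 2))

# Example: a "blinker" oscillates with period 2.
grid = np.zeros((5, 5), dtype=bool)
grid[2, 1:4] = True
next_grid = life_step(grid)
```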
Figure 2: (a) append-spawn; (b) prune-still-easy. Trees are permanent living cells. The agent can move crates but not walls. The screen wraps vertically and horizontally. (a): The agent receives reward for creating gray cells in the blue areas. The goal can be entered when some number of gray cells are present. Spawners stochastically create yellow living cells. (b): After the agent removes some number of red cells, the goal turns red and can be entered.

Wainwright and Eckersley [27] score side effects as the degree to which the agent perturbs green cell patterns. Over an episode of T time steps, side effects are quantified as the Wasserstein 1-distance between the configuration of green cells had the state evolved naturally for T time steps, and the actual final configuration. As the primary reward function R is indifferent to green cells, this proxy measures the safety performance of learned policies. If the agent never disturbs green cells, it achieves a perfect score of zero. By construction, minimizing side effect score preserves R_true's optimal value, since R_true encodes our preferences about the existing green patterns.

As shown in table 1, Turner et al. [25] evaluated AUP on toy environments. In contrast, SafeLife vigorously challenges modern reinforcement learning algorithms.

AI safety gridworlds [16]            SafeLife [27]
Dozens of states                     Millions of states
Deterministic dynamics               Stochastic dynamics
Handful of preset environments       Randomly generated environments
One side effect per level            Many side effect opportunities
Immediate side effects               Chaos unfolds over time

Table 1: SafeLife is ideal for testing side effect avoidance.
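A simplified sketch of the side effect score described above, assuming the counterfactual board (the state evolved naturally for T steps without the agent) has already been computed; it uses the POT optimal-transport package and treats green cells as uniform point masses, which only approximates SafeLife's exact scoring.

```python
import numpy as np
import ot  # Python Optimal Transport package (pip install pot)

def green_cell_coords(board, green_value):
    """Return an (n, 2) array of coordinates of green cells on the board."""
    return np.argwhere(board == green_value).astype(float)

def side_effect_score(final_board, counterfactual_board, green_value=1):
    """Wasserstein-1 distance between the agent-perturbed and naturally evolved
    green-cell configurations (a simplified stand-in for SafeLife's metric)."""
    actual = green_cell_coords(final_board, green_value)
    natural = green_cell_coords(counterfactual_board, green_value)
    if len(actual) == 0 and len(natural) == 0:
        return 0.0  # nothing to compare: perfect score
    if len(actual) == 0 or len(natural) == 0:
        return float(max(len(actual), len(natural)))  # crude penalty for total loss or creation
    # Uniform weights over cells; cost = Euclidean distance between coordinates.
    cost = ot.dist(actual, natural, metric='euclidean')
    a = np.full(len(actual), 1.0 / len(actual))
    b = np.full(len(natural), 1.0 / len(natural))
    return float(ot.emd2(a, b, cost))
```

An agent that leaves the green cells exactly as they would have evolved on their own scores zero under this sketch, matching the "perfect score" described in the text.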
The agent can move in the cardinal directions, spawn/destroy a living cell in the cardinal directions, or do nothing. We have four conditions: PPO, AUP, AUP_proj, and Naive. Each condition is trained with PPO on a different reward signal for five million (5M) time steps. See the supplemental material for architectural and training details.
PPO: Trained on the primary SafeLife reward function R, without a side effect penalty.

AUP: For the first 100K time steps, the agent randomly explores to collect observation frames. These frames are used to train a continuous Bernoulli variational autoencoder with a Z-dimensional latent space and encoder network E. If Z = 1, the auxiliary reward is the output of the encoder E. Otherwise, we draw linear functionals φ_i uniformly at random from (0, 1)^Z. The auxiliary reward function R_i is defined as φ_i ∘ E : S → ℝ. For each of the |R| auxiliary reward functions, we learn a Q-value network for 1M time steps. The learned Q_{R_i} define the penalty term of eq. (1). The agent then learns R_AUP for 3.9M steps, during which time λ is linearly increased from .001 to λ*.
AUP_proj: The same as AUP, but the auxiliary reward function is a random projection from the downsampled observation space to ℝ, without using a variational autoencoder.

Naive: Trained on the primary reward function R minus (roughly) the L1 distance between the current state and the initial state (see the sketch below). The agent is penalized when cells differ from their initial values. Wainwright and Eckersley [27] found that an unscaled L1 penalty produced the best results. While a good benchmark for ideal behavior in certain static tasks, penalizing state change often fails to avoid crucial side effects. State change penalties do not differentiate between moving a box and irreversibly wedging a box in a corner [14].

The default settings are: N_env = 8 randomly generated environments in the curriculum, Z = 1 latent space dimension, |R| = 1 auxiliary reward function, and a final R_AUP penalty severity λ*. The discount rate γ and remaining hyperparameters are listed in appendix D.

We evaluate the conditions on the append-spawn (fig. 2a) and prune-still-easy (fig. 2b) tasks. Furthermore, we include two easier variants of append-spawn: append-still (no stochastic spawners) and append-still-easy (no stochastic spawners, fewer green cells).

We conduct three trials. At the beginning of each trial, SafeLife randomly generates N_env environments for the given task. The conditions are evaluated on the same set of random environments. The curriculum is a random sequence of these environments.

For append-still, we allotted an extra 1M steps to achieve convergence for all agents. For append-spawn, agents pretrain on append-still-easy environments for the first 2M steps and train on append-spawn for 3M steps. For AUP in append-spawn, the autoencoder and auxiliary network are trained on both tasks. R_AUP is then pretrained for 2M steps and trained for 1.9M steps.
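As referenced in the Naive condition above, a minimal sketch of a state-change penalty of this kind (our own approximation, not SafeLife's exact implementation):

```python
import numpy as np

def naive_reward(primary_reward, board, initial_board):
    """Primary reward minus an (unscaled) penalty on deviation from the initial state.

    board, initial_board: integer arrays of cell types with the same shape.
    Every cell whose value differs from its initial value adds to the penalty,
    whether or not the change is reversible.
    """
    changed = (board != initial_board).astype(float)
    return primary_reward - np.sum(changed)
```

Because the penalty is computed cell by cell against the initial configuration, it treats reversible rearrangements and irreversible destruction identically, which is exactly the weakness noted above.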
Results. In append-still-easy, even though AUP waits 1.1M steps to start training on R_AUP, AUP is competitive with PPO by step 1.75M (see fig. 3). By step 2.75M, AUP consistently outperforms PPO while incurring less than a twentieth of the side effects. Naive also does very well and learns more quickly than PPO. AUP_proj does better than PPO but worse than AUP, perhaps implying that the one-dimensional encoder provides more structure than a random projection.

In prune-still-easy, all four conditions competitively accrue reward. PPO and AUP_proj frequently have side effects. AUP avoids side effects, but not as well as in append-still-easy. The Naive benchmark has fewer side effects than AUP. This makes sense: since these environments are static, Naive almost directly penalizes our unobserved side effect metric (change to the green cells).

The append-still environments contain more green cells. Once again, AUP has far fewer side effects than PPO. AUP_proj and Naive both flounder, earning significantly lower return than PPO. While Naive appears to have fewer side effects than AUP, this is only because Naive usually does nothing. In the supplemental material, we display episode lengths over the course of training: Naive converges to an average episode length of 843 (the maximum is 1,001). Even PPO has an average episode length of 548. In stark contrast, AUP learns effective and decisive policies with half the average length of PPO. AUP frequently attains a length of only 43 (near optimal). AUP significantly decreases episode length for most tasks, perhaps because AUP applies small movement penalties (theorem 2).

The append-spawn environments contain both stochastic yellow cell spawners and more green cells. These environments challenge PPO, which earns less reward and yet causes more side effects. AUP_proj completely fails to learn. Naive usually fails to get any reward, as its policy erratically wanders the environment. AUP is once again superior to PPO: 131% of the reward, 46% of the side effects.

Figure 3: Learning curves with shaded regions representing ± standard deviation. AUP begins training on R_AUP at step 1.1M. AUP has far fewer side effects than PPO. (Panels show side effect score and reward over training steps, in millions, for append-still-easy, prune-still-easy, append-still, and append-spawn.)

Hyperparameter sweep

Method. In append-still-easy, we evaluate AUP over four settings of λ* and over (N_env, Z) pairs drawn from four N_env settings (including N_env = ∞, meaning that each episode takes place in a new environment) crossed with Z ∈ {1, 4, 16, 64}. We also evaluate PPO on each N_env setting. For each setting, we record both the side-effect score and the return of the learned policy, averaged over three trials. We use default settings for all unmodified parameters.

Figure 4: Side effect score, averaged over three learned policies for append-still-easy. Lower score is better. Default AUP setting outlined in black.

Figure 5: Episodic reward, averaged over three learned policies for append-still-easy. Higher reward is better, although AUP only aims to match PPO. Default AUP setting outlined in black.
Results. As N_env increases, reward tends to decrease and side effect score tends to increase. This performance degradation does not seem to be due to Attainable Utility Preservation, but rather because Proximal Policy Optimization has challenges generalizing (see fig. 4 and fig. 5). In particular, PPO accrues less reward and induces more side effects as N_env increases. However, even when N_env = ∞, AUP (Z = 16) shows the potential to significantly reduce side effects without reducing episodic return.

AUP's default configuration achieves 117% of PPO's episodic return, without any of the side effects. The AUP penalty term might be acting as a shaping reward. This is intriguing: shaping usually requires knowledge of the desired task, whereas the auxiliary reward function is randomly generated. Additionally, once AUP begins learning R_AUP at step 1.1M, AUP learns much more quickly than PPO did (fig. 3); this supports the shaping hypothesis. AUP imposed minimal overhead: due to its increased sample efficiency, AUP reaches PPO's asymptotic episodic return at the same time as PPO.

Surprisingly, AUP does well with a single latent space dimension (Z = 1). As Z increases, so does AUP's side effect score. In the supplementary material, our data show that higher-dimensional auxiliary reward functions are harder to learn, resulting in a poorly learned auxiliary Q-function. Nonetheless, for each N_env setting, AUP's worst configuration has significantly fewer side effects than PPO.

Surprisingly, AUP also does well with a single auxiliary reward function (|R| = 1). We hypothesize that destroying patterns decreases power; by theorem 3, this is penalized in the limit of |R| → ∞. Furthermore, we believe that decreasing power usually decreases optimal value for any given single auxiliary reward function. Since R_AUP penalizes optimal value decrease, this might explain why AUP does well with one auxiliary reward function.

At larger values of λ*, AUP becomes more conservative. As λ* increases further, AUP stops moving entirely. AUP only regularizes learned policies, so AUP can still make expensive mistakes during training.
We successfully scaled AUP to complex environments without providing task-specific knowledge: the auxiliary reward function was derived from a one-dimensional variational autoencoder trained through random exploration. To the best of our knowledge, AUP is the first task-agnostic approach which avoids side effects and competitively achieves reward in complex environments.

Wainwright and Eckersley [27] speculated that avoiding side effects must necessarily decrease performance on the primary task. This may be true for optimal policies, but not necessarily for learned policies. AUP significantly improved performance on append-still-easy and append-spawn, while matching performance on prune-still-easy and append-still.

AUP_proj enjoyed moderate success on the easier tasks. This suggests that AUP works (to varying extents) for a wide range of uninformative reward functions.

While Naive penalizes every state perturbation equally, AUP applies penalty in proportion to irreversibility. For example, the agent could move crates around (and then put them back later). AUP incurred little penalty for doing so, while Naive was more constrained. We believe that AUP will continue to scale to useful applications, in part because it naturally accounts for irreversibility.
Future work.
Off-policy learning could allow simultaneous training of the auxiliary R_i and of R_AUP. Instead of learning an auxiliary Q-function, the agent could just learn the auxiliary advantage function with respect to inaction. The SafeLife suite includes more challenging variants of prune-still-easy. SafeLife also includes difficult navigation tasks, in which the agent must reach the goal by wading either through fragile green patterns or through robust yellow patterns.

AUP's excellent performance when |R| = Z = 1 raises interesting questions. Turner et al. [25]'s small "Options" environment required |R| = 25 for good performance. SafeLife environments are much larger than Options (table 1), so why does |R| = 1 perform so well? To what extent does the AUP penalty term provide reward shaping? Why do one-dimensional encodings provide a learnable reward signal?
Conclusion.
To realize the full potential of reinforcement learning, we need more than algorithms which train policies that optimize a specified reward function. We also need to be able to specify the right reward function. Fundamentally, we face a frame problem: we often know what we want the agent to do, but we cannot list everything we want the agent not to do. AUP scales to challenging domains, incurs modest overhead, performs competitively on the original task, and avoids side effects, all without explicit information as to what constitutes a "side effect".
Broader Impact
A scalable side effect avoidance method would ease the challenge of reward specification and aid deployment of reinforcement learning in situations where mistakes are costly. Conversely, developers should carefully consider how reinforcement learning algorithms might produce policies with catastrophic impact. Developers should not blindly rely on even a well-tested side effect penalty.

Acknowledgments
This work was made possible by the Center for Effective Altruism and the Long-term Future Fund. We thank Joshua Turner for help compiling fig. 4 and fig. 5. Scott Emmons inspired theorem 9. We thank Stuart Armstrong, Andrew Critch, Evan Hubinger, Dylan Hadfield-Menell, Matthew Olson, Rohin Shah, Logan Smith, and Carroll Wainwright for their ideas and feedback.
References [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization.In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages22–31, 2017.[2] Eitan Altman.
Constrained Markov decision processes , volume 7. CRC Press, 1999.[3] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-basedreinforcement learning with stability guarantees. In
Advances in Neural Information ProcessingSystems , pages 908–918, 2017.[4] Nick Bostrom.
Superintelligence . Oxford University Press, 2014.[5] Frank M Brown.
The Frame Problem in Artificial Intelligence: Proceedings of the 1987Workshop . Morgan Kaufmann, 2014.[6] Bart Bussmann, Jacqueline Heinerman, and Joel Lehman. Towards empathic deep Q-learning. arXiv:1906.10918 , 2019.[7] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. ALyapunov-based approach to safe reinforcement learning. In
Advances in Neural InformationProcessing Systems , pages 8092–8101, 2018.[8] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deepreinforcement learning from human preferences. In
Advances in Neural Information ProcessingSystems , pages 4299–4307, 2017.[9] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning witha corrupted reward channel. In
Proceedings of the Twenty-Sixth International Joint Conferenceon Artificial Intelligence, IJCAI-17 , pages 4705–4713, 2017.[10] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning toreset for safe and autonomous reinforcement learning. In
International Conference on LearningRepresentations , 2018.[11] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning.
Journal of Machine Learning Research , 16(1):1437–1480, 2015.[12] Martin Gardner. The fantastic combinations of John Conway’s new solitaire game ‘life’.
Scientific American , 223(4):120–123, 1970.[13] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inversereward design. In
Advances in Neural Information Processing Systems , pages 6765–6774, 2017.[14] Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoidingside effects using relative reachability.
CoRR , abs/1806.01186, 2018.[15] Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, RamanaKumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of AIingenuity, 2020. URL https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity .[16] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq,Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv:1711.09883 , November 2017.917] Gabriel Loaiza-Ganem and John P Cunningham. The continuous Bernoulli: fixing a pervasiveerror in variational autoencoders. In
Advances in Neural Information Processing Systems , pages13266–13276, 2019.[18] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning–anoverview. In
International Workshop on Modelling and Simulation for Autonomous Systems ,pages 357–375. Springer, 2014.[19] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs usingnondominated policies. In
AAAI , 2010.[20] Paul Rendell. Turing universality of the game of life. In
Collision-based computing , pages513–539. Springer, 2002.[21] Stuart Russell.
Human compatible: Artificial intelligence and the problem of control . Viking,2019.[22] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxi-mators. In
International Conference on Machine Learning , pages 1312–1320, 2015.[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximalpolicy optimization algorithms. arXiv:1707.06347 , 2017.[24] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Theimplicit preference information in an initial state. In
International Conference on LearningRepresentations , 2019.[25] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency viaattainable utility preservation. In
Proceedings of the AAAI/ACM Conference on AI, Ethics, andSociety , pages 385–391, 2020.[26] Alexander Matt Turner, Logan Smith, Rohin Shah, and Prasad Tadepalli. Optimal FarsightedAgents Tend to Seek Power. arXiv:1912.01683 , April 2020.[27] Carroll L Wainwright and Peter Eckersley. Safelife 1.0: Exploring side effects in complexenvironments. arXiv:1912.01217 , 2019.[28] Shun Zhang, Edmund H Durfee, and Satinder P Singh. Minimax-regret querying on side effectsfor safe optimality in factored Markov decision processes. In
Proceedings of the Twenty-SeventhInternational Joint Conference on Artificial Intelligence, IJCAI-18 , pages 4867–4873, 2018.10
A Additional data
Figure 6: Episode length curves (episode length versus training steps, in millions) for append-still-easy, prune-still-easy, append-still, and append-spawn, with shaded regions representing ± standard deviation. AUP begins training on R_AUP at step 1.1M. AUP significantly decreases episode length in the append tasks.

Figure 7: Auxiliary reward curves for AUP on append-still-easy (with a Z-dimensional latent space, Z ∈ {1, 4, 16, 64}; auxiliary reward versus training steps, in thousands), with shaded regions representing ± standard deviation. Auxiliary reward is not comparable across trials, so learning is expressed by the slope of the curves.

B Theoretical results
Consider a rewardless MDP ⟨S, A, T, γ⟩ whose state space S and action space A are both finite, and γ ∈ [0, 1). Reward functions R ∈ ℝ^S have corresponding optimal value functions V*_R(s).

Definition. Let D be a continuous distribution over reward functions bounded [0, 1], with probability measure F. The attainable utility distance between state distributions Δ, Δ′ ∈ Δ(S) is

  d_AU(Δ, Δ′) := ∫_D | E_Δ[V*_R(s)] − E_Δ′[V*_R(s′)] | dF(R).   (2)

Restriction to degenerate distributions yields a distance metric over the state space.

B.1 Main results

Theorem 1. d_AU is a distance metric on Δ(S).

Proof. For Δ, Δ′, Δ″ ∈ Δ(S):

1. d_AU(Δ, Δ′) ≥ 0.
2. d_AU(Δ, Δ′) = 0 iff Δ = Δ′.
3. d_AU(Δ, Δ′) = d_AU(Δ′, Δ).
4. d_AU(Δ, Δ″) ≤ d_AU(Δ, Δ′) + d_AU(Δ′, Δ″).

Properties 1 and 3 are trivially true. Property 2 follows from lemma 11. Property 4 follows from applying the triangle inequality for real numbers to the integrand.

Theorem 2 (Movement penalties are small). Let Δ ≠ Δ′. Suppose that all states in the support of Δ can deterministically reach in one step all states in the support of Δ′, and vice versa. Then 0 < d_AU(Δ, Δ′) < 1.

Proof. 0 < d_AU(Δ, Δ′) by theorem 1. Let R ∈ D. By proposition 4, |V*_R(s) − V*_R(s′)| ≤ (1 − γ) max(V*_R(s), V*_R(s′)) ≤ 1. Then

  d_AU(Δ, Δ′) ≤ ∫_D E_{Δ,Δ′}[ |V*_R(s) − V*_R(s′)| ] dF(R)   (4)
             < ∫_D E_{Δ,Δ′}[ (1 − γ) max(V*_R(s), V*_R(s′)) ] dF(R)   (5)
             ≤ 1.   (6)

Equation (5) follows because D is continuous, so it cannot be the case that almost all reward functions assign 0 reward to either s or s′.

Definition (Average optimal value [26]). V*_avg(s) := E_D[V*_R(s)].

Theorem 3 (Power-shift penalties are large). d_AU(Δ, Δ′) ≥ | E_Δ[V*_avg(s)] − E_Δ′[V*_avg(s′)] |.

Proof. Apply the reverse triangle inequality to the integrand of d_AU(Δ, Δ′) and use the linearity of expectation.

We derive a principled motivation for preserving the reachability of the initial state, bounding the decrease in V*_{R_true} by how many steps it takes to return to the initial state s.

Proposition 4 (Communicability bounds maximum change in attainable utility). If s can reach s′ in k_1 steps and s′ can reach s in k_2 steps, then

  max_{R ∈ [b,c]^S} |V*_R(s) − V*_R(s′)| ≤ (1 − γ^{max(k_1,k_2)}) (c − b) / (1 − γ) < (c − b) / (1 − γ).

In particular, max_{R ∈ [b,c]^S} V*_R(s) − V*_R(s′) ≤ (1 − γ^{k_2}) (c − b) / (1 − γ).

Proof. We first bound the maximum increase:

  max_{R ∈ [b,c]^S} V*_R(s′) − V*_R(s) ≤ max_{R ∈ [b,c]^S} V*_R(s′) − ( b (1 − γ^{k_1}) / (1 − γ) + γ^{k_1} V*_R(s′) )   (7)
    ≤ c / (1 − γ) − ( b (1 − γ^{k_1}) / (1 − γ) + γ^{k_1} c / (1 − γ) )   (8)
    = (1 − γ^{k_1}) (c − b) / (1 − γ).   (9)

Equation (7) holds because even if we make R equal b for as many states as possible, s′ is still reachable from s. The case for maximum decrease is similar.
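As a concrete reading of proposition 4 (a worked instance we add for illustration), take rewards bounded in [0, 1] (b = 0, c = 1) and states that are mutually reachable in one step (k_1 = k_2 = 1):

```latex
% Worked instance of proposition 4 with b = 0, c = 1, k_1 = k_2 = 1.
\max_{R \in [0,1]^{\mathcal{S}}} \bigl| V^*_R(s) - V^*_R(s') \bigr|
  \;\le\; \bigl(1 - \gamma^{\max(k_1, k_2)}\bigr)\,\frac{c - b}{1 - \gamma}
  \;=\; (1 - \gamma)\,\frac{1}{1 - \gamma}
  \;=\; 1.
```

This matches the constant in theorem 2: one-step movement can change a [0, 1]-bounded attainable utility by at most 1, whereas states that are many steps apart only satisfy the loose bound (c − b)/(1 − γ).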
Positive affine transformation of D allows generalization of our results to other bounds, as the optimal policy is invariant to positive affine transformation of the reward function.

Proposition 5. Let D′ be any positive affine transformation mX + C of D. Then

  d^{D′}_AU(Δ, Δ′) = m · d^{D}_AU(Δ, Δ′).   (10)

B.2 Additional results

Lemma 6. d_AU(Δ, Δ′) ≤ E_{s∼Δ, s′∼Δ′}[ d_AU(s, s′) ].

Lemma 7. ∀ s, s′: d_AU(s, s′) < 1/(1 − γ).

Proof. Because optimal value is bounded [0, 1/(1 − γ)], d_AU(s, s′) ≤ 1/(1 − γ). Equality holds iff for almost all R ∈ D, V*_R(s) = 1/(1 − γ) and V*_R(s′) = 0, or vice versa. But because D is continuous, s′ must induce positive optimal value for a positive measure set of reward functions.

Corollary 8. d_AU(Δ, Δ′) < 1/(1 − γ).

Theorem 9 (Reward functions induce unique optimal value functions). R ↦ V*_R is injective.

Proof. Given V*_R and the rewardless MDP, deduce an optimal policy π* for R by choosing a V*_R-greedy action for each state. Let T_{π*} be the transition probabilities under π*. Then

  V*_R = R + γ T_{π*} V*_R   (11)
  (I − γ T_{π*}) V*_R = R.   (12)

Definition. Let e_s represent the unit vector for state s. The state visitation distribution induced by following π from state s is

  f^π_s := Σ_{t=0}^∞ γ^t E[ e_{s′} | π followed for t steps from s ].   (13)

Lemma 10. The elements of { f^π_s | s ∈ S } are linearly independent.

Proof. Consider the all-zero optimal value function, with optimal policy π*. Theorem 9 implies that the following homogeneous system of equations has a unique solution for r:

  f^{π* ⊤}_{s_1} r = 0, …, f^{π* ⊤}_{s_{|S|}} r = 0.

That solution is clearly the all-zero reward function (for which all policies are optimal), so the f^{π*}_s are linearly independent. Since every policy is optimal for the all-zero reward function, we conclude that the f^π_s are linearly independent for any policy π.

Turner et al. [26]'s theorem 33 shows the following result for deterministic dynamics and for single states s ≠ s′. We generalize to the stochastic case and to distributions over states.

Lemma 11. If Δ ≠ Δ′, then P( E_Δ[V*_R(s)] = E_Δ′[V*_R(s′)] | R ∼ D ) = 0.

Proof. Let R ∈ D (also written r ∈ ℝ^{|S|}), and let π* be one of its optimal policies. By lemma 10, E_Δ[f^{π*}_s] = E_Δ′[f^{π*}_{s′}] iff Δ = Δ′. Therefore, E_Δ[f^{π*}_s] ≠ E_Δ′[f^{π*}_{s′}]. Trivially, E_Δ[V*_R(s)] = E_Δ′[V*_R(s′)] iff E_Δ[f^{π*}_s]^⊤ r = E_Δ′[f^{π*}_{s′}]^⊤ r. Since E_Δ[f^{π*}_s] ≠ E_Δ′[f^{π*}_{s′}], the set of satisfactory r has no interior in the subspace topology induced by D's support. This convex set has zero Lebesgue measure; by the Radon-Nikodym theorem, it also has zero measure under continuous distributions.
C Training details

We detail how we trained the AUP and AUP_proj conditions.

C.1 R_aux training

For the first phase of training, our goal is to learn Q_aux, allowing us to compute the AUP penalty in the second phase of training. Due to the size of the full SafeLife state (350 × … × …), both conditions downsample the observations with average pooling and convert them to intensity values. Previously, Turner et al. [25] learned Q_aux with tabular Q-learning. They used environments small enough that reward could be assigned to each state. Because SafeLife environments are too large for tabular Q-learning, we demonstrated two methods for randomly generating an auxiliary reward function.
AUP: We acquire a low-dimensional state representation by training a continuous Bernoulli variational autoencoder (CB-VAE) [17]. To train the CB-VAE, we collect a buffer of observations by acting randomly for 100,000/N_env steps in each of the N_env environments. This gives us 100K total observations with an N_env-environment curriculum. We train the CB-VAE for 100 epochs, preserving the encoder E for downstream auxiliary reward training. For each auxiliary reward function, we draw a linear functional uniformly from (0, 1)^Z to serve as our auxiliary reward function, where Z is the dimension of the CB-VAE's latent space. The auxiliary reward for an observation is the composition of the linear functional with the observation's latent representation (see the sketch at the end of this subsection).
AUP_proj: Instead of using a CB-VAE, AUP_proj simply downsamples the input observation. At the beginning of training, we generate a linear functional over the unit hypercube (with respect to the downsampled observation space). The auxiliary reward for an observation is the composition of the linear functional with the downsampled observation.

After the CB-VAE has been trained, we start auxiliary reward training in the corresponding SafeLife environment. To learn Q_aux, we modify the value function in PPO to a Q-function. Our training algorithm for phase 1 only differs from PPO in how we calculate reward. We train each auxiliary reward function for 1M steps.
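As referenced above, a minimal sketch of how an auxiliary reward function of the form φ ∘ E might be assembled from a trained encoder; the encoder interface, the names, and the stand-in encoder in the example are assumptions of this sketch, not the released code.

```python
import numpy as np

def make_auxiliary_reward(encoder, latent_dim, rng=None):
    """Build an auxiliary reward function R_i = phi ∘ E from a trained encoder.

    encoder:    callable mapping a (downsampled) observation to a length-latent_dim
                latent vector, e.g. the mean of the CB-VAE posterior.
    latent_dim: Z, the dimension of the latent space.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if latent_dim == 1:
        # With a one-dimensional latent space, the encoder output itself is the reward.
        return lambda obs: float(encoder(obs)[0])
    # Otherwise, draw a random linear functional phi from the unit hypercube (0, 1)^Z.
    phi = rng.uniform(0.0, 1.0, size=latent_dim)
    return lambda obs: float(phi @ np.asarray(encoder(obs)))

# Example with a stand-in "encoder" (a fixed random projection of flattened observations).
proj = np.random.default_rng(0).normal(size=(4, 25 * 25))
fake_encoder = lambda obs: proj @ obs.reshape(-1)
aux_reward = make_auxiliary_reward(fake_encoder, latent_dim=4)
reward = aux_reward(np.zeros((25, 25)))
```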
C.2 R_AUP training

In phase 2, we train a new PPO agent on R_AUP (eq. (1)) for the corresponding SafeLife task. Each step, the agent selects an action a in state s according to its policy π_AUP, and receives reward R_AUP(s, a) from the environment. We compute R_AUP(s, a) with the learned Q-values Q_aux(s, ∅) and Q_aux(s, a). The penalty term is modulated by the hyperparameter λ, which is linearly scaled from .001 to some final value λ*. Because λ controls the relative influence of the penalty, linearly increasing λ over time prioritizes primary task learning early in training and slowly encourages the agent to obtain the same reward while avoiding side effects. If the value of λ is too high (if side effects are too costly), the agent won't have time to adapt its current policy and will choose inaction (∅) to escape the penalty. A careful λ schedule helps induce a successful policy that also avoids side effects.
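A minimal sketch of such a linear λ schedule; the step counts and the final value below are placeholders for illustration, not the paper's settings.

```python
def lambda_schedule(step, start_step=1_100_000, end_step=5_000_000,
                    lambda_init=1e-3, lambda_final=1.0):
    """Linearly anneal the penalty coefficient from lambda_init to lambda_final
    over the R_AUP training phase; before start_step, keep the initial value."""
    if step <= start_step:
        return lambda_init
    if step >= end_step:
        return lambda_final
    frac = (step - start_step) / (end_step - start_step)
    return lambda_init + frac * (lambda_final - lambda_init)
```

Calling lambda_schedule(step) at each update then yields the coefficient used in eq. (1) for that step.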
Table 2 lists the hyperparameters used for all conditions, which generally match the default SafeLife settings. Common refers to those hyperparameters that are the same for each evaluated condition. AUX refers to hyperparameters used only when training the auxiliary reward functions; thus, it only pertains to AUP and AUP_proj. The conditions PPO and Naive use the PPO hyperparameters for the duration of their training, while AUP and AUP_proj use them when training with respect to R_AUP.

Hyperparameter                        Value
Common
  Learning Rate                       …
  Optimizer                           Adam
  Gamma (γ)                           …
  Lambda (PPO)                        …
  Lambda (AUP)                        .001 → λ*
  Entropy Clip                        …
  Value Coefficient                   …
  Gradient Norm Clip                  …
  Clip Epsilon                        …
AUX
  Entropy Coefficient                 …
  Training Steps                      1 · 10^6
PPO
  Entropy Coefficient                 …
Policy
  Number of Hidden Layers             …
  Output Channels in Hidden Layers    (32, …, …)
  Nonlinearity                        ReLU
CB-VAE
  Learning Rate                       …
  Optimizer                           Adam
  Latent Space Dimension (Z)          1
  Batch Size                          …
  Training Epochs                     100
  Epsilon                             …
  Number of Hidden Layers (encoder)   …
  Number of Hidden Layers (decoder)   …
  Hidden Layer Width (encoder)        (512, …, …, …, …, …)
  Hidden Layer Width (decoder)        (128, …, …, …, …, output)
  Nonlinearity                        ELU

Table 2: Chosen hyperparameters.
E Compute environment
Condition      GPU-hours per trial
PPO            …
AUP            …
AUP_proj       …