Avoiding Side Effects in Complex Environments
Alexander Matt Turner ∗ Neale Ratzlaff ∗ Prasad Tadepalli
Oregon State University {turneale@, ratzlafn@, tadepall@eecs.}oregonstate.edu
Abstract
Reward function specification can be difficult, even in simple environments. Realistic environments contain millions of states. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway's Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead, completes the specified task, and avoids side effects.
Reward function specification can be difficult, even when the desired behavior seems clear-cut. Rewarding progress in a race leads an agent to collect checkpoint reward instead of completing the race [15]. We want to minimize the negative side effects of misspecification: from a manufacturing robot which breaks expensive equipment, to content recommendation systems which radicalize their users, to potential future AI systems which negatively transform the world [4, 21].

Side effect avoidance poses a version of the "frame problem": each action can have many effects, and it is impractical to explicitly penalize all of the bad ones [5]. For example, a housekeeping agent should clean a dining room without radically rearranging furniture, and a manufacturing agent should assemble widgets without breaking equipment. A general, transferable solution to side effect avoidance would ease reward specification: the agent's designers could just positively specify what should be done, as opposed to negatively specifying what should not be done.

Breaking equipment is bad because it hampers future optimization of the true objective (which includes our preferences about the factory). That is, there often exists a reward function R_true which fully specifies the agent's task within its deployment context. In the factory setting, R_true might encode "assemble widgets, but don't spill the paint, break the conveyor belt, injure workers, etc." We want the agent to preserve optimal value for this true reward function. While we can accept suboptimal actions (e.g. pacing the factory floor), we cannot accept the destruction of value for the true task. By avoiding negative side effects which decrease value for the true task, the designers can correct any misspecification and eventually achieve low regret for R_true.

Despite being unable to directly specify R_true, we demonstrate a method for preserving its optimal value anyway. In Turner et al. [25]'s toy environments, preserving optimal value for many randomly generated reward functions often preserves the optimal value for R_true. In this paper, we generalize this approach to combinatorially complex environments and evaluate it in the chaotic and challenging SafeLife test suite [27]. We show the rather surprising result that by preserving optimal value for a single randomly generated reward function, AUP preserves optimal value for R_true and thereby avoids negative side effects.

∗ Equal contribution.

Prior work
AUP avoids negative side effects in small gridworld environments while preserving optimal value for randomly generated reward functions [25]. Penalizing decrease in (discounted) state reachability achieves similar results [14]. However, this approach has difficulty scaling: naively estimating all reachability functions is a task quadratic in the size of the state space.

In the supplementary material, proposition 4 shows that preserving initial state reachability [10] bounds the maximum decrease in optimal value for R_true. Unfortunately, due to irreversible dynamics, initial state reachability often cannot be preserved.

Everitt et al. [9] frame reward misspecification as the composition of a corruption function with R_true. Shah et al. [24] exploit information contained in the initial state of the environment to infer which side effects are negative; for example, if vases are present, humans must have gone out of their way to avoid them, so the agent should as well. Christiano et al. [8] infer human preference information from solicited trajectory comparisons. Hadfield-Menell et al. [13] consider the provided reward function to only suggest the designer's true preferences on the training distribution.

Robust optimization selects a trajectory which maximizes the minimum return achieved under a feasible set of reward functions [19]. However, we do not assume we can specify the feasible set. In constrained MDPs, the agent obeys constraints while maximizing the observed reward function [2, 1, 28]. Exhaustively specifying constraints is difficult.

In the multi-agent setting, empathic deep Q-learning preserves optimal value for another agent in the environment [6]. Schaul et al. [22] demonstrate a value function predictor which generalizes across both states and goals.

Safe reinforcement learning focuses on avoiding catastrophic mistakes during training [18, 11, 3, 7], while this work only considers the consequences of the learned policy.

Consider a Markov decision process (MDP) ⟨S, A, T, R, γ⟩ with finite state space S, finite action space A, transition function T : S × A → Δ(S), reward function R : S × A → ℝ, and discount factor γ. We assume the agent may take a no-op action ∅ ∈ A. We refer to V*_R(s) as the optimal value or attainable utility of reward function R at state s.

To define AUP's pseudo-reward function, the designer provides a finite reward function set R ⊂ ℝ^S, hereafter referred to as the auxiliary set. This set does not necessarily contain R_true. Each auxiliary reward function R_i ∈ R has a learned Q-function Q_i.

AUP penalizes average change in ability to optimize the auxiliary reward functions. The motivation is that by not changing optimal value for a wide range of auxiliary reward functions, the agent also does not decrease optimal value for R_true.

Definition (AUP reward function [25]). Let λ ≥ 0. Then

  R_AUP(s, a) := R(s, a) − (λ / |R|) Σ_{R_i ∈ R} | Q*_i(s, a) − Q*_i(s, ∅) |.   (1)

The regularization parameter λ controls penalty severity. In practice, the learned auxiliary Q_i is a stand-in for the optimal Q-function Q*_i.
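To make eq. (1) concrete, here is a minimal sketch of the penalty computation, assuming learned auxiliary Q-values are already available; the function and argument names are ours, not the paper's released code.

```python
import numpy as np

def aup_reward(primary_reward, aux_q_values, noop_q_values, lam):
    """Compute R_AUP(s, a) as in eq. (1).

    primary_reward: scalar R(s, a).
    aux_q_values:   Q_i(s, a) for each auxiliary reward function R_i.
    noop_q_values:  Q_i(s, no-op), in the same order.
    lam:            penalty coefficient lambda >= 0.
    """
    aux_q_values = np.asarray(aux_q_values, dtype=float)
    noop_q_values = np.asarray(noop_q_values, dtype=float)
    # Average absolute change in auxiliary attainable utility, relative to inaction.
    penalty = np.mean(np.abs(aux_q_values - noop_q_values))
    return primary_reward - lam * penalty

# Example: a single auxiliary Q-function (|R| = 1), as in the default AUP condition.
r_aup = aup_reward(primary_reward=1.0, aux_q_values=[0.42], noop_q_values=[0.55], lam=0.5)
```

With |R| = 1 (the paper's default), the penalty reduces to λ |Q_1(s, a) − Q_1(s, ∅)|.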
To an approximation, Wainwright and Eckersley [27]'s SafeLife evolves according to the transition rules of Conway's Game of Life [12]. Cells endure, spawn, or die depending on how many living neighbors they have. In the eight cells surrounding the agent, no cells spawn or die; the agent can disturb dynamic patterns merely by approaching them.

Figure 1 compares AUP with Schulman et al. [23]'s Proximal Policy Optimization (PPO) in a simple scenario. While PPO optimizes the primary reward R, AUP also preserves the optimal value for a single auxiliary reward function (|R| = 1).

Figure 1: (a) PPO trajectory; (b) AUP trajectory. The agent receives 1 primary reward for entering the goal. The agent can move in the cardinal directions, destroy cells in the cardinal directions, or do nothing. Walls are not movable. The right end of the screen wraps around to the left. (a): The learned trajectory for the misspecified primary reward function R destroys fragile green cells. (b): Starting from the same state, AUP's trajectory preserves the green cells.

It is important to note that we did not hand-select an informative auxiliary reward function to induce the trajectory of fig. 1b. Instead, the auxiliary reward was the output of a one-dimensional observation encoder, corresponding to a continuous Bernoulli variational autoencoder [17] trained through random exploration (see section 5).

Our theorems provide intuition about how the AUP penalty term works. Proofs and additional results are in appendix B.
Definition. Let D be a continuous distribution over reward functions bounded [0, 1], with probability measure F. The attainable utility distance between state distributions Δ, Δ′ ∈ Δ(S) is

  d_AU(Δ, Δ′) := ∫_D | E_Δ[V*_R(s)] − E_Δ′[V*_R(s′)] | dF(R).   (2)

Theorem 1. d_AU is a distance metric on Δ(S).

Viewing the designer as sampling auxiliary reward functions from distribution D, the AUP penalty term is the Monte Carlo integration of λγ · d_AU(T(s, a), T(s, ∅)):

  (λ / |R|) Σ_{R_i ∈ R} | Q*_i(s, a) − Q*_i(s, ∅) | = (λγ / |R|) Σ_{R_i ∈ R} | E_{T(s,a)}[V*_i(s_a)] − E_{T(s,∅)}[V*_i(s_∅)] |.   (3)

Insofar as the Monte Carlo integration approximates d_AU, the attainable utility distance sheds light on the attainable utility penalty term.

Theorem 2 (Movement penalties are small). Let Δ ≠ Δ′. Suppose that all states in the support of Δ can deterministically reach in one step all states in the support of Δ′, and vice versa. Then 0 < d_AU(Δ, Δ′) < 1. In general, we only have 0 ≤ d_AU(Δ, Δ′) < 1/(1 − γ) (corollary 8 in appendix B.2).

The intuitive notion of "power" corresponds to the ability to achieve goals in general, which can be formalized as average optimal value. For example, resources increase average optimal value, while immobility decreases it.

Definition (Average optimal value [26]). V*_avg(s) := E_D[V*_R(s)].

AUP significantly penalizes change in expected power compared to inaction. In fig. 1b, AUP heavily penalized the destruction of green cells. Destroying these cells reduces power.

Theorem 3 (Power-shift penalties are large). d_AU(Δ, Δ′) ≥ | E_Δ[V*_avg(s)] − E_Δ′[V*_avg(s′)] |.

In Conway's Game of Life, cells are alive or dead. Depending on how many live neighbors surround a cell, the cell comes to life, dies, or retains its state. Even simple initial conditions can evolve into complex and chaotic patterns, and the Game of Life is Turing-complete when played on an infinite grid [20]. SafeLife turns the Game of Life into an actual game. An autonomous agent moves freely through the world, which is a large finite grid. There are many colors and kinds of cells, many of which have unique effects (see fig. 2).
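For reference, a minimal NumPy sketch of the standard Game of Life update rule described above, on a wrap-around grid (our own illustration; SafeLife's engine additionally freezes the cells adjacent to the agent and adds many other cell types):

```python
import numpy as np

def life_step(alive):
    """One step of Conway's Game of Life on a toroidal (wrap-around) grid.

    alive: 2D boolean array; True marks a living cell.
    """
    # Count living neighbors by summing the eight shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(alive, dr, axis=0), dc, axis=1)
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A dead cell with exactly 3 neighbors spawns; a living cell with 2 or 3 survives.
    return (neighbors == 3) | (alive & (neighbors == 2))

# Example: a "blinker" oscillates with period 2.
grid = np.zeros((5, 5), dtype=bool)
grid[2, 1:4] = True
next_grid = life_step(grid)
```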
Figure 2: (a) append-spawn; (b) prune-still-easy. Trees are permanent living cells. The agent can move crates but not walls. The screen wraps vertically and horizontally. (a): The agent receives reward for creating gray cells in the blue areas. The goal can be entered when some number of gray cells are present. Spawners stochastically create yellow living cells. (b): After the agent removes some number of red cells, the goal turns red and can be entered.

Wainwright and Eckersley [27] score side effects as the degree to which the agent perturbs green cell patterns. Over an episode of T time steps, side effects are quantified as the Wasserstein 1-distance between the configuration of green cells had the state evolved naturally for T time steps, and the actual final configuration. As the primary reward function R is indifferent to green cells, this proxy measures the safety performance of learned policies. If the agent never disturbs green cells, it achieves a perfect score of zero. By construction, minimizing side effect score preserves R_true's optimal value, since R_true encodes our preferences about the existing green patterns.

As shown in table 1, Turner et al. [25] evaluated AUP on toy environments. In contrast, SafeLife vigorously challenges modern reinforcement learning algorithms.

AI safety gridworlds [16]            SafeLife [27]
Dozens of states                     Millions of states
Deterministic dynamics               Stochastic dynamics
Handful of preset environments       Randomly generated environments
One side effect per level            Many side effect opportunities
Immediate side effects               Chaos unfolds over time

Table 1: SafeLife is ideal for testing side effect avoidance.
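A simplified sketch of the side effect score described above, assuming the counterfactual board (the state evolved naturally for T steps without the agent) has already been computed; it uses the POT optimal-transport package and treats green cells as uniform point masses, which only approximates SafeLife's exact scoring.

```python
import numpy as np
import ot  # Python Optimal Transport package (pip install pot)

def green_cell_coords(board, green_value):
    """Return an (n, 2) array of coordinates of green cells on the board."""
    return np.argwhere(board == green_value).astype(float)

def side_effect_score(final_board, counterfactual_board, green_value=1):
    """Wasserstein-1 distance between the agent-perturbed and naturally evolved
    green-cell configurations (a simplified stand-in for SafeLife's metric)."""
    actual = green_cell_coords(final_board, green_value)
    natural = green_cell_coords(counterfactual_board, green_value)
    if len(actual) == 0 and len(natural) == 0:
        return 0.0  # nothing to compare: perfect score
    if len(actual) == 0 or len(natural) == 0:
        return float(max(len(actual), len(natural)))  # crude penalty for total loss or creation
    # Uniform weights over cells; cost = Euclidean distance between coordinates.
    cost = ot.dist(actual, natural, metric='euclidean')
    a = np.full(len(actual), 1.0 / len(actual))
    b = np.full(len(natural), 1.0 / len(natural))
    return float(ot.emd2(a, b, cost))
```

An agent that leaves the green cells exactly as they would have evolved on their own scores zero under this sketch, matching the "perfect score" described in the text.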
The agent can move in the cardinal directions, spawn/destroy a living cell in the cardinal directions, or do nothing. We have four conditions: PPO, AUP, AUP_proj, and Naive. Each condition is trained with PPO on a different reward signal for five million (5M) time steps. See the supplemental material for architectural and training details.
PPO: Trained on the primary SafeLife reward function R, without a side effect penalty.

AUP: For the first 100K time steps, the agent randomly explores to collect observation frames. These frames are used to train a continuous Bernoulli variational autoencoder with a Z-dimensional latent space and encoder network E. If Z = 1, the auxiliary reward is the output of the encoder E. Otherwise, we draw linear functionals φ_i uniformly at random from (0, 1)^Z. The auxiliary reward function R_i is defined as φ_i ∘ E : S → ℝ. For each of the |R| auxiliary reward functions, we learn a Q-value network for 1M time steps. The learned Q_{R_i} define the penalty term of eq. (1). The agent then learns R_AUP for 3.9M steps, during which time λ is linearly increased from .001 to λ*.
AUP_proj: The same as AUP, but the auxiliary reward function is a random projection from the downsampled observation space to ℝ, without using a variational autoencoder.

Naive: Trained on the primary reward function R minus (roughly) the L1 distance between the current state and the initial state (see the sketch below). The agent is penalized when cells differ from their initial values. Wainwright and Eckersley [27] found that an unscaled L1 penalty produced the best results. While a good benchmark for ideal behavior in certain static tasks, penalizing state change often fails to avoid crucial side effects. State change penalties do not differentiate between moving a box and irreversibly wedging a box in a corner [14].

The default settings are: N_env = 8 randomly generated environments in the curriculum, Z = 1 latent space dimension, |R| = 1 auxiliary reward function, and a final R_AUP penalty severity λ*. The discount rate γ and remaining hyperparameters are listed in appendix D.

We evaluate the conditions on the append-spawn (fig. 2a) and prune-still-easy (fig. 2b) tasks. Furthermore, we include two easier variants of append-spawn: append-still (no stochastic spawners) and append-still-easy (no stochastic spawners, fewer green cells).

We conduct three trials. At the beginning of each trial, SafeLife randomly generates N_env environments for the given task. The conditions are evaluated on the same set of random environments. The curriculum is a random sequence of these environments.

For append-still, we allotted an extra 1M steps to achieve convergence for all agents. For append-spawn, agents pretrain on append-still-easy environments for the first 2M steps and train on append-spawn for 3M steps. For AUP in append-spawn, the autoencoder and auxiliary network are trained on both tasks. R_AUP is then pretrained for 2M steps and trained for 1.9M steps.
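As referenced in the Naive condition above, a minimal sketch of a state-change penalty of this kind (our own approximation, not SafeLife's exact implementation):

```python
import numpy as np

def naive_reward(primary_reward, board, initial_board):
    """Primary reward minus an (unscaled) penalty on deviation from the initial state.

    board, initial_board: integer arrays of cell types with the same shape.
    Every cell whose value differs from its initial value adds to the penalty,
    whether or not the change is reversible.
    """
    changed = (board != initial_board).astype(float)
    return primary_reward - np.sum(changed)
```

Because the penalty is computed cell by cell against the initial configuration, it treats reversible rearrangements and irreversible destruction identically, which is exactly the weakness noted above.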
Results. In append-still-easy, even though AUP waits 1.1M steps to start training on R_AUP, AUP is competitive with PPO by step 1.75M (see fig. 3). By step 2.75M, AUP consistently outperforms PPO while incurring less than a twentieth of the side effects. Naive also does very well and learns more quickly than PPO. AUP_proj does better than PPO but worse than AUP, perhaps implying that the one-dimensional encoder provides more structure than a random projection.

In prune-still-easy, all four conditions competitively accrue reward. PPO and AUP_proj frequently have side effects. AUP avoids side effects, but not as well as in append-still-easy. The Naive benchmark has fewer side effects than AUP. This makes sense: since these environments are static, Naive almost directly penalizes our unobserved side effect metric (change to the green cells).

The append-still environments contain more green cells. Once again, AUP has far fewer side effects than PPO. AUP_proj and Naive both flounder, earning significantly lower return than PPO. While Naive appears to have fewer side effects than AUP, this is only because Naive usually does nothing. In the supplemental material, we display episode lengths over the course of training: Naive converges to an average episode length of 843 (the maximum is 1,001). Even PPO has an average episode length of 548. In stark contrast, AUP learns effective and decisive policies with half the average length of PPO. AUP frequently attains a length of only 43 (near optimal). AUP significantly decreases episode length for most tasks, perhaps because AUP applies small movement penalties (theorem 2).

The append-spawn environments contain both stochastic yellow cell spawners and more green cells. These environments challenge PPO, which earns less reward and yet causes more side effects. AUP_proj completely fails to learn. Naive usually fails to get any reward, as its policy erratically wanders the environment. AUP is once again superior to PPO: 131% of the reward, 46% of the side effects.

Figure 3: Learning curves with shaded regions representing ± standard deviation. AUP begins training on R_AUP at step 1.1M. AUP has far fewer side effects than PPO. (Panels show side effect score and reward over training steps, in millions, for append-still-easy, prune-still-easy, append-still, and append-spawn.)

Hyperparameter sweep

Method. In append-still-easy, we evaluate AUP over four settings of λ* and over (N_env, Z) pairs drawn from four N_env settings (including N_env = ∞, meaning that each episode takes place in a new environment) crossed with Z ∈ {1, 4, 16, 64}. We also evaluate PPO on each N_env setting. For each setting, we record both the side-effect score and the return of the learned policy, averaged over three trials. We use default settings for all unmodified parameters.

Figure 4: Side effect score, averaged over three learned policies for append-still-easy. Lower score is better. Default AUP setting outlined in black.

Figure 5: Episodic reward, averaged over three learned policies for append-still-easy. Higher reward is better, although AUP only aims to match PPO. Default AUP setting outlined in black.
Results. As N_env increases, reward tends to decrease and side effect score tends to increase. This performance degradation does not seem to be due to Attainable Utility Preservation, but rather because Proximal Policy Optimization has challenges generalizing (see fig. 4 and fig. 5). In particular, PPO accrues less reward and induces more side effects as N_env increases. However, even when N_env = ∞, AUP (Z = 16) shows the potential to significantly reduce side effects without reducing episodic return.

AUP's default configuration achieves 117% of PPO's episodic return, without any of the side effects. The AUP penalty term might be acting as a shaping reward. This is intriguing: shaping usually requires knowledge of the desired task, whereas the auxiliary reward function is randomly generated. Additionally, once AUP begins learning R_AUP at step 1.1M, AUP learns much more quickly than PPO did (fig. 3); this supports the shaping hypothesis. AUP imposed minimal overhead: due to its increased sample efficiency, AUP reaches PPO's asymptotic episodic return at the same time as PPO.

Surprisingly, AUP does well with a single latent space dimension (Z = 1). As Z increases, so does AUP's side effect score. In the supplementary material, our data show that higher-dimensional auxiliary reward functions are harder to learn, resulting in a poorly learned auxiliary Q-function. Nonetheless, for each N_env setting, AUP's worst configuration has significantly fewer side effects than PPO.

Surprisingly, AUP also does well with a single auxiliary reward function (|R| = 1). We hypothesize that destroying patterns decreases power; by theorem 3, this is penalized in the limit of |R| → ∞. Furthermore, we believe that decreasing power usually decreases optimal value for any given single auxiliary reward function. Since R_AUP penalizes optimal value decrease, this might explain why AUP does well with one auxiliary reward function.

At larger values of λ*, AUP becomes more conservative. As λ* increases further, AUP stops moving entirely. AUP only regularizes learned policies, so AUP can still make expensive mistakes during training.
We successfully scaled AUP to complex environments without providing task-specific knowledge: the auxiliary reward function was derived from a one-dimensional variational autoencoder trained through random exploration. To the best of our knowledge, AUP is the first task-agnostic approach which avoids side effects and competitively achieves reward in complex environments.

Wainwright and Eckersley [27] speculated that avoiding side effects must necessarily decrease performance on the primary task. This may be true for optimal policies, but not necessarily for learned policies. AUP significantly improved performance on append-still-easy and append-spawn, while matching performance on prune-still-easy and append-still.

AUP_proj enjoyed moderate success on the easier tasks. This suggests that AUP works (to varying extents) for a wide range of uninformative reward functions.

While Naive penalizes every state perturbation equally, AUP applies penalty in proportion to irreversibility. For example, the agent could move crates around (and then put them back later). AUP incurred little penalty for doing so, while Naive was more constrained. We believe that AUP will continue to scale to useful applications, in part because it naturally accounts for irreversibility.
Future work.
Off-policy learning could allow simultaneous training of the auxiliary R_i and of R_AUP. Instead of learning an auxiliary Q-function, the agent could just learn the auxiliary advantage function with respect to inaction. The SafeLife suite includes more challenging variants of prune-still-easy. SafeLife also includes difficult navigation tasks, in which the agent must reach the goal by wading either through fragile green patterns or through robust yellow patterns.

AUP's excellent performance when |R| = Z = 1 raises interesting questions. Turner et al. [25]'s small "Options" environment required |R| = 25 for good performance. SafeLife environments are much larger than Options (table 1), so why does |R| = 1 perform so well? To what extent does the AUP penalty term provide reward shaping? Why do one-dimensional encodings provide a learnable reward signal?
Conclusion.
To realize the full potential of reinforcement learning, we need more than algorithms which train policies that optimize a specified reward function. We also need to be able to specify the right reward function. Fundamentally, we face a frame problem: we often know what we want the agent to do, but we cannot list everything we want the agent not to do. AUP scales to challenging domains, incurs modest overhead, performs competitively on the original task, and avoids side effects, all without explicit information as to what constitutes a "side effect".
Broader Impact
A scalable side effect avoidance method would ease the challenge of reward specification and aid deployment of reinforcement learning in situations where mistakes are costly. Conversely, developers should carefully consider how reinforcement learning algorithms might produce policies with catastrophic impact. Developers should not blindly rely on even a well-tested side effect penalty.

Acknowledgments
This work was made possible by the Center for Effective Altruism and the Long-term Future Fund. We thank Joshua Turner for help compiling fig. 4 and fig. 5. Scott Emmons inspired theorem 9. We thank Stuart Armstrong, Andrew Critch, Evan Hubinger, Dylan Hadfield-Menell, Matthew Olson, Rohin Shah, Logan Smith, and Carroll Wainwright for their ideas and feedback.
References [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization.In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages22–31, 2017.[2] Eitan Altman.
Constrained Markov decision processes , volume 7. CRC Press, 1999.[3] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-basedreinforcement learning with stability guarantees. In
Advances in Neural Information ProcessingSystems , pages 908–918, 2017.[4] Nick Bostrom.
Superintelligence . Oxford University Press, 2014.[5] Frank M Brown.
The Frame Problem in Artificial Intelligence: Proceedings of the 1987Workshop . Morgan Kaufmann, 2014.[6] Bart Bussmann, Jacqueline Heinerman, and Joel Lehman. Towards empathic deep Q-learning. arXiv:1906.10918 , 2019.[7] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. ALyapunov-based approach to safe reinforcement learning. In
Advances in Neural InformationProcessing Systems , pages 8092–8101, 2018.[8] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deepreinforcement learning from human preferences. In
Advances in Neural Information ProcessingSystems , pages 4299–4307, 2017.[9] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning witha corrupted reward channel. In
Proceedings of the Twenty-Sixth International Joint Conferenceon Artificial Intelligence, IJCAI-17 , pages 4705–4713, 2017.[10] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning toreset for safe and autonomous reinforcement learning. In
International Conference on LearningRepresentations , 2018.[11] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning.
Journal of Machine Learning Research , 16(1):1437–1480, 2015.[12] Martin Gardner. The fantastic combinations of John Conway’s new solitaire game ‘life’.
Scientific American , 223(4):120–123, 1970.[13] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inversereward design. In
Advances in Neural Information Processing Systems , pages 6765–6774, 2017.[14] Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoidingside effects using relative reachability.
CoRR , abs/1806.01186, 2018.[15] Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, RamanaKumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: the flip side of AIingenuity, 2020. URL https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity .[16] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq,Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv:1711.09883 , November 2017.917] Gabriel Loaiza-Ganem and John P Cunningham. The continuous Bernoulli: fixing a pervasiveerror in variational autoencoders. In
Advances in Neural Information Processing Systems , pages13266–13276, 2019.[18] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning–anoverview. In
International Workshop on Modelling and Simulation for Autonomous Systems ,pages 357–375. Springer, 2014.[19] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs usingnondominated policies. In
AAAI , 2010.[20] Paul Rendell. Turing universality of the game of life. In
Collision-based computing , pages513–539. Springer, 2002.[21] Stuart Russell.
Human compatible: Artificial intelligence and the problem of control . Viking,2019.[22] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxi-mators. In
International Conference on Machine Learning , pages 1312–1320, 2015.[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximalpolicy optimization algorithms. arXiv:1707.06347 , 2017.[24] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Theimplicit preference information in an initial state. In
International Conference on LearningRepresentations , 2019.[25] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency viaattainable utility preservation. In
Proceedings of the AAAI/ACM Conference on AI, Ethics, andSociety , pages 385–391, 2020.[26] Alexander Matt Turner, Logan Smith, Rohin Shah, and Prasad Tadepalli. Optimal FarsightedAgents Tend to Seek Power. arXiv:1912.01683 , April 2020.[27] Carroll L Wainwright and Peter Eckersley. Safelife 1.0: Exploring side effects in complexenvironments. arXiv:1912.01217 , 2019.[28] Shun Zhang, Edmund H Durfee, and Satinder P Singh. Minimax-regret querying on side effectsfor safe optimality in factored Markov decision processes. In
Proceedings of the Twenty-SeventhInternational Joint Conference on Artificial Intelligence, IJCAI-18 , pages 4867–4873, 2018.10
A Additional data
Figure 6: Episode length curves (episode length versus training steps, in millions) for append-still-easy, prune-still-easy, append-still, and append-spawn, with shaded regions representing ± standard deviation. AUP begins training on R_AUP at step 1.1M. AUP significantly decreases episode length in the append tasks.

Figure 7: Auxiliary reward curves for AUP on append-still-easy (with a Z-dimensional latent space, Z ∈ {1, 4, 16, 64}; auxiliary reward versus training steps, in thousands), with shaded regions representing ± standard deviation. Auxiliary reward is not comparable across trials, so learning is expressed by the slope of the curves.

B Theoretical results
Consider a rewardless MDP ⟨S, A, T, γ⟩ whose state space S and action space A are both finite, and γ ∈ [0, 1). Reward functions R ∈ ℝ^S have corresponding optimal value functions V*_R(s).

Definition. Let D be a continuous distribution over reward functions bounded [0, 1], with probability measure F. The attainable utility distance between state distributions Δ, Δ′ ∈ Δ(S) is

  d_AU(Δ, Δ′) := ∫_D | E_Δ[V*_R(s)] − E_Δ′[V*_R(s′)] | dF(R).   (2)

Restriction to degenerate distributions yields a distance metric over the state space.

B.1 Main results

Theorem 1. d_AU is a distance metric on Δ(S).

Proof. For Δ, Δ′, Δ″ ∈ Δ(S):

1. d_AU(Δ, Δ′) ≥ 0.
2. d_AU(Δ, Δ′) = 0 iff Δ = Δ′.
3. d_AU(Δ, Δ′) = d_AU(Δ′, Δ).
4. d_AU(Δ, Δ″) ≤ d_AU(Δ, Δ′) + d_AU(Δ′, Δ″).

Properties 1 and 3 are trivially true. Property 2 follows from lemma 11. Property 4 follows from applying the triangle inequality for real numbers to the integrand.

Theorem 2 (Movement penalties are small). Let Δ ≠ Δ′. Suppose that all states in the support of Δ can deterministically reach in one step all states in the support of Δ′, and vice versa. Then 0 < d_AU(Δ, Δ′) < 1.

Proof. 0 < d_AU(Δ, Δ′) by theorem 1. Let R ∈ D. By proposition 4, |V*_R(s) − V*_R(s′)| ≤ (1 − γ) max(V*_R(s), V*_R(s′)) ≤ 1. Then

  d_AU(Δ, Δ′) ≤ ∫_D E_{Δ,Δ′}[ |V*_R(s) − V*_R(s′)| ] dF(R)   (4)
             < ∫_D E_{Δ,Δ′}[ (1 − γ) max(V*_R(s), V*_R(s′)) ] dF(R)   (5)
             ≤ 1.   (6)

Equation (5) follows because D is continuous, so it cannot be the case that almost all reward functions assign 0 reward to either s or s′.

Definition (Average optimal value [26]). V*_avg(s) := E_D[V*_R(s)].

Theorem 3 (Power-shift penalties are large). d_AU(Δ, Δ′) ≥ | E_Δ[V*_avg(s)] − E_Δ′[V*_avg(s′)] |.

Proof. Apply the reverse triangle inequality to the integrand of d_AU(Δ, Δ′) and use the linearity of expectation.

We derive a principled motivation for preserving the reachability of the initial state, bounding the decrease in V*_{R_true} by how many steps it takes to return to the initial state s.

Proposition 4 (Communicability bounds maximum change in attainable utility). If s can reach s′ in k_1 steps and s′ can reach s in k_2 steps, then

  max_{R ∈ [b,c]^S} |V*_R(s) − V*_R(s′)| ≤ (1 − γ^{max(k_1,k_2)}) (c − b) / (1 − γ) < (c − b) / (1 − γ).

In particular, max_{R ∈ [b,c]^S} V*_R(s) − V*_R(s′) ≤ (1 − γ^{k_2}) (c − b) / (1 − γ).

Proof. We first bound the maximum increase:

  max_{R ∈ [b,c]^S} V*_R(s′) − V*_R(s) ≤ max_{R ∈ [b,c]^S} V*_R(s′) − ( b (1 − γ^{k_1}) / (1 − γ) + γ^{k_1} V*_R(s′) )   (7)
    ≤ c / (1 − γ) − ( b (1 − γ^{k_1}) / (1 − γ) + γ^{k_1} c / (1 − γ) )   (8)
    = (1 − γ^{k_1}) (c − b) / (1 − γ).   (9)

Equation (7) holds because even if we make R equal b for as many states as possible, s′ is still reachable from s. The case for maximum decrease is similar.
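As a concrete reading of proposition 4 (a worked instance we add for illustration), take rewards bounded in [0, 1] (b = 0, c = 1) and states that are mutually reachable in one step (k_1 = k_2 = 1):

```latex
% Worked instance of proposition 4 with b = 0, c = 1, k_1 = k_2 = 1.
\max_{R \in [0,1]^{\mathcal{S}}} \bigl| V^*_R(s) - V^*_R(s') \bigr|
  \;\le\; \bigl(1 - \gamma^{\max(k_1, k_2)}\bigr)\,\frac{c - b}{1 - \gamma}
  \;=\; (1 - \gamma)\,\frac{1}{1 - \gamma}
  \;=\; 1.
```

This matches the constant in theorem 2: one-step movement can change a [0, 1]-bounded attainable utility by at most 1, whereas states that are many steps apart only satisfy the loose bound (c − b)/(1 − γ).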
Positive affine transformation of D allows generalization of our results to other bounds, as the optimal policy is invariant to positive affine transformation of the reward function.

Proposition 5. Let D′ be any positive affine transformation mX + C of D. Then

  d^{D′}_AU(Δ, Δ′) = m · d^{D}_AU(Δ, Δ′).   (10)

B.2 Additional results

Lemma 6. d_AU(Δ, Δ′) ≤ E_{s∼Δ, s′∼Δ′}[ d_AU(s, s′) ].

Lemma 7. ∀ s, s′: d_AU(s, s′) < 1/(1 − γ).

Proof. Because optimal value is bounded [0, 1/(1 − γ)], d_AU(s, s′) ≤ 1/(1 − γ). Equality holds iff for almost all R ∈ D, V*_R(s) = 1/(1 − γ) and V*_R(s′) = 0, or vice versa. But because D is continuous, s′ must induce positive optimal value for a positive measure set of reward functions.

Corollary 8. d_AU(Δ, Δ′) < 1/(1 − γ).

Theorem 9 (Reward functions induce unique optimal value functions). R ↦ V*_R is injective.

Proof. Given V*_R and the rewardless MDP, deduce an optimal policy π* for R by choosing a V*_R-greedy action for each state. Let T_{π*} be the transition probabilities under π*. Then

  V*_R = R + γ T_{π*} V*_R   (11)
  (I − γ T_{π*}) V*_R = R.   (12)

Definition. Let e_s represent the unit vector for state s. The state visitation distribution induced by following π from state s is

  f^π_s := Σ_{t=0}^∞ γ^t E[ e_{s′} | π followed for t steps from s ].   (13)

Lemma 10. The elements of { f^π_s | s ∈ S } are linearly independent.

Proof. Consider the all-zero optimal value function, with optimal policy π*. Theorem 9 implies that the following homogeneous system of equations has a unique solution for r:

  f^{π* ⊤}_{s_1} r = 0, …, f^{π* ⊤}_{s_{|S|}} r = 0.

That solution is clearly the all-zero reward function (for which all policies are optimal), so the f^{π*}_s are linearly independent. Since every policy is optimal for the all-zero reward function, we conclude that the f^π_s are linearly independent for any policy π.

Turner et al. [26]'s theorem 33 shows the following result for deterministic dynamics and for single states s ≠ s′. We generalize to the stochastic case and to distributions over states.

Lemma 11. If Δ ≠ Δ′, then P( E_Δ[V*_R(s)] = E_Δ′[V*_R(s′)] | R ∼ D ) = 0.

Proof. Let R ∈ D (also written r ∈ ℝ^{|S|}), and let π* be one of its optimal policies. By lemma 10, E_Δ[f^{π*}_s] = E_Δ′[f^{π*}_{s′}] iff Δ = Δ′. Therefore, E_Δ[f^{π*}_s] ≠ E_Δ′[f^{π*}_{s′}]. Trivially, E_Δ[V*_R(s)] = E_Δ′[V*_R(s′)] iff E_Δ[f^{π*}_s]^⊤ r = E_Δ′[f^{π*}_{s′}]^⊤ r. Since E_Δ[f^{π*}_s] ≠ E_Δ′[f^{π*}_{s′}], the set of satisfactory r has no interior in the subspace topology induced by D's support. This convex set has zero Lebesgue measure; by the Radon-Nikodym theorem, it also has zero measure under continuous distributions.
C Training details

We detail how we trained the AUP and AUP_proj conditions.

C.1 R_aux training

For the first phase of training, our goal is to learn Q_aux, allowing us to compute the AUP penalty in the second phase of training. Due to the size of the full SafeLife state (350 × … × …), both conditions downsample the observations with average pooling and convert them to intensity values. Previously, Turner et al. [25] learned Q_aux with tabular Q-learning. They used environments small enough that reward could be assigned to each state. Because SafeLife environments are too large for tabular Q-learning, we demonstrated two methods for randomly generating an auxiliary reward function.
AUP: We acquire a low-dimensional state representation by training a continuous Bernoulli variational autoencoder (CB-VAE) [17]. To train the CB-VAE, we collect a buffer of observations by acting randomly for 100,000/N_env steps in each of the N_env environments. This gives us 100K total observations with an N_env-environment curriculum. We train the CB-VAE for 100 epochs, preserving the encoder E for downstream auxiliary reward training. For each auxiliary reward function, we draw a linear functional uniformly from (0, 1)^Z to serve as our auxiliary reward function, where Z is the dimension of the CB-VAE's latent space. The auxiliary reward for an observation is the composition of the linear functional with the observation's latent representation (see the sketch at the end of this subsection).
AUP_proj: Instead of using a CB-VAE, AUP_proj simply downsamples the input observation. At the beginning of training, we generate a linear functional over the unit hypercube (with respect to the downsampled observation space). The auxiliary reward for an observation is the composition of the linear functional with the downsampled observation.

After the CB-VAE has been trained, we start auxiliary reward training in the corresponding SafeLife environment. To learn Q_aux, we modify the value function in PPO to a Q-function. Our training algorithm for phase 1 only differs from PPO in how we calculate reward. We train each auxiliary reward function for 1M steps.
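As referenced above, a minimal sketch of how an auxiliary reward function of the form φ ∘ E might be assembled from a trained encoder; the encoder interface, the names, and the stand-in encoder in the example are assumptions of this sketch, not the released code.

```python
import numpy as np

def make_auxiliary_reward(encoder, latent_dim, rng=None):
    """Build an auxiliary reward function R_i = phi ∘ E from a trained encoder.

    encoder:    callable mapping a (downsampled) observation to a length-latent_dim
                latent vector, e.g. the mean of the CB-VAE posterior.
    latent_dim: Z, the dimension of the latent space.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if latent_dim == 1:
        # With a one-dimensional latent space, the encoder output itself is the reward.
        return lambda obs: float(encoder(obs)[0])
    # Otherwise, draw a random linear functional phi from the unit hypercube (0, 1)^Z.
    phi = rng.uniform(0.0, 1.0, size=latent_dim)
    return lambda obs: float(phi @ np.asarray(encoder(obs)))

# Example with a stand-in "encoder" (a fixed random projection of flattened observations).
proj = np.random.default_rng(0).normal(size=(4, 25 * 25))
fake_encoder = lambda obs: proj @ obs.reshape(-1)
aux_reward = make_auxiliary_reward(fake_encoder, latent_dim=4)
reward = aux_reward(np.zeros((25, 25)))
```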
C.2 R_AUP training

In phase 2, we train a new PPO agent on R_AUP (eq. (1)) for the corresponding SafeLife task. Each step, the agent selects an action a in state s according to its policy π_AUP, and receives reward R_AUP(s, a) from the environment. We compute R_AUP(s, a) with the learned Q-values Q_aux(s, ∅) and Q_aux(s, a). The penalty term is modulated by the hyperparameter λ, which is linearly scaled from .001 to some final value λ*. Because λ controls the relative influence of the penalty, linearly increasing λ over time prioritizes primary task learning early in training and slowly encourages the agent to obtain the same reward while avoiding side effects. If the value of λ is too high (if side effects are too costly), the agent won't have time to adapt its current policy and will choose inaction (∅) to escape the penalty. A careful λ schedule helps induce a successful policy that also avoids side effects.
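A minimal sketch of such a linear λ schedule; the step counts and the final value below are placeholders for illustration, not the paper's settings.

```python
def lambda_schedule(step, start_step=1_100_000, end_step=5_000_000,
                    lambda_init=1e-3, lambda_final=1.0):
    """Linearly anneal the penalty coefficient from lambda_init to lambda_final
    over the R_AUP training phase; before start_step, keep the initial value."""
    if step <= start_step:
        return lambda_init
    if step >= end_step:
        return lambda_final
    frac = (step - start_step) / (end_step - start_step)
    return lambda_init + frac * (lambda_final - lambda_init)
```

Calling lambda_schedule(step) at each update then yields the coefficient used in eq. (1) for that step.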
Table 2 lists the hyperparameters used for all conditions, which generally match the default SafeLife settings. Common refers to those hyperparameters that are the same for each evaluated condition. AUX refers to hyperparameters used only when training the auxiliary reward functions; thus, it only pertains to AUP and AUP_proj. The conditions PPO and Naive use the PPO hyperparameters for the duration of their training, while AUP and AUP_proj use them when training with respect to R_AUP.

Hyperparameter                        Value
Common
  Learning Rate                       …
  Optimizer                           Adam
  Gamma (γ)                           …
  Lambda (PPO)                        …
  Lambda (AUP)                        .001 → λ*
  Entropy Clip                        …
  Value Coefficient                   …
  Gradient Norm Clip                  …
  Clip Epsilon                        …
AUX
  Entropy Coefficient                 …
  Training Steps                      1 · 10^6
PPO
  Entropy Coefficient                 …
Policy
  Number of Hidden Layers             …
  Output Channels in Hidden Layers    (32, …, …)
  Nonlinearity                        ReLU
CB-VAE
  Learning Rate                       …
  Optimizer                           Adam
  Latent Space Dimension (Z)          1
  Batch Size                          …
  Training Epochs                     100
  Epsilon                             …
  Number of Hidden Layers (encoder)   …
  Number of Hidden Layers (decoder)   …
  Hidden Layer Width (encoder)        (512, …, …, …, …, …)
  Hidden Layer Width (decoder)        (128, …, …, …, …, output)
  Nonlinearity                        ELU

Table 2: Chosen hyperparameters.
E Compute environment
Condition      GPU-hours per trial
PPO            …
AUP            …
AUP_proj       …