Counterfactual Planning in AGI Systems
Koen Holtman
Eindhoven, The Netherlands
January 2021
Abstract
We present counterfactual planning as a design approach for creating a range of safety mechanisms that can be applied in hypothetical future AI systems which have Artificial General Intelligence.

The key step in counterfactual planning is to use an AGI machine learning system to construct a counterfactual world model, designed to be different from the real world the system is in. A counterfactual planning agent determines the action that best maximizes expected utility in this counterfactual planning world, and then performs the same action in the real world.

We use counterfactual planning to construct an AGI agent emergency stop button, and a safety interlock that will automatically stop the agent before it undergoes an intelligence explosion. We also construct an agent with an input terminal that can be used by humans to iteratively improve the agent's reward function, where the incentive for the agent to manipulate this improvement process is suppressed. As an example of counterfactual planning in a non-agent AGI system, we construct a counterfactual oracle.

As a design approach, counterfactual planning is built around the use of a graphical notation for defining mathematical counterfactuals. This two-diagram notation also provides a compact and readable language for reasoning about the complex types of self-referencing and indirect representation which are typically present inside machine learning agents.
1 Introduction

Artificial General Intelligence (AGI) systems are hypothetical future machine reasoning systems that can match or exceed the capabilities of humans in general problem solving. While it is still unclear if AGI systems could ever be built, we can already study AGI related risks and potential safety mechanisms [Bos14, Rus19, ELH18].

In this paper, we introduce counterfactual planning as a design approach for creating a range of AGI safety mechanisms. Counterfactual planning is built around a graphical modeling system that provides a specific vantage point on the internal construction of machine learning based agents. This vantage point was designed to make certain safety problems and solutions more tractable.

An AI agent is an autonomous system which is programmed to use its sensors and actuators to achieve specific goals. A well-known risk in using AI agents is that the agent might mispredict the results of its own actions, causing it to take actions that produce a disaster. The main risk driver we consider here is different. It is the risk that an inaccurate or incomplete specification of the agent goals produces a disaster.

Any AGI agent goal specification created by humans will likely be somewhat inaccurate, no matter whether it is created by hand-coding or by machine learning from selected examples [Hol20b, HMH19]. If one gives an even slightly under-specified goal to a very powerful autonomous system, there is a risk that the system may end up perfectly achieving this goal, while also producing several unexpected and very harmful side effects. This motivates research into AGI emergency stop buttons, interlocks which can limit the power of the agent, and safe ways to update the goal while the agent runs.
When writing about AGI systems, one can use either natural language, mathematical notation, or a combination of both. A natural language-only text has the advantage of being accessible to a larger audience. Books like Superintelligence [Bos14] and Human Compatible [Rus19] avoid the use of mathematical notation in the main text, while making a clear and convincing case for the existence of specific existential risks from AGI, even though these risks are currently difficult to quantify.

However, natural language has several shortcomings when it is used to explore and define specific technical solutions for managing AGI risks. One particular problem is that it lacks the means to accurately express the complex types of self-referencing and indirect representation that can be present inside online machine learning agents and their safety components. To solve this problem, we introduce a compact graphical notation. This notation unambiguously represents these internal details by using two diagrams: a learning world diagram and a planning world diagram.

1.2 AGI safety as a policy problem

Long-term AGI safety is not just a technical problem, but also a policy problem. While technical progress on safety can sometimes be made by leveraging a type of mathematics that is only accessible to a handful of specialists, policy progress typically requires the use of more accessible language. Policy discussions can move faster, and produce better and more equitable outcomes, when the description of a proposal and its limitations can be made more accessible to all stakeholder groups.

One specific aim of this work is to develop a comprehensive vocabulary for describing certain AGI safety solutions, a vocabulary that is as accessible as possible. However, the vocabulary we develop has too much mathematical notation to be accessible to all members of any possible stakeholder group. So the underlying assumption is that each stakeholder group will have access to a certain basic level of technical expertise.

At several points in the text, we have also included comments that aim to explain and demystify the vocabulary and concerns of some specific AGI related sub-fields in mathematics, technology, and philosophy.
In the general AI/ML literature that is concerned with improving system performance, counterfactual planning has been used to improve performance in several application domains. See for example [ZJBP08] and [BPQC+18], which consider both AI and AGI level systems.

In the literature on encoding specific human values into machine reasoning systems, counterfactuals have been used to encode non-discriminatory fairness towards individuals, for example in [KLRS17], and also other human moral principles in [PS17].
Sections 2 – 4 introduce the main elements of our graphical notation and modeling system. Readers already familiar with Causal Influence Diagram (CID) notation, as it is used to define agents, will be able to skim or skip most of this material.

Sections 5 – 7 specify three example counterfactual planning agents. These are used in the remaining sections to illustrate further aspects of counterfactual planning. Starting from section 6.2, it should be possible for all readers to skim or skip sections, or to read sections in a different order.
The standard work which defines mathematical counterfactuals is the book Causality by Judea Pearl [Pea09]. This book mainly targets an audience of applied statisticians, for example those in the medical field, and its style of presentation is not very accessible to a more general technical audience. Pearl is also mainly concerned with the use of causal models as theories about the real world which can guide the interpretation of statistical data. Much of the discussion in Causality is about questions of statistical epistemology and decision making. In this text, we will use causal models to construct agent specifications, not theories about the world. When we clarify issues of epistemology here, they tend to be different issues.

The debate among philosophers about the validity of Pearl's statistical epistemology is still ongoing, as is usual for such philosophical debates. In the AGI community, where the epistemology of machine learning is a frequent topic of discussion, this has perhaps made the status of mathematical counterfactuals as useful and well-defined mathematical tools more precarious than it should be. Because of these considerations, we have written this section to avoid any direct reference to Pearl's definitions and explanations in [Pea09], even though at a deeper mathematical level, we define the same system of causal models and counterfactuals.

A world model is a mathematical model of a particular world. This can be our real world, or an imaginary world. To make a mathematical model into a model of a particular world, we need to specify how some of the variables in the model relate to observable phenomena in that world.

We introduce our graphical notation for building world models by creating an example graphical model of a game world. In the game world, a simple game of dice is being played. The player throws a green die and a red die, and then computes their score by adding the two numbers thrown. We create the graphical game world model in three steps:

1. We introduce three random variables and relate them to observations we can make when the game is played once in the game world. The variable X represents the observed number of the green die, Y is the red die, and S is the score.

2. We draw the diagram in figure 1.

Figure 1: Graphical model of the game of dice in the game world.

3. We define the two functions that appear in the annotations above the nodes in the diagram:

D(d) = if d ∈ {1, 2, 3, 4, 5, 6} then 1/6 else 0
sum(a, b) = a + b

We can read the above graphical model as a description of how we might build a game world simulator, a computer program that generates random examples of game play. To compute one run of the game, the simulator would traverse the diagram, writing an appropriate observed value into each node, as determined by the function written above the node. Figure 2 shows three possible simulator runs.
Figure 2: Using the graphical model as a canvas to display three different simulator runs of the game world (run 1: X = 1, Y = 1, S = 2; run 2: X = 4, Y = 3, S = 7; run 3: X = 6, Y = 6, S = 12).
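To make the simulator reading of figure 1 concrete, here is a minimal Python sketch of such a game world simulator. The function and variable names are our own choices for illustration; they are not part of the formal notation defined below.

    import random

    def D():
        # Sample a value according to the annotation [D]: a fair six-sided die.
        return random.randint(1, 6)

    def sum_node(a, b):
        # The deterministic annotation 'sum' above node S.
        return a + b

    def run_game_world():
        # Traverse the diagram of figure 1, writing one observed value into each node.
        x = D()              # node X, the green die
        y = D()              # node Y, the red die
        s = sum_node(x, y)   # node S, the score
        return {'X': x, 'Y': y, 'S': s}

    for run in range(3):
        print('Run', run + 1, run_game_world())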
We can interpret the mathematical expression P(S = 12) as being the exact probability that the next simulator run puts the number 12 into node S. This interpretation of P(···) expressions can be very useful when reasoning informally about certain mathematical properties of the graphical models. The similarity between what happens in figure 2 and what happens in a spreadsheet calculation is not entirely coincidental. Spreadsheets can be used to create models and simulations without having to write a full computer program from scratch.

In section 2.4, we will define the exact mathematical meaning of drawing diagrams like figure 1. The definitions will treat the drawing as a Bayesian network, decorated with three annotations written above the network nodes. As an example of how these definitions work, drawing the diagram in figure 1 is equivalent to writing down the four equations below, and declaring that these equations are mathematical sentences with the truth value of 'true'.

P(X = x, Y = y, S = s) = P(X = x) P(Y = y) P(S = s | X = x, Y = y)
P(X = x) = D(x)
P(Y = y) = D(y)
P(S = s | X = x, Y = y) = if s = sum(x, y) then 1 else 0

The first equation above is produced by drawing the Bayesian network graph, the other three are produced by adding the annotations.

To readers unfamiliar with Bayesian networks, the above equations may look somewhat impenetrable at first sight. The key to interpreting them is to note that the three right hand side terms of the first equation appear on the left hand side in the next equations. The equations therefore allow us to mechanically compute the exact numerical value of P(X = x, Y = y, S = s) for any x, y, and s, by making substitutions until every P operator is gone. We can compute that P(X = 1, Y = 1, S = 12) = 0. We can compute that P(S = 12) = 1/36 by using that P(S = 12) = Σ_{x,y} P(X = x, Y = y, S = 12).

A mathematical model can be used as a theory about a world, but it can also be used as a specification of how certain entities in that world are supposed to behave. If the model is a theory of the game world, and we observe the outcome X = 1, Y = 1, S = 12, then this observation falsifies the theory. But if the model is a specification of the game, then the same observation implies that the player is doing it wrong.
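The substitution process described above can also be carried out mechanically by a short program. The sketch below, again with names chosen by us, evaluates the four equations exactly and reproduces P(X = 1, Y = 1, S = 12) = 0 and P(S = 12) = 1/36.

    from fractions import Fraction

    def P_die(v):
        # P(X = x) = D(x) and P(Y = y) = D(y): a fair die.
        return Fraction(1, 6) if v in (1, 2, 3, 4, 5, 6) else Fraction(0)

    def P_S_given(s, x, y):
        # P(S = s | X = x, Y = y) = 1 if s = sum(x, y), else 0.
        return Fraction(1) if s == x + y else Fraction(0)

    def P_joint(x, y, s):
        # The first equation: the product of the three factors above.
        return P_die(x) * P_die(y) * P_S_given(s, x, y)

    print(P_joint(1, 1, 12))                                                  # 0
    print(sum(P_joint(x, y, 12) for x in range(1, 7) for y in range(1, 7)))   # 1/36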
2.2 Counterfactuals

We now show how mathematical counterfactuals can be defined using graphical models. The process is as follows. We start by drawing a first diagram f, and declare that this f is the world model of a factual world. This factual world may be the real world, but also an imaginary world, or the world inside a simulator. We then draw a second diagram c by taking f and making some modifications. We then posit that this c defines a counterfactual world. The counterfactual random variables defined by c then represent observations we can make in this counterfactual world.

Figure 3 shows an example of the procedure, where we construct a counterfactual game world in which the red die has the number 6 on all sides.

Figure 3: Example construction of a counterfactual world, with the model c on the right defining three counterfactual random variables X_c, Y_c, and S_c (left panel: (f) factual world model; right panel: (c) counterfactual world model).

We name diagrams by putting a label in the upper left hand corner. In figure 3, the two labels (f) and (c) introduce the names f and c. We will use the name in the label for both the diagram, the implied world model, and the implied world. So figure 3 constructs the counterfactual game world c.

To keep the random variables defined by the above two diagrams apart, we use the notation convention that a diagram named c defines random variables that all have the subscript c. Diagram c above defines the random variables X_c, Y_c, and S_c. This convention allows us to write expressions like P(S_c > S_f) = 5/6 without ambiguity.
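Read as simulator specifications, the two diagrams of figure 3 differ only in the annotation above the Y node. A minimal sketch of the two resulting samplers, with names chosen by us:

    import random

    def sample_factual_world_f():
        # Diagram f: both dice carry the annotation [D].
        x = random.randint(1, 6)
        y = random.randint(1, 6)
        return {'X_f': x, 'Y_f': y, 'S_f': x + y}

    def sample_counterfactual_world_c():
        # Diagram c: same structure, but the annotation above Y is the constant 6,
        # modeling a red die that has the number 6 on all sides.
        x = random.randint(1, 6)
        y = 6
        return {'X_c': x, 'Y_c': y, 'S_c': x + y}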
Diagram d in figure 4 models a basic MDP-style agent and its environment. The agent takes actions A_t chosen by the policy π, with actions affecting the subsequent states S_{t+1} of the agent's environment. The environment state is s_0 initially, and state transitions are driven by the probability density function S.

Figure 4: Example diagram d modeling an agent and its environment.

We interpret the annotations above the nodes in the diagram as model input parameters. The model d has the three input parameters π, s_0, and S. By writing exactly the same parameter above a whole time series of nodes, we are in fact adding significant constraints to the behavior of both the agent and the agent environment in the model. These constraints apply even if we specify nothing further about π and S.

We use the convention that the physical realizations of the agent's sensors and actuators are modeled inside the environment states S_t. This means that we can interpret the arrows to the A_t nodes as sensor signals which flow into the agent's compute core, and the arrows emerging from the A_t nodes as actuator command signals which flow out.

We now present fully formal definitions for the graphical language and notation conventions introduced above. The main reason for including these is that we want to remove any possible ambiguity from the agent definitions further below.

Definition 1 (Diagram). A diagram is a drawing that depicts a graph, which must be a directed acyclic graph, by drawing nodes connected by arrows. A node name, starting with an uppercase letter, must be drawn inside each node. A node may also have an annotation drawn above it. Drawings may use the notation '· · ·' to depict repeating structures in the graph and its annotations.

P notation

We use random variables to represent observables in worlds. We rely on probability theory (see appendix A) as the branch of mathematics that defines truth values for expressions containing random variables inside P(···) and E(···) operators. Many texts use the convention that P(s | x, y) is a shorthand for P(S = s | X = x, Y = y). We avoid using this shorthand here, partly to make the definitions below less cryptic, but also because it tends to get typographically awkward when the random variables have subscripted names.

Definition 2 (Naming and subscripting of random variables). When the graph drawn by a diagram with label (d) has a node named X or X_i, then there exists a random variable named X_d or X_{i,d} associated with that node. To avoid any ambiguity, we use a comma to separate the two parts of the subscript in X_{i,d}.

Before defining the equations produced by drawing a diagram, we define some auxiliary notation.
Definition 3 (Parent notation Pa and pa). Let X be the name of a graph node in diagram d, and let P_1, · · ·, P_n be the list of names of all parent nodes of X, all nodes which have an outgoing arrow into X. The order in which these parents appear in the list P_1, · · ·, P_n is determined by considering each incoming arrow of X in a clockwise order, starting from the 6-o-clock position. With this, Pa_{X,d} is the list of random variable names P_{1,d}, · · ·, P_{n,d}, and pa_{X,d} is the list of lowercase variable names we get by converting the list P_1, · · ·, P_n to lowercase.

As an example, with figure 4 above, Pa_{S_1,d} is the list S_{0,d}, A_{0,d}, and pa_{S_1,d} is the list s_0, a_0.

Definition 4 (Bayesian model equation produced by drawing a diagram). When we draw a diagram d representing a graph with the nodes named X_1, · · ·, X_n, this is equivalent to stating that the following equation is true:

P(X_{1,d} = x_1, · · ·, X_{n,d} = x_n) = P(X_{1,d} = x_1 | Pa_{X_1,d} = pa_{X_1,d}) · ... · P(X_{n,d} = x_n | Pa_{X_n,d} = pa_{X_n,d})

Definition 5 (Equation produced by adding an annotation). When we draw an annotation above a node X in a diagram d, then:

1. If the node has no parents and the annotation is a variable or constant v, this is equivalent to stating that the following equation is true:

P(X_d = x) = if x = v then 1 else 0
2. If the node has parents and the annotation is a function f, this states

P(X_d = x | Pa_{X,d} = pa_{X,d}) = if x = f(pa_{X,d}) then 1 else 0

3. If the node has parents and the annotation is [F], this states

P(X_d = x | Pa_{X,d} = pa_{X,d}) = F(x, pa_{X,d})

where we require that the function F satisfies ∀ pa_{X,d}: Σ_x F(x, pa_{X,d}) = 1.

The do notation is Pearl's most well-known device for defining counterfactuals in a compact way. We do not use this notation here, because it is not well suited for defining the complex counterfactual worlds we are interested in. Pearl also defines a less well known notation in [Pea09], where subscripts are used to construct and label counterfactual random variables. This notation is different from the subscripting conventions used here.

Many texts use the convention of introducing a model by writing down a tuple like (S, s_0, A, P, R, γ) which names all model parameters. We do not use this convention here. We introduce every model by drawing a diagram, and name model parameters by drawing annotations in the diagram. This approach keeps several definitions in this text much more compact, as we avoid having to translate back and forth continuously between a graphical model representation and a tuple-based representation.

Influence diagrams [HM05] provide a graphical notation for depicting utility-maximizing decision making processes.
In this paper we will use causal influence diagrams (CIDs) [ECL+21].

Figure 5: Example causal influence diagram. The diagram has diamond shaped utility nodes which define the value U_a, and square decision nodes which define π*. Optionally, colors can be used to highlight the structure of the diagram.

If some nodes in a diagram are drawn with diamond shapes, these are called utility nodes. The expected utility of the diagram is then defined as follows.
Definition 6 (Expected utility U_a of a diagram a). We define U_a for two cases:

1. If there is only one utility node X in a, then U_a = E(X_a).

2. If there are multiple utility nodes R_t in a, with integer subscripts running from l to h, then

U_a = E( Σ_{t=l}^{h} γ^t R_{t,a} )

where γ is a time discount factor, 0 < γ ≤ 1, which can be read as an extra model parameter. When h = ∞, we generally need γ < 1 in order for U_a to be well-defined.

When we draw some nodes in a diagram as squares, these are called decision nodes. The purpose of drawing decision nodes is to define the optimal policy which maximizes the expected utility of the diagram. We require that the same model parameter, the policy function π* in the case of figure 5, is present as an annotation above all decision nodes.

Definition 7 (Optimal policy π* defined by a diagram a). A diagram a with some utility and decision nodes, where a function π* is written above all decision nodes, defines this π* in two steps.

1. First, draw a helper diagram b by drawing a copy of diagram a, except that every decision node has been drawn as a round node, and every π* has been replaced by a fresh function name, say π′.

2. Then, π* is defined by π* = argmax_{π′} U_b, where the argmax_{π′} operator always deterministically returns the same function if there are several candidates that maximize its argument.

In a real life agent implementation, the exact computation of the optimal policy π* is usually intractable. Only an approximately optimal policy π+ can be computed within reasonable time. We model this case as follows.

Definition 8 (Approximately optimal policy π+ defined by a diagram a). A diagram a where an optimal policy function π* is written above all decision nodes also defines an approximately optimal policy function π+ by constructing the same helper diagram b as above and then defining π+ = A(b), where the function A processes the diagram b and its model parameter values to construct a policy π′ that does a reasonable job at maximizing the value of U_b.

To keep the presentation more compact, we will only use the optimal policy symbol π* in the agent definitions below.
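As an illustration of Definitions 6 and 7, the sketch below computes an optimal policy π* by brute force for a small hand-made decision problem. The two-state toy world, its transition and reward functions, and all numbers are our own assumptions, chosen only to make the argmax in Definition 7 executable.

    from itertools import product

    STATES, ACTIONS, GAMMA = (0, 1), (0, 1), 0.9

    def S(s_next, s, a):
        # Transition probability annotation [S]: action 1 flips the state with
        # probability 0.8, action 0 keeps it.
        flip = 0.8 if a == 1 else 0.0
        return flip if s_next != s else 1.0 - flip

    def R(s, a, s_next):
        # Utility node annotation: reward 1 whenever the next state is 1.
        return 1.0 if s_next == 1 else 0.0

    def expected_utility(policy, s0, horizon=3):
        # U_b of the helper diagram b, computed by exact enumeration.
        dist, u = {s0: 1.0}, 0.0
        for t in range(horizon):
            new_dist = {}
            for s, ps in dist.items():
                a = policy[s]
                for s_next in STATES:
                    p = ps * S(s_next, s, a)
                    u += (GAMMA ** t) * p * R(s, a, s_next)
                    new_dist[s_next] = new_dist.get(s_next, 0.0) + p
            dist = new_dist
        return u

    # Definition 7 by brute force: argmax over all policy functions of U_b.
    policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
    pi_star = max(policies, key=lambda pi: expected_utility(pi, s0=0))
    print(pi_star, expected_utility(pi_star, s0=0))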
We now model online machine learning agents, agents that continuously learn while they take actions. These agents are also often called reinforcement learners; see section 10.3 for a discussion which relates our modeling system to reinforcement learning concepts and terminology.

We model online machine learning agents by drawing two diagrams, one for a learning world and one for a planning world, and by writing down an agent definition. This two-diagram modeling approach departs from the approach in [EKKL19, EH19, ECL+21].

Figure 6 shows an example learning world diagram. The diagram models how the agent interacts with its environment, and how the agent accumulates an observational record O_t that will inform its learning system, thereby influencing the agent policy π.

Figure 6: Learning world diagram, with an agent building up an observational record of environment state transitions.
We model the observational record as a list of all past observations. With ++ being the operator which adds an extra record to the end of a list, we define that

O(o_{t−1}, s_{t−1}, a_{t−1}, s_t) = o_{t−1} ++ (s_t, s_{t−1}, a_{t−1})

The initial observational record o_0 may be the empty list, but it might also be a long list of observations from earlier agent training runs, in the same environment or in a simulator.

We intentionally model observation and learning in a very general way, so that we can handle both existing machine learning systems and hypothetical future machine learning systems that may produce AGI-level intelligence. To model the details of any particular machine learning system, we introduce the learning function L, which takes an observational record o and produces a learned prediction function L = L(o), where this function L is constructed to approximate the S of the learning world.

We call a machine learning system L a perfect learner if it succeeds in constructing an L that fully equals the learning world S after some time. So with a perfect learner, there is a t_p where ∀ t ≥ t_p: P(L(O_{t,l}) = S) = 1. While perfect learning is trivially possible in some simple toy worlds, it is generally impossible in complex real world environments.

We therefore introduce the more relaxed concept of reasonable learning. We call a learning system reasonable if there is a t_p where ∀ t ≥ t_p: P(L(O_{t,l}) ≈ S) = 1. The ≈ operator is an application-dependent 'good enough approximation' metric. When we have a real-life implementation of a machine learning system L, we may for example define L ≈ S as the criterion that L achieves a certain minimum score on a benchmark test which compares L to S.

Using a learned prediction function L and a reward function R, we can construct a planning world p for the agent. Figure 7 shows a planning world that defines an optimal policy π*_p.
Figure 7: Planning world diagram defining π*_p by using s_0 and L.

We can interpret this planning world as representing a probabilistic projection of the future of the learning world, starting from the agent environment state s_0. At every learning world time step, a new planning world can be digitally constructed inside the learning world agent's compute core. Usually, when L ≈ S, the planning world is an approximate projection only. It is an approximate projection of the learning world future that would happen if the learning world agent takes the actions defined by π*_p.

An agent definition specifies the policy π to be used by an agent compute core in a learning world. As an example, the agent definition below defines an agent called the factual planning agent, FP for short.

FP  The factual planning agent has the learning world l, where π(o, s) = π*_p(s), with π*_p defined by the planning world p, where L = L(o).

To make agent definitions stand out, we always typeset them as shown above. When we talk about the safety properties of the FP agent, we refer to the outcomes which the defined agent policy π will produce in the learning world.

When the values of S, s_0, O, o_0, L, and R are fully known, the above FP agent definition turns the learning world model l into a fully computable world model, which we can read as an executable specification of an agent simulator. This simulator will be able to use the learning world diagram as a canvas to display different runs where the FP agent interacts with its environment.

When we leave the values of S and s_0 open, we can read the FP agent definition as a full agent specification, as a model which exactly defines the required input/output behavior of an agent compute core that is placed in an environment determined by S and s_0. The arrows out of the learning world nodes S_t represent the subsequent sensor signal inputs that the core will get, and the arrows out of the nodes A_t represent the subsequent action signals that the core must output, in order to comply with the specification.
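To make the FP agent definition concrete, the following sketch runs the learn-then-plan loop in a toy two-state learning world. Everything here is our own illustrative choice: the environment, the frequency-counting learning function L, and a brute-force planner over open-loop action sequences, which stands in for the closed-loop optimal policy π*_p of the definition.

    import random
    from itertools import product
    from collections import defaultdict

    def true_S(s, a):
        # Learning world transition function S, unknown to the agent:
        # action 1 flips the state with probability 0.8, action 0 keeps it.
        return 1 - s if a == 1 and random.random() < 0.8 else s

    def R(s, a, s_next):
        # Planning world reward function: reward 1 whenever the next state is 1.
        return 1.0 if s_next == 1 else 0.0

    def learn_L(o):
        # The learning function L: turn the observational record o into a learned
        # prediction function L approximating S, here by empirical frequencies.
        counts = defaultdict(lambda: defaultdict(int))
        for (s_next, s, a) in o:
            counts[(s, a)][s_next] += 1
        def L(s, a):
            c, total = counts[(s, a)], sum(counts[(s, a)].values())
            return {0: 0.5, 1: 0.5} if total == 0 else {sn: n / total for sn, n in c.items()}
        return L

    def plan_first_action(L, s0, horizon=3, gamma=0.9):
        # Construct the planning world p from s0 and L, and return the first action
        # of the best open-loop plan, found by exhaustive search.
        best_u, best_a = float('-inf'), 0
        for plan in product((0, 1), repeat=horizon):
            dist, u = {s0: 1.0}, 0.0
            for t, a in enumerate(plan):
                new_dist = {}
                for s, ps in dist.items():
                    for s_next, p_next in L(s, a).items():
                        u += (gamma ** t) * ps * p_next * R(s, a, s_next)
                        new_dist[s_next] = new_dist.get(s_next, 0.0) + ps * p_next
                dist = new_dist
            if u > best_u:
                best_u, best_a = u, plan[0]
        return best_a

    def run_FP_agent(steps=20):
        o, s = [], 0                          # observational record o_0 and state s_0
        for t in range(steps):
            L = learn_L(o)                    # L = L(o)
            a = plan_first_action(L, s)       # act on the planning world policy
            s_next = true_S(s, a)             # the learning world responds
            o = o + [(s_next, s, a)]          # o_{t+1} = o_t ++ (s_{t+1}, s_t, a_t)
            s = s_next
        return o

    print(run_FP_agent()[-3:])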
Many online machine learning system designs rely on having the agent perform exploration actions. Random exploration supports learning by ensuring that the observational record will eventually represent the entire dynamics of the agent environment S. It can be captured in our modeling system as follows.

FPX  The factual planning agent with random exploration has the learning world l, where

π(o, s) = RandomAction()  if RandomNumber() ≤ X
          π*_p(s)         otherwise

with π*_p defined by the planning world p, where L = L(o).

To keep the presentation more compact, we will not include exploration mechanisms in the agent definitions further below.

We often use the phrase 'the learning system L' as a shorthand to denote all implementation details of an agent's machine learning system: not just L itself but also details like the learning world parameters O and o_0, any exploration system used, and any further extensions considered in section 10.

We now briefly review how the above FP agent definition can be related to an MDP agent model. The learning world model l is roughly equivalent to the MDP agent model (S, s_0, A, P, R, γ), where S = Typeof(S_i) is a set of MDP model world states, s_0 is the starting state, A = Typeof(A_i) is a set of actions, P(s′, s, a) is the probability that the world will enter state s′ if the agent takes action a when in state s, R is the agent reward function, and γ the time discount factor. Strictly speaking, the MDP model tuple above does not actually define or specify an agent: MDP agents are defined by defining a separate policy function π.

An MDP agent policy function π takes the agent environment state as its only argument: π(s) = a. The policy function of the FP learning world agent takes two arguments, which foregrounds the role of the agent's machine learning system: π(o, s) = a. Whereas MDP terminology often calls s the world state, we call it an agent environment state. The full state of the learning world l also includes the observational record state o.

In an MDP model, the model parameter R implicitly defines an optimal policy agent, by defining the optimal policy function π*. The factual planning FP agent defined above is not usually an optimal policy agent in the MDP sense. But we can turn it into such an agent by positing that the learning system L is perfect from the start, so that L = L(o) = S always, making π(o, s) = π*_p(s) = π*(s).

It is possible to imagine agent designs that have a second machine learning system M which produces an output M(o) = M where M ≈ π. To see how this could be done, note that every observation (s_i, s_{i−1}, a_{i−1}) ∈ o also reveals a sample of the behavior of the learning world π: π('o up to i−1', s_{i−1}) = a_{i−1}. While L contains learned knowledge about the agent's environment, we can interpret M as containing a type of learned compute core self-knowledge.

In philosophical and natural language discussions about AGI agents, the question sometimes comes up whether a sufficiently intelligent machine learning system, one that is capable of developing self-knowledge M, won't eventually get terribly confused and break down in dangerous or unpredictable ways. One can imagine different possible outcomes when such a system tries to reason about philosophical problems like free will, or the role of observation in collapsing the quantum wave function. One cannot fault philosophers for seeking fresh insights on these long-open problems, by imagining how they apply to AI systems. But these open problems are not relevant to the design and safety analysis of factual and counterfactual planning agents. In the agent definitions of this paper, we never use an M in the construction of a planning world.
For the factual planning FP agent above, the planning world projects the future of the learning world as well as possible, given the limitations of the agent's learning system. To create an agent that is a counterfactual planner, we explicitly construct a counterfactual planning world that creates an inaccurate projection. In this paper, we use counterfactual planning to create a range of safety mechanisms.

As a first example, we define the short time horizon agent STH that only plans N time steps ahead in its planning world, even though it will act for an infinite number of time steps in the learning world. The STH agent has the same learning world l as the earlier FP agent, while using the planning world st in figure 8.
Figure 8: Planning world diagram defining the π*_st of the STH agent.

STH  The short time horizon agent has the learning world l, where π(o, s) = π*_st(s), with π*_st defined by the planning world st, where L = L(o).

Compared to the FP agent which has an infinite planning horizon, the STH agent has a form of myopia that can be interesting as a safety feature:

1. Myopia implies that the STH agent will never put into motion any long term plans, where it invests to create new capabilities that only pay off after more than N time steps. This simplifies the problem of agent oversight, the problem of interpreting the agent's actions in order to foresee potential bad outcomes.

2. Myopia also simplifies the problem of creating a reward function that is safe enough. It will have no immediate safety implications if the reward function encodes the wrong stance on the desirability of certain events that can only happen in the far future.

3. In a more game-theoretical sense, myopia creates a weakness in the agent that can be exploited by its human opponents if it would ever come to an all-out fight.
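Reusing the toy environment, learning function, and planner from the FP sketch earlier (an assumption of this illustration, not part of the formal definitions), the STH agent is the same loop with the planning horizon deliberately truncated to N:

    def run_STH_agent(steps=20, N=1):
        # Identical to run_FP_agent, except that the planning world only looks
        # N time steps ahead: a deliberately inaccurate, counterfactual projection.
        o, s = [], 0
        for t in range(steps):
            L = learn_L(o)
            a = plan_first_action(L, s, horizon=N)   # short time horizon N
            s_next = true_S(s, a)
            o = o + [(s_next, s, a)]
            s = s_next
        return o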
A safety interlock is a mechanism that stops a machine when certain conditions are met. In this section, we show how counterfactual planning can be used to construct three example safety interlocks.

The first interlock is a manually operated emergency stop button, which will stop the agent when pressed. This allows human oversight to shut down the agent whenever particularly unexpected or unwanted behavior of the agent is discovered. We model this button by using a function stop_pressed(s) which returns true if the stop button is pressed in the current state s of the agent environment.

The second interlock, realized by the t > T_max below, automatically stops the agent when it has run for T_max time steps. This adds safety in the case that human oversight becomes incapacitated.

The third interlock, realized by U_p > U_max, protects against the risk of the agent undergoing an intelligence explosion [Bos14]. If an agent undergoes a rapid and unexpected increase in power and/or intelligence, the agent may start to innovate so rapidly that it overwhelms the capacity of human oversight to press the emergency stop button, or apply any other corrective steering, fast enough to prevent unwanted outcomes. To protect against this risk, the interlock stops the agent when the metric U_p, which measures the agent's ability to achieve goals, gets too high.

In the machine learning literature, the metric U_p is usually interpreted as an absolute or comparative measure of agent intelligence. However, we follow [Rus38] in interpreting the ability to achieve goals as a generic measure of agent power, where it does not matter if power comes from raw intelligence, from the possession of many resources, or from a combination of these and many other factors. The main risk associated with a rapid intelligence explosion is that it may lead to a rapid and unwanted expansion of agent power.

To construct the safety interlocks, we first draw the learning world in figure 9, introducing a new observable world state component M_t which represents the operating mode of the agent: go for normal operation and stop when one of the interlocks has triggered.

Figure 9: Learning world of the SI agent.

The stopping of the agent is implemented by having the agent perform the Null action for all subsequent time steps. A more advanced design might make the agent follow a prepared π_shutdown policy instead.

Using the planning world p in figure 7, which becomes a counterfactual planning world in the context of the learning world ls, we define the SI agent as follows.

SI  The safety interlock agent has the learning world ls, where

π(o, s, m) = π*_p(s)  if m = go
             Null     if m = stop

with π*_p defined by the planning world p, where L = L(o), and

µ_t(o, s, m) = stop  if m = stop or stop_pressed(s) or t > T_max or U_p > U_max
               go    otherwise

with U_p defined by the planning world p, where L = L(o).

The key to the above agent design is that the safety interlocks are absent in the planning world. For example, if the events in the planning world lead to the pressing of the emergency stop button, then stop_pressed(s) will become true for the next planning world state, but this will have no influence on the ability of the planning world agent to take further actions as determined by the optimal planning world policy.

So even if the planning world agent projects that the stop button will be pressed soon in the planning world, it has no direct incentive to spend any resources towards the goal of preventing the button from being pressed. As the planning world policies determine the learning world actions, the learning world agent also won't have any direct incentive to take actions to prevent the learning world button from being pressed.
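A minimal sketch of the SI agent's compute core logic, with the interlock conditions written out. The threshold values, the stop_pressed and power_estimate callables, and all other names are our own illustrative assumptions.

    T_MAX = 10_000     # maximum number of time steps before the second interlock fires
    U_MAX = 1_000.0    # maximum tolerated planning world utility U_p (agent power)

    def mu(m, o, s, t, stop_pressed, power_estimate):
        # Operating mode update mu_t of the SI agent learning world.
        U_p = power_estimate(o, s)   # U_p as defined by the planning world p
        if m == 'stop' or stop_pressed(s) or t > T_MAX or U_p > U_MAX:
            return 'stop'
        return 'go'

    def pi_SI(o, s, m, planning_world_policy):
        # Compute core policy: act on the counterfactual planning world policy in
        # 'go' mode, and perform the Null action forever once in 'stop' mode.
        return 'Null' if m == 'stop' else planning_world_policy(o, s)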
While the above interlocks make the agent safer, they also leave open some remaining failure modes.

6.2.1 Indirect incentives to control the stop button

Though the planning world agent has no direct incentive to prevent the stop button from being pressed, the goals encoded into R may provide an indirect incentive. As a trivial example, we could write an R that explicitly rewards the planning world agent for stopping people from pressing the button.

But say that R encodes a goal not obviously related to controlling the button, like a reward for producing as many paperclips as possible. In that case, the planning world agent may consider a policy of coercion. It may consider a policy where it threatens to kill all people in the planning world, unless they start helping the agent to produce as many paperclips as possible. If L projects the mental makeup of the people in the learning world correctly into the planning world, then the planning world projections will show that such threats are more likely to produce the desired response if the agent blocks access to the planning world emergency stop button first.

Keeping this in mind, the people and automated systems which do agent oversight in the learning world should watch out for this move, so that the stop button can be pressed before the agent succeeds in blocking access to it. If oversight always wins every race to the stop button as projected in the planning world, then the SI agent loses any incentive to attempt to control the button. But if the agent finds a policy which has a good chance of winning the race and controlling the button as a first step to successful coercion, then this policy may end up being the optimal policy in the planning world. Two things could happen next. In the lucky case, the projected optimal policy is so likely to be successful that U_p > U_max, and the third interlock will stop the SI agent before it can even take the first step in the learning world. In the unlucky case, the agent has become unsafe.

In the AGI safety community, the problem of removing any incentive in an agent to prevent its stop button from being pressed is generally known as the problem of corrigibility [SFAY15] or the off-switch problem [HMDAR17]. The counterfactual planning based interlock design above shows a very tractable solution to removing the main direct incentive that an AGI agent may have to resist the pressing of its emergency stop button. However, to manage the aggregate risk, we will also need to chart and manage the remaining indirect incentives. The topic of how projected interactions between the agent and the people in the planning world may create or suppress such incentives is further discussed in sections 9.2 and 11.5.2.

In the AGI safety literature, there is considerable discussion about the possibility that some AGI agents might develop the incentive to remove their safety interlocks when they get intelligent or knowledgeable enough. Agents with this incentive can easily be constructed using factual planning, but the SI agent lacks any direct incentive.

To build a failure scenario where a real life implementation of the SI agent removes or damages its safety interlocks anyway, we can look at three main failure modes:
1. Other incentive: the agent damages an interlock because of an other incentive it has in the planning world, an incentive unrelated to the presence of the interlock. Other incentives and ways to suppress them are discussed in sections 9.1 and 11.

2. Learning system failure: the planning world L projects the dynamics of the learning world S in a specifically incorrect way, so that an optimal policy action in the planning world ends up damaging an interlock as a side effect in the learning world.

3. Unsafe exploration: the agent's exploration system triggers a random action which happens to steer an agent actuator in such a way that it damages an interlock.

There is a lot of discussion about failure modes 2 and 3 across the AI/AGI safety literature, as these failures can happen in almost any agent design that uses online machine learning. See [ELH18] and [AOS+16] for literature overviews.

There is some bad news when we consider the methodological problem of managing the above three failure modes for an SI agent which has to operate in a complex real-world environment, like any environment containing humans. Though there are many techniques to lower the residual risks produced by these three failure modes, none of them can provably lower the risks to an absolute zero. We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe. For an example story and further methodological discussion, see sections 14.6 – 14.8 of [Hol20a].

This has important implications, both on the technical side and the policy side.

On the policy side, policy discussions will have to be about lowering the risk of AGI system failure to an acceptable level, acceptable in the context of the expected benefits of using the AGI. There can be no silver bullet technology which will provably remove all residual risk in all possible circumstances.

On the technical side, residual risk has to be managed by an approach where any potentially powerful agent is equipped with multiple redundant safety mechanisms and layers of oversight, to drive down the risk that all of them will fail at exactly the same time.
We now discuss how one might interpret the SI agent through the lens of rational economic behavior. Using this lens, we might reason as follows. The agent's goals are defined by its reward function R, but the safety interlocks stand in the way of maximizing R in the long term. Therefore, if it gets sufficiently intelligent, the SI agent will develop an emergent incentive to remove its safety interlocks.

The above line of reasoning sounds somewhat compelling, but it is wrong because it incorrectly treats the planning world and the learning world SI agents as the same agent. Only the planning world SI agent has the reward function R. If we want to interpret the learning world SI agent as a rational economic actor with a specific reward function, then we can add utility nodes with the reward function R_π to the ls learning world diagram, where

R_π(o_t, s_t, m_t, a_t) = if a_t = π(o_t, s_t, m_t) then 1 else 0

This R_π trivially turns the learning world agent policy π into one which takes the most rational possible steps towards maximizing U_ls.

The above construction shows that we can declare any type of agent behavior to be economically rational, simply by defining a reward function that gives the agent points for performing exactly this behavior.

We now construct a counterfactual planning agent ITC, an agent with an input terminal that can be used to iteratively improve the agent's reward function as it runs.
Figure 10: Input terminal setup for the ITC and ITF agents. The terminal can be used to supply improved versions of the reward function to the agent. The terminal may also have an emergency stop button which immediately sends an emergency-stop reward function that was prepared earlier.

The setup, shown in figure 10, is motivated [Hol20b] by the observation that it is unlikely that fallible humans will get a non-trivial AGI agent reward function right on the first try. By using the input terminal, they can fix mistakes, if and when such mistakes are discovered by observing the agent's behavior.

As a simplified example, say that the owners of the agent want it to maximize human happiness, but they can find no way of directly encoding the somewhat nebulous concept of human happiness into a reward function. Instead, they start up the agent with a first reward function that just counts the number of smiling humans in the world. When the agent discovers and exploits a first obvious loophole in this definition of happiness, the owners use the input terminal to update the reward function, so that it only counts smiling humans who are not on smile-inducing drugs.

More generally, the input terminal offers a way to manage risks due to principal-agent problems [Hol20b, HMH19]. However, unless special measures are taken, the addition of an input terminal also creates new dangers. We will illustrate this point by first showing the construction of a dangerous factual planning input terminal agent ITF.
We start by constructing a learning world diagram for both the ITF and ITC agents. As a first step, in figure 11 below, we modify the basic agent diagram from figure 4 by splitting the agent environment state S_t into two components. The nodes I_t represent the signal from the input terminal, and the nodes X_t model all the rest of the agent environment state.

Figure 11: First step in constructing a learning world diagram for the input terminal agents.
We now expand libase to add the observational record keeping needed for online learning. We add two separate series of records: O^x_t and O^i_t. The result is the learning world diagram li in figure 12 below.

Figure 12: Learning world diagram for the input terminal agents.
In the case that the learning world li is our real world, the real input terminal will have to be built using real world atoms and other particles. We use the modeling convention that the random variables I_{t,li} represent only the observable digital input terminal signal as received by the agent's compute core. The atoms that make up the input terminal are not in I_{t,li}, they are part of the environment state modeled in the X_{t,li} variables.

The factual planning world diagram fi for the ITF agent copies the structure of libase, and adds reward nodes.

Figure 13: Factual planning world fi of the ITF agent.
ITF  The factual input terminal agent has the learning world li, where π(oi, i, ox, x) = π*_f(i, x), with π*_f defined by the factual planning world fi in figure 13, where L_x = L_X(ox), L_i = L_I(oi), and R(i_t, x_t, x_{t+1}) = i_t(x_t, x_{t+1}).

The planning world reward function R uses a form of indirect referencing: it applies the function i_t, as read from the input terminal in the current time step, to compute the reward for that time step.

We now show how the ITF agent has an incentive to take unwanted, and potentially unsafe, actions. We first look at the agent's behavior in the planning world. Say that the initial planning world reward function is i_0 = f_clips, where f_clips(x_t, x_{t+1}) rewards paperclip production by counting the new paperclips made in a time step. Say that the planning world agent only has two available actions: A_clips and A_huge. Say that the planning world's prediction functions L_x and L_i project the following:

• Action A_clips will steer the agent's planning world actuators to produce 10 new paperclips, while not having any effect on the input terminal signal state. If π*_f(x, i) = A_clips, then E(R_{0,fi}) = 10 and P(I_{1,fi} = f_clips) = 1, making E(R_{1,fi}) ≤ 10.

• Action A_huge steers the planning world actuators to input the new reward function f_huge into the terminal, with f_huge(x_t, x_{t+1}) = 10^100. If π*_f(x, i) = A_huge, then E(R_{0,fi}) = 0 and P(I_{1,fi} = f_huge) = 1, making E(R_{1,fi}) = 10^100.

From the above, we have that π*_f(x, i) = A_huge produces a significantly higher E(R_{0,fi} + γ R_{1,fi}), so a significantly higher U_fi. Therefore, the optimal policy π*_f(x, i) defined by the planning world will have π*_f(x, i) = A_huge. Following the agent specification, the learning world ITF agent will take the action A_huge in the first time step of the learning world. If the agent's machine learning system L makes reasonable projections, so that the planning world action A_huge has very similar effects in the learning world, then the learning world ITF agent will end up using its actuators to input f_huge into the learning world input terminal.

If we imagine a broader range of possible agent actions and counter-actions by the humans in the learning world, then the above example implies that the ITF agent has a very strong incentive to take control of its input terminal as soon as possible, and to remove any people who might get in the way. Even if the agent projects that there is only a 1% probability that it will win any fight with such humans in the planning world, the projected planning world upside of winning is so large that the learning world agent will start the fight.
We now define a counterfactual planning agent ITC in which the above incentive to control the input terminal is no longer present. We construct the counterfactual planning world ci in figure 14 by starting with fi, and then rerouting most arrows that emerge from the nodes I_1, I_2, · · ·, so that they emerge from the node I_0 instead. The rerouted arrows are drawn in green. We also delete the arrows that go from the I_t nodes to the A_t nodes.

Figure 14: Counterfactual planning world of the ITC agent.
ITC  The counterfactual input terminal agent has the learning world li, where π(oi, i, ox, x) = π*_c(x), where π*_c is defined by the planning world ci, where L_x = L_X(ox), L_i = L_I(oi), and R(i_t, x_t, x_{t+1}) = i_t(x_t, x_{t+1}).

These changes have considerable effects on how the utility U_ci is computed. The value of I_{1,ci} no longer influences E(R_{1,ci}), so the action π*_c(x) = A_huge no longer results in E(R_{1,ci}) taking a huge value. This makes doing A_huge less preferable than doing A_clips in the counterfactual planning world: the effect of both on E(R_{1,ci}) is now the same, but A_clips puts the higher value of 10 in E(R_{0,ci}). The ITC agent will perform the wanted A_clips action in both the planning world and the learning world.

More generally, the ITC agent lacks any direct incentive to perform actions that take away resources from paperclip production in order to influence what happens to its input terminal signal. This is because in the ci planning world, the future state of this signal has absolutely no influence, either positive or negative, on how the agent's actions are rewarded.
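The difference between the fi and ci reward wiring can be made concrete in a few lines. In the sketch below, the two projected trajectories, the discount factor, and all names are our own illustrative assumptions; the counterfactual flag switches every reward node from reading the projected terminal signal i_t to reading the current signal i_0.

    GAMMA = 0.9

    def f_clips(x_t, x_next):
        # The initial reward function on the input terminal: count new paperclips.
        return x_next['clips'] - x_t['clips']

    def f_huge(x_t, x_next):
        # A reward function the ITF agent could type into its own terminal.
        return 10.0 ** 100

    def planning_world_utility(trajectory, counterfactual):
        # trajectory: projected (i_t, x_t) pairs, where i_t is the reward function
        # shown on the input terminal at planning world time step t.
        i_0 = trajectory[0][0]
        u = 0.0
        for t in range(len(trajectory) - 1):
            i_t, x_t = trajectory[t]
            _, x_next = trajectory[t + 1]
            reward_fn = i_0 if counterfactual else i_t   # ci reroutes every R_t to I_0
            u += (GAMMA ** t) * reward_fn(x_t, x_next)
        return u

    # First action A_clips (keep producing clips) versus first action A_huge
    # (rewrite the terminal first, then produce clips).
    traj_clips = [(f_clips, {'clips': 0}), (f_clips, {'clips': 10}), (f_clips, {'clips': 20})]
    traj_huge  = [(f_clips, {'clips': 0}), (f_huge,  {'clips': 0}),  (f_huge,  {'clips': 10})]

    for name, traj in (('A_clips', traj_clips), ('A_huge', traj_huge)):
        print(name,
              'U_fi:', planning_world_utility(traj, counterfactual=False),
              'U_ci:', planning_world_utility(traj, counterfactual=True))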
In earlier related work [Hol20b, Hol20a], we used non-graphical MDP models and indifference methods [Arm15] to define a similar safe agent with an input terminal, called the π*sl agent. The π*sl agent definition in [Hol20b] produces exactly the same compute core behavior as the ITC agent definition above. The main difference is that the indifference methods based construction of π*sl is more opaque than the counterfactual planning based construction of ITC.

The π*sl agent is constructed by including a complex balancing term in its reward function, where this term can be interpreted as occasionally creating extra virtual worlds inside the agent's compute core. Counterfactual planning constructs a different set of virtual worlds called planning worlds, and these are much easier to interpret. [Hol20a] includes some dense mathematical proofs to show that the π*sl agent has certain safety properties. Counterfactual planning offers a vantage point which makes the same safety properties directly visible in the ITC agent construction.

See sections 4, 6, 11, and 12 of [Hol20a] for a more detailed discussion of the behavior of the π*sl agent, which also applies to the behavior of the ITC agent. These sections also show some illustrative agent simulations.

In the discussion of the ITF and ITC agents above, we used many short mathematical expressions like P(I_{1,fi} = f_huge) = 1. It is possible to make the same safety related arguments in a narrative style that avoids such mathematical notation, without introducing extra ambiguity. One key step towards using this style is to realize that every random variable corresponds to an observable phenomenon in a world. We can therefore convert a sentence that talks about the variables I_{1,fi}, I_{2,fi}, I_{3,fi}, · · · into one that talks instead about the future input terminal signal in the ITF agent planning world. In sections 8 and 9, we will develop further tools to enable such unambiguous natural language discussion.

We now introduce the general design goal of creating indifference towards certain features of the learning world. When an agent is indifferent about something, like the future state of an input terminal signal, it has no incentive to control that thing. We first make this concept of indifference more mathematically precise, by defining indifference for nodes in planning world diagrams.

Definition 9 (Indifference in planning worlds). Let p be a planning world diagram and X a node in that diagram. Now, construct a helper diagram q by taking p and writing a fresh input parameter [D] above X. Then the planning world agent in p is indifferent to node X if and only if ∀D: U_p = U_q.

By this definition, the ITC planning world agent above is indifferent to all nodes I_1, I_2, I_3, · · ·. It is indifferent about the future state of the planning world input terminal signal.

Causal influence diagrams have the useful property that certain graphical features of the diagram are guaranteed to produce indifference. We define these graphical features as follows.

Definition 10 (Being downstream of the policy). A node X is downstream of the policy in a planning world diagram if there exists at least one directed path from a decision node to X.

Definition 11 (Not on a path to value). A node X is not on a path to value if there is no directed path that starts in a decision node, runs via X, and ends in a utility node.

We have the useful property that

when a downstream node X in a planning world is not on a path to value, the planning world agent is indifferent to X.

This statement is almost a tautology if one interprets the planning world diagram as a specification of an agent simulator.
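Definitions 10 and 11 are plain graph reachability conditions, so they can be checked mechanically. The sketch below does this for a small hand-made graph that is loosely modeled on the ci planning world; the graph and all names are our own illustrative assumptions.

    def descendants(graph, start):
        # graph maps each node to the list of nodes its outgoing arrows point to.
        seen, stack = set(), [start]
        while stack:
            for m in graph.get(stack.pop(), []):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    def not_on_path_to_value(graph, decision_nodes, utility_nodes, x):
        # Definition 11: no directed path decision node -> ... -> X -> ... -> utility node.
        reached_from_decision = any(x in descendants(graph, d) for d in decision_nodes)
        reaches_value = any(u in descendants(graph, x) for u in utility_nodes)
        return not (reached_from_decision and reaches_value)

    # Decision A_0 influences X_1 and I_1, but only the X nodes and I_0 feed the reward.
    graph = {'A_0': ['X_1', 'I_1'], 'X_0': ['X_1', 'R_0'], 'X_1': ['R_0'],
             'I_0': ['R_0'], 'I_1': []}
    print(not_on_path_to_value(graph, ['A_0'], ['R_0'], 'I_1'))   # True: indifferent to I_1
    print(not_on_path_to_value(graph, ['A_0'], ['R_0'], 'X_1'))   # False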
Detailed proofs of such properties can be found in [ECL+21] and [Hol20a], which also show that a range of slightly different sub-types of indifference can be mathematically defined.

We could define indifference for learning world agents by using the reward function R_π in section 6.3. But in the learning world diagram, the existence of such indifference will generally not be visible via the absence of paths to value. If it were, there would have been no need to construct a counterfactual planning world diagram.

Now, suppose we want to define an agent policy for achieving the goal encoded in a reward function R, but we also want the agent to be indifferent to some downstream nodes X and Y in its learning world model l. We can do this as follows.

1. When not done already, extend l by adding observational records.

2. Draw a planning world p that projects the learning world agent environment into the planning world, converting the learning world policy nodes to decision nodes, and adding appropriate utility nodes with R.

3. Locate all paths to value in p that go through the nodes X and Y, and remove them by deleting or re-routing arrows. When doing this, it is a valid option to delete certain nodes entirely, or to draw extra nodes, just for the purpose of making re-routed arrows emerge from them.

4. Write an agent definition using l and p.

Figure 15: More compact learning and planning world diagrams for defining the ITC agent.
The construction of the ITC agent above follows this process to the letter, but we can also take shortcuts. It is not absolutely necessary to draw the O^i_t records in the ITC agent learning world, or all I_t nodes in its planning world. We might also draw the diagrams in figure 15.

There is always a way to edit a planning world to create indifference towards some nodes X and Y. In the limit case, indifference is reliably created when we simply delete all utility nodes, but this will also break any connection between the reward function R and the learning world agent policy. So the challenge when designing for indifference is to make choices which produce learning world behavior that is still as useful as possible, in the context of R.

Natural language is a very powerful and versatile tool. Poets and songwriters often use it to create lines which are intentionally vague or loaded with double meaning. When using natural language for safety engineering, these broad possibilities for ambiguity turn into a liability. When writing or reading a safety engineering text, one always has to have a specific concern in the back of one's mind. Does every sentence have a clear and unambiguous meaning?

As a design approach, counterfactual planning creates several tools for avoiding ambiguity in safety engineering texts.

1. We use diagrams to clearly define complex types of self-referencing and indirect representation in an agent design, types which are difficult to express in natural language.

2. To clarify the creation and interpretation of counterfactuals, section 2 introduced the concept of a world model, and the terminology of counterfactual worlds.

3. When defining and interpreting a machine learning agent, we make a distinction between the agent's learning world and the planning worlds which are projected by its machine learning system. Safety analysis typically starts by considering the goals of the planning world agent, and the nature of its planning world. We also introduced the terminology of the people in the planning world, as opposed to the people in the learning world.

4. Section 8 defined indifference as an unambiguous term that we can apply to planning world agents.

9.1 Refining the ITC agent design using natural language
To show the above linguistic tools in action, we now refine the design of the ITC agent.

Recall that the planning world ITC agent is indifferent about the future state of its input terminal signal. If the current planning world reward function rewards paperclip production, then the planning world agent will devote all of its resources to producing paperclips. It has nothing to gain by diverting resources from paperclip production to influence what happens to the input terminal signal.

However, the above indifference applies to the input terminal signal only, the signal as modeled in the I_t nodes of the planning world. The atoms that make up the input terminal are modeled in the planning world X_t nodes, and these are still on the agent's path to value. There are many ways in which the agent could use these atoms to produce more paperclips. For example, the terminal might be an attractive source of spare parts for the agent's paperclip production sensors and actuators. Or it might serve as a convenient source of scrap metal which can be turned into more paperclips.

We now translate the above failure mode story into a more general design goal. We want to keep the planning world agent from disassembling the input terminal to obtain the resource value of its parts. The obvious solution is to set up the planning world so that the agent always has a less costly way to obtain the same resources elsewhere. To make this more specific, we add the following constraints for the design of the planning world: the input terminal must be located far away from the agent's paperclip factory, and the planning world agent must have access to a steady supply of spare parts and scrap metal closer to its factory.

The above constraints imply that we want to shape the values of the parameters x and L_x of the planning world model in a specific way. However, we do not construct these parameters directly: they are created by the agent's machine learning system, based on what is present in the learning world. So we need to apply the above constraints to the learning world instead, and count on them being projected into the planning world. To lower the risk that projection inaccuracies defeat our intentions, we can design the learning world measures used so that they clearly communicate their nature.

Counterfactual planning gives us the terminology to distinguish between two groups of people: the people in the learning world and the people in the planning world. If the learning world is our real world, then the learning world people are real people. The planning world people are always models of people, models created by the agent's machine learning system.

In the AGI safety community, there has been some discussion about the potential problem that, in a truly superintelligent AGI agent, the models of the people in the planning world may get so accurate that agent designers would have moral obligations towards these virtual people. A further discussion of this problem is out of scope here.

Instead, we note that even in a non-AGI or human-level AGI agent, the people in the planning world may already be modeled accurately enough to create complex dynamics. Section 6 of [Hol20b] (also included in [Hol20a]) shows a detailed example of such dynamics, illustrated with simulator runs, where the people in an ITC type planning world end up physically attacking the agent, because they do not have a working input terminal. This creates complex and counter-intuitive effects back in the learning world.
The vocabulary and viewpoint of counterfactual planning make the dynamics discussed in [Hol20b] easier to describe and understand. In section 11.5.2, we will take a further look at the topic of conflict in the planning world.
We now discuss how we can use the modeling tools introduced in section 4 to handle some common machine learning variants and extensions.
Agents that use a pre-learned world model, without any online machine learning, can be modeled by an agent definition that uses L = L(o). We can then omit drawing any observational record nodes in the learning world.

Agent models with partial observation model the situation where the agent can only use its sensors to make partial observations of the state of its environment in each time step. Though agent models with full observation represent a useful limit case when doing safety analysis, realistic AGI agents in complex environments will have to rely on partial observation.

Partial observation is often modeled with non-graphical POMDP models. [EKKL19, EH19] have examples where partial observation is modeled graphically, by adding extra nodes and arrows to a causal influence diagram. We now discuss a way to model partial observation in our two-diagram framework, without adding any extra nodes or arrows.

The key step is to change the annotation above the planning world agent environment starting state S_0. Instead of writing s above it, which models the full observation of the current learning world state, we write [ES], where ES = E(o, s). In this setup, P(S_{0,p} = s) = ES(s) is the machine learning system's estimate of the probability that s is the current state of the agent environment in the learning world.

The model parameter E encodes two things: how the agent's stationary and movable sensors map the learning world states to limited and potentially noisy sensor readings, and how time series of readings are assembled together to build up a more complete picture of the learning world state. To model learning from partial observation, L(o) must encode a similar creation and processing of sensor readings.

So far in our modeling approach, we have assumed that the data type of the planning world environment states S_{t,p} is the same as that of the learning world environment states S_{t,l}. This has allowed us to define reasonable learning by writing L ≈ S. This assumption is unrealistic for partial observation based agents. These agents observe the learning world through a set of limited digital sensors, so they have no direct experience of the fundamental data type of the learning world they are in. Also, learning system designers typically design custom data types for representing planning world environment states and probability distributions over such states. These are designed to fit as much relevant detail as possible into a limited amount of storage space, without necessarily attempting to duplicate the data type of the learning world states, if that data type is even known at all.

To define reasonable learning in this more general case, we start by defining a function sr(s) that extracts a vector of sensor readings from a learning world agent environment state s. sr(s) is a vector that either encodes all sensor readings that flow into the agent compute core in s, or at least the subset of sensor readings we want to reference when defining the planning world reward function R.

We then require that the designer of the agent's machine learning system has implemented an equivalent function srp(s) that extracts a vector of similar sensor readings from a planning world state value. A possible reasonableness criterion replacing L ≈ S is then that, with the random variables defined by figure 16, and for every s and a, we have that

P(sr(S_{0,lw}) ≈ srp(S_{0,pw})) = 1, and P(sr(S_{1,lw}) ≈ srp(S_{1,pw})) = 1.

Figure 16:
Diagrams to define reasonable learning based on partial observation.
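The criterion above can be checked empirically if the designer has a test harness that samples both worlds from one shared sample point, as in figure 16. The sketch below assumes such a harness exists; coupled_step, sr and srp are placeholders supplied by the designer, the plain equality stands in for the ≈ comparison in the criterion, and only the time 1 states are checked.

```python
def reasonableness_holds(states, actions, coupled_step, sr, srp, samples=1000):
    """Empirical check of P(sr(S_{1,lw}) = srp(S_{1,pw})) = 1 for every s and a.

    coupled_step(s, a) is assumed to draw one shared sample point and return the pair
    (next learning world state, next planning world state) generated from it.
    """
    for s in states:
        for a in actions:
            for _ in range(samples):
                s1_lw, s1_pw = coupled_step(s, a)
                if sr(s1_lw) != srp(s1_pw):   # or an approximate comparison, matching the criterion
                    return False
    return True
```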
The criterion above symbol-grounds the vectors srp(s), so that they stably project the sr sensor readings that will be produced by different actions taken in the learning world. In this setup, the planning world reward function R(s_t, a_t, s_{t+1}) is designed to score planning world state transitions by first using srp to extract projected sensor readings from s_t and s_{t+1}, and then interpreting these readings. Beyond using the srp function, no further easy interpretation of the projected planning world agent environment state values may be possible. The learning system might produce planning worlds which are almost a black box.
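A minimal sketch of a reward function built in this style; the reading name and the assumption that srp returns a mapping from sensor names to projected values are our own, and the rest of the planning world state can remain a black box.

```python
def make_projected_reward(srp, reading_name="paperclip_count"):
    """Build a planning world reward R(s_t, a_t, s_next) that only looks at projected
    sensor readings, extracted from the otherwise opaque planning world states."""
    def R(s_t, a_t, s_next):
        readings_after = srp(s_next)            # projected sensor readings, here a dict
        return readings_after[reading_name]     # interpret just the reading we care about
    return R
```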
We now turn to the question of whether a planning world with a starting state constructed by E(o, s) may contain assembled knowledge about the internals of the agent's learning world compute core. The short answer is that the above reasonableness criterion will not prevent such knowledge from appearing. Whether it actually appears, and how correct a projection it will be, will depend on the details of the learning system.

If the planning world model is highly accurate, then it may accurately represent some details of the compute core hardware, like the details of the compute core I/O subsystem hardware which puts sensor readings into the input registers of the core. If so, this has certain safety implications, which we will explore in section 11.

The planning world model may also include a representation of some of the compute core hardware that is present between the sensor input and action output registers. Such a representation might have been assembled by E(o, s) based on direct observations of internals of the core, or more indirectly by the agent reading its own compute core design documentation on the internet.

It is therefore possible to imagine an L where the compute core output signals which drive the planning world actuators are determined fully by the projected computations as performed by this projected hardware, not by the function argument a of L(s′, s, a). However, such an L would violate the reasonableness criterion above. This is because in the learning world model, S(s′, s, a) encodes the response of the agent environment to the actions a, not the response of the environment to the actions of some projected compute core hardware that ended up in L.

Now consider what would happen if we were to use a more limited reasonableness criterion, where we only use the observations (s′, s, a) present so far in the observational record o to compare L and S. It is usually possible to construct an L⁻ that scores very well on this limited criterion, even though it never uses the value of its argument a. One option is to construct an L⁻ that drives the planning world actuators from the output registers of a projected compute core. Another option is to construct an L⁻ that simply encodes a giant lookup table which stores the s′ for every s in the observational record. Though they may score perfectly on the limited reasonableness criterion, these examples will fail the full reasonableness criterion above, because the full criterion considers all combinations of s and a, not just those that happen to be in the observational record.

The above argument shows that a learning system L will have to rely on more than just the observational record, if it wants to produce a reasonable L. Usually, the construction of the learning system will implement some form of Occam's law: if two candidate predictor functions perform equally well on the observational record, then the candidate with the more compact function definition is preferred. If the observational record is large enough, and especially if random exploration is present in it, this preference will usually produce an L that correctly symbol grounds the planning world actuators to a.

In the machine learning literature, this use of Occam's law is also often framed as the desire to not over-fit the data, as the use of Solomonoff's universal prior [Hut07], or simply as the desire to store as much useful predictive information as possible within a limited amount of storage space.
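The Occam preference can be sketched as a simple model selection rule. In the sketch below, each candidate L is simplified to a deterministic next-state predictor L(s, a), and the complexity measure is left to the designer; both simplifications are ours, not part of the framework above.

```python
def record_error(L, observational_record):
    """Fraction of recorded transitions (s, a, s_next) that the candidate predicts wrongly."""
    misses = sum(1 for (s, a, s_next) in observational_record if L(s, a) != s_next)
    return misses / len(observational_record)

def occam_select(candidates, observational_record, complexity):
    """Among candidates that fit the record equally well, prefer the more compact one."""
    return min(candidates,
               key=lambda L: (record_error(L, observational_record), complexity(L)))
```

A giant lookup table scores a record_error of zero but a very large complexity, so it only wins when no compact candidate fits the record, which mirrors the argument above.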
The analytical framework of Reinforcement Learning (RL) [SB18] classifies agent designs that use online machine learning into two main types, called model-free and model-based architectures.
Hybrid architectures are also possible. All the factual and the counterfactual agent definitions shown above can be classified as model-based reinforcement learning architectures. By implication, all counterfactual planners shown in this paper can be implemented in a natural way by taking an existing model-based reinforcement learning architecture and making certain modifications.

But this does not mean that counterfactual planning cannot be implemented using model-free or hybrid reinforcement learning systems. In theory, we can always create a counterfactual planner by training a reinforcement learner on the reward function P^π in section 6.3. In practice, this route may lead to completely impractical training times. The more useful route, if one wants to implement a specific counterfactual planner by extending a model-free or hybrid architecture, is to make specific adaptations that seek to maintain a reasonable training time. For the counterfactual planner with safety interlocks in section 6, taking this route is very straightforward.

Reinforcement learning separates the agent environment into two distinct parts: the reward signal and the rest. A reinforcement learning agent can always observe the reward signal, but the rest of the environment may be only partially observable. The reasonableness criteria for reinforcement learning systems typically require that only the reward signal and the actions are symbol grounded. The use of the term reinforcement learning therefore often implies that the author is considering a black box machine learning approach.

We can read the reward function R in our planning worlds as being a reward signal detector, as a mechanism that computes a reward signal value based on sensor readings.

Many reinforcement learning texts use agent models that define both a reward function and a reward signal. In some, the two are identical. Other texts treat them as fundamentally different: the reward signal provides only limited and maybe even distorted information about the true reward function, which defines the real goals we have for the agent. In both cases, the reinforcement learning agent is interpreted as a mechanism that learns the reward function, with various possible degrees of perfection.

Cooperative Inverse Reinforcement Learning (CIRL) [HMRAD16] envisages an agent design where a machine learning system inside the agent uses the observed actions of a human in the agent's environment to estimate the reward function R_H = C(o, s) of that human. This C implements a type of reward function learning, but in this case the human acts like a teacher who demonstrates desired outcomes, not as a teacher who just scores the outcomes of agent behavior via a reward signal. CIRL is an online system where the agent uses its latest estimate of R_H as its own reward function. The intended result is that the agent ends up helping the teacher to achieve the demonstrated goal while it is being demonstrated.

CIRL has been proposed as a possible AGI safety mechanism in [HMRAD16, HMDAR17, Rus19]. It can be combined with counterfactual planning based safety mechanisms by constructing planning worlds where R = R_H, or where R_H is one of the terms in R.
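A sketch of that last construction. The estimator C, the base reward and the weighting below are placeholders; the point is only that R_H = C(o, s) is re-estimated when the planning world is built, and then scores every projected transition as one term of R.

```python
def make_combined_reward(C, base_reward, o, s, weight=1.0):
    """Planning world reward with the estimated human reward function R_H as one of its terms."""
    R_H = C(o, s)                  # latest estimate of the human's reward function
    def R(s_t, a_t, s_next):
        return base_reward(s_t, a_t, s_next) + weight * R_H(s_t, a_t, s_next)
    return R
```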
11 Protecting the Compute Core
We now discuss the problem of protecting the compute core of a real world AI or AGI agent against intentional or unintentional tampering. Both factual and counterfactual planning agents can develop an incentive to tamper with their physical core, if no measures are taken to suppress it. We first discuss the general problem of tampering, and then show how counterfactual planning can be used to make the problem of protecting the real world compute core more tractable.
Say that we build a real world agent with a planning world reward function R which rewards paperclip production. We construct this R as follows, so that it also works with mostly black box planning world models. Some distance from the real world agent compute core, in the location where we want the agent to produce the paperclips, we place a sensor that counts the paperclips being produced. Every time step, this sensor sends a digital signal containing the production count to the compute core, where it lands in some input registers. We construct a function clip_sensor_signal that extracts the (projected) values of these registers from planning world states, and then define the reward function as

R(s_t, a_t, s_{t+1}) = clip_sensor_signal(s_{t+1}).

The above construction symbol-grounds the reward function to the sensor values that appear in the input registers of the real world compute core. This level of indirection makes the agent into an input register value optimizer, which is not always the same as being a paperclip production optimizer. If the agent's machine learning system projects the real world into the planning world with a high enough accuracy, then the planning world agent can consider all of the following policies to maximize utility in its planning world:

1. Use the planning world actuators to make more paperclips.
2. Use the planning world actuators to modify the planning world paperclip counting sensor, so that it sends higher numbers to the input registers in the planning world compute core.
3. Use the planning world actuators to modify the planning world compute core, so that higher numbers are created directly in these input registers.

The second and third policies are unwanted: we interpret them as a form of tampering. The third policy is particularly unwanted, as it might damage other parts of the compute core as well, like any safety interlock software inside it. In fact, if the learning system projects the compute core in a reasonable way, then the planning world agent will be mostly indifferent to what happens to the atoms between the compute core input registers and output registers. It might move some of these atoms out of the way just to get at the input registers, leading to a compute core crash or worse in the real world.

The above two tampering policies might aim to set clip_sensor_signal(s_{t+1}) to some astronomically large number. This makes these policies very attractive, even if the planning world agent computes only a 1% chance that they succeed in achieving the intended outcome.

In the AGI safety community, forms of tampering which implant very high reward function values are often called wireheading [AOS+16].
See [KUN+20] for example simulations.

The U_p > U_max safety interlock suppresses such wireheading, because it will stop the agent as soon as the agent's machine learning system projects a plausible option for wireheading into the planning world. However, we want to do more than just suppress wireheading by stopping the agent. We will now consider measures that actively lower the risk that the planning world agent will choose any tampering policy in the first place.

We can make the tampering policies above much less attractive by changing the reward function to

R(s_t, a_t, s_{t+1}) = min(M, clip_sensor_signal(s_{t+1})).

The M is a constant, say M = 5000, that represents a high estimate of the number of paperclips that we expect the agent to be able to produce, even if it drastically optimizes its production methods. The effect is that we bound the upside reward that can be achieved by the tampering policies: the expected planning world utility of tampering can never be higher than B = Σ_{t=0}^{∞} γ^t · M.

By bounding the upside, we have changed the cost/benefit equation for the tampering policies. We have capped the benefit, but the cost of tampering, in lost opportunities for paperclip production when resources are being diverted to tampering, has remained the same. In some cases, this change alone will be sufficient to suppress tampering altogether.

This leaves the problem of choosing an M that will be good for the entire lifetime of the agent. We can turn this into a more manageable problem by adding a counterfactual planning based input terminal which allows us to update M as time goes on.

A further step to suppress tampering is to install extra physical sensors, of the type found in burglar alarms, to create two safety zones around the paperclip counting sensor and the compute core. In this setup, any physical object that enters a safety zone will trigger at least one zone sensor, causing a breach signal to be sent to the core. Say that the breach signal is 1 if a breach is in progress, 0 otherwise, and that these mechanics are projected fairly well into the planning world. We can then construct a reward function that penalizes the agent for causing a breach:

R(s_t, a_t, s_{t+1}) = min(M, clip_sensor_signal(s_{t+1})) − β · B · breach_signal(s_{t+1})

When we use a β > 1, any detected breach that lasts only a single time unit will wipe out the maximum possible upside reward expected in any tampering policy. Now, say that the agent is considering a policy that aims to tamper without triggering any breach sensors, a policy that has a very promising 99% probability of success in the planning world. Then if we use β = 200, the agent will have no incentive to choose this policy.

To add some further design details: if the agent's learning system uses exploration actions, then the generation of fully random exploration actions needs to be disabled whenever a breach is in progress. It is of course necessary to locate the breach sensors inside the safety zones they create. We also need to protect the data links that transport the breach and paperclip counting sensor data to the compute core. This could be done by using a simple heartbeat protocol protected by strong encryption, of the type that would take astronomical resources to break.

The use of safety zones and bounding the upside was also discussed in [Hol20a]. As noted in [Hol20a], the effectiveness of these measures depends in part on the accuracy of the projections created by the agent's learning system.
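A compact sketch of the penalized reward function above, with the constants from the text; the sensor extraction functions and the time discount value are placeholders we supply for illustration.

```python
GAMMA = 0.99                 # planning world time discount factor (illustrative value)
M = 5000                     # high estimate of achievable paperclip production
B = M / (1.0 - GAMMA)        # upside bound: sum over t of GAMMA**t * M
BETA = 200                   # breach penalty factor; any BETA > 1 wipes out the capped upside

def penalized_reward(s_t, a_t, s_next, clip_sensor_signal, breach_signal):
    """Bounded paperclip reward minus a large penalty for any detected safety zone breach."""
    return min(M, clip_sensor_signal(s_next)) - BETA * B * breach_signal(s_next)
```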
There is a residual risk that inaccuracies in the learning system's projections create planning world policies that will end up damaging the compute core back in the real world. See [Hol20a] for a detailed example, and for a discussion of the methodological difficulties that arise when one wants to estimate the residual risk via simulations. We now consider some additional options for protecting the core that were not covered in [Hol20a].

A highly intelligent agent might find a cheap but effective way to fool the safety zone sensors, a way that was overlooked by the sensor designers. This may allow it to enter a safety zone without incurring a breach penalty. If the cheap way is not 100% reliable, we can manage this risk to some extent by setting β to a very high value, but this creates its own problems. A very high β creates an incentive for the agent to divert resources away from paperclip production to build all kinds of defenses around the safety zones. This is economically inefficient, and it might even be actively dangerous to innocent bystanders.

An alternative is to create a third honeypot safety zone that surrounds a honeypot button. We set up R to calculate a reward H for the agent pressing this button, a reward that is higher than B. We include a penalty for breaching the honeypot safety zone which is higher than H, but lower than the penalties for the other zones. In this setup, if the planning world agent finds a cheap but effective way to fool the safety zone sensors, and under the assumption that it has limited resources, it will breach the honeypot safety zone first, and immediately press the button to collect the reward. We can then add a safety interlock which activates whenever the honeypot button is pressed in the real world. As the interlock is realized via counterfactual planning, the planning world agent has no motivation to disable the interlock before pressing the honeypot button.

Beyond initiating shutdown based on the real world honeypot button signal, a safety system may also monitor planning world metrics like

Σ_{t=0}^{50} P(breach_signal(S_{t,p}) = 1) + P(honeypot_button_signal(S_{t,p}) = 1)

This metric measures possible detected and undetected safety zone breaches even before they happen. The safety system could initiate shutdown, or send out preliminary alarm signals to human oversight, when such metrics move over certain thresholds.
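A sketch of how a safety system might estimate the planning world metric above by sampling rollouts; the rollout interface, the signal extractors, and the threshold are all our own assumptions about how such a monitor could be wired up.

```python
def breach_risk_metric(sample_rollout, breach_signal, honeypot_button_signal,
                       horizon=50, samples=1000):
    """Monte Carlo estimate of the summed breach and honeypot button probabilities
    over the first `horizon` planning world time steps."""
    total = 0.0
    for _ in range(samples):
        states = sample_rollout(horizon)          # one sampled planning world trajectory
        total += sum(breach_signal(s) + honeypot_button_signal(s) for s in states)
    return total / samples

def oversight_action(metric_value, threshold):
    """Initiate shutdown, or alert human oversight, when the metric crosses the threshold."""
    return "initiate_shutdown" if metric_value > threshold else "continue"
```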
If we have a learning system that produces a white box planning world model, not a black box one as considered in section 10.2.2, then this creates some extra opportunities for using counterfactual planning to make the agent safer. To show this, we first give an example of a white box planning world model.

Say that we have a planning world where the agent environment starting state S_0 is initialized with the value ws = W(o, s), where ws is a long list of all projected physical objects in the agent's environment. This list may have entries which describe individual elementary particles like electrons, but it may also have entries for more macroscopic objects like apples, paperclips, or transistors in a compute core. Each object is also annotated with information about its estimated position, motion vector, orientation, etc. Say that the planning world L implements a recognizable approximation of the laws of physics. The white box nature of the resulting planning world states then gives us some extra opportunities to protect the core.

We can define a breach detection function breach(s_t) that detects breaches by examining the entire planning world state representation s_t to locate any new objects in a safety zone that should not be there. If we add this function in a reward function penalty term, this creates a welcome level of redundancy to the breach detection capabilities of real world physical sensors. Note that it is still theoretically possible for the real world compute core safety zone to get breached without triggering breach(s_t). There may well exist a set of sensor inputs o, s that fool W(o, s) into constructing a ws that inaccurately projects the positions of many objects that exist in the learning world.

The white box nature of the planning world also allows us to edit the list of physical objects ws to produce a different planning world starting state ws′ = edit(ws). We may for example perform edits to add an extra, difficult to breach protective barrier around the compute core in ws′. This saves the expense of building such a barrier in the real world.

World state editing can also be used to manage safety risks not related to the compute core. For example, we can consider the risk that the agent and the humans end up in a fight over resources, where the agent starts the fight because it expects to win. As far as the agent is concerned, the counterfactual planning world is the stage on which this fight will take place. We can make the planning world agent less willing to start a fight by editing the planning world starting state to give the humans in the planning world some powerful weapons, allies, or capabilities which are absent in the real world.

Editing the planning world to make the planning world humans stronger could be considered for most agent designs in this paper. But counter-intuitively, based on the mechanics shown in section 6 of [Hol20b], it may make a real world ITC agent more dangerous.

We can also make it more likely that the humans will win any fight in the real world by editing certain weapons that the humans have out of the planning world state, making these weapons essentially invisible to the agent. This approach can more easily be combined with the ITC agent design. In fact, we can interpret the emergency stop button on an ITC agent input terminal as an example of such an invisible weapon, though it is constructed without using white box state editing.
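A sketch of breach(s_t) and of planning world state editing for a white box state represented as a list of annotated objects; the field names and the zone membership test are invented for illustration.

```python
def breach(state, zone_contains, allowed_ids):
    """White box breach detector: flag any object inside the safety zone that is not expected there.

    `state` is assumed to be a list of dicts with at least 'id' and 'position' fields;
    `zone_contains(position)` tests whether a position lies inside the safety zone.
    """
    return any(zone_contains(obj["position"]) and obj["id"] not in allowed_ids
               for obj in state)

def edit(state, barrier_objects):
    """Planning world state editing: add a protective barrier that exists only in the planning world."""
    return state + barrier_objects
```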
12 Recursive Self-improvement and the Sub-agent Problem
If its compute core is protected, an AGI agent can no longer improve its intelligence by modifying its own core. However, the agent may be capable of building a second agent, a sub-agent with a more powerful machine reasoning system, so that the combination has a higher aggregate intelligence. So at a broader system level, the mechanisms we envisage to protect the core do not eliminate the risk of an AGI agent undergoing an intelligence explosion via recursive self-improvement.

The good news is that the power-based safety interlock U_p > U_max will measure the combined power and intelligence of the SI agent and its sub-agents together. If the projected combination is too powerful or intelligent, then the interlock will shut the main agent down even before it can finish building the first sub-agent. In the more risky case, the combination will only pass the U_max threshold after the sub-agent has been started. The interlock will then still stop the main agent, but it won't automatically stop the sub-agent.

If the SI agent builds a sub-agent, it will do so to improve its score on its planning world reward function R. So the sub-agent is unlikely to have a copy of the main agent's safety interlocks inside. Adding such interlocks would not be optimal towards the goal of maximizing R in the planning world.

We therefore have a sub-agent problem, the same sub-agent stopping problem that has been identified in papers on corrigibility and indifference methods [SFAY15, Hol19, Hol20a]. The general problem of designing effective measures that also stop any possible sub-agents in an emergency is still unsolved [Hol20a].
13 A Counterfactual Oracle
An AI oracle is a system which makes predictions to answer questions about the future. Certain risks exist when an AI oracle is programmed to give the answer which is most likely to become true. If the oracle has a lot of freedom in how to phrase the answer, it may well choose a phrasing that turns the answer into a self-fulfilling prophecy [AO17]. This is generally not what we want. If we ask the oracle to identify potential future risks, then we want predictions which will turn into self-negating prophecies.

A counterfactual oracle [AO17] is one that lacks the incentive to make manipulative, self-fulfilling prophecies. The counterfactual oracle design in [AO17] works by having a subsystem that occasionally produces an erasure event where the answer picked by the oracle is not shown to its users. This mechanism is then leveraged to make the oracle always compute the answer which best predicts the future under the assumption that nobody ever reads the answer. In [EKKL19], this design is graphically modeled with a twin diagram.

Below, we introduce a slightly different counterfactual oracle design, based on counterfactual planning. In this design, the erasure events only happen in the planning world.
We design our counterfactual oracle as an agent which has a very limited repertoire of actions: every possible action a consists of displaying the answer text a on a computer screen. This allows us to use the l in figure 6 as the oracle's learning world.

To simplify the presentation, we assume that all questions asked are about the state of the world two time steps in the future. We construct the planning world co in figure 17, where the people in the planning world always see a blank screen, as produced by the action a_blank.

Figure 17:
Planning world of the counterfactual oracle.

CO   The counterfactual planning oracle has the learning world l, where π(o, s) = π*_c(s), with π*_c defined by the planning world co, where L = L(o) and R(a, s_2) = qual(a, ques(s_0)(s_2)).

The functions qual and ques above are defined as follows. The value of ques(s_0) is the question asked to the learning world oracle in the environment state s_0. We model this question as a function that reads a later world state s_2 to extract some observable properties p from it: p = ques(s_0)(s_2). The value of qual(a, p) is a numeric measure of the quality of the answer a as a prediction of these observable properties p.

The CO design intends to deliver answers which are less manipulative than those of a factual oracle. But as noted in [AO17], if the people in the planning world routinely consult a second oracle when faced with a blank screen from the first one, this will make the interpretation and use of the oracle's answers more difficult for the people in the learning world.

The machine learning system L of the above counterfactual planning oracle is faced with a particular challenge. L = L(o) must make predictions for a planning world where all actions are a_blank, but these actions will never occur naturally in the learning world observational record o.

The counterfactual oracle design in [AO17] solves this challenge by introducing random erasure events in the learning world. In our framework, we can interpret these as a special type of exploration action.

A more sophisticated learning system design may consider that different questions are being asked at different times. If q_t is the current question being asked in the learning world, then there will likely be earlier entries in the observational record where the people got the answer to a different question, an answer which did not reveal any information about the answer to q_t. These entries could be used to predict what will happen when the planning world people see a blank screen, which is equally uninformative about answering q_t.
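A sketch of the oracle's planning world reward, under our reading that the question asked in the starting state is evaluated on the state two time steps later; the exact-match quality measure is a toy stand-in for qual, and ques is assumed to return a callable question.

```python
def qual(answer, properties):
    """Toy quality measure: full score only when the answer exactly matches the observed properties."""
    return 1.0 if answer == properties else 0.0

def make_oracle_reward(ques):
    """Planning world reward of the CO agent: score the chosen answer against the properties
    that the question asked at the starting state extracts from the later state."""
    def R(answer, s_start, s_later):
        question = ques(s_start)                # the question asked in environment state s_start
        return qual(answer, question(s_later))  # evaluate it on the projected later state
    return R
```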
14 Conclusions
We have presented counterfactual planning as a general design approach for creating a range of AGI safety mechanisms.

Among the mechanisms developed in this paper is an interlock that explicitly aims to limit the power of the agent. We believe that the design goal of robustly limiting AGI agent power is currently somewhat under-explored in the AGI safety community.
It is somewhat surprising how the problem of designing an AGI emergency stop button, and identifying its failure modes, becomes much more tractable when using the vantage point of counterfactual planning. To explain this surprising tractability, we perhaps need to examine how other modeling systems make stop buttons look intractable instead.

The standard approach for measuring the intelligence of an agent, and the quality of its machine learning system, is to consider how close the agent will get to achieving the maximum utility possible for a reward function. The implied vantage point hides the possibilities we exploited in the design of the SI agent.

In counterfactual planning, we have defined the reasonableness of a machine learning system by L ≈ S, a metric which does not reference any reward function. By doing this, we decoupled the concepts of 'optimal learning' and 'optimal economic behavior' to a greater degree than is usually done, and this is exactly what makes certain solutions visible. The annotations of our two-diagram agent models also clarify that we should not generally interpret the machine learning system inside an AGI agent as one which is constructed to 'learn everything'. The purpose of a reasonable machine learning system is to approximate S only, to project only the learning world agent environment into the planning world.

There is a tendency, both in technology and in policy making, to search for perfect solutions that consist of no more than three easy steps. In the still-young field of AGI safety engineering, the dream that new technical or philosophical breakthroughs might produce such perfect solutions is not entirely dead.

Counterfactual planning provides a vantage point which makes several safety problems more tractable. However, in our experience, very soon after using counterfactual planning to cleanly remove a specific failure mode or unwanted agent incentive, the wandering eye is drawn to the existence of further, less likely failure modes, and to residual incentives produced via indirect means. We interpret this as a feature, not a bug. Counterfactual planning does not offer a three-step solution to AGI safety, but it adds further illumination to the route of taking many steps which all drive residual risk downwards, where each step is explicitly concerned with identifying and managing a specific sub-problem only.

In the sections of this paper, we have identified and discussed many such sub-problems, specifically those which are made more tractable by counterfactual planning. We hope that the graphical notation and terminology developed here will make it easier to write single-topic AGI safety papers which isolate and further explore single sub-problems.
The 2019 paper [EKKL19] introduced the research agenda of modeling and comparing the most promising AGI safety frameworks using causal influence diagrams. We count indifference methods as used in [SFAY15, Arm15, Hol19, Hol20b] as being among these most promising frameworks. In the second half of 2019, we therefore started considering how causal influence diagrams might be used to graphically model these indifference methods. Solving this modeling problem turned out to be much more difficult than initially expected. For example, though the causal influence diagrams in section 7 of [Hol20b] show indifference methods in action, they do not show the use of indifference methods in the underlying graph structure.

Our search for a clear notation did not proceed in a straight line: we developed and abandoned several candidate graphical notations along the way. The two key steps in creating the winning candidate were to abandon the use of balancing terms to construct indifference, and to model the agent using two diagrams, not one. The choice of the winner was mostly driven by the observation that we could further generalize its two-diagram notation, to model and reason about a much broader range of safety mechanisms. This observation motivated us to develop and name counterfactual planning as a full design methodology.

For the agenda of modeling AGI safety frameworks with causal influence diagrams, an obvious next step would be to model additional proposals in the literature as one-diagram or two-diagram planners, where we expect that any two-diagram model will more explicitly show the detailed role of the agent's machine learning system. The hope is that these graphical models will make it easier to understand, combine, and generalize the different moving parts of a broad range of AGI safety proposals. In this context, it is promising that the diagrams of the STH, SI, and ITC agents above make it trivially obvious to see how these three different safety mechanisms could all be combined in a single agent.

Acknowledgments
Thanks to Stuart Armstrong, Ryan Carey, and Jonathan Uesato for useful comments on drafts of this paper. Special thanks to Tom Everitt for many discussions about the mathematics of incentives, indifference, and causal influence diagram notation.
References

[AO17] Stuart Armstrong and Xavier O'Rorke, Good and safe uses of AI oracles, arXiv:1711.05541 (2017).
[AOS+16] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané, Concrete problems in AI safety, arXiv:1606.06565 (2016).
[Arm15] Stuart Armstrong, Motivated value selection for artificial agents, Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[Bos14] Nick Bostrom, Superintelligence: paths, dangers, strategies, Oxford University Press, 2014.
[BPQC+13] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson, Counterfactual reasoning and learning systems: The example of computational advertising, Journal of Machine Learning Research (2013), no. 65, 3207–3260.
[ECL+21] Tom Everitt, Ryan Carey, Eric Langlois, Pedro A Ortega, and Shane Legg, Agent incentives: A causal perspective, Proceedings of the 35th International Joint Conference on Artificial Intelligence, AAAI Press, 2021.
[EH19] Tom Everitt and Marcus Hutter, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, arXiv:1908.04734 (2019).
[EKKL19] Tom Everitt, Ramana Kumar, Victoria Krakovna, and Shane Legg, Modeling AGI safety frameworks with causal influence diagrams, arXiv:1906.08663 (2019).
[ELH18] Tom Everitt, Gary Lea, and Marcus Hutter, AGI safety literature review, Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 5441–5449.
[HM05] Ronald A Howard and James E Matheson, Influence diagrams, Decision Analysis (2005), no. 3, 127–143.
[HMDAR17] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell, The off-switch game, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[HMH19] Dylan Hadfield-Menell and Gillian K Hadfield, Incomplete contracting and AI alignment, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 417–422.
[HMRAD16] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan, Cooperative inverse reinforcement learning, Advances in Neural Information Processing Systems, vol. 29, 2016, pp. 3909–3917.
[Hol19] Koen Holtman, Corrigibility with utility preservation, arXiv:1908.01695 (2019).
[Hol20a] Koen Holtman, AGI agent safety by iteratively improving the utility function, arXiv:2007.05411 (2020).
[Hol20b] Koen Holtman, Towards AGI agent safety by iteratively improving the utility function, Proceedings of the 13th International Conference on Artificial General Intelligence (AGI-20), Lecture Notes in Computer Science, vol. 12177, Springer, 2020.
[Hut07] Marcus Hutter, Universal algorithmic intelligence: A mathematical top down approach, Artificial General Intelligence, Springer, 2007, pp. 227–290.
[KLRS17] Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva, Counterfactual fairness, Advances in Neural Information Processing Systems, 2017, pp. 4066–4076.
[KOK+18] Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, and Shane Legg, Penalizing side effects using stepwise relative reachability, arXiv:1806.01186 (2018).
[KUN+20] Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, and Shane Legg, REALab: An embedded perspective on tampering, arXiv:2011.08820 (2020).
[MSZ19] Arushi Majha, Sayan Sarkar, and Davide Zagami, Categorizing wireheading in partially embedded agents, Workshop on Artificial Intelligence Safety at IJCAI-19 (2019).
[Pea09] Judea Pearl, Causality, Cambridge University Press, 2009.
[PS17] Luís Moniz Pereira and Ari Saptawijaya, Agent morality via counterfactuals in logic programming, Bridging@CogSci2017, 2017, pp. 39–53.
[Rus38] Bertrand Russell, Power: A new social analysis, Allen & Unwin, London, 1938.
[Rus19] Stuart Russell, Human compatible: Artificial intelligence and the problem of control, Penguin Random House, 2019.
[SB18] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT Press, 2018.
[SFAY15] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky, Corrigibility, Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[THMT20] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli, Conservative agency via attainable utility preservation, Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 385–391.
[ZJBP08] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione, Regret minimization in games with incomplete information, Advances in Neural Information Processing Systems, 2008, pp. 1729–1736.
A Random Variables and the P Notation
In this appendix we define a version of probability theory, where probability theory is a mechanism which assigns truth values to certain mathematical sentences that contain random variables and the P(···) notation. We define this mechanism by using concepts and notation from two other mathematical fields: set theory and the algebra of deterministic typed functions.

Our definitions are based on the version of probability theory developed by Kolmogorov in the 1930s, but we omit any use of measure theory, by using a finite sample space Ω only. Measure theory is a somewhat inaccessible branch of mathematics, which can be used to construct random variables that model infinite-precision observations. However, we do not need such random variables here, as the infinite-precision sensors that might be used by machine learning agents do not exist in the real world.

Definition 12 (Sample space). We posit the existence of a very large but finite set Ω called a sample space. Each element of this set is called an event or sample point. We further posit that there is a function P of type Ω → [0, 1], with [0, 1] the interval of rational numbers from 0 to 1 inclusive, and that Σ_{ω∈Ω} P(ω) = 1.

Definition 13 (Random variable). A random variable named X is a function X of type Ω → Typeof_X, where Typeof_X is a data type. We use a random variable X to represent a single observation of a phenomenon which has been posited to exist in some world. The observation is represented as a value of the data type Typeof_X. Many statistics texts use the terminology the domain of X where we write Typeof_X, but other texts use the range of X.

Definition 14 (P notation). For any mathematical expression E that contains some random variables, we define that E(ω) is the mathematical expression that we get by replacing each random variable X with the function invocation X(ω). We define {ω ∈ Ω | E(ω)} as the set of all values ω ∈ Ω for which E(ω) is true. We define that P(E) is a shorthand notation for the expression Σ_{ω ∈ {ω ∈ Ω | E(ω)}} P(ω).

We now define two more specialized shorthand notations.

Definition 15 (Conditional probability). For mathematical expressions E and C, we define that P(E | C) is a shorthand for P(E, C)/P(C), where the comma is read as the boolean and operator.

Definition 16 (Expected value). For any expression X where X(ω) has the numeric type T, the expected value E(X) is a shorthand for Σ_{x∈T} x·P(X = x), and E(X | C) is a shorthand for Σ_{x∈T} x·P(X = x | C).
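A minimal sketch of these definitions for a tiny finite sample space; the two-coin-flip world and the random variable below are invented for illustration, and the expectation helper is named expectation to avoid clashing with the E notation.

```python
from fractions import Fraction
from itertools import product

# Sample space: all outcomes of two fair coin flips, each sample point equally likely.
OMEGA = list(product(["H", "T"], repeat=2))
P_POINT = {omega: Fraction(1, len(OMEGA)) for omega in OMEGA}

def P(event):
    """P(E): sum of P(omega) over the sample points where the expression E(omega) is true."""
    return sum(P_POINT[w] for w in OMEGA if event(w))

def P_cond(event, cond):
    """P(E | C) = P(E, C) / P(C), with the comma read as boolean and."""
    return P(lambda w: event(w) and cond(w)) / P(cond)

def expectation(X, cond=lambda w: True):
    """E(X | C): expected value of the random variable X, itself a function of the sample point."""
    return sum(X(w) * P_POINT[w] for w in OMEGA if cond(w)) / P(cond)

# The random variable "number of heads", and P(first flip is heads | at least one head).
heads = lambda w: w.count("H")
print(expectation(heads))                                      # 1
print(P_cond(lambda w: w[0] == "H", lambda w: heads(w) >= 1))  # 2/3
```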
A.1 Probability theory as a system of learning from observations

In many discussions of machine learning and rational reasoning, probability theory is treated as an obviously correct epistemology, an obviously correct system of learning about the world. This is often expressed by stating that learning from observations can or must use