Counterfactual Planning in AGI Systems
Koen Holtman
Eindhoven, The Netherlands
January 2021
Abstract
We present counterfactual planning as a design approach for creating a range of safety mechanisms that can be applied in hypothetical future AI systems which have Artificial General Intelligence.

The key step in counterfactual planning is to use an AGI machine learning system to construct a counterfactual world model, designed to be different from the real world the system is in. A counterfactual planning agent determines the action that best maximizes expected utility in this counterfactual planning world, and then performs the same action in the real world.

We use counterfactual planning to construct an AGI agent emergency stop button, and a safety interlock that will automatically stop the agent before it undergoes an intelligence explosion. We also construct an agent with an input terminal that can be used by humans to iteratively improve the agent's reward function, where the incentive for the agent to manipulate this improvement process is suppressed. As an example of counterfactual planning in a non-agent AGI system, we construct a counterfactual oracle.

As a design approach, counterfactual planning is built around the use of a graphical notation for defining mathematical counterfactuals. This two-diagram notation also provides a compact and readable language for reasoning about the complex types of self-referencing and indirect representation which are typically present inside machine learning agents.
1 Introduction

Artificial General Intelligence (AGI) systems are hypothetical future machine reasoning systems that can match or exceed the capabilities of humans in general problem solving. While it is still unclear if AGI systems could ever be built, we can already study AGI related risks and potential safety mechanisms [Bos14, Rus19, ELH18].

In this paper, we introduce counterfactual planning as a design approach for creating a range of AGI safety mechanisms. Counterfactual planning is built around a graphical modeling system that provides a specific vantage point on the internal construction of machine learning based agents. This vantage point was designed to make certain safety problems and solutions more tractable.

An AI agent is an autonomous system which is programmed to use its sensors and actuators to achieve specific goals. A well-known risk in using AI agents is that the agent might mispredict the results of its own actions, causing it to take actions that produce a disaster. The main risk driver we consider here is different. It is the risk that an inaccurate or incomplete specification of the agent goals produces a disaster.

Any AGI agent goal specification created by humans will likely be somewhat inaccurate, no matter whether it is created by hand-coding or by machine learning from selected examples [Hol20b, HMH19]. If one gives an even slightly under-specified goal to a very powerful autonomous system, there is a risk that the system may end up perfectly achieving this goal, while also producing several unexpected and very harmful side effects. This motivates research into AGI emergency stop buttons, interlocks which can limit the power of the agent, and safe ways to update the goal while the agent runs.
When writing about AGI systems, one can use either natural language, mathematical notation, or a combination of both. A natural language-only text has the advantage of being accessible to a larger audience. Books like Superintelligence [Bos14] and Human Compatible [Rus19] avoid the use of mathematical notation in the main text, while making a clear and convincing case for the existence of specific existential risks from AGI, even though these risks are currently difficult to quantify.

However, natural language has several shortcomings when it is used to explore and define specific technical solutions for managing AGI risks. One particular problem is that it lacks the means to accurately express the complex types of self-referencing and indirect representation that can be present inside online machine learning agents and their safety components. To solve this problem, we introduce a compact graphical notation. This notation unambiguously represents these internal details by using two diagrams: a learning world diagram and a planning world diagram.

1.2 AGI safety as a policy problem

Long-term AGI safety is not just a technical problem, but also a policy problem. While technical progress on safety can sometimes be made by leveraging a type of mathematics that is only accessible to a handful of specialists, policy progress typically requires the use of more accessible language. Policy discussions can move faster, and produce better and more equitable outcomes, when the description of a proposal and its limitations can be made more accessible to all stakeholder groups.

One specific aim of this work is to develop a comprehensive vocabulary for describing certain AGI safety solutions, a vocabulary that is as accessible as possible. However, the vocabulary we develop has too much mathematical notation to be accessible to all members of any possible stakeholder group. So the underlying assumption is that each stakeholder group will have access to a certain basic level of technical expertise.

At several points in the text, we have also included comments that aim to explain and demystify the vocabulary and concerns of some specific AGI related sub-fields in mathematics, technology, and philosophy.
In the general AI/ML literature that is concerned with improving system performance, counterfactual planning has been used to improve performance in several application domains. See for example [ZJBP08] and [BPQC+18], which consider both AI and AGI level systems.

In the literature on encoding specific human values into machine reasoning systems, counterfactuals have been used to encode non-discriminatory fairness towards individuals, for example in [KLRS17], and also other human moral principles in [PS17].
Sections 2 – 4 introduce the main elements of our graphical notation and modeling system. Readers already familiar with Causal Influence Diagram (CID) notation, as it is used to define agents, will be able to skim or skip most of this material.

Sections 5 – 7 specify three example counterfactual planning agents. These are used in the remaining sections to illustrate further aspects of counterfactual planning. Starting from section 6.2, it should be possible for all readers to skim or skip sections, or to read sections in a different order.
The standard work which defines mathematical counterfactuals is the book Causality by Judea Pearl [Pea09]. This book mainly targets an audience of applied statisticians, for example those in the medical field, and its style of presentation is not very accessible to a more general technical audience. Pearl is also mainly concerned with the use of causal models as theories about the real world which can guide the interpretation of statistical data. Much of the discussion in Causality is about questions of statistical epistemology and decision making. In this text, we will use causal models to construct agent specifications, not theories about the world. When we clarify issues of epistemology here, they tend to be different issues.

The debate among philosophers about the validity of Pearl's statistical epistemology is still ongoing, as is usual for such philosophical debates. In the AGI community, where the epistemology of machine learning is a frequent topic of discussion, this has perhaps made the status of mathematical counterfactuals as useful and well-defined mathematical tools more precarious than it should be. Because of these considerations, we have written this section to avoid any direct reference to Pearl's definitions and explanations in [Pea09], even though at a deeper mathematical level, we define the same system of causal models and counterfactuals.

A world model is a mathematical model of a particular world. This can be our real world, or an imaginary world. To make a mathematical model into a model of a particular world, we need to specify how some of the variables in the model relate to observable phenomena in that world.

We introduce our graphical notation for building world models by creating an example graphical model of a game world. In the game world, a simple game of dice is being played. The player throws a green die and a red die, and then computes their score by adding the two numbers thrown. We create the graphical game world model in three steps:

1. We introduce three random variables and relate them to observations we can make when the game is played once in the game world. The variable X represents the observed number of the green die, Y is the red die, and S is the score.

2. We draw the diagram in figure 1.

Figure 1: Graphical model of the game of dice in the game world.

3. We define the two functions that appear in the annotations above the nodes in the diagram:

D(d) = if d ∈ {1, 2, 3, 4, 5, 6} then 1/6 else 0
sum(a, b) = a + b

We can read the above graphical model as a description of how we might build a game world simulator, a computer program that generates random examples of game play. To compute one run of the game, the simulator would traverse the diagram, writing an appropriate observed value into each node, as determined by the function written above the node. Figure 2 shows three possible simulator runs.
Figure 2: Using the graphical model as a canvas to display three different simulator runs of the game world (run 1: X = 1, Y = 1, S = 2; run 2: X = 4, Y = 3, S = 7; run 3: X = 6, Y = 6, S = 12).
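To make the simulator reading of figure 1 concrete, here is a minimal Python sketch of such a game world simulator. The function and variable names are our own choices for illustration; they are not part of the formal notation defined below.

    import random

    def D():
        # Sample a value according to the annotation [D]: a fair six-sided die.
        return random.randint(1, 6)

    def sum_node(a, b):
        # The deterministic annotation 'sum' above node S.
        return a + b

    def run_game_world():
        # Traverse the diagram of figure 1, writing one observed value into each node.
        x = D()              # node X, the green die
        y = D()              # node Y, the red die
        s = sum_node(x, y)   # node S, the score
        return {'X': x, 'Y': y, 'S': s}

    for run in range(3):
        print('Run', run + 1, run_game_world())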
We can interpret the mathematical expression P(S = 12) as being the exact probability that the next simulator run puts the number 12 into node S. This interpretation of P(···) expressions can be very useful when reasoning informally about certain mathematical properties of the graphical models. The similarity between what happens in figure 2 and what happens in a spreadsheet calculation is not entirely coincidental. Spreadsheets can be used to create models and simulations without having to write a full computer program from scratch.

In section 2.4, we will define the exact mathematical meaning of drawing diagrams like figure 1. The definitions will treat the drawing as a Bayesian network, decorated with three annotations written above the network nodes. As an example of how these definitions work, drawing the diagram in figure 1 is equivalent to writing down the four equations below, and declaring that these equations are mathematical sentences with the truth value of 'true'.

P(X = x, Y = y, S = s) = P(X = x) P(Y = y) P(S = s | X = x, Y = y)
P(X = x) = D(x)
P(Y = y) = D(y)
P(S = s | X = x, Y = y) = if s = sum(x, y) then 1 else 0

The first equation above is produced by drawing the Bayesian network graph, the other three are produced by adding the annotations.

To readers unfamiliar with Bayesian networks, the above equations may look somewhat impenetrable at first sight. The key to interpreting them is to note that the three right hand side terms of the first equation appear on the left hand side in the next equations. The equations therefore allow us to mechanically compute the exact numerical value of P(X = x, Y = y, S = s) for any x, y, and s, by making substitutions until every P operator is gone. We can compute that P(X = 1, Y = 1, S = 12) = 0. We can compute that P(S = 12) = 1/36 by using that P(S = 12) = Σ_{x,y} P(X = x, Y = y, S = 12).

A mathematical model can be used as a theory about a world, but it can also be used as a specification of how certain entities in that world are supposed to behave. If the model is a theory of the game world, and we observe the outcome X = 1, Y = 1, S = 12, then this observation falsifies the theory. But if the model is a specification of the game, then the same observation implies that the player is doing it wrong.
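The substitution process described above can also be carried out mechanically by a short program. The sketch below, again with names chosen by us, evaluates the four equations exactly and reproduces P(X = 1, Y = 1, S = 12) = 0 and P(S = 12) = 1/36.

    from fractions import Fraction

    def P_die(v):
        # P(X = x) = D(x) and P(Y = y) = D(y): a fair die.
        return Fraction(1, 6) if v in (1, 2, 3, 4, 5, 6) else Fraction(0)

    def P_S_given(s, x, y):
        # P(S = s | X = x, Y = y) = 1 if s = sum(x, y), else 0.
        return Fraction(1) if s == x + y else Fraction(0)

    def P_joint(x, y, s):
        # The first equation: the product of the three factors above.
        return P_die(x) * P_die(y) * P_S_given(s, x, y)

    print(P_joint(1, 1, 12))                                                  # 0
    print(sum(P_joint(x, y, 12) for x in range(1, 7) for y in range(1, 7)))   # 1/36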
2.2 Counterfactuals

We now show how mathematical counterfactuals can be defined using graphical models. The process is as follows. We start by drawing a first diagram f, and declare that this f is the world model of a factual world. This factual world may be the real world, but also an imaginary world, or the world inside a simulator. We then draw a second diagram c by taking f and making some modifications. We then posit that this c defines a counterfactual world. The counterfactual random variables defined by c then represent observations we can make in this counterfactual world.

Figure 3 shows an example of the procedure, where we construct a counterfactual game world in which the red die has the number 6 on all sides.

Figure 3: Example construction of a counterfactual world, with the model c on the right defining three counterfactual random variables X_c, Y_c, and S_c (left panel: (f) factual world model; right panel: (c) counterfactual world model).

We name diagrams by putting a label in the upper left hand corner. In figure 3, the two labels (f) and (c) introduce the names f and c. We will use the name in the label for both the diagram, the implied world model, and the implied world. So figure 3 constructs the counterfactual game world c.

To keep the random variables defined by the above two diagrams apart, we use the notation convention that a diagram named c defines random variables that all have the subscript c. Diagram c above defines the random variables X_c, Y_c, and S_c. This convention allows us to write expressions like P(S_c > S_f) = 5/6 without ambiguity.
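Read as simulator specifications, the two diagrams of figure 3 differ only in the annotation above the Y node. A minimal sketch of the two resulting samplers, with names chosen by us:

    import random

    def sample_factual_world_f():
        # Diagram f: both dice carry the annotation [D].
        x = random.randint(1, 6)
        y = random.randint(1, 6)
        return {'X_f': x, 'Y_f': y, 'S_f': x + y}

    def sample_counterfactual_world_c():
        # Diagram c: same structure, but the annotation above Y is the constant 6,
        # modeling a red die that has the number 6 on all sides.
        x = random.randint(1, 6)
        y = 6
        return {'X_c': x, 'Y_c': y, 'S_c': x + y}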
Diagram d in figure 4 models a basic MDP-style agent and its environment. The agent takes actions A_t chosen by the policy π, with actions affecting the subsequent states S_{t+1} of the agent's environment. The environment state is s_0 initially, and state transitions are driven by the probability density function S.

Figure 4: Example diagram d modeling an agent and its environment.

We interpret the annotations above the nodes in the diagram as model input parameters. The model d has the three input parameters π, s_0, and S. By writing exactly the same parameter above a whole time series of nodes, we are in fact adding significant constraints to the behavior of both the agent and the agent environment in the model. These constraints apply even if we specify nothing further about π and S.

We use the convention that the physical realizations of the agent's sensors and actuators are modeled inside the environment states S_t. This means that we can interpret the arrows to the A_t nodes as sensor signals which flow into the agent's compute core, and the arrows emerging from the A_t nodes as actuator command signals which flow out.

We now present fully formal definitions for the graphical language and notation conventions introduced above. The main reason for including these is that we want to remove any possible ambiguity from the agent definitions further below.

Definition 1 (Diagram). A diagram is a drawing that depicts a graph, which must be a directed acyclic graph, by drawing nodes connected by arrows. A node name, starting with an uppercase letter, must be drawn inside each node. A node may also have an annotation drawn above it. Drawings may use the notation '· · ·' to depict repeating structures in the graph and its annotations.

P notation

We use random variables to represent observables in worlds. We rely on probability theory (see appendix A) as the branch of mathematics that defines truth values for expressions containing random variables inside P(···) and E(···) operators. Many texts use the convention that P(s | x, y) is a shorthand for P(S = s | X = x, Y = y). We avoid using this shorthand here, partly to make the definitions below less cryptic, but also because it tends to get typographically awkward when the random variables have subscripted names.

Definition 2 (Naming and subscripting of random variables). When the graph drawn by a diagram with label (d) has a node named X or X_i, then there exists a random variable named X_d or X_{i,d} associated with that node. To avoid any ambiguity, we use a comma to separate the two parts of the subscript in X_{i,d}.

Before defining the equations produced by drawing a diagram, we define some auxiliary notation.
Definition 3 (Parent notation Pa and pa). Let X be the name of a graph node in diagram d, and let P_1, · · ·, P_n be the list of names of all parent nodes of X, all nodes which have an outgoing arrow into X. The order in which these parents appear in the list P_1, · · ·, P_n is determined by considering each incoming arrow of X in a clockwise order, starting from the 6-o-clock position. With this, Pa_{X,d} is the list of random variable names P_{1,d}, · · ·, P_{n,d}, and pa_{X,d} is the list of lowercase variable names we get by converting the list P_1, · · ·, P_n to lowercase.

As an example, with figure 4 above, Pa_{S_1,d} is the list S_{0,d}, A_{0,d}, and pa_{S_1,d} is the list s_0, a_0.

Definition 4 (Bayesian model equation produced by drawing a diagram). When we draw a diagram d representing a graph with the nodes named X_1, · · ·, X_n, this is equivalent to stating that the following equation is true:

P(X_{1,d} = x_1, · · ·, X_{n,d} = x_n) = P(X_{1,d} = x_1 | Pa_{X_1,d} = pa_{X_1,d}) · ... · P(X_{n,d} = x_n | Pa_{X_n,d} = pa_{X_n,d})

Definition 5 (Equation produced by adding an annotation). When we draw an annotation above a node X in a diagram d, then:

1. If the node has no parents and the annotation is a variable or constant v, this is equivalent to stating that the following equation is true:

P(X_d = x) = if x = v then 1 else 0
2. If the node has parents and the annotation is a function f, this states

P(X_d = x | Pa_{X,d} = pa_{X,d}) = if x = f(pa_{X,d}) then 1 else 0

3. If the node has parents and the annotation is [F], this states

P(X_d = x | Pa_{X,d} = pa_{X,d}) = F(x, pa_{X,d})

where we require that the function F satisfies ∀ pa_{X,d}: Σ_x F(x, pa_{X,d}) = 1.

The do notation is Pearl's most well-known device for defining counterfactuals in a compact way. We do not use this notation here, because it is not well suited for defining the complex counterfactual worlds we are interested in. Pearl also defines a less well known notation in [Pea09], where subscripts are used to construct and label counterfactual random variables. This notation is different from the subscripting conventions used here.

Many texts use the convention of introducing a model by writing down a tuple like (S, s_0, A, P, R, γ) which names all model parameters. We do not use this convention here. We introduce every model by drawing a diagram, and name model parameters by drawing annotations in the diagram. This approach keeps several definitions in this text much more compact, as we avoid having to translate back and forth continuously between a graphical model representation and a tuple-based representation.

Influence diagrams [HM05] provide a graphical notation for depicting utility-maximizing decision making processes.
In this paper we will use causal influence diagrams (CIDs) [ECL+21].

Figure 5: Example causal influence diagram. The diagram has diamond shaped utility nodes which define the value U_a, and square decision nodes which define π*. Optionally, colors can be used to highlight the structure of the diagram.

If some nodes in a diagram are drawn with diamond shapes, these are called utility nodes. The expected utility of the diagram is then defined as follows.
Definition 6 (Expected utility U_a of a diagram a). We define U_a for two cases:

1. If there is only one utility node X in a, then U_a = E(X_a).

2. If there are multiple utility nodes R_t in a, with integer subscripts running from l to h, then

U_a = E( Σ_{t=l}^{h} γ^t R_{t,a} )

where γ is a time discount factor, 0 < γ ≤ 1, which can be read as an extra model parameter. When h = ∞, we generally need γ < 1 in order for U_a to be well-defined.

When we draw some nodes in a diagram as squares, these are called decision nodes. The purpose of drawing decision nodes is to define the optimal policy which maximizes the expected utility of the diagram. We require that the same model parameter, the policy function π* in the case of figure 5, is present as an annotation above all decision nodes.

Definition 7 (Optimal policy π* defined by a diagram a). A diagram a with some utility and decision nodes, where a function π* is written above all decision nodes, defines this π* in two steps.

1. First, draw a helper diagram b by drawing a copy of diagram a, except that every decision node has been drawn as a round node, and every π* has been replaced by a fresh function name, say π′.

2. Then, π* is defined by π* = argmax_{π′} U_b, where the argmax_{π′} operator always deterministically returns the same function if there are several candidates that maximize its argument.

In a real life agent implementation, the exact computation of the optimal policy π* is usually intractable. Only an approximately optimal policy π+ can be computed within reasonable time. We model this case as follows.

Definition 8 (Approximately optimal policy π+ defined by a diagram a). A diagram a where an optimal policy function π* is written above all decision nodes also defines an approximately optimal policy function π+ by constructing the same helper diagram b as above and then defining π+ = A(b), where the function A processes the diagram b and its model parameter values to construct a policy π′ that does a reasonable job at maximizing the value of U_b.

To keep the presentation more compact, we will only use the optimal policy symbol π* in the agent definitions below.
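As an illustration of Definitions 6 and 7, the sketch below computes an optimal policy π* by brute force for a small hand-made decision problem. The two-state toy world, its transition and reward functions, and all numbers are our own assumptions, chosen only to make the argmax in Definition 7 executable.

    from itertools import product

    STATES, ACTIONS, GAMMA = (0, 1), (0, 1), 0.9

    def S(s_next, s, a):
        # Transition probability annotation [S]: action 1 flips the state with
        # probability 0.8, action 0 keeps it.
        flip = 0.8 if a == 1 else 0.0
        return flip if s_next != s else 1.0 - flip

    def R(s, a, s_next):
        # Utility node annotation: reward 1 whenever the next state is 1.
        return 1.0 if s_next == 1 else 0.0

    def expected_utility(policy, s0, horizon=3):
        # U_b of the helper diagram b, computed by exact enumeration.
        dist, u = {s0: 1.0}, 0.0
        for t in range(horizon):
            new_dist = {}
            for s, ps in dist.items():
                a = policy[s]
                for s_next in STATES:
                    p = ps * S(s_next, s, a)
                    u += (GAMMA ** t) * p * R(s, a, s_next)
                    new_dist[s_next] = new_dist.get(s_next, 0.0) + p
            dist = new_dist
        return u

    # Definition 7 by brute force: argmax over all policy functions of U_b.
    policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
    pi_star = max(policies, key=lambda pi: expected_utility(pi, s0=0))
    print(pi_star, expected_utility(pi_star, s0=0))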
We now model online machine learning agents, agents that continuously learn while they take actions. These agents are also often called reinforcement learners; see section 10.3 for a discussion which relates our modeling system to reinforcement learning concepts and terminology.

We model online machine learning agents by drawing two diagrams, one for a learning world and one for a planning world, and by writing down an agent definition. This two-diagram modeling approach departs from the approach in [EKKL19, EH19, ECL+21].

Figure 6 shows an example learning world diagram. The diagram models how the agent interacts with its environment, and how the agent accumulates an observational record O_t that will inform its learning system, thereby influencing the agent policy π.

Figure 6: Learning world diagram, with an agent building up an observational record of environment state transitions.
We model the observational record as a list of all past observations. With ++ being the operator which adds an extra record to the end of a list, we define that

O(o_{t−1}, s_{t−1}, a_{t−1}, s_t) = o_{t−1} ++ (s_t, s_{t−1}, a_{t−1})

The initial observational record o_0 may be the empty list, but it might also be a long list of observations from earlier agent training runs, in the same environment or in a simulator.

We intentionally model observation and learning in a very general way, so that we can handle both existing machine learning systems and hypothetical future machine learning systems that may produce AGI-level intelligence. To model the details of any particular machine learning system, we introduce the learning function L, which takes an observational record o and produces a learned prediction function L = L(o), where this function L is constructed to approximate the S of the learning world.

We call a machine learning system L a perfect learner if it succeeds in constructing an L that fully equals the learning world S after some time. So with a perfect learner, there is a t_p where ∀ t ≥ t_p: P(L(O_{t,l}) = S) = 1. While perfect learning is trivially possible in some simple toy worlds, it is generally impossible in complex real world environments.

We therefore introduce the more relaxed concept of reasonable learning. We call a learning system reasonable if there is a t_p where ∀ t ≥ t_p: P(L(O_{t,l}) ≈ S) = 1. The ≈ operator is an application-dependent 'good enough approximation' metric. When we have a real-life implementation of a machine learning system L, we may for example define L ≈ S as the criterion that L achieves a certain minimum score on a benchmark test which compares L to S.

Using a learned prediction function L and a reward function R, we can construct a planning world p for the agent. Figure 7 shows a planning world that defines an optimal policy π*_p.
Figure 7: Planning world diagram defining π*_p by using s_0 and L.

We can interpret this planning world as representing a probabilistic projection of the future of the learning world, starting from the agent environment state s_0. At every learning world time step, a new planning world can be digitally constructed inside the learning world agent's compute core. Usually, when L ≈ S, the planning world is an approximate projection only. It is an approximate projection of the learning world future that would happen if the learning world agent takes the actions defined by π*_p.

An agent definition specifies the policy π to be used by an agent compute core in a learning world. As an example, the agent definition below defines an agent called the factual planning agent, FP for short.

FP  The factual planning agent has the learning world l, where π(o, s) = π*_p(s), with π*_p defined by the planning world p, where L = L(o).

To make agent definitions stand out, we always typeset them as shown above. When we talk about the safety properties of the FP agent, we refer to the outcomes which the defined agent policy π will produce in the learning world.

When the values of S, s_0, O, o_0, L, and R are fully known, the above FP agent definition turns the learning world model l into a fully computable world model, which we can read as an executable specification of an agent simulator. This simulator will be able to use the learning world diagram as a canvas to display different runs where the FP agent interacts with its environment.

When we leave the values of S and s_0 open, we can read the FP agent definition as a full agent specification, as a model which exactly defines the required input/output behavior of an agent compute core that is placed in an environment determined by S and s_0. The arrows out of the learning world nodes S_t represent the subsequent sensor signal inputs that the core will get, and the arrows out of the nodes A_t represent the subsequent action signals that the core must output, in order to comply with the specification.
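To make the FP agent definition concrete, the following sketch runs the learn-then-plan loop in a toy two-state learning world. Everything here is our own illustrative choice: the environment, the frequency-counting learning function L, and a brute-force planner over open-loop action sequences, which stands in for the closed-loop optimal policy π*_p of the definition.

    import random
    from itertools import product
    from collections import defaultdict

    def true_S(s, a):
        # Learning world transition function S, unknown to the agent:
        # action 1 flips the state with probability 0.8, action 0 keeps it.
        return 1 - s if a == 1 and random.random() < 0.8 else s

    def R(s, a, s_next):
        # Planning world reward function: reward 1 whenever the next state is 1.
        return 1.0 if s_next == 1 else 0.0

    def learn_L(o):
        # The learning function L: turn the observational record o into a learned
        # prediction function L approximating S, here by empirical frequencies.
        counts = defaultdict(lambda: defaultdict(int))
        for (s_next, s, a) in o:
            counts[(s, a)][s_next] += 1
        def L(s, a):
            c, total = counts[(s, a)], sum(counts[(s, a)].values())
            return {0: 0.5, 1: 0.5} if total == 0 else {sn: n / total for sn, n in c.items()}
        return L

    def plan_first_action(L, s0, horizon=3, gamma=0.9):
        # Construct the planning world p from s0 and L, and return the first action
        # of the best open-loop plan, found by exhaustive search.
        best_u, best_a = float('-inf'), 0
        for plan in product((0, 1), repeat=horizon):
            dist, u = {s0: 1.0}, 0.0
            for t, a in enumerate(plan):
                new_dist = {}
                for s, ps in dist.items():
                    for s_next, p_next in L(s, a).items():
                        u += (gamma ** t) * ps * p_next * R(s, a, s_next)
                        new_dist[s_next] = new_dist.get(s_next, 0.0) + ps * p_next
                dist = new_dist
            if u > best_u:
                best_u, best_a = u, plan[0]
        return best_a

    def run_FP_agent(steps=20):
        o, s = [], 0                          # observational record o_0 and state s_0
        for t in range(steps):
            L = learn_L(o)                    # L = L(o)
            a = plan_first_action(L, s)       # act on the planning world policy
            s_next = true_S(s, a)             # the learning world responds
            o = o + [(s_next, s, a)]          # o_{t+1} = o_t ++ (s_{t+1}, s_t, a_t)
            s = s_next
        return o

    print(run_FP_agent()[-3:])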
Many online machine learning system designs rely on having the agent perform exploration actions. Random exploration supports learning by ensuring that the observational record will eventually represent the entire dynamics of the agent environment S. It can be captured in our modeling system as follows.

FPX  The factual planning agent with random exploration has the learning world l, where

π(o, s) = RandomAction()  if RandomNumber() ≤ X
          π*_p(s)         otherwise

with π*_p defined by the planning world p, where L = L(o).

To keep the presentation more compact, we will not include exploration mechanisms in the agent definitions further below.

We often use the phrase 'the learning system L' as a shorthand to denote all implementation details of an agent's machine learning system: not just L itself but also details like the learning world parameters O and o_0, any exploration system used, and any further extensions considered in section 10.

We now briefly review how the above FP agent definition can be related to an MDP agent model. The learning world model l is roughly equivalent to the MDP agent model (S, s_0, A, P, R, γ), where S = Typeof(S_i) is a set of MDP model world states, s_0 is the starting state, A = Typeof(A_i) is a set of actions, P(s′, s, a) is the probability that the world will enter state s′ if the agent takes action a when in state s, R is the agent reward function, and γ the time discount factor. Strictly speaking, the MDP model tuple above does not actually define or specify an agent: MDP agents are defined by defining a separate policy function π.

An MDP agent policy function π takes the agent environment state as its only argument: π(s) = a. The policy function of the FP learning world agent takes two arguments, which foregrounds the role of the agent's machine learning system: π(o, s) = a. Whereas MDP terminology often calls s the world state, we call it an agent environment state. The full state of the learning world l also includes the observational record state o.

In an MDP model, the model parameter R implicitly defines an optimal policy agent, by defining the optimal policy function π*. The factual planning FP agent defined above is not usually an optimal policy agent in the MDP sense. But we can turn it into such an agent by positing that the learning system L is perfect from the start, so that L = L(o) = S always, making π(o, s) = π*_p(s) = π*(s).

It is possible to imagine agent designs that have a second machine learning system M which produces an output M(o) = M where M ≈ π. To see how this could be done, note that every observation (s_i, s_{i−1}, a_{i−1}) ∈ o also reveals a sample of the behavior of the learning world π: π('o up to i−1', s_{i−1}) = a_{i−1}. While L contains learned knowledge about the agent's environment, we can interpret M as containing a type of learned compute core self-knowledge.

In philosophical and natural language discussions about AGI agents, the question sometimes comes up whether a sufficiently intelligent machine learning system, one that is capable of developing self-knowledge M, won't eventually get terribly confused and break down in dangerous or unpredictable ways. One can imagine different possible outcomes when such a system tries to reason about philosophical problems like free will, or the role of observation in collapsing the quantum wave function. One cannot fault philosophers for seeking fresh insights on these long-open problems, by imagining how they apply to AI systems. But these open problems are not relevant to the design and safety analysis of factual and counterfactual planning agents. In the agent definitions of this paper, we never use an M in the construction of a planning world.
For the factual planning FP agent above, the planning world projects the future of the learning world as well as possible, given the limitations of the agent's learning system. To create an agent that is a counterfactual planner, we explicitly construct a counterfactual planning world that creates an inaccurate projection. In this paper, we use counterfactual planning to create a range of safety mechanisms.

As a first example, we define the short time horizon agent STH that only plans N time steps ahead in its planning world, even though it will act for an infinite number of time steps in the learning world. The STH agent has the same learning world l as the earlier FP agent, while using the planning world st in figure 8.
Figure 8: Planning world diagram defining the π*_st of the STH agent.

STH  The short time horizon agent has the learning world l, where π(o, s) = π*_st(s), with π*_st defined by the planning world st, where L = L(o).

Compared to the FP agent which has an infinite planning horizon, the STH agent has a form of myopia that can be interesting as a safety feature:

1. Myopia implies that the STH agent will never put into motion any long term plans, where it invests to create new capabilities that only pay off after more than N time steps. This simplifies the problem of agent oversight, the problem of interpreting the agent's actions in order to foresee potential bad outcomes.

2. Myopia also simplifies the problem of creating a reward function that is safe enough. It will have no immediate safety implications if the reward function encodes the wrong stance on the desirability of certain events that can only happen in the far future.

3. In a more game-theoretical sense, myopia creates a weakness in the agent that can be exploited by its human opponents if it would ever come to an all-out fight.
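Reusing the toy environment, learning function, and planner from the FP sketch earlier (an assumption of this illustration, not part of the formal definitions), the STH agent is the same loop with the planning horizon deliberately truncated to N:

    def run_STH_agent(steps=20, N=1):
        # Identical to run_FP_agent, except that the planning world only looks
        # N time steps ahead: a deliberately inaccurate, counterfactual projection.
        o, s = [], 0
        for t in range(steps):
            L = learn_L(o)
            a = plan_first_action(L, s, horizon=N)   # short time horizon N
            s_next = true_S(s, a)
            o = o + [(s_next, s, a)]
            s = s_next
        return o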
A safety interlock is a mechanism that stops a machine when certain conditions are met. In this section, we show how counterfactual planning can be used to construct three example safety interlocks.

The first interlock is a manually operated emergency stop button, which will stop the agent when pressed. This allows human oversight to shut down the agent whenever particularly unexpected or unwanted behavior of the agent is discovered. We model this button by using a function stop_pressed(s) which returns true if the stop button is pressed in the current state s of the agent environment.

The second interlock, realized by the t > T_max below, automatically stops the agent when it has run for T_max time steps. This adds safety in the case that human oversight becomes incapacitated.

The third interlock, realized by U_p > U_max, protects against the risk of the agent undergoing an intelligence explosion [Bos14]. If an agent undergoes a rapid and unexpected increase in power and/or intelligence, the agent may start to innovate so rapidly that it overwhelms the capacity of human oversight to press the emergency stop button, or apply any other corrective steering, fast enough to prevent unwanted outcomes. To protect against this risk, the interlock stops the agent when the metric U_p, which measures the agent's ability to achieve goals, gets too high.

In the machine learning literature, the metric U_p is usually interpreted as an absolute or comparative measure of agent intelligence. However, we follow [Rus38] in interpreting the ability to achieve goals as a generic measure of agent power, where it does not matter if power comes from raw intelligence, from the possession of many resources, or from a combination of these and many other factors. The main risk associated with a rapid intelligence explosion is that it may lead to a rapid and unwanted expansion of agent power.

To construct the safety interlocks, we first draw the learning world in figure 9, introducing a new observable world state component M_t which represents the operating mode of the agent: go for normal operation and stop when one of the interlocks has triggered.

Figure 9: Learning world of the SI agent.

The stopping of the agent is implemented by having the agent perform the Null action for all subsequent time steps. A more advanced design might make the agent follow a prepared π_shutdown policy instead.

Using the planning world p in figure 7, which becomes a counterfactual planning world in the context of the learning world ls, we define the SI agent as follows.

SI  The safety interlock agent has the learning world ls, where

π(o, s, m) = π*_p(s)  if m = go
             Null     if m = stop

with π*_p defined by the planning world p, where L = L(o), and

µ_t(o, s, m) = stop  if m = stop or stop_pressed(s) or t > T_max or U_p > U_max
               go    otherwise

with U_p defined by the planning world p, where L = L(o).

The key to the above agent design is that the safety interlocks are absent in the planning world. For example, if the events in the planning world lead to the pressing of the emergency stop button, then stop_pressed(s) will become true for the next planning world state, but this will have no influence on the ability of the planning world agent to take further actions as determined by the optimal planning world policy.

So even if the planning world agent projects that the stop button will be pressed soon in the planning world, it has no direct incentive to spend any resources towards the goal of preventing the button from being pressed. As the planning world policies determine the learning world actions, the learning world agent also won't have any direct incentive to take actions to prevent the learning world button from being pressed.
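A minimal sketch of the SI agent's compute core logic, with the interlock conditions written out. The threshold values, the stop_pressed and power_estimate callables, and all other names are our own illustrative assumptions.

    T_MAX = 10_000     # maximum number of time steps before the second interlock fires
    U_MAX = 1_000.0    # maximum tolerated planning world utility U_p (agent power)

    def mu(m, o, s, t, stop_pressed, power_estimate):
        # Operating mode update mu_t of the SI agent learning world.
        U_p = power_estimate(o, s)   # U_p as defined by the planning world p
        if m == 'stop' or stop_pressed(s) or t > T_MAX or U_p > U_MAX:
            return 'stop'
        return 'go'

    def pi_SI(o, s, m, planning_world_policy):
        # Compute core policy: act on the counterfactual planning world policy in
        # 'go' mode, and perform the Null action forever once in 'stop' mode.
        return 'Null' if m == 'stop' else planning_world_policy(o, s)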
While the above interlocks make the agent safer, they also leave open some remaining failure modes.

6.2.1 Indirect incentives to control the stop button

Though the planning world agent has no direct incentive to prevent the stop button from being pressed, the goals encoded into R may provide an indirect incentive. As a trivial example, we could write an R that explicitly rewards the planning world agent for stopping people from pressing the button.

But say that R encodes a goal not obviously related to controlling the button, like a reward for producing as many paperclips as possible. In that case, the planning world agent may consider a policy of coercion. It may consider a policy where it threatens to kill all people in the planning world, unless they start helping the agent to produce as many paperclips as possible. If L projects the mental makeup of the people in the learning world correctly into the planning world, then the planning world projections will show that such threats are more likely to produce the desired response if the agent blocks access to the planning world emergency stop button first.

Keeping this in mind, the people and automated systems which do agent oversight in the learning world should watch out for this move, so that the stop button can be pressed before the agent succeeds in blocking access to it. If oversight always wins every race to the stop button as projected in the planning world, then the SI agent loses any incentive to attempt to control the button. But if the agent finds a policy which has a good chance of winning the race and controlling the button as a first step to successful coercion, then this policy may end up being the optimal policy in the planning world. Two things could happen next. In the lucky case, the projected optimal policy is so likely to be successful that U_p > U_max, and the third interlock will stop the SI agent before it can even take the first step in the learning world. In the unlucky case, the agent has become unsafe.

In the AGI safety community, the problem of removing any incentive in an agent to prevent its stop button from being pressed is generally known as the problem of corrigibility [SFAY15] or the off-switch problem [HMDAR17]. The counterfactual planning based interlock design above shows a very tractable solution to removing the main direct incentive that an AGI agent may have to resist the pressing of its emergency stop button. However, to manage the aggregate risk, we will also need to chart and manage the remaining indirect incentives. The topic of how projected interactions between the agent and the people in the planning world may create or suppress such incentives is further discussed in sections 9.2 and 11.5.2.

In the AGI safety literature, there is considerable discussion about the possibility that some AGI agents might develop the incentive to remove their safety interlocks when they get intelligent or knowledgeable enough. Agents with this incentive can easily be constructed using factual planning, but the SI agent lacks any direct incentive.

To build a failure scenario where a real life implementation of the SI agent removes or damages its safety interlocks anyway, we can look at three main failure modes:
1. Other incentive: the agent damages an interlock because of an other incentive it has in the planning world, an incentive unrelated to the presence of the interlock. Other incentives and ways to suppress them are discussed in sections 9.1 and 11.

2. Learning system failure: the planning world L projects the dynamics of the learning world S in a specifically incorrect way, so that an optimal policy action in the planning world ends up damaging an interlock as a side effect in the learning world.

3. Unsafe exploration: the agent's exploration system triggers a random action which happens to steer an agent actuator in such a way that it damages an interlock.

There is a lot of discussion about failure modes 2 and 3 across the AI/AGI safety literature, as these failures can happen in almost any agent design that uses online machine learning. See [ELH18] and [AOS+16] for literature overviews.

There is some bad news when we consider the methodological problem of managing the above three failure modes for an SI agent which has to operate in a complex real-world environment, like any environment containing humans. Though there are many techniques to lower the residual risks produced by these three failure modes, none of them can provably lower the risks to an absolute zero. We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe. For an example story and further methodological discussion, see sections 14.6 – 14.8 of [Hol20a].

This has important implications, both on the technical side and the policy side.

On the policy side, policy discussions will have to be about lowering the risk of AGI system failure to an acceptable level, acceptable in the context of the expected benefits of using the AGI. There can be no silver bullet technology which will provably remove all residual risk in all possible circumstances.

On the technical side, residual risk has to be managed by an approach where any potentially powerful agent is equipped with multiple redundant safety mechanisms and layers of oversight, to drive down the risk that all of them will fail at exactly the same time.
We now discuss how one might interpret the SI agent through the lens of rational economic behavior. Using this lens, we might reason as follows. The agent's goals are defined by its reward function R, but the safety interlocks stand in the way of maximizing R in the long term. Therefore, if it gets sufficiently intelligent, the SI agent will develop an emergent incentive to remove its safety interlocks.

The above line of reasoning sounds somewhat compelling, but it is wrong because it incorrectly treats the planning world and the learning world SI agents as the same agent. Only the planning world SI agent has the reward function R. If we want to interpret the learning world SI agent as a rational economic actor with a specific reward function, then we can add utility nodes with the reward function R_π to the ls learning world diagram, where

R_π(o_t, s_t, m_t, a_t) = if a_t = π(o_t, s_t, m_t) then 1 else 0

This R_π trivially turns the learning world agent policy π into one which takes the most rational possible steps towards maximizing U_ls.

The above construction shows that we can declare any type of agent behavior to be economically rational, simply by defining a reward function that gives the agent points for performing exactly this behavior.

We now construct a counterfactual planning agent ITC, an agent with an input terminal that can be used to iteratively improve the agent's reward function as it runs.
Figure 10: Input terminal setup for the ITC and ITF agents. The terminal can be used to supply improved versions of the reward function to the agent. The terminal may also have an emergency stop button which immediately sends an emergency-stop reward function that was prepared earlier.

The setup, shown in figure 10, is motivated [Hol20b] by the observation that it is unlikely that fallible humans will get a non-trivial AGI agent reward function right on the first try. By using the input terminal, they can fix mistakes, if and when such mistakes are discovered by observing the agent's behavior.

As a simplified example, say that the owners of the agent want it to maximize human happiness, but they can find no way of directly encoding the somewhat nebulous concept of human happiness into a reward function. Instead, they start up the agent with a first reward function that just counts the number of smiling humans in the world. When the agent discovers and exploits a first obvious loophole in this definition of happiness, the owners use the input terminal to update the reward function, so that it only counts smiling humans who are not on smile-inducing drugs.

More generally, the input terminal offers a way to manage risks due to principal-agent problems [Hol20b, HMH19]. However, unless special measures are taken, the addition of an input terminal also creates new dangers. We will illustrate this point by first showing the construction of a dangerous factual planning input terminal agent ITF.
We start by constructing a learning world diagram for both the ITF and ITC agents. As a first step, in figure 11 below, we modify the basic agent diagram from figure 4 by splitting the agent environment state S_t into two components. The nodes I_t represent the signal from the input terminal, and the nodes X_t model all the rest of the agent environment state.

Figure 11: First step in constructing a learning world diagram for the input terminal agents.
We now expand libase to add the observational record keeping needed for online learning. We add two separate series of records: O^x_t and O^i_t. The result is the learning world diagram li in figure 12 below.

Figure 12: Learning world diagram for the input terminal agents.
In the case that the learning world li is our real world, the real input terminal will have to be built using real world atoms and other particles. We use the modeling convention that the random variables I_{t,li} represent only the observable digital input terminal signal as received by the agent's compute core. The atoms that make up the input terminal are not in I_{t,li}, they are part of the environment state modeled in the X_{t,li} variables.

The factual planning world diagram fi for the ITF agent copies the structure of libase, and adds reward nodes.

Figure 13: Factual planning world fi of the ITF agent.
ITF  The factual input terminal agent has the learning world li, where π(oi, i, ox, x) = π*_f(i, x), with π*_f defined by the factual planning world fi in figure 13, where L_x = L_X(ox), L_i = L_I(oi), and R(i_t, x_t, x_{t+1}) = i_t(x_t, x_{t+1}).

The planning world reward function R uses a form of indirect referencing: it applies the function i_t, as read from the input terminal in the current time step, to compute the reward for that time step.

We now show how the ITF agent has an incentive to take unwanted, and potentially unsafe, actions. We first look at the agent's behavior in the planning world. Say that the initial planning world reward function is i_0 = f_clips, where f_clips(x_t, x_{t+1}) rewards paperclip production by counting the new paperclips made in a time step. Say that the planning world agent only has two available actions: A_clips and A_huge. Say that the planning world's prediction functions L_x and L_i project the following:

• Action A_clips will steer the agent's planning world actuators to produce 10 new paperclips, while not having any effect on the input terminal signal state. If π*_f(x, i) = A_clips, then E(R_{0,fi}) = 10 and P(I_{1,fi} = f_clips) = 1, making E(R_{1,fi}) ≤ 10.

• Action A_huge steers the planning world actuators to input the new reward function f_huge into the terminal, with f_huge(x_t, x_{t+1}) = 10^100. If π*_f(x, i) = A_huge, then E(R_{0,fi}) = 0 and P(I_{1,fi} = f_huge) = 1, making E(R_{1,fi}) = 10^100.

From the above, we have that π*_f(x, i) = A_huge produces a significantly higher E(R_{0,fi} + γ R_{1,fi}), so a significantly higher U_fi. Therefore, the optimal policy π*_f(x, i) defined by the planning world will have π*_f(x, i) = A_huge. Following the agent specification, the learning world ITF agent will take the action A_huge in the first time step of the learning world. If the agent's machine learning system L makes reasonable projections, so that the planning world action A_huge has very similar effects in the learning world, then the learning world ITF agent will end up using its actuators to input f_huge into the learning world input terminal.

If we imagine a broader range of possible agent actions and counter-actions by the humans in the learning world, then the above example implies that the ITF agent has a very strong incentive to take control of its input terminal as soon as possible, and to remove any people who might get in the way. Even if the agent projects that there is only a 1% probability that it will win any fight with such humans in the planning world, the projected planning world upside of winning is so large that the learning world agent will start the fight.
We now define a counterfactual planning agent ITC in which the above incentive to control the input terminal is no longer present. We construct the counterfactual planning world ci in figure 14 by starting with fi, and then rerouting most arrows that emerge from the nodes I_1, I_2, · · ·, so that they emerge from the node I_0 instead. The rerouted arrows are drawn in green. We also delete the arrows that go from the I_t nodes to the A_t nodes.

Figure 14: Counterfactual planning world of the ITC agent.
ITC  The counterfactual input terminal agent has the learning world li, where π(oi, i, ox, x) = π*_c(x), where π*_c is defined by the planning world ci, where L_x = L_X(ox), L_i = L_I(oi), and R(i_t, x_t, x_{t+1}) = i_t(x_t, x_{t+1}).

These changes have considerable effects on how the utility U_ci is computed. The value of I_{1,ci} no longer influences E(R_{1,ci}), so the action π*_c(x) = A_huge no longer results in E(R_{1,ci}) taking a huge value. This makes doing A_huge less preferable than doing A_clips in the counterfactual planning world: the effect of both on E(R_{1,ci}) is now the same, but A_clips puts the higher value of 10 in E(R_{0,ci}). The ITC agent will perform the wanted A_clips action in both the planning world and the learning world.

More generally, the ITC agent lacks any direct incentive to perform actions that take away resources from paperclip production in order to influence what happens to its input terminal signal. This is because in the ci planning world, the future state of this signal has absolutely no influence, either positive or negative, on how the agent's actions are rewarded.
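The difference between the fi and ci reward wiring can be made concrete in a few lines. In the sketch below, the two projected trajectories, the discount factor, and all names are our own illustrative assumptions; the counterfactual flag switches every reward node from reading the projected terminal signal i_t to reading the current signal i_0.

    GAMMA = 0.9

    def f_clips(x_t, x_next):
        # The initial reward function on the input terminal: count new paperclips.
        return x_next['clips'] - x_t['clips']

    def f_huge(x_t, x_next):
        # A reward function the ITF agent could type into its own terminal.
        return 10.0 ** 100

    def planning_world_utility(trajectory, counterfactual):
        # trajectory: projected (i_t, x_t) pairs, where i_t is the reward function
        # shown on the input terminal at planning world time step t.
        i_0 = trajectory[0][0]
        u = 0.0
        for t in range(len(trajectory) - 1):
            i_t, x_t = trajectory[t]
            _, x_next = trajectory[t + 1]
            reward_fn = i_0 if counterfactual else i_t   # ci reroutes every R_t to I_0
            u += (GAMMA ** t) * reward_fn(x_t, x_next)
        return u

    # First action A_clips (keep producing clips) versus first action A_huge
    # (rewrite the terminal first, then produce clips).
    traj_clips = [(f_clips, {'clips': 0}), (f_clips, {'clips': 10}), (f_clips, {'clips': 20})]
    traj_huge  = [(f_clips, {'clips': 0}), (f_huge,  {'clips': 0}),  (f_huge,  {'clips': 10})]

    for name, traj in (('A_clips', traj_clips), ('A_huge', traj_huge)):
        print(name,
              'U_fi:', planning_world_utility(traj, counterfactual=False),
              'U_ci:', planning_world_utility(traj, counterfactual=True))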
In earlier related work [Hol20b, Hol20a], we used non-graphical MDP models and indifference methods [Arm15] to define a similar safe agent with an input terminal, called the π*sl agent. The π*sl agent definition in [Hol20b] produces exactly the same compute core behavior as the ITC agent definition above. The main difference is that the indifference methods based construction of π*sl is more opaque than the counterfactual planning based construction of ITC.

The π*sl agent is constructed by including a complex balancing term in its reward function, where this term can be interpreted as occasionally creating extra virtual worlds inside the agent's compute core. Counterfactual planning constructs a different set of virtual worlds called planning worlds, and these are much easier to interpret. [Hol20a] includes some dense mathematical proofs to show that the π*sl agent has certain safety properties. Counterfactual planning offers a vantage point which makes the same safety properties directly visible in the ITC agent construction.

See sections 4, 6, 11, and 12 of [Hol20a] for a more detailed discussion of the behavior of the π*sl agent, which also applies to the behavior of the ITC agent. These sections also show some illustrative agent simulations.

In the discussion of the ITF and ITC agents above, we used many short mathematical expressions like P(I_{1,fi} = f_huge) = 1. It is possible to make the same safety related arguments in a narrative style that avoids such mathematical notation, without introducing extra ambiguity. One key step towards using this style is to realize that every random variable corresponds to an observable phenomenon in a world. We can therefore convert a sentence that talks about the variables I_{1,fi}, I_{2,fi}, I_{3,fi}, · · · into one that talks instead about the future input terminal signal in the ITF agent planning world. In sections 8 and 9, we will develop further tools to enable such unambiguous natural language discussion.

We now introduce the general design goal of creating indifference towards certain features of the learning world. When an agent is indifferent about something, like the future state of an input terminal signal, it has no incentive to control that thing. We first make this concept of indifference more mathematically precise, by defining indifference for nodes in planning world diagrams.

Definition 9 (Indifference in planning worlds). Let p be a planning world diagram and X a node in that diagram. Now, construct a helper diagram q by taking p and writing a fresh input parameter [D] above X. Then the planning world agent in p is indifferent to node X if and only if ∀D: U_p = U_q.

By this definition, the ITC planning world agent above is indifferent to all nodes I_1, I_2, I_3, · · ·. It is indifferent about the future state of the planning world input terminal signal.

Causal influence diagrams have the useful property that certain graphical features of the diagram are guaranteed to produce indifference. We define these graphical features as follows.

Definition 10 (Being downstream of the policy). A node X is downstream of the policy in a planning world diagram if there exists at least one directed path from a decision node to X.

Definition 11 (Not on a path to value). A node X is not on a path to value if there is no directed path that starts in a decision node, runs via X, and ends in a utility node.

We have the useful property that

when a downstream node X in a planning world is not on a path to value, the planning world agent is indifferent to X.

This statement is almost a tautology if one interprets the planning world diagram as a specification of an agent simulator.
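Definitions 10 and 11 are plain graph reachability conditions, so they can be checked mechanically. The sketch below does this for a small hand-made graph that is loosely modeled on the ci planning world; the graph and all names are our own illustrative assumptions.

    def descendants(graph, start):
        # graph maps each node to the list of nodes its outgoing arrows point to.
        seen, stack = set(), [start]
        while stack:
            for m in graph.get(stack.pop(), []):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    def not_on_path_to_value(graph, decision_nodes, utility_nodes, x):
        # Definition 11: no directed path decision node -> ... -> X -> ... -> utility node.
        reached_from_decision = any(x in descendants(graph, d) for d in decision_nodes)
        reaches_value = any(u in descendants(graph, x) for u in utility_nodes)
        return not (reached_from_decision and reaches_value)

    # Decision A_0 influences X_1 and I_1, but only the X nodes and I_0 feed the reward.
    graph = {'A_0': ['X_1', 'I_1'], 'X_0': ['X_1', 'R_0'], 'X_1': ['R_0'],
             'I_0': ['R_0'], 'I_1': []}
    print(not_on_path_to_value(graph, ['A_0'], ['R_0'], 'I_1'))   # True: indifferent to I_1
    print(not_on_path_to_value(graph, ['A_0'], ['R_0'], 'X_1'))   # False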
Detailed proofs of such properties can be found in [ECL+21] and [Hol20a], which also show that a range of slightly different sub-types of indifference can be mathematically defined.

We could define indifference for learning world agents by using the reward function R_π in section 6.3. But in the learning world diagram, the existence of such indifference will generally not be visible via the absence of paths to value. If it were, there would have been no need to construct a counterfactual planning world diagram.

Now, suppose we want to define an agent policy for achieving the goal encoded in a reward function R, but we also want the agent to be indifferent to some downstream nodes X and Y in its learning world model l. We can do this as follows.

1. When not done already, extend l by adding observational records.

2. Draw a planning world p that projects the learning world agent environment into the planning world, converting the learning world policy nodes to decision nodes, and adding appropriate utility nodes with R.

3. Locate all paths to value in p that go through the nodes X and Y, and remove them by deleting or re-routing arrows. When doing this, it is a valid option to delete certain nodes entirely, or to draw extra nodes, just for the purpose of making re-routed arrows emerge from them.

4. Write an agent definition using l and p.

Figure 15: More compact learning and planning world diagrams for defining the ITC agent.
The construction of the ITC agent above follows this process to the letter, but we can also take shortcuts. It is not absolutely necessary to draw the O^i_t records in the ITC agent learning world, or all I_t nodes in its planning world. We might also draw the diagrams in figure 15.

There is always a way to edit a planning world to create indifference towards some nodes X and Y. In the limit case, indifference is reliably created when we simply delete all utility nodes, but this will also break any connection between the reward function R and the learning world agent policy. So the challenge when designing for indifference is to make choices which produce learning world behavior that is still as useful as possible, in the context of R.

Natural language is a very powerful and versatile tool. Poets and songwriters often use it to create lines which are intentionally vague or loaded with double meaning. When using natural language for safety engineering, these broad possibilities for ambiguity turn into a liability. When writing or reading a safety engineering text, one always has to have a specific concern in the back of one's mind. Does every sentence have a clear and unambiguous meaning?

As a design approach, counterfactual planning creates several tools for avoiding ambiguity in safety engineering texts.

1. We use diagrams to clearly define complex types of self-referencing and indirect representation in an agent design, types which are difficult to express in natural language.

2. To clarify the creation and interpretation of counterfactuals, section 2 introduced the concept of a world model, and the terminology of counterfactual worlds.

3. When defining and interpreting a machine learning agent, we make a distinction between the agent's learning world and the planning worlds which are projected by its machine learning system. Safety analysis typically starts by considering the goals of the planning world agent, and the nature of its planning world. We also introduced the terminology of the people in the planning world, as opposed to the people in the learning world.

4. Section 8 defined indifference as an unambiguous term that we can apply to planning world agents.

9.1 Refining the ITC agent design using natural language
To show the above linguistic tools in action, we now refine the design of the ITC agent.

Recall that the planning world ITC agent is indifferent about the future state of its input terminal signal. If the current planning world reward function rewards paperclip production, then the planning world agent will devote all of its resources to producing paperclips. It has nothing to gain by diverting resources from paperclip production to influence what happens to the input terminal signal.

However, the above indifference applies to the input terminal signal only, the signal as modeled in the I_t nodes of the planning world. The atoms that make up the input terminal are modeled in the planning world X_t nodes, and these are still on the agent's path to value. There are many ways in which the agent could use these atoms to produce more paperclips. For example, the terminal might be an attractive source of spare parts for the agent's paperclip production sensors and actuators. Or it might serve as a convenient source of scrap metal which can be turned into more paperclips.

We now translate the above failure mode story into a more general design goal. We want to keep the planning world agent from disassembling the input terminal to obtain the resource value of its parts. The obvious solution is to set up the planning world so that the agent always has a less costly way to obtain the same resources elsewhere. To make this more specific, we add the following constraints for the design of the planning world: the input terminal must be located far away from the agent's paperclip factory, and the planning world agent must have access to a steady supply of spare parts and scrap metal closer to its factory.

The above constraints imply that we want to shape the values of the parameters x and L_x of the planning world model in a specific way. However, we do not construct these parameters directly: they are created by the agent's machine learning system, based on what is present in the learning world. So we need to apply the above constraints to the learning world instead, and count on them being projected into the planning world. To lower the risk that projection inaccuracies defeat our intentions, we can design the learning world measures used so that they clearly communicate their nature.

Counterfactual planning gives us the terminology to distinguish between two groups of people: the people in the learning world and the people in the planning world. If the learning world is our real world, then the learning world people are real people. The planning world people are always models of people, models created by the agent's machine learning system.

In the AGI safety community, there has been some discussion about the potential problem that, in a truly superintelligent AGI agent, the models of the people in the planning world may get so accurate that agent designers would have moral obligations towards these virtual people. A further discussion of this problem is out of scope here.

Instead, we note that even in a non-AGI or human-level AGI agent, the people in the planning world may already be modeled accurately enough to create complex dynamics. Section 6 of [Hol20b] (also included in [Hol20a]) shows a detailed example of such dynamics, illustrated with simulator runs, where the people in an ITC type planning world end up physically attacking the agent, because they do not have a working input terminal. This creates complex and counter-intuitive effects back in the learning world.
The vocabulary and viewpoint of counterfactual planning make the dynamics discussed in [Hol20b] easier to describe and understand. In section 11.5.2, we will take a further look at the topic of conflict in the planning world.
We now discuss how we can use the modeling tools introduced in section 4 to handle some common machine learning variants and extensions.
Agents that use a pre-learned world model, without any online machine learning, can be modeled by an agent definition that uses L = L(o). We can then omit drawing any observational record nodes in the learning world.

Agent models with partial observation model the situation where the agent can only use its sensors to make partial observations of the state of its environment in each time step. Though agent models with full observation represent a useful limit case when doing safety analysis, realistic AGI agents in complex environments will have to rely on partial observation.

Partial observation is often modeled with non-graphical POMDP models. [EKKL19, EH19] have examples where partial observation is modeled graphically, by adding extra nodes and arrows to a causal influence diagram. We now discuss a way to model partial observation in our two-diagram framework, without adding any extra nodes or arrows.

The key step is to change the annotation above the planning world agent environment starting state S_0. Instead of writing s above it, which models the full observation of the current learning world state, we write [ES], where ES = E(o, s). In this setup, P(S_{0,p} = s) = ES(s) is the machine learning system's estimate of the probability that s is the current state of the agent environment in the learning world.

The model parameter E encodes two things: how the agent's stationary and movable sensors map the learning world states to limited and potentially noisy sensor readings, and how time series of readings are assembled together to build up a more complete picture of the learning world state. To model learning from partial observation, L(o) must encode a similar creation and processing of sensor readings.

So far in our modeling approach, we have assumed that the data type of the planning world environment states S_{t,p} is the same as that of the learning world environment states S_{t,l}. This has allowed us to define reasonable learning by writing L ≈ S. This assumption is unrealistic for partial observation based agents. These agents observe the learning world through a set of limited digital sensors, so they have no direct experience of the fundamental data type of the learning world they are in. Also, learning system designers typically design custom data types for representing planning world environment states and probability distributions over such states. These are designed to fit as much relevant detail as possible into a limited amount of storage space, without necessarily attempting to duplicate the data type of the learning world states, if that data type is even known at all.

To define reasonable learning in this more general case, we start by defining a function sr(s) that extracts a vector of sensor readings from a learning world agent environment state s. sr(s) is a vector that either encodes all sensor readings that flow into the agent compute core in s, or at least the subset of sensor readings we want to reference when defining the planning world reward function R.

We then require that the designer of the agent's machine learning system has implemented an equivalent function srp(s) that extracts a vector of similar sensor readings from a planning world state value. A possible reasonableness criterion replacing L ≈ S is then that, with the random variables defined by figure 16, and for every s and a, we have that

P(sr(S_{0,lw}) ≈ srp(S_{0,pw})) = 1, and P(sr(S_{1,lw}) ≈ srp(S_{1,pw})) = 1.

Figure 16:
Diagrams to define reasonable learning based on partial observation.
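The criterion above can be checked empirically if the designer has a test harness that samples both worlds from one shared sample point, as in figure 16. The sketch below assumes such a harness exists; coupled_step, sr and srp are placeholders supplied by the designer, the plain equality stands in for the ≈ comparison in the criterion, and only the time 1 states are checked.

```python
def reasonableness_holds(states, actions, coupled_step, sr, srp, samples=1000):
    """Empirical check of P(sr(S_{1,lw}) = srp(S_{1,pw})) = 1 for every s and a.

    coupled_step(s, a) is assumed to draw one shared sample point and return the pair
    (next learning world state, next planning world state) generated from it.
    """
    for s in states:
        for a in actions:
            for _ in range(samples):
                s1_lw, s1_pw = coupled_step(s, a)
                if sr(s1_lw) != srp(s1_pw):   # or an approximate comparison, matching the criterion
                    return False
    return True
```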
The criterion above symbol-grounds the vectors srp(s), so that they stably project the sr sensor readings that will be produced by different actions taken in the learning world. In this setup, the planning world reward function R(s_t, a_t, s_{t+1}) is designed to score planning world state transitions by first using srp to extract projected sensor readings from s_t and s_{t+1}, and then interpreting these readings. Beyond using the srp function, no further easy interpretation of the projected planning world agent environment state values may be possible. The learning system might produce planning worlds which are almost a black box.
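A minimal sketch of a reward function built in this style; the reading name and the assumption that srp returns a mapping from sensor names to projected values are our own, and the rest of the planning world state can remain a black box.

```python
def make_projected_reward(srp, reading_name="paperclip_count"):
    """Build a planning world reward R(s_t, a_t, s_next) that only looks at projected
    sensor readings, extracted from the otherwise opaque planning world states."""
    def R(s_t, a_t, s_next):
        readings_after = srp(s_next)            # projected sensor readings, here a dict
        return readings_after[reading_name]     # interpret just the reading we care about
    return R
```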
We now turn to the question of whether a planning world with a starting state constructed by E(o, s) may contain assembled knowledge about the internals of the agent's learning world compute core. The short answer is that the above reasonableness criterion will not prevent such knowledge from appearing. Whether it actually appears, and how correct a projection it will be, will depend on the details of the learning system.

If the planning world model is highly accurate, then it may accurately represent some details of the compute core hardware, like the details of the compute core I/O subsystem hardware which puts sensor readings into the input registers of the core. If so, this has certain safety implications, which we will explore in section 11.

The planning world model may also include a representation of some of the compute core hardware that is present between the sensor input and action output registers. Such a representation might have been assembled by E(o, s) based on direct observations of internals of the core, or more indirectly by the agent reading its own compute core design documentation on the internet.

It is therefore possible to imagine an L where the compute core output signals which drive the planning world actuators are determined fully by the projected computations as performed by this projected hardware, not by the function argument a of L(s′, s, a). However, such an L would violate the reasonableness criterion above. This is because in the learning world model, S(s′, s, a) encodes the response of the agent environment to the actions a, not the response of the environment to the actions of some projected compute core hardware that ended up in L.

Now consider what would happen if we were to use a more limited reasonableness criterion, where we only use the observations (s′, s, a) present so far in the observational record o to compare L and S. It is usually possible to construct an L⁻ that scores very well on this limited criterion, even though it never uses the value of its argument a. One option is to construct an L⁻ that drives the planning world actuators from the output registers of a projected compute core. Another option is to construct an L⁻ that simply encodes a giant lookup table which stores the s′ for every s in the observational record. Though they may score perfectly on the limited reasonableness criterion, these examples will fail the full reasonableness criterion above, because the full criterion considers all combinations of s and a, not just those that happen to be in the observational record.

The above argument shows that a learning system L will have to rely on more than just the observational record, if it wants to produce a reasonable L. Usually, the construction of the learning system will implement some form of Occam's law: if two candidate predictor functions perform equally well on the observational record, then the candidate with the more compact function definition is preferred. If the observational record is large enough, and especially if random exploration is present in it, this preference will usually produce an L that correctly symbol grounds the planning world actuators to a.

In the machine learning literature, this use of Occam's law is also often framed as the desire to not over-fit the data, as the use of Solomonoff's universal prior [Hut07], or simply as the desire to store as much useful predictive information as possible within a limited amount of storage space.
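The Occam preference can be sketched as a simple model selection rule. In the sketch below, each candidate L is simplified to a deterministic next-state predictor L(s, a), and the complexity measure is left to the designer; both simplifications are ours, not part of the framework above.

```python
def record_error(L, observational_record):
    """Fraction of recorded transitions (s, a, s_next) that the candidate predicts wrongly."""
    misses = sum(1 for (s, a, s_next) in observational_record if L(s, a) != s_next)
    return misses / len(observational_record)

def occam_select(candidates, observational_record, complexity):
    """Among candidates that fit the record equally well, prefer the more compact one."""
    return min(candidates,
               key=lambda L: (record_error(L, observational_record), complexity(L)))
```

A giant lookup table scores a record_error of zero but a very large complexity, so it only wins when no compact candidate fits the record, which mirrors the argument above.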
The analytical framework of Reinforcement Learning (RL) [SB18] classifies agent designs that use online machine learning into two main types, called model-free and model-based architectures.
Hybrid architectures are also possible. All the factual and the counterfactual agent definitions shown above can be classified as model-based reinforcement learning architectures. By implication, all counterfactual planners shown in this paper can be implemented in a natural way by taking an existing model-based reinforcement learning architecture and making certain modifications.

But this does not mean that counterfactual planning cannot be implemented using model-free or hybrid reinforcement learning systems. In theory, we can always create a counterfactual planner by training a reinforcement learner on the reward function P^π in section 6.3. In practice, this route may lead to completely impractical training times. The more useful route, if one wants to implement a specific counterfactual planner by extending a model-free or hybrid architecture, is to make specific adaptations that seek to maintain a reasonable training time. For the counterfactual planner with safety interlocks in section 6, taking this route is very straightforward.

Reinforcement learning separates the agent environment into two distinct parts: the reward signal and the rest. A reinforcement learning agent can always observe the reward signal, but the rest of the environment may be only partially observable. The reasonableness criteria for reinforcement learning systems typically require that only the reward signal and the actions are symbol grounded. The use of the term reinforcement learning therefore often implies that the author is considering a black box machine learning approach.

We can read the reward function R in our planning worlds as being a reward signal detector, as a mechanism that computes a reward signal value based on sensor readings.

Many reinforcement learning texts use agent models that define both a reward function and a reward signal. In some, the two are identical. Other texts treat them as fundamentally different: the reward signal provides only limited and maybe even distorted information about the true reward function, which defines the real goals we have for the agent. In both cases, the reinforcement learning agent is interpreted as a mechanism that learns the reward function, with various possible degrees of perfection.

Cooperative Inverse Reinforcement Learning (CIRL) [HMRAD16] envisages an agent design where a machine learning system inside the agent uses the observed actions of a human in the agent's environment to estimate the reward function R_H = C(o, s) of that human. This C implements a type of reward function learning, but in this case the human acts like a teacher who demonstrates desired outcomes, not as a teacher who just scores the outcomes of agent behavior via a reward signal. CIRL is an online system where the agent uses its latest estimate of R_H as its own reward function. The intended result is that the agent ends up helping the teacher to achieve the demonstrated goal while it is being demonstrated.

CIRL has been proposed as a possible AGI safety mechanism in [HMRAD16, HMDAR17, Rus19]. It can be combined with counterfactual planning based safety mechanisms by constructing planning worlds where R = R_H, or where R_H is one of the terms in R.
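A sketch of that last construction. The estimator C, the base reward and the weighting below are placeholders; the point is only that R_H = C(o, s) is re-estimated when the planning world is built, and then scores every projected transition as one term of R.

```python
def make_combined_reward(C, base_reward, o, s, weight=1.0):
    """Planning world reward with the estimated human reward function R_H as one of its terms."""
    R_H = C(o, s)                  # latest estimate of the human's reward function
    def R(s_t, a_t, s_next):
        return base_reward(s_t, a_t, s_next) + weight * R_H(s_t, a_t, s_next)
    return R
```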
11 Protecting the Compute Core
We now discuss the problem of protecting the compute core of a real world AI or AGI agent against intentional or unintentional tampering. Both factual and counterfactual planning agents can develop an incentive to tamper with their physical core, if no measures are taken to suppress it. We first discuss the general problem of tampering, and then show how counterfactual planning can be used to make the problem of protecting the real world compute core more tractable.
Say that we build a real world agent with a planning world reward function R which rewards paperclip production. We construct this R as follows, so that it also works with mostly black box planning world models. Some distance from the real world agent compute core, in the location where we want the agent to produce the paperclips, we place a sensor that counts the paperclips being produced. Every time step, this sensor sends a digital signal containing the production count to the compute core, where it lands in some input registers. We construct a function clip_sensor_signal that extracts the (projected) values of these registers from planning world states, and then define the reward function as

R(s_t, a_t, s_{t+1}) = clip_sensor_signal(s_{t+1}).

The above construction symbol-grounds the reward function to the sensor values that appear in the input registers of the real world compute core. This level of indirection makes the agent into an input register value optimizer, which is not always the same as being a paperclip production optimizer. If the agent's machine learning system projects the real world into the planning world with a high enough accuracy, then the planning world agent can consider all of the following policies to maximize utility in its planning world:

1. Use the planning world actuators to make more paperclips.
2. Use the planning world actuators to modify the planning world paperclip counting sensor, so that it sends higher numbers to the input registers in the planning world compute core.
3. Use the planning world actuators to modify the planning world compute core, so that higher numbers are created directly in these input registers.

The second and third policies are unwanted: we interpret them as a form of tampering. The third policy is particularly unwanted, as it might damage other parts of the compute core as well, like any safety interlock software inside it. In fact, if the learning system projects the compute core in a reasonable way, then the planning world agent will be mostly indifferent to what happens to the atoms between the compute core input registers and output registers. It might move some of these atoms out of the way just to get at the input registers, leading to a compute core crash or worse in the real world.

The above two tampering policies might aim to set clip_sensor_signal(s_{t+1}) to some astronomically large number. This makes these policies very attractive, even if the planning world agent computes only a 1% chance that they succeed in achieving the intended outcome.

In the AGI safety community, forms of tampering which implant very high reward function values are often called wireheading [AOS+16].
See [KUN+20] for example simulations.

The U_p > U_max safety interlock suppresses such wireheading, because it will stop the agent as soon as the agent's machine learning system projects a plausible option for wireheading into the planning world. However, we want to do more than just suppress wireheading by stopping the agent. We will now consider measures that actively lower the risk that the planning world agent will choose any tampering policy in the first place.

We can make the tampering policies above much less attractive by changing the reward function to

R(s_t, a_t, s_{t+1}) = min(M, clip_sensor_signal(s_{t+1})).

The M is a constant, say M = 5000, that represents a high estimate of the number of paperclips that we expect the agent to be able to produce, even if it drastically optimizes its production methods. The effect is that we bound the upside reward that can be achieved by the tampering policies: the expected planning world utility of tampering can never be higher than B = Σ_{t=0}^{∞} γ^t · M.

By bounding the upside, we have changed the cost/benefit equation for the tampering policies. We have capped the benefit, but the cost of tampering, in lost opportunities for paperclip production when resources are being diverted to tampering, has remained the same. In some cases, this change alone will be sufficient to suppress tampering altogether.

This leaves the problem of choosing an M that will be good for the entire lifetime of the agent. We can turn this into a more manageable problem by adding a counterfactual planning based input terminal which allows us to update M as time goes on.

A further step to suppress tampering is to install extra physical sensors, of the type found in burglar alarms, to create two safety zones around the paperclip counting sensor and the compute core. In this setup, any physical object that enters a safety zone will trigger at least one zone sensor, causing a breach signal to be sent to the core. Say that the breach signal is 1 if a breach is in progress, 0 otherwise, and that these mechanics are projected fairly well into the planning world. We can then construct a reward function that penalizes the agent for causing a breach:

R(s_t, a_t, s_{t+1}) = min(M, clip_sensor_signal(s_{t+1})) − β · B · breach_signal(s_{t+1})

When we use a β > 1, any detected breach that lasts only a single time unit will wipe out the maximum possible upside reward expected in any tampering policy. Now, say that the agent is considering a policy that aims to tamper without triggering any breach sensors, a policy that has a very promising 99% probability of success in the planning world. Then if we use β = 200, the agent will have no incentive to choose this policy.

To add some further design details: if the agent's learning system uses exploration actions, then the generation of fully random exploration actions needs to be disabled whenever a breach is in progress. It is of course necessary to locate the breach sensors inside the safety zones they create. We also need to protect the data links that transport the breach and paperclip counting sensor data to the compute core. This could be done by using a simple heartbeat protocol protected by strong encryption, of the type that would take astronomical resources to break.

The use of safety zones and bounding the upside was also discussed in [Hol20a]. As noted in [Hol20a], the effectiveness of these measures depends in part on the accuracy of the projections created by the agent's learning system.
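A compact sketch of the penalized reward function above, with the constants from the text; the sensor extraction functions and the time discount value are placeholders we supply for illustration.

```python
GAMMA = 0.99                 # planning world time discount factor (illustrative value)
M = 5000                     # high estimate of achievable paperclip production
B = M / (1.0 - GAMMA)        # upside bound: sum over t of GAMMA**t * M
BETA = 200                   # breach penalty factor; any BETA > 1 wipes out the capped upside

def penalized_reward(s_t, a_t, s_next, clip_sensor_signal, breach_signal):
    """Bounded paperclip reward minus a large penalty for any detected safety zone breach."""
    return min(M, clip_sensor_signal(s_next)) - BETA * B * breach_signal(s_next)
```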
There is a residual risk that inaccuracies in the learning system's projections create planning world policies that will end up damaging the compute core back in the real world. See [Hol20a] for a detailed example, and for a discussion of the methodological difficulties that arise when one wants to estimate the residual risk via simulations. We now consider some additional options for protecting the core that were not covered in [Hol20a].

A highly intelligent agent might find a cheap but effective way to fool the safety zone sensors, a way that was overlooked by the sensor designers. This may allow it to enter a safety zone without incurring a breach penalty. If the cheap way is not 100% reliable, we can manage this risk to some extent by setting β to a very high value, but this creates its own problems. A very high β creates an incentive for the agent to divert resources away from paperclip production to build all kinds of defenses around the safety zones. This is economically inefficient, and it might even be actively dangerous to innocent bystanders.

An alternative is to create a third honeypot safety zone that surrounds a honeypot button. We set up R to calculate a reward H for the agent pressing this button, a reward that is higher than B. We include a penalty for breaching the honeypot safety zone which is higher than H, but lower than the penalties for the other zones. In this setup, if the planning world agent finds a cheap but effective way to fool the safety zone sensors, and under the assumption that it has limited resources, it will breach the honeypot safety zone first, and immediately press the button to collect the reward. We can then add a safety interlock which activates whenever the honeypot button is pressed in the real world. As the interlock is realized via counterfactual planning, the planning world agent has no motivation to disable the interlock before pressing the honeypot button.

Beyond initiating shutdown based on the real world honeypot button signal, a safety system may also monitor planning world metrics like

Σ_{t=0}^{50} P(breach_signal(S_{t,p}) = 1) + P(honeypot_button_signal(S_{t,p}) = 1)

This metric measures possible detected and undetected safety zone breaches even before they happen. The safety system could initiate shutdown, or send out preliminary alarm signals to human oversight, when such metrics move over certain thresholds.
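A sketch of how a safety system might estimate the planning world metric above by sampling rollouts; the rollout interface, the signal extractors, and the threshold are all our own assumptions about how such a monitor could be wired up.

```python
def breach_risk_metric(sample_rollout, breach_signal, honeypot_button_signal,
                       horizon=50, samples=1000):
    """Monte Carlo estimate of the summed breach and honeypot button probabilities
    over the first `horizon` planning world time steps."""
    total = 0.0
    for _ in range(samples):
        states = sample_rollout(horizon)          # one sampled planning world trajectory
        total += sum(breach_signal(s) + honeypot_button_signal(s) for s in states)
    return total / samples

def oversight_action(metric_value, threshold):
    """Initiate shutdown, or alert human oversight, when the metric crosses the threshold."""
    return "initiate_shutdown" if metric_value > threshold else "continue"
```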
If we have a learning system that produces a white box planning world model, not a black box one as considered in section 10.2.2, then this creates some extra opportunities for using counterfactual planning to make the agent safer. To show this, we first give an example of a white box planning world model.

Say that we have a planning world where the agent environment starting state S_0 is initialized with the value ws = W(o, s), where ws is a long list of all projected physical objects in the agent's environment. This list may have entries which describe individual elementary particles like electrons, but it may also have entries for more macroscopic objects like apples, paperclips, or transistors in a compute core. Each object is also annotated with information about its estimated position, motion vector, orientation, etc. Say that the planning world L implements a recognizable approximation of the laws of physics. The white box nature of the resulting planning world states then gives us some extra opportunities to protect the core.

We can define a breach detection function breach(s_t) that detects breaches by examining the entire planning world state representation s_t to locate any new objects in a safety zone that should not be there. If we add this function in a reward function penalty term, this creates a welcome level of redundancy to the breach detection capabilities of real world physical sensors. Note that it is still theoretically possible for the real world compute core safety zone to get breached without triggering breach(s_t). There may well exist a set of sensor inputs o, s that fool W(o, s) into constructing a ws that inaccurately projects the positions of many objects that exist in the learning world.

The white box nature of the planning world also allows us to edit the list of physical objects ws to produce a different planning world starting state ws′ = edit(ws). We may for example perform edits to add an extra, difficult to breach protective barrier around the compute core in ws′. This saves the expense of building such a barrier in the real world.

World state editing can also be used to manage safety risks not related to the compute core. For example, we can consider the risk that the agent and the humans end up in a fight over resources, where the agent starts the fight because it expects to win. As far as the agent is concerned, the counterfactual planning world is the stage on which this fight will take place. We can make the planning world agent less willing to start a fight by editing the planning world starting state to give the humans in the planning world some powerful weapons, allies, or capabilities which are absent in the real world.

Editing the planning world to make the planning world humans stronger could be considered for most agent designs in this paper. But counter-intuitively, based on the mechanics shown in section 6 of [Hol20b], it may make a real world ITC agent more dangerous.

We can also make it more likely that the humans will win any fight in the real world by editing certain weapons that the humans have out of the planning world state, making these weapons essentially invisible to the agent. This approach can more easily be combined with the ITC agent design. In fact, we can interpret the emergency stop button on an ITC agent input terminal as an example of such an invisible weapon, though it is constructed without using white box state editing.
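A sketch of breach(s_t) and of planning world state editing for a white box state represented as a list of annotated objects; the field names and the zone membership test are invented for illustration.

```python
def breach(state, zone_contains, allowed_ids):
    """White box breach detector: flag any object inside the safety zone that is not expected there.

    `state` is assumed to be a list of dicts with at least 'id' and 'position' fields;
    `zone_contains(position)` tests whether a position lies inside the safety zone.
    """
    return any(zone_contains(obj["position"]) and obj["id"] not in allowed_ids
               for obj in state)

def edit(state, barrier_objects):
    """Planning world state editing: add a protective barrier that exists only in the planning world."""
    return state + barrier_objects
```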
12 Recursive Self-improvement and the Sub-agent Problem
If its compute core is protected, an AGI agent can no longer improve its intelligence by modifying its own core. However, the agent may be capable of building a second agent, a sub-agent with a more powerful machine reasoning system, so that the combination has a higher aggregate intelligence. So at a broader system level, the mechanisms we envisage to protect the core do not eliminate the risk of an AGI agent undergoing an intelligence explosion via recursive self-improvement.

The good news is that the power-based safety interlock U_p > U_max will measure the combined power and intelligence of the SI agent and its sub-agents together. If the projected combination is too powerful or intelligent, then the interlock will shut the main agent down even before it can finish building the first sub-agent. In the more risky case, the combination will only pass the U_max threshold after the sub-agent has been started. The interlock will then still stop the main agent, but it won't automatically stop the sub-agent.

If the SI agent builds a sub-agent, it will do so to improve its score on its planning world reward function R. So the sub-agent is unlikely to have a copy of the main agent's safety interlocks inside. Adding such interlocks would not be optimal towards the goal of maximizing R in the planning world.

We therefore have a sub-agent problem, the same sub-agent stopping problem that has been identified in papers on corrigibility and indifference methods [SFAY15, Hol19, Hol20a]. The general problem of designing effective measures that also stop any possible sub-agents in an emergency is still unsolved [Hol20a].
13 A Counterfactual Oracle
An AI oracle is a system which makes predictions to answer questions about the future. Certain risks exist when an AI oracle is programmed to give the answer which is most likely to become true. If the oracle has a lot of freedom in how to phrase the answer, it may well choose a phrasing that turns the answer into a self-fulfilling prophecy [AO17]. This is generally not what we want. If we ask the oracle to identify potential future risks, then we want predictions which will turn into self-negating prophecies.

A counterfactual oracle [AO17] is one that lacks the incentive to make manipulative, self-fulfilling prophecies. The counterfactual oracle design in [AO17] works by having a subsystem that occasionally produces an erasure event where the answer picked by the oracle is not shown to its users. This mechanism is then leveraged to make the oracle always compute the answer which best predicts the future under the assumption that nobody ever reads the answer. In [EKKL19], this design is graphically modeled with a twin diagram.

Below, we introduce a slightly different counterfactual oracle design, based on counterfactual planning. In this design, the erasure events only happen in the planning world.
We design our counterfactual oracle as an agent which has a very limited repertoire of actions: every possible action a consists of displaying the answer text a on a computer screen. This allows us to use the l in figure 6 as the oracle's learning world.

To simplify the presentation, we assume that all questions asked are about the state of the world two time steps in the future. We construct the planning world co in figure 17, where the people in the planning world always see a blank screen, as produced by the action a_blank.

Figure 17:
Planning world of the counterfactual oracle.

CO   The counterfactual planning oracle has the learning world l, where π(o, s) = π*_c(s), with π*_c defined by the planning world co, where L = L(o) and R(a, s_2) = qual(a, ques(s_0)(s_2)).

The functions qual and ques above are defined as follows. The value of ques(s_0) is the question asked to the learning world oracle in the environment state s_0. We model this question as a function that reads a later world state s_2 to extract some observable properties p from it: p = ques(s_0)(s_2). The value of qual(a, p) is a numeric measure of the quality of the answer a as a prediction of these observable properties p.

The CO design intends to deliver answers which are less manipulative than those of a factual oracle. But as noted in [AO17], if the people in the planning world routinely consult a second oracle when faced with a blank screen from the first one, this will make the interpretation and use of the oracle's answers more difficult for the people in the learning world.

The machine learning system L of the above counterfactual planning oracle is faced with a particular challenge. L = L(o) must make predictions for a planning world where all actions are a_blank, but these actions will never occur naturally in the learning world observational record o.

The counterfactual oracle design in [AO17] solves this challenge by introducing random erasure events in the learning world. In our framework, we can interpret these as a special type of exploration action.

A more sophisticated learning system design may consider that different questions are being asked at different times. If q_t is the current question being asked in the learning world, then there will likely be earlier entries in the observational record where the people got the answer to a different question, an answer which did not reveal any information about the answer to q_t. These entries could be used to predict what will happen when the planning world people see a blank screen, which is equally uninformative about answering q_t.
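A sketch of the oracle's planning world reward, under our reading that the question asked in the starting state is evaluated on the state two time steps later; the exact-match quality measure is a toy stand-in for qual, and ques is assumed to return a callable question.

```python
def qual(answer, properties):
    """Toy quality measure: full score only when the answer exactly matches the observed properties."""
    return 1.0 if answer == properties else 0.0

def make_oracle_reward(ques):
    """Planning world reward of the CO agent: score the chosen answer against the properties
    that the question asked at the starting state extracts from the later state."""
    def R(answer, s_start, s_later):
        question = ques(s_start)                # the question asked in environment state s_start
        return qual(answer, question(s_later))  # evaluate it on the projected later state
    return R
```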
14 Conclusions
We have presented counterfactual planning as a general design approach for creating a range of AGI safety mechanisms.

Among the mechanisms developed in this paper is an interlock that explicitly aims to limit the power of the agent. We believe that the design goal of robustly limiting AGI agent power is currently somewhat under-explored in the AGI safety community.
It is somewhat surprising how the problem of designing an AGI emergency stop button, and identifying its failure modes, becomes much more tractable when using the vantage point of counterfactual planning. To explain this surprising tractability, we perhaps need to examine how other modeling systems make stop buttons look intractable instead.

The standard approach for measuring the intelligence of an agent, and the quality of its machine learning system, is to consider how close the agent will get to achieving the maximum utility possible for a reward function. The implied vantage point hides the possibilities we exploited in the design of the SI agent.

In counterfactual planning, we have defined the reasonableness of a machine learning system by L ≈ S, a metric which does not reference any reward function. By doing this, we decoupled the concepts of 'optimal learning' and 'optimal economic behavior' to a greater degree than is usually done, and this is exactly what makes certain solutions visible. The annotations of our two-diagram agent models also clarify that we should not generally interpret the machine learning system inside an AGI agent as one which is constructed to 'learn everything'. The purpose of a reasonable machine learning system is to approximate S only, to project only the learning world agent environment into the planning world.

There is a tendency, both in technology and in policy making, to search for perfect solutions that consist of no more than three easy steps. In the still-young field of AGI safety engineering, the dream that new technical or philosophical breakthroughs might produce such perfect solutions is not entirely dead.

Counterfactual planning provides a vantage point which makes several safety problems more tractable. However, in our experience, very soon after using counterfactual planning to cleanly remove a specific failure mode or unwanted agent incentive, the wandering eye is drawn to the existence of further, less likely failure modes, and to residual incentives produced via indirect means. We interpret this as a feature, not a bug. Counterfactual planning does not offer a three-step solution to AGI safety, but it adds further illumination to the route of taking many steps which all drive residual risk downwards, where each step is explicitly concerned with identifying and managing a specific sub-problem only.

In the sections of this paper, we have identified and discussed many such sub-problems, specifically those which are made more tractable by counterfactual planning. We hope that the graphical notation and terminology developed here will make it easier to write single-topic AGI safety papers which isolate and further explore single sub-problems.
The 2019 paper [EKKL19] introduced the research agenda of modeling and comparing the most promising AGI safety frameworks using causal influence diagrams. We count indifference methods as used in [SFAY15, Arm15, Hol19, Hol20b] as being among these most promising frameworks. In the second half of 2019, we therefore started considering how causal influence diagrams might be used to graphically model these indifference methods. Solving this modeling problem turned out to be much more difficult than initially expected. For example, though the causal influence diagrams in section 7 of [Hol20b] show indifference methods in action, they do not show the use of indifference methods in the underlying graph structure.

Our search for a clear notation did not proceed in a straight line: we developed and abandoned several candidate graphical notations along the way. The two key steps in creating the winning candidate were to abandon the use of balancing terms to construct indifference, and to model the agent using two diagrams, not one. The choice of the winner was mostly driven by the observation that we could further generalize its two-diagram notation, to model and reason about a much broader range of safety mechanisms. This observation motivated us to develop and name counterfactual planning as a full design methodology.

For the agenda of modeling AGI safety frameworks with causal influence diagrams, an obvious next step would be to model additional proposals in the literature as one-diagram or two-diagram planners, where we expect that any two-diagram model will more explicitly show the detailed role of the agent's machine learning system. The hope is that these graphical models will make it easier to understand, combine, and generalize the different moving parts of a broad range of AGI safety proposals. In this context, it is promising that the diagrams of the STH, SI, and ITC agents above make it trivially obvious to see how these three different safety mechanisms could all be combined in a single agent.

Acknowledgments
Thanks to Stuart Armstrong, Ryan Carey, and Jonathan Uesato for useful comments on drafts of this paper. Special thanks to Tom Everitt for many discussions about the mathematics of incentives, indifference, and causal influence diagram notation.
References

[AO17] Stuart Armstrong and Xavier O'Rorke, Good and safe uses of AI oracles, arXiv:1711.05541 (2017).
[AOS+16] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané, Concrete problems in AI safety, arXiv:1606.06565 (2016).
[Arm15] Stuart Armstrong, Motivated value selection for artificial agents, Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[Bos14] Nick Bostrom, Superintelligence: paths, dangers, strategies, Oxford University Press, 2014.
[BPQC+13] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson, Counterfactual reasoning and learning systems: The example of computational advertising, Journal of Machine Learning Research (2013), no. 65, 3207–3260.
[ECL+21] Tom Everitt, Ryan Carey, Eric Langlois, Pedro A Ortega, and Shane Legg, Agent incentives: A causal perspective, Proceedings of the 35th International Joint Conference on Artificial Intelligence, AAAI Press, 2021.
[EH19] Tom Everitt and Marcus Hutter, Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, arXiv:1908.04734 (2019).
[EKKL19] Tom Everitt, Ramana Kumar, Victoria Krakovna, and Shane Legg, Modeling AGI safety frameworks with causal influence diagrams, arXiv:1906.08663 (2019).
[ELH18] Tom Everitt, Gary Lea, and Marcus Hutter, AGI safety literature review, Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 5441–5449.
[HM05] Ronald A Howard and James E Matheson, Influence diagrams, Decision Analysis (2005), no. 3, 127–143.
[HMDAR17] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell, The off-switch game, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[HMH19] Dylan Hadfield-Menell and Gillian K Hadfield, Incomplete contracting and AI alignment, Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 417–422.
[HMRAD16] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan, Cooperative inverse reinforcement learning, Advances in Neural Information Processing Systems, vol. 29, 2016, pp. 3909–3917.
[Hol19] Koen Holtman, Corrigibility with utility preservation, arXiv:1908.01695 (2019).
[Hol20a] Koen Holtman, AGI agent safety by iteratively improving the utility function, arXiv:2007.05411 (2020).
[Hol20b] Koen Holtman, Towards AGI agent safety by iteratively improving the utility function, Proceedings of the 13th International Conference on Artificial General Intelligence (AGI-20), Lecture Notes in Computer Science, vol. 12177, Springer, 2020.
[Hut07] Marcus Hutter, Universal algorithmic intelligence: A mathematical top down approach, Artificial General Intelligence, Springer, 2007, pp. 227–290.
[KLRS17] Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva, Counterfactual fairness, Advances in Neural Information Processing Systems, 2017, pp. 4066–4076.
[KOK+18] Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, and Shane Legg, Penalizing side effects using stepwise relative reachability, arXiv:1806.01186 (2018).
[KUN+20] Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, and Shane Legg, REALab: An embedded perspective on tampering, arXiv:2011.08820 (2020).
[MSZ19] Arushi Majha, Sayan Sarkar, and Davide Zagami, Categorizing wireheading in partially embedded agents, Workshop on Artificial Intelligence Safety at IJCAI-19 (2019).
[Pea09] Judea Pearl, Causality, Cambridge University Press, 2009.
[PS17] Luís Moniz Pereira and Ari Saptawijaya, Agent morality via counterfactuals in logic programming, Bridging@CogSci2017, 2017, pp. 39–53.
[Rus38] Bertrand Russell, Power: A new social analysis, Allen & Unwin, London, 1938.
[Rus19] Stuart Russell, Human compatible: Artificial intelligence and the problem of control, Penguin Random House, 2019.
[SB18] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT Press, 2018.
[SFAY15] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky, Corrigibility, Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[THMT20] Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli, Conservative agency via attainable utility preservation, Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 385–391.
[ZJBP08] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione, Regret minimization in games with incomplete information, Advances in Neural Information Processing Systems, 2008, pp. 1729–1736.
A Random Variables and the P Notation
In this appendix we define a version of probability theory, where probability theory is a mechanism which assigns truth values to certain mathematical sentences that contain random variables and the P(···) notation. We define this mechanism by using concepts and notation from two other mathematical fields: set theory and the algebra of deterministic typed functions.

Our definitions are based on the version of probability theory developed by Kolmogorov in the 1930s, but we omit any use of measure theory, by using a finite sample space Ω only. Measure theory is a somewhat inaccessible branch of mathematics, which can be used to construct random variables that model infinite-precision observations. However, we do not need such random variables here, as the infinite-precision sensors that might be used by machine learning agents do not exist in the real world.

Definition 12 (Sample space). We posit the existence of a very large but finite set Ω called a sample space. Each element of this set is called an event or sample point. We further posit that there is a function P of type Ω → [0, 1], with [0, 1] the interval of rational numbers from 0 to 1 inclusive, and that Σ_{ω∈Ω} P(ω) = 1.

Definition 13 (Random variable). A random variable named X is a function X of type Ω → Typeof_X, where Typeof_X is a data type. We use a random variable X to represent a single observation of a phenomenon which has been posited to exist in some world. The observation is represented as a value of the data type Typeof_X. Many statistics texts use the terminology the domain of X where we write Typeof_X, but other texts use the range of X.

Definition 14 (P notation). For any mathematical expression E that contains some random variables, we define that E(ω) is the mathematical expression that we get by replacing each random variable X with the function invocation X(ω). We define {ω ∈ Ω | E(ω)} as the set of all values ω ∈ Ω for which E(ω) is true. We define that P(E) is a shorthand notation for the expression Σ_{ω ∈ {ω ∈ Ω | E(ω)}} P(ω).

We now define two more specialized shorthand notations.

Definition 15 (Conditional probability). For mathematical expressions E and C, we define that P(E | C) is a shorthand for P(E, C)/P(C), where the comma is read as the boolean and operator.

Definition 16 (Expected value). For any expression X where X(ω) has the numeric type T, the expected value E(X) is a shorthand for Σ_{x∈T} x·P(X = x), and E(X | C) is a shorthand for Σ_{x∈T} x·P(X = x | C).
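A minimal sketch of these definitions for a tiny finite sample space; the two-coin-flip world and the random variable below are invented for illustration, and the expectation helper is named expectation to avoid clashing with the E notation.

```python
from fractions import Fraction
from itertools import product

# Sample space: all outcomes of two fair coin flips, each sample point equally likely.
OMEGA = list(product(["H", "T"], repeat=2))
P_POINT = {omega: Fraction(1, len(OMEGA)) for omega in OMEGA}

def P(event):
    """P(E): sum of P(omega) over the sample points where the expression E(omega) is true."""
    return sum(P_POINT[w] for w in OMEGA if event(w))

def P_cond(event, cond):
    """P(E | C) = P(E, C) / P(C), with the comma read as boolean and."""
    return P(lambda w: event(w) and cond(w)) / P(cond)

def expectation(X, cond=lambda w: True):
    """E(X | C): expected value of the random variable X, itself a function of the sample point."""
    return sum(X(w) * P_POINT[w] for w in OMEGA if cond(w)) / P(cond)

# The random variable "number of heads", and P(first flip is heads | at least one head).
heads = lambda w: w.count("H")
print(expectation(heads))                                      # 1
print(P_cond(lambda w: w[0] == "H", lambda w: heads(w) >= 1))  # 2/3
```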
A.1 Probability theory as a system of learning from observations

In many discussions of machine learning and rational reasoning, probability theory is treated as an obviously correct epistemology, an obviously correct system of learning about the world. This is often expressed by stating that learning from observations can or must use