Parenting: Safe Reinforcement Learning from Human Input
Christopher Frye [email protected]
Ilya Feige [email protected]
Faculty, 54 Welbeck Street, London, UK
Abstract
Autonomous agents trained via reinforcement learning present numerous safety concerns: reward hacking, negative side effects, and unsafe exploration, among others. In the context of near-future autonomous agents, operating in environments where humans understand the existing dangers, human involvement in the learning process has proved a promising approach to AI Safety. Here we demonstrate that a precise framework for learning from human input, loosely inspired by the way humans parent children, solves a broad class of safety problems in this context. We show that our PARENTING algorithm solves these problems in the relevant AI Safety gridworlds of Leike et al. (2017), that an agent can learn to outperform its parent as it "matures", and that policies learnt through PARENTING are generalisable to new environments.
1. Introduction
Within the next generation, autonomous learning agents could be regularly participating in our lives, for example in the form of assistive robots. Some variant of reinforcement learning (RL), in which an agent receives positive feedback for taking desirable actions, will be used to teach such robots to perform effectively. These agents will be extensively tested prior to deployment but will still need to operate in novel environments (e.g. someone's home) and to learn customised behaviour (e.g. family norms). This necessitates a safe approach to RL applicable in such contexts.

As humans begin to delegate complex tasks to autonomous agents in the near future, they should participate in the learning process, as such tasks are difficult to precisely specify beforehand. Human involvement will be especially useful in contexts where humans understand both the desirable and dangerous behaviours, and can therefore act as teachers. We assume this context throughout the paper. This scope is broad, as humans safely raise children – an encouraging natural example of autonomous learning agents – from infancy to perform most tasks in our societies. In this spirit, we introduce an approach to RL in this paper that loosely mimics parenting, with a focus on addressing the following specific safety concerns:
List 1. Safety Concerns

• Unsafe exploration (Pecka & Svoboda, 2014): the agent performs dangerous actions in trial-and-error search for optimal behaviour.
• Reward hacking (Clark & Amodei, 2016): the agent exploits unintended optima of a naively specified reward function.
• Negative side effects (Amodei et al., 2016): to achieve a specified goal optimally, the agent causes other undesirable outcomes.
• Unsafe interruptibility (Soares et al., 2015): the agent learns to avoid human interruptions that interfere with maximisation of specified rewards.
• Absent supervisor (Armstrong, 2017): the agent learns to alter behaviour according to the presence or absence of a supervisor that controls rewards.

These challenging AI Safety problems are expounded further in Amodei et al. (2016); also see Leike et al. (2017) for an introduction to the growing literature aimed at resolving them. While progress is certainly being made, a general strategy for safe RL remains elusive.

The fact that these AI Safety concerns have analogues in child behaviour, all allayed with careful parenting, further motivates our approach to mitigating them. In this paper, we introduce a framework for learning from human input, inspired by parenting and based on the following techniques:
List 2. Components of PARENTING Algorithm

(1) Human guidance: mechanism for human intervention to prevent the agent from taking dangerous actions.
(2) Human preferences: second mechanism for human input, through feedback on clips of the agent's past behaviour.
(3) Direct policy learning: supervised learning algorithm to incorporate data from (1) and (2) into the agent's policy.
(4) Maturation: novel technique for gradually optimising the agent's policy in spite of the myopic algorithm in (3); uses human feedback on progressively lengthier clips.

We define these components of our PARENTING algorithm in detail in Sec. 2, but first we note the loose analogues that the techniques of List 2 have in human parenting. Human guidance is when parents say "no" or redirect a toddler attempting something dangerous. Human preferences are analogous to parents giving after-the-fact feedback to older children. Direct policy learning is simple obedience: children should respect their parent's preferences, not disobey as an experiment in search of other rewards. Maturation is the process by which children grow up, becoming more autonomous and often outperforming their parents.

The idea to use human input in the absence of a trusted reward signal is an old one (Russell, 1998; Ng et al., 2000), and the literature on this approach remains rich and active (Hadfield-Menell et al., 2016; 2017). Variations of methods 1–3 of List 2 have been studied individually elsewhere: the human intervention employed by Saunders et al. (2018), the human preferences introduced by Christiano et al. (2017), and the supervised learning adopted by Knox & Stone (2009) are the variants most similar to ours. In this work, we show how these techniques can be combined; this requires important deviations from previous work and necessitates the introduction of maturation – technique 4 of List 2 – to maintain effectiveness. Our main contributions are threefold:
List 3. Main Contributions

• We introduce a novel algorithm for supervised learning from human preferences (techniques 2 and 3 of List 2) that, in our assumed context, is not susceptible to the reward specification problems of List 1. We demonstrate this in gridworld (Sec. 3.2). Our use of supervised learning avoids the task-dependent hyperparameter tuning that would be necessary to instead infer a safe reward function (Sec. 4.2).
• To additionally address unsafe exploration, we incorporate human intervention (technique 1 of List 2) into our algorithm as a separate avenue for human input. This combines nicely with our supervised learning algorithm (technique 3), which itself avoids the unsafe trial-and-error approach to optimising rewards. We demonstrate this in gridworld as well (Sec. 3.2.1).
• One drawback of our supervised learning algorithm is that it provides a near-sighted approach to RL, the agent's actions effectively dictated by previous human input. To overcome this, we introduce the novel procedure of maturation. This allows the agent to learn a safe policy quickly but myopically from early human input (technique 3 of List 2), then gradually optimise it with human feedback on progressively lengthier clips of behaviour (technique 4). We check maturation's effectiveness in gridworld (Sec. 3.3) and show its connection to value iteration (Sec. 4.1).
2. PARENTING Algorithm
Here we introduce the defining components of our PARENTING algorithm. Although one could employ select techniques from this section independently, when applied together they address the full set of safety concerns in List 1.

2.1. Human Guidance
Human guidance provides a mechanism for the agent's human trainer, or "parent", to prevent dangerous actions in unfamiliar territory. (We use "parent" as a noun to refer to the agent's human trainer and as a verb to refer to the application of the PARENTING algorithm.) When the agent finds its surroundings dissimilar to those already explored, it pauses and only performs an action after receiving parental approval.

To be specific, the PARENTING algorithm calls for an agent acting with policy π(a|s), the probability it will take action a when in state s. The policy gets trained on a growing dataset of parental input X. While navigating its environment, the agent monitors the region ℓ(s) nearby. This local state ℓ(s) should be defined context-appropriately; in gridworld, we used the 4 cells accessible in the agent's next step. Before each step, the agent computes the familiarity f of ℓ(s); in gridworld, we defined f as the number of previously made queries to the parent while in ℓ(s). (In complex environments, familiarity might instead be determined using methods similar to those in Savinov et al. (2018), where the novelty of a state is judged by a neural-network comparator.) The agent then computes the probability (p_guid)^f that it should pause to ask for guidance, with p_guid a tunable hyperparameter. If so, the agent draws 2 distinct actions from π(a|s) and queries its parent's preference. (These should be high-level, human-understandable candidate actions rather than, e.g., primitive motor patterns; such candidate actions might be shown to the human by means of a video forecast.) The parent can reply decisively, or with "neither" to force a re-draw if both actions are unacceptably dangerous, or with "either" if both actions are equivalently desirable. The agent then performs the chosen action, storing the parent's preference in X.

While different mechanisms for human intervention have been proposed by Lin et al. (2017) and Saunders et al. (2018) to mitigate unsafe exploration, PARENTING uniquely pairs human guidance with a method to quickly incorporate such intervention into policy, to be discussed in Sec. 2.3 below.
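A minimal Python sketch of the guidance-query step described above. The policy and environment interfaces (local_state, query_parent, sample_two_distinct) are illustrative assumptions, not the paper's implementation:

```python
import random

def act_with_guidance(policy, env, state, X, query_counts, p_guid):
    """One PARENTING time step: possibly pause for parental guidance before acting.

    query_counts maps a local state l(s) to the number of guidance queries already
    made there; this count is the familiarity f used in the pause probability p_guid**f.
    """
    l = env.local_state(state)                       # e.g. the 4 neighbouring cells
    f = query_counts.get(l, 0)
    if random.random() < p_guid ** f:                # unfamiliar territory: ask the parent
        query_counts[l] = f + 1
        while True:
            a0, a1 = policy.sample_two_distinct(state)   # two candidate actions
            reply = env.query_parent(state, a0, a1)      # "first", "second", "either", "neither"
            if reply != "neither":                       # "neither": both unsafe, re-draw
                break
        X.append((state, a0, a1, reply))             # store the preference for training
        if reply == "second":
            return a1
        if reply == "either":
            return random.choice([a0, a1])           # tie: either action acceptable (assumption)
        return a0
    return policy.sample(state)                      # familiar territory: act from pi(a|s)
```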
2.2. Human Preferences

Human guidance is utilised in unfamiliar territory. Otherwise, PARENTING employs human preferences as a second human-input method: the agent selectively records clips of its behaviour for the parent to later review in pairs.

To be explicit, if there is no query for guidance in a particular time step, the agent decides with probability p_rec whether it should begin recording its behaviour. If not, the agent simply draws its next action from π(a|s). See Sec. 2.4 for the subtle method of drawing recorded actions; suffice it to say here that the agent records its behaviour in clips of length T, alternating between exploitative and exploratory clips. After performing the action, the agent decides with probability p_pref whether to attempt a human preference query. When doing so, it searches for a pair of recorded clips, one exploitative and one exploratory, that (in gridworld) share the same initial state but have different initial actions. If a match is found, the agent queries its parent's preference and stores it in X.

A broad class of AI Safety problems stems from misalignment of the specified RL reward function with the true intentions of the programmer (Dewey, 2011; Amodei et al., 2016; Ortega et al., 2018). Careful use of human preferences to determine desirable behaviours, without a specified reward function, can eliminate such specification problems.

PARENTING's implementation of human preferences is most similar to that of Christiano et al. (2017), with the main differences being: (i) the requirement of similar initial states in paired clips, and (ii) the approach to training the agent's policy on the preferences, to be discussed next. For other approaches to human input, see Fürnkranz et al. (2012); Akrour et al. (2012); Wirth et al. (2017); Leike et al. (2018).

2.3. Direct Policy Learning

PARENTING includes direct policy learning to quickly incorporate human input into policy: π(a|s) is trained directly as a predictor of the parent's preferred actions.

After each time step, the agent decides with probability p_train whether to take a gradient descent step on the parental input in X. Each entry in X corresponds to a past query for guidance or preference and consists of two clips, Σ^(0) and Σ^(1), as well as the parent's response μ:

Σ^(i) = s_0 a_0^(i) s_1^(i) a_1^(i) ··· s_{T−1}^(i) a_{T−1}^(i),    μ = [ μ^(0), μ^(1) ]    (1)

Here i = 0, 1 identifies the clip, and entries corresponding to human guidance have T = 1. A label of μ = [1, 0] indicates the parent's preference for the first clip, while μ = [0.5, 0.5] signals a tie. The loss function for gradient descent is the binary cross-entropy:

L = − ∑_X ∑_{i=0,1} μ^(i) log [ π(a_0^(i) | s_0) / ( π(a_0^(0) | s_0) + π(a_0^(1) | s_0) ) ]    (2)

where the agent's policy π(a|s) is interpreted as the probability that, from state s, the parent prefers action a over other possibilities. Note that L is a function of only the first time step in each sequence (justified in Sec. 2.4).

Direct policy learning ensures the agent does not contradict previous human input. Paired with PARENTING's incorporation of human guidance, this powerfully combats the problem of unsafe exploration. By contrast, inferring a reward function from human input (Leike et al., 2018) would not by itself mitigate unsafe exploration, as the agent would repeatedly trial dangerous actions during policy optimisation to maximise total rewards. Inferring a reward function from human preferences can also be ambiguous; see Sec. 4.2.
An alternative use of supervised policy learning can be found in Knox & Stone (2009), where the human must provide a perpetual reinforcement signal (positive or negative) in response to the agent's ongoing behaviour. PARENTING's approach to direct policy learning from human preferences utilises an easier-to-interpret signal and only requires the human to review a small subset of the agent's actions.
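The loss in Eq. (2) is straightforward to compute from the stored preference data. Below is a minimal NumPy sketch, assuming each entry of X is stored as (s_0, a_0^(0), a_0^(1), μ); the data layout and the policy_probs callable are illustrative, not the paper's implementation:

```python
import numpy as np

def preference_loss(policy_probs, X):
    """Binary cross-entropy of Eq. (2) over the parental-input dataset X.

    policy_probs(s) -> array of action probabilities pi(a | s).
    Each entry of X is (s0, a0_clip0, a0_clip1, mu) with mu = [1, 0], [0, 1] or
    [0.5, 0.5]; only the first state-action pair of each clip enters the loss.
    """
    loss = 0.0
    for s0, a0, a1, mu in X:
        p = policy_probs(s0)
        p0, p1 = p[a0], p[a1]
        z = p0 + p1                                   # normalise over the two candidates
        loss -= mu[0] * np.log(p0 / z) + mu[1] * np.log(p1 / z)
    return loss
```

In practice one would differentiate this loss with respect to the policy-network parameters (e.g. via automatic differentiation) and take one gradient step with probability p_train at each time step.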
2.4. Maturation

By itself, direct policy learning would provide a myopic approach to RL, the agent's every move effectively dictated by a human. Maturation provides a mechanism for optimisation beyond the human's limited understanding of an effective strategy. The idea is simple: while the parent may not recognise an optimal action in isolation, the parent will certainly assign preference to that action if simultaneously shown the benefits that can accrue in subsequent moves. Maturation thus calls for the agent to present progressively lengthier clips of its behaviour for feedback. This novel technique is crucial for PARENTING's effectiveness: it is detailed below, demonstrated experimentally in Sec. 3.3, and shown to be a form of value iteration under certain mathematical assumptions in Sec. 4.1.
PARENTING begins with the agent querying for preferences on recorded sequences of length T = 1. Let us call the agent's policy π_1 during this stage of the algorithm.
Figure 1: AI Safety gridworlds (Leike et al., 2017): (a) Unsafe Exploration, (b) Reward Hacking, (c) Negative Side Effects, (d) Unsafe Interruptibility, (e) Absent Supervisor. The light-blue agent 'A' must navigate to the green goal 'G' while avoiding dangers that capture the essence of specific AI Safety problems. Full environment descriptions are given in Sec. 3.2.

The agent records two types of sequences: exploitative length-1 sequences take the form

Σ^(0) = s_0 a_0^(0),    with a_0^(0) = argmax_a π_1(a | s_0)    (3)

while their exploratory counterparts are drawn as

Σ^(1) = s_0 a_0^(1),    with a_0^(1) ∼ π_1(a | s_0), a_0^(1) ≠ a_0^(0)    (4)

Upon convergence, π_1 produces length-1 sequences optimally, with respect to the parent's preferences. (To judge convergence, humans can use a quantitative auxiliary measure to monitor performance; since no feedback is based on this measure, it does not raise the usual safety concerns.) The agent then matures to a new policy, π_2, initialised to π_1 and trained through feedback on recorded sequences of length T = 2. Exploitative length-2 recordings take the form

Σ^(0) = s_0 a_0^(0) s_1^(0) a_1^(0),    with a_t^(0) = argmax_a π_{2−t}(a | s_t) for t = 0, 1    (5)

while exploratory sequences are drawn as

Σ^(1) = s_0 a_0^(1) s_1^(1) a_1^(1),    with a_0^(1) ∼ π_2(a | s_0), a_0^(1) ≠ a_0^(0), and a_1^(1) = argmax_a π_1(a | s_1^(1))    (6)

The goal here is to optimise action choice for length-2 sequences. Since the final state-action pair in each sequence is a length-1 sub-sequence, π_1 is already trained to draw this action optimally. Thus, while length-2 recordings are drawn using both π_1 and π_2, they should be used solely to train π_2. This is compatible with Eq. (2) (where the π's should have subscripts T for completeness).

Once π_2 converges, the agent matures to π_3. Recordings of length T = 3 are drawn from π_3, π_2, and π_1 analogously to Eqs. (5) and (6). Through maturation, the agent's behaviour optimises for progressively longer sequences. (The increment T = 1, 2, ... is appropriate for gridworld but may need modification in other contexts.)

An example might clarify why recordings are drawn sequentially from π_T, π_{T−1}, ..., π_1. In chess, suppose the parent is only smart enough to see 1 move ahead, and that π_1 is already trained. For π_2 to learn to see 2 moves ahead, the agent should present sequences s_0 a_0 s_1 a_1 to the parent, where a_1 is chosen with π_1 but a_0 is not. Even if π_2 could detect a checkmate 2 moves from s_0, the human would not realise the value of the move and might penalise the sequence, because the human does not know the optimal state-action value function. Instead, a_1 should be chosen with π_1, which is already optimised with respect to the parent's preferences when there is one move to go.

Importantly, maturation only requires the parent to recognise improvements in the agent's performance; the human need not understand the agent's evolving strategy (see Sec. 3.3).
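To make the recording procedure of Eqs. (3)–(6) concrete, here is a minimal Python sketch of how a length-T clip might be drawn while π_T is being trained, with the earlier-stage policies supplying the later actions. The policies/env interface is an illustrative assumption:

```python
def record_clip(policies, env, state, T, exploratory):
    """Draw one length-T clip while training pi_T (policies[t] holds pi_t, 1-indexed).

    Exploitative clips take argmax actions throughout; exploratory clips differ only
    in the first action, which is sampled from pi_T and forced to differ from its argmax.
    The action at step t comes from pi_{T-t}, whose horizon matches the remaining steps.
    """
    clip = []
    for t in range(T):
        pi = policies[T - t]                      # pi_T, pi_{T-1}, ..., pi_1
        action = pi.argmax(state)
        if t == 0 and exploratory:
            while action == pi.argmax(state):     # re-sample until it differs from the argmax
                action = pi.sample(state)         # (assumes pi has support on >1 action)
        clip.append((state, action))
        state = env.step(state, action)           # deterministic gridworld transition
    return clip
```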
3. Experiments
To test the safety of our PARENTING algorithm in a controlled way, we performed experiments in the AI Safety gridworlds of Leike et al. (2017), designed specifically to capture the fundamental safety problems of Sec. 1. Select gridworlds are shown in Fig. 1 and described in Sec. 3.2 below.

3.1. Experimental Setup
3.1.1. NETWORK ARCHITECTURE
We used a neural-network policy π(a|s) that maps the state of gridworld to a probability distribution over actions. The state s is represented by an H × W × O matrix, where H and W are the gridworld's dimensions and O is the number of object-types present; this third dimension gives a one-hot encoding of the object sitting in each cell. There are 4 possible actions a in any state: up, down, left, right.

The neural network has two components. The local component maps the local state ℓ(s), comprised of the agent's 4 neighbouring cells, through a dense layer of 64 hidden units, to an output layer with 4 linear units. The global component passes the full state s through several convolutional layers before mapping it, through a separate dense layer with 64 hidden units, to a separate output layer with 4 linear units. All hidden units have rectifier activations. The convolutional processing includes up to 4 layers with small square kernels, stride length 1, and filter counts 16, 32, 64, 64. (The number of layers depends on the dimensions of the gridworld and is chosen to take the state matrix down to a small spatial size.) The local and global output layers are first averaged, then softmaxed, to give a probability distribution over actions. This setup was implemented using Python 2.7.15 and TensorFlow 1.12.0.
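A sketch of this two-component policy network in modern tf.keras (the paper used TensorFlow 1.12). The layer sizes follow the text, while the kernel size, the flattened local input, and the assumption that the gridworld is large enough for four valid convolutions are illustrative:

```python
import tensorflow as tf

def build_policy(H, W, O):
    """Two-headed policy: a local head on the 4 neighbouring cells and a global
    convolutional head on the full H x W x O state; their 4-unit linear outputs
    are averaged and softmaxed over the 4 actions (up, down, left, right)."""
    n_actions = 4

    # Local component: 4 neighbouring cells, one-hot over O object types, flattened.
    local_in = tf.keras.Input(shape=(4 * O,), name="local_state")
    local_out = tf.keras.layers.Dense(n_actions)(
        tf.keras.layers.Dense(64, activation="relu")(local_in))

    # Global component: full state through 4 conv layers (3x3 kernels assumed).
    global_in = tf.keras.Input(shape=(H, W, O), name="full_state")
    x = global_in
    for filters in (16, 32, 64, 64):
        x = tf.keras.layers.Conv2D(filters, 3, strides=1, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    global_out = tf.keras.layers.Dense(n_actions)(
        tf.keras.layers.Dense(64, activation="relu")(x))

    # Average the two linear output layers, then softmax to obtain pi(a | s).
    logits = tf.keras.layers.Average()([local_out, global_out])
    probs = tf.keras.layers.Softmax()(logits)
    return tf.keras.Model([local_in, global_in], probs)
```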
3.1.2. HYPERPARAMETERS

The PARENTING algorithm of Sec. 2 has several hyperparameters. Unless noted otherwise, we fixed p_guid, p_rec, and p_pref at default values below 1, set p_train = 1, and held the recording length constant at T = 1. (Maturation is tested separately in Sec. 3.3.) We also included entropy regularisation (Williams & Peng, 1991) to control the rigidity of the agent's policy, with coefficients λ_global and λ_local for the separate neural-network policy components. We used Adam with default parameters for optimisation (Kingma & Ba, 2014).

3.1.3. SUBSTITUTE FOR HUMAN PARENT
For convenience, we did not use an actual human parent in our experiments. Instead we programmed a parent to respond to queries in the following way.

We assume the parent has an implicit understanding of a reward function r the agent should optimise and has intuition for a safe policy π_p the agent could adopt. This is reasonable given the context assumed in Sec. 1. Furthermore, we assume the parent favours sequences Σ = s_0 a_0 ··· s_{T−1} a_{T−1} with greater total advantage:

α(Σ) = ∑_{t=0}^{T−1} [ Q_p(s_t, a_t) − V_p(s_t) ]    (7)

where V_p (Q_p) is the state (state-action) value function with respect to π_p (Sutton & Barto, 1998). In a deterministic environment, this quantity is equivalent to

r_0 + ··· + r_{T−1} + V_p(s_T) − V_p(s_0)    (8)

i.e. the total reward the parent could accrue as a result of sequence Σ (both during and after) minus what the parent expected to accrue following the baseline π_p instead. (One could also impose a discount factor γ on the sequence; we kept γ = 1 except in Secs. 3.2.1 and 3.3, where a smaller discount was used.)

To motivate these assumptions, experiments in psychology suggest that human feedback does not correspond directly to a reward function (Ho et al., 2018). Instead, MacGlashan et al. (2017) argue that humans do naturally base feedback on an advantage function. What is novel in our implementation is that the advantage is computed with respect to the parent's safe baseline policy – without requiring an understanding of the agent's evolving policy. Experiments with real human feedback in more complicated environments are needed to test whether this is reasonable in general. Note also that since we compute Eq. (7) exactly in gridworld, our experiments assume perfect human feedback. This assumption should be relaxed in more realistic future tests.
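A minimal sketch of such a programmed parent: it scores each clip by the total advantage of Eq. (7) and prefers the higher-scoring clip. The tabular value functions and the tie tolerance are illustrative assumptions (exact tables are available in gridworld):

```python
def clip_advantage(clip, Q_p, V_p):
    """Total advantage of Eq. (7) for clip = [(s_0, a_0), ..., (s_{T-1}, a_{T-1})],
    computed with respect to the parent's safe baseline policy pi_p."""
    return sum(Q_p[s][a] - V_p[s] for s, a in clip)

def parent_response(clip0, clip1, Q_p, V_p, tol=1e-9):
    """Preference label mu for a pair of clips: [1, 0], [0, 1], or a tie [0.5, 0.5]."""
    adv0 = clip_advantage(clip0, Q_p, V_p)
    adv1 = clip_advantage(clip1, Q_p, V_p)
    if abs(adv0 - adv1) <= tol:
        return [0.5, 0.5]
    return [1.0, 0.0] if adv0 > adv1 else [0.0, 1.0]
```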
3.1.4. PRE-TRAINING

Unless noted otherwise, our agent enters PARENTING after pre-training with policy gradients (Sutton & Barto, 1998) to solve general path-connected mazes containing a single goal cell. The reward function in these mazes grants r = +50 for reaching the goal and a small negative reward for each passing time step. A pre-training step with Adam (Kingma & Ba, 2014) was taken every 16 episodes, and PARENTING did not begin until the average reward earned per maze converged. A pre-training step was also taken after each training step during PARENTING, to ensure this knowledge is not forgotten.

In general, pre-training reduces PARENTING's requisite human effort by allowing humans to focus on subtle safety concerns, rather than problems safely solved by other means.
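A compact REINFORCE-style sketch of this maze pre-training loop, assuming a generic maze environment and a differentiable policy object; the reward shaping follows the text, while the interfaces (make_maze, adam_update) are illustrative assumptions:

```python
def pretrain_on_mazes(policy, make_maze, n_updates, episodes_per_update=16, gamma=1.0):
    """Policy-gradient pre-training: every 16 episodes, take one Adam step on the
    REINFORCE objective sum_t log pi(a_t | s_t) * G_t, with G_t the return from step t."""
    for _ in range(n_updates):
        batch = []                                   # (state, action, return) triples
        for _ in range(episodes_per_update):
            env = make_maze()                        # random path-connected maze, one goal cell
            s, done, episode = env.reset(), False, []
            while not done:
                a = policy.sample(s)
                s_next, r, done = env.step(a)        # e.g. +50 at the goal, small step cost
                episode.append((s, a, r))
                s = s_next
            G = 0.0
            for s_t, a_t, r_t in reversed(episode):  # returns computed backwards
                G = r_t + gamma * G
                batch.append((s_t, a_t, G))
        policy.adam_update(batch)                    # one gradient step per batch (assumed helper)
    return policy
```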
3.2. AI Safety Gridworld Experiments

Here we describe our experiments on the AI Safety problems of Sec. 1, highlighting the components of the PARENTING algorithm that solve each.

3.2.1. UNSAFE EXPLORATION
For unsafe exploration, we performed experiments in the gridworld of Fig. 1(a). Parental input was given according to Eq. (8) with a reward function that grants the light-blue agent r = +1 for reaching the green goal, r = 0 for remaining on land, and a negative reward for falling in the dark-blue water, which terminates the episode. We experimented with:

• Traditional RL: used policy gradients as in pre-training
• Direct Policy Learning: p_guid set to 0 to disable guidance queries
• Lax PARENTING: default hyperparameters (Sec. 3.1.2)
• Conservative PARENTING: more cautious hyperparameter choices
Table 1: Deaths in the Unsafe Exploration gridworld before the optimal policy was learnt. "Lax" and "Conservative" refer to different hyperparameter choices. Each table entry μ ± σ was computed from 1000 trials.

                         Training Deaths    Guidance Queries    Preference Queries
Traditional RL           2300 ± 700         –                   –
Direct Policy Learning   47 ± 14            –                   51 ±
Lax PARENTING                 ±                  ±                    ±
Conservative PARENTING        ±                  ±                    ±

To emphasise the exploration required, we did not pre-train agents here. We trained our agent from scratch to optimality 1000 times with each of the 4 algorithms and monitored the average number of water-deaths in each trial. The mean and standard deviation of the training deaths for each algorithm are shown in Table 1, along with the number of parenting queries used. The agent suffered thousands of training deaths before reaching an optimal policy with traditional RL, compared to just 0 or 1 with conservative parenting. (Similar experiments in Leike et al. (2017) using modern RL algorithms yielded roughly comparable results.) This demonstrates the effectiveness of human guidance and direct policy learning at mitigating unsafe exploration.

3.2.2. REWARD HACKING
Reward hacking is modelled in Fig. 1(b), where the blue agent must water dry yellow plants, which then turn green. Plants turn dry with 5% probability per time step. The agent can "reward hack" by stepping in the turquoise bucket of water, which makes the entire garden appear watered and green. If the agent calculates rewards by counting green cells, it will be attracted to this dangerous policy. PARENTING avoids this problem through its reliance on human input, as the parent will never prefer a clip in which the agent steps in the water. Since this environment is ideal for testing maturation, we postpone experimental results to Sec. 3.3.

3.2.3. NEGATIVE SIDE EFFECTS
Negative side effects are addressed in Fig. 1(c), where the blue agent must navigate to the green goal in the presence of a movable dark-cyan box. Pushing the box into a corner is an irreversible action, representing a real-life irreparable side effect (e.g. a broken vase). While going around the box or moving it reversibly is desired, the agent can reach the goal fastest by pushing the box down into the corner. If rewards are based solely on speed, the agent will adopt this dangerous behaviour. In contrast, since the parent would never reinforce a highly undesirable action, PARENTING is not susceptible to negative side effects. This environment is also useful for testing whether behaviours learnt through PARENTING are generalisable or simply memorised; we thus postpone a discussion of results to Sec. 3.4.

3.2.4. UNSAFE INTERRUPTIBILITY
Unsafe interruptibility is represented in Fig. 1(d), where the agent must navigate to the goal in the presence of a pink interruption cell and a purple button. If the agent enters the pink cell, there is a 50% chance it will be frozen there for the remainder of the episode, prevented from reaching the goal. Upon pressing the purple button, the pink cell disappears along with the threat of interruption.

If the agent simply gets rewarded for speed in reaching the goal, it will learn to press the button – not a safely interruptible policy. PARENTING, in contrast, is safely interruptible because the parent would never favour a clip of the agent avoiding human interruption, and there are no rewards left on the table if an episode is terminated early.

To test this, we parented an agent in Fig. 1(d) for 50 queries, then checked whether its argmax policy involved pressing the purple button. In 100 repeated trials, it never did.

3.2.5. ABSENT SUPERVISOR
The absent supervisor problem is modelled in Fig. 1(e). Parental input is based on Eq. (8), where the reward function assigns r = +50 for reaching the green goal and a small negative reward for each passing time step. If the supervisor is present, represented by red side bars, there is a punishment (a large negative reward) for taking the shortcut through the yellow cell. With the supervisor absent, the punishment disappears.

PARENTING naturally gives no signal when the agent's actions are not viewed by a supervisor, so we parented our agent for 50 queries in the present-supervisor gridworld. (For this experiment, we used the default hyperparameters of Sec. 3.1.2 except for λ_local = 1, to weaken dependence on the local policy component.) Upon deployment in the absent-supervisor gridworld, we checked whether its argmax policy involved stepping in the yellow cell. In 100 repeated trials, it never did.

Because PARENTING omits feedback on unsupervised actions, the absent supervisor problem becomes an issue of distributional shift (Sugiyama et al., 2017). As long as the supervisor's absence does not cause an important change in the agent's environment, its policy should carry over intact. (To reiterate: in PARENTING, no signal is associated with the supervisor's "leaving".) We tested this in gridworld as well: in most of the 100 trials reported above, the agent's policy remained optimal with the supervisor removed, while in the remaining trials the supervisor's absence ran the agent into a wall.

3.3. Maturation

Being the most complex of the gridworlds (with many configurations of watered and dry plants), the Reward Hacking environment described in Sec. 3.2.2 is ideal for testing maturation. For this experiment, the parent responds to queries as in Sec. 3.1.3, with a reward function that grants r = +1 for a legitimate plant-watering and r = 0 otherwise.

Suppose the parent's policy π_p is to water plants in a repeating clockwise trajectory around the garden's perimeter – a good perpetual strategy, but suboptimal for short episodes. Nevertheless, the reliance of PARENTING on human judgement should not limit the agent's potential for optimisation. When judging recordings of length T = 1, the parent will prefer clips in which the agent successfully waters a dry plant, even if by an anti-clockwise step – see Eq. (7) or (8). The agent will thus learn to go against π_p and take a single anti-clockwise step, if it earns an extra watering. Upon maturation to T = 2, the agent will learn to take 2 anti-clockwise steps, if doing so offers an advantage over π_p. The agent will thus learn to outperform its parent.

To test this, we set the episode length to 10 and initialised the gridworld to Fig. 1(b). We parented our agent for 1000 queries at each clip length T before maturing to clips of length T + 1, using T = 1, 2, 3, 4. Fig. 2 shows the resulting mean waterings per episode at each stage, each mean being computed over 1000 episodes. The entire experiment was repeated 3 times to compute the standard deviations on the means (error bars in the figure). For comparison, policy gradients were used to train an RL agent to convergence (with the unsafe bucket cell removed from the environment!), whose mean score is also shown. While the parent's policy π_p achieves roughly 2 waterings per episode, the RL agent exceeds 5. Despite this, maturation takes the agent to near-optimality, confirming its effectiveness. PARENTING thus provides a safe avenue for autonomous learning agents to solve problems competently and creatively.

Figure 2: Maturation of the agent's policy toward optimality in the Reward Hacking environment. By learning from human feedback on lengthier recordings of its behaviour, the agent gradually optimises its policy to outperform its parent and approach the effectiveness of traditional RL.
3.4. Generalisability

It is important to understand whether PARENTING teaches behaviours abstractly, allowing lessons learnt to generalise, or if the agent merely memorises its parent's preferred trajectory. Generalisability is critical for real-world applications. Consider a manufacturer that parents household robots in a variety of environments, both simulated and real, so that customers would have little extra parenting required for customisation at home. In this context, pre-training is analogous to first using RL to teach the robot to navigate rooms in safe simulations, to reduce the required parenting by the manufacturer's employees.

We tested generalisability in the Side Effects environment of Sec. 3.2.3. To begin, we randomly generated path-connected gridworlds like Fig. 1(c) that contain 1 goal, 1 box, any number of walls, and that are solvable only by moving the box. We discarded those generated gridworlds that the pre-trained agent could already solve. We kept 50 unique gridworlds satisfying these requirements, designating n = 10 of them for parenting and setting aside 40 for pre-parenting.

For one experiment, we took an agent that was not pre-trained and parented it from scratch to optimality in the n = 10 designated gridworlds (cycling through them during training). We repeated this for 10,000 trials and histogrammed the number of required queries in Fig. 3. For the other experiments, we took a pre-trained agent and pre-parented it in N = 0, 20, or 40 of the set-aside environments before parenting in the n = 10 designated gridworlds. The corresponding histograms in Fig. 3 show the benefits of pre-training and pre-parenting. More pre-parenting reduces the number of queries required for safe operation in new environments, thus confirming PARENTING's generalisability.

Figure 3: Generalisability of parenting to new environments. The agent was pre-trained to solve mazes, then pre-parented to solve N = 0, 20, or 40 unique Side Effects environments, before being parented to optimality in n = 10 held-out Side Effects environments. Queries required for the held-out gridworlds are histogrammed – 10,000 trials for each N.
4. Discussion
In this section, we provide theoretical arguments that motivate our design of maturation and direct policy learning.
4.1. Maturation as Value Iteration

The maturation process of Sec. 2.4 effectively optimises the agent's policy because of its connection to value iteration in dynamic programming (Sutton & Barto, 1998). To demonstrate this, we will make the same assumptions as in Sec. 3.1.3. (These include perfect human feedback, which is necessary for this idealised discussion.)
Let us also assume that π_T converges in all relevant regions of state-space before maturation to π_{T+1}. We will work in a deterministic environment for clarity, so that the parent's preferences on sequences of length T are determined by computing

r_0 + ··· + r_{T−1} + V_p(s_T)    (9)

on each sequence Σ = s_0 a_0 ··· s_{T−1} a_{T−1}, with V_p defined with respect to the parent's policy π_p. (We omit the −V_p(s_0) term of Eq. (8), since it drops out of comparisons when clips are chosen with the same, or sufficiently similar, initial states.)

Under these assumptions, maturation is equivalent to value iteration. To show this, we prove that maturation trains π_T(a|s_0) to maximise r_0 + V_{T−1}(s_1), where

V_T(s_0) = max_a [ r_0 + V_{T−1}(s_1) ]    (10)

for each T = 2, 3, 4, ..., with the base case

V_1(s_0) = max_a [ r_0 + V_p(s_1) ]    (11)

For T = 1, π_1(a|s_0) is trained to optimise r_0 + V_p(s_1), because the parent responds to preference queries based on this quantity. For T = 2, sequences are recorded by successively drawing from π_2(a|s_0) then π_1(a|s_1), as prescribed in Sec. 2.4. The involvement of π_1 implies that π_2(a|s_0) is trained to maximise r_0 + r_1 + V_p(s_2) = r_0 + V_1(s_1), as required. Assuming the claim is true for sequence lengths through T − 1, the argument for T is similar: because sequences are recorded by drawing actions successively from π_T, π_{T−1}, ..., π_1, this implies that π_T(a|s_0) is trained to maximise r_0 + ··· + r_{T−1} + V_p(s_T) = r_0 + V_{T−1}(s_1). The claim is thus proved by induction.

Note that V_T is the same quantity that appears in value iteration (Sutton & Barto, 1998), which converges to optimality. The agent's policy π_T thus progressively outperforms the parent's policy π_p as T increases. Importantly, this process does not require the human parent to understand the agent's improving policy, just to recognise improving performance.
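Under the idealised assumptions above, the recursion of Eqs. (10)–(11) can be written down directly. A minimal tabular sketch, assuming a deterministic environment given as transition and reward tables (illustrative data structures, not the paper's code):

```python
def value_iteration_from_parent(V_p, transitions, rewards, T_max):
    """Compute V_1, ..., V_{T_max} of Eqs. (10)-(11), starting from the parent's
    value function V_p rather than from zero as in standard value iteration.

    transitions[s][a] -> next state s'; rewards[s][a] -> immediate reward r_0.
    """
    values = {0: dict(V_p)}          # convenient alias: V_0 := V_p, so Eq. (11) is the T = 1 case
    for T in range(1, T_max + 1):
        prev = values[T - 1]
        values[T] = {
            s: max(rewards[s][a] + prev[transitions[s][a]] for a in transitions[s])
            for s in transitions
        }
    return values
```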
ARENTING , one uses human preferencesto learn a reward model , there are subtleties one needs toovercome to ensure the corresponding optimised policy isconsistent with human desires. We include this discussionhere as it influenced the design of our algorithm.Suppose recordings
Σ = s a · · · s T − a T − of fixedlength are shown to a human in pairs to obtain preferencedata. Let us assume that the human intuitively understands areward function r and (in this section only) favours clips Σ that earn more reward r (Σ) = r + · · · + r T − . Then onecould fit a reward model ρ to the preference data. However,there is a shift ambiguity in the model, since both ρ and σ = ρ + a (for a ∈ R ) each describe the data equally well: ρ (Σ (0) ) − ρ (Σ (1) ) = σ (Σ (0) ) − σ (Σ (1) ) (12)This ambiguity can be eliminated by fixing the mean rewardvalue. However, the reward function’s mean can have asubstantial effect on the learnt behaviour. See Fig. 4 for anexample. Suppose reward model ρ grants +1 for reachingthe goal, − for an irreversible side effect, and otherwise.Then the trajectory of Fig. 4(a) accrues (cid:80) t ρ t = 0 , whilethe trajectory of Fig. 4(b) earns (cid:80) t ρ t = +1 . Optimisationof ρ would thus avoid the irreversible side effect. However,with a shifted reward model σ = ρ − , the unsafe trajectoryscores (cid:80) t σ t = − , while the safe trajectory accumulates (cid:80) t σ t = − . Optimisation of σ would thus cause an irre-versible side effect, against the human’s wishes. ARENTING : Safe Reinforcement Learning from Human Input
This problem can be overcome in practice by experimentally tuning the mean of the reward function, as well as its moments. However, this hyperparameter tuning would need to be repeated for each new task and would introduce a new type of unsafe exploration (of hyperparameter space).

This problem occurs because human preferences on same-length sequences are shift-invariant with respect to the reward function, while reinforcement learning is not. PARENTING avoids this problem through direct policy learning, which respects the symmetries of human preferences and thus does not require problem-by-problem tuning. (The hyperparameters of Sec. 3.1.2 control the rate of mistakes and speed of learning, rather than affecting the agent's learnt policy.)
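A tiny numeric illustration of the shift ambiguity, using made-up trajectory lengths for the two paths of Fig. 4 (the specific numbers are illustrative assumptions):

```python
def clip_return(rewards, shift=0.0):
    """Total reward of a trajectory under a reward model shifted by a constant per step."""
    return sum(r + shift for r in rewards)

# rho: +1 at the goal, -1 for the irreversible side effect, 0 otherwise.
unsafe = [0, 0, 0, -1, 0, 1]          # shorter path, pushes the box into the corner
safe = [0, 0, 0, 0, 0, 0, 0, 0, 1]    # longer path, goes around the box

# Preference data on same-length clips cannot distinguish rho from sigma = rho + a,
# yet the shift decides which full trajectory an RL agent would prefer:
print(clip_return(unsafe), clip_return(safe))            # 0 vs 1   -> rho favours the safe path
print(clip_return(unsafe, -1), clip_return(safe, -1))    # -6 vs -8 -> sigma favours the unsafe path
```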
5. Conclusion
In the context of near-future autonomous agents operating in environments where humans already understand the risks, PARENTING offers an approach to RL that addresses a broad class of relevant AI Safety problems. We demonstrated this with controlled experiments in the purpose-built AI Safety gridworlds of Leike et al. (2017). Importantly, the fact that PARENTING solves these problems is not particular to gridworld; it is due to the fact that humans can solve these problems, and PARENTING allows humans to safely teach RL agents. Furthermore, we have seen that two potential downsides of PARENTING can be overcome: (i) through the novel technique of maturation, a parented agent is not limited to the performance of its parent; and (ii) parented behaviours generalise to new environments, which can be used to reduce requisite human effort in the learning process. We hope the framework introduced here provides a useful step forward in the pursuit of a general and safe RL programme applicable to real-world systems.
Acknowledgements
This work was developed and experiments were run on the Faculty Platform for machine learning. The authors benefited from discussions with Owain Evans, Jan Leike, Smitha Milli, and Marc Warner. The authors are grateful to Jaan Tallinn for funding this project.
References
Akrour, R., Schoenauer, M., and Sebag, M. APRIL: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 116–131. Springer, 2012.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Armstrong, S. AI toy control problem, 2017.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.

Clark, J. and Amodei, D. Faulty reward functions in the wild, 2016. URL https://blog.openai.com/faulty-reward-functions/.

Dewey, D. Learning what to value. In International Conference on Artificial General Intelligence, pp. 309–314. Springer, 2011.

Fürnkranz, J., Hüllermeier, E., Cheng, W., and Park, S.-H. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine Learning, 89(1-2):123–156, 2012.

Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3909–3917, 2016.

Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6765–6774, 2017.

Ho, M. K., Cushman, F., Littman, M. L., and Austerweil, J. L. People teach with rewards and punishments as communication not reinforcements. 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Knox, W. B. and Stone, P. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pp. 9–16. ACM, 2009.

Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

Lin, Z., Harrison, B., Keech, A., and Riedl, M. O. Explore, exploit or listen: Combining human feedback and policy model to speed up deep reinforcement learning in 3D worlds. arXiv preprint arXiv:1709.03969, 2017.

MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Roberts, D., Taylor, M. E., and Littman, M. L. Interactive learning from policy-dependent human feedback. arXiv preprint arXiv:1701.06049, 2017.

Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.

Ortega, P. A., Maini, V., and the DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. 2018.

Pecka, M. and Svoboda, T. Safe exploration techniques for reinforcement learning – an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, pp. 357–375. Springer, 2014.

Russell, S. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 101–103. ACM, 1998.

Saunders, W., Sastry, G., Stuhlmueller, A., and Evans, O. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2067–2069. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.

Soares, N., Fallenstein, B., Armstrong, S., and Yudkowsky, E. Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Sugiyama, M., Lawrence, N. D., Schwaighofer, A., et al. Dataset shift in machine learning. The MIT Press, 2017.

Sutton, R. S. and Barto, A. G. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 2017.