Making Efficient Use of Demonstrations to Solve Hard Exploration Problems
2019-9-5
Caglar Gulcehre*, Tom Le Paine*, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams, Gabriel Barth-Maron, Ziyu Wang, Nando de Freitas and Worlds Team. *Equal contributions. DeepMind, London.
This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.
1. Introduction
Reinforcement learning from demonstrations has proven to be an effective strategy for attacking problems that require sample efficiency and involve hard exploration. For example, Aytar et al. (2018), Pohlen et al. (2018) and Salimans and Chen (2018b) have shown that RL with demonstrations can address the hard exploration problem in Montezuma’s Revenge. Večerík et al. (2017), Merel et al. (2017) and Paine et al. (2018) have demonstrated similar results in robotics. Many other works have shown that demonstrations can accelerate learning and address hard-exploration tasks (e.g. see Hester et al., 2018; Kim et al., 2013; Nair et al., 2018).

In this paper, we attack the problem of learning from demonstrations in hard exploration tasks in partially observable environments with highly variable initial conditions. These three aspects together conspire to make learning challenging:
1. Sparse rewards induce a difficult exploration problem, which is a challenge for many state of the art RL methods. An environment has sparse reward when a non-zero reward is only seen after taking a long sequence of correct actions. Our approach is able to solve tasks where standard methods run for billions of steps without seeing a single non-zero reward.
2. Partial observability forces the use of memory, and also reduces the generality of information provided by a single demonstration, since trajectories cannot be broken into isolated transitions using the Markov property. An environment has partial observability if the agent can only observe a part of the environment at each timestep.
3. Highly variable initial conditions (i.e. changes in the starting configuration of the environment in each episode) are a big challenge for learning from demonstrations, because the demonstrations cannot account for all possible configurations. When the initial conditions are fixed it is possible to be extremely efficient through tracking (Aytar et al., 2018; Peng et al., 2018); however, with a large variety of initial conditions the agent is forced to generalize over environment configurations. Generalizing between different initial conditions is known to be difficult (Ghosh et al., 2017; Langlois et al., 2019).

Our approach to these problems combines demonstrations with off-policy, recurrent Q-learning in a way that allows us to make very efficient use of the available data. In particular, we vastly outperform behavioral cloning using the same set of demonstrations in all of our experiments.

Another desirable property of our approach is that our agents are able to learn to outperform the demonstrators, and in some cases even to discover strategies that the demonstrators were not aware of. In one of our tasks the agent is able to discover and exploit a bug in the environment in spite of all the demonstrators completing the task in the intended way.

Learning from a small number of demonstrations under highly variable initial conditions is not straightforward. We identify a key parameter of our algorithm, the demo ratio, which controls the proportion of expert demonstrations vs agent experience in each training batch. This hyper-parameter has a dramatic effect on the performance of the algorithm. Surprisingly, we find that the optimal demo ratio is very small (but non-zero) across a wide variety of tasks.

The mechanism our agents use to efficiently extract information from expert demonstrations is to use them in a way that guides (or biases) the agent’s own autonomous exploration of the environment. Although this mechanism is not obvious from the algorithm construction, our behavioral analysis confirms the presence of this guided exploration effect.

To demonstrate the effectiveness of our approach we introduce a suite of tasks (which we call the
Hard-Eight suite) that exhibit our three targeted properties. The tasks are set in a procedurally-generated 3D world, and require complex behavior (e.g. tool use, long-horizon memory) from the agent to succeed. The tasks are designed to be difficult challenges in our targeted setting, and several state of the art methods (themselves ablations of our approach) fail to solve them.

The main contributions of this paper are:
1. We design a new agent that makes efficient use of demonstrations to solve sparse reward tasks in partially observed environments with highly variable initial conditions.
2. We provide an analysis of the mechanism our agents use to exploit information from the demonstrations.
3. We introduce a suite of eight tasks that support this line of research.
2. Recurrent Replay Distributed DQN from Demonstrations
Figure 1 | The R2D3 distributed system diagram. The learner samples batches that are a mixture of demonstrations and the experiences the agent generates by interacting with the environment over the course of training. The ratio between demos and agent experiences is a key hyper-parameter which must be carefully tuned to achieve good performance.
We propose a new agent, which we refer to as Recurrent Replay Distributed DQN from Demonstrations (R2D3). R2D3 is designed to make efficient use of demonstrations to solve sparse reward tasks in partially observed environments with highly variable initial conditions. This section gives an overview of the agent, and detailed pseudocode can be found in Appendix A.

The architecture of the R2D3 agent is shown in Figure 1. There are several actor processes, each running independent copies of the behavior against an instance of the environment. Each actor streams its experience to a shared agent replay buffer, where experience from all actors is aggregated and globally prioritized (Horgan et al., 2018; Schaul et al., 2016) using a mixture of max and mean of the TD-errors with priority exponent η. There is a second, demo replay buffer, which is populated with expert demonstrations of the task to be solved. Expert trajectories are also prioritized using the scheme of Kapturowski et al. (2018). Maintaining separate replay buffers for agent experience and expert demonstrations allows us to prioritize the sampling of agent and expert data separately.

The learner process samples batches of data from both the agent and demo replay buffers simultaneously. A hyperparameter ρ, the demo ratio, controls the proportion of data coming from expert demonstrations versus from the agent’s own experience. The demo ratio is implemented at a batch level by randomly choosing whether to sample from the expert replay buffer independently for each element with probability ρ. Using a stochastic demo ratio in this way allows us to target demo ratios smaller than the inverse of the batch size, which we found to be very important for good performance. The objective optimized by the learner uses n-step double Q-learning (with n = 5) and a dueling architecture (Hessel et al., 2018; Wang et al., 2016). In addition to performing network updates, the learner is also responsible for pushing updated priorities back to the replay buffers.

In each replay buffer, we store fixed-length (m = 80) sequences of (s, a, r) tuples where adjacent sequences overlap by 40 time-steps. The sequences never cross episode boundaries. Given a single batch of trajectories we unroll both online and target networks (Mnih et al., 2015) on the same sequence of states to generate value estimates with the recurrent state initialized to zero. Proper initialization of the recurrent state would require always replaying episodes from the beginning, which would add significant complexity to our implementation. As an approximation of this we treat the first 40 steps of each sequence as a burn-in phase, and apply the training objective to the final 40 steps only. An alternative approximation would be to store stale recurrent states in replay, but we did not find this to improve performance over zero initialization with burn-in.
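To make the sequence-replay scheme above concrete, the following is a minimal sketch of zero-initialization with burn-in; the helper names (unroll, rnn_step, q_loss) and the list-based sequence format are illustrative assumptions rather than the actual implementation.

import numpy as np

SEQ_LEN = 80   # stored sequence length m (Table 2); adjacent sequences overlap by 40 steps
BURN_IN = 40   # burn-in prefix; the training objective is applied to the final 40 steps only

def split_sequence(seq):
    """Split a replayed sequence into a burn-in prefix and a training suffix."""
    return seq[:BURN_IN], seq[BURN_IN:]

def unroll(rnn_step, inputs, state):
    """Unroll a recurrent cell over a sequence, returning per-step outputs and the final state."""
    outputs = []
    for x in inputs:
        out, state = rnn_step(x, state)
        outputs.append(out)
    return outputs, state

def sequence_loss(rnn_step, q_loss, seq, zero_state):
    """Zero-initialize the recurrent state, burn in on the first 40 steps,
    then apply the loss only to the second half of the sequence."""
    burn_in, train = split_sequence(seq)
    _, state = unroll(rnn_step, burn_in, zero_state)   # burn-in: outputs discarded
    q_values, _ = unroll(rnn_step, train, state)       # training segment
    return q_loss(q_values, train)

The same scheme would be applied to both the online and the target network so that value estimates are produced from the same sequence of states.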
3. Background
Exploration remains one of the most fundamental challenges for reinforcement learning. So-called “hard-exploration” domains are those in which rewards are sparse, and optimal solutions typically have long and sparsely-rewarded trajectories. Hard-exploration domains may also have many distracting dead ends that the agent may not be able to recover from once it gets into a certain state. In recent years, the most notable such domains are Atari environments, including
Montezuma’s Revenge and
Pitfall (Bellemare et al., 2013). These domains are particularly tricky for classical RL algorithms because even finding a single non-zero reward to bootstrap from is incredibly challenging.

A common technique used to address the difficulty of exploration is to encourage the agent to visit under-explored areas of the state-space (Schmidhuber, 1991). Such techniques are commonly known as intrinsic motivation (Chentanez et al., 2005) or count-based exploration (Bellemare et al., 2016). However, these approaches do not scale well as the state space grows, as they still require exhaustive search in sparse reward environments. Additionally, recent empirical results suggest that these methods do not consistently outperform ϵ-greedy exploration (Taïga et al., 2019). The difficulty of exploration is also a consequence of the current inability of our agents to abstract the world and learn scalable, causal models with explanatory power. Instead they often use low-level features or handcrafted heuristics and lack the generalization power necessary to work in a more abstract space. Hints can be provided to the agent which bias it towards promising regions of the state space either via reward-shaping (Ng et al., 1999) or by introducing a sequence of curriculum tasks (Bengio et al., 2009; Graves et al., 2017). However, these approaches can be difficult to specify and, in the case of reward shaping, often lead to unexpected behavior where the agent learns to exploit the modified rewards.

Figure 2 | Hard-Eight task suite. In each task an agent must interact with objects in its environment in order to gain access to a large apple that provides reward. The 3D environment is also procedurally generated so that every episode the state of the world including object shapes, colors, and positions is different. From the point of view of the agent the environment is partially observed. Because it may take hundreds of low-level actions to collect an apple the reward is sparse, which makes exploration difficult.

Another hallmark of hard-exploration benchmarks is that they tend to be fully-observable and exhibit little variation between episodes. Nevertheless, techniques like random no-ops and “sticky actions” have been proposed to artificially increase episode variance in Atari (Machado et al., 2018); an alternative is to instead consider domains with inherent variability. Other recent work on the
Obstacle Tower challenge domain (Juliani et al., 2019) is similar to our task suite in this regard. Reliance on determinism of the environment is one of the chief criticisms of imitation leveled by Juliani (2018), who offers a valuable critique of Aytar et al. (2018), Ecoffet et al. (2019) and Salimans and Chen (2018a). In contrast, our approach is able to solve tasks with substantial per-episode variability.

GAIL (Ho and Ermon, 2016) is another imitation learning method; however, GAIL has never been successfully applied to complex partially observable environments that require memory. Even the maze task in Żołna et al. (2019) has distinguishable rooms, uses a single layout across all episodes, and as a result does not require a recurrent policy or discriminator.
4. Hard-Eight Task Suite
To address the difficulty of hard exploration in partially observable problems with highly variable initial conditions we introduce a collection of eight tasks which exhibit these properties. Due to the generated nature of these tasks and the rich form of interaction between the agent and environment, we see greatly increased levels of variability between episodes. From the perspective of the learning process, these tasks are particularly interesting because just memorizing an open loop sequence of actions is unlikely to achieve even partial success on a new episode. The nature of interaction with the environment combined with a limited field of view also necessitates the use of memory in the agent.

All of the tasks in the Hard-Eight task suite share important common properties that make them hard exploration problems. First, each task emits sparse rewards: in all but one task the only positive instantaneous reward obtained also ends the episode. The visual observations in each task are also first-person and thus the state of the world is only ever partially observed. Several of the tasks are constructed to ensure that it is not possible to observe all task relevant information simultaneously.
Figure 3 | High-level steps necessary to solve the Baseball task. Each step in this sequence must be completed in order, and must be implemented by the agent as a sequence of low level actions (no option structure is available to the agent). The necessity of completing such a long sequence of high level steps makes it unlikely that the task will ever be solved by random exploration. Note that each step involves interaction with physical objects in the environment, shown in bold.
Finally, each task is subject to highly variable initial conditions. This is accomplished by including several procedural elements, including colors, shapes and configurations of task relevant objects. The procedural generation ensures that simply copying the actions from a demonstration is not sufficient for successful execution, which is a sharp contrast to the case of Atari (Pohlen et al., 2018). A more detailed discussion of these aspects can be found in Appendix B, and videos of agents and humans performing these tasks can be found at https://deepmind.com/research/publications/r2d3 .

Each task makes use of a standardized avatar with a first-person view of the environment, controlled by the same discretized action space consisting of 46 discrete actions. In all tasks the agent is rewarded for collecting apples and often this is the only reward obtained before the episode ends. A depiction of each task is shown in Figure 2. A description of the procedural elements and a filmstrip of a successful episode for each task is provided in Appendix B.

Each of these tasks requires the agent to complete a sequence of high-level steps to complete the task. An example from the task suite is shown in Figure 3. The agent must: find the bat, pick up the bat, knock the ball off the plinth, pick up the ball, activate the sensor with the ball (opening the door), walk through the door, and collect the large apple.

The Hard-Eight task suite contains the following tasks:
Baseball
The agent spawns in a small room with a sensor and a key object resting high atop a plinth. The agent must find a stick and use it to knock the key object off the plinth in order to activate the sensor. Activating the sensor opens a door to an adjoining room with a large apple which ends the episode.
Drawbridge
The agent spawns at one end of a network of branching platforms separated by drawbridges, which can be activated by touching a key object to a sensor. Activating a drawbridge with a key object destroys the key. Each platform is connected to several drawbridges, but has only one key object available. Some paths through the level have small apples which give reward. The agent must choose the most rewarding path through the level to obtain a large apple at the end, which ends the episode.
Navigate Cubes
The agent spawns on one side of a large room. On the other side of the room on a raised platform there is a large apple which ends the episode. Across the center of the room there is a wall of movable blocks. The agent must dig through the wall of blocks and find a ramp onto the goal platform in order to collect the large apple.
Push Blocks
The agent spawns in a medium sized room with a recessed sensor in the floor. There are several objects in the room that can be pushed but not lifted. The agent must push a block whose color matches the sensor into the recess in order to open a door to an adjoining room which contains a large apple which ends the episode. Pushing a wrong object into the recess makes the level impossible to complete.
Remember Sensor
The agent spawns near a sensor of a random color. The agent must travel down a long hallway to a room full of blocks and select one that matches the color of the sensor. Bringing the correct block back to the sensor allows access to a large apple which ends the episode. In addition to being far away, traveling between the hallway and the block room requires the agent to cross penalty sensors, which incur a small negative reward.
Throw Across
The agent spawns in a U-shaped room with empty space between the legs of the U. There are two key objects near the agent spawn point. The agent must throw one of the key objects across the void, and carry the other around the bottom of the U. Both key objects are needed to open two locked doors which then give access to a large apple which ends the episode.
Wall Sensor
The agent spawns in a small room with a wall mounted sensor and a key object. The agent must pick up the key and touch it to the sensor, which opens a door. In the adjoining room there is a large apple which ends the episode.
Wall Sensor Stack
The agent spawns in a small room with a wall mounted sensor and two key objects. This time one of the key objects must be in constant contact with the sensor in order for the door to remain open. The agent must stack the two objects so one can rest against the sensor, allowing the agent to pass through to an adjoining room with a large apple which ends the episode.
5. Baselines
In this section we discuss the baselines and ablations we use to compare against our R2D3 agent in the experiments. We compare to Behavior Cloning (a common baseline for learning from demonstrations) as well as two ablations of our method which individually remove either recurrence or demonstrations from R2D3. The two ablations correspond to two different state of the art methods from the literature.
Behavior Cloning
BC is a simple and common baseline method for learning policies from demonstrations (Pomerleau, 1989; Rahmatizadeh et al., 2018). This algorithm corresponds to a supervised learning approach to imitation learning which uses only expert trajectories as its training dataset to fit a parameterized policy mapping states to actions. For discrete actions this corresponds to a classification task, which we fit using the cross-entropy loss. If the rewards of trajectories in the training dataset are consistently high, BC is known to outperform recent batch-RL methods (Fujimoto et al., 2018). To enable fair comparison we trained our BC agent using the same recurrent neural network architecture that we used for our R2D3 algorithm (see Figure 4).
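For reference, behavior cloning over discrete actions reduces to a cross-entropy classification loss on expert state-action pairs; a minimal sketch, where the policy_logits_fn callable and the trajectory format are assumptions for illustration:

import numpy as np

def cross_entropy(logits, action):
    """Negative log-likelihood of the expert action under the policy logits."""
    logits = logits - logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())    # log-softmax
    return -log_probs[action]

def bc_loss(policy_logits_fn, trajectory):
    """Average cross-entropy over one expert trajectory of (state, action) pairs."""
    losses = [cross_entropy(policy_logits_fn(s), a) for s, a in trajectory]
    return float(np.mean(losses))

In practice the policy would be the same recurrent network used by R2D3, unrolled over the demonstration sequence before computing per-step logits.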
No Demonstrations
The first ablation we consider is to remove demonstrations from R2D3. This corresponds to setting the demo ratio (see Figure 1) to ρ = 0. This special case of R2D3 corresponds exactly to the R2D2 agent of Kapturowski et al. (2018), which itself extends DQN (Mnih et al., 2015) to partially observed environments by combining it with recurrence and the distributed training architecture of Ape-X DQN (Horgan et al., 2018). This ablation is itself state of the art on Atari-57 and DMLab-30, making it an extremely strong baseline.
No Recurrence
The second ablation we consider is to replace the recurrent value function of R2D3 with a feed-forward reactive network. We do this separately from the no demonstrations ablation, leaving the full system in Figure 1 intact, with only the structure of the network changed.
Figure 4 | (a) Recurrent head used by R2D3 agents. (b) Feedforward head used by the DQfD agent. Heads in both (a) and (b) are used to compute the Q values. (c) Architecture used to compute the input feature representations. Frames of size 96x72 are fed into a ResNet; the output is then augmented by concatenating the previous action a_{t−1}, previous reward r_{t−1}, and other proprioceptive features f_t, such as accelerations, whether the avatar hand is holding an object, and the hand’s relative distance to the avatar.

If we further fix the demo ratio to ρ = 0.25, then this ablation corresponds to the DQfD agent of Hester et al. (2018), which is competitive on hard-exploration Atari environments such as Montezuma’s Revenge. However, we do not restrict ourselves to ρ = 0.25, and instead optimize over the demo ratio for the ablation as well as for our main agent.
6. Experiments
We evaluate the performance of our R2D3 agent alongside state-of-the-art deep RL baselines. As discussed in Section 5, we compare our R2D3 agent to BC (standard LfD baseline), R2D2 (off-policy SOTA), and DQfD (LfD SOTA). We use our own implementations for all agents, and we plan to release code for all agents including R2D3.

For each task in the Hard-Eight suite, we trained R2D3, R2D2, and DQfD using 256 ϵ-greedy CPU-based actors and a single GPU-based learner process. Following Horgan et al. (2018), the i-th actor was assigned a distinct noise parameter ϵ_i, with the values regularly spaced on a log scale. For each of the algorithms their common hyperparameters were held fixed. Additionally, for R2D3 and DQfD the demo ratio was varied to study its effect. For BC we also varied the learning rate independently in a vain attempt to find a successful agent.

All agents act in the environment with an action-repeat factor of 2, i.e. the actions received by the environment are repeated twice before passing the observation to the agent. Using an action repeat of 4 is common in other domains like Atari (Bellemare et al., 2012; Mnih et al., 2015); however, we found that using an action repeat of 4 made the Hard-Eight tasks too difficult for our demonstrators. Using an action repeat of 2 allowed us to strike a compromise between ease of demonstration (which is made harder by high action repeats prohibiting smooth and intuitive motion) and ease of learning for the agents (which is made harder by low action repeats increasing the number of steps required to complete the task).

Figure 4 illustrates the neural network architecture of the different agents. As much as possible we use the same network architecture across all agents, deviating only for DQfD, where the recurrent head is replaced with an equally sized feed-forward layer. We briefly outline the training setup below, and give an explicit enumeration of the hyperparameters in Appendix C.

Figure 5 | Reward vs actor steps curves for R2D3 and baselines on the Hard-Eight task suite. The curves are computed as the mean performance for the same agent across 5 different seeds per task. Error regions show the 95% confidence interval for the mean reward across seeds. Several curves overlap exactly at zero reward for the full range of the plots. R2D3 can perform human-level or better on Baseball, Drawbridge, Navigate Cubes and Wall Sensor. R2D2 could not get any positive rewards on any of the tasks. DQfD and BC agents occasionally see rewards on Drawbridge and Navigate Cubes tasks, but this happens rarely enough that the effect is not visible in the plots. Indicators mark analysis points in Section 6.3.

For R2D3, R2D2 and DQfD we use the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate of 2 × 10^-4. We use hyperparameters that are shown to work well for similar environments. We use distributed training with 256 parallel actors, trained for at least 10 billion actor steps for all tasks.

For the BC agent the training regime is slightly different, since this agent does not interact with the environment during training. For BC we also use the Adam optimizer but we additionally perform a hyperparameter sweep over three learning rates.
Since there is no notion of actor steps in BC, we trained for 500k learner steps instead.

During the course of training, an evaluator process periodically queries the learner process for the latest network weights and runs the resulting policy on an episode, logging both the final return and the total number of steps (actor or learner steps, as appropriate) performed at the time of evaluation.

We collected a total of 100 demonstrations for each task spread across three different experts (each expert contributed roughly one third of the demonstrations for each task). Demonstrations for the tasks were collected using keyboard and mouse controls mapped to the agent’s exact action space, which was necessary to enable both behaviour cloning and learning from demonstrations. We show statistics related to the human demonstration data which we collected from three experts in Table 1.

In Figure 5, we report the return against the number of actor steps, averaged over five random initializations. We find that none of the baselines succeed in any of the eight environments. Meanwhile, R2D3 learns six out of the eight tasks, and reaches or exceeds human performance in four of them. The fact that R2D3 learns at all in this setting with only 100 demonstrations per task demonstrates the ability of the agent to make very efficient use of the demonstrations. This is in contrast to BC and DQfD which use the same demonstrations, and both fail to learn a single task from the suite.

Figure 6 | Success rate (see main text) for R2D3 across all tasks with at least one successful seed, as a function of demo ratio. The square markers for each demo ratio denote the mean success rate, and the error bars show a bootstrapped estimate of a percentile interval for the mean estimate. The lower demo ratios consistently outperform the higher demo ratios across the suite of tasks.
Table 1 | Human demonstration statistics. We collected 100 demos for each task from three human demonstrators. We report mean lengths (in number of frames) and rewards of the episodes along with the standard deviations for each task.
All methods, including R2D3, fail to solve two of the tasks: Remember Sensor and Throw Across. These are the two tasks in the suite that are most demanding in terms of memory requirements for the agent, and it is possible that our zero-initialization with burn-in strategy for handling LSTM states in replay does not give R2D3 sufficient context to complete these tasks successfully. Future work should explore the better handling of recurrent states as a possible avenue towards success on these tasks. R2D3, BC, and DQfD receive some negative returns on Remember Sensor, which indicates that the agents navigate down the hallway and walk over penalty sensors.

R2D3 performed better than our average human demonstrator on Baseball, Drawbridge, Navigate Cubes and the Wall Sensor tasks. The behavior on Wall Sensor Stack in particular is quite interesting. On this task R2D3 found a completely different strategy than the human demonstrators by exploiting a bug in the implementation of the environment. The intended strategy for this task is to stack two blocks on top of each other so that one of them can remain in contact with a wall mounted sensor, and this is the strategy employed by the demonstrators. However, due to a bug in the environment the strategy learned by R2D3 was to trick the sensor into remaining active even when it is not in contact with the key by pressing the key against it in a precise way.

In light of the uniform failure of our baselines to learn on the Hard-Eight suite we made several attempts at training other models on the task suite; however, these attempts were all unsuccessful. For example, we tried adding randomized prior functions (Osband et al., 2018) to R2D2, but this approach was still unable to obtain reward on any of the Hard-Eight tasks. We also trained an IMPALA agent with pixel control (Jaderberg et al., 2016) as auxiliary reward to help with exploration, but this approach also failed to learn on any of the tasks we attempted. We omit these results from Figure 5, only keeping the most relevant baselines.
In our experiments on the Hard-Eight tasks (see Figure 5), we did a hyperparameter search and chose the best hyperparameters for each method independently. In this section, we look more closely at how the demo ratio (ρ) affects learning in R2D3. To do this we look at how the success rate of R2D3 across the entire Hard-Eight task suite varies as a function of the demo ratio.

The goal of each task in the Hard-Eight suite is to collect a large apple, which ends the episode and gives a large reward.
Figure 7 | Guided exploration behavior in the Push Blocks task. (a) Spatial pattern of exploration behavior (panels: R2D2 @ 5B, R2D3 @ 5B, and R2D3 @ 40B actor steps). Overlay of the agent’s trajectories over 200 episodes. Blocks and sensors are not shown for clarity. R2D2 appears to follow a random walk. R2D3 concentrates on a particular spatial region. (b)
Interactions between the agent and blocks during the first 12B steps. Each line shows a different random seed. R2D2 rarely pushes the blocks. (c)
Example trajectory of R2D3 after training, showing the agent pushing the blue block onto the blue sensor, then going to collect the apple reward (green star).

We consider an episode successful if the large apple is collected. An agent that executes many episodes in the environment will either succeed or fail at each one. We consider an agent successful if, after training, at least 75% of its final 25 episodes are successful. Finally, an individual agent with a fixed set of hyperparameters may still succeed or fail depending on the randomness in the environment and the initialization of the agent. We call the proportion of agents that succeed for a given set of hyperparameters the success rate of the algorithm.

We train several R2D3 agents on each tractable task in the Hard-Eight suite, varying only the demo ratio while keeping the rest of the hyperparameters fixed at the values used for the learning experiment. We consider four different demo ratios across six tasks, with five seeds for each task, for a total of 120 agents trained. Figure 6 shows estimates of the success rate for the R2D3 algorithm for each different demo ratio, aggregated across all tasks. We observe that tuning the demo ratio has a strong effect on the success rate across the task suite, and that the best demo ratio is quite small. See Appendix D.3 for further results.

The typical strategy for exploration in RL is to either use a stochastic policy and sample actions, or to use a deterministic policy and take random actions some small ϵ fraction of the time. Given sufficient time both of these approaches will in theory cover the space of possible behaviors, but in practice the amount of time required to achieve this coverage can be prohibitively long. In this experiment, we compare the behavior of R2D3 to the behavior of R2D2 (which is equivalent to R2D3 without demonstrations) on two of the tasks from the Hard-Eight suite. Even very early in training (well before R2D3 is able to reliably complete the tasks) we see many more task-relevant actions from R2D3 than from R2D2, suggesting that the effect of demonstrations is to bias R2D3 towards exploring relevant parts of the environment.

In Figure 7 we begin by examining the Push Blocks task. The task here is to push a particular block onto a sensor to give access to a large apple, and we examine the behavior of both R2D3 and R2D2 after 5B steps, which is long before R2D3 begins to solve the task with any regularity (see Figure 5). Looking at the distribution of spatial locations for the agents it is clear that R2D2 essentially diffuses randomly around the room, while R2D3 spends much more time in task-relevant parts of the environment (e.g. away from the walls). We also record the total distance traveled by the moveable blocks in the room, and find that R2D3 tends to move the blocks significantly more often than R2D2, even before it has learned to solve the task. We exclude Remember Sensor and Throw Across from this analysis, since we saw no successful seeds for either of these tasks.
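As a concrete reading of the success-rate metric defined earlier in this section, a minimal sketch; the function names and the episode-log format are illustrative assumptions:

def agent_succeeded(episode_successes, last_n=25, threshold=0.75):
    """An agent is successful if at least 75% of its final 25 episodes collected the large apple."""
    final = episode_successes[-last_n:]
    return sum(final) / len(final) >= threshold

def success_rate(per_agent_episode_successes):
    """Proportion of trained agents (seeds) that are successful for one hyperparameter setting."""
    outcomes = [agent_succeeded(ep) for ep in per_agent_episode_successes]
    return sum(outcomes) / len(outcomes)

Figure 6 aggregates this quantity over tasks for each demo ratio.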
Figure 8 | Guided exploration behavior in the Baseball task. (a)
Sub-behaviors expressed by five R2D2 and five R2D3 agents after 0.5B steps of training (left) and 4B steps of training (right). Each point is estimated from 200 episodes. At 0.5B steps, none of the agents received any reward over the 200 evaluation episodes, while at 4B steps, three of the R2D3 agents received reward on almost every episode. Even when the R2D3 agents are not receiving reward, they are expressing some of the necessary behaviors provided through human demonstrations. (b)
R2D3 agents eventually surpass human performance. The 3 of 5 R2D3 agents shown in (a) which start obtaining rewards continue to bootstrap towards more efficient policies than humans.
In Figure 8 we show a different analysis of the Baseball task (see Figure 3 for a detailed walkthrough of this task). Here we manually identify a sequence of milestones that a trajectory must reach in order to be successful, and record how often different agents achieve each of these subgoals. This subgoal structure is implicit in the task, but is not made available explicitly to any of the agents during training; they are identified here purely as a post-hoc analysis tool. In this task we see that the R2D3 agents learn very quickly to pick up and raise the bat, while the R2D2 agents rarely interact with the bat at all, and actually do so less as training proceeds. We also see that hitting the ball off the plinth is the most difficult step to learn in this task, bottlenecking two of the R2D3 agents.
7. Conclusion
In this paper, we introduced the R2D3 agent, which is designed to make efficient use of demonstrations to learn in partially observable environments with sparse rewards and highly variable initial conditions. We showed through several experiments on eight very difficult tasks that our approach is able to outperform multiple state of the art baselines, two of which are themselves ablations of R2D3.

We also identified a key parameter of our algorithm, the demo ratio, and showed that careful tuning of this parameter is critical to good performance. Interestingly we found that the optimal demo ratio is surprisingly small but non-zero, which suggests that there may be a risk of overfitting to the demonstrations at the cost of generalization. For future work, we could investigate how this optimal demo ratio changes with the total number of demonstrations and, more generally, the distribution of expert trajectories relative to the task variability.

We introduced the Hard-Eight suite of tasks and used them in all of our experiments. These tasks are specifically designed to be partially observable tasks with sparse rewards and highly variable initial conditions, making them an ideal testbed for showcasing the strengths of R2D3 in contrast to existing methods in the literature.

Our behavioral analysis showed that the mechanism R2D3 uses to efficiently extract information from expert demonstrations is to use them in a way that guides (or biases) the agent’s own autonomous exploration of the environment. An in-depth analysis of agent behavior on the Hard-Eight task suite is a promising direction for understanding how different RL algorithms make selective use of information.
Acknowledgements
We would like to thank the following members of the DeepMind Worlds Team for developing the tasks in this paper: Charlie Beattie, Gavin Buttimore, Adrian Collister, Alex Cullum, Charlie Deck, Simon Green, Tom Handley, Cédric Hauteville, Drew Purves, Richie Steigerwald and Marcus Wainwright.

We would also like to acknowledge the scientific python community for developing the core set of tools that enabled this work, including Tensorflow (Abadi et al., 2016), Numpy (Oliphant, 2006), Pandas (McKinney et al., 2010), Matplotlib (Hunter, 2007) and Seaborn (Waskom et al., 2017).
References
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems, pages 2930–2941, 2018.
Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI Conference on Artificial Intelligence, pages 864–871, 2012.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Marc G Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48, 2009.
Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320, 2017.
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 3215–3222, 2018.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, pages 3223–3230, 2018.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
John D Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
Arthur Juliani. On “solving” Montezuma’s revenge. https://medium.com/@awjuliani/on-solving-montezumas-revenge-2146d83f0bc3, 2018. Accessed: 2019-19-21.
Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. In AAAI-19 Workshop on Games and Simulations for Artificial Intelligence, 2019.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
Beomjoon Kim, Amir-massoud Farahmand, Joelle Pineau, and Doina Precup. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pages 2859–2867, 2013.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
Wes McKinney et al. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In IEEE International Conference on Robotics and Automation, pages 6292–6299, 2018.
Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, pages 278–287, 1999.
Travis Oliphant. Guide to NumPy. USA: Trelgol Publishing, 2006.
Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, et al. One-shot high-fidelity imitation: Training large-scale deep nets with RL. arXiv preprint arXiv:1810.05017, 2018.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics, 37(4):1:14, 2018.
Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics and Automation, pages 3758–3765, 2018.
Tim Salimans and Richard Chen. Learning Montezuma’s revenge from a single demonstration. https://openai.com/blog/learning-montezumas-revenge-from-a-single-demonstration, 2018a. Accessed: 2019-19-22.
Tim Salimans and Richard Chen. Learning Montezuma’s revenge from a single demonstration. arXiv preprint arXiv:1812.03381, 2018b.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
Jürgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on Neural Networks, pages 1458–1463, 1991.
Adrien Ali Taïga, William Fedus, Marlos C Machado, Aaron Courville, and Marc G Bellemare. Benchmarking bonus-based exploration methods on the arcade learning environment. arXiv preprint arXiv:1908.02388, 2019.
Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.
Michael Waskom, Olga Botvinnik, Drew O’Kane, Paul Hobson, Saulius Lukauskas, David C Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Tal Yarkoni, Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, Chris Fonnesbeck, Antony Lee, and Adel Qalieh. mwaskom/seaborn: v0.8.1 (September 2017), September 2017. URL https://doi.org/10.5281/zenodo.883859.
Konrad Żołna, Negar Rostamzadeh, Yoshua Bengio, Sungjin Ahn, and Pedro O Pinheiro. Reinforced imitation in heterogeneous action space. arXiv preprint arXiv:1904.03438, 2019.
A. R2D3
Below we include pseudocode for the full R2D3 agent. The agent consists first of a single learner process which samples from both demonstration and agent buffers in order to update its policy parameters.
Algorithm 1 Learner
Inputs: replay of expert demonstrations D, replay of agent experiences R, batch size B, sequence length m, and number of actors A.
Initialize policy weights θ.
Initialize target policy weights θ′ ← θ.
Launch A actors and replicate policy weights θ to each actor.
for n steps do
    Sample transition sequences (s_{t:t+m}, a_{t:t+m}, r_{t:t+m}) from replay D with probability ρ or from replay R with probability (1 − ρ), to construct a mini-batch of size B.
    Calculate loss using target network.
    Perform a gradient descent step to update θ.
    If t mod t_target = 0, update the target policy weights θ′ ← θ.
    If t mod t_actor = 0, replicate policy weights to the actors.
end for
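A minimal Python sketch of the learner loop in Algorithm 1, showing the per-element stochastic demo-ratio sampling described in Section 2. The replay buffers, network, loss, and optimizer are abstracted behind assumed callables (sample_sequence, loss, step, load_weights, update_weights), so this is an illustration of the control flow rather than the actual implementation.

import random

def sample_batch(demo_replay, agent_replay, batch_size, demo_ratio):
    """Build one training batch: each element independently comes from the demo
    buffer with probability rho, otherwise from the agent buffer."""
    batch = []
    for _ in range(batch_size):
        buffer = demo_replay if random.random() < demo_ratio else agent_replay
        batch.append(buffer.sample_sequence())  # prioritized sampling happens inside the buffer
    return batch

def learner_loop(demo_replay, agent_replay, net, target_net, optimizer, *,
                 num_steps, batch_size=32, demo_ratio=1/256,
                 target_period=400, actor_period=200, actors=()):
    """Sketch of Algorithm 1: sample mixed batches, update the online network,
    and periodically refresh the target network and the actors' weights."""
    for t in range(num_steps):
        batch = sample_batch(demo_replay, agent_replay, batch_size, demo_ratio)
        loss = net.loss(batch, target_net)      # n-step double Q-learning loss
        optimizer.step(loss)
        if t % target_period == 0:
            target_net.load_weights(net.weights())
        if t % actor_period == 0:
            for actor in actors:
                actor.update_weights(net.weights())

Setting demo_ratio to 0 recovers the R2D2 ablation of Section 5; the default of 1/256 reflects the small demo ratios that worked best in our sweeps (Appendix D.3).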
The agent also consists of A parallel actor processes which interact with a copy of the environment in order to obtain data which is then inserted into the agent buffer. The agents periodically update their parameters to match those being updated on the learner.

Algorithm 2 Actor
repeat
    Sample action from behavior policy a ← π(s)
    Execute a and observe s′ and r
    Store (s, a, s′, r) in R
until learner finishes.

B. Hard-Eight task suite details
Sparse rewards
All of the tasks emit sparse rewards; indeed, in all but one task the only positive instantaneous reward obtained also ends the episode successfully. In other words, for standard RL algorithms to learn by bootstrapping, the actors must first solve the task inadvertently, and must do so with no intermediate signal to guide them.
Partial observability
Visual observations are all first-person, which means that some relevant features of the state of the world may be invisible to the agent simply because they are behind it or around a corner. Some tasks (e.g. Remember Sensor) are explicitly designed so that this is the case.
Highly Variable Initial Conditions
Many of the elements of the tasks are procedurally generated, which leads to significant variability between episodes of the same task. In particular, the starting position and orientation of the agent are randomized and similarly, where they are present, the shapes, colors, and textures of various objects are randomly sampled from a set of available such features. Therefore a single (or small number of) demonstration(s) is not sufficient to guide an agent to solve the task as it is in the case of DQfD on Atari (Pohlen et al., 2018).
Observation specification
All of the tasks provide the same observation space. In particular, a visual channel consisting of 96 by 72 RGB pixels, as well as accelerations of the avatar, force applied by the avatar hand on the object, whether the avatar is holding anything or not, and the distance of a held object from the face of the avatar (zero when there is no held object).
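A sketch of this observation layout as a spec dictionary; the field names and the dimensionalities of the proprioceptive entries are assumptions for illustration only.

OBSERVATION_SPEC = {
    "pixels": (72, 96, 3),         # 96x72 RGB first-person view
    "acceleration": (3,),           # avatar accelerations (shape assumed)
    "hand_force": (1,),             # force applied by the avatar hand on a held object
    "is_holding": (1,),             # whether the avatar is holding anything
    "held_object_distance": (1,),   # distance of a held object from the avatar's face (0 if empty)
}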
Action specification
The action space consists of four displacement and four rotation actions (8), duplicated for coarse and fine-grained movement (16) as well as for movement with and without grasping (32). The avatar also has an invisible “hand” which can be used to manipulate objects in the environment. The location of the hand is controlled by the avatar gaze direction, plus an additional two actions that control the distance of the hand from the body (34). A grasped object can be manipulated by six rotation actions (two for each rotational degree of freedom; 40) as well as four additional actions controlling the distance of the hand from the body at coarse and fine speed (44). Finally there is an independent grasp action (to hold an object without moving), and a no-op action (total 46). Compared to coarse actions, fine-grained actions result in slower movements, allowing the agent to perform careful manipulations.
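A back-of-the-envelope tally of the 46-action space described above; the grouping follows the prose, and the exact ordering of actions within each group is an assumption.

displacement_and_rotation = 4 + 4                   # 8 base movement actions
coarse_fine = displacement_and_rotation * 2         # 16: coarse and fine-grained variants
with_without_grasp = coarse_fine * 2                # 32: duplicated with and without grasping
hand_distance = with_without_grasp + 2              # 34: move the hand closer/farther from the body
held_object_rotation = hand_distance + 6            # 40: two per rotational degree of freedom
held_object_distance = held_object_rotation + 4     # 44: coarse/fine hand distance while holding
total = held_object_distance + 2                    # 46: independent grasp action and no-op

assert total == 46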
B.1. Individual task details
This section gives additional details on each task in our suite, including a sequence of frames from a successful task execution (performed by a human) and a list of the procedural elements randomized per episode. Videos of agents and humans performing these tasks can be found at https://deepmind.com/research/publications/r2d3 .

Baseball
Procedural elements
• Initial position and orientation of the agent
• Wall, floor and object materials and colors
• Initial position of the stick
• Position of plinth
Drawbridge
Procedural elements
• Initial position and orientation of the agent
• Wall, floor, ceiling and object materials and colors
• Positions of the small apples throughout the network of ledges
Navigate Cubes
Procedural elements
• Initial position and orientation of the agent
• Wall, floor and object materials and colors
Push Blocks
Procedural elements
• Initial position and orientation of the agent
• Wall, floor, object materials and colors
• Positions of the objects
• Sensor required color
Remember Sensor
Procedural elements
• Initial position and orientation of the agent
• Sensor required color
• Number of objects in the block room
• Position of objects in the block room
• Shape and material of the objects in the block room
Throw Across
Procedural elements
• Initial position and orientation of the agent
• Wall, floor and object materials and colors
• Color and material of the sensors
• Initial positions of the two key objects
Wall Sensor
Procedural elements
• Initial position and orientation of the agent
• Position of the sensor
• Position of the key object
Wall Sensor Stack
Procedural elements
• Initial position and orientation of the agent
• Wall, floor and object materials and colors
• Initial positions of both key objects
• Position of the sensor
C. Hyper-parameters
In Table 2, we report the shared set of hyper-parameters across different models and tasks.
Hyperparameters                      Values

Network                              See Figure 4

Environment
  Image height                       72
  Image width                        96
  Color                              RGB
  Action repeats                     2
  Observation spec                   See section B
  Action spec                        See section B

Learner
  Learning rate                      2e-4
  Optimizer                          Adam (Kingma and Ba, 2014)
  Global norm gradient clipping      True
  Discount factor (γ)                0.997
  Batch size (B)                     32
  Target update period (t_target)    400
  Actor update period (t_actor)      200
  Prioritized sampling               True
  Sequence length (m)                80
  Burn-in length                     40
  Asymmetric reward clipping         True
  Number of actors (A)               256
  Max replay capacity                500000
  Min replay capacity                25000

Table 2 | Hyper-parameters used for all experiments.
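For convenience, the Table 2 values could be collected into a single configuration object; a minimal sketch in which the key names are assumptions and the values are copied from the table.

R2D3_CONFIG = {
    # Environment
    "image_height": 72,
    "image_width": 96,
    "color": "RGB",
    "action_repeats": 2,
    # Learner
    "learning_rate": 2e-4,
    "optimizer": "Adam",
    "global_norm_gradient_clipping": True,
    "discount_factor": 0.997,
    "batch_size": 32,
    "target_update_period": 400,
    "actor_update_period": 200,
    "prioritized_sampling": True,
    "sequence_length": 80,
    "burn_in_length": 40,
    "asymmetric_reward_clipping": True,
    "num_actors": 256,
    "max_replay_capacity": 500000,
    "min_replay_capacity": 25000,
}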
D. Experiments
D.1. Surpassing the experts
An important property of R2D3 is that although the agents are trained from demonstrations, the behaviors they achieve are able to surpass the skill of the demonstrations they were trained from. This can be seen quantitatively from the reward curves in Figure 5, where the R2D3 agent surpasses the human baseline performance on four of the eight tasks (e.g. Baseball, Navigate Cubes, Wall Sensor and Wall Sensor Stack). In some of these cases the improved score is simply a matter of executing the optimal strategy more fluently than the demonstrators. For example, this is the case in the Baseball task, where the human demonstrators are handicapped by the fact that the human interface to the agent action space makes it awkward to rotate a held object. This turns picking up the stick and orienting it properly to knock the ball off the plinth into a tricky task for humans, but the agents are able to refine their behavior to be much more efficient (see Figure 8c).

The behavior on Wall Sensor Stack is especially interesting; however, in this case the agents find a completely different strategy than the human demonstrators by exploiting a bug in the implementation of the environment. The intended strategy for this task is to stack two blocks on top of each other so that one of them can remain in contact with a wall mounted sensor, and this is the strategy employed by the demonstrators. However, due to a bug in the environment it is also possible to trick the sensor into remaining active even when it is not in contact with the key by pressing the key against it in a precise way. The R2D3 agents are able to discover this bug and exploit it, resulting in superhuman scores on this task even though this strategy is not present in the demonstrations.

Figure 9 | We show the rewards of the R2D3 agent on different tasks for each seed separately.

Figure 10 | R2D3 learning curves with varying demo ratios for all tasks.
D.2. Additional experiments
We also ran a few additional experiments to get more information about the tasks we did not solve, or solved incorrectly. Videos for these experiments are available at https://deepmind.com/research/publications/r2d3 .

Remember Sensor
This task requires a long memory, and also has the longest episode length of any task in the Hard-Eight suite. In an attempt to mitigate these issues, we trained the agent using a higher action repeat, which reduces the episode length, and used stale LSTM states instead of zero LSTM states, which provides information from earlier in the episode. This allows R2D3 to learn policies that display reasonable behavior, retrieving a random block and bringing it back to the hallway. Using this method it can occasionally solve the task.
Throw Across
The demonstrations collected for this task had a very low success rate of 54%. We attempted to compensate for this by collecting an additional 30 demos. When we trained R2D3 with all 130 demos, all seeds solved the task.
Wall Sensor Stack
The original Wall Sensor Stack environment had a bug that the R2D3 agent was able to exploit. We fixed the bug and verified the agent can learn the proper stacking behavior.
D.3. Additional details for main experiments
In Figure 9, we show the performance of the R2D3 agents for each seed separately. On tasks such as Drawbridge, Navigate Cubes and Wall Sensor, all seeds take off quite rapidly and they have very low variance in reward between different seeds. However, on the Wall Sensor Stack task, while one seed takes off quite rapidly, the rest of them are just flat. In Figure 10, we elaborate on Figure 6. For Baseball, Navigate Cubes, Push Blocks, and Wall Sensor Stack, a demo ratio of 1/256 works best. On Drawbridge and Wall Sensor all demo ratios are similarly effective.
Figure 11 | Further detail of guided exploration behavior in the Push Blocks task (as in Figure 7). (a)
Proportion of episodes in which the agent pushes a crate into the recess during the initial 12B steps of training. (b)
Proportion of episodes in which the crate pushed into the recess actually matches the sensor color. Data are only shown when crates are pushed into the recess on at least 5 out of 200 episodes. Dashed line shows the probability expected if a random crate was pushed into the recess. Thus, while (c) shows that by 12B steps the R2D3 agent may have reasonable success in pushing crates into the recess, it has not yet mastered the logic that the crate color must match the sensor color.
Figure 12 | Further detail of guided exploration behavior in the Push Blocks task (as in Figure 7). (a)
Spatial pattern of exploration behavior for the R2D2 agent over the course of ∼12B steps of training. Each row shows a different random seed; the number of training steps increases from the leftmost column to the rightmost column. There is little variation in how the policy manifests as explorative behavior across seeds and training time. (b)