Sample-efficient Deep Reinforcement Learning with Imaginary Rollouts for Human-Robot Interaction
Mohammad Thabet, Massimiliano Patacchiola, and Angelo Cangelosi

Abstract — Deep reinforcement learning has proven to be a great success in allowing agents to learn complex tasks. However, its application to actual robots can be prohibitively expensive. Furthermore, the unpredictability of human behavior in human-robot interaction tasks can hinder convergence to a good policy. In this paper, we present an architecture that allows agents to learn models of stochastic environments and use them to accelerate learning. We describe how an environment model can be learned online and used to generate synthetic transitions, as well as how an agent can leverage these synthetic data to accelerate learning. We validate our approach using an experiment in which a robotic arm has to complete a task composed of a series of actions based on human gestures. Results show that our approach leads to significantly faster learning, requiring much less interaction with the environment. Furthermore, we demonstrate how learned models can be used by a robot to produce optimal plans in real world applications.
I. INTRODUCTION
Deep reinforcement learning (RL) has recently been applied successfully to a variety of problems, such as playing Atari games with super-human proficiency [1] and robot control [2]. However, applying RL methods to real robots can be extremely costly, since acquiring thousands of episodes of interaction with the environment often requires a lot of time and can lead to physical damage. Furthermore, in human-robot interaction (HRI) scenarios, human actions cannot be predicted with certainty, which can significantly impede convergence to a good policy.

One way of alleviating these problems is to have the agent learn a model of the environment, and use this model to generate synthetic data that can be used in conjunction with real data to train the agent. This assumes that the environment dynamics are easier to learn than an optimal policy, which is a generally valid assumption at least for some classes of tasks. Furthermore, if such a model is stochastic in nature, then the uncertainty in state changes can be taken into account, thus allowing more natural interaction with humans. Much like how people learn, an agent with a model of its environment can generate imaginary scenarios that can be used to help optimize its performance. This approach has garnered much attention in the field recently, and is sometimes referred to as endowing agents with imagination [3], [4], [5].

Mohammad Thabet and Angelo Cangelosi are with the School of Computer Science, the University of Manchester, United Kingdom, [email protected], [email protected]
Massimiliano Patacchiola is with the School of Informatics, the University of Edinburgh, United Kingdom, [email protected]
Angelo Cangelosi is also with the AIST Artificial Intelligence Research Centre, Tokyo, Japan.

Fig. 1. Experiments with the Sawyer robotic arm. The robot has to solve a puzzle by rotating the cubes to reach a goal state based on the pointing gesture by the human.
In this paper we describe an architecture that allows an agent to learn a stochastic model of the environment and use it to accelerate learning in RL problems. In our approach, an agent encodes sensory information into low-dimensional representations, and learns a model of its environment online in latent space, while simultaneously learning an optimal policy. The model can be learned much faster than the policy, and therefore can be used to augment transitions collected from the real environment with synthetic transitions, improving the sample-efficiency of RL. Our approach requires no prior knowledge of the task; only the encoder needs to be pretrained on task-relevant images, and it can generally be reused for multiple tasks. We test our architecture on a high-level robotic task in which a robot has to interpret a gesture and solve a puzzle based on it (Fig. 1). Results show that incorporating synthetic data leads to a significant speedup in learning, especially when only a small amount of real interaction data is made available to the agent.

II. RELATED WORK
There has been much recent interest in the literature in combining model-free and model-based approaches to reinforcement learning. Ha and Schmidhuber [5] built models for various video game environments using a combination of a mixture density network (MDN) and a long short-term memory (LSTM), which they call MDN-RNN. In their approach, they first compress visual data into low-dimensional representations via a variational autoencoder (VAE), and then train the MDN-RNN to predict future state vectors, which are used by the controller as additional information to select optimal actions. However, they pretrained the environment models on data collected by a random agent playing video games, whereas in our work a model for an HRI task is learned online.

The use of learned models to create synthetic training data has also been explored. Kalweit et al. [4] used learned models to create imaginary rollouts to be used in conjunction with real rollouts. In their approach, they limit model usage based on an uncertainty estimate in the Q-function approximator, which they obtain with bootstrapping. They were able to achieve significantly faster learning on simulated continuous robot control tasks. However, they relied on well-defined, low-dimensional state representations such as joint states and velocities, as opposed to raw visual data as in our approach.

Racanière et al. [3] used a learned model to generate multiple imaginary rollouts, which they compress and aggregate to provide context for a controller that they train on classic video games. The advantage of this approach is that the controller can leverage important information contained in subsequences of imagined rollouts, and is more robust to erroneous model predictions.

Model rollouts can be used to improve targets for temporal difference (TD) algorithms as well. Feinberg et al. [6] used a model rollout to compute improved targets over many steps, in what they termed model-based value expansion (MVE). More recently, Buckman et al. [7] proposed an extension to MVE, in which they use an ensemble of models to generate multiple rollouts of various lengths, interpolating between them and favoring those with lower uncertainty.

Deep reinforcement learning is increasingly being employed successfully for robots in continuous control tasks and manipulation [2], [8], [9], [10]. However, its application to high-level tasks and in HRI has been very limited. Qureshi et al. [11] used a multimodal DQN to teach a humanoid robot basic social skills such as successfully shaking hands with strangers. Input to their system consisted of depth and greyscale images. Interaction data were collected using the robot over a period of 14 days, where they separated the data collection and training phases and alternated between them for practical reasons. In our work, we are primarily interested in increasing the sample efficiency so that training requires fewer resources, making RL more practical for robots.

III. BACKGROUND
A. Reinforcement Learning
In reinforcement learning [12], a task is modelled as a Markov decision process (MDP) in which an agent influences its environment state $s_t$ with action $a_t = \pi(s_t)$ chosen according to some policy $\pi$. The environment then transitions into a new state $s_{t+1}$ and provides the agent with a reward signal $r_t$. This process repeats until the environment reaches a terminal state, concluding an episode of interaction. The goal of the agent is to learn an optimal policy $\pi^*$ and use it to maximize the expected return, which is the discounted sum of rewards $R_t = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma \in [0, 1]$ is the discount factor and $T$ is the timestep at which a terminal state is reached.

There are model-based and model-free algorithms for finding an optimal policy. One model-free method is to learn the action-value function $Q^\pi(s, a)$, which is the expected return from taking action $a$ in state $s$ and following policy $\pi$ thereafter: $Q^\pi(s_t, a_t) = \mathbb{E}_\pi[R_t \mid s_t, a_t]$. The agent's goal thus becomes to learn an optimal Q-function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$. This can be achieved using a recursive relationship known as the Bellman equation:

$$Q^*(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \;\middle|\; s_t, a_t\right] \quad (1)$$

Since most non-trivial tasks have very large state or action spaces, the Q-function usually cannot be calculated analytically, and is estimated instead by a function approximator $Q_\theta(s, a)$ with parameters $\theta$. One common approach is deep Q-networks (DQN) [1], in which transitions are stored as tuples of the form $(s_t, a_t, r_t, s_{t+1})$ and used to train a neural network so that $Q_\theta(s, a) \approx Q^*(s, a)$. A DQN is trained to minimize the loss function:

$$L(\theta) = \left(y_t - Q_\theta(s_t, a_t)\right)^2 \quad (2)$$

where the target $y_t$ is obtained from Equation 1 using the estimate $Q_\theta(s_{t+1}, a_{t+1})$. Given the gradients of Equation 2 with respect to $\theta$, the network can be trained using some variation of stochastic gradient descent. Actions are chosen based on the $\epsilon$-greedy policy, where the optimal action is chosen with probability $1 - \epsilon$ and a random action with probability $\epsilon$.
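To make Equations 1 and 2 concrete, the following is a minimal numpy sketch of the one-step TD target and the squared TD error; the q_network callable is a hypothetical stand-in that maps a batch of states to per-action Q-value estimates, and the names are illustrative rather than the paper's implementation.

    import numpy as np

    def dqn_targets(q_network, transitions, gamma=0.99):
        """One-step TD targets y_t = r_t + gamma * max_a' Q(s_{t+1}, a') (Eq. 1).
        `transitions` is a tuple of batched arrays."""
        states, actions, rewards, next_states, terminal = transitions
        max_next_q = q_network(next_states).max(axis=1)   # max over a_{t+1}
        # No bootstrapping past a terminal state.
        return rewards + gamma * max_next_q * (1.0 - terminal)

    def dqn_loss(q_network, transitions, gamma=0.99):
        """Squared TD error of Equation 2, averaged over the batch."""
        states, actions, rewards, next_states, terminal = transitions
        y = dqn_targets(q_network, transitions, gamma)
        q_sa = q_network(states)[np.arange(len(actions)), actions]
        return np.mean((y - q_sa) ** 2)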
B. Variational Autoencoders

Variational autoencoders (VAE) [13] are generative models that can be used both to generate synthetic data and to encode existing data into representations in a low-dimensional latent space. Like traditional autoencoders, they consist of an encoding network and a decoding one. The key difference is that they encode a data point $x$ into a probability distribution over a latent variable vector $z$. The goal is then to learn an approximate multivariate Gaussian posterior distribution $q(z|x) = \mathcal{N}(\mu(x), \Sigma(x)I)$, which is assumed to have a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$. This can be done by minimizing the Kullback-Leibler (KL) divergence between $q(z|x)$ and the true posterior $p(z|x)$:

$$\begin{aligned}
\mathrm{KL}(q(z|x) \,\|\, p(z|x)) &= \mathbb{E}_q[\log q(z|x) - \log p(z|x)] \\
&= \mathbb{E}_q[\log q(z|x) - \log p(x|z) - \log p(z) + \log p(x)] \\
&= -\mathbb{E}_q[\log p(x|z)] + \mathrm{KL}(q(z|x) \,\|\, p(z)) + \log p(x)
\end{aligned} \quad (3)$$

Here, the first term is the reconstruction loss, while the second term penalizes divergence of the learned posterior from the assumed prior. The expectation $\mathbb{E}_q[\log p(x|z)]$ can be approximated by $\log p(x|z)$ for a single sample $z = \mu(x) + \Sigma^{1/2}(x)\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, decoded with the decoder network. Since $\log p(x)$ does not depend on the network parameters, and since $p(z) = \mathcal{N}(0, I)$ gives the KL term in closed form, minimizing Equation 3 is equivalent to minimizing:

$$L(\theta, \phi) = -\log p_\phi(x|z) - \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right) \quad (4)$$

Here we have parametrized the encoder and decoder networks with $\theta$ and $\phi$ respectively, $J$ is the dimensionality of the latent space, and $\sigma_j^2$ are the diagonal elements of $\Sigma(x; \theta)$. The encoder and decoder networks are trained back to back to minimize the loss given by Equation 4. Note that if $p(x|z)$ is Bernoulli, the reconstruction loss is equivalent to the cross-entropy between the actual $x$ and the predicted $\hat{x}$.
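As a concrete illustration of Equation 4 and the sampling step, here is a minimal numpy sketch. It assumes a Bernoulli decoder and that the encoder outputs log-variances (a common parameterization that the text does not specify).

    import numpy as np

    def vae_loss(x, x_hat, mu, log_var):
        """Equation 4 for a Bernoulli decoder: pixel-wise binary cross-entropy
        plus the closed-form KL divergence between N(mu, sigma^2 I) and the
        unit Gaussian prior. `log_var` holds log(sigma_j^2)."""
        eps = 1e-7
        recon = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
        kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
        return recon + kl

    def reparameterize(mu, log_var, rng=np.random):
        """Sample z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization
        trick), so gradients can flow through the sampling step."""
        return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)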
C. Mixture Density Networks

Mixture density networks (MDN) [14] are neural networks that model data as a mixture of Gaussian distributions. This is useful for modeling multi-valued mappings, such as many inverse functions or stochastic processes. MDNs model the distribution of target data $y$ conditioned on input data $x$ as $p(y|x) = \sum_{i=1}^{m} \alpha_i(x)\, \phi(y; \mu_i(x), \sigma_i^2(x))$, where $m$ is the number of components, $\alpha_i$ are the mixture coefficients subject to $\sum_{i=1}^{m} \alpha_i = 1$, and $\phi(\cdot\,; \mu, \sigma^2)$ is a Gaussian kernel with mean $\mu$ and variance $\sigma^2$. MDNs have a similar structure to feedforward networks, except that they have three parallel output layers for three vectors: one for the means, one for the variances, and one for the mixture coefficients. The network parameters $\theta$ are optimized by minimizing the negative log-likelihood of the data:

$$L(\theta) = -\log \sum_{i=1}^{m} \alpha_i(x; \theta)\, \phi\!\left(y; \mu_i(x; \theta), \sigma_i^2(x; \theta)\right) \quad (5)$$

To predict an output for a given input, we sample from the resulting mixture distribution by first sampling from the categorical distribution defined by the $\alpha_i$ to select a component Gaussian, and then sampling from the latter.
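A minimal numpy sketch of Equation 5 and the two-stage sampling just described, assuming isotropic Gaussian kernels as in Bishop's formulation; the function names are illustrative.

    import numpy as np

    def sample_mdn(alpha, mu, sigma, rng=np.random):
        """Draw one sample from the predicted mixture.
        alpha: (m,) mixture coefficients, mu: (m, d) means, sigma: (m,) std devs."""
        i = rng.choice(len(alpha), p=alpha)           # pick a component
        return mu[i] + sigma[i] * rng.standard_normal(mu.shape[1])

    def mdn_nll(alpha, mu, sigma, y):
        """Negative log-likelihood of target y under the mixture (Equation 5)."""
        d = mu.shape[1]
        sq_dist = np.sum((y - mu) ** 2, axis=1)
        kernels = np.exp(-0.5 * sq_dist / sigma ** 2) / (2 * np.pi * sigma ** 2) ** (d / 2)
        return -np.log(np.sum(alpha * kernels) + 1e-12)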
IV. ARCHITECTURE

The proposed architecture consists of three components: the vision module (V) that produces abstract representations of input images, the environment model (M) which generates imaginary rollouts, and the controller (C) that learns to map states into actions. We assume that the environment is Markovian and is fully represented at any given time by the input image. Figure 2 shows an overview of the architecture.

Fig. 2. Overview of the proposed architecture. The controller C influences the environment with an action, which produces state s and reward r. The encoder V encodes s into latent state vector z. The environment model M can be trained on real transitions and then used to generate imaginary transitions. C can then be trained on both real and imaginary transitions.

V comprises the encoder part of a variational autoencoder (VAE) [13], and is responsible for mapping the high-dimensional input images into low-dimensional state representations. The controller and the environment model are trained in this low-dimensional latent space, which is generally computationally less expensive. The main advantage of using a VAE instead of a vanilla autoencoder is that the VAE maps every input image into a continuous region in the latent space, defined by the parameters of a multivariate Gaussian distribution. This makes the environment model more robust and ensures that its output is meaningful and can be mapped back into realistic images.

M is responsible for generating synthetic transitions, and predicts the future state $z_{t+1}$ and the reward $r_t$ based on the current state $z_t$ and input action $a_t$. It is composed of three models: a mixture density network (MDN) [14] that learns the transition dynamics, a reward predictor called the r-network, and a terminal state predictor called the d-network. The MDN learns the conditional probability distribution of the next state $p(z_{t+1} | z_t, a_t)$. The r-network learns to predict the reward for each state, while the d-network learns to predict whether a state is terminal or not. Both the r- and d-networks are implemented as feed-forward neural networks. To generate imaginary rollouts, M can be seeded with an initial state from V, and then run in closed loop, where its output is fed back into its input along with the selected action. The advantage of using an MDN is that it is possible to learn a model of stochastic environments, in which an action taken in a given state can lead to multiple next states. This is especially useful in HRI tasks, in which the human response to actions taken by the robot cannot be predicted with certainty. Furthermore, modelling the next state probabilistically is much more robust to errors in prediction, allowing the environment model to run in closed loop.

Lastly, C is responsible for selecting the appropriate action in a given state. It is implemented as a Q-network, and learns to estimate the action values of states. C is trained on both real and imaginary transitions to maximize the cumulative reward.
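The closed-loop rollout generation described above might look as follows in Python; mdn_sample, r_net, and d_net are hypothetical interfaces standing in for M's three components, not the authors' actual API.

    def imaginary_rollout(z0, controller, mdn_sample, r_net, d_net, max_depth=10):
        """Generate one imaginary rollout by running the environment model M in
        closed loop: seed with latent state z0, pick actions with the controller,
        and feed each predicted next state back in. `mdn_sample(z, a)` samples
        z_{t+1} from the MDN, `r_net(z)` predicts the reward of a state, and
        `d_net(z)` predicts whether it is terminal."""
        rollout, z = [], z0
        for _ in range(max_depth):
            a = controller(z)
            z_next = mdn_sample(z, a)
            r, done = r_net(z_next), d_net(z_next)
            rollout.append((z, a, r, z_next, done))
            if done:
                break
            z = z_next
        return rollout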
V. EXPERIMENTS

The experiments detailed in this section are designed to evaluate our approach on a real-world robotic application. We are interested primarily in the performance increase gained by utilizing the learned model, compared to the baseline DQN method [1].
A. Experiment Setup
To test our architecture, we designed a task in which a robot has to solve a puzzle based on pointing gestures made by a human. The robot sees three cubes with arrows painted on them, with each arrow pointing either up, down, left, or right. The human can point to any of the three cubes, but may not point to a different cube during the same episode. To successfully complete the task, the robot has to rotate the cubes so that only the arrow on the cube being pointed to is in the direction of the human, with the constraint that at no point should two arrows point in the same direction. The task is similar to puzzle games typically used in studies about robot learning, such as the Towers of Hanoi puzzle [15].

We formulate the task as an RL problem in which the agent can choose from 6 discrete actions at any given time: rotating any of the three cubes 90° clockwise or counterclockwise. The robot gets a reward of +50 for reaching the goal state, -5 for reaching an illegal state, and -1 otherwise to incentivize solving the task as efficiently as possible. An episode terminates if the robot reaches either a goal state or an illegal state, or after it performs 10 actions. See Fig. 3 for examples of goal and illegal terminal states.

Fig. 3. Examples of terminal states of the task. (a) is a goal state, while (b) is an illegal state.

To train the robot, we created a simulated environment that receives the selected action from the agent (the robot) as input, and outputs an image representing its state, along with a reward signal and a flag to mark terminal states. The environment is implemented as a finite state machine with 192 macrostates, where each macrostate is the combination of 3 microstates describing the directions of each of the three arrows, plus another microstate describing which box the hand is pointing to. Whenever the environment transitions to a certain state, it outputs one of a multitude of images associated with that state at random, thus producing the observable state that the agent perceives.

To produce the images used for the environment, we first collected multiple image fragments for each of the possible microstates of the environment. Each of these fragments depicts a slight variation of the same microstate, for example slightly different box positions or hand positions. We thus create a pool of multiple image fragments for each possible microstate. To synthesize a complete image for a given macrostate, we choose a random fragment for each of its constituent microstates, and then patch them together. For the experiments, we collected 50 fragments for each possible hand microstate, and 16 fragments for each possible arrow microstate, yielding a very large number of possible unique synthesized images. The images were taken with the Sawyer robotic arm camera (Fig. 1). For the experiments, we synthesized 100,000 training images, and 10,000 test images.
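For illustration, the underlying puzzle logic of the simulated environment could be sketched as follows; the direction encoding and the DIR_HUMAN constant are assumptions for the sketch, not the paper's actual implementation.

    # Directions encoded 0..3; DIR_HUMAN is the assumed encoding of
    # "pointing towards the human".
    DIR_HUMAN = 2

    def step(arrows, pointed, action, steps_taken):
        """Apply one of the 6 discrete actions and return (arrows, reward, done).
        arrows: directions of the 3 cubes, pointed: index of the pointed cube,
        action: 0..5 encoding (cube index, rotation direction)."""
        cube, ccw = divmod(action, 2)
        arrows = list(arrows)
        arrows[cube] = (arrows[cube] + (-1 if ccw else 1)) % 4   # rotate 90 deg
        if len(set(arrows)) < len(arrows):            # two arrows aligned: illegal
            return arrows, -5, True
        if all((d == DIR_HUMAN) == (i == pointed) for i, d in enumerate(arrows)):
            return arrows, +50, True                  # goal state reached
        return arrows, -1, steps_taken + 1 >= 10      # step cost, 10-action cap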
B. Procedure

The training procedure for our experiments can be summarized as follows:
1) Train the VAE on all training images.
2) Start collecting real rollouts and training the controller.
3) After some number of episodes, start training the environment model M on collected rollouts.
4) Use M to generate synthetic rollouts simultaneously with real rollouts.
5) Continue training the controller using both real and synthetic rollouts.
Algorithm 1: Training procedure for agents.

Require: Pretrained encoder V
  Initialize controller C and environment model E_θ
  Initialize real memory M_R and imaginary memory M_I
  for e = 0 to num_episodes do
      Observe initial state s_0; set s_t = s_0
      while s_t is not terminal do
          Use V to encode s_t into μ_t and σ_t
          Apply action a_t = C(μ_t)
          Observe s_{t+1}, reward r_t, terminal signal d_{t+1}
          Encode s_{t+1} into μ_{t+1} and σ_{t+1}
          Save transition (μ_t, σ_t, a_t, μ_{t+1}, σ_{t+1}, r_t, d_{t+1}) in M_R
          for i = 0 to N_E do
              Train E_θ on minibatch from M_R
          for i = 0 to N_R do
              Train C on minibatch from M_R
          if e ≥ I_start then
              Use E_θ to generate I_B imaginary rollouts of depth I_D
              Save imaginary transitions in M_I
              for i = 0 to N_I do
                  Train C on minibatch from M_I
          s_t = s_{t+1}
The exact training procedure is given in Algorithm 1. In the following, we detail the training procedure for each component of the system and justify our choice of the different parameters.
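To complement Algorithm 1, here is a condensed Python rendering of the same loop. The env, V, C, and M interfaces (reset/step, encode, act/update, update/imagine) are hypothetical stand-ins for the components described above, not the authors' actual API.

    import random
    from collections import deque

    class ReplayMemory:
        """Fixed-capacity FIFO buffer of transitions."""
        def __init__(self, capacity):
            self.buf = deque(maxlen=capacity)
        def add(self, transition):
            self.buf.append(transition)
        def add_many(self, transitions):
            self.buf.extend(transitions)
        def sample(self, batch_size=64):
            return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def train(env, V, C, M, num_episodes, N_E=16, N_R=1, N_I=1,
              I_start=1000, I_B=3, I_D=10):
        M_R, M_I = ReplayMemory(50_000), ReplayMemory(3_000)
        for e in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                mu, sigma = V.encode(s)
                a = C.act(mu)                         # epsilon-greedy action
                s_next, r, done = env.step(a)
                mu2, sigma2 = V.encode(s_next)
                M_R.add((mu, sigma, a, mu2, sigma2, r, done))
                for _ in range(N_E):                  # fit the environment model
                    M.update(M_R.sample(512))
                for _ in range(N_R):                  # controller: real data
                    C.update(M_R.sample())
                if e >= I_start:                      # controller: imaginary data
                    for rollout in M.imagine(C, breadth=I_B, depth=I_D):
                        M_I.add_many(rollout)
                    for _ in range(N_I):
                        C.update(M_I.sample())
                s = s_next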
1) Variational Autoencoder:
To train the VAE, we split the grayscale training images along the horizontal axis into 3 strips, where each strip contains a single box. We then fed the strips into the VAE as 3 channels to help the VAE learn more task-relevant features. The architecture used for the VAE is that used by Ha and Schmidhuber in [5], except that we encode the images into an 8-dimensional latent space. The VAE was trained for 1000 epochs on the 100,000 synthesized training images after scaling them down to a manageable resolution. The VAE is trained to minimize the combined reconstruction and KL losses given by Equation 4. Here, the reconstruction loss is given by the pixel-wise binary cross-entropy. The KL loss was multiplied by a weighting factor β that controls the capacity of the latent information channel. In general, increasing β yields more efficient compression of the inputs and leads to learning independent and disentangled features, at the cost of worse reconstruction [16]. We found β = 4 to produce the best results. The Adam optimizer was used with a learning rate of 0.0005 and a batch size of 2000.
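The strip-splitting preprocessing could be sketched as follows, assuming the three boxes sit side by side in the frame; the exact crop boundaries are an assumption.

    import numpy as np

    def to_strips(image):
        """Split a grayscale frame into 3 equal strips (one box each) and stack
        them as channels before feeding the VAE."""
        w = image.shape[1]
        strips = [image[:, i * w // 3:(i + 1) * w // 3] for i in range(3)]
        return np.stack(strips, axis=-1)              # shape (H, W/3, 3)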
2) Environment Model:
The MDN used to model the dynamics in the environment model learns the posterior probability of the next latent state vector as a Gaussian mixture model with 5 components. The MDN has 3 hidden layers of 256 ReLU units with 50% dropout and 3 parallel output layers for the distribution parameters: one for the mixture coefficients with softmax activation, and one each for the means and the variances, both with linear activation. When collecting transitions, we stored the parameters μ and σ produced by V for each frame, and we sampled from N(μ, σ²) to obtain latent space vectors when constructing a training batch. This form of data augmentation was found to greatly improve the generalization and performance of the model. The accompanying r-network has 3 hidden layers of 512 ReLU units each with 50% dropout, and was trained to minimize the log-cosh loss. The d-network has 2 hidden layers of 256 ReLU units each with 50% dropout, and was trained to minimize the binary cross-entropy. Both networks were trained to predict the corresponding value based on the state alone. During training, the MDN, the r-network, and the d-network were all updated 16 times per timestep on batches of 512 transitions, using the Adam optimizer with a learning rate of 0.001.
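The latent-resampling data augmentation amounts to drawing a fresh sample from each stored posterior whenever a batch is built; a minimal sketch, assuming mu and sigma arrays are stored per transition:

    import numpy as np

    def resample_latents(mu, sigma, rng=np.random):
        """Draw z ~ N(mu, sigma^2) from the stored VAE parameters of each frame,
        so the model sees a slightly different latent vector for the same
        transition every time it is sampled. mu, sigma: (batch, 8) arrays."""
        return mu + sigma * rng.standard_normal(mu.shape)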
3) Controller:
The controller is a DQN consisting of 3 hidden layers (512 ReLU, 256 ReLU, 128 ReLU) and a linear output layer. It was updated once on a batch of 64 real transitions and once on a batch of 64 imaginary transitions each timestep. We found that for such a relatively simple task, updating the controller more often led to worse performance. We also found that using popular DQN extensions like a separate target network or prioritized experience replay did not significantly affect performance. The controller used an $\epsilon$-greedy strategy with an exponentially annealed exploration rate given by $\epsilon = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min})\, e^{-\lambda t}$, where $t$ is the time step. The controller was trained to minimize the MSE loss using an Adam optimizer with a learning rate of 0.001.
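The annealing schedule is a one-liner; a sketch with the constants left as parameters:

    import math

    def epsilon(t, eps_min, eps_max, lam):
        """Exponentially annealed exploration rate for the epsilon-greedy policy."""
        return eps_min + (eps_max - eps_min) * math.exp(-lam * t)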
4) Parameters:
When training the agents, we set the depth of imaginary rollouts $I_D$ to 10, and the breadth $I_B$ to 3. The size of the real memory was 50,000 transitions, and that of the imaginary memory was 3,000. We found that training the controller only on recently generated transitions leads to better performance, since more recent copies of M produce better predictions. We achieve this both by limiting the imaginary memory size and by generating multiple rollouts simultaneously. Furthermore, we found that setting the update rate of the controller on both real and imaginary transitions ($N_R$ and $N_I$ in Algorithm 1) to more than 1 can lead to stability issues. Another parameter we had to tune was the number of episodes to wait before starting to generate imaginary rollouts ($I_{start}$ in Algorithm 1), since M will produce erroneous predictions early on in the training. We found that waiting for about 1000 episodes provides the best results.

C. Results
We compare the performance of agents augmented with imaginary transitions using our approach against a baseline DQN trained only on real transitions. To aid comparison, all hyperparameters and architectural choices were the same for agents augmented with our approach and for the baseline DQN. For a given number of training episodes, we trained 5 agents from scratch and then tested them on the simulated environment for 1000 episodes. We then averaged the percentage of successfully completed episodes of all 5 agents over all test runs.

The agents trained using our approach performed significantly better than the baseline DQN when trained for a small number of episodes, with an increase of 35.9% in performance at 2000 episodes (Fig. 4(a)). The advantage then starts to decline the more episodes the agent is trained for, as the baseline DQN catches up quickly. This is to be expected, since at higher episode counts the agent has collected enough real transitions and no longer needs the extra data generated by the environment model. Table I shows the exact results for this experiment.

We then increased the difficulty of the task while keeping the dynamics the same by additionally requiring the goal state not to have any arrows pointing towards the agent, and ran all the tests again. Results can be seen in Fig. 4(b) and Table II. Augmented agents showed an even greater performance increase compared to the baseline DQN, with up to a 78.5% increase in performance at 2000 training episodes. This shows that the performance increase due to using synthetic transitions is proportional to the difference in complexity between the task itself and the environment dynamics.

Fig. 4. Test results for various numbers of training episodes, (a) for the original task, and (b) for the more difficult variation of the task. (c) shows the performance increase in both tasks. Error bars represent standard deviations.

TABLE I. Mean percentage of successful test episodes for various numbers of training episodes for the original task. Std. deviations are given in parentheses.

Episodes | Base DQN     | Augmented    | % increase
2000     | 42.18 (6.01) | 57.34 (6.37) | 35.94
3000     | 62.88 (4.91) | 81.4 (2.95)  | 29.45
4000     | 81.44 (3.13) | 91.88 (2.38) | 12.81
5000     | 88.22 (3.07) | 95.1 (3.76)  | 7.79
6000     | 92.16 (2.57) | 96.96 (2.1)  | 5.2

TABLE II. Mean percentage of successful test episodes for various numbers of training episodes for the difficult variation of the task. Std. deviations are given in parentheses.
Generating Plans:
One of the advantages of learning an environment model is that it allows a trained agent to produce entire plans given only the initial state (note that this is only possible for environments with deterministic underlying dynamics).
This can be achieved by initializing the environment model with the initial state, and then generating an imaginary rollout in which the controller always chooses the optimal action for each state. To demonstrate this, we deployed a controller and an environment model on the Sawyer robotic arm (Fig. 1), where both networks had been previously trained for 6000 episodes using the training method described previously. Afterwards, we ran experiments to evaluate the planning capabilities of the system. Each experiment began by setting the cubes to a random state, with the experimenter pointing to a random cube. Then, we let the robot observe the configuration with the camera, and asked it to produce a plan consisting of a trajectory of actions to solve the task in its original form. The robot can execute the plan by selecting successively from a set of pre-programmed point-to-point movements to rotate the boxes. Out of 20 test runs, the robot successfully solved the task 17 times. The correct generated plans varied in length from 1 to 5, depending on the initial state. Moreover, the generated plans for all successful runs were optimal, containing only the fewest possible actions required to solve the task. Fig. 5 shows an example of an imaginary rollout according to an optimal plan of length 5 as generated by the agent.
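A sketch of this greedy open-loop planning, assuming a hypothetical mdn_mean(z, a) that returns the most likely next state (e.g. the mean of the highest-weighted mixture component):

    def generate_plan(z0, controller, mdn_mean, d_net, max_len=10):
        """Roll the learned model forward from an initial latent state, always
        taking the greedy (optimal) action, and return the action sequence."""
        plan, z = [], z0
        for _ in range(max_len):
            a = controller(z)                      # greedy action in imagined state
            plan.append(a)
            z = mdn_mean(z, a)                     # most likely next state
            if d_net(z):                           # imagined terminal state
                break
        return plan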
Fig. 5. An example of an imaginary rollout of length 5. (a) is the initial state as observed by the robot. (b) through (f) are imagined next states after successively applying actions in the optimal plan. The visualizations of the model predictions were obtained by mapping the latent space vectors to images via the decoder part of the VAE.

Model Generalization:
One of the interesting results we noticed is that the model showed some generalization capabilities to transitions it had not experienced before. Since episodes always terminated after encountering a terminal state, the model never experienced any transitions from this kind of state. To test model generalization, we deliberately set the model state to a random terminal state 20 times, and then asked it to predict the next state for a random action each time. A model trained for 5000 episodes was able to correctly predict the next state 75% of the time. Fig. 6 shows an example of model prediction for unseen transitions.

Fig. 6. An example of model prediction for unseen transitions. The action selected here is to rotate the rightmost cube clockwise. (a) is the state before the action, and (b) is the state after.
D. Discussion
One of the main challenges in learning a model online is avoiding overfitting on the small subset of data that is made available early in the training. A model can easily get stuck in a local minimum if it is trained excessively on initial data, and then fail to converge to an acceptable loss value in a reasonable amount of time as more data become available. We avoid this through three measures. First, we limit the model capacity by deliberately choosing smaller model sizes. Second, we adopt a probabilistic approach to encoding latent space representations and modeling environment dynamics. Third, we employ high dropout rates in the models. We also found that selecting an unnecessarily large latent space dimensionality leads to worse models. (When trained online, high-capacity models often exhibited a behaviour reminiscent of the Dunning-Kruger effect: they would achieve a very low loss value early in the training, which would quickly rise as more data are acquired, before eventually settling at a value in between.)

Probabilistic models are also much more robust, which is essential when using the dynamics model in closed loop to generate rollouts. Traditional models based on point estimates will produce some error in prediction, which will quickly compound, resulting in completely erroneous predictions sometimes as early as the second pass. This of course makes using imaginary rollouts detrimental to learning.

The ability to learn stochastic models can be useful even for environments whose underlying dynamics are deterministic. An environment with deterministic underlying dynamics can have stochastic observable dynamics, since each latent state of the environment can produce multiple observable states. For example, the task we used for the experiments has deterministic underlying dynamics, since the configuration of the arrows will always change in the same way in response to a certain action. However, the observable state will change stochastically: the positions of the boxes or the hand may differ for the same configuration. The agent has no knowledge of the underlying dynamics since it only has access to observable states. Therefore, it needs to be able to model the observable dynamics stochastically in order to produce realistic imaginary rollouts.

The generalization capabilities of the dynamics model can in principle be used to facilitate learning other similar tasks. The two variations of the task we used for the experiments share the exact same dynamics; they differ only in the definition of the reward functions. Indeed, for any given dynamics, an arbitrarily large family of tasks can be defined by specifying different reward functions. If learning the reward function can be separated from learning the dynamics, and assuming that the former is easier to learn than the latter, then learning new tasks in the same family will become much faster once the agent learns a dynamics model. However, this is left for future work.

VI. CONCLUSION
In this paper we presented an architecture that allows an agent to learn a model of stochastic environments in a reinforcement learning setting. This allows the agent to significantly reduce the number of interactions it needs to make with the actual environment, which is especially useful for tasks involving real robots, where collecting real data can be prohibitively expensive. Furthermore, the ability to model stochastic environments makes this approach well-suited for tasks involving interaction with humans, as their actions usually cannot be predicted with certainty. We provided a detailed algorithm describing how to train both the agent and the environment model simultaneously, and how to use synthetic data in conjunction with real data. We validated our approach on a high-level robotic task in which an agent has to simultaneously interpret a human gesture and solve a puzzle based on it. Results show that agents augmented with synthetic data outperform baseline methods, especially in situations where only limited interaction data with the environment are available.

In future work, we will include recurrent models (such as LSTMs) in our architecture to handle environments with non-Markovian state representations. Furthermore, we will experiment with building environment models that can capture multi-modal dynamics, allowing agents to make use of acoustic information, for instance. Another important extension is to include a measure of uncertainty to limit model usage when its output is erroneous; the simplest way to achieve this is by using model ensembles. We will also incorporate different ways of leveraging synthetic data to improve data efficiency even further, such as using imaginary rollouts to compute improved targets and to predict future outcomes directly. Finally, we will investigate using programming by demonstration techniques [17] to bootstrap agents, further decreasing the number of interactions the robot has to make with the environment.

ACKNOWLEDGMENT
This project has received funding from the European Union's Horizon 2020 framework programme for research and innovation under the Marie Sklodowska-Curie Grant Agreement No. 642667 (SECURE).
REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[2] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[3] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al., "Imagination-augmented agents for deep reinforcement learning," in Advances in Neural Information Processing Systems, 2017, pp. 5690–5701.
[4] G. Kalweit and J. Boedecker, "Uncertainty-driven imagination for continuous deep reinforcement learning," in Conference on Robot Learning, 2017, pp. 195–206.
[5] D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," in Advances in Neural Information Processing Systems, 2018, pp. 2455–2467.
[6] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, "Model-based value estimation for efficient model-free reinforcement learning," arXiv preprint arXiv:1803.00101, 2018.
[7] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, "Sample-efficient reinforcement learning with stochastic ensemble value expansion," in Advances in Neural Information Processing Systems, 2018, pp. 8234–8244.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[9] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389–3396.
[10] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," arXiv preprint arXiv:1704.03073, 2017.
[11] A. H. Qureshi, Y. Nakamura, Y. Yoshikawa, and H. Ishiguro, "Robot gains social intelligence through multimodal deep reinforcement learning," in IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2016, pp. 745–751.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[13] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations, 2014.
[14] C. M. Bishop, "Mixture density networks," Citeseer, Tech. Rep., 1994.
[15] K. Lee, Y. Su, T.-K. Kim, and Y. Demiris, "A syntactic approach to robot imitation learning using probabilistic activity grammars," Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1323–1334, 2013.
[16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," in International Conference on Learning Representations, 2017.
[17] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, "Robot programming by demonstration," in Springer Handbook of Robotics. Springer, 2008.