Flatland: a Lightweight First-Person 2-D Environment for Reinforcement Learning
Hugo Caselles-Dupré ∗†, Louis Annabi ∗, Oksana Hagen ∗‡, Michael Garcia-Ortiz ∗, David Filliat †
∗ AI Lab (Softbank Robotics Europe)   † Flowers Laboratory (ENSTA ParisTech & INRIA)   ‡ Centre for Robotics and Neural Systems (Plymouth University)
Emails: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract — Flatland is a simple, lightweight environment for fast prototyping and testing of reinforcement learning agents. It is of lower complexity compared to similar 3D platforms (e.g. DeepMind Lab or VizDoom), but emulates physical properties of the real world, such as continuity, multi-modal partially-observable states with first-person view and coherent physics. We propose to use it as an intermediary benchmark for problems related to Lifelong Learning.
Flatland is highly customizable and offers a wide range of task difficulty to extensively evaluate the properties of artificial agents. We experiment with three reinforcement learning baseline agents and show that they can rapidly solve a navigation task in Flatland. A video of an agent acting in Flatland is available here: https://youtu.be/I5y6Y2ZypdA.
I. INTRODUCTION
A key goal of artificial intelligence research is to design agents capable of Lifelong Learning, which is the continued learning of tasks, from one or more domains, over the course of a lifetime of an artificial agent. Emulation of a real-life sensorimotor experience is an important aspect of Lifelong Learning. Such emulation enables building agents that are capable of acquiring common sense knowledge in the form of learning naive physics [12] or by building models of sensorimotor interactions with the environment [13], [14].

With this goal in mind, several complex 3-D partially-observable environments have been developed, such as DeepMind Lab [7], VizDoom [8] or Malmo [9]. They are used for testing Reinforcement Learning (RL) agents on tasks and scenarios that require advanced capabilities in terms of perception, planning, and representation of space for navigation-related tasks, which paves the way for building agents capable of behaving in a real-life scenario. These environments are suitable for Lifelong Learning research since they emulate key features of real-life tasks such as first-person view, partial observability and coherent physics. However, we argue that these benchmarks have limitations (see Sec. II). The richness of sensors and environments in current 3-D immersive simulations does not allow for fast experiments and prototyping. The environments used for prototyping, e.g. grid-worlds, do not have the features required for building agents capable of behaving in real-life scenarios, such as coherent physics or continuous state and action spaces. There is a need for a middle-ground benchmark for testing artificial agents, which preserves the complexity of the tasks while reducing the complexity of the sensory space.

We propose
Flatland, a 2-D first-person view environment where an agent can move and interact with different elements of the environment, constrained by simple physical laws (see Sec. III). All the obstacles, objects and agents may have different physical properties and visual appearances. With a two-dimensional world, we reduce the complexity of the environment while preserving key features of the physical world, such as first-person view, partial observability and coherent physics. Our environment allows us to conduct fast RL experiments, as we demonstrate on a navigation task (see Sec. IV). We show that three baseline RL algorithms solve this task two orders of magnitude faster than a similar experiment [4] in VizDoom. Faster experiments allow researchers to perform extensive testing which, as pointed out in [1], is crucial for reproducibility and establishing the significance of the results.

We plan on releasing
Flatland as open-source software. The software is easily expandable and is compatible with the OpenAI Gym RL API [2]. Future versions of the simulator will be guided by feedback and requests from the community.
II. RELATED WORK
To experiment with and develop new artificial agents, it is common to use grid-world simulations [17], [18], [23] as they are easy to manipulate, fast to train on, and easily comprehensible for an external observer. However, grid-world environments often assume a perfect knowledge of the agent's state. This simplified situation does not tie in with real-life scenarios and makes it impossible to tackle issues related to the grounding of perception or the emergence of common sense knowledge [19].

The Atari suite [6] provides environments where the agent perceives its state through 2D images, and has a small set of discrete actions that it can perform to play a game. These environments are more complex than grid-worlds and allow the development of new algorithms for control and planning. Still, they are fully observable and thus do not correspond to real-life scenarios.

Recent simulators generate environments closer to real-life experience. They allow testing approaches without having to deal with the problems of robots in the real world: hardware failure, cost of experiments, time of experiments, scalability. For example, RL benchmarks such as MuJoCo [10] focus on the building of control policies for robots. In these simulators, the agent performs complex continuous actions in a realistic physical world in order to attain a certain position or a certain speed. These benchmarks helped develop new control algorithms for, e.g., bipedal walking.

To simulate real-life scenarios, the appropriate simulators are partially-observable and first-person 3-D environments (such as DeepMind Lab or VizDoom). They allow tackling the challenges related to navigation and planning with a first-person sensory input. In these environments, the richness of sensors and actions (2D images, continuous actions), as well as task complexity (need for memory and long-term prediction), make the computation required to train algorithms prohibitive for most research teams. The computational cost of experiments on realistic environments does not facilitate principled statistical analysis of the proposed approaches. The same argument can be made for task complexity, as it becomes impossible to test a large variety of tasks using these environments.

We believe that a first-person, partially observable environment with an intermediary level of sensory richness and tasks of increasing complexity would help the community to test and evaluate new approaches that are ultimately designed to be applied for acting in real-life scenarios.

III. Flatland: A SIMULATOR

Fig. 1.
Flatland's Python API. It is identical to the OpenAI Gym [2] API, so that researchers used to the latter can use the former easily.
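As an illustrative sketch of the Gym-style loop that Fig. 1 describes, an interaction with a Flatland environment could look as follows; the module name, environment id and configuration argument below are assumptions made for the example, not the released API:

```python
# Illustrative sketch only: a Gym-style interaction loop as described in Fig. 1.
# The module name "flatland", the make() signature and the config path are
# hypothetical; the actual API is defined by the released software.
import flatland  # hypothetical package name

env = flatland.make("navigation-v0", config_file="rooms.yaml")  # hypothetical id and config

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy, for illustration
    obs, reward, done, info = env.step(action)   # standard Gym step signature
    total_reward += reward
env.close()
print(total_reward)
```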
Flatland is a first-person 2-D game platform designed for research and development of autonomous artificial agents.
Flatland can be used to study how artificial agents may learn tasks in partially or fully observable diverse worlds. The game engine is based on the Pymunk and Pygame Python libraries and is inspired by a blog post from Matt Harvey [3]. The tasks are inspired by the DeepMind Lab environment [7].

We design
Flatland for fast prototyping and testing of ideas related to Lifelong Learning, such as acquiring common sense knowledge, which is built upon learning about regularities in the sensorimotor information available to the agent [15], and constructing sensory representations as well as a model of the physics of the environment. RL provides a convenient framework for the evaluation of autonomous open-ended learning, and can be used to quantify and compare different artificial agents. Therefore,
Flatland is designed to be used in the context of RL algorithms.

The simulator creates a rectangular environment with several physical objects with different physical properties and textures. Environments are composed of rooms and corridors which are delimited by walls or doors and may contain obstacles and items that can be picked up to obtain (positive or negative) rewards. All these variables can be specified by the user so that tasks of a wide range of difficulty can be considered. Environments are generated based on a human-readable configuration file.

Regarding state spaces, the agent perceives through one or several first-person sensors which provide, for example, information about the color or distance of elements of its environment (e.g. depth and RGB sensors), or a top-down view of an area around the agent. In the experiments presented hereafter, the agent has one sensory input corresponding to colors with an intensity depending on distance (see Fig. 2).

Concerning action spaces, the agent can act discretely (go forward, turn left, turn right for navigation, for instance) or continuously (one continuous dimension for each movement). Additionally, the simulator allows us to easily attach body parts (arms and hooks) to the agent if we want to implement more complex tasks. Games can be composed of more than one agent, which can be interesting to study the emergence of collaborative or social behaviours in multi-agent systems.

We ensure compatibility with the OpenAI Gym RL API by defining the same API, as illustrated in a minimal example in Fig. 1.
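As a purely hypothetical illustration of the kind of parameters such a human-readable configuration file could expose, consider the following Python dictionary; none of the key names or values are taken from the actual simulator, they only illustrate describing rooms, obstacles and rewarded items in a declarative way:

```python
# Hypothetical sketch of the parameters a Flatland configuration might expose.
# Every key name below is an assumption made for illustration only.
example_config = {
    "environment": {"width": 400, "height": 400, "rooms": 1},
    "agent": {
        "sensor": {"type": "rgb-distance", "resolution": 64},   # 1-D image of 64 pixels
        "actions": ["forward", "turn_left", "turn_right"],
    },
    "obstacles": [
        {"shape": "circle", "radius": 20, "color": "blue"},
        {"shape": "rectangle", "size": [30, 60], "color": "green"},
    ],
    "items": [
        {"name": "fruit",  "color": "orange", "reward": 10,  "count": 20},
        {"name": "poison", "color": "purple", "reward": -10, "count": 20},  # illustrative value
    ],
}
```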
IV. EXPERIMENTS
We conduct preliminary experiments to evaluate the potential of Flatland as a fast prototyping platform for learning agents.
A. A test navigation task
Fig. 2. Experimental setup. The agent's task is to collect orange "fruits" (+10 reward) while avoiding purple "poisons" (negative reward). The agent's inputs are 1-D images, which are extended to 2-D images here for visualization purposes.

For this task we have defined an environment that consists of a room with 4 obstacles of different shapes and colors, and two types of objects that can be picked up. The agent can collect orange "fruits" (+10 reward) or purple "poisons" (negative reward) by touching them. No other reward signal is provided.

Fig. 3. Evidence of non-trivial behavior of DQN. The agent carefully navigates in the environment in order to collect "fruits" while avoiding "poisons". Best viewed in color.

The goal of the agent is to find a strategy that gets a maximum reward in a fixed number of timesteps. It receives as input a 1-D image corresponding to what the agent sees in front of it, which is processed using 1-D convolutional layers. This task is similar to the one proposed for the evaluation in [4] and [11]. The agent's action space is discrete and composed of 3 actions: move forward, rotate left and rotate right. The task, inputs and actions are illustrated in Figure 2.

B. Baseline models
We propose to test our simulator by solving the task described in Sec. IV-A using 3 baselines that were previously evaluated on first-person view navigation tasks: Deep Q-Network (DQN) [5] and Asynchronous Advantage Actor-Critic (A3C) [11], two model-free RL baselines, and Direct Future Prediction (DFP) [4], an algorithm that learns to act by predicting features of the environment. The goal of this experiment is not to compare merits and weaknesses of the three evaluated methods. We rather show that three RL baselines perform well on a navigation task in
Flatland, and converge faster compared to complex 3D environments.

DQN [5] combines Q-learning [20] with experience replay [21] and a deep neural network. For a given state s, DQN outputs a vector of action values Q(s, ·, θ), where θ are the parameters of the network. Policies are derived using an ε-greedy approach w.r.t. Q. DQN was tested on a set of Atari 2600 games [6], reaching human-level performance on many games. It has thus become a commonly used baseline for RL problems.

In A3C [11], many instances of the agent interact in parallel with many instances of the environment, which both accelerates and stabilizes learning. The A3C algorithm constructs an approximation to both the policy π(a|s, θ) and the value function V(s, θ_v), using parameters of the actor θ and parameters of the critic θ_v. Both policy and value are adjusted towards an n-step lookahead value, using an entropy regularization penalty.

In DFP [4], the reinforcement learning problem is reformulated as a supervised learning problem. The agent predicts high-level features at multiple timescales (here: score, number of "fruits" picked up and number of "poisons" picked up), which are combined linearly to form an objective function. Then, policies are derived using actions that maximize this objective. This algorithm won, by a large margin, the VizDoom 2016 competition.
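For reference, the quantities these three baselines optimize can be written compactly in their standard forms from the cited papers (the notation below is ours, not taken from this text):

```latex
% DQN one-step target (Q-learning with experience replay and target parameters \theta^-):
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)

% A3C n-step advantage used to update the actor \theta and the critic \theta_v:
A_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V(s_{t+n}; \theta_v) - V(s_t; \theta_v)

% DFP action selection: predicted future measurements \hat{f} weighted by a goal vector g:
a_t = \arg\max_{a} \; g^{\top} \hat{f}(s_t, a; \theta)
```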
C. Experimental setup

We use available architectures for each method, and replace 2-D convolutions by 1-D convolutions. In the Appendix, we provide all hyperparameters and references to the implementations used. Each method is trained for a fixed number of episodes using the Adam optimizer [16]. Reported results and figures show results averaged over multiple random seeds, following the recommendation in [1] of evaluating on a large enough sample of random seeds to ensure significant results.

V. RESULTS
Training results are reported in Figure 4. We report the average reward as well as confidence intervals computed with a standard t-test for the mean, as recommended in [22].

All three methods rapidly converge to a policy that successfully navigates in the environment to pick up "fruits" while avoiding "poisons". Agents learn to navigate efficiently. This is illustrated in Fig. 3 and in the accompanying video (linked in the abstract). The policies are completely reactive since none of the methods incorporates memory.
Fig. 4. Training curves for DQN, A3C and DFP (reward as a function of the number of episodes). Means and 95% confidence intervals are computed using independent runs.

The best scoring agents achieved scores close to that obtained by humans after playing several games. Compared with a similar navigation task in the experiments of [4], where DQN, DFP and A3C need far more interaction to converge, convergence here is reached approximately two orders of magnitude faster in terms of timesteps. Since the task complexity is similar, these results confirm that faster convergence can be obtained using a lower-dimensional sensory stream. This faster convergence is two-fold. First, the 1-D image sensors contain less rich information than 2-D images, so training is faster in terms of training steps. Second, it allows for a reduced number of parameters in the neural network (1-D instead of 2-D convolutional layers), hence faster optimization of the neural network in terms of wall-clock time. Overall this allows for experiments at a reduced computational cost. For instance, training DQN took less than 1 hour on an Intel i7-7700K CPU.
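For completeness, the confidence intervals in Fig. 4 follow the standard Student-t form for the mean over n independent runs, i.e. the textbook formula from [22]:

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}, \qquad \alpha = 0.05 \text{ for a 95\% interval,}
```

where x̄ is the mean reward over runs, s the sample standard deviation and n the number of runs.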
VI. CONCLUSION
We propose Flatland, a lightweight, 2-D, partially-observable, highly flexible environment for testing and evaluation of reinforcement learning agents, in the context of problems related to Lifelong Learning. This simulation preserves the main characteristics of the tasks that agents face in real-life environments, such as partial observability, coherent physics and a first-person view. However, contrary to commonly used complex 3-D benchmarks with the same properties, our environment has a much lower-dimensional sensory stream. This enables fast prototyping and extensive experiments at low cost, as we show by experimenting with three RL baselines on a navigation task. On this task, convergence is achieved two orders of magnitude faster, in terms of number of training steps, compared to a similar experiment on VizDoom.
Flatland is designed as a framework for the evaluation of the performance of RL agents, and features the OpenAI Gym API.

The version of
Flatland presented in this paper is preliminary. Future work on Flatland includes adding a head and arms to agents, so that they are able to manipulate the environment by moving objects. This would allow us to define more complex tasks, which could potentially require reasoning and memory to be solved. Another direction of future work includes adding more diverse sensors (e.g. touch, smell) and more actions (e.g. grab objects, open doors). This would allow us to implement multi-modal learning, which is a promising direction in artificial intelligence research. The space of possible tasks could be expanded by adding more features, such as defining tasks with delayed rewards. Finally, we also intend to enable the ability to add multiple agents to the simulation, so that this environment can be used to study multi-agent interactions. Most importantly, this benchmark should evolve according to the community's needs in terms of the environment's nature and characteristics.
REFERENCES

[1] Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina & Meger, David (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
[2] Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie & Zaremba, Wojciech (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[3] Harvey, Matt. Matt Harvey's self-driving car project, https://github.com/harvitronix/reinforcement-learning-car. Online; accessed 28-05-2018.
[4] Dosovitskiy, Alexey & Koltun, Vladlen (2016). Learning to act by predicting the future. arXiv preprint arXiv:1611.01779.
[5] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg & others (2015). Human-level control through deep reinforcement learning. Nature, 518, 529.
[6] Bellemare, Marc G, Naddaf, Yavar, Veness, Joel & Bowling, Michael (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
[7] Beattie, Charles, Leibo, Joel Z, Teplyashin, Denis, Ward, Tom, Wainwright, Marcus, Kuttler, Heinrich, Lefrancq, Andrew, Green, Simon, Valdes, Victor, Sadik, Amir & others (2016). DeepMind Lab. arXiv preprint arXiv:1612.03801.
[8] Kempka, Michal, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub & Jaskowski, Wojciech (2016). ViZDoom: A Doom-based AI research platform for visual reinforcement learning. Computational Intelligence and Games (CIG), 2016 IEEE Conference, 1-8.
[9] Johnson, Matthew, Hofmann, Katja, Hutton, Tim & Bignell, David (2016). The Malmo Platform for Artificial Intelligence Experimentation. IJCAI.
[10] Todorov, Emanuel, Erez, Tom & Tassa, Yuval (2012). MuJoCo: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5026-5033.
[11] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David & Kavukcuoglu, Koray (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, 1928-1937.
[12] Agrawal, Pulkit, Nair, Ashvin, Abbeel, Pieter, Malik, Jitendra & Levine, Sergey (2016). Learning to Poke by Poking: Experiential Learning of Intuitive Physics. CoRR, abs/1606.07419.
[13] Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122.
[14] Wayne, Greg et al. (2018). Unsupervised Predictive Memory in a Goal-Directed Agent. CoRR, abs/1803.10760.
[15] Friston, K. J. (2009). The free-energy principle: a rough guide to the brain?. Trends in Cognitive Sciences, 13(7), 293-301.
[16] Kingma, Diederik P & Ba, Jimmy (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[17] Barreto, Andre, Dabney, Will, Munos, Rémi, Hunt, Jonathan J, Schaul, Tom, van Hasselt, Hado P & Silver, David (2017). Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 4055-4065.
[18] Sutton, Richard S & Barto, Andrew G (1998). Reinforcement learning: An introduction. MIT Press, Cambridge.
[19] Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B & Gershman, Samuel J (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
[20] Watkins, Christopher John Cornish Hellaby (1989). Learning from Delayed Rewards. PhD Thesis.
[21] Lin, Long-Ji (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293-321.
[22] Hogg, Robert Vincent & Tanis, Elliot A (2009). Probability and statistical inference. Pearson Educational International.
[23] Teh, Yee, Bapst, Victor, Czarnecki, Wojciech M, Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas & Pascanu, Razvan (2017). Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, 4499-4509.

APPENDIX
We present implementation details for each of the three RL baselines that we experiment with (see Sec. IV of the main paper).
A. Deep Q-Network

• Implementation: Keras-rl (https://github.com/keras-rl/keras-rl)
• Normalization of inputs
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• Policy: Boltzmann policy (softmax) with temperature 1
• Soft updates for the Q-learning parameters: 0.01
• Replay buffer size: 500000
• Architecture: Input (shape = (64, 3)) - Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D - Dense layer (3) - Output (shape = 3); a sketch of this network is given after the A3C settings below
• Discount factor γ = 0.

B. Asynchronous Advantage Actor-Critic

• Implementation: https://github.com/openai/universe-starter-agent
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• t_max = 20 and I_update = 20
• Entropy regularization with weight β = 0.
• Discount factor γ = 0.
• No action repeat: execute an action on every frame (action repeat = 1)
• Architecture: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1); we have substituted the LSTM with a fully connected layer of dimension 128 to make the policy reactive
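As a rough sketch of the 1-D convolutional network listed above for DQN, the architecture could be written in Keras as follows. Only the layer types and sizes come from the list; the ReLU activations, the Flatten layer and the default pooling size are assumptions:

```python
# Hedged sketch of the DQN architecture listed above (Keras).
# Activations, Flatten and the pooling size are assumptions; layer types
# and sizes follow the appendix.
from tensorflow.keras import layers, models

def build_dqn_network(n_actions=3, input_shape=(64, 3)):
    # input_shape: 1-D image of 64 pixels with 3 color channels
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(32, kernel_size=8, strides=1, activation="relu"),
        layers.Conv1D(48, kernel_size=4, strides=1, activation="relu"),
        layers.Conv1D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling1D(),
        layers.Flatten(),
        layers.Dense(n_actions),   # one Q-value per action (forward, turn left, turn right)
    ])
    return model

model = build_dqn_network()
model.summary()
```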
C. Direct Future Prediction

• Implementation: https://github.com/flyyufelix/Direct-Future-Prediction-Keras
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• Measurements used: score, number of fruits picked up, number of poisons picked up
• Goal vector: [1, 1, -1]
• Normalization of inputs and measurements
• Training interval: 3 timesteps
• Policy: ε-greedy with respect to the defined goal
• Linear annealing of ε from 1 to 0.0001
• Replay buffer size: 20000
• Architecture: we only modify the convolutional part with: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D; the rest is unchanged
• Discount factor γ = 0.99
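To make the role of the goal vector [1, 1, -1] concrete, a minimal sketch of DFP-style action selection is given below; predict_measurements is a hypothetical stand-in for the trained network, not the authors' implementation:

```python
# Hedged sketch of DFP-style action selection with the goal vector above.
# predict_measurements() is a placeholder for the trained DFP network, which
# would return predicted future (score, fruits, poisons) for each action.
import numpy as np

GOAL = np.array([1.0, 1.0, -1.0])   # weights for (score, fruits picked up, poisons picked up)
ACTIONS = ["forward", "turn_left", "turn_right"]

def predict_measurements(observation, action):
    # Placeholder: in DFP this is the network's prediction of future measurements
    # (over the chosen temporal offsets) given the observation and candidate action.
    return np.zeros(3)

def select_action(observation):
    # Pick the action whose predicted measurements best align with the goal vector.
    utilities = [GOAL @ predict_measurements(observation, a) for a in ACTIONS]
    return ACTIONS[int(np.argmax(utilities))]
```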