Flatland: a Lightweight First-Person 2-D Environment for Reinforcement Learning
Hugo Caselles-Dupré ∗†, Louis Annabi ∗, Oksana Hagen ∗‡, Michael Garcia-Ortiz ∗, David Filliat †
∗ AI Lab (Softbank Robotics Europe)   † Flowers Laboratory (ENSTA ParisTech & INRIA)   ‡ Centre for Robotics and Neural Systems (Plymouth University)
Emails: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract — Flatland is a simple, lightweight environment for fast prototyping and testing of reinforcement learning agents. It is of lower complexity compared to similar 3D platforms (e.g. DeepMind Lab or VizDoom), but emulates physical properties of the real world, such as continuity, multi-modal partially-observable states with first-person view and coherent physics. We propose to use it as an intermediary benchmark for problems related to Lifelong Learning.
Flatland is highly customizable and offers a wide range of task difficulty to extensively evaluate the properties of artificial agents. We experiment with three reinforcement learning baseline agents and show that they can rapidly solve a navigation task in Flatland. A video of an agent acting in Flatland is available here: https://youtu.be/I5y6Y2ZypdA.
I. INTRODUCTION
A key goal of artificial intelligence research is to design agents capable of Lifelong Learning, which is the continued learning of tasks, from one or more domains, over the course of a lifetime of an artificial agent. Emulation of a real-life sensorimotor experience is an important aspect of Lifelong Learning. Such emulation enables building agents that are capable of acquiring common sense knowledge in the form of learning naive physics [12] or by building models of sensorimotor interactions with the environment [13], [14].

With this goal in mind, several complex 3-D partially-observable environments have been developed, such as DeepMind Lab [7], VizDoom [8] or Malmo [9]. They are used for testing Reinforcement Learning (RL) agents on tasks and scenarios that require advanced capabilities in terms of perception, planning, and representation of space for navigation-related tasks, which paves the way for building agents capable of behaving in a real-life scenario. These environments are suitable for Lifelong Learning research since they emulate key features of real-life tasks such as first-person view, partial observability and coherent physics. However, we argue that these benchmarks have limitations (see Sec. II). The richness of sensors and environments in current 3-D immersive simulations does not allow for fast experiments and prototyping. The environments used for prototyping, e.g. grid-worlds, do not have the features required for building agents capable of behaving in real-life scenarios, such as coherent physics or continuous state and action spaces. There is a need for a middle-ground benchmark for testing artificial agents, which preserves the complexity of the tasks while reducing the complexity of the sensory space.

We propose
Flatland, a 2-D first-person view environment where an agent can move and interact with different elements of the environment, constrained by simple physical laws (see Sec. III). All the obstacles, objects and agents may have different physical properties and visual appearances. With a two-dimensional world, we reduce the complexity of the environment while preserving key features of the physical world, such as first-person view, partial observability and coherent physics. Our environment allows us to conduct fast RL experiments, as we demonstrate on a navigation task (see Sec. IV). We show that three baseline RL algorithms solve this task two orders of magnitude faster than a similar experiment [4] in VizDoom. Faster experiments allow researchers to perform extensive testing which, as pointed out in [1], is crucial for reproducibility and establishing the significance of the results.

We plan on releasing
Flatland as open-source software. The software is easily expandable and is compatible with the OpenAI Gym RL API [2]. Future versions of the simulator will be guided by feedback and requests from the community.
II. RELATED WORK
To experiment with and develop new artificial agents, it is common to use grid-world simulations [17], [18], [23] as they are easy to manipulate, fast to train on, and easily comprehensible for an external observer. However, grid-world environments often assume a perfect knowledge of the agent's state. This simplified situation does not tie in with real-life scenarios and makes it impossible to tackle issues related to the grounding of perception or the emergence of common sense knowledge [19].

The Atari suite [6] provides environments where the agent perceives its state through 2D images, and has a small set of discrete actions that it can perform to play a game. These environments are more complex than grid-worlds and allow the development of new algorithms for control and planning. Still, they are fully observable and thus do not correspond to real-life scenarios.

Recent simulators generate environments closer to real-life experience. They allow testing approaches without having to deal with the problems of robots in the real world: hardware failure, cost of experiments, time of experiments, scalability. For example, RL benchmarks such as MuJoCo [10] focus on the building of control policies for robots. In these simulators, the agent performs complex continuous actions in a realistic physical world in order to attain a certain position or a certain speed. These benchmarks helped develop new control algorithms for, e.g., bipedal walking.

To simulate real-life scenarios, the appropriate simulators are partially-observable and first-person 3-D environments (such as DeepMind Lab or VizDoom). They allow tackling the challenges related to navigation and planning with a first-person sensory input. In these environments, the richness of sensors and actions (2D images, continuous actions), as well as task complexity (need for memory and long-term prediction), make the computation required to train algorithms prohibitive for most research teams. The computational cost of experiments on realistic environments does not facilitate principled statistical analysis of the proposed approaches. The same argument can be made for task complexity, as it becomes impossible to test a large variety of tasks using these environments.

We believe that a first-person, partially observable environment with an intermediary level of sensory richness and tasks of increasing complexity would help the community to test and evaluate new approaches that are ultimately designed to be applied for acting in real-life scenarios.

III. Flatland: A SIMULATOR

Fig. 1.
Flatland's Python API. It is identical to the OpenAI Gym [2] API, so that researchers used to the latter can use the former easily.
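As an illustrative sketch of the Gym-style loop that Fig. 1 describes, an interaction with a Flatland environment could look as follows; the module name, environment id and configuration argument below are assumptions made for the example, not the released API:

```python
# Illustrative sketch only: a Gym-style interaction loop as described in Fig. 1.
# The module name "flatland", the make() signature and the config path are
# hypothetical; the actual API is defined by the released software.
import flatland  # hypothetical package name

env = flatland.make("navigation-v0", config_file="rooms.yaml")  # hypothetical id and config

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()           # random policy, for illustration
    obs, reward, done, info = env.step(action)   # standard Gym step signature
    total_reward += reward
env.close()
print(total_reward)
```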
Flatland is a first-person 2-D game platform designed for research and development of autonomous artificial agents.
Flatland can be used to study how artificial agents may learn tasks in partially or fully observable diverse worlds. The game engine is based on the Pymunk and Pygame Python libraries and is inspired by a blog post from Matt Harvey [3]. The tasks are inspired by the DeepMind Lab environment [7].

We design
Flatland for fast prototyping and testing of ideas related to Lifelong Learning, such as acquiring common sense knowledge, which is built upon learning about regularities in the sensorimotor information available to the agent [15], and constructing sensory representations as well as a model of the physics of the environment. RL provides a convenient framework for the evaluation of autonomous open-ended learning, and can be used to quantify and compare different artificial agents. Therefore,
Flatland is designed to be used in the context of RL algorithms.

The simulator creates a rectangular environment with several physical objects with different physical properties and textures. Environments are composed of rooms and corridors which are delimited by walls or doors and may contain obstacles and items that can be picked up to obtain (positive or negative) rewards. All these variables can be specified by the user so that tasks of a wide range of difficulty can be considered. Environments are generated based on a human-readable configuration file.

Regarding state spaces, the agent perceives through one or several first-person sensors which provide, for example, information about the color or distance of elements of its environment (e.g. depth and RGB sensors), or a top-down view of an area around the agent. In the experiments presented hereafter, the agent has one sensory input corresponding to colors with an intensity depending on distance (see Fig. 2).

Concerning action spaces, the agent can act discretely (go forward, turn left, turn right for navigation, for instance) or continuously (one continuous dimension for each movement). Additionally, the simulator allows us to easily attach body parts (arms and hooks) to the agent if we want to implement more complex tasks. Games can be composed of more than one agent, which can be interesting to study the emergence of collaborative or social behaviours in multi-agent systems.

We ensure compatibility with the OpenAI Gym RL API by defining the same API, as illustrated in a minimal example in Fig. 1.
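As a purely hypothetical illustration of the kind of parameters such a human-readable configuration file could expose, consider the following Python dictionary; none of the key names or values are taken from the actual simulator, they only illustrate describing rooms, obstacles and rewarded items in a declarative way:

```python
# Hypothetical sketch of the parameters a Flatland configuration might expose.
# Every key name below is an assumption made for illustration only.
example_config = {
    "environment": {"width": 400, "height": 400, "rooms": 1},
    "agent": {
        "sensor": {"type": "rgb-distance", "resolution": 64},   # 1-D image of 64 pixels
        "actions": ["forward", "turn_left", "turn_right"],
    },
    "obstacles": [
        {"shape": "circle", "radius": 20, "color": "blue"},
        {"shape": "rectangle", "size": [30, 60], "color": "green"},
    ],
    "items": [
        {"name": "fruit",  "color": "orange", "reward": 10,  "count": 20},
        {"name": "poison", "color": "purple", "reward": -10, "count": 20},  # illustrative value
    ],
}
```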
IV. EXPERIMENTS
We conduct preliminary experiments to evaluate the potential of Flatland as a fast prototyping platform for learning agents.
A. A test navigation task
Fig. 2. Experimental setup. The agent's task is to collect orange "fruits" (+10 reward) while avoiding purple "poisons" (negative reward). The agent's inputs are 1-D images, which are extended to 2-D images here for visualization purposes.

For this task we have defined an environment that consists of a room with 4 obstacles of different shapes and colors, and two types of objects that can be picked up. The agent can collect orange "fruits" (+10 reward) or purple "poisons" (negative reward) by touching them. No other reward signal is provided.

Fig. 3. Evidence of non-trivial behavior of DQN. The agent carefully navigates in the environment in order to collect "fruits" while avoiding "poisons". Best viewed in color.

The goal of the agent is to find a strategy that gets a maximum reward in a fixed number of timesteps. It receives as input a 1-D image corresponding to what the agent sees in front of it, which is processed using 1-D convolutional layers. This task is similar to the one proposed for the evaluation in [4] and [11]. The agent's action space is discrete and composed of 3 actions: move forward, rotate left and rotate right. The task, inputs and actions are illustrated in Figure 2.

B. Baseline models
We propose to test our simulator by solving the task described in Sec. IV-A using 3 baselines that were previously evaluated on first-person view navigation tasks: Deep Q-Network (DQN) [5] and Asynchronous Advantage Actor-Critic (A3C) [11], two model-free RL baselines, and Direct Future Prediction (DFP) [4], an algorithm that learns to act by predicting features of the environment. The goal of this experiment is not to compare merits and weaknesses of the three evaluated methods. We rather show that three RL baselines perform well on a navigation task in
Flatland, and converge faster compared to complex 3D environments.

DQN [5] combines Q-learning [20] with experience replay [21] and a deep neural network. For a given state s, DQN outputs a vector of action values Q(s, ·, θ), where θ are the parameters of the network. Policies are derived using an ε-greedy approach w.r.t. Q. DQN was tested on a set of Atari 2600 games [6], reaching human-level performance on many games. It has thus become a commonly used baseline for RL problems.

In A3C [11], many instances of the agent interact in parallel with many instances of the environment, which both accelerates and stabilizes learning. The A3C algorithm constructs an approximation to both the policy π(a|s, θ) and the value function V(s, θ_v), using parameters of the actor θ and parameters of the critic θ_v. Both policy and value are adjusted towards an n-step lookahead value, using an entropy regularization penalty.

In DFP [4], the reinforcement learning problem is reformulated as a supervised learning problem. The agent predicts high-level features at multiple timescales (here: score, number of "fruits" picked up and number of "poisons" picked up), which are combined linearly to form an objective function. Then, policies are derived using actions that maximize this objective. This algorithm won, by a large margin, the VizDoom 2016 competition.
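For reference, the quantities these three baselines optimize can be written compactly in their standard forms from the cited papers (the notation below is ours, not taken from this text):

```latex
% DQN one-step target (Q-learning with experience replay and target parameters \theta^-):
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)

% A3C n-step advantage used to update the actor \theta and the critic \theta_v:
A_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V(s_{t+n}; \theta_v) - V(s_t; \theta_v)

% DFP action selection: predicted future measurements \hat{f} weighted by a goal vector g:
a_t = \arg\max_{a} \; g^{\top} \hat{f}(s_t, a; \theta)
```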
C. Experimental setup

We use available architectures for each method, and replace 2-D convolutions by 1-D convolutions. In the Appendix, we provide all hyperparameters and references to the implementations used. Each method is trained for a fixed number of episodes using the Adam optimizer [16]. Reported results and figures show results averaged over multiple random seeds, following the recommendation in [1] of evaluating on a large enough sample of random seeds to ensure significant results.

V. RESULTS
Training results are reported in Figure 4. We report the average reward as well as confidence intervals computed with a standard t-test for the mean, as recommended in [22].

All three methods rapidly converge to a policy that successfully navigates in the environment to pick up "fruits" while avoiding "poisons". Agents learn to navigate efficiently. This is illustrated in Fig. 3 and in the accompanying video (linked in the abstract). The policies are completely reactive since none of the methods incorporates memory.
Fig. 4. Training curves for DQN, A3C and DFP (reward as a function of the number of episodes). Means and 95% confidence intervals are computed using independent runs.

The best scoring agents achieved scores close to that obtained by humans after playing several games. Compared with a similar navigation task in the experiments of [4], where DQN, DFP and A3C need far more interaction to converge, convergence here is reached approximately two orders of magnitude faster in terms of timesteps. Since the task complexity is similar, these results confirm that faster convergence can be obtained using a lower-dimensional sensory stream. This faster convergence is two-fold. First, the 1-D image sensors contain less rich information than 2-D images, so training is faster in terms of training steps. Second, it allows for a reduced number of parameters in the neural network (1-D instead of 2-D convolutional layers), hence faster optimization of the neural network in terms of wall-clock time. Overall this allows for experiments at a reduced computational cost. For instance, training DQN took less than 1 hour on an Intel i7-7700K CPU.
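For completeness, the confidence intervals in Fig. 4 follow the standard Student-t form for the mean over n independent runs, i.e. the textbook formula from [22]:

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}, \qquad \alpha = 0.05 \text{ for a 95\% interval,}
```

where x̄ is the mean reward over runs, s the sample standard deviation and n the number of runs.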
VI. CONCLUSION
We propose Flatland, a lightweight, 2-D, partially-observable, highly flexible environment for testing and evaluation of reinforcement learning agents, in the context of problems related to Lifelong Learning. This simulation preserves the main characteristics of the tasks that agents face in real-life environments, such as partial observability, coherent physics and a first-person view. However, contrary to commonly used complex 3-D benchmarks with the same properties, our environment has a much lower-dimensional sensory stream. This enables fast prototyping and extensive experiments at low cost, as we show by experimenting with three RL baselines on a navigation task. On this task, convergence is achieved two orders of magnitude faster, in terms of number of training steps, compared to a similar experiment on VizDoom.
Flatland is designed as a framework for the evaluation of the performance of RL agents, and features the OpenAI Gym API.

The version of
Flatland presented in this paper is preliminary. Future work on Flatland includes adding a head and arms to agents, so that they are able to manipulate the environment by moving objects. This would allow us to define more complex tasks, which could potentially require reasoning and memory to be solved. Another direction of future work includes adding more diverse sensors (e.g. touch, smell) and more actions (e.g. grab objects, open doors). This would allow us to implement multi-modal learning, which is a promising direction in artificial intelligence research. The space of possible tasks could be expanded by adding more features, such as defining tasks with delayed rewards. Finally, we also intend to enable the ability to add multiple agents to the simulation, so that this environment can be used to study multi-agent interactions. Most importantly, this benchmark should evolve according to the community's needs in terms of the environment's nature and characteristics.
REFERENCES

[1] Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina & Meger, David (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
[2] Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie & Zaremba, Wojciech (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
[3] Harvey, Matt. Matt Harvey's self-driving car project, https://github.com/harvitronix/reinforcement-learning-car. Online; accessed 28-05-2018.
[4] Dosovitskiy, Alexey & Koltun, Vladlen (2016). Learning to act by predicting the future. arXiv preprint arXiv:1611.01779.
[5] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg & others (2015). Human-level control through deep reinforcement learning. Nature, 518, 529.
[6] Bellemare, Marc G, Naddaf, Yavar, Veness, Joel & Bowling, Michael (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
[7] Beattie, Charles, Leibo, Joel Z, Teplyashin, Denis, Ward, Tom, Wainwright, Marcus, Kuttler, Heinrich, Lefrancq, Andrew, Green, Simon, Valdes, Victor, Sadik, Amir & others (2016). DeepMind Lab. arXiv preprint arXiv:1612.03801.
[8] Kempka, Michal, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub & Jaskowski, Wojciech (2016). ViZDoom: A Doom-based AI research platform for visual reinforcement learning. Computational Intelligence and Games (CIG), 2016 IEEE Conference, 1-8.
[9] Johnson, Matthew, Hofmann, Katja, Hutton, Tim & Bignell, David (2016). The Malmo Platform for Artificial Intelligence Experimentation. IJCAI.
[10] Todorov, Emanuel, Erez, Tom & Tassa, Yuval (2012). MuJoCo: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5026-5033.
[11] Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David & Kavukcuoglu, Koray (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, 1928-1937.
[12] Agrawal, Pulkit, Nair, Ashvin, Abbeel, Pieter, Malik, Jitendra & Levine, Sergey (2016). Learning to Poke by Poking: Experiential Learning of Intuitive Physics. CoRR, abs/1606.07419.
[13] Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122.
[14] Wayne, Greg et al. (2018). Unsupervised Predictive Memory in a Goal-Directed Agent. CoRR, abs/1803.10760.
[15] Friston, K. J. (2009). The free-energy principle: a rough guide to the brain?. Trends in Cognitive Sciences, 13(7), 293-301.
[16] Kingma, Diederik P & Ba, Jimmy (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[17] Barreto, Andre, Dabney, Will, Munos, Rémi, Hunt, Jonathan J, Schaul, Tom, van Hasselt, Hado P & Silver, David (2017). Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 4055-4065.
[18] Sutton, Richard S & Barto, Andrew G (1998). Reinforcement learning: An introduction. MIT Press, Cambridge.
[19] Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B & Gershman, Samuel J (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
[20] Watkins, Christopher John Cornish Hellaby (1989). Learning from Delayed Rewards. PhD Thesis.
[21] Lin, Long-Ji (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293-321.
[22] Hogg, Robert Vincent & Tanis, Elliot A (2009). Probability and statistical inference. Pearson Educational International.
[23] Teh, Yee, Bapst, Victor, Czarnecki, Wojciech M, Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas & Pascanu, Razvan (2017). Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, 4499-4509.

APPENDIX
We present implementation details for each of the three RL baselines that we experiment with (see Sec. IV of the main paper).
A. Deep Q-Network

• Implementation: Keras-rl (https://github.com/keras-rl/keras-rl)
• Normalization of inputs
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• Policy: Boltzmann policy (softmax) with temperature 1
• Soft updates for the Q-learning parameters: 0.01
• Replay buffer size: 500000
• Architecture: Input (shape = (64, 3)) - Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D - Dense layer (3) - Output (shape = 3); a sketch of this network is given after the A3C settings below
• Discount factor γ = 0.

B. Asynchronous Advantage Actor-Critic

• Implementation: https://github.com/openai/universe-starter-agent
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• t_max = 20 and I_update = 20
• Entropy regularization with weight β = 0.
• Discount factor γ = 0.
• No action repeat: execute an action on every frame (action repeat = 1)
• Architecture: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1); we have substituted the LSTM with a fully connected layer of dimension 128 to make the policy reactive
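As a rough sketch of the 1-D convolutional network listed above for DQN, the architecture could be written in Keras as follows. Only the layer types and sizes come from the list; the ReLU activations, the Flatten layer and the default pooling size are assumptions:

```python
# Hedged sketch of the DQN architecture listed above (Keras).
# Activations, Flatten and the pooling size are assumptions; layer types
# and sizes follow the appendix.
from tensorflow.keras import layers, models

def build_dqn_network(n_actions=3, input_shape=(64, 3)):
    # input_shape: 1-D image of 64 pixels with 3 color channels
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(32, kernel_size=8, strides=1, activation="relu"),
        layers.Conv1D(48, kernel_size=4, strides=1, activation="relu"),
        layers.Conv1D(64, kernel_size=3, strides=1, activation="relu"),
        layers.MaxPooling1D(),
        layers.Flatten(),
        layers.Dense(n_actions),   # one Q-value per action (forward, turn left, turn right)
    ])
    return model

model = build_dqn_network()
model.summary()
```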
C. Direct Future Prediction

• Implementation: https://github.com/flyyufelix/Direct-Future-Prediction-Keras
• Adam: learning rate = 0. , β1 = 0. , β2 = 0.
• Measurements used: score, number of fruits picked up, number of poisons picked up
• Goal vector: [1, 1, -1]
• Normalization of inputs and measurements
• Training interval: 3 timesteps
• Policy: ε-greedy with respect to the defined goal
• Linear annealing of ε from 1 to 0.0001
• Replay buffer size: 20000
• Architecture: we only modify the convolutional part with: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D; the rest is unchanged
• Discount factor γ = 0.99
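To make the role of the goal vector [1, 1, -1] concrete, a minimal sketch of DFP-style action selection is given below; predict_measurements is a hypothetical stand-in for the trained network, not the authors' implementation:

```python
# Hedged sketch of DFP-style action selection with the goal vector above.
# predict_measurements() is a placeholder for the trained DFP network, which
# would return predicted future (score, fruits, poisons) for each action.
import numpy as np

GOAL = np.array([1.0, 1.0, -1.0])   # weights for (score, fruits picked up, poisons picked up)
ACTIONS = ["forward", "turn_left", "turn_right"]

def predict_measurements(observation, action):
    # Placeholder: in DFP this is the network's prediction of future measurements
    # (over the chosen temporal offsets) given the observation and candidate action.
    return np.zeros(3)

def select_action(observation):
    # Pick the action whose predicted measurements best align with the goal vector.
    utilities = [GOAL @ predict_measurements(observation, a) for a in ACTIONS]
    return ACTIONS[int(np.argmax(utilities))]
```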