Adaptive Reinforcement Learning through Evolving Self-Modifying Neural Networks
Samuel Schmidgall
George Mason University
[email protected]
ABSTRACT
The adaptive learning capabilities seen in biological neural networks are largely a product of the self-modifying behavior emerging from online plastic changes in synaptic connectivity. Current methods in Reinforcement Learning (RL) only adjust to new interactions after reflection over a specified time interval, preventing the emergence of online adaptivity. Recent work addressing this by endowing artificial neural networks with neuromodulated plasticity has been shown to improve performance on simple RL tasks trained using backpropagation, but has yet to scale up to larger problems. Here we study the problem of meta-learning in a challenging quadruped domain, where each leg of the quadruped has a chance of becoming unusable, requiring the agent to adapt by continuing locomotion with the remaining limbs. Results demonstrate that agents evolved using self-modifying plastic networks are more capable of adapting to complex meta-learning tasks, even outperforming the same network updated using gradient-based algorithms while taking less time to train.
CCS CONCEPTS
• Computing methodologies → Bio-inspired approaches; Neural networks; Reinforcement learning;

KEYWORDS
Reinforcement Learning, Meta-Learning, Self-Modifying, Adaptive
ACM Reference Format:
Samuel Schmidgall. 2020. Adaptive Reinforcement Learning through Evolving Self-Modifying Neural Networks. In Genetic and Evolutionary Computation Conference Companion (GECCO '20 Companion), July 8–12, 2020, Cancún, Mexico. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3377929.3389901
The brain's active self-modifying behavior plays an important role in its effectiveness for continual adaptation and learning in dynamic environments. Furthermore, evolution has led to the design of both the underlying neural connectivity as well as the framework for directing neuromodulated plasticity, the structure from which short-term synaptic self-modification occurs. However, the most common methods by which current AI systems are trained contradict
this way of learning. Consequently, modern training methods render AI incapable of online adaptation, performing well only on the tasks that they were trained on. Even slight deviations from the original simulated environment might be catastrophic for the agent's performance.

To address this problem, recent literature in meta-learning aims to optimize toward an initial set of parameters that enable rapid learning over a specified set of tasks, such as Model-Agnostic Meta-Learning (MAML) [2]. Another set of methods utilizes fast and slow weights in neural networks through a non-trainable Hebbian-learning-based associative memory [4]. Building on this, differentiable neuromodulation [3] proposes a way to augment traditional artificial neural networks with fast and slow weights, where the fast weights are modified through the addition of neuromodulated plasticity that is trainable using backpropagated gradients.

The work presented in this paper both demonstrates that self-modifying neural networks are capable of solving complex learning tasks in dynamic environments and poses Evolutionary Strategies as the natural choice for developing such networks. Previous work using neuromodulated plasticity [3][5] only experimented on simple problems, and only considered optimization through backpropagated gradients. Here we show evidence toward the applicability of evolved neuromodulated plasticity in a high-dimensional continuous control problem, Crippled-Ant, which requires both precise motor skills and adaptivity.
The approach presented in this work compares a traditional neural network architecture against one with self-modifying synaptic connectivity, where the changes in connectivity are modulated by a learned set of parameters. Performance comparisons are made between the policy gradient algorithm Proximal Policy Optimization [6] and a simplified version of Natural Evolution Strategies [5], which, for simplicity, will be referred to as OpenAI-ES for the duration of this paper.
Within the differentiable neuromodulation framework, the weights along with the plasticity of each connection are optimized:

$$x_t = \phi\big((w + \alpha H_t)\, x_{t-1}\big) \qquad (1)$$

$$H_{t+1} = H_t + M(x_t)\, x_t x_{t-1}^{\top} \qquad (2)$$

where $x_t$ is the output of a layer of neurons at time $t$, $\phi$ is a nonlinear activation function, $w$ is the set of traditional non-plastic weights, and $\alpha$ is the plasticity coefficient that scales the magnitude of the plastic component of each connection. The plastic component at timestep $t$ is represented by $H_t$, which accumulates the modulated outer product of pre- and post-synaptic activity between the respective layers. Here, plasticity is modulated through a learned neuromodulatory signal $M(x_t)$, which can be represented by a variety of functions, but in this work is represented by a single-layer feed-forward neural network. $H_t$ is clipped between $-\omega$ and $\omega$, with $\omega = 1$ in this experiment.
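For concreteness, the sketch below implements one such self-modifying layer following Eqs. (1)–(2). This is a minimal NumPy illustration rather than the paper's code: the tanh nonlinearity, the parameter shapes, and the initialization are assumptions, and M is realized as a single modulatory unit as described above.

```python
import numpy as np

class PlasticLayer:
    """A feed-forward layer with neuromodulated plastic weights, Eqs. (1)-(2)."""

    def __init__(self, n_in, n_out, omega=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.1, (n_out, n_in))      # slow, non-plastic weights w
        self.alpha = rng.normal(0.0, 0.1, (n_out, n_in))  # per-connection plasticity coefficients
        self.m = rng.normal(0.0, 0.1, n_out)              # single-layer modulatory network M
        self.H = np.zeros((n_out, n_in))                  # plastic component H_t
        self.omega = omega                                # clipping bound (omega = 1 here)

    def forward(self, x_prev):
        # Eq. (1): activity from the slow weights plus the scaled plastic term.
        x = np.tanh((self.w + self.alpha * self.H) @ x_prev)
        # Learned neuromodulatory signal M(x_t): one linear unit squashed by tanh.
        mod = np.tanh(self.m @ x)
        # Eq. (2): accumulate the modulated outer product of post- and
        # pre-synaptic activity, clipped to [-omega, omega].
        self.H = np.clip(self.H + mod * np.outer(x, x_prev),
                         -self.omega, self.omega)
        return x
```

Note that each call to forward both produces the layer's activity and updates $H_t$ in place, so adaptation happens online within a single lifetime, while $w$, $\alpha$, and the parameters of $M$ are the quantities that evolution (or a gradient method) optimizes.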
Figure 1: Adaptive locomotion. In the Crippled-Ant Environment, a limb is chosen at random to be disabled (in red), requiring the agent to adapt its gait using the remaining limbs.
Starting with an initial zero-vector $\theta_t$, the OpenAI-ES algorithm generates $N$ population samples of random noise vectors $v_{t,i} \sim \mathcal{N}(0, \sigma^2)$ and uses them to create population individuals $\theta_t + v_{t,i}$. The fitness of each individual is evaluated over the course of a lifetime through an environment-defined reward, $r_{t,i}$. Such reward is often center-ranked to prevent early local optima [5]. Using the corresponding rewards, parameters are updated with Stochastic Gradient Descent (SGD) as follows:

$$\theta_{t+1} = \theta_t + \frac{\alpha}{N\sigma} \sum_{i=1}^{N} v_{t,i}\, r_{t,i} \qquad (3)$$

OpenAI-ES was chosen because it has been shown to be competitive with, and to exhibit better exploration behavior than, both DQN and A3C on difficult RL benchmarks [5]. While OpenAI-ES is less sample-efficient than these other methods, it is better structured for distributed computing and allows a shorter wall-clock training time. Additionally, because it does not require back-propagation of error gradients, the wall-clock training time is further reduced for optimization over networks involving recurrence, such as the neuromodulated plasticity used in our experiments.
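A sketch of one iteration of this loop follows, under stated assumptions: the population size, learning rate, and noise scale are placeholders, and the fitness function stands in for a full environment rollout. Perturbations are drawn as $\sigma v$ with $v \sim \mathcal{N}(0, I)$, which matches the paper's $v_{t,i} \sim \mathcal{N}(0, \sigma^2)$.

```python
import numpy as np

def centered_ranks(rewards):
    """Map raw rewards to ranks scaled into [-0.5, 0.5] (center-ranking, [5])."""
    ranks = np.empty(len(rewards))
    ranks[np.argsort(rewards)] = np.arange(len(rewards))
    return ranks / (len(rewards) - 1) - 0.5

def es_step(theta, fitness, n_pop=100, sigma=0.1, lr=0.01,
            rng=np.random.default_rng(0)):
    """One OpenAI-ES iteration: perturb theta, evaluate each individual
    over a lifetime, then apply the SGD update of Eq. (3)."""
    noise = rng.standard_normal((n_pop, theta.size))           # v ~ N(0, I)
    rewards = np.array([fitness(theta + sigma * v) for v in noise])
    shaped = centered_ranks(rewards)                           # center-ranked r
    grad = noise.T @ shaped / (n_pop * sigma)                  # Eq. (3) estimate
    return theta + lr * grad

# Toy usage with a quadratic fitness as a stand-in for an environment rollout:
theta = np.zeros(10)
for _ in range(200):
    theta = es_step(theta, fitness=lambda p: -np.sum((p - 1.0) ** 2))
```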
The meta-learning capabilities of the neural network in this paper are evaluated on a high-dimensional continuous control environment, Crippled-Ant [1]. The environment begins with a 12-jointed quadruped aiming to attain the highest possible velocity in a limited amount of time (Figure 1). The environment takes direct joint torque for each of the 12 joints as input. The state is represented as a 111-dimensional vector containing relative angles and velocities for each joint, as well as information about external forces acting on the quadruped. At the beginning of each session, a leg is randomly selected to be crippled on the quadrupedal robot, rendering it fully unusable. This environment was chosen because this modification causes a significant change in the action dynamics, requiring gait adaptation throughout the course of each run.

Figure 2: Performance Comparison on the Crippled-Ant Environment.
Performance of each policy is measured for self-modifying (SM-) and traditional neural networks trained using Proximal Policy Optimization and OpenAI-ES.
Evaluation of performance is averaged over 100 episodes from 5 fully trained models for each algorithm during the testing phase to ensure accurate measurement. Each algorithm is trained using the default hyper-parameters from its respective paper. OpenAI-ES was compared against a policy gradient algorithm often used in continuous control problems, Proximal Policy Optimization (PPO). Both of these algorithms were also compared using fixed weights and differentiable self-modifying ones. The experimental results demonstrate that self-modifying networks trained through Evolutionary Strategies consistently outperform networks without such augmentation trained using OpenAI-ES and PPO, as well as self-modifying networks using PPO. Total training time for the self-modifying OpenAI-ES averaged around minutes, and minutes for the self-modifying PPO running on a standard 6-core CPU. Future work involves experimenting with new types of neuromodulation, as well as understanding the full capabilities of such networks.
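As an illustration only, the following sketch shows how such a testing phase could be organized, assuming a Gym-style environment interface; crippling is approximated by zeroing the action dimensions of one randomly chosen leg (three contiguous torque indices per leg, per the 12-joint description above), and make_env and policy are hypothetical names rather than the paper's API.

```python
import numpy as np

N_MODELS, N_EPISODES, JOINTS_PER_LEG = 5, 100, 3

def crippled_episode(env, policy, rng):
    """Roll out one episode with a randomly chosen leg fully disabled."""
    leg = rng.integers(4)  # one of the quadruped's four legs
    dead = slice(leg * JOINTS_PER_LEG, (leg + 1) * JOINTS_PER_LEG)
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = np.asarray(policy(obs), dtype=float)
        action[dead] = 0.0  # the crippled leg produces no torque
        obs, reward, done, _ = env.step(action)
        total += reward
    return total

def evaluate(make_env, policies, rng=np.random.default_rng(0)):
    """Average return over N_EPISODES episodes for each trained model."""
    assert len(policies) == N_MODELS
    returns = [crippled_episode(make_env(), p, rng)
               for p in policies for _ in range(N_EPISODES)]
    return float(np.mean(returns))
```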
REFERENCES
[1] Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. 2018. Learning to Adapt: Meta-Learning for Model-Based Control. CoRR abs/1803.11347 (2018). arXiv:1803.11347 http://arxiv.org/abs/1803.11347
[2] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. CoRR abs/1703.03400 (2017). arXiv:1703.03400 http://arxiv.org/abs/1703.03400
[3] Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O. Stanley. 2019. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In ICLR.
[4] Jack W. Rae, Chris Dyer, Peter Dayan, and Timothy P. Lillicrap. 2018. Fast Parametric Learning with Activation Memorization. CoRR abs/1803.10049 (2018). arXiv:1803.10049 http://arxiv.org/abs/1803.10049
[5] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:stat.ML/1703.03864
[6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347