Meta-Learning through Hebbian Plasticity in Random Networks
Elias Najarro and Sebastian Risi
IT University of Copenhagen, 2300 Copenhagen, Denmark
[email protected], [email protected]
Abstract
Lifelong learning and adaptability are two defining aspects of biological agents. Modern reinforcement learning (RL) approaches have shown significant progress in solving complex tasks, however once training is concluded, the found solutions are typically static and incapable of adapting to new information or perturbations. While it is still not completely understood how biological brains learn and adapt so efficiently from experience, it is believed that synaptic plasticity plays a prominent role in this process. Inspired by this biological mechanism, we propose a search method that, instead of optimizing the weight parameters of neural networks directly, only searches for synapse-specific Hebbian learning rules that allow the network to continuously self-organize its weights during the lifetime of the agent. We demonstrate our approach on several reinforcement learning tasks with different sensory modalities and more than 450K trainable plasticity parameters. We find that starting from completely random weights, the discovered Hebbian rules enable an agent to navigate a dynamical 2D-pixel environment; likewise they allow a simulated 3D quadrupedal robot to learn how to walk while adapting to morphological damage not seen during training and in the absence of any explicit reward or error signal, in less than 100 timesteps. Code is available at https://github.com/enajx/HebbianMetaLearning.

Introduction

Agents controlled by neural networks and trained through reinforcement learning (RL) have proven to be capable of solving complex tasks [1–3]. However, once trained, the neural network weights of these agents are typically static, thus their behaviour remains mostly inflexible, showing limited adaptability to unseen conditions or information. These solutions, whether found by gradient-based methods or black-box optimization algorithms, are often immutable and overly specific for the problem they have been trained to solve [4, 5]. When applied to a different task, these networks need to be retrained, requiring many extra iterations.

Unlike artificial neural networks, biological agents display remarkable levels of adaptive behavior and can learn rapidly [6, 7]. Although the underlying mechanisms are not fully understood, it is well established that synaptic plasticity plays a fundamental role [8, 9]. For example, many animals can quickly walk after being born without any explicit supervision or reward signals, seamlessly adapting to their bodies of origin. Different plasticity-regulating mechanisms have been suggested, which can be encompassed in two main ideal-type families: end-to-end mechanisms, which involve top-down feedback propagating errors [10], and local mechanisms, which solely rely on local activity in order to regulate the dynamics of the synaptic connections. The earliest proposed version of a purely local mechanism is known as
Hebbian plasticity, which in its simplest form states that the synaptic strength between neurons changes proportionally to the correlation of activity between them [11].
[Figure 1 panels: network weights at t = 0, t = 100, t = 200 (A, B, C) and episodic reward, for the undamaged morphology, front-left leg damage (not seen during training), and front-right leg damage.]
Figure 1:
Hebbian Learning in Random Networks.
Starting from random weights, the discovered learning rules allow fast adaptation to different morphological damage without an explicit reward signal. The figure shows the weights of the network at three different timesteps (A, B, C) during the lifetime of the robot with standard morphology (top-left). Each column represents the weights of one of the network layers at the different timesteps. At t=0 (A) the weights of the network are initialised randomly by sampling from a uniform distribution w ∈ U[-0.1, 0.1]; thereafter their dynamics are determined by the evolved Hebbian rules and the sensory input from the environment. After a few timesteps a diagonal-like pattern appears and the quadruped starts to move, which is reflected in an increase in the episodic reward (bottom row). The network with the same Hebbian rules is able to adapt to robots with varying morphological damage, even ones not seen during training (top right).

The rigidity of non-plastic networks and their inability to keep learning once trained can partially be attributed to them traditionally having both a fixed neural architecture and a static set of synaptic weights. In this work we are therefore interested in algorithms that search for plasticity mechanisms that allow agents to adapt during their lifetime [12–15]. While recent work in this area has focused on determining both the weights of the network and the plasticity parameters, we are particularly intrigued by the interesting properties of randomly-initialised networks in both machine learning [16–18] and neuroscience [19]. Therefore, we propose to search for plasticity rules that work with randomly initialised networks purely based on a process of self-organisation.

To accomplish this, we optimize for connection-specific Hebbian learning rules that allow an agent to find high-performing weights for non-trivial reinforcement learning tasks without any explicit reward during its lifetime. We demonstrate our approach on two continuous control tasks and show that such a network reaches a higher performance than a fixed-weight network in a vision-based RL task. In a 3-D locomotion task, the Hebbian network is able to adapt to damage to the morphology of a simulated quadrupedal robot that has not been seen during training, while a fixed-weight network fails to do so. In contrast to fixed-weight networks, the weights of the Hebbian networks continuously vary during the lifetime of the agent; the evolved plasticity rules give rise to the emergence of an attractor in the weight phase-space, which results in the network quickly converging to high-performing dynamical weights.

We hope that our demonstration of random Hebbian networks will inspire more work in neural plasticity that challenges current assumptions in reinforcement learning; instead of agents starting deployment with finely-tuned and frozen weights, we advocate for the use of more dynamical neural networks, which might display dynamics closer to their biological counterparts. Interestingly, we find that the discovered Hebbian networks are remarkably robust and can even recover from having a large part of their weights zeroed out.

In this paper we focus on exploring the potential of Hebbian plasticity to master reinforcement learning problems. Meanwhile, artificial neural networks (ANNs) have been the object of great interest by neuroscientists for being capable of explaining some neurobiological data [20], while at the same time being able to perform certain visual cognitive tasks at a human level.
Likewise, demonstrating how random networks – solely optimised through local rules – are capable of reaching competitive performance in complex tasks may contribute to the pool of plausible models for understanding how learning occurs in the brain. Finally, we hope this line of research will further help promote ANN-based RL frameworks for studying how biological agents learn [21].

Related Work
Meta-learning. The aim in meta-learning or learning-to-learn [22, 23] is to create agents that can learn quickly from ongoing experience. A variety of different methods for meta-learning already exist [24–29]. For example, Wang et al. [27] showed that a recurrent LSTM network [30] can learn to reinforcement learn. In their work, the policy network connections stay fixed during the agent's lifetime and learning is achieved through changes in the hidden state of the LSTM. While most approaches, such as the work by Wang et al. [27], take the environment's reward as input in the inner loop of the meta-learning algorithm (either as input to the neural network or to adjust the network's weights), we do not give explicit rewards during the agent's lifetime in the work presented here. Typically, during meta-training, networks are trained on a number of different tasks and then tested on their ability to learn new tasks. A recent trend in meta-learning is to find good initial weights (e.g. through gradient descent [28] or evolution [29]), from which adaptation can be performed in a few iterations. One such approach is Model-Agnostic Meta-Learning (MAML) [28], which allows simulated robots to quickly adapt to different goal directions. Hybrid approaches bringing together gradient-based learning with unsupervised Hebbian rules have also been shown to improve performance on supervised-learning tasks [31].

A less explored meta-learning approach is the evolution of plastic networks that undergo changes at various timescales, such as in their neural connectivity, while experiencing sensory feedback. These evolving plastic networks are motivated by the promise of discovering principles of neural adaptation, learning, and memory [13]. They enable agents to perform a type of meta-learning by adapting during their lifetime, through evolving recurrent networks that can store activation patterns [32] or by evolving forms of local Hebbian learning rules that change the network's weights based on the correlated activation of neurons ("what fires together wires together"). Instead of relying on Hebbian learning rules, early work [14] explored the optimization of the parameters of a parameterised learning rule that is applied to all connections in the network. Most related to our approach is early work by Floreano and Urzelai [33], who explored the idea of starting networks with random weights and then applying Hebbian learning. This approach demonstrated the promise of evolving Hebbian rules but was restricted to only four different types of Hebbian rules and small networks (12 neurons, 144 connections) applied to a simple robot navigation task.

Instead of training local learning rules through evolutionary optimization, recent work showed it is also possible to optimize the plasticity of individual synaptic connections through gradient descent [15]. However, while the trainable parameters in their work only determine how plastic each connection is, the black-box optimization approach employed in this paper allows each connection to implement its own Hebbian learning rule.
Self-Organization. Self-organization plays a critical role in many natural systems [34] and is an active area of research in complex systems. It has also recently been gaining more prominence in machine learning, with graph neural networks being a noteworthy example [35]. The recent work by Mordvintsev et al. [36] on growing cellular automata through local rules encoded by a neural network has interesting parallels to the work we present here; in their work the growth of 2D images relies on self-organization, while in our work it is the network's weights themselves that self-organize. A benefit of self-organizing systems is that they are very robust and adaptive. The goal of our proposed approach is to take a step towards similar levels of robustness for neural network-based RL agents.
Neuroscience. In biological nervous systems, the weakening and strengthening of synapses through synaptic plasticity is assumed to be one of the key mechanisms for long-term learning [8, 9]. Evolution shaped these learning mechanisms over long timescales, allowing efficient learning during our lives. What is clear is that the brain can rewire itself based on the experiences we undergo during our lifetime [37]. Additionally, animals are born with highly structured brain connectivity that allows them to learn quickly from birth [38]. However, the importance of random connectivity in biological brains is less well understood. For example, random connectivity seems to play a critical role in the prefrontal cortex [39], allowing an increase in the dimensionality of neural representations. Interestingly, it was only recently shown that these theoretical models matched experimental data better when random networks were combined with simple Hebbian learning rules [19].

The most well-known form of synaptic plasticity occurring in biological spiking networks is spike-timing-dependent plasticity (STDP). Artificial neural networks, on the other hand, have continuous outputs, which are usually interpreted as an abstraction of spiking networks in which the continuous output of each neuron represents a spike-rate coding average – instead of spike-timing coding – of a neuron over a long time window or, equivalently, of a subset of spiking neurons over a short time window; in this scenario, the relative timing of the pre- and post-synaptic activity does not play a central role anymore [40, 41]. Spike-rate-dependent plasticity (SRDP) is a well-documented phenomenon in biological brains [42, 43]. We take inspiration from this work, showing that random networks combined with Hebbian learning can also enable more robust meta-learning approaches.
The main steps of our approach can be summarized as follows: (1) An initial population of neural networks with random synapse-specific learning rules is created, (2) each network is initialised with random weights and evaluated on a task based on its accumulated episodic reward, with the network weights changing at each timestep following the discovered learning rules, and (3) a new population is created through an evolution strategy [44], moving the learning-rule parameters towards rules with higher cumulative rewards. The algorithm then starts again at (2), with the goal to progressively discover more and more efficient learning rules that can work with arbitrarily initialised networks.

In more detail, the synapse-specific learning rules in this paper are inspired by biological Hebbian mechanisms. We use a generalized Hebbian ABCD model [45, 46] to control the synaptic strength between the artificial neurons of relatively simple feedforward networks. Specifically, the weights of the agent are randomly initialized and updated during its lifetime at each timestep following:

\Delta w_{ij} = \eta_w \cdot (A_w\, o_i o_j + B_w\, o_i + C_w\, o_j + D_w), \qquad (1)

where w_{ij} is the weight between neurons i and j, \eta_w is the evolved learning rate, A_w the evolved correlation term, B_w the evolved presynaptic term, and C_w the evolved postsynaptic term, with o_i and o_j being the presynaptic and postsynaptic activations respectively. While the coefficients A, B, C explicitly determine the local dynamics of the network weights, the evolved coefficient D can be interpreted as an individual inhibitory/excitatory bias of each connection in the network. In contrast to previous work, our approach is not limited to uniform plasticity [47, 48] (i.e. each connection having the same amount of plasticity) or restricted to only optimizing a connection-specific plasticity value [15]. Instead, building on the ability of recent evolution strategy implementations to scale to a large number of parameters [44], our approach allows each connection in the network to have both a different learning rule and a different learning rate.

We hypothesize that this Hebbian plasticity mechanism should give rise to the emergence of an attractor in weight phase-space, which leads the randomly-initialised weights of the policy network to quickly converge towards high-performing values, guided by sensory feedback from the environment.
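To make Equation 1 concrete, the following is a minimal NumPy sketch of one plasticity step, assuming the five coefficients are stored as per-connection arrays of the same shape as the layer's weight matrix (function and variable names are ours, not taken from the official repository):

```python
import numpy as np

def hebbian_update(W, o_pre, o_post, A, B, C, D, eta):
    """One ABCD Hebbian step (Eq. 1) for a single layer.

    W, A, B, C, D, eta: arrays of shape (n_pre, n_post), one value per synapse.
    o_pre:  presynaptic activations o_i, shape (n_pre,).
    o_post: postsynaptic activations o_j, shape (n_post,).
    """
    corr = np.outer(o_pre, o_post)   # o_i * o_j for every connection
    pre = o_pre[:, None]             # o_i repeated along the postsynaptic axis
    post = o_post[None, :]           # o_j repeated along the presynaptic axis
    return W + eta * (A * corr + B * pre + C * post + D)
```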
3.1 Optimization details

The particular population-based optimization algorithm that we employ is an evolution strategy (ES) [49, 50]. ES have recently been shown to reach competitive performance compared to other deep reinforcement learning approaches across a variety of different tasks [44]. These black-box optimization methods have the benefit of not requiring the backpropagation of gradients and can deal with both sparse and dense rewards. Here, we adapt the ES algorithm by Salimans et al. [44] to not optimize the weights directly but instead to find the set of Hebbian coefficients that will dynamically control the weights of the network during its lifetime based on the input from the environment.

In order to evolve the optimal local learning rules, we randomly initialise both the policy network's weights w and the Hebbian coefficients h by sampling from uniform distributions w ∈ U[-0.1, 0.1] and h ∈ U[0, 1] respectively. Subsequently we let the ES algorithm evolve h, which in turn determines the updates to the policy network's weights at each timestep through Equation 1. At each evolutionary step t we compute the task-dependent fitness of the agent F(h_t), we populate a new set of n candidate solutions by sampling normal noise \epsilon_i \sim \mathcal{N}(0, I) and adding it to the current best solution h_t, and we subsequently update the parameters of the solution based on the fitness evaluation of each of the i ∈ n candidate solutions:

h_{t+1} = h_t + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} F(h_t + \sigma\epsilon_i) \cdot \epsilon_i,

where \alpha modulates how much the parameters are updated at each generation and \sigma modulates the amount of noise introduced in the candidate solutions. It is important to note that during its lifetime the agent does not have access to this reward.

We compare our Hebbian approach to a standard fixed-weight approach, using the same ES algorithm to optimise either the weights directly or the learning rule parameters, respectively. All the code necessary to evolve both the Hebbian networks as well as the static networks with the ES algorithm is available at https://github.com/enajx/HebbianMetaLearning.
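The ES outer loop can be sketched in a few lines; this is a simplified, single-process version under our own naming, omitting the rank shaping, mirrored sampling, and parallelisation of the full algorithm [44] (the fitness normalisation below is a common stabiliser, not a detail stated in the text):

```python
import numpy as np

def evolve_hebbian_rules(fitness_fn, n_coeffs, generations=300,
                         popsize=200, alpha=0.2, sigma=0.1):
    """Evolve the flattened vector h of Hebbian coefficients.

    fitness_fn(h) is assumed to run one full episode: initialise the policy
    weights from U[-0.1, 0.1], update them each timestep with Eq. 1, and
    return the episodic reward (which the agent itself never sees).
    """
    h = np.random.uniform(0.0, 1.0, n_coeffs)     # h ~ U[0, 1]
    for _ in range(generations):
        eps = np.random.randn(popsize, n_coeffs)  # ε_i ~ N(0, I)
        F = np.array([fitness_fn(h + sigma * e) for e in eps])
        F = (F - F.mean()) / (F.std() + 1e-8)     # normalise fitnesses
        h = h + alpha / (popsize * sigma) * eps.T @ F
    return h
```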
Figure 2: Test domains. The random Hebbian network approach introduced in this paper is tested on the CarRacing-v0 environment [51] and a quadruped locomotion task. In the robot task, the same network has to adapt to three morphologies while only seeing two of them during the training phase (standard Ant-v0 morphology, morphology with damaged right front leg, and unseen morphology with damaged left front leg), without any explicit reward feedback.
We demonstrate our approach on two continuous control environments with different sensory modalities (Fig. 2). The first is a challenging vision-based RL task, in which the goal is to drive a racing car through procedurally generated tracks as fast as possible. While not appearing too complicated, the task was only recently solved (achieving a score of more than 900 averaged over 100 random rollouts) [52–54]. The second domain is a complex 3-D locomotion task that controls a four-legged robot [55]. Here the information of the environment is represented as a one-dimensional state vector.
Vision-based environment
As a vision-based environment, we use the CarRacing-v0 domain [51], built with the Box2D physics engine. The output state of the environment is resized and normalised, resulting in an observation space of 3 channels (RGB) of 84 × 84 pixels each. The policy network consists of two convolutional layers, activated by hyperbolic tangent and interposed with pooling layers, which feed a 3-layer feedforward network with [128, 64, 3] nodes per layer and no bias. This network has 92,690 weight parameters, 1,362 corresponding to the convolutional layers and 91,328 to the fully connected ones. The three network outputs control three continuous actions (left/right steering, acceleration, brake). Under the ABCD mechanism, with five coefficients per synapse, this results in 5 × 91,328 = 456,640 Hebbian coefficients, including the lifetime learning rates η.

In this environment, only the weights of the fully connected layers are controlled by the Hebbian plasticity mechanism, while the 1,362 parameters of the convolutional layers remain static during the lifetime of the agent. The reason is that there is no natural definition of what the presynaptic and postsynaptic activity of a convolution filter may be, which makes the interpretation of Hebbian plasticity for convolutional layers challenging. Furthermore, previous research on the human visual cortex indicates that the representation of visual stimuli in the early regions of the ventral stream is compatible with the representations of convolutional layers trained for image recognition [56], therefore suggesting that the variability of the parameters of convolutional layers should be limited. The evolutionary fitness is calculated as -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the generated track.

For the quadruped, we use a 3-layer feedforward network with [128, 64, 8] nodes per layer, no bias, and hyperbolic tangent as activation function. This architectural choice leads to a network with 12,288 synapses. Under the ABCD plastic mechanism, which has 5 coefficients per synapse, this translates to a set of 5 × 12,288 = 61,440 Hebbian coefficients, including the lifetime learning rates η. For the state-vector environment we use the open-source Bullet physics engine and its pyBullet Python wrapper [57], which includes the "Ant" robot, a quadruped with 13 rigid links, including four legs and a torso, along with 8 actuated joints [58]. It is modeled after the ant robot in the MuJoCo simulator [59] and constitutes a common benchmark in RL [28]. The robot has an input size of 28, comprising the positional and velocity information of the agent, and an action space of 8 dimensions, controlling the motion of each of the 8 joints. The fitness function of the quadruped agent selects for distance travelled during a period of 1,000 timesteps.

The parameters used for the ES algorithm to optimize both the Hebbian and static networks are the following: a population size of 200 for the CarRacing-v0 domain and of 500 for the quadruped, reflecting the higher complexity of this domain. Other parameters were the same for both domains and reflect typical ES settings (ES algorithms are typically more robust to different hyperparameters than other RL approaches [44]), with a learning rate α=0.2, α decay=0.995, σ=0.1, and σ decay=0.999. These hyperparameters were found by trial-and-error and worked best in prior experiments.

For each of the two domains, we performed three independent evolutionary runs (with different random seeds) for both the static and Hebbian approach. We performed additional ablation studies on restricted forms of the generalised Hebbian rule, which can be found in the Appendix.
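As an illustration of how the pieces fit together for the quadruped, here is a hedged sketch of one agent lifetime, reusing the hebbian_update sketch from above; the layer shapes follow the [128, 64, 8] architecture, an old gym-style environment API is assumed, and rules is a list of per-layer coefficient tuples (names and structure are our assumptions, not the repository's):

```python
import numpy as np

def run_lifetime(env, rules, timesteps=1000, plastic_until=None):
    """One episode: weights start random and self-organize via Eq. 1.
    Reward is accumulated only as the evolutionary fitness; it is never
    fed back into the weight updates. `plastic_until` optionally stops
    the Hebbian updates after that many steps (used for Fig. 3, left)."""
    shapes = [(28, 128), (128, 64), (64, 8)]            # quadruped layers
    Ws = [np.random.uniform(-0.1, 0.1, s) for s in shapes]
    obs, fitness = env.reset(), 0.0
    for t in range(timesteps):
        acts = [np.asarray(obs, dtype=float)]
        for W in Ws:                                    # tanh layers, no bias
            acts.append(np.tanh(acts[-1] @ W))
        obs, reward, done, _ = env.step(acts[-1])
        fitness += reward
        if plastic_until is None or t < plastic_until:
            for k, (A, B, C, D, eta) in enumerate(rules):
                Ws[k] = hebbian_update(Ws[k], acts[k], acts[k + 1],
                                       A, B, C, D, eta)
        if done:
            break
    return fitness
```

A function of this form can serve directly as the fitness_fn of the ES sketch in Section 3.1, after mapping the flat coefficient vector h onto the per-layer arrays.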
Vision-based Environment
To test how well the evolved solutions generalize, we compare the cumulative rewards averaged over 100 rollouts for the highest-performing Hebbian-based approach and the traditional fixed-weight approach. The set of local learning rules found by the ES algorithm yields a reward of 870 ± 13, while the static-weights solution only reached a performance of 711 ± 16. The Hebbian approach falls short of the best reported score (914 ± 15 [54]), but is on par with deep RL approaches such as PPO (865 ± 159 [54]). The competitive performance of the Hebbian learning agent is rather surprising, since it starts every one of the 100 rollouts with completely different random weights, but through the tuned learning rules it is able to adapt quickly. While the Hebbian network takes slightly longer to reach a high training performance, likely because of the increased parameter space (see Appendix), the benefit is a higher generality when tested on procedurally generated tracks not seen during training.
For the locomotion task, we created three variations of a 4-legged robot so as to mimic the effect of partial damage to one of its legs (Fig. 2). The choice of these morphologies is intended to create a task that would be difficult to master for a neural network that is not able to adapt. During training, both the static-weights and the Hebbian plastic networks follow the same set-up: at each training step the policy is optimised following the ES algorithm described in Section 3.1, where the fitness function consists of the average distance walked over two morphologies, the standard one and the one with damage on the right front leg. The third morphology (damage on the left front leg) is left out of the training loop in order to subsequently evaluate the generalisation of the networks.

| Damage | Seen/Unseen during training | Learning rule | Distance travelled | Solved |
|---|---|---|---|---|
| No damage | Seen | Hebbian | 1205 ± 100 | True |
| No damage | Seen | Static weights | 1604 ± 171 | True |
| Right front leg | Seen | Hebbian | 1132 ± 60 | True |
| Right front leg | Seen | Static weights | 1431 ± 54 | True |
| Left front leg | Unseen | Hebbian | 471 ± 87 | True |
| Left front leg | Unseen | Static weights | 68 ± 56 | False |
Table 1: Average distance travelled by the highest-performing quadrupeds evolved with both local rules (Hebbian) and static weights, across 100 rollouts. While the Hebbian learning approach finds a solution for the seen and unseen morphologies (defined as moving away from the initial start position by at least 100 units of length), the static-weights agent can only develop locomotion for the two morphologies that were present during training.

For the quadruped, we define solving the task as monotonically moving away from the initial position by at least 100 units of length. Out of the five evolutionary runs, both the Hebbian network and the static network found solutions for the seen morphologies in all runs. On the other hand, the static-weights network was incapable of finding a single solution that would solve the unseen damaged morphology, while the Hebbian network did manage to find solutions for the damaged unseen morphology. However, the performances of the Hebbian networks evaluated on the unseen morphology have a high variance. Understanding why some Hebbian solutions generalise and others do not paves the way for further research; we hypothesize that in order to obtain a solution capable of generalizing robustly the agent would need to be trained on a diverse set of morphologies with randomized damage. To test how well the evolved solutions generalize, we compare the distance walked averaged over 100 rollouts for the Hebbian and the static-weights networks. We report the highest-performing solutions on each of the morphologies from a single evolutionary run (Table 1).

Since the static-weights network cannot adapt to the environment, it efficiently solves the morphologies that it has seen during training but fails at the unseen one. On the other hand, the Hebbian network is capable of adapting to the new morphologies, leading to an efficient self-organization of the network's synaptic weights (Fig. 1). Furthermore, we found that the initial random weights of the network can even be sampled from distributions other than the one used during the discovery of the Hebbian coefficients, such as N(0, 0.1), and the agent still reaches a comparable performance.

Interestingly, even without the presence of any reward feedback during its lifetime, the Hebbian-based network is able to find well-performing weights for each of the three morphologies. The incoming activation patterns alone are enough for the network to adapt without explicitly knowing which morphology is currently being simulated. However, for the morphologies that the static-weight network did solve, it reached a higher reward than the Hebbian-based approach. Several reasons may explain this, including the need for extra time to learn or the larger size of the parameter space, which could require longer training times to find even more efficient plasticity rules.

In order to determine the minimum number of timesteps the weights need to converge from random to optimal during an agent's lifetime, we investigated freezing the Hebbian update mechanism of the weights after a different number of timesteps and examining the resulting episode's cumulative reward. We observe that the weights only need between 30 and 80 timesteps (i.e. Hebbian updates) to converge to a set of optimal values (Fig. 3, left). Furthermore, we tested the resilience of the network to external perturbations by saturating all its outputs to 1.0 for 100 timesteps, effectively freezing the agent in place. Fig. 3, right shows that the evolved Hebbian rules allow the network to recover to optimal weights within a few timesteps. Furthermore, the Hebbian network is able to recover from a partial loss of its connections, which we simulate by zeroing out a subset of the synaptic weights during one timestep (Fig. 4, left). We observe a brief disruption in the behavior of the agent; however, the network is able to reconverge towards an optimal solution in a few timesteps (Fig. 4, upper-right).

In order to get better insight into the effect of the discovered plasticity rules and the development of the weight patterns during Hebbian learning, we performed a dimensionality reduction through principal component analysis (PCA), which projects the high-dimensional space in which the network weights live to a 3-dimensional representation at each timestep, such that most of the variance is explained by this lower-dimensional representation (Fig. 5).
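Concretely, the freezing experiment amounts to sweeping the plastic_until cutoff of the run_lifetime sketch above (env and rules are the objects from that sketch; the cutoff values here are illustrative, not the paper's exact grid):

```python
# Stop plasticity after `cut` Hebbian updates and record the episodic
# reward obtained with the weights frozen from that point on (Fig. 3, left).
rewards = {cut: run_lifetime(env, rules, plastic_until=cut)
           for cut in (10, 30, 50, 80, 150, 300, 1000)}
```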
Figure 3:
Learning efficiency and robustness to actuator perturbations.
Left: The cumulative reward for the quadruped whose weights are frozen at different timesteps. The Hebbian network only needs in the order of 30–80 timesteps to converge to high-performing weights.
Right: The performance of a quadruped whose actuators are frozen during 100 timesteps (from t=300 to t=400). The robot is able to quickly recover from this perturbation in around 50 timesteps.
Figure 4:
Resilience to weight perturbations. A: Visualisation of the network's weights at the timestep when a third of its weights are zeroed out, shown as a black band. B: Visualisation of the network's weights 10 timesteps after the zeroing; the network's weights have recovered from the perturbation. Right: The performance of the quadruped quickly recovers after an initial drop when we zero out a subset of the synaptic weights. The purple line indicates the timestep of the weight zeroing.

For the car environment we observe the presence of a U-shaped 2-dimensional manifold on which most of the weights live; this contrasts with the dynamics of a network in which we set the Hebbian coefficients (Eq. 1) to random values, where the weight trajectory lacks any structure and oscillates around zero. In the case of the three quadruped morphologies, the trajectories of the Hebbian network follow a 3-dimensional curve with an oscillatory signature; with random Hebbian coefficients the network does not give rise to any apparent structure in its weight trajectory.
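The projection behind Fig. 5 can be reproduced with standard tooling; a minimal sketch, assuming the per-timestep weight snapshots have been flattened into one matrix (scikit-learn is our choice for the PCA, not necessarily the authors'):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_weight_trajectory(weight_history, n_components=3):
    """weight_history: array of shape (timesteps, n_weights), one flattened
    snapshot of all plastic weights per timestep of an episode.
    Returns the trajectory in the first principal components."""
    return PCA(n_components=n_components).fit_transform(weight_history)
```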
Figure 5:
Discovered Weight Attractors. Low-dimensional representations of the weight dynamics (each dot represents a timestep; the first timestep is indicated with a star marker). The plotted trajectory represents the evolution of the first 3 principal components (PCA) of the synaptic weights controlled by the Hebbian plasticity mechanism with the evolved coefficients over 1,000 timesteps. Left: Pixel-based CarRacing-v0 agent. Right: The three quadruped agent morphologies: Bullet's AntBulletEnv-v0 and the two damaged morphologies (Fig. 2). In both panels, trajectories obtained with random Hebbian coefficients are shown for comparison.

In this work we introduced a novel approach that allows agents with random weights to adapt quickly to a task. It is interesting to note that lifetime adaptation happens without any explicitly provided reward signal, and is only based on the evolved Hebbian local learning rules. In contrast to typical static network approaches, in which the weights of the network do not change during the lifetime of the agent, the weights in the Hebbian-based networks self-organize and converge to an attractor in weight space during their lifetime.

The ability to adapt weights quickly is shown to be important for tasks such as adapting to damaged robot morphologies, and could be useful for settings such as continual learning [60]. The ability to converge to high-performing weights from initially random weights is surprisingly robust, and the best networks manage to do this for each of the 100 rollouts in the CarRacing domain. That the Hebbian networks are more general while their performance on a particular task/robot morphology can be lower is maybe not surprising: learning generally takes time but can result in greater generalisation [61].

Interestingly, randomly initialised networks have recently shown particularly interesting properties in different domains [16–18]. We add to this recent trend by demonstrating that random weights are all you need to adapt quickly to some complex RL domains, given that they are paired with expressive neural plasticity mechanisms.

An interesting future work direction is to extend the approach with neuromodulated plasticity, which has been shown to improve the performance of evolving plastic neural networks [62] and plastic networks trained through backpropagation [63]. Among other properties, neuromodulation allows certain neurons to modulate the level of plasticity of the connections in the neural network. Additionally, a complex system of neuromodulation seems critical in animal brains for more elaborate forms of learning [64]. Such an ability could be particularly important when giving the network an additional reward signal as input for goal-based adaptation. The approach presented here opens up other interesting research areas, such as also evolving the agent's neural architecture [65] or encoding the learning rules through a more indirect genotype-to-phenotype mapping [66, 38].

In the neuroscience community, the question of which parts of animal behaviors are innate and which parts are acquired through learning is hotly debated [38]. Interestingly, randomness in the connectivity of these biological networks potentially plays a more important part than previously recognized. For example, random feedback connections could allow biological brains to perform a type of backpropagation [67], and there is recent evidence suggesting that the prefrontal cortex might in effect employ a combination of random connectivity and Hebbian learning [19]. To the best of our knowledge, this is the first time the combination of random networks and Hebbian learning has been applied to a complex reinforcement learning problem, which we hope could inspire further cross-pollination of ideas between neuroscience and machine learning in the future [20].

In contrast to current reinforcement learning algorithms that try to be as general as possible, evolution biased animal nervous systems to be able to quickly learn by restricting their learning to what is important for their survival [38]. The results presented in this paper, in which the agent's innate knowledge is the evolved learning rules, take a step in this direction. The presented approach opens up interesting future research directions that suggest de-emphasizing the role played by the network's weights and focusing more on the learning rules themselves. The results on two complex and different reinforcement learning tasks suggest that such an approach is worth exploring further.
Acknowledgements
This work was supported by a DFF-Danish ERC-programme grant and an Amazon Research Award.

Broader Impact
The ethical and future societal consequences of this work are hard to predict, but likely similar to other work dealing with more adaptive agents and robots. In particular, giving robots the ability to still function when injured could make it easier for them to be deployed in areas that have both a positive and a negative impact on society. In the very long term, robots that can adapt could help in industrial automation or help to care for the elderly. On the other hand, more adaptive robots could also be more easily used for military applications. The approach presented in this paper is far from being deployed in these areas, but it is important to discuss its potential long-term consequences early on.
References

[1] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

[2] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

[3] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.

[4] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.

[5] Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. NeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018.

[6] C. Lloyd Morgan. Animal intelligence. Nature, 26(674):523–524, Sep 1882. ISSN 1476-4687. doi: 10.1038/026523b0.

[7] Euan M. Macphail. Brain and Intelligence in Vertebrates. Oxford University Press, USA, 1982.

[8] Xu Liu, Steve Ramirez, Petti T. Pang, Corey B. Puryear, Arvind Govindarajan, Karl Deisseroth, and Susumu Tonegawa. Optogenetic stimulation of a hippocampal engram activates fear memory recall. Nature, 484(7394):381–385, 2012.

[9] Stephen J. Martin, Paul D. Grimwood, and Richard G. M. Morris. Synaptic plasticity and memory: an evaluation of the hypothesis. Annual Review of Neuroscience, 23(1):649–711, 2000.

[10] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. ArXiv e-prints, Oct 2018. URL https://arxiv.org/abs/1810.11393.

[11] D. O. Hebb. The Organization of Behavior; A Neuropsychological Theory. Wiley, 1949.

[12] Eseoghene Ben-Iwhiwhu, Pawel Ladosz, Jeffery Dick, Wen-Hua Chen, Praveen Pilly, and Andrea Soltoggio. Evolving inborn knowledge for fast adaptation in dynamic POMDP problems. arXiv preprint arXiv:2004.12846, 2020.

[13] Andrea Soltoggio, Kenneth O. Stanley, and Sebastian Risi. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks, 108:48–67, 2018.

[14] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Neural Networks, 1991.

[15] Thomas Miconi, Jeff Clune, and Kenneth O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. ArXiv e-prints, Apr 2018. URL https://arxiv.org/abs/1804.02464.

[16] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.

[17] Jürgen Schmidhuber, Daan Wierstra, Matteo Gagliolo, and Faustino Gomez. Training recurrent networks by Evolino. Neural Computation, 19(3):757–779, 2007.

[18] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.

[19] Grace W. Lindsay, Mattia Rigotti, Melissa R. Warden, Earl K. Miller, and Stefano Fusi. Hebbian learning in a random network captures selectivity properties of the prefrontal cortex. Journal of Neuroscience, 37(45):11021–11036, 2017.

[20] Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770, 2019.

[21] Isabella Pozzi, Sander Bohté, and Pieter Roelfsema. A biologically plausible learning rule for deep learning in the brain. 2018.

[22] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.

[23] Jürgen Schmidhuber. Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[25] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[26] Jürgen Schmidhuber. A 'self-referential' weight matrix. In International Conference on Artificial Neural Networks, pages 446–450. Springer, 1993.

[27] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[28] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.

[29] Chrisantha Thomas Fernando, Jakub Sygnowski, Simon Osindero, Jane Wang, Tom Schaul, Denis Teplyashin, Pablo Sprechmann, Alexander Pritzel, and Andrei A. Rusu. Meta learning by the Baldwin effect. arXiv preprint arXiv:1806.07917, 2018.

[30] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[31] Jeffrey Cheng, Ari Benjamin, Benjamin Lansdell, and Konrad Paul Kording. Augmenting supervised learning by meta-learning unsupervised local rules. NeurIPS 2019 Workshop Neuro AI, Sep 2019. URL https://openreview.net/pdf?id=HJlKNmFIUB.

[32] Randall D. Beer and John C. Gallagher. Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior, 1(1):91–122, 1992.

[33] Dario Floreano and Joseba Urzelai. Evolutionary robots with on-line self-organization and behavioral fitness. Neural Networks, 13(4-5):431–443, 2000.

[34] Scott Camazine, Jean-Louis Deneubourg, Nigel R. Franks, James Sneyd, Eric Bonabeau, and Guy Theraula. Self-Organization in Biological Systems, volume 7. Princeton University Press, 2003.

[35] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

[36] Alexander Mordvintsev, Ettore Randazzo, Eyvind Niklasson, and Michael Levin. Growing neural cellular automata. Distill, 2020. doi: 10.23915/distill.00023. https://distill.pub/2020/growing-ca.

[37] Stephen T. Grossberg. Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control, volume 70. Springer Science & Business Media, 2012.

[38] Anthony M. Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature Communications, 10(1):1–7, 2019.

[39] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.

[40] Steven A. Prescott and Terrence J. Sejnowski. Spike-rate coding and spike-time coding are affected oppositely by different adaptation mechanisms. Journal of Neuroscience, 28(50):13649, Dec 2008. doi: 10.1523/JNEUROSCI.1792-08.2008.

[41] Romain Brette. Philosophy of the spike: Rate-based vs. spike-based theories of the brain. Frontiers in Systems Neuroscience, 9, Nov 2015. ISSN 1662-5137. doi: 10.3389/fnsys.2015.00151.

[42] P. J. Sjöström, G. G. Turrigiano, and S. B. Nelson. Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32(6):1149–1164, Dec 2001. ISSN 0896-6273. doi: 10.1016/s0896-6273(01)00542-6.

[43] Luca A. Finelli, Seth Haney, Maxim Bazhenov, Mark Stopfer, and Terrence J. Sejnowski. Synaptic learning rules and sparse coding in a model sensory system. PLOS Computational Biology, 4(4):e1000062, Apr 2008. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1000062.

[44] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. ArXiv e-prints, Mar 2017. URL https://arxiv.org/abs/1703.03864.

[45] Andrea Soltoggio, Peter Durr, Claudio Mattiussi, and Dario Floreano. Evolving neuromodulatory topologies for reinforcement learning-like problems. In 2007 IEEE Congress on Evolutionary Computation, pages 2471–2478. IEEE, 2007.

[46] Yael Niv, Daphna Joel, Isaac Meilijson, and Eytan Ruppin. Evolution of reinforcement learning in uncertain environments: Emergence of risk-aversion and matching. In European Conference on Artificial Life, pages 252–261. Springer, 2001.

[47] Jimmy Ba, Geoffrey E. Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pages 4331–4339, 2016.

[48] Jürgen Schmidhuber. Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets. In International Conference on Artificial Neural Networks, pages 460–463. Springer, 1993.

[49] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies – a comprehensive introduction. Natural Computing, 1(1):3–52, 2002.

[50] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation, pages 3381–3387. IEEE, 2008.

[51] Oleg Klimov. CarRacing-v0, 2016. URL https://gym.openai.com/envs/CarRacing-v0/.

[52] Sebastian Risi and Kenneth O. Stanley. Deep neuroevolution of recurrent and discrete world models. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 456–462, 2019.

[53] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2450–2462, 2018.

[54] Yujin Tang, Duong Nguyen, and David Ha. Neuroevolution of self-interpretable agents. arXiv preprint arXiv:2003.08165, 2020.

[55] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. ArXiv e-prints, Jun 2015. URL https://arxiv.org/abs/1506.02438.

[56] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, Feb 2016. ISSN 1546-1726. doi: 10.1038/nn.4244.

[57] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.

[58] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. ArXiv e-prints, Apr 2016. URL https://arxiv.org/abs/1604.06778v3.

[59] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[60] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

[61] George Gaylord Simpson. The Baldwin effect. Evolution, 7(2):110–117, 1953.

[62] Andrea Soltoggio, John A. Bullinaria, Claudio Mattiussi, Peter Dürr, and Dario Floreano. Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th International Conference on Artificial Life (Alife XI), pages 569–576. MIT Press, 2008.

[63] Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv preprint arXiv:2002.10585, 2020.

[64] Michael J. Frank, Lauren C. Seeberger, and Randall C. O'Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.

[65] Adam Gaier and David Ha. Weight agnostic neural networks. ArXiv e-prints, Jun 2019. URL https://arxiv.org/abs/1906.04358.

[66] Sebastian Risi and Kenneth O. Stanley. Indirectly encoding neural plasticity as a pattern of local rules. SpringerLink, pages 533–543, Aug 2010. doi: 10.1007/978-3-642-15193-4_50.

[67] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247, 2014.

[68] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ArXiv e-prints, Mar 2018. URL https://arxiv.org/abs/1803.03635v5.
Appendix
Fig. 6 shows an example of how we visualize the weights of the network at a particular timestep. Each pixel represents the weight value w_ij of one synaptic connection. We represent the weights of each of the three fully connected layers, FC layer 1, FC layer 2, and FC layer 3, separately: the quadruped's network has an input space of dimension 28 and three fully connected layers with [128, 64, 8] neurons respectively, hence the rectangle for FC layer 1 has a horizontal dimension of 28 and a vertical one of 128. The second layer, FC layer 2, has a horizontal dimension of 64 and a vertical one of 128, while the last layer, FC layer 3, is 64 pixels vertically and 8 horizontally, which corresponds to the dimension of the action space. Darker pixels indicate negative values while white pixels are positive values. In the case of the CarRacing environment the weights are normalised to the interval [-1, +1], while the quadruped agents have unbounded weights.
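A possible way to produce such a visualisation, with our own function name (matplotlib is an assumption here, not a stated dependency of the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weights(Ws, bound=None):
    """Render each layer's weight matrix as a grayscale image, one panel
    per layer (cf. Fig. 6): darker pixels negative, lighter positive."""
    fig, axes = plt.subplots(1, len(Ws))
    for ax, W in zip(axes, Ws):
        v = bound if bound is not None else np.abs(W).max()
        ax.imshow(W, cmap="gray", vmin=-v, vmax=v)
        ax.set_axis_off()
    plt.show()
```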
Figure 6:
Network Weights Visualizations.
Visualisation of a random initial state of the network's weights. Each column represents the weights of one of the three layers, while each pixel represents the value of a weight between two neurons.
We show the training over generations for both approaches and both domains in Fig. 7. Even though the Hebbian method has to optimize a significantly larger number of parameters, training performance increases similarly fast for both approaches.
(a) Training curves for the Car environment
(b) Training curves for the quadrupeds
Figure 7:
Left:
Training curve for the car environment.
Right:
Training curve of the quadrupeds for both the static network and the Hebbian one. Curves are averaged over three evolutionary runs.

A.3 Hebbian rules
We analyze the different flavours of Hebbian rules derived from Eq. 2 in the car racing environment. For this experiment, we do not evolve the parameters of the convolutional layers; instead they are randomly fixed at initialisation and we solely evolve the Hebbian coefficients controlling the feedforward layers. The variants range from the simplest one, where all but the A coefficients are zero, to the most general form, where all four coefficients A, B, C, D and the intra-life learning rate \eta are present (Fig. 8):

\Delta w_{ij} = \eta_w \cdot (A_w\, o_i o_j + B_w\, o_i + C_w\, o_j + D_w). \qquad (2)

The static network and all the generalised Hebbian models can solve the pixel-based task; only the Hebbian version with a single coefficient A per synapse is incapable of solving it. The slower convergence of the Hebbian models with more coefficients can be explained by the fact that larger parameter spaces need more generations to be explored by the ES algorithm.
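One way to express the ablated variants is to mask the terms of Eq. 2 that a given variant drops, reusing the hebbian_update sketch from Section 3 (the variant names follow the legend of Fig. 8; this construction is our assumption, not necessarily how the experiments were coded):

```python
def restricted_update(W, o_pre, o_post, coeffs, variant="ABCD+lr"):
    """Eq. 2 with the terms absent from `variant` zeroed out,
    e.g. "A", "AD+lr", "ABC", "ABC+lr", or "ABCD+lr"."""
    A, B, C, D, eta = coeffs
    keep, lr = (variant.split("+") + [""])[:2]
    A = A if "A" in keep else 0.0
    B = B if "B" in keep else 0.0
    C = C if "C" in keep else 0.0
    D = D if "D" in keep else 0.0
    eta = eta if lr == "lr" else 1.0   # fixed unit learning rate otherwise
    return hebbian_update(W, o_pre, o_post, A, B, C, D, eta)
```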
Figure 8:
Hebbian rules ablations.
Training curves for the car racing agent with five different Hebbian rule variations (A, AD+lr, ABC, ABC+lr, ABCD+lr) and the static baseline. Curves are averaged over three evolutionary runs.

We also show the distribution of coefficients of the most general ABCD+η version (Fig. 10), which shows a normal distribution. We hypothesise that this distribution is potentially necessary to allow the self-organization of the weights so that they do not grow to extreme values. Analysing the resulting weight distributions and evolved rules opens up many interesting future research directions.

We experimented with evolving – alongside the Hebbian coefficients – the initial weights of the network, rather than randomly initializing them at each episode. We do this by sampling normal noise twice (Algorithm 2, Step 5 from [44]) and computing the fitness of the resulting solution pairs (Hebbian coefficients, initial weights). Surprisingly, this does not increase the training efficiency of the agents (Fig. 9). Furthermore, we find that runs for the CarRacing environment where we co-evolve the initial conditions are more likely to stall on local optima: 2 out of 3 runs found a network with good performance (at least 800 reward), while the third run stalled on low performance (a reward of less than 100). This finding may be explained by the extra difficulty that co-evolution introduces in the ES algorithm, as well as the extra lottery-ticket initialisation of both the initial weights and the Hebbian coefficients [68]. However, other possible implementations of this system may yield better results, and evolving both the connections' Hebbian coefficients and learning rules has shown promise in smaller networks [45, 13, 66].
Figure 9: Training curves of the Hebbian networks for the quadruped environments. We show that initializing the network with random weights at each episode and co-evolving the initial weights lead to similar results. Curves are averaged over three evolutionary runs.
Figure 10: [Histograms (coefficient value vs. count) of the evolved Hebbian coefficients A_w, B_w, C_w, D_w and learning rates η_w for the most general ABCD+η rule.]