An approach to synaptic learning for autonomous motor control
Sergio Verduzco-Flores∗
Computational Neuroscience Unit
Okinawa Institute of Science and Technology
Okinawa, Japan
[email protected]
∗Corresponding author.
William Dorrell
CNU, OIST [email protected]
Erik De Schutter
CNU, OIST [email protected]
Abstract
In the realm of motor control, artificial agents cannot match the performance of their biological counterparts. We thus explore a neural control architecture that is both biologically plausible and capable of fully autonomous learning. The architecture consists of feedback controllers that learn to achieve a desired state by selecting the errors that should drive them. This selection happens through a family of differential Hebbian learning rules that, through interaction with the environment, can learn to control systems where the error responds monotonically to the control signal. We next show that, in a more general case, neural reinforcement learning can be coupled with a feedback controller to reduce errors that arise non-monotonically from the control signal. The use of feedback control reduces the complexity of the reinforcement learning problem, because only a desired value must be learned, with the controller handling the details of how it is reached. This makes the function to be learned simpler, potentially allowing more complex actions to be learned. We discuss how this approach could be extended to hierarchical architectures.
Animals are masters at motor control, but it remains unclear how they achieve this. To advance the study of animal motor control, the best strategy may be to mimic the way animals learn. In particular:

- The agent learns as its body interacts in real time with the environment. Preferably a physical body, transmission delays, and response latencies should be considered.
- The controller uses nothing more than neurons.
- Learning rules use only information locally available at the postsynaptic neuron. Rather than relying on labelled data, learning takes advantage of correlations between signals, and of reinforcement learning mechanisms.
- No element of the model goes against current consensus in neuroscience.

Models with these characteristics are rare. One reason is the significant challenge of performing motor control exclusively with neural elements. More than 70 years ago Cybernetics recognized the challenge of motor control in biological organisms [20], and emphasized the role of feedback control in an abstract way, without addressing the problem of neural implementations.

A more practical viewpoint was put forward by Perceptual Control Theory (PCT) [23]. The tenets of PCT can be stated in these (oversimplified) terms: through evolution the organism has an innate knowledge of the perceptions which are conducive to homeostatic states. The organism is constantly attempting to create those perceptions, and this is what it learns through interaction with the environment. Matching desired and actual perceptions is a control problem, which PCT addresses using a hierarchy of feedback control systems. Although provocative, PCT research is yet to produce autonomous agents approaching the state of the art. Despite steady progress in control theory for more than a century, it is still not feasible to create an autonomous agent by using feedback control to create desired values of homeostatic variables. We will outline why this is the case.

A feedback control system receives a vector of errors and outputs a vector of control signals. The control law maps errors to control signals so as to reduce the errors. This map is conventionally obtained by a designer using a model of the physical system to be controlled, called the plant. For example, the linear-quadratic regulator (LQR) is a feedback controller whose control law can be found by minimizing a quadratic cost function using either the Hamilton-Jacobi-Bellman equation or Pontryagin's maximum principle [14] (a numerical sketch is given at the end of this introduction).

Solutions such as the LQR are highly effective in particular situations, but they seem unsuited for specifying more general behaviours. Consider the case of an agent that must maintain regular levels of certain homeostatic variables, such as blood sugar, or nociceptors' activity. These variables are not directly controllable, as would be the case for limb position. Instead, the variables to be controlled must be set through interaction with the environment. Designing a controller for this case requires a model not only of the agent's body, but also of the environment, and of how they interact. This quickly becomes intractable for conventional control theory approaches.

Using a control law that is not designed a priori, but instead is learned through interaction with the environment, is the main concern of Reinforcement Learning (RL) [31]. This discipline has had notable success in controlling agents in limited environments, but it can quickly run into the curse of dimensionality as the space of possible policies grows.
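The sketch promised above: when a linear model of the plant is available, an LQR control law can be computed numerically in a few lines. The following example uses SciPy's Riccati solver on a made-up plant (the matrices below are illustrative, not part of our model); it is included only to show how strongly model-based this design process is.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical linear plant dx/dt = A x + B u (a damped point mass;
# these matrices are illustrative, not taken from the paper).
A = np.array([[0.0, 1.0],
              [0.0, -0.1]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # quadratic state cost
R = np.array([[1.0]])    # quadratic control cost

# The Riccati solution P yields the optimal feedback gain K,
# so that the control law is the linear feedback u = -K x.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)
print("LQR gain K:", K)
```

Every step of this design hinges on knowing A and B; it is precisely this dependence on a plant model that the approaches discussed next try to remove.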
The use of deep networks to learn value functions and policies (called deep RL) can mitigate this curse of dimensionality, leading to human or beyond-human performance in some game environments (e.g. [18, 29]). Using more realistic bodies and environments remains a challenge.

In part inspired by the anatomical structure of animal brains, hierarchical architectures [16] hold the promise of taming the curse of dimensionality. Both hierarchical RL [2] and hierarchical deep RL [11] are areas of ongoing research, but it remains unclear how these approaches can best harness the learning and representational power of neural networks. Moreover, it is not known whether their computations parallel those in animal brains. This motivates the development of biologically plausible models of hierarchical RL [24, 6, 19, 7].

We present a family of synaptic learning mechanisms allowing a feedback control system to adjust itself so as to reduce an arbitrary error, as long as the error and the motor commands have a monotonic relation. In other words, the motor command should not cause the error to increase in one context and to decrease in a different one. We consider that for the low levels of a control hierarchy, where the response properties of actuators must be learned and continuously adjusted, learning from correlations between signals is a better approach than using global rewards. Our learning mechanisms offer a novel way to do this.

We illustrate how our learning rules can be used to control two simple systems. Next we show how the restriction of a monotonic relation between errors and motor commands in a feedback controller can be overcome by using reinforcement learning to adjust the controller according to the current context.

In the Discussion we describe new results and future directions, including realistic simulations of 2D arm reaching, and a hierarchical version of an architecture presented in this paper.
Preliminaries
The feedback control problem we consider is depicted in figure 1A. The P block represents a plant that encapsulates both the agent's body and the environment, which send afferent signals to a sensory population S_P, producing the perceived value of the state. A separate population S_D represents a desired value for the perceived state S_P. A controller C receives the activity of both S_D and S_P, mapping them into a motor command to the agent's actuators.

A common configuration for feedback control is shown in panel B of figure 1. Here the input to the controller is an error consisting of the difference between the desired and the actual observation. When S_D and S_P are scalar values, a simple strategy such as Proportional-Derivative (PD) control [30] can be autonomously configured and achieve acceptable performance. When S_D, S_P, and the output of the controller are high-dimensional, we say the system is Multi-Input Multi-Output (MIMO).

MIMO systems present a particular set of challenges, and control theory has a well-developed array of techniques to overcome them (e.g. [13]). Most of those techniques assume a mathematical model of the plant, which, as we have discussed, is not feasible in our case. Among the techniques that do not assume a model of the plant (e.g. data-driven control systems), very few use biologically realistic neural networks in their solution.

Four control architectures using biologically plausible neural networks have been identified [26]. Of these, direct inverse learning [12] is an offline method, whereas distal supervised learning [9] and feedback error learning [17] rely on backpropagation to train forward and/or inverse models of the plant. The fourth architecture is Reinforcement Learning, which avoids the limitations of the other architectures, but is generally slower to find a solution. Given the close ties between RL and differential Hebbian learning [10], it is interesting to ask whether the correlations between the inputs and outputs of the controller can be used to obtain a control law that is as adaptive and biologically plausible as RL, but achieves faster learning.

Looking at the population C in panel B of figure 1, we can see that a module that cancels its own inputs would provide a solution to the neural control problem. The actions of this module would also promote the stability of this recurrent network. To create such a device we follow this intuition: if one of C's outputs controls one of its inputs, then that input will resemble a slightly delayed version of that output (with a possible sign reversal, as the action can increase or decrease the error). If we can measure this resemblance in a way that is somewhat invariant to scaling, mean values, and small differences in phase, then we can configure the weights of a negative feedback controller capable of reducing that input.

Assume that the error is an M-dimensional vector e = [e_1, ..., e_M], and that C contains N units whose activity is in the vector c = [c_1, ..., c_N]. If there are lateral connections among the C units, then they can have the c_k values locally available. The synaptic weight ω_ij for the connection from e_j to c_i will then have a time derivative ω̇_ij(t) = Γ_ij(c(t − Δt), e(t)), where Γ_ij is a differential operator. We consider a family of Hebbian-like learning rules where

\[ \Gamma_{ij}(t) = -\alpha \, G_j(e(t)) \, H_i(c(t - \Delta t)), \qquad (1) \]

where α is a learning rate, and G_j, H_i are differential operators.
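Concretely, a rule from this family can be applied online by buffering past controller activity. The following is a minimal numpy sketch; the operators G and H, the delay, and all sizes are placeholders, with the concrete choices used in the paper given by equations 2, 3, and 4 below.

```python
import numpy as np
from collections import deque

alpha, dt = 0.05, 0.01       # learning rate and Euler step (arbitrary values)
delay_steps = 5              # Delta t expressed in integration steps
M, N = 3, 4                  # number of error signals and of C units
W = 0.1 * np.random.rand(N, M)          # weights from e_j to c_i
c_buffer = deque(maxlen=delay_steps)    # stores past controller activity

def G(e):            # placeholder differential operator on the errors
    return e

def H(c_delayed):    # placeholder differential operator on delayed outputs
    return c_delayed

def learning_step(W, e, c):
    """One Euler step of  d(w_ij)/dt = -alpha * G_j(e(t)) * H_i(c(t - Delta t))."""
    c_buffer.append(c.copy())
    if len(c_buffer) == delay_steps:      # wait until the buffer is full
        c_delayed = c_buffer[0]           # approximately c(t - Delta t)
        W += dt * (-alpha * np.outer(H(c_delayed), G(e)))  # in-place update
```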
In the rest of this subsection we will derive three equations of this type, whose performance will be tested later in the paper.

Figure 1: Basic feedback control architectures. All the modules depicted by a circle represent populations of neural units whose output is a scalar value between 0 and 1 (e.g. firing rate neurons). A) A general setup; the goal is to produce the same value in S_P and S_D. B) Negative feedback control. C) Negative feedback control with dual populations, as described in section 3.2. Connections inside the gray dashed oval are adjusted using the rules of section 3.1.

In the simplest interpretation, G_j measures the activity of e_j, whereas H_i measures the activity of c_i(t − Δt). Setting G_j = e_j(t), H_i = c_i(t − Δt) produces Hebbian learning with a time delay. This rule, however, is not invariant to scaling or mean values, and the weights will tend to saturate. A more attractive choice is to use the correlation of the first derivatives. This provides a measure of whether c_i and e_j change together, in a way that is invariant to their mean values. The resulting learning rule is

\[ \dot{\omega}_{ij}(t) = -\alpha \, \dot{e}_j(t) \, \dot{c}_i(t - \Delta t). \]

The differential rule above can be improved by noticing that changes in e_j(t) could coincide temporally with those of c_i(t − Δt), while the main cause of those changes is in fact a different output c_k. It may thus be beneficial to introduce some competition among the synapses, as in

\[ \dot{\omega}_{ij}(t) = -\alpha \big( \dot{e}_j(t) - \langle \dot{e}(t) \rangle \big) \big( \dot{c}_i(t - \Delta t) - \langle \dot{c}(t - \Delta t) \rangle \big), \qquad (2) \]

where \( \langle \dot{e} \rangle \equiv \frac{1}{M} \sum_k \dot{e}_k \), and \( \langle \dot{c} \rangle \equiv \frac{1}{N} \sum_k \dot{c}_k \). This rule bears some resemblance to the relative gain array criterion [3], which measures interactions in decentralized control systems. This is explained in the Appendix.

The dependence of e on c may take the form D e = F(c), where D is a differential operator and F a vector function. This implies that the correlations may appear in mixed-order derivatives, as would be the case between force and displacement in a system following Newton's laws. We illustrate this with

\[ \dot{\omega}_{ij}(t) = -\alpha \big( \ddot{e}_j(t) - \langle \ddot{e}(t) \rangle \big) \big( \dot{c}_i(t - \Delta t) - \langle \dot{c}(t - \Delta t) \rangle \big). \qquad (3) \]

We may also introduce a dependency on the errors in the H_i term of equation 1. One case comes from introducing a "global" error \( \|e(t)\| = \sum_k e_k(t) \). In the architecture of section 3.2 (Fig. 1C) the e_k values are non-negative, so this sum of elements is the L_1 norm. We now interpret G_j as a measure of how active e_j(t) was compared with all the other e_k values, whereas H_i measures how much c_i(t − Δt) contributes to \( \|e\| \), using correlations among derivatives. An example of such a rule is

\[ \dot{\omega}_{ij}(t) = -\alpha \Big( \big[ \dot{e}_j(t) - \langle \dot{e}(t) \rangle \big] \big[ \|\dot{e}(t)\| \, \dot{c}_i(t - \Delta t) + \|\ddot{e}(t)\| \, \ddot{c}_i(t - \Delta t) \big] - \dot{\Lambda}(t) \, \dot{c}_i(t) \Big), \qquad (4) \]

where Λ is the scaled sum of inputs from C, namely \( \Lambda = \sum_k w_{ik} c_k \). The \( \dot{\Lambda}(t) \dot{c}_i(t) \) term is useful to desynchronize C units that inhibit each other, so they don't potentiate the same inputs. A motivation for using higher-order derivatives in this rule is that similarity between e(t) and c_i(t − Δt) will be reflected in similar Taylor expansions.

A different approach is to let H_i be an integral operator that measures the similarity between the power spectra of the error and command signals, which is invariant to phase differences. Due to its non-trivial neural implementation this is not pursued in this paper. As can be surmised, other rules can be devised; we limit ourselves to exploring equations 2, 3, and 4.
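As a concrete illustration, the following numpy sketch implements one Euler step of equations 2 and 3. The derivative estimates for the error and controller signals would in practice come from filtering the unit activities, and all constants are placeholders; the normalization step mirrors the divisive normalization applied to the S_PD/S_DP connections described below, assuming non-negative weights.

```python
import numpy as np

alpha, dt = 0.05, 0.01   # placeholder learning rate and Euler step

def rule2_step(W, e_dot, c_dot_delayed, target_sum=1.0):
    """One Euler step of equation (2). e_dot holds the error derivatives at
    time t; c_dot_delayed holds the C-unit derivatives at time t - Delta t."""
    e_dev = e_dot - e_dot.mean()                   # e'_j - <e'>
    c_dev = c_dot_delayed - c_dot_delayed.mean()   # c'_i - <c'>
    W += dt * (-alpha * np.outer(c_dev, e_dev))
    # Divisive normalization: rescale each unit's incoming weights so
    # their sum stays constant (assumes non-negative weights).
    W *= target_sum / W.sum(axis=1, keepdims=True)
    return W

def rule3_step(W, e_ddot, c_dot_delayed, target_sum=1.0):
    """Equation (3) has the same form as (2), but correlates the error's
    second derivative with the controller's first derivative."""
    return rule2_step(W, e_ddot, c_dot_delayed, target_sum)
```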
To test learning rules like equations 2, 3, and 4 we created a more biological version of the architecture in panel B of figure 1. This is depicted in panel C.

Since we interpret our units as firing rate neurons, we don't consider any negative activities, other than possibly in the plant. To handle positive and negative errors we use two populations, S_DP and S_PD. S_DP receives excitation from S_D, and inhibition from S_P (which would come from feedforward inhibition, not explicitly modelled). Inhibition and excitation are swapped in S_PD, so that together with S_DP these populations create a dual representation where the magnitude of the error increases the activity, regardless of its sign.

The C population maintains this duality, as it is divided into CE and CI components. Each unit in CE has a counterpart in CI, and they mutually inhibit one another, reflecting the organization of motor units in agonist-antagonist pairs, as commonly found in vertebrates.

It can be shown that using linear units and a learning rule as in equation 2 in the network of Fig. 1B allows convergence to fixed points with non-zero error (see Appendix). To avoid this we use a specialized type of unit inspired by intrinsic oscillators in the spinal cord [15]. The response of each unit in CE and CI consists of two parts. One is the integral of its inputs, and the other is a sinusoidal (or a rectified sinusoidal) whose amplitude is modulated by the input. The specific equations can be found in the Appendix. The rest of the units are sigmoidal, except for those in S_D, whose activity is a fixed function of time. Projections from P to S_P were one-to-one with unit weights. The weights of connections from S_PD/S_DP undergo divisive normalization [4] so that their sum remains constant.

We simulated each of the learning rules in equations 2, 3, and 4 for two simple models of the plant P. The learning rules depend on the relative timing of the error and command signals. To properly test them we need to consider transmission delays and latencies in the responses of the units, which may be significant in biological organisms. For this purpose we use the Draculab simulation software [32]. Each unit and synapse is modeled with an ordinary differential equation, and transmission delays are considered.

In the first model, each unit c_ej in CE was associated with a vector v_j, whereas the corresponding unit c_ij in CI was associated with −v_j. The output of the plant was a vector p defined as p = Σ_j (c_ej − c_ij) v_j, where c_ej, c_ij also denote the activity of those units. The task of the network in this case is akin to solving a linear system. For a given desired vector s_D in S_D, there is a vector p_D that makes the activity in S_P match s_D. The network must find a weight matrix W_CP from C to P, and a vector c of activities in C, such that p_D = W_CP c.

This problem can only be solved when W_CP has full rank. Moreover, as would be expected from a decentralized control system, as the interaction between the controlled variables increases the problem becomes more difficult, and performance begins to decrease [13, Ch. 74]. This means that increasing the number of columns with similar non-zero elements will cause interactions among the controllers, which may be reflected as a larger error.
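A static sketch of this first task (ignoring dynamics, delays, and learning) makes the rank condition concrete; the sizes and the choice of v_j vectors below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sp = 4                       # number of units in S_P (placeholder)
V = np.eye(n_sp)               # columns are the v_j vectors (here: identity)

def plant_output(c_e, c_i):
    """p = sum_j (c_ej - c_ij) v_j, written in matrix form."""
    return V @ (c_e - c_i)

# The task has an exact solution only if the matrix of v_j columns
# has full rank:
p_desired = rng.random(n_sp)
assert np.linalg.matrix_rank(V) == n_sp
net_drive = np.linalg.lstsq(V, p_desired, rcond=None)[0]   # c_e - c_i
print("residual:", np.linalg.norm(V @ net_drive - p_desired))
```

Replacing V with a Haar basis, an overcomplete matrix, or random columns reproduces the other connectivity types described next.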
To explore this we selected four types of v_j vectors. For the first type, the matrix with the v_j vectors as columns was the identity matrix. In the second type the v_j vectors constituted a Haar basis, which is an orthogonal basis with +1 and −1 entries; in this case the activity of any C unit will affect all errors. In the third type the set of v_j vectors was the union of the sets from the two previous cases, creating an overcomplete basis in the W_CP matrix. For the final type all v_j vectors were random, and there were 3 for each unit in S_P. Simulations were run for 1, 2, 4, and 8 units in S_P. Results are summarized in figure 2. The third and fourth types of connectivity are respectively labeled overcomplete and overcomplete2 in this figure.

It can be seen that tracking one or a few values is easy for any of the rules, but a higher dimensionality of S_P, and a high degree of interaction between its components, make the task more difficult. The effect of redundancy in the actuators (overcomplete W_CP) is less marked.

Figure 2: Simulation results for 4 types of connectivity matrices in a linear plant model. The number of values in S_P is labelled N in the x-axis. The y-axis indicates the time average of the norm ‖s_P − s_D‖ for the second half of the simulation, where s_P is the vector of activities in S_P, normalized so it has a unit norm for N > 1, and likewise for s_D. Gray markers indicate the same mean error when a simulation with the same characteristics was run with static synapses. In the case N = 1 only the identity matrix is tested.

The second plant model is a pendulum that stops when trying to rotate past a certain angle, so monotonic control is maintained. C uses two units, one providing clockwise and the other counterclockwise torque. The value in S_D represents a given angle, and the task is to move the pendulum to that angle so the activities in S_P and S_D can be equal.

Figure 3: Pendulum tracking a desired angle. Left: controller architecture. Each circle represents a single neuron, whereas the square represents the plant P. Blue connections are excitatory, red ones are inhibitory. θ represents the current angle in radians, whereas θ̇ is the angular velocity. These state variables are transformed into activation in the (0, 1) range by sigmoidal units in the A population. Both units in the M population receive all A signals. The connections from A to M (green dotted ovals) evolve following the input correlation rule, and the connections from M to C units (gray dotted ovals) evolve using the rule from equation 2. The output of the C units is mapped into either a positive or a negative torque (τ). Right: activity of S_D (yellow) and S_P throughout a simulation. The effect of the intrinsic 4 Hz oscillations of the C units can be observed in the S_P activity.

For this particular task the architecture of Fig. 1C was extended with a population M receiving the afferent activity A, which consisted of the two pendulum state variables (the angle θ and the angular velocity θ̇) in their non-negative (dual) representation, resulting in four inputs (figure 3). The gain of each M unit was modulated by one of the two errors, either s_DP or s_PD. The M units used the input correlation rule [22] to potentiate afferent inputs that correlate with the error that modulates them. This allows M to send C an error composed of afferent signals, resulting in a self-configuring proportional-derivative controller. This scheme is inspired by the long-loop reflex, consisting of signals that go from afferents to motor cortex, and from motor cortex to motoneurons.
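A minimal sketch of the input correlation update used by the M units, following the form of the rule in [22]; the constants and the finite-difference derivative estimate are assumptions.

```python
import numpy as np

mu, dt = 0.01, 0.01    # placeholder learning rate and Euler step

def ico_step(w, x, e, e_prev):
    """Input correlation learning for one M unit: each afferent weight w_i
    changes in proportion to its input x_i times the derivative of the
    error signal e that modulates the unit (finite-difference estimate)."""
    e_dot = (e - e_prev) / dt
    w += dt * mu * x * e_dot
    return w
```

Afferents whose activity consistently predicts changes in the error are thereby potentiated, which is how M comes to relay an error built from θ and θ̇, i.e. a proportional-derivative signal.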
Figure 3 shows a representative result, where the system learns to perceive a desired S_D angle in S_P. Full equations and simulation details can be found in the Appendix. The C units in this version of the model output the integral of their input, so this is in fact a rather non-standard PI controller. For the model mentioned in section 4.1 these integrators are replaced by a pair of sigmoidal units, similar to the classic Wilson-Cowan model [5].

Because the G_j, H_i terms in the learning rules of equations 2, 3, and 4 are monotonic functions, a c_i that causes either positive or negative e_j errors depending on the context will create inconsistent correlations, making the approach unlikely to succeed. A further complication is that the representation of sensory signals may not always be germane for negative feedback control. Muscle afferents use a firing rate code that provides information about the muscle's length, speed, and tension, but other afferents may provide a distributed representation, using a population of neurons where each one is tuned to a particular range of values (e.g. direction tuning in somatosensory cortex [21], or retinotopic location tuning in posterior parietal cortex [1]).

If a system such as ours is to be found at the bottom of a hierarchy that controls homeostatic variables, it should deal with the issues above. We have identified an actor-critic architecture that offers a plausible solution (figure 4, right). The actor C consists of one or more feedback controllers, each of which learns to associate each state with a target value and a configuration. The state network S does an expansive recoding of the activities in the S_D and S_P populations (as in [8]) that permits the V and C networks to learn functions of the state using a single layer. V consists of a single unit that learns a value associated with each activity vector in S, using a neural version of the temporal differences algorithm. The value produced by V is in turn used so that the actor C can learn a good controller configuration associated with each state. Notice that the inclusion of S_D in the state representation allows for a natural way to produce a value function that is also dependent on the goal [27].

To make this concrete we implemented this architecture to solve the pendulum control problem of the previous section, but in this case the pendulum is allowed to rotate freely. Given a desired angle θ_D and a current angle θ, this problem cannot be solved efficiently by a controller that responds proportionally to θ_D − θ, because θ is periodic (359 degrees and 1 degree are very close). The proportional controller does not cross the angle discontinuity, even if this is the shortest path to reach the desired angle.

To solve this problem we can give the controller two angle representations, each possessing a discontinuity in a different region of the input space, in this case 0 and 180 degrees. For each state the network learns which of these afferent inputs to use, and also which θ_D angle it should aim towards. The θ_D angle is learned (i.e. it is not directly set from S_D into C) so that distributed afferent representations can be handled; this amounts to performing a coordinate transformation, and it will become significant in section 4.2.
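The role of the two angle representations can be seen in a small numerical example. The 0 and 180 degree discontinuities are those described above; everything else is illustrative.

```python
import numpy as np

def rep_a(theta):
    """Angle mapped to [0, 2*pi): its discontinuity sits at 0 degrees."""
    return theta % (2.0 * np.pi)

def rep_b(theta):
    """Angle mapped to [-pi, pi): its discontinuity sits at 180 degrees."""
    return (theta + np.pi) % (2.0 * np.pi) - np.pi

theta, theta_d = np.deg2rad(350.0), np.deg2rad(10.0)
# A proportional controller driven by rep_a's error is pushed the long way
# around, because its error jumps when the pendulum crosses 0 degrees:
print(rep_a(theta_d) - rep_a(theta))   # about -5.93 rad
# rep_b's discontinuity (180 degrees) is not on the short path, so its
# error drives the pendulum directly through 0 degrees:
print(rep_b(theta_d) - rep_b(theta))   # about +0.35 rad
```

Learning which representation to use in each state is precisely the "controller configuration" that the actor acquires.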
Learning the angle representation and the desired angle greatly benefits from having a value associated with each state, providing a measure of distance between S_D and S_P. This can be obtained using standard RL methods, although modified to work with this neural network in real time. Let θ*_D be the angle where s_P = s_D, whereas θ_D denotes the target value of the controller. When θ ≈ θ*_D the system experiences a reward, used in a modified version of the neural TD learning rule [28] for the connections from S to V. The value that V outputs conflates two factors: the time that the controller requires to bring θ to θ_D, and the time it takes to go from θ_D to θ*_D.

The value from V can be used to train the connections from S to C so they choose the appropriate target value θ_D given the current context. To this end we use a form of reward-modulated Hebbian learning, where the value from V is used for modulation. Full equations are in the Appendix; a schematic, discrete-time version of these two updates is sketched at the end of this section.

Figure 4 shows representative results from a simulation of our actor-critic architecture. The system effectively learns to track a desired angle that changes through time in a noisy fashion. It should be observed that: 1) the final value function resembles the identity matrix, reflecting the fact that reward happens when s_P = s_D; 2) the system learns to choose the afferent representation ("Controller Choice" in the figure) whose discontinuity is not in the shortest path between θ and θ_D; 3) the system learns to set θ_D ≈ θ*_D regardless of θ, which is optimal in the case of our value function. Removing the dependency on θ in the policy is akin to reducing the dimensionality of the problem, which is possible thanks to the feedback controller in C.

An important challenge in the implementation of our architecture is that the three types of learning (learning the value, the desired angle, and the angle representation) must be decoupled. Our informal observations are that concurrent learning of the angle representation and the desired angle may cause interference. To decouple learning we thus had a training period where random target angles were given to the controller, allowing the system to learn both the value function and the best angle representation for each state. After this the policy (i.e. the target angle for each state) could be easily learned. Although it required two phases, the difficulty of training and the final performance of this model were similar to those of an actor-critic architecture where the policy simply consisted of choosing negative or positive torques (see Appendix).

Using the ideas presented here we extended the architecture of figure 3 for the task of planar arm reaching. The plant includes full double-pendulum dynamics, with six Hill-type muscles providing redundant actuation. Each muscle provides 3 signals arising from models of the Ia, Ib, and II muscle afferents. These respond non-linearly, and in the case of the Ia afferent the response can briefly be non-monotonic. All connections include transmission delays; unit and afferent activities come from differential equations presenting response latencies. The model receives desired values for the Ia and II afferent signals, and can autonomously learn how to produce them. To our knowledge no biologically plausible model has managed to do this before using only neurons. Due to the model's complexity, and to its potential biological insights, the full description of this work is presented in a separate paper (in preparation).
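The schematic version promised above: a discrete-time simplification of the two updates of the actor-critic model. The paper's real-time, continuous versions with delays are in the Appendix; the linear value readout and the constants here are assumptions made for illustration.

```python
import numpy as np

gamma, alpha_v, alpha_c = 0.95, 0.05, 0.01   # placeholder constants

def value_td_step(w_v, s, s_next, reward):
    """TD update for the S -> V connections, assuming a linear value
    readout v = w_v . s over the state-population activity s."""
    delta = reward + gamma * float(w_v @ s_next) - float(w_v @ s)
    w_v += alpha_v * delta * s               # move v(s) toward its TD target
    return delta

def actor_hebb_step(W_c, s, c, value):
    """Value-modulated Hebbian update for the S -> C connections: the
    state/controller correlation is gated by the value signal from V."""
    W_c += alpha_c * value * np.outer(c, s)
```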
The architecture illustrated in figure 4 has one important characteristic: C is a feedback controller controlling θ, embedded within a feedback controller controlling s_P. Nothing stops us from replacing the negative feedback controller in C with a more general feedback controller of the type depicted in figure 4, so that the action of the high-level controller is to set the target value of the lower-level controller. Notice that the coordinate transformation done in the model of section 3.4 to find a target angle is, in this general setting, a process of subgoal selection. By generating rewards for a lower level when an "s_P = s_D" event occurs at the next level, we can construct a hierarchical neural reinforcement learning system, potentially capable of handling more complex tasks (a schematic version of this idea is sketched at the end of this section). In fact, the architecture of figure 4 is an adapted version of a biologically-inspired hierarchical architecture whose publication is under preparation (in a paper distinct from that mentioned in section 4.1).

The models in sections 3.3 and 3.4 present different solutions to the exploration vs. exploitation dilemma. The model of section 3.4 has two types of exploration. First is the noisy training signal, constantly forcing changes in the target angles. Second, the network is initially trained with a random policy, while the controller configuration (i.e. the angle representation) assigned to each state is being learned. In this case the network goes through a development period with poor initial performance, because this is useful to decouple the learning of policies and of controller configurations. In contrast, in section 3.3 exploration happens because pulses or oscillations are produced by the controller in response to the error signals. These are reminiscent of the jerks and twitches happening at the early stages of motor learning in animals, which for some mammals may happen before birth, and lead to the foundations of functional proprioception and motor coordination [25]. For the learning rules in section 3.3, using a constantly shifting desired state to encourage exploration would be counterproductive, because sudden shifts in the desired state would appear as shifts in the error signal, interfering with learning.

Generalizing, training a hierarchical system with the architecture we have suggested would involve starting with the lowest hierarchical levels, and proceeding to the next level once a reasonable performance is achieved. At each level, sensory representation training happens first through exploratory policies, followed by more motivated behaviour.

Overall, our results show how to produce autonomous motor control through a hierarchy of feedback controllers. The learning rules of section 3.1 can readily learn the correlations between sensory features and actuator outputs (which tend to remain stable), whereas more complex, context-dependent control rules can be learned using a second hierarchical level such as that of section 3.4. We expect that through further hierarchical extensions, such as those described in section 4.2, we will approach truly flexible, animal-like control.
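As a closing illustration of the nesting described above, the sketch below shows how the "s_P = s_D" event at one level could generate the reward for the level below. The Level class, the tolerance, and the stub methods are all hypothetical, not the paper's implementation.

```python
import numpy as np

TOL = 0.05   # hypothetical tolerance for the "s_P = s_D" event

class Level:
    """Schematic hierarchy node: this level's action is the target (s_D)
    of its child level, and the child is rewarded whenever this level's
    own error vanishes."""
    def __init__(self, controller, child=None):
        self.controller = controller   # maps (s_d, s_p) to an action/subgoal
        self.child = child

    def step(self, s_d, s_p, child_s_p=None):
        action = self.controller(s_d, s_p)
        if self.child is not None:
            # the action selects the subgoal for the child controller
            self.child.step(action, child_s_p)
            if np.linalg.norm(s_p - s_d) < TOL:
                self.child.receive_reward(1.0)   # hypothetical reward delivery
        return action

    def receive_reward(self, r):
        pass   # stub: would drive this level's TD and actor updates
```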
Supplementary Material
The Appendix and source code for this paper can be obtained from: https://gitlab.com/sergio.verduzco/public_materials in the neurips_2020 folder.
References

[1] R. A. Andersen, G. K. Essick, and R. M. Siegel. Encoding of spatial location by posterior parietal neurons. Science, 230(4724):456–458, October 1985.
[2] Matthew Michael Botvinick. Hierarchical reinforcement learning and decision making. Current Opinion in Neurobiology, 22(6):956–962, December 2012.
[3] E. Bristol. On a new measure of interaction for multivariable process control. IEEE Transactions on Automatic Control, 11(1):133–134, January 1966.
[4] Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51–62, January 2012.
[5] G. Bard Ermentrout and David H. Terman. Mathematical Foundations of Neuroscience. Springer Science & Business Media, New York, NY, 2010.
[6] Michael J. Frank and David Badre. Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: Computational analysis. Cerebral Cortex, 22(3):509–526, March 2012.
[7] Dongqi Han, Kenji Doya, and Jun Tani. Self-organization of action hierarchy and compositionality by reinforcement learning with recurrent neural networks. arXiv:1901.10113 [cs, stat], November 2019.
[8] Bernd Illing, Wulfram Gerstner, and Johanni Brea. Biologically plausible deep learning — but how far can we go with shallow networks? Neural Networks, 118:90–101, October 2019.
[9] Michael I. Jordan and David E. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307–354, 1992.
[10] Christoph Kolodziejski, Bernd Porr, and Florentin Wörgötter. On the asymptotic equivalence between differential Hebbian and temporal difference learning. Neural Computation, 21(4):1173–1202, 2008.
[11] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3675–3683. Curran Associates, Inc., 2016.
[12] M. Kuperstein. Neural model of adaptive hand-eye coordination for single postures. Science, 239(4845):1308–1311, March 1988.
[13] William Levine, editor. The Control Systems Handbook: Control System Advanced Methods, Second Edition. CRC Press, 2011.
[14] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, 2012.
[15] Eve Marder and Dirk Bucher. Central pattern generators and the control of rhythmic movements. Current Biology, 11(23):R986–R996, November 2001.
[16] Josh Merel, Matthew Botvinick, and Greg Wayne. Hierarchical motor control in mammals and machines. Nature Communications, 10(1):1–12, December 2019.
[17] Hiroyuki Miyamoto, Mitsuo Kawato, Tohru Setoyama, and Ryoji Suzuki. Feedback-error-learning neural network for trajectory control of a robotic manipulator. Neural Networks, 1(3):251–265, January 1988.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[19] Kenji Morita, Jenia Jitsev, and Abigail Morrison. Corticostriatal circuit mechanisms of value-based action selection: Implementation of reinforcement learning algorithms and beyond. Behavioural Brain Research, 311:110–121, September 2016.
[20] D. A. Novikov. Cybernetics: From Past to Future. Springer, 2015.
[21] Yu-Cheng Pei, Steven S. Hsiao, James C. Craig, and Sliman J. Bensmaia. Shape invariant coding of motion direction in somatosensory cortex. PLoS Biology, 8(2), February 2010.
[22] Bernd Porr and Florentin Wörgötter. Strongly improved stability and faster convergence of temporal sequence learning by using input correlations only. Neural Computation, 18(6):1380–1412, 2006.
[23] William T. Powers. Behavior: The Control of Perception (2nd ed., rev. & exp.). Benchmark Press, New Canaan, CT, 2005.
[24] Daniel Rasmussen, Aaron Voelker, and Chris Eliasmith. A neural model of hierarchical reinforcement learning. PLOS ONE, 12(7):e0180234, July 2017.
[25] Scott R. Robinson, Gale A. Kleven, and Michele R. Brumley. Prenatal development of interlimb motor learning in the rat fetus. Infancy, 13(3):204–228, May 2008.
[26] Uri Rokni. Neural networks for control. In Encyclopedia of Neuroscience, pages 2592–2596. Springer, Berlin, Heidelberg, 2009.
[27] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, pages 1312–1320, 2015.
[28] Wolfram Schultz, Peter Dayan, and P. Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, March 1997.
[29] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017.
[30] Eduardo D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer Science & Business Media, 2013.
[31] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[32] Sergio Verduzco-Flores and Erik De Schutter. Draculab: A Python simulator for firing rate neural networks with delayed adaptive connections. Frontiers in Neuroinformatics, 13, 2019.