CPG-ACTOR: Reinforcement Learning for Central Pattern Generators
Luigi Campanaro, Siddhant Gangapurwala, Daniele De Martini, Wolfgang Merkt, Ioannis Havoutis
Abstract — Central Pattern Generators (CPGs) have several properties desirable for locomotion: they generate smooth trajectories, are robust to perturbations and are simple to implement. Although conceptually promising, we argue that the full potential of CPGs has so far been limited by insufficient sensory-feedback information. This paper proposes a new methodology that allows tuning CPG controllers through gradient-based optimisation in a Reinforcement Learning (RL) setting. To the best of our knowledge, this is the first time CPGs have been trained in conjunction with a Multilayer Perceptron (MLP) network in a Deep-RL context. In particular, we show how CPGs can directly be integrated as the Actor in an Actor-Critic formulation. Additionally, we demonstrate how this change permits us to integrate highly non-linear feedback directly from sensory perception to reshape the oscillators' dynamics. Our results on a locomotion task using a single-leg hopper demonstrate that explicitly using the CPG as the Actor rather than as part of the environment results in a significant increase in the reward gained over time (roughly six times more) compared with previous approaches. Furthermore, we show that our method without feedback reproduces results similar to prior work with feedback. Finally, we demonstrate how our closed-loop CPG progressively improves the hopping behaviour over longer training epochs relying only on basic reward functions.

I. INTRODUCTION
The increased manoeuvrability of legged robots in comparison to wheeled or crawling robots necessitates complex planning and control solutions. In particular, the requirement to maintain balance while interacting with an uncertain environment under noisy sensing severely restricts the time algorithms can spend on computing new solutions in response to perturbations or changes in the environment. This complexity is further increased by the high dimensionality of the problem, uncertainty about the environment and robot models, and physical constraints. The current state of the art for high-performance locomotion comprises modular, model-based controllers which break the control problem down into different sub-modules [1], [2]: first, trajectory optimisation defines a motion plan over a longer time horizon using approximated models for computational efficiency; this plan is then tracked using advanced whole-body controllers which operate on the full dynamics model and provide robustness to external disturbances. This rigorous approach is rooted in the knowledge of every portion of the motion, but it is also limited by heuristics handcrafted by engineers at each of the stages. In fact, many systems need to estimate ground contact or slippage to trigger transitions between states or reflexes [3], [4]. Such estimation is often based on heuristically-set thresholds, yet it is sensitive to unmodelled aspects of the environment.

All authors are with the Oxford Robotics Institute, University of Oxford, UK. Emails: {luigi, siddhant, daniele, wolfgang, ioannis}@robots.ox.ac.uk.

Fig. 1: The experiments are carried out on a classic Reinforcement Learning (RL) benchmark – the single-leg hopper – in a custom environment based on the ANYmal quadruped robot [9]. It can hop along the vertical axis and is controlled by Central Pattern Generators (CPGs). Closed-loop feedback is incorporated using a jointly trained Multilayer Perceptron (MLP) network which processes joint sensing observations to reshape the oscillator dynamics of the CPGs.

Often, the computations behind these controllers are so expensive that dealing with sudden disturbances is beyond their abilities, and simplifications of the dynamic models are needed to meet the re-planning time requirements, resulting in a loss of dynamism and performance [5]. While the field of legged robot control has been dominated over the last decades by conventional control approaches, data-driven methods recently demonstrated unprecedented results that outpaced most classical approaches in terms of robustness and dynamic behaviours [6]–[8]. These controllers often employ a parametrised policy to map sensory information to low-level actuation commands, and are tuned to optimise a given reward function on data acquired by running the controller itself, which improves with experience. In particular, controllers trained using deep-RL utilise a Neural Network (NN) policy to perform this mapping. As a result, controllers trained with RL exhibit behaviours that cannot be hand-crafted by engineers and are further robust to events encountered during the interaction with the environment. However, widely-used NN architectures, such as MLPs, do not naturally produce the oscillatory behaviour exhibited in natural locomotion gaits and as such require long training procedures to learn to perform smooth oscillations.

A third family of controllers has been used with promising results for robot locomotion: CPGs, biologically-inspired neural networks able to produce rhythmic patterns. Indeed, the locomotor system of vertebrates is organised such that the CPGs – located in the spine – are responsible for producing the basic rhythmic patterns, while higher-level centres (the motor cortex, cerebellum, and basal ganglia) are responsible for modulating the resulting patterns according to environmental conditions [10]. Besides the intrinsic oscillatory behaviour, several other properties make the usage of CPGs desirable for the locomotion task; these include (1) the generation of smooth and rhythmic patterns which are resilient against state perturbations (due to their limit cycle), (2) minimal control dimensionality, i.e. few high-level signals are needed to control a robot, (3) implementation simplicity (eq. (1) fully describes the model) and (4) being model-free, hence well adapted to locomotion in unknown environments [11]. However, very few design principles are available, especially for the integration of sensor feedback in such systems [11], and, although conceptually promising, we argue that the full potential of CPGs has so far been limited by insufficient sensory-feedback integration.

The ability of Deep-NNs to discover and model highly non-linear relationships between the observations – the inputs – and the control signals – the outputs – makes such approaches appealing for control.
In particular, Deep-RL, built on Deep-NNs, has demonstrated very convincing results in solving complex locomotion tasks [6], [7] and does not require direct supervision, learning instead through interaction with the task. Hence, we argue that combining Deep-RL with CPGs could improve the latter's comprehension of the surrounding environment. However, optimising Deep-NN architectures in conjunction with CPGs requires adequate methods capable of propagating the gradient from the loss to the parameters, also known as backpropagation. In contrast, methodologies that are more commonly applied in tuning CPGs, such as Genetic Algorithms (GA), Particle Swarm Optimisation (PSO) and hand-tuning, are rarely used for NN applications due to the very high dimensionality of the latter's search space.

Concisely, model-based control requires expert tuning and is computationally demanding during runtime; deep-RL controllers are computationally cheap during runtime, but require offline exploration and "discovery" of concepts already known for locomotion (limit cycles, oscillatory behaviour, etc.) from scratch, which leads to long training times and careful tuning of reward functions. CPGs, instead, use concepts developed from bio-inspired sensorimotor control and are computationally cheap during runtime, but are challenging to tune and to incorporate feedback within. To address this, this paper introduces a novel way of using Deep-NNs to incorporate feedback into a fully differentiable CPG formulation, and applies Deep-RL to jointly learn the CPG parameters and the MLP feedback.

A. Related Work
Our work is related to both the field of CPG design and that of RL, in particular to the application of the latter for the optimisation of the former's parameters. CPGs are very versatile and have been used for different applications including non-contact tasks such as swimmers [10], [12], modular robots [13], [14] and locomotion of small quadrupeds [11], [15]–[17]. The CPGs adopted in our research are modelled as Hopf non-linear oscillators (cf. eq. (1)), which have been successfully transferred to small quadrupedal systems and have exhibited dynamic locomotion behaviours [15]–[17].

The trajectories CPGs generate are used as references for each of the actuators during locomotion, and a tuning procedure is required to reach coordination. The optimisation of CPG-based controllers usually occurs in simulation through GA [10], PSO [14], [18] or expert hand-tuning [11], [15]–[17]. Prior work has evaluated the performance of CPGs for blind locomotion over flat ground [18]. However, to navigate on rough terrain, sensory feedback is crucial (e.g. in order to handle early or late contact), as shown in [15]: here, a hierarchical controller was designed in which the CPGs relied on a state machine controlling the activation of the feedback. In particular, the stumbling correction and leg extension reflexes are constant impulses triggered by the state machine. The attitude control, meanwhile, relies on information such as the contact status of each leg, the joint angles read by encoders and the rotation matrix indicating the orientation of the robot's trunk; all these data are processed in a virtual model control fashion and then linearly combined with the CPG equations, eq. (1). Finally, the angle of attack between leg and terrain is useful to accelerate or decelerate the body or to locomote on slopes: it is controlled by the sagittal hip joints and is linearly combined with eq. (1) to provide feedback.

Similarly to [15], [17] also uses feedback, this time based on gyroscope velocities and optical flow from a camera, to modify the CPG output in order to maintain balance. However, in [17] the authors first tune the CPGs in an open-loop setting and then train a NN with PSO to provide feedback (at this stage the parameters of the CPGs are kept fixed). Their method relies on a simple NN with 7 inputs – 4 from the camera/optical flow and 3 from the gyroscope – and a single hidden layer. We follow the same design philosophy in the sense that we preprocess the sensory feedback through a NN; yet, we propose to tune its parameters in conjunction with the CPG. We argue that in this way the full potential of the interplay of the two can be exploited. In particular, this effectively allows the feature processing of raw signals to be learnt from experience.

RL promises to overcome the limitations of model-based approaches by learning effective controllers directly from experience. Robotics tasks in RL – such as the hopper considered in this work (Fig. 1) – are challenging as their action space is continuous and the set of possible actions is infinite. Hence, any method based on learning the action values (the expected discounted reward received by following a policy) must search through this infinite set in order to select an action. In contrast, actor-critic methods rely on an explicit representation of the policy independent from the value function. The policy is known as the actor, because it is used to select actions, while the estimated value function
is known as the critic, because it criticises the actions taken by the actor [23], as shown in Fig. 2a. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. In Deep-RL, both are classically approximated by NNs.

Researchers have applied RL to optimise CPGs in different scenarios [19]–[22]. The common factor among them is the formulation of the actor-critic method; yet, they include the CPG controller in the environment, as depicted in Fig. 2b. In other words, the CPG is part of the (black-box) environment dynamics. According to the authors of [22], the motivations for including CPGs in the environment are their intrinsic recurrent nature and the amount of time necessary to train them, since CPGs have been regarded as Recurrent Neural Networks (RNNs), which are computationally expensive and slow to train. In [19], [20], during training and inference the policy outputs a new set of parameters for the CPGs in response to observations from the environment at every time-step. In this case, the observations processed by the actor network – which usually represent the feedback – are responsible for producing a meaningful set of CPG parameters for the current state. Conversely, in [21], [22] the parameters are fixed and, similarly to [17], the CPGs receive inputs from the policy. However, whether the CPG parameters were renewed every time-step or kept fixed, all of these works considered CPGs as part of the environment rather than making use of their recurrent nature as stateful networks. We exploit this observation in this paper.

Fig. 2: (a) represents the basic actor-critic Deep-RL method adopted for continuous action space control. (b) illustrates the approach proposed in [19]–[22], which consists of a classic actor-critic with CPGs embedded in the environment. (c), instead, is the approach proposed in the present work, which includes the CPGs alongside the MLP network in the actor-critic architecture.
B. Contributions
In this work, we combine the benefits of CPGs and RL and present a new methodology for designing CPG-based controllers. In particular, and in contrast to prior work, we embed the CPG directly as the actor of an Actor-Critic framework instead of it being part of the environment. The advantage of directly embedding a dynamical system is to directly encode knowledge about the characteristics of the task (e.g., periodicity) without resorting to recurrent approaches. The outcome is CPG-ACTOR, a new architecture that allows end-to-end training of CPGs and an MLP by means of Deep-RL. In particular, our contributions are:

1) For the first time – to the best of our knowledge – the parameters of the CPGs can be directly trained through state-of-the-art gradient-based optimisation techniques such as Proximal Policy Optimisation (PPO) [24], a powerful RL algorithm. To make this possible, we propose a fully differentiable CPG formulation (Sec. II-A) along with a novel way of capturing the state of the CPG without unrolling its recurrent state (Sec. II-B).
2) Exploiting the fully differentiable approach further enables us to incorporate and jointly tune an MLP network in charge of processing feedback in the same pipeline.
3) We demonstrate roughly six times better training performance compared with previous state-of-the-art approaches (Sec. IV).

II. METHODOLOGY
Differently from the previous approaches presented in Sec. I-A, we embed CPGs directly as part of the actor in an actor-critic framework, as shown in Fig. 2c. Indeed, the policy NN has been replaced by a combination of an MLP network for sensory pre-processing and CPGs for action computation, while the value function is still approximated by an MLP network. These measures ensure that the parameters of the CPGs are fixed while interacting with the environment and during inference, presenting an alternative (and more direct) way of tuning classical CPG-based controllers.

However, a naïve integration of CPGs into the Actor-Critic formulation is error-prone and special care needs to be taken:
• to attain differentiability through the CPG actor in order to exploit gradient-based optimisation techniques;
• not to neglect the hidden state, as CPGs are stateful networks.
We analyse these aspects separately in the following sections.

A. Differentiable Central Pattern Generators
Parallel implementations of RL algorithms spawn the same policy π_θ on parallel instances of the same robot to quickly gather more experience. Once the interaction with the simulation environment ends, the observations are fetched in batches and used to update the actor and the critic. Instead of selecting the best-fitted controller, as GA does, the update is based on gradient descent algorithms, such as Adam [25]. Consequently, the implementation of the CPGs must be differentiable.

Fig. 3: The images show the difference between backpropagation for classic RNNs (3a) and CPGs (3b). In particular, to train RNNs, the matrices W_xh, W_hy, W_hh have to be tuned, where W_hh regulates the evolution between two hidden states. Instead, for CPGs only the parameters in θ̇_i and r̈_i (eq. (1)) need tuning, while the evolution of the hidden state is determined by eq. (2).
1) Hopf Oscillators:
As the underlying oscillatory equations for our CPG network, we choose to utilise the Hopf oscillator, as in [12]. However, since the equations in [12] describe a system in continuous time, we need to discretise them for use as a discrete-time robot controller, as in eq. (1):

\begin{aligned}
\dot{\theta}_i^t &= 2\pi\nu_i(d_i^t) + \zeta_i^t + \xi_i^t \\
\zeta_i^t &= \textstyle\sum_j r_j^{t-1} w_{ij} \sin\!\left(\theta_j^{t-1} - \theta_i^{t-1} - \varphi_{ij}\right) \\
\ddot{r}_i^t &= a_i\!\left(a_i\left(\rho_i(d_i^t) - r_i^{t-1}\right) - \dot{r}_i^{t-1}\right) + \kappa_i^t \\
x_i^t &= r_i^t \cos(\theta_i^t)
\end{aligned} \tag{1}

where the superscript t denotes the value at the t-th time-step, θ_i and r_i are the scalar state variables representing the phase and the amplitude of oscillator i respectively, ν_i and ρ_i determine its intrinsic frequency and amplitude as functions of the input command signals d_i, and a_i is a positive constant governing the amplitude dynamics. The effects of the couplings between oscillators are accounted for in ζ_i, and the specific coupling between oscillators i and j is defined by the weight w_{ij} and phase φ_{ij}. The signal x_i represents the burst produced by the oscillatory centre, used as position reference by the motors. Finally, ξ_i and κ_i are the feedback components provided by the MLP network.

To calculate the variables r and θ from their derivatives, we apply a trapezoidal rule, as in eq. (2):

\begin{aligned}
\theta^t &= \theta^{t-1} + \tfrac{dt}{2}\left(\dot{\theta}^{t-1} + \dot{\theta}^{t}\right) \\
\dot{r}^t &= \dot{r}^{t-1} + \tfrac{dt}{2}\left(\ddot{r}^{t-1} + \ddot{r}^{t}\right) \\
r^t &= r^{t-1} + \tfrac{dt}{2}\left(\dot{r}^{t-1} + \dot{r}^{t}\right)
\end{aligned} \tag{2}

where dt is the time-step duration.
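As a concrete illustration of the discrete-time update in eqs. (1) and (2), the following is a minimal Python sketch for a single oscillator. The function name, the choice of passing ν and ρ as callables of the command d, and the default dt are illustrative assumptions, not the authors' implementation.

    import math

    def hopf_step(state, nu, rho, a, d, zeta=0.0, xi=0.0, kappa=0.0, dt=0.01):
        """One discrete-time update of a single Hopf oscillator, eqs. (1)-(2).

        state      : previous (theta, theta_dot, r, r_dot, r_ddot)
        nu, rho    : callables giving intrinsic frequency/amplitude for command d
        a          : positive constant governing the amplitude dynamics
        zeta       : coupling term from the other oscillators (0 if isolated)
        xi, kappa  : feedback terms, e.g. produced by the MLP-feedback network
        """
        theta, theta_dot, r, r_dot, r_ddot = state

        # eq. (1): new derivatives of phase and amplitude
        new_theta_dot = 2.0 * math.pi * nu(d) + zeta + xi
        new_r_ddot = a * (a * (rho(d) - r) - r_dot) + kappa

        # eq. (2): trapezoidal integration of phase and amplitude
        new_theta = theta + 0.5 * dt * (theta_dot + new_theta_dot)
        new_r_dot = r_dot + 0.5 * dt * (r_ddot + new_r_ddot)
        new_r = r + 0.5 * dt * (r_dot + new_r_dot)

        x = new_r * math.cos(new_theta)  # position reference sent to the motor
        return x, (new_theta, new_theta_dot, new_r, new_r_dot, new_r_ddot)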
2) Tensorial implementation:
The tensorial operations have to be carefully implemented to allow a correct flow of the gradient and batched computations, both crucial for updating the actor-critic framework. Let N be the number of CPGs in the network; then:

\begin{aligned}
\dot{\Theta}^t &= 2\pi C_\nu(V, D^t) + Z^t \mathbb{1} + \Xi^t \\
Z^t &= (WV) * (\Lambda R^{t-1}) * \sin\!\left(\Lambda\Theta^{t-1} - \Lambda^{\top}\Theta^{t-1} - \Phi V\right) \\
\ddot{R}^t &= (AV) * \left(AV\left(P(V, D^t) - R^{t-1}\right) - \dot{R}^{t-1}\right) + K^t \\
X^t &= R^t \cos(\Theta^t)
\end{aligned} \tag{3}

Here, Θ ∈ R^N and R ∈ R^N are the vectors containing θ_i and r_i, while Ξ ∈ R^N and K ∈ R^N contain ξ_i and κ_i respectively. V ∈ R^M contains the M constant parameters to be optimised of the network composed of the N CPGs. C_ν : R^M × R^d → R^N, P : R^M × R^d → R^N and A ∈ R^{N×M} map the set V and the command D^t ∈ R^d to the parameters corresponding to ν_i, ρ_i and a_i respectively.

Z ∈ R^{N×N}, instead, takes into consideration the coupling effect of each CPG on every other CPG; the overall effect on the i-th CPG is then the sum of the i-th row of Z, as in Z𝟙, where 𝟙 is a vector of N elements with value 1. Within Z, W ∈ R^{N×N×M} and Φ ∈ R^{N×N×M} extract the coupling weights and phases from V, while Λ ∈ R^{N×N×N} encodes the connections among the nodes of the CPG network.
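The sketch below shows, under assumed tensor names and shapes, how this update can be written with vectorised, differentiable operations (here in PyTorch) so that gradients reach both the CPG parameters and the feedback terms. It implements the per-oscillator form of eqs. (1) and (2) over N coupled oscillators rather than the exact Λ-based construction of eq. (3), and the dependence of ν and ρ on the command d is folded into constant per-oscillator tensors for simplicity.

    import math
    import torch

    def cpg_step(theta, theta_dot, r, r_dot, r_ddot, nu, rho, a, w, phi, xi, kappa, dt):
        """Vectorised, differentiable update of N coupled oscillators (eqs. (1)-(2)).

        theta, theta_dot, r, r_dot, r_ddot : previous state, tensors of shape (N,)
        nu, rho, a : per-oscillator parameters (N,), derived from the learnable set V
        w, phi     : coupling weights and phase biases, tensors of shape (N, N)
        xi, kappa  : feedback terms (N,), e.g. the output of the MLP-feedback network
        Every operation is a torch op, so gradients flow to the CPG and MLP parameters.
        """
        # coupling zeta_i = sum_j r_j * w_ij * sin(theta_j - theta_i - phi_ij)
        phase_diff = theta.unsqueeze(0) - theta.unsqueeze(1) - phi   # entry (i, j) = theta_j - theta_i - phi_ij
        zeta = (w * r.unsqueeze(0) * torch.sin(phase_diff)).sum(dim=1)

        new_theta_dot = 2.0 * math.pi * nu + zeta + xi               # eq. (1), phase dynamics
        new_r_ddot = a * (a * (rho - r) - r_dot) + kappa             # eq. (1), amplitude dynamics

        new_theta = theta + 0.5 * dt * (theta_dot + new_theta_dot)   # eq. (2), trapezoidal integration
        new_r_dot = r_dot + 0.5 * dt * (r_ddot + new_r_ddot)
        new_r = r + 0.5 * dt * (r_dot + new_r_dot)

        x = new_r * torch.cos(new_theta)                             # joint position references
        return x, (new_theta, new_theta_dot, new_r, new_r_dot, new_r_ddot)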
B. Recurrent state in CPGs

In order to efficiently train CPGs in an RL setting, we need to overcome the limitations highlighted in [22]: particularly, that CPGs are recurrent networks and that RNNs take a significant time to train. In this section, we show how we can reframe CPGs as stateless networks and fully determine the state from our observation without the requirement to unroll the RNN.

Stateless networks, such as MLPs, do not need any information from the previous state to compute the next step, and their backpropagation procedure is fast and straightforward. RNNs, on the other hand, are stateful networks, i.e. the state of the previous time-step is needed to compute the following step's output. As a consequence, they are computationally more expensive and require a specific procedure to be trained. RNNs rely on Backpropagation Through Time (BPTT), Fig. 3a, which is a gradient-based technique specifically designed to train stateful networks. BPTT unfolds the RNN in time: the unfolded network contains t inputs and outputs, one for each time-step. As shown in Fig. 3a, the mapping from an input x_t to an output y_t depends on three different matrices: W_xh determines the transition between x_t and the hidden state h, W_hy regulates the transformation from h_t to y_t and, lastly, W_hh governs the evolution between two hidden states. All the matrices W_xh, W_hy, W_hh are initially unknown and tuned during the optimisation.

Undeniably, CPGs have a recurrent nature and as such require storing the previous hidden state. However, differently from RNNs, the transition between consecutive hidden states in CPGs is determined a priori using eq. (2), without the need to tune W_hh. This observation has two significant consequences: firstly, CPGs do not have to be unrolled to be trained, since, given the previous state and the new input, the output is fully determined. Secondly, eliminating W_hh has the additional benefit of entirely excluding gradient explosion or vanishing during training; both points are illustrated in Fig. 3b. As a result, CPGs can be framed as a stateless network on condition that the previous state is passed as an input to the system.
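To illustrate how the stateless reframing combines with the jointly trained feedback network, below is a hedged sketch of an actor module. It reuses the cpg_step function sketched above; the class name, layer sizes, parameterisation and dt value are assumptions for illustration rather than the authors' code.

    import torch
    import torch.nn as nn

    class CPGActor(nn.Module):
        """Sketch of the actor of Fig. 2c: an MLP maps observations to the feedback
        terms (xi, kappa), and the differentiable CPG update turns them into joint
        position targets. The previous oscillator state is carried alongside the
        observations in the rollout buffer, so no BPTT unrolling is needed."""

        def __init__(self, obs_dim, n_osc, hidden=64, dt=0.01):
            super().__init__()
            self.dt = dt
            self.feedback_mlp = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 2 * n_osc),  # xi and kappa for each oscillator
            )
            # learnable CPG parameters (playing the role of the set V in eq. (3))
            self.nu = nn.Parameter(torch.ones(n_osc))
            self.rho = nn.Parameter(torch.ones(n_osc))
            self.a = nn.Parameter(torch.ones(n_osc))
            self.w = nn.Parameter(torch.zeros(n_osc, n_osc))
            self.phi = nn.Parameter(torch.zeros(n_osc, n_osc))

        def forward(self, obs, cpg_state):
            # cpg_state = (theta, theta_dot, r, r_dot, r_ddot), stored with the observations
            xi, kappa = self.feedback_mlp(obs).chunk(2, dim=-1)
            action, new_state = cpg_step(*cpg_state, self.nu, self.rho, self.a,
                                         self.w, self.phi, xi, kappa, self.dt)
            return action, new_state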
III. EVALUATION

The two main components of our approach (Fig. 2c) are the environment (Fig. 1) and the agent, part of which is CPG-ACTOR. We evaluate our method on a classic RL benchmark, the hopping leg [26], [27], which also suits CPGs well. In fact, a single leg needs only two joints to hop, and this is the minimal configuration required by coupled Hopf oscillators to express their complete form; with fewer than two, the coupling terms in eq. (1) would cancel out. In order to exclude additional sources of complexity, we tested the efficacy of the new method in this minimal configuration first; however, we also plan to address the full locomotion task in the future, and developing an environment with realistic masses, forces, inertias and robot dimensions builds a solid base for further development.

Hence, we based the environment on a single leg of the ANYmal quadruped robot, which was fixed to a vertical slider. Its mass is .42 kg and it is actuated by two series-elastic actuators capable of 40 N m of torque and a maximum joint velocity of 15 rad s−1. We adopted PyBullet [28] to simulate the dynamics of the assembly and to extract the relevant information.

At every time-step the following observations are captured: the joints' measured positions p_j^m and velocities v_j^m, the desired positions p_j^d, and the position p_h and velocity v_h of the hip attached to the rail. The desired torques t_j^d and the planar velocity of the foot v_f^{x,y} are instead used in computing the rewards, as described below. To train CPG-ACTOR, we formulate a reward function as the sum of five distinct terms, each focusing on a different aspect of the desired behaviour:

\begin{aligned}
r_1 &= c_1 \cdot \max(v_h, 0) \\
r_2 &= \textstyle\sum_{\text{joint}} c_2 \cdot (p_j^d - p_j^m)^2 \\
r_3 &= \textstyle\sum_{\text{joint}} c_3 \cdot (v_j^m)^2 \\
r_4 &= \textstyle\sum_{\text{joint}} c_4 \cdot (t_j^d)^2 \\
r_5 &= c_5 \cdot \left\| v_f^{x,y} \right\|^2
\end{aligned} \tag{4}

where c_1 ≥ 0 and c_2, c_3, c_4, c_5 ≤ 0 are the weights associated with each reward term. In particular, r_1 promotes vertical jumping, r_2 encourages the reduction of the error between the desired and the measured position, r_3 and r_4 reduce respectively the measured velocity and the desired torque of the motors and, finally, r_5 discourages the foot from slipping.

Although CPG-ACTOR has been extensively treated in Sec. II, it is important to stress that it has been integrated into an existing RL framework based on OpenAI Baselines [29]. This allows us to exploit standard, well-tested RL implementations, parallel environment optimisation and GPU computation, and it makes the approach easy to extend to other algorithms as they share the same infrastructure.
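For concreteness, the reward of eq. (4) can be computed from the simulator readings roughly as follows. This is a minimal sketch assuming NumPy arrays for the per-joint quantities; the weight values shown are placeholders, not the coefficients used in the paper.

    import numpy as np

    def hopper_reward(v_hip, p_des, p_meas, v_meas, t_des, v_foot_xy,
                      c1=1.0, c2=-0.1, c3=-0.01, c4=-0.001, c5=-0.1):
        """Sum of the five terms of eq. (4); c1 >= 0 rewards upward hip velocity,
        c2..c5 <= 0 penalise the remaining terms (placeholder values)."""
        r1 = c1 * max(v_hip, 0.0)                 # promote vertical jumping
        r2 = c2 * np.sum((p_des - p_meas) ** 2)   # joint position tracking error
        r3 = c3 * np.sum(v_meas ** 2)             # penalise measured joint velocities
        r4 = c4 * np.sum(t_des ** 2)              # penalise desired torques
        r5 = c5 * np.linalg.norm(v_foot_xy) ** 2  # discourage foot slippage
        return r1 + r2 + r3 + r4 + r5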
A. Experimental setup
CPG-ACTOR is compared against [19] using the same environment. Both approaches resort to an actor-critic formulation, running the same critic network with two hidden layers of 64 units each. The main difference is the actor, which is described in detail in Sec. II for the CPG-ACTOR case, while [19] relies on a network with two hidden layers of 64 units each.

As Sec. IV illustrates, an appropriate comparison between CPG-ACTOR and [19] required the latter to be warm-started in order to generate desired positions resulting in visible motions of the leg. Differently from the salamander [12], already-tuned parameters are not available for the hopping task, hence a meaningful set from [15] was used as reference. The warm-starting consisted in training the actor network for 100 epochs in a supervised fashion using the aforementioned parameters as targets.

IV. RESULTS
A. Validation of end-to-end training
We first demonstrate the effectiveness of CPG-ACTOR for end-to-end training. Figure 4 shows how the parameters belonging to both the CPG controller (Fig. 4a) and the network that processes the feedback (Figs. 4b and 4c) evolve in conjunction. This is signified by their distributions changing over the course of the learning process, from darker to lighter shades as training proceeds.
B. Comparison between MLP and CPG-ACTOR
In Fig. 6, the desired positions and the desired velocities of a classic actor-critic (as in Fig. 2a) and of CPG-ACTOR are compared after training for the same amount of time. What emerges is that the desired positions of CPG-ACTOR are smooth (Fig. 6a), while the MLP actor shows a bang-bang behaviour. Moreover, the desired velocities of CPG-ACTOR (Fig. 6b) almost respect the motors' operative range – red horizontal lines – without explicitly constraining the optimisation around these values. The desired positions and desired velocities generated by CPG-ACTOR – under the same setup used for the MLP – appear closer to a safe deployment on a real robot compared to a classic actor-critic. Although a more careful tuning of the rewards could produce desirable trajectories for the MLP as well, CPGs require less effort to achieve a smooth oscillatory motion, and this is the reason for investigating their potential.

Fig. 4: The images show the evolution – from darker to lighter colours – of the distributions of the CPG parameters (Fig. 4a) and of the weights (Fig. 4b) and biases (Fig. 4c) of the output layer of the MLP-feedback network. This demonstrates the simultaneous gradient propagation through the CPG and MLP parameters as described in Sec. II-A. (a) CPG-parameters distribution over time. (b) MLP-feedback weights distribution over time. (c) MLP-feedback biases distribution over time.
Fig. 5: (5a) shows how the reward evolves during training; each approach was run five times and the rewards averaged. (5b) illustrates the trajectories generated by the different approaches: [19] with warm-start produces an output similar to CPG-ACTOR without feedback, while CPG-ACTOR with feedback presents a heavily reshaped signal. The different contribution of the feedback in the two approaches is explained by (5c) and (5d), which show the phase and amplitude equations of eq. (1). Here the feedback – in the CPG-ACTOR case – actively interacts with the controller according to the observed state, resulting in visibly reshaped θ̇ and r̈ (green lines). Panels: (a) episode reward over a 20M time-step horizon, comparing CPG-Actor (without feedback), CPG-Actor (with feedback), CPG-Actor-Critic (baseline) and CPG-Actor-Critic (warm-started); (b) desired positions generated by CPG-Actor-Critic [19] and CPG-ACTOR; (c) comparison between θ̇, eq. (1), generated by CPG-Actor-Critic [19] and CPG-ACTOR; (d) comparison between r̈, eq. (1), generated by CPG-Actor-Critic [19] and CPG-ACTOR.
C. Comparison between CPG-ACTOR and previous baselines
Since the integration of CPGs in RL frameworks has already been proposed in other works, we validated our approach against [19] to prove its novelty and the resulting advantages. The approach proposed in [19] is applied to a salamander robot in order to adapt the original version, presented in [12], to more challenging environments; hence, the integration of exteroceptive information to improve the capabilities of the controller is pivotal. We reproduced this method and applied it to our test-bed (Fig. 1) to compare it with CPG-ACTOR. Warm-starting the policy network with reference to the parameters in [12] is one of the steps proposed in [19], and the result on our hopping leg is represented by the red line in Fig. 5a. Warm-starting is a crucial operation because, without it, the outcome (blue line, Fig. 5a) would not have been adequate for a fair comparison with CPG-ACTOR, due to its poor performance. Conversely, CPG-ACTOR (green line, Fig. 5a) performs better on average throughout training than the other approaches, reaching roughly six times more reward after 20 million time-steps.

We investigated the reason for such different performance and we argue that it lies in the way the feedback affects the CPG controller. Figures 5c and 5d represent the evolution over time of the CPG equations (eq. (1)). Observing θ̇ and r̈ in the experiments with [19], it is evident that they do not show responsiveness to the environment, since the blue and the red lines remain almost flat during the whole episode. On the other hand, θ̇ and r̈ in the CPG-ACTOR experiments (green line) demonstrate substantial and roughly periodic modifications over time. This is also suggested by the desired positions in Fig. 5b: in the case of CPG-ACTOR the original CPG cosine output is heavily reshaped by the feedback, while [19] presents an almost sinusoidal behaviour.

Besides, we compared our approach without feedback (orange line) with [19], and it surprisingly performs better than the latter. This is quite impressive since [19] updates its output based on the observations received, while CPG-ACTOR was here tested in open loop.

Fig. 6: Comparison of the desired position (6a) and the desired velocity (6b) generated by the CPGs and the MLP. The plot relative to the knee joint (KFE) in (6b) is magnified to better show the sharp output of the MLP and how the CPG's desired velocities are very close to the motors' limits (horizontal red lines), even though the latter were not explicit constraints of the optimisation.
D. Evaluation of progressive task achievement
The last set of experiments assesses how the CPG outputs and the overall behaviour evolve over the course of learning. The plots in Fig. 7 present the system at 1, 20 and 60 million time-steps of training. In particular, Figs. 7b and 7c are very similar, since they represent how the position outputs of, respectively, the hip (HFE) and knee (KFE) joints develop over time. Figure 7a, instead, shows the progress of the hopper in learning to jump; indeed, the continuous and dotted lines – respectively indicating the hip and the foot position – start quite low at the beginning of the training and almost double in height after 60 million time-steps.

V. DISCUSSION AND FUTURE WORK
We propose CPG-ACTOR, an effective and novel method to tune CPG controllers through gradient-based optimisation in an RL setting. In this context, we showed how CPGs can be directly integrated as the Actor in an Actor-Critic formulation and, additionally, we demonstrated how this method permits us to include highly non-linear feedback to reshape the oscillators' dynamics. Our results on a locomotion task using a single-leg hopper demonstrated that explicitly using the CPG as an Actor, rather than as part of the environment, results in a significant increase in the reward gained over time compared with previous approaches. Finally, we demonstrated how our closed-loop CPG progressively improves the hopping behaviour relying only on basic reward functions.

In the future, we will extend the present approach to the full locomotion task and deploy it on real hardware. In fact, we believe this novel approach gives CPGs all the tools to rival state-of-the-art techniques in the field and gives researchers a less reward-sensitive training method.

ACKNOWLEDGEMENTS
The authors would like to thank Prof. Auke Ijspeert and his students, Jonathan Arreguit and Shravan Tata, for providing insights and feedback. We would further like to thank Alexander Mitchell for his feedback in reviewing the manuscript.

Fig. 7: The figures demonstrate that CPG-Actor progressively learns to jump, indicated by higher peaks of both the hip (solid line) and foot (dotted line) heights (Fig. 7a), shown at 1M, 20M and 60M training steps. We further show the evolution of the output of the oscillators across epochs (Figs. 7b and 7c). (a) The hopper progressively learns to jump. (b) Desired hip positions across epochs. (c) Desired knee positions across epochs.
REFERENCES

[1] M. Kalakrishnan, J. Buchli, P. Pastor, M. Mistry, and S. Schaal, "Learning, planning, and control for quadruped locomotion over challenging terrain," The Int. J. of Rob. Res. (IJRR), vol. 30, no. 2, pp. 236–258, 2011.
[2] C. D. Bellicoso, F. Jenelten, C. Gehring, and M. Hutter, "Dynamic locomotion through online nonlinear motion optimization for quadrupedal robots," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2261–2268, July 2018.
[3] M. Camurri, M. Fallon, S. Bazeille, A. Radulescu, V. Barasuol et al., "Probabilistic contact estimation and impact detection for state estimation of quadruped robots," IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 1023–1030, 2017.
[4] M. Focchi, V. Barasuol, M. Frigerio, D. G. Caldwell, and C. Semini, Slip Detection and Recovery for Quadruped Robots. Springer International Publishing, 2018.
[5] B. Ponton, A. Herzog, A. Del Prete, S. Schaal, and L. Righetti, "On time optimization of centroidal momentum dynamics," in Proc. IEEE Int. Conf. Rob. Autom. (ICRA), 2018, pp. 5776–5782.
[6] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis et al., "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, no. 26, 2019.
[7] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, "Learning quadrupedal locomotion over challenging terrain," Science Robotics, vol. 5, no. 47, 2020.
[8] S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis, "RLOC: Terrain-aware legged locomotion using reinforcement learning and optimal control," 2020.
[9] M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso et al., "ANYmal – a highly mobile and dynamic quadrupedal robot," in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys. (IROS), 2016, pp. 38–44.
[10] A. J. Ijspeert, "Central pattern generators for locomotion control in animals and robots: A review," Neural Networks, vol. 21, no. 4, pp. 642–653, 2008.
[11] L. Righetti and A. J. Ijspeert, "Pattern generators with sensory feedback for the control of quadruped locomotion," in Proc. IEEE Int. Conf. Rob. Autom. (ICRA), May 2008, pp. 819–824.
[12] A. J. Ijspeert, A. Crespi, D. Ryczko, and J.-M. Cabelguen, "From swimming to walking with a salamander robot driven by a spinal cord model," Science, vol. 315, no. 5817, pp. 1416–1420, 2007.
[13] A. Kamimura, H. Kurokawa, E. Toshida, K. Tomita, S. Murata et al., "Automatic locomotion pattern generation for modular robots," in Proc. IEEE Int. Conf. Rob. Autom. (ICRA), vol. 1, Sep. 2003, pp. 714–720.
[14] S. Bonardi, M. Vespignani, R. Möckel, J. Van den Kieboom, S. Pouya et al., "Automatic generation of reduced CPG control networks for locomotion of arbitrary modular robot structures," Proc. Robotics: Science and Systems (RSS), 2014.
[15] M. Ajallooeian, S. Gay, A. Tuleu, A. Spröwitz, and A. J. Ijspeert, "Modular control of limit cycle locomotion over unperceived rough terrain," in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys. (IROS), Tokyo, 2013, pp. 3390–3397.
[16] M. Ajallooeian, S. Pouya, A. Sproewitz, and A. J. Ijspeert, "Central pattern generators augmented with virtual model control for quadruped rough terrain locomotion," in Proc. IEEE Int. Conf. Rob. Autom. (ICRA), May 2013, pp. 3321–3328.
[17] S. Gay, J. Santos-Victor, and A. Ijspeert, "Learning robot gait stability using neural networks as sensory feedback function for central pattern generators," in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys. (IROS), 2013, pp. 194–201.
[18] A. Spröwitz, A. Tuleu, M. Vespignani, M. Ajallooeian, E. Badri et al., "Towards dynamic trot gait locomotion: Design, control, and experiments with cheetah-cub, a compliant quadruped robot," The Int. J. of Rob. Res. (IJRR), vol. 32, no. 8, pp. 932–950, 2013.
[19] Y. Cho, S. Manzoor, and Y. Choi, "Adaptation to environmental change using reinforcement learning for robotic salamander," Intell. Serv. Robot., vol. 12, no. 3, pp. 209–218, Jul. 2019.
[20] A. L. Ciancio, L. Zollo, E. Guglielmelli, D. Caligiore, and G. Baldassarre, "Hierarchical reinforcement learning and central pattern generators for modeling the development of rhythmic manipulation skills," vol. 2, 2011, pp. 1–8.
[21] Y. Nakamura, T. Mori, M.-a. Sato, and S. Ishii, "Reinforcement learning for a biped robot based on a CPG-actor-critic method," Neural Networks, vol. 20, no. 6, pp. 723–735, 2007.
[22] S. Fukunaga, Y. Nakamura, K. Aso, and S. Ishii, "Reinforcement learning for a snake-like robot controlled by a central pattern generator," in Proc. IEEE Conf. Rob., Aut. Mech. (ICMA), vol. 2, 2004, pp. 909–914.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. Learn. Repres. (ICLR), 2015.
[26] P. Fankhauser, M. Hutter, C. Gehring, M. Bloesch, M. A. Hoepflinger et al., "Reinforcement learning of single legged locomotion," in Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys. (IROS), 2013, pp. 188–193.
[27] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman et al., "OpenAI Gym," 2016.
[28] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," http://pybullet.org, 2016–2020.
[29] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert et al., "OpenAI Baselines," https://github.com/openai/baselines, 2017.