Autonomous learning and chaining of motor primitives using the Free Energy Principle
Louis Annabi
ETIS UMR 8051, CY University, ENSEA, CNRS
F-95000 Cergy-Pontoise, France
[email protected]
Alexandre Pitti
ETIS UMR 8051, CY University, ENSEA, CNRS
F-95000 Cergy-Pontoise, France
[email protected]
Mathias Quoy
ETIS UMR 8051, CY University, ENSEA, CNRS
F-95000 Cergy-Pontoise, France
[email protected]
Abstract—In this article, we apply the Free-Energy Principle to the question of motor primitives learning. An echo-state network is used to generate motor trajectories. We combine this network with a perception module and a controller that can influence its dynamics. This new compound network permits the autonomous learning of a repertoire of motor trajectories. To evaluate the repertoires built with our method, we exploit them in a handwriting task where primitives are chained to produce long-range sequences.
Index Terms—Unsupervised learning, self-organizing feature maps, inference algorithms, predictive coding, intelligent control, nonlinear dynamical systems
I. INTRODUCTION
We consider the problem of building a repertoire of motor primitives from an open-ended, task-agnostic interaction with the environment. We suggest that a suitable repertoire of motor primitives should enable the agent to reach a set of states that best covers its state space. Based on this hypothesis, we train an agent to learn a discrete representation of its state space, as well as motor primitives driving the agent into the learned discrete states. In a fully observable environment, a clustering algorithm such as Kohonen self-organising maps [1] applied to the agent's sensory observations makes it possible to learn a set of discrete states that covers the agent's state space well. Using this set of discrete states as goals, an agent can learn policies that drive it towards those goals, thus building for itself a repertoire of motor primitives. Our main contribution is to address this twofold learning problem in terms of free energy minimisation.

The Free Energy Principle [2] (FEP) suggests that the computing mechanisms in the brain accounting for perception, action and learning can be explained as a process of minimisation of an upper bound on surprise called free energy. On the one hand, FEP applied to perception [3] translates the inference on the causes of sensory observations into a gradient descent on free energy, and aligns nicely with the predictive coding [4] and Bayesian brain [5] hypotheses. On the other hand, FEP applied to action, or active inference [6], [7], can explain motor control and decision making as an optimisation of free energy constrained by prior beliefs. In this work, we
present a variational formulation of our problem that allows us to translate motor primitives learning into a free energy minimisation problem.

In previous works, we applied the principle of free energy minimisation in a spiking recurrent neural network for the generation of long-range sequences [8], but not associated to sensorimotor control. The presented model was able to generate long-range sequences minimising free energy functions corresponding to several random goals. We used a randomly connected recurrent neural network in order to generate trajectories, and combined it with a second population of neurons in charge of driving its activation in directions minimising the distance towards a randomly sampled goal.

Using randomly connected recurrent neural networks to generate sequences is at the core of reservoir computing (RC) techniques [9], [10]. In particular, there is work using RC for the generation of motor trajectories, see for instance [11], [12]. In the RC framework, inputs are mapped to a high-dimensional space by a recurrent neural network called the reservoir, and decoded by an output layer. The reservoir weights are fixed and the readout weights are regressed, usually with gradient descent. In [8], we proposed to alter the learning problem by fixing the readout weights to random values as well, and by optimising the input of the reservoir network instead.

In this article, we propose to combine the ideas developed in our previous work with a perception module in order to learn a repertoire of motor primitives. We train and evaluate our model in an environment designed for handwriting, where the agent controls a 2 degrees of freedom arm. The agent randomly explores its environment by drawing random trajectories on the canvas. By clustering its visual observations of the resulting trajectories, the agent learns a set of prototype observations which it will sample as goals to achieve during its future drawing episodes. This learning method is interesting from a developmental point of view since it implements goal babbling. Developmental psychology tells us that learning sensorimotor contingencies [13] plays a key role in the development of young infants. In [14], the authors present a review about sensorimotor contingencies in the fields of developmental psychology and developmental robotics, in which they propose a very general model of how a learning agent should organise its exploration of the environment to develop its sensorimotor skills. They suggest that the agent should continuously sample goals from its state space and practice achieving these goals. The work we propose in this article aligns nicely with their suggestion, as our agent randomly samples goals from a discrete state space, and optimises the motor sequences leading to these discrete states.

In the following, we will first present the model, then the results on motor primitives learning. In a third part, we will evaluate the learned repertoires on a task of motor primitive chaining to draw complex trajectories.

II. METHODS
A. Model for motor primitives learning
Our neural network architecture, represented in figure 1, can be segmented into three substructures. On the sensory pathway (top), a Kohonen map is used to cluster the observations. On the motor pathway (bottom), a reservoir network and a controller are used to model the generation and optimisation of motor sequences.

Fig. 1: Model for motor primitives learning. x denotes the activation signal. [r] denotes the sequence of activations {r_t}_{t=1..T} of the reservoir network. [a] denotes the sequence of atomic motor commands {a_t}_{t=1..T} decoded from the reservoir dynamics. o denotes the visual observation. s denotes the hidden state. k denotes the primitive index, on which depend the activation signal x and the prior probability over hidden states p(s). F denotes the free energy; it is used as an optimisation signal for the controller.

The activation signal x stimulates the random recurrent neural network (RNN), which exhibits a self-sustained activity r during T time steps. T atomic motor commands a are read out from the T activations of the recurrent network. The environment provides an observation o after being impacted by the sequence of actions. This observation is then categorised in s ∈ S = {s_i}_{i=1..n}.

1) Reservoir network: Motor primitives are sequences of atomic motor commands read out from the dynamics of the reservoir network. To optimise these sequences according to some criteria, one can either optimise the readout weights, optimise the recurrent dynamics of the reservoir, or control its activity (our case). Updating the recurrent weights of the RNN can have the undesirable effect of limiting its dynamics. Thus in RC, the focus is usually put on the learning of appropriate readout weights. However, we showed in a previous work [8] that it is possible to control the RNN dynamics by optimising its initial state. In this article, this is the strategy we use in order to learn motor primitives.

Our implementation of the reservoir network is based on an existing model for handwriting trajectory generation using RC [15], using the following equations:

u(t) = (1 − 1/τ) · u(t−1) + (1/τ) · W_r · r(t−1)   (1)
r(t) = tanh(u(t))   (2)
a(t) = tanh(W_o · r(t))   (3)

where u and r denote the network activation respectively before and after application of the non-linearity. We denote by τ the time constant of the network.

For the recurrent network to have a self-sustained activity, we referred to the weights initialisation used in [15]. The recurrent weights matrix W_r is made sparse, with each coefficient having a probability p_r of being non-null. When non-null, the coefficients are sampled from a normal distribution N(0, σ_r²/n_r), with a variance scaled according to the network size n_r. The readout weights W_o are sampled from a normal distribution N(0, σ_o²).
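To make this concrete, here is a minimal NumPy sketch of equations (1), (2), (3) under the initialisation described above. The class and method names are ours, and the sparsity p_r = 0.1 and primitive length T = 50 are assumed values for illustration, not the values used in the experiments.

```python
import numpy as np

class Reservoir:
    """Leaky random RNN with a fixed random linear-tanh readout (eqs. 1-3).
    All weights are fixed at initialisation; only the initial state is
    controlled from outside, as described in the text."""

    def __init__(self, n_r=100, n_o=2, tau=10.0, p_r=0.1,
                 sigma_r=1.0, sigma_o=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Sparse recurrent weights: each coefficient is non-null with
        # probability p_r, drawn from N(0, sigma_r^2 / n_r).
        mask = rng.random((n_r, n_r)) < p_r
        self.W_r = mask * rng.normal(0.0, sigma_r / np.sqrt(n_r), (n_r, n_r))
        # Fixed random readout weights, drawn from N(0, sigma_o^2).
        self.W_o = rng.normal(0.0, sigma_o, (n_o, n_r))
        self.tau = tau

    def run(self, x, T=50):
        """Drive the network from initial state x and read out T atomic
        motor commands a(1), ..., a(T)."""
        u = x.copy()            # activation before the non-linearity
        r = np.tanh(u)
        actions = []
        for _ in range(T):
            u = (1 - 1 / self.tau) * u + (1 / self.tau) * (self.W_r @ r)  # eq. (1)
            r = np.tanh(u)                                                # eq. (2)
            actions.append(np.tanh(self.W_o @ r))                         # eq. (3)
        return np.stack(actions)   # shape (T, n_o): angular velocity commands
```

For instance, calling Reservoir().run(x) with a 100-dimensional initial state x returns a (50, 2) array of motor commands, which is the role played by "simulate action" in the algorithm below.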
2) Kohonen map: The Kohonen map [1] takes as input a 64x64 gray-scale image. Each filter learned by the Kohonen map has to correspond to a distinct motor primitive. Since we expected to learn motor primitives corresponding mainly to movement orientations, we used a Kohonen map topology with only one cyclic dimension. We also ran experiments with 2-d and 3-d topologies, with and without cyclic dimensions. We chose to stick with the 1-d cyclic topology because it presented fast learning and a balanced use of all the learned filters.

Here are the equations of the Kohonen network:

i_w(t) = argmin_i ‖ W_k[i, :](t) − o(t) ‖   (4)
W_k(t+1) = (1 − λ_k · N(i_w(t))) ⊙ W_k(t) + λ_k · N(i_w(t)) ⊙ o(t)   (5)

where W_k denotes the Kohonen weights, each row W_k[i, :] corresponding to the filter associated with the neuron of index i. i_w denotes the winner neuron index, i.e. the index of the Kohonen neuron whose associated filter is closest to the input stimulus o. The neighbourhood function N(i_w(t)) depends on the chosen topology. It is maximum in i_w(t) and decreases exponentially according to the distance with regard to the winner neuron index i_w(t). This exponential decay is parameterised by a neighbourhood width σ_k. The operator ⊙ denotes the element-wise product.
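A minimal sketch of equations (4) and (5) on the 1-d cyclic topology could look as follows. The learning rate lam and neighbourhood width sigma_k are placeholder values, and storing the filters as the rows of an n × d matrix follows the notation of the text.

```python
import numpy as np

class CyclicKohonen:
    """Kohonen map on a 1-d cyclic (ring) topology, eqs. (4)-(5)."""

    def __init__(self, n=50, d=64 * 64, lam=0.1, sigma_k=5.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.random((n, d))   # one flattened 64x64 filter per row
        self.n, self.lam, self.sigma_k = n, lam, sigma_k

    def winner(self, o):
        # eq. (4): index of the filter closest to the observation o
        return int(np.argmin(np.linalg.norm(self.W - o, axis=1)))

    def update(self, o):
        i_w = self.winner(o)
        idx = np.arange(self.n)
        # Cyclic distance to the winner on the 1-d ring topology.
        dist = np.minimum(np.abs(idx - i_w), self.n - np.abs(idx - i_w))
        # Exponentially decaying neighbourhood centred on the winner.
        N = np.exp(-dist / self.sigma_k)[:, None]
        # eq. (5): move every filter towards o, weighted by the neighbourhood
        self.W = (1 - self.lam * N) * self.W + self.lam * N * o
        return i_w
```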
3) Free energy derivations: Our goal is to learn the right activation signals x that stimulate the random recurrent network in a way that leads to desired categorisations of the observation o by the Kohonen network. Let n be the number of motor primitives that we want to learn. We set the size of the Kohonen network, as well as the number of activation signals to learn, to n. For the trajectory of index k, we want to learn the stimulus x*_k that best activates the corresponding category in the Kohonen network (i.e. such that p(o | s = s_k) ≈ 1). We can observe that the optimal activation signal x*_k depends on the weights of the Kohonen map. Because of this dependence, it would be easier to learn and fix the Kohonen map before optimising the controller. However, it is more realistic from a developmental point of view to train these two structures at the same time. For this reason, we learn both networks in parallel, but control their learning parameters over time so as to favour the learning of one network over the other.

We use free energy minimisation as the strategy to train the controller. What follows is a formalisation of our model using a variational approach:

• p(s) is the prior probability over states. Here we propose using a softmax, parameterised by β > 0, around the index k of the current primitive:

p(s = s_i) = exp(−β · |k − i|) / Σ_j exp(−β · |k − j|)   (6)

• p(o | s) is the state observation mapping. The observations are images of size d. For simplicity, we make the approximation of considering all pixel values as independent. We choose to use Bernoulli distributions for all pixel values o_l, with the pixel values of the filter associated with s as parameters (equation (7)).

4) Optimisation method: Our optimisation problem is the following. For each primitive k, we want to find an activation signal x_k that generates an observation o resulting in a low free energy F(o). To use gradient-based methods, we would need a differentiable model of how the activation signal x_k impacts the resulting free energy F(o). Since we do not have a model of how the environment produces observations, the whole x → r → a → o → F(o) chain is not differentiable. To solve our problem, we instead use a random search optimisation method, detailed by the following algorithm:

{Random initialisation of the controller}
for k < n do
  x_k ∼ N(0, ·)
end for
{Training}
for e < E do
  k ∼ U(n)
  δx ∼ N(0, σ(e))
  u⁺ ← x_k + δx
  u⁻ ← x_k − δx
  [a_1, ..., a_T]⁺ ← simulate_action(u⁺)
  [a_1, ..., a_T]⁻ ← simulate_action(u⁻)
  o⁺ ← env([a_1, ..., a_T]⁺)
  o⁻ ← env([a_1, ..., a_T]⁻)
  i_w⁺ ← simulate_kohonen(o⁺)
  i_w⁻ ← simulate_kohonen(o⁻)
  f⁺ ← free_energy(i_w⁺, o⁺)
  f⁻ ← free_energy(i_w⁻, o⁻)
  x_k ← x_k − λ · (f⁺ − f⁻) · δx
end for

The parameter e in the search standard deviation σ(e) indicates that this coefficient can depend on the training episode e. The "simulate_action" function used in the code above corresponds to the iterative application of equations (1), (2), (3) for the duration T of the motor primitives. The "env" function corresponds to the generation of an observation by the environment after being impacted by the sequence of actions. This computation is performed by the environment and unknown to the agent. The "simulate_kohonen" function corresponds to the application of equations (4) and (5). The "free_energy" function corresponds to the application of equation (10).
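To illustrate how these pieces fit together, the sketch below assumes that the free energy of equation (10) decomposes into the complexity and inaccuracy terms discussed in the results, i.e. F = −log p(s = s_{i_w}) − log p(o | s = s_{i_w}) for a winner-take-all recognition density with Bernoulli means given by the winning Kohonen filter. The rollout callback is our abstraction of the simulate_action → env → simulate_kohonen chain.

```python
import numpy as np

def prior(k, n=50, beta=1.0):
    # eq. (6): softmax prior over states, peaked on the current primitive k
    logits = -beta * np.abs(k - np.arange(n))
    p = np.exp(logits)
    return p / p.sum()

def free_energy(i_w, o, k, W, beta=1.0, eps=1e-6):
    """Assumed decomposition of eq. (10): complexity + inaccuracy, for an
    observation o categorised as i_w, target primitive k, Kohonen filters W."""
    complexity = -np.log(prior(k, W.shape[0], beta)[i_w])
    mu = np.clip(W[i_w], eps, 1 - eps)   # Bernoulli means = winning filter
    inaccuracy = -np.sum(o * np.log(mu) + (1 - o) * np.log(1 - mu))
    return complexity + inaccuracy

def random_search_step(x_k, k, lam, sigma_e, rollout, rng=None):
    """Two-sided random-search update of the activation signal x_k.
    rollout(u, k) must run the reservoir/environment/Kohonen chain and
    return the free energy of the resulting observation."""
    rng = rng or np.random.default_rng()
    dx = rng.normal(0.0, sigma_e, x_k.shape)
    f_plus = rollout(x_k + dx, k)
    f_minus = rollout(x_k - dx, k)
    # Move along the perturbation, scaled by the free-energy difference.
    return x_k - lam * (f_plus - f_minus) * dx
```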
B. Experimental setup

1) Environment: The environment is an initially blank canvas on which the agent can draw. The initial position of the pen is at the center of the canvas. The agent can act on the environment via 2D actions. The actions are the angular velocities of a 2 degrees of freedom arm, as represented in figure 2.

Fig. 2: Initial (left) and final (right) arm position for a trajectory taken from a data set of handwriting trajectories.

2) Training: We start from a random initialisation of the Kohonen map. Over time, the Kohonen map self-organises when being presented with the trajectories generated by the random search algorithm. Simultaneously, the random search algorithm learns to generate motor trajectories that lead to the different Kohonen prototype observations.

Training was performed using the following set of parameters for the different components described in the previous section:
• RNN: n_r = 100, τ = 10, p_r = 0. , σ_r = 1.
• Readout layer: n_o = 2, σ_o = 1.
• Kohonen map: n = 50, λ_k = 0. , Kohonen width σ_k(e) varies over time, see figure 3.
• Free energy: ten values of β were tested, see figure 5.
• Random optimisation: λ = 0. , σ(e) varies over time, see figure 3.

Fig. 3: Search variance σ(e) and Kohonen width σ_k(e) according to training iteration e.

C. Results

We trained our model for E = 20000 iterations on n = 50 primitives. At each iteration, we uniformly sample k from [1, n]. We train on the k-th primitive by adjusting the prior probability as in (6) and optimising x_k. On average, each activation signal x_k is trained on 400 iterations. Figure 3 displays the evolution of the random optimisation search variance and of the Kohonen width over the 20000 iterations. We discuss this choice in the light of the following results.

Fig. 4: Inaccuracy and complexity averaged over the number of primitives n = 50, with β = 8. Each primitive has been sampled on at least 360 iterations, and on average on 400 iterations.

Figure 4 displays the evolution of inaccuracy and complexity during training. During the first phase (see figure 3), the random search has a very high variance. Consequently, the trajectories generated and fed to the Kohonen map are very diverse, and this allows the Kohonen map to self-organise. Inaccuracy does not seem to decrease in this early phase. This is because the Kohonen filters, initially very broad, are becoming more precise. The high variance in the random search allows for a diminution of complexity, but still generates trajectories that are too noisy to accurately fit the more precise Kohonen filters. During the second phase, we decrease the variance of the random search. The system can now converge more precisely, and this causes a faster decrease of both inaccuracy and complexity.

We can question whether it is necessary for the random search variance to remain high for such a long time, since it slows down learning. We observed that if we reduce the duration of the first phase, the Kohonen map does not have the time to self-organise, and this results in an entangled topology. Because of the complexity term in the free energy computations, the topology of the Kohonen map has an influence over the learning. For instance, with an entangled topology, a search direction for x_k that activates a neuron closer to k might not make the actual trajectory closer to the ones recognised by the k-th Kohonen neuron. In other words, having a proper topology smooths the loss function.

We also notice that the inaccuracy cannot decrease below a certain value. At first, we could think that this is because the optimisation strategy is stuck in a local optimum. However, we obtained the same lower bound on inaccuracy over different training sessions. Since the optimisation strategy relies on random sampling, there is no evident reason to encounter the same local minimum. Our explanation is that this lower bound is imposed by the Kohonen network neighbourhood function. Because the Kohonen width does not reach 0, the Kohonen centroids are still attracting each other, and this prevents them from completely fitting the presented observations. In consequence, the filters are always partly mixed with their neighbours, and this causes the inaccuracy to plateau at a value that depends on σ_k.
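For concreteness, a two-phase annealing consistent with this discussion could be implemented as below. The switch point and all numeric values are assumptions chosen only to reproduce the qualitative shape of figure 3, not the values used in the experiments.

```python
def schedules(e, E=20000, e_switch=10000,
              sigma_hi=1.0, sigma_lo=0.1, k_hi=5.0, k_lo=0.5):
    """Two-phase annealing of the search variance sigma(e) and the Kohonen
    neighbourhood width sigma_k(e): both stay high while the map
    self-organises, then decay so the controller can converge precisely.
    All numeric values are illustrative assumptions."""
    if e < e_switch:
        return sigma_hi, k_hi
    # Linear interpolation towards the final values in the second phase.
    t = (e - e_switch) / (E - e_switch)
    return sigma_hi + t * (sigma_lo - sigma_hi), k_hi + t * (k_lo - k_hi)
```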
Fig. 5: Average distance |i_w − k| between the activated neuron in the Kohonen map i_w and the primitive index k, according to β.

Figure 5 shows the impact of the parameter β on the convergence. Looking at equations (6) and (10), we can see that this parameter directly scales the overall complexity. For low values of β, the random search is more likely to be stuck in local minima of free energy, when activating a Kohonen neuron closer to k corresponds to an increase in inaccuracy that exceeds the decrease in complexity. We measured the average distance between the activated neuron in the Kohonen map and the primitive index k at the end of training for six values of β. The results, presented in figure 5, confirm that the final states obtained with higher values of β correspond to a more precise mapping between i_w (winner index of the Kohonen map) and k (state index enforced by the prior probability).

Figure 6 displays some of the learned motor primitives. The blue component of the image corresponds to the trajectory that is actually being generated by the reservoir network for the activation signal x_k. The red component of the image corresponds to the Kohonen filter of index k. This figure allows visual confirmation of several points. First, the inaccuracy at the end of training seems indeed to come from the blurriness of the Kohonen filters. Second, the filters and motor primitive trajectories seem to follow a topology: the index of the primitive seems highly correlated with the orientation of the route taken by the arm end effector. Finally, every trajectory seems to be in the center of the corresponding Kohonen filter, which suggests that the minimisation of complexity successfully enforced the mapping between i_w and k.

Fig. 6: 9 of the 50 learned motor primitives and corresponding Kohonen filters. The blue component of the image corresponds to the trajectory that is actually being generated by the network for the activation signal x_k. The red component of the image corresponds to the Kohonen filter of index k.

III. CHAINING OF MOTOR PRIMITIVES

To validate our approach, we still need to show that this repertoire of motor primitives is efficient at constructing more complex movements. To perform this evaluation, we propose to extend the model presented in section II. First, we define a new perception module meant to classify sensory observations into states corresponding to more complex trajectories. To avoid confusion, we will denote this new hidden state σ. Second, we enable in this revisited model the chaining of motor primitives. In the previous model, each drawing episode corresponds to one motor trajectory of length T being generated. Here, in each drawing episode, the agent draws a trajectory corresponding to M motor primitives of length T chained together.

Since we are simply trying to evaluate the primitives learned in the first model, we will not address the training of the network used to classify sensory observations. Instead, our focus will be on the decision making process, i.e. the selection of the primitives {k_1, ..., k_M} to chain in order to reach a certain desired state σ*.

A. Model

The model for motor primitives chaining is presented in figure 7.

Fig. 7: Model for motor primitives chaining. π = p(σ) denotes the prior probability over states σ. [k] denotes the sequence of motor primitive indices {k_m}_{m=1..M}. [x] denotes the sequence of corresponding activation signals {x_m}_{m=1..M}. [r] denotes the resulting sequence of activations {r_{m,t}}_{m=1..M, t=1..T} of the reservoir network. [a] denotes the sequence of atomic motor commands {a_{m,t}}_{m=1..M, t=1..T} decoded from the reservoir dynamics. o denotes the visual observation provided by the environment after being modified by the sequence of atomic motor commands. σ denotes the hidden state; it is different from the hidden state of the previous section. F denotes the expected free energy; it is used as an optimisation signal for the choice of the sequence of motor primitive indices {k_m}_{m=1..M}.

1) New hidden state: The hidden state σ corresponds to new categories representing complex motor trajectories. We used the Character Trajectories Data Set from [16], composed of approximately 70 trajectories for each letter of the alphabet that can be drawn without lifting the pen. The trajectories are provided as sequences of pen positions. We drew these trajectories using our drawing environment and used the resulting observations to build the new state observation mapping.
2) Expected free energy derivations: The model uses active inference for decision making. It selects actions that minimise a free energy function constrained by prior beliefs over hidden states. According to active inference, constraining the prior probability over states to infer a state distribution that favours a target state σ* will force the agent to perform actions that fulfill this prediction. In this sense, the prior probability over states can be compared to the definition of a reward in reinforcement learning. For instance, a rewarding state would be a state that is more likely under the prior probability over states.

Here is the formalisation of this model in the variational framework:

• The prior probability over states p(σ) acts similarly to a reward function in reinforcement learning. The prior beliefs (or prior preferences) over σ will be set manually to different values during testing to guide the agent into a desired state.

• The other probability distribution over states is one that the agent has control over. By choosing one primitive in the learned repertoire, the agent selects one resulting distribution over states q_k(σ), modeling how the choice of motor primitive k will influence the state. This probability distribution over hidden states corresponds to the output of the learned classifier when fed with the observation resulting from the application of the k-th motor primitive.

• As in the primitive learning model presented in section II, the state observation mapping p(o | σ) is built using equation (7). The filters used for each category correspond to the average of the observations belonging to this class obtained from the data set.

We can now derive the expected free energy using this model:

E[F(k)] = KL(q_k(σ) ‖ p(σ)) − Σ_i q_k(σ = σ_i) · log p(o | σ = σ_i)   (11)

3) Action selection: The following algorithm details the action selection process on a trajectory composed of M motor primitives using free energy minimisation:

p(σ) ← init()
for m < M do
  for k < n do
    u ← x_k
    [a_1, ..., a_T] ← simulate_action(u)
    o ← env_model([a_1, ..., a_T])
    q_k(σ) ← classifier(o)
    f_k ← expected_free_energy(q_k(σ), p(σ), o)
  end for
  k*_m ← argmin_k(f_k)
  u* ← x_{k*_m}
  [a_1, ..., a_T] ← simulate_action(u*)
  env([a_1, ..., a_T])
end for

The function "env_model" corresponds to a learned forward model that allows estimating the future observations for any sequence of actions. In our experiments, we simply simulated the actions and rewound the environment, which (in a deterministic environment) corresponds to having a perfect forward model (env = env_model). The function "classifier" corresponds to the classification of the observation by a learned classifier. It outputs a probability distribution over hidden states σ. Finally, the function "expected_free_energy" corresponds to the application of eq. (11).
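The sketch below illustrates equation (11) and the greedy selection loop, assuming q_k and p(σ) are dense probability vectors over the letter categories and that filters holds one averaged Bernoulli-mean image per category; env_model and classifier stand in for the perfect forward model and the learned classifier just described.

```python
import numpy as np

def expected_free_energy(q_k, p_sigma, o, filters, eps=1e-6):
    """eq. (11): KL(q_k || p) minus the expected log-likelihood of the
    predicted observation o under each category's Bernoulli filter."""
    kl = np.sum(q_k * (np.log(q_k + eps) - np.log(p_sigma + eps)))
    mu = np.clip(filters, eps, 1 - eps)                   # (n_classes, d)
    log_lik = o * np.log(mu) + (1 - o) * np.log(1 - mu)   # per-pixel terms
    accuracy = np.sum(q_k * log_lik.sum(axis=1))
    return kl - accuracy

def select_primitive(x_signals, p_sigma, env_model, classifier, filters):
    """Greedy one-step active inference: evaluate every primitive on the
    forward model and pick the one with the lowest expected free energy."""
    best_k, best_f = None, np.inf
    for k, x_k in enumerate(x_signals):
        o = env_model(x_k)     # predicted canvas after running primitive k
        q_k = classifier(o)    # distribution over hidden states sigma
        f = expected_free_energy(q_k, p_sigma, o, filters)
        if f < best_f:
            best_k, best_f = k, f
    return best_k
```

Chaining then amounts to calling select_primitive M times, executing the chosen primitive in the real environment after each call.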
B. Results

Fig. 8: Example of filter and produced trajectories. Left: filter for the category 's'. Middle and right: trajectories produced by chaining five motor primitives belonging to two different learned repertoires.

In our tests, the hidden state σ was built using five classes corresponding to the letters 'c', 'h', 'i', 's', 'r'. We chose these letters because the trajectories inside each category were relatively close, and this allowed for filters of low entropy. Figure 8 displays one filter of the learned classifier and two trajectories generated by chaining five motor primitives from two different primitive repertoires learned with our model. The trajectories were obtained by setting the prior preferences to 0.96 for the category 's' and 0.01 for the four other categories.

To verify that our model learns a valuable repertoire of motor primitives, we compare the quality of the constructed complex trajectories (as in figures 8b and 8c) with our model and with random repertoires. Random repertoires are built using the same RNN and readout layer initialisations. They differ from the learned repertoires in the fact that we do not optimise the activation signals x_k of the reservoir. The initial states of the reservoir used to generate the primitives are taken as x_k ∼ N(0, ·). In other words, they are equivalent to the repertoires our model would provide before learning.

Fig. 9: Average complexity for learned and random repertoires of different sizes.

Figure 9 displays the average complexities measured at the end of the episode. For each episode, we set the prior preferences of one of the letter categories to 0.96 and the others to 0.01. The values are averaged over the different letters and over 5 different repertoires, learned or random.

The complexity scores how close the recognition probability q(σ) (provided by the learned classifier) is to the prior preferences p(σ), and thus constitutes a suitable indicator for comparison. For low complexities, the constructed images are close to the filter, as in figure 8. We observe that the average complexity tends to be lower for repertoires of larger sizes, independently of the type of repertoire. Having a larger repertoire of primitives indeed should be an asset in order to reconstruct more complex trajectories. For every repertoire size, we measure a lower complexity with repertoires learned using the model described in section II.

IV. CONCLUSION

The results displayed in section III show that our model is able to learn repertoires of motor primitives that are efficient at building more complex trajectories when combined.

To further validate our approach, it would be interesting to compare our results with other strategies for motor primitive learning. On the one hand, there is existing work in developmental robotics prescribing guidelines to build repertoires of motor primitives [14], [17], but they do not provide a neural network implementation to be used for comparison. On the other hand, the option discovery literature in hierarchical reinforcement learning provides practical methods to build repertoires of relevant options. Options were introduced in [18] as a candidate solution to address the issue of temporal scaling in reinforcement learning. An option is defined as a temporally extended action, and thus is conceptually similar to a motor primitive. It would be interesting to measure how our approach compares with current state-of-the-art techniques for unsupervised option learning such as [19].

ACKNOWLEDGMENT

This work was funded by the Cergy-Paris University Foundation (Facebook grant) and Labex MME-DII, France (ANR11-LBX-0023-01).

REFERENCES

[1] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, pp. 59–69, 1982.
[2] K. Friston and J. Kilner, "A free energy principle for the brain," J. Physiol. Paris, vol. 100, pp. 70–87, 2006.
[3] K. Friston, "Hierarchical models in the brain," PLOS Computational Biology, vol. 4, no. 11, pp. 1–24, 2008.
[4] R. Rao and D. Ballard, "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects," Nat. Neurosci., vol. 2, pp. 79–87, 1999.
[5] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, "The Helmholtz machine," Neural Computation, vol. 7, pp. 889–904, 1995.
[6] K. Friston, J. Daunizeau, and S. Kiebel, "Reinforcement learning or active inference?" PLoS ONE, vol. 4, no. 7, p. e6421, 2009.
[7] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, J.
O'Doherty, and G. Pezzulo, "Active inference and learning," Neuroscience & Biobehavioral Reviews, vol. 68, pp. 862–879, 2016.
[8] A. Pitti, P. Gaussier, and M. Quoy, "Iterative free-energy optimization for recurrent neural networks (INFERNO)," PLoS ONE, vol. 12, no. 3, p. e0173684, 2017.
[9] D. Verstraeten, B. Schrauwen, M. D'Haene, and D. Stroobandt, "An experimental unification of reservoir computing methods," Neural Networks, vol. 20, pp. 391–403, 2007.
[10] M. Lukoševičius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, no. 3, pp. 127–149, 2009.
[11] J. Namikawa and J. Tani, "Learning to imitate stochastic time series in a compositional way by chaos," Neural Networks, vol. 23, no. 5, pp. 625–638, 2010.
[12] F. Mannella and G. Baldassarre, "Selection of cortical dynamics for motor behaviour by the basal ganglia," Biological Cybernetics, vol. 109, pp. 575–595, 2015.
[13] J. K. O'Regan and A. Noë, "A sensorimotor account of vision and visual consciousness," Behavioral and Brain Sciences, vol. 24, no. 5, pp. 939–973, 2001.
[14] L. Jacquey, G. Baldassarre, V. Santucci, and J. K. O'Regan, "Sensorimotor contingencies as a key drive of development: From babies to robots," Frontiers in NeuroRobotics, vol. 13, no. 98, 2019.
[15] R. Laje and D. Buonomano, "Robust timing and motor patterns by taming chaos in recurrent neural networks," Nature Neuroscience, vol. 16, no. 7, pp. 925–935, 2013.
[16] D. Dua and C. Graff, "UCI machine learning repository," http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2019.
[17] E. Ugur, E. Sahin, and E. Öztop, "Self-discovery of motor primitives and learning grasp affordances," pp. 3260–3267, 2012.
[18] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1, pp. 181–211, 1999.
[19] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, "Diversity is all you need: Learning skills without a reward function," in International Conference on Learning Representations, 2019.