Intrinsically Motivated Open-Ended Multi-Task Learning Using Transfer Learning to Discover Task Hierarchy
Nicolas Duminy, Sao Mai Nguyen, Junshuai Zhu, Dominique Duhaut, Jerome Kerdreux
Article
Appl. Sci., 975. https://doi.org/10.3390/app11030975

Département Mathématiques Informatique Statistique, Université Bretagne Sud, Lab-STICC, UBL, 56321 Lorient, France; [email protected] (N.D.); [email protected] (D.D.)
IMT Atlantique, Lab-STICC, UBL, 29238 Brest, France; [email protected] (J.Z.); [email protected] (J.K.)
Flowers Team, U2IS, ENSTA Paris, Institut Polytechnique de Paris & Inria, 91120 Palaiseau, France
* Correspondence: [email protected]
Abstract:
In open-ended continuous environments, robots need to learn multiple parameterised control tasks in hierarchical reinforcement learning. We hypothesise that the most complex tasks can be learned more easily by transferring knowledge from simpler tasks, and faster by adapting the complexity of the actions to the task. We propose a task-oriented representation of complex actions, called procedures, to learn online task relationships and unbounded sequences of action primitives to control the different observables of the environment. Combining both goal-babbling with imitation learning, and active learning with transfer of knowledge based on intrinsic motivation, our algorithm self-organises its learning process. It chooses at any given time a task to focus on, and what, how, when and from whom to transfer knowledge. We show with a simulation and a real industrial robot arm, in cross-task and cross-learner transfer settings, that task composition is key to tackling highly complex tasks. Task decomposition is also efficiently transferred across different embodied learners and by active imitation, where the robot requests just a small number of demonstrations and the adequate type of information. The robot learns and exploits task dependencies so as to learn tasks of every complexity.
Keywords: curriculum learning; continual learning; hierarchical reinforcement learning; interactive reinforcement learning; imitation learning; multi-task learning; active imitation learning; hierarchical learning; intrinsic motivation
1. Introduction

Figure 1. Real Yumi setup: the 7-DOF industrial robot arm can produce sounds by moving the blue and green objects and touching the table. See https://youtu.be/6gQn3JGhLvs for an example task.
Let us consider a reinforcement learning (RL) [1] robot placed in an environment surrounded by objects, without external rewards, but with human experts' help. How can the robot learn multiple tasks such as manipulating an object, at the same time as combining objects together or other complex tasks that require multiple steps?

In the case of tasks with various complexities and dimensionalities, without a priori domain knowledge, the complexities of actions considered should be unbounded. If we relate the complexity of actions to their dimensionality, actions of unbounded complexity should belong to spaces of unbounded dimensionality. For instance, if an action primitive of dimension n is sufficient for placing an object at a position, a sequence of 2 primitives, i.e., an action of dimension 2n, is sufficient to place the stick on 2 xylophone keys. Nevertheless, tunes have variable lengths and durations. Likewise, as a conceptual illustration, an unbounded sequence of actions is needed to control the interactive table and play tunes of any length in a setup like in Figure 1. In this work, we consider that actions of unbounded complexity can be expressed as action primitives and unbounded sequences of action primitives, also named in [2] respectively micro and compound actions. The agent thus needs to estimate the complexity of the task and deploy actions of the corresponding complexity.

To solve this unbounded problem, the learning agent should start small before trying to learn more complex tasks, as theorised in [3].
Indeed, in multi-task learning problems, some tasks can be compositions of simpler tasks (which we call 'hierarchically organised tasks'). This approach has been coined 'curriculum learning' in [4]. The idea of this approach is to use the knowledge acquired from simple tasks to solve more complex tasks or tasks at a high level of the hierarchy, or in other words, to leverage transfer learning (TL) [5–7]. Uncovering the relationships between tasks is useful for transferring knowledge from one task to another. The insight behind TL is that generalisation may occur not only within tasks, but also across tasks. This is relevant for compositional tasks. But how can the learning agent discover the decomposition of tasks and the relationships between tasks? Moreover, transfer of knowledge between tasks can also be complemented by transfer of knowledge from teachers. Indeed, humans and many animals do not just learn a task by trial and error. Rather, they extract knowledge about how to approach a problem from watching other people performing a similar task. Behavioural psychology studies [8,9] highlight the importance of social and instructed learning, "including learning about the consequences, sequential structure and hierarchical organisation of actions" [10]. Imitation is a mechanism for emerging representational capabilities [11]. How can imitation enable the decomposition of tasks into subtasks, and which kind of information should be transferred from the teacher to the learning agent to enable effective hierarchical reinforcement learning? How robust is this transfer to correspondence problems?
How can teachers avoid demonstrations that correspond to behaviours the agent already masters, or that require prerequisites the robot has not learned yet?

This work addresses multi-task learning in open-ended environments by studying the role of transfer of knowledge across tasks, with the hypothesis that some tasks are interrelated, and the role of transfer of knowledge from other learners or experts, to determine how information is best transferred for hierarchical reinforcement learning: when, what and whom to imitate?
2. State of the Art
To learn unbounded sequences of motor actions for multiple tasks, we examine recent methods for curriculum learning based on intrinsic motivation. We also review the methods for hierarchical reinforcement learning and imitation learning, which can be described as two types of transfer of knowledge.
In continual learning in an open-ended world without external rewards, to discover repertoires of skills, agents must be endowed with intrinsic motivation (IM), which is described in psychology as triggering curiosity in human beings [12], to explore the diversity of outcomes they can cause and to control their environment [13,14]. These methods use a reward function that is not shaped to fit a specific task but is general to all tasks the robot will face. Tending towards life-long learning, this approach, also called artificial curiosity, may be seen as a particular case of reinforcement learning using a reward function parametrised by features internal to the learning agent. One important form of IM system is the ability to autonomously set one's own goals among the multiple tasks to learn. Approaches such as [15,16] have extended the heuristics of IM with goal-oriented exploration, and proven able to learn fields of tasks in continuous task and action spaces of high but bounded dimensionality. More recently, IMGEP [17] and CURIOUS [18] have combined intrinsic motivation and goal babbling with deep neural networks and replay mechanisms. They could select goals in a developmental manner from easy to more difficult tasks. Nevertheless, these works did not leverage cross-goal learning but only used interpolation between parametrised goals over a common memory dataset.
Nevertheless, in the case of tasks with various complexities and dimensionalities, especially with action spaces of unbounded dimensionality, those methods become intractable and the volume of the task and action spaces to explore grows exponentially. In that case, exploiting the relationships between the tasks can enable a learning agent to tackle increasingly complex tasks more easily, and heuristics such as social guidance can highlight these relationships. The idea would be to treat complex skills as assembly tasks, i.e., sequences of simpler tasks. This approach is in line with descriptions of the motor behaviour of humans and primates, who compose their early motions and recombine them after a maturation phase into sequences of action primitives [19]. In artificial systems, this idea has been implemented as a neuro-dynamic model by composing action primitives in [20], and has been proposed in [21] to learn subtask policies and a scheduler to switch between subtasks, with offline off-policy learning, to derive a solution that is time-dependent on a scheduler. On the other hand, options were proposed as a temporally abstract representation of complex actions made of lower-level actions, and revealed faster to reach interesting subspaces, as reviewed in [22]. Learning simple skills and then combining them by skill chaining is shown in [23] to be more effective than learning the sequences directly. Other approaches using temporal abstraction and hierarchical organisation have been proposed [24].

More recently, intrinsic motivation has also tackled hierarchical RL to build increasingly complex skills by discovering and exploiting the task hierarchy using planning methods [25]. However, this approach does not model explicitly a representation of the task hierarchy, letting planning compose the sequences in the exploitation phase. IM has also been used with temporal abstraction and deep learning in h-DQN [26], with a meta level learning subgoals and a controller level learning policies over atomic actions.
However, h-DQN was only applied to discrete state and action spaces. A similar idea has been proposed for continuous action and state spaces in [27], where the algorithm IM-PB relies on a representation, called procedure, of the task decomposition. A fully autonomous intrinsically motivated learner successfully discovers and exploits the task hierarchy of its complex environment, while still building sequences of action primitives of adapted sizes. These approaches could generalise across tasks by re-using the policies of subgoals for more complex tasks, once the task decomposition is learned. We would like to investigate here, for continuous action and state spaces, the role of transfer of knowledge on task decomposition in hierarchical reinforcement learning, when there are
several levels of hierarchy; and how transfer learning operates when tasks are hierarchically related compared to when tasks are similar.

Imitation learning techniques, or Learning from Demonstration (LfD) [28,29], provide human knowledge for complex control tasks. However, imitation learning is often limited by the set of demonstrations. To overcome the sub-optimality and noisiness of demonstrations, imitation learning has recently been combined with RL exploration, for instance when initial human demonstrations have successfully initiated RL in [30,31]. In [32], the combination of transfer learning, learning from demonstration and reinforcement learning significantly improves both learning time and policy performance for a single task. However, in the previously cited works, two hypotheses are frequently made:
• The transfer of knowledge needs only one type of information. However, for multi-task learning, the demonstration set should provide different types of information depending on the task and the knowledge of the learner. Indeed, in imitation learning works, different kinds of information for transfer of knowledge have been examined separately, depending on the setup at hand: external reinforcement signals [33], demonstrations of actions [34], demonstrations of procedures [35], advice operators [36] or disambiguation among actions [37]. The combination of different types of demonstrations has been studied in [38] for multi-task learning, where the proposed algorithm Socially Guided Intrinsic Motivation with Active Choice of Teacher and Strategy (SGIM-ACTS) showed that imitating a demonstrated action and outcome has different effects depending on the task, and that the combination of different types of demonstrations with autonomous exploration bootstraps the learning of multiple tasks.
For hierarchical RL, the algorithm Socially Guided Intrinsic Motivation with Procedure Babbling (SGIM-PB) in [35] could also take advantage of demonstrations of actions and task decompositions. We propose in this study to examine the role of each kind of demonstration with respect to the control tasks in order to learn task hierarchy.
• The timing of these demonstrations has no influence. However, in curriculum learning the timing of knowledge transfer should be essential. Furthermore, the agent best knows when and what information it needs from the teachers, and active requests for knowledge transfer should be more efficient. For instance, a reinforcement learner choosing when to request social guidance was shown in [39] to make more progress. Such techniques are called active imitation learning or interactive learning, and echo the psychological descriptions of infants' selectivity in social partners and its link to their motivation to learn [40,41]. Active imitation learning has been implemented in [42], where the agent learns when to imitate using intrinsic motivation for a hierarchical RL problem in a discrete setting. For continuous action, state and goal spaces, the SGIM-ACTS algorithm [38] uses intrinsic motivation to choose not only the kind of demonstrations, but also when to request demonstrations and whom to ask among several teachers. SGIM-ACTS was extended for hierarchical reinforcement learning with the algorithm Socially Guided Intrinsic Motivation with Procedure Babbling (SGIM-PB) in [35]. In this article, we will study whether a transfer of a batch of data or an active learner is more efficient to learn task hierarchy.
We combine both types of transfer of knowledge, across tasks and from experts, to address multi-task learning in a hierarchical RL problem in a non-rewarding, continuous and unbounded environment, where experts with their own fields of expertise, unknown to the robot, are available at the learner's request. We propose to continue the approach
initiated in [35,43] with the SGIM-PB algorithm, which combines intrinsic motivation, imitation learning and transfer learning to enable a robot to learn its curriculum by:
• Discovering and exploiting the task hierarchy using a dual representation of complex actions in action and outcome spaces;
• Combining autonomous exploration of the task decompositions with imitation of the available teachers, using demonstrations as task dependencies;
• Using intrinsic motivation, and more precisely its empirical measures of progress, as its guidance mechanism to decide which information to transfer across tasks; and, for imitation, when, how and from which source of information to transfer.

In this article, we examine how task decomposition can be learned and transferred from a teacher or another learner using the mechanisms of intrinsic motivation, in autonomous exploration and active imitation learning, for discovering task hierarchy in cross-task and cross-learner transfer learning. More precisely, while in [35] we showed on a toy simulation faster learning and better precision in the control, in this article we show on an industrial robot that task decomposition is pivotal to completing tasks of higher complexity (by adding more levels of hierarchy in the experimental setup), and we test the properties of our active imitation of task decomposition: it is valid for cross-learner transfer even in the case of different embodiments, and active imitation proves more efficient than imitation of a batch dataset given from initialisation. This use-case study enables deeper analysis into the mechanisms of the transfer of knowledge.

The article is organised as follows: we describe our approach in Section 3, and present our setups on the physical simulator and the real robot of an industrial robot arm in Section 4. The results are analysed in Section 5 and discussed in Section 6; finally, we conclude this article in Section 7.
3. Our Approach
Grounding our work in cognitive developmental robotics [44,45], we propose an intrinsically motivated learner able to self-organise its learning process for multi-task learning of hierarchically organised tasks by exploring action, task and task decomposition spaces. Our proposed algorithm combines autonomous exploration with active imitation learning into a learner discovering the task hierarchy, to reuse its previously gained skills for tackling more complex ones, while adapting the complexity of its actions to the complexity of the task at hand.

In this section, we first formalise the learning problem we are facing. Then we describe the algorithm Socially Guided Intrinsic Motivation with Procedure Babbling (SGIM-PB). This algorithm uses a task-oriented representation of task decomposition, called procedures, to build more and more complex actions, adapting to the difficulty of the task.
Let us consider a robot, able to perform motions through the use of action primitives πθ ∈ Π. We suppose that the action primitives are parametrised functions with parameters of dimension n. We note the parameters θ ∈ R^n. The action primitives represent the smallest units of motion available to the robot. The robot can also chain multiple action primitives together to form sequences of action primitives of any size k ∈ N. We consider that the robot can execute actions in the form of a sequence of one or several action primitives. We note π an action, and will specify the parameter θ in the case of an action primitive πθ. We note Π^N the complete space of actions of any size available to the learner.

The environment can change as a consequence of the motions of the robot. We call outcomes ω ∈ Ω these consequences. They can be of various types and dimensionalities, and are therefore split into outcome subspaces Ωi ⊂ Ω. Those outcomes can also be of different complexities, meaning that the actions generating these outcomes may require different numbers of action primitives to chain. The robot aims for learning generalisation (how to reach a range of outcomes as broad as possible) and learning speed. It learns which action to perform depending on the outcome to generate, known as the inverse model M : ω ↦ π. A task is thus a desired outcome, and the inverse model indicates which action can reach it. As more than one action can lead to the same outcome, M is not a function.

We take the trial-and-error approach, and we suppose that the error can be evaluated and that Ω is a metric space, which means the learner can evaluate a distance d(ω1, ω2) between two outcomes. Let us note H the hierarchy of the tasks used by our robot.
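Before turning to the task hierarchy H, the inverse model described above can be sketched as a nearest-neighbour lookup over an episodic memory of (action, outcome) pairs. This is a minimal illustrative sketch: the class and function names, and the choice of k, are our own and not taken from the SGIM-PB implementation.

```python
import math

class Memory:
    """Stores (action, outcome) pairs collected by exploration."""
    def __init__(self):
        self.records = []  # list of (action_parameters, outcome) tuples

    def add(self, action, outcome):
        self.records.append((action, outcome))

def distance(w1, w2):
    """Euclidean distance between two outcomes (Ω is a metric space)."""
    return math.dist(w1, w2)

def inverse_model(memory, goal, k=3):
    """M: ω ↦ π -- return the k actions whose recorded outcomes are
    nearest to the goal outcome. Several actions may reach the same
    outcome, so M returns candidates rather than a single answer."""
    ranked = sorted(memory.records, key=lambda r: distance(r[1], goal))
    return [action for action, _ in ranked[:k]]

# Actions are flat parameter vectors; a sequence of k primitives of
# dimension n is a vector of dimension k*n.
mem = Memory()
mem.add((0.1, 0.2), (1.0, 1.0))             # one primitive, outcome (1, 1)
mem.add((0.5, 0.5, 0.9, 0.1), (2.0, 2.0))   # sequence of 2 primitives
candidates = inverse_model(mem, goal=(1.1, 0.9), k=1)  # -> [(0.1, 0.2)]
```

In practice SGIM-PB refines such candidates by local regression rather than returning them directly, as detailed below.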
H is formally defined as a directed graph where each node is a task T. Figure 2 shows a representation of task hierarchy. As our algorithm is tackling the learning of complex hierarchically organised tasks, exploring and exploiting this hierarchy could ease the learning of the most complex tasks.

Figure 2.
Task hierarchy of the Yumi experimental setup: blue lines represent task decomposition into procedures for the simulation setup, dashed lines for the physical setup, red lines for the direct inverse model. E.g., to move a blue object placed initially at (x0, y0) to a desired position (x1, y1), the robot can carry out task ωi that consists in moving its end-effector to (x0, y0) to pick it, then task ωj that consists in moving the end-effector to (x1, y1). These subtasks are executed with action primitives πi and πj respectively. Therefore, to move the object, the learning agent can use the sequence of action primitives (πi, πj).

To represent and learn this task hierarchy, we are using the procedure framework. This representation has been created as a way to push the learner to combine previously learned tasks to form more complex ones. A procedure is a combination of outcomes (ω1, ω2) ∈ Ω². Carrying out a procedure (ω1, ω2) means executing sequentially each component πi of the action sequence, where πi is an action that reaches ωi, ∀ i ∈ {1, 2}.

We can formalise task decomposition as a mapping from a desired outcome ωg to a procedure, P : ωg ↦ (ω1, ω2). ω1 and ω2 are then called subtasks. In the task hierarchy H, an outcome represents a task node in the graph, while the task decomposition represents the directed edges, and the procedure is the list of its successors. H is initialised as a densely connected graph, and the exploration of SGIM-PB prunes the connections by testing which procedures or task decompositions respect the ground truth. The procedure space or task decomposition space Ω² is a new space to be explored by the learner to discover and exploit task decompositions. Our proposition is to derive the inverse model M by using recursively the inverse model M and the task decomposition P until we derive a sequence of action primitives, following this recursion:

M(ωg) ↦ (πθ1, ..., πθk) with k ∈ N, or
M(ωg) ↦ (M(ω1), M(ω2)) if P(ωg) ↦ (ω1, ω2)

We here present the algorithm SGIM-PB (Socially Guided Intrinsic Motivation with Procedure Babbling). In Section 5.5, we will also look at the variant named SGIM-TL (Socially Guided Intrinsic Motivation with Transferred Lump), which is provided at initialisation with a dataset of transferred procedures and their corresponding reached outcomes: {(ωi, ωj), ωr}. The main differences between SGIM-PB and SGIM-TL are outlined, and they are contrasted with the former versions IM-PB and SGIM-ACTS, in Table 1. SGIM-PB and SGIM-TL propose to learn both M and P simultaneously.
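The recursion above can be sketched as a short function: if a task decomposition P(ωg) is known, recurse on each subtask; otherwise fall back to a directly known action. The dictionaries and task names below are illustrative stand-ins for the learner's models, not the paper's data structures.

```python
def resolve(goal, decomposition, known_actions):
    """Return a flat sequence of action primitives reaching `goal`.
    decomposition: dict mapping a task to its procedure (pair of subtasks),
                   i.e., a sketch of P.
    known_actions: dict mapping a task to a primitive reaching it,
                   i.e., a sketch of the directly learned part of M."""
    if goal in decomposition:
        w1, w2 = decomposition[goal]
        # Carrying out the procedure (ω1, ω2) = executing sequentially
        # the action sequences reaching ω1 then ω2.
        return resolve(w1, decomposition, known_actions) + \
               resolve(w2, decomposition, known_actions)
    return [known_actions[goal]]

# Example mirroring Figure 2: moving the blue object decomposes into two
# end-effector (touch) subtasks, each reached by one primitive.
decomposition = {"move_blue": ("touch_A", "touch_B")}
known_actions = {"touch_A": "pi_i", "touch_B": "pi_j"}
steps = resolve("move_blue", decomposition, known_actions)  # -> ["pi_i", "pi_j"]
```

Because the recursion bottoms out at primitives, the length of the produced sequence adapts to the depth of the hierarchy, matching the unbounded action sequences of the problem formalisation.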
Table 1. Differences between SGIM-PB, SGIM-TL, SGIM-ACTS and IM-PB.
Algorithm | Action Representation | Strategies σ | Transferred Dataset | Timing of the Transfer | Transfer of Knowledge
IM-PB [27] | parametrised actions, procedures | auton. action space explo., auton. procedural space explo. | None | NA | cross-task transfer
SGIM-ACTS [38] | parametrised actions | auton. action space explo., mimicry of an action teacher | Teacher demo. of actions | Active request by the learner to the teacher | imitation
SGIM-TL | parametrised actions, procedures | auton. action space explo., auton. procedural space explo., mimicry of an action teacher, mimicry of a procedure teacher | Another robot's procedures, teacher demo. of actions and procedures | Procedures transferred at initialisation time, active request by the learner to the teacher | cross-task transfer, imitation
SGIM-PB | parametrised actions, procedures | auton. action space explo., auton. procedural space explo., mimicry of an action teacher, mimicry of a procedure teacher | Teacher demo. of actions and procedures | Active request by the learner to the teacher | cross-task transfer, imitation
SGIM-PB starts learning from scratch. It is only provided with:
• the dimensionality and boundaries of the action primitive space Π;
• the dimensionality and boundaries of each of the outcome subspaces Ωi ⊂ Ω;
• the dimensionality and boundaries of the procedural spaces (Ωi, Ωj) ⊂ Ω², defined as all possible pairs of two outcome subspaces.

Our SGIM-PB agent is to collect data in order to learn how to reach all outcomes by generalising from the sampled data. This means it has to learn, for all reachable outcomes, the actions or procedures to use to reach the outcome. This corresponds to learning the inverse model M. The model uses a local regression based on the k-nearest neighbours from the data collected by exploration. In order to do that, the learning agent is provided with different exploration strategies (see Section 3.3.1), defined as methods to generate a procedure or action for any given outcome. The 4 types of strategies available to the robot are: two autonomous exploration strategies and two interactive strategies per task type. For the autonomous exploration strategies, we consider action space exploration and procedural space exploration. For the interactive strategies, we consider mimicry of an action or a procedure teacher of the task type: the former's demonstrations are motor actions, while the latter's are procedures.

As these strategies could be more appropriate for some tasks than others, and as their effectiveness can depend on the maturity of the learning process, our learning agent needs to map the outcome subspaces and regions to the best suited strategies to learn them.
Thus, SGIM-PB uses an Interest Mapping (see Section 3.3.2) that associates to each strategy and region of an outcome space partition an interest measure, so as to guide the exploration by indicating which tasks are the most interesting to explore at the current learning stage, and which strategy is the most efficient.

The SGIM-PB algorithm (see Algorithm 1, Figure 3) learns by episodes, and starts each episode by selecting an outcome ωg ∈ Ω to focus on and an exploration strategy σ. The strategy and outcome region are selected based on the Interest Mapping by roulette wheel selection, also called fitness proportionate selection, where the interest measure serves as fitness (see Section 3.3.2).
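The roulette wheel (fitness proportionate) selection over the interest mapping can be sketched as follows. Here the interest map is a simple dictionary and the region/strategy labels are illustrative; the actual algorithm maintains a partition of the outcome space.

```python
import random

def roulette_select(interest_map, rng=random):
    """Draw a (region, strategy) pair with probability proportional to
    its interest measure (fitness proportionate selection).
    interest_map: dict mapping (region, strategy) -> interest >= 0."""
    pairs = list(interest_map.items())
    total = sum(interest for _, interest in pairs)
    pick = rng.uniform(0, total)
    acc = 0.0
    for (region, strategy), interest in pairs:
        acc += interest
        if pick <= acc:
            return region, strategy
    return pairs[-1][0]  # guard against floating-point edge cases

# Illustrative interest values: autonomous exploration of Ω1 currently
# yields the most progress, so it is sampled most often.
interest_map = {
    ("Omega_1", "auton_action"): 0.6,
    ("Omega_1", "mimic_action_teacher"): 0.1,
    ("Omega_2", "auton_procedure"): 0.3,
}
region, strategy = roulette_select(interest_map)
```

This stochastic choice keeps some probability mass on low-interest pairs, so no strategy or region is ever ruled out permanently.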
Figure 3.
SGIM-PB architecture: the arrows show the data transfer between the different blocks; the numbers between brackets refer to specific line numbers in Algorithm 1.
Algorithm 1 SGIM-PB and its SGIM-TL variant

Input: the different strategies σ1, ..., σm
Initialization: partition of outcome spaces R ← ⊔i {Ωi}
Input: CASE SGIM-TL: transfer dataset D = {(ωi, ωj), ωr}
Initialization: CASE SGIM-TL: episodic memory Memo ← {(ωi, ωj), ωr}
loop
    ωg, σ ← Select Goal Outcome and Strategy(R)
    if σ = autonomous action space exploration strategy then
        Memo ← Goal-Directed Action Optimization(ωg)
    else if σ = autonomous procedural space exploration strategy then
        Memo ← Goal-Directed Procedure Optimization(ωg)
    else if σ = mimicry of action teacher i strategy then
        (πd, ωd) ← ask and observe demonstrated action to teacher i
        Memo ← Mimic Action(πd)
    else if σ = mimicry of procedural teacher i strategy then
        ((ωdi, ωdj), ωd) ← ask and observe demonstrated procedure to teacher i
        Memo ← Mimic Procedure((ωdi, ωdj))
    end if
    Update M with collected data Memo
    R ← Update Outcome and Strategy Interest Mapping(R, Memo, ωg)
end loop

At each episode, the robot and the environment are reset to their initial states. The learner uses the chosen strategy to build an action (based or not on a procedure), decomposed into action primitives which are then executed sequentially without getting back to its initial position. Whole actions are recorded, along with their outcomes. Each step (after each action primitive execution) of the sequence of action primitives is also recorded in
Memo.

In an episode under the autonomous action space exploration strategy, the learner tries to optimise the action π to produce ωg by stochastically choosing between random exploration of actions and local regression on the k-nearest action sequence neighbours. The probability of choosing local regression over random actions is proportional to the distance of ωg to its nearest action neighbours. This action optimisation is called Goal-Directed Action Optimization and is based on the SAGG-RIAC algorithm [15].

In an episode under the autonomous procedural space exploration strategy, the learner tries to optimise the procedure (ωi, ωj) ∈ Ω² to produce ωg, by stochastically choosing between random exploration of the procedural space and local regression on the k-nearest procedure neighbours. The probability of choosing local regression over random procedures is proportional to the distance of ωg to its nearest procedure neighbours. This process is called Goal-Directed Procedure Optimization.

In an episode under the mimicry of an action teacher strategy, the learner requests a demonstration of an action to reach ωg from the chosen teacher. The teacher selects the demonstration πd as the action in its demonstration repertoire reaching the closest outcome to ωg. The learner has direct access to the parameters of πd = (πd1, πd2, ..., πdl), and explores locally the action parameter space (we do not consider the correspondence problem from teachers' demonstrations).

In an episode under the mimicry of a procedural teacher strategy, the learner requests a task decomposition of ωg from the chosen teacher.
The demonstrated procedure (ωdi, ωdj) will define a locality in the procedure space for the learner to explore.

When performing nearest neighbour searches during the execution of the autonomous action and procedure exploration strategies (for local optimisation of procedures or actions, or when executing a procedure), the algorithm uses a performance metric which takes into account the complexity of the underlying action selected:

perf(ωg) = d(ω, ωg) γ^n    (1)

where d(ω, ωg) is the normalised Euclidean distance between the target outcome ωg and the reached outcome ω, n is the size of the action chosen (the length of the sequence of primitives), and γ > 1 is a constant penalising longer actions.

After each episode, the learner assesses its competence for the goal outcome ωg by computing the distance d(ωr, ωg) with the outcome ωr it actually reached (if it has not reached any outcome in Ωi, we use a default value d_thres). Then an interest measure is computed for the goal outcome and all outcomes reached during the episode (including the outcomes from a different subspace than the goal outcome):

interest(ω, σ) = p(ω) / K(σ)    (2)

where the progress p(ω) is the difference between the best competence for ω before and after the episode, and K(σ) is the cost associated to each strategy. K(σ) is a meta-parameter to favour some strategies, such as the autonomous ones, to push the robot to rely on itself as much as possible instead of bothering teachers.

The interest measures are then used to partition the outcome space Ω. The trajectory of the episode is added to the partition with hindsight experience replay (both goal and reached outcomes are taken into account), storing the values of the strategy σ, the outcome parameter, and the interest measure.
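Equations (1) and (2) can be sketched directly. The value of γ and the strategy cost below are illustrative assumptions; the text only requires γ > 1.

```python
import math

def perf(reached, goal, n, gamma=1.2):
    """Eq. (1): perf(ωg) = d(ω, ωg) * γ^n. Lower is better;
    γ > 1 penalises longer sequences of n primitives.
    gamma=1.2 is an assumed value for illustration."""
    return math.dist(reached, goal) * gamma ** n

def interest(best_perf_before, best_perf_after, strategy_cost):
    """Eq. (2): interest = p(ω) / K(σ), where the progress p(ω) is the
    improvement of the best competence on ω over the episode, and K(σ)
    is the (meta-parameter) cost of the strategy."""
    progress = best_perf_before - best_perf_after
    return progress / strategy_cost

# A 1-primitive action reaching within 0.1 of the goal scores better
# (lower) than a 5-primitive action reaching within 0.05 of it, because
# of the γ^n complexity penalty.
short = perf((0.1, 0.0), (0.0, 0.0), n=1)   # 0.1 * 1.2^1 = 0.12
long_ = perf((0.05, 0.0), (0.0, 0.0), n=5)  # 0.05 * 1.2^5 ≈ 0.124
gain = interest(best_perf_before=0.5, best_perf_after=0.3, strategy_cost=2.0)
```

The division by K(σ) illustrates how a costly interactive strategy must yield more progress than an autonomous one to obtain the same interest.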
When the number of outcomes added to a region exceeds a fixed limit, the region is split into two regions, with a clustering boundary that separates outcomes with low interest from those with high interest. This method is explained in more detail in [46]. The interest mapping is a tool to identify the zone of proximal development, where the interest is maximal, and to organize the learning process.

This interest mapping is initialized with the partition composed of each outcome type Ω_i ⊂ Ω. For the first episode, the learning agent always starts by choosing a goal and a strategy at random.

In particular, this interest mapping enables SGIM-PB to uncover the task hierarchy by associating goal tasks and procedures. When testing a specific procedure (ω_i, ω_j) ∈ Ω² that produces ω instead of the goal ω_g, under the procedural space exploration or the mimicry of a procedural teacher strategies, SGIM-PB assesses the performance of this task decomposition and records the trajectory of the episode in the memory. This task decomposition is likely to be reused during local regression on the k-nearest neighbours for tasks close to ω and ω_g and for short sequences of primitives, i.e., if its performance perf(ω_g) = d(ω, ω_g) γ^n is good (a low value). Thus the different task decompositions are compared both for their precision and for their cost in terms of complexity. At the same time, SGIM-PB updates the interest map for that strategy. If the procedure is not relevant for the goal ω_g, the procedure is ignored henceforward. On the contrary, if the procedure is the right task decomposition, the interest interest(ω, σ) for this procedure exploration/mimicry strategy σ increases. Thus, conversely, SGIM-PB continues to explore using the same strategy, and tests more procedures for the same region of outcomes.
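The region-splitting step can be sketched as follows. This is a toy illustration: the actual method [46] chooses a clustering boundary in the outcome space, not a simple median split over interest values, and the size limit here is arbitrary.

```python
def split_region(records, max_size=4):
    """Split a region of (outcome_param, interest) records once it
    holds more than `max_size` entries, separating low-interest
    outcomes from high-interest ones (toy median split)."""
    if len(records) <= max_size:
        return [records]
    ordered = sorted(records, key=lambda r: r[1])  # sort by interest
    cut = len(ordered) // 2
    return [ordered[:cut], ordered[cut:]]
```

Recursively applying such splits yields a partition of Ω whose high-interest regions mark the learner's current zone of proximal development.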
As the number of explored procedures increases and as procedures are selected by intrinsic motivation (using the performance and interest criteria), SGIM-PB associates goal tasks to the relevant procedures, hence builds up the adequate task decompositions. These associations to procedures constitute the task hierarchy uncovered by the robot.
4. Experiment
In this section, we present the experiment we conducted to assess the capabilities of our SGIM-PB algorithm. The experimental setup features the 7-DOF right arm of an industrial Yumi robot from ABB, which can interact with an interactive table and its virtual objects. Figure 1 shows the robot facing a tangible RFID interactive table from [47]. The robot learns to interact with it using its end-effector. It can learn an infinite number of hierarchically organized tasks regrouped in 5 types of tasks, using sequences of motor actions of unrestricted size. The experiments have been carried out with the physical industrial ABB robot and with the simulation software provided with the robot, RobotStudio. While this software provides static inverse kinematics, it cannot provide movement trajectories.

We made preliminary tests of our SGIM-PB learner on a physical simulator of the robot. We will call this the simulation setup. In the simulation setup, the end-effector is the tip of the vacuum pump below its hand. The simulation setup will be modified into a setup called the left-arm setup in Section 5.5.

We also implemented this setup as a physical experiment, to compare both interactive strategies more fairly using SGIM-ACTS and SGIM-PB. For that, we modified the procedural teacher strategies so that they have the same limited repertoire from which to draw demonstrations as the action teachers. We also added an extra, more complex task without demonstrations to compare the autonomous exploration capabilities of both algorithms. This setup, shown in Figure 1, will be referred to as the physical setup. In the physical setup, the end-effector is the bottom part of the hand.
In the following subsections, we describe in more detail the physical setup and the simulation setup and their variables, while mentioning their differences. We end this section by presenting the teachers and the evaluation methods.
The position of the arm's tip on the table (see Figure 4) is noted (x0, y0). Two virtual objects (disks of radius R) are placed at positions (x1, y1) and (x2, y2). Only one object can be moved at a time, otherwise the setup is blocked and the robot's motion cancelled. If both objects have been moved, a burst sound is emitted by the interactive table, parametrised by its frequency f, its intensity level l and its rhythm b. It can be maintained for a duration t by touching a new position on the table afterwards. The sound parameters are computed with arbitrary rules, chosen so as to have both linear and non-linear relationships, as follows:

f = (D/4 − d_min)/D    (3)
l = −(ln(r) − ln(r_min))/(ln(D) − ln(r_min))    (4)
b = |ϕ|/π    (5)
t = d/D    (6)

where D is the interactive table diagonal, (r, ϕ) are the polar coordinates of the green object in the coordinate system centred on the blue object, r_min = R, d_min is the distance between the blue object and the closest table corner, and d is the distance between the end-effector position on the table and the green object.

Figure 4.
Representation of the interactive table: the first object is in blue, the second one in green; the produced burst sound and maintained sound are represented in the top left corner. The outcome spaces are also represented in this figure, with their position parameters in black and the sound parameters in red.
The interactive table state is refreshed after each action primitive executed. The robot is not allowed to collide with the interactive table; in this case, the motor action is cancelled and reaches no outcome. Before each attempt, the robot is set to its initial position and the environment is reset.

Each joint motion is encoded by a Dynamic Movement Primitive (DMP):

τ·v̇ = K(g − x) − Dv + (g − x) f(s)    (7)
τ·ẋ = v    (8)
τ·ṡ = −αs    (9)

where x and v are the position and velocity of the system; s is the phase of the motion; x0 and g are the starting and end positions of the motion; τ is a factor used to temporally scale the system (set to fix the duration of an action primitive execution); K and D are the spring constant and damping term, fixed for the whole experiment; α is also a constant fixed for the experiment; and f is a non-linear term used to shape the trajectory, called the forcing term. This forcing term is defined as:

f(s) = (Σ_i w_i ψ_i(s) s) / (Σ_i ψ_i(s))    (10)

where ψ_i(s) = exp(−h_i (s − c_i)²), with centers c_i and widths h_i fixed for all primitives. There is one weight w_i per DMP, which is therefore simply noted w.

The weights of the forcing term and the end positions are the only parameters of the DMP used by the robot. One weight per DMP is used and each DMP controls one joint. The starting position of a primitive is set by either the initial position of the robot (if it is starting a new action) or the end position of the preceding primitive. Therefore an action primitive π_θ is parametrized by:

θ = (a_1, a_2, a_3, a_4, a_5, a_6, a_7)    (11)

where a_i = (w(i), g(i)) corresponds to the DMP parameters of joint i: the final joint angle g(i) and the parameter w(i) of the basis function of the forcing term. The action primitive space is thus Π = R^14, and the complete action space is (R^14)^N.
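Equations (7)-(9) can be integrated numerically for one joint. The sketch below uses a single basis function per DMP, as in the paper, but the gains, time step, and basis parameters are illustrative values, not those of the experiment.

```python
import numpy as np

def dmp_rollout(x0, g, w, K=25.0, D=10.0, alpha=4.0, tau=1.0,
                c=0.5, h=10.0, dt=0.002, n_steps=500):
    """Euler integration of Equations (7)-(9) for one joint, with a
    single basis function in the forcing term (one weight w per DMP,
    as in the paper).  Gains and basis parameters are illustrative."""
    x, v, s = float(x0), 0.0, 1.0
    traj = [x]
    for _ in range(n_steps):
        psi = np.exp(-h * (s - c) ** 2)
        f = (w * psi * s) / psi          # Equation (10), one basis function
        v += dt / tau * (K * (g - x) - D * v + (g - x) * f)
        x += dt / tau * v
        s += dt / tau * (-alpha * s)
        traj.append(x)
    return np.array(traj)
```

As the phase s decays to zero, the forcing term vanishes and the spring-damper system pulls the joint to the goal angle g, which is why only w and g need to be learned per joint.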
4.2.2. Outcome Spaces

The outcome spaces the robot learns are hierarchically organized:
• Ω0 = {(x0, y0)}: positions touched on the table;
• Ω1 = {(x1, y1)}: positions of the first object;
• Ω2 = {(x2, y2)}: positions of the second object;
• Ω3 = {(x1, y1, x2, y2)}: positions of both objects;
• Ω4 = {(f, l, b)}: burst sounds produced;
• Ω5 = {(f, l, b, t)}: maintained sounds produced.

The outcome space is a composite and continuous space (for the physical setup Ω = ⋃_{i=0..5} Ω_i, for the simulation setup Ω = ⋃_{i=0..4} Ω_i), containing subspaces of 2 to 4 dimensions. Multiple interdependencies are present between tasks: controlling the position of either the blue object (Ω1) or the green object (Ω2) comes after being able to touch the table at a given position (Ω0); moving both objects (Ω3) or making a sound (Ω4) comes after being able to move the blue and the green objects; the maintained sound (Ω5) is the most complex task of the physical setup. This hierarchy is shown in Figure 2.

Our intuition is that a learning agent should start by making good progress in the easiest task Ω0, then Ω1 and Ω2. Once it has mastered those easy tasks, it can reuse that knowledge to learn to achieve the most complex tasks Ω3 and Ω4. We will particularly focus on the learning of the Ω4 outcome space and the use of the procedure framework for it. Indeed, in this setting, the relationship between a goal outcome in Ω4 and the necessary positions of both objects (Ω1, Ω2) to reach that goal is not linear. So with this setting, we test if the robot can learn a non-linear mapping between a complex task and a procedural space.
Finally, for the physical version, we see if the robot can reach and explore the most complex task Ω5 in the absence of an allocated teacher.

To help SGIM-PB in the simulation setup, procedural teachers (each with a fixed strategical cost K(σ)) were provided, one for each outcome space apart from Ω0. Each teacher gives on the fly a procedure adapted to the learner's request, according to its domain of expertise and according to a construction rule:
• ProceduralTeacher1: Ω1 → (Ω0, Ω0);
• ProceduralTeacher2: Ω2 → (Ω0, Ω0);
• ProceduralTeacher3: Ω3 → (Ω1, Ω2);
• ProceduralTeacher4: Ω4 → (Ω1, Ω2).

We also added different action teachers (each with a fixed strategical cost K(σ)):
• ActionTeacher0 (Ω0): 11 demos of action primitives;
• ActionTeacher1 (Ω1): 10 demos of size 2 actions;
• ActionTeacher2 (Ω2): 8 demos of size 2 actions;
• ActionTeacher34 (Ω3 and Ω4): 73 demos of size 4 actions.

In the physical setup, we want to delve into the differences between action and procedural teachers. So as to put them on an equal footing, we used for all teachers demonstration datasets of limited sizes. In the physical version, the demonstrations of the action and procedural teachers reach the same outcomes for Ω1, Ω2, Ω3 and Ω4. An extra action teacher provides demonstrations for the simplest outcome space Ω0:
• ActionTeacher0 (Ω0): 9 demos of action primitives;
• ActionTeacher1 (Ω1): 7 demos of size 2 actions;
• ActionTeacher2 (Ω2): 7 demos of size 2 actions;
• ActionTeacher3 (Ω3): 32 demos of size 4 actions;
• ActionTeacher4 (Ω4): 7 demos of size 4 actions.

The demonstrations of the procedural teachers correspond to the subgoals reached by the action primitives of the action teachers. The procedural teachers have the same number of demonstrations as their respective action teachers.

The action teachers were provided to the SGIM-ACTS learner, while the SGIM-PB algorithm had all the procedural teachers and ActionTeacher0.
No teacher was provided for the most complex outcome space Ω5, so as to compare the autonomous exploration capabilities of both learners.

In both setups, the number of demonstrations was chosen arbitrarily small, and the higher the dimensionality of the outcome space a teacher teaches, the more demonstrations it can offer. The demonstrations were chosen so as to cover the reachable outcome spaces uniformly.

To evaluate our algorithm, we created a testbench set of goals uniformly covering the outcome space (the evaluation outcomes are different from the demonstration outcomes). It has 29,200 goals for the real robot and 19,200 goals for the simulated version. The evaluation consists in computing the mean Euclidean distance between each of the testbench goals and its nearest neighbour in the learner's memory (d_thres = 5). The evaluation is repeated regularly across the learning process.

Then, to assess the efficiency of our algorithm in the simulation setup, we compare the averaged results of 10 runs of 25,000 learning iterations (each run took about a week to complete) of the following algorithms:
• RandomAction: random exploration of the action space Π^N;
• IM-PB: autonomous exploration of the action space Π^N and the procedural space Ω², driven by intrinsic motivation;
• SGIM-ACTS: interactive learner driven by intrinsic motivation, choosing between autonomous exploration of the action space Π^N and mimicry of any action teacher;
• SGIM-PB: interactive learner driven by intrinsic motivation.
It has autonomous exploration strategies (of the action space Π^N or the procedural space Ω²) and mimicry strategies for any procedural teacher and for ActionTeacher0;
• Teachers: non-incremental learner only provided with the combined knowledge of all the action teachers.

For the physical setup, we only compare one run each of SGIM-ACTS and SGIM-PB (20,000 iterations, or one month of trials, for each), so as to delve deeper into the study of the interactive strategies and their impact.

The code used is available at https://bitbucket.org/smartan117/sgim-yumi-simu (simulated version) and at https://bitbucket.org/smartan117/sgim-yumi-real (physical version).
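The evaluation metric (mean Euclidean distance from each testbench goal to its nearest neighbour among reached outcomes, with the default d_thres for unreached subspaces) can be sketched as follows. The data structures are illustrative, not the authors' own.

```python
import numpy as np

def evaluate(testbench, memory_by_task, d_thres=5.0):
    """Mean Euclidean distance from each testbench goal to its
    nearest neighbour among the outcomes the learner reached.

    `testbench` and `memory_by_task` map a task id to an array of
    outcome parameters (illustrative structure)."""
    errors = []
    for task, goals in testbench.items():
        reached = memory_by_task.get(task)
        for goal in goals:
            if reached is None or len(reached) == 0:
                errors.append(d_thres)   # subspace never reached
            else:
                errors.append(float(np.linalg.norm(reached - goal, axis=1).min()))
    return float(np.mean(errors))
```

A learner that never reaches a subspace is thus charged the full default distance for every goal of that subspace, which makes unreachable complex tasks clearly visible in the curves.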
5. Results

Figure 5 shows the procedural spaces chosen for the complex outcome spaces Ω3 and Ω4 for the simulated setup (left column), and for Ω3, Ω4 and Ω5 for the physical setup (right column). We can see that in both setups, the procedural space most associated with each outcome space corresponds to the designed task hierarchy, namely, Ω3 mostly uses procedures (Ω1, Ω2), Ω4 uses (Ω1, Ω2), and Ω5 uses (Ω4, Ω0). It also corresponds to the hierarchy demonstrated by the procedural teachers. The learner was even capable of learning the task hierarchy in the absence of a provided teacher for the outcome space Ω5 in the physical setup. The SGIM-PB learner was able to learn the task hierarchy of the setup using the procedure framework.
The procedure repartition is not perfect though, as other subgoals are also used. For instance, some suboptimal procedures were also associated with these goals: although they correctly reach the goal tasks, they introduce longer sequences of actions.

Figure 5.
Task hierarchy discovered by the SGIM-PB learners: for the three complex outcome spaces, the percentage of times each procedural space was chosen, for the physical setup (left) and the simulated setup (center). The percentage scale is represented in the colorbar (right).
SGIM-PB was able to scale the complexity of its actions: it uses mostly primitives and sequences of 2 primitives for Ω0; sequences of 2 primitives for Ω1 and Ω2; of 4 primitives for Ω3 and Ω4; and of 5 primitives for Ω5 on the physical setup. It is not perfect, as SGIM-PB partially associated the Ω0 outcome space with sequences of 2 primitives when single action primitives were sufficient. Because all tasks require the robot to touch the table, and thus produce an outcome in Ω0, all the complex actions used for the other tasks could be associated with reaching Ω0: the redundancy of this model makes it harder to select the optimal action size.
Figure 6.
Histogram of the sizes of the actions chosen by SGIM-PB for each outcome space. Test results on the testbench tasks on the simulation (left) and physical setups (right).

This understanding of task complexity and task hierarchy also leads to a better performance of SGIM-PB. Figure 7 shows the global evaluation results of all tested algorithms for the simulated version. It plots the mean test error made by each algorithm to reach the goals of the benchmark, with respect to the number of actions explored during learning. SGIM-PB has the lowest level of error compared to the other algorithms. Thus
SGIM-PB learns with better precision. This is due to its transfer of knowledge from simple tasks to complex tasks after learning the task hierarchy, owing to both its use of imitation learning and the procedure representation.
Figure 7.
Evaluation of all algorithms throughout the learning process for the simulation setup; final standard deviations are given in the legend.
First, let us examine the role of imitation by contrasting the algorithms with active imitation (SGIM-PB and SGIM-ACTS) with the autonomous learners without imitation (RandomAction and IM-PB). IM-PB is the variation of SGIM-PB with only autonomous learning, without imitation of actions or imitation of procedures. In Figure 7, both autonomous learners have higher final levels of error than the active imitation learners, which shows the advantage of using social guidance. Besides, both SGIM-ACTS and SGIM-PB have error levels dropping below that of the teachers, showing they learned beyond the provided action demonstrations: the combination of autonomous exploration and imitation learning improves the learner's performance beyond the performance of the teachers.

Figure 8 plots the evaluation on the simulated setup for each type of task. While all algorithms have about the same performance on the simple task Ω0, we notice a significant difference for the complex tasks in Ω1, Ω2, Ω3 or Ω4 between the autonomous and the active imitation learners. The autonomous learners were not even able to reach a goal in the complex subspaces. In particular, the difference between IM-PB and SGIM-PB means that imitation is necessary to learn complex tasks: it is not only a speeding-up effect.

Figure 8.
Evaluation of all algorithms per outcome space on the simulation setup (RandomAction and IM-PB are superposed on all outcome spaces except Ω0).

These results show that the imitation strategies have improved the performance from the beginning of the learning process, and this improvement is more visible for complex tasks.
More than a speeding effect, imitation enables the robot to reach the first goals in the complex task subspaces, which are later optimised and generalised by autonomous exploration, so that SGIM-PB is not bound by the limitations of the demonstration dataset.

Second, let us examine the role of procedures, both for imitation and for autonomous exploration.

5.3.1. Demonstrations of Procedures

To analyse the difference between the imitation strategies, i.e., between imitation of action primitives and imitation of procedures, we can compare the algorithms SGIM-PB and SGIM-ACTS. While SGIM-PB has procedural teachers for the complex tasks and can explore the procedure space to learn task decomposition, SGIM-ACTS has action teachers and does not have the procedure representation to learn task decomposition. Instead, SGIM-ACTS explores the action space by choosing a length for its action primitive sequence, then the parameters of the primitives.

In Figure 7, we see that SGIM-PB is able to outperform SGIM-ACTS after only 2000 iterations, which suggests that procedural teachers can effectively replace action teachers for complex tasks. More precisely, as shown in Figure 8, for the task Ω0, where SGIM-ACTS and SGIM-PB have the same action primitive teacher ActionTeacher0, there is no difference in performance, while SGIM-PB outperforms SGIM-ACTS on all the complex tasks, particularly Ω3.

To understand the learning process of SGIM-PB that leads to this difference in performance, let us look at the evolution of the choices of each outcome space (Figure 9) and strategy (Figure 10). The improvement in performance of SGIM-PB compared to the other algorithms can be explained in Figure 9 by its choice of the task Ω3 in the curriculum for iterations above 10,000, after an initial phase where it explores all outcome spaces. In Figure 10, we notice that SGIM-PB mainly chooses as strategies: ProceduralTeacher3 among all imitation strategies, and autonomous exploration of procedures among the autonomous exploration strategies.
The histogram of each task-strategycombination chosen for the whole learning process in Figure 11 confirms that ProceduralTeacher3 uminy, N.; Nguyen, S.M.; Zhu, J.; Duhaut, D.; Kerdreux, J. Intrinsically Motivated Open-Ended Multi-Task Learning Using Transfer Learning to DiscoverTask Hierarchy. Appl. Sci. , , 975. https://doi.org/10.3390/app11030975 , 975 19 of 34 was chosen the most frequently among imitation strategies specifically for tasks Ω , but SGIM-PBused most extensively Autonomous procedures. Thus SGIM-PB performance improvementis correlated with its procedure space exploration with both Autonomous procedures andprocedural imitation strategies . Figure 9.
Evolution of choices of tasks for the SGIM-PB learner during the learning process on the simulation setup.
This comparison between SGIM-PB and SGIM-ACTS on the simulation is confirmed on the physical setup. The global evaluation in Figure 12 shows a significant gap between the two performances. SGIM-PB even outperforms SGIM-ACTS after only 1000 iterations, which suggests that procedural teachers can be more effective than action teachers for complex tasks.
Figure 10.
Evolution of choices of strategies for the SGIM-PB learner during the learning process on the simulation setup.
The performance per type of task in Figure 13 shows that, as in the simulation setup, there is little difference for the simplest tasks, and more difference on the complex tasks. The more complex the task, the more SGIM-PB outperforms SGIM-ACTS. This confirms that procedural teachers are better adapted to tackle the most complex and hierarchical outcome spaces.

Figure 11.
Choices of strategy and goal outcome for the SGIM-PB learner during the learning process on the simulation setup.
Figure 12.
Evaluation of all algorithms throughout the learning process for the physical setup.
Figure 13.
Evaluation for each outcome space of the physical Yumi setup.

The difference is largest for the tasks of Ω5, which are also the highest-level tasks of the hierarchy. For Ω5, no action or procedural teacher was provided to SGIM-PB or SGIM-ACTS, therefore we can contrast the specific effects of autonomous exploration of procedures with those of autonomous exploration of the action space. Figure 13 shows that the performance of SGIM-ACTS is constant for Ω5: it is not able to reach any task in Ω5 even once, while SGIM-PB, owing to the capability of the procedure framework to reuse the knowledge acquired for the other tasks, is able to explore this outcome space.

To understand the reasons for this difference, let us examine the histogram of the strategies chosen per task in Figure 14. For Ω5, SGIM-PB massively uses procedure space exploration compared to action space exploration. Owing to autonomous procedure exploration, SGIM-PB can thus learn complex tasks by using the decomposition of tasks into known subtasks.

Figure 14.
Choices of strategy and goal outcome for the SGIM-PB learner during the learning process on the physical setup.
This highlights the essential role of the procedure representation and of procedure space exploration, by imitation but also by autonomous exploration, in order to learn high-level hierarchical tasks, which have sparse rewards.

Given the properties of imitation and procedures that we outlined, is SGIM-PB capable of choosing the right strategy for each task, so as to build a curriculum starting from simple tasks before complex tasks?

Figure 14 shows that SGIM-PB uses more procedural exploration and imitation than action exploration or imitation for the most complex tasks, compared to the simpler tasks. A coupling appears between simple tasks and action space exploration on the one hand, and complex tasks and procedure exploration on the other hand. Moreover, for imitation, SGIM-PB was overall able to correctly identify the teacher most adapted to each outcome space. Its only suboptimal choice is to use the procedural teacher built for Ω3 to explore Ω4. This mistake can be explained by the fact that both outcome spaces have the same task decomposition (see Figure 2).

Likewise, for the simulation setup, in Figure 11, the histogram of explored task-strategy combinations confirms that the learner mostly uses autonomous exploration strategies: mostly action exploration for the simplest outcome spaces (Ω0, Ω1, Ω2), and procedure exploration for the most complex ones (Ω3, Ω4). This shows the complementarity of actions and procedures for exploration in an environment with a hierarchical set of outcome spaces. We can also see, for each task subspace, the proportion of imitation used. While for Ω0, SGIM-PB uses the autonomous action exploration strategy five times more than ActionTeacher0, the proportion of imitation increases for the complex tasks: imitation seems to be required more for complex tasks.
From Figure 11, we can also confirm that for each task, the teacher most requested is specialised in the goal task; the only exceptions are ProceduralTeacher3 and ProceduralTeacher4, who are specialised in different complex tasks but use the same subtask decomposition, so that demonstrations of ProceduralTeacher3 have effects on Ω4, and vice versa. The choices shown in the histogram show that SGIM-PB has spotted the teachers' domains of expertise.

Let us analyse the evolution of the time each outcome space (Figure 9) and strategy (Figure 10) is chosen during the learning process of SGIM-PB on the simulated setup. In Figure 9, its self-assigned curriculum starts by exploring the simplest task Ω0 until 1000 iterations. In Figure 10, we see that this period corresponds mainly to the use of the autonomous action exploration strategy, relying on itself to acquire its body schema. Then it gradually switches to working on the most complex outcome space Ω3 (the highest dimension) and marginally more on Ω4, while preferring autonomous procedures and, marginally, the teachers for Ω3 and Ω4. In contrast, the use of the strategy ActionTeacher0 decreases: SGIM-PB does not use action imitation any more. SGIM-PB switches from imitation of action primitives to procedures, and most of all it turns to the autonomous action and autonomous procedure strategies.

For the physical setup, the evolution of the outcome spaces (Figure 15) and strategies (Figure 16) chosen is more difficult to analyse. However, they show the same trend from iterations 0 to 10,000: the easy outcome spaces Ω0, Ω1, Ω2 are more explored in the beginning, before being neglected after 1000 iterations to explore the most complex outcome spaces. Imitation is mostly used in the beginning of the learning process, whereas later in the curriculum, autonomous exploration is preferred. The autonomous action exploration strategy was also less used.
However, the curriculum after 10,000 iterations is harder to analyse: on two occasions (around 13,000 and again around 17,000 iterations), SGIM-PB switched to autonomous exploration of actions while exploring the simpler outcome spaces Ω0, Ω1, Ω2. This might mean that the learner needs to consolidate its knowledge of the basic tasks before being able to make further progress on the complex tasks.

Figure 15.
Evolution of choices of tasks for the SGIM-PB learner during the learning process on the physical setup.
Figure 16.
Evolution of choices of strategies for the SGIM-PB learner during the learning process on the physical setup.
5.5. Transfer of Procedures between Different Embodiments

In this section, we explore the possibility of transferring a set of procedures as a batch at the beginning of the learning process, as opposed to active requests from the learner throughout the learning process. We consider a learner and a teacher with different embodiments working on the same tasks. While knowledge of actions cannot be reused straightforwardly, how can the knowledge of procedures be exploited?

In our example, we consider new learners trying to explore the interactive table using the left arm of Yumi, while benefitting from the knowledge acquired with a right-arm Yumi. We call this simulation setup the left-arm setup. We extracted a dataset D composed of the procedures and their corresponding reached outcomes ((ω_i, ω_j), ω_r), taken from the experience memory of a SGIM-PB learner that has trained on the right arm for 25,000 iterations (taken from the runs of SGIM-PB on the simulated setup). To analyse the benefits of batch transfer before learning, we run in the simulated setup (see Section 4.1) two variants of SGIM-PB, 10 times each, for 10,000 iterations:
• Left-Yumi-SGIM-PB: the classical SGIM-PB learner using its left arm, using exactly the same strategies as on the simulated setup, without any procedure transferred;
• Left-Yumi-SGIM-TL: a modified SGIM-PB learner, benefiting from the same strategies as on the simulated setup, and which receives the dataset D as a Transferred Lump at the initialisation phase: Memo ← {((ω_i, ω_j), ω_r)} at the beginning of its learning process. No actions are transferred, and the transferred data are only used for computing the local exploration of the procedural space, so they impact neither the interest model nor the test evaluations reported in the next section.
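The Transferred Lump initialisation can be sketched as follows. The structure of the learner's memory is illustrative; only the idea that procedures, not actions, are seeded comes from the text.

```python
def init_learner_with_transfer(transfer_dataset):
    """Seed a new learner's memory with a batch of
    ((subtask_i, subtask_j), reached_outcome) records coming from a
    learner with a different embodiment.  Only the procedure memory
    is filled: no actions are transferred, and these records serve
    only the local (nearest-neighbour) exploration of the
    procedural space."""
    memo = {"procedures": [], "actions": []}
    for procedure, reached in transfer_dataset:
        memo["procedures"].append((procedure, reached))
    return memo
```

Because procedures are expressed in outcome (task) space rather than in joint space, such records remain meaningful for the left arm even though the right arm's motor commands would not.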
All procedural teachers propose the same procedures as in Section 4.1 for the right-handed SGIM-PB learner. ActionTeacher0, proposing action demonstrations for reaching Ω0 outcomes, was changed to left-handed demonstrations.

5.5.1. Evaluation Performance

Figure 17 shows the global evaluation of Left-Yumi-SGIM-PB and Left-Yumi-SGIM-TL. We can see that, even though the learning process seems quicker before 500 iterations for the Left-Yumi-SGIM-TL learner, both learners quickly learn at the same rate, as shown by their overlapping evaluation curves. If we look at Figure 18, which shows this evaluation broken down for each task, we see the same phenomenon for the individual outcome spaces. This seems to indicate that active imitation using the small demonstration datasets is enough for Left-Yumi-SGIM-PB to tackle this setup, while the huge dataset of transferred procedures does not give Left-Yumi-SGIM-TL an edge, other than a slightly better performance on the complex tasks in the initial phase.

Figure 17.
Evaluation of all algorithms throughout the learning process for the transfer learning setup; final standard deviations are given in the legend.

To analyse what task hierarchy both learners have discovered at the end of their learning process, we plotted Figure 19. We can see that no learner is clearly better at discovering the setup task hierarchy: both most often decompose Ω as ( Ω , Ω ) , but slightly less often Ω or Ω as ( Ω , Ω ) . For Ω , Left-Yumi-SGIM-PB also uses another procedure, ( Ω , Ω ) , which is also valid. To position both objects, the robot can start with either object 1 or object 2. However, if we take into account the task hierarchy contained in the transfer dataset fed to Left-Yumi-SGIM-TL before its learning process, we can see that Left-Yumi-SGIM-TL has learned almost exactly the same task hierarchy as the transfer dataset, which indicates its influence owing to the large number of procedures transferred: procedures from 25,000 iterations were transferred, compared to the new procedures explored by Left-Yumi-SGIM-TL during only 10,000 iterations. Hence, Left-Yumi-SGIM-TL has the same defects and qualities as the transfer dataset in terms of the task hierarchy discovered. For instance, Left-Yumi-SGIM-TL uses the inefficient procedure ( Ω , Ω ) to reach Ω more often than Left-Yumi-SGIM-PB. It was not able to overcome the defects of the transfer dataset. Figure 18.
Task evaluation of all algorithms throughout the learning process for the transfer learning setup.

If we look at the procedures that were actually tried out by the learner during its learning process for reaching each goal outcome (see Figure 20), we can see first that, for all versions of the algorithm and all types of tasks except Ω , the procedures most used during the learning phase correspond to the ground truth. Thus, intrinsic motivation has oriented the exploration towards the relevant task decompositions. Besides, for all types of task, Left-Yumi-SGIM-TL tends to explore the procedures in ( Ω , Ω ) more than Left-Yumi-SGIM-PB. This difference can also be explained by the predominance of this procedure in the transfer dataset D. Both of them also tried many procedures from the Ω procedural space. This confirms that Left-Yumi-SGIM-TL was not able to filter out the defects of its transfer dataset in terms of the task hierarchy provided.

5.5.4. Strategical Choices

Analysing the learning process (Figures 21–24) of both learners, we can see that they are very similar. Both learners start by performing a lot of imitation of the available teachers, coupled with an exploration of all outcome types, until 1500 and 2000 iterations for Left-Yumi-SGIM-TL and Left-Yumi-SGIM-PB respectively. This difference in timing can be caused by the transferred dataset D, but is not very significant. Then they focus more on the autonomous exploration of the action space to reach the Ω outcome subspace, before gradually working on more complex outcome spaces while performing more and more autonomous exploration of the procedural space. However, the Left-Yumi-SGIM-TL learner seems to abandon its initial imitation phase faster than Left-Yumi-SGIM-PB (about 1000 iterations faster), and also quickly starts working on the more complex Ω outcome space with the strategy of autonomous exploration of procedures. This initial maturation seems perhaps too fast, as Left-Yumi-SGIM-TL reverts to working on Ω with autonomous actions afterwards: we see two other peaks of this choice of combination at 5500 iterations and 9500 iterations. On the contrary, Left-Yumi-SGIM-PB seems to converge more steadily towards its final learning phase.
Figure 19.
Task hierarchy discovered by the learners on the left-arm setup: this represents, for outcome spaces Ω , Ω and Ω , the percentage of times each procedural space is chosen during tests to reach goals of the testbench by Left-Yumi-SGIM-PB (1st row), the transfer dataset (2nd row) and the Left-Yumi-SGIM-TL learner (3rd row). The ground truth hierarchy is indicated by a red square.
Figure 20.
Procedures used during the exploration phase on the left-arm setup: this shows, for outcome spaces Ω , Ω and Ω , the percentage of times each procedural space was chosen during learning by Left-Yumi-SGIM-PB (1st row) and the Left-Yumi-SGIM-TL learner (3rd row); the procedures discovered by the transfer dataset D are also shown (2nd row). The ground truth hierarchy is indicated by a red square. Figure 21.
Evolution of choices of strategies for the Left-Yumi-SGIM-PB learner during the learning process on the left-arm setup.
Figure 22.
Evolution of choices of tasks for the Left-Yumi-SGIM-PB learner during the learning process on the left-arm setup.
Figure 23.
Evolution of choices of strategies for the Left-Yumi-SGIM-TL learner during the learning process on the left-arm setup.
Figure 24.
Evolution of choices of tasks for the Left-Yumi-SGIM-TL learner during the learning process on the left-arm setup.
If we look at the choices of strategy and goal outcomes over the whole learning process (see Figures 25 and 26), we can see that this difference in the learning processes is visible in the number of times each of the two main combinations of strategy and goal outcome space was chosen: Left-Yumi-SGIM-TL favours working autonomously on procedures to explore Ω , whereas Left-Yumi-SGIM-PB worked more on autonomous exploration of actions for Ω . Figure 25.
Choices of strategy and goal outcome for the Left-Yumi-SGIM-PB learner on the left-arm setup.
Figure 26.
Choices of strategy and goal outcome for the Left-Yumi-SGIM-TL learner on the left-arm setup.
These results seem to indicate that if a transfer of knowledge about difficult tasks takes place before easy tasks are learned, it can disturb the learning curriculum by shifting the learner's focus to difficult tasks. The learner needs to realise that the gap in knowledge is too high, and give up these difficult tasks, to re-focus on learning the easy subtasks. Thus, demonstrations given throughout the learning process and adapted to the development of the learner seem more beneficial than a batch of data given at a single point in time, despite the larger amount of data. These results show that procedure demonstrations can be effective for robots of different embodiments to learn complex tasks, and that active requests of a few procedure demonstrations are effective to learn task hierarchy. The results were not significantly improved by more data added at initialisation, in terms of error, exploration strategy or autonomous curriculum learning.
6. Discussion
The experimental results highlight the following properties of SGIM-PB:
• the procedure representation and task composition become necessary to reach tasks higher in the hierarchy; this is not a simple bootstrapping effect;
• transfer of knowledge of procedures is efficient for cross-learner transfer of knowledge, even when the learners have different embodiments;
• active imitation learning of procedures is more advantageous than imitation from a dataset provided at the initialisation phase.

The performance of SGIM-PB stems from its tackling several aspects of transfer of knowledge, and relies on our proposed representation of compound actions, which allows hierarchical reinforcement learning.
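To make the procedure representation concrete, the following is an illustrative sketch (not the authors' implementation) of how a compound task can be resolved either directly into action primitives or recursively through a procedure, i.e., a pair of subtasks; the inverse models `inverse_action` and `inverse_procedure` are hypothetical stand-ins for the learned mappings.

```python
# Dual representation sketch: a task is reached either directly by a sequence
# of action primitives, or by a procedure (omega_i, omega_j), a pair of
# subtasks that are themselves resolved recursively.

def resolve(task, inverse_action, inverse_procedure):
    """Expand `task` into a flat sequence of action primitives.

    `inverse_action(task)` returns a primitive sequence if the task can be
    reached directly, else None; `inverse_procedure(task)` returns the
    subtask pair (omega_i, omega_j) learned as its decomposition.
    """
    actions = inverse_action(task)
    if actions is not None:
        return actions
    omega_i, omega_j = inverse_procedure(task)
    # Executing omega_i then omega_j reaches the compound task; the resulting
    # primitive sequence is unbounded since each subtask may itself decompose.
    return (resolve(omega_i, inverse_action, inverse_procedure)
            + resolve(omega_j, inverse_action, inverse_procedure))

# Toy hierarchy: task 3 decomposes into (1, 2); tasks 1 and 2 are primitive.
inv_a = lambda t: [("primitive", t)] if t in (1, 2) else None
inv_p = {3: (1, 2)}.get
print(resolve(3, inv_a, inv_p))  # [('primitive', 1), ('primitive', 2)]
```

This recursive expansion is what lets the action-sequence length adapt to the depth of the task hierarchy rather than being bounded in advance.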
Our representation of actions allows our learning algorithm to adapt the complexity of its actions to the complexity of the task at hand, whereas other approaches, using via-points [49] or parametrised skills [50], had to bound the complexity of their actions. Other approaches like [51] also use a temporally abstract representation of actions: options. However, options are often used in discrete-state settings to reach specific states such as bottlenecks, and are generally learned beforehand. On the contrary, our dual representation of skills as action primitive sequences and procedures allows online learning of an unlimited number
of complex behaviours. Our exploitation of the learned action primitive sequences is simplistic, though, and needs improvement.

Our work proposes a representation of the relationships between tasks and their complexities, through the procedural framework. Comparatively, to tackle a hierarchical multi-task setting, [25] proposed learning action primitives and using planning to recursively chain skills. However, that approach does not build a representation of a sequence of action primitives, and planning grows slower as the environment is explored. A joint use of planning techniques with an unrestrained exploration of the procedural space could be an interesting prospect and extension of our work.

In this article, we tackled the learning of complex control tasks using sequences of actions of unbounded length. Our main contribution is a dual representation of compound actions in both action and outcome spaces. We showed its impact on autonomous exploration but also on imitation learning: SGIM-PB learns the most complex tasks by autonomous procedural space exploration, and can benefit more from procedural demonstrations for complex tasks and from motor demonstrations for simple tasks, confirming the results obtained on a simpler setup in [35]. Our work demonstrates the gains that can be achieved by requesting just a small amount of demonstration data with the right type of information with respect to the complexity of the tasks. Our work should be improved by a better exploitation algorithm of the low-level control model and can be sped up by adding planning.

We have tested our algorithm in the classical setups of transfer of knowledge: cross-task transfer, cross-learner transfer and transfer by imitation learning. We have shown that
SGIM-PB can autonomously determine the main questions of transfer learning as theorised in [5]:
• What information to transfer? For compositional tasks, a demonstration of task decomposition is more useful than a demonstration of an action, as it helps bootstrap cross-task transfer of knowledge. Our case study shows a clear advantage of procedure demonstrations and procedure exploration. On the contrary, for simple tasks, action demonstrations and action space exploration show more advantages. Furthermore, for cross-learner transfer, especially when the learners have different embodiments, this case study indicates that demonstrations of procedures are still helpful, whereas demonstrations of actions are no longer relevant.
• How to transfer? We showed that decomposition of a hierarchical task, through procedure exploration and imitation, is more efficient than learning action parameters directly, i.e., interpolation of action parameters. This confirms the results found on a simpler setup in [35].
• When to transfer? Our last setup shows that an active imitation learner asking for demonstrations adapted to its growing competence performs as well as, or even better than, when it is given a significantly larger batch of data at initialisation time. More generally, for a less data-hungry transfer learning algorithm, the transferred dataset should be given to the learner in a timely manner, so that the information is adapted to the current level of the learner, i.e., its zone of proximal development [52]. This advantage has already been shown by an active and strategic learner, SGIM-IM [53], a simpler version of SGIM-PB, which had better performance than passive learners for multi-task learning using action primitives.
• Which source of information? Our strategical learner chooses for each task the most efficient strategy between self-exploration and imitation. Most of all, it could understand the domains of expertise of the different teachers and choose the most appropriate expert to imitate.
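The strategical choices above rest on empirical measures of competence progress. The following is a hedged, simplified sketch of such an interest measure, not the published SGIM-PB interest model: the class name, window heuristic and toy data are all illustrative assumptions.

```python
from collections import defaultdict

# Simplified intrinsic-motivation sketch: track competence per
# (task, strategy) pair and favour the pair with the highest recent
# competence progress, so plateaued combinations lose interest.

class InterestModel:
    def __init__(self, window=5):
        self.window = window
        self.history = defaultdict(list)  # (task, strategy) -> competences

    def record(self, task, strategy, competence):
        self.history[(task, strategy)].append(competence)

    def progress(self, key):
        h = self.history[key][-self.window:]
        if len(h) < 2:
            return float("inf")  # unexplored pairs stay maximally interesting
        half = len(h) // 2
        return abs(sum(h[half:]) / len(h[half:]) - sum(h[:half]) / half)

    def choose(self):
        """Pick the (task, strategy) pair with the highest empirical progress."""
        return max(self.history, key=self.progress)

im = InterestModel()
for c in (0.1, 0.2, 0.5, 0.7):       # fast progress: keep exploring
    im.record("omega_2", "autonomous_procedures", c)
for c in (0.9, 0.9, 0.9, 0.9):       # plateaued: loses interest
    im.record("omega_0", "imitation", c)
print(im.choose())  # ('omega_2', 'autonomous_procedures')
```

Driving the curriculum by progress rather than raw competence is what makes the learner abandon mastered tasks and postpone tasks whose knowledge gap is still too high.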
The results of this case study confirm the results found in [35] on a simpler setup and in [38] for learning action primitives.
7. Conclusions
We showed that our industrial robot could learn sequences of motor actions of unrestrained size to achieve a field of hierarchically organised outcomes. To learn to control continuous high-dimensional spaces of outcomes through a continuous, infinite-dimensional space of actions, we combined: goal-oriented exploration to enable the learner to organise its learning process in a multi-task learning setting; procedures as a task-oriented representation to build increasingly complex sequences of actions; active imitation strategies to select the most appropriate information and source of information; and intrinsic motivation as a heuristic to drive the robot's curriculum learning process. All four aspects are combined inside a curriculum learner called SGIM-PB. This algorithm showed the following characteristics throughout this study:
• Hierarchical RL: it learns online task decomposition on 4 levels of hierarchy using the procedural framework, and it exploits the task decomposition to match the complexity of the sequences of action primitives to the task;
• Curriculum learning: it autonomously switches from simple to complex tasks, and from exploration of actions for simple tasks to exploration of procedures for the most complex tasks;
• Imitation learning: it empirically infers which kind of information is useful for each kind of task and requests just a small amount of demonstrations with the right type of information by choosing between procedural and action teachers;
• Transfer of knowledge: it automatically decides what information to transfer, how and when to transfer it, and which source of information to use for cross-task and cross-learner transfer learning.

Thus, our work proposes an active imitation learning algorithm based on intrinsic motivation that uses empirical measures of competence progress to choose at the same time what target task to focus on, which source tasks to reuse and how to transfer knowledge about task decomposition.
Our contributions, grounded in the field of cognitive robotics, are: a new representation of complex actions enabling the exploitation of task decomposition, and the proposition that tutors supply information on the task hierarchy to learn compound tasks. This work should be improved by a better exploitation algorithm of the low-level control model and can be sped up by adding planning methods.
Author Contributions:
Conceptualization, S.M.N.; Formal analysis, J.Z.; Funding acquisition, D.D.; Investigation, N.D.; Methodology, J.K.; Writing—original draft, N.D.; Writing—review & editing, S.M.N. All authors have read and agreed to the published version of the manuscript.
Funding:
This work was supported by the Ministère de l'Education Nationale, de l'Enseignement Supérieur et de la Recherche, the European Regional Development Fund (ERDF), Région Bretagne, and the Conseil Général du Finistère, and by Institut Mines Télécom, in the framework of the VITAAL project.
Institutional Review Board Statement:
Not applicable.
Informed Consent Statement:
Not applicable.
Data Availability Statement:
The code used is available at https://bitbucket.org/smartan117/sgim-yumi-simu (simulated version) and at https://bitbucket.org/smartan117/sgim-yumi-real (physical version).
Acknowledgments:
This work was supported by the Ministère de l'Education Nationale, de l'Enseignement Supérieur et de la Recherche, the European Regional Development Fund (ERDF), Région Bretagne, and the Conseil Général du Finistère, and by Institut Mines Télécom, in the framework of the VITAAL project.
Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
2. Zech, P.; Renaudo, E.; Haller, S.; Zhang, X.; Piater, J. Action representations in robotics: A taxonomy and systematic classification. Int. J. Robot. Res. , , 518–562, doi:10.1177/0278364919835020.
3. Elman, J. Learning and development in neural networks: The importance of starting small. Cognition , , 71–99.
4. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning; ACM: New York, NY, USA, 2009; pp. 41–48.
5. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. , , 1345–1359.
6. Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. , , 1633–1685.
7. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data , , 9, doi:10.1186/s40537-016-0043-6.
8. Whiten, A. Primate culture and social learning. Cogn. Sci. , , 477–508.
9. Call, J.; Carpenter, M. Imitation in Animals and Artifacts; Chapter Three Sources of Information in Social Learning; MIT Press: Cambridge, MA, USA, 2002; pp. 211–228.
10. Tomasello, M.; Carpenter, M. Shared intentionality. Dev. Sci. , , 121–125.
11. Piaget, J. The Origins of Intelligence in Children (M. Cook, Trans.); WW Norton & Co: New York, NY, USA, 1952.
12. Deci, E.; Ryan, R.M. Intrinsic Motivation and Self-Determination in Human Behavior; Plenum Press: New York, NY, USA, 1985.
13. Oudeyer, P.Y.; Kaplan, F.; Hafner, V. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Trans. Evol. Comput. , , 265–286.
14. Schmidhuber, J. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Trans. Auton. Ment. Dev. , , 230–247.
15. Baranes, A.; Oudeyer, P.Y. Intrinsically motivated goal exploration for active motor learning in robots: A case study. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 1766–1773.
16. Rolf, M.; Steil, J.; Gienger, M. Goal Babbling permits Direct Learning of Inverse Kinematics. IEEE Trans. Auton. Ment. Dev. , , 216–229.
17. Forestier, S.; Mollard, Y.; Oudeyer, P. Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. arXiv , arXiv:1708.02190.
18. Colas, C.; Fournier, P.; Chetouani, M.; Sigaud, O.; Oudeyer, P.Y. CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning. In International Conference on Machine Learning; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: 2019; Volume 97, pp. 1331–1340.
19. Giszter, S.F. Motor primitives—new data and future questions. Curr. Opin. Neurobiol. , , 156–165.
20. Arie, H.; Arakaki, T.; Sugano, S.; Tani, J. Imitating others by composition of primitive actions: A neuro-dynamic model. Robot. Auton. Syst. , , 729–741.
21. Riedmiller, M.; Hafner, R.; Lampe, T.; Neunert, M.; Degrave, J.; van de Wiele, T.; Mnih, V.; Heess, N.; Springenberg, J.T. Learning by Playing: Solving Sparse Reward Tasks from Scratch. arXiv , arXiv:1802.10567.
22. Barto, A.G.; Konidaris, G.; Vigorito, C. Behavioral hierarchy: Exploration and representation. In Computational and Robotic Models of the Hierarchical Organization of Behavior; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–46.
23. Konidaris, G.; Barto, A.G. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. Adv. Neural Inf. Process. Syst. (NIPS) , , 1015–1023.
24. Barto, A.G.; Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst. , , 41–77.
25. Manoury, A.; Nguyen, S.M.; Buche, C. Hierarchical affordance discovery using intrinsic motivation. In Proceedings of the 7th International Conference on Human-Agent Interaction; ACM: Kyoto, Japan, 2019; pp. 196–193.
26. Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Adv. Neural Inf. Process. Syst. , , 3675–3683.
27. Duminy, N.; Nguyen, S.M.; Duhaut, D. Learning a set of interrelated tasks by using sequences of motor policies for a strategic intrinsically motivated learner. In Proceedings of the 2018 Second IEEE International Conference on Robotic Computing (IRC), Laguna Hills, CA, USA, 31 January–2 February 2018.
28. Schaal, S. Learning from demonstration. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1997; pp. 1040–1046.
29. Billard, A.; Calinon, S.; Dillmann, R.; Schaal, S. Handbook of Robotics; Number 59; Chapter Robot Programming by Demonstration; MIT Press: Cambridge, MA, USA, 2007.
30. Muelling, K.; Kober, J.; Peters, J. Learning table tennis with a mixture of motor primitives. In Proceedings of the 2010 10th IEEE-RAS International Conference on Humanoid Robots, Nashville, TN, USA, 6–8 December 2010; pp. 411–416.
31. Reinhart, R.F. Autonomous exploration of motor skills by skill babbling. Auton. Robot. , , 1521–1537.
32. Taylor, M.E.; Suay, H.B.; Chernova, S. Integrating reinforcement learning with human demonstrations of varying ability. In The 10th International Conference on Autonomous Agents and Multiagent Systems—Volume 2; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2011; pp. 617–624.
33. Thomaz, A.L.; Breazeal, C. Experiments in Socially Guided Exploration: Lessons learned in building robots that learn with and without human teachers. Connect. Sci. , , 91–110.
34. Grollman, D.H.; Jenkins, O.C. Incremental learning of subtasks from unsegmented demonstration. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 261–266.
35. Duminy, N.; Nguyen, S.M.; Duhaut, D. Learning a Set of Interrelated Tasks by Using a Succession of Motor Policies for a Socially Guided Intrinsically Motivated Learner. Front. Neurorobot. , , 87.
36. Argall, B.D.; Browning, B.; Veloso, M. Learning robot motion control with demonstration and advice-operators. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 399–404.
37. Chernova, S.; Veloso, M. Interactive Policy Learning through Confidence-Based Autonomy. J. Artif. Intell. Res. , , 1.
38. Nguyen, S.M.; Oudeyer, P.Y. Active choice of teachers, learning strategies and goals for a socially guided intrinsic motivation learner. Paladyn J. Behav. Robot. , , 136–146.
39. Cakmak, M.; Chao, C.; Thomaz, A.L. Designing interactions for robot active learners. IEEE Trans. Auton. Ment. Dev. , , 108–118.
40. Begus, K.; Southgate, V. Active Learning from Infancy to Childhood; Springer: Berlin/Heidelberg, Germany, 2018; Chapter Curious Learners: How Infants' Motivation to Learn Shapes and Is Shaped by Infants' Interactions with the Social World; pp. 13–37.
41. Poulin-Dubois, D.; Brooker, I.; Polonia, A. Infants prefer to imitate a reliable person. Infant Behav. Dev. , , 303–309, doi:10.1016/j.infbeh.2011.01.006.
42. Fournier, P.; Colas, C.; Sigaud, O.; Chetouani, M. CLIC: Curriculum Learning and Imitation for object Control in non-rewarding environments. IEEE Trans. Cogn. Dev. Syst. , 1, doi:10.1109/TCDS.2019.2933371.
43. Duminy, N.; Nguyen, S.M.; Duhaut, D. Effects of social guidance on a robot learning sequences of policies in hierarchical learning. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018.
44. Asada, M.; Hosoda, K.; Kuniyoshi, Y.; Ishiguro, H.; Inui, T.; Yoshikawa, Y.; Ogino, M.; Yoshida, C. Cognitive developmental robotics: A survey. IEEE Trans. Auton. Ment. Dev. , , 12–34.
45. Cangelosi, A.; Schlesinger, M. Developmental Robotics: From Babies to Robots; MIT Press: Cambridge, MA, USA, 2015.
46. Nguyen, S.M.; Oudeyer, P.Y. Socially Guided Intrinsic Motivation for Robot Learning of Motor Skills. Auton. Robot. , , 273–294.
47. Kubicki, S.; Pasco, D.; Hoareau, C.; Arnaud, I. Using a tangible interactive tabletop to learn at school: Empirical studies in the wild. In Actes de la 28ième conférence francophone sur l'Interaction Homme-Machine; ACM: New York, NY, USA, 2016; pp. 155–166.
48. Pastor, P.; Hoffmann, H.; Asfour, T.; Schaal, S. Learning and generalization of motor skills by learning from demonstration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 763–768.
49. Stulp, F.; Schaal, S. Hierarchical reinforcement learning with movement primitives. In Proceedings of the 2011 11th IEEE-RAS International Conference on Humanoid Robots, Bled, Slovenia, 26–28 October 2011; pp. 231–238.
50. Da Silva, B.; Konidaris, G.; Barto, A.G. Learning Parameterized Skills. arXiv , arXiv:1206.6398.
51. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.