Hierarchical Affordance Discovery using Intrinsic Motivation
Alexandre Manoury, [email protected], IMT Atlantique, Brest, France
Sao Mai Nguyen, [email protected], IMT Atlantique, Brest, France
Cédric Buche, [email protected], Brest, France
ABSTRACT
To be capable of life-long learning in a real-life environment, robots have to tackle multiple challenges. Being able to relate the physical properties they observe in their environment to the interactions these properties allow is one of them. This skill, named affordance learning, is strongly related to embodiment and is mastered through each person's development: each individual learns affordances differently, through their own interactions with their surroundings. Current methods for affordance learning usually either use fixed actions to learn these affordances or focus on static setups involving a robotic arm to be operated. In this article, we propose an algorithm using intrinsic motivation to guide the learning of affordances for a mobile robot. This algorithm is capable of autonomously discovering, learning and adapting interrelated affordances without pre-programmed actions. Once learned, these affordances may be used by the algorithm to plan sequences of actions in order to perform tasks of various difficulties. We then present one experiment and analyse our system, before comparing it with other approaches from reinforcement learning and affordance learning.
KEYWORDS
Intrinsic motivation, Incremental learning, Affordances
ACM Reference Format:
Alexandre Manoury, Sao Mai Nguyen, and Cédric Buche. 2019. Hierarchical Affordance Discovery using Intrinsic Motivation. In
Proceedings of the 7th International Conference on Human-Agent Interaction (HAI '19), October 6–10, 2019, Kyoto, Japan.
ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3349537.3351898
The research work presented is partially supported by the European Regional Fund (FEDER) via the VITAAL Contrat Plan Etat Region.

1 INTRODUCTION
Continuous adaptation to the environment constitutes a key feature of the human learning process. It enables humans to learn to interact with newly discovered objects, either by reusing and adapting previously acquired knowledge or by building new skills more adapted to the situation at hand. This competence, named life-long learning, is one of the central challenges for service robots to act in our everyday environment, towards socially assistive robotics and human-agent interaction.

To tackle this, we adopt the approach of developmental robotics. Indeed, studying how infants learn and adapt constitutes an essential example of life-long learning, and it highlights the multiple mechanisms involved in this learning. Among them, we focus on two in particular: the way infants relate what they see to how they may interact with surrounding objects; and how they explore and interact with their environment while building new skills.

The first, coined in 1979 by Gibson as the concept of affordance [23], describes the strong relationship between visual cues and possible interactions. Contrary to classical computer vision, central to the notion of affordance are the concepts of embodiment and of motor capabilities [10]. For instance, adults and infants do not see the same affordances for the same objects because they do not have the same body, the same way humans do not perceive the same affordances as robots. Such affordances evolve all along the life of a person, directly through their interactions with their surroundings.

The latter, the capacity that infants have to autonomously explore their environment, may be one of the answers to how affordances are learned. Indeed, infants use their curiosity to drive their exploration and build new skills through it [9]. This capacity, described as intrinsic motivation in psychology [14], provides a powerful mechanism to learn motor skills such as affordances.

In this paper, we combine those two aspects of the human learning process, affordances and intrinsic motivation, by proposing a robotic framework to learn affordances through active learning. We apply it to the case of a mobile robot. Moreover, as we aim at complex affordances that may require not a single primitive action but a succession of actions, we use planning to chain actions in order to perform tasks of various difficulties.
2 RELATED WORK
In this section we review the work related to two aspects of our approach: affordance learning and active learning algorithms, especially those using intrinsic motivation. We also present the reinforcement learning methods that we compare with later in the article.
2.1 Affordance Learning
Many approaches to affordance learning have been developed: the traversability affordance, for instance, has been studied in different works [4, 22]. Likewise, the grasp affordance is a recurrent topic, and various approaches exist to learn it, such as learning based on visual descriptors or on raw image input [11, 17]. However, such methods are not easily generalised and tend to focus on one, or a fixed number of, specific affordances, with no mechanism adapting them to new or more complex affordances.

More general approaches, not focusing on a single affordance, have also been proposed. Such approaches build and update a list of affordances through the robot's interactions with its environment. In [12], Bayesian networks are used to learn dependencies between actions, effects and visual properties; a widely used definition of affordances in robotics. Likewise, [19] also uses a Bayesian approach, but coupled with a fixed and finite pre-programmed action set to learn affordances. In our case, we aim at continual learning of multiple affordances through the robot's interaction with its environment. Thus, the robot builds sensorimotor skills by itself, using a wide variety of actions: it can use actions of unbounded length and duration, in a continuous action space. In [6], Ugur et al. propose a developmental approach to affordance learning, where the robot learns by stages: simple affordances first, more complex ones later. But they limit their approach to simple affordances with no multi-object interaction possible. In all of the methods above, the approach is limited to a single scene and to a single object interaction at a time, guiding the robot's exploration without letting it choose autonomously. Furthermore, the considered setups usually focus on a fixed robot with an end-point manipulator, while in our case we consider a mobile robot, using its mobility to explore its surroundings by itself and choosing which object to interact with.

2.2 Active Motor Learning
Central to Gibson's theory is the notion that the motor capabilities of the agent dramatically influence perception. Therefore affordance learning is closely linked to motor learning. In recent years, multiple approaches have emerged for active motor learning. For instance, several methods exist to learn forward and inverse models, which map motor policies to sensorimotor outcomes, as formalised by [8, 24]. However, as the dimensionality of the considered spaces increases, the learner faces the curse of dimensionality [2] and no comprehensive exploration method is possible.

Developmental methods, inspired by how infants explore and learn, have been proposed to tackle this issue [18]. Indeed, curiosity has been identified as a key mechanism for exploration [14]. Other methods, even further inspired by human psychology, have been developed [1], using a goal-babbling mechanism to generate goals and drive exploration.

More recently, methods using intrinsic motivation to build a hierarchy of interrelated skills have been proposed: first by using a pre-programmed and static hierarchy [7], then by learning the hierarchy through exploration, for robotic arms [5] or mobile robots [13]. The latter provides an algorithm, CHIME, that uses intrinsic motivation to guide its exploration. It is capable of adaptively adding to and modifying a skill hierarchy based on its interactions. It may also plan sequences of actions to perform complex tasks.
2.3 Reinforcement Learning
In other domains, such as reinforcement learning, numerous methods have emerged to tackle daily tasks, using either classical approaches, such as Q-Learning, or more recent neural-network-based ones: DQN [16] or Actor-Critic algorithms [15]. But such methods differ from our approach, as they are designed to tackle specific tasks, defined by a reward function that needs to be provided to the learner. Towards more general reinforcement learning, Universal Value Function Approximators [20] have been proposed: the task goal is this time learned as an environmental state instead of being fixed. This lets the learned value function be more general and applicable to various goals. Closer to our approach, the CURIOUS algorithm has been proposed, combining deep reinforcement learning and intrinsic motivation to generate goals to explore [3].

We decide to base our proposition on the CHIME algorithm and adapt its learning algorithm to the affordance learning problem by using sensorimotor features. Whereas in [13] CHIME could not generalise its skills to new objects, our algorithm is capable of generalising to new objects, as its learning is based on sensory features.
3 EXPERIMENTAL SETUP AND PROBLEM FORMALISATION
Before describing our method, we first present the experiment and formalise the learning problem.

3.1 Experimental Setup
The experimental setup used in this article is presented in Figure 1.
Figure 1: Experimental setup used: at the center is a mobile robot; the green objects represent movable entities, contrary to the red ones. The room is closed.
A mobile robot, equipped with two controllable wheels, is placed in a rectangular room. It is surrounded by multiple objects and possesses a LIDAR sensor, helping it to detect obstacles. It can navigate between the objects or learn to push them. The robot starts with a limited prior knowledge:
• how many effectors it possesses, in this case its 2 wheels;
• a method to perform actions on its effectors. In our case, the method lists the indexes of the effectors to activate and the action parameters p ∈ [−1, 1] representing the intensity of the current to apply to each wheel;
• a list of the objects present in the room. We consider 9 cylinders in our setup, of various sizes and colors. The green objects are pushable whereas the red ones are not;
• a method to observe the properties of each of the objects (including the robot itself). The properties used in our experiment are: position, shape, radius, height, color, and the position relative to the robot. A pushable object requires more torque to be moved as its radius and height increase.

With this knowledge, the robot has to autonomously explore its environment and learn how to perform various tasks: moving itself, placing an object somewhere, pushing an object using another one. It also has to learn the affordances corresponding to such tasks and to be able to recognise (or estimate) the objects on which an affordance may apply or not. E.g. the pushability affordance cannot be applied to red objects, as those are fixed.

The robot explores the room in episodes of an arbitrarily fixed length of 100 actions. At the beginning of each episode, the positions and properties of the objects are initialised to random values, and the robot always starts at the center of the room.

In real-life operation, the robot can extract this information and these properties from its sensors. In our case, the real-life system can use an RGB-D sensor to segment objects and a variational autoencoder to extract properties from each of them. All the acquired data are used to fill in an environmental map, used in turn to provide data to the learning algorithm described in this article. To keep this system simple and avoid multiplying the sources of errors, this article focuses only on the learning algorithm, with direct access to high-level data.

3.2 Problem Formalisation
Let us consider a robot interacting with its non-rewarding environment by performing sequences of motions of unbounded length in order to induce changes in its surroundings.

Each one of these motions is a primitive action described by a parametrised function with N parameters: a ∈ A ⊂ ℝ^N. Each primitive action a corresponds to a command that may be sent to one or several actuators of the robot.

Our robot can perform sequences of primitive actions. Such a sequence may be of any length n ∈ ℕ, and is described by n successive primitive actions: a = [a₁, ..., aₙ] ∈ Aⁿ. Thus the primitive action space exploitable by the robot is a continuous space of unbounded dimensionality, A^ℕ.

Each of the actions performed by the robot may have consequences on the environment, observable by the robot. We call such consequences observations and note them ω ∈ Ω ⊂ ℝ^M. Each subspace of Ω is related to a given property of an object o ∈ O present in the environment (e.g. the position of an object). We consider this relation known to the robot, as such knowledge is required to build an affordance.
This is a weak assumption, as such information may be extracted from visual segmentation or by exploiting data from a semantic map. In our case, such data are directly given by the simulator itself.

To learn how to interact with its environment, the robot learns models of the relations between primitive actions a ∈ A and outcomes ω ∈ Ω (relative observations before and after executing an action) obtained after performing this action within a given context ω̃ ∈ Ω̃ (absolute state before executing the action; for convenience, we mark context spaces with ˜ to differentiate them from outcome spaces).

For convenience, we define the controllable ensemble C = A ∪ Ω_controllable, regrouping both primitive actions (∈ A) and observables that may be controlled (∈ Ω_controllable), i.e. for which a model may be used to find one or a sequence of primitive actions to be performed in order to induce a value for the given observable. Ω_controllable is a subset of Ω, and this set changes dynamically as the robot discovers new control models.

More generally, the robot may learn models between controllables c ∈ C (not only primitive actions) and relative observations within a given context. Indeed, our robot may learn how to reach a goal observation value by first inducing a change in another observable of the environment. E.g. pushing an object can be performed after reaching the object.

To formalise affordances, we use two elements. First, we note an affordance model A(C_i, Ω_j, Ω̃_k), where C_i ⊂ C is the input space of the affordance, Ω_j ⊂ Ω is its output space and Ω̃_k ⊂ Ω̃ is its context space. Secondly, to visually identify this affordance, we associate A with a visual predictor p_A. It learns on Ω and indicates whether A may be applied to an object o in the scene or not, according to its visual or physical properties. Moreover, to be able to learn how to use the affordance to complete tasks, A possesses a forward model M_A and an inverse model L_A. Both models learn the relationship between C_i and Ω_j knowing a context Ω̃_k. The forward model is used to predict the observable consequences ω of a controllable c_i in a given context ω̃. Conversely, the inverse model is used to estimate a controllable c_i to be performed in a given context ω̃ to induce a goal observable state ω as a result of c_i. These models are trained on the data acquired by the robot all along its life and recorded in its dataset, noted D.

Each affordance A can be seen as a basic skill, letting the robot perform a given simple task, e.g. reaching a position or placing an object somewhere. Let us note H the ensemble of the affordances used by our robot. As our robot is meant to be adaptive, H varies along time.
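As a toy illustration of this formalisation, the sketch below shows one possible data structure for an affordance A(C_i, Ω_j, Ω̃_k) with its forward model M_A and inverse model L_A. It is a minimal sketch only: the k-nearest-neighbour models and all names (Affordance, forward, inverse) are our own assumptions, not the authors' implementation.

import numpy as np

class Affordance:
    """Minimal affordance A(C_i, Omega_j, Omega~_k): input, output, context spaces."""

    def __init__(self, input_dim, output_dim, context_dim):
        self.C = np.empty((0, input_dim))      # controllables c_i experienced so far
        self.W = np.empty((0, output_dim))     # outcomes omega observed for each c_i
        self.ctx = np.empty((0, context_dim))  # contexts omega~ in which they occurred

    def add_sample(self, c, omega, context):
        """Record one (controllable, outcome, context) triple in the training data."""
        self.C = np.vstack([self.C, c])
        self.W = np.vstack([self.W, omega])
        self.ctx = np.vstack([self.ctx, context])

    def forward(self, c, context, k=5):
        """Forward model M_A: predict the outcome of c in a context
        (k-NN average; assumes at least one recorded sample)."""
        d = np.linalg.norm(np.hstack([self.C, self.ctx]) - np.hstack([c, context]), axis=1)
        return self.W[np.argsort(d)[:k]].mean(axis=0)

    def inverse(self, omega_goal, context, k=5):
        """Inverse model L_A: estimate a controllable reaching omega_goal in a context."""
        d = np.linalg.norm(np.hstack([self.W, self.ctx]) - np.hstack([omega_goal, context]), axis=1)
        return self.C[np.argsort(d)[:k]].mean(axis=0)

The visual predictor p_A would be a separate classifier over object properties, trained as described in Section 4.4.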
4 LEARNING ALGORITHM
For our robot to learn to associate a sequence of primitive actions [a₁, ..., aₙ] to desired consequences on multiple objects in its environment, it needs to learn which consequences ω can be observed, and learn the control actions to realise these consequences. For this learning problem, we propose an algorithm in this section. We first introduce its global architecture before detailing its key processes: how intrinsic motivation drives the exploration, how actions are executed, and finally how affordance models are built and updated.

4.1 Global Architecture
Our algorithm is based on the CHIME algorithm [13]. Both are iterative and active learning algorithms that learn by episodes, but unlike CHIME, our algorithm is designed to consider visual properties during its learning process.

The global layout of the algorithm architecture is presented in Figure 2 and the corresponding pseudocode can be seen in Algorithm 1. At the beginning of the learning, the dataset D and the affordance hierarchy H are both empty: the robot autonomously collects data and creates affordances.

Figure 2: Abstract layout of a learning episode, beginning on the left at the bold node.
At each episode, the robot explores its environment by performing actions, observes the context and the outcomes obtained, and processes the acquired data. One episode is composed of multiple iterations, and at each iteration one primitive action is performed.

Starting an episode, the robot decides either to explore a random action (l. 10), or to use goal babbling to generate a goal to attain during the episode (l. 5). This decision is stochastic, based on a parameter σ, and it also depends on whether interesting goals may be generated or not. E.g. at the first episode, no data has been acquired yet, and thus only a random action may be performed.

When choosing a random action, the robot generates a random controllable to be tested, c ∈ C, among all the controllable spaces (including the primitive actions). If required, this controllable is then converted to an executable primitive action, as only primitive actions may be performed by the robot's effectors. This process is described in Section 4.3.

When choosing to generate a goal, an affordance A and a goal ω_g are selected, based on an interest metric detailed later in Section 4.2. The robot next decides on which object this goal will be tested. Currently, it simply selects the closest object considered as valid by the affordance's visual classifier p_A. The robot then uses its inverse models and its planning system to infer a sequence of controllables c ∈ Cⁿ to be performed in order to reach ω_g. Once again, these controllables are broken down into executable primitive actions if required, using the same process as previously.

In both cases, the robot generates a sequence of primitive actions a = [a₁, ..., aₙ] ∈ Aⁿ of length n, corresponding to a random action or to a sequence of actions designed to reach a generated goal ω_g. These actions are then executed by the robot (l. 17): for each primitive subaction a_i, the absolute value of each observable space is first recorded (corresponding to the context of the subaction), a_i is then performed, and the difference for each observable space (compared to before the execution) is retrieved.

After finishing an episode, the robot obtains a list of tuples (a_i, ω_i1, ..., ω_ik, ω̃_i1, ..., ω̃_ik) for each iteration, where i corresponds to the iteration index and k to the number of subspaces of Ω. These data are then stored in D (l. 25). They are also processed and used to improve existing affordances (l. 24), to decide whether creating a new affordance is necessary or not, and to update the intrinsic motivation system. These different processes are described in the following subsections.

4.2 Intrinsic Motivation
This algorithm uses intrinsic motivation to guide its exploration. It is based on the CHIME algorithm [13], itself inspired by the SAGG-RIAC algorithm [1].
Algorithm 1: Algorithm layout

1:  i = 0
2:  loop
3:      D_episodic = ∅
4:      if H ≠ ∅ and Random() ≤ σ then
5:          A = AffordanceSelection(H)
6:          ω = GoalSelection(A)
7:          ω_g = ObjectSelection(A, ω)
8:          c = Plan(ω_g)
9:      else
10:         C_i = RandomControllableSpace(C)
11:         c_r = RandomValue(C_i)
12:         c = [c_r]
13:     a = TransformToPrimitive(c)
14:     for c_k ∈ c do
15:         ω_before = GetObservations(Ω)
16:         a_i = [c_k] if c_k ∈ A else TransformToPrimitive(c_k)
17:         Execute(a_i)
18:         ω_after = GetObservations(Ω)
19:         ω_i = ω_after − ω_before
20:         ω̃_i = ω_before
21:         D_episodic ← (a_i, ω_i, ω̃_i)
22:         i += 1
23:     UpdateInterestMaps(D, D_episodic)
24:     UpdateAffordances(D, D_episodic)
25:     D ← D ∪ D_episodic
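To make the AffordanceSelection and GoalSelection steps of Algorithm 1 concrete, here is a minimal sketch of the competence, progress and interest measures detailed in the next subsection; the window sizes, the sign convention and all function names are our illustrative assumptions, not the authors' code.

import numpy as np

def competence(goal_errors, k=10):
    """Competence near a goal: mean distance between estimated and reached
    outcomes over the k most recent attempts in this region (negated so
    that higher means more competent)."""
    return -float(np.mean(goal_errors[-k:]))

def interest(competence_history, n=5):
    """Interest of a region: mean of the last n learning progresses, where
    progress is the finite-difference derivative of competence over time."""
    progresses = np.diff(competence_history)
    return float(np.abs(progresses[-n:]).mean())

# Usage: pick the region (hence affordance and goal) with the highest interest.
region_histories = {"region_a": [-0.9, -0.7, -0.4], "region_b": [-0.5, -0.5, -0.5]}
best = max(region_histories, key=lambda r: interest(np.array(region_histories[r])))
# -> "region_a": competence is still improving there, so exploring it pays off.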
For each affordance A(C_A, Ω_A, Ω̃_A), the system creates an interest map: a partition of Ω_A that is constructed incrementally based on progress measures, as described in [1]. The goal of this process is to divide Ω_A into regions and attribute a value of interest to each region. This interest corresponds to an estimate of how much exploring this region may improve the robot's knowledge in the future.

This measure is linked to a notion of competence. In our case, we define the competence of an affordance A near a goal ω ∈ Ω_A as mean(ω_e − ω_r) over the k last outcomes near ω, where ω_e corresponds to the outcome estimated by the algorithm for a given controllable c, and ω_r to the effective outcome reached during the exploration. The derivative of this competence is used to define a learning progress: how much an affordance model has been improved. The interest value of a region then corresponds to the mean of the last n learning progresses in this region. More details about this process and the region-splitting mechanism may be found in [13] and in [1].

4.3 Action Execution
To perform a sequence of controllables c, our algorithm uses the same system as CHIME. For each element c_i of c:
• if the sub-controllable c_i to be performed is a primitive action, it is directly sent to the effectors and executed without any pre-processing;
• otherwise, if c_i is not a primitive action, it corresponds to an observable the robot wants to induce within its environment, i.e. c_i ∉ A but c_i ∈ Ω_controllable. It then cannot be directly executed by the effectors of the robot, and needs to be broken down into primitive actions beforehand. An affordance A(C_A, Ω_A, Ω̃_A) is selected (with c_i ∈ Ω_A) and its inverse model is applied to c_i in order to obtain a lower-level controllable b_i ∈ C_A. If c_i is difficult to reach using only one lower-level controllable, a planning phase is used to build a sequence of elements of C_A that reaches c_i when executed. For each element of this newly created sequence, if it is not primitive, the same mechanism is applied recursively until only primitive actions remain.

At the end of this mechanism, we obtain a list b ∈ A^ℕ composed only of primitive actions that can be executed directly. Additional information can be found in [13].

4.4 Affordance Creation and Update
The CHIME algorithm has been designed to autonomously learn models of data. We diverge from it to autonomously learn affordances instead. In this section we present how affordances are added to H and updated. At each episode, the robot has to decide multiple elements: whether a new affordance must be added or existing affordances are enough; how to train the visual classifiers of the affordances; and whether affordances need to be updated. To answer those questions, the robot follows the procedure presented in Algorithm 2.

At the end of each episode, the subspaces of Ω for which non-null relative outcomes ω have been observed are listed. The robot then randomly picks a space from this list and verifies whether it matches an existing affordance. A space matches an affordance if adding the data from this space to the training set of the affordance does not reduce its competence. If not matching, the robot tries to add context spaces to the affordance, or else tries to create a new affordance.
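The matching test just described can be sketched as follows, reusing the illustrative Affordance class from Section 3.2; the held-out probe set, the tolerance and the function names are our assumptions, not the authors' exact criterion.

import copy
import numpy as np

def estimate_competence(affordance, probes):
    """Estimated competence: negated mean error of the forward model M_A
    on a few probe (controllable, outcome, context) triples."""
    errors = [np.linalg.norm(affordance.forward(c, ctx) - omega)
              for c, omega, ctx in probes]
    return -float(np.mean(errors))

def matches(affordance, new_samples, probes, tolerance=0.0):
    """A space matches an affordance if adding its data to the affordance's
    training set does not reduce the affordance's competence."""
    before = estimate_competence(affordance, probes)
    candidate = copy.deepcopy(affordance)      # trial update, not committed
    for c, omega, ctx in new_samples:
        candidate.add_sample(c, omega, ctx)
    after = estimate_competence(candidate, probes)
    return after >= before - tolerance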
The predictor p_A is afterwards trained on the acquired data (positive or negative). The predictor p_A used in our system is a binary neural network composed of 3 fully connected layers, using as input all the properties of the object o currently considered. It is trained using action replay on balanced data (objects on which A is applicable and the others).

5 EXPERIMENTS
We used our experimental setup to perform three series of tests:
• firstly, evaluating our system itself: which affordances are created, and when;
• then, comparing the task performance of our system with CHIME;
• finally, comparing our system to other reinforcement learning approaches.

To measure the performance of our (or another) algorithm at completing tasks, we define an evaluation metric as follows: for each task, we pre-define a list of points (robot positions or object positions) to be reached. Then, during evaluation, the system attempts to reach each point in the simulator, and 1 minus the mean error at reaching those points is defined as the evaluation of this task.
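Read literally, this metric can be sketched in a few lines (a toy illustration; the error normalisation by a scale factor is our assumption, since errors must fall in [0, 1] for the score to be meaningful):

import numpy as np

def task_evaluation(reached, targets, scale):
    """Evaluation of a task: 1 - mean (normalised) error over predefined points."""
    errors = np.linalg.norm(np.asarray(reached) - np.asarray(targets), axis=1) / scale
    return 1.0 - float(np.clip(errors, 0.0, 1.0).mean())

# e.g. two target positions in a 600x400 room, normalised by the room diagonal
print(task_evaluation([(100, 50), (300, 200)], [(110, 60), (280, 210)], scale=721.1))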
Algorithm 2: Autonomous affordances adaptation

Input: a, the actions performed during the episode; ω, the observations at the beginning of each iteration of the episode.

Spaces = SelectSpaces(Ω, ω)
repeat k times
    S = PickSpace(Spaces)
    for A ∈ H do
        matched = False
        if Matches(A, a, ω_S) then
            matched = True
            Add (a, ω_S) to the model training dataset of A
            TrainVisualClassifier(A, ω_S, True)
        else
            repeat k′ times
                S′_context = PickSpace(Ω)
                NewA = Copy(A)
                ContextSpace_NewA = ContextSpace_NewA ∪ S′_context
                if Competence(NewA) ≥ τ_modification then
                    A ← NewA
                    matched = True
                    break
            p_A ← TrainVisualClassifier(A, ω_S, matched)
            if matched then
                Add (a, ω_S) to the model training dataset of A
            else
                NewA = Affordance(a, S, ∅)
                if Competence(NewA) ≥ τ_creation then
                    H ← H ∪ NewA
                    p_NewA ← TrainVisualClassifier(NewA, ω_S, True)

5.1 Affordance Discovery
In our first test, we let the robot explore the environment presented in Section 3.1. This environment is simulated in Python using a 2D physics engine named pymunk. We perform 10 runs, letting the robot explore autonomously during 4000 iterations, and we report the mean results.
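For concreteness, here is a minimal sketch of how such a closed room with pushable and fixed cylinders can be built in pymunk; all dimensions, masses and the differential-drive command are our illustrative assumptions, not the authors' simulation code.

import pymunk

def build_room(width=600, height=400):
    """Closed rectangular room delimited by four static walls."""
    space = pymunk.Space()
    space.damping = 0.2  # crude ground friction so pushed objects stop
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    for i in range(4):
        space.add(pymunk.Segment(space.static_body, corners[i], corners[(i + 1) % 4], 1))
    return space

def add_cylinder(space, pos, radius, height, pushable):
    """Green (pushable) cylinders get a dynamic body whose mass grows with
    radius and height, so bigger objects need more force; red ones are static."""
    if pushable:
        mass = radius * height
        body = pymunk.Body(mass, pymunk.moment_for_circle(mass, 0, radius))
    else:
        body = pymunk.Body(body_type=pymunk.Body.STATIC)
    body.position = pos
    space.add(body, pymunk.Circle(body, radius))
    return body

space = build_room()
robot = add_cylinder(space, (300, 200), 15, 1, True)

# One primitive action: wheel intensities p in [-1, 1] applied as forces
# at the left/right wheel positions (a rough differential drive).
p_left, p_right = 0.8, 0.4
robot.apply_force_at_local_point((1000 * p_left, 0), (0, -10))
robot.apply_force_at_local_point((1000 * p_right, 0), (0, 10))
space.step(1 / 60)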
At the end of the exploration, we observe the affordances created and their evaluation, as presented in Figure 3. The robot has successfully discovered multiple affordances; we count 12 at the end for the majority of runs. Among them, 3 were expected:
• A₀: moving the robot itself,
• A₁: pushing an object by moving the robot, and
• A₂: pushing an object using another object.
The other affordances discovered are unintended but still valid: they correspond to unexpected correlations the robot has found between various spaces. In our analysis we focus on the first 3 affordances mentioned above.

Even in this simple environment, the algorithm has managed to create a hierarchy of interrelated skills: A₂ depending on A₁ to be completed, itself depending on A₀.

More than just the final number of affordances, it is interesting to observe the creations, deletions and updates of affordances all along the exploration.

Figure 3: Evaluation during the training for three affordances: moving the robot itself (A₀), pushing an object (A₁) and pushing an object using another one (A₂). Evaluation is done every 200 iterations between 200 and 2000; thus, affordance A₁ is created between iterations 600 and 800. The figure also shows the mean evaluation value when using only random actions or only goal babbling when possible (standard deviations for those are not displayed for clarity). For A₂, the goal-babbling-only and random-only versions do not manage to create the affordance.

Concerning the affordance A₀, we can see in Figure 4 (top) that the affordance is created from iteration t=25. At the moment of the discovery of this affordance, the model created by the robot does not take any context space as input: as no walls or obstacles have been encountered yet, the robot considers that its movement only depends on its wheels' speed. At iteration t=150 the affordance is updated, and the relative position between an object and the robot is added as a context space; a wrong assumption, but coherent with the data acquired so far. Then quickly, at iteration 175, this context space is replaced with the robot's LIDAR space and kept as such until the end. No physical property is used here as a context space; this is due to the fact that the robot is the only object using this affordance.

The results for A₁ in Figure 4 (bottom) show that this affordance is created much later than A₀. This is explained by the fact that the robot first has to collide with an object to discover how to push objects directly with its body; the first occurrence of such a collision was around t=500 iterations on average. Here again, the context space of the affordance has evolved during the exploration and has finally converged to the relative position between the object and the robot. Unlike A₀, 2 physical properties are added as context spaces of this affordance: the radius and the height of the object at hand. As the pushability of each object depends on these two physical properties, it is expected to see them appear here, and this confirms that our algorithm has captured the dependency on such properties.

Once A₁ has been created, its visual classifier p_A₁ is also created and trained to identify the objects to which A₁ may be applied or not. At the end of the 4000 iterations, we use p_A₁ to check its prediction for each object in the room, including the robot itself: it is positive for all the green objects and negative for the robot. This is expected, as the robot can push neither itself nor the fixed red objects. Hence, our algorithm has successfully managed to construct both an affordance model and the corresponding visual classifier.

For A₀, A₁ and A₂, the affordances are created as soon as the collected data permit it. This behaviour is desired and due to a low affordance creation threshold τ_creation. This favours the exploration of newly discovered spaces and regions: indeed, with a low threshold value, affordances are easily created, and a goal may be generated to explore them. If the exploration then points out that an affordance is a false positive, that affordance is destroyed. On the contrary, if the exploration confirms it as a valid affordance, active learning continues to gradually collect new data to increase the robot's competence for this affordance.

Figure 4: Temporal evolution of affordances A₀ (top) and A₁ (bottom) during the learning process. Please note that the iteration axis is not the same for A₀ and A₁. Colors are not related to the competence graph: yellow spaces are part of A, blue and green ones of Ω; blue ones use relative data while green ones use absolute data. *: relative position between an object and the robot; **: robot absolute position; ***: LIDAR data.

To further analyse our algorithm, we tested two extreme situations: one with only random action exploration, and another using only goal babbling whenever possible. The first case favours novelty and discovery: the rate of affordance addition is high, but the exploration and mastering of the already discovered affordances is delayed. In Figure 3 we can see that the competence curve for A₀ requires more time to converge than in the previous test. At the opposite, using only goal babbling whenever available, the number of affordances discovered is greatly reduced and concentrated at the beginning of the exploration. In this configuration, A₁ is discovered later compared to the previous configuration.

5.2 Comparison with Existing Approaches
We compare our approach to baselines belonging to two different families: firstly, reinforcement learning algorithms on similar setups; secondly, affordance learning algorithms. To our knowledge, however, the latter methods do not focus on mobile robots and are thus evaluated on experimental setups significantly different from ours.
As we want to compare our algorithm to existing ones on the same setup, we choose classical reinforcement learning algorithms such as Q-Learning, DQN (Deep Q-Network) and Actor-Critic for our experimental setup. As they are not designed for multi-task learning and require an extrinsic reward, some modifications have been made to enable these algorithms to learn in our setup: we limited the experiment to one object at a time (except for A₂) and added a reward function to provide feedback. Unlike our method, where the exploration is self-guided, the desired behaviours or tasks to be completed by these algorithms must be made explicit through the reward function. We test these algorithms on 3 increasingly difficult tasks: moving the robot, pushing an object directly, and pushing an object using another object as a tool. To match the general aspect of our algorithm, we use Universal Value Function Approximators [20] with these three algorithms in order to learn how to reach various goals. We use 2 different kinds of reward function for each setup:

Version                     | Reaching goal | Pushing/going in the right direction | Else
Non-guided (sparse reward)  | +1000         | —                                    | -5
Guided                      | +1000         | max +20                              | -5

Table 1: Reward functions used by the comparative setup
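A literal reading of Table 1 as code (a sketch only; the distance-based shaping term and all thresholds are our illustrative assumptions):

import numpy as np

def reward(pos, goal, prev_pos, goal_radius=10.0, guided=True):
    """Reward functions of Table 1: +1000 on reaching the goal, -5 otherwise;
    the guided variant adds up to +20 for moving in the right direction."""
    d = np.linalg.norm(np.asarray(pos) - np.asarray(goal))
    d_prev = np.linalg.norm(np.asarray(prev_pos) - np.asarray(goal))
    if d <= goal_radius:
        return 1000.0
    r = -5.0
    if guided:
        r += min(20.0, max(0.0, d_prev - d))  # progress shaping, capped at +20
    return r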
In addition to these algorithms, we also compare ours to CURIOUS, a reinforcement learning algorithm using intrinsic motivation for exploration. We base its reward on the non-guided version. When required, Ω has been discretised uniformly. Actions have also been discretised into 4 when needed: forward, backward, turning left, turning right. When the robot reaches the goal zone, or after 1000 iterations, the episode ends, the setup is reset, and the robot is randomly placed inside the room.

We perform 10 runs over 50000 iterations for each task, reward function and algorithm, and report the results in Figure 5.

For the first task (top), we can see that all the algorithms succeed in 10000 to 20000 iterations. With our algorithm, moving the robot is mastered in as few as 250 iterations. The difference mainly comes from our use of planning: it lets the robot reach distant spots even with so little exploration done.

For the second task (middle), only the guided versions are successful, requiring between 12500 and 26000 iterations. The non-guided versions fail because of the combinatorial explosion of the states involved and the difficulty of reaching the final goal. In our case, this task is learned within 1000 iterations.

For the most complex task (bottom), only CURIOUS and our algorithm manage to succeed; CURIOUS reaches a competence of 0.96 using 32000 iterations. The other algorithms, even using the guided rewards, fail due to the complexity of the task at hand. Our algorithm only requires 2100 iterations to reach the final level of competence of CURIOUS.

For all the examples above, as we use UVFA, Q-Learning is highly dependent on the number of goals it has to explore (as each goal corresponds to a different state). This adds another prior (in addition to the reward function) that is not required with our algorithm.

Figure 5: Evaluation of Q-Learning, DQN, Actor-Critic and CURIOUS applied to three tasks: moving the robot, pushing an object, and pushing it using another object as a tool. The standard deviation is displayed in transparency.
As the majority of works in affordance learning use robot arms to manipulate objects, and not mobile robots, experimental setups are difficult to compare. Thus, we provide a qualitative comparison between our approach and existing affordance learning ones, analysing the learning process reported for the corresponding affordances in these different setups.

In [12], the system extracts pre-programmed controllables from the considered objects, as in our algorithm, then discretises and clusters them. It then builds a dependency graph that encompasses visual controllables, the performed action and the action context. In our case, the information contained in this graph is all included in our models and visual classifiers; thus, our system is capable of building the same affordances. Conversely, the pre-programmed actions of [12] are in our case autonomously learned by the robot, requiring less prior information and adding more flexibility. In both cases, the temporal aspect of sequences of actions is not learned, but in our algorithm the planning layer automatically creates successive sequences, based on the models learned.

On the contrary, in [21], the system builds a hierarchy of affordances, as in our proposition. This time, intrinsic motivation is used to select which action to execute within a finite set of pre-defined low-level actions, whereas in our system the robot manages to learn primitive actions in a continuous space and is capable of using sequences of actions by chaining primitive actions.
6 CONCLUSION
We have presented an algorithm for affordance learning combining the concept of affordances and intrinsically motivated exploration. It allows a robot to autonomously discover unknown affordances and learn actions to exploit them. The learning is based on active learning to collect data through new interactions with the environment, guided by the heuristics of intrinsic motivation. Once learned, these affordance control models are used to plan complex tasks with known or unknown objects, by using their physical properties to decide whether or not a learned affordance may be applied.

Our main contribution in this article is to propose a learning algorithm for multiple objects based on physical properties, so as to generalise to new objects. We have shown that it can discover, in a developmental manner, non-predefined affordances from the easiest to the most complex ones, and can use unbounded sequences of learned actions to complete complex tasks. We have compared our algorithm to others to outline two main properties: the hierarchical and developmental learning process, as well as the capacity to use sequences of actions to adapt to the complexity of the task at hand. This algorithm broadly relies on the concept of embodiment and is strongly inspired by human development from this point of view, for both the affordance aspect and the intrinsic motivation one.

In future work we want to deepen the comparison with existing methods by considering similar setups, and thus applying our algorithm to robotic arms. We also aim for a more complete system, including a mechanism for visual feature extraction in order to provide inputs for our algorithm.
REFERENCES
[1] Adrien Baranes and Pierre-Yves Oudeyer. 2009. R-IAC: Robust intrinsically motivated exploration and active learning. IEEE Transactions on Autonomous Mental Development 1, 3 (2009), 155–169.
[2] Richard Bellman. 1957. Dynamic Programming. Princeton University Press.
[3] Cédric Colas, Pierre Fournier, Mohamed Chetouani, Olivier Sigaud, and Pierre-Yves Oudeyer. 2019. CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, Long Beach, California, USA, 1331–1340. http://proceedings.mlr.press/v97/colas19a.html
[4] Dongshin Kim, Jie Sun, Sang Min Oh, J. M. Rehg, and A. F. Bobick. 2006. Traversability classification using unsupervised on-line visual learning for outdoor robot navigation. In Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006).
[5] Nicolas Duminy, Sao Mai Nguyen, and Dominique Duhaut. 2019. Learning a Set of Interrelated Tasks by Using a Succession of Motor Policies for a Socially Guided Intrinsically Motivated Learner. Frontiers in Neurorobotics 12 (2019), 87. https://doi.org/10.3389/fnbot.2018.00087
[6] Emre Ugur, Yukie Nagai, Erol Sahin, and Erhan Oztop. 2015. Staged Development of Robot Skills: Behavior Formation, Affordance Learning and Imitation with Motionese. IEEE Transactions on Autonomous Mental Development 7, 2 (2015), 119–139. https://doi.org/10.1109/TAMD.2015.2426192
[7] Sébastien Forestier and Pierre-Yves Oudeyer. 2016. Overlapping Waves in Tool Use Development: a Curiosity-Driven Computational Model. In IEEE International Conference on Developmental Learning and Epigenetic Robotics (2016), 1859–1864.
[8] B. A. Francis and W. M. Wonham. 1976. The internal model principle of control theory. Automatica 12, 5 (1976), 457–465. https://doi.org/10.1016/0005-1098(76)90006-6
[9] Jacqueline Gottlieb, Pierre-Yves Oudeyer, Manuel Lopes, and Adrien Baranes. 2013. Information-seeking, curiosity, and attention: Computational and neural mechanisms. Trends in Cognitive Sciences 17, 11 (2013), 585–593. https://doi.org/10.1016/j.tics.2013.09.001
[10] Lorenzo Jamone, Emre Ugur, Angelo Cangelosi, Luciano Fadiga, Alexandre Bernardino, Justus Piater, and Jose Santos-Victor. 2016. Affordances in psychology, neuroscience and robotics: a survey. IEEE Transactions on Cognitive and Developmental Systems (2016). https://doi.org/10.1109/TCDS.2016.2594134
[11] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37, 4-5 (2018), 421–436. https://doi.org/10.1177/0278364917710318
[12] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and Jose Santos-Victor. 2008. Learning Object Affordances: From Sensory-Motor Coordination to Imitation. IEEE Transactions on Robotics 24, 1 (2008), 15–26. https://doi.org/10.1109/TRO.2007.914848
[13] A. Manoury, S. M. Nguyen, and C. Buche. 2019. CHIME: An Adaptive Hierarchical Representation for Continuous Intrinsically Motivated Exploration. In 2019 Third IEEE International Conference on Robotic Computing (IRC). 167–170. https://doi.org/10.1109/IRC.2019.00032
[14] Karen A. Miller, Edward L. Deci, and Richard M. Ryan. 1988. Intrinsic Motivation and Self-Determination in Human Behavior. Contemporary Sociology (1988).
[15] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. CoRR abs/1602.01783 (2016). arXiv:1602.01783 http://arxiv.org/abs/1602.01783
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[17] Renaud Detry, Emre Başeski, Mila Popović, Younes Touati, Norbert Krüger, Oliver Kroemer, Jan Peters, and Justus Piater. 2009. Learning object-specific grasp affordance densities. In 2009 IEEE 8th International Conference on Development and Learning. 1–6. https://doi.org/10.1109/DEVLRN.2009.5175529
[18] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena Hafner. 2007. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation 11, 2 (2007), 265–286. https://doi.org/10.1109/TEVC.2006.890271
[19] R. Omar Chavez-Garcia, Pierre Luce-Vayrac, and Raja Chatila. 2016. Discovering Affordances Through Perception and Manipulation. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 3959–3964. https://doi.org/10.1109/IROS.2016.7759583
[20] Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. 2015. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15). JMLR.org, 1312–1320. http://dl.acm.org/citation.cfm?id=3045118.3045258
[21] Emre Ugur and Justus Piater. 2016. Emergent structuring of interdependent affordance learning tasks using intrinsic motivation and empirical feature selection. IEEE Transactions on Cognitive and Developmental Systems (2016), 1–13.
[22] Emre Uğur and Erol Şahin. 2010. Traversability: A case study for learning and perceiving affordances in robots. Adaptive Behavior 18, 3 (2010), 258–284. https://doi.org/10.1177/1059712310370625
[23] Bruce A. Whitehead. 1981. James J. Gibson: The ecological approach to visual perception. Boston: Houghton Mifflin, 1979, 332 pp. Behavioral Science 26, 3 (1981), 308–309. https://doi.org/10.1002/bs.3830260313
[24] D. M. Wolpert and M. Kawato. 1998. Multiple paired forward and inverse models for motor control. Neural Networks 11, 7-8 (1998), 1317–1329.