Learning of Behavior Trees for Autonomous Agents
Michele Colledanchise, Ramviyas Parasuraman, and Petter Ögren
Abstract — The definition of an accurate system model for an Automated Planner (AP) is often impractical, especially for real-world problems. Conversely, off-the-shelf planners fail to scale up and are domain dependent. These drawbacks are inherited from conventional transition systems such as Finite State Machines (FSMs) that describe the action-plan execution generated by the AP. On the other hand, Behavior Trees (BTs) represent a valid alternative to FSMs, presenting many advantages in terms of modularity, reactiveness, scalability and domain-independence. In this paper, we propose a model-free AP framework using Genetic Programming (GP) to derive an optimal BT for an autonomous agent to achieve a given goal in unknown (but fully observable) environments. We illustrate the proposed framework using experiments conducted with the open-source benchmark Mario AI for the automated generation of BTs that can play the game character Mario through a level at various degrees of difficulty, including enemies and obstacles.
Index Terms — Behavior trees, Evolutionary learning, Genetic Programming, Intelligent agents, Autonomous Robots
I. INTRODUCTION
Automated planning is a branch of Artificial Intelligence (AI) that concerns the realization of strategies or action sequences, typically for execution by intelligent agents, autonomous robots and unmanned vehicles. Unlike classical control and classification problems, the solutions are complex and must be discovered and optimized in a multidimensional space. According to [1], there are four common subjects that concern the use of Automated Planners (APs). First, the knowledge representation: the type of knowledge that an AP will learn must be defined. Second, the extraction of experience: how learning examples are collected. Third, the learning algorithm: how to capture patterns from the collected experience. Finally, the exploitation of the collected knowledge: how the AP benefits from the learned knowledge.

Applying AP in a real-world scenario is still an open problem [2]. In fully known environments with available models, the planning can be done offline: solutions can be found and evaluated prior to execution. Unfortunately, in most cases the environment is unknown and the strategy needs to be revised online. Recent works are extending the application of APs from toy examples to real problems such as planning space missions [3], fire extinction [4], and underwater navigation [5]. However, as highlighted in [6], most of these planners are hard to scale up and present issues when it comes to extending their domain.

The authors are with the Centre for Autonomous Systems, Computer Vision and Active Perception Lab, School of Computer Science and Communication, The Royal Institute of Technology - KTH, Stockholm, Sweden. E-mail: {miccol | ramviyas | petter}@kth.se

Fig. 1. Benchmark used to validate the framework.

Despite these successful examples, the application of APs to real-world problems suffers from two main problems:
Planning Task: Generally, APs require an accurate description of the planning task. These descriptions include the model of the actions that can be performed in the environment, the specification of the state of the environment, and the goal to achieve. Generating an exact definition of the planning task is often unfeasible for real-world problems;
Extensibility: Usually, finding a solution of an AP is a PSPACE-complete problem [7], [8]. Recent works tackle this problem through reachability analysis [9], [10], but still, specifying search control knowledge is more difficult than specifying the planning task, because it requires expertise both in the task to solve and in the planning algorithm [11].

The task's goal is described using a fitness function defined by the user. The derived action execution is described as a composition of sub-planners using a tree-structured framework inherited from the computer game industry [12], namely the BT. BTs are a recent modular alternative to Controlled Hybrid Systems (CHSs) for describing reactive, fault-tolerant executions of robot tasks [13]. BTs were first introduced in artificial intelligence for computer games, to meet their need of programming the behavior of in-game non-player opponents [14]. Their tree structure, which encompasses modularity, flexibility, and ease of human understanding, has made them very popular in industry, and their graph representations have attracted a growing amount of attention in academia [13], [15]–[18] and in the robotics industry [19].

The main advantage of BTs as compared to CHSs can be seen by the following programming language analogy. In most CHSs, the state transitions are encoded in the states themselves, and switching from one state to the other leaves no memory of where the transition was made from. This is very general and flexible, but actually very similar to the now obsolete GOTO statement, which was an important part of many early programming languages, e.g., BASIC. In BTs, the equivalents of state transitions are governed by function calls and return values being passed up and down the tree structure. This is also flexible, but similar to the FUNCTION calls that have replaced GOTO in almost all modern programming languages. Thus, BTs exhibit many of the advantages in terms of readability, modularity and reusability that were gained when going from GOTO to FUNCTION calls in the 1980s. Moreover, in a CHS, adding a state requires evaluating each possible transition from/to the new state, and removing a state can require the re-evaluation of all the transitions in the system. BTs have a natural way to connect/disconnect new states, avoiding redundant evaluation of state transitions.

In a tree-structured framework such as the BT, the relations between nodes are defined by parent-child relations. These relations fit naturally in Genetic Programming (GP), allowing entire sub-trees to cross over and mutate through generations to yield an optimized BT that generates a plan leading to the desired goal.

In this paper we propose a model-free, algorithm-based framework that generates a BT for an autonomous agent to achieve a given goal in unknown environments. The advantages of our approach rest on the advantages of BTs over a general CHS. Hence our approach is modular and we can reduce the complexity by dividing the goal into sub-goals.

II. RELATED WORK
Evolutionary algorithms have been successfully applied to evolving robot or agent behaviors [20]–[23]. For instance, in [22], the authors used a GP methodology to obtain a better wall-follower algorithm for a mobile robot. In another interesting example [24], the authors applied Grammatical Evolution (GE) to generate different levels of a simulation environment for a game benchmark (Mario AI). Learning an agent's behaviors using evolutionary algorithms has been shown to outperform reinforcement learning strategies, at least for agents whose perception abilities are ambiguous [25].

BTs were originally used in the gaming industry, where the computer (autonomous) player uses BTs for its decision making. Recently, there have been works that improve a BT using several learning techniques, for example Q-learning [26] and evolutionary approaches [23], [27].

In a work by Perez et al. [23], the authors used GE to evolve BTs to create an AI controller for an autonomous agent (game character). Despite this being the most relevant work, we depart from it by using a metaheuristic evolutionary learning algorithm instead of grammatical evolution, as the GP algorithm provides a natural way of manipulating BTs and applying genetic operators.

Scheper et al. [28] applied evolutionary learning to BTs for a real-world robotic (Micro Air Vehicle) application. It appears to be the first real-world robotic application of evolving BTs. They used a (sub-optimal) manually crafted BT as the initial BT in the evolutionary learning process, and conducted experiments with a flying robot, while the BT that controls the robot keeps learning in every experiment. Finally, they demonstrated a significant improvement in the performance of the final evolved BT compared to the initial user-defined BT. While we take inspiration from this work, its downside is that it requires an initial BT to work, which goes against our model-free objective.

Even though the above-mentioned works motivate our present research, we intend to use a model-free framework, as opposed to model-based frameworks or frameworks that need extensive prior information. Hence, we propose a framework that is more robust and requires no information about the environment, but thrives on the fact that the environment is fully observable. Although we do not make a direct comparison of our work with other relevant works in this paper, we envisage it in our future works.

III. BACKGROUND: BTs AND GP
In this section we briefly describe BTs and GP. A more detailed description of BTs can be found in [14].
A. Behavior Tree
A Behavior Tree is a graphical modeling language and a representation for the execution of actions based on conditions and observations in a system. While BTs have become popular for modeling the artificial intelligence in computer games, they are similar to a combination of hierarchical finite state machines and hierarchical task network planners.

A BT is a directed rooted tree where each node is either a control flow node or an execution node (or the root). For each pair of connected nodes, we define the outgoing node as parent and the incoming node as child. The root has no parents and only one child, the control flow nodes have one parent and one or more children, and the execution nodes have one parent and no children. Graphically, the children of a control flow node are placed below it and are executed in order from left to right, as shown in Figs. 2-4.

The execution of a BT begins from the root node, which sends ticks with a given frequency to its child. (A tick is a signal that allows the execution of a child.) When a parent sends a tick to a child, the child is allowed to execute. The child returns to the parent the status running if its execution has not finished yet, success if it has achieved its goal, or failure otherwise.

There are four types of control flow nodes (selector, sequence, parallel, and decorator) and two execution nodes (action and condition). Their execution is explained as follows.

Selector: The selector node ticks its children from the leftmost, returning success (running) as soon as it finds a child that returns success (running). It returns failure only if all the children return failure. When a child returns running or success, the selector node does not tick the next children (if any). The selector node is graphically represented by a box with a "?", as in Fig. 2.

Fig. 2. Graphical representation of a fallback node with N children.

Algorithm 1: Pseudocode of a fallback node with N children
  for i ← 1 to N do
    childStatus ← Tick(child(i))
    if childStatus = running then
      return running
    else if childStatus = success then
      return success
  return failure

Sequence: The sequence node ticks its children from the leftmost, returning failure (running) as soon as it finds a child that returns failure (running). It returns success only if all the children return success. When a child returns running or failure, the sequence node does not tick the next children (if any). The sequence node is graphically represented by a box with a "→", as in Fig. 3.

Fig. 3. Graphical representation of a sequence node with N children.

Algorithm 2: Pseudocode of a sequence node with N children
  for i ← 1 to N do
    childStatus ← Tick(child(i))
    if childStatus = running then
      return running
    else if childStatus = failure then
      return failure
  return success

Parallel: The parallel node ticks its children in parallel and returns success if M ≤ N children return success; it returns failure if N − M + 1 children return failure, and it returns running otherwise. The parallel node is graphically represented by a box with two arrows, as in Fig. 4.

Fig. 4. Graphical representation of a parallel node with N children.

Decorator: The decorator node manipulates the return status of its child according to a policy defined by the user (e.g., it inverts the success/failure status of the child). The decorator node is graphically represented as in Fig. 5(a).

Action: The action node performs an action, returning success if the action is completed and failure if the action cannot be completed. Otherwise it returns running. The action node is represented as in Fig. 5(b).

Condition: The condition node checks whether a condition is satisfied or not, returning success or failure accordingly. The condition node never returns running. The condition node is represented as in Fig. 5(c).

Fig. 5. Graphical representation of a decorator node (a, with the label describing the user-defined policy), an action node (b, with the label describing the action performed), and a condition node (c, with the label describing the condition verified).

Root: The root node generates ticks. It is graphically represented as a white box labeled with "∅".
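To make the tick semantics above concrete, here is a minimal Python sketch of the selector and sequence nodes of Algorithms 1 and 2, together with a condition node; the class and constant names are ours, not from any particular BT library. An action node would expose the same tick interface, returning running while its activity is in progress.

```python
# Minimal sketch of the tick semantics of Algorithms 1 and 2 (names are ours).
RUNNING, SUCCESS, FAILURE = "running", "success", "failure"

class Selector:
    """Ticks children left to right; returns at the first success/running child."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status in (RUNNING, SUCCESS):
                return status      # remaining children are not ticked
        return FAILURE             # every child failed

class Sequence:
    """Ticks children left to right; returns at the first failure/running child."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status in (RUNNING, FAILURE):
                return status      # remaining children are not ticked
        return SUCCESS             # every child succeeded

class Condition:
    """Execution node that never returns running."""
    def __init__(self, predicate):
        self.predicate = predicate

    def tick(self):
        return SUCCESS if self.predicate() else FAILURE
```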
B. Genetic Programming

GP is an optimization algorithm that takes inspiration from biological evolution and is a specialization of genetic algorithms where each individual is itself a computer program (in this work, each individual is a BT). We use GP to optimize a population of randomly generated BTs according to a user-defined fitness function determined by a BT's ability to achieve a given goal.

GP has been used as a powerful tool to solve complex engineering problems through evolution strategies [29]. In our GP, individuals are BTs that are evolved using the genetic operations of reproduction, crossover, and mutation. In each population, individuals are selected according to the fitness function and then mated, crossing over parts of their sub-trees to form an offspring. The offspring is finally mutated, generating a new population. This process continues until the GP finds a BT that satisfies the goal (i.e., optimizes the fitness function and satisfies all constraints).

Often, the size of the final BT generated using GP is large, even though there might exist a smaller BT with the same fitness value and performance. This phenomenon of generating a BT of larger size than necessary is termed bloat. We therefore also apply bloat control at the end to optimize the size of the generated BT.

The GP used with BTs allows entire sub-trees to cross over and mutate through generations. The BTs of the previous generation are called parents and produce children BTs after applying the genetic operators. The best performing children are selected from the child population to act as the parent population of the next generation.

Crossover, mutation and selection are the three major genetic operations that we use in our approach. The crossover exchanges sub-trees between two parent BTs. Mutation replaces a node of a parent BT with a randomly selected node. Selection is the process of choosing BTs for the next population; the probability of being selected for the next population is proportional to a fitness function that describes "how close" the agent is to the goal.
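The overall GP cycle described above can be sketched as a short generation loop; this is our generic rendering, with the operators select, crossover and mutate passed in as parameters and detailed in the following subsections.

```python
import random

def evolve(population, fitness, select, crossover, mutate,
           n_generations=50, p_crossover=0.9):
    """Generic GP generation loop (a sketch; operator details are given below).

    `population` is a list of BTs; `fitness` maps a BT to a number; `select`,
    `crossover` and `mutate` are the genetic operators described in the text."""
    for _ in range(n_generations):
        offspring = []
        while len(offspring) < len(population):
            a = select(population, fitness)
            b = select(population, fitness)
            child = crossover(a, b) if random.random() < p_crossover else a
            offspring.append(mutate(child))
        population = offspring
    return max(population, key=fitness)
```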
1) Two-point crossover in two BTs: The crossover is performed by randomly swapping a sub-tree of one BT with a sub-tree of another BT, at any level [30]. Figs. 6 and 7 show two BTs before and after a crossover operation.
Fig. 6. BTs before the crossover of the highlighted sub-trees.

Fig. 7. BTs after the crossover of the highlighted sub-trees.
Remark 1: Note that using BTs as the knowledge representation framework avoids the problem of logic violation during crossover experienced in [31].
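A minimal sketch of such a subtree crossover follows, assuming BTs are encoded as nested tuples (node_type, [children]) with strings as execution-node leaves; this representation and the helper names are ours, not the paper's.

```python
import copy
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs. A tree is either a leaf (a string naming an
    execution node) or a pair (node_type, [children]) for a control flow node."""
    yield path, tree
    if isinstance(tree, tuple):
        _, children = tree
        for i, child in enumerate(children):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced by `new_subtree`."""
    if not path:
        return new_subtree
    node_type, children = tree
    i = path[0]
    new_child = replace_at(children[i], path[1:], new_subtree)
    return (node_type, children[:i] + [new_child] + children[i + 1:])

def crossover(a, b):
    """One offspring: a random subtree of `a` is replaced by a random subtree of `b`."""
    path_a, _ = random.choice(list(subtrees(a)))
    _, subtree_b = random.choice(list(subtrees(b)))
    return replace_at(copy.deepcopy(a), path_a, copy.deepcopy(subtree_b))
```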
2) Mutation operation in a BT: We use a unary mutation operator, where the mutation is carried out by replacing a node in a BT with another node of the same type (i.e., we do not replace an execution node with a control flow node or vice versa). This increases diversity, which is crucial in GP [30]. To improve convergence properties we use so-called simulated annealing [32], performing the mutation on several nodes of the first population of BTs and gradually reducing the number of mutated nodes in each new population. In this way we start with a very high diversity, to avoid possible local minima of the fitness function, and we get a smaller diversity as we get close to the goal.
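A sketch of this annealed mutation, reusing the subtrees and replace_at helpers from the crossover sketch; the leaf pool, the initial mutation count n0 and the decay rate are illustrative assumptions of ours, not values from the paper.

```python
import random

# Hypothetical leaf pool; in the benchmark these would be the available actions.
ACTION_LEAVES = ("right", "left", "jump", "shoot", "crouch")

def mutate(tree, generation, n0=6, decay=0.75):
    """Replace up to round(n0 * decay**generation) leaves with other leaves:
    many mutations in early generations, few near convergence. Only action
    leaves are touched here, so node types are preserved."""
    leaf_paths = [p for p, s in subtrees(tree)
                  if isinstance(s, str) and s in ACTION_LEAVES]
    k = max(1, round(n0 * decay ** generation))
    for path in random.sample(leaf_paths, min(k, len(leaf_paths))):
        tree = replace_at(tree, path, random.choice(ACTION_LEAVES))
    return tree
```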
3) Selection mechanism: From the mutated population, also called the offspring, individuals are selected for the next population. The selection process is a random process which selects a given T_i with a probability p_i. This probability is proportional to a fitness function f_i which quantitatively measures how many sub-goals are satisfied. There are three commonly used ways to compute p_i given the fitness function [1]:

1) Naive Method: p_i = f_i / Σ_j f_j, that is, the fitness divided by the sum of the fitness of all the individuals in the population (to ensure p_i ∈ [0, 1]).

2) Rank Space Method: We set P_c as the probability of the highest ranking individual (the individual with the highest f_i), then we sort the trees in the population in descending order w.r.t. the fitness (i.e., T_1 has the highest fitness). The probabilities p_i are then defined as follows:

p_k = (1 − P_c)^(k−1) P_c,  ∀k ∈ {1, 2, ..., N − 1}   (1)
p_N = (1 − P_c)^(N−1)   (2)

3) Diversity Rank Method: We measure the diversity d_i of an individual T_i w.r.t. the others in the population. The probability p_i encompasses both diversity and fitness. Let d̄ and f̄ be the maximal values of d_i and f_i in the population respectively; the probability p_i is given by:

p_i = 1 − ‖[d_i, f_i] − [d̄, f̄]‖ / ‖[d̄, f̄]‖   (3)

Individuals with the same survival probability lie on so-called iso-goodness curves.
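The three probability assignments translate directly into code; the following sketch implements the naive ratio and Eqs. (1)-(3) (the default P_c value is an illustrative choice of ours).

```python
import math

def naive_probabilities(fitnesses):
    """Naive method: p_i = f_i / sum_j f_j (fitness values assumed non-negative)."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def rank_space_probabilities(n, p_c=0.6):
    """Rank space method, Eqs. (1)-(2): individuals sorted by descending fitness
    get geometrically decreasing probabilities that sum to one."""
    probs = [p_c * (1 - p_c) ** k for k in range(n - 1)]   # ranks 1 .. N-1
    probs.append((1 - p_c) ** (n - 1))                     # rank N
    return probs

def diversity_rank_probability(d_i, f_i, d_bar, f_bar):
    """Diversity rank method, Eq. (3), with d_bar/f_bar the population maxima."""
    distance = math.hypot(d_i - d_bar, f_i - f_bar)
    return 1 - distance / math.hypot(d_bar, f_bar)
```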
IV. PROBLEM FORMULATION

Here we formulate definitions and assumptions, then we state the main problem, and finally illustrate the approach with an example.
Assumption 1: The environment is unknown but fully observable. We consider the problem of so-called learning of stochastic models in a fully observable environment [2].
Assumption 2: There exists a finite set of actions that, when performed, lead from the initial condition to the goal.
Definition 1: S is the state space of the environment.

Remark 2: We only know the initial state and the final state.
Definition 2: Σ is a finite set of actions.

Definition 3: γ : S × Σ → [0, 1] is the fitness function. It takes the value 1 if and only if performing the finite set of actions Σ changes the state of the environment from an initial state to a final state that satisfies the goal.

Problem 1: Given a goal described by a fitness function γ and an arbitrary initial state s₀ ∈ S, derive an action sequence Σ such that γ(s₀, Σ) = 1.

V. PROPOSED APPROACH
In this section we describe the proposed approach. We begin by defining which actions the agent can perform and which conditions it can observe. We also define an appropriate fitness function that takes as input a BT and returns a fitness value proportional to how close the BT is to achieving a given goal. An empirically determined moving time window τ (seconds) is used in the execution process: the BT is executed continuously, but the fitness function is evaluated over the past τ seconds. The progressive changes in the fitness function are assessed to determine the course of the learning algorithm.

We follow a metaheuristic learning strategy, where we use a greedy algorithm first and, when it fails, we use the GP. The GP is also used when the greedy algorithm cannot provide any results or when the complexity of the solution increases. This mixed-learning heuristic approach is meant to reduce the learning time significantly compared to using pure GP, while still achieving an optimal BT that satisfies a given goal.

At an initial state s₀ we start with a BT that consists of only one node, which is an action node. To choose that action node, we use a greedy search process where each action is executed until we find an action such that, when it is executed for τ seconds, the value of the fitness function keeps increasing. If such an action is found, it is added to the BT. However, if no action is found, the GP process is initiated with a population of binary trees (two nodes in a BT) with random node assignments consisting of combinations of condition and action nodes, and it returns an initial BT that increases the fitness value the most.

In the next stages, the resulting BT is executed again, and the changes in the conditions and fitness values are monitored. When the fitness value starts decreasing, the recently changed conditions (within τ seconds) are composed (randomly) into a sub-tree (as in Fig. 2) and added to the existing BT. Then we use the greedy search algorithm as above to find the action node for the previously added sub-tree, by adding each possible action to that sub-tree and executing the whole BT. Once again, when an action that increases the fitness value is found, that action is added to the recent sub-tree of the BT and the whole process continues. If no such action is found, the GP is used to determine the sub-tree which increases the fitness value. We iterate these processes until the goal is achieved. Finally, we remove the possibly unnecessary nodes in the BT by applying the anti-bloat control operation.
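The greedy first stage can be sketched as follows; run_for_window is an assumed hook standing in for executing an action in the environment for τ seconds and sampling the fitness over that window.

```python
def greedy_action_search(actions, run_for_window):
    """Greedy first stage (sketch): try each primitive action for tau seconds and
    return the first one under which the fitness keeps increasing; returning
    None hands control over to the GP stage."""
    for action in actions:
        trace = run_for_window(action)   # fitness samples over the window
        increasing = all(later > earlier
                         for earlier, later in zip(trace, trace[1:]))
        if trace and increasing:
            return action
    return None
```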
We now address the concerns raised by [1] for the AP proposed using BTs.

A. Knowledge representation

The knowledge is represented as a BT. A BT can be seen as a rule-based system, in the sense that it describes which action to perform when certain conditions are satisfied.
B. Extraction of the experience
The experiences (knowledge) are extracted in terms of conditions observed from the environment using the sensory perception of the autonomous agent. Examples of conditions for a robot could be obstacle position, position information, energy level, etc. Similarly, example conditions for a game character could be "enemy in front", "obstacle close-by", "level reached", "points collected", "number of bullets remaining", etc.
C. Learning algorithm
Algorithm 3 presents the pseudo-code of the learning algorithm. The learning algorithm has two steps. The first step aims to identify which conditions have to be verified in order to perform some actions. The second step aims to learn the actions to perform.

As mentioned earlier, the framework starts at the initial state s₀. If the value of the fitness function does not increase, a greedy algorithm is used to try each action until it finds the one that leads to an increase of the fitness value. If no action is found, it starts the GP to learn a BT composition of actions, as explained before. We call the learned BT T₀.

Remark 3: In case the framework learns a single action, T₀ is a degenerate BT composed of a single action.

Let C_F ⊆ C be the set of conditions that have changed from true to false and let C_T ⊆ C be the set of conditions that have changed from false to true during τ. The BT composition of those conditions, T_cond, is depicted in Fig. 8.

Fig. 8. Graphical representation of T_cond.

The conditions encoded in T_cond make the fitness value decrease (i.e., when T_cond returns success, the fitness value decreases). Thus we need to learn the BT to be performed whenever T_cond returns success, to enable an increase in the fitness value. The learning procedure continues as before. Let T_acts be the learned BT to be performed when T_cond returns success; the BT that the agent runs is now

T₁ ≜ selector(T̃₁, T₀)   (4)

where

T̃₁ ≜ sequence(T_cond, T_acts).   (5)

The agent runs T₁ as long as the value of the fitness function increases. When the fitness stops increasing, a new BT T₂ is learned following the previous procedure. In general, as long as the goal is not reached, the learned BT is:

T_i ≜ selector(T̃_i, T_{i−1})   (6)

When the final BT is learned (i.e., the BT that leads the agent to the goal), we run the anti-bloat algorithm to remove possibly inefficient nodes introduced due to a large time window τ or due to the randomness needed by the GP.

Algorithm 3: Pseudocode of the learning algorithm
  γ_old ← GetFitness(nil)
  t₀ ← GetFirstBT()
  t ← t₀
  while γ < 1 do
    γ ← GetFitness(t)
    if γ_old ≥ γ then
      t_cond ← GetChangedConditions()
      t_acts ← LearnSingleAction(t)
      if t_acts = nil then
        t_acts ← LearnBT(t)
      t̃ ← Sequence(t_cond, t_acts)
      t ← Selector(t̃, t)
    γ_old ← γ
  return t

D. Exploitation of the collected knowledge

At each stage, the resulting BTs are executed in a simulated (or real) environment to be evaluated against a fitness function. Based on the value of the fitness function, the learning algorithm decides the future course. The fitness function is defined in accordance with the given goal. For instance, if the goal is to complete a level in a game, then the fitness function is a function of the following: the game points acquired by the agent (game character), how far (distance) the agent has traversed, how much time the agent has spent in the game level, how many enemies were shot by the agent, etc.
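For the game example just given, a fitness function could combine those quantities along the following lines; the weights and the weighted-sum form are purely illustrative assumptions, since the exact non-linear combination used in the experiments is not reproduced here.

```python
def mario_fitness(distance, kills, hurts, time_left, level_complete,
                  w_d=1.0, w_k=10.0, w_h=25.0, completion_bonus=1000.0, w_t=2.0):
    """Hypothetical fitness combining the quantities listed above; the
    experiments use a non-linear combination whose exact form is not
    reproduced here, so this weighted sum is only a sketch."""
    score = w_d * distance + w_k * kills - w_h * hurts
    if level_complete:
        score += completion_bonus + w_t * time_left   # reward finishing fast
    return score
```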
E. Anti-bloat control
Once we have obtained the BT that satisfies the goal, we search for ineffective sub-trees, i.e., those action compositions that are superfluous for reaching the goal. This process is called anti-bloat control in GP. Most often, the genetic operators (such as size-fair crossover and size-fair mutation) or the selection mechanism in GP apply the bloat control. However, in this work, we first generate the BT using the GP without size/depth restrictions, in order to achieve a complex yet practical BT. Then we apply bloat control using a separate breadth-first algorithm that reduces the size and depth of the generated BT while keeping the properties of the BT and its performance at the same time.

To identify the redundant or unnecessary sub-trees, we enumerate the sub-trees with a breadth-first enumeration. We run the BT without the first sub-tree and check whether the fitness function attains a lower value or not. In the former case the sub-tree is kept; in the latter case the sub-tree is removed, creating a new BT without that sub-tree. Then we run the same procedure on the new BT. The procedure stops when no ineffective sub-trees are found. Algorithm 4 presents the pseudo-code of the procedure.
Algorithm 4: Pseudocode of the anti-bloat control for inefficient subtree(s) removal
  t_new ← t
  i ← 0
  while i ≤ GetNodesNumber(t_new) do
    i ← i + 1
    t_rem ← RemoveSubtree(t_new, i)
    if GetFitness(t_rem) ≥ GetFitness(t_new) then
      t_new ← t_rem
      i ← 0
  return t_new

Remark 4: The procedure is trivial using a BT, thanks to its tree structure.
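On the nested-tuple representation used in the earlier sketches, Algorithm 4 can be rendered as follows; the depth-first enumeration is a simplification of the breadth-first scan described above, and subtrees comes from the crossover sketch.

```python
def remove_subtree(tree, path):
    """Return a copy of `tree` with the node at non-empty `path` deleted from
    its parent's child list."""
    node_type, children = tree
    i = path[0]
    if len(path) == 1:
        return (node_type, children[:i] + children[i + 1:])
    pruned_child = remove_subtree(children[i], path[1:])
    return (node_type, children[:i] + [pruned_child] + children[i + 1:])

def anti_bloat(tree, fitness):
    """Drop any subtree whose removal does not lower the fitness, then rescan
    the smaller tree; terminates because each removal shrinks the tree."""
    base = fitness(tree)
    for path, _ in subtrees(tree):
        if not path:
            continue                      # never remove the root
        candidate = remove_subtree(tree, path)
        if fitness(candidate) >= base:
            return anti_bloat(candidate, fitness)
    return tree
```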
VI. PRELIMINARY EXPERIMENTS
To experimentally verify the proposed approach, we used the Mario AI [33] open-source benchmark for the Super Mario Bros game developed initially by Nintendo. The game play in Mario AI, as in the original Nintendo version, consists in moving the controlled character, namely Mario, through two-dimensional levels, which are viewed sideways. Mario can walk and run to the right and left, jump, and (depending on which state he is in) shoot fireballs. Gravity acts on Mario, making it necessary to jump over cliffs to get past them. Mario can be in one of three states: Small, Big (can kill enemies by jumping onto them), and Fire (can shoot fireballs).

The main goal of each level is to get to the end of the level, which means traversing it from left to right. Auxiliary goals include collecting as many coins as possible, finishing the level as fast as possible, and collecting the highest score, which in part depends on the number of collected coins and killed enemies.

Complicating matters is the presence of cliffs and moving enemies. If Mario falls down a hole, he loses a life. If he touches an enemy, he gets hurt; this means losing a life if he is currently in the Small state. Otherwise, his state degrades from Fire to Big or from Big to Small.
a) Actions: In the benchmark there are five actions available: walk right, walk left, crouch, shoot, and jump.
b) Conditions: In the benchmark there is a receptive field of observations; we chose a square grid around Mario as the receptive field, as shown in Fig. 9. For each box of the grid there are two conditions available: whether the box is occupied by an enemy and whether the box is occupied by an obstacle.

c) Fitness Function: The fitness function is given by a non-linear combination of the distance traversed, the enemies killed, the number of times Mario gets hurt, and the time left when the end of the level is reached. Fig. 9 illustrates the receptive field around Mario used in our experiments.
Fig. 9. Receptive field around Mario. In this case Mario is in the Fire state, hence he occupies 2 blocks.
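The per-cell conditions can be generated mechanically from the receptive field; in this sketch observe(r, c) is an assumed hook into the benchmark's grid, and the grid size is an illustrative placeholder rather than the dimensions used in the experiments.

```python
GRID_SIZE = 5   # illustrative; the experiments' exact field size is not reproduced here

def grid_conditions(observe):
    """Two predicates per receptive-field cell, as described above.
    `observe(r, c)` is an assumed hook returning what occupies cell (r, c)."""
    conditions = {}
    for r in range(GRID_SIZE):
        for c in range(GRID_SIZE):
            # default arguments bind r, c at definition time for each closure
            conditions[f"enemy_{r}_{c}"] = lambda r=r, c=c: observe(r, c) == "enemy"
            conditions[f"obstacle_{r}_{c}"] = lambda r=r, c=c: observe(r, c) == "obstacle"
    return conditions
```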
A. Testbed 1: No Enemies and No Cliffs
This is a simple case. The agent has to learn how to move towards the end of the level and how to jump over obstacles. The selection method in the GP is the rank-space method. A YouTube video shows the learning phase in real time (https://youtu.be/uaqHbzRbqrk). Fig. 10 illustrates a resulting BT learned for Testbed 1.

Fig. 10. BT learned for Testbed 1.
B. Testbed 2: Walking Enemies and No Cliffs
This is slightly more complex than Testbed 1. The agent has to learn how to move towards the end of the level, how to jump over obstacles, and how to kill the enemies. The selection method in the GP is the rank-space method. A YouTube video shows the learning phase in real time (https://youtu.be/phy98jbdgQc).
Remark 5: The YouTube video does not show the initial learned BT T₀, which was the simple action "Right".

Fig. 11 illustrates a resulting BT learned for Testbed 2.

Fig. 11. BT learned for Testbed 2.
C. Testbed 3: Flying Enemies and Cliffs
VII. CONCLUSIONS
BTs are used to represent the knowledge, as they provide a valid alternative to conventional planners such as FSMs in terms of modularity, reactiveness, scalability and domain-independence. In this paper we presented a model-free AP for an autonomous agent using a metaheuristic optimization approach, involving a combination of GP and greedy-based algorithms, to generate an optimal BT that achieves a desired goal. To the best of our knowledge, this is the first work following a fully model-free framework, whereas other relevant works either use model-based frameworks or use a-priori information for the behavior trees. We have detailed how we addressed the following subjects in AP: knowledge representation, learning algorithm, extraction of experience, and exploitation of the collected knowledge. Further, the proposed approach was tested in the open-source Mario AI benchmark to simulate autonomous behavior of the game character Mario in the benchmark simulator. Some samples of the results are illustrated in this paper. A video of a working example and illustration is available at https://youtu.be/phy98jbdgQc. Even though the results are encouraging and comparable to the state of the art, more rigorous analysis and validation will be needed before extending the proposed approach to real-world robots.

VIII. FUTURE WORK
The first item of future work is to examine our approach in the Mario AI benchmark with extensive experiments and to compare our results with other state-of-the-art approaches such as [23]. We further plan to explore dynamic environments and adapt our algorithm accordingly. Inspired by the work in [34], we also plan to look at the possibility of using supervised learning to generate an optimal BT.

Regarding the supervised learning, we are developing a model-free framework to generate a BT by learning from training examples. The strength of the approach lies in the possibility of separating the tasks to learn. A YouTube video (http://youtu.be/ZositEzjidE) shows a preliminary result of the supervised learning approach implemented in the Mario AI benchmark. In this example, the agent learns separately the task shoot and the task jump from examples of a game played by a user.

REFERENCES
[1] T. M. Mitchell, Machine Learning. WCB/McGraw-Hill, Boston, MA, 1997, vol. 8.
[2] S. Jiménez, T. De La Rosa, S. Fernández, F. Fernández, and D. Borrajo, "A review of machine learning for automated planning," The Knowledge Engineering Review, vol. 27, no. 04, pp. 433–467, 2012.
[3] P. Nayak, J. Kurien, G. Dorais, W. Millar, K. Rajan, R. Kanefsky, E. Bernard, B. Gamble Jr, N. Rouquette, D. Smith et al., "Validating the DS-1 remote agent experiment," in Artificial Intelligence, Robotics and Automation in Space, vol. 440, 1999, p. 349.
[4] J. Fdez-Olivares, L. Castillo, O. García-Pérez, and F. Palao, "Bringing users and planning technology together: experiences in SIADEX," in Proc. ICAPS, 2006, pp. 11–20.
[5] J. G. Bellingham and K. Rajan, "Robotics in remote and hostile environments," Science, vol. 318, no. 5853, pp. 1098–1102, 2007.
[6] S. Jiménez, T. De la Rosa, S. Fernández, F. Fernández, and D. Borrajo, "A review of machine learning for automated planning," The Knowledge Engineering Review, vol. 27, no. 04, pp. 433–467, 2012.
[7] T. Bylander, "The computational complexity of propositional STRIPS planning," Artificial Intelligence, vol. 69, no. 1, pp. 165–204, 1994.
[8] T. Bylander, "Complexity results for planning," in IJCAI, vol. 10, 1991, pp. 274–279.
[9] F. Bacchus and F. Kabanza, "Using temporal logics to express search control knowledge for planning," Artificial Intelligence, vol. 116, no. 1, pp. 123–191, 2000.
[10] D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman, "SHOP2: An HTN planning system," J. Artif. Intell. Res. (JAIR), vol. 20, pp. 379–404, 2003.
[11] S. Minton, Learning Search Control Knowledge: An Explanation-Based Approach. Springer, 1988, vol. 61.
[12] D. Isla, "Halo 3 - building a better battle," in Game Developers Conference, 2008.
[13] M. Colledanchise, A. Marzinotto, and P. Ögren, "Performance analysis of stochastic behavior trees," in Robotics and Automation (ICRA), 2014 IEEE International Conference on, June 2014.
[14] P. Ögren, "Increasing modularity of UAV control systems using computer game behavior trees," in AIAA Guidance, Navigation and Control Conference, Minneapolis, MN, 2012.
[15] M. Colledanchise and P. Ögren, "How behavior trees modularize robustness and safety in hybrid systems," in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, Sept 2014, pp. 1482–1488.
[16] J. A. D. Bagnell, F. Cavalcanti, L. Cui, T. Galluzzo, M. Hebert, M. Kazemi, M. Klingensmith, J. Libby, T. Y. Liu, N. Pollard, M. Pivtoraiko, J.-S. Valois, and R. Zhu, "An integrated system for autonomous robotics manipulation," in IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2012, pp. 2955–2962.
[17] A. Klöckner, "Interfacing behavior trees with the world using description logic," in AIAA Conference on Guidance, Navigation and Control, Boston, 2013.
[18] M. Colledanchise, A. Marzinotto, D. V. Dimarogonas, and P. Ögren, "Adaptive fault tolerant execution of multi-robot missions using behavior trees," CoRR, vol. abs/1502.02960, 2015. [Online]. Available: http://arxiv.org/abs/1502.02960
[19] K. R. Guerin, C. Lea, C. Paxton, and G. D. Hager, "A framework for end-user instruction of a robot assistant for manufacturing."
[20] G. B. Parker and M. H. Probst, "Using evolution strategies for the real-time learning of controllers for autonomous agents in xpilot-ai," in IEEE Congress on Evolutionary Computation. IEEE, 2010, pp. 1–7.
[21] G. de Croon, L. O'Connor, C. Nicol, and D. Izzo, "Evolutionary robotics approach to odor source localization," Neurocomputing, vol. 121, pp. 481–497, 2013.
[22] C. Lazarus and H. Hu, "Using genetic programming to evolve robot behaviours," in Proceedings of the 3rd British Conference on Autonomous Mobile Robotics and Autonomous Systems, Manchester, 2001.
[23] D. Perez, M. Nicolau, M. O'Neill, and A. Brabazon, "Evolving behaviour trees for the Mario AI competition using grammatical evolution," in Proceedings of the 2011 International Conference on Applications of Evolutionary Computation - Volume Part I, ser. EvoApplications'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 123–132.
[24] N. Shaker, M. Nicolau, G. Yannakakis, J. Togelius, and M. O'Neill, "Evolving levels for Super Mario Bros using grammatical evolution," Sept 2012, pp. 304–311.
[25] G. D. Croon, M. F. V. Dartel, and E. O. Postma, "Evolutionary learning outperforms reinforcement learning on non-Markovian tasks," in Workshop on Memory and Learning Mechanisms in Autonomous Robots, 8th European Conference on Artificial Life, 2005.
[26] R. Dey and C. Child, "QL-BT: Enhancing behaviour tree design and implementation with Q-learning," Aug 2013, pp. 1–8.
[27] C.-U. Lim, R. Baumgarten, and S. Colton, "Evolving behaviour trees for the commercial game DEFCON," in Proceedings of the 2010 International Conference on Applications of Evolutionary Computation - Volume Part I, ser. EvoApplications'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 100–110.
[28] K. Y. W. Scheper, S. Tijmons, C. C. de Visser, and G. C. H. E. de Croon, "Behaviour trees for evolutionary robotics," CoRR, vol. abs/1411.7267, 2014.
[29] I. Rechenberg, "Evolution strategy," Computational Intelligence: Imitating Life, vol. 1, 1994.
[30] J. W. Tweedale and L. C. Jain, "Innovation in modern artificial intelligence," in Embedded Automation in Human-Agent Environment. Springer, 2012, pp. 15–31.
[31] Z. Fu, B. L. Golden, S. Lele, S. Raghavan, and E. A. Wasil, "A genetic algorithm-based approach for building accurate decision trees," INFORMS Journal on Computing, vol. 15, no. 1, pp. 3–22, 2003.
[32] L. Davis, Genetic Algorithms and Simulated Annealing. Morgan Kaufmann Publishers, Los Altos, CA, Jan 1987.
[33] S. Karakovskiy and J. Togelius, "The Mario AI benchmark and competitions," Computational Intelligence and AI in Games, IEEE Transactions on, vol. 4, no. 1, pp. 55–67, 2012.
[34] E. Tomai and R. Flores, "Adapting in-game agent behavior by observation of players using learning behavior trees," in