Learning STRIPS Action Models with Classical Planning
Diego Aineto, Sergio Jiménez and Eva Onaindia
Departamento de Sistemas Informáticos y Computación
Universitat Politècnica de València
Camino de Vera s/n, 46022 Valencia, Spain
{dieaigar,serjice,onaindia}@dsic.upv.es

Abstract
This paper presents a novel approach for learning STRIPS action models from examples that compiles this inductive learning task into a classical planning task. Interestingly, the compilation approach is flexible to different amounts of available input knowledge; the learning examples can range from a set of plans (with their corresponding initial and final states) to just a pair of initial and final states (no intermediate action or state is given). Moreover, the compilation accepts partially specified action models and it can be used to validate whether the observation of a plan execution follows a given STRIPS action model, even if this model is not fully specified.
Introduction
Besides plan synthesis (Ghallab, Nau, and Traverso 2004), planning action models are also useful for plan/goal recognition (Ramírez 2012). In both planning tasks, an automated planner is required to reason about action models that correctly and completely capture the possible world transitions (Geffner and Bonet 2013). Unfortunately, building planning action models is complex, even for planning experts, and this knowledge acquisition task is a bottleneck that limits the potential of AI planning (Kambhampati 2007).

On the other hand, Machine Learning (ML) has proven able to compute a wide range of different kinds of models from examples (Michalski, Carbonell, and Mitchell 2013). The application of inductive ML to learning STRIPS action models, the vanilla action model for planning (Fikes and Nilsson 1971), is not straightforward though:

• The input to ML algorithms (the learning/training data) is usually a finite set of vectors that represent the value of some fixed object features. The input for learning planning action models is, however, a set of observations of plan executions (where each plan possibly has a different length).

• The output of ML algorithms is usually a scalar value (an integer, in the case of classification tasks, or a real value, in the case of regression tasks). When learning action models, the output is, for each action, the preconditions, negative effects and positive effects that define the possible state transitions.
Learning STRIPS action models is a well-studied problem with sophisticated algorithms such as ARMS (Yang, Wu, and Jiang 2007), SLAF (Amir and Chang 2008) or LOCM (Cresswell, McCluskey, and West 2013), which do not require full knowledge of the intermediate states traversed by the example plans. Motivated by recent advances on the synthesis of different kinds of generative models with classical planning (Bonet, Palacios, and Geffner 2009; Segovia-Aguas, Jiménez, and Jonsson 2016; Segovia-Aguas, Jiménez, and Jonsson 2017), this paper introduces an innovative planning compilation approach for learning STRIPS action models. The compilation approach is appealing by itself because it opens up the door to the bootstrapping of planning action models, but also because:

1. It is flexible to various amounts of input knowledge. Learning examples range from a set of plans (with their corresponding initial and final states) to just a pair of initial and final states where no intermediate state or action is observed.

2. It accepts previous knowledge about the structure of the actions in the form of partially specified action models. In the extreme, the compilation can validate whether an observed plan execution is valid for a given STRIPS action model, even if this model is not fully specified.

The second section of the paper formalizes the classical planning model, its extension to conditional effects (a requirement of the proposed compilation) and the STRIPS action model (the output of the addressed learning task). The third section formalizes the task of learning action models with different amounts of available input knowledge. The fourth and fifth sections describe our compilation approach to tackle the formalized learning tasks. Finally, the last sections show the experimental evaluation, discuss the strengths and weaknesses of the compilation approach and propose several opportunities for future research.
Background
This section defines the planning model and the output of the learning tasks addressed in the paper.

Classical planning with conditional effects

Our approach to learning STRIPS action models is compiling this learning task into a classical planning task with conditional effects. Conditional effects allow us to compactly define actions whose effects depend on the current state. Supporting conditional effects is now a requirement of the IPC (Vallati et al. 2015) and many classical planners cope with conditional effects without compiling them away.

We use F to denote the set of fluents (propositional variables) describing a state. A literal l is a valuation of a fluent f ∈ F; i.e. either l = f or l = ¬f. A set of literals L represents a partial assignment of values to fluents (without loss of generality, we will assume that L does not contain conflicting values). We use L(F) to denote the set of all literal sets on F; i.e. all partial assignments of values to fluents. A state s is a full assignment of values to fluents; |s| = |F|, so the size of the state space is 2^|F|. Explicitly including negative literals ¬f in states simplifies subsequent definitions, but often we will abuse notation and define a state s only in terms of the fluents that are true in s, as is common in STRIPS planning.

A classical planning frame is a tuple Φ = ⟨F, A⟩, where F is a set of fluents and A is a set of actions. An action a ∈ A is defined with preconditions, pre(a) ⊆ L(F), positive effects, eff+(a) ⊆ L(F), and negative effects, eff−(a) ⊆ L(F). We say that an action a ∈ A is applicable in a state s iff pre(a) ⊆ s. The result of applying a in s is the successor state θ(s, a) = (s \ eff−(a)) ∪ eff+(a).

An action a ∈ A with conditional effects is defined as a set of preconditions pre(a) and a set of conditional effects cond(a). Each conditional effect C ▹ E ∈ cond(a) is composed of two sets of literals: C ⊆ L(F), the condition, and E ⊆ L(F), the effect. An action a ∈ A is applicable in a state s iff pre(a) ⊆ s, and the triggered effects resulting from the action application are the effects whose conditions hold in s:

  triggered(s, a) = ∪_{C ▹ E ∈ cond(a), C ⊆ s} E.

The result of applying action a in state s is the successor state θ(s, a) = (s \ eff−c(s, a)) ∪ eff+c(s, a), where eff−c(s, a) ⊆ triggered(s, a) and eff+c(s, a) ⊆ triggered(s, a) are, respectively, the triggered negative and positive effects.

A classical planning problem is a tuple P = ⟨F, A, I, G⟩, where I is an initial state and G ⊆ L(F) is a goal condition. A plan for P is an action sequence π = ⟨a_1, ..., a_n⟩ that induces the state trajectory ⟨s_0, s_1, ..., s_n⟩ such that s_0 = I and each a_i (1 ≤ i ≤ n) is applicable in s_{i-1} and generates the successor state s_i = θ(s_{i-1}, a_i). The plan length is denoted with |π| = n. A plan π solves P iff G ⊆ s_n; i.e. if the goal condition is satisfied in the last state resulting from the application of the plan π in the initial state I.
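In PDDL, conditional effects are written with the when construct. As a minimal sketch (our own illustration, not taken from the paper), the following action moves a robot between rooms and moves the ball only when the robot is carrying it:

(:action move
 :parameters (?from ?to - room)
 :precondition (at-robot ?from)
 :effect (and (not (at-robot ?from)) (at-robot ?to)
              ;; conditional effect: the ball changes location
              ;; only if the robot is currently carrying it
              (when (carrying-ball)
                    (and (not (ball-at ?from)) (ball-at ?to)))))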
STRIPS action schemas and variable name objects

Our approach is aimed at learning PDDL action schemas that follow the STRIPS requirement (McDermott et al. 1998; Fox and Long 2003). Figure 1 shows the stack schema of a four-operator blocksworld domain (Slaney and Thiébaux 2001) encoded in PDDL.

(:action stack
 :parameters (?v1 ?v2 - object)
 :precondition (and (holding ?v1) (clear ?v2))
 :effect (and (not (holding ?v1)) (not (clear ?v2))
              (handempty) (clear ?v1) (on ?v1 ?v2)))
Figure 1: The stack operator schema of the blocksworld domain specified in PDDL.
To formalize the output of the learning task, we assume that fluents F are instantiated from a set of predicates Ψ, as in PDDL. Each predicate p ∈ Ψ has an argument list of arity ar(p). Given a set of objects Ω, the set of fluents F is induced by assigning objects in Ω to the arguments of the predicates in Ψ; i.e. F = {p(ω) : p ∈ Ψ, ω ∈ Ω^ar(p)}, where Ω^k is the k-th Cartesian power of Ω.

Let Ω_v = {v_i}, 1 ≤ i ≤ max_{a∈A} ar(a), be a new set of objects denoted as variable names (Ω ∩ Ω_v = ∅). Ω_v is bound to the maximum arity of an action in a given planning frame. For instance, in a three-block blocksworld, Ω = {block1, block2, block3} while Ω_v = {v1, v2} because the operators with the maximum arity, stack and unstack, have two parameters each.

Let F_v be a new set of fluents, F ∩ F_v = ∅, that results from instantiating the predicates in Ψ using exclusively objects of Ω_v. F_v defines the elements of the preconditions and effects of an action schema. For instance, in the blocksworld domain, F_v = {handempty, holding(v1), holding(v2), clear(v1), clear(v2), ontable(v1), ontable(v2), on(v1,v1), on(v1,v2), on(v2,v1), on(v2,v2)}.

Finally, we assume that an action a ∈ A is instantiated from a STRIPS operator schema ξ = ⟨head(ξ), pre(ξ), add(ξ), del(ξ)⟩ where:

• head(ξ) = ⟨name(ξ), pars(ξ)⟩ is the operator header defined by its name and the corresponding variable names, pars(ξ) = {v_i}, 1 ≤ i ≤ ar(ξ). For instance, the headers of a four-operator blocksworld domain are: pickup(v1), putdown(v1), stack(v1,v2) and unstack(v1,v2).

• pre(ξ) ⊆ F_v is the set of preconditions, del(ξ) ⊆ F_v the negative effects and add(ξ) ⊆ F_v the positive effects, such that del(ξ) ⊆ pre(ξ), del(ξ) ∩ add(ξ) = ∅ and pre(ξ) ∩ add(ξ) = ∅.
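For instance, the stack schema of Figure 1 corresponds to the tuple ξ = ⟨⟨stack, {v1, v2}⟩, pre(ξ), add(ξ), del(ξ)⟩ with

  pre(ξ) = {holding(v1), clear(v2)},
  add(ξ) = {handempty, clear(v1), on(v1,v2)},
  del(ξ) = {holding(v1), clear(v2)},

which satisfies the three constraints above: del(ξ) ⊆ pre(ξ), del(ξ) ∩ add(ξ) = ∅ and pre(ξ) ∩ add(ξ) = ∅.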
Learning STRIPS action models
Learning STRIPS action models from fully available input knowledge, i.e. from plans where the pre- and post-states of every action in the plans are known, is straightforward. When intermediate states are available, operator schemas are derived by lifting the literals that change between the pre- and post-state of each action execution. The preconditions of an action are derived by lifting the minimal set of literals that appears in all the pre-states of the corresponding action (Jiménez et al. 2012).
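As a toy illustration (our own, using the blocksworld of Figure 1): suppose we observe the execution of (stack A B) with pre-state s = {holding(A), clear(B), ontable(B)} and post-state s′ = {handempty, clear(A), on(A,B), ontable(B)}. Lifting the literals that change (with A ↦ v1 and B ↦ v2) yields

  del(stack) = {holding(v1), clear(v2)},
  add(stack) = {handempty, clear(v1), on(v1,v2)},

while the candidate preconditions lifted from this single pre-state are {holding(v1), clear(v2), ontable(v2)}; intersecting them with the lifted pre-states of further stack executions eventually discards ontable(v2) and converges to pre(stack) = {holding(v1), clear(v2)}.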
This section formalizes more challenging learning tasks, where less input knowledge is available.

Learning from (initial, final) state pairs. This learning task amounts to observing agents acting in the world but watching only the result of their plans' execution. No intermediate information about the actions in the plans is given. This learning task is formalized as Λ = ⟨Ψ, Σ⟩:

• Ψ is the set of predicates that define the abstract state space of a given planning domain.

• Σ = {σ_1, ..., σ_τ} is a set of (initial, final) state pairs called labels. Each label σ_t = (s_0^t, s_n^t), 1 ≤ t ≤ τ, comprises the final state s_n^t resulting from executing an unknown plan π_t = ⟨a_1^t, ..., a_n^t⟩ in the initial state s_0^t.
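For instance (our own two-block illustration of a label), inverting a two-block tower yields

  σ_1 = ({on(A,B), ontable(B), clear(A), handempty},
         {on(B,A), ontable(A), clear(B), handempty}),

where an unknown plan π_1 that explains the label could be ⟨(unstack A B), (putdown A), (pickup B), (stack B A)⟩.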
Learning from labeled plans. We augment the input knowledge with the actions executed by the observed agent and define the learning task Λ′ = ⟨Ψ, Σ, Π⟩:

• Π = {π_1, ..., π_τ} is a given set of example plans where π_t = ⟨a_1^t, ..., a_n^t⟩, 1 ≤ t ≤ τ, is an action sequence that induces the corresponding state sequence ⟨s_0^t, s_1^t, ..., s_n^t⟩ such that each a_i^t, 1 ≤ i ≤ n, is applicable in s_{i-1}^t and generates s_i^t = θ(s_{i-1}^t, a_i^t).

Figure 2 shows an example of a learning task Λ′ of the blocksworld domain. This task has a single learning example, Π = {π_1} and Σ = {σ_1}, that corresponds to observing the execution of an eight-action plan (|π_1| = 8) for inverting a four-block tower.
Learning from partially specified action models. In case partially specified operator schemas are available, we can incorporate this information within the learning task. The new learning task is defined as Λ′′ = ⟨Ψ, Σ, Π, Ξ⟩:

• Ξ is a partially specified action model in which some preconditions and effects are known a priori.

A solution to Λ is a set of operator schemas Ξ that is compliant with the predicates in Ψ and the set of initial and final states Σ. In a Λ learning scenario, a solution must not only determine a possible STRIPS action model but also the plans π_t, 1 ≤ t ≤ τ, that explain the given labels Σ using the learned model. A solution to Λ′ is a set of STRIPS operator schemas Ξ (one schema ξ = ⟨head(ξ), pre(ξ), add(ξ), del(ξ)⟩ for each action that has a different name in the example plans Π) that is compliant with the predicates Ψ, the example plans Π, and their corresponding labels Σ. Finally, a solution to Λ′′ is a set of STRIPS operator schemas Ξ that is also compliant with the provided partially specified action model Ξ.
Learning STRIPS action models with planning
;;; Predicates in Ψ
(handempty) (holding ?o - object)
(clear ?o - object) (ontable ?o - object)
(on ?o1 - object ?o2 - object)

;;; Plan π_1
0: (unstack A B)
1: (putdown A)
2: (unstack B C)
3: (stack B A)
4: (unstack C D)
5: (stack C B)
6: (pickup D)
7: (stack D C)

;;; Label σ_1 = (s_0^1, s_n^1)
s_0^1: tower A on B on C on D (A clear, D on the table)
s_n^1: tower D on C on B on A (D clear, A on the table)

Figure 2: Learning task of the blocksworld domain from a single labeled plan.

In our approach, a learning task Λ, Λ′ or Λ′′ is solved by compiling it into a classical planning task with conditional effects. The intuition behind the compilation is that a solution to the resulting classical planning task is a sequence of actions that:
1. Programs the STRIPS action model Ξ. A solution plan has a prefix that, for each ξ ∈ Ξ, determines the fluents from F_v that belong to pre(ξ), del(ξ) and add(ξ).

2. Validates the programmed STRIPS action model Ξ in the given input knowledge (the labels Σ and Π, and/or Ξ if available). For every label σ_t ∈ Σ, a solution plan has a postfix that produces the final state s_n^t using the programmed action model Ξ in the corresponding initial state s_0^t. This process is the validation of the programmed action model Ξ with the set of learning examples, 1 ≤ t ≤ τ.

To formalize our compilation, we first define a set of classical planning instances P_t = ⟨F, ∅, I_t, G_t⟩ that belong to the same planning frame (i.e. same fluents and actions but different initial states and goals). The fluents F are built by instantiating the predicates in Ψ with the objects of the input labels Σ. Formally, Ω = ∪_{1≤t≤τ} obj(s_0^t), where obj is a function that returns the objects that appear in a fully specified state. The set of actions, A = ∅, is empty because the action model is initially unknown. Finally, the initial state I_t is given by the state s_0^t ∈ σ_t, and the goals G_t are defined by the state s_n^t ∈ σ_t.

We can now formalize the compilation approach. We start with Λ as it requires the least input knowledge. Given a learning task Λ = ⟨Ψ, Σ⟩, the compilation outputs a classical planning task P_Λ = ⟨F_Λ, A_Λ, I_Λ, G_Λ⟩:

• F_Λ extends F with:
  – Fluents pre_f(ξ), del_f(ξ) and add_f(ξ), for every f ∈ F_v and ξ ∈ Ξ, that represent the programmed action model. If a fluent pre_f(ξ)/del_f(ξ)/add_f(ξ) holds, it means that f is a precondition/negative effect/positive effect of the operator schema ξ ∈ Ξ. For instance, the preconditions of the stack schema (Figure 1) are represented by the fluents pre_holding_stack_v1 and pre_clear_stack_v2.
  – A fluent mode_prog to indicate whether the operator schemas are being programmed or validated (when already programmed).
  – Fluents {test_t}, 1 ≤ t ≤ τ, which represent the examples where the action model will be validated.

• I_Λ contains the fluents from F that encode s_0^1 (the initial state of the first label), the fluents in every pre_f(ξ) ∈ F_Λ and the fluent mode_prog set to true. Our compilation assumes that any operator schema is initially programmed with every possible precondition (the most specific learning hypothesis), no negative effect and no positive effect.

• G_Λ = ∪_{1≤t≤τ} {test_t} indicates that the programmed action model is validated in all the learning examples.

• A_Λ contains actions of three kinds:

1. Actions for programming an operator schema ξ ∈ Ξ (see the PDDL sketch after Figure 3):
   – Actions for removing a precondition f ∈ F_v from ξ:
     pre(programPre_{f,ξ}) = {¬del_f(ξ), ¬add_f(ξ), mode_prog, pre_f(ξ)},
     cond(programPre_{f,ξ}) = {∅} ▹ {¬pre_f(ξ)}.
   – Actions for adding a negative or positive effect f ∈ F_v to ξ:
     pre(programEff_{f,ξ}) = {¬del_f(ξ), ¬add_f(ξ), mode_prog},
     cond(programEff_{f,ξ}) = {pre_f(ξ)} ▹ {del_f(ξ)}, {¬pre_f(ξ)} ▹ {add_f(ξ)}.
2. Actions for applying an already programmed operator schema ξ ∈ Ξ bound to the objects ω ⊆ Ω^ar(ξ). We assume the operator headers are known, so the binding of ξ is done implicitly by order of appearance of the action parameters, i.e. the variables pars(ξ) are bound to the objects in ω that appear in the same position. Figure 3 shows the PDDL encoding of the action for applying a programmed operator stack.
     pre(apply_{ξ,ω}) = {pre_f(ξ) ⟹ p(ω)} for every p ∈ Ψ, f = p(pars(ξ)),
     cond(apply_{ξ,ω}) = {del_f(ξ)} ▹ {¬p(ω)} for every p ∈ Ψ, f = p(pars(ξ)),
                         {add_f(ξ)} ▹ {p(ω)} for every p ∈ Ψ, f = p(pars(ξ)),
                         {mode_prog} ▹ {¬mode_prog}.
3. Actions for validating the learning example t, 1 ≤ t ≤ τ:
     pre(validate_t) = G_t ∪ {test_j}, 1 ≤ j < t,
   whose effects add test_t and set the fluents of F to encode the initial state s_0^{t+1} of the next learning example.

Theorem 1. A classical plan π that solves P_Λ induces an action model Ξ that solves the learning task Λ.

Proof sketch. Once the operator schemas Ξ are programmed, they can only be applied and validated, according to the mode_prog fluent. To solve P_Λ, the goals {test_t}, 1 ≤ t ≤ τ, can only be achieved by executing an applicable sequence of programmed operator schemas that reaches the final state s_n^t, defined in σ_t, starting from s_0^t. If this is achieved for all the input examples, 1 ≤ t ≤ τ, it means that the programmed action model Ξ is compliant with the provided input knowledge and hence is a solution to Λ.

(:action apply_stack
 :parameters (?o1 - object ?o2 - object)
 :precondition
  (and (or (not (pre_on_stack_v1_v1)) (on ?o1 ?o1))
       (or (not (pre_on_stack_v1_v2)) (on ?o1 ?o2))
       (or (not (pre_on_stack_v2_v1)) (on ?o2 ?o1))
       (or (not (pre_on_stack_v2_v2)) (on ?o2 ?o2))
       (or (not (pre_ontable_stack_v1)) (ontable ?o1))
       (or (not (pre_ontable_stack_v2)) (ontable ?o2))
       (or (not (pre_clear_stack_v1)) (clear ?o1))
       (or (not (pre_clear_stack_v2)) (clear ?o2))
       (or (not (pre_holding_stack_v1)) (holding ?o1))
       (or (not (pre_holding_stack_v2)) (holding ?o2))
       (or (not (pre_handempty_stack)) (handempty)))
 :effect
  (and (when (del_on_stack_v1_v1) (not (on ?o1 ?o1)))
       (when (del_on_stack_v1_v2) (not (on ?o1 ?o2)))
       (when (del_on_stack_v2_v1) (not (on ?o2 ?o1)))
       (when (del_on_stack_v2_v2) (not (on ?o2 ?o2)))
       (when (del_ontable_stack_v1) (not (ontable ?o1)))
       (when (del_ontable_stack_v2) (not (ontable ?o2)))
       (when (del_clear_stack_v1) (not (clear ?o1)))
       (when (del_clear_stack_v2) (not (clear ?o2)))
       (when (del_holding_stack_v1) (not (holding ?o1)))
       (when (del_holding_stack_v2) (not (holding ?o2)))
       (when (del_handempty_stack) (not (handempty)))
       (when (add_on_stack_v1_v1) (on ?o1 ?o1))
       (when (add_on_stack_v1_v2) (on ?o1 ?o2))
       (when (add_on_stack_v2_v1) (on ?o2 ?o1))
       (when (add_on_stack_v2_v2) (on ?o2 ?o2))
       (when (add_ontable_stack_v1) (ontable ?o1))
       (when (add_ontable_stack_v2) (ontable ?o2))
       (when (add_clear_stack_v1) (clear ?o1))
       (when (add_clear_stack_v2) (clear ?o2))
       (when (add_holding_stack_v1) (holding ?o1))
       (when (add_holding_stack_v2) (holding ?o2))
       (when (add_handempty_stack) (handempty))
       (when (modeProg) (not (modeProg)))))

Figure 3: PDDL action for applying an already programmed schema stack (implications coded as disjunctions).

The compilation is complete in the sense that it does not discard any possible STRIPS action model. The size of the classical planning task P_Λ depends on:

• The arity of the action headers in Ξ and the predicates Ψ of the learning task. The larger the arity, the larger the F_v set, which in turn defines the size of the fluent sets pre_f(ξ)/del_f(ξ)/add_f(ξ) and the corresponding set of programming actions.

• The number of learning examples. The larger this number, the more test_t fluents and validate_t actions in P_Λ.
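The programming actions admit a similarly direct PDDL encoding. The paper only shows the apply actions in PDDL, so the following rendering is our own sketch (the action names follow Figure 4 and the fluent names follow Figure 3); it shows the two actions that program the fluent holding(v1) of the stack schema:

(:action program_pre_holding_stack_v1
 ;; removes holding(v1) from the most specific precondition hypothesis
 :precondition (and (modeProg) (pre_holding_stack_v1)
                    (not (del_holding_stack_v1))
                    (not (add_holding_stack_v1)))
 :effect (not (pre_holding_stack_v1)))

(:action program_eff_holding_stack_v1
 ;; if holding(v1) is still a precondition, it is programmed as a
 ;; negative effect; otherwise it is programmed as a positive effect
 :precondition (and (modeProg)
                    (not (del_holding_stack_v1))
                    (not (add_holding_stack_v1)))
 :effect (and (when (pre_holding_stack_v1) (del_holding_stack_v1))
              (when (not (pre_holding_stack_v1)) (add_holding_stack_v1))))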
Constraining the learning hypothesis space with additional input knowledge

In this section, we show that further input knowledge can be used to constrain the space of possible action models and to make the learning task more practicable.

Labeled plans

We extend the compilation to consider labeled plans. Given a learning task Λ′ = ⟨Ψ, Σ, Π⟩, the compilation outputs a classical planning task P_Λ′ = ⟨F_Λ′, A_Λ′, I_Λ′, G_Λ′⟩:

• F_Λ′ extends F_Λ with F_Π = {plan(name(ξ), Ω^ar(ξ), j)}, the fluents that encode the steps of the plans in Π, where F_{π_t} ⊆ F_Π encodes π_t ∈ Π. Fluents at_j and next_{j,j+1}, 1 ≤ j < n, are also added to represent the current plan step and to iterate through the steps of a plan.

• I_Λ′ extends I_Λ with the fluents F_{π_1} plus the fluents at_1 and {next_{j,j+1}}, 1 ≤ j < n, to indicate the plan step where the action model is validated. As in the original compilation, G_Λ′ = G_Λ = ∪_{1≤t≤τ} {test_t}.

• With respect to A_Λ′:
1. The actions for programming the preconditions/effects of a given operator schema ξ ∈ Ξ are the same.
2. The actions for applying an already programmed operator have an extra precondition f ∈ F_Π that encodes the current plan step, and extra conditional effects {at_j} ▹ {¬at_j, at_{j+1}}, for all j ∈ [1, n], for advancing to the next plan step. With this mechanism we ensure that these actions are applied in the same order as in the example plans.
3. The actions for validating the current learning example have an extra precondition, at_{|π_t|}, to indicate that the current plan π_t is fully executed, and extra conditional effects to remove plan π_t and initiate the next plan π_{t+1}: {∅} ▹ {¬at_{|π_t|}, at_1} ∪ {¬f : f ∈ F_{π_t}} ∪ {f : f ∈ F_{π_{t+1}}}.

Partially specified action models

The known preconditions and effects of a partially specified action model are encoded as fluents pre_f(ξ), del_f(ξ) and add_f(ξ) set to true in the initial state I_Λ′ (an illustrative fragment is sketched after Figure 4). The programming actions programPre_{f,ξ} and programEff_{f,ξ} become unnecessary, and they are removed from A_Λ′, thus making the planning task P_Λ′ easier to solve.

To illustrate this, the plan of Figure 4 is a solution to a learning task Λ′′ = ⟨Ψ, Σ, Π, Ξ⟩ for acquiring the blocksworld action model where the operator schemas for pickup, putdown and unstack are specified in Ξ. This plan programs and validates the operator schema stack from blocksworld using the plan π_1 and label σ_1 shown in Figure 2. Plan steps [0, 8] program the preconditions of the stack operator, steps [9, 13] program the operator effects and steps [14, 22] validate the programmed operators following the plan π_1 shown in Figure 2.

In the extreme, when a fully specified STRIPS action model is given in Ξ, the compilation validates whether an observed plan follows the given model. In this case, if a solution plan is found for P_Λ′, it means that the given action model is valid for the provided examples. If P_Λ′ is unsolvable, it means that the action model is invalid because it is not compliant with all the given examples. Tools for plan validation like VAL (Howey, Long, and Fox 2004) could also be used at this point.

00: (program_pre_clear_stack_v1)
01: (program_pre_handempty_stack)
02: (program_pre_holding_stack_v2)
03: (program_pre_on_stack_v1_v1)
04: (program_pre_on_stack_v1_v2)
05: (program_pre_on_stack_v2_v1)
06: (program_pre_on_stack_v2_v2)
07: (program_pre_ontable_stack_v1)
08: (program_pre_ontable_stack_v2)
09: (program_eff_clear_stack_v1)
10: (program_eff_clear_stack_v2)
11: (program_eff_handempty_stack)
12: (program_eff_holding_stack_v1)
13: (program_eff_on_stack_v1_v2)
14: (apply_unstack a b i1 i2)
15: (apply_putdown a i2 i3)
16: (apply_unstack b c i3 i4)
17: (apply_stack b a i4 i5)
18: (apply_unstack c d i5 i6)
19: (apply_stack c b i6 i7)
20: (apply_pickup d i7 i8)
21: (apply_stack d c i8 i9)
22: (validate_1)

Figure 4: Plan for programming and validating the stack schema using plan π_1 and label σ_1 (shown in Figure 2) as well as previously specified operator schemas for pickup, putdown and unstack.
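As an illustration of this encoding (our own sketch; the fluent names mirror Figure 3), a fully known putdown schema would be injected by setting the following fluents to true in the initial state and removing the programming actions of putdown:

;; putdown: pre = {holding(v1)}, del = {holding(v1)},
;;          add = {ontable(v1), clear(v1), handempty}
(pre_holding_putdown_v1)
(del_holding_putdown_v1)
(add_ontable_putdown_v1)
(add_clear_putdown_v1)
(add_handempty_putdown)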
Static predicates

A static predicate p ∈ Ψ is a predicate that does not appear in the effects of any action (Fox and Long 1998). Therefore, one can get rid of the mechanism for programming these predicates in the effects of any action schema while keeping the compilation complete. Given a static predicate p:

• Fluents del_f(ξ) and add_f(ξ), such that f ∈ F_v is an instantiation of the static predicate p with the set of variable names Ω_v, can be discarded for every ξ ∈ Ξ.

• Actions programEff_{f,ξ} (s.t. f ∈ F_v is an instantiation of p in Ω_v) can also be discarded for every ξ ∈ Ξ.

Static predicates can also constrain the space of possible preconditions by looking at the given set of labels Σ. One can assume that if a precondition f ∈ F_v (s.t. f is an instantiation of a static predicate in Ω_v) is not compliant with the labels in Σ, then the fluents pre_f(ξ) and the actions programPre_{f,ξ} can be discarded for every ξ ∈ Ξ. For instance, in the zenotravel domain, the preconditions of the form pre_next_ξ_vi_vi (for the schemas board, debark, fly, zoom and refuel) can be discarded, together with their corresponding programming actions, because a precondition (next ?vi ?vi - flevel) will never hold in any state of Σ.

On the other hand, fluents pre_f(ξ) and actions programPre_{f,ξ} are discardable for every ξ ∈ Ξ if a precondition f ∈ F_v (s.t. f is an instantiation of a static predicate in Ω_v) is not possible according to Π. Back to the zenotravel domain, if an example plan π_t ∈ Π contains the action (fly plane1 city2 city0 fl3 fl2) and the corresponding label σ_t ∈ Σ contains the static literal (next fl2 fl3) but does not contain (next fl2 fl2), (next fl3 fl3) or (next fl3 fl2), the only possible precondition of fly that includes the static predicate is pre_next_fly_v5_v4.

Evaluation

This section evaluates the performance of our approach for learning STRIPS action models with different amounts of available input knowledge.

Setup. The domains used in the evaluation are IPC domains that satisfy the STRIPS requirement (Fox and Long 2003), taken from the PLANNING.DOMAINS repository (Muise 2016). We only used 5 learning examples for each domain, and we fixed the examples for all the experiments so that we can evaluate the impact of the input knowledge on the quality of the learned models. All experiments were run on an Intel Core i5 3.10 GHz x 4 with 8 GB of RAM.

The classical planner we used to solve the instances that result from our compilations is MADAGASCAR (Rintanen 2014). We used MADAGASCAR due to its ability to deal with planning instances populated with dead-ends. In addition, SAT-based planners can apply the actions for programming preconditions in a single planning step (in parallel) because these actions do not interact. Actions for programming action effects can also be applied in a single planning step, which significantly reduces the planning horizon.

Metrics. The quality of the learned models is measured with the precision and recall metrics.
These two metrics are frequently used in pattern recognition, information retrieval and binary classification, and they are more informative than simply counting the number of errors in the learned model or computing the symmetric difference between the learned and the reference model (Davis and Goadrich 2006). Intuitively, precision gives a notion of the soundness of the learned models, while recall gives a notion of their completeness. Formally, Precision = tp/(tp + fp), where tp is the number of true positives (predicates that correctly appear in the learned action model) and fp is the number of false positives (predicates that appear in the learned action model but should not). Recall is formally defined as Recall = tp/(tp + fn), where fn is the number of false negatives (predicates that should appear in the learned action model but are missing).

Given the syntax-based nature of these metrics, it may happen that they report low scores for learned models that are actually good but correspond to reformulations of the actual model, i.e. a learned model semantically equivalent but syntactically different from the reference model. This mainly occurs when the learning task is under-constrained.
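As a toy illustration of these metrics (our own, not from the paper): if the reference preconditions of stack are {holding(v1), clear(v2)} and the learned schema has preconditions {holding(v1), ontable(v1)}, then tp = 1 (holding(v1)), fp = 1 (ontable(v1)) and fn = 1 (clear(v2)), so

  Precision = tp/(tp + fp) = 1/(1 + 1) = 0.5,
  Recall = tp/(tp + fn) = 1/(1 + 1) = 0.5.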
Learning from labeled plans

We start evaluating our approach with tasks Λ′ = ⟨Ψ, Σ, Π⟩, where labeled plans are available. We then repeat the evaluation exploiting the potential static predicates computed from Σ, which are the predicates that appear unaltered in the initial and final states of every σ_t ∈ Σ. Static predicates are used to constrain the space of possible action models as explained in the previous section.

Table 1 shows the obtained results. Precision (P) and recall (R) are computed separately for the preconditions (Pre), positive effects (Add) and negative effects (Del), while the last two columns of each setting and the last row report average values. We can observe that identifying static predicates leads to models with better precondition recall. This evidences that many of the missing preconditions corresponded to static predicates: there is no incentive to learn them because they always hold (Gregory and Cresswell 2015).

             No static                                Static
             Pre        Add        Del        Avg        Pre        Add        Del        Avg
             P    R     P    R     P    R     P    R     P    R     P    R     P    R     P    R
Blocks       1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0
Driverlog    1.0  0.36  0.75 0.86  1.0  0.71  0.92 0.64  0.9  0.64  0.56 0.71  0.86 0.86  0.78 0.73
Ferry        1.0  0.57  1.0  1.0   1.0  1.0   1.0  0.86  1.0  0.57  1.0  1.0   1.0  1.0   1.0  0.86
Floortile    0.52 0.68  0.64 0.82  0.83 0.91  0.66 0.80  0.68 0.68  0.89 0.73  1.0  0.82  0.86 0.74
Grid         0.62 0.47  0.75 0.86  0.78 1.0   0.71 0.78  0.79 0.65  1.0  0.86  0.88 1.0   0.89 0.83
Gripper      1.0  0.67  1.0  1.0   1.0  1.0   1.0  0.89  1.0  0.67  1.0  1.0   1.0  1.0   1.0  0.89
Hanoi        1.0  0.50  1.0  1.0   1.0  1.0   1.0  0.83  0.75 0.75  1.0  1.0   1.0  1.0   0.92 0.92
Miconic      0.75 0.33  0.50 0.50  0.75 1.0   0.67 0.61  0.89 0.89  1.0  0.75  0.75 1.0   0.88 0.88
Satellite    0.60 0.21  1.0  1.0   1.0  0.75  0.87 0.65  0.82 0.64  1.0  1.0   1.0  0.75  0.94 0.80
Transport    1.0  0.40  1.0  1.0   1.0  0.80  1.0  0.73  1.0  0.70  0.83 1.0   1.0  0.80  0.94 0.83
Visitall     1.0  0.50  1.0  1.0   1.0  1.0   1.0  0.83  1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0
Zenotravel   1.0  0.36  1.0  1.0   1.0  0.71  1.0  0.69  1.0  0.64  0.88 1.0   1.0  0.71  0.96 0.79
Average      0.88 0.50  0.88 0.92  0.95 0.91  0.90 0.78  0.90 0.74  0.93 0.92  0.96 0.91  0.93 0.86

Table 1: Precision and recall scores for learning tasks from labeled plans without (left) and with (right) static predicates.

Table 2 reports the total planning time and the preprocessing time (in seconds) invested by MADAGASCAR to solve the planning instances that result from our compilation, as well as the number of actions of the solution plans. All the learning tasks are solved in a few seconds. Interestingly, one can identify the domains with static predicates just by looking at the reported plan length. In these domains, some of the preconditions that correspond to static predicates are directly derived from the learning examples and therefore fewer programming actions are required. When static predicates are identified, the resulting compilation is also much more compact and produces smaller planning/instantiation times.

             No static                    Static
             Total  Preprocess  Length    Total  Preprocess  Length
Blocks       0.04   0.00        72        0.03   0.00        72
Driverlog    0.14   0.09        83        0.06   0.03        59
Ferry        0.06   0.03        55        0.06   0.03        55
Floortile    2.42   1.64        168       0.67   0.57        77
Grid         4.82   4.75        88        3.39   3.35        72
Gripper      0.03   0.01        43        0.01   0.00        43
Hanoi        0.12   0.06        48        0.09   0.06        39
Miconic      0.06   0.03        57        0.04   0.00        41
Satellite    0.20   0.14        67        0.18   0.12        60
Transport    0.59   0.53        61        0.39   0.35        48
Visitall     0.21   0.15        40        0.17   0.15        36
Zenotravel   2.07   2.04        71        1.01   1.00        55

Table 2: Total planning time, preprocessing time and plan length for learning tasks from labeled plans without/with static predicates.

Learning from partially specified action models

We now evaluate the ability of our approach to support partially specified action models, that is, to address learning tasks of the kind Λ′′ = ⟨Ψ, Σ, Π, Ξ⟩. In this experiment, the model of half of the actions is given in Ξ as an extra input of the learning task.

Tables 3 and 4 summarize the obtained results, which include the identification of static predicates. We only report the precision and recall of the unknown actions, since the values of the metrics for the known action models are 1.0. In this experiment, a low value of precision or recall has a greater impact than in the corresponding Λ′ tasks because the evaluation is done only over half of the actions. This occurs, for instance, in the precondition recall of domains such as Floortile, Gripper or Satellite.

             Pre        Add        Del        Avg
             P    R     P    R     P    R     P    R
Blocks       1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0
Driverlog    1.0  0.71  1.0  1.0   1.0  1.0   1.0  0.90
Ferry        1.0  0.67  1.0  1.0   1.0  1.0   1.0  0.89
Floortile    0.75 0.60  1.0  0.80  1.0  0.80  0.92 0.73
Grid         1.0  0.67  1.0  1.0   1.0  1.0   0.84 0.78
Gripper      1.0  0.50  1.0  1.0   1.0  1.0   1.0  0.83
Miconic      1.0  1.0   1.0  1.0   1.0  1.0   1.0  1.0
Satellite    1.0  0.57  1.0  1.0   1.0  1.0   1.0  0.86
Transport    1.0  0.75  1.0  1.0   1.0  1.0   1.0  0.92
Zenotravel   1.0  0.67  1.0  1.0   1.0  0.67  1.0  0.78
Average      0.98 0.71  1.0  0.98  1.0  0.95  0.98 0.87

Table 3: Precision and recall scores for learning tasks with partially specified action models.

             Total  Preprocess  Length
Blocks       0.07   0.01        54
Driverlog    0.03   0.01        40
Ferry        0.06   0.03        45
Floortile    0.43   0.42        55
Grid         3.12   3.07        53
Gripper      0.03   0.01        35
Miconic      0.03   0.01        34
Satellite    0.14   0.14        47
Transport    0.23   0.21        37
Zenotravel   0.90   0.89        40

Table 4: Time and plan length for learning tasks with partially specified action models.

Remarkably, the overall precision is now 0.98, which means that the contents of the learned models are highly reliable. The value of recall, 0.87, is an indication that the learned models still miss some information (preconditions are again the most difficult component to learn fully). Overall, the results confirm the previous trend: the more input knowledge in the task, the better the models and the shorter the planning time. Additionally, the solution plans required for this task are shorter because it is only necessary to program half of the actions (the other half are included in the input knowledge Ξ). Visitall and Hanoi are excluded from this evaluation because they only contain one action schema.

Learning from (initial, final) state pairs

Finally, we evaluate our approach when input plans are not available, so the planner must compute not only the action models but also the plans that satisfy the input labels. Tables 5 and 6 summarize the results obtained for the task Λ = ⟨Ψ, Σ, Ξ⟩ using static predicates. Values for the Zenotravel and Grid domains are not reported because MADAGASCAR was not able to solve the corresponding planning tasks within a 1000-second time bound.

             Pre        Add        Del        Avg
             P    R     P    R     P    R     P    R
Blocks       0.33 0.33  0.75 0.50  0.33 0.33  0.47 0.39
Driverlog    1.0  0.29  0.33 0.67  1.0  0.50  0.78 0.48
Ferry        1.0  0.67  0.50 1.0   1.0  1.0   0.83 0.89
Floortile    0.67 0.40  0.50 0.40  1.0  0.40  0.72 0.40
Grid         -    -     -    -     -    -     -    -
Gripper      1.0  0.50  1.0  1.0   1.0  1.0   1.0  0.83
Miconic      0.0  0.0   0.33 0.50  0.0  0.0   0.11 0.17
Satellite    1.0  0.14  0.67 1.0   1.0  1.0   0.89 0.71
Transport    0.0  0.0   0.25 0.5   0.0  0.0   0.08 0.17
Zenotravel   -    -     -    -     -    -     -    -
Average      0.63 0.29  0.54 0.70  0.67 0.53  0.61 0.51

Table 5: Precision and recall scores for learning from (initial, final) state pairs.

             Total  Preprocess  Length
Blocks       2.14   0.00        58
Driverlog    0.09   0.00        88
Ferry        0.17   0.01        65
Floortile    6.42   0.15        126
Grid         -      -           -
Gripper      0.03   0.00        47
Miconic      0.04   0.00        68
Satellite    4.34   0.10        126
Transport    2.57   0.21        47
Zenotravel   -      -           -

Table 6: Time and plan length when learning from (initial, final) state pairs.

The values of precision and recall are significantly lower than in Table 1. Given that the learning hypothesis space is now fairly under-constrained, actions can be reformulated and still be compliant with the inputs (e.g.,
the blocksworld operator stack can be learned with the preconditions and effects of the unstack operator, and vice versa). We tried to minimize this effect with the additional input knowledge (static predicates and partially specified action models), and yet the results are below the scores obtained when learning from labeled plans.

Related work

Action model learning has also been studied in domains where there is partial or missing state observability. ARMS works when no partial intermediate state is given. It defines a set of weighted constraints that must hold for the plans to be correct, and solves the weighted propositional satisfiability problem with a MAX-SAT solver (Yang, Wu, and Jiang 2007). In order to efficiently solve the large MAX-SAT representations, ARMS implements a hill-climbing method that models the actions approximately. In contrast to our validation, which aims at covering all the training examples, ARMS maximizes the number of covered examples from a testing set.

SLAF also deals with partial observability (Amir and Chang 2008). Given a formula representing the initial belief state, a sequence of executed actions and the corresponding partially observed states, it builds a complete explanation of the observations by models of actions through a CNF formula. The learning algorithm updates the formula of the belief state with every action and observation in the sequence, and thus the final returned formula includes all consistent models. SLAF assesses the quality of the learned models with respect to the actual generative model.

LOCM only requires the example plans as input, without the need to provide information about predicates or states (Cresswell, McCluskey, and West 2013; Cresswell and Gregory 2011). The lack of available information is overcome by exploiting assumptions about the kind of domain model it has to generate. Particularly, it assumes a domain consists of a collection of objects (sorts) whose defined set of states can be captured by a parameterized Finite State Machine. LOP (LOCM with Optimized Plans (Gregory and Cresswell 2015)) incorporates static predicates and applies a post-processing step after the LOCM analysis that requires a set of optimal plans to be used in the learning phase. This is done to mitigate the limitation of LOCM of inducing similar models for domains with similar structures. LOP compares the learned models with the corresponding reference model.

Compiling an action model learning task into classical planning is a general and flexible approach that allows us to accommodate various amounts and kinds of input knowledge, and it opens up a path for addressing further learning and validation tasks. For instance, the example plans in Π can be replaced or complemented by a set O of sequences of observations (i.e., fully or partially observed states with noisy or missing fluents (Sohrabi, Riabov, and Udrea 2016)), and learning tasks of the kind Λ = ⟨Ψ, Σ, O, Ξ⟩ would also be attainable.
Furthermore, our approach seems extensible to learning other types of generative models (e.g. hierarchical models such as HTNs or behaviour trees) that can be more appealing than STRIPS models since they require less search effort to compute a planning solution.

Conclusions

We presented a novel approach for learning STRIPS action models from examples using classical planning. The approach is flexible to various amounts of input knowledge and accepts partially specified action models. We also introduced the precision and recall metrics, widely used in ML, for evaluating the learned action models. These two metrics measure the soundness and completeness of the learned models and facilitate the identification of model flaws.

To the best of our knowledge, this is the first work on learning action models that is exhaustively evaluated over a wide range of domains and uses exclusively an off-the-shelf classical planner. The work in (Stern and Juba 2017) proposes a planning compilation for learning action models from plan traces following the finite-domain representation for the state variables. This is a theoretical study on the boundaries of the learned models and no experimental results are reported.

When example plans are available, we can compute accurate action models from small sets of learning examples (five examples per domain) in little computation time (less than a second). When action plans are not available, our approach still produces action models that are compliant with the input information.
In this case, since learning is not constrained by actions, operators can be reformulated, changing their semantics, in which case the comparison with a reference model turns out to be tricky.

Generating informative examples for learning planning action models is still an open issue. Planning actions include preconditions that are only satisfied by specific sequences of actions, which have a low probability of being chosen by chance (Fern, Yoon, and Givan 2004). The success of recent algorithms for exploring planning tasks (Francès et al. 2017) motivates the development of novel techniques for autonomously collecting informative learning examples. The combination of such exploration techniques with our learning approach is an appealing research direction that opens up the door to the bootstrapping of planning action models.

In many applications, the actual actions executed by the observed agent are not available but, instead, the resulting states can be observed. We plan to extend our approach to learning from state observations, as it broadens the range of application to external observers and facilitates the representation of imperfect observability, as shown in plan recognition (Sohrabi, Riabov, and Udrea 2016), as well as learning from unstructured data, like state images (Asai and Fukunaga 2018).

Acknowledgments

This work is supported by the Spanish MINECO project TIN2017-88476-C2-1-R. Diego Aineto is partially supported by the FPU16/03184 and Sergio Jiménez by the RYC15/18009, both programs funded by the Spanish government.

References

[Amir and Chang 2008] Amir, E., and Chang, A. 2008. Learning partially observable deterministic action models. Journal of Artificial Intelligence Research 33:349–402.

[Asai and Fukunaga 2018] Asai, M., and Fukunaga, A. 2018. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. In National Conference on Artificial Intelligence, AAAI-18.

[Bonet, Palacios, and Geffner 2009] Bonet, B.; Palacios, H.; and Geffner, H. 2009. Automatic derivation of memoryless policies and finite-state controllers using classical planners. In International Conference on Automated Planning and Scheduling, ICAPS-09.

[Cresswell and Gregory 2011] Cresswell, S., and Gregory, P. 2011. Generalised domain model acquisition from action traces. In International Conference on Automated Planning and Scheduling, ICAPS-11.

[Cresswell, McCluskey, and West 2013] Cresswell, S. N.; McCluskey, T. L.; and West, M. M. 2013. Acquiring planning domain models using LOCM. The Knowledge Engineering Review 28(2):195–213.

[Davis and Goadrich 2006] Davis, J., and Goadrich, M. 2006. The relationship between precision-recall and ROC curves. In International Conference on Machine Learning, 233–240. ACM.

[Fern, Yoon, and Givan 2004] Fern, A.; Yoon, S. W.; and Givan, R. 2004. Learning domain-specific control knowledge from random walks. In International Conference on Automated Planning and Scheduling, ICAPS-04, 191–199.

[Fikes and Nilsson 1971] Fikes, R. E., and Nilsson, N. J. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2(3-4):189–208.

[Fox and Long 1998] Fox, M., and Long, D. 1998. The automatic inference of state invariants in TIM. Journal of Artificial Intelligence Research 9:367–421.

[Fox and Long 2003] Fox, M., and Long, D. 2003. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20:61–124.

[Francès et al. 2017] Francès, G.; Ramírez, M.; Lipovetzky, N.; and Geffner, H. 2017. Purely declarative action descriptions are overrated: Classical planning with simulators. In International Joint Conference on Artificial Intelligence, IJCAI-17, 4294–4301. AAAI Press.

[Geffner and Bonet 2013] Geffner, H., and Bonet, B. 2013. A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool Publishers.

[Ghallab, Nau, and Traverso 2004] Ghallab, M.; Nau, D.; and Traverso, P. 2004. Automated Planning: Theory and Practice. Elsevier.

[Gregory and Cresswell 2015] Gregory, P., and Cresswell, S. 2015. Domain model acquisition in the presence of static relations in the LOP system. In International Conference on Automated Planning and Scheduling, ICAPS-15, 97–105.

[Howey, Long, and Fox 2004] Howey, R.; Long, D.; and Fox, M. 2004. VAL: Automatic plan validation, continuous effects and mixed initiative planning using PDDL. In International Conference on Tools with Artificial Intelligence, ICTAI-04, 294–301. IEEE.

[Jiménez et al. 2012] Jiménez, S.; De La Rosa, T.; Fernández, S.; Fernández, F.; and Borrajo, D. 2012. A review of machine learning for automated planning. The Knowledge Engineering Review 27(4):433–467.

[Kambhampati 2007] Kambhampati, S. 2007. Model-lite planning for the web age masses. In National Conference on Artificial Intelligence, AAAI-07, 1601–1605.

[McDermott et al. 1998] McDermott, D.; Ghallab, M.; Howe, A.; Knoblock, C.; Ram, A.; Veloso, M.; Weld, D.; and Wilkins, D. 1998. PDDL – The Planning Domain Definition Language.

[Michalski, Carbonell, and Mitchell 2013] Michalski, R. S.; Carbonell, J. G.; and Mitchell, T. M. 2013. Machine Learning: An Artificial Intelligence Approach. Springer Science & Business Media.

[Muise 2016] Muise, C. 2016. Planning.domains. http://bibbase.org/network/publication/muise-planningdomains/.

[Ramírez 2012] Ramírez, M. 2012. Plan Recognition as Planning. Ph.D. Dissertation, Universitat Pompeu Fabra.

[Rintanen 2014] Rintanen, J. 2014. Madagascar: Scalable planning with SAT. In Proceedings of the 8th International Planning Competition, IPC-2014.

[Segovia-Aguas, Jiménez, and Jonsson 2016] Segovia-Aguas, J.; Jiménez, S.; and Jonsson, A. 2016. Hierarchical finite state controllers for generalized planning. In International Joint Conference on Artificial Intelligence, IJCAI-16, 3235–3241. AAAI Press.

[Segovia-Aguas, Jiménez, and Jonsson 2017] Segovia-Aguas, J.; Jiménez, S.; and Jonsson, A. 2017. Generating context-free grammars using classical planning. In International Joint Conference on Artificial Intelligence, IJCAI-17, 4391–4397.

[Slaney and Thiébaux 2001] Slaney, J., and Thiébaux, S. 2001. Blocks World revisited. Artificial Intelligence 125(1-2):119–153.

[Sohrabi, Riabov, and Udrea 2016] Sohrabi, S.; Riabov, A. V.; and Udrea, O. 2016. Plan recognition as planning revisited. In International Joint Conference on Artificial Intelligence, IJCAI-16, 3258–3264. AAAI Press.

[Stern and Juba 2017] Stern, R., and Juba, B. 2017. Efficient, safe, and probably approximately complete learning of action models. In International Joint Conference on Artificial Intelligence, IJCAI-17, 4405–4411. AAAI Press.

[Vallati et al. 2015] Vallati, M.; Chrpa, L.; Grzes, M.; McCluskey, T. L.; Roberts, M.; and Sanner, S. 2015. The 2014 International Planning Competition: Progress and trends. AI Magazine 36(3):90–98.

[Yang, Wu, and Jiang 2007] Yang, Q.; Wu, K.; and Jiang, Y. 2007. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence 171(2-3):107–143.