Learning General Policies from Small Examples Without Supervision
Guillem Francès, Blai Bonet, Hector Geffner
Universitat Pompeu Fabra, Barcelona, Spain; ICREA & Universitat Pompeu Fabra, Barcelona, Spain
[email protected], [email protected], [email protected]
Abstract
Generalized planning is concerned with the computation of general policies that solve multiple instances of a planning domain all at once. It has been recently shown that these policies can be computed in two steps: first, a suitable abstraction in the form of a qualitative numerical planning problem (QNP) is learned from sample plans, then the general policies are obtained from the learned QNP using a planner. In this work, we introduce an alternative approach for computing more expressive general policies which does not require sample plans or a QNP planner. The new formulation is very simple and can be cast in terms that are more standard in machine learning: a large but finite pool of features is defined from the predicates in the planning examples using a general grammar, and a small subset of features is sought for separating "good" from "bad" state transitions, and goals from non-goals. The problems of finding such a "separating surface" while labeling the transitions as "good" or "bad" are jointly addressed as a single combinatorial optimization problem expressed as a Weighted Max-SAT problem. The advantage of looking for the simplest policy in the given feature space that solves the given examples, possibly non-optimally, is that many domains have no general, compact policies that are optimal. The approach yields general policies for a number of benchmark domains.
Introduction
Generalized planning is concerned with the computation of general policies or plans that solve multiple instances of a given planning domain all at once (Srivastava, Immerman, and Zilberstein 2008; Bonet, Palacios, and Geffner 2009; Hu and De Giacomo 2011; Belle and Levesque 2016; Segovia, Jiménez, and Jonsson 2016). For example, a general plan for clearing a block x in any instance of Blocksworld involves a loop where the topmost block above x is picked up and placed on the table until no such block remains. A general plan for solving any Blocksworld instance is also possible, like one where misplaced blocks and those above them are moved to the table, and then to their targets in order. The key question in generalized planning is how to represent and compute such general plans from the domain representation.

∗ This paper extends (Francès, Bonet, and Geffner 2021a) with an appendix providing proofs and further details on the methodology and the empirical results.

In one of the most general formulations, general policies are obtained from an abstract planning model expressed as a qualitative numerical planning problem or QNP (Srivastava et al. 2011). A QNP is a standard STRIPS planning model extended with non-negative numerical variables that can be decreased or increased "qualitatively", i.e., by uncertain positive amounts, short of making the variables negative. Unlike standard planning with numerical variables (Helmert 2002), QNP planning is decidable, and QNPs can be compiled in polynomial time into fully observable non-deterministic (FOND) problems (Bonet and Geffner 2020).

The main advantage of the formulation of generalized planning based on QNPs is that it applies to standard relational domains where the pool of (ground) actions changes from instance to instance. On the other hand, while the planning domain is assumed to be given, the QNP abstraction is not, and hence it has to be written by hand or learned. This is the approach of Bonet, Francès, and Geffner (2019), where generalized plans are obtained by learning the QNP abstraction from the domain representation and sample plans, and then solving the abstraction with a QNP planner.

In this work, we build on this thread but introduce an alternative approach for computing general policies that is simpler, yet more powerful. The learning problem is cast as a self-supervised classification problem where (1) a pool of features is automatically generated from a general grammar applied to the domain predicates, and (2) a small subset of features is sought for separating "good" from "bad" state transitions, and goals from non-goals. The problems of finding the "separating surface" while labeling the transitions as "good" or "bad" are addressed jointly as a single combinatorial optimization task solved with a Weighted Max-SAT solver. The approach yields general policies for a number of benchmark domains.

The paper is organized as follows. We first review related work and classical planning, and introduce a new language for expressing general policies motivated by the work on QNPs. We then present the learning task, the computational approach for solving it, and the experimental results.

Related Work
The computation of general plans from domain encodings and sample plans has been addressed in a number of works (Khardon 1999; Martín and Geffner 2004; Fern, Yoon, and Givan 2006; Silver et al. 2020). Generalized planning has also been formulated as a problem in first-order logic (Srivastava, Immerman, and Zilberstein 2011; Illanes and McIlraith 2019), and general plans over finite horizons have been derived using first-order regression (Boutilier, Reiter, and Price 2001; Wang, Joshi, and Khardon 2008; van Otterlo 2012; Sanner and Boutilier 2009). More recently, general policies for planning have been learned from PDDL domains and sample plans using deep learning (Toyer et al. 2018; Bueno et al. 2019; Garg, Bajpai, and Mausam 2020). Deep reinforcement learning methods (Mnih et al. 2015) have also been used to generate general policies from images without assuming prior symbolic knowledge (Groshev et al. 2018; Chevalier-Boisvert et al. 2019), in certain cases accounting for objects and relations through the use of suitable architectures (Garnelo and Shanahan 2019). Our work is closest to the works of Bonet, Francès, and Geffner (2019) and Francès et al. (2019). The first provides a model-based approach to generalized planning where an abstract QNP model is learned from the domain representation and sample instances and plans, which is then solved by a QNP planner (Bonet and Geffner 2020). The second learns a generalized value function in an unsupervised manner, under the assumption that this function is linear. Model-based approaches have an advantage over inductive approaches that learn generalized plans; like logical approaches, they guarantee that the resulting policies (conclusions) are correct provided that the model (set of premises) is correct. The approach developed in this work does not make use of QNPs or planners but inherits these formal properties.
Planning
A (classical) planning instance is a pair P = ⟨D, I⟩ where D is a first-order planning domain and I is an instance. The domain D contains a set of predicate symbols and a set of action schemas with preconditions and effects given by atoms p(x_1, ..., x_k), where p is a k-ary predicate symbol, and each x_i is a variable representing one of the arguments of the action schema. The instance is a tuple I = ⟨O, Init, Goal⟩, where O is a (finite) set of object names c_i, and Init and
Goal are sets of ground atoms p(c_1, ..., c_k), where p is a k-ary predicate symbol. This is indeed the structure of planning problems as expressed in PDDL (Haslum et al. 2019).

The states associated with a problem P are the possible sets of ground atoms, and the state graph G(P) associated with P has the states of P as nodes, an initial state s_0 that corresponds to the set of atoms in Init, and a set of goal states S_G given by all states that include the atoms in Goal. In addition, the graph has a directed edge (s, s′) for each state transition that is possible in P, i.e., where there is a ground action a whose preconditions hold in s and whose effects transform s into s′. A state trajectory s_0, ..., s_n is possible in P if every transition (s_i, s_{i+1}) is possible in P, and it is goal-reaching if s_n is a goal state. An action sequence a_0, ..., a_{n−1} that gives rise to a goal-reaching trajectory, i.e., where transition (s_i, s_{i+1}) is enabled by ground action a_i, is called a plan or solution for P.

Generalized Planning
A key question in generalized planning is how to represent general plans or policies when the different instances to be solved have different sets of objects and ground actions. One solution is to work with general features (functions) that have well-defined values over any state of any possible domain instance, and think of general policies π as mappings from feature valuations into abstract actions that denote changes in the feature values (Bonet and Geffner 2018). In this work, we build on this intuition but avoid the introduction of abstract actions (Bonet and Geffner 2021).

Policy Language and Semantics
The features considered are boolean and numerical. The first are denoted by letters like p, and their (true or false) value in a state s is denoted as p(s). Numerical features n take non-negative integer values, and their value in a state is denoted as n(s). The complete set of features is denoted as Φ; a joint valuation over all the features in Φ in a state s is denoted as φ(s), while an arbitrary valuation is denoted as φ. The expression ⟦φ⟧ denotes the boolean counterpart of φ; i.e., ⟦φ⟧ gives a truth value to all the atoms p(s) and n(s) = 0 for features p and n in Φ, without providing the exact value of the numerical features n when n(s) ≠ 0. The number of possible boolean feature valuations ⟦φ⟧ is 2^|Φ|, which is a fixed number, as the set of features Φ does not change across instances.

The possible effects E on the features in Φ are p and ¬p for boolean features p in Φ, and n↓ and n↑ for numerical features n in Φ. If Φ = {p, q, n, m, r} and E = {p, ¬q, n↑, m↓}, the meaning of the effects in E is that p must become true, q must become false, n must increase its value, and m must decrease it. The features in Φ that are not mentioned in E, like r, keep their values. A set of effects E can thus be thought of as a set of constraints on possible state transitions:

Definition 1.
Let Φ be a set of features over a domain D, let (s, s′) be a state transition over an instance P of D, and let E be a set of effects over the features in Φ. Then the transition (s, s′) is compatible with (or satisfies) E when: 1) if p (resp. ¬p) is in E, then p(s′) = true (resp. p(s′) = false); 2) if n↓ (resp. n↑) is in E, then n(s) > n(s′) (resp. n(s) < n(s′)); and 3) if p and n are not mentioned in E, then p(s) = p(s′) and n(s) = n(s′), respectively.

The form of the general policies considered in this work can then be defined as follows:
Definition 2. A general policy π_Φ is given by a set of rules C ↦ E where C is a set (conjunction) of p- and n-literals for p and n in Φ, and E is an effect expression.

The p- and n-literals are p, ¬p, n = 0, and ¬(n = 0), the last abbreviated as n > 0. For a reachable state s, the policy π_Φ is a filter on the state transitions (s, s′) in P:

Definition 3. A general policy π_Φ denotes a mapping from state transitions (s, s′) over instances P ∈ Q into boolean values. A transition (s, s′) is compatible with π_Φ if for some policy rule C ↦ E, C is true in φ(s) and (s, s′) satisfies E.

As an illustration of these definitions, we consider a policy for achieving the goals clear(x) and an empty gripper in any Blocksworld instance with a block x.

Example.
Consider the policy π_Φ given by the following two rules for features Φ = {H, n}, where H is true if a block is being held, and n tracks the number of blocks above x:

{¬H, n > 0} ↦ {H, n↓};  {H, n > 0} ↦ {¬H}.  (1)

The first rule says that when the gripper is empty and there are blocks above x, then any action that decreases n and makes H true should be selected. The second one says that when the gripper is not empty and there are blocks above x, any action that makes H false and does not affect the count n should be selected (this rules out placing the block being held above x, as this would increase n).

The conditions under which a general policy solves a class of problems are the following:

Definition 4.
A state trajectory s_0, ..., s_n is compatible with policy π_Φ in an instance P if s_0 is the initial state of P and each pair (s_i, s_{i+1}) is a possible state transition in P compatible with π_Φ. The trajectory is maximal if s_n is a goal state, there are no state transitions (s_n, s) in P compatible with π_Φ, or the trajectory is infinite and does not include a goal state.

Definition 5.
A general policy π_Φ solves a class Q of instances over domain D if in each instance P ∈ Q, all maximal state trajectories compatible with π_Φ reach a goal state.

The policy expressed by the rules in (1) can be shown to solve the class Q_clear of all Blocksworld instances.
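To make the semantics concrete, the following Python sketch (illustrative code, not the D2L implementation) implements the compatibility tests of Definitions 1-3 and encodes the policy (1) for Q_clear. Feature valuations φ(s) are represented as dictionaries mapping feature names to booleans or non-negative integers:

# Minimal sketch of the policy semantics in Definitions 1-3.
def satisfies_condition(cond, phi_s):
    """C is true in phi(s): cond maps f to True for p / n>0, False for ¬p / n=0."""
    return all(bool(phi_s[f]) == v for f, v in cond.items())

def satisfies_effects(effects, phi_s, phi_t, features):
    """Definition 1: transition (s, s') is compatible with the effect set E."""
    for f in features:
        if f in effects:
            e = effects[f]
            if e is True and phi_t[f] is not True:
                return False
            if e is False and phi_t[f] is not False:
                return False
            if e == 'dec' and not phi_t[f] < phi_s[f]:
                return False
            if e == 'inc' and not phi_t[f] > phi_s[f]:
                return False
        elif phi_t[f] != phi_s[f]:  # features not mentioned in E keep value
            return False
    return True

def compatible(policy, phi_s, phi_t, features):
    """Definition 3: some rule C -> E has C true in phi(s) and (s, s') |= E."""
    return any(satisfies_condition(c, phi_s) and
               satisfies_effects(e, phi_s, phi_t, features)
               for c, e in policy)

# The policy (1) for Q_clear: H boolean, n numerical.
features = ['H', 'n']
policy = [({'H': False, 'n': True}, {'H': True, 'n': 'dec'}),  # {¬H,n>0} -> {H,n↓}
          ({'H': True, 'n': True}, {'H': False})]              # {H,n>0} -> {¬H}
assert compatible(policy, {'H': False, 'n': 3}, {'H': True, 'n': 2}, features)
assert not compatible(policy, {'H': True, 'n': 2}, {'H': False, 'n': 1}, features)

The second assertion fails to be compatible because the second rule does not mention n, so n must keep its value across the transition.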
Non-deterministic Policy Rules

The general policies π_Φ introduced above determine the actions a to be taken in a state s indirectly, as the actions a that result in state transitions (s, s′) that are compatible with a policy rule C ↦ E. If there is a single rule body C that is true in s, then for the transition (s, s′) to be compatible with π_Φ, (s, s′) must satisfy the effect E. Yet, it is possible that the bodies C_i of many rules C_i ↦ E_i are true in s, and then for (s, s′) to be compatible with π_Φ it suffices that (s, s′) satisfies one of the effects E_i.

For convenience, we abbreviate sets of rules C_i ↦ E_i, i = 1, ..., m, that have the same body C_i = C, as C ↦ E_1 | ··· | E_m, and refer to the latter as a non-deterministic rule. The non-determinism is on the effects on the features: one effect E_i may increment a feature n, and another effect E_j may decrease it, or leave it unchanged (if n is not mentioned in E_j). Policies π_Φ where all pairs of rules C ↦ E and C′ ↦ E′ have bodies C and C′ that are jointly inconsistent are said to be deterministic. Previous formulations that cast general policies as mappings from feature conditions into abstract (QNP) actions yield policies that are deterministic in this way (Bonet and Geffner 2018; Bonet, Francès, and Geffner 2019). Non-deterministic policies, however, are strictly more expressive.

Example.
Consider a domain Delivery where a truck has to pick up m packages spread on a grid, while taking them, one by one, to a single target cell t. If we consider the collection of instances with one package only, call them Delivery-1, a general policy π_Φ for them can be expressed using the set of features Φ = {n_p, n_t, C, D}, where n_p represents the distance from the agent to the package (0 when in the same cell or when holding the package), n_t represents the distance from the agent to the target cell, and C and D represent that the package is carried and delivered, respectively. One may be tempted to write the policy π_Φ by means of the four deterministic rules:

r_1: {¬C, n_p > 0} ↦ {n_p↓};   r_2: {¬C, n_p = 0} ↦ {C};
r_3: {C, n_t > 0} ↦ {n_t↓};   r_4: {C, n_t = 0} ↦ {¬C, D}.

The rules say "if away from the package, get closer", "if you don't have the package but are in the same cell, pick it up", "if carrying the package and away from the target, get closer to the target", and "if carrying the package in the target cell, drop the package". This policy, however, does not solve Delivery-1. The reason is that transitions (s, s′) where the agent gets closer to the package satisfy the conditions ¬C and n_p > 0 of rule r_1 but may fail to satisfy its head {n_p↓}. This is because the actions that decrease the distance n_p to the package may affect the distance n_t of the agent to the target, contradicting r_1, which says that n_t does not change. To solve Delivery-1 with the same features, rule r_1 must be changed to the non-deterministic rule:

r′_1: {¬C, n_p > 0} ↦ {n_p↓, n_t↓} | {n_p↓, n_t↑} | {n_p↓},

which says indeed that "when away from the package, move closer to the package, for any possible effect on the distance n_t to the target, which may decrease, increase, or stay the same." We often abbreviate rules like r′_1 as {¬C, n_p > 0} ↦ {n_p↓, n_t?}, where n_t? expresses "any effect on n_t."

Learning General Policies: Formulation
We turn now to the key challenge: learning the features Φ and general policies π_Φ from samples P_1, ..., P_k of a target class of problems Q, given the domain D. The learning task is formulated as follows. From the predicates used in D and a fixed grammar, we generate a large pool F of boolean and numerical features f, as in (Bonet, Francès, and Geffner 2019), each of which is associated with a measure w(f) of syntactic complexity. We then search for the simplest set of features Φ ⊆ F such that a policy π_Φ defined on Φ solves all sample instances P_1, ..., P_k. This task is formulated as a Weighted Max-SAT problem over a suitable propositional theory T, with the score Σ_{f∈Φ} w(f) to minimize.

This learning scheme is unsupervised, as the sample instances do not come with their plans. Since the sample instances are assumed to be sufficiently small (small state spaces), this is not a crucial issue, and by letting the learning algorithm choose which plans to generalize, the resulting approach becomes more flexible. In particular, if we ask for the policy π_Φ to generalize given plans as in (Bonet, Francès, and Geffner 2019), it may well happen that there are policies in the feature space but none of them generalizes the plans provided by the teacher.

We next describe the propositional theory T, assuming that the feature pool F and the feature weights w(f) are given, and then explain how they are generated. Our SAT formulation is different from that of (Bonet, Francès, and Geffner 2019), as it is aimed at capturing a more expressive class of policies without requiring QNP planners.

Learning the General Policy as Weighted Max-SAT
The propositional theory T = T(S, F) that captures our learning task takes as inputs the pool of features F and the state space S made up of the (reachable) states s, the possible state transitions (s, s′), and the sets of (reachable) goal states in each of the sample problem instances P_1, ..., P_k. The handling of dead-end states is explained below. States arising from the different instances are assumed to be different even if they express the same set of ground atoms. The propositional variables in T are:

• Select(f): feature f from pool F makes it into Φ,
• Good(s, s′): transition (s, s′) is compatible with π_Φ,
• V(s, d): numerical label V(s) = d, for V*(s) ≤ d ≤ δV*(s).

The true atoms Select(f) in the satisfying assignment define the features f ∈ Φ, while the true atoms Good(s, s′), along with the selected features, define the policy π_Φ. More precisely, there is a rule C ↦ E_1 | ··· | E_m in the policy iff for each effect E_i, there is a true atom Good(s, s_i) for which C = ⟦φ(s)⟧, and E_i captures the way in which the selected features change across the transition (s, s_i). The formulas in the theory use numerical labels V(s) = d, for V*(s) ≤ d ≤ δV*(s), where V*(s) is the minimum distance from s to a goal, and δ ≥ 1 is a slack parameter that controls the degree of suboptimality that we allow. All experiments in this paper use δ = 2. These values are used to ensure that the policy determined by the Good(s, s′) atoms solves all instances P_i as well as all instances P_i[s] that are like P_i but with s as the initial state, where s is a state reachable in P_i and is not a dead-end. We call the P_i[s] problems variants of P_i. Dead-ends are states from which the goal cannot be reached, and they are labeled as such in S.

The formulas are the following. States s and t, and transitions (s, s′) and (t, t′), range over those in S, excluding transitions where the first state of the transition is a dead-end or a goal. ∆f(s, s′) expresses how feature f changes across transition (s, s′): for boolean features, ∆f(s, s′) ∈ {↑, ↓, ⊥}, meaning that f changes from false to true, from true to false, or stays the same; for numerical features, ∆f(s, s′) ∈ {↑, ↓, ⊥}, meaning that f increases, decreases, or stays the same. The formulas in T = T(S, F) are:

1. Policy: ∨_{(s,s′)} Good(s, s′), for each non-goal state s,
2. V bounds: Exactly-1 {V(s, d) : V*(s) ≤ d ≤ δV*(s)},
3. V descent: Good(s, s′) ∧ V(s, d) → ∨_{d′ < d} V(s′, d′),
4. Goal separation: ∨_{f : ⟦f(s)⟧ ≠ ⟦f(s′)⟧} Select(f), when exactly one of {s, s′} is a goal. (These formulas imply that V(s, 0) holds iff s is a goal state.)
5. Bad transitions: ¬Good(s, s′) for s solvable and s′ a dead-end,
6. D2-separation: Good(s, s′) ∧ ¬Good(t, t′) → D2(s, s′; t, t′), where D2(s, s′; t, t′) is ∨_{∆f(s,s′) ≠ ∆f(t,t′)} Select(f).

The first formula asks for a good transition from any non-goal state s. The good transitions are the transitions that will be compatible with the policy. The second and third formulas ensure that these good transitions lead to a goal state, and furthermore, that they can capture any non-deterministic policy that does so. The fourth formula is about separating goal from non-goal states, and the fifth is about excluding transitions into dead-ends. Finally, the D2-separation formula says that if (s, s′) is a "good" transition (i.e., compatible with the resulting policy π_Φ), then any other transition (t, t′) in S where the selected features change exactly as in (s, s′) must be "good" as well. ∆f(s, s′) above captures how feature f changes across the transition (s, s′), and the selected features f change in the same way in (s, s′) and (t, t′) when ∆f(s, s′) = ∆f(t, t′).
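For concreteness, the following sketch shows how formulas (1)-(6) can be assembled into a Weighted Max-SAT instance. D2L itself uses Open-WBO; the sketch below instead uses the python-sat package, and its inputs (value ranges, separating feature sets, and so on) are simplified assumptions precomputed from the sample:

from itertools import count
from pysat.formula import WCNF

def encode_theory(features, weight, nongoal, succs, transitions, vrange,
                  goal_sep, dead_trans, d2_pairs, changed):
    """Sketch of T(S, F). vrange[s] lists V*(s), ..., delta*V*(s);
    goal_sep holds, per goal/non-goal pair, the features whose boolean
    value differs; changed(tr1, tr2) lists features whose Delta differs."""
    wcnf, new = WCNF(), count(1)
    sel = {f: next(new) for f in features}                       # Select(f)
    good = {tr: next(new) for tr in transitions}                 # Good(s, s')
    V = {(s, d): next(new) for s in vrange for d in vrange[s]}   # V(s, d)

    for f in features:                            # soft clauses: minimize
        wcnf.append([-sel[f]], weight=weight[f])  # the total feature cost

    for s in nongoal:                             # (1) some good transition
        wcnf.append([good[(s, t)] for t in succs[s]])

    for s in vrange:                              # (2) exactly one V(s, d)
        lits = [V[(s, d)] for d in vrange[s]]
        wcnf.append(lits)
        for i, a in enumerate(lits):
            for b in lits[i + 1:]:
                wcnf.append([-a, -b])

    for (s, t) in transitions:                    # (3) V decreases along
        for d in vrange[s]:                       #     good transitions
            wcnf.append([-good[(s, t)], -V[(s, d)]] +
                        [V[(t, d2)] for d2 in vrange[t] if d2 < d])

    for fs in goal_sep:                           # (4) goal separation
        wcnf.append([sel[f] for f in fs])

    for tr in dead_trans:                         # (5) no transitions
        wcnf.append([-good[tr]])                  #     into dead-ends

    for (tr1, tr2) in d2_pairs:                   # (6) D2-separation
        wcnf.append([-good[tr1], good[tr2]] +
                    [sel[f] for f in changed(tr1, tr2)])
    return wcnf, sel, good

A minimum-cost satisfying assignment can then be computed with any Max-SAT solver, e.g., pysat's RC2 engine.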
The propositional encoding is sound and complete in the following sense:

Theorem 6. Let S be the state space associated with a set P_1, ..., P_k of sample instances of a class of problems Q over a domain D, and let F be a pool of features. The theory T(S, F) is satisfiable iff there is a general policy π_Φ over features Φ ⊆ F that discriminates goals from non-goals and solves P_1, ..., P_k and their variants.

For the purpose of generalization outside of the sample instances, instead of looking for any satisfying assignment of the theory T(S, F), we look for the satisfying assignments that minimize the complexity of the resulting policy, as measured by the sum of the costs w(f) of the clauses Select(f) that are true, where w(f) is the complexity of feature f ∈ F.

We sketched above how a general policy π_Φ is extracted from a satisfying assignment. The only thing missing is the precise meaning of the phrase "E_i captures the way in which the selected features change in the transition from s to s_i". For this, we look at the value of the expression ∆f(s, s_i), computed at preprocessing, and place f (resp. ¬f) in E_i if f is boolean and ∆f(s, s_i) is ↑ (resp. ↓), and place f↑ (resp. f↓) in E_i if f is numerical and ∆f(s, s_i) is ↑ (resp. ↓). Duplicate effects E_i and E_j in a policy rule are merged.
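A sketch of this extraction step (illustrative names; boolean values are cast to 0/1 so that ↑ on a boolean feature means it becomes true):

def delta(v_s, v_t):
    """Delta_f(s, s'): 'inc', 'dec', or None if the value is unchanged."""
    v_s, v_t = int(v_s), int(v_t)
    if v_t > v_s:
        return 'inc'
    if v_t < v_s:
        return 'dec'
    return None

def extract_policy(selected, good_transitions, phi):
    """One rule per distinct boolean valuation [phi(s)] over the selected
    features; one effect set per true Good(s, s_i) atom; duplicates merged."""
    rules = {}
    for (s, t) in good_transitions:
        cond = tuple((f, bool(phi[s][f])) for f in selected)    # C = [phi(s)]
        effect = frozenset((f, delta(phi[s][f], phi[t][f]))
                           for f in selected
                           if delta(phi[s][f], phi[t][f]) is not None)
        rules.setdefault(cond, set()).add(effect)  # merge duplicate effects
    return rules   # maps each body C to its effect set {E_1, ..., E_m}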
The resulting policy delivers the properties of Theorem 6:

Theorem 7. The policy π_Φ and features Φ that are determined by a satisfying assignment of the theory T solve the sample problems P_1, ..., P_k and their variants.

Feature Pool
The feature pool F used in the theory T(S, F) is obtained following the method described by Bonet, Francès, and Geffner (2019), where the (primitive) domain predicates are combined through a standard description logics grammar (Baader et al. 2003) in order to build a larger set of (unary) concepts c and (binary) roles r. Concepts represent properties that the objects of any problem instance can fulfill in a state, such as the property of being a package that is in a truck on its target location in a standard logistics problem. For primitive predicates p mentioned in the goal, a "goal predicate" p_G is added that is evaluated not in the state but in the goal, following (Martín and Geffner 2004).

From these concepts and roles, we generate cardinality features |c|, which evaluate to the number of objects that satisfy concept c in a given state, and distance features Distance(c_1, r, c_2), which evaluate to the minimum number of r-steps between two objects that (respectively) satisfy c_1 and c_2. We refer the reader to the appendix for more detail. Both types of features are lower-bounded by 0 and upper-bounded by the total number of objects in the problem instance. Cardinality features that only take values in {0, 1} are made into boolean features. The complexity w(f) of feature f is given by the size of its syntax tree. The feature pool F used in the experiments below contains all features up to a certain complexity bound k_F.
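Evaluating the two feature types over a state reduces to a set cardinality and a breadth-first search over the role denotation. A minimal sketch, assuming denotations are precomputed as Python sets (of objects for concepts, of pairs for roles):

from collections import deque

def cardinality_feature(concept_denotation):
    """|c|: number of objects satisfying concept c in the state."""
    return len(concept_denotation)

def distance_feature(c1, role, c2, num_objects):
    """Distance(c1, r, c2): minimum number of r-steps from an object
    satisfying c1 to one satisfying c2; num_objects + 1 when no such
    path exists (see the appendix)."""
    queue = deque((x, 0) for x in c1)
    seen = set(c1)
    while queue:
        x, d = queue.popleft()
        if x in c2:
            return d
        for (a, b) in role:
            if a == x and b not in seen:
                seen.add(b)
                queue.append((b, d + 1))
    return num_objects + 1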
Experimental Results

We implemented the proposed approach in a C++/Python system called D2L and evaluated it on several problems. Source code and benchmarks are available online (https://github.com/rleap-project/d2l) and archived in Zenodo (Francès, Bonet, and Geffner 2021b). Our implementation uses the Open-WBO Weighted Max-SAT solver (Martins, Manquinho, and Lynce 2014). All experiments were run on an Intel i7-8700 CPU with a 16 GB memory limit.

The domains include all problems with simple goals from (Bonet, Francès, and Geffner 2019), e.g., clearing a block or stacking two blocks in Blocksworld, plus standard PDDL domains such as Gripper, Spanner, Miconic, Visitall and Blocksworld. In all the experiments, we use δ = 2 and k_F = 8, except in Delivery, where k_F = 9 is required to find a policy. We next describe two important optimizations.

Exploiting indistinguishability of constraints.
A fixed feature pool F induces an equivalence relation over the set of all transitions in the training sample that puts two transitions in the same equivalence class iff they cannot be distinguished by F. The theory T(S, F) above can be simplified by arbitrarily choosing one transition (s, s′) for each of these equivalence classes, then using a single SAT variable Good(s, s′) to denote the goodness of any transition in the class and to enforce the D2-separation clauses.
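A sketch of this reduction, reusing the delta helper from the extraction sketch above (two transitions fall in the same class iff their change signatures over the whole pool F coincide):

def transition_classes(transitions, pool, phi):
    """Group sample transitions by their Delta-signature over the pool F;
    one representative (and one Good variable) per class suffices."""
    classes = {}
    for (s, t) in transitions:
        signature = tuple(delta(phi[s][f], phi[t][f]) for f in pool)
        classes.setdefault(signature, []).append((s, t))
    representatives = [group[0] for group in classes.values()]
    return representatives, classes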
Incremental constraint generation. Since the number of D2-separation constraints in the theory T(S, F) grows quadratically with the number of equivalence classes among the transitions, we use a constraint-generation loop where these constraints are enforced incrementally. We start with a set τ_0 of pairs of transitions (s, s′) and (t, t′) that contains all pairs for which s = t, plus some random pairs from S. We obtain the theory T_0(S, F) that is like T(S, F) but where the D2-separation constraints are restricted to pairs in τ_0. At each step, we solve T_i(S, F) and validate the solution to check whether it distinguishes all good from bad transitions in the entire sample; if it does not, the offending transitions are added to τ_{i+1} ⊃ τ_i, and the loop continues until the solution to T_i(S, F) satisfies the D2-separation formulas for all pairs of transitions in S, not just those in τ_i.
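The loop itself is a standard lazy constraint-generation scheme. In the following sketch, encode, maxsat_solve and find_violations are assumed helpers wrapping the encoding above, the Max-SAT solver, and the validation pass over the sample:

def solve_incrementally(sample, pool, tau_0, encode, maxsat_solve,
                        find_violations):
    """Lazily enforce D2-separation: solve T_i, validate against the whole
    sample, add offending pairs, and repeat until no violation remains."""
    tau = set(tau_0)
    while True:
        theory = encode(sample, pool, d2_pairs=tau)       # T_i(S, F)
        model = maxsat_solve(theory)
        offending = find_violations(model, sample, pool)  # violated D2 pairs
        if not offending:           # D2-separation holds for all of S
            return model
        tau |= offending            # tau_{i+1} strictly extends tau_i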
Results

Table 1 provides an overview of the execution of D2L over all generalized domains. The two main conclusions to be drawn from the results are that (1) our generalized policies are more expressive and capture policies that cannot be captured by previous approaches (Bonet, Francès, and Geffner 2019), and (2) our SAT encoding is also simpler and scales up much better, allowing us to tackle harder tasks with reasonable computational effort. Also, the new formulation is unsupervised and complete, in the sense that if there is a general policy in the given feature space that solves the instances, the solver is guaranteed to find it.

In all domains, we use a modified version of the Pyperplan planner (https://github.com/aibasel/pyperplan) to check empirically that the learned policies are able to solve a set of test instances of significantly larger dimensions than the training instances. For standard PDDL domains with readily-available instances (e.g., Gripper, Spanner, Miconic), the test set includes all instances in the benchmark set (we use the distribution at https://github.com/aibasel/downward-benchmarks), whereas for other domains such as Q_rew, Q_deliv or Q_bw, the test set contains randomly-generated instances.

We next briefly describe the policy learnt by D2L in each domain; the appendix contains detailed descriptions and proofs of correctness for all these policies. All features discussed in this section are automatically derived with the description-logic grammar, but we label them manually for readability.
Clearing a block. Q_clear is a simplified Blocksworld where the goal is to get clear(x) for a distinguished block x. We use the standard 4-op encoding with stack and unstack actions. Any 5-block training instance suffices to compute the following policy over features Φ = {c, H, n}, which denote, respectively, whether x is clear, whether the gripper holds a block, and the number of blocks above x:

r_1: {¬c, H, n = 0} ↦ {c, ¬H},
r_2: {¬c, ¬H, n > 0} ↦ {c?, H, n↓},
r_3: {¬c, H, n > 0} ↦ {¬H}.

Rule r_1 applies only when x is held (the only case where n = 0 and ¬c), and puts x on the table. Rule r_2 picks any block above x that can be picked, potentially making x clear, and r_3 puts down a block y ≠ x anywhere not above x. Note that this policy is slightly more complex than the one defined in (1) because the SAT theory enforces that goals be distinguishable from non-goals, which in the standard encoding cannot be achieved with H and n alone.

Stacking two blocks. Q_on is another simplification of Blocksworld where the goal is on(x, y) for two designated blocks x and y. One training instance with 5 blocks yields a policy over features Φ = {e, c(x), on(y), ok, c}. The first four are boolean and encode whether the gripper is empty, x is clear, some block is on y, and x is on y; the last is numerical and encodes the number of clear objects. This version of the problem is more general than that in (Bonet, Francès, and Geffner 2019), where x and y are assumed to be initially in different towers.

[Table 1 appears here in the original; its numeric entries were lost in extraction and are omitted.]

Table 1: Overview of results. |P_i| is the number of training instances, and dim is the size of the largest training instance along the main generalization dimension(s): number of blocks (Q_clear, Q_on, Q_bw), number of balls (Q_grip), grid size (Q_rew, Q_deliv, Q_visit), number of locations and spanners (Q_span), number of passengers and floors (Q_micon). We fix δ = 2 and k_F = 8 in all experiments except Q_deliv, where k_F = 9. S is the number of transitions in the training set, and S/∼ is the number of distinguishable equivalence classes in S. d_max is the max diameter of the training instances. |F| is the size of the feature pool. "Vars" and "clauses" are the number of variables and clauses in the CNF form of the theory T(S, F); the number in parentheses is the number of clauses in the last iteration of the constraint generation loop. t_all is total CPU time, in seconds, while t_SAT is the CPU time spent solving Max-SAT problems. c_Φ is the optimal cost of the SAT solution, |Φ| is the number of selected features, k* is the cost of the most complex feature in the policy, and |π_Φ| is the number of rules in the resulting policy. CPU times are given for the incremental constraint generation approach.

Gripper. Q_grip is the standard Gripper domain where a two-arm robot has to move n balls between two rooms A and B. Any 4-ball instance is sufficient to learn a simple policy with features Φ = {r_B, c, b} that denote whether the robot is at B, the number of balls carried by the robot, and the number of balls not yet left in B:

r_1: {¬r_B, c = 0, b > 0} ↦ {c↑},
r_2: {r_B, c = 0, b > 0} ↦ {¬r_B},
r_3: {r_B, c > 0, b > 0} ↦ {c↓, b↓},
r_4: {¬r_B, c > 0, b > 0} ↦ {r_B}.

In any non-goal state, the policy is compatible with the transition induced by some action; overall, it implements a loop that moves balls from A to B, one by one. Bonet, Francès, and Geffner (2019) also learn an abstraction for Gripper, but need an extra feature g that counts the number of free grippers in order to keep the soundness of their QNP model. Our approach does not need to build such a model, and the policies it learns often use features of smaller complexity.

Picking rewards. Q_rew consists of an agent that navigates a grid with some non-walkable cells in order to pick up scattered reward items. Training on a single grid with randomly-placed rewards and non-walkable cells results in the same policy as reported by Bonet, Francès, and Geffner (2019), which moves the agent to the closest unpicked reward, picks it, and repeats. In contrast with that work, however, our approach does not require sample plans, and its propositional theory is one order of magnitude smaller.

Delivery. Q_deliv is the previously discussed Delivery problem, where a truck needs to pick m packages from different locations in a grid and deliver them, one at a time, to a single target cell t. The policy learnt by D2L is a generalization to m packages of the one-package policy discussed before.

Visitall. Q_visit is the standard Visitall domain where an agent has to visit all the cells in a grid at least once. Training on a single instance produces a single-rule policy based on features Φ = {u, d} that represent the number of unvisited cells and the distance to a closest unvisited cell. The policy, similar to the one for Q_rew, moves the agent greedily to a closest unvisited cell until all cells have been visited.

Spanner.
Q_span is the standard Spanner domain where an agent picks up spanners along a corridor that are used at the end to tighten some nuts. Since spanners can be used only once and the corridor is one-way, the problem becomes unsolvable as soon as the agent moves forward and leaves some needed spanner behind. We feed D2L with 3 training instances with different initial locations of spanners, and it computes a policy with features
Φ = {n, h, e} that denote the number of nuts that still have to be tightened, the number of objects not held by the agent, and whether the agent's location is empty, i.e., has no spanner or nut in it:

r_1: {n > 0, h > 0, e} ↦ {e?},
r_2: {n > 0, h > 0, ¬e} ↦ {h↓, e?} | {n↓}.

The policy dictates a move when the agent is in an empty location; otherwise, it dictates either to pick up a spanner or to tighten a nut. Importantly, it never allows the agent to leave a location with some unpicked spanner, thereby avoiding dead-ends. Note that the features and policy are fit to the domain actions. For instance, an effect {e?} as in r_1 could not appear if the domain had no-op actions, as the resulting no-op transitions would comply with r_1 without making progress to the goal. The learned policy solves the 30 instances of the learning track of the 2011 International Planning Competition, and can actually be formally proven correct over all Spanner instances.

Miconic. Q_micon is the domain where a single elevator moves across different floors to pick up and deliver passengers to their destinations. We train on two instances with a few floors and passengers with different origins and destinations. The learned policy uses 4 numerical features that encode the number of passengers on board the lift, the number of passengers waiting to board, the number of passengers waiting to board on the same floor where the lift is, and the number of boarded passengers whose target floor is the one where the lift is. The policy solves the 50 instances of the standard Miconic distribution.

Blocksworld. Q_bw is the classical Blocksworld where the goal is to achieve some desired arbitrary configuration of blocks, under the assumption that each block has a goal destination (i.e., the goal picks a single goal state). We use a standard PDDL encoding where blocks are moved atomically from one location to another (no gripper). The only predicates are on and clear, and the set of objects consists of n blocks and the table, which is always clear. We use a single training instance where the target location of all blocks is specified. We obtain a policy over the features Φ = {c, t′, bwp} that stand for the number of clear objects, the number of objects that are not on their target location, and the number of objects such that all objects below them are well-placed, i.e., in their goal configuration. Interestingly, the value of all features in non-goal states is always positive (bwp > 0 holds trivially, as the table is always well-placed and below all blocks). The computed policy has one single rule with four effects:

{c > 0, t′ > 0, bwp > 0} ↦ {c↑} | {c↑, t′?, bwp↑} | {c↑, t′↓} | {c↓, t′↓}.

The last effect in the rule is compatible with any move of a block from the table into its final position, where everything below is already well-placed (this is the only move away from the table compatible with the policy), while the remaining effects are compatible with moving into the table a block that is not on its final position. The policy solves a set of 100 test instances with 10 to 30 blocks and random initial and goal configurations, and can actually be proven correct.
Discussion of Results.
On dead-end-free domains where all instances of the same size (same objects) have isomorphic state spaces, D2L is able to generate valid policies from one single training instance. In these cases, the only choice we have made regarding the training instance is selecting a size for the instance which is sufficiently large to avoid overfitting, but sufficiently small to allow the expansion of the entire state space. As we have seen, though, the approach is also able to handle domains with dead-ends (Q_span) or where different instances with the same objects can give rise to non-isomorphic state spaces (Q_rew, Q_micon). In these cases, the selection of training instances needs to be done more carefully so that sufficiently diverse situations are exemplified in the training set.

As can be seen in Table 1, the two optimizations discussed at the beginning are key to scale up in different domains. Considering indistinguishable classes of transitions instead of individual transitions offers a dramatic reduction in the size of the theory T(S, F) for domains with a large number of symmetries such as Spanner, Visitall, and Gripper. On the other hand, the incremental constraint-generation loop also reduces the size of the theory by up to one order of magnitude for domains such as Miconic and Blocksworld. Overall, the size of the propositional theory, which is the main bottleneck in (Bonet, Francès, and Geffner 2019), is much smaller. Where they report a number of clauses for Q_clear, Q_on, Q_grip and Q_rew of, respectively, 767K, 3.3M, 358K and 1.2M, the number of clauses in our encoding is 242.3K, 281.5K, 100.8K and 98.9K, that is, up to one order of magnitude smaller, which allows D2L to scale up to several other domains. Our approach is also more efficient than the one in (Francès et al. 2019), which requires several hours to solve a domain such as Gripper.
Conclusions
We have introduced a new method for learning features and general policies from small problems without supervision. This is achieved by means of a novel formulation in which a large but finite pool of features is defined from the predicates in the planning examples using a general grammar, and a small subset of features is sought for separating "good" from "bad" state transitions, and goals from non-goals. The problems of finding such a "separating surface" while labeling the transitions as "good" or "bad" are addressed jointly as a Weighted Max-SAT problem. The formulation is complete in the sense that if there is a general policy with features in the pool that solves the training instances, the solver will find it, and by computing the simplest such solution, it ensures better generalization outside of the training set. In comparison with existing approaches, the new formulation is conceptually simpler, more scalable (much smaller propositional theories), and more expressive (richer class of non-deterministic policies, and value functions that are not necessarily linear in the features). In the future, we want to study extensions for synthesizing provably correct policies, exploiting related results on QNPs.
Acknowledgements
This research is partially funded by an ERC Advanced Grant (No 885107), by grant TIN-2015-67959-P from MINECO, Spain, and by the Knut and Alice Wallenberg (KAW) Foundation through the WASP program. H. Geffner is also a Wallenberg Guest Professor at Linköping University, Sweden. G. Francès is partially supported by grant IJC2019-039276-I from MICINN, Spain.
References
Baader, F.; Calvanese, D.; McGuinness, D.; Patel-Schneider, P.; and Nardi, D. 2003. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.

Belle, V.; and Levesque, H. J. 2016. Foundations for Generalized Planning in Unbounded Stochastic Domains. In Proc. KR, 380–389.

Bonet, B.; Francès, G.; and Geffner, H. 2019. Learning Features and Abstract Actions for Computing Generalized Plans. In Proc. AAAI, 2703–2710.

Bonet, B.; and Geffner, H. 2018. Features, Projections, and Representation Change for Generalized Planning. In Proc. IJCAI, 4667–4673.

Bonet, B.; and Geffner, H. 2020. Qualitative Numeric Planning: Reductions and Complexity. JAIR.

Bonet, B.; and Geffner, H. 2021. General Policies, Representations, and Planning Width. In Proc. AAAI.

Bonet, B.; Palacios, H.; and Geffner, H. 2009. Automatic Derivation of Memoryless Policies and Finite-State Controllers Using Classical Planners. In Proc. ICAPS, 34–41.

Boutilier, C.; Reiter, R.; and Price, B. 2001. Symbolic Dynamic Programming for First-Order MDPs. In Proc. IJCAI, volume 1, 690–700.

Bueno, T. P.; de Barros, L. N.; Mauá, D. D.; and Sanner, S. 2019. Deep Reactive Policies for Planning in Stochastic Nonlinear Domains. In Proc. AAAI, volume 33, 7530–7537.

Chevalier-Boisvert, M.; Bahdanau, D.; Lahlou, S.; Willems, L.; Saharia, C.; Nguyen, T. H.; and Bengio, Y. 2019. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. In Proc. ICLR.

Fern, A.; Yoon, S.; and Givan, R. 2006. Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes. JAIR 25: 75–118.

Francès, G.; Bonet, B.; and Geffner, H. 2021a. Learning General Policies from Small Examples Without Supervision. In Proc. AAAI.

Francès, G.; Bonet, B.; and Geffner, H. 2021b. Source Code and Benchmarks for the Paper "Learning General Policies from Small Examples Without Supervision". https://doi.org/10.5281/zenodo.4322798.

Francès, G.; Corrêa, A. B.; Geissmann, C.; and Pommerening, F. 2019. Generalized Potential Heuristics for Classical Planning. In Proc. IJCAI, 5554–5561.

Garg, S.; Bajpai, A.; and Mausam. 2020. Generalized Neural Policies for Relational MDPs. arXiv preprint arXiv:2002.07375.

Garnelo, M.; and Shanahan, M. 2019. Reconciling Deep Learning with Symbolic Artificial Intelligence: Representing Objects and Relations. Current Opinion in Behavioral Sciences 29: 17–23.

Groshev, E.; Goldstein, M.; Tamar, A.; Srivastava, S.; and Abbeel, P. 2018. Learning Generalized Reactive Policies Using Deep Neural Networks. In Proc. ICAPS.

Haslum, P.; Lipovetzky, N.; Magazzeni, D.; and Muise, C. 2019. An Introduction to the Planning Domain Definition Language. Morgan & Claypool.

Helmert, M. 2002. Decidability and Undecidability Results for Planning with Numerical State Variables. In Proc. AIPS, 44–53.

Hu, Y.; and De Giacomo, G. 2011. Generalized Planning: Synthesizing Plans that Work for Multiple Environments. In Proc. IJCAI, 918–923.

Illanes, L.; and McIlraith, S. A. 2019. Generalized Planning via Abstraction: Arbitrary Numbers of Objects. In Proc. AAAI.

Khardon, R. 1999. Learning Action Strategies for Planning Domains. Artificial Intelligence 113(1-2): 125–148.

Martín, M.; and Geffner, H. 2004. Learning Generalized Policies from Planning Examples Using Concept Languages. Applied Intelligence 20(1): 9–19.

Martins, R.; Manquinho, V.; and Lynce, I. 2014. Open-WBO: A Modular MaxSAT Solver. In Proc. SAT, 438–445.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-Level Control Through Deep Reinforcement Learning. Nature 518: 529–533.

van Otterlo, M. 2012. Solving Relational and First-Order Logical Markov Decision Processes: A Survey. In Reinforcement Learning, 253–292. Springer.

Sanner, S.; and Boutilier, C. 2009. Practical Solution Techniques for First-Order MDPs. Artificial Intelligence 173(5-6): 748–788.

Segovia, J.; Jiménez, S.; and Jonsson, A. 2016. Generalized Planning with Procedural Domain Control Knowledge. In Proc. ICAPS, 285–293.

Seipp, J.; Pommerening, F.; Röger, G.; and Helmert, M. 2016. Correlation Complexity of Classical Planning Domains. In Proc. IJCAI, 3242–3250.

Silver, T.; Allen, K. R.; Lew, A. K.; Kaelbling, L. P.; and Tenenbaum, J. 2020. Few-Shot Bayesian Imitation Learning with Logical Program Policies. In Proc. AAAI, 10251–10258.

Srivastava, S.; Immerman, N.; and Zilberstein, S. 2008. Learning Generalized Plans Using Abstract Counting. In Proc. AAAI, 991–997.

Srivastava, S.; Immerman, N.; and Zilberstein, S. 2011. A New Representation and Associated Algorithms for Generalized Planning. Artificial Intelligence 175(2): 615–647.

Srivastava, S.; Zilberstein, S.; Immerman, N.; and Geffner, H. 2011. Qualitative Numeric Planning. In Proc. AAAI.

Toyer, S.; Trevizan, F.; Thiébaux, S.; and Xie, L. 2018. Action Schema Networks: Generalised Policies with Deep Learning. In Proc. AAAI.

Wang, C.; Joshi, S.; and Khardon, R. 2008. First Order Decision Diagrams for Relational MDPs. JAIR 31: 431–472.
Appendix
This appendix contains (1) proofs for Theorems 6 and 7, (2) a detailed description of the feature grammar used by D2L, (3) a full account of the generalized policies learned by the approach, and (4) proofs of their generalization to all instances of the generalized problem.

Theorems 6 and 7
For a sample S made of problems P_1, ..., P_k, let P_S be the collection of problems consisting of P_i[s], for 1 ≤ i ≤ k and non-dead-end state s in P_i, where the problem P_i[s] is like problem P_i but with initial state set to s. Theorems 6 and 7 in the paper are subsumed by the following:

Theorem.
Let S be the state space associated with a set P_1, ..., P_k of sample instances of a class of problems Q over a domain D, and let F be a pool of features. The theory T(S, F) is satisfiable iff there is a policy π_Φ over features Φ ⊆ F that solves all the problems in P_S and where the features in Φ discriminate goal from non-goal states in the problems P_1, ..., P_k.

Proof. We need to show two implications. Let us denote the theory T(S, F) as T. For the first direction, let π be a policy defined over a subset Φ of features from F such that π solves any problem in P_S and Φ discriminates the goals in P_1, ..., P_k. Let us construct an assignment σ for the variables in T:

– σ ⊨ Select(f) iff f ∈ Φ,
– σ ⊨ Good(s, s′) iff transition (s, s′) is compatible with π,
– σ ⊨ V(s, d) iff V_π(s) = d, where V_π(s) is the length of the max-length path connecting s to a goal in the subgraph S_π of S spanned by π: edge (s, s′) belongs to S_π iff (s, s′) is compatible with π, s is a non-goal state, and both s and s′ are non-dead-end states. Since π solves P_S, the graph S_π is acyclic and V(s, d) is well defined for non-dead-end states s.

We show that σ satisfies the formulas that make up T:

1. ∨_{(s,s′)} Good(s, s′) clearly holds since if s is a non-goal and non-dead-end state in P_i, the problem P_i[s] belongs to P_S and thus is solvable by π. Then, there is at least one transition (s, s′) in S_π (i.e., compatible with π).
2. Straightforward, since V*(s) ≤ V_π(s) ≤ δV*(s).
3. Since S_π is acyclic, if Good(s, s′) holds, then (s, s′) is in S_π and V_π(s′) < V_π(s).
4. Straightforward. If only one of {s, s′} is a goal, there is some feature f ∈ Φ such that f(s) ≠ f(s′), since Φ discriminates goals from non-goals.
5. By definition, S_π has no transition (s, s′) where s is non-goal and non-dead-end and s′ is a dead-end. Hence, σ ⊨ ¬Good(s, s′).
6. Let (s, s′) and (t, t′) be two transitions between non-dead-end states where s and t are both also non-goal states. If Good(s, s′) and ¬Good(t, t′), then both transitions must be separated by at least some feature f ∈ Φ, since otherwise, if (s, s′) is compatible with π (defined over Φ), then (t, t′) would also be compatible with π and thus Good(t, t′) would hold as well.

Hence, σ ⊨ T.

For the converse direction, let σ be a satisfying assignment for T. We must construct a subset Φ of features from F that discriminates goal from non-goal states in P_1, ..., P_k, and a policy π = π_Φ that solves P_S. Constructing Φ is easy: Φ = {f ∈ F : σ ⊨ Select(f)}. For defining the policy, let us introduce the following idea. For a subset Φ of features and a subset T of transitions in S, the policy π_T is the policy given by the rules Φ(s) ↦ E_1 | ··· | E_m where

– s is a "source" state in some transition (s, s′) in T,
– Φ(s) is the set of boolean conditions given by Φ on s; i.e., Φ(s) = {p : p(s) = true} ∪ {¬p : p(s) = false} ∪ {n > 0 : n(s) > 0} ∪ {n = 0 : n(s) = 0}, where p (resp. n) is a boolean (resp. numerical) feature in Φ,
– each E_i captures the feature changes for some transition (s′, s′′) in T such that s′ ⊨ Φ(s); i.e., E_i is a maximal set of feature effects that is compatible with (s′, s′′).

The policy π is the policy π_T for T = {(s, s′) ∈ S : σ ⊨ Good(s, s′)}. Observe that the policy π is well defined since for two transitions (s, s′) and (t, t′) such that Φ(s) = Φ(t), the two rules associated with Φ(s) and Φ(t), respectively, are identical. This follows from formula (6) in the theory. By formula (4) in the theory, the features in Φ discriminate goal from non-goal states in P_1, ..., P_k. So, we only need to show that π solves any problem in P_S.

As before, let us construct the subgraph S_π of S spanned by π: the edge (s, s′) is in S_π iff s is a non-goal and non-dead-end state and (s, s′) is compatible with π; equivalently, σ ⊨ Good(s, s′). Since S_π contains all states that are reachable in P_1, ..., P_k, a necessary and sufficient condition for π to solve any problem in P_S is that S_π is acyclic and each non-dead-end state in S_π is connected to a goal state. The first property is a consequence of the assignments V(s, d) to each non-dead-end state s in S_π, since by formula (3), if (s, s′) belongs to S_π and σ ⊨ V(s, d), then σ ⊨ V(s′, d′) for some 0 ≤ d′ < d. For the second property, if s is a non-goal and non-dead-end state in S_π, then by formula (2), σ ⊨ V(s, d) for some d, and by formulas (1) and (3), s is connected to a state s′ in S_π such that σ ⊨ V(s′, d′) for some 0 ≤ d′ < d. The state s′ is not a dead-end by formula (5). If s′ is not a goal state, repeating the argument we find that s′ is connected to a non-dead-end state s′′ such that σ ⊨ V(s′′, d′′) with 0 ≤ d′′ < d′ < d. This process is continued until a goal state s* connected to s is found, for which σ ⊨ V(s*, 0). Therefore, π solves any problem P in P_S.

Feature Grammar
The set F of candidate features is generated through a standard description logics grammar (Baader et al. 2003), similarly to (Bonet, Francès, and Geffner 2019; Francès et al. 2019). Description logics build on the notions of concepts, classes of objects that have some property, and roles, relations between these objects. We here use the standard description logic SOI as a building block for our features. We start from a set of primitive concepts and roles made up of all unary and binary predicates that are used to define the PDDL model corresponding to the generalized problem. Following Martín and Geffner (2004), we also consider goal versions p_g of each predicate p in the PDDL model that is relevant for the goal. These have a fixed denotation in all states of a particular problem instance, given by the goal formula. To illustrate, a typical Blocksworld instance with a goal like on(x, y) results in primitive concepts clear, holding, ontable, and primitive roles on and on_g. In generalized domains where it makes sense to define a goal in terms of a few goal parameters (e.g., "clear block x"), we take these into account in the feature grammar below. Note however that this is mostly to improve interpretability, and could be easily simulated without the need for such parameters.

Concept Language: Syntax and Semantics
Assume that C and C′ stand for concepts, R and R′ for roles, and ∆ stands for the universe of a particular problem instance, made up of all the objects appearing in it. The set of all concepts and roles and their denotations in a given state s is inductively defined as follows:

– Any primitive concept p is a concept with denotation p^s = {a | s ⊨ p(a)}, and any primitive role r is a role with denotation r^s = {(a, b) | s ⊨ r(a, b)}.
– The universal concept ⊤ and the bottom concept ⊥ are concepts with denotations ⊤^s = ∆ and ⊥^s = ∅.
– The negation ¬C, the union C ⊔ C′, and the intersection C ⊓ C′ are concepts with denotations (¬C)^s = ∆ \ C^s, (C ⊔ C′)^s = C^s ∪ C′^s, and (C ⊓ C′)^s = C^s ∩ C′^s.
– The existential restriction ∃R.C and the universal restriction ∀R.C are concepts with denotations (∃R.C)^s = {a | ∃b: (a, b) ∈ R^s ∧ b ∈ C^s} and (∀R.C)^s = {a | ∀b: (a, b) ∈ R^s → b ∈ C^s}.
– The role-value map R = R′ is a concept with denotation (R = R′)^s = {a | ∀b: (a, b) ∈ R^s ↔ (a, b) ∈ R′^s}.
– If a is a constant in the domain or a goal parameter, the nominal {a} is a concept with denotation {a}^s = {a}.
– The inverse role R⁻, the composition role R ∘ R′, and the (non-reflexive) transitive closure role R⁺ are roles with denotations (R⁻)^s = {(b, a) | (a, b) ∈ R^s}, (R ∘ R′)^s = {(a, c) | ∃b: (a, b) ∈ R^s ∧ (b, c) ∈ R′^s}, and (R⁺)^s = {(a_0, a_n) | ∃a_1, ..., a_{n−1} ∀i: (a_{i−1}, a_i) ∈ R^s}.

We place some restrictions on the above grammar in order to reduce the combinatorial explosion of possible concepts: (1) we do not generate concept unions, (2) we do not generate role compositions, (3) we only generate role-value maps R = R′ where R′ is the goal version of R, and (4) we only generate the inverse and transitive closure roles r_p⁻, r_p⁺, (r_p⁻)⁺, with r_p being a primitive role.
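This denotational semantics translates directly into set operations. A minimal evaluator sketch for some of the constructs (concepts as Python sets of objects, roles as sets of pairs; the syntax-tree encoding and the state interface are assumptions):

def evaluate(term, state):
    """Evaluate a concept/role syntax tree in a state; 'state' provides
    the universe and the denotations of primitive predicates."""
    op, args = term[0], term[1:]
    if op == 'primitive':                   # p^s or r^s
        return state.denotation(args[0])
    if op == 'not':                         # (¬C)^s = Δ \ C^s
        return state.universe - evaluate(args[0], state)
    if op == 'and':                         # (C ⊓ C')^s = C^s ∩ C'^s
        return evaluate(args[0], state) & evaluate(args[1], state)
    if op == 'exists':                      # (∃R.C)^s
        r, c = evaluate(args[0], state), evaluate(args[1], state)
        return {a for (a, b) in r if b in c}
    if op == 'forall':                      # (∀R.C)^s
        r, c = evaluate(args[0], state), evaluate(args[1], state)
        return {a for a in state.universe
                if all(b in c for (x, b) in r if x == a)}
    if op == 'inverse':                     # (R⁻)^s
        return {(b, a) for (a, b) in evaluate(args[0], state)}
    raise ValueError(op)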
From concepts to features

The complexity of a concept or role is defined as the size of its syntax tree. From the above-described infinite set of concepts and roles, we only consider the finite subset G_k of those with complexity under a given bound k. (This feature generation process implicitly restricts the domains that D2L can tackle to those having predicates with arity ≤ 2.) When generating G_k, redundant concepts and roles are pruned. A concept or role is redundant when its denotation over all states in the training set is the same as that of some previously generated concept or role. From the domain model and G_k, we generate the following features:

• For each nullary primitive predicate p, a boolean feature b_p that is true in s iff p is true in s.
• For each concept C ∈ G_k, we generate a boolean feature |C| if |C^s| ∈ {0, 1} for all states s in the training set, and a numerical feature |C| otherwise. The value of boolean feature |C| in s is true iff |C^s| = 1; the value of numerical feature |C| is |C^s|.
• Numerical features Distance(C_1, R:C, C_2) that represent the smallest n such that there are objects x_1, ..., x_n satisfying C_1^s(x_1), C_2^s(x_n), and (R:C)^s(x_i, x_{i+1}) for i = 1, ..., n−1. The denotation (R:C)^s contains all pairs (x, y) in R^s such that y ∈ C^s. When no such n exists, the feature evaluates to m + 1, where m is the number of objects in the particular problem instance.

The complexity w(f) of a feature f is set to the complexity of C for features |C|, to 1 for features b_p, and to the sum of the complexities of C_1, R, C, and C_2 for features Distance(C_1, R:C, C_2). Only features with complexity bounded by k are generated. For efficiency reasons, we only generate features Distance(C_1, R:C, C_2) where the denotation of concept C_1 in all states contains one single object. All of this feature generation procedure follows (Bonet, Francès, and Geffner 2019), except for the addition of goal predicates to the set of primitive concepts and roles.

Generalized Policies
We next describe in detail the generalized policies learned by D2L on the reported example domains, and show that they generalize over the entire domain. (The encodings of those domains below that are standard benchmarks from competitions and the literature can be obtained at https://github.com/aibasel/downward-benchmarks.)
Reasoning about correctness. We sketch a method to prove correctness of a policy over an entire generalized planning domain Q in a domain-dependent manner; the following discussion relates to (Seipp et al. 2016). Let P be an instance of Q. We assume that Q implicitly defines which states of P are valid; henceforth we are only concerned with valid states. We say that a valid state of P is solvable if there is a path from it to some goal in P, and that it is alive if it is solvable but not a goal. We denote by A(P) the set of alive states of instance P. In dead-end-free domains, where all states are solvable, any state is either alive or a goal.
Definition 8 (Complete & Descending Policies). We say that generalized policy π_Φ is complete over P if for any state s ∈ A(P), π_Φ is compatible with some transition (s, s′). We say that π_Φ is descending over P if there is some function γ that maps states of P to a totally ordered set U such that for any alive state s ∈ A(P) and π_Φ-compatible transition (s, s′), we have that γ(s′) < γ(s).
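On a small finite instance, both conditions of Definition 8 can be verified by brute-force enumeration. The sketch below assumes hypothetical interfaces, a successor generator successors(s), a test policy_compatible(s, s2) for π_Φ-compatibility, and a candidate ranking function gamma(s); none of these names comes from the paper's implementation.

```python
# Brute-force check of Definition 8 on a finite instance, under hypothetical
# interfaces: successors(s), policy_compatible(s, s2), gamma(s).

def is_complete(alive_states, successors, policy_compatible):
    """Every alive state must admit at least one policy-compatible transition."""
    return all(
        any(policy_compatible(s, s2) for s2 in successors(s))
        for s in alive_states
    )

def is_descending(alive_states, successors, policy_compatible, gamma):
    """gamma must strictly decrease along every policy-compatible transition."""
    return all(
        gamma(s2) < gamma(s)
        for s in alive_states
        for s2 in successors(s)
        if policy_compatible(s, s2)
    )
```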
Theorem 9. Let π_Φ be a policy that is complete and descending for P. Then, π_Φ solves P.
Proof. Because π_Φ is descending, no state trajectory compatible with it can feature the same state more than once. Since the set S(P) of states of P is finite, there is a finite number of trajectories compatible with π_Φ, all of which have length bounded by |S(P)|. Let τ be one maximal such trajectory, i.e., a trajectory τ = s_0, . . . , s_n such that P allows no π_Φ-compatible transition (s_n, s). Because π_Φ is complete, s_n ∉ A(P), so it must be a goal.
A way to show that π_Φ is descending is by providing a fixed-length tuple ⟨f_1, . . . , f_n⟩ of state features f_i : S(P) → ℕ. Boolean features can have their truth values cast to 0 (false) or 1 (true). If for every transition (s, s′) compatible with π_Φ we have ⟨f_1(s′), . . . , f_n(s′)⟩ < ⟨f_1(s), . . . , f_n(s)⟩, where < is the lexicographic order over tuples, then π_Φ is descending. When this is the case, we say that π_Φ descends over ⟨f_1, . . . , f_n⟩.
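This lexicographic test is immediate to mechanize, since tuples in most languages already compare lexicographically; a small sketch follows, with the feature functions passed in as callables (an assumed interface).

```python
# Lexicographic descent over a feature tuple ⟨f_1, ..., f_n⟩.

def valuation(features, state):
    """Evaluate the feature tuple on a state, casting booleans to 0/1."""
    return tuple(int(f(state)) for f in features)

def descends(features, transitions):
    """True iff every policy-compatible transition (s, s2) strictly decreases
    the feature tuple; Python's tuple comparison is exactly the order <."""
    return all(valuation(features, s2) < valuation(features, s)
               for (s, s2) in transitions)
```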
Policy for Q_clear
The set of features Φ learned by D2L contains:
• c ≡ |clear ⊓ {x}|: whether block x is clear.
• H ≡ |holding|: whether the gripper is holding some block.
• n ≡ |∃on⁺.{x}|: number of blocks above x.
The learned policy π_Φ is:
r_1: {¬c, H, n = 0} ↦ {c, ¬H},
r_2: {¬c, ¬H, n > 0} ↦ {c?, H, n↓},
r_3: {¬c, H, n > 0} ↦ {¬H}.
There are no dead-ends in Q_clear, and c is true only in goal states. A particularity of the 4-op encoding used here is that a block being held is not considered clear; hence, n = 0 does not imply the goal.
Let us show that π_Φ is complete over any problem instance P. Let s ∈ A(P). If the gripper is holding some block in s, then the transition where it puts the block on the table is compatible with π_Φ (rules r_1, r_3), regardless of whether the block is x or not. If the gripper is empty, there must be at least one block above x, otherwise s would be a goal. The transition where the gripper picks up one such block is compatible with π_Φ (r_2).
Now, let us show that π_Φ descends over the feature tuple ⟨n, H⟩. Rules r_1 and r_3 do not affect n and make H false, so any transition (s, s′) compatible with them makes the valuation of ⟨n, H⟩ decrease. Rule r_2 always decreases n, so compatible transitions decrease ⟨n, H⟩. Since π_Φ is complete and descending, it solves P.
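The rule notation used in these policies, feature conditions on the left and qualitative effects such as n↓, c↑, or the unconstrained c? on the right, can also be checked mechanically on a transition. Below is a sketch under assumed encodings, using the Q_clear rules as data; we additionally assume that features not mentioned in a rule's effects must keep their value, which is how we read the policy language here.

```python
# Sketch (assumed encodings): a rule is (conditions, effects). Conditions map a
# feature to a required boolean value or a predicate on its numeric value;
# effects use True/False for boolean targets, "dec"/"inc" for qualitative
# decrements/increments, and "?" for an unconstrained change.

Q_CLEAR_RULES = [
    ({"c": False, "H": True,  "n": lambda n: n == 0}, {"c": True, "H": False}),
    ({"c": False, "H": False, "n": lambda n: n > 0},  {"c": "?", "H": True, "n": "dec"}),
    ({"c": False, "H": True,  "n": lambda n: n > 0},  {"H": False}),
]

def satisfies(cond, value):
    return cond(value) if callable(cond) else value == cond

def compatible(rule, s_vals, s2_vals):
    """Is the transition (s, s2), given as feature valuations, compatible with rule?"""
    conds, effs = rule
    if not all(satisfies(cond, s_vals[f]) for f, cond in conds.items()):
        return False
    for f, eff in effs.items():
        if eff == "?":                                   # value unconstrained
            continue
        if eff == "dec" and not s2_vals[f] < s_vals[f]:
            return False
        if eff == "inc" and not s2_vals[f] > s_vals[f]:
            return False
        if eff in (True, False) and s2_vals[f] != eff:
            return False
    # Features not mentioned in the effects keep their value (our assumption).
    return all(s2_vals[f] == s_vals[f] for f in s_vals if f not in effs)

def pi_compatible(rules, s_vals, s2_vals):
    return any(compatible(r, s_vals, s2_vals) for r in rules)
```

For instance, a transition where the gripper picks up the topmost block above x leaves c unconstrained, sets H, and decreases n, which matches the second rule above.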
Policy for Q_on
The set of features Φ learned by D2L contains:
• e ≡ |handempty|: whether the gripper is empty,
• c ≡ |clear|: number of clear objects,
• c(x) ≡ |clear ⊓ {x}|: whether block x is clear,
• on(y) ≡ |∃on.{y}|: whether some block is on y,
• ok ≡ |{x} ⊓ ∃on.{y}|: whether x is on y.
The learned policy π_Φ is:
r_1: {e, c(x), ¬on(y)} ↦ {¬e, ¬c(x), c↓} | {¬e, ¬c(x)},
r_2: {e, c(x), on(y)} ↦ {¬e} | {¬e, ¬on(y)} | {¬e, ¬c(x)},
r_3: {e, ¬c(x), ¬on(y)} ↦ {¬e} | {¬e, c(x)},
r_4: {e, ¬c(x), on(y)} ↦ {¬e} | {¬e, ¬on(y)} | {¬e, c(x)},
r_5: {¬e, c(x)} ↦ {e, c↑},
r_6: {¬e, ¬c(x), on(y)} ↦ {e, c(x), c↑} | {e, c↑},
r_7: {¬e, ¬c(x), ¬on(y)} ↦ {e, c(x), ok, on(y)} | {e, c↑}.
There are no dead-ends in Q_on, and in all alive states, c > 0 and ¬ok; these two conditions have been omitted from the bodies of all 7 rules above for readability.
Let us show that π_Φ is complete over any problem instance P. Note that the conditions in the rule bodies partition A(P). Let s ∈ A(P) be an alive state. If the gripper is holding some block in s, then putting it on the table is always possible and is compatible with rules r_5–r_7, except when the held block is x and y has nothing above it; in that case, putting x on y is possible and compatible with r_7. If the gripper is empty, consider whether x and y are clear. If both are clear, then picking up x is always possible, and is compatible with r_1. Otherwise, there must be at least one tower of blocks, and picking up some block from such a tower is always possible, and is compatible with r_2–r_4.
Now, let us show that π_Φ descends over the feature tuple ⟨al, ready′, t′, e⟩, where al is 1 in any alive state and 0 otherwise; ready′ is 0 if holding(x) and clear(y), and 1 otherwise; t′ is the number of blocks not on the table; and e is as defined above. Rules r_1–r_4 are compatible only with transitions where the gripper is initially empty; none affects al, and all decrease e. All their effects are compatible only with pick-ups from a tower (otherwise c↓), hence do not affect t′, except for picking up x when y is clear (first effect of r_1), which increases t′ but makes ready′ decrease. Rules r_5–r_6 are compatible only with putting a held block on the table, decreasing t′, and do not affect al or ready′. A similar reasoning applies to rule r_7, except when the held block is x and can be put on y, in which case al decreases. Since π_Φ is complete and descending, it solves P.
Policy for Q_grip
The set of features Φ learned by D2L contains:
• r_B ≡ |∃at_g.at-robby|: whether the robot is at B.
• c ≡ |∃carry.⊤|: number of balls carried by the robot.
• b ≡ |¬(at_g = at)|: number of balls not in room B.
The learned policy π_Φ is:
r_1: {¬r_B, c = 0, b > 0} ↦ {c↑},
r_2: {r_B, c = 0, b > 0} ↦ {¬r_B},
r_3: {r_B, c > 0, b > 0} ↦ {c↓, b↓},
r_4: {¬r_B, c > 0, b > 0} ↦ {r_B}.
There are no dead-ends in Gripper, and b > 0 in any alive state of an instance P. Let us show that π_Φ is complete over any problem instance P. Let s be an alive state. If the robot is in room A carrying some ball, the transition where it moves to B is compatible with π_Φ (r_4). If it is carrying no ball, then there must be some ball in A (s is not a goal); picking it up is compatible with r_1. Now, if the robot is in room B carrying some ball, the transition where it drops the ball is compatible with r_3; if it carries no ball, the transition where it moves to room A is compatible with r_2.
Policy π_Φ descends over the tuple ⟨b_A, b_RA, b_RB, r_B⟩, where b_A counts the number of balls in room A, b_Rx the number of balls held by the robot while in room x, and r_B is as defined above. This is because rule r_1 decreases b_A; rule r_2 decreases r_B without affecting the other features; rule r_3 decreases b_RB; and rule r_4 increases b_RB but decreases b_RA. Since π_Φ is complete and descending, it solves P.
Policy for Q_rew
The set of features