Hierarchical Width-Based Planning and Learning
Miquel Junyent, Vicenç Gómez, Anders Jonsson
Universitat Pompeu Fabra, Barcelona, Spain
{miquel.junyent, vicen.gomez, anders.jonsson}@upf.edu

Abstract
Width-based search methods have demonstrated state-of-the-art performance in a wide range of testbeds, from classical planning problems to image-based simulators such as Atari games. These methods scale independently of the size of the state space, but exponentially in the problem width. In practice, running the algorithm with a width larger than 1 is computationally intractable, prohibiting IW from solving higher-width problems. In this paper, we present a hierarchical algorithm that plans at two levels of abstraction. A high-level planner uses abstract features that are incrementally discovered from low-level pruning decisions. We illustrate this algorithm in classical planning PDDL domains as well as in pixel-based simulator domains. In classical planning, we show how IW(1) at two levels of abstraction can solve problems of width 2. For pixel-based domains, we show how, in combination with a learned policy and a learned value function, the proposed hierarchical IW can outperform current flat IW-based planners in Atari games with sparse rewards.
Introduction
The use of hierarchies in planning has proven to be a very successful way of significantly reducing the computational cost of finding good plans. Traditional methods include Hierarchical Task Networks (Currie and Tate 1991; Erol, Hendler, and Nau 1996), macro-actions (Fikes, Hart, and Nilsson 1972; Korf 1985), and state abstraction methods (Sacerdoti 1974; Knoblock 1990). Hierarchical planning can lead to exponential gains in complexity by exploiting the structure of a problem involving a reduced subset of the state components.

Iterated Width (IW) (Lipovetzky and Geffner 2012) is a search algorithm that makes use of the feature representation of the states to perform structured exploration. The original IW algorithm consists of successive breadth-first searches in which states are pruned if they fail to meet a novelty criterion. In particular, IW(w) only considers w features at a time, and prunes those states for which all combinations of w features are made true in previously generated states. IW(w) runs in time and space that are exponential in w, but independent of the size of the state space.

Initially proposed as a blind search method for classical planning, IW search has been extended in many different ways, resulting in several competitive width-based planners, including LW1 for partially observable domains (Bonet and Geffner 2014), or BFWS as an informed (best-first) width search planner (Lipovetzky and Geffner 2017).

One particular advantage of width-based planners is that, unlike other classical planners, they do not need a declarative representation of actions, costs or goals (Francès et al. 2017).
Width-based planners are thus directly applicable in simulator environments, achieving state-of-the-art performance in the General Video Game competition (Geffner and Geffner 2015) and the Atari suite (Lipovetzky, Ramirez, and Geffner 2015; Shleyfman, Tuisov, and Domshlak 2016; Bandres, Bonet, and Geffner 2018).

The performance of IW strongly depends on how informative the state features are. Using poorly informed features requires a large value of w to reach a goal state, whereas using highly informative features reduces the problem width and, hence, makes it solvable using a lower value of w. This effect is known, e.g., in Atari, where using informative RAM states leads to better results than planning directly with pixels (Bandres, Bonet, and Geffner 2018). How to discover or learn such features to reduce the problem width is an open problem, and several ideas have been proposed, including the use of conjunctive features (Francès et al. 2017) or deep learning methods (Junyent, Jonsson, and Gómez 2019; Dittadi, Drachmann, and Bolander 2020).

In practice, IW is mostly used with w = 1, with complexity linear in the number of features (Geffner and Geffner 2015; Bandres, Bonet, and Geffner 2018; Ramirez et al. 2018; Dittadi, Drachmann, and Bolander 2020). In many challenging problems, even w = 2 with quadratic complexity is unfeasible (Geffner and Geffner 2015). Finding ways to run IW with a larger value of w can further extend the applicability of this class of planners.

In this work, we propose a hierarchical formulation of width-based planning that takes advantage of both the structured search performed by width-based algorithms and the concept of hierarchy, which explicitly captures the idea of using state abstraction to effectively reduce the width of a problem. The framework can be combined with other forms of learning to further extend the applicability of width-based planners.
Background

In this section we define Markov decision processes and the Iterated Width (IW) algorithm, and introduce notation that will be used throughout the paper.
Markov Decision Processes
A Markov decision process (MDP) is modeled as a tuple M = ⟨S, A, P, r⟩, where S is a finite set of states, A is a finite set of actions, P is a transition function and r is a reward function. We assume that the transition function P is deterministic, i.e. P : S × A → S maps state-action pairs to next states, while the reward function r : S × A → ℝ maps state-action pairs to real-valued rewards.

At each time step t, a learning agent observes state s_t ∈ S, selects an action a_t ∈ A, transitions to a new state s_{t+1} = P(s_t, a_t) and receives reward r_t = r(s_t, a_t). The aim of the learner is to compute a policy π : S → Δ(A), i.e. a mapping from states to probability distributions over actions, that maximizes some measure of expected future reward. Here, Δ(A) = {μ ∈ ℝ^A : Σ_a μ(a) = 1, μ(a) ≥ 0 ∀a} is the probability simplex over A.

The expected future reward associated with policy π is governed by a value function V^π, defined in each state s as

V^π(s) = E_π [ Σ_{t=0}^∞ γ^t r(S_t, A_t) | S_0 = s ].

Here, S_t and A_t are random variables representing the state and action at time t, respectively, satisfying A_t ∼ π(S_t) and S_{t+1} = P(S_t, A_t) for each t ≥ 0, and γ ∈ (0, 1) is a discount factor. The optimal value function V* is given by V* = max_π V^π, and the optimal policy π* is the argument achieving this maximum, i.e. π* = arg max_π V^π.

We assume that there exists a set of features F, each with finite domain D, and a mapping φ : S → D^|F| from states to feature vectors. For each feature f ∈ F and state s ∈ S, let φ(s)[f] ∈ D be the value that s assigns to f. It is common to approximate the value function in state s using the feature vector φ(s) and a parameter vector θ, i.e. the estimation of the value in state s is given by V̂_θ(s) = g(φ(s), θ) for some function g, e.g.
a neural network.

We can use deterministic MDPs to model goal-directed planning tasks. Such a planning task is also defined by a set of states S, a set of actions A and a deterministic transition function P. In addition, there is a set of designated goal states S_G ⊂ S. To model the task as an MDP, we make each goal state s_G ∈ S_G absorbing by defining the transition function as P(s_G, a) = s_G for each action a ∈ A. The reward function is defined as r(s, a) = 1 if s ∈ S_G and r(s, a) = 0 otherwise. Hence an optimal policy attempts to reach a goal state as quickly as possible and then stay there.

Iterated Width
Iterated Width (IW) (Lipovetzky and Geffner 2012) is a forward search algorithm that explores the state space of a deterministic MDP starting from a given initial state s_0. IW was initially developed for goal-directed planning tasks, attempting to find a goal state among the set of explored states. However, the algorithm has later been adapted to MDPs by instead attempting to maximize expected future reward (Lipovetzky, Ramirez, and Geffner 2015).

In its basic form, IW is a blind search algorithm that performs breadth-first search in the space of states, starting from s_0. However, unlike standard breadth-first search, IW uses a novelty measure to prune states. The novelty measure critically relies on the feature vector φ(s) associated with each state s. Concretely, IW defines a width parameter w, and remembers all visited tuples of feature values of size w. During search, a state s is considered novel if its associated feature vector φ(s) contains at least one tuple of feature values of size w that has not been visited before. IW then prunes all states that are not novel.

For a given width w, because of the pruning mechanism, the number of states visited by IW(w) is exponential in w. Since the state space is usually large, IW(w) is typically provided with a search budget, and terminates as soon as the number of visited states exceeds the budget. Without a search budget, in most domains it is computationally infeasible to execute IW(w) for w > 1. However, many planning benchmarks turn out to have small width, at least when considering atomic goals, and in practice they can be solved by IW(1) or IW(2).

Several researchers have proposed extensions of the original IW algorithm. Rollout IW (Bandres, Bonet, and Geffner 2018) simulates a breadth-first search by repeatedly generating trajectories, or rollouts, from the initial state s_0. This is useful in domains for which it is expensive to store states in memory, making it impractical to perform an actual breadth-first search.
The π-IW algorithm (Junyent, Jonsson, and Gómez 2019) maintains and updates a policy π, and uses the policy to decide in which order to expand states, rather than exploring blindly.

Complexity of IW(w)

In this section we provide a precise upper bound on the number of states visited by IW(w). We use n = |F| to denote the number of features, and d = |D| to denote the size of the common domain. We also assume that the branching factor is b, i.e. in each state s, at most b actions are applicable. The only existing complexity result that we know of is that IW(1) generates at most ndb nodes, and that IW(2) generates at most (nd)^2 b nodes (Lipovetzky, Ramirez, and Geffner 2015). Our aim is to provide a tighter bound.

Let N(n, d, w) denote the maximum number of novel states visited by IW(w) for a given pair (n, d). Hence the number of visited states (including those pruned) is bounded by N(n, d, w) · b. Below we provide a recursive definition of N(n, d, w). The intuition is that we can remove a feature f to decompose the problem into two smaller instances of IW on n − 1 features. One instance has width w − 1, corresponding to tuples that are to be combined with a value of f. Since f has d − 1 values that have not been seen in the initial state s_0, there are d − 1 ways to combine the tuples of size w − 1 with a value of f. The other instance has width w, corresponding to tuples that do not involve a value of f. There are two base cases: w = 0, in which case no state is novel apart from s_0, i.e. N(n, d, 0) = 1, and w = n, in which case all states are novel, i.e. N(n, d, n) = d^n. The recursive definition is now given by

N(n, d, 0) = 1,
N(n, d, n) = d^n,
N(n, d, w) = (d − 1) · N(n − 1, d, w − 1) + N(n − 1, d, w).

Theorem 1.
For n features with domain size d, the maximum number of novel states visited by IW(w), 0 ≤ w < n, is

N(n, d, w) = Σ_{k=0}^{w} C(n − 1 − k, w − k) · d^k · (d − 1)^{w−k},

where C(·, ·) denotes the binomial coefficient.

The proof of Theorem 1 appears in the supplementary material, which also shows that N(n, d, w) is indeed upper bounded by (nd)^w, which is consistent with previous results.

Hierarchical IW
In this section, we present our hierarchical approach to width-based planning. We start by defining a simple algorithm for hierarchical blind search. Then, we consider using width-based planners at all levels of the hierarchy, and show its effect on the width compared to planning at a single level.

For simplicity, and without loss of generality, we assume a two-level hierarchy: a high level (h) and a low level (ℓ). Each level is defined by its own feature set (F_h and F_ℓ, with domains D_h and D_ℓ, respectively) and feature mapping (φ_h : S → D_h^|F_h| and φ_ℓ : S → D_ℓ^|F_ℓ|, respectively). Each state s maps to a high-level state s_h = φ_h(s) and a low-level state s_ℓ = φ_ℓ(s).

A Hierarchical Approach to Blind Search
Blind search methods require two components: a successor function, that given a state and an action returns a successor state (e.g. a simulator), and a stopping condition, that stops the search, for instance, when the goal is reached or after a budget is exhausted. In order to have different search levels, we modify these two components as follows:

• High-level successor function: Each call to this function triggers a low-level search, which runs until a new high-level state is found (i.e. a state s that maps to a different φ_h(s)).

• Low-level stopping condition: When a different high-level state is encountered, the search is stopped, returning control to the high-level planner. This stopping condition is added to the existing stopping conditions.

Control goes back and forth between the high- and low-level planners. Each time the high-level successor function is called, the corresponding low-level search is resumed, generating new states until a new high-level state is found. We achieve this by storing a low-level search tree for each high-level state. If the low-level search terminates without finding a new high-level state, the high-level successor function returns null, and the high-level state is marked as expanded. The high-level planner will only generate successors from non-expanded high-level states, and can resume search from any state by retrieving it from memory.

The proposed framework allows many levels of abstraction, as well as the possibility of using different search methods at each level. For instance, we could use breadth-first search at the high level and depth-first search at the low level, or combine different width-based search methods.
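The control flow above can be sketched as follows. This is a minimal sketch under illustrative interfaces (`step`, `phi_h`), not the authors' implementation; it uses a plain breadth-first search at the low level and a seen-set instead of a stored search tree.

```python
from collections import deque

class HighLevelSuccessors:
    """Sketch of the modified high-level successor function: each
    high-level state owns a resumable low-level breadth-first search,
    advanced until a state mapping to a *different* high-level state
    is generated (the low-level stopping condition). Returning None
    marks the high-level state as expanded."""

    def __init__(self, step, actions, phi_h):
        self.step, self.actions, self.phi_h = step, actions, phi_h
        self.searches = {}  # high-level state -> (queue, seen, boundary)

    def next_successor(self, s_h, entry):
        q, seen, out = self.searches.setdefault(
            s_h, (deque([entry]), {entry}, deque()))
        while not out and q:
            s = q.popleft()
            for a in self.actions:
                s2 = self.step(s, a)
                if s2 in seen:
                    continue
                seen.add(s2)
                if self.phi_h(s2) == s_h:
                    q.append(s2)       # still inside this subspace
                else:
                    out.append(s2)     # new high-level state: stop here
        return out.popleft() if out else None
```

In the corridor-with-key example used later, repeatedly calling `next_successor` on the key-not-held subspace first yields the boundary state where the key is picked up, and eventually None once the subspace is exhausted.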
Hierarchical Width
The framework presented in the previous section partitions the state space into subspaces based on high-level states. To plan over the set of subspaces, we can use any width-based search method as a high-level planner. For instance, we can apply IW(1) at the high level and IW(1) at the low level. We denote this by HIW(1,1). We next define a particular type of high-level feature that we call splitting.

Definition 1.
A high-level feature f ∈ F_h is splitting if, for each value v ∈ D_h, the induced subspace of states {s ∈ S : φ_h(s)[f] = v} is a connected graph.

As an illustration, consider a simple problem where an agent needs to move along a corridor, pick up a key, and go back along the same path to open a door. We can describe this problem using two features: p (the position) and k (whether or not the key is held). If k ∈ F_h, then k is splitting: when k is false, the agent can still visit all the positions of the corridor, and likewise when k is true.

Theorem 2.
If all features in F_h are splitting, HIW(w_h, w_ℓ) is equivalent to a restricted version of IW(w_h + w_ℓ) with tuples of w_h features from F_h and w_ℓ features from F_ℓ.

Proof. Since each feature in F_h is splitting, whenever we apply IW(w_ℓ) in a high-level state s_h, the entire subspace of states induced by s_h is reachable. Since the restricted version of IW(w_h + w_ℓ) considers exactly w_ℓ features in F_ℓ, it will explore the same low-level states as IW(w_ℓ). At the high level, the restricted version of IW(w_h + w_ℓ) considers exactly w_h features in F_h, so it will explore the same high-level states as IW(w_h). Since the tuples in IW(w_h + w_ℓ) involve features in both F_h and F_ℓ, each state in the low-level search of a new high-level state is considered novel. Hence HIW(w_h, w_ℓ) explores the same states as the restricted version of IW(w_h + w_ℓ).

Theorem 3.
Let n_h = |F_h| and d_h = |D_h| be the number of high-level features and the high-level domain size, and define (n_ℓ, d_ℓ) analogously. The maximum number of novel states expanded by HIW(w_h, w_ℓ) is N(n_h, d_h, w_h) · N(n_ℓ, d_ℓ, w_ℓ).

Proof. At the high level, HIW(w_h, w_ℓ) applies IW(w_h), which expands a maximum of N(n_h, d_h, w_h) novel high-level states due to Theorem 1. For each novel high-level state, HIW(w_h, w_ℓ) applies IW(w_ℓ), which expands a maximum of N(n_ℓ, d_ℓ, w_ℓ) novel low-level states.

Note that the maximum number of novel states expanded by the unrestricted version of IW(w_h + w_ℓ) on the feature set F = F_h × F_ℓ is N(n_h + n_ℓ, max(d_h, d_ℓ), w_h + w_ℓ), which is much larger than N(n_h, d_h, w_h) · N(n_ℓ, d_ℓ, w_ℓ) in general.

Algorithm 1: Method for finding high-level features
Input: node n
N = ∅
if IsLeaf(n) and Depth(n) > 1 then
    P = Atoms(n) ∩ Atoms(Parent(n))          // common atoms
    if |P| < |Atoms(n)| then                 // ensure different state
        b = Branch(tree, n)                  // get branch root → n
        B = ∪_{i=1}^{Depth(n)−1} Atoms(b[i]) // all branch atoms
        N = P − B                            // keep (branch) novel atoms
return N

Serialized Hierarchical IW
In classical planning, the states are defined by a set of atoms, and, although one atom may be more informative than others, there is no hierarchical structure. In this section, we present a simple method for identifying relevant features that may split the state space. Then, we introduce an algorithm that performs a sequence of hierarchical searches, using the aforementioned method to discover new high-level feature candidates at each step. In the experiments section, we test the algorithm in a range of classical planning domains.
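The detection method of Algorithm 1 can be sketched directly on atom sets. This is a hypothetical sketch: atoms are (feature, value) pairs, `branch_atoms` holds the atom sets of the ancestors strictly above the parent, and all names are illustrative.

```python
def candidate_features(leaf_atoms, parent_atoms, branch_atoms):
    """Sketch of the heuristic in Algorithm 1: for a pruned leaf that
    differs from its parent, keep the atoms it shares with the parent
    that have not appeared earlier on the branch, i.e. atoms that
    changed exactly once before the trajectory was pruned."""
    common = leaf_atoms & parent_atoms       # atoms that did not just change
    if len(common) >= len(leaf_atoms):       # identical state: no candidate
        return set()
    seen = set().union(*branch_atoms) if branch_atoms else set()
    return common - seen                     # branch-novel common atoms
```

In the corridor example, a leaf pruned after moving back with the key shares the atom k=true with its parent, and k=true never appeared higher up the branch, so k is returned as a high-level feature candidate.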
Discovering High-Level Features
Consider a search tree generated by IW(1) for a problem of width 2. Is it possible to identify features that split the state space, so that the problem can be solved by HIW(1,1)? In this section, we present a simple method for detecting candidate abstract features from a set of features F.

We consider all trajectories in the tree and hypothesize that a feature that changes only once before a trajectory is pruned is a good candidate for a high-level feature. Consider again the corridor example in which an agent has to use a key to open a door. IW(1) prunes any trajectory that repeats a position p, and will not solve the problem. However, feature k splits the state space into two sub-problems: reaching the key (k = true), and going back to the door (k = false).

We can detect high-level features using the method detailed in Algorithm 1. For each pruned leaf node, we retrieve the features that are shared with its parent that have not appeared in that branch before. The intuition is that when a splitting feature f changes value for the first time, the next state is likely to be pruned by IW(1), since the same state has probably been visited earlier for the previous value of f.

Serialization
A simple algorithm that takes advantage of the previous method would be:

1. Perform an IW(1) search; if the goal is found, return.
2. Run the simple method on the IW(1) tree to find high-level features.
3. Run HIW(1,1).

This algorithm actually finds promising candidate features for small instances. For instance, it can solve the simple
corridor example. However, it fails on bigger instances, possibly because a single IW(1) search may not be sufficient to visit states that contain relevant features.

Inspired by the serialization of goals from Lipovetzky and Geffner (2012), we propose Serialized HIW (Algorithm 2), which runs a series of Hierarchical IW searches. It exploits one high-level feature candidate at each step, and discovers new relevant features when necessary. First, we run HIW(1,1), which is equivalent to IW(1) since we start with an empty set of high-level features. While the task is not solved, we randomly sample a pruned node and extract a set of high-level feature candidates using the method in Algorithm 1. In case the resulting set is empty, we repeat the operation until new feature candidates are found or there are no more pruned nodes to sample from, in which case we stop the search. Then, a feature candidate is sampled from the set, and the current search tree is restructured accordingly.

Algorithm 2: Serialized Hierarchical IW Search

Initialize: H = ∅, P = List(), T = Tree(), solved = false
while not solved do
    pruned, solved = HIW(T)
    if not solved then
        Append(P, pruned)
        while H == ∅ do
            if P is empty then
                return
            n = Pop(P)          // sample pruned node
            H = Heuristic(n)    // Algorithm 1
        h = Pop(H)              // sample candidate atom
        Restructure(T, h)       // create high-level nodes

Restructuring the tree mainly involves two operations: detaching subtrees at the low level and inserting new nodes at the high level. Although this may seem costly, both operations consist of modifying the data structure, while leaving the data untouched. Modifying a search tree, however, implies that the associated novelty table cannot be reused. Thus, a new novelty table needs to be generated for the modified low-level trees, and for the high-level tree. To reduce this cost, we only generate a new novelty table, if necessary, when the corresponding tree search is resumed.

Learning with Hierarchy
In this section we show how to combine HIW with a learning-based approach that uses a policy to direct search.
Count-based Rollout IW
Bandres, Bonet, and Geffner (2018) presented Rollout IW (RIW), a width-based algorithm that performs breadth-first search implicitly, from independent rollout trajectories. RIW(w) maintains the notion of width by modifying the definition of novelty: a state s is considered novel if any w-tuple of features of s has not appeared at a lower depth. With this, the authors achieve an algorithm that is equivalent to IW(w), but with better anytime behavior. This novelty measure actually allows for many width-based algorithms, since it unties the order of expanding nodes from the novelty measure.

Algorithm 3: Count-based Rollout IW

function Lookahead(N, C)
    while not StopCondition() and not N empty do
        n = Select(N, C)
        Rollout(n, N, C)

function Select(N, C)
    c = GetCounts(N, C)          // feature counts of nodes in N
    p ∝ exp(1/(τ(c + 1)))
    n = Sample(N, p)
    return n

function Rollout(n, N, C)
    while not StopCondition() do
        C[n.features]++
        n = Successor(n)
        if n == null or not Novel(n) or Terminal(n) then
            return
        Prune(N, n.features)
        N[n.features] = n

In our scenario, a subset of states is encapsulated under the same high-level state (i.e. a set of high-level features). Selecting one high-level state or another directly determines which low-level states are generated. In order to balance exploration within high-level states, we extend RIW with a selection method that depends on state visitation counts.

Our method, named Count-based Rollout IW, is detailed in Algorithm 3. Similar to RIW, it consists of two phases: node selection and rollout. A non-pruned node of the search tree is selected according to a softmax probability distribution inversely proportional to the visitation counts of its features. Then, a rollout is performed, generating nodes until one that does not pass the novelty test is found.

When a node n with features f passes the novelty test, there may be another node deeper in the tree with the same set of features f that needs to be pruned.
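The Select step of Algorithm 3 can be sketched numerically as follows. The feature tuples, counts, and temperature are illustrative; the point is only that rarely visited feature tuples receive higher selection probability.

```python
import math

def selection_probs(feature_tuples, counts, tau=0.5):
    """Softmax over logits 1/(tau*(count+1)), as in the Select step of
    Count-based Rollout IW: feature tuples with low visitation counts
    get higher selection probability (a sketch; inputs illustrative)."""
    logits = [1.0 / (tau * (counts.get(f, 0) + 1)) for f in feature_tuples]
    m = max(logits)                       # numerically stable softmax
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with counts {f_A: 10, f_B: 0}, the unvisited tuple f_B is assigned the larger probability, which biases rollouts toward under-explored high-level states.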
In the implementation, we identify such nodes by keeping a mapping N from features to unpruned nodes. When pruning a node, we leave the visitation count for features f, C[f], untouched. Thus, the new node n will be selected according to the existing visitation count. This way, we ensure a balance between different high-level states. Importantly, all nodes below the pruned node are not considered anymore for selection. Therefore, pruning a node implies removing it from the mapping N together with its descendants (function Prune).

Modifications to π-IW

Junyent, Jonsson, and Gómez (2019) introduced Policy-Guided IW (π-IW), an online replanning algorithm that alternates planning and learning. π-IW learns a policy π from the rewards observed in the IW tree, and uses π to guide future searches. However, in sparse-reward tasks, IW(1) may not reach any reward, especially when the planning horizon is too short. Here we extend the original π-IW in two ways: adding a better tie-breaking mechanism, and a value function estimate. We call this (still flat) version π-IW+.

When no reward is found during planning, the target policy for the learning step becomes the uniform distribution, and π-IW behaves as Rollout IW. In this case, π-IW may take a step towards a region of the search tree with low node count, and presumably with less novel states, losing valuable structure information provided by the IW search. To avoid that, we modify the target policy of π-IW to use the node counts in the search tree for tie-breaking (i.e. the number of descendants per action at the root node). The new target policy takes the form π_target ∝ π_rewards · π_counts, where the product is element-wise, and the counts policy is a softmax distribution:

π_counts(a|s) = exp(1/(τ(c(s, a) + 1))) / Σ_{a'∈A} exp(1/(τ(c(s, a') + 1)))

The temperature parameter for π_rewards is typically close to zero to ensure a greedy target policy.
Therefore, by performing the product, we achieve the effect of tie-breaking, especially if the temperature parameter for the counts is some orders of magnitude higher than the one for the rewards.

This tie-breaking helps find deeper rewards. However, π-IW will not exploit this information in subsequent episodes, since π_target is still based on the rewards of the current planning horizon. To amend this, we learn a value function, which we combine with the observed rewards to generate a better estimate of π_rewards. When backpropagating the rewards from the leaves to the root, we take the maximum between the observed rewards and our value estimate.

To learn a parameterized policy estimate π̂_θ, we take the same approach as that of Junyent, Jonsson, and Gómez (2019). Specifically, we represent π̂_θ using a neural network, and at each time step t, we use the cross-entropy loss to update the parameters θ:

L = −π_target,t(·|s_t)ᵀ log π̂_θ(·|s_t).

The difference in our work is that the target policy now uses visitation counts for tie-breaking. We also add an ℓ2 regularization term. To learn the value function, we take the same approach as in MuZero (Schrittwieser et al. 2019).
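The construction of the target policy and the cross-entropy loss can be sketched as follows. This is a hedged sketch, not the authors' implementation: the backed-up returns, counts, and temperatures are illustrative, and π_rewards is taken here as a low-temperature softmax over per-action returns.

```python
import math

def softmax(xs):
    m = max(xs)                          # stable softmax
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def target_policy(returns, root_counts, tau_r=0.01, tau_c=10.0):
    """Sketch of the pi-IW+ target: elementwise product of a
    near-greedy softmax over backed-up returns (pi_rewards) and a
    softmax over inverse root action counts (pi_counts), renormalized.
    tau_c is orders of magnitude larger than tau_r, so the counts
    only break ties among actions with equal returns."""
    pi_rewards = softmax([r / tau_r for r in returns])
    pi_counts = softmax([1.0 / (tau_c * (c + 1)) for c in root_counts])
    prod = [a * b for a, b in zip(pi_rewards, pi_counts)]
    z = sum(prod)
    return [p / z for p in prod]

def cross_entropy_loss(pi_target, pi_theta, eps=1e-12):
    """L = -sum_a pi_target(a) * log pi_theta(a)."""
    return -sum(t * math.log(p + eps) for t, p in zip(pi_target, pi_theta))
```

With returns (1, 1, 0) the first two actions tie under π_rewards; the less-visited one receives the larger target probability through π_counts, while the zero-return action keeps negligible mass.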
Policy-Guided Hierarchical IW
Hierarchical IW can be straightforwardly used for online replanning. At each step, we sample an action a ∼ π_target ∝ π_rewards · π_counts. To generate π_rewards, we need to backpropagate the rewards through the hierarchical tree. Starting from the high-level leaf nodes, we first backpropagate the rewards of the associated low-level trees. Then, to propagate this return between two high-level nodes, we feed the return to the corresponding low-level leaf nodes of the high-level parent, and repeat until we reach the high-level root. To generate π_counts, we backpropagate the counts of each high-level node to the root in a similar manner.

After executing an action a, we cache the resulting subtree for subsequent searches, similar to previous versions of IW. In this case, we need to take into account that some high-level states will not be reachable anymore, and we should thus remove them from the high-level tree.

[Table 1: For each domain, the number of single-goal instances (I) and, for IW(1), IW(2) and SHIW(1,1), the coverage, average number of nodes and time per solved instance. Domains include 8puzzle, Barman, Blocks World, Cybersecurity, Depots, Floortile, Freecell, Miconic, OpenStacks, Parking, Rovers, Sokoban and Tpp, among others.]
Experiments in Classical Planning
According to Theorem 2, some problems of width 2 can be solved using HIW(1,1). However, the theorem assumes that high-level features split the state space. In this section, we address the following questions:

• In practice, can HIW(1,1) solve problems of width 2?
• Can the method in Algorithm 1 identify splitting features?
• Is SHIW a good alternative to IW(2)?

Lipovetzky and Geffner (2012) empirically showed that most classical planning problems with atomic goals present a low width. In Table 1, we reproduce such results, and compare them to our algorithm. The table consists of 36 classical planning domains from the International Planning Competitions prior to 2012. For each domain, we show the number of single-goal instances (I), generated by splitting each instance with G goal atoms into G single-goal instances. Columns 3–5 show the number of instances solved by IW(1), IW(2) and SHIW(1,1), respectively, and the average number of nodes per solved instance in parentheses. In these experiments, SHIW(1,1) consists of two standard IW(1) searches, one at each level of abstraction.

In some domains, IW(1) has greater coverage than IW(2), e.g. in Woodworking. This is because we set a fixed node budget, and IW(2) may exhaust the budget before finding the goal. SHIW(1,1) outperforms IW(1) in all but five domains: Barman, OpenStacks, Parking, Scanalyzer and Woodworking. Compared to IW(2), SHIW(1,1) covers more or the same number of instances in 24 out of 36 domains. In 13 cases the average number of nodes per solved instance is lower for SHIW(1,1), and in 18 cases SHIW solved faster even when solving more instances.

Figure 1: Comparison between π-HIW(1,1) and π-HIW(n,1) in the small (top) and large (bottom) gridworld environments.

Pixel-based Testbeds
In this section, we test our approach, π-HIW, in pixel-based gridworld environments and Atari games. We use two levels of abstraction: the high-level planner is Count-based Rollout IW (Algorithm 3) and the low-level planner is π-IW (i.e. Rollout IW guided by the current policy estimate). The set of abstract features φ_h(s) consists of a discretization of the image, similar to the one used in Go-Explore (Ecoffet et al. 2019), where the image is divided into tiles and the mean pixel value of each tile is taken as the feature value. Usually, this is further quantized into a smaller subset (e.g. 8 pixel values). For the low-level set of features, we follow the methodology of Junyent, Jonsson, and Gómez (2019) and define φ_ℓ(s) as the boolean discretization of z(s), where z is the last layer of the neural network representing π̂_θ.

Gridworld Environments
We test our algorithm in two gridworld environments with sparse rewards (Figure 2). The goal of the agent (blue square) is to pick up the key (red) and open the door (green), avoiding walls (gray). The agent is rewarded with +1 only when the door is reached while holding the key. Any other state has a reward of 0, except if the agent hits a wall, in which case the episode terminates with a negative reward. We also end the episode after a fixed number of steps for the small and large environments, respectively. The observation is an image and the possible actions are { no-op, up, down, left, right }. The setting is similar to that of Junyent, Jonsson, and Gómez (2019), but with larger environments and therefore sparser rewards.

Figure 2: Snapshot of the two gridworld environments. The blue, red, green and gray squares represent the agent, the key, the door, and the walls, respectively. The optimal policy takes 36 and 62 steps, respectively.

We test our approach, π-HIW, with two different high-level widths: w_h = 1 and w_h = n = |F_h|, which we call π-HIW(1,1) and π-HIW(n,1), respectively. Even though IW(n) explores the entire high-level space, there is a single combination of n features, which makes the novelty check efficient. In the original IW algorithm, IW(n) is equivalent to a breadth-first search without state duplicates. Nevertheless, we use Count-Based Rollout IW, described in Algorithm 3. With this, we may achieve effective widths larger than 1 for tuples composed of high- and low-level features.

We compare our hierarchical approach to two baselines: π-IW, and our modified version π-IW+, which uses a value estimate and the number of nodes in the tree for tie-breaking. For the latter, we use a fixed temperature to generate π_counts. In order to bound the memory used by the planner at each step, we set a maximum tree size. The visitation-count temperature used by the high-level planner (Algorithm 3) is also fixed. All other hyperparameters are kept the same as in Junyent, Jonsson, and Gómez (2019).

Figure 3: Comparison between π-IW and the proposed hierarchical π-HIW in the Atari benchmark. The vertical axis indicates the relative improvement (s_π-HIW − s_π-IW)/s_π-IW, where s_π-IW and s_π-HIW are the scores of the flat and hierarchical versions, respectively. For both Montezuma and Venture, the relative improvement is ∞, since π-IW has a score of 0. See text for details.

Figure 1 shows results for both environments. We observe that the baseline π-IW does not perform well, reporting a reward close to zero in both environments. π-IW+, however, takes advantage of the value function and the count policy and learns to solve the first task, while also achieving a positive mean score on the second one. For the hierarchical version, which also includes the aforementioned modifications, we report results of π-HIW(1,1) and π-HIW(n,1) using different numbers of tiles in φ_h(s) and different numbers of values per tile.
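The novelty check discussed above can be sketched as follows; the class and method names are ours, not the paper's implementation. When the width equals the number of features, the only feature combination is the full vector, so the check degenerates to duplicate detection, as the text notes.

```python
from itertools import combinations

class NoveltyTable:
    """Novelty check for IW(w) over discrete features.

    A state is novel if at least one size-w tuple of
    (feature index, value) pairs has not been seen before.
    With w equal to the number of features, there is a single
    such tuple (the whole feature vector), so the check reduces
    to membership in a set of visited vectors.
    """

    def __init__(self, width):
        self.width = width
        self.seen = set()

    def is_novel(self, features):
        pairs = tuple(enumerate(features))
        novel = False
        for combo in combinations(pairs, self.width):
            if combo not in self.seen:
                novel = True
                self.seen.add(combo)
        return novel
```

For example, with width 1 a state is pruned only if every individual feature value was already observed, while with width n a state is pruned as soon as the exact feature vector repeats.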
We observe that, for the smaller task, 2x2 tiles are enough to obtain good performance, similar to the baseline π-IW+, and that performance degrades as the number of tiles increases. In the larger task, both π-HIW(1,1) and π-HIW(n,1) outperform the baseline, but need at least 3x3 tiles to perform well. Finally, the version with width w_h = n and 4x4 tiles performs best in the larger environment.

Atari Games
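The per-game comparison metric plotted on the vertical axis of Figure 3 can be sketched as below. The function name is ours; we write the metric in signed form, so that regressions show up as negative values, and the flat planner scoring zero yields an infinite improvement (as for Montezuma's Revenge and Venture).

```python
def relative_improvement(score_hiw, score_iw):
    """Relative improvement of the hierarchical planner over the flat one.

    Returns (s_HIW - s_IW) / s_IW, or infinity when the flat
    planner's score is zero. Sketch of the Figure 3 metric,
    not the paper's exact evaluation code.
    """
    if score_iw == 0:
        return float("inf")
    return (score_hiw - score_iw) / score_iw
```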
We finish this section with a set of experiments using the Atari simulator. In this case, we do not optimize the hyperparameters, keeping the same quantization of pixel values and the same tile size. Figure 3 shows a comparison between π-HIW(n,1) and π-IW using the same setup as in Junyent, Jonsson, and Gómez (2019), but half the budget of simulator interactions. We observe that π-HIW improves over its predecessor π-IW in more than half of the games. In particular, π-HIW achieves a positive score in hard exploration games such as Montezuma's Revenge and Venture. This is remarkable, since it had not previously been reported for width-based planners.

Figure 4: Performance of π-IW, π-IW+ and π-HIW in Montezuma's Revenge. Average over 5 runs with different random seeds. Shades show the maximum and minimum values.

Figure 4 shows the learning curve in the game of Montezuma's Revenge. The full table of results is included in the appendix. From these results, we can confirm that π-HIW benefits from the state abstractions provided by a simple downsampling of the image.

Conclusions
We have presented a novel hierarchical approach to width-based planning. Our approach uses different feature mappings to create several levels of the planning hierarchy, which makes it possible to use different search algorithms at different levels of the hierarchy. Specifically, we propose to use Iterated Width at both levels, resulting in the hierarchical search algorithm HIW(w_h, w_ℓ). Experiments in planning benchmarks show that HIW(1,1) is competitive with IW(2), even though it expands fewer nodes. When combined with a policy learning scheme, HIW is able to achieve a positive score in hard exploration Atari games. For future work, a promising approach is to explore different combinations of search algorithms at different levels of the hierarchy.

References

Bandres, W.; Bonet, B.; and Geffner, H. 2018. Planning With Pixels in (Almost) Real Time. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018.
Bonet, B.; and Geffner, H. 2014. Belief Tracking for Planning with Sensing: Width, Complexity and Approximations. Journal of Artificial Intelligence Research.
Currie, K.; and Tate, A. 1991. O-Plan: The Open Planning Architecture. Artificial Intelligence.
ICAPS 2020 Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning.
Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-Explore: A New Approach for Hard-Exploration Problems. arXiv preprint arXiv:1901.10995.
Erol, K.; Hendler, J.; and Nau, D. S. 1996. Complexity Results for HTN Planning. Annals of Mathematics and Artificial Intelligence.
Fikes, R. E.; Hart, P. E.; and Nilsson, N. J. 1972. Learning and Executing Generalized Robot Plans. Artificial Intelligence 3: 251–288.
Francès, G.; Ramírez, M.; Lipovetzky, N.; and Geffner, H. 2017. Purely Declarative Action Descriptions are Overrated: Classical Planning with Simulators. In
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 4294–4301. doi:10.24963/ijcai.2017/600.
Geffner, T.; and Geffner, H. 2015. Width-Based Planning for General Video-Game Playing. In Proceedings of the Eleventh AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE-15).
Junyent, M.; Jonsson, A.; and Gómez, V. 2019. Deep Policies for Width-Based Planning in Pixel Domains. In Proceedings of the Twenty-Ninth International Conference on Automated Planning and Scheduling, ICAPS'19, 646–654. AAAI Press.
Knoblock, C. A. 1990. Learning Abstraction Hierarchies for Problem Solving. In Proceedings of the Eighth National Conference on Artificial Intelligence - Volume 2, 923–928. AAAI Press.
Korf, R. E. 1985. Macro-Operators: A Weak Method for Learning. Artificial Intelligence.
Lipovetzky, N.; and Geffner, H. 2012. Width and Serialization of Classical Planning Problems. In Proceedings of the 20th European Conference on Artificial Intelligence, 540–545.
Lipovetzky, N.; and Geffner, H. 2017. Best-First Width Search: Exploration and Exploitation in Classical Planning. In Proceedings of the 31st Conference on Artificial Intelligence (AAAI 2017).
Lipovetzky, N.; Ramirez, M.; and Geffner, H. 2015. Classical Planning with Simulators: Results on the Atari Video Games. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, IJCAI'15, 1610–1616. AAAI Press. ISBN 9781577357384.
Ramirez, M.; Papasimeon, M.; Lipovetzky, N.; Benke, L.; Miller, T.; Pearce, A. R.; Scala, E.; and Zamani, M. 2018. Integrated Hybrid Planning and Programmed Control for Real Time UAV Maneuvering. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1318–1326.
Sacerdoti, E. D. 1974. Planning in a Hierarchy of Abstraction Spaces. Artificial Intelligence.
Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; Lillicrap, T.; and Silver, D. 2019. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. arXiv preprint arXiv:1911.08265.
Shleyfman, A.; Tuisov, A.; and Domshlak, C. 2016. Blind Search for Atari-Like Online Planning Revisited. In International Joint Conference on Artificial Intelligence, 3251–3257. AAAI Press.
Scores of π-IW(1), π-IW(1)+ and π-HIW(n,1) over 54 Atari games. Best score given in bold.

Proof of Theorem 1

Here we prove Theorem 1, which states that for n features with bounded domain size d, the maximum number of novel nodes expanded by IW(w), 0 ≤ w < n, is given by
\[
N(n, d, w) = \sum_{k=0}^{w} \binom{n-1-k}{w-k} d^k (d-1)^{w-k}.
\]
The proof is by induction on pairs of integers (n, w). The base case is given by (n, 0), in which case we have
\[
N(n, d, 0) = \binom{n-1}{0} d^0 (d-1)^0 = 1.
\]
For (n, w) such that 0 < w < n-1, by the induction hypothesis we assume that Theorem 1 holds for (n-1, w-1) and (n-1, w). Applying the recursive definition yields
\[
\begin{aligned}
N(n, d, w) &= (d-1) N(n-1, d, w-1) + N(n-1, d, w) \\
&= (d-1) \sum_{k=0}^{w-1} \binom{n-2-k}{w-1-k} d^k (d-1)^{w-1-k} + \sum_{k=0}^{w} \binom{n-2-k}{w-k} d^k (d-1)^{w-k} \\
&= \sum_{k=0}^{w-1} \left[ \binom{n-2-k}{w-1-k} + \binom{n-2-k}{w-k} \right] d^k (d-1)^{w-k} + \binom{n-2-w}{0} d^w (d-1)^0 \\
&= \sum_{k=0}^{w-1} \binom{n-1-k}{w-k} d^k (d-1)^{w-k} + \binom{n-1-w}{0} d^w (d-1)^0 \\
&= \sum_{k=0}^{w} \binom{n-1-k}{w-k} d^k (d-1)^{w-k}.
\end{aligned}
\]
Here, we used the identities \(\binom{n-1}{m-1} + \binom{n-1}{m} = \binom{n}{m}\), 0 < m < n, and \(\binom{n}{0} = 1 = \binom{n+1}{0}\).

For (n, w) such that w = n-1, by the induction hypothesis we assume that Theorem 1 holds for (n-1, w-1). Applying the recursive definition yields
\[
\begin{aligned}
N(n, d, w) &= (d-1) N(n-1, d, w-1) + N(n-1, d, w) \\
&= (d-1) \sum_{k=0}^{w-1} \binom{n-2-k}{w-1-k} d^k (d-1)^{w-1-k} + d^w \\
&= \sum_{k=0}^{w-1} \binom{n-1-k}{w-k} d^k (d-1)^{w-k} + \binom{n-1-w}{0} d^w (d-1)^0 \\
&= \sum_{k=0}^{w} \binom{n-1-k}{w-k} d^k (d-1)^{w-k}.
\end{aligned}
\]
Here, we used the definition N(n-1, d, w) = N(w, d, w) = d^w and the identity \(\binom{n}{n} = 1 = \binom{n+1}{n+1}\), which is applicable since (w-1-k) = (n-2-k).

To obtain a compact upper bound on N(n, d, w), we can write
\[
N(n, d, w) = \sum_{k=0}^{w} \binom{n-1-k}{w-k} d^k (d-1)^{w-k} \le d^w \sum_{k=0}^{w} \binom{n-1}{w-k} = d^w \sum_{k=0}^{w} \binom{n-1}{k} \le d^w n^w = (nd)^w.
\]
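As a sanity check on the derivation above, the closed form can be compared numerically against the recursive definition for small parameter values. The helper names below are ours; the two base cases are N(n, d, 0) = 1 and N(w, d, w) = d^w, as used in the proof.

```python
from math import comb

def N_closed(n, d, w):
    """Closed form: sum_{k=0}^{w} C(n-1-k, w-k) * d^k * (d-1)^(w-k)."""
    return sum(comb(n - 1 - k, w - k) * d**k * (d - 1)**(w - k)
               for k in range(w + 1))

def N_rec(n, d, w):
    """Recursive definition with base cases N(n,d,0)=1 and N(w,d,w)=d^w."""
    if w == 0:
        return 1
    if w == n:
        return d**n
    return (d - 1) * N_rec(n - 1, d, w - 1) + N_rec(n - 1, d, w)

# The two definitions agree for all small (n, d, w) with 0 <= w < n.
assert all(N_closed(n, d, w) == N_rec(n, d, w)
           for n in range(1, 8) for d in range(2, 5) for w in range(n))
```

For instance, N(2, 2, 1) = 3: with two boolean features, IW(1) can expand at most three novel nodes.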