Improving Human Decision-Making by Discovering Efficient Strategies for Hierarchical Planning
Saksham Consul, Lovis Heindrich, Jugoslav Stojcheski, Falk Lieder
Abstract
To make good decisions in the real world, people need efficient planning strategies because their computational resources are limited. Knowing which planning strategies would work best for people in different situations would be very useful for understanding and improving human decision-making. But our ability to compute those strategies used to be limited to very small and very simple planning tasks. To overcome this computational bottleneck, we introduce a cognitively-inspired reinforcement learning method that exploits the hierarchical structure of human behavior. The basic idea is to decompose sequential decision problems into two sub-problems: setting a goal and planning how to achieve it. This hierarchical decomposition enables us to discover optimal strategies for human planning in larger and more complex tasks than was previously possible. The discovered strategies outperform existing planning algorithms and achieve a super-human level of computational efficiency. We demonstrate that teaching people to use those strategies significantly improves their performance in sequential decision-making tasks that require planning up to eight steps ahead. By contrast, none of the previous approaches was able to improve human performance on these problems. These findings suggest that our cognitively-informed approach makes it possible to leverage reinforcement learning to improve human decision-making in complex sequential decision problems. Future work can leverage our method to develop decision support systems that improve human decision-making in the real world.
Keywords: decision-making; planning; automatic strategy discovery; reinforcement learning; resource rationality; boosting
Acknowledgements:
This project was funded by grant number CyVy-RF-2019-02 from the Cyber Valley Research Fund. The authors would like to thank Yash Raj Jain, Frederic Becker, Aashay Mehta, and Julian Skirzynski for helpful discussions.

Max Planck Institute for Intelligent Systems, Tübingen, Germany, 72076
E-mail: [email protected], ORCID: 0000-0003-2746-6110
1 Introduction

To make good decisions people often plan many steps ahead. This requires efficient planning strategies because the number of possible action sequences grows exponentially with the number of steps and people's cognitive resources are limited. Recent work has shown that teaching people clever decision strategies is a promising way to improve human decision-making (Hertwig and Grüne-Yanoff, 2017; Hafenbrädl et al., 2016); this approach is known as boosting. One of the bottlenecks of boosting is that discovering clever decision strategies that work well in the real world is very challenging and time consuming. For boosting to be effective, the taught strategies have to be well-adapted to the decisions and environments in which people will use them (Simon, 1956; Gigerenzer and Selten, 2002; Todd and Gigerenzer, 2012). The cognitive modeling paradigm of resource-rational analysis (Lieder and Griffiths, 2020) can be used to mathematically define planning strategies that are optimally adapted to the problems people have to solve and the cognitive resources people can use to solve those problems (Callaway et al., 2018b, 2020). Knowing those strategies can be very useful for understanding and improving human decision-making (Callaway et al., 2018b, 2020; Lieder et al., 2019, 2020).

Recent work has developed algorithms for computing such optimal strategies from a model of the problem to be solved, the cognitive operations people have available to solve that problem, and how costly those operations are (Callaway et al., 2018b, 2020; Lieder et al., 2017; Griffiths, 2020). We refer to this approach as automatic strategy discovery. This approach frames planning strategies as policies for selecting planning operations. Its methods use algorithms from dynamic programming and reinforcement learning (Sutton and Barto, 2018) to compute the policy that maximizes the expected reward of executing the resulting plan minus the cost of the computations that the policy would perform to arrive at that plan (Callaway et al., 2018a; Lieder and Griffiths, 2020; Griffiths et al., 2019). Recent work used dynamic programming to discover optimal planning strategies for different three-step planning problems, and found that it is possible to improve human planning on those problems by teaching people the automatically discovered strategies (Lieder et al., 2019, 2020). Subsequent work applied reinforcement learning to approximate strategies for planning up to six steps ahead in a task where each step entailed choosing between two options and there were only two possible rewards (Callaway et al., 2018a). But none of the existing strategy discovery methods (Callaway et al., 2018a; Kemtur et al., 2020) is scalable enough to discover good planning strategies for more complex environments. This is because the run time of these methods grows exponentially with the size of the planning problem. This confined automatic strategy discovery methods to very small and very simple planning tasks. Discovering planning strategies that achieve, let alone exceed, the computational efficiency of human planning is still out of reach for virtually all practically relevant sequential decision problems.

To overcome this computational bottleneck, we developed a scalable method for discovering planning strategies that achieve a (super-)human level of computational efficiency on some of the planning problems that are too large for existing strategy discovery methods.
Our approach draws inspiration from the hierarchical structure of human behavior (Botvinick, 2008; Miller et al., 1960; Carver and Scheier, 2001; Tomov et al., 2020). Research in cognitive science and neuroscience suggests that the brain decomposes long-term planning into goal-setting and planning at multiple hierarchically-nested timescales (Carver and Scheier, 2001; Botvinick, 2008).
2 Background

Before we introduce, evaluate, and apply our new method for discovering hierarchical planning strategies, we briefly introduce the concepts and methods that it builds on. We start with the theoretical framework we use to define what constitutes a good planning strategy.

2.1 Resource rationality

Previous work has shown that people's planning strategies are jointly shaped by the structure of the environment and the cost of planning (Callaway et al., 2018b, 2020). This idea has been formalized within the framework of resource-rational analysis (Lieder and Griffiths, 2020). Resource-rational analysis is a cognitive modeling paradigm that derives process models of people's cognitive strategies from the assumption that the brain makes optimal use of its finite computational resources. These computational resources are modeled as a set of elementary information processing operations. Each of these operations has a cost that reflects how many computational resources it requires. Those operations are assumed to be the building blocks of people's cognitive strategies. To be resource-rational, a planning strategy has to achieve the optimal tradeoff between the expected return of the resulting decision and the expected cost of the planning operations it will perform to reach that decision.
Both depend on the structure of the environment. The degree to which a planning strategy $h$ is resource-rational in a given environment $e$ can be quantified by the sum of expected rewards achieved by executing the plan it generates ($R_{\text{total}}$) minus the expected computational cost it incurs to make those choices, that is

$\mathrm{RR}(h, e) = \mathbb{E}[R_{\text{total}} \mid h, e] - \lambda \cdot \mathbb{E}[N \mid h, e]$,  (1)

where $\lambda$ is the cost of performing one planning operation and $N$ is the number of planning operations that the strategy performs. Throughout this article, we use this measure as our primary criterion for the performance of planning algorithms, automatically discovered strategies, and people.

2.2 Discovering resource-rational planning strategies by solving metalevel MDPs

Callaway et al. (2018b) developed a method to automatically derive resource-rational planning strategies by modeling the optimal planning strategy as the solution to a metalevel Markov Decision Process (metalevel MDP). In general, a metalevel MDP $M = (\mathcal{B}, \mathcal{C}, T, r)$ is defined as an undiscounted MDP where $b \in \mathcal{B}$ represents the belief state, $T(b, c, b')$ is the probability of transitioning from belief state $b$ to belief state $b'$ by performing computation $c \in \mathcal{C}$, and $r(b, c)$ is a reward function that describes the costs and benefits of computation (Hay et al., 2014). It is important to note that the actions in a metalevel MDP are computations, which are different from object-level actions: the former are planning operations and the latter are physical actions that move the agent through the environment. Previous methods for discovering near-optimal decision strategies (Lieder et al., 2017; Callaway et al., 2018b,a) have been developed for and evaluated in a planning task known as the Mouselab-MDP paradigm (Callaway et al., 2017, 2020).

2.3 The Mouselab-MDP paradigm

The Mouselab-MDP paradigm was developed to make people's elementary planning operations observable. This is achieved by externalizing the process of planning as information seeking. Concretely, the Mouselab-MDP paradigm illustrated in Figure 1 shows the participant a map of an environment where each location harbors an occluded positive or negative reward.

Fig. 1: Illustration of the Mouselab-MDP paradigm. Rewards are revealed by clicking with the mouse, prior to selecting a path using the keyboard. This figure shows one concrete task that can be created using this paradigm.

To find out which path to take, the participant has to click on the locations they consider visiting to uncover their rewards. Each of these clicks is recorded and interpreted as the reflection of one elementary planning operation. The cost of planning is externalized by the fee that people have to pay for each click. People can stop planning and start navigating through the environment at any time. But once they have started to move through the environment they cannot resume planning. The participant has to follow one of the paths along the arrows to one of the outermost nodes.

To evaluate the resource-rational performance metric specified in Equation 1 in the Mouselab-MDP paradigm, we measure $R_{\text{total}}$ by the sum of rewards along the chosen path, set $\lambda$ to the cost of clicking, and measure $N$ by the number of clicks that a strategy made on a given trial.
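As a concrete illustration, the following minimal sketch estimates Equation 1 from a set of simulated or logged trials; the trial record format is a hypothetical stand-in, not the paper's implementation.

```python
# Sketch: scoring a planning strategy by the resource-rationality criterion
# of Equation 1, estimated from trials. Each trial records the rewards along
# the executed path (R_total) and the number of clicks (N).

def resource_rationality(trials, lam):
    """Estimate RR(h, e) = E[R_total] - lam * E[N] from a list of trials."""
    total = 0.0
    for t in trials:
        r_total = sum(t['path_rewards'])        # R_total for this trial
        total += r_total - lam * t['n_clicks']  # return minus planning cost
    return total / len(trials)

# Example: two Mouselab-style trials with a click cost of 1
trials = [
    {'path_rewards': [4, -2, 8], 'n_clicks': 5},
    {'path_rewards': [1, 6, 3], 'n_clicks': 2},
]
print(resource_rationality(trials, lam=1.0))  # ((10 - 5) + (10 - 2)) / 2 = 6.5
```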
Many other tasks can be created by varying the size and layout of the environment, the distributions that the rewards are drawn from, and the cost of clicking.

The structure of a Mouselab-MDP environment can be modelled as a directed acyclic graph (DAG), where each node is associated with a reward that is sampled from a probability distribution, and each edge represents a transition from one node to another. In this article, we refer to the agent's initial position as the root node, the most distant nodes as goal nodes, and all other nodes as intermediate nodes.

Figure 2 shows an instance of a Mouselab-MDP environment that we use extensively in this article. There, the variance of each node's reward distribution increases with the node's depth. This models that the values of distant states are more variable than the values of proximal states. Therefore, the goal nodes have a higher variance than the intermediate nodes.
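The following sketch builds such a DAG with depth-dependent reward variance; the chain layout and parameter values are illustrative stand-ins for the richer sub-graph structure shown in Figure 2.

```python
# Sketch: a Mouselab-MDP-style environment as a DAG with reward variance
# that doubles with depth, as described above. The single chain per goal is
# an illustrative simplification of the paper's sub-graphs.
import random

def make_environment(n_goals=2, base_var=5.0, seed=0):
    rng = random.Random(seed)
    edges = {0: []}          # adjacency list; node 0 is the root
    rewards = {}             # hidden rewards, revealed for a fee per click
    node = 1
    for _ in range(n_goals):
        prev = 0
        for depth in range(1, 4):                 # a 3-step chain per goal
            var = base_var * 2 ** (depth - 1)     # variance doubles with depth
            rewards[node] = rng.gauss(0.0, var ** 0.5)
            edges[prev].append(node)
            edges[node] = []
            prev, node = node, node + 1
    return edges, rewards

edges, rewards = make_environment()
print(edges)    # e.g. {0: [1, 4], 1: [2], 2: [3], 3: [], ...}
print(rewards)  # occluded node rewards; deeper nodes are more variable
```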
Fig. 2: Mouselab-MDP environment with 2 goals. Nodes associated with each goal are denoted in green (circles) and red (triangles), respectively. The goal nodes have darker shades of green and red (diamonds), and the root node's color is blue (square). (A node's depth is defined as the length of the longest path connecting the node to the root node.)

2.4 The metalevel MDP of the Mouselab-MDP paradigm

In the metalevel MDP model of this task, each belief state $b \in \mathcal{B}$ encodes probability distributions over the rewards that the nodes might harbor. The possible computations are $\mathcal{C} = \{\xi_1, \ldots, \xi_M, c_{1,1}, \ldots, c_{M,N}, \perp\}$, where $c_{g,n}$ reveals the reward at intermediate node $n$ on the path to goal $g$, and $\xi_g$ reveals the value of the goal node $g$. For simplicity, each computation is assigned the same fixed cost. When the value of a node is revealed, the new belief about the value of the inspected node assigns a probability of one to the observed value. The metalevel operation $\perp$ terminates planning and triggers the execution of the plan. The agent then selects one of the paths to a goal state that has the highest expected sum of rewards according to the current belief state.

2.5 Methods for solving metalevel MDPs

In their seminal paper, Russell and Wefald (1991) introduced the theory of rational metareasoning. In Russell and Wefald (1992), they define the value of computation $\mathrm{VOC}(c, b)$ as the expected improvement in decision quality achieved by performing computation $c$ in belief state $b$ and continuing optimally, minus the cost of computation $c$. Using this formalization, the optimal planning strategy $\pi^*_{\text{meta}}$ selects the computations that maximize the value of computation (VOC), that is

$\pi^*_{\text{meta}} = \arg\max_c \mathrm{VOC}(c, b)$.  (2)

When the VOC is non-positive for all available computations, the policy terminates ($c = \perp$) and executes the best object-level action according to the current belief state. Hence, $\mathrm{VOC}(\perp, b) = 0$. In general, the VOC is computationally intractable, but it can be approximated (Callaway et al., 2018a). Lin et al. (2015) estimated the VOC by the myopic value of computation ($\mathrm{VOI}_1$), which is the expected improvement in decision quality that would be attained by terminating deliberation immediately after performing the computation. Hay et al. (2014) approximated rational metareasoning by solving multiple smaller metalevel MDPs that each define the problem of deciding between one object-level action and its best alternative.

Inspired by research on how people learn how to plan (Krueger et al., 2017), Callaway et al. (2018a) developed a reinforcement learning method for learning when to select which computation. This method, Bayesian Metalevel Policy Search (BMPS), uses Bayesian optimization to find a policy that maximizes the expected return of a metalevel MDP. The policy space is parameterized by weights that determine to which extent computations are selected based on the myopic VOC versus less short-sighted approximations of the value of computation. It thereby improves upon approximating the value of computation by the myopic $\mathrm{VOI}_1$ by considering the possibility that the optimal metalevel policy might perform additional computations afterwards.
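To make the greedy meta-policy of Equation 2 concrete, here is a minimal sketch; the `voc` and `update` callables are assumed to be given (in general the VOC is intractable and must be approximated, as discussed next).

```python
# Sketch of the meta-policy in Equation 2: greedily perform the computation
# with the highest VOC, and terminate once no computation has positive VOC
# (recall that VOC(bottom, b) = 0).

def metareason(belief, computations, voc, update):
    """Return the sequence of computations the greedy VOC policy performs."""
    performed = []
    computations = list(computations)
    while computations:
        best = max(computations, key=lambda c: voc(c, belief))
        if voc(best, belief) <= 0:
            break                         # terminate and act on the belief
        belief = update(belief, best)     # perform the chosen computation
        computations.remove(best)         # each node is inspected at most once
        performed.append(best)
    return performed
```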
Concretely, BMPS approximates the value of computation by interpolating between the myopic $\mathrm{VOI}_1$ and the value of perfect information, that is

$\widehat{\mathrm{VOC}}(c, b; \mathbf{w}) = w_1 \cdot \mathrm{VOI}_1(c, b) + w_2 \cdot \mathrm{VPI}(b) + w_3 \cdot \mathrm{VPI}_{\text{sub}}(c, b) - w_4 \cdot \mathrm{cost}(c)$,  (3)

where $\mathrm{VPI}(b)$ denotes the value of perfect information; it assumes that all computations possible in the given belief state would take place. Furthermore, $\mathrm{VPI}_{\text{sub}}(c, b)$ measures the benefit of having full information about the subset of parameters that the computation reasons about (e.g., the values of all paths that pass through the node evaluated by the computation), $\mathrm{cost}(c)$ is the cost of the computation $c$, and $\mathbf{w} = (w_1, w_2, w_3, w_4)$ is a vector of weights. Since the VOC and $\mathrm{VPI}_{\text{sub}}$ are bounded by the $\mathrm{VOI}_1$ from below and by the VPI from above, the approximation $\widehat{\mathrm{VOC}}$ is a convex combination of these features, and the weights $w_1$, $w_2$, and $w_3$ are constrained to a probability simplex. Finally, the weight associated with the cost feature satisfies $w_4 \in [1, h]$, where $h$ is the maximum number of available computations to be performed. The values of these weights are computed using Bayesian optimization (Mockus, 2012); discovering the optimized weights is analogous to discovering the optimal policy in the environment.
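A minimal sketch of the feature combination in Equation 3 follows; the feature functions themselves are assumed to be given.

```python
# Sketch of Equation 3: BMPS scores each computation by a weighted blend of
# VOI features minus a weighted cost.

def voc_hat(c, b, w, voi1, vpi, vpi_sub, cost):
    w1, w2, w3, w4 = w   # w1 + w2 + w3 = 1 (simplex); w4 in [1, h]
    return (w1 * voi1(c, b)
            + w2 * vpi(b)
            + w3 * vpi_sub(c, b)
            - w4 * cost(c))
```

The best computation under this approximation is then selected exactly as in Equation 2, with $\widehat{\mathrm{VOC}}$ in place of the exact VOC.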
Alternative methods for solving metalevel MDPs include the work of Sezener and Dayan (2020) and Svegliato and Zilberstein (2018). Sezener and Dayan (2020) solve a multi-armed bandit problem using a Monte Carlo tree search based on static and dynamic values of computation. In a bandit problem, unlike most models of planning, transitions depend purely on the chosen action and not on the current state. Svegliato and Zilberstein (2018) devised an approximate metareasoning algorithm that uses temporal difference (TD) learning to decide when to terminate the planning process.

2.6 Intelligent cognitive tutors

Utilising the optimal planning strategies discovered by solving metalevel MDPs, Lieder et al. (2019, 2020) have developed intelligent tutors that teach people the optimal planning strategies for a given environment. Most of the tutors let people practice planning in the Mouselab-MDP paradigm and provide immediate feedback on each chosen planning operation. The feedback has two components: (1) information about what the optimal planning strategy would have done; and (2) an affective element given as positive feedback (e.g., "Good job!") or negative feedback. The negative feedback included a slightly frustrating time-out penalty during which participants were forced to wait idly for a duration that was proportional to how sub-optimal their planning operation had been.

Lieder et al. (2020) found that participants were able to learn to use the automatically discovered strategies, remember them, and use them in novel environments with a similar structure. These findings suggest that automatic strategy discovery can be used to improve human decision-making if the discovered strategies are well-adapted to the situations where people might use them. Additionally, Lieder
et al. (2020) also found that video demonstrations of click sequences performed by the optimal strategy are an equally effective teaching method as providing immediate feedback. Here, we build on these findings to develop cognitive tutors that teach automatically discovered strategies by demonstrating them to people.
3 Discovering efficient hierarchical planning strategies

All previous strategy discovery methods evaluate and compare the utilities of all possible computations in each step. As such, these algorithms have to explore the entire metalevel MDP's state space, which grows exponentially with the number of nodes. As a consequence, these methods do not scale well to problems with large state spaces and long planning horizons. This is especially true of the Bayesian Metalevel Policy Search algorithm (BMPS; Callaway et al., 2018a), whose run time is exponential in the number of nodes of the planning problem. In contrast to the exhaustive enumeration of all possible planning operations performed by those methods, people would not even consider making detailed low-level motor plans for navigating to a specific distant location (e.g., Terminal C of San Juan Airport) until they arrive at a high-level plan that leads them to or through that location (Tomov et al., 2020). Here, we build on insights about human planning to develop a more scalable method for discovering efficient planning strategies.

3.1 Hierarchical problem decomposition

To efficiently plan over long horizons, people (Botvinick, 2008; Carver and Scheier, 2001; Tomov et al., 2020) and hierarchical planning algorithms (Kaelbling and Lozano-Pérez, 2010; Sacerdoti, 1974; Marthi et al., 2007; Wolfe et al., 2010) decompose the problem into first setting goals and then planning how to achieve them. This two-stage process breaks large planning problems down into smaller problems that are easier to solve. To discover hierarchical planning strategies automatically, our proposed strategy discovery algorithm decomposes the problem of discovering planning strategies into the sub-problems of discovering a strategy for selecting a goal and discovering a strategy for planning the path to the chosen goal. A pictorial representation is given in Figure 3.

Fig. 3: Flow diagram of the hierarchically discovered planning strategy. The high level controller decides on goal computations. The low level controller decides the computations to be performed within a goal.

Formally, this is accomplished by decomposing the metalevel MDP defining the strategy discovery problem into two metalevel MDPs with smaller state and action spaces. Constructing metalevel MDPs for goal-setting and path planning is easy when there is a small set of candidate goals. Such candidate goals can often be identified based on prior knowledge or the structure of the domain (Schapiro et al., 2013; Solway et al., 2014). A high level controller solves the goal-setting MDP, whereas a low level controller solves the goal-achievement MDP. When a controller is in control, a computation is selected from the corresponding metalevel MDP and performed. A meta controller compares the expected reward of the current goal with the expected reward of the next best goal and decides when control should be switched between the high level and the low level controller. Hence, when the low level controller discovers that the current goal is not as valuable as expected, the meta controller allows for goal switching. (The results presented in the paper involve up to … possible belief states.)

The metalevel MDP model of the sub-problem of goal selection (Section 3.1.1) only includes computations for estimating the values of a small set of candidate goal states ($V(g_1), \cdots, V(g_M)$). This means that goals are chosen without considering how costly it would be to achieve them.
This makes sense when all goals are known to be achievable and the differences between the values of alternative goal states are substantially larger than the differences between the costs of reaching them. This is arguably true for many challenges people face in real life. For instance, when a high school student plans their career, the difference between the long-term values of studying computer science versus becoming a janitor is likely much larger than the difference between the costs of achieving either goal. This is to be expected when the time it will take to achieve the goals is short relative to a person's lifetime.

The goal-achievement MDP (Section 3.1.2) only includes computations that update the estimated costs of alternative paths to the chosen goal by determining the costs or rewards of state-action pairs that lie on those paths. Restricting the computations to the selected goal raises a potential issue: some computations that are irrelevant in the goal-achievement MDP can be highly valuable when considering the complete problem. One example is a computation that reveals the value of a node lying on an unavoidable path to the selected goal. This problem is further accentuated if such a node might harbor a highly positive or negative reward. To rectify this problem, the meta controller facilitates goal switching. A real-world example of the necessity to switch goals after discovering an unlikely, highly negative event is to switch from investing in the stock market to investing in real estate after discovering a likely stock market crash.

Decomposing the strategy discovery problem into these two components reduces the number of possible computations that the metareasoning method has to choose between from $M \cdot N$ to $M + N$, where $M$ is the number of possible final destinations (goals) and $N$ is the number of steps to the chosen goal (see Appendix A.2). Perhaps the most promising metareasoning method for automatic strategy discovery is the Bayesian Metalevel Policy Search algorithm (BMPS; Callaway et al., 2018a; Kemtur et al., 2020). To solve the two types of metalevel MDPs introduced below more effectively, we also introduce an improvement of the BMPS algorithm in Section 3.2.

3.1.1 The goal-selection metalevel MDP

The optimal strategy for setting the goal can be formalized as the solution to the metalevel MDP $M_H = (\mathcal{B}_H, \mathcal{C}_H, T_H, R_H)$, where the belief state $b_H(g) \in \mathcal{B}_H$ denotes the expected cumulative reward that the agent can attain starting from the goal state $g \in \mathcal{G}$. The high-level computations are $\mathcal{C}_H = \{\xi_1, \ldots, \xi_M, \perp_H\}$, where $\xi_g$ reveals the value $V(g)$ of the goal node $g$, and $\perp_H$ terminates the high-level planning, leading the agent to select the goal with the highest value according to its current belief state. The reward function is $R_H(b_H, c_H) = -\lambda_H$ for $c_H \in \{\xi_1, \ldots, \xi_M\}$ and $R_H(b_H, \perp_H) = \max_{k \in \mathcal{G}} \mathbb{E}[b_H(k)]$.

3.1.2 The goal-achievement metalevel MDP

Having set a goal to pursue, the agent has to find the optimal planning strategy to achieve it. This planning strategy is formalized as the solution to the metalevel MDP $M_L = (\mathcal{B}_L, \mathcal{C}_L, T_L, R_L)$, where the belief state $b \in \mathcal{B}_L$ denotes the expected reward for each node. The agent can only perform a subset of meta-actions $\mathcal{C}_{g,L} = \{c_{g,1}, \ldots, c_{g,N}, \perp_L\}$, where $c_{g,n}$ reveals the reward at node $n$ in the goal set $h_g \in H$. A goal set $h_g \in H$ refers to all nodes, including the goal node, which lie on the paths leading to goal $g \in \mathcal{G}$.
Furthermore, $\perp_L$ terminates planning and leads the agent to select the path with the highest expected sum of rewards according to the current belief state. The reward function is $R_L(b, c_g) = -\lambda_L$ for $c_g \in \{c_{g,1}, \ldots, c_{g,N}\}$ and $R_L(b, \perp_L) = \max_{p \in \mathcal{P}} \sum_{n \in p} \mathbb{E}[b_n]$, where $\mathcal{P}$ is the set of all paths and $b_n$ is the belief about the reward of node $n$.

3.2 Hierarchical Bayesian Metalevel Policy Search

Having introduced the hierarchical problem decomposition, we now present how this decomposition can be leveraged to make BMPS and other automatic strategy discovery methods more scalable. BMPS approximates the value of computation (VOC) according to Equation 3. We propose to use BMPS to solve the goal-selection metalevel MDP and the goal-achievement metalevel MDP separately. The meta controller then decides when each discovered policy should run. A detailed analysis of the computation time is presented in Appendix A.2.
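The two metalevel reward functions can be illustrated as follows; beliefs are reduced to dicts of expected values, and the sketch is an illustrative simplification of the formal definitions, not the authors' implementation.

```python
# Sketch of R_H and R_L from Section 3.1. TERMINATE stands in for the
# termination operations of the two metalevel MDPs.
TERMINATE = "terminate"

def reward_high(goal_beliefs, c, lam_h):
    """R_H: -lam_h per goal inspection; value of the best goal on termination."""
    if c == TERMINATE:
        return max(goal_beliefs.values())
    return -lam_h

def reward_low(node_beliefs, paths, c, lam_l):
    """R_L: -lam_l per node inspection; value of the best path on termination."""
    if c == TERMINATE:
        return max(sum(node_beliefs[n] for n in path) for path in paths)
    return -lam_l

# Example: two goals with expected values 12 and 7, and an inspection cost of 1
print(reward_high({"g1": 12.0, "g2": 7.0}, "inspect_g1", lam_h=1.0))  # -1.0
print(reward_high({"g1": 12.0, "g2": 7.0}, TERMINATE, lam_h=1.0))     # 12.0
```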
High level policy search: The VOC for the high-level policy is approximated using three features: (1) the myopic utility of performing a goal state evaluation ($\mathrm{VOI}^H_1$), (2) the value of perfect information about all goals ($\mathrm{VPI}^H$), and (3) the cost of the respective computation ($\mathrm{cost}^H$):

$\widehat{\mathrm{VOC}}^H(c^H, b^H; \mathbf{w}^H) = w^H_1 \cdot \mathrm{VOI}^H_1(c^H, b^H) + w^H_2 \cdot \mathrm{VPI}^H(b^H) - w^H_3 \cdot \mathrm{cost}^H(c^H)$,  (4)

where $w^H_1$ and $w^H_2$ are constrained to a probability simplex, $w^H_3 \in [1, M]$, and $M$ is the number of goals. Additionally, the cost $\mathrm{cost}^H(c^H)$ is defined as

$\mathrm{cost}^H(c^H) = \lambda_H$ if $c^H \in \{\xi_1, \ldots, \xi_M\}$, and $\mathrm{cost}^H(\perp_H) = 0$.  (5)
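The weights at both levels are fit by Bayesian optimization; the paper reports using the GPyOpt library for this (see the low-level policy search below). A minimal sketch under that setup; `evaluate_policy` is a hypothetical stand-in for running the weighted VOC policy on sampled environments, and the simplex handling and search domain are simplifications.

```python
# Sketch: fitting VOC weights (Equation 4) with Bayesian optimization.
import numpy as np
import GPyOpt

def evaluate_policy(w):
    # Stand-in objective: replace with the mean return of the VOC-hat policy
    # over sampled environments.
    return -float(np.sum((w - np.array([0.4, 0.6, 2.0])) ** 2))

def objective(X):
    scores = []
    for row in X:
        w = np.array(row, dtype=float)
        w[:2] /= w[:2].sum() + 1e-12        # keep (w1, w2) on the simplex
        scores.append(-evaluate_policy(w))  # GPyOpt minimizes, so negate
    return np.array(scores).reshape(-1, 1)

domain = [
    {"name": "w1", "type": "continuous", "domain": (0.0, 1.0)},
    {"name": "w2", "type": "continuous", "domain": (0.0, 1.0)},
    {"name": "w3", "type": "continuous", "domain": (1.0, 5.0)},  # cost weight in [1, M]
]
opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=30)
print(opt.x_opt, -float(opt.fx_opt))  # best weights and their estimated return
```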
Low level policy search: In a similar manner as for the high-level policy, the value of computation for the low-level policy is approximated using a mixture of VOI features and the anticipated cost of the current computation and future computations, that is:

$\widehat{\mathrm{VOC}}^L(c, b, g; \mathbf{w}^L) = w^L_1 \cdot \mathrm{VOI}^L_1(c, b, g) + w^L_2 \cdot \mathrm{VPI}^L(b, g) + w^L_3 \cdot \mathrm{VPI}^L_{\text{sub}}(c, b, g) - w^L_4 \cdot \mathrm{cost}^L(c, g, \mathbf{w}^L)$,  (6)

where $w^L_i$ ($i = 1, 2, 3$) are constrained to a probability simplex, $w^L_4 \in [1, |h_g|]$, and $|h_g|$ is the number of nodes in goal set $h_g$. The weight values for both levels are optimised iteratively with Bayesian optimization (Mockus, 2012) using the GPyOpt library (The GPyOpt authors, 2016).

The cost feature of the original BMPS algorithm introduced by Callaway et al. (2018a) only considered the cost of a single computation, whereas its VOI features consider the benefits of performing a sequence of computations. As a consequence, policies learned with the original version of BMPS are biased towards inspecting nodes that many paths converge on, even when the values of those nodes are irrelevant. To rectify this problem, we redefine the cost feature so that it considers the costs of all computations assumed by the VOI features. Concretely, to compute the low-level policy, we define the cost feature of BMPS as the weighted average of the costs of generating the information assumed by the VOI features $F = \{\mathrm{VOI}^L_1, \mathrm{VPI}^L, \mathrm{VPI}^L_{\text{sub}}\}$, that is
$\mathrm{cost}^L(c, g, \mathbf{w}^L) = \sum_{f \in F} \frac{w^L_f}{|h_g|} \sum_{n} \mathbb{I}(c, f, n) \cdot \mathrm{cost}(c)$,  (7)

where $\mathbb{I}(c, f, n)$ returns 1 if node $n$ is relevant when computing feature $f$ for computation $c$ and 0 otherwise.

In the remainder of this article, we refer to the resulting strategy discovery algorithm as hierarchical BMPS and refer to the original version of BMPS as non-hierarchical BMPS.

3.3 Tree contraction method for faster BMPS feature computation

To further increase the scalability of BMPS, we make an additional improvement to how it computes the features used to approximate the value of computation (Callaway et al., 2018a). Specifically, we improve computational efficiency by combining nodes in the meta MDP according to a set of predefined conditions, ultimately reducing the complexity of the necessary computations. Nodes are combined by merging two nodes into a single new node with a probability distribution that represents their combined reward value.

The algorithm consists of three different operations that combine node distributions. A list of conditions determines which operation to apply, and the algorithm stops when the distributions of all nodes within the MDP are collapsed into a single root node.

– Add: Combines the distributions of two consecutive nodes by adding them. This operation can be applied to two consecutive nodes in the tree as long as the parent node does not have other child nodes and the child node does not have other parents. (Two nodes are consecutive if they are in a direct parent-child relation.)

– Maximise: Combines two parallel nodes by taking the maximum over each combination of values the nodes can take, merging the nodes' distributions while taking into account that the optimal path will always lead through the higher-valued node of the two. This operation can be applied to two nodes that have a single identical parent and child node.

– Split: Splits a child node into separate nodes by duplicating that node. The whole tree is then duplicated as many times as the node has possible values, fixing the node's distribution to each possibility. The duplicated trees are then individually reduced to single root nodes, and the individual root nodes are combined into a single tree by pairwise application of the maximise operation. This operation can be applied to nodes that have multiple parent nodes, where each of the individual nodes after splitting is only connected to one of its parent nodes.

The split operation is the most computationally expensive operation and is therefore only applied when the add and maximise operations are insufficient to reduce the tree to a single node. Specifically, this happens when a node that needs to be reduced by the maximise operation has an additional parent or child node. Since the structure of the environment stays identical while the rewards and discovered states vary, we precompute the necessary operations to reduce the tree and then apply the reduction individually for each problem instance.

Our adjustment is purely algorithmic and does not change the value of computation. Therefore, it does not impair the performance of the discovered strategies. An additional effect of the tree contraction method is that it extends the types of environments solvable by BMPS. Previously, BMPS was only able to handle environments with a branching tree structure: nodes can have multiple children but never multiple parents.
Our new formulation allows us to compute the BMPS features for tree structures in which nodes have multiple parent nodes as well. This is possible through the application of the maximise operation, which allows combining multiple parent nodes into a single node, making them solvable through the value-of-computation calculation. The range of solvable environments is therefore extended from trees to directed acyclic graphs. This extension is especially relevant for environments containing goal nodes, since it is often the case that multiple intermediate nodes converge on the same goal node.
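A minimal sketch of the two cheap contraction operations on discrete reward distributions (represented as value-to-probability dicts) follows; the example distributions mirror those of the high-risk environment described in Section 4.2, and the split operation is omitted for brevity.

```python
# Sketch: the add and maximise contraction operations on discrete reward
# distributions. `add` merges a parent-child chain (convolution of sums);
# `maximise` merges two parallel nodes, keeping the higher value because the
# optimal path always goes through the better of the two.
from itertools import product
from collections import defaultdict

def add(dist_a, dist_b):
    out = defaultdict(float)
    for (va, pa), (vb, pb) in product(dist_a.items(), dist_b.items()):
        out[va + vb] += pa * pb
    return dict(out)

def maximise(dist_a, dist_b):
    out = defaultdict(float)
    for (va, pa), (vb, pb) in product(dist_a.items(), dist_b.items()):
        out[max(va, vb)] += pa * pb
    return dict(out)

risky = {-1500: 0.1, 0: 0.9}
step = {-10: 0.25, -5: 0.25, 5: 0.25, 10: 0.25}
print(add(risky, step))      # distribution of a two-node chain
print(maximise(step, step))  # distribution of the better of two branches
```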
4 Performance evaluation

To evaluate our method for discovering hierarchical planning strategies, we benchmarked its performance, scalability, and robustness using the two types of environments illustrated in Figure 2 and Figure 6. The first type of environment conforms to the structure that motivated our hierarchical problem decomposition (i.e., the variability of rewards increases from each step to the next), and the second type does not. In the second type of environment, we introduced a high-risk node on the path to each goal (see Figure 6). This violates the assumption that motivated the hierarchical decomposition and makes goal switching essential for good performance.

For each environment, we apply the criterion defined in Equation 1 (see Sections 2.1 and 2.3) to evaluate the degree to which the resulting strategies are resource-rational against the resource rationality of human planning, existing planning algorithms, and the strategies discovered by state-of-the-art strategy discovery methods. Furthermore, we show that our method is substantially more scalable than previous methods.

To compare the performance of the automatically discovered planning strategies to the performance of people, we conducted experiments on Amazon Mechanical Turk (Litman et al., 2017). In these experiments, we measured human performance in Flight Planning tasks that are analogous to the environments we use to evaluate our method (see Figure 7). For the first type of environment, we recruited … participants for each of the four environments (average age … years, range: 19-70 years; … female). Participants were paid $… plus a performance-dependent bonus (average bonus $…). The average duration of the experiment was … min. For the second type of environment, we recruited … participants (average age … years, range: 19-70 years; … female). Participants were paid $… plus a performance-dependent bonus (average bonus $…). The average duration of the experiment was … min. Following instructions that informed the participants about the range of possible reward values, participants were given the opportunity to familiarize themselves with the task in practice trials of the Flight Planning task. After this, participants were evaluated on test trials of the Flight Planning task for the first type of environment and test trials for the second type. To ensure high data quality, we applied the same pre-determined exclusion criterion throughout all presented experiments: we excluded participants who did not make a single click on more than half of the test trials, because not clicking is highly indicative of a participant not engaging with the task and speeding through it. In the first environment type we excluded … participants and in the second environment type we excluded … participants.

4.1 Evaluation in environments that conform to the assumed structure

We first evaluate the performance and scalability of our method in environments whose structure conforms to the assumptions that motivated the hierarchical problem decomposition. To do so, we compare the performance of the discovered strategies against the performance of existing planning algorithms, the strategies discovered by previous strategy discovery methods, and human performance in four increasingly challenging environments of this type with 2-5 candidate goals. The reward of each node is sampled from a normal distribution with mean 0. The variance of rewards available at non-goal nodes was 5 for nodes reachable within a single step (level 1) and doubled from each level to the next.
The variance of the distribution from which the reward associated with the goal node was sampled starts from … and increases by … for every additional goal node. The environment was partitioned into one sub-graph per goal. Each of those sub-graphs contains 17 intermediate nodes, forming 10 possible paths that reach the goal state in at most 5 steps (see Figure 2). The cost of planning is 1 point per click ($\lambda = 1$).

To estimate an upper bound on the performance of existing planning algorithms on our benchmark problems, we selected Backward Search and Bidirectional Search (Russell and Norvig, 2002) because, unlike most planning algorithms, they start by considering potential final destinations, which is optimal for planning in our benchmark problems. These search algorithms terminate when they find a path whose expected return exceeds a threshold, called the aspiration value. The aspiration value was selected using Bayesian optimization (Mockus, 2012) to obtain the best possible performance from each selected planning algorithm. We also evaluated the performance of a random-search algorithm, which chooses computations uniformly at random from the set of metalevel operations that have not been performed yet.

In addition to those planning algorithms, our baselines include three state-of-the-art methods for automatic strategy discovery: the greedy myopic VOC strategy discovery algorithm (Lin et al., 2015), which approximates the VOC by its myopic utility ($\mathrm{VOI}_1$); BMPS (Callaway et al., 2018a); and the Adaptive Metareasoning Policy Search algorithm (AMPS; Svegliato and Zilberstein, 2018), which uses approximate metareasoning to decide when to terminate planning. Our implementation of AMPS uses a deep Q-network (Mnih et al., 2013) to estimate the difference between the values of stopping versus continuing planning. It learns this estimate based on the expected termination reward of the best path. The planning operations are selected by maximizing the myopic value of information ($\mathrm{VOI}_1$). When applying hierarchical BMPS to this environment, we disabled the goal-switching component of the meta controller, since the cumulative variance of the intermediate nodes was less than the variance of the goal nodes, rendering goal switching unnecessary. To illustrate the versatility of our hierarchical problem decomposition, we also applied it to the greedy myopic VOC strategy discovery algorithm.

4.1.1 Performance comparison

Table 1 and Figure 5a compare the performance of the strategies discovered by hierarchical BMPS and the hierarchical greedy myopic VOC method against the performance of the strategies discovered by the two state-of-the-art methods, two standard planning algorithms, and human performance on the benchmark problems described above (Section 4.1). These results show that the strategies discovered by our new hierarchical strategy discovery methods outperform extant planning algorithms and the strategies discovered by the AMPS algorithm across all of our benchmark problems (p < . for all pairwise Wilcoxon rank-sum tests). Critically, imposing hierarchical constraints on the strategy search of BMPS and the greedy myopic VOC method had no negative effect on the performance of the resulting strategies (p > . for all pairwise Wilcoxon rank-sum tests). Additionally, when human participants were tested on these environments, they performed much worse than the strategy discovered by our hierarchical method, regardless of the number of goals (p < . for all pairwise Wilcoxon rank-sum tests).
Type | Name | 2 Goals | 3 Goals | 4 Goals | 5 Goals
S | Hierarchical BMPS | … | … | … | …
S | Non-hierarchical BMPS | … | 111.53 | … | …
S | Hierarchical greedy myopic VOC | … | … | … | …
S | Non-hierarchical greedy myopic VOC | … | … | … | …
S | Adaptive Metareasoning Policy Search | 77.08 | 109.39 | 127.01 | 141.34
P | Depth-first Search | 74.99 | 109.13 | 129.66 | 143.45
P | Breadth-first Search | 87.62 | 112.83 | 127.68 | 137.40
P | Bidirectional Search | 88.59 | 115.07 | 134.08 | 154.24
P | Backward Search | 87.85 | 114.29 | 134.43 | 156.56
P | Random Policy | 52.73 | 80.05 | 89.31 | 101.15
 | Human Baseline | 45.42 | 88.06 | 39.32 | 124.89
Table 1: Net returns of various strategy discovery methods (S) and existing planning algorithms (P). The best algorithms and the best net returns for each environment setting (column) are formatted in bold. The four best methods performed significantly better than the other methods, but the differences between them are not statistically significant.

As illustrated in Figure 4, the planning strategy our hierarchical BMPS algorithm discovered for this type of environment is qualitatively different from all existing planning algorithms. In general, the strategy proceeds as follows: it first evaluates the goal nodes until it finds a goal node with a sufficiently high reward. Then, it plans backward from the chosen goal to the current state. In evaluating candidate paths from the goal to the current state, it discards each path from further exploration as soon as it encounters a high negative reward on that path. This phenomenon is known as pruning and has previously been observed in human planning (Huys et al., 2012). The non-hierarchical version of BMPS also discovered this type of planning strategy. This suggests that goal-setting with backward planning is the resource-rational strategy for this environment rather than an artifact of our hierarchical problem decomposition. (While this strategy was discovered assuming that the cost of evaluating a potential goal node is the same as the cost of evaluating an intermediate node, we found that the discovered strategy remained the same as we increased the cost of evaluating goal nodes.) Unlike this type of planning, most extant planning algorithms plan forward, and the few planning algorithms that plan backward (e.g., Bidirectional Search and Backward Search) do not preemptively terminate the exploration of a path.
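In pseudocode form, the discovered strategy can be summarized as follows; this is a descriptive sketch of the behavior reported above, with hypothetical helper callables, not the learned policy itself.

```python
# Sketch of the discovered strategy: inspect goal nodes until one is good
# enough, then plan backward from it, pruning any path on which a large
# negative reward is uncovered. The helpers (click, paths_to,
# is_good_enough, is_large_loss) are illustrative stand-ins.

def discovered_strategy(goals, click, paths_to, is_good_enough, is_large_loss):
    # Phase 1: goal setting - inspect goal nodes until one looks good enough
    chosen = goals[-1]                       # fallback: last inspected goal
    for g in goals:
        if is_good_enough(click(g)):
            chosen = g
            break
    # Phase 2: backward planning from the chosen goal, with pruning
    surviving = []
    for path in paths_to(chosen):            # each path ordered goal -> root
        for node in path:
            if is_large_loss(click(node)):   # prune this path on a big loss
                break
        else:
            surviving.append(path)           # no large loss found: keep path
    return chosen, surviving
```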
4.1.2 Scalability comparison

Fig. 4: Sequence of nodes revealed in a particular environment. The numbers above the nodes indicate the sequence in which the nodes were revealed. The number in each revealed node indicates its reward.

Fig. 5: (a) Net returns of existing planning algorithms (striped bars) versus planning strategies discovered by various strategy discovery methods (bars without stripes). (b) Comparison of the mean run time of various strategy discovery methods (in seconds).

Table 2 and Figure 5b compare the run times of our hierarchical strategy discovery methods against their non-hierarchical counterparts and Adaptive Metareasoning Policy Search. This comparison shows that imposing hierarchical structure substantially increased the scalability of BMPS and the greedy myopic VOC method. The improved run time profile reflects a reduction in the asymptotic upper bound on the algorithms' run times when hierarchical structure is imposed on the strategy space (see Appendix A.2). As shown in the last column of Table 2, this reduction in computational complexity increases the size of environments for which we can discover resource-rational planning strategies by a factor of 14-15, depending on the required quality of the planning strategy. This makes it possible to automatically discover planning strategies for sequential decision problems with up to … states. Consequently, our method scales to metalevel MDPs with up to … possible belief states, whereas the original version of BMPS was limited to problems with only up to … possible belief states. This shows that our approach increased the scalability of automatic strategy discovery by a factor of …. This is a significant step towards discovering resource-rational planning strategies for the complex planning problems people face in the real world. While the hierarchical greedy myopic VOC method is the most scalable strategy discovery method, the fastest method on our four benchmarks was our hierarchical BMPS algorithm with tree contraction. Comparing the first two rows of Table 2 shows that the tree contraction method described in Section 3.3 significantly contributed to this speed-up (for more details see Appendix A.5).
Strategy Discovery Algorithm | 2 Goals | 3 Goals | 4 Goals | 5 Goals | Max.
Hierarchical BMPS with tree contraction | 0.21 | 0.23 | 0.24 | 0.27 | …
Hierarchical BMPS | 4.18 | 6.45 | 7.45 | 9.30 | 180
Non-hierarchical BMPS | 16.09 | 44.81 | 117.46 | 203.18 | 36
Hierarchical greedy myopic VOC | … | … | … | … | …
Non-hierarchical greedy myopic VOC | 10.56 | 29.02 | 121.53 | 101.97 | 180
Adaptive Metareasoning Policy Search | 6.45 | 13.39 | 24.43 | 64.36 | 36
Table 2: Average time to evaluate an environment, in seconds. The last column denotes the size of the largest environment (…).

4.1.3 Robustness to the ratio of goal and path variance

… when the variance of goal values is at least as high as the variance of the path costs, but increases to … as the variance of goal values drops to only one third of the variance of the path costs.

Table 3: Comparison of the performance of BMPS with and without hierarchical structure on the 2-goal environment with various variance ratios. $\sigma_\Sigma$: cumulative standard deviation of the longest path to a goal node; $\sigma_1$: standard deviation of the first goal node; $\sigma_2$: standard deviation of the second goal node; $\Delta$: absolute difference between net returns; $\%\Delta$: relative difference between net returns.

(For these evaluations, the continuous normal distribution is discretized into 4 bins, so that, including the undiscovered state, each node has 5 possible conditions.)

4.2 Evaluation in environments that violate the assumed structure

To accommodate environments whose structure violates the assumption that more distant rewards are more variable than more proximal ones, the hierarchical strategies discovered by our method can alternate between goal selection and goal
planning. We now demonstrate the benefits of this goal-switching functionality by comparing the performance of our method with versus without goal switching. In particular, we demonstrate that switching goals leads to better performance if the assumption of increasing variance is violated and does not harm performance when that assumption is met.

First, we compare the performance of the two algorithms in an environment where switching goals should lead to an improvement in performance. This environment has a total of 60 nodes split into four different goals, each consisting of 15 nodes in the low-level MDP. The difference to the previously used environments is that one of the unavoidable intermediate nodes has a 10% chance to harbor a large loss of -1500 (see Figure 6). The cost of computation in this environment is 10 points per click (i.e., $\lambda = 10$). The optimal strategy for this environment selects a goal, checks the high-risk node on the path leading to the selected goal, and switches to a different goal if it uncovers the large loss. We compare the performance of hierarchical BMPS with goal switching to the performance of hierarchical BMPS without goal switching, the non-hierarchical BMPS method with tree contraction (without our tree contraction method, the original version of BMPS would not have been scalable enough to handle this environment), and human performance. The three strategy discovery algorithms were all trained on the same environment following the same training steps. Their performance is reported in Table 4.

Since humans were evaluated on only 5 randomly selected instances of this environment and each environment instance contains some randomness in its reward values, we evaluated the performance of the strategy discovery methods on the same 5 environments. None of the performance scores follows a normal distribution, as tested with a Shapiro-Wilk test (p < . for each). The performance of the individual algorithms was compared with Wilcoxon rank-sum tests, adjusting the critical alpha value via Bonferroni correction. Comparing the score of goal switching to both our method without goal switching (W = 14., p < .) and the original BMPS algorithm (W = 11.38, p < .001) shows a significant benefit of goal switching. Comparing the original BMPS method to the algorithm without goal switching, the original BMPS version performs significantly better (W = 18., p < .). While the average human performance was only −… (see Table 4), our method achieved a resource-rationality score of … on the same environment instances. A Shapiro-Wilk test detected no significant violation of the assumption that participants' average scores are normally distributed (p = .33).
We therefore compared the average human performance to our method in a one-sample t-test. We found that human participants performed significantly worse than the strategy discovered by our method (t(25) = −., p < .). This suggests that the strategy discovered by our method achieved a superhuman level of computational efficiency.
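The goal-switching behavior described above can be summarized by a simple control rule; the sketch below is a descriptive simplification with hypothetical helpers, not the learned meta controller.

```python
# Sketch: pursue the best goal, inspect the unavoidable high-risk node on
# its path, and switch goals if the large loss is uncovered. The helpers
# (risky_node_of, click) and the loss cutoff are illustrative stand-ins.

def pursue_with_switching(goal_values, risky_node_of, click, loss_cutoff=-1000):
    """Return the goal committed to after checking the risky nodes."""
    remaining = dict(goal_values)                # expected value per goal
    while remaining:
        goal = max(remaining, key=remaining.get) # currently best goal
        if click(risky_node_of(goal)) > loss_cutoff:
            return goal                          # no large loss: keep goal
        del remaining[goal]                      # large loss found: switch
    return None                                  # every goal's path is blocked
```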
Fig. 6: Environment that demonstrates the utility of goal switching. High-risk nodes (8, 23, 38, and 53) follow a categorical reward distribution with a value of -1500 with probability 0.1 and 0 with probability 0.9. Goal nodes (15, 30, 45, and 60) have an equiprobable categorical reward distribution over 0, 25, 75, and 100. The first node in each sub-tree (1, 16, 31, and 46) as well as the root node (0) have a fixed reward of 0. All other intermediate nodes follow an equiprobable categorical distribution over -10, -5, 5, and 10.
Algorithm | N | Reward | Std
No goal-switching | 5000 | -80.38 | 446.47
Goal-switching | 5000 | 51.33 | 32.2
Non-hierarchical BMPS | 5000 | 39.29 | 40.93
Human baseline | 26 | -79.92 | 74.06
Table 4: Mean reward and standard deviation of executing the hierarchical planning algorithm with and without goal switching, as well as the original BMPS algorithm, over 5000 random instances of the high-risk environment with four goal states. The human baseline was gathered in an online experiment with a lower number of samples.

To show that enabling our algorithm's capacity for goal switching has no negative effect on its performance even when the assumption of the hierarchical decomposition is met, we performed a second comparison on the two-goal environment with increasing variance from Section 4.1.1. Since in this environment the rewards are most variable at the goal nodes, switching goals should usually be unnecessary. Therefore, due to the environment structure, we do not expect the goal-switching strategy to perform better than the purely hierarchical strategy. Comparing the performance in this environment, we observe that both versions of the algorithm perform similarly well (Table 5). A Wilcoxon rank-sum test (W = 0., p = .) shows no significant difference between the two. This demonstrates that the addition of goal switching to the algorithm does not impair performance, even when goal switching is not beneficial.

Algorithm | N | Reward | Std
No goal-switching | 5000 | 108.84 | 95.37
Goal-switching | 5000 | 108.78 | 95.37
Table 5: Mean reward and standard deviation of executing the meta controller and the hierarchical planning algorithm over 5000 random instances of the increasing-variance environment with two goal states.
5 Improving human decision-making by teaching the discovered strategies

Having shown that our method discovers planning strategies that achieve a super-human level of performance, we now evaluate whether we can improve human decision-making by teaching people the automatically discovered strategies. Building on the Mouselab-MDP paradigm introduced in Section 2.3, we investigate this question in the context of the Flight Planning task illustrated in Figure 7. Participants are tasked with planning the route of an airplane across a network of airports. Each flight generates a profit or a loss. Participants can find out how much profit or loss an individual flight would generate by clicking on its destination for a fee of $1. The participant's goal is to maximize the sum of the flights' profits minus the cost of planning. Participants can make as few or as many clicks as they like before selecting a route using their keyboard.

To teach people the automatically discovered strategies, we developed cognitive tutors (see Section 2.6) that show people step-by-step demonstrations of what the optimal strategies for different environments would do to reach a decision (see Figure 8). In each step, the strategy selects one click based on which information has already been revealed. At some point the tutor stops clicking and moves the airplane down the best route indicated by the revealed information. Moving forward, we will refer to cognitive tutors teaching the hierarchical planning strategies discovered by hierarchical BMPS as hierarchical tutors and to the tutors teaching the strategies discovered by non-hierarchical BMPS as non-hierarchical tutors.

To evaluate the effectiveness of these demonstration-based cognitive tutors, we conducted two experiments in which participants were taught the optimal strategies for flight planning problems equivalent to the two types of environments in which we evaluated our strategy discovery method in Section 4.
Fig. 7: Screenshot of the Flight Planning task used to assess people's planning skills in Experiment 1. Participants can drag and zoom in the environment to show different portions.

To assess the potential benefits of the hierarchical tutors enabled by our new scalable strategy discovery method, these experiments compare the performance of people who were taught by hierarchical tutors against the performance of people who were taught by non-hierarchical tutors, the performance of people who were taught by the original feedback-based tutor for small environments (Lieder et al., 2020, 2019), and the performance of people who practiced the task on their own. We developed the best possible version of each tutor given the limited scalability of the underlying strategy discovery method. The increased scalability of our new method enabled the hierarchical tutor to demonstrate the optimal strategy for the task participants faced, whereas the other tutors could only show demonstrations on smaller versions of the task. We found that showing people a small number of demonstrations of the optimal planning strategy significantly improved their decision-making not only when the assumption underlying our method's hierarchical problem decomposition is met (Experiment 1) but also when it is violated (Experiment 2).

5.1 Experiment 1: Teaching people the optimal strategy for an environment with increasing variance
In Experiment 1, participants were taught the optimal planning strategy for an environment in which final destinations can be reached through different paths comprising between 4 and 6 steps each (see Figure 7). The most important property of this environment is that the variance of the available rewards doubles from each step to the next, starting from 5 in the first step. Therefore, in this environment, the optimal planning strategy is to first inspect the values of alternative goals, then commit to the best goal one could find, and then plan how to achieve it without ever reconsidering other goals.

Fig. 8: Screenshot of the cognitive tutor demonstrating the non-hierarchical strategy evaluated in Experiment 1. The numbers beside the nodes indicate the sequence in which the clicks were performed.
We recruited … participants on Amazon Mechanical Turk (average age … years, range: …-… years; … female) (Litman et al., 2017). Participants were paid $… plus a performance-dependent bonus (average bonus $…). The average duration of the experiment was … min. Participants could earn a performance-dependent bonus of 1 cent for every 10 points they won in the test trials.

All participants first had to agree to a consent form stating that they were above 18, a US citizen residing in the USA, and fluent in English. After this, instructions about the range of possible rewards (−… to …), the cost of clicking ($…), and the movement keys were presented. Then, participants went through practice trials to familiarize themselves with the experiment. Following this, participants were either given additional practice trials or trials with the cognitive tutor, depending on their experimental condition. Finally, participants were given test trials in the flight planning task with 10 possible final destinations illustrated in Figure 7. Participants started the test block with … points.

To evaluate the efficacy of the cognitive tutors, participants were assigned to four groups. In the experimental group, participants were taught by the hierarchical tutor. The first control group was taught by the non-hierarchical tutor. The second control group was taught by the feedback-based cognitive tutor (Lieder et al., 2019, 2020) illustrated in Figure 9. The third control group practiced the Flight Planning task 10 times without feedback.

Fig. 9: Screenshot of the feedback-based tutor against which we evaluated our scalable demonstration-based tutor. This tutor gives people feedback on the planning operations they perform in a smaller version of the environment. This small environment is the largest one for which optimal feedback can be computed with the currently available methods.

The hierarchical tutor taught the strategy that the hierarchical BMPS algorithm discovered for the task participants had to perform in the test block. It first demonstrated trials with the goal-selection strategy; it then showed three demonstrations of the goal-planning strategy; and finally presented demonstrations of the complete strategy combining both parts. The non-hierarchical tutor showed demonstrations of the strategy that non-hierarchical BMPS discovered for the largest version of the Flight Planning task it could handle (i.e., 2 goals instead of 10 goals). Computational bottlenecks confined the feedback-based tutor to a three-step planning task with six possible final destinations, shown in Figure 9 (Lieder et al., 2020, 2019). Participants received feedback on each of their clicks and on their decision when they stopped clicking, as illustrated in Figure 9. When a participant chose a sub-optimal planning operation, they were shown a message stating which planning operation the optimal strategy would have performed instead. In addition, they received a timeout penalty whose duration was proportional to how sub-optimal their planning operation had been.

Counterbalanced assignment ensured that participants were equally distributed across the four experimental conditions (i.e., … participants per condition). To ensure high data quality, we applied a pre-determined exclusion criterion: we excluded participants who did not make a single click on more than half of the test trials, because not clicking is highly indicative of speeding through the experiment without engaging with the task.
Fig. 10: Average performance of participants in the four conditions of Experiment 1 (control condition without feedback, control condition with feedback, non-hierarchical tutor, hierarchical tutor); the y-axis shows performance in points (0–200).
Figure 10 shows the average performance of the four groups on the test trials. According to a Shapiro–Wilk test, participants' scores on the test trials were not normally distributed in any of the four groups (all p < . ). We therefore tested our hypothesis using non-parametric tests. To test whether there were any significant differences between the groups in our repeated-measures design, we performed a Wald test. We found that people's performance differed significantly across the four experimental conditions (F = 15. , p = . ). Planned pair-wise Wilcoxon rank-sum tests confirmed that teaching people strategies discovered by the hierarchical method significantly improved their performance (204.48 points/trial) compared to the control condition (177.36 points/trial, p = . , d = 0. ), the feedback-based cognitive tutor (167.02 points/trial, p < . , d = 0. ), and the non-hierarchical tutor (p = . , d = 0. ). By contrast, neither the feedback-based cognitive tutor (p = . , d = 0. ) nor the non-hierarchical cognitive tutor (p = . , d = 0. ) was more effective than letting people practice the task on their own. These results show that the hierarchical method is able to discover and teach strategies in environments in which previous methods failed.

5.2 Experiment 2: Teaching people the optimal strategy for a risky environment

In Experiment 2, participants were taught the strategy our method discovered for the 8-step decision problem illustrated in Figure 6. Critically, in this environment each path contains one risky node that harbors an extreme loss with a probability of 10%. Therefore, the optimal strategy for this environment inspects the risky node while planning how to achieve the selected goal and then switches to another goal when it encounters a large negative reward on the path to the initially selected goal.

To test whether our approach can also improve people's performance in environments with this more complex structure, we created two demonstration-based cognitive tutors that teach the strategies discovered by hierarchical BMPS with goal-switching and hierarchical BMPS without goal-switching, respectively, and a feedback-based tutor that teaches the optimal strategy for a 3-step version of the risky environment.

This experiment used a Flight Planning task that is analogous to the environment described in Section 4.2 (see Figure 6). Specifically, the environment comprises 4 goal nodes and 60 intermediate nodes (i.e., 15 per goal). Although each goal can be reached through multiple paths, all of those paths lead through an unavoidable node that has a 10% risk of harboring a large loss of -1500. The aim of this experiment is to verify that we are still able to improve human planning even when the environment requires a more complex strategy that occasionally switches goals during planning. To test this hypothesis, we showed participants in the experimental condition demonstrations of the strategy discovered by our method, and compared their performance to the performance of three control groups. The first control group was shown demonstrations of the strategy discovered by the version of our method without goal-switching; the second control group discovered their own strategy in five training trials; the third control group practiced planning on a three-step task with a feedback-based tutor (see Section 2.6) (Lieder et al., 2020, 2019). The environment used by the feedback-based tutor mimicked the high-risk environment.
To achieve this, we changed the reward distribution of the intermediate nodes so that there is a 10% chance of a negative reward of -96, a 30% chance of -4, a 30% chance of +4, and a 30% chance of +8. We then recomputed the optimal feedback using dynamic programming (Lieder et al., 2020, 2019).

We recruited 201 participants (average age . years, range: 19–70 years; female) on Amazon Mechanical Turk (Litman et al., 2017) over three consecutive days. All but two of them completed the assignment. Applying the same pre-determined exclusion criterion as in Experiment 1 (i.e., excluding participants who did not engage with the environment in more than half of the test trials) led to the exclusion of 30 participants (15%), leaving us with participants. Participants were paid $1.30 and a performance-dependent bonus of up to $1. The average bonus was $0.56, and the average duration of the experiment was 16.28 minutes. Participants were randomly assigned to one of four experimental conditions determining their training in the planning task. All groups were tested in five identical test trials. The data was analysed using the robust f1.ld.f1 function of the nparLD R package (Noguchi et al., 2012). On each trial, the participant's score was calculated as the expected reward of the path they chose minus the cost of the clicks they had made (as in the simulations, the cost per click was 10).

The results of the experiment are summarized in Table 6. Since the Shapiro–Wilk test showed that the scores in none of the four conditions were normally distributed (p < . for all), we again used the non-parametric Wald test to evaluate the data. The Wald test showed significant differences between the four groups (F = 62. , p < . ). Pairwise robust post-hoc comparisons showed that participants trained with demonstrations of the strategy discovered by hierarchical BMPS with goal-switching significantly outperformed all other conditions: participants trained by purely hierarchical demonstrations without goal-switching (p < . , d = 0. ), participants who did not receive demonstrations (p < . , d = 0. ), and participants who had practiced planning with optimal feedback on a smaller analogous environment (p < . , d = 0. ). The performance of the three control groups was statistically indistinguishable (all p ≥ . ). Participants trained by purely hierarchical demonstrations did not perform significantly better than participants who trained with optimal feedback (p = . , d = 0. ) or participants who did not receive demonstrations (p = . , d = 0. ). Additionally, there was no significant difference between the optimal-feedback condition and the no-demonstration condition (p = . , d = 0. ).

The results of this experiment show that we can significantly improve human decision-making by showing people demonstrations of the automatically discovered hierarchical planning strategy with goal-switching. This is a unique advantage of our new method, because none of the other approaches was able to improve people's decision-making in this large and risky environment. By comparing human performance to the optimal performance of our algorithm in the same environment (see Table 4), we can see that even though we were able to improve human performance, participants still did not fully understand the strategy based on the demonstrations alone. This reveals the limitations of teaching planning strategies purely with demonstrations, especially for more complex strategies.
Improving upon the pedagogy of our purely demonstration-based hierarchical tutor is an important direction for future work.

Condition                        N     Reward     Std
No demonstration                 42    -103.31    173.69
Goal-switching demonstration     45     -26.71    121.87
Hierarchical demonstration       39     -94.41     51.75
Feedback tutor                   43    -108.49    179.64
Table 6: Experimental results of teaching people automatically discovered planning strategies for the high-risk environment shown in Figure 6. For each condition, we report the number of participants, the mean expected reward, and the standard deviation of the expected reward.
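For concreteness, the per-trial scoring and the nonparametric group comparisons used in both experiments can be reproduced along the following lines. This is a minimal sketch assuming SciPy; the data layout and function names are hypothetical, and the click cost of 10 is the value used in Experiment 2.

    # Minimal sketch of the trial scoring and pairwise comparisons reported
    # above (hypothetical data layout; not the original analysis code).
    from scipy import stats

    def trial_score(expected_path_reward, n_clicks, click_cost=10):
        # Score = expected reward of the chosen path minus the cost of clicks.
        return expected_path_reward - click_cost * n_clicks

    def compare_groups(scores_a, scores_b):
        # Scores were not normally distributed (Shapiro-Wilk), so group
        # differences are assessed with a Wilcoxon rank-sum test instead
        # of a parametric t-test.
        normal = (stats.shapiro(scores_a).pvalue >= .05
                  and stats.shapiro(scores_b).pvalue >= .05)
        return stats.ttest_ind(scores_a, scores_b) if normal \
            else stats.ranksums(scores_a, scores_b)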
To make good decisions in complex situations, people and machines have to use efficient planning strategies because planning is costly. Efficient planning strategies can be discovered automatically, but computational challenges confined previous strategy discovery methods to tiny problems. To overcome this problem, we devised a more scalable machine learning approach to automatic strategy discovery. The central idea of our method is to decompose the strategy discovery problem into discovering goal-setting strategies and discovering goal-achievement strategies. In addition, we made a substantial algorithmic improvement to the state-of-the-art method for automatic strategy discovery (Callaway et al., 2018a) by introducing the tree-contraction method. We found that this hierarchical decomposition of the planning problem, together with our tree-contraction method, drastically reduces the time complexity of automatic strategy discovery without compromising the quality of the discovered strategies in many cases. Furthermore, by introducing the tree-contraction method we have extended the set of environment structures that automatic strategy discovery can be applied to from trees to directed acyclic graphs. These advances significantly extend the range of strategy discovery problems that can be solved by making the algorithm faster, more scalable, and applicable to environments with more complex structure. This is an important step towards discovering efficient planning strategies for real-world problems.

Recent findings suggest that teaching people automatically discovered efficient planning strategies is a promising way to improve their decisions (Lieder et al., 2020, 2019). Due to computational limitations, this approach was previously confined to sequential decision problems with at most three steps (Lieder et al., 2020, 2019). The strategy discovery methods developed in this article make it possible to scale up this approach to larger and more realistic planning tasks. As a proof of concept, we showed that our method makes it possible to improve people's decisions in planning tasks with up to 8 steps and up to 10 final goals. We evaluated the effectiveness of showing people demonstrations of the strategies discovered by our method in two separate experiments where the environments were so large that previous methods were unable to discover planning strategies within a time budget of eight hours. Thus, the best one could do when training people with previous methods was to construct cognitive tutors that teach the optimal strategy for a smaller environment requiring a similar strategy, or to have people practice without feedback. Evaluating our method against these alternative approaches, we found that our approach was the only one that was significantly more beneficial than having people practice the task on their own. To the best of our knowledge, this makes our algorithm the only strategy discovery method that can improve human performance on sequential decision problems of this size.
This suggests that our approach makes it possible to leverage reinforcement learning to improve human decision-making in problems that were out of reach for previous intelligent tutors.

Our method's hierarchical decomposition of the planning problem exploits the fact that people can typically identify potential mid- or long-term goals that might be much more valuable than any of the rewards they could attain in the short run. This corresponds to the assumption that the rewards available in more distant states are more variable than the rewards available in more proximal states. When this assumption is satisfied, our method discovers planning strategies much more rapidly than previous methods, and the discovered strategies are as good as or better than those discovered by the best previous methods. When this assumption is violated, the goal-switching mechanism of our method can compensate for the mismatch. This allows the discovered strategies to perform almost as well as the strategies discovered by BMPS. The more strongly its assumption is violated, the more our method relies on this mechanism. In doing so, it automatically trades off its computational speed-up against the quality of the resulting strategy. This shows that our method is robust to violations of its assumptions about the structure of the environment; it exploits simplifying structure only when it exists.

Some aspects of our work share similarities with recent work on goal-conditioned planning (Nasiriany et al., 2019; Pertsch et al., 2020), although the problem we solved is conceptually different. Both aforementioned methods optimize the route to a given final location, whereas our method learns a strategy for solving sequential decision problems in which the strategy chooses the final state itself. Furthermore, while Nasiriany et al. (2019) specified a fixed strategy for selecting the sequence of goals, our method learns such a strategy itself. Critically, while the policies learned by Nasiriany et al. (2019) select physical actions (e.g., move left), the metalevel policies learned by our method select planning operations (i.e., simulate the outcome of taking action a in state s and update the plan accordingly). Finally, our method explicitly considers the cost of planning to find algorithms that achieve the optimal trade-off between the cost of planning and the quality of the resulting decisions.

Our method's scalability comes at a price. Since our approach decomposes the full sequential decision problem into two sub-problems (goal-selection and goal-planning), its accuracy can be limited by the fact that it never considers the whole problem space at once. This is unproblematic when the environment's structure matches our method's assumption that the rewards of potential goals are more variable than more proximal rewards. But it could be problematic when this assumption is violated too strongly. We mitigated this potential problem by allowing the strategy discovery algorithm to switch goals. Even with this adaptation, the discovered strategy is not optimal in all cases: since each alternative goal's reward is represented by the average expected goal reward, the algorithm will only switch goals if the current goal's reward is below average. However, if the current goal's expected return is above average, the discovered strategy will not explore other goals even when that would lead to a higher reward.
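To state this limitation compactly, the following sketch (with hypothetical names; not the actual implementation) expresses the switching rule under the assumption that unexplored goals are summarized by the average expected goal reward.

    # Illustration of the goal-switching limitation discussed above
    # (hypothetical names; not the actual implementation).
    def should_switch_goal(current_goal_value, all_goal_values):
        # The alternative to the current goal is represented by the average
        # expected goal reward, so switching is only triggered when the
        # current goal looks worse than an average goal.
        average_goal_value = sum(all_goal_values) / len(all_goal_values)
        return current_goal_value < average_goal_value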
On balance, we think that the scalability of our method to large environments outweighs this minor loss in performance.

The advances presented in this article open up many exciting avenues for future work. For instance, our approach could be extended to plans with potentially many levels of hierarchically nested subgoals. Future work might also extend our method so that any state can be selected as a goal. In its current form, our algorithm always selects only the environment's most distant states (leaf nodes) as candidate goals. Future versions might allow the set of candidate goals to be chosen more flexibly, such that some leaf nodes can be ignored and some especially important intermediate nodes in the tree can be considered as potential sub-goals. A more flexible definition and potentially a dynamic selection of goal nodes could increase the strategy discovery algorithm's performance, and possibly allow us to solve a wider range of more complex problems. This would mitigate the limitations of the increasing-variance assumption by considering all potentially valuable states as (sub)goals regardless of where they are located.

The advances reported in this article have potential applications in artificial intelligence, cognitive science, and human-computer interaction. First, since the hierarchical structure exploited by our method exists in many real-world problems, it may be worthwhile to apply our approach to discovering planning algorithms for other real-world applications of artificial intelligence where information is costly. This could be a promising step toward AI systems with a (super)human level of computational efficiency. Second, our method also enables cognitive scientists to scale up the resource-rational analysis methodology for understanding the cognitive mechanisms of decision-making (Lieder and Griffiths, 2020) to increasingly naturalistic models of the decision problems people face in real life. Third, future work will apply the methods developed in this article to train and support people in making real-world decisions they frequently face. Our approach is especially relevant when acquiring information that might improve a decision is costly or time-consuming. This is the case in many real-world decisions. For instance, when a medical doctor plans how to treat a patient's symptoms, acquiring an additional piece of information might mean ordering an MRI scan that costs $1000. Similarly, a holiday-planning app would have to be mindful of the user's time when deciding which series of places and activities the user should evaluate to efficiently plan their road trip or vacation. Similar tradeoffs exist in project planning, financial planning, and time management. Furthermore, our approach can also be applied to support the information-collection process of hiring decisions, purchasing decisions, and investment decisions. Our approach could be used to train people how to make such decisions with intelligent tutors (Lieder et al., 2020, 2019). Alternatively, the strategies could be conveyed by decision support systems that guide people through real-life decisions by asking a series of questions. In this case, each question the system asks would correspond to an adaptively chosen information-gathering operation.
In summary, the reinforcement learning method developed in this article is an important step towards building intelligent systems with (super)human computational efficiency, towards understanding how people make decisions, and towards leveraging artificial intelligence to improve human decision-making in the real world. At a high level, our findings support the conclusion that incorporating cognitively-informed hierarchical structure into reinforcement learning methods can make them more useful for real-world applications.

Declarations
Funding
This project was funded by grant number CyVy-RF-2019-02 from the Cyber Valley Research Fund.
Conflicts of interest/Competing interests
The authors declare that they have no conflicts of interest or competing interests.
Availability of data and material (data transparency)
All materials of the behavioral experiments we conducted are available at https://github.com/RationalityEnhancement/SSD_Hierarchical/master/Human-Experiments. Anonymized data from the experiments is available at https://github.com/RationalityEnhancement/SSD_Hierarchical/master/Human-Experiments.

Code availability (software application or custom code)
The code of the machine learning methods introduced in this article is available at https://github.com/RationalityEnhancement/SSD_Hierarchical.

Ethics approval
The experiments reported in this article were approved by the IEC of the University of Tübingen under IRB protocol number 667/2018BO2 (“Online-Experimente über das Erlernen von Entscheidungsstrategien”).
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
Not applicable.
References
Botvinick MM (2008) Hierarchical models of behavior and prefrontal function. Trends in Cognitive Sciences 12(5):201–208
Callaway F, Lieder F, Krueger PM, Griffiths TL (2017) Mouselab-MDP: A new paradigm for tracing how people plan. In: The 3rd Multidisciplinary Conference on Reinforcement Learning and Decision Making, Ann Arbor, MI, URL https://osf.io/vmkrq/
Callaway F, Gul S, Krueger PM, Griffiths TL, Lieder F (2018a) Learning to select computations. In: Uncertainty in Artificial Intelligence
Callaway F, Lieder F, Das P, Gul S, Krueger P, Griffiths T (2018b) A resource-rational analysis of human planning. In: Kalish C, Rau M, Zhu J, Rogers T (eds) CogSci 2018
Callaway F, van Opheusden B, Gul S, Das P, Krueger P, Lieder F, Griffiths T (2020) Human planning as optimal information seeking. Manuscript under review
Carver CS, Scheier MF (2001) On the self-regulation of behavior. Cambridge University Press
Gigerenzer G, Selten R (2002) Bounded rationality: The adaptive toolbox. MIT Press
Griffiths TL (2020) Understanding human intelligence through human limitations. Trends in Cognitive Sciences
Griffiths TL, Callaway F, Chang MB, Grant E, Krueger PM, Lieder F (2019) Doing more with less: meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences 29:24–30
Hafenbrädl S, Waeger D, Marewski JN, Gigerenzer G (2016) Applied decision making with fast-and-frugal heuristics. Journal of Applied Research in Memory and Cognition 5(2):215–231
Hay N, Russell S, Tolpin D, Shimony SE (2014) Selecting computations: Theory and applications. arXiv preprint arXiv:1408.2048
Hertwig R, Grüne-Yanoff T (2017) Nudging and boosting: Steering or empowering good decisions. Perspectives on Psychological Science 12(6):973–986
Huys QJ, Eshel N, O’Nions E, Sheridan L, Dayan P, Roiser JP (2012) Bonsai trees in your head: how the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology 8(3)
Kaelbling LP, Lozano-Pérez T (2010) Hierarchical planning in the now. In: Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence
Kemtur A, Jain Y, Mehta A, Callaway F, Consul S, Stojcheski J, Lieder F (2020) Leveraging machine learning to automatically derive robust planning strategies from biased models of the environment. In: CogSci 2020
Krueger PM, Lieder F, Griffiths TL (2017) Enhancing metacognitive reinforcement learning using reward structures and feedback. In: Proceedings of the 39th Annual Conference of the Cognitive Science Society
Lieder F, Griffiths TL (2020) Resource-rational analysis: understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences 43
Lieder F, Krueger PM, Griffiths T (2017) An automatic method for discovering rational heuristics for risky choice. In: CogSci
Lieder F, Callaway F, Jain Y, Krueger P, Das P, Gul S, Griffiths T (2019) A cognitive tutor for helping people overcome present bias. In: RLDM 2019
Lieder F, Callaway F, Jain YR, Das P, Iwama G, Gul S, Krueger P, Griffiths TL (2020) Leveraging artificial intelligence to improve people’s planning strategies. Manuscript in revision
Lin CH, Kolobov A, Kamar E, Horvitz E (2015) Metareasoning for planning under uncertainty. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
Litman L, Robinson J, Abberbock T (2017) TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods 49(2):433–442
Marthi B, Russell SJ, Wolfe JA (2007) Angelic semantics for high-level actions. In: Seventeenth International Conference on Automated Planning and Scheduling, pp 232–239
Miller GA, Galanter E, Pribram KH (1960) Plans and the structure of behavior.
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602
Mockus J (2012) Bayesian approach to global optimization: theory and applications, vol 37. Springer Science & Business Media
Nasiriany S, Pong V, Lin S, Levine S (2019) Planning with goal-conditioned policies. In: Advances in Neural Information Processing Systems, pp 14843–14854
Noguchi K, Gel YR, Brunner E, Konietschke F (2012) nparLD: An R software package for the nonparametric analysis of longitudinal data in factorial experiments. Journal of Statistical Software 50(12):1–23
Pertsch K, Rybkin O, Ebert F, Finn C, Jayaraman D, Levine S (2020) Long-horizon visual planning with goal-conditioned hierarchical predictors. arXiv preprint arXiv:2006.13205
Russell S, Norvig P (2002) Artificial intelligence: a modern approach
Russell S, Wefald E (1992) Principles of metareasoning. Artificial Intelligence 49(1-3):361–395
Russell SJ, Wefald E (1991) Do the right thing: studies in limited rationality. MIT Press
Sacerdoti ED (1974) Planning in a hierarchy of abstraction spaces. Artificial Intelligence 5(2):115–135
Schapiro AC, Rogers TT, Cordova NI, Turk-Browne NB, Botvinick MM (2013) Neural representations of events arise from temporal community structure. Nature Neuroscience 16(4):486
Sezener E, Dayan P (2020) Static and dynamic values of computation in MCTS. In: Conference on Uncertainty in Artificial Intelligence, PMLR, pp 31–40
Simon HA (1956) Rational choice and the structure of the environment. Psychological Review 63(2):129
Solway A, Diuk C, Córdova N, Yee D, Barto AG, Niv Y, Botvinick MM (2014) Optimal behavioral hierarchy. PLoS Computational Biology 10(8)
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Svegliato J, Zilberstein S (2018) Adaptive metareasoning for bounded rational agents. In: IJCAI-ECAI Workshop on Architectures and Evaluation for Generality, Autonomy and Progress in AI (AEGAP), Stockholm, Sweden
The GPyOpt authors (2016) GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt
Todd PM, Gigerenzer GE (2012) Ecological rationality: Intelligence in the world. Oxford University Press
Tomov MS, Yagati S, Kumar A, Yang W, Gershman SJ (2020) Discovery of hierarchical representations for efficient planning. PLoS Computational Biology 16(4):e1007594
Wolfe J, Marthi B, Russell S (2010) Combined task and motion planning for mobile manipulation. In: Twentieth International Conference on Automated Planning and Scheduling
Appendix
A.1 VOC features
The features used to approximate the value of information can be explained using a simplified Mouselab-MDP environment. The ground-truth value of each node is depicted in Figure A11(a). The rewards are sampled from equiprobable categorical distributions of {−, −, , }, {−, −, , } and {−, −, , } for nodes of depth , and , respectively. In this example, the value of information of two metalevel actions is considered, marked in Figure A11(b).

To compute the VOC, the possible values of a subset of nodes are considered, and the expected return if those node values were known is computed as the cumulative reward accumulated along the maximal path. Subtracting from this quantity the cumulative reward of the maximal path given the current belief state gives the value of information of performing the metalevel action. The greater the difference between the two quantities, the more information is likely to be gained by performing the computation.

For the myopic VOC computation, the subset of nodes considered is just the node whose reward would be revealed by the metalevel action. VPI considers the entire set of nodes in the environment. VPI_sub considers all nodes lying on paths that pass through the node revealed by the computation. For example, when computing VPI_sub for the computation that reveals the node marked in blue in Figure A11(b), nodes , and would be considered, whereas nodes , and would be considered for the computation corresponding to revealing the value of the green node.
Fig. A11: (a) Simplified example environment with the reward beneath each node marked in the figure. In the beginning, the rewards are hidden. Metalevel actions reveal the corresponding reward of a node. (b) State of the environment. Two metalevel actions are being considered, which reveal the corresponding nodes as denoted by the blue (4) and green (7) color.
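To make the VOC computation concrete, the following sketch computes the myopic value of information for a toy belief state with two paths. The tree structure, distributions, and numbers are illustrative rather than those of Figure A11; VPI and VPI_sub follow the same pattern but jointly enumerate larger subsets of nodes.

    # Myopic VOI for a toy two-path belief state (illustrative numbers only).
    # A belief maps each node to a list of (value, probability) pairs;
    # a revealed node has a single outcome with probability 1.
    belief = {
        1: [(-4, 1.0)],                                   # already revealed
        2: [(-8, .25), (-4, .25), (4, .25), (8, .25)],
        3: [(-48, .25), (-24, .25), (24, .25), (48, .25)],
        4: [(-48, .25), (-24, .25), (24, .25), (48, .25)],
    }
    paths = [(1, 3), (2, 4)]  # each path is a sequence of node ids

    def node_mean(node, b):
        return sum(v * p for v, p in b[node])

    def best_path_value(b):
        # Expected return of the best path under the current belief.
        return max(sum(node_mean(n, b) for n in path) for path in paths)

    def myopic_voi(node, b):
        # Expected improvement from revealing one node before acting.
        expected_informed = 0.0
        for value, prob in b[node]:
            informed = dict(b)
            informed[node] = [(value, 1.0)]
            expected_informed += prob * best_path_value(informed)
        return expected_informed - best_path_value(b)

    print(myopic_voi(3, belief))  # 16.0: revealing node 3 is informative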
A.2 Time complexity upper bound analysis
In this section we analyse the computational time upper bounds of the methods that we used. For simplicity, in the hierarchical case, we assume that goal switching is not possible; that is, the high-level controller runs once, followed by the low-level controller. To ensure readability, we explicitly define the notation used in this section and throughout the paper:
– N: number of intermediate nodes per goal
– M: number of goal states
– B: number of bins used to discretize a continuous probability distribution
– RUN: number of unrevealed nodes relevant to compute a feature

Strategy discovery algorithm          Feature      RUN    Time
Hierarchical BMPS                     VOI_1^H      1      O(B)
                                      VPI^H        M      O(B^M)
                                      VOI_1^L      1      O(B)
                                      VPI_sub^L    N      O(B^N)
                                      VPI^L        N      O(B^N)
Hierarchical greedy myopic VOC        VOI_1^H      1      O(B)
                                      VOI_1^L      1      O(B)
Non-hierarchical BMPS                 VOI          1      O(B)
                                      VPI_sub      N      O(B^N)
                                      VPI          M·N    O(B^(M·N))
Non-hierarchical greedy myopic VOC    VOI          1      O(B)

Table 7: Asymptotic time to compute a feature for the hierarchical and non-hierarchical strategy discovery algorithms
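To get a feel for these asymptotic costs, one can evaluate the dominant B^RUN factors from Table 7 for example problem sizes. The sketch below uses illustrative values of M, N, and B, not values taken from the paper:

    # Dominant B**RUN cost factors from Table 7 for example problem sizes.
    M, N, B = 10, 6, 4  # goals, intermediate nodes per goal, bins (examples)

    costs = {
        "myopic VOI (RUN = 1)":            B ** 1,
        "VPI_sub / VPI_L (RUN = N)":       B ** N,
        "VPI_H (RUN = M)":                 B ** M,
        "non-hierarchical VPI (RUN = MN)": B ** (M * N),
    }
    for feature, cost in costs.items():
        print(f"{feature:34s} ~ {cost:.2e}")
    # The non-hierarchical VPI factor (~1.3e36) dwarfs the largest
    # hierarchical factor B**M (~1.0e6), illustrating the speed-up.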
The hierarchical decomposition reduced the time complexity of the myopic strategy discovery problem from O((M·N)^2·B) to O((M+N)^2·B). For the BMPS algorithm, the hierarchical structure reduces the computational time upper bound from O((M·N)^2·B^N) to O(M^2·B + B^M + N·B^N). The reduction in the upper bound implies that algorithms that use the hierarchical structure scale to more complex environments.

As discussed in Section 2.5, metalevel actions are selected by maximising the approximate VOC: at each step, a metalevel action that maximizes the approximate VOC is chosen. Executing a metalevel action converts the probability distribution of the chosen node into a Dirac delta function concentrated at the revealed node value. The calculation of VOI features for continuous probability distributions requires the computation of multiple cumulative distribution functions (CDFs). In general, this procedure is computationally expensive, and approximating it inherently requires discretization. Hence, the probability density function (PDF) of the continuous probability distribution associated with a node in the environment is discretized into B bins. As the number of bins B increases, the discrepancy between the approximation and the true PDF/CDF decreases, shrinking to 0 as B → ∞, at the cost of a higher computational cost.

The number of relevant unrevealed nodes (RUN) is the count of unrevealed nodes relevant to computing a VOI feature. The RUN varies between features and directly affects the time required to compute their values for a given state-action pair. For calculating the approximate VOC, the number of possible value combinations to be compared is B^RUN. Each possible value set requires a constant amount of time α ≠ 0 to compute the highest cumulative return over all possible outcomes of the set. Hence, the time required to calculate the approximate VOC scales with α·B^RUN. For the myopic VOC estimation, the number of relevant unrevealed nodes is 1. For BMPS, the number of relevant nodes differs between the VOI features F = {VOI, VPI, VPI_sub}. The VOI feature requires the least time to compute, since its value depends on only one node in the environment. The most time-consuming calculation is that of the VPI feature, since its value depends on all nodes in the environment. The VPI_sub feature considers the values of all paths that pass through the node evaluated by the metalevel action. Hence, the most time-consuming VPI_sub calculations are those for metalevel actions that correspond to goal nodes. In terms of big-O notation, it takes O(B) time to calculate the myopic VOC for a given state-action pair, whereas it takes O(B^RUN) time to calculate the VPI_sub value for a given state-action pair.

The maximum amount of computational time needed to calculate the approximate VOC directly depends on the selection of all possible metalevel actions, for which we prove upper bounds for both the non-hierarchical (Section A.3) and the hierarchical strategy discovery problem (Section A.4).
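As a concrete illustration of this discretization step, the following sketch replaces a continuous reward distribution by a B-point categorical approximation with equal-probability bins; the use of SciPy and of a normal distribution are assumptions made for the example.

    # B-bin discretization of a continuous reward distribution (sketch).
    import numpy as np
    from scipy import stats

    def discretize(dist, B):
        # Split the support into B equal-probability bins and represent
        # each bin by its conditional mean (computed by numerical
        # integration), so the mean of the approximation is exact.
        edges = dist.ppf(np.linspace(0.0, 1.0, B + 1))
        values = np.array([
            dist.expect(lb=lo, ub=hi, conditional=True)
            for lo, hi in zip(edges[:-1], edges[1:])
        ])
        probs = np.full(B, 1.0 / B)
        return values, probs

    # 4-bin approximation of N(0, 5); the error shrinks to 0 as B grows.
    values, probs = discretize(stats.norm(loc=0, scale=5), B=4)
    print(values, probs)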
A.3 Time complexity of the non-hierarchical strategy discovery problem
In the setting of the non-hierarchical strategy discovery problem, the metalevel policy initially has to select the best metalevel action from M·(N+1)+1 possible actions. This selection requires M·(N+1) VOI feature computations. Since the computation of VPI is required only once for a given state, the next most computationally expensive feature is VPI_sub, which is computed for each possible action. Among all possible actions, the ones that demand the most computational time to compute VPI_sub are the actions that correspond to the goal nodes. In this case, the computation of VPI_sub takes O(B^{N+1}) time.

In general, if there are M goals in the environment and each goal consists of N+1 nodes (i.e., N intermediate nodes plus the goal node), the maximum number of metalevel actions performed, including the termination action, is M·(N+1)+1.

In the non-hierarchical BMPS strategy discovery problem, the computational time upper bound to perform all metalevel actions and terminate is

\sum_{i=0}^{M \cdot (N+1)+1} \left[ M \cdot (N+1) - i \right] \cdot O(B^{N+1}) = \sum_{i=0}^{M \cdot (N+1)} \left[ M \cdot (N+1) - i \right] \cdot O(B^{N})    (8)
= O((M \cdot N)^2 \cdot B^{N})    (9)

For the greedy myopic strategy discovery algorithm, the number of relevant nodes (RUN = 1) reduces the second factor in each term to O(B). In this case, the computational time upper bound to perform all metalevel actions and terminate is

\sum_{i=0}^{M \cdot (N+1)+1} \left[ M \cdot (N+1) + 1 - i \right] \cdot O(B) = \sum_{i=0}^{M \cdot (N+1)} \left[ M \cdot (N+1) + 1 - i \right] \cdot O(B)    (10)
= \frac{(M \cdot (N+1) + 1) \cdot (M \cdot (N+1) + 2)}{2} \cdot O(B)    (11)
= O((M \cdot N)^2 \cdot B)    (12)

A.4 Time complexity of the hierarchical strategy discovery problem
In the setting of the hierarchical strategy discovery problem, the action space shrinks severely for each metalevel action selection. During the goal-setting phase of the procedure, the number of high-level actions, including the high-level policy termination, is M+1. The selection of a high-level action requires the computation of VPI^H and at most M computations of VOI_1^H. Therefore, the computational time upper bound for this phase is

\sum_{i=0}^{M} \left[ (M - i) \cdot O(B) + O(B^{M-i}) \right] = O(M^2 \cdot B) + \sum_{i=0}^{M} O(B^{M-i}) = O(M^2 \cdot B + B^{M})    (13)

Similarly, during the goal-achievement phase of the algorithmic procedure, the number of low-level metalevel actions for each goal is N+1. The most time-consuming feature calculated in the goal-achievement phase is VPI_sub^L. Therefore, the computational time upper bound for this phase per goal is

\sum_{i=0}^{N+1} (N + 1 - i) \cdot O(B^{N+1-i}) = O(N \cdot B^{N})    (14)

To calculate the upper bound for the hierarchical strategy discovery algorithm, the combined computational time of the high-level and the low-level policy is the sum of the computational time consumed on both levels independently. Additionally, since the value of a goal node is not required in the goal-achievement procedure, the computational time upper bound of the metalevel actions at the low level is bounded by the number of intermediate nodes N for each goal separately.

The computational time upper bound of hierarchical BMPS is

\sum_{i=0}^{M} \left[ (M - i) \cdot O(B) + O(B^{M-i}) \right] + \sum_{j=0}^{N} (N - j) \cdot O(B^{N-j}) = O(M^2 \cdot B + B^{M} + N \cdot B^{N})    (15)

In the case of the myopic approximation, the number of relevant nodes at each level is 1. Therefore, the computational time upper bound to perform all metalevel actions and terminate is

\left[ \sum_{i=0}^{M} (M - i) + \sum_{j=0}^{N} (N - j) \right] \cdot O(B) = \left( \frac{M \cdot (M+1)}{2} + \frac{N \cdot (N+1)}{2} \right) \cdot O(B)    (16)
= O((M + N)^2 \cdot B)    (17)

A.5 Analysis of the speed-up achieved by the tree-contraction method