Metareasoning for Planning Under Uncertainty
Christopher H. Lin ∗ Andrey Kolobov Ece Kamar Eric Horvitz
University of Washington, Seattle, WA — Microsoft Research, Redmond, WA
[email protected], {akolobov, eckamar, horvitz}@microsoft.com

Abstract
The conventional model for online planning under uncertainty assumes that an agent can stop and plan without incurring costs for the time spent planning. However, planning time is not free in most real-world settings. For example, an autonomous drone is subject to nature's forces, like gravity, even while it thinks, and must either pay a price for counteracting these forces to stay in place, or grapple with the state change caused by acquiescing to them. Policy optimization in these settings requires metareasoning, a process that trades off the cost of planning and the potential policy improvement that can be achieved. We formalize and analyze the metareasoning problem for Markov Decision Processes (MDPs). Our work subsumes previously studied special cases of metareasoning and shows that in the general case, metareasoning is at most polynomially harder than solving MDPs with any given algorithm that disregards the cost of thinking. For reasons we discuss, optimal general metareasoning turns out to be impractical, motivating approximations. We present approximate metareasoning procedures which rely on special properties of the BRTDP planning algorithm and explore the effectiveness of our methods on a variety of problems.
Offline probabilistic planning approaches, such as policy iteration [Howard, 1960], aim to construct a policy for every possible state before acting. In contrast, online planners, such as RTDP [Barto et al., 1995] and UCT [Kocsis and Szepesvári, 2006], interleave planning with execution. After an agent takes an action and moves to a new state, these planners suspend execution to plan for the next step. The more planning time they have, the better their action choices. Unfortunately, planning time in online settings is usually not free. Consider an autonomous Mars rover trying to decide what to do while a sandstorm is nearing. The size and uncertainty of the domain precludes a priori computation of a complete policy, and demands the use of online planning algorithms.

∗ Research was performed while the author was an intern at Microsoft Research.
Normally, the longer the rover runs its planning algorithm, the better decision it can make. However, computation costs power; moreover, if it reasons for too long without taking preventive action, it risks being damaged by the oncoming sandstorm. Or consider a space probe on final approach to a speeding comet, when the probe must plan to ensure a safe landing based on new information it gets about the comet's surface. More deliberation time means a safer landing. At the same time, if the probe deliberates for too long, the comet may zoom out of range, a similarly undesirable outcome. Scenarios like these give rise to a general metareasoning decision problem: how should an agent trade off the cost of planning and the quality of the resulting policy for the base planning task every time it needs to make a move, so as to optimize its long-term utility? Metareasoning about base-level problem solving has been explored for probabilistic inference and decision making [Horvitz, 1987; Horvitz et al., 1989], theorem proving [Horvitz and Klein, 1995; Kautz et al., 2002], handling streams of problems [Horvitz, 2001; Shahaf and Horvitz, 2009], and search [Russell and Wefald, 1991; Burns et al., 2013]. There has been little work exploring generalized approaches to metareasoning for planning. We explore the general metareasoning problem for Markov decision processes (MDPs). We begin by formalizing the problem with a general but precise definition that subsumes several previously considered metareasoning models. Then, we show with a rigorous theoretical analysis that optimal general metareasoning for planning under uncertainty is at most polynomially harder than solving the original planning problem with any given MDP solver. However, this increase in computational complexity, among other reasons we discuss, renders such optimal general metareasoning impractical.
The analysis raises the issue of allocating time for metareasoning itself, and leads to an infinite regress of meta∗reasoning (metareasoning, metametareasoning, etc.) problems. We next turn to the development and testing of fast approximate metareasoning algorithms. Our procedures use the Bounded RTDP (BRTDP [McMahan et al., 2005]) algorithm to tackle the base MDP problem, and leverage BRTDP-computed bounds on the quality of MDP policies to reason about the value of computation. In contrast to prior work on this topic, our methods do not require any training data, precomputation, or prior information about target domains. We perform a set of experiments showing the performance of these algorithms versus baselines in several synthetic domains with different properties, and characterize their performance with a measure that we call the metareasoning gap, a measure of the potential for improvement from metareasoning. The experiments demonstrate that the proposed techniques excel when the metareasoning gap is large.

Metareasoning efforts to date have employed strategies that avoid the complexity of the general metareasoning problem for planning by relying on different kinds of simplifications and approximations. Such prior studies include metareasoning for time-critical decisions where expected value of computation is used to guide probabilistic inference [Horvitz, 1987; Horvitz et al., 1989], and work on guiding sequences of single actions in search [Russell and Wefald, 1991; Burns et al., 2013]. Several lines of work have leveraged offline learning [Breese and Horvitz, 1990; Horvitz et al., 2001; Kautz et al., 2002].
Other studies have relied on optimizations and inferences that leverage the structure of problems, such as the functional relationships between metareasoning and reasoning [Horvitz and Breese, 1990; Zilberstein and Russell, 1996], the structure of the problem space [Horvitz and Klein, 1995], and the structure of utility [Horvitz, 2001]. In other work, Hansen and Zilberstein [2001] proposed a non-myopic dynamic programming solution for single-shot problems. Finally, several planners rely on a heuristic form of online metareasoning when maximizing policy reward under computational constraints in real-world time with no "conversion rate" between the two [Kolobov et al., 2012; Keller and Geißer, 2015]. In contrast, our metareasoning model is unconstrained, with computational and base-MDP costs in the same "currency." Our investigation also has connections to research on allocating time in a system composed of multiple sensing and planning components [Zilberstein and Russell, 1993; Zilberstein and Russell, 1996], on optimizing portfolios of planning strategies in scheduling applications [Dean et al., 1995], and on choosing actions to explore in Monte Carlo planning [Hay et al., 2012]. In other related work, Chanel et al. [2014] consider how best to plan on one thread while a separate thread handles execution.
A key contribution of our work is formalizing the metareasoning problem for planning under uncertainty. We build on the framework of stochastic shortest path (SSP) MDPs with a known start state. This general MDP class includes finite-horizon and discounted-reward MDPs as special cases [Bertsekas and Tsitsiklis, 1996], and can also be used to approximate partially observable MDPs with a fixed initial belief state. An SSP MDP M is a tuple ⟨S, A, T, C, s₀, s_g⟩, where S is a finite set of states, A is a set of actions that the agent can take, T : S × A × S → [0, 1] is a transition function, C : S × A → ℝ is a cost function, s₀ ∈ S is the start state, and s_g is the goal state. An SSP MDP must have a complete proper policy, a policy that leads to the goal from any state with probability 1, and all improper policies must accumulate infinite cost from every state from which they fail to reach the goal with a positive probability. The objective is to find a Markovian policy π : S → A with the minimum expected cost of reaching the goal from the start state s₀; in SSP MDPs, at least one policy of this form is globally optimal. Without loss of generality, we assume an SSP MDP to have a specially designated NOP ("no-operation") action. NOP is the action the agent chooses when it wants to "idle" and "think/plan", and its semantic meaning is problem-dependent. For example, in some MDPs, choosing NOP means staying in the current state for one time step, while in others it may mean allowing a tidal wave to carry the agent to another state. Designating an action as NOP does not change SSP MDPs' mathematical properties, but plays a crucial role in our metareasoning formalization.
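To make the formalism concrete, the tuple ⟨S, A, T, C, s₀, s_g⟩ with a designated NOP action can be sketched as a small data structure. This is a hypothetical illustration only; the two-state chain, its costs, and the "stay in place" semantics of NOP are invented for the example.

```python
# Sketch of an SSP MDP <S, A, T, C, s0, sg> with a designated NOP action.
# The tiny two-state chain below is invented purely for illustration.

def T(s, a):
    """Successor distribution T(s, a, .) returned as a dict {s': prob}."""
    if a == "NOP":                      # here, NOP means "stay put and think"
        return {s: 1.0}
    if s == "s0" and a == "go":         # "go" usually reaches the goal
        return {"sg": 0.9, "s0": 0.1}
    return {s: 1.0}

def C(s, a):
    """Cost function C(s, a); note that thinking (NOP) is not free."""
    return 0.5 if a == "NOP" else 1.0

mdp = {
    "S": {"s0", "sg"},
    "A": {"go", "NOP"},
    "T": T,
    "C": C,
    "s0": "s0",
    "sg": "sg",
}
```

The only structural requirement the metareasoning formalization adds is that one action of A is singled out as NOP; everything else is a standard SSP MDP.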
The online planning problem of an agent, which involves choosing an action to execute in any given state, is represented as an SSP MDP that encapsulates the dynamics of the environment and the costs of acting and thinking. We call this problem the base problem. The agent starts off in this environment with some default policy, which can be as simple as random or guided by an unsophisticated heuristic. The agent's metareasoning problem, then, amounts to deciding, at every step during its interaction with the environment, between improving its existing policy or using this policy's recommended action, while paying a cost for executing either of these options, so as to minimize its expected cost of getting to the goal. Besides the agent's state in the base MDP, which we call the world state, the agent's metareasoning decisions are conditioned on the algorithm the agent uses for solving the base problem, i.e., intuitively, on the agent's thinking process. To abstract away the specifics of this planning algorithm for the purposes of metareasoning formalization, we view it as a black-box MDP solver and represent it, following the Church-Turing thesis, with a Turing machine B that takes a base SSP MDP M as input. In our analysis, we assume the following about Turing machine B's operation:

• B is deterministic and halts on every valid base MDP M. This assumption does not affect the expressiveness of our model, since randomized Turing machines can be trivially simulated on deterministic ones, e.g., via seed enumeration (although potentially at an exponential increase in time complexity). At the same time, it greatly simplifies our theorems.

• An agent's thinking cycle corresponds to B executing a single instruction.

• A configuration of B is a combination of B's tape contents, state register contents, head position, and next input symbol. It represents the state of the online planner in solving the base problem M. We denote the set of all configurations B ever enters on a given input MDP M as X_B(M). We assume that B can be paused after executing y instructions, and that its configuration at that point can be mapped to an action for any world state s of M using a special function f : S × X_B(M) → A in time polynomial in M's flat representation. The number of instructions needed to compute f is not counted into y. That is, an agent can stop thinking at any point and obtain a policy for its current world state.

• An agent is allowed to "think" (i.e., execute B's instructions) only by choosing the NOP action. If an agent decides to resume thinking after pausing B and executing a few actions, B restarts from the configuration in which it was last paused.

We can now define metareasoning precisely:

Definition 1.
Metareasoning Problem. Consider an SSP MDP M = ⟨S, A, T, C, s₀, s_g⟩ and an SSP MDP solver represented by a deterministic Turing machine B. Let X_B(M) be the set of all configurations B enters on input M, and let T_B(M) : X_B(M) × X_B(M) → {0, 1} be the (deterministic) transition function of B on X_B(M). A metareasoning problem for M with respect to B, denoted Meta_B(M), is an MDP ⟨S^m, A^m, T^m, C^m, s^m₀, s^m_g⟩ s.t.

• S^m = S × X_B(M)
• A^m = A
• T^m((s, χ), a, (s′, χ′)) = T(s, a, s′) if a ≠ NOP, χ = χ′, and a = f(s, χ); T^m((s, χ), NOP, (s′, χ′)) = T(s, NOP, s′) · T_B(M)(χ, χ′); and T^m = 0 in all other cases
• C^m((s, χ), a, (s′, χ′)) = C(s, a) if T^m((s, χ), a, (s′, χ′)) ≠ 0, and 0 otherwise
• s^m₀ = (s₀, χ₀), where χ₀ is the first configuration B enters on input M
• s^m_g = (s_g, χ), where χ is any configuration in X_B(M)

Solving the metareasoning problem means finding a policy for Meta_B(M) with the lowest expected cost of reaching s^m_g.

This definition casts a metareasoning problem for a base MDP as another MDP (a meta-MDP). Note that in Meta_B(M), an agent must choose either NOP or the action currently recommended by f(s, χ); all other transitions have probability 0. Thus, Meta_B(M)'s definition essentially forces an agent to switch between two "meta-actions": thinking, or acting in accordance with the current policy. Modeling an agent's reasoning process with a Turing machine allows us to see that at every time step the metareasoning decision depends on the combination of the current world state and the agent's "state of mind," as captured by the Turing machine's current configuration. In principle, this decision could depend on the entire history of the two, but the following theorem implies that, as for M, at least one optimal policy for Meta_B(M) is always Markovian.

Theorem 1.
If the base MDP M is an SSP MDP, then Meta_B(M) is an SSP MDP as well, provided that B halts on M with a proper policy. If the base MDP M is an infinite-horizon discounted-reward MDP, then so is Meta_B(M). If the base MDP M is a finite-horizon MDP, then so is Meta_B(M).

Proof. Verifying the result for finite-horizon and infinite-horizon discounted-reward MDPs M is trivial, since the only requirement Meta_B(M) must satisfy in these cases is to have a finite horizon or a discount factor, respectively. If M is an SSP MDP, then, per the SSP MDP definition [Bertsekas and Tsitsiklis, 1996], to ascertain the theorem's claim we need to verify that (1) Meta_B(M) has at least one proper policy and (2) every improper policy in Meta_B(M) accumulates an infinite cost from some state. To see why (1) is true, recall that Meta_B(M)'s state space is formed by all configurations Turing machine B enters on M. Consider any state (s′, χ′) of Meta_B(M). Since B is deterministic, as stated in Section 3, the configuration χ′ lies in the linear sequence of configurations between the "designated" initial configuration χ₀ and the final proper-policy configuration that B enters according to the theorem. Thus, B can reach a proper-policy configuration from χ′. Therefore, let the agent starting in the state (s′, χ′) of Meta_B(M) choose NOP until B halts, and then follow the proper policy corresponding to B's final configuration until it reaches a goal state s_g of M. This state corresponds to a goal state (s_g, χ) of Meta_B(M). Since this construction works for any (s′, χ′), it gives a complete proper policy for Meta_B(M). To verify (2), consider any policy π^m for Meta_B(M) that with positive probability fails to reach the goal. Any infinite trajectory of π^m that fails to reach the goal can be mapped onto a trajectory in M that repeats the action choices of π^m's trajectory in M's state space S. Since M is an SSP MDP, this projected trajectory must accumulate an infinite cost, and therefore the original trajectory in Meta_B(M) must do so as well, implying the desired result.

We now present two results to address the difficulty of metareasoning.

Theorem 2.
For an SSP MDP M and a deterministic Turing machine B representing a solver for M, the time complexity of Meta_B(M) is at most polynomial in the time complexity of executing B on M.

Proof. The main idea is to construct the MDP representing Meta_B(M) by simulating B on M. Namely, we can run B on M until it halts and record every configuration B enters to obtain the set X. Given X, we can construct S^m = S × X and all other components of Meta_B(M) in time polynomial in |X| and |M|. Constructing X itself takes time proportional to the running time of B on M. Since, by Theorem 1, Meta_B(M) is an SSP MDP and hence can be solved in time polynomial in the size of its components, e.g., by linear programming, the result follows.

Theorem 3.
Metareasoning for SSP MDPs is P-complete under NC-reduction. (Please see the appendix for proof.)
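The construction in Theorem 2's proof, simulating B to completion and recording every configuration it enters, can be sketched in a few lines. The `solver_configs` iterable below is a hypothetical stand-in for stepping B instruction by instruction; a real solver's configurations would be tape/state snapshots rather than integers.

```python
def build_meta_state_space(world_states, solver_configs):
    """Sketch of Theorem 2's construction: run the (deterministic) solver B
    to completion, record the set X of configurations it enters, and form
    the meta-MDP state space S^m = S x X."""
    X = list(solver_configs)   # deterministic B => one linear run of configs
    return [(s, chi) for s in world_states for chi in X]

# Toy stand-in for B's run: configurations are just instruction counters.
meta_states = build_meta_state_space(["s0", "sg"], range(3))
```

The sketch also makes the proof's caveat visible: enumerating `solver_configs` means running B all the way to halting, which is exactly the expense the discussion below identifies as self-defeating.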
At first glance, the results above look encouraging. However, upon closer inspection they reveal several subtleties making optimal metareasoning utterly impractical. First, although both SSP MDPs and their metareasoning counterparts with respect to an optimal polynomial-time solver are in P, doing metareasoning for a given MDP M is appreciably more expensive than solving that MDP itself. Given that the additional complexity due to metareasoning cannot be ignored, the agent now faces the new challenge of allocating computational time between metareasoning and planning for the base problem. This challenge is a meta-metareasoning problem, and ultimately causes infinite regress, an unbounded nested sequence of ever-costlier reasoning problems. Second, constructing Meta_B(M) by running B on M, as the proof of Theorem 2 proceeds, may entail solving M in the process of metareasoning. While the proof doesn't show that this is the only way of constructing Meta_B(M), without making additional assumptions about B's operation one cannot exclude the possibility of having to run B until convergence, and thereby completely solving M, even before Meta_B(M) is fully formulated. Such a construction would defeat the purpose of metareasoning. Third, the validity of Theorems 2 and 3 relies on an implicit crucial assumption that the transitions of solver B on the base MDP M are known in advance. Without this knowledge, Meta_B(M) turns into a reinforcement learning problem [Sutton and Barto, 1998], which further increases the complexity of metareasoning and the need for simulating B on M. Neither of these is viable in reality. The difficulties with optimal metareasoning motivate the development of approximation procedures. In this regard, the preceding analysis provides two important insights. It suggests that, since running B on M until halting is infeasible, it may be worth trying to predict B's progress on M.
Many existing MDP algorithms have clear operational patterns, e.g., evaluating policies in decreasing order of their cost, as policy iteration does [Howard, 1960]. Regularities like these can be of value in forecasting the benefit of running B on M for additional cycles of thinking. We now focus on exploring approximation schemes that can leverage these patterns.

Our approach to metareasoning is guided by value of computation (VOC) analysis. In contrast to previous work that formulates VOC for single actions or decision-making problems [Horvitz, 1987; Horvitz et al., 1989; Russell and Wefald, 1991], we aim to formulate VOC for online planning. For a given metareasoning problem Meta_B(M), VOC at any encountered state s^m = (s, χ) is exactly the difference between the Q-value of the agent following f(s, χ) (the action recommended by the current policy of the base MDP M) and the Q-value of the agent taking NOP and thinking:

VOC(s^m) = Q*(s^m, f(s, χ)) − Q*(s^m, NOP).   (1)

VOC captures the difference in long-term utility between thinking and acting as determined by these Q-values. An agent should take the NOP action and think when the VOC is positive. Our technique aims to evaluate VOC by estimating Q*(s^m, f(s, χ)) and Q*(s^m, NOP). However, attempting to estimate these terms in a near-optimal manner ultimately runs into the same difficulties as solving Meta_B(M), such as simulating the agent's thinking process many steps into the future, and is likely infeasible. Therefore, fast approximations for the Q-values will generally have to rely on simplifying assumptions. We rely on performing greedy metareasoning analysis, as has been done in past studies of metareasoning [Horvitz et al., 1989; Russell and Wefald, 1991]:

Meta-Myopic Assumption.
In any state s^m of the meta-MDP, we assume that after the current step, the agent will never again choose NOP, and hence will never change its policy.

This meta-myopic assumption is important in allowing us to reduce VOC estimation to predicting the improvement in the value of the base MDP policy following a single thinking step. The weakness of this assumption is that opportunities for subsequent policy improvements are overlooked. In other words, the VOC computation only reasons about the current thinking opportunity. Nonetheless, in practice, we compute VOC at every timestep, so the agent can still think later. Our experiments show that our algorithms perform well in spite of their meta-myopicity.
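Under the meta-myopic assumption, the per-step decision rule therefore collapses to Eq. (1): think exactly when VOC > 0. A minimal sketch (the Q-values are expected costs-to-go, so lower is better; the numbers in the comments are invented):

```python
def choose(q_act, q_nop):
    """Eq. (1): VOC(s^m) = Q*(s^m, f(s, chi)) - Q*(s^m, NOP).
    Q-values here are expected costs-to-go, so a positive VOC means that
    thinking (NOP) has lower expected cost than acting right away."""
    voc = q_act - q_nop
    return "NOP" if voc > 0 else "act"

# Acting now is estimated to cost 12.0, thinking first only 9.5:
# VOC = 2.5 > 0, so the agent thinks.
assert choose(12.0, 9.5) == "NOP"
# Further planning is not expected to pay off: VOC = -0.4, so the agent acts.
assert choose(7.0, 7.4) == "act"
```

The entire approximation machinery that follows exists only to supply the two arguments of `choose` cheaply, without solving Meta_B(M).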
We begin the presentation of our approximation scheme with the selection of B, the agent's thinking algorithm. Since approximating Q*(s^m, f(s, χ)) and Q*(s^m, NOP) essentially amounts to assessing policy values, we would like an online planning algorithm that provides efficient policy value approximations, preferably with some guarantees. Having access to these policy value approximations enables us to design approximate metareasoning algorithms that can evaluate VOC efficiently in a domain-independent fashion. One algorithm with this property is Bounded RTDP (BRTDP) [McMahan et al., 2005]. It is an anytime planning algorithm based on RTDP [Barto et al., 1995]. Like RTDP, BRTDP maintains a lower bound on an MDP's optimal value function V*, which is repeatedly updated via Bellman backups as BRTDP simulates trials/rollouts to the goal, making BRTDP's configuration-to-configuration transition function T_B(M)(χ, χ′) stochastic. A key difference is that in addition to maintaining a lower bound, it also maintains an upper bound, updated in the same conceptual way as the lower one. If BRTDP is initialized with a monotone upper-bound heuristic, then the upper bound decreases monotonically as BRTDP runs. The construction of domain-independent monotone bounds is beyond the scope of this paper, but is easy for the domains we study in our experiments. Another key difference between BRTDP and RTDP is that if BRTDP is stopped before convergence, it returns an action greedy with respect to the upper, not lower, bound. This behavior guarantees that the expected cost of a policy returned at any time by a monotonically-initialized BRTDP is no worse than BRTDP's current upper bound. Our metareasoning algorithms utilize these properties to estimate VOC. In the rest of the discussion, we assume that BRTDP is initialized with a monotone upper-bound heuristic.
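The two operations our metareasoner relies on, Bellman backups on a bound and greedy action selection on the upper bound, can be sketched as follows. This is a schematic fragment, not BRTDP itself: trial sampling, outcome selection guided by the bound gap, and termination conditions are all omitted, and the toy transition/cost functions in the usage example are invented.

```python
def backup(V, s, actions, T, C):
    """One Bellman backup for a cost-minimization MDP:
    V(s) <- min_a [ C(s, a) + sum_{s'} T(s, a, s') * V(s') ].
    BRTDP applies updates of this form to both its lower and upper bounds."""
    V[s] = min(C(s, a) + sum(p * V[s2] for s2, p in T(s, a).items())
               for a in actions)
    return V[s]

def greedy_on_upper(V_upper, s, actions, T, C):
    """If stopped early, BRTDP returns the action greedy w.r.t. the *upper*
    bound, whose expected cost is no worse than that bound."""
    return min(actions,
               key=lambda a: C(s, a) + sum(p * V_upper[s2]
                                           for s2, p in T(s, a).items()))

# Invented toy problem: from s0, "go" reaches the goal sg w.p. 0.9; "NOP" idles.
T = lambda s, a: {"sg": 0.9, "s0": 0.1} if (s, a) == ("s0", "go") else {s: 1.0}
C = lambda s, a: 0.5 if a == "NOP" else 1.0
V_upper = {"s0": 10.0, "sg": 0.0}
backup(V_upper, "s0", ["go", "NOP"], T, C)   # upper bound tightens: 10.0 -> 2.0
best = greedy_on_upper(V_upper, "s0", ["go", "NOP"], T, C)
```

With a monotone initialization, repeated backups only ever lower `V_upper`, which is the property the drop-modeling below exploits.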
We now show how BRTDP's properties help us with estimating the two terms in the definition of VOC, Q*(s^m, f(s, χ)) and Q*(s^m, NOP). We first assume that one "thinking cycle" of BRTDP (i.e., executing NOP once and running BRTDP in the meantime, resulting in a transition from BRTDP's current configuration χ to another configuration χ′) corresponds to completing some fixed number of BRTDP trials from the agent's current world state s.

Estimating Q*(s^m, NOP). We first describe how to estimate the value of taking the NOP action (thinking). At the highest level, this estimation first involves writing down an expression for Q*(s^m, NOP), making a series of approximations for different terms, and then modeling the behavior of how BRTDP's upper bounds on the Q-value function drop in order to compute the needed quantities. When opting to think by choosing NOP, the agent may transition to a different world state while simultaneously updating its policy for the base problem. Therefore, we can express

Q*(s^m, NOP) = Σ_{s′} T(s, NOP, s′) Σ_{χ′} T_B(M)(χ, χ′) V*((s′, χ′)).   (2)

Because of meta-myopicity, we have V*((s′, χ′)) = V^{χ′}(s′), where V^{χ′} is the value function of the policy corresponding to χ′ in the base MDP. However, this expression cannot be efficiently evaluated in practice, since we know neither BRTDP's transition distribution T_B(M)(χ, χ′) nor the state values V^{χ′}(s′), forcing us to make further approximations. To do so, we treat V^{χ′} and Q^{χ′} as random variables, and rewrite

Σ_{χ′} T_B(M)(χ, χ′) V^{χ′}(s′) = Σ_a P(A^{χ′}_{s′} = a) E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a],   (3)

where the random variable A^{χ′}_{s′} takes value a iff f(s′, χ′) = a after one thinking cycle in state (s, χ). Intuitively, P(A^{χ′}_{s′} = a) denotes the probability that BRTDP will recommend action a in state s′ after one thinking cycle. Now, let us denote the Q-value upper bound corresponding to BRTDP's current configuration χ as Q̄^χ. This value is known. Then, let the upper bound corresponding to BRTDP's next configuration χ′ be Q̄^{χ′}. Because we do not know χ′, this value is unknown, and is a random variable.
Because BRTDP selects actions greedily w.r.t. the upper bound, we follow this behavior and use the upper bound to estimate the Q-value, assuming that Q^{χ′} = Q̄^{χ′}. Since the value of Q̄^{χ′} is unknown at the time of the VOC computation, P(A^{χ′}_{s′} = a) and E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a] are computed by integrating over the possible values of Q̄^{χ′}. We have

E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a] = (1 / P(A^{χ′}_{s′} = a)) ∫ Q̄^{χ′}(s′, a) P(A^{χ′}_{s′} = a | Q̄^{χ′}(s′, a)) P(Q̄^{χ′}(s′, a)) dQ̄^{χ′}(s′, a),

and

P(A^{χ′}_{s′} = a) = ∫ P(Q̄^{χ′}(s′, a)) Π_{aᵢ ≠ a} P(Q̄^{χ′}(s′, aᵢ) > Q̄^{χ′}(s′, a)) dQ̄^{χ′}(s′, a).

Therefore, we must model the distribution that Q̄^{χ′} is drawn from. We do so by modeling the change ΔQ̄ = Q̄^χ − Q̄^{χ′} due to a single BRTDP thinking cycle that corresponds to a transition from configuration χ to χ′. Since Q̄^χ is known and fixed, estimating a distribution over possible ΔQ̄ gives us a distribution over Q̄^{χ′}. Let Δ̂Q̄_{s,a} be the change in Q̄_{s,a} resulting from the most recent thinking cycle for some state s and action a. We first assume that the change resulting from an additional cycle of thinking, ΔQ̄_{s,a}, will be no larger than the last change: ΔQ̄_{s,a} ≤ Δ̂Q̄_{s,a}. This assumption is reasonable, because we can expect the change in bounds to decrease as BRTDP converges to the true value function. Given this assumption, we must choose a distribution D over the interval [0, Δ̂Q̄_{s,a}] such that for the next thinking cycle, ΔQ̄_{s,a} ∼ D.
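For intuition about what this drop model computes, one can approximate P(A^{χ′}_{s′} = a) by simple Monte Carlo instead of closed-form integration: sample each action's next drop from Uniform[0, Δ̂Q̄_{s,a}] (independently, as in the uncorrelated variant) and count how often each action attains the lowest sampled upper bound. This sketch is for illustration only; the paper's algorithms evaluate the integrals directly, and the sampling interface here is invented.

```python
import random

def recommendation_probs(q_upper, last_drops, n_samples=20000, seed=0):
    """Monte Carlo illustration of the uniform-drop model. For each action a,
    sample its next bound drop from Uniform[0, last observed drop] and record
    which action ends up with the lowest upper bound, i.e., which action
    BRTDP (greedy on upper bounds) would then recommend."""
    rng = random.Random(seed)
    counts = {a: 0 for a in q_upper}
    for _ in range(n_samples):
        sampled = {a: q_upper[a] - rng.uniform(0.0, last_drops[a])
                   for a in q_upper}
        counts[min(sampled, key=sampled.get)] += 1
    return {a: c / n_samples for a, c in counts.items()}
```

For example, if one action's worst-case next bound (no drop at all) still lies below another action's best-case bound, the first action is recommended with probability 1; when the reachable intervals overlap, the probability mass splits between the actions.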
Figure 1a illustrates these modeling assumptions for two hypothetical actions, a₁ and a₂. One option is to make D uniform, so as to represent our poor knowledge about the next bound change. Then, computing P(A^{χ′}_{s′} = a) involves evaluating an integral of a polynomial of degree O(|A|) (the product of |A| − 1 CDFs and a PDF), and computing E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a] also entails evaluating an integral of degree O(|A|); thus these quantities for all actions in a state can be computed in time O(|A|²). Since the overall goal of this subsection, approximating Q*(s^m, NOP), requires computing P(A^{χ′}_{s′} = a) for all actions in all states where NOP may lead, assuming there are no more than K ≪ |A| such states, the complexity becomes O(K|A|²) for each state visited by the agent on its way to the goal.

A weakness of this approach is that the changes in the upper bounds for different actions are modeled independently. For example, if the upper bounds for two actions in a given state decrease by a large amount in the previous thinking step, then it is unlikely that in the next thinking step one of them will drop dramatically while the other drops very little. This independence can cause the amount of uncertainty in the upper bound at the next thinking step to be overestimated, leading to VOC being overestimated as well. Therefore, we create another version of the algorithm assuming that the speeds of decrease of the Q-value upper bounds for all actions are perfectly correlated; all ratios between future drops in the next thinking cycle are equal to the ratios between the observed drops in the last thinking cycle. Formally, for a given state s, we let ρ ∼ Uniform[0, 1] and set ΔQ̄_{s,a} = ρ · Δ̂Q̄_{s,a} for all actions a. Now, to compute P(A^{χ′}_{s′} = a), for each action a we represent the range of its possible future Q-values Q̄^{χ′}_{s,a} with a line segment l_a on the unit interval [0, 1], where l_a(0) = Q̄^χ_{s,a} and l_a(1) = Q̄^χ_{s,a} − Δ̂Q̄_{s,a}. Then, P(A^{χ′}_{s′} = a) is simply the proportion of l_a which lies below all the other lines representing all other actions. We can naïvely compute these probabilities in time O(|A|²) by enumerating all intersections. Similarly, computing E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a] is also easy: this value is the mean of the portion of l_a that is beneath all other lines. Figure 1b illustrates these computations.

[Figure 1: a) Hypothetical drops in upper bounds on the Q-values of two actions, a₁ and a₂. We assume the next Q-value drop resulting from another cycle of thinking, ΔQ̄, is drawn from a range equal to the last drop from thinking, Δ̂Q̄. b) Assuming perfect correlation in the speed of decrease of the Q-value upper bounds, as the upper bounds of the two actions drop from an additional cycle of thinking, initially a₁ has a better upper bound, but eventually a₂ overtakes a₁.]

Whether or not we make the assumption of action independence, we further speed up the computations by only calculating E[Q^{χ′}(s′, a) | A^{χ′}_{s′} = a] and P(A^{χ′}_{s′} = a) for the two "most promising" actions a, those with the lowest expectations of potential upper bounds. This limits the computation time to the time required to determine these actions (linear in |A|), and makes the time complexity of estimating Q*(s^m, NOP) for one state s be O(K|A|) instead of O(K|A|²).

Estimating Q*(s^m, f(s, χ)). Now that we have described how to estimate the value of taking the
NOP action, we describe how to estimate the value of taking the currently recommended action, f(s, χ). We estimate Q*(s^m, f(s, χ)) by computing E[Q^{χ′}(s, f(s, χ))], which takes constant time, keeping the overall time complexity linear. The reason we estimate Q*(s^m, f(s, χ)) using future Q-value upper-bound estimates based on a probabilistic projection of χ′, as opposed to our current Q-value upper bounds based on the current configuration χ, is to make use of the more informed bounds derived at the future utility estimation. As the BRTDP algorithm is given more computation time, it can more accurately estimate the upper bound of a policy. This type of approximation has been justified before [Russell and Wefald, 1991]. In addition, using future utility estimates in both estimating Q*(s^m, f(s, χ)) and Q*(s^m, NOP) provides a consistency guarantee: if thinking leads to no policy change, then our method estimates VOC to be zero.

The core of our algorithms involves the computations we have described, applied in every state s the agent visits on the way to the goal. In the experiments, we denote UnCorrMetareasoner as the metareasoner that assumes the actions are uncorrelated, and
Metareasoner as the metareasoner that does not make this assumption. To complete the algorithms, we ensure that they decide the agent should think for another cycle if ∆̂Q_{s,a} is not yet available for the agent's current world state s (e.g., because BRTDP has never updated the bounds for this state's Q-values so far), since the VOC computation is not possible without prior observations of ∆̂Q_{s,a}. Crucially, all our estimates make metareasoning take time only linear in the number of actions, O(K|A|), per visited state.

We evaluate our metareasoning algorithms in several synthetic domains designed to reflect a wide variety of factors that could influence the value of metareasoning. Our goal is to demonstrate the ability of our algorithms to estimate the value of computation and adapt to a plethora of world conditions. The experiments are performed on four domains, all built on a square grid world in which the agent can move between cells at each time step to reach the goal located in the upper right corner. To initialize the lower and upper bounds of BRTDP, we use the zero heuristic and an appropriately scaled (multiplied by a constant) Manhattan distance to the goal, respectively. The four domains are as follows:

• Stochastic.
This domain adds winds to the grid world, making it analogous to worlds with stochastic state transitions. Moving against the wind causes slower movement across the grid, whereas moving with the wind results in faster movement. The agent's initial state is the southeast corner and the goal is located in the northeast corner. We set the parameters of the domain so that some policy can get the agent to the goal in a small number of steps (tens instead of hundreds) and so that the winds significantly influence the number of steps needed: the agent can move 11 cells at a time, and the wind has a pushing power of 10 cells. The agent's next location is determined by adding the agent's movement vector and the wind's vector, except when the agent decides to think (executes NOP), in which case it stays in the same position. Thus, the winds can never push the agent in the direction opposite to its intention. The prevailing wind direction over most of the grid is northerly, except in the column of cells containing the goal and the starting position, where it is southerly. Note that this southerly wind direction makes the initial heuristic extremely suboptimal. To simulate stochastic state transitions, the winds blow in their prevailing direction in a given cell with 60% probability; with 40% probability they blow in a direction orthogonal to the prevailing one (20% easterly and 20% westerly).

We perform a set of experiments on this simplest domain of the set to observe the effect of different costs of thinking and acting on the behaviors of the algorithms. We vary the cost of thinking and the cost of acting between 1 and 15. When we vary the cost of thinking, we fix the cost of acting at 11, and when we vary the cost of acting, we fix the cost of thinking at 1.

• Traps.
This domain modifies the Stochastic domain to represent settings where the costs of thinking and acting are not constant across states. To simplify the parameter choices, we fix the costs of thinking and acting to be equal, respectively, to the wind strength and the agent's moving distance; thus, the cost of thinking is 10 and the cost of acting is 11. To vary the costs of thinking and acting between states, we make thinking and acting at the initial state extremely expensive, at a cost of 100, about 10 times the cost of acting and thinking in the other states. Thus, the agent is forced to think outside its initial state in order to perform optimally.

• DynamicNOP-1.
In the previous domains, executing a NOP does not change the agent's state. In this domain, thinking causes the agent to move in the direction of the wind, so the agent stochastically transitions as a result of thinking. The cost of thinking is therefore composed of both an explicit and an implicit component: a static value of 1 unit, and a dynamic component determined by the stochastic state transitions caused by thinking. The static value is set to 1 so that the dynamic component can dominate the decisions about thinking. The agent starts in cell (98, …). We change the wind directions so that there are easterly winds in the southernmost row and northerly winds in the easternmost column, which can push the agent very quickly to the goal. Westerly winds exist everywhere else, pushing the agent away from the goal. We change the stochasticity of the winds so that the westerly winds change to northerly winds with 20% probability, while all other wind directions are deterministic. We lower the amount of stochasticity to better see whether our agents can reason about the implicit costs of thinking. The wind directions are arranged so that there is potential for the agent to improve upon its initial policy, but thinking is risky: it can move the agent into the left region of the grid, from which recovery is hard, since all the winds there push the agent away from the goal.

• DynamicNOP-2.
This domain is just like the previous one, but we change the direction of the winds in the northernmost row to be easterly. These winds also do not change direction. In this domain, taking a thinking action is less risky than in the previous one: even when the agent is pushed to the left region of the board, it can find strategies to reach the goal quickly by exploiting the easterly wind at the top of the board.
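For concreteness, the shared wind dynamics of these domains can be sketched as a transition function. This is our own illustrative sketch, not the authors' implementation: the function names, the 100×100 grid size, and the coordinate convention are assumptions, while the move distance (11 cells), wind strength (10 cells), and the 60/40 wind stochasticity come from the Stochastic domain description above.

```python
import random

MOVE = 11   # agent's moving distance (Stochastic domain)
WIND = 10   # wind's pushing power

# Unit vectors (dx, dy) for the four compass directions.
DIRS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def orthogonal(d):
    """Directions orthogonal to the prevailing wind (the 40% case)."""
    return ["E", "W"] if d in ("N", "S") else ["N", "S"]

def sample_wind(prevailing, rng=random):
    """Prevailing direction with 60% probability; otherwise one of the
    two orthogonal directions (20% each), as in the Stochastic domain."""
    if rng.random() < 0.6:
        return prevailing
    return rng.choice(orthogonal(prevailing))

def step(pos, action, prevailing, size=100, rng=random):
    """One transition: the agent's move vector plus the wind's vector,
    clamped to the grid. Thinking (action is None, i.e. NOP) leaves the
    agent in place in the Stochastic and Traps domains."""
    x, y = pos
    if action is None:                       # NOP: think, don't move
        return pos
    ax, ay = DIRS[action]
    wx, wy = DIRS[sample_wind(prevailing, rng)]
    nx = min(max(x + MOVE * ax + WIND * wx, 0), size - 1)
    ny = min(max(y + MOVE * ay + WIND * wy, 0), size - 1)
    return (nx, ny)
```

Because the move distance (11) exceeds the wind strength (10), adding the two vectors can never produce net movement opposite to the agent's intention, matching the text above.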
We introduce the concept of the metareasoning gap as a way to quantify the potential improvement over the initial heuristic-implied policy, denoted Heuristic, that optimal metareasoning could achieve. The metareasoning gap is the ratio of the expected cost of Heuristic for the base MDP to the expected cost of the optimal metareasoning policy, computed at the initial state. Exact computation of the metareasoning gap requires evaluating the optimal metareasoning policy and is infeasible. Instead, we compute an upper bound on the metareasoning gap by substituting the cost of the optimal metareasoning policy with the cost of the optimal policy for the base MDP (denoted OptimalBase). The metareasoning gap can be no larger than this upper bound, because metareasoning can only add cost to OptimalBase. We report this upper bound (MG_UB) for each domain in Table 1 and show that our algorithms for metareasoning provide significant benefits when MG_UB is high. We note that none of the algorithms use the metareasoning gap in their reasoning.

                         Heuristic   OptimalBase   MG_UB
  Stochastic (Thinking)     1089        103.9       10.5
  Stochastic (Acting)      767.3         68.1       11.3
  Traps                      979        113.5        8.6
  DynamicNOP-1             251.4         66          3.8
  DynamicNOP-2             119.4         66          1.8

Table 1: Upper bounds of the metareasoning gaps (MG_UB) for all test domains, defined as the ratio of the expected cost of the initial heuristic policy (Heuristic) to that of an optimal one (OptimalBase) at the initial state.
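The MG_UB column of Table 1 is simply the ratio of the two expected-cost columns, which can be checked in a few lines of Python (a sanity check on the table, not part of any algorithm):

```python
# Expected costs from Table 1: (Heuristic, OptimalBase) per domain.
costs = {
    "Stochastic (Thinking)": (1089.0, 103.9),
    "Stochastic (Acting)":   (767.3, 68.1),
    "Traps":                 (979.0, 113.5),
    "DynamicNOP-1":          (251.4, 66.0),
    "DynamicNOP-2":          (119.4, 66.0),
}

# MG_UB = cost(Heuristic) / cost(OptimalBase): an upper bound on the
# metareasoning gap, since metareasoning can only add cost to OptimalBase.
mg_ub = {d: round(h / o, 1) for d, (h, o) in costs.items()}
print(mg_ub)  # Stochastic (Thinking) -> 10.5, DynamicNOP-2 -> 1.8, etc.
```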
We compare our metareasoning algorithms against a number of baselines. The Think*Act baseline simply plans for n cycles at the initial state and then executes the resulting policy, without planning again. We also consider the Prob baseline, which chooses to plan with probability p at each state and executes its current policy with probability 1 − p. An important drawback of these baselines is that their performance is sensitive to their parameters n and p, and the optimal parameter settings vary across domains. The NoInfoThink baseline plans for another cycle if it has no information about how the BRTDP upper bounds will change. This baseline is a simplified version of our algorithms that does not try to estimate the VOC.

For each experimental condition, we run each metareasoning algorithm until it reaches the goal 1000 times and average the results to account for stochasticity. Each BRTDP trajectory is 50 actions long.

In Stochastic, we perform several experiments varying the costs of thinking (
NOP) and acting. We observe (figures can be found in the appendix) that when the cost of thinking is low or the cost of acting is high, the baselines do well with high values of n and p; when the costs are reversed, smaller values do better. This trend is expected, since a lower thinking cost affords more thinking, but these baselines offer no way to predict the "successful" values of n and p in advance. Metareasoner does not require parameter tuning and beats even the best-performing baseline in all settings. Figure 2a compares the metareasoning algorithms against the baselines when the results are averaged over the various settings of the cost of acting, and Figure 2b shows results averaged over the various settings of the cost of thinking.
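The parameter sensitivity of these baselines is visible directly in their decision rules. The sketch below is our own illustration (the class names and `choose` interface are assumptions; planning is abstracted into returning the NOP action):

```python
import random

class ThinkStarAct:
    """Think*Act: plan for n cycles at the initial state, then only act."""
    def __init__(self, n):
        self.remaining = n
    def choose(self, state, current_policy):
        if self.remaining > 0:
            self.remaining -= 1
            return "NOP"                 # spend a cycle planning
        return current_policy(state)     # execute; never plan again

class Prob:
    """Prob: at each state, plan with probability p, else act."""
    def __init__(self, p, rng=random):
        self.p, self.rng = p, rng
    def choose(self, state, current_policy):
        if self.rng.random() < self.p:
            return "NOP"
        return current_policy(state)
```

Both baselines commit to a fixed n or p regardless of the state, which is exactly why their best settings vary across domains while the metareasoners need no tuning.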
Metareasoner does extremely well in this domain because the metareasoning gap is large, suggesting that metareasoning can improve the initial policy significantly. Importantly, we see that Metareasoner performs better than NoInfoThink, which shows the benefit of reasoning about how the bounds on Q-values will change. UnCorrMetareasoner does not do as well as Metareasoner, probably because the assumption that the actions' Q-values are uncorrelated does not hold well.

Figure 2: Comparison of Metareasoner and UnCorrMetareasoner with baselines on the experimental domains. For readability, some figures do not include Heuristic when it performs especially poorly.

We now turn to
Traps, where thinking and acting in the initial state incur significant cost. Figure 2c again summarizes the results. Think*Act performs very poorly because it is limited to thinking only at the initial state. Metareasoner does well: it figures out that it should not think in the initial state (beyond the initial thinking step) and should instead quickly move to safer locations. UnCorrMetareasoner also closes the metareasoning gap significantly, but again not as much as Metareasoner.

We now consider
DynamicNOP-1, a domain adversarial to approximate metareasoning, because winds almost everywhere push the agent away from the goal. There are only a few locations from which winds can carry the agent to the destination. Figure 2d shows that our algorithms do not achieve large gains here. This result is not surprising: the best policy involves little thinking, because whenever the agent chooses to think it is pushed away from the goal, and thinking for just a few consecutive time steps can take the agent to states from which reaching the goal is extremely difficult. Consequently, Think*Act with 1–3 thinking steps turns out to be near-optimal, since it is pushed away from the goal only slightly and can use a slightly improved heuristic to head back. Metareasoner actually does well in many individual runs, but occasionally thinks longer due to stochasticity in the VOC computation and can get stuck, yielding a higher average policy cost. In particular, it may repeatedly be pushed into a state it has never encountered before, where it must think again because it has no history of how BRTDP's bounds have changed in that state, and then subsequently be pushed into yet another unencountered state. In this domain, our approximate algorithms can diverge from an optimal policy, which would plan very little to minimize the risk of being pushed away from the goal.
DynamicNOP-2 provides the agent more opportunities to recover from poor decisions. Figure 2e demonstrates that our algorithms perform much better in DynamicNOP-2 than in DynamicNOP-1. In DynamicNOP-2, even if our algorithms do not discover from the initial thinking the jetstreams that can push the agent towards the goal, they get more chances to recover when they get stuck. When thinking can move the agent on the board, having more opportunities to recover reduces the risk associated with making suboptimal thinking decisions. Interestingly, the addition of the extra jetstream decreases the metareasoning gap at the initial state. However, the metareasoning gap at many other states in the domain increases, showing that the metareasoning gap at the initial state is not always the best way to characterize the potential for improvement via metareasoning.
We formalize and analyze the general metareasoning problem for MDPs, demonstrating that metareasoning is only polynomially harder than solving the base MDP. Given the determination that optimal general metareasoning is impractical, we turn to approximate metareasoning algorithms, which estimate the value of computation by relying on the bounds given by BRTDP. Finally, we empirically compare our metareasoning algorithms to several baselines on problems designed to reflect challenges posed across a spectrum of worlds, and show that the proposed algorithms are much better at closing large metareasoning gaps.

We have assumed that the agent can plan only when it takes the NOP action. A generalization of our work would allow varying amounts of thinking as part of any action: some actions may consume more CPU resources than others, and actions that do not consume all resources during execution can allocate the remainder to planning. We can also relax the meta-myopic assumption, so as to consider the consequences of thinking for more than one cycle. In many cases, assuming that the agent will think for only one more step can lead to underestimation of the value of thinking, since many cycles of thinking may be necessary to see significant value. This ability can be obtained within our current framework by projecting changes in bounds for multiple steps. However, in experiments to date, we have found that pushing out the horizon of analysis led to large accumulations of error and poor performance, due to approximation upon approximation in predictions about multiple thinking cycles. Finally, we may be able to improve our metareasoners by learning about, and harnessing, more details of the base-level planner. In our Metareasoner approximation scheme, we make strong assumptions about how the upper bounds provided by BRTDP will change; learning distributions over these changes may improve performance. More informed models may lead to accurate estimation of the non-myopic value of computation. However, learning such distributions in a domain-independent manner is difficult, since the planner's behavior depends heavily on the domain and heuristic at hand.
References

[Barto et al., 1995] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.

[Bertsekas and Tsitsiklis, 1996] Dimitri P. Bertsekas and John Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, 1996.

[Breese and Horvitz, 1990] John S. Breese and Eric Horvitz. Ideal reformulation of belief networks. In UAI, 1990.

[Burns et al., 2013] Ethan Burns, Wheeler Ruml, and Minh B. Do. Heuristic search when time matters. Journal of Artificial Intelligence Research, 47:697–740, 2013.

[Chanel et al., 2014] Caroline P. Carvalho Chanel, Charles Lesire, and Florent Teichteil-Königsbuch. A robotic execution framework for online probabilistic (re)planning. In ICAPS, 2014.

[Dean et al., 1995] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76:35–74, 1995.

[Hansen and Zilberstein, 2001] Eric A. Hansen and Shlomo Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126(1):139–157, 2001.

[Hay et al., 2012] Nick Hay, Stuart Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computations: Theory and applications. In UAI, 2012.

[Horvitz and Breese, 1990] Eric J. Horvitz and John S. Breese. Ideal partition of resources for metareasoning. Technical Report KSL-90-26, Stanford University, 1990.

[Horvitz and Klein, 1995] Eric Horvitz and Adrian Klein. Reasoning, metareasoning, and mathematical truth: Studies of theorem proving under limited resources. In UAI, 1995.

[Horvitz et al., 1989] Eric J. Horvitz, Gregory F. Cooper, and David E. Heckerman. Reflection and action under scarce resources: Theoretical principles and empirical study. In IJCAI, 1989.

[Horvitz et al., 2001] Eric Horvitz, Yongshao Ruan, Carla P. Gomes, Henry Kautz, Bart Selman, and David M. Chickering. A Bayesian approach to tackling hard computational problems. In UAI, 2001.

[Horvitz, 1987] Eric Horvitz. Reasoning about beliefs and actions under computational resource constraints. In UAI, 1987.

[Horvitz, 2001] Eric Horvitz. Principles and applications of continual computation. Artificial Intelligence, 126:159–196, 2001.

[Howard, 1960] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.

[Kautz et al., 2002] Henry Kautz, Eric Horvitz, Yongshao Ruan, Carla Gomes, and Bart Selman. Dynamic restart policies. In AAAI, 2002.

[Keller and Geißer, 2015] Thomas Keller and Florian Geißer. Better be lucky than good: Exceeding expectations in MDP evaluation. In AAAI, 2015.

[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.

[Kolobov et al., 2012] Andrey Kolobov, Mausam, and Daniel S. Weld. LRTDP versus UCT for online probabilistic planning. In AAAI, 2012.

[McMahan et al., 2005] H. Brendan McMahan, Maxim Likhachev, and Geoffrey J. Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In ICML, 2005.

[Russell and Wefald, 1991] Stuart Russell and Eric Wefald. Principles of metareasoning. Artificial Intelligence, 49(1):361–395, 1991.

[Shahaf and Horvitz, 2009] Dafna Shahaf and Eric Horvitz. Investigations of continual computation. In IJCAI, 2009.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.

[Zilberstein and Russell, 1993] Shlomo Zilberstein and Stuart J. Russell. Anytime sensing, planning and action: A practical model for robot control. In IJCAI, 1993.

[Zilberstein and Russell, 1996] Shlomo Zilberstein and Stuart Russell. Optimal composition of real-time systems. Artificial Intelligence, 82:181–213, 1996.

Figure 3: The construction of a base MDP M′ for a metareasoning problem from an input SSP MDP M.

A Appendix
A.1 Proof of Theorem 3
Proof.
By calling metareasoning P-complete we mean that there exists a Turing machine B s.t. (1) for any input SSP MDP M′, Meta_B(M′) can be decided in time polynomial in |M′|, i.e., Meta_B(M′) is in P, and (2) there is a class of P-complete problems that can be converted to Meta_B(M′) via an NC-reduction, i.e., by constructing M′ appropriately using a polynomial number of parallel Turing machines, each operating in polylogarithmic time.

The first part of the above claim follows from Theorem 2: since SSP MDPs are solvable optimally by linear programming in polynomial time, Meta_B(M′) is in P if B encodes a polynomial solver for linear programs.

For the second part, we perform an NC-reduction from the class of SSP MDPs to the class of SSP-MDP-based metareasoning problems with respect to a fixed optimal polynomial-time solver B. Specifically, given an SSP MDP M with an initial state, we show how to construct another SSP MDP M′ s.t., for the optimal polynomial-time solver B we describe shortly, deciding Meta_B(M′) is equivalent to deciding M. The intuition behind converting a given SSP MDP M into M′, the SSP MDP that will serve as the base in our metareasoning problem, is to augment M with new states where the agent can "think" by using a zero-cost NOP action until it arrives at an optimal policy for the original states of M. Afterwards, the agent can transition from any of these newly added "thinking states" to M's original start state s_0 and execute the optimal policy from there. Unfortunately, the proof is not as straightforward as it seems, because we cannot simply build M′ by equipping M with a new start state that has a self-loop zero-cost NOP action: an M′ with such an action would violate the SSP MDP definition. Below, we show how to overcome this difficulty.
Since thinking in the newly added states of M′ costs nothing, the cost of an optimal policy for Meta_B(M′) is the same as for M, so deciding the former problem decides the latter.

The construction of M′ from a given SSP MDP M is illustrated in Figure 3. Consider the number of instruction steps it takes to solve M by linear programming. This number is polynomial; namely, there exists a polynomial p_LP(|M|) that bounds M's solution time from above. To transform M into M′, we add a set of p_LP(|M|) states, s′_1, s′_2, …, s′_{p_LP(|M|)}, to M. These new states connect into a chain via zero-cost NOP actions: the start state s′_1 of M′ links to s′_2, s′_2 links to s′_3, and so on, until s′_{p_LP(|M|)} links to s_0, the start state of M. In addition, for every original state of M, we create a self-loop NOP action with a positive cost. The entire transformation can easily be implemented as an NC-reduction on p_LP(|M|) + |S| computers, each recording the cost and transition function of NOP for a separate state. Since for each state,
NOP's cost and transition functions together can be encoded by just two numbers (the NOP transition function assigns probability 1 to a single transition that is implicitly but unambiguously determined for every state), each computer operates in polylogarithmic time. Moreover, initializing each of the parallel machines with the MDP state for which it is supposed to write out the transition and cost function values is as simple as appropriately setting a pointer to the input tape, and can be done in log-space. Thus, the above procedure is a valid NC-reduction. Note also that M′ is an SSP MDP: although it has zero-cost actions, they do not form loops or strongly connected components.

Our motivation for constructing M′ as above was to provide the agent with enough states where it can "think" to guarantee that, if the agent starts at s′_1, it arrives at M's initial state s_0 with a computed optimal policy for s_0 onwards. This would imply that the expected cost of an optimal policy for Meta_B(M′) from s′_1 is the same as for M. However, for this guarantee to hold, we need a general SSP MDP solver B that can solve/decide M′ in time O(poly(|M|)), not O(poly(|M′|)). The difference matters because M′ is larger than M, so the newly added chain of states may not be long enough for an O(poly(|M′|)) policy computation to incur zero cost.

To circumvent this issue, we define B to recognize "lollypop-shaped" MDPs M′ as in Figure 3, which consist of an arbitrarily connected subset S_c of the state space representing a sub-MDP M_c, preceded by a chain of NOP-connected states of size p_LP(|M_c|) leading to M_c's start state s_0, and to ignore the linear chain part. (The policy for the linear chain part is determined uniquely, so there is no need to write it out explicitly.)
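As a concrete illustration, the lollypop construction can be sketched in a few lines of Python. This is our own sketch, not part of the formal reduction: the dictionary-based MDP encoding is an assumption, and the self-loop NOP cost of 1 merely stands in for any positive cost.

```python
def build_lollypop(M, p_lp_bound):
    """Augment SSP MDP M with a chain of p_lp_bound "thinking states"
    s'1 .. s'p connected by zero-cost NOP actions, the last of which
    leads to M's start state. Original states get a positive-cost
    self-loop NOP, so the zero-cost actions form no loops and the
    result remains an SSP MDP."""
    chain = [f"s'{i}" for i in range(1, p_lp_bound + 1)]
    transitions = {}  # (state, action) -> (next_state, cost)
    for a, b in zip(chain, chain[1:]):
        transitions[(a, "NOP")] = (b, 0.0)       # free thinking along the chain
    transitions[(chain[-1], "NOP")] = (M["start"], 0.0)
    for s in M["states"]:
        transitions[(s, "NOP")] = (s, 1.0)       # costly self-loop in M's states
    return {"start": chain[0], "chain": chain,
            "transitions": transitions, "base": M}
```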
To let B identify the chain part, we assume the metareasoning problem's input SSP MDP to be given as a string consisting of a description of M_c, followed by a separator symbol and a description of the chain; the M_c description encodes the arbitrarily connected part of the input MDP, and the chain description encodes the linear NOP-connected chain. For MDPs violating the conditions in Figure 3 (i.e., having a different connectivity structure, or a linear part of the wrong size), the chain description must be empty, with the entire MDP description placed before the separator. B is defined to read the input string only up to the separator. Constructing M′ from M and recording M′ in the aforementioned way ensures that the optimal policy for the metareasoning problem Meta_B(M′) chooses NOP until the agent reaches M's start state s_0, by which point it will have computed an optimal policy for M. Coupled with the fact that Meta_B(M′) is in P, this implies the theorem's claim.

A.2 More Figures
Figures 4 through 11 show results for the Stochastic domain, where we vary the cost of thinking and the cost of acting.
Figure 4: Comparison of algorithms in Stochastic, with cost of thinking = 1 and cost of acting = 11.

Figure 5: Comparison of algorithms in Stochastic, with cost of thinking = 5 and cost of acting = 11.

Figure 6: Comparison of algorithms in Stochastic, with cost of thinking = 10 and cost of acting = 11.

Figure 7: Comparison of algorithms in Stochastic, with cost of thinking = 15 and cost of acting = 11.

Figure 8: Comparison of algorithms in Stochastic, with cost of acting = 1 and cost of thinking = 1.

Figure 9: Comparison of algorithms in Stochastic, with cost of acting = 5 and cost of thinking = 1.

Figure 10: Comparison of algorithms in Stochastic, with cost of acting = 10 and cost of thinking = 1.

Figure 11: Comparison of algorithms in Stochastic.