Theory and Analysis of Optimal Planning over Long and Infinite Horizons for Achieving Independent Partially-Observable Tasks that Evolve over Time
Anahita Mohseni-Kabir, Manuela Veloso, and Maxim Likhachev

*This work was partially supported by Sony AI. The authors are with the School of Computer Science, Carnegie Mellon University, {anahitam, mmv, maxim}@cs.cmu.edu.

Abstract — We present the theoretical analysis and proofs of a recently developed algorithm that allows for optimal planning over long and infinite horizons for achieving multiple independent tasks that are partially observable and evolve over time.
I. BACKGROUND
We focus on the class of problems with N independent tasks that evolve over time, proposed in [1]. The robot should interleave its actions to find an optimal sequence of actions to attend to all the tasks. A POMDP that includes a single task and a robot is called a client POMDP. The N client POMDPs are combined into one large POMDP model called an agent POMDP. To compute an optimal solution for all the tasks, one should solve the agent POMDP optimally. We briefly describe the client and agent POMDPs, and how [1], referred to as the multi-task planner, solves the agent POMDP efficiently.

A. Client POMDP
The client POMDP for task $i$ is represented as a tuple $(S_i, A_i, Z_i, T_i, O_i, R_i, \gamma, H)$. The state space $S_i = S_R \times S_{C_i}$ includes the robot's state, $S_R$, and the other state variables that are specific to task $i$, $S_{C_i}$. The action space $A_i$ includes the actions that can be applied by the robot to task $i$ and a special no op action which bears no direct effect on the task (i.e., when executed, task $i$ follows its underlying Hidden Markov Model). $Z_i$ denotes the observation space. The robot takes an action $a \in A_i$ and transitions from a state $s \in S_i$ to $s' \in S_i$ with probability $T_i(s, a, s')$. The robot then observes $z \in Z_i$ and receives a reward $R_i(s, a)$. The function $O_i(s', a, z)$ models the robot's noisy observations. In POMDP planning, the robot keeps a distribution over the states, called a belief state, and searches for a policy $\pi : B_i \to A_i$ that maximizes $E\big[\sum_{t=0}^{H} \gamma^t r_{i,t}\big]$ at each belief $b \in B_i$, where $r_{i,t}$ is the reward gained at time $t$ from POMDP $i$, $H$ is the planning horizon, and $\gamma$ is the discount factor. For infinite-horizon problems with discounting, $H = \infty$ and $0 \le \gamma < 1$. For finite-horizon problems, $H$ is finite and $\gamma = 1$. The optimal value $V^*_i(b)$ can be computed by iteratively applying the Bellman equation. Similarly, the value of following a trajectory $n$, $V^n_i(b_i)$, can be computed by iteratively applying the Bellman equation and following the remaining trajectory [2]. In our algorithms, $n$ refers to a trajectory consisting of no op actions, so $V^n_i(b_i) \le V^*_i(b_i)$.
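As a concrete illustration of the client POMDP machinery, the following is a minimal Python sketch of a tabular belief update and expected-reward computation. The class name, array layout, and method names are our own illustrative choices, not an API from [1].

```python
import numpy as np

# Minimal sketch of one client POMDP with tabular dynamics, assuming
# T[s, a, s'] = T_i(s, a, s'), O[s', a, z] = O_i(s', a, z), R[s, a] = R_i(s, a).
# All names here are illustrative, not from the paper's implementation.
class ClientPOMDP:
    def __init__(self, T, O, R, gamma=1.0):
        self.T, self.O, self.R, self.gamma = T, O, R, gamma

    def update_belief(self, b, a, z):
        """Return the successor belief b^{az} via a Bayes filter."""
        predicted = self.T[:, a, :].T @ b           # sum_s T(s, a, s') b(s)
        unnormalized = self.O[:, a, z] * predicted  # weight by O(s', a, z)
        return unnormalized / unnormalized.sum()

    def expected_reward(self, b, a):
        """sum_s b(s) R(s, a): the immediate-reward term of the Bellman backup."""
        return b @ self.R[:, a]
```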
B. Agent POMDP

A POMDP created from N client POMDPs is called an agent POMDP (or robot POMDP). Let $P = \{i \in \mathbb{N} : i \le N\}$. Formally, the agent POMDP is represented by $(N, S, A, Z, T, O, R, \gamma, H)$ where $S = S_R \times S_{C_1} \times \ldots \times S_{C_N}$, $A$, and $Z = Z_1 \times \ldots \times Z_N$ denote the state, action, and observation spaces respectively. $T$, $O$, and $R$ are the transition, observation, and reward functions respectively. The robot's action set $A$ (Eq. 1) contains vectors of length $N$ in which all elements except one are no ops. The robot's distribution over the states is $b \in B$ where $B = B_1 \times \ldots \times B_N$. The agent POMDP's reward function is additive in terms of its underlying tasks, $E\big[\sum_{i=1}^{N} R_i\big]$.

$$A = \bigcup_{i \in P} \bigcup_{a \in A_i} \overbrace{[\,no\,op, \ldots, \underbrace{a}_{i\text{-th element}}, \ldots, no\,op\,]}^{\text{length } N} \quad (1)$$

Using the mathematical definition of the independent tasks, the optimal value of an agent POMDP built from a set of independent client POMDPs $P$ can be iteratively computed as follows, where $\Pr(z \mid b, a) = \prod_{k \in P} \Pr(z_k \mid b_k, a[k])$ [1]:

$$V^*_t(b) = \max_{a \in A}\Big[\sum_{i \in P}\sum_{s \in S_i} b_i(s)\, R_i(s, a[i]) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V^*_{t-1}(b^{az})\Big] \quad (2)$$

To compute the value of the current belief state of the robot for a fixed horizon $H$, the planner starts with the robot's current belief as the root of a tree and builds the tree of all reachable beliefs by considering all possible actions and observations. For each action and observation, a new belief node is added to the tree as a child node of its immediate previous belief. To solve the agent POMDP, a combined belief tree of all tasks is built till horizon $H$. The value of the robot's current belief is computed by propagating value estimates up from the fringe nodes, to their ancestors, all the way to the root, according to Eq. 2. We call this approach agent POMDP with a fixed horizon or agent-POMDP-FH.
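To make the action set of Eq. 1 concrete, here is a hedged Python sketch that enumerates the joint action vectors; `client_actions` and `NO_OP` are hypothetical names introduced for illustration.

```python
NO_OP = "no_op"  # stands for the special no op action

# Hedged sketch of Eq. 1: the agent POMDP's actions are length-N vectors
# that apply one client POMDP's action and no op everywhere else.
# `client_actions[i]` is assumed to hold A_i without the no op action.
def agent_action_set(client_actions):
    N = len(client_actions)
    actions = [tuple([NO_OP] * N)]  # the all-no-op vector (the "+1" action)
    for i, A_i in enumerate(client_actions):
        for a in A_i:
            vec = [NO_OP] * N
            vec[i] = a
            actions.append(tuple(vec))
    return actions

# Example: 3 tasks with 2 actions each -> 3 * 2 + 1 = 7 joint actions,
# matching the N * |A| + 1 count discussed in the text.
print(len(agent_action_set([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])))  # 7
```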
C. Multi-task POMDP Planner (Multi-task-FH)

Note that the agent POMDP approach is impractical if the number of tasks is large. If each client POMDP has $|S|$ states and $|A|$ actions (excluding the no op action), and there are $N$ tasks, the robot should plan over an agent POMDP with $|S|^N$ states and $N \times |A| + 1$ actions, which is infeasible. Our prior work exploits the observation that in some domains the number of tasks, $k^*$, that the robot can attend to within $H$ is limited [1]. Given this observation, if the robot optimally solves all possible sub-problems of size $k^*$ with different combinations of tasks, it can find the optimal solution to the agent POMDP. In [1], we prove that decomposing the agent POMDP into a series of sub-problems of size $k^*$, solving all combinations of $k^*$ out of $N$ tasks, $tpls = \{tpl \in \mathcal{P}(P) : |tpl| = k^*\}$, and returning the action with the highest value from them is the optimal solution to the agent POMDP. Symbol $\mathcal{P}$ represents the power set. Note that each member $tpl$ (tuple of size $k^*$) of the set $tpls$ is a sub-problem that can be solved by building a combined POMDP from the POMDPs in $tpl$. The robot assumes that a trajectory of no op actions is being executed on the POMDPs that are not in $tpl$. $k^*$ is provided to the algorithm, but we discuss a way to compute it in [1]. A sketch of the tuple enumeration follows this paragraph.

The prior work uses the solutions to the individual client POMDPs to compute lower and upper-bounds on the optimal value of the agent POMDP to prune the $tpls$ set. Different from [1], which only uses the solutions to the single tasks to prune the low-quality tasks, in this work we take a more gradual approach and monotonically improve the bounds to prune the low-quality tasks. We start with single tasks ($k = 1$), but gradually increase $k$ and solve sub-problems of size $k$ ($k < k^*$) to prune the tasks. Since our algorithm gradually improves the bounds to prune as many tasks as possible, it eventually solves fewer sub-problems of size $k^*$ compared to [1]. We first use the single tasks to prune, then pairs, then triplets, and so on. In addition, we use a truncated horizon $h$ ($h < H$) to compute the solutions to the sub-problems of size $k$, rather than the full horizon $H$ which is needed to solve the sub-problems of size $k^*$. This gradual and monotonic improvement of the bounds, together with planning until a truncated horizon $h$, enables the robot to efficiently and optimally plan over long fixed-length horizons without discounting and infinite-length horizons with discounting, rather than planning for a short fixed horizon as done in [1].
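The enumeration of the $tpls$ set can be sketched in a few lines of Python (the function name is ours; [1] does not prescribe an implementation):

```python
from itertools import combinations

# Hedged sketch of the tpls construction: all subsets of the task set P
# with exactly k_star elements.
def enumerate_tuples(P, k_star):
    return [frozenset(tpl) for tpl in combinations(P, k_star)]

# Example: N = 4 tasks and k_star = 2 give C(4, 2) = 6 sub-problems.
print(len(enumerate_tuples(range(4), 2)))  # 6
```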
II. APPROACH

In this section, we first explain the main ideas that we use to extend the agent POMDP planner (explained in section I-B) to be applied on long-horizon problems. We call this new approach agent POMDP with adaptive horizon or agent-POMDP-AH. We then explain how the agent-POMDP-AH is extended to include the key insights and the efficiency of [1] (multi-task-FH, explained in section I-C). We call our approach multi-task POMDP with adaptive horizon (multi-task-AH) since, in addition to leveraging the multiple independent tasks structure, we adapt the horizon (specifically, iteratively increase it) to improve the solution's quality. Similar to [1], we use an online planning framework which interleaves planning and execution. Its main loop is in Alg. 1. During the planning phase, the algorithm computes the best action to execute given the robot's current belief (lines 3-7). In the execution phase, the robot executes the selected action (line 8), updates the belief state (line 9), and replans after each action execution.

A. Agent POMDP with Adaptive Horizon
We adapt the agent-POMDP-FH approach for the class of problems with multiple independent tasks to enable the robot to efficiently plan for long horizons. This approach uses a similar procedure to solve the agent POMDP as agent-POMDP-FH, but modifies it with two main ideas. The key ideas are that instead of expanding the belief tree of all the tasks for the full horizon $H$, the robot 1) builds the belief tree until a truncated but gradually increasing horizon $h$ and 2) computes the lower and upper-bounds on the value of the fringe nodes at the truncated horizon. To compute the bounds for the fringe nodes, the robot only solves the individual tasks for the remaining horizon $H - h$ (or $\infty$ in the infinite-horizon case) and combines their solutions. It then computes the lower and upper-bounds for the non-fringe nodes by propagating the bound values up from the fringe nodes following the Bellman equation in Eq. 2. Note that when planning with a truncated horizon $h$, the planner expands the combined model of all the tasks only till the truncated horizon $h$, but the individual tasks are solved till the full horizon $H$ to compute the bounds. When the lower and upper-bounds on the value of the robot's belief become equal, the optimal solution is found and the search is terminated. This enables the robot to terminate the search before reaching the full planning horizon $H$. We call the agent POMDP solver that follows this process TruncatedAgentPOMDP. For long horizons, solving the individual tasks (to compute the bounds) is much faster than expanding the belief tree of the combined model; thus, this approach is efficient compared to the agent-POMDP-FH.

Instead of planning for a fixed horizon $H$, this algorithm (Alg. 1) performs planning for increasing values of horizon $h$ until one of the following conditions is satisfied: 1) the horizon limit $H$ is reached, or 2) the lower-bound $\underline{V}$ on the value of the robot's belief is equal to its upper-bound $\bar{V}$ (line 4). The first condition assures that the algorithm terminates when it reaches the maximum horizon $H$ and outputs the same solution as planning for a fixed horizon $H$. The second condition enables the robot to terminate planning before reaching the full planning horizon, thus being more efficient than the agent POMDP approach with a fixed horizon $H$.

Alg. 2 provides the implementation of some of the functions in Alg. 1 for the agent-POMDP-AH approach. The TruncatedAgentPOMDP solver builds a combined model with all the client POMDPs in $P$. It finds the bounds for the fringe nodes using the ComputeBounds function and propagates the bounds up to compute the bounds for the non-fringe nodes. We refer to all the POMDPs in $tpl$ as $tpl_u$; for the agent POMDP, $tpl_u = P$ (all possible tasks). The intuition behind the lower-bound computation (the $\underline{V}$ line of ComputeBounds) is to only consider the best client POMDP from $tpl$ and perform no ops on the other POMDPs. This is similar to taking a greedy approach of always selecting the best task to attend to rather than interleaving the tasks. This is indeed a possible solution, hence it is the lower-bound. The intuition behind the upper-bound computation is to assume that the robot can address all the client POMDPs (tasks) in $tpl$ in parallel. We only have one robot, so this is an upper-bound.

Since the client POMDPs are solved over and over for different beliefs and horizons during planning to compute the bounds, their solutions are cached and reused in the process.
Algorithm 1: Online Planner with Adaptive Horizon

1: MultiTaskAdaptiveHorizonPlanner(env, P, h, H)
2: while not AllTasksDone() do
3:     tpls ← InitializeTuples(P, h, H)
4:     while $\underline{V} \neq \bar{V}$ and $h \leq H$ do
5:         a, tpls, $\underline{V}$, $\bar{V}$ ← SelectAction(P, h, H, tpls)
6:         h ← h + 1
7:         tpls ← RecomputeTuples(h, tpls) // this function is only needed in the multi-task-AH approach
8:     observations ← Step(env, a)
9:     UpdateBeliefs(P, observations)
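For readers who prefer code, the control flow of Alg. 1 can be sketched as below. This is a structural sketch only: `env` and the helper functions are hypothetical placeholders standing in for the paper's subroutines, not a runnable planner.

```python
import math

# Structural sketch of Alg. 1's main loop, assuming hypothetical helpers:
# initialize_tuples, select_action, recompute_tuples, update_beliefs, and an
# `env` object with all_tasks_done()/step(). None of these names are from [1].
def adaptive_horizon_planner(env, P, h, H):
    while not env.all_tasks_done():
        tpls = initialize_tuples(P, h, H)
        lower, upper, a = -math.inf, math.inf, None
        # Grow the truncated horizon until the bounds meet (optimality
        # certificate) or the full planning horizon H is reached.
        while lower != upper and h <= H:
            a, tpls, lower, upper = select_action(P, h, H, tpls)
            h += 1
            tpls = recompute_tuples(h, tpls)  # only needed in multi-task-AH
        observations = env.step(a)        # execute the selected action
        update_beliefs(P, observations)   # per-task Bayes filter update
```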
Algorithm 2: Agent POMDP with Adaptive Horizon

InitializeTuples(P, h, H)
    return tpls ← {(P, ∅)}

SelectAction(P, h, H, tpls)
    ($\underline{V}_P$, $\bar{V}_P$) ← TruncatedAgentPOMDP(h, H, tpls)
    $a_{best}$ ← action with highest $\bar{V}_P$
    return $a_{best}$, tpls, $\underline{V}_P$, $\bar{V}_P$

ComputeBounds(b, tpl) // for the remaining horizon H − h
    $\underline{V} \leftarrow \max_{p \in tpl_u} \big( V^*_p(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_q(b_q) \big)$
    $\bar{V} \leftarrow \sum_{p \in tpl_u} V^*_p(b_p)$
    return $\underline{V}$, $\bar{V}$
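The fringe-node bound computation of ComputeBounds (Eq. 3 and Eq. 10 below) reduces to a few lines once the single-task values are available. A minimal sketch, assuming the remaining-horizon values $V^*_p(b_p)$ and $V^n_p(b_p)$ are precomputed by a single-task solver and passed in as dictionaries of floats:

```python
# Hedged sketch of ComputeBounds. v_star[p] stands for V*_p(b_p) and
# v_noop[p] for V^n_p(b_p), both over the remaining horizon H - h.
def compute_bounds(v_star, v_noop):
    tasks = list(v_star)
    # Lower bound: attend to the single best task, no-op on the rest.
    lower = max(v_star[p] + sum(v_noop[q] for q in tasks if q != p)
                for p in tasks)
    # Upper bound: pretend all tasks can be attended to in parallel.
    upper = sum(v_star[p] for p in tasks)
    return lower, upper

# Toy example with two tasks (hypothetical values):
print(compute_bounds({0: 10.0, 1: 8.0}, {0: -2.0, 1: 0.0}))  # (10.0, 18.0)
```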
B. Multi-task POMDP with Adaptive Horizon

We exploit the two key ideas from the previous section and extend the multi-task-FH [1] to address long horizon planning. The multi-task-FH is able to leverage the independent tasks structure in the problem to efficiently solve the agent POMDP, and agent-POMDP-AH speeds up planning for long horizons by terminating the search earlier through the truncated horizon and bound computations. We combine the benefits of the two approaches in the multi-task-AH. Multi-task-FH exploits the observation that within a fixed horizon $H$, the robot can only consider a limited number of tasks $k^*$. Similarly here, we also consider all possible subsets of size $k^*$ as this is needed to ensure optimality. However, in addition, we leverage the observation that within the truncated horizon $h$, $h \le H$, the robot can only consider $k$ tasks ($k \le k^*$), and it performs no ops on the other tasks. Intuitively, we use the key idea of [1] twice: once to divide the agent POMDP of size $P$ into smaller problems of size $k^*$, and a second time to divide the smaller problems of size $k^*$ into sub-problems of size $k$, $k \le k^*$, that can be solved more efficiently. Leveraging the truncated horizon to further limit the number of tasks that the robot can attend to enables us to significantly speed up planning. The robot only considers combined models of size $k$ till horizon $h$, rather than combined models of size $k^*$, but computes the lower and upper-bounds on $k^*$ individual tasks for the remaining horizon $H - h$. Note that the lower and upper-bound computations are done on the individual tasks till the full horizon $H$; so their computations should consider all $k^*$ tasks to ensure similar optimality guarantees as [1] (as the agent can attend to $k^*$ tasks within horizon $H$). Our approach is especially powerful in infinite-horizon problems with $N$ tasks. In such problems the number of tasks that the robot can attend to within $H = \infty$ is $N$, i.e., $k^* = N$; thus, if $k \ll N$, the algorithm significantly expedites planning by solving multiple sub-problems of smaller sizes rather than solving the agent POMDP with all $N$ tasks. As we increase the truncated horizon $h$, we might need to increase the size of the subsets, i.e., increase $k$. We explain how we address this important aspect of the problem later.

Alg. 3 shows the multi-task-AH algorithm. The function InitializeTuples considers all possible subsets of $P$ with size $k^*$ (line 2) and further divides them into subsets of size $k$ (line 3). Each subset of size $k^*$ ($tpl \in tpls$) is divided into two sets, $tpl_c$ with size $k$ and $tpl_l$ with size $k^* - k$. The truncated agent POMDP is built from the POMDPs in $tpl_c$ while executing no ops on the POMDPs in $tpl_l$, but the bound computations for the fringe nodes are done on all POMDPs in $tpl_u = tpl_c \cup tpl_l$ to assure valid lower and upper-bounds on the value of the tuple $tpl$ (TruncatedAgentPOMDP function).
The SelectAction function solves a truncated agent POMDP for each $tpl$ (line 7) to compute its bounds, while executing no ops on the other POMDPs that are not in $tpl$ (line 8). It then updates the bounds on the value of the full agent POMDP (line 9). The algorithm then removes the tuples for which the upper-bounds are less than the lower-bound of the agent POMDP and returns the action from the $tpl$ with the highest upper-bound (lines 10-11); a sketch of this pruning step appears after Alg. 3. The size of the sub-problems and their bounds get updated as the truncated horizon $h$ increases. Function RecomputeTuples updates the $tpls$ set as the number of tasks that the robot can attend to within the horizon increases from $k$ to $k' = k + 1$. For each $tpl \in tpls$, a member of $tpl_l$ is removed and added to its $tpl_c$ set. We consider removing every element from the $tpl_l$ set to generate all possible new tuples. This ensures that the optimality guarantees hold as we increase the truncated horizon $h$.
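The tuple refinement performed by RecomputeTuples can be sketched as follows; the $(tpl_c, tpl_l)$ pairs are rendered as frozensets, an illustrative representation of our own choosing:

```python
# Hedged sketch of RecomputeTuples: when the attendable-task count grows
# from k to k' = k + 1, every task p in tpl_l is promoted into tpl_c once,
# generating all possible refined tuples.
def recompute_tuples(tpls, k_changed):
    if not k_changed:          # k == k': nothing to refine yet
        return tpls
    refined = set()
    for tpl_c, tpl_l in tpls:  # each tuple is a (frozenset, frozenset) pair
        for p in tpl_l:
            refined.add((tpl_c | {p}, tpl_l - {p}))
    return refined

# Toy example: one tuple with tpl_c = {0} and tpl_l = {1, 2}.
tpls = {(frozenset({0}), frozenset({1, 2}))}
for tpl_c, tpl_l in sorted(recompute_tuples(tpls, True), key=str):
    print(sorted(tpl_c), sorted(tpl_l))  # {0,1}/{2} and {0,2}/{1}
```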
Algorithm 3: Multi-task POMDP with Adaptive Horizon

1: InitializeTuples(P, h, H)
2:     k, k* ← the maximum number of tasks attendable within h and H; T ← $\{tpl : tpl \in \mathcal{P}(P), |tpl| = k^*\}$
3:     tpls′ ← $\{(tpl_c, tpl_l) : tpl_u \in T, tpl_c \in \mathcal{P}(tpl_u), |tpl_c| = k, tpl_l = tpl_u \setminus tpl_c\}$
4:     return tpls′
5: SelectAction(P, h, H, tpls)
6:     for tpl ∈ tpls do
7:         ($\underline{V}_{tpl}$, $\bar{V}_{tpl}$) ← TruncatedAgentPOMDP(h, H, tpl)
8:         ($\underline{U}_{tpl}$, $\bar{U}_{tpl}$) ← ($\underline{V}_{tpl}$, $\bar{V}_{tpl}$) + $\sum_{q \in P \setminus tpl_u} V^n_q$
9:         $\underline{V}_P$ ← max($\underline{V}_P$, $\underline{U}_{tpl}$); $\bar{V}_P$ ← max($\bar{V}_P$, $\bar{U}_{tpl}$)
10:    tpls ← $\{tpl : tpl \in tpls, \bar{U}_{tpl} \geq \underline{V}_P\}$
11:    $a_{best}$ ← action from the tpl with highest $\bar{U}_{tpl}$
12:    return $a_{best}$, tpls, $\underline{V}_P$, $\bar{V}_P$
13: RecomputeTuples(h, tpls)
14:    k, k′ ← the maximum number of tasks attendable within h − 1 and h; tpls′ ← tpls
15:    if k ≠ k′ then
16:        tpls′ ← $\{(tpl_c \cup \{p\}, tpl_l \setminus \{p\}) : tpl \in tpls, p \in tpl_l\}$
17:    return tpls′

Other improvements that can be added to Alg. 3 to further expedite planning include: 1) for a given tuple $tpl$, if $\underline{V}_{tpl} = \bar{V}_{tpl}$, we do not need to recompute the $tpl$'s bounds as it is already optimal; 2) the tuples can be processed in decreasing order of their upper-bounds, so if the updated $\underline{V}_P$ is greater than the next tuple's upper-bound, that tuple and the remaining tuples in the list can be discarded; and 3) if desirable, a timeout condition can also be added to the conditions on line 4 of Alg. 1 to ensure online performance.
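As a small illustration of the pruning and action selection on lines 10-11 of Alg. 3 (a hedged sketch with invented toy values; the dictionary representation is ours):

```python
# Hedged sketch of the tuple pruning in SelectAction (lines 10-11 of Alg. 3).
# upper[tpl] / lower[tpl] stand for the U-bounds of each tuple; `tpls` is a
# list of hashable tuple identifiers. Names are illustrative only.
def prune_and_select(tpls, lower, upper):
    lower_P = max(lower[tpl] for tpl in tpls)  # best lower-bound so far
    # Discard tuples whose upper-bound cannot beat the agent lower-bound.
    kept = [tpl for tpl in tpls if upper[tpl] >= lower_P]
    # The action is taken from the tuple with the highest upper-bound.
    best_tpl = max(kept, key=lambda tpl: upper[tpl])
    return kept, best_tpl

# Toy example with three hypothetical tuples:
tpls = ["AB", "AC", "BC"]
lower = {"AB": 5.0, "AC": 3.0, "BC": 4.0}
upper = {"AB": 9.0, "AC": 4.5, "BC": 7.0}
print(prune_and_select(tpls, lower, upper))  # (['AB', 'BC'], 'AB')
```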
III. OPTIMALITY PROOFS

In this section, we first prove that agent-POMDP-AH computes an optimal solution. We then prove that multi-task-AH finds the same solution as agent-POMDP-AH. We discuss both the proofs and the intuition behind them. The proofs use the independent tasks definition, as stated in [1].
Notation required for understanding the intuition behind the proofs (mostly borrowed from [1]):
• $V^*_{p,t}$: the optimal value of the client POMDP $p$ at time $t$.
• $V^n_{p,t}$: the value of following a trajectory of no ops for the client POMDP $p$ at time $t$.
• $V^*_{P,t}$: the optimal value of the agent POMDP created from the POMDPs in $P$ at time $t$ (Eq. 2).
• $V^*_{tpl,t}$: the optimal value of the agent POMDP created from only the client POMDPs in $tpl$ at time $t$.
• $\underline{V}^h_{tpl,t}(b_{tpl})$, $\bar{V}^h_{tpl,t}(b_{tpl})$: the lower and upper-bound on the value of a belief node $b_{tpl}$ in the belief tree of a truncated agent POMDP created only from the members of $tpl$ till $h$. The bounds on the values of the fringe nodes of the truncated belief tree are computed using Eq. 3 and Eq. 10.

More notation required for understanding the proofs (mostly borrowed from [1]):
• $B^*$: the Bellman operator.
• $A_{tpl}$: only considers the actions associated with the POMDPs in $tpl$ and performs no op on the other POMDPs (same as Eq. 1, but the union is over $tpl$, not $P$).
• $Q^*_{p,t}(b, a)$: the optimal value of the client POMDP $p$ at time $t$ for belief $b$ and action $a$.
• $U^*_{tpl,t}$: the optimal value of the agent POMDP built from $P$ with the action set $A_{tpl}$. Intuitively, $U^*_{tpl,t}$ considers both the value of the POMDPs in $tpl$ ($V^*_{tpl,t}$) and the value of executing no ops on the ones that are not in $tpl$.
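For concreteness, one application of the Bellman operator $B^*$ (restricted to the action set $A_{tpl}$) can be written out as follows; this is our rendering, consistent with Eq. 2:

$$(B^* V)(b) = \max_{a \in A_{tpl}} \Big[ \sum_{i \in tpl_u} \sum_{s \in S_i} b_i(s)\, R_i(s, a[i]) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V(b^{az}) \Big]$$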
A. Lower and upper-bound

We show that the bound computations are valid (Lem. 1 and 2) and monotone (Lem. 3 and 4). The monotonicity property assures that the lower and upper-bounds on the value of a belief node do not change or improve after each iteration of the algorithm (increase in the truncated horizon $h$). The bounds on the value of the fringe nodes are computed for the remaining horizon $H - h$ (or $\infty$ in the infinite-horizon case) using Eq. 3 and Eq. 10. The bounds for the non-fringe nodes are computed by propagating the bound computations of the fringe nodes up to the root belief node. We do not make any assumptions regarding the maximum possible horizon in the bound computations, thus the lemmas also hold for infinite-horizon problems with discounting. Because of this, our adaptive horizon algorithm can generate and improve a solution in an anytime fashion until the optimal solution is achieved. We use mathematical induction to prove Lem. 1 to 4.

Lemma 1.
Eq. 3 provides a lower-bound on the value of a tuple $tpl = (tpl_c, tpl_l)$ where $tpl_u = tpl_c \cup tpl_l$:

$$\underline{V}_{tpl,t}(b_{tpl}) = \max_{p \in tpl_u}\Big[V^*_{p,t}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,t}(b_q)\Big] \le V^*_{tpl,t}(b_{tpl}) \quad (3)$$
Intuition behind proof: Consider that only one task from $tpl_u$, $p \in tpl_u$, can be executed till the full horizon ($V^*_p$), while we perform no ops on the other tasks ($\sum V^n_q$). The best such task is then selected as the lower-bound on $V^*_{tpl}$: $\max_p \big[V^*_p + \sum V^n_q\big]$.
Proof: The proof goes by mathematical induction. For $h' = 1$, if $\forall p \in P,\ V^*_{p,0}(b_p) = 0$, Eq. 4 follows from Eq. 2:

$$V^*_{tpl,1}(b_{tpl}) = \overbrace{\max_{p \in tpl_u}\max_{a \in A_p}}^{\max_{a \in A_{tpl_u}}}\Big[\sum_{i \in tpl_u}\sum_{s \in S_i} b_i(s)\, R_i(s, a[i])\Big] = \max_{p \in tpl_u}\Big[V^*_{p,1}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,1}(b_q)\Big] \quad (4)$$

For $h' = t - 1$, we assume Eq. 5 and consequently Eq. 6 hold, and show that they both hold for $h' = t$:

$$V^*_{tpl,t-1}(b_{tpl}) \ge \max_{p \in tpl_u}\Big[V^*_{p,t-1}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,t-1}(b_q)\Big] \quad (5)$$

$$\forall p \in tpl_u:\quad V^*_{tpl,t-1}(b_{tpl}) \ge V^*_{p,t-1}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,t-1}(b_q) \quad (6)$$

We expand Eq. 2 as follows (writing $b$ for $b_{tpl}$):

$$V^*_{tpl,t}(b) = \max_{a \in A_{tpl_u}}\Big[\sum_{i \in tpl_u}\sum_{s \in S_i} b_i(s)\, R_i(s, a[i]) + \gamma \sum_{z_q \in Z_q} \Pr(z_q \mid b_q, a[q]) \cdots \sum_{z_r \in Z_r} \Pr(z_r \mid b_r, a[r])\, V^*_{tpl,t-1}(b^{az})\Big] \quad (7)$$

We substitute Eq. 6 in Eq. 7. Given the independence assumption, for a specific $Z_i$, we can marginalize out the sums over the $Z_j$s ($j \neq i$). $\forall p \in tpl_u$, we obtain:

$$V^*_{tpl,t}(b) \ge \max_{a \in A_{tpl_u}}\Big[Q^*_{p,t}(b_p, a[p]) + \overbrace{\sum_{q \in tpl_u \setminus \{p\}} Q^n_{q,t}(b_q, a[q])}^{Q^{noop}}\Big] \ge \max_{a \in A_p}\Big[Q^*_{p,t}(b_p, a) + Q^{noop}\Big] \ge V^*_{p,t}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,t}(b_q) \quad (8)$$

Thus, Eq. 9 holds for every $h' = t$:

$$V^*_{tpl,t}(b) \ge \max_{p \in tpl_u}\Big[V^*_{p,t}(b_p) + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,t}(b_q)\Big] \quad (9)$$

Lemma 2.
Eq. 10 provides an upper-bound on the value of a tuple $tpl = (tpl_c, tpl_l)$:

$$\bar{V}_{tpl,t}(b_{tpl}) = \sum_{p \in tpl_u} V^*_{p,t}(b_p) \ge V^*_{tpl,t}(b_{tpl}) \quad (10)$$
Intuition behind proof: The idea behind the upper-bound computation is to assume that the robot can attend to all the tasks in $tpl$, $p \in tpl_u$, in parallel ($\sum V^*_p$). We only have one robot, so this is an upper-bound on $V^*_{tpl}$.
Proof: Similar to Lem. 1, the proof goes by mathematical induction. For $h' = 1$, the following equation holds:

$$V^*_{tpl,1}(b_{tpl}) = \max_{a \in A_{tpl_u}}\Big[\sum_{i \in tpl_u}\sum_{s \in S_i} b_i(s)\, R_i(s, a[i])\Big] \le \sum_{i \in tpl_u} \max_{a \in A_{tpl_u}}\Big[\sum_{s \in S_i} b_i(s)\, R_i(s, a[i])\Big] = \sum_{i \in tpl_u} V^*_{i,1}(b_i) \quad (11)$$

We assume Eq. 12 holds for $h' = t - 1$ ($p, q, r, \ldots \in tpl_u$) and show that it also holds for $h' = t$:

$$V^*_{tpl,t-1}(b) \le V^*_{p,t-1}(b_p) + \ldots + V^*_{q,t-1}(b_q) + \ldots + V^*_{r,t-1}(b_r) \quad (12)$$

Similar to Lem. 1, Eq. 12 is substituted in Eq. 7 and simplified to obtain Eq. 13. Thus, Eq. 10 holds for every $h' = t$:

$$V^*_{tpl,t}(b) \le \max_{a \in A_{tpl_u}}\Big[\sum_{i \in tpl_u} Q^*_{i,t}(b_i, a[i])\Big] \le \sum_{i \in tpl_u} \max_{a \in A_{tpl_u}} Q^*_{i,t}(b_i, a[i]) = \sum_{p \in tpl_u} V^*_{p,t}(b_p) \quad (13)$$
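As a hypothetical numeric illustration of Lem. 1 and 2 (values invented for exposition): with two tasks where $V^*_{1,t}(b_1) = 6$, $V^*_{2,t}(b_2) = 4$, $V^n_{1,t}(b_1) = 1$, and $V^n_{2,t}(b_2) = 0$, Eq. 3 and Eq. 10 give

$$\underline{V}_{tpl,t}(b_{tpl}) = \max(6 + 0,\ 4 + 1) = 6, \qquad \bar{V}_{tpl,t}(b_{tpl}) = 6 + 4 = 10,$$

so the optimal interleaved value $V^*_{tpl,t}(b_{tpl})$ is bracketed in $[6, 10]$.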
Lemma 3. The lower-bound computation is monotone:

$$\underline{V}^h_{tpl,t}(b_{tpl}) \le \underline{V}^{h'}_{tpl,t}(b_{tpl}), \quad \text{where } h < h' \text{ and } h, h' \le H \quad (14)$$

In both $\underline{V}^h_{tpl,t}$'s and $\underline{V}^{h'}_{tpl,t}$'s computations, the belief tree is first built till horizon $h$. To compute $\underline{V}^h_{tpl,t}$, the lower-bounds on the values of the fringe belief nodes at horizon $h$ are computed using Eq. 3 and are propagated up the belief tree. To compute $\underline{V}^{h'}_{tpl,t}$, the algorithm expands the tree for $d$ more steps, $h' = h + d$, and then uses Eq. 3 to compute the lower-bounds for the fringe nodes at depth $h + d$ and propagates the bounds up the belief tree. In both cases the lower-bounds on the values of the fringe nodes are computed using Eq. 3 till the full horizon $H$. This property guarantees that as the truncated horizon increases from $h$ to $h'$ ($h < h'$), the lower-bound on the value of a certain fringe node at horizon $h$, and consequently on the non-fringe nodes, is non-decreasing.
Intuition behind proof: The main difference between $\underline{V}^h(b')$ and $\underline{V}^{h'}(b')$ for a certain belief node $b'$ at depth $h$ (or horizon $h$) is that the former uses the trivial lower-bound estimate for the node, but the latter does more computation to expand the belief tree further before using a similar trivial lower-bound estimate for the nodes at depth $h + d$. To compute the lower-bound for a fringe node at depth $h$, $\underline{V}^h(b')$, the algorithm assumes that from there on till $H$, only one task can be executed and no ops are executed on the other tasks (one possible solution). So, expanding the belief tree (exhaustive search) for $d$ more steps till horizon $h + d$ to compute $\underline{V}^{h'}(b')$ will only find the same or a better solution than achieving a single task. I.e., the lower-bound on $b'$ is non-decreasing as we increase the horizon.
Proof: For a certain leaf node $b$ at horizon $h$, we compare its lower-bound when the truncated agent POMDP is built till $h$ against when it is built till $h'$. The proof goes by mathematical induction. First, we show that $\underline{V}_{tpl,H-h} \le B^* \underline{V}_{tpl,H-h-1}$ holds for $d = 1$. We proved this previously when we substituted Eq. 6 in Eq. 7 to get Eq. 8; thus:

$$V^*_{tpl,H-h} = B^* V^*_{tpl,H-h-1} \ge B^* \underline{V}_{tpl,H-h-1} \ge \max_{p \in tpl_u}\Big[V^*_{p,H-h} + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,H-h}\Big] = \underline{V}_{tpl,H-h} \quad (15)$$

Now, we assume that for $h' = h + d$ the following holds for the belief node $b$: $\underline{V}_{tpl,H-h} \le B^{*d} \underline{V}_{tpl,H-h-d}$, and we prove that the same equation also holds for $h' = h + d + 1$. For a certain belief $b$, both $\underline{V}_{tpl,H-h} \le B^* \underline{V}_{tpl,H-h-1}$ and $\underline{V}_{tpl,H-h} \le B^{*d} \underline{V}_{tpl,H-h-d}$ hold; thus the following holds for $h' = h + d + 1$:

$$V^*_{tpl,H-h} \ge B^{*d}\big[B^* \underline{V}_{tpl,H-h-d-1}\big] \ge B^{*d} \underline{V}_{tpl,H-h-d} \ge \max_{p \in tpl_u}\Big[V^*_{p,H-h} + \sum_{q \in tpl_u \setminus \{p\}} V^n_{q,H-h}\Big] = \underline{V}_{tpl,H-h} \quad (16)$$

Lemma 4.
The upper-bound computation is monotone:

$$\bar{V}^h_{tpl,t}(b_{tpl}) \ge \bar{V}^{h'}_{tpl,t}(b_{tpl}), \quad \text{where } h < h' \text{ and } h, h' \le H \quad (17)$$

This property guarantees that as the horizon increases from $h$ to $h'$, the upper-bound on the value of a certain fringe node, and consequently on the non-fringe nodes, is non-increasing.
Intuition behind proof: Similar to the intuition we gave for the lower-bound's monotonicity, for a certain belief node $b'$ at depth $h$, $\bar{V}^h(b')$ estimates the upper-bound by assuming that all the tasks can be performed in parallel. However, $\bar{V}^{h'}(b')$ expands the belief tree for $d$ more steps before assuming that all the tasks can be performed in parallel. Thus, given that $\bar{V}^{h'}(b')$ uses the Bellman equation during the $d$ steps, it has a better estimate of the upper-bound than the assumption that all the tasks can be attended to in parallel during those $d$ steps, as assumed in $\bar{V}^h(b')$. I.e., as the horizon increases and more of the belief tree is expanded, the upper-bound on the value of $b'$ improves (i.e., is non-increasing).

Proof: Similar to Lem. 3's proof, the proof goes by mathematical induction. First, we show that $B^* \bar{V}_{tpl,H-h-1} \le \bar{V}_{tpl,H-h}$ holds for $d = 1$. We proved this previously when we substituted Eq. 12 in Eq. 7 to get Eq. 13; thus:

$$V^*_{tpl,H-h} = B^* V^*_{tpl,H-h-1} \le \max_{a \in A_{tpl_u}}\Big[\sum_{i \in tpl_u} Q^*_{i,H-h}(b_i, a[i])\Big] \le \sum_{i \in tpl_u} \max_{a \in A_{tpl_u}} Q^*_{i,H-h}(b_i, a[i]) = \sum_{i \in tpl_u} V^*_{i,H-h} = \bar{V}_{tpl,H-h} \quad (18)$$

We assume that for the belief node $b$ and $h' = h + d$, $B^{*d} \bar{V}_{tpl,H-h-d} \le \bar{V}_{tpl,H-h}$ holds, and we prove it also holds for $h' = h + d + 1$. We know both $B^* \bar{V}_{tpl,H-h-1} \le \bar{V}_{tpl,H-h}$ and $B^{*d} \bar{V}_{tpl,H-h-d} \le \bar{V}_{tpl,H-h}$ hold; thus:

$$V^*_{tpl,H-h} \le B^{*d}\big[B^* \bar{V}_{tpl,H-h-d-1}\big] \le B^{*d} \bar{V}_{tpl,H-h-d} \le \bar{V}_{tpl,H-h} \quad (19)$$

In summary, we proved that the bound computations are valid and monotone; thus if $tpl = (P, \emptyset)$, Lem. 1 to 4 prove the optimality of agent-POMDP-AH. Given the iterative nature of the horizon, in the worst case the agent-POMDP-AH approach reaches the full horizon $H$ and obtains the same solution as the agent-POMDP-FH approach.

B. Multi-task-AH
We prove Alg. 3 is optimal. We assume $k^*$ and $k$ are the maximum number of tasks that the robot can attend to within $H$ and the truncated horizon $h$ respectively. $\hat{V}_P$ denotes the value of the agent POMDP under such assumptions, referred to as the limited tasks assumption.

Lemma 5.
The lower and upper-bounds on the value of the agent POMDP created from the set $P$, $\hat{\underline{V}}_P$ and $\hat{\bar{V}}_P$, can be computed by Eq. 20 and Eq. 21 respectively, where $tpls = \{tpl \in \mathcal{P}(P) : |tpl| = k^*\}$, and the bounds are monotone. (Proof of the SelectAction function in Alg. 3.)

$$\hat{\underline{V}}_{P,t}(b) = \max_{tpl \in tpls}\Big(\underline{V}_{tpl,t}(b_{tpl}) + \sum_{q \in P \setminus tpl_u} V^n_{q,t}(b_q)\Big) \le \hat{V}^*_{P,t}(b) \quad (20)$$

$$\hat{\bar{V}}_{P,t}(b) = \max_{tpl \in tpls}\Big(\bar{V}_{tpl,t}(b_{tpl}) + \sum_{q \in P \setminus tpl_u} V^n_{q,t}(b_q)\Big) \ge \hat{V}^*_{P,t}(b) \quad (21)$$
Intuition behind proof: In [1], we proved that finding the optimal values of all $tpl \in tpls$ ($V^*_{tpl}$) while performing no ops on the other POMDPs ($\sum V^n_q$) and selecting the best V-value, $\max_{tpl \in tpls}(V^*_{tpl} + \sum V^n_q)$, provides the optimal solution to the agent POMDP. We proved in Lem. 1 to 4 that the lower and upper-bounds on $V^*_{tpl}$ are valid and monotone. The validity and monotonicity of $\hat{\underline{V}}_P$ and $\hat{\bar{V}}_P$ then simply follow from the validity and monotonicity of $\underline{V}_{tpl}$ and $\bar{V}_{tpl}$.
Proof: We show that the bounds are valid and then argue why they are also monotone. From [1], we know:

$$\hat{V}^*_{P,t}(b) = \max_{tpl \in tpls} U^*_{tpl,t}(b) \quad (22)$$

$$U^*_{tpl,t}(b) = V^*_{tpl,t}(b_{tpl}) + \sum_{q \in P \setminus tpl} V^n_{q,t}(b_q) \quad (23)$$

We proved in Lem. 1 and 2 that $\underline{V}_{tpl,t}(b_{tpl}) \le V^*_{tpl,t}(b_{tpl})$ and $\bar{V}_{tpl,t}(b_{tpl}) \ge V^*_{tpl,t}(b_{tpl})$ respectively. Thus, $\underline{U}_{tpl,t}(b)$ and $\bar{U}_{tpl,t}(b)$, computed by substituting $V^*_{tpl,t}(b_{tpl})$ by $\underline{V}_{tpl,t}(b_{tpl})$ and $\bar{V}_{tpl,t}(b_{tpl})$ in Eq. 23, are lower and upper-bounds on $U^*_{tpl,t}(b)$. We substitute $\underline{U}_{tpl,t}(b)$ and $\bar{U}_{tpl,t}(b)$ in Eq. 22 to prove Eq. 20 and Eq. 21. Given that the bound computations for $V^*_{tpl,t}$ are monotone, $\sum_q V^n_{q,t}(b_q)$ does not change for a given tuple as we increase the horizon, and the $\max$ operator does not change the monotonicity of $\underline{U}_{tpl,t}$ and $\bar{U}_{tpl,t}$; therefore $\hat{\underline{V}}_{P,t}$ and $\hat{\bar{V}}_{P,t}$ are monotone.
Lemma 6
Alg. 3 converges to the optimal solution of the agent POMDP in both finite-horizon problems without discounting and infinite-horizon problems with discounting.
Intuition behind proof:
In Lem. 1 to 5, we proved that dividing the agent POMDP into subtasks ($tpl \in tpls$) and computing the lower and upper-bounds for all the tuples in $tpls$ provide valid and monotone bounds on the value of the agent POMDP. In those lemmas, we assumed that $k = k^*$, i.e., a combined model of all the $k^*$ tasks is expanded till the truncated horizon $h$ even though we know that the robot can only attend to $k$ tasks within $h$. Differently, in Alg. 3, to efficiently solve each $tpl$ for a small truncated horizon $h$, we only consider subsets of size $k$, but compute the bounds on all the $k^*$ POMDPs in $tpl$, so $k < k^*$. Both cases $k = k^*$ and $k < k^*$ use the same lower and upper-bound computations and have the same $h$ as their truncated horizon. However, in the former we perform the tree expansion on a combined model built from all the POMDPs in the $tpl$ set, but in the latter we consider all combinations of the tasks with size $k$ out of the POMDPs in $tpl$ and perform the tree expansion on those only. The proof uses the same idea as [1]. It uses the assumption that within a certain horizon $h$, only $k$ tasks can be attended to, so if we consider all combinations of $k$ tasks out of the members of the $tpl$ ($tpl_u$), we will get the same solution as the combined model of all the tasks in $tpl$. Given that the bound computations are the same in both cases, when $k < k^*$ we get the same solution as when $k = k^*$ (proof for line 3 in Alg. 3 and the RecomputeTuples function), and Alg. 3 computes valid and monotone bounds on the value of the agent POMDP.
Proof:
In Lem. 1 to 5, we proved that dividing the agent POMDP into subtasks ($tpl \in tpls$) and computing the lower and upper-bounds for all the members of $tpls$ provide valid and monotone bounds on the value of the agent POMDP. In these lemmas, we assumed that $k = k^*$, so line 3 of Alg. 3 would become $tpls' = \{(tpl_c, tpl_l) : tpl \in tpls, tpl_c = tpl, tpl_l = \emptyset\}$, and the RecomputeTuples function would not change the $tpls$ set. However, the benefits of our approach are manifested when the truncated horizon $h$ is smaller than the full planning horizon $H$, and consequently $k < k^*$. In Alg. 3, we divide each $tpl$ into two sets, $tpl_c$ with $k$ tasks and $tpl_l$ with $k^* - k$ tasks (all possible combinations of $k$ tasks out of $k^*$ tasks), perform the tree expansion for the POMDPs in $tpl_c$ while executing no ops on the members of $tpl_l$, and compute the bounds on all members of $tpl_u = tpl_c \cup tpl_l$. When $k < k^*$, if we prove that by using this approach we get the same solution as when $k = k^*$, we prove that Alg. 3 computes valid and monotone bounds on the value of the agent POMDP. Notice that the only difference between $k = k^*$ and $k < k^*$ is where the tree expansion is performed: in the former, it is performed on a combined model of all the POMDPs in $tpl$, while in the latter it is performed on all combinations of size $k$ out of the POMDPs in $tpl$.

REFERENCES

[1] A. Mohseni-Kabir, M. Veloso, and M. Likhachev, "Efficient robot planning for achieving multiple independent partially observable tasks that evolve over time," in Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 2020.
[2] A. R. Cassandra, "A survey of POMDP applications," in AAAI Fall Symposium on Planning with Partially Observable Markov Decision Processes, 1998.