CMAX++: Leveraging Experience in Planning and Execution using Inaccurate Models
Anirudh Vemula, J. Andrew Bagnell, Maxim Likhachev
Robotics Institute, Carnegie Mellon University; Aurora
[email protected], [email protected], [email protected]
Abstract
Given access to accurate dynamical models, modern planning approaches are effective in computing feasible and optimal plans for repetitive robotic tasks. However, it is difficult to model the true dynamics of the real world before execution, especially for tasks requiring interactions with objects whose parameters are unknown. A recent planning approach, CMAX, tackles this problem by adapting the planner online during execution to bias the resulting plans away from inaccurately modeled regions. CMAX, while being provably guaranteed to reach the goal, requires strong assumptions on the accuracy of the model used for planning and fails to improve the quality of the solution over repetitions of the same task. In this paper we propose CMAX++, an approach that leverages real-world experience to improve the quality of resulting plans over successive repetitions of a robotic task. CMAX++ achieves this by integrating model-free learning using acquired experience with model-based planning using the potentially inaccurate model. We provide provable guarantees on the completeness of CMAX++ and on its asymptotic convergence to the optimal path cost as the number of repetitions increases. CMAX++ is also shown to outperform baselines in simulated robotic tasks, including 3D mobile robot navigation where the track friction is incorrectly modeled, and a 7D pick-and-place task where the mass of the object is unknown, leading to a discrepancy between true and modeled dynamics.

A blog post summarizing this work can be found at https://vvanirudh.github.io/blog/cmaxpp/

Introduction

Figure 1: (left) PR2 lifting a heavy dumbbell, that is modeled as light, to a goal location that is higher than the start location, resulting in dynamics that are inaccurately modeled. (right) Mobile robot navigating around a track with icy patches with unknown friction parameters, leading to the robot skidding. In both cases, any path to the goal needs to contain a transition (pink) whose dynamics are not modeled accurately.

We often require robots to perform tasks that are highly repetitive, such as picking and placing objects in assembly tasks and navigating between locations in a warehouse. For such tasks, robotic planning algorithms have been highly effective in cases where the system dynamics are easily specified by an efficient forward model (Berenson, Abbeel, and Goldberg 2012). However, for tasks involving interactions with objects, dynamics are very difficult to model without complete knowledge of the parameters of the objects, such as mass and friction (Ji and Xiao 2001). Using inaccurate models for planning can result in plans that are ineffective and fail to complete the task (McConachie et al. 2020). In addition, for such repetitive tasks we expect the robot's task performance to improve, leading to efficient plans in later repetitions. Thus, we need a planning approach that can use potentially inaccurate models while leveraging experience from past executions to complete the task in each repetition, and improve performance across repetitions.
A recent planning approach, CMAX, introduced in (Vemula et al. 2020), adapts its planning strategy online to account for any inaccuracies in the forward model without requiring any updates to the dynamics of the model. CMAX achieves this online by inflating the cost of any transition that is found to be incorrectly modeled and replanning, thus biasing the resulting plans away from regions where the model is inaccurate. It does so while maintaining guarantees on completing the task, without any resets, in a finite number of executions. However, CMAX requires that there always exists a path from the current state of the robot to the goal containing only transitions that have not yet been found to be incorrectly modeled. This is a strong assumption on the accuracy of the model and can often be violated, especially in the context of repetitive tasks.

For example, consider the task shown in Figure 1 (left), where a robotic arm needs to repeatedly pick a heavy object, that is incorrectly modeled as light, and place it on top of a taller table while avoiding an obstacle. As the object is heavy, transitions that involve lifting the object will have a discrepancy between true and modeled dynamics. However, any path from the start pose to the goal pose requires lifting the object and thus, the resulting plan needs to contain a transition that is incorrectly modeled. This violates the aforementioned assumption of CMAX, and it ends up inflating the cost of any transition that lifts the object, resulting in plans that avoid lifting the object in future repetitions. Thus, the quality of the CMAX solution deteriorates across repetitions and, in some cases, it even fails to complete the task. Figure 1 (right) presents another example task where a mobile robot is navigating around a track with icy patches that have unknown friction parameters. Once the robot enters a patch, any action executed results in the robot skidding, thus violating the assumption of CMAX because any path to the goal from the current state will have inaccurately modeled transitions. CMAX ends up inflating the cost of all actions executed inside the icy patch, leading to the robot being unable to find a path in future laps and failing to complete the task. Thus, in both examples, we need a planning approach that allows solutions to contain incorrectly modeled transitions while ensuring that the robot reaches the goal.

In this paper we present CMAX++, an approach for interleaving planning and execution that uses inaccurate models and leverages experience from past executions to provably complete the task in each repetition without any resets. Furthermore, it improves the quality of the solution across repetitions. In contrast to CMAX, CMAX++ requires weaker conditions to ensure task completeness, and is provably guaranteed to converge to a plan with optimal cost as the number of repetitions increases. The key idea behind CMAX++ is to combine the conservative behavior of CMAX, which tries to avoid incorrectly modeled regions, with model-free Q-learning, which tries to estimate and follow the optimal cost-to-goal value function with no regard for any discrepancies between modeled and true dynamics. This enables CMAX++ to compute plans that utilize inaccurately modeled transitions, unlike CMAX. Based on this idea, we present an algorithm for small state spaces, where we can do exact planning, and a practical algorithm for large state spaces using function approximation techniques. We also propose an adaptive version of CMAX++ that intelligently switches between CMAX and CMAX++ to combine the advantages of both approaches, and exhibits goal-driven behavior in earlier repetitions and optimality in later repetitions. The proposed algorithms are tested on simulated robotic tasks: 3D mobile robot navigation where the track friction is incorrectly modeled (Figure 1 right), and a 7D pick-and-place task where the mass of the object is unknown (Figure 1 left).
Related Work

A typical approach to planning in tasks with unknown parameters is to use acquired experience from executions to update the dynamics of the model and replan (Sutton 1991). This works well in practice for tasks where the forward model is flexible and can be updated efficiently. However, for real-world tasks the models used for planning often cannot be updated efficiently online (Todorov, Erez, and Tassa 2012) and are often precomputed offline using expensive procedures (Hauser et al. 2006). Another line of work (Saveriano et al. 2017; Abbeel, Quigley, and Ng 2006) seeks to learn a residual dynamical model to account for the inaccuracies in the initial model. However, it can take a prohibitively large number of executions to learn the true dynamics, especially in domains like deformable manipulation (Essahbi, Bouzgarrou, and Gogu 2012). This precludes these approaches from demonstrating goal-driven behavior, as we show in our experimental analysis.

Recent works such as CMAX (Vemula et al. 2020) and (McConachie et al. 2020) pursue an alternative approach that does not require updating the dynamics of the model or learning a residual component. These approaches exhibit goal-driven behavior by focusing on completing the task and not on modeling the true dynamics accurately. While CMAX achieves this by inflating the cost of any transition whose dynamics are inaccurately modeled, (McConachie et al. 2020) present an approach that learns a binary classifier offline that is used online to predict whether a transition is accurately modeled or not. Although these methods work well in practice for goal-oriented tasks, they do not leverage experience acquired online to improve the quality of the solution when used for repetitive tasks.

Our work is closely related to approaches that integrate model-based planning with model-free learning. (Lee et al. 2020) use model-based planning in regions where the dynamics are accurately modeled and switch to a model-free policy in regions with high uncertainty. However, they mostly focus on perception uncertainty and require a coarse estimate of the uncertain region prior to execution, which is often not available for tasks with other modalities of uncertainty, such as unknown inertial parameters. A very recent work by (Lagrassa, Lee, and Kroemer 2020) uses a model-based planner until a model inaccuracy is detected and then switches to a model-free policy to complete the task. Similar to our approach, they deal with general modeling errors, but they rely on expert demonstrations to learn the model-free policy. In contrast, our approach does not require any expert demonstrations and only uses the experience acquired online to obtain model-free value estimates that are used within planning.

Finally, our approach is also related to the field of real-time heuristic search, which tackles the problem of efficient planning in large state spaces with bounded planning time. In this work, we introduce a novel planner that is inspired by LRTA* (Korf 1990), which limits the number of expansions in the search procedure and interleaves execution with planning. Crucially, our planner also interleaves planning and execution but, unlike these approaches, employs model-free value estimates obtained from past experience within the search.
Problem Setup

Following the notation of (Vemula et al. 2020), we consider the deterministic shortest path problem represented by the tuple M = (S, A, G, f, c), where S is the state space, A is the action space, G ⊆ S is the non-empty set of goals, f : S × A → S is a deterministic dynamics function, and c : S × A → [0, 1] is the cost function. Note that we assume that the costs lie between 0 and 1, but any bounded cost function can be scaled to satisfy this assumption. Crucially, our approach assumes that the action space A is discrete, and any goal state g ∈ G is a cost-free termination state. The objective of the shortest path problem is to find the least-cost path from a given start state s ∈ S to any goal state g ∈ G in M. As is typical in shortest path problems, we assume that there exists at least one path from each state s ∈ S to one of the goal states, and that the cost of any transition from a non-goal state is positive (Bertsekas 2005). We will use V(s) to denote the state value function (a running estimate of the cost-to-goal from state s), and Q(s, a) to denote the state-action value function (a running estimate of the sum of the transition cost and the cost-to-goal from the successor state), for any state s and action a. Similarly, we will use the notation V*(s) and Q*(s, a) to denote the corresponding optimal value functions. A value estimate is called admissible if it underestimates the optimal value function at all states and actions, and is called consistent if it satisfies the triangle inequality, i.e. V(s) ≤ c(s, a) + V(f(s, a)) and Q(s, a) ≤ c(s, a) + V(f(s, a)) for all s, a, and V(g) = 0 for all g ∈ G.

In this work, we focus on repetitive robotic tasks where the true deterministic dynamics f are unknown, but we have access to an approximate model described by M̂ = (S, A, G, f̂, c), where f̂ approximates the true dynamics. In each repetition of the task, the robot acts in the environment M to acquire experience over a single trajectory and reach the goal, without access to any resets. This rules out any episodic approach. Since the true dynamics are unknown and can only be discovered through executions, we consider the online real-time planning setting where the robot has to interleave planning and execution. In our motivating navigation example (Figure 1 right), the approximate model M̂ represents a track with no icy patches, whereas the environment M contains icy patches. Thus, there is a discrepancy between the modeled dynamics f̂ and the true dynamics f. Following (Vemula et al. 2020), we will refer to state-action pairs that have inaccurately modeled dynamics as "incorrect" transitions, and use the notation X ⊆ S × A to denote the set of discovered incorrect transitions. The objective in our work is for the robot to reach a goal in each repetition, despite using an inaccurate model for planning, while improving performance, measured using the cost of executions, across repetitions.
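To make the admissibility and consistency conditions concrete, here is a minimal Python sketch that checks them on a toy three-state instance; the chain, costs, and value numbers below are illustrative assumptions, not taken from the paper.

states = ["s0", "s1", "g"]
goals = {"g"}
actions = ["move"]
f = {("s0", "move"): "s1", ("s1", "move"): "g", ("g", "move"): "g"}   # deterministic dynamics
c = {("s0", "move"): 0.4, ("s1", "move"): 0.5, ("g", "move"): 0.0}    # costs in [0, 1]

V_star = {"s0": 0.9, "s1": 0.5, "g": 0.0}   # optimal cost-to-goal for this toy instance
V = {"s0": 0.7, "s1": 0.5, "g": 0.0}        # a candidate value estimate

def is_admissible(V, V_star):
    # V underestimates the optimal cost-to-goal at every state.
    return all(V[s] <= V_star[s] for s in states)

def is_consistent(V, f, c):
    # V(g) = 0 at goals and V(s) <= c(s, a) + V(f(s, a)) for every transition.
    if any(V[g] != 0.0 for g in goals):
        return False
    return all(V[s] <= c[(s, a)] + V[f[(s, a)]] for s in states for a in actions)

print(is_admissible(V, V_star), is_consistent(V, f, c))  # prints: True True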
Approach

In this section, we describe the proposed approach, CMAX++. First, we present a novel planner used in CMAX++ that can exploit incorrect transitions using their model-free Q-value estimates. Second, we present CMAX++ and its adaptive version for small state spaces, and establish their guarantees. Finally, we describe a practical instantiation of CMAX++ for large state spaces leveraging function approximation techniques.
Algorithm 1 Hybrid Limited-Expansion Search

1: procedure SEARCH(s, M̂, V, Q, X, K)
2:   Initialize g(s) = 0, min-priority open list O, and closed list C
3:   Add s to open list O with priority p(s) = g(s) + V(s)
4:   for i = 1, 2, · · · , K do
5:     Pop s_i from O
6:     if s_i is a dummy state or s_i ∈ G then
7:       Set s_best ← s_i and go to Line 22
8:     for a ∈ A do                                          ▷ Expanding state s_i
9:       if (s_i, a) ∈ X then                                 ▷ Incorrect transition
10:        Add a dummy state s′ to O with priority p(s′) = g(s_i) + Q(s_i, a)
11:        continue
12:      Get successor s′ = f̂(s_i, a)
13:      If s′ ∈ C, continue
14:      if s′ ∈ O and g(s′) > g(s_i) + c(s_i, a) then
15:        Set g(s′) = g(s_i) + c(s_i, a) and recompute p(s′)
16:        Reorder open list O
17:      else if s′ ∉ O then
18:        Set g(s′) = g(s_i) + c(s_i, a)
19:        Add s′ to O with priority p(s′) = g(s′) + V(s′)
20:    Add s_i to closed list C
21:  Pop s_best from open list O
22:  for s′ ∈ C do
23:    Update V(s′) ← p(s_best) − g(s′)
24:  Backtrack from s_best to s, and set a_best as the first action on the path from s to s_best in the search tree
25:  return a_best

During online execution, we want the robot to acquire experience and leverage it to compute better plans. This requires a hybrid planner that is able to incorporate value estimates obtained using past experience in addition to model-based planning, and to quickly compute the next action to execute. To achieve this, we propose a real-time heuristic search-based planner that performs a bounded number of expansions and is able to utilize Q-value estimates for incorrect transitions.

The planner is presented in Algorithm 1. Given the current state s, the planner constructs a lookahead search tree using at most K state expansions. For each expanded state s_i, if any outgoing transition has been flagged as incorrect based on experience, i.e. (s_i, a) ∈ X, then the planner creates a dummy state with priority computed using the model-free Q-value estimate of that transition (Line 10). Note that we create a dummy state because the model M̂ does not know the true successor of an incorrect transition. For the transitions that are correct, we obtain successor states using the approximate model M̂. This ensures that we rely on the inaccurate model only for transitions that are not known to be incorrect. At any stage, if a dummy state is expanded, then we need to terminate the search as the model M̂ does not know any of its successors, in which case we set the best state s_best as the dummy state (Line 7). Otherwise, we choose s_best as the best state (lowest priority) among the leaves of the search tree after K expansions (Line 21). Finally, the best action to execute at the current state s is computed as the first action along the path from s to s_best in the search tree (Line 24). The planner also updates the state value estimates V of all expanded states using the priority of the best state p(s_best) to make the estimates more accurate (Lines 22 and 23), similar to RTAA* (Koenig and Likhachev 2006).

The ability of our planner to exploit incorrect transitions using their model-free Q-value estimates, obtained from past experience, distinguishes it from real-time search-based planners such as LRTA* (Korf 1990), which cannot utilize model-free value estimates during planning. This enables CMAX++ to compute plans that utilize incorrect transitions if they enable the robot to get to the goal with lower cost.
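The following Python sketch mirrors the structure of Algorithm 1 for a tabular setting. The model.succ/model.cost accessors, the dictionaries V and Q, and the incorrect set are assumed interfaces for illustration, not an API from the paper, and the sketch assumes the open list never empties within K expansions.

def search(s, model, V, Q, incorrect, K, goals, actions):
    g = {s: 0.0}                                  # cost-from-root for every generated state
    parent = {}                                   # child -> (parent, action), for backtracking
    open_list = {s: g[s] + V.get(s, 0.0)}         # state -> priority p = g + V
    closed = []
    s_best, p_best = None, None
    for _ in range(K):
        s_i = min(open_list, key=open_list.get)   # pop the lowest-priority state
        p_i = open_list.pop(s_i)
        if (isinstance(s_i, tuple) and s_i[0] == "dummy") or s_i in goals:
            # A dummy state has no known successors in the model; a goal needs no expansion.
            s_best, p_best = s_i, p_i
            break
        for a in actions:
            if (s_i, a) in incorrect:
                # Incorrect transition: add a dummy leaf whose priority uses the model-free
                # Q-value estimate instead of the (unknown) true successor.
                dummy = ("dummy", s_i, a)
                open_list[dummy] = g[s_i] + Q.get((s_i, a), 0.0)
                parent[dummy] = (s_i, a)
                continue
            s_next = model.succ(s_i, a)           # successor under the possibly inaccurate model
            new_g = g[s_i] + model.cost(s_i, a)
            if s_next in closed:
                continue
            if s_next not in g or new_g < g[s_next]:
                g[s_next] = new_g
                parent[s_next] = (s_i, a)
                open_list[s_next] = new_g + V.get(s_next, 0.0)
        closed.append(s_i)
    if s_best is None:                            # K expansions used up: best open leaf wins
        s_best = min(open_list, key=open_list.get)
        p_best = open_list[s_best]
    for s_c in closed:                            # RTAA*-style update of expanded states
        V[s_c] = p_best - g[s_c]
    node, a_best = s_best, None                   # backtrack to find the first action from s
    while node != s:
        node, a_best = parent[node]
    return a_best

In CMAX++ this routine would be called once per time step with the robot's current state, so K directly bounds the planning effort spent per executed action.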
Algorithm 2 CMAX++ and A-CMAX++ in small state spaces

Require: Model M̂, start state s, initial value estimates V, Q, number of expansions K, t ← 0, incorrect set X ← {}, number of repetitions N, sequence {α_i ≥ 1}_{i=1}^N, initial penalized value estimates Ṽ = V, penalized model M̃ ← M̂ (the last three are used only by A-CMAX++)
1: for each repetition i = 1, · · · , N do
2:   t ← 0, s_t ← s
3:   while s_t ∉ G do
4:     Compute a_t = SEARCH(s_t, M̂, V, Q, X, K)
5:     Compute ã_t = SEARCH(s_t, M̃, Ṽ, Q, {}, K)            ▷ A-CMAX++ only
6:     If Ṽ(s_t) ≤ α_i V(s_t), assign a_t = ã_t               ▷ A-CMAX++ only
7:     Execute a_t in the environment to get s_{t+1} = f(s_t, a_t)
8:     if s_{t+1} ≠ f̂(s_t, a_t) then
9:       Add (s_t, a_t) to the set: X ← X ∪ {(s_t, a_t)}
10:      Update: Q(s_t, a_t) = c(s_t, a_t) + V(s_{t+1})
11:      Update penalized model M̃ ← M̃_X                       ▷ A-CMAX++ only
12:    t ← t + 1
CMAX++ in Small State Spaces

CMAX++ in small state spaces is simple and easy to implement, as it is feasible to maintain value estimates in a table for all states and actions, and to explicitly maintain a running set of incorrect transitions with fast lookup, without resorting to function approximation techniques.

The algorithm is presented in Algorithm 2 (excluding the lines marked as A-CMAX++ only). CMAX++ maintains a running estimate of the set of incorrect transitions X, and updates the set whenever it encounters an incorrect state-action pair during execution. Crucially, unlike CMAX, it maintains a Q-value estimate for each incorrect transition that is used during planning in Algorithm 1, thereby enabling the planner to compute paths that contain incorrect transitions. It is also important to note that, like CMAX, CMAX++ never updates the dynamics of the model. However, instead of using the penalized model for planning as CMAX does, CMAX++ uses the initial model M̂, and utilizes both model-based planning and model-free Q-value estimates to replan a path from the current state to a goal.

The downside of CMAX++ is that estimating Q-values from online executions can be inefficient, as it might take many executions before we obtain an accurate Q-value estimate for an incorrect transition. This has been extensively studied in the past and is a major disadvantage of model-free methods (Sun et al. 2019). As a result of this inefficiency, CMAX++ lacks the goal-driven behavior of CMAX in early repetitions of the task, despite achieving optimal behavior in later repetitions. In the next section, we present an adaptive version of CMAX++ (A-CMAX++) that combines the goal-driven behavior of CMAX with the optimality of CMAX++.
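As a rough illustration, a single repetition of tabular CMAX++ can be written as the loop below; env.step, model.succ, and model.cost are assumed helpers (as in the earlier search sketch), not functions defined by the paper.

def cmax_pp_repetition(start, env, model, V, Q, incorrect, K, goals, actions):
    s, steps = start, 0
    while s not in goals:
        a = search(s, model, V, Q, incorrect, K, goals, actions)
        s_next = env.step(s, a)                   # true (unknown) dynamics f(s, a)
        if s_next != model.succ(s, a):            # discrepancy discovered online
            incorrect.add((s, a))                 # remember the incorrect transition ...
            Q[(s, a)] = model.cost(s, a) + V.get(s_next, 0.0)   # ... and give it a Q-estimate
        s = s_next
        steps += 1
    return steps

Note that every later execution of an already-flagged incorrect transition refreshes its Q-estimate with the current value of the observed successor, matching Line 10 of Algorithm 2.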
A-CMAX++

Background on CMAX. Before we describe A-CMAX++, we will start by summarizing CMAX; for more details, refer to (Vemula et al. 2020). At each time step t during execution, CMAX maintains a running estimate of the incorrect set X, and constructs a penalized model specified by the tuple M̃_X = (S, A, G, f̂, c̃_X), where the cost function c̃_X(s, a) = |S| if (s, a) ∈ X, and c̃_X(s, a) = c(s, a) otherwise. In other words, the cost of any transition found to be incorrect is set high (or inflated), while the costs of other transitions are the same as in M̂. CMAX uses the penalized model M̃_X to plan a path from the current state s_t to a goal state. Subsequently, CMAX executes the first action a_t along the path and observes whether the true dynamics and model dynamics differ on the executed action. If so, the state-action pair (s_t, a_t) is appended to the incorrect set X and the penalized model M̃_X is updated. CMAX continues to do this at every time step until the robot reaches a goal state.

Observe that the inflation of the cost of any incorrect state-action pair biases the planner to "explore" all other state-action pairs that are not yet known to be incorrect before it plans a path using an incorrect transition. This induces a goal-driven behavior in the computed plan that enables CMAX to quickly find an alternative path and not waste executions learning the true dynamics.
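For reference, the penalized cost used by CMAX can be sketched in one line; model.cost and the incorrect set are the same assumed helpers used in the sketches above.

def penalized_cost(s, a, model, incorrect, num_states):
    # Inflating the cost to |S| biases the planner away from known-incorrect transitions
    # without ever editing the model's dynamics.
    return float(num_states) if (s, a) in incorrect else model.cost(s, a)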
A-CMAX++ is presented in Algorithm 2 (including the lines marked as A-CMAX++ only). A-CMAX++ maintains a running estimate of the incorrect set X and constructs the penalized model M̃ at each time step t, similar to CMAX. For any state at time step t, we first compute the best action a_t based on the approximate model M̂ and the model-free Q-value estimates (Line 4). In addition, we also compute the best action ã_t using the penalized model M̃, similar to CMAX, which inflates the cost of any incorrect transition (Line 5). The crucial step in A-CMAX++ is Line 6, where we compare the penalized value Ṽ(s_t) (obtained using the penalized model M̃) and the non-penalized value V(s_t) (obtained using the approximate model M̂ and the Q-value estimates). Given a sequence {α_i ≥ 1} for repetitions i = 1, · · · , N of the task, if Ṽ(s_t) ≤ α_i V(s_t), then we execute action ã_t; otherwise, we execute a_t. This implies that if the cost incurred by following CMAX actions in the future is within α_i times the cost incurred by following CMAX++ actions, then we prefer to execute the CMAX action.

If the sequence {α_i} is chosen to be non-increasing such that α_1 ≥ α_2 ≥ · · · ≥ α_N ≥ 1, then we can observe that A-CMAX++ has the desired anytime-like behavior. It remains goal-driven in early repetitions, by choosing CMAX actions, and converges to optimal behavior in later repetitions, by choosing CMAX++ actions. Further, the executions needed to obtain accurate Q-value estimates are distributed across repetitions, ensuring that A-CMAX++ does not have poor performance in any single repetition. Thus, A-CMAX++ combines the advantages of both CMAX and CMAX++.
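A sketch of the per-step switching rule of A-CMAX++ (Lines 4-6 of Algorithm 2), reusing the earlier search sketch; pen_model is assumed to be the penalized model M̃ with inflated costs.

def a_cmax_pp_action(s, model, pen_model, V, V_pen, Q, incorrect, K, goals, actions, alpha_i):
    a = search(s, model, V, Q, incorrect, K, goals, actions)          # CMAX++ action (Line 4)
    a_pen = search(s, pen_model, V_pen, Q, set(), K, goals, actions)  # CMAX action (Line 5)
    # Both searches also refresh their value tables for s; prefer the conservative CMAX
    # action while its cost-to-goal estimate is within a factor alpha_i of the CMAX++ one.
    return a_pen if V_pen.get(s, 0.0) <= alpha_i * V.get(s, 0.0) else a

With a non-increasing sequence {α_i}, later repetitions satisfy the inequality less often, so the behavior gradually shifts from CMAX to CMAX++.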
Theoretical Guarantees

We will start by formally stating the assumption needed by CMAX to ensure completeness:

Assumption 4.1 ((Vemula et al. 2020)). Given a penalized model M̃_{X_t} and the current state s_t at any time step t, there always exists at least one path from s_t to a goal that does not contain any state-action pairs (s, a) that are known to be incorrect, i.e. (s, a) ∈ X_t.

Observe that the above assumption needs to be valid at every time step t before the robot reaches a goal and thus can be hard to satisfy. Before we state the theoretical guarantees for CMAX++, we need the following assumption on the approximate model M̂ that is used for planning:

Assumption 4.2. The optimal value function V̂* using the dynamics of the approximate model M̂ underestimates the optimal value function V* using the true dynamics of M at all states, i.e. V̂*(s) ≤ V*(s) for all s ∈ S.

In other words, if there exists a path from any state s to a goal state in the environment M, then there exists a path with the same or lower cost from s to a goal in the approximate model M̂. In our motivating pick-and-place example (Figure 1 left), this assumption is satisfied if the object is modeled as light in M̂, as the object being heavy in reality can only increase the cost. This assumption was also considered in previous works such as (Jiang 2018) and is known as the Optimistic Model Assumption.

We can now state the following guarantees:

Theorem 4.1 (Completeness). Assume the initial value estimates V, Q are admissible and consistent. Then we have:
1. If Assumption 4.2 holds, then using either CMAX++ or A-CMAX++, the robot is guaranteed to reach a goal state in at most |S|³ time steps in each repetition.
2. If Assumption 4.1 holds, then (a) using A-CMAX++ with a large enough α_i in any repetition i (typically true for early repetitions), the robot is guaranteed to reach a goal state in at most |S|² time steps, and (b) using CMAX++, it is guaranteed to reach a goal state in at most |S|³ time steps in each repetition.

Proof Sketch. The first part of the theorem follows from the analysis of Q-learning for systems with deterministic dynamics (Koenig and Simmons 1993). In the worst case, if the model is incorrect everywhere and Assumption 4.2 (or Assumption 4.1) holds, then Algorithm 2 reduces to Q-learning, and hence we can borrow its worst-case bounds. The second part of the theorem, concerning A-CMAX++, follows from the completeness proof of CMAX. □
Theorem 4.2 (Asymptotic Convergence). Assume Assumption 4.2 holds, and that the initial value estimates V, Q are admissible and consistent. For a sufficiently large number of repetitions N, there exists an integer j ≤ N such that the robot follows a path with the optimal cost to the goal using CMAX++ in Algorithm 2 in repetitions i ≥ j.

Proof Sketch. The guarantee follows from the asymptotic convergence of Q-learning (Koenig and Simmons 1993). □
It is important to note that the conditions required for Theorem 4.1 are weaker than the conditions required for the completeness of CMAX. Firstly, if either Assumption 4.1 or Assumption 4.2 holds, then CMAX++ can be shown to be complete, but CMAX is guaranteed to be complete only under Assumption 4.1. Furthermore, Assumption 4.2 only needs to hold for the approximate model M̂ we start with, whereas Assumption 4.1 needs to be satisfied for every penalized model M̃ constructed at any time step t during execution.

Large State Spaces

In this section, we present a practical instantiation of CMAX++ for large state spaces where it is infeasible to maintain tabular value estimates and the incorrect set X explicitly. Thus, we leverage function approximation techniques to maintain these estimates. Assume that there exists a metric d under which S is bounded. We relax the definition of the incorrect set using this metric to define X_ξ as the set of all (s, a) pairs such that d(f(s, a), f̂(s, a)) > ξ, where ξ ≥ 0. Typically, we choose ξ to allow for small modeling discrepancies that can be compensated by a low-level path-following controller.

CMAX++ in large state spaces is presented in Algorithm 3. The algorithm closely follows CMAX for large state spaces presented in (Vemula et al. 2020). The incorrect set X_ξ is maintained using sets of hyperspheres, with each set corresponding to a discrete action. Whenever the agent executes an incorrect state-action pair (s, a), CMAX++ adds a hypersphere centered at s with radius δ, as measured using the metric d, to the incorrect set corresponding to action a. In future planning, any state-action pair (s′, a′) is declared incorrect if s′ lies inside any of the hyperspheres in the incorrect set corresponding to action a′. After each execution, CMAX++ proceeds to update the value function approximators (Line 9) by sampling previously executed transitions and visited states from buffers and performing gradient descent steps (Procedures 13 and 17) using the mean squared loss functions given by L_Q(Q_ζ, X_Q) = (1/|X_Q|) Σ_{(s_i, a_i) ∈ X_Q} (Q(s_i, a_i) − Q_ζ(s_i, a_i))² and L_V(V_θ, X_V) = (1/|X_V|) Σ_{s_i ∈ X_V} (V(s_i) − V_θ(s_i))².

By using hyperspheres, CMAX++ "covers" the set of incorrect transitions, and enables fast lookup using KD-trees in the state space. Like Algorithm 2, we never update the approximate model M̂ used for planning. However, unlike Algorithm 2, we update the value estimates for sampled previous transitions and states (Lines 14 and 18). This ensures that the global function approximators used to maintain the value estimates V_θ, Q_ζ have good generalization beyond the current state and action. Algorithm 3 can also be extended in a similar fashion as Algorithm 2 to include A-CMAX++ by maintaining a penalized value function approximation and updating it using gradient descent.

Algorithm 3 CMAX++ in large state spaces

Require: Model M̂, start state s, value function approximators V_θ, Q_ζ, number of expansions K, t ← 0, discrepancy threshold ξ, radius of hypersphere δ, set of hyperspheres X_ξ ← {}, number of repetitions N, batch size B, state buffer D_S, transition buffer D_SA, learning rate η, number of updates U
1: for each repetition i = 1, · · · , N do
2:   t ← 0, s_t ← s
3:   while s_t ∉ G do
4:     Compute a_t = SEARCH(s_t, M̂, V_θ, Q_ζ, X_ξ, K)
5:     Execute a_t in the environment to get s_{t+1} = f(s_t, a_t)
6:     if d(s_{t+1}, f̂(s_t, a_t)) > ξ then
7:       Add hypersphere: X_ξ ← X_ξ ∪ {sphere(s_t, a_t, δ)}
8:     Add s_t to D_S, and (s_t, a_t, s_{t+1}) to D_SA
9:     for u = 1, · · · , U do                                  ▷ Approximator updates
10:      QUPDATE(Q_ζ, V_θ, D_SA)
11:      VUPDATE(V_θ, Q_ζ, D_S, X_ξ)
12:    t ← t + 1
13: procedure QUPDATE(Q_ζ, V_θ, D_SA)
14:   Sample B transitions from D_SA with replacement
15:   Construct the training set X_Q = {((s_i, a_i), Q(s_i, a_i))} for each sampled transition (s_i, a_i, s′_i), computing Q(s_i, a_i) = c(s_i, a_i) + V_θ(s′_i)
16:   Update: ζ ← ζ − η∇_ζ L_Q(Q_ζ, X_Q)
17: procedure VUPDATE(V_θ, Q_ζ, D_S, X_ξ)
18:   Sample B states from D_S with replacement
19:   Call SEARCH(s_i, M̂, V_θ, Q_ζ, X_ξ, K) for each sampled s_i to get all states s′_i on the closed list and their corresponding value updates V(s′_i), constructing the training set X_V = {(s′_i, V(s′_i))}
20:   Update: θ ← θ − η∇_θ L_V(V_θ, X_V)
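A sketch of the hypersphere bookkeeping described above, with one KD-tree of sphere centers per discrete action. scipy's cKDTree is one possible nearest-neighbour structure (queried here with the Manhattan metric, p=1, as used in the 7D experiment); the continuous embedding phi(s) of a state is an assumption of this sketch.

import numpy as np
from scipy.spatial import cKDTree

class HypersphereIncorrectSet:
    def __init__(self, radius):
        self.radius = radius
        self.centers = {}          # action -> list of sphere centers
        self.trees = {}            # action -> cKDTree over those centers

    def add(self, s, a, phi):
        # Cover the newly discovered incorrect transition with a sphere around phi(s).
        self.centers.setdefault(a, []).append(np.asarray(phi(s), dtype=float))
        self.trees[a] = cKDTree(np.vstack(self.centers[a]))   # rebuilt each time; fine for a sketch

    def is_incorrect(self, x, a):
        # The pair (s, a) is declared incorrect if phi(s) = x falls inside any sphere
        # stored for action a.
        tree = self.trees.get(a)
        if tree is None:
            return False
        dist, _ = tree.query(np.asarray(x, dtype=float), k=1, p=1)
        return bool(dist <= self.radius)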
Experiments

We test the efficiency of CMAX++ and A-CMAX++ on simulated robotic tasks, emphasizing their performance in each repetition of the task, and improvement across repetitions. In each task, we start the next repetition only if the robot reached a goal in the previous repetition. The code to reproduce our experiments can be found at https://github.com/vvanirudh/CMAXPP.

3D Mobile Robot Navigation with Icy Patches

In this experiment, the task is for a mobile robot with Reeds-Shepp dynamics (Reeds and Shepp 1990) to navigate around a track M with icy patches (Figure 1 right). This can be represented as a planning problem in a 3D discrete state space S, with any state represented using the tuple (x, y, θ), where (x, y) is the 2D position of the robot and θ describes its heading. The XY-space is discretized into a grid and the θ dimension is discretized into a fixed number of cells. We construct a lattice graph (Pivtoraiko, Knepper, and Kelly 2009) using motion primitives that are pre-computed offline respecting the differential constraints on the motion of the robot. The model M̂ used for planning contains the same track as M but without any icy patches; thus, the robot discovers transitions affected by icy patches only through executions. Since the state space is small, we use Algorithm 2 for CMAX++ and A-CMAX++.

Figure 2: Number of steps taken to finish a lap averaged across 10 instances, each with icy patches placed randomly around the track. The number above each bar reports the number of instances in which the robot was successful in finishing the respective lap within the time step limit.
For A-CMAX++, we use a non-increasing sequence with α_i = 1 + β_i, where β_1 = 100 and β_i is decreased by 2.5 after every 5 repetitions (see the Appendix for more details on choosing the sequence). We compare both algorithms with CMAX. For all the approaches, we perform K = 100 expansions. Since the motion primitives are computed offline using an expensive procedure, it is not feasible to update the dynamics of the model M̂ online, and hence we do not compare with any model learning baselines. We also conducted several experiments with model-free Q-learning, and found that it performed poorly, requiring a very large number of executions and finishing only 10 laps in the best case. Hence, we do not include it in our results shown in Figure 2.

CMAX performs well in the early laps, computing paths with lower costs compared to CMAX++. However, after a few laps the robot using CMAX gets stuck within an icy patch and does not make any more progress. Observe that when the robot is inside the icy patch, Assumption 4.1 is violated, and CMAX ends up inflating all transitions that take the robot out of the patch, leading to the robot finishing all laps in only 2 out of 10 instances. CMAX++, on the other hand, is suboptimal in the initial laps, but converges to paths with lower costs in later laps. More importantly, the robot using CMAX++ manages to finish all laps in all instances. A-CMAX++ also successfully finishes all laps in all instances. However, it outperforms both CMAX and CMAX++ in all laps by intelligently switching between them, achieving goal-driven behavior in early laps and optimal behavior in later laps. Thus, A-CMAX++ combines the advantages of CMAX and CMAX++.
7D Pick-and-Place with a Heavy Object

The task in this experiment is to pick a heavy object from a shorter table, using a 7 degree-of-freedom (DOF) robotic arm (Figure 1 left), and place it at a goal pose on a taller table while avoiding an obstacle. As the object is heavy, the arm cannot generate the required force in certain configurations and can only lift the object to small heights. The problem is represented as planning in a 7D discrete state space, where the first 6 dimensions describe the 6 DOF pose of the arm end-effector, and the last dimension corresponds to the redundant DOF in the arm. The action space A is a discrete set of actions corresponding to moving in each dimension by a fixed offset in the positive or negative direction. The model M̂ used for planning models the object as light, and hence does not capture the dynamics of the arm correctly when it tries to lift the heavy object. The state space is discretized into a fixed number of cells in each dimension, resulting in a large number of states. Thus, we need to use Algorithm 3 for CMAX++ and A-CMAX++. The goal is to pick and place the object repeatedly over a fixed number of repetitions, where at the start of each repetition the object is in the start pose and needs to reach the goal pose by the end of the repetition.

We compare with CMAX for large state spaces, model-free Q-learning (van Hasselt, Guez, and Silver 2016), and residual model learning baselines (Saveriano et al. 2017). We chose two kinds of function approximators for the learned residual dynamics: global function approximators such as Neural Networks (NN), and local memory-based function approximators such as K-Nearest Neighbors regression (KNN). The Q-learning baseline uses Q-values that are cleverly initialized using the model M̂, making it a strong model-free baseline. We use the same neural network function approximators for maintaining value estimates for all approaches and perform K = 5 expansions. We chose the metric d as the Manhattan metric and use ξ = 0 for this experiment. We use a radius of δ = 3 for the hyperspheres introduced in the 7D discrete state space, and to ensure a fair comparison use the same radius for KNN regression. These values are chosen to reflect the discrepancies observed when the arm tries to lift the object. All approaches use the same initial value estimates obtained through planning in M̂. A-CMAX++ uses a non-increasing sequence α_i = 1 + β_i, where β_1 = 4 and β_i is decayed by a constant factor after each repetition.

The results are presented in Table 1. Model-free Q-learning takes a large number of executions in the initial repetitions to estimate accurate Q-values, but in later repetitions computes paths with lower costs, managing to finish all repetitions in some, but not all, of the instances. Among the residual model learning baselines, the KNN approximator is successful in all instances but takes a large number of executions to learn the true dynamics, while the NN approximator finishes all repetitions in only a few instances. CMAX performs well in the initial repetitions but quickly gets stuck due to inflated costs, and manages to complete the task in all repetitions in only 1 instance. CMAX++ is successful in finishing the task in all instances and repetitions, while improving performance across repetitions. Finally, as expected, A-CMAX++ also finishes all repetitions, sometimes even having better performance than CMAX and CMAX++.

Table 1: Number of steps taken to reach the goal in the 7D pick-and-place experiment, for multiple instances each with random start and obstacle locations. We report mean and standard error only among successful instances in which the robot reached the goal within the time step limit. The success subcolumn indicates the percentage of successful instances. Rows compare CMAX, CMAX++, A-CMAX++, residual model learning with KNN and NN approximators, and Q-learning; columns report steps and success rate per repetition.
Discussion

A major advantage of CMAX++ is that, unlike previous approaches that deal with inaccurate models, it can exploit inaccurately modeled transitions without wasting online executions to learn the true dynamics. It estimates the Q-values of incorrect transitions leveraging past experience and enables the planner to compute solutions containing such transitions. Thus, CMAX++ is especially useful in robotic domains with repetitive tasks where the true dynamics are intractable to model, such as deformable manipulation, or vary over time due to reasons such as wear and tear. Furthermore, the optimistic model assumption is easier to satisfy when compared to assumptions used by previous approaches like CMAX, and the performance of CMAX++ degrades gracefully with the accuracy of the model, reducing to Q-learning in the case where the model is inaccurate everywhere.

Limitations of CMAX++ and A-CMAX++ include hyperparameters such as the radius δ and the sequence {α_i}, which might need to be tuned for the task. However, from our sensitivity experiments (see Appendix) we observe that the performance of A-CMAX++ is robust to the choice of the sequence {α_i} as long as it is non-increasing. Note that Assumption 4.2 can be restrictive for tasks where designing an initial optimistic model requires extensive domain knowledge. However, it is infeasible to relax this assumption further without resorting to global undirected exploration techniques (Thrun 1992), which are highly sample inefficient, to ensure completeness. An interesting future direction is to interleave model identification with CMAX++ to combine the best of approaches that learn the true dynamics and CMAX++. For instance, given a set of plausible forward models, we seek to quickly identify the best model while ensuring efficient performance in each repetition.
Acknowledgements
AV would like to thank Jacky Liang, Fahad Islam, Ankit Bhatia, Allie Del Giorno, Dhruv Saxena and Pragna Mannam for their help in reviewing the draft. AV is supported by the CMU presidential fellowship endowed by TCS. Finally, AV would like to thank Caelan Garrett for developing and maintaining the wonderful ss-pybullet library.
References
Abbeel, P.; Quigley, M.; and Ng, A. Y. 2006. Using inaccurate models in reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), 1–8. ACM.
Andrychowicz, M.; Crow, D.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; and Zaremba, W. 2017. Hindsight Experience Replay. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5048–5058.
Berenson, D.; Abbeel, P.; and Goldberg, K. 2012. A robot path planning framework that learns from experience. In IEEE International Conference on Robotics and Automation (ICRA 2012), 3671–3678. IEEE.
Bertsekas, D. P. 2005. Dynamic Programming and Optimal Control, 3rd Edition. Athena Scientific.
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv:1606.01540.
Catto, E. 2007. Box2D: A 2D physics engine for games. Open source.
Coumans, E.; et al. 2013. Bullet physics library. Open source: bulletphysics.org.
Diankov, R. 2010. Automated Construction of Robotic Manipulation Programs. Ph.D. thesis, Carnegie Mellon University, Robotics Institute.
Essahbi, N.; Bouzgarrou, B. C.; and Gogu, G. 2012. In Mechanisms, Mechanical Transmissions and Robotics, volume 162 of Applied Mechanics and Materials.
Garrett, C. R. 2018. ss-pybullet: PyBullet planning utilities. https://github.com/caelan/ss-pybullet.
Hauser, K.; Bretl, T.; Harada, K.; and Latombe, J.-C. 2006. Using Motion Primitives in Probabilistic Sample-Based Planning for Humanoid Robots. In Algorithmic Foundation of Robotics VII (WAFR 2006), volume 47 of Springer Tracts in Advanced Robotics, 507–522. Springer.
Ji, X.; and Xiao, J. 2001. Planning Motions Compliant to Complex Contact States. International Journal of Robotics Research.
Jiang, N. 2018. PAC Reinforcement Learning With an Imperfect Model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. URL http://arxiv.org/abs/1412.6980.
Koenig, S.; and Likhachev, M. 2006. Real-time adaptive A*. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2006), 281–288. ACM.
Koenig, S.; and Simmons, R. G. 1993. Complexity Analysis of Real-Time Reinforcement Learning. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93).
Korf, R. E. 1990. Real-time heuristic search. Artificial Intelligence.
Lagrassa, A.; Lee, S.; and Kroemer, O. 2020. Learning Skills to Patch Plans Based on Inaccurate Models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020).
Lee, M. A.; Florensa, C.; Tremblay, J.; Ratliff, N. D.; Garg, A.; Ramos, F.; and Fox, D. 2020. Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning. CoRR abs/2005.10872.
McConachie, D.; Power, T.; Mitrano, P.; and Berenson, D. 2020. Learning When to Trust a Dynamics Model for Planning in Reduced State Spaces. IEEE Robotics and Automation Letters.
Paszke, A.; et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc.
Pivtoraiko, M.; Knepper, R. A.; and Kelly, A. 2009. Differentially constrained mobile robot motion planning in state lattices. Journal of Field Robotics.
Reeds, J. A.; and Shepp, L. A. 1990. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics.
Saveriano, M.; Yin, Y.; Falco, P.; and Lee, D. 2017. Data-efficient control policy search using residual dynamics learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017), 4709–4715. IEEE.
Sun, W.; Jiang, N.; Krishnamurthy, A.; Agarwal, A.; and Langford, J. 2019. Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches. In Conference on Learning Theory (COLT 2019), volume 99 of Proceedings of Machine Learning Research, 2898–2933. PMLR.
Sutton, R. S. 1991. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. SIGART Bulletin.
Thrun, S. 1992. Efficient Exploration in Reinforcement Learning. Technical report, Carnegie Mellon University.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), 5026–5033. IEEE.
van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016).
Vemula, A.; Oza, Y.; Bagnell, J. A.; and Likhachev, M. 2020. Planning and Execution using Inaccurate Models with Provable Guarantees. In Proceedings of Robotics: Science and Systems (RSS 2020). Corvallis, Oregon, USA.
A Sensitivity Experiments
In this section, we present the results of our sensitivity experiments examining the performance of A-CMAX++ with the choice of the sequence {α_i}. We compare the performance of different choices of the sequence {α_i} on the 3D mobile robot navigation task. For each run, we average the results across multiple instances with randomly placed ice patches and present the mean and standard errors. To keep the figures concise, we plot the cumulative number of steps taken to reach the goal from the start of the first lap to the current lap, across all laps. In all our runs, A-CMAX++ successfully completes all laps, and hence we do not report the number of successful instances in our results. We choose four schedules for the sequence {α_i} (a code sketch of these schedules is given after the figure captions below):

1. Exponential Schedule: In this schedule, we vary β_{i+1} = ρβ_i, where ρ < 1 is a constant that is tuned, and α_i = 1 + β_i. Observe that as i → ∞, α_i → 1, and that the sequence {α_i} is a decreasing sequence. We vary both the initial β_1 chosen and the constant ρ in our experiments. For β_1 we choose among the values [10, 100, 1000], and ρ is chosen among [0.5, 0.7, 0.9]. The results are shown in Figure 3. All choices have almost the same performance, with β_1 = 1000 having the best performance initially but slightly worse performance in the last several laps. The choice of β_1 = 100 and ρ = 0.9 seems to be a good choice, with great performance in both the initial and final laps.

2. Linear Schedule: In this schedule, we vary β_{i+1} = β_i − η, where α_i = 1 + β_i and η > 0 is a constant that is determined so that β_N = 0, i.e. α_N = 1. Hence, we have η = β_1/N. We vary the initial β_1 and choose among the values [10, 100, 200]. The results are shown in Figure 4. All three choices have the same performance except in the last few laps, where β_1 = 10 degrades while the other two choices perform well.

3. Time Decay Schedule: In this schedule, we vary β_{i+1} = β_1/(i+1), where α_i = 1 + β_i. In other words, we decay β at the rate of 1/i, where i is the lap number. Again, observe that as i → ∞, we have α_i → 1. We vary the initial β_1 and choose among the values [10, 100, 1000]. The results are shown in Figure 5. The choices of β_1 = 100 and β_1 = 1000 have the best (and similar) performance, while β_1 = 10 has poor performance, as it quickly switches to CMAX++ in the early laps and wastes executions learning accurate Q-values.

4. Step Schedule: In this schedule, we vary β as a step function with β_{i+1} = β_i − δ if i is a multiple of ξ, where ξ is the step frequency, α_i = 1 + β_i, and δ is a constant that is determined so that β_N = 0, i.e. α_N = 1. Hence, we have δ = β_1ξ/N. We vary both the initial β_1 and the step frequency ξ. For β_1 we choose among the values [10, 100, 200], and for ξ we choose among [5, 10, 20]. The results are shown in Figure 6. All choices have the same performance, and A-CMAX++ seems to be robust to the choice of step size frequency.

Figure 3: Sensitivity experiments with an exponential schedule (varying β_{i+1} = ρβ_i with α_i = 1 + β_i; runs with β_1 ∈ {10, 100, 1000} and ρ ∈ {0.5, 0.7, 0.9}).
Figure 4: Sensitivity experiments with a linear schedule (varying β_{i+1} = β_i − η with α_i = 1 + β_i; runs with (β_1, η) ∈ {(10, 0.05), (100, 0.5), (200, 1.0)}).
Figure 5: Sensitivity experiments with a time decay schedule (varying β_{i+1} = β_1/(i+1) with α_i = 1 + β_i; runs with β_1 ∈ {10, 100, 1000}).
Figure 6: Sensitivity experiments with a step schedule (varying β_{i+1} = β_i − δ if i is a multiple of ξ, with α_i = 1 + β_i; runs with β_1 ∈ {10, 100, 200} and ξ ∈ {5, 10, 20}).
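The four schedules can be written down compactly; the sketch below assumes N = 200 laps (the horizon shown in the figures) and uses hyperparameter values from the figure legends only as examples.

def exponential(beta1, rho, N):               # beta_{i+1} = rho * beta_i
    betas = [beta1]
    for _ in range(N - 1):
        betas.append(betas[-1] * rho)
    return betas

def linear(beta1, N):                         # beta_{i+1} = beta_i - eta, eta = beta_1 / N
    eta = beta1 / N
    return [max(beta1 - i * eta, 0.0) for i in range(N)]

def time_decay(beta1, N):                     # beta_{i+1} = beta_1 / (i + 1)
    return [beta1 / (i + 1) for i in range(N)]

def step(beta1, xi, N):                       # drop beta by delta = beta_1 * xi / N every xi laps
    delta = beta1 * xi / N
    return [max(beta1 - (i // xi) * delta, 0.0) for i in range(N)]

# Example: the switching thresholds alpha_i for the EXP beta_1 = 100, rho = 0.9 run.
alphas = [1.0 + b for b in exponential(100.0, 0.9, 200)]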
For our final comparison, we pick the best performing choice among all the schedules and compare performance among these selected choices. The results are shown in Figure 7.

Figure 7: Sensitivity experiments with the best choices among all schedules (EXP β_1 = 100, ρ = 0.9; LINEAR β_1 = 100, η = 0.5; TIME β_1 = 100; STEP β_1 = 100, ξ = 5, δ = 2.5).

We can observe that all schedules have the same performance except the exponential schedule, which has worse performance. This can be attributed to its rapid decrease in the value of β compared to the other schedules: A-CMAX++ switches to CMAX++ after relatively few laps, resulting in a large number of executions wasted to learn accurate Q-value estimates. This does not happen for the other schedules, as they decrease β gradually, spreading out the executions used to learn accurate Q-value estimates across several laps and not performing poorly in any single lap.

B Proofs
In this section, we present the assumptions and proofs that result in the theoretical guarantees of CMAX++ and A-CMAX++.
B.1 Significance of Optimistic Model Assumption
To understand the significance of the optimistic model assumption, it is important to note that completeness guarantees usually require the use of admissible and consistent value estimates, i.e. estimates that always underestimate the true cost-to-goal values. This requirement needs to hold every time we plan (or replan), to ensure that we never discard a path as being expensive in terms of cost when it is cheap in reality.

All our guarantees assume that the initial value estimates are consistent and admissible, but to ensure that they always remain consistent and admissible throughout execution, we need the optimistic model assumption. This assumption ensures that updating value estimates by planning in the model M̂ always results in estimates that are admissible and consistent. In other words, the optimal value function of the model M̂ (which we obtain by doing full state space planning in M̂) always underestimates the optimal value function of the environment M at all states s ∈ S.

A very intuitive way to understand the assumption is to imagine a navigation task where the robot is navigating from a start to a goal in the presence of obstacles. In this example, the optimistic model assumption requires that the model should never place an obstacle at a location when there isn't an obstacle in the environment at the same location. However, if there truly is an obstacle at some location, then the model can either have an obstacle or not have one at that location. Put simply, an agent that is planning using the model should never be "pleasantly surprised" by what it sees in the environment. Several other intuitive examples are presented in (Jiang 2018), and we recommend the reader look at them for more intuition.

B.2 Completeness Proof
To prove completeness, we first need to note that the Q-update in CMAX++ always ensures that the Q-value estimates remain consistent and admissible as long as the state value estimates remain consistent and admissible. We have already seen why the optimistic model assumption ensures that the state value estimates always remain consistent and admissible. Thus, we can use Theorem 3 from RTAA* (Koenig and Likhachev 2006) in conjunction with the optimistic model assumption to ensure completeness. Note that if the model is inaccurate everywhere, then our planner reduces to doing K = 1 expansions at every time step and acts similarly to Q-learning, which is also guaranteed to be complete with admissible and consistent estimates (Koenig and Simmons 1993). The worst-case bound of |S|³ steps is taken directly from the upper bound on Q-learning from (Koenig and Simmons 1993). The above arguments are true for both CMAX++ and A-CMAX++. Note that for A-CMAX++, if all paths to the goal contain an incorrect transition, then the penalized value estimate satisfies Ṽ(s) > αV(s) for any finite α, and A-CMAX++ thus falls back on CMAX++.

For the second part of the theorem, the assumption of CMAX (we will refer to this as the optimistic penalized model assumption) in conjunction with the RTAA* guarantee again ensures completeness for CMAX++ and A-CMAX++. To see this, observe that the optimistic penalized model assumption ensures that the value estimates are always admissible and consistent w.r.t. the true penalized model (M̃_X where X contains all the incorrect transitions), and from the assumption we know that there exists a path to the goal in the true penalized model. Hence, CMAX++ and A-CMAX++ are bound to find this path.

CMAX++ again utilizes the worst-case bounds of Q-learning under the optimistic penalized model assumption, and attains an upper bound of |S|³ steps. However, A-CMAX++ with a sufficiently large α_i for any repetition i acts similarly to CMAX, and can thereby utilize the worst-case bounds of LRTA* (which is simply RTAA* with K = 1 expansions) from (Koenig and Simmons 1993), giving an upper bound of |S|² time steps. This shows the advantage of A-CMAX++ over CMAX++, especially in earlier repetitions when the incorrect set X is small (thus making the optimistic penalized model assumption hold) and α_i is large.

B.3 Asymptotic Convergence Proof
The asymptotic convergence proof relies entirely on the asymptotic convergence of Q-learning (Koenig and Simmons 1993) and the asymptotic convergence of LRTA* (Korf 1990) to optimal value estimates. The proof again crucially relies on the fact that the value estimates always remain admissible and consistent, which is ensured by the optimistic model assumption. Note that the optimistic penalized model assumption is not enough to guarantee asymptotic convergence to the optimal cost in M, as we penalize incorrect transitions. However, it is possible to show that under the optimistic penalized model assumption, both CMAX++ and A-CMAX++ converge to the optimal cost in the true penalized model M̃_X, where X contains all incorrect transitions.

C Experiment Details
All experiments were implemented using Python 3.6 and run on an Intel Core machine. We use PyTorch (Paszke et al. 2019) to train the neural network function approximators in our 7D experiments, use Box2D (Catto 2007) for our 3D mobile robot simulation (similar to the OpenAI Gym (Brockman et al. 2016) car racing environment), and use PyBullet (Coumans et al. 2013) for our 7D PR2 experiments.
C.1 3D Mobile Robot Navigation with Icy Patches
Figure 8: Example track used in the 3D mobile robot experiment.

An example track used in the 3D experiment is shown in Figure 8. We generate motion primitives offline using the following procedure: (a) We first define the primitive action set for the robot by discretizing the steering angle into 3 cells, one corresponding to zero steering and the other two corresponding to a fixed positive and negative steering angle. We also discretize the speed of the robot into 2 cells, corresponding to +2 m/s and −2 m/s. (b) We then discretize the state space into a grid in XY space and a fixed number of cells in the θ dimension, giving a 3D grid in XYθ space. (c) We then initialize the robot at the origin of the XY grid with different headings chosen among the discrete heading values, and roll out all possible sequences of primitive actions up to a maximum motion primitive length. (d) We then filter the motion primitives, keeping those whose end point is very close to a cell center in the XYθ grid. During execution, we use a pure pursuit controller to track the motion primitive so that the robot always starts and ends on a cell center. During planning, we simply use the discrete offsets stored in the motion primitive to compute the next state (thus, the model dynamics are pre-computed offline during motion primitive generation).

The cost function used is as follows: for any motion primitive a and state s, the cost of executing a from s is given by c(s, a) = Σ_{s′} c′(s′), where c′ is a pre-defined cost map over the XYθ grid and s′ ranges over all the intermediate states (including the final state) that the robot goes through while executing the motion primitive a from s. The pre-defined cost map is defined as follows: c′(s) = 1 if the state s lies on the track (i.e. the XY location corresponding to s lies on the track) and c′(s) = 100 otherwise (i.e. all XY locations corresponding to grass or wall have a cost of 100). This encourages the planner to come up with a path that lies completely on the track.

We define two checkpoints on opposite ends of the track (shown as blue squares in Figure 8). The goal of the robot is to reach the next checkpoint incurring the least cost while staying on the track. Note that this requires the robot to complete laps around the track as quickly as possible. Since the state space is small, we maintain the value estimates V, Q, Ṽ using tables and update the appropriate table entry for each value update. The tables are initialized with value estimates obtained by planning in the model M̂ using a planner with K = 100 expansions until the robot can efficiently complete the laps using the optimal paths. However, this does not mean that the initial value estimates are the optimal values for the M̂ dynamics, since the planner looks ahead and can achieve optimal paths with underestimated value functions. Nevertheless, these estimates are highly informative.
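A sketch of this cost computation; primitive_cells(s, a) (the sequence of discrete cells the primitive sweeps through, including the final one) and on_track(cell) are assumed helpers standing in for the lattice bookkeeping.

TRACK_COST, OFF_TRACK_COST = 1.0, 100.0

def cell_cost(cell, on_track):
    # Pre-defined cost map: cheap on the track, expensive on grass or wall.
    return TRACK_COST if on_track(cell) else OFF_TRACK_COST

def primitive_cost(s, a, primitive_cells, on_track):
    # Summing the cost map over every intermediate cell encourages plans that stay on the track.
    return sum(cell_cost(cell, on_track) for cell in primitive_cells(s, a))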
C.2 7D Pick-and-Place with a Heavy Object

For our 7D experiments, we make use of the Bullet physics engine through the PyBullet interface. For motion planning and other simulation capabilities we make use of the ss-pybullet library (Garrett 2018). The task is shown in Figure 9. The goal is for the robot to pick the heavy object from its start pose and place it at its goal pose while avoiding the obstacle, without any resets. Since the object is heavy, the robot fails to lift it in certain configurations where it cannot generate the required torque. Thus, while lifting the object the robot might fail to reach the goal waypoint and only reach an intermediate waypoint, resulting in discrepancies between modeled and true dynamics.

Figure 9: 7D pick-and-place experiment.

This is represented as a planning problem in a 7D state space. The first 6 dimensions correspond to the 6-DOF pose of the object (or gripper), and the last dimension corresponds to the redundant DOF in the arm (in our case, the upper arm roll joint). Given a 7D configuration, we use the IKFast library (Diankov 2010) to compute the corresponding joint angle configuration of the arm. The action space consists of motion primitives that move the arm by a fixed offset in each of the 7 dimensions in the positive and negative directions. The discrepancies in this experiment are only in the Z dimension corresponding to lifting the object. For planning, we simply use a kinematic model of the arm and assume that the object being lifted is extremely light. Thus, we do not need to explicitly consider dynamics during planning. However, during execution we take the dynamics into account by executing the motion primitives in the simulator. The cost of any transition is 1 if the object is not at the goal pose, and 0 if the object is at the goal pose. We start the next repetition only if the robot reached the goal pose in the previous repetition.

The 7D state space is discretized into a fixed number of cells in each dimension, resulting in a large number of discrete states. Since the state space is large, we use neural network function approximators to maintain the value functions $V$, $Q$, $\tilde{V}$. For the state value functions $V$, $\tilde{V}$ we use the following neural network approximator: a feedforward network with fully connected hidden layers and ReLU activations after each layer except the last, which takes as input a feature representation of the 7D state computed as follows:

• For any discrete state $s$, we compute a continuous representation $r(s)$ that is used to construct the features:
  – The discrete state is represented as $(x_d, y_d, z_d, r_d, p_d, y_d, rjoint_d)$, where $(x_d, y_d, z_d)$ is the 3D discrete location of the object (or gripper), $(r_d, p_d, y_d)$ are the discrete roll, pitch, and yaw of the object (or gripper), and $rjoint_d$ is the discrete redundant joint angle.
  – We convert $(x_d, y_d, z_d)$ to a continuous representation $(x_c, y_c, z_c)$ by simply dividing each coordinate by the grid size in that dimension.
  – We do a similar construction for $rjoint_c$, dividing $rjoint_d$ by the grid size in that dimension.
  – However, note that $r_d, p_d, y_d$ are angular dimensions, and simply dividing by the grid size would not encode the wrap-around nature inherent in angular dimensions (we did not have this problem for $rjoint_d$ as the redundant joint angle has lower and upper limits and is always recorded as a value between those limits). To account for this, we use a sine-cosine representation defined as $(r_{c_1}, r_{c_2}, p_{c_1}, p_{c_2}, y_{c_1}, y_{c_2}) = (\sin(r_c), \cos(r_c), \sin(p_c), \cos(p_c), \sin(y_c), \cos(y_c))$, where $r_c, p_c, y_c$ are the roll, pitch, and yaw angles corresponding to the cell centers of the grid cells $r_d, p_d, y_d$.
  – Thus, the final 10D representation of state $s$ is given by $r(s) = (x_c, y_c, z_c, r_{c_1}, r_{c_2}, p_{c_1}, p_{c_2}, y_{c_1}, y_{c_2}, rjoint_c)$.
  – We also define a truncated 9D representation $r'(s) = (x_c, y_c, z_c, r_{c_1}, r_{c_2}, p_{c_1}, p_{c_2}, y_{c_1}, y_{c_2})$ and a 3D representation $r''(s) = (x_c, y_c, z_c)$.
• The first feature is the 9D relative position of the goal pose $g$ w.r.t. the object, $f_1 = r'(g) - r'(s)$.
• The second feature is the 10D relative position of the object w.r.t. the gripper home state $h$, $f_2 = r(s) - r(h)$.
• The third feature is the 9D relative position of the goal w.r.t. the gripper home state $h$, $f_3 = r'(g) - r'(h)$.
• The fourth feature is the 3D relative position of the obstacle's top left corner $o_1$ w.r.t. the object, $f_4 = r''(o_1) - r''(s)$.
• The fifth and final feature is the 3D relative position of the obstacle's bottom right corner $o_2$ w.r.t. the object, $f_5 = r''(o_2) - r''(s)$.
• Thus, the final 34D feature representation is given by $f(s) = (f_1, f_2, f_3, f_4, f_5)$.
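The following is a minimal Python sketch (not the paper's code) of the feature construction above. The grid sizes, angular cell width, and cell-center convention are illustrative placeholders; only the structure of $r(s)$, $r'(s)$, $r''(s)$, and $f(s)$ follows the text.

```python
# Minimal sketch (not the paper's code) of the 34D feature construction described above.
import numpy as np

N_XYZ = np.array([20.0, 20.0, 20.0])   # hypothetical number of cells in x, y, z
N_RJOINT = 20.0                        # hypothetical number of cells for the redundant joint
ANGLE_CELL = 2 * np.pi / 16            # hypothetical angular cell width for roll, pitch, yaw

def r_full(s):
    """10D continuous representation r(s) of a discrete state (xd, yd, zd, rd, pd, yd, rjointd)."""
    xd, yd, zd, rd, pd, yawd, rjointd = s
    xc, yc, zc = np.array([xd, yd, zd], dtype=float) / N_XYZ   # divide by grid size
    rjointc = rjointd / N_RJOINT
    # sine-cosine encoding of the angular dimensions (evaluated at cell centers)
    r, p, y = (np.array([rd, pd, yawd], dtype=float) + 0.5) * ANGLE_CELL
    return np.array([xc, yc, zc,
                     np.sin(r), np.cos(r),
                     np.sin(p), np.cos(p),
                     np.sin(y), np.cos(y),
                     rjointc])

def r_trunc(s):  # 9D representation r'(s): drop the redundant joint
    return r_full(s)[:9]

def r_xyz(s):    # 3D representation r''(s): position only
    return r_full(s)[:3]

def features(s, goal, home, obstacle_lt_xyz, obstacle_rb_xyz):
    """34D feature vector f(s) = (f1, f2, f3, f4, f5). Obstacle corners are passed
    directly as normalized 3D coordinates (i.e., r'' of the corner cells)."""
    f1 = r_trunc(goal) - r_trunc(s)               # 9D: goal pose relative to the object
    f2 = r_full(s) - r_full(home)                 # 10D: object relative to the gripper home state
    f3 = r_trunc(goal) - r_trunc(home)            # 9D: goal relative to the gripper home state
    f4 = np.asarray(obstacle_lt_xyz) - r_xyz(s)   # 3D: obstacle top left corner w.r.t. the object
    f5 = np.asarray(obstacle_rb_xyz) - r_xyz(s)   # 3D: obstacle bottom right corner w.r.t. the object
    return np.concatenate([f1, f2, f3, f4, f5])   # 34D total
```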
The output of the network is a single scalar value representing the cost-to-goal of the input state. Instead of learning the cost-to-goal/value from scratch, we start with an initial value estimate that is hardcoded (the Manhattan distance to the goal in the 7D discrete grid), and the neural network approximator is used to learn a residual on top of it. A similar trick was used in CMAX (Vemula et al. 2020). The residual state value function approximator was initialized to output 0 for all $s \in S$. We use a similar architecture for the residual $Q$-value function approximator, but it takes as input the 34D state feature representation and outputs a vector in $\mathbb{R}^{|A|}$ (in our case, $\mathbb{R}^{14}$) representing the cost-to-goal estimate for each action $a \in A$. We also use the same hardcoded value estimates as before, in addition to the residual approximator, to construct the $Q$-values. All baselines and proposed approaches use the same function approximator and the same initial hardcoded value estimates to ensure a fair comparison. The value function approximators are trained using a mean squared loss.

The residual model learning baseline with a neural network (NN) function approximator uses the following architecture: fully connected hidden layers, each followed by ReLU activations except the last layer. The input of the network is the feature representation of the state and a one-hot encoding of the action. The output of the network is a correction in the 7D continuous state space, which is added to the next state predicted by the model $\hat{M}$. The loss function used to train the network is a simple mean squared loss. The residual model learning baseline with a K-nearest neighbor regression approximator (KNN) uses a fixed Manhattan radius in the discrete 7D state space. We compute the prediction by averaging the next-state residual vectors observed in the past for all states that lie within this radius of the current state. The averaged residual is added to the next state predicted by the model $\hat{M}$ to obtain the learned next state.

We use the Adam optimizer (Kingma and Ba 2015) with weight decay (L2 regularization) to train all the neural network function approximators in all approaches. We use separate batch sizes for the state value function approximators and the $Q$-value function approximators, and perform 3 updates of the state value function and 5 updates of the state-action value function at each time step. We update the parameters of all neural network approximators using polyak averaging. Finally, we use the hindsight experience replay trick (Andrychowicz et al. 2017) in training all the value function approximators, with the probability of sampling any future state in past trajectories as the goal set to 0.7.
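As an illustration of the residual value estimate and the polyak parameter averaging mentioned above, here is a PyTorch-style sketch. The layer widths, learning rate, weight decay, and averaging coefficient are placeholders, and the use of a separate slowly-updated copy of the network for the polyak average is an assumption rather than the paper's exact setup.

```python
# PyTorch-style sketch (not the paper's code) of the residual value estimate and a polyak
# parameter update. Hyperparameter values below are placeholders.
import torch
import torch.nn as nn

class ResidualValueNet(nn.Module):
    """Predicts a residual on top of the hardcoded Manhattan cost-to-goal estimate."""
    def __init__(self, feature_dim: int = 34, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # initialize the residual to output 0 for all states, as described above
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def value_estimate(residual_net, features, manhattan_heuristic):
    # V(s) = hardcoded Manhattan distance-to-goal + learned residual
    return manhattan_heuristic + residual_net(features)

def polyak_update(averaged_net, online_net, tau: float = 0.9):
    # averaged <- tau * averaged + (1 - tau) * online  (tau is a placeholder value)
    with torch.no_grad():
        for p_avg, p_new in zip(averaged_net.parameters(), online_net.parameters()):
            p_avg.mul_(tau).add_((1.0 - tau) * p_new)

# Optimizer setup of the kind described above (values are placeholders, not the paper's)
net = ResidualValueNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-5)
```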