C-Learning: Horizon-Aware Cumulative Accessibility Estimation
Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell, Tong Li, Animesh Garg
Published as a conference paper at ICLR 2021

Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell & Tong Li
Layer 6 AI
{panteha, gabriel, harry, anthony, jesse, tong}@layer6.ai

Animesh Garg
University of Toronto, Vector Institute, Nvidia
[email protected]

Abstract
Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show that these functions obey a recurrence relation, which enables learning from offline interactions. We also prove that optimal cumulative accessibility functions are monotonic in the planning horizon. Additionally, our method can trade off speed and reliability in goal-reaching by suggesting multiple paths to a single goal depending on the provided horizon. We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. Our code is available at https://github.com/layer6ai-labs/CAE, and additional visualizations can be found at https://sites.google.com/view/learning-cae/.

1 Introduction
Multi-goal reinforcement learning tackles the challenging problem of reaching multiple goals, and as a result, is an ideal framework for real-world agents that solve a diverse set of tasks. Despite progress in this field (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017; Ghosh et al., 2019), current algorithms suffer from a set of limitations: an inability to find multiple paths to a goal, high sample complexity, and poor results in complex motion planning tasks. In this paper we propose C-learning, a method which addresses all of these shortcomings.

Many multi-goal reinforcement learning algorithms are limited by learning only a single policy π(a|s, g) over actions a to reach goal g from state s. There is an unexplored trade-off between reaching the goal reliably and reaching it quickly. We illustrate this shortcoming in Figure 1a, which represents an environment where an agent must reach a goal on the opposite side of some predator. Shorter paths can reach the goal faster at the cost of a higher probability of being eaten. Existing algorithms do not allow a dynamic choice of whether to act safely or quickly at test time.

The second limitation is sample complexity. Despite significant improvements (Andrychowicz et al., 2017; Ghosh et al., 2019), multi-goal reaching still requires a very large amount of environment interactions for effective learning. We argue that the optimal Q-function must be learned to high accuracy for the agent to achieve reasonable performance, and this leads to sample inefficiency. The same drawback of optimal Q-functions often causes agents to learn sub-optimal ways of reaching the intended goal. This issue is particularly true for motion planning tasks (Qureshi et al., 2020), where current algorithms struggle.

Figure 1: (a) A continuous spectrum of paths allow the mouse to reach its goal faster, at an increased risk of disturbing the cat and being eaten. (b) Q* (with γ close to 1) needs to be learned more accurately than C* to act optimally. The goal g can be reached in h* = 5 steps from s, so that Q*(s, g, a*) and Q*(s, g, a−) are very close in value, while C*(s, a*, g, h*) = 1 and C*(s, a−, g, h*) = 0.

We propose to address these limitations by learning horizon-aware policies π(a|s, g, h), which should be followed to reach goal g from state s in at most h steps. The introduction of a time horizon h naturally allows us to tune the speed/reliability trade-off, as an agent wishing to reach the goal faster should select a policy with a suitably small h value. To learn these policies, we introduce the optimal cumulative accessibility function C*(s, a, g, h). This is a generalization of the state-action value function and corresponds to the probability of reaching goal g from state s after at most h steps if action a is taken, and the agent acts optimally thereafter. Intuitively it is similar to the optimal Q-function, but Q-functions rarely correspond to probabilities, whereas the C*-function does so by construction. We derive Bellman backup update rules for C*, which allow it to be learned via minimization of unbiased estimates of the cross-entropy loss; this is in contrast to Q-learning, which optimizes biased estimates of the squared error. Policies π(a|s, g, h) can then be recovered from the C* function. We call our method cumulative accessibility estimation, or C-learning.
Pong et al. (2018) proposed TDMs, a method involving horizon-aware policies. We point out that their method is roughly related to a non-cumulative version of ours with a different loss that does not enable the speed/reliability trade-off and is ill-suited for sparse rewards. We include a detailed discussion of TDMs in section 4.

One might expect that adding an extra dimension to the learning task, namely h, would increase the difficulty, as C* effectively contains the information of several optimal Q-functions for different discount factors. However, we argue that C* does not need to be learned to the same degree of accuracy as the optimal Q-function for the agent to solve the task. As a result, learning C* is more efficient, and converges in fewer environmental interactions. This property, combined with our proposed goal sampling technique and replay buffer used during training, provides empirical improvements over Q-function based methods.

In addition to these advantages, learning C* is itself useful, containing information that the horizon-aware policies do not. It estimates whether a goal g is reachable from the current state s within h steps. In contrast, π(a|s, g, h) simply returns some action, even for unreachable goals. We show that C* can be used to determine reachability with examples in a nonholonomic environment.

Summary of contributions: (i) introducing C-functions and cumulative accessibility estimation for both discrete and continuous action spaces; (ii) highlighting the importance of the speed vs reliability trade-off in finite horizon reinforcement learning; (iii) introducing a novel replay buffer specially tailored for learning C* which builds on HER (Andrychowicz et al., 2017); and (iv) empirically showing the effectiveness of our method for goal-reaching as compared to existing alternatives, particularly in the context of complex motion planning tasks.

2 Background and Related Work
Let us extend the Markov Decision Process (MDP) formalism (Sutton et al., 1998) for goal-reaching. We consider a set of actions A, a state space S, and a goal set G. We assume access to a goal checking function G : S × G → {0, 1} such that G(s, g) = 1 if and only if state s achieves goal g. For example, achieving the goal could mean exactly reaching a certain state, in which case G = S and G(s, g) = 1(s = g). For many continuous state spaces, hitting a state exactly has zero probability. Here we can still take G = S, but let G(s, g) = 1(d(s, g) ≤ ε) for some radius ε and metric d. More general choices are possible. For example, in the Dubins' Car environment which we describe in more detail later, the state consists of both the location and orientation of the car: S = R² × S¹. We take G = R², and G(s, g) checks that the location of the car is within some small radius of g, ignoring the direction entirely. For a fixed g, G(s, g) can be thought of as a sparse reward function.

Figure 2: Graphical model depicting trajectories from P_{π(·|·,g,h)}(·|s_0 = s, a_0 = a). Gray nodes denote fixed values, and white nodes stochastic ones. Nodes a_0, g and s_0 are non-stochastic simply because they are conditioned on, not because they are always fixed within the environment. Note that the values of h decrease deterministically. Nodes corresponding to horizons could be separated from states, but are not for a more concise graph.

In the goal-reaching setting, a policy π : S × G → P(A), where P(A) denotes the set of distributions over A, maps state-goal pairs to an action distribution. The environment dynamics are given by a starting distribution p(s_0, g), usually taken as p(s_0)p(g), and transition probabilities p(s_{t+1}|s_t, a_t). States for which G(s, g) = 1 are considered terminal.

Q-Learning: A Q-function (Watkins & Dayan, 1992) for multi-goal reaching, Q^π : S × G × A → R, is defined by Q^π(s_t, g, a_t) = E_π[Σ_{i=t}^∞ γ^{i−t} G(s_i, g) | s_t, a_t], where γ ∈ [0, 1) is a discount factor and the expectation is with respect to state-action trajectories obtained by using π(a|s_i, g). If π* is an optimal policy in the sense that Q^{π*}(s, g, a) ≥ Q^π(s, g, a) for every π and (s, g, a) ∈ S × G × A, then Q^{π*} matches the optimal Q-function, Q*, which obeys the Bellman equation:

Q*(s, g, a) = E_{s' ∼ p(·|s,a)} [ G(s, g) + γ max_{a' ∈ A} Q*(s', g, a') ].   (1)

In deep Q-learning (Mnih et al., 2015), Q* is parameterized with a neural network and learning is achieved by enforcing the relationship from equation 1. This is done by minimizing Σ_i L(Q*(s_i, g_i, a_i), y_i), where y_i corresponds to the expectation in equation 1 and is estimated using a replay buffer of stored tuples (s_i, a_i, g_i, s'_i). Note that s'_i is the state the environment transitioned to after taking action a_i from state s_i, and determines the value of y_i. Typically L is chosen as a squared error loss, and the dependency of y_i on Q* is ignored for backpropagation in order to stabilize training. Once Q* is learned, the optimal policy is recovered by π*(a|s, g) = 1(a = arg max_{a'} Q*(s, g, a')).

There is ample work extending and improving upon deep Q-learning (Haarnoja et al., 2018). For example, Lillicrap et al.
(2015) extend it to the continuous action space setting, and Fujimoto et al. (2018) further stabilize training. These improvements are fully compatible with goal-reaching (Pong et al., 2019; Bharadhwaj et al., 2020a; Ghosh et al., 2019). Andrychowicz et al. (2017) proposed Hindsight Experience Replay (HER), which relabels past experience as achieved goals, and allows sample efficient learning from sparse rewards (Nachum et al., 2018).
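As a concrete illustration of the goal-checking functions G discussed in this section, the following minimal sketch shows the three variants mentioned above (exact match, an ε-ball, and a Dubins'-car style location-only check). The function names, the tolerance value, and the use of NumPy are assumptions for illustration only, not the paper's implementation.

```python
# Minimal sketches of goal-checking functions G(s, g); eps and names are assumed.
import numpy as np

def goal_check_discrete(s, g):
    """G(s, g) = 1(s = g): exact match for discrete state/goal spaces."""
    return float(s == g)

def goal_check_ball(s, g, eps=0.1, metric=np.linalg.norm):
    """G(s, g) = 1(d(s, g) <= eps) for continuous spaces."""
    return float(metric(np.asarray(s) - np.asarray(g)) <= eps)

def goal_check_location_only(s, g, eps=0.1):
    """Dubins'-car style check: s = (x, y, heading), g = (x, y); heading ignored."""
    return float(np.max(np.abs(np.asarray(s[:2]) - np.asarray(g))) <= eps)
```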
3 Cumulative Accessibility Functions

We now consider horizon-aware policies π : S × G × N → P(A), and define the cumulative accessibility function C^π(s, a, g, h), or C-function, as the probability of reaching goal g from state s in at most h steps by taking action a and following the policy π thereafter. By "following the policy π thereafter" we mean that after a, the next action a_1 is sampled from π(·|s_1, g, h−1), a_2 is sampled from π(·|s_2, g, h−2), and so on. See Figure 2 for a graphical model depiction of how these trajectories are obtained. Importantly, an agent need not always act the same way at a particular state in order to reach a particular goal, thanks to horizon-awareness. We use P_{π(·|·,g,h)}(·|s_0 = s, a_0 = a) to denote probabilities in which actions are drawn in this manner and transitions are drawn according to the environment p(s_{t+1}|s_t, a_t). More formally, C^π is given by:

C^π(s, a, g, h) = P_{π(·|·,g,h)} ( max_{t=0,...,h} G(s_t, g) = 1 | s_0 = s, a_0 = a ).   (2)

Proposition 1: C^π can be framed as a Q-function within the MDP formalism, and if π* is optimal in the sense that C^{π*}(s, a, g, h) ≥ C^π(s, a, g, h) for every π and (s, a, g, h) ∈ S × A × G × N, then C^{π*} matches the optimal C-function, C*, which obeys the following equation:

C*(s, a, g, h) = E_{s' ∼ p(·|s,a)} [ max_{a' ∈ A} C*(s', a', g, h−1) ]   if G(s, g) = 0 and h ≥ 1,
C*(s, a, g, h) = G(s, g)   otherwise.   (3)

See appendix A for a detailed mathematical proof of this proposition. The proof proceeds by first deriving a recurrence relationship that holds for any C^π. In an analogous manner to the Bellman equation in Q-learning, this recurrence involves an expectation over π(·|s', g, h−1), which, when replaced by a max, returns the recursion for C*.

Proposition 1 is relevant as it allows us to learn C*, enabling goal-reaching policies to be recovered:

π*(a|s, g, h) = 1( a = arg max_{a'} C*(s, a', g, h) ).   (4)

C* itself is useful for determining reachability. After maximizing over actions, it estimates whether a given goal is reachable from a state within some horizon. Comparing these probabilities for different horizons allows us to make a speed/reliability trade-off for reaching goals.

We observe that an optimal C*-function is non-decreasing in h, but this does not necessarily hold for non-optimal C-functions. For example, a horizon-aware policy could actively try to avoid the goal for high values of h, and the C^π-function constructed from it would show lower probabilities of success for larger h. See appendix A for a concrete example of this counter-intuitive behavior.

Proposition 2: C* is non-decreasing in h.

See appendix A for a detailed mathematical proof. Intuitively, the proof consists of showing that an optimal policy cannot exhibit the pathology mentioned above. Given an optimal policy π*(a|s, g, h) for a fixed horizon h we construct a policy π̃ for h + 1 which always performs at least as well, and lower bounds the performance of π*(a|s, g, h + 1).

In addition to being an elegant theoretical property, proposition 2 suggests that there is additional structure in a C* function which mitigates the added complexity from using horizon-aware policies.
Indeed, in our preliminary experiments we used a non-cumulative version of C-functions (see section 3.3) and obtained significantly improved performance upon changing to C-functions. Moreover, monotonicity in h could be encoded in the architecture of C* (Sill, 1998; Wehenkel & Louppe, 2019). However, we found that actively doing so hurt empirical performance (appendix F).
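To make the recursion in equation 3 concrete, the following sketch computes C* exactly by dynamic programming on a tiny randomly generated MDP with known transition probabilities, and checks Proposition 2 numerically. The environment, its sizes, and all names are illustrative assumptions rather than anything used in the paper.

```python
# Exact tabular backup of equation 3 on an assumed toy MDP with known dynamics.
import numpy as np

n_states, n_actions, H = 5, 2, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
goal = n_states - 1
G = (np.arange(n_states) == goal).astype(float)                   # G(s, g) for a fixed g

# C[h, s, a] = C*(s, a, g, h); base case (h = 0, or goal already reached) is G(s, g)
C = np.zeros((H + 1, n_states, n_actions))
C[0] = G[:, None]
for h in range(1, H + 1):
    best_next = C[h - 1].max(axis=1)               # max_{a'} C*(s', a', g, h-1)
    backup = P @ best_next                          # E_{s' ~ p(.|s,a)}[ ... ]
    C[h] = np.where(G[:, None] == 1, 1.0, backup)   # reached-goal states stay at 1

assert np.all(np.diff(C, axis=0) >= -1e-12)         # Proposition 2: non-decreasing in h
```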
3.1 Shortcomings of Q-learning

Before describing our method for learning C*, we highlight a shortcoming of Q-learning. Consider a 2D navigation environment where an agent can move deterministically in the cardinal directions, and fix s and g. For an optimal action a*, the optimal Q-function will achieve some value Q*(s, g, a*) ∈ [0, 1] in the sparse reward setting. Taking a sub-optimal action a− initially results in the agent taking two extra steps to reach the intended goal, given that the agent acts optimally after the first action, so that Q*(s, g, a−) = γ² Q*(s, g, a*). The value of γ is typically chosen close to 1 to ensure that future rewards are not too heavily discounted. As a consequence γ² ≈ 1, and thus the value of Q* at the optimal action is very close to its value at a sub-optimal action. We illustrate this issue in Figure 1b. In this scenario, recovering an optimal policy requires that the error between the learned Q-function and Q* should be at most (1 − γ²)/2; this is reflected empirically by Q-learning having high sample complexity and learning sub-optimal paths. This shortcoming surfaces in any environment where taking a sub-optimal action results in a slightly longer path than an optimal one, as in e.g. motion planning tasks.

The C* function does not have this shortcoming. Consider the same 2D navigation example, and let h* be the smallest horizon for which g can be reached from s. h* can be easily obtained from C* as the smallest h such that max_a C*(s, a, g, h) = 1. Again, denoting a* as an optimal action and a− as a sub-optimal one, we have that C*(s, a*, g, h*) = 1 whereas C*(s, a−, g, h*) = 0, which is illustrated in Figure 1b. Therefore, the threshold for error is much higher when learning the C* function. This property results in fewer interactions with the environment needed to learn C* and more efficient solutions.
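A quick back-of-the-envelope check of the two error margins discussed above, using an assumed example value of γ = 0.99 (the argument only requires γ close to 1):

```python
# Relative gap between Q* at the optimal and sub-optimal action, vs. the C* gap.
gamma = 0.99                      # assumed example value; the text only needs gamma close to 1
q_gap_fraction = 1 - gamma ** 2   # Q*(s,g,a-) = gamma^2 * Q*(s,g,a*), so the relative gap is 1 - gamma^2
print(q_gap_fraction)             # ~0.0199: errors above roughly (1 - gamma^2)/2 can flip the greedy action

c_gap = 1.0 - 0.0                 # C*(s, a*, g, h*) = 1 vs C*(s, a-, g, h*) = 0
print(c_gap)                      # 1.0: a far larger margin for error when learning C*
```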
3.2 Horizon-Independent Policies

Given a C*-function, equation 4 lets us recover a horizon-aware policy. At test time, using small values of h can achieve goals faster, while large values of h will result in safe policies. However, a natural question arises: how small is small and how large is large? In this section we develop a method to quantitatively recover reasonable values of h which adequately balance the speed/reliability trade-off. First, a safety threshold α ∈ (0, 1] is chosen, which indicates the percentage of the maximum value of C* we are willing to accept as safe enough. Smaller values of α will thus result in quicker policies, while larger values will result in safer ones. Then we consider a range of viable horizons, H, and find the maximal C* value, M(s, g) = max_{h ∈ H, a ∈ A} C*(s, a, g, h). We then compute:

h_α(s, g) = arg min_{h ∈ H} { max_{a ∈ A} C*(s, a, g, h) : max_{a ∈ A} C*(s, a, g, h) ≥ α M(s, g) },   (5)

and take π*_α(a|s, g) = 1( a = arg max_{a'} C*(s, a', g, h_α(s, g)) ). This procedure also allows us to recover horizon-independent policies from C* by using a fixed value of α, which makes comparing against horizon-unaware methods straightforward. We used horizon-unaware policies with added randomness as the behaviour policy when interacting with the environment for exploration.
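One way to read equation 5 in code, assuming a callable c_fn wrapping the learned C-function over a discrete action set (all names here are illustrative, and the smallest feasible horizon is returned, which coincides with equation 5 when C* is non-decreasing in h):

```python
# Minimal sketch of recovering h_alpha(s, g) from a learned C-function.
def horizon_for_threshold(c_fn, s, g, actions, horizons, alpha):
    values = {h: max(c_fn(s, a, g, h) for a in actions) for h in horizons}
    m = max(values.values())                           # M(s, g)
    feasible = [h for h in sorted(horizons) if values[h] >= alpha * m]
    return feasible[0]                                 # smallest acceptable horizon

# The horizon-independent policy then acts greedily with respect to
# C*(s, ., g, h_alpha(s, g)).
```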
3.3 Alternative Recursions

One could consider defining a non-cumulative version of the C-function, which we call an A-function for "accessibility" (not for "advantage"), yielding the probability of reaching a goal in exactly h steps. However, this version is more susceptible to pathological behaviors that hinder learning for certain environments. For illustration, consider an agent that must move one step at a time in the cardinal directions on a checkerboard. Starting on a dark square, the probability of reaching a light square in an even number of steps is always zero, but may be non-zero for odd numbers. An optimal A-function would fluctuate wildly as the step horizon h is increased, resulting in a harder target to learn. Nonetheless, A-functions admit a similar recursion to equation 3, which we include in appendix C for completeness. In any case, the C-function provides a notion of reachability which is more closely tied to related work (Savinov et al., 2018; Venkattaramanujam et al., 2019; Ghosh et al., 2018; Bharadhwaj et al., 2020a).

In Q-learning, discount factors close to 1 encourage safe policies, while discount factors close to 0 encourage fast policies. One could then also consider discount-conditioned policies π(a|s, g, γ) as a way to achieve horizon-awareness. In appendix D we introduce D-functions (for "discount"), D(s, a, g, γ), which allow recovery of discount-conditioned policies. However, D-functions suffer from the same limitation as Q-functions in that they need to be learned to a high degree of accuracy.

4 Cumulative Accessibility Estimation
Our training algorithm, which we call cumulative accessibility estimation (CAE), or C-learning, is detailed in algorithm 1. Similarly to Q-learning, the C* function can be modelled as a neural network with parameters θ, denoted C*_θ, which can be learned by minimizing:

− Σ_i [ y_i log C*_θ(s_i, a_i, g_i, h_i) + (1 − y_i) log(1 − C*_θ(s_i, a_i, g_i, h_i)) ],   (6)

where y_i corresponds to a stochastic estimate of the right-hand side of equation 3 (equation 18 in appendix B). The sum is over tuples (s_i, a_i, s'_i, g_i, h_i) drawn from a replay buffer which we specially tailor to successfully learn C*. Since C* corresponds to a probability, we change the usual squared loss to the binary cross entropy loss. Note that even if the targets y_i are not necessarily binary, the use of binary cross entropy is still justified as it is equivalent to minimizing the KL divergence between Bernoulli distributions with parameters y_i and C*_θ(s_i, a_i, g_i, h_i) (Bellemare et al., 2017). Since the targets used do not correspond to the right-hand side of equation 3 but to a stochastic estimate (through s'_i) of it, using this loss instead of the squared loss results in an unbiased estimate of Σ_i L(C*_θ(s_i, a_i, g_i, h_i), y_i^true), where y_i^true corresponds to the right-hand side of equation 3 at (s_i, a_i, g_i, h_i) (see appendix B for a detailed explanation). This confers another benefit of C-learning over Q-learning, as passing a stochastic estimate of equation 1 through the typical squared loss results in biased estimators. As in Double Q-learning (Van Hasselt et al., 2015), we use a second network C*_θ' to compute the y_i targets, and do not minimize equation 6 with respect to θ'. We periodically update θ' to match θ. We now explain our algorithmic details.
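Before turning to the full procedure in Algorithm 1, the following is a minimal PyTorch-style sketch of a single update with the loss in equation 6, using a target network for the y_i as described above. The network interfaces, the discrete action set, and the batch layout are assumptions for illustration, not the paper's implementation.

```python
# One C-learning update: BCE against stochastic targets built from a target network.
import torch
import torch.nn.functional as F

def c_learning_update(c_net, c_target, optimizer, batch, actions):
    # c_net / c_target: assumed modules mapping (s, a, g, h) to a probability in (0, 1)
    s, a, g, h, s_next, goal_reached = batch           # goal_reached = G(s, g)
    with torch.no_grad():                               # targets are not back-propagated through
        next_vals = torch.stack(
            [c_target(s_next, a_next, g, (h - 1).clamp(min=0)) for a_next in actions], dim=-1)
        y = torch.where((goal_reached == 0) & (h >= 1),
                        next_vals.max(dim=-1).values,
                        goal_reached.float())           # stochastic estimate of equation 3
    pred = c_net(s, a, g, h)
    loss = F.binary_cross_entropy(pred, y)              # equation 6, averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```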
Algorithm 1: Training C-learning Network
Parameter: N_explore: Number of random exploration episodes
Parameter: N_GD: Number of goal-directed episodes
Parameter: N_train: Number of batches to train on per goal-directed episode
Parameter: N_copy: Number of batches between target network updates
Parameter: α: Learning rate

1:  θ ← Initial weights for C*_θ
2:  θ' ← θ                                    // Copy weights to target network
3:  D ← []                                    // Initialize experience replay buffer
4:  n_b ← 0                                   // Counter for training batches
5:  repeat N_explore times
6:      E ← get_rollout(π_random)             // Do a random rollout
7:      D.append(E)                           // Save the rollout to the buffer
8:  repeat N_GD times
9:      g ← goal_sample(n_b)                  // Sample a goal
10:     E ← get_rollout(π_behavior)           // Try to reach the goal
11:     D.append(E)                           // Save the rollout
12:     repeat N_train times
13:         if n_b mod N_copy = 0 then
14:             θ' ← θ                        // Copy weights periodically
15:         B := {s_i, a_i, s'_i, g_i, h_i}_{i=1}^{|B|} ← sample_batch(D)   // Sample a batch
16:         {y_i}_{i=1}^{|B|} ← get_targets(B, θ')                          // Estimate RHS of equation 3
17:         L̂ ← −(1/|B|) Σ_{i=1}^{|B|} [ y_i log C*_θ(s_i, a_i, g_i, h_i) + (1 − y_i) log(1 − C*_θ(s_i, a_i, g_i, h_i)) ]
18:         θ ← θ − α ∇_θ L̂                   // Update weights
19:         n_b ← n_b + 1                     // Trained one batch

Reachability-guided sampling: When sampling a batch (line 15 of algorithm 1), for each i we first sample an episode from D, and then a transition (s_i, a_i, s'_i) from that episode. We sample h_i favoring smaller h's at the beginning of training. We achieve this by selecting h_i = h with probability proportional to h^(−κ n_GD / N_GD), where n_GD is the current episode number, N_GD the total number of episodes, and κ is a hyperparameter. For deterministic environments, we follow HER (Andrychowicz et al., 2017) and select g_i uniformly at random from the states observed in the episode after s_i. For stochastic environments we use a slightly modified version, which addresses the bias incurred by HER (Matthias et al., 2018) in the presence of stochasticity. All details are included in appendix G.

Extension to continuous action spaces: Note that constructing the targets y_i requires taking a maximum over the action space. While this is straightforward in discrete action spaces, it is not so for continuous actions. Lillicrap et al. (2015) proposed DDPG, a method enabling deep Q-learning in continuous action spaces. Similarly, Fujimoto et al. (2018) proposed TD3, a method for further stabilizing Q-learning. We note that C-learning is compatible with the ideas underpinning both of these methods. We present a TD3 version of C-learning in appendix E.

We finish this section by highlighting differences between C-learning and related work. Ghosh et al. (2019) proposed GCSL, a method for goal-reaching inspired by supervised learning. In their derivations, they also include a horizon h which their policies can depend on, but they drop this dependence in their experiments as they did not see a practical benefit from including h. We find the opposite for C-learning. TDMs (Pong et al., 2018) use horizon-aware policies and a similar recursion to ours. In practice they use a negative distance reward, which significantly differs from our goal-reaching indicators. This is an important difference as TDMs operate under dense rewards, while we use sparse rewards, making the problem significantly more challenging. Additionally, distance between states in nonholonomic environments is very poorly described by such a distance, resulting in TDMs being ill-suited for motion planning tasks.
We also highlight that even if the reward in TDMs was swapped with our goal checking function G, the resulting objective would be much closer to the non-cumulative version of C-learning presented in appendix C. TDMs recover policies from their horizon-aware Q-function in a different manner to ours, not allowing a speed vs reliability trade-off. The recursion presented for TDMs is used by definition, whereas we have provided a mathematical derivation of the Bellman equation for C-functions along with proofs of additional structure. As a result of the C-function being a probability we use a different loss function (binary cross-entropy) which results in unbiased estimates of the objective's gradients. Finally, we point out that TDMs sample horizons uniformly at random, which differs from our specifically tailored replay buffer.

Figure 3: Experimental environments, from left to right: frozen lake, Dubins' car, FetchPickAndPlace-v1 and HandManipulatePenFull-v0. Arrows represent actions (only their direction should be considered; their magnitude is not representative of the distance travelled by an agent taking an action). See text for description.
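Before turning to the experiments, here is a minimal sketch of the reachability-guided horizon sampling described in section 4; the horizon range, the value of κ, and the use of NumPy are illustrative assumptions.

```python
# Horizons drawn with probability proportional to h ** (-kappa * n_gd / N_gd).
import numpy as np

def sample_horizon(n_gd, N_gd, h_max=50, kappa=2.0, rng=np.random.default_rng()):
    horizons = np.arange(1, h_max + 1)
    weights = horizons.astype(float) ** (-kappa * n_gd / N_gd)
    return rng.choice(horizons, p=weights / weights.sum())
```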
5 Experiments

5.1 Setup
Our experiments are designed to show that (i) C-learning enables the speed vs reliability trade-off, (ii) the C* function recovered through C-learning meaningfully matches reachability in nonholonomic environments, and (iii) C-learning scales to complex and high-dimensional motion planning tasks, resulting in improved sample complexity and goal-reaching. We use success rate (percentage of trajectories which reach the goal) and path length (average, over trajectories and goals, number of steps needed to achieve the goal among successful trajectories) as metrics.

We compare C-learning against GCSL (Ghosh et al., 2019) (for discrete action spaces only) and deep Q-learning with HER (Andrychowicz et al., 2017) across several environments, which are depicted in Figure 3. All experimental details and hyperparameters are given in appendix G. We also provide ablations, and comparisons against TDMs and the alternative recursions from section 3.3, in appendix F. We evaluate C-learning on the following domains:

1. Frozen lake is a 2D navigation environment where the agent must navigate without falling in the holes (dark blue). Both the state space (a grid containing two rectangular holes) and the action space are discrete. Falling in a hole terminates an episode. The agent's actions correspond to intended directions. The agent moves in the intended direction with some fixed probability, and in each perpendicular direction with a smaller probability. Moving against the boundaries of the environment has no effect. We take G = S and G(s, g) = 1(g = s).

2. Dubins' car is a more challenging deterministic 2D navigation environment, where the agent drives a car which cannot turn by more than 10° per step. The states, with spatial coordinates in [0, 1]², are continuous and include the direction of the car. There are 7 actions: the combinations of {left 10°, straight, right 10°} × {forward 1, reverse 1}, and the option to not move. As such, the set of reachable states is not simply a ball around the current state. The environment also has walls through which the car cannot drive. We take G = [0, 1]², and the goal is considered to be reached when the car is within a small L∞ distance of the goal, regardless of its orientation.

3. FetchPickAndPlace-v1 (Brockman et al., 2016) is a complex, higher-dimensional environment in which a robotic arm needs to pick up a block and move it to the goal location. Goals are defined by their 3-dimensional coordinates. The state space is 25-dimensional, and the action space is continuous and 4-dimensional.

4. HandManipulatePenFull-v0 (Brockman et al., 2016) is a realistic environment known to be a difficult goal-reaching problem, where deep Q-learning with HER shows very limited success (Plappert et al., 2018). The environment has a continuous action space of dimension 20, a 61-dimensional state space, and 7-dimensional goals. Note that we are considering the more challenging version of the environment where both target location and orientation are chosen randomly.

Figure 4: (a) Multimodal policies recovered by C-learning in frozen lake for different values of h for reaching (G) from (S) (top); unimodal policies recovered by GCSL and HER (bottom). (b) Heatmaps over the goal space of max_a C*(s, a, g, h) with a fixed s and h for Dubins' car, with h = 7 (top), h = 15 (middle) and h = 25 (bottom). (c) Trajectories learned by C-learning (top), GCSL (middle) and HER (bottom) for Dubins' car.
(d) Success rate throughout training for Dubins' car (top), FetchPickAndPlace-v1 (middle) and HandManipulatePenFull-v0 (bottom) for C-learning (blue), HER (red) and GCSL (green).

5.2 Results
Speed vs reliability trade-off: We use frozen lake to illustrate this trade-off. At test time, we set the starting state and goal as shown in Figure 4a. Notice that, given enough time, an agent can reach the goal with near certainty by going around the holes on the right side of the lake. However, if the agent is time-constrained, the optimal policy must accept the risk of falling in a hole. We see that C-learning does indeed learn both the risky and safe policies. Other methods, as previously explained, can only learn one. To avoid re-plotting the environment for every horizon, we have plotted arrows in Figure 4a corresponding to arg max_a π*(a|s_0, g, h), arg max_a π*(a|s_1, g, h−1), arg max_a π*(a|s_2, g, h−2), and so on, where s_{t+1} is obtained from the previous state and action while ignoring the randomness in the environment (i.e. s_{t+1} = arg max_s p(s|s_t, a_t)). In other words, we are plotting the most likely trajectory of the agent. When given the minimal amount of time to reach the goal (h = 6), the CAE agent correctly learns to accept the risk of falling in a hole by taking the direct path. When given four times as long (h = 24), the agent takes a safer path by going around the hole. Notice that the agent makes a single mistake at the upper right corner, which is quickly corrected when the value of h is decreased. On the other hand, we can see on the bottom panel that both GCSL and HER recover a single policy, thus not enabling the speed vs reliability trade-off. Surprisingly, GCSL learns to take the high-risk path, despite Ghosh et al. (2019)'s intention to incentivize paths that are guaranteed to reach the goal.

Reachability learning: To demonstrate that we are adequately learning reachability, we removed the walls from Dubins' car and further restricted the turning angle, thus making the dumbbell shape of the true reachable region more extreme. In Figure 4b, we show that our learned C* function correctly learns which goals are reachable from which states for different time horizons in Dubins' car: not only is the learned C* function increasing in h, but the shapes are as expected. None of the competing alternatives recover this information in any way, and thus comparisons are not available. As previously mentioned, the optimal C*-function in this task is not trivial. Reachability is defined by the "geodesics" in S = [0, 1]² × S¹ that are constrained by the finite turning radius, and thus follow a more intricate structure than a Euclidean ball.

Table 1: Comparison of C-learning against relevant benchmarks in three environments, averaged across five random seeds. Runs in bold are either the best on the given metric in that environment, or have a mean score within the error bars of the best. The table reports success rate and path length for CAE, GCSL, and HER on Dubins' Car, and for CAE (TD3) and HER (TD3) on FetchPickAndPlace-v1 and HandManipulatePenFull-v0.
Motion planning and goal-reaching: For a qualitative comparison of performance for goal-reaching, Figure 4c shows trajectories for C-learning and competing alternatives for Dubins' car. We can see that C-learning finds the optimal route, which is a combination of forward and backward movement, while other methods find inefficient paths. We also evaluated our method on challenging goal reaching environments, and observed that C-learning achieves state-of-the-art results both in sample complexity and success rate. Figure 4d shows that C-learning is able to learn considerably faster in the FetchPickAndPlace-v1 environment. More importantly, C-learning achieves an absolute improvement in its success rate over the current state-of-the-art algorithm HER (TD3) in the HandManipulatePenFull-v0 environment. Quantitative results are shown in Table 1. On Dubins' car, we also note that C-learning ends up with a smaller final L∞ distance to the goal than GCSL.

6 Discussion
We have shown that C-learning enables simultaneous learning of how to reach goals quickly and how to reach them safely, which current methods cannot do. We point out that reaching the goal safely in our setting means doing so at test time, and is different to what is usually considered in the safety literature, where safe exploration is desired (Chow et al., 2018; Bharadhwaj et al., 2020b). Additionally, learning C* effectively learns reachability within an environment, and could thus naturally lend itself to incorporation into other frameworks, for example, in goal setting for hierarchical RL tasks (Nachum et al., 2018) where intermediate, reachable goals need to be selected sequentially. We believe further investigations on using C-functions in safety-critical environments requiring adaptation (Peng et al., 2018; Zhang et al., 2020), or for hierarchical RL, are promising directions for further research.

We have also argued that C-functions are more tolerant of errors during learning than Q-functions, which increases sample efficiency. This is verified empirically in that C-learning is able to solve goal-reaching tasks earlier on in training than Q-learning.

We finish by noticing a parallel between C-learning and the options framework (Sutton et al., 1999; Bacon et al., 2017), which introduces temporal abstraction and allows agents to not always follow the same policy when at a given state s. However, our work does not fit within this framework, as options evolve stochastically and new options are selected according only to the current state, while horizons evolve deterministically and depend on the previous horizon only, not the state. Additionally, and unlike C-learning, nothing encourages different options to learn safe or quick policies, and there is no reachability information contained in options.

7 Conclusions
In this paper we introduced C-learning, a Q-learning-inspired method for goal-reaching. Unlike previous approaches, we propose the use of horizon-aware policies, and show that not only can these policies be tuned to reach the goal faster or more reliably, but they also outperform horizon-unaware approaches for goal-reaching on complex motion planning tasks. We hope our method will inspire further research into horizon-aware policies and their benefits.

References
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Homanga Bharadhwaj, Animesh Garg, and Florian Shkurti. Leaf: Latent exploration along the frontier. arXiv preprint arXiv:2005.10934, 2020a.

Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, and Animesh Garg. Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497, 2020b.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8092–8101, 2018.

Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

Dibya Ghosh, Abhishek Gupta, and Sergey Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.

Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. arXiv preprint arXiv:1912.06088, 2019.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Plappert Matthias, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313, 2018.

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE, 2018.

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.

Ahmed Qureshi, Yinglong Miao, Anthony Simeonov, and Michael Yip. Motion planning networks: Bridging the gap between learning-based and classical motion planners. arXiv preprint arXiv:1907.06013, 2020.

Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.

Joseph Sill. Monotonic networks. In Advances in Neural Information Processing Systems, pp. 661–667, 1998.

Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

Srinivas Venkattaramanujam, Eric Crawford, Thang Doan, and Doina Precup. Self-supervised learning of distance functions for goal-conditioned reinforcement learning. arXiv preprint arXiv:1907.02998, 2019.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Antoine Wehenkel and Gilles Louppe. Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, pp. 1545–1555, 2019.

Jesse Zhang, Brian Cheung, Chelsea Finn, Sergey Levine, and Dinesh Jayaraman. Cautious adaptation for reinforcement learning in safety-critical settings. arXiv preprint arXiv:2008.06622, 2020.
A Proofs of Propositions
Proposition 1: C^π can be framed as a Q-function within the MDP formalism, and if π* is optimal in the sense that C^{π*}(s, a, g, h) ≥ C^π(s, a, g, h) for every π and (s, a, g, h) ∈ S × A × G × N, then C^{π*} matches the optimal C-function, C*, which obeys the following equation:

C*(s, a, g, h) = E_{s' ∼ p(·|s,a)} [ max_{a' ∈ A} C*(s', a', g, h−1) ]   if G(s, g) = 0 and h ≥ 1,
C*(s, a, g, h) = G(s, g)   otherwise.

Proof: First, we frame C-functions within the MDP formalism. Consider a state space given by S' = S × G × N, where states corresponding to reached goals (i.e. G(s, g) = 1) or with last coordinate equal to 0 (i.e. h = 0) are considered terminal. The dynamics in this environment are given by:

p(s_{t+1}, g_{t+1}, h_{t+1} | s_t, g_t, h_t, a_t) = p(s_{t+1} | s_t, a_t) 1(g_{t+1} = g_t) 1(h_{t+1} = h_t − 1).   (7)

That is, states evolve according to the original dynamics, while goals remain unchanged and horizons decrease by one at every step. The initial distribution is given by:

p(s_0, g_0, h_0) = p(s_0) p(g_0, h_0 | s_0),   (8)

where p(s_0) corresponds to the original starting distribution and p(g_0, h_0 | s_0) needs to be specified beforehand. Together, the initial distribution and the dynamics, along with a policy π(a|s, g, h), properly define a distribution over trajectories, and again, we use P_{π(·|·,g,h)} to denote probabilities with respect to this distribution. The reward function r : S' × A → R is given by:

r(s, a, g, h) = G(s, g).   (9)

By taking the discount factor to be 1, and since we take states for which G(s, g) = 1 to be terminal, the return (sum of rewards) over a trajectory (s_0, s_1, ...) with h_0 = h is given by:

max_{t=0,...,h} G(s_t, g).   (10)

For notational simplicity we take G(s_t, g) = 0 whenever t is greater than the length of the trajectory to properly account for terminal states. Since the return is binary, its expectation matches its probability of being equal to 1, so that indeed, the Q-functions in this MDP correspond to C^π:

C^π(s, a, g, h) = P_{π(·|·,g,h)} ( max_{t=0,...,h} G(s_t, g) = 1 | s_0 = s, a_0 = a ).

Now, we derive the Bellman equation for our C-functions. Trivially:

C^π(s, a, g, h) = G(s, g),   (11)

whenever G(s, g) = 1 or h = 0. For the rest of the derivation, we assume that G(s, g) = 0 and h ≥ 1. We note that the probability of reaching g from s in at most h steps is given by the probability of reaching it in exactly one step, plus the probability of not reaching it in the first step and reaching it in at most h − 1 steps thereafter. Formally:

C^π(s, a, g, h)   (12)
= C^π(s, a, g, 1) + E_{s' ∼ p(·|s,a)} [ (1 − G(s', g)) E_{a' ∼ π(·|s',g,h−1)} [ C^π(s', a', g, h−1) ] ]
= E_{s' ∼ p(·|s,a)} [ G(s', g) ] + E_{s' ∼ p(·|s,a)} [ (1 − G(s', g)) E_{a' ∼ π(·|s',g,h−1)} [ C^π(s', a', g, h−1) ] ]
= E_{s' ∼ p(·|s,a)} [ E_{a' ∼ π(·|s',g,h−1)} [ C^π(s', a', g, h−1) ] ],

where the last equality follows from the fact that C^π(s', a', g, h−1) = 1 whenever G(s', g) = 1. Putting everything together, we obtain the Bellman equation for C^π:

C^π(s, a, g, h) = E_{s' ∼ p(·|s,a)} [ E_{a' ∼ π(·|s',g,h−1)} [ C^π(s', a', g, h−1) ] ]   if G(s, g) = 0 and h ≥ 1,
C^π(s, a, g, h) = G(s, g)   otherwise.   (13)

Recall that the optimal policy is defined by the fact that C* ≥ C^π for any arguments and for any policy π. We can see by maximizing equation 13 that, given the C*(·, ·, ·, h−1) values, the optimal policy values at horizon h−1 must be
π*(a|s, g, h−1) = 1( a = arg max_{a'} C*(s, a', g, h−1) ). We thus obtain:

C*(s, a, g, h) = E_{s' ∼ p(·|s,a)} [ max_{a' ∈ A} C*(s', a', g, h−1) ]   if G(s, g) = 0 and h ≥ 1,
C*(s, a, g, h) = G(s, g)   otherwise.

Note that, as in Q-learning, the Bellman equation for the optimal policy has been obtained by replacing the expectation with respect to a' with a max.

Proposition 2: C* is non-decreasing in h.

As mentioned in the main manuscript, one might naively think that this monotonicity property should hold for C^π for any π. However this is not quite the case, as π depends on h and a perverse policy may actively avoid the goal for large values of h. Restricting to optimal C*-functions, such pathologies are avoided. As an example, consider an environment with three states, {0, 1, 2}, and two actions {−1, +1}. The transition rule is s_{t+1} = max(0, min(2, s_t + a_t)), that is, the agent moves deterministically in the direction of the action unless doing so would move it out of the domain. Goals are defined as states. Let π be such that π(a|s = 1, g = 2, h = 1) = 1(a = 1) and π(a|s = 1, g = 2, h = 2) = 1(a = −1). While clearly a terrible policy, π is such that C^π(s = 0, a = 1, g = 2, h = 2) = 1 and C^π(s = 0, a = 1, g = 2, h = 3) = 0, so that C^π can decrease with h.

Proof: Fix a distribution p on the action space. For any policy π(a|s, g, h), we define a new policy π̃ as:

π̃(a|s, g, h + 1) = π(a|s, g, h) if h > 0, and π̃(a|s, g, h + 1) = p(a) otherwise.   (14)

The new policy π̃ acts the same as π for the first h steps and on the last step it samples an action from the fixed distribution p. The final step can only increase the cumulative probability of reaching the goal, therefore:

C^π̃(s, a, g, h + 1) ≥ C^π(s, a, g, h).   (15)

Since equation 15 holds for all policies π, taking the maximum over policies gives:

C*(s, a, g, h + 1) ≥ max_π C^π̃(s, a, g, h + 1) ≥ max_π C^π(s, a, g, h) = C*(s, a, g, h).   (16)
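The counterexample and Proposition 2 can be checked numerically on this three-state chain. The sketch below assumes, purely for concreteness, that the perverse policy takes action +1 in every case not specified above; this choice does not affect the quantities of interest.

```python
# Verify the counterexample (C^pi can decrease in h) and Proposition 2 (C* cannot).
def step(s, a):
    return max(0, min(2, s + a))  # deterministic transition rule

GOAL = 2

def pi(s, h):
    # Perverse policy from the counterexample (goal fixed to g = 2).
    if s == 1 and h == 1:
        return +1
    if s == 1 and h == 2:
        return -1
    return +1  # assumption: +1 everywhere the text leaves the policy unspecified

def C_pi(s, a, h):
    # Probability (0 or 1 here, everything is deterministic) of reaching the goal
    # within h steps, taking action a first and following pi thereafter.
    if s == GOAL:
        return 1.0
    if h == 0:
        return 0.0
    s_next = step(s, a)
    if s_next == GOAL:
        return 1.0
    return C_pi(s_next, pi(s_next, h - 1), h - 1)

def C_star(s, a, h):
    # Optimal C-function via the recursion of Proposition 1.
    if s == GOAL:
        return 1.0
    if h == 0:
        return 0.0
    s_next = step(s, a)
    return max(C_star(s_next, a_next, h - 1) for a_next in (-1, +1))

print(C_pi(0, +1, 2), C_pi(0, +1, 3))          # 1.0 0.0 -- C^pi decreases in h
print([max(C_star(0, a, h) for a in (-1, +1)) for h in range(1, 5)])
# [0.0, 1.0, 1.0, 1.0] -- C* is non-decreasing in h
```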
NBIASEDNESS OF THE CROSS ENTROPY LOSS
For given s i , a i , g i and h i from the replay buffer, we denote the right-hand side of equation 2 as y truei : y truei = E s (cid:48) ∼ p ( ·| s i ,a i ) (cid:20) max a (cid:48) ∈A C ∗ θ ( s (cid:48) , a (cid:48) , g i , h i − (cid:21) if G ( s i , g i ) = 0 and h i ≥ ,G ( s i , g i ) otherwise . (17)Note that y truei cannot be evaluated exactly, as the expectation over s (cid:48) would require knowledge ofthe environment dynamics to compute in closed form. However, using s (cid:48) i , the next state after s i inthe replay buffer, we can obtain a single-sample Monte Carlo estimate of y truei , y i as: y i = (cid:40) max a (cid:48) ∈A C ∗ θ ( s (cid:48) i , a (cid:48) , g i , h i − if G ( s i , g i ) = 0 and h i ≥ ,G ( s i , g i ) otherwise . (18)Clearly y i is an unbiased estimate of y truei . However, the optimization objective is given by: (cid:88) i L ( C ∗ θ ( s i , a i , g i , h i ) , y truei ) (19)13ublished as a conference paper at ICLR 2021where the sum is over tuples in the replay buffer and L is the loss being used. Simply replacing y truei with its estimate y i , while commonly done in Q -learning, will in general result in a biasedestimate of the loss: (cid:88) i L ( C ∗ θ ( s i , a i , g i , h i ) , y truei ) = (cid:88) i L ( C ∗ θ ( s i , a i , g i , h i ) , E s (cid:48) i | s i ,a i [ y i ]) (cid:54) = (cid:88) i E s (cid:48) i | s i ,a i [ L ( C ∗ θ ( s i , a i , g i , h i ) , y i )] (20)since in general the expectation of a function need not match the function of the expectation. Inother words, pulling the expectation with respect to s (cid:48) i outside of the loss will in general incur inbias. However, if L is linear in its second argument, as is the case with binary cross entropy but notwith the squared loss, then one indeed recovers: (cid:88) i L ( C ∗ θ ( s i , a i , g i , h i ) , y truei ) = (cid:88) i E s (cid:48) i | s i ,a i [ L ( C ∗ θ ( s i , a i , g i , h i ) , y i )] (21)so that replacing y truei with y i does indeed recover an unbiased estimate of the loss. C N ON - CUMULATIVE CASE
C Non-Cumulative Case

We originally considered a non-cumulative version of the C-function, giving the probability of reaching the goal in exactly h steps. We call this function the accessibility function, A^π (despite our notation, this is unrelated to the commonly used advantage function), defined by:

A^π(s, a, g, h) = P_{π(·|·,g,h)} ( G(s_h, g) = 1 | s_0 = s, a_0 = a ).   (22)

Here the trivial case is only when h = 0, where we have:
A^π(s, a, g, 0) = G(s, g).   (23)

If h ≥ 1, we can obtain a similar recursion to that of the C-functions. Here we no longer assume that states which reach the goal are terminal. The probability of reaching the goal in exactly h steps is equal to the probability of reaching it in h − 1 steps after taking the first step. After the first action a, subsequent actions are sampled from the policy π:

A^π(s, a, g, h) = P_{π(·|·,g,h)} ( G(s_h, g) = 1 | s_0 = s, a_0 = a )
= E_{s' ∼ p(·|s,a)} [ E_{a' ∼ π(·|s',g,h−1)} [ P_{π(·|·,g,h)} ( G(s_h, g) = 1 | s_1 = s', a_1 = a' ) ] ]
= E_{s' ∼ p(·|s,a)} [ E_{a' ∼ π(·|s',g,h−1)} [ P_{π(·|·,g,h−1)} ( G(s_{h−1}, g) = 1 | s_0 = s', a_0 = a' ) ] ]
= E_{s' ∼ p(·|s,a)} [ E_{a' ∼ π(·|s',g,h−1)} [ A^π(s', a', g, h−1) ] ].   (24)

By the same argument we employed in proposition 1, the recursion for the optimal A-function, A*, is obtained by replacing the expectation with respect to a' with a max. Putting this together with the base case, we have:

A*(s, a, g, h) = E_{s' ∼ p(·|s,a)} [ max_{a' ∈ A} A*(s', a', g, h−1) ]   if h ≥ 1,
A*(s, a, g, h) = G(s, g)   otherwise.   (25)

Note that this recursion is extremely similar to that of C*, and differs only in the base cases for the recursion. The difference is empirically relevant, however, for the two reasons that make C* easier to learn: C* is monotonic in h, and the cumulative probability is more well-behaved generally.
D Discounted Case
Another approach which would allow a test-time trade-off between speed and reliability is to learn the following D-function (D standing for "discounted accessibility"):

D^π(s, a, g, γ) = E_π [ γ^{T−1} | s_0 = s, a_0 = a ],   (26)

where the random variable T is the smallest positive number such that G(s_T, g) = 1. If no such state occurs during an episode then γ^{T−1} is interpreted as zero. This is the discounted future return of an environment in which satisfying the goal returns a reward of 1 and terminates the episode. We may derive a recursion relation for this formalism too:

D^π(s, a, g, γ) = E_{s' ∼ p(·|s,a)} [ G(s', g) + γ (1 − G(s', g)) E_{a' ∼ π(·|s')} [ E_π [ γ^{T−1} | s', a' ] ] ]
= E_{s' ∼ p(·|s,a)} [ G(s', g) + γ (1 − G(s', g)) E_{a' ∼ π(·|s')} [ D^π(s', a', g, γ) ] ].   (27)

By the same argument as employed for the C and A functions, the D-function of the optimal policy is obtained by replacing the expectation over actions with a max, giving:

D*(s, a, g, γ) = E_{s' ∼ p(·|s,a)} [ G(s', g) + γ (1 − G(s', g)) max_{a'} D*(s', a', g, γ) ].   (28)

Learning such a D-function would allow a test-time trade-off between speed and reliability, and might well be more efficient than training independent models for different values of γ. We did not pursue this experimentally for two reasons. Firstly, our initial motivation was to allow some higher-level operator (either human or a controlling program) to trade off speed for reliability, and a hard horizon is usually more interpretable than a discounting factor. Secondly, we noticed that C-learning performs strongly at goal-reaching in deterministic environments, and we attribute this to the hard horizon allowing the optimal policy to be executed even with significant errors in the C-value. Conversely, Q-learning can typically only tolerate small errors before the actions selected become sub-optimal. D-learning would suffer from the same issue.
E Continuous Action Space Case
As mentioned in the main manuscript, C-learning is compatible with the ideas that underpin DDPG and TD3 (Lillicrap et al., 2015; Fujimoto et al., 2018), allowing it to be used in continuous action spaces. This requires the introduction of another neural network approximating a deterministic policy, μ_φ : S × G × N → A, which is intended to return μ(s, g, h) = arg max_a C*(s, a, g, h). This network is trained alongside C*_θ as detailed in Algorithm 2.

We do point out that the procedure to obtain a horizon-agnostic policy, and thus π_behavior, is also modified from the discrete version. Here, M(s, g) = max_{h ∈ H} C*(s, μ(s, g, h), g, h), and:

h_γ(s, g) = arg min_{h ∈ H} { C*(s, μ(s, g, h), g, h) : C*(s, μ(s, g, h), g, h) ≥ γ M(s, g) },   (29)

where we now take π*_γ(a|s, g) as a point mass at μ(s, g, h_γ(s, g)).
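A minimal PyTorch-style sketch of the actor and target-network updates from Algorithm 2; the network interfaces are assumptions, and the soft update follows the convention as written in Algorithm 2, where τ weights the old target parameters.

```python
# Actor update: train mu_phi to maximize the first critic's probability estimate.
import torch

def actor_update(mu, c1, actor_optimizer, s, g, h):
    actor_loss = -c1(s, mu(s, g, h), g, h).mean()   # ascend on C*_theta1(s, mu(s,g,h), g, h)
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float) -> None:
    # theta' <- (1 - tau) * theta + tau * theta', as written in Algorithm 2
    for p_t, p_s in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(tau).add_((1 - tau) * p_s.data)
```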
F Additional Experiments
In this section we study the performance of C-learning across different types of goals. We first evaluate C-learning for the Mini maze and Dubins' car environments, and partition their goal space into easy, medium and hard as shown in Figure 5. We run experiments for several choices of random seed in each environment and take the average for each metric and goal stratification, plus/minus the standard deviation across runs.

The Mini maze results are shown in Table 2. We see that C-learning beats GCSL on success rate, although only marginally, with the bulk of the improvement observed on the reaching of hard goals. Both methods handily beat HER on success rate, and note that the path length results in the HER section are unreliable because very few runs reach the goal.

The Dubins' car results are shown in Table 3. We see that C-learning is the clear winner here across the board, with GCSL achieving a somewhat close success rate, and HER ending up with paths which are almost as efficient. Note that this result is the quantitative counterpart to the qualitative trajectory visualizations from Figure 4c. As mentioned in the main manuscript, we also tried explicitly enforcing monotonicity of our C-functions using the method of Sill (1998), but we obtained much lower success rates in Dubins' car when doing so.

We also compare C-learning to naive horizon-aware Q-learning, where instead of sampling h as in C-learning, we consider it as part of the state space and thus sample it along with the state in the replay buffer. Results are shown in the left panel of Figure 6. We can see that our sampling of h achieves the same performance in roughly half the time. Additionally, we compare against sampling h uniformly in the right panel of Figure 6, and observe similar results.
Algorithm 2: Training C-learning (TD3) Version

Parameter: N_explore : Number of random exploration episodes
Parameter: N_GD : Number of goal-directed episodes
Parameter: N_train : Number of batches to train on per goal-directed episode
Parameter: N_copy : Number of batches between target network updates
Parameter: α : Learning rate

θ_1, θ_2 ← Initial weights for C*_{θ_1}, C*_{θ_2}
θ'_1, θ'_2 ← θ_1, θ_2                                   // Copy weights to target networks
φ ← Initial weights for µ_φ
φ' ← φ                                                  // Copy weights to target actor
D ← []                                                  // Initialize experience replay buffer
n_b ← 0                                                 // Counter for training batches
repeat N_explore times
    E ← get_rollout(π_random)                           // Do a random rollout
    D.append(E)                                         // Save the rollout to the buffer
repeat N_GD times
    g ← goal_sample(n_b)                                // Sample a goal
    E ← get_rollout(π_behavior)                         // Try to reach the goal
    D.append(E)                                         // Save the rollout
    repeat N_train times
        B := {s_i, a_i, s'_i, g_i, h_i}_{i=1}^{|B|} ← sample_batch(D)      // Sample a batch
        {y_i}_{i=1}^{|B|} ← get_targets(B, θ'_1, θ'_2)                      // Estimate RHS of equation 18
        for j ∈ {1, 2}:
            L̂_j ← −(1/|B|) Σ_i [ y_i log C*_{θ_j}(s_i, a_i, g_i, h_i) + (1 − y_i) log(1 − C*_{θ_j}(s_i, a_i, g_i, h_i)) ]
            θ_j ← θ_j − α ∇_{θ_j} L̂_j                   // Update critic weights
        if n_b mod policy_delay = 0 then
            L̂_actor ← −(1/|B|) Σ_i C*_{θ_1}(s_i, µ_φ(s_i, g_i, h_i), g_i, h_i)
            φ ← φ − α ∇_φ L̂_actor                       // Update actor weights
            θ'_j ← θ_j (1 − τ) + θ'_j τ, for j ∈ {1, 2} // Update target networks
            φ' ← φ (1 − τ) + φ' τ
        n_b ← n_b + 1                                   // Trained one batch

Figure 5: Partitioning of environments into easy (green), medium (yellow) and hard (red) goals to reach from a given starting state (blue). The left panel shows Mini maze, and the right shows Dubins' car.
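As a rough illustration of the inner loop of Algorithm 2, the sketch below performs one critic/actor update in PyTorch. The network objects, the pre-computed targets, and all hyperparameter names are assumptions made for the example, not the released implementation.

```python
import torch
import torch.nn.functional as F

def td3_c_learning_step(batch, critics, critic_targets, actor, opt_critics, opt_actor,
                        n_b, policy_delay=2, tau=0.005):
    """One inner-loop step of a TD3-style C-learning update (illustrative sketch)."""
    s, a, g, h, y = batch          # states, actions, goals, horizons, targets y_i in [0, 1]

    # Twin-critic update: binary cross-entropy against the bootstrapped targets y_i.
    for critic, opt in zip(critics, opt_critics):
        c = critic(s, a, g, h).clamp(1e-6, 1 - 1e-6)   # predicted reachability probability
        loss = F.binary_cross_entropy(c, y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Delayed actor update: raise the C-value of the actor's proposed action.
    if n_b % policy_delay == 0:
        actor_loss = -critics[0](s, actor(s, g, h), g, h).mean()
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()

        # Soft (Polyak) update of the target critics: theta' <- (1 - tau) * theta + tau * theta'.
        for critic, target in zip(critics, critic_targets):
            for p, p_t in zip(critic.parameters(), target.parameters()):
                p_t.data.mul_(tau).add_((1 - tau) * p.data)
```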
Table 2: Relevant metrics for the Mini maze environment, stratified by difficulty of goal (see Figure 5 (left) from Appendix F). Runs in bold are either the best on the given metric in that environment, or have a mean score within the error bars of the best.
[Table 2 body: success rate and path length for CAE, GCSL, and HER, each reported over easy, medium, hard, and all goals.]

Table 3: Relevant metrics for the Dubins' car environment, stratified by difficulty of goal (see Figure 5 (right) from Appendix F). Runs in bold are either the best on the given metric in that environment, or have a mean score within the error bars of the best.
[Table 3 body: success rate and path length for CAE, GCSL, and HER, each reported over easy, medium, hard, and all goals.]
We also compare C-learning against TDMs (Pong et al., 2018) for motion planning tasks. Results are shown in Figure 7. As mentioned in the main manuscript, TDMs assume a dense reward function, which C-learning does not have access to. In spite of this, C-learning significantly outperforms TDMs.

Finally, we also compare against the non-cumulative version of C-learning (A-learning) and the discount-based recursion (D-learning) on Dubins' car. For D-learning, we selected γ uniformly at random. Results are shown in Table 4: C-learning indeed outperforms both alternatives. Curiously, while underperforming on success rate, A-learning and D-learning seem to obtain shorter paths among successful trials.
G  EXPERIMENTAL DETAILS
C-learning for stochastic environments: As mentioned in the main manuscript, we modify the replay buffer in order to avoid the bias incurred by HER sampling (Matthias et al., 2018) in non-deterministic environments. We sample goals independently from the chosen episode: g_i is sampled to be a goal that is potentially reachable from s_i in h_i steps. For example, if given access to a distance d between states and goals such that the distance can decrease by at most one unit after a single step, we sample g_i from the set of goals g such that d(s_i, g) ≤ h_i. Moreover, when constructing y_i (line 16 of Algorithm 1), if we know for a fact that g_i cannot be reached from s'_i in h_i − 1 steps, we output 0 instead of max_{a'} C*_{θ'}(s'_i, a', g_i, h_i − 1); for example, d(s'_i, g_i) > h_i − 1 allows us to set y_i = 0. We found this practice, combined with the sampling of g_i described above, to significantly improve performance. While this requires some knowledge of the environment, such as a metric over states, this knowledge is available in many environments. For frozen lake, we use the L_1 metric to determine whether a goal is unreachable from a state, while ignoring the holes in the environment so as to not use too much information about the environment.

Figure 6: Success rate throughout training of C-learning (blue) and naive horizon-aware Q-learning (red) on FetchPickAndPlace-v1 (left); and C-learning (blue) and C-learning with uniform h sampling (red) on FetchPickAndPlace-v1 (right).

Figure 7: Success rate throughout training of C-learning (blue) and TDMs (red) on FetchPickAndPlace-v1 (left) and HandManipulatePenFull-v0 (right).
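The goal sampling and target pruning for stochastic environments described above can be sketched as follows; the distance function, goal test, candidate sets, and helper names are assumptions for illustration rather than the paper's actual code.

```python
import numpy as np

def sample_goal_and_target(s, s_next, h, d, goal_test, C_target,
                           candidate_goals, candidate_actions):
    """Sample a potentially reachable goal and build the bootstrapped target y (sketch).

    d(x, g)         -> distance that can decrease by at most one unit per step
    goal_test(x, g) -> True iff state x satisfies goal g
    C_target        -> target network, C_target(s, a, g, h) -> reachability probability
    """
    # Sample g independently of the episode, restricted to goals reachable within h steps.
    reachable = [g for g in candidate_goals if d(s, g) <= h]
    g = reachable[np.random.randint(len(reachable))]

    # If the goal is already satisfied at s', the target is 1.
    if goal_test(s_next, g):
        return g, 1.0

    # If g provably cannot be reached from s' in h - 1 steps, set the target to 0
    # instead of bootstrapping from the target network.
    if d(s_next, g) > h - 1:
        return g, 0.0

    # Otherwise bootstrap: max over candidate actions of C*(s', a', g, h - 1).
    y = max(C_target(s_next, a, g, h - 1) for a in candidate_actions)
    return g, y
```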
Table 4: Relevant metrics for the Dubins' car environment, stratified by difficulty of goal (see Figure 5 (right) from Appendix F). Runs in bold are either the best on the given metric in that environment, or have a mean score within the error bars of the best.

[Table 4 body: success rate and path length for C-learning, A-learning, and D-learning, each reported over easy, medium, hard, and all goals.]
G.1  FROZEN LAKE
For all methods, we train for a fixed number of episodes of bounded length, with a fixed learning rate and batch size, and a fixed number of gradient steps per episode. We use an ε-greedy behavior policy and a neural network with two ReLU hidden layers. We run fully random exploration episodes before we start training. We take p(s) as uniform among non-hole states during training, and set it to a point mass at the starting cell for testing. We set p(g) as uniform among states during training, and we evaluate at every goal during testing. For C-learning, we use κ = 3, and copy the target network at a fixed interval of training steps. We take the metric d to be the L_1 norm, completely ignoring the holes so as to not use too much environment information. For the horizon-independent policy, we used a fixed horizon set H and threshold α. We point out that, while C-learning did manage to recover the safe path for large h, it did not always do so: we suspect that the policy of going directly up is more likely to be explored. However, we never observed GCSL taking the safe path.
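For concreteness, a minimal sketch of an ε-greedy behavior policy driven by the learned C-function is given below. It assumes the greedy action maximizes the current C* estimate at the chosen horizon; this is an illustration of the idea, not a transcription of the released code.

```python
import numpy as np

def epsilon_greedy_action(C_star, s, g, h, actions, epsilon, rng=None):
    """Hypothetical epsilon-greedy behavior policy built on a C-function estimate."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                        # explore: uniformly random action
        return actions[rng.integers(len(actions))]
    # Exploit: pick the action with the highest estimated reachability within h steps.
    values = [C_star(s, a, g, h) for a in actions]
    return actions[int(np.argmax(values))]
```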
G.2  MINI MAZE

We again train for a fixed number of episodes of bounded length, with a fixed number of gradient steps per episode, and additionally decay the exploration noise of the C-learning behavior policy throughout training according to a schedule of the form ε = ε₀ / (1 + n_GD / c), for constants ε₀ and c. The network we used has two ReLU hidden layers. While the environment is deterministic, we use the same replay buffer as for frozen lake, and take the metric d to be the L_1 norm, completely ignoring walls. We found this helped improve performance and allowed C-learning to still learn to reach far-away goals. We also define H(s, g) := {‖s − g‖, ‖s − g‖ + 1, ..., h_max}, with h_max := 50, as the set of horizons over which we check when rolling out the policies. We take p(s) as a point mass both for training and testing. We use p(g) during training as specified in Algorithm 1, and evaluate all the methods on a fixed set of goals. Additionally, we lower the learning rate by a constant factor partway through training. All the other hyperparameters are as in frozen lake. Figure 5 shows the split between easy, medium, and hard goals.
G.3  DUBINS' CAR

We train for a fixed number of episodes of bounded length, using a fixed number of gradient steps per episode. We use a neural network with two ReLU hidden layers. We take the metric d to be the L_∞ norm, completely ignoring walls, which we only use to decide whether or not we have reached the goal. We take p(s) as a point mass both for training and testing. We use p(g) during training as specified in Algorithm 1, and evaluate all the methods on a fixed set of goals. All other hyperparameters are as in frozen lake. The partition of goals into easy, medium, and hard is specified in Figure 5, where the agent always starts at the upper-left corner.

G.4  FETCHPICKANDPLACE-V1

We use a fixed number of gradient steps per episode, and only update φ with half the frequency of θ. We use a fixed number of episodes for FetchPickAndPlace-v1. All the other hyperparameters are as in Dubins' car.

G.5  HANDMANIPULATEPENFULL-V0

We use a fixed number of gradient steps per episode, and only update φ with half the frequency of θ. We use 60000 episodes for HandManipulatePenFull-v0.