Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
Yunhao Tang
Columbia University [email protected]
Alp Kucukelbir
Columbia University [email protected]
Abstract
We propose a graphical model framework for goal-conditioned reinforcement learning (RL), with an expectation maximization (EM) algorithm that operates on a lower bound of the RL objective. The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as hindsight experience replay (HER), handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hindsight expectation maximization (hEM), significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards.

1 Introduction

In goal-conditioned reinforcement learning (RL), an agent seeks to achieve a goal through interactions with the environment. At each step, the agent receives a reward, which ideally reflects how well it is achieving its goal. Traditional RL methods leverage these rewards to learn good policies. As such, the effectiveness of these methods relies on how informative the rewards are.

This sensitivity of traditional RL algorithms has led to a flurry of activity around reward shaping [1]. This limits the applicability of RL, as reward shaping is often specific to an environment and task, a practical obstacle to wider applicability. Binary rewards, however, are trivial to specify. The agent receives a strict indicator of success when it has achieved its goal; until then, it receives precisely zero reward. Such sparsity of reward signals renders goal-conditioned RL extremely challenging for traditional methods [2].

How can we navigate such binary reward settings? Consider an agent that explores its environment but fails to achieve its goal. One idea is to treat, in hindsight, its exploration as having achieved some other goal. By relabeling a 'failure' relative to an original goal as a 'success' with respect to some other goal, we can imagine an agent succeeding frequently at many goals, in spite of failing at the original goals. This insight motivates hindsight experience replay (HER) [2], an intuitive strategy that enables off-policy RL algorithms, such as [3, 4], to function in sparse binary reward settings.

The statistical simulation of rare events occupies a similar setting. Consider estimating an expectation of low-probability events using Monte Carlo sampling. The variance of this estimator relative to its expectation is too high to be practical [5]. A powerful approach to reduce variance is importance sampling (IS) [6]. The idea is to adapt the sampling procedure such that these rare events occur frequently, and then to adjust the final computation. Could IS help in binary reward RL settings too?
Main idea. We propose a probabilistic framework for goal-conditioned RL that formalizes the intuition of hindsight using ideas from statistical simulation. We equate the traditional RL objective to maximizing the evidence of our probabilistic model. This leads to a new algorithm, hindsight expectation maximization (hEM), which maximizes a tractable lower bound of the original objective [7]. A central insight is that the E-step naturally interprets hindsight replay as a special case of IS.

Figure 1: Training curves of hindsight expectation maximization (hEM) and HER on four goal-conditioned RL benchmark tasks: (a) Point mass, (b) Reacher goal, (c) Fetch robot, (d) Sawyer robot. Inputs are either state-based (solid lines) or image-based (dashed lines). The y-axis shows the success rates and the x-axis shows the training time steps. All curves are averages over multiple random seeds. hEM consistently outperforms HER across all tasks.

Figure 1 compares hEM to HER [2] on four goal-conditioned RL tasks with low-dimensional state and high-dimensional images as inputs. While hEM performs consistently well on both input types, HER struggles with image-based inputs. This is due to how
HER leverages hindsight replay within a temporal difference (TD)-learning procedure; performance degrades sharply with the dimensionality of the inputs (as observed previously in [8, 9]; also see Section 4). In contrast, hEM leverages hindsight experiences through the lens of IS, thus enabling better performance in high dimensions.

The rest of this section presents a quick background on goal-conditioned RL and probabilistic inference. Expert readers may jump ahead to Section 2, which presents our graphical model.

Goal-conditioned RL background. A Markov decision process (MDP) can be simply extended to incorporate multiple goals. Consider an agent that interacts with an environment in episodes. At the beginning of each episode, a goal $g \in \mathcal{G}$ is fixed. At a discrete time $t \geq 0$, an agent in state $s_t \in \mathcal{S}$ takes action $a_t \in \mathcal{A}$, receives a reward $r(s_t, a_t, g) \in \mathbb{R}$ and transitions to its next state $s_{t+1} \sim p(\cdot \mid s_t, a_t) \in \mathcal{S}$. This process is independent of goals. A policy $\pi(a \mid s, g): \mathcal{S} \times \mathcal{G} \mapsto \mathcal{P}(\mathcal{A})$ defines a map from state and goal to distributions over actions. Given a distribution over goals $g \sim p(\cdot)$, we consider the undiscounted episodic return $J(\pi) := \mathbb{E}_{g \sim p(\cdot)}\big[\mathbb{E}_{\pi}[\sum_{t=0}^{T-1} r(s_t, a_t, g)]\big]$. When rewards are independent of goals, $r(s_t, a_t, g) \equiv r(s_t, a_t)$, we recover classical RL [10].
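To make this setup concrete, the following is a minimal sketch of a goal-conditioned MDP with a sparse binary reward, modeled on the 'Flip bit' task used in the experiments of Section 4; the class and method names here are ours, not part of any benchmark API.

```python
import numpy as np

class FlipBitEnv:
    """Sketch of a goal-conditioned MDP: flip one bit per step, reward I[s' = g]."""

    def __init__(self, k, horizon=50, seed=0):
        self.k, self.horizon = k, horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = self.rng.integers(0, 2, self.k)
        self.goal = self.rng.integers(0, 2, self.k)  # g ~ p(.), fixed per episode
        self.t = 0
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        self.state[action] ^= 1                      # flip the bit at index `action`
        self.t += 1
        reward = float(np.array_equal(self.state, self.goal))  # r(s_t, a_t, g)
        done = reward > 0 or self.t >= self.horizon
        return self.state.copy(), reward, done
```

Under this interface, $J(\pi)$ is simply the Monte Carlo average of episode returns over sampled goals; with random bit strings, rewarding episodes are exponentially rare in $k$, which is exactly the sparsity problem studied below.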
Probabilistic inference background. Consider data as observed random variables $x \in \mathcal{X}$. Each measurement $x$ is a discrete or continuous random variable. A likelihood $p_\theta(x \mid z)$ relates each measurement to latent variables $z \in \mathcal{Z}$ and unknown, but fixed, parameters $\theta$. The full probabilistic generative model specifies a prior over the latent variable, $p(z)$. Bayesian inference requires computing the posterior $p(z \mid x)$, an intractable task for all but a small class of simple models.

Variational inference approximates the posterior by matching a tractable density $q_\phi(z \mid x)$ to the posterior. The following calculation specifies this procedure:
$$\log p(x) = \log \mathbb{E}_{z \sim p(\cdot)}[p_\theta(x \mid z)] \qquad (1)$$
$$= \log \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \qquad (2)$$
$$= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}[q_\phi(\cdot \mid x) \,\|\, p(z)] =: \mathcal{L}(p_\theta, q). \qquad (3)$$
(For a detailed derivation, please see [7].) Equation (3) defines the evidence lower bound (ELBO) $\mathcal{L}(p_\theta, q)$. Matching the tractable density $q_\phi(z \mid x)$ to the posterior thus turns into maximizing the ELBO via expectation maximization (EM) [11] or stochastic gradient ascent [12, 13]. Figure 2(a) presents a graphical model of the above. For a fixed set of $\theta$ parameters, the optimal variational distribution $q$ is the true posterior: $\arg\max_q \mathcal{L}(p_\theta, q) \equiv p(z \mid x) := p_\theta(x \mid z)\, p(z)/p(x)$. From an IS perspective, note how the variational distribution $q_\phi(z \mid x)$ serves as a proposal distribution in place of $p(z)$.
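The IS reading of this last point can be seen in a toy numerical experiment (ours, not from the paper): estimating an evidence term that depends on a rare latent 'hit' is hopeless when sampling from the prior, but easy under a proposal that concentrates on the hit.

```python
import numpy as np

rng = np.random.default_rng(0)
k, z_star, n = 1000, 7, 100_000           # true evidence p(x) = 1/k
prior = np.full(k, 1.0 / k)

# Naive Monte Carlo from the prior: p(x) = E_{z~p(.)}[p(x|z)], p(x|z) = I[z == z*].
z = rng.integers(0, k, n)
naive = (z == z_star).astype(float)

# Importance sampling: proposal q concentrates half its mass on the 'hit'.
q = np.full(k, 0.5 / (k - 1)); q[z_star] = 0.5
zq = rng.choice(k, size=n, p=q)
weighted = (zq == z_star) * prior[zq] / q[zq]

print(naive.mean(), naive.std())          # ~0.001 with per-sample std ~0.032
print(weighted.mean(), weighted.std())    # ~0.001 with per-sample std ~0.001
```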
Figure 2: Graphical models: (a) probabilistic inference, (b) Variational RL, (c) goal-conditioned RL (generative model), (d) goal-conditioned RL (inference model). Solid lines represent generative models and dashed lines represent inference models. Circles represent random variables and squares represent parameters. Shading indicates that the random variable is observed.
2 A Probabilistic Model for Goal-conditioned RL

Probabilistic modeling and control enjoy strong connections, especially in linear systems [14, 15]. Two recent frameworks connect probabilistic inference to general RL: Variational RL [16–19] and RL as inference [20–22]. We situate our probabilistic model by first presenting Variational RL below. (Appendix A presents a detailed comparison to RL as inference.)

Variational RL. Begin by defining a trajectory random variable $\tau \equiv (s_t, a_t)_{t=0}^{T-1}$ to encapsulate a sequence of state and action pairs. The random variable is generated by a factorized distribution $a_t \sim \pi_\theta(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$, which defines the joint distribution $p_\theta(\tau) := \Pi_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Conditional on $\tau$, define the distribution of a binary optimality variable as $p(O = 1 \mid \tau) \propto \exp(\sum_{t=0}^{T-1} r(s_t, a_t)/\alpha)$ for some $\alpha > 0$, where we assume $r(s_t, a_t) \geq 0$ without loss of generality. Optimizing the standard RL objective corresponds to maximizing the evidence $\log p(O = 1)$, where all binary variables are treated as observed and equal to one. Positing a variational approximation to the posterior over trajectories gives the following lower bound,
$$\log p(O = 1) \geq \mathbb{E}_{q(\tau)}[\log p(O = 1 \mid \tau)] - \mathrm{KL}[q(\tau) \,\|\, p_\theta(\tau)] =: \mathcal{L}(\pi_\theta, q). \qquad (4)$$
Figure 2(b) shows a combined graphical model for both the generative and inference models of Variational RL. Equation (4) is typically maximized using EM (e.g., [23, 18, 19]), by alternating updates between $\theta$ and $q(\tau)$. Note that Variational RL does not model goals.

To extend the Variational RL framework to incorporate goals, introduce a goal variable $g$ and a prior distribution $g \sim p(\cdot)$. Conditional on a goal $g$, the trajectory variable $\tau \equiv (s_t, a_t)_{t=0}^{T-1} \sim p(\cdot \mid \theta, g)$ is sampled by executing the policy $\pi_\theta(a \mid s, g)$ in the MDP. Similar to Variational RL, the joint distribution factorizes as $p(\tau \mid \theta, g) := \Pi_{t=0}^{T-1} \pi_\theta(a_t \mid s_t, g)\, p(s_{t+1} \mid s_t, a_t)$. Now, define a goal-conditioned binary optimality variable $O$, such that $p(O = 1 \mid \tau, g) := R(\tau, g) := \sum_{t=0}^{T-1} r(s_t, a_t, g)/\alpha$, where $\alpha > 0$ normalizes this density. Figure 2(c) shows a graphical model of just this generative model. Treat the optimality variables as the observations and assume $O \equiv 1$. The following proposition shows the equivalence between inference in this model and traditional goal-conditioned RL.

Proposition 1. (Proof in Appendix B.1.) Maximizing the evidence of the probabilistic model is equivalent to maximizing returns in the goal-conditioned RL problem, i.e.,
$$\arg\max_\theta \log p(O = 1) = \arg\max_\theta J(\pi_\theta). \qquad (5)$$

Equation (5) implies that algorithms that maximize the evidence of such probabilistic models could be readily applied to goal-conditioned RL. Unlike typical probabilistic inference settings, the evidence here could technically be directly optimized. Indeed, $p(O = 1) \equiv J(\pi_\theta)$ could be maximized via traditional RL approaches, e.g., policy gradients [24]. In particular, the REINFORCE gradient estimator [25] of Equation (5) is given by $\eta_\theta = \sum_{t \geq 0} \sum_{t' \geq t} r(s_{t'}, a_{t'}, g)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t, g) \approx \nabla_\theta J(\pi_\theta)$, where $g \sim p(\cdot)$ and $(s_t, a_t)_{t=0}^{T-1}$ are sampled on-policy. The direct optimization of $\log p(O = 1)$ consists of gradient ascents $\theta \leftarrow \theta + \eta_\theta$. This poses a practical challenge in goal-conditioned RL. To see why, consider the following example.
Illustrative example. Consider a one-step MDP with $T = 1$, where $\mathcal{S} = \{s\}$, $\mathcal{A} = \mathcal{G}$, and $r(s, a, g) = \mathbb{I}[a = g]$. Assume that there are a finite number of actions and goals, $|\mathcal{A}| = |\mathcal{G}| = k$. The following theorem shows the difficulty in building a practical estimator for $\nabla_\theta J(\pi_\theta)$.

Theorem 1. (Proof in Appendix B.2.) Consider the example above. Let the policy $\pi_\theta(a \mid s, g) = \mathrm{softmax}(L_{a,g})$ be parameterized by logits $L_{a,g}$ and let $\eta_{a,g}$ be the one-sample REINFORCE gradient estimator of $L_{a,g}$. Assume a uniform distribution over goals, $p(g) = 1/k$ for all $g \in \mathcal{G}$. Assume that the policy is randomly initialized (e.g., $L_{a,g} \equiv L$ for all $a, g$ and some $L$). Let $\mathrm{MSE}[x]$ be the mean squared error $\mathrm{MSE}[x] := \mathbb{E}[(x - \mathbb{E}[\eta_{a,g}])^2]$. Then the relative error $\sqrt{\mathrm{MSE}[\eta_{a,g}]}/\mathbb{E}[\eta_{a,g}] = k(1 + o(1))$ grows approximately linearly with $k$, for all $a \in \mathcal{A}$, $g \in \mathcal{G}$.
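Theorem 1 is easy to check numerically. Below is a quick simulation (our own) of the one-sample estimator $\eta_{a,g}$ at a uniformly initialized policy; the measured relative error grows roughly like $k$.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_error(k, n=2_000_000, a=0, g=0):
    """Monte Carlo relative error of the one-sample REINFORCE estimator
    of the logit gradient for L[a, g] in the one-step MDP."""
    gp = rng.integers(0, k, n)                 # g' ~ p(.)
    b = rng.integers(0, k, n)                  # b ~ pi(.|s, g'), uniform at init
    # reward I[b == g'] times grad_{L[a,g]} log pi(b|s,g') = (I[b==a]-1/k) I[g'==g]
    eta = (b == gp) * (gp == g) * ((b == a) - 1.0 / k)
    return eta.std() / abs(eta.mean())

for k in (4, 8, 16, 32):
    print(k, round(relative_error(k), 1))      # grows roughly linearly in k
```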
The above theorem shows that, in this simple setup, the relative error of the REINFORCE gradient estimator grows linearly in $k$. This implies that reducing the error with traditional Monte Carlo sampling would require $m \approx k^2$ samples, which quickly becomes intractable as $k$ increases. Though variance reduction methods such as control variates [26] could help, they do not change the super-linear growth rate of samples (see comments in Appendix B.2). The fundamental bottleneck is that dense gradients $r(s, a, g) \nabla_\theta \log \pi_\theta(a \mid s, g) \neq 0$ are rare events, with probability $1/k$, which makes them difficult to estimate accurately with on-policy measures [5]. This example hints at similar issues in more realistic cases and motivates an IS approach to address the problem.

3 Hindsight Expectation Maximization

Consider a variational inequality similar to Equation (3) with a variational distribution $q(\tau, g)$:
$$\log p(O = 1) = \log \mathbb{E}_{q(\tau, g)}\left[\frac{p(O = 1 \mid \tau, g)\, p(g)\, p(\tau \mid \theta, g)}{q(\tau, g)}\right] \qquad (6)$$
$$\geq \mathbb{E}_{q(\tau, g)}\left[\log \frac{p(O = 1 \mid \tau, g)\, p(g)\, p(\tau \mid \theta, g)}{q(\tau, g)}\right] \qquad (7)$$
$$= \mathbb{E}_{q(\tau, g)}[\log p(O = 1 \mid \tau, g)] - \mathrm{KL}[q(\tau, g) \,\|\, p(g)\, p(\tau \mid \theta, g)] =: \mathcal{L}(\pi_\theta, q). \qquad (8)$$
This variational distribution corresponds to the inference model in Figure 2(d). As with typical graphical models, instead of maximizing $\log p(O = 1)$, consider maximizing its ELBO $\mathcal{L}(\pi_\theta, q)$ with respect to both $\theta$ and the variational distribution $q(\tau, g)$. Our key insight lies in the following observation: the bottleneck of the direct optimization of $\log p(O = 1)$ lies in the sparsity of $p(O = 1 \mid \tau, g) = \sum_{t=0}^{T-1} r(s_t, a_t, g)/\alpha$, where $(\tau, g)$ are sampled with the on-policy measure $g \sim p(\cdot)$, $\tau \sim p(\cdot \mid \theta, g)$. The variational distribution $q(\tau, g)$ serves as an IS proposal in place of $p(\tau \mid \theta, g)\, p(g)$. If $q(\tau, g)$ puts more probability mass on $(\tau, g)$ pairs with high returns (high $p(O = 1 \mid \tau, g)$), the rewards become dense and learning becomes feasible. In the next section, we show how hindsight replay [2] provides an intuitive and effective way to select such a $q(\tau, g)$.

The EM algorithm [11] for Equation (8) alternates between an E- and M-step: at iteration $t$, denote the policy parameter by $\theta_t$ and the variational distribution by $q_t$:
$$\text{E-step:}\ \ q_{t+1} = \arg\max_q \mathcal{L}(\pi_{\theta_t}, q), \qquad \text{M-step:}\ \ \theta_{t+1} = \arg\max_\theta \mathcal{L}(\pi_\theta, q_{t+1}). \qquad (9)$$
This ensures a monotonic improvement in the ELBO: $\mathcal{L}(\pi_{\theta_{t+1}}, q_{t+1}) \geq \mathcal{L}(\pi_{\theta_t}, q_t)$. We discuss these two alternating steps in detail below, starting with the M-step.

M-step: Optimization for $\pi_\theta$. Fixing the variational distribution $q(\tau, g)$, optimizing $\mathcal{L}(\pi_\theta, q)$ with respect to $\theta$ is equivalent to
$$\max_\theta \mathbb{E}_{q(\tau, g)}[\log p(\tau \mid \theta, g)] \equiv \max_\theta \mathbb{E}_{q(\tau, g)}\left[\sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t, g)\right]. \qquad (10)$$
The right hand side of Equation (10) corresponds to a supervised learning problem where learning samples come from $q(\tau, g)$. Prior studies have adopted this idea and developed policy optimization algorithms in this direction [18, 19, 27–29]. In practice, the M-step is carried out partially, where $\theta$ is updated with gradient steps instead of optimizing Equation (10) fully.
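In code, the partial M-step is nothing more than goal-conditioned behavior cloning on samples from $q(\tau, g)$. A minimal sketch for a discrete-action policy follows; the network shape, sizes, and helper names are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

obs_dim, goal_dim, num_actions = 10, 3, 4     # placeholder sizes
policy = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                       nn.Linear(256, num_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def m_step(states, goals, actions):
    """One partial M-step update: maximize E_q[sum_t log pi(a_t|s_t, g)].
    states: [B, obs_dim]; goals: [B, goal_dim]; actions: [B] (long)."""
    logits = policy(torch.cat([states, goals], dim=-1))
    log_pi = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -log_pi.mean()                     # supervised negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```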
E-step: Optimization for $q(\tau, g)$. The choice of $q(\tau, g)$ should satisfy two desirable properties: (P.1) it leads to monotonic improvements in $\log p(O = 1) \equiv J(\pi_\theta)$, or a lower bound thereof; (P.2) it provides dense learning signals for the M-step.

The posterior distribution $p(\tau, g \mid O = 1)$ achieves (P.1) and (P.2) in a near-optimal way, in that it is the maximizer of the E-step in Equation (9), which monotonically improves the ELBO. The posterior also provides dense reward signals to the M-step because $p(\tau, g \mid O = 1) \propto p(O = 1 \mid \tau, g)$. In practice, one chooses a variational distribution $q(\tau, g)$ as an alternative to the intractable posterior by maximizing Equation (8). Below, we show it is possible to achieve (P.1) and (P.2) even though the E-step is not carried out fully. By plugging in $p(O = 1 \mid \tau, g) = \sum_{t=0}^{T-1} r(s_t, a_t, g)/\alpha$, we write the ELBO as
$$\mathcal{L}(\pi_\theta, q) = \underbrace{\mathbb{E}_{q(\tau, g)}\left[\frac{\sum_{t=0}^{T-1} r(s_t, a_t, g)}{\alpha}\right]}_{\text{first term}}\ \underbrace{-\ \mathrm{KL}[q(\tau, g) \,\|\, p(g)\, p(\tau \mid \theta, g)]}_{\text{second term}}. \qquad (11)$$
We now examine alternative ways to select the variational distribution $q(\tau, g)$.
Prior work. State-of-the-art model-free algorithms such as MPO [18, 19] apply a factorized variational distribution $q^{\mathrm{ent}}(\tau, g) = p(g)\, \Pi_{t=0}^{T-1} q^{\mathrm{ent}}(a_t \mid s_t, g)$. The variational distribution is defined by local distributions $q^{\mathrm{ent}}(a \mid s, g) := \pi_\theta(a \mid s, g) \exp(\hat{Q}^{\pi_\theta}(s, a, g)/\eta)$ for some temperature $\eta > 0$ and estimates of Q-functions $\hat{Q}^{\pi_\theta}(s, a, g)$. The design of $q^{\mathrm{ent}}(\tau, g)$ can be interpreted as initializing $q^{\mathrm{ent}}(a \mid s, g)$ with $\pi_\theta(a \mid s, g)$, which effectively maximizes the second term in Equation (11), and then taking one improvement step on the first term [18]. This distribution satisfies (P.1) because the combined EM algorithm corresponds to entropy-regularized policy iteration, which retains monotonic improvements in $J(\pi_\theta)$. However, it does not satisfy (P.2): when rewards are sparse, $r(s, a, g) \approx 0$, the Q-function estimates are sparse as well, $\hat{Q}^{\pi_\theta}(s, a, g) \approx 0$, which leads to uninformed variational distributions $q^{\mathrm{ent}}(a \mid s, g) \propto \pi_\theta(a \mid s, g) \exp(\hat{Q}^{\pi_\theta}(s, a, g)/\eta) \approx \pi_\theta(a \mid s, g)$ for the M-step. In fact, when $\eta$ is large and the update to $q(a \mid s, g)$ from $\pi_\theta(a \mid s, g)$ becomes infinitesimal, the E-step is equivalent to policy gradients [26, 24], which suffer from the sparsity of rewards as discussed in Section 2.
Hindsight variational distribution. Maximizing the first term of the ELBO is challenging when rewards are sparse. This motivates choosing a $q(\tau, g)$ which puts more weight on maximizing the first term. We now formally introduce the hindsight variational distribution $q^h(\tau, g)$, the sampling distribution employed equivalently in HER [2]. Sampling from this distribution is implicitly defined by an algorithmic procedure (see the sketch after the steps below):

Step 1. Collect an on-policy trajectory or sample a trajectory from a replay buffer, $\tau \sim \mathcal{D}$.

Step 2.
Find a $g$ such that the trajectory is rewarding, in that $R(\tau, g)$ is high or the trial is successful. Return the pair $(\tau, g)$.

Note that Step 2 can be conveniently carried out with access to the reward function $r(s, a, g)$, as in [2].
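A sketch of this sampler, in the style of the 'future' relabeling strategy of [2], is shown below; `achieved_goal` is a hypothetical helper mapping a state to the goal it achieves, and boundary details vary by implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hindsight(buffer, achieved_goal):
    """Sample (tau, g) from q^h(tau, g): pick a stored trajectory (Step 1),
    then relabel with a goal the trajectory actually achieved (Step 2).

    `buffer` is a list of trajectories, each a list of (state, action) pairs.
    """
    tau = buffer[rng.integers(len(buffer))]        # Step 1: tau ~ D
    t = rng.integers(len(tau))                     # a time step along tau
    g = achieved_goal(tau[t][0])                   # Step 2: goal achieved at step t
    return tau[: t + 1], g                         # this prefix is rewarding for g
```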
Contrary to $q^{\mathrm{ent}}(\tau, g)$, this hindsight variational distribution maximizes the first term in Equation (11) by construction. This naturally satisfies (P.2), as $q^h(\tau, g)$ provides highly rewarding samples $(\tau, g)$ and hence dense signals to the M-step. The following theorem shows how $q^h(\tau, g)$ improves the sampling performance of our gradient estimates.

Theorem 2. (Proof in Appendix B.3.) Consider the illustrative example in Theorem 1. Let $\eta^h_{a,g} = r(s, b, g')\, \nabla_{L_{a,g}} \log \pi(b \mid s, g')/k$ be the normalized one-sample REINFORCE gradient estimator, where $(b, g')$ are sampled from the hindsight variational distribution with an on-policy buffer. Then the relative error $\sqrt{\mathrm{MSE}[\eta^h_{a,g}]}/\mathbb{E}[\eta^h_{a,g}] = \sqrt{k}(1 + o(1))$ grows sub-linearly, for all $a \in \mathcal{A}$, $g \in \mathcal{G}$.

Theorem 2 implies that reducing the relative error of the hindsight estimator $\eta^h_{a,g}$ with traditional Monte Carlo sampling would require $m \approx (\sqrt{k})^2 = k$ samples, which scales linearly with the problem size $k$. This is a sharp contrast to $m \approx k^2$ from using the on-policy REINFORCE gradient estimator. The above result shows the benefits of IS: under $q^h(\tau, g)$, rewarding trajectory-goal pairs are given high probabilities, which naturally alleviates the issue of sparse rewards. The following result shows that $q^h(\tau, g)$ also satisfies (P.1) under mild conditions.

Theorem 3. (Proof in Appendix B.4.) Assume $p(g)$ to be uniform without loss of generality and a tabular representation of the policy $\pi_\theta$. At iteration $t$, assume that the partial E-step returns $q_t(\tau, g)$ and the M-step objective in Equation (10) is optimized fully. Also assume the variational distribution to be the hindsight variational distribution, $q_t(\tau, g) := q^h(\tau, g)$. Let $\tilde{p}_t(g) := \int_\tau q_t(\tau, g)\, d\tau$ be the marginal distribution of goals. The performance is lower bounded as $J(\pi_{\theta_{t+1}}) \geq |\mathrm{supp}(\tilde{p}_t(g))| / |\mathcal{G}| =: \tilde{L}_t$. When the replay buffer size $|\mathcal{D}|$ increases over iterations, the lower bound improves: $\tilde{L}_{t+1} \geq \tilde{L}_t$.

3.1 Algorithm

We now present hEM, a combination of the above E- and M-steps. The algorithm maintains a policy $\pi_\theta(a \mid s, g)$. At each iteration, hEM collects $N$ trajectory-goal pairs by first sampling a goal $g \sim p(\cdot)$ and then rolling out a trajectory $\tau$. All trajectories are stored in a replay buffer $\mathcal{D}$ [3]. At training time, hEM carries out a partial E-step by sampling $(\tau, g)$ pairs from $q^h(\tau, g)$. For the partial M-step, the policy is updated through several gradient ascents on Equation (10) with the Adam optimizer [30]. Importantly, hEM is an off-policy RL algorithm without value functions, which also makes it agnostic to reward functions. The pseudocode is summarized in Algorithm 1, with a Python sketch following the pseudocode. Please refer to Appendix C for full descriptions of the algorithm.
Algorithm 1 Hindsight Expectation Maximization (hEM)
INPUT: policy $\pi_\theta(a \mid s, g)$.
while $t = 0, 1, \ldots$ do
    Sample goal $g \sim p(\cdot)$ and trajectory $\tau \sim p(\cdot \mid \theta, g)$ by executing $\pi_\theta$ in the MDP.
    Save data $(\tau, g)$ to a replay buffer $\mathcal{D}$.
    E-step. Sample from $q^h(\tau, g)$: sample $\tau \equiv (s_t, a_t)_{t=0}^{T-1} \sim \mathcal{D}$ and find rewarding goals $g$.
    M-step. Update the policy by a few gradient ascents $\theta \leftarrow \theta + \nabla_\theta \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t, g)$.
end while
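Putting the pieces together, here is a compact Python rendering of Algorithm 1. It is a sketch under our own interface assumptions: `env` follows the reset/step convention from the earlier environment sketch, `sample_action` draws from $\pi_\theta(\cdot \mid s, g)$, and `supervised_update` performs the partial M-step of Equation (10).

```python
import random

def hem(env, sample_action, supervised_update, achieved_goal,
        iterations=1000, n_traj=20, n_updates=40):
    """Sketch of Algorithm 1 (hEM); helper names are ours, not the paper's."""
    buffer = []
    for _ in range(iterations):
        # Data collection: g ~ p(.), tau ~ p(.|theta, g), stored in D.
        for _ in range(n_traj):
            state, goal = env.reset()
            tau, done = [], False
            while not done:
                action = sample_action(state, goal)
                state_next, _, done = env.step(action)
                tau.append((state, action))
                state = state_next
            buffer.append(tau)
        # Partial E-step (hindsight sampling) + partial M-step (supervised).
        for _ in range(n_updates):
            tau = random.choice(buffer)
            t = random.randrange(len(tau))
            g = achieved_goal(tau[t][0])          # relabel with an achieved goal
            states, actions = zip(*tau[: t + 1])
            supervised_update(states, actions, g)
```

Note the absence of a critic or explicit IS ratios: the only learned object is the policy, updated by maximum likelihood.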
3.2 Connections to prior work

Hindsight experience replay. The core of HER lies in hindsight goal replay [2]. Similar to hEM, HER samples trajectory-goal pairs from the hindsight variational distribution $q^h(\tau, g)$, and minimizes the Q-learning loss $\mathbb{E}_{(\tau, g) \sim q^h(\cdot)}[\sum_{t=0}^{T-1}(Q_\theta(s_t, a_t, g) - r(s_t, a_t, g) - \gamma \max_{a'} Q_\theta(s_{t+1}, a', g))^2]$. The development of hEM in Section 3 formalizes this choice of the sampling distribution, $q(\tau, g) := q^h(\tau, g)$, as partially maximizing the ELBO during an E-step. Compared to hEM, HER learns a critic $Q_\theta(s, a, g)$. We will see in the experiments that such critic learning tends to be much more unstable when rewards are sparse and inputs are high-dimensional, as was also observed in [8, 9].
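For contrast, a sketch of the critic update that HER performs on the same hindsight-relabeled data; we show a DQN-style discrete-action variant for brevity (the paper's experiments combine HER with DDPG), and all sizes here are placeholders.

```python
import torch
import torch.nn as nn

obs_dim, goal_dim, num_actions = 10, 3, 4       # placeholder sizes
q_net = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                      nn.Linear(256, num_actions))

def her_td_loss(s, a, r, s_next, g, gamma=0.98):
    """One-step TD loss on transitions relabeled by q^h(tau, g).
    s, s_next: [B, obs_dim]; g: [B, goal_dim]; a: [B] (long); r: [B]."""
    q = q_net(torch.cat([s, g], dim=-1)).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(torch.cat([s_next, g], dim=-1)).max(dim=1).values
    return ((q - (r + gamma * q_next)) ** 2).mean()
```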
Hindsight policy gradient. In its vanilla form, the hindsight policy gradient (HPG) considers on-policy stochastic gradient estimators of the RL objective [24], $\mathbb{E}_{p(g)\, p(\tau \mid \theta, g)}[R(\tau, g)\, \nabla_\theta \log p(\tau \mid \theta, g)]$. Despite variance reduction methods such as control variates [26, 24], the unbiased estimators of HPG do not address the rare-event issue central to sparse-reward MDPs, where $R(\tau, g)\, \nabla_\theta \log p(\tau \mid \theta, g)$ taking non-zero values is a rare event under the on-policy measure $p(g)\, p(\tau \mid \theta, g)$. Contrast HPG with the unbiased IS objective in Equation (8), $\mathbb{E}_{q(\tau, g)}[R(\tau, g)\, \nabla_\theta \log p(\tau \mid \theta, g) \cdot \frac{p(g)\, p(\tau \mid \theta, g)}{q(\tau, g)}]$, where the proposal $q(\tau, g)$ ideally prioritizes the rare events [5] to generate rich learning signals. hEM further avoids the explicit IS ratios with the variational approach that leads to an ELBO [7].
4 Experiments

We evaluate the empirical performance of hEM on a wide range of goal-conditioned RL benchmark tasks. These tasks all have extremely sparse binary rewards which indicate the success of a trial. The evaluation criterion is the success rate at test time. Since hEM builds on purely model-free concepts, we focus on the model-free state-of-the-art algorithm HER [2] as a comparison. In some cases we also compare with the closely related
HPG [24]; however, we find that even as HPG adopts denser rewards, its performance evaluated as the success rate is far inferior to HER and hEM. For hyper-parameter details and additional results, please see Appendix C.

4.1 Flip bit and continuous navigation

Flip bit. Taken from [2], the
MDP is parameterized by the number of bits $K$. The state space and goal space are $\mathcal{S} = \mathcal{G} = \{0, 1\}^K$ and the action space is $\mathcal{A} = \{1, 2, \ldots, K\}$. Given $s_t$, the action flips the bit at location $a_t$. The reward function is $r(s_t, a_t) = \mathbb{I}[s_{t+1} = g]$: the state has been flipped to match the target bit string. The environment is difficult for traditional RL methods, as the search space is of exponential size $|\mathcal{S}| = 2^K$. In Figure 3(a), we present results for HER (taken from Figure 1 of [2]), hEM and HPG. Observe that hEM and HER consistently perform well even when $K = 50$, while the performance of HPG drops drastically as the underlying spaces become enormous. See Figure 3(b) for the training curves of hEM and HPG for $K = 50$; note that HPG does not make any progress.
Continuous navigation. As a continuous analogue of the Flip bit MDP, consider a $K$-dimensional navigation task with a point mass. The state space and goal space coincide, $\mathcal{X} = \mathcal{G} = [-1, 1]^K$, while the actions $\mathcal{A} \subset \mathbb{R}^K$ specify small bounded changes in states (see the sketch below). The reward function is $r(s_t, a_t) = \mathbb{I}[\|s_{t+1} - g\| < \epsilon]$ for a fixed threshold $\epsilon$, which indicates success upon reaching the goal location. Results are shown in Figure 3(c), where we see that as $K$ increases, the search space quickly explodes and the performance of HER degrades drastically. The performance of hEM is not greatly influenced by increases in $K$. See Figure 3(d) for a comparison of training curves between hEM and HER for $K = 40$. HER already learns much more slowly compared to hEM, and degrades further when $K = 80$.
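A sketch of this navigation environment follows; the action bound and success threshold below are our assumptions, since the exact constants are not recoverable from the text.

```python
import numpy as np

class NavigationEnv:
    """K-dimensional point-mass navigation with a sparse, thresholded reward."""

    def __init__(self, k, horizon=50, max_step=0.1, threshold=0.1, seed=0):
        self.k, self.horizon = k, horizon
        self.max_step, self.threshold = max_step, threshold  # assumed constants
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = self.rng.uniform(-1.0, 1.0, self.k)
        self.goal = self.rng.uniform(-1.0, 1.0, self.k)
        self.t = 0
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        delta = np.clip(action, -self.max_step, self.max_step)
        self.state = np.clip(self.state + delta, -1.0, 1.0)
        self.t += 1
        reward = float(np.linalg.norm(self.state - self.goal) < self.threshold)
        return self.state.copy(), reward, reward > 0 or self.t >= self.horizon
```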
Figure 3: Summary of results for the Flip bit and continuous navigation MDPs: (a) Flip bit results, (b) Flip bit curves, (c) navigation results, (d) navigation curves. Plots (a) and (c) show the final performance after training is completed. Plots (b) and (d) show the training curves for Flip bit $K = 50$ and navigation $K = 40$, respectively. hEM consistently outperforms HER and HPG across these tasks.
4.2 Goal-conditioned reaching tasks

To assess the performance of hEM in contexts with richer system dynamics, we consider a wide range of goal-conditioned reaching tasks. We present details of their state space $\mathcal{X}$, goal space $\mathcal{G}$ and action space $\mathcal{A}$ in Appendix C. These include Point mass, Reacher goal, Fetch robot and Sawyer robot, as illustrated in Figure 7 in Appendix C.

Across all tasks, the reward takes the sparse form $r(s, a, g) = \mathbb{I}[\text{success}]$. As a comparison, we also include a HER baseline where the rewards take the form $\tilde{r}(s, a, g) = -\mathbb{I}[\text{failure}]$. Such reward shaping does not change the optimality of policies, as $\tilde{r} = r - 1$, but it makes the reward more 'dense' and more suitable for learning by neural-network-based Q-functions. We observe that this makes a significant difference in the performance of HER. We denote this HER baseline as 'HER-dense'.

From the results in Figure 4, we see that hEM performs significantly better than HER with binary rewards (HER-sparse). The performance of hEM quickly converges to optimality while HER struggles at learning good Q-functions. However, when compared with HER-dense, hEM does not achieve noticeable gains. Such an observation confirms the importance of reward shaping to HER.

4.3 Image-based experiments

We further assess the performance of hEM when policy inputs are high-dimensional images (see Figure 8 for illustrations). Across all tasks, the state inputs are by default images $s \in \mathbb{R}^{w \times w \times 3}$ where $w \in \{48, 64\}$, while the goal is still low-dimensional. See Appendix C for the network architectures.
Figure 4: Training curves of hEM and HER on four goal-conditioned RL benchmark tasks with state-based inputs and sparse binary rewards: (a) Point mass, (b) Reacher goal, (c) Fetch robot, (d) Sawyer robot. The y-axis shows the success rates and the x-axis shows the training time steps. All curves are averages over multiple random seeds. Standard deviations are small across seeds.

We focus on the comparison between hEM and HER-dense in Figure 5, as the performance of HER with binary rewards is inferior, as seen in Section 4.2. We see that for image-based tasks, HER-dense significantly underperforms hEM. While HER-dense makes slow progress in most cases, hEM achieves stable learning across all tasks. We speculate this is partly due to the common observation [8, 9] that TD-learning directly from high-dimensional image inputs is challenging. For example, prior work [31] has applied a variational autoencoder approach [12] to reduce the dimension of the image inputs for downstream TD-learning. In contrast, hEM only requires optimization in a supervised learning style, which is much more stable with end-to-end training on image inputs. Further, image-based goals are much easier to specify in certain contexts [31]. We evaluate hEM on image-based goals for the Sawyer robot and achieve similar performance as with state-based goals. See results in Figure 9 in Appendix C.
Figure 5: Training curves of hEM and HER on four goal-conditioned RL tasks with image-based inputs: (a) Point mass (I), (b) Reacher goal (I), (c) Fetch robot (I), (d) Sawyer robot (I). Standard deviations are small across seeds. All curves are averages over multiple random seeds. 'hEM-48' refers to image inputs with $w = 48$. hEM achieves stable learning regardless of the input sizes, though larger sizes do slow the learning rate.

hEM collects $N$ trajectories at each iteration, which is set to $N = 20$ in the previous experiments (except for Fetch robot (I) and Reacher (I), where $N = 80$). In certain cases, we find that the performance of hEM critically depends on the size of $N$. In Figure 9 in Appendix C we provide ablation results on the Flip bit MDP with $K = 50$ and the image-based Fetch robot task. Both tasks are challenging, due either to an enormous state space or to the high dimensionality of the images. In general, we find that larger $N$ leads to better performance. Similar observations have been made for HER [2], where increasing the number of parallel workers generally improves training performance.
5 Conclusion

We present a probabilistic framework for goal-conditioned RL. This framework motivates the development of hEM, a simple and effective off-policy RL algorithm. Our formulation draws formal connections between hindsight goal replay [2] and IS for rare-event simulation. hEM combines the stability of supervised learning updates via the M-step and the hindsight replay technique via the E-step. We show improved performance over a variety of benchmark RL tasks, especially in high-dimensional input settings with sparse binary rewards.

References

[1] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.

[2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[4] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[5] Gerardo Rubino and Bruno Tuffin. Rare event simulation using Monte Carlo methods. John Wiley & Sons, 2009.

[6] George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury, Pacific Grove, CA, 2002.

[7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[8] Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019.

[9] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.

[10] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

[11] Todd K Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

[12] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[13] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 2014.

[14] Rudolf Kalman. On the general theory of control systems. IRE Transactions on Automatic Control, 4(3):110–110, 1959.

[15] Emanuel Todorov. General duality between optimal control and estimation. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 4286–4292. IEEE, 2008.

[16] Jens Kober and Jan R Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, pages 849–856, 2009.

[17] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, pages 207–215, 2013.

[18] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.

[19] H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019.

[20] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

[21] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.

[22] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

[23] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[24] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Juergen Schmidhuber. Hindsight policy gradients. arXiv preprint arXiv:1711.06006, 2017.

[25] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.

[26] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[27] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

[28] Quan Vuong, Yiming Zhang, and Keith W Ross. Supervised policy update for deep reinforcement learning. arXiv preprint arXiv:1805.11706, 2018.

[29] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[31] Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018.

[32] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[33] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

[34] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

[35] Kavosh Asadi and Michael L Littman. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 243–252. JMLR.org, 2017.

[36] Michael C Fu. Stochastic gradient estimation. In Handbook of Simulation Optimization, pages 105–147. Springer, 2015.

[37] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

[38] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[39] Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp. Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, pages 15298–15309, 2019.

[40] Soroush Nasiriany, Vitchyr Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In Advances in Neural Information Processing Systems, pages 14814–14825, 2019.

[41] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[42] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
A Details on Graphical Models for Reinforcement Learning
In this section, we review details of the RL as inference framework [32, 22] and highlight its critical differences from Variational RL.

The graphical model for RL as inference is shown in Figure 6(c). The framework also assumes a trajectory variable $\tau \equiv (s_t, a_t)_{t=0}^{T-1}$ which encompasses the state and action sequences. Conditional on the trajectory variable $\tau$, the optimality variable is defined as $p(O = 1 \mid \tau) \propto \exp(\sum_{t=0}^{T-1} r(s_t, a_t)/\alpha)$ for $\alpha > 0$. Under this framework, the trajectory variable has a prior $a_t \sim p(\cdot)$, where $p(\cdot)$ is usually set to be a uniform distribution over the action space $\mathcal{A}$.

The policy parameter $\theta$ comes into play with the inference model. The framework asks the question: what is the posterior distribution $p(\tau \mid O = 1)$? To approximate this intractable posterior distribution, consider the variational distribution $q(\tau) := \Pi_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Searching for the best approximate posterior by minimizing the KL divergence $\mathrm{KL}[q(\tau) \,\|\, p(\tau \mid O = 1)]$, it can be shown that this is equivalent to maximum-entropy RL [33–35].

It is important to note that RL as inference does not contain trainable parameters for the generative model. Contrast this with Variational RL and the graphical model for goal-conditioned RL in Figure 2: the policy-dependent parameter $\theta$ is part of a generative model. The variational distribution $q(\tau, g)$, defined separately from $\theta$, is the inference model. In such cases, the variational distribution $q(\tau, g)$ is an auxiliary distribution which aids in the optimization of $\theta$ by performing partial E-steps.
Figure 6: Graphical models: (a) probabilistic inference, (b) Variational RL, (c) RL as inference. Plot (c) shows the graphical model for RL as inference [32, 22]. Solid lines represent generative models and dashed lines represent inference models. Circles represent random variables and squares represent parameters. Filled circles represent observed random variables. This graphical model does not have trainable parameters for the generative model. The policy-dependent parameter $\theta$ is in the inference model.

B Details on Proofs
B.1 Proof of Proposition 1

The proof follows from the observation that $p(O = 1) = \mathbb{E}_{g \sim p(\cdot), \pi}[p(O = 1 \mid \tau, g)] = J(\pi_\theta)$, and taking the log does not change the optimal solution.

B.2 Proof of Theorem 1
Recall that we have a one-step MDP setup where $\mathcal{A} = \mathcal{G}$ and $|\mathcal{A}| = k$. The policy $\pi(a \mid s, g) = \mathrm{softmax}(L_{a,g})$ is parameterized by logits $L_{a,g}$. When the policy is initialized randomly, we have $L_{a,g} \equiv L$ for some $L$, and $\pi(a \mid s, g) = 1/k$ for all $a, g$. Assume also $p(g) = 1/k$ for all $g$.

The one-sample REINFORCE gradient estimator for the component $L_{a,g}$ is $\eta_{a,g} = r(s, b, g')\, \nabla_{L_{a,g}} \log \pi(b \mid s, g')$ with $g' \sim p(\cdot)$ and $b \sim \pi(\cdot \mid s, g')$. Further, we can show
$$\mathbb{E}[\eta_{a,g}] = \frac{1}{k^2}\delta_{a,g} - \frac{1}{k^3}, \qquad \mathbb{V}[\eta_{a,g}] = \Big(\frac{1}{k^2} - \frac{2}{k^3} - \frac{1}{k^4} + \frac{2}{k^5}\Big)\delta_{a,g} + \frac{1}{k^4} - \frac{1}{k^6},$$
where $\delta_{a,b}$ is the Kronecker delta, i.e., $\delta_{a,b} = 1$ if $a = b$ and $\delta_{a,b} = 0$ otherwise. Taking the ratio (note that the estimator is unbiased, so the MSE consists purely of the variance), in either case ($\delta_{a,g} = 1$ or $\delta_{a,g} = 0$) we obtain
$$\frac{\mathrm{MSE}[\eta_{a,g}]}{\mathbb{E}[\eta_{a,g}]^2} = k^2(1 + o(1)),$$
which directly reduces to the result of the theorem.
Comment on the control variates. We also briefly study the effect of control variates. Let $X, Y$ be two random variables and assume $\mathbb{E}[Y] = 0$. Compare the variance $\mathbb{V}[X]$ with $\mathbb{V}[X + \alpha Y]$, where $\alpha$ is chosen optimally to minimize the variance of the second estimator. It can be shown that with the best $\alpha^*$, the ratio of variance reduction is $(\mathbb{V}[X] - \mathbb{V}[X + \alpha^* Y])/\mathbb{V}[X] = \rho^2 := \mathrm{Cov}[X, Y]^2/(\mathbb{V}[X]\,\mathbb{V}[Y])$. Consider the state-based control variate for the REINFORCE gradient estimator, in this case $-\alpha \cdot \nabla_{L_{a,g}} \log \pi(b \mid s, g')$, where $\alpha$ is chosen to minimize the variance of the following aggregate estimator:
$$\eta_{a,g}(\alpha) = r(s, b, g')\, \nabla_{L_{a,g}} \log \pi(b \mid s, g') - \alpha\, \nabla_{L_{a,g}} \log \pi(b \mid s, g').$$
Note that in practice, $\alpha$ is chosen to be state-dependent for the REINFORCE gradient estimator of general MDPs and is set to be the value function, $\alpha := V^\pi(s)$. Such a choice is not optimal [36] but is conveniently adopted in practice. Here, we consider an optimal $\alpha^*$ for the one-step MDP. The central quantity is the squared correlation $\rho^2$ between $r(s, b, g')\, \nabla_{L_{a,g}} \log \pi(b \mid s, g')$ and $\nabla_{L_{a,g}} \log \pi(b \mid s, g')$. With similar computations as above, it can be shown that $\rho^2 \approx 1$ for $a = g$ and $\rho^2 \approx 0$ otherwise. This implies that for $k$ out of the $k^2$ logit parameters, the variance reduction is significant; yet for the rest of the $k^2 - k$ parameters, the variance reduction is negligible. Overall, the analysis reflects that conventional control variates do not address the issue of super-linear growth of relative errors that results from sparse gradients.
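A small numerical check of the variance-reduction identity above (our own example, with synthetic $X$ and $Y$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
noise = rng.normal(size=n)
Y = rng.normal(size=n)            # zero-mean control variate
X = 3.0 * Y + noise               # estimator of interest, correlated with Y

alpha = -np.cov(X, Y)[0, 1] / Y.var()                 # optimal alpha*
rho2 = np.cov(X, Y)[0, 1] ** 2 / (X.var() * Y.var())  # squared correlation
print(X.var(), (X + alpha * Y).var())  # variance shrinks by factor (1 - rho^2)
print(rho2)                            # here rho^2 = 9/10, a 90% reduction
```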
B.3 Proof of Theorem 2

Similar to the proof of Theorem 1, we can show that the normalized one-step REINFORCE gradient estimator $\eta^h_{a,g} = r(s, b, g')\, \nabla_{L_{a,g}} \log \pi(b \mid s, g')/k$ with $(b, g') \sim q^h(\tau, g)$ has the following property, in either case ($\delta_{a,g} = 1$ or $\delta_{a,g} = 0$):
$$\frac{\mathrm{MSE}[\eta^h_{a,g}]}{\mathbb{E}[\eta^h_{a,g}]^2} = k(1 + o(1)).$$
This implies the result of the theorem.
B.4 Proof of Theorem 3

Without loss of generality, we assume $p(g)$ is a uniform measure, i.e., $p(g) = 1/|\mathcal{G}|$. If not, we could always find a transformation $g = f(g')$ such that $g'$ takes a uniform measure [6] and treat $g'$ as the goal to condition on.

Let $|\mathcal{G}| < \infty$ and recall that $\mathrm{supp}(\tilde{p}(g))$ is the support of $\tilde{p}(g)$. The uniform distribution assumption gives $|\mathrm{supp}(\tilde{p}(g))| = \int_{g \in \mathrm{supp}(\tilde{p}(g))} dg$. At iteration $t$, under a tabular representation, the M-step update implies that $\pi_\theta$ learns the optimal policy for all $g$ that could be sampled from $q_t(\tau, g)$, which effectively corresponds to the support of $\tilde{p}_t(g)$. Formally, this implies $\mathbb{E}_{p(\tau \mid \theta_{t+1}, g)}[R(\tau, g)] = 1$ for all $g \in \mathrm{supp}(\tilde{p}_t(g))$. This further implies
$$J(\pi_{\theta_{t+1}}) := \int \mathbb{E}_{p(\tau \mid \theta_{t+1}, g)}[R(\tau, g)]\, p(g)\, dg \geq \int_{g \in \mathrm{supp}(\tilde{p}_t(g))} 1 \cdot p(g)\, dg = |\mathrm{supp}(\tilde{p}_t(g))| / |\mathcal{G}|.$$

C Additional Experiment Results
C.1 Details on Benchmark tasks
All reaching tasks are built with the physics simulation engine MuJoCo [37]. We build a customized point mass environment; the Reacher and Fetch environments are partly based on the OpenAI Gym environments [38]; the Sawyer environment is based on the multiworld open source code https://github.com/vitchyr/multiworld.

All simulation tasks below have a maximum episode length of $T = 50$. The episode terminates early if the goal is achieved at a certain step. The sparse binary reward function is $r(s, a, g) = \mathbb{I}[\text{success}]$, which indicates the success of the transitioned state $s' = f(s, a)$. Below we describe in detail the setup of each task, in particular the success criterion.

Figure 7: Illustration of tasks. From left to right: Point mass, Reacher, Fetch robot and Sawyer robot. On the right is the image-based input for Fetch robot. For additional information on the tasks, see Appendix C.
Figure 8: Illustration of image-based inputs for the different reaching tasks in the main paper. Images are down-sampled to size $w \times w \times 3$ as inputs, where $w \in \{48, 64\}$.
• Point mass [39]. The objective is to navigate a point mass through a 2-D room with obstacles to the target location. $|\mathcal{S}| = 4$, $|\mathcal{G}| = 2$ and $|\mathcal{A}| = 2$. The goals $g \in \mathbb{R}^2$ are specified as a 2-D point on the plane. Included in the state $s$ are the 2-D coordinates of the point mass, denoted $s_{xy} \in \mathbb{R}^2$. Success is defined as $d(z(s_{xy}), z(g)) \leq \bar{d}$, where $d(\cdot, \cdot)$ is the Euclidean distance and $z(\cdot)$ is an element-wise normalization function $z(x) := (x - x_{\min})/(x_{\max} - x_{\min})$, with $x_{\max}, x_{\min}$ the boundaries of the wall. The normalized threshold is set to $\bar{d} = c \cdot \sqrt{2}$ for a small constant $c$.
• Reacher [38]. The objective is to move, via joint motors, the finger tip of a 2-D Reacher robot to reach a target goal location. $|\mathcal{S}| = 11$, $|\mathcal{G}| = 2$ and $|\mathcal{A}| = 2$. As with the above point mass environment, the goals $g \in \mathbb{R}^2$ are locations of a point in the 2-D plane. Included in the state $s$ are the 2-D coordinates of the finger tip location of the Reacher robot, $s_{xy}$. The success criterion is defined identically to the point mass environment.
• Fetch robot [38, 2]. The objective is to move, via position controls, the end effector of a Fetch robot to reach a target location in 3-D space. $|\mathcal{S}| = 10$, $|\mathcal{G}| = 3$ and $|\mathcal{A}| = 3$. This task belongs to the standard environments in OpenAI Gym [38] and we leave the details to the code base and [2].
• Sawyer robot [31, 40]. The objective is to move, via motor controls, the end effector of a Sawyer robot to reach a target location in 3-D space. $|\mathcal{S}| = |\mathcal{G}| = |\mathcal{A}| = 3$. This task belongs to the multiworld code base.
Details on image inputs. For the customized point mass and Reacher environments, the image inputs are taken by cameras which look vertically down at the systems. For the Fetch robot and Sawyer robot environments, the images are taken by cameras mounted to the robotic systems. See Figure 8 for an illustration of the image inputs. For such simulation environments, the transition $s' \sim p(\cdot \mid s, a)$ is deterministic, so we equivalently write $s' = f(s, a)$ for some deterministic function $f$.

C.2 Details on Algorithms and Hyper-parameters

Hindsight Expectation Maximization. In our implementation, we take the policy network $\pi_\theta(a \mid s, g)$ to be a state-goal conditional Gaussian distribution $\pi_\theta(a \mid s, g) = \mathcal{N}(\mu_\theta(s, g), \sigma^2)$ with a parameterized mean $\mu_\theta(s, g)$ and a global standard deviation $\sigma$. The mean network takes the concatenated vector $[x, g]$ as input, has fully-connected hidden layers interleaved with $\mathrm{relu}(x)$ non-linear activation functions, and outputs a vector $\mu_\theta(s, g) \in \mathbb{R}^{|\mathcal{A}|}$.

hEM alternates between data collection using the policy and policy optimization with the EM algorithm. During data collection, the output action is perturbed by Gaussian noise, $a' = \mathcal{N}(0, \sigma_a^2) + a$, $a \sim \pi_\theta(\cdot \mid s, g)$, with a fixed scale $\sigma_a$. Note that injecting noise into actions is a common practice in off-policy RL algorithms to ensure sufficient exploration [3, 4]. The baseline hEM collects data with $N = 30$ parallel MPI actors, each with $k = 20$ trajectories. When sampling the hindsight goal given trajectories, we adopt the future strategy specified in HER [2]: in particular, at state $s$, future achieved goals are uniformly sampled at training time as $q^h(\tau, g)$. All parameters are optimized with the Adam optimizer [30] with a fixed learning rate $\alpha$. By default, we run $M = 30$ parallel MPI workers for data collection and training; at each iteration hEM collects $N = 20$ trajectories from the environment. For the image-based Reacher and Fetch robot, hEM collects $N = 80$ trajectories.
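A sketch of such a policy network in PyTorch; the two hidden layers of width 256 are our assumptions, as the exact layer counts and widths are elided in the text.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Goal-conditioned Gaussian policy pi_theta(a|s,g) = N(mu_theta(s,g), sigma^2)
    with a global (state-independent) standard deviation."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # global std

    def log_prob(self, s, g, a):
        mu = self.mu(torch.cat([s, g], dim=-1))
        dist = torch.distributions.Normal(mu, self.log_sigma.exp())
        return dist.log_prob(a).sum(-1)   # plugged into the M-step objective
```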
Hindsight Experience Replay. By design in [2], HER is combined with off-policy learning algorithms such as DQN or DDPG [3, 4]. We describe the details of DDPG. The algorithm maintains a Q-function $Q_\theta(s, a, g)$ parameterized similarly to a universal value function [41]: the network takes as input the concatenated vector $[x, a, g]$, has hidden layers with $h = 256$ hidden units per layer interleaved with $\mathrm{relu}(x)$ non-linear activation functions, and outputs a single scalar. The policy network $\pi_\theta(s, g)$ takes the concatenated vector $[x, g]$ as input, has the same intermediate architecture as the Q-function network, and outputs the action vector $\pi_\theta(s, g) \in \mathbb{R}^{|\mathcal{A}|}$. We take the implementation from OpenAI Baselines [42]; all missing hyper-parameters are the default hyper-parameters in the code base. Across all tasks, HER is run with $M = 20$ parallel MPI workers, as specified in [42].
Image-based architecture. When the state or goal is image-based, the Q-function network and policy network apply a convolutional network to extract features. For example, let $s, g \in \mathbb{R}^{w \times w \times 3}$ with $w \in \{48, 64\}$ be raw images, and let $f_\theta(s), f_\theta(g)$ be the features output by the convolutional network. These features are concatenated before passing through the fully-connected networks described above. The convolutional network stacks three layers with 32, 64, and 64 feature maps respectively, each followed by a $\mathrm{relu}$ activation; each layer is specified by a triple $[n_f, r_f, s_f]$, where $n_f$ is the number of feature maps, $r_f$ the feature patch dimension, and $s_f$ the stride.
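An equivalent PyTorch sketch of the encoder; the 32/64/64 feature-map counts follow the text, while the kernel sizes and strides below are standard defaults we assume, since the exact $[r_f, s_f]$ values are garbled in extraction.

```python
import torch.nn as nn

# Feature extractor f_theta for w x w x 3 image inputs (channels-first in PyTorch).
conv_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # [32, r_f, s_f]
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # [64, r_f, s_f]
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # [64, r_f, s_f]
    nn.Flatten(),
)
```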
C.3 Ablation study

Ablation study on the effect of $N$. hEM collects $N$ trajectories at each training iteration. We vary $N$ over a range of values (from $N = 5$ up to $N = 80$) on two challenging domains, Flip bit with $K = 50$ and the image-based Fetch robot task, and evaluate the corresponding performance. See Figure 9. We see that, in general, larger $N$ tends to lead to better performance. For example, when $N = 5$, hEM learns slowly on Flip bit; when $N = 80$, hEM generally achieves faster convergence and better asymptotic performance across both tasks. We speculate that this is partly because with large $N$ the algorithm can have a larger coverage over goals (larger support over goals in the language of Theorem 3). With small $N$, the policy might converge prematurely and hence learn slowly. Similar observations have been made for HER, where the algorithm performs better with a large number of MPI workers (effectively large $N$).

Ablation on image-based goals. To further assess the robustness of hEM against image-based inputs, we consider the Sawyer robot where both states and goals are image-based. This differs from the experiments shown in Figure 5, where only states are image-based. In Figure 9(c), we see that the performance of hEM does not degrade even when goals are image-based, and is roughly agnostic to the size of the image. Contrast this with HER, which does not make significant progress even when only states are image-based.
Figure 9: Ablation study. Plots (a) and (b): the effect of the data collection size $N$ ((a) Flip bit, (b) Fetch robot). Plot (c): the effect of image-based inputs for both states and goals (Sawyer robot). 'hEM-48' refers to image-based inputs of size $48 \times 48 \times 3$.

C.4 Comparison between hEM and HPG
We do not list HPG as a major baseline for comparison in the main paper, primarily for a few reasons: by design, the HPG agent tackles discrete action spaces (see the author code base https://github.com/paulorauber/hpg), while many goal-conditioned baselines of interest [2, 31, 40] use continuous action spaces. Also, in [24] the authors did not report comparisons to traditional baselines such as HER, and only reported cumulative rewards instead of success rate as the evaluation criterion. Here, we compare hEM with HPG on a few discrete benchmarks provided in [24] to assess their performance.
Details on HPG. The HPG implementation is based on the author code base. [24] proposes several HPG variants with different policy gradient variance reduction techniques [26], and we take the HPG variant with the highest performance as reported in the paper. Throughout the experiments we set the learning rate to a fixed value and all other hyper-parameters take default values.
Benchmarks. We compare hEM and HPG on Flip bit with $K = 25$ and the four room environment. The details of the Flip bit environment can be found in the main paper. The four room environment is used as a benchmark task in [24], where the agent navigates a grid world with four rooms to reach a target location within an episodic time limit. The agent has access to four actions, which move the agent in four directions. The trial is successful only if the agent reaches the goal in time.

Results. We show results in Figure 10. For Flip bit with $K = 25$, HPG and hEM behave similarly: both algorithms reach near-optimal performance quickly and have similar convergence speed. When the state space increases to $K = 50$, HPG does not make any progress while the performance of hEM steadily improves. Finally, for the four room environment, we see that though the performance of HPG initially increases as quickly as that of hEM, its success rate quickly saturates to a level significantly below the asymptotic performance of hEM. These observations show that hEM performs much more robustly and significantly better than HPG, especially in challenging environments.
Figure 10: Comparison between hEM and HPG: (a) Flip bit $K = 25$, (b) Flip bit $K = 50$, (c) four room. HPG performs well on the Flip bit MDP with $K = 25$, but when $K = 50$ its performance drops drastically. HPG also underperforms hEM on the four room environment, where it makes fast progress initially but saturates to a low sub-optimal level.