Meta-Model-Based Meta-Policy Optimization
Takuya Hiraoka, Takahisa Imagawa, Voot Tangkaratt, Takayuki Osa, Takashi Onishi, Yoshimasa Tsuruoka
NEC Corporation, National Institute of Advanced Industrial Science and Technology, RIKEN Center for Advanced Intelligence Project, Kyushu Institute of Technology, and The University of Tokyo
Abstract
Model-based reinforcement learning (MBRL) has been applied to meta-learning settings and demonstrated its high sample efficiency. However, in previous MBRL for meta-learning settings, policies are optimized via rollouts that fully rely on a predictive model for an environment, and thus its performance in a real environment tends to degrade when the predictive model is inaccurate. In this paper, we prove that the performance degradation can be suppressed by using branched meta-rollouts. Based on this theoretical analysis, we propose meta-model-based meta-policy optimization (M3PO), in which the branched meta-rollouts are used for policy optimization. We demonstrate that M3PO outperforms existing meta reinforcement learning methods in continuous-control benchmarks.
1 Introduction

Reinforcement learning (RL) methods have achieved remarkable successes in many decision-making tasks, such as playing video games or controlling robots (e.g., [9, 20]). In conventional RL methods, when multiple tasks are to be solved, a policy for each task is independently learned with millions of training samples from the environment for the individual tasks. This independent learning with a large number of samples prevents the conventional RL methods from being applied to practical problems that require a variety of policies to solve multiple tasks (e.g., robotic manipulation problems involving grasping or moving different types of objects [39]). Meta-learning methods [29, 35] have recently gained much attention as a solution to this problem; they learn a structure shared across the tasks by using a large number of samples collected from a part of the tasks. Once learned, these methods can adapt quickly to new (the remaining) tasks with a small number of samples given in them.

In previous work, meta RL methods have been introduced into both model-free and model-based settings. For model-free settings, there are two main types of approaches proposed so far: recurrent-based policy adaptation [4, 19, 26, 37] and gradient-based policy adaptation [1, 6, 7, 10, 27, 31]. In these approaches, policies adapt to a new task by leveraging the history of past trajectories; following previous work [3], we refer to these adaptive policies as meta-policies in our paper. These model-free RL methods with meta-learning require more training samples than conventional RL because, in addition to learning control policies, learning of policy adaptation is also required [18].

For model-based settings, Saemundsson et al. [28] use a predictive model (also known as a transition model) based on a latent variable Gaussian process for model predictive control. Nagabandi et al. [22, 23] introduced both recurrent-based and gradient-based meta-learning methods into model-based RL. In these works, the predictive models adapt to a new task by leveraging the history of past trajectories; in analogy to the meta-policy, we refer to these adaptive predictive models as meta-models in our paper. In general, these meta-model-based RL approaches are more sample efficient than the model-free approaches because they leverage more supervision. However, in previous meta-model-based RL, the meta-policy (or the course of future actions) is optimized via rollouts relying fully on the meta-model, and thus its performance in a real environment tends to degrade when the meta-model is inaccurate.

In this paper, we address this problem in meta-model-based RL. First, as preliminaries, we review related work (Section 2) and partially observable Markov decision processes (POMDPs) (Section 3). Next, we formulate meta-model-based RL as solving a POMDP (Section 4) and then conduct a theoretical analysis of its performance guarantee (Section 5). Our analysis shows that, if the meta-model is used in the branched rollout manner [15], the performance degradation due to meta-model inaccuracy can be significantly suppressed compared with the full meta-model-based rollout case. Based on this analysis, we propose a practical meta-model-based RL method called Meta-Model-based Meta-Policy Optimization (M3PO), in which the meta-model is used in the branched rollout manner (Section 6). Finally, we experimentally demonstrate that M3PO outperforms existing methods in continuous-control benchmarks (Section 7). Our main contributions are as follows: (1) providing performance guarantees of meta-model-based RL (Theorem 1, Theorem 2, and Theorem 3), and (2) proposing and showing the effectiveness of M3PO.
2 Related Work

In this section, we review related work on POMDPs and theoretical analysis in model-based RL.

Partially observable Markov decision processes: In our paper, we formulate model-based RL in a meta-learning setting as solving a POMDP, and provide its performance guarantee under the branched rollout setting. POMDPs are a long-studied problem (e.g., [8, 32, 33]), and there are many works that discuss performance guarantees of RL methods for solving POMDPs. (We include works on Bayes-adaptive MDPs [8, 40] as works on POMDPs because they are a special case of POMDPs.) However, to the best of our knowledge, the performance guarantee of RL methods based on branched rollouts has not been discussed in the literature. On the other hand, some researchers [14, 16, 40] propose model-free RL methods for solving a POMDP without prior knowledge of the accurate model. However, they do not provide theoretical analyses of performance. In this work, by contrast, we propose a model-based RL method, and provide theoretical analyses of its performance guarantee.
Theoretical analysis on the performance of model-based RL: There are previous works in which theoretical analysis on the performance of model-based RL is provided [5, 12, 15, 17, 25]. In these theoretical analyses, standard Markov decision processes (MDPs) are assumed, and the meta-learning (or POMDP) setting is not discussed. In contrast, our work provides a theoretical analysis of the meta-learning (and POMDP) setting, substantially extending Janner et al.'s work [15]. They analyze the performance guarantee of branched rollouts with a predictive model on MDPs, and introduce branched rollouts into a model-based RL algorithm. We extend their analysis and algorithm to the meta-learning (POMDP) case. In addition, as a part of our analysis, we make a modification to their theorems on branched rollouts so that some important premises (e.g., the effect of multiple model-based rollout factors) are more properly reflected in the theorems. See A.1 in the appendix for a more detailed discussion of our contribution.
3 Partially Observable Markov Decision Processes

We formalize our problem with a POMDP, which is defined as the tuple ⟨O, S, A, p_ob, r, γ, p_st⟩. Here, O is the set of observations, S is the set of hidden states, A is the set of actions, p_ob : O × S × A → [0, 1] is the observation probability, p_st : S × S × A → [0, 1] is the state transition probability, r : S × A → R is a reward function, and γ ∈ [0, 1) is a discount factor. At time step t, these functions are used as p_st(s_t | s_{t−1}, a_{t−1}), p_ob(o_t | s_t, a_{t−1}), and r_t = r(s_t, a_t). (For simplicity, we use these probabilities by abbreviating the subscripts "st" and "ob.") The agent cannot directly observe the hidden state, but receives the observation instead. The agent selects an action based on a policy π := p(a_{t+1} | h_t), where h_t is a history (the past trajectories), i.e., h_t := {a_1, o_1, . . . , a_t, o_t}. We denote the set of histories by H. Given the definition of the history, the history transition probability can be defined as p(h_{t+1} | a_{t+1}, h_t) := p(o_{t+1} | h_t). Here, p(o_{t+1} | h_t) := Σ_{s_{t+1}} Σ_{s_t} p(s_t | h_t) p(s_{t+1} | s_t, a_t) p(o_{t+1} | s_{t+1}, a_t), where p(s_t | h_t) is the belief about the hidden state. The goal of reinforcement learning in the POMDP is to find the optimal policy π* that maximizes the expected return R := Σ_{t=0}^{∞} γ^t r_t (i.e., π* = arg max_π E_{a∼π, h∼p}[R]).

4 Meta-Model-Based RL as Solving a POMDP

In this section, we formulate model-based RL in the meta-learning setting as solving a POMDP by using a parameterized meta-policy and meta-model. As with [22, 23], we assume online adaptation situations where the agent can leverage a small number of samples to adapt quickly to a new task. Here, a task specifies the transition probability and the reward function. The task information cannot be observed by the agent, and may change at each step in an episode (i.e., the task information is included in the hidden state s). In the previous meta-learning literature (e.g., [7, 22, 23, 26]), the policy and the predictive model adapt to a new task by using past trajectories in an episode. As we noted earlier, we call this sort of adaptive policy and predictive model a meta-policy and a meta-model, respectively.

We define a meta-policy and a meta-model as π_φ(a_{t+1} | h_t) and p_θ(r_t, o_{t+1} | h_t), respectively. Here, φ and θ are learnable parameters of the meta-policy and the meta-model. r_t and o_{t+1} are assumed to be conditionally independent given h_t, i.e., p_θ(r_t, o_{t+1} | h_t) = p_θ(r_t | h_t) · p_θ(o_{t+1} | h_t). As with p(h_{t+1} | a_{t+1}, h_t), we define the meta-model for the history as p_θ(h_{t+1} | a_{t+1}, h_t) := p_θ(o_{t+1} | h_t).

We use the parameterized meta-model and the meta-policy as in Algorithm 1.

Algorithm 1 Abstract Meta-Model-based Meta-Policy Optimization (Abstract M3PO)
  Initialize meta-policy π_φ, meta-model p_θ, and dataset D.
  for N epochs do
    Collect trajectories from the environment according to π_φ: D = D ∪ {(h_t, o_{t+1}, r_t)}.
    Optimize π_φ and p_θ: (φ, θ) ← arg max_{(φ,θ)} E_{a∼π_φ, h∼p_θ}[R] − C(ε_m(θ), ε_π(φ)).
  end for

In this algorithm, the following procedures are repeated: 1) collection of samples from the real environment with the meta-policy, which are stored in a dataset D, and 2) optimization of the meta-policy and the meta-model to maximize E_{a∼π_φ, h∼p_θ}[R] − C(ε_m(θ), ε_π(φ)).
Here, E_{a∼π_φ, h∼p_θ}[R] is the meta-model return, i.e., the return of the meta-policy on the meta-model (for simplicity, we use the abbreviated style E_{π_φ, p_θ}[R]), and C(ε_m(θ), ε_π(φ)) is a discrepancy depending on the two error quantities ε_m and ε_π (their detailed definitions are introduced in the next section). In the next section, we conduct theoretical analyses of the performance guarantee of Algorithm 1.
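To make the abstract procedure concrete, the following is a minimal Python sketch of the Algorithm 1 loop. The callables passed in (collect_trajectories, optimize_lower_bound) are illustrative placeholders of ours, not the authors' released API; they stand for sample collection in the real environment and for joint optimization of (φ, θ) against the lower bound.

```python
from typing import Any, Callable, List, Tuple

def abstract_m3po(
    collect_trajectories: Callable[[], List[Tuple[Any, Any, float]]],  # returns tuples (h_t, o_{t+1}, r_t)
    optimize_lower_bound: Callable[[List], None],  # maximizes E[R] - C(eps_m, eps_pi) w.r.t. (phi, theta)
    n_epochs: int,
) -> List:
    """Minimal sketch of the abstract M3PO loop (Algorithm 1); assumed interfaces, not the released code."""
    dataset: List = []  # D: trajectories collected in the real environment
    for _ in range(n_epochs):
        # 1) Collect trajectories from the environment according to the current meta-policy pi_phi.
        dataset.extend(collect_trajectories())
        # 2) Optimize pi_phi and p_theta against the lower bound of the true return.
        optimize_lower_bound(dataset)
    return dataset
```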
Supplement: Some readers may be concerned that "in a typical meta-RL setting, the objective to be optimized is usually defined as the expected return with respect to the task distribution, but such an objective is not included in the formulation in Section 4." In A.4 in the appendix, we show that such an objective can actually be derived from E_{a∼π_φ, h∼p_θ}[R] by specializing our formulation (i.e., our formulation covers the typical meta-RL setting). Nevertheless, in this paper, we avoid such specialization in order to provide analysis and algorithms covering a wider range of meta-RL settings.

5 Theoretical Analysis

In this section, we analyze the performance guarantee of meta-model-based RL with an inaccurate meta-model. In Section 5.1, we provide the performance guarantee in the full meta-model-based rollout case (Theorem 1) as a baseline to be improved. In Section 5.2, we introduce the notion of branched meta-rollouts and show that, under the one-step branched meta-rollout, performance degradation due to meta-model error can be suppressed compared with the full meta-model-rollout case (Theorem 2). In Section 5.3, we show that, by considering 1) the meta-model error under the current meta-policy and 2) an appropriately adjusted length for the branched meta-rollout, the performance degradation can be further suppressed (Theorem 3). These analysis results indicate that the branched meta-rollout with an appropriate rollout length is a better way of using the model for suppressing the performance degradation due to the meta-model error than the full meta-model-based rollout.
5.1 Performance Guarantee under Full Meta-Model-Based Rollouts

Our goal is to outline a theoretical framework in which we can provide performance guarantees for Algorithm 1. To show the guarantees, we construct a lower bound taking the following form: E_{π_φ, p}[R] ≥ E_{π_φ, p_θ}[R] − C(ε_m(θ), ε_π(φ)), where E_{π_φ, p}[R] denotes the true return, i.e., the return of the meta-policy in the true environment. The discrepancy between these returns, C, which determines the amount of performance degradation, can be expressed as a function of two error quantities of the meta-model: generalization error due to sampling, and distribution shift due to the updated meta-policy. For our analysis, we define the bounds of the generalization error ε_m and the distribution shift ε_π as follows:

Definition 1. ε_m(θ) := max_t E_{a∼π_D, h∼p}[D_TV(p(h_{t+1} | a_{t+1}, h_t) || p_θ(h_{t+1} | a_{t+1}, h_t))], where D_TV is the total variation distance and π_D is the data collection policy that the actions contained in D follow.

Definition 2. ε_π(φ) := max_{h_t} D_TV(π_D(a_{t+1} | h_t) || π_φ(a_{t+1} | h_t)).

In addition to these error bounds, we assume that we can know a bound on the reward expected on the basis of a belief: r_max > |Σ_{s_t} p(s_t | h_t) r(s_t, a_t)|.

Now we present our bound, which is an extension of the theorem proposed in [15] to our setting.

Theorem 1 (The POMDP extension of Theorem 4.1 in [15]). Let the expected TV-distance between the two history transition distributions be bounded at each timestep by ε_m and the meta-policy divergence be bounded by ε_π. Then the true return is bounded from below by the meta-model return of the meta-policy and a discrepancy:

E_{π_φ, p}[R] ≥ E_{π_φ, p_θ}[R] − r_max [ 2γ(ε_m + 2ε_π)/(1 − γ)² + 4ε_π/(1 − γ) ],   (1)

where the bracketed quantity multiplied by r_max is the discrepancy C(ε_m(θ), ε_π(φ)).

This theorem implies that the discrepancy of the returns under full meta-model rollouts scales linearly with both ε_m and ε_π. If we can reduce the discrepancy, the two returns are closer to each other and thus the performance degradation is more significantly suppressed. In the next section, we discuss a new meta-model usage to reduce the discrepancy induced by the meta-model error (i.e., ε_m).

5.2 Branched Meta-Rollouts

The analysis of Theorem 1 relies on running full rollouts through the meta-model, allowing meta-model errors to compound. This is reflected in the bound by a factor scaling quadratically with the effective horizon, 1/(1 − γ). In such cases, we can improve the algorithm by choosing to rely less on the meta-model and instead more on real data collected from the true environment when the meta-model is inaccurate.

In order to allow for dynamic adjustment between meta-model-based and model-free rollouts, we introduce the notion of a branched meta-rollout, in which we begin a rollout from a history under the previous meta-policy's history distribution p_{π_D}(h_t) and run k steps according to π_φ under the learned meta-model p_θ. (This can be seen as an extension of the "branched rollout" proposed in [15, 34] to meta-learning settings. As with [15], to simplify our analysis, we assume that the meta-model can accurately estimate the reward; we discuss the case in which the reward prediction of the meta-model is inaccurate in A.6 in the appendix.) Under such a scheme, the true return can be bounded from below as follows:

Theorem 2. Under the k-branched meta-rollout, using the bound of the meta-model error under π_D, ε_m = max_t E_{a∼π_D, h∼p}[D_TV(p(h′ | h, a) || p_θ(h′ | h, a))], the bound of the meta-policy shift ε_π = max_{h_t} D_TV(π_D || π_φ), and the return on the meta-model E_{(a,h)∼D_model}[R], where D_model is the set of samples collected through branched rollouts, the following inequality holds:

E_{π_φ, p}[R] ≥ E_{(a,h)∼D_model}[R] − r_max { 2γ ε_π/(1 − γ)² + [(γ − kγ^k + (k − 1)γ^{k+1})/(1 − γ)²](ε_π + ε_m) + [(γ^k − γ)/(γ − 1)](ε_π + ε_m) + [γ^k/(1 − γ)](k + 1)(ε_π + ε_m) }.   (2)

The discrepancy factors relying on ε_m at k = 1 in Theorem 2 are smaller than the discrepancy factors relying on ε_m in Theorem 1 (see Corollary 1 in the appendix). This indicates that the performance degradation due to the meta-model error can be suppressed more than under the full meta-model-based rollout.

Figure 1: Discrepancy between E_{π_φ, p}[R] and E_{(a,h)∼D_model}[R] in Theorem 2 (a) and in Theorem 3 (b). Each vertical axis represents the rollout length k and each horizontal axis represents ε_m, ε_m′ ∈ [0, 1]. In all figures, for evaluating the discrepancy values, we set the other variables as r_max = 1, ε_π = 1 − ε_m for (a), and ε_π = 1 − ε_m′ for (b).

5.3 Branched Meta-Rollouts with an Adjusted Rollout Length

Figure 1a shows that the discrepancy value in Theorem 2 tends to monotonically increase as the value of k increases, regardless of the values of γ and ε_m; this means that the optimal value of k is always 1. However, intuitively, we may expect that there is an optimal value of k higher than 1 when the meta-model error is small; as will be shown in this section, this intuition is correct. One of the main reasons for the mismatch between this intuition and the tendency of the discrepancy in Theorem 2 is that, in its analysis, the meta-model error under the data collection meta-policy π_D (i.e., ε_m) is used instead of that under the current meta-policy π_φ. Ignoring the meta-model error under the current meta-policy induces a pessimistic estimation of the discrepancy in the analysis (see the evaluation of "term B" and "term C" in the proof of Theorem 4 via the proof of Theorem 2 in the appendix for more details).
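The monotone trend in Figure 1a can be checked numerically. The short script below evaluates the discrepancy term of Theorem 2 for a grid of k values; it is a minimal sketch that assumes the discrepancy takes exactly the form written in Eq. (2), with r_max = 1 and ε_π = 1 − ε_m as in the figure.

```python
import numpy as np

def discrepancy_theorem2(k, gamma, eps_m, eps_pi, r_max=1.0):
    """Discrepancy term of Eq. (2), assuming the form written above."""
    term1 = 2 * gamma * eps_pi / (1 - gamma) ** 2
    term2 = (gamma - k * gamma**k + (k - 1) * gamma**(k + 1)) / (1 - gamma) ** 2 * (eps_pi + eps_m)
    term3 = (gamma**k - gamma) / (gamma - 1) * (eps_pi + eps_m)
    term4 = gamma**k / (1 - gamma) * (k + 1) * (eps_pi + eps_m)
    return r_max * (term1 + term2 + term3 + term4)

# Setting of Figure 1a: r_max = 1 and eps_pi = 1 - eps_m; the value grows with k.
for eps_m in (0.1, 0.5, 0.9):
    values = [discrepancy_theorem2(k, gamma=0.99, eps_m=eps_m, eps_pi=1 - eps_m)
              for k in range(1, 11)]
    print(eps_m, [round(v, 1) for v in values])
```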
To estimate the discrepancy more tightly, as in [15], we introduce the assumption that the meta-model error under the current meta-policy can be approximated by ε_π and ε_m:

Assumption 1. An approximated meta-model error under the current policy, ε_m′: ε_m′(ε_π) ≈ ε_m + ε_π dε_m′/dε_π, where dε_m′/dε_π is the local change of ε_m′ with respect to ε_π.

To see the tendency of this approximated meta-model error, we plot the empirical value of dε_m′/dε_π, varying the size of training samples for the meta-model, in Figure 7 in A.10. The figure shows that, as the training sample size increases, the value tends to gradually approach zero. This means that training meta-models with more samples provides better generalization on nearby distributions. Equipped with the approximated meta-model error on the distribution of the current meta-policy π_φ, we arrive at the following bound:

Theorem 3. Let ε_m′ ≥ max_t E_{a∼π_φ, h∼p}[D_TV(p(h′ | h, a) || p_θ(h′ | h, a))]. Then

E_{π_φ, p}[R] ≥ E_{(a,h)∼D_model}[R] − r_max { 2γ ε_π/(1 − γ)² + [(γ − kγ^k + (k − 1)γ^{k+1})/(1 − γ)²](ε_m′ − ε_π) + [(γ^k − γ)/(γ − 1)](ε_m′ − ε_π) + [γ^k/(1 − γ)](k + 1)(ε_m′ − ε_π) }.   (3)

Given that ε_m = ε_m′, it is obvious that the discrepancy in Theorem 3 is equal to or smaller than that in Theorem 2. In the discrepancy in Theorem 3, all terms except the first term become negative when ε_m′ < ε_π. (Such situations may arise when the meta-model is learnt with a large amount of data and dε_m′/dε_π is nearly zero (recall the result in Figure 7), or when the environment dynamics over tasks are simple and easily approximated by the meta-model.) This implies that the optimal k that minimizes the discrepancy can take on values higher than 1 when the meta-model error is relatively small. The empirical trend of the discrepancy value (Figure 1b) supports this; when ε_m′ is lower than 0.5 (i.e., ε_m′ < ε_π), the discrepancy values decrease as the value of k grows, regardless of the value of γ. This result motivates us to set k to a value higher than 1, in accordance with the meta-model error, to reduce the discrepancy.

To recap our analysis, the use of branched meta-rollouts with appropriate rollout lengths (i.e., the value of k) contributes to reducing the returns' discrepancy relying on the meta-model error and thus leads to suppressing the performance degradation due to the meta-model inaccuracy.

6 Meta-Model-Based Meta-Policy Optimization (M3PO)

In the previous section, we showed that the use of branched meta-rollouts with appropriate rollout lengths can suppress the performance degradation induced by meta-model inaccuracy. In this section, based on this result, we propose to modify Algorithm 1 so that the meta-policy and the meta-model are optimized on the basis of the lower bound of the true return with branched meta-rollouts, E_{(a,h)∼D_model}[R] − C(ε_m(θ), ε_π(φ)), instead of E_{π_φ, p_θ}[R] − C(ε_m(θ), ε_π(φ)).

Algorithm 2 Meta-Model-based Meta-Policy Optimization with Deep RL (M3PO)
1: Initialize meta-policy π_φ, meta-model p_θ, environment dataset D_env, meta-model dataset D_model.
2: for N epochs do
3:    Train meta-model p_θ with D_env: θ ← arg max_θ E_{D_env}[p_θ(r_t, o_{t+1} | h_t)]
4:    for E steps do
5:       Take actions according to π_φ; add the trajectory to D_env
6:       for M model rollouts do
7:          Sample h_t uniformly from D_env
8:          Perform a k-step meta-model rollout starting from h_t using meta-policy π_φ; add the fictitious trajectories to D_model
9:       end for
10:      for G gradient updates do
11:         Update policy parameters with D_model: φ ← φ − ∇_φ J_{D_model}(φ)
12:      end for
13:   end for
14: end for

More specifically, we propose the following modifications to Algorithm 1:
Meta-policy optimization: The meta-policy is optimized with the branched meta-rollouts E_{(a,h)∼D_model}[R]. For the optimization, we adapt PEARL [26] because it achieved good learning performance in meta-learning settings [26]. To adapt PEARL, we use the imagination trajectories generated from the branched meta-rollouts for optimizing the meta-policy (and value functions). To stabilize the learning, we omit C(ε_m(θ), ε_π(φ)) from the optimization objective for the meta-policy. π_φ is optimized by using the gradient of the optimization objective J_{D_model}(φ) := E_{(a,h)∼D_model}[D_KL(π_φ || exp(Q^{π_φ} − V^{π_φ}))], where D_KL is the Kullback-Leibler divergence, Q^{π_φ}(a_{t+1}, h_t) := E_{(r,h)∼D_model}[R | a_{t+1}, h_t], and V^{π_φ}(h_t) := Σ_{a_{t+1}} Q^{π_φ}(a_{t+1}, h_t) π_φ(a_{t+1} | h_t).
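As a concrete illustration of this objective, the sketch below evaluates D_KL(π_φ || exp(Q − V)) for a single history with a discrete action set, following the definitions above. It is a minimal sketch of ours, not the authors' implementation; in particular, the normalization of exp(Q − V) is our addition so that the KL divergence is well defined.

```python
import numpy as np

def policy_kl_objective(q_values: np.ndarray, policy_probs: np.ndarray) -> float:
    """Sketch of J(phi) = D_KL(pi_phi || exp(Q - V)) for one history h_t, discrete actions.

    V(h_t) = sum_a Q(a, h_t) * pi_phi(a | h_t), as defined in the text above.
    The renormalization of exp(Q - V) is an assumption of this sketch.
    """
    v = np.sum(q_values * policy_probs)        # V^{pi_phi}(h_t)
    target = np.exp(q_values - v)              # unnormalized target density exp(Q - V)
    target = target / target.sum()             # normalize so the KL is well defined
    return float(np.sum(policy_probs * np.log(policy_probs / target)))

# Toy usage: two candidate actions for one history sampled from D_model.
print(policy_kl_objective(q_values=np.array([1.0, 0.5]),
                          policy_probs=np.array([0.6, 0.4])))
```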
Meta-model optimization: The meta-model is optimized to minimize the discrepancy (i.e., ε_m). In practice, to simplify and stabilize learning, we learn the meta-model via maximum likelihood estimation. (The transition of ε_m over training epochs is shown in Figure 8 in the appendix; it indicates that the model error tends to decrease as epochs elapse.) For the meta-model, to reduce model bias, we use a bootstrap ensemble of B dynamics models {p¹_θ, ..., p^B_θ}. Here p^i_θ(r_t, o_{t+1} | h_t) is the i-th conditional Gaussian distribution with diagonal covariance: p^i_θ(r_t, o_{t+1} | h_t) = N(r_t, o_{t+1} | μ^i_θ(h_t), σ^i_θ(h_t)), where μ^i_θ and σ^i_θ are the mean and the standard deviation, respectively. In our implementation, we use a recurrent architecture inspired by [23, 26]; at each evaluation of the model, {a_1, o_1, ..., a_{t−1}, o_{t−1}} in h_t is fed to a recurrent neural network, and then its hidden unit output and {a_t, o_t} in h_t are fed to a feed-forward neural network that outputs the mean and standard deviation of the Gaussian distribution. We use the gated recurrent unit [2] for the recurrent neural network. (We also tried to use the gradient-based MAML architecture [7] for our meta-models, as with [23], but were unable to obtain reasonable results with it.)

The resulting algorithm is shown in Algorithm 2. The modifications "Meta-model optimization" and "Meta-policy optimization" in the last paragraph are reflected in line 3 and lines 4-13, respectively. In addition, as indicated by Theorem 2 and Theorem 3, appropriately setting k contributes to decreasing the discrepancy, and leads to the suppression of the performance degradation in the real environment. Thus, we treat k as a hyperparameter, and set it to different values for different environments so that the discrepancy decreases. For the experiments described in the next section, we set the hyperparameters for this algorithm by grid search. A search result for the hyperparameter values is described in Table 1 in the appendix.
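The k-step branched meta-rollout loop of Algorithm 2 (lines 5-8) can be sketched as follows. The helper interfaces (meta_policy, meta_model, append_history) are illustrative placeholders of ours rather than the released implementation.

```python
import random
from typing import Any, Callable, List, Tuple

def branched_meta_rollouts(
    d_env: List[Any],                                     # histories h_t stored in D_env
    meta_policy: Callable[[Any], Any],                    # a ~ pi_phi(. | h)
    meta_model: Callable[[Any, Any], Tuple[Any, float]],  # (o', r) ~ p_theta(. | h, a)
    append_history: Callable[[Any, Any, Any], Any],       # h' = h extended with (a, o')
    num_rollouts: int,
    k: int,
) -> List[Tuple[Any, Any, float]]:
    """Generate fictitious (h, a, r) tuples for D_model via k-step branched meta-rollouts."""
    d_model = []
    for _ in range(num_rollouts):
        h = random.choice(d_env)          # sample a real history uniformly from D_env
        for _ in range(k):                # roll the meta-model forward for k steps
            a = meta_policy(h)
            o_next, r = meta_model(h, a)
            d_model.append((h, a, r))
            h = append_history(h, a, o_next)
    return d_model
```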
7 Experiments

In this section, we report our experiments aiming to answer the following questions. Q.1: Can our method (M3PO) outperform existing meta-RL methods? Q.2: How do meta-model rollout lengths affect the actual performance?

In our experiments, we compare our method (M3PO) with baseline methods:
PEARL [26] and
Learning to adapt (L2A) [22]. More detailed information is described in A.8 in the appendix. We conduct a comparative evaluation of the methods on a variety of simulated robot environments using the MuJoCo physics engine [36]. We prepare the environments proposed in the meta-RL [6, 22, 26, 27] and robust-RL [13, 24] literature: Halfcheetah-fwd-bwd, Halfcheetah-pier, Ant-fwd-bwd, Ant-crippled-leg, Walker2D-randomparams, and Humanoid-direc. In all the environments, the agent is required to adapt to a fluidly changing task that the agent cannot directly observe. Detailed information about each environment is described in A.9 in the appendix. (The source code to replicate the experiments will be made public after the acceptance of this paper.)

Regarding Q1, our experimental results indicate that M3PO outperforms existing meta RL methods. In Figure 2, the learning curves of M3PO and existing meta-RL methods (L2A and PEARL) in the meta-training phase are shown, and they indicate that the sample efficiency of M3PO is better than that of L2A and PEARL. (Note that, in an early stage of the training phase, there are many test episodes in which unseen tasks appear. Therefore, the improvement of M3PO over L2A and PEARL at the early stage of learning indicates its high adaptation capability for unseen situations.) L2A is trapped at a poor performance (return) that is not improved even when the training sample size increases, whereas M3PO avoids being trapped at such a poor performance. PEARL can improve meta-policy performance via training in all environments, but the degree of improvement of PEARL is smaller than that of M3PO. On the other hand, in some of the environments (e.g., Halfcheetah-pier), as the training sample size increases, the relative performance of M3PO against PEARL becomes worse. This indicates that, as with [21], dynamic switching from M3PO to PEARL or other model-free approaches needs to be considered to further improve overall performance.

Figure 2: Learning curves. In each figure, the vertical axis represents returns, and the horizontal axis represents the number of training samples (x1000). The meta-policy and meta-model are fixed and evaluated in terms of their expected returns on 50 episodes at every 5000 training samples for L2A and every 1000 training samples for the other methods. In each episode, the task is initialized and changed randomly. Each method is evaluated in at least five trials, and the expected return on the 50 episodes is further averaged over the trials. The averaged expected returns and their standard deviations are plotted in the figures. We assigned an almost equal time budget to each method, and the trials completed in the time frame are used for calculating performance. Therefore, the number of trials of each method is not necessarily equal to that of the others.
Regarding Q2, we conducted an evaluation of M3PO by varying its model-rollout length k. The evaluation results (Figure 3) indicate that the performance is degraded when the model-rollout length is long (i.e., when the performance is significantly affected by the meta-model error). We can see significant performance degradation especially in Ant-fwd-bwd and Humanoid-direc. In Ant-fwd-bwd, the performance at k = 100 is significantly worse than that at k = 10. In Humanoid-direc, the performance at k = 5 is significantly worse than that at k = 1 (i.e., the one described in Figure 2). As we have seen, the performance degradation in Humanoid-direc is more sensitive to the model-rollout length than that in Ant-fwd-bwd. One of the reasons for this is that the meta-model error in Humanoid-direc is larger than that in Ant-fwd-bwd (Figure 8 in the appendix).
Figure 3: Learning curves of M3PO. In each figure, the vertical axis represents returns, and the horizontal axis represents the number of training samples (x1000). The meta-policy and meta-model are fixed and evaluated in terms of their expected returns on 50 episodes per 1000 training samples. We run experiments by varying the rollout length k of M3PO. In these experiments, the values of the hyperparameters except k are the same as those described in Table 1. Each case is evaluated in at least five trials, and the expected return on the 50 episodes is further averaged over the trials. The averaged expected returns and their standard deviations are plotted in the figures.

8 Conclusion
In this paper, we analyzed the performance guarantee (and performance degradation) of model-based RL in a meta-learning setting. We first formulated model-based reinforcement learning in a meta-learning setting as solving a POMDP. We then conducted a theoretical analysis of the performance guarantee under both the full model-based rollout case and the branched meta-rollout case. Based on the theoretical result, we introduced branched meta-rollouts into policy optimization and proposed M3PO. Our experimental results show that it achieves better sample efficiency than a variant of PEARL and L2A. We also discussed important future work for improving M3PO (e.g., a study of dynamic switching from M3PO to model-free approaches).

Broader Impact
Our work contributes to reducing various sorts of costs for training adaptive policies. The reduction of these costs enables us to apply RL to many practical problems where the costs had been a bottleneck for the application of RL. Such practical problems may include ones where people are part of the environment and agents have a chance to harm them. We encourage researchers and engineers to pay more attention to the safety aspect when they use RL methods for solving such problems.
References [1] Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuousadaptation via meta-learning in nonstationary and competitive environments. In Proc. ICLR,2018.[2] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., andBengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machinetranslation. In Proc. EMNLP, 2014.[3] Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-basedreinforcement learning via meta-policy optimization. In Proc. CoRL, pp. 617–629, 2018.[4] Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl : Fastreinforcement learning via slow reinforcement learning. In Proc. ICLR, 2017.[5] Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. Model-basedvalue expansion for efficient model-free reinforcement learning. In Proc. ICML, 2018.[6] Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradientdescent can approximate any learning algorithm. In Proc. ICLR, 2018.[7] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deepnetworks. In Proc. ICML, pp. 1126–1135, 2017.[8] Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al. Bayesian reinforcement learning:A survey. Foundations and Trends R (cid:13) in Machine Learning, 8(5-6):359–483, 2015.[9] Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manip-ulation with asynchronous off-policy updates. In Proc. ICRA, pp. 3389–3396. IEEE, 2017.[10] Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning ofstructured exploration strategies. In Proc. NeurIPS, pp. 5302–5311, 2018.[11] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximumentropy deep reinforcement learning with a stochastic actor. In Proc. ICML, pp. 1856–1865,2018.[12] Henaff, M. Explicit explore-exploit algorithms in continuous state spaces. In Proc. NeurIPS,2019.[13] Hiraoka, T., Imagawa, T., Mori, T., Onishi, T., and Tsuruoka, Y. Learning robust options byconditional value at risk optimization. In Proc. NeurIPS, 2019.[14] Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcementlearning for POMDPs. In Proc. ICML, pp. 2117–2126, 2018.[15] Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policyoptimization. In Proc. NeurIPS, 2019.[16] Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deepreinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019.1017] Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In Proc. ICLR, 2018.[18] Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S., and Finn, C. Guided meta-policysearch. In Proc. NeurIPS, pp. 9653–9664, 2019.[19] Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner.In Proc. ICLR, 2018.[20] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deepreinforcement learning. Nature, 518(7540):529–533, 2015.[21] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. 
In Proc. ICRA, pp. 7559–7566,2018.[22] Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learningto adapt in dynamic, real-world environments via meta-reinforcement learning. In Proc. ICLR,2019.[23] Nagabandi, A., Finn, C., and Levine, S. Deep online learning via meta-learning: Continualadaptation for model-based RL. In Proc. ICLR, 2019.[24] Rajeswaran, A., Ghotra, S., Levine, S., and Ravindran, B. EPOpt: Learning Robust NeuralNetwork Policies Using Model Ensembles. In Proc. ICLR, 2017.[25] Rajeswaran, A., Mordatch, I., and Kumar, V. A game theoretic framework for model basedreinforcement learning, 2020.[26] Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proc. ICML, pp. 5331–5340,2019.[27] Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. Promp: Proximal meta-policysearch. In Proc. ICLR, 2019.[28] Sæmundsson, S., Hofmann, K., and Deisenroth, M. P. Meta reinforcement learning with latentvariable Gaussian processes. arXiv preprint arXiv:1803.07551, 2018.[29] Schmidhuber, J., Zhao, J., and Wiering, M. Simple principles of metalearning. Technicalreport IDSIA, 69:1–23, 1996.[30] Silver, D. and Veness, J. Monte-Carlo planning in large POMDPs. In Proc. NIPS, pp. 2164–2172, 2010.[31] Stadie, B. C., Yang, G., Houthooft, R., Chen, X., Duan, Y., Wu, Y., Abbeel, P., and Sutskever,I. Some considerations on learning to explore via meta-reinforcement learning. arXiv preprintarXiv:1803.01118, 2018.[32] Sun, W. Towards Generalization and Efficiency in Reinforcement Learning. PhD thesis,Carnegie Mellon University, 2019.[33] Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL incontextual decision processes: PAC bounds and exponential improvements over model-freeapproaches. In Proc. COLT, pp. 2898–2933, 2019.[34] Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximat-ing dynamic programming. In Proc. ML, pp. 216–224. Elsevier, 1990.[35] Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In science & businessmedia, pp. 3–17. Springer, 1998.[36] Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. InProc. IROS, pp. 5026–5033. IEEE, 2012. 1137] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blun-dell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprintarXiv:1611.05763, 2016.[38] Williams, G., Aldrich, A., and Theodorou, E. Model predictive path integral control usingcovariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.[39] Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world:A benchmark and evaluation for multi-task and meta reinforcement learning. In Proc. CoRL,2019.[40] Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S.VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. In Proc. ICLR,2020. 12 Appendices
A.1 How does our work differ from Janner et al.’s work [15]?
Although our work is grounded primarily on the basis of Janner et al.'s work [15], we provide non-trivial contributions on both theoretical and practical frontiers: (1) We provide theorems about the relation between true returns and returns on inaccurate predictive models (model returns) in a "meta-learning (POMDPs)" setting (Section 5). In their work, they provide theorems about the relation between the true returns and the model returns in the branched rollout in MDPs. In contrast, we provide theorems about the relation between the true returns and the model returns in the branched rollout in POMDPs. In addition, in the derivation of the theorems (Theorems 4.2 and 4.3) in their work, a number of important premises are not properly taken into consideration (a detailed discussion is given in the second paragraph of A.5). We provide new theorems, in which the premises are more properly reflected, for both MDPs and POMDPs (A.2 and A.5). (2) We extend the model-based policy optimization (MBPO) proposed by Janner et al. to the meta-learning (POMDP) setting (Section 6). MBPO is for the MDP setting and does not support POMDP settings, while our method (M3PO) supports POMDP settings. Furthermore, we empirically demonstrate the usefulness of the meta-model usage in the branched rollout manner in the POMDP settings (Section 7).
A.2 Proofs of theorems in the main content
Before starting the derivation of the main theorems, we introduce a lemma useful for bridging POMDPs and MDPs.

Lemma 1 ([30]). Given a POMDP ⟨O, S, A, p_ob, r, γ, p_st⟩, consider the derived MDP with histories as states, ⟨H, A, γ, r̄, p_hi⟩, where for all t, p_hi := p(h_{t+1} | a_{t+1}, h_t) = Σ_{s_t∈S} Σ_{s_{t+1}∈S} p(s_t | h_t) p(s_{t+1} | s_t, a_t) p(o_{t+1} | s_{t+1}, a_t) and r̄(h_t, a_t) := Σ_{s_t∈S} p(s_t | h_t) r(s_t, a_t). Then, the value function V̄^π(h_t) of the derived MDP is equal to the value function V^π(h_t) of the POMDP.

Proof. The statement can be derived by backward induction on the value functions. See the proof of Lemma 1 in [30] for details.
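For readers who want the induction step spelled out, the following display is our own restatement (not taken verbatim from [30]) of one step of the recursion, written with generic one-step indexing where h′ denotes the successor history:

V̄^π(h) = Σ_a π(a | h) [ r̄(h, a) + γ Σ_{h′} p_hi(h′ | a, h) V̄^π(h′) ]
        = Σ_a π(a | h) Σ_s p(s | h) [ r(s, a) + γ Σ_{s′} p(s′ | s, a) Σ_{o′} p(o′ | s′, a) V̄^π(h ∪ {a, o′}) ],

where the second line substitutes the definitions of r̄ and p_hi; it is exactly the belief-based value recursion of the POMDP, so equality of the two value functions propagates backward step by step.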
Proof of Theorem 1:
Proof.
By Lemma 1, our problem in POMDPs can be mapped into a problem in MDPs, and then Theorem 4.1 in [15] can be applied to the problem. Similarly, the proofs of Theorems 2 and 3 are derived by mapping our problem into one in MDPs by Lemma 1 and leveraging theoretical results on MDPs.
Proof of Theorem 2:
Proof.
By Lemma 1, our problem in POMDPs can be mapped into a problem in MDPs, and then Theorem 4 in A.5 can be applied to the problem.
Proof of Theorem 3:
Proof.
By Lemma 1, our problem in POMDPs can be mapped into a problem in MDPs, and then Theorem 5 in A.5 can be applied to the problem.
A.3 The discrepancy relying on the meta-model error in Theorem 1 and that in Theorem 2

Corollary 1. The discrepancy factors relying on ε_m in Theorem 1, C_Th1,m, are equal to or larger than those relying on ε_m at k = 1 in Theorem 2, C_Th2,m.

Proof. By Theorems 1 and 2,

C_Th1,m = 2γ r_max ε_m / (1 − γ)²,   (4)
C_Th2,m = 2γ r_max ε_m / (1 − γ).   (5)

Given that γ ∈ [0, 1), r_max > 0, and ε_m ≥ 0,

C_Th1,m − C_Th2,m = r_max [2γ ε_m − 2γ(1 − γ) ε_m] / (1 − γ)² = 2γ² r_max ε_m / (1 − γ)² ≥ 0.   (6)
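As a quick sanity check of the corollary, the snippet below evaluates C_Th1,m and C_Th2,m, using the expressions exactly as written above, on a small grid of γ and ε_m values:

```python
# Sanity check of Corollary 1, using C_Th1,m and C_Th2,m exactly as written above.
r_max = 1.0
for gamma in (0.5, 0.9, 0.99):
    for eps_m in (0.0, 0.3, 0.9):
        c_th1 = 2 * gamma * r_max * eps_m / (1 - gamma) ** 2
        c_th2 = 2 * gamma * r_max * eps_m / (1 - gamma)
        assert c_th1 >= c_th2, (gamma, eps_m)
print("C_Th1,m >= C_Th2,m on all grid points")
```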
A.4 Connection to a typical meta-RL setting

In Section 4, we formulate meta-model-based RL as solving a POMDP. However, this formulation may make it difficult for certain readers to comprehend the connection to a typical meta-RL setting. Although in the normative meta-RL setting (e.g., [7]) the objective to be optimized is given as the return expected with respect to a task distribution, such an objective does not appear in the formulation in Section 4. In this section, we show that such an objective can be derived by specializing the formulation in Section 4 under a number of assumptions (Corollary 2). Then, we explain why we did not adopt such a specialization and maintained a more abstract formulation in Section 4.

First, letting a task set and a task distribution be denoted by T and p(τ), where τ ∈ T, respectively, we introduce the following assumptions:

Assumption 2. S := O × T.

Assumption 3. p(s_{t+1} | s_t, a_t) := p(o_{t+1} | o_t, τ_t, a_t) · 1(τ_{t+1} = τ_t).

Assumption 4.
For t > 0, p(s_t | h_t) := p(τ_t | h_t) · 1(τ_t = τ).

Assumption 5. p(τ | h_0) := p(τ).

Here, 1(·) is the indicator function that returns one if the argument is true, and zero otherwise. With these assumptions, the following corollary holds:

Corollary 2.
Given a POMDP ⟨O, S, A, p_ob, r, γ, p_st⟩ and a task set T, consider the parameterized MDP with histories as states, ⟨H, A, γ, r̿, p̄_ob⟩, where for all t, p̄_ob := p(o_{t+1} | o_t, τ, a_t) and r̿ := r(o_t, τ, a_t). Under Assumptions 2 ∼
5, the expected return on the parameterized MDP E a ∼ π,h ∼ p,τ ∼ p ( τ ) [ P ∞ t γ t ¯¯ r t ] := P τ ∈T p ( τ ) E a ∼ π,h ∼ p [ P ∞ t γ t ¯¯ r t | τ ] is equal to the expected returnon the POMDP E a ∼ π,h ∼ p [ R ] .Proof. By applying Lemma 1, the value function in a POMDP hO , S , A , p ob , r, γ, p st i can bemapped to the value function ¯ V π ( h t ) in the derived MDP, which is hH , A , γ, ¯ r, p hi i , where ∀ t. p hi := p ( h t +1 | a t +1 , h t ) = P s t ∈S P s t +1 ∈S p ( s t | h t ) p ( s t +1 | s t , a t ) p ( o t +1 | s t +1 , a t ) and ¯ r ( h t , a t ) := P s t ∈ S p ( s t | h t ) r ( s t , a t ) .By applying the assumptions, this value function can be transformed to a different representationthat explicitly contains τ and its distribution:For t > , (7) ¯ V π ( h t ) = X a t +1 π ( a t +1 | h t ) X s t ∈ S p ( s t | h t ) r ( s t , a t )+ γ X o t +1 ∈O X s t ∈S X s t +1 ∈S p ( s t | h t ) p ( s t +1 | s t , a t ) p ( o t +1 | s t +1 , a t ) ¯ V π ( h t +1 ) = X a t +1 π ( a t +1 | h t ) r ( o t , τ , a t ) | {z } ¯¯ r t + γ X o t +1 ∈O p ( o t +1 | o t , τ , a t ) | {z } ¯ p ob ¯ V π ( h t +1 ) . t = 0 , (8) ¯ V π ( h ) = X a π ( a | h ) (X τ p ( τ ) r ( o , τ , a ) + γ X o ∈O X τ p ( τ ) p ( o | o , τ , a ) ¯ V π ( h ) ) = X τ p ( τ ) X a π ( a | h ) ( r ( o , τ , a ) + γ X o ∈O p ( o | o , τ , a ) ¯ V π ( h ) )| {z } ¯¯ V π ( h ) . Therefore, (9) E a ∼ π,h ∼ p [ R ] = X h p ( h ) ¯ V π ( h )= X τ p ( τ ) X h p ( h ) ¯¯ V π ( h )= E a ∼ π,h ∼ p,τ ∼ p ( τ ) " ∞ X t γ t ¯¯ r t By Corollary 2, our formulation in Section 4 can be specialized into a problem where theobjective to be optimized is given as the return expected with respect to a task distribution.We can derive the meta-model returns with discrepancies for bounding the true return (i.e., E a ∼ π φ ,h ∼ p θ ,τ ∼ p ( τ ) [ P ∞ t γ t ¯¯ r t ] − C ( ǫ m , ǫ π ) ) by using Corollary 2 instead of Lemma 1 in the proofsof Theorem 1 and replacing p θ ( o t | o , τ , a t ) with p θ ( o t | h t − ) .The main reason that we do not adopt such a specialization in Section 4 is to avoid restrictionsinduced by the assumptions (Assumption 2 ∼ A.5 The derivation of the relation of the returns in k -step branched rollouts ( k ≥ ) inMarkov decision processes In this section, we discuss the relation of the true returns and the model returns under the branchedrollout in an MDP, which is defined by a tuple hS , A , r, γ, p st i . Here, S is the set of states, A is theset of actions, p st := p ( s ′ | s, a ) is the state transition probability for any s ′ , s ∈ S and a ∈ A , r isthe reward function and γ ∈ [0 , is the discount factor. At time step t , the former two functionsare used as p ( s t | s t − , a t ) , r t = r ( s t , a t ) . The agent selects an action on the basis of a policy π := p ( a t +1 | s t ) . We denote the data collection policy by π D and the state visitation probabilityunder π D and p ( s ′ | s, a ) by p π D ( s t ) . We also denote the predictive model for the next state by p θ ( s ′ | s, a ) . In addition, we define the upper bounds of the reward scale as r max > max s,a | r ( s, a ) | .Note that, in this section, to discuss the MDP case, we are overriding the definition of the variablesand functions that were defined for the POMDP case in the main body. In addition, for simplicity,we use the abbreviated style E π,p [ R ] for the true return E a ∼ π,s ∼ p [ R := P ∞ t =0 γ t r t ] .Although the theoretical analysis on the relation of the returns in the MDP case is provided by Janneret al. 
[15], in their analysis, a number of important premises are not properly taken into consideration.First, although they use the replay buffers for the branched rollout (i.e. datasets D env and D model inAlgorithm 2 in [15]), they do not take the use of the replay buffers into account in the their theoreticalanalysis. Furthermore, they calculate state-action visitation probabilities based solely on a single15 igure 4: An example of branched rollouts with k = 3 . Here, the blue dots represent trajectories contained inan environment dataset D env and the yellow dots represent fictitious trajectories generated in accor-dance with a current policy π under a predictive model p θ . model-based rollout factor. In the branched rollout, state-action visitation probabilities (except forthe one at t = 0 ) should be affected by multiple past model-based rollouts. For example, a state-action visitation probability at t (s.t. t > k ) is affected by the model-based rollout branched fromreal trajectories at t − k and ones from t − k + 1 to t − (in total, k model-based rollouts). However,in their analysis (the proof of Lemma B.4 in [15]), they calculate state-action visitation probabilitiesbased solely on the model-based rollout. For example, in their analysis, it is assumed that a state-action visitation probability at t (s.t. t > k ) is affected only by the model-based rollout branchedfrom real trajectories at t − k . These oversights of important premises in their analysis induce alarge mismatch between those for their theorems and those made for the actual implementation ofthe branched rollout (i.e., Algorithm 2 in [15]). Therefore, we decided to newly derive the theoremson the branched rollout, reflecting these premises more appropriately.The outline of our branched rollout is shown in Figure 4. Here, we assume that the trajectoriescollected from the real environment are stored in a dataset D env . The trajectories stored in D env can be seen as trajectories following the true dynamics p ( s ′ | s, a ) and data collection policy (i.e., amixture of previous policies used for data collection) π D . At each branched rollout, the trajectoriesin D env are uniformly sampled , and then starting from the sampled trajectories, k -step model-based rollouts in accordance with π under p θ is run. The fictitious trajectories generated by thebranched rollout is stored in a model dataset D model 12 . This process more appropriately reflectsthe actual implementation of the branched rollout (i.e., lines 5–8 in Algorithm 2) in [15] . Theperformance of π is evaluated as the expected return under the state-action visitation probability in D model . Thus, the initial state probability for the rollout starting from the trajectories follows p π D ( s ) Here, when the trajectories are stored in D model , the states in the trajectories are augmented with time stepinformation to deal with the state transition depending on the time step. Note that the extension of this process to the POMDP case is compatible with the implementation of thebranched meta-rollout in our algorithm (lines 4–13 in Algorithm 2). E ( a,s ) ∼D model [ R ] as: E ( a,s ) ∼D model [ R ] := X s ,a p π D ( s , a ) r ( s , a ) + k − X t =1 X s t ,a t γ t p br t Assume that the rollout process in which the policy and dynamics canbe switched to other ones at time step t sw . 
Letting two probabilities be p and p , for ≤ t ′ ≤ t sw , we assume that the dynamics distributions are boundedas ǫ m, pre ≥ max t ′ E s ∼ p [ D TV ( p ( s t ′ | s t ′ − , a t ′ ) || p ( s t ′ | s t ′ − , a t ′ ))] . In addition, for t sw < t ′ ≤ t , we assume that the dynamics distributions are bounded as ǫ m, post ≥ max t ′ E s ∼ p [ D TV ( p ( s t ′ | s t ′ − , a t ′ ) || p ( s t ′ | s t ′ − , a t ′ ))] . Likewise, the policy divergence isbounded by ǫ π, pre and ǫ π, post . Then, the following inequation holds (15) X s t ,a t | p ( s t , a t ) − p ( s t , a t ) | ≤ t − t sw )( ǫ m, post + ǫ π, post ) + 2 t sw ( ǫ m, pre + ǫ π, pre ) roof. The proof is done in a similar manner to those of Lemma B.1 and B.2 in [15]. X s t ,a t | p ( s t , a t ) − p ( s t , a t ) | = X s t ,a t | p ( a t ) p ( s t | a t ) − p ( a t ) p ( s t | a t ) | = X s t ,a t | p ( a t ) p ( s t | a t ) − p ( a t ) p ( s t | a t ) + ( p ( a t ) − p ( a t )) p ( s t | a t ) |≤ X s t ,a t p ( a t ) | p ( s t | a t ) − p ( s t | a t ) | + X a t | p ( a t ) − p ( a t ) |≤ X s t ,a t p ( a t ) | p ( s t | a t ) − p ( s t | a t ) | + X a t ,s t − | p ( a t , s t − ) − p ( a t , s t − ) | = X s t ,a t p ( a t ) | p ( s t | a t ) − p ( s t | a t ) | + X a t ,s t − | p ( s t − ) p ( a t | s t − ) − p ( s t − ) p ( a t | s t − ) + ( p ( s t − ) − p ( s t − )) p ( a t | s t − ) |≤ X s t ,a t p ( a t ) | p ( s t | a t ) − p ( s t | a t ) | + X a t ,s t − p ( s t − ) | p ( a t | s t − ) − p ( a t | s t − ) | + X s t − | p ( s t − ) − p ( s t − ) |≤ X s t ,a t p ( a t ) | p ( s t | a t ) − p ( s t | a t ) | + X a t ,s t − p ( s t − ) | p ( a t | s t − ) − p ( a t | s t − ) | + X s t − ,a t − | p ( s t − , a t − ) − p ( s t − , s t − ) |≤ ǫ m, post + 2 ǫ π, post + X s t − ,a t − | p ( s t − , a t − ) − p ( s t − , s t − ) |≤ t − t sw )( ǫ m, post + ǫ π, post ) + X s t sw ,a t sw | p ( s t sw , a t sw ) − p ( s t sw , s t sw ) |≤ t − t sw )( ǫ m, post + ǫ π, post ) + 2 t sw ( ǫ m, pre + ǫ π, pre ) (16)Now, we start the derivation of our bounds. Theorem 4. Under the k -step branched rollouts, using the bound of a model error under π D , ǫ m = max t E a ∼ π D ,s ∼ p,t [ D TV ( p ( s ′ | s, a ) || p θ ( s ′ | s, a ))] and the bound of the policy shift ǫ π =max s D TV ( π || π D ) , the following inequation holds, (17) E π,p [ R ] ≥ E D model [ R ] − r max (cid:26) γ (1 − γ ) ǫ π + γ − kγ k + ( k − γ k +1 (1 − γ ) ( ǫ π + ǫ m )+ γ k − γγ − ǫ π + ǫ m ) + γ k − γ ( k + 1)( ǫ π + ǫ m ) (cid:27) . roof. (cid:12)(cid:12) E π,p [ R ] − E D model [ R ] (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P s ,a { p π ( s , a ) − p π D ( s , a ) } r ( s , a )+ P k − t =1 P s t ,a t γ t (cid:8) p π ( s t , a t ) − p br t Let ǫ m ′ ≥ max t E a ∼ π,s ∼ p [ D TV ( p ( s ′ | s, a ) || p θ ( s ′ | s, a ))] , (27) E π,p [ R ] ≥ E D model [ R ] − r max (cid:26) γ (1 − γ ) ǫ π + 1 − kγ ( k − + ( k − γ k (1 − γ ) γ ( ǫ m ′ − ǫ π )+ γ k − γγ − ǫ m ′ − ǫ π ) + γ k − γ ( k + 1)( ǫ m ′ − ǫ π ) (cid:27) . Proof. The derivation of this theorem is basically the same as that in the Theorem 4 case except forthe way of evaluation of the bound of terms B and C.For term B , we can apply Lemma 2 to bound the value: X s t ,a t (cid:12)(cid:12) p π ( s t , a t ) − p br t In Section 5, we discuss the relation between the true returns and the returns estimated on the meta-model under the assumption that the reward prediction error is zero. 
The theoretical result underthis assumption is still useful because there are many cases where the true reward function is givenand the reward prediction is not required. However, a number of readers still may want to knowwhat the relation of the returns is under the assumption that the reward prediction is inaccurate. Inthis section, we provide the relation of the returns under inaccurate reward prediction in the MDPcase .We start our discussion by defining the bound of the reward prediction error ǫ r : Here, we do not discuss the theorems in the POMDP case because those in the MDP case can be easilyextended into the POMDP case by utilizing Lemma 1. efinition 3. ǫ r := max t E ( a t ,s t ) ∼D model [ | r ( s t , a t ) − r θ ( s t , a t ) | ] , where r θ ( s t , a t ) := E r t ∼ p θ [ r t | a t , s t ] . We also define the return on the branched rollout with inaccurate reward prediction. E D model [ ˆ R ] := X s ,a p π D ( s , a ) r θ ( s , a ) + k − X t =1 X s t ,a t γ t p br t Similar to the derivation of Theorem 6, we obtain the result by substituting Eq. 30 and 34into Eq. 33. A.7 PEARL in Sections 6 and 7 The PEARL algorithm used in Sections 6 and 7 refers to “PEARL with RNN-traj” in [26]. AlthoughPEARL with RNN-traj performed worse than vanilla PEARL and its variant (PEARL with RNN-tran) in the original paper [26], we found that PEARL RNN-traj works best in our setup, and thusdecided to use it for our implementation of M3PO and experiments. A.8 Baseline methods for our experimentPEARL: The model-free meta-RL method proposed in [26]. This is an off-policy method and im-plemented by extending Soft Actor-Critic [11]. By leveraging experience replay, this method showshigh sample efficiency. We reimplemented the PEARL on TensorFlow, referring to the original im-plementation on PyTorch ( https://github.com/katerakelly/oyster ). Learning to adapt (L2A): The meta-model-based RL proposed in [22]. In this method, the meta-model is implemented with MAML [7] and the optimal action is found by the model predictive pathintegral control [38] on the full meta–model based rollouts. We adapt the following implementationof L2A to our experiment: https://github.com/iclavera/learning_to_adapt A.9 Environments for our experiment For our experiment in Section 7, we prepare simulated robot environments using the MuJoCophysics engine [36] (Figure 5): Halfcheetah-fwd-bwd: In this environment, meta-policies are used to control the half-cheetah,which is a planar biped robot with eight rigid links, including two legs and a torso, along withsix actuated joints. Here, the half-cheetah’s moving direction is randomly selected from “forward”and “backward” around every 15 seconds (in simulation time). If the half-cheetah moves in thecorrect direction, a positive reward is fed to the half-cheetah in accordance with the magnitude ofmovement, otherwise, a negative reward is fed. Halfcheetah-pier: In this environment, the half-cheetah runs over a series of blocks that are floatingon water. Each block moves up and down when stepped on, and the changes in the dynamics arerapidly changing due to each block having different damping and friction properties. These proper-ties are randomly determined at the beginning of each episode. Ant-fwd-bwd: Same as Halfcheetah-fwd-bwd except that the meta-policies are used for controllingthe ant, which is a quadruped robot with nine rigid links, including four legs and a torso, along witheight actuated joints. 
Ant-crippled-leg: In this environment, we randomly sample a leg on the ant to cripple. The crip-pling of the leg causes unexpected and drastic changes to the underlying dynamics. One of the fourlegs is randomly crippled every 15 seconds. Walker2D-randomparams: In this environment, the meta-policies are used to control the walker,which is a planar biped robot consisting of seven links, including two legs and a torso, along withsix actuated joints. The walker’s torso mass and ground friction is randomly determined every 15seconds. Humanoid-direc: In this environment, the meta-policies are used to control the humanoid, whichis a biped robot with 13 rigid links, including two legs, two arms and a torso, along with 17 actuatedjoints. In this task, the humanoid moving direction is randomly selected from two different direc-tions around every 15 seconds. If the humanoid moves in the correct direction, a positive reward24s fed to the humanoid in accordance with the magnitude of its movement, otherwise, a negativereward is fed.Left: Halfcheetah-fwd-bwd, Center: Halfcheetah-pier, Right: Walker2D-randomparamsLeft: Ant-fwd-bwd, Center: Ant-crippled-leg, Right: Humanoid-direc Figure 5: Environments for our experiment .10 Complementary experimental results Figure 6: Discrepancy between E π φ ,p [ R ] and E π φ ,p θ [ R ] in Theorem 1. The vertical and horizontal axesrepresent the discrepancy value and ǫ m ∈ [0 , , respectively. We set the other variables as ǫ π =1 − ǫ m and r max = 1 . Figure 7: The local change in ǫ m ′ with respect to ǫ π versus training sample size. In each figure, the verticalaxis represents the local change of the meta-model error ( dǫ m ′ dǫ π ) and the horizontal axis represents thetraining sample size (x1000). The red-dotted line is the linear interpolation of the blue dots, whichshows the trend of the local change decreasing as the training sample size grows. igure 8: Transition of model errors on training. In each figure, the vertical axis represents empirical valuesof ǫ m and the horizontal axis represents the number of training samples (x1000). We ran five trialswith different random seeds. The result of the x -th trial is denoted by Trial- x . We used the negativeof log-likelihood of the meta-model on validation samples as the approximation of ǫ m . Figure 9: Learning curve of PEARL in a long-term training. In each figure, the vertical axis represents ex-pected returns and the horizontal axis represents the number of training samples ( x50000 ). Themeta-policy and meta-model were fixed and their expected returns were evaluated on 50 episodesat every 50,000 training samples. Each method was evaluated in three trials, and the result of the x -th trial is denoted by Trial- x . Note that the scale of the horizontal axis is larger than that inFigure 2 by 50 times (i.e., 4 in this figure is equal to 200 in Figure 2). igure 10: Learning curve of M3PO and M2PO. In each figure, the vertical axis represents expected returns andthe horizontal axis represents the number of training samples (x1000). The meta-policy and meta-model (and a predictive model) were fixed and their expected returns were evaluated on 50 episodesat every 1000 training samples for the other methods. In each episode, the task was initialized andchanged randomly. Each method was evaluated in at least five trials, and the expected return on the50 episodes was further averaged over the trials. The averaged expected returns and their standarddeviations are plotted in the figures. 
Figure 11: Transition of TD-errors (Q-function error) on training. In each figure, the vertical axis representsempirical values of ǫ m and the horizontal axis represents the number of training samples (x1000).We ran ten trials with different random seeds and plotted the average of their results. The error barmeans one standard deviation. .11 Complementary analysis In addition to Q1 and Q2 in the main content, we also conducted a complementary analysis to answerthe following question. Q.3: Does the use of a meta-model in M3PO contribute to the improvementof the meta-policy?In an analysis in this section, we compare M3PO with the following method. Model-based Meta-Policy Optimization (M2PO): This method is a variant of M3PO, in which a non-adaptive pre-dictive model is used instead of the meta-model. The predictive model architecture is the same asthat in the MBPO algorithm [15] (i.e., the ensemble of Gaussian distributions based on four-layerfeed-forward neural networks).Regarding Q3, our experimental result indicates that the use of a meta-model contributed to theperformance improvement in a number of the environments. In Figure 10 in the appendix, wecan clearly see the improvement of M3PO against M2PO in Halfcheetah-fwd-bwd. In addition, inthe Ant environments, although the M3PO’s performance is seemingly the same as that of M2PO,the qualitative performance is quite different; the M3PO can produce the meta-policy for walkingin the correct direction, while M2PO failed to do so (M2PO produces the meta-policy “alwaysstanding” with a very small amount of control signal). For Humanoid-direc, in contrast, M2PO tendsto achieve a better sample efficiency than M3PO. We hypothesize that the primary reason for this isthat during the plateau at the early stage of training in Humanoid-direc, the predictive model usedin M2PO generates fictitious trajectories that make meta-policy optimization more stable. To verifythis hypothesis, we compare TD-errors (Q-function errors) during training, which is an indicator ofthe stability of meta-policy optimization, in M3PO and M2PO. The evaluation result (Figure 11 inthe appendix) shows that during the performance plateau (10–60 epoch), the TD-error in M2PO wasactually lower than that in M3PO; this result supports our hypothesis. In this paper, we did not focuson the study of meta-model usage to generate the trajectories that make meta-policy optimizationstable, but this experimental result indicates that such a study is important for further improvingM3PO. A.12 Hyperparameter setting Table 1: Hyperparameter settings for M3PO results shown in Figure 2. x → y over epohs a → b denotes athresholded linear function, i.e., at epoch e , f ( e ) = min(max( x + e − ab − a · ( y − x ) , x ) , y ) H a l f c h ee t a h -f w d - b w d H a l f c h ee t a h - p i e r A n t -f w d - b w d A n t - c r i pp l e d - l e g W a l k e r D -r a ndo m p a r a m s H u m a no i d - d i r ec N epoch 200 E environment step per epoch 1000 M meta-model rollouts per epoch 1e6 5e5 1e6 B ensemble size 3 G meta-policy update per environment step 40 20 k meta-model horizon 1 1 → →100