On the Generalization Gap in Reparameterizable Reinforcement Learning
Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher
Salesforce Research, Palo Alto, CA, USA. Correspondence to: Huan Wang <[email protected]>.

Abstract
Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We focus on the special class of reparameterizable RL problems, where the trajectory distribution can be decomposed using the reparameterization trick. For this problem class, estimating the expected return is efficient and the trajectory can be computed deterministically given peripheral random variables, which enables us to study reparameterizable RL using supervised learning and transfer learning theory. Through these relationships, we derive guarantees on the gap between the expected and empirical return for both intrinsic and external errors, based on Rademacher complexity as well as the PAC-Bayes bound. Our bound suggests that the generalization capability of reparameterizable RL is related to multiple factors, including the "smoothness" of the environment transition, the reward, and the agent policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.
1. Introduction
Reinforcement learning (RL) has proven successful in a series of applications such as games (Silver et al., 2016; 2017; Mnih et al., 2015; Vinyals et al., 2017; OpenAI, 2018), robotics (Kober et al., 2013), recommendation systems (Li et al., 2010; Shani et al., 2005), resource management (Mao et al., 2016; Mirhoseini et al., 2018), neural architecture design (Baker et al., 2017), and more. However, some key questions in reinforcement learning remain unsolved. One that draws more and more attention is the issue of overfitting in reinforcement learning (Sutton, 1995;
Cobbe et al., 2018; Zhang et al., 2018b; Packer et al., 2018; Zhang et al., 2018a). A model that performs well in the training environment may or may not perform well when used in the testing environment. There is also a growing interest in understanding the conditions for model generalization and developing algorithms that improve generalization.

In general, we would like to measure how accurately an algorithm is able to predict on previously unseen data. One metric of interest is the gap between the training and testing loss or reward. It has been observed that such gaps are related to multiple factors: initial state distribution, environment transition, the level of "difficulty" in the environment, model architectures, and optimization. Zhang et al. (2018b) split randomly sampled initial states into training and testing sets and evaluated the performance gap in deep reinforcement learning. They empirically observed overfitting caused by the randomness of the environment, even if the initial distribution and the transition in the testing environment are kept the same as in training. On the other hand, Farebrother et al. (2018); Justesen et al. (2018); Cobbe et al. (2018) allowed the test environment to vary from training, and observed huge differences in testing performance. Packer et al. (2018) also reported very different testing behaviors across models and algorithms, even for the same RL problem.

Although overfitting has been empirically observed in RL from time to time, theoretical guarantees on generalization, especially finite-sample guarantees, are still missing. In this work, we focus on on-policy RL, where agent policies are trained based on episodes of experience that are sampled "on-the-fly" using the current policy in training. We identify two major obstacles in the analysis of on-policy RL. First, the episode distribution keeps changing as the policy gets updated during optimization. Therefore, episodes have to be continuously redrawn from the new distribution induced by the updated policy during optimization. For finite-sample analysis, this leads to a process with complex dependencies. Second, state-of-the-art research on RL tends to mix the errors caused by randomness in the environment and shifts in the environment distribution. We argue that these two types of errors are actually very different. One, which we call intrinsic error, is analogous to overfitting in supervised learning, and the other, called external error, looks more like the errors in transfer learning.

Our key observation is that there exists a special class of RL, called reparameterizable RL, where randomness in the environment can be decoupled from the transition and initialization procedures via the reparameterization trick (Kingma & Welling, 2014). Through reparameterization, an episode's dependency on the policy is "lifted" to the states. Hence, as the policy gets updated, episodes are deterministic given peripheral random variables. As a consequence, the expected reward in reparameterizable RL is connected to the Rademacher complexity as well as the PAC-Bayes bound. The reparameterization trick also makes the analysis for the second type of errors, i.e., when the environment distribution is shifted, much easier, since the environment parameters are also "lifted" to the representation of states.
Related Work
Generalization in reinforcement learning has been investigated extensively, both theoretically and empirically. Theoretical work includes bandit analysis (Agarwal et al., 2014; Auer et al., 2002; 2009; Beygelzimer et al., 2011), Probably Approximately Correct (PAC) analysis (Jiang et al., 2017; Dann et al., 2017; Strehl et al., 2009; Lattimore & Hutter, 2014), as well as minimax analysis (Azar et al., 2017; Chakravorty & Hyland, 2003). Most works focus on the analysis of regret and consider the gap between the expected value and the optimal return. On the empirical side, besides the previously mentioned work, Whiteson et al. (2011) propose generalized methodologies that are based on multiple environments sampled from a distribution. Nair et al. (2015) also use random starts to test generalization.

Other research has examined generalization from a transfer learning perspective. Lazaric (2012); Taylor & Stone (2009); Zhan & Taylor (2015); Laroche (2017) examine model generalization across different learning tasks, and provide guarantees on asymptotic performance.

There are also works in robotics on transferring policies from simulators to the real world, on optimizing an internal model from data (Kearns & Singh, 2002), and on solving abstracted or compressed MDPs (Majeed & Hutter, 2018).
Our Contributions:

• A connection between (on-policy) reinforcement learning and supervised learning through the reparameterization trick. It simplifies the finite-sample analysis for RL, and yields Rademacher and PAC-Bayes bounds on Markov Decision Processes (MDPs).

• Identifying a class of reparameterizable RL and providing a simple bound for "smooth" environments and models with a limited number of parameters.

• A guarantee for reparameterized RL when the environment is changed during testing. In particular, we discuss two cases of environment shift: a change in the initial distribution of the states, or in the transition function.
2. Notation and Formulation
We denote a Markov Decision Process (MDP) as a 5-tuple $(\mathcal{S}, \mathcal{A}, P, r, P_0)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s, a, s'): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability from state $s$ to $s'$ when taking action $a$, $r(s): \mathcal{S} \to \mathbb{R}$ represents the reward function, and $P_0(s): \mathcal{S} \to [0, 1]$ is the initial state distribution. Let $\pi(s) \in \Pi$:
$\mathcal{S} \to \mathcal{A}$ be the policy map that returns the action $a$ at state $s$.

We consider episodic MDPs with a finite horizon. Given the policy map $\pi$ and the transition probability $P$, the state-to-state transition probability is $T_\pi(s, s') = P(s, \pi(s), s')$. Without loss of generality, the length of the episode is $T + 1$. We denote a sequence of states $[s_0, s_1, \ldots, s_T]$ as $s$. The total reward in an episode is $R(s) = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma \in (0, 1]$ is a discount factor and $r_t = r(s_t)$.

Denote the joint distribution of the sequence of states in an episode $s = [s_0, s_1, \ldots, s_T]$ as $\mathcal{D}_\pi$. Note $\mathcal{D}_\pi$ is also related to $P$ and $P_0$. In this work we assume $P$ and $P_0$ are fixed, so $\mathcal{D}_\pi$ is a function of $\pi$. Our goal is to find a policy that maximizes the expected total discounted reward (return):

$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{s \sim \mathcal{D}_\pi} R(s) = \arg\max_{\pi \in \Pi} \mathbb{E}_{s \sim \mathcal{D}_\pi} \sum_{t=0}^{T} \gamma^t r_t. \tag{1}$$

Suppose during training we have a budget of $n$ episodes; then the empirical return maximizer is

$$\hat{\pi} = \arg\max_{\pi \in \Pi,\; s^i \sim \mathcal{D}_\pi} \frac{1}{n} \sum_{i=1}^{n} R(s^i), \tag{2}$$

where $s^i = [s^i_0, s^i_1, \ldots, s^i_T]$ is the $i$th episode of length $T + 1$. We are interested in the generalization gap

$$\Phi = \left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}'_{\hat{\pi}}} R(s) \right|. \tag{3}$$

Note that in (3) the distribution $\mathcal{D}'_{\hat{\pi}}$ may be different from $\mathcal{D}_{\hat{\pi}}$, since in the testing environment $P'$ as well as $P'_0$ may be shifted compared to the training environment.
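As a small illustration of these definitions, here is a minimal sketch (our own, in NumPy, with hypothetical helper names) of the empirical return in (2); the gap (3) then compares this average against the expected return under the test distribution:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(s) = sum_t gamma^t * r_t for one episode s = [s_0, ..., s_T]."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def empirical_return(episode_rewards, gamma=0.99):
    """(1/n) sum_i R(s^i): the quantity maximized in (2)."""
    return float(np.mean([discounted_return(ep, gamma) for ep in episode_rewards]))

# The generalization gap (3) compares this empirical average against the
# expected return under the (possibly shifted) test distribution D'_pi.
```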
3. Generalization in Reinforcement Learning vs. Supervised Learning
Generalization has been well studied in the supervised learning scenario. A popular assumption is that samples are independent and identically distributed: $(x_i, y_i) \sim \mathcal{D}$, $\forall i \in \{1, 2, \ldots, n\}$. Similar to the empirical return maximization discussed in Section 2, in supervised learning a popular algorithm is empirical risk minimization:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i), \tag{4}$$

where $f \in \mathcal{F}: \mathcal{X} \to \mathcal{Y}$ is the prediction function to be learned and $\ell: \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$ is the loss function. Similarly, generalization in supervised learning concerns the gap between the expected loss $\mathbb{E}[\ell(f, x, y)]$ and the empirical loss $\frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i)$.

It is easy to find the correspondence between the episodes defined in Section 2 and the samples $(x_i, y_i)$ in supervised learning. Just like supervised learning where $(x, y) \sim \mathcal{D}$, in (episodic) reinforcement learning $s^i \sim \mathcal{D}_\pi$. Also, the reward function $R$ in reinforcement learning is similar to the loss function $\ell$ in supervised learning. However, reinforcement learning is different because:

• In supervised learning, the sample distribution $\mathcal{D}$ is kept fixed, and the loss function $\ell \circ f$ changes as we choose different predictors $f$.

• In reinforcement learning, the reward function $R$ is kept fixed, but the sample distribution $\mathcal{D}_\pi$ changes as we choose different policy maps $\pi$.

As a consequence, the training procedure in reinforcement learning is also different. Popular methods such as REINFORCE (Williams, 1992), Q-learning (Sutton & Barto, 1998), and actor-critic methods (Mnih et al., 2016) draw new states and episodes on the fly as the policy $\pi$ is being updated. That is, the distribution $\mathcal{D}_\pi$ from which episodes are drawn always changes during optimization. In contrast, in supervised learning we only update the predictor $f$ without affecting the underlying sample distribution $\mathcal{D}$.
4. Intrinsic vs. External Generalization Errors
The generalization gap (3) can be bounded as

$$\Phi \leq \underbrace{\left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}_{\hat{\pi}}} R(s) \right|}_{\text{intrinsic}} + \underbrace{\left| \mathbb{E}_{s \sim \mathcal{D}_{\hat{\pi}}} R(s) - \mathbb{E}_{s \sim \mathcal{D}'_{\hat{\pi}}} R(s) \right|}_{\text{external}} \tag{5}$$

using the triangle inequality. The first term in (5) is the concentration error between the empirical reward and its expectation. Since it is caused by the intrinsic randomness of the environment, we call it the intrinsic error. Even if the test environment shares the same distribution with training, in the finite-sample scenario there is still a gap between training and testing. This is analogous to the overfitting problem studied in supervised learning. Zhang et al. (2018b) mainly focuses on this aspect of generalization. In particular, their randomness is carefully controlled in experiments to only come from the initial states $s_0 \sim P_0$.

We call the second term in (5) the external error, as it is caused by shifts of the distribution in the environment. For example, the transition distribution $P$ or the initialization distribution $P_0$ may change during testing, which leads to a different underlying episode distribution $\mathcal{D}'_\pi$. This is analogous to the transfer learning problem. For instance, generalization as in Cobbe et al. (2018) is mostly external error, since the numbers of levels used for training and testing are different even though the difficulty level parameters are sampled from the same distribution. The setting in Packer et al. (2018) covers both intrinsic and external errors.
5. Why Intrinsic Generalization Error?

If $\pi$ is fixed, by concentration of measure, as the number of episodes $n$ increases, the intrinsic error decreases roughly as $1/\sqrt{n}$. For example, if the reward is bounded, $|R(s^i)| \leq c/2$, then by McDiarmid's bound, with probability at least $1 - \delta$,

$$\left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}}[R(s)] \right| \leq c \sqrt{\frac{\log \frac{2}{\delta}}{2n}}, \tag{6}$$

where $c > 0$. Note the bound above also holds for the test samples if the distribution $\mathcal{D}$ is fixed and $s_{\text{test}} \sim \mathcal{D}$.

For the population objective (1), $\pi^*$ is defined deterministically, since the value $\mathbb{E}_{s \sim \mathcal{D}_\pi} R(s)$ is a deterministic function of $\pi$. However, in the finite-sample case (2), the policy map $\hat{\pi}$ is stochastic: it depends on the samples $s^i$. As a consequence, the underlying distribution $\mathcal{D}_{\hat{\pi}}$ is not fixed. In that case, the expectation $\mathbb{E}_{s \sim \mathcal{D}_{\hat{\pi}}}[R(s)]$ in (6) becomes a random variable, so (6) does not hold any more.

One way of fixing the issue caused by the random $\mathcal{D}_{\hat{\pi}}$ is to prove a bound that holds uniformly for all policies $\pi \in \Pi$. If $\Pi$ is finite, by applying a union bound, it follows that:

Lemma 1. If $\Pi$ is finite, and $|R(s)| \leq c/2$, then with probability at least $1 - \delta$, for all $\pi \in \Pi$,

$$\left| \frac{1}{n} \sum_{i=1}^{n} R(s^i) - \mathbb{E}_{s \sim \mathcal{D}_\pi}[R(s)] \right| \leq c \sqrt{\frac{\log \frac{2|\Pi|}{\delta}}{2n}}, \tag{7}$$

where $|\Pi|$ is the cardinality of $\Pi$.

Unfortunately, in most applications $\Pi$ is not finite. One difficulty in analyzing the intrinsic generalization error is that the policy changes during the optimization procedure. This leads to a change in the episode distribution $\mathcal{D}_\pi$. Usually $\pi$ is updated using episodes generated from some "previous" distributions, and the updated policy is then used to generate new episodes. In this case it is not easy to split episodes into a training and a testing set, since during optimization samples always come from the updated policy distribution.
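Assuming the reconstructed constants in (7), a quick sketch (ours, with illustrative numbers only) of how the finite-class bound scales:

```python
import numpy as np

def finite_class_bound(c, n, card_pi, delta=0.05):
    """Right-hand side of (7): c * sqrt(log(2|Pi|/delta) / (2n))."""
    return c * np.sqrt(np.log(2 * card_pi / delta) / (2 * n))

# Example: rewards bounded by c/2 = 0.5, n = 1000 episodes, |Pi| = 10^6 policies.
print(finite_class_bound(c=1.0, n=1000, card_pi=1e6))  # ~0.094
```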
6. Reparameterization Trick
The reparameterization trick has been popular in the optimization of deep networks (Kingma & Welling, 2014; Maddison et al., 2017; Jang et al., 2017; Tokui & Sato, 2016) and is used, e.g., for the purpose of optimization efficiency. In RL, suppose the objective (1) is reparameterizable:

$$\mathbb{E}_{s \sim \mathcal{D}_\pi} R(s) = \mathbb{E}_{\xi \sim p(\xi)} R(s(f(\xi, \pi))).$$

Then, under some weak assumptions,

$$\nabla_\theta \mathbb{E}_{s \sim \mathcal{D}_{\pi_\theta}} R(s) = \nabla_\theta \left[ \mathbb{E}_{\xi \sim p(\xi)} R(s(f(\xi, \pi_\theta))) \right] = \mathbb{E}_{\xi \sim p(\xi)} \left[ \nabla_\theta R(s(f(\xi, \pi_\theta))) \right]. \tag{8}$$

The reparameterization trick has already been used in RL: for example, PGPE (Rückstieß et al., 2010) uses policy reparameterization, and SVG (Heess et al., 2015) uses policy and environment dynamics reparameterization. In this work, we show that the reparameterization trick can also help to analyze the generalization gap. More precisely, we will show that since both $P$ and $P_0$ are fixed, even if they are unknown, as long as they satisfy some "smoothness" assumptions, we can provide theoretical guarantees on the test performance.
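A minimal sketch of the identity (8) on a toy reparameterizable objective (ours, not from the paper): with $s = \theta + \xi$, $\xi \sim N(0, 1)$, and $R(s) = -s^2$, the pathwise estimator recovers $\nabla_\theta \mathbb{E}[R] = -2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reparameterizable objective: s = theta + xi with xi ~ N(0, 1), and
# "reward" R(s) = -s^2, so E[R] = -(theta^2 + 1) and d/dtheta E[R] = -2*theta.
theta = 1.5
xi = rng.standard_normal(1_000_000)

# Pathwise (reparameterized) gradient estimate, as in (8):
# E_xi[ dR/ds * ds/dtheta ] with dR/ds = -2s and ds/dtheta = 1.
grad_estimate = np.mean(-2.0 * (theta + xi))
print(grad_estimate)  # close to -2 * theta = -3.0
```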
7. Reparameterized MDP
We start our analysis by reparameterizing a Markov Decision Process with discrete states. We will give a general argument on reparameterizable RL in the next section. In this section we slightly abuse notation by letting $P_0$ and $P(s, a)$ denote $|\mathcal{S}|$-dimensional probability vectors of the multinomial distributions for initialization and transition respectively.

One difficulty in the analysis of generalization in reinforcement learning arises from the sampling steps in the MDP, where states are drawn from multinomial distributions specified by either $P_0$ or $P(s_t, a_t)$, because the sampling procedure does not explicitly connect the states and the distribution parameters. We can use standard Gumbel random variables, with density $g \sim \exp(-(g + e^{-g}))$, to reparameterize the sampling and get a procedure equivalent to classical MDPs but with slightly different expressions, as shown in Algorithm 1.

Algorithm 1
Reparameterized MDP
Initialization: Sample $g^{\text{init}}, g_0, g_1, \ldots, g_T \sim G^{|\mathcal{S}|}$. $s_0 = \arg\max(g^{\text{init}} + \log P_0)$, $R = 0$.
for $t$ in $0, \ldots, T$ do
  $R = R + \gamma^t r(s_t)$
  $s_{t+1} = \arg\max(g_t + \log P(s_t, \pi(s_t)))$
end for
return $R$.

In the reparameterized MDP procedure, $G^{|\mathcal{S}|}$ is an $|\mathcal{S}|$-dimensional Gumbel distribution. $g_0, \ldots, g_T$ are $|\mathcal{S}|$-dimensional vectors with each entry being a Gumbel random variable. Also, $g^{\text{init}} + \log P_0$ and $g_t + \log P(s_t, a_t)$ are entry-wise vector sums, so they are both $|\mathcal{S}|$-dimensional vectors. $\arg\max(v)$ returns the index of the maximum entry in the $|\mathcal{S}|$-dimensional vector $v$. In the reparameterized MDP procedure shown above, the states $s_t$ are represented as an index in $\{1, 2, \ldots, |\mathcal{S}|\}$. After reparameterization, we may rewrite the RL objective (2) as:

$$\hat{\pi} = \arg\max_{\pi \in \Pi,\; g^i \sim G^{|\mathcal{S}| \times T}} \frac{1}{n} \sum_{i=1}^{n} R(s^i(g^i; \pi)), \tag{9}$$

where $g^i = [g^i_0, g^i_1, \ldots, g^i_T]$, $g^i_t$ is an $|\mathcal{S}|$-dimensional Gumbel random variable, and

$$R(s^i(g^i; \pi)) = \sum_{t=0}^{T} \gamma^t r(s^i_t(g^i_0, g^i_1, \ldots, g^i_t; \pi)) \tag{10}$$

is the discounted return for one episode of length $T + 1$. The reparameterized objective (9) maximizes the empirical reward by varying the policy $\pi$. The distribution from which the random variables $g^i$ are drawn does not depend on the policy $\pi$ anymore, and the policy $\pi$ only affects the reward $R(s^i(g^i; \pi))$ through the states $s^i$.

The objective (9) is a discrete function due to the $\arg\max$ operator. One way to circumvent this is to use the Gumbel softmax to approximate the $\arg\max$ operator (Maddison et al., 2017; Jang et al., 2017). If we denote $s$ as a one-hot vector in $\mathbb{R}^{|\mathcal{S}|}$, and further relax the entries of $s$ to take positive values that sum up to one, we may use the softmax to approximate the $\arg\max$ operator. For instance, the reparameterized initial-state distribution becomes:

$$s_0 = \frac{\exp\{(g^{\text{init}} + \log P_0)/\tau\}}{\|\exp\{(g^{\text{init}} + \log P_0)/\tau\}\|_1}, \tag{11}$$

where $g^{\text{init}}$ is an $|\mathcal{S}|$-dimensional Gumbel random variable, $P_0$ is an $|\mathcal{S}|$-dimensional probability vector of a multinomial distribution, and $\tau$ is a positive scalar. As the temperature $\tau \to 0$, the softmax approaches $s_0 = \arg\max(g^{\text{init}} + \log P_0) \sim P_0$ in terms of the one-hot vector representation. Again, we abuse notation by denoting $s^i(f(g^i; \pi))$ as $s^i(g^i; \pi)$.
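A minimal NumPy sketch (ours) of the two ingredients above: the Gumbel-max trick, which samples exactly from a multinomial $p$, and the softmax relaxation (11):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])   # a row of P(s_t, a_t): next-state probabilities

def gumbel(shape):
    """Standard Gumbel noise via inverse CDF: g = -log(-log(U))."""
    return -np.log(-np.log(rng.uniform(size=shape)))

# Gumbel-max: argmax(g + log p) is distributed exactly according to p.
draws = np.argmax(gumbel((100_000, 3)) + np.log(p), axis=1)
print(np.bincount(draws) / len(draws))          # approx [0.2, 0.5, 0.3]

def gumbel_softmax(p, tau):
    """Relaxation (11): a point in the simplex; one-hot in the limit tau -> 0."""
    z = (gumbel(p.shape) + np.log(p)) / tau
    z = np.exp(z - z.max())                     # numerically stable softmax
    return z / z.sum()

print(gumbel_softmax(p, tau=0.1))               # nearly one-hot
```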
8. Reparameterizable RL
In general, as long as the transition and initialization processes can be reparameterized so that the environment parameters are separated from the random variables, the objective can always be reformulated so that the policy only affects the reward instead of the underlying distribution. The reparameterizable RL procedure is shown in Algorithm 2.
Algorithm 2
Reparameterizable RL
Initialization: Sample $\xi_0, \xi_1, \ldots, \xi_T$. $s_0 = \mathcal{I}(\xi_0)$, $R = 0$.
for $t$ in $0, \ldots, T$ do
  $R = R + \gamma^t r(s_t)$
  $s_{t+1} = \mathcal{T}(s_t, \pi(s_t), \xi_t)$
end for
return $R$.

In this procedure, the $\xi_t$ are $d$-dimensional random variables, but they are not necessarily sampled from the same distribution (they may also have different dimensions; in this work, without loss of generality, we assume they share the same dimension $d$). In many scenarios they are treated as random noise. $\mathcal{I}: \mathbb{R}^d \to \mathbb{R}^{|\mathcal{S}|}$ is the initialization function. During initialization, the random variable $\xi_0$ is taken as input and the output is an initial state $s_0$. The transition function $\mathcal{T}: \mathbb{R}^{|\mathcal{S}|} \times \mathbb{R}^{|\mathcal{A}|} \times \mathbb{R}^d \to \mathbb{R}^{|\mathcal{S}|}$ takes the current state $s_t$, the action produced by the policy $\pi(s_t)$, and a random variable $\xi_t$, and produces the next state $s_{t+1}$.

In reparameterizable RL, the peripheral random variables $\xi_0, \xi_1, \ldots, \xi_T$ can be sampled before the episode is generated. In this way, the randomness is decoupled from the policy function, and as the policy $\pi$ gets updated, the episodes can be computed deterministically.

The class of reparameterizable RL problems includes those whose initial state, transition, reward and optimal policy distribution can be reparameterized. Generally, a distribution can be reparameterized, e.g., if it has a tractable inverse CDF, is a composition of reparameterizable distributions (Kingma & Welling, 2014), or is a limit of smooth approximators (Maddison et al., 2017; Jang et al., 2017). Reparameterizable RL settings include LQR (Lewis et al., 1995) and physical systems (e.g., robotics) where the dynamics are given by stochastic partial differential equations (PDEs) with reparameterizable components over continuous state-action spaces.
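A minimal sketch of Algorithm 2 (ours; the environment functions and shapes are hypothetical): once the noise $\xi_0, \ldots, \xi_T$ is drawn, the rollout is a deterministic function of $\theta$:

```python
import numpy as np

def rollout(theta, policy, init_fn, trans_fn, reward_fn, xis, gamma=0.99):
    """Algorithm 2: with the peripheral noise `xis` drawn up front, the
    episode and its return are deterministic functions of theta."""
    s, ret = init_fn(xis[0]), 0.0
    for t in range(len(xis) - 1):
        ret += gamma ** t * reward_fn(s)                 # R <- R + gamma^t r(s_t)
        s = trans_fn(s, policy(s, theta), xis[t + 1])    # s_{t+1} = T(s_t, pi(s_t), xi_t)
    return ret

# Toy instantiation: linear dynamics, linear policy, quadratic cost.
rng = np.random.default_rng(0)
xis = rng.standard_normal((65, 4))                       # xi_0, ..., xi_T with T = 64
theta = 0.1 * rng.standard_normal((4, 4))
ret = rollout(theta,
              policy=lambda s, th: s @ th,               # pi(s; theta)
              init_fn=lambda xi: xi,                     # s_0 = I(xi_0)
              trans_fn=lambda s, a, xi: 0.9 * s + 0.1 * a + 0.01 * xi,
              reward_fn=lambda s: -float(np.sum(s ** 2)),
              xis=xis)
print(ret)   # re-running with the same `xis` gives the identical return
```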
9. Main Result
For reparameterizable RL, if the environment and the policy are "smooth", we can control the error between the expected and the empirical reward. In particular, we make the following assumptions. (Throughout, $\|\cdot\|$ denotes the $L_2$ norm.)

Assumption 1. $\mathcal{T}(s, a, \xi): \mathbb{R}^{|\mathcal{S}|} \times \mathbb{R}^{|\mathcal{A}|} \times \mathbb{R}^d \to \mathbb{R}^{|\mathcal{S}|}$ is $L_{t,1}$-Lipschitz in the first variable $s$ and $L_{t,2}$-Lipschitz in the second variable $a$. That is, $\forall x, x', y, y', z$,

$$\|\mathcal{T}(x, y, z) - \mathcal{T}(x', y, z)\| \leq L_{t,1} \|x - x'\|, \qquad \|\mathcal{T}(x, y, z) - \mathcal{T}(x, y', z)\| \leq L_{t,2} \|y - y'\|.$$

Assumption 2.
The policy is parameterized as $\pi(s; \theta): \mathbb{R}^{|\mathcal{S}|} \times \mathbb{R}^m \to \mathbb{R}^{|\mathcal{A}|}$, and $\pi(s; \theta)$ is $L_{\pi,1}$-Lipschitz in the states and $L_{\pi,2}$-Lipschitz in the parameter $\theta \in \mathbb{R}^m$; that is, $\forall s, s', \theta, \theta'$,

$$\|\pi(s; \theta) - \pi(s'; \theta)\| \leq L_{\pi,1} \|s - s'\|, \qquad \|\pi(s; \theta) - \pi(s; \theta')\| \leq L_{\pi,2} \|\theta - \theta'\|.$$

Assumption 3.
The reward $r(s): \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}$ is $L_r$-Lipschitz: $|r(s') - r(s)| \leq L_r \|s' - s\|$.

If Assumptions 1, 2 and 3 hold, we have the following:
Theorem 1. In reparameterizable RL, suppose the transition $\mathcal{T}'$ in the test environment satisfies $\forall x, y, z$, $\|(\mathcal{T}' - \mathcal{T})(x, y, z)\| \leq \zeta$, and suppose the initialization function $\mathcal{I}'$ in the test environment satisfies $\forall \xi$, $\|(\mathcal{I}' - \mathcal{I})(\xi)\| \leq \epsilon$. If Assumptions 1, 2 and 3 hold, the peripheral random variables $\xi^i$ for each episode are i.i.d., and the reward is bounded, $|R(s)| \leq c/2$, then with probability at least $1 - \delta$, for all policies $\pi \in \Pi$:

$$\left| \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}'))] - \frac{1}{n} \sum_i R(s(\xi^i; \pi, \mathcal{T}, \mathcal{I})) \right| \leq 2\,\mathrm{Rad}(R_{\pi, \mathcal{T}, \mathcal{I}}) + L_r \zeta \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1} + L_r \epsilon \sum_{t=0}^{T} \gamma^t \nu^t + O\!\left(c \sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$, and

$$\mathrm{Rad}(R_{\pi, \mathcal{T}, \mathcal{I}}) = \mathbb{E}_\xi \mathbb{E}_\sigma \left[ \sup_{\pi} \frac{1}{n} \sum_{i=1}^{n} \sigma_i R(s^i(\xi^i; \pi, \mathcal{T}, \mathcal{I})) \right]$$

is the Rademacher complexity of $R(s(\xi; \pi, \mathcal{T}, \mathcal{I}))$ under the training transition $\mathcal{T}$ and the training initialization $\mathcal{I}$, and $n$ is the number of training episodes.

Note the i.i.d. assumption on the peripheral variables $\xi^i$ is across episodes. Within the same episode, there can be correlations among the $\xi^i_t$ at different time steps. Similar arguments can also be made when the transition $\mathcal{T}'$ in the test environment stays the same as $\mathcal{T}$, but the initialization $\mathcal{I}'$ is different from $\mathcal{I}$. In the following sections we bound the intrinsic and external errors respectively.
10. Bounding Intrinsic Generalization Error
After reparameterization, the objective (9) is essentially the same as an empirical risk minimization problem in the supervised learning scenario. According to classical learning theory, the following lemma is straightforward (Shalev-Shwartz & Ben-David, 2014):
Lemma 2.
If the reward is bounded, $|R(s)| \leq c/2$, $c > 0$, and the $g^i \sim G^{|\mathcal{S}| \times T}$ are i.i.d. across episodes, then with probability at least $1 - \delta$, for all $\pi \in \Pi$:

$$\left| \mathbb{E}_{g \sim G^{|\mathcal{S}| \times T}}[R(s(g; \pi))] - \frac{1}{n} \sum_i R(s^i(g^i; \pi)) \right| \leq 2\,\mathrm{Rad}(R_\pi) + O\!\left(c \sqrt{\frac{\log(1/\delta)}{n}}\right), \tag{12}$$

where $\mathrm{Rad}(R_\pi) = \mathbb{E}_g \mathbb{E}_\sigma \left[ \sup_\pi \frac{1}{n} \sum_{i=1}^{n} \sigma_i R(s^i(g^i; \pi)) \right]$ is the Rademacher complexity of $R(s(g; \pi))$.

The bound (12) holds uniformly for all $\pi \in \Pi$, so it also holds for $\hat{\pi}$. Unfortunately, in MDPs $\mathrm{Rad}(R_\pi)$ is hard to control, mainly due to the recursive $\arg\max$ in the representation of the states, $s_{t+1} = \arg\max(g_t + \log P(s_t, \pi(s_t)))$.

On the other hand, for general reparameterizable RL we may control the intrinsic generalization gap by assuming some "smoothness" conditions on the transition $\mathcal{T}$ as well as the policy $\pi$. In particular, it is straightforward to prove that the empirical return $R$ is "smooth" if the transitions and policies are all Lipschitz.
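For a finite grid of candidate policies standing in for $\Pi$, the empirical Rademacher complexity in Lemma 2 can be estimated by Monte Carlo over the sign variables $\sigma$; a sketch (ours, with hypothetical names):

```python
import numpy as np

def empirical_rademacher(returns, num_sign_draws=2000, seed=0):
    """Monte Carlo estimate of Rad(R_pi) = E_sigma[sup_pi (1/n) sum_i sigma_i R_i(pi)].
    `returns` has shape (num_policies, n): per-episode returns R(s^i(g^i; pi))
    for a finite grid of policies standing in for the class Pi."""
    rng = np.random.default_rng(seed)
    _, n = returns.shape
    sigmas = rng.choice([-1.0, 1.0], size=(num_sign_draws, n))  # Rademacher signs
    # For each sign draw, take the sup over policies, then average the draws.
    return float(np.mean(np.max(returns @ sigmas.T, axis=0)) / n)

# Usage: evaluate each candidate policy on the same n reparameterized episodes
# (shared Gumbel noise g^i), stack the returns row-wise, then estimate Rad.
```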
Lemma 3. For reparameterizable RL, given Assumptions 1, 2, and 3, the empirical return $R$ defined in (10), as a function of the parameter $\theta$, has Lipschitz constant

$$\beta = L_r L_{t,2} L_{\pi,2} \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1}, \tag{13}$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$.

Also, if the number of parameters $m$ in $\pi(\theta)$ is bounded, then the Rademacher complexity $\mathrm{Rad}(R_\pi)$ in Lemma 2 can be controlled (van der Vaart, 1998; Bartlett, 2013).
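A direct transcription of (13) (ours; note the $\nu = 1$ limit of $(\nu^t - 1)/(\nu - 1)$ is $t$):

```python
def return_lipschitz(L_r, L_t1, L_t2, L_pi1, L_pi2, T, gamma=1.0):
    """beta from (13): Lipschitz constant of the return w.r.t. theta."""
    nu = L_t1 + L_t2 * L_pi1
    geom = lambda t: t if nu == 1.0 else (nu ** t - 1.0) / (nu - 1.0)
    return L_r * L_t2 * L_pi2 * sum(gamma ** t * geom(t) for t in range(T + 1))

# Example: mildly expansive dynamics (nu = 1.1) over a short horizon.
print(return_lipschitz(L_r=1.0, L_t1=1.0, L_t2=0.5, L_pi1=0.2, L_pi2=1.0, T=10))
```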
Lemma 4. For reparameterizable RL, given Assumptions 1, 2, and 3, if the parameter $\theta \in \mathbb{R}^m$ is bounded such that $\|\theta\| \leq B$, and the function class of the reparameterized reward $R$ is closed under negation, then the Rademacher complexity $\mathrm{Rad}(R_\pi)$ is bounded by

$$\mathrm{Rad}(R_\pi) = O\!\left(\beta \sqrt{\frac{m}{n}}\right), \tag{14}$$

where $\beta$ is the Lipschitz constant defined in (13), and $n$ is the number of episodes.

In the context of deep learning, deep neural networks are over-parameterized models that have proven to work well in many applications. However, the bound above does not explain why over-parameterized models also generalize well, since the Rademacher complexity bound (14) can become extremely large as $m$ grows. To ameliorate this issue, Arora et al. (2018) recently proposed a compression approach that compresses a neural network to a smaller one with fewer parameters but roughly the same training errors. Whether this also applies to reparameterizable RL is yet to be proven. There are also trajectory-based techniques proposed to sharpen the generalization bound (Li et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019; Cao & Gu, 2019).

We can also analyze the generalization of the empirical return by making a slightly different assumption on the policy. Suppose $\pi$ is parameterized as $\pi(\theta)$, and $\theta$ is sampled from some posterior distribution $\theta \sim \mathcal{Q}$. According to the PAC-Bayes theorem (McAllester, 1998; 2003; Neyshabur et al., 2018; Langford & Shawe-Taylor, 2002):

Lemma 5.
Given a "prior" distribution $\mathcal{D}$ over the parameters, with probability at least $1 - \delta$ over the draw of $n$ episodes, for all posteriors $\mathcal{Q}$:

$$\mathbb{E}_g[R_{\theta \sim \mathcal{Q}}(g)] \geq \frac{1}{n} \sum_i R_{\theta \sim \mathcal{Q}}(g^i) - \sqrt{\frac{\mathrm{KL}(\mathcal{Q} \| \mathcal{D}) + \log \frac{2n}{\delta}}{2(n - 1)}}, \tag{15}$$

$$R_{\theta \sim \mathcal{Q}}(g^i) = \mathbb{E}_{\theta \sim \mathcal{Q}}\left[R(s^i(g^i; \pi(\theta)))\right] = \mathbb{E}_{\theta \sim \mathcal{Q}}\left[\sum_{t=0}^{T} \gamma^t r(s^i_t(g^i; \pi(\theta)))\right], \tag{16}$$

where $R_{\theta \sim \mathcal{Q}}(g)$ is the expected "Bayesian" reward.

The bound (15) holds for all posteriors $\mathcal{Q}$. In particular, it holds if $\mathcal{Q}$ is $\theta + u$, where $\theta$ could be any solution provided by empirical return maximization, and $u$ is a perturbation, e.g., a zero-centered uniform or Gaussian distribution. This suggests that maximizing a perturbed objective instead may lead to better generalization performance, which has already been observed empirically (Wang et al., 2018b).

The tricky part about perturbing the policy is choosing the level of noise. Suppose there is an empirical reward optimizer $\pi(\hat{\theta})$. When the noise level is small, the first term in (15) is large, but the second term may also be large, since the posterior $\mathcal{Q}$ is too focused on $\hat{\theta}$ while the "prior" $\mathcal{D}$ cannot depend on $\hat{\theta}$; and vice versa. On the other hand, if the reward function is "nice", e.g., if some "smoothness" assumption holds in a local neighborhood of $\hat{\theta}$, then one can prove that the optimal noise level roughly scales inversely as the square root of the local Hessian diagonals (Wang et al., 2018a).
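A sketch (ours) of the perturbed-policy evaluation suggested by Lemma 5: estimate the "Bayesian" reward by averaging rollouts at $\hat{\theta} + u$ with Gaussian $u$; `avg_return` is a hypothetical rollout routine:

```python
import numpy as np

def bayesian_reward(theta_hat, avg_return, noise_std, num_draws=100, seed=0):
    """Estimate R_{theta~Q} for the posterior Q = theta_hat + N(0, noise_std^2 I),
    as in the discussion after Lemma 5. `avg_return(theta)` should roll out the
    policy pi(theta) on held-out noise and return its average reward."""
    rng = np.random.default_rng(seed)
    draws = [avg_return(theta_hat + noise_std * rng.standard_normal(theta_hat.shape))
             for _ in range(num_draws)]
    return float(np.mean(draws))
```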
11. Bounding External Generalization Error
Another source of generalization error in RL comes from changes in the environment. For example, in an MDP $(\mathcal{S}, \mathcal{A}, P, r, P_0)$, the transition probability $P$ or the initialization distribution $P_0$ may be different in the test environment. Cobbe et al. (2018) and Packer et al. (2018) show that as the distribution of the environment varies, the gap between training and testing can be huge.

Indeed, if the test distribution is drastically different from the training environment, there is no guarantee that the same model could possibly work for testing. On the other hand, if the test distribution $\mathcal{D}'$ is not too far away from the training distribution $\mathcal{D}$, then the test error can still be controlled. For example, for supervised learning, Mohri & Medina (2012) prove that the expected loss under a drifting distribution is also bounded. In addition to the Rademacher complexity and a concentration tail, there is one more term in the gap that measures the discrepancy between the training and testing distributions.

For reparameterizable RL, since the environment parameters are lifted into the reward function in the reformulated objective (9), the analysis becomes easier. For MDPs, a small change in the environment can cause a large difference in the reward, since $\arg\max$ is not continuous. However, if the transition function is "smooth", the expected reward in the new environment can still be controlled, e.g., if we assume the transition function $\mathcal{T}$, the reward function $r$, as well as the policy function $\pi$ are all Lipschitz, as in Section 10.

If the transition function $\mathcal{T}$ is the same in the test environment and the only difference is the initialization, we can prove the following lemma:

Lemma 6. In reparameterizable RL, suppose the initialization function $\mathcal{I}'$ in the test environment satisfies $\forall \xi$, $\|(\mathcal{I}' - \mathcal{I})(\xi)\| \leq \zeta$ for $\zeta > 0$, and the transition function $\mathcal{T}$ in the test environment is the same as in training. If Assumptions 1, 2, and 3 hold, then:

$$\left| \mathbb{E}_\xi[R(s(\xi; \mathcal{I}'))] - \mathbb{E}_\xi[R(s(\xi; \mathcal{I}))] \right| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t (L_{t,1} + L_{t,2} L_{\pi,1})^t \tag{17}$$

Lemma 6 means that if the initialization in the test environment is not too different from the training one, and if the transition, policy and reward functions are smooth, then the expected reward in the test environment won't deviate too much from that of training.

[Table 1: Intrinsic gap versus smoothness. Columns: policy softmax temperature $\tau$ with the corresponding gap and the metric $\tau \Pi_l \|\hat{\theta}_l\|_F$, plus the gaps under varying state- and action-transition temperatures. Most numeric entries are unrecoverable from the extraction; the policy-$\tau$ gap column decreases from about 0.554 to 0.471 as $\tau$ grows.]
The other possible environment change is that the test initialization $\mathcal{I}$ stays the same but the transition changes from the training transition $\mathcal{T}$ to $\mathcal{T}'$. Similar to before, we have:

Lemma 7. In reparameterizable RL, suppose the transition $\mathcal{T}'$ in the test environment satisfies $\forall x, y, z$, $\|(\mathcal{T}' - \mathcal{T})(x, y, z)\| \leq \zeta$, and the initialization $\mathcal{I}$ in the test environment is the same as in training. If Assumptions 1, 2 and 3 hold, then

$$\left| \mathbb{E}_\xi[R(s(\xi; \mathcal{T}'))] - \mathbb{E}_\xi[R(s(\xi; \mathcal{T}))] \right| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1}, \tag{18}$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$.

The difference between (18) and (17) is that the change $\zeta$ in the transition $\mathcal{T}$ is injected at every step and accumulates across the episode, so for $\nu > 1$ the gap in (18) can become huge as the length $T$ of the episode increases.
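To see the difference numerically, compare the two coefficients under illustrative constants (ours; any $1 < \nu < 2$ behaves similarly):

```python
nu, gamma, L_r, shift, T = 1.2, 1.0, 1.0, 0.1, 128

init_bound = L_r * shift * sum(gamma ** t * nu ** t for t in range(T + 1))    # (17)
trans_bound = L_r * shift * sum(gamma ** t * (nu ** t - 1) / (nu - 1)
                                for t in range(T + 1))                        # (18)
print(trans_bound / init_bound)   # close to 1/(nu - 1) = 5 for this nu
```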
12. Simulation
We now present empirical measurements in simulations to verify some claims made in Sections 10 and 11. The bound (14) suggests the gap between the expected reward and the empirical reward is related to the Lipschitz constant $\beta$ of $R$, which according to equation (13) is related to the Lipschitz constants of a series of functions including $\pi$, $\mathcal{T}$, and $r$.

In (13), as the length of the episode $T$ increases, the dominating factors in $\beta$ are $L_{t,1}$, $L_{t,2}$ and $L_{\pi,1}$. Our first simulation fixes the environment and varies $L_{\pi,1}$. In the simulation, we assume the initialization $\mathcal{I}$ and the transition $\mathcal{T}$ are known and fixed. (In real applications this is not doable, since $\mathcal{T}$ and $\mathcal{I}$ are unknown; here we assume they are known just to investigate the generalization gap.) $\mathcal{I}$ is an identity function, and $\xi_0 \in \mathbb{R}^{|\mathcal{S}|}$ is a vector of i.i.d. uniformly distributed random variables: $\xi_0[k] \sim U[0, 1]$, $\forall k \in 1, \ldots, |\mathcal{S}|$. The transition function is $\mathcal{T}(s, a, \xi) = s T_1 + a T_2 + \xi T_3$, where $s \in \mathbb{R}^{|\mathcal{S}|}$, $a \in \mathbb{R}^{|\mathcal{A}|}$, and $\xi$ are row vectors, and $T_1 \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$, $T_2 \in \mathbb{R}^{|\mathcal{A}| \times |\mathcal{S}|}$, and $T_3$ are matrices used to project the states, actions, and noise respectively. $T_1$, $T_2$, and $T_3$ are randomly generated and then kept fixed during the experiment. We use $\gamma = 1$ as the discounting constant throughout.

The policy $\pi(s, \theta)$ is modeled using a multi-layer perceptron (MLP) with rectified linear units as the activation. The last layer of the MLP is a linear layer followed by a softmax function with temperature: $q(x[k]; \tau) = \frac{\exp(x[k]/\tau)}{\sum_k \exp(x[k]/\tau)}$. By varying the temperature $\tau$ we are able to control the Lipschitz constants $L_{\pi,1}$ and $L_{\pi,2}$ of the policy class, if we assume the bound on the parameters $\|\theta\| \leq B$ is unchanged. We set the length of the episode to $T = 128$, and randomly sample $\xi_0, \xi_1, \ldots, \xi_T$ for $n = 128$ training and testing episodes. Then we use the same random noise to evaluate a series of policy classes with different temperatures $\tau$.

Since we assume $\mathcal{I}$ and $\mathcal{T}$ are known, during training the computation graph is complete. Hence we can directly optimize the coefficients $\theta$ in $\pi(s; \theta)$ just as in supervised learning. We use Adam (Kingma & Ba, 2015) to optimize, with initial learning rates $10^{-2}$ and $10^{-3}$; when the reward stops increasing we halve the learning rate. We then analyze the gap between the average training and testing reward.

First, we observe the gap is affected by the optimization procedure. For example, different learning rates can lead to different local optima, even if we decrease the learning rate by half when the reward does not increase. Second, even if we know the environment $\mathcal{I}$ and $\mathcal{T}$, so that we can optimize the policy $\pi(s; \theta)$ directly, we still experience unstable learning just like other RL algorithms. This suggests that the instability of RL algorithms such as A2C and A3C (Mnih et al., 2016) may not arise from the estimation of the environment, since even if we know the environment the learning is still unstable.

Given the unstable training procedure, for each trial we ran the training with learning rates of 1e-2 and 1e-3, and the run with the higher training reward at the last epoch is used for reporting. Ideally, as we vary $\tau$, the Lipschitz constant for the function class $\pi \in \Pi$ changes accordingly, given the assumption $\|\theta\| \leq B$. However, it is unclear whether $B$ changes across different configurations; after all, the assumption that the parameters are bounded is artificial. To ameliorate this defect we also check the metric $\tau \Pi_l \|\theta_l\|_F$, where $\theta_l$ is the weight matrix of the $l$th layer of the MLP.
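A sketch of this simulated environment and the temperature-controlled policy head (ours; the sizes and the feature map are placeholders, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 16, 4                                     # stand-ins for |S| and |A|
T1 = rng.standard_normal((S, S)) / np.sqrt(S)    # projects states
T2 = rng.standard_normal((A, S)) / np.sqrt(A)    # projects actions
T3 = rng.standard_normal((S, S)) / np.sqrt(S)    # projects noise
W1, W2 = rng.standard_normal((S, 32)), rng.standard_normal((32, A))  # toy MLP

def transition(s, a, xi):
    """T(s, a, xi) = s T1 + a T2 + xi T3 with fixed random projections."""
    return s @ T1 + a @ T2 + xi @ T3

def policy(s, tau):
    """ReLU MLP with a temperature-tau softmax head; larger tau gives a
    smoother (smaller-Lipschitz) policy."""
    logits = np.maximum(s @ W1, 0.0) @ W2
    z = np.exp((logits - logits.max()) / tau)
    return z / z.sum()

s = rng.uniform(size=S)                          # s_0 = I(xi_0), xi_0[k] ~ U[0, 1]
for _ in range(3):                               # a few reparameterized steps
    s = transition(s, policy(s, tau=1.0), rng.uniform(size=S))
```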
In our experiment there is no bias term in the linear layers of the MLP, so $\tau \Pi_l \|\hat{\theta}_l\|_F$ can be used as a metric on the Lipschitz constant $L_{\pi,1}$ at the solution point $\hat{\theta}$. We also vary the smoothness of the transition function as a function of the states ($T_1$) and the actions ($T_2$), by applying a softmax with different temperatures $\tau$ to the singular values of the randomly generated matrices.

Table 1 shows that the average generalization gap roughly decreases as $\tau$ increases. The metric $\tau \Pi_l \|\hat{\theta}_l\|_F$ also decreases along with the average gap. In particular, the 2nd and 3rd columns show the average gap as the policy becomes "smoother". The 4th column shows the generalization gap decreasing as we increase the transition-$\tau$ for $T_1$ (states), with the policy-$\tau$ fixed. Similarly, the last column is the gap as the transition-$\tau$ for actions ($T_2$) varies. In Table 2 the environment is fixed, and for each parameter configuration the gap is averaged over trials with randomly initialized and then optimized policies.

[Table 2: Empirical gap versus the number of policy parameters (columns: Params, Gap); the numeric entries are unrecoverable from the extraction.]

To measure the external generalization gap, we vary the transition $\mathcal{T}$ as well as the initialization $\mathcal{I}$ in the test environment. For that, we add a vector of Rademacher random variables $\Delta$ to $\mathcal{I}$ or $\mathcal{T}$, with $\|\Delta\| = \zeta$. We adjust the level of noise $\zeta$ in the simulation and report the change of the average gap in Table 3 and Table 4. It is not surprising that the change $\Delta_\mathcal{T}$ in the transition $\mathcal{T}$ leads to a higher generalization gap, since the impact of $\Delta_\mathcal{T}$ is accumulated across time steps. Indeed, comparing the bounds (18) and (17) with $\gamma = 1$ and $\nu > 1$, the shift in (18) is injected at every step and accumulates, which matches the much larger gaps we observe.

[Table 3: Empirical generalization gap versus the shift $\zeta$ in the initialization $\mathcal{I}$ (columns: $\zeta$ in $\mathcal{I}$, Gap); the numeric entries are unrecoverable from the extraction.]

[Table 4: Empirical generalization gap versus the shift $\zeta$ in the transition $\mathcal{T}$ (columns: $\zeta$ in $\mathcal{T}$, Gap); the recoverable gap row reads 11, 451, 8,260, 73,300, growing rapidly with $\zeta$.]
13. Discussion and Future Work
Even though a variety of distributions, discrete or continuous, can be reparameterized, and we have shown that the classical MDP with discrete states is reparameterizable, it is not clear in general under which conditions reinforcement learning problems are reparameterizable. Classifying particular cases where RL is not reparameterizable is an interesting direction for future work. Second, the transitions of discrete MDPs are inherently non-smooth, so Theorem 1 does not apply. In this case, the PAC-Bayes bound can be applied, but this requires a totally different framework. It would be interesting to see if there is a "Bayesian" version of Theorem 1. Finally, our analysis only covers "on-policy" RL. Studying generalization for "off-policy" RL remains an interesting future topic.
References
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. International Conference on Machine Learning, 2014.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning, 2018.

Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. International Conference on Machine Learning, 2019.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems 21, 2009.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. International Conference on Machine Learning, 2017.

Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. 2017.

Bartlett, P. Lecture notes on theoretical statistics. 2013.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

Cao, Y. and Gu, Q. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. CoRR, abs/1902.01384, 2019.

Chakravorty, S. and Hyland, D. C. Minimax reinforcement learning. American Institute of Aeronautics and Astronautics, 2003.

Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. CoRR, 2018. URL http://arxiv.org/abs/1812.02341.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. International Conference on Neural Information Processing Systems (NIPS), 2017.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in DQN. CoRR, 2018. URL https://arxiv.org/abs/1810.00123.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 2015.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. International Conference on Machine Learning, 2017.

Justesen, N., Torrado, R. R., Bontrager, P., Khalifa, A., Togelius, J., and Risi, S. Illuminating generalization in deep reinforcement learning through procedural level generation. NeurIPS Deep RL Workshop, 2018.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 2013.

Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. International Conference on Neural Information Processing Systems (NIPS), 2002.

Laroche, R. Transfer reinforcement learning with shared dynamics. 2017.

Lattimore, T. and Hutter, M. Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science, 2014.

Lazaric, A. Transfer in reinforcement learning: a framework and a survey. Reinforcement Learning - State of the Art, Springer, 2012.

Lewis, F., Syrmos, V., and Syrmos, V. Optimal Control. A Wiley-Interscience publication. Wiley, 1995. ISBN 9780471033783. URL https://books.google.com/books?id=jkD37elP6NIC.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web, 2010.

Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix recovery, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: a continuous relaxation of discrete random variables. International Conference on Learning Representations, 2017.

Majeed, S. J. and Hutter, M. Performance guarantees for homomorphisms beyond Markov decision processes. CoRR, abs/1811.03895, 2018.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. Resource management with deep reinforcement learning. 2016.

McAllester, D. A. Some PAC-Bayesian theorems. Conference on Learning Theory (COLT), 1998.

McAllester, D. A. Simplified PAC-Bayesian margin bounds. Conference on Learning Theory (COLT), 2003.

Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V., and Dean, J. Hierarchical planning for device placement. 2018. URL https://openreview.net/pdf?id=Hkc-TeZ0W.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, 2016.

Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions. Algorithmic Learning Theory, 2012.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015. URL http://arxiv.org/abs/1507.04296.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. International Conference on Learning Representations (ICLR), 2018.

OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. Assessing generalization in deep reinforcement learning. CoRR, 2018. URL https://arxiv.org/abs/1810.12282.

Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., and Schmidhuber, J. Exploring parameter space in reinforcement learning. Paladyn, 2010.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135.

Shani, G., Brafman, R. I., and Heckerman, D. An MDP-based recommender system. The Journal of Machine Learning Research, 2005.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T. P., Simonyan, K., and Hassabis, D. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, 2017. URL http://arxiv.org/abs/1712.01815.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 2009.

Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. 1995.

Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 2009.

Tokui, S. and Sato, I. Reparameterization trick for discrete variables. CoRR, 2016. URL https://arxiv.org/abs/1611.01239.

van der Vaart, A. Asymptotic Statistics. Cambridge, 1998.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., and Tsing, R. StarCraft II: A new challenge for reinforcement learning. CoRR, 2017. URL http://arxiv.org/abs/1708.04782.

Wang, H., Keskar, N. S., Xiong, C., and Socher, R. Identifying generalization properties in neural networks. 2018a. URL https://openreview.net/forum?id=BJxOHs0cKm.

Wang, J., Liu, Y., and Li, B. Reinforcement learning with perturbed rewards. CoRR, abs/1810.01032, 2018b. URL http://arxiv.org/abs/1810.01032.

Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Protecting against evaluation overfitting in empirical reinforcement learning. 2011.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Zhan, Y. and Taylor, M. E. Online transfer learning in reinforcement learning domains. CoRR, abs/1507.00436, 2015.

Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. CoRR, 2018a. URL https://arxiv.org/abs/1806.07937.

Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. CoRR, 2018b. URL http://arxiv.org/abs/1804.06893.

A. Proof of Lemma 3
Lemma. For reparameterizable RL, given Assumptions 1, 2, and 3, the empirical return $R$ defined in (10), as a function of the parameter $\theta$, has Lipschitz constant

$$\beta = \sum_{t=0}^{T} \gamma^t L_r L_{t,2} L_{\pi,2} \frac{\nu^t - 1}{\nu - 1},$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$.

Proof. Denote $s'_t = s_t(\theta')$ and $s_t = s_t(\theta)$. We start by investigating the policy function across different time steps:

$$\|\pi(s'_t; \theta') - \pi(s_t; \theta)\| = \|\pi(s'_t; \theta') - \pi(s_t; \theta') + \pi(s_t; \theta') - \pi(s_t; \theta)\| \leq \|\pi(s'_t; \theta') - \pi(s_t; \theta')\| + \|\pi(s_t; \theta') - \pi(s_t; \theta)\| \leq L_{\pi,1} \|s'_t - s_t\| + L_{\pi,2} \|\theta' - \theta\|. \tag{19}$$

The first inequality is the triangle inequality, and the second follows from the Lipschitz Assumption 2.

Looking at the change of states as the episode proceeds:

$$\|s'_t - s_t\| = \|\mathcal{T}(s'_{t-1}, \pi(s'_{t-1}; \theta'), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}; \theta), \xi_{t-1})\| \leq \|\mathcal{T}(s'_{t-1}, \pi(s'_{t-1}; \theta'), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s'_{t-1}; \theta'), \xi_{t-1})\| + \|\mathcal{T}(s_{t-1}, \pi(s'_{t-1}; \theta'), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}; \theta), \xi_{t-1})\| \leq L_{t,1} \|s'_{t-1} - s_{t-1}\| + L_{t,2} \|\pi(s'_{t-1}; \theta') - \pi(s_{t-1}; \theta)\|. \tag{20}$$

Now combining (19) and (20),

$$\|s'_t - s_t\| \leq L_{t,1} \|s'_{t-1} - s_{t-1}\| + L_{t,2} \left( L_{\pi,1} \|s'_{t-1} - s_{t-1}\| + L_{\pi,2} \|\theta' - \theta\| \right) = (L_{t,1} + L_{t,2} L_{\pi,1}) \|s'_{t-1} - s_{t-1}\| + L_{t,2} L_{\pi,2} \|\theta' - \theta\|.$$

At initialization we have $s'_0 = s_0$, since the initialization process does not involve any computation using the parameter $\theta$ of the policy $\pi$. By recursion, we get

$$\|s'_t - s_t\| \leq L_{t,2} L_{\pi,2} \|\theta' - \theta\| \sum_{k=0}^{t-1} (L_{t,1} + L_{t,2} L_{\pi,1})^k = L_{t,2} L_{\pi,2} \frac{\nu^t - 1}{\nu - 1} \|\theta' - \theta\|,$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$. By Assumption 3, $r(s)$ is $L_r$-Lipschitz, so

$$|r(s'_t) - r(s_t)| \leq L_r \|s'_t - s_t\| \leq L_r L_{t,2} L_{\pi,2} \frac{\nu^t - 1}{\nu - 1} \|\theta' - \theta\|.$$

So for the return,

$$|R(s') - R(s)| = \left| \sum_{t=0}^{T} \gamma^t r(s'_t) - \sum_{t=0}^{T} \gamma^t r(s_t) \right| \leq \sum_{t=0}^{T} \gamma^t |r(s'_t) - r(s_t)| \leq \sum_{t=0}^{T} \gamma^t L_r L_{t,2} L_{\pi,2} \frac{\nu^t - 1}{\nu - 1} \|\theta' - \theta\| = \beta \|\theta' - \theta\|.$$

B. Proof of Lemma 6
Lemma. In reparameterizable RL, suppose the initialization function $\mathcal{I}'$ in the test environment satisfies $\|(\mathcal{I}' - \mathcal{I})(\xi)\| \leq \zeta$, and the transition function is the same for both the training and testing environments. If Assumptions 1, 2, and 3 hold, then

$$|\mathbb{E}_\xi[R(s(\xi; \mathcal{I}'))] - \mathbb{E}_\xi[R(s(\xi; \mathcal{I}))]| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t (L_{t,1} + L_{t,2} L_{\pi,1})^t.$$

Proof. Denote the state at time $t$ with $\mathcal{I}'$ as the initialization function by $s'_t$. Again we look at the difference between $s'_t$ and $s_t$. By the triangle inequality and Assumptions 1 and 2,

$$\|s'_t - s_t\| = \|\mathcal{T}(s'_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| \leq \|\mathcal{T}(s'_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s'_{t-1}), \xi_{t-1})\| + \|\mathcal{T}(s_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| \leq L_{t,1} \|s'_{t-1} - s_{t-1}\| + L_{t,2} \|\pi(s'_{t-1}) - \pi(s_{t-1})\| \leq (L_{t,1} + L_{t,2} L_{\pi,1}) \|s'_{t-1} - s_{t-1}\| \leq (L_{t,1} + L_{t,2} L_{\pi,1})^t \|s'_0 - s_0\| \leq (L_{t,1} + L_{t,2} L_{\pi,1})^t \zeta,$$

where the last inequality is due to the assumption that $\|s'_0 - s_0\| = \|\mathcal{I}'(\xi_0) - \mathcal{I}(\xi_0)\| \leq \zeta$.

Also, since $r(s)$ is Lipschitz,

$$|R(s') - R(s)| = \left| \sum_{t=0}^{T} \gamma^t r(s'_t) - \sum_{t=0}^{T} \gamma^t r(s_t) \right| \leq \sum_{t=0}^{T} \gamma^t |r(s'_t) - r(s_t)| \leq \sum_{t=0}^{T} \gamma^t L_r \|s'_t - s_t\| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t (L_{t,1} + L_{t,2} L_{\pi,1})^t.$$

The argument above holds for any given random input $\xi$, so

$$|\mathbb{E}_\xi[R(s'(\xi))] - \mathbb{E}_\xi[R(s(\xi))]| \leq \left| \int_\xi (R(s'(\xi)) - R(s(\xi))) \right| \leq \int_\xi |R(s'(\xi)) - R(s(\xi))| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t (L_{t,1} + L_{t,2} L_{\pi,1})^t.$$

C. Proof of Lemma 7
Lemma. In reparameterizable RL, suppose the transition $\mathcal{T}'$ in the test environment satisfies $\forall x, y, z$, $\|(\mathcal{T}' - \mathcal{T})(x, y, z)\| \leq \zeta$, and the initialization is the same for both the training and testing environments. If Assumptions 1, 2 and 3 hold, then

$$|\mathbb{E}_\xi[R(s(\xi; \mathcal{T}'))] - \mathbb{E}_\xi[R(s(\xi; \mathcal{T}))]| \leq \sum_{t=0}^{T} \gamma^t L_r \frac{\nu^t - 1}{\nu - 1} \zeta, \tag{21}$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$.

Proof. Again, denote the state at time $t$ under the new transition function $\mathcal{T}'$ by $s'_t$, and the state at time $t$ under the original transition function $\mathcal{T}$ by $s_t$. Then

$$\|s'_t - s_t\| = \|\mathcal{T}'(s'_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| \leq \|\mathcal{T}'(s'_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}'(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| + \|\mathcal{T}'(s_{t-1}, \pi(s_{t-1}), \xi_{t-1}) - \mathcal{T}(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| \leq \|\mathcal{T}'(s'_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}'(s_{t-1}, \pi(s'_{t-1}), \xi_{t-1})\| + \|\mathcal{T}'(s_{t-1}, \pi(s'_{t-1}), \xi_{t-1}) - \mathcal{T}'(s_{t-1}, \pi(s_{t-1}), \xi_{t-1})\| + \zeta \leq L_{t,1} \|s'_{t-1} - s_{t-1}\| + L_{t,2} L_{\pi,1} \|s'_{t-1} - s_{t-1}\| + \zeta = (L_{t,1} + L_{t,2} L_{\pi,1}) \|s'_{t-1} - s_{t-1}\| + \zeta.$$

Again we have the initialization condition $s'_0 = s_0$, since the initialization procedure $\mathcal{I}$ stays the same. By recursion we have

$$\|s'_t - s_t\| \leq \zeta \sum_{k=0}^{t-1} (L_{t,1} + L_{t,2} L_{\pi,1})^k = \zeta \frac{\nu^t - 1}{\nu - 1}. \tag{22}$$

By Assumption 3,

$$|R(s') - R(s)| = \left| \sum_{t=0}^{T} \gamma^t r(s'_t) - \sum_{t=0}^{T} \gamma^t r(s_t) \right| \leq \sum_{t=0}^{T} \gamma^t |r(s'_t) - r(s_t)| \leq \sum_{t=0}^{T} \gamma^t L_r \|s'_t - s_t\| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1}.$$

Again the argument holds for any given random input $\xi$, so

$$|\mathbb{E}_\xi[R(s'(\xi))] - \mathbb{E}_\xi[R(s(\xi))]| \leq \left| \int_\xi (R(s'(\xi)) - R(s(\xi))) \right| \leq \int_\xi |R(s'(\xi)) - R(s(\xi))| \leq L_r \zeta \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1}.$$

D. Proof of Theorem 1
Theorem. In reparameterizable RL, suppose the transition $\mathcal{T}'$ in the test environment satisfies $\forall x, y, z$, $\|(\mathcal{T}' - \mathcal{T})(x, y, z)\| \leq \zeta$, and suppose the initialization function $\mathcal{I}'$ in the test environment satisfies $\forall \xi$, $\|(\mathcal{I}' - \mathcal{I})(\xi)\| \leq \epsilon$. If Assumptions 1, 2 and 3 hold, the peripheral random variables $\xi^i$ for each episode are i.i.d., and the reward is bounded, $|R(s)| \leq c/2$, then with probability at least $1 - \delta$, for all policies $\pi \in \Pi$,

$$\left| \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}'))] - \frac{1}{n} \sum_i R(s(\xi^i; \pi, \mathcal{T}, \mathcal{I})) \right| \leq 2\,\mathrm{Rad}(R_{\pi, \mathcal{T}, \mathcal{I}}) + L_r \zeta \sum_{t=0}^{T} \gamma^t \frac{\nu^t - 1}{\nu - 1} + L_r \epsilon \sum_{t=0}^{T} \gamma^t \nu^t + O\!\left(c \sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where $\nu = L_{t,1} + L_{t,2} L_{\pi,1}$, and

$$\mathrm{Rad}(R_{\pi, \mathcal{T}, \mathcal{I}}) = \mathbb{E}_\xi \mathbb{E}_\sigma \left[ \sup_\pi \frac{1}{n} \sum_{i=1}^{n} \sigma_i R(s^i(\xi^i; \pi, \mathcal{T}, \mathcal{I})) \right]$$

is the Rademacher complexity of $R(s(\xi; \pi, \mathcal{T}, \mathcal{I}))$ under the training transition $\mathcal{T}$ and the training initialization $\mathcal{I}$, and $n$ is the number of training episodes.

Proof. Note that

$$\left| \frac{1}{n} \sum_i R(s(\xi^i; \pi, \mathcal{T}, \mathcal{I})) - \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}'))] \right| \leq \left| \frac{1}{n} \sum_i R(s(\xi^i; \pi, \mathcal{T}, \mathcal{I})) - \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}, \mathcal{I}))] \right| + \left| \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}, \mathcal{I}))] - \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}))] \right| + \left| \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}))] - \mathbb{E}_\xi[R(s(\xi; \pi, \mathcal{T}', \mathcal{I}'))] \right|.$$

The first term is bounded uniformly over $\Pi$ by Lemma 2, the second term by Lemma 7, and the third term by Lemma 6. Combining the three bounds completes the proof.