Variational Policy Gradient Method for Reinforcement Learning with General Utilities
Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, Mengdi Wang
((1) 2020 Submitted pp; Published mm/dd
Variational Policy Gradient Method forReinforcement Learning with General Utilities
Junyu Zhang [email protected]
Department of Industrial and Systems EngineeringUniversity of MinnesotaMinneapolis, Minnesota, 55455
Alec Koppel [email protected]
Computational and Information Sciences DirectorateUS Army Research LaboratoryAdelphi, MD 20783
Amrit Singh Bedi [email protected]
Computational and Information Sciences DirectorateUS Army Research LaboratoryAdelphi, MD, USA 20783
Csaba Szepesvari [email protected]
Department of Computer ScienceDeepMind/University of AlbertaPrinceton, NJ 08544
Mengdi Wang [email protected]
Department of Electrical EngineeringCenter for Statistics and Machine LearningPrinceton University/DeepmindPrinceton, NJ 08544
Abstract
In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sumof rewards have gained traction, such as in constrained problems, exploration, and acting uponprior experiences. In this paper, we consider policy optimization in Markov Decision Problems,where the objective is a general concave utility function of the state-action occupancy measure,which subsumes several of the aforementioned examples as special cases. Such generality invalidatesthe Bellman equation. As this means that dynamic programming no longer works, we focus ondirect policy search. Analogously to the Policy Gradient Theorem Sutton et al. (2000) available forRL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL withgeneral utilities, which establishes that the parametrized policy gradient may be obtained as thesolution of a stochastic saddle point problem involving the Fenchel dual of the utility function. Wedevelop a variational Monte Carlo gradient estimation algorithm to compute the policy gradientbased on sample paths. We prove that the variational policy gradient scheme converges globally tothe optimal policy for the general objective, though the optimization problem is nonconvex. Wealso establish its rate of convergence of the order O (1 /t ) by exploiting the hidden convexity ofthe problem, and proves that it converges exponentially when the problem admits hidden strongconvexity. Our analysis applies to the standard RL problem with cumulative rewards as a specialcase, in which case our result improves the available convergence rate. c (cid:13) a r X i v : . [ c s . L G ] J u l hang, Koppel, Bedi, Szepesvari, and Wang
1. Introduction
The standard formulation of reinforcement learning (RL) is concerned with finding a policy thatmaximizes the expected sum of rewards along the sample paths generated by the policy. The additivenature of the objective function creates an attractive algebraic structure which most efficient RLalgorithms exploit. However, the cumulative reward objective is not the only one that has attractedattention. In fact, many alternative objectives made appearances already in the early literature onstochastic optimal control and operations research. Examples include various kinds of risk-sensitiveobjectives Kallenberg (1994); Borkar and Meyn (2002); Yu et al. (2009); Mannor and Tsitsiklis(2011), objectives to maximize the entropy of the state visitation distribution Hazan et al. (2018),the incorporation of constraints Derman and Klein (1965); Altman (1999); Achiam et al. (2017), andlearning to “mimic” a demonstration Schaal (1997); Argall et al. (2009).In this paper, we consider RL with general utility functions, and we aim to develop a principledmethodology and theory for policy optimization in such problems. We focus on utility functions thatare concave functionals of the state-action occupancy measure, which contains many, although notall, of the aforementioned examples as special cases.The general (or non-standard Kallenberg (1994))utility is a strict generalization of cumulative reward, which itself can be viewed as a linear functionalof the state-action occupancy measure, and as such, is a concave function of the occupancy measures.When moving beyond cumulative rewards, we quickly run into technical challenges because ofthe lack of additive structure. Without additivity of rewards, the problem becomes non-Markovianin the cost-to-go Tak´acs (1966); Whitehead and Lin (1995). Consequently, the Bellman equationfails to hold and dynamic programming (DP) breaks down. Therefore, stochastic methods basedupon DP such as temporal difference Sutton (1988) and Q-learning Watkins and Dayan (1992); Ross(2014) are inapplicable. The value function, the core quantity for RL, is not even well defined forgeneral utilities, thus invalidating the foundation of value-function based approach to RL.Due to these challenges, we consider direct policy search methods for the solution of RL problemsdefined by general utility functions. We consider the most elementary policy-based method, namelythe Policy Gradient (PG) method Williams (1992). The idea of policy gradient methods is that torepresent policies through some policy parameterization and then move the parameters of a policy inthe direction of the gradient of the objective function. When (as typical) only a noisy estimate of thegradient is available, we arrive at a stochastic approximation method Robbins and Monro (1951);Kiefer et al. (1952). In the classical cumulative reward objectives, the gradient can be written asthe product of the action-value function and the gradient of the logarithm of the policy, or policyscore function Sutton et al. (2000). State-of-the-art RL algorithms for the cumulative reward settingcombine this result with other ideas, such as limiting the changes to the policies S Kakade (2002);Schulman et al. (2015, 2017), variance reduction Kakade (2002); Papini et al. (2018); Xu et al. (2019),or exploiting structural aspects of the policy parameterization Wang et al. (2019); Agarwal et al.(2019); Mei et al. (2020).As mentioned, these approaches crucially rely on the standard PG Theorem Sutton et al. 
(2000),which is not available for general utilities. Compounding this challenge is the fact that the action-valuefunction is not well-defined in this instance, either. Thus, how and whether the policy gradient canbe effectively computed becomes a question. Further, due to the problem’s nonconvexity, it is anopen question whether an iterative policy improvement scheme converges to anything meaningful: Inparticular, while standard results for stochastic approximation would give convergence to stationarypoints Borkar (2009), it is unclear whether the stationary points give reasonable policies. Therefore,we ask the question:
Is policy search viable for general utilities,when Bellman’s equation, the value function, and dynamic programming all fail?
We will answer the question positively in this paper. Our contributions are three-folded: ariational Policy Gradient • We derive a Variational Policy Gradient Theorem for RL with general utilities which establishesthat the parametrized policy gradient is the solution to a stochastic saddle point problem. • We show that the Variational Policy Gradient can be estimated by a primal-dual stochasticapproximation method based on sample paths generated by following the current policy Arrowet al. (1958). We prove that the random error of the estimate decays at order O (1 / √ n ) thatalso depends on properties of the utility, where n is the number of episodes . • We consider the non-parameterized policy optimization problem which is nonconvex in thepolicy space. Despite the lack of convexity, we identify the problem’s hidden convexity, whichallows us to show that a variational policy gradient ascent scheme converges to the globaloptimal policy for general utilities, at a rate of O (1 /t ), where t is the iteration index. In thespecial case of cumulative rewards, our result improves upon the best known convergence rate O (1 / √ t ) for tabular policy gradient Agarwal et al. (2019), and matches the convergence rate ofvariants of the algorithm such as softmax policy gradient Mei et al. (2020) and natural policygradient Agarwal et al. (2019). In the case where the utility is strongly concave in occupancymeasures (e.g., utilities involving Kullback-Leiber divergence), we established the exponentialconvergence rate of the variational gradient scheme. Related Work.
Policy gradient methods have been extensive studied for RL with cumulativereturns. There is a large body of work on variants of policy-based methods as well as theoreticalconvergence analysis for these methods. Due to space constraints, we defer a thorough review toSupplement A.
Notation.
We let R denote the set of reals. We also let (cid:107) · (cid:107) denote the 2-norm, while for matriceswe let it denote the spectral norm. For the p -norms (1 ≤ p ≤ ∞ ), we use (cid:107) · (cid:107) p . For any matrix B , (cid:107) B (cid:107) ∞ , := max (cid:107) u (cid:107) ∞ ≤ (cid:107) Bu (cid:107) . For a differentiable function f , we denote by ∇ f its gradient. If f is nondifferentiable, we denote by ˆ ∂f the Fr´echet superdifferential of f ; see e.g. Drusvyatskiy andPaquette (2019).
2. Problem Formulation
Consider a Markov decision process (MDP) over the finite state space S and a finite action space A .For each state i ∈ S , a transition to state j ∈ S occurs when selecting action a ∈ A according to aconditional probability distribution j ∼ P ( ·| a, i ), for which we define the short-hand notation P a ( i, j ).Let ξ be the initial state distribution of the MDP. We let S denote the number of states and A thenumber of actions. The goal is to prescribe actions based on previous states in order to maximizesome long term objective. We call π : S → P ( A ) a policy that maps states to distributions overactions, which we subsequently stipulate is stationary. In the standard (cumulative return) MDP,the objective is to maximize the expected cumulative sum of future rewards Puterman (2014), i.e.,max π V π ( s ) := E (cid:34) ∞ (cid:88) t =0 γ t r s t a t (cid:12)(cid:12)(cid:12)(cid:12) i = s, a t ∼ π ( ·| s t ) , t = 0 , , . . . (cid:35) , ∀ s ∈ S . (2.1)with reward r s t a t ∈ R revealed by the environment when action a t is chosen at state s t .In this paper we consider policy optimization for maximizing general objective functions that arenot limited to cumulative rewards. In particular, we consider the problemmax π R ( π ) := F ( λ π ) (2.2)where λ π is known as the cumulative discounted state-action occupancy measure , or flux under policy π , and F is a general concave functional. Denote ∆ SA and L as the set of policy and flux respectively, hang, Koppel, Bedi, Szepesvari, and Wang then λ π is given by the mapping Λ : ∆ SA (cid:55)→ L as λ πsa = Λ sa ( π ) := ∞ (cid:88) t =0 γ t · P (cid:16) s t = s, a t = a (cid:12)(cid:12)(cid:12) π, s ∼ ξ (cid:17) for ∀ a ∈ A , ∀ s ∈ S . (2.3)Similar to the LP formulation of a standard MDP, we can write (2.2) equivalently as an optimiza-tion problem in λ (see Zhang et al. (2020a)), giving rise tomax λ F ( λ ) s.t. (cid:88) a ∈A ( I − γP (cid:62) a ) λ a = ξ, λ ≥ , (2.4)where λ a = [ λ a , · · · , λ Sa ] (cid:62) ∈ R A is the a -th column of λ and ξ is the initial distribution over thestate space S . The constraints require that λ be the unnormalized state-action occupancy measurecorresponding to some policy. In fact, it is well known that a policy π inducing λ can be extractedfrom λ using the mapping Π : L (cid:55)→ ∆ SA as π ( a | s ) = Π sa ( λ ) := λ sa (cid:80) a (cid:48)∈A λ sa (cid:48) for all a, s .Problem (2.2) contains the original MDP problem as a special case. To be specific, when F ( λ ) = (cid:104) r, λ (cid:105) with r ∈ R SA as the reward function, then F ( λ ) = (cid:104) λ, r (cid:105) = E (cid:2) (cid:80) ∞ t =0 γ t r s t a t (cid:12)(cid:12) π, s ∼ ξ (cid:3) .This means that (2.4) is a generalization of (2.1), and reduces to the dual LP formulation of standardMDP for this (linear) choice of F ( · ) Kallenberg (1983). We focus on the case where F is concave,which makes (2.4) a concave (hence, convenient) maximization problem. Next we introduce a fewexamples that arise in practice for incentivizing safety, exploration, and imitation, respectively. Example 2.1 ( MDP with Constraints or Barriers).
In discounted constrained MDPs the goalis to maximize the total expected discounted reward under a constraint where for some cost function c : S × A → R , the total expected discounted cost incurred by the chosen policy is constrainedfrom above. Letting r denote the reward function over S × A , the underlying optimization problembecomes max π v πr := E π (cid:34) ∞ (cid:88) t =0 γ t r ( s t , a t ) (cid:35) s.t. v πc := E π (cid:34) ∞ (cid:88) t =0 γ t c ( s t , a t ) (cid:35) ≤ C. (2.5)As is well known, a relaxed formulation ismax λ F ( λ ) := (cid:104) λ, r (cid:105) − β · p ( (cid:104) λ, c (cid:105) − C ) s.t. (cid:88) a ∈A ( I − γP (cid:62) a ) λ a = ξ, λ ≥ . (2.6)where p is a penalty function (e.g., the log barrier function). Example 2.2 ( Pure Exploration).
In the absence of a reward function, an agent may considerthe problem of finding a policy whose stationary distribution has the largest “entropy”, as this shouldfacilitate maximizing the speed at which the agent explores its environment Hazan et al. (2018):max π R ( π ) := Entropy(¯ λ π ) , (2.7)where ¯ λ π is the normalized state visitation measure given by ¯ λ πs = (1 − γ ) (cid:80) a λ πsa for all s . Variousentropic measures are possible, but the simplest is the negative log-likelihood: Entropy(¯ λ π ) = − (cid:80) s ¯ λ πs log[¯ λ πs ]. As is well known, this entropy is (strongly) concave.Another example, when d state-action features φ ( s, a ) ∈ R d are available, is to cover the entirefeature space by maximizing the smallest eigenvalue of the covariance matrix:max π R ( π ) := σ min (cid:32) E π (cid:34) ∞ (cid:88) t =1 γ t φ ( s t , a t ) φ ( s t , a t ) (cid:62) (cid:35)(cid:33) . (2.8) ariational Policy Gradient In (2.8), observe that E π [ (cid:80) ∞ t =1 γ t φ ( s t , a t ) φ ( s t , a t ) (cid:62) ] = (cid:80) sa λ πsa · φ ( s, a ) φ ( s, a ) (cid:62) . By Rayleighprinciple, it is again a concave function of λ . Example 2.3 ( Learning to mimic a demonstration).
When demonstrations are available, theymay be employed to obtain information about a prior policy in the form of a state visitationdistribution ¯ µ . Remaining close to this prior can be achieved by minimizing the Kullback-Liebler(KL) divergence between the state marginal distribution of λ and the prior ¯ µ stated as F ( λ ) = KL (cid:16) (1 − γ ) (cid:88) a λ a || ¯ µ (cid:17) (2.9)which, when substituted into (2.4), yields a method for ensuring some baseline performance. Wefurther note that in place of KL divergence, one can also use other convex distances such asWasserstein, total variation, or Hellinger distances.Additional instances may be found in Zhang et al. (2020a). With the setting clarified, we shiftfocus to developing an algorithmic solution to (2.4), that is, to solve for policy π .
3. Variational Policy Gradient Theorem
To handle the curse of dimensionality, we allow parametrization of the policy by π = π θ , where θ ∈ Θ ⊂ R d is the parameter vector. In this way, we can narrow down the policy search problem towithin a d -dimensional parameter space rather than the high-dimensional space of tabular policies.The policy optimization problem then becomesmax θ ∈ Θ R ( π θ ) := F ( λ π θ ) (3.1)where F is the concave utility of the state-action occupancy measure λ ( θ ) := λ π θ , Θ ⊂ R d is aconvex set. We seek to solve for the policy maximizing the utility as in (3.1) using gradient ascentover the parameter space Θ. Note that (3.1) is simply (2.2) with parameterization θ of policy π substituted. We denote by ∇ θ R ( π θ ) the parameterized policy gradient of general utility.First, recall the policy gradient theorem for RL with cumulative rewards Sutton et al. (2000).Let the reward function be r . Define V ( θ ; r ) := (cid:104) λ ( θ ) , r (cid:105) , i.e., the total expected discounted rewardunder the reward function r and the policy π θ . The Policy Gradient Theorem states that ∇ θ V ( θ ; r ) = E π θ (cid:34) ∞ (cid:88) t =0 γ t Q π θ ( s t , a t ; r ) · ∇ θ log π θ ( a t | s t ) (cid:35) , (3.2)where Q π ( s, a ; r ) := E π (cid:2) (cid:80) t γ t r ( s t , a t ) | s = s, a = a, a t ∼ π ( · | s t ) (cid:3) . Unfortunately, this elegantresult no longer holds when we consider a general function instead of cumulative rewards: The policygradient theorem relies on the additivity of rewards, which is lost in our problem. For future reference,we denote Q π ( s, a ; z ) := E π (cid:2) (cid:80) t γ t z s t a t | s = s, a = a, a t ∼ π ( · | s t ) (cid:3) where z is any “function”of the state-action pairs ( z ∈ R SA ). Moreover, V ( θ ; z ) is defined similarly. These definitions aremotivated by subsequent efforts to derive an expression for the gradient of (3.1). R ( π θ )Now we derive the policy gradient of R ( π θ ) with respect to θ . By the chain rule, the gradient of F ( λ ( θ )) := F ( λ π θ ), using the definition of R ( π θ ), yields (assuming differentiability of F, λ ): ∇ θ R ( π θ ) = (cid:88) s ∈S (cid:88) a ∈A ∂F ( λ ( θ )) ∂λ sa · ∇ θ λ sa ( θ ) . (3.3) hang, Koppel, Bedi, Szepesvari, and Wang To directly use the chain rule, one needs the partial derivatives ∂F ( λ ( θ )) ∂λ sa and ∇ θ λ sa ( θ ). Unfortunately,neither of them is easy to estimate. The partial gradient ∂F ( λ ( θ )) ∂λ sa is a function of the currentstate-action occupancy measure λ π θ . One might attempt to estimate the measure λ π θ and thenevaluate the gradient map ∂F ( λ ( θ )) ∂λ sa . However, estimates of distributions over large spaces tend toconverge very slowly Tsybakov (2008).As it turns out, a viable alternate route is to consider the Fenchel dual F ∗ of F . Recall that F ∗ ( z ) = inf λ (cid:104) λ, z (cid:105) − F ( λ ), where we use (cid:104) x, y (cid:105) := x (cid:62) y (since F is concave, the dual is defined usinginf, instead of sup). As is well known, for F concave, under mild regularity conditions, the bidual(dual of the dual) of F is equal to F . This forms the basis of our first result, which states that thesteepest policy ascent direction of (3.1) is the solution to a stochastic saddle point problem. Theproofs of this and subsequent results are given in the supplementary material. Theorem 3.1 ( Variational Policy Gradient Theorem ) . Suppose F is concave and continuouslydifferentiable in an open neighborhood of λ π θ . 
Denote V ( θ ; z ) to be the cumulative value of policy π θ when the reward function is z , and assume ∇ θ V ( θ ; z ) always exists. Then we have ∇ θ R ( π θ ) = lim δ → + argmax x inf z (cid:26) V ( θ ; z ) + δ ∇ θ V ( θ ; z ) (cid:62) x − F ∗ ( z ) − δ (cid:107) x (cid:107) (cid:27) . (3.4)Therefore, to estimate ∇ θ R ( π θ ) we require the cumulative return V ( θ ; z ) of the function z , itsassociated “vanilla” policy gradient (3.2), and the gradient of the Fenchel dual of F at z . Theseingredients are combined via (3.4) to obtain a valid policy gradient for general objectives. Next, wediscuss how to estimate the gradient using sampled trajectories. Theorem 3.1 implies that one can estimate ∇ θ R ( π θ ) by solving a stochastic saddle point problem.Suppose we generate n i.i.d. episodes of length K following π θ , denoted as ζ i = { s ( i ) k , a ( i ) k } Kk =1 . Thenwe can estimate V ( θ ; z ) and ∇ V ( θ ; z ) for any function z by˜ V ( θ ; z ) := 1 n n (cid:88) i =1 V ( θ ; z ; ζ i ) := 1 n n (cid:88) i =1 K (cid:88) k =1 γ k · z ( s ( i ) k , a ( i ) k ) , (3.5) ∇ ˜ V ( θ ; z ) := 1 n (cid:88) i =1 ∇ θ V ( θ ; z ; ζ i ) := 1 n n (cid:88) i =1 K (cid:88) k =1 (cid:88) a ∈A γ k · Q ( s ( i ) k , a ; z ) ∇ θ π θ ( a | s ( i ) k ) . For a given value of K , the error introduced by “truncating” trajectories at length K is of order γ K / (1 − γ ), which quickly decays to zero for γ <
1. Plugging in the obtained estimates into (3.4)gives rise to the sample-average approximation to the policy gradient:ˆ ∇ θ R ( π θ ; δ ) := argmax x inf (cid:107) z (cid:107) ∞ ≤ (cid:96) F (cid:26) − F ∗ ( z ) + ˜ V ( θ ; z ) + δ ∇ θ ˜ V ( θ ; z ) (cid:62) x − δ (cid:107) x (cid:107) (cid:27) , (3.6)where (cid:96) F is defined in the next theorem. Therefore, any algorithm that solves problem (3.6) will serveour purpose. A MC stochastic approximation scheme, i.e., Algorithm 1, is provided in Appendix B.1. Theorem 3.2 ( Error bound of policy gradient estimates ) . Suppose the following holds: (i) dom F = R SA , there exists (cid:96) F such that max {(cid:107)∇ F ( λ ) (cid:107) ∞ : (cid:107) λ (cid:107) ≤ − γ } ≤ (cid:96) F . (ii) F is L F -smooth under L norm, i.e., (cid:107)∇ F ( λ ) − ∇ F ( λ (cid:48) ) (cid:107) ∞ ≤ L F (cid:107) λ − λ (cid:48) (cid:107) . (iii) F ∗ is ( (cid:96) F ∗ ) -Lipschitz with respect to the L ∞ norm in the set { z : (cid:107) z (cid:107) ∞ ≤ (cid:96) F , F ∗ ( z ) > −∞} . (iv) There exists C with (cid:107)∇ θ π ( ·| s ) (cid:107) ∞ , ≤ C , where ∇ θ π ( ·| s ) = [ ∇ θ π (1 | s ) , · · · , ∇ θ π ( A | s )] .Let ˆ ∇ θ R ( π θ ) := lim δ → + ˆ ∇ θ R ( π θ ; δ ) . Then E [ (cid:107) ˆ ∇ θ R ( π θ ) − ∇ θ R ( π θ ) (cid:107) ] ≤ O (cid:18) C ( (cid:96) F + L F (cid:96) F ∗ ) n (1 − γ ) + C L F n (1 − γ ) (cid:19) + O ( γ K ) . ariational Policy Gradient Remarks.(1)
Theorem 3.2 suggests an O (1 / √ n ) error rate, proving that the variational policy gradient -though more complicated than the typical policy gradient that takes the form of a mean - can beefficiently estimated from finite data. (2) Although the variable z is high dimensional, our error bound depends only on the properties of F . (3) We assumed for simplicity that Q values are known. In practice, they can estimated by, e.g., anadditional Monte Carlo rollout on the same sample path or temporal difference learning. As long asthe estimator for Q ( s, a ; z ) is unbiased and upper bouded by O ( (cid:107) z (cid:107) ∞ − γ ), the result will not change. (4) For the case of cumulative rewards, we have F ( λ ) = (cid:104) r, λ (cid:105) , so that (cid:96) F = (cid:107) r (cid:107) ∞ , (cid:96) F ∗ =0 , L F =0.Therefore E [ (cid:107) ˆ ∇ θ R ( π θ ) − ∇ θ R ( π θ ) (cid:107) ] ≤ O (cid:18) C (cid:107) r (cid:107) ∞ n (1 − γ ) (cid:19) . Special cases of ∇ θ R ( π θ ) . We further explain how to obtain the variational policy gradientfor several special cases of R , including constrained MDP, maximal exploration, and learning fromdemonstrations. See Appendix B.2 for more details.
4. Global Convergence of Policy Gradient Ascent
In this section, we analyze policy search for the problem (3.1), i.e., max θ ∈ Θ R ( π θ ) via gradient ascent: θ k +1 = argmax θ ∈ Θ R ( π θ k )+ (cid:10) ∇ θ R ( π θ k ) , θ − θ k (cid:11) − η (cid:107) θ − θ k (cid:107) = Proj Θ (cid:8) θ k + η ∇ θ R ( π θ k ) (cid:9) (4.1)where Proj Θ {·} denotes Euclidean projection onto Θ, and equivalence holds by the convexity of Θ. We study the geometry of the (possibly) nonconvex optimization problem (3.1). When F is a linearfunction of λ , and the parameterization is tabular or softmax, existing theory of cumulative-returnRL problems have shown that every first-order stationary point of (3.1) is globally optimal – seeAgarwal et al. (2019); Mei et al. (2020).In what follows, we show that the problem (3.1) has no spurious extrema despite of its nonconvexity,for general utility functions and policy parametrization. Specifically, to generalize global optimalityattributes of stationary points of (3.1) from (2.1), we exploit structural aspects of the relationshipbetween occupancy measures and parameterized families of policies, namely, that these entities arerelated through a bijection. This bijection, when combined with the fact that (3.1) is concave in λ ,and suitably restricting the parameterized family of policies, is what we subsequently describe as“hidden convexity.” For these results to be valid, we require the following regularity conditions. Assumption 4.1.
Suppose the following holds true: (i). λ ( · ) forms a bijection between Θ and λ (Θ) , where Θ and λ (Θ) are closed and convex. (ii). The Jacobian matrix ∇ θ λ ( θ ) is Lipschitz continuous in Θ . (iii). Denote g ( · ) := λ − ( · ) as the inverse mapping of λ ( · ) . Then there exists (cid:96) θ > s.t. (cid:107) g ( λ ) − g ( λ (cid:48) ) | ≤ (cid:96) θ ||| λ − λ (cid:48) ||| for some norm ||| · ||| and for all λ, λ (cid:48) ∈ λ (Θ) . In particular, for the direct policy parametrization, also known as the “tabular” policy case, wehave λ ( θ ) := Λ( π ) where Λ is defined in (2.3). When ξ is positive-valued, Assumption 4.1 is true forthe tabular policy case (as established in Appendix H). Theorem 4.2 ( Global optimality of stationary policies).
Suppose Assumption 4.1 holds, and F is a concave, and continuous function defined in an open neighbourhood containing λ (Θ) . Let θ ∗ be a first-order stationary point of problem (3.1) , i.e., ∃ u ∗ ∈ ˆ ∂ ( F ◦ λ )( θ ∗ ) , s.t. (cid:104) u ∗ , θ − θ ∗ (cid:105) ≤ for ∀ θ ∈ Θ . (4.2) hang, Koppel, Bedi, Szepesvari, and Wang Then θ ∗ is a globally optimal solution of problem (3.1) . Theorem 4.2 provides conditions such that, despite of nonconvexity, local search methods canfind the global optimal policies. Since we aim at general utilities, we naturally separated out theconvex and non-convex maps in the composite objective and our conditions for optimality rely on theproperties of these. In a recent paper, Bhandari and Russo (2020) proposed some sufficient conditionsunder which a result similar to Theorem 4.2 holds in the setting of the standard, cumulative totalreward criterion. Their conditions are (i) the policy class is closed under (one-step, weighted) policyimprovement and that (ii) all stationary points of the one-step policy improvement map are globaloptima of this map. It remains for future work to see the relationship between our conditions andthese conditions: They appear to have rather different natures.
Now we analyze the convergence rate of the policy gradient scheme (4.1) for general utilities.
Assumption 4.3.
There exists
L > such that the policy gradient ∇ θ R ( π θ ) is L -Lipschitz. The objective R ( π θ ) is nonconvex in θ , so one might expect that gradient schemes converge tostationary solutions at a standard O (1 / √ t ) convergence rate Shapiro et al. (2014). Remarkably, thepolicy optimization problem admits a convex nature if we view it in the space of λ , as long as F isconcave. By exploiting this hidden convexity, we establish an O (1 /t ) convergence rate for solving RLwith general utilities. Further, we show that, when the utility F is strongly concave, the gradientascent scheme converges to the globally optimal policy exponentially fast. Theorem 4.4 ( Convergence rate of parameterized policy gradient iteration).
Let Assump-tions 4.1 and 4.3 hold. Denote D λ := max λ,λ (cid:48) ∈ λ (Θ) ||| λ − λ (cid:48) ||| as defined in Assumption 4.1(iii). Thenthe policy gradient update (4.1) with η = 1 /L satisfies for all kR ( π θ ∗ ) − R ( π θ k ) ≤ L(cid:96) θ D λ k + 1 . Additionally, if F ( · ) is µ -strongly concave with respect to the ||| · ||| norm, we have R ( π θ ∗ ) − R ( π θ k ) ≤ (cid:16) − L(cid:96) θ /µ (cid:17) k ( R ( π θ ∗ ) − R ( π θ )) . The exponential convergence result of Theorem 4.4 implies that, when a regularizer like Kullback-Leiber divergence is used, policy gradient method converges much faster. In other words, policysearch with general utilities can actually be easier than the typical, cumulative-return problem.Finally, we study the case where policies are not parameterized, i.e., θ = π . The next theoremestablishes a tighter convergence rate than what Theorem 4.4 already implies. Theorem 4.5 ( Convergence rate of tabular policy gradient iteration).
Let θ = π and λ ( θ ) = Λ( π ) . Let Assumption 4.3 hold and assume that ξ is positive-valued. Then the iteratesgenerated by (4.1) with η = 1 /L satisfy for all k ≥ that R ( π ∗ ) − R ( π k ) ≤ L |S| (1 − γ ) ( k + 1) · (cid:13)(cid:13)(cid:13) d π ∗ ξ /ξ (cid:13)(cid:13)(cid:13) ∞ . The case of cumulative rewards.
Let us consider the well-studied special case where F is a linearfunctional, i.e., R ( π ) = V π [cf. (2.1)] is the typical cumulative return. In this case, we have L = γA (1 − γ ) (Agarwal et al. (2019)). Now in order to obtain an (cid:15) -optimal policy ¯ π such that V π ∗ − V ¯ π ≤ (cid:15) , thegradient ascent update requires O (cid:16) SA (1 − γ ) (cid:15) · (cid:13)(cid:13) d π ∗ ξ /ξ (cid:13)(cid:13) ∞ (cid:17) iterations according to Theorem 4.5. Thisbound is strictly smaller than the O (cid:16) SA (1 − γ ) (cid:15) (cid:13)(cid:13) d π ∗ ξ /ξ (cid:13)(cid:13) ∞ (cid:17) iteration complexity proved by Agarwal et al.(2019) for tabular policy gradient. The improvement from O (1 /(cid:15) ) to (1 /(cid:15) ) comes from the fact that,although the policy optimization problem is nonconvex, our analysis exploits its hidden convexity inthe space of λ . ariational Policy Gradient
5. Experiments
Now we shift to numerically validating our methods and theory on OpenAI Frozen Lake Brockmanet al. (2016). Throughout, additional details may be found in Appendix C.Figure 1:
PG estimation via Alg.1
Cosine similarity betweenPG estimates ˆ x t generatedby Algorithm 1 after t sam-ples and the ground truth x (cid:63) , which consistently con-verges to near 1 across dif-ferent instances (E.g. (2.1)- (2.3)) when t becomeslarge. For comparison, wealso include the convergenceof PG estimates from RE-INFORCE for cumulative re-turns. (a) Entropy vs. World & occupancy dist. (En-tropy)
Figure 2:
Results for maximum entropy exploration : In Fig. 2(a), toquantify exploration, we present the entropy of flux λ over train-ing index n for our approach, as compared with the entropy of auniform random policy. Fig. 2(b)(bottom) visualizes the worldmodel (holes in the lake have null entropy, as they terminate theepisode), the lower middle layer displays the occupancy measureassociated with a uniformly random policy, the upper-middle vi-sualizes the pseudo-reward z ∗ defined by the Fenchel dual of theentropy (2.7) – see Appendix B.2. Lastly, on top we visualize theoccupancy measure associated with the max entropy policy, whichbetter covers the space than a uniformly random policy. Policy Gradient (PG) Estimation.
First we investigate the use of Theorem 3.1 and Algorithm1 (Appendix B.1) for PG estimation, for several instances of the general utility. We also compareit with the gradient estimates computed by REINFORCE for cumulative returns. Specifically, inFigure 1 we illustrate the convergence of gradient estimates, measured using the cosine similaritybetween x n (running estimate based on n episodes) and the true gradient x ∗ (which is evaluatedusing brute force Monte Carlo rollouts – see Appendix C.2). The cosine similarity converges to 1across different instances, providing evidence that Algorithm 1 yields consistent gradient estimatesfor general utilities. PG Ascent for Maximal Entropy Exploration.
Next, we consider maximum entropy exploration(2.7) using algorithm (4.1), with softmax parametrization. First, we display the evolution of theentropy of the normalized occupancy measure over the number of episodes in Fig. 2(a). Then, wevisualize the world model in Fig. 2(b)(bottom). Moreover, the lower middle is the occupancy measureassociated with a uniformly random policy, the upper-middle layer visualizes the ”pseudo-reward” z ∗ computed as the Fenchel dual of the entropy (2.7) – see Appendix B.2, which is null at the holesand positive otherwise. We use a different color to denote that its values are not likelihoods. Theoccupancy measure obtained by policy gradient ascent with gradient estimated by Algorithm 1 atthe end of training is in Figure 2(b)(top) – observe the maximal entropy policy achieves significantlybetter coverage of the state space than the uniformly random policy. PG Ascent for Avoiding Obstacles.
Suppose our goal is to navigate the Frozen Lake and avoidobstacles. We consider imposing penalties to avoid costly states [cf. (2.6)] via a logarithmic barrier(B.3), and by applying variational PG ascent, we obtain an optimal policy whose resulting occupancymeasure is depicted in Fig. 3(a)(top). For comparison, we consider optimizing the standard expectedcumulative return (2.1), whose state occupancy measure is given in Fig. 3(a)(middle). Observe thatimposing log penalties yields policies whose probability mass is concentrated away from obstacles(dark green). Further, we display in Fig. 3 the reward 3(b) and cost 3(c) accumulation during test hang, Koppel, Bedi, Szepesvari, and Wang (a) World & occupancy dist.(CMDP) (b) Reward vs.
Figure 3:
Results for avoiding obstacles.
Fig. 3(a)(bottom) depicts the world model of OpenAI Frozen Lake withaugmentation to include costly states, e.g., obstacles: C represents costly states, F is the frozen lake, H is thehole, and G is the goal. We consider softmax policy parameterization, and visualize the occupancy measureassociated with REINFORCE for the cumulative return (2.1) in the middle layer, and the relaxed
CMDP (2.6)via a logarithmic barrier (B.3) at the top.The policy obtained via barriers avoids visiting costly states, incontrast to the middle. Fig. 3(b) and Fig. 3(c) show the reward/cost accumulated during test trajectories overtraining index for Algorithm 1. Observe that the reward/cost curves behave differently as the penalty parameter β varies: observe that without any constraint imposition (which implies β = 0 in red), one achieves the highestreward, but incurs the most costs, i.e., hits obstacles most often. Larger β imposes more penalty, and hence β = 4 incurs lowest cost and lowest reward. Other instances are also shown for β = 1 and β = 2. trajectories as a function of the iteration index for the PG ascent (4.1) for the cumulative return (2.1)as compared with a logarithmic barrier imposed to solve (2.6) for different penalty parameters β .
6. Broader Impact
While RL has a great number of potential applications, our work is of foundational nature and assuch, the application of the ideas in this paper can have both broad positive and negative impacts.However, this paper is purely theoretical, as we do not aim at any specific application, there is nothingwe can say about the most likely broader impact of this work that would go beyond speculation.
References
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 22–31.JMLR. org, 2017.Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximationwith policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261 , 2019.Eitan Altman.
Constrained Markov decision processes , volume 7. CRC Press, 1999.Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learningfrom demonstration.
Robotics and autonomous systems , 57(5):469–483, 2009.K.J. Arrow, L. Hurwicz, and H. Uzawa.
Studies in Linear and Non-Linear Programming , volume IIof
Stanford Mathematical Studies in the Social Sciences . Stanford University Press, Stanford,December 1958.Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXivarXiv:1906.01786 , 2019.Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. workingpaper, 2020. URL https://djrusso.github.io/docs/policy_grad_optimality.pdf . ariational Policy Gradient Shalabh Bhatnagar, Richard Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-criticalgorithms.
Automatica , 45(11):2471–2482, 2009.Vivek S Borkar.
Stochastic approximation: A dynamical systems viewpoint . Cambridge UniversityPress, 2008.Vivek S. Borkar.
Stochastic Approximation: A Dynamical Systems Viewpoint , volume 77. Wiley,2009.Vivek S Borkar and Sean P Meyn. Risk-sensitive optimal control for Markov decision processes withmonotone cost.
Mathematics of Operations Research , 27(1):192–209, 2002.Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, andWojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540 , 2016.Jingjing Bu, Afshin Mesbahi, Maryam Fazel, and Mehran Mesbahi. LQR through the lens of firstorder methods: Discrete-time case. arXiv preprint arXiv:1907.08921 , 2019.Cyrus Derman and Morton Klein. Some remarks on finite horizon Markovian decision models.
Operations Research , 13(2):272–278, April 1965. URL https://doi.org/10.1287/opre.13.2.272 .Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convexfunctions and smooth maps.
Mathematical Programming , 178(1-2):503–558, 2019.Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policygradient methods for the linear quadratic regulator. In
International Conference on MachineLearning , pages 1467–1476, 2018.Jerzy A. Filar, L. C. M. Kallenberg, and Huey-Miin Lee. Variance-penalized Markov decisionprocesses.
Mathematics of Operations Research , 14(1):147–161, 1989. URL http://pubsonline.informs.org/doi/abs/10.1287/moor.14.1.147 .Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximumentropy exploration. arXiv preprint arXiv:1812.02690 , 2018.Ying Huang and L. C. M. Kallenberg. On finding optimal policies for Markov decision chains:A unifying framework for mean-variance-tradeoffs.
Mathematics of Operations Research , 19(2):434–448, 1994. URL http://pubsonline.informs.org/doi/abs/10.1287/moor.19.2.434 .Sham M Kakade. A natural policy gradient. In
Advances in neural information processing systems ,pages 1531–1538, 2002.Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learningwith matrices.
Journal of Machine Learning Research , 13(Jun):1865–1890, 2012.L C M Kallenberg.
Linear Programming and Finite Markovian Control Problems . CWI MathematischCentrum, 1983.L. C. M. Kallenberg. Survey of linear programming for standard and nonstandard Markovian controlproblems. Part I: Theory.
Zeitschrift f¨ur Operations Research , 40(1):1–42, 1994.Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function.
The Annals of Mathematical Statistics , 23(3):462–466, 1952.Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In
Advances in Neural InformationProcessing Systems , pages 1008–1014, 2000. hang, Koppel, Bedi, Szepesvari, and Wang Vijaymohan R Konda and Vivek S Borkar. Actor-critic–type learning algorithms for Markov DecisionProcesses.
SIAM Journal on Control and Optimization , 38(1):94–123, 1999.Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimizationattains globally optimal policy. arXiv preprint arXiv:1906.10306 , 2019.Shie Mannor and John N Tsitsiklis. Mean-variance optimization in Markov decision processes. In
Proceedings of the 28th International Conference on International Conference on Machine Learning ,pages 177–184, 2011.Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergencerates of softmax policy gradient methods. arXiv preprint arXiv:2005.06392 , 2020.Jorge Nocedal and Stephen Wright.
Numerical optimization . Springer Science & Business Media,2006.Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli.Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618 , 2018.Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Adaptive step-size for policy gradient methods.In
Advances in Neural Information Processing Systems , pages 1394–1402, 2013.Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov DecisionProcesses.
Machine Learning , 100(2-3):255–283, 2015.Martin L Puterman.
Markov decision processes: discrete stochastic dynamic programming . JohnWiley & Sons, 2014.Herbert Robbins and Sutton Monro. A stochastic approximation method.
The annals of mathematicalstatistics , pages 400–407, 1951.Sheldon M Ross.
Introduction to stochastic dynamic programming . Academic press, 2014.J Langford S Kakade. Approximately optimal approximate reinforcement learning. In
ICML , pages267–274, 2002.Stefan Schaal. Learning from demonstration. In
Advances in neural information processing systems ,pages 1040–1046, 1997.John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust regionpolicy optimization. In
International conference on machine learning , pages 1889–1897, 2015.John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policyoptimization algorithms. arXiv preprint arXiv:1707.06347 , 2017.Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy´nski.
Lectures on stochastic programming:modeling and theory . SIAM, 2014.Richard S Sutton. Learning to predict by the methods of temporal differences.
Machine learning , 3(1):9–44, 1988.Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradientmethods for reinforcement learning with function approximation. In
Advances in neural informationprocessing systems , pages 1057–1063, 2000.L Tak´acs. Non-Markovian processes. In
Stochastic Process: Problems and Solutions , pages 46–62.Springer, 1966. ariational Policy Gradient Alexandre B Tsybakov.
Introduction to nonparametric estimation . Springer Science & BusinessMedia, 2008.Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Globaloptimality and rates of convergence. arXiv preprint arXiv:1909.01150 , 2019.Christopher JCH Watkins and Peter Dayan. Q-learning.
Machine learning , 8(3-4):279–292, 1992.Steven D Whitehead and Long-Ji Lin. Reinforcement learning of non-Markov decision processes.
Artificial intelligence , 73(1-2):271–306, 1995.Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcementlearning.
Machine learning , 8(3-4):229–256, 1992.Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursivevariance reduction. arXiv preprint arXiv:1909.08610 , 2019.Y.-L. Yu, Y. Li, D. Schuurmans, and Cs. Szepesv´ari. A general projection property for distributionfamilies. In
Advances in Neural Information Processing Systems , 2009.Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel. Cautious reinforcement learningvia distributional risk in the dual domain. arXiv preprint arXiv:2002.12475 , 2020a.Junyu Zhang, Mingyi Hong, Mengdi Wang, and shuzhong Zhang. Generalization bounds for stochasticsaddle point problems. arXiv preprint arXiv:2006.02067 , 2020b.Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Ba¸sar. Global convergence of policy gradientmethods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383 , 2019. hang, Koppel, Bedi, Szepesvari, and Wang Supplementary Material for“Variational Policy Gradient Methodfor Reinforcement Learning with General Utilities”Appendix A. Related Work
We provide a more extension discussion for the context of this work. Firstly, when closed-formexpressions for the optimizer of a function are unavailable, solving optimization problems requiresiterative schemes such as gradient ascent Nocedal and Wright (2006). Their convergence to globalextrema is predicated on concavity and the tractability of computing ascent directions. When theobjective takes the form of an expected value of a function parameterized by a random variable,stochastic approximations are required Robbins and Monro (1951); Kiefer et al. (1952). The PGTheorem mentioned above gives a specific form for obtaining ascent directions with respect to aparameterized family of stationary policies via trajectories in a Markov decision process, when theobjective is the expected cumulative return Sutton et al. (2000), which gives rise to the REINFORCEalgorithm.The convergence of policy search for the expected cumulative return has been studied extensivelyin recent years. Under general parameterizations the problem becomes nonconvex. Hence, earlywork focused on asymptotic convergence to stationarity Pirotta et al. (2015) by invoking dynamicalsystems Borkar (2008). In actor-critic Konda and Borkar (1999); Konda and Tsitsiklis (2000), onereplaces the Monte Carlo rollout of the Q function with a temporal difference estimator Sutton(1988), and its asymptotic stability follows similar logic Bhatnagar et al. (2009). Another line ofwork focused on only on per-step value increase, i.e., policy improvement bounds Pirotta et al. (2013,2015). Recent interest has been on structural results that yield convergence to global optimality:when state transitions are linear Fazel et al. (2018); Bu et al. (2019)), the policy parameterization isdirect (tabular) Bhandari and Russo (2019); Agarwal et al. (2019), function approximation error canbe quantified S Kakade (2002); Liu et al. (2019). Clever step-size rules have also been designed toensure convergence to second-order stationary points under general settings Zhang et al. (2019).These results, however, are restricted to the expected cumulative return, a linear functional of thestate-action occupancy measure, and hence do not apply to general concave functionals of the formconsidered in this work. Early works in operations research consider nonstandard utilities Huang andKallenberg (1994), motivated by certain variance-penalizations which may also be written as concavefunctionals of occupancy measures Filar et al. (1989). Similar in spirit to this work is Kallenberg(1994), as it also puts occupancy measures at the center of its conceptual development. Theseworks develop dynamic programming approaches for tabular settings, and hence are not scalable toproblems with large spaces. More recently, maximizing the entropy of the state visitation distributionhas been considered Hazan et al. (2018), a special case of the concave utilities we study. Moreover,the authors develop a model-based iteratively policy update, which requires explicit knowledge ofthe transition probability matrix. By contrast, in this work we prioritize model-free approaches forpossibly large spaces via the fusion of direct policy search and parameterization over a family ofpolicies.
Appendix B. Supplementary materials of Section 3
B.1 A Monte Carlo Algorithm for solving (3.6)Note that any algorithm that solves problem (3.6) will serve our purpose. Therefore, we provide aMonte Carlo method that alternates between stochastic primal and dual updates as an example,stated in Algorithm 1, in which the projection operator onto the set { z : (cid:107) z (cid:107) ∞ ≤ (cid:96) F } is denoted as ariational Policy Gradient Proj (cid:96) F { z } . For any z , z (cid:48) = Proj (cid:96) F { z } is defined as z (cid:48) i = − (cid:96) F , if z i ∈ ( −∞ , − (cid:96) F ) ,z i , if z i ∈ [ − (cid:96) F , (cid:96) F ] ,(cid:96) F , if z i ∈ ( (cid:96) F , + ∞ ) . It is worth noting that we have we omit the term δ ∇ ˜ V ( θ ; z ) (cid:62) x when computing the gradient w.r.t. Algorithm 1
Monte Carlo Variational Policy Gradient Estimation
Require: a differentiable policy parametrization π θ , stepsizes α t , β t >
0, initial points x = 0, z = 0.A constant (cid:96) F . policy parameter θ ∈ R d Generate episodes ζ i = { ( s k , a k ) } from i = 1 , · · · , n following π θ ( a | s )For t = 0 , , , ... until some stopping criterion is met: Sample ( s k , a k ) from the data set Update z t +1 ← Proj (cid:96) F (cid:26) z t − α t − γ s k ,a k + α t ∇ F ∗ ( z t ) (cid:27) (B.1) x t +1 ← x t + β t (cid:34)(cid:88) a ∈A Q π θ ( s k , a ; z t ) · ∇ θ π θ ( a | s k ) − x t (cid:35) (B.2) Output: the last iterate x z in (B.1). Note that for the iterates x t are all well bounded, then δ ∇ ˜ V ( θ ; z t ) (cid:62) x t = O ( δ ), which isnegligible when δ → B.2 Special cases of policy gradient computation
We give several examples of the policy gradient for special cases of the general utility in (3.1).
Linear utility
The simplest, where F ( λ ) = (cid:104) λ, r (cid:105) [cf. (2.1)], we have F ∗ ( z ) = 0 if z = c · r forsome scalar c and F ∗ ( z ) = ∞ otherwise. In this case z ∗ = r and Theorem 3.1 recovers the knownpolicy gradient theorem for the risk-neutral MDP (2.1), that is ∇ θ R ( π θ ) = ∇ θ V ( θ ; r ). Constrained MDPs
By contrast, in Example 2.1, i.e., when a constraint E π [ (cid:80) ∞ t =0 γ t c ( s t , a t )] ≤ C on the accumulation of costs c ( s t , a t ) is present, and we may enforce it approximately with a logbarrier by defining R ( π θ ) = (cid:104) r, λ ( θ ) (cid:105) + β log ( C − (cid:104) c, λ ( θ ) (cid:105) ) = V ( θ ; r ) + β log ( C − V ( θ ; c )) , (B.3)where β is a regularization parameter, in which case the policy gradient takes the form ∇ R ( π θ ) = ∇ θ V ( θ ; r ) − β ∇ θ V ( θ ; c ) C − V ( θ ; c ) . Estimating the policy gradient R of constrained MDP consists of estimating two policy gradients ∇ θ V ( θ ; c ) and ∇ θ V ( θ ; r ) and accumulated reward V ( θ ; c ). Minimum eigenvalue
For case (2.8), define Φ( λ π θ ) = (cid:80) s,a λ π θ sa · φ ( s, a ) φ ( s, a ) (cid:62) . Then Φ( λ π θ ) issymmetric and positive semidefinite, since λ π θ ≥
0. By using Rayleigh principle, we have R ( π θ ) = σ min (Φ( λ π θ )) = min (cid:107) u (cid:107) =1 u (cid:62) Φ( λ π θ ) u = min (cid:107) u (cid:107) =1 (cid:88) s,a λ π θ sa | φ ( s, a ) (cid:62) u | . (B.4) hang, Koppel, Bedi, Szepesvari, and Wang which is the minimum of a family of linear function in λ . Let v (1) , ..., v ( k ) be a group of orthonormalbases of the eigenspace of Φ( λ π θ ) corresponding to the minimum eigenvalue. Then define k vectorsas r ( i ) ( s, a ) = | φ ( s, a ) (cid:62) v ( i ) | , ∀ s, a , i = 1 , ..., k . Then the Fr´echet superdifferential of R at θ isˆ ∂ θ R ( π θ ) = (cid:110) ∇ θ V ( θ ; r ) : r ∈ conv( r (1) , ..., r ( k ) ) (cid:111) , where conv( · ) denotes the convex hull of a group of vectors. When the multiplicity of the minimumeigenvalue is 1, then R ( · ) is differentiable at this point and ˆ ∂ θ R ( · ) = {∇ θ R ( · ) } . Entropy maximization
For the entropy (2.7), its Fenchel dual takes the form F ∗ ( z ) = − (cid:88) sa exp (cid:8) − z sa − γ − (cid:9) . Learning to mimic a distribution
For the KL divergence to a prior µ in (2.9), we have F ∗ ( z ) = (cid:40) − (cid:80) s µ s exp (cid:8) − z s − γ − (cid:9) if z sa = z sa ∀ s ∈ S , a , a ∈ A , −∞ otherwise. Appendix C. Additional Details of Experiments
C.1 Details of Environment
OpenAI Frozen Lake is a finite-state action problem. The standard state consists of { S, F, H, G } ,to which we add an additional state C which is visualized in Fig. 3(a). At each step, an agentselects an action a ∈ A , which consists of one of four directions (up, down, left, right), which maybe enumerated as { , . . . , } . The reward is null at all Frozen F spaces, the start S location, andthe Holes H in the lake. If the agent enters a hole, the episode terminates, and hence null reward isaccumulated for this trajectory. The only positive reward is 1 and may be obtained when reachingthe goal state G . Our augmentation is that costly states C have been added, which incur reward − . π θ ( s | a ) = e θ sa / ( (cid:80) a (cid:48) e θ sa (cid:48) ) for θ ∈ R |S|×|A| . Forthe Frozen lake environment in this paper, we have |S| = 16 and |A| = 4. C.2 Computing the True Policy Gradient
For comparison, we compute the true policy gradient by using a baseline approach based on the chainrule and a variant of REINFORCE Sutton et al. (2000): the second factor on the right-hand sideof (3.3) is exactly computed using REINFORCE ∇ θ λ sa ( θ ), whereas the first, ∂F ( λ ( θ )) ∂λ sa , is computedusing an additional Monte Carlo rollout. We denote as x ∗ the result of this procedure and use itas ground truth. In Figure 4(a) we display the evolution of its norm difference (cid:107) ˆ x (cid:63)n − ˆ x (cid:63)n − (cid:107) as thesample size n increases. That it approaches null with the sample size implies that this brute forceMonte Carlo variant of REINFORCE is convergent, and hence is a reasonable benchmark comparator. C.3 Details about Maximum Entropy Exploration
For this problem instance, i.e., (2.7) from Example 2.2, we also consider the state space defined byFrozen Lake, but note that the reward as defined by the environment is now a moot point. This is ariational Policy Gradient (a) Convergence of x (cid:63) Figure 4: Fig. 4(a) displays the convergence of a generalization of REINFORCE-based gradientestimator for (3.3) in terms of its difference (cid:107) ˆ x (cid:63)n − ˆ x (cid:63)n − (cid:107) as the number of processedtrajectories n increases, which converges to null, certifying ˆ x (cid:63)n as a baseline.because each state contributes positive entropy, with the exception of the holes in the lake, whichterminate the episode. We visualize this setup at the bottom layer of Fig. 2(b). The lower middlelayer visualizes the occupancy measure associated with a uniform policy. Moreover, the upper middlelayer visualizes the “pseudo-reward” z for each point in the state space. This quantity is computed interms of the Fenchel dual of the entropy – see Appendix B.2, and the occupancy measure associatedwith the output of Algorithm 1 at the end of training is visualized at the top layer. To obtain thisresult, we run it for 10 total episodes, and for each episode we evaluate the entropy using (2.7). Weconsider a constant step-size α = 0 . , β = 0 .
1, and η = 0 .
001 throughout this experiment.
C.4 Details about the Constrained Markov Decision Process
In this subsection, we elaborate upon the implementation of Example 2.1, specifically, (2.6) and itsapproximation using a logarithmic barrier as detailed in (B.3). We consider the problem of navigatingthrough the FrozenLake environment as shown in Fig. 3(a)(bottom): we seek to reach the goal state G (reward = 1) from the starting location S (reward = 0), navigating along F frozen spaces (reward= 0), while avoiding locations marked C (reward = − .
2) that denote costly states (obstacles) and H holes.We consider two approaches to the problem: first, we focus on optimizing the standard expectedcumulative return (2.1), whose associated state occupancy measure is given in Fig. 3(a)(middle);second, we consider imposing constraints to avoid costly states [cf. (2.6)] via a logarithmic barrier(B.3), whose resulting occupancy measure is depicted in Fig. 3(a)(top). Bluer/yellower colors denotehigher/lower likelihoods, respectively. We observe that imposing constraints yields policies whoseprobability mass is concentrated away from constraints and instead along paths from the start to thegoal. Thus, Algorithm 1 combined with a policy search scheme (4.1) may be used to solve CMDPs.This trend is corroborated in Fig. 3, which depicts the reward 3(b) and cost 3(c) accumulationduring test trajectories as a function of training index for Algorithm 1 for the cumulative return(2.1) as compared with a logarithmic barrier imposed to solve CMDP (2.6) for different penaltyparameters β . We may observe that without imposing any constraint ( β = 0 in red), one achievesthe highest reward, but incurs the most costs, i.e., hits obstacles most often, a form of “recklessboldness.” Larger β means higher penalty for the constraints, and hence β = 4 incurs lower cost andlower reward. We further added the curves for β = 1 and β = 2 for comparison.For all results reported in Fig. 3, we run the algorithm for 10 K total training steps in the formof episodes. For each episode, we run a number of evaluation (test) trajectories in order to determine hang, Koppel, Bedi, Szepesvari, and Wang their merit, both in terms of reward and cost accumulation. Put more simply, we evaluate theperformance averaged over a few test trajectories as a function of episode number and report itsaverage over last 20 episodes to show the trend. This is to illuminate policy improvement in itsvarious forms (reward/cost accumulation) during training. Moreover, the algorithm is run withconstant step-size η = 0 . Appendix D. Proof of Theorem 3.1
Proof.
First note that for any z ∈ R SA , x ∈ R d , we have V ( θ ; z ) = (cid:104) z, λ ( θ ) (cid:105) , ∇ θ V ( θ ; z ) (cid:62) x = (cid:104) z, ∇ θ λ ( θ ) x (cid:105) , (D.1)where ∇ θ λ ( θ ) is the SA × d Jacobian matrix, the first identity holds by definition, and the secondholds by directly differetiating the first indentity and product it with x .Consider the saddle point problem in (3.4) for fixed 0 < δ <
1. Let G be any constant such that (cid:107)∇ F ( λ ( θ )) (cid:107) ∞ < G . Define( x ∗ ( δ ) , z ∗ ( δ )) := argmax x argmin (cid:107) z (cid:107) ∞ ≤ G (cid:8) V ( θ ; z ) + δ ∇ θ V ( θ ; z ) (cid:62) x − F ∗ ( z ) − δ (cid:107) x (cid:107) (cid:9) . (D.2)Note in (D.2) we added the auxiliary constraint set { z : (cid:107) z (cid:107) ∞ ≤ G } , and later we will showthat this constraint is inactive for all δ sufficiently small. We will also show that ( x ∗ ( δ ) , z ∗ ( δ )) arebounded for all δ sufficiently small.By the first-order stationarity condition, we have x ∗ ( δ ) = ∇ θ V ( θ ; z ∗ ( δ )) . Note that ∇ θ V ( θ ; · ) is a linear function of z , thus there exists B > (cid:107)∇ θ V ( θ ; z ) (cid:107) ≤ B forall z ∈ {(cid:107) z (cid:107) ∞ ≤ G } . And consequently (cid:107) x ∗ ( δ ) (cid:107) ≤ B for all δ > x ∈ {(cid:107) x (cid:107) ≤ B } , we havelim δ → + λ ( θ ) + δ ∇ θ λ ( θ ) x = λ ( θ ) . Therefore, there exists some small δ >
In this case, we consider the unconstrained solution, for $\|x\| \le B$, defined by
\[
z^*(x; \delta) := \operatorname*{argmin}_{z}\ \Big\{ V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - F^*(z) \Big\} = \nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\, x\big),
\]
and observe that the unconstrained solution satisfies $\|z^*(x;\delta)\|_\infty < G$, so that the constraint $\|z\|_\infty \le G$ is not active. Therefore, for $\delta < \delta_0$, we can equivalently rewrite (D.2) as
\[
x^*(\delta) := \operatorname*{argmax}_{\|x\|\le B}\ \min_{z}\ \Big\{ V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\} = \operatorname*{argmax}_{\|x\|\le B}\ \Big\{ F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\, x\big) - \tfrac{\delta}{2}\|x\|^2 \Big\}. \tag{D.3}
\]
Recall that we showed $\|x^*(\delta)\| \le B$; therefore the constraint $\|x\| \le B$ is also inactive and removable. Therefore $x^*(\delta)$ is equivalent to the unconstrained min-max solution for all $\delta$ sufficiently small, and Fenchel duality together with the first-order stationarity condition implies
\[
x^*(\delta) = \operatorname*{argmax}_{x}\ \inf_{z}\ \Big\{ V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\} = \nabla_\theta\lambda(\theta)^\top \nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\, x^*(\delta)\big).
\]
Using the fact that $\nabla F$ is continuous at $\lambda(\theta)$ and that $x^*(\delta)$ is bounded, letting $\delta \to 0^+$ gives
\[
\lim_{\delta\to 0^+} x^*(\delta) = \lim_{\delta\to 0^+} \nabla_\theta\lambda(\theta)^\top \nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\, x^*(\delta)\big) = \nabla_\theta\lambda(\theta)^\top \nabla F\big(\lambda(\theta)\big) = \nabla_\theta R(\theta),
\]
where the last equality uses the chain rule.
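The fixed-point relation displayed above also suggests a simple way to approximate the regularized solution $x^*(\delta)$ numerically. The minimal sketch below assumes access to an empirical occupancy estimate `lam_hat` and a Jacobian-type estimate `M_hat` such that `M_hat @ z` estimates $\nabla_\theta V(\theta;z)$; the damped fixed-point solver and all step sizes are illustrative assumptions, not the paper's prescription.

```python
import numpy as np

def variational_gradient_estimate(lam_hat, M_hat, grad_F, delta=1e-3,
                                  iters=500, lr=0.5):
    """Approximate the regularized solution x*(delta) of (D.3) by iterating
        x <- M_hat @ grad_F(lam_hat + delta * M_hat.T @ x),
    the stationarity relation derived in the proof above.

    lam_hat : estimate of lambda(theta), shape (S*A,)
    M_hat   : estimate of the Jacobian nabla_theta lambda(theta)^T, shape (d, S*A)
    grad_F  : callable returning the gradient of the utility F at an occupancy
    All inputs are assumed oracles; this is a sketch, not Algorithm 1 verbatim.
    """
    x = np.zeros(M_hat.shape[0])
    for _ in range(iters):
        x_new = M_hat @ grad_F(lam_hat + delta * M_hat.T @ x)
        x = (1.0 - lr) * x + lr * x_new        # damped fixed-point update
    return x
```

As $\delta \to 0$ the iterate approaches `M_hat @ grad_F(lam_hat)`, mirroring the limit computed in the proof.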
Appendix E. Proof of Theorem 3.2

Proof.
First, let us denote the solution of the saddle point problem in (3.4) for fixed $0 < \delta < 1$ as
\[
(x^*(\delta), z^*(\delta)) = \operatorname*{argmax}_{x}\ \operatorname*{argmin}_{\|z\|_\infty \le \ell_F}\ \Big\{ V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\}, \tag{E.1}
\]
and its approximation with empirically estimated value functions and their gradients in (3.5) as
\[
(\hat x(\delta), \hat z(\delta)) = \operatorname*{argmax}_{x}\ \operatorname*{argmin}_{\|z\|_\infty \le \ell_F}\ \Big\{ \tilde V(\theta; z) + \delta\,\nabla_\theta\tilde V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\}. \tag{E.2}
\]
Then we decompose the quantity $\mathbb{E}\big[\|\hat\nabla_\theta R(\pi_\theta) - \nabla_\theta R(\pi_\theta)\|^2\big]$ into three terms by adding and subtracting (i) $x^*(\delta)$ and (ii) $\hat x(\delta)$, which we then show depends on the difference between (iii) $\hat z(\delta)$ and $z^*(\delta)$. Taken together with computing the limit of the right-hand side as $\delta \to 0$, this yields the result. The three error terms are controlled by the following lemma.
Lemma E.1. Consider $(x^*(\delta), z^*(\delta))$ and $(\hat x(\delta), \hat z(\delta))$ as defined in (E.1)-(E.2), respectively. Under the technical conditions stated in Theorem 3.2, their respective estimation errors satisfy:

(i) $\big\| x^*(\delta) - \nabla_\theta R(\pi_\theta) \big\| = O(\delta)$.

(ii) $\displaystyle \mathbb{E}\big[\|x^*(\delta) - \hat x(\delta)\|^2\big] \le \frac{2C^2\|z^*(\delta)\|_\infty^2}{(1-\gamma)^4}\Big(\frac{\gamma^{2K}}{(1-\gamma)^2} + \frac{1}{n}\Big) + \frac{2C^2}{(1-\gamma)^4}\,\mathbb{E}\big[\|z^*(\delta) - \hat z(\delta)\|_\infty^2\big]$.

(iii) $\displaystyle \mathbb{E}\big[\|\hat z(\delta) - z^*(\delta)\|_\infty^2\big] \le O\Big(\frac{L_F^2}{n(1-\gamma)^2} + \frac{L_F^2\,\ell_{F^*}^2}{n} + \frac{L_F\,\delta}{n} + \frac{L_F^2\,\delta^2}{n}\Big)$.

Combining the three steps and the fact that $\|z^*(\delta)\|_\infty \le \ell_F$ yields
\[
\mathbb{E}\big[\|\hat x(\delta) - \nabla_\theta R(\theta)\|^2\big] \le O\Big(\frac{C^2(\ell_F^2 + L_F^2\,\ell_{F^*}^2)}{n(1-\gamma)^4} + \frac{C^2 L_F^2}{n(1-\gamma)^6}\Big) + O\Big(\delta^2 + \frac{\delta}{n} + \gamma^{2K}\Big).
\]
Letting $\delta \to 0$, we get
\[
\mathbb{E}\big[\|\hat\nabla_\theta R(\pi_\theta) - \nabla_\theta R(\pi_\theta)\|^2\big] \le O\Big(\frac{C^2(\ell_F^2 + L_F^2\,\ell_{F^*}^2)}{n(1-\gamma)^4} + \frac{C^2 L_F^2}{n(1-\gamma)^6}\Big) + O\big(\gamma^{2K}\big).
\]

Lemma E.1(i)-(iii) is proved in the following subsections. For ease of notation, we will simply write $x^*$ and $\hat x$ instead of $x^*(\delta)$ and $\hat x(\delta)$; similarly, we write $z^*$ and $\hat z$ instead of $z^*(\delta)$ and $\hat z(\delta)$.

E.1 Preliminary Technicalities

Linearity property. The functions $Q$, $V$ and $\nabla_\theta V$ are linear in the reward function. Namely, for any $\alpha, \alpha' \in \mathbb{R}$ and $r, r' \in \mathbb{R}^{|\mathcal S||\mathcal A|}$,
\[
\alpha\,\nabla_\theta V(\theta; r) + \alpha'\,\nabla_\theta V(\theta; r') = \nabla_\theta V(\theta; \alpha r + \alpha' r').
\]
Similar identities hold for $Q^{\pi_\theta}(s,a;\cdot)$ and $V(\theta;\cdot)$. For the stochastic estimators $\nabla_\theta\tilde V(\theta; r; \zeta)$, it is straightforward to check that the linearity property still holds.

Upper bounds on $Q$ and $V$. Given an arbitrary reward function $r$, the $Q$ and $V$ functions are bounded as
\[
|Q^{\pi_\theta}(s,a;r)| \le \frac{\|r\|_\infty}{1-\gamma} \qquad \text{and} \qquad |V(\theta; r)| \le \frac{\|r\|_\infty}{1-\gamma}.
\]

Uniform upper bounds for the estimators.
Given any sample path $\zeta = \{(s_k, a_k)\}_{k=0}^{K}$, the estimators $\tilde V(\theta; z; \zeta)$ and $\nabla_\theta\tilde V(\theta; z; \zeta)$ are upper bounded by
\[
|\tilde V(\theta; z; \zeta)| \le \frac{\|z\|_\infty}{1-\gamma} \qquad \text{and} \qquad \|\nabla_\theta\tilde V(\theta; z; \zeta)\| \le \frac{C\|z\|_\infty}{(1-\gamma)^2}. \tag{E.3}
\]
Consequently, as the sample averages of $\tilde V(\theta; z; \zeta_i)$ and $\nabla_\theta\tilde V(\theta; z; \zeta_i)$, we also have
\[
|\tilde V(\theta; z)| \le \frac{\|z\|_\infty}{1-\gamma} \qquad \text{and} \qquad \|\nabla_\theta\tilde V(\theta; z)\| \le \frac{C\|z\|_\infty}{(1-\gamma)^2} \tag{E.4}
\]
for any set of sample paths $\{\zeta_i\}_{i=1}^{n}$.

Proof.
For $\tilde V(\theta; z; \zeta)$ and any $z$,
\[
|\tilde V(\theta; z; \zeta)| = \Big|\sum_{k=0}^{K}\gamma^k\, z(s_k, a_k)\Big| \le \sum_{k=0}^{K}\gamma^k\,\|z\|_\infty \le \frac{\|z\|_\infty}{1-\gamma}.
\]
For $\nabla_\theta\tilde V(\theta; z; \zeta)$ and any $z$,
\[
\|\nabla_\theta\tilde V(\theta; z; \zeta)\| = \Big\|\sum_{k=0}^{K}\sum_{a\in\mathcal A}\gamma^k\, Q(s_k, a; z)\,\nabla_\theta\pi_\theta(a\,|\,s_k)\Big\| \le \sum_{k=0}^{K}\gamma^k\,\Big\|\sum_{a\in\mathcal A} Q(s_k, a; z)\,\nabla_\theta\pi_\theta(a\,|\,s_k)\Big\| \le \sum_{k=0}^{K}\gamma^k \max_{\|u\|_\infty \le \frac{\|z\|_\infty}{1-\gamma}} \|\nabla_\theta\pi_\theta(\cdot\,|\,s_k)\,u\| \le \frac{C\|z\|_\infty}{(1-\gamma)^2}.
\]
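For reference, the truncated Monte Carlo estimators bounded above can be sketched in code. The tabular softmax parameterization, and the use of empirical discounted tail returns in place of the exact $Q(s_k, a; z)$ values, are assumptions made to keep the example self-contained; the score-function form used below agrees with the all-action expression only in expectation.

```python
import numpy as np

def softmax_policy(theta, s):
    """theta: (S, A) logits of a tabular softmax policy; returns pi(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def value_estimate(z, path, gamma):
    """Truncated estimate  V~(theta; z; zeta) = sum_k gamma^k z(s_k, a_k)."""
    return sum(gamma ** k * z[s, a] for k, (s, a) in enumerate(path))

def grad_value_estimate(theta, z, path, gamma):
    """One-path estimate of grad_theta V(theta; z), using discounted tail
    returns of the path as Q-estimates and the score-function identity
    sum_a Q(s,a) grad pi(a|s) = E_{a~pi}[Q(s,a) grad log pi(a|s)]."""
    grad = np.zeros_like(theta)
    tail, qhat = 0.0, {}
    for k in reversed(range(len(path))):          # discounted tail returns
        s, a = path[k]
        tail = z[s, a] + gamma * tail
        qhat[k] = tail
    for k, (s, a) in enumerate(path):
        pi = softmax_policy(theta, s)
        glog = -pi.copy()                         # grad_theta[s] log pi(a|s)
        glog[a] += 1.0                            #   = e_a - pi for softmax
        grad[s] += gamma ** k * qhat[k] * glog
    return grad
```

Averaging `value_estimate` and `grad_value_estimate` over $n$ sampled paths $\{\zeta_i\}$ gives the quantities $\tilde V(\theta;z)$ and $\nabla_\theta\tilde V(\theta;z)$ analyzed in this appendix.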
E.2 Proof of Lemma E.1(i)

Consider the problem (E.1). First let us ignore the requirement that $\|z\|_\infty \le \ell_F$. For this family of unconstrained problems (one for each $\delta$), Theorem 3.1 gives
\[
\lim_{\delta\to 0^+} x^*(\delta) = \nabla_\theta R(\pi_\theta).
\]
Consequently, $\lim_{\delta\to 0^+}\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta) = \lambda(\theta)$. Because $\|\lambda(\theta)\|_1 = (1-\gamma)^{-1}$, there exists $\delta_0 > 0$ such that for all $\delta < \delta_0$ we have
\[
\big\|\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta)\big\|_1 \le \frac{2}{1-\gamma}.
\]
According to condition (i) of this theorem, we then have $\|\nabla F(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta))\|_\infty \le \ell_F$. It is worth noting that $z^*(\delta) = \nabla F(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta))$ is also the solution to the unconstrained version of (E.1). Therefore $\|z^*(\delta)\|_\infty \le \ell_F$, so that we can add this constraint without changing the optimal solutions. By the intermediate result in the proof of Theorem 3.1, we have
\[
x^*(\delta) = \nabla_\theta\lambda(\theta)^\top\nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta)\big).
\]
Consequently, by the Lipschitz continuity of $\nabla F$, we have
\[
\big\|x^*(\delta) - \nabla_\theta R(\theta)\big\| = \big\|\nabla_\theta\lambda(\theta)^\top\nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta)\big) - \nabla_\theta\lambda(\theta)^\top\nabla F\big(\lambda(\theta)\big)\big\| \le \big\|\nabla_\theta\lambda(\theta)^\top\big\|_{\infty,2}\cdot\big\|\nabla F\big(\lambda(\theta) + \delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta)\big) - \nabla F\big(\lambda(\theta)\big)\big\|_\infty \le L_F\,\big\|\nabla_\theta\lambda(\theta)^\top\big\|_{\infty,2}\cdot\big\|\delta\,\nabla_\theta\lambda(\theta)\,x^*(\delta)\big\| = O(\delta),
\]
as stated in Lemma E.1(i). In the last step, we used the fact that $x^*(\delta)$ is bounded because $x^*(\delta) \to \nabla_\theta R(\pi_\theta)$.

E.3 Proof of Lemma E.1(ii)
By the first-order stationarity conditions of the problems (E.1)-(E.2), we know
\[
x^* = \nabla_\theta V(\theta; z^*) \qquad \text{and} \qquad \hat x = \nabla_\theta\tilde V(\theta; \hat z).
\]
Consider the norm difference between the preceding quantities:
\[
\mathbb{E}\big[\|x^* - \hat x\|^2\big] \le 2\,\mathbb{E}\big[\|\nabla_\theta V(\theta; z^*) - \nabla_\theta\tilde V(\theta; z^*)\|^2\big] + 2\,\mathbb{E}\big[\|\nabla_\theta\tilde V(\theta; z^*) - \nabla_\theta\tilde V(\theta; \hat z)\|^2\big]. \tag{E.5}
\]
To bound the term $\mathbb{E}\big[\|\nabla_\theta V(\theta; z^*) - \nabla_\theta\tilde V(\theta; z^*)\|^2\big]$, recall the definition (3.5):
\[
\nabla_\theta\tilde V(\theta; z) := \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta\tilde V(\theta; z; \zeta_i) = \frac{1}{n}\sum_{i=1}^{n}\sum_{k=0}^{K}\sum_{a\in\mathcal A}\gamma^k\, Q\big(s_k^{(i)}, a; z\big)\,\nabla_\theta\pi_\theta\big(a\,\big|\,s_k^{(i)}\big).
\]
Consider the first term on the right-hand side of (E.5). Add and subtract $\mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*)\big]$ and use the fact that the cross term vanishes in expectation (the bias-variance decomposition identity) to write
\[
\mathbb{E}\big[\|\nabla_\theta V(\theta; z^*) - \nabla_\theta\tilde V(\theta; z^*)\|^2\big] = \big\|\nabla_\theta V(\theta; z^*) - \mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*)\big]\big\|^2 + \mathbb{E}\Big[\big\|\nabla_\theta\tilde V(\theta; z^*) - \mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*)\big]\big\|^2\Big]. \tag{E.6}
\]
For the first (squared bias) term on the right-hand side of (E.6), denote $d^{\pi}_{\xi,K}(s) = (1-\gamma)\sum_{t=0}^{K}\gamma^t\,\mathrm{Prob}(s_t = s \,|\, \pi,\ s_0\sim\xi)$. Then it is straightforward that $\sum_s |d^{\pi}_{\xi,K}(s) - d^{\pi}_\xi(s)| \le \frac{\gamma^{K+1}}{1-\gamma}$. As a result, we know
\[
\big\|\nabla_\theta V(\theta; z^*) - \mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*)\big]\big\| = \frac{1}{1-\gamma}\Big\|\sum_s\big(d^{\pi}_\xi(s) - d^{\pi}_{\xi,K}(s)\big)\sum_a Q^{\pi_\theta}(s, a; z^*)\,\nabla_\theta\pi_\theta(a\,|\,s)\Big\| \le \frac{1}{1-\gamma}\sum_s\big|d^{\pi}_\xi(s) - d^{\pi}_{\xi,K}(s)\big|\cdot\max_{\|u\|_\infty\le\frac{\|z^*\|_\infty}{1-\gamma}}\|\nabla_\theta\pi_\theta(\cdot\,|\,s)\,u\| \le \frac{C\|z^*\|_\infty}{(1-\gamma)^2}\sum_s\big|d^{\pi}_\xi(s) - d^{\pi}_{\xi,K}(s)\big| \le \frac{C\|z^*\|_\infty\,\gamma^{K+1}}{(1-\gamma)^3}. \tag{E.7}
\]
Next, we consider the second (variance) term on the right-hand side of (E.6). By substituting (3.5) in for $\nabla_\theta\tilde V(\theta; z^*)$ to rewrite it in terms of the trajectories $\zeta_i$, we have
\[
\mathbb{E}\Big[\big\|\nabla_\theta\tilde V(\theta; z^*) - \mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*)\big]\big\|^2\Big] = \frac{1}{n}\,\mathbb{E}\Big[\big\|\nabla_\theta\tilde V(\theta; z^*; \zeta_i) - \mathbb{E}\big[\nabla_\theta\tilde V(\theta; z^*; \zeta_i)\big]\big\|^2\Big] \le \frac{1}{n}\,\mathbb{E}\Big[\big\|\nabla_\theta\tilde V(\theta; z^*; \zeta_i)\big\|^2\Big] \le \frac{C^2\|z^*\|_\infty^2}{n(1-\gamma)^4}.
\]
The first inequality comes from crudely upper-bounding the variance by the second moment of the estimator.
The last inequality uses (E.3).

Now, returning to the second term in the bound (E.5): by the linearity of the stochastic estimators with respect to the reward argument and (E.4), we have
\[
\big\|\nabla_\theta\tilde V(\theta; z^*) - \nabla_\theta\tilde V(\theta; \hat z)\big\| = \big\|\nabla_\theta\tilde V(\theta; z^* - \hat z)\big\| \le \frac{C\|z^* - \hat z\|_\infty}{(1-\gamma)^2}.
\]
Taking the expectation after squaring both sides yields
\[
\mathbb{E}\Big[\big\|\nabla_\theta\tilde V(\theta; z^*) - \nabla_\theta\tilde V(\theta; \hat z)\big\|^2\Big] \le \frac{C^2}{(1-\gamma)^4}\,\mathbb{E}\big[\|z^* - \hat z\|_\infty^2\big]. \tag{E.8}
\]
Combining inequalities (E.5), (E.6), (E.7), and (E.8) yields
\[
\mathbb{E}\big[\|x^* - \hat x\|^2\big] \le \frac{2C^2\|z^*\|_\infty^2}{(1-\gamma)^4}\Big(\frac{\gamma^{2K}}{(1-\gamma)^2} + \frac{1}{n}\Big) + \frac{2C^2}{(1-\gamma)^4}\,\mathbb{E}\big[\|z^* - \hat z\|_\infty^2\big],
\]
which is as stated in Lemma E.1(ii).

E.4 Proof of Lemma E.1(iii)
In this subsection we will apply the generalization bound for stochastic saddle points from Zhang et al. (2020b) to bound the term $\mathbb{E}[\|\hat z - z^*\|_\infty^2]$. To achieve this, we need a compact feasible region for $x$. Note that for problems (E.1) and (E.2), the solutions $x^*$ and $\hat x$ have the form
\[
x^* = \nabla_\theta V(\theta; z^*) \qquad \text{and} \qquad \hat x = \nabla_\theta\tilde V(\theta; \hat z).
\]
Due to (E.4) and the constraint that $\|z\|_\infty \le \ell_F$, we have
\[
\|x^*\| \le \frac{C\|z^*\|_\infty}{(1-\gamma)^2} \le \frac{C\ell_F}{(1-\gamma)^2} \qquad \text{and thus} \qquad \|\hat x\| \le \frac{C\ell_F}{(1-\gamma)^2} \ \text{with probability 1.}
\]
Therefore, adding the constraint $\|x\| \le \frac{C\ell_F}{(1-\gamma)^2}$ will not change the solutions of problems (E.1) and (E.2). Formally speaking, we will apply the theory of Zhang et al. (2020b) to the following pair of constrained problems:
\[
(x^*, z^*) = \operatorname*{argmax}_{x\in\mathcal X}\ \operatorname*{argmin}_{z\in\mathcal Z}\ \Big\{ V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\}, \tag{E.9}
\]
and
\[
(\hat x, \hat z) = \operatorname*{argmax}_{x\in\mathcal X}\ \operatorname*{argmin}_{z\in\mathcal Z}\ \Big\{ \tilde V(\theta; z) + \delta\,\nabla_\theta\tilde V(\theta; z)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2 \Big\}, \tag{E.10}
\]
with $\mathcal X = \{x : \|x\| \le \frac{C\ell_F}{(1-\gamma)^2}\}$ and $\mathcal Z = \{z : \|z\|_\infty \le \ell_F\}$. The problems (E.1) and (E.9) share the same solution, and problems (E.2) and (E.10) share the same solution.

Finally, similar to the proof of (E.7), for any $x \in \mathcal X$ and $z \in \mathcal Z$,
\[
\Big| V(\theta; z) + \delta\,\nabla_\theta V(\theta; z)^\top x - \mathbb{E}\big[\tilde V(\theta; z; \zeta_i) + \delta\,\nabla_\theta\tilde V(\theta; z; \zeta_i)^\top x\big] \Big| = O\Big(\frac{\gamma^K}{1-\gamma}\Big).
\]
For simplicity of the discussion, let us assume that $K$ is large enough so that we can ignore the $O\big(\frac{\gamma^K}{1-\gamma}\big)$ bias. Therefore problem (E.10) can be viewed as an empirical version of problem (E.9) with negligible bias. To apply the theory of Zhang et al. (2020b), define
\[
\Psi_\zeta(x, z) := \tilde V(\theta; z; \zeta) + \delta\,\nabla_\theta\tilde V(\theta; z; \zeta)^\top x - F^*(z) - \tfrac{\delta}{2}\|x\|^2.
\]
Then for any sample path $\zeta$, $\Psi_\zeta$ satisfies the following set of properties:

• $\Psi_\zeta(\cdot, z)$ is $\mu_x$-strongly concave under the $\ell_2$ norm, and $\Psi_\zeta(x, \cdot)$ is $\mu_z$-strongly convex under the $\ell_\infty$ norm. In other words, for all $x, x' \in \mathcal X$ and $z, z' \in \mathcal Z$,
\[
\begin{cases}
\Psi_\zeta(x', z) \le \Psi_\zeta(x, z) + \langle u, x' - x\rangle - \frac{\mu_x}{2}\|x' - x\|^2, & u \in \partial_x\Psi_\zeta(x, z),\\
\Psi_\zeta(x, z') \ge \Psi_\zeta(x, z) + \langle v, z' - z\rangle + \frac{\mu_z}{2}\|z' - z\|_\infty^2, & v \in \partial_z\Psi_\zeta(x, z).
\end{cases}
\]
In our case, it is clear that $\mu_x = \delta$. Due to Theorem 3 of Kakade et al. (2012), $\mu_z = L_F^{-1}$.

• The feasible regions $\mathcal X$ and $\mathcal Z$ are compact convex sets. For every $\zeta$, there exist constants $\ell_x(\zeta, z)$ and $\ell_z(\zeta, x)$ such that
\[
\begin{cases}
|\Psi_\zeta(x', z) - \Psi_\zeta(x, z)| \le \ell_x(\zeta, z)\,\|x' - x\|, & \forall\, x, x' \in \mathcal X \text{ and } z \in \mathcal Z,\\
|\Psi_\zeta(x, z') - \Psi_\zeta(x, z)| \le \ell_z(\zeta, x)\,\|z' - z\|_\infty, & \forall\, z, z' \in \mathcal Z \text{ and } x \in \mathcal X.
\end{cases}
\]
In our case, we have $\ell_z(\zeta, x) = \sup\{\|u\|_1 : z\in\mathcal Z,\ u\in\partial_z\Psi_\zeta(x, z)\} = \frac{1}{1-\gamma} + \ell_{F^*} + O(\delta)$ and $\ell_x(\zeta, z) = \sup_{x\in\mathcal X}\|\nabla_x\Psi_\zeta(x, z)\| = O(\delta)$. Consequently,
\[
\begin{cases}
(\ell^w_x)^2 := \sup_{z\in\mathcal Z}\ \mathbb{E}\big[\ell_x(\zeta, z)^2\big] = O(\delta^2),\\
(\ell^w_z)^2 := \sup_{x\in\mathcal X}\ \mathbb{E}\big[\ell_z(\zeta, x)^2\big] = O\Big(\big(\ell_{F^*} + \tfrac{1}{1-\gamma}\big)^2 + \delta^2\Big).
\end{cases}
\]
With the above two properties, Theorem 1 of Zhang et al.
(2020b) indicates that, up to an absolute constant,
\[
\mu_z\,\mathbb{E}\big[\|\hat z - z^*\|_\infty^2\big] \le \frac{1}{n}\left(\frac{(\ell^w_x)^2}{\mu_x} + \frac{(\ell^w_z)^2}{\mu_z}\right).
\]
With the detailed parameters substituted into the above inequality, we have
\[
\mathbb{E}\big[\|\hat z - z^*\|_\infty^2\big] \le O\Big(\frac{L_F^2}{n(1-\gamma)^2} + \frac{L_F^2\,\ell_{F^*}^2}{n} + \frac{L_F\,\delta}{n} + \frac{L_F^2\,\delta^2}{n}\Big),
\]
as stated in Lemma E.1(iii).
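For intuition, the constrained empirical saddle point (E.10) can be tackled with simple projected gradient descent-ascent. The sketch below exploits the linearity $\tilde V(\theta;z) = \langle z, \hat\lambda\rangle$ and $\nabla_\theta\tilde V(\theta;z) = \hat M z$; the oracles `lam_hat`, `M_hat`, `grad_F_conj` (gradient of the convex conjugate $F^*$), the step sizes, and the radius of $\mathcal X$ are all assumptions of this sketch, which is an illustrative solver rather than the paper's Algorithm 1.

```python
import numpy as np

def solve_empirical_saddle(lam_hat, M_hat, grad_F_conj, ell_F, delta=1e-2,
                           steps=5000, eta_x=0.1, eta_z=0.1, x_radius=10.0):
    """Projected gradient descent-ascent for
        max_{x in X} min_{z in Z}  <z, lam_hat> + delta * x^T (M_hat @ z)
                                   - F*(z) - (delta/2)||x||^2,
    with Z = {||z||_inf <= ell_F} and X an l2 ball of radius x_radius
    (a stand-in for C*ell_F/(1-gamma)^2)."""
    d, SA = M_hat.shape
    x, z = np.zeros(d), np.zeros(SA)
    for _ in range(steps):
        # ascent in x:  grad_x = delta * (M_hat @ z - x)
        x = x + eta_x * delta * (M_hat @ z - x)
        nx = np.linalg.norm(x)
        if nx > x_radius:                        # project onto the l2 ball X
            x *= x_radius / nx
        # descent in z: grad_z = lam_hat + delta * M_hat.T @ x - grad F*(z)
        z = z - eta_z * (lam_hat + delta * M_hat.T @ x - grad_F_conj(z))
        z = np.clip(z, -ell_F, ell_F)            # project onto the l_inf ball Z
    return x, z                                  # x approximates the gradient estimate
```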
Appendix F. Proof of Theorem 4.2

Proof.
Let $\theta^*$ be a first-order stationary solution of (3.1). When $F$ is concave and locally Lipschitz continuous in a neighbourhood containing $\lambda(\Theta)$, we can compute the Fréchet superdifferential of $F\circ\lambda$ at $\theta^*$ by the chain rule; see Drusvyatskiy and Paquette (2019). That is,
\[
\hat\partial(F\circ\lambda)(\theta^*) = [\nabla_\theta\lambda(\theta^*)]^\top\,\partial F(\lambda^*),
\]
where $\partial F(\lambda^*)$ denotes the set of supergradients of the concave function $F$ at $\lambda^*$. Then there exists $w^* \in \partial F(\lambda^*) \subseteq \mathbb{R}^{SA}$ such that $u^* := [\nabla_\theta\lambda(\theta^*)]^\top w^* \in \hat\partial(F\circ\lambda)(\theta^*)$ as in (4.2). It follows from (4.2) that
\[
\langle w^*, \nabla_\theta\lambda(\theta^*)(\theta - \theta^*)\rangle \le 0 \qquad \text{for all } \theta \in \Theta. \tag{F.1}
\]
For any $\lambda \in \lambda(\Theta)$, let $\theta := g(\lambda)$ so that $\lambda = \lambda(\theta)$. Therefore, by adding and subtracting $\nabla_\theta\lambda(\theta^*)(\theta - \theta^*)$ inside the inner product, we have
\[
\langle w^*, \lambda - \lambda^*\rangle = \langle w^*, \lambda(\theta) - \lambda(\theta^*)\rangle = \langle w^*, \nabla_\theta\lambda(\theta^*)(\theta - \theta^*)\rangle + \langle w^*, \lambda(\theta) - \lambda(\theta^*) - \nabla_\theta\lambda(\theta^*)(\theta - \theta^*)\rangle \le \|w^*\|\,\|\lambda(\theta) - \lambda(\theta^*) - \nabla_\theta\lambda(\theta^*)(\theta - \theta^*)\|, \tag{F.2}
\]
where the last inequality drops the first term using (F.1) and applies the Cauchy-Schwarz inequality to the second. Note that the Jacobian matrix $\nabla_\theta\lambda(\theta)$ is Lipschitz continuous; denote the Lipschitz constant by $L_\lambda$, i.e., $\|\nabla_\theta\lambda(\theta) - \nabla_\theta\lambda(\theta')\| \le L_\lambda\|\theta - \theta'\|$ for all $\theta, \theta' \in \Theta$. Then
\[
\|\lambda(\theta) - \lambda(\theta^*) - \nabla_\theta\lambda(\theta^*)(\theta - \theta^*)\| \le \frac{L_\lambda}{2}\,\|\theta - \theta^*\|^2.
\]
By Assumption 4.1, we know $\|\theta - \theta^*\| = \|g(\lambda) - g(\lambda^*)\| \le \ell_\theta\,|||\lambda - \lambda^*|||$. Substituting the above inequalities into (F.2) yields
\[
\langle w^*, \lambda - \lambda^*\rangle \le \frac{L_\lambda\,\ell_\theta^2}{2}\,\|w^*\|\,|||\lambda - \lambda^*|||^2 \qquad \forall\,\lambda \in \lambda(\Theta). \tag{F.3}
\]
Note that (F.3) holds for arbitrary $\lambda \in \lambda(\Theta)$. Therefore, since $\lambda(\Theta)$ is assumed to be convex (Assumption 4.1(i)), we can also substitute $\lambda$ with $(1-\alpha)\lambda^* + \alpha\lambda$, $\alpha \in [0,1]$,
into the above inequality, which yields
\[
\alpha\,\langle w^*, \lambda - \lambda^*\rangle \le \frac{L_\lambda\,\ell_\theta^2}{2}\,\alpha^2\,\|w^*\|\,|||\lambda - \lambda^*|||^2 \qquad \forall\,\lambda \in \lambda(\Theta),\ \forall\,\alpha \in [0,1].
\]
Dividing both sides of the preceding expression by $\alpha$ and taking $\alpha \to 0^+$ gives
\[
\langle w^*, \lambda - \lambda^*\rangle \le \lim_{\alpha\to 0^+}\frac{L_\lambda\,\ell_\theta^2}{2}\,\alpha\,\|w^*\|\,|||\lambda - \lambda^*|||^2 = 0 \qquad \forall\,\lambda \in \lambda(\Theta).
\]
Recall that the following problem is concave in $\lambda$:
\[
\max_\lambda\ F(\lambda) \qquad \text{s.t. } \lambda \in \lambda(\Theta).
\]
Since $w^*$ is a supergradient of the concave function $F$ at $\lambda^*$, the preceding inequality gives $F(\lambda) \le F(\lambda^*) + \langle w^*, \lambda - \lambda^*\rangle \le F(\lambda^*)$ for every $\lambda \in \lambda(\Theta)$; therefore $\lambda^*$ is the globally optimal solution. Then $\theta^* = g(\lambda^*)$ is the globally optimal solution of the nonconvex optimization problem (3.1).

Appendix G. Proof of Theorem 4.4
G.1 Proof of sublinear convergence
Proof.
First, the Lipschitz continuity in Assumption 4.3 indicates that
\[
\big| F(\lambda(\theta)) - F(\lambda(\theta_k)) - \langle\nabla_\theta F(\lambda(\theta_k)), \theta - \theta_k\rangle \big| \le \frac{L}{2}\,\|\theta - \theta_k\|^2.
\]
Consequently, for any $\theta \in \Theta$ we have
\[
F(\lambda(\theta_k)) + \langle\nabla_\theta F(\lambda(\theta_k)), \theta - \theta_k\rangle - \frac{L}{2}\,\|\theta - \theta_k\|^2 \ \ge\ F(\lambda(\theta)) - L\,\|\theta - \theta_k\|^2. \tag{G.1}
\]
The optimality condition in the policy update rule (4.1) then yields
\[
F(\lambda(\theta_{k+1})) \ge F(\lambda(\theta_k)) + \langle\nabla_\theta F(\lambda(\theta_k)), \theta_{k+1} - \theta_k\rangle - \frac{L}{2}\,\|\theta_{k+1} - \theta_k\|^2 = \max_{\theta\in\Theta}\Big\{ F(\lambda(\theta_k)) + \langle\nabla_\theta F(\lambda(\theta_k)), \theta - \theta_k\rangle - \frac{L}{2}\,\|\theta - \theta_k\|^2 \Big\} \overset{(a)}{\ge} \max_{\theta\in\Theta}\Big\{ F(\lambda(\theta)) - L\,\|\theta - \theta_k\|^2 \Big\} \overset{(b)}{\ge} \max_{\alpha\in[0,1]}\Big\{ F(\lambda(\theta_\alpha)) - L\,\|\theta_\alpha - \theta_k\|^2 \ :\ \theta_\alpha = g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \Big\}. \tag{G.2}
\]
Here, step (a) is due to (G.1) and step (b) uses the convexity of $\lambda(\Theta)$. Now, we proceed to analyze the right-hand side of (G.2). First, by the concavity of $F$ and the fact that $\lambda\circ g = \mathrm{id}$, we have
\[
F(\lambda(\theta_\alpha)) = F\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \ge \alpha F(\lambda(\theta^*)) + (1-\alpha)F(\lambda(\theta_k)).
\]
Moreover, by the Lipschitz continuity assumption on $g$, we have
\[
\|\theta_\alpha - \theta_k\| = \big\|g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) - g(\lambda(\theta_k))\big\| \le \alpha\,\ell_\theta\,|||\lambda(\theta^*) - \lambda(\theta_k)||| \le \alpha\,\ell_\theta\,D_\lambda. \tag{G.3}
\]
Substituting the above two inequalities into the right-hand side of (G.2), we get
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_{k+1})) \le \min_{\alpha\in[0,1]}\Big\{ F(\lambda(\theta^*)) - F(\lambda(\theta_\alpha)) + L\,\|\theta_\alpha - \theta_k\|^2 \ :\ \theta_\alpha = g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \Big\} \le \min_{\alpha\in[0,1]}\Big\{ (1-\alpha)\big(F(\lambda(\theta^*)) - F(\lambda(\theta_k))\big) + \alpha^2 L\,\ell_\theta^2 D_\lambda^2 \Big\}. \tag{G.4}
\]
Let $\alpha_k := \frac{F(\lambda(\theta^*)) - F(\lambda(\theta_k))}{2L\,\ell_\theta^2 D_\lambda^2} \ge 0$,
which is the minimizer of the RHS of (G.4) as long as it satisfies $\alpha_k \le 1$. We claim that if $\alpha_k \ge 1$ then $\alpha_{k+1} < 1$, and further, that if $\alpha_k < 1$ then $\alpha_{k+1} \le \alpha_k$. The two claims together mean that $(\alpha_k)_k$ is decreasing and all $\alpha_k$ are in $[0,1)$ except perhaps $\alpha_0$.

To prove the first of the two claims, assume $\alpha_k \ge 1$. This implies that $F(\lambda(\theta^*)) - F(\lambda(\theta_k)) \ge 2L\,\ell_\theta^2 D_\lambda^2$. Hence, choosing $\alpha = 1$ in (G.4), we get
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_{k+1})) \le L\,\ell_\theta^2 D_\lambda^2,
\]
which implies that $\alpha_{k+1} \le 1/2 < 1$. To prove the second claim, assume $\alpha_k < 1$; we can then substitute $\alpha = \alpha_k$ into (G.4) to get
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_{k+1})) \le \Big(1 - \frac{F(\lambda(\theta^*)) - F(\lambda(\theta_k))}{4L\,\ell_\theta^2 D_\lambda^2}\Big)\big(F(\lambda(\theta^*)) - F(\lambda(\theta_k))\big),
\]
which shows that $\alpha_{k+1} \le \alpha_k$ as required.

Now, by the preceding discussion, for $k = 1, 2, \ldots$ the previous recursion holds. Using the definition of $\alpha_k$, we rewrite it in the equivalent form
\[
\alpha_{k+1} \le \Big(1 - \frac{\alpha_k}{2}\Big)\,\alpha_k.
\]
By rearranging the preceding expression and algebraic manipulations, we obtain
\[
\frac{2}{\alpha_{k+1}} \ \ge\ \frac{2}{\big(1 - \frac{\alpha_k}{2}\big)\alpha_k} \ =\ \frac{2}{\alpha_k} + \frac{1}{1 - \frac{\alpha_k}{2}} \ \ge\ \frac{2}{\alpha_k} + 1.
\]
For simplicity, assume first that $\alpha_0 < 1$. Then $\frac{2}{\alpha_k} \ge \frac{2}{\alpha_0} + k$, and consequently
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_k)) \le \frac{F(\lambda(\theta^*)) - F(\lambda(\theta_0))}{1 + \frac{F(\lambda(\theta^*)) - F(\lambda(\theta_0))}{4L\,\ell_\theta^2 D_\lambda^2}\cdot k} \le \frac{4L\,\ell_\theta^2 D_\lambda^2}{k};
\]
in fact, since $\alpha_0 < 1$, the same computation gives the slightly sharper bound $\frac{4L\,\ell_\theta^2 D_\lambda^2}{k+2}$. A similar analysis holds when $\alpha_0 \ge 1$. Combining the two cases gives
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_k)) \le \frac{4L\,\ell_\theta^2 D_\lambda^2}{k+1}
\]
no matter the value of $\alpha_0$, which proves the result.
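The recursion established above can be checked numerically. The short script below (with an arbitrary stand-in value for the constant $L\ell_\theta^2 D_\lambda^2$ and an arbitrary initial gap, both assumptions of the example) iterates the worst-case recursion with the minimizing $\alpha$ and compares the resulting gap against the $O(1/k)$ bound from the analysis.

```python
import numpy as np

c = 1.0          # stands in for L * ell_theta^2 * D_lambda^2 (arbitrary choice)
gap = 5.0        # initial gap F(lambda(theta*)) - F(lambda(theta_0)) (arbitrary)
for k in range(1, 101):
    alpha = min(1.0, gap / (2.0 * c))   # minimizer of (1-a)*gap + c*a^2 over [0,1]
    gap = (1.0 - alpha) * gap + c * alpha ** 2
    if k % 20 == 0:
        # the analytical bound 4c/k should dominate the simulated worst case
        print(k, gap, 4.0 * c / k)
```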
G.2 Proof of exponential convergence

When strong concavity of $F$ is available, we further obtain an exponential (linear) convergence result.

Proof.
We start from (G.2), whose proof requires no strong concavity of $F$:
\[
F(\lambda(\theta_{k+1})) \ge \max_{\alpha\in[0,1]}\Big\{ F(\lambda(\theta_\alpha)) - L\,\|\theta_\alpha - \theta_k\|^2 \ :\ \theta_\alpha = g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \Big\}. \tag{G.5}
\]
By the $\mu$-strong concavity of $F$, we have
\[
F(\lambda(\theta_\alpha)) = F\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \ge \alpha F(\lambda(\theta^*)) + (1-\alpha)F(\lambda(\theta_k)) + \frac{\mu}{2}\,\alpha(1-\alpha)\,|||\lambda(\theta^*) - \lambda(\theta_k)|||^2.
\]
By the Lipschitz continuity of $g$, we know that
\[
\|\theta_\alpha - \theta_k\| = \big\|g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) - g(\lambda(\theta_k))\big\| \le \alpha\,\ell_\theta\,|||\lambda(\theta^*) - \lambda(\theta_k)|||.
\]
Substituting the above two inequalities into the right-hand side of (G.5), we get
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_{k+1})) \le \min_{\alpha\in[0,1]}\Big\{ F(\lambda(\theta^*)) - F(\lambda(\theta_\alpha)) + L\,\|\theta_\alpha - \theta_k\|^2 \ :\ \theta_\alpha = g\big(\alpha\lambda(\theta^*) + (1-\alpha)\lambda(\theta_k)\big) \Big\} \le \min_{\alpha\in[0,1]}\Big\{ (1-\alpha)\big(F(\lambda(\theta^*)) - F(\lambda(\theta_k))\big) - \alpha\Big(\frac{(1-\alpha)\mu}{2} - L\,\ell_\theta^2\,\alpha\Big)\,|||\lambda(\theta^*) - \lambda(\theta_k)|||^2 \Big\}. \tag{G.6}
\]
Suppose we choose $\bar\alpha = \frac{\mu}{\mu + 2L\ell_\theta^2} < 1$, so that $\frac{(1-\bar\alpha)\mu}{2} - L\,\ell_\theta^2\,\bar\alpha = 0$. Then we have a contraction with modulus $1 - \bar\alpha$:
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_{k+1})) \le (1 - \bar\alpha)\,\big(F(\lambda(\theta^*)) - F(\lambda(\theta_k))\big).
\]
Consequently, for any $k \ge 1$, we have
\[
F(\lambda(\theta^*)) - F(\lambda(\theta_k)) \le (1 - \bar\alpha)^k\,\big(F(\lambda(\theta^*)) - F(\lambda(\theta_0))\big),
\]
which can be translated into an iteration complexity by fixing $\epsilon$ and the initialization $\theta_0$, and solving for the minimal $k$ such that $F(\lambda(\theta^*)) - F(\lambda(\theta_k)) \le \epsilon$. Doing so is an algebraic exercise which results in
\[
O\Big(\frac{1}{\bar\alpha}\,\log\Big(\frac{F(\lambda(\theta^*)) - F(\lambda(\theta_0))}{\epsilon}\Big)\Big) = O\Big(\frac{L\,\ell_\theta^2}{\mu}\,\log\Big(\frac{1}{\epsilon}\Big)\Big).
\]

Appendix H. Validating Assumption 4.1 for tabular policy case
For the tabular policy case, the following proposition holds true, and hence Assumption 4.1 is satisfied in this case.
Proposition H.1.
Suppose $\xi_s > 0$ for all $s \in \mathcal S$. Then the following hold:

(i). The mappings $\Pi$ and $\Lambda$ form a pair of bijections between the convex sets $\Delta_{SA}$ and $\mathcal L$;

(ii). There exists $L_\lambda > 0$ such that $\|\nabla\Lambda(\pi) - \nabla\Lambda(\pi')\| \le L_\lambda\|\pi - \pi'\|$ for all $\pi, \pi' \in \Delta_{SA}$;

(iii). For all $\lambda, \lambda' \in \mathcal L$, we have
\[
\|\Pi(\lambda) - \Pi(\lambda')\|^2 \le 2\sum_{s}\frac{\sum_a\big(\lambda'_{sa} - \lambda_{sa}\big)^2 + \big(\sum_a\lambda'_{sa} - \sum_a\lambda_{sa}\big)^2}{\big(\sum_a\lambda_{sa}\big)^2}.
\]
Consequently, $\|\Pi(\lambda) - \Pi(\lambda')\| \le \frac{2}{\min_s\xi_s}\,\|\lambda - \lambda'\|_1$.

Proof.
Proof of (i): The identities $\Pi\circ\Lambda = \mathrm{id}_{\mathcal L}$ and $\Lambda\circ\Pi = \mathrm{id}_{\Delta_{SA}}$ are standard; see, e.g., Altman (1999) or Appendix A of Zhang et al. (2020a).

Proof of (ii): For the existence of the Lipschitz constant $L_\lambda$ of the gradient $\nabla\Lambda$, note that the $t$-th term of the infinite sum
\[
\Lambda_{sa}(\pi) = \sum_{t=0}^{\infty}\gamma^t\cdot\mathbb{P}\big(s_t = s,\ a_t = a \,\big|\, \pi,\ s_0\sim\xi\big)
\]
is a $(t+1)$-th order polynomial in $\pi$. Therefore, $\Lambda_{sa}(\pi)$ can actually be defined for any $\pi$, even if $\pi\notin\Delta_{SA}$, as long as this infinite series of polynomials of $\pi$ converges absolutely. Note that for every $\pi\in\Delta_{SA}$, since $0 \le \mathbb{P}(s_t = s, a_t = a \,|\, \pi, s_0\sim\xi) \le 1$ and $0 < \gamma < 1$,
even if we slightly perturb $\pi$ within a neighbourhood of it (not necessarily remaining in $\Delta_{SA}$ after the perturbation), the infinite series is still absolutely convergent. This indicates that $\Lambda_{sa}$ is infinitely continuously differentiable in an open neighbourhood containing $\Delta_{SA}$; then, due to the compactness of $\Delta_{SA}$, we are able to argue that there exists an $L_\lambda$ such that $\nabla\Lambda$ is $L_\lambda$-Lipschitz continuous within $\Delta_{SA}$.

Proof of (iii): Now, we provide the calculation of the Lipschitz constant of $\Pi$. For ease of notation, let us define $\mu_s = \sum_{a\in\mathcal A}\lambda_{sa}$ and $\mu'_s = \sum_{a\in\mathcal A}\lambda'_{sa}$. Then for all $\lambda, \lambda' \in \mathcal L$ and all $(s,a)\in\mathcal S\times\mathcal A$, it holds that
\[
\Pi_{sa}(\lambda) - \Pi_{sa}(\lambda') = \frac{\lambda_{sa}}{\mu_s} - \frac{\lambda'_{sa}}{\mu'_s} = \Big(\frac{\lambda_{sa}}{\mu_s} - \frac{\lambda'_{sa}}{\mu_s}\Big) + \Big(\frac{\lambda'_{sa}}{\mu_s} - \frac{\lambda'_{sa}}{\mu'_s}\Big) = \frac{1}{\mu_s}\big(\lambda_{sa} - \lambda'_{sa}\big) + \frac{\mu'_s - \mu_s}{\mu_s\,\mu'_s}\,\lambda'_{sa}.
\]
Consequently, we can compute the squared norm of the preceding expression and apply the inequality $(a+b)^2 \le 2a^2 + 2b^2$:
\[
\|\Pi(\lambda) - \Pi(\lambda')\|^2 = \sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\big(\Pi_{sa}(\lambda) - \Pi_{sa}(\lambda')\big)^2 \le 2\sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\frac{1}{\mu_s^2}\big(\lambda_{sa} - \lambda'_{sa}\big)^2 + 2\sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\frac{(\mu'_s - \mu_s)^2}{\mu_s^2\,(\mu'_s)^2}\,(\lambda'_{sa})^2 \le 2\sum_{s\in\mathcal S}\frac{1}{\mu_s^2}\Big(\sum_{a\in\mathcal A}\big(\lambda_{sa} - \lambda'_{sa}\big)^2 + (\mu'_s - \mu_s)^2\Big), \tag{H.1}
\]
where the last inequality follows because $\|x\|_2 \le \|x\|_1$ holds for any vector $x$, applied to the vector $\big(\lambda'_{sa}/\mu'_s\big)_{a\in\mathcal A}$, whose $\ell_1$ norm equals $1$. Finally, noting that $\mu_s \ge \xi_s > 0$,
we have
\[
\|\Pi(\lambda) - \Pi(\lambda')\|^2 \le 2\sum_{s\in\mathcal S}\frac{1}{\mu_s^2}\Big(\sum_{a\in\mathcal A}\big(\lambda_{sa} - \lambda'_{sa}\big)^2 + (\mu'_s - \mu_s)^2\Big) \le \frac{2}{\min_s\xi_s^2}\sum_{s\in\mathcal S}\Big(\sum_{a\in\mathcal A}\big(\lambda_{sa} - \lambda'_{sa}\big)^2 + \Big(\sum_{a\in\mathcal A}\big|\lambda_{sa} - \lambda'_{sa}\big|\Big)^2\Big) \le \frac{4}{\min_s\xi_s^2}\,\|\lambda - \lambda'\|_1^2.
\]
Taking the square root of both sides completes the proof.
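Proposition H.1(i) and (iii) can be sanity-checked numerically on a small random MDP. The sketch below builds $\Lambda(\pi)$ by solving the discounted flow equations, applies $\Pi$, and verifies $\Pi(\Lambda(\pi)) = \pi$ together with the Lipschitz-type bound in the form reconstructed above; the random-MDP construction, sizes, and seed are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel
xi = rng.random(S); xi /= xi.sum()                              # initial distribution, xi_s > 0

def occupancy(pi):
    """Lambda(pi): unnormalized state-action occupancy, shape (S, A)."""
    P_pi = np.einsum('sa,sat->st', pi, P)                 # state kernel under pi
    m = np.linalg.solve(np.eye(S) - gamma * P_pi.T, xi)   # m = xi + gamma * P_pi^T m
    return m[:, None] * pi

def pi_of(lam):
    """Pi(lambda): conditional policy lambda_{sa} / sum_a lambda_{sa}."""
    return lam / lam.sum(axis=1, keepdims=True)

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)

lam, lam2 = occupancy(pi), occupancy(pi2)
assert np.allclose(pi_of(lam), pi)                        # Pi(Lambda(pi)) = pi, Prop. H.1(i)

lhs = np.linalg.norm(pi_of(lam) - pi_of(lam2))
rhs = 2.0 / xi.min() * np.abs(lam - lam2).sum()           # bound of Prop. H.1(iii)
print(lhs, rhs)                                           # lhs should not exceed rhs
```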
Appendix I. Proof of Theorem 4.5
Proof.
To prove this theorem, it suffices to observe that (G.2) remains true with $\theta = \pi$, $\lambda(\theta) = \Lambda(\pi)$ and $g(\lambda) = \Pi(\lambda)$. Therefore, (G.2) can be translated as
\[
F(\Lambda(\pi_{k+1})) \ge \max_{\alpha\in[0,1]}\Big\{ F(\Lambda(\pi_\alpha)) - L\,\|\pi_\alpha - \pi_k\|^2 \ :\ \pi_\alpha = \Pi\big(\alpha\Lambda(\pi^*) + (1-\alpha)\Lambda(\pi_k)\big) \Big\}. \tag{I.1}
\]
By the concavity of $F$ and the fact that $\Lambda\circ\Pi = \mathrm{id}$, we have
\[
F(\Lambda(\pi_\alpha)) = F\big(\alpha\Lambda(\pi^*) + (1-\alpha)\Lambda(\pi_k)\big) \ge \alpha F(\Lambda(\pi^*)) + (1-\alpha)F(\Lambda(\pi_k)). \tag{I.2}
\]
For the inequality (G.3), we can derive a tighter bound by the following argument. Writing $\lambda^* = \Lambda(\pi^*)$ and $\lambda^k = \Lambda(\pi_k)$, and letting $d^{\pi}_\xi(s) = (1-\gamma)\sum_a\Lambda_{sa}(\pi)$ denote the normalized state occupancy measure (so that $d^{\pi_k}_\xi(s) \ge (1-\gamma)\xi_s$), Proposition H.1(iii) gives
\[
\|\pi_\alpha - \pi_k\|^2 = \big\|\Pi\big(\alpha\Lambda(\pi^*) + (1-\alpha)\Lambda(\pi_k)\big) - \Pi\big(\Lambda(\pi_k)\big)\big\|^2 \le 2\alpha^2\sum_s\frac{\sum_a\big(\lambda^*_{sa} - \lambda^k_{sa}\big)^2 + \big(\sum_a\lambda^*_{sa} - \sum_a\lambda^k_{sa}\big)^2}{\big(\sum_a\lambda^k_{sa}\big)^2} \le 4\alpha^2\sum_s\frac{\big(\sum_a\lambda^*_{sa}\big)^2 + \big(\sum_a\lambda^k_{sa}\big)^2}{\big(\sum_a\lambda^k_{sa}\big)^2} = 4\alpha^2|\mathcal S| + 4\alpha^2\sum_s\Big(\frac{d^{\pi^*}_\xi(s)}{d^{\pi_k}_\xi(s)}\Big)^2 \le 4\alpha^2|\mathcal S| + 4\alpha^2|\mathcal S|\,\Big\|\frac{d^{\pi^*}_\xi}{d^{\pi_k}_\xi}\Big\|_\infty^2 \le 4\alpha^2|\mathcal S|\Big(1 + (1-\gamma)^{-2}\big\|d^{\pi^*}_\xi/\xi\big\|_\infty^2\Big) \le \frac{8\alpha^2|\mathcal S|}{(1-\gamma)^2}\,\big\|d^{\pi^*}_\xi/\xi\big\|_\infty^2. \tag{I.3}
\]
Denote $D^2 := \frac{8|\mathcal S|}{(1-\gamma)^2}\big\|d^{\pi^*}_\xi/\xi\big\|_\infty^2$. Substituting the above two inequalities into the right-hand side of (I.1), we get
\[
F(\Lambda(\pi^*)) - F(\Lambda(\pi_{k+1})) \le \min_{\alpha\in[0,1]}\Big\{ F(\Lambda(\pi^*)) - F(\Lambda(\pi_\alpha)) + L\,\|\pi_\alpha - \pi_k\|^2 \ :\ \pi_\alpha = \Pi\big(\alpha\Lambda(\pi^*) + (1-\alpha)\Lambda(\pi_k)\big) \Big\} \le \min_{\alpha\in[0,1]}\Big\{ (1-\alpha)\big(F(\Lambda(\pi^*)) - F(\Lambda(\pi_k))\big) + L D^2\alpha^2 \Big\}. \tag{I.4}
\]
Note that (I.4) differs from (G.4) only in that $\ell_\theta^2 D_\lambda^2$ is replaced by $D^2$. The rest of the proof of Theorem 4.5 is almost identical to that of Theorem 4.4 and is hence omitted.