Approximation Benefits of Policy Gradient Methods with Aggregated States
Daniel J. Russo
Division of Decision, Risk, and Operations, Columbia University
[email protected]
July 24, 2020
Abstract
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state aggregation, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per period is bounded by ε, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ε/(1 − γ), where γ is a discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy [2006] into the critical role of state-relevance weights in approximate dynamic programming.

Introduction

As motivation, consider a sequence of works that applied approximate dynamic programming techniques to Tetris. Bertsekas and Ioffe [1996] applied approximate policy iteration (API), which modifies the classic policy iteration algorithm by approximating the value function as a linear combination of hand-crafted features. Subsequently, Kakade [2002], Szita and Lörincz [2006], and Furmston and Barber [2012] attained much higher scores using methods that directly search over a class of policies. This is not unique to Tetris. A similar phenomenon was observed in an ambulance redeployment problem by Maxwell et al. [2013] and a battery storage problem by Scott et al. [2014]. Experiments with deep reinforcement learning tend to be less transparent, but policy gradient methods are extremely popular [see e.g. Schulman et al., 2015, 2017].

Kakade [2002], Szita and Lörincz [2006], Furmston and Barber [2012], Maxwell et al. [2013] and Scott et al. [2014] all search over the class of policies that are induced by (soft) maximization with respect to some parameterized value function. In a sense, these methods tune the parameters of the value function approximation, but do so aiming to directly improve the total expected reward earned rather than to minimize a measure of prediction error. As a result, any gap in performance cannot be due to the approximation architecture and instead is caused by the procedure that sets the parameters. See Gabillon et al. [2013] for a full account of the history.

There is very limited theory formalizing this phenomenon. Several works provide broad performance guarantees for each type of algorithm. In the case of API, Munos [2003], Antos et al. [2008], and Lazaric et al. [2012] build on the original analysis of Bertsekas and Tsitsiklis [1996]. An intellectual foundation for studying policy gradient methods was laid by Kakade and Langford [2002], who analyze a conservative policy iteration algorithm (CPI). Scherrer and Geist [2014] observed that guarantees similar to those for CPI could be provided for some idealized policy gradient methods, and recently Agarwal et al. [2019b] developed approximation guarantees and convergence rates for a much broader class of policy gradient algorithms. Comparing these bounds, one finds that the results for incremental algorithms like CPI depend on a certain distribution-shift term that is typically smaller than the so-called concentrability coefficients in Munos [2003], Antos et al. [2008], and Lazaric et al. [2012]. See Scherrer [2014].
But few conclusions can be drawn by comparing upper bounds alone, and the distribution-shift terms are highly abstract.

This paper provides a specialized study of algorithms that use state-aggregated representations, under which the state space is partitioned and either the policy or value function approximation does not distinguish between states in a common partition. State aggregation is a very old idea in approximate dynamic programming and reinforcement learning [Whitt, 1978, Bean et al., 1987, Gordon, 1995, Tsitsiklis and Van Roy, 1996, Rust, 1997, Li et al., 2006, Jiang et al., 2015, Abel et al., 2016], leading to tractable algorithms for problems with low-dimensional continuous state spaces where it is believed that nearby states are similar. We measure the inherent error of a state aggregation by the largest difference between two elements of the state-action value function belonging to a common partition, denoted ε_φ (here φ denotes a particular state aggregation).

We show that any policy that is a stationary point of the policy gradient objective function has per-period regret less than ε_φ. Many variants of policy gradient algorithms, being first-order methods, are ensured to converge (often efficiently) to stationary points, so this provides a guarantee on the quality of an ultimate policy produced with this approximation architecture. This guarantee is a substantial improvement over past work. The recent results of Bhandari and Russo [2019] translate into per-period regret of κ_ρ ε_φ/(1 − γ), where γ is a discount factor and κ_ρ is a complex term that captures distribution shift. Critically, here even per-period regret scales with the effective horizon. Other available bounds [Kakade and Langford, 2002, Scherrer and Geist, 2014, Agarwal et al., 2019b] are at least as bad. Building on an example of Bertsekas and Tsitsiklis [1996], we give an example in which API produces policies whose per-period regret scales as ε_φ/(1 − γ), hence establishing formally that policy gradient methods converge to a drastically better policy with the same approximation architecture.

The poor performance of API in that example appears to be linked to the use of a non-adaptive state weighting in the weighted least-squares problems defining value function estimates. This was the critical insight of Van Roy [2006], who showed that dramatic performance gains are possible if, in the projected Bellman equation defining an approximate value function, the Euclidean projection is weighted by the invariant distribution under a greedy policy. A recent preprint by Dong et al. [2019] shows an optimistic variant of Q-learning efficiently approaches such a fixed point, ensuring limiting per-period regret smaller than ε_φ. The current paper was directly inspired by that work and an intuition that policy gradient methods should have the same approximation guarantee. Although the proofs bear little resemblance to Van Roy [2006] and Dong et al. [2019], a similar intuition applies. To draw a tighter connection, Section 5 considers the use of actor-critic methods, which use estimated value functions to evaluate policy gradients. The theory of compatible function approximation due to Konda and Tsitsiklis [2000] and Sutton et al. [2000] implies that unbiased gradient evaluation requires that state-action value functions be estimated in a norm defined by the state occupancy measure of the current policy. This offers one alternative perspective on the results of Van Roy [2006].
We consider a Markov decision process M = (S, A, r, P, γ, ρ), which consists of a finite state space S = {1, · · ·, n}, finite action space A = {1, · · ·, k}, reward function r, transition kernel P, discount factor γ ∈ (0, 1) and initial distribution ρ. For any finite set X, we let ∆(X) = {d ∈ R^{|X|}_+ : ∑_{x∈X} d_x = 1} denote the set of probability distributions over X. A stationary randomized policy is a mapping π : S → ∆(A). We use π(s, a) to denote the a-th component of π(s). Let Π denote the set of all stationary randomized policies. Conditioned on the history up to that point, an agent who selects action a in state s ∈ S earns mean reward r(s, a) in that period and transitions randomly to a next state, where P(s′ | s, a) denotes the probability of transitioning to state s′ ∈ S. To treat randomized policies, we overload notation, defining for d ∈ ∆(A), r(s, d) = ∑_{a=1}^{k} r(s, a) d_a and P(s′ | s, d) = ∑_{a=1}^{k} P(s′ | s, a) d_a. Notice that if e_a ∈ ∆(A) is the standard basis vector, then r(s, e_a) = r(s, a).

Value functions and Bellman operators.
We define, respectively, the value function associated with a policy π and the optimal value function by

V^π(s) = E^π_s [ ∑_{t=0}^{∞} γ^t r(s_t, a_t) ]    and    V^∗(s) = sup_{π∈Π} V^π(s).

The notation E^π_s[·] denotes an expectation over the sequence of states when s_0 = s and the policy π is applied. A policy π^∗ is said to be optimal if V^{π^∗}(s) = V^∗(s) for every s ∈ S. It is known that an optimal deterministic policy exists. Throughout this paper, I will use π^∗ to denote some optimal policy. There could be multiple, but this does not change the results. The Bellman operator T_π : R^n → R^n associated with a policy π ∈ Π maps a value function V ∈ R^n to a new value function T_π V ∈ R^n defined by (T_π V)(s) = r(s, π(s)) + γ ∑_{s′∈S} P(s′ | s, π(s)) V(s′). The Bellman optimality operator T : R^n → R^n is defined by

TV(s) = max_{π∈Π} (T_π V)(s) = max_{d∈∆(A)} { r(s, d) + γ ∑_{s′∈S} P(s′ | s, d) V(s′) }.

It is well known that T and T_π are contraction mappings with respect to the maximum norm. Their unique fixed points are V^∗ and V^π, respectively. For a state s ∈ S, policy π ∈ Π, and action distribution d ∈ ∆(A), define the state-action value function Q^π(s, d) = r(s, d) + γ ∑_{s′∈S} P(s′ | s, d) V^π(s′), which measures the expected total discounted reward of sampling an action from d in state s and applying π thereafter. When d = e_a is deterministic for some a ∈ A, we denote this simply by Q^π(s, a). Define Q^∗(s, d) = Q^{π^∗}(s, d) for some optimal policy π^∗. These obey the relations

Q^π(s, π′(s)) = (T_{π′} V^π)(s)    and    max_{d∈∆(A)} Q^π(s, d) = (TV^π)(s).    (1)

Geometric average rewards and occupancies. Typically in dynamic programming, we seek policies that are optimal simultaneously from every initial state. Policy gradient methods are instead derived with respect to a weaker scalar objective that measures expected discounted reward from a random initial state,

J(π) = (1 − γ) ∑_{s∈S} ρ(s) V^π(s).

Another critical object is the discounted state occupancy measure

η_π = (1 − γ) ∑_{t=0}^{∞} γ^t ρ P_π^t ∈ ∆(S),

where P_π ∈ R^{n×n} is the Markov transition matrix under policy π and ρ is viewed as a row vector. Here η_π(s) gives the geometric average time spent in state s when the initial state is drawn from ρ. These two are related, as J(π) = ∑_{s∈S} η_π(s) r(s, π(s)).

The factor of (1 − γ) in η_π and J(π) serves to normalize these quantities and gives them a natural interpretation in terms of average reward problems. In particular, consider, just for the moment, a problem with modified transition probabilities P̃(s′ | s, a) = (1 − γ) ρ(s′) + γ P(s′ | s, a). That is, in each period there is a 1 − γ chance that the system resets to a random state drawn from ρ. Otherwise, the problem continues with the next state drawn according to P. In this episodic problem, J(π) denotes the average reward earned by π and η_π(s) is the average fraction of time spent in state s under policy π. Undiscounted average reward problems are often constructed by studying J(π) as the discount factor approaches one [Bertsekas, 1995, Puterman, 2014].
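For a small tabular problem, every object defined above can be computed exactly by solving linear systems. The NumPy sketch below does so; the randomly generated MDP, its sizes, and the helper names are illustrative assumptions rather than anything specified in the paper. Later sketches in this document reuse these definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 6, 3, 0.9                      # toy sizes: states, actions, discount
P = rng.dirichlet(np.ones(n), size=(n, k))   # P[s, a, s'] = transition probability
r = rng.uniform(0.0, 1.0, size=(n, k))       # r[s, a] = mean reward
rho = np.ones(n) / n                         # initial distribution

def policy_matrices(pi):
    """Reward vector r_pi(s) and transition matrix P_pi(s, s') under policy pi(s, a)."""
    r_pi = np.einsum("sa,sa->s", pi, r)      # r(s, pi(s)) = sum_a pi(s, a) r(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)    # P_pi(s, s') = sum_a pi(s, a) P(s'| s, a)
    return r_pi, P_pi

def value_function(pi):
    """V^pi solves the linear system (I - gamma P_pi) V = r_pi."""
    r_pi, P_pi = policy_matrices(pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def q_function(pi):
    """Q^pi(s, a) = r(s, a) + gamma sum_{s'} P(s'| s, a) V^pi(s')."""
    return r + gamma * P @ value_function(pi)

def occupancy(pi):
    """eta_pi = (1 - gamma) rho (I - gamma P_pi)^{-1}, the discounted occupancy measure."""
    _, P_pi = policy_matrices(pi)
    return (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, rho)

def J(pi):
    """Scalar objective J(pi) = (1 - gamma) sum_s rho(s) V^pi(s)."""
    return (1 - gamma) * rho @ value_function(pi)

pi_unif = np.ones((n, k)) / k
print(J(pi_unif), occupancy(pi_unif).sum())  # the occupancy measure sums to one
```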
A state aggregation is defined by a function φ : S → {1, · · ·, m} that partitions the state space into m segments. We call φ^{-1}(j) = {s ∈ S : φ(s) = j} the j-th segment of the partition. Typically we have in mind problems where the state space is enormous (effectively infinite) but it is tractable to store and loop over vectors of length m. Tractable algorithms can then be derived by searching over approximate transition kernels, value functions, or policies that don't distinguish between distinct states belonging to a common segment. Our hope is that states in a common segment are sufficiently similar, for example due to smoothness in the transition dynamics and rewards, so that these approximations still allow for effective decision-making.

To make this idea formal, let us define the set of approximate value functions and policies induced by a state aggregation φ:

Q_φ = { Q ∈ R^{|S|×|A|} : Q(s, a) = Q(s′, a) for all a ∈ A and s, s′ ∈ S such that φ(s) = φ(s′) },
Π_φ = { π ∈ Π : π(s) = π(s′) for all s, s′ ∈ S such that φ(s) = φ(s′) }.

It should be emphasized that practical algorithms do not require, for example, actually storing n·k numbers in order to represent an element Q ∈ Q_φ ⊂ R^{n×k}. Instead, one stores just m·k numbers, one per segment-action pair.

Should we approximate the value function or the policy? In this setting, there is a broad equivalence.

Remark 1 (Equivalence of aggregated-state approximations). The set of randomized state-aggregated policies Π_φ is induced by softmax optimization with respect to state-aggregated value functions:

Π_φ = closure{ π ∈ Π : Q ∈ Q_φ, π(s)_a = e^{Q(s,a)} / ∑_{a′∈A} e^{Q(s,a′)} ∀ s ∈ S, a ∈ A }.

Moreover, the set of deterministic policies contained in Π_φ is precisely equal to { f ∈ A^{|S|} : Q ∈ Q_φ, f(s) = min{ argmax_{a∈A} Q(s, a) } }, the set of greedy policies with respect to some state-aggregated value function.
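As a concrete illustration of Remark 1, the sketch below (continuing the toy MDP above, whose names n, k and rng it assumes are in scope) stores an aggregated value function as an m × k table and expands it into the softmax and greedy policies it induces. The aggregation map and all helper names are illustrative; only m·k numbers are ever stored.

```python
import numpy as np
# Continues the toy MDP above: n, k, rng are assumed to be in scope.

m = 3                                       # number of segments (illustrative)
phi = np.array([0, 0, 1, 1, 2, 2])          # phi[s] = segment of state s (an arbitrary aggregation)

def softmax_policy(theta):
    """Remark 1: the randomized policy in Pi_phi induced by an aggregated table theta in R^{m x k}."""
    logits = theta[phi]                                     # shape (n, k); rows repeat within a segment
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def greedy_policy(theta):
    """The deterministic member of Pi_phi that is greedy with respect to theta."""
    pi = np.zeros((n, k))
    pi[np.arange(n), theta[phi].argmax(axis=1)] = 1.0       # argmax breaks ties toward the smaller index
    return pi

theta = rng.normal(size=(m, k))             # only m*k numbers are stored
pi = softmax_policy(theta)
assert np.allclose(pi[0], pi[1])            # states in a common segment share an action distribution
```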
Convergence to stationary points. Policy gradient methods are first-order optimization methods applied to maximize J(π) over the constrained policy class Π_φ. Of course, just as there is an ever-growing list of first-order optimization procedures, there are many policy gradient variants. How do we provide insights relevant to this whole family of algorithms? Were J(π) concave, we would expect any sensible optimization method to converge to a solution of max_{π∈Π_φ} J(π), allowing us to abstract away the details of the optimization procedure and study instead the quality of decisions that can be made using a certain constrained policy class. Unfortunately, J(π) is not concave. It is, however, smooth (see Lemma 1). In smooth non-convex optimization, we expect sensible first-order methods to converge to a first-order stationary point. Studying the quality of such policies then gives broad insight into how the use of restricted policy classes affects the limiting performance reached by policy gradient methods.

As defined below, a policy is a first-order stationary point if, based on a first-order approximation to J(·), there is no feasible direction that improves the objective value. Local search algorithms generally continue to increase the objective value until reaching a stationary point. Throughout this section, we view each π ∈ Π as a stacked vector π = (π(s, a) : s ∈ S, a ∈ A) ∈ R^{|S|·|A|}. It may also be natural to view π as an |S| × |A| dimensional matrix whose rows are probability vectors. In that case, all results are equivalent if one views ⟨A, B⟩ = Trace(A⊤B) as the standard inner product on matrices and all norms as the Frobenius norm.
Definition 1. A policy π ∈ Π is a first-order stationary point of J : Π → R on the subset Π_φ ⊂ Π if

⟨∇J(π), π′ − π⟩ ≤ 0    ∀ π′ ∈ Π_φ.

The following smoothness result is shown by a short calculation in Agarwal et al. [2019b].
Lemma 1. For every π, π′ ∈ Π, ‖∇J(π) − ∇J(π′)‖ ≤ L ‖π − π′‖, where L = 2γ|A| ‖r‖_∞ / (1 − γ)².

Footnote (to Remark 1): Here we have broken ties deterministically in favor of the action with a smaller index. If there are multiple optimal actions and ties are broken differently at states sharing a common segment, the induced policy would not be constant across segments.

Footnote (to Lemma 1): In arXiv version 2 of Agarwal et al. [2019b], this is Lemma E.3. To translate their result to our formulation, one must multiply the statement in Lemma E.3 by (1 − γ), as in the definition J(π) = (1 − γ) ∑_{s} ρ(s) V^π(s). They also normalize so that |r(s, a)| ≤ 1, which is why ‖r‖_∞ does not appear in their expression.
Below we study an idealized policy gradient method, with exact gradient evaluations and a direct parameterization. This allows for a simple, clear study of the quality of approximation possible with this restricted policy class.

Recall that a point π_∞ is a limit point of a sequence if some subsequence converges to π_∞. Bounded sequences have convergent subsequences, so limit points exist for the sequence {π_t} in Lemma 2. The operator Proj_{Π_φ}(π) = argmin_{π′∈Π_φ} ‖π′ − π‖ denotes orthogonal projection onto the convex set Π_φ. For the interested reader, the appendix provides many extra details related to this algorithm. It explains that this projection can be computed using simple soft-thresholding operations and that the whole algorithm can be implemented efficiently while storing only m·k values, one per state segment and action, rather than one per state. The appendix also shows how to generate unbiased stochastic gradients of J(·). The body of this paper will instead focus on the quality of these stationary points. This lemma and its proof can be found in Bhandari and Russo [2020].

Lemma 2 (Convergence to stationary points). For any π_1 ∈ Π and α ∈ (0, 1/L], let

π_{t+1} = Proj_{Π_φ}( π_t + α ∇J(π_t) ),    t = 1, 2, 3, · · ·.
If π_∞ is a limit point of {π_t : t ∈ N}, then π_∞ is a stationary point of J(·) on Π_φ and lim_{t→∞} J(π_t) = J(π_∞).
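The iteration in Lemma 2 is easy to instantiate on the toy problem from the earlier sketches (whose helpers it assumes are in scope). Because every member of Π_φ is constant across the states of a segment, the Euclidean projection onto Π_φ decomposes segment by segment: average the rows of the target within each segment, then project that average onto the probability simplex. The step size and iteration count below are illustrative and not tuned to the constant L of Lemma 1.

```python
import numpy as np
# Continues the toy example: P, r, rho, n, k, gamma, m, phi and the helpers
# value_function, q_function, occupancy are assumed to be in scope.

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    j = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    tau = (1.0 - css[j]) / (j + 1)
    return np.maximum(v + tau, 0.0)

def grad_J(pi):
    """Lemma 5: the gradient of J(pi) has entries dJ/dpi(s, a) = eta_pi(s) Q^pi(s, a)."""
    return occupancy(pi)[:, None] * q_function(pi)

def project_onto_Pi_phi(x):
    """Exact Euclidean projection of an |S| x |A| array onto Pi_phi."""
    pi = np.empty_like(x)
    for i in range(m):
        seg = (phi == i)
        pi[seg] = project_simplex(x[seg].mean(axis=0))   # segment average, then simplex projection
    return pi

# Projected policy gradient ascent as in Lemma 2, started from the uniform policy.
pi = np.ones((n, k)) / k
alpha = 0.05          # illustrative constant step size; Lemma 2 asks for alpha in (0, 1/L]
for _ in range(2000):
    pi = project_onto_Pi_phi(pi + alpha * grad_J(pi))

print("J at the limiting policy:", (1 - gamma) * rho @ value_function(pi))
```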
Quality of stationary points. We will measure the accuracy of a state aggregation φ(·) through the maximal difference between state-action values at states belonging to the same segment of the state space. This notion is weaker than alternatives that explicitly assume transition probabilities and rewards are uniformly close within segments, usually by imposing a Lipschitz condition. But it is a stronger requirement than the recent one in Dong et al. [2019], which only looks at the gap between state-action values under the optimal value function. It does not seem possible to give meaningful guarantees for policy gradient methods if we replace Q^π with Q^∗ in the definition below, but we leave this question for future work.

Definition 2 (Inherent state aggregation error). Let ε_φ ∈ R be the smallest scalar satisfying

|Q^π(s, a) − Q^π(s′, a)| ≤ ε_φ

for every π ∈ Π_φ, every a ∈ A, and all s, s′ ∈ S such that φ(s) = φ(s′).

Despite the non-convexity of J(·), one can give guarantees on the quality of its stationary points. Recall that π^∗ ∈ Π denotes some optimal policy, which by definition satisfies V^{π^∗}(s) = V^∗(s) ∀ s ∈ S. Such a π^∗ is also an unconstrained maximizer of the policy gradient objective J(·). Let us emphasize that this result requires each state-space segment to have positive probability under the initial weighting ρ. Similar conditions appear in Bhandari and Russo [2020] and Agarwal et al. [2019b], and each discusses their necessity at some length. Let us also emphasize that the convergence rates of many policy gradient methods depend inversely on min_i ρ(φ^{-1}(i)). An exploratory initial distribution is critical to these algorithms' practical success.

Theorem 3 (Quality of stationary points). Suppose ρ(φ^{-1}(i)) > 0 for each i ∈ {1, · · ·, m}. If π_∞ is a stationary point of J(·) on Π_φ, then

J(π^∗) − J(π_∞) ≤ (1 − γ) ‖V^{π_∞} − V^∗‖_∞ ≤ ε_φ.

For purposes of comparison, let us provide the best known alternative result, which can be derived by specializing a result in Bhandari and Russo [2020]. A similar result was given in Agarwal et al. [2019b], but that bound is even worse, having an extra factor of 1/(1 − γ) on the right-hand side.

Theorem 4 (Earlier result by Bhandari and Russo [2020]). If π_∞ is a stationary point of J(·) on Π_φ, then

J(π^∗) − J(π_∞) ≤ (1 − γ) ‖V^{π_∞} − V^∗‖_∞ ≤ κ_ρ ε_φ / (1 − γ),    where    κ_ρ ≤ max_{i∈{1,···,m}} η_{π^∗}(φ^{-1}(i)) / ρ(φ^{-1}(i)).

Here, κ_ρ captures whether the initial distribution ρ places weight on each segment of the state partition that is aligned with the occupancy measure of an optimal policy. The form here is somewhat stronger than the simple one in Kakade and Langford [2002], which does not aggregate across segments, but it is still problematic. Without special knowledge about the optimal policy, it is impossible to guarantee κ_ρ is smaller than the number of segments m. Worse perhaps is the dependence on the effective horizon 1/(1 − γ).
Recall from Section 2 that J(π) ∈ [0, 1] has the interpretation of a geometric average reward per decision. The optimality gap J(π^∗) − J(π_∞) then represents a kind of average per-decision regret produced by a limiting policy. The dependence on 1/(1 − γ) on the right-hand side is then highly problematic, suggesting performance degrades entirely in a long-horizon regime. While undesirable, this horizon dependence is unavoidable under some classic approximate dynamic programming procedures. Below, we will show this is true for API with a uniform state weighting.

Proof of Theorem 3.
The next lemma is a specialization of the policy gradient theorem [Sutton andBarto, 2018]. For completeness, details are given in the appendix. The inner product interpretationis inspired by Konda and Tsitsiklis [2000].
Lemma 5 (Policy gradient theorem for directional derivatives). For each π, π′ ∈ Π,

⟨∇J(π), π′ − π⟩ = ∑_{s∈S} ∑_{a∈A} η_π(s) Q^π(s, a) ( π′(s, a) − π(s, a) ) := ⟨Q^π, π′ − π⟩_{η_π×1}.    (2)

The next lemma is a special case of one in Bhandari and Russo [2020]. This simplified setting allows for an extremely simple proof, so we include it for completeness.

Lemma 6 (An approximate Bellman equation for stationary points). If π_∞ is a stationary point of J(·) on Π_φ, then

E[V^{π_∞}(S)] = max_{π∈Π_φ} E[(T_π V^{π_∞})(S)],    where S ∼ η_{π_∞}.

Proof.
Continue to let S denote a random draw from η_{π_∞}. For every π ∈ Π_φ we have

0 ≥ ⟨∇J(π_∞), π − π_∞⟩ = ⟨Q^{π_∞}, π − π_∞⟩_{η_{π_∞}×1} (a)= E[ Q^{π_∞}(S, π(S)) − Q^{π_∞}(S, π_∞(S)) ] (b)= E[ (T_π V^{π_∞})(S) − V^{π_∞}(S) ].

Equality (a) recalls that Q(s, d) = ∑_a Q(s, a) d_a and equality (b) uses (1). The other direction uses that π_∞ ∈ Π_φ along with the Bellman equation V^{π_∞} = T_{π_∞} V^{π_∞}.

We are now ready to prove Theorem 3.
We apply Lemma 6 and several times use the connection between Q-functions and Bellman operators in (1). Let S denote a random draw from η_{π_∞}. Since E[(T_{π_∞} V^{π_∞})(S)] = max_{π∈Π_φ} E[(T_π V^{π_∞})(S)], we have

π_∞ ∈ argmax_{π∈Π_φ} E[(T_π V^{π_∞})(S)] = argmax_{π∈Π_φ} E[Q^{π_∞}(S, π(S))] = argmax_{π∈Π_φ} ∑_{i=1}^{m} E[Q^{π_∞}(S, π(S)) | φ(S) = i] P(φ(S) = i).

Let a^∞_i denote the action selected by policy π_∞ at any state s ∈ φ^{-1}(i) in segment i. The vector (a^∞_1, · · ·, a^∞_m) provides a full description of the policy π_∞. The optimization problem above decomposes across segments, showing

a^∞_i ∈ argmax_{a∈A} E[Q^{π_∞}(S, a) | φ(S) = i],    i = 1, · · ·, m.

Now we use the definition of ε_φ to show a^∞_i must be near-optimal at every state in segment i. Pick

(s^∗_i, a^∗_i) ∈ argmax_{s∈φ^{-1}(i), a∈A} Q^{π_∞}(s, a).

By the optimality of a^∞_i there must exist some s̃ ∈ φ^{-1}(i) such that Q^{π_∞}(s̃, a^∞_i) ≥ Q^{π_∞}(s̃, a^∗_i). For any other s ∈ φ^{-1}(i) we have

Q^{π_∞}(s, a^∞_i) ≥ Q^{π_∞}(s̃, a^∞_i) − ε_φ ≥ Q^{π_∞}(s̃, a^∗_i) − ε_φ ≥ Q^{π_∞}(s^∗_i, a^∗_i) − ε_φ ≥ max_{a∈A} Q^{π_∞}(s, a) − ε_φ.

Observe that Q^{π_∞}(s, a^∞_i) = Q^{π_∞}(s, π_∞(s)) = V^{π_∞}(s) and max_{a∈A} Q^{π_∞}(s, a) = (T V^{π_∞})(s). Since s is arbitrary, this gives the element-wise inequality V^{π_∞} ⪰ T V^{π_∞} − ε_φ e, where e denotes a vector of ones. Using the monotonicity of Bellman operators and the fact that T(V + ce) = TV + γce [Bertsekas, 1995], we have

V^{π_∞} ⪰ T V^{π_∞} − ε_φ e ⪰ T² V^{π_∞} − γ ε_φ e − ε_φ e ⪰ · · · ⪰ V^∗ − ε_φ/(1 − γ) e.

Rearranging gives ‖V^{π_∞} − V^∗‖_∞ ≤ ε_φ/(1 − γ), and since J(π^∗) − J(π_∞) = (1 − γ) ∑_s ρ(s)(V^∗(s) − V^{π_∞}(s)) ≤ (1 − γ) ‖V^{π_∞} − V^∗‖_∞, the result follows.

Horizon-dependent per-period regret under approximate policy iteration
Approximate policy iteration is one of the classic approximate dynamic programming algorithms. It has deep connections to popular methods today, like Q-learning with target networks that are infrequently updated [Mnih et al., 2015]. Approximate policy iteration is presented in Algorithm 1. The norm there is defined by ‖Q‖_{w×1} = √( ∑_s ∑_a w(s) Q(s, a)² ). The procedure mimics the classic policy iteration algorithm [Puterman, 2014], except it uses a regression-based approximation in the policy evaluation step. It is worth noting that this is a somewhat idealized form of the algorithm. Practical algorithms use efficient sample-based approximations to the least-squares problem defining Q̂. See Bertsekas and Tsitsiklis [1996] or Bertsekas [2011].

How does this algorithm perform? Our main result in this section is captured by the following proposition, giving a lower bound on performance which is worse than the result in Theorem 3 by a factor of the effective horizon 1/(1 − γ). Recall from Section 3 that there is a broad equivalence between searching over the restricted class of value functions in Q_φ and searching over the restricted class of policies Π_φ. Any advantage in the limiting performance of policy gradient methods is due to the way in which they search over policies and not to an advantage in representational power.
Proposition 7. There exists an MDP and a state aggregation rule φ such that, if {π_t}_{t∈N} is generated by Algorithm 1 with uniform weighting w(s) = 1/|S| ∀ s, then

lim sup_{t→∞} (1 − γ) ‖V^{π_t} − V^∗‖_∞ ≥ ε_φ / (4(1 − γ)).    (3)

This result is established through an example which synthesizes an example from Bertsekas and Tsitsiklis [1996] with an example of Van Roy [2006]. The latter work studies approximate value iteration in state-aggregated problems. Let us emphasize two features of this result. First, it does not show that every policy produced by API is poor. Instead, in our example API will cycle endlessly through all policies, some of which are disastrous. Second, as elucidated by Van Roy [2006], a critical reason for the poor performance is the non-adaptive choice of state-relevance weights w(·). We will return to this insight in the next section.

Note that a classic result of [Bertsekas and Tsitsiklis, 1996, Prop. 6.2], specialized to this setting, can be used to show an upper bound that matches (3) up to a numerical constant. Technically, the analysis of Bertsekas and Tsitsiklis [1996] applies to value functions and not state-action value functions. The reader can find the same proof written in terms of state-action value functions in Agarwal et al. [2019a].

Figure 1: A bad example for API, depicted here for n = 10 and m = 5. (figure omitted)

Algorithm 1: Approximate Policy Iteration
input: w ∈ ∆(S), π_1 ∈ Π
for t = 1, 2, · · · do
    Q̂_t ∈ argmin_{Q̂∈Q_φ} ‖Q̂ − Q^{π_t}‖_{w×1}        /* approximate policy evaluation step */
    π_{t+1}(s) ∈ argmax_{a∈A} Q̂_t(s, a)  ∀ s ∈ S       /* policy improvement step */
end
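As a concrete sketch of Algorithm 1, the code below runs state-aggregated approximate policy iteration on the toy problem from the earlier sketches (whose definitions it assumes are in scope). With state aggregation, the w-weighted least-squares projection onto Q_φ reduces to a w-weighted average of Q^{π_t} within each segment, which is the structure exploited in Example 1 below; the uniform weighting and iteration count are illustrative choices.

```python
import numpy as np
# Continues the toy example: n, k, m, phi and q_function are assumed to be in scope.

def aggregate_projection(Q, w):
    """argmin over Qhat in Q_phi of ||Qhat - Q||_{w x 1}: a w-weighted average within each segment."""
    Qhat = np.empty_like(Q)
    for i in range(m):
        seg = (phi == i)
        Qhat[seg] = np.average(Q[seg], axis=0, weights=w[seg])
    return Qhat

def greedy(Qhat):
    """Deterministic policy that is greedy with respect to Qhat (argmax breaks ties toward the smaller index)."""
    pi = np.zeros((n, k))
    pi[np.arange(n), Qhat.argmax(axis=1)] = 1.0
    return pi

def approximate_policy_iteration(w, pi, iters=20):
    policies = [pi]
    for _ in range(iters):
        Qhat = aggregate_projection(q_function(pi), w)   # approximate policy evaluation
        pi = greedy(Qhat)                                # policy improvement
        policies.append(pi)
    return policies

w_uniform = np.ones(n) / n
api_policies = approximate_policy_iteration(w_uniform, np.ones((n, k)) / k)
```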
Example 1. Consider an MDP with n = 2m states, depicted in Figure 1 for n = 10 and m = 5. For s ∈ {1, · · ·, m}, we have φ(s) = φ(s + m) = s. This means that the algorithms don't distinguish between s and s + m. In state s ∈ {2, · · ·, m} there are two possible actions: Move (M), which moves the agent to state s − 1 and generates reward r(s, M) = 0, and Stay (S), which keeps the agent in the same state with reward r(s, Stay) = −c ∑_{i=1}^{s−1} γ^{i−1}. The negative reward for the action Stay can be thought of as a cost. State 1 has only the action Stay, which incurs cost 0. (Or one can think of Move as being identical to Stay in state 1.)
Rewards obey the recursion r(2, Stay) = −c and r(s, Stay) = γ r(s − 1, Stay) − c for s ∈ {3, · · ·, m}. For s ∈ {m + 1, · · ·, 2m}, the transitions are identical to those of state s − m, but r(s, Stay) = r(s − m, Stay) + ε_φ. Pick c = ε_φ/2. The optimal policy is to move left from every state.

Consider API applied with w(s) = 1/|S|. The weighted least-squares problem has a particularly simple form in this case. It is straightforward to show that the problem decomposes across segments of the state space and, as the conditional mean minimizes squared loss, has the form

Q̂_t(s, a) = E_{S∼w}[ Q^{π_t}(S, a) | S ∈ φ^{-1}(φ(s)) ] = ½ ( Q^{π_t}(s, a) + Q^{π_t}(s + m, a) ) = Q^{π_t}(s, a) + (ε_φ/2) 1{a = Stay}    ∀ s ≤ m, a ∈ A.

That is, in each segment the value of the current policy is overestimated by ε_φ/2 at state s and underestimated by ε_φ/2 at state s + m. In this way, we have constructed a problem with state aggregation that effectively reproduces the behavior of approximate policy iteration in Example 6.4 of Bertsekas and Tsitsiklis [1996].

We will only sketch the analysis of the example. Assume that ties in the execution of API are broken in favor of the action Stay. (Otherwise, we can shift c by an infinitesimal constant.) One can show that, beginning with an initial policy π_1 that moves left from every state, API produces a policy π_2 that moves left at each s ∈ {3, · · ·, m} but plays Stay at state 2. The reason is that

Q̂^{π_1}(2, Stay) = Q^{π_1}(2, Stay) + ε_φ/2 = r(2, Stay) + γ V^{π_1}(2) + ε_φ/2 = −c + ε_φ/2 = 0 = Q̂^{π_1}(2, Move),

but Q̂^{π_1}(s, Stay) < Q̂^{π_1}(s, Move) for s > 2, since r(s, Stay) strictly decreases with s.
Using that the policy π_2 will move left until reaching state 2, at which point it stays and receives r(2, Stay) in every subsequent period, we compute

Q^{π_2}(3, Move) = 0 + γ V^{π_2}(2) = γ r(2, Stay)/(1 − γ) = −γc/(1 − γ),
Q^{π_2}(3, Stay) = r(3, Stay) + γ Q^{π_2}(3, Move) = (−γc − c) + ( −γ²c/(1 − γ) ) = −c − γc/(1 − γ).

Then Q̂^{π_2}(3, Stay) − Q̂^{π_2}(3, Move) = ε_φ/2 − c ≥ 0, so the policy π_3 produced by API will play Stay in state 3. Since r(s, Stay) < r(3, Stay) for s > 3, the policy π_3 still moves left in states to the right of the third state.
The iterations continue this way until eventually we find a policy π_m that plays Stay from every state. Next, π_{m+1} = π_1 and the process repeats endlessly. Suppose m is large. When π_t plays Stay at every state, we find V^{π_t}(m) = r(m, Stay)/(1 − γ) = −c ∑_{i=1}^{m−1} γ^{i−1} / (1 − γ) ≈ −c/(1 − γ)². On the other hand, the optimal policy moves left from every state, earning zero reward (V^∗(s) = 0 for every s). Hence (1 − γ) ‖V^{π_t} − V^∗‖_∞ ≈ c/(1 − γ) = ε_φ/(2(1 − γ)). (The claim in Proposition 7 uses a looser constant of 1/4 to allow for small errors due to finite m.) Notice that J(π_t) ≈ −c/(1 − γ) as well if ρ(m) is sufficiently close to one.

The apparent gap in performance between policy gradient methods and approximate policy iteration may be surprising, given their close connections. Some insight can be gained by considering actor-critic methods, which use estimated value functions in evaluating gradients of J(·). To make this precise, recall that the policy gradient expression in Lemma 5 expresses directional derivatives as a certain weighted inner product, ⟨∇J(π), π′ − π⟩ = ⟨Q^π, π′ − π⟩_{η_π×1}. Actor-critic methods replace Q^π with some parameterized approximation, producing an approximate gradient.

An elegant insight of Konda and Tsitsiklis [2000] and Sutton et al. [2000] shows that compatible value function approximation produces no error in gradient evaluation. Below we identify the form of compatible function approximation in our setting. Let ‖Q̂ − Q^π‖_{η_π×1} denote the norm induced by the inner product ⟨·, ·⟩_{η_π×1}.

Lemma 8 (Compatible function approximation). If Q̂^π = argmin_{Q̂∈Q_φ} ‖Q̂ − Q^π‖_{η_π×1}, then

⟨∇J(π), π′ − π⟩ = ⟨Q̂^π, π′ − π⟩_{η_π×1}    ∀ π′ ∈ Π_φ.

Proof.
Observe that Span(Π_φ) ⊆ Q_φ, where Span(Π_φ) consists of all vectors of the form ∑_{i=1}^{I} c_i π^{(i)} with each c_i ∈ R a scalar and each π^{(i)} ∈ Π_φ. Now, Q̂^π is the orthogonal projection of Q^π onto Q_φ with respect to the norm induced by the inner product ⟨·, ·⟩_{η_π×1}, so Q^π − Q̂^π is orthogonal to every element of Q_φ and, in particular, to every element of Span(Π_φ). This implies

⟨Q^π, π̃⟩_{η_π×1} = ⟨Q̂^π, π̃⟩_{η_π×1}    ∀ π̃ ∈ Span(Π_φ).

Combined with Lemma 5, this yields the result.

This result suggests that a critical flaw in the API method treated above was its use of a fixed state weighting in estimating the value function. Lemma 8 shows that if value functions are projections in a norm weighted by the current state occupancy measure, these value functions can be used to evaluate policy gradients while attaining the guarantee in Theorem 3. Previously, Van Roy [2006] identified the critical role of state-relevance weightings in approximate value iteration methods with state aggregation. This section emphasizes the connection of his insight to the policy gradient theorem. An open question is whether better guarantees are possible for API with an adaptive state weighting. I conjecture this is not true, since API can make large changes to the policy and our accuracy guarantee for compatible function approximation applies only locally. The incremental nature of policy gradient methods also appears to be critical to the results in this paper.
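To make Lemma 8 concrete, the sketch below (continuing the toy example and reusing aggregate_projection from the API sketch) forms the compatible critic by projecting Q^π onto Q_φ in the η_π-weighted norm, and checks numerically that it reproduces the directional derivatives ⟨∇J(π), π′ − π⟩ for directions between policies in Π_φ.

```python
import numpy as np
# Continues the toy example: m, k, rng, phi, occupancy, q_function,
# softmax_policy and aggregate_projection are assumed to be in scope.

def compatible_critic(pi):
    """Projection of Q^pi onto Q_phi in the norm weighted by eta_pi (Lemma 8)."""
    return aggregate_projection(q_function(pi), occupancy(pi))

def directional_derivative(pi, pi_new, Q):
    """<Q, pi_new - pi> weighted by eta_pi over states and unweighted over actions."""
    return np.sum(occupancy(pi)[:, None] * Q * (pi_new - pi))

pi = softmax_policy(rng.normal(size=(m, k)))          # a policy in Pi_phi
pi_new = softmax_policy(rng.normal(size=(m, k)))      # another policy in Pi_phi, giving a feasible direction

exact = directional_derivative(pi, pi_new, q_function(pi))        # Lemma 5 with the true Q^pi
approx = directional_derivative(pi, pi_new, compatible_critic(pi))  # same expression with the projected critic
assert np.isclose(exact, approx)                      # no error in gradient evaluation
```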
Acknowledgments. I would like to thank Jalaj Bhandari for many helpful discussions on policy gradient methods.
References
David Abel, D Ellis Hershkowitz, and Michael L Littman. Near optimal behavior via approximate stateabstraction. In
Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48 , pages 2915–2923, 2016.Alekh Agarwal, Nan Jiang, and Sham M Kakade. Reinforcement learning: Theory and algorithms. Technicalreport, 2019a.Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation withpolicy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261 , 2019b.András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residualminimization based fitted policy iteration and a single sample path.
Machine Learning , 71(1):89–129, 2008.James C Bean, John R Birge, and Robert L Smith. Aggregation in dynamic programming.
Operations Research ,35(2):215–220, 1987.Amir Beck.
First-order methods in optimization . SIAM, 2017.Dimitri P Bertsekas.
Dynamic programming and optimal control , volume 2. Athena scientific Belmont, MA,1995.Dimitri P Bertsekas. Nonlinear programming.
Journal of the Operational Research Society , 48(3):334–334, 1997.Dimitri P Bertsekas. Approximate policy iteration: A survey and some new methods.
Journal of ControlTheory and Applications , 9(3):310–335, 2011.Dimitri P Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming.
Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA , 14,1996.Dimitri P Bertsekas and John N Tsitsiklis.
Neuro-dynamic programming . Athena Scientific, 1996.Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprintarXiv:1906.01786 , 2019.Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. 2020. URL http://djrusso.github.io/docs/policy_grad_optimality.pdf .Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth,nonconvex problems.
SIAM Journal on Optimization, 29(3):1908–1930, 2019. Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014. Shi Dong, Benjamin Van Roy, and Zhengyuan Zhou. Provably efficient reinforcement learning with aggregated states. arXiv preprint arXiv:1912.06366, 2019. John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In
Proceedings of the 25th international conference on Machine learning , pages272–279, 2008.Thomas Furmston and David Barber. A unifying perspective of parametric policy search methods formarkov decision processes. In
Advances in neural information processing systems , pages 2717–2725, 2012.Victor Gabillon, Mohammad Ghavamzadeh, and Bruno Scherrer. Approximate dynamic programmingfinally performs well in the game of tetris. In
Advances in neural information processing systems , pages1754–1762, 2013.Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochasticprogramming.
SIAM Journal on Optimization , 23(4):2341–2368, 2013.Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochasticprogramming.
Mathematical Programming , 156(1-2):59–99, 2016.Geoffrey J Gordon. Stable function approximation in dynamic programming. In
Machine Learning Proceedings1995 , pages 261–268. Elsevier, 1995.Nan Jiang, Alex Kulesza, and Satinder Singh. Abstraction selection in model-based reinforcement learning.In
International Conference on Machine Learning , pages 179–188, 2015.Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In
ICML ,volume 2, pages 267–274, 2002.Sham M Kakade. A natural policy gradient. In
Advances in neural information processing systems , pages1531–1538, 2002.Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In
Advances in neural information processingsystems , pages 1008–1014, 2000.Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squarespolicy iteration.
Journal of Machine Learning Research , 13(Oct):3041–3074, 2012.Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps.In
ISAIM , 2006.Matthew S Maxwell, Shane G Henderson, and Huseyin Topaloglu. Tuning approximate dynamic program-ming policies for ambulance redeployment via direct search.
Stochastic Systems , 3(2):322–361, 2013.Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, AlexGraves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control throughdeep reinforcement learning.
Nature , 518(7540):529–533, 2015.Rémi Munos. Error bounds for approximate policy iteration. In
ICML , volume 3, pages 560–567, 2003.Martin L Puterman.
Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016a. Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In , pages 1244–1251. IEEE, 2016b. Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In
Advances in Neural Information Processing Systems , pages1145–1153, 2016c.John Rust. Using randomization to break the curse of dimensionality.
Econometrica: Journal of the EconometricSociety , pages 487–516, 1997.Bruno Scherrer. Approximate policy iteration schemes: a comparison. In
International Conference on MachineLearning , pages 1314–1322, 2014.Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iterationas boosted policy search. In
Joint European Conference on Machine Learning and Knowledge Discovery inDatabases , pages 35–50. Springer, 2014.John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policyoptimization. In
International conference on machine learning , pages 1889–1897, 2015.John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza-tion algorithms. arXiv preprint arXiv:1707.06347 , 2017.Warren R Scott, Warren B Powell, and Somayeh Moazehi. Least squares policy iteration with instrumentalvariables vs. direct policy search: Comparison against optimal benchmarks using energy storage. arXivpreprint arXiv:1401.0843 , 2014.Richard S Sutton and Andrew G Barto.
Reinforcement learning: An introduction . MIT press, 2018.Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods forreinforcement learning with function approximation. In
Advances in neural information processing systems ,pages 1057–1063, 2000.István Szita and András Lörincz. Learning tetris using the noisy cross-entropy method.
Neural computation ,18(12):2936–2941, 2006.John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming.
Machine Learning , 22(1-3):59–94, 1996.Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation.
Mathematics of Operations Research , 31(2):234–244, 2006.Ward Whitt. Approximations of dynamic programs, i.
Mathematics of Operations Research , 3(3):231–243, 1978.Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Appendix: Implementing projected policy gradient with aggregated state approximations
Conceptually, the simplest policy gradient method is the projected gradient ascent iteration

π_{t+1} = Proj_{Π_φ}( π_t + α ∇J(π_t) ) = argmax_{π∈Π_φ} { J(π_t) + ⟨∇J(π_t), π − π_t⟩ − (1/(2α)) ‖π − π_t‖² },    t ∈ N,

where any policy π ∈ Π is viewed as a stacked |S|·|A|-dimensional vector satisfying ∑_{a∈A} π(s, a) = 1 and π ≥ 0.
The operator Proj_{Π_φ}(π) = argmin_{π′∈Π_φ} ‖π′ − π‖ denotes orthogonal projection onto the convex set Π_φ with respect to the Euclidean norm. The second equality is a well-known "proximal" interpretation of the projected update [Beck, 2017]. Although the optimization problem

argmax_{π∈Π_φ} { J(π_t) + ⟨∇J(π_t), π − π_t⟩ − (1/(2α)) ‖π − π_t‖² }

appears to involve |S|·|A| decision variables, it is equivalent to one involving m·|A| decision variables. Algorithm 2 below uses θ ∈ R^{m×k} to denote the parameter of a state-aggregated policy, where π_θ(s, a) = θ_{i,a} is the probability of selecting action a at a state s ∈ φ^{-1}(i) in segment i. The projection has a simple solution, involving projecting the vector θ̃_{i,:} corresponding to segment i onto the space of action distributions ∆(A). Projection onto the simplex can be executed with a simple soft-thresholding procedure [Duchi et al., 2008].
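For completeness, here is a sketch of that soft-thresholding projection in the spirit of Duchi et al. [2008]; the function name and the numerical example are illustrative. It is the same operation applied row by row in Algorithm 2 below: sort, find the threshold θ, and clip.

```python
import numpy as np

def project_onto_simplex(v):
    """Euclidean projection of v onto {d : d >= 0, sum(d) = 1} by soft thresholding."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u) - 1.0
    j = np.arange(1, v.size + 1)
    rho = j[u - css / j > 0][-1]             # largest number of coordinates kept above the threshold
    theta = css[rho - 1] / rho               # the soft threshold
    return np.maximum(v - theta, 0.0)

# Example: project one unconstrained target row of theta-tilde back onto Delta(A).
print(project_onto_simplex(np.array([0.9, 0.4, -0.2])))   # -> [0.75, 0.25, 0.0]
```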
Algorithm 2: Projected Policy Gradient
input: θ ∈ R^{m×k}, initial stepsize α
for t = 1, 2, · · · do
    Get the gradient g = ∇_θ J(π_θ)
    Form the target θ̃ = θ + α g
    /* Project each row onto the simplex */
    for i = 1, · · ·, m do
        θ_{i,:} ← argmin_{d∈∆(A)} ‖d − θ̃_{i,:}‖
    end
    α ← get.updated.stepsize()
end

Algorithm 3 provides an unbiased Monte Carlo policy gradient estimator. It is based on the formula

∂J(π)/∂π(s, a) = η_π(s) Q^π(s, a).

Rewriting this, if s̃ ∼ η_π and ã | s̃ ∼ Uniform(1, · · ·, k), then

∂J(π)/∂π(s, a) = Q^π(s, a) P(s̃ = s) = k Q^π(s, a) P(s̃ = s, ã = a).

Using the chain rule, we have

∂J(π_θ)/∂θ_{i,a} = ∑_{s∈φ^{-1}(i)} ∂J(π)/∂π(s, a) = k E[ Q^π(s̃, a) 1{ s̃ ∈ φ^{-1}(i), ã = a } ].

Using this, Algorithm 3 first draws a state s̃ from η_π(·) and then an action ã uniformly at random. Then it uses a Monte Carlo rollout to estimate Q^π(s̃, ã). To give unbiased estimates of the infinite-horizon discounted sums underlying η_π and Q^π, it leverages a well-known equivalence between geometric discounting and the use of a random geometric horizon. For any scalar random variables {X_t}_{t=0,1,···}, one has

E[ ∑_{t=0}^{∞} γ^t X_t ] = E[ ∑_{t=0}^{τ} X_t ],

where τ ∼ Geometric(1 − γ) is supported on {0, 1, 2, · · ·} and is independent of {X_t}. The equivalence is due to the fact that P(τ ≥ t) = γ^t. It is easy to modify this method to provide an estimate of the gradient with respect to the policy parameter θ in Algorithm 2.
Algorithm 3: Simple Unbiased Gradient
input: policy π ∈ Π_φ
/* Sample s̃ ∼ η_π */
Sample τ ∼ Geometric(1 − γ)
Sample the initial state s_0 ∼ ρ
Apply policy π for τ timesteps and observe (s_0, a_0, r_0, · · ·, a_{τ−1}, r_{τ−1}, s_τ)
Set s̃ = s_τ
/* Draw a uniformly random action */
Sample ã ∼ Uniform{1, · · ·, k}
/* Unbiased estimate of Q^π(s̃, ã) */
Sample τ̃ ∼ Geometric(1 − γ)
Apply action ã and observe (r̃_0, s̃_1)
if τ̃ > 0 then
    Apply policy π for τ̃ steps and observe (ã_1, r̃_1, s̃_2, · · ·, ã_{τ̃}, r̃_{τ̃}, s̃_{τ̃+1})
end
Set Q̂ = r̃_0 + r̃_1 + · · · + r̃_{τ̃}
Find the state segment I = φ(s̃)
Set ĝ(i, a) = k·Q̂ if (i, a) = (I, ã), and ĝ(i, a) = 0 otherwise
Return ĝ ∈ R^{m×k}
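The following Python sketch mirrors Algorithm 3 on the toy MDP from the earlier sketches (whose arrays and aggregation it assumes are in scope); the simulator interface and helper names are illustrative assumptions. Each call returns one unbiased draw of the gradient with respect to the table θ, so averaging many draws approximates ∇_θ J(π_θ).

```python
import numpy as np
# Continues the toy example: n, k, m, gamma, P, r, rho, phi and rng are assumed to be in scope.

def sample_next(s, a):
    """Simulator step for the toy MDP: return the reward and a sampled next state."""
    return r[s, a], rng.choice(n, p=P[s, a])

def sample_action(pi, s):
    return rng.choice(k, p=pi[s])

def unbiased_gradient(pi):
    """One Monte Carlo draw of the gradient estimator in Algorithm 3."""
    # Sample s_tilde ~ eta_pi by running pi for a Geometric(1 - gamma) number of steps.
    tau = rng.geometric(1 - gamma) - 1            # support {0, 1, 2, ...}
    s = rng.choice(n, p=rho)
    for _ in range(tau):
        _, s = sample_next(s, sample_action(pi, s))
    s_tilde = s
    # Uniform random action, then a rollout of random geometric length to estimate Q.
    a_tilde = rng.integers(k)
    tau2 = rng.geometric(1 - gamma) - 1
    reward, s = sample_next(s_tilde, a_tilde)
    Q_hat = reward
    for _ in range(tau2):
        reward, s = sample_next(s, sample_action(pi, s))
        Q_hat += reward
    # Assemble the m x k gradient estimate: k * Q_hat on the sampled (segment, action) pair.
    g = np.zeros((m, k))
    g[phi[s_tilde], a_tilde] = k * Q_hat
    return g
```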