Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator
Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi

University of Washington, Seattle, WA, USA: [email protected], [email protected], [email protected]. Duke University, Durham, NC, USA: [email protected].

March 26, 2019
Abstract
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model, 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest, 3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem (that of linear quadratic regulators), these methods must solve a non-convex optimization problem, where little is understood about their efficiency from both computational and statistical perspectives. In contrast, system identification and model based planning in optimal control theory have a much more solid theoretical footing, where much is known with regards to their computational and statistical properties. This work bridges this gap, showing that (model free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem dependent quantities) with regards to their sample and computational complexities.
1 Introduction

Recent years have seen major advances in the control of uncertain dynamical systems using reinforcement learning and data-driven approaches; examples range from allowing robots to perform more sophisticated control tasks such as robotic hand manipulation [Tassa et al., 2012, Al Borno et al., 2013, Kumar et al., 2016, Levine et al., 2016, Tobin et al., 2017, Rajeswaran et al., 2017a], to sequential decision making in game domains, e.g., AlphaGo [Silver et al., 2016] and Atari game playing [Mnih et al., 2015]. Deep reinforcement learning (DeepRL) is becoming increasingly popular for tackling such challenging sequential decision making problems.

Many of these successes have relied on sampling based reinforcement learning algorithms such as policy gradient methods, including the DeepRL approaches. For these approaches, there is little theoretical understanding of their efficiency, either from a statistical or a computational perspective. In contrast, control theory (optimal and adaptive control) has a rich body of tools, with provable guarantees, for related sequential decision making problems, particularly those that involve continuous control. These latter techniques are often model-based: they estimate an explicit dynamical model first (via system identification) and then design optimal controllers.

This work builds bridges between these two lines of work, namely, between optimal control theory and sample based reinforcement learning methods, using ideas from mathematical optimization.

1.1 The optimal control problem

In the standard optimal control problem, a dynamical system is described as

\[
x_{t+1} = f_t(x_t, u_t, w_t),
\]

where $f_t$ maps a state $x_t \in \mathbb{R}^d$, a control (the action) $u_t \in \mathbb{R}^k$, and a disturbance $w_t$, to the next state $x_{t+1} \in \mathbb{R}^d$, starting from an initial state $x_0$.
The objective is to find the control input $u_t$ which minimizes the long term cost,

\[
\text{minimize } \sum_{t=0}^{T} c_t(x_t, u_t) \quad \text{such that } x_{t+1} = f_t(x_t, u_t, w_t), \quad t = 0, \ldots, T.
\]

Here the $u_t$ are allowed to depend on the history of observed states, and $T$ is the time horizon (which can be finite or infinite). In practice, this is often solved by considering the linearized control (sub-)problem where the dynamics are approximated by

\[
x_{t+1} = A_t x_t + B_t u_t + w_t,
\]

and the costs are approximated by a quadratic function in $x_t$ and $u_t$, e.g. [Todorov and Li, 2004]. The present paper considers an important special case: the time homogeneous, infinite horizon problem referred to as the linear quadratic regulator (LQR) problem. The results herein can also be extended to the finite horizon, time inhomogeneous setting, discussed in Section 5.

We consider the following infinite horizon LQR problem,

\[
\text{minimize } \mathbb{E}\left[\sum_{t=0}^{\infty} \left(x_t^\top Q x_t + u_t^\top R u_t\right)\right] \quad \text{such that } x_{t+1} = A x_t + B u_t, \quad x_0 \sim \mathcal{D},
\]

where the initial state $x_0 \sim \mathcal{D}$ is assumed to be randomly distributed according to distribution $\mathcal{D}$; the matrices $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d \times k}$ are referred to as system (or transition) matrices; $Q \in \mathbb{R}^{d \times d}$ and $R \in \mathbb{R}^{k \times k}$ are both positive definite matrices that parameterize the quadratic costs. For clarity, this work does not consider a noise disturbance but only a random initial state. The importance of (some) randomization for analyzing direct methods is discussed in Section 3.

Throughout, assume that $A$ and $B$ are such that the optimal cost is finite (for example, controllability of the pair $(A, B)$ would ensure this). Optimal control theory [Anderson and Moore, 1990, Evans, 2005, Bertsekas, 2011, 2017] shows that the optimal control input can be written as a linear function in the state,

\[
u_t = -K^* x_t, \quad \text{where } K^* \in \mathbb{R}^{k \times d}.
\]

Planning with a known model.
For the infinite horizon LQR problem, planning can be achieved by solving the Algebraic Riccati Equation (ARE),

\[
P = A^\top P A + Q - A^\top P B (B^\top P B + R)^{-1} B^\top P A, \tag{1}
\]

for a positive definite matrix $P$ which parameterizes the "cost-to-go" (the optimal cost from a state going forward). The optimal control gain is then given as:

\[
K^* = (B^\top P B + R)^{-1} B^\top P A. \tag{2}
\]

To find $P$, there are iterative methods, algebraic solution methods, and (convex) SDP formulations. Solving the ARE is extensively studied; one approach due to [Kleinman, 1968] (for continuous time) and [Hewer, 1971] (for discrete time) is to simply run the recursion

\[
P_{k+1} = Q + A^\top P_k A - A^\top P_k B (R + B^\top P_k B)^{-1} B^\top P_k A,
\]

with $P_0 = Q$, which converges to the unique positive semidefinite solution of the ARE (since the fixed-point iteration is contractive). Other approaches are direct and are based on linear algebra, which carry out an eigenvalue decomposition on a certain block matrix (called the Hamiltonian matrix) followed by a matrix inversion [Lancaster and Rodman, 1995]. The LQR problem can also be expressed as a semidefinite program (SDP) with variable $P$ as given in [Balakrishnan and Vandenberghe, 2003] (see Section A in the supplement).

However, these formulations: 1) do not directly parameterize the policy, 2) are not "end-to-end" approaches, in that they are not directly optimizing the cost function of interest, and 3) it is not immediately clear how to utilize these approaches in the model-free setting, where the agent only has simulation access. These issues are outlined in Section A of the supplement.

Even in the most basic case of the standard linear quadratic regulator model, little is understood as to how direct (model-free) policy gradient methods fare.
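As a concrete baseline for the model-based planning just described, the Hewer fixed-point recursion can be sketched in a few lines. This is a minimal sketch: the 2-state, 1-input system matrices below are hypothetical illustrative choices, not an example from the paper.

```python
import numpy as np

# Hypothetical 2-state, 1-input system (illustrative values only).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def riccati_recursion(A, B, Q, R, iters=500):
    """Hewer's fixed-point iteration for the discrete-time ARE:
    P_{k+1} = Q + A^T P_k A - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A,
    starting from P_0 = Q."""
    P = Q.copy()
    for _ in range(iters):
        gain = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ gain
    return P

P = riccati_recursion(A, B, Q, R)
K_star = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # gain of Equation (2); u_t = -K* x_t

# Sanity checks: P satisfies the ARE, and the closed loop A - B K* is stable.
are_residual = Q + A.T @ P @ A - A.T @ P @ B @ K_star - P
closed_loop_radius = max(abs(np.linalg.eigvals(A - B @ K_star)))
```

In practice one would use a library ARE solver; the point of the sketch is only that the contractive recursion converges to the stabilizing solution.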
This work provides rigorous guarantees, showing that, even though the approach deals with a non-convex problem, directly using (model free) local search methods leads to finding the globally optimal policy (i.e., a policy whose objective value is $\epsilon$-close to the optimal). The main contributions are as follows:

• (Exact case) Even with access to exact gradient evaluation, little is understood about whether or not convergence to the optimal policy occurs, even in the limit, due to the non-convexity of the problem. This work shows that global convergence does indeed occur (and does so efficiently) for gradient descent methods.

• (Model free case) Without a model, this work shows how one can use simulated trajectories (as opposed to having knowledge of the model) in a stochastic policy gradient method, where provable convergence to a globally optimal policy is guaranteed, with (polynomially) efficient computational and sample complexities.

• (The natural policy gradient) Natural policy gradient methods [Kakade, 2001], along with related algorithms such as Trust Region Policy Optimization [Schulman et al., 2015] and the natural actor critic [Peters and Schaal, 2007], are some of the most widely used and effective policy gradient methods (see Duan et al. [2016]). While many results argue in favor of this method based on either information geometry [Kakade, 2001, Bagnell and Schneider, 2003] or on connections to actor-critic methods [Deisenroth et al., 2013], these results do not provably show an improved convergence rate. This work is the first to provide a guarantee that the natural gradient method enjoys a considerably improved convergence rate over its naive gradient counterpart.

More broadly, the techniques in this work merge ideas from optimal control theory, mathematical optimization (first order and zeroth order), and sample based reinforcement learning methods.
These techniques may ultimately help in improving upon the existing set of algorithms, addressing issues such as variance reduction or improving upon the natural policy gradient method (with, say, a Gauss-Newton method as in Theorem 7). The Discussion section touches upon some of these issues.

1.3 Related work

In the reinforcement learning setting, the model is unknown, and the agent must learn to act through its interactions with the environment. Here, solution concepts are typically divided into: model-based approaches, where the agent attempts to learn a model of the world, and model-free approaches, where the agent directly learns to act and does not explicitly learn a model of the world. The related work on provably learning LQRs is reviewed from this perspective.
Model-based learning approaches.
In the context of LQRs, the agent can attempt to learn the dynamics of "the plant" (i.e., the model) and then plan, using this model, for control synthesis. Here, the classical approach is to learn the model with subspace-based system identification [Ljung, 1999]. Fiechter [1994] provides a provable learning (and non-asymptotic) result, where the quality of the policy obtained is shown to be near optimal (efficiency is in terms of the persistence of the training data and the controllability Gramian). Abbasi-Yadkori and Szepesvári [2011] also provides provable, non-asymptotic learning results in a regret context, using a bandit algorithm that achieves lower sample complexity (by balancing exploration-exploitation more effectively); the computational efficiency of this approach is less clear.

More recently, Dean et al. [2017] expands on an explicit system identification process, where a robust control synthesis procedure is adopted that relies on a coarse model of the plant matrices ($A$ and $B$ are estimated up to some accuracy level, naturally leading to a "robust control" setup to then design the controller based on the coarse model). Tighter analysis for sample complexity was given in Tu and Recht [2018], Simchowitz et al. [2018]. Arguably, this is the most general (and non-asymptotic) result that is efficient from a statistical perspective. Computationally, the method works with a finite horizon to approximate the infinite horizon. This result only needs the plant to be controllable; the work herein needs the stronger assumption that the initial policy in the local search procedure is a stable controller (an assumption which may be inherent to local search procedures, discussed in Section 5). Another recent line of work [Hazan et al., 2017, 2018, Arora et al., 2018] treats the problem of learning a linear dynamical system as an online learning problem.
[Hazan et al., 2017, Arora et al., 2018] are restricted to systems with symmetric dynamics (symmetric $A$ matrix), while [Hazan et al., 2018] handles a more general setting. This line of work can handle the case when there are latent states (i.e., when the observed output is a linear function of the state, and the state is not observed directly) and does not need to do system identification first. On the other hand, it does not output a succinct linear policy as Dean et al. [2017] or this paper do.

Model-free learning approaches.
Model-free approaches that do not rely on an explicit system identification step typically either: 1) estimate value functions (or state-action values) through Monte Carlo simulation, which are then used in some approximate dynamic programming variant [Bertsekas, 2011], or 2) directly optimize a (parameterized) policy, also through Monte Carlo simulation. Model-free approaches for learning optimal controllers are not well understood from a theoretical perspective. Here, Bradtke et al. [1994] provides an asymptotic learnability result using a value function approach, namely $Q$-learning.

This work seeks to characterize the behavior of (direct) policy gradient methods, where the policy is linearly parameterized, as specified by a matrix $K \in \mathbb{R}^{k \times d}$ which generates the controls $u_t = -K x_t$ for $t \ge 0$. The cost of this $K$ is denoted as:

\[
C(K) := \mathbb{E}_{x_0 \sim \mathcal{D}}\left[\sum_{t=0}^{\infty} \left(x_t^\top Q x_t + u_t^\top R u_t\right)\right],
\]

where $\{x_t, u_t\}$ is the trajectory induced by following $K$, starting with $x_0 \sim \mathcal{D}$. The importance of (some) randomization, either in $x_0$ or noise through having a disturbance, for analyzing gradient methods is discussed in Section 3. Here, $K^*$ is a minimizer of $C(\cdot)$.

Gradient descent on $C(K)$, with a fixed stepsize $\eta$, follows the update rule:

\[
K \leftarrow K - \eta \nabla C(K).
\]

It is helpful to explicitly write out the functional form of the gradient. Define $P_K$ as the solution to:

\[
P_K = Q + K^\top R K + (A - BK)^\top P_K (A - BK),
\]

and, under this definition, it follows that $C(K)$ can be written as:

\[
C(K) = \mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0^\top P_K x_0.
\]

Also, define $\Sigma_K$ as the (un-normalized) state correlation matrix, i.e.

\[
\Sigma_K = \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} x_t x_t^\top.
\]

Lemma 1.
(Policy Gradient Expression) The policy gradient is:

\[
\nabla C(K) = 2\left((R + B^\top P_K B)K - B^\top P_K A\right)\Sigma_K.
\]

For later simplicity, define $E_K$ to be

\[
E_K = (R + B^\top P_K B)K - B^\top P_K A,
\]

so that the gradient can be written as $\nabla C(K) = 2 E_K \Sigma_K$.

Proof. Observe:

\[
C_K(x_0) = x_0^\top P_K x_0 = x_0^\top \left(Q + K^\top R K\right) x_0 + x_0^\top (A - BK)^\top P_K (A - BK) x_0 = x_0^\top \left(Q + K^\top R K\right) x_0 + C_K\big((A - BK)x_0\big).
\]

Let $\nabla$ denote the gradient with respect to $K$, and note that $\nabla C_K((A - BK)x_0)$ has two terms (one with respect to the $K$ in the subscript and one with respect to the input $(A - BK)x_0$). This implies

\[
\nabla C_K(x_0) = 2RKx_0 x_0^\top - 2B^\top P_K (A - BK) x_0 x_0^\top + \nabla C_K(x)\big|_{x = (A - BK)x_0} = 2\left((R + B^\top P_K B)K - B^\top P_K A\right) \sum_{t=0}^{\infty} x_t x_t^\top,
\]

using the recursion with $x_1 = (A - BK)x_0$. Taking expectations completes the proof.

2.2 Review: (Model free) sample based policy gradient methods

Sample based policy gradient methods introduce some randomization for estimating the gradient.
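Before turning to randomized estimates, note that $P_K$ and $\Sigma_K$ solve discrete Lyapunov equations, so for a known model the exact cost and the gradient of Lemma 1 are directly computable, and the lemma can be cross-checked with finite differences. A minimal sketch, assuming hypothetical system matrices, an arbitrary stabilizing $K$, and $\mathbb{E}[x_0 x_0^\top] = I$:

```python
import numpy as np

# Hypothetical problem data (illustrative); Sigma0 = E[x0 x0^T] is taken to be I.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); Sigma0 = np.eye(2)

def dlyap(M, W, iters=500):
    # Fixed-point iteration for X = W + M X M^T (converges when rho(M) < 1).
    X = W.copy()
    for _ in range(iters):
        X = W + M @ X @ M.T
    return X

def cost_and_grad(K):
    M = A - B @ K
    P = dlyap(M.T, Q + K.T @ R @ K)   # P_K = Q + K^T R K + (A-BK)^T P_K (A-BK)
    Sigma = dlyap(M, Sigma0)          # Sigma_K = Sigma0 + (A-BK) Sigma_K (A-BK)^T
    C = np.trace(P @ Sigma0)          # C(K) = E[x0^T P_K x0]
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return C, 2 * E @ Sigma           # Lemma 1: grad C(K) = 2 E_K Sigma_K

K = np.array([[0.1, 0.3]])            # an arbitrary stabilizing policy
C, G = cost_and_grad(K)

# Central finite differences as an independent check of Lemma 1.
eps = 1e-6
G_fd = np.zeros_like(K)
for i in range(K.shape[0]):
    for j in range(K.shape[1]):
        Kp, Km = K.copy(), K.copy()
        Kp[i, j] += eps
        Km[i, j] -= eps
        G_fd[i, j] = (cost_and_grad(Kp)[0] - cost_and_grad(Km)[0]) / (2 * eps)
```

The two gradients agree to finite-difference precision, which is the content of Lemma 1 on this instance.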
REINFORCE. [Williams, 1992, Sutton et al., 2000] Let $\pi_\theta(u|x)$ be a parametric stochastic policy, where $u \sim \pi_\theta(\cdot|x)$. The policy gradient of the cost, $C(\theta)$, is:

\[
\nabla C(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} Q^{\pi_\theta}(x_t, u_t)\, \nabla \log \pi_\theta(u_t|x_t)\right], \quad \text{where } Q^{\pi_\theta}(x, u) = \mathbb{E}\left[\sum_{t=0}^{\infty} c_t \,\Big|\, x_0 = x,\ u_0 = u\right],
\]

where the expectation is with respect to the trajectory $\{x_t, u_t\}$ induced under the policy $\pi_\theta$, and where $Q^{\pi_\theta}(x,u)$ is referred to as the state-action value. The REINFORCE algorithm uses Monte Carlo estimates of the gradient obtained by simulating $\pi_\theta$.

The natural policy gradient.
The natural policy gradient [Kakade, 2001] follows the update:

\[
\theta \leftarrow \theta - \eta\, G_\theta^{-1} \nabla C(\theta), \quad \text{where } G_\theta = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla \log \pi_\theta(u_t|x_t)\, \nabla \log \pi_\theta(u_t|x_t)^\top\right]
\]

is the Fisher information matrix. There are numerous successful related approaches [Peters and Schaal, 2007, Schulman et al., 2015, Duan et al., 2016]. An important special case is using a linear policy with additive Gaussian noise [Rajeswaran et al., 2017b], i.e.

\[
\pi_K(x, u) = \mathcal{N}(Kx, \sigma^2 I), \tag{3}
\]

where $K \in \mathbb{R}^{k \times d}$ and $\sigma^2$ is the noise variance. Here, the natural policy gradient of $K$ (when $\sigma^2$ is considered fixed) takes the form:

\[
K \leftarrow K - \eta \nabla C(K) \Sigma_K^{-1}. \tag{4}
\]

To see this, one can verify that the Fisher matrix of size $kd \times kd$, which is indexed as $[G_K]_{(i,j),(i',j')}$ where $i, i' \in \{1, \ldots, k\}$ and $j, j' \in \{1, \ldots, d\}$, has a block diagonal form where the only non-zero blocks are $[G_K]_{(i,\cdot),(i,\cdot)} = \Sigma_K$ (this is the block corresponding to the $i$-th coordinate of the action, as $i$ ranges from $1$ to $k$). This form holds more generally, for any diagonal noise.

Zeroth order optimization.
Zeroth order optimization is a generic procedure [Conn et al., 2009, Nesterov and Spokoiny, 2015] for optimizing a function $f(x)$, using only query access to the function values of $f(\cdot)$ at input points $x$ (and without explicit query access to the gradients of $f$). This is also the approach in using "evolutionary strategies" for reinforcement learning [Salimans et al., 2017]. The generic approach can be described as follows: define the perturbed function as

\[
f_{\sigma^2}(x) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[f(x + \varepsilon)\right].
\]

For small $\sigma^2$, the smoothed function is a good approximation to the original function. Due to the Gaussian smoothing, the gradient has a particularly simple functional form (see Conn et al. [2009], Nesterov and Spokoiny [2015]):

\[
\nabla f_{\sigma^2}(x) = \frac{1}{\sigma^2}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\left[f(x + \varepsilon)\, \varepsilon\right].
\]

This expression implies a straightforward method to obtain an unbiased estimate of $\nabla f_{\sigma^2}(x)$, through obtaining only the function values $f(x + \varepsilon)$ for random $\varepsilon$.

3 The (non-convex) Optimization Landscape
This section provides a brief characterization of the optimization landscape, in order to help provide intuition as to why global convergence is possible and where the analysis difficulties lie.
Lemma 2. (Non-convexity) If $d \ge 3$, there exists an LQR optimization problem, $\min_K C(K)$, which is not convex, quasi-convex, or star-convex.

The specific example is given in the supplementary material (Section B). In particular, there can be two matrices $K$ and $K'$ where both $C(K)$ and $C(K')$ are finite, but $C((K + K')/2)$ is infinite.

For a general non-convex optimization problem, gradient descent may not even converge to the global optimum in the limit. The optimization problem of LQR satisfies a special gradient domination condition, which makes it much easier to optimize:

Lemma 3. (Gradient domination) Let $K^*$ be an optimal policy. Suppose $K$ has finite cost and $\sigma_{\min}(\Sigma_K) > 0$. It holds that

\[
C(K) - C(K^*) \le \frac{\|\Sigma_{K^*}\|}{\sigma_{\min}(\Sigma_K)^2\, \sigma_{\min}(R)}\, \|\nabla C(K)\|_F^2.
\]

This lemma can be proved by analyzing the "advantage" of the optimal policy $K^*$ over $K$ at every step. The detailed lemma and the full proof are deferred to the supplementary material.

As a corollary, this lemma provides a characterization of the stationary points.

Corollary 4. (Stationary point characterization) If $\nabla C(K) = 0$, then either $K$ is an optimal policy or $\Sigma_K$ is rank deficient.

Note that the covariance $\Sigma_K \succeq \Sigma_0 := \mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0 x_0^\top$. Therefore, this lemma is the motivation for using a distribution over $x_0$ (as opposed to a deterministic starting point): $\mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0 x_0^\top$ being full rank guarantees that $\Sigma_K$ is full rank, which implies all stationary points are global optima. An additive disturbance in the dynamics model also suffices.

The concept of gradient domination is important in the non-convex optimization literature [Polyak, 1963, Nesterov and Polyak, 2006, Karimi et al., 2016]. A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be gradient dominated if there exists some constant $\lambda$ such that for all $x$,

\[
f(x) - \min_{x'} f(x') \le \lambda\, \|\nabla f(x)\|^2.
\]
If a function is gradient dominated, then whenever the magnitude of the gradient is small at some $x$, the function value at $x$ is close to the optimal function value. Using the fact that $\Sigma_K \succeq \Sigma_0$, the following corollary of Lemma 3 shows that $C(K)$ is gradient dominated.

Corollary 5. (Gradient Domination) Suppose $\mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0 x_0^\top$ is full rank. Then $C(K)$ is gradient dominated, i.e.

\[
C(K) - C(K^*) \le \lambda\, \langle \nabla C(K), \nabla C(K) \rangle,
\]

where $\lambda = \frac{\|\Sigma_{K^*}\|}{\sigma_{\min}(\Sigma_0)^2\, \sigma_{\min}(R)}$ is a problem dependent constant (and $\langle \cdot, \cdot \rangle$ denotes the trace inner product).

Naively, one may hope that gradient domination immediately implies that gradient descent converges quickly to the global optimum. This would indeed be the case if $C(K)$ were a smooth function (a differentiable function $f(x)$ is said to be smooth if the gradients of $f$ are continuous; equivalently, see the definition in Equation 13): if $C(K)$ is both gradient dominated and smooth, then classical mathematical optimization results [Polyak, 1963] would not only immediately imply global convergence, they would also imply convergence at a linear rate. These results are not immediately applicable because it is not straightforward to characterize the (local) smoothness properties of $C(K)$; this is a difficulty well studied in the optimal control theory literature, related to robustness and stability.

Similarly, one may hope that recent results on escaping saddle points [Nesterov and Polyak, 2006, Ge et al., 2015, Jin et al., 2017] immediately imply that gradient descent converges quickly to the global optimum, given that there are no (spurious) local optima. Again, for reasons related to smoothness, this is not the case.

The main reason that the LQR objective cannot satisfy the smoothness condition globally is that the objective becomes infinite when the matrix $A - BK$ becomes unstable (i.e.
has an eigenvalue outside of the unit circle in the complex plane). At the boundary between stable and unstable policies, the objective function quickly becomes infinite, which violates the traditional smoothness conditions, since smoothness would imply quadratic upper bounds on the objective function.

To address this, it is observed that when the policy $K$ is not too close to the boundary, the objective satisfies an almost-smoothness condition:

Lemma 6. ("Almost" smoothness) $C(K)$ satisfies:

\[
C(K') - C(K) = -2\,\mathrm{Tr}\!\left(\Sigma_{K'} (K - K')^\top E_K\right) + \mathrm{Tr}\!\left(\Sigma_{K'} (K - K')^\top (R + B^\top P_K B)(K - K')\right).
\]

To see why this is related to smoothness (e.g. compare to Equation 13), suppose $K'$ is sufficiently close to $K$ so that

\[
\Sigma_{K'} \approx \Sigma_K + O(\|K - K'\|).
\]

The leading order term $-2\,\mathrm{Tr}(\Sigma_K (K - K')^\top E_K)$ would then behave as $\mathrm{Tr}((K' - K)^\top \nabla C(K))$, and the remaining terms would be second order in $K - K'$. Quantifying the Taylor approximation $\Sigma_{K'} \approx \Sigma_K + O(\|K - K'\|)$ is one of the key steps in proving the convergence of policy gradient methods.

First, results on exact gradient methods are provided. From an analysis perspective, this is the natural starting point; once global convergence is established for exact methods, the question of using simulation-based, model-free methods can be approached with zeroth-order optimization methods (where gradients are not available, and can only be approximated using samples of the function value).
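The expression in Lemma 6 is an exact identity, not a Taylor expansion, so it can be verified directly on a small instance. A sketch under the same hypothetical instance as the earlier sketches, with two stabilizing policies $K$ and $K'$ chosen arbitrarily:

```python
import numpy as np

# Hypothetical instance (illustrative); Sigma0 = E[x0 x0^T] = I.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); Sigma0 = np.eye(2)

def dlyap(M, W, iters=500):
    # Fixed-point iteration for X = W + M X M^T (valid while rho(M) < 1).
    X = W.copy()
    for _ in range(iters):
        X = W + M @ X @ M.T
    return X

def quantities(K):
    M = A - B @ K
    P = dlyap(M.T, Q + K.T @ R @ K)
    Sigma = dlyap(M, Sigma0)
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A   # E_K from Lemma 1
    return np.trace(P @ Sigma0), P, Sigma, E

K  = np.array([[0.1, 0.3]])     # both policies are stabilizing for this instance
K2 = np.array([[0.2, 0.25]])    # plays the role of K'

C1, P1, _, E1 = quantities(K)
C2, _, S2, _ = quantities(K2)   # only C(K') and Sigma_{K'} are needed for K'

# Lemma 6: C(K') - C(K) = -2 Tr(Sigma_{K'} (K-K')^T E_K)
#                         + Tr(Sigma_{K'} (K-K')^T (R + B^T P_K B)(K-K'))
lhs = C2 - C1
rhs = (-2 * np.trace(S2 @ (K - K2).T @ E1)
       + np.trace(S2 @ (K - K2).T @ (R + B.T @ P1 @ B) @ (K - K2)))
```

The two sides agree to numerical precision, which is exactly the "almost smoothness" identity: the would-be quadratic model is exact once $\Sigma_{K'}$ (rather than $\Sigma_K$) weights the terms.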
Notation. $\|Z\|$ denotes the spectral norm of a matrix $Z$; $\mathrm{Tr}(Z)$ denotes the trace of a square matrix; $\sigma_{\min}(Z)$ denotes the minimal singular value of a square matrix $Z$. Also, it is helpful to define

\[
\mu := \sigma_{\min}\!\left(\mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0 x_0^\top\right).
\]

We consider three exact update rules. For gradient descent, the update is

\[
K_{n+1} = K_n - \eta \nabla C(K_n). \tag{5}
\]

For natural policy gradient descent, the direction is defined so that it is consistent with the stochastic case, as per Equation 4; in the exact case the update is:

\[
K_{n+1} = K_n - \eta \nabla C(K_n) \Sigma_{K_n}^{-1}. \tag{6}
\]

For the Gauss-Newton method, the update is:

\[
K_{n+1} = K_n - \eta (R + B^\top P_{K_n} B)^{-1} E_{K_n}, \tag{7}
\]

with $E_{K_n} = \tfrac{1}{2} \nabla C(K_n) \Sigma_{K_n}^{-1}$ as in Lemma 1. The standard policy iteration algorithm [Howard, 1964], which optimizes a one-step deviation from the current policy, is equivalent to a special case of the Gauss-Newton method when $\eta = 1$ (for the case of policy iteration, convergence in the limit is provided in Todorov and Li [2004], Ng et al. [2002], Liao and Shoemaker [1991], along with local convergence rates).

The Gauss-Newton method requires the most complex oracle to implement: it requires access to $\nabla C(K)$, $\Sigma_K$, and $R + B^\top P_K B$; it also enjoys the strongest convergence rate guarantee. At the other extreme, gradient descent requires oracle access to only $\nabla C(K)$ and has the slowest convergence rate. The natural policy gradient sits in between, requiring oracle access to $\nabla C(K)$ and $\Sigma_K$, and having a convergence rate between the other two methods.

Theorem 7. (Global Convergence of Gradient Methods) Suppose $C(K_0)$ is finite and $\mu > 0$.
• Gauss-Newton case: For a stepsize $\eta = 1$ and for

\[
N \ge \frac{\|\Sigma_{K^*}\|}{\mu} \log \frac{C(K_0) - C(K^*)}{\varepsilon},
\]

the Gauss-Newton algorithm (Equation 7) enjoys the following performance bound: $C(K_N) - C(K^*) \le \varepsilon$.

• Natural policy gradient case: For a stepsize

\[
\eta = \frac{1}{\|R\| + \frac{\|B\|^2 C(K_0)}{\mu}}
\]

and for

\[
N \ge \frac{\|\Sigma_{K^*}\|}{\mu} \left( \frac{\|R\|}{\sigma_{\min}(R)} + \frac{\|B\|^2 C(K_0)}{\mu\, \sigma_{\min}(R)} \right) \log \frac{C(K_0) - C(K^*)}{\varepsilon},
\]

natural policy gradient descent (Equation 6) enjoys the following performance bound: $C(K_N) - C(K^*) \le \varepsilon$.

• Gradient descent case: For an appropriate (constant) setting of the stepsize $\eta$,

\[
\eta = \mathrm{poly}\!\left( \frac{\mu\, \sigma_{\min}(Q)}{C(K_0)},\ \frac{1}{\|A\|},\ \frac{1}{\|B\|},\ \frac{1}{\|R\|},\ \sigma_{\min}(R) \right),
\]

and for

\[
N \ge \frac{\|\Sigma_{K^*}\|}{\mu} \log \frac{C(K_0) - C(K^*)}{\varepsilon}\ \mathrm{poly}\!\left( \frac{C(K_0)}{\mu\, \sigma_{\min}(Q)},\ \|A\|,\ \|B\|,\ \|R\|,\ \frac{1}{\sigma_{\min}(R)} \right),
\]

gradient descent (Equation 5) enjoys the following performance bound: $C(K_N) - C(K^*) \le \varepsilon$.
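The three exact updates of Equations 5-7 are straightforward to run once $P_K$ and $\Sigma_K$ are computed from the model. A sketch on a hypothetical instance (with $\Sigma_0 = I$, so $\mu = 1$); the stepsizes used for gradient descent and natural policy gradient are small ad hoc choices, not the tuned values of Theorem 7:

```python
import numpy as np

# Hypothetical instance (illustrative values); Sigma0 = I.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); Sigma0 = np.eye(2)

def dlyap(M, W, iters=500):
    X = W.copy()
    for _ in range(iters):
        X = W + M @ X @ M.T
    return X

def quantities(K):
    M = A - B @ K
    P = dlyap(M.T, Q + K.T @ R @ K)
    Sigma = dlyap(M, Sigma0)
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return np.trace(P @ Sigma0), E, Sigma, P   # C(K), E_K, Sigma_K, P_K

K0 = np.array([[0.1, 0.3]])
C0 = quantities(K0)[0]

K_gd, K_npg, K_gn = K0, K0, K0
for _ in range(50):
    _, E, Sigma, P = quantities(K_gd)
    K_gd = K_gd - 1e-3 * (2 * E @ Sigma)               # Eq. (5): gradient descent
    _, E, Sigma, P = quantities(K_npg)
    K_npg = K_npg - 1e-2 * (2 * E)                     # Eq. (6): grad C(K) Sigma_K^{-1} = 2 E_K
    _, E, Sigma, P = quantities(K_gn)
    K_gn = K_gn - np.linalg.solve(R + B.T @ P @ B, E)  # Eq. (7) with eta = 1 (policy iteration)

C_gd, C_npg = quantities(K_gd)[0], quantities(K_npg)[0]
C_gn, E_gn = quantities(K_gn)[0], quantities(K_gn)[1]
```

On this instance all three costs decrease, and the Gauss-Newton iterates drive $E_K$ to (numerically) zero, the stationarity condition for the optimal $K^*$, far faster than the other two updates, mirroring the ordering of the rates in Theorem 7.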
In comparison to model-based approaches, these results require the (possibly) stronger assumption that the initial policy is a stable controller, i.e. $C(K_0)$ is finite (an assumption which may be inherent to local search procedures). The Discussion mentions this as a direction for future work.

The proof for the Gauss-Newton algorithm is simple, based on the characterizations in Lemma 3 and Lemma 6, and is given below. The proofs for natural policy gradient descent and gradient descent are more involved, and are deferred to the supplementary material.

Lemma 8.
Suppose that

\[
K' = K - \eta (R + B^\top P_K B)^{-1} E_K.
\]

If $\eta \le 1$, then

\[
C(K') - C(K^*) \le \left(1 - \frac{\eta\mu}{\|\Sigma_{K^*}\|}\right)\left(C(K) - C(K^*)\right).
\]

Proof.
Observe that $K' = K - \eta (R + B^\top P_K B)^{-1} E_K$, so $K - K' = \eta (R + B^\top P_K B)^{-1} E_K$. Using Lemma 6 and the condition on $\eta$,

\begin{align*}
C(K') - C(K) &= -2\eta\, \mathrm{Tr}\!\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right) + \eta^2\, \mathrm{Tr}\!\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \\
&\le -\eta\, \mathrm{Tr}\!\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \\
&\le -\eta\, \sigma_{\min}(\Sigma_{K'})\, \mathrm{Tr}\!\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \\
&\le -\eta\mu\, \mathrm{Tr}\!\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \\
&\le -\frac{\eta\mu}{\|\Sigma_{K^*}\|}\left(C(K) - C(K^*)\right),
\end{align*}

where the last step uses Lemma 3.

With this lemma, the proof of the convergence rate of the Gauss-Newton algorithm is immediate.

Proof. (of Theorem 7, Gauss-Newton case) The stepsize $\eta = 1$ leads to a contraction of $1 - \frac{\mu}{\|\Sigma_{K^*}\|}$ at every step.

In the model free setting, the controller has only simulation access to the model; the model parameters, $A$, $B$, $Q$ and $R$, are unknown. The standard optimal control theory approach is to use system identification to learn the model, and then plan with this learned model. This section proves that model-free, policy gradient methods also lead to globally optimal policies, with both polynomial computational and sample complexities (in the relevant quantities).

Using a zeroth-order optimization approach (see Section 2.2), Algorithm 1 provides a procedure to find (bounded bias) estimates, $\widehat{\nabla C(K)}$ and $\widehat{\Sigma}_K$, of both $\nabla C(K)$ and $\Sigma_K$. These can then be used in the policy gradient and natural policy gradient updates. For policy gradient, we have

\[
K_{n+1} = K_n - \eta\, \widehat{\nabla C(K_n)}. \tag{8}
\]

For natural policy gradient, we have:

\[
K_{n+1} = K_n - \eta\, \widehat{\nabla C(K_n)}\, \widehat{\Sigma}_{K_n}^{-1}. \tag{9}
\]
Algorithm 1: Model-Free Policy Gradient (and Natural Policy Gradient) Estimation

Input: $K$, number of trajectories $m$, roll out length $\ell$, smoothing parameter $r$, dimension $d$.

for $i = 1, \ldots, m$ do
  Sample a policy $\widehat{K}_i = K + U_i$, where $U_i$ is drawn uniformly at random over matrices whose (Frobenius) norm is $r$.
  Simulate $\widehat{K}_i$ for $\ell$ steps starting from $x_0 \sim \mathcal{D}$. Let $\widehat{C}_i$ and $\widehat{\Sigma}_i$ be the empirical estimates:
  \[
  \widehat{C}_i = \sum_{t=1}^{\ell} c_t, \qquad \widehat{\Sigma}_i = \sum_{t=1}^{\ell} x_t x_t^\top,
  \]
  where $c_t$ and $x_t$ are the costs and states on this trajectory.
end for

Return the (biased) estimates:
\[
\widehat{\nabla C(K)} = \frac{1}{m} \sum_{i=1}^{m} \frac{d}{r^2}\, \widehat{C}_i\, U_i, \qquad \widehat{\Sigma}_K = \frac{1}{m} \sum_{i=1}^{m} \widehat{\Sigma}_i.
\]

In both Equations (8) and (9), Algorithm 1 is called at every iteration to provide the estimates of $\nabla C(K_n)$ and $\Sigma_{K_n}$.

The choice of using zeroth order optimization vs using REINFORCE (with Gaussian additive noise, as in Equation 3) is primarily for technical reasons. It is plausible that the REINFORCE estimation procedure has lower variance. One additional minor difference, again for technical reasons, is that Algorithm 1 uses a perturbation from the surface of a sphere (as opposed to a Gaussian perturbation).

Theorem 9. (Global Convergence in the Model Free Setting) Suppose $C(K_0)$ is finite, $\mu > 0$, and that $x_0 \sim \mathcal{D}$ has norm bounded by $L$ almost surely. Also, for both the policy gradient method and the natural policy gradient method, suppose Algorithm 1 is called with parameters:

\[
m, \ell, 1/r = \mathrm{poly}\!\left( C(K_0),\ \frac{1}{\mu},\ \frac{1}{\sigma_{\min}(Q)},\ \|A\|,\ \|B\|,\ \|R\|,\ \frac{1}{\sigma_{\min}(R)},\ d,\ \frac{1}{\epsilon},\ \frac{L^2}{\mu} \right).
\]
• Natural policy gradient case: For a stepsize

\[
\eta = \frac{1}{\|R\| + \frac{\|B\|^2 C(K_0)}{\mu}}
\]

and for

\[
N \ge \frac{\|\Sigma_{K^*}\|}{\mu} \left( \frac{\|R\|}{\sigma_{\min}(R)} + \frac{\|B\|^2 C(K_0)}{\mu\, \sigma_{\min}(R)} \right) \log \frac{2\left(C(K_0) - C(K^*)\right)}{\varepsilon},
\]

then, with high probability, i.e. with probability greater than $1 - \exp(-d)$, the natural policy gradient descent update (Equation 9) enjoys the following performance bound: $C(K_N) - C(K^*) \le \varepsilon$. (The correlations in the state-action value estimates in REINFORCE are more challenging to analyze.)

• Gradient descent case: For an appropriate (constant) setting of the stepsize $\eta$,

\[
\eta = \mathrm{poly}\!\left( \frac{\mu\, \sigma_{\min}(Q)}{C(K_0)},\ \frac{1}{\|A\|},\ \frac{1}{\|B\|},\ \frac{1}{\|R\|},\ \sigma_{\min}(R) \right),
\]

and for

\[
N \ge \frac{\|\Sigma_{K^*}\|}{\mu} \log \frac{C(K_0) - C(K^*)}{\varepsilon} \times \mathrm{poly}\!\left( \frac{C(K_0)}{\mu\, \sigma_{\min}(Q)},\ \|A\|,\ \|B\|,\ \|R\|,\ \frac{1}{\sigma_{\min}(R)} \right),
\]

then, with high probability, gradient descent (Equation 8) enjoys the following performance bound: $C(K_N) - C(K^*) \le \varepsilon$.

This theorem gives the first polynomial time guarantee for policy gradient and natural policy gradient algorithms in the LQR problem.
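The gradient estimator of Algorithm 1 can be sketched and compared against the exact gradient of Lemma 1. The instance, rollout count $m$, horizon $\ell$, and smoothing radius $r$ below are illustrative choices; the estimate is biased (finite $r$, finite $\ell$) and noisy, but with a fixed seed it recovers the exact gradient to moderate accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance, as in the earlier sketches; x0 is drawn uniformly on the
# sphere of radius sqrt(d), so that E[x0 x0^T] = I.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1)
K = np.array([[0.1, 0.3]])
d = 2                                   # state dimension

def rollout_cost(Khat, ell):
    # One trajectory of length ell under u_t = -Khat x_t.
    x = rng.standard_normal(d)
    x = np.sqrt(d) * x / np.linalg.norm(x)
    total = 0.0
    for _ in range(ell):
        u = -Khat @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return total

def grad_estimate(K, m=10000, ell=30, r=0.1):
    # Zeroth order estimate in the style of Algorithm 1: perturb K uniformly on
    # the Frobenius sphere of radius r, average (D / r^2) * cost * U_i (D = K.size).
    D = K.size
    G_hat = np.zeros_like(K)
    for _ in range(m):
        U = rng.standard_normal(K.shape)
        U = r * U / np.linalg.norm(U)
        G_hat += (D / r**2) * rollout_cost(K + U, ell) * U
    return G_hat / m

# Exact gradient from the model (Lemma 1), for comparison only.
def dlyap(M, W, iters=500):
    X = W.copy()
    for _ in range(iters):
        X = W + M @ X @ M.T
    return X

M = A - B @ K
P = dlyap(M.T, Q + K.T @ R @ K)
Sigma = dlyap(M, np.eye(d))
G_exact = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma

G_hat = grad_estimate(K)
```

Note that the estimator never touches $A$, $B$, $Q$, $R$ directly, only simulated costs, which is the sense in which the updates (8) and (9) are model free.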
Proof Sketch
The model free results (Theorem 9) are proved in the following three steps:

1. Prove that when the roll out length $\ell$ is large enough, the cost function $C$ and the covariance $\Sigma$ are approximately equal to the corresponding infinite horizon quantities.

2. Show that with enough samples, Algorithm 1 can estimate both the gradient and the covariance matrix within the desired accuracy.

3. Prove that both gradient descent and natural gradient descent converge at a similar rate, even if the gradient/natural gradient estimates have some bounded perturbations.

The proofs are technical and are deferred to the supplementary material. We have focused on proving polynomial relationships in our complexity bounds, and did not optimize for the best dependence on the relevant parameters.

5 Discussion

This work has provided provable guarantees that model-based gradient methods and model-free (sample based) policy gradient methods converge to the globally optimal solution, with finite polynomial computational and sample complexities. Taken together, the results herein place these popular and practical policy gradient approaches on a firm theoretical footing, making them comparable to other principled approaches (e.g., subspace system identification methods and algebraic iterative approaches).
Finite C(K₀) assumption, noisy case, and finite horizon case. These methods allow for extensions to the noisy case and the finite horizon case. This work also made the assumption that C(K₀) is finite, which may not be easy to achieve in some infinite horizon problems. The simplest way to address this is to model the infinite horizon problem with a finite horizon one; the techniques developed in Section D.1 show this is possible. This is an important direction for future work.

Open Problems.

• Variance reduction: This work only proved efficiency from a polynomial sample size perspective. An interesting future direction would be how to rigorously combine variance reduction methods and model-based methods to further decrease the sample size.
• A sample based Gauss-Newton approach: This work showed how the Gauss-Newton algorithm improves over even the natural policy gradient method, in the exact case. A practically relevant question for the Gauss-Newton method would be how to both: a) construct a sample based estimator, and b) extend this scheme to deal with (non-linear) parametric policies.

• Robust control: In model based approaches, optimal control theory provides efficient procedures to deal with (bounded) model mis-specification. An important question is how to provably understand robustness in a model free setting.
Acknowledgments
Support from DARPA Lagrange Grant FA8650-18-2-7836 (to M. F., M. M., and S. K.) and from ONR award N00014-12-1-1002 (to M. F. and M. M.) is gratefully acknowledged. S. K. gratefully acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery and the ONR award N00014-18-1-2247. S. K. thanks Emo Todorov, Aravind Rajeswaran, Kendall Lowrey, Sanjeev Arora, and Elad Hazan for helpful discussions. S. K. and M. F. also thank Ben Recht for helpful discussions. R. G. acknowledges funding from NSF CCF-1704656. We thank Jingjing Bu from the University of Washington for running the numerical simulations in Section E in the supplementary material. We also thank Bin Hu for reading the paper carefully and pointing out a missing step in the proof, which is now addressed in Section C.
References
Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems.
Conference on Learning Theory, 2011. ISSN 15337928. M. Al Borno, M. de Lasa, and A. Hertzmann. Trajectory Optimization for Full-Body Movements with Complex Contacts.
IEEE Transactions on Visualization and Computer Graphics , 2013.Brian D. O. Anderson and John B. Moore.
Optimal Control: Linear Quadratic Methods . Prentice-Hall,Inc., Upper Saddle River, NJ, USA, 1990. ISBN 0-13-638560-5.Sanjeev Arora, Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Towards provable controlfor unknown linear dynamical systems. 2018.J. Andrew Bagnell and Jeff Schneider. Covariant policy search. In
Proceedings of the 18th InternationalJoint Conference on Artificial Intelligence , IJCAI’03, pages 1019–1024, San Francisco, CA, USA, 2003.Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1630659.1630805 .V. Balakrishnan and L. Vandenberghe. Semidefinite programming duality and linear time-invariant systems.
IEEE Transactions on Automatic Control , 48(1):30–41, 2003.Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods.
Journal of ControlTheory and Applications , 9(3):310–335, 2011. ISSN 16726340. doi: 10.1007/s11768-011-1005-3.Dimitri P. Bertsekas.
Dynamic Programming and Optimal Control. Athena Scientific, 2017. S.J. Bradtke, B.E. Ydstie, and A.G. Barto. Adaptive linear quadratic control using policy iteration.
Proceed-ings of American Control Conference , 3(2):3475–3479, 1994. doi: 10.1109/ACC.1994.735224.E.F. Camacho and C. Bordons.
Model Predictive Control. Advanced Textbooks in Control and Signal Processing. Springer London, 2004. ISBN 9781852336943. A.R. Conn, K. Scheinberg, and L.N. Vicente.
Introduction to derivative-free optimization , volume 8 of
MPS/SIAM Series on Optimization . Society for Industrial and Applied Mathematics (SIAM), Philadel-phia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2009.S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadraticregulator.
ArXiv e-prints , 2017.Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics.
Found. Trends Robot., 2(1–2):1–142, 2013. URL http://dx.doi.org/10.1561/2300000021. Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In
ICML , 2016.Lawrence C. Evans. An introduction to mathematical optimal control theory.
University of California, Department of Mathematics, page 126, 2005. ISSN 14712334. doi: 10.1186/1471-2334-10-32. Claude-Nicolas Fiechter. PAC adaptive control of linear systems. In
Proceedings of the seventh annual conference on Computational learning theory (COLT '94), pages 88–97, 1994. Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In
Proceedings of the sixteenth annual ACM-SIAMsymposium on Discrete algorithms , pages 385–394. Society for Industrial and Applied Mathematics,2005.Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradientfor tensor decomposition.
Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris,France, July 3-6, 2015 , 2015.Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In
Advances in Neural Information Processing Systems , pages 6705–6715, 2017.Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general lineardynamical systems. arXiv preprint arXiv:1802.03981 , 2018.G. A. Hewer. An iterative technique for the computation of steady state gains for the discrete optimalregulator.
IEEE Trans. Automat. Contr. , pages 382–384, 1971.Ronald A Howard.
Dynamic programming and Markov processes . Wiley for The Massachusetts Instituteof Technology, 1964.Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddlepoints efficiently. In
Proceedings of the 34th International Conference on Machine Learning, ICML2017, Sydney, NSW, Australia, 6-11 August 2017 , pages 1724–1732, 2017.S. Kakade. A natural policy gradient. In
NIPS , 2001.S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In
ICML , 2002.S. M. Kakade.
On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition.
Machine Learning and Knowledge Discovery inDatabases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016,Proceedings, Part I , pages 795–811, 2016.D. L. Kleinman. On an iterative technique for Riccati equation computations.
IEEE Transactions on Auto-matic Control , 13(1):114–115, 1968. ISSN 0018-9286. doi: 10.1109/TAC.1968.1098829.V. Kumar, E. Todorov, and S. Levine. Optimal control with learned local models: Application to dexterousmanipulation. In
ICRA , 2016.P. Lancaster and L. Rodman.
Algebraic Riccati Equations . Oxford science publications. Clarendon Press,1995. ISBN 9780191591259.Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotorpolicies.
JMLR , 17(39):1–40, 2016.L. Z. Liao and C. A. Shoemaker. Convergence in unconstrained discrete-time differential dynamic program-ming.
IEEE Transactions on Automatic Control , 36, 1991.Lennart Ljung, editor.
System Identification (2Nd Ed.): Theory for the User . Prentice Hall PTR, UpperSaddle River, NJ, USA, 1999. ISBN 0-13-656695-2.Karl M˚artensson. Gradient methods for large-scale and distributed linear quadratic control.
Ph.D. Theses ,2012.Karl M˚artensson and Anders Rantzer. Gradient methods for iterative distributed control synthesis.
Confer-ence on Decision and Control , pages 1–6, 2009.Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, AlexGraves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, AmirSadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and DemisHassabis. Human-level control through deep reinforcement learning.
Nature , 518, 2015.Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its global performance.
Math. Program. , pages 177–205, 2006.Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.
Founda-tions of Computational Mathematics , pages 1–40, 2015. ISSN 1615-3383.Chi-Kong Ng, Li-Zhi Liao, and Duan Li. A globally convergent and efficient method for unconstraineddiscrete-time optimal control.
J. Global Optimization , 23:401–421, 2002.J. Peters and S. Schaal. Natural actor-critic.
Neurocomputing , 71:1180–1190, 2007.E. Polak. An Historical Survey of Computational Methods in Optimal Control.
SIAM Review, 15(2):553–584, 1973. ISSN 00361445. doi: 10.1137/1015071. B. T. Polyak. Gradient methods for minimizing functionals.
USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963. Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.
CoRR, abs/1709.10087, 2017a. URL http://arxiv.org/abs/1709.10087. Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control.
CoRR , abs/1703.02660, 2017b. URL http://arxiv.org/abs/1703.02660 .J.B. Rawlings and D.Q. Mayne.
Model Predictive Control: Theory and Design . Nob Hill Pub., 2009. ISBN9780975937709.Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative toreinforcement learning.
ArXiv e-prints , 2017.J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In
ICML ,2015.David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Do-minik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, KorayKavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networksand tree search.
Nature , 529, 2016.Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning withoutmixing: Towards a sharp analysis of linear system identification. In
COLT , 2018.Gilbert W Stewart and Ji-Guang Sun. Matrix perturbation theory (computer science and scientific comput-ing), 1990.Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methodsfor reinforcement learning with function approximation. In
Advances in neural information processingsystems , pages 1057–1063, 2000.Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectoryoptimization.
International Conference on Intelligent Robots and Systems , 2012.Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domainrandomization for transferring deep neural networks from simulation to the real world.
ArXiv e-prints ,2017.Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback controlof constrained nonlinear stochastic systems. In
American Control Conference , 2004.Joel A Tropp. User-friendly tail bounds for sums of random matrices.
Foundations of computational math-ematics , 12(4):389–434, 2012.Stephen Tu and Benjamin Recht. Least-squares temporal difference learning for the linear quadratic regu-lator. In
ICML , 2018.Eugene E Tyrtyshnikov.
A brief introduction to numerical analysis . Springer Science & Business Media,2012.Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learn-ing.
Machine Learning, 8(3):229–256, 1992.
A Planning with a model
This section briefly reviews some parameterizations and solution methods for the classic LQR and relatedproblems from control theory.
Finite horizon LQR.
First, consider the finite horizon case. The basic approach is to view it as a dynamic program with the value function x_tᵀ P_t x_t, where

P_{t−1} = Q + Aᵀ P_t A − Aᵀ P_t B (R + Bᵀ P_t B)⁻¹ Bᵀ P_t A,

which in turn gives the optimal control

u_t = −K_t x_t = −(R + Bᵀ P_{t+1} B)⁻¹ Bᵀ P_{t+1} A x_t

(recursions run backward in time). Another approach is to view the LQR problem as a linearly-constrained quadratic program (QP) in all the x_t and u_t (where the constraints are given by the dynamics, and the problem size equals the horizon). The QP is clearly a convex problem, but this observation is not useful by itself, as the problem size grows with the horizon and naive use of quadratic programming scales badly. However, the special structure due to the linearity of the dynamics allows for simplifications and a control-theoretic interpretation as follows: the Lagrange multipliers in the QP can be interpreted as "co-state" variables, and they follow a recursion that runs backwards in time known as the "adjoint system" dynamics. Using Lagrange duality, one can show that this approach is equivalent to solving the Riccati recursion mentioned above. Popular use of the LQR in control practice is often in the receding horizon LQR [Camacho and Bordons, 2004, Rawlings and Mayne, 2009]: at time t, an input sequence is found that minimizes the T-step ahead LQR cost starting at the current time, and then only the first input in the sequence is used. The resulting static feedback gain converges to the infinite horizon optimal solution as the horizon T becomes longer.

Infinite horizon LQR.
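The finite-horizon backward recursion and the receding-horizon claim above are easy to check numerically. A minimal sketch on an assumed toy double-integrator system (all numerical choices are ours, not the paper's): it runs the backward Riccati recursion with terminal condition P_T = Q and shows that the first gain K₀ stabilizes to a fixed (static) gain as the horizon T grows.

```python
import numpy as np

# Assumed toy system (illustrative only, not from the paper).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def first_gain(T):
    """Run the backward Riccati recursion over horizon T (with P_T = Q)
    and return K_0, the gain a receding-horizon controller would apply.
    (`first_gain` is a hypothetical helper name of ours.)"""
    P = Q.copy()                                             # P_T
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # K_t
        P = Q + A.T @ P @ A - A.T @ P @ B @ K                # P_{t-1}
    return K

# As the horizon grows, the receding-horizon gain converges to the
# static infinite-horizon optimal gain.
K5, K20, K50 = first_gain(5), first_gain(20), first_gain(50)
print(np.linalg.norm(K5 - K50), np.linalg.norm(K20 - K50))   # shrinking
```

The recursion in `first_gain` is exactly the one displayed above; only the toy matrices and horizons are assumptions.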
Here, the constrained optimization view (QP) is not informative as the problem is infinite dimensional; however, the dynamic programming viewpoint readily extends. Suppose the system
(A, B) is controllable (which guarantees the optimal cost is finite). It turns out that the value function and the optimal controller are static (i.e., do not depend on t) and can be found by solving the Algebraic Riccati Equation (ARE) given in (1). The optimal K can then be found from equation (2). The main computational step is solving the ARE, which is extensively studied (e.g. [Lancaster and Rodman, 1995]). One approach due to [Kleinman, 1968] (for continuous time) and [Hewer, 1971] (for discrete time) is to simply run the recursion

P_{k+1} = Q + Aᵀ P_k A − Aᵀ P_k B (R + Bᵀ P_k B)⁻¹ Bᵀ P_k A,

where P₀ = Q, which converges to the unique positive semidefinite solution of the ARE (since the fixed-point iteration is contractive). Other approaches are direct and based on linear algebra, which carry out an eigenvalue decomposition on a certain block matrix (called the Hamiltonian matrix) followed by a matrix inversion [Lancaster and Rodman, 1995]. Direct computation of the control input has also been considered in the optimal control literature, e.g., gradient updates in function spaces [Polak, 1973]. For the linear quadratic setup, direct iterative computation of the feedback gain has been examined in [M˚artensson and Rantzer, 2009], and explored further in [M˚artensson, 2012] with a view towards distributed implementations. These methods are presented as local search heuristics without provable guarantees of reaching the optimal policy.

SDP formulation.
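The fixed-point recursion of Kleinman and Hewer just described takes only a few lines. A minimal sketch on an assumed toy system (illustrative numbers only), checking that the limit satisfies the ARE and that the resulting gain is stabilizing:

```python
import numpy as np

# Assumed toy system (illustrative only).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Iterate P_{k+1} = Q + A^T P_k A - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A
# starting from P_0 = Q; the iteration converges to the unique PSD
# solution of the discrete-time ARE.
P = Q.copy()
for _ in range(1000):
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ G
    if np.max(np.abs(P_next - P)) < 1e-13:
        P = P_next
        break
    P = P_next

# ARE fixed-point residual and the optimal gain (equation (2)).
G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
residual = np.max(np.abs(Q + A.T @ P @ A - A.T @ P @ B @ G - P))
Kstar = G
rho = np.max(np.abs(np.linalg.eigvals(A - B @ Kstar)))
print(residual, rho)      # residual ~ 0, and rho < 1 (closed loop stable)
```

For larger problems the direct linear-algebraic solvers mentioned above are preferable; the iteration here is just the simplest correct method.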
The LQR problem can also be expressed as a semidefinite program (SDP) with variable P, as given in [Balakrishnan and Vandenberghe, 2003] (section 5, equation (34); this is for a continuous-time system, but there are similar discrete-time versions). This SDP can be derived by relaxing the equality in the Riccati equation to an inequality, then using the Schur complement lemma to rewrite the resulting Riccati inequality as a linear matrix inequality. The objective in the case of LQR is the trace of the positive definite matrix variable, and the optimization problem (for the continuous time system) is given as

maximize x₀ᵀ P x₀
subject to [ Aᵀ P + P A + Q , P B ; Bᵀ P , I ] ⪰ 0,  P ⪰ 0,   (10)

where the optimization variable is P. This SDP and its dual, and system-theoretic interpretations of its optimality conditions, have been explored in [Balakrishnan and Vandenberghe, 2003]. Note that while the optimal solution P* of this SDP is the unique positive semidefinite solution to the Riccati equation, which in turn gives the optimal policy K*, other feasible P (not equal to P*) do not necessarily correspond to a feasible, stabilizing policy K. This means that the feasible set of this SDP is not a convex characterization of all P that correspond to stabilizing K. Thus it also implies that if one uses any optimization algorithm that maintains iterates in the feasible set (e.g., interior point methods), no useful policy can be extracted from the iterates before convergence to P*. For this reason, this convex formulation is not helpful for parametrizing the space of policies K in a manner that supports the use of local search methods (those that directly lower the cost function of interest as a function of policy K), which is the focus of this work.

B Non-convexity of the set of stabilizing State Feedback Gains
In this section we prove Lemma 2. Let K(A, B) denote the set of state feedback gains K such that A − BK is stable, i.e., its eigenvalues are inside the unit circle in the complex plane. This set is generally nonconvex. A concise counterexample to convexity is provided here. Let A and B be 3 × 3 identity matrices and

K₁ = [ 1  0  −10 ; −1  1  0 ; 0  0  1 ]  and  K₂ = [ 1  −10  0 ; 0  1  0 ; −1  0  1 ].

Then the spectra of A − BK₁ and A − BK₂ are both concentrated at the origin, yet two of the eigenvalues of A − BK̂, with K̂ = (K₁ + K₂)/2, are outside of the unit circle in the complex plane (they equal ±√5).

C Analysis: the exact case
This section provides the analysis of the convergence rates of the (exact) gradient based methods. First, some helpful lemmas for the analysis are provided. Throughout, it is convenient to use the following definition:

E_K := (R + Bᵀ P_K B) K − Bᵀ P_K A.
The policy gradient can then be written as:

∇C(K) = 2 ((R + Bᵀ P_K B) K − Bᵀ P_K A) Σ_K = 2 E_K Σ_K.

C.1 Helper lemmas
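The formula ∇C(K) = 2E_KΣ_K above can be sanity-checked against central finite differences of C(K). A minimal sketch on an assumed toy instance (numbers illustrative), computing P_K and Σ_K by solving the corresponding Lyapunov equations:

```python
import numpy as np

# Assumed toy instance (illustrative only).
A = np.array([[1.0, 0.5], [0.0, 1.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); Sigma0 = np.eye(2)   # E[x0 x0^T]

def dlyap(M, W):
    # Solve X = W + M X M^T via the vec/Kronecker identity.
    n = M.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(M, M), W.reshape(-1))
    return x.reshape(n, n)

def cost(K):                                   # C(K) = Tr(P_K Sigma0)
    M = A - B @ K
    return np.trace(dlyap(M.T, Q + K.T @ R @ K) @ Sigma0)

def grad(K):                                   # grad C(K) = 2 E_K Sigma_K
    M = A - B @ K
    P = dlyap(M.T, Q + K.T @ R @ K)            # P_K
    S = dlyap(M, Sigma0)                       # Sigma_K
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A    # E_K
    return 2 * E @ S

K = np.array([[0.5, 1.0]])                     # a stabilizing gain
g = grad(K)

# Central finite differences, entry by entry.
h, g_fd = 1e-5, np.zeros_like(K)
for i in range(K.shape[0]):
    for j in range(K.shape[1]):
        Kp, Km = K.copy(), K.copy()
        Kp[i, j] += h; Km[i, j] -= h
        g_fd[i, j] = (cost(Kp) - cost(Km)) / (2 * h)

print(np.max(np.abs(g - g_fd)))                # agrees to many digits
```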
Define the value V_K(x), the state-action value Q_K(x, u), and the advantage A_K(x, u). V_K(x) is the cost of the policy starting with x₀ = x and proceeding with K onwards:

V_K(x) := Σ_{t=0}^∞ (x_tᵀ Q x_t + u_tᵀ R u_t) = xᵀ P_K x.

Q_K(x, u) is the cost of the policy starting with x₀ = x, taking action u₀ = u, and then proceeding with K onwards:

Q_K(x, u) := xᵀ Q x + uᵀ R u + V_K(Ax + Bu).

The advantage A_K(x, u) is:

A_K(x, u) = Q_K(x, u) − V_K(x).

The advantage can be viewed as the change in cost starting at state x and taking a one step deviation from the policy K. The next lemma is identical to that in [Kakade and Langford, 2002, Kakade, 2003] for Markov decision processes.

Lemma 10. (Cost difference lemma) Suppose K and K′ have finite costs. Let {x′_t} and {u′_t} be state and action sequences generated by K′, i.e. starting with x′₀ = x and using u′_t = −K′ x′_t. It holds that:

V_K′(x) − V_K(x) = Σ_t A_K(x′_t, u′_t).

Also, for any x, the advantage at the action u = −K′x is:

A_K(x, −K′x) = 2 xᵀ (K′ − K)ᵀ E_K x + xᵀ (K′ − K)ᵀ (R + Bᵀ P_K B)(K′ − K) x.   (11)

Proof.
Let c′_t be the cost sequence generated by K′. Telescoping the sum appropriately:

V_K′(x) − V_K(x) = Σ_{t=0}^∞ c′_t − V_K(x)
 = Σ_{t=0}^∞ (c′_t + V_K(x′_t) − V_K(x′_t)) − V_K(x)
 = Σ_{t=0}^∞ (c′_t + V_K(x′_{t+1}) − V_K(x′_t))
 = Σ_{t=0}^∞ A_K(x′_t, u′_t),

which completes the first claim (the third equality uses the fact that x₀ = x = x′₀). For the second claim, observe that:

V_K(x) = xᵀ (Q + Kᵀ R K) x + xᵀ (A − BK)ᵀ P_K (A − BK) x.

And, for u = −K′x,

A_K(x, u) = Q_K(x, u) − V_K(x)
 = xᵀ (Q + (K′)ᵀ R K′) x + xᵀ (A − BK′)ᵀ P_K (A − BK′) x − V_K(x)
 = xᵀ (Q + (K′ − K + K)ᵀ R (K′ − K + K)) x + xᵀ (A − BK − B(K′ − K))ᵀ P_K (A − BK − B(K′ − K)) x − V_K(x)
 = 2 xᵀ (K′ − K)ᵀ ((R + Bᵀ P_K B) K − Bᵀ P_K A) x + xᵀ (K′ − K)ᵀ (R + Bᵀ P_K B)(K′ − K) x,

which completes the proof.

This lemma is helpful in proving that C(K) is gradient dominated.

Lemma 11. (Gradient domination, Lemma 3 and Corollary 5 restated) Let K* be an optimal policy. Suppose K has finite cost and μ > 0. It holds that:

C(K) − C(K*) ≤ ‖Σ_K*‖ Tr(E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ (‖Σ_K*‖/σ_min(R)) Tr(E_Kᵀ E_K)
 ≤ (‖Σ_K*‖/(4 σ_min(Σ_K)² σ_min(R))) Tr(∇C(K)ᵀ ∇C(K))
 ≤ (‖Σ_K*‖/(4 μ² σ_min(R))) Tr(∇C(K)ᵀ ∇C(K)).

For a lower bound, it holds that:

C(K) − C(K*) ≥ (μ/‖R + Bᵀ P_K B‖) Tr(E_Kᵀ E_K).
From Equation 11 and by completing the square,

Q_K(x, −K′x) − V_K(x)
 = 2 Tr(xxᵀ (K′ − K)ᵀ E_K) + Tr(xxᵀ (K′ − K)ᵀ (R + Bᵀ P_K B)(K′ − K))
 = Tr(xxᵀ (K′ − K + (R + Bᵀ P_K B)⁻¹ E_K)ᵀ (R + Bᵀ P_K B)(K′ − K + (R + Bᵀ P_K B)⁻¹ E_K))
  − Tr(xxᵀ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≥ −Tr(xxᵀ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K),   (12)

with equality when K′ = K − (R + Bᵀ P_K B)⁻¹ E_K. Let x*_t and u*_t be the sequence generated under K*. Using this and Lemma 10,

C(K) − C(K*) = −E Σ_t A_K(x*_t, u*_t)
 ≤ E Σ_t Tr(x*_t (x*_t)ᵀ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 = Tr(Σ_K* E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ ‖Σ_K*‖ Tr(E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ ‖Σ_K*‖ ‖(R + Bᵀ P_K B)⁻¹‖ Tr(E_Kᵀ E_K)
 ≤ (‖Σ_K*‖/σ_min(R)) Tr(E_Kᵀ E_K)
 = (‖Σ_K*‖/(4 σ_min(R))) Tr(Σ_K⁻¹ ∇C(K)ᵀ ∇C(K) Σ_K⁻¹)
 ≤ (‖Σ_K*‖/(4 σ_min(Σ_K)² σ_min(R))) Tr(∇C(K)ᵀ ∇C(K))
 ≤ (‖Σ_K*‖/(4 μ² σ_min(R))) Tr(∇C(K)ᵀ ∇C(K)),

which completes the proof of the upper bound. Here the last step is because Σ_K ⪰ E[x₀x₀ᵀ]. For the lower bound, consider K′ = K − (R + Bᵀ P_K B)⁻¹ E_K, for which equality holds in Equation 12. Let x′_t and u′_t be the sequence generated under K′.
Using that C(K*) ≤ C(K′),

C(K) − C(K*) ≥ C(K) − C(K′)
 = −E Σ_t A_K(x′_t, u′_t)
 = E Σ_t Tr(x′_t (x′_t)ᵀ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≥ Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≥ (μ/‖R + Bᵀ P_K B‖) Tr(E_Kᵀ E_K),

which completes the proof.

Recall that a function f is said to be smooth (or C¹-smooth) if for some finite β, it satisfies:

|f(x) − f(y) − ∇f(y)ᵀ(x − y)| ≤ (β/2) ‖x − y‖²   (13)

for all x, y (equivalently, f is smooth if the gradients of f are Lipschitz continuous).

Lemma 12. ("Almost" smoothness, Lemma 6 restated) C(K) satisfies:

C(K′) − C(K) = −2 Tr(Σ_K′ (K − K′)ᵀ E_K) + Tr(Σ_K′ (K − K′)ᵀ (R + Bᵀ P_K B)(K − K′)).

To see why this is related to smoothness (e.g. compare to Equation 13), suppose K′ is sufficiently close to K so that:

Σ_K′ ≈ Σ_K + O(‖K − K′‖)   (14)

and the leading order term 2 Tr(Σ_K′ (K′ − K)ᵀ E_K) would then behave as Tr((K′ − K)ᵀ ∇C(K)). The challenge in the proof (for gradient descent) is quantifying the lower order terms in this argument.

Proof.
The claim immediately results from Lemma 10, by using Equation 11 and taking an expectation. The next lemma provides helpful spectral norm bounds on P_K and Σ_K:

Lemma 13.
It holds that:

‖P_K‖ ≤ C(K)/μ,  ‖Σ_K‖ ≤ C(K)/σ_min(Q).

Proof.
For the first claim, C(K) is lower bounded as:

C(K) = E_{x₀∼D}[x₀ᵀ P_K x₀] ≥ ‖P_K‖ σ_min(E[x₀x₀ᵀ]) = μ ‖P_K‖.

Alternatively, C(K) can be lower bounded as:

C(K) = Tr(Σ_K (Q + Kᵀ R K)) ≥ Tr(Σ_K) σ_min(Q) ≥ ‖Σ_K‖ σ_min(Q),

which proves the second claim.

C.2 Gauss-Newton Analysis

The next lemma bounds the one step progress of Gauss-Newton.
Lemma 14. (Lemma 8 restated) Suppose that:

K′ = K − η (R + Bᵀ P_K B)⁻¹ E_K

(equivalently, K′ = K − (η/2)(R + Bᵀ P_K B)⁻¹ ∇C(K) Σ_K⁻¹; the factor of 2 in ∇C(K) = 2E_KΣ_K is absorbed into the stepsize). If η ≤ 1, then

C(K′) − C(K*) ≤ (1 − ημ/‖Σ_K*‖) (C(K) − C(K*)).

Proof.
First we prove this assuming K′ is a stabilizing policy (that is, ρ(A − BK′) < 1). In this case we can apply Lemma 12. Observe K′ = K − η (R + Bᵀ P_K B)⁻¹ E_K. Using Lemma 12 and the condition on η,

C(K′) − C(K) = −2η Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K) + η² Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ −η Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ −η σ_min(Σ_K′) Tr(E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ −η μ Tr(E_Kᵀ (R + Bᵀ P_K B)⁻¹ E_K)
 ≤ −η (μ/‖Σ_K*‖) (C(K) − C(K*)),

where the last step uses Lemma 11. Now, we will prove that K′ is always stabilizing for our choice of η. We will use K′(η) to denote the policy K′ obtained with step size η. Assume towards contradiction that for some η ≤ 1 the policy K′(η) is not stabilizing. Let η₀ = inf{η : ρ(A − BK′(η)) ≥ 1} and let η₁ = η₀ − ε for a small enough ε > 0. By definition K′(η₁) is still stabilizing, so we know C(K′(η₁)) ≤ C(K), and also ‖A − BK′(η)‖ ≤ ‖A − BK‖ + ‖(R + Bᵀ P_K B)⁻¹ ∇C(K) Σ_K⁻¹‖ is uniformly bounded for every K′(η). By Lemma 16 we know there exists a neighborhood of K′(η₁) such that every policy in this neighborhood is stabilizing. However, this contradicts the assumption that K′(η₀) is not stabilizing when ε is chosen to be small enough.

With this lemma, the proof of the convergence rate of the Gauss-Newton algorithm is immediate.

Proof. (of Theorem 7, Gauss-Newton case) The theorem follows since η = 1 leads to a contraction of 1 − μ/‖Σ_K*‖ at every step.

C.3 Natural Policy Gradient Descent Analysis
The next lemma bounds the one step progress of the natural policy gradient.
Lemma 15.
Suppose: K′ = K − η E_K (equivalently, K′ = K − (η/2) ∇C(K) Σ_K⁻¹; the factor of 2 in ∇C(K) = 2E_KΣ_K is absorbed into the stepsize) and that η ≤ 1/‖R + Bᵀ P_K B‖. It holds that:

C(K′) − C(K*) ≤ (1 − η σ_min(R) μ/‖Σ_K*‖) (C(K) − C(K*)).

Proof. We will again first prove the lemma when K′ is a stabilizing policy (ρ(A − BK′) < 1). Using the same idea as in the proof of Lemma 8, we can prove that K′ must be stabilizing for all the step sizes we choose. Since K′ = K − ηE_K, Lemma 12 implies:

C(K′) − C(K) = −2η Tr(Σ_K′ E_Kᵀ E_K) + η² Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B) E_K).

The last term can be bounded as:
Tr(Σ_K′ E_Kᵀ (R + Bᵀ P_K B) E_K) = Tr((R + Bᵀ P_K B) E_K Σ_K′ E_Kᵀ)
 ≤ ‖R + Bᵀ P_K B‖ Tr(E_K Σ_K′ E_Kᵀ)
 = ‖R + Bᵀ P_K B‖ Tr(Σ_K′ E_Kᵀ E_K).

Continuing and using the condition on η,

C(K′) − C(K) ≤ −2η Tr(Σ_K′ E_Kᵀ E_K) + η² ‖R + Bᵀ P_K B‖ Tr(Σ_K′ E_Kᵀ E_K)
 ≤ −η Tr(Σ_K′ E_Kᵀ E_K)
 ≤ −η σ_min(Σ_K′) Tr(E_Kᵀ E_K)
 ≤ −η μ Tr(E_Kᵀ E_K)
 ≤ −η (μ σ_min(R)/‖Σ_K*‖) (C(K) − C(K*)),

using Lemma 11. With this lemma, the proof of the natural policy gradient convergence rate can be completed.

Proof. (of Theorem 7, natural policy gradient case) Using Lemma 13,

‖R + Bᵀ P_K B‖ ≤ ‖R‖ + ‖B‖² ‖P_K‖ ≤ ‖R‖ + ‖B‖² C(K)/μ.

The proof proceeds by induction, arguing that Lemma 15 can be applied at every step. If it were the case that C(K_t) ≤ C(K₀), then

η = 1/(‖R‖ + ‖B‖² C(K₀)/μ) ≤ 1/(‖R‖ + ‖B‖² C(K_t)/μ) ≤ 1/‖R + Bᵀ P_{K_t} B‖,

and by Lemma 15:

C(K_{t+1}) − C(K*) ≤ (1 − (μ σ_min(R)/‖Σ_K*‖) · 1/(‖R‖ + ‖B‖² C(K₀)/μ)) (C(K_t) − C(K*)).

In particular, C(K_{t+1}) ≤ C(K_t) ≤ C(K₀), so the induction continues, which completes the proof.

C.4 Gradient Descent Analysis
As informally argued by Equation 14, the proof seeks to quantify how Σ_K′ changes with η. Then the proof bounds the one step progress of gradient descent.

Σ_K perturbation analysis. This subsection aims to prove the following:
Lemma 16. (Σ_K perturbation) Suppose K′ is such that:

‖K′ − K‖ ≤ (σ_min(Q) μ) / (4 C(K) ‖B‖ (‖A − BK‖ + 1)).

It holds that:

‖Σ_K′ − Σ_K‖ ≤ 4 (C(K)/σ_min(Q))² (‖B‖ (‖A − BK‖ + 1)/μ) ‖K − K′‖.

The proof proceeds by starting with a few technical lemmas. First, define a linear operator on symmetric matrices, T_K(·), which can be viewed as a matrix acting on d(d+1)/2 dimensions. Define this operator on a symmetric matrix X as follows:

T_K(X) := Σ_{t=0}^∞ (A − BK)ᵗ X [(A − BK)ᵀ]ᵗ.

Also define the induced norm of T_K as follows:

‖T_K‖ = sup_X ‖T_K(X)‖/‖X‖,   (15)

where the supremum is over all symmetric matrices X (whose spectral norm is non-zero). Also, define Σ₀ = E[x₀ x₀ᵀ].

Lemma 17. (T_K norm bound) It holds that

‖T_K‖ ≤ C(K)/(μ σ_min(Q)).

Proof.
For a unit norm vector v ∈ Rᵈ and a unit spectral norm symmetric matrix X,

vᵀ T_K(X) v = Σ_{t=0}^∞ vᵀ (A − BK)ᵗ X [(A − BK)ᵀ]ᵗ v
 = Σ_{t=0}^∞ vᵀ (A − BK)ᵗ Σ₀^{1/2} (Σ₀^{−1/2} X Σ₀^{−1/2}) Σ₀^{1/2} [(A − BK)ᵀ]ᵗ v
 ≤ ‖Σ₀^{−1/2} X Σ₀^{−1/2}‖ Σ_{t=0}^∞ vᵀ (A − BK)ᵗ Σ₀ [(A − BK)ᵀ]ᵗ v
 = ‖Σ₀^{−1/2} X Σ₀^{−1/2}‖ (vᵀ T_K(Σ₀) v)
 ≤ (1/σ_min(E[x₀x₀ᵀ])) ‖T_K(Σ₀)‖ = (1/μ) ‖Σ_K‖,

using that T_K(Σ₀) = Σ_K. The proof is completed using the upper bound on ‖Σ_K‖ in Lemma 13.

Also, with respect to K, define another linear operator on symmetric matrices:

F_K(X) = (A − BK) X (A − BK)ᵀ.

Let I denote the identity operator on the same space. Define the induced norm ‖·‖ of these operators as in Equation 15. Note these operators are related to the operator T_K as follows:

Lemma 18.
When $(A-BK)$ has spectral radius smaller than $1$, $\mathcal{T}_K = (\mathcal{I} - \mathcal{F}_K)^{-1}$.

Proof.
When $(A-BK)$ has spectral radius smaller than $1$, $\mathcal{T}_K$ is well defined and satisfies $\mathcal{T}_K = \mathcal{I} + \mathcal{T}_K \circ \mathcal{F}_K$. Therefore $\mathcal{T}_K \circ (\mathcal{I} - \mathcal{F}_K) = \mathcal{I}$, and $\mathcal{T}_K = (\mathcal{I} - \mathcal{F}_K)^{-1}$.

Since $\Sigma_K = \mathcal{T}_K(\Sigma_0) = (\mathcal{I} - \mathcal{F}_K)^{-1}(\Sigma_0)$, the proof of Lemma 16 seeks to bound:
$$\|\Sigma_K - \Sigma_{K'}\| = \|(\mathcal{T}_K - \mathcal{T}_{K'})(\Sigma_0)\| = \big\|\big((\mathcal{I}-\mathcal{F}_K)^{-1} - (\mathcal{I}-\mathcal{F}_{K'})^{-1}\big)(\Sigma_0)\big\|.$$
The following two perturbation bounds are helpful in this.
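Before proceeding, the identity $\mathcal{T}_K = (\mathcal{I}-\mathcal{F}_K)^{-1}$ is easy to sanity-check numerically. The sketch below uses an arbitrary stable matrix $M$ standing in for $A-BK$ (an illustrative assumption, not part of the argument): acting on row-major $\mathrm{vec}(X)$, the map $X \mapsto MXM^\top$ is the Kronecker product $M \otimes M$, so $(\mathcal{I}-\mathcal{F}_K)^{-1}$ is a single linear solve.

```python
# Check T_K = (I - F_K)^{-1} numerically for a stable closed-loop matrix M.
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.standard_normal((d, d))
M *= 0.9 / max(abs(np.linalg.eigvals(M)))  # rescale so the spectral radius is 0.9 (< 1)

def T_K(X, iters=500):
    """Truncated series sum_t M^t X (M^T)^t."""
    out, term = np.zeros_like(X), X.copy()
    for _ in range(iters):
        out += term
        term = M @ term @ M.T
    return out

def T_K_resolvent(X):
    """(I - F_K)^{-1} applied to X via the Kronecker representation of F_K."""
    op = np.eye(d * d) - np.kron(M, M)
    return np.linalg.solve(op, X.reshape(-1)).reshape(d, d)

X = rng.standard_normal((d, d))
X = X + X.T                                # symmetric test matrix
err = np.linalg.norm(T_K(X) - T_K_resolvent(X))
```

With spectral radius $0.9$, the series terms decay geometrically, so the truncated sum and the resolvent form agree to machine precision.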
Lemma 19.
It holds that:
$$\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 2\,\|A-BK\|\,\|B\|\,\|K-K'\| + \|B\|^2\,\|K-K'\|^2.$$

Proof.
Let $\Delta = K' - K$. For every symmetric matrix $X$,
$$(\mathcal{F}_K - \mathcal{F}_{K'})(X) = (A-BK)\, X\, (B\Delta)^\top + (B\Delta)\, X\, (A-BK)^\top - (B\Delta)\, X\, (B\Delta)^\top.$$
The operator norm of $\mathcal{F}_K - \mathcal{F}_{K'}$ is the maximum possible ratio of the spectral norms of $(\mathcal{F}_K - \mathcal{F}_{K'})(X)$ and $X$. The claim then follows from the submultiplicativity of the spectral norm, $\|AX\| \le \|A\|\,\|X\|$.

Lemma 20. If $\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 1/2$, and both $\mathcal{F}_K$ and $\mathcal{F}_{K'}$ satisfy $\rho(\mathcal{F}_K) < 1$ and $\rho(\mathcal{F}_{K'}) < 1$, then
$$\|(\mathcal{T}_K - \mathcal{T}_{K'})(\Sigma)\| \le 2\,\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|\mathcal{T}_K(\Sigma)\| \le 2\,\|\mathcal{T}_K\|^2\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|\Sigma\|.$$

Proof.
Define $\mathcal{A} = \mathcal{I} - \mathcal{F}_K$ and $\mathcal{B} = \mathcal{F}_{K'} - \mathcal{F}_K$. In this case $\mathcal{A}^{-1} = \mathcal{T}_K$ and $(\mathcal{A} - \mathcal{B})^{-1} = \mathcal{T}_{K'}$. Hence, the condition $\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 1/2$ translates to the condition $\|\mathcal{A}^{-1}\|\,\|\mathcal{B}\| \le 1/2$.

Observe:
$$\big(\mathcal{A}^{-1} - (\mathcal{A}-\mathcal{B})^{-1}\big)(\Sigma) = \big(\mathcal{I} - (\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\big)(\mathcal{A}^{-1}(\Sigma)) = \big(\mathcal{I} - (\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\big)(\mathcal{T}_K(\Sigma)).$$
Since $(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1} = \mathcal{I} + \mathcal{A}^{-1}\circ\mathcal{B}\circ(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}$,
$$\|(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| \le 1 + \|\mathcal{A}^{-1}\circ\mathcal{B}\|\,\|(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| \le 1 + \tfrac12\,\|(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\|,$$
so $\|(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| \le 2$. Hence,
$$\|\mathcal{I} - (\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| = \|\mathcal{A}^{-1}\circ\mathcal{B}\circ(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| \le \|\mathcal{A}^{-1}\|\,\|\mathcal{B}\|\,\|(\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\| \le 2\,\|\mathcal{A}^{-1}\|\,\|\mathcal{B}\| = 2\,\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|.$$
Combining these,
$$\|(\mathcal{T}_K - \mathcal{T}_{K'})(\Sigma)\| \le \|\mathcal{I} - (\mathcal{I} - \mathcal{A}^{-1}\circ\mathcal{B})^{-1}\|\,\|\mathcal{T}_K(\Sigma)\| \le 2\,\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|\mathcal{T}_K(\Sigma)\|.$$
This proves the main inequality. The last step of the lemma just applies the definition of the norm of $\mathcal{T}_K$: $\|\mathcal{T}_K(\Sigma)\| \le \|\mathcal{T}_K\|\,\|\Sigma\|$.

With these lemmas, we can first prove a weaker version of Lemma 16 which additionally assumes that $\mathcal{F}_{K'}$ has spectral radius smaller than $1$.

Lemma 21.
Lemma 16 holds with the additional assumption that $\rho(\mathcal{F}_{K'}) < 1$ (where $\mathcal{F}_{K'}$ is defined as $\mathcal{F}_{K'}(X) = (A-BK')X(A-BK')^\top$).

Proof. First, the proof shows $\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 1/2$, which is the condition needed in Lemma 20. Observe that the assumed condition on $\|K-K'\|$ implies that
$$\|B\|\,\|K'-K\| \le \frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)\,(\|A-BK\|+1)} \le \frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)} \le 1,$$
using that $\sigma_{\min}(Q)\,\mu / C(K) \le 1$ due to Lemma 13. Using Lemma 19,
$$\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 2\,\|A-BK\|\,\|B\|\,\|K-K'\| + \|B\|^2\,\|K-K'\|^2 \le 2\,\|B\|\,(\|A-BK\|+1)\,\|K-K'\|. \qquad (16)$$
Using this and Lemma 17,
$$\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le \frac{C(K)}{\sigma_{\min}(Q)\,\mu}\cdot 2\,\|B\|\,(\|A-BK\|+1)\,\|K-K'\| \le \frac12,$$
where the last step uses the condition on $\|K-K'\|$. Thus,
$$\|\Sigma_{K'} - \Sigma_K\| \le 2\,\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|\mathcal{T}_K(\Sigma_0)\| \le 2\cdot\frac{C(K)}{\sigma_{\min}(Q)\,\mu}\cdot\big(2\,\|B\|\,(\|A-BK\|+1)\,\|K-K'\|\big)\cdot\frac{C(K)}{\sigma_{\min}(Q)},$$
using Lemmas 13 and 19, which is the claimed bound.

Now the only remaining step to prove Lemma 16 is to show that within the ball assumed in Lemma 16, the policy is always stabilizing (that is, $\rho(\mathcal{F}_{K'}) < 1$).

Lemma 22.
Suppose $K'$ is such that:
$$\|K'-K\| \le \frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)\,\|B\|\,(\|A-BK\|+1)}.$$
It holds that $\rho(\mathcal{F}_{K'}) < 1$ (where $\mathcal{F}_{K'}$ is defined as $\mathcal{F}_{K'}(X) = (A-BK')X(A-BK')^\top$).

The idea of the proof is that if $\rho(\mathcal{F}_{K'})$ is very close to $1$, the covariance matrix $\Sigma_{K'}$ must be large. Note that $\rho(\mathcal{F}_{K'}) = \rho(A-BK')^2$, so it suffices to prove the claim for $A-BK'$.

Lemma 23.
For any $K'$ with $\rho(A-BK') < 1$, we have
$$\mathrm{tr}(\Sigma_{K'}) \ge \frac{\mu}{1 - \rho(A-BK')^2}.$$

Proof.
We know
$$\Sigma_{K'} = \sum_{i=0}^\infty \mathcal{F}^i_{K'}(\Sigma_0).$$
Since $\Sigma_0 \succeq \mu I$, the $i$-th term satisfies $\mathcal{F}^i_{K'}(\Sigma_0) \succeq \mu\,(A-BK')^i[(A-BK')^\top]^i$. The trace of this term is therefore at least
$$\mathrm{tr}\big(\mathcal{F}^i_{K'}(\Sigma_0)\big) \ge \mu\,\mathrm{tr}\big((A-BK')^i[(A-BK')^\top]^i\big) = \mu\,\|(A-BK')^i\|_F^2 \ge \mu\,\rho\big((A-BK')^i\big)^2 = \mu\,\rho(A-BK')^{2i}.$$
Summing the traces of all the terms (a geometric series) gives the result.

Now we are ready to prove Lemma 22.
Proof.
Let
$$\Gamma = \mathrm{tr}(\Sigma_K) + d\,\frac{C(K)}{\sigma_{\min}(Q)};$$
this is the maximum possible value of $\mathrm{tr}(\Sigma_{K''})$ according to Lemma 21 when $K''$ is within the ball of Lemma 16 and $\rho(\mathcal{F}_{K''}) < 1$. Let $\epsilon = \mu/(3\Gamma)$. We know that $\rho(A-BK) < 1-\epsilon$, because otherwise Lemma 23 would be contradicted.

Assume towards contradiction that there exists a $K'$ within the ball $\|K'-K\| \le \frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)\,\|B\|\,(\|A-BK\|+1)}$ such that $\rho(A-BK') \ge 1$. Since the spectral radius is a continuous function [Tyrtyshnikov, 2012], there must be a point $K''$ on the path between $K$ and $K'$ such that $\rho(A-BK'') = 1-\epsilon$. Now Lemma 21 applies to $K''$, and we conclude that
$$\|\Sigma_{K''} - \Sigma_K\| \le 4\left(\frac{C(K)}{\sigma_{\min}(Q)}\right)^2 \frac{\|B\|\,(\|A-BK\|+1)}{\mu}\,\|K-K''\| \le \frac{C(K)}{\sigma_{\min}(Q)}.$$
As a result, $\mathrm{tr}(\Sigma_{K''}) \le \mathrm{tr}(\Sigma_K) + d\,\|\Sigma_{K''} - \Sigma_K\| \le \Gamma$. On the other hand, by Lemma 23,
$$\mathrm{tr}(\Sigma_{K''}) \ge \frac{\mu}{1-(1-\epsilon)^2} \ge \frac{\mu}{2\epsilon} = \frac{3\Gamma}{2} > \Gamma.$$
This is a contradiction. Therefore, for any point $K'$ within the ball, $\rho(A-BK') < 1$.

Lemma 16 now follows immediately from Lemma 21 and Lemma 22.

Gradient Descent Progress
Equipped with these lemmas, the one-step progress of gradient descent can be bounded.
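Before stating the lemma, the descent behavior is easy to observe numerically. The sketch below runs exact policy gradient descent on a small illustrative LQR instance (the matrices and the stepsize are assumptions for the demo, not quantities from the paper), computing $P_K$ and $\Sigma_K$ by discrete Lyapunov solves and using the policy gradient expression $\nabla C(K) = 2E_K\Sigma_K$ with $E_K = (R+B^\top P_K B)K - B^\top P_K A$:

```python
# Exact policy gradient descent on a tiny LQR: x_{t+1} = (A - B K) x_t,
# x_0 ~ N(0, I) (so mu = 1), C(K) = Tr(P_K Sigma_0).  All matrices below are
# illustrative choices; the stepsize is a conservative heuristic.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
Sigma0 = np.eye(2)

def lyap(M, X):
    """Solve S = X + M S M^T (discrete Lyapunov) via the Kronecker trick."""
    n = M.shape[0]
    return np.linalg.solve(np.eye(n * n) - np.kron(M, M), X.reshape(-1)).reshape(n, n)

def cost_and_grad(K):
    M = A - B @ K
    P = lyap(M.T, Q + K.T @ R @ K)       # P_K = sum_t (M^T)^t (Q + K^T R K) M^t
    Sigma = lyap(M, Sigma0)              # Sigma_K = sum_t M^t Sigma_0 (M^T)^t
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return np.trace(P @ Sigma0), 2 * E @ Sigma

K = np.zeros((1, 2))                     # this A is stable, so K = 0 is admissible
eta, costs = 2e-4, []
for _ in range(2000):
    c, g = cost_and_grad(K)
    costs.append(c)
    K = K - eta * g
final_cost, _ = cost_and_grad(K)
```

With a stepsize this small, the cost decreases monotonically at every iteration, which is exactly the qualitative behavior the one-step progress bound formalizes.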
Lemma 24.
Suppose that $K' = K - \eta\,\nabla C(K)$, where
$$\eta \le \frac{1}{16}\,\min\left\{ \left(\frac{\sigma_{\min}(Q)\,\mu}{C(K)}\right)^2 \frac{1}{\|B\|\,\|\nabla C(K)\|\,(1+\|A-BK\|)},\; \frac{\sigma_{\min}(Q)}{2\,C(K)\,\|R + B^\top P_K B\|} \right\}. \qquad (17)$$
It holds that:
$$C(K') - C(K^*) \le \left(1 - \frac{2\,\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K) - C(K^*)\big).$$

Proof. By Lemma 12,
$$C(K') - C(K) = -4\eta\,\mathrm{Tr}(\Sigma_{K'}\Sigma_K E_K^\top E_K) + 4\eta^2\,\mathrm{Tr}(\Sigma_{K'}\Sigma_K E_K^\top(R+B^\top P_K B)E_K \Sigma_K)$$
$$\le -4\eta\,\mathrm{Tr}(\Sigma_K E_K^\top E_K \Sigma_K) + 4\eta\,\|\Sigma_{K'}-\Sigma_K\|\,\mathrm{Tr}(\Sigma_K E_K^\top E_K) + 4\eta^2\,\|\Sigma_{K'}\|\,\|R+B^\top P_K B\|\,\mathrm{Tr}(\Sigma_K E_K^\top E_K \Sigma_K)$$
$$\le -4\eta\left(1 - \frac{\|\Sigma_{K'}-\Sigma_K\|}{\sigma_{\min}(\Sigma_K)} - \eta\,\|\Sigma_{K'}\|\,\|R+B^\top P_K B\|\right)\mathrm{Tr}(\Sigma_K E_K^\top E_K \Sigma_K)$$
$$= -\eta\left(1 - \frac{\|\Sigma_{K'}-\Sigma_K\|}{\sigma_{\min}(\Sigma_K)} - \eta\,\|\Sigma_{K'}\|\,\|R+B^\top P_K B\|\right)\mathrm{Tr}\big(\nabla C(K)^\top \nabla C(K)\big)$$
$$\le -\frac{4\,\eta\,\mu^2\,\sigma_{\min}(R)}{\|\Sigma_{K^*}\|}\left(1 - \frac{\|\Sigma_{K'}-\Sigma_K\|}{\mu} - \eta\,\|\Sigma_{K'}\|\,\|R+B^\top P_K B\|\right)\big(C(K) - C(K^*)\big),$$
where the last step uses Lemma 11 together with $\sigma_{\min}(\Sigma_K) \ge \mu$.

By Lemma 16,
$$\frac{\|\Sigma_{K'}-\Sigma_K\|}{\mu} \le \eta\cdot 4\left(\frac{C(K)}{\sigma_{\min}(Q)}\right)^2 \frac{\|B\|\,(\|A-BK\|+1)}{\mu^2}\,\|\nabla C(K)\| \le \frac14,$$
using the assumed condition on $\eta$. Using this last claim and Lemma 13 (together with $\mu \le C(K)/\sigma_{\min}(Q)$),
$$\|\Sigma_{K'}\| \le \|\Sigma_{K'}-\Sigma_K\| + \|\Sigma_K\| \le \frac{\mu}{4} + \frac{C(K)}{\sigma_{\min}(Q)} \le \frac{2\,C(K)}{\sigma_{\min}(Q)}.$$
Hence,
$$1 - \frac{\|\Sigma_{K'}-\Sigma_K\|}{\mu} - \eta\,\|\Sigma_{K'}\|\,\|R+B^\top P_K B\| \ge 1 - \frac14 - \eta\,\frac{2\,C(K)}{\sigma_{\min}(Q)}\,\|R+B^\top P_K B\| \ge \frac12,$$
using the condition on $\eta$, which completes the proof.

In order to prove a gradient descent convergence rate, the following bounds are helpful:

Lemma 25.
It holds that
$$\|\nabla C(K)\| \le \frac{2\,C(K)}{\sigma_{\min}(Q)}\sqrt{\frac{\|R+B^\top P_K B\|\,(C(K)-C(K^*))}{\mu}}$$
and that:
$$\|K\| \le \frac{1}{\sigma_{\min}(R)}\left(\sqrt{\frac{\|R+B^\top P_K B\|\,(C(K)-C(K^*))}{\mu}} + \|B^\top P_K A\|\right).$$

Proof.
Using Lemma 13,
$$\|\nabla C(K)\|^2 \le 4\,\mathrm{Tr}(\Sigma_K E_K^\top E_K \Sigma_K) \le 4\,\|\Sigma_K\|^2\,\mathrm{Tr}(E_K^\top E_K) \le 4\left(\frac{C(K)}{\sigma_{\min}(Q)}\right)^2 \mathrm{Tr}(E_K^\top E_K).$$
By Lemma 11,
$$\mathrm{Tr}(E_K^\top E_K) \le \frac{\|R+B^\top P_K B\|\,(C(K)-C(K^*))}{\mu},$$
which proves the first claim. Again using Lemma 11,
$$\|K\| \le \|(R+B^\top P_K B)^{-1}\|\,\|(R+B^\top P_K B)K\| \le \frac{1}{\sigma_{\min}(R)}\,\|(R+B^\top P_K B)K\|$$
$$\le \frac{1}{\sigma_{\min}(R)}\Big(\|(R+B^\top P_K B)K - B^\top P_K A\| + \|B^\top P_K A\|\Big) = \frac{\|E_K\|}{\sigma_{\min}(R)} + \frac{\|B^\top P_K A\|}{\sigma_{\min}(R)}$$
$$\le \frac{\sqrt{\mathrm{Tr}(E_K^\top E_K)}}{\sigma_{\min}(R)} + \frac{\|B^\top P_K A\|}{\sigma_{\min}(R)} = \frac{\sqrt{(C(K)-C(K^*))\,\|R+B^\top P_K B\|/\mu}}{\sigma_{\min}(R)} + \frac{\|B^\top P_K A\|}{\sigma_{\min}(R)},$$
which proves the second claim.

With these lemmas, the proof of the gradient descent convergence rate follows:

Proof (of Theorem 7, gradient descent case). First, the following argues that progress is made at $t=1$. Based on Lemma 13 and Lemma 25, by choosing $\eta$ to be an appropriate polynomial in $C(K_0)$, $\|A\|$, $\|B\|$, $\|R\|$, $\sigma_{\min}(R)$, $\sigma_{\min}(Q)$ and $\mu$, the stepsize condition in Equation 17 is satisfied. Hence, by Lemma 24,
$$C(K_1) - C(K^*) \le \left(1 - \frac{2\,\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K_0) - C(K^*)\big),$$
which implies that the cost decreases at $t=1$. Proceeding inductively, now suppose that $C(K_t) \le C(K_0)$; then the stepsize condition in Equation 17 is still satisfied (due to the use of $C(K_0)$ in bounding the quantities in Lemma 25). Thus, Lemma 24 can again be applied for the update at time $t+1$ to obtain:
$$C(K_{t+1}) - C(K^*) \le \left(1 - \frac{2\,\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K_t) - C(K^*)\big).$$
Provided
$$T \ge \frac{\|\Sigma_{K^*}\|}{\eta\,\mu^2\,\sigma_{\min}(R)}\,\log\frac{C(K_0) - C(K^*)}{\varepsilon},$$
then $C(K_T) - C(K^*) \le \varepsilon$, and the result follows.

D Analysis: the Model-free case
This section shows how techniques from zeroth order optimization allow the algorithm to run in the model-free setting with only black-box access to a simulator. The dependencies on various parameters are not optimized, and the notation $h$ is used to represent different polynomials in the relevant factors ($\frac{C(K)}{\mu\,\sigma_{\min}(Q)}$, $\|A\|$, $\|B\|$, $\|R\|$, $1/\sigma_{\min}(R)$). When such a polynomial also depends on the dimension $d$ or the accuracy $1/\epsilon$, this is specified via explicit parameters (as in $h(d, 1/\epsilon)$).

The section starts by showing how the infinite horizon can be approximated with a finite horizon.

D.1 Approximating $C(K)$ and $\Sigma_K$ with a finite horizon

This section shows that as long as there is an upper bound on $C(K)$, it is possible to approximate both $C(K)$ and $\Sigma_K$ to any desired accuracy.

Lemma 26.
For any $K$ with finite $C(K)$, let $\Sigma^{(\ell)}_K = \mathbb{E}[\sum_{i=0}^{\ell-1} x_i x_i^\top]$ and $C^{(\ell)}(K) = \mathbb{E}[\sum_{i=0}^{\ell-1} x_i^\top Q x_i + u_i^\top R u_i] = \langle \Sigma^{(\ell)}_K,\, Q + K^\top R K\rangle$. If
$$\ell \ge \frac{d\,C(K)^2}{\epsilon\,\mu\,\sigma_{\min}(Q)^2},$$
then $\|\Sigma^{(\ell)}_K - \Sigma_K\| \le \epsilon$. Also, if
$$\ell \ge \frac{d\,C(K)^2\,(\|Q\| + \|R\|\,\|K\|^2)}{\epsilon\,\mu\,\sigma_{\min}(Q)^2},$$
then $C(K) \ge C^{(\ell)}(K) \ge C(K) - \epsilon$.

Proof. First, the bound on $\Sigma_K$ is proved. Define the operators $\mathcal{T}_K$ and $\mathcal{F}_K$ as in Section C.4, and observe that $\Sigma_K = \mathcal{T}_K(\Sigma_0)$ and $\Sigma^{(\ell)}_K = \Sigma_K - \mathcal{F}_K^\ell(\Sigma_K)$.

If $X \succeq Y$, then $\mathcal{F}_K(X) \succeq \mathcal{F}_K(Y)$; this follows immediately from the form $\mathcal{F}_K(X) = (A-BK)X(A-BK)^\top$, since if $X$ is PSD then $WXW^\top$ is also PSD for any $W$.

Now, since
$$\sum_{i=0}^{\ell-1} \mathrm{tr}\big(\mathcal{F}^i_K(\Sigma_0)\big) = \mathrm{tr}\Big(\sum_{i=0}^{\ell-1} \mathcal{F}^i_K(\Sigma_0)\Big) \le \mathrm{tr}\Big(\sum_{i=0}^{\infty} \mathcal{F}^i_K(\Sigma_0)\Big) = \mathrm{tr}(\Sigma_K) \le \frac{d\,C(K)}{\sigma_{\min}(Q)}$$
(here the last step is by Lemma 13), and all traces are nonnegative, there must exist $j < \ell$ such that
$$\mathrm{tr}\big(\mathcal{F}^j_K(\Sigma_0)\big) \le \frac{d\,C(K)}{\ell\,\sigma_{\min}(Q)}.$$
Also, since $\Sigma_K \preceq \frac{C(K)}{\mu\,\sigma_{\min}(Q)}\,\Sigma_0$,
$$\mathrm{tr}\big(\mathcal{F}^j_K(\Sigma_K)\big) \le \frac{C(K)}{\mu\,\sigma_{\min}(Q)}\,\mathrm{tr}\big(\mathcal{F}^j_K(\Sigma_0)\big) \le \frac{d\,C(K)^2}{\ell\,\mu\,\sigma_{\min}(Q)^2}.$$
Therefore, as long as $\ell \ge \frac{d\,C(K)^2}{\epsilon\,\mu\,\sigma_{\min}(Q)^2}$, it follows that:
$$\|\Sigma_K - \Sigma^{(\ell)}_K\| \le \|\Sigma_K - \Sigma^{(j)}_K\| = \|\mathcal{F}^j_K(\Sigma_K)\| \le \epsilon.$$
Here the first step is again because all the terms are PSD, so using more terms is always better. The last step follows because $\mathcal{F}^j_K(\Sigma_K)$ is also a PSD matrix, so its spectral norm is bounded by its trace. In fact, the same argument shows that $\mathrm{tr}(\Sigma_K - \Sigma^{(\ell)}_K) \le \epsilon$.

Next, observe $C^{(\ell)}(K) = \langle \Sigma^{(\ell)}_K,\, Q + K^\top R K\rangle$ and $C(K) = \langle \Sigma_K,\, Q + K^\top R K\rangle$, therefore
$$C(K) - C^{(\ell)}(K) \le \mathrm{tr}(\Sigma_K - \Sigma^{(\ell)}_K)\,(\|Q\| + \|R\|\,\|K\|^2).$$
Therefore, if $\ell \ge \frac{d\,C(K)^2(\|Q\|+\|R\|\|K\|^2)}{\epsilon\,\mu\,\sigma_{\min}(Q)^2}$, then $\mathrm{tr}(\Sigma_K - \Sigma^{(\ell)}_K) \le \epsilon/(\|Q\|+\|R\|\|K\|^2)$ and hence $C(K) - C^{(\ell)}(K) \le \epsilon$.

D.2 Perturbation of $C(K)$ and $\nabla C(K)$

The next lemma shows that the function value and its gradient are approximately preserved when a small perturbation is applied to the policy $K$.

Lemma 27 ($C_K$ perturbation). Suppose $K'$ is such that:
$$\|K'-K\| \le \min\left(\frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)\,\|B\|\,(\|A-BK\|+1)},\; \|K\|\right),$$
then:
$$|C(K') - C(K)| \le 6\,\|K\|\,\|R\|\,\mathbb{E}\|x_0\|^2 \left(\frac{C(K)}{\mu\,\sigma_{\min}(Q)}\right)^2 \big(\|K\|\,\|B\|\,\|A-BK\| + \|K\|\,\|B\| + 1\big)\,\|K-K'\|.$$

Proof.
As shown in the proof of Lemma 21, the assumption implies that $\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 1/2$ and, from Equation 16, that
$$\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 2\,\|B\|\,(\|A-BK\|+1)\,\|K-K'\|.$$
First, observe:
$$|C(K') - C(K)| \le \mathrm{Tr}(\mathbb{E}\,x_0 x_0^\top)\,\|\mathcal{T}_{K'}(Q + (K')^\top R K') - \mathcal{T}_K(Q + K^\top R K)\| = \mathbb{E}\|x_0\|^2\,\|P_{K'} - P_K\|.$$
To bound the difference, we just need to bound $\|P_{K'} - P_K\|$. For that we have
$$\|P_{K'} - P_K\| = \|\mathcal{T}_{K'}(Q + (K')^\top R K') - \mathcal{T}_K(Q + K^\top R K)\|$$
$$\le \|(\mathcal{T}_{K'} - \mathcal{T}_K)(Q + (K')^\top R K')\| + \|\mathcal{T}_K\big(K^\top R K - (K')^\top R K'\big)\|$$
$$\le 2\,\|\mathcal{T}_K\|^2\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\big(\|(K')^\top R K' - K^\top R K\| + \|K^\top R K\|\big) + \|\mathcal{T}_K\|\,\|K^\top R K - (K')^\top R K'\|$$
$$\le 2\,\|\mathcal{T}_K\|\,\|(K')^\top R K' - K^\top R K\| + 2\,\|\mathcal{T}_K\|^2\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|K^\top R K\|,$$
where the last step uses $\|\mathcal{T}_K\|\,\|\mathcal{F}_K - \mathcal{F}_{K'}\| \le 1/2$ to absorb the first product into $\|\mathcal{T}_K\|\,\|(K')^\top R K' - K^\top R K\|$.

For the first term,
$$\|\mathcal{T}_K\|\,\|(K')^\top R K' - K^\top R K\| \le \|\mathcal{T}_K\|\,\big(2\,\|K\|\,\|R\|\,\|K'-K\| + \|R\|\,\|K'-K\|^2\big) \le 3\,\|\mathcal{T}_K\|\,\|K\|\,\|R\|\,\|K'-K\|,$$
using the assumption that $\|K'-K\| \le \|K\|$. For the second term,
$$2\,\|\mathcal{T}_K\|^2\,\|\mathcal{F}_K - \mathcal{F}_{K'}\|\,\|K^\top R K\| \le 4\,\|\mathcal{T}_K\|^2\,\|B\|\,(\|A-BK\|+1)\,\|K-K'\|\,\|K\|^2\,\|R\|.$$
Combining the two terms (together with $\|\mathcal{T}_K\| \le \frac{C(K)}{\mu\,\sigma_{\min}(Q)}$, which is at least $1$) completes the proof.

The next lemma shows that the gradient is also stable under perturbation.
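The cost-perturbation bound just proved (and the analogous gradient bound that follows) predicts at most linear change in $C$ and $\nabla C$ for small $\|K'-K\|$. A quick finite-difference probe on a toy LQR (all matrices are illustrative choices, not from the paper) checks that the difference quotients stay bounded as the perturbation shrinks:

```python
# Finite-difference probe: |C(K + delta*D) - C(K)| / delta and
# ||grad C(K + delta*D) - grad C(K)|| / delta should stay roughly constant
# (bounded) as delta -> 0 near a stabilizing K.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.array([[1.0]]), np.eye(2)

def lyap(M, X):
    """Solve S = X + M S M^T via the Kronecker trick."""
    n = M.shape[0]
    return np.linalg.solve(np.eye(n * n) - np.kron(M, M), X.reshape(-1)).reshape(n, n)

def cost_and_grad(K):
    M = A - B @ K
    P = lyap(M.T, Q + K.T @ R @ K)
    Sigma = lyap(M, Sigma0)
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return np.trace(P @ Sigma0), 2 * E @ Sigma

K = np.zeros((1, 2))
c, g = cost_and_grad(K)
D = g / np.linalg.norm(g)                  # unit Frobenius-norm perturbation direction
ratios_c, ratios_g = [], []
for delta in (1e-3, 1e-4, 1e-5):
    c2, g2 = cost_and_grad(K + delta * D)
    ratios_c.append(abs(c2 - c) / delta)   # ~ |<grad C, D>|, independent of delta
    ratios_g.append(np.linalg.norm(g2 - g) / delta)  # ~ directional curvature, bounded
```

Both ratio sequences stay essentially constant as $\delta$ shrinks, consistent with the Lipschitz-type bounds.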
Lemma 28 ($\nabla C_K$ perturbation). Suppose $K'$ is such that:
$$\|K'-K\| \le \min\left(\frac{\sigma_{\min}(Q)\,\mu}{4\,C(K)\,\|B\|\,(\|A-BK\|+1)},\; \|K\|\right).$$
Then there is a polynomial $h_{\mathrm{grad}}$ in $\frac{C(K)}{\mu\,\sigma_{\min}(Q)}$, $\mathbb{E}[\|x_0\|^2]$, $\|A\|$, $\|B\|$, $\|R\|$, $1/\sigma_{\min}(R)$ such that
$$\|\nabla C(K') - \nabla C(K)\| \le h_{\mathrm{grad}}\,\|K'-K\|.$$
Also, $\|\nabla C(K') - \nabla C(K)\|_F \le h_{\mathrm{grad}}\,\|K'-K\|_F$.

Proof.
Recall $\nabla C(K) = 2E_K\Sigma_K$, where $E_K = (R+B^\top P_K B)K - B^\top P_K A$. Therefore
$$\nabla C(K') - \nabla C(K) = 2E_{K'}\Sigma_{K'} - 2E_K\Sigma_K = 2(E_{K'}-E_K)\Sigma_{K'} + 2E_K(\Sigma_{K'}-\Sigma_K).$$
Let us first look at the second term. By Lemma 11,
$$\mathrm{Tr}(E_K^\top E_K) \le \frac{\|R+B^\top P_K B\|\,(C(K)-C(K^*))}{\mu},$$
and by Lemma 16,
$$\|\Sigma_{K'}-\Sigma_K\| \le 4\left(\frac{C(K)}{\sigma_{\min}(Q)}\right)^2\frac{\|B\|\,(\|A-BK\|+1)}{\mu}\,\|K-K'\|.$$
Therefore the second term is bounded by
$$8\left(\frac{C(K)}{\sigma_{\min}(Q)}\right)^2 \sqrt{\frac{\|R+B^\top P_K B\|\,(C(K)-C(K^*))}{\mu}}\;\frac{\|B\|\,(\|A-BK\|+1)}{\mu}\,\|K-K'\|.$$
Next we bound the first term. Since $K'-K$ is small enough, $\|\Sigma_{K'}\| \le \|\Sigma_K\| + \frac{C(K)}{\sigma_{\min}(Q)} \le \frac{2\,C(K)}{\sigma_{\min}(Q)}$. For $E_{K'}-E_K$, we first need a bound on $P_{K'}-P_K$. By the previous lemma,
$$\|P_{K'}-P_K\| \le 6\left(\left(\frac{C(K)}{\mu\,\sigma_{\min}(Q)}\right)^2 \|K\|^2\,\|R\|\,\|B\|\,(\|A-BK\|+1) + \frac{C(K)}{\mu\,\sigma_{\min}(Q)}\,\|K\|\,\|R\|\right)\|K-K'\|.$$
Therefore, writing
$$E_{K'} - E_K = R(K'-K) + B^\top(P_{K'}-P_K)A + B^\top(P_{K'}-P_K)BK' + B^\top P_K B(K'-K),$$
and since $\|K'\| \le 2\|K\|$ and $\|K\|$ can be bounded in terms of $C(K)$ (Lemma 25), all the terms can be bounded by polynomials in the relevant parameters multiplied by $\|K-K'\|$.

D.3 Smoothing and the gradient descent analysis

This section analyzes the smoothing procedure and completes the proof for gradient descent. Although Gaussian smoothing is more standard, the objective $C(K)$ is not finite for every $K$; therefore, technically, $\mathbb{E}_{u\sim\mathcal{N}(0,\sigma^2 I)}[C(K+u)]$ is not well defined.
This is avoided by smoothing over a ball. Let $\mathbb{S}_r$ represent the uniform distribution over the points with norm $r$ (the boundary of the sphere), and $\mathbb{B}_r$ represent the uniform distribution over all points with norm at most $r$ (the solid ball). When these distributions are applied to a matrix $U$, the Frobenius norm ball is used. The algorithm performs gradient descent on the following function:
$$C_r(K) = \mathbb{E}_{U\sim\mathbb{B}_r}[C(K+U)].$$
The next lemma uses a standard technique (e.g., in [Flaxman et al., 2005]) to show that the gradient of $C_r(K)$ can be estimated with only an oracle for the function value.

Lemma 29.
$$\nabla C_r(K) = \frac{d}{r^2}\,\mathbb{E}_{U\sim\mathbb{S}_r}[C(K+U)\,U].$$
This is the same as Lemma 2.1 in Flaxman et al. [2005]; for completeness, the proof is provided below.
Proof.
By Stokes' formula,
$$\nabla \int_{\mathbb{B}_r} C(K+U)\,dU = \int_{\mathbb{S}_r} C(K+U)\,\frac{U}{\|U\|_F}\,dU.$$
By definition,
$$C_r(K) = \frac{\int_{\mathbb{B}_r} C(K+U)\,dU}{\mathrm{vol}_d(\mathbb{B}_r)}.$$
Also,
$$\mathbb{E}_{U\sim\mathbb{S}_r}[C(K+U)\,U] = r\,\mathbb{E}_{U\sim\mathbb{S}_r}\Big[C(K+U)\,\frac{U}{r}\Big] = r\cdot\frac{\int_{\mathbb{S}_r} C(K+U)\,\frac{U}{\|U\|_F}\,dU}{\mathrm{vol}_{d-1}(\mathbb{S}_r)}.$$
The lemma follows from combining these equations with the fact that $\mathrm{vol}_d(\mathbb{B}_r) = \mathrm{vol}_{d-1}(\mathbb{S}_r)\cdot\frac{r}{d}$.

From the lemma above and standard concentration inequalities, it is immediate that a polynomial number of samples suffices to approximate the gradient.
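This estimator can be sketched concretely on the toy LQR used earlier (illustrative matrices; the exact gradient $\nabla C(K) = 2E_K\Sigma_K$ is used only to measure the estimation error). One practical note: subtracting the baseline $C(K)$ leaves the expectation unchanged, since $\mathbb{E}[U] = 0$, and only reduces variance; the baseline is a standard trick added for the demo, not part of the lemma.

```python
# Zeroth-order estimate of grad C_r(K):  (n / r^2) * E_{U ~ S_r}[ C(K + U) U ],
# with n = K.size and U uniform on the Frobenius sphere of radius r.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.array([[1.0]]), np.eye(2)

def lyap(M, X):
    """Solve S = X + M S M^T via the Kronecker trick."""
    n = M.shape[0]
    return np.linalg.solve(np.eye(n * n) - np.kron(M, M), X.reshape(-1)).reshape(n, n)

def cost(K):
    M = A - B @ K
    return np.trace(lyap(M.T, Q + K.T @ R @ K) @ Sigma0)

def exact_grad(K):
    M = A - B @ K
    P = lyap(M.T, Q + K.T @ R @ K)
    E = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return 2 * E @ lyap(M, Sigma0)

rng = np.random.default_rng(1)
K = np.zeros((1, 2))
n, r, m = K.size, 0.01, 20000
c0, est = cost(K), np.zeros_like(K)
for _ in range(m):
    G = rng.standard_normal(K.shape)
    U = r * G / np.linalg.norm(G)          # uniform on the Frobenius sphere S_r
    est += (n / r**2) * (cost(K + U) - c0) * U
est /= m
rel_err = np.linalg.norm(est - exact_grad(K)) / np.linalg.norm(exact_grad(K))
```

With $m = 20{,}000$ function evaluations the relative error is a few percent, in line with the polynomial sample complexity claimed next.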
Lemma 30.
Given an $\epsilon$, there are fixed polynomials $h_r(1/\epsilon)$, $h_{\mathrm{sample}}(d,1/\epsilon)$ such that when $r \le 1/h_r(1/\epsilon)$, with $m \ge h_{\mathrm{sample}}(d,1/\epsilon)$ samples $U_1,\ldots,U_m \sim \mathbb{S}_r$, with high probability (at least $1-(d/\epsilon)^{-d}$) the average
$$\hat\nabla = \frac{1}{m}\sum_{i=1}^m \frac{d}{r^2}\,C(K+U_i)\,U_i$$
is $\epsilon$-close to $\nabla C(K)$ in Frobenius norm.

Further, if for $x_0\sim\mathcal{D}$, $\|x_0\| \le L$ almost surely, there are polynomials $h_{\ell,\mathrm{grad}}(d,1/\epsilon)$, $h_{r,\mathrm{trunc}}(1/\epsilon)$, $h_{\mathrm{sample,trunc}}(d,1/\epsilon,L^2/\mu)$ such that when $m \ge h_{\mathrm{sample,trunc}}(d,1/\epsilon,L^2/\mu)$ and $\ell \ge h_{\ell,\mathrm{grad}}(d,1/\epsilon)$, letting $x^{(i)}_j, u^{(i)}_j$ (for $0 \le j < \ell$) be a single path sampled using the policy $K+U_i$, the average
$$\tilde\nabla = \frac{1}{m}\sum_{i=1}^m \frac{d}{r^2}\Big[\sum_{j=0}^{\ell-1} (x^{(i)}_j)^\top Q x^{(i)}_j + (u^{(i)}_j)^\top R u^{(i)}_j\Big]\, U_i$$
is also $\epsilon$-close to $\nabla C(K)$ in Frobenius norm with high probability.

Proof. For the first part, the difference is broken into two terms:
$$\hat\nabla - \nabla C(K) = \big(\nabla C_r(K) - \nabla C(K)\big) + \big(\hat\nabla - \nabla C_r(K)\big).$$
For the first term, choose $h_r(1/\epsilon)$ so that $1/h_r(1/\epsilon) = \min\{r_0,\, \epsilon/(2h_{\mathrm{grad}})\}$ ($r_0$ is chosen later). By Lemma 28, when $r \le \epsilon/(2h_{\mathrm{grad}})$, every point $U$ on the sphere satisfies $\|\nabla C(K+U) - \nabla C(K)\|_F \le \epsilon/2$. Since $\nabla C_r(K)$ is the expectation of $\nabla C(K+U)$ over $U\sim\mathbb{B}_r$, the triangle inequality gives $\|\nabla C_r(K) - \nabla C(K)\|_F \le \epsilon/2$. The proof also makes sure that $r \le r_0$, where $r_0$ is such that for any $U\sim\mathbb{S}_r$ it holds that $C(K+U) \le 2C(K)$. By Lemma 27, $1/r_0$ is a polynomial in the relevant factors.

For the second term, by Lemma 29, $\mathbb{E}[\hat\nabla] = \nabla C_r(K)$, and each individual sample has norm bounded by $2dC(K)/r$, so by the vector Bernstein inequality, with
$$m \ge h_{\mathrm{sample}}(d,1/\epsilon) = \Theta\left(d\left(\frac{d\,C(K)}{\epsilon\, r}\right)^2\log(d/\epsilon)\right)$$
samples, with high probability (at least $1-(d/\epsilon)^{-d}$), $\|\hat\nabla - \mathbb{E}[\hat\nabla]\|_F \le \epsilon/2$. Adding these two terms and applying the triangle inequality gives the result.

For the second part, the proof breaks the difference into more terms. Let $\nabla'$ be equal to $\frac{1}{m}\sum_{i=1}^m \frac{d}{r^2}\, C^{(\ell)}(K+U_i)\, U_i$ (where $C^{(\ell)}$ is defined as in Lemma 26); then
$$\tilde\nabla - \nabla C(K) = (\tilde\nabla - \nabla') + (\nabla' - \hat\nabla) + (\hat\nabla - \nabla C(K)).$$
The third term is just what was bounded earlier; by choosing $h_{r,\mathrm{trunc}}(1/\epsilon) = h_r(2/\epsilon)$ and making sure $h_{\mathrm{sample,trunc}}(d,1/\epsilon,L^2/\mu) \ge h_{\mathrm{sample}}(d,2/\epsilon)$, it is guaranteed to be smaller than $\epsilon/2$.

For the second term, choose $\ell \ge h_{\ell,\mathrm{grad}}(d,1/\epsilon)$, a polynomial chosen via Lemma 26 (of order $d^2\,C(K)^2(\|Q\|+\|R\|\|K\|^2)/(\epsilon\, r\, \mu\, \sigma_{\min}(Q)^2)$) so that for any $K'$ with $C(K') \le 2C(K)$ it holds that $|C^{(\ell)}(K') - C(K')| \le \frac{r\epsilon}{4d}$. Therefore, by the triangle inequality,
$$\Big\|\frac{1}{m}\sum_{i=1}^m \frac{d}{r^2}\,C^{(\ell)}(K+U_i)\,U_i - \frac{1}{m}\sum_{i=1}^m \frac{d}{r^2}\,C(K+U_i)\,U_i\Big\| \le \frac{d}{r^2}\cdot\frac{r\epsilon}{4d}\cdot r = \frac{\epsilon}{4}.$$
Finally, for the first term, it is easy to see that $\mathbb{E}[\tilde\nabla] = \nabla'$, where the expectation is taken over the randomness of the initial states $x^{(i)}_0$. Since $\|x^{(i)}_0\| \le L$, we have $x^{(i)}_0(x^{(i)}_0)^\top \preceq \frac{L^2}{\mu}\,\mathbb{E}[x_0x_0^\top]$, and as a result the sum
$$\sum_{j=0}^{\ell-1} (x^{(i)}_j)^\top Q x^{(i)}_j + (u^{(i)}_j)^\top R u^{(i)}_j \le \frac{L^2}{\mu}\,C(K+U_i).$$
Therefore $\tilde\nabla - \nabla'$ is again an average of independent vectors with bounded norm, so by the vector Bernstein inequality, when $h_{\mathrm{sample,trunc}}(d,1/\epsilon,L^2/\mu)$ is a large enough polynomial, $\|\tilde\nabla - \nabla'\| \le \epsilon/4$ with high probability. Adding all the terms finishes the proof.

Theorem 31.
There are fixed polynomials $h_{GD,r}(1/\epsilon)$, $h_{GD,\mathrm{sample}}(d,1/\epsilon,L^2/\mu)$, $h_{GD,\ell}(d,1/\epsilon)$ such that if at every step the gradient is estimated as in Lemma 30 (truncated at step $\ell$), with the stepsize $\eta$ and $T$ picked as in the gradient descent case of Theorem 7, then $C(K_T) - C(K^*) \le \epsilon$ with high probability (at least $1-\exp(-d)$).

Proof. By Lemma 24, when $\eta \le 1/h_{GD,\eta}$ for some fixed polynomial $h_{GD,\eta}$ (given in Lemma 24),
$$C(K') - C(K^*) \le \left(1 - \frac{2\,\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K) - C(K^*)\big).$$
Let $\tilde\nabla$ be the approximate gradient computed, and let $K'' = K - \eta\tilde\nabla$ be the iterate that uses the approximate gradient. The proof shows that, given enough samples, the gradient can be estimated with enough accuracy to ensure
$$|C(K'') - C(K')| \le \frac{\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\cdot\epsilon.$$
Then, when $C(K) - C(K^*) \ge \epsilon$, it holds that
$$C(K'') - C(K^*) \le \left(1 - \frac{\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K) - C(K^*)\big),$$
and the same proof as for Theorem 7 gives the convergence guarantee.

Now $|C(K'') - C(K')|$ is bounded. By Lemma 27, if
$$\|K''-K'\| \le \frac{\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\cdot\epsilon\cdot\frac{1}{h_{\mathrm{func}}}$$
($h_{\mathrm{func}}$ being the polynomial in Lemma 27), then $|C(K'') - C(K')|$ is small enough. To get that, observe $K'' - K' = \eta(\nabla - \tilde\nabla)$; therefore it suffices to ensure
$$\|\nabla - \tilde\nabla\| \le \frac{\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\cdot\epsilon\cdot\frac{1}{h_{\mathrm{func}}}.$$
By Lemma 30, it suffices to pick $h_{GD,r}(1/\epsilon) = h_{r,\mathrm{trunc}}\big(2h_{\mathrm{func}}\|\Sigma_{K^*}\|/(\mu^2\sigma_{\min}(R)\epsilon)\big)$, $h_{GD,\mathrm{sample}}(d,1/\epsilon,L^2/\mu) = h_{\mathrm{sample,trunc}}\big(d,\, 2h_{\mathrm{func}}\|\Sigma_{K^*}\|/(\mu^2\sigma_{\min}(R)\epsilon),\, L^2/\mu\big)$, and $h_{GD,\ell}(d,1/\epsilon) = h_{\ell,\mathrm{grad}}\big(d,\, 2h_{\mathrm{func}}\|\Sigma_{K^*}\|/(\mu^2\sigma_{\min}(R)\epsilon)\big)$. This gives the desired upper bound on $\|\nabla - \tilde\nabla\|$ with high probability (at least $1-(d/\epsilon)^{-d}$).

Since the number of steps is polynomial, by a union bound, with high probability (at least $1 - T(d/\epsilon)^{-d} \ge 1-\exp(-d)$) the gradient is accurate enough at all the steps, so at every step
$$C(K'') - C(K^*) \le \left(1 - \frac{\eta\,\sigma_{\min}(R)\,\mu^2}{\|\Sigma_{K^*}\|}\right)\big(C(K) - C(K^*)\big).$$
The rest of the proof is the same as for Theorem 7. Note that in the smoothing, because the function value is monotonically decreasing and by the choice of radius, every function value encountered is bounded by $2C(K_0)$, so the polynomials are indeed bounded throughout the algorithm.

D.4 The natural gradient analysis
Before the theorem for the natural gradient is proven, the following lemma shows that the covariance matrix $\Sigma_K$ can be estimated accurately.
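A simplified numerical sketch of this covariance estimation (omitting the policy perturbations $U_i$ of the lemma for clarity, and with illustrative matrices): initial states are drawn uniformly from the sphere of radius $\sqrt{d}$, so $\mathbb{E}[x_0x_0^\top] = I$ (that is, $\mu = 1$) and $\|x_0\|^2 = d$ is bounded, as the lemma assumes.

```python
# Monte-Carlo estimate of Sigma_K from truncated trajectories, compared to the
# exact Sigma_K obtained from a discrete Lyapunov solve.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.zeros((1, 2))
M = A - B @ K
d = 2

def lyap(Mat, X):
    """Solve S = X + Mat S Mat^T via the Kronecker trick."""
    n = Mat.shape[0]
    return np.linalg.solve(np.eye(n * n) - np.kron(Mat, Mat), X.reshape(-1)).reshape(n, n)

Sigma_exact = lyap(M, np.eye(d))       # Sigma_K = sum_j M^j Sigma_0 (M^T)^j

rng = np.random.default_rng(2)
m, ell = 2000, 100                     # number of trajectories, truncation horizon
Sigma_hat = np.zeros((d, d))
for _ in range(m):
    x = rng.standard_normal(d)
    x *= np.sqrt(d) / np.linalg.norm(x)   # x0 uniform on the sphere of radius sqrt(d)
    for _ in range(ell):
        Sigma_hat += np.outer(x, x)
        x = M @ x
Sigma_hat /= m
rel_err = np.linalg.norm(Sigma_hat - Sigma_exact) / np.linalg.norm(Sigma_exact)
```

The estimate is accurate to a few percent, and in particular its smallest eigenvalue stays well above $\mu/2$, which is the property the natural gradient update needs when inverting the estimated covariance.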
Lemma 32.
If for x ∼ D , (cid:107) x (cid:107) ≤ L almost surely, there exists polynomials h r,var (1 /(cid:15) ) , h varsample,trunc ( d, /(cid:15), L /µ ) and h (cid:96),var ( d, /(cid:15) ) such that if ˆΣ K is estimated using at least m ≥ h varsample,trunc ( d, /(cid:15), L /µ ) initialpoints x , ..., x m , m random perturbations U i ∼ S r where r ≤ /h r,var (1 /(cid:15) ) , all of these initial pointsare simulated using ˆ K i = K + U i to (cid:96) ≥ h (cid:96),var ( d, /(cid:15) ) iterations, then with high probability (at least − ( d/(cid:15) ) − d ) the following estimate ˜Σ = 1 m m (cid:88) i =1 (cid:96) − (cid:88) j =0 x ij ( x ij ) (cid:62) . satisfies (cid:107) ˜Σ − Σ K (cid:107) ≤ (cid:15) . Further, when (cid:15) ≤ µ/ , it holds that σ min ( ˆΣ K ) ≥ µ/ .Proof. This is broken into three terms: let Σ ( (cid:96) ) K be defined as in Lemma 26, let ˆΣ = m (cid:80) mi =1 Σ K + U i and ˆΣ ( (cid:96) ) = m (cid:80) mi =1 Σ ( (cid:96) ) K + U i , then it holds that ˜Σ − Σ K = ( ˜Σ − ˆΣ ( (cid:96) ) ) + ( ˆΣ ( (cid:96) ) − ˆΣ) + ( ˆΣ − Σ K ) . First, r is chosen small enough so that C ( K + U i ) ≤ C ( K ) . This only requires an inverse polynomial r by Lemma 27. 35or the first term, note that E [ ˜Σ] = ˆΣ ( (cid:96) ) where the expectation is taken over the initial points x i . Since (cid:107) x i (cid:107) ≤ L , ( x i )( x i ) (cid:62) (cid:22) L µ E [ x x (cid:62) ] , and as a result the sum (cid:96) − (cid:88) j =0 x ij ( x ij ) (cid:62) Q (cid:22) L µ Σ K + U i . Therefore, standard concentration bounds show that when h varsample,trunc is a large enough polynomial, (cid:107) ˜Σ − ˆΣ ( (cid:96) ) (cid:107) ≤ (cid:15)/ holds with high probability.For the second term, Lemma 26 is applied. Because C ( K + U i ) ≤ C ( K ) , choosing (cid:96) ≥ h (cid:96),var ( d, /(cid:15) ) = d · C ( K ) (cid:15)µσ min ( Q ) , the error introduced by truncation (cid:107) ˆΣ ( (cid:96) ) − ˆΣ (cid:107) is then bounded by (cid:15)/ .For the third term, Lemma 16 is applied. 
When
$$r \le \epsilon \cdot \frac{\sigma_{\min}(Q)}{C(K)} \cdot \frac{\mu}{\|B\|(\|A - BK\| + 1)},$$
we have $\|\Sigma_{K+U_i} - \Sigma_K\| \le \epsilon/3$. Since $\hat\Sigma$ is the average of the $\Sigma_{K+U_i}$, by the triangle inequality $\|\hat\Sigma - \Sigma_K\| \le \epsilon/3$.

Adding the three terms gives the result. Finally, the bound on $\sigma_{\min}(\tilde\Sigma)$ follows simply from Weyl's theorem.

Theorem 33.
Suppose $C(K_0)$ is finite and $\mu > 0$. The natural gradient method follows the update rule
$$K_{t+1} = K_t - \eta\, \nabla C(K_t)\, \Sigma_{K_t}^{-1}.$$
Suppose the stepsize is set to
$$\eta = \frac{1}{\|R\| + \frac{\|B\|^2 C(K_0)}{\mu}}.$$
If the gradient and the covariance are estimated as in Lemma 30 and Lemma 32 with $r = 1/h_{\mathrm{NGD},r}(1/\epsilon)$, with $m \ge h_{\mathrm{NGD,sample}}(d, 1/\epsilon, L^2/\mu)$ samples, and both are truncated to $h_{\mathrm{NGD},\ell}(d, 1/\epsilon)$ iterations, then with high probability (at least $1 - \exp(-d)$), after
$$T > \frac{\|\Sigma_{K^*}\|}{\mu} \left( \frac{\|R\|}{\sigma_{\min}(R)} + \frac{\|B\|^2 C(K_0)}{\mu\, \sigma_{\min}(R)} \right) \log \frac{2(C(K_0) - C(K^*))}{\varepsilon}$$
iterations, the natural gradient method satisfies the following performance bound:
$$C(K_T) - C(K^*) \le \varepsilon.$$

Proof.
By Lemma 15,
$$C(K') - C(K^*) \le \left(1 - \frac{\eta\, \sigma_{\min}(R)\, \mu}{\|\Sigma_{K^*}\|}\right) (C(K) - C(K^*)).$$
Let $\tilde\nabla$ be the estimated gradient, $\tilde\Sigma_K$ the estimated $\Sigma_K$, and let $K'' = K - \eta\, \tilde\nabla\, \tilde\Sigma_K^{-1}$. The proof shows that when both the gradient and the covariance matrix are estimated accurately enough,
$$|C(K') - C(K'')| \le \frac{\epsilon}{2} \cdot \frac{\eta\, \sigma_{\min}(R)\, \mu}{\|\Sigma_{K^*}\|}.$$
This implies that when $C(K) - C(K^*) \ge \epsilon$,
$$C(K'') - C(K^*) \le \left(1 - \frac{\eta\, \sigma_{\min}(R)\, \mu}{2\|\Sigma_{K^*}\|}\right)(C(K) - C(K^*)).$$
There is a polynomial $h_{\mathrm{func}}$ such that whenever
$$\|K'' - K'\| \le \frac{\epsilon\, \eta\, \sigma_{\min}(R)\, \mu}{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|},$$
the desired bound on $|C(K') - C(K'')|$ holds. To achieve this, it suffices to have
$$\|\tilde\nabla\, \tilde\Sigma_K^{-1} - \nabla C(K)\, \Sigma_K^{-1}\| \le \frac{\epsilon\, \sigma_{\min}(R)\, \mu}{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|}.$$
This is broken into two terms:
$$\|\tilde\nabla\, \tilde\Sigma_K^{-1} - \nabla C(K)\, \Sigma_K^{-1}\| \le \|\tilde\nabla - \nabla C(K)\|\, \|\tilde\Sigma_K^{-1}\| + \|\nabla C(K)\|\, \|\tilde\Sigma_K^{-1} - \Sigma_K^{-1}\|.$$
For the first term, by Lemma 32, when the number of samples is large enough, $\|\tilde\Sigma_K^{-1}\| \le 2/\mu$. Therefore it suffices to ensure
$$\|\tilde\nabla - \nabla C(K)\| \le \frac{\epsilon\, \sigma_{\min}(R)\, \mu^2}{2 h_{\mathrm{func}}\, \|\Sigma_{K^*}\|};$$
this can be done by Lemma 30, setting $h_{\mathrm{NGD,grad},r}(1/\epsilon) = h_{r,\mathrm{trunc}}\!\left(\frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|}{\mu^2\, \sigma_{\min}(R)\, \epsilon}\right)$, $h_{\mathrm{NGD,gradsample}}(d, 1/\epsilon, L/\mu) = h_{\mathrm{sample,trunc}}\!\left(d, \frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|}{\mu^2\, \sigma_{\min}(R)\, \epsilon}, L/\mu\right)$ and $h_{\mathrm{NGD},\ell,\mathrm{grad}}(d, 1/\epsilon) = h_{\ell,\mathrm{grad}}\!\left(d, \frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|}{\mu^2\, \sigma_{\min}(R)\, \epsilon}\right)$.

For the second term, it suffices to ensure
$$\|\tilde\Sigma_K^{-1} - \Sigma_K^{-1}\| \le \frac{\epsilon\, \sigma_{\min}(R)\, \mu}{2 h_{\mathrm{func}}\, \|\Sigma_{K^*}\|\, \|\nabla C(K)\|}.$$
By standard matrix perturbation, if $\sigma_{\min}(\Sigma_K) \ge \mu$ and $\|\tilde\Sigma_K - \Sigma_K\| \le \mu/2$, then
$$\|\tilde\Sigma_K^{-1} - \Sigma_K^{-1}\| \le \frac{2\|\tilde\Sigma_K - \Sigma_K\|}{\mu^2}.$$
Therefore by Lemma 32 it suffices to choose $h_{\mathrm{NGD,var},r}(1/\epsilon) = h_{\mathrm{var},r}\!\left(\frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|\, \|\nabla C(K)\|}{\mu^3\, \sigma_{\min}(R)\, \epsilon}\right)$, $h_{\mathrm{NGD,varsample}}(d, 1/\epsilon, L/\mu) = h_{\mathrm{varsample,trunc}}\!\left(d, \frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|\, \|\nabla C(K)\|}{\mu^3\, \sigma_{\min}(R)\, \epsilon}, L/\mu\right)$ and $h_{\mathrm{NGD},\ell,\mathrm{var}}(d, 1/\epsilon) = h_{\ell,\mathrm{var}}\!\left(d, \frac{h_{\mathrm{func}}\, \|\Sigma_{K^*}\|\, \|\nabla C(K)\|}{\mu^3\, \sigma_{\min}(R)\, \epsilon}\right)$. These are indeed polynomials because $\|\nabla C(K)\|$ is bounded by Lemma 25.

Finally, choose $h_{\mathrm{NGD},r} = \max\{h_{\mathrm{NGD,grad},r}, h_{\mathrm{NGD,var},r}\}$, $h_{\mathrm{NGD,sample}} = \max\{h_{\mathrm{NGD,gradsample}}, h_{\mathrm{NGD,varsample}}\}$, and $h_{\mathrm{NGD},\ell} = \max\{h_{\mathrm{NGD},\ell,\mathrm{grad}}, h_{\mathrm{NGD},\ell,\mathrm{var}}\}$. This ensures that all the bounds mentioned above hold and that
$$C(K'') - C(K^*) \le \left(1 - \frac{\eta\, \sigma_{\min}(R)\, \mu}{2\|\Sigma_{K^*}\|}\right)(C(K) - C(K^*)).$$
The rest of the proof is the same as for Theorem 7. Note again that in the smoothing, because the function value is monotonically decreasing and by the choice of radius, all the function values encountered are bounded by $2C(K_0)$, so the polynomials are indeed bounded throughout the algorithm.

D.5 Standard Matrix Perturbation and Concentrations
In the previous sections, we used several standard tools from matrix perturbation theory and matrix concentration, which we summarize here. The matrix perturbation theorems can be found in Stewart and Sun [1990]; the matrix concentration bounds can be found in Tropp [2012].
Theorem 34 (Weyl's Theorem). Suppose $B = A + E$; then the singular values of $B$ are within $\|E\|$ of the corresponding singular values of $A$. In particular, $\|B\| \le \|A\| + \|E\|$ and $\sigma_{\min}(B) \ge \sigma_{\min}(A) - \|E\|$.

Theorem 35 (Perturbation of Inverse). Let $B = A + E$, and suppose $\|E\| \le \sigma_{\min}(A)/2$; then
$$\|B^{-1} - A^{-1}\| \le \frac{2\|A - B\|}{\sigma_{\min}(A)^2}.$$

Theorem 36 (Matrix Bernstein). Suppose $\hat A = \sum_i \hat A_i$, where the $\hat A_i$ are independent random matrices of dimension $d_1 \times d_2$ (let $d = d_1 + d_2$). Let $\mathbb{E}[\hat A] = A$, and define the variance terms $M_1 = \mathbb{E}[\sum_i \hat A_i \hat A_i^\top]$ and $M_2 = \mathbb{E}[\sum_i \hat A_i^\top \hat A_i]$. If $\sigma^2 = \max\{\|M_1\|, \|M_2\|\}$, and every $\hat A_i$ has spectral norm $\|\hat A_i\| \le R$ with probability 1, then with high probability
$$\|\hat A - A\| \le O\!\left(R \log d + \sqrt{\sigma^2 \log d}\right).$$
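As a quick numerical sanity check of Theorems 34 and 35, the snippet below draws an arbitrary well-conditioned matrix $A$ and a perturbation $E$ scaled so that the hypothesis $\|E\| \le \sigma_{\min}(A)/2$ holds, then verifies both bounds; the specific matrices and scaling constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 5.0 * np.eye(4) + 0.5 * rng.standard_normal((4, 4))  # well-conditioned A
smin = np.linalg.svd(A, compute_uv=False)[-1]            # sigma_min(A)

E = rng.standard_normal((4, 4))
E *= 0.4 * smin / np.linalg.norm(E, 2)                   # enforce ||E|| <= sigma_min(A)/2
B = A + E

# Weyl: each singular value moves by at most the spectral norm of E
sv_A = np.linalg.svd(A, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
assert np.max(np.abs(sv_A - sv_B)) <= np.linalg.norm(E, 2) + 1e-12

# Perturbation of inverse: ||B^{-1} - A^{-1}|| <= 2 ||A - B|| / sigma_min(A)^2
lhs = np.linalg.norm(np.linalg.inv(B) - np.linalg.inv(A), 2)
rhs = 2 * np.linalg.norm(A - B, 2) / smin ** 2
assert lhs <= rhs
```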
[Figure 1 plot: relative error $(C(K) - C(K^*))/C(K^*)$ versus iteration (0-100), titled "Convergence of the Gradient Descent for LQR with Gradient Oracle".]

Figure 1: Simulation results with Gradient Descent.*

*The simulation was done by Jingjing Bu.
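A miniature version of the experiment behind Figure 1 can be sketched as follows. This is an illustrative reimplementation, not the code used for the figure: the dimensions, random system, and backtracking constants are arbitrary choices, and the gradient oracle uses the expression $\nabla C(K) = 2\big((R + B^\top P_K B)K - B^\top P_K A\big)\Sigma_K$ with the Lyapunov equations solved by truncated series.

```python
import numpy as np

def lyap_sum(M, S0, iters=400):
    # Truncated series for X = S0 + M X M^T (valid when M is stable).
    X, T = S0.copy(), S0.copy()
    for _ in range(iters):
        T = M @ T @ M.T
        X = X + T
    return X

def lqr_cost(A, B, Q, R, K, Sigma0):
    cl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(cl))) >= 1.0:
        return np.inf                     # unstable closed loop: infinite cost
    P = lyap_sum(cl.T, Q + K.T @ R @ K)   # P_K from the value-function recursion
    return np.trace(P @ Sigma0)

def lqr_grad(A, B, Q, R, K, Sigma0):
    cl = A - B @ K
    P = lyap_sum(cl.T, Q + K.T @ R @ K)
    Sigma = lyap_sum(cl, Sigma0)
    return 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma

rng = np.random.default_rng(0)
d, k = 4, 2
A = rng.standard_normal((d, d))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # rescale so A is stable
B = rng.standard_normal((d, k))
Q, R, Sigma0 = np.eye(d), np.eye(k), np.eye(d)

K = np.zeros((k, d))                               # C(K0) finite since A is stable
costs = [lqr_cost(A, B, Q, R, K, Sigma0)]
for _ in range(100):
    G = lqr_grad(A, B, Q, R, K, Sigma0)
    step = 1.0
    # backtracking (Armijo) line search on the exact cost
    while step > 1e-12 and lqr_cost(A, B, Q, R, K - step * G, Sigma0) > \
            costs[-1] - 0.5 * step * np.sum(G * G):
        step /= 2
    K = K - step * G
    costs.append(lqr_cost(A, B, Q, R, K, Sigma0))
```

Because the line search only accepts steps that decrease the (exact) cost, the recorded cost sequence is monotonically decreasing, mirroring the behavior in the figure.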
In our proofs we often treat a matrix as a vector and work with its Frobenius norm; in these cases we use the following corollary:
Theorem 37 (Vector Bernstein). Suppose $\hat a = \sum_i \hat a_i$, where the $\hat a_i$ are independent random vectors of dimension $d$. Let $\mathbb{E}[\hat a] = a$, and let the variance be $\sigma^2 = \mathbb{E}[\sum_i \|\hat a_i\|^2]$. If every $\hat a_i$ has norm $\|\hat a_i\| \le R$ with probability 1, then with high probability
$$\|\hat a - a\| \le O\!\left(R \log d + \sqrt{\sigma^2 \log d}\right).$$

E Simulation Results
Here we give simulations of the gradient descent algorithm (with backtracking step size) to show that the algorithm indeed converges within a reasonable time in practice. In this experiment, $x \in \mathbb{R}^d$ and $u \in \mathbb{R}^k$. We use random matrices $A, B$. The scaling of $A$ is chosen so that $A$ is stable with high probability ($\lambda_{\max}(A) < 1$). We initialize the solution at $K_0 = 0$, which ensures $C(K_0)$ is finite because $A$ is stable. The distribution of the initial point $x_0$