CoNES: Convex Natural Evolutionary Strategies
Sushant Veer
Mechanical and Aerospace Engineering, Princeton University
[email protected]
Anirudha Majumdar
Mechanical and Aerospace Engineering, Princeton University
[email protected]
Abstract
We present a novel algorithm – convex natural evolutionary strategies (CoNES) – for optimizing high-dimensional blackbox functions by leveraging tools from convex optimization and information geometry. CoNES is formulated as an efficiently-solvable convex program that adapts the evolutionary strategies (ES) gradient estimate to promote rapid convergence. The resulting algorithm is invariant to the parameterization of the belief distribution. Our numerical results demonstrate that CoNES vastly outperforms conventional blackbox optimization methods on a suite of functions used for benchmarking blackbox optimizers. Furthermore, CoNES demonstrates the ability to converge faster than conventional blackbox methods on a selection of OpenAI's MuJoCo reinforcement learning tasks for locomotion.
Policy optimization in reinforcement learning (RL) can be posed as a blackbox optimization problem: given access to a "blackbox" in the form of a simulator or robot hardware, find a setting of policy parameters that maximizes rewards. This perspective has led to significant recent interest from the RL community towards scaling blackbox optimization methods and has catapulted the use of blackbox optimizers from low-dimensional hyperparameter tuning [18, 25] to training deep neural networks (DNNs) with thousands of parameters [12, 13, 14, 31, 33, 42]. Despite these promising advances, the sample complexity of blackbox methods remains high and is the subject of ongoing research.

In this paper we study a class of blackbox optimization methods called evolutionary strategies (ES) [41, 42]. ES methods maintain a belief distribution on the domain of candidates. At each iteration, a batch of candidates is sampled from this distribution and their fitness is evaluated. These fitness scores are used to obtain a Monte-Carlo (MC) estimate of the loss function's gradient with respect to the parameters of the belief distribution. In the domain of ES for RL, approaches that adapt the sampling rate from the belief distribution and reuse samples from previous iterations have been proposed to improve the sample complexity [12, 13]. However, standard ES methods are not invariant to re-parameterizations of the belief distribution. Hence, the choice of belief parameterization (e.g., encoding the covariance as a symmetric positive definite matrix vs. a Cholesky decomposition) can affect the rate of convergence and cause undesirable behavior (e.g., oscillations) [48]. In contrast, ES techniques based on the natural gradient [5, 44, 48] are parameterization invariant and can demonstrate improved sample efficiency. However, these methods have not been thoroughly exploited in RL due to the difficulties in computing the natural gradient for high-dimensional problems; in particular, computing the natural gradient requires the challenging estimation of the Fisher information matrix.

In this paper, we present a novel algorithm – convex natural evolutionary strategies (CoNES) – that leverages results on the natural gradient [5, 44, 48] from information geometry [6] and couples them with powerful tools from convex optimization (e.g., second-order cone programming [9] and geometric programming [8]) to promote rapid convergence. In particular, CoNES refines a crude gradient estimate by transforming it through a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. The relationship to natural evolutionary strategies (NES) [48] comes from the fact that the limiting solution of the KL-constrained optimization problem (as the "radius" of the KL-divergence ball shrinks to zero) corresponds to the natural gradient. However, in contrast to NES [48], CoNES circumvents the estimation of the Fisher information matrix by directly solving the convex KL-constrained optimization problem.

Figure 1:
Illustration demonstrating the importance of accounting for the step length when choosing the update direction. At the belief distribution expressed in the coordinates θ, if we follow the negative of the gradient direction (right), then, with the step size ∆θ, the loss increases. However, accounting for the step size while choosing the direction, we would go left and the loss would decrease.

Furthermore, tuning the radius of the KL-divergence ball facilitates better alignment of the update direction with the update step size, yielding faster convergence than NES (which provides the steepest ascent direction for infinitesimal step lengths); see Fig. 1 for an illustration that demonstrates the importance of accounting for the step length when choosing the update direction.

Our theoretical results establish that CoNES is invariant to the parameterization of the belief distribution (e.g., encoding the covariance as a symmetric positive definite matrix or a Cholesky decomposition does not affect the solution of the CoNES optimization problem). Parameterization invariance ensures that we are working with the intrinsic mathematical object (i.e., a probability distribution) and that the specific encoding of this object does not affect the outcome. Moreover, CoNES is agnostic to the method that generates the crude gradient estimate and can thus potentially be combined with various existing ES methods, such as [12, 13, 42]. Through our numerical results we demonstrate that CoNES vastly outperforms various conventional blackbox optimizers on a suite of 5000-dimensional benchmark functions for blackbox optimizers:
Sphere, Rosenbrock, Rastrigin, and Lunacek. We also demonstrate the improved sample complexity achieved by CoNES on the following OpenAI MuJoCo RL tasks: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2.

Blackbox optimization.
Various engineering problems require optimizing systems for which the governing mechanisms are not explicitly known; e.g., system identification of complex physical systems [4] and mechanism design [7]. Blackbox optimization techniques such as Nelder-Mead [36], evolutionary strategies (ES) [41], simulated annealing [28], genetic algorithms [24], the cross-entropy method [16], and covariance matrix adaptation (CMA) [20] were developed to address such problems. Recently, the growing potential of these methods for training control policies with reinforcement learning [11, 12, 13, 14, 19, 31, 33, 42] has reignited interest in blackbox optimizers. In this paper, we will primarily consider the class of blackbox optimizers that fall under the purview of ES.
Evolutionary strategies for reinforcement learning.
In RL tasks, the advantages of ES – high parallelizability, better robustness, and richer exploration – were first demonstrated in [42]. Spurred by these findings, a plethora of recent developments aimed at improving ES for RL have emerged, some of which include: explicit novelty search regularization to avoid local minima [14], robustification of ES and efficient re-use of prior rollouts [12], and adaptive sampling for the ES gradient estimate [13]. We remark that all the above papers focus on improving the ES MC gradient estimator. In contrast, this paper presents a method that refines the ES gradient estimate – regardless of where that estimate comes from – by solving a convex program.
Natural gradient.
Our method is directly motivated by the concept of the natural gradient [6]. The application of the natural gradient in learning was initially pioneered in [5] and was later demonstrated to be effective for RL [26], deep learning with backpropagation [39], and blackbox optimization with ES [44, 48]. However, the latent potential of the natural gradient has not been completely realized due to the difficulty of estimating the Fisher information matrix. Much of the prior work employing the natural gradient has focused on efficient estimation or computation of the Fisher information matrix [39, 44, 49]. In contrast, CoNES does not work directly with the Fisher information matrix. Instead, we approximate the update direction by solving a convex program that maximizes the loss while being constrained to a KL-divergence ball around the current belief distribution; as the radius of the KL-divergence ball goes to zero, the limiting solution of this convex program corresponds to the natural gradient (see Proposition 1).
Trust-regions for blackbox optimization.
Recent work on trust region methods for blackbox optimizers [2, 31, 34] performs updates on the belief distribution by optimizing the loss on a KL-divergence ball. However, [2, 34] perform the constrained optimization on a discretization of the belief distribution. The approach in [31] computes the KL-divergence for each dimension individually and bounds their maximum; the resulting optimization problem is approximated via a clipped surrogate objective similar to proximal policy optimization (PPO) [43]. In contrast, we exactly solve a KL-constrained problem whose solution approximates the natural gradient (as outlined above and formally discussed in Section 4.1) using powerful tools from convex optimization (e.g., second-order cone programming and geometric programming).
We denote a blackbox loss function by ˆl : X → R with X ⊆ R^m as its domain. Let P be a distribution on the domain X that signifies our belief of where the optimal candidate for ˆl resides. We assume that P belongs to the statistical manifold 𝒫 [45], which is a Riemannian manifold [40] of probability distributions. Any point P ∈ 𝒫 is expressed in the coordinates θ ∈ R^n. Rather than optimizing ˆl directly, we work with the loss function l : 𝒫 → R which provides the expected loss l(P) = E_{x∼P}[ˆl(x)] under the belief distribution P. When referring to the manifold in a coordinate-free setting, we express the loss as l : 𝒫 → R, whereas, when we work with a particular coordinate system on 𝒫, we express the loss as l : R^n → R; this abuse of notation creates no confusion as it will always be clear from context. The (Euclidean) gradient operator is denoted by ∇; the natural gradient operator is denoted by ˜∇; and the solution of CoNES is denoted by ˆ∇. The KL-divergence between two distributions is denoted by D(·||·) and the Euclidean inner product between two vectors is denoted by ⟨·,·⟩.

It is a commonly-held belief that the steepest ascent direction for a loss function l : 𝒫 → R is given by its gradient ∇l. However, this is only true if the domain 𝒫 is expressed in an orthonormal coordinate system in a Euclidean space. If the space 𝒫 admits a Riemannian manifold [40] structure, the steepest ascent direction is instead given by the natural gradient ˜∇l [6, Section 12.1.2]. Besides providing the steepest ascent direction on 𝒫, the natural gradient possesses various attractive properties: (a) the natural gradient is independent of the choice of coordinates θ on the statistical manifold 𝒫; (b) the natural gradient avoids saturation due to sigmoidal activation functions [6, Theorem 12.2]; (c) online natural-gradient learning is asymptotically Fisher efficient, i.e., it asymptotically approaches equality in the Cramér-Rao bound [5]. These qualities lay the foundation of our interest in leveraging the natural gradient in learning applications. In the rest of this section we present two explicit characterizations of the natural gradient relevant to this paper.

Let F(θ) be the Fisher information matrix for the Riemannian manifold of distributions 𝒫 described in the coordinates θ; e.g., Gaussian distributions can be expressed in the coordinates θ = (µ, vec ∘ upper-triangle(Σ)) where µ, Σ denote the mean and the covariance, respectively. The natural gradient then satisfies the following relation with the Euclidean gradient:

˜∇l(θ) = F(θ)⁻¹ ∇l(θ). (1)

For the second characterization of the natural gradient we need the Fisher-Rao norm ‖·‖_F : 𝒫 → [0, ∞) defined as ‖θ‖_F := √⟨θ, F(θ)θ⟩ [30, Definition 2]. Using this norm we can express the natural gradient as follows:

Proposition 1. [Adapted from [37, Proposition 1]]
Let 𝒫 be a statistical manifold, each point of which is a probability distribution P_θ parameterized by θ. Let l : 𝒫 → R be a loss function which maps a probability distribution P_θ to a scalar. Then, the natural gradient ˜∇l(θ) of the loss function computed at any θ satisfies:

˜∇l(θ) / ‖˜∇l(θ)‖_F = lim_{ε→0} argmax_{v ∈ R^n} l(θ + εv)  s.t.  D(P_{θ+εv} || P_θ) ≤ ε²/2. (2)

Proposition 1 states that the natural gradient is aligned with the direction v which maximizes the loss function in an infinitesimal KL-divergence ball around the current distribution P_θ. To avoid confusion, it is worth clarifying that the maximization in Proposition 1 computes the natural gradient, which can then be passed to a gradient-based optimizer to minimize the loss.

Remark 1.
Proposition 1 also holds true for the linear approximation of the loss function l(θ + εv) at θ. Intuitively, the reason for this is that the linear approximation locally converges to the loss function for arbitrarily small ε > 0.

The evolutionary strategies (ES) framework performs a Monte-Carlo estimate of the gradient of the loss with respect to the belief distribution [48, Section 2]:

∇l(θ) = ∇ E_{x∼P_θ}[ˆl(x)] = E_{x∼P_θ}[ˆl(x) ∇ ln P_θ(x)]. (3)

This gradient estimate is then supplied to a gradient-based optimizer to update the belief distribution. Note that (3) provides an estimate of the Euclidean gradient. Instead of using the Euclidean gradient (3), natural evolutionary strategies (NES) [44, 48] estimates the natural gradient by transforming the Euclidean gradient estimate (3) through (1).

Despite the various advantages offered by the natural gradient, the computationally expensive estimation of the Fisher information matrix F(θ) and its inverse makes it difficult to scale to very high-dimensional problems. Proposition 1 offers an alternative way to compute the natural gradient while obviating the need to estimate F(θ); however, (2) is a challenging non-convex optimization problem. To develop CoNES we "massage" (2) into an efficiently-solvable convex program.

We begin by relaxing the requirement lim_{ε→0} in (2) and instead allow a finite ε > 0:

v*(θ) ∈ argmax_v { l(θ + εv) | D(P_{θ+εv} || P_θ) ≤ ε, v ∈ R^n }, (4)

where ε is now a hyperparameter which can be as large as necessary (without loss of generality, the radius ε²/2 in (2) is replaced by ε). Using v*(θ) as the update direction could yield faster convergence than ˜∇l(θ). This may seem counter-intuitive because the natural gradient is the steepest ascent direction; however, that direction is only optimal for infinitesimal step lengths, and a finite ε permits us to align the search for the steepest ascent direction with the desired step length of the update, yielding rapid convergence; see Fig. 1 for an illustration.

We are interested in settings where the landscape of the loss function l is unknown and querying loss values of individual candidates is expensive. Even if the analytical form of l were available to us, (4) may be a non-convex problem and hence challenging to solve. To make this problem more tractable, we perform a Taylor expansion of the loss function, l(θ + εv) ≈ l(θ) + ⟨∇l(θ), εv⟩, and work with the following optimization problem:

v*(θ) ∈ argmax_v { l(θ) + ⟨∇l(θ), εv⟩ | D(P_{θ+εv} || P_θ) ≤ ε, v ∈ R^n }. (5)

In (5), l(θ) is a constant offset which does not affect the choice of v and can hence be ignored. Further, we denote δθ := εv and restate (5) as:

ˆ∇l(θ; ε) ∈ argmax_{δθ} { ⟨∇l(θ), δθ⟩ | D(P_{θ+δθ} || P_θ) ≤ ε, δθ ∈ R^n }. (6)

Despite these relaxations, the optimization problem (6) may still be intractable due to the lack of convexity of the feasible set. However, in the following theorem we establish for the Gaussian family of probability distributions that (6) is convex and can be solved in polynomial time.

Theorem 1.
The optimization problem (6) is:

• a semidefinite program (SDP) with an additional exponential cone constraint if 𝒫 is the space of Gaussian distributions;
• a second-order cone program (SOCP) with an additional exponential cone constraint if 𝒫 is the space of Gaussian distributions with diagonal covariance.

Proof. As the objective function of (6) is linear, we only need to verify the convexity of the feasible set. We first consider the case when 𝒫 is the space of Gaussian distributions. Let P_{θ+δθ} = N(µ, Σ) and P_θ = N(µ_0, Σ_0). Then:

D(P_{θ+δθ} || P_θ) = (1/2) ( Tr(Σ_0⁻¹Σ) + (µ − µ_0)ᵀ Σ_0⁻¹ (µ − µ_0) − log det(Σ) + log det(Σ_0) − n ), (7)

which is convex because Tr(Σ_0⁻¹Σ) is linear, (µ − µ_0)ᵀ Σ_0⁻¹ (µ − µ_0) is a positive-definite quadratic, and −log det(Σ) is convex. Finally, noting that log det constraints can be formulated as an SDP with an additional exponential cone constraint [1] completes the proof of this part.

Now we consider the family of Gaussian distributions P_{θ+δθ} = N(µ, Σ) and P_θ = N(µ_0, Σ_0) with diagonal covariance. We denote the means as µ = (µ_1, ..., µ_n) and µ_0 = (µ_{0,1}, ..., µ_{0,n}). The diagonal elements of the covariances Σ and Σ_0 are expressed as (σ_1, ..., σ_n) and (σ_{0,1}, ..., σ_{0,n}), respectively. Then, the KL-divergence between two distributions in this family is:

D(P_{θ+δθ} || P_θ) = (1/2) ∑_{i=1}^{n} ( σ_i/σ_{0,i} + (µ_i − µ_{0,i})²/σ_{0,i} − log(σ_i/σ_{0,i}) − 1 ). (8)

From (8), it follows that the problem (6) for this family of distributions is an SOCP with an additional exponential cone constraint (that arises from the log terms), completing the proof.

Figure 2:
Geometric illustration of CoNES.
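As a quick numerical sanity check on the constraint used in the diagonal-covariance case, the following sketch compares the closed-form KL-divergence (8) against a Monte-Carlo estimate; the variable names are illustrative, and the per-coordinate variances play the role of the diagonal entries of Σ.

```python
import numpy as np

def kl_diag_gauss(mu, var, mu0, var0):
    """Closed-form KL( N(mu, diag(var)) || N(mu0, diag(var0)) ), cf. (8)."""
    return 0.5 * np.sum(var / var0 + (mu - mu0) ** 2 / var0 - np.log(var / var0) - 1.0)

# Monte-Carlo check: KL(p||q) = E_{x~p}[log p(x) - log q(x)].
rng = np.random.default_rng(0)
n = 4
mu, var = rng.normal(size=n), np.exp(rng.normal(size=n))
mu0, var0 = rng.normal(size=n), np.exp(rng.normal(size=n))

x = mu + np.sqrt(var) * rng.standard_normal((200_000, n))
log_p = -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
log_q = -0.5 * np.sum((x - mu0) ** 2 / var0 + np.log(2 * np.pi * var0), axis=1)

print(kl_diag_gauss(mu, var, mu0, var0), np.mean(log_p - log_q))  # the two values should agree closely
```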
Restricting the class of belief distributions to those in Theorem 1 gives rise to CoNES: a family of convex programs that draws motivation from the concept of the natural gradient to transform the Euclidean gradient. To geometrically visualize CoNES, consider the illustration in Fig. 2. The orange surface is the loss landscape and the gray surface is the linearization of the loss at the point denoted by θ; in differential geometric terms, the orange surface is more accurately characterized as the manifold given by the graph of the loss l(θ), while the gray surface is the manifold's tangent space at (θ, l(θ)). The green arrow represents the solution of CoNES for a KL-divergence ball (light green region) with a very small ε, which can also be regarded as the natural gradient (modulo the norm) at θ by Remark 1. The red arrow is the solution of CoNES for a KL-divergence ball (light red region) with a larger ε. Note that this figure is an illustration; the KL-divergence balls may not necessarily manifest in the depicted shapes. The NES gradient is the sharpest ascent direction for an infinitesimal step size, but it may not be ideal for a larger step size. With CoNES, we can tune the scalar parameter ε to better align the update direction with the gradient-based optimizer's step size (learning rate), yielding faster updates. Indeed, the choice of ε is important to the performance of CoNES, as demonstrated in our numerical results in Section 7.2. The mechanism for selecting (or adapting) the hyperparameter ε is beyond the scope of this paper and will be explored in future work.

The pseudo-code for our implementation of CoNES as a blackbox optimizer is detailed in Algorithm 1. We use the ES gradient estimate (presented in Section 4.2) as the Gradient-Estimator in Line 5 of Algorithm 1; any estimator of the Euclidean gradient, such as [12, 13], can be used here. We use Adam [27] as our gradient-based optimizer in Line 7; any gradient-based optimizer can be used.
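Before presenting the pseudo-code, the following is a minimal CVXPY sketch of the CoNES transformation, i.e., problem (6), for the diagonal-Gaussian family, with the belief parameterized by its mean and per-coordinate variances. This is an illustrative sketch under that parameterization rather than our released implementation; the experiments in Section 7 use MOSEK [35], but any solver that handles exponential cone constraints can be substituted.

```python
import cvxpy as cp
import numpy as np

def cones_step(grad_mu, grad_var, mu0, var0, eps):
    """One CoNES transformation, cf. (6), for a diagonal-Gaussian belief
    parameterized by (mean, per-coordinate variance).

    grad_mu, grad_var: Euclidean gradient estimate w.r.t. the mean / variances.
    Returns the refined update direction (d_mu, d_var)."""
    n = mu0.size
    d_mu = cp.Variable(n)            # change in the mean
    var = cp.Variable(n, pos=True)   # new per-coordinate variances
    # Closed-form KL between diagonal Gaussians, cf. (8).
    kl = 0.5 * (cp.sum(cp.multiply(var, 1.0 / var0))
                + cp.sum(cp.multiply(cp.square(d_mu), 1.0 / var0))
                - cp.sum(cp.log(var)) + np.sum(np.log(var0)) - n)
    objective = cp.Maximize(grad_mu @ d_mu + grad_var @ (var - var0))
    prob = cp.Problem(objective, [kl <= eps])
    prob.solve(solver=cp.ECOS)       # exponential-cone-capable solver (MOSEK, ECOS, SCS, ...)
    return d_mu.value, var.value - var0

# Illustrative usage with a random gradient estimate.
rng = np.random.default_rng(0)
n = 10
mu0, var0 = np.zeros(n), np.ones(n)
d_mu, d_var = cones_step(rng.normal(size=n), rng.normal(size=n), mu0, var0, eps=1.0)
```

The returned direction then replaces the raw gradient estimate before the gradient-based (Adam) update of the belief parameters.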
Algorithm 1
CoNES
1: Hyperparameters: radius ε of the KL-divergence ball, number of candidates N drawn at each iteration
2: Initialize: θ ← θ_0, Optimizer
3: repeat
4:   {x_i}_{i=1}^N ← draw N samples from the belief distribution P_θ
5:   ∇_θ l(θ) ← Gradient-Estimator({x_i}_{i=1}^N, {ˆl(x_i)}_{i=1}^N)
6:   ˆ∇_θ l(θ) ← CoNES(∇_θ l(θ), ε)   ▷ solve (6)
7:   θ ← Optimizer(θ, ˆ∇_θ l(θ))
8: until termination conditions are satisfied
9: return θ

An important property of the natural gradient is its independence from the parameterization of the belief distribution; e.g., for Gaussian distributions it does not matter whether we use the covariance matrix or its Cholesky decomposition. The natural gradient inherits this property by construction as the covariant gradient on the statistical manifold [6]. Parameterization invariance ensures that we are working with the intrinsic mathematical objects (probability distributions here) and that the specific encoding of these objects will not affect the outcome. From a practical perspective, we derive the benefit of fewer properties to "engineer".

A natural question to ask is whether CoNES (Problem (6)) exhibits the same property. Proposition 1 ensures that the CoNES optimization exhibits this property in the limit of ε tending to zero, as the update direction then coincides with the natural gradient. However, establishing this property for arbitrary ε > 0 requires a separate argument, which we provide next. We work with the loss l rather than its linearization, with the understanding that if parameterization invariance holds for an arbitrary function l, it automatically holds for the linear function in (6). With a slight abuse of notation, we express the loss function l : R^n → R in the coordinates θ on the statistical manifold instead of the coordinate-free notation l : 𝒫 → R. Now we are ready to present the main result of this section:

Theorem 2.
Consider the optimization problem:
OPT_θ : l*_θ = max { l(θ + εv_θ) | D(P_{θ+εv_θ} || P_θ) ≤ ε, v_θ ∈ R^n }. (9)

Let
Φ : R^n → R^n be a smooth invertible mapping which performs a coordinate change from θ to φ := Φ(θ). Consider the following optimization problem OPT_φ in the new coordinates:

OPT_φ : l*_φ = max { l ∘ Φ⁻¹(φ + εv_φ) | D(P_{φ+εv_φ} || P_φ) ≤ ε, v_φ ∈ R^n }. (10)

Then, there exists an invertible mapping Φ_v : R^n → R^n such that v*_θ ∈ argmax OPT_θ ⟺ Φ_v(v*_θ) ∈ argmax OPT_φ, ensuring that l*_θ = l*_φ.

Theorem 2 shows that expressing the belief distribution P ∈ 𝒫 in different coordinates θ or φ provides the same optimal loss and the same set of possible outcomes (up to a bijective mapping). Of course, we cannot ensure that the outcome, i.e., the argmax of the CoNES optimization, is the same, due to the potential lack of uniqueness of the optima; e.g., consider the maximization of x_1² + x_2² subject to x_1² + x_2² ≤ ε with the initial point (x_1, x_2) =
(0, 0) – all directions v from the initial point are equally good.

Intuitively, Theorem 2 holds because the KL-divergence is independent of the parameterization of the distribution [29, Corollary 4.1], i.e., for θ, φ, and Φ as defined in Theorem 2, we have:

D(P_{θ+εv_θ} || P_θ) = D(P_{Φ(θ+εv_θ)} || P_{Φ(θ)}). (11)

To formally prove Theorem 2, we first establish two lemmas. The first lemma shows the existence of a bijective mapping between v_θ and v_φ.

Lemma 1.
Let θ, φ, and Φ be as defined in Theorem 2. Then, there exists a bijective mapping Φ_v : R^n → R^n, defined as

Φ_v(v_θ) := (Φ(θ + εv_θ) − Φ(θ)) / ε. (12)

Proof.
First we check the injectivity of Φ_v:

Φ_v(v_{θ,1}) = Φ_v(v_{θ,2}) ⟺ Φ(θ + εv_{θ,1}) = Φ(θ + εv_{θ,2}) ⟺ v_{θ,1} = v_{θ,2} (since Φ is injective). (13)

Next, to check the surjectivity of Φ_v, let v_φ ∈ R^n be arbitrary. Then there exists v_θ := (Φ⁻¹(Φ(θ) + εv_φ) − θ)/ε which satisfies Φ_v(v_θ) = v_φ.

In the following remark, we express the result of Lemma 1 in a form that is more conducive to our forthcoming proof.

Remark 2.
Lemma 1 ensures that the following relation holds for any v_θ ∈ R^n:

v_φ = Φ_v(v_θ) ⟺ φ + εv_φ = Φ(θ + εv_θ) ⟺ θ + εv_θ = Φ⁻¹(φ + εv_φ),

where the first equivalence holds by the expression of Φ_v in (12) and the second equivalence holds from the bijectivity of Φ. (From a geometric perspective, θ and φ are coordinates on the statistical manifold 𝒫, either of which can be used to express a distribution P ∈ 𝒫. The directions v_θ and v_φ lie in the tangent space T_P𝒫 of 𝒫 at P.)

Lemma 2. Let B_θ := { v ∈ R^n | D(P_{θ+εv} || P_θ) ≤ ε } and B_φ := { v ∈ R^n | D(P_{φ+εv} || P_φ) ≤ ε } be the feasible sets of OPT_θ and OPT_φ, respectively. Let Φ_v be defined as in Lemma 1. Then, B_φ = { Φ_v(v) | v ∈ B_θ }.

Proof. Let v_φ ∈ { Φ_v(v) | v ∈ B_θ }; then there exists a v_θ ∈ B_θ such that v_φ = Φ_v(v_θ). Therefore, Remark 2 ensures that φ + εv_φ = Φ(θ + εv_θ), which further gives us:

D(P_{φ+εv_φ} || P_φ) = D(P_{Φ(θ+εv_θ)} || P_{Φ(θ)}) = D(P_{θ+εv_θ} || P_θ) ≤ ε, (14)

where the last equality follows from (11) and the inequality follows from the fact that v_θ ∈ B_θ. From (14) we have that v_φ ∈ B_φ, implying that { Φ_v(v) | v ∈ B_θ } ⊆ B_φ.

Now, let v_φ ∈ B_φ. By the surjectivity of Φ_v from Lemma 1, there exists a v_θ ∈ R^n such that v_φ = Φ_v(v_θ). With this, Remark 2 ensures that φ + εv_φ = Φ(θ + εv_θ). Hence, using (11), followed by φ + εv_φ = Φ(θ + εv_θ), gives:

D(P_{θ+εv_θ} || P_θ) = D(P_{Φ(θ+εv_θ)} || P_{Φ(θ)}) = D(P_{φ+εv_φ} || P_φ) ≤ ε, (15)

where the last inequality follows from the fact that v_φ ∈ B_φ. Therefore, by (15), we have that v_θ ∈ B_θ, which, combined with the earlier assertion that v_φ = Φ_v(v_θ), implies that v_φ ∈ { Φ_v(v) | v ∈ B_θ }. This ensures that B_φ ⊆ { Φ_v(v) | v ∈ B_θ } and completes the proof.

Proof of Theorem 2.
The proof follows from the following chain of arguments:

v*_θ ∈ argmax OPT_θ ⟺ l(θ + εv*_θ) ≥ l(θ + εv_θ), ∀ v_θ ∈ B_θ (16)
⟺ l ∘ Φ⁻¹(φ + εΦ_v(v*_θ)) ≥ l ∘ Φ⁻¹(φ + εΦ_v(v_θ)), ∀ v_θ ∈ B_θ (17)
⟺ l ∘ Φ⁻¹(φ + εΦ_v(v*_θ)) ≥ l ∘ Φ⁻¹(φ + εv_φ), ∀ v_φ ∈ B_φ (18)
⟺ Φ_v(v*_θ) ∈ argmax OPT_φ, (19)

where (17) follows from Remark 2 (Lemma 1) and (18) follows from Lemma 2. Further, because l(θ + εv*_θ) = l ∘ Φ⁻¹(φ + εΦ_v(v*_θ)) from Remark 2, we get l*_θ = l*_φ.

In this section, we use CoNES on two classes of problems: (a) a standard suite of high-dimensional loss functions used to benchmark blackbox optimizers, and (b) a selection of OpenAI Gym's [10] MuJoCo [47] suite of RL tasks. We compare CoNES against existing methods including ES, natural evolutionary strategies (NES), and covariance matrix adaptation (CMA). We custom implemented ES, NES, and CoNES, while CMA is adapted directly from the open-source PyCMA package [21]; our code is accessible at: https://github.com/irom-lab/CoNES.

The family of Gaussian belief distributions with diagonal covariance is used for ES, NES, and CoNES. This family of belief distributions permits the implementation of NES exactly (i.e., without having to numerically estimate the Fisher information matrix [44]) for high-dimensional problems, serving as a strong baseline to compare CoNES against. For CMA, PyCMA's default family of belief distributions – Gaussian distributions with non-diagonal covariance – is used. For ES, NES, and CoNES we compute an estimate of the gradient direction and pass it to the Adam optimizer [27] to update the belief distribution. For each of these methods we perform antithetic sampling and rank-based fitness transformation [42]. Unlike [42], we also update the variance of the belief distribution; we circumvent the non-negativity constraint on the variance by updating the log of the variance with the Adam optimizer instead. The resulting convex optimization problems for CoNES are solved using the CVXPY package [17] and the MOSEK solver [35].
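To make the gradient-estimation step concrete, the following is a minimal numpy sketch of an antithetic Monte-Carlo estimate of (3) for a diagonal-Gaussian belief with a rank-based fitness transformation in the spirit of [42]; the centered-rank shaping and all names are illustrative assumptions rather than a transcription of the released implementation.

```python
import numpy as np

def centered_ranks(fitness):
    """Rank-based fitness shaping (centered ranks in [-0.5, 0.5]), cf. [42]."""
    ranks = np.empty_like(fitness)
    ranks[np.argsort(fitness)] = np.arange(fitness.size)
    return ranks / (fitness.size - 1) - 0.5

def es_gradient(loss_fn, mu, log_var, n_pairs, rng):
    """Antithetic Monte-Carlo estimate of the Euclidean gradient (3) of
    E_{x ~ N(mu, diag(exp(log_var)))}[loss(x)] w.r.t. (mu, log_var)."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_pairs, mu.size))
    xs = np.concatenate([mu + std * eps, mu - std * eps])       # antithetic pairs
    fit = centered_ranks(np.array([loss_fn(x) for x in xs]))
    # Score functions of the diagonal Gaussian in (mu, log_var) coordinates.
    z = (xs - mu) / std
    grad_mu = np.mean(fit[:, None] * z / std, axis=0)
    grad_log_var = np.mean(fit[:, None] * 0.5 * (z ** 2 - 1.0), axis=0)
    return grad_mu, grad_log_var

# Example: gradient estimate for the Sphere loss at a random point.
rng = np.random.default_rng(0)
mu, log_var = rng.normal(size=5), np.zeros(5)
g_mu, g_lv = es_gradient(lambda x: x @ x, mu, log_var, n_pairs=50, rng=rng)
```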
Figure 3: Average loss (solid curve) with standard deviation (shaded region) across 10 seeds for ES, NES, CMA, and CoNES on Sphere, Rosenbrock, Rastrigin, and Lunacek.

Figure 4: Average step size (solid) with standard deviation (shaded region) of the belief distribution's mean across 10 seeds for ES, NES, CMA, and CoNES on Sphere, Rosenbrock, Rastrigin, and Lunacek.

We first test our approach on four 5000-dimensional functions: Sphere, Rosenbrock, Rastrigin, and Lunacek [23], which are provided in Appendix B. These functions are commonly-used benchmarks for blackbox optimization methods [22, 46]. Hyperparameters for ES, NES, and CoNES are shared across all problems (see Appendix A), while the hyperparameters of CMA are the default values chosen by PyCMA. Training for these benchmark functions was performed on a desktop with a 3.30 GHz Intel i9-7900X CPU with 10 cores and 32 GB RAM. Fig. 3 plots the average and standard deviation (shaded region) of the loss curves across 10 seeds. The rapid drop of the loss for CoNES demonstrates significant benefits in terms of sample complexity over the other methods. Fig. 4 shows that the step size for CoNES is smaller than for ES and NES, which, coupled with its lower loss, implies that the update direction for CoNES is more accurate than that of ES and NES. The run-time for a single seed is ∼35 minutes for CMA.
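For reference, the four benchmark functions (defined in Appendix B) admit a direct numpy implementation; the sketch below follows the standard definitions in [23], with function names chosen for illustration.

```python
import numpy as np

def sphere(x):
    return x @ x

def rosenbrock(x):
    return np.sum(100.0 * (x[:-1] ** 2 - x[1:]) ** 2 + (1.0 - x[:-1]) ** 2)

def rastrigin(x):
    return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

def lunacek(x):
    n = x.size
    mu1, d = 2.5, 1.0
    s = 1.0 - 1.0 / (2.0 * np.sqrt(n + 20.0) - 8.2)
    mu2 = -np.sqrt((mu1 ** 2 - d) / s)
    bi_sphere = min(np.sum((x - mu1) ** 2), d * n + s * np.sum((x - mu2) ** 2))
    return bi_sphere + 10.0 * np.sum(1.0 - np.cos(2.0 * np.pi * (x - mu1)))
```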
Next, we benchmark our approach on the following environments from the OpenAI Gym suite of RL problems: HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2. We employ a fully-connected neural network policy with tanh activations possessing one hidden layer with 16 neurons for Swimmer-v2 and 50 neurons for all other environments. The input to the policies is the agent's state – which is normalized using a method similar to the one adopted by [33] – and the output is a vector in the agent's action space. The training for these tasks was performed on a c5.24xlarge instance on Amazon Web Services (AWS). Fig. 5 presents the average and standard deviation of the rewards for each RL task across 10 seeds against the number of time-steps of interaction with the environment. Fig. 5 as well as Table 1 illustrate that CoNES performs well on all these tasks. For each environment we share the same hyperparameters (excluding ε) between ES, NES, and CoNES; for CMA we use the default hyperparameters as chosen by PyCMA. It is worth pointing out that for RL tasks, CoNES demonstrates high sensitivity to the choice of ε. The results for CoNES reported in Fig. 5 and Table 1 are for the best choice of ε from a fixed set of five candidate values (each of the form √c for a constant c). Each seed of HalfCheetah-v2, Walker2D-v2, and Hopper-v2 takes ∼10 hours with CMA.

Figure 5: Average reward (solid curve) with standard deviation (shaded region) across 10 seeds for ES, NES, CMA, and CoNES on HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2.

Table 1: Timesteps to attain a target average reward (over 10 seeds) for the RL tasks. For each environment, the timestep count for the best performing blackbox method is displayed in bold. A hyphen (–) marks a method that failed to achieve the target average reward within the allotted timestep budget.

We presented convex natural evolutionary strategies (CoNES) for optimizing high-dimensional blackbox functions. CoNES combines the notion of the natural gradient from information geometry with powerful techniques from convex optimization (e.g., second-order cone programming and geometric programming). In particular, CoNES refines a gradient estimate by solving a convex program that searches for the direction of steepest ascent in a KL-divergence ball around the current belief distribution. We formally established that CoNES is invariant under transformations of the belief parameterization. Our numerical results on benchmark functions and RL examples demonstrate the ability of CoNES to converge faster than conventional blackbox methods such as ES, NES, and CMA.
Future Work.
This paper raises numerous exciting future directions to explore. The performance of CoNES depends on the choice of the radius ε of the KL-divergence ball. Furthermore, a suitable choice of ε in one region of the loss landscape may not be suitable for another. Hence, an adaptive scheme for choosing the radius of the KL-divergence ball could substantially enhance the performance of CoNES. Another potentially fruitful future direction arises from the observation that Proposition 1 – which serves as the cornerstone of CoNES – holds for any f-divergence [15]; this is an outcome of the fact that the Hessian of all f-divergences is the Fisher information [32]. Hence, we can generalize CoNES to arbitrary f-divergences; this may afford greater flexibility in tuning it for the specific loss landscape and further improving performance. We can increase the flexibility afforded by CoNES even more by expanding beyond the family of Gaussian belief distributions. Finally, we are also exploring the empirical benefits of adaptively restricting the covariance matrix model [3, 13] in order to further enhance sample complexity.

Acknowledgements
The authors were supported by the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, and the Amazon Research Award.

Appendix

A Hyperparameters

The parameters for the Adam optimizer were chosen according to [27, Algorithm 1] for all results in Section 7.
Benchmark Functions.
For all the results in Section 7.1 the initial belief distribution is chosen to be the normal distribution N(0, I). The hyperparameters for ES, NES, and CoNES were chosen as follows: the number of function evaluations performed per iteration is 100 and the learning rate for the mean and the log of the variance is 0.1. Additionally, ε is set to 100 for CoNES.

RL Tasks.
The hyperparameters for ES, NES, and CoNES for the results in Section 7.2 are detailed in Table 2 below; some of these hyperparameters were borrowed from [38].

Table 2: Hyperparameters for RL tasks. For each of HalfCheetah-v2, Walker2D-v2, Hopper-v2, and Swimmer-v2, the table lists the mean (µ) and standard deviation (σ) of the initial belief distribution, the learning rates for µ and log(σ), the number of samples drawn per iteration (N), the number of rollouts per policy (m), and the KL-divergence radius ε.
B Benchmark functions
Let x ∈ R^n be expressed in its coordinates as x = (x_1, ..., x_n).

• Sphere: x ↦ xᵀx
• Rosenbrock: x ↦ ∑_{i=1}^{n−1} ( 100(x_i² − x_{i+1})² + (1 − x_i)² )
• Rastrigin: x ↦ 10n + ∑_{i=1}^{n} ( x_i² − 10 cos(2πx_i) )
• Lunacek: first define the constants µ_1 = 2.5, s = 1 − 1/(2√(n + 20) − 8.2), d = 1, and µ_2 = −√((µ_1² − d)/s). Using these constants the function can be expressed as x ↦ min{ ∑_{i=1}^{n} (x_i − µ_1)², dn + s ∑_{i=1}^{n} (x_i − µ_2)² } + 10 ∑_{i=1}^{n} (1 − cos(2π(x_i − µ_1))).

References

[1] Mosek modeling cook-book: Log-determinant. https://docs.mosek.com/modeling-cookbook/sdo.html
[2] In Proceedings of the Genetic and Evolutionary Computation Conference, pages 657–664, 2017.

[3] Y. Akimoto and N. Hansen. Projection-based restricted covariance matrix adaptation for high dimension. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 197–204, 2016.

[4] S. Amaran, N. V. Sahinidis, B. Sharda, and S. J. Bury. Simulation optimization: a review of algorithms and applications. Annals of Operations Research, 240(1):351–380, 2016.

[5] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[6] S.-I. Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.

[7] C. Audet and M. Kokkolaras. Blackbox and derivative-free optimization: theory, algorithms and applications, 2016.

[8] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67, 2007.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[11] K. Chatzilygeroudis, R. Rama, R. Kaushik, D. Goepp, V. Vassiliades, and J.-B. Mouret. Black-box data-efficient policy search for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 51–58. IEEE, 2017.

[12] K. Choromanski, A. Pacchiano, J. Parker-Holder, Y. Tang, D. Jain, Y. Yang, A. Iscen, J. Hsu, and V. Sindhwani. Provably robust blackbox optimization for reinforcement learning. arXiv preprint arXiv:1903.02993, 2019.

[13] K. M. Choromanski, A. Pacchiano, J. Parker-Holder, Y. Tang, and V. Sindhwani. From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. In Advances in Neural Information Processing Systems, pages 10299–10309, 2019.

[14] E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. Stanley, and J. Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, pages 5027–5038, 2018.

[15] I. Csiszár and P. C. Shields. Information Theory and Statistics: A Tutorial. Now Publishers Inc, 2004.

[16] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.

[17] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[18] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.

[19] D. Ha. Reinforcement learning for improving agent design. Artificial Life, 25(4):352–365, 2019.

[20] N. Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.

[21] N. Hansen, Y. Akimoto, and P. Baudis. CMA-ES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634, Feb. 2019.

[22] N. Hansen, A. Auger, O. Mersmann, T. Tusar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. arXiv preprint arXiv:1603.08785, 2016.

[23] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. [Research Report] RR-6829, INRIA, 2009. inria-00362633v2.

[24] J. H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press, 1992.

[25] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning. Springer, 2019.

[26] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[29] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[30] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.

[31] G. Liu, L. Zhao, F. Yang, J. Bian, T. Qin, N. Yu, and T.-Y. Liu. Trust region evolution strategies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4352–4359, 2019.

[32] A. Makur. A Study of Local Approximations in Information Theory. PhD thesis, Massachusetts Institute of Technology, 2015.

[33] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

[34] M. Miyashita, S. Yano, and T. Kondo. Mirror descent search and its acceleration. Robotics and Autonomous Systems, 106:107–116, 2018.

[35] MOSEK ApS. MOSEK Fusion API for Python 9.0.84 (beta), 2019.

[36] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

[37] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. The Journal of Machine Learning Research, 18(1):564–628, 2017.

[38] P. Pagliuca, N. Milano, and S. Nolfi. Efficacy of modern neuro-evolutionary strategies for continuous control optimization. arXiv preprint arXiv:1912.05239, 2019.

[39] R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

[40] P. Petersen, S. Axler, and K. Ribet. Riemannian Geometry, volume 171. Springer, 2006.

[41] I. Rechenberg and M. Eigen. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.

[42] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[44] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Efficient natural evolution strategies. In Proceedings of the Annual Conference on Genetic and Evolutionary Computation, pages 539–546, 2009.

[45] M. Suzuki. Information geometry and statistical manifold. arXiv preprint arXiv:1410.3369, 2014.

[46] O. Teytaud and J. Rapin. Nevergrad: An open source tool for derivative-free optimization, 2018.

[47] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.

[48] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.

[49] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, 2017.