A Linear Time Natural Evolution Strategy for Non-Separable Functions
Yi Sun, Faustino Gomez, Tom Schaul, and Jürgen Schmidhuber
IDSIA, University of Lugano & SUPSI, Galleria 2, Manno, CH-6928, Switzerland
{yi, tino, tom, juergen}@idsia.ch

Abstract.
We present a novel Natural Evolution Strategy (NES) variant, the Rank-One NES (R1-NES), which uses a low-rank approximation of the search distribution covariance matrix. The algorithm allows computation of the natural gradient with cost linear in the dimensionality of the parameter space, and excels in solving high-dimensional non-separable problems, including the best result to date on the Rosenbrock function (512 dimensions).
1 Introduction

Black-box optimization (also called zero-order optimization) methods have received a great deal of attention in recent years due to their broad applicability to real-world problems [7, 9-11, 15, 18]. When the structure of the objective function is unknown or too complex to model directly, or when gradient information is unavailable or unreliable, such methods are often seen as the last resort, because all they require is that the objective function can be evaluated at specific points.

In continuous black-box optimization, the state-of-the-art algorithms, such as xNES [4] and CMA-ES [8], are all based on the same principle [1, 4]: a Gaussian search distribution is repeatedly updated based on the objective function values of sampled points. Usually the full covariance matrix of the distribution is updated, allowing the algorithm to adapt the size and shape of the Gaussian to the local characteristics of the objective function. The full parameterization also provides invariance under affine transformations of the coordinate system, so that ill-shaped, highly non-separable problems can be tackled. However, this generality comes at a price: the number of parameters scales quadratically with the number of dimensions, and the computational cost per sample is at least quadratic in the number of dimensions [13], and sometimes cubic [4, 16].

This cost is often justified, since evaluating the objective function can dominate the computation, and thus the main focus is on improving sampling efficiency. However, in many problems, such as optimizing the weights of a neural network, the dimensionality of the parameter space can be very large (e.g., many thousands of weights), and the quadratic cost of updating the search distribution becomes the computational bottleneck. One possible remedy is to restrict the covariance matrix to be diagonal [14], which reduces the computation per function evaluation to O(d), linear in the number d of dimensions. Unfortunately, this "diagonal" approach performs poorly when the problem is non-separable, because the search distribution cannot follow directions that are not parallel to the current coordinate axes.

In this paper, we propose a new variant of the natural evolution strategy family [17], termed Rank-One NES (R1-NES). The algorithm stays within the general NES framework, in that the search distribution is adjusted according to the natural gradient [2], but it uses a novel parameterization of the covariance matrix,

    C = \sigma^2 \left( I + u u^\top \right),

where u and \sigma are the parameters to be adjusted. This parameterization allows the predominant eigendirection u of C to be aligned in any direction, enabling the algorithm to tackle highly non-separable problems while maintaining only O(d) parameters. We show through rigorous derivation that the natural gradient can also be computed effectively in O(d) per sample. R1-NES scales well to high dimensions, and dramatically outperforms diagonal covariance matrix algorithms on non-separable objective functions. As an example, R1-NES reliably solves the non-convex Rosenbrock function up to 512 dimensions.

The rest of the paper is organized as follows. Section 2 briefly reviews the NES framework. The derivation of R1-NES is presented in Section 3. Section 4 empirically evaluates the algorithm on standard benchmark functions, and Section 5 concludes the paper.
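Before turning to the framework, the following minimal numpy sketch (our own illustration, not from the paper; the helper name cov_times is ours) makes the complexity claim concrete: with the rank-one parameterization, C is never materialized, and a covariance-vector product needs only one inner product, i.e., O(d) work and storage.

```python
import numpy as np

def cov_times(x, sigma, u):
    # C @ x for C = sigma^2 (I + u u^T), using only O(d) work and storage
    return sigma ** 2 * (x + u * np.dot(u, x))

d = 1000
rng = np.random.default_rng(0)
u = rng.normal(size=d)            # predominant direction: d parameters instead of d^2
sigma = 0.5
x = rng.normal(size=d)

# check against the explicit O(d^2) covariance matrix
C = sigma ** 2 * (np.eye(d) + np.outer(u, u))
assert np.allclose(cov_times(x, sigma, u), C @ x)
```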
2 Natural Evolution Strategies

Natural evolution strategies (NES) are a class of evolutionary algorithms for real-valued optimization that maintain a search distribution and adapt the distribution parameters by following the natural gradient of the expected function value. The success of such algorithms is largely attributed to the use of the natural gradient, which has the advantage of always pointing in the direction of steepest ascent, even if the parameter space is not Euclidean. Moreover, compared to the regular gradient, the natural gradient down-weights gradient components with higher uncertainty, making the update more reliable. As a consequence, NES algorithms can effectively cope with objective functions with ill-shaped landscapes, in particular preventing premature convergence on plateaus and avoiding overly aggressive steps on ridges [16].

The general framework of NES is as follows. At each time step, the algorithm draws n ∈ N samples x_1, ..., x_n ~ π(·|θ), with π(·|θ) being the search distribution parameterized by θ. Let f : R^d → R be the objective function to maximize. The expected function value under the search distribution is

    J(\theta) = \mathbb{E}_\theta[f(x)] = \int f(x)\, \pi(x \mid \theta)\, dx .

Using the log-likelihood trick, the gradient with respect to the parameters can be written as

    \nabla_\theta J = \nabla_\theta \int f(x)\, \pi(x \mid \theta)\, dx
                    = \int f(x)\, \pi(x \mid \theta)\, \frac{\nabla_\theta \pi(x \mid \theta)}{\pi(x \mid \theta)}\, dx
                    = \mathbb{E}\left[ f(x)\, \nabla_\theta \log \pi(x \mid \theta) \right],

from which we obtain the Monte-Carlo estimate of the search gradient

    \nabla_\theta J \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, \nabla_\theta \log \pi(x_i \mid \theta) .

The key step of NES then consists of replacing this gradient by the natural gradient

    \tilde\nabla_\theta J = F^{-1} \nabla_\theta J ,

where

    F = \mathbb{E}\left[ \nabla_\theta \log \pi(x \mid \theta)\, \nabla_\theta \log \pi(x \mid \theta)^\top \right]

is the Fisher information matrix (see Fig. 1 for an illustration). This leads to a straightforward scheme of natural gradient ascent for iteratively updating the parameters:

    \theta \leftarrow \theta + \eta\, \tilde\nabla_\theta J
          = \theta + \frac{\eta}{n} \sum_{i=1}^{n} f(x_i)\, F^{-1} \nabla_\theta \log \pi(x_i \mid \theta)
          = \theta + \frac{\eta}{n} \sum_{i=1}^{n} f(x_i)\, \tilde\nabla_\theta \log \pi(x_i \mid \theta) .    (1)

The sequence of 1) sampling an offspring population, 2) computing the corresponding Monte-Carlo estimate of the gradient, 3) transforming it into the natural gradient, and 4) updating the search distribution constitutes one iteration of NES.

The most difficult step in NES is the computation of the Fisher information matrix with respect to the parameterization. For the full Gaussian distribution, the Fisher matrix can be derived analytically [4, 16]. However, for an arbitrary parameterization of C, the Fisher matrix can be highly non-trivial.
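As an illustration of this scheme (and not of the R1-NES update derived below), here is a minimal sketch of one NES iteration for the simplest case of an isotropic Gaussian N(µ, e^{2λ}I), for which the natural gradient of the log-density takes the simple closed forms x − µ and (e^{−2λ}‖x − µ‖² − d)/(2d). The learning rates and the crude fitness normalization are placeholder choices of ours, not the shaping used in practice.

```python
import numpy as np

def nes_step(f, mu, lam, n=20, eta_mu=1.0, eta_lam=0.1, rng=np.random.default_rng(0)):
    """One natural-gradient ascent step for an isotropic Gaussian N(mu, e^{2 lam} I)."""
    d = mu.size
    y = rng.standard_normal((n, d))
    x = mu + np.exp(lam) * y                              # sampled offspring
    fit = np.array([f(xi) for xi in x])
    fit = (fit - fit.mean()) / (fit.std() + 1e-12)        # crude normalization, not the paper's shaping
    g_mu = (fit[:, None] * (x - mu)).mean(axis=0)         # Monte-Carlo estimate of the natural gradient on mu
    g_lam = (fit * ((y ** 2).sum(axis=1) - d)).mean() / (2 * d)
    return mu + eta_mu * g_mu, lam + eta_lam * g_lam      # update as in Eq. (1)

# usage: maximize f(x) = -||x||^2 in 10 dimensions
mu, lam = np.full(10, 3.0), 0.0
for _ in range(300):
    mu, lam = nes_step(lambda x: -np.sum(x ** 2), mu, lam)
```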
3 The Rank-One NES

In this paper, we consider a special parameterization of the covariance matrix,

    C = \sigma^2 \left( I + u u^\top \right),

with parameter set θ = ⟨σ, u⟩. The special part of the parameterization is the vector u ∈ R^d, which corresponds to the predominant direction of C. It allows the search distribution to be aligned in any direction by adjusting u, enabling the algorithm to follow valleys not aligned with the current coordinate axes, which is essential for solving non-separable problems.

Since σ should always be positive, following the same procedure as in [4], we parameterize σ = e^λ, so that λ ∈ R can be adjusted freely by gradient steps without worrying about σ becoming negative. The parameter set is adjusted to θ = ⟨λ, u⟩ accordingly.

From the derivation in [16], the natural gradient with respect to the sample mean is given by

    \tilde\nabla_\mu \log p(x \mid \theta) = x - \mu .    (2)

In the subsequent discussion we always assume µ = 0 for simplicity; this can be achieved easily by shifting the coordinates. It is straightforward to sample from N(0, C) by letting

    y \sim \mathcal{N}(0, I), \quad z \sim \mathcal{N}(0, 1), \quad x = \sigma (y + z u) \sim \mathcal{N}(0, C) .

The inverse of C can also be computed easily as

    C^{-1} = \sigma^{-2} \left( I - \frac{1}{1 + r^2}\, u u^\top \right),

where r^2 = u^\top u. Using the relation \det(I + u u^\top) = 1 + u^\top u, the determinant of C is

    |C| = \sigma^{2d} \left( 1 + r^2 \right) .
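The following short numpy check (our own illustration; the variable names are ours) confirms these identities numerically in a small dimension: sampling via x = e^λ(y + zu), the Sherman-Morrison form of C^{-1}, and the determinant, all of which take O(d) time without ever forming C.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 5, -0.3
u = rng.normal(size=d)
r2 = u @ u                                          # r^2 = u^T u
C = np.exp(2 * lam) * (np.eye(d) + np.outer(u, u))  # explicit C, only for checking

# sampling: x = e^lam (y + z u) has covariance C
y, z = rng.standard_normal(d), rng.standard_normal()
x = np.exp(lam) * (y + z * u)

# C^{-1} x and log|C| without ever forming C (both O(d))
Cinv_x = np.exp(-2 * lam) * (x - (u @ x) / (1 + r2) * u)
logdet = 2 * d * lam + np.log1p(r2)

assert np.allclose(Cinv_x, np.linalg.solve(C, x))
assert np.isclose(logdet, np.linalg.slogdet(C)[1])
```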
Knowing C^{-1} and |C| allows the log-likelihood to be written explicitly as

    \log p(x \mid \theta) = \text{const} - \frac{1}{2} \log |C| - \frac{1}{2} x^\top C^{-1} x
                          = \text{const} - \lambda d - \frac{1}{2} \log\left(1 + r^2\right)
                            - \frac{1}{2} e^{-2\lambda} x^\top x + \frac{1}{2} \frac{e^{-2\lambda}}{1 + r^2} \left( x^\top u \right)^2 .

The regular gradient with respect to λ and u can then be computed as

    \nabla_\lambda \log p(x \mid \theta) = -d + e^{-2\lambda} \left( x^\top x - \frac{(x^\top u)^2}{1 + r^2} \right),    (3)

    \nabla_u \log p(x \mid \theta) = -\frac{u}{1 + r^2} + e^{-2\lambda} \left[ \frac{(x^\top u)\, x}{1 + r^2} - \frac{(x^\top u)^2\, u}{(1 + r^2)^2} \right] .    (4)

Replacing x with e^λ (y + z u), the Fisher matrix can be computed by marginalizing out the i.i.d. standard Gaussian variables y and z, namely,

    F = \mathbb{E}_x\left[ \nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top \right]
      = \mathbb{E}_{y,z}\left[ \nabla_\theta \log p\left( e^\lambda (y + z u) \mid \theta \right)\, \nabla_\theta \log p\left( e^\lambda (y + z u) \mid \theta \right)^\top \right] .

Since the elements of \nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top are essentially polynomials in y and z, their expectations can be computed analytically (the derivation is tedious and thus omitted here; all derivations have been verified numerically using Monte-Carlo simulation), which gives the exact Fisher information matrix

    F = \begin{bmatrix} 2d & \frac{2 u^\top}{1 + r^2} \\ \frac{2 u}{1 + r^2} & B \end{bmatrix},
    \qquad
    B = \frac{1}{1 + r^2} \left[ r^2 I + \frac{1 - r^2}{1 + r^2}\, u u^\top \right] .

Let v = u / r. Then

    F = \frac{r^2}{1 + r^2} \begin{bmatrix} \frac{2d (1 + r^2)}{r^2} & \frac{2 v^\top}{r} \\ \frac{2 v}{r} & I + \frac{1 - r^2}{1 + r^2}\, v v^\top \end{bmatrix},

and the inverse of F is thus given by

    F^{-1} = \frac{1 + r^2}{r^2} \begin{bmatrix} \frac{2d (1 + r^2)}{r^2} & \frac{2 v^\top}{r} \\ \frac{2 v}{r} & I + \frac{1 - r^2}{1 + r^2}\, v v^\top \end{bmatrix}^{-1} .

We apply the formula for the block matrix inverse from [12],

    \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
    = \begin{bmatrix} C_1^{-1} & -A_{11}^{-1} A_{12} C_2^{-1} \\ -C_2^{-1} A_{21} A_{11}^{-1} & C_2^{-1} \end{bmatrix},

where C_1 = A_{11} - A_{12} A_{22}^{-1} A_{21} and C_2 = A_{22} - A_{21} A_{11}^{-1} A_{12} are the Schur complements. Let F be partitioned as above. Then

    A_{22}^{-1} = I - \frac{1 - r^2}{2}\, v v^\top,

and the Schur complements are

    C_1 = \frac{2d (1 + r^2)}{r^2} - \frac{2 v^\top}{r} \left( I - \frac{1 - r^2}{2}\, v v^\top \right) \frac{2 v}{r}
        = \frac{2d (1 + r^2)}{r^2} - \frac{2 (1 + r^2)}{r^2}
        = 2 (d - 1)\, \frac{1 + r^2}{r^2}

and

    C_2 = I + \frac{1 - r^2}{1 + r^2}\, v v^\top - \frac{2\, v v^\top}{d (1 + r^2)}
        = I + \frac{1}{1 + r^2} \left[ (1 - r^2) - \frac{2}{d} \right] v v^\top,

whose inverse is given by

    C_2^{-1} = I + \frac{2 + d (r^2 - 1)}{2 (d - 1)}\, v v^\top .

Combining the results gives the analytical form of the inverse Fisher matrix:

    F^{-1} = \frac{1 + r^2}{2 r^2 (d - 1)} \begin{bmatrix} \frac{r^2}{1 + r^2} & -r v^\top \\ -r v & 2 (d - 1) I + \left[ 2 + d (r^2 - 1) \right] v v^\top \end{bmatrix} .

Multiplying F^{-1} with the regular gradient in Eqs. (3) and (4) gives the natural gradient for λ and u:

    \tilde\nabla_\lambda \log p(x \mid \theta) = \frac{1}{2 (d - 1)} \left[ \left( e^{-2\lambda} x^\top x - d \right) - \left( e^{-2\lambda} (x^\top v)^2 - 1 \right) \right]    (5)

and

    \tilde\nabla_u \log p(x \mid \theta) = \frac{e^{-2\lambda}}{2 (d - 1)\, r} \left\{ 2 (d - 1) (x^\top v)\, x
        + \left[ (1 - d) (x^\top v)^2 + (r^2 + 1) \left( (x^\top v)^2 - x^\top x \right) \right] v \right\} .    (6)

Note that computing both \tilde\nabla_\lambda \log p(x \mid \theta) and \tilde\nabla_u \log p(x \mid \theta) requires only the inner products x^\top x and x^\top v, and can therefore be done in O(d) storage and time.
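In code, Eqs. (5) and (6), as reconstructed above, reduce to a handful of inner products. The sketch below (our own, with our variable names) computes the per-sample natural gradients in O(d) time and storage.

```python
import numpy as np

def natural_gradients(x, lam, u):
    """Per-sample natural gradients of log p(x | lam, u), Eqs. (5)-(6); O(d) time and storage."""
    d = x.size
    r = np.linalg.norm(u)
    v = u / r                       # unit predominant direction
    xx = x @ x                      # x^T x
    xv = x @ v                      # x^T v
    s = np.exp(-2 * lam)
    g_lam = ((s * xx - d) - (s * xv ** 2 - 1)) / (2 * (d - 1))
    g_u = (s / (2 * (d - 1) * r)) * (
        2 * (d - 1) * xv * x
        + ((1 - d) * xv ** 2 + (r ** 2 + 1) * (xv ** 2 - xx)) * v
    )
    return g_lam, g_u
```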
The natural gradient above is obtained with respect to u. However, a direct gradient update on u has an unpleasant property when \tilde\nabla_u \log p(x \mid \theta) points in the direction opposite to u, as illustrated in Fig. 2(a). In this case the gradient tends to shrink u, but if \tilde\nabla_u \log p(x \mid \theta) is large, adding it flips the direction of u, and the length of u may even grow. This causes numerical problems, especially when r is small.

A remedy is to separate the length and direction of u, namely, to reparameterize u = e^c v, where ||v|| = 1 and e^c is the length of u. A gradient update on c can then never flip u, which avoids the problem. Note that for a small change δu, the updates on c and v can be obtained from

    \delta c = \frac{1}{2} \log (u + \delta u)^\top (u + \delta u) - c
             \approx \frac{1}{2} \log \left( u^\top u + 2\, \delta u^\top u \right) - c
             = \frac{1}{2} \log u^\top u + \frac{1}{2} \log \left( 1 + \frac{2\, \delta u^\top u}{u^\top u} \right) - c
             \approx \frac{\delta u^\top u}{u^\top u}

and

    \delta v = \frac{u + \delta u}{\sqrt{(u + \delta u)^\top (u + \delta u)}} - v
             \approx \frac{u + \delta u}{\sqrt{u^\top u + 2\, \delta u^\top u}} - \frac{u}{\sqrt{u^\top u}}
             \approx \frac{(u + \delta u) \left( 1 - \frac{\delta u^\top u}{u^\top u} \right)}{\sqrt{u^\top u}} - \frac{u}{\sqrt{u^\top u}}
             \approx \frac{1}{\sqrt{u^\top u}} \left[ \delta u - \frac{\delta u^\top u}{u^\top u}\, u \right],

keeping only terms of first order in δu. The natural gradients on c and v are obtained by letting δu ∝ \tilde\nabla_u \log p(x \mid \theta), thanks to the invariance property:

    \tilde\nabla_c \log p(x \mid \theta) = r^{-1}\, \tilde\nabla_u \log p(x \mid \theta)^\top v    (7)

and

    \tilde\nabla_v \log p(x \mid \theta) = r^{-1} \left[ \tilde\nabla_u \log p(x \mid \theta) - \left( \tilde\nabla_u \log p(x \mid \theta)^\top v \right) v \right] .    (8)

Note that computing \tilde\nabla_c \log p(x \mid \theta) and \tilde\nabla_v \log p(x \mid \theta) involves only inner products between vectors, which can also be done in time linear in the number of dimensions.

Using the parameterization ⟨c, v⟩ introduces another problem. When r is small, \tilde\nabla_c \log p(x \mid \theta) tends to be large, and directly updating c then causes r to grow exponentially, resulting in numerical instability, as shown in Fig. 2(b). In this case the additive update on u, rather than the update on ⟨c, v⟩, is more stable. In our implementation, the additive update on u is therefore used whenever \tilde\nabla_c \log p(x \mid \theta) exceeds a fixed threshold, and the exponential update on ⟨c, v⟩ is used otherwise. This solution proved to be numerically stable in all our tests. Algorithm 1 shows the complete R1-NES algorithm in pseudocode.

Algorithm 1: R1-NES(n)
    while not terminated do
        for i = 1 to n do
            y_i ← N(0, I);  z_i ← N(0, 1)
            x_i ← e^λ (y_i + z_i u)                     // generate sample
            fitness[i] ← f(µ + x_i)
        end
        Compute the natural gradients for µ, λ, u, c, and v according to
        Eqs. (2), (5), (6), (7), and (8), and combine them using Eq. (1).
        µ ← µ + η ˜∇_µ J
        λ ← λ + η ˜∇_λ J
        if ˜∇_c J < threshold then
            c ← c + η ˜∇_c J                            // exponential update on ⟨c, v⟩
            v ← (v + η ˜∇_v J) / ||v + η ˜∇_v J||
            u ← e^c v
        else
            u ← u + η ˜∇_u J                            // additive update on u
            c ← log ||u||
            v ← u / ||u||
        end
    end
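Putting the pieces together, the sketch below implements one generation of Algorithm 1 in numpy under stated assumptions: the learning rate, the simple fitness normalization, and the switching threshold `thresh` are placeholder choices of ours, not values from the paper.

```python
import numpy as np

def r1nes_generation(f, mu, lam, u, n=50, eta=0.1, thresh=1.0, rng=np.random.default_rng(0)):
    """One generation of R1-NES (Algorithm 1, our reconstruction). Requires d > 1 and u != 0."""
    d = mu.size
    r = np.linalg.norm(u)
    v = u / r

    # 1) sample offspring: x_i = e^lam (y_i + z_i u), evaluated at mu + x_i
    y = rng.standard_normal((n, d))
    z = rng.standard_normal(n)
    x = np.exp(lam) * (y + z[:, None] * u)
    fit = np.array([f(mu + xi) for xi in x])
    w = (fit - fit.mean()) / (fit.std() + 1e-12)      # placeholder normalization, not the paper's shaping

    # 2) per-sample natural gradients, combined as in Eq. (1)
    xx = (x * x).sum(axis=1)                          # x_i^T x_i
    xv = x @ v                                        # x_i^T v
    s = np.exp(-2 * lam)
    g_mu = (w[:, None] * x).mean(axis=0)                                      # Eq. (2)
    g_lam = (w * ((s * xx - d) - (s * xv ** 2 - 1))).mean() / (2 * (d - 1))   # Eq. (5)
    g_u = (s / (2 * (d - 1) * r)) * (w[:, None] * (                           # Eq. (6)
            2 * (d - 1) * xv[:, None] * x
            + ((1 - d) * xv ** 2 + (r ** 2 + 1) * (xv ** 2 - xx))[:, None] * v
          )).mean(axis=0)
    g_c = (g_u @ v) / r                                                       # Eq. (7)
    g_v = (g_u - (g_u @ v) * v) / r                                           # Eq. (8)

    # 3) update the search distribution
    mu = mu + eta * g_mu
    lam = lam + eta * g_lam
    if g_c < thresh:                                   # exponential update on <c, v>
        c = np.log(r) + eta * g_c
        v_new = v + eta * g_v
        u = np.exp(c) * v_new / np.linalg.norm(v_new)
    else:                                              # additive update on u
        u = u + eta * g_u
    return mu, lam, u

# usage: maximize f(x) = -sum(x^2), starting away from the optimum
mu, lam, u = np.full(8, 2.0), 0.0, 0.1 * np.ones(8)
for _ in range(500):
    mu, lam, u = r1nes_generation(lambda x: -np.sum(x ** 2), mu, lam, u)
```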
4 Experiments

The R1-NES algorithm was evaluated on the twelve noise-free unimodal functions [6] in the Black-Box Optimization Benchmarking (BBOB) collection from the 2010 GECCO Workshop for Real-Parameter Optimization. In order to make the results comparable to those of other methods, the setup in [5] was used, which transforms the pure benchmark functions to make the parameters non-separable (for some) and to avoid trivial optima at the origin.

R1-NES was compared to xNES [3] and SNES [14] on each benchmark with problem dimensions d = 2^k, k = 1, ..., 9 (20 runs for each setup), except for xNES, which was only run up to k = 6, d = 64. Note that xNES serves as a proper baseline since it is state-of-the-art, achieving performance on par with the popular CMA-ES. The reference machine is an Intel Core i7 processor with 1.6 GHz and 4 GB of RAM.

Fig. 3 shows the results for the eight benchmarks on which R1-NES performs at least as well as the other methods, and often much better. For dimensionalities under 64, R1-NES is comparable to xNES in terms of the number of fitness evaluations, indicating that the rank-one parameterization of the search distribution effectively captures the local curvature of the fitness function (see Fig. 5 for an example). However, the time required to compute the update differs drastically between the two algorithms, as depicted in Fig. 4. For example, a typical run of xNES in 64 dimensions takes hours (hence the truncated xNES curves in all graphs), compared to minutes for R1-NES. As a result, R1-NES can solve these problems up to 512 dimensions in acceptable time. In particular, the result on the 512-dimensional Rosenbrock function is, to our knowledge, the best to date. We estimate that optimizing the 512-dimensional sphere function with xNES (or any other full-parameterization method, e.g., CMA-ES) would take over a year of computation time on the same reference hardware. It is also worth pointing out that SNES, though it has similarly low complexity per function evaluation, can only solve the separable problems (Sphere, Linear, AttractiveSector, and Ellipsoid).

Fig. 6 shows four cases (Ellipsoid, StepEllipsoid, RotatedEllipsoid, and Tablet) for which R1-NES is not suited, highlighting a limitation of the algorithm. Three of the four functions are from the Ellipsoid family, where the fitness functions are variants of the type

    f(x_1, \ldots, x_d) = \sum_{i=1}^{d} x_i^2 \cdot 10^{6 \frac{i-1}{d-1}} .

The eigenvalues of the Hessian span several orders of magnitude, and a parameterization with a single predominant direction is not enough to approximate the Hessian, resulting in poor performance. The other function on which R1-NES fails is the Tablet function, where all but one eigendirection has a large eigenvalue. Since the parameterization of R1-NES only allows a single direction to have a large eigenvalue, the shape of the Hessian cannot be effectively approximated.

5 Conclusion

We presented a new black-box optimization algorithm, R1-NES, which employs a novel parameterization of the search distribution covariance matrix that allows a predominant search direction to be adjusted using the natural gradient, with complexity linear in the dimensionality. The algorithm shows excellent performance on a number of high-dimensional non-separable problems that, to date, have not been solved with other parameterizations of similar complexity. Future work will concentrate on overcoming the limitations of the algorithm (shown in Fig. 6). In particular, we intend to extend the algorithm to a) incorporate multiple search directions, and b) enable each search direction to shrink as well as grow.
Acknowledgement
This research was funded in part by Swiss National Science Foundation grants 200020-122124 and 200020-125038, and by the EU IM-CLeVeR project.
References
1. Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature (PPSN), 2010.
2. S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
3. T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber. Exponential natural evolution strategies. In Genetic and Evolutionary Computation Conference (GECCO), Portland, OR, 2010.
4. T. Glasmachers, T. Schaul, Y. Sun, and J. Schmidhuber. Exponential natural evolution strategies. In GECCO'10, 2010.
5. N. Hansen and A. Auger. Real-parameter black-box optimization benchmarking 2010: Experimental setup, 2010.
6. N. Hansen and S. Finck. Real-parameter black-box optimization benchmarking 2010: Noiseless functions definitions, 2010.
7. N. Hansen, A. S. P. Niederberger, L. Guzzella, and P. Koumoutsakos. A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. Trans. Evol. Comp., 13:180-197, 2009.
8. N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195, 2001.
9. M. Hasenjäger, B. Sendhoff, T. Sonoda, and T. Arima. Three dimensional evolutionary aerodynamic design optimization with CMA-ES. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, GECCO '05, pages 2173-2180, New York, NY, USA, 2005. ACM.
10. M. Jebalia, A. Auger, M. Schoenauer, F. James, and M. Postel. Identification of the isotherm function in chromatography using CMA-ES. In IEEE Congress on Evolutionary Computation, pages 4289-4296, 2007.
11. J. Klockgether and H. P. Schwefel. Two-phase nozzle and hollow core jet experiments. In Proc. 11th Symp. Engineering Aspects of Magnetohydrodynamics, pages 141-148, 1970.
12. K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2008.
13. R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity. In Parallel Problem Solving from Nature, PPSN X, pages 296-305. Springer, 2008.
14. T. Schaul, T. Glasmachers, and J. Schmidhuber. High dimensions and heavy tails for natural evolution strategies. In Genetic and Evolutionary Computation Conference (GECCO), 2011.
15. O. M. Shir and T. Bäck. The second harmonic generation case-study as a gateway for ES to quantum control problems. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO '07, pages 713-721, New York, NY, USA, 2007. ACM.
16. Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Stochastic search using the natural gradient. In ICML'09, 2009.
17. D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In Proceedings of the Congress on Evolutionary Computation (CEC08), Hong Kong. IEEE Press, 2008.
18. S. Winter, B. Brendel, and C. Igel. Registration of bone structures in 3D ultrasound and CT data: Comparison of different optimization strategies. International Congress Series, 1281:242-247, 2005.
Fig. 1. Plain versus natural gradient in parameter space.
Consider two parameters, e.g. θ = (µ, σ), of the search distribution. On the left, the solid (black) arrows indicate the gradient samples ∇_θ log π(x|θ), and the dotted (blue) arrows correspond to f(x) · ∇_θ log π(x|θ), that is, the same gradient estimates scaled by fitness. Combining these, the bold (green) arrow indicates the (sampled) fitness gradient ∇_θ J, while the bold dashed (red) arrow indicates the corresponding natural gradient ˜∇_θ J = F^{-1} ∇_θ J. Being random variables with expectation zero, the distribution of the black arrows is governed by their covariance (the gray ellipse). Note that this covariance is a quantity in parameter space (where the θ reside), not to be confused with that of the search space (where the samples x reside). In contrast, on the right, the solid (black) arrows represent ˜∇_θ log π(x|θ), and the dotted (blue) arrows indicate the natural gradient samples f(x) · ˜∇_θ log π(x|θ), resulting in the natural gradient (dashed red). The covariance of the solid arrows on the right-hand side turns out to be the inverse of the covariance of the solid arrows on the left. This has the effect that, when computing the natural gradient, directions with high variance (uncertainty) are penalized and thus shrunken, while components with low variance (high certainty) are boosted, since these components of the gradient samples deserve more trust. This makes the (dashed red) natural gradient a much more trustworthy update direction than the (green) plain gradient.

Fig. 2. Illustration of the change in parameterization.
In both panels, the black lines and ellipses refer to the current predominant direction u and the corresponding search distribution from which samples are drawn. The black cross denotes one such sample that is being used to update the distribution. In the left panel, the direction of the selected point is almost perpendicular to u, resulting in a large gradient that reduces u (the dotted blue line). However, a direct gradient update on u flips the direction of u; as a result, u stays in the same undesired direction, but with increased length. In contrast, performing the update on c and v gives the predominant search direction depicted in red, with u shrunk properly. The right panel shows another case, where the selected point aligns with the search direction, and performing the exponential update on c and v causes u to increase dramatically (green line & ellipsoid). This effect is prevented by performing the additive update (Eq. 6) on u (red line & ellipsoid).

Fig. 3. Performance comparison on BBOB unimodal benchmarks. Log-log plot of the median number of fitness evaluations (over 20 trials) required to reach the target fitness value on the unimodal benchmark functions for which R1-NES is well suited, on dimensions 2 to 512 (cases for which 90% or more of the runs converged prematurely are not shown). Note that xNES consistently solves all benchmarks on small dimensions (≤ 64).

Fig. 4. Computation time per function evaluation, for the three algorithms, on problem dimensions ranging from 2 to 512. Both SNES and R1-NES scale linearly, whereas the cost grows cubically for xNES.
Fig. 5. Behavior of R1-NES on the 32-dimensional Cigar function f(x_1, ..., x_d) = 10^6 x_1^2 + x_2^2 + ... + x_d^2. The left panel shows the best fitness found so far, and the min and max fitness in the current population, against the number of evaluations. The right panel shows how λ and c evolve over time. Note that λ decreases almost linearly, indicating that all directions except the predominant one shrink exponentially. In contrast, c first increases and then stabilizes around log 1000 (the black line). As a result, I + uu^⊤ corresponds to the Hessian of the Cigar function.

Fig. 6. Performance comparison on BBOB unimodal benchmarks for which R1-NES is not well suited.