Challenges in High-dimensional Reinforcement Learning with Evolution Strategies
Nils Müller and Tobias Glasmachers
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
{nils.mueller, tobias.glasmachers}@ini.rub.de

Abstract
Evolution Strategies (ESs) have recently become popular for training deep neural networks, in particular on reinforcement learning tasks, a special form of controller design. Compared to classic problems in continuous direct search, deep networks pose extremely high-dimensional optimization problems, with many thousands or even millions of variables. In addition, many control problems give rise to a stochastic fitness function. Considering the relevance of the application, we study the suitability of evolution strategies for high-dimensional, stochastic problems. Our results give insights into which algorithmic mechanisms of modern ESs are of value for the class of problems at hand, and they reveal principled limitations of the approach. They are in line with our theoretical understanding of ESs. We show that combining ESs that offer reduced internal algorithm cost with uncertainty handling techniques yields promising methods for this class of problems.
Introduction

Since the publication of DeepMind's Deep-Q-Learning system [18] in 2015, the field of (deep) reinforcement learning (RL) [34] has been developing at a rapid pace. In [18], neural networks learn to play classic Atari 2600 games solely from interaction, based on raw (unprocessed) visual input. The approach had a considerable impact because it demonstrated the great potential of deep reinforcement learning. Only one year later, AlphaGo [8] demystified the ancient game of Go by beating multiple human world experts. In this rapidly moving field, Evolution Strategies (ESs) [4, 22, 30] gained considerable attention from the machine learning community when OpenAI promoted them as a "scalable alternative to reinforcement learning" [17], which spawned several follow-up works [6, 9].

Already long before deep RL, controller design with ESs was studied for many years within the domain of neuroevolution [16, 23, 24, 29, 32, 33]. The optimization of neural network controllers is frequently cast as an episodic RL problem, which can be solved with direct policy search, for example with an ES. This amounts to parameterizing a class of controllers, which are optimized to maximize reward or to minimize cost, determined by running the controller on the task at hand, often in a simulator. The value of the state-of-the-art covariance matrix adaptation evolution strategy (CMA-ES) algorithm [22] for this problem class was emphasized by several authors [16, 23]. CMA-ES was even augmented with an uncertainty handling mechanism, specifically for controller design [14].

The controller design problems considered in the above-discussed papers are rather low-dimensional, at least compared to deep learning models with up to millions of weights. CMA-ES is rarely applied to problems with more than 100 variables. This is because learning a full covariance matrix introduces non-trivial algorithm-internal cost and hence prevents the direct application of CMA-ES to high-dimensional problems. In recent years it turned out that even covariance matrix adaptation can be scaled up to very large dimensions, as proven by a series of algorithms [1, 11, 28, 31], either by restricting the covariance matrix to the diagonal, to a low-rank model, or to a combination of both. Although apparently promising, none of these approaches has to date been applied to the problem of deep reinforcement learning.

Against this background, we investigate the suitability of evolution strategies in general, and of modern scalable CMA-ES variants in particular, for the design of large-scale neural network controllers. In contrast to most existing studies in this domain, we approach the problem from an optimization perspective, not from a (machine) learning perspective. We are primarily interested in how different algorithmic components affect optimization performance in high-dimensional, noisy optimization problems. Our results provide a deeper understanding of relevant aspects of algorithm design for deep neuroevolution.

The rest of the paper is organized as follows. After a brief introduction to controller design, we discuss mechanisms of evolution strategies in terms of their convergence properties. We then carry out experiments on RL problems as well as on optimization benchmarks, and close with our conclusions.
Problem Setting

General setting.
In this paper, we investigate the utility of evolution strategies for optimization problems that pose several difficulties at the same time:

• a large number d of variables (high dimension of the search space R^d),
• fitness noise, i.e., the variability of fitness values f(x) when evaluating the non-deterministic fitness function multiple times at the same point x, and
• multi-modality, i.e., the presence of a large number of local optima.

Additionally, a fundamental requirement is a relatively quick fitness evaluation, even though a single evaluation typically requires the simulation of real-world phenomena.

State-of-the-art algorithms like CMA-ES can handle dimensions of up to d ≤ 100 with ease. They become painfully slow for significantly larger values of d.

Despite the greater generality of the described problem setting, a central motivation for studying the above problem class is controller design. In evolutionary controller design, an individual (a candidate controller) is evaluated in a Monte Carlo manner, by sampling its performance on a (simulated) task, or on a set of tasks and conditions. Stochasticity caused by random state transitions and randomized controllers is a common issue. Due to complex and stochastic controller-environment interactions, controller design is considered a difficult problem, and black-box approaches like ESs are well suited for the task, in particular if gradients are not available.
Reinforcement learning.
In reinforcement learning, control problems are typically framed as stochastic, time-discrete Markov decision processes $(S, A, P_{\cdot,\cdot}(\cdot), R_\cdot(\cdot,\cdot), \gamma)$ with the notion of a (software) agent embedded in an environment. The agent takes an action $a \in A$ when presented with a state $s \in S$ of the environment, in order to receive a reward $(s, s', a) \mapsto R_s(s', a) \in \mathbb{R}$ for the resulting transition to the new state $s' \in S$ in the next time step. An individual encodes a (possibly randomized) controller or policy $\pi_\theta : S \to A$ with parameters $\theta \in \Theta$, which is followed by the agent. It is assumed that each policy yields a constant expected cumulative reward over a fixed number $\tau$ of actions taken when acting according to it, since the state transition probability $(s, s', a) \mapsto P_{s,a}(s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$ to a successor state $s'$ is stationary (time-independent) and depends only on the current state and action (Markov property), for all $s, s' \in S$, $a \in A$. This cumulative reward acts as a fitness measure $F_\pi : \Theta \to \mathbb{R}$, while the policy parameters $\theta$ (e.g., weights of a neural network $\pi_\theta$) are the variables of the problem. Thus, we consider the (reinforcement learning) optimization problem

$$\min_{\theta \in \Theta} F_\pi(\theta) = - \sum_{s_1, \dots, s_\tau \in S} \left( \sum_{k=0}^{\tau-1} \gamma^k R_{s_k}\big(s_{k+1}, \pi_\theta(s_k)\big) \right) \cdot \prod_{j=0}^{\tau-1} P_{s_j, \pi_\theta(s_j)}(s_{j+1}), \qquad (2)$$

where $\gamma \in (0, 1]$ is a discount factor.

Recent developments in RL have demonstrated the merit of "model-free" approaches to the design of high-dimensional controllers such as neural networks, solving a variety of previously inaccessible tasks [8, 18], and have produced novel frameworks for scaling evolution strategies to CPU clusters [17].

ESs have advantages and disadvantages compared to alternative approaches like policy gradient methods. Several mechanisms of ESs add robustness to the search. Modeling distributions over policy parameters, as done explicitly in natural evolution strategies (NES) [7] and also in CMA-ES, serves this purpose [12], and so do problem-agnostic algorithm design and strong invariance properties. Direct policy search does not suffer from the temporal credit assignment problem or from sparse rewards [17]. ESs have demonstrated superior exploration behavior, which is important to avoid a high bias when sampling the environment [13]. On the other hand, ESs ignore the information contained in individual state transitions and rewards. This inefficiency can (partly) be compensated by better parallelism in ESs [17].
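To make the objective concrete, the following sketch shows how $F_\pi(\theta)$ is typically estimated in practice by Monte Carlo rollouts. It is a minimal illustration assuming a classic (pre-0.26) Gym-style episodic environment with a continuous action space and a small tanh policy network; the architecture and all constants are hypothetical stand-ins, not the exact setup used in our experiments.

```python
import numpy as np
import gym  # OpenAI Gym [10]; assumes the classic env.step/env.reset API


def make_policy(theta, obs_dim, act_dim, hidden=30):
    # Unpack a flat parameter vector theta into a one-hidden-layer tanh
    # network; a hypothetical stand-in for the controllers pi_theta.
    n1 = obs_dim * hidden
    W1 = theta[:n1].reshape(obs_dim, hidden)
    b1 = theta[n1:n1 + hidden]
    n2 = n1 + hidden
    W2 = theta[n2:n2 + hidden * act_dim].reshape(hidden, act_dim)
    b2 = theta[n2 + hidden * act_dim:]
    return lambda s: np.tanh(np.tanh(s @ W1 + b1) @ W2 + b2)


def fitness(theta, env, episodes=1, gamma=1.0, tau=1600):
    # Monte Carlo estimate of F_pi(theta): negated (discounted) cumulative
    # reward, averaged over `episodes` rollouts. Random state transitions
    # make repeated calls return different values -- the fitness noise.
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]
    pi = make_policy(theta, obs_dim, act_dim)
    returns = []
    for _ in range(episodes):
        s, ret, discount = env.reset(), 0.0, 1.0
        for _ in range(tau):
            s, r, done, _ = env.step(pi(s))
            ret += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return -float(np.mean(returns))  # minimization convention
```

An ES then treats `fitness` as a noisy black box over the d-dimensional parameter vector θ.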
Evolution Strategies

In this section, we discuss Evolution Strategies (ESs) from a bird's eye perspective, in terms of their central algorithmic components, and without resorting to the details of their implementation. For actual exemplary algorithm instances with the properties under consideration, we refer to the literature.
Algorithm 1: Generic Evolution Strategy Template

    initialize λ, m ∈ R^d, σ > 0, C = I
    repeat
        repeat
            for i ← 1, ..., λ do
                sample offspring x_i ∼ N(m, σ²C)
                evaluate fitness f(x_i) by testing the controller encoded by x_i on the task
            end for
            actual optimization: update m
            step size control: update σ
            covariance matrix adaptation: update C
            uncertainty handling, i.e., adapt λ or the number of tests per fitness evaluation
        until stopping criterion is met
        prepare restart, i.e., set new initial m, σ, and λ, and reset C ← I
    until budget is used up
    return m

Algorithm 1 summarizes commonly found mechanisms without going into any details.

ESs enjoy many invariance properties. This is generally considered a sign of good algorithm design: due to their rank-based processing of fitness values, they are invariant to strictly monotonically increasing transformations of fitness; furthermore, they are invariant to translation, rotation, and scaling, provided that the initial distribution is transformed accordingly, and with CMA (see below) they are even asymptotically invariant under arbitrary affine transformations.
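For concreteness, here is a minimal runnable instance of the template in Algorithm 1, with fixed σ and C = I and only the center m updated; the adaptation mechanisms discussed next are omitted, and population size and step size are illustrative defaults rather than recommended settings.

```python
import numpy as np


def nonadaptive_es(f, d, lam=16, sigma=0.1, budget=100_000, seed=0):
    # Minimal (mu/mu, lambda)-ES: sample offspring around m, rank them by
    # fitness, recombine the best mu into the new center.
    # No sigma or C adaptation, no restarts, no uncertainty handling.
    rng = np.random.default_rng(seed)
    mu = lam // 2
    m = np.zeros(d)
    for _ in range(budget // lam):
        X = m + sigma * rng.standard_normal((lam, d))  # x_i ~ N(m, sigma^2 I)
        fx = np.array([f(x) for x in X])               # noisy black-box fitness
        m = X[np.argsort(fx)[:mu]].mean(axis=0)        # truncation selection
    return m
```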
Step Size Control.

The algorithms applied to RL problems in [6, 9, 17] are designed in the style of non-adaptive algorithms, i.e., they apply a mutation distribution with fixed parameters σ and C, adapting only the center m. This method is known to converge as slowly as pure random search [21]. Therefore it is in general imperative to add step size adaptation, which has been an integral mechanism since the inception of the method [4, 30]. Cumulative step size adaptation (CSA) is a state-of-the-art method [22]. Step size control enables linear convergence on scale-invariant (e.g., convex quadratic) functions, and hence locally linear convergence into twice continuously differentiable local optima [21], which puts ESs into the same class as many gradient-based methods. It was shown in [35] that the convergence of rank-based algorithms cannot be faster than linear. However, the convergence rate of a step size adaptive ES is of the form O(1/(kd)), where d is the dimensionality of the search space and k is the condition number of the Hessian at the optimum. In contrast, the convergence rate of gradient descent suffers from large k, but is independent of the dimension d.
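As an illustration of the mechanism, the sketch below adds CSA to the simple ES above: an evolution path accumulates consecutive steps of m, and σ grows if the path is longer than expected under random selection and shrinks otherwise. The constants follow common defaults in the spirit of [22], but the exact parameterization varies between publications and is an assumption here (in particular, the damping d_sigma is simplified).

```python
import numpy as np


def csa_es(f, d, lam=None, sigma=1.0, budget=100_000, seed=0):
    # (mu/mu_w, lambda)-ES with cumulative step size adaptation, C = I.
    rng = np.random.default_rng(seed)
    lam = lam or 4 + int(3 * np.log(d))                 # common heuristic
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # recombination weights
    w /= w.sum()
    mu_eff = 1.0 / np.sum(w ** 2)                        # variance-effective mu
    c_sigma = (mu_eff + 2) / (d + mu_eff + 5)            # path learning rate
    d_sigma = 1 + c_sigma                                # simplified damping
    chi_d = np.sqrt(d) * (1 - 1 / (4 * d) + 1 / (21 * d ** 2))  # E||N(0,I)||
    m, p = np.zeros(d), np.zeros(d)
    for _ in range(budget // lam):
        Z = rng.standard_normal((lam, d))
        fx = np.array([f(m + sigma * z) for z in Z])
        idx = np.argsort(fx)[:mu]
        z_mean = w @ Z[idx]                              # weighted recombination
        m = m + sigma * z_mean
        # Evolution path: long path => steps correlated => increase sigma.
        p = (1 - c_sigma) * p + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * z_mean
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p) / chi_d - 1))
    return m, sigma
```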
Metric Learning.

Metric adaptation methods like CMA-ES [5, 7, 22] improve the convergence rate to O(1/d) by adapting not only the global step size σ but also the full covariance matrix C of the mutation distribution. However, estimating a suitable covariance matrix requires a large number of samples, so that fast progress is made only after O(d²) fitness evaluations, which is in itself prohibitive for large d. Also, the algorithm-internal cost for storing and updating a full covariance matrix, and even for drawing a sample from the distribution, is at least of order O(d²), which means that modeling the covariance matrix quickly becomes the computational bottleneck, in particular if the cost of a fitness evaluation scales linearly with d, as is the case for neural networks.

Several ESs for large-scale optimization have been proposed as a remedy [1, 11, 20, 28, 31]. They model only the diagonal of the covariance matrix and/or interactions in an O(1)- to O(log(d))-dimensional subspace, achieving a time and memory complexity of O(d) to O(d log(d)) per sample. The aim is to offer a reasonable compromise between ES-internal and external (fitness) complexity while retaining most of the benefits of full covariance matrix adaptation. The LM-MA-ES algorithm [11] offers the special benefit of adapting the fastest evolving subspace of the covariance matrix with only O(d) samples, which is a significant speed-up over the O(d²) sample cost of full covariance matrix learning.
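To illustrate why such restricted covariance models scale, the sketch below draws offspring from an implicitly defined low-rank covariance by applying k ≪ d rank-one transformations to an isotropic Gaussian, in the spirit of LM-MA-ES [11]. The per-direction weights c_j and the maintenance of the stored direction vectors are placeholders, not the published learning-rate schedule; the point is the O(d·k) cost per sample instead of O(d²).

```python
import numpy as np


def sample_low_rank(m, sigma, M, lam, rng):
    # M holds k << d direction vectors (one per row). Each offspring starts
    # as z ~ N(0, I) and is transformed by k rank-one updates
    #   z <- (1 - c_j) * z + c_j * M[j] * (M[j]^T z),
    # yielding samples from an implicit covariance without ever forming
    # a d-by-d matrix. Cost: O(d * k) per sample.
    d = m.shape[0]
    k = M.shape[0]
    c = 1.0 / (1.5 ** np.arange(1, k + 1))     # illustrative weights only
    Z = rng.standard_normal((lam, d))          # isotropic base samples
    for j in range(k):
        proj = Z @ M[j]                        # shape (lam,)
        Z = (1 - c[j]) * Z + c[j] * np.outer(proj, M[j])
    return m + sigma * Z                       # offspring population
```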
Noise Handling.

Evolution strategies can be severely impaired by noise, in particular when it interferes with step size adaptation. Being randomized algorithms, ESs are capable of tolerating some level of noise with ease. In the easy-to-analyze multiplicative noise model [26], the noise level decays as we approach the optimum, and hence, on the sphere function $f(x) = \|x\|^2$, the signal-to-noise ratio (defined as the systematic variance of f due to sampling divided by the variance of the random noise) oscillates around a positive constant, provided that step size adaptation works as desired [25], keeping σ roughly proportional to ‖m‖/d. For strong noise this ratio is small; the ES then essentially performs a random walk, a non-elitist algorithm may even diverge, and CSA is in danger of converging prematurely [2]. For more realistic additive noise, the noise variance is (lower bounded by) a positive constant. When converging to the optimum, σ and hence the signal-to-noise ratio decay to zero, and progress stalls at some distance to the optimum. Thus there exists a principled limitation on the precision to which an optimum can be located. Explicit noise handling mechanisms like [3, 14] can be employed to increase the precision, and even to enable convergence, e.g., by adaptively increasing the population size or the number of independent controller test runs per fitness evaluation, effectively improving the signal-to-noise ratio. The algorithm parameters can be tuned to avoid premature convergence of CSA. However, the convergence speed is then so slow that in practice additive noise imposes a limit on the attainable solution precision, even if the optimal convergence rate is attained [3].
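A simple form of uncertainty handling is resampling: averaging n independent evaluations shrinks the noise standard deviation by a factor of √n. The adaptive rule below is a caricature of the mechanism of [14], which re-evaluates individuals and increases the evaluation effort when the ranking proves unstable; the instability measure and all constants are illustrative assumptions, not the published procedure.

```python
import numpy as np


def averaged_fitness(f, x, n):
    # Averaging n independent noisy evaluations improves the
    # signal-to-noise ratio at the cost of n-fold evaluation effort.
    return float(np.mean([f(x) for _ in range(n)]))


def rank_instability(f, X, n):
    # Evaluate the population twice and measure how much the ranking
    # changes; with a noise-free fitness the two orderings agree.
    r1 = np.argsort([averaged_fitness(f, x, n) for x in X])
    r2 = np.argsort([averaged_fitness(f, x, n) for x in X])
    return float(np.mean(r1 != r2))  # fraction of rank positions that moved


def adapt_effort(n, instability, threshold=0.3, n_max=64):
    # Caricature of [14]: spend more evaluations per individual while the
    # ranking is noise-dominated, fewer when the signal is clear.
    if instability > threshold:
        return min(2 * n, n_max)
    return max(n // 2, 1)
```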
Noise in High Dimensions.

There are multiple ways in which optimization under noise and optimization in high dimensions interact. In the best case, adaptation slows down due to the reduced information content per sample, which is the case for metric learning. The situation is even worse for step size adaptation: for the noise-free sphere problem, the optimal step size σ is known to be proportional to ‖m‖/d. Therefore, at the same distance to the optimum and for the same noise strength, noise handling becomes harder in high dimensions, since the smaller step size yields smaller systematic fitness differences between offspring and hence a worse signal-to-noise ratio. The step size can then become too small, and CSA can converge prematurely [2].

Experiments

Most of the theoretical arguments brought forward in the previous section are of an asymptotic nature, while practice is sometimes dominated by constant factors and transient effects. Also, it remains unclear which of the different effects, such as slow convergence, slow adaptation, and the difficulty of handling noise, is the critical factor. In this section, we provide empirical answers to these questions.

Well-established benchmark collections exist in the evolutionary computation domain, in particular for continuous search spaces [15, 19]. Typical questions are whether an algorithm can handle non-separable, ill-conditioned, multi-modal, or noisy test functions. However, it is not a priori clear which of these properties are found in typical controller design problems. For example, the optimization landscapes of neural networks are not yet well understood. Closing this gap is far beyond the scope of this paper. Here we pursue a simpler goal, namely to identify the most relevant factors. More concretely, we aim to understand in which situation (dimensionality and noise strength) which algorithm component (as discussed in the previous section) has a significant impact on optimization performance, and which mechanisms fail to pay off.

To this end, we run different series of experiments on established benchmarks from the optimization literature and from the RL domain. We have published code for reproducing all experiments online (https://github.com/NiMlr/High-Dim-ES-RL). For ease of comparison, we use the recently proposed MA-ES algorithm [5], adapting the full covariance matrix, which was shown empirically to perform very similarly to CMA-ES. This choice is motivated by its closeness to the LM-MA-ES method [11], which learns a low-rank approximation of the covariance matrix. When disabling metric learning entirely in these methods, we obtain a rather simple ES with CSA, which we include in the comparison.

Figure 1 shows the time evolution of the fitness F_π(θ) (eq. (2)) on three prototypical benchmark problems from the OpenAI Gym environment [10], a collection of RL benchmarks: acrobot, bipedal walker, and robopong. All three controllers π_θ (eq. (2)) are fully connected deep networks with hidden layer sizes 30-30-10 (acrobot) and 30-30-15-10 (bipedal walker and robopong), giving rise to moderate numbers of weights around 2,000, depending on the task-specific numbers of inputs and controls. It is apparent that in all three cases fitness noise plays a key role.

Figure 2 investigates the scaling of LM-MA-ES and MA-ES with problem dimension on the bipedal walker task. For the small network considered above, MA-ES performs considerably worse than LM-MA-ES, not only in wall clock time (not shown) but also in terms of sample complexity. A similar effect was observed in [11] for the Rosenbrock problem. This indicates that LM-MA-ES can profit from its fast adaptation to the most prominent subspace.
However, this effect does not necessarily generalize to other tasks. More importantly, we see (unsurprisingly) that the performance of both algorithms is severely affected as d grows.

[Figure 1: Evolution of population average fitness for three reinforcement learning tasks with LM-MA-ES, averaged over five runs. Panels: acrobot (1,483 weights), biped (2,349 weights), robopong (1,352 weights).]

[Figure 2: Evolution of fitness of neural networks with different numbers of weights (different hidden layer sizes), for LM-MA-ES (left) and MA-ES (right) on the bipedal walker task.]

In order to gain a better understanding of the effect of fitness noise on high-dimensional controller design, we consider optimization benchmarks. These problems have the advantage that the optimum is known and that the noise strength is under our control. Since we are particularly interested in scalable metric learning, we employ the noisy ellipsoid problem $f(x) = \bar{f}(x) + N(x)$ with $\bar{f}(x) = \sqrt{x^T H x}$, where $H$ has eigenvalues $\lambda_i = k^{(i-1)/(d-1)}$ and $N(x)$ is the noise; a minimal implementation sketch follows below. For the multiplicative case, the range of N(x) is proportional to $\bar{f}(x)$, while for the additive case it is not. Among the problem parameters we vary

• the problem dimension d ∈ {20, 200, 2000, 20000},
• the problem conditioning k (sphere, benign ellipsoid, standard ellipsoid), and
• the noise strength (none, multiplicative with various constants of proportionality, additive).
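For reference, here is a minimal implementation of this benchmark family, with H diagonal (the standard construction, equivalent to the general case up to rotation); the noise constant c stands for one of the tested noise levels and is an illustrative choice.

```python
import numpy as np


def make_noisy_ellipsoid(d, k, noise="none", c=0.025, seed=None):
    # fbar(x) = sqrt(x^T H x) with eigenvalues lambda_i = k^((i-1)/(d-1)),
    # so k is the condition number of H and k = 1 gives the sphere.
    rng = np.random.default_rng(seed)
    lam = k ** (np.arange(d) / max(d - 1, 1))  # eigenvalues of diagonal H

    def f(x):
        fbar = np.sqrt(np.sum(lam * x ** 2))   # sqrt(x^T H x)
        if noise == "multiplicative":
            return fbar * (1.0 + c * rng.standard_normal())
        if noise == "additive":
            return fbar + c * rng.standard_normal()
        return fbar

    return f
```

Combined with any of the ES sketches above, this reproduces the qualitative setup underlying Figure 3.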
Figure 3 shows the time evolution of fitness and step size of the different algorithms under these conditions.

The experiments on the noise-free sphere problem show that the speed of optimization decays with increasing dimension, as predicted by theory [25]: halving the distance to the optimum requires Θ(d) samples. For this reason, within the fixed evaluation budget, there is less progress in higher dimensions. For d = 20,000, the distance to the optimum is reduced only by a small factor, which requires the step size to change by the same (small) amount. However, extrapolating our results, we see that in extremely high dimensions the algorithm is simply not provided enough generations to make sufficient progress to justify step size adaptation. This is in accordance with [6]. A similar effect is observed for metric learning, which takes Θ(d²) samples to adapt the full covariance matrix. Even for the still moderate dimension of d = 2,000, this consumes a substantial share of the budget, and it becomes hopeless for neural networks with millions of weights, unless the budget grows at least linearly with the number of weights. This in turn requires extremely fast simulations as well as a large amount of computational hardware resources.

[Figure 3: Evolution of fitness and step size over function evaluations, averaged over five independent runs, for three different algorithms and problems. Rows: sphere, benign ellipse, ellipse. Columns: no noise, multiplicative noise 0.005, multiplicative noise 0.025, multiplicative noise 0.125, additive noise 10⁻⁶. Legend: d = 20, 200, 2000, 20000; simple ES, LM-MA-ES, MA-ES. Note the logarithmic scale of both axes.]

Noise has a significant impact on the optimization behavior and on the solution quality. Additive noise implies extremely slow convergence, and indeed we find that all methods stall in this case. Too strong multiplicative noise even results in divergence. A particularly adversarial effect is that the noise strength that can be tolerated is at best inversely proportional to the dimension. This effect shows up nicely in the noisy sphere results. Here, uncertainty handling can help in principle, since it improves the signal-to-noise ratio adaptively to the needs of the algorithm, but at the cost of more function evaluations per generation, which amplifies the effects discussed above.

In the presence of noise, CSA does not seem to work well in low dimensions. In the case of high noise, log(σ) performs a random walk. However, this walk is subject to a selection bias away from high values, even though high values would improve the signal-to-noise ratio. Therefore we find extended periods of stalled progress, in particular for d = 20, accompanied by a random walk of the (far too small) step size. This effect is not observed in higher dimensions, probably due to the smaller update rate.

We are particularly interested in the interplay between metric adaptation and noise. It turns out that in all cases where CMA helps (non-spherical problems of moderate dimension), i.e., where LM-MA-ES and MA-ES outperform the simple ES, the same holds true for the corresponding noisy problems.
We conclude that metric learning still works well, even when faced with noise in high dimensions. The influence of noise can be controlled and mitigated with uncertainty handling techniques [3, 14]. This essentially results in curves similar to those in the leftmost column of Figure 3, but with slower convergence, depending on the noise strength. In controller design, noise handling can be key to success, in particular if the optimal controller is nearly deterministic while strong noise is encountered during learning. This is a plausible assumption for the bipedal walker task: at an intermediate stage, the walker falls over randomly depending on minor details of the environment, resulting in high noise variance, while a controller that has learned a stable and robust walking pattern achieves good performance with low variance. Then it is key to handle the early phase by means of uncertainty handling, which enables the ES to enter the late convergence phase eventually. Figure 4 displays such a situation for the benign ellipse with d = 100,000, with additive noise applied only for function values above a threshold. LM-MA-ES without uncertainty handling fails, but with uncertainty handling the algorithm finally reaches the noise-free region and then converges quickly.

[Figure 4: (UH-)LM-MA-ES on the benign ellipse in d = 100,000 with additive noise restricted to f̄(x) > 0.5. LM-MA-ES without uncertainty handling (blue curve) diverges, while LM-MA-ES with uncertainty handling approaches the optimum (red curve).]

Figure 5 shows the effect of uncertainty handling. It yields significantly more stable optimization behavior in two ways: first, it keeps the step size high, avoiding an undesirable decay and hence the danger of premature convergence or of a less robust population; second, it keeps the fitness variance small, which allows the algorithm to reach better fitness in the late fine-tuning phase. Interestingly, the ES without uncertainty handling is initially faster. This can be mitigated by tuning the initial step size, which in any case becomes an increasingly important task in high dimensions, for two reasons: adaptation takes long in high dimensions, and even worse, a too small initial step size makes uncertainty handling kick in without need, so that the adaptation takes even longer. The latter is especially relevant for the expensive problems commonly found in RL.

[Figure 5: Fitness and number of re-evaluations per candidate (left); step size and population standard deviation of fitness (right). Averaged over six runs of LM-MA-ES with and without uncertainty handling on the bipedal walker task.]
Conclusion

We have investigated the utility of different algorithmic mechanisms of evolution strategies for problems with a specific combination of challenges, namely high-dimensional search spaces and fitness noise. The study is motivated by a broad class of problems, namely the design of flexible controllers. Reinforcement learning with neural networks yields some extremely high-dimensional problem instances of this type.

We have argued theoretically, and also found empirically, that many of the well-established components of state-of-the-art methods like CMA-ES and scalable variants thereof gradually lose their value in high dimensions, unless the number of function evaluations can be scaled up accordingly. This affects the adaptation of the covariance matrix, and in extremely high-dimensional cases also the step size. This somewhat justifies the application of very simple algorithms for training neural networks with millions of weights, see [6].

Additive noise imposes a principled limitation on the solution quality. However, it turns out that adaptation of the search distribution still helps, because it allows for a larger step size and hence a better signal-to-noise ratio. Unsurprisingly, uncertainty handling can be a key technique for robust convergence.

Overall, we find that adaptation of the mutation distribution becomes less valuable in high dimensions because it kicks in only rather late. However, it never harms, and it can help even when dealing with noise in high dimensions. Our results indicate that a scalable modern evolution strategy with step size control and efficient metric learning, equipped with uncertainty handling, is the most promising general-purpose technique for high-dimensional controller design.
References

[1] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. Comparison-based natural gradient optimization in high dimension. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pages 373–380. ACM, 2014.
[2] Hans-Georg Beyer and Dirk V. Arnold. Qualms regarding the optimality of cumulative path length control in CSA/CMA-evolution strategies. Evolutionary Computation, 11(1):19–28, 2003.
[3] Hans-Georg Beyer and Michael Hellwig. Analysis of the pcCMSA-ES on the noisy ellipsoid model. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 689–696. ACM, 2017.
[4] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies – a comprehensive introduction. Natural Computing, 1(1):3–52, 2002.
[5] Hans-Georg Beyer and Bernhard Sendhoff. Simplify your covariance matrix adaptation evolution strategy. IEEE Transactions on Evolutionary Computation, 2017.
[6] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. Back to basics: Benchmarking canonical evolution strategies for playing Atari. Technical Report 1802.08842, arXiv.org, 2018.
[7] Daan Wierstra et al. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
[8] David Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[9] Felipe Such et al. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. Technical Report 1712.06567, arXiv.org, 2017.
[10] Greg Brockman et al. OpenAI Gym. Technical Report 1606.01540, arXiv.org, 2016.
[11] Ilya Loshchilov et al. Limited-memory matrix adaptation for large scale black-box optimization. Technical Report 1705.06693, arXiv.org, 2017.
[12] Joel Lehman et al. ES is more than just a traditional finite-difference approximator. Technical Report 1712.06568v2, arXiv.org, 2017.
[13] Matthias Plappert et al. Parameter space noise for exploration. Technical Report 1706.01905v2, arXiv.org, 2017.
[14] Nikolaus Hansen et al. A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation, 13(1):180–197, 2009.
[15] Nikolaus Hansen et al. COCO: A platform for comparing continuous optimizers in a black-box setting. Technical Report 1603.08785, arXiv.org, 2016.
[16] Thomas Geijtenbeek et al. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics (TOG), 32(6):206, 2013.
[17] Tim Salimans et al. Evolution strategies as a scalable alternative to reinforcement learning. Technical Report 1703.03864, arXiv.org, 2017.
[18] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[19] Xiaodong Li et al. Benchmark functions for the CEC 2013 special session and competition on large-scale global optimization. gene, 7(33):8, 2013.
[20] Yi Sun et al. A linear time natural evolution strategy for non-separable functions. In Conference Companion on Genetic and Evolutionary Computation. ACM, 2013.
[21] Nikolaus Hansen, Dirk V. Arnold, and Anne Auger. Evolution strategies. In Springer Handbook of Computational Intelligence, pages 871–898. Springer, 2015.
[22] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[23] Verena Heidrich-Meisner and Christian Igel. Neuroevolution strategies for episodic reinforcement learning. Journal of Algorithms, 64(4):152–168, 2009.
[24] Christian Igel. Neuroevolution for reinforcement learning using evolution strategies. In Congress on Evolutionary Computation, volume 4, pages 2588–2595, 2003.
[25] Jens Jägersküpper. How the (1+1)-ES using isotropic mutations minimizes positive definite quadratic forms. Theoretical Computer Science, 361(1):38–56, 2006.
[26] Mohamed Jebalia and Anne Auger. On multiplicative noise models for stochastic search. In Parallel Problem Solving from Nature, pages 52–61. Springer, 2008.
[27] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[28] Ilya Loshchilov. A computationally efficient limited memory CMA-ES for large scale optimization. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pages 397–404. ACM, 2014.
[29] David E. Moriarty, Alan C. Schultz, and John J. Grefenstette. Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research, 11:241–276, 1999.
[30] Ingo Rechenberg. Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. 1973.
[31] Raymond Ros and Nikolaus Hansen. A simple modification in CMA-ES achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, pages 296–305. Springer, 2008.
[32] Kenneth Stanley, David D'Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.
[33] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
[34] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[35] Olivier Teytaud and Sylvain Gelly. General lower bounds for evolutionary algorithms. In Parallel Problem Solving from Nature – PPSN IX, pages 21–31, 2006.