Learning Adaptive Differential Evolution Algorithm from Optimization Experiences by Policy Gradient
Jianyong Sun, Senior Member, IEEE, Xin Liu, Thomas Bäck, Senior Member, IEEE, and Zongben Xu

J. Sun, X. Liu and Z. Xu are with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China, 710049. Email: [email protected], [email protected], [email protected].
T. Bäck is with the Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands. Email: [email protected].
Abstract—Differential evolution is one of the most prestigious population-based stochastic optimization algorithms for black-box problems. The performance of a differential evolution algorithm depends highly on its mutation and crossover strategy and the associated control parameters. However, the determination process for the most suitable parameter setting is troublesome and time-consuming. Adaptive control parameter methods that can adapt to the problem landscape and optimization environment are preferable to fixed parameter settings. This paper proposes a novel adaptive parameter control approach based on learning from the optimization experiences over a set of problems. In the approach, the parameter control is modeled as a finite-horizon Markov decision process. A reinforcement learning algorithm, named policy gradient, is applied to learn an agent (i.e., parameter controller) that can provide the control parameters of a proposed differential evolution adaptively during the search procedure. The differential evolution algorithm based on the learned agent is compared against nine well-known evolutionary algorithms on the CEC'13 and CEC'17 test suites. Experimental results show that the proposed algorithm performs competitively against these compared algorithms on the test suites.
Index Terms—Adaptive differential evolution, reinforcement learning, deep learning, policy gradient, global optimization.
I. INTRODUCTION
Among many evolutionary algorithm (EA) variants, differential evolution (DE) [1], [2] is one of the most prestigious due to its exclusive advantages such as automatic adaptation, easy implementation, and very few control parameters [3], [4]. DE variants have been successfully applied to a variety of real-world optimization problems [3], [5], and have been considered very competitive in the evolutionary computation community according to their performances in various competitions [6]–[9]. However, DE also has some drawbacks such as stagnation, premature convergence, sensitivity/insensitivity to control parameters, and others [5]. Although various factors, such as the dimensionality of the decision space and the characteristics of the optimization problems, can cause these drawbacks, a bad choice of control parameters (namely, the scale factor F, the crossover rate CR and the population size N) is one of the key problems [10], [11].

It is well acknowledged that control parameters can significantly influence the performance of an evolutionary algorithm [10], and this also holds for the control parameters of DE. In the early days of research in DE, the control parameters were usually set by trial-and-error [12]–[14] according to the optimization experiences gained from applying them to a set of test problems. Once the control parameters are set, they are fixed along the search procedure (this parameter determination approach is usually referred to as "parameter tuning"). For example, in [15], F and CR are suggested to lie within fixed ranges, while in [13] N is suggested to be chosen relative to the problem dimension n. The trial-and-error approach is usually time-consuming, not reliable and inefficient [16].

Along with the study of the exploration-exploitation relation in EAs, it was found that the optimal control parameter setting for a DE algorithm is problem-specific, dependent on the state of the evolutionary search procedure, and on the different requirements of problems [17]. Further, different control parameters impose different influences on the algorithmic performance in terms of effectiveness, efficiency and robustness [18]. Therefore, it is generally a very difficult task to properly determine the optimal control parameters for a balanced algorithmic performance due to various factors, such as problem characteristics and the correlations among them.

Some researchers claim that F and CR both affect the convergence and robustness, but F is more related to convergence [19], CR is more sensitive to the problem characteristics [20], and their optimal values correlate with N [21]. N could cause stagnation if it is too small, and slow convergence if it is too big [17]. Its setting strongly depends on problem characteristics (e.g., separability, multi-modality, and others). It is also observed that the control parameters should be set differently at different generations simply due to changing requirements (i.e., exploration vs. exploitation) in different phases of the optimization run. For example, researchers generally believe that F should be set larger at the beginning of the search to encourage exploration and smaller at the end to ensure convergence [22].
In light of this observation, parameter control methods, i.e., methods for changing the control parameter values dynamically during the run, have become one of the main streams in DE studies [23]. These methods are referred to as "parameter control". In [24], the parameter control methods are classified as deterministic, adaptive, and self-adaptive, while a hybrid control is included in [23]. In this paper, we propose to classify the parameter control methods based on how they are set according to online information collected during the evolutionary search. Our classification differentiates whether parameters are learned during the search or not, and is provided next.
1) Parameter Control Without Learning:
In this category,online information of any form is not reflected in the control a r X i v : . [ c s . N E ] F e b of the parameters. Rather, a simple rule is applied determin-istically or probabilistically along the search process.The simple rule is constructed usually in three ways. First,no information is used at all. For example, in [25], [26], F issampled uniformly at random from a prefixed range at eachgeneration for each individual. In [27], a combination of F and CR is randomly picked from three pre-defined pairs of F and CR at each generation for each individual.Second, the simple rule is time-varying depending on thegeneration. For example, in [25], F decreases linearly, whilein [28], F and CR are determined based on a sinusoidalfunction of the generation index. In [6], the population size N is linearly decreased with respect to the number of fitnessevaluation used.Third, information collected in the current generation , suchas the range of the fitness values, the diversity of the popula-tion, the distribution of individuals and the rank of individuals,is used to specify the simple rule. For example, in [29], theminimum and maximum fitness values at current populationare used to determine the value of F at each generation.In [30], F and CR are sampled based on the diversity of theobjective values of the current population, while the averageof the current population in the objective space is used in [31].The individual’s rank is used to determine F and CR for eachindividual in [32]. It is used to compute the mean values ofthe normal distributions associated with F and CR in [33]from which F and CR are sampled for each individual.
2) Parameter Control With Learning from the Search Process:
In this category, collectable information during the search process is processed for updating the control parameters. The information used in these methods is mainly the successful trials obtained by using previous F and CR values. There are mainly three ways.

First, a trial F and CR is decided by an ε-greedy strategy, as initially developed in [34]. That is, if a uniformly sampled value is less than a hyper-parameter ε, a random number in [0, 1] is uniformly sampled as the trial F (resp. CR); otherwise the previous F (resp. CR) is used. If the trial obtained by using these F and CR values is successful, the sampled F and CR will be passed to the next generation; otherwise the previous F and CR will be kept. The sampled F value is taken as a perturbation in [35]. The hyper-parameter ε is adaptively determined either by the current population diversity [30], or by fitness values [31].

Second, successful control parameters are stored in a memory (or pool) during the optimization process, which is then used to create the next parameters [36], [37]. For example, the median (or mean) value of the memory values is used as the mean of the control parameter distribution for sampling new control parameters [36]. The distribution is assumed to be normal [36] or Cauchy [38]. To make the sampling adapt to the search state, hyper-parameters are introduced in the distribution [14], [38], and these hyper-parameters are updated at each generation according to the success of previous distributions.

Third, some authors proposed to update the mean of the control parameter's distribution through a convex linear combination between the arithmetic [37] or Lehmer [6], [39] mean of the stored pool of successful control parameters and the current control parameters. In [40], an ensemble of two sinusoidal waves is added to adapt F based on the successful performance of previous generations. In [41], a semi-parameter adaptation approach based on randomization and adaptation is developed to effectively adapt F values. In [42], the control parameters are also modified based on the ratio of the current number of function evaluations to the maximum allowed number of function evaluations. A grouping strategy with an adaptation scheme is proposed in [43], [44] to tackle the improper updating method of CR. In [45], [46], a memory update mechanism is further developed, where new control parameters are assumed to follow normal and Cauchy distributions, respectively. In [47], the control parameters are updated according to the formulae of particle swarm optimization, in which the control parameters are considered as particles and evolved along the evolution process.

In this paper, we propose a novel approach to adaptively control F and CR. In our approach, the control parameters at each generation are the output of a non-linear function which is modeled by a deep neural network (DNN). The DNN works as the parameter controller. The parameters of the network are learned from the experiences of optimizing a set of training functions by a proposed DE. The learning is based on the formalization of the evolutionary procedure as a finite-horizon Markov decision process (MDP). One of the reinforcement learning algorithms, policy gradient [48], is applied to optimize for the optimal parameters of the DNN. This method can be considered as an automatic alternative to the trial-and-error approach in the early days of DE study. Note that the trial-and-error approach can only provide possible values or ranges of the control parameters.
For a new test problem, these values need to be further adjusted, which could result in spending a large amount of computational resources. Our approach does not need such extra adjustment for a new test problem. It can provide control parameter settings that are adaptive not only to the search procedure, but also to the test problems.

In the remainder of the paper, we first introduce deep learning and reinforcement learning, which are the preliminaries of our approach, in Section II. Section III presents the proposed learning method. Experimental results are presented in Sections IV and V, including the comparison with several well-known DEs and a state-of-the-art EA. The related work is presented in Section VI. Section VII concludes the paper.

II. PARAMETERIZED KNOWLEDGE REPRESENTATION BY DEEP AND REINFORCEMENT LEARNING
A. Deep Learning
Deep learning is a class of machine learning algorithms for learning data representations [49]. It consists of multiple layers of nonlinear processing units for the extraction of meaningful features. The deep learning architecture can be seen as a parameterized model for knowledge representation and a tool for knowledge extraction. It can also be seen as an efficient expert system: given the current state of a system, a proper decision can be made through the deep neural network if the network is optimally trained. The deep neural network has a high order of modeling freedom due to its large number of parameters, which makes it able to make accurate predictions. Deep learning has had remarkable success in recent years in image processing, natural language processing and other application domains [49].

TABLE I: Notations.
N: the population size
n: dimension of the decision variable
f: the objective (fitness) function
x_i^t ∈ R^n: the i-th individual at the t-th generation
P^t ∈ R^{N×n}: the t-th population
F^t ∈ R^N: the fitness values of the t-th population
V^t ∈ R^{N×n}: the mutated population at the t-th generation
P̃^t ∈ R^{N×n}: the trial solutions obtained at the t-th generation
F̃^t ∈ R^N: the fitness values of the trial solutions
U^t: the statistics at the t-th generation
H^t: the memory at the t-th generation
M: the mutation operator
CR: the crossover operator
S: the selection operator

B. Reinforcement Learning
Reinforcement learning (RL) deals with the situation in which an agent interacts with its surrounding environment. The aim of learning is to find an optimal policy for the agent that can respond to the environment so as to maximize the cumulative reward. RL can be modeled as a Markov decision process (MDP) described by a 4-tuple (s, a, r, p_a), where s (resp. a, r and p_a) represents the state (resp. action, reward, and transition probability).

Formally, at each time step t, a state s_t and a reward r_t are associated with it. A policy π is a conditional probability distribution, i.e., π = p(A_t | S_t; θ) with parameter θ, where A_t (resp. S_t) represents the random variable for the action (resp. state). Given the current state s_t, the agent takes an action a_t by sampling from π. Given this action, the environment responds with a new state s_{t+1} and a reward r_{t+1}. The expectation of the total reward,

U(θ) = E_{τ∼q(τ)} [ Σ_{t=0}^{T} γ_t r_t ],

where γ_t denotes a time-step dependent weighting factor, is to be maximized for an optimal policy π*, where the expectation is taken over the trajectory τ = {s_0, a_0, s_1, a_1, ..., a_{T−1}, s_T, ...} with joint probability distribution q(τ).

To train an RL agent for the optimal policy parameters, various RL algorithms, such as temporal difference, Q-learning, SARSA, policy gradient and others, have been widely used for different scenarios (e.g., discrete or continuous action and state spaces) [48]. If the policy is represented by a deep neural network, this leads to the so-called deep RL, which has achieved incredible success in playing games, such as AlphaGo [50].
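As a reading aid for the notation above, the weighted return inside the expectation can be stated in a few lines of Python; this is only an illustration under our own choices (the helper name and the unit weights), not part of the method described later.

import numpy as np

def trajectory_return(rewards, gammas):
    # Weighted return sum_t gamma_t * r_t of one sampled trajectory.
    rewards = np.asarray(rewards, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    return float(np.sum(gammas * rewards))

# Hypothetical rewards over three steps, with gamma_t = 1 for illustration.
print(trajectory_return([0.10, 0.05, 0.30], [1.0, 1.0, 1.0]))  # 0.45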
III. ADAPTIVE PARAMETER CONTROL VIA POLICY GRADIENT

In this section, we show how to learn to control the adaptive settings of a typical DE. Before presenting the algorithm, the notations used in the paper are listed in Table I.
A. The Typical DE
In the proposed DE, the current-to-pbest/1 mutation operator and the binomial crossover are employed. In the current-to-pbest/1 mutation operator [37], at the t-th generation, for each individual x_i^t, a mutated individual v_i^t is generated in the following manner:

v_i^t = x_i^t + F_i^t · (x_pbest^t − x_i^t) + F_i^t · (x_{r1}^t − x_{r2}^t),   (1)

where x_pbest^t is an individual randomly selected from the best N × p (p ∈ (0, 1]) individuals at generation t. The indices r1, r2 are randomly selected from [1, N] such that they differ from each other and from i.

The binomial crossover operator works on the target individual x_i^t = (x_{i,1}^t, ..., x_{i,n}^t) and the corresponding mutated individual v_i^t = (v_{i,1}^t, ..., v_{i,n}^t) to obtain a trial individual x̃_i^t = (x̃_{i,1}^t, ..., x̃_{i,n}^t) element by element as follows:

x̃_{i,j}^t = v_{i,j}^t, if rand[0, 1] ≤ CR_i^t or j = j_rand;   x̃_{i,j}^t = x_{i,j}^t, otherwise,   (2)

where rand[0, 1] denotes a number sampled uniformly from [0, 1] and j_rand is an integer randomly chosen from [1, n].

Given the trial individuals, the next generation is selected individual by individual. For each individual, if f(x̃_i^t) ≤ f(x_i^t), then x_i^{t+1} = x̃_i^t; otherwise x_i^{t+1} = x_i^t.

It is clear that the control parameters of the proposed DE include N and {F_i^t, CR_i^t, 1 ≤ i ≤ N} at each generation. In the following, we will show how {F_i^t, CR_i^t, 1 ≤ i ≤ N} can be learned from optimization experiences.
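To make the three operators concrete, the sketch below implements one generation of current-to-pbest/1 mutation, binomial crossover and greedy selection with NumPy. It is a minimal illustration of Eqs. (1)-(2), not the authors' code; the per-individual F and CR vectors are assumed to be supplied by the controller introduced in the next subsection, and bound handling is omitted.

import numpy as np

def de_generation(pop, fit, f, F, CR, p=0.05, rng=None):
    # One generation: current-to-pbest/1 mutation, binomial crossover, greedy selection.
    # pop: (N, n) population; fit: (N,) fitness values (minimized);
    # F, CR: (N,) per-individual control parameters produced by the controller.
    rng = rng or np.random.default_rng()
    N, n = pop.shape
    top = max(1, int(np.ceil(p * N)))              # size of the pbest pool
    pbest_pool = np.argsort(fit)[:top]             # indices of the best N*p individuals
    new_pop, new_fit = pop.copy(), fit.copy()
    for i in range(N):
        pbest = pop[rng.choice(pbest_pool)]
        r1, r2 = rng.choice([j for j in range(N) if j != i], size=2, replace=False)
        v = pop[i] + F[i] * (pbest - pop[i]) + F[i] * (pop[r1] - pop[r2])    # Eq. (1)
        jrand = rng.integers(n)
        mask = (rng.random(n) <= CR[i]) | (np.arange(n) == jrand)            # Eq. (2)
        trial = np.where(mask, v, pop[i])
        ft = f(trial)
        if ft <= fit[i]:                            # greedy selection
            new_pop[i], new_fit[i] = trial, ft
    return new_pop, new_fit

In practice a bound-handling step (clipping or reflecting trial vectors into the search domain) would normally follow the crossover; it is left out here for brevity.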
B. Embed Recurrent Neural Network within the Typical DE

The evolution procedure of the proposed DE can be formalized as follows. At generation t, a mutated population V^t = {v_1^t, ..., v_N^t} is first generated by applying the mutation operator (denoted as M) on the current population P^t; a trial population P̃^t = {x̃_1^t, ..., x̃_N^t} is further obtained by applying the binomial crossover operator (denoted as CR). The new population is then formed by applying the selection operation (denoted as S). In the sequel, we denote Θ_F^t = {F_i^t, 1 ≤ i ≤ N} and Θ_CR^t = {CR_i^t, 1 ≤ i ≤ N}. Formally, the evolution procedure can be written as follows:

V^t = M(P^t; Θ_F^t);   P̃^t = CR(V^t, P^t; Θ_CR^t);   F̃^t = f(P̃^t);   P^{t+1}, F^{t+1} = S(F^t, F̃^t, P^t, P̃^t).   (3)

Due to the stochastic nature of the mutation and crossover operators, the evolution procedure can be considered as a stochastic time series. The creation of solutions at generation t depends on information collected from the previous generations 0 to t − 1. Fig. 1 shows the flowchart of the procedure at the t-th generation.

As discussed in the introduction, recent studies focused on controlling F and CR by learning from online information. Various kinds of information are derived and used to update the control parameters for the current population. Generally speaking, the updated control parameters can be considered as the output of a non-linear function with the collected information as input. Since an artificial neural network (ANN) is a universal function approximator [51], this motivates us to take an ANN to approximate the non-linear function.

Fig. 1: The flowchart of a typical DE at the t-th generation. In the figure, the mutation operator (resp. crossover and selection operator) is denoted as M (resp. CR and S).

Fig. 2: The proposed DE with embedded neural network (denoted as N) at the t-th generation.

The neural network can be embedded in the evolution procedure as a parameter controller. Fig. 2 shows the flowchart of the proposed DE with the embedded neural network at the t-th generation. As illustrated in the figure, the neural network (N in the figure) outputs Θ_F^t and Θ_CR^t. Θ_F^t is used as the input to the mutation operator, while Θ_CR^t is the input to the crossover operator. The mutated population (i.e., V^t) and the trial population (i.e., P̃^t) are then generated, respectively. V^t and P^t are the input to CR; P̃^t and P^t are the input to S for selecting the new population P^{t+1}.

Existing research only utilizes the information collected from the current and/or previous generations for the parameter control. However, all the information up to the current generation should have a certain influence, although with different importance: the closer to the current generation, the more influential.

To accommodate this time-dependence feature, we take the neural network to be a long short-term memory (LSTM) network [52]. LSTM is a kind of recurrent neural network. As its name suggests, LSTM is capable of capturing long-term dependencies among input signals.
There are a variety of LSTM variants. Its simplest form is formulated as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f);
i_t = σ(W_i · [h_{t−1}, x_t] + b_i);
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c);
C_t = f_t ⊗ C_{t−1} + i_t ⊗ C̃_t;
o_t = σ(W_o · [h_{t−1}, x_t] + b_o);
h_t = o_t ⊗ tanh(C_t),

where x_t is the input at step t, [h_{t−1}, x_t] denotes the catenation of h_{t−1} and x_t, ⊗ denotes the Hadamard product, and σ and tanh are the sigmoid and hyperbolic tangent activation functions, respectively:

σ(z) = 1 / (1 + e^{−z});   tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

The parameters of the LSTM include the matrices W_f, W_i, W_c and W_o and the biases b_f, b_i, b_c and b_o. Fig. 3 shows the flowchart of the LSTM.

Fig. 3: The flowchart of the LSTM.

Omitting the intermediate variables, the LSTM can be formally written as:

C_t, h_t = LSTM(x_t, h_{t−1}, C_{t−1}; W),   (4)

where W = [W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o] denotes its parameters.

In our context, we consider the input to the LSTM to be the catenation of F^t and U^t, where U^t is a set of statistics derived from F^t, and denote A^t = [F^t, U^t] (here [F^t, U^t] means that the two vectors F^t and U^t are catenated into one single vector). In addition, we use H^t and C^t to represent the short- and long-term memory. Formally, the parameter controller can be written as follows:

C^t, H^t = LSTM(A^t, H^{t−1}, C^{t−1}; W_L);
Θ_F^t = FullConnect(H^t; W_F, b_F);
Θ_CR^t = FullConnect(H^t; W_C, b_C),   (5)

where W_L is the parameter of the LSTM and FullConnect(·; W, b) represents a fully-connected neural network with weight matrix W and bias b. Here

Θ_F^t = σ(H^{t⊤} W_F + b_F);   Θ_CR^t = σ(H^{t⊤} W_C + b_C).   (6)

In the sequel, we denote Θ^t = [Θ_F^t, Θ_CR^t] and use the following concise formula to represent Eq. 5:

Θ^t, C^t, H^t = LSTM(A^t, H^{t−1}, C^{t−1}; W),   (7)

where W = [W_L, W_F, b_F, W_C, b_C].
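The controller of Eqs. (5)-(7) can be sketched with a standard LSTM cell and two fully-connected heads. The PyTorch fragment below is a minimal rendering under our own assumptions about tensor shapes (the input A^t is treated as a single flat vector with a leading batch dimension); it is meant to show the data flow, not to reproduce the authors' implementation.

import torch
import torch.nn as nn

class ParameterController(nn.Module):
    # LSTM-based controller: maps the LSTM input A^t (fitness values plus statistics U^t)
    # and the recurrent memories (H^{t-1}, C^{t-1}) to per-individual F and CR values.
    def __init__(self, state_dim, hidden_dim, pop_size):
        super().__init__()
        self.cell = nn.LSTMCell(state_dim, hidden_dim)   # Eqs. (4)-(5), parameters W_L
        self.head_F = nn.Linear(hidden_dim, pop_size)    # W_F, b_F
        self.head_CR = nn.Linear(hidden_dim, pop_size)   # W_C, b_C

    def forward(self, A_t, H_prev, C_prev):
        # A_t, H_prev, C_prev are shaped (1, state_dim) and (1, hidden_dim).
        H_t, C_t = self.cell(A_t, (H_prev, C_prev))
        theta_F = torch.sigmoid(self.head_F(H_t))        # Eq. (6), values in (0, 1)
        theta_CR = torch.sigmoid(self.head_CR(H_t))
        return torch.cat([theta_F, theta_CR], dim=-1), H_t, C_t   # Theta^t = [Theta_F^t, Theta_CR^t]

Under the definitions above, state_dim would be N + 2b (the N fitness values plus the two b-bin histogram statistics defined in the next subsection) and pop_size = N; the sigmoid heads keep the mean F and CR values in (0, 1), matching Eq. (6).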
C. Model the Evolution Search Procedure as an MDP

To learn the parameters of the LSTM, i.e., the agent or the controller, embedded in the DE, we first model the evolution procedure of the proposed DE as an MDP with the following definitions of environment, state, action, policy, reward, and transition.
1) Environment:
For parameter control, an optimal controller is expected to be learned from the optimization experiences obtained when optimizing a set of optimization problems. Therefore, the environment consists of a set of optimization problems (called training functions). They are used to evaluate the performance of the controller during learning. Note that these training functions should share some common characteristics from which a good parameter controller can be learned.
2) State (S^t): We take the fitness F^t, the statistics U^t and the memories H^t as the state S^t.

In particular, U^t includes the histogram of the normalized F^t (denoted as h^t) and the moving average of the histogram vectors over the past g generations (denoted as h̄^t). Formally,

f̄_i = (f_i − min{F^t}) / (max{F^t} − min{F^t});
h^t = histogram({f̄_i}, b);
h̄^t = (1/g) Σ_{i=t−g}^{t−1} h^i,   (8)

where b is the number of bins. (A histogram is constructed by dividing the entire range of values into a series of intervals, i.e., bins, and counting how many values fall into each bin.) The lower (resp. upper) range of the bins is defined by min{F^t} (resp. max{F^t}). That is, to derive U^t, the fitness values of the current population are first normalized. Their histogram is then computed and taken as the input to the LSTM. This represents the information of the current population. Further, the statistic h̄^t is computed as the information from the previous search history.

It should be noted that the statistics U^t are computed at each generation w.r.t. the current population, not for each individual.
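The following NumPy sketch illustrates the statistics of Eq. (8); the helper names are our own, and the histogram is taken over the normalized values, so the bin range [0, 1] corresponds to the min/max of F^t described above.

import numpy as np
from collections import deque

def fitness_histogram(fit, b):
    # Histogram h^t of the normalized fitness values of the current population (Eq. 8).
    lo, hi = fit.min(), fit.max()
    norm = (fit - lo) / (hi - lo) if hi > lo else np.zeros_like(fit)
    h, _ = np.histogram(norm, bins=b, range=(0.0, 1.0))
    return h.astype(float)              # raw bin counts

class StateTracker:
    # Keeps the moving average h_bar^t of the histograms over the past g generations.
    def __init__(self, g, b):
        self.hist_buffer = deque(maxlen=g)
        self.b = b

    def statistics(self, fit):
        h = fitness_histogram(fit, self.b)
        h_bar = (np.mean(np.array(self.hist_buffer), axis=0)
                 if self.hist_buffer else np.zeros(self.b))
        self.hist_buffer.append(h)      # current histogram enters the buffer for later generations
        return np.concatenate([h, h_bar])   # U^t = [h^t, h_bar^t]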
3) Action (A^t) and Policy (π): In the MDP, given state S^t, the agent chooses (samples) an action from the policy π, defined as a probability distribution p(A^t | S^t; θ), where θ represents the parameters of the policy. Here we define A^t as the control parameters, i.e., A^t = {F_i^t, CR_i^t, 1 ≤ i ≤ N} ∈ R^{2N}.

Since the control parameters take continuous values, we assume the policy is normal. That is,

π(A^t | S^t) = N(A^t | LSTM(S^t), σ²) = (1 / (2πσ²)^N) exp{ −(1/(2σ²)) ‖A^t − LSTM(S^t; W)‖² }.   (9)

It is seen that the policy is uniquely determined by the LSTM parameter W.
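The Gaussian policy of Eq. (9) can be sampled directly from the controller output. The following PyTorch fragment is only a hedged sketch with an assumed fixed standard deviation sigma; it draws one action (the per-individual F and CR values) and records its log-probability, which is what the policy-gradient update in the next subsection needs.

import torch

def sample_action(theta_mean, sigma):
    # Sample A^t ~ N(LSTM(S^t), sigma^2 I) and return it together with log pi(A^t | S^t).
    dist = torch.distributions.Normal(theta_mean, sigma)   # independent Gaussians, Eq. (9)
    action = dist.sample()                                 # raw F and CR values
    log_prob = dist.log_prob(action).sum()                 # joint log-density of the 2N values
    # Clipping the samples into (0, 1] before handing them to mutation/crossover is a
    # practical safeguard we assume here; the policy itself is an unbounded Gaussian.
    params = action.clamp(1e-8, 1.0)
    return params, log_prob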
4) Reward (r^{t+1}): The environment responds with a reward r^{t+1} after the action. In our case, the reward r^{t+1} is defined as the relative improvement of the best fitness:

r^{t+1} = (max{F^{t+1}} − max{F^t}) / max{F^t},   (10)

where max{F^t} denotes the best fitness obtained at generation t. That is, after determining the control parameters, the mutation, crossover and selection operations are performed to obtain the next generation. The relative improvement is considered as the outcome of the application of the sampled control parameters. A higher reward (improvement) indicates that the determined control parameters have a more positive impact on the search for the global optimum.
5) Transition:
The transition is also a probability, p(S^{t+1} | A^t = a^t, S^t = s^t). In our case, this probability distribution is not available. It will be seen in the following that the transition distribution does not affect the learning.
D. Learn the Control Parameter by Policy Gradient
A variant of the RL algorithms, policy gradient (PG), is able to deal with the scenario in which the transition probability is not known. PG works by updating the policy parameters via stochastic gradient ascent on the expectation of the reward:

θ^{t+1} = θ^t + α_t ∇_θ U(θ^t),   (11)

where α_t denotes the learning rate and ∇_θ U(θ) is the gradient of the cumulative reward U(θ).

An evolutionary search procedure with a finite number of generations (denoted as T) can be considered as a finite-horizon MDP. For such an MDP, given the previously defined state and action, a trajectory τ is {S^0, A^0, r^1, ..., S^{T−1}, A^{T−1}, r^T}. The joint probability of the trajectory can be written as

q(τ; θ) = p(S^0) ∏_{t=0}^{T−1} π(A^t | S^t; θ) p(S^{t+1} | A^t, S^t).   (12)

Further, ∇_θ U(θ) can be derived as follows:

∇_θ U(θ) = Σ_τ r(τ) ∇_θ q(τ; θ)
         = Σ_τ r(τ) [∇_θ q(τ; θ) / q(τ; θ)] q(τ; θ)
         = Σ_τ r(τ) q(τ; θ) ∇_θ [log q(τ; θ)]
         = Σ_τ r(τ) q(τ; θ) [ Σ_{t=0}^{T−1} ∇_θ log π(A^t | S^t; θ) ],   (13)

where r(τ) is the cumulative reward of the trajectory τ. Note that the transition terms p(S^{t+1} | A^t, S^t) do not depend on θ and therefore vanish when the gradient of log q(τ; θ) is taken; this is why the unknown transition distribution does not affect the learning. The expectation in Eq. 13 can be approximated by sampling L trajectories τ_1, ..., τ_L:

∇_θ U(θ) ≈ (1/L) Σ_{i=1}^{L} r(τ_i) Σ_{t=0}^{T−1} ∇_θ ln π(A^t = a_i^t | S^t = s_i^t; θ),   (14)

where a_i^t (resp. s_i^t) denotes the action (resp. state) value at time t in the i-th trajectory.

A detailed description of how to update W is given in Alg. 1. In the algorithm, to obtain the optimal W, the optimization experience over a set of M training functions is used. For each function, a set of L trajectories is first sampled by applying the proposed DE (lines 8-19) for T generations. The control parameters of the proposed DE are obtained by forward computation of the LSTM given the present W at each generation. With the sampled trajectories, the reward at each generation for each optimization function is computed and used for updating W (line 22).

Note that in the learning, a set of optimization functions is used. For each optimization function, the proposed DE is applied for T generations to sample the trajectories. Each trajectory can be considered as an optimization experience for a particular training function. For each function, L trajectories are sampled, so M functions provide L × M optimization experiences. Learning from these experiences could thus lead to a good parameter controller.
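Eq. (14) is the standard REINFORCE estimator. The fragment below shows, as a sketch only, how the update of Eqs. (11) and (14) could be realized in PyTorch once the log-probabilities of the sampled actions (as in the policy sketch above) and the per-trajectory returns have been collected. The surrogate-loss device (negating the weighted log-probabilities so that a gradient-descent optimizer performs ascent) is a common implementation trick, not a detail taken from the paper.

import torch

def policy_gradient_step(optimizer, log_probs, returns):
    # One update of the controller parameters W (Eqs. 11 and 14).
    # log_probs: list over L trajectories, each a list of log pi(a_t | s_t) tensors.
    # returns:   list of L scalar trajectory rewards r(tau_i), e.g. summed rewards of Eq. (10).
    L = len(log_probs)
    surrogate = torch.zeros(())
    for traj_log_probs, r_tau in zip(log_probs, returns):
        # r(tau_i) * sum_t log pi(a_t | s_t): its gradient is one term of Eq. (14)
        surrogate = surrogate + r_tau * torch.stack(traj_log_probs).sum()
    loss = -surrogate / L          # negate so that minimizing performs gradient ascent on U(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

Here optimizer could be torch.optim.SGD(controller.parameters(), lr=alpha), which corresponds to the plain gradient-ascent update of Eq. (11); any other choice of optimizer would be an assumption beyond the paper.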
Algorithm 1: Learning to control the parameters of the DE
Require: the LSTM parameter W, the number of epochs Q, the number of training functions M, the population size N, the number of trajectories L, the trajectory length T and the learning rate α
1:  Initialize W uniformly at random;
2:  for epoch = 1 → Q do
3:    Initialize P^0 = [x_1, ..., x_N] uniformly at random;
4:    for k = 1 → M do
5:      Set P_k^0 = P^0;
6:      Evaluate F_k^0 = {f_k(x_i), 1 ≤ i ≤ N};
7:      ▷ trajectory sampling
8:      for l = 1 → L do
9:        Set t ← 0, H_k^0 = 0 and C_k^0 = 0;
10:       repeat
11:         Compute U_k^t = [h^t, h̄^t] by Eq. 8;
12:         Set A_k^t = [U_k^t, F_k^t];
13:         Apply the LSTM: Θ_k^t, C_k^{t+1}, H_k^{t+1} ← LSTM(A_k^t, H_k^t, C_k^t; W);
14:         Create the trial population: P̃_k^t = [x̃_1^t, ..., x̃_N^t] ← CR ∘ M(P_k^t; Θ_k^t);
15:         Evaluate the trial population: F̃_k^t ← {f(x̃_i^t), 1 ≤ i ≤ N};
16:         Form the new population: P_k^{t+1}, F_k^{t+1} ← S(F_k^t, F̃_k^t, P_k^t, P̃_k^t);
17:         Calculate r_k^{t+1} using Eq. (10); set t ← t + 1;
18:       until t ≥ T
19:     end for
20:   end for
21:   ▷ policy parameter updating
22:   Update W using Eq. (14) and Eq. (11);
23: end for
24: return W

E. Embed the Learned Controller within the DE
After training, it is assumed that we have secured the knowledge required for generating control parameters through learning from optimization experiences. Given a new test problem, the proposed DE with the learned parameter controller can be applied directly. The detailed algorithm, named the learned DE (dubbed LDE), is summarized in Alg. 2.

In Alg. 2, the evolution procedure is the same as in a typical DE except that the control parameters are samples of the output of the controller (lines 8-10). One of the inputs of the LSTM, the statistics U^t, is computed at each generation (lines 5-6), and the hidden information H^t and C^t is initialized (line 3) and maintained during the evolution.

Note that in the LDE, the controller contains knowledge learned from optimization experiences, which can be considered as extraneous/offline information, while the use of U^t, H^t and C^t represents the information learned during the search procedure, which is intraneous/online information. The time complexity of one generation of LDE is O(H² + N · H + N · n), where H denotes the number of neurons used in the hidden layer.

IV. EXPERIMENTAL STUDY
In this section, we first present the implementation details of both Alg. 1 and Alg. 2. The training details and the comparison results against some known DEs and a state-of-the-art EA are presented afterwards.
Algorithm 2: The Learned DE (LDE)
Require: the trained agent with parameter W
1:  Initialize population P^0 uniformly at random;
2:  Evaluate F^0 = f(P^0);
3:  Set g ← 0, H^0 = 0 and C^0 = 0;
4:  while the termination criteria have not been met do
5:    Compute U^g = [h^g, h̄^g] by Eq. 8;
6:    Set A^g = [U^g, F^g];
7:    g ← g + 1;
8:    Θ^{g−1}, C^g, H^g ← LSTM(A^{g−1}, H^{g−1}, C^{g−1}; W);
9:    Θ̃^{g−1} ∼ N(Θ^{g−1}, σ²);
10:   P^g, F^g ← DE(P^{g−1}, F^{g−1}; Θ̃^{g−1});
11: end while
12: return x* = arg max f(P^g)

Some or all functions in CEC'13 [53] are used as the training functions. In the comparison study, functions from CEC'13 or CEC'17 [54] that have not been used for training are tested. The CEC'13 test suite consists of five unimodal functions f1–f5, 15 basic multimodal functions f6–f20 and eight composition functions f21–f28. The CEC'17 test suite includes two unimodal functions, seven simple multimodal functions, ten hybrid functions, and ten more complex composition functions (the suite is available at https://github.com/P-N-Suganthan/CEC2017-BoundContrained).

When training, the following settings were used: the number of epochs Q = 150, the population size N = 50 when n = 10 and N = 100 when n = 30 for the mutation strategy, the number of bins b = 5, the number of previous generations g = 5, the trajectory length T = 50, the number of trajectories L = 20, and a fixed learning rate α.

For the experimental comparison, the same criteria as explained in [53] and [54] are used. Each algorithm is executed for 51 runs on each function. An algorithm terminates if the maximum number of objective function evaluations (MAXNFE) is exceeded or the function error value, i.e., the difference between the function value of the best found solution and that of the optimal solution, falls below the prescribed threshold.

The compared algorithms include the following.

DE [12]: the original DE algorithm with DE/rand/1/bin mutation and binomial crossover.

JADE [37]: the classical adaptive DE method in which the DE/current-to-pbest/1 mutation strategy was first proposed. JADE has two versions, one with an external archive and one without; the external archive aids the generation of offspring during mutation. In JADE, each F (resp. CR) is generated for each individual by sampling from a Cauchy (resp. normal) distribution, and the location parameter of the distribution is updated by the Lehmer (resp. arithmetic) mean of the successful F's (resp. CR's).

jSO [55]: ranked second, and the best DE-based algorithm, in the CEC'17 competition. As an elaborate variation of JADE, two scale factors are associated with the mutation strategy; they differ from each other and are limited within various bounds along the evolution.

CoBiDE [56]: in which the F (resp. CR) values are generated from a bimodal distribution consisting of two Cauchy distributions. Trial vectors are formed in the Eigen coordinate system built by the eigenvectors of the covariance matrix of the top individuals.

cDE [14]: in which both F and CR are selected from a pre-defined pool. The selection probability is proportional to the corresponding number of successful individuals obtained from previous generations.

CoBiDE-PCM [24] and cDE-PCM [24]: proposed in [24] for studying the effect of parameter control methods. These two algorithms are largely consistent with CoBiDE and cDE, but are equipped with different mutation and crossover operators. They ranked first or second on the BBOB benchmark in [24].

HSES [57]: the winner of the bound-constrained competition of CEC'18.
HSES is a three-phase algorithm. A modified univariate sampling is employed in the first stage to obtain good initial points, CMA-ES [58] is used in the second stage, and another univariate sampling follows for local refinement.

The parameters and hyper-parameters of these methods are kept the same as the settings in the original references in our experiments. Table II shows the detailed variation operators used and the parameter (hyper-parameter) settings of the compared algorithms. Code for the compared algorithms and the test suites is available at https://github.com/P-N-Suganthan/CEC2017, https://github.com/P-N-Suganthan/CEC2018, https://sites.google.com/view/pcmde/, and ∼tvrdik/wp-content/uploads/men05 CD.pdf. The test functions of CEC'18 are the same as those of CEC'17 except that one function of CEC'17 is removed.

It should be noted that JADE, cDE and CoBiDE are not tested on the CEC'13 or CEC'17 test suites in the original references, but on 20 basic functions, six basic functions and CEC'05 [59], respectively. The parameters of these algorithms were tuned manually by grid search based on the mean fitness values found over a number of independent runs for each function. It is expected that the parameter tuning procedures of these algorithms are time-consuming and computationally intensive. However, we should be frank that there is a possibility that these algorithms' performances could be improved if their parameters were tuned on the CEC'13 or CEC'17 test suites.

A. The Comparison Results on the CEC'13 Test Suite
In this experiment, we use the first 20 functions f1–f20 of CEC'13 as the training functions. The remaining eight functions f21–f28 are used for comparison. As in [54], the function error value is used as the metric and recorded for each run. The means and standard deviations of the error values obtained for each function over 51 runs are used for comparison. In the following, the experimental results are summarized in tables, in which the means, standard deviations and the Wilcoxon rank-sum hypothesis test results are included. The best (minimum) mean values are typeset in bold.

The Wilcoxon rank-sum hypothesis test is performed to test the significant differences between LDE and the compared algorithms. The test results are shown by using the symbols +, −, and ≈ in the tables. The symbol + (resp. −, and ≈) indicates that LDE performs significantly worse than (resp. better than, and similarly to) the compared algorithm at a significance level of 0.05. The results are summarized in the "WR" column of the tables.

Tables IV and V summarize the experimental results when terminating at MAXNFE, while Table III shows the results when the algorithms terminate at the maximum number of generations T = 50, which is the number of generations used for training the agent.

From Table III, we can see that LDE does not perform satisfactorily. It is worse than the compared algorithms when the algorithms terminate at generation T = 50. However, it performs better as the process of evolution continues up to MAXNFE. Note that in the training, the optimization procedure does not terminate at MAXNFE. The poor performance of LDE implies that the agent trained within T = 50 generations is not good enough. However, the experimental results show that the knowledge learned within T = 50 generations can be beneficial for the evolutionary search in further generations.

Tables IV and V show that when MAXNFE has been reached, LDE exhibits superior overall performance compared with DE/rand/1/bin and JADE with archive, and shows similar performance compared with JADE without archive. However, LDE is outperformed by jSO on five and six out of the eight 10-D and 30-D test functions, respectively.

TABLE II: Detailed reproduction operators used and parameter and hyper-parameter settings of the compared algorithms.
TABLE III: Means and standard deviations of function error values for the comparison of LDE on the CEC'13 benchmark suite for n = 10 at generation T = 50.

TABLE IV: Means and standard deviations of function error values for the comparison of LDE on the CEC'13 benchmark suite for n = 10 when MAXNFE has been reached.

TABLE V: Means and standard deviations of function error values for the comparison of LDE on the CEC'13 benchmark suite for n = 30 when MAXNFE has been reached.

Specifically, on the eight 10-D complex composition functions, LDE performs better than DE on six functions, and surpasses JADE with archive on three functions and JADE without archive on two functions. However, LDE performs statistically better than jSO on only three out of the eight test functions. For the 30-D test problems, LDE yields better performance than DE on five functions, and shows the same advantage over JADE with and without archive on three of the CEC'13 benchmarks with 30-D. Again, LDE performs worse than jSO on six test functions with 30-D.

To see the overall performances, we rank the compared algorithms by using the average performance score (APS) [60]. The APS is defined based on the error values obtained by the compared algorithms for the test functions. Suppose there are m algorithms A_1, ..., A_m to compare on a set of M functions. For each i, j ∈ [1, m], if A_j performs better than A_i on the k-th function F_k, k ∈ [1, M], with statistical significance (i.e., p < 0.05), then set δ_ij = 1, otherwise δ_ij = 0. The performance score of A_i on F_k is computed as follows:

P_k(A_i) = Σ_{j ∈ [1, m] \ {i}} δ_ij.   (15)

The APS value of A_i is the average of the performance score values of A_i over the test functions. A smaller APS value indicates a better performance.
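The APS ranking of Eq. (15) can be computed directly from the pairwise Wilcoxon decisions. The NumPy sketch below, with an assumed 0/1 significance array as input, is only a reading aid for the definition, not the evaluation code used in the paper.

import numpy as np

def average_performance_score(delta):
    # delta[k, i, j] = 1 if algorithm A_j is significantly better than A_i on function F_k, else 0.
    # Returns the APS value of each algorithm (smaller is better).
    # P_k(A_i) = sum over j != i of delta_ij (Eq. 15); the diagonal is zero by construction.
    scores = delta.sum(axis=2)          # shape (M, m): performance score of each A_i on each F_k
    return scores.mean(axis=0)          # average over the M test functions

# Tiny illustration with 2 functions and 3 algorithms (hypothetical decisions):
delta = np.zeros((2, 3, 3))
delta[0, 1, 0] = 1    # on the first function, A_0 beats A_1
delta[:, 2, :2] = 1   # A_0 and A_1 beat A_2 on both functions
print(average_performance_score(delta))   # [0.  0.5 2. ]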
Table VI summarizes the ranks of the compared algorithms in terms of their APS values. It can be seen that jSO is superior to LDE. LDE ranks second, which is better than the other algorithms. This shows that the proposed method is quite promising.

TABLE VI: The average ranks of the compared algorithms according to their APS values on the last eight functions of the CEC'13 test suite.

Fig. 4 shows the evolution of the control parameters during the optimization obtained by the learned controller and by jSO when optimizing one of the test functions. In the figure, we show the mean values of F and CR at each generation by clustering F^t into three groups. The upper plot shows the mean F values, while the lower plot shows the mean CR values associated with the individuals in the groups. From the upper plot of Fig. 4(a) for the learned controller, it is seen that high-quality individuals generally have a smaller F value than the low-quality individuals, while the middle-quality individuals have higher CR values. From Fig. 4(b) for jSO, it is seen that along the evolution, the F and CR values become scattered. The better performance of LDE when running more generations indicates that the learned controller is promising for adaptive parameter control.

Fig. 4: The evolution of F and CR along the optimization obtained by (a) the learned controller and (b) jSO. The upper (resp. lower) plot shows the F (resp. CR) values. The population's fitness is grouped into three clusters at each generation, and the associated F and CR values are averaged.

B. The Comparison Results on the CEC'17 Test Suite
In this section, all 28 test problems of the CEC'13 test suite are used as the training functions. LDE is then compared with the other algorithms on the 29 functions of the CEC'17 test suite. Tables VII and VIII summarize the means and standard deviations of the function error values obtained by all the compared methods over 51 runs on the CEC'17 test suite for 10-D and 30-D, respectively.

For the 10-D test functions, LDE exhibits superiority over HSES and most of the conventional and classical DE-based algorithms, except for CoBiDE and cDE-PCM. It performs similarly to jSO. Particularly, LDE performs better than the classical DE and JADE with archive on 16 functions, CoBiDE-PCM on 24 functions, cDE on 16 functions, and HSES on 14 functions. LDE performs worse than CoBiDE on 12 functions, jSO on 7 functions and cDE-PCM on 6 functions. LDE performs similarly to CoBiDE and jSO on 16 and 15 functions, respectively.

For the 30-D test problems, it is seen that jSO and HSES perform better than LDE on most of the test functions. However, LDE performs better than the rest of the algorithms in general. Particularly, LDE performs better than the classical DE and the other adaptive DEs on more functions than it performs worse on. The performance of LDE is similar to that of CoBiDE in the sense that the numbers of functions on which LDE outperforms CoBiDE and on which CoBiDE outperforms LDE are the same.

TABLE VII: Means and standard deviations of the error values obtained by LDE and the compared algorithms on the CEC'17 benchmark suite for n = 10 when MAXNFE has been reached or the function error falls below the prescribed threshold.

TABLE VIII: Means and standard deviations of the error values obtained by LDE and the compared algorithms on the CEC'17 benchmark suite for n = 30 when MAXNFE has been reached or the function error falls below the prescribed threshold.

TABLE IX: Average ranking of the compared algorithms according to their APS values on the CEC'17 benchmark functions.
The ranking result of all the algorithms is shown in Table IX. It is seen that LDE ranks third on the CEC'17 benchmark suite. Note that, first, the agent is learned on CEC'13; its performance on CEC'17 implies that some useful knowledge which is helpful for parameter control is indeed effectively learned. Second, once the controller has been learned, it is applied to solve new test functions without requiring any tuning of the algorithmic parameters. This can greatly reduce a possibly large amount of computational effort.

V. SENSITIVITY ANALYSIS
One of the main parameters that greatly influence the performance of LDE is the number of neurons (i.e., the sizes of H^t and C^t) used in the hidden layers. A higher number can increase the representation ability of the LSTM but may cause over-fitting. Here we investigate the effect of different neuron sizes on the performance of LDE on the last eight functions f21–f28 of CEC'13 for 10-D and 30-D.

Six agents with different numbers of neurons are learned on the 10-D functions. A set of neuron sizes, from 500 to 3000 with an interval of 500, is studied while the population size is fixed at 50. The obtained results are summarized in Table X. From Table X, we see that 1) the performance of LDE differs w.r.t. the neuron size; 2) for different functions, the best result is obtained with different neuron sizes; and 3) a higher number of neurons does not always mean better performance.

Generally speaking, the population size ought to be increased for problems with larger dimensions. To see the effect of the population size, we carry out experiments to learn five controllers with different population and neuron sizes on the same CEC'13 training functions (i.e., f1–f20 with 30-D). The performance of the learned controllers is again tested on f21–f28 with 30-D. Table XI lists the comparison results of the five designed controllers. From the table, it is observed that the neuron size has the same effect as in the 10-D case. Further, it can be seen that the population and neuron sizes together have a very complex effect on the performance of the learned controller.

TABLE X: The results on the last eight problems of CEC'13 with different cell sizes (500, 1000, 1500, 2000, 2500, 3000) for n = 10 at termination.

TABLE XI: The results on the last eight problems of CEC'13 with different population and cell sizes for n = 30 at termination (N = 100, 100, 100, 150, 200 with cell sizes 2000, 3000, 3500, 3000, 1500, respectively).

VI. RELATED WORK
In this paper, RL is used as the main technique to learn how to adaptively control the algorithmic parameters from optimization experiences. To the best of our knowledge, there is no related work on controlling the DE parameters by learning from optimization experiences. However, we found some works on controlling the parameters of genetic algorithms (GAs). These works relate to our approach but with significant differences.

In [61], four control parameters (including crossover rate, mutation rate, tournament proportion and population size) of a GA are dynamically regulated with the help of reinforcement learning. The learning algorithm is a mix of Q-learning and SARSA, which involves maintaining a discrete table of state-action pairs. In [61], information along the GA's search procedure is extracted as the state. The two RL algorithms switch with a pre-defined frequency to find a new action (i.e., control parameter value). The work shows that the RL-enhanced GA outperforms a steady-state GA in terms of fitness and success rate.

In [62], the Q-learning algorithm is applied to choose a suitable reproduction operator which can generate a promising individual in a short time. The authors propose a new reward function incorporating the GA's multi-point search feature and the time complexity of the recombination operators. Further, the action-value function is updated after generating all individuals. Similarly, in [63], Q-learning is also used to adaptively select reproduction operators, but the chosen operator is applied to the whole population. This method is shown empirically to tend to avoid obstructive operators and thus solve the problems more efficiently than random selection.

In [64], a universal controller using RL is found to be able to adapt to any existing EA and to adjust any parameter involved. In their method, a set of observables (considered as states) is fed to a binary decision tree consisting of only one root node for representing a universal state. SARSA [48] is carried out to update the state-action value. It is shown that the RL-enhanced controller exhibits superiority over two benchmark controllers on most common complex problems.

Here we would like to point out the significant differences between the proposed approach and the aforementioned RL-based approaches. First, the RL methods, such as Q-learning or SARSA, used in the existing approaches are developed for MDPs with discrete states and actions. Second, existing parameter controllers are not learned from optimization experiences, but are updated based on the online information obtained from the search procedure during a single run on a single test problem. The main idea behind the existing studies is the same as in the DE parameter control methods reviewed in the introduction: they all try to use information obtained online to update the control parameters. Third, different from RL which aims to learn an agent with a converged optimal policy, the policy derived from the state-action pairs in the existing studies is not necessarily convergent or even stable. However, the approach proposed in this paper can learn from extraneous information for a stable policy.

The only work that applies an idea similar to our approach is DE-DDQN [65], in which a set of mutation operators is adaptively selected based on learning from optimization experiences over a set of training functions. In DE-DDQN, double deep Q-learning is applied for the selection. Various features are defined as states and taken as input to the deep neural network at each generation.

VII. CONCLUSION
This paper proposed a new adaptive parameter controller obtained by learning from the optimization experiences of a set of training functions. The adaptive parameter control problem was modeled as an MDP. A recurrent neural network, the LSTM, was employed as the parameter controller. A reinforcement learning algorithm, policy gradient, was used to learn the parameters of the LSTM. The learned controller was embedded within a DE for the optimization of new test problems. In the experiments, functions in the CEC'13 test suite were used for training. After training, the trained agent was studied on the CEC'13 and CEC'17 test suites in comparison with some well-known DE algorithms and a state-of-the-art evolutionary algorithm. The experimental results showed that the learned DE was very competitive with the compared algorithms, which indicates the effectiveness of the proposed controller.

From our experimental study, we find that training the parameter controller for 30-D problems is rather difficult in terms of computational resources. In particular, the CPU/GPU time used in the training process is considerable. Further, as the number of dimensions increases, it is expected that there will be an increasing need for training time and powerful computing devices. It is also hard to choose the training functions so as to make the training stable. Moreover, there is no theoretical foundation or practical principle for deciding the cell size of the employed neural network. Another disadvantage of the learned algorithm is that its time complexity is greater than that of the compared algorithms.

Note that the training and test functions share similar features since they are all constructed by using the same basic functions. As a result, the performance of LDE on unrelated functions is not predictable and may be limited on totally different sets of functions, such as real-world problems. A possible way to improve the applicability of LDE may be to use new learning techniques or to incorporate existing DE techniques in the LDE.

In the future, we plan to improve the performance of the LDE in a number of ways, such as using different statistics U^t, adopting different neural networks, considering different outputs of the neural network, and others. Further, we intend to apply the LDE to some real-world optimization and engineering problems. We also intend to study the use of reinforcement learning for adaptive mutation/crossover strategies, the learning of hyper-parameters of state-of-the-art evolutionary algorithms, and the learning of meta-heuristics for combinatorial optimization problems.

REFERENCES

[1] R. Storn and K. Price, "Differential evolution – A simple and efficient adaptive scheme for global optimization over continuous spaces," International Computer Science Institute, Berkeley, 1995.
[2] R. Storn and K. Price, "Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces,"
J. Glob.Optim. , vol. 11, no. 4, pp. 341–359, 1997.[3] S. Das, S. Mullick, and P. Suganthan, “Recent advances in differentialevolution – An updated survey,”
Swarm Evol. Comput. , vol. 27, pp.1–30, 2016.[4] F. Neri and V. Tirronen, “Recent advances in differential evolution – Asurvey and experimental analysis,”
Artif. Intell. Rev. , vol. 33, no. 1-2,pp. 61–106, 2010.[5] S. Das and P. Suganthan, “Differential evolution: A survey of the state-of-the-art,”
IEEE Trans. Evol. Comput. , vol. 15, no. 1, pp. 4–31, Feb.2011.[6] R. Tanabe and A. S. Fukunaga, “Improving the search performance ofSHADE using linear population size reduction,” in
Proc. IEEE Congr.Evol. Comput. (CEC’14) , Beijing, China, 2014, pp. 1658–1665. [7] J. Brest, M. S. Mauˇcec, and B. Boˇskovi´c, “The 100-digit challenge:Algorithm jDE100,” in Proc. IEEE Congr. Evol. Comput. (CEC’19) ,Wellington, New Zealand, 2019, pp. 19–26.[8] U. ˇSkvorc, T. Eftimov, and P. Koroˇsec, “CEC real-parameter optimiza-tion competitions: Progress from 2013 to 2018,” in
Proc. IEEE Congr.Evol. Comput. (CEC’19)
[10] A. E. Eiben, R. Hinterding, and Z. Michalewicz, "Parameter control in evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 3, no. 2, pp. 124–141, Jul. 1999.
[11] G. Karafotias, M. Hoogendoorn, and A. E. Eiben, "Parameter control in evolutionary algorithms: Trends and challenges," IEEE Trans. Evol. Comput., vol. 19, no. 2, pp. 167–187, Apr. 2015.
[12] R. Gämperle, S. D. Müller, and P. Koumoutsakos, "A parameter study for differential evolution," Advances in Intelligent Systems, Fuzzy Systems, Evolutionary Computation, vol. 10, no. 10, pp. 293–298, 2002.
[13] J. Rönkkönen, S. Kukkonen, and K. V. Price, "Real-parameter optimization with differential evolution," in Proc. IEEE Congr. Evol. Comput. (CEC'05), vol. 1, Edinburgh, Scotland, United Kingdom, 2005, pp. 506–513.
[14] J. Tvrdík, "Competitive differential evolution," in MENDEL, Brno, Czech Republic, 2006, pp. 7–12.
[15] P. Kaelo and M. Ali, "Differential evolution algorithms using hybrid mutation," Comput. Optim. Appl., vol. 37, no. 2, pp. 231–246, 2007.
[16] J. Tvrdík, "Adaptation in differential evolution: A numerical comparison," Appl. Soft Comput., vol. 9, no. 3, pp. 1149–1155, 2009.
[17] V. Feoktistov, Differential evolution: In search of solutions. Berlin, Heidelberg: Springer, 2006.
[18] J. Brest, Constrained Real-Parameter Optimization with ε-Self-Adaptive Differential Evolution. Berlin, Heidelberg: Springer, 2009, pp. 73–93.
[19] K. Price, Eliminating drift bias from the differential evolution algorithm. Berlin, Heidelberg: Springer, 2008, pp. 33–88.
[20] A. K. Qin and P. N. Suganthan, "Self-adaptive differential evolution algorithm for numerical optimization," in Proc. IEEE Congr. Evol. Comput. (CEC'05), vol. 2, Edinburgh, Scotland, United Kingdom, 2005, pp. 1785–1791.
[21] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, "Differential evolution training algorithm for feed-forward neural networks," Neural Process. Lett., vol. 17, no. 1, pp. 93–105, 2003.
[22] G. Li and M. Liu, "The summary of differential evolution algorithm and its improvements," in Proc. Int. Conf. Adv. Comput. Theory Eng. (ICACTE'10), vol. 3, Chengdu, China, 2010, pp. V3153–V3156.
[23] E.-N. Dragoi and V. Dafinescu, "Parameter control and hybridization techniques in differential evolution: A survey," Artif. Intell. Rev., vol. 45, no. 4, pp. 447–470, 2016.
[24] R. Tanabe and A. Fukunaga, "Reviewing and benchmarking parameter control methods in differential evolution," IEEE Trans. Cybern., vol. 50, no. 3, pp. 1170–1184, Mar. 2020.
[25] S. Das, A. Konar, and U. K. Chakraborty, "Two improved differential evolution schemes for faster global search," in Proc. Annu. Genet. Evol. Comput. Conf. (GECCO'05), Washington DC, USA, 2005, pp. 991–998.
[26] D. Zou, J. Wu, L. Gao, and S. Li, "A modified differential evolution algorithm for unconstrained optimization problems," Neurocomputing, vol. 120, pp. 469–481, 2013.
[27] Y. Wang, Z. Cai, and Q. Zhang, "Differential evolution with composite trial vector generation strategies and control parameters," IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 55–66, Feb. 2011.
[28] A. Draa, S. Bouzoubia, and I. Boukhalfa, "A sinusoidal differential evolution algorithm for numerical optimisation," Appl. Soft Comput., vol. 27, pp. 99–126, 2015.
[29] M. M. Ali and A. A. Törn, "Population set-based global optimization algorithms: Some modifications and numerical studies," Comput. Oper. Res., vol. 31, no. 10, pp. 1703–1725, 2004.
[30] V. Tirronen and F. Neri, Differential Evolution with Fitness Diversity Self-adaptation. Berlin, Heidelberg: Springer, 2009, pp. 199–234.
[31] L. Jia and C. Zhang, "An improved self-adaptive control parameter of differential evolution for global optimization," Int. J. Digit. Content Technol. Appl., vol. 6, no. 8, pp. 343–350, 2012.
[32] T. Takahama and S. Sakai, "Efficient constrained optimization by the ε constrained rank-based differential evolution," in Proc. IEEE Congr. Evol. Comput. (CEC'12), Brisbane, QLD, Australia, 2012, pp. 1–8.
[33] L. Tang, Y. Dong, and J. Liu, "Differential evolution with an individual-dependent mechanism," IEEE Trans. Evol. Comput., vol. 19, no. 4, pp. 560–574, Aug. 2015.
[34] J. Brest, S. Greiner, B. Bošković, M. Mernik, and V. Žumer, "Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems," IEEE Trans. Evol. Comput., vol. 10, no. 6, pp. 646–657, Dec. 2006.
[35] F. Lezama, J. Soares, R. Faia, and Z. Vale, "Hybrid-adaptive differential evolution with decay function (HyDE-DF) applied to the 100-digit challenge competition on single objective numerical optimization," in Proc. Genet. Evol. Comput. Conf. Companion (GECCO'19), Prague, Czech Republic, 2019, pp. 7–8.
[36] A. Qin, V. Huang, and P. N. Suganthan, "Differential evolution algorithm with strategy adaptation for global numerical optimization," IEEE Trans. Evol. Comput., vol. 13, no. 2, pp. 398–417, Apr. 2009.
[37] J. Zhang and A. C. Sanderson, "JADE: Adaptive differential evolution with optional external archive," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 945–958, Oct. 2009.
[38] Z. Yang, K. Tang, and X. Yao, "Self-adaptive differential evolution with neighborhood search," in Proc. IEEE Congr. Evol. Comput. (CEC'08), Hong Kong, China, Jun. 2008, pp. 1110–1116.
[39] R. Tanabe and A. Fukunaga, "Success-history based parameter adaptation for differential evolution," in Proc. IEEE Congr. Evol. Comput. (CEC'13), Cancun, Mexico, 2013, pp. 71–78.
[40] N. H. Awad, M. Z. Ali, and P. N. Suganthan, "Ensemble sinusoidal differential covariance matrix adaptation with Euclidean neighborhood for solving CEC2017 benchmark problems," in Proc. IEEE Congr. Evol. Comput. (CEC'17), Donostia-San Sebastian, Spain, 2017, pp. 372–379.
[41] A. W. Mohamed, A. A. Hadi, A. M. Fattouh, and K. M. Jambi, "LSHADE with semi-parameter adaptation hybrid with CMA-ES for solving CEC2017 benchmark problems," in Proc. IEEE Congr. Evol. Comput. (CEC'17), Donostia-San Sebastian, Spain, 2017, pp. 145–152.
[42] A. Zamuda, "Function evaluations up to 1e+12 and large population sizes assessed in distance-based success history differential evolution for 100-digit challenge and numerical optimization scenarios (DISHchain 1e+12): A competition entry for "100-digit challenge, and four other numerical optimization competitions" at the genetic and evolutionary computation conference (GECCO) 2019," in Proc. Genet. Evol. Comput. Conf. Companion (GECCO'19), Prague, Czech Republic, 2019, pp. 11–12.
[43] Z. Meng, J.-S. Pan, and K.-K. Tseng, "PaDE: An enhanced differential evolution algorithm with novel control parameter adaptation schemes for numerical optimization," Knowledge-Based Syst., vol. 168, pp. 80–99, 2019.
[44] Z. Meng, J.-S. Pan, and L. Kong, "Parameters with adaptive learning mechanism (PALM) for the enhancement of differential evolution," Knowledge-Based Syst., vol. 141, pp. 92–112, 2018.
[45] J. Brest, M. S. Maučec, and B. Bošković, "iL-SHADE: Improved L-SHADE algorithm for single objective real-parameter optimization," in Proc. IEEE Congr. Evol. Comput. (CEC'16), Vancouver, BC, Canada, 2016, pp. 1188–1195.
[46] Z. Zhao, J. Yang, Z. Hu, and H. Che, "A differential evolution algorithm with self-adaptive strategy and control parameters based on symmetric Latin hypercube design for unconstrained optimization problems," Eur. J. Oper. Res., vol. 250, no. 1, pp. 30–45, 2016.
[47] Z.-H. Zhan and J. Zhang, "Self-adaptive differential evolution based on PSO learning strategy," in Proc. Annu. Genet. Evol. Comput. Conf. (GECCO'10), Portland, OR, USA, 2010, pp. 39–46.
[48] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press, 2018.
[49] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[50] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.
[51] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control Signals Syst., vol. 2, no. 4, pp. 303–314, 1989.
[52] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[53] J. Liang, B. Qu, P. Suganthan, and A. G. Hernández-Díaz, "Problem definitions and evaluation criteria for the CEC2013 special session on real-parameter optimization," Computational Intelligence Laboratory, Zhengzhou University, Zhengzhou, China and Nanyang Technological University, Singapore, Tech. Rep., 2013.
[54] G. Wu, R. Mallipeddi, and P. Suganthan, "Problem definitions and evaluation criteria for the CEC2017 competition and special session on constrained single objective real-parameter optimization," National University of Defense Technology, Changsha, Hunan, PR China and Kyungpook National University, Daegu, South Korea and Nanyang Technological University, Singapore, Tech. Rep., 2016.
[55] J. Brest, M. S. Maučec, and B. Bošković, "Single objective real-parameter optimization: Algorithm jSO," in Proc. IEEE Congr. Evol. Comput. (CEC'17), Donostia-San Sebastian, Spain, 2017, pp. 1311–1318.
[56] Y. Wang, H.-X. Li, T. Huang, and L. Li, "Differential evolution based on covariance matrix learning and bimodal distribution parameter setting," Appl. Soft Comput., vol. 18, pp. 232–247, 2014.
[57] G. Zhang and Y. Shi, "Hybrid sampling evolution strategy for solving single objective bound constrained problems," in Proc. IEEE Congr. Evol. Comput. (CEC'18), Rio de Janeiro, Brazil, 2018, pp. 1–7.
[58] N. Hansen, S. D. Müller, and P. Koumoutsakos, "Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)," Evol. Comput., vol. 11, no. 1, pp. 1–18, 2003.
[59] P. Suganthan, N. Hansen, J. Liang, K. Deb, Y.-P. Chen, A. Auger, and S. Tiwari, "Problem definitions and evaluation criteria for the CEC2005 special session on real-parameter optimization," KanGAL report, vol. 2005005, 2005.
[60] J. Bader and E. Zitzler, "HypE: An algorithm for fast hypervolume-based many-objective optimization," Evol. Comput., vol. 19, no. 1, pp. 45–76, 2011.
[61] A. E. Eiben, M. Horvath, W. Kowalczyk, and M. C. Schut, "Reinforcement learning for online control of evolutionary algorithms," in Proc. International Workshop on Engineering Self-Organising Applications (ESOA'06), Hakodate, Japan: Springer, 2006, pp. 151–160.
[62] Y. Sakurai, K. Takada, T. Kawabe, and S. Tsuruta, "A method to control parameters of evolutionary algorithms by using reinforcement learning," in Proc. Int. Conf. Signal Image Technol. Internet Based Syst. (SITIS'10), Kuala Lumpur, Malaysia, 2010, pp. 74–79.
[63] A. Buzdalova, V. Kononov, and M. Buzdalov, "Selecting evolutionary operators using reinforcement learning: Initial explorations," in Proc. Companion Publ. Genet. Evol. Comput. Conf. (GECCO Comp'14), Vancouver, BC, Canada, 2014, pp. 1033–1036.
[64] G. Karafotias, A. E. Eiben, and M. Hoogendoorn, "Generic parameter control with reinforcement learning," in Proc. Genet. Evol. Comput. Conf. (GECCO'14), Vancouver, BC, Canada, 2014, pp. 1319–1326.
[65] M. Sharma, A. Komninos, M. López-Ibáñez, and D. Kazakov, "Deep reinforcement learning based parameter control in differential evolution," in