Differential Evolution with Reversible Linear Transformations
Jakub M. Tomczak
Vrije Universiteit Amsterdam, Amsterdam, the Netherlands, [email protected]
Ewelina Węglarz-Tomczak
University of Amsterdam, Amsterdam, the Netherlands, [email protected]
Agoston E. Eiben
Vrije Universiteit Amsterdam, Amsterdam, the Netherlands, [email protected]
ABSTRACT
Differential evolution (DE) is a well-known type of evolutionary algorithm (EA). Similarly to other EA variants, it can suffer from small populations and lose diversity too quickly. This paper presents a new approach to mitigate this issue: we propose to generate new candidate solutions by applying a reversible linear transformation to a triplet of solutions from the population. In other words, the population is enlarged using newly generated individuals without evaluating their fitness. We assess our methods on three problems: (i) benchmark function optimization, (ii) discovering parameter values of the gene repressilator system, and (iii) learning neural networks. The empirical results indicate that the proposed approach outperforms vanilla DE and a version of DE that applies the differential mutation three times, on all testbeds.
CCS CONCEPTS
• Theory of computation → Bio-inspired optimization
KEYWORDS
Black-box optimization, reversible computation, population-basedalgorithms
ACM Reference Format:
Jakub M. Tomczak, Ewelina Węglarz-Tomczak, and Agoston E. Eiben. 2020. Differential Evolution with Reversible Linear Transformations. In Proceedings of Preprint. ACM, New York, NY, USA, 9 pages. https://doi.org/
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Preprint, preliminary work. © Copyright held by the owner/author(s).

INTRODUCTION

Optimization is about finding a solution that minimizes (or maximizes) an objective function under given constraints, i.e., the possible values that solutions can take. A subset of optimization problems with so-called black-box objective functions constitutes black-box optimization. In general, a black-box is any process that, when provided an input, returns an output, but whose analytical description is unavailable or non-differentiable [2]. Examples of black-box functions (and/or constraints) are computer programs [6], physical and biochemical processes [32, 33], or evolutionary robotics [7]. There exists a vast number of derivative-free optimization (DFO) methods, ranging from classical algorithms like iterative local search or direct search [2] to modern approaches like Bayesian optimization [26] and evolutionary algorithms (EA) [3, 9]. Differential evolution (DE) [22, 28] is one of the most successful and popular population-based DFO algorithms; it utilizes evolutionary operators (mutation and crossover) and a selection mechanism to generate a new set of candidate solutions. DE is a metaheuristic with no convergence guarantees; however, it possesses multiple interesting theoretical properties [20]. Since the original publication of DE [28], the method has been extended in many ways, e.g., by using adaptive local search [19], modifying the differential mutation operator [35], or optimizing the parameters of DE [27]. DE has also been applied to many real-life problems, such as digital filter design [14], parameter estimation in ODEs [34], discovering predictive genes in microarray data [30], or robot navigation [18].

In this paper, we follow this line of research and present an extension of DE. One potential issue with DE is that, similarly to other EA variants, it can suffer from small populations and lose diversity too quickly. Therefore, a potential solution is adjusting or modifying the population size [8]. Here, we propose to enlarge the population on-the-fly by generating new candidate solutions using reversible linear transformations applied to triplets of solutions from the population. As a result, we take a population of N individuals and generate 3N new candidates, assuming that we can afford running extra evaluations. This procedure enhances DE and allows the search space to be explored and exploited better. We evaluate our approach on three testbeds. First, we present results on benchmark function optimization (Griewank, Rastrigin, Schwefel and Salomon functions). Second, we apply the proposed methods to discovering parameter values of the gene repressilator system. Lastly, we utilize the new DE scheme for learning neural networks on image data.
In all experiments, we show that enlarging the population size indeed allows faster convergence (in terms of the number of fitness evaluations), and that the reversible linear transformations provide an efficient and effective alternative to the vanilla differential mutation.

The contribution of the paper is threefold. First, we propose to enhance DE by applying reversible linear transformations with two different linear operators. Second, we analyze the operators by inspecting their eigenvalues. Third, we show empirically, on problems with the number of variables ranging from 4 to 4120, that the proposed DE with reversible linear transformations significantly outperforms DE and its extension with three perturbations.

We consider an optimization problem of a function f : X → R, where X ⊆ R^D is the search space. In this paper we focus on the minimization problem, namely:

    x* = arg min_{x ∈ X} f(x).    (1)

Further, we assume that the analytical form of the function f is unknown or cannot be used to calculate derivatives; however, we can query it through a simulation or experimental measurements. Problems of this sort are known as black-box optimization problems [2, 13]. Additionally, we consider a bounded search space, i.e., we include inequality constraints for all dimensions in the following form: l_d ≤ x_d ≤ u_d, where l_d, u_d ∈ R and l_d < u_d, for d = 1, 2, ..., D.

One of the most widely-used methods for black-box optimization problems is differential evolution (DE) [28], which maintains a population of candidate solutions, X = {x_1, ..., x_N}, to iteratively generate new query points. A new candidate is generated by randomly picking a triplet from the population, (x_i, x_j, x_k) ∈ X, and then perturbing x_i by adding a scaled difference between x_j and x_k, that is:

    y = x_i + F (x_j − x_k),    (2)

where F ∈ R_+ is the scaling factor.
This operation can be seen as an adaptive mutation operator, widely known as differential mutation [22]. Further, the authors of [28] proposed to sample a binary mask m ∈ {0, 1}^D according to the Bernoulli distribution with probability p = P(m_d = 1) shared across all D dimensions, and calculate the final candidate according to the following formula:

    v = m ⊙ y + (1 − m) ⊙ x_i,    (3)

where ⊙ denotes element-wise multiplication. In the evolutionary computation literature this operation is known as the uniform crossover operator [3, 9]. In this paper, we fix p = 0.9. For selection, we combine the old population with the new one and select the N candidates with the highest fitness values (i.e., the deterministic (µ + λ) selection). This variant of DE is referred to as "DE/rand/1/bin", where rand stands for randomly selecting a base vector, 1 is for adding a single perturbation (a vector difference), and bin denotes the uniform crossover. Sometimes it is called classic DE [22].

Generating new candidates in DE requires sampling a triplet of solutions; based on these points, one solution is perturbed using the other two. This approach possesses multiple advantages, naming only a few: (i) it is non-parametric, i.e., contrary to evolution strategies [4], no assumption on the underlying distribution of the population is made; (ii) it has been shown to be effective in many benchmark optimization problems and real-life applications [22]. However, the number of possible perturbations is finite and relies entirely on the population size. Therefore, a small population size could produce insufficient variability of new candidate solutions. To counteract this issue, we propose the following solutions:

(1) In order to increase variability, we can perturb candidates multiple times by running the differential mutation more than once (e.g., three times).
(2) In fact, we can use the selected triplet of points and use it three times to generate new points.
In other words, we notice that there is no need to sample three different triplets.
(3) We propose to modify the selected triplet by using newly generated solutions on-the-fly. This approach enlarges the population size.

In the following subsections, we outline the three approaches. Further, we notice that the second and third methods can be represented as linear transformations. As such, we can analyze them algebraically.
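Before detailing the three approaches, a minimal sketch of one generation of classic DE/rand/1/bin (Eqs. 2 and 3 with the deterministic (µ + λ) selection described above) may be helpful. The function names and signature below are ours, not taken from the authors' repository:

```python
import numpy as np

def de_generation(pop, fitness, f_obj, F=0.5, p=0.9, rng=None):
    """One generation of classic DE/rand/1/bin with (mu + lambda) selection.

    pop     : (N, D) array of candidate solutions
    fitness : (N,) array of objective values (lower is better)
    f_obj   : black-box objective, maps a (D,) vector to a float
    """
    rng = rng or np.random.default_rng()
    N, D = pop.shape
    new = np.empty_like(pop)
    for n in range(N):
        # pick a random triplet (x_i, x_j, x_k) and mutate, Eq. (2)
        i, j, k = rng.choice(N, size=3, replace=False)
        y = pop[i] + F * (pop[j] - pop[k])
        # uniform (binomial) crossover with the base vector x_i, Eq. (3)
        m = rng.random(D) < p
        new[n] = np.where(m, y, pop[i])
    new_fit = np.array([f_obj(x) for x in new])
    # deterministic (mu + lambda) selection: keep the N best of old + new
    joint = np.vstack([pop, new])
    joint_fit = np.concatenate([fitness, new_fit])
    best = np.argsort(joint_fit)[:N]
    return joint[best], joint_fit[best]
```

Because the selection is elitist, the best objective value in the population never worsens between generations.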
In the first approach, we generate a larger new population by perturbing the point x_i using multiple candidate solutions, namely x_j, x_k, x_l, x_m, x_n, x_q ∈ X. Then, we can produce 3N new candidate solutions instead of N as follows:

    y_1 = x_i + F (x_j − x_k)    (4)
    y_2 = x_i + F (x_l − x_m)    (5)
    y_3 = x_i + F (x_n − x_q).    (6)

This approach requires sampling more pairs and evaluating more points; however, it allows the search space to be explored better. We refer to this approach as Differential Evolution ×
3, or DEx3 for short.
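A sketch of the DEx3 mutant generation (Eqs. 4–6) is given below; crossover and selection proceed as in classic DE. The helper name is ours, and for simplicity all seven indices are drawn without replacement, which assumes a population of at least seven individuals:

```python
import numpy as np

def dex3_candidates(pop, F=0.5, rng=None):
    """Generate 3N mutants as in DEx3: each base point x_i is perturbed by
    three independently sampled difference pairs (Eqs. 4-6)."""
    rng = rng or np.random.default_rng()
    N, D = pop.shape
    out = np.empty((3 * N, D))
    for n in range(N):
        # seven distinct indices: the base x_i plus three difference pairs
        i, j, k, l, m_, n_, q = rng.choice(N, size=7, replace=False)
        out[3 * n + 0] = pop[i] + F * (pop[j] - pop[k])
        out[3 * n + 1] = pop[i] + F * (pop[l] - pop[m_])
        out[3 * n + 2] = pop[i] + F * (pop[n_] - pop[q])
    return out
```

Note that with F = 0 every mutant collapses onto its base vector, which is a quick sanity check of the implementation.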
We first notice that in the DEx3 approach we sample three pairs of points to calculate perturbations. Since we pick them at random, we propose instead to sample three candidates x_i, x_j, x_k ∈ X and calculate perturbations by changing their positions only, that is:

    y_1 = x_i + F (x_j − x_k)
    y_2 = x_j + F (x_k − x_i)    (7)
    y_3 = x_k + F (x_i − x_j).

In other words, we perturb each point using the remaining two. Interestingly, we notice that Eq. 7 corresponds to applying a linear transformation to these three points. For this purpose, we rewrite (7) in matrix notation by introducing matrices Y = [y_1, y_2, y_3]^T and X = [x_i, x_j, x_k]^T, which yields:

    Y = M X,    (8)

where:

    M = [ 1, F, −F; −F, 1, F; F, −F, 1 ].    (9)

The matrix M can be further decomposed as follows:

    M = I + F A,  with  A = [ 0, 1, −1; −1, 0, 1; 1, −1, 0 ],    (10)

where I denotes the identity matrix and A is an antisymmetric matrix.

Figure 1: Real part of eigenvalues, ℜ(λ), and absolute value of eigenvalues, |λ|, for: (a) M in ADE, and (b) R in RevDE.

Comparing Eq. 7 to DEx3, a single triplet now suffices to produce three new candidates. We refer to this approach as Antisymmetric Differential Evolution (ADE), because the linear transformation consists of the identity matrix and an antisymmetric matrix parameterized by the scaling factor F. The linear transformation presented in Eq. 7 utilizes the triplet (x_i, x_j, x_k) to generate three new points; however, it can still be seen as applying DE three times, but in a specific manner (i.e., by defining the linear operator M).
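The equivalence between the sequential form (Eq. 7) and the matrix form (Eqs. 8–10) can be checked directly; the short sketch below (names ours) builds M and applies it to a stacked triplet:

```python
import numpy as np

def ade_matrix(F):
    """Linear operator M = I + F * A of Eq. (10), with A antisymmetric."""
    A = np.array([[0., 1., -1.],
                  [-1., 0., 1.],
                  [1., -1., 0.]])
    return np.eye(3) + F * A

def ade_candidates(triplet, F=0.5):
    """Apply Y = M X (Eq. 8) to a stacked triplet X = [x_i, x_j, x_k]^T,
    given as a (3, D) array."""
    return ade_matrix(F) @ triplet
```

Each row of the result reproduces the corresponding line of Eq. 7, and det(M) = 1 + 3F^2 matches the determinant computed in the Appendix.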
A natural question arises whether a different transformation could be proposed that allows better exploitation and/or exploration of the search space. The mutation operator in DE perturbs candidates using other individuals in the population; as a result, too small a population could limit exploration of the search space. In order to overcome this issue, we propose to modify ADE by using newly generated candidates on-the-fly, that is:

    y_1 = x_i + F (x_j − x_k)
    y_2 = x_j + F (x_k − y_1)    (11)
    y_3 = x_k + F (y_1 − y_2).

Using the new candidates y_1 and y_2 allows perturbations to be calculated using points outside the population. This approach does not follow the typical construction of an EA, in which only evaluated candidates are mutated. Further, similarly to ADE, we can express (11) as a linear transformation Y = R X with the following linear operator:

    R = [ 1, F, −F; −F, 1 − F^2, F + F^2; F + F^2, −F + F^2 + F^3, 1 − 2F^2 − F^3 ].    (12)

In order to obtain the matrix R, we plug y_1 into the second and third equations in (11), and then y_2 into the last equation in (11). We refer to this version of DE as Reversible Differential Evolution (RevDE), because the linear transformation is reversible (see the next subsection).
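The substitution that yields R can be verified numerically: the sequential updates of Eq. 11 and the single matrix product R X must agree for any triplet. A minimal sketch (names ours):

```python
import numpy as np

def revde_matrix(F):
    """Linear operator R of Eq. (12), obtained by substituting y_1 (and then
    y_2) into the updates of Eq. (11)."""
    return np.array([
        [1.,          F,                   -F],
        [-F,          1. - F**2,            F + F**2],
        [F + F**2,   -F + F**2 + F**3,      1. - 2.*F**2 - F**3],
    ])

def revde_candidates(triplet, F=0.5):
    """Sequential form of Eq. (11); equivalent to revde_matrix(F) @ triplet."""
    xi, xj, xk = triplet
    y1 = xi + F * (xj - xk)
    y2 = xj + F * (xk - y1)   # uses the fresh candidate y1
    y3 = xk + F * (y1 - y2)   # uses both y1 and y2
    return np.stack([y1, y2, y3])
```

As a bonus, det(R) evaluates to 1 for any F, which is the reversibility result proved in the Appendix.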
An interesting property of the matrices M and R in ADE and RevDE, respectively, is that they are non-singular (see the Appendix for the proofs). Since they are non-singular, they are also invertible, and thus ADE and RevDE use reversible linear transformations.

Reversibility is an important property for formulating Markov Chain Monte Carlo (MCMC) methods [1]. Therefore, we could take advantage of the proposed reversible linear transformations and extend the existing work on utilizing DE for sampling methods [31]. However, this is beyond the scope of this paper and we leave it for future work.

We can obtain insight into a linear operator by analyzing its eigenvalues, which tell us how a matrix transforms an object [29]. Therefore, they play a crucial role in analyzing properties of linear operators; e.g., in control theory, the real parts of eigenvalues determine the stability of linear dynamical systems (if the real parts of all eigenvalues are lower than 0, the system is stable; otherwise it is unstable [5]). Further, the absolute value of an eigenvalue λ_i determines the influence of the corresponding eigenvector [29]. If the absolute value of the eigenvalue is lower than 1, then the eigenvector is a decaying mode. Similarly, if |λ_i| >
1, then the eigenvector is a dominant mode. In the case of |λ_i| = 1, the eigenvector corresponds to a steady state. In Figure 1, we present the real parts and absolute values of the eigenvalues of M and R. We notice the following facts:

• For ADE, all real parts of the eigenvalues are equal to 1, and all absolute values of the eigenvalues are larger than or equal to 1 (the eigenvalues of M are 1 and 1 ± i√3 F, so their moduli are 1 and √(1 + 3F^2)). As a result, the method will never lead to a decaying mode, and as such it encourages exploration of the search space.

• For RevDE, the situation is different: the eigenvalues behave differently below and above F = 0.75. For F > 0.75, one eigenvalue has a real part equal to 1 and the other two eigenvalues have real parts lower than 0. However, in all cases, all absolute values of the eigenvalues are larger than 0.¹ In other words, RevDE possesses steady states for some values of F, but for F > 0.75 it could generate candidate solutions that are dominated by the direction indicated by one of two eigenvectors. Consequently, this could lead to "jumping" in the search space.

Since ADE is closely related to DEx3, this result sheds additional light on the behavior of DE, and seems to confirm that taking F larger than 0.75 could be troublesome, while keeping F below 0.75 is preferable, because then the linear operator will not lead to dominating modes. As a result, better exploitation/exploration of the search space can be achieved.

¹ This fact follows from the non-singularity of the matrix R, i.e., a matrix is non-singular iff all its eigenvalues are non-zero.
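The Figure-1-style eigenvalue analysis can be reproduced in a few lines; the sketch below (names ours) prints real parts and moduli of the eigenvalues of M and R on a small grid of F values:

```python
import numpy as np

def operators(F):
    """The linear operators M (ADE, Eq. 10) and R (RevDE, Eq. 12)."""
    A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
    M = np.eye(3) + F * A
    R = np.array([[1., F, -F],
                  [-F, 1. - F**2, F + F**2],
                  [F + F**2, -F + F**2 + F**3, 1. - 2.*F**2 - F**3]])
    return M, R

for F in (0.25, 0.5, 0.75, 0.9):
    M, R = operators(F)
    lm, lr = np.linalg.eigvals(M), np.linalg.eigvals(R)
    print(f"F={F}: Re(l_M)={np.round(lm.real, 3)}, |l_M|={np.round(np.abs(lm), 3)}, "
          f"Re(l_R)={np.round(lr.real, 3)}, |l_R|={np.round(np.abs(lr), 3)}")
```

For M one can check analytically that the eigenvalues are 1 and 1 ± i√3 F; for R, since det(R) = 1, the product of the eigenvalue moduli is always 1, so a dominant mode (|λ| > 1) must be balanced by a decaying one.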
In order to verify our approach empirically, we compare the three proposed methods and the standard DE on three testbeds:

(1) Benchmark functions: optimization of selected benchmark functions.
(2) Gene repressilator system: discovering parameter values of a system of ordinary differential equations for given observations.
(3) Neural network learning: learning a neural network with one hidden layer on an image dataset.

In all experiments, we used the uniform crossover with p = 0.9, and the scaling factor F was selected per problem from a predefined grid of eight values. The population size was set to 500 across all experiments. We pick the base vector randomly. The code of the methods and all experiments is available at the following link: https://github.com/jmtomczak/reversible-de.

We evaluate the proposed methods on the optimization task of four benchmark functions:

• Griewank function [11]:
    f(x) = 1 + Σ_{d=1}^{D} x_d^2 / 4000 − Π_{d=1}^{D} cos(x_d / √d)    (13)
  with the box constraints x_d ∈ [−600, 600] for all d;

• Rastrigin function [23]:
    f(x) = 10 D + Σ_{d=1}^{D} ( x_d^2 − 10 cos(2π x_d) )    (14)
  with the box constraints x_d ∈ [−5.12, 5.12] for all d;

• Salomon function [24]:
    f(x) = 1 − cos( 2π √( Σ_{d=1}^{D} x_d^2 ) ) + 0.1 √( Σ_{d=1}^{D} x_d^2 )    (15)
  with the box constraints x_d ∈ [−100, 100] for all d;

• Schwefel function [25]:
    f(x) = 418.9829 D − Σ_{d=1}^{D} x_d sin( √|x_d| )    (16)
  with the box constraints x_d ∈ [0, 500] for all d.

We test the methods on these functions with different dimensionalities, namely D ∈ {10, 30, 100}. We run DEx3, ADE and RevDE for 150 generations. Since DE evaluates three times fewer candidate solutions, we run it for 450 generations to match the number of evaluations. However, we want to highlight that DE is more informed than the other methods due to the propagation of new solutions in consecutive iterations (i.e., it applies the selection mechanism three times more often). All experiments are repeated 10 times.

The results of the best solution found until a given evaluation are presented in Figure 2. First, we notice that ADE and DEx3 perform similarly to DE. However, in 3 out of 12 cases (i.e., the Griewank function with D = 10, 30 and 100), ADE outperformed DEx3. This suggests that there is no need to sample multiple (i.e., three) triplets, and utilizing a single triplet to generate new candidates is sufficient.

In all test cases, RevDE achieved the best results in terms of both the final objective value and the convergence speed. This result is remarkable, because new candidate solutions are generated on-the-fly and are used to generate new points. Moreover, for D =
30 and D = 100, i.e., the higher-dimensional cases, RevDE outperformed the other methods significantly. These results are especially promising for real-life applications like discovering parameter values of mechanistic models and computer programs [6], or learning controllers in (evolutionary) robotics [15, 16].

For the Rastrigin function with D = 30 and D = 100, there is a peculiar behavior of RevDE: around evaluation number 80000 and 50000, respectively, there is a large improvement in the objective value. We hypothesize that the optimizer "jumps out" of a local minimum due to a large eigenvalue, as discussed in Section 3.4.
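The benchmark setup can be reproduced with a compact RevDE loop. The sketch below runs RevDE on the Rastrigin function; it is our simplification (crossover is omitted, the population and generation counts are scaled down), not the authors' exact experimental code:

```python
import numpy as np

def rastrigin(x):
    """Rastrigin function, Eq. (14); global minimum 0 at the origin."""
    return 10. * x.size + float(np.sum(x**2 - 10. * np.cos(2. * np.pi * x)))

def revde_step(pop, fit, f_obj, F=0.5, bounds=(-5.12, 5.12), rng=None):
    """One RevDE generation: each sampled triplet yields three new points via
    Eq. (11); candidates are clipped to the box, then (mu + lambda) selection."""
    rng = rng or np.random.default_rng()
    N, D = pop.shape
    cand = []
    for _ in range(N):
        i, j, k = rng.choice(N, size=3, replace=False)
        y1 = pop[i] + F * (pop[j] - pop[k])
        y2 = pop[j] + F * (pop[k] - y1)
        y3 = pop[k] + F * (y1 - y2)
        cand.extend([y1, y2, y3])
    cand = np.clip(np.array(cand), *bounds)
    cand_fit = np.array([f_obj(x) for x in cand])
    joint = np.vstack([pop, cand])
    joint_fit = np.concatenate([fit, cand_fit])
    best = np.argsort(joint_fit)[:N]
    return joint[best], joint_fit[best]

rng = np.random.default_rng(42)
pop = rng.uniform(-5.12, 5.12, size=(50, 10))
fit = np.array([rastrigin(x) for x in pop])
for _ in range(100):
    pop, fit = revde_step(pop, fit, rastrigin, rng=rng)
print("best objective after 100 generations:", fit.min())
```

Note that each generation evaluates 3N candidates, which is the bookkeeping behind matching DE's 450 generations against 150 for the proposed methods.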
The gene repressilator system proposed in [10] is a popular model of gene regulatory systems. It consists of three genes connected in a feedback loop, where each gene transcribes the repressor protein for the next gene in the loop. The model consists of six ordinary differential equations that describe the dependencies among the mRNAs (m_1, m_2, m_3) and the corresponding proteins (p_1, p_2, p_3), and four parameters x = [α_0, n, β, α]^T, as follows:

    dm_1/dt = −m_1 + α / (1 + p_3^n) + α_0    (17)
    dp_1/dt = −β (p_1 − m_1)    (18)
    dm_2/dt = −m_2 + α / (1 + p_1^n) + α_0    (19)
    dp_2/dt = −β (p_2 − m_2)    (20)
    dm_3/dt = −m_3 + α / (1 + p_2^n) + α_0    (21)
    dp_3/dt = −β (p_3 − m_3).    (22)

Figure 2: The results of the best solution found until a given evaluation on four benchmark black-box optimization testbeds: (a) Griewank function, (b) Rastrigin function, (c) Salomon function, (d) Schwefel function, and three cases: (left column) D = 10, (middle column) D = 30, (right column) D = 100. The solid red lines correspond to DE, the solid yellow lines are for DEx3, the dotted-dashed green lines depict ADE, and the dotted blue lines represent RevDE. In all test cases, the average and one standard deviation over 10 runs are presented.

Figure 3: The discovered parameter values in the repressilator model by: (left column) DEx3, (middle column) ADE, (right column) RevDE. Colors of dots represent the generation number: blue is the 1st generation, orange is the 4th generation, green is the 8th generation, red is the 20th generation. The real parameter values: (α_0, n, β, α) = (1, 2, 5, 1000).

Further, we assume that only the mRNAs are measured, while the proteins are treated as missing data. The goal of this experiment is to discover the parameter values for a given observation of mRNA. We transform this problem into the minimization of the following objective:

    f(x) = (1/N) Σ_{n=1}^{N} √( Σ_{i=1}^{3} ( m_{i,n} − m̂_{i,n}(x) )^2 ),    (23)

where m̂_{i,n}(x) is given by numerically integrating the system of differential equations in (17–22) using a solver, e.g., a Runge-Kutta method. Notice that the objective function is black-box due to the non-differentiable simulator.

We follow the settings outlined in [33]. The real parameter values are assumed to be x* = [1, 2, 5, 1000]^T, and we generate the real values of m_i by first solving equations (17–22) with x* and given initial conditions (m_1, p_1, m_2, p_2, m_3, p_3), and then adding Gaussian noise with mean equal to 0 and standard deviation equal to 5.

We run all methods for 20 generations. All experiments were repeated 10 times. For analyzing the final solutions, we look into the convergence of a population from a single run.
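A minimal sketch of simulating the system (17–22) and evaluating the objective (23) is shown below. The parameter ordering x = (α_0, n, β, α) follows the equations above; the protein-to-mRNA coupling uses the standard repressilator wiring, and the time grid and initial state in the usage are placeholders, not the authors' exact settings:

```python
import numpy as np
from scipy.integrate import solve_ivp

def repressilator(t, state, alpha0, n, beta, alpha):
    """Right-hand side of Eqs. (17)-(22); state = (m1, p1, m2, p2, m3, p3)."""
    m1, p1, m2, p2, m3, p3 = state
    dm1 = -m1 + alpha / (1. + p3**n) + alpha0
    dp1 = -beta * (p1 - m1)
    dm2 = -m2 + alpha / (1. + p1**n) + alpha0
    dp2 = -beta * (p2 - m2)
    dm3 = -m3 + alpha / (1. + p2**n) + alpha0
    dp3 = -beta * (p3 - m3)
    return [dm1, dp1, dm2, dp2, dm3, dp3]

def simulate_mrna(x, t_eval, init):
    """Integrate the ODEs and return the three mRNA trajectories, shape (3, T)."""
    sol = solve_ivp(repressilator, (t_eval[0], t_eval[-1]), init,
                    t_eval=t_eval, args=tuple(x))
    return sol.y[[0, 2, 4]]

def objective(x, t_eval, init, observed):
    """Eq. (23): mean over time points of the Euclidean distance between the
    observed and simulated mRNA levels."""
    m_hat = simulate_mrna(x, t_eval, init)
    return float(np.mean(np.sqrt(np.sum((observed - m_hat)**2, axis=0))))
```

Passed to a DE variant as `f_obj = lambda x: objective(x, t_eval, init, observed)`, this is the black-box simulator the text refers to: the solver call makes f non-differentiable with respect to x.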
Figure 4: The results of the best solution found until a given fitness evaluation on the repressilator model. The average and the standard deviation over 10 runs are reported.

We present the results of the best solution found until a given objective evaluation in Figure 4. Further, we depict the convergence of the population in Figure 3. We present only a single run out of ten; however, the behavior is almost indistinguishable between runs. All methods achieve almost identical objective values (see Figure 4). RevDE seems to converge slightly faster (on average) than the other methods, but the difference is not significant.

We can obtain more insight into the performance of the methods by analyzing the behavior of the population over generations (see Figure 3). First, we notice that all methods converge to a single point within 20 generations. Second, ADE and DEx3 behave almost indistinguishably; comparing their scatterplots, it is almost impossible to spot a difference. However, RevDE converges faster, because already in the 4th generation its solutions are less scattered than those of DEx3 and ADE. In other words, comparing how the points are distributed in the 4th and the 8th generation, it is apparent that the variance of the populations found by RevDE is smaller than for ADE and DEx3.

Eventually, we would like to comment on discovering the parameter values. In the case of ADE, DEx3 and RevDE, all candidates converge to the same solution, roughly around the point x = [1, 2, 5, 1380]^T. Comparing the real values and the discovered ones, we see that the only mismatch is for α. However, the discrepancy (1000 vs. 1380) possibly follows from the fact that the observed data is noisy, because the population in the 4th generation covered values around 1000 well. Similarly, in [33] the Approximate Bayesian Computation with Sequential Monte Carlo method (ABC-SMC) also obtained values roughly between 800 and 1300 (see Figure 4(c) in [33]).
We conclude that RevDE seems to be a very promising alternative to Monte Carlo techniques for finding parameters in simulator-based inference problems [6].

In the last experiment, we aim at evaluating our approach on a high-dimensional optimization problem. For this purpose, we train a neural network with a single fully-connected hidden layer on the image dataset of ten handwritten digits (MNIST [17]). We resize the original images from 28px × 28px to 14px × 14px, which gives a network with D = 4120 weights in total (i.e., X = R^4120). We use the ReLU non-linear activation function for the hidden units and the softmax activation function for the outputs. The objective function is the classification error:

    f(x) = 1 − (1/N) Σ_{n=1}^{N} I[ y_n = ŷ_n(x) ],    (24)

where N denotes the number of images, I[·] is an indicator function, y_n is the true label of the n-th image, and ŷ_n(x) is the label of the n-th image predicted by a neural network with weights x. The prediction of the neural network is the class label with the highest value given by the softmax output.

The original dataset consists of 60000 pairs of images and labels for training, and 10000 pairs of images and labels for testing. In our experiments, we use only 2000 training points, but all 10000 testing points. All models are trained for 500 epochs (generations) and the experiments are repeated 3 times. For testing, we take the candidate solution from the final population with the lowest training classification error.

The objective function in Eq. 24 is non-differentiable, and thus can be treated as a black-box objective. However, we want to highlight that this experiment does not aim at proposing DE as an alternative to gradient-based training procedures, because the log-likelihood function is a good proxy for the objective in (24). In fact, it has been shown in multiple papers that the (stochastic) gradient descent optimizer is extremely effective in learning neural networks, and DE is not competitive with it at all [12]. We rather use
the neural network learning problem as an interesting showcase of a high-dimensional optimization problem.

Table 1: Test results on MNIST. The average classification error with the standard errors over 3 runs is reported for DEx3, ADE and RevDE.

In Figure 5, we present the learning curves of neural networks trained with the different methods. Additionally, in Table 1 we gather the test classification errors. First of all, we notice that the training is not fully converged and possibly better results could have been achieved. Nevertheless, our goal is to present the performance of our methods on a high-dimensional problem rather than to reach state-of-the-art scores. That being said, we first observe that the proposed extensions of DE share similar learning curves. ADE performed best during training, and RevDE converged to a better point than DEx3 at the very end. However, the final test performance was better for ADE and RevDE than for DEx3. This result can possibly be explained by the fact that DEx3 is more stochastic than the other two methods, which could be harmful in a high-dimensional problem.

A close inspection of the results in Table 1 suggests that ADE and RevDE perform on par, and that they seem to be better than DEx3. This result is potentially interesting, because it indicates that the negative message delivered in [12], namely that DE is definitely worse than gradient-based learning methods, is not necessarily true, and more research in this direction is required, especially in the context of adding non-differentiable components (regularizers) to the learning objective.
Figure 5: Training curves on MNIST. The average and the standard deviation over 3 runs are reported.

CONCLUSION

In this paper, we noted that insufficient variability of the population could cause DE to lose diversity too quickly. In order to counteract this issue, we proposed three extensions of DE: (i) DE with multiple samples of candidates for calculating perturbations, (ii) DE with a reversible linear transformation using the sum of the identity matrix and an antisymmetric matrix, and (iii) DE with a reversible linear transformation utilizing newly generated, not yet evaluated candidates.

We provide a theoretical analysis of the proposed linear operators by proving their reversibility and inspecting their eigenvalues. Further, we show empirically on three testbeds (benchmark function optimization, discovering parameter values of the gene repressilator system, and learning neural networks) that producing new candidates on-the-fly allows better results to be obtained in fewer evaluations compared to DE.

We believe that this work opens new possible research directions:
• Representing the differential mutation as a linear transformation allows other forms of linear operators to be investigated.
• The linear operators defined in this paper are parameterized with a single parameter. A natural extension would be to consider different parameterizations.
• Here, we present an analysis based on eigenvalues. However, we can consider the reversible transformation as a dynamical system (e.g., an extension of the analysis outlined in [20]).
• We can take advantage of the reversibility of the proposed linear transformations. For instance, reversibility is an important property of transition operators in MCMC methods [1]. A modification of vanilla DE for formulating a proper MCMC method was already presented in [31], and an interesting direction would be to extend this work using DE with the reversible linear transformations.
APPENDIX
Non-singularity of M and R

Proposition 5.1. The matrix M defined in Eq. 9 is non-singular.

Proof. Since M is a small 3 × 3 matrix, we can calculate its determinant analytically, which gives:

    det(M) = 1 + 3F^2.    (25)

For any value of F we have det(M) ≥ 1 > 0; therefore, the matrix M is non-singular. □

Proposition 5.2. The matrix R defined in Eq. 12 is non-singular.

Proof. The matrix R is also a small 3 × 3 matrix; thus, we can calculate its determinant analytically, which gives:

    det(R) = 1.    (26)

Since the determinant is always 1, R is non-singular. □

ACKNOWLEDGMENTS
EW-T is financed by a grant within Mobilnosc Plus V from the Polish Ministry of Science and Higher Education (Grant No. 1639/MOB/V/2017/0).
REFERENCES
[1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. 2003. An introduction to MCMC for machine learning. Machine Learning 50, 1-2 (2003), 5–43.
[2] Charles Audet and Warren Hare. 2017. Derivative-free and Blackbox Optimization. Springer.
[3] Thomas Bäck. 1996. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press.
[4] Thomas Bäck, Christophe Foussette, and Peter Krause. 2013. Contemporary Evolution Strategies. Springer.
[5] Zdzisław Bubnicki. 2005. Modern Control Theory. Springer.
[6] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. 2019. The frontier of simulation-based inference. arXiv preprint arXiv:1911.01429 (2019).
[7] Stephane Doncieux, Nicolas Bredeche, Jean-Baptiste Mouret, and Agoston E. Eiben. 2015. Evolutionary robotics: what, why, and where to. Frontiers in Robotics and AI 2 (2015), 4.
[8] Agoston E. Eiben, Elena Marchiori, and V. A. Valkó. 2004. Evolutionary algorithms with on-the-fly population size adjustment. In International Conference on Parallel Problem Solving from Nature. Springer, 41–50.
[9] Agoston E. Eiben and James E. Smith. 2015. Introduction to Evolutionary Computing. Springer.
[10] Michael B. Elowitz and Stanislas Leibler. 2000. A synthetic oscillatory network of transcriptional regulators. Nature 403 (2000), 335–338.
[11] Andreas O. Griewank. 1981. Generalized descent for global optimization. Journal of Optimization Theory and Applications 34, 1 (1981), 11–39.
[12] Jarmo Ilonen, Joni-Kristian Kamarainen, and Jouni Lampinen. 2003. Differential evolution training algorithm for feed-forward neural networks. Neural Processing Letters 17, 1 (2003), 93–105.
[13] Donald R. Jones, Matthias Schonlau, and William J. Welch. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 4 (1998), 455–492.
[14] Nurhan Karaboga. 2005. Digital IIR filter design using differential evolution algorithm. EURASIP Journal on Advances in Signal Processing 2005, 8 (2005), 1269–1276.
[15] Gongjin Lan, Matteo De Carlo, Fuda van Diggelen, Jakub M. Tomczak, Diederik M. Roijers, and Agoston E. Eiben. 2020. Learning directed locomotion in modular robots with evolvable morphologies. arXiv preprint arXiv:2001.07804 (2020).
[16] Gongjin Lan, Milan Jelisavcic, Diederik M. Roijers, Evert Haasdijk, and Agoston E. Eiben. 2018. Directed locomotion for modular robots with evolvable morphologies. In International Conference on Parallel Problem Solving from Nature. Springer, 476–487.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[18] Erasmo Gabriel Martinez-Soltero and Jesus Hernandez-Barragan. 2018. Robot navigation based on differential evolution. IFAC-PapersOnLine 51, 13 (2018), 350–354.
[19] Nasimul Noman and Hitoshi Iba. 2008. Accelerating differential evolution using an adaptive local search. IEEE Transactions on Evolutionary Computation 12, 1 (2008), 107–125.
[20] Karol R. Opara and Jarosław Arabas. 2019. Differential Evolution: A survey of theoretical analyses. Swarm and Evolutionary Computation 44 (2019), 546–558.
[21] Magnus Erik Hvass Pedersen. 2010. Good parameters for differential evolution. Technical Report HL1002. Hvass Laboratories.
[22] Kenneth Price, Rainer M. Storn, and Jouni A. Lampinen. 2006. Differential Evolution: A Practical Approach to Global Optimization. Springer Science & Business Media.
[23] Leonard Andreevič Rastrigin. 1974. Systems of Extremal Control. Nauka.
[24] Ralf Salomon. 1996. Re-evaluating genetic algorithm performance under coordinate rotation of benchmark functions. A survey of some theoretical and practical aspects of genetic algorithms. BioSystems 39, 3 (1996), 263–278.
[25] Hans-Paul Schwefel. 1981. Numerical Optimization of Computer Models. John Wiley & Sons, Inc.
[26] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1, 148–175.
[27] Mudita Sharma, Alexandros Komninos, Manuel López-Ibáñez, and Dimitar Kazakov. 2019. Deep reinforcement learning based parameter control in differential evolution. In Proceedings of the Genetic and Evolutionary Computation Conference. 709–717.
[28] Rainer Storn and Kenneth Price. 1997. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11, 4 (1997), 341–359.
[29] Gilbert Strang. 2016. Introduction to Linear Algebra. Wellesley-Cambridge Press, Wellesley, MA.
[30] Dimitris K. Tasoulis, Vassilis P. Plagianakos, and Michael N. Vrahatis. 2006. Differential evolution algorithms for finding predictive gene subsets in microarray data. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, 484–491.
[31] Cajo J. F. Ter Braak. 2006. A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing 16, 3 (2006), 239–249.
[32] Jakub M. Tomczak and Ewelina Węglarz-Tomczak. 2019. Estimating kinetic constants in the Michaelis–Menten model from one enzymatic assay using Approximate Bayesian Computation. FEBS Letters (2019).
[33] Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael P. H. Stumpf. 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6, 31 (2009), 187–202.
[34] Feng-Sheng Wang, Tzu-Liang Su, and Horng-Jhy Jang. 2001. Hybrid Differential Evolution for Problems of Kinetic Parameter Estimation and Dynamic Optimization of an Ethanol Fermentation Process. Industrial & Engineering Chemistry Research 40, 13 (2001), 2876–2885.
[35] Jingqiao Zhang and Arthur C. Sanderson. 2009. JADE: Adaptive Differential Evolution with Optional External Archive. IEEE Transactions on Evolutionary Computation 13, 5 (2009), 945–958.