Performance Analysis and Improvement of Parallel Differential Evolution
Zibin Pan, The Chinese University of Hong Kong, [email protected]
Abstract—Differential evolution (DE) is an effective global evolutionary optimization algorithm used to solve global optimization problems, mainly in continuous domains. In this field, researchers pay most attention to improving the capability of DE to find better global solutions; however, the computational performance of DE is also an important aspect, especially when the problem scale is large. This paper first analyzes the design of parallel computation in DE, which can easily be executed with the Math Kernel Library (MKL) and the Compute Unified Device Architecture (CUDA). Then the essence of the exponential crossover operator is described, and we point out that it cannot be fully parallelized. We therefore propose a new exponential crossover operator (NEC) that can be executed in parallel with MKL/CUDA. Extended experiments show that the new crossover operator speeds up DE greatly. Finally, we test the new parallel DE structure, showing that it is much faster than the traditional one.
Index Terms—Differential evolution, matrix calculation, high performance
I. INTRODUCTION

Differential evolution (DE) is a simple and useful evolutionary algorithm, similar to the Genetic Algorithm (GA), that can address optimization problems efficiently and precisely [1], [2]. Since DE was proposed by Price and Storn in 1995 in several papers [3], [4], [5], it has attracted many researchers because of its simple implementation, wide applicability, and good optimization results.

Most researchers pay attention to reforming DE so that it solves difficult problems better. But for large-scale optimization problems, although the evaluation of individuals is the computational bottleneck most of the time, we still found that using CPU/GPU parallel technologies in DE yields a dramatic increase in performance.

In DE, each chromosome is an array containing a list of real numbers. This simple and useful structure makes it possible to compute in parallel with matrix-calculation techniques, with the help of the Math Kernel Library (MKL) or the Compute Unified Device Architecture (CUDA). The former performs well in matrix calculation on the CPU, while the latter can speed up the code on a Graphics Processing Unit (GPU).

There are already many achievements in this field. For example, DE with GPGPU (general-purpose GPU), a parallel version of DE executing on the GPU, was introduced by de Veronese and Krohling [6], Zhu [7], and Zhu and Li [8]. Cortes et al. [9] analyzed the influence of parameters on the speedup and solution quality of DE on a GPGPU, showing that DE on a GPGPU not only runs faster but also produces good optimization results. Recently, Meselhi et al. designed a fast differential evolution for big optimization [10], which shows a large increase in computational performance.

However, these parallel achievements only perform well in "DE/*/*/bin".
They do not do well in other versions of DE such as "DE/*/*/L", which uses the exponential crossover (exp) instead of the uniform (binomial) crossover in recombination. The work of Tanabe [11] showed that the exp crossover can perform better on some optimization problems where adjacent variables have dependencies; hence, it is still useful in DE. However, as this paper will show, the exp crossover cannot be parallelized as fully as the uniform (binomial) crossover, so under parallel calculation the former runs much slower than the latter.

The rest of the paper is organized as follows. Section 2 presents parallel-computation algorithms for the different parts of DE. Section 3 uncovers the essence of the exponential crossover and redesigns it with the help of probability theory so that it is faster and can be accelerated on MKL/GPU. Section 4 compares the performance of DE, DE on MKL (MKL-DE), and DE with CUDA (CUDA-DE).
II. PARALLEL DIFFERENTIAL ALGORITHM
The differential evolution algorithm processes a population of Np individuals. Each individual owns a chromosome, which is a D-dimension vector x_{i,g} of real numbers, where i is the index of the individual in the population and g indicates the generation to which the vector belongs. x_g is a matrix that stores all the x_{i,g} of the population. After initialization, base vector selection, differential mutation, recombination, and selection, a new generation is created, and after several iterations the solutions of the population get better and better. The pseudocode of DE is shown in Algorithm 1.

In Algorithm 1, each iteration of the loop between line 5 and line 14 can be computed in parallel. Suppose we denote a parallel-calculation task as a "job", meaning that the device can run all jobs in parallel; then each iteration of the loop can be allocated to a job. In addition, lines 7, 8, 11, and 14 loop over all the elements of the chromosome vector, and each of these element-wise loops is independent.
Algorithm 1: Differential Evolution (DE)
1: Set parameters Cr, F, and Np; iteration counter g ← 0
2: Initialize population P_{x,g} = (x_{i,g}), i = 0, 1, ..., Np − 1
3: fit ← evaluate(x_g)
4: while stop criteria is false do
5:   for i = 0 to Np − 1 do
6:     Select a base index number r
7:     v_{i,g} ← differential_mutation(F, x_{i,g}, r)
8:     u_{i,g} ← recombination(Cr, v_{i,g}, x_{i,g})
9:     fit' ← evaluate(u_{i,g})
10:    if fit' < fit_i then
11:      x_{i,g+1} ← u_{i,g}
12:      fit_i ← fit'
13:    else
14:      x_{i,g+1} ← x_{i,g}

However, it is not good enough simply to turn each loop iteration into a sub-job of a parallel calculation: since we cannot control the running time of line 9 (evaluation), it is unwise to allocate each iteration of lines 5-14 to a single job for a thread, because that would embed a time-consuming evaluation task in the job. It is necessary to change the algorithm flow of DE to a new design in which the evaluation is separated out, so that the remaining procedures can be computed in parallel on MKL/CUDA. The pseudocode of the parallel DE framework is shown in Algorithm 2.
Algorithm 2: Parallel Differential Evolution Framework
1: Set parameters Cr, F, and Np; iteration counter g ← 0
2: Initialize the population chromosome matrix x_g
3: fit ← evaluate(x_g)
4: while stop criteria is false do
5:   shuffle the order of the individuals
6:   r ← base_index_selection(Np)
7:   mask ← crossover(Cr, Np, D)
8:   u_g ← copy(x_g)
9:   u_g[mask] ← differential_mutation(F, x_g, mask, r)
10:  fit' ← evaluate(u_g)
11:  idx ← where(fit' < fit)
12:  x_{g+1} ← copy(x_g)
13:  x_{g+1}[idx, :] ← u_g[idx, :]
14:  fit[idx] ← fit'[idx]

In Algorithm 2, r in line 6 is a vector that stores the indices of the individuals chosen by the selection method; the base vectors of DE are x_g[r, :], and r can be created in parallel. In line 7, mask is a 0-1 matrix storing, for each element of the u_g matrix, whether its value needs to be changed to the mutation result (1 means yes, 0 means no). Note that the order of the crossover and the mutation is exchanged, which decreases the running time while giving the same result as traditional DE. "DE/*/*/bin" can be designed according to Algorithm 2, and the framework can be used in many versions of DE. Except for line 10 (evaluation), all steps from line 5 to line 14 can be executed in parallel as matrix calculations with the help of MKL/CUDA; the underlying details, such as how many sub-threads are used in the parallel calculation, are set to a suitable state by MKL/CUDA automatically. There are many kinds of base index selection in DE; some are shown in Algorithms 3, 4, and 5. The basic parallel mutation of "DE/*/1/bin" is shown in Algorithm 6, and the parallel uniform (binomial) crossover in Algorithm 7.
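As a sketch of how lines 11-14 of Algorithm 2 map onto matrix operations, the following NumPy fragment (function and variable names are ours; NumPy stands in for the MKL/CuPy backends) performs the whole selection step for the population at once:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_next_generation(x_g, u_g, fit, fit_new):
    """Lines 11-14 of Algorithm 2: vectorized greedy selection."""
    idx = np.where(fit_new < fit)[0]   # individuals whose trial vector improved
    x_next = x_g.copy()
    x_next[idx, :] = u_g[idx, :]       # keep the better trial vectors
    fit = fit.copy()
    fit[idx] = fit_new[idx]
    return x_next, fit

# toy usage: minimize the sphere function over a random population
x_g = rng.random((8, 3))
u_g = rng.random((8, 3))
sphere = lambda m: np.sum(m * m, axis=1)
x_next, fit = select_next_generation(x_g, u_g, sphere(x_g), sphere(u_g))
```

The same code runs on the GPU by swapping `numpy` for `cupy`, whose API is largely NumPy-compatible.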
Algorithm 3: Random Base Index Selection
Require: Np. Ensure: an array r.
1: r ← floor(Np · rand(0, 1, size=Np))

Algorithm 4: Random Offset Base Index Selection
Require: Np. Ensure: an array r.
1: idx ← [0, 1, ..., Np − 1]
2: rg ← ceil((Np − 1) · rand(0, 1))
3: r ← (idx + rg) mod Np

Algorithm 5: Permutation Base Index Selection
Require: Np. Ensure: an array r.
1: idx ← [0, 1, ..., Np − 1]
2: r ← shuffle(idx)

The three selections above can all be easily computed on MKL/CUDA.
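The three selection schemes translate directly into vectorized NumPy, for example (a sketch; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_base_index(np_size):
    # Algorithm 3: independent uniform random base indices in [0, Np)
    return np.floor(np_size * rng.random(np_size)).astype(int)

def random_offset_base_index(np_size):
    # Algorithm 4: shift all indices by one shared random offset
    idx = np.arange(np_size)
    rg = int(np.ceil((np_size - 1) * rng.random()))
    return (idx + rg) % np_size

def permutation_base_index(np_size):
    # Algorithm 5: a random permutation of the indices
    return rng.permutation(np_size)
```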
Algorithm 6: Basic Parallel Differential Mutation
Require: F, x_g, mask, r. Ensure: an array results.
1: Np, Lind ← the size of x_g
2: r1 ← (r + ceil((Np − 1) · rand(0, 1, size=Np))) mod Np
3: r2 ← (r1 + ceil((Np − 1) · rand(0, 1, size=Np))) mod Np
4: results ← (x_g[r, :])[mask] + F · ((x_g[r1, :])[mask] − (x_g[r2, :])[mask])

Algorithm 6 implements the differential mutation of "DE/*/1/bin"; the formula is

v_i ← x_{r0} + F · (x_{r1} − x_{r2}).  (1)

In traditional DE, the crossover is executed after the mutation, so after the crossover quite many elements of v_{i,g} that are not passed on to u_{i,g} are wasted. The mask therefore helps to speed up DE.

Algorithm 7 implements the uniform (binomial) crossover, whose formula is

u_{i,j,g} ← v_{i,j,g}, if rand(0, 1) < Cr or j = j_rand;
            x_{i,j,g}, otherwise,  (2)

where j is an index taking values in 0, 1, ..., D − 1, and mask is a 0-1 matrix that marks whether each element of the u_g matrix needs to be crossed with the v_g matrix.
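Equations (1) and (2) can be sketched together in NumPy (names are ours; this illustrates the mask idea, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def binomial_mask(Cr, Np, D):
    # Equation (2) as a boolean matrix: cross where rand < Cr, always at j_rand
    mask = rng.random((Np, D)) < Cr
    mask[np.arange(Np), rng.integers(0, D, size=Np)] = True
    return mask

def masked_mutation(F, x_g, mask, r):
    # Equation (1), evaluated only at the masked positions (as in Algorithm 6)
    Np = x_g.shape[0]
    r1 = (r + np.ceil((Np - 1) * rng.random(Np)).astype(int)) % Np
    r2 = (r1 + np.ceil((Np - 1) * rng.random(Np)).astype(int)) % Np
    u_g = x_g.copy()
    u_g[mask] = x_g[r][mask] + F * (x_g[r1][mask] - x_g[r2][mask])
    return u_g

Np, D = 6, 8
x_g = rng.random((Np, D))
mask = binomial_mask(0.5, Np, D)
u_g = masked_mutation(0.5, x_g, mask, rng.permutation(Np))
```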
Algorithm 7: Parallel Uniform (Binomial) Crossover (MKL/CUDA-BIN)
Require: Cr ∈ [0, 1], Np, D. Ensure: mask.
1: mask ← an Np × D matrix in which all elements are 0
2: j_rand ← randint(D, size=Np)
3: idx ← [0, 1, ..., Np − 1]
4: mask[idx, j_rand] ← 1
5: R ← rand(0, 1, size=(Np, D))
6: mask[R < Cr] ← 1

The algorithms above can easily be run with MKL/CUDA, because most of their procedures are matrix calculations. Section 4 shows the acceleration achieved with MKL/CUDA.

III. NEW EXPONENTIAL CROSSOVER
A. The Essence of the Exponential Crossover
The exponential crossover in differential evolution is similar to 1- or 2-point crossover in the Genetic Algorithm (GA) [3] and is widely used in "DE/*/*/L". Its implementation is shown in Algorithm 8. In each generation g, a mutation vector v_{i,g} is created from the parent chromosome vector x_{i,g} by applying some mutation strategy. After that, v_{i,g} is crossed with x_{i,g} to create the trial vector u_{i,g} by applying the exponential crossover. Cr ∈ [0, 1] is the crossover rate, and j_rand is a decision-variable index randomly selected from [0, D − 1].
Algorithm 8: Exponential Crossover
Require: x_g, v_g, Cr ∈ [0, 1]. Ensure: u_g.
1: Np, D ← the shape of x_g
2: for i ← [0, 1, ..., Np − 1] do
3:   u_{i,g} ← x_{i,g}
4:   j is randomly selected from [0, D − 1]; L ← 0
5:   do
6:     u_{i,j,g} ← v_{i,j,g}
7:     j ← (j + 1) mod D
8:     L ← L + 1
9:   while rand[0, 1) < Cr and L < D
However, the exponential crossover is seldom used in parallel DE practice, and the reason is easy to find: if we allocate the exponential crossover of each individual to a parallel job, each job has to loop over the chromosome vector to check whether each bit of u_{i,g} needs to be crossed, and this loop cannot run in parallel. In the uniform (binomial) crossover, by contrast, the check for each bit of the chromosome is independent, so it is more suitable for parallel calculation. Figure 1 shows the runtime comparison between the parallel exponential crossover (p-exp) and the parallel uniform (binomial) crossover (p-bin).

[Fig. 1: Running time comparison between p-exp and p-bin when Np = 1000 (panels for Cr = 0.2, 0.4, 0.7, and 0.9).]

The exponential crossover has an inherent advantage: when Cr is relatively small and the dimension of u_{i,g} is large, it does not need to traverse all bits of u_{i,g}, because the loop will probably stop halfway. So we can see in Figure 1 that when Cr = 0.2 and Cr = 0.4, as the dimension of the chromosome vector grows, p-exp performs better than p-bin. However, when Cr gets larger, p-exp is slower than p-bin, because each parallel job of p-exp has more work to do: it must traverse the bits of u_{i,g} to check whether they need to be crossed.

In some related work, the L of each v_{i,g} is calculated before the crossover, so that the remaining procedures of the exp crossover can be calculated in parallel at higher speed; see [12]. However, when calculating L, these methods still use a loop like that of the traditional exp crossover, so they remain unsuitable for parallel calculation.

If we reconsider the exponential crossover, we can see that L, the number of crossed bits in u_{i,g}, follows a geometric distribution, not an exponential distribution. But since the geometric distribution is closely related to the exponential distribution, it is unnecessary to rename the exponential crossover. Instead, we can use probability theory to redesign it in order to make it faster.

According to Algorithm 8, L ~ G(1 − Cr), where G denotes the geometric distribution, meaning that

P{L = n} = Cr^(n−1) (1 − Cr), n ≥ 1.  (3)

What we use here is the property

G(n) = P{L ≤ n} = 1 − Σ_{i>n} (1 − Cr) Cr^(i−1) = 1 − Cr^n.  (4)

Now we need to generate a random number that follows this geometric distribution. Let X be an exponentially distributed random variable with parameter λ; then L = ⌈X⌉, where ⌈·⌉ is the ceiling function, is a geometrically distributed random variable with parameter p = 1 − e^(−λ). Here p is the parameter of the geometric distribution, and in the exponential crossover p = 1 − Cr, so we obtain λ = −ln Cr. If we denote by F_X(x) the distribution function of this exponential distribution, then

F_X(x) = 1 − Cr^x.  (5)

According to the inversion method [13] and equation 5, we get

U = F_X(x),  (6)

where U is a random number in (0, 1). Since 1 − U is also a uniformly distributed random number in (0, 1), we may solve Cr^x = U directly, which gives a random sample x from the exponential distribution:

x = ln(U) / ln(Cr).  (7)

So the L of the exponential crossover can be sampled directly by the following equation instead of traversing the elements of the chromosome vector:

L = ⌈ln(U) / ln(Cr)⌉.  (8)
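Equation (8) is easy to check numerically; a minimal sketch (function name ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_L(Cr, size):
    # Inverse-transform sampling: L = ceil(ln U / ln Cr) is geometric
    # with success probability p = 1 - Cr, so P{L = n} = Cr**(n-1) * (1 - Cr).
    U = rng.random(size)
    U = np.where(U == 0.0, np.nextafter(0.0, 1.0), U)  # guard against log(0)
    return np.ceil(np.log(U) / np.log(Cr)).astype(int)

L = sample_L(0.5, 200_000)
# With Cr = 0.5: P{L = 1} = 1 - Cr = 0.5 and E[L] = 1 / (1 - Cr) = 2.
```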
B. New Exponential Crossover (NEC)

In the new exponential crossover, since Cr ∈ [0, 1], it is necessary to deal with three special cases: Cr = 0, Cr = 1, and the fact that L cannot be larger than the dimension number D of u_{i,g}. Thus the equation used to generate L is

L = 1, if Cr = 0;
    min(⌈ln(U) / ln(Cr)⌉, D), if Cr ∈ (0, 1);
    D, if Cr = 1.  (9)

After generating L, we know directly which bits of u_{i,g} need to be crossed, so we can redesign the exponential crossover; Algorithm 9 shows the new design.
Algorithm 9: New Exponential Crossover (NEC)
Require: x_g, v_g, Cr ∈ [0, 1]. Ensure: u_g.
1: Np, D ← the shape of x_g
2: for i ← [0, 1, ..., Np − 1] do
3:   u_{i,g} ← x_{i,g}
4:   j is randomly selected from [0, D − 1]; l ← 0
5:   calculate L by equation 9
6:   do
7:     u_{i,j,g} ← v_{i,j,g}
8:     j ← (j + 1) mod D
9:     l ← l + 1
10:  while l < L

Figure 2 shows the frequency of L under the new exponential crossover (NEC) and the traditional exponential crossover (exp) over 1,000,000 trials with D = 10. We can see that NEC produces results similar to the traditional exponential crossover. Figures 3 and 4 show the higher speed of NEC over 100 experiments with different Cr.

[Fig. 2: The frequencies of L under nec and exp with D = 10 (panels for Cr = 0.1, 0.3, 0.7, and 0.9).]

[Fig. 3: Running time comparison between nec and exp when Np = 1000.]

[Fig. 4: Running time comparison between nec and exp when D = 100.]

Comparing the traditional exponential crossover and NEC: the latter computes the total crossed length L first, so that when looping over u_{i,g} it only needs to compare the current crossed length l with L, which makes it much faster than the traditional exponential crossover.

Moreover, the procedure of generating the L of every individual in DE can also be executed in parallel, so we can make full use of MKL/CUDA to speed up the new exponential crossover. First, the column vector Ls, which stores the L of every u_{i,g}, i = 0, 1, ..., Np − 1, is sampled in parallel by applying equation 9. Then, with the help of Ls, we create in parallel a matrix mask that denotes whether each bit of every chromosome vector u_{i,g} needs to be crossed. Finally, we cross every bit of the chromosomes in parallel on MKL/CUDA according to the matrix mask. It is worth noting that the new exponential crossover on MKL/CUDA (MKL/CUDA-NEC) only returns a signal matrix mask, similarly to Algorithm 7; this is because in parallel differential evolution on MKL/CUDA (see Algorithm 2) the mutation is executed after the crossover, and with the help of the mask the mutation only needs to mutate the elements where the mask equals 1. Algorithm 10 is the pseudocode of the parallel new exponential crossover (p-nec).
Algorithm 10: Parallel Version of the New Exponential Crossover (MKL/CUDA-NEC)
Require: Cr ∈ [0, 1], Np, D. Ensure: mask.
1: j_rand ← randint(D, size=Np)
2: create a column vector Ls by applying equation 9 in parallel
3: Seq ← repmat([0, 1, ..., D − 1], size=(Np, 1))
4: j ← repmat(j_rand, size=(1, D))
5: j' ← repmat(j_rand + Ls, size=(1, D))
6: mask ← (j ≤ Seq) XOR (j' ≤ Seq)

Comparing MKL/CUDA-NEC with the p-exp mentioned above, all the procedures of the former are matrix calculations, which can easily be run thoroughly in parallel with the help of MKL/CUDA; thus it runs much faster than p-exp. Figures 5, 6, and 7 show the runtime comparison among CUDA-nec (using CUDA to compute NEC), MKL-nec (using MKL to compute NEC), and p-exp when the population size Np is 100, 1000, and 5000, over 100 repeated experiments, with D (the dimension of u_{i,g}) set from 2 to 5000. These figures show that when Np is small, MKL-NEC is faster than the others, but when Np is large and the dimension of the chromosome grows larger and larger, CUDA-NEC executes faster. That is because in big optimization the runtime of data transport between main memory and the GPU is no longer the bottleneck, so we get a higher speedup by using the GPU.

Table I shows more experimental results. Each experiment is run with Cr = 0.2, 0.4, 0.6, 0.8, and 1.0 for 100 times, and the total runtime is recorded. When the dimension of the decision variable (D) and the population size (Np) are relatively small, the new exponential crossover on MKL has higher performance; when D and Np are relatively large, it is better to use CUDA-NEC, as well as CUDA-bin. Moreover, whether or not MKL is used, as D and Np grow, the uniform (binomial) crossover costs more time than the new exponential crossover. That is because in the uniform (binomial) crossover we need to create an Np × D matrix of random numbers, while for the new exponential crossover, with the help of equation 9, we only need to create a random column vector of length Np.
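A NumPy sketch of Algorithm 10's mask construction (our reconstruction; unlike the XOR formula above, it handles the wrap-around past position D − 1 explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)

def nec_mask(Cr, Np, D):
    """Build the NEC 0-1 mask by matrix operations (sketch of Algorithm 10)."""
    j_rand = rng.integers(0, D, size=Np)                  # start position per row
    U = np.clip(rng.random(Np), np.finfo(float).tiny, None)
    if Cr <= 0.0:                                         # special cases of eq. 9
        Ls = np.ones(Np, dtype=int)
    elif Cr >= 1.0:
        Ls = np.full(Np, D)
    else:
        Ls = np.minimum(np.ceil(np.log(U) / np.log(Cr)).astype(int), D)
    seq = np.tile(np.arange(D), (Np, 1))                  # each row: 0 .. D-1
    j = j_rand[:, None]
    j_end = j + Ls[:, None]
    # positions s with j <= s < j + L, wrapping past D back to column 0
    mask = ((j <= seq) ^ (j_end <= seq)) | (seq < j_end - D)
    return mask

mask = nec_mask(0.5, 1000, 16)
```

Each row of the mask is a circular run of consecutive crossed positions, exactly as the sequential loop of Algorithm 9 would produce.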
But if we use CUDA to accelerate with the GPU when the computation scale is large, the runtime gap between CUDA-nec and CUDA-bin is not so significant.

[Fig. 5: Running time comparison among CUDA-nec, MKL-nec, and p-exp when Np = 100.]

[Fig. 6: Running time comparison among CUDA-nec, MKL-nec, and p-exp when Np = 1000.]

TABLE I: Running time comparison of several DE crossovers (time in seconds)

D | Np | exp | nec | bin | MKL-nec | CUDA-nec | MKL-bin | CUDA-bin
10 | 100 | 0.05 | 0.03 | 0.06 | 0.02 | 1.76 | 0.02 | 6.99
10 | 1000 | 0.44 | 0.12 | 0.51 | 0.05 | 1.18 | 0.07 | 0.64
10 | 10000 | 4.34 | 1.17 | 4.99 | 0.31 | 1.34 | 0.53 | 0.94
100 | 100 | 0.11 | 0.08 | 0.23 | 0.03 | 1.16 | 0.05 | 0.66
100 | 1000 | 1.09 | 0.61 | 2.18 | 0.12 | 1.23 | 0.32 | 0.64
100 | 10000 | 10.86 | 6.15 | 22.57 | 0.98 | 1.11 | 4.18 | 1.36
1000 | 100 | 0.75 | 0.57 | 1.9 | 0.13 | 0.66 | 0.32 | 0.56
1000 | 1000 | 7.53 | 5.44 | 19.16 | 0.73 | 1.07 | 3.98 | 1.00
1000 | 10000 | 74.77 | 54.52 | 192.46 | 9.16 | 4.31 | 41.76 | 4.56
10000 | 100 | 7.1 | 5.46 | 18.67 | 0.99 | 1.48 | 4.1 | 0.97
10000 | 1000 | 70.82 | 53.58 | 187.6 | 7.97 | 4.43 | 41.49 | 3.91
10000 | 10000 | 710.4 | 538.05 | 1868.47 | 77.82 | 33.26 | 421.17 | 39.14
IV. PERFORMANCE EVALUATION OF DE BASED ON MKL/CUDA

Our DE programs are implemented in Python. MKL is used through NumPy-MKL, and the GPU environment is set up with CuPy [14] and CUDA 10.0. To overcome the performance deficiency of Python, the key code is implemented in Cython, the C extensions for Python.
[Fig. 7: Running time comparison among CUDA-nec, MKL-nec, and p-exp when Np = 5000.]

There are three main versions of DE in this experiment: non-parallel DE, DE on MKL (MKL-DE), and DE on CUDA (CUDA-DE). In addition, each DE comes in two sub-versions: "DE/rand/1/bin" and "DE/rand/1/L". It is worth noting that the crossover of "DE/rand/1/L" is the parallel version of the new exponential crossover given in Algorithm 10. All programs run on an Intel i5-9600K CPU with 16 GB of memory and a common NVIDIA GeForce GTX 760 GPU.

In this experiment we focus on three well-known benchmark functions, Ackley, Griewank, and Rosenbrock [15]:

• Ackley function (multimodal, nonseparable):
f(x) = −20 exp(−0.2 √((1/D) Σ_{j=1}^{D} x_j²)) − exp((1/D) Σ_{j=1}^{D} cos(2π x_j)) + 20 + e,
x_j ∈ [−32, 32], x* = (0, 0, ..., 0), f(x*) = 0.

• Griewank function (multimodal, nonseparable):
f(x) = (1/4000) Σ_{j=1}^{D} x_j² − Π_{j=1}^{D} cos(x_j / √j) + 1,
x_j ∈ [−600, 600], x* = (0, 0, ..., 0), f(x*) = 0.

• Rosenbrock function (multimodal, nonseparable):
f(x) = Σ_{j=1}^{D−1} [100 (x_j² − x_{j+1})² + (1 − x_j)²],
x_j ∈ [−2.048, 2.048], x* = (1, 1, ..., 1), f(x*) = 0.

The parameter settings of the DEs in the experiment are shown in Table II, where F and Cr follow K. Price and R. Storn's work [15]. The dimension D of the three benchmark functions and the population size Np are set to different values in order to compare the computational performance at different scales. The results are shown in Tables III to V.

TABLE II: Parameter Settings

Function | D | Np | F | Cr | MAX_GEN
Ackley | 1000 | 500 | 0.5 | 0.2 | 30000
Griewank | 500 | 300 | 0.25 | 0.45 | 20000
Rosenbrock | 50 | 100 | 0.65 | 0.95 | 10000
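The three benchmarks above can be written as vectorized NumPy functions so that a whole population is evaluated in one call (a sketch; function names are ours):

```python
import numpy as np

# Each row of x is one individual; each function returns one value per row.
def ackley(x):
    D = x.shape[1]
    s1 = np.sqrt(np.sum(x * x, axis=1) / D)
    s2 = np.sum(np.cos(2.0 * np.pi * x), axis=1) / D
    return -20.0 * np.exp(-0.2 * s1) - np.exp(s2) + 20.0 + np.e

def griewank(x):
    j = np.arange(1, x.shape[1] + 1)
    return (np.sum(x * x, axis=1) / 4000.0
            - np.prod(np.cos(x / np.sqrt(j)), axis=1) + 1.0)

def rosenbrock(x):
    return np.sum(100.0 * (x[:, :-1] ** 2 - x[:, 1:]) ** 2
                  + (1.0 - x[:, :-1]) ** 2, axis=1)
```

Evaluating the population as one matrix keeps even line 10 of Algorithm 2 vectorized for benchmarks like these.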
TABLE III: Comparison of DE variants on 1000-D Ackley with 500 individuals and 30000 generations

Algorithm | best value | best gen | time/s | speedup
DE/rand/1/L | 2.32E-05 | 30000 | 2742.93 (±52) | -
MKL-DE/rand/1/L | … | … | … | 1.18
CUDA-DE/rand/1/L | 2.62E-05 | 29991 | 323.97 (±5) | 8.47
DE/rand/1/bin | 2.15E-04 | 29998 | 3323.53 (±93) | -
MKL-DE/rand/1/bin | 2.84E-04 | 29951 | 2784.57 (±98) | 1.19
CUDA-DE/rand/1/bin | 2.28E-04 | 29983 | 321.45 (±11) | 10.34
TABLE IV: Comparison of DE variants on 500-D Griewank with 300 individuals and 20000 generations

Algorithm | best value | best gen | time/s | speedup
DE/rand/1/L | 8.88E-16 | 18366 | 563.84 (±84) | -
MKL-DE/rand/1/L | 3.71E-16 | 19855 | 475.81 (±45) | 1.19
CUDA-DE/rand/1/L | 1.73E-16 | 19936 | 149.03 (±5) | 3.78
DE/rand/1/bin | 2.22E-16 | 8125 | 732.84 (±95) | -
MKL-DE/rand/1/bin | … | … | … | 1.40
CUDA-DE/rand/1/bin | 2.22E-16 | 6041 | 143.95 (±8) | 5.09
From the results, we can see that DE on MKL/CUDA (see Algorithm 2) is much faster than traditional DE (see Algorithm 1). Between DE on MKL (MKL-DE) and DE on CUDA (CUDA-DE): when the dimension of the variable is relatively small, for example D = 50, MKL-DE is the fastest; when D is relatively large, for example D = 500 or D = 1000, and the population size Np is large, CUDA-DE is the fastest. That is because Algorithm 2 consists mainly of matrix calculations, which can be executed very fast on MKL/CUDA, and when the matrix scale is quite large, CUDA can do the calculation much faster by using the GPU to accelerate.

V. CONCLUSIONS AND FUTURE WORK
This paper first presents a parallel differential evolution framework that is suitable to be built on MKL/CUDA. The procedure of this parallel DE is mainly matrix calculation, so that with the help of MKL/CUDA the DE program can run at high speed. Moreover, there is no need to worry about the parameters of the kernel processes of the parallel calculation, such as the number of CPU cores, GPU cores, and threads; these parameters are set automatically and suitably by MKL/CUDA.

We then analyzed the disadvantage of the exponential crossover of DE, which is inefficient and not suitable for fully parallel calculation on MKL/CUDA. Hence, a new exponential crossover was presented in Algorithm 9, which runs faster. Moreover, a parallel version of the new exponential crossover was proposed in Algorithm 10; it consists entirely of matrix calculations, so it can be executed at high speed with the help of MKL/CUDA. From the experimental results presented in Section 3, we can see that the computational performance of the parallel version of the new exponential crossover catches up with the parallel binomial (uniform) crossover on MKL/CUDA.

The experimental results presented in Section 4 show the higher performance of the parallel DE on MKL/CUDA. Especially when the dimension of the decision variable of the optimization problem is quite large, DE on CUDA has a higher acceleration ratio.
TABLE V: Comparison of DE variants on 50-D Rosenbrock with 100 individuals and 10000 generations

Algorithm | best value | best gen | time/s | speedup
DE/rand/1/L | 2.04E-07 | 9991 | 15.43 (±2) | -
MKL-DE/rand/1/L | 4.28E-07 | 9989 | 9.96 (±1) | 1.55
CUDA-DE/rand/1/L | 1.16E-07 | 9996 | 58.92 (±3) | 0.26
DE/rand/1/bin | 3.71E-10 | 10000 | 19.13 (±2) | -
MKL-DE/rand/1/bin | 2.17E-10 | 9998 | 10.01 (±1) | 1.91
CUDA-DE/rand/1/bin | 9.93E-10 | 10000 | 56.72 (±4) | 0.34
Since differential evolution is a stochastic evolutionary optimization algorithm that is inherently parallel and similar to many other evolutionary algorithms, the whole area of evolutionary computation may also benefit from MKL/CUDA. In the future, we will make further efforts to speed up more evolutionary algorithms.
REFERENCES
[1] K. R. Opara and J. Arabas, "Differential evolution: A survey of theoretical analyses," Swarm and Evolutionary Computation, vol. 44, pp. 546-558, 2019.
[2] N. Pham, A. Malinowski, and T. Bartczak, "Comparative study of derivative free optimization algorithms," IEEE Transactions on Industrial Informatics, vol. 7, no. 4, pp. 592-600, 2011.
[3] R. Storn and K. Price, "Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces," International Computer Science Institute, Tech. Rep. TR-95-012, pp. 1-12, 1995.
[4] K. V. Price, "Differential evolution: A fast and simple numerical optimizer," Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS), pp. 524-527, 1996.
[5] R. Storn, "On the usage of differential evolution for function optimization," Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS), pp. 519-523, 1996.
[6] L. de P. Veronese and R. A. Krohling, "Differential evolution algorithm on the GPU with C-CUDA," IEEE Congress on Evolutionary Computation, pp. 1-7, 2010.
[7] W. Zhu, "Massively parallel differential evolution - pattern search optimization with graphics hardware acceleration: an investigation on bound constrained optimization," Journal of Global Optimization, pp. 417-437, 2011.
[8] W. Zhu and Y. Li, "GPU-accelerated differential evolutionary Markov chain Monte Carlo method for multi-objective optimization over continuous space," Proceedings of the 2nd Workshop on Bio-Inspired Algorithms for Distributed Systems, pp. 1-8, 2010.
[9] O. A. C. Cortes, S. Pais, and N. M. Filipo, "Differential evolution on a GPGPU: The influence of parameters on speedup and the quality of solutions," International Parallel and Distributed Processing Symposium Workshops, pp. 299-306, 2015.
[10] M. A. Meselhi, S. M. Elsayed, D. L. Essam, and R. A. Sarker, "Fast differential evolution for big optimization," International Conference on Software, Knowledge Information, Industrial Management and Applications (SKIMA), 2018.
[11] R. Tanabe and A. Fukunaga, "Reevaluating exponential crossover in differential evolution," Lecture Notes in Computer Science, vol. 8672, pp. 201-210, 2014.
[12] S. Z. Zhao and P. N. Suganthan, "Empirical investigations into the exponential crossover of differential evolutions," Swarm and Evolutionary Computation, vol. 9, pp. 27-36, 2013.
[13] L. Devroye, "Complexity questions in non-uniform random variate generation," 2010.
[14] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, "CuPy: A NumPy-compatible library for NVIDIA GPU calculations," Workshop on ML Systems at NIPS 2017, pp. 1-7, 2017.