Significance Relations for the Benchmarking of Meta-Heuristic Algorithms
Mario Köppen and Kei Ohnishi
Computer Science and Engineering Department, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, 820-8502 Fukuoka, Japan
Email: mkoeppen,[email protected]
Abstract—The experimental analysis of meta-heuristic algorithm performance is usually based on comparing average performance metric values over a set of algorithm instances. As algorithms grow close in performance gains, the additional consideration of the significance of a metric improvement comes into play. However, from this moment the comparison changes from an absolute to a relative mode. Here the implications of this paradigm shift are investigated. Significance relations are formally established. Based on this, a trade-off between increasing cycle-freeness of the relation and small maximum sets can be identified, allowing for the selection of a proper significance level and a resulting ranking of a set of algorithms. The procedure is exemplified on the CEC'05 benchmark of real parameter single objective optimization problems. The significance relation here is based on awarding ranking points for relative performance gains, similar to the Borda count voting method or the Wilcoxon signed rank test. In the particular CEC'05 case, five ranks for algorithm performance can be clearly identified.
Keywords—benchmarking, meta-heuristic algorithms, Borda count, relational optimization
I. INTRODUCTION
Recently there has been growing interest in the experimental analysis of algorithm performance. The establishment of computational paradigms like soft computing and computational intelligence has led to a rapidly increasing number of new algorithm proposals, especially based on computational models of evolution, genetics, or swarm intelligence, but also on modifications of hitherto uniform algorithms to the level of the appearance of new algorithms, or on the combination, fusion and hybridization of existing algorithms into new ones. The common aspects of these algorithms, often called meta-heuristic algorithms, are that their design is essentially problem-independent, that their processing usually includes random factors, and that there are no guarantees or known bounds for their performance, while they possess parameters influencing the likelihood of good results by adjustable effort, like population size or number of generations. Thus, algorithm performance can differ from problem to problem and from algorithm instance to algorithm instance, allowing for a performance competition between all those algorithms. However, the experimental evaluation of algorithm performance faces a number of problems.
Just to name the essential ones: (1) the problem of specifying a subset of test functions that are challenging enough to generate a spectrum of performance values, while avoiding any "needle-in-a-haystack" pure-chance search that would not provide any meaningful insights into the strengths and weaknesses of the studied algorithm; (2) fairness of comparison, usually understood as measuring performance under "equal effort" conditions like the same number of test function evaluations (but commonly not memory usage or CPU time); (3) the means of quantifying the experimental results into performance measures and the way of comparing them (where the No-Free-Lunch theorems [1] state the fundamental non-existence of a distinguishing measure over all possible functions, and [2] states the non-existence of related benchmarks); (4) the question of favorable parameter settings and the related identity of an algorithm, which often allows for a number of structural modifications to be applied while the algorithm is still considered the same, and which of these design choices to tolerate while still maintaining a fair comparison.

A number of approaches have tried to accommodate these problems and proposed sets of benchmark functions and related bounds on effort-related algorithm parameters. Among them, the series of benchmark suites presented at the annual IEEE Congress on Evolutionary Computation (CEC) gained a lot of attention. Starting in 2005, when a general benchmark on evolutionary real parameter single objective optimization was presented as an open contest (as it was again in 2010 and 2013), a number of benchmarks on various specific aspects followed: for example, on evolutionary constrained real parameter single objective optimization problems in 2006, on large-scale single objective global optimization with bound constraints in 2008 and 2010, and on niching in 2010.
In the meantime, the CEC'05 benchmark has become a de facto standard for the evaluation of new algorithms. Subsequent publications like [3] showed that despite the well-thought-out and modern design of these benchmarks, the problem of a proper evaluation of the results remained an issue. This is also related to the growing acceptance of statistical methods in the evaluation of experimental algorithm performance, assuming that each algorithm unveils a statistical distribution of its performance values. The general consideration here is about the significance of a performance improvement. While it has become common to repeat experiments 10 or 30 times and consider the numerical average of the closest approaches to known extrema of a specific benchmark function as a suitable quantity, it has also become clear that an algorithm showing performance p + ε is not automatically better than an algorithm with performance p: it has to be significantly better as well. The meaning of significance then is usually related to a statistical confidence value that the average performances stemmed from different distributions (more precisely: to reject the null hypothesis H0 that for two algorithms both performance outcomes follow the same distribution). However, the notion of significance also introduces another relevant aspect into the question of the "best algorithm" that has not been addressed so far: the loss of an absolute comparison.

To illustrate the meaning of this, compare the situation with typical ways of performing sport contests, where we can easily identify two main lines. For some sports, we are considering absolute measures of success. In the 100-meter dash, the performance is just the time needed to pass the 100 meter distance. This is an absolute value. The record performance can be saved and is available at any later time to decide on the setting of a new record performance. Currently, it is 9.58s, which is a result from 2009.
In comparison, team sports like soccer, baseball, tennis or handball do not have such an absolute value of performance and are therefore performed in tournaments. This means that from one match no quantity is derived that allows judging the performance of the next match (we are not considering global achievements that go into the "hall of fame" here). The performance evaluation is relative. This has a number of implications: for example, we can tolerate intransitivity in a tournament, which would be nonsense in a race. When team A wins against team B, and team B afterwards against team C, there is no reason to assume that in a later match team A is guaranteed to win against team C. However, there are various attempts to assign ratings to teams that allow for the approximate computation (or prediction) of ranks and match outcomes, as for example Massey's approach of assigning ratings such that their differences are as close as possible to known match outcomes [4], or Keener's approach of rating inspired by the famous page rank procedure as it is used by the Google(TM) search engine [5].

Algorithm performance shares aspects with both kinds of sport contest evaluations: an algorithm does not need the parallel processing of another algorithm to assess its performance on some benchmark function (otherwise the study of algorithm performance would enter the realms of game theory). But as soon as we introduce the aspect of significance, there is no absolute single-value measure available anymore, and the justification of algorithm betterness becomes a pairwise exercise, thus a relative one.

Here we want to study a way to account for these aspects based on a purely (set-)relational framework. We contribute a formal definition of a performance comparison with significance and identify a subset of relations where an optimal choice of the significance level is feasible, based on the relation between the minimal number of cycles appearing in the comparison and the smallest number of best algorithms (Section II).
In Section III we apply this framework to some results presented for the CEC'05 benchmark in order to demonstrate the design of feasible relations for comparison, followed by a discussion in Section IV.

II. MAXIMALITY OF SIGNIFICANCE RELATIONS
Usually in optimization we study a real-valued function y = f(x) where the quest is for a value x that maximizes (in case f is seen as a quality or fitness function) or minimizes (in case f is seen as a cost function) the function value y. In a number of circumstances, especially related to modern applications of computer science, this approach (then called global optimization) appears not fully adequate to reflect aspects of optimality like fairness, resource limitations, user preferences or simply multiple conflicting objectives. Thus, relational optimization attempts to generalize this concept of optimality by studying the maximality of relations.

A (set-theoretic) binary relation between elements of a domain A is a subset of A × A, i.e. a set of ordered pairs of elements of A that are considered to be in that relation with each other. This is also the way to interpret the above optimality task: if x1 and x2 are from the domain of f, we have f(x1) > f(x2), and we are looking for maximality, then the relation Ropt is just the real-valued larger-than relation between function values, and the pair (x1, x2) ∈ Ropt.

The essential point, following Suzumura's theory of social choice [6], is that a concept of maximality can be considered for any relation, no matter what its domain and what the specific way of specifying related pairs. To do this, we first consider the asymmetric part P(R) of a relation: it is the part of the relation where the order in the pair matters, i.e. for a relation R the pair (x, y) ∈ R but not (y, x) ∈ R. The elements of R that are not in P(R) then establish the symmetric part I(R), where both (x, y) and (y, x) belong to R. Thus, we have R = P(R) ∪ I(R) and P(R) ∩ I(R) = ∅, and a convenient test for each pair in R whether it belongs to the asymmetric or the symmetric part.

Now, for the asymmetric part we consider all elements of the domain that never appear in the second position: this is the maximum set of R.
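As a minimal sketch of these set-theoretic notions (with a made-up three-element domain; none of these names come from the benchmark), the asymmetric part, the symmetric part, and the maximum set can be computed directly from a relation given as a set of ordered pairs:

```python
def asymmetric_part(R):
    """P(R): pairs (x, y) in R whose reversal (y, x) is not in R."""
    return {(x, y) for (x, y) in R if (y, x) not in R}

def symmetric_part(R):
    """I(R): pairs (x, y) in R whose reversal (y, x) is also in R."""
    return {(x, y) for (x, y) in R if (y, x) in R}

def maximum_set(domain, R):
    """Elements that never appear in the second position of the asymmetric part."""
    dominated = {y for (_, y) in asymmetric_part(R)}
    return set(domain) - dominated

# Toy example: a beats b and c, while b and c are tied (a symmetric pair).
domain = {"a", "b", "c"}
R = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")}
print(maximum_set(domain, R))  # → {'a'}
```

Note that R = P(R) ∪ I(R) and P(R) ∩ I(R) = ∅ hold by construction, matching the decomposition above.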
Table I. THE AVERAGE PERFORMANCES OVER REPEATED RUNS ON THE TEST PROBLEMS OF THE CEC'05 BENCHMARK IN DIMENSION 10 FOR THE 11 TEST ALGORITHMS. THE VALUES HERE ARE THE SAME AS THE VALUES USED IN [3], WITH SOME ADJUSTMENT OF NUMERICAL SCALE FOR BETTER READABILITY.

      BLX-GL50  BLX-MA    CoEVO     DE        DMS-L-PSO EDA       G-CMA-ES  K-PCX     L-CMA-ES  L-SaDE    SPC-PNX
f1    1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09
f2    1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09  1.00E-09
f3    5.71E+02  4.77E+04  1.00E-09  1.94E-06  1.00E-09  2.12E+01  1.00E-09  4.15E-01  1.00E-09  1.67E-02  1.08E+05
f4    1.00E-09  2.00E-08  1.00E-09  1.00E-09  1.89E-03  1.00E-09  1.00E-09  7.94E-07  1.76E+06  1.42E-05  1.00E-09
f5    1.00E-09  2.12E-02  2.133     1.00E-09  1.14E-06  1.00E-09  1.00E-09  4.85E+01  1.00E-09  0.012     1.00E-09
f6    1.00E-09  1.49      1.25E+01  1.59E-01  6.89E-08  4.18E-02  1.00E-09  4.78E-01  1.00E-09  1.20E-08  1.89E+01
f7    1.17E-02  1.97E-01  3.71E-02  1.46E-01  4.52E-02  4.21E-01  1.00E-09  2.31E-01  1.00E-09  0.02      8.26E-02
f8    20.35     20.19     20.27     20.4      20        20.34     20        20        20        20        20.99
f9    1.154     0.4379    19.19     0.955     1.00E-09  5.418     0.239     0.119     44.9      1.00E-09  4.02
f10   4.975     5.643     26.77     12.5      3.622     5.289     7.96E-02  0.239     40.8      4.969     7.304
f11   2.334     4.557     9.029     0.847     4.623     3.944     0.934     6.65      3.65      4.891     1.91
f12   406.9     74.3      604.6     31.7      2.4001    442.3     29.3      149       209       4.50E-07  259.5
f13   0.7498    0.7736    1.137     0.977     0.3689    1.841     0.696     0.653     0.494     0.22      0.8379
f14   2.172     2.03      3.706     3.45      2.36      2.63      3.01      2.35      4.01      2.915     3.046
f15   400       269.6     293.8     259       4.854     365       228       510       211       32        253.8
f16   93.49     101.6     177.2     113       94.76     143.9     91.3      95.9      105       101.2     109.6
f17   109       127       211.8     115       110.1     156.8     123       97.3      549       114.1     119
f18   420       803.3     901.4     400       760.7     483.2     332       752       497       719.4     439.6
f19   449       762.8     844.5     420       714.3     564.4     326       751       516       704.9     380
f20   446       800       862.9     460       822       651.9     300       813       442       713       440
f21   689.3     721.8     634.9     492       536       484       500       1050      404       464       680.1
f22   758.6     670.9     778.9     718       692.4     770.9     729       659       740       734.9     749.3
f23   638.9     926.7     834.6     572       730.3     640.5     559       1060      791       664.1     575.9
f24   200       224       313.8     200       224       200       200       406       865       200       200
f25   403.6     395.7     257.3     923       365.7     373       374       406       442       375.9     406

If we read (x, y) ∈ R as "x is better than / dominates / is preferred to y", then for any element of the domain that never appears in the role of y it means that there is no better, or preferred, element, or that it is non-dominated. We see that this definition only uses the set-theoretic specification of a relation and nothing else, and it includes the above example of function optimality as a special case. It has to be distinguished from the greatestness of a domain element: for a greatest element x from the best set of a relation, for each y ∈ A (including x itself) we have (x, y) ∈ R; that is, using the above readings, x is better than any other element,
preferred to any other element, or dominating any element of A. There are some relations between the maximum set and the best set (for example, the best set is a subset of the maximum set). In the case of global optimality both basically coincide, but not in the general case. However, the specific way of defining verifiable relations in optimization problems gives preference to the concept of maximality, at the price that there is usually more than one maximal element, compared to usually empty best sets.

Now we consider maximum sets of relations within the scope of experimental algorithm analysis, taking significance into account.

Definition 1. A significance relation is a family R_σ of relations parameterized by the (real-valued, non-negative) significance level σ such that:
• for σ2 > σ1, R_σ2 ⊆ R_σ1, and
• for σ → ∞, R_σ → ∅.
The relation R_0 is called the base relation.

These two requirements reflect the common ideas behind significance. Take as an example the comparison of two algorithms by their average performances on a number of benchmark functions: in order to consider algorithm A really better than algorithm B, we set a threshold s such that algorithm A is only considered to be better than B when having a larger average than B by margin s. If we increase s, the number of cases where A is better than B will decrease: this is the first requirement. If s becomes large enough, at some point no algorithm will be better than any other by such a large margin, and the relation becomes the empty set. Last but not least, in the case s = 0 the relation becomes the standard real-valued larger-than relation, which can be seen as the base relation for comparison.

Now that there is some evolution of relations over the span of significance values, the question is what happens to their maximum sets when σ is increasing. Two comments are needed:

(1) The definition of maximum sets applies to any relation. However, maximum sets can be empty.
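A brief sketch of Definition 1, using the threshold-on-averages example just described (all numbers are made up for illustration): the relation shrinks monotonically as the margin s grows and eventually becomes empty. The last lines also show how a cyclic relation leaves every element dominated, i.e. an empty maximum set:

```python
def significance_relation(avg, s):
    """R_s: (A, B) is in the relation if avg[A] exceeds avg[B] by more than s.
    For s = 0 this is the base relation, the plain larger-than comparison."""
    return {(a, b) for a in avg for b in avg
            if a != b and avg[a] > avg[b] + s}

avg = {"A": 5.0, "B": 4.2, "C": 1.0}      # hypothetical average performances
R0 = significance_relation(avg, 0.0)      # base relation: three ordered pairs
R1 = significance_relation(avg, 1.0)      # (A, B) drops out: its margin is only 0.8
R9 = significance_relation(avg, 9.0)      # empty: no pair differs by that much
assert R9 == set() and R1 < R0            # monotone shrinkage, eventual emptiness

# A cyclic (asymmetric) relation leaves no maximal element:
cycle = {("a", "b"), ("b", "c"), ("c", "a")}
dominated = {y for (_, y) in cycle}       # here every element is dominated
print(sorted({"a", "b", "c"} - dominated))  # → []
```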
A sufficient condition for the existence of non-empty maximum sets over finite and non-empty domains is cycle-freeness of the relation. Cycle-freeness means that there is no sequence of one or more elements xi (i = 1, ..., k) where (x1, x2) ∈ P(R), (x2, x3) ∈ P(R), ..., (x(k-1), xk) ∈ P(R) and also (xk, x1) ∈ R. Note that the naming "cycle" refers to the alternative representation of a relation as a directed graph, where cycles correspond to strongly connected components. The definition of P(R) clarifies the issue for k = 1 and k = 2, but in other cases it can become a complicated task to decide whether there are cycles or not. Even if there are cycles, this does not automatically imply empty maximum sets in specific cases. However, the empty set, or empty relation, does not contain cycles for sure, and therefore, by a continuity argument, for any significance relation there is some σ such that for this value and all larger ones the relation will be cycle-free and the existence of maximum elements is guaranteed.

(2) On the other hand, for the empty relation each element of the domain is maximal. Under the additional constraint that R_σ is an asymmetric relation for each σ, the size of the maximum sets increases monotonically with increasing σ.

Taking both arguments together, we can identify a significance level where the relation becomes cycle-free (or one where the maximum set is non-empty) while the size of the maximum set is as small as possible. From this we can identify the "best algorithms" by that maximum set. At the end, we have to consider how to design an appropriate relation. This will be demonstrated in the next section, where we apply the theoretical foundation given in this section to the CEC'05 benchmark.

III. APPLICATION TO CEC'05 BENCHMARK
The CEC'05 benchmark is composed of 25 functions that fall into three categories: functions 1 to 5 are unimodal functions, functions 6 to 14 are multi-modal functions, and the remaining functions 15 to 25 are so-called hybrid composition functions. For details see the corresponding technical report [7]. For the contest, 11 algorithms were selected. Achieved performance values for the dimension 10 versions of all 25 problems can be seen in Table I (these are the same values that were studied in [3]). A quick glance at this table already reveals a number of issues which make clear that the evaluation of the outcome of the experiment is not straightforward. At first, the performances differ largely in order of magnitude. Values of 1.00E-09 represent cases where the problem was "fully solved", i.e. the algorithm was stopped at this point. Such values are predominant in the first category of unimodal functions. For functions of the second category we find varying performance, while the third category provides the most challenging functions, and the algorithms seem to yield comparably bad performance values.

When considering a base relation for the specification of a significance relation, this makes clear that the average of these values per algorithm is not suitable, due to the different scales of the performance values for the particular functions; the same holds for any reference to Euclidean distances between points with performance values as components. In fact, what we need is:
• Horizontal scale-freeness: each problem function provides its own scale for the typical range of performance values. This can be achieved by considering relative gains of performance instead of taking reference to absolute values.
• Vertical scale-freeness: the evaluation also needs to take the differing performance scales of different problems into account.
This can be achieved by awarding rank points, as is done in the Borda count method for voting (see [8] for a gentle introduction to this topic), or in the Wilcoxon signed rank test for statistical significance.

Based on these arguments, the proposed significance relation for a given significance level σ, applied to two vectors x and y, both of dimension n and with positive components, is as follows:
1) Compute the ranking vector r where r_i = max[x_i/y_i, y_i/x_i] (i = 1, ..., n), as well as the signature vector s where s_i is +1 if x_i > y_i, -1 if x_i < y_i, and 0 if x_i = y_i.
2) Sort the components of r in non-decreasing order to yield the vector r̃.
3) The award points vector a is the vector (1, 2, ..., n) permuted in the same way as the permutation leading from r to r̃.
4) Compute the scalar product D = a · s̃ (where s̃ is the version of s sorted by the same permutation). In case of ties, collect the corresponding award points and share them equally between x and y.
5) If D ≥ σ, then x is in significance relation to y; otherwise not.

In summary, the ratios between corresponding components of x and y (the larger one divided by the smaller one) are sorted by size, and signed award points are given to x or y: 1 for the smallest ratio, 2 for the second smallest, etc. If x had the larger component, the sign is +1, and -1 otherwise. For example, consider the vectors x = (2, 6, 9) and y = (4, 3, 3) and significance σ = 2. Then r = (4/2, 6/3, 9/3) = (2, 2, 3), which is sorted as r̃ = (2, 2, 3), and the corresponding award vector is a = (1, 2, 3) ((2, 1, 3) is also possible but would not affect the result). The sign vector registers which vector had the larger component: s = (-1, +1, +1), s̃ = (-1, +1, +1). Since the two first components of r̃ are equal, their total award points 1 + 2 are shared equally as 1.5 each, and x receives 1.5 (for its second component) and 3 (for the third component), while y receives 1.5 for the first component. The total is D = -1.5 + 1.5 + 3 = 3. Since 3 is larger than the significance level, x is in relation to y (or: x is better than y with significance at least 2).

We apply this relation to the CEC'05 benchmark function results for dimension 10 on the 11 test algorithms. This means that we select various increasing values of σ until the relation becomes cycle-free, and then look for the maximal (or: non-dominated) elements. Figure 1 shows the result for σ = 10.

Figure 1. Graph of the significance relation for CEC'05 benchmark functions and test algorithms and significance level 10.
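The five-step comparison above can be sketched as follows (a Python sketch under our reading of the steps; the test vectors are illustrative, chosen so that two ratios tie and the point-sharing rule of step 4 is exercised):

```python
def significance_score(x, y):
    """D = a · s̃ for the rank-point comparison of two positive vectors."""
    n = len(x)
    r = [max(xi / yi, yi / xi) for xi, yi in zip(x, y)]   # step 1: ratios
    s = [(xi > yi) - (xi < yi) for xi, yi in zip(x, y)]   # step 1: signs
    order = sorted(range(n), key=lambda i: r[i])          # step 2: sort ratios
    a = [0.0] * n                                         # steps 3/4: award points
    i = 0
    while i < n:
        j = i
        while j < n and r[order[j]] == r[order[i]]:       # group tied ratios
            j += 1
        shared = sum(range(i + 1, j + 1)) / (j - i)       # share points equally
        for k in range(i, j):
            a[order[k]] = shared
        i = j
    return sum(ai * si for ai, si in zip(a, s))           # step 4: D = a · s̃

def in_relation(x, y, sigma):
    """Step 5: x is in significance relation to y iff D >= sigma."""
    return significance_score(x, y) >= sigma

# Ratios (2, 2, 3), signs (-1, +1, +1); points 1 and 2 are shared as 1.5 each,
# so D = -1.5 + 1.5 + 3 = 3, which reaches significance level 2.
x, y = (2, 6, 9), (4, 3, 3)
print(significance_score(x, y), in_relation(x, y, 2))  # → 3.0 True
```

Note that the score is antisymmetric: swapping x and y flips every sign, so the reversed comparison yields -D.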
Figure 2. Graph of the significance relation for CEC’05 benchmark functions and test algorithms and significance level 60.
It is shown as a directed graph where a directed edge (or arc) indicates that the relation holds between the corresponding algorithms (the algorithms are given as numbers in the order of the header row of Table I). A number of things can be seen:
• Algorithm 7 is a clear winner, as it is in significance relation to all other algorithms.
• Algorithm 3 is a clear loser, since it is dominated by all other algorithms.
• There is one cycle in the graph, linking algorithms 1, 10, 5, 4 and back to 1.
• Algorithms 2, 6, 8, 9 and 11 appear to be on a lower rank than this cycle, but of higher rank than algorithm 3.

If we increase σ, at around 15 the relation becomes cycle-free and the above four ranks consolidate (not shown here for space reasons). For σ = 60 (Fig. 2) we can distinguish five ranks (7 as top performer as before, next rank 5, 1 and 10, next rank 4, 9, 11, then 2, 6 and 8, and last as well as least 3). Note that algorithm 4 was part of the cycle before; now it is ranked below all other members of the cycle. For σ = 100 the result will shrink to 4 ranks, and at some point all but algorithms 7 and 3 will be on the same second rank.

IV. DISCUSSION
The analysis of the CEC'05 example shows that the procedure is feasible and allows for a ranking of a number of algorithms. Here are some related thoughts and considerations:

(1) The procedure possibly assigns the same rank to several algorithms. In the worst case, all algorithms, or a large part of them, can be on the same rank; that is, when for the smallest significance level σ where the relation becomes cycle-free suddenly many elements are not dominated by any other element. While this is possible in theory, in such a case the design of the benchmark might be questioned, since it also indicates a lack of distinctiveness of the selected benchmark functions (for example, when all algorithms can solve all problems). As the result for the CEC'05 benchmark shows, there can even be a single best algorithm.

(2) The computational effort of the evaluation is rather small, as long as the number of algorithms is bounded. It needs pairwise comparisons and a linear comparison procedure. However, even polynomial complexity can become hard when the number of algorithms is of the order of 1 million or so, since then there would be on the order of 10^12 comparisons, an effort far beyond the available computational power these days. In this case, meta-heuristic algorithms can be designed to approximate maximum sets of general relations [9].

(3) The selection of σ depends on the specific choice of algorithms. This means that if the set of algorithms changes, the selection procedure for σ can also give a different value. So questions like this may come up: algorithm 7 was the winner of the CEC'05 benchmark; what to do to make an even better algorithm? The answer depends much on the benchmark functions themselves, but a direct answer is: either be better than algorithm 7 for all functions, or be better for functions where algorithm 7 (compared e.g. to next-rank algorithms) is weak.

V. SUMMARY
A method for the evaluation of benchmark results for function optimization has been presented. It is based on the general concept of significance relations: two algorithms are in such a relation if one performs better than the other by a gain that is numerically represented as a significance level. When the significance level is increased, there must be a level where the relation becomes cycle-free and provides a full ranking of all algorithms (thus also providing a partial order of the algorithms). The procedure differs from the classical "compare the average performances" in the sense that it accounts for significance while switching to a relational mode (instead of an absolute mode) of comparison. The feasibility of the approach has been shown by using test results for the CEC'05 benchmark.

REFERENCES

[1] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997.
[2] M. Köppen, "No-free-lunch theorems and the diversity of algorithms," in Congress on Evolutionary Computation, 2004 (CEC2004), vol. 1. IEEE, 2004, pp. 235–241.
[3] S. García, D. Molina, M. Lozano, and F. Herrera, "A study on the use of non-parametric tests for analyzing the evolutionary algorithms' behaviour: a case study on the CEC'2005 special session on real parameter optimization," Journal of Heuristics, vol. 15, no. 6, pp. 617–644, 2009.
[4] K. Massey, Statistical Models Applied to the Rating of Sports Teams (Bachelor Thesis). Bluefield College, 1997.
[5] J. P. Keener, "The Perron-Frobenius theorem and the ranking of football teams," SIAM Review, vol. 35, no. 1, pp. 80–93, 1993.
[6] K. Suzumura, Rational Choice, Collective Decisions, and Social Welfare. Cambridge University Press, 2009.
[7] P. N. Suganthan, N. Hansen, J. J. Liang, K. Deb, Y. Chen, A. Auger, and S. Tiwari, "Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization," KanGAL Report, vol. 2005005, 2005.
[8] C. Börgers, Mathematics of Social Choice: Voting, Compensation, and Division. SIAM, 2009.
[9] M. Köppen, K. Yoshida, and K. Ohnishi, "Meta-heuristic optimization reloaded," in