Reproducibility in Evolutionary Computation
MANUEL LÓPEZ-IBÁÑEZ,
University of Málaga, Spain
JUERGEN BRANKE,
University of Warwick, UK
LUÍS PAQUETE,
University of Coimbra, CISUC, Department of Informatics Engineering, Portugal
Experimental studies are prevalent in Evolutionary Computation (EC), and concerns about the reproducibility and replicability of such studies have increased in recent times, reflecting similar concerns in other scientific fields. In this article, we suggest a classification of different types of reproducibility that refines the badge system of the Association for Computing Machinery (ACM) adopted by TELO. We discuss, within the context of EC, the different types of reproducibility as well as the concepts of artifact and measurement, which are crucial for claiming reproducibility. We identify cultural and technical obstacles to reproducibility in the EC field. Finally, we provide guidelines and suggest tools that may help to overcome some of these reproducibility obstacles.

Additional Key Words and Phrases: Evolutionary Computation, Reproducibility, Empirical study, Benchmarking
As in many other fields of Computer Science, most of the published research in Evolutionary Computation (EC) relies on experiments to justify its conclusions. The ability to reach similar conclusions by repeating an experiment performed by other researchers is the only way a research community can reach a consensus on a given hypothesis. From an engineering perspective, the assumption that experimental findings hold under similar conditions is essential for making sound decisions and predicting their outcomes when tackling a real-world problem.

The “reproducibility crisis” refers to the realisation that many experimental findings described in peer-reviewed scientific publications cannot be reproduced, either because they lack enough details to repeat the experiment or because repeating the experiment leads to different conclusions. Despite its strong mathematical basis, Computer Science (CS) also shows signs of suffering such a crisis [Cockburn et al. 2020; Fonseca Cacho and Taghva 2020; Gundersen et al. 2018]. EC is by no means an exception. In fact, as we will discuss later, particular challenges of reproducibility in EC arise from the stochastic nature of the algorithms.

Interest in reproducibility has only become explicit in EC relatively recently. The goal of this paper is to discuss reproducibility in the context of EC (and heuristic optimisation in general). We review the abundant research on reproducibility from other fields and adapt it, when pertinent, to the EC context. In Section 2, we explain what reproducibility means in the context of EC and argue that reproducibility is as relevant in EC as in any other sub-field of CS, both from a scientific and an engineering perspective. In Section 3, we discuss, in the context of EC, two key concepts that arise when discussing reproducibility, the notions of artifact and measurement.
We also review the terminology adopted by ACM and others to formally distinguish between different levels of reproducibility, and propose a refinement that classifies reproducibility studies in EC according to the factors that are varied in the study with respect to the original work. We discuss in Section 4 some of the cultural and technical obstacles to ensuring reproducibility in EC. In Section 5, we suggest guidelines and tools that may help overcome some of those obstacles. Finally, we conclude in Section 6 with an overall discussion of the state of reproducibility in EC and point out future directions to further understand and improve reproducibility.
Authors’ addresses: Manuel López-Ibáñez, [email protected], University of Málaga, Bulevar Louis Pasteur, 35, Málaga, Spain, 29071; Juergen Branke, [email protected], University of Warwick, Gibbet Hill Road, Coventry, UK, CV4 7AL; Luís Paquete, [email protected], University of Coimbra, CISUC, Department of Informatics Engineering, Polo II, Pinhal de Marrocos, Coimbra, Portugal, 3020-290.
Evolutionary Computation (EC), in much the same way as Computer Science [Wegner 1976], can be seen as a three-fold discipline: it is mathematical since it is concerned with the formal properties of abstract structures; it is also scientific since it is concerned with the empirical study of a particular class of phenomena; and it is engineering since it is concerned with the effective design of tools that have social and commercial impact in the real world. Despite major advances in the Theory of EC, the widespread belief is that the dynamics of practical EC algorithms applied to non-trivial problems are too complex to be analysed using only mathematical arguments. Thus, in the following, we discuss EC from the scientific and engineering perspectives, whose body of knowledge is mostly based on empirical findings.

In empirical sciences, the body of knowledge is built by following the principles of the Scientific Method:

(1) Observe a phenomenon, e.g., EAX crossover appears to have the capacity for local optimisation in the travelling salesman problem (TSP) [Nagata and Kobayashi 1997].
(2) Construct a hypothesis, e.g., an evolutionary algorithm (EA) produces better solutions for the TSP when using EAX than when using the other known crossovers for the TSP.
(3) Conduct an experiment, e.g., measuring the performance of an EA using EAX and alternative crossovers on a number of TSP instances.
(4) Analyse and draw a conclusion on whether the experiment supports the hypothesis and, hence, it is provisionally accepted, or not, hence, it is falsified.

A cornerstone of the Scientific Method is the notion of falsifiability, i.e., a scientific hypothesis must be testable empirically and, possibly, falsifiable. For example, the statement “
There are problems for which Evolutionary Algorithms are the best optimisation methods possible” is not falsifiable (by evidence), not only because of the vagueness of terms such as “best” and “Evolutionary Algorithm”, but also because we may never know all possible optimisation methods nor all possible problems. However, the statement “in crossover operators for the traveling salesman problem, the trade-off between the ratio of edges inherited by offsprings from parents and the variety of offsprings is important for generating large number of improved offsprings” [Nagata and Kobayashi 1999] is falsifiable by evidence.

On the other hand, research in EC very often takes an engineering perspective, e.g., when comparing different EC algorithms to solve a particular problem. As in other engineering disciplines, a researcher has to go through the following steps:

(1) Specify requirements, e.g., an EA that outperforms the LKH algorithm in finding solutions less than from the optimum on instances of the TSP with up to 200,000 cities [Nagata and Kobayashi 2013].
(2) Design a solution, e.g., a particular type of EA that uses a novel crossover.
(3) Conduct an experiment, which will often involve implementing a reasonably efficient prototype, careful parameter tuning and benchmarking the prototype against a competitor.
(4) Analyse and draw a conclusion on whether the benchmarking results provide evidence that the solution meets the requirements.

There are clear parallels between the scientific and engineering perspectives. Moreover, engineering requirements can also be recast as scientific hypotheses.
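The experimental step (3) above can be sketched in code. The sketch below is a minimal, hypothetical illustration (the two "algorithms", the instance sizes and the cost model are invented stand-ins, not the EAs from the cited work): it benchmarks two stochastic algorithms over repeated runs on fixed benchmark instances and reports summary statistics, keeping the fixed and random factors of the experiment explicit.

```python
import random
import statistics

def algorithm_a(instance_size, rng):
    """Stand-in for the proposed EA: best tour cost found in one run (hypothetical)."""
    return instance_size * (1.00 + abs(rng.gauss(0.0, 0.02)))

def algorithm_b(instance_size, rng):
    """Stand-in for the competitor: slightly worse on average (hypothetical)."""
    return instance_size * (1.02 + abs(rng.gauss(0.0, 0.02)))

instances = [1000, 5000, 10000]  # fixed factor: the chosen benchmark instances
n_runs = 30                      # random factor: repeated runs with fresh seeds

summary = {}
for inst in instances:
    # Same seed per run for both algorithms: a paired comparison with
    # common random numbers.
    costs_a = [algorithm_a(inst, random.Random(run)) for run in range(n_runs)]
    costs_b = [algorithm_b(inst, random.Random(run)) for run in range(n_runs)]
    summary[inst] = (statistics.mean(costs_a), statistics.mean(costs_b))
    print(f"n={inst}: mean cost A={summary[inst][0]:.1f}, B={summary[inst][1]:.1f}")
```

In an actual study, the summary would also include measures of dispersion and an appropriate statistical test, and the seeds used would be recorded and published as part of the artifacts.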
A major difference between the scientific and engineering perspectives is that the latter is mostly concerned with demonstrating performance differences between realistic algorithmic implementations on practical problems under the requirements specification, whereas the former is concerned with confirming hypotheses on abstract models of the real world that may lead to general principles.

From both scientific and engineering perspectives, experiments that are reproducible and falsifiable by others are a prerequisite to reaching a consensus in the research community and building a body of knowledge about working principles of EC. Such “laws of qualitative structure” [Newell and Simon 1976] are qualitative hypotheses that are accepted by the research community until sufficient empirical evidence arises to falsify them, e.g., the generally accepted hypothesis that the search space of the travelling salesman problem (TSP) has a “big valley” structure [Boese et al. 1994].

Scientific progress is a collaborative effort. Most new research results build on previous research results. The first step to improving an algorithm is to reproduce the previous results. In this sense, reproducibility facilitates (or even is a pre-requisite of) scientific progress. If reproducing previous results is easy because the research has been published in a reproducible format, it saves researchers a lot of time, allowing the community to quickly absorb new results, speeding up scientific progress as well as the transfer of new ideas into practical applications.
In a provocative paper entitled “Why most published research findings are false”, Ioannidis [2005] suggests that much of published research results cannot be trusted. A recent survey in Nature [Baker 2016] revealed that more than 70% of researchers have previously failed in an attempt to reproduce another researcher’s experiments, and over 50% have even failed to reproduce one of their own previous results. This is generally not due to researchers deliberately falsifying their results, but more often a result of the publication culture, researcher ignorance, or confirmation bias.

A well-documented bias against publishing negative results [Fanelli 2012] together with a culture that rewards scientists chiefly on quantity of publications incentivises careless research [Grimes et al. 2018]. Researchers often lack sufficient expertise in statistics and unknowingly use improper statistical tests, insufficient sample sizes [Campelo and Wanner 2020], or manipulations of the experimental conditions that alter (unintentionally or deliberately) the statistical significance of results, e.g., p-hacking [Cockburn et al. 2020; Simmons et al. 2011] and hypothesising after results are known (HARKing) [Kerr 1998].

In particular, since research in EC is often framed as competitive testing [Hooker 1996], there is also a bias in the effort spent by researchers in verifying their experimental setup and code. If a researcher hypothesises that a new mutation operator should work well on a particular problem, and experiments show poor performance, they are likely to carefully check their code and experimental setup to make sure that this unexpectedly poor performance is not due to an error. On the other hand, if results are very positive, they are less likely to suspect a problem and thus spend much less time verifying their code and experimental setup.
As a result, errors that lead to poor performance are usually corrected, whereas errors that lead to apparent but false good performance are often not detected and are published. A similar bias, but in the opposite direction, may be true for competing algorithms. There, an error that leads to poor performance of an existing algorithm relative to the author’s newly proposed algorithm risks not being detected because it supports the author’s presumption that their new algorithm is better. Even if the code is available but the bug only shows up on new problem instances or experimental conditions, there is little incentive to investigate the reasons behind the poor performance of a benchmark algorithm. Brockhoff [2015] reports the illustrative case of a bug in the implementation of an algorithm affecting published results and how the bug has propagated to many other software packages due to the lack of independent implementations, potentially affecting the results of hundreds of published papers.

However, even though there is evidence that a significant proportion of published research results are wrong, and many researchers probably have experienced challenges in reproducing published results [Sörensen et al. 2017], the number of published corrections is negligible. A search on Scopus reveals that out of 2484 papers published in the journals IEEE Transactions on Evolutionary Computation, Evolutionary Computation, and
Swarm Intelligence and Evolutionary Computation, only 8 were Errata. As a consequence, a lot of effort is potentially wasted by many research groups who independently attempt to reproduce results and fail, before the rumor somehow spreads and people accept that certain results are not reproducible. A proper research culture where reproducibility is regularly attempted and also negative results are published could significantly speed up scientific progress.
Informally, the terms reproducibility and replicability are often used to describe various concepts related to being able to confirm or falsify a hypothesis by repeating an experiment. More formally, those terms often denote various degrees of reproducibility and, unfortunately, not always consistently, since different communities use different terminologies. For a historical perspective on terminology, see Plesser [2018]. In our paper, we use the term “reproducibility” when addressing the general topic. When discussing specific degrees of reproducibility, we mostly follow the terminology used by ACM [ACM 2020] with a further refinement presented later in this section. The ACM terminology relies on the concepts of “artifact” and “measurement”:
Artifact. “A digital object that was either created by the authors to be used as part of the study or generated by the experiment itself” [ACM 2020]. Examples of artifacts in the context of EC would be complete implementations of algorithms, either in source code or executable form; data or code required to fully specify problem instances or benchmark functions, e.g., files containing distance matrices for the traveling salesman problem, a software library of continuous benchmark functions or a simulation software needed for evaluating solutions; raw data measured during the experiment and used for validating the hypothesis, e.g., measurements of solution quality, counts of objective function evaluations, iterations, steps, or computation times; and any scripts required to process the raw measurements and calculate the statistics or visualisations that justify the conclusions of the experiment.
Measurement.
The term “measurement” is used in analogy to physical experiments. For computer science, a measurement is the data that results from an experiment. For EC, it is common to report summary statistics such as means and standard errors of results or runtimes. As discussed by McGeoch [2012], the measurements taken should be appropriate to the level of abstraction being studied. For example, computational effort may be measured as cycles, seconds, function evaluations or iteration counters.

Based on the above concepts of artifact and measurement, the ACM defines the following terms [ACM 2020]:
Repeatability. “The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials.” (Same team, same experimental setup.)

[Footnote: Searched November 2020.] [Footnote: On August 24th 2020, ACM swapped the definitions of “Reproducibility” and “Replicability” to match the terminology proposed by Claerbout and Karrenbach [1992].]
Reproducibility. “The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials.” (Different team, same experimental setup.)

Replicability. “The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials.” (Different team, different experimental setup.)

In the context of EC, repeatability means that the authors of a publication can reliably perform multiple times their own experiments and get the same result up to their own stated precision. Reproducibility means that independent researchers can reliably perform multiple times the experiments described by the publication using the artifacts provided by the original authors and the same computational environment or a similar one, and get the same result up to the stated precision. Finally, replicability means that independent researchers can reliably perform multiple times the experiments using independently developed artifacts on a different computational environment and get the same result up to the stated precision.

According to the ACM classification, the main distinction between reproducibility and replicability is that, in the case of reproducibility, the original artifacts are re-used, while for replicability another group has to independently generate the necessary artifacts. However, we believe that there are more dimensions to reproducibility, especially in evolutionary computation, where algorithms are randomised and parameterised, and results are based on benchmark problems. What should be kept fixed and what should change to assess either reproducibility or replicability?

Following classical statistical terminology [Chiarandini and Goegebeur 2010], we make a distinction between two types of experimental factors: random effect factors and fixed effect factors.
A random factor has many possible values and the experimental conclusion of a paper applies to a certain range or distribution, but the experiment only evaluates a random sample of values. A fixed factor may also have many possible values, but the experiment only evaluates specific values chosen by the experimenter, and the claim in the paper is only supported for those specific values. A typical random factor in EC is the random seed of a stochastic algorithm, even though most computer experiments are not truly random. A typical fixed factor would be an algorithmic parameter. Whether a factor is treated as random or as fixed is typically decided by the experimenter, depending on the claim that the experiment will aim to support. In some cases, a factor must be fixed because there is no known unbiased way to sample its values. This is often the case for benchmark problem instances—e.g., it is not clear how to sample from the space of all “interesting” real-valued functions—or when only few real-world instances are available for a particular application. If they are selected by the experimenter, then they are treated as a fixed factor and the experiment only directly supports claims regarding those specific instances, although the author may hypothesise about a wider applicability. If the problem instances are randomly generated or selected from a larger class of instances, then they are treated as a random factor, and the paper can make statistical inferences about the larger class.

We suggest considering the following three dimensions of reproducibility:

(1) Artifacts: Re-use of the original artifacts should allow one to repeat the exact same experiments as described in the original publication. However, it bears the risk of also repeating the exact same mistakes in case the original code or data contained errors. Having the artifacts re-created by another group reduces the risk of errors being repeated, and also confirms that all information required to re-create the artifacts is contained in the manuscript. In some cases, the entire computational environment, and even the hardware, used in the original experiment may be provided as artifacts in the form of virtual machines, “containers” or access to cloud platforms (see Section 5.1).

(2) Random factors: In the presence of random factors in an experiment, repeating exactly the same computation would require using exactly the same values of the random factors, e.g., the same random seeds. However, one would expect that the claims of the paper hold after resampling the values of the random factors. Of course, such claims would need to be expressed in statistical terms to determine whether the results are equivalent.

(3) Fixed factors: Unless somehow randomised, fixed factors in EC typically include test problems, parameter settings, computational budget, etc. Strictly speaking, the hypothesis supported by the experiment will only apply to the specific values tested. Changing these values (or converting them to a random factor) will test whether the claims of the paper generalise also to other values and would go beyond just replication of the experiments in the paper. In some cases, the experiment specifies a reproducible procedure to randomise or unambiguously determine the values of a factor, for example, for deciding parameter values. In those cases, the procedure itself becomes the experimental factor, either random or fixed.

The typical combinations of these dimensions in a reproducibility study, together with a suggested label, are summarised in Table 1. In a repeatability study, every dimension is exactly as in the original experiment. This could be useful to assess that the original results are indeed obtainable, but may be only feasible for the original authors or require access to the original computational environment. A reproducibility study (in the narrow sense) would vary the stochastic aspects of the experiment, i.e., the random factors, but re-use as much as possible the original artifacts, possibly including the computational environment if provided as an artifact, and the values of fixed factors. At this level, we cannot expect to obtain exactly the same results as the original experiment. What is being evaluated is the statistical robustness of the conclusions reached. At a third level we find replicability studies, where the goal is to reach the same conclusion as the original experiment but with independently developed artifacts. Such a study would evaluate how much the conclusions depend on the particular artifacts and/or computational environment.
As in the previous level, random factors must be varied to properly evaluate the statistical robustness of the claims, thus there is no point in re-using the original random seeds at this level. A further level concerns the generalisability of the claims of the paper to other values of the fixed factors. Generalisability goes beyond the claims supported by the experiment. For example, although the conclusions of the paper may be true for the problem instances (or instance generator) evaluated, they would be more interesting if they extend to other problem instances. The sensitivity of the algorithm’s performance to particular parameter settings would also be an example of generalisability. In generalisability studies that use independently developed artifacts, it is a good idea to conduct first a replicability study so that, if conclusions are different from the ones in the original experiment, this discrepancy can be properly attributed to the changes in fixed factors or in artifacts.

Studies that fall between levels are possible. For example, a study that re-uses some of the original artifacts, such as the implementation of the algorithms, while evaluating the results in a new computational environment, would fall closer to reproducibility than replicability. Similarly, if claims of the paper rely on specific aspects (implementation language, hardware and third-party software capabilities, etc.), these become fixed factors rather than artifacts and, thus, varying them would be closer to a generalisability study than a replicability one.

Table 1. Proposed classification of reproducibility studies.

Label             Artifacts        Random factors  Fixed factors  Comment
Repeatability     Original         Original        Original       Exactly repeat the original experiment, generating precisely the same results.
Reproducibility   Original         New             Original       Test whether the original results were dependent on specific values of random factors and, hence, only a statistical anomaly.
Replicability     New              New             Original       Test whether it is possible to independently reach the same conclusion without relying on original artifacts.
Generalisability  Original or New  New             New            Test whether the conclusion extends beyond the experimental setup of the original paper. When new artifacts are used, generalisability should come after a replicability study.

Ideally, all published experiments should be replicable, and some argue that a pure repeatability study does not generate additional evidence for a paper’s claims and therefore may not be worthwhile [Drummond 2009]. However, evaluating the repeatability and reproducibility of an experimental study is typically less demanding and may be taken as a precondition before attempting replication. In any case, we consider it important that reproducibility studies are specific about what level of reproducibility is attempted. We also suggest that if an attempt to reproduce results fails to generalise to other values of fixed factors (e.g., test problems or parameter settings), it should be attempted with the same test problems and parameter settings, and if this fails too, it should be attempted with the original artifacts and, if possible, the same random seeds. This would allow tracing back the cause of a reproducibility problem. Hence, while repeatability and reproducibility are not the end goal, they are still important to learn from and facilitate replicability and generalisability studies.

For completeness, we would like to mention two more terminologies that classify levels of reproducibility. First, the Turing Way project [The Turing Way Community et al.
2019] funded by The Alan Turing Institute in the UK distinguishes reproducibility (the same analysis performed on the same dataset consistently produces the same answer), replicability (the same analysis performed on different datasets produces qualitatively similar answers), robustness (a different analysis applied to the same dataset produces a qualitatively similar answer to the same question) and generalisability (the combination of replicability and robustness). Second, Stodden [2014] makes a distinction between empirical reproducibility, which is the concept of reproducibility that arises in the natural sciences, i.e., being able to repeat an experiment following the details published and obtain a similar conclusion by using different artifacts, since artifacts cannot be copied in the natural sciences; computational reproducibility, which relates to the availability of the code, data and all details of the implementation and experimental setup that allow obtaining the published results; and, lastly, statistical reproducibility, which is concerned with validating the results of repeated experiments by means of statistical assessment.

Nevertheless, the above definitions do not fully specify what details define the experimental setup (or operating conditions). Completely replicating the exact conditions of the original experiment may be impossible even for the original authors, e.g., the original hardware may not be available anymore, the load of the computational system may have influenced the measurements, the precise version of some software libraries may be unknown, some sources of randomness may not be repeatable, etc. In that case, the experimental setup may refer only to the details that the original authors consider relevant for their experiment.
Alternatively, one may give up on repeatability and reproducibility as long as replicability is achievable, which does not mean that the latter is easier to achieve than the former. Indeed, replicability requires high-level descriptions of artifacts with enough detail to enable their independent development and a careful choice of measurements, their stated precision and confidence levels, that allow other researchers to unequivocally conclude whether a replication attempt falsifies the original experiment.
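The distinction between a repeatability check (re-using the original seeds, expecting identical results) and a reproducibility check (resampling the random factors, expecting agreement only up to a stated precision) can be sketched as follows. The "experiment", the seeds and the tolerance are hypothetical stand-ins:

```python
import random
import statistics

def run_experiment(seed):
    """Stand-in for one run of a stochastic algorithm: best of 50 noisy evaluations."""
    rng = random.Random(seed)
    return min(rng.gauss(100.0, 5.0) for _ in range(50))

original_seeds = list(range(20))
original = [run_experiment(s) for s in original_seeds]

# Repeatability: same artifacts *and* same values of the random factors
# (here, the seeds) should yield bit-identical measurements.
repeated = [run_experiment(s) for s in original_seeds]
assert repeated == original

# Reproducibility: resample the random factor (fresh seeds); we can only
# expect agreement in statistical terms, e.g. means within a stated precision.
stated_precision = 5.0  # hypothetical tolerance; a real study would use a proper test
resampled = [run_experiment(s) for s in range(1000, 1020)]
gap = abs(statistics.mean(resampled) - statistics.mean(original))
print(f"difference of means: {gap:.2f} (within stated precision: {gap <= stated_precision})")
```

A real reproducibility study would replace the simple tolerance on means with a statistical test of equivalence at a stated confidence level.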
Despite the obvious benefits of reproducibility studies, very few such studies are published in EC. In this section, we try to explain the low number of reproducibility studies by discussing cultural and technical obstacles.
A key reason for the low number of reproducibility studies is simply that the current “publish or perish” culture does not encourage them. Neither does the author of a scientific paper have enough incentive to facilitate reproducibility studies, nor do other scientists have enough incentive to conduct them.
Disincentives to provide artifacts.
Reproducibility studies would be greatly facilitated if authors made their artifacts available and accessible for others to easily re-use, which means that authors would have to learn about and apply standard principles of software engineering such as proper documentation, modularity, version control, testing and maintenance. With reputation and career prospects closely linked to the number of publications, this additional effort is not obviously beneficial to the individual, and thus researchers rather invest their time in publishing more papers than in making artifacts available and accessible. Besides the additional effort required, the publication of artifacts increases the chances of error detection, and thus may increase the chances of the paper being rejected (if the artifacts are checked before publication), of having to publish errata, or even of having to retract the paper. In principle, the lack of intrinsic incentives could be counter-balanced by top journals requiring the publication of artifacts prior to publication of the paper. This would not only lead to a larger number of artifacts being available, but also raise the quality of the artifacts provided, as reviewers would be less inclined to review and accept poor-quality artifacts. There is some evidence that journal policies are frequently ignored and marginally effective if they merely “encourage” authors to make their artifacts available or only “require” them post-publication [Stodden et al. 2018]. Nevertheless, we are not aware of any journal or conference in the EC field that requires artifact publication. In any case, code review places a significant additional workload onto reviewers and a peer review system that is already stretched to the limit, and the time constraints for conference publications would not allow for additional code review. It can therefore be expected that only few top publications would be able to provide such a service.
As a result, very few published papers, even in major journals, provide a complete set of artifacts.

Difficulty to publish a reproducibility study.
Conducting a reproducibility study is also not incentivised, as publishing the results may be challenging [Sörensen et al. 2017]. If the experiments confirm the results from the original paper, the knowledge gained may be considered marginal. On the other hand, if the experiments fail to validate previous work, the results of the original publication stand against the results of the reproducibility study, and the question arises whether there is a problem with the original paper, a problem with the reproducibility study, or whether the difference is simply due to statistical uncertainty. It is difficult to convince reviewers that the new study is more reliable than the old one. An independent third party, or a collaboration between the authors of the original paper and the team that tried to reproduce the results, may be required to explain the observed difference. Hence, rather than spending the effort on reproducibility studies that are difficult to publish, scientists are incentivised to develop new algorithms and publish new results.

[Footnote: Although a recent study found a small positive correlation between linking to artifacts in a paper and its scientific impact in terms of citations [Heumüller et al. 2020].]
Insufficient description.
The reluctance of authors to publish and properly document their artifacts further compounds the disincentive for reproducibility studies. Without artifacts, direct reproducibility is impossible. But even if the artifacts are available when the paper is published, they may not match the required standards for reproducibility: the steps to reproduce the results are not fully documented, the artifacts require precise versions of additional software not provided nor documented, the download link to the artifacts has become unavailable since the paper was published, etc. It also happens that the artifacts provided do not match the description in the paper, i.e., either the algorithm described in the paper is not the one actually used in the experiments or the results shown in the paper correspond to a different version of the artifacts provided.

Often, the description given in a paper is unintentionally ambiguous or insufficient to re-implement the precise algorithm. Indeed, the page limit imposed by some journals often necessitates omitting some details. This stresses the importance of making the original source code available. Even "obsolete" code, which can no longer be run because the compiler or hardware needed are no longer available, can help to resolve ambiguities and fill in details missing in the paper.
Mistakes perceived negatively.
Even though everyone occasionally makes mistakes and discussing them openly would be beneficial to the community, errors are culturally disdained. The author of a study may feel uneasy having to admit a mistake in a published paper, and the scientist who conducted a reproducibility study may feel uneasy about challenging the authors of the original study. As a result, even if someone has attempted to reproduce a scientific study, the results are rarely published.
Intellectual property.
Concerns about licensing, privacy and commercially sensitive information may be legitimate obstacles to making artifacts (source code or data) publicly available [Fonseca Cacho and Taghva 2020]. Although it may be tempting to make artifacts available only to reviewers under some type of nondisclosure agreement [Heroux 2015; Stodden et al. 2016], such an arrangement does not actually improve reproducibility.

Similarly, publishing artifacts in executable form instead of source code does not increase reproducibility. One might think that being able to reproduce the results by having the algorithm in executable form is better than not being able to reproduce the results. However, such an argument misunderstands the ultimate purpose of reproducibility, which is to be able to understand in detail how the published results were produced and whether they match the description in the paper. Therefore, even obsolete source code, in the sense discussed above, is better than "working" black-box object code.
Computational environment.
Although lab conditions in computer science are very controlled, the experimental artifact may depend on details of the computational environment such as the compiler, the hardware, or specific libraries. Any difference in those may significantly distort runtime results, random number generation or floating-point calculations, or even make it impossible to run the code at all. Often such dependencies are not properly described in the paper, but even if they are, recreating a specific computational environment for a reproducibility study may be no small feat, and becomes more challenging the older the artifact is [Perkel 2020].
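As a minimal illustration (our own sketch, not a tool from any cited work), a few lines of Python suffice to record basic environment details automatically, so that they can be archived together with the raw results:

```python
import json
import platform
import sys

def environment_report():
    """Gather basic details of the computational environment so they can
    be archived alongside the experimental results."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python_implementation": platform.python_implementation(),
        "python_version": sys.version.split()[0],
    }

# Store the report as JSON next to the raw result files.
print(json.dumps(environment_report(), indent=2))
```

Similar snippets can record compiler flags or library versions; the key point is that the report is generated mechanically at experiment time rather than reconstructed from memory when writing the paper.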
Computational resources.
A more challenging obstacle arises when the time or computational resources required to reproduce an experiment are prohibitively large. It is not unusual nowadays for research teams to have access to computation clusters capable of performing several years of CPU-time in a few weeks. Reviewers may not have access to such resources, nor the time or budget required to reproduce all experiments.
Verification of artifacts.
Although we believe that a cursory peer review of artifacts before publication would have a positive impact on reproducibility in EC, in an ideal world one would like to ensure the correctness of the artifacts. However, the effort to do so manually is tremendous, and can only be reduced somewhat by implementation-agnostic validation test suites and detailed source documentation. This is also one of the reasons why replicability studies using independent implementations, based solely on the description of the algorithm in the paper, are very valuable, as it is unlikely that different teams would make exactly the same implementation errors.
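One way to build such an implementation-agnostic validation test suite is to check independent implementations against each other on shared test cases. The following is a toy sketch (the functions, instance and distances are our own illustration, not artifacts from any cited paper):

```python
def tour_length_v1(tour, dist):
    """Reference implementation: sum the distances along the tour edges."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def tour_length_v2(tour, dist):
    """Independent re-implementation with a different traversal style."""
    total = 0
    for a, b in zip(tour, tour[1:] + tour[:1]):
        total += dist[a][b]
    return total

def implementations_agree(instances, impl_a, impl_b, tol=1e-9):
    """True if both implementations produce the same value on every instance."""
    return all(abs(impl_a(t, d) - impl_b(t, d)) <= tol for t, d in instances)

# A toy 4-city symmetric instance.
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
instances = [([0, 1, 2, 3], dist), ([0, 2, 1, 3], dist)]
print(implementations_agree(instances, tour_length_v1, tour_length_v2))  # → True
```

Because the two teams are unlikely to make the same mistake, a disagreement on any shared instance localises an error without requiring a manual review of either code base.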
A particular challenge in empirical EC research is that experiments are necessarily limited to specific problem instances and parameter settings. Nevertheless, the insights are usually expected to generalise to other settings. For example, if a paper finds one TSP crossover operator superior to another on some TSP problem instances and for a certain number of function evaluations, the expectation is (and the claim in the paper usually implies) that similar results also hold for other TSP problem instances and a larger or smaller computational budget. Studies on generalisability as defined in Table 1 are thus very important to understand how robust and generalisable the paper's conclusions are. Furthermore, in EC, parameter settings can have a huge impact on performance. Smit and Eiben [2010] have demonstrated that automatic optimisation of the parameters can substantially improve even the performance of the algorithm winning the CEC 2005 competition. They speculate in their conclusion that different algorithms might benefit differently from tuning, and that tuning all algorithms may change the ranking observed in the competition. This is exactly what Melis et al. [2017] have investigated for natural language processing. They re-evaluated several popular architectures and regularisation methods by automatically tuning their parameters and arrive at the conclusion that standard architectures, when properly tuned, outperform more recent models. To make things worse, one can argue that changing the problem instance class or the runtime available necessitates a change in the algorithm's parameter settings. Someone testing the generalisability of a paper's conclusion to a different class of problem instances thus faces the additional challenge of choosing appropriate parameter settings.
So unless parameters have been set in a systematic way, this raises the question of whether the observation that one algorithm is better than another is really due to the algorithmic differences, or just a consequence of insufficient or inappropriate tuning.
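To make the notion of a systematic parameter-setting procedure concrete, the following is a minimal random-search tuner that gives every algorithm the same tuning budget and a fixed seed (a hypothetical sketch; real studies would use dedicated tools such as irace or SMAC, and all names here are illustrative):

```python
import random

def tune(algorithm, tuning_instances, param_space, budget=50, seed=42):
    """Evaluate `budget` random configurations on the tuning instances and
    return the best one; the fixed seed makes the tuning run repeatable."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(budget):
        cfg = {p: rng.uniform(lo, hi) for p, (lo, hi) in param_space.items()}
        cost = sum(algorithm(inst, cfg) for inst in tuning_instances)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg

# Toy stand-in for an EC algorithm: cost depends on how close the
# mutation rate is to an instance-specific sweet spot.
def toy_algorithm(instance, cfg):
    return (cfg["mutation_rate"] - instance) ** 2

best = tune(toy_algorithm, tuning_instances=[0.2, 0.4],
            param_space={"mutation_rate": (0.0, 1.0)})
print(best)
```

Applying the same procedure, budget and instance set to every compared algorithm removes tuning effort as a confounding factor in the comparison.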
In this section, we discuss a few general guidelines and present pointers to the literature that aim at improving, assessing and encouraging the reproducibility of research published in the EC field. Some of these guidelines are inspired by the ACM guidelines for artifact review badging [ACM 2020], the guidelines for AI research endorsed by AAAI [Gundersen et al. 2018], the Replicated Computational Results Initiative of ACM Transactions on Mathematical Software [Heroux 2015] and other sources [Stodden et al. 2016].
Publish permanently accessible, complete and useful artifacts.
When sharing artifacts, the rule of thumb should be that a person who only has access to the published paper and the artifacts provided should be able to reproduce the results shown in the paper without having to contact the original authors. This implies that the shared artifacts should not change after publication, because the changes may prevent reproducing the paper as published. Hence, a development repository, e.g. in GitHub, is not a valid repository for artifacts unless the precise versions used in a paper are clearly tagged. Preferably, artifact repositories will have a digital object identifier (DOI), such as those generated by Zenodo (e.g., doi: 10.5281/zenodo.3749288). If revisions to the published artifacts are necessary, it should be easy to identify each previous version. The repository should provide a plan for long-term, ideally permanent, accessibility. Authors' personal webpages or development repositories do not typically satisfy this requirement. ACM uses the badge Artifacts Available for papers that match the requirements above [ACM 2020].

Artifacts should contain, at a minimum, all the source code and the input data required to reproduce the results reported in the paper, together with clear metadata and sufficient documentation on how to reproduce the results. We suggest, however, providing a detailed step-by-step documentation, flexible reproduction scripts and, as much as possible, the raw intermediate (generated) data, so that reviewers and other researchers can selectively repeat parts of the experiment. Such extensive artifacts enable a better evaluation and comprehension, hopefully simplifying reproduction efforts and avoiding mistakes. In summary, with regards to source code, we suggest splitting the code into:

• Pre-processing code, e.g. code that generates instance data and scripts that set up the experimental conditions.
• Algorithm code, the implementation of the algorithm(s) to be tested.
• Analysis code, scripts that post-process the data produced by the algorithm and perform statistical analysis.
• Presentation code, e.g. scripts that generate tables and figures reported in the article.

As for the generated data provided, although a paper may report only summary statistics, the artifacts should ideally contain the raw data generated, thus not only enabling the reproduction of the analysis, but also further analysis by others. Furthermore, in an optimisation context, we argue that the raw data should contain not only objective function values but also the actual solutions, thus making it possible to verify and compare results. Even for simple problems such as the TSP, the correct computation of the objective function may depend on technical details, e.g., preprocessing of distance data (see the TSPLIB documentation, http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/tsp95.pdf). Subtle implementation errors cannot be detected unless the actual solutions are available. Verification may be facilitated by publicly available solution checkers, or the authors themselves may provide such a checker as an additional artifact that other researchers may use to verify their obtained solutions. Solution checkers should be as simple as possible so that the implementation can be trusted. One further step would be to automatically run the solution checker during post-processing. This proposal would bring us closer to "certifying artifacts" [McConnell et al. 2011].

Finally, we argue that artifacts should be made available as source code and in open-data formats, under conditions no more restrictive than those required to read the paper itself. License information should be included with the artifacts. (The latest version of the reproducibility guidelines mentioned above can be found at https://folk.idi.ntnu.no/odderik/reproducibility_guidelines.pdf; version 1.3, last accessed June 25, 2020.)
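A solution checker of the kind suggested above can indeed be very small. The following toy sketch (our own illustration, with made-up distances) verifies both the feasibility of a TSP solution and the objective value claimed for it:

```python
def check_tsp_solution(tour, dist, claimed_length, tol=1e-6):
    """Verify that `tour` visits every city exactly once and that its
    recomputed length matches the objective value claimed in the paper."""
    n = len(dist)
    if sorted(tour) != list(range(n)):
        return False, "tour is not a permutation of all cities"
    length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    if abs(length - claimed_length) > tol:
        return False, f"recomputed length {length} differs from claimed {claimed_length}"
    return True, "ok"

# A toy 4-city symmetric instance; tour 0-1-3-2 has length 2+4+3+9 = 18.
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(check_tsp_solution([0, 1, 3, 2], dist, claimed_length=18))  # → (True, 'ok')
```

Because the checker is only a handful of lines, other researchers can audit it directly and run it against published solution files.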
Facilitate access to computational resources.
Current practice for journals checking artifacts is that it is up to reviewers to get access to the required resources and bear the cost of reproducing experiments. If special hardware is required, e.g., graphical processing units (GPUs), authors could consider providing reviewers with access to the required resources for the purpose of reproducibility checks. Although this case may seem similar to the availability of sensitive artifacts discussed above, where we argued against making sensitive artifacts available only to reviewers, there is a fundamental difference: artifacts that are only disclosed to reviewers will never become available to other researchers, which hinders reproducibility, whereas specialist hardware such as GPUs is publicly available for purchase by interested researchers, but reviewers should not bear the cost. A similar distinction may be made between undisclosed data, which is not suitable for reproducibility, and data that is simply too large to host or copy for review purposes [Fonseca Cacho and Taghva 2020]. Journals might consider making resources available to their reviewers and bearing some of the cost. For really expensive resources, however, the only realistic solution might be that research councils specifically fund reproducibility studies (see also Section 5.3).
Report detailed experimental conditions.
Any details required to reproduce the experiment but not included as part of the artifacts should be thoroughly reported in the documentation included with the artifacts. These details include the precise versions of any additional software, packages, libraries, simulators, compilers, interpreters, and operating systems (possibly including installation steps unless trivial), as well as the relevant details of the hardware platform. For example, experiments requiring significant amounts of memory should report the memory available, whereas results sensitive to small changes in computation time should report full CPU details including cache sizes.

Literate programming, dynamic documentation and reproducible notebooks (e.g., Rmarkdown, Jupyter notebooks, Knitr) integrate code, documentation and analysis, which makes it much easier to understand and interact with code, and to reproduce or observe results by automatically re-creating analysis, tables and figures.

Nowadays, several technical solutions exist that can ensure the portability of programs to different software environments, such as virtual machines, containers, and platforms, e.g., the Open Science Framework (https://osf.io/), Code Ocean (https://codeocean.com/) and Docker. A container includes everything that is needed to run an application, such as code, system tools, system libraries and settings, independent of the underlying operating system. ACM TELO explicitly supports the use of Code Ocean and can integrate it directly into its Digital Library platform.

In the case of algorithms, all (hyper-)parameters should be clearly described in the documentation, including their domain. For the purposes of generalisability, the process used for setting parameter values should be reproducible as well, ideally by means of design of experiments [Montgomery 2012; Paquete et al. 2007] or automatic algorithm configuration tools [Birattari 2009], such as SMAC [Hutter et al. 2011] or irace [López-Ibáñez et al. 2016], with a clear explanation of the values explored.

If results depend on random number generators, one should document or provide as artifacts the precise random seeds that produce the results reported, for the purpose of allowing the exact repetition of an experiment. This recommendation also applies to randomly generated data and problem instances, although in this case one may also provide the generated data for completeness.

Measure and report with reproducibility in mind.
There are two main concerns that should guide which measurements are performed and how they are reported: (1) the level of algorithmic abstraction being considered in the experiment, and (2) what measurements can actually be reproduced given the artifacts provided.

McGeoch [2012] provides detailed guidelines for measuring and reporting solution quality and computational effort at various abstraction levels that are directly applicable to EC. At the highest level, we find algorithm paradigms such as metaheuristics, which are generic algorithmic templates that can be applied to different problem domains; at the lowest level, we find executable implementations of specific algorithms running on a particular machine. It makes sense to consider machine-independent measures to compare algorithm paradigms as well as very precise machine counts to compare different implementations, but not the other way around.

When measuring and reporting computation time, authors should report not only the hardware configuration, but also include benchmarking codes and their running times among the artifacts provided, e.g., a publicly available deterministic algorithm for the particular problem domain, run on a few small standard benchmark problem instances. Such information may be used to normalise machine speeds [Johnson 2002] by comparing the speed of these standard benchmarks on different computers, and scaling speeds accordingly.

With respect to ensuring reproducibility in the narrow sense, random experimental factors may lead to differences in the results reported, no matter how detailed the artifacts provided are. Therefore, results should not be reported with a confidence or precision larger than what can actually be reproduced, since it provides a false certainty about the values reported.
Nevertheless, the more details are included in the artifacts (e.g., random seeds, precise versions of required software, or even fully-fledged software containers and virtual machines), the less random variation we need to account for in a reproducibility study.
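The recommendation to document random seeds can be made concrete with a few lines of Python (a sketch under our own naming, not code from any cited artifact): per-run seeds are drawn from a documented master seed and archived with the artifacts, so each individual run can later be repeated exactly.

```python
import random

def run_experiment(seed):
    """One stochastic run; every random decision flows from the given seed."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

# Draw the per-run seeds from a documented master seed and archive them
# with the artifacts (e.g., in a seeds.json file).
master = random.Random(20240101)
seeds = [master.randrange(2**31) for _ in range(5)]
results = [run_experiment(s) for s in seeds]

# Anyone holding `seeds` can reproduce the runs exactly:
assert [run_experiment(s) for s in seeds] == results
```

The same pattern applies to randomly generated problem instances: archiving the generator seed makes the instance set itself repeatable.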
Report statistical inference to make your claims more robust.
Due to the stochastic nature of EC algorithms, it is expected that authors report not only means and variances, but also confidence intervals, 𝑝-values and/or effect size estimates. The use of confidence intervals, rather than 𝑝-values, is usually recommended in the recent literature [Cumming 2012]. The former give useful information about uncertainty and are easier to interpret. Moreover, a 𝑝-value can always be derived from a confidence interval, but not the other way around. Effect size measures, such as Cohen's 𝑑 and Pearson's 𝑟, estimate the effect of a treatment, such as the effect of a new operator on the overall performance of an algorithm. Confidence intervals for effect size estimates are also available. For an appropriate treatment of inferential procedures in the context of computer science experiments, we refer to Lilja [2000], Cohen [1995], Bartz-Beielstein et al. [2010] and McGeoch [2012].

Be precise about the claims made.
Most empirical results in EC are obtained with specific algorithmic parameter settings, on a small set of problem instances and under specific experimental conditions (e.g., number of function evaluations allowed) and random seeds. However, it is usually expected that the conclusions generalise beyond the precise experiment reported by the paper. Certainly, in most cases, a conclusion that depends on the specific random seeds would not have much value, even if it is fully repeatable. On the other hand, the conclusions in many papers are much broader, e.g., crossover operator A is better than crossover operator B for continuous optimisation. If a subsequent study finds that crossover operator B is better than crossover operator A on continuous problems different from the ones used in the original paper then, strictly speaking, the original claim is falsified, even though the experiment may still be replicable and we can only say that the results do not generalise (Table 1). Authors should therefore be as precise as possible about the claims they make, such as the experimental conditions and problem classes for which they believe their conclusions to hold. The experimental design should also reflect the scope of the claims, e.g., by using a problem instance generator whenever possible to clearly define the relevant class of problems, rather than testing on a few arbitrarily selected problem instances. In the absence of (or in addition to) such a generator, defining and measuring problem features would characterise the scope of the claims [Muñoz and Smith-Miles 2020] and provide evidence that the conclusions hold within this scope. Another well-known issue is that specialising algorithm designs and parameter settings to particular problem instances, i.e., overfitting, typically comes at the cost of worsening performance on unseen instances, even of the same problem [Birattari 2009].
Thus, several journals [Dorigo 2016; Journal of Heuristics 2015] have adopted policies that require a clear separation between the problem instances used for algorithmic development and parameter tuning, and the problem instances used for hypothesis testing and benchmarking. Such a separation provides evidence that the claims of the paper apply to a broader scope than the particular instances evaluated. Finally, a sensitivity analysis of parameter settings and experimental conditions would also provide evidence that the main conclusions hold when those conditions vary.
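The separation between tuning and test instances can be enforced mechanically. A minimal sketch (instance names and split ratio are illustrative), where fixing the seed makes the split itself a reproducible part of the experiment:

```python
import random

def split_instances(instance_names, tuning_fraction=0.5, seed=123):
    """Partition instances into disjoint tuning and testing sets; the
    fixed seed makes the split itself reproducible."""
    rng = random.Random(seed)
    shuffled = list(instance_names)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * tuning_fraction)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])

instances = [f"inst{i:02d}" for i in range(10)]
tuning, testing = split_instances(instances)
assert not set(tuning) & set(testing)  # no instance used for both purposes
print(tuning)
print(testing)
```

Publishing the split (or the seed that produces it) with the artifacts lets reviewers verify that no test instance leaked into algorithm development or parameter tuning.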
Procedures to assess reproducibility should be tied to the authors' claims.

Unfortunately, there is no standard way to evaluate the reproducibility of computational experiments. This is particularly difficult in EC, since one has to deal not only with differences in hardware and/or software when reproducing experiments, but also with algorithm stochasticity. Therefore, we advocate some caution before concluding, in a clear-cut manner, that a work is not reproducible if results do not match exactly. Instead, we suggest investigating further the reasons for the work not being reproducible, for instance, identifying possible hardware or software differences, such as compiler flags, cache level sizes, software libraries, or even sample size.

Even in repeatability studies (see Table 1), we may not always expect an implementation to obtain exactly the same result under the same random seed if run twice, for instance, due to small fluctuations in the running time that defines the termination condition. Inferential procedures could be used to assert whether the differences between the original runs and the replicated runs are due to a random or a systematic effect. In this case, a matched-pair inferential procedure would be appropriate in order to take into account the natural pairing between the original and the replicated run using the same random seed.

A typical scenario in EC is to compare the performance of several algorithms on a set of benchmark instances. Asserting an author's claim that Algorithm A is significantly better than Algorithm B with respect to solution quality can be performed by testing whether the reproduced results show a significant effect in the same direction, given the same significance level as specified in the original publication.
Alternatively, the opposite direction of the claim could be tested, which, if significant, would allow one to infer that the work is not reproducible.

In the above scenario, it is also possible to test whether the effect size is significantly different, even if the direction is the same. The authors of Open Science Collaboration [2015] suggest testing whether the original effect size is within the 95% confidence interval of the reproduced effect size estimate. However, some concerns have been raised about this procedure, as the average probability of the first 95% confidence interval including the next reproduced mean is only approximately 83% [Cumming 2012]. Reporting confidence intervals of the difference between the original and reproduced effect sizes is usually recommended. If 0 is included in this confidence interval, it suggests that the work is reproducible.

We note that the usual assumptions of parametric inferential procedures may be hard to meet when assessing EC algorithms, and non-parametric alternatives may be better suited. However, conducting non-parametric inference procedures based on computationally intensive methods, such as bootstrapping and randomisation tests, for assessing reproducibility may require access to all the data collected by the original study, in addition to the aggregated statistical measures usually reported.
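As an example of the matched-pair reasoning above, a percentile-bootstrap confidence interval for the mean difference between original and reproduced runs (paired by random seed) can be computed without parametric assumptions. The data below are made up purely for illustration:

```python
import random
import statistics

def bootstrap_ci(paired_diffs, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for the mean of matched-pair differences."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(paired_diffs, k=len(paired_diffs)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]

original   = [10.2, 11.5, 9.8, 10.9, 11.1, 10.4, 10.7, 11.3]  # published runs
reproduced = [10.1, 11.6, 9.9, 10.8, 11.2, 10.3, 10.6, 11.4]  # same seeds, new run
diffs = [a - b for a, b in zip(original, reproduced)]
lo, hi = bootstrap_ci(diffs)
# If 0 lies inside the interval, the observed differences are consistent
# with random fluctuation rather than a systematic effect.
print(lo <= 0 <= hi)
```

As the section notes, such resampling requires the raw per-run data of the original study, not just the aggregated statistics reported in the paper.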
Meta-analysis is an interesting complementary approach, extensively used in other fields to aggregate results from different studies and derive general conclusions [Borenstein et al. 2009]. In the context of reproducibility, meta-analysis would allow one to understand how much the effect size varies between the original and the reproduced studies by combining the results from both. This new estimate usually takes the form of a weighted average of the individual estimates, where the weights are inversely proportional to the sampling variance, and from which inferential procedures for testing heterogeneity are constructed. We refer to Ehm [2016] for the application of inferential methods in meta-analysis for reproducibility.

Most research in EC is trying to derive general insights such as "algorithm A is better than algorithm B for this class of problems" from limited experiments on a set of problem instances. Reproducibility studies should thus not only focus on exactly reproducing results, but also expand the experimentation to assess generalisability, by changing the value of fixed factors such as parameter settings and problem instances. Only such experiments will confirm, over time, that the conclusions are not only valid for the fixed values examined in the original paper, but have a broader validity.
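The inverse-variance weighting just described is simple to compute. A minimal fixed-effect sketch (the effect sizes and variances below are made up for illustration):

```python
def fixed_effect_meta(estimates, variances):
    """Fixed-effect meta-analysis: combine effect-size estimates using
    weights inversely proportional to their sampling variances."""
    weights = [1.0 / v for v in variances]
    combined = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return combined, 1.0 / sum(weights)  # combined estimate and its variance

# Effect sizes (e.g., Cohen's d) from an original study and a replication,
# together with their sampling variances.
combined, var = fixed_effect_meta(estimates=[0.8, 0.5], variances=[0.04, 0.02])
print(round(combined, 3), round(var, 4))  # → 0.6 0.0133
```

The combined variance shrinks as studies are added, and the per-study deviations from the combined estimate feed the heterogeneity tests mentioned above.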
Ideally, rigorous journals should adopt the Transparency and Openness Promotion (TOP) guidelines, which require reproducibility checks and even independent replication before publication [Nosek et al. 2015; Stodden et al. 2016]. Some journals, such as Mathematical Programming Computation, already require that source code is provided to reviewers, and some authors (and we) believe this requirement should become the norm [Sörensen et al. 2017]. An intermediate, less onerous step is to award recognition to papers that achieve certain levels of reproducibility. ACM badges [ACM 2020] already provide a way to recognise different degrees of reproducibility that journals could adopt.

ACM TELO follows the ACM guidelines for reproducibility [ACM 2020]. When submitting the manuscript, the author can apply for an ACM reproducibility badge. Once the paper passes the first stage of review and is accepted or returned to the author for revision, the artifacts are reviewed by a member of the journal's reproducibility board, who can recommend that a badge be awarded, or request further revisions of the artifact before a badge can be awarded. Three badges can be requested: Artifacts Available, Artifacts Evaluated and Results Validated. The badges are independent, that is, any combination of badges can be requested. The badge Artifacts Available is provided if the artifact is publicly available in a permanent repository. The badge Artifacts Evaluated is concerned with ensuring that the artifact fulfills the requirements to be reproduced by others; it has two levels, functional and reusable, the latter requiring that the artifact can be re-used and re-purposed. The badge Results Validated has the level Results Reproduced, which corresponds to the notion of reproducibility presented in Section 3, that is, the experimental results are validated using the artifact provided by the authors; and the level Results Replicated, which requires that the results are replicated without using the authors' artifacts. However, the latter is not offered by ACM TELO at the moment, as the effort to replicate the code has been deemed excessive.

Another aspect that may be of interest to the EC community, and that may help to prevent publication bias, is to allow authors to pre-register [Nosek et al. 2018] their scientific studies and hypotheses with a journal, or a publicly available website, before conducting the experiments. Pre-registration would allow reviewers to verify whether the initial authors' plan and the published results match or not. A more ambitious goal would be to allow the pre-registration document to be peer reviewed in order to identify issues with the experimental setup and its suitability for validating the authors' hypotheses before the experiments are conducted. A certain publication guarantee could be provided depending on the reviewers' confidence. This is particularly relevant for experimental studies that take a very long time or require huge amounts of computational resources.

Funding agencies may also encourage reproducibility in various ways, as suggested by Stodden et al. [2016]. In particular, funding agencies may require that the resulting research is reproducible according to specific and verifiable criteria, in a similar manner to how some funding agencies already require and provide funding for open-access publications. Also, funding agencies could encourage reproducibility studies by funding such studies, and support reproducibility efforts by funding research that analyses or alleviates reproducibility obstacles. We want to highlight the incongruity of funding non-reproducible research with public money.
Reproducibility is a cornerstone of science. Without reproducibility, scientific progress is impossible. Yet, many scientific works are not reproducible, partly due to wrong incentives for academics, partly due to insufficient awareness. EC is particularly vulnerable because of its mostly experimental approach and the stochastic nature of its algorithms. In this paper, we have discussed reproducibility in the context of EC, and proposed a new classification of reproducibility studies, distinguishing four different types, namely, Repeatability, Reproducibility, Replicability and Generalisability, with different purposes and study designs. We have then analysed the reasons for the reproducibility crisis and identified various cultural and technical obstacles.

Despite these obstacles, there are positive developments that point to a shift of culture. First, concern is growing in the EC community about questionable benchmarking [Bartz-Beielstein et al. 2020], insufficient statistical assessment [Buzdalov 2019; García et al. 2009; Shilane et al. 2008], unfair parameter tuning [Birattari 2009], and, more recently, reproducibility and replicability issues. Several journals have adopted explicit policies that encourage reproducibility (albeit without requiring it) and improve replicability. Some ACM journals, with TELO being a prominent example, have established reproducibility boards that award badges recognising the effort in making research reproducible. Finally, due to this shift in culture, solutions to technical obstacles are becoming more widely available and adopted, thus lowering the effort of improving the reproducibility of EC research.

We suggest that reproducibility (in the narrow sense) is a short-term goal that ideally should be checked during the review process. In EC, in particular, there are no actual technical obstacles to making code and data available, thus making results reproducible should be the norm. Platforms such as Code Ocean and OSF exist that provide a nearly identical experimental setup to ensure that published results may be reproduced by reviewers and other researchers. Nevertheless, once this validation step is done, we believe that the preservation of code and data is more useful in the long term than the long-term availability of a reproducible experimental environment, given the rapid obsolescence of software and hardware.
Even if the original study becomes non-reproducible due to the obsolescence of its original artifacts, studying its code and data could help future replication and generalisation efforts.

The next step should be empirical and statistical replicability, and published research should enable it. In other words, published research should contain the information required to independently replicate the experiment without using the original artifacts, and to reach the same conclusion given the statistical confidence provided by the original experiment. This information would include all relevant details about the algorithm, problem, measurements and experimental environment at the right level of abstraction. It would also include all the statistical details that would allow the authors of a replication study to assess whether their new results, which are expected to be numerically different from the original ones due to varying the random factors of the experiments, reject or not the original hypothesis.

The final step that will actually push the boundary is to examine the generalisability of the claims made in scientific papers by testing whether the main conclusions still hold in somewhat different experimental setups and for different problem classes.

To overcome the reproducibility crisis we need a culture shift towards reproducibility in EC, with reproducibility playing a bigger role in education, funding decisions, recruitment and reputation. While this requires some extra effort, especially early on, the reward will be faster scientific progress, less frustration trying to build on others' work, and a higher reputation for the field as a whole. The journey has already begun.
ACKNOWLEDGMENTS
We would like to thank Carola Doerr (Sorbonne University, France) and Mike Preuss (Leiden University) for pointing out guidelines for reproducibility in other fields. M. López-Ibáñez is a “Beatriz Galindo” Senior Distinguished Researcher (BEAGAL 18/00053) funded by the Spanish Ministry of Science and Innovation (MICINN). This work was partially funded by national funds through the FCT – Foundation for Science and Technology, I.P., within the scope of the project CISUC – UID/CEC/00326/2020.
REFERENCES
M. Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature 533 (2016), 452–454.
T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss (Eds.). 2010. Experimental Methods for the Analysis of Optimization Algorithms. Springer, Berlin, Germany.
T. Bartz-Beielstein, C. Doerr, D. van den Berg, J. Bossek, S. Chandrasekaran, T. Eftimov, A. Fischbach, P. Kerschke, W. La Cava, M. López-Ibáñez, K. M. Malan, J. H. Moore, B. Naujoks, P. Orzechowski, V. Volz, M. Wagner, and T. Weise. 2020. Benchmarking in Optimization: Best Practice and Open Issues. Arxiv preprint arXiv:2007.03488 [cs.NE] (2020). https://arxiv.org/abs/2007.03488
M. Birattari. 2009. Tuning Metaheuristics: A Machine Learning Perspective. Studies in Computational Intelligence, Vol. 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00483-4
K. D. Boese, A. B. Kahng, and S. Muddu. 1994. A New Adaptive Multi-Start Technique for Combinatorial Global Optimization. Operations Research Letters 16, 2 (1994), 101–113.
M. Borenstein, L. V. Hedges, J. P. T. Higgins, and H. R. Rothstein. 2009. Introduction to Meta-Analysis. Wiley.
D. Brockhoff. 2015. A Bug in the Multiobjective Optimizer IBEA: Salutary Lessons for Code Release and a Performance Re-Assessment. In Evolutionary Multi-criterion Optimization, EMO 2015 Part I, A. Gaspar-Cunha, C. H. Antunes, and C. A. Coello Coello (Eds.). Lecture Notes in Computer Science, Vol. 9018. Springer, Heidelberg, Germany, 187–201. https://doi.org/10.1007/978-3-319-15934-8_13
M. Buzdalov. 2019. Towards better estimation of statistical significance when comparing evolutionary algorithms. In GECCO’19 Companion, M. López-Ibáñez, A. Auger, and T. Stützle (Eds.). ACM Press, New York, NY, 1782–1788. https://doi.org/10.1145/3319619.3326899
F. Campelo and E. F. Wanner. 2020. Sample size calculations for the experimental comparison of multiple algorithms on multiple problem instances. Journal of Heuristics (2020). https://doi.org/10.1007/s10732-020-09454-w
M. Chiarandini and Y. Goegebeur. 2010. Mixed Models for the Analysis of Optimization Algorithms. See [Bartz-Beielstein et al. 2010], 225–264. https://doi.org/10.1007/978-3-642-02538-9
J. Claerbout and M. Karrenbach. 1992. Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists, 601–604. https://doi.org/10.1190/1.1822162
A. Cockburn, P. Dragicevic, L. Besançon, and C. Gutwin. 2020. Threats of a Replication Crisis in Empirical Computer Science. Commun. ACM 63, 8 (July 2020), 70–79. https://doi.org/10.1145/3360311
P. R. Cohen. 1995. Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA.
J. Cumming. 2012. Understanding the New Statistics – Effect Sizes, Confidence Intervals, and Meta-analysis. Taylor & Francis.
M. Dorigo. 2016. Swarm intelligence: A few things you need to know if you want to publish in this journal. Swarm Intelligence (Nov. 2016). https://static.springer.com/sgw/documents/1593723/application/pdf/Additional_submission_instructions.pdf
C. Drummond. 2009. Replicability is not Reproducibility: Nor is it Good Science. In Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML.
In Reproducibility – Principles, Problems, Practices and Prospects, H. Atmanspacher and S. Maasen (Eds.). Wiley, 141–168.
D. Fanelli. 2012. Negative results are disappearing from most disciplines and countries. Scientometrics 90, 3 (2012), 891–904. https://doi.org/10.1007/s11192-011-0494-7
J. R. Fonseca Cacho and K. Taghva. 2020. The State of Reproducible Research in Computer Science. In S. Latifi (Ed.). Springer International Publishing, 519–524. https://doi.org/10.1007/978-3-030-43020-7_68
S. García, D. Molina, M. Lozano, and F. Herrera. 2009. A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 Special Session on Real Parameter Optimization. Journal of Heuristics 15, 6 (2009), 617–644. https://doi.org/10.1007/s10732-008-9080-4
D. R. Grimes, C. T. Bauch, and J. P. A. Ioannidis. 2018. Modelling science trustworthiness under publish or perish pressure. Royal Society Open Science 5 (2018), 171511.
O. E. Gundersen, Y. Gil, and D. W. Aha. 2018. On Reproducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications. AI Magazine 39, 3 (Sept. 2018), 56–68. https://doi.org/10.1609/aimag.v39i3.2816
M. A. Heroux. 2015. Editorial: ACM TOMS Replicated Computational Results Initiative. ACM Trans. Math. Software 41, 3 (June 2015), 1–5. https://doi.org/10.1145/2743015
R. Heumüller, S. Nielebock, J. Krüger, and F. Ortmeier. 2020. Publish or perish, but do not forget your software artifacts. Empirical Software Engineering 25, 6 (2020), 4585–4616. https://doi.org/10.1007/s10664-020-09851-6
J. N. Hooker. 1996. Testing Heuristics: We Have It All Wrong. Journal of Heuristics 1, 1 (1996), 33–42. https://doi.org/10.1007/BF02430364
F. Hutter, H. H. Hoos, and K. Leyton-Brown. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. In Learning and Intelligent Optimization, 5th International Conference, LION 5, C. A. Coello Coello (Ed.). Lecture Notes in Computer Science, Vol. 6683. Springer, Heidelberg, Germany, 507–523.
J. P. A. Ioannidis. 2005. Why Most Published Research Findings Are False. PLoS Medicine 2, 8 (2005), e124. https://doi.org/10.1371/journal.pmed.0020124
D. S. Johnson. 2002. A Theoretician’s Guide to the Experimental Analysis of Algorithms. In Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges.
N. L. Kerr. 1998. HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review 2, 3 (Aug. 1998), 196–217. https://doi.org/10.1207/s15327957pspr0203_4
D. J. Lilja. 2000. Measuring Computer Performance: A Practitioner’s Guide. Cambridge University Press. https://doi.org/10.1017/CBO9780511612398
M. López-Ibáñez, J. Dubois-Lacoste, L. Pérez Cáceres, T. Stützle, and M. Birattari. 2016. The irace package: Iterated Racing for Automatic Algorithm Configuration. Operations Research Perspectives 3 (2016), 43–58.
Computer Science Review 5, 2 (2011), 119–161. https://doi.org/10.1016/j.cosrev.2010.09.009
C. C. McGeoch. 2012. A Guide to Experimental Algorithmics. Cambridge University Press.
G. Melis, C. Dyer, and P. Blunsom. 2017. On the State of the Art of Evaluation in Neural Language Models. Arxiv preprint arXiv:1707.05589 (2017). http://arxiv.org/abs/1707.05589
D. C. Montgomery. 2012. Design and Analysis of Experiments (8th ed.). John Wiley & Sons, New York, NY.
M. A. Muñoz and K. Smith-Miles. 2020. Generating New Space-Filling Test Instances for Continuous Black-Box Optimization. Evolutionary Computation 28, 3 (Sept. 2020), 379–404. https://doi.org/10.1162/evco_a_00262
Y. Nagata and S. Kobayashi. 1997. Edge Assembly Crossover: A High-power Genetic Algorithm for the Traveling Salesman Problem. In ICGA, T. Bäck (Ed.). Morgan Kaufmann Publishers, San Francisco, CA, 450–457.
Y. Nagata and S. Kobayashi. 1999. An analysis of edge assembly crossover for the traveling salesman problem. In IEEE SMC’99 Conference Proceedings, 1999 IEEE International Conference on Systems, Man, and Cybernetics, K. Ito, F. Harashima, and K. Tanie (Eds.). IEEE Press, 628–633. https://doi.org/10.1109/icsmc.1999.823285
Y. Nagata and S. Kobayashi. 2013. A Powerful Genetic Algorithm Using Edge Assembly Crossover for the Traveling Salesman Problem. INFORMS Journal on Computing 25, 2 (2013), 346–363. https://doi.org/10.1287/ijoc.1120.0506
A. Newell and H. A. Simon. 1976. Computer Science as Empirical Inquiry: Symbols and Search. Commun. ACM 19, 3 (March 1976), 113–126. https://doi.org/10.1145/360018.360022
B. A. Nosek, G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen, M. Contestabile, A. Dafoe, E. Eich, J. Freese, R. Glennerster, D. Goroff, D. P. Green, B. Hesse, M. Humphreys, J. Ishiyama, D. Karlan, A. Kraut, A. Lupia, P. Mabry, T. Madon, N. Malhotra, E. Mayo-Wilson, M. McNutt, E. Miguel, E. L. Paluck, U. Simonsohn, C. Soderberg, B. A. Spellman, J. Turitto, G. VandenBos, S. Vazire, E. J. Wagenmakers, R. Wilson, and T. Yarkoni. 2015. Promoting an open research culture. Science 348, 6242 (2015), 1422–1425.
In Metaheuristics: Progress in Complex Systems Optimization, K. F. Doerner, M. Gendreau, P. Greistorfer, W. J. Gutjahr, R. F. Hartl, and M. Reimann (Eds.). Operations Research / Computer Science Interfaces, Vol. 39. Springer, New York, NY, 325–344. https://doi.org/10.1007/978-0-387-71921-4_17
J. M. Perkel. 2020. Challenge to scientists: does your ten-year-old code still run? Nature 584 (2020), 656–658. https://doi.org/10.1038/d41586-020-02462-7
H. E. Plesser. 2018. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Frontiers in Neuroinformatics 11 (Jan. 2018). https://doi.org/10.3389/fninf.2017.00076
D. Shilane, J. Martikainen, S. Dudoit, and S. J. Ovaska. 2008. A general framework for statistical performance comparison of evolutionary computation algorithms. Information Sciences.
J. P. Simmons, L. D. Nelson, and U. Simonsohn. 2011. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science (2011). https://ssrn.com/abstract=1850704
S. K. Smit and A. E. Eiben. 2010. Beating the ’world champion’ evolutionary algorithm via REVAC tuning. In Proceedings of the 2010 Congress on Evolutionary Computation (CEC 2010), H. Ishibuchi et al. (Eds.). IEEE Press, Piscataway, NJ, 1–8. https://doi.org/10.1109/CEC.2010.5586026
K. Sörensen, F. Arnold, and D. Palhazi Cuervo. 2017. A critical analysis of the “improved Clarke and Wright savings algorithm”. International Transactions in Operational Research 26, 1 (2017), 54–63. https://doi.org/10.1111/itor.12443
V. Stodden. 2014. What scientific idea is ready for retirement? Reproducibility. Edge.
The Turing Way Community. 2019. The Turing Way: A Handbook for Reproducible Data Science. Zenodo. https://doi.org/10.5281/zenodo.3233986
P. Wegner. 1976. Research paradigms in computer science. In