How to Evaluate Solutions in Pareto-based Search-Based Software Engineering? A Critical Review and Methodological Guidance
Miqing Li, Tao Chen, Xin Yao, Fellow, IEEE

Miqing Li and Tao Chen contributed equally to this research. (Corresponding authors: Tao Chen and Xin Yao)
Miqing Li is with the School of Computer Science, University of Birmingham, UK, B15 2TT. (email: [email protected])
Tao Chen is with the Department of Computer Science, Loughborough University, UK, LE11 3TU. (email: [email protected])
Xin Yao is with the Shenzhen Key Laboratory of Computational Intelligence (SKyLoCI), Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, P. R. China, and CERCIA, School of Computer Science, University of Birmingham, UK, B15 2TT. (email: [email protected])
Abstract — With modern requirements, there is an increasing tendency to consider multiple objectives/criteria simultaneously in many Software Engineering (SE) scenarios. Such a multi-objective optimization scenario comes with an important issue — how to evaluate the outcome of optimization algorithms, which typically is a set of incomparable solutions (i.e., being Pareto nondominated to each other). This issue can be challenging for the SE community, particularly for practitioners of Search-Based SE (SBSE). On one hand, multi-objective optimization could still be relatively new to SE/SBSE researchers, who may not be able to identify the right evaluation methods for their problems. On the other hand, simply following the evaluation methods for general multi-objective optimization problems may not be appropriate for specific SBSE problems, especially when the problem nature or the decision maker's preferences are explicitly/implicitly known. This has been well echoed in the literature by various inappropriate/inadequate selections and inaccurate/misleading uses of evaluation methods. In this paper, we first carry out a systematic and critical review of quality evaluation for multi-objective optimization in SBSE. We survey 717 papers published between 2009 and 2019 from 36 venues in seven repositories, and select 95 prominent studies, through which we identify five important but overlooked issues in the area. We then conduct an in-depth analysis of quality evaluation indicators/methods and general situations in SBSE, which, together with the identified issues, enables us to codify a methodological guidance for selecting and using evaluation methods in different SBSE scenarios.
Index Terms — Search-based software engineering, multi-objective optimization, Pareto optimization, quality evaluation, quality indicators, preferences.
1 INTRODUCTION

In software engineering (SE), it is not uncommon to face a scenario where multiple objectives/criteria need to be considered simultaneously [20], [50]. In such scenarios, there is usually no single optimal solution but rather a set of Pareto optimal solutions (termed a Pareto front in the objective space), i.e., solutions that cannot be improved on one objective without degrading on some other objective. To tackle these multi-objective SE problems, different problem-solving ideas have been brought up. One of them is to generate a set of solutions to approximate the Pareto front. This, in contrast with the idea of aggregating objectives (by weighting) into a single-objective problem, provides different trade-offs between the objectives, from which the decision maker (DM) can choose their favorite solution.

In such Pareto-based optimization, a fundamental issue is to evaluate the quality of solution sets (populations) obtained by computational search methods (e.g., greedy search, heuristics, and evolutionary algorithms) in order to know how well the methods perform. Since the obtained solution sets are typically not comparable to each other with respect to Pareto dominance, how to evaluate/compare them is non-trivial. A straightforward way is to plot the solution sets (by scatter plot) for an intuitive evaluation/comparison. Yet this only works well for the bi-objective case, and when the number of objectives reaches four, it is impossible to show the solution sets by scatter plot. More importantly, visual comparison cannot provide a quantitative comparative result between the solution sets.

Another way to evaluate the solution sets is to report their descriptive statistical results, such as the best, mean, and median values on each objective from each solution set. This has been widely used in Search-Based SE (SBSE) [6], [9], [18], [22], [37], [130], [134]. However, some of these statistical indexes may easily give misleading evaluation results. That is, a solution set which is evaluated better than its competitor could in fact never be preferred by the DM under any circumstance. This will be explained in detail later (Section 5.1.2).

Generic quality indicators, which are arguably the most straightforward evaluation method, mapping a solution set to a real number that indicates one or several aspects of the set's quality, have emerged in the fields of evolutionary computation and operational research [10], [62], [116], [154]. Today, analyzing and designing quality indicators has become an important research topic.
1. A solution set A is said to (Pareto) dominate a solution set B if for any solution in B there exists at least one solution in A dominating it, where the dominance relation between two solutions can be seen as a natural "better" relation of the objective vectors, i.e., better or equal on all the objectives, and better on at least one objective [154].

There are hundreds of quality indicators in the literature [75], with some measuring closeness of the solution set to the Pareto front, some gauging diversity of the solution set, some considering a comprehensive evaluation of the solution set, etc. The SBSE community benefits from this prosperity. A common practice in SBSE is to use some well-established quality indicators, such as hypervolume (HV) [153] and inverted generational distance (IGD) [26], to evaluate the obtained solution sets. However, some indicators may not be appropriate when it comes to practical SE optimization scenarios. For example, since the Pareto front of a practical SBSE problem is typically unavailable, indicators that require a reference set that well represents the problem's Pareto front, such as IGD, may not be well suited [75].
More importantly, specific SBSE problems usually have their own nature and requirements. Simply following indicators that were designed for general Pareto-based optimization may fail to reflect these requirements. Take the software product line configuration problem as an example. In this problem, the objective of a product's correctness is always prioritized above other objectives (e.g., cost and richness of features). Equally rating these objectives by using generic indicators like HV (which in fact has been commonly practiced in the literature [64], [93], [118], [120]) may return the DM meaningless solutions, i.e., invalid products with good performance on the other objectives. This situation also applies to the test case generation problem, where the DM may first favor full code coverage and then the others (e.g., low cost).

Moreover, some SBSE problems may be associated with the DM's explicit/implicit assumptions or preferences between the objectives. Researchers are expected to select indicators bearing these assumptions/preferences in mind. For instance, in many SE scenarios, the DM may prefer well-balanced trade-off solutions (i.e., knee points on the Pareto front) between conflicting objectives. An example is that when optimizing the conflicting non-functional qualities of a software system (e.g., latency and energy consumption), knee points are typically the most preferred solutions, as in such a case it is often too difficult, if not impossible, to explicitly quantify the relative importance between objectives. Under this circumstance, quality indicators that treat all points on the Pareto front equally (such as IGD) may not be able to reflect this preference, despite the fact that they have been frequently used in such scenarios [37], [89].
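Since knee points recur throughout this paper as a preference assumption, the following minimal sketch (our illustration, not material from the surveyed studies) shows one common way to identify a knee on a bi-objective nondominated front: the point farthest from the line joining the two extreme solutions, measured after per-objective normalization.

# A minimal sketch (ours) of one common knee-point definition for a
# bi-objective nondominated set under minimization: after normalizing each
# objective to [0, 1], the knee is the point farthest from the line joining
# the two extreme solutions. The distance is left unscaled by the line
# length, which does not change the argmax.
def knee_point(front):
    xs, ys = zip(*front)
    def norm(p):  # normalize each objective to [0, 1]
        return ((p[0] - min(xs)) / (max(xs) - min(xs) or 1),
                (p[1] - min(ys)) / (max(ys) - min(ys) or 1))
    pts = [norm(p) for p in front]
    (x1, y1), (x2, y2) = min(pts), max(pts)  # the two extreme solutions
    def dist_to_line(p):
        return abs((y2 - y1) * p[0] - (x2 - x1) * p[1] + x2 * y1 - y2 * x1)
    return front[max(range(len(front)), key=lambda i: dist_to_line(pts[i]))]

# (1.0, 1.0) offers the most balanced trade-off among these points.
print(knee_point([(0.0, 5.0), (1.0, 1.0), (5.0, 0.0)]))

Other knee definitions exist (e.g., based on marginal trade-off rates); which one applies depends on the DM's notion of "balanced".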
Finally, the study of quality indicator selection itself in multi-objective optimization is in fact a non-trivial task. Each indicator has its own specific quality implication, and the variety of indicators in the literature can easily overwhelm the researchers and practitioners in the field. On the one hand, an accurate categorization of quality indicators is of high importance; failing to do so can easily result in a misleading understanding of search algorithms' behavior, see [74]. On the other hand, even under the same category, different indicators have distinct quality implications, e.g., IGD prefers uniformly distributed solutions whereas HV is in favor of knee solutions. A careful selection needs to be made to ensure that the considered quality indicators are in line with the DM's preferences. In addition, many quality indicators involve critical parameters (e.g., the reference point in the HV indicator). It remains unclear how to properly set these parameters under different circumstances, particularly in the presence of the DM's preferences.

Given the above, this paper aims to systematically survey and justify some of the overwhelming issues when evaluating solution sets in SBSE, and more importantly, to provide a systematic and methodological guidance for selecting/using evaluation methods and quality indicators in various Pareto-based SBSE scenarios. Such guidance is of high practicality to the SE community, as research from the well-established community of multi-objective optimization may still be relatively new to SE researchers and practitioners. This is, to the best of our knowledge, the first work of its kind to specifically target the quality evaluation of solution sets in SBSE based on a theoretically justifiable methodology.

It is worth mentioning that recently there have been some attempts, from the perspective of empirical studies, to provide guidelines for quality indicator selection [4], [138]. Wang et al. [138] proposed a practical guide for SBSE researchers based on observations from experimental results on three real-world SBSE problems. Ali et al. [4] significantly extended that work and provided a set of guidelines based on observations from experimental results on nine SBSE problems from industrial, real-world and open-source projects. However, observations drawn from an empirical investigation of specific SBSE scenarios may not be generalizable. Indeed, different DMs may prefer different trade-offs between objectives, even for the same optimization problem, as nondominated solutions are in essence incomparable. Observations obtained on one (or some) scenario(s) are therefore difficult to transfer to other scenarios. As a result, a general and theoretically sound guidance based upon the DM's preferences is needed, since the fundamental goal of multi-objective optimization is to supply the DM with a set of solutions that are the most consistent with their preferences.

For the rest of the paper, we start by providing some background knowledge of multi-objective optimization and quality evaluation (Section 2). Then, we conduct a systematic survey of the SE problems that involve Pareto-based search (hence termed Pareto-based SBSE problems) across all phases of the classic Software Development Life Cycle (SDLC) [112], along with their problem nature, the DM's preferences, and the quality indicators and evaluation methods used (Sections 3 and 4). The survey covers 717 searched papers published between 2009 and 2019, from 36 venues in seven repositories, leading to 95 prominent primary studies in the SBSE community. This is followed by a critical review of the evaluation method selection and use in those primary studies, based on which we identify five important issues that have been significantly overlooked (Section 5). Then, we carry out an in-depth analysis of frequently used quality indicators in the area (Section 6), in order to make it clear which indicators fit which situation. Next, to mitigate the identified issues in future SBSE work, we provide a methodological guidance and procedure for selecting, adjusting, and using evaluation methods in various SBSE scenarios (Section 7).
The last three sections are devoted to threats to validity, related work, and conclusion, respectively.
2 PRELIMINARIES ON MULTI-OBJECTIVE OPTIMIZATION
Multi-objective optimization is an optimization scenario that considers multiple objectives/criteria simultaneously. Apparently, when comparing solutions in multi-objective optimization, we need to consider all the objectives of a given optimization problem. There are two commonly used terms to define the relations between solutions: Pareto dominance and weak Pareto dominance.

Without loss of generality, let us consider a minimization scenario. For two solutions a, b ∈ Z (Z ⊂ R^m, where m denotes the number of objectives), solution a is said to weakly dominate b (denoted as a ⪯ b) if a_i ≤ b_i for 1 ≤ i ≤ m. If, in addition, there exists at least one objective j on which a_j < b_j, we say that a dominates b (denoted as a ≺ b). A solution a ∈ Z is called Pareto optimal if there is no b ∈ Z that dominates a. The set of all Pareto optimal solutions of a multi-objective optimization problem is called its Pareto front.

2. For simplicity, we refer to an objective vector as a solution and the outcome of a multi-objective optimizer as a solution set.

The above relations between solutions can immediately be extended to relations between sets. Let A and B be two solution sets.

Relation 1 (Dominance between two sets [154]). We say that A dominates B (denoted as A ≺ B) if for every solution b ∈ B there exists at least one solution a ∈ A that dominates b.

Relation 2 (Weak dominance between two sets [154]). We say that A weakly dominates B (denoted as A ⪯ B) if for every solution b ∈ B there exists at least one solution a ∈ A that weakly dominates b.

We can see that the weak dominance relation between two sets does not rule out their equality, while the dominance relation does, but the latter also rules out the case where the two sets share some identical solutions. Thus, we may need another relation to define that A is generally better than B.

Relation 3 (Better relation between two sets [154]). We say that A is better than B (denoted as A ◁ B) if for every solution b ∈ B there exists at least one solution a ∈ A that weakly dominates b, but there exists at least one solution in A that is not weakly dominated by any solution in B.

The better relation ◁ reflects the most general assumption of the DM's preferences when comparing solution sets. However, the better relation may leave many solution sets incomparable, since it is very likely that the two sets contain solutions that are nondominated with each other. As typically the size of the Pareto front of a multi-objective optimization problem can be prohibitively large or even infinite, a solution set that can well represent the Pareto front is preferred, especially when the DM's preferences are unavailable.
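To make the set-level relations above concrete, here is a minimal Python sketch (ours, not from the paper), assuming minimization and solutions represented as equal-length tuples of objective values:

# A minimal sketch (ours) of the solution- and set-level relations defined
# above, assuming minimization and solutions given as tuples of objectives.

def weakly_dominates(a, b):
    """a weakly dominates b: better than or equal on every objective."""
    return all(ai <= bi for ai, bi in zip(a, b))

def dominates(a, b):
    """a dominates b: weakly dominates and strictly better somewhere."""
    return weakly_dominates(a, b) and any(ai < bi for ai, bi in zip(a, b))

def better(A, B):
    """A is better than B (Relation 3): every b in B is weakly dominated
    by some a in A, and at least one a in A is not weakly dominated by
    any b in B."""
    covers = all(any(weakly_dominates(a, b) for a in A) for b in B)
    strict = any(not any(weakly_dominates(b, a) for b in B) for a in A)
    return covers and strict

# Example: A is better than B, yet no single solution relation decides it.
A = [(1, 4), (2, 2), (4, 1)]
B = [(2, 4), (3, 3), (4, 1)]
assert better(A, B) and not better(B, A)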
This need for good representation leads to four quality aspects that we often care about [75]: Convergence, how close the solution set is to the Pareto front; Spread, how large the region covered by the set is; Uniformity, how evenly the solutions are distributed in the set; and Cardinality, how many (unique) nondominated solutions are in the set. Over the last three decades, numerous quality evaluation methods have been developed for these four aspects. Among them, quality indicators are the most popular ones [75]. They typically map a solution set to a real number that indicates one or more of the four quality aspects, defining a total order among solution sets for comparison.
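As a concrete example of an indicator mapping a solution set to a single real number, the sketch below (ours, not the paper's code) implements simplified forms of GD [133] and IGD [26], using average Euclidean distance (i.e., p = 1) to or from a reference set R:

# A minimal sketch (ours) of two common indicators, using average
# Euclidean distance. GD measures convergence of the set S to a reference
# front R; IGD swaps the roles of S and R, so it also reflects how well
# S covers R.
import math

def _dist(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def gd(S, R):
    """Generational Distance: mean distance from each s in S to its
    nearest reference point in R (lower is better)."""
    return sum(min(_dist(s, r) for r in R) for s in S) / len(S)

def igd(S, R):
    """Inverted Generational Distance: mean distance from each reference
    point in R to its nearest s in S (lower is better)."""
    return gd(R, S)

Note that IGD, unlike GD, requires R to represent the Pareto front well, which is precisely the requirement that, as discussed in the introduction, practical SBSE problems often cannot meet.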
3 REVIEW METHODOLOGY

Although we have sporadically witnessed several inappropriate evaluations of solution sets through our own experience of working in SBSE, the key challenge in this work remains to systematically understand what the trends of issues in evaluating solution sets in the SBSE community are, if any, so that a clarified guidance can be drawn thereafter. To this end, we first conduct a systematic literature review covering the studies published between 2009 and 2019. A reason that we consider this period is that one of the most well-known SBSE surveys (Harman et al. [50]) has covered the SBSE work from 1976 to 2008, and we try to cover the period that has not been reviewed by that work. In addition, since 2010 or so, there has been a rapidly increasing interest in using Pareto-based optimization techniques to deal with SBSE problems. Given these two reasons, we have chosen 2009 as the starting year of our review. Having said that, we do not aim to provide a complete review of all parts of the SBSE work, but specifically of the aspects related to the major trends in evaluating solution sets.

Our review methodology follows the best practice of a systematic literature review for software engineering [60], consisting of a search protocol, inclusion/exclusion criteria, a formal data collection process, and a pragmatic classification. Specifically, the review aims to answer three research questions (RQs):
• RQ1: What evaluation methods have been used to evaluate solution sets in SBSE? (What)
• RQ2: What are the reasons for, and the practice of, using the generic quality indicators? (Why and How)
• RQ3: In what domains and contexts have the evaluation methods been used? (Where)

As shown in Figure 1, our literature review protocol obtains inputs from various sources via automatic search, which we will elaborate in detail in Section 3.2. This gives us 3,156 studies including duplicates. We then removed any duplicated studies by automatically matching their titles, leading to 717 searched studies. Next, we filtered the searched studies by reading through their titles and abstracts using two simple filtering criteria:

• The paper is not relevant to SBSE.
• The paper does not investigate or compare multi-objective search/optimization.

A study was ruled out if it meets either of the above two filtering criteria. The aim of filtering is to reduce the found studies to a much smaller and more concise set, namely the candidate studies.
3. Patents, citation entries, inaccessible papers, and any other non-English documents were also eliminated.
Fig. 1: Systematic literature review protocol.

As can be seen, the process resulted in 298 candidate studies prior to manual search. Starting from the 298 studies, we adopted an iterative forward snowballing as suggested by Felizardo et al. [33], where the newly included studies (after filtering) were placed into the next snowballing round. The reason why we did not do backward snowballing is that we set a strict timescale on the studies searched within the last decade, and thus backward snowballing would too easily violate this requirement of timeliness. To avoid a complicated process, we relied on Google Scholar as the single source for forward snowballing, as it returns most of the searched results, as shown in Figure 1, and has been followed by other software engineering surveys [42]. The snowballing stopped when no new studies could be found, and it eventually led to 319 candidate studies, upon which the procedure for the full-text review began.

At the next stage, we reviewed all the 319 studies and temporarily included studies using the inclusion criteria from Section 3.3, which resulted in 167 candidate studies. We then applied the exclusion criteria (see Section 3.3) to filter the temporarily included studies, leading to 101 candidate studies. Using the cleaning criteria specified in Section 3.3, a further cleaning process was conducted to prune different studies that essentially report on the same work, e.g., journal papers extended from conference versions. All the processes finally produced 95 primary studies for data analysis and collection.

On these 95 primary studies, we conducted systematic and pragmatic data collection via three iterations, whose details are given in Sections 3.4 and 3.5. The summarized results are reported thereafter.
3.2 Search Strategy

From 12th to 19th August 2019, we conducted an automatic search over a wide range of scientific literature sources, including ACM Library, IEEE Xplore, ScienceDirect, SpringerLink, Google Scholar, DBLP and the SBSE repository maintained by the CREST research group at UCL.

We used a search string that aims to cover a variety of problem natures and application domains with respect to multi-objective optimization. Synonyms and keywords were properly linked via logical operators (AND, OR) to build the search term. The final search string is shown below:

("multi objective" OR "multi criteria" OR "Pareto based" OR "non dominated" OR "Pareto front") AND "search based software engineering" AND optimization

We conducted a full-text search on ACM Library, IEEE Xplore, ScienceDirect, SpringerLink, and Google Scholar, but relied on searching the title only for DBLP and UCL's SBSE repository, due to their restricted search features. Since DBLP's search feature cannot handle the whole search string, we paired each term in the first bracket with "search based software engineering" to run the searches independently and collected all results returned. We omitted "optimization" as it rarely appears together with "search based software engineering" in a title, and searching for it alone would produce many irrelevant results. For a similar reason, for the UCL SBSE repository, we searched each term from the first bracket independently, as it is known that all the studies in this source are SBSE related.

On all the sources except DBLP and UCL's SBSE repository, we tried two versions of the search string: one with a hyphen between the commonly used terms (e.g., "multi(-)objective") and another without. The returned results with the highest number of items were used. In particular, when searching on UCL's SBSE repository, the results of these two versions were combined, because, for example, "multi objective" and "multi-objective" would lead to different results there. We recorded all the results returned under semantically equivalent terms.
4. http://crestweb.cs.ucl.ac.uk/resources/sbse_repository
5. This is because their results are not mutually exclusive; e.g., on Google Scholar, "multi objective" would also return all the studies that contain "multi-objective", but not the other way around.
6. Admittedly, there is no metric that is able to well quantify the im-pact of a paper. Nevertheless, the citation count can indicate somethingabout a paper, e.g., its popularity.7. All the citations were counted by 23rd Nov 2019. off between the trend coverage and a reasonablyrequired effort for detailed data collections. This issimilar to a sampling of the literature with the aim togather the “representative” samples. This approachwas adopted by many works, such as [40] wherethey used the citation count from Google Scholar asa threshold to select studies for review, as we did inthis work.(b) It is not uncommon to see that software engi-neering surveys are conducted using some metricsto measure the “impact” of a work. For example,some restrict their work only at what the authorsbelieve to be premium venues [42], others use athreshold on the impact factors of the publishedjournals, e.g., Cai and Card [12] use . and Zouet al. [155] use . . In our case, it may not be a bestpractice to apply a metric at the venue level as theSBSE work is often multi-disciplinary (as we willshow in Table 2) — it is difficult to quantify the“impact” across communities. We, therefore, havetaken a measurement at the paper level based onthe citation counts from Google Scholar, which hasbeen used as the sole metric to differentiate betweenthe studies in some prior work [27], [40], [42].(c) Indeed, there is no rule to set the citationthreshold. The settings in this work were taken fromthe (rounded) average figure within the populationof the candidate studies. These may seem very highat the first glance probably due to two reasons:(i) by publication date, we meant the official datethat the work appears on the publisher’s webpage(for journal work, this means it has been given anofficial issue number). Yet, it is not uncommon thatmany studies are made citable as pre-prints beforethe actual publication, e.g., ICSE often has around6 months gap between notification and official pub-lication, and there is an even larger gap for somejournals. This has helped to accumulate citations.(ii) Google Scholar counts the citations made byany publicly available documents and self-citation,which can still be part of the impact but impliestheir citation count may be higher than those purelymade by peer-reviewed publications. Nevertheless,this could indeed pose a threat of construct validity,which we will discuss in Section 8.3) The study is a short paper, i.e., shorter than 8 pages(double column) or 15 pages (single column).4) The study is a review, survey, or tutorial.5) The study is published in a non-peer-reviewed pub-lic venue, e.g., arXiv.Finally, if multiple studies of the same research work arefound, we applied the following cleaning criteria to deter-mine if they should all be considered. The same procedure isapplied if the same authors have published different studiesfor the same SBSE approach, and thereby only significantcontributions are analyzed for the review. • All studies are considered if they report on the sameSBSE problem but different solutions. • All studies are considered if they report on the sameSBSE problem and solutions but have different as-
TABLE 1: Data collection items.
ID | Item | Questions
I1 | Author(s) | N/A
I2 | Year | N/A
I3 | Title | N/A
I4 | Venue (journal or conference) | N/A
I5 | Citation count | N/A
I6 | Indicator and method | RQ1
I7 | Stated quality aspects | RQ2
I8 | Reference point/front | RQ2
I9 | Number of objectives | RQ3
I10 | SBSE problem | RQ3
I11 | DM's preferences and contextual information | RQ3
3.4 Data Collection Items

The items to be collected when reviewing the details of the primary studies are shown in Table 1. We now describe their design rationales and the procedure to extract and classify the data for each item.

The data for I1 to I5 are merely used as the meta-information of the primary studies. I6, which answers RQ1, is the key item of our review. The evaluation method(s) used can be easily identified in a study, most commonly in the
Experiment section. In general, apart from identifying the evaluation methods used in each study, we also seek to classify them into the following four categories:

• Generic Quality Indicator: This refers to indicators that are designed to evaluate the quality of solution sets for generic multi-objective optimization problems (e.g., HV, IGD and Spread), as documented by Li and Yao [75]. Formally, a quality indicator is a metric that maps a set of solutions (i.e., solution vectors) to a real number that indicates one or several aspects of the solution set quality [74], [75], e.g., how close the set is to the Pareto front and how evenly the solutions are distributed in the set.
• Solution Set Plotting (SSP): This is a straightforward way to evaluate solution sets — visualizing the results by plotting them.
• Descriptive Objective Evaluation (DOE): This resorts to the direct statistical results of objective values, e.g., the best/mean/median of the solution set on each objective.
• Problem Specific Indicator (PSI): This refers to indicators that are not used for generic multi-objective optimization problems, but are specifically designed for a given SBSE problem.

I7 is heavily related to I6, but requires a more detailed inspection of the studies. By this means, we aim to collect information about, when a generic quality indicator is used, what quality aspect the study seeks to evaluate by it (for RQ2), which is the key reason why such an indicator is chosen. I7 is classified based on the four quality aspects of a solution set as concluded by Li and Yao [75], i.e., Convergence, Spread, Uniformity, and Cardinality. For each study, we first looked for the section where the generic quality indicators are explained; if no information was found, we then searched every place where the generic quality indicators are mentioned. We classify each indicator into the quality aspects based on whether their keywords have been clearly mentioned; otherwise, the indicator is marked as Unknown under I7 of the study.

For I8, we wish to understand how the generic quality indicators are used, as some of them require a reference point (e.g., HV) or a reference Pareto front (e.g., GD and IGD) in order to be used correctly (for
RQ2). This again follows a similar procedure to that of I7; when no such information can be found for an indicator that requires a reference, we marked Unknown under the indicator for I8 of the study.

To answer RQ3, I9 is rather straightforward, and understanding it can help us to know whether some evaluation methods are used appropriately, as some of them have limitations in terms of the number of objectives to be optimized. I10 is also relatively easy to identify, most commonly from the Introduction section, and we classify the Pareto-based SBSE problems into the SDLC phases of a classic waterfall model according to [112]. Note that we choose this model by no means to rely on its usefulness, but only because it is one of the oldest models and consists of very generic phases that allow us to showcase the categories of SBSE problems. Finally, to complete RQ3, I11 is crucial, as it enables us to assess whether the evaluation methods are used correctly, given the DM's preferences over the objectives and/or the contextual information, which is one of the core initiatives of this paper. To classify the preferences and contextual information, we followed a pragmatic classification coding:
• Contextual information: Every problem has its own nature and characteristics; there is no exception for Pareto-based SBSE problems. In general, such nature and characteristics of the problem form the contextual information, which is precise, clear, and explicitly stated as a fact in a study. For example, in the software product line configuration problem, many studies state that there is no doubt that the correctness objective has a higher priority than any other, as an invalid product has no value in practice.
• DM's preferences: The DMs often have preferences over certain objectives or are able to provide information about their relative importance and expectation. This may be, for example, "objective A is preferred as long as objective B has at least reached b"; or well-balanced solutions (a.k.a. knee solutions) are preferred. When the DM's preferences are aligned with contextual information, they are indeed similar. However, the key difference is that the contextual information is clear, and it is a hard requirement that is well acknowledged for the given Pareto-based SBSE problem, regardless of whether the DM explicitly states it or not. In contrast, the DM's preferences are often vague and imprecise, and cannot be generalized to all the scenarios of the given Pareto-based SBSE problem.
• Not specified: When neither of the above categories applies, the preference and contextual information is marked as Not specified.

Extracting the data for I11 focuses on understanding exactly what DM's preferences and contextual information are assumed in each study. This was achieved by inspecting the sections relevant to Problem Statement and Approach Design. If no information could be found there, we then looked for insights from the Experiment section. For example, when a single objective, which belongs to part of the search, is explicitly discussed and used to compare the peer approaches in the evaluation, it often reflects the assumptions of contextual information and/or DM's preferences in the study.
3.5 Data Collection Process

For each primary study identified, the data items from Table 1 were collected and classified based on the coding from Section 3.4. The first two authors of this paper reviewed the primary studies independently. The data and classification extracted by one author were checked by the other. Disagreements and discrepancies were resolved by discussion between the two authors or by consulting an additional researcher.

Following the strategy recommended in a recent survey [155], we adopted three iterations for the data collection process, detailed as below:
Iteration 1: This iteration aims to conduct an initial data collection to summarize the data and perform a preliminary classification. During the process, a notable difficulty between the authors was that evaluations using descriptive statistics and problem-specific indicators are hard to distinguish. This is due to the fact that most of them are not clearly stated in the studies, and there is a wide variety of problem-specific indicators across all Pareto-based SBSE problems (we found 34 of them in our review). Therefore, any study for which the authors suspected that these two types of methods might have been used, but could not be certain, was placed into a bin for further investigation in the next iteration. There were 26 studies in the bin when this iteration finished.
Iteration 2: In this iteration, the two authors checked the data and classification from each other to ensure consistency. A study was discussed during the process if either author had any concern about the data extracted. Any unresolved studies from the bin were also checked by the other author again. In particular, for each study in the bin, a common agreement on the descriptive statistics and problem-specific indicators used was reached via either discussion between the authors or consulting external researchers. Further reading to understand the nature of an evaluation method (for problem-specific indicators) was conducted when necessary. Apart from this, the other major discussions raised concerned certain generic quality indicators, due to two reasons: (i) some studies have indeed used generic quality indicators, but the actual name of the indicator is missing, although some detailed calculation has been provided, e.g., [48], [145]; (ii) some other studies have used completely different names to refer to the same quality indicator (or even invented their own name), e.g., [69], [147]. These cases required both authors to thoroughly inspect the detailed calculation of those indicators before reaching an agreement. Overall, a total of 60 studies were discussed in this iteration.
TABLE 2: The reviewed studies counts and venues (Searched / Candidate / Primary).

Journal
ACM Transactions on Software Engineering and Methodology | 15 | 10 | 8
Elsevier Information and Software Technology | 65 | 33 | 6
Elsevier Applied Soft Computing | 17 | 4 | 0
Springer Automated Software Engineering | 16 | 6 | 2
IEEE Transactions on Software Engineering | 79 | 41 | 9
Springer Empirical Software Engineering | 37 | 17 | 3
Elsevier Future Generation Computing Systems | 3 | 3 | 1
Springer Soft Computing | 11 | 4 | 0
IEEE Transactions on Evolutionary Computation | 13 | 3 | 2
IEEE Transactions on Services Computing | 6 | 3 | 3
Elsevier Journal of Systems and Software | 70 | 35 | 8
Elsevier Information Sciences | 18 | 7 | 3
Springer Requirements Engineering | 5 | 3 | 1
Springer Software Quality Journal | 12 | 7 | 2
Wiley Software Testing, Verification and Reliability | 4 | 2 | 1
Wiley Software: Practice and Experience | 8 | 4 | 1
Springer Software and Systems Modeling | 2 | 1 | 1
IEEE Transactions on Systems, Man, and Cybernetics | 2 | 2 | 1

Conference, Symposium and Congress
IEEE/ACM Conference on Software Engineering | 30 | 12 | 8
Springer Symposium on Search Based Software Engineering | 110 | 33 | 4
IEEE Congress on Evolutionary Computation | 21 | 8 | 0
IEEE/ACM Conference on Automated Software Engineering | 21 | 8 | 5
ACM Conference and Symposium on the Foundations of Software Engineering | 13 | 3 | 0
ACM Genetic and Evolutionary Computation Conference | 57 | 33 | 10
IEEE Conference on Software Testing, Verification and Validation | 20 | 11 | 3
ACM Symposium on Software Testing and Analysis | 8 | 5 | 3
IEEE/ACM Conference on Empirical Software Engineering and Measurements | 1 | 0 | 0
ACM Systems and Software Product Line Conference | 11 | 3 | 1
IEEE Conference on Web Services | 2 | 2 | 1
IEEE Conference on Software Maintenance | 15 | 1 | 1
IEEE Conference on Software Maintenance and Reengineering | 4 | 3 | 1
IEEE Workshop on Combining Modelling and Search-Based Software Engineering | 5 | 3 | 1
IEEE Conference on Requirements Engineering | 7 | 3 | 1
ACM Conference on Performance Engineering | 2 | 2 | 2
IEEE/ACM Conference on Program Comprehension | 3 | 2 | 1
IEEE Conference on Software Architecture | 4 | 2 | 1

Total | 717 | 319 | 95
Iteration 3: The process of the final iteration is similar to that of Iteration 1, but its goal is to eliminate any typos, missing labels, and errors. The extracted data for 11 studies, which contained errors, were corrected during the process.
4 RESULTS OF THE REVIEW
A breakdown of the studies identified with respect to the venues where they were published is shown in Table 2. As can be seen, the studies come from a wide range of conferences and journals, which are all reputable. It is worth noting that the results do not only include studies published in software engineering venues, but also those published in service, system and cloud engineering conferences/journals, as well as those in computational intelligence venues, as long as they are related to problems in the software engineering domain and comply with the inclusion/exclusion criteria.

Next, we report on the results collected from our systematic literature review, which would further motivate the remainder of our work.
TABLE 3: Acronyms of the evaluation methods.
Acronym | Full Name
HV [153] | Hypervolume
IGD [26] | Inverted Generational Distance
IGD+ [56] | Inverted Generational Distance+
GD [133] | Generational Distance
Spread (Δ) [29] | Spread
SP [121] | Spacing
ε [154] | ε-Indicator
NFS | Nondominated Front Size
CI [87] | Contribution Indicator
C (CS) | C Metric
AS [63] | Attainment Surface
ED | Euclidean Distance
ER | Error Rate
SSP | Solution Set Plotting
DOE | Descriptive Objective Evaluation
Fig. 2: Usage of evaluation methods in primary studies.

The usage of evaluation methods is presented in Figure 2, along with the details for every single primary study in Table A1 (in the appendix). As can be seen, a total of 13 generic quality indicators have been used in the primary studies (i.e., HV, IGD, GD, Spread, ε, NFS, CI, CS, AS, SP, ED, ER and IGD+). Explanations of all the acronyms can be found in Table 3. In particular, HV, IGD and GD are the top three most widely used generic quality indicators across almost all the SBSE problems, due presumably to their popularity as well as "inertia" (i.e., researchers tend to use indicators which were used before, even though they are not the best fit) [75].

There are also 45 primary studies using PSI; for example, MoJoFM [65] is a commonly used symmetric indicator for the software modularization problem, which aims to compare two resulting partitions of classes (i.e., two solutions), and is thus inapplicable to other optimization scenarios.
8. The raw data and minutes recorded during the discussion in the data collection process are publicly available at: https://github.com/taochen/sbse-qi.
TABLE 4: Descriptive Objective Evaluation (DOE) methods used.

DOE | Used in
Mean Fitness Value (MFV) | [76] [137]
Analytic Hierarchy Process (AHP) | [125] [126]
Mean, best, worst, median and/or statistical result of each objective for solutions in the population* | [88] [9] [137] [65] [6] [21] [134] [37] [58] [30] [82] [108] [142] [69] [114] [129] [128] [118] [34] [105] [7] [107] [76] [123] [1] [135] [141]
Best of one objective over the population while another is below certain thresholds* | [66]

* The mean of all repeated runs is reported. Note that a study could involve more than one DOE form.
Note that since the PSI are highly domain-dependent and are not explicitly designed for evaluating solution sets, we do not specify the usage details for every single one of them. We do, however, present which particular PSI is used under which context, as shown in Table 8, which we will elaborate in Section 4.3.3. In summary, we found a total of 34 different PSI over all the Pareto-based SBSE problems.

Apart from the quality indicators, SSP and DOE have also been overwhelmingly used, by 50 and 29 primary studies respectively, to evaluate solution sets. For SSP, we found only two sub-types: the Parallel Coordinate plot, which shows the solutions' objective values upon n parallel lines, where n is the number of objectives; and the Scatter Plot, which plots the solutions in the objective space. DOE involves more diverse forms, as shown in Table 4, including 27 cases that compare the mean, best, worst, median, or statistical results of each objective in the evaluation, along with the remaining five cases that use three other forms.
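As an illustration of the most common DOE form above, the following minimal sketch (ours, not taken from any primary study) computes per-objective descriptive statistics over a solution set:

# A minimal sketch (ours) of the most common DOE form in Table 4:
# per-objective best/worst/mean/median over a solution set, assuming
# minimization and solutions given as tuples of objective values.
import statistics

def doe_summary(solution_set):
    """Return {objective index: (best, worst, mean, median)}."""
    summary = {}
    for i, values in enumerate(zip(*solution_set)):
        summary[i] = (min(values), max(values),
                      statistics.mean(values), statistics.median(values))
    return summary

# Caveat (see Section 5): the "best" values on different objectives may
# come from different, mutually nondominated solutions, so such
# per-objective statistics can easily mislead.
print(doe_summary([(1, 4), (2, 2), (4, 1)]))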
As shown in Table 5, for those primary studies that made use of generic quality indicators, there is most commonly a clear statement about what quality aspect(s) they selected an indicator for, as appeared in 73 cases. This is also the reason and evidence that these studies used to justify their choices. However, there is still a considerable number of cases (59) that were marked as Unknown, i.e., no clear and explicit rationale for the choice has been provided. For the reference front used by GD and IGD, whilst in most cases the best Pareto front found by all algorithms (i.e., the nondominated solutions of the set consisting of the solutions produced by all algorithms) is used, many studies still do not explicitly declare such information. On the reference point used by HV, we see diverse ways of obtaining such a point, including using the worst objective value of all the solutions found, the boundary of the optimization problem in SBSE, and the nadir point of the Pareto front composed of all the solutions found.
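To illustrate why the reference point matters, below is a minimal sketch (ours, not the paper's code) of the bi-objective case of HV; the same solution set can score rather differently under different reference points:

# A minimal sketch (ours) of 2-D hypervolume with respect to a reference
# point r, assuming minimization: the area weakly dominated by the set
# and bounded by r (larger is better).
def hv_2d(S, r):
    # Keep only solutions that improve on r in both objectives.
    pts = sorted({(x, y) for x, y in S if x < r[0] and y < r[1]})
    area, prev_y = 0.0, r[1]
    for x, y in pts:
        if y < prev_y:                  # skip dominated points
            area += (r[0] - x) * (prev_y - y)
            prev_y = y
    return area

# The same set scores differently under different reference points, which
# is why studies should report how the point was chosen.
S = [(1, 4), (2, 2), (4, 1)]
print(hv_2d(S, (5, 5)), hv_2d(S, (10, 10)))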
TABLE 5: Summary of how generic quality indicators are used to evaluate solution sets in the primary studies.
Indicator | Stated Quality Aspects to Measure | Reference Point or Front
HV | Unknown (25), Q1∪Q2∪Q3 (15), Q1∪Q2 (3), Q1 (3), Q2∪Q3 (2) | Unknown (23), Worst values (10), Nadir point (9), Boundary (6)
IGD | Q1∪Q2∪Q3 (9), Q1 (4), Unknown (2), Q2∪Q3 (1) | Best Pareto front found (14), Unknown (2)
IGD+ | Q1∪Q2∪Q3 (1) | Unknown (1)
GD | Unknown (10), Q1 (6) | Best Pareto front found (10), Unknown (6)
Spread | Q2∪Q3 (9), Q2 (5) | N/A
ε-Indicator | Q1 (5), Unknown (2), Q1∪Q2∪Q3 (1) | N/A
NFS | Unknown (4), Q1∪Q4 (3) | N/A
CI | Unknown (5), Q1 (1) | N/A
CS | Unknown (4), Q1 (1) | N/A
AS | Unknown (4) | N/A
SP | Q3 (2), Q2∪Q3 (1) | N/A
ED | Unknown (3) | N/A
ER | Q1 (1) | Unknown (1)

Q1 = Convergence; Q2 = Spread; Q3 = Uniformity; Q4 = Cardinality. The number of primary studies is shown within the brackets.
Table 6 shows the numbers of objectives considered by the evaluation methods. On the generic quality indicators, we see that IGD has been used with the highest number of objectives, i.e., 15, and all of them (except SP and ER) have been used on bi-objective cases, two being the minimum number of objectives required to form a Pareto front. Whilst most of the generic quality indicators have been used in the bi- and tri-objective cases, a considerable number of them have been used with more than three objectives.

As for PSI, DOE, and SSP, we can observe that they are used over a relatively wider range of objective numbers compared with most of the generic quality indicators.
From Table 7, it is clear that our systematic review has revealed 21 distinct Pareto-based SBSE problems, which are spread across all six common phases of the SDLC. Notably, certain problems have attracted more attention than others, as evidenced by the much higher number of primary studies included, such as the software product line and the white/black-box test case generation problems. Among others, the software testing phase, as well as the deployment and maintenance phase, contain much more diverse problems than the other SDLC phases. This is probably because the nature of those problems, which are usually in later phases of the SDLC, fits the requirements of search-based optimization well.
9. Note that [16] studies three different problems.
TABLE 6: Summary of the numbers of objectives under which the evaluation methods (generic quality indicators, PSI, SSP and DOE) are used. The bracket shows the number of problems under which a pair of objective number and evaluation method is considered. Note that a study may consider problems with different numbers of objectives.
In Table 8, we summarize the assumptions on the DM’spreferences and contextual information about the objectivesfor each Pareto-based SBSE problem reviewed, and theevaluation methods used to compare different solution setsunder these contexts. As we can see, there are 17 cases,covering 25 primary studies, have made assumptions onDM’s preferences. The contextual information, in contrast,has been used in five cases over 20 primary studies. Theothers do not clearly state the assumptions in this regardand hence noted as
Not specified . One of the most notableobservations is that a particular SBSE problem may havemultiple, distinct assumptions about the DM’s preferencesand contextual information. In fact, most of the problemshave more than one assumption on the preferences/con-textual information, and particularly, the project schedulingproblem and software configuration/adaptation probleminvolve up to four different types of assumptions. Thisreflects the fact that many problems are complex and theactual preferences can be situation-dependent. Yet, it doesbring the requirements that all those situations need to becatered for. In contrast, problems such as effort estimationand requirement assignment have assumed only one typeof preferences/contextual information, which implies a rel-
TABLE 7: Pareto-based SBSE problems in different SDLC phases.
SDLC Phase | SBSE Problem | Description | Primary Studies
Planning | Effort Estimation | Optimize, e.g., accuracy and confidence interval, by changing the number of measured samples. | [88] [115]
Planning | Project Scheduling | Optimize, e.g., duration and cost, by assigning employees to the tasks of a software project. | [126] [125] [114] [44] [35] [24] [16]
Requirement Analysis | Requirement Assignment | Optimize, e.g., completeness and familiarity, by assigning requirements to different stakeholders for their reviews. | [76]
Requirement Analysis | Next Release Problem | Optimize, e.g., robustness and cost, by selecting stakeholders' requirements in the next release of software. | [36] [31] [48] [69] [147] [148] [16] [149]
Design | Software Modeling and Architecting | Optimize, e.g., cohesion and coupling, by modeling the object-oriented concept of the software and its architecture using standard notations. | [109] [9] [129] [128] [51] [11] [59]
Design | Software Product Line | Optimize, e.g., correctness and richness, by finding the concrete products from the feature model. | [118] [119] [120] [54] [52] [93] [144] [137] [16] [45] [143] [66] [113] [77] [53]
Implementation | Library Recommendation | Optimize, e.g., library linked-usage and semantic similarity, by prioritizing the libraries that meet the required functionality to be used in the codebase. | [99]
Implementation | Program Improvement | Optimize, e.g., execution time and number of instructions, by producing semantically preserved software code. | [141]
Implementation | Software Modularization | Optimize, e.g., modularization quality, cohesion and coupling, by placing different classes of code into different clusters. | [65] [6] [108] [37] [7]
Testing | Code Smell Detection | Optimize, e.g., coverage of bad examples and detection of good examples, by identifying the code and modules that could potentially cause issues. | [80]
Testing | Defect Prediction | Optimize, e.g., effectiveness and cost, by adjusting the source code components to be predicted by the model. | [15] [14] [23] [92]
Testing | Test Case Prioritization | Optimize, e.g., coverage of the code and cost of test, by ordering the test cases to be tested. | [5] [103] [83] [123] [32] [106] [127]
Testing | White Box Test Case Generation | Optimize, e.g., coverage of the code and cost of test, by identifying the test cases, inputs and test suites based on internal information of the software. | [2] [145] [150] [82] [58] [34] [100] [57] [102]
Testing | Black Box Test Case Generation | Optimize, e.g., length of the inputs, distance to the ideal inputs, and cost of test, by identifying the test cases, inputs and test suites without internal information about the software. | [124] [107] [3]
Deployment and Maintenance | Resource Management | Optimize, e.g., response time and cost, by changing the supported software and hardware resources such as in the Cloud environment. | [39] [68] [17]
Deployment and Maintenance | Software Configuration and Adaptation | Optimize, e.g., response time and energy consumption, by changing software-specific configurations, structure and connectors at design time or runtime. | [105] [43] [84] [67] [1] [19] [13]
Deployment and Maintenance | Program Manipulation | Optimize, e.g., response time and memory consumption, by changing the parametrized variables within the program code. | [142]
Deployment and Maintenance | Service Composition | Optimize, e.g., latency and cost, by mapping the concrete services into abstract services within a workflow. | [135] [134] [131] [21]
Deployment and Maintenance | Log Template Identification | Optimize, e.g., the frequency and specificity of the log message matched to a log template. | [86]
Deployment and Maintenance | Workflow Scheduling | Optimize, e.g., makespan and energy consumption, by assigning activities into a given application workflow. | [30]
Deployment and Maintenance | Software Refactoring | Optimize, e.g., number of defects found and semantics, by changing the design model or the program code. | [94] [91] [95] [96] [97] [98] [81] [90] [89]

5 ISSUES ON QUALITY EVALUATION IN PARETO-BASED SBSE
Based on our systematic literature review, this section provides a systematic analysis of five issues of quality evaluation, classified into two categories, from state-of-the-art Pareto-based SBSE work.
As shown in Figure 2 and Tables 8 and A1 (in the appendix), there exist many SBSE studies, particularly in the early days, that relied on plotting the solution set returned (SSP) and/or on reporting some DOE results to reflect the quality of solution sets. Despite these two methods being simple to apply, they may easily lead to inaccurate evaluations and conclusions.
ISSUE I: Inadequacy of Solution Set Plotting (SSP)

A straightforward way to evaluate/compare the quality of solution sets returned by search algorithms is to plot the solution sets and judge intuitively how good they are. Such visual comparison is among the most frequently used methods in SBSE, but it may not be very practical in many cases. First, it cannot scale up well — when the number of objectives is larger than three, the direct observation of solution sets (by scatter plot) is unavailable. Second, the visual comparison fails to quantify the difference between solution sets. Finally, when an algorithm involves stochastic
TABLE 8: Assumptions of DM's preferences and contextual information in the Pareto-based SBSE problems and the corresponding evaluation methods used.
SBSE Problem | Assumptions | Evaluation Methods
(P) marks an assumption on the DM's preferences; (C) marks an assumption based on contextual information; * marks a problem-specific indicator (PSI).
Effort Estimation | Not specified [88] [115] | DOE, SSP, CI, GD, HV
Project Scheduling | Not specified [16] [126] | SSP, DOE, Spread, GD, IGD, NFS, HV
Project Scheduling | (P) Prefer solutions that favor certain objectives using Analytic Hierarchy Process [125] | DOE, CS, SSP, GD, SP, Spread, HV
Project Scheduling | (P) Prefer knee solutions [114] [44] [35] | SSP, CI, GD, DOE, HV
Project Scheduling | (P) Prefer widely distributed solutions [24] | AS, HV
Requirement Assignment | Not specified [76] | DOE, SSP, HV
Next Release Problem | Not specified [36] [31] [48] [69] [147] [16] [149] | SSP, DOE, AS, NFS, GD, CI, Spread, HV, % of included requirements*
Next Release Problem | (P) Prefer extreme solutions [148] | SSP
Software Modeling and Architecting | Not specified [109] [9] [129] | DOE, SSP, SP, HV, % of within-range solutions*, % of equivalent solutions*
Software Modeling and Architecting | (P) Prefer knee solutions [59] | average correction*, manual correction*, recall*, precision*
Software Modeling and Architecting | (P) Prefer solutions that favor certain objectives as ranked by users [128] | DOE
Software Modeling and Architecting | (P) Prefer solutions that meet preferences in, e.g., the requirement documentation or the goal model [51] [11] | SSP
Software Product Line | Not specified [143] | HV, IGD+, SSP
Software Product Line | (C) Prefer solutions that favor the correctness objective over the others [118] [119] [120] [54] [52] [93] [144] [53] [16] [45] [66] [113] [77] | CS, Spread, NFS, SSP, ε-indicator, IGD, HV, DOE, number of required evaluations to find a valid solution*, full coverage ratio*, % of valid solutions*
Software Product Line | (P) Prefer balanced solutions [137] | DOE, SSP
Library Recommendation | Not specified [99] | GD, Spread, HV, SSP, accuracy*, precision*, recall*
Program Improvement | (C) Prefer the program validity [141] | DOE, SSP
Software Modularization | (C) Prefer solutions that favor the modularization quality objective over the others [65] [6] [108] | DOE, SSP, MoJoFM*
Software Modularization | (P) Prefer knee solutions [37] [7] | DOE, IGD, HV, precision*, recall*, manual precision*, difficulty to perform task by human*, possibility of manually fixing the bug in solution by human*, possibility of manually adapting the solution by human*
Code Smell Detection | Not specified [80] | IGD, HV, precision*, recall*
Defect Prediction | Not specified [15] [14] [23] [92] | SSP, HV, ACC*, Popt*, precision*, recall*, AUC*, cost of code inspection*
Test Case Prioritization | Not specified [5] [103] [83] [123] [32] [106] [127] | SSP, DOE, CS, ED, Spread, GD, ε-indicator, IGD, HV, NFS, APFD*, % of detected faults*
White Box Test Case Generation | Not specified [2] [145] [82] [34] [102] [100] | DOE, SSP, CI, GD, NFS, HV, AS, total coverage*
White Box Test Case Generation | (P) Prefer solutions that favor certain objectives as ranked by users, i.e., reference point [57] [58] | ED, SSP, R-HV, average number of solutions in the region of interest*
White Box Test Case Generation | (C) Prefer solutions that favor the coverage objective over the others [150] | SSP, HV
Black Box Test Case Generation | Not specified [107] [3] | DOE, HV, GD, Spread
Black Box Test Case Generation | (C) Prefer solutions that favor the coverage objective over the others [124] | SSP, p-measure*
Resource Management | Not specified [39] | IGD, HV
Resource Management | (P) Prefer knee solutions [68] [17] | SSP, CS, GD, elasticity*
Software Configuration and Adaptation | Not specified [105] [43] | DOE, GD, ε-indicator, IGD, HV, SSP
Software Configuration and Adaptation | (P) Prefer solutions that meet preferences from the natural descriptions of the stakeholders [84] [67] [1] | DOE, SSP, expected value of total perfect information*
Software Configuration and Adaptation | (P) Prefer knee solutions [19] | SSP, ED, HV, % of valid solutions*
Software Configuration and Adaptation | (P) Prefer robust solutions around a given region [13] | SSP, modified ε-indicator and IGD according to problem nature
Program Manipulation | Not specified [142] | DOE, AS, CI, HV, SSP
Service Composition | Not specified [135] [21] | DOE, HV, ε-indicator
Service Composition | (P) Prefer extreme solutions [134] [131] | DOE, SSP, IGD, HV, coefficient of variation of objective values*
Log Template Identification | (P)
Prefer knee solutions [86]
SSP , precision (cid:63) , recall (cid:63) , f-measure (cid:63)
Workflow Scheduling Not specified [30]
DOE , SSP , HV Software Refactoring Not specified [94] [91] [95] [96] [97][98] [81] [90]
SSP , IGD , precision (cid:63) , recall (cid:63) , defect correction ratio (cid:63) , reusedrefactoring (cid:63) , usefulness by human (cid:63) , % of fixed code smells (cid:63) ,code change score (cid:63) , manual precision (cid:63) , quality gains (cid:63) ,medium value of refactoring (cid:63) (P)
Prefer knee solutions [89]
SSP , IGD , precision (cid:63) , recall (cid:63) , manual precision (cid:63) , qual-ity gain (cid:63) , defect correction ratio (cid:63) , number of suggestedrefactoring (cid:63) , usefulness by human (cid:63)
All problem-specific indicators are listed in full and marked as ⋆. (P) denotes the DM's preferences; (C) denotes contextual information.
ISSUE II: Inappropriate Use of Descriptive Objective Evaluation (DOE)

Many Pareto-based SBSE studies evaluate solution sets by DOE, that is, by statistics of the objective values in the obtained solution set(s). For example, as can be seen in Table 4, the mean objective value was considered in [6], [7], [9], [102], [105], [107], [114], [118], [128], [129]; the median value in [6], [37], [58]; the best value in [9], [21], [30], [65], [82], [88], [100], [108], [134], [135], [141], [142]; the worst value in [9], [134]; and the statistical significance of the differences between distinct solution sets' objective values in [1], [76], [123], [137]. Such DOE measures need to be used in line with the DM's preferences. For example, comparing the best value of each objective can evaluate solution sets well if the DM prefers the extreme points (solutions), but may not be well-suited when balanced points are wanted, which, unfortunately, was practiced in some studies such as [137], as shown in Table 8. Worse still, many DOE measures may give a misleading evaluation, including those comparing the mean, median, and worst values of each objective and those comparing statistically significant differences on each objective. That is to say, by such a DOE measure a solution set is evaluated better than another set when, in fact, the latter would always be preferred by the DM. Figure 3 gives such an example (minimization) with respect to calculating the mean of each objective. As shown, the mean of the solution set A on either objective f1 or f2 is larger than that of the solution set B, thus A is regarded as inferior to B. Yet, A will always be favored by the DM, since there is one solution in A better than any solution in B.

Fig. 3: An example where comparing the mean on each objective fails to reflect the quality of solution sets. In this minimization problem, solution set A dominates solution set B (i.e., any solution in B is dominated by at least one solution in A), thus A is always favored by the DM. Yet, the mean of A on either objective f1 or f2 is larger than that of B; thus A is regarded as inferior to B.

On the other hand, as recalled from Table 4, some work in the primary studies considers selecting one particular solution (by using a decision-making method) from the whole solution set produced by the Pareto-based search for comparison. For example, the studies in [76], [137] considered the Mean Fitness Value (MFV) and the studies in [125] [126] considered the Analytic Hierarchy Process (AHP) [41]. However, this raises a question: if a clear weighting between the DM's objectives is known (so that only one solution from the whole solution set needs to be taken into account), why not directly integrate this information into the problem model, thereby converting the multi-objective problem into an easier single-objective problem in the first place?
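To make the pitfall concrete, the following minimal Python sketch uses hypothetical bi-objective minimization data (the exact values of Figure 3 are not reproduced here): the set A dominates the set B, yet a per-objective mean comparison ranks A as inferior.

```python
# Hypothetical minimization example in the spirit of Figure 3: the set A
# dominates B, yet comparing per-objective means ranks A as inferior.

def dominates(a, b):
    """True if solution a Pareto-dominates solution b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def set_dominates(A, B):
    """True if every solution in B is dominated by some solution in A."""
    return all(any(dominates(a, b) for a in A) for b in B)

A = [(1.0, 1.0), (10.0, 10.0), (11.0, 11.0)]
B = [(2.0, 2.0), (3.0, 3.0)]

mean = lambda S, i: sum(s[i] for s in S) / len(S)
print([mean(A, 0), mean(A, 1)], [mean(B, 0), mean(B, 1)])  # A's means are larger (worse)
print(set_dominates(A, B))                                 # ...yet A dominates B: True
```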
As disclosed in Tables 6, 8 and A1 (in the appendix), it has been commonly seen in Pareto-based SBSE studies that the selected or used quality indicators cannot accurately reflect the quality of solution sets. This is largely because SBSE researchers/practitioners may not be very clear about the indicators' behavior, role, and characteristics. This leads them either to fail to select appropriate indicators to evaluate the generic quality of solution sets, or to fail to align the considered indicators with the DM's preferences or the problem's contextual information.

ISSUE III: Confusion of the Quality Aspects Covered by Generic Quality Indicators
As mentioned, the generic quality of a solution set in Pareto-based optimization can be interpreted as how well it represents the Pareto front. It can be broken down into four aspects: convergence, spread, uniformity, and cardinality [75]. It is expected that when the DM's preferences are unknown a priori, an indicator (or a combination of indicators) covers all four quality aspects, since a solution set with these qualities can well represent the Pareto front and has a great probability of being preferred by the DM.

Unfortunately, as shown in Tables 6 and A1 (in the appendix), many studies in SBSE only consider part of these quality aspects. For example, the studies in [48], [69] used the convergence indicator GD [133] as the sole indicator to compare the solution sets. The study in [145] considered both PFS [52] and CI, which, however, cover merely convergence and cardinality. In addition, some indicators were used to evaluate certain quality aspect(s) of solution sets which, unfortunately, they were not designed for, as shown in [74]. For example, the indicator C [153], designed for convergence evaluation, was considered for evaluating spread in [138]. SP [121], which can only reflect the uniformity of solution sets, was used to evaluate the diversity (i.e., both spread and uniformity) in [109]. PFS, which counts the nondominated solutions in a set, was placed into the category of diversity indicators in [52], [138], and the ε-indicator, which is able to reflect all the quality aspects of solution sets, was placed into the category of convergence indicators in [138].

As can be seen from Tables 5 and 6, some indicators were also used incorrectly. For example, the indicator Spread (i.e., Δ in [29]) as well as its variants (e.g., GS [151]), which are only effective in the bi-objective case, were frequently used on problems with more than two objectives. In addition, IGD and Spread require the Pareto front to consist of uniformly-distributed points, while HV and the ε-indicator do not [75]. Therefore, IGD and Spread may not be very suitable in SBSE, despite the fact that they were frequently used, e.g., in [31], [35], [52], [114], [125], [126], [147].
ISSUE IV: Oblivion of Context Information
As shown in Table 8, many studies in Pareto-based SBSE compare solution sets without bearing in mind the contextual information of the considered optimization problem. They typically adopt commonly-used quality indicators to directly evaluate the set of all the obtained solutions, although some of these solutions may never or rarely be of interest to the DM. Figure 4 shows such an example, under a scenario of optimizing the code coverage and the cost of testing time on the software test case generation problem, borrowed from [74]. As can be seen, the set B is evaluated better than the set A by all eight quality indicators commonly used in SBSE [138] (GD [133], ED [25], ε-indicator [154], GS [151], PFS [52], IGD [26], HV [153] and C [153]). However, depending on the context (as shown in Table 8), the DM might first favor full code coverage and then possibly low cost [150]. This makes the set A of more interest, as it contains a solution that achieves full coverage at a lower cost than the full-coverage solution in B.

Fig. 4: An example where the lack of considering contextual information may give unwanted evaluation results [74], comparing two solution sets (A and B) for optimizing the code coverage and the cost of testing time on the software test case generation problem. B is evaluated better than A on eight frequently-used indicators, e.g., GD and ED.

Similar observations have been made for optimal product selection in software product lines [16], [45], [52], [66], [77], [113], [119], [120], [132], where the correctness of configurations is regarded as one objective and equally rated as the other objectives (e.g., richness of features and cost). This may lead to an invalid product being evaluated better than a valid product if the former performs better on the other objectives, which is apparently of no value to the DM. In addition, in many SBSE problems, cost could be an objective to minimize, but solutions with zero cost are trivial, e.g., the solution with zero cost and zero coverage in Figure 4. However, these solutions may largely affect the evaluation results. Therefore, it is necessary to remove solutions that would never be of interest to the DM before the evaluation, which, unfortunately, has been rarely practiced in Pareto-based SBSE.
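The remedy suggested above, discarding solutions the DM would never consider before applying any indicator, can be sketched as follows; the (cost, coverage) pairs and the filtering rule are illustrative assumptions, not the data of Figure 4.

```python
# Illustrative pre-filtering by contextual information (Issue IV):
# drop trivial zero-coverage solutions before any indicator is computed.
# The solutions below are hypothetical (cost, coverage) pairs.

def filter_by_context(solutions, min_coverage=0.0):
    """Keep only solutions the DM could plausibly accept, i.e., those
    achieving strictly more than the given coverage threshold."""
    return [s for s in solutions if s[1] > min_coverage]

B = [(0, 0.0), (100, 0.3), (200, 0.6), (350, 0.8), (500, 1.0)]
print(filter_by_context(B))  # the trivial (0, 0.0) solution is removed
```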
ISSUE V: Noncompliance of the DM's Preferences

Although every quality indicator is designed to reflect certain quality aspect(s) of solution sets (i.e., convergence, spread, uniformity, cardinality, or their combination), indicators do have their own implicit preferences. For example, the indicators HV and IGD, both designed to cover all four quality aspects, have rather distinct preferences: HV prefers the knee points of a solution set, while IGD is in favor of a set of uniformly distributed solutions. Therefore, it is important to select indicators whose preferences are in line with the DM's; neglecting this can lead to misleading evaluation results. Figure 5 gives such an example: when preferring knee points, considering the indicator IGD could return a misleading result, i.e., the set having knee points is evaluated worse than the set having no knee point. Similar observations also apply to the indicators GD and CI, as shown in the same figure.

Fig. 5: An example that preferring knee points while using the indicator IGD (as well as GD and CI) can lead to misleading results. Consider two solution sets (A and B) for a bi-objective minimization scenario, where A consists of the two knee points of the Pareto front and B consists of three well-distributed non-knee points on the Pareto front. Apparently, if the DM prefers knee points then solutions in A will certainly be selected. Yet, A is evaluated worse than (or equal to) B by IGD, GD and CI (e.g., GD(A) = GD(B) = 0, as both sets lie on the Pareto front). In contrast, the indicator HV reflects this preference, with A evaluated better than B.

Unfortunately, as revealed in Table 8, such misuse of indicators is not uncommon in the SBSE community: for example, preferring knee points yet using IGD in [37], [89]; preferring knee points yet using GD and CI in [35], [114]; and preferring extreme solutions yet using HV and IGD in [131], [134]. HV can be somewhat in favor of extreme solutions if the reference point is set far away from the considered set, but IGD certainly does not prefer extreme solutions. Therefore, it is of high importance to understand the behavior, role, and characteristics of the considered indicators, which may not be very clear to the community. In the next section, we will detail widely used quality evaluation methods in the area (as well as other useful indicators) and explain the scope of their applicability.

REVISITING QUALITY EVALUATION FOR PARETO-BASED OPTIMIZATION
In Pareto-based optimization, the general goal for the algorithm designer is to supply the DM with a set of solutions from which they can select their preferred one. In general, the actual preferences can either be articulated by the DM or derived from the contextual information of the problem, which may differ depending on the situation. Having said that, the Pareto dominance relation is apparently the foremost criterion in any case, provided that the concept of optimum is solely based on the direct comparison of solutions' objective values (rather than on other criteria, e.g., robustness and implementability with respect to decision variables). That is to say, the DM would never prefer a solution over one that dominates it.

As discussed in Section 2, the better relation (◁) represents the most general and weakest form of superiority between two sets. That is, for two solution sets A and B, A ◁ B indicates that A is at least as good as B, while B is not as good as A. It meets any preference potentially articulated by the DM. If A ◁ B, then it is always safe for the DM to consider only solutions in A. Apparently, it is desirable that a quality evaluation method be able to capture this relation; that is to say, for any two solution sets A and B, if A ◁ B, then A is evaluated better than B. Unfortunately, there are very few quality evaluation methods holding this property; HV is one of them [152]. There is a weaker property called being Pareto compliant [62], [154], which is more commonly used in the literature. That is, a quality evaluation method is said to be Pareto compliant if and only if "at least as good" in terms of the dominance relation implies "at least as good" in terms of the evaluation values (i.e., ∀A, B: A ⪯ B ⇒ I(A) ≤ I(B), where I is the evaluation method, assuming the smaller the better). Despite the relaxation, many quality indicators are not Pareto compliant, including widely used ones such as GD, IGD, Spread, GS, and SP. Pareto compliant indicators are mainly those falling into the category of evaluating the convergence of solution sets (e.g., C and CI) and the category of evaluating the comprehensive quality of solution sets (e.g., HV, ε-indicator, IPF [10], R [46], and PCI [72]).
DCI [70] is the only known diversity indicator compliant with Pareto dominance when comparing two sets. In addition, some non-compliant indicators can become Pareto compliant after some modifications. For example, GD and IGD can be transformed into two Pareto compliant indicators (called GD+ and IGD+) by considering a "superiority" distance instead of the Euclidean distance between points [56]. Overall, it is highly recommended to consider (at least) Pareto compliant quality indicators to evaluate solution sets; otherwise, the basic assumption about the DM's preferences may be violated, namely, recommending to the DM a solution set B over A where each solution in B is inferior to, or can be replaced by (in the case of equality), some solution in A. This is exactly what the DOE evaluation method that compares the mean on each objective does in the example of Figure 3.

Now, one may ask why not directly use the better relation to evaluate solution sets. The reason is that the better relation may leave many solution sets incomparable, since in most cases there exist solutions from different sets that are nondominated to each other. Therefore, we need stronger assumptions about the DM's preferences, which are reflected by quality evaluation methods. However, stronger assumptions (than the better relation) cannot guarantee that the favored set (under the assumptions) is certainly preferred by the DM, as in different situations the DM may indeed prefer different trade-offs between objectives. Consequently, it is vital to ensure that the considered evaluation methods are in line with the DM's explicit or implicit preferences.

Back to the example in Figure 4, where the objectives code coverage and cost of testing time are optimized: essentially, these two solution sets are not comparable with respect to the better relation, despite the fact that most solutions in A are dominated by some solution in B. As stated, the DM may be more interested in full code coverage first and then possibly lower cost, thus preferring A to B. However, the considered eight indicators fail to capture this information and give the opposite result. This clearly indicates the importance of understanding quality evaluation methods (including what kind of assumptions they imply). Next, we review several quality evaluation methods which are commonly used in the SBSE community (as shown in Table 2) and which, at the same time, are representative in reflecting certain aspect(s) of solution set quality.

Descriptive Objective Evaluation (DOE)

As stated before, the DOE methods evaluate a solution set (or several sets obtained by a search algorithm in multiple runs) by directly reporting statistical results of the objective values of its solutions, such as the mean, median, best, and worst values, and statistical significance (in comparison with other sets). Unfortunately, such methods are rarely Pareto compliant and unlikely to be associated with the DM's preferences. An exception is the method that considers the best value of some objective(s) in a solution set, since it is Pareto compliant and able to directly reflect the DM's preferences in the case that they prefer extreme solutions. Overall, the DOE methods are not recommended, unless the DM explicitly expresses preferences in line with them.
Contribution Indicator (CI)

The CI indicator [87], which was designed to compare the convergence of two solution sets, has been frequently used in SBSE, e.g., in [35], [114], [115], [142], [145], [149]. CI calculates the ratio of the solutions of a set that are not dominated by any solution in the other set. Formally, given two sets A and B,

CI(A, B) = \frac{|A \cap B|/2 + |A_{\prec B}| + |A_{\parallel B}|}{|A \cap B| + |A_{\prec B}| + |A_{\parallel B}| + |B_{\prec A}| + |B_{\parallel A}|}    (1)

where A_{\prec B} stands for the set of solutions in A that dominate some solution of B (i.e., A_{\prec B} = {a ∈ A | ∃b ∈ B : a ≺ b}), and A_{\parallel B} stands for the set of solutions in A that do not weakly dominate any solution in B and also are not dominated by any solution in B (i.e., A_{\parallel B} = {a ∈ A | ∄b ∈ B : a ⪯ b ∨ b ≺ a}).

The CI value is in the range [0, 1], and a higher value is preferable. It is apparent that CI(A, B) + CI(B, A) = 1. A clear strength of the indicator CI is that it holds the better relation (it is not unusual for binary indicators, i.e., those directly comparing two sets, to hold the better relation [75]); that is, if A ◁ B then CI(A, B) > CI(B, A). Moreover, if A ≺ B, then CI(B, A) = 0. In addition, apart from comparing the convergence of solution sets, CI can reflect their cardinality to some extent: a set having a larger number of solutions is likely to be favored by the indicator.

A clear weakness of CI is that it relies completely on the dominance relation between solutions, thus providing little information about the extent to which one set outperforms another. Moreover, it may leave many solution sets incomparable if all solutions from the sets are nondominated to each other, which may happen frequently in many-objective optimization, where more objectives are considered.

There is another well-known dominance-based quality indicator (called C or CS) [153], used in, e.g., [5], [17], [125], [146]. It measures the proportion of solutions in a set that are weakly dominated by some solution in the other set; in other words, the percentage of a set that is covered by its opponent. The details of the indicator C can be found in [75]. C tends to be more popular in the multi-objective optimization community, despite sharing the above strengths and weaknesses with CI. Finally, it is worth mentioning that, despite only partially reflecting the convergence of solution sets, such dominance-based indicators are useful, since most problems in SBSE are combinatorial ones, where the size of the Pareto front may be relatively small and comparable solutions (i.e., dominated/duplicate solutions) from different sets are likely [75].
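A direct reading of Equation (1) in code may help; the following Python sketch assumes minimization and represents solutions as tuples of objective values (the two example sets are hypothetical):

```python
# A sketch of the contribution indicator CI of Equation (1); minimization
# is assumed and solution sets are lists of objective-value tuples.

def weakly_dominates(a, b):          # a ⪯ b
    return all(x <= y for x, y in zip(a, b))

def dominates(a, b):                 # a ≺ b
    return weakly_dominates(a, b) and a != b

def ci(A, B):
    inter = [a for a in A if a in B]                             # A ∩ B
    a_dom = [a for a in A if any(dominates(a, b) for b in B)]    # A's dominating part
    b_dom = [b for b in B if any(dominates(b, a) for a in A)]    # B's dominating part
    a_nc = [a for a in A                                         # A's incomparable part
            if not any(weakly_dominates(a, b) or dominates(b, a) for b in B)]
    b_nc = [b for b in B                                         # B's incomparable part
            if not any(weakly_dominates(b, a) or dominates(a, b) for a in A)]
    num = len(inter) / 2 + len(a_dom) + len(a_nc)
    den = len(inter) + len(a_dom) + len(a_nc) + len(b_dom) + len(b_nc)
    return num / den

A = [(1.0, 3.0), (3.0, 1.0)]         # hypothetical sets
B = [(2.0, 4.0), (4.0, 4.0)]
print(ci(A, B), ci(B, A))            # 1.0 and 0.0; the two values sum to 1
```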
Generational Distance (GD)

As one of the most widely used convergence indicators in SBSE (used in, e.g., [2], [5], [17], [35], [99], [105], [114], [115], [125], [147]), GD [133] measures how close the obtained solution set is to the Pareto front. Since the Pareto front is usually unknown a priori, a reference set R, which consists of the nondominated solutions of the collection of solutions obtained by all the search algorithms considered, is typically used to represent the Pareto front in practice. Formally, given a solution set A = {a_1, a_2, ..., a_n}, GD is defined as

GD(A) = \frac{1}{n} \left( \sum_{i=1}^{n} \big( \min_{r \in R} d(a_i, r) \big)^{p} \right)^{1/p}    (2)

where d(a_i, r) is the Euclidean distance between a_i and r, and p is a parameter determining what kind of mean of the distances is used, e.g., the quadratic mean or the arithmetic mean.

The GD value is to be minimized, and the ideal value is zero, which indicates that the set is precisely on the Pareto front. In the original version, the parameter p was set to 2. Unfortunately, this makes the evaluation value rather sensitive to outliers and also affected by the size of the solution set (when n → ∞, GD → 0 even if the set is far away from the Pareto front [122]). Setting p = 1 has now been commonly accepted.

Compared to the dominance-based convergence indicators (e.g., CI and C), GD is more accurate in measuring the closeness of solution sets to the Pareto front, since it considers the distance between points. However, a clear weakness of GD is that it is not Pareto compliant [61], [154]. This is very undesirable, since GD, as a convergence indicator, fails to provide reliable evaluation results with respect to the weakest assumption about the DM's preferences. A simple example was given in [154]: a set A whose sole solution dominates the sole solution of another set B can nevertheless be evaluated worse than B by GD under certain reference sets. Recently, a modified GD was proposed to overcome this issue, called GD+ [56], where the Euclidean distance between a_i and r in Equation (2) is replaced by one that considers only the objectives on which r is superior to a_i. Specifically,

d^{+}(a_i, r) = \left( \sum_{j=1}^{m} \big( \max\{a_{ij} - r_j, 0\} \big)^{2} \right)^{1/2}    (3)

where m denotes the number of objectives and a_{ij} denotes the value of solution a_i on the j-th objective. This modification makes the indicator compliant with Pareto dominance; in the above example, GD+ evaluates A better than B.
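Under the same conventions (minimization, solutions as tuples), GD with p = 1 and its compliant variant GD+ can be sketched as follows; the example set and reference set are hypothetical:

```python
import math

# Sketches of GD (Equation (2), with p = 1) and GD+ (Equation (3));
# minimization is assumed, and R is the Pareto front representation.

def gd(A, R, p=1):
    dists = [min(math.dist(a, r) for r in R) for a in A]
    return (sum(d ** p for d in dists)) ** (1.0 / p) / len(A)

def d_plus(a, r):
    """Equation (3): only objectives on which a is worse than r count."""
    return math.sqrt(sum(max(ai - ri, 0.0) ** 2 for ai, ri in zip(a, r)))

def gd_plus(A, R):
    return sum(min(d_plus(a, r) for r in R) for a in A) / len(A)

A = [(2.0, 3.0), (4.0, 1.5)]         # hypothetical set
R = [(1.0, 2.0), (2.0, 1.0)]         # hypothetical reference set
print(gd(A, R), gd_plus(A, R))
```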
Spread (Δ)

Spread (aka Δ) [29] and its variants [125], [151] have been commonly adopted to evaluate the diversity (i.e., spread and uniformity) of solution sets in the field, e.g., in [31], [52], [99], [118], [120], [125], [126], [147]. Specifically, the indicator Δ of a solution set A (assuming the set consists only of nondominated solutions) in a bi-objective scenario is defined as

\Delta(A) = \frac{d_{upper} + d_{bottom} + \sum_{i=1}^{n-1} |d_i - \bar{d}|}{d_{upper} + d_{bottom} + (n-1)\bar{d}}    (4)

where n denotes the size of A, d_i (i = 1, 2, ..., n−1) is the Euclidean distance between consecutive solutions in A, and d̄ is the average of all the distances d_i. d_upper and d_bottom are the Euclidean distances between the two extreme solutions of A and the two extreme points of the Pareto front, respectively.

A small Δ value is preferred, indicating a good distribution of the set in terms of both spread and uniformity. Δ = 0 means that the solutions in the set are equidistantly spaced and their boundaries reach the Pareto front extremes.

A major weakness of Δ (including its variants) is that it only works reliably on bi-objective problems, where nondominated solutions are located consecutively on either objective. With more objectives, the neighbor of a solution on one objective may be far away on another objective [71]. This issue applies to any distance-based diversity indicator [75]. For problems with more than two objectives, region division-based diversity indicators are more accurate [75]. They typically divide the space into many equal-sized cells and then consider cells instead of solutions (e.g., counting the number of these cells), based on the fact that a set of more diversified solutions usually populates more cells. However, such indicators may suffer from the curse of dimensionality, as they typically need to record information for every cell. In this regard, the diversity indicator DCI [70] may be a pragmatic option, since its calculation involves only non-empty cells and is thus independent of the number of cells (its computational cost increases linearly with the objective dimensionality).

In addition, the indicator Spacing (aka SP) [121] has also been used to evaluate the diversity of solution sets, e.g., in [109], [125]. However, this indicator can only reflect the uniformity (not the spread) of solution sets [75].

Nondominated Front Size (NFS)

Used in, e.g., [52], [103], [145], the NFS (also called Pareto Front Size, PFS) simply counts how many nondominated solutions are in the obtained solution set. However, this indicator may not be very practical, as in many cases all solutions in the obtained set are nondominated to each other, particularly in many-objective optimization. In addition, since by definition duplicate solutions are nondominated to each other, a set full of duplicate solutions would be evaluated well by NFS if there is no other solution in the set dominating them.

As such, a measure that only considers unique nondominated solutions which are not dominated by any other set seems more reasonable. Specifically, one can consider the ratio of the number of such solutions in each set to the size of the reference set (which consists of the unique nondominated solutions of the collection of solutions obtained by all the algorithms). In other words, we quantify the contribution of each set to the combined nondominated front of all the sets. Formally, let A_unf be the unique nondominated front of a given solution set A (i.e., A_unf ⊆ A ∧ A_unf ⪯ A ∧ ∀a_i ∈ A_unf, ∄a_j ∈ A_unf, j ≠ i : a_j ⪯ a_i). Then the indicator, denoted as the Unique Nondominated Front Ratio (UNFR), is defined as

UNFR(A) = \frac{|\{a \in A_{unf} \mid \nexists r \in R_{unf} : r \prec a\}|}{|R_{unf}|}    (5)

where R_unf denotes the reference set consisting of the unique nondominated solutions of the collection of all the solutions produced.

The UNFR value is in the range [0, 1], and a higher value is preferred. A value of zero means that for any solution in A there always exists some better solution in the other sets; a value of one means that for any solution in the other sets there always exists some solution in A better than (or at least equal to) it (i.e., the reference set is comprised precisely of solutions of A). In addition, UNFR is Pareto compliant.
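The following sketch implements the two set-level measures just introduced, Δ of Equation (4) for bi-objective sets and UNFR of Equation (5); the Pareto front extremes and the example sets are hypothetical:

```python
import math

# Sketches of Spread Δ (Equation (4), bi-objective only) and UNFR
# (Equation (5)); minimization is assumed throughout.

def spread_delta(A, ext1, ext2):
    """Δ of a bi-objective set A; ext1/ext2 are the Pareto front extremes."""
    S = sorted(A)                                    # order along f1
    d = [math.dist(S[i], S[i + 1]) for i in range(len(S) - 1)]
    d_bar = sum(d) / len(d)
    d_upper, d_bottom = math.dist(S[0], ext1), math.dist(S[-1], ext2)
    return (d_upper + d_bottom + sum(abs(x - d_bar) for x in d)) / \
           (d_upper + d_bottom + (len(S) - 1) * d_bar)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def unfr(A, all_sets):
    """Share of the combined unique nondominated front contributed by A."""
    pool = {s for S in all_sets for s in S}
    r_unf = {s for s in pool if not any(dominates(t, s) for t in pool)}
    return len(set(A) & r_unf) / len(r_unf)

A = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]             # hypothetical
print(spread_delta(A, (0.0, 1.0), (1.0, 0.0)))       # 0.0: ideal distribution
B = [(0.25, 0.8), (2.0, 2.0)]
print(unfr(A, [A, B]), unfr(B, [A, B]))              # 0.75 and 0.25
```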
Inverted Generational Distance (IGD)

IGD [26] is a well-known indicator in the field (e.g., in [5], [13], [32], [37], [39], [43], [52], [80], [89], [91], [126], [131]). As the name suggests, IGD, an inversion of GD, measures how close the Pareto front is to the obtained solution set. Formally, given a solution set A and a reference set R, IGD is calculated as

IGD(A) = \frac{1}{|R|} \sum_{r \in R} \min_{a \in A} d(r, a)    (6)

where d(r, a) is the Euclidean distance between r and a. A low IGD value is preferable.

IGD is capable of reflecting the quality of a solution set in terms of all four aspects: convergence, spread, uniformity, and cardinality. However, a major weakness of IGD is that its evaluation results heavily depend on the behavior of its reference set. A reference set of densely and uniformly distributed solutions along the Pareto front is required; otherwise, IGD can easily return misleading results [75]. This is particularly problematic in SBSE, since the reference set is normally created from the collection of all the obtained solutions, so its distribution cannot be controlled. Consider the example in Figure 6(a), which compares two solution sets A and B. The reference set is comprised of all the nondominated solutions, i.e., the three solutions of A and the two boundary solutions of B. As can be seen, B performs significantly worse than A in terms of convergence, with its solutions being either dominated by some solution in A or slightly better on one objective but much worse on the other; thus B is unlikely to be preferred by the DM. However, IGD gives the opposite evaluation, with IGD(A) > IGD(B).

In addition, the way the reference set is created makes IGD prefer a specific distribution pattern, namely the one consistent with the majority of the considered solution sets [75]. In other words, if a solution set is distributed very differently from the others, then that set is likely to be assigned a poor IGD value whatever its actual distribution is. Figure 6(b) shows such an example. When comparing A with B (with the reference set comprised of these two sets), A is evaluated better than B. But if another set C, which has a similar distribution pattern to B, is added into the evaluation, so that the reference set is now comprised of the three sets, A becomes evaluated worse than B. A potential way to deal with this issue is to first cluster crowded solutions in the reference set and then consider these well-distributed clusters instead of arbitrarily-distributed points, as done in the indicator PCI [72]. Yet, this could induce another issue: how to properly cluster the solutions in the reference set subject to a potentially highly irregular solution distribution.

Fig. 6: Two examples where using the collection of solution sets as the reference set may lead to misleading evaluations for IGD. (a) Comparing two bi-objective sets A and B, A should be highly likely to be preferred to B, as the solutions of B are either dominated by some solution in A or slightly better on one objective but significantly worse on the other; yet IGD gives the opposite result. (b) Comparing three bi-objective sets A, B and C, in general A may be more likely to be preferred by the DM than B and C, as it provides better spread and cardinality; yet IGD gives the opposite result: IGD(A) > IGD(B) > IGD(C).
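IGD itself is a one-liner; what matters is the reference set fed into it. A sketch under the same conventions (both sets below are hypothetical):

```python
import math

# A sketch of IGD (Equation (6)); the trustworthiness of the value
# depends entirely on how well R represents the Pareto front.

def igd(A, R):
    return sum(min(math.dist(r, a) for a in A) for r in R) / len(R)

A = [(1.0, 1.0), (2.0, 2.0)]                  # hypothetical set
R = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]      # hypothetical reference set
print(igd(A, R))                              # ≈ 0.94
```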
Hypervolume (HV)

Like IGD, HV [153] evaluates the quality of a solution set in terms of all four aspects. Due to its desirable practical usability and theoretical properties, HV is arguably the most commonly used indicator in SBSE, e.g., used in [24], [30], [34], [37], [54], [93], [99], [103], [105], [125], [126], [131], [139], [144]. For a solution set, the HV value is the volume of the union of the hypercubes determined by each of its solutions and a reference point. It can be formulated as

HV(A) = \lambda \left( \bigcup_{a \in A} \{ x \mid a \prec x \prec r \} \right)    (7)

where r denotes the reference point and λ denotes the Lebesgue measure. A high HV value is preferred.

A limitation of the HV indicator is its exponentially increasing computational time with respect to the number of objectives. Many efforts have been made to reduce its running time, theoretically and practically (see [75] for a summary), which make the indicator workable on a solution set with more than 10 objectives (under a reasonable set size).

As stated previously, HV is in favor of the knee points of a solution set, and is thus a good choice when the DM prefers the knee points of the problem's Pareto front. In addition, the setting of the reference point can affect its evaluation results. Consider the two solution sets A and B in Figure 7, where A consists of two boundary solutions and B consists of four uniformly distributed inner solutions. When the reference point is set to (6, 6) (Figure 7(a)), A is evaluated worse than B. When the reference point is set to (11, 11) (Figure 7(b)), A is evaluated better than B.

Fig. 7: An example where distinct reference points lead HV to prefer different solution sets: the set A consists of two boundary solutions (A = {(0, 5), (5, 0)}) and the set B consists of four uniformly distributed inner solutions (B = {(1, 4), (2, 3), (3, 2), (4, 1)}). In (a), where the reference point is (6, 6), A is evaluated worse than B: HV(A) = 11 < HV(B) = 19. In (b), where the reference point is (11, 11), A is evaluated better than B: HV(A) = 96 > HV(B) = 94.

Fortunately, we can make good use of such behavior of HV to enable the indicator to reflect the DM's preferences. If the DM prefers the extreme points, then the reference point can be set fairly distant from the solution sets' boundaries, e.g., doubling the Pareto front's range, namely r_i = nadir_i + l_i, where nadir_i is the nadir point of the Pareto front (or the reference set, i.e., the combined nondominated front) on its i-th objective, and l_i is the range of the Pareto front (or the reference set) on the i-th objective. If there is no clear preference from the DM, unfortunately, no consensus on how to set the reference point has been reached in the multi-objective optimization field. A common practice is to set it at 1.1 times the range of the combined nondominated front (i.e., r_i = nadir_i + l_i/10). Some recent studies [55] suggested setting it as r_i = nadir_i + l_i/h, where h is an integer subject to \binom{h+m-1}{m-1} ≤ n < \binom{h+m}{m-1} (m and n being the number of objectives and the size of the considered set, respectively). In any case, the reference point setting is non-trivial: an appropriate setting needs to consider not only the number of objectives and the size of the solution set, but also the actual dimensionality of the set, its shape, etc.
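For the bi-objective case, HV can be computed exactly by a simple sweep; the sketch below reproduces the reference-point sensitivity of Figure 7:

```python
# An exact 2-D hypervolume sketch (minimization) reproducing Figure 7.

def hv_2d(A, ref):
    """Sweep the points sorted by f1 and sum the dominated rectangles
    bounded by the reference point ref."""
    pts = sorted(p for p in A if p[0] < ref[0] and p[1] < ref[1])
    vol, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                       # skip dominated points
            vol += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return vol

A = [(0, 5), (5, 0)]                   # two boundary solutions
B = [(1, 4), (2, 3), (3, 2), (4, 1)]   # four uniform inner solutions
print(hv_2d(A, (6, 6)), hv_2d(B, (6, 6)))      # 11.0 < 19.0: B wins
print(hv_2d(A, (11, 11)), hv_2d(B, (11, 11)))  # 96.0 > 94.0: A wins
```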
ε-indicator

The ε-indicator is another well-established comprehensive indicator frequently appearing in SBSE, e.g., [13], [32], [43], [52], [135]. It measures the maximum difference between two solution sets and can be defined as

\epsilon(A, B) = \max_{b \in B} \min_{a \in A} \max_{i \in \{1,...,m\}} (a_i - b_i)    (8)

where a_i denotes the objective value of a on the i-th objective and m is the number of objectives. A low value is preferred; ε(A, B) ≤ 0 implies that A weakly dominates B. When B is replaced with a reference set that represents the Pareto front, the ε-indicator becomes a unary indicator, measuring the gap between the considered set and the Pareto front.

The ε-indicator is Pareto compliant and user friendly (parameter-free, with quadratic computational effort). Yet, the calculation of the ε-indicator involves only one particular objective of one particular solution in either set (where the maximum difference is), so its evaluation omits the differences on all other objectives and solutions. This may lead to different solution sets having the same or similar evaluation results, as reported in [78]. In addition, in some studies [111], the ε-indicator has been empirically found to behave very similarly to HV in ranking solution sets.
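The additive ε-indicator of Equation (8) is equally compact; the two sets below are hypothetical, and minimization with normalized objectives is assumed:

```python
# A sketch of the additive ε-indicator (Equation (8)), minimization.

def epsilon(A, B):
    return max(min(max(ai - bi for ai, bi in zip(a, b)) for a in A)
               for b in B)

A = [(1.0, 3.0), (3.0, 1.0)]   # hypothetical
B = [(2.0, 2.0)]               # hypothetical
print(epsilon(A, B))  # 1.0: shifting A down by 1 on every objective
                      # would make it weakly dominate B
```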
Table 9 summarizes the above 12 indicators on several aspects, namely, (i) what kind of quality aspect(s) they are able to reflect, (ii) whether they are Pareto compliant, (iii) what needs to be taken care of when using them, and (iv) what situations they are suitable for. The following guidelines can be derived from the table.

• If the DM wants to know the convergence quality of a solution set to the Pareto front, GD+ (instead of GD) could be an ideal choice: it is Pareto compliant, and the reference set required can be set as the combined nondominated front of all the considered sets, not necessarily a set of uniformly-distributed points. If the DM wants to know the relative quality between two solution sets in terms of the Pareto dominance relation, CI (or C) could be a choice.
• If the DM wants to know the diversity quality (both spread and uniformity) of a solution set, for bi-objective cases Δ is a good choice; for problems with more objectives, DCI can be used. SP can only reflect the uniformity of a solution set, which may not be very useful, as uniformly distributed solutions concentrating in a tiny area are typically not in the DM's favor.
• The indicator UNFR should replace NFS to measure the cardinality of solution sets.
• Regarding comprehensive evaluation indicators, HV can generally be the first choice, especially when the DM prefers knee points. In addition, if the DM prefers extreme solutions, the reference point needs to be set fairly distant from the solution sets' boundaries. The ε-indicator is user-friendly, but is less sensitive to solution sets' quality differences than HV, since its value depends only on one particular solution on one particular objective. IGD may not be very practical, as it requires a Pareto front representation consisting of densely and uniformly distributed points.

METHODOLOGICAL GUIDANCE TO QUALITY EVALUATION IN PARETO-BASED SBSE

In this section, we provide guidance on how to select and use quality evaluation methods in Pareto-based SBSE. As discussed previously, selecting and using quality evaluation methods needs to be aligned with the DM's preferences. (For convenience, from this point forward we use "the DM's preference information" as a general term, which refers not only to the preferences that the DM articulates but also to the requirements derived from the problem nature and contextual information.) A solution set being evaluated better means nothing but its solutions having a bigger chance to be picked by the DM. However, for different problems, or even for the same problem under different circumstances, the articulation of the DM's preferences may differ. In some cases, the DM is confident to articulate their preferences (or they can easily be derived from contextual information); e.g., they see one objective as more important than the others. In some cases, the DM may experience difficulty in precisely articulating their preferences; e.g., they are only able to provide some vague preference information, such as a fuzzy region around one point. In some other cases, the DM's preferences may not be available at all; e.g., when the DM wants to see what the whole Pareto front looks like before articulating their preferences. Therefore, quality evaluation needs to be conducted in accordance with these different cases. Next, we consider four general cases of quality evaluation with respect to the DM's preferences.

The case of the DM's preferences being clear can often fall into two categories over Pareto-based SBSE problems. The first is when the relative importance/weighting among the considered objectives can be explicitly expressed and quantified, e.g., in [125]. It is worth noting that the weighting between objectives need not be fixed a priori. For example, in the case of interactive Pareto-based SBSE for software modeling and architecting problems [128], the DM is asked to explicitly rank the relative importance of the objectives as the search proceeds. Under this circumstance, the sum of the weighted objectives can be used to find the fittest solution from a solution set, and thereby determine the quality of the set.

The other category concerns when the DM prefers some objective over some other (i.e., a clear priority can be assumed, which is a unique situation that often implies some clear contextual information of a hard requirement in the SBSE problem), or when the DM is only interested in solutions that come up to scratch on some objective (which could be seen as a constraint). This happens frequently in the software product line configuration problem [52], [54], [93], [118], [119], [120], [144], where the correctness of the products (i.e., the feature model's dependency compliance) is always of higher priority than other objectives, such as the richness and the cost of the model: only the solutions (products) that achieve full dependency compliance are of interest. This is obvious, as a violation of dependency implies a faulty and incorrect configuration, which is valueless in practice. A similar situation applies to the test case generation problem [53], [103], [104], [124], [150], where the DM is typically interested in test suites with full coverage. In addition, the DM may only be interested in solutions that reach a certain level on some objective. For example, in software deployment and maintenance, it is not uncommon to have a statement like "The software service shall be available for at least 95% of the time". In such a case, it is rather clear that any availability value less than 95% is unacceptable, while anything beyond 95% can be considered.

An appropriate way to perform evaluation under the above circumstances is to transfer the DM's preferences onto the solution set to be evaluated. This can be done by first removing the irrelevant solutions from the set. After that, the set of the remaining solutions is evaluated, subject to two situations: if the remaining solutions all have the same value on the objective(s) where the DM articulates their preferences, then the quality evaluation is performed only on the other objectives; otherwise, the evaluation is done on all the objectives. The former is commonly seen when the DM is only interested in solutions that achieve the best of the objective, such as the solutions with full coverage for the test case generation problem, whereas the latter often applies when the DM is interested in a particular threshold of solution quality on the objective: in the software deployment and maintenance case mentioned above, only the solutions with availability values of no less than 95% would be evaluated.

TABLE 9: A summary of representative quality indicators used in SBSE, their usage notes/caveats, and applicable conditions.

CI
  Quality aspects: convergence (−), cardinality (−); Pareto compliant: +.
  Usage notes/caveats: (i) not able to distinguish between sets if their solutions are nondominated to each other, which may happen frequently in many-objective optimization; (ii) a binary indicator which evaluates the relative quality of two sets and cannot be converted into a unary indicator.
  Applicable conditions: (i) when the DM wants to know the relative quality difference (in terms of the dominance relation) between two sets, and (ii) when the Pareto front size is relatively small, e.g., on some low-dimensional combinatorial problems.
C (CS)
  Quality aspects: convergence (−), cardinality (−); Pareto compliant: +.
  Usage notes/caveats: (i) not able to distinguish between sets if their solutions are nondominated to each other, which may happen frequently in many-objective optimization; (ii) a binary indicator which evaluates the relative quality of two sets and cannot be converted into a unary indicator; (iii) duplicate solutions should be removed before the calculation.
  Applicable conditions: (i) when the DM wants to know the relative quality difference (in terms of the dominance relation) between two sets, and (ii) when the Pareto front size is relatively small, e.g., on some low-dimensional combinatorial problems.

GD
  Quality aspects: convergence (+).
  Usage notes/caveats: (i) additional problem knowledge: a reference set that represents the Pareto front (not necessarily a set of uniformly-distributed points); (ii) each objective needs normalization; (iii) may give misleading results due to not holding the Pareto compliance property.
  Applicable conditions: (i) when the DM wants to know how close the obtained sets are to the Pareto front, (ii) when the compared sets are nondominated to each other (i.e., there is no better relation between the sets), and (iii) when the Pareto front range can be estimated properly (e.g., no DRS points in the reference set [75]).

GD+
  Quality aspects: convergence (+); Pareto compliant: +.
  Usage notes/caveats: (i) additional problem knowledge: a reference set that represents the Pareto front (not necessarily a set of uniformly-distributed points); (ii) each objective needs normalization.
  Applicable conditions: (i) when the DM wants to know how close the obtained sets are to the Pareto front.

Spread
  Quality aspects: spread (+), uniformity (+).
  Usage notes/caveats: (i) additional problem knowledge: the extreme points of the Pareto front; (ii) each objective needs normalization; (iii) reliable only on bi-objective problems.
  Applicable conditions: (i) when the DM wants to know the diversity (including both spread and uniformity) of the obtained sets on bi-objective problems, and (ii) when the compared sets are nondominated to each other.

DCI
  Quality aspects: spread (+), uniformity (−), cardinality (−); Pareto compliant: − (subject to conditions).
  Usage notes/caveats: (i) additional problem knowledge: a proper setting of the grid division; (ii) an M-nary indicator that evaluates the relative quality of M sets, but can be converted into a unary one by comparing the obtained set with the Pareto front.
  Applicable conditions: (i) when the DM wants to know the diversity of the obtained sets.

SP
  Quality aspects: uniformity (+).
  Usage notes/caveats: (i) each objective needs normalization; (ii) cannot reflect the spread of solution sets.
  Applicable conditions: (i) when the DM wants to know the uniformity of the obtained sets, and (ii) when the compared sets are nondominated to each other.

NFS
  Quality aspects: cardinality (+).
  Usage notes/caveats: (i) not able to compare sets, as it only counts the number of nondominated solutions in a set.
  Applicable conditions: not reliable when the DM wants to compare sets.

UNFR
  Quality aspects: cardinality (+); Pareto compliant: +.
  Usage notes/caveats: none.
  Applicable conditions: (i) when the DM wants to compare the cardinality of sets, particularly how much they contribute to the combined nondominated front.

IGD
  Quality aspects: convergence (+), spread (+), uniformity (−), cardinality (−).
  Usage notes/caveats: (i) additional problem knowledge: a reference set that well represents the Pareto front (i.e., densely and uniformly distributed points); (ii) each objective needs normalization; (iii) may give misleading results due to the lack of the Pareto compliance property.
  Applicable conditions: (i) when the DM wants to know how well the obtained sets represent the Pareto front, (ii) when the compared sets are nondominated to each other, and (iii) when a Pareto front with densely and uniformly distributed points is available.

HV
  Quality aspects: convergence (+), spread (+), uniformity (−), cardinality (+); Pareto compliant: +.
  Usage notes/caveats: (i) additional problem knowledge: a reference point that is worse than the nadir point of the Pareto front; (ii) exponentially increasing computational cost in objective dimensionality; (iii) the DM can specify the reference point according to their preference for extreme solutions or for inner ones.
  Applicable conditions: (i) when the DM wants to know the comprehensive quality of the obtained sets, especially suitable if the DM prefers the knee points of the problem, and (ii) when the objective dimensionality is not very high.

ε-indicator
  Quality aspects: convergence (+), spread (+), uniformity (−), cardinality (−); Pareto compliant: +.
  Usage notes/caveats: (i) each objective needs normalization; (ii) a binary indicator, but can be converted into a unary indicator by comparing the obtained set with the Pareto front; (iii) differently-performing sets may have the same/similar evaluation results.
  Applicable conditions: (i) when the DM wants to know the maximum difference between two solution sets (or between the obtained solution set and the Pareto front).

Legend: "+" generally means that the indicator can well reflect the specified quality (or meet the specified property). "−" for convergence means that the indicator can reflect the convergence of a set to some extent, e.g., indicators considering only the dominance relation as the convergence measure. "−" for spread means that the indicator can only reflect the extensity of a set. "−" for uniformity means that the indicator can reflect the uniformity of a set to some extent, i.e., a disturbance to an equally-spaced set may not certainly lead to a worse evaluation result. "−" for cardinality means that adding a nondominated solution to a set is not surely, but likely, to lead to a better evaluation result, and it never leads to a worse evaluation result. "−" for Pareto compliance means that the indicator holds the property subject to certain conditions.
It is not uncommon that there exist important yet imprecise preferences in the SDLC. In general, they are mainly derived from the non-functional requirements recorded in documentation, notes, and specifications, which are often vague in nature, as in [51] [11] [84] [67] [1]. For example, in the software configuration and adaptation problem, some statements may be rather ambiguous, like "the first objective should be reasonable and the others are as good as possible". In such a situation, one may not be able to integrate the preferences into the quality evaluation, since it is not possible to quantify qualitative descriptions like "reasonable". As such, a safe choice is to treat them as a general multi-objective optimization case (i.e., without specific preferences).

In other situations, the SBSE researchers/practitioners may give some preference information around some values/thresholds on one (or several) objective. This, in contrast to the case of the DM's preferences being clear, allows some tolerance on the specified value/threshold. For example, the software may have a requirement stating that "the cost shall be low while the product shall support ideally up to 3000 simultaneous users". This typically happens for SMEs, where the budget of a software project (e.g., money for buying required Cloud resources or the consumption of data centers) is low, and thus it is more realistic to set a threshold point such that a certain level of performance (e.g., 3000 simultaneous users) would be sufficient (anything beyond is deemed equivalent). However, while the requirement gives a clear cap on the best performance expected, it does not constrain the worst case, implying that it allows tolerance when the users goal cannot be met.

Though not impossible, it can be a challenging task to find a quality indicator that is able to reflect such preference information. First, the quality indicator should be capable of accommodating such preference information, in the sense that the evaluation results can embody it. Second, the introduction of the preferences should neither compromise the general quality aspects that the indicator reflects, nor violate the properties that the indicator complies with (e.g., being Pareto compliant). In this regard, the indicator HV [153] could be a good choice, since it (i) can relatively easily integrate the DM's preference information [136], [152] and (ii) can still be Pareto compliant after a careful introduction of the DM's preference information [152].

To integrate preferences into HV, one approach, called the weighted HV [152], is to interpret the HV value as the volume of the objective space enclosed by the attainment function [28] and the axes. Here, the attainment function gives, for each vector in the objective space, the probability that it is weakly dominated by the outcome of the solution set. Then, to give different weights to different regions via a weight distribution function, the weighted HV is calculated as the integral over the product of the weight distribution function and the attainment function [152]. This essentially transforms the preference information into a weight distribution function that unequalizes the HV contributions from different regions. However, it is not trivial to construct a weight distribution function that is able to reflect the preferences expressed by the SBSE researchers/practitioners.
Even in the situation where the preference information is clear (e.g., a clear weighting between objectives), the preferences cannot be used directly as the weight distribution function, because of the interaction between the weight distribution function and the attainment function in the calculation.

Another (perhaps more pragmatic) approach is to directly transform the original solutions into new solutions which accommodate the preference information, and then apply HV (or other quality indicators) to the new solutions, provided that such a transformation is in line with the selected indicator. For instance, consider the above example, where the cost shall be low while the product shall ideally support up to 3000 users. Let us say that there are two solutions a = (1500, 3000) and b = (2000, u) with u > 3000 obtained for this problem. Solution a has a lower cost, while solution b can support more users. However, according to the preference information, any number of users beyond 3000 can be deemed equivalent. As such, the user number of solution b can be transformed down to 3000; that is, now b = (2000, 3000), which is worse than (i.e., dominated by) a. The HV indicator can capture such dominance relation information: a dominated solution is always evaluated worse by HV than one dominating it. Next, we look at a case study based on this example to see how such a transformation affects the evaluation results.

Consider a situation of designing a product with the requirements that "the cost shall be low while the product should be able to support at least 1500 simultaneous users and ideally reach 3000 users". As can be seen, the first objective, cost, is a normal one (i.e., the lower the better), while for the second objective, the number of simultaneous users, there are two types of preferences: a clear one and a vague one. The statement "support at least 1500 users" is a clear one, which means the product is useless if it cannot support 1500 users. The statement "ideally reach 3000 users" is a vague one, which implies that, despite the threshold, it is acceptable to support fewer users, and the same level of satisfaction is attained even if more users can be supported. As such, we can have the following transformation function.
$$a'_i=\begin{cases}3000, & \text{if } a_i>3000\\ a_i, & \text{if } 1500\le a_i\le 3000\\ \text{discard } a, & \text{if } a_i<1500\end{cases}\qquad(9)$$

where $a'_i$ denotes the transformed value of solution $a$ on the $i$th objective (here, the user-number objective).

Now let us assume two solution sets A = {a1, a2, a3} and B = {b1, b2, b3} obtained by two search algorithms, where the costs of a1, a2 and a3 are 750, 1000 and 1500, and those of b1, b2 and b3 are 500, 1250 and 2000, respectively; their user numbers are shown in Figure 8(a) and (b). We want to evaluate and compare them under the circumstances with/without the preference information given above, to see how transferring preferences into solutions affects the evaluation results. As seen in Figure 8(a) and (b), without considering the preferences, A is evaluated worse than B (HV(A) < HV(B)). This makes sense, as the solutions of B spread more widely than those of A. Yet, when considering the preferences of the DM (transferred by Equation (9)), while the set A stays unchanged (A′ = A), the solution b1 will be discarded and the solution b3 will become (2000, 3000). As a result, A′ is evaluated significantly better than B′ (HV(A′) > HV(B′)), as shown in Figure 8(c) and (d). This shows that the integration of the DM's preferences can completely change the evaluation results between solution sets.
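The transformation itself is mechanical; the following minimal sketch implements Equation (9) for the (cost, users) representation, where the concrete user numbers assigned to A and B are illustrative assumptions rather than the exact values of Figure 8.

```python
LOWER, CAP = 1500, 3000  # "at least 1500 users", "ideally reach 3000 users"

def transform(solution):
    """Apply Equation (9) to a (cost, users) solution; None means discard."""
    cost, users = solution
    if users < LOWER:               # cannot support 1500 users: useless
        return None
    return (cost, min(users, CAP))  # anything beyond the cap is equivalent

def transform_set(solutions):
    return [t for t in map(transform, solutions) if t is not None]

# Illustrative (assumed) user numbers for the sets A and B of the case study.
A = [(750, 2000), (1000, 2500), (1500, 3000)]
B = [(500, 1000), (1250, 2500), (2000, 4000)]
print(transform_set(A))  # A' = A: all user numbers already in [1500, 3000]
print(transform_set(B))  # b1 dropped (< 1500 users), b3 clipped to (2000, 3000)
```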
Sometimes, the DM may be more interested in specific parts/solutions of the Pareto front than in others. Knee points are certainly among such solutions, preferred in many situations, e.g., in [114], [44], [35], [137], [37], [68], [17], [19], [89]. Knee points are points on the Pareto front where a small improvement on one objective would lead to a large deterioration on at least one other objective. They represent "good" trade-offs between conflicting objectives and are thus naturally of more interest to the DM. An example is the cloud autoscaling problem [17], where different cloud tenants (users) may introduce conflicting objectives due to interference and the shared infrastructure. From the perspective of the cloud vendor, ensuring fairness among tenants of the same class is often the top priority, and thus the knee solutions are of more interest. As we explained previously, HV is a good choice in such a situation, alongside other indicators like the ε-indicator [154], IGD+ [56] and PCI [72], whereas unfortunately IGD [26] is not one of them, despite being widely used, e.g., in [37], [89].

Another relatively common situation is that the DM may be more interested in the extreme solutions (e.g., in [148], [134], [131]), namely, solutions achieving the best value on one objective or another. For example, for the service composition problem [134], one may prefer the extreme solutions around the edges, e.g., those with low latency but high cost, or vice versa. For this situation, HV can also be a viable choice: as shown previously (Section 5.7), setting the reference point fairly distant from the combined nondominated solution set gives the extreme solutions a bigger weighting in the evaluation results. Besides, one may directly compare solution sets through their best values on the corresponding objective(s). Such a DOE measure, in contrast to HV which provides comprehensive evaluation results, returns the objective values themselves, which are straightforward for the DM to understand.

As can be seen in Table 8, the majority of studies in Pareto-based SBSE effectively do not involve any preference. In this situation, a solution set that well represents the whole Pareto front is preferred. As aforementioned, this "representation" can be broken down into the quality aspects of convergence, diversity (i.e., spread and uniformity), and cardinality. Naturally, it is expected to consider quality indicators which (together) are able to cover all of them. In general, there are two ways to implement that in practice. One is to consider several indicators, each responsible for one specific aspect. For example, GD+ [56] is for a solution set's convergence, ∆ [29] for its diversity (under the bi-objective circumstance), and UNFR for its cardinality. The other is to consider a comprehensive indicator that evaluates all the aspects; such indicators include HV, IGD, and the ε-indicator. Today, there is a tendency to use comprehensive indicators, and numerous recent studies used HV and IGD. However, as explained previously, IGD may not be an ideal indicator in Pareto-based SBSE, as a Pareto front representation with densely and uniformly distributed points is usually unavailable in practice.

In addition, when using comprehensive indicators, we suggest considering multiple differently-behaving indicators if applicable. Each indicator has its own (explicit or implicit) preferences. A solution set evaluated better on one indicator is often evaluated better as well on another similar indicator; this means nothing more than that the set is favored under that particular type of preference. When a solution set is evaluated better on all of the considered indicators whose preferences are quite different, that set certainly has a higher chance to be chosen by the DM. Unfortunately, many comprehensive indicators behave similarly to HV [78], [111], namely, preferring knee points of the Pareto front rather than a set of uniformly distributed solutions on the Pareto front; examples include R2 [46] and the ε-indicator (except IGD, which, however, is typically not applicable). Therefore, as a supplement to HV, considering a quality indicator that can well evaluate the diversity of a solution set sounds reasonable. In this regard, the indicator ∆ (Spread) [29] may be chosen for the bi-objective case and DCI [70] for the case with more objectives.
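For the single-aspect route, the convergence indicator GD+ [56] is simple enough to state in a few lines; the sketch below (with an illustrative solution set and reference set) computes it under minimization, counting only the amount by which a solution is worse than its nearest reference point.

```python
import math

def d_plus(s, r):
    """Modified distance of GD+: only objective values worse than the
    reference point r contribute (minimization assumed)."""
    return math.sqrt(sum(max(si - ri, 0.0) ** 2 for si, ri in zip(s, r)))

def gd_plus(solution_set, reference_set):
    """Average modified distance from each solution to its nearest reference point."""
    return sum(min(d_plus(s, r) for r in reference_set)
               for s in solution_set) / len(solution_set)

S = [(1.2, 3.1), (2.5, 1.9)]              # obtained solution set (illustrative)
R = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]  # reference front sample (illustrative)
print(gd_plus(S, R))
```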
The above are four general cases of quality evaluation of solution sets on the basis of the DM's preferences. On top of those, there exist some quality indicators for specific SBSE scenarios (see Table 8), which we call problem-specific indicators (PSI). For example, in the library recommendation problem [99], top-k accuracy, precision, and recall on historical datasets are commonly used PSIs for evaluating recommendation systems. For another example, in the software modularization problem, the PSI MoJoFM [65], derived from the MoJo distance, compares a produced solution to a given "golden rule" solution, which naturally represents the DM's preference over objectives such as cohesion and coupling [140]. Another example can be seen in the test case generation problem, where some works use multi-objectivization to improve code coverage [100], [101], [102]. They reformulate the coverage criterion as a many-objective optimization problem, where the objectives to be optimized are different coverage targets (e.g., branches in [100]), but only the total coverage of all test cases in the produced test suite is of interest. In this case, such a total coverage criterion can be regarded as a problem-specific indicator, as it does reflect the quality of a solution set in this specific problem but is not directly involved in the search-based optimization.

Overall, such PSIs not only represent a more "accessible" quality evaluation of solution sets (i.e., how they perform under the practical problem background), but also usually imply some preferences from the DM. Therefore, it is highly recommended to include them in the evaluation if they exist. Nevertheless, it is worth noting that PSIs usually need to work together with generic quality indicators (e.g., those in Table 9) to provide reliable evaluations, since they may be irrelevant to Pareto-based optimization (e.g., only focusing on particular objectives in the evaluation). For example, the study [106] mainly relies on APFD, the average percentage of faults detected, to evaluate the solution set of prioritized test cases. Indeed, APFD is a frequently used PSI in test case prioritization, but it can only reflect the rate of faults detected, not the reliance of test cases, both of which are objectives to be optimized in the problem.

In addition, plotting representative solution sets (SSP) is also desirable as an auxiliary evaluation, as it enables the SBSE researchers/practitioners to get a sense of what the solution set looks like. This is very helpful not only for solution set comparison, but also for the DM to understand the problem and then perhaps to refine their preferences further. To use SSP, for an algorithm involving stochastic elements, we suggest plotting the solution set of the particular run whose evaluation result (obtained by a comprehensive quality indicator, e.g., HV) is the closest to the median value over all the runs. Alternatively, for optimization problems with two or three objectives, median attainment surfaces [38], [63], [79] can be used to visualize the performance of the algorithm with respect to all the runs (which have already been adopted in the literature [24], [34], [142], [148]). For problems with more objectives, the parallel coordinates plot (instead of the scatter plot) is a helpful tool, which can reflect the convergence and diversity of a solution set to some extent [73]. It has started to be used recently, e.g., in [53], [89], [91], [143], [144].
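As a pointer to tooling, a parallel coordinates plot of two solution sets can be produced with a few lines of pandas/matplotlib; the objective values and set labels below are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Two illustrative four-objective solution sets (values assumed normalized).
set_a = [(0.2, 0.8, 0.5, 0.3), (0.5, 0.4, 0.6, 0.2)]
set_b = [(0.1, 0.9, 0.7, 0.4), (0.6, 0.3, 0.2, 0.8)]
objectives = ["f1", "f2", "f3", "f4"]
rows = [dict(zip(objectives, s), algorithm="A") for s in set_a]
rows += [dict(zip(objectives, s), algorithm="B") for s in set_b]

# One polyline per solution, colored by the algorithm that produced it.
parallel_coordinates(pd.DataFrame(rows), "algorithm")
plt.ylabel("objective value")
plt.show()
```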
Based on the above, we are now in a good position to provide a general procedure for evaluating solution sets in Pareto-based SBSE, given in Figure 9.

Fig. 9: General procedure of quality evaluation in Pareto-based SBSE. [The figure is a decision flowchart over the decisions D1–D14 and processes P1–P4 discussed below, each path ending in recommended evaluation method(s). Its notes read: since every quality indicator has its own implications, the guidance is merely a general instruction for selecting quality evaluation methods, and the concrete usage of the quality indicators in each specific case should refer to Table 5; if the DM prefers to know the uniformity of the solution set only (rather than diversity), SP will be fitter than DCI and ∆; for comprehensive quality evaluation, we recommend using multiple (differently-behaving) indicators if applicable, e.g., first HV and then other comprehensive indicators; when boundary solutions are preferred, the reference point for HV should be far away from the nadir point (see Section 5.7), and DOE there means the best value of each objective from the population.]

At first, we suggest conducting some screening (P1 in the figure) to filter out trivial solutions in the considered solution sets according to the nature of the optimization problem in SBSE. The trivial solutions can be seen as those which are straightforward to obtain and would never be of interest to the DM, but may affect the evaluation result, e.g., the solution with zero cost and zero coverage in the example of Figure 4.
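The screening step amounts to a small preprocessing pass; a minimal sketch is given below, assuming all objectives have been converted to minimization and a problem-specific test for triviality (the optional removal of dominated solutions goes beyond P1 itself).

```python
def is_trivial(solution):
    """Problem-specific test; here (assumed): the all-zero solution, e.g.,
    zero cost and zero coverage, is trivial and never of interest to the DM."""
    return all(v == 0.0 for v in solution)

def dominates(p, q):
    """Pareto dominance under minimization (all objectives minimized)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def screen(solution_set, drop_dominated=False):
    """P1: remove trivial solutions; optionally also remove dominated ones."""
    kept = [s for s in solution_set if not is_trivial(s)]
    if drop_dominated:
        kept = [s for s in kept if not any(dominates(o, s) for o in kept if o != s)]
    return kept

print(screen([(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 1.0)], drop_dominated=True))
# -> [(1.0, 2.0), (3.0, 1.0)]
```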
After the filtering, it comes to evaluating the solution sets according to the DM's preference information (D1). If there is no preference information available at all (e.g., in the effort estimation and test case prioritization problems), we suggest considering quality indicators that together are able to accurately reflect all the quality aspects (D2–D5). For example, one can consider separately evaluating distinct quality aspects of solution sets, e.g., GD+ for convergence (together with CI if willing to know the dominance relation between sets), DCI for diversity (when three or more objectives are involved), and UNFR for cardinality. Alternatively, one can consider evaluating the comprehensive quality of solution sets, e.g., HV in most cases, or even some mix (e.g., HV plus UNFR) if an understanding of some specific quality aspects is required on top of the solution sets' general quality. Note that the cardinality of solution sets tends to carry more weight in the SBSE area than in some other areas such as evolutionary computation, since many SBSE problems are combinatorial ones, where the Pareto front size may be relatively small and it is likely to have comparable solutions (e.g., dominated/duplicate solutions) from different sets. In any case, whatever indicators are considered, the way of using them needs to comply with their usage notes and caveats (see Table 9).

If there is preference information available, we recommend first checking whether (part of) it belongs to clear preferences (D6); if so, such as in the software product line configuration problem, one can transfer the clear preferences into the solutions (P2), as discussed in Section 7.1. It is necessary to note that sometimes after the transfer there is only one objective left to be considered (e.g., in the example of Figure 4). In this case, the best value on that remaining objective represents the quality of the solution set.

After considering clear preferences, one needs to see whether there exist some vague preferences (D7). If the answer is yes, e.g., in the software configuration and adaptation problem, then we transfer those that are transferable (D8, P3, D9), e.g., the example in Figure 8 and the resolution from Section 7.2. After that, we recommend checking whether the remaining preferences can be transferred into an indicator (D10); if so, one can then transfer them (P4), e.g., transferring certain preferences into a weight distribution function in the weighted HV [152], and use that indicator to evaluate the solution sets, as shown in Section 7.2.

When the preference information cannot be accommodated into an indicator, the next step is to check whether the DM prefers a specific part of the Pareto front (D11), such as in the software modularization and service composition problems. As we discussed in Section 7.3, if one prefers knee points on the Pareto front (i.e., well-balanced solutions between conflicting objectives), then HV is a good option. If one prefers boundary solutions, then HV with an unusual configuration of its reference point can be used, alongside reporting the best value of the relevant objective(s) in the population.

Note that the DM may present several types of preference information. For example, one may specify a clear threshold on one objective and at the same time be interested in knee points on the Pareto front — a typical case when non-functional requirements of the software are involved. Another example has been seen in the situation of Figure 8, where the DM's preferences contain both clear and vague information. In addition, it is necessary to mention that there do exist some situations where the DM's preferences cannot be quantified/transferred properly, e.g., the DM may state something like "the cost should be reasonable". In such a situation, we suggest proceeding to D2 — the general multi-objective optimization case (without specific preferences).

After going through all possible cases of the DM's preferences, it comes to checking the last two quality evaluation methods: problem-specific evaluation, i.e., PSI (D13), and solution set plotting, i.e., SSP (D14). These are two very helpful methods in reflecting aspects of the solution set's quality that may not have been captured by generic quality indicators.

8 THREATS TO VALIDITY

Threats to construct validity can be raised by the research methodology, which may not serve the purpose of surveying the evaluation methods for Pareto-based optimization in existing SBSE studies. We have mitigated such threats by following the systematic review protocol proposed by Kitchenham et al. [60], which is a widely recognized methodology for conducting a survey in SE research. Another threat is related to the citation count used in the exclusion criteria. Indeed, it is difficult to set a threshold for it, as the citation count itself cannot fully reflect the impact of a work. Since there is no metric that suffices to do so, in this work we used the citation count and set a threshold by averaging over the candidate studies. It is however worth noting that we do not seek to provide a comprehensive review of the entire SBSE field, but to capture major trends in the evaluation of solution sets, which can at least provide some sources for analyzing and building the methodological guidance.
Therefore, it is necessary to reach a trade-off between the trend coverage and the effort required for detailed data collection on the studies.

Threats to internal validity may be introduced by inappropriate classification and interpretation of the SBSE papers, their implied preferences, and the quality indicators/evaluation methods used. We have limited this by conducting three iterations of study reviews by the first two authors. Error checks and investigations were also conducted to correct any issues found during the search procedure. The key issues identified have also been resolved among the first two authors or by consulting external researchers.

Threats to external validity may restrict the generalizability of the proposed guidance and the considered cases. We have mitigated this by conducting the survey both widely and deeply: it covers 717 searched papers published between 2009 and 2019, from 36 venues in seven repositories, while extracting 95 prominent primary studies following the exclusion and inclusion procedures. This has included 21 of the most noticeable SBSE problems, which spread across the whole SDLC. The extracted assumptions about the DM's preferences, together with rigorous analyses of the 12 representative quality indicators (i.e., either used widely in SBSE or proposed herein for a more accurate evaluation), have provided rich sources for us to establish a general methodological guidance for the community.

Finally, although our guidance has been designed to cover a wide range of SBSE problems, it is always possible that there are situations which we have unfortunately missed; for example, the behavior of solutions in the decision space, e.g., their diversity and robustness. As different settings of parameters (i.e., decision variables) may lead to solutions of similar/identical quality (e.g., multiple points in the decision space map to a single point in the objective space), the DM naturally prefers those which are easier to implement. Therefore, a set of diverse solutions in the decision space is preferred, providing more options for the DM. Another aspect that the DM may consider is robustness [13], which relates to how fast the quality of solutions degrades when varying their parameters (decision variables). This issue is particularly important in an uncertain environment, where the solution may not be deployed accurately and/or the estimated objective functions may carry a margin of error. Therefore, robust solutions are preferred to sensitive ones, even if their quality is slightly lower in some circumstances. Overall, in those cases, an evaluation of solution sets' quality both in the decision space and in the objective space is needed.

9 RELATED WORK

Various surveys on SBSE (e.g., [47], [50], [85]) reveal intense interest in developing computational search methods for complex optimization problems in SE. Some of them focus on, or are relevant to, Pareto-based multi-objective optimization. For example, Sayyad et al. [117] performed a brief literature review of SBSE studies that used Pareto-based evolutionary algorithms for multi-objective optimization problems; Boussaïd et al. [8] conducted a comprehensive survey on search-based model-driven engineering and classified relevant search algorithms into single- and multi-objective ones; Ramírez et al. [110] reviewed SBSE studies on a subarea of multi-objective optimization, many-objective optimization, where the number of objectives is larger than three.
In general, these papers concentrate on the development of search algorithms for Pareto-based multi-objective optimization problems; very few touched on the quality evaluation of the results obtained by search algorithms until recently. Wang et al. [138] proposed a practical guide for SBSE researchers/practitioners to select quality indicators in Pareto-based optimization, on the basis of experimental studies evaluating eight quality indicators on three industrial and real-world problems. They first classified these indicators into four categories (convergence, diversity, combination, and coverage) and then, based on empirical observations, drew several conclusions about indicator selection. For example, they concluded that it does matter which indicator to select in the diversity category, but it does not matter which indicator to select within the same convergence or combination category.

Very recently, Ali et al. [4] substantially extended Wang et al.'s work [138] and provided a set of guidelines drawn from an extensive empirical evaluation on nine SBSE problems from industrial, real-world and open-source projects. From these experiments, they produced 22 observations based on statistical comparisons between six multi-objective evolutionary algorithms. They claimed that the differences between SBSE problems have a high effect on the consistency of quality indicators' evaluation results, whereas the effect of search algorithms is low. A noticeable difference from [138] is that the guidance provided did not build on a classification of indicators.

Li et al. [74] conducted a critical review of Wang et al.'s work [138]. They argued that some conclusions (e.g., that it matters which indicator to select in the diversity category) are actually caused by the inaccurate classification of the considered indicators. More importantly, they argued that even if an accurate classification were made, one still could not draw conclusions such as it does not matter which indicator to select, whether in the same category or across different categories.

Indeed, as can be seen in Section 6, each quality indicator has its own distinct quality implications. A solution set being evaluated better by an indicator does not mean that it generally has higher quality, but rather that it is preferred under the assumption that the indicator accurately reflects the DM's preferences. However, different DMs may prefer different trade-off solutions between objectives, even for the same problem. For example, for the project scheduling problem, in some scenarios the DM may prefer knee solutions [114], [44], [35]; in others, the DM may prefer widely distributed solutions [24]; in yet others, the DM may prefer specific solutions relying on the Analytic Hierarchy Process [125]. Consequently, observations on quality indicators drawn from an empirical investigation of specific SBSE scenarios may not generalize well. This suggests a need for a general, methodological guidance on how to select and use indicators in SBSE. Such a guidance is based not upon empirical studies of specific problems but upon the fundamental goal of multi-objective optimization — supplying the DM with a set of solutions that are the most consistent with their preferences.

It is worth mentioning that a recent survey paper [75] on quality evaluation in multi-objective optimization has appeared, albeit not specific to SBSE.
It systematically reviewed 100 quality indicators, analyzed correlations between representative indicators, discussed several important issues in designing indicators, and suggested a few future research directions. One key purpose of that work is indicator design and development, i.e., to inform researchers and practitioners of what aspects to bear in mind when designing new indicators. In contrast, our work here is about the selection and use of indicators (and other evaluation methods), i.e., to guide researchers and practitioners on how to select/use existing indicators, or even specialize them, for evaluating solution sets in SBSE. Another major difference between the two works is that the work [75] considered the general situation where the DM's preferences are not available, whereas our work considers situations based precisely upon various DM's preferences.

As such, these two works complement each other well. If the SBSE researchers/practitioners want to understand correlations between different quality indicators, to know some important issues (e.g., scaling, normalization and the effect of dominated/duplicate solutions) when performing quality evaluation for a practical problem, or even to design/develop new indicators by themselves, they can refer to the general survey paper [75]. In contrast, if the SBSE researchers/practitioners want to select and use existing indicators in various optimization scenarios in SBSE, or to adapt existing indicators to fit explicit/implicit preferences from the DM in the given SBSE problem, this work serves well.

Overall, in comparison with the existing works, this work presents several additional contributions. First, we conduct a systematic literature review on quality evaluation for Pareto-based optimization in SBSE. Second, from that review, we present a variety of inappropriate/inadequate selections and inaccurate/misleading uses of evaluation methods, and identify five important but overlooked issues. Third, from the perspective of the goal of multi-objective optimization, we discuss the reasons that quality indicators are needed, carry out an in-depth analysis of frequently used quality indicators in the area, and explain the scope of their applicability. Finally, we provide a methodological guidance and procedure for selecting and using evaluation methods in various SBSE scenarios.

10 CONCLUSIONS

The nature of considering multiple (conflicting) objectives in many SBSE problems leads to a link between SE and multi-objective optimization. However, compared to the flourishing use/design of multi-objective optimizers in SBSE, the evaluation of the optimizers' outcome remains relatively "casual". Existing SBSE research often works by analogy, namely, following popular (or previously used) quality evaluation methods without considering whether they are truly suitable for the specific situation. In this paper, we have carried out a systematic and critical review of quality evaluation in Pareto-based SBSE, covering 95 prominent studies published between 2009 and 2019 from 36 venues in seven repositories. We have found that in many studies the selection/use of evaluation methods is not appropriate and can even be misleading, based on which we summarize five critical issues, namely:

• Inadequacy of Solution Set Plotting.
• Inappropriate use of Descriptive Objective Evaluation.
• Confusion of the quality aspects covered by generic quality indicators.
• Oblivion of context information.
• Noncompliance of the DM's preferences.

Through revisiting the pros and cons of widely used quality indicators in SBSE, we have provided a methodological guidance and procedure for selecting, adjusting and using quality evaluation methods on the basis of the following availability/types of the DM's preferences:

• There are clear preferences between the objectives.
• There are vague/rough preferences between the objectives.
• There are preferences on some specific parts of the Pareto front.
• There are no preferences available.

We hope that our guidance will help to mitigate the evaluation issues in future SBSE work and, more importantly, will make the quality evaluation of solution sets easier, clearer and more accurate for SBSE researchers and practitioners.

APPENDIX

Table A1 specifies the evaluation methods and how they are used for all the 95 studies analyzed in this work.

TABLE A1: What and how the evaluation methods are used in each primary study. [For each primary study, the table lists the indicators/methods used (e.g., HV, IGD, IGD+, GD, SP, Spread, CI, CS, ER, NFS, AS, the ε-indicator, DOE, SSP, and PSI), the quality aspects they are stated to measure (Q1 = convergence, Q2 = spread, Q3 = uniformity, Q4 = cardinality), the number of objectives, the reference point used for HV (e.g., nadir point, worst values), and the reference front used (e.g., the best Pareto front found).]

ACKNOWLEDGMENT

This work was supported by the Guangdong Provincial Key Laboratory (Grant No. 2020B121201001), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386), the Shenzhen Science and Technology Program (Grant No. KQTD2016112514355531), and the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008). We are grateful to the editor and anonymous reviewers for their constructive comments on the early version of this paper.

REFERENCES

[1] H. Abdeen, H. Sahraoui, and C. Debreceni, "Multi-objective optimization in rule-based design space exploration," in IEEE/ACM International Conference on Automated Software Engineering, 2014, pp. 289–300.
[2] R. B. Abdessalem, S. Nejati, L. C. Briand, and T. Stifter, "Testing advanced driver assistance systems using multi-objective search and neural networks," in IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 63–74.
[3] ——, "Testing vision-based control systems using learnable evolutionary algorithms," in . IEEE, 2018, pp. 1016–1026.
[4] S. Ali, P. Arcaini, D. Pradhan, S. A. Safdar, and T. Yue, "Quality indicators in search-based software engineering: An empirical evaluation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 2, pp. 1–29, 2020.
[5] W. K. G. Assunção, T. E. Colanzi, S. R. Vergilio, and A. Pozo, "A multi-objective optimization approach for the integration and test order problem," Information Sciences, vol. 267, no. 2, pp. 119–139, 2014.
[6] G. Bavota, F. Carnevale, A. D. Lucia, M. D. Penta, and R. Oliveto, "Putting the developer in-the-loop: An interactive GA for software re-modularization," in International Symposium on Search Based Software Engineering, 2012, pp. 75–89.
[7] S. Boukharata, A. Ouni, M. Kessentini, S. Bouktif, and H. Wang, "Improving web service interfaces modularity using multi-objective optimization," Automated Software Engineering, vol. 26, no. 2, pp. 275–312, 2019.
275–312, 2019.[8] I. Boussa¨ıd, P. Siarry, and M. Ahmed-Nacer, “A survey on search-based model-driven engineering,” Automated Software Engineer-ing , vol. 24, no. 2, pp. 233–294, 2017.[9] M. Bowman, L. C. Briand, and Y. Labiche, “Solving the classresponsibility assignment problem in object-oriented analysiswith multi-objective genetic algorithms,” IEEE Transactions onSoftware Engineering , vol. 36, no. 6, pp. 817–837, 2010.[10] B. Bozkurt, J. W. Fowler, E. S. Gel, B. Kim, M. K¨oksalan, andJ. Wallenius, “Quantitative comparison of approximate solu-tion sets for multicriteria optimization problems with weightedTchebycheff preference function,” Operations Research , vol. 58,no. 3, pp. 650–659, 2010.[11] S. A. Busari and E. Letier, “RADAR: A lightweight tool forrequirements and architecture decision analysis,” in Proceedingsof the 39th International Conference on Software Engineering . IEEEPress, 2017, pp. 552–562.[12] K.-Y. Cai and D. Card, “An analysis of research topics in softwareengineering–2006,” Journal of Systems and Software , vol. 81, no. 6,pp. 1051–1058, 2008.[13] R. Calinescu, M. Ceska, S. Gerasimou, M. Kwiatkowska, andN. Paoletti, “Designing robust software systems through para-metric markov chain synthesis,” in IEEE International Conferenceon Software Architecture , 2017.[14] G. Canfora, A. D. Lucia, M. D. Penta, R. Oliveto, A. Panichella,and S. Panichella, “Multi-objective cross-project defect predic-tion,” in IEEE Sixth International Conference on Software Testing,Verification and Validation , 2013, pp. 252–261.[15] ——, “Defect prediction as a multiobjective optimization prob-lem,” Software Testing Verification and Reliability , vol. 25, no. 4, pp.426–459, 2015.[16] J. Chen, V. Nair, R. Krishna, and T. Menzies, “‘Sampling’ as abaseline optimizer for search-based software engineering,” IEEETransactions on Software Engineering , vol. 45, no. 6, pp. 597–614,June 2019.[17] T. Chen and R. Bahsoon, “Self-adaptive trade-off decision mak-ing for autoscaling cloud-based services,” IEEE Transactions onServices Computing , vol. 10, no. 4, pp. 618–632, 2017.[18] T. Chen, R. Bahsoon, S. Wang, and X. Yao, “To adapt or notto adapt?: Technical debt and learning driven self-adaptationfor managing runtime performance,” in Proceedings of the 2018ACM/SPEC International Conference on Performance Engineering,ICPE 2018, Berlin, Germany, April 09-13, 2018 , 2018, pp. 48–55.[19] T. Chen, K. Li, R. Bahsoon, and X. Yao, “FEMOSAA: Featureguided and knee driven multi-objective optimization for self-adaptive software,” ACM Transactions on Software Engineering andMethodology , vol. 27, no. 2, 2018.[20] T. Chen, M. Li, K. Li, and K. Deb, “Search-based software engi-neering for self-adaptive systems: One survey, five disappoint-ments and six opportunities,” CoRR , vol. abs/2001.08236, 2020.[21] T. Chen, M. Li, and X. Yao, “On the effects of seeding strategies:a case for search-based multi-objective service composition,” in Proceedings of the Genetic and Evolutionary Computation Conference .ACM, 2018, pp. 1419–1426.[22] ——, “Standing on the shoulders of giants: Seeding search-basedmulti-objective optimization with prior knowledge for softwareservice composition,” Inf. Softw. Technol. , vol. 114, pp. 155–175,2019. [Online]. Available: https://doi.org/10.1016/j.infsof.2019.05.013[23] X. Chen, Y. Zhao, Q. Wang, and Z. Yuan, “Multi: Multi-objectiveeffort-aware just-in-time software defect prediction,” Informationand Software Technology , vol. 93, pp. 1–13, 2018.[24] F. Chicano, F. Luna, A. J. Nebro, and E. 
Alba, “Using multi-objective metaheuristics to solve the software project schedulingproblem,” in Conference on Genetic and Evolutionary Computation ,2011, pp. 1915–1922. [25] J. L. Cochrane and M. Zeleny, Multiple Criteria Decision Making .University of South Carolina Press, 1973.[26] C. A. C. Coello and M. R. Sierra, “A study of the parallelizationof a coevolutionary multi-objective evolutionary algorithm,” in Proceedings of the Mexican International Conference on ArtificialIntelligence (MICAI) , 2004, pp. 688–697.[27] T. E. Colanzi, W. K. G. Assunc¸ ˜ao, P. R. Farah, S. R. Vergilio, andG. Guizzo, “A review of ten years of the symposium on search-based software engineering,” in Search-Based Software Engineering ,S. Nejati and G. Gay, Eds., 2019.[28] V. G. Da Fonseca, C. M. Fonseca, and A. O. Hall, “Inferentialperformance assessment of stochastic optimisers and the attain-ment function,” in International Conference on Evolutionary Multi-Criterion Optimization . Springer, 2001, pp. 213–225.[29] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and eli-tist multiobjective genetic algorithm: Nsga-ii,” IEEE Transactionson Evolutionary Computation , vol. 6, no. 2, pp. 182–197, 2002.[30] J. J. Durillo, V. Nae, and R. Prodan, “Multi-objective energy-efficient workflow scheduling using list-based heuristics,” FutureGeneration Computer Systems , vol. 36, no. 3, pp. 221–236, 2014.[31] J. J. Durillo, Y. Y. Zhang, E. Alba, and A. J. Nebro, “A study of themulti-objective next release problem,” in International Symposiumon Search Based Software Engineering , 2009, pp. 29–60.[32] M. G. Epitropakis, S. Yoo, M. Harman, and E. K. Burke, “Empir-ical evaluation of Pareto efficient multi-objective regression testcase prioritisation,” in International Symposium on Software Testingand Analysis , 2015, pp. 234–245.[33] K. R. Felizardo, E. Mendes, M. Kalinowski, ´E. F. de Souza,and N. L. Vijaykumar, “Using forward snowballing to updatesystematic reviews in software engineering,” in Proceedings ofthe 10th ACM/IEEE International Symposium on Empirical SoftwareEngineering and Measurement, ESEM 2016, Ciudad Real, Spain,September 8-9, 2016 . ACM, 2016, pp. 53:1–53:6.[34] J. Ferrer, F. Chicano, and E. Alba, “Evolutionary algorithmsfor the multi-objective test data generation problem,” SoftwarePractice and Experience , vol. 42, no. 11, pp. 1331–1362, 2012.[35] F. Ferrucci, M. Harman, J. Ren, and F. Sarro, “Not going totake this anymore: multi-objective overtime planning for soft-ware engineering projects,” in International Conference on SoftwareEngineering , 2013, pp. 462–471.[36] A. Finkelstein, M. Harman, S. A. Mansouri, J. Ren, and Y. Zhang,“A search based approach to fairness analysis in requirementassignments to aid negotiation, mediation and decision making,” Requirements Engineering , vol. 14, no. 4, pp. 231–245, 2009.[37] M. Fleck, J. Troya, M. Kessentini, M. Wimmer, and B. Alkhazi,“Model transformation modularization as a many-objective op-timization problem,” IEEE Transactions on Software Engineering ,vol. 43, no. 11, pp. 1009–1032, 2017.[38] C. M. Fonseca and P. J. Fleming, “On the performance assess-ment and comparison of stochastic multiobjective optimizers,”in International Conference on Parallel Problem Solving from Nature .Springer, 1996, pp. 584–593.[39] S. Frey, F. Fittkau, and W. Hasselbring, “Search-based geneticoptimization for deployment and reconfiguration of softwarein the cloud,” in International Conference on Software Engineering ,2013, pp. 512–521.[40] W. Fu, T. Menzies, and X. 
Shen, “Tuning for software analytics: Isit really necessary?” Inf. Softw. Technol. , vol. 76, pp. 135–146, 2016.[41] J. F ¨ul¨op, “Introduction to decision making methods,” in BDEI-3workshop, Washington . Citeseer, 2005, pp. 1–15.[42] M. Galster, D. Weyns, D. Tofan, B. Michalik, and P. Avgeriou,“Variability in software systems - A systematic literature review,” IEEE Trans. Software Eng. , vol. 40, no. 3, pp. 282–306, 2014.[43] S. Gerasimou, G. Tamburrelli, and R. Calinescu, “Search-basedsynthesis of probabilistic models for quality-of-service softwareengineering (t),” in IEEE/ACM International Conference on Auto-mated Software Engineering , 2016, pp. 319–330.[44] S. Gueorguiev, M. Harman, and G. Antoniol, “Software projectplanning for robustness and completion time in the presenceof uncertainty using multi-objective search based software en-gineering,” in Conference on Genetic and Evolutionary Computation ,2009, pp. 1673–1680.[45] J. Guo, J. H. Liang, K. Shi, D. Yang, J. Zhang, K. Czarnecki,V. Ganesh, and H. Yu, “Smtibea: a hybrid multi-objective op-timization algorithm for configuring large constrained softwareproduct lines,” Software & Systems Modeling , vol. 18, no. 2, pp.1447–1466, 2019. ANUSCRIPT ACCEPTED BY IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2020 29 [46] M. P. Hansen and A. Jaszkiewicz, “Evaluating the quality ofapproximations to the nondominated set,” Institute of Mathe-matical Modeling, Technical University of Denmark, IMM-REP-1998-7, 1998.[47] M. Harman, Y. Jia, J. Krinke, W. B. Langdon, J. Petke, andY. Zhang, “Search based software engineering for software prod-uct line engineering: a survey and directions for future work,”in Proceedings of the 18th International Software Product Line Confer-ence , 2014, pp. 5–18.[48] M. Harman, J. Krinke, J. Ren, and S. Yoo, “Search based datasensitivity analysis applied to requirement engineering,” in Con-ference on Genetic and Evolutionary Computation , 2009, pp. 1681–1688.[49] M. Harman, K. Lakhotia, J. Singer, D. R. White, and S. Yoo,“Cloud engineering is search based software engineering too,” Journal of Systems and Software , vol. 86, no. 9, pp. 2225–2241, 2013.[50] M. Harman, S. A. Mansouri, and Y. Zhang, “Search-based soft-ware engineering: Trends, techniques and applications,” ACMComputing Surveys (CSUR) , vol. 45, no. 1, p. 11, 2012.[51] W. Heaven and E. Letier, “Simulating and optimising design de-cisions in quantitative goal models,” in Requirements EngineeringConference , 2011, pp. 79–88.[52] C. Henard, M. Papadakis, M. Harman, and Y. L. Traon, “Combin-ing multi-objective search and constraint solving for configuringlarge software product lines,” in IEEE/ACM IEEE InternationalConference on Software Engineering , 2015, pp. 517–528.[53] R. M. Hierons, M. Li, X. Liu, J. A. Parejo, S. Segura, and X. Yao,“Many-objective test suite generation for software product lines,” ACM Transactions on Software Engineering and Methodology , vol. 29,no. 1, 2020.[54] R. M. Hierons, M. Li, X. Liu, S. Segura, and W. Zheng, “SIP:Optimal product selection from feature models using many-objective evolutionary optimization,” ACM Transactions on Soft-ware Engineering and Methodology , vol. 25, no. 2, pp. 1–39, 2016.[55] H. Ishibuchi, R. Imada, Y. Setoguchi, and Y. Nojima, “How tospecify a reference point in hypervolume calculation for fair per-formance comparison,” Evolutionary Computation , vol. 26, no. 3,pp. 411–440, 2018.[56] H. Ishibuchi, H. Masuda, Y. Tanigaki, and Y. 
Nojima, “Modifieddistance calculation in generational distance and inverted gener-ational distance,” in Proceedings of the International Conference onEvolutionary Multi-Criterion Optimization (EMO) , 2015, pp. 110–125.[57] H. L. Jakubovski Filho, T. N. Ferreira, and S. R. Vergilio, “Prefer-ence based multi-objective algorithms applied to the variabilitytesting of software product lines,” Journal of Systems and Software ,vol. 151, pp. 194–209, 2019.[58] S. Kalboussi, S. Bechikh, M. Kessentini, and L. B. Said,“Preference-based many-objective evolutionary testing generatesharder test cases for autonomous agents,” in International Sympo-sium on Search Based Software Engineering , 2013, pp. 245–250.[59] W. Kessentini, H. Sahraoui, and M. Wimmer, “Automated meta-model/model co-evolution: A search-based approach,” Informa-tion and Software Technology , vol. 106, pp. 49–67, 2019.[60] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey,and S. Linkman, “Systematic literature reviews in software engi-neering - a systematic literature review,” Information and SoftwareTechnology , vol. 51, no. 1, pp. 7–15, 2009.[61] J. D. Knowles and D. W. Corne, “On metrics for comparingnondominated sets,” in Proceeding of the Congress EvolutionaryComputation (CEC) , vol. 1, 2002, pp. 711–716.[62] J. D. Knowles, L. Thiele, and E. Zitzler, “A tutorial on theperformance assessment of stochastic multiobjective optimizers,”Computer Engineering and Networks Laboratory (TIK), ETHZurich, Switzerland, Tech. Rep. No. 214, 2006.[63] J. D. Knowles, “A summary-attainment-surface plotting methodfor visualizing the performance of stochastic multiobjective op-timizers,” in . IEEE, 2005, pp. 552–557.[64] S. Kumar, R. Bahsoon, T. Chen, K. Li, and R. Buyya, “Multi-tenantcloud service composition using evolutionary optimization,” in , 2018, pp. 972–979.[65] A. C. Kumari and K. Srinivas, “Hyper-heuristic approach formulti-objective software module clustering,” Journal of Systemsand Software , vol. 117, pp. 384–401, 2016. [66] N.-Z. Lee, P. Arcaini, S. Ali, and F. Ishikawa, “Stability analysisfor safety of automotive multi-product lines: A search-based ap-proach,” in Proceedings of the Genetic and Evolutionary ComputationConference . ACM, 2019, pp. 1241–1249.[67] E. Letier, D. Stefan, and E. T. Barr, “Uncertainty, risk, and in-formation value in software requirements and architecture,” in International Conference on Software Engineering , 2014, pp. 883–894.[68] H. Li, G. Casale, and T. Ellahi, “SLA-driven planning and opti-mization of enterprise applications,” in Joint Wosp/sipew Interna-tional Conference on Performance Engineering , 2010, pp. 117–128.[69] L. Li, M. Harman, E. Letier, and Y. Zhang, “Robust next releaseproblem:handling uncertainty during optimization,” in Confer-ence on Genetic and Evolutionary Computation , 2014, pp. 1247–1254.[70] M. Li, S. Yang, and X. Liu, “Diversity comparison of Pareto frontapproximations in many-objective optimization,” IEEE Transac-tions on Cybernetics , vol. 44, no. 12, pp. 2568–2584, 2014.[71] ——, “Shift-based density estimation for Pareto-based algorithmsin many-objective optimization,” IEEE Transactions on Evolution-ary Computation , vol. 18, no. 3, pp. 348–365, 2014.[72] ——, “A performance comparison indicator for Pareto frontapproximations in many-objective optimization,” in Proceedingsof the Genetic and Evolutionary Computation Conference (GECCO) ,2015, pp. 703–710.[73] M. Li, L. Zhen, and X. 
Yao, “How to read many-objective solu-tion sets in parallel coordinates,” IEEE Computational IntelligenceMagazine , vol. 12, no. 4, pp. 88–97, 2017.[74] M. Li, T. Chen, and X. Yao, “A critical review of ”A practicalguide to select quality indicators for assessing Pareto-basedsearch algorithms in search-based software engineering”: Essayon quality indicator selection for SBSE,” in , 2018, pp. 17–20.[75] M. Li and X. Yao, “Quality evaluation of solution sets in multiob-jective optimisation: A survey,” ACM Computing Surveys , vol. 52,no. 2, 2019.[76] Y. Li, T. Yue, S. Ali, and L. Zhang, “Zen-ReqOptimizer: a search-based approach for requirements assignment optimization,” Em-pirical Software Engineering , vol. 22, no. 1, pp. 175–234, 2017.[77] X. Lian, L. Zhang, J. Jiang, and W. Goss, “An approach foroptimized feature selection in large-scale software product lines,” Journal of Systems and Software , vol. 137, pp. 636–651, 2018.[78] A. Liefooghe and B. Derbel, “A correlation analysis of set qualityindicator values in multiobjective optimization,” in Proceedingsof the Genetic and Evolutionary Computation Conference (GECCO) .ACM, 2016, pp. 581–588.[79] M. L´opez-Ib´anez, L. Paquete, and T. St ¨utzle, “Exploratory anal-ysis of stochastic local search algorithms in biobjective opti-mization,” in Experimental methods for the analysis of optimizationalgorithms . Springer, 2010, pp. 209–222.[80] U. Mansoor, M. Kessentini, B. R. Maxim, and K. Deb, “Multi-objective code-smells detection using good and bad design ex-amples,” Software Quality Journal , vol. 25, no. 2, pp. 1–24, 2016.[81] U. Mansoor, M. Kessentini, M. Wimmer, and K. Deb, “Multi-viewrefactoring of class and activity diagrams using a multi-objectiveevolutionary algorithm,” Software Quality Journal , vol. 25, no. 2,pp. 1–29, 2015.[82] K. Mao, M. Harman, and Y. Jia, “Sapienz: Multi-objective auto-mated testing for android applications,” in Proceedings of the 25thInternational Symposium on Software Testing and Analysis , 2016, pp.94–105.[83] T. Mariani, G. Guizzo, R. S. Vergilio, and T. R. A. Pozo, “Gram-matical evolution for the multi-objective integration and testorder problem,” in Proceedings of the Genetic and EvolutionaryComputation Conference (GECCO) , 2016, pp. 1069–1076.[84] A. Martens, H. Koziolek, S. Becker, and R. Reussner, “Automat-ically improve software architecture models for performance,reliability, and cost using evolutionary algorithms,” in JointWOSP/SIPEW International Conference on Performance Engineering ,2010, pp. 105–116.[85] P. McMinn, “Search-based software testing: Past, present andfuture,” in . IEEE, 2011, pp. 153–163.[86] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sas-nauskas, “A search-based approach for accurate identificationof log message formats,” in Proceedings of the 26th Conference onProgram Comprehension . ACM, 2018, pp. 167–177. ANUSCRIPT ACCEPTED BY IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2020 30 [87] H. Meunier, E. G. Talbi, and P. Reininger, “A multiobjectivegenetic algorithm for radio network optimization,” in Proceedingsof the 2000 Congress on Evolutionary Computation , vol. 1, 2000, pp.317–324.[88] L. L. Minku and X. Yao, “Software effort estimation as a multiob-jective learning problem,” ACM Transactions on Software Engineer-ing and Methodology , vol. 22, no. 4, pp. 402–418, 2013.[89] M. W. Mkaouer, M. Kessentini, S. Bechikh, M. . Cinn´eide, andK. 
Deb, “On the use of many quality attributes for softwarerefactoring: a many-objective search-based software engineeringapproach,” Empirical Software Engineering , vol. 21, no. 6, pp. 2503–2545, 2016.[90] M. W. Mkaouer, M. Kessentini, S. Bechikh, and K. Deb, “Recom-mendation system for software refactoring using innovizationand interactive dynamic optimization,” in ACM/IEEE Interna-tional Conference on Automated Software Engineering , 2014, pp. 331–336.[91] W. Mkaouer, M. Kessentini, A. Shaout, P. Koligheu, S. Bechikh,K. Deb, and A. Ouni, “Many-objective software remodularizationusing NSGA-III,” ACM Trans. Softw. Eng. Methodol. , vol. 24, no. 3,pp. 17:1–17:45, 2015.[92] C. Ni, X. Chen, F. Wu, Y. Shen, and Q. Gu, “An empirical study onpareto based multi-objective feature selection for software defectprediction,” Journal of Systems and Software , vol. 152, pp. 215–238,2019.[93] R. Olaechea, D. Rayside, J. Guo, and K. Czarnecki, “Compari-son of exact and approximate multi-objective optimization forsoftware product lines,” in International Software Product LineConference , 2014, pp. 92–101.[94] A. Ouni, M. Kessentini, and H. Sahraoui, “Search-based refac-toring using recorded code changes,” in European Conference onSoftware Maintenance and Reengineering , 2013, pp. 221–230.[95] A. Ouni, M. Kessentini, H. Sahraoui, and M. Boukadoum, “Main-tainability defects detection and correction: a multi-objectiveapproach,” Automated Software Engineering , vol. 20, no. 1, pp. 47–79, 2013.[96] A. Ouni, M. Kessentini, H. Sahraoui, and M. S. Hamdi, “Search-based refactoring: Towards semantics preservation,” in IEEEInternational Conference on Software Maintenance , 2012, pp. 347–356.[97] ——, “The use of development history in software refactoringusing a multi-objective evolutionary algorithm,” in Genetic andEvolutionay Computation Conference , 2013, pp. 1461–1468.[98] A. Ouni, M. Kessentini, H. Sahraoui, K. Inoue, and K. Deb,“Multi-criteria code refactoring using search-based software en-gineering: An industrial case study,” ACM Transactions on Soft-ware Engineering and Methodology , vol. 25, no. 3, p. 23, 2016.[99] A. Ouni, R. G. Kula, M. Kessentini, T. Ishio, D. M. German, andK. Inoue, “Search-based software library recommendation usingmulti-objective optimization,” Information and Software Technol-ogy , vol. 83, pp. 55–75, 2017.[100] A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulatingbranch coverage as a many-objective optimization problem,” in IEEE International Conference on Software Testing, Verification andValidation , 2015, pp. 1–10.[101] ——, “Automated test case generation as a many-objective op-timisation problem with dynamic selection of the targets,” IEEETransactions on Software Engineering , vol. 44, no. 2, pp. 122–158,2018.[102] ——, “Incremental control dependency frontier exploration formany-criteria test case generation,” in International Symposium onSearch Based Software Engineering . Springer, 2018, pp. 309–324.[103] A. Panichella, R. Oliveto, M. D. Penta, and A. D. Lucia, “Im-proving multi-objective test case selection by injecting diversityin genetic algorithms,” IEEE Transactions on Software Engineering ,vol. 41, no. 4, pp. 358–383, 2015.[104] J. A. Parejo, A. B. S´anchez, S. Segura, A. Ruiz-Cort´es, R. E. Lopez-Herrejon, and A. Egyed, “Multi-objective test case prioritizationin highly configurable systems: A case study,” Journal of Systemsand Software , vol. 122, pp. 287–310, 2016.[105] G. G. Pascual, R. E. Lopez-Herrejon, L. Fuentes, and A. 
Egyed,“Applying multiobjective evolutionary algorithms to dynamicsoftware product lines for reconfiguring mobile applications,” Journal of Systems and Software , vol. 103, no. C, pp. 392–411, 2015.[106] D. Pradhan, S. Wang, S. Ali, T. Yue, and M. Liaaen, “REMAP:using rule mining and multi-objective search for dynamic test case prioritization,” in IEEE International Conference on SoftwareTesting, Verification and Validation (ICST) . IEEE, 2018, pp. 46–57.[107] D. Pradhan, S. Wang, T. Yue, S. Ali, and M. Liaaen, “Search-based test case implantation for testing untested configurations,” Information and Software Technology , vol. 111, pp. 22–36, 2019.[108] K. Praditwong, M. Harman, and X. Yao, “Software module clus-tering as a multi-objective search problem,” Software EngineeringIEEE Transactions on , vol. 37, no. 2, pp. 264–282, 2011.[109] A. Ram´ırez, J. R. Romero, and S. Ventura, “A comparative studyof many-objective evolutionary algorithms for the discovery ofsoftware architectures,” Empirical Software Engineering , vol. 21,no. 6, pp. 2546–2600, 2016.[110] A. Ramirez, J. R. Romero, and S. Ventura, “A survey of many-objective optimisation in search-based software engineering,” Journal of Systems and Software , vol. 149, pp. 382–395, 2019.[111] M. Ravber, M. Mernik, and M. ˇCrepinˇsek, “The impact of qualityindicators on the rating of multi-objective evolutionary algo-rithms,” Applied Soft Computing , vol. 55, pp. 265–275, 2017.[112] N. B. Ruparelia, “Software development lifecycle models,” ACMSIGSOFT Software Engineering Notes , vol. 35, no. 3, pp. 8–13, 2010.[113] T. Saber, D. Brevet, G. Botterweck, and A. Ventresque, “Is seedinga good strategy in multi-objective feature selection when featuremodels evolve?” Information and Software Technology , vol. 95, pp.266–280, 2018.[114] F. Sarro, F. Ferrucci, M. Harman, A. Manna, and J. Ren, “Adaptivemulti-objective evolutionary algorithms for overtime planningin software projects,” IEEE Transactions on Software Engineering ,vol. 43, no. 10, pp. 898–917, 2017.[115] F. Sarro, A. Petrozziello, and M. Harman, “Multi-objective soft-ware effort estimation,” vol. 21, no. 2, pp. 619–630, 2016.[116] S. Sayın, “Measuring the quality of discrete representations ofefficient sets in multiple objective mathematical programming,” Mathematical Programming , vol. 87, no. 3, pp. 543–560, 2000.[117] A. S. Sayyad and H. Ammar, “Pareto-optimal search-basedsoftware engineering (posbse): A literature survey,” in The 2ndInternational Workshop on Realizing Artificial Intelligence Synergiesin Software Engineering (RAISE) . IEEE, 2013, pp. 21–27.[118] A. S. Sayyad, J. Ingram, T. Menzies, and H. Ammar, “Optimumfeature selection in software product lines: Let your model andvalues guide your search,” in International Workshop on CombiningModelling and Search-Based Software Engineering , 2013, pp. 22–27.[119] ——, “Scalable product line configuration: A straw to break thecamel’s back,” in IEEE/ACM International Conference on AutomatedSoftware Engineering , 2013, pp. 465–474.[120] A. S. Sayyad, T. Menzies, and H. Ammar, “On the value of userpreferences in search-based software engineering: A case studyin software product lines,” in International Conference on SoftwareEngineering , 2013, pp. 492–501.[121] J. R. Schott, “Fault tolerant design using single and multicriteriagenetic algorithm optimization,” Master’s thesis, Department ofAeronautics and Astronautics, Massachusetts Institute of Tech-nology, 1995.[122] O. Schutze, X. Esquivel, A. Lara, and C. C. A. 
[123] S. Segura, R. E. Lopez-Herrejon, and A. Egyed, “Multi-objective test case prioritization in highly configurable systems,” Journal of Systems and Software, vol. 122, no. C, pp. 287–310, 2016.
[124] A. Shahbazi and J. Miller, “Black-box string test case generation through a multi-objective optimization,” IEEE Transactions on Software Engineering, vol. 42, no. 4, pp. 361–378, 2016.
[125] X. Shen, L. Minku, R. Bahsoon, and X. Yao, “Dynamic software project scheduling through a proactive-rescheduling method,” IEEE Transactions on Software Engineering, vol. 42, no. 7, pp. 658–686, 2016.
[126] X. Shen, L. Minku, N. Marturi, Y. Guo, and Y. Han, “A Q-learning-based memetic algorithm for multi-objective dynamic software project scheduling,” Information Sciences, vol. 428, pp. 1–29, 2018.
[127] S. Y. Shin, S. Nejati, M. Sabetzadeh, L. C. Briand, and F. Zimmer, “Test case prioritization for acceptance testing of cyber physical systems: a multi-objective search-based approach,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 2018, pp. 49–60.
[128] C. L. Simons and I. C. Parmee, “Elegant object-oriented software design via interactive, evolutionary computation,” IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 42, no. 6, pp. 1797–1805, 2012.
[129] C. L. Simons, I. C. Parmee, and R. Gwynllyw, “Interactive, evolutionary search in upstream object-oriented class design,” IEEE Transactions on Software Engineering, vol. 36, no. 6, pp. 798–816, 2010.
[130] D. Sobhy, L. L. Minku, R. Bahsoon, T. Chen, and R. Kazman, “Run-time evaluation of architectures: A case study of diversification in IoT,” Journal of Systems and Software, vol. 159, 2020. [Online]. Available: https://doi.org/10.1016/j.jss.2019.110428
[131] B. Tan, H. Ma, Y. Mei, and M. Zhang, “Evolutionary multi-objective optimization for web service location allocation problem,” IEEE Transactions on Services Computing, 2018.
[132] T. H. Tan, Y. Xue, M. Chen, J. Sun, Y. Liu, and J. S. Dong, “Optimizing selection of competing features via feedback-directed evolutionary algorithms,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis. ACM, 2015, pp. 246–256.
[133] D. A. Van Veldhuizen and G. B. Lamont, “Evolutionary computation and convergence to a Pareto front,” in Late Breaking Papers at the Genetic Programming Conference, 1998, pp. 221–228.
[134] H. Wada, J. Suzuki, Y. Yamano, and K. Oba, “E3: A multiobjective optimization framework for SLA-aware service composition,” IEEE Transactions on Services Computing, vol. 5, no. 3, pp. 358–372, 2012.
[135] F. Wagner, A. Klein, B. Klopper, F. Ishikawa, and S. Honiden, “Multi-objective service composition with time- and input-dependent QoS,” in IEEE International Conference on Web Services, 2012, pp. 234–241.
[136] T. Wagner and H. Trautmann, “Integration of preferences in hypervolume-based multiobjective evolutionary algorithms by means of desirability functions,” IEEE Transactions on Evolutionary Computation, vol. 14, no. 5, pp. 688–701, 2010.
[137] S. Wang, S. Ali, and A. Gotlieb, “Cost-effective test suite minimization in product lines using search techniques,” Journal of Systems and Software, vol. 103, no. C, pp. 370–391, 2015.
[138] S. Wang, S. Ali, T. Yue, Y. Li, and M. Liaaen, “A practical guide to select quality indicators for assessing Pareto-based search algorithms in search-based software engineering,” in International Conference on Software Engineering. IEEE, 2016, pp. 631–642.
[139] Z. Wang, K. Tang, and X. Yao, “Multi-objective approaches to optimal testing resource allocation in modular software systems,” IEEE Transactions on Reliability, vol. 59, no. 3, pp. 563–575, 2010.
[140] Z. Wen and V. Tzerpos, “An effectiveness measure for software clustering algorithms,” in Proceedings of the 12th IEEE International Workshop on Program Comprehension. IEEE, 2004, pp. 194–203.
[141] D. R. White, A. Arcuri, and J. A. Clark, “Evolutionary improvement of programs,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 4, pp. 515–538, 2011.
[142] F. Wu, W. Weimer, M. Harman, Y. Jia, and J. Krinke, “Deep parameter optimisation,” in Genetic and Evolutionary Computation Conference, 2015, pp. 1375–1382.
[143] Y. Xiang, X. Yang, Y. Zhou, and H. Huang, “Enhancing decomposition-based algorithms by estimation of distribution for constrained optimal software product selection,” IEEE Transactions on Evolutionary Computation, 2019.
[144] Y. Xiang, Y. Zhou, Z. Zheng, and M. Li, “Configuring software product lines by combining many-objective optimization and SAT solvers,” ACM Transactions on Software Engineering and Methodology, vol. 26, no. 4, 2018.
[145] S. Yoo and M. Harman, “Using hybrid algorithm for Pareto efficient multi-objective test suite minimisation,” Journal of Systems and Software, vol. 83, no. 4, pp. 689–701, 2010.
[146] G. Zhang, Z. Su, M. Li, F. Yue, J. Jiang, and X. Yao, “Constraint handling in NSGA-II for solving optimal testing resource allocation problems,” IEEE Transactions on Reliability, vol. 66, no. 4, pp. 1193–1212, 2017.
[147] Y. Zhang, M. Harman, and S. L. Lim, “Empirical evaluation of search based requirements interaction management,” Information and Software Technology, vol. 55, no. 1, pp. 126–152, 2013.
[148] Y. Zhang, M. Harman, and S. A. Mansouri, “The multi-objective next release problem,” in Proceedings of the Genetic and Evolutionary Computation Conference, 2007, pp. 1129–1137.
[149] Y. Zhang, M. Harman, G. Ochoa, G. Ruhe, and S. Brinkkemper, “An empirical study of meta- and hyper-heuristic search for multi-objective release planning,” ACM Transactions on Software Engineering and Methodology, vol. 27, no. 1, p. 3, 2018.
[150] W. Zheng, R. M. Hierons, M. Li, X. H. Liu, and V. Vinciotti, “Multi-objective optimisation for regression testing,” Information Sciences, vol. 334, pp. 1–16, 2016.
[151] A. Zhou, Y. Jin, Q. Zhang, B. Sendhoff, and E. Tsang, “Combining model-based and genetics-based offspring generation for multi-objective optimization using a convergence criterion,” in IEEE Congress on Evolutionary Computation, 2006, pp. 892–899.
[152] E. Zitzler, D. Brockhoff, and L. Thiele, “The hypervolume indicator revisited: On the design of Pareto-compliant indicators via weighted integration,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2007, pp. 862–876.
[153] E. Zitzler and L. Thiele, “Multiobjective optimization using evolutionary algorithms - a comparative case study,” in Proceedings of the International Conference on Parallel Problem Solving from Nature (PPSN), 1998, pp. 292–301.
[154] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. Da Fonseca, “Performance assessment of multiobjective optimizers: An analysis and review,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 2, pp. 117–132, 2003.
[155] W. Zou, D. Lo, Z. Chen, X. Xia, Y. Feng, and B. Xu, “How practitioners perceive automated bug report management techniques,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.

Miqing Li is currently a Lecturer in the School of Computer Science at the University of Birmingham. His research is principally on multi-objective optimisation, where he focuses on developing population-based randomised algorithms (mainly evolutionary algorithms) for both general challenging problems (e.g., many-objective optimisation, constrained optimisation, robust optimisation, expensive optimisation) and specific challenging problems (e.g., those in software engineering, system engineering, product disassembly, post-disaster response, neural architecture search, and reinforcement learning for games). Dr Li has published over 60 research papers in scientific journals and international conferences. Some of his papers, since published, have been amongst the most cited papers in the corresponding journals, such as IEEE Transactions on Evolutionary Computation, Artificial Intelligence, ACM Transactions on Software Engineering and Methodology, IEEE Transactions on Parallel and Distributed Systems, and ACM Computing Surveys. His work has received the Best Student Paper Award or a Best Paper Award nomination at mainstream evolutionary computation conferences, including CEC, GECCO, and SEAL. Dr Li is the founding chair of the IEEE CIS Task Force on Many-Objective Optimisation.

Tao Chen (M’15) received his Ph.D. from the School of Computer Science, University of Birmingham, United Kingdom, in 2016. He is currently a Lecturer (Assistant Professor) in Computer Science at the Department of Computer Science, Loughborough University, United Kingdom. He has broad research interests in software engineering, including but not limited to performance engineering, self-adaptive software systems, search-based software engineering, data-driven software engineering, and computational intelligence. As the lead author, his work has been published in internationally renowned journals, such as IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, IEEE Transactions on Services Computing, and Proceedings of the IEEE, as well as in top-tier conferences, e.g., ICSE, ASE, and GECCO. Among other roles, Dr. Chen regularly serves as a PC member for various conferences in his fields and is an associate editor for the Services Transactions on Internet of Things.

Xin Yao (F’03) received the BSc degree from the University of Science and Technology of China (USTC), Hefei, China, in 1982, the MSc degree from the North China Institute of Computing Technologies, Beijing, China, in 1985, and the PhD degree from USTC in 1990. He is a Chair Professor of Computer Science at the Southern University of Science and Technology (SUSTech), Shenzhen, China, and a part-time Professor of Computer Science at the University of Birmingham, UK. His current research interests include evolutionary computation, machine learning, and their real-world applications, especially to software engineering. He started his work on search-based software engineering (SBSE) more than a decade ago, including “Coevolving Programs and Unit Tests from Their Specification” at ASE’07 and “Software Module Clustering as a Multi-Objective Search Problem” in March 2011’s IEEE Transactions on Software Engineering.
His latest work on SBSE includes “Software Effort Interval Prediction via Bayesian Inference and Synthetic Bootstrap Resampling” in January 2019’s ACM Transactions on Software Engineering and Methodology and “Synergizing Domain Expertise With Self-Awareness in Software Systems: A Patternized Architecture Guideline” in July 2020’s Proceedings of the IEEE. He was a recipient of the Royal Society Wolfson Research Merit Award in 2012, the IEEE Computational Intelligence Society (CIS) Evolutionary Computation Pioneer Award in 2013, and the IEEE Frank Rosenblatt Award in 2020. His work won the 2001 IEEE Donald G. Fink Prize Paper Award; the 2010, 2016, and 2017 IEEE Transactions on Evolutionary Computation Outstanding Paper Awards; the 2011 IEEE Transactions on Neural Networks Outstanding Paper Award; and many other best paper awards at conferences. He was the President of IEEE CIS from 2014 to 2015 and the Editor-in-Chief of IEEE Transactions on Evolutionary Computation.