An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms (Extended Version)
Márcio P. Basgalupp, Rodrigo C. Barros, Alex G. C. de Sá, Gisele L. Pappa, Rafael G. Mantovani, André C. P. L. F. de Carvalho, Alex A. Freitas
Abstract
This paper presents an experimental comparison among four Automated Machine Learning (AutoML) methods for recommending the best classification algorithm for a given input dataset. Three of these methods are based on Evolutionary Algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method based on the Combined Algorithm Selection and Hyper-parameter optimisation (CASH) approach. The EA-based methods build classification algorithms from a single machine learning paradigm: either decision-tree induction, rule induction, or Bayesian network classification. Auto-WEKA combines algorithm selection and hyper-parameter optimisation to recommend classification algorithms from multiple paradigms. We performed controlled experiments where these four AutoML methods were given the same runtime limit for different values of this limit. In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant. However, the EA evolving decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets, by comparison with many algorithms from other learning paradigms that can be recommended by Auto-WEKA. We also observed that Auto-WEKA has shown meta-overfitting, a form of overfitting at the meta-learning level, rather than at the base-learning level.
Classification is one of the main machine learning tasks and, hence, there is a large variety of classification algorithms available [Witten et al., 2016, Zaki and Meira Jr, 2020]. However, in most real-world applications, the choice of classification algorithm for a new dataset or application domain is still mainly an ad hoc decision.

In this context, the use of meta-learning for algorithm recommendation is a very important research area with seminal work dating back many years, including the StatLog [Michie et al., 1994] and METAL [MET, 2002] projects. Meta-learning can be defined as learning how to learn, which involves learning, from previous experience, what is the best machine learning algorithm (and its best hyper-parameter setting) for a given dataset [Brazdil et al., 2008, Vanschoren, 2018]. Meta-learning systems for algorithm recommendation can be divided into two broad groups, namely: (a) systems that perform algorithm selection based on meta-features [Brazdil et al., 2008], which is the most investigated type; and (b) systems that search for the best possible classification algorithm in a given algorithm space [Thornton et al., 2013].

Meta-feature-based meta-learning for algorithm selection and recommendation consists of two basic steps [Brazdil et al., 2008]. First, the creation of a meta-training set where each meta-instance represents a dataset, meta-features represent dataset properties, and each meta-class represents a (base-level) learning algorithm. Second, the induction of a meta-classification model by a (meta) classification algorithm over the meta-training set, thus allowing the recommendation of algorithm(s) for a novel dataset (not included in the meta-training set). A key issue is the design of a good set of meta-features, with enough predictive power to support an accurate recommendation of the best learning algorithm.
Extensive research on this topic has produced a large variety of meta-features [Brazdil et al., 2008, Ho and Basu, 2002, Ho et al., 2006], but finding a set of meta-features with very good predictive power is still an open and difficult problem.

A limitation of meta-feature-based meta-learning research is that usually a small number of candidate classification algorithms are considered as meta-classes. This is because, in general, the larger the number of candidate classification algorithms used as meta-classes, the more difficult it is for the meta-classification algorithm to accurately predict all meta-classes. In addition, it is difficult to produce large meta-datasets for meta-learning, since in order to compute the meta-class of each meta-instance we need to run all candidate classification algorithms on all datasets (one for each meta-instance).

These difficulties have motivated research on the second type of meta-learning for algorithm recommendation: meta-learning systems using search or optimisation methods to indicate the best classification algorithm for a given target dataset, in a given algorithm space [Pappa and Freitas, 2009, Leite et al., 2012, Thornton et al., 2013, Pappa et al., 2014, Kotthoff et al., 2017, Barros et al., 2015, van Rijn et al., 2015]. This work focuses mainly on this type of meta-learning system, which is a type of Automated Machine Learning (AutoML) [Hutter et al., 2019], since such systems effectively automate the process of selecting the best algorithm and its hyper-parameters for the input dataset. This AutoML approach bypasses the need for designing meta-features and it can, in principle, consider a substantially larger number of candidate classification algorithms and hyper-parameters than meta-feature-based meta-learning systems.
Note that although this approach does not explicitly use a learning algorithm at the meta-level, some methods following this AutoML approach (like some methods evaluated in this work) perform a form of meta-learning, because the search is performed in the space of candidate learning algorithms and is guided by an evaluation function based on the accuracy of learning algorithms at the base level. Therefore, the search method at the meta-level is implicitly learning from the results of base-level learning algorithms. Note, however, that this kind of meta-learning of course does not occur in the case of simple and popular methods for algorithm selection and parameter configuration, like random search and grid search, which do not perform any learning by themselves.

In this context, the main contribution of this paper is to present an extensive empirical comparison of the predictive performance of four sophisticated AutoML methods for the recommendation of classification algorithms. One of these methods, Auto-WEKA [Thornton et al., 2013, Kotthoff et al., 2017], performs algorithm selection and hyper-parameter configuration by considering all candidate classification algorithms available in the well-known WEKA data mining tool, which includes algorithms based on several different types of knowledge (or model) representations – e.g., decision trees, if-then classification rules, Bayesian network classifiers, neural networks, support vector machines, etc. The other three methods are based on evolutionary algorithms (EAs). Unlike Auto-WEKA, each of the three EAs focuses on a search space containing classification algorithms based on a single type of knowledge representation. More precisely, the EAs evolve rule induction algorithms [Pappa and Freitas, 2009], decision-tree induction algorithms [Barros et al., 2015], and Bayesian network classification algorithms [de Sá and Pappa, 2014].
Hence, the EAs produce a narrower diversity of classification algorithms in terms of knowledge representation. However, within its specialized knowledge representation, an EA can have more flexibility (or autonomy) to construct new classification algorithms, rather than just optimising the configuration of hyper-parameters for an existing classification algorithm, as discussed later.

There are also other recently proposed EAs for related AutoML tasks. In particular, the EAs proposed in [de Sá et al., 2017, Křen et al., 2017, Olson et al., 2016a] try to optimize an entire machine learning pipeline for a given dataset, including the choice of data preprocessing methods (like feature scaling operators and feature selection methods) and classification algorithm. By contrast, we focus on using EAs that recommend only classification algorithms. In addition, in [Nyathi and Pillay, 2017] an EA is proposed to automatically evolve another type of EA (genetic programming) for classification. By contrast, the EAs used here automatically evolve more conventional (non-evolutionary) types of classification algorithms, as mentioned earlier.

Controlled experiments were performed, where the four previous AutoML methods (the three EAs and Auto-WEKA) had the same runtime limit for different values of this limit. In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant, but Auto-WEKA showed meta-overfitting, a form of overfitting at the meta-learning level, due to evaluating many different (base-level) classification algorithms during its search for the best algorithm. This is in contrast to the standard overfitting at the base level, due to evaluating many different models built by the same classification algorithm.
In addition, the EA evolving decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets, by comparison with many algorithms from other learning paradigms that can be recommended by Auto-WEKA. Furthermore, an analysis of the different types of classification algorithms recommended by Auto-WEKA shows that, overall, decision-tree and ensemble algorithms were the most frequently recommended types of algorithms, whilst rule induction algorithms were the least recommended type.

The remainder of this paper is organised as follows. Section 2 reviews the background on AutoML methods for classification-algorithm recommendation, focusing on the four previously mentioned AutoML methods. Section 3 describes the methodology adopted in this study for executing the experimental analyses, whose extensive results are presented in Section 4. Finally, the main conclusions and future work suggestions are presented in Section 5.

This section reviews the main concepts underlying several AutoML methods for automatic recommendation of the best classification algorithm for a given input dataset. It mainly covers the four AutoML methods evaluated in this work, Auto-WEKA and three EAs, as mentioned earlier. Its last subsection briefly reviews related work on other evolutionary AutoML methods.
Initial work on meta-learning focused on selecting the best classification algorithm(s) for a given dataset, explicitly or implicitly assuming a default configuration (hyper-parameter settings) for the candidate algorithms. However, given that the success of a classification algorithm strongly depends on its hyper-parameter settings, more recent work has focused on the so-called Combined Algorithm Selection and Hyper-parameter optimisation (CASH) problem [Thornton et al., 2013]. In this section, we review the AutoML methods evaluated in this work that address the CASH problem by considering, as candidate algorithms to be recommended, classification algorithms from multiple knowledge (model) representations, like decision trees, IF-THEN classification rules, probabilistic graphical models, neural networks, ensembles, etc.

In this context, an advanced and well-known system designed for the CASH problem is Auto-WEKA [Thornton et al., 2013, Kotthoff et al., 2017], whose search space includes all classification algorithms available in
Weka [Hall et al., 2009] with their corresponding candidate hyper-parameter settings.

In order to search the space of candidate algorithms and their hyper-parameter settings, Auto-WEKA uses a stochastic search method, named Sequential Model-Based Optimisation (SMBO), and a loss function to measure classification error. The goal is to find the classification algorithm and its corresponding hyper-parameter settings that minimise the value of the loss function for the target dataset. SMBO essentially works as follows. First, the CASH problem is formulated as a hierarchical hyper-parameter search space where there is a new root-level hyper-parameter that selects between algorithms. Hence, a candidate solution is an algorithm selected at the root level and its hyper-parameters selected at lower levels. As shown in Algorithm 1, SMBO initially builds a model (M_L, line 1) representing the dependency of the loss function on the candidate hyper-parameter settings. Next, it iteratively uses the model to generate a promising candidate hyper-parameter setting (λ, line 3), evaluates the setting (lines 4-5), and updates the model according to the evaluation (line 6). SMBO is flexible enough to be used with different algorithms for building the dependency model, with random forests being used in [Thornton et al., 2013, Kotthoff et al., 2017].

Algorithm 1 Pseudo-code of SMBO. Adapted from [Thornton et al., 2013].
1: Initialise model M_L; H = ∅
2: while time budget has not been exceeded do
3:     λ = candidate configuration from M_L
4:     compute c = L(A_λ, D^(i)_train, D^(i)_valid)
5:     H = H ∪ {(λ, c)}
6:     Update M_L given H
7: end while
8: return λ from H with minimal c

The approach used by Auto-WEKA was also extended to produce another system for solving the CASH problem, namely Auto-sklearn [Feurer et al., 2015b], which uses the scikit-learn machine learning library [Pedregosa et al., 2011] rather than
Weka. Auto-sklearn extends Auto-WEKA's approach in two ways. First, it uses an ensemble of the classification models generated by the SMBO search method, instead of just one model as in Auto-WEKA. Second, it uses meta-feature-based meta-learning to find good classification algorithm configurations (see [Feurer et al., 2015b, Feurer et al., 2015a] for details of these two extensions). In addition, meta-feature-based meta-learning has been recently used to initialise SMBO's search for the optimal solution to the CASH problem [Feurer et al., 2015c]. It should be noted that the aforementioned systems, although very advanced, are limited to finding a combination of algorithm and hyper-parameter settings among existing combinations in the base machine learning toolkit being used (Weka or scikit-learn). They do not have enough autonomy for constructing a new classification algorithm, which can be done in some cases by the EA-based meta-learning methods discussed in the next section.
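For concreteness, the SMBO loop of Algorithm 1 can be sketched in Python as follows. The one-dimensional toy loss, the random candidate sampler, and the nearest-neighbour surrogate are illustrative assumptions, much simpler than the hierarchical search space and random-forest model used by Auto-WEKA:

```python
import random

def smbo(loss, sample_candidate, budget=50, k=3):
    """Minimal SMBO sketch (Algorithm 1): keep a history H of
    (configuration, cost) pairs, use a surrogate model of the loss
    to pick promising candidates, and return the best one found."""
    history = []                       # H = {(lambda, c)}

    def surrogate(x):
        # Toy surrogate M_L: mean cost of the k nearest evaluated configs.
        if not history:
            return 0.0
        nearest = sorted(history, key=lambda lc: abs(lc[0] - x))[:k]
        return sum(c for _, c in nearest) / len(nearest)

    for _ in range(budget):            # "while time budget not exceeded"
        # Pick the most promising of a few random proposals under M_L.
        pool = [sample_candidate() for _ in range(10)]
        lam = min(pool, key=surrogate)
        c = loss(lam)                  # evaluate L(A_lambda, D_train, D_valid)
        history.append((lam, c))       # H = H ∪ {(lambda, c)}; M_L updates implicitly
    return min(history, key=lambda lc: lc[1])   # lambda with minimal c

# Toy 1-D "hyper-parameter": minimise (x - 2)^2 over [-5, 5].
random.seed(0)
best_x, best_c = smbo(lambda x: (x - 2) ** 2, lambda: random.uniform(-5, 5))
```

In the real systems, the configuration λ is a full algorithm-plus-hyper-parameter setting and the loss is measured by training and validating the corresponding classifier.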
Each of the Evolutionary Algorithm-based (EA-based) AutoML methods evaluated in this work explores a search space with classification algorithms from a different knowledge (model) representation, namely: rule induction [Pappa and Freitas, 2009], decision-tree induction [Barros et al., 2012a], or Bayesian network classifiers [de Sá and Pappa, 2014]. EAs are search methods based on the natural selection principle [Eiben and Smith, 2015]. They have been extensively used for evolving classification models in machine learning [Freitas, 2008, Barros et al., 2012a]. In this work, however, the EAs evolve full classification algorithms rather than classification models. In EA terminology, the EAs used in this work are hyper-heuristic search methods, which perform a search in the space of candidate classification algorithms [Pappa et al., 2014]; whilst EAs that perform a search in the space of classification models are conventional meta-heuristic search methods.

The three EAs receive as input a high-level pseudo-code with the main algorithmic components to be used to create classification algorithms from a target algorithm type. For instance, if the target is rule induction algorithms, the components include a rule search method, a rule evaluation criterion, etc. Each component can be instantiated in different ways, e.g., confidence or information gain can be used to instantiate the rule evaluation component. Given an input dataset, an EA searches for the best combination of algorithmic components based on an evaluation function (called fitness function in EAs).
Thus, the EA's output is a classification algorithm of the target type. Note that the EAs can sometimes generate a new classification algorithm which works in a way different from all current (manually-designed) classification algorithms. This is because the EAs can combine the prespecified algorithmic components in novel ways, not yet explored by human algorithm designers.

As an example of algorithm construction, let us consider the EA for evolving decision-tree algorithms. That EA's algorithmic components include, among other types of components, 15 different split criteria and 5 tree-pruning methods. A manually-designed decision-tree algorithm like J48 (WEKA's version of C4.5) or CART offers just a subset of these split criteria and pruning methods. Hence, when Auto-WEKA configures a decision-tree algorithm, it first chooses exactly which algorithm will be configured, say J48 or CART, and then it considers only the split criteria and tree-pruning methods/hyper-parameters available in WEKA for the chosen algorithm. It cannot combine, e.g., the information gain ratio used by J48 with the cost-complexity pruning used by CART. By contrast, the EA can construct a new decision-tree induction algorithm with any combination of split criterion and tree-pruning method/hyper-parameters (as well as any combination of other specific components), regardless of whether or not the chosen combination of components occurs in a current manually-designed decision-tree algorithm.

Algorithm 2 shows the high-level pseudo-code of the three EAs for recommending classification algorithms used in this work. First, they generate a population of candidate solutions (classification algorithms), or individuals, based on the target pseudo-code and sets of components given as input. For a fixed number of iterations (generations) g, the classification algorithms represented by the individuals in the initial population P are built and run on the input dataset.
The input dataset is divided into meta-training, meta-validation, and meta-test sets. In order to measure the fitness (quality) of an individual, its corresponding classification algorithm is executed over the meta-training set to build a classification model. Afterwards, a given predictive performance measure is used to evaluate the model's performance on the meta-validation set, and this measure is used as the fitness of the individual.

To avoid overfitting, every s generations the examples belonging to the meta-training and meta-validation sets are resampled, and the best individual found in that sample is saved in BestSet. During the EA run, individuals at different generations may therefore be evaluated with different data. Based on the individuals' fitness values, the best candidate classification algorithms are selected to undergo EA operations such as crossover and mutation, according to user-defined probabilities. At the end of an EA run, the best algorithm output by the EA is chosen as follows. Considering the individuals saved in BestSet, a new cross-validation procedure is performed on the training set. All individuals are then executed using the same cross-validation folds, and the best classification algorithm is output. That algorithm is finally evaluated on the meta-test set, which was not seen during the EA run, to compute the final measure of predictive accuracy for the evolved classification algorithm.

Algorithm 2 Pseudo-code of the evolutionary algorithms for generating classification algorithms.
BuildTailoredAlgorithm(dataset, generalPseudocode, components, g, s)
    P = CreatePopulation(generalPseudocode, components)
    count = 0
    BestSet = ∅
    while count < g do
        for all indiv in P do
            BuildAlgorithm(indiv)
            RunAlgorithm(indiv, dataset)
        end for
        TournamentSelection(P)
        Crossover(P)
        Mutation(P)
        count = count + 1
        if count mod s = 0 then
            BestSet = BestSet ∪ {best in P}
            Resample dataset
        end if
    end while
    return best in BestSet according to a predictive performance measure

All three EAs discussed in this paper follow Algorithm 2, but they vary in how they represent individuals, the types of components used to build classification algorithms (depending on the type of target classification algorithm), and the performance measure used to select the best individuals. All algorithms require user-defined hyper-parameters which include, besides the number of iterations (generations), the number of individuals, the rates of crossover and mutation (operators used to produce new individuals from existing ones), the rate of elitism (i.e. the percentage of individuals from the current generation that are passed unaltered to the next generation), and the number of individuals selected to undergo tournament selection.
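The generational loop of Algorithm 2 can be sketched in Python as follows. The bit-string individuals and the max-ones toy fitness are illustrative stand-ins for the real encodings and for the meta-validation F-measure, which are specific to each EA:

```python
import random

def evolve(fitness, length=10, pop_size=20, g=30, s=10,
           cx_rate=0.9, mut_rate=0.1, tournament_k=2):
    """Sketch of Algorithm 2: evolve bit-string 'algorithms' with tournament
    selection, one-point crossover and bit-flip mutation, saving the best
    individual of each s-generation window in best_set."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_set = []
    for count in range(1, g + 1):
        # "Build and run" each individual, i.e. compute its fitness.
        scored = [(fitness(ind), ind) for ind in pop]

        def select():  # tournament selection among k random individuals
            return max(random.sample(scored, tournament_k))[1]

        nxt = []
        while len(nxt) < pop_size:
            a, b = select(), select()
            if random.random() < cx_rate:          # one-point crossover
                cut = random.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # Bit-flip mutation, one bit at a time.
            nxt.append([bit ^ (random.random() < mut_rate) for bit in a])
        pop = nxt
        if count % s == 0:                         # save best; resample data here
            best_set.append(max(scored)[1])
    return max(best_set, key=fitness)              # final re-evaluation of BestSet

random.seed(0)
best = evolve(sum)   # toy fitness: number of 1-bits (the max-ones problem)
```

In the real EAs, the individuals encode algorithmic components rather than bits, and the resampling step draws new meta-training/meta-validation splits.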
The first EA proposed for generating a full classification algorithm customised to a given input dataset evolves rule induction algorithms (which output IF-THEN classification rules), using a Grammar-based Genetic Programming (GGP) algorithm [Pappa and Freitas, 2009], named GGP-RI (GGP for Rule Induction). GGPs differ from standard EAs as they receive as input a grammar, and all candidate solutions generated must obey the grammar's production rules.

The grammar has production rules specifying how the following components of rule induction algorithms can be instantiated and combined together into valid algorithms: the decision to generate an unordered rule set or an ordered rule list; different methods to initialise, search, evaluate, and prune rules; as well as different loop structures and conditional statements to control the iterative processes of constructing a rule and adding/removing rules to/from a set/list. Each individual is represented by a tree generated by applying the production rules. Each tree is mapped to a rule induction algorithm. The GGP grammar has 26 non-terminals and 83 production rules and, varying the order in which the production rules are applied, the GGP's search space has over 2 billion different rule induction algorithms. GGP-RI's fitness function is the F-measure (the harmonic mean of precision and recall) of a candidate rule induction algorithm on the meta-validation set (as explained earlier).
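To illustrate how a grammar constrains the construction of candidate algorithms, the toy sketch below derives an "algorithm" from a three-component grammar. These production rules are invented for illustration and are far simpler than GGP-RI's actual grammar of 26 non-terminals and 83 production rules:

```python
import random

# Invented toy grammar in GGP-RI's spirit: non-terminals (in <angle brackets>)
# map to lists of candidate expansions; lowercase strings are terminals.
GRAMMAR = {
    "<Algorithm>": [["<RuleModel>", "<Search>", "<Evaluate>"]],
    "<RuleModel>": [["ordered-list"], ["unordered-set"]],
    "<Search>":    [["greedy-search"], ["beam-search"]],
    "<Evaluate>":  [["confidence"], ["information-gain"]],
}

def derive(symbol):
    """Expand a symbol by randomly choosing productions until only terminals
    remain; the resulting derivation maps to one candidate rule induction
    algorithm, so every generated individual is grammar-valid by construction."""
    if symbol not in GRAMMAR:
        return [symbol]                        # terminal: nothing to expand
    production = random.choice(GRAMMAR[symbol])
    return [tok for part in production for tok in derive(part)]

random.seed(1)
algorithm = derive("<Algorithm>")   # one terminal per algorithmic component
```

GGP-RI's individuals are the derivation trees themselves (so crossover and mutation can swap grammar-compatible subtrees), rather than the flat token lists used in this sketch.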
A hyper-heuristic EA that generates decision-tree induction algorithms, called HEAD-DT (Hyper-heuristic Evolutionary Algorithm for Automatically Designing Decision-Tree algorithms), is described in [Barros et al., 2013, Barros et al., 2014]. Unlike GGP, HEAD-DT is based on a genetic algorithm with linear encoding. An individual (candidate decision-tree induction algorithm) consists of a set of many options to instantiate the following components of decision-tree induction algorithms: the data split procedure used at each node of the tree (i.e., whether to perform a binary or multi-way split and which feature evaluation function should be used), the tree expansion stopping criteria, approaches to cope with missing values (in both the training and testing phases), and the tree pruning procedure. For each algorithmic component, an individual specifies both categorical options (e.g., the choice of feature evaluation function, out of 16 predefined functions) and the numerical value of hyper-parameters associated with the chosen options (e.g., a hyper-parameter that controls the degree of pruning for a given pruning method). HEAD-DT's fitness function is the F-measure of a candidate decision-tree induction algorithm on the meta-validation set, and its search space contains 21,319,200 different decision-tree algorithms. It was applied with success in different application domains, such as gene expression classification [Barros et al., 2014] and rational drug design [Barros et al., 2012b].
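HEAD-DT's linear (flat) encoding can be illustrated by drawing one individual as a fixed set of genes, each gene fixing one categorical choice or numeric hyper-parameter. The component names and option lists below are invented examples, not HEAD-DT's actual 16 feature evaluation functions or 5 pruning methods:

```python
import random

# Invented option lists for illustration of a linear-encoded individual.
SPLIT_CRITERIA = ["info-gain", "gain-ratio", "gini", "chi-square"]
PRUNING = ["none", "reduced-error", "cost-complexity"]

def random_individual():
    """One gene per algorithmic component; categorical genes pick an option,
    numeric genes pick a hyper-parameter value for the chosen option."""
    return {
        "split_criterion": random.choice(SPLIT_CRITERIA),      # categorical
        "binary_split": random.random() < 0.5,                 # categorical
        "min_instances_per_leaf": random.randint(1, 20),       # numeric gene
        "pruning_method": random.choice(PRUNING),              # categorical
        "pruning_strength": round(random.uniform(0.0, 1.0), 2) # numeric gene
    }

random.seed(0)
indiv = random_individual()   # decoded directly into one decision-tree algorithm
```

Because every gene combination decodes to a valid algorithm, standard genetic-algorithm crossover and mutation can be applied position-wise without a grammar.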
The EA for generating Bayesian Network Classification (BNC) algorithms is named HHEA-BNC (Hyper-Heuristic Evolutionary Algorithm for creating a BNC algorithm) [de Sá and Pappa, 2014, de Sá and Pappa, 2013]. BNC algorithms usually have two phases [Cheng and Greiner, 1999, Daly et al., 2011]: (i) network-structure learning; and (ii) parameter learning. In the first phase, the algorithm learns which nodes (features) in the network should be connected to each other. The parameter learning phase, in turn, learns the Conditional Probability Tables (CPTs) for each node of the network (the BNC model). However, learning the parameters of a BNC model is a relatively straightforward procedure once the network structure has been determined. For this reason, HHEA-BNC focuses on the structure learning phase. HHEA-BNC encodes candidate BNC algorithms using a dynamic array-like representation, where each position in the array represents a different algorithm component to be instantiated. In order to select and instantiate the components of the BNC algorithm, HHEA-BNC uses a top-down approach, where the first instantiated component of the BNC algorithm being created is the search method, with a choice among different methods. The search method defines the type of algorithm being generated (naïve Bayes, score-based, constraint-based, or hybrid) and, consequently, the type of BNC model being created (i.e. tree, graph, or no edges between features, in the case of naïve Bayes). Based on this first choice, different BNC algorithms can be generated, including components like scoring metrics, statistical independence tests, the maximal number of parents per node, etc. The smallest individual has three components, while the largest has more. The search space of HHEA-BNC has 60,510,000 different candidate BNC algorithms. HHEA-BNC's fitness function is the F-measure of a candidate BNC algorithm on the meta-validation set.
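HHEA-BNC's top-down, variable-length encoding can be illustrated as follows: the first gene (the search method) determines which further component genes exist. The search methods, component names, and option lists below are invented stand-ins, not HHEA-BNC's actual components:

```python
import random

# Invented illustration of a top-down, dynamic array-like encoding: the first
# position (the search method) decides which other positions the array needs.
COMPONENTS_BY_SEARCH_METHOD = {
    "naive-bayes":      [],                                     # no structure search
    "score-based":      ["scoring_metric", "max_parents"],
    "constraint-based": ["independence_test", "significance"],
}
OPTIONS = {
    "scoring_metric":    ["AIC", "BIC", "K2"],
    "max_parents":       [1, 2, 3],
    "independence_test": ["chi-square", "mutual-information"],
    "significance":      [0.01, 0.05],
}

def random_bnc_algorithm():
    """Instantiate the search method first, then only the components that
    this choice makes relevant (hence the variable individual length)."""
    method = random.choice(list(COMPONENTS_BY_SEARCH_METHOD))
    genes = [("search_method", method)]
    genes += [(c, random.choice(OPTIONS[c]))
              for c in COMPONENTS_BY_SEARCH_METHOD[method]]
    return genes

random.seed(0)
indiv = random_bnc_algorithm()
```

The variable length mirrors the text above: choosing naïve Bayes needs no structure-learning components, whereas score-based or constraint-based search adds further positions to the array.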
We have also identified three evolutionary AutoML methods that try to optimize the entire classification pipeline: (i) the Tree-based Pipeline Optimization Tool (TPOT) [Olson et al., 2016a, Olson et al., 2016b]; (ii) Genetic Programming for Machine Learning (GP-ML) [Křen et al., 2017]; and (iii) REsilient ClassifIcation Pipeline Evolution (RECIPE) [de Sá et al., 2017]. A pipeline is defined as a machine learning workflow that solves the classification task. To solve this type of task, a pipeline may contain data preprocessing methods (e.g., feature normalization or feature selection), must have a classification algorithm (e.g., naïve Bayes or a support vector machine), and may have a post-processing approach (e.g., voting or stacking). Therefore, these methods take into account various aspects of machine learning instead of focusing only on the classification algorithm. This means that these methods can select and configure a range of different classification-related methods during the evolutionary search, as they are not centered on just one type of classification algorithm. This basic principle is also followed by Auto-WEKA and Auto-sklearn, which are well-known non-EA-based AutoML methods. The aforementioned EA-based AutoML methods are discussed in somewhat more detail next.

TPOT is a genetic programming-based method that searches for the most suitable classification pipeline for the input dataset. It encompasses (part of) the available methods in the scikit-learn library in its search space, and allows different ways of combining the data preprocessing methods (in sequence or in parallel) and the classification algorithms (supporting ensemble approaches or not). Although TPOT has been designed for general classification, it also has a specific version for bioinformatics studies, named TPOT-MDR [Sohn et al., 2017].
TPOT-MDR includes two new data preprocessing operators that are used in genetic analyses of human diseases: Multifactor Dimensionality Reduction (MDR) and the Expert Knowledge Filter (EKF). Besides, both versions perform multi-objective search using Pareto selection (based on the well-known NSGA-II algorithm [Deb et al., 2002]) with two objectives: maximizing the predictive accuracy of the pipeline and minimizing the pipeline's overall complexity (represented by the number of pipeline operators).

The main issue when using TPOT is that it can generate classification pipelines that are invalid or arbitrary during its evolutionary process, i.e., pipelines that do not solve the classification task itself. This happens because TPOT does not impose any constraints when combining the ML components to create the pipelines. For instance, TPOT can create a pipeline without a classification algorithm [Olson et al., 2016a]. This, of course, makes the evolutionary process waste resources, as various individuals would not solve the classification task. This can be considered a significant drawback of TPOT in the context of the classification task.

GP-ML overcomes this limitation by using a strongly typed genetic programming (STGP) method. An STGP method restricts the scikit-learn pipelines in such a way that makes them valid from the machine learning point of view. In addition, GP-ML applies an asynchronous evolutionary algorithm [Scott and De Jong, 2016] instead of a generational one. [Scott and De Jong, 2016] observed that asynchronous evolution is biased towards the evaluation of faster pipelines in some parts of the search space. However, [Křen et al., 2017] consider this bias an advantage for the AutoML task, because a faster pipeline is usually preferable to a slower one when both present similar predictive accuracy values.

RECIPE follows the same basic principle as GP-ML, i.e., it only allows the generation of valid pipelines during the evolutionary process.
In order to implement this principle, RECIPE defines a grammar which encompasses the classification knowledge in scikit-learn. Therefore, RECIPE makes use of grammar-based genetic programming (GGP) [Mckay et al., 2010] to perform the search for the most suitable classification pipeline. The grammar prevents the generation of invalid/arbitrary pipelines, and can also speed up the search.
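TPOT's two-objective selection mentioned above (maximize accuracy, minimize pipeline size) rests on the notion of Pareto dominance, which can be sketched as follows; the candidate pipelines below are made-up (accuracy, number-of-operators) pairs:

```python
def dominates(a, b):
    """a = (accuracy, n_operators). a dominates b if it is no worse on both
    objectives (higher accuracy, fewer operators) and strictly better on one."""
    acc_a, size_a = a
    acc_b, size_b = b
    return (acc_a >= acc_b and size_a <= size_b) and (acc_a > acc_b or size_a < size_b)

def pareto_front(candidates):
    """Keep the candidates not dominated by any other candidate
    (the first non-dominated front, as in NSGA-II's ranking step)."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# Made-up (accuracy, n_operators) pairs for five candidate pipelines.
pipelines = [(0.90, 5), (0.88, 2), (0.90, 7), (0.85, 1), (0.80, 4)]
front = pareto_front(pipelines)   # → [(0.90, 5), (0.88, 2), (0.85, 1)]
```

NSGA-II additionally ranks the remaining candidates into successive fronts and uses crowding distance to keep the front spread out; this sketch shows only the dominance test at its core.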
The experiments are divided into two parts. The first part compares the results obtained by the EAs with the results obtained by Auto-WEKA [Thornton et al., 2013], whose search space includes all 33 classification algorithms available in WEKA. These experiments used 20 datasets. The second part of the experiments compares one of the EAs (HEAD-DT, the EA evolving decision-tree algorithms) against Auto-WEKA, on an extended set of 40 datasets. The main reason for using a smaller number of datasets in the first part of the experiments was the very long computation time associated with comparing four methods. HEAD-DT was chosen because, among the two most successful EAs overall (HEAD-DT and HHEA-BNC, as discussed later), HEAD-DT has the advantage of producing decision-tree algorithms which are more scalable to larger datasets than the Bayesian network classification algorithms produced by HHEA-BNC. The datasets used in both parts of the experiments are described next.
The first part of the experiments focuses on challenging datasets, characterised in general (with one exception) by a small number of instances and a large number of attributes. Table 1 summarises their main characteristics, including the number of instances, number of numerical and nominal attributes, percentage of missing values, class balance ratio (class bal.), and number of classes. Class bal. is the ratio of the minority class frequency over the majority class frequency – values closer to 0 (1) indicate datasets with more (less) class distribution imbalance. The first 12 datasets in this table are bioinformatics datasets, whilst the last 8 are text mining datasets. The first six datasets involve data from the biology of ageing. Datasets CE-T3, SC-T3, DM-T3, and MM-T3 are described in [Wan et al., 2015]; whilst datasets DNA-T3 and DNA-T11 are described in [Freitas et al., 2011]. Dataset PS-T3 involves post-synaptic proteins [Pappa et al., 2005]. The 5 microarray datasets are publicly-available microarray gene expression datasets, described in [de Souto et al., 2008]. Finally, the 8 text mining datasets were obtained from OpenML [Vanschoren et al., 2014].

Table 1: Summary of the 20 datasets used in both the first and the second sets of experiments.
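The class balance ratio (class bal.) used in Table 1 can be computed as in the following minimal sketch; `class_balance_ratio` is an illustrative name, not from the paper:

```python
from collections import Counter

def class_balance_ratio(labels):
    """Ratio of the minority-class frequency over the majority-class
    frequency. Values near 0 indicate strong class imbalance; values
    near 1 indicate a balanced class distribution."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# Example: 10 positive vs. 40 negative instances -> 10/40 = 0.25
print(class_balance_ratio(["pos"] * 10 + ["neg"] * 40))  # 0.25
```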
The 10-fold cross-validation technique (10-cv) [Witten et al., 2016] was used in the experiments. Since Auto-WEKA and the Evolutionary Algorithms (EAs) are non-deterministic, their results are an average over 5 executions, generating, for each method, 1,000 algorithms. All results presented in Section 4 refer to the predictive accuracy of the recommended algorithms in the test sets.
Two predictive accuracy measures are used. First, the Geometric Mean (GMean) of sensitivity (Sens) and specificity (Spec) [Japkowicz and Shah, 2011], defined as GMean = √(Sens × Spec). Sens is the proportion of positive instances that were correctly predicted as positive. Spec is the proportion of negative instances that were correctly predicted as negative. These measures were calculated considering each class in turn as the positive class, and then computing the weighted average of these measures, weighting the classes according to their relative frequency. The GMean measure was also used to evaluate some of these datasets in [Wan et al., 2015]. The second predictive accuracy measure used is the simple classification accuracy measure used by Auto-WEKA to choose the best algorithm for each dataset.
Statistical significance analysis was applied to the experimental results. In the first set of experiments (comparing four methods), we have adopted Demšar's [Demšar, 2006] recommendation to use the Friedman test with the adjusted statistic F_F [Iman and Davenport, 1980] to compare multiple algorithms over multiple datasets, followed by the Nemenyi post-hoc test for pairwise comparisons. In the final experiment, comparing only two methods, we have used the Wilcoxon test [Wilcoxon et al., 1970]. The main advantage of all these statistical tests is that they are non-parametric, so they do not assume that the data follows the normal distribution (nor any other probability distribution, for that matter). All statistical tests were used with the conventional significance level of 0.05.
In order to perform a fair comparison, all EAs were configured with the same hyper-parameter values, listed in Table 3.
Table 3: Parameter values for the evolutionary algorithms.
Parameter: Value
Number of individuals: 100
Number of generations before changing the validation set: 5
Tournament selection size: 2
Elitism rate: 5%
Crossover rate: 95%
Mutation rate: 5%
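The weighted GMean measure described in the evaluation methodology above can be sketched as follows. This is a minimal illustration under one reading of the description: the per-class sensitivities and specificities are averaged, weighted by class frequency, before taking the geometric mean; the function name is ours:

```python
import numpy as np

def weighted_gmean(y_true, y_pred, classes):
    """GMean of class-frequency-weighted sensitivity and specificity.

    Each class is treated in turn as the positive class; the per-class
    sensitivities and specificities are then averaged, weighted by the
    relative frequency of each class, before the geometric mean is taken.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    sens = spec = 0.0
    for c in classes:
        pos = y_true == c
        neg = ~pos
        weight = pos.sum() / n
        sens += weight * (y_pred[pos] == c).mean()  # true-positive rate for c
        spec += weight * (y_pred[neg] != c).mean()  # true-negative rate for c
    return float(np.sqrt(sens * spec))

# Two classes, four instances, one instance of class 1 misclassified:
print(weighted_gmean([0, 0, 1, 1], [0, 0, 1, 0], classes=[0, 1]))  # 0.75
```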
Table 4 shows the hyper-parameter settings for Auto-WEKA, based on the options provided by its Experiment Builder [Thornton et al., 2013]. Note that the 10-cv mentioned in Table 4 is another cross-validation procedure used by Auto-WEKA, this time over the training set (generated by the outermost 10-cv), to evaluate its candidate solutions regarding their predictive accuracy.
Table 4: Hyper-parameter values for all versions of Auto-WEKA.
Parameter: Value(s)
Instance generator: 10-fold cross-validation, seed = 1,..,5
Evaluation measure: error rate (classification)
Optimisation method: SMAC, with executable = smac-v2.06.01-development-619/smac, Initial Incumbent = Random, Execution Mode = SMAC, InitialN = 1
memLimit: 15 GB
timeLimit: from 1,000s to 10,000s
None of the four meta-learning methods had their hyper-parameter values optimised for individual datasets. A more robust hyper-parameter optimisation procedure would be too time-consuming, given the very large number of experiments carried out in this work.
3.4 Computational Environment and Runtime Limits
The experiments were executed on a Dual Intel 2.10GHz Xeon E5-2683 v4 Hexadeca-Core machine with 128GB RAM. Recall that, in order to perform controlled experiments comparing different meta-learning methods with the same computational budget, two types of experiments are performed, as reported in Section 4. The first type of experiment compares the results obtained by the three EAs (each evolving classification algorithms based on a single type of knowledge representation) with the results obtained by Auto-WEKA, which can recommend classification algorithms based on multiple knowledge representations. The second type of experiment compares the best EA (HEAD-DT, evolving decision-tree algorithms) against Auto-WEKA on an extended set of datasets.
In both types of experiments, to ensure a fair comparison among all meta-learning methods, each of them is allocated the same runtime limit. Experiments were performed with ten increasing values of the runtime limit for each meta-learning method, namely 1,000s (seconds), 2,000s, ..., up to 10,000s. These runtime limits refer to the time taken by a single run of each method on each dataset, on a single cross-validation fold. Due to space restrictions, the next section will report only the results for the smallest and the largest runtime limits, i.e., 1,000s and 10,000s. The results for the other runtime limits can be seen in [Basgalupp et al., 2018].
In addition to the parameters that are common to all three EAs, which were set as described in Table 3, there is a parameter that is used by GGP-RI and HHEA-BNC, but not by HEAD-DT. This parameter is a timeout for evaluating each individual (candidate algorithm) of the EA. For GGP-RI, the value of this parameter starts at 10s (seconds) when the runtime limit for the entire run of GGP-RI is 1,000s. The individual evaluation timeout then increases by 10s for each increase of 1,000s in GGP-RI's runtime limit, up to 100s when GGP-RI's runtime limit is 10,000s.
For HHEA-BNC, the value of this parameter starts at 50s when the runtime limit for the entire run of HHEA-BNC is 1,000s. The individual evaluation timeout then increases by 50s for each increase of 1,000s in HHEA-BNC's runtime limit, up to 500s when HHEA-BNC's runtime limit is 10,000s. HEAD-DT does not need this parameter because the decision-tree induction algorithms produced by this EA are relatively fast. The values of this parameter for HHEA-BNC are larger than those for GGP-RI because the Bayesian network classification algorithms generated by the former tend to be considerably slower than the rule induction algorithms generated by the latter EA.
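The linear scaling of the individual-evaluation timeout described above can be sketched as follows (the function name is illustrative):

```python
def individual_timeout(runtime_limit_s, base_timeout_s):
    """Per-individual evaluation timeout, scaled linearly with the EA's
    total runtime limit (1,000s .. 10,000s): the base timeout applies at
    1,000s and grows by one base unit per extra 1,000s of runtime."""
    return base_timeout_s * runtime_limit_s // 1000

# GGP-RI uses a 10s base timeout; HHEA-BNC uses a 50s base timeout.
print(individual_timeout(1000, 10))    # 10  (GGP-RI at 1,000s)
print(individual_timeout(10000, 10))   # 100 (GGP-RI at 10,000s)
print(individual_timeout(10000, 50))   # 500 (HHEA-BNC at 10,000s)
```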
This section presents the results of the following two types of experiments:
1. Experiments comparing four AutoML methods: the three EAs (HEAD-DT, GGP-RI, HHEA-BNC) and Auto-WEKA.
2. Experiments comparing one of the EAs (HEAD-DT, evolving decision-tree algorithms) with Auto-WEKA, on an extended set of datasets.
As mentioned earlier, due to the very large number of experiments, the first type of experiment uses the 20 datasets shown in Table 1, whilst the second type uses an extended set of 40 datasets (the 20 datasets in Table 1 plus the 20 datasets in Table 2). We report the values of accuracy and GMean (the geometric mean of sensitivity and specificity) for each dataset, and the average values of accuracy and GMean, as well as the average rank of each method based on these measures, over the corresponding datasets. The lower the rank, the better the method. A method that outperforms every other method in every dataset has an average rank of 1.0 (first position). The complete tables with per-dataset results can be found in the Supplementary Results file. Recall that, although we performed experiments with the runtime limit for meta-learning methods varying from 1,000s to 10,000s, in increments of 1,000s, in general only the results for 1,000s and 10,000s are reported in this section, due to space restrictions. The results for all 10 runtime limits can be found in the Supplementary Results file.
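The average ranks reported below can be computed as in this minimal sketch; `average_ranks` is an illustrative name, and, as in the Friedman procedure, ties receive the mean of the tied ranks:

```python
import numpy as np

def average_ranks(scores):
    """Average rank of each method over all datasets.

    `scores` is an (n_datasets, n_methods) array of accuracy or GMean
    values; higher is better, so rank 1 goes to the best method on each
    dataset, and tied methods receive the mean of the tied ranks.
    """
    scores = np.asarray(scores, dtype=float)
    n_datasets, n_methods = scores.shape
    ranks = np.zeros_like(scores)
    for i, row in enumerate(scores):
        r = np.empty(n_methods)
        r[np.argsort(-row)] = np.arange(1, n_methods + 1)  # best score -> rank 1
        for v in np.unique(row):                            # average tied ranks
            r[row == v] = r[row == v].mean()
        ranks[i] = r
    return ranks.mean(axis=0)

# Hypothetical scores for 3 datasets x 2 methods:
print(average_ranks([[0.90, 0.80], [0.70, 0.75], [0.60, 0.50]]))
```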
This section compares the four AutoML methods (the three EAs and Auto-WEKA) in controlled experiments where all four methods use the same runtime limit, as mentioned earlier.
Tables 5 and 6 show the GMean results for each method, for the runtime limits of 1,000s and 10,000s, respectively. Recall that these runtime limits refer to a single run of a meta-learning method, for each fold of the cross-validation procedure. The last row of these tables shows the average rank based on GMean over all 20 datasets. Tables 7 and 8 show the accuracy results for each method, for the runtime limits of 1,000s and 10,000s, respectively.
In Table 5, with GMean results for the smallest runtime limit of 1,000s, the best average ranks were jointly obtained by three methods, HEAD-DT, HHEA-BNC and Auto-WEKA, whilst HEAD-DT obtained a slightly better average GMean value. In Table 6, with results for the longest runtime limit of 10,000s, Auto-WEKA obtained a slightly better result (regarding both the average rank and the average GMean value) than HEAD-DT and HHEA-BNC. In both tables, GGP-RI was clearly the worst-performing method. This result seems partly due to the fact that GGP-RI had poor results in many datasets with a large number of numerical attributes. Comparing the average GMean values of each method across both tables, one can observe that the three EAs only slightly improved their GMean values from 1,000s to 10,000s, an improvement of just 0.001 for HEAD-DT and 0.003 for the other two EAs. By contrast, Auto-WEKA obtained a somewhat greater GMean improvement of 0.008 when the runtime limit increased from 1,000s to 10,000s.
Hence, Auto-WEKA has benefited from the increase in runtime limit more than the EAs. This seems due to the fact that Auto-WEKA is searching a much more diverse space of classification algorithms, in terms of knowledge representations.
Recall that each EA's search space includes algorithms from a single knowledge representation.
Table 5: GMean results for the four AutoML methods (time limit: 1,000s).
(Per-dataset GMean values are given in the Supplementary Results file.)
Table 6: GMean results for the four AutoML methods (time limit: 10,000s).
(Per-dataset GMean values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.050, HHEA-BNC 2.100, GGP-RI 3.850, Auto-WEKA 2.000
Table 7: Accuracy results for the four AutoML methods (time limit: 1,000s).
(Per-dataset accuracy values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.150, HHEA-BNC 2.025, GGP-RI 4.000, Auto-WEKA 1.825
Figure 1 shows the critical diagrams comparing the four AutoML methods in terms of their average rank based on both GMean (in the top two diagrams) and accuracy (in the bottom two diagrams). For both measures, and for both the runtime limits of 1,000s and 10,000s, we can see that there is no statistically-significant difference among all methods, with the exception of GGP-RI, which is significantly outperformed by the other three methods.
Table 8: Accuracy results for the four AutoML methods (time limit: 10,000s).
(Per-dataset accuracy values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.150, HHEA-BNC 2.050, GGP-RI 3.950, Auto-WEKA 1.850
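The critical difference (CD) shown in the diagrams of Figure 1 follows the Nemenyi procedure recommended by Demšar [Demšar, 2006]: two methods differ significantly when their average ranks differ by more than CD = q_α √(k(k+1)/(6N)). A minimal sketch, using the standard two-tailed Nemenyi critical value q_0.05 = 2.569 for k = 4 classifiers:

```python
import math

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference for comparing k methods over n
    datasets: two methods differ significantly (at the chosen alpha)
    if their average ranks differ by more than this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# k = 4 AutoML methods, n = 20 datasets, q_0.05 = 2.569 (Demšar's table):
cd = nemenyi_cd(k=4, n=20, q_alpha=2.569)
print(round(cd, 3))  # 1.049
```

With CD ≈ 1.049, GGP-RI's average-rank gap to the other three methods exceeds the critical difference, consistent with the conclusion above.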
[Figure 1 panels: (a) GMean: 1,000s; (b) GMean: 10,000s; (c) Accuracy: 1,000s; (d) Accuracy: 10,000s]
Figure 1: Critical diagrams showing average GMean/Accuracy ranks and Nemenyi's critical difference (CD) for the four AutoML methods.
As mentioned earlier, the analysis of the results so far focused only on the runtime limits of 1,000s and 10,000s due to space restrictions, but we performed experiments with 10 different limits (from 1,000s up to 10,000s). Figure 2(a) shows the evolution of the GMean average ranks for the four meta-learning methods across the 10 runtime limits. This figure shows that HHEA-BNC tends to achieve overall the best (lowest) average rank until the runtime limit of 7,000s, whilst for longer runtime limits Auto-WEKA and HEAD-DT tend to share the best rank, with Auto-WEKA slightly better at the last runtime limit. Figure 2(b) shows the same evolution, but this time regarding average accuracy ranks. In this case, Auto-WEKA remains the best method across all runtime limits, and for nearly all runtime limits the second place is obtained by HHEA-BNC. Note that GGP-RI remained consistently the worst method across all 10 runtime limits, for both GMean and accuracy results.
Figure 2: Evolution of average ranks for all AutoML methods across the 10 runtime limits.
Figures 3(a) and 3(b) show the broad types of algorithms recommended by Auto-WEKA per dataset, for the runtime limits of 1,000s and 10,000s, respectively. Since Auto-WEKA considers a large number of algorithms, instead of referring to specific algorithms, the graphs show the frequency of recommendations for five broad types of algorithms, namely: the three types of algorithms that are considered by the three EAs (decision trees, if-then classification rules, and Bayesian network classifiers), ensemble methods, and all the others. Note that the variability of the selected types of algorithms is high, highlighting the difficulty of selecting the best algorithm for each dataset.
For the runtime limit of 1,000s (Figure 3(a)), ensembles had the highest prevalence across the datasets; they were selected by Auto-WEKA in 33.9% of the cases, closely followed by decision-tree algorithms, selected in 31.4% of the cases. For the runtime limit of 10,000s (Figure 3(b)), these two types of classification algorithms swapped places in the ranking by prevalence, i.e., decision-tree algorithms were selected by Auto-WEKA in 34.7% of the cases, whilst ensembles were selected in 27.3% of the cases. Bayesian classification algorithms also did relatively well, partly because they had a high prevalence among the text mining datasets. For both runtime limits, Bayesian classification algorithms were the third most selected type of classification algorithm: they were selected in 16.7% of the cases in Figure 3(a) and in 24.2% of the cases in Figure 3(b). For both runtime limits, rule induction algorithms had small frequencies of selection, only 7.9% in Figure 3(a) and 6.7% in Figure 3(b). This is consistent with the fact that, out of the three EAs for AutoML evaluated in this work, GGP-RI (which evolved rule induction algorithms) obtained clearly the worst result.
Figure 3: Number of times each type of classification algorithm is selected by Auto-WEKA.
4.2 More extensive experiments comparing HEAD-DT and Auto-WEKA
In this section we compare HEAD-DT and Auto-WEKA on an extended set of 40 datasets, comprising the 20 datasets used in the previous section plus 20 other datasets, as discussed in Section 3.1. As mentioned earlier, the motivation for using this larger set of datasets only to compare the two methods in this section, rather than to compare more methods in the previous section, is the much larger amount of time associated with the experiments using all 40 datasets. This section uses the same experimental methodology as the previous section, using 10-fold cross-validation and comparing the two methods with the same runtime limit, varying this limit from 1,000s to 10,000s in increments of 1,000s. Again, due to space restrictions, we report results only for the smallest and longest runtime limits, namely 1,000s and 10,000s; the results for all 10 runtime limits can be found in the Supplementary Results file.
Table 9 and Table 10 show the accuracy and GMean values, respectively, obtained by HEAD-DT and Auto-WEKA with the runtime limits of 1,000s and 10,000s. In terms of accuracy, Auto-WEKA somewhat outperformed HEAD-DT overall, whilst the opposite was observed for the GMean measure. This result is consistent with the fact that Auto-WEKA's search tries to optimise the accuracy measure (unlike HEAD-DT), as discussed earlier. However, the result of a Wilcoxon significance test, at the conventional significance level of 0.05, indicates that there is no statistically significant difference in predictive performance between HEAD-DT and Auto-WEKA (for both accuracy and GMean measures), for each of the 10 runtime limits.
Figure 4(a) shows the evolution of the average GMean values (across all datasets) for Auto-WEKA and HEAD-DT across the 10 runtime limits. This figure shows that HEAD-DT obtains a better (higher) GMean value for all runtime limits. Figure 4(b) shows the same type of evolution for the accuracy measure.
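The Wilcoxon signed-rank comparison of two methods over paired per-dataset results can be sketched with SciPy as below. The accuracy vectors here are hypothetical stand-ins for illustration, not values from our experiments:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for the two methods, paired by
# dataset (illustrative numbers only):
head_dt   = [0.81, 0.76, 0.90, 0.66, 0.72, 0.85, 0.78, 0.93, 0.61, 0.70]
auto_weka = [0.83, 0.74, 0.91, 0.68, 0.71, 0.86, 0.80, 0.92, 0.63, 0.72]

stat, p = wilcoxon(head_dt, auto_weka)
# The null hypothesis of equal performance is not rejected when p >= 0.05.
print(round(p, 3))
```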
In this case, HEAD-DT obtains the best average accuracy for the smallest runtime limit, but Auto-WEKA obtains higher accuracy for all other runtime limits. It should be noted, however, that in both graphs the differences in predictive performance between HEAD-DT and Auto-WEKA are small, less than 1% in general, across the different runtime limits.
Figure 4: Evolution of average predictive values for HEAD-DT and Auto-WEKA across the 10 runtime limits.
Table 9: Accuracy results for HEAD-DT and Auto-WEKA (time limits: 1,000s and 10,000s).
(Per-dataset accuracy values for HEAD-DT and Auto-WEKA are given in the Supplementary Results file.)
Table 10: GMean results for HEAD-DT and Auto-WEKA (time limits: 1,000s and 10,000s).
(Per-dataset GMean values for HEAD-DT and Auto-WEKA are given in the Supplementary Results file.)
Conclusions
AutoML is currently a very popular research topic, having attracted a great deal of attention through the proposal of new tools, mainly based on optimisation [Hutter et al., 2019, Elsken et al., 2019, Mohr et al., 2018, das Dôres et al., 2018, Li et al., 2018, Larcher and Barbosa, 2019, Koch et al., 2018, Jin et al., 2019, Fusi et al., 2018]. Given the relevance of AutoML, this work has evaluated four methods for recommending a classification algorithm for a target dataset: three Evolutionary Algorithms (EAs) and Auto-WEKA [Thornton et al., 2013], in two sets of experiments. In the first set of experiments, we compared the four AutoML methods with the same runtime limit on 20 datasets. Auto-WEKA can recommend classification algorithms of various types (paradigms), whilst each of the three EAs is restricted to recommending a different type of classification algorithm: decision-tree induction, rule induction or Bayesian network classification algorithms, in the case of HEAD-DT, GGP-RI and HHEA-BNC, respectively. In these experiments, there was no statistically significant difference in predictive accuracy between the three best methods, namely two EAs (HEAD-DT and HHEA-BNC) and Auto-WEKA. However, these three methods obtained significantly better predictive accuracy than the other EA (GGP-RI). These results were broadly consistent across the 10 different runtime limits used in the experiments. In the second set of experiments, where a larger set of 40 datasets was used to compare the predictive accuracy of HEAD-DT and Auto-WEKA only, again there was no statistically significant difference between the predictive performance of these two methods.
However, the focus of HEAD-DT on decision-tree algorithms only has two advantages from the perspective of other algorithm-evaluation criteria.
First, in applications where it is important that the classification model be interpretable by users (e.g. in medical applications), decision-tree algorithms have the advantage of generating interpretable classification models. By contrast, since Auto-WEKA can select any algorithm out of many types of classification algorithms, it can recommend classification algorithms producing black-box (non-interpretable) models. Indeed, in our experiments, Auto-WEKA often recommended ensembles, which are not easily interpretable. Second, decision-tree algorithms also have the advantage of being in general more scalable to large datasets than several other types of classification algorithms in Auto-WEKA's search space, such as neural networks, support vector machines and some ensemble methods.
Overall, when the runtime limit is increased from 1,000s to 10,000s, Auto-WEKA benefits more from the extra search time than HEAD-DT. This seems due to the fact that Auto-WEKA has to explore a much more diverse space of classification algorithms, so it probably requires more time to find the best type of classification algorithm to recommend for a given input dataset.
In addition, we observed that Auto-WEKA exhibited meta-overfitting: for the best algorithm found by Auto-WEKA, the GMean values on the test set were substantially lower than the GMean values estimated on the training set during the search. As noted earlier, this meta-overfitting is a form of overfitting at the meta-learning level, due to evaluating many different (base-level) classification algorithms during Auto-WEKA's search for the best algorithm. This is in contrast to standard overfitting at the base level, which is due to evaluating many different models built by the same classification algorithm.
It would be interesting to enhance the search process of the EAs by first performing a global search to optimise the candidate algorithms' (procedural) components, followed by a second (global or local) search to optimise the continuous parameters of the best algorithm generated by the first search. Another future research direction is to extend the EAs to produce an ensemble of evolved classification algorithms in a post-processing phase, after the EAs have completed their search.
In addition, since Auto-WEKA showed a clear sign of meta-overfitting, another research direction consists of developing new meta-overfitting-avoidance methods that could potentially improve the predictive performance of Auto-WEKA. Finally, it would be interesting to compare the three EAs and Auto-WEKA with other AutoML methods, such as Auto-sklearn and those described in Section 2.2.4. This would give a more detailed assessment of which AutoML method recommends the best classification algorithm across different datasets.
A Supplementary Material for: An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms

Table A1: Results of predictive Specificity values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.499 0.585 0.576 0.496 0.541
DM 0.348 0.545 0.423 0.449 0.315
MM 0.312 0.496 0.403 0.484 0.292
SC 0.280 0.542 0.362 0.412 0.297
DNA3 0.562 0.720 0.604 0.482 0.562
DNA11 0.260 0.478 0.420 0.491 0.288
PS 0.700 0.801 0.794 0.913 0.823
chen-2002 0.857 0.798 0.944 0.826 0.756
chowdary-2006 0.965 0.951 0.969 0.918 0.930
nutt-2003-v2 0.933 0.933 0.850 0.733 0.600
singh-2002 0.746 0.781 0.901 0.754 0.696
west-2001 0.862 0.815 0.878 0.578 0.782
dbworld-bodies 0.770 0.743 0.819 0.686 0.758
dbworld-bodies-stemmed 0.900 0.792 0.758 0.596 0.811
oh0.wc 0.857 0.972 0.979 0.925 0.962
oh5.wc 0.857 0.976 0.985 0.921 0.975
oh10.wc 0.863 0.961 0.967 0.921 0.962
oh15.wc 0.851 0.966 0.973 0.921 0.952
re0.wc 0.736 0.915 0.940 0.881 0.909
re1.wc 0.829 0.970 0.979 0.941 0.970
Average 0.699 0.787 0.776 0.716 0.709
Table A2: Results of predictive Sensitivity values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.623 0.648 0.659 0.567 0.642
DM 0.614 0.664 0.623 0.555 0.623
MM 0.631 0.661 0.640 0.631 0.708
SC 0.827 0.822 0.835 0.774 0.807
DNA3 0.863 0.841 0.848 0.733 0.863
DNA11 0.720 0.777 0.740 0.706 0.726
PS 0.981 0.985 0.985 0.985 0.983
chen-2002 0.821 0.810 0.949 0.827 0.787
chowdary-2006 0.971 0.951 0.972 0.923 0.944
nutt-2003-v2 0.867 0.867 0.850 0.717 0.450
singh-2002 0.744 0.783 0.903 0.756 0.694
west-2001 0.855 0.835 0.855 0.555 0.735
dbworld-bodies 0.738 0.740 0.814 0.698 0.733
dbworld-bodies-stemmed 0.867 0.783 0.767 0.612 0.798
oh0.wc 0.351 0.820 0.874 0.463 0.779
oh5.wc 0.261 0.807 0.883 0.356 0.805
oh10.wc 0.262 0.727 0.754 0.368 0.742
oh15.wc 0.272 0.736 0.810 0.387 0.653
re0.wc 0.537 0.745 0.832 0.559 0.739
re1.wc 0.396 0.798 0.859 0.472 0.786
Average 0.660 0.790 0.823 0.632 0.750

Table A3: Results of predictive GMean values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.557 0.616 0.616 0.530 0.589
DM 0.460 0.600 0.511 0.496 0.442
MM 0.435 0.569 0.499 0.537 0.451
SC 0.464 0.663 0.535 0.556 0.466
DNA3 0.694 0.773 0.707 0.584 0.694
DNA11 0.426 0.601 0.553 0.586 0.450
PS 0.828 0.888 0.884 0.948 0.899
chen-2002 0.838 0.804 0.947 0.826 0.771
chowdary-2006 0.968 0.951 0.970 0.920 0.937
nutt-2003-v2 0.898 0.898 0.846 0.721 0.507
singh-2002 0.745 0.782 0.902 0.755 0.695
west-2001 0.858 0.823 0.866 0.563 0.756
dbworld-bodies 0.753 0.741 0.816 0.691 0.745
dbworld-bodies-stemmed 0.883 0.787 0.762 0.603 0.804
oh0.wc 0.548 0.892 0.925 0.654 0.865
oh5.wc 0.473 0.887 0.933 0.568 0.886
oh10.wc 0.475 0.835 0.854 0.581 0.844
oh15.wc 0.480 0.842 0.888 0.595 0.788
re0.wc 0.629 0.826 0.884 0.702 0.820
re1.wc 0.573 0.880 0.917 0.666 0.873
Average 0.649 0.783 0.791 0.654 0.714
Table A4: Results of predictive Specificity values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.693 0.664 0.656 0.691 0.676
DM 0.808 0.758 0.758 0.758 0.825
MM 0.733 0.739 0.739 0.739 0.675
SC 0.767 0.724 0.701 0.701 0.701
DNA3 0.846 0.846 0.846 0.846 0.846
DNA11 0.854 0.833 0.833 0.833 0.833
PS 0.978 0.989 0.989 0.975 0.978
chen-2002 0.964 0.937 0.937 0.937 0.937
chowdary-2006 0.965 0.965 0.965 0.965 0.965
nutt-2003-v2 0.983 0.983 0.983 0.983 0.983
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.973 0.973 0.973 0.973 0.973
dbworld-bodies 0.844 0.844 0.844 0.844 0.844
dbworld-bodies-stemmed 0.846 0.846 0.846 0.846 0.846
oh0.wc 0.984 0.984 0.981 0.980 0.980
oh5.wc 0.991 0.991 0.990 0.991 0.990
oh10.wc 0.975 0.975 0.975 0.975 0.979
oh15.wc 0.977 0.978 0.979 0.980 0.980
re0.wc 0.961 0.952 0.953 0.954 0.955
re1.wc 0.986 0.985 0.985 0.983 0.983
Average 0.899 0.891 0.889 0.890 0.890

Table A5: Results of predictive Specificity values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.676 0.676 0.682 0.708 0.708
DM 0.825 0.796 0.796 0.796 0.796
MM 0.675 0.675 0.675 0.675 0.675
SC 0.701 0.701 0.722 0.722 0.722
DNA3 0.846 0.846 0.846 0.846 0.846
DNA11 0.833 0.807 0.807 0.807 0.807
PS 0.978 0.978 0.978 0.978 0.978
chen-2002 0.937 0.937 0.937 0.937 0.937
chowdary-2006 0.965 0.965 0.965 0.965 0.965
nutt-2003-v2 0.983 0.983 0.983 0.983 0.983
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.973 0.973 0.973 0.973 0.973
dbworld-bodies 0.844 0.844 0.844 0.844 0.844
dbworld-bodies-stemmed 0.846 0.846 0.846 0.846 0.846
oh0.wc 0.981 0.982 0.982 0.982 0.982
oh5.wc 0.990 0.990 0.990 0.990 0.990
oh10.wc 0.979 0.979 0.980 0.980 0.980
oh15.wc 0.982 0.982 0.982 0.982 0.982
re0.wc 0.954 0.957 0.957 0.957 0.957
re1.wc 0.984 0.984 0.985 0.985 0.985
Average 0.890 0.888 0.889 0.890 0.890

Table A6: Results of predictive Sensitivity values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.759 0.749 0.736 0.759 0.751
DM 0.867 0.842 0.842 0.842 0.875
MM 0.867 0.878 0.878 0.878 0.875
SC 0.943 0.927 0.915 0.915 0.915
DNA3 0.935 0.935 0.935 0.935 0.935
DNA11 0.933 0.919 0.919 0.919 0.919
PS 0.998 0.999 0.999 0.998 0.998
chen-2002 0.966 0.950 0.950 0.950 0.950
chowdary-2006 0.971 0.971 0.971 0.971 0.971
nutt-2003-v2 0.967 0.967 0.967 0.967 0.967
singh-2002 0.851 0.851 0.851 0.851 0.851
west-2001 0.960 0.960 0.960 0.960 0.960
dbworld-bodies 0.831 0.831 0.831 0.831 0.831
dbworld-bodies-stemmed 0.829 0.829 0.829 0.829 0.829
oh0.wc 0.896 0.901 0.887 0.884 0.884
oh5.wc 0.930 0.925 0.922 0.922 0.917
oh10.wc 0.834 0.839 0.840 0.837 0.855
oh15.wc 0.845 0.851 0.855 0.858 0.859
re0.wc 0.888 0.870 0.874 0.878 0.878
re1.wc 0.893 0.893 0.887 0.874 0.879
Average 0.898 0.894 0.892 0.893 0.895
Table A7: Results of predictive Sensitivity values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.751 0.751 0.764 0.779 0.779
DM 0.875 0.867 0.867 0.867 0.867
MM 0.875 0.875 0.875 0.875 0.875
SC 0.915 0.911 0.919 0.919 0.915
DNA3 0.935 0.935 0.935 0.935 0.935
DNA11 0.919 0.911 0.911 0.911 0.911
PS 0.998 0.998 0.998 0.998 0.998
chen-2002 0.950 0.950 0.950 0.950 0.950
chowdary-2006 0.971 0.971 0.971 0.971 0.971
nutt-2003-v2 0.967 0.967 0.967 0.967 0.967
singh-2002 0.851 0.851 0.851 0.851 0.851
west-2001 0.960 0.960 0.960 0.960 0.960
dbworld-bodies 0.831 0.831 0.831 0.831 0.831
dbworld-bodies-stemmed 0.829 0.829 0.829 0.829 0.829
oh0.wc 0.886 0.892 0.892 0.892 0.892
oh5.wc 0.912 0.912 0.912 0.912 0.916
oh10.wc 0.855 0.855 0.858 0.858 0.858
oh15.wc 0.865 0.865 0.874 0.872 0.872
re0.wc 0.875 0.880 0.880 0.880 0.880
re1.wc 0.883 0.891 0.893 0.893 0.893
Average 0.895 0.895 0.897 0.897 0.897

Table A8: Results of predictive GMean values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.725 0.705 0.694 0.724 0.712
DM 0.834 0.794 0.794 0.794 0.846
MM 0.789 0.797 0.797 0.797 0.756
SC 0.846 0.806 0.788 0.788 0.788
DNA3 0.887 0.887 0.887 0.887 0.887
DNA11 0.890 0.873 0.873 0.873 0.873
PS 0.988 0.994 0.994 0.986 0.988
chen-2002 0.965 0.943 0.943 0.943 0.943
chowdary-2006 0.968 0.968 0.968 0.968 0.968
nutt-2003-v2 0.975 0.975 0.975 0.975 0.975
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.967 0.967 0.967 0.967 0.967
dbworld-bodies 0.837 0.837 0.837 0.837 0.837
dbworld-bodies-stemmed 0.837 0.837 0.837 0.837 0.837
oh0.wc 0.939 0.941 0.933 0.931 0.931
oh5.wc 0.960 0.957 0.955 0.955 0.953
oh10.wc 0.901 0.904 0.904 0.902 0.915
oh15.wc 0.909 0.912 0.915 0.916 0.917
re0.wc 0.924 0.910 0.913 0.915 0.916
re1.wc 0.938 0.938 0.934 0.927 0.929
Average 0.896 0.890 0.888 0.889 0.889
Table A9: Results of predictive GMean values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.712 0.712 0.721 0.742 0.742
DM 0.846 0.827 0.827 0.827 0.827
MM 0.756 0.756 0.756 0.756 0.756
SC 0.788 0.786 0.803 0.803 0.801
DNA3 0.887 0.887 0.887 0.887 0.887
DNA11 0.873 0.855 0.855 0.855 0.855
PS 0.988 0.988 0.988 0.988 0.988
chen-2002 0.943 0.943 0.943 0.943 0.943
chowdary-2006 0.968 0.968 0.968 0.968 0.968
nutt-2003-v2 0.975 0.975 0.975 0.975 0.975
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.967 0.967 0.967 0.967 0.967
dbworld-bodies 0.837 0.837 0.837 0.837 0.837
dbworld-bodies-stemmed 0.837 0.837 0.837 0.837 0.837
oh0.wc 0.932 0.936 0.936 0.936 0.936
oh5.wc 0.950 0.950 0.950 0.950 0.952
oh10.wc 0.915 0.915 0.917 0.917 0.917
oh15.wc 0.921 0.921 0.926 0.925 0.925
re0.wc 0.913 0.917 0.917 0.917 0.917
re1.wc 0.932 0.936 0.938 0.938 0.938
Average 0.890 0.888 0.890 0.891 0.891

Table A10: Results of predictive Specificity values of Auto-WEKA-Trees for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.539 0.547 0.549 0.552 0.552
DM 0.378 0.363 0.361 0.366 0.388
MM 0.399 0.418 0.417 0.418 0.418
SC 0.254 0.253 0.253 0.259 0.254
DNA3 0.554 0.593 0.578 0.589 0.612
DNA11 0.403 0.376 0.401 0.377 0.362
PS 0.699 0.702 0.711 0.721 0.722
chen-2002 0.924 0.926 0.914 0.917 0.918
chowdary-2006 0.993 0.994 0.994 0.994 0.993
nutt-2003-v2 0.879 0.854 0.854 0.854 0.854
singh-2002 0.892 0.886 0.886 0.881 0.875
west-2001 0.914 0.914 0.914 0.914 0.914
dbworld-bodies 0.738 0.743 0.734 0.738 0.763
dbworld-bodies-stemmed 0.863 0.843 0.793 0.769 0.771
oh0.wc 0.965 0.965 0.966 0.966 0.966
oh5.wc 0.981 0.981 0.980 0.980 0.980
oh10.wc 0.959 0.959 0.959 0.958 0.959
oh15.wc 0.967 0.967 0.968 0.968 0.969
re0.wc 0.921 0.919 0.920 0.920 0.920
re1.wc 0.968 0.968 0.969 0.969 0.969
Average 0.759 0.759 0.756 0.756 0.758
Table A11: Results of predictive Specificity values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.553 0.558 0.553 0.555 0.558
DM 0.387 0.389 0.388 0.375 0.387
MM 0.418 0.414 0.413 0.416 0.402
SC 0.259 0.254 0.290 0.258 0.269
DNA3 0.600 0.610 0.598 0.608 0.601
DNA11 0.388 0.379 0.378 0.422 0.387
PS 0.723 0.744 0.736 0.748 0.747
chen-2002 0.920 0.923 0.915 0.914 0.912
chowdary-2006 0.993 0.993 0.993 0.993 0.988
nutt-2003-v2 0.858 0.858 0.850 0.854 0.875
singh-2002 0.872 0.875 0.872 0.872 0.877
west-2001 0.914 0.908 0.908 0.908 0.908
dbworld-bodies 0.737 0.739 0.739 0.739 0.739
dbworld-bodies-stemmed 0.769 0.769 0.772 0.772 0.772
oh0.wc 0.966 0.966 0.966 0.966 0.966
oh5.wc 0.981 0.980 0.980 0.980 0.980
oh10.wc 0.958 0.959 0.958 0.959 0.958
oh15.wc 0.969 0.969 0.969 0.969 0.969
re0.wc 0.920 0.919 0.917 0.917 0.917
re1.wc 0.969 0.969 0.969 0.969 0.969
Average 0.758 0.759 0.758 0.760 0.759

Table A12: Results of predictive Sensitivity values of Auto-WEKA-Trees for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.635 0.643 0.644 0.645 0.645
DM 0.634 0.628 0.630 0.628 0.641
MM 0.683 0.692 0.686 0.692 0.692
SC 0.820 0.817 0.817 0.818 0.820
DNA3 0.833 0.847 0.831 0.834 0.843
DNA11 0.704 0.715 0.720 0.706 0.708
PS 0.976 0.976 0.976 0.977 0.977
chen-2002 0.931 0.934 0.926 0.927 0.927
chowdary-2006 0.991 0.993 0.993 0.993 0.991
nutt-2003-v2 0.821 0.796 0.796 0.796 0.796
singh-2002 0.893 0.889 0.889 0.884 0.877
west-2001 0.899 0.899 0.899 0.899 0.899
dbworld-bodies 0.748 0.747 0.739 0.745 0.770
dbworld-bodies-stemmed 0.848 0.830 0.786 0.770 0.770
oh0.wc 0.802 0.804 0.808 0.810 0.809
oh5.wc 0.807 0.807 0.805 0.807 0.807
oh10.wc 0.725 0.721 0.726 0.721 0.724
oh15.wc 0.775 0.778 0.783 0.783 0.788
re0.wc 0.745 0.768 0.770 0.769 0.768
re1.wc 0.766 0.766 0.767 0.767 0.768
Average 0.802 0.803 0.800 0.799 0.801
Table A13: Results of predictive Sensitivity values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.645   0.648   0.645   0.646   0.648
DM                       0.645   0.642   0.640   0.641   0.660
MM                       0.692   0.692   0.689   0.686   0.683
SC                       0.818   0.820   0.825   0.817   0.818
DNA3                     0.842   0.843   0.843   0.838   0.834
DNA11                    0.712   0.711   0.717   0.721   0.715
PS                       0.977   0.978   0.978   0.979   0.978
chen-2002                0.930   0.931   0.926   0.926   0.926
chowdary-2006            0.991   0.991   0.991   0.991   0.983
nutt-2003-v2             0.804   0.804   0.788   0.796   0.813
singh-2002               0.874   0.877   0.874   0.874   0.879
west-2001                0.899   0.893   0.893   0.893   0.893
dbworld-bodies           0.754   0.755   0.755   0.755   0.755
dbworld-bodies-stemmed   0.766   0.766   0.770   0.770   0.770
oh0.wc                   0.810   0.810   0.810   0.805   0.806
oh5.wc                   0.810   0.808   0.808   0.808   0.810
oh10.wc                  0.721   0.723   0.727   0.728   0.722
oh15.wc                  0.790   0.791   0.792   0.791   0.792
re0.wc                   0.769   0.768   0.768   0.767   0.770
re1.wc                   0.768   0.768   0.768   0.767   0.769
Average                  0.801   0.801   0.800   0.800   0.801

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.539   0.547   0.549   0.552   0.552
DM                       0.378   0.363   0.361   0.366   0.388
MM                       0.399   0.418   0.417   0.418   0.418
SC                       0.254   0.253   0.253   0.259   0.254
DNA3                     0.554   0.593   0.578   0.589   0.612
DNA11                    0.403   0.376   0.401   0.377   0.362
PS                       0.699   0.702   0.711   0.721   0.722
chen-2002                0.924   0.926   0.914   0.917   0.918
chowdary-2006            0.993   0.994   0.994   0.994   0.993
nutt-2003-v2             0.879   0.854   0.854   0.854   0.854
singh-2002               0.892   0.886   0.886   0.881   0.875
west-2001                0.914   0.914   0.914   0.914   0.914
dbworld-bodies           0.738   0.743   0.734   0.738   0.763
dbworld-bodies-stemmed   0.863   0.843   0.793   0.769   0.771
oh0.wc                   0.965   0.965   0.966   0.966   0.966
oh5.wc                   0.981   0.981   0.980   0.980   0.980
oh10.wc                  0.959   0.959   0.959   0.958   0.959
oh15.wc                  0.967   0.967   0.968   0.968   0.969
re0.wc                   0.921   0.919   0.920   0.920   0.920
re1.wc                   0.968   0.968   0.969   0.969   0.969
Average                  0.759   0.759   0.756   0.756   0.758
Table A15: Results of predictive GMean values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.553   0.558   0.553   0.555   0.558
DM                       0.387   0.389   0.388   0.375   0.387
MM                       0.418   0.414   0.413   0.416   0.402
SC                       0.259   0.254   0.290   0.258   0.269
DNA3                     0.600   0.610   0.598   0.608   0.601
DNA11                    0.388   0.379   0.378   0.422   0.387
PS                       0.723   0.744   0.736   0.748   0.747
chen-2002                0.920   0.923   0.915   0.914   0.912
chowdary-2006            0.993   0.993   0.993   0.993   0.988
nutt-2003-v2             0.858   0.858   0.850   0.854   0.875
singh-2002               0.872   0.875   0.872   0.872   0.877
west-2001                0.914   0.908   0.908   0.908   0.908
dbworld-bodies           0.737   0.739   0.739   0.739   0.739
dbworld-bodies-stemmed   0.769   0.769   0.772   0.772   0.772
oh0.wc                   0.966   0.966   0.966   0.966   0.966
oh5.wc                   0.981   0.980   0.980   0.980   0.980
oh10.wc                  0.958   0.959   0.958   0.959   0.958
oh15.wc                  0.969   0.969   0.969   0.969   0.969
re0.wc                   0.920   0.919   0.917   0.917   0.917
re1.wc                   0.969   0.969   0.969   0.969   0.969
Average                  0.758   0.759   0.758   0.760   0.759

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.537           0.565  0.571  0.550  0.398
DM                       0.416           0.456  0.382  0.511  0.327
MM                       0.480           0.430  0.494  0.534  0.292
SC                       0.345           0.378  0.221  0.458  0.161
DNA3                     0.506           0.606  0.513  0.641  0.237
DNA11                    0.284           0.373  0.370  0.565  0.244
PS                       0.761           0.852  0.700  0.884  0.060
chen-2002                0.854           0.790  0.778  0.858  0.419
chowdary-2006            0.935           0.954  0.965  0.936  0.404
nutt-2003-v2             0.767           0.833  0.900  0.933  0.633
singh-2002               0.766           0.819  0.725  0.724  0.491
west-2001                0.848           0.935  0.918  0.815  0.490
dbworld-bodies           0.786           0.743  0.865  0.743  0.455
dbworld-bodies-stemmed   0.679           0.787  0.737  0.792  0.455
oh0.wc                   0.960           0.962  0.857  0.972  0.807
oh5.wc                   0.976           0.965  0.861  0.978  0.838
oh10.wc                  0.949           0.957  0.863  0.961  0.848
oh15.wc                  0.950           0.960  0.866  0.964  0.828
re0.wc                   0.852           0.923  0.712  0.922  0.596
re1.wc                   0.953           0.972  0.829  0.972  0.776
Average                  0.730           0.763  0.706  0.786  0.488
Table A17: Results of predictive Sensitivity values of the baseline methods for inducing rule-based models.

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.651           0.642  0.657  0.592  0.602
DM                       0.647           0.673  0.655  0.639  0.673
MM                       0.708           0.675  0.775  0.661  0.708
SC                       0.851           0.811  0.835  0.806  0.839
DNA3                     0.813           0.849  0.835  0.784  0.763
DNA11                    0.732           0.718  0.732  0.755  0.756
PS                       0.984           0.986  0.981  0.988  0.940
chen-2002                0.871           0.793  0.765  0.860  0.581
chowdary-2006            0.951           0.962  0.971  0.933  0.596
nutt-2003-v2             0.633           0.767  0.800  0.867  0.367
singh-2002               0.764           0.815  0.725  0.723  0.509
west-2001                0.835           0.915  0.915  0.835  0.510
dbworld-bodies           0.798           0.740  0.860  0.740  0.545
dbworld-bodies-stemmed   0.688           0.771  0.721  0.783  0.545
oh0.wc                   0.773           0.799  0.351  0.815  0.193
oh5.wc                   0.788           0.771  0.260  0.810  0.162
oh10.wc                  0.687           0.724  0.266  0.712  0.152
oh15.wc                  0.700           0.762  0.315  0.748  0.172
re0.wc                   0.668           0.781  0.543  0.725  0.404
re1.wc                   0.767           0.801  0.396  0.797  0.224
Average                  0.765           0.788  0.668  0.779  0.512

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.591           0.602  0.612  0.571  0.489
DM                       0.514           0.551  0.497  0.567  0.469
MM                       0.567           0.531  0.610  0.589  0.451
SC                       0.520           0.537  0.422  0.602  0.368
DNA3                     0.629           0.715  0.653  0.707  0.424
DNA11                    0.452           0.504  0.513  0.647  0.428
PS                       0.865           0.916  0.828  0.935  0.238
chen-2002                0.862           0.791  0.771  0.859  0.493
chowdary-2006            0.943           0.958  0.968  0.934  0.490
nutt-2003-v2             0.692           0.794  0.845  0.898  0.477
singh-2002               0.765           0.817  0.725  0.723  0.500
west-2001                0.841           0.925  0.916  0.823  0.491
dbworld-bodies           0.791           0.741  0.862  0.741  0.495
dbworld-bodies-stemmed   0.683           0.779  0.729  0.787  0.495
oh0.wc                   0.861           0.876  0.548  0.890  0.395
oh5.wc                   0.877           0.862  0.473  0.890  0.369
oh10.wc                  0.807           0.832  0.479  0.827  0.359
oh15.wc                  0.815           0.855  0.522  0.849  0.377
re0.wc                   0.754           0.849  0.621  0.817  0.491
re1.wc                   0.855           0.882  0.573  0.880  0.417
Average                  0.734           0.766  0.658  0.777  0.436
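The Sensitivity and Specificity values tabulated throughout this appendix are the true-positive and true-negative rates. For a binary confusion matrix they can be sketched as follows (an illustrative helper, not the code used in the experiments; the multi-class case averages these rates over classes):

```python
def sensitivity(tp: int, fn: int) -> float:
    # True-positive rate: fraction of actual positives correctly identified.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # True-negative rate: fraction of actual negatives correctly identified.
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts: 8 true positives, 2 false negatives,
# 9 true negatives, 1 false positive.
print(sensitivity(8, 2))  # 0.8
print(specificity(9, 1))  # 0.9
```

Reporting both rates separately, rather than a single accuracy figure, shows whether a method trades errors on one class for errors on the other; ZeroR's low Sensitivity here is an example of that trade-off made extreme.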
Table A19: Results of predictive Specificity values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.413   0.415   0.415   0.415   0.415
DM                       0.405   0.407   0.423   0.431   0.431
MM                       0.399   0.411   0.423   0.423   0.426
SC                       0.194   0.190   0.190   0.190   0.190
DNA3                     0.429   0.428   0.427   0.427   0.427
DNA11                    0.352   0.343   0.353   0.353   0.353
PS                       0.280   0.297   0.284   0.268   0.281
chen-2002                0.632   0.628   0.617   0.624   0.624
chowdary-2006            0.809   0.814   0.814   0.811   0.808
nutt-2003-v2             0.670   0.670   0.660   0.660   0.660
singh-2002               0.618   0.619   0.621   0.626   0.634
west-2001                0.649   0.635   0.638   0.634   0.645
dbworld-bodies           0.555   0.555   0.555   0.555   0.551
dbworld-bodies-stemmed   0.640   0.640   0.644   0.644   0.644
oh0.wc                   0.809   0.809   0.809   0.809   0.809
oh5.wc                   0.846   0.845   0.845   0.845   0.845
oh10.wc                  0.843   0.847   0.847   0.847   0.847
oh15.wc                  0.833   0.833   0.833   0.833   0.833
re0.wc                   0.614   0.614   0.614   0.614   0.614
re1.wc                   0.786   0.786   0.786   0.787   0.787
Average                  0.589   0.589   0.590   0.590   0.591

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.415   0.415   0.415   0.416   0.415
DM                       0.431   0.430   0.430   0.429   0.429
MM                       0.426   0.426   0.426   0.434   0.436
SC                       0.187   0.187   0.187   0.187   0.190
DNA3                     0.427   0.427   0.432   0.432   0.432
DNA11                    0.353   0.353   0.353   0.363   0.363
PS                       0.281   0.281   0.281   0.281   0.281
chen-2002                0.626   0.629   0.630   0.636   0.633
chowdary-2006            0.809   0.813   0.813   0.813   0.813
nutt-2003-v2             0.647   0.647   0.643   0.650   0.650
singh-2002               0.642   0.642   0.642   0.642   0.642
west-2001                0.649   0.649   0.649   0.649   0.647
dbworld-bodies           0.551   0.551   0.551   0.551   0.558
dbworld-bodies-stemmed   0.644   0.637   0.637   0.637   0.637
oh0.wc                   0.809   0.809   0.809   0.809   0.809
oh5.wc                   0.845   0.845   0.845   0.845   0.845
oh10.wc                  0.848   0.848   0.848   0.848   0.848
oh15.wc                  0.833   0.833   0.833   0.833   0.833
re0.wc                   0.614   0.613   0.613   0.614   0.614
re1.wc                   0.787   0.787   0.787   0.787   0.787
Average                  0.591   0.591   0.591   0.593   0.593
Table A21: Results of predictive Sensitivity values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.607   0.608   0.608   0.609   0.608
DM                       0.684   0.684   0.693   0.698   0.698
MM                       0.717   0.722   0.726   0.726   0.726
SC                       0.820   0.818   0.818   0.821   0.821
DNA3                     0.813   0.812   0.807   0.807   0.807
DNA11                    0.739   0.738   0.741   0.741   0.741
PS                       0.952   0.953   0.952   0.951   0.952
chen-2002                0.686   0.684   0.676   0.683   0.683
chowdary-2006            0.854   0.858   0.858   0.855   0.854
nutt-2003-v2             0.620   0.620   0.600   0.600   0.600
singh-2002               0.609   0.613   0.615   0.619   0.626
west-2001                0.594   0.585   0.589   0.593   0.602
dbworld-bodies           0.615   0.615   0.615   0.615   0.612
dbworld-bodies-stemmed   0.669   0.669   0.671   0.671   0.671
oh0.wc                   0.196   0.196   0.196   0.196   0.196
oh5.wc                   0.156   0.158   0.158   0.158   0.158
oh10.wc                  0.162   0.160   0.160   0.160   0.160
oh15.wc                  0.175   0.175   0.175   0.175   0.174
re0.wc                   0.399   0.399   0.399   0.399   0.399
re1.wc                   0.217   0.218   0.218   0.218   0.217
Average                  0.564   0.564   0.564   0.565   0.565

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.608   0.608   0.608   0.608   0.608
DM                       0.698   0.702   0.702   0.700   0.700
MM                       0.726   0.726   0.726   0.726   0.724
SC                       0.823   0.823   0.823   0.823   0.821
DNA3                     0.807   0.807   0.809   0.809   0.809
DNA11                    0.741   0.741   0.741   0.741   0.741
PS                       0.952   0.952   0.952   0.952   0.952
chen-2002                0.684   0.685   0.686   0.692   0.689
chowdary-2006            0.855   0.857   0.857   0.857   0.857
nutt-2003-v2             0.593   0.593   0.587   0.600   0.600
singh-2002               0.634   0.634   0.634   0.634   0.634
west-2001                0.614   0.614   0.614   0.614   0.610
dbworld-bodies           0.612   0.612   0.612   0.612   0.619
dbworld-bodies-stemmed   0.671   0.665   0.665   0.665   0.665
oh0.wc                   0.196   0.196   0.196   0.196   0.196
oh5.wc                   0.158   0.158   0.158   0.158   0.158
oh10.wc                  0.162   0.162   0.162   0.162   0.162
oh15.wc                  0.175   0.175   0.175   0.175   0.175
re0.wc                   0.399   0.398   0.398   0.399   0.399
re1.wc                   0.217   0.217   0.217   0.217   0.217
Average                  0.566   0.566   0.566   0.567   0.567
Table A23: Results of predictive GMean values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.501   0.502   0.502   0.503   0.503
DM                       0.523   0.525   0.538   0.545   0.545
MM                       0.524   0.532   0.542   0.542   0.544
SC                       0.392   0.388   0.388   0.389   0.389
DNA3                     0.582   0.581   0.578   0.578   0.578
DNA11                    0.498   0.491   0.499   0.499   0.499
PS                       0.445   0.463   0.451   0.436   0.447
chen-2002                0.658   0.654   0.645   0.652   0.652
chowdary-2006            0.830   0.834   0.834   0.832   0.829
nutt-2003-v2             0.631   0.631   0.616   0.616   0.616
singh-2002               0.613   0.616   0.618   0.623   0.630
west-2001                0.617   0.605   0.609   0.609   0.619
dbworld-bodies           0.582   0.582   0.582   0.582   0.578
dbworld-bodies-stemmed   0.652   0.652   0.655   0.655   0.655
oh0.wc                   0.398   0.398   0.398   0.398   0.398
oh5.wc                   0.361   0.364   0.364   0.364   0.364
oh10.wc                  0.370   0.367   0.367   0.367   0.367
oh15.wc                  0.382   0.381   0.381   0.381   0.381
re0.wc                   0.489   0.489   0.489   0.489   0.489
re1.wc                   0.407   0.408   0.408   0.408   0.407
Average                  0.523   0.523   0.523   0.523   0.525

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.503   0.503   0.503   0.503   0.502
DM                       0.545   0.546   0.546   0.544   0.544
MM                       0.544   0.544   0.544   0.549   0.550
SC                       0.386   0.386   0.386   0.386   0.389
DNA3                     0.578   0.578   0.583   0.583   0.583
DNA11                    0.499   0.499   0.499   0.506   0.506
PS                       0.448   0.448   0.448   0.448   0.448
chen-2002                0.653   0.655   0.657   0.663   0.659
chowdary-2006            0.831   0.833   0.833   0.833   0.833
nutt-2003-v2             0.606   0.606   0.600   0.611   0.611
singh-2002               0.638   0.638   0.638   0.638   0.638
west-2001                0.628   0.628   0.628   0.628   0.624
dbworld-bodies           0.578   0.578   0.578   0.578   0.585
dbworld-bodies-stemmed   0.655   0.649   0.649   0.649   0.649
oh0.wc                   0.398   0.398   0.398   0.398   0.398
oh5.wc                   0.364   0.364   0.364   0.364   0.364
oh10.wc                  0.369   0.369   0.369   0.369   0.369
oh15.wc                  0.381   0.381   0.381   0.381   0.381
re0.wc                   0.489   0.489   0.489   0.489   0.489
re1.wc                   0.407   0.407   0.407   0.407   0.407
Average                  0.525   0.525   0.525   0.526   0.526
Table A25: Results of predictive Specificity values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.532   0.529   0.531   0.532   0.532
DM                       0.360   0.361   0.368   0.358   0.362
MM                       0.319   0.317   0.316   0.316   0.316
SC                       0.223   0.233   0.233   0.228   0.233
DNA3                     0.532   0.555   0.562   0.588   0.571
DNA11                    0.382   0.384   0.400   0.435   0.399
PS                       0.676   0.697   0.697   0.709   0.706
chen-2002                0.850   0.867   0.850   0.853   0.850
chowdary-2006            0.938   0.930   0.928   0.928   0.928
nutt-2003-v2             0.804   0.808   0.792   0.792   0.792
singh-2002               0.791   0.792   0.798   0.791   0.808
west-2001                0.927   0.927   0.927   0.927   0.927
dbworld-bodies           0.783   0.763   0.777   0.793   0.783
dbworld-bodies-stemmed   0.728   0.759   0.771   0.764   0.781
oh0.wc                   0.949   0.952   0.955   0.956   0.956
oh5.wc                   0.967   0.971   0.973   0.975   0.975
oh10.wc                  0.952   0.953   0.952   0.952   0.951
oh15.wc                  0.955   0.957   0.957   0.957   0.958
re0.wc                   0.851   0.911   0.906   0.906   0.907
re1.wc                   0.912   0.945   0.951   0.950   0.950
Average                  0.722   0.731   0.732   0.735   0.734

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.532   0.537   0.534   0.534   0.535
DM                       0.365   0.379   0.371   0.376   0.363
MM                       0.316   0.316   0.316   0.316   0.316
SC                       0.238   0.233   0.228   0.253   0.212
DNA3                     0.571   0.554   0.564   0.568   0.549
DNA11                    0.439   0.452   0.427   0.421   0.448
PS                       0.712   0.712   0.720   0.730   0.729
chen-2002                0.866   0.859   0.853   0.864   0.857
chowdary-2006            0.928   0.928   0.928   0.928   0.928
nutt-2003-v2             0.808   0.792   0.792   0.804   0.808
singh-2002               0.798   0.805   0.808   0.796   0.800
west-2001                0.919   0.919   0.919   0.919   0.919
dbworld-bodies           0.776   0.782   0.780   0.782   0.782
dbworld-bodies-stemmed   0.797   0.770   0.774   0.788   0.794
oh0.wc                   0.957   0.956   0.955   0.956   0.956
oh5.wc                   0.977   0.975   0.976   0.977   0.977
oh10.wc                  0.952   0.953   0.953   0.953   0.955
oh15.wc                  0.957   0.957   0.957   0.957   0.957
re0.wc                   0.909   0.908   0.908   0.906   0.908
re1.wc                   0.950   0.951   0.950   0.951   0.951
Average                  0.738   0.737   0.736   0.739   0.737
Table A27: Results of predictive Sensitivity values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.628   0.626   0.629   0.629   0.631
DM                       0.649   0.651   0.653   0.651   0.653
MM                       0.694   0.692   0.689   0.689   0.689
SC                       0.820   0.818   0.816   0.815   0.816
DNA3                     0.835   0.840   0.840   0.844   0.840
DNA11                    0.719   0.730   0.725   0.741   0.734
PS                       0.977   0.977   0.977   0.977   0.978
chen-2002                0.871   0.889   0.875   0.875   0.871
chowdary-2006            0.945   0.941   0.938   0.938   0.938
nutt-2003-v2             0.783   0.792   0.783   0.783   0.783
singh-2002               0.793   0.793   0.798   0.791   0.808
west-2001                0.915   0.915   0.915   0.915   0.915
dbworld-bodies           0.792   0.772   0.783   0.796   0.788
dbworld-bodies-stemmed   0.729   0.758   0.758   0.759   0.767
oh0.wc                   0.750   0.762   0.768   0.772   0.771
oh5.wc                   0.751   0.771   0.774   0.780   0.781
oh10.wc                  0.702   0.704   0.705   0.702   0.701
oh15.wc                  0.751   0.753   0.754   0.755   0.756
re0.wc                   0.671   0.744   0.740   0.740   0.740
re1.wc                   0.610   0.720   0.755   0.754   0.754
Average                  0.769   0.782   0.784   0.785   0.786

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.630   0.634   0.633   0.633   0.633
DM                       0.660   0.662   0.657   0.649   0.649
MM                       0.689   0.689   0.689   0.689   0.689
SC                       0.819   0.815   0.815   0.815   0.815
DNA3                     0.840   0.835   0.831   0.835   0.831
DNA11                    0.751   0.747   0.736   0.746   0.744
PS                       0.978   0.978   0.978   0.979   0.979
chen-2002                0.883   0.879   0.872   0.881   0.878
chowdary-2006            0.938   0.938   0.938   0.938   0.938
nutt-2003-v2             0.792   0.783   0.783   0.783   0.792
singh-2002               0.798   0.805   0.808   0.795   0.801
west-2001                0.910   0.910   0.910   0.910   0.910
dbworld-bodies           0.784   0.789   0.789   0.789   0.789
dbworld-bodies-stemmed   0.784   0.759   0.763   0.775   0.783
oh0.wc                   0.774   0.772   0.771   0.775   0.773
oh5.wc                   0.786   0.780   0.782   0.786   0.786
oh10.wc                  0.704   0.706   0.707   0.709   0.712
oh15.wc                  0.757   0.757   0.755   0.756   0.757
re0.wc                   0.744   0.743   0.745   0.740   0.743
re1.wc                   0.754   0.755   0.754   0.755   0.755
Average                  0.789   0.787   0.786   0.787   0.788
Table A29: Results of predictive GMean values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.578   0.575   0.578   0.578   0.579
DM                       0.480   0.482   0.487   0.480   0.483
MM                       0.465   0.463   0.462   0.462   0.462
SC                       0.419   0.427   0.426   0.421   0.426
DNA3                     0.658   0.674   0.678   0.695   0.683
DNA11                    0.517   0.521   0.528   0.558   0.533
PS                       0.793   0.805   0.805   0.811   0.810
chen-2002                0.860   0.878   0.862   0.864   0.860
chowdary-2006            0.941   0.935   0.933   0.933   0.933
nutt-2003-v2             0.789   0.796   0.783   0.783   0.783
singh-2002               0.792   0.793   0.798   0.791   0.808
west-2001                0.920   0.920   0.920   0.920   0.920
dbworld-bodies           0.787   0.767   0.780   0.795   0.785
dbworld-bodies-stemmed   0.728   0.758   0.764   0.761   0.774
oh0.wc                   0.843   0.852   0.856   0.858   0.858
oh5.wc                   0.852   0.865   0.867   0.872   0.872
oh10.wc                  0.817   0.819   0.819   0.817   0.817
oh15.wc                  0.847   0.849   0.849   0.850   0.850
re0.wc                   0.756   0.823   0.819   0.819   0.819
re1.wc                   0.742   0.823   0.847   0.846   0.846
Average                  0.729   0.741   0.743   0.746   0.745

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.579   0.584   0.581   0.582   0.582
DM                       0.487   0.497   0.490   0.489   0.482
MM                       0.462   0.462   0.462   0.462   0.462
SC                       0.432   0.426   0.421   0.443   0.406
DNA3                     0.683   0.670   0.675   0.680   0.668
DNA11                    0.567   0.572   0.550   0.551   0.567
PS                       0.813   0.813   0.817   0.823   0.823
chen-2002                0.874   0.869   0.862   0.872   0.867
chowdary-2006            0.933   0.933   0.933   0.933   0.933
nutt-2003-v2             0.796   0.783   0.783   0.789   0.796
singh-2002               0.798   0.805   0.808   0.795   0.800
west-2001                0.914   0.914   0.914   0.914   0.914
dbworld-bodies           0.780   0.785   0.784   0.785   0.785
dbworld-bodies-stemmed   0.790   0.764   0.768   0.781   0.788
oh0.wc                   0.860   0.859   0.858   0.861   0.859
oh5.wc                   0.876   0.872   0.874   0.876   0.876
oh10.wc                  0.819   0.820   0.821   0.822   0.824
oh15.wc                  0.851   0.851   0.850   0.850   0.851
re0.wc                   0.822   0.822   0.822   0.819   0.821
re1.wc                   0.846   0.847   0.846   0.847   0.847
Average                  0.749   0.747   0.746   0.749   0.748
Table A31: Results of predictive Specificity values of the baseline methods for building Bayesian network classifiers.

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.550     0.554       0.556
DM                       0.545     0.566       0.620
MM                       0.498     0.507       0.574
SC                       0.525     0.468       0.609
DNA3                     0.736     0.716       0.742
DNA11                    0.672     0.672       0.509
PS                       0.800     0.779       0.772
chen-2002                0.884     0.924       0.864
chowdary-2006            0.954     0.980       0.980
nutt-2003-v2             0.800     0.750       0.367
singh-2002               0.833     0.785       0.725
west-2001                0.892     0.875       0.835
dbworld-bodies           0.811     0.700       0.905
dbworld-bodies-stemmed   0.733     0.733       0.871
oh0.wc                   0.987     0.977       0.988
oh5.wc                   0.983     0.973       0.984
oh10.wc                  0.976     0.964       0.974
oh15.wc                  0.977     0.965       0.978
re0.wc                   0.930     0.938       0.949
re1.wc                   0.976     0.974       0.975
Average                  0.803     0.790       0.789

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.586     0.598       0.584
DM                       0.647     0.664       0.647
MM                       0.650     0.683       0.707
SC                       0.738     0.754       0.750
DNA3                     0.805     0.820       0.829
DNA11                    0.734     0.734       0.689
PS                       0.974     0.981       0.979
chen-2002                0.905     0.933       0.899
chowdary-2006            0.962     0.981       0.981
nutt-2003-v2             0.700     0.700       0.333
singh-2002               0.834     0.785       0.725
west-2001                0.875     0.875       0.815
dbworld-bodies           0.814     0.750       0.895
dbworld-bodies-stemmed   0.767     0.767       0.862
oh0.wc                   0.899     0.796       0.895
oh5.wc                   0.855     0.787       0.869
oh10.wc                  0.804     0.721       0.801
oh15.wc                  0.829     0.725       0.834
re0.wc                   0.759     0.566       0.798
re1.wc                   0.814     0.664       0.837
Average                  0.798     0.764       0.786
Table A33: Results of predictive GMean values of the baseline methods for building Bayesian network classifiers.

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.567     0.576       0.569
DM                       0.589     0.608       0.632
MM                       0.560     0.578       0.627
SC                       0.616     0.587       0.672
DNA3                     0.766     0.762       0.778
DNA11                    0.699     0.699       0.581
PS                       0.883     0.874       0.869
chen-2002                0.894     0.928       0.881
chowdary-2006            0.958     0.980       0.980
nutt-2003-v2             0.745     0.716       0.340
singh-2002               0.833     0.785       0.725
west-2001                0.883     0.874       0.824
dbworld-bodies           0.812     0.722       0.900
dbworld-bodies-stemmed   0.749     0.749       0.866
oh0.wc                   0.942     0.882       0.940
oh5.wc                   0.917     0.875       0.925
oh10.wc                  0.886     0.833       0.883
oh15.wc                  0.900     0.836       0.903
re0.wc                   0.840     0.728       0.870
re1.wc                   0.891     0.804       0.903
Average                  0.797     0.770       0.784

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.540   0.541   0.539   0.533   0.534
DM                       0.508   0.553   0.571   0.552   0.544
MM                       0.562   0.560   0.572   0.565   0.549
SC                       0.332   0.328   0.343   0.348   0.356
DNA3                     0.670   0.672   0.674   0.673   0.666
DNA11                    0.423   0.389   0.384   0.380   0.369
PS                       0.724   0.737   0.751   0.745   0.745
chen-2002                0.841   0.870   0.865   0.867   0.864
chowdary-2006            0.962   0.962   0.959   0.962   0.959
nutt-2003-v2             0.780   0.820   0.810   0.813   0.840
singh-2002               0.771   0.778   0.780   0.793   0.791
west-2001                0.885   0.884   0.878   0.878   0.894
dbworld-bodies           0.742   0.742   0.759   0.780   0.775
dbworld-bodies-stemmed   0.758   0.797   0.809   0.778   0.812
oh0.wc                   0.986   0.986   0.986   0.986   0.986
oh5.wc                   0.982   0.979   0.972   0.975   0.975
oh10.wc                  0.972   0.974   0.971   0.972   0.969
oh15.wc                  0.980   0.977   0.977   0.974   0.980
re0.wc                   0.931   0.931   0.930   0.930   0.931
re1.wc                   0.938   0.968   0.970   0.971   0.971
Average                  0.764   0.772   0.775   0.774   0.775
Table A35: Results of predictive Specificity values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.541   0.538   0.539   0.544   0.544
DM                       0.555   0.550   0.579   0.567   0.563
MM                       0.525   0.520   0.512   0.505   0.510
SC                       0.360   0.343   0.364   0.381   0.365
DNA3                     0.670   0.667   0.650   0.650   0.649
DNA11                    0.357   0.361   0.354   0.357   0.356
PS                       0.747   0.739   0.740   0.748   0.729
chen-2002                0.863   0.865   0.862   0.862   0.858
chowdary-2006            0.962   0.965   0.965   0.962   0.951
nutt-2003-v2             0.827   0.843   0.847   0.837   0.840
singh-2002               0.789   0.785   0.783   0.786   0.776
west-2001                0.884   0.886   0.882   0.882   0.877
dbworld-bodies           0.753   0.771   0.758   0.787   0.776
dbworld-bodies-stemmed   0.803   0.812   0.791   0.801   0.798
oh0.wc                   0.986   0.983   0.979   0.979   0.979
oh5.wc                   0.977   0.972   0.972   0.972   0.978
oh10.wc                  0.966   0.966   0.963   0.966   0.966
oh15.wc                  0.980   0.980   0.980   0.980   0.977
re0.wc                   0.931   0.931   0.918   0.917   0.917
re1.wc                   0.971   0.971   0.972   0.972   0.972
Average                  0.772   0.772   0.770   0.773   0.769

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.615   0.614   0.608   0.607   0.609
DM                       0.708   0.717   0.724   0.708   0.701
MM                       0.748   0.746   0.748   0.744   0.744
SC                       0.802   0.802   0.795   0.801   0.801
DNA3                     0.841   0.837   0.832   0.829   0.837
DNA11                    0.743   0.741   0.742   0.742   0.739
PS                       0.978   0.979   0.980   0.978   0.979
chen-2002                0.867   0.880   0.875   0.878   0.872
chowdary-2006            0.971   0.971   0.969   0.971   0.969
nutt-2003-v2             0.730   0.780   0.770   0.767   0.790
singh-2002               0.771   0.777   0.779   0.793   0.791
west-2001                0.888   0.883   0.879   0.879   0.896
dbworld-bodies           0.764   0.760   0.776   0.800   0.790
dbworld-bodies-stemmed   0.783   0.811   0.823   0.787   0.823
oh0.wc                   0.896   0.896   0.895   0.896   0.894
oh5.wc                   0.848   0.835   0.796   0.815   0.814
oh10.wc                  0.790   0.793   0.779   0.779   0.767
oh15.wc                  0.844   0.830   0.832   0.820   0.845
re0.wc                   0.760   0.760   0.760   0.760   0.760
re1.wc                   0.742   0.789   0.796   0.798   0.798
Average                  0.805   0.810   0.808   0.808   0.811
Table A37: Results of predictive Sensitivity values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.615   0.608   0.608   0.612   0.614
DM                       0.710   0.712   0.722   0.726   0.713
MM                       0.735   0.730   0.726   0.728   0.730
SC                       0.801   0.798   0.800   0.804   0.806
DNA3                     0.839   0.837   0.835   0.834   0.838
DNA11                    0.735   0.739   0.730   0.726   0.730
PS                       0.979   0.979   0.979   0.979   0.977
chen-2002                0.872   0.873   0.872   0.871   0.868
chowdary-2006            0.971   0.973   0.973   0.971   0.965
nutt-2003-v2             0.783   0.787   0.793   0.783   0.790
singh-2002               0.789   0.785   0.783   0.787   0.777
west-2001                0.886   0.891   0.888   0.888   0.883
dbworld-bodies           0.769   0.786   0.773   0.801   0.792
dbworld-bodies-stemmed   0.820   0.831   0.801   0.818   0.814
oh0.wc                   0.894   0.882   0.867   0.868   0.868
oh5.wc                   0.827   0.802   0.800   0.804   0.829
oh10.wc                  0.755   0.754   0.739   0.755   0.754
oh15.wc                  0.844   0.845   0.845   0.845   0.832
re0.wc                   0.760   0.760   0.747   0.746   0.746
re1.wc                   0.799   0.800   0.802   0.801   0.802
Average                  0.809   0.809   0.804   0.807   0.806

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.576   0.576   0.572   0.569   0.570
DM                       0.596   0.626   0.640   0.622   0.613
MM                       0.637   0.634   0.642   0.635   0.626
SC                       0.497   0.496   0.508   0.514   0.518
DNA3                     0.741   0.741   0.741   0.739   0.739
DNA11                    0.544   0.521   0.516   0.514   0.507
PS                       0.827   0.835   0.847   0.837   0.838
chen-2002                0.852   0.874   0.870   0.872   0.867
chowdary-2006            0.966   0.966   0.964   0.966   0.964
nutt-2003-v2             0.746   0.793   0.784   0.783   0.809
singh-2002               0.771   0.777   0.779   0.793   0.791
west-2001                0.886   0.883   0.878   0.878   0.895
dbworld-bodies           0.753   0.750   0.767   0.789   0.782
dbworld-bodies-stemmed   0.770   0.804   0.816   0.782   0.817
oh0.wc                   0.940   0.940   0.939   0.940   0.939
oh5.wc                   0.913   0.902   0.871   0.885   0.884
oh10.wc                  0.876   0.878   0.868   0.868   0.858
oh15.wc                  0.909   0.898   0.899   0.889   0.910
re0.wc                   0.841   0.841   0.841   0.841   0.841
re1.wc                   0.832   0.874   0.878   0.880   0.880
Average                  0.774   0.781   0.781   0.780   0.782
Table A39: Results of predictive GMean values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.577   0.572   0.572   0.577   0.578
DM                       0.624   0.623   0.643   0.638   0.629
MM                       0.608   0.604   0.596   0.594   0.598
SC                       0.522   0.509   0.525   0.539   0.528
DNA3                     0.744   0.739   0.729   0.727   0.730
DNA11                    0.498   0.503   0.495   0.496   0.497
PS                       0.842   0.836   0.838   0.845   0.824
chen-2002                0.867   0.869   0.867   0.866   0.862
chowdary-2006            0.966   0.969   0.969   0.966   0.958
nutt-2003-v2             0.799   0.808   0.814   0.804   0.809
singh-2002               0.789   0.785   0.783   0.787   0.777
west-2001                0.885   0.888   0.884   0.884   0.879
dbworld-bodies           0.760   0.778   0.765   0.794   0.784
dbworld-bodies-stemmed   0.811   0.821   0.795   0.809   0.805
oh0.wc                   0.939   0.929   0.918   0.918   0.918
oh5.wc                   0.895   0.874   0.873   0.876   0.896
oh10.wc                  0.848   0.848   0.836   0.848   0.847
oh15.wc                  0.910   0.910   0.910   0.910   0.900
re0.wc                   0.841   0.841   0.828   0.827   0.827
re1.wc                   0.881   0.881   0.882   0.882   0.883
Average                  0.780   0.779   0.776   0.779   0.777

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.566   0.566   0.564   0.567   0.567
DM                       0.546   0.552   0.554   0.550   0.567
MM                       0.490   0.493   0.493   0.493   0.492
SC                       0.396   0.401   0.407   0.417   0.417
DNA3                     0.570   0.552   0.570   0.565   0.560
DNA11                    0.462   0.452   0.443   0.439   0.446
PS                       0.688   0.685   0.676   0.681   0.682
chen-2002                0.891   0.885   0.891   0.890   0.886
chowdary-2006            0.973   0.973   0.973   0.973   0.973
nutt-2003-v2             0.850   0.846   0.842   0.838   0.838
singh-2002               0.872   0.874   0.874   0.872   0.869
west-2001                0.908   0.898   0.908   0.898   0.912
dbworld-bodies           0.885   0.885   0.863   0.861   0.858
dbworld-bodies-stemmed   0.937   0.915   0.918   0.939   0.939
oh0.wc                   0.966   0.967   0.966   0.966   0.965
oh5.wc                   0.964   0.964   0.964   0.964   0.967
oh10.wc                  0.960   0.960   0.960   0.960   0.961
oh15.wc                  0.964   0.964   0.964   0.964   0.964
re0.wc                   0.941   0.940   0.941   0.941   0.938
re1.wc                   0.961   0.962   0.961   0.961   0.961
Average                  0.789   0.787   0.787   0.787   0.788
Table A41: Results of predictive Specificity values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.567   0.572   0.570   0.567   0.571
DM                       0.555   0.557   0.561   0.559   0.560
MM                       0.488   0.492   0.498   0.494   0.494
SC                       0.406   0.401   0.423   0.387   0.407
DNA3                     0.566   0.567   0.576   0.571   0.560
DNA11                    0.462   0.460   0.456   0.466   0.462
PS                       0.688   0.686   0.694   0.695   0.693
chen-2002                0.888   0.885   0.887   0.884   0.886
chowdary-2006            0.973   0.973   0.973   0.973   0.973
nutt-2003-v2             0.842   0.842   0.842   0.842   0.842
singh-2002               0.872   0.864   0.862   0.867   0.859
west-2001                0.908   0.912   0.912   0.912   0.908
dbworld-bodies           0.871   0.868   0.857   0.857   0.844
dbworld-bodies-stemmed   0.932   0.928   0.944   0.941   0.932
oh0.wc                   0.966   0.966   0.966   0.965   0.965
oh5.wc                   0.967   0.966   0.966   0.966   0.966
oh10.wc                  0.960   0.960   0.960   0.960   0.960
oh15.wc                  0.965   0.965   0.965   0.965   0.964
re0.wc                   0.941   0.941   0.941   0.941   0.941
re1.wc                   0.962   0.961   0.962   0.962   0.961
Average                  0.789   0.788   0.791   0.789   0.788

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.642   0.644   0.644   0.646   0.646
DM                       0.693   0.699   0.697   0.697   0.702
MM                       0.722   0.728   0.728   0.728   0.728
SC                       0.825   0.827   0.829   0.831   0.833
DNA3                     0.846   0.837   0.844   0.841   0.839
DNA11                    0.711   0.711   0.700   0.713   0.717
PS                       0.979   0.979   0.978   0.978   0.978
chen-2002                0.900   0.895   0.900   0.899   0.897
chowdary-2006            0.983   0.983   0.983   0.983   0.983
nutt-2003-v2             0.800   0.792   0.783   0.775   0.775
singh-2002               0.872   0.874   0.874   0.871   0.869
west-2001                0.900   0.890   0.900   0.890   0.905
dbworld-bodies           0.874   0.874   0.854   0.854   0.850
dbworld-bodies-stemmed   0.938   0.918   0.918   0.938   0.938
oh0.wc                   0.803   0.810   0.806   0.804   0.799
oh5.wc                   0.766   0.766   0.766   0.766   0.773
oh10.wc                  0.723   0.722   0.721   0.722   0.721
oh15.wc                  0.770   0.771   0.771   0.771   0.771
re0.wc                   0.755   0.755   0.755   0.755   0.751
re1.wc                   0.779   0.779   0.779   0.779   0.779
Average                  0.814   0.813   0.811   0.812   0.813
Table A43: Results of predictive Sensitivity values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.646   0.648   0.646   0.644   0.646
DM                       0.695   0.693   0.695   0.691   0.693
MM                       0.728   0.725   0.731   0.731   0.731
SC                       0.828   0.827   0.833   0.830   0.832
DNA3                     0.842   0.846   0.839   0.846   0.842
DNA11                    0.726   0.717   0.722   0.726   0.719
PS                       0.978   0.979   0.979   0.979   0.979
chen-2002                0.899   0.897   0.899   0.896   0.899
chowdary-2006            0.983   0.983   0.983   0.983   0.983
nutt-2003-v2             0.783   0.783   0.783   0.783   0.783
singh-2002               0.871   0.864   0.862   0.867   0.859
west-2001                0.900   0.905   0.905   0.905   0.900
dbworld-bodies           0.863   0.859   0.851   0.847   0.835
dbworld-bodies-stemmed   0.931   0.926   0.941   0.938   0.930
oh0.wc                   0.804   0.804   0.804   0.803   0.801
oh5.wc                   0.774   0.771   0.773   0.770   0.773
oh10.wc                  0.724   0.723   0.723   0.725   0.723
oh15.wc                  0.773   0.774   0.771   0.772   0.771
re0.wc                   0.754   0.754   0.754   0.754   0.755
re1.wc                   0.776   0.776   0.778   0.775   0.777
Average                  0.814   0.813   0.814   0.813   0.812

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.603   0.604   0.602   0.605   0.605
DM                       0.612   0.618   0.619   0.616   0.628
MM                       0.586   0.590   0.590   0.590   0.589
SC                       0.555   0.560   0.566   0.572   0.571
DNA3                     0.684   0.670   0.683   0.679   0.675
DNA11                    0.568   0.562   0.551   0.555   0.560
PS                       0.800   0.798   0.793   0.796   0.797
chen-2002                0.895   0.890   0.896   0.894   0.892
chowdary-2006            0.978   0.978   0.978   0.978   0.978
nutt-2003-v2             0.818   0.812   0.806   0.799   0.799
singh-2002               0.872   0.874   0.874   0.872   0.869
west-2001                0.904   0.893   0.904   0.893   0.908
dbworld-bodies           0.879   0.879   0.858   0.857   0.854
dbworld-bodies-stemmed   0.937   0.917   0.918   0.938   0.938
oh0.wc                   0.881   0.885   0.882   0.881   0.878
oh5.wc                   0.859   0.859   0.859   0.859   0.864
oh10.wc                  0.833   0.833   0.832   0.833   0.832
oh15.wc                  0.862   0.862   0.862   0.862   0.862
re0.wc                   0.843   0.842   0.843   0.843   0.840
re1.wc                   0.865   0.865   0.865   0.865   0.865
Average                  0.792   0.790   0.789   0.789   0.790
Table A45: Results of predictive GMean values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.605   0.609   0.607   0.604   0.607
DM                       0.618   0.619   0.622   0.619   0.621
MM                       0.587   0.588   0.593   0.591   0.591
SC                       0.562   0.559   0.575   0.552   0.565
DNA3                     0.680   0.682   0.686   0.685   0.677
DNA11                    0.576   0.570   0.570   0.578   0.571
PS                       0.800   0.799   0.803   0.804   0.803
chen-2002                0.893   0.891   0.893   0.890   0.892
chowdary-2006            0.978   0.978   0.978   0.978   0.978
nutt-2003-v2             0.806   0.806   0.806   0.806   0.806
singh-2002               0.872   0.864   0.862   0.867   0.859
west-2001                0.904   0.908   0.908   0.908   0.904
dbworld-bodies           0.866   0.863   0.854   0.852   0.839
dbworld-bodies-stemmed   0.931   0.927   0.943   0.940   0.931
oh0.wc                   0.881   0.881   0.881   0.880   0.879
oh5.wc                   0.865   0.863   0.864   0.862   0.864
oh10.wc                  0.833   0.833   0.833   0.834   0.833
oh15.wc                  0.863   0.864   0.862   0.863   0.862
re0.wc                   0.842   0.842   0.842   0.842   0.843
re1.wc                   0.864   0.864   0.865   0.863   0.864
Average                  0.791   0.790   0.792   0.791   0.789

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.564   0.561   0.557   0.566   0.558
DM                       0.460   0.469   0.474   0.437   0.437
MM                       0.471   0.448   0.457   0.462   0.465
SC                       0.290   0.280   0.299   0.279   0.281
DNA3                     0.584   0.586   0.564   0.571   0.577
DNA11                    0.380   0.396   0.376   0.401   0.418
PS                       0.737   0.740   0.739   0.747   0.750
chen-2002                0.918   0.923   0.927   0.917   0.928
chowdary-2006            0.982   0.982   0.986   0.981   0.984
nutt-2003-v2             0.896   0.917   0.892   0.900   0.925
singh-2002               0.865   0.876   0.871   0.878   0.881
west-2001                0.891   0.882   0.882   0.882   0.885
dbworld-bodies           0.759   0.776   0.778   0.802   0.821
dbworld-bodies-stemmed   0.820   0.873   0.890   0.882   0.875
oh0.wc                   0.959   0.967   0.968   0.967   0.966
oh5.wc                   0.978   0.975   0.976   0.975   0.975
oh10.wc                  0.958   0.958   0.958   0.959   0.960
oh15.wc                  0.967   0.968   0.968   0.967   0.968
re0.wc                   0.921   0.916   0.917   0.913   0.916
re1.wc                   0.959   0.961   0.963   0.964   0.963
Average                  0.768   0.773   0.772   0.773   0.777
Table A47: Results of predictive Specificity values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.556   0.551   0.558   0.560   0.564
DM                       0.455   0.466   0.464   0.449   0.439
MM                       0.453   0.463   0.450   0.470   0.467
SC                       0.273   0.269   0.284   0.276   0.280
DNA3                     0.555   0.586   0.551   0.584   0.601
DNA11                    0.392   0.397   0.401   0.391   0.388
PS                       0.747   0.741   0.744   0.739   0.741
chen-2002                0.925   0.919   0.927   0.929   0.924
chowdary-2006            0.986   0.986   0.985   0.981   0.986
nutt-2003-v2             0.921   0.892   0.904   0.917   0.908
singh-2002               0.897   0.887   0.889   0.883   0.877
west-2001                0.871   0.885   0.891   0.874   0.891
dbworld-bodies           0.818   0.810   0.810   0.810   0.814
dbworld-bodies-stemmed   0.880   0.878   0.885   0.877   0.881
oh0.wc                   0.966   0.966   0.966   0.966   0.966
oh5.wc                   0.975   0.976   0.977   0.976   0.976
oh10.wc                  0.960   0.960   0.959   0.959   0.960
oh15.wc                  0.967   0.967   0.966   0.966   0.966
re0.wc                   0.914   0.915   0.915   0.913   0.912
re1.wc                   0.963   0.964   0.963   0.964   0.963
Average                  0.774   0.774   0.775   0.774   0.775

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.651   0.648   0.644   0.651   0.645
DM                       0.660   0.669   0.664   0.658   0.658
MM                       0.720   0.706   0.706   0.709   0.709
SC                       0.828   0.828   0.821   0.822   0.834
DNA3                     0.853   0.838   0.837   0.844   0.840
DNA11                    0.712   0.711   0.714   0.715   0.727
PS                       0.975   0.975   0.976   0.977   0.977
chen-2002                0.927   0.930   0.935   0.931   0.937
chowdary-2006            0.988   0.988   0.991   0.986   0.991
nutt-2003-v2             0.842   0.858   0.858   0.850   0.875
singh-2002               0.866   0.877   0.872   0.879   0.881
west-2001                0.871   0.860   0.860   0.860   0.865
dbworld-bodies           0.760   0.783   0.787   0.796   0.811
dbworld-bodies-stemmed   0.820   0.882   0.885   0.878   0.871
oh0.wc                   0.775   0.813   0.812   0.808   0.805
oh5.wc                   0.791   0.795   0.795   0.794   0.793
oh10.wc                  0.721   0.719   0.722   0.726   0.724
oh15.wc                  0.779   0.781   0.784   0.781   0.783
re0.wc                   0.783   0.778   0.779   0.777   0.779
re1.wc                   0.751   0.759   0.760   0.764   0.761
Average                  0.804   0.810   0.810   0.810   0.813
Table A49: Results of predictive Sensitivity values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.642   0.640   0.643   0.644   0.650
DM                       0.667   0.666   0.670   0.667   0.656
MM                       0.709   0.692   0.717   0.723   0.703
SC                       0.818   0.820   0.818   0.832   0.825
DNA3                     0.844   0.848   0.835   0.849   0.853
DNA11                    0.723   0.721   0.717   0.725   0.701
PS                       0.976   0.975   0.976   0.976   0.975
chen-2002                0.931   0.927   0.933   0.934   0.930
chowdary-2006            0.991   0.991   0.988   0.986   0.991
nutt-2003-v2             0.867   0.858   0.858   0.858   0.867
singh-2002               0.898   0.889   0.889   0.884   0.879
west-2001                0.850   0.865   0.871   0.855   0.871
dbworld-bodies           0.811   0.802   0.802   0.802   0.807
dbworld-bodies-stemmed   0.874   0.874   0.882   0.875   0.879
oh0.wc                   0.803   0.805   0.804   0.804   0.803
oh5.wc                   0.792   0.793   0.793   0.793   0.796
oh10.wc                  0.723   0.728   0.725   0.725   0.729
oh15.wc                  0.778   0.778   0.778   0.777   0.779
re0.wc                   0.775   0.779   0.778   0.778   0.778
re1.wc                   0.767   0.768   0.767   0.767   0.767
Average                  0.812   0.811   0.812   0.813   0.812

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.606   0.603   0.599   0.607   0.600
DM                       0.545   0.555   0.556   0.531   0.530
MM                       0.573   0.553   0.559   0.563   0.564
SC                       0.473   0.466   0.478   0.463   0.469
DNA3                     0.694   0.689   0.677   0.685   0.685
DNA11                    0.515   0.524   0.511   0.529   0.544
PS                       0.831   0.833   0.832   0.837   0.839
chen-2002                0.923   0.926   0.931   0.924   0.932
chowdary-2006            0.985   0.985   0.988   0.983   0.987
nutt-2003-v2             0.865   0.884   0.871   0.871   0.897
singh-2002               0.866   0.876   0.871   0.878   0.881
west-2001                0.881   0.870   0.870   0.870   0.875
dbworld-bodies           0.759   0.779   0.782   0.799   0.815
dbworld-bodies-stemmed   0.820   0.876   0.887   0.880   0.873
oh0.wc                   0.860   0.887   0.886   0.884   0.882
oh5.wc                   0.879   0.880   0.880   0.880   0.879
oh10.wc                  0.831   0.829   0.831   0.834   0.833
oh15.wc                  0.868   0.869   0.871   0.869   0.870
re0.wc                   0.849   0.844   0.845   0.842   0.844
re1.wc                   0.848   0.854   0.855   0.858   0.856
Average                  0.773   0.779   0.779   0.779   0.783
Table A51: Results of predictive GMean values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                  6,000s  7,000s  8,000s  9,000s  10,000s
CE                        0.597   0.594   0.599   0.600   0.605
DM                        0.546   0.552   0.554   0.543   0.532
MM                        0.556   0.556   0.557   0.571   0.563
SC                        0.456   0.452   0.468   0.465   0.463
DNA3                      0.673   0.695   0.668   0.694   0.707
DNA11                     0.525   0.528   0.529   0.524   0.516
PS                        0.836   0.835   0.835   0.836   0.837
chen-2002                 0.928   0.923   0.930   0.931   0.927
chowdary-2006             0.988   0.988   0.986   0.983   0.988
nutt-2003-v2              0.891   0.871   0.878   0.884   0.884
singh-2002                0.898   0.888   0.889   0.883   0.878
west-2001                 0.860   0.875   0.881   0.864   0.881
dbworld-bodies            0.814   0.806   0.806   0.806   0.810
dbworld-bodies-stemmed    0.877   0.876   0.883   0.876   0.880
oh0.wc                    0.881   0.882   0.881   0.881   0.881
oh5.wc                    0.878   0.880   0.880   0.879   0.881
oh10.wc                   0.833   0.836   0.834   0.834   0.836
oh15.wc                   0.867   0.867   0.867   0.866   0.867
re0.wc                    0.842   0.844   0.844   0.843   0.842
re1.wc                    0.860   0.860   0.859   0.860   0.860
Average                   0.780   0.780   0.781   0.781   0.782
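The GMean columns above are derived from the corresponding Specificity and Sensitivity tables: for each dataset and timeout, GMean is the geometric mean of the two rates. A minimal sketch of this computation (the helper name `g_mean` is ours; the example values are taken from the CE row at the 6,000s timeout in Tables A47 and A49):

```python
import math

def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of the true-positive and true-negative rates."""
    return math.sqrt(sensitivity * specificity)

# CE dataset at the 6,000s timeout:
sensitivity_ce = 0.642  # Table A49
specificity_ce = 0.556  # Table A47
print(round(g_mean(sensitivity_ce, specificity_ce), 3))  # → 0.597, as in Table A51
```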
data set                  Specificity       Sensitivity       GMean
                          1k      2k-10k    1k      2k-10k    1k      2k-10k
CE                        0.574   0.574     0.644   0.644     0.608   0.608
DM                        0.550   0.550     0.675   0.675     0.607   0.607
MM                        0.472   0.472     0.674   0.674     0.557   0.557
SC                        0.351   0.351     0.778   0.778     0.507   0.507
DNA3                      0.718   0.718     0.879   0.879     0.790   0.790
DNA11                     0.448   0.448     0.749   0.749     0.571   0.571
PS                        0.747   0.747     0.980   0.980     0.833   0.833
chen-2002                 0.844   0.844     0.859   0.859     0.851   0.851
chowdary-2006             0.937   0.937     0.932   0.932     0.934   0.934
nutt-2003-v2              0.933   0.933     0.867   0.867     0.898   0.898
singh-2002                0.757   0.757     0.763   0.763     0.760   0.760
west-2001                 0.905   0.905     0.895   0.895     0.900   0.900
dbworld-bodies            0.670   0.670     0.688   0.688     0.678   0.678
dbworld-bodies-stemmed    0.865   0.865     0.843   0.843     0.854   0.854
oh0.wc                    0.972   0.972     0.818   0.818     0.891   0.891
oh5.wc                    0.977   0.977     0.809   0.809     0.889   0.889
oh10.wc                   0.969   0.959     0.771   0.732     0.865   0.838
oh15.wc                   0.968   0.965     0.767   0.755     0.861   0.853
re0.wc                    0.919   0.919     0.755   0.755     0.833   0.833
re1.wc                    0.969   0.969     0.791   0.789     0.875   0.874
Average                   0.777   0.777     0.797   0.794     0.778   0.776
data set                  Specificity                      Sensitivity                      GMean
                          1k      2k      3k-6k   7k-10k   1k      2k      3k-6k   7k-10k   1k      2k      3k-6k   7k-10k
CE                        0.561   0.574   0.574   0.574    0.629   0.644   0.644   0.644    0.608   0.608   0.608   0.608
DM                        0.550   0.550   0.550   0.550    0.675   0.675   0.675   0.675    0.607   0.607   0.607   0.607
MM                        0.374   0.472   0.472   0.472    0.640   0.674   0.674   0.674    0.557   0.557   0.557   0.557
SC                        0.351   0.351   0.351   0.351    0.778   0.778   0.778   0.778    0.507   0.507   0.507   0.507
DNA3                      0.718   0.718   0.718   0.718    0.879   0.879   0.879   0.879    0.790   0.790   0.790   0.790
DNA11                     0.448   0.448   0.448   0.448    0.749   0.749   0.749   0.749    0.571   0.571   0.571   0.571
PS                        0.747   0.747   0.747   0.747    0.980   0.980   0.980   0.980    0.833   0.833   0.833   0.833
chen-2002                 0.855   0.844   0.844   0.844    0.876   0.859   0.859   0.859    0.851   0.851   0.851   0.851
chowdary-2006             0.937   0.937   0.937   0.937    0.932   0.932   0.932   0.932    0.934   0.934   0.934   0.934
nutt-2003-v2              0.950   0.933   0.933   0.933    0.900   0.867   0.867   0.867    0.898   0.898   0.898   0.898
singh-2002                0.757   0.757   0.757   0.757    0.763   0.763   0.763   0.763    0.760   0.760   0.760   0.760
west-2001                 0.905   0.905   0.905   0.905    0.895   0.895   0.895   0.895    0.900   0.900   0.900   0.900
dbworld-bodies            0.712   0.670   0.670   0.670    0.721   0.688   0.688   0.688    0.678   0.678   0.678   0.678
dbworld-bodies-stemmed    0.819   0.865   0.865   0.865    0.814   0.843   0.843   0.843    0.854   0.854   0.854   0.854
oh0.wc                    0.972   0.972   0.972   0.972    0.818   0.818   0.818   0.818    0.891   0.891   0.891   0.891
oh5.wc                    0.977   0.977   0.977   0.977    0.809   0.809   0.809   0.809    0.889   0.889   0.889   0.889
oh10.wc                   0.960   0.959   0.959   0.959    0.736   0.732   0.732   0.732    0.838   0.838   0.838   0.838
oh15.wc                   0.968   0.968   0.965   0.965    0.767   0.767   0.755   0.755    0.861   0.861   0.853   0.853
re0.wc                    0.920   0.920   0.920   0.919    0.749   0.749   0.749   0.755    0.830   0.830   0.830   0.833
re1.wc                    0.970   0.969   0.969   0.969    0.801   0.789   0.789   0.789    0.881   0.874   0.874   0.874
Average                   0.772   0.777   0.777   0.777    0.796   0.794   0.794   0.794    0.777   0.777   0.776   0.776

References

[MET, 2002] (2002). METAL: Meta-learning assistant for providing user support in machine learning and data mining.
[Barros et al., 2012a] Barros, R. C., Basgalupp, M. P., de Carvalho, A. C. P. L. F., and Freitas, A. A. (2012a). A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(3):291–312.
[Barros et al., 2013] Barros, R. C., Basgalupp, M. P., de Carvalho, A. C. P. L. F., and Freitas, A. A. (2013). Automatic design of decision-tree algorithms with evolutionary algorithms. Evolutionary Computation, 21(4):659–684.
[Barros et al., 2014] Barros, R. C., Basgalupp, M. P., Freitas, A. A., and de Carvalho, A. C. P. L. F. (2014). Evolutionary design of decision-tree algorithms tailored to microarray gene expression data sets. IEEE Transactions on Evolutionary Computation, 18(6):873–892.
[Barros et al., 2015] Barros, R. C., de Carvalho, A. C., and Freitas, A. A. (2015). Automatic Design of Decision-Tree Induction Algorithms. SpringerBriefs in Computer Science. Springer.
[Barros et al., 2012b] Barros, R. C., Winck, A. T., Machado, K. S., Basgalupp, M. P., de Carvalho, A. C. P. L. F., Ruiz, D. D., and de Souza, O. N. (2012b). Automatic design of decision-tree induction algorithms tailored to flexible-receptor docking data. BMC Bioinformatics, 13:310.
[Basgalupp et al., 2018] Basgalupp, M., Barros, R., Sá, A. G., Pappa, G. L., Mantovani, R., de Carvalho, A., and Freitas, A. (2018). Supplementary material for: An experimental evaluation of meta-learning methods for recommending classification algorithms. To be submitted to arXiv.
[Brazdil et al., 2008] Brazdil, P., Giraud-Carrier, C., Soares, C., and Vilalta, R. (2008). Metalearning: Applications to Data Mining. Springer, 1st edition.
[Cheng and Greiner, 1999] Cheng, J. and Greiner, R. (1999). Comparing Bayesian network classifiers. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 101–108. Morgan Kaufmann.
[Daly et al., 2011] Daly, R., Shen, Q., and Aitken, S. (2011). Learning Bayesian networks: Approaches and issues. The Knowledge Engineering Review, 26(2):99–157.
[das Dôres et al., 2018] das Dôres, S. C. N., Soares, C., and Ruiz, D. (2018). Bandit-based automated machine learning. In Proceedings of the Brazilian Conference on Intelligent Systems, BRACIS'18, pages 121–126, New York, NY, USA. IEEE.
[de Sá and Pappa, 2013] de Sá, A. G. C. and Pappa, G. L. (2013). Towards a method for automatically evolving Bayesian network classifiers. In Proceedings of the Annual Conference Companion on Genetic and Evolutionary Computation, pages 1505–1512. ACM.
[de Sá and Pappa, 2014] de Sá, A. G. C. and Pappa, G. L. (2014). A hyper-heuristic evolutionary algorithm for learning Bayesian network classifiers. In Proceedings of the Ibero-American Conference on Artificial Intelligence, pages 430–442. Springer.
[de Sá et al., 2017] de Sá, A. G. C., Pinto, W. J. G. S., Oliveira, L. O. V. B., and Pappa, G. L. (2017). RECIPE: A grammar-based framework for automatically evolving classification pipelines. In Proceedings of the European Conference on Genetic Programming (EuroGP), pages 246–261. Springer International Publishing.
[de Souto et al., 2008] de Souto, M., Costa, I., de Araujo, D., Ludermir, T., and Schliep, A. (2008). Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):497.
[Deb et al., 2002] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.
[Demšar, 2006] Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.
[Eiben and Smith, 2015] Eiben, A. E. and Smith, J. (2015). From evolutionary computation to the evolution of things. Nature, 521(7553):476–482.
[Elsken et al., 2019] Elsken, T., Metzen, J. H., and Hutter, F. (2019). Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21.
[Feurer et al., 2015a] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015a). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc.
[Feurer et al., 2015b] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015b). Methods for improving Bayesian optimization for AutoML. In ICML 2015 AutoML Workshop.
[Feurer et al., 2015c] Feurer, M., Springenberg, J. T., and Hutter, F. (2015c). Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 1128–1135.
[Freitas, 2008] Freitas, A. A. (2008). A review of evolutionary algorithms for data mining. In Soft Computing for Knowledge Discovery and Data Mining, pages 79–111. Springer US.
[Freitas et al., 2011] Freitas, A. A., Vasieva, O., and Magalhães, J. P. d. (2011). A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related. BMC Genomics, 12(1):1–11.
[Fusi et al., 2018] Fusi, N., Sheth, R., and Elibol, H. M. (2018). Probabilistic matrix factorization for automated machine learning. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS'18, pages 3348–3357, Red Hook, NY, USA. Curran Associates Inc.
[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18.
[Ho and Basu, 2002] Ho, T. and Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300.
[Ho et al., 2006] Ho, T., Basu, M., and Law, M. (2006). Measures of geometrical complexity in classification problems. In Data Complexity in Pattern Recognition. Springer London.
[Hutter et al., 2019] Hutter, F., Kotthoff, L., and Vanschoren, J., editors (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer, New York, NY, USA. Available at http://automl.org/book.
[Iman and Davenport, 1980] Iman, R. and Davenport, J. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595.
[Japkowicz and Shah, 2011] Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, NY, USA.
[Jin et al., 2019] Jin, H., Song, Q., and Hu, X. (2019). Auto-Keras: An efficient neural architecture search system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'19, pages 1946–1956, New York, NY, USA. ACM.
[Koch et al., 2018] Koch, P., Golovidov, O., Gardner, S., Wujek, B., Griffin, J., and Xu, Y. (2018). Autotune: A derivative-free optimization framework for hyperparameter tuning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'18, pages 443–452, New York, NY, USA. ACM.
[Kotthoff et al., 2017] Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25):1–5.
[Křen et al., 2017] Křen, T., Pilát, M., and Neruda, R. (2017). Automatic creation of machine learning workflows with strongly typed genetic programming. International Journal on Artificial Intelligence Tools, 26(05):1760020.
[Larcher and Barbosa, 2019] Larcher, C. H. N. and Barbosa, H. J. C. (2019). Auto-CVE: A coevolutionary approach to evolve ensembles in automated machine learning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO'19, pages 392–400, New York, NY, USA. ACM.
[Leite et al., 2012] Leite, R., Brazdil, P., and Vanschoren, J. (2012). Selecting Classification Algorithms with Active Testing, pages 117–131. Springer, Berlin.
[Li et al., 2018] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52.
[Mckay et al., 2010] Mckay, R., Hoai, N., Whigham, P., Shan, Y., and O'Neill, M. (2010). Grammar-based genetic programming: A survey. Genetic Programming and Evolvable Machines, 11(3):365–396.
[Michie et al., 1994] Michie, D., Spiegelhalter, D. J., Taylor, C. C., and Campbell, J., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
[Mohr et al., 2018] Mohr, F., Wever, M., and Hüllermeier, E. (2018). ML-Plan: Automated machine learning via hierarchical planning. Machine Learning, 107:1495–1515.
[Nyathi and Pillay, 2017] Nyathi, T. and Pillay, N. (2017). Automated design of genetic programming classification algorithms using a genetic algorithm. In EvoApplications (2), volume 10200 of Lecture Notes in Computer Science, pages 224–239.
[Olson et al., 2016a] Olson, R., Urbanowicz, R., Andrews, P., Lavender, N., Kidd, L., and Moore, J. H. (2016a). Automating biomedical data science through tree-based pipeline optimization. In Proceedings of the European Conference on the Applications of Evolutionary Computation, pages 123–137.
[Olson et al., 2016b] Olson, R. S., Bartley, N., Urbanowicz, R. J., and Moore, J. H. (2016b). Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 485–492. ACM.
[Pappa et al., 2005] Pappa, G. L., Baines, A. J., and Freitas, A. A. (2005). Predicting post-synaptic activity in proteins with data mining. Bioinformatics, 21(2):19–25.
[Pappa and Freitas, 2009] Pappa, G. L. and Freitas, A. (2009). Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach. Springer, 1st edition.
[Pappa et al., 2014] Pappa, G. L., Ochoa, G., Hyde, M. R., Freitas, A. A., Woodward, J., and Swan, J. (2014). Contrasting meta-learning and hyper-heuristic research: The role of evolutionary algorithms. Genetic Programming and Evolvable Machines, 15(1):3–35.
[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
[Scott and De Jong, 2016] Scott, E. O. and De Jong, K. A. (2016). Evaluation-time bias in quasi-generational and steady-state asynchronous evolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 845–852. ACM.
[Sohn et al., 2017] Sohn, A., Olson, R. S., and Moore, J. H. (2017). Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 489–496. ACM.
[Thornton et al., 2013] Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'13, pages 847–855. ACM.
[van Rijn et al., 2015] van Rijn, J. N., Abdulrahman, S. M., Brazdil, P., and Vanschoren, J. (2015). Fast algorithm selection using learning curves. In Advances in Intelligent Data Analysis XIV - 14th International Symposium, IDA 2015, Saint Etienne, France, October 22-24, pages 298–309.
[Vanschoren, 2018] Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
[Vanschoren et al., 2014] Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. (2014). OpenML: Networked science in machine learning. SIGKDD Explorations Newsletter, 15(2):49–60.
[Wan et al., 2015] Wan, C., Freitas, A., and de Magalhaes, J. (2015). Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2):262–275.
[Wilcoxon et al., 1970] Wilcoxon, F., Katti, S. K., and Wilcox, R. A. (1970). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1:171–259.
[Witten et al., 2016] Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 4th edition.
[Zaki and Meira Jr, 2020] Zaki, M. J. and Meira Jr, W. (2020). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, Cambridge, UK, 2nd edition.