An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms (Extended Version)
Márcio P. Basgalupp, Rodrigo C. Barros, Alex G. C. de Sá, Gisele L. Pappa, Rafael G. Mantovani, André C. P. L. F. de Carvalho, Alex A. Freitas
Abstract
This paper presents an experimental comparison among four Automated Machine Learning (AutoML) methods for recommending the best classification algorithm for a given input dataset. Three of these methods are based on Evolutionary Algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method based on the Combined Algorithm Selection and Hyper-parameter optimisation (CASH) approach. The EA-based methods build classification algorithms from a single machine learning paradigm: either decision-tree induction, rule induction, or Bayesian network classification. Auto-WEKA combines algorithm selection and hyper-parameter optimisation to recommend classification algorithms from multiple paradigms. We performed controlled experiments where these four AutoML methods were given the same runtime limit for different values of this limit. In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant. However, the EA evolving decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets, by comparison with many algorithms from other learning paradigms that can be recommended by Auto-WEKA. We also observed that Auto-WEKA has shown meta-overfitting, a form of overfitting at the meta-learning level, rather than at the base-learning level.
Classification is one of the main machine learning tasks and, hence, there is a large variety of classification algorithms available [Witten et al., 2016, Zaki and Meira Jr, 2020]. However, in most real-world applications, the choice of classification algorithm for a new dataset or application domain is still mainly an ad hoc decision.

In this context, the use of meta-learning for algorithm recommendation is a very important research area with seminal work dating back many years, including the StatLog [Michie et al., 1994] and METAL [MET, 2002] projects. Meta-learning can be defined as learning how to learn, which involves learning, from previous experience, what is the best machine learning algorithm (and its best hyper-parameter setting) for a given dataset [Brazdil et al., 2008, Vanschoren, 2018]. Meta-learning systems for algorithm recommendation can be divided into two broad groups, namely: (a) systems that perform algorithm selection based on meta-features [Brazdil et al., 2008], which is the most investigated type; and (b) systems that search for the best possible classification algorithm in a given algorithm space [Thornton et al., 2013].

Meta-feature-based meta-learning for algorithm selection and recommendation consists of two basic steps [Brazdil et al., 2008]. First, the creation of a meta-training set where each meta-instance represents a dataset, meta-features represent dataset properties, and each meta-class represents a (base-level) learning algorithm. Second, the induction of a meta-classification model by a (meta) classification algorithm over the meta-training set, thus allowing the recommendation of algorithm(s) for a novel dataset (not included in the meta-training set). A key issue is the design of a good set of meta-features, with enough predictive power to support an accurate recommendation of the best learning algorithm.
Extensive research on this topic has produced a large variety of meta-features [Brazdil et al., 2008, Ho and Basu, 2002, Ho et al., 2006], but finding a set of meta-features with very good predictive power is still an open and difficult problem.

A limitation of meta-feature-based meta-learning research is that usually a small number of candidate classification algorithms are considered as meta-classes. This is because, in general, the larger the number of candidate classification algorithms used as meta-classes, the more difficult it is for the meta-classification algorithm to accurately predict all meta-classes. In addition, it is difficult to produce large meta-datasets for meta-learning, since in order to compute the meta-class of each meta-instance we need to run all candidate classification algorithms on all datasets (one for each meta-instance).

These difficulties have motivated research on the second type of meta-learning for algorithm recommendation: meta-learning systems using search or optimisation methods to indicate the best classification algorithm for a given target dataset, in a given algorithm space [Pappa and Freitas, 2009, Leite et al., 2012, Thornton et al., 2013, Pappa et al., 2014, Kotthoff et al., 2017, Barros et al., 2015, van Rijn et al., 2015]. This work focuses mainly on this type of meta-learning system, which is a type of Automated Machine Learning (AutoML) [Hutter et al., 2019], since such systems effectively automate the process of selecting the best algorithm and its hyper-parameters for the input dataset. This AutoML approach bypasses the need for designing meta-features and it can, in principle, consider a substantially larger number of candidate classification algorithms and hyper-parameters than meta-feature-based meta-learning systems.
Note that although this approach does not explicitly use a learning algorithm at the meta-level, some methods following this AutoML approach (like some methods evaluated in this work) perform a form of meta-learning, because the search is performed in the space of candidate learning algorithms and is guided by an evaluation function based on the accuracy of learning algorithms at the base level. Therefore, the search method at the meta-level is implicitly learning from the results of base-level learning algorithms. Note, however, that this kind of meta-learning of course does not occur in the case of simple and popular methods for algorithm selection and parameter configuration, like random search and grid search, which do not perform any learning by themselves.

In this context, the main contribution of this paper is to present an extensive empirical comparison of the predictive performance of four sophisticated AutoML methods for the recommendation of classification algorithms. One of these methods, Auto-WEKA [Thornton et al., 2013, Kotthoff et al., 2017], performs algorithm selection and hyper-parameter configuration by considering all candidate classification algorithms available in the well-known WEKA data mining tool, which includes algorithms based on several different types of knowledge (or model) representations – e.g., decision trees, if-then classification rules, Bayesian network classifiers, neural networks, support vector machines, etc. The other three methods are based on evolutionary algorithms (EAs). Unlike Auto-WEKA, each of the three EAs focuses on a search space containing classification algorithms based on a single type of knowledge representation. More precisely, the EAs evolve rule induction algorithms [Pappa and Freitas, 2009], decision-tree induction algorithms [Barros et al., 2015], and Bayesian network classification algorithms [de Sá and Pappa, 2014].
Hence, the EAs produce a narrower diversity of classification algorithms in terms of knowledge representation. However, within its specialized knowledge representation, an EA can have more flexibility (or autonomy) to construct new classification algorithms, rather than just optimising the configuration of hyper-parameters for an existing classification algorithm, as discussed later.

There are also other recently proposed EAs for related AutoML tasks. In particular, the EAs proposed in [de Sá et al., 2017, Křen et al., 2017, Olson et al., 2016a] try to optimize an entire machine learning pipeline for a given dataset, including the choice of data preprocessing methods (like feature scaling operators and feature selection methods) and classification algorithm. By contrast, we focus on using EAs that recommend only classification algorithms. In addition, in [Nyathi and Pillay, 2017] an EA is proposed to automatically evolve another type of EA (genetic programming) for classification. By contrast, the EAs used here automatically evolve more conventional (non-evolutionary) types of classification algorithms, as mentioned earlier.

Controlled experiments were performed, where the four previous AutoML methods (the three EAs and Auto-WEKA) had the same runtime limit for different values of this limit. In general, the difference in predictive accuracy of the three best AutoML methods was not statistically significant, but Auto-WEKA showed meta-overfitting, a form of overfitting at the meta-learning level, due to evaluating many different (base-level) classification algorithms during its search for the best algorithm. This is in contrast to the standard overfitting at the base level, due to evaluating many different models built by the same classification algorithm.
In addition, the EA evolving decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets, by comparison with many algorithms from other learning paradigms that can be recommended by Auto-WEKA. Furthermore, an analysis of the different types of classification algorithms recommended by Auto-WEKA shows that, overall, decision-tree and ensemble algorithms were the most frequently recommended types of algorithms, whilst rule induction algorithms were the least recommended type.

The remainder of this paper is organised as follows. Section 2 reviews the background on AutoML methods for classification-algorithm recommendation, focusing on the four previously mentioned AutoML methods. Section 3 describes the methodology adopted in this study for executing the experimental analyses, whose extensive results are presented in Section 4. Finally, the main conclusions and future work suggestions are presented in Section 5.

This section reviews the main concepts underlying several AutoML methods for automatic recommendation of the best classification algorithm for a given input dataset. It mainly covers the four AutoML methods evaluated in this work, Auto-WEKA and three EAs, as mentioned earlier. Its last subsection briefly reviews related work on other evolutionary AutoML methods.
Initial work on meta-learning focused on selecting the best classification algorithm(s) for a given dataset, explicitly or implicitly assuming a default configuration (hyper-parameter settings) for the candidate algorithms. However, given that the success of a classification algorithm strongly depends on its hyper-parameter settings, more recent work has focused on the so-called Combined Algorithm Selection and Hyper-parameter optimisation (CASH) problem [Thornton et al., 2013]. In this section, we review the AutoML methods evaluated in this work that address the CASH problem by considering, as candidate algorithms to be recommended, classification algorithms from multiple knowledge (model) representations, like decision trees, IF-THEN classification rules, probabilistic graphical models, neural networks, ensembles, etc.

In this context, an advanced and well-known system designed for the CASH problem is Auto-WEKA [Thornton et al., 2013, Kotthoff et al., 2017], whose search space includes all classification algorithms available in
Weka [Hall et al., 2009] with their corresponding candidate hyper-parameter settings.

In order to search the space of candidate algorithms and their hyper-parameter settings, Auto-WEKA uses a stochastic search method, named Sequential Model-Based Optimisation (SMBO), and a loss function to measure classification error. The goal is to find the classification algorithm and its corresponding hyper-parameter settings that minimise the value of the loss function for the target dataset. SMBO essentially works as follows. First, the CASH problem is formulated as a hierarchical hyper-parameter search space where there is a new root-level hyper-parameter that selects between algorithms. Hence, a candidate solution is an algorithm selected at the root level and its hyper-parameters selected at lower levels. As shown in Algorithm 1, SMBO initially builds a model (M_L, line 1) representing the dependency of the loss function on the candidate hyper-parameter settings. Next, it iteratively uses the model to generate a promising candidate hyper-parameter setting (λ, line 3), evaluates the setting (lines 4-5), and updates the model according to the evaluation (line 6). SMBO is flexible enough to be used with different algorithms for building the dependency model, with random forests being used in [Thornton et al., 2013, Kotthoff et al., 2017].

Algorithm 1 Pseudo-code of SMBO. Adapted from [Thornton et al., 2013].
1: Initialise model M_L; H = ∅
2: while time budget has not been exceeded do
3:     λ = candidate configuration from M_L
4:     compute c = L(A_λ, D^(i)_train, D^(i)_valid)
5:     H = H ∪ {(λ, c)}
6:     Update M_L given H
7: end while
8: return λ from H with minimal c

The approach used by Auto-WEKA was also extended to produce another system for solving the CASH problem, namely Auto-sklearn [Feurer et al., 2015b], which uses the scikit-learn machine learning library [Pedregosa et al., 2011] rather than
Weka. Auto-sklearn extends Auto-WEKA's approach in two ways. First, it uses an ensemble of the classification models generated by the SMBO search method, instead of just one model as in Auto-WEKA. Second, it uses meta-feature-based meta-learning to find good classification algorithm configurations (see [Feurer et al., 2015b, Feurer et al., 2015a] for details of these two extensions). In addition, meta-feature-based meta-learning has been recently used to initialise SMBO's search for the optimal solution to the CASH problem [Feurer et al., 2015c]. It should be noted that the aforementioned systems, although very advanced, are limited to finding a combination of algorithm and hyper-parameter settings among existing combinations in the base machine learning toolkit being used (Weka or scikit-learn). They do not have enough autonomy for constructing a new classification algorithm, which can be done in some cases by the EA-based meta-learning methods discussed in the next section.
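For concreteness, the SMBO loop of Algorithm 1 can be sketched in Python as follows. The one-dimensional toy loss, the random candidate sampler, and the nearest-neighbour surrogate are illustrative assumptions, much simpler than the hierarchical search space and random-forest model used by Auto-WEKA:

```python
import random

def smbo(loss, sample_candidate, budget=50, k=3):
    """Minimal SMBO sketch (Algorithm 1): keep a history H of
    (configuration, cost) pairs, use a surrogate model of the loss
    to pick promising candidates, and return the best one found."""
    history = []                       # H = {(lambda, c)}

    def surrogate(x):
        # Toy surrogate M_L: mean cost of the k nearest evaluated configs.
        if not history:
            return 0.0
        nearest = sorted(history, key=lambda lc: abs(lc[0] - x))[:k]
        return sum(c for _, c in nearest) / len(nearest)

    for _ in range(budget):            # "while time budget not exceeded"
        # Pick the most promising of a few random proposals under M_L.
        pool = [sample_candidate() for _ in range(10)]
        lam = min(pool, key=surrogate)
        c = loss(lam)                  # evaluate L(A_lambda, D_train, D_valid)
        history.append((lam, c))       # H = H ∪ {(lambda, c)}; M_L updates implicitly
    return min(history, key=lambda lc: lc[1])   # lambda with minimal c

# Toy 1-D "hyper-parameter": minimise (x - 2)^2 over [-5, 5].
random.seed(0)
best_x, best_c = smbo(lambda x: (x - 2) ** 2, lambda: random.uniform(-5, 5))
```

In the real systems, the configuration λ is a full algorithm-plus-hyper-parameter setting and the loss is measured by training and validating the corresponding classifier.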
Each of the Evolutionary Algorithm-based (EA-based) AutoML methods evaluated in this work explores a search space with classification algorithms from a different knowledge (model) representation, namely: rule induction [Pappa and Freitas, 2009], decision-tree induction [Barros et al., 2012a], or Bayesian network classifiers [de Sá and Pappa, 2014]. EAs are search methods based on the natural selection principle [Eiben and Smith, 2015]. They have been extensively used for evolving classification models in machine learning [Freitas, 2008, Barros et al., 2012a]. In this work, however, the EAs evolve full classification algorithms rather than classification models. In EA terminology, the EAs used in this work are hyper-heuristic search methods, which perform a search in the space of candidate classification algorithms [Pappa et al., 2014]; whilst EAs that perform a search in the space of classification models are conventional meta-heuristic search methods.

The three EAs receive as input a high-level pseudo-code with the main algorithmic components to be used to create classification algorithms from a target algorithm type. For instance, if the target is rule induction algorithms, the components include a rule search method, a rule evaluation criterion, etc. Each component can be instantiated in different ways, e.g., confidence or information gain can be used to instantiate the rule evaluation component. Given an input dataset, an EA searches for the best combination of algorithmic components based on an evaluation function (called fitness function in EAs).
Thus, the EA's output is a classification algorithm of the target type. Note that the EAs can sometimes generate a new classification algorithm which works in a way different from all current (manually-designed) classification algorithms. This is because the EAs can combine the prespecified algorithmic components in novel ways, not yet explored by human algorithm designers.

As an example of algorithm construction, let us consider the EA for evolving decision-tree algorithms. That EA's algorithmic components include, among other types of components, 15 different split criteria and 5 tree-pruning methods. A manually-designed decision-tree algorithm like J48 (WEKA's version of C4.5) or CART offers just a subset of these split criteria and pruning methods. Hence, when Auto-WEKA configures a decision-tree algorithm, it first chooses exactly which algorithm will be configured, say J48 or CART, and then it considers only the split criteria and tree-pruning methods/hyper-parameters available in WEKA for the chosen algorithm. It cannot combine, e.g., the information gain ratio used by J48 with the cost-complexity pruning used by CART. By contrast, the EA can construct a new decision-tree induction algorithm with any combination of split criterion and tree-pruning method/hyper-parameters (as well as any combination of other specific components), regardless of whether or not the chosen combination of components occurs in a current manually-designed decision-tree algorithm.

Algorithm 2 shows the high-level pseudo-code of the three EAs for recommending classification algorithms used in this work. First, they generate a population of candidate solutions (classification algorithms), or individuals, based on the target pseudo-code and sets of components given as input. For a fixed number of iterations (generations) g, the classification algorithms represented by the individuals in the initial population P are built and run on the input dataset.
The input dataset is divided into meta-training, meta-validation, and meta-test sets. In order to measure the fitness (quality) of an individual, its corresponding classification algorithm is executed over the meta-training set to build a classification model. Afterwards, a given predictive performance measure is used to evaluate the model's performance on the meta-validation set, and this measure is used as the fitness of the individual.

To avoid overfitting, every s generations the examples belonging to the meta-training and meta-validation sets are resampled, and the best individual found in that sample is saved in BestSet. During the EA run, individuals at different generations may therefore be evaluated with different data. Based on the individuals' fitness values, the best candidate classification algorithms are selected to undergo EA operations such as crossover and mutation, according to user-defined probabilities. At the end of an EA run, the best algorithm output by the EA is chosen as follows. Considering the individuals saved in BestSet, a new cross-validation procedure is performed on the training set. All individuals are then executed using the same cross-validation folds, and the best classification algorithm is output. That algorithm is finally evaluated on the meta-test set, which was not seen during the EA run, to compute the final measure of predictive accuracy for the evolved classification algorithm.

Algorithm 2 Pseudo-code of the evolutionary algorithms for generating classification algorithms.
BuildTailoredAlgorithm(dataset, generalPseudocode, components, g, s)
    P = CreatePopulation(generalPseudocode, components)
    count = 0
    BestSet = ∅
    while count < g do
        for all indiv in P do
            BuildAlgorithm(indiv)
            RunAlgorithm(indiv, dataset)
        end for
        TournamentSelection(P)
        Crossover(P)
        Mutation(P)
        count = count + 1
        if count mod s = 0 then
            BestSet = BestSet ∪ {best in P}
            Resample dataset
        end if
    end while
    return best in BestSet according to a predictive performance measure

All three EAs discussed in this paper follow Algorithm 2, but they vary in how they represent individuals, the types of components used to build classification algorithms (depending on the type of target classification algorithm), and the performance measure used to select the best individuals. All algorithms require user-defined hyper-parameters which include, besides the number of iterations (generations), the number of individuals, the rates of crossover and mutation (operators used to produce new individuals from existing ones), the rate of elitism (i.e. the percentage of individuals from the current generation that are passed unaltered to the next generation), and the number of individuals selected to undergo tournament selection.
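The generational loop of Algorithm 2 can be sketched in Python as follows. The bit-string individuals and the max-ones toy fitness are illustrative stand-ins for the real encodings and for the meta-validation F-measure, which are specific to each EA:

```python
import random

def evolve(fitness, length=10, pop_size=20, g=30, s=10,
           cx_rate=0.9, mut_rate=0.1, tournament_k=2):
    """Sketch of Algorithm 2: evolve bit-string 'algorithms' with tournament
    selection, one-point crossover and bit-flip mutation, saving the best
    individual of each s-generation window in best_set."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_set = []
    for count in range(1, g + 1):
        # "Build and run" each individual, i.e. compute its fitness.
        scored = [(fitness(ind), ind) for ind in pop]

        def select():  # tournament selection among k random individuals
            return max(random.sample(scored, tournament_k))[1]

        nxt = []
        while len(nxt) < pop_size:
            a, b = select(), select()
            if random.random() < cx_rate:          # one-point crossover
                cut = random.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # Bit-flip mutation, one bit at a time.
            nxt.append([bit ^ (random.random() < mut_rate) for bit in a])
        pop = nxt
        if count % s == 0:                         # save best; resample data here
            best_set.append(max(scored)[1])
    return max(best_set, key=fitness)              # final re-evaluation of BestSet

random.seed(0)
best = evolve(sum)   # toy fitness: number of 1-bits (the max-ones problem)
```

In the real EAs, the individuals encode algorithmic components rather than bits, and the resampling step draws new meta-training/meta-validation splits.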
The first EA proposed for generating a full classification algorithm customised to a given input dataset evolves rule induction algorithms (which output IF-THEN classification rules), using a Grammar-based Genetic Programming (GGP) algorithm [Pappa and Freitas, 2009], named GGP-RI (GGP for Rule Induction). GGPs differ from standard EAs as they receive as input a grammar, and all candidate solutions generated must obey the grammar's production rules.

The grammar has production rules specifying how the following components of rule induction algorithms can be instantiated and combined together into valid algorithms: the decision to generate an unordered rule set or an ordered rule list; different methods to initialise, search, evaluate, and prune rules; as well as different loop structures and conditional statements to control the iterative processes of constructing a rule and adding/removing rules to/from a set/list. Each individual is represented by a tree generated by applying the production rules. Each tree is mapped to a rule induction algorithm. The GGP grammar has 26 non-terminals and 83 production rules and, varying the order in which the production rules are applied, the GGP's search space has over 2 billion different rule induction algorithms. GGP-RI's fitness function is the F-measure (the harmonic mean of precision and recall) of a candidate rule induction algorithm on the meta-validation set (as explained earlier).
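To illustrate how a grammar constrains the construction of candidate algorithms, the toy sketch below derives an "algorithm" from a three-component grammar. These production rules are invented for illustration and are far simpler than GGP-RI's actual grammar of 26 non-terminals and 83 production rules:

```python
import random

# Invented toy grammar in GGP-RI's spirit: non-terminals (in <angle brackets>)
# map to lists of candidate expansions; lowercase strings are terminals.
GRAMMAR = {
    "<Algorithm>": [["<RuleModel>", "<Search>", "<Evaluate>"]],
    "<RuleModel>": [["ordered-list"], ["unordered-set"]],
    "<Search>":    [["greedy-search"], ["beam-search"]],
    "<Evaluate>":  [["confidence"], ["information-gain"]],
}

def derive(symbol):
    """Expand a symbol by randomly choosing productions until only terminals
    remain; the resulting derivation maps to one candidate rule induction
    algorithm, so every generated individual is grammar-valid by construction."""
    if symbol not in GRAMMAR:
        return [symbol]                        # terminal: nothing to expand
    production = random.choice(GRAMMAR[symbol])
    return [tok for part in production for tok in derive(part)]

random.seed(1)
algorithm = derive("<Algorithm>")   # one terminal per algorithmic component
```

GGP-RI's individuals are the derivation trees themselves (so crossover and mutation can swap grammar-compatible subtrees), rather than the flat token lists used in this sketch.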
A hyper-heuristic EA that generates decision-tree induction algorithms, called HEAD-DT (Hyper-heuristic Evolutionary Algorithm for Automatically Designing Decision-Tree algorithms), is described in [Barros et al., 2013, Barros et al., 2014]. Unlike GGP, HEAD-DT is based on a genetic algorithm with linear encoding. An individual (candidate decision-tree induction algorithm) consists of a set of many options to instantiate the following components of decision-tree induction algorithms: the data split procedure used at each node of the tree (i.e., whether to perform a binary or multi-way split and which feature evaluation function should be used), the tree expansion stopping criteria, approaches to cope with missing values (in both the training and testing phases), and the tree pruning procedure. For each algorithmic component, an individual specifies both categorical options (e.g., the choice of feature evaluation function, out of 16 predefined functions) and the numerical value of hyper-parameters associated with the chosen options (e.g., a hyper-parameter that controls the degree of pruning for a given pruning method). HEAD-DT's fitness function is the F-measure of a candidate decision-tree induction algorithm on the meta-validation set, and its search space contains 21,319,200 different decision-tree algorithms. It was applied with success in different application domains, such as gene expression classification [Barros et al., 2014] and rational drug design [Barros et al., 2012b].
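HEAD-DT's linear (flat) encoding can be illustrated by drawing one individual as a fixed set of genes, each gene fixing one categorical choice or numeric hyper-parameter. The component names and option lists below are invented examples, not HEAD-DT's actual 16 feature evaluation functions or 5 pruning methods:

```python
import random

# Invented option lists for illustration of a linear-encoded individual.
SPLIT_CRITERIA = ["info-gain", "gain-ratio", "gini", "chi-square"]
PRUNING = ["none", "reduced-error", "cost-complexity"]

def random_individual():
    """One gene per algorithmic component; categorical genes pick an option,
    numeric genes pick a hyper-parameter value for the chosen option."""
    return {
        "split_criterion": random.choice(SPLIT_CRITERIA),      # categorical
        "binary_split": random.random() < 0.5,                 # categorical
        "min_instances_per_leaf": random.randint(1, 20),       # numeric gene
        "pruning_method": random.choice(PRUNING),              # categorical
        "pruning_strength": round(random.uniform(0.0, 1.0), 2) # numeric gene
    }

random.seed(0)
indiv = random_individual()   # decoded directly into one decision-tree algorithm
```

Because every gene combination decodes to a valid algorithm, standard genetic-algorithm crossover and mutation can be applied position-wise without a grammar.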
The EA for generating Bayesian Network Classification (BNC) algorithms is named HHEA-BNC (Hyper-Heuristic Evolutionary Algorithm for creating a BNC algorithm) [de Sá and Pappa, 2014, de Sá and Pappa, 2013]. BNC algorithms usually have two phases [Cheng and Greiner, 1999, Daly et al., 2011]: (i) network-structure learning; and (ii) parameter learning. In the first phase, the algorithm learns which nodes (features) in the network should be connected to each other. The parameter learning phase, in turn, learns the Conditional Probability Tables (CPTs) for each node of the network (the BNC model). However, learning the parameters of a BNC model is a relatively straightforward procedure once the network structure has been determined. For this reason, HHEA-BNC focuses on the structure learning phase. HHEA-BNC encodes candidate BNC algorithms using a dynamic array-like representation, where each position in the array represents a different algorithm component to be instantiated. In order to select and instantiate the components of the BNC algorithm, HHEA-BNC uses a top-down approach, where the first instantiated component of the BNC algorithm being created is the search method, with a choice among different methods. The search method defines the type of algorithm being generated (naïve Bayes, score-based, constraint-based, or hybrid) and, consequently, the type of BNC model being created (i.e. tree, graph, or no edges between features, in the case of naïve Bayes). Based on this first choice, different BNC algorithms can be generated, including components like scoring metrics, statistical independence tests, the maximal number of parents per node, etc. The smallest individual has three components, while the largest has more. The search space of HHEA-BNC has 60,510,000 different candidate BNC algorithms. HHEA-BNC's fitness function is the F-measure of a candidate BNC algorithm on the meta-validation set.
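HHEA-BNC's top-down, variable-length encoding can be illustrated as follows: the first gene (the search method) determines which further component genes exist. The search methods, component names, and option lists below are invented stand-ins, not HHEA-BNC's actual components:

```python
import random

# Invented illustration of a top-down, dynamic array-like encoding: the first
# position (the search method) decides which other positions the array needs.
COMPONENTS_BY_SEARCH_METHOD = {
    "naive-bayes":      [],                                     # no structure search
    "score-based":      ["scoring_metric", "max_parents"],
    "constraint-based": ["independence_test", "significance"],
}
OPTIONS = {
    "scoring_metric":    ["AIC", "BIC", "K2"],
    "max_parents":       [1, 2, 3],
    "independence_test": ["chi-square", "mutual-information"],
    "significance":      [0.01, 0.05],
}

def random_bnc_algorithm():
    """Instantiate the search method first, then only the components that
    this choice makes relevant (hence the variable individual length)."""
    method = random.choice(list(COMPONENTS_BY_SEARCH_METHOD))
    genes = [("search_method", method)]
    genes += [(c, random.choice(OPTIONS[c]))
              for c in COMPONENTS_BY_SEARCH_METHOD[method]]
    return genes

random.seed(0)
indiv = random_bnc_algorithm()
```

The variable length mirrors the text above: choosing naïve Bayes needs no structure-learning components, whereas score-based or constraint-based search adds further positions to the array.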
We have also identified three evolutionary AutoML methods that try to optimize the entire classification pipeline: (i) the Tree-based Pipeline Optimization Tool (TPOT) [Olson et al., 2016a, Olson et al., 2016b]; (ii) Genetic Programming for Machine Learning (GP-ML) [Křen et al., 2017]; and (iii) REsilient ClassifIcation Pipeline Evolution (RECIPE) [de Sá et al., 2017]. A pipeline is defined as a machine learning workflow that solves the classification task. To solve this type of task, a pipeline may contain data preprocessing methods (e.g., feature normalization or feature selection), must have a classification algorithm (e.g., naïve Bayes or a support vector machine), and may have a post-processing approach (e.g., voting or stacking). Therefore, these methods take into account various aspects of machine learning instead of focusing only on the classification algorithm. This means that these methods can select and configure a range of different classification-related methods during the evolutionary search, as they are not centered on just one type of classification algorithm. This basic principle is also followed by Auto-WEKA and Auto-sklearn, which are well-known non-EA-based AutoML methods. The aforementioned EA-based AutoML methods are discussed in somewhat more detail next.

TPOT is a genetic programming-based method that searches for the most suitable classification pipeline for the input dataset. It encompasses (part of) the available methods in the scikit-learn library in its search space, and allows different ways of combining the data preprocessing methods (in sequence or in parallel) and the classification algorithms (supporting ensemble approaches or not). Although TPOT has been designed for general classification, it also has a specific version for bioinformatics studies, named TPOT-MDR [Sohn et al., 2017].
TPOT-MDR includes two new data preprocessing operators that are used in genetic analyses of human diseases: Multifactor Dimensionality Reduction (MDR) and the Expert Knowledge Filter (EKF). Besides, both versions perform multi-objective search using Pareto selection (based on the well-known NSGA-II algorithm [Deb et al., 2002]) with two objectives: maximizing the predictive accuracy of the pipeline and minimizing the pipeline's overall complexity (represented by the number of pipeline operators).

The main issue when using TPOT is that it can generate classification pipelines that are invalid or arbitrary during its evolutionary process, i.e., pipelines that do not solve the classification task itself. This happens because TPOT does not impose any constraints when combining the ML components to create the pipelines. For instance, TPOT can create a pipeline without a classification algorithm [Olson et al., 2016a]. This, of course, makes the evolutionary process waste resources, as various individuals would not solve the classification task. This can be considered a significant drawback of TPOT in the context of the classification task.

GP-ML overcomes this limitation by using a strongly typed genetic programming (STGP) method. An STGP method restricts the scikit-learn pipelines in such a way that makes them valid from the machine learning point of view. In addition, GP-ML applies an asynchronous evolutionary algorithm [Scott and De Jong, 2016] instead of a generational one. [Scott and De Jong, 2016] observed that asynchronous evolution is biased towards the evaluation of faster pipelines in some parts of the search space. However, [Křen et al., 2017] consider this bias an advantage for the AutoML task, because a faster pipeline is usually preferable to a slower one when both present similar predictive accuracy values.

RECIPE follows the same basic principle as GP-ML, i.e., it only allows the generation of valid pipelines during the evolutionary process.
In order to implement this principle, RECIPE defines a grammar which encompasses the classification knowledge in scikit-learn. Therefore, RECIPE makes use of grammar-based genetic programming (GGP) [Mckay et al., 2010] to perform the search for the most suitable classification pipeline. The grammar prevents the generation of invalid/arbitrary pipelines, and can also speed up the search.
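TPOT's two-objective selection mentioned above (maximize accuracy, minimize pipeline size) rests on the notion of Pareto dominance, which can be sketched as follows; the candidate pipelines below are made-up (accuracy, number-of-operators) pairs:

```python
def dominates(a, b):
    """a = (accuracy, n_operators). a dominates b if it is no worse on both
    objectives (higher accuracy, fewer operators) and strictly better on one."""
    acc_a, size_a = a
    acc_b, size_b = b
    return (acc_a >= acc_b and size_a <= size_b) and (acc_a > acc_b or size_a < size_b)

def pareto_front(candidates):
    """Keep the candidates not dominated by any other candidate
    (the first non-dominated front, as in NSGA-II's ranking step)."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# Made-up (accuracy, n_operators) pairs for five candidate pipelines.
pipelines = [(0.90, 5), (0.88, 2), (0.90, 7), (0.85, 1), (0.80, 4)]
front = pareto_front(pipelines)   # → [(0.90, 5), (0.88, 2), (0.85, 1)]
```

NSGA-II additionally ranks the remaining candidates into successive fronts and uses crowding distance to keep the front spread out; this sketch shows only the dominance test at its core.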
The experiments are divided into two parts. The first part compares the results obtained by the EAs with the results obtained by Auto-WEKA [Thornton et al., 2013], whose search space includes all 33 classification algorithms available in WEKA. These experiments used 20 datasets. The second part of the experiments compares one of the EAs (HEAD-DT, the EA evolving decision-tree algorithms) against Auto-WEKA, on an extended set of 40 datasets. The main reason for using a smaller number of datasets in the first part of the experiments was the very long computation time associated with comparing four methods. HEAD-DT was chosen because, among the two most successful EAs overall (HEAD-DT and HHEA-BNC, as discussed later), HEAD-DT has the advantage of producing decision-tree algorithms which are more scalable to larger datasets than the Bayesian network classification algorithms produced by HHEA-BNC. The datasets used in both parts of the experiments are described next.
The first part of the experiments focuses on challenging datasets, characterised in general (with one exception) by a small number of instances and a large number of attributes. Table 1 summarises their main characteristics, including the number of instances, number of numerical and nominal attributes, percentage of missing values, class balance ratio (class bal.), and number of classes. Class bal. is the ratio of the minority class frequency over the majority class frequency – values closer to 0 (1) indicate datasets with more (less) class distribution imbalance. The first 12 datasets in this table are bioinformatics datasets, whilst the last 8 are text mining datasets. The first six datasets involve data from the biology of ageing. Datasets CE-T3, SC-T3, DM-T3, and MM-T3 are described in [Wan et al., 2015]; whilst datasets DNA-T3 and DNA-T11 are described in [Freitas et al., 2011]. Dataset PS-T3 involves post-synaptic proteins [Pappa et al., 2005]. The 5 microarray datasets are publicly-available microarray gene expression datasets, described in [de Souto et al., 2008]. Finally, the 8 text mining datasets were obtained from OpenML [Vanschoren et al., 2014].

Table 1: Summary of the 20 datasets used in both the first and the second sets of experiments.
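The class balance ratio (class bal.) used in Table 1 can be computed as in the following minimal sketch; `class_balance_ratio` is an illustrative name, not from the paper:

```python
from collections import Counter

def class_balance_ratio(labels):
    """Ratio of the minority-class frequency over the majority-class
    frequency. Values near 0 indicate strong class imbalance; values
    near 1 indicate a balanced class distribution."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# Example: 10 positive vs. 40 negative instances -> 10/40 = 0.25
print(class_balance_ratio(["pos"] * 10 + ["neg"] * 40))  # 0.25
```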
The 10-fold cross-validation technique (10-cv) [Witten et al., 2016] was used in the experiments. Since Auto-WEKA and the Evolutionary Algorithms (EAs) are non-deterministic, their results are an average over 5 executions, generating, for each method, 1,000 algorithms. All results presented in Section 4 refer to the predictive accuracy of the recommended algorithms in the test sets.
Two predictive accuracy measures are used. First, the Geometric Mean (GMean) of sensitivity (Sens) and specificity (Spec) [Japkowicz and Shah, 2011], defined as GMean = √(Sens × Spec). Sens is the proportion of positive instances that were correctly predicted as positive. Spec is the proportion of negative instances that were correctly predicted as negative. These measures were calculated considering each class in turn as the positive class, and then computing the weighted average of these measures, weighting the classes according to their relative frequency. The GMean measure was also used to evaluate some of these datasets in [Wan et al., 2015]. The second predictive accuracy measure used is the simple classification accuracy measure used by Auto-WEKA to choose the best algorithm for each dataset.
Statistical significance analysis was applied to the experimental results. In the first set of experiments (comparing four methods), we have adopted Demšar's [Demšar, 2006] recommendation to use the Friedman test with the adjusted statistic F_F [Iman and Davenport, 1980] to compare multiple algorithms over multiple datasets, followed by the Nemenyi post-hoc test for pairwise comparisons. In the final experiment, comparing only two methods, we have used the Wilcoxon test [Wilcoxon et al., 1970]. The main advantage of all these statistical tests is that they are non-parametric, so they do not assume that the data follows the normal distribution (nor any other probability distribution, for that matter). All statistical tests were used with the conventional significance level of 0.05.
In order to perform a fair comparison, all EAs were configured with the same hyper-parameter values, listed in Table 3.
Table 3: Parameter values for the evolutionary algorithms.
Parameter: Value
Number of individuals: 100
Number of generations before changing the validation set: 5
Tournament selection size: 2
Elitism rate: 5%
Crossover rate: 95%
Mutation rate: 5%
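The weighted GMean measure described in the evaluation methodology above can be sketched as follows. This is a minimal illustration under one reading of the description: the per-class sensitivities and specificities are averaged, weighted by class frequency, before taking the geometric mean; the function name is ours:

```python
import numpy as np

def weighted_gmean(y_true, y_pred, classes):
    """GMean of class-frequency-weighted sensitivity and specificity.

    Each class is treated in turn as the positive class; the per-class
    sensitivities and specificities are then averaged, weighted by the
    relative frequency of each class, before the geometric mean is taken.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    sens = spec = 0.0
    for c in classes:
        pos = y_true == c
        neg = ~pos
        weight = pos.sum() / n
        sens += weight * (y_pred[pos] == c).mean()  # true-positive rate for c
        spec += weight * (y_pred[neg] != c).mean()  # true-negative rate for c
    return float(np.sqrt(sens * spec))

# Two classes, four instances, one instance of class 1 misclassified:
print(weighted_gmean([0, 0, 1, 1], [0, 0, 1, 0], classes=[0, 1]))  # 0.75
```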
Table 4 shows the hyper-parameter settings for Auto-WEKA, based on the options provided by its Experiment Builder [Thornton et al., 2013]. Note that the 10-cv mentioned in Table 4 is another cross-validation procedure used by Auto-WEKA, this time over the training set (generated by the outermost 10-cv), to evaluate its candidate solutions regarding their predictive accuracy.
Table 4: Hyper-parameter values for all versions of Auto-WEKA.
Parameter: Value(s)
Instance generator: 10-fold cross-validation, seed = 1,..,5
Evaluation measure: error rate (classification)
Optimisation method: SMAC, with executable = smac-v2.06.01-development-619/smac, Initial Incumbent = Random, Execution Mode = SMAC, InitialN = 1
memLimit: 15 GB
timeLimit: from 1,000s to 10,000s
None of the four meta-learning methods had their hyper-parameter values optimised for individual datasets. A more robust hyper-parameter optimisation procedure would be too time-consuming, given the very large number of experiments carried out in this work.
3.4 Computational Environment and Runtime Limits
The experiments were executed on a Dual Intel 2.10GHz Xeon E5-2683 v4 Hexadeca-Core machine with 128GB RAM. Recall that, in order to perform controlled experiments comparing different meta-learning methods with the same computational budget, two types of experiments are performed, as reported in Section 4. The first type of experiment compares the results obtained by the three EAs (each evolving classification algorithms based on a single type of knowledge representation) with the results obtained by Auto-WEKA, which can recommend classification algorithms based on multiple knowledge representations. The second type of experiment compares the best EA (HEAD-DT, evolving decision-tree algorithms) against Auto-WEKA on an extended set of datasets.
In both types of experiments, to ensure a fair comparison among all meta-learning methods, each of them is allocated the same runtime limit. Experiments were performed with ten increasing values of the runtime limit for each meta-learning method, namely 1,000s (seconds), 2,000s, ..., up to 10,000s. These runtime limits refer to the time taken by a single run of each method on each dataset, on a single cross-validation fold. Due to space restrictions, the next section will report only the results for the smallest and the largest runtime limits, i.e., 1,000s and 10,000s. The results for the other runtime limits can be seen in [Basgalupp et al., 2018].
In addition to the parameters that are common to all three EAs, which were set as described in Table 3, there is a parameter that is used by GGP-RI and HHEA-BNC, but not by HEAD-DT. This parameter is a timeout for evaluating each individual (candidate algorithm) of the EA. For GGP-RI, the value of this parameter starts at 10s (seconds) when the runtime limit for the entire run of GGP-RI is 1,000s. The individual evaluation timeout then increases by 10s for each increase of 1,000s in GGP-RI's runtime limit, up to 100s when GGP-RI's runtime limit is 10,000s.
For HHEA-BNC, the value of this parameter starts at 50s when the runtime limit for the entire run of HHEA-BNC is 1,000s. The individual evaluation timeout then increases by 50s for each increase of 1,000s in HHEA-BNC's runtime limit, up to 500s when HHEA-BNC's runtime limit is 10,000s. HEAD-DT does not need this parameter because the decision-tree induction algorithms produced by this EA are relatively fast. The values of this parameter for HHEA-BNC are larger than those for GGP-RI because the Bayesian network classification algorithms generated by the former tend to be considerably slower than the rule induction algorithms generated by the latter EA.
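The linear scaling of the individual-evaluation timeout described above can be sketched as follows (the function name is illustrative):

```python
def individual_timeout(runtime_limit_s, base_timeout_s):
    """Per-individual evaluation timeout, scaled linearly with the EA's
    total runtime limit (1,000s .. 10,000s): the base timeout applies at
    1,000s and grows by one base unit per extra 1,000s of runtime."""
    return base_timeout_s * runtime_limit_s // 1000

# GGP-RI uses a 10s base timeout; HHEA-BNC uses a 50s base timeout.
print(individual_timeout(1000, 10))    # 10  (GGP-RI at 1,000s)
print(individual_timeout(10000, 10))   # 100 (GGP-RI at 10,000s)
print(individual_timeout(10000, 50))   # 500 (HHEA-BNC at 10,000s)
```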
This section presents the results of the following two types of experiments:
1. Experiments comparing four AutoML methods: the three EAs (HEAD-DT, GGP-RI, HHEA-BNC) and Auto-WEKA.
2. Experiments comparing one of the EAs (HEAD-DT, evolving decision-tree algorithms) with Auto-WEKA, on an extended set of datasets.
As mentioned earlier, due to the very large number of experiments, the first type of experiment uses the 20 datasets shown in Table 1, whilst the second type uses an extended set of 40 datasets (the 20 datasets in Table 1 plus the 20 datasets in Table 2). We report the values of accuracy and GMean (the geometric mean of sensitivity and specificity) for each dataset, and the average values of accuracy and GMean, as well as the average rank of each method based on these measures, over the corresponding datasets. The lower the rank, the better the method. A method that outperforms every other method in every dataset has an average rank of 1.0 (first position). The complete tables with per-dataset results can be found in the Supplementary Results file. Recall that, although we performed experiments with the runtime limit for meta-learning methods varying from 1,000s to 10,000s, in increments of 1,000s, in general only the results for 1,000s and 10,000s are reported in this section, due to space restrictions. The results for all 10 runtime limits can be found in the Supplementary Results file.
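The average ranks reported below can be computed as in this minimal sketch; `average_ranks` is an illustrative name, and, as in the Friedman procedure, ties receive the mean of the tied ranks:

```python
import numpy as np

def average_ranks(scores):
    """Average rank of each method over all datasets.

    `scores` is an (n_datasets, n_methods) array of accuracy or GMean
    values; higher is better, so rank 1 goes to the best method on each
    dataset, and tied methods receive the mean of the tied ranks.
    """
    scores = np.asarray(scores, dtype=float)
    n_datasets, n_methods = scores.shape
    ranks = np.zeros_like(scores)
    for i, row in enumerate(scores):
        r = np.empty(n_methods)
        r[np.argsort(-row)] = np.arange(1, n_methods + 1)  # best score -> rank 1
        for v in np.unique(row):                            # average tied ranks
            r[row == v] = r[row == v].mean()
        ranks[i] = r
    return ranks.mean(axis=0)

# Hypothetical scores for 3 datasets x 2 methods:
print(average_ranks([[0.90, 0.80], [0.70, 0.75], [0.60, 0.50]]))
```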
This section compares the four AutoML methods (the three EAs and Auto-WEKA) in controlled experiments where all four methods use the same runtime limit, as mentioned earlier.
Tables 5 and 6 show the GMean results for each method, for the runtime limits of 1,000s and 10,000s, respectively. Recall that these runtime limits refer to a single run of a meta-learning method, for each fold of the cross-validation procedure. The last row of these tables shows the average rank based on GMean over all 20 datasets. Tables 7 and 8 show the accuracy results for each method, for the runtime limits of 1,000s and 10,000s, respectively.
In Table 5, with GMean results for the smallest runtime limit of 1,000s, the best average ranks were jointly obtained by three methods, HEAD-DT, HHEA-BNC and Auto-WEKA, whilst HEAD-DT obtained a slightly better average GMean value. In Table 6, with results for the longest runtime limit of 10,000s, Auto-WEKA obtained a slightly better result (regarding both the average rank and the average GMean value) than HEAD-DT and HHEA-BNC. In both tables, GGP-RI was clearly the worst-performing method. This result seems partly due to the fact that GGP-RI had poor results in many datasets with a large number of numerical attributes. Comparing the average GMean values of each method across both tables, one can observe that the three EAs only slightly improved their GMean values from 1,000s to 10,000s, an improvement of just 0.001 for HEAD-DT and 0.003 for the other two EAs. By contrast, Auto-WEKA obtained a somewhat greater GMean improvement of 0.008 when the runtime limit increased from 1,000s to 10,000s.
Hence, Auto-WEKA has benefited from the increase in runtime limit more than the EAs. This seems due to the fact that Auto-WEKA is searching a much more diverse space of classification algorithms, in terms of knowledge representations.
Recall that each EA's search space includes algorithms from a single knowledge representation.
Table 5: GMean results for the four AutoML methods (time limit: 1,000s).
(Per-dataset GMean values are given in the Supplementary Results file.)
Table 6: GMean results for the four AutoML methods (time limit: 10,000s).
(Per-dataset GMean values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.050, HHEA-BNC 2.100, GGP-RI 3.850, Auto-WEKA 2.000
Table 7: Accuracy results for the four AutoML methods (time limit: 1,000s).
(Per-dataset accuracy values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.150, HHEA-BNC 2.025, GGP-RI 4.000, Auto-WEKA 1.825
Figure 1 shows the critical diagrams comparing the four AutoML methods in terms of their average rank based on both GMean (in the top two diagrams) and accuracy (in the bottom two diagrams). For both measures, and for both the runtime limits of 1,000s and 10,000s, we can see that there is no statistically-significant difference among all methods, with the exception of GGP-RI, which is significantly outperformed by the other three methods.
Table 8: Accuracy results for the four AutoML methods (time limit: 10,000s).
(Per-dataset accuracy values are given in the Supplementary Results file.)
Average Rank: HEAD-DT 2.150, HHEA-BNC 2.050, GGP-RI 3.950, Auto-WEKA 1.850
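The critical difference (CD) shown in the diagrams of Figure 1 follows the Nemenyi procedure recommended by Demšar [Demšar, 2006]: two methods differ significantly when their average ranks differ by more than CD = q_α √(k(k+1)/(6N)). A minimal sketch, using the standard two-tailed Nemenyi critical value q_0.05 = 2.569 for k = 4 classifiers:

```python
import math

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference for comparing k methods over n
    datasets: two methods differ significantly (at the chosen alpha)
    if their average ranks differ by more than this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# k = 4 AutoML methods, n = 20 datasets, q_0.05 = 2.569 (Demšar's table):
cd = nemenyi_cd(k=4, n=20, q_alpha=2.569)
print(round(cd, 3))  # 1.049
```

With CD ≈ 1.049, GGP-RI's average-rank gap to the other three methods exceeds the critical difference, consistent with the conclusion above.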
[Figure 1 panels: (a) GMean: 1,000s; (b) GMean: 10,000s; (c) Accuracy: 1,000s; (d) Accuracy: 10,000s]
Figure 1: Critical diagrams showing average GMean/Accuracy ranks and Nemenyi's critical difference (CD) for the four AutoML methods.
As mentioned earlier, the analysis of the results so far focused only on the runtime limits of 1,000s and 10,000s due to space restrictions, but we performed experiments with 10 different limits (from 1,000s up to 10,000s). Figure 2(a) shows the evolution of the GMean average ranks for the four meta-learning methods across the 10 runtime limits. This figure shows that HHEA-BNC tends to achieve overall the best (lowest) average rank until the runtime limit of 7,000s, whilst for longer runtime limits Auto-WEKA and HEAD-DT tend to share the best rank, with Auto-WEKA slightly better at the last runtime limit. Figure 2(b) shows the same evolution, but this time regarding average accuracy ranks. In this case, Auto-WEKA remains the best method across all runtime limits, and for nearly all runtime limits the second place is obtained by HHEA-BNC. Note that GGP-RI remained consistently the worst method across all 10 runtime limits, for both GMean and accuracy results.
Figure 2: Evolution of average ranks for all AutoML methods across the 10 runtime limits.
Figures 3(a) and 3(b) show the broad types of algorithms recommended by Auto-WEKA per dataset, for the runtime limits of 1,000s and 10,000s, respectively. Since Auto-WEKA considers a large number of algorithms, instead of referring to specific algorithms, the graphs show the frequency of recommendations for five broad types of algorithms, namely: the three types of algorithms that are considered by the three EAs (decision trees, if-then classification rules, and Bayesian network classifiers), ensemble methods, and all the others. Note that the variability of the selected types of algorithms is high, highlighting the difficulty of selecting the best algorithm for each dataset.
For the runtime limit of 1,000s (Figure 3(a)), ensembles had the highest prevalence across the datasets; they were selected by Auto-WEKA in 33.9% of the cases, closely followed by decision-tree algorithms, selected in 31.4% of the cases. For the runtime limit of 10,000s (Figure 3(b)), these two types of classification algorithms swapped places in the ranking by prevalence, i.e., decision-tree algorithms were selected by Auto-WEKA in 34.7% of the cases, whilst ensembles were selected in 27.3% of the cases. Bayesian classification algorithms also did relatively well, partly because they had a high prevalence among the text mining datasets. For both runtime limits, Bayesian classification algorithms were the third most selected type of classification algorithm: they were selected in 16.7% of the cases in Figure 3(a) and in 24.2% of the cases in Figure 3(b). For both runtime limits, rule induction algorithms had small frequencies of selection, only 7.9% in Figure 3(a) and 6.7% in Figure 3(b). This is consistent with the fact that, out of the three EAs for AutoML evaluated in this work, GGP-RI (which evolved rule induction algorithms) obtained clearly the worst result.
Figure 3: Number of times each type of classification algorithm is selected by Auto-WEKA.
4.2 More extensive experiments comparing HEAD-DT and Auto-WEKA
In this section we compare HEAD-DT and Auto-WEKA on an extended set of 40 datasets, comprising the 20 datasets used in the previous section plus 20 other datasets, as discussed in Section 3.1. As mentioned earlier, the motivation for using this larger set of datasets only to compare the two methods in this section, rather than to compare more methods in the previous section, is the much larger amount of time associated with the experiments using all 40 datasets. This section uses the same experimental methodology as the previous section, using 10-fold cross-validation and comparing the two methods with the same runtime limit, varying this limit from 1,000s to 10,000s in increments of 1,000s. Again, due to space restrictions, we report results only for the smallest and longest runtime limits, namely 1,000s and 10,000s; the results for all 10 runtime limits can be found in the Supplementary Results file.
Table 9 and Table 10 show the accuracy and GMean values, respectively, obtained by HEAD-DT and Auto-WEKA with the runtime limits of 1,000s and 10,000s. In terms of accuracy, Auto-WEKA somewhat outperformed HEAD-DT overall, whilst the opposite was observed for the GMean measure. This result is consistent with the fact that Auto-WEKA's search tries to optimise the accuracy measure (unlike HEAD-DT), as discussed earlier. However, the result of a Wilcoxon significance test, at the conventional significance level of 0.05, indicates that there is no statistically significant difference in predictive performance between HEAD-DT and Auto-WEKA (for both accuracy and GMean measures), for each of the 10 runtime limits.
Figure 4(a) shows the evolution of the average GMean values (across all datasets) for Auto-WEKA and HEAD-DT across the 10 runtime limits. This figure shows that HEAD-DT obtains a better (higher) GMean value for all runtime limits. Figure 4(b) shows the same type of evolution for the accuracy measure.
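The Wilcoxon signed-rank comparison of two methods over paired per-dataset results can be sketched with SciPy as below. The accuracy vectors here are hypothetical stand-ins for illustration, not values from our experiments:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset accuracies for the two methods, paired by
# dataset (illustrative numbers only):
head_dt   = [0.81, 0.76, 0.90, 0.66, 0.72, 0.85, 0.78, 0.93, 0.61, 0.70]
auto_weka = [0.83, 0.74, 0.91, 0.68, 0.71, 0.86, 0.80, 0.92, 0.63, 0.72]

stat, p = wilcoxon(head_dt, auto_weka)
# The null hypothesis of equal performance is not rejected when p >= 0.05.
print(round(p, 3))
```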
In this case, HEAD-DT obtains the best average accuracy for the smallest runtime limit, but Auto-WEKA obtains higher accuracy for all other runtime limits. It should be noted, however, that in both graphs the differences in predictive performance between HEAD-DT and Auto-WEKA are small, less than 1% in general, across the different runtime limits.
Figure 4: Evolution of average predictive values for HEAD-DT and Auto-WEKA across the 10 runtime limits.
Table 9: Accuracy results for HEAD-DT and Auto-WEKA (time limits: 1,000s and 10,000s).
(Per-dataset accuracy values for HEAD-DT and Auto-WEKA are given in the Supplementary Results file.)
Table 10: GMean results for HEAD-DT and Auto-WEKA (time limits: 1,000s and 10,000s).
(Per-dataset GMean values for HEAD-DT and Auto-WEKA are given in the Supplementary Results file.)
Conclusions
AutoML is currently a very popular research topic, having attracted a great deal of attention through the proposal of new tools, mainly based on optimisation [Hutter et al., 2019, Elsken et al., 2019, Mohr et al., 2018, das Dôres et al., 2018, Li et al., 2018, Larcher and Barbosa, 2019, Koch et al., 2018, Jin et al., 2019, Fusi et al., 2018]. Given the relevance of AutoML, this work has evaluated four methods for recommending a classification algorithm for a target dataset: three Evolutionary Algorithms (EAs) and Auto-WEKA [Thornton et al., 2013], in two sets of experiments. In the first set of experiments, we compared the four AutoML methods with the same runtime limit on 20 datasets. Auto-WEKA can recommend classification algorithms of various types (paradigms), whilst each of the three EAs is restricted to recommending a different type of classification algorithm: decision-tree induction, rule induction or Bayesian network classification algorithms, in the case of HEAD-DT, GGP-RI and HHEA-BNC, respectively. In these experiments, there was no statistically significant difference in predictive accuracy between the three best methods, namely two EAs (HEAD-DT and HHEA-BNC) and Auto-WEKA. However, these three methods obtained significantly better predictive accuracy than the other EA (GGP-RI). These results were broadly consistent across the 10 different runtime limits used in the experiments. In the second set of experiments, where a larger set of 40 datasets was used to compare the predictive accuracy of HEAD-DT and Auto-WEKA only, again there was no statistically significant difference between the predictive performance of these two methods.
However, the focus of HEAD-DT on decision-tree algorithms only has two advantages from the perspective of other algorithm-evaluation criteria.
First, in applications where it is important that the classification model be interpretable by users (e.g. in medical applications), decision-tree algorithms have the advantage of generating interpretable classification models. By contrast, since Auto-WEKA can select any algorithm out of many types of classification algorithms, it can recommend classification algorithms producing black-box (non-interpretable) models. Indeed, in our experiments, Auto-WEKA often recommended ensembles, which are not easily interpretable. Second, decision-tree algorithms also have the advantage of being in general more scalable to large datasets than several other types of classification algorithms in Auto-WEKA's search space, such as neural networks, support vector machines and some ensemble methods.
Overall, when the runtime limit is increased from 1,000s to 10,000s, Auto-WEKA benefits more from the extra search time than HEAD-DT. This seems due to the fact that Auto-WEKA has to explore a much more diverse space of classification algorithms, so it probably requires more time to find the best type of classification algorithm to recommend for a given input dataset.
In addition, we observed that Auto-WEKA exhibited meta-overfitting: for the best algorithm found by Auto-WEKA, the GMean values on the test set were substantially lower than the GMean values estimated on the training set during the search. As noted earlier, this meta-overfitting is a form of overfitting at the meta-learning level, due to evaluating many different (base-level) classification algorithms during Auto-WEKA's search for the best algorithm. This is in contrast to standard overfitting at the base level, which is due to evaluating many different models built by the same classification algorithm.
It would be interesting to enhance the search process of the EAs by first performing a global search to optimise the candidate algorithms' (procedural) components, followed by a second (global or local) search to optimise the continuous parameters of the best algorithm generated by the first search. Another future research direction is to extend the EAs to produce an ensemble of evolved classification algorithms in a post-processing phase, after the EAs have completed their search.
In addition, since Auto-WEKA showed a clear sign of meta-overfitting, another research direction consists of developing new meta-overfitting-avoidance methods that could potentially improve the predictive performance of Auto-WEKA. Finally, it would be interesting to compare the three EAs and Auto-WEKA with other AutoML methods, such as Auto-sklearn and those described in Section 2.2.4. This would give a more detailed assessment of which AutoML method recommends the best classification algorithm across different datasets.
A Supplementary Material for: An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms

Table A1: Results of predictive Specificity values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.499 0.585 0.576 0.496 0.541
DM 0.348 0.545 0.423 0.449 0.315
MM 0.312 0.496 0.403 0.484 0.292
SC 0.280 0.542 0.362 0.412 0.297
DNA3 0.562 0.720 0.604 0.482 0.562
DNA11 0.260 0.478 0.420 0.491 0.288
PS 0.700 0.801 0.794 0.913 0.823
chen-2002 0.857 0.798 0.944 0.826 0.756
chowdary-2006 0.965 0.951 0.969 0.918 0.930
nutt-2003-v2 0.933 0.933 0.850 0.733 0.600
singh-2002 0.746 0.781 0.901 0.754 0.696
west-2001 0.862 0.815 0.878 0.578 0.782
dbworld-bodies 0.770 0.743 0.819 0.686 0.758
dbworld-bodies-stemmed 0.900 0.792 0.758 0.596 0.811
oh0.wc 0.857 0.972 0.979 0.925 0.962
oh5.wc 0.857 0.976 0.985 0.921 0.975
oh10.wc 0.863 0.961 0.967 0.921 0.962
oh15.wc 0.851 0.966 0.973 0.921 0.952
re0.wc 0.736 0.915 0.940 0.881 0.909
re1.wc 0.829 0.970 0.979 0.941 0.970
Average 0.699 0.787 0.776 0.716 0.709
Table A2: Results of predictive Sensitivity values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.623 0.648 0.659 0.567 0.642
DM 0.614 0.664 0.623 0.555 0.623
MM 0.631 0.661 0.640 0.631 0.708
SC 0.827 0.822 0.835 0.774 0.807
DNA3 0.863 0.841 0.848 0.733 0.863
DNA11 0.720 0.777 0.740 0.706 0.726
PS 0.981 0.985 0.985 0.985 0.983
chen-2002 0.821 0.810 0.949 0.827 0.787
chowdary-2006 0.971 0.951 0.972 0.923 0.944
nutt-2003-v2 0.867 0.867 0.850 0.717 0.450
singh-2002 0.744 0.783 0.903 0.756 0.694
west-2001 0.855 0.835 0.855 0.555 0.735
dbworld-bodies 0.738 0.740 0.814 0.698 0.733
dbworld-bodies-stemmed 0.867 0.783 0.767 0.612 0.798
oh0.wc 0.351 0.820 0.874 0.463 0.779
oh5.wc 0.261 0.807 0.883 0.356 0.805
oh10.wc 0.262 0.727 0.754 0.368 0.742
oh15.wc 0.272 0.736 0.810 0.387 0.653
re0.wc 0.537 0.745 0.832 0.559 0.739
re1.wc 0.396 0.798 0.859 0.472 0.786
Average 0.660 0.790 0.823 0.632 0.750

Table A3: Results of predictive GMean values of the baseline methods for building decision tree models.
data set Decision Stump J48 LMT RandomTrees REPTree
CE 0.557 0.616 0.616 0.530 0.589
DM 0.460 0.600 0.511 0.496 0.442
MM 0.435 0.569 0.499 0.537 0.451
SC 0.464 0.663 0.535 0.556 0.466
DNA3 0.694 0.773 0.707 0.584 0.694
DNA11 0.426 0.601 0.553 0.586 0.450
PS 0.828 0.888 0.884 0.948 0.899
chen-2002 0.838 0.804 0.947 0.826 0.771
chowdary-2006 0.968 0.951 0.970 0.920 0.937
nutt-2003-v2 0.898 0.898 0.846 0.721 0.507
singh-2002 0.745 0.782 0.902 0.755 0.695
west-2001 0.858 0.823 0.866 0.563 0.756
dbworld-bodies 0.753 0.741 0.816 0.691 0.745
dbworld-bodies-stemmed 0.883 0.787 0.762 0.603 0.804
oh0.wc 0.548 0.892 0.925 0.654 0.865
oh5.wc 0.473 0.887 0.933 0.568 0.886
oh10.wc 0.475 0.835 0.854 0.581 0.844
oh15.wc 0.480 0.842 0.888 0.595 0.788
re0.wc 0.629 0.826 0.884 0.702 0.820
re1.wc 0.573 0.880 0.917 0.666 0.873
Average 0.649 0.783 0.791 0.654 0.714
Table A4: Results of predictive Specificity values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.693 0.664 0.656 0.691 0.676
DM 0.808 0.758 0.758 0.758 0.825
MM 0.733 0.739 0.739 0.739 0.675
SC 0.767 0.724 0.701 0.701 0.701
DNA3 0.846 0.846 0.846 0.846 0.846
DNA11 0.854 0.833 0.833 0.833 0.833
PS 0.978 0.989 0.989 0.975 0.978
chen-2002 0.964 0.937 0.937 0.937 0.937
chowdary-2006 0.965 0.965 0.965 0.965 0.965
nutt-2003-v2 0.983 0.983 0.983 0.983 0.983
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.973 0.973 0.973 0.973 0.973
dbworld-bodies 0.844 0.844 0.844 0.844 0.844
dbworld-bodies-stemmed 0.846 0.846 0.846 0.846 0.846
oh0.wc 0.984 0.984 0.981 0.980 0.980
oh5.wc 0.991 0.991 0.990 0.991 0.990
oh10.wc 0.975 0.975 0.975 0.975 0.979
oh15.wc 0.977 0.978 0.979 0.980 0.980
re0.wc 0.961 0.952 0.953 0.954 0.955
re1.wc 0.986 0.985 0.985 0.983 0.983
Average 0.899 0.891 0.889 0.890 0.890

Table A5: Results of predictive Specificity values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.676 0.676 0.682 0.708 0.708
DM 0.825 0.796 0.796 0.796 0.796
MM 0.675 0.675 0.675 0.675 0.675
SC 0.701 0.701 0.722 0.722 0.722
DNA3 0.846 0.846 0.846 0.846 0.846
DNA11 0.833 0.807 0.807 0.807 0.807
PS 0.978 0.978 0.978 0.978 0.978
chen-2002 0.937 0.937 0.937 0.937 0.937
chowdary-2006 0.965 0.965 0.965 0.965 0.965
nutt-2003-v2 0.983 0.983 0.983 0.983 0.983
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.973 0.973 0.973 0.973 0.973
dbworld-bodies 0.844 0.844 0.844 0.844 0.844
dbworld-bodies-stemmed 0.846 0.846 0.846 0.846 0.846
oh0.wc 0.981 0.982 0.982 0.982 0.982
oh5.wc 0.990 0.990 0.990 0.990 0.990
oh10.wc 0.979 0.979 0.980 0.980 0.980
oh15.wc 0.982 0.982 0.982 0.982 0.982
re0.wc 0.954 0.957 0.957 0.957 0.957
re1.wc 0.984 0.984 0.985 0.985 0.985
Average 0.890 0.888 0.889 0.890 0.890

Table A6: Results of predictive Sensitivity values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.759 0.749 0.736 0.759 0.751
DM 0.867 0.842 0.842 0.842 0.875
MM 0.867 0.878 0.878 0.878 0.875
SC 0.943 0.927 0.915 0.915 0.915
DNA3 0.935 0.935 0.935 0.935 0.935
DNA11 0.933 0.919 0.919 0.919 0.919
PS 0.998 0.999 0.999 0.998 0.998
chen-2002 0.966 0.950 0.950 0.950 0.950
chowdary-2006 0.971 0.971 0.971 0.971 0.971
nutt-2003-v2 0.967 0.967 0.967 0.967 0.967
singh-2002 0.851 0.851 0.851 0.851 0.851
west-2001 0.960 0.960 0.960 0.960 0.960
dbworld-bodies 0.831 0.831 0.831 0.831 0.831
dbworld-bodies-stemmed 0.829 0.829 0.829 0.829 0.829
oh0.wc 0.896 0.901 0.887 0.884 0.884
oh5.wc 0.930 0.925 0.922 0.922 0.917
oh10.wc 0.834 0.839 0.840 0.837 0.855
oh15.wc 0.845 0.851 0.855 0.858 0.859
re0.wc 0.888 0.870 0.874 0.878 0.878
re1.wc 0.893 0.893 0.887 0.874 0.879
Average 0.898 0.894 0.892 0.893 0.895
Table A7: Results of predictive Sensitivity values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.751 0.751 0.764 0.779 0.779
DM 0.875 0.867 0.867 0.867 0.867
MM 0.875 0.875 0.875 0.875 0.875
SC 0.915 0.911 0.919 0.919 0.915
DNA3 0.935 0.935 0.935 0.935 0.935
DNA11 0.919 0.911 0.911 0.911 0.911
PS 0.998 0.998 0.998 0.998 0.998
chen-2002 0.950 0.950 0.950 0.950 0.950
chowdary-2006 0.971 0.971 0.971 0.971 0.971
nutt-2003-v2 0.967 0.967 0.967 0.967 0.967
singh-2002 0.851 0.851 0.851 0.851 0.851
west-2001 0.960 0.960 0.960 0.960 0.960
dbworld-bodies 0.831 0.831 0.831 0.831 0.831
dbworld-bodies-stemmed 0.829 0.829 0.829 0.829 0.829
oh0.wc 0.886 0.892 0.892 0.892 0.892
oh5.wc 0.912 0.912 0.912 0.912 0.916
oh10.wc 0.855 0.855 0.858 0.858 0.858
oh15.wc 0.865 0.865 0.874 0.872 0.872
re0.wc 0.875 0.880 0.880 0.880 0.880
re1.wc 0.883 0.891 0.893 0.893 0.893
Average 0.895 0.895 0.897 0.897 0.897

Table A8: Results of predictive GMean values of HEAD-DT for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.725 0.705 0.694 0.724 0.712
DM 0.834 0.794 0.794 0.794 0.846
MM 0.789 0.797 0.797 0.797 0.756
SC 0.846 0.806 0.788 0.788 0.788
DNA3 0.887 0.887 0.887 0.887 0.887
DNA11 0.890 0.873 0.873 0.873 0.873
PS 0.988 0.994 0.994 0.986 0.988
chen-2002 0.965 0.943 0.943 0.943 0.943
chowdary-2006 0.968 0.968 0.968 0.968 0.968
nutt-2003-v2 0.975 0.975 0.975 0.975 0.975
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.967 0.967 0.967 0.967 0.967
dbworld-bodies 0.837 0.837 0.837 0.837 0.837
dbworld-bodies-stemmed 0.837 0.837 0.837 0.837 0.837
oh0.wc 0.939 0.941 0.933 0.931 0.931
oh5.wc 0.960 0.957 0.955 0.955 0.953
oh10.wc 0.901 0.904 0.904 0.902 0.915
oh15.wc 0.909 0.912 0.915 0.916 0.917
re0.wc 0.924 0.910 0.913 0.915 0.916
re1.wc 0.938 0.938 0.934 0.927 0.929
Average 0.896 0.890 0.888 0.889 0.889
Table A9: Results of predictive GMean values of HEAD-DT for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.712 0.712 0.721 0.742 0.742
DM 0.846 0.827 0.827 0.827 0.827
MM 0.756 0.756 0.756 0.756 0.756
SC 0.788 0.786 0.803 0.803 0.801
DNA3 0.887 0.887 0.887 0.887 0.887
DNA11 0.873 0.855 0.855 0.855 0.855
PS 0.988 0.988 0.988 0.988 0.988
chen-2002 0.943 0.943 0.943 0.943 0.943
chowdary-2006 0.968 0.968 0.968 0.968 0.968
nutt-2003-v2 0.975 0.975 0.975 0.975 0.975
singh-2002 0.852 0.852 0.852 0.852 0.852
west-2001 0.967 0.967 0.967 0.967 0.967
dbworld-bodies 0.837 0.837 0.837 0.837 0.837
dbworld-bodies-stemmed 0.837 0.837 0.837 0.837 0.837
oh0.wc 0.932 0.936 0.936 0.936 0.936
oh5.wc 0.950 0.950 0.950 0.950 0.952
oh10.wc 0.915 0.915 0.917 0.917 0.917
oh15.wc 0.921 0.921 0.926 0.925 0.925
re0.wc 0.913 0.917 0.917 0.917 0.917
re1.wc 0.932 0.936 0.938 0.938 0.938
Average 0.890 0.888 0.890 0.891 0.891

Table A10: Results of predictive Specificity values of Auto-WEKA-Trees for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.539 0.547 0.549 0.552 0.552
DM 0.378 0.363 0.361 0.366 0.388
MM 0.399 0.418 0.417 0.418 0.418
SC 0.254 0.253 0.253 0.259 0.254
DNA3 0.554 0.593 0.578 0.589 0.612
DNA11 0.403 0.376 0.401 0.377 0.362
PS 0.699 0.702 0.711 0.721 0.722
chen-2002 0.924 0.926 0.914 0.917 0.918
chowdary-2006 0.993 0.994 0.994 0.994 0.993
nutt-2003-v2 0.879 0.854 0.854 0.854 0.854
singh-2002 0.892 0.886 0.886 0.881 0.875
west-2001 0.914 0.914 0.914 0.914 0.914
dbworld-bodies 0.738 0.743 0.734 0.738 0.763
dbworld-bodies-stemmed 0.863 0.843 0.793 0.769 0.771
oh0.wc 0.965 0.965 0.966 0.966 0.966
oh5.wc 0.981 0.981 0.980 0.980 0.980
oh10.wc 0.959 0.959 0.959 0.958 0.959
oh15.wc 0.967 0.967 0.968 0.968 0.969
re0.wc 0.921 0.919 0.920 0.920 0.920
re1.wc 0.968 0.968 0.969 0.969 0.969
Average 0.759 0.759 0.756 0.756 0.758
Table A11: Results of predictive Specificity values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.
data set 6,000s 7,000s 8,000s 9,000s 10,000s
CE 0.553 0.558 0.553 0.555 0.558
DM 0.387 0.389 0.388 0.375 0.387
MM 0.418 0.414 0.413 0.416 0.402
SC 0.259 0.254 0.290 0.258 0.269
DNA3 0.600 0.610 0.598 0.608 0.601
DNA11 0.388 0.379 0.378 0.422 0.387
PS 0.723 0.744 0.736 0.748 0.747
chen-2002 0.920 0.923 0.915 0.914 0.912
chowdary-2006 0.993 0.993 0.993 0.993 0.988
nutt-2003-v2 0.858 0.858 0.850 0.854 0.875
singh-2002 0.872 0.875 0.872 0.872 0.877
west-2001 0.914 0.908 0.908 0.908 0.908
dbworld-bodies 0.737 0.739 0.739 0.739 0.739
dbworld-bodies-stemmed 0.769 0.769 0.772 0.772 0.772
oh0.wc 0.966 0.966 0.966 0.966 0.966
oh5.wc 0.981 0.980 0.980 0.980 0.980
oh10.wc 0.958 0.959 0.958 0.959 0.958
oh15.wc 0.969 0.969 0.969 0.969 0.969
re0.wc 0.920 0.919 0.917 0.917 0.917
re1.wc 0.969 0.969 0.969 0.969 0.969
Average 0.758 0.759 0.758 0.760 0.759

Table A12: Results of predictive Sensitivity values of Auto-WEKA-Trees for timeouts from 1,000s to 5,000s.
data set 1,000s 2,000s 3,000s 4,000s 5,000s
CE 0.635 0.643 0.644 0.645 0.645
DM 0.634 0.628 0.630 0.628 0.641
MM 0.683 0.692 0.686 0.692 0.692
SC 0.820 0.817 0.817 0.818 0.820
DNA3 0.833 0.847 0.831 0.834 0.843
DNA11 0.704 0.715 0.720 0.706 0.708
PS 0.976 0.976 0.976 0.977 0.977
chen-2002 0.931 0.934 0.926 0.927 0.927
chowdary-2006 0.991 0.993 0.993 0.993 0.991
nutt-2003-v2 0.821 0.796 0.796 0.796 0.796
singh-2002 0.893 0.889 0.889 0.884 0.877
west-2001 0.899 0.899 0.899 0.899 0.899
dbworld-bodies 0.748 0.747 0.739 0.745 0.770
dbworld-bodies-stemmed 0.848 0.830 0.786 0.770 0.770
oh0.wc 0.802 0.804 0.808 0.810 0.809
oh5.wc 0.807 0.807 0.805 0.807 0.807
oh10.wc 0.725 0.721 0.726 0.721 0.724
oh15.wc 0.775 0.778 0.783 0.783 0.788
re0.wc 0.745 0.768 0.770 0.769 0.768
re1.wc 0.766 0.766 0.767 0.767 0.768
Average 0.802 0.803 0.800 0.799 0.801
Table A13: Results of predictive Sensitivity values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.645   0.648   0.645   0.646   0.648
DM                       0.645   0.642   0.640   0.641   0.660
MM                       0.692   0.692   0.689   0.686   0.683
SC                       0.818   0.820   0.825   0.817   0.818
DNA3                     0.842   0.843   0.843   0.838   0.834
DNA11                    0.712   0.711   0.717   0.721   0.715
PS                       0.977   0.978   0.978   0.979   0.978
chen-2002                0.930   0.931   0.926   0.926   0.926
chowdary-2006            0.991   0.991   0.991   0.991   0.983
nutt-2003-v2             0.804   0.804   0.788   0.796   0.813
singh-2002               0.874   0.877   0.874   0.874   0.879
west-2001                0.899   0.893   0.893   0.893   0.893
dbworld-bodies           0.754   0.755   0.755   0.755   0.755
dbworld-bodies-stemmed   0.766   0.766   0.770   0.770   0.770
oh0.wc                   0.810   0.810   0.810   0.805   0.806
oh5.wc                   0.810   0.808   0.808   0.808   0.810
oh10.wc                  0.721   0.723   0.727   0.728   0.722
oh15.wc                  0.790   0.791   0.792   0.791   0.792
re0.wc                   0.769   0.768   0.768   0.767   0.770
re1.wc                   0.768   0.768   0.768   0.767   0.769
Average                  0.801   0.801   0.800   0.800   0.801

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.539   0.547   0.549   0.552   0.552
DM                       0.378   0.363   0.361   0.366   0.388
MM                       0.399   0.418   0.417   0.418   0.418
SC                       0.254   0.253   0.253   0.259   0.254
DNA3                     0.554   0.593   0.578   0.589   0.612
DNA11                    0.403   0.376   0.401   0.377   0.362
PS                       0.699   0.702   0.711   0.721   0.722
chen-2002                0.924   0.926   0.914   0.917   0.918
chowdary-2006            0.993   0.994   0.994   0.994   0.993
nutt-2003-v2             0.879   0.854   0.854   0.854   0.854
singh-2002               0.892   0.886   0.886   0.881   0.875
west-2001                0.914   0.914   0.914   0.914   0.914
dbworld-bodies           0.738   0.743   0.734   0.738   0.763
dbworld-bodies-stemmed   0.863   0.843   0.793   0.769   0.771
oh0.wc                   0.965   0.965   0.966   0.966   0.966
oh5.wc                   0.981   0.981   0.980   0.980   0.980
oh10.wc                  0.959   0.959   0.959   0.958   0.959
oh15.wc                  0.967   0.967   0.968   0.968   0.969
re0.wc                   0.921   0.919   0.920   0.920   0.920
re1.wc                   0.968   0.968   0.969   0.969   0.969
Average                  0.759   0.759   0.756   0.756   0.758
Table A15: Results of predictive GMean values of Auto-WEKA-Trees for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.553   0.558   0.553   0.555   0.558
DM                       0.387   0.389   0.388   0.375   0.387
MM                       0.418   0.414   0.413   0.416   0.402
SC                       0.259   0.254   0.290   0.258   0.269
DNA3                     0.600   0.610   0.598   0.608   0.601
DNA11                    0.388   0.379   0.378   0.422   0.387
PS                       0.723   0.744   0.736   0.748   0.747
chen-2002                0.920   0.923   0.915   0.914   0.912
chowdary-2006            0.993   0.993   0.993   0.993   0.988
nutt-2003-v2             0.858   0.858   0.850   0.854   0.875
singh-2002               0.872   0.875   0.872   0.872   0.877
west-2001                0.914   0.908   0.908   0.908   0.908
dbworld-bodies           0.737   0.739   0.739   0.739   0.739
dbworld-bodies-stemmed   0.769   0.769   0.772   0.772   0.772
oh0.wc                   0.966   0.966   0.966   0.966   0.966
oh5.wc                   0.981   0.980   0.980   0.980   0.980
oh10.wc                  0.958   0.959   0.958   0.959   0.958
oh15.wc                  0.969   0.969   0.969   0.969   0.969
re0.wc                   0.920   0.919   0.917   0.917   0.917
re1.wc                   0.969   0.969   0.969   0.969   0.969
Average                  0.758   0.759   0.758   0.760   0.759

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.537           0.565  0.571  0.550  0.398
DM                       0.416           0.456  0.382  0.511  0.327
MM                       0.480           0.430  0.494  0.534  0.292
SC                       0.345           0.378  0.221  0.458  0.161
DNA3                     0.506           0.606  0.513  0.641  0.237
DNA11                    0.284           0.373  0.370  0.565  0.244
PS                       0.761           0.852  0.700  0.884  0.060
chen-2002                0.854           0.790  0.778  0.858  0.419
chowdary-2006            0.935           0.954  0.965  0.936  0.404
nutt-2003-v2             0.767           0.833  0.900  0.933  0.633
singh-2002               0.766           0.819  0.725  0.724  0.491
west-2001                0.848           0.935  0.918  0.815  0.490
dbworld-bodies           0.786           0.743  0.865  0.743  0.455
dbworld-bodies-stemmed   0.679           0.787  0.737  0.792  0.455
oh0.wc                   0.960           0.962  0.857  0.972  0.807
oh5.wc                   0.976           0.965  0.861  0.978  0.838
oh10.wc                  0.949           0.957  0.863  0.961  0.848
oh15.wc                  0.950           0.960  0.866  0.964  0.828
re0.wc                   0.852           0.923  0.712  0.922  0.596
re1.wc                   0.953           0.972  0.829  0.972  0.776
Average                  0.730           0.763  0.706  0.786  0.488
Table A17: Results of predictive Sensitivity values of the baseline methods for inducing rule-based models.

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.651           0.642  0.657  0.592  0.602
DM                       0.647           0.673  0.655  0.639  0.673
MM                       0.708           0.675  0.775  0.661  0.708
SC                       0.851           0.811  0.835  0.806  0.839
DNA3                     0.813           0.849  0.835  0.784  0.763
DNA11                    0.732           0.718  0.732  0.755  0.756
PS                       0.984           0.986  0.981  0.988  0.940
chen-2002                0.871           0.793  0.765  0.860  0.581
chowdary-2006            0.951           0.962  0.971  0.933  0.596
nutt-2003-v2             0.633           0.767  0.800  0.867  0.367
singh-2002               0.764           0.815  0.725  0.723  0.509
west-2001                0.835           0.915  0.915  0.835  0.510
dbworld-bodies           0.798           0.740  0.860  0.740  0.545
dbworld-bodies-stemmed   0.688           0.771  0.721  0.783  0.545
oh0.wc                   0.773           0.799  0.351  0.815  0.193
oh5.wc                   0.788           0.771  0.260  0.810  0.162
oh10.wc                  0.687           0.724  0.266  0.712  0.152
oh15.wc                  0.700           0.762  0.315  0.748  0.172
re0.wc                   0.668           0.781  0.543  0.725  0.404
re1.wc                   0.767           0.801  0.396  0.797  0.224
Average                  0.765           0.788  0.668  0.779  0.512

data set                 Decision Table  JRip   OneR   PART   ZeroR
CE                       0.591           0.602  0.612  0.571  0.489
DM                       0.514           0.551  0.497  0.567  0.469
MM                       0.567           0.531  0.610  0.589  0.451
SC                       0.520           0.537  0.422  0.602  0.368
DNA3                     0.629           0.715  0.653  0.707  0.424
DNA11                    0.452           0.504  0.513  0.647  0.428
PS                       0.865           0.916  0.828  0.935  0.238
chen-2002                0.862           0.791  0.771  0.859  0.493
chowdary-2006            0.943           0.958  0.968  0.934  0.490
nutt-2003-v2             0.692           0.794  0.845  0.898  0.477
singh-2002               0.765           0.817  0.725  0.723  0.500
west-2001                0.841           0.925  0.916  0.823  0.491
dbworld-bodies           0.791           0.741  0.862  0.741  0.495
dbworld-bodies-stemmed   0.683           0.779  0.729  0.787  0.495
oh0.wc                   0.861           0.876  0.548  0.890  0.395
oh5.wc                   0.877           0.862  0.473  0.890  0.369
oh10.wc                  0.807           0.832  0.479  0.827  0.359
oh15.wc                  0.815           0.855  0.522  0.849  0.377
re0.wc                   0.754           0.849  0.621  0.817  0.491
re1.wc                   0.855           0.882  0.573  0.880  0.417
Average                  0.734           0.766  0.658  0.777  0.436
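The Sensitivity and Specificity values tabulated throughout this appendix are the true-positive and true-negative rates. For a binary confusion matrix they can be sketched as follows (an illustrative helper, not the code used in the experiments; the multi-class case averages these rates over classes):

```python
def sensitivity(tp: int, fn: int) -> float:
    # True-positive rate: fraction of actual positives correctly identified.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # True-negative rate: fraction of actual negatives correctly identified.
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts: 8 true positives, 2 false negatives,
# 9 true negatives, 1 false positive.
print(sensitivity(8, 2))  # 0.8
print(specificity(9, 1))  # 0.9
```

Reporting both rates separately, rather than a single accuracy figure, shows whether a method trades errors on one class for errors on the other; ZeroR's low Sensitivity here is an example of that trade-off made extreme.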
Table A19: Results of predictive Specificity values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.413   0.415   0.415   0.415   0.415
DM                       0.405   0.407   0.423   0.431   0.431
MM                       0.399   0.411   0.423   0.423   0.426
SC                       0.194   0.190   0.190   0.190   0.190
DNA3                     0.429   0.428   0.427   0.427   0.427
DNA11                    0.352   0.343   0.353   0.353   0.353
PS                       0.280   0.297   0.284   0.268   0.281
chen-2002                0.632   0.628   0.617   0.624   0.624
chowdary-2006            0.809   0.814   0.814   0.811   0.808
nutt-2003-v2             0.670   0.670   0.660   0.660   0.660
singh-2002               0.618   0.619   0.621   0.626   0.634
west-2001                0.649   0.635   0.638   0.634   0.645
dbworld-bodies           0.555   0.555   0.555   0.555   0.551
dbworld-bodies-stemmed   0.640   0.640   0.644   0.644   0.644
oh0.wc                   0.809   0.809   0.809   0.809   0.809
oh5.wc                   0.846   0.845   0.845   0.845   0.845
oh10.wc                  0.843   0.847   0.847   0.847   0.847
oh15.wc                  0.833   0.833   0.833   0.833   0.833
re0.wc                   0.614   0.614   0.614   0.614   0.614
re1.wc                   0.786   0.786   0.786   0.787   0.787
Average                  0.589   0.589   0.590   0.590   0.591

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.415   0.415   0.415   0.416   0.415
DM                       0.431   0.430   0.430   0.429   0.429
MM                       0.426   0.426   0.426   0.434   0.436
SC                       0.187   0.187   0.187   0.187   0.190
DNA3                     0.427   0.427   0.432   0.432   0.432
DNA11                    0.353   0.353   0.353   0.363   0.363
PS                       0.281   0.281   0.281   0.281   0.281
chen-2002                0.626   0.629   0.630   0.636   0.633
chowdary-2006            0.809   0.813   0.813   0.813   0.813
nutt-2003-v2             0.647   0.647   0.643   0.650   0.650
singh-2002               0.642   0.642   0.642   0.642   0.642
west-2001                0.649   0.649   0.649   0.649   0.647
dbworld-bodies           0.551   0.551   0.551   0.551   0.558
dbworld-bodies-stemmed   0.644   0.637   0.637   0.637   0.637
oh0.wc                   0.809   0.809   0.809   0.809   0.809
oh5.wc                   0.845   0.845   0.845   0.845   0.845
oh10.wc                  0.848   0.848   0.848   0.848   0.848
oh15.wc                  0.833   0.833   0.833   0.833   0.833
re0.wc                   0.614   0.613   0.613   0.614   0.614
re1.wc                   0.787   0.787   0.787   0.787   0.787
Average                  0.591   0.591   0.591   0.593   0.593
Table A21: Results of predictive Sensitivity values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.607   0.608   0.608   0.609   0.608
DM                       0.684   0.684   0.693   0.698   0.698
MM                       0.717   0.722   0.726   0.726   0.726
SC                       0.820   0.818   0.818   0.821   0.821
DNA3                     0.813   0.812   0.807   0.807   0.807
DNA11                    0.739   0.738   0.741   0.741   0.741
PS                       0.952   0.953   0.952   0.951   0.952
chen-2002                0.686   0.684   0.676   0.683   0.683
chowdary-2006            0.854   0.858   0.858   0.855   0.854
nutt-2003-v2             0.620   0.620   0.600   0.600   0.600
singh-2002               0.609   0.613   0.615   0.619   0.626
west-2001                0.594   0.585   0.589   0.593   0.602
dbworld-bodies           0.615   0.615   0.615   0.615   0.612
dbworld-bodies-stemmed   0.669   0.669   0.671   0.671   0.671
oh0.wc                   0.196   0.196   0.196   0.196   0.196
oh5.wc                   0.156   0.158   0.158   0.158   0.158
oh10.wc                  0.162   0.160   0.160   0.160   0.160
oh15.wc                  0.175   0.175   0.175   0.175   0.174
re0.wc                   0.399   0.399   0.399   0.399   0.399
re1.wc                   0.217   0.218   0.218   0.218   0.217
Average                  0.564   0.564   0.564   0.565   0.565

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.608   0.608   0.608   0.608   0.608
DM                       0.698   0.702   0.702   0.700   0.700
MM                       0.726   0.726   0.726   0.726   0.724
SC                       0.823   0.823   0.823   0.823   0.821
DNA3                     0.807   0.807   0.809   0.809   0.809
DNA11                    0.741   0.741   0.741   0.741   0.741
PS                       0.952   0.952   0.952   0.952   0.952
chen-2002                0.684   0.685   0.686   0.692   0.689
chowdary-2006            0.855   0.857   0.857   0.857   0.857
nutt-2003-v2             0.593   0.593   0.587   0.600   0.600
singh-2002               0.634   0.634   0.634   0.634   0.634
west-2001                0.614   0.614   0.614   0.614   0.610
dbworld-bodies           0.612   0.612   0.612   0.612   0.619
dbworld-bodies-stemmed   0.671   0.665   0.665   0.665   0.665
oh0.wc                   0.196   0.196   0.196   0.196   0.196
oh5.wc                   0.158   0.158   0.158   0.158   0.158
oh10.wc                  0.162   0.162   0.162   0.162   0.162
oh15.wc                  0.175   0.175   0.175   0.175   0.175
re0.wc                   0.399   0.398   0.398   0.399   0.399
re1.wc                   0.217   0.217   0.217   0.217   0.217
Average                  0.566   0.566   0.566   0.567   0.567
Table A23: Results of predictive GMean values of GGP-RI for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.501   0.502   0.502   0.503   0.503
DM                       0.523   0.525   0.538   0.545   0.545
MM                       0.524   0.532   0.542   0.542   0.544
SC                       0.392   0.388   0.388   0.389   0.389
DNA3                     0.582   0.581   0.578   0.578   0.578
DNA11                    0.498   0.491   0.499   0.499   0.499
PS                       0.445   0.463   0.451   0.436   0.447
chen-2002                0.658   0.654   0.645   0.652   0.652
chowdary-2006            0.830   0.834   0.834   0.832   0.829
nutt-2003-v2             0.631   0.631   0.616   0.616   0.616
singh-2002               0.613   0.616   0.618   0.623   0.630
west-2001                0.617   0.605   0.609   0.609   0.619
dbworld-bodies           0.582   0.582   0.582   0.582   0.578
dbworld-bodies-stemmed   0.652   0.652   0.655   0.655   0.655
oh0.wc                   0.398   0.398   0.398   0.398   0.398
oh5.wc                   0.361   0.364   0.364   0.364   0.364
oh10.wc                  0.370   0.367   0.367   0.367   0.367
oh15.wc                  0.382   0.381   0.381   0.381   0.381
re0.wc                   0.489   0.489   0.489   0.489   0.489
re1.wc                   0.407   0.408   0.408   0.408   0.407
Average                  0.523   0.523   0.523   0.523   0.525

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.503   0.503   0.503   0.503   0.502
DM                       0.545   0.546   0.546   0.544   0.544
MM                       0.544   0.544   0.544   0.549   0.550
SC                       0.386   0.386   0.386   0.386   0.389
DNA3                     0.578   0.578   0.583   0.583   0.583
DNA11                    0.499   0.499   0.499   0.506   0.506
PS                       0.448   0.448   0.448   0.448   0.448
chen-2002                0.653   0.655   0.657   0.663   0.659
chowdary-2006            0.831   0.833   0.833   0.833   0.833
nutt-2003-v2             0.606   0.606   0.600   0.611   0.611
singh-2002               0.638   0.638   0.638   0.638   0.638
west-2001                0.628   0.628   0.628   0.628   0.624
dbworld-bodies           0.578   0.578   0.578   0.578   0.585
dbworld-bodies-stemmed   0.655   0.649   0.649   0.649   0.649
oh0.wc                   0.398   0.398   0.398   0.398   0.398
oh5.wc                   0.364   0.364   0.364   0.364   0.364
oh10.wc                  0.369   0.369   0.369   0.369   0.369
oh15.wc                  0.381   0.381   0.381   0.381   0.381
re0.wc                   0.489   0.489   0.489   0.489   0.489
re1.wc                   0.407   0.407   0.407   0.407   0.407
Average                  0.525   0.525   0.525   0.526   0.526
Table A25: Results of predictive Specificity values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.532   0.529   0.531   0.532   0.532
DM                       0.360   0.361   0.368   0.358   0.362
MM                       0.319   0.317   0.316   0.316   0.316
SC                       0.223   0.233   0.233   0.228   0.233
DNA3                     0.532   0.555   0.562   0.588   0.571
DNA11                    0.382   0.384   0.400   0.435   0.399
PS                       0.676   0.697   0.697   0.709   0.706
chen-2002                0.850   0.867   0.850   0.853   0.850
chowdary-2006            0.938   0.930   0.928   0.928   0.928
nutt-2003-v2             0.804   0.808   0.792   0.792   0.792
singh-2002               0.791   0.792   0.798   0.791   0.808
west-2001                0.927   0.927   0.927   0.927   0.927
dbworld-bodies           0.783   0.763   0.777   0.793   0.783
dbworld-bodies-stemmed   0.728   0.759   0.771   0.764   0.781
oh0.wc                   0.949   0.952   0.955   0.956   0.956
oh5.wc                   0.967   0.971   0.973   0.975   0.975
oh10.wc                  0.952   0.953   0.952   0.952   0.951
oh15.wc                  0.955   0.957   0.957   0.957   0.958
re0.wc                   0.851   0.911   0.906   0.906   0.907
re1.wc                   0.912   0.945   0.951   0.950   0.950
Average                  0.722   0.731   0.732   0.735   0.734

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.532   0.537   0.534   0.534   0.535
DM                       0.365   0.379   0.371   0.376   0.363
MM                       0.316   0.316   0.316   0.316   0.316
SC                       0.238   0.233   0.228   0.253   0.212
DNA3                     0.571   0.554   0.564   0.568   0.549
DNA11                    0.439   0.452   0.427   0.421   0.448
PS                       0.712   0.712   0.720   0.730   0.729
chen-2002                0.866   0.859   0.853   0.864   0.857
chowdary-2006            0.928   0.928   0.928   0.928   0.928
nutt-2003-v2             0.808   0.792   0.792   0.804   0.808
singh-2002               0.798   0.805   0.808   0.796   0.800
west-2001                0.919   0.919   0.919   0.919   0.919
dbworld-bodies           0.776   0.782   0.780   0.782   0.782
dbworld-bodies-stemmed   0.797   0.770   0.774   0.788   0.794
oh0.wc                   0.957   0.956   0.955   0.956   0.956
oh5.wc                   0.977   0.975   0.976   0.977   0.977
oh10.wc                  0.952   0.953   0.953   0.953   0.955
oh15.wc                  0.957   0.957   0.957   0.957   0.957
re0.wc                   0.909   0.908   0.908   0.906   0.908
re1.wc                   0.950   0.951   0.950   0.951   0.951
Average                  0.738   0.737   0.736   0.739   0.737
Table A27: Results of predictive Sensitivity values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.628   0.626   0.629   0.629   0.631
DM                       0.649   0.651   0.653   0.651   0.653
MM                       0.694   0.692   0.689   0.689   0.689
SC                       0.820   0.818   0.816   0.815   0.816
DNA3                     0.835   0.840   0.840   0.844   0.840
DNA11                    0.719   0.730   0.725   0.741   0.734
PS                       0.977   0.977   0.977   0.977   0.978
chen-2002                0.871   0.889   0.875   0.875   0.871
chowdary-2006            0.945   0.941   0.938   0.938   0.938
nutt-2003-v2             0.783   0.792   0.783   0.783   0.783
singh-2002               0.793   0.793   0.798   0.791   0.808
west-2001                0.915   0.915   0.915   0.915   0.915
dbworld-bodies           0.792   0.772   0.783   0.796   0.788
dbworld-bodies-stemmed   0.729   0.758   0.758   0.759   0.767
oh0.wc                   0.750   0.762   0.768   0.772   0.771
oh5.wc                   0.751   0.771   0.774   0.780   0.781
oh10.wc                  0.702   0.704   0.705   0.702   0.701
oh15.wc                  0.751   0.753   0.754   0.755   0.756
re0.wc                   0.671   0.744   0.740   0.740   0.740
re1.wc                   0.610   0.720   0.755   0.754   0.754
Average                  0.769   0.782   0.784   0.785   0.786

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.630   0.634   0.633   0.633   0.633
DM                       0.660   0.662   0.657   0.649   0.649
MM                       0.689   0.689   0.689   0.689   0.689
SC                       0.819   0.815   0.815   0.815   0.815
DNA3                     0.840   0.835   0.831   0.835   0.831
DNA11                    0.751   0.747   0.736   0.746   0.744
PS                       0.978   0.978   0.978   0.979   0.979
chen-2002                0.883   0.879   0.872   0.881   0.878
chowdary-2006            0.938   0.938   0.938   0.938   0.938
nutt-2003-v2             0.792   0.783   0.783   0.783   0.792
singh-2002               0.798   0.805   0.808   0.795   0.801
west-2001                0.910   0.910   0.910   0.910   0.910
dbworld-bodies           0.784   0.789   0.789   0.789   0.789
dbworld-bodies-stemmed   0.784   0.759   0.763   0.775   0.783
oh0.wc                   0.774   0.772   0.771   0.775   0.773
oh5.wc                   0.786   0.780   0.782   0.786   0.786
oh10.wc                  0.704   0.706   0.707   0.709   0.712
oh15.wc                  0.757   0.757   0.755   0.756   0.757
re0.wc                   0.744   0.743   0.745   0.740   0.743
re1.wc                   0.754   0.755   0.754   0.755   0.755
Average                  0.789   0.787   0.786   0.787   0.788
Table A29: Results of predictive GMean values of Auto-WEKA-Rules for timeouts from 1,000s to 5,000s.

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.578   0.575   0.578   0.578   0.579
DM                       0.480   0.482   0.487   0.480   0.483
MM                       0.465   0.463   0.462   0.462   0.462
SC                       0.419   0.427   0.426   0.421   0.426
DNA3                     0.658   0.674   0.678   0.695   0.683
DNA11                    0.517   0.521   0.528   0.558   0.533
PS                       0.793   0.805   0.805   0.811   0.810
chen-2002                0.860   0.878   0.862   0.864   0.860
chowdary-2006            0.941   0.935   0.933   0.933   0.933
nutt-2003-v2             0.789   0.796   0.783   0.783   0.783
singh-2002               0.792   0.793   0.798   0.791   0.808
west-2001                0.920   0.920   0.920   0.920   0.920
dbworld-bodies           0.787   0.767   0.780   0.795   0.785
dbworld-bodies-stemmed   0.728   0.758   0.764   0.761   0.774
oh0.wc                   0.843   0.852   0.856   0.858   0.858
oh5.wc                   0.852   0.865   0.867   0.872   0.872
oh10.wc                  0.817   0.819   0.819   0.817   0.817
oh15.wc                  0.847   0.849   0.849   0.850   0.850
re0.wc                   0.756   0.823   0.819   0.819   0.819
re1.wc                   0.742   0.823   0.847   0.846   0.846
Average                  0.729   0.741   0.743   0.746   0.745

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.579   0.584   0.581   0.582   0.582
DM                       0.487   0.497   0.490   0.489   0.482
MM                       0.462   0.462   0.462   0.462   0.462
SC                       0.432   0.426   0.421   0.443   0.406
DNA3                     0.683   0.670   0.675   0.680   0.668
DNA11                    0.567   0.572   0.550   0.551   0.567
PS                       0.813   0.813   0.817   0.823   0.823
chen-2002                0.874   0.869   0.862   0.872   0.867
chowdary-2006            0.933   0.933   0.933   0.933   0.933
nutt-2003-v2             0.796   0.783   0.783   0.789   0.796
singh-2002               0.798   0.805   0.808   0.795   0.800
west-2001                0.914   0.914   0.914   0.914   0.914
dbworld-bodies           0.780   0.785   0.784   0.785   0.785
dbworld-bodies-stemmed   0.790   0.764   0.768   0.781   0.788
oh0.wc                   0.860   0.859   0.858   0.861   0.859
oh5.wc                   0.876   0.872   0.874   0.876   0.876
oh10.wc                  0.819   0.820   0.821   0.822   0.824
oh15.wc                  0.851   0.851   0.850   0.850   0.851
re0.wc                   0.822   0.822   0.822   0.819   0.821
re1.wc                   0.846   0.847   0.846   0.847   0.847
Average                  0.749   0.747   0.746   0.749   0.748
Table A31: Results of predictive Specificity values of the baseline methods for building Bayesian network classifiers.

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.550     0.554       0.556
DM                       0.545     0.566       0.620
MM                       0.498     0.507       0.574
SC                       0.525     0.468       0.609
DNA3                     0.736     0.716       0.742
DNA11                    0.672     0.672       0.509
PS                       0.800     0.779       0.772
chen-2002                0.884     0.924       0.864
chowdary-2006            0.954     0.980       0.980
nutt-2003-v2             0.800     0.750       0.367
singh-2002               0.833     0.785       0.725
west-2001                0.892     0.875       0.835
dbworld-bodies           0.811     0.700       0.905
dbworld-bodies-stemmed   0.733     0.733       0.871
oh0.wc                   0.987     0.977       0.988
oh5.wc                   0.983     0.973       0.984
oh10.wc                  0.976     0.964       0.974
oh15.wc                  0.977     0.965       0.978
re0.wc                   0.930     0.938       0.949
re1.wc                   0.976     0.974       0.975
Average                  0.803     0.790       0.789

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.586     0.598       0.584
DM                       0.647     0.664       0.647
MM                       0.650     0.683       0.707
SC                       0.738     0.754       0.750
DNA3                     0.805     0.820       0.829
DNA11                    0.734     0.734       0.689
PS                       0.974     0.981       0.979
chen-2002                0.905     0.933       0.899
chowdary-2006            0.962     0.981       0.981
nutt-2003-v2             0.700     0.700       0.333
singh-2002               0.834     0.785       0.725
west-2001                0.875     0.875       0.815
dbworld-bodies           0.814     0.750       0.895
dbworld-bodies-stemmed   0.767     0.767       0.862
oh0.wc                   0.899     0.796       0.895
oh5.wc                   0.855     0.787       0.869
oh10.wc                  0.804     0.721       0.801
oh15.wc                  0.829     0.725       0.834
re0.wc                   0.759     0.566       0.798
re1.wc                   0.814     0.664       0.837
Average                  0.798     0.764       0.786
Table A33: Results of predictive GMean values of the baseline methods for building Bayesian network classifiers.

data set                 BayesNet  NaiveBayes  NaiveBayesMultinomial
CE                       0.567     0.576       0.569
DM                       0.589     0.608       0.632
MM                       0.560     0.578       0.627
SC                       0.616     0.587       0.672
DNA3                     0.766     0.762       0.778
DNA11                    0.699     0.699       0.581
PS                       0.883     0.874       0.869
chen-2002                0.894     0.928       0.881
chowdary-2006            0.958     0.980       0.980
nutt-2003-v2             0.745     0.716       0.340
singh-2002               0.833     0.785       0.725
west-2001                0.883     0.874       0.824
dbworld-bodies           0.812     0.722       0.900
dbworld-bodies-stemmed   0.749     0.749       0.866
oh0.wc                   0.942     0.882       0.940
oh5.wc                   0.917     0.875       0.925
oh10.wc                  0.886     0.833       0.883
oh15.wc                  0.900     0.836       0.903
re0.wc                   0.840     0.728       0.870
re1.wc                   0.891     0.804       0.903
Average                  0.797     0.770       0.784

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.540   0.541   0.539   0.533   0.534
DM                       0.508   0.553   0.571   0.552   0.544
MM                       0.562   0.560   0.572   0.565   0.549
SC                       0.332   0.328   0.343   0.348   0.356
DNA3                     0.670   0.672   0.674   0.673   0.666
DNA11                    0.423   0.389   0.384   0.380   0.369
PS                       0.724   0.737   0.751   0.745   0.745
chen-2002                0.841   0.870   0.865   0.867   0.864
chowdary-2006            0.962   0.962   0.959   0.962   0.959
nutt-2003-v2             0.780   0.820   0.810   0.813   0.840
singh-2002               0.771   0.778   0.780   0.793   0.791
west-2001                0.885   0.884   0.878   0.878   0.894
dbworld-bodies           0.742   0.742   0.759   0.780   0.775
dbworld-bodies-stemmed   0.758   0.797   0.809   0.778   0.812
oh0.wc                   0.986   0.986   0.986   0.986   0.986
oh5.wc                   0.982   0.979   0.972   0.975   0.975
oh10.wc                  0.972   0.974   0.971   0.972   0.969
oh15.wc                  0.980   0.977   0.977   0.974   0.980
re0.wc                   0.931   0.931   0.930   0.930   0.931
re1.wc                   0.938   0.968   0.970   0.971   0.971
Average                  0.764   0.772   0.775   0.774   0.775
Table A35: Results of predictive Specificity values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.541   0.538   0.539   0.544   0.544
DM                       0.555   0.550   0.579   0.567   0.563
MM                       0.525   0.520   0.512   0.505   0.510
SC                       0.360   0.343   0.364   0.381   0.365
DNA3                     0.670   0.667   0.650   0.650   0.649
DNA11                    0.357   0.361   0.354   0.357   0.356
PS                       0.747   0.739   0.740   0.748   0.729
chen-2002                0.863   0.865   0.862   0.862   0.858
chowdary-2006            0.962   0.965   0.965   0.962   0.951
nutt-2003-v2             0.827   0.843   0.847   0.837   0.840
singh-2002               0.789   0.785   0.783   0.786   0.776
west-2001                0.884   0.886   0.882   0.882   0.877
dbworld-bodies           0.753   0.771   0.758   0.787   0.776
dbworld-bodies-stemmed   0.803   0.812   0.791   0.801   0.798
oh0.wc                   0.986   0.983   0.979   0.979   0.979
oh5.wc                   0.977   0.972   0.972   0.972   0.978
oh10.wc                  0.966   0.966   0.963   0.966   0.966
oh15.wc                  0.980   0.980   0.980   0.980   0.977
re0.wc                   0.931   0.931   0.918   0.917   0.917
re1.wc                   0.971   0.971   0.972   0.972   0.972
Average                  0.772   0.772   0.770   0.773   0.769

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.615   0.614   0.608   0.607   0.609
DM                       0.708   0.717   0.724   0.708   0.701
MM                       0.748   0.746   0.748   0.744   0.744
SC                       0.802   0.802   0.795   0.801   0.801
DNA3                     0.841   0.837   0.832   0.829   0.837
DNA11                    0.743   0.741   0.742   0.742   0.739
PS                       0.978   0.979   0.980   0.978   0.979
chen-2002                0.867   0.880   0.875   0.878   0.872
chowdary-2006            0.971   0.971   0.969   0.971   0.969
nutt-2003-v2             0.730   0.780   0.770   0.767   0.790
singh-2002               0.771   0.777   0.779   0.793   0.791
west-2001                0.888   0.883   0.879   0.879   0.896
dbworld-bodies           0.764   0.760   0.776   0.800   0.790
dbworld-bodies-stemmed   0.783   0.811   0.823   0.787   0.823
oh0.wc                   0.896   0.896   0.895   0.896   0.894
oh5.wc                   0.848   0.835   0.796   0.815   0.814
oh10.wc                  0.790   0.793   0.779   0.779   0.767
oh15.wc                  0.844   0.830   0.832   0.820   0.845
re0.wc                   0.760   0.760   0.760   0.760   0.760
re1.wc                   0.742   0.789   0.796   0.798   0.798
Average                  0.805   0.810   0.808   0.808   0.811
Table A37: Results of predictive Sensitivity values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.615   0.608   0.608   0.612   0.614
DM                       0.710   0.712   0.722   0.726   0.713
MM                       0.735   0.730   0.726   0.728   0.730
SC                       0.801   0.798   0.800   0.804   0.806
DNA3                     0.839   0.837   0.835   0.834   0.838
DNA11                    0.735   0.739   0.730   0.726   0.730
PS                       0.979   0.979   0.979   0.979   0.977
chen-2002                0.872   0.873   0.872   0.871   0.868
chowdary-2006            0.971   0.973   0.973   0.971   0.965
nutt-2003-v2             0.783   0.787   0.793   0.783   0.790
singh-2002               0.789   0.785   0.783   0.787   0.777
west-2001                0.886   0.891   0.888   0.888   0.883
dbworld-bodies           0.769   0.786   0.773   0.801   0.792
dbworld-bodies-stemmed   0.820   0.831   0.801   0.818   0.814
oh0.wc                   0.894   0.882   0.867   0.868   0.868
oh5.wc                   0.827   0.802   0.800   0.804   0.829
oh10.wc                  0.755   0.754   0.739   0.755   0.754
oh15.wc                  0.844   0.845   0.845   0.845   0.832
re0.wc                   0.760   0.760   0.747   0.746   0.746
re1.wc                   0.799   0.800   0.802   0.801   0.802
Average                  0.809   0.809   0.804   0.807   0.806

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.576   0.576   0.572   0.569   0.570
DM                       0.596   0.626   0.640   0.622   0.613
MM                       0.637   0.634   0.642   0.635   0.626
SC                       0.497   0.496   0.508   0.514   0.518
DNA3                     0.741   0.741   0.741   0.739   0.739
DNA11                    0.544   0.521   0.516   0.514   0.507
PS                       0.827   0.835   0.847   0.837   0.838
chen-2002                0.852   0.874   0.870   0.872   0.867
chowdary-2006            0.966   0.966   0.964   0.966   0.964
nutt-2003-v2             0.746   0.793   0.784   0.783   0.809
singh-2002               0.771   0.777   0.779   0.793   0.791
west-2001                0.886   0.883   0.878   0.878   0.895
dbworld-bodies           0.753   0.750   0.767   0.789   0.782
dbworld-bodies-stemmed   0.770   0.804   0.816   0.782   0.817
oh0.wc                   0.940   0.940   0.939   0.940   0.939
oh5.wc                   0.913   0.902   0.871   0.885   0.884
oh10.wc                  0.876   0.878   0.868   0.868   0.858
oh15.wc                  0.909   0.898   0.899   0.889   0.910
re0.wc                   0.841   0.841   0.841   0.841   0.841
re1.wc                   0.832   0.874   0.878   0.880   0.880
Average                  0.774   0.781   0.781   0.780   0.782
Table A39: Results of predictive GMean values of HHEA-BNC for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.577   0.572   0.572   0.577   0.578
DM                       0.624   0.623   0.643   0.638   0.629
MM                       0.608   0.604   0.596   0.594   0.598
SC                       0.522   0.509   0.525   0.539   0.528
DNA3                     0.744   0.739   0.729   0.727   0.730
DNA11                    0.498   0.503   0.495   0.496   0.497
PS                       0.842   0.836   0.838   0.845   0.824
chen-2002                0.867   0.869   0.867   0.866   0.862
chowdary-2006            0.966   0.969   0.969   0.966   0.958
nutt-2003-v2             0.799   0.808   0.814   0.804   0.809
singh-2002               0.789   0.785   0.783   0.787   0.777
west-2001                0.885   0.888   0.884   0.884   0.879
dbworld-bodies           0.760   0.778   0.765   0.794   0.784
dbworld-bodies-stemmed   0.811   0.821   0.795   0.809   0.805
oh0.wc                   0.939   0.929   0.918   0.918   0.918
oh5.wc                   0.895   0.874   0.873   0.876   0.896
oh10.wc                  0.848   0.848   0.836   0.848   0.847
oh15.wc                  0.910   0.910   0.910   0.910   0.900
re0.wc                   0.841   0.841   0.828   0.827   0.827
re1.wc                   0.881   0.881   0.882   0.882   0.883
Average                  0.780   0.779   0.776   0.779   0.777

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.566   0.566   0.564   0.567   0.567
DM                       0.546   0.552   0.554   0.550   0.567
MM                       0.490   0.493   0.493   0.493   0.492
SC                       0.396   0.401   0.407   0.417   0.417
DNA3                     0.570   0.552   0.570   0.565   0.560
DNA11                    0.462   0.452   0.443   0.439   0.446
PS                       0.688   0.685   0.676   0.681   0.682
chen-2002                0.891   0.885   0.891   0.890   0.886
chowdary-2006            0.973   0.973   0.973   0.973   0.973
nutt-2003-v2             0.850   0.846   0.842   0.838   0.838
singh-2002               0.872   0.874   0.874   0.872   0.869
west-2001                0.908   0.898   0.908   0.898   0.912
dbworld-bodies           0.885   0.885   0.863   0.861   0.858
dbworld-bodies-stemmed   0.937   0.915   0.918   0.939   0.939
oh0.wc                   0.966   0.967   0.966   0.966   0.965
oh5.wc                   0.964   0.964   0.964   0.964   0.967
oh10.wc                  0.960   0.960   0.960   0.960   0.961
oh15.wc                  0.964   0.964   0.964   0.964   0.964
re0.wc                   0.941   0.940   0.941   0.941   0.938
re1.wc                   0.961   0.962   0.961   0.961   0.961
Average                  0.789   0.787   0.787   0.787   0.788
Table A41: Results of predictive Specificity values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.567   0.572   0.570   0.567   0.571
DM                       0.555   0.557   0.561   0.559   0.560
MM                       0.488   0.492   0.498   0.494   0.494
SC                       0.406   0.401   0.423   0.387   0.407
DNA3                     0.566   0.567   0.576   0.571   0.560
DNA11                    0.462   0.460   0.456   0.466   0.462
PS                       0.688   0.686   0.694   0.695   0.693
chen-2002                0.888   0.885   0.887   0.884   0.886
chowdary-2006            0.973   0.973   0.973   0.973   0.973
nutt-2003-v2             0.842   0.842   0.842   0.842   0.842
singh-2002               0.872   0.864   0.862   0.867   0.859
west-2001                0.908   0.912   0.912   0.912   0.908
dbworld-bodies           0.871   0.868   0.857   0.857   0.844
dbworld-bodies-stemmed   0.932   0.928   0.944   0.941   0.932
oh0.wc                   0.966   0.966   0.966   0.965   0.965
oh5.wc                   0.967   0.966   0.966   0.966   0.966
oh10.wc                  0.960   0.960   0.960   0.960   0.960
oh15.wc                  0.965   0.965   0.965   0.965   0.964
re0.wc                   0.941   0.941   0.941   0.941   0.941
re1.wc                   0.962   0.961   0.962   0.962   0.961
Average                  0.789   0.788   0.791   0.789   0.788

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.642   0.644   0.644   0.646   0.646
DM                       0.693   0.699   0.697   0.697   0.702
MM                       0.722   0.728   0.728   0.728   0.728
SC                       0.825   0.827   0.829   0.831   0.833
DNA3                     0.846   0.837   0.844   0.841   0.839
DNA11                    0.711   0.711   0.700   0.713   0.717
PS                       0.979   0.979   0.978   0.978   0.978
chen-2002                0.900   0.895   0.900   0.899   0.897
chowdary-2006            0.983   0.983   0.983   0.983   0.983
nutt-2003-v2             0.800   0.792   0.783   0.775   0.775
singh-2002               0.872   0.874   0.874   0.871   0.869
west-2001                0.900   0.890   0.900   0.890   0.905
dbworld-bodies           0.874   0.874   0.854   0.854   0.850
dbworld-bodies-stemmed   0.938   0.918   0.918   0.938   0.938
oh0.wc                   0.803   0.810   0.806   0.804   0.799
oh5.wc                   0.766   0.766   0.766   0.766   0.773
oh10.wc                  0.723   0.722   0.721   0.722   0.721
oh15.wc                  0.770   0.771   0.771   0.771   0.771
re0.wc                   0.755   0.755   0.755   0.755   0.751
re1.wc                   0.779   0.779   0.779   0.779   0.779
Average                  0.814   0.813   0.811   0.812   0.813
Table A43: Results of predictive Sensitivity values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.646   0.648   0.646   0.644   0.646
DM                       0.695   0.693   0.695   0.691   0.693
MM                       0.728   0.725   0.731   0.731   0.731
SC                       0.828   0.827   0.833   0.830   0.832
DNA3                     0.842   0.846   0.839   0.846   0.842
DNA11                    0.726   0.717   0.722   0.726   0.719
PS                       0.978   0.979   0.979   0.979   0.979
chen-2002                0.899   0.897   0.899   0.896   0.899
chowdary-2006            0.983   0.983   0.983   0.983   0.983
nutt-2003-v2             0.783   0.783   0.783   0.783   0.783
singh-2002               0.871   0.864   0.862   0.867   0.859
west-2001                0.900   0.905   0.905   0.905   0.900
dbworld-bodies           0.863   0.859   0.851   0.847   0.835
dbworld-bodies-stemmed   0.931   0.926   0.941   0.938   0.930
oh0.wc                   0.804   0.804   0.804   0.803   0.801
oh5.wc                   0.774   0.771   0.773   0.770   0.773
oh10.wc                  0.724   0.723   0.723   0.725   0.723
oh15.wc                  0.773   0.774   0.771   0.772   0.771
re0.wc                   0.754   0.754   0.754   0.754   0.755
re1.wc                   0.776   0.776   0.778   0.775   0.777
Average                  0.814   0.813   0.814   0.813   0.812

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.603   0.604   0.602   0.605   0.605
DM                       0.612   0.618   0.619   0.616   0.628
MM                       0.586   0.590   0.590   0.590   0.589
SC                       0.555   0.560   0.566   0.572   0.571
DNA3                     0.684   0.670   0.683   0.679   0.675
DNA11                    0.568   0.562   0.551   0.555   0.560
PS                       0.800   0.798   0.793   0.796   0.797
chen-2002                0.895   0.890   0.896   0.894   0.892
chowdary-2006            0.978   0.978   0.978   0.978   0.978
nutt-2003-v2             0.818   0.812   0.806   0.799   0.799
singh-2002               0.872   0.874   0.874   0.872   0.869
west-2001                0.904   0.893   0.904   0.893   0.908
dbworld-bodies           0.879   0.879   0.858   0.857   0.854
dbworld-bodies-stemmed   0.937   0.917   0.918   0.938   0.938
oh0.wc                   0.881   0.885   0.882   0.881   0.878
oh5.wc                   0.859   0.859   0.859   0.859   0.864
oh10.wc                  0.833   0.833   0.832   0.833   0.832
oh15.wc                  0.862   0.862   0.862   0.862   0.862
re0.wc                   0.843   0.842   0.843   0.843   0.840
re1.wc                   0.865   0.865   0.865   0.865   0.865
Average                  0.792   0.790   0.789   0.789   0.790
Table A45: Results of predictive GMean values of Auto-WEKA-Bayes for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.605   0.609   0.607   0.604   0.607
DM                       0.618   0.619   0.622   0.619   0.621
MM                       0.587   0.588   0.593   0.591   0.591
SC                       0.562   0.559   0.575   0.552   0.565
DNA3                     0.680   0.682   0.686   0.685   0.677
DNA11                    0.576   0.570   0.570   0.578   0.571
PS                       0.800   0.799   0.803   0.804   0.803
chen-2002                0.893   0.891   0.893   0.890   0.892
chowdary-2006            0.978   0.978   0.978   0.978   0.978
nutt-2003-v2             0.806   0.806   0.806   0.806   0.806
singh-2002               0.872   0.864   0.862   0.867   0.859
west-2001                0.904   0.908   0.908   0.908   0.904
dbworld-bodies           0.866   0.863   0.854   0.852   0.839
dbworld-bodies-stemmed   0.931   0.927   0.943   0.940   0.931
oh0.wc                   0.881   0.881   0.881   0.880   0.879
oh5.wc                   0.865   0.863   0.864   0.862   0.864
oh10.wc                  0.833   0.833   0.833   0.834   0.833
oh15.wc                  0.863   0.864   0.862   0.863   0.862
re0.wc                   0.842   0.842   0.842   0.842   0.843
re1.wc                   0.864   0.864   0.865   0.863   0.864
Average                  0.791   0.790   0.792   0.791   0.789

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.564   0.561   0.557   0.566   0.558
DM                       0.460   0.469   0.474   0.437   0.437
MM                       0.471   0.448   0.457   0.462   0.465
SC                       0.290   0.280   0.299   0.279   0.281
DNA3                     0.584   0.586   0.564   0.571   0.577
DNA11                    0.380   0.396   0.376   0.401   0.418
PS                       0.737   0.740   0.739   0.747   0.750
chen-2002                0.918   0.923   0.927   0.917   0.928
chowdary-2006            0.982   0.982   0.986   0.981   0.984
nutt-2003-v2             0.896   0.917   0.892   0.900   0.925
singh-2002               0.865   0.876   0.871   0.878   0.881
west-2001                0.891   0.882   0.882   0.882   0.885
dbworld-bodies           0.759   0.776   0.778   0.802   0.821
dbworld-bodies-stemmed   0.820   0.873   0.890   0.882   0.875
oh0.wc                   0.959   0.967   0.968   0.967   0.966
oh5.wc                   0.978   0.975   0.976   0.975   0.975
oh10.wc                  0.958   0.958   0.958   0.959   0.960
oh15.wc                  0.967   0.968   0.968   0.967   0.968
re0.wc                   0.921   0.916   0.917   0.913   0.916
re1.wc                   0.959   0.961   0.963   0.964   0.963
Average                  0.768   0.773   0.772   0.773   0.777
Table A47: Results of predictive Specificity values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.556   0.551   0.558   0.560   0.564
DM                       0.455   0.466   0.464   0.449   0.439
MM                       0.453   0.463   0.450   0.470   0.467
SC                       0.273   0.269   0.284   0.276   0.280
DNA3                     0.555   0.586   0.551   0.584   0.601
DNA11                    0.392   0.397   0.401   0.391   0.388
PS                       0.747   0.741   0.744   0.739   0.741
chen-2002                0.925   0.919   0.927   0.929   0.924
chowdary-2006            0.986   0.986   0.985   0.981   0.986
nutt-2003-v2             0.921   0.892   0.904   0.917   0.908
singh-2002               0.897   0.887   0.889   0.883   0.877
west-2001                0.871   0.885   0.891   0.874   0.891
dbworld-bodies           0.818   0.810   0.810   0.810   0.814
dbworld-bodies-stemmed   0.880   0.878   0.885   0.877   0.881
oh0.wc                   0.966   0.966   0.966   0.966   0.966
oh5.wc                   0.975   0.976   0.977   0.976   0.976
oh10.wc                  0.960   0.960   0.959   0.959   0.960
oh15.wc                  0.967   0.967   0.966   0.966   0.966
re0.wc                   0.914   0.915   0.915   0.913   0.912
re1.wc                   0.963   0.964   0.963   0.964   0.963
Average                  0.774   0.774   0.775   0.774   0.775

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.651   0.648   0.644   0.651   0.645
DM                       0.660   0.669   0.664   0.658   0.658
MM                       0.720   0.706   0.706   0.709   0.709
SC                       0.828   0.828   0.821   0.822   0.834
DNA3                     0.853   0.838   0.837   0.844   0.840
DNA11                    0.712   0.711   0.714   0.715   0.727
PS                       0.975   0.975   0.976   0.977   0.977
chen-2002                0.927   0.930   0.935   0.931   0.937
chowdary-2006            0.988   0.988   0.991   0.986   0.991
nutt-2003-v2             0.842   0.858   0.858   0.850   0.875
singh-2002               0.866   0.877   0.872   0.879   0.881
west-2001                0.871   0.860   0.860   0.860   0.865
dbworld-bodies           0.760   0.783   0.787   0.796   0.811
dbworld-bodies-stemmed   0.820   0.882   0.885   0.878   0.871
oh0.wc                   0.775   0.813   0.812   0.808   0.805
oh5.wc                   0.791   0.795   0.795   0.794   0.793
oh10.wc                  0.721   0.719   0.722   0.726   0.724
oh15.wc                  0.779   0.781   0.784   0.781   0.783
re0.wc                   0.783   0.778   0.779   0.777   0.779
re1.wc                   0.751   0.759   0.760   0.764   0.761
Average                  0.804   0.810   0.810   0.810   0.813
Table A49: Results of predictive Sensitivity values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                 6,000s  7,000s  8,000s  9,000s  10,000s
CE                       0.642   0.640   0.643   0.644   0.650
DM                       0.667   0.666   0.670   0.667   0.656
MM                       0.709   0.692   0.717   0.723   0.703
SC                       0.818   0.820   0.818   0.832   0.825
DNA3                     0.844   0.848   0.835   0.849   0.853
DNA11                    0.723   0.721   0.717   0.725   0.701
PS                       0.976   0.975   0.976   0.976   0.975
chen-2002                0.931   0.927   0.933   0.934   0.930
chowdary-2006            0.991   0.991   0.988   0.986   0.991
nutt-2003-v2             0.867   0.858   0.858   0.858   0.867
singh-2002               0.898   0.889   0.889   0.884   0.879
west-2001                0.850   0.865   0.871   0.855   0.871
dbworld-bodies           0.811   0.802   0.802   0.802   0.807
dbworld-bodies-stemmed   0.874   0.874   0.882   0.875   0.879
oh0.wc                   0.803   0.805   0.804   0.804   0.803
oh5.wc                   0.792   0.793   0.793   0.793   0.796
oh10.wc                  0.723   0.728   0.725   0.725   0.729
oh15.wc                  0.778   0.778   0.778   0.777   0.779
re0.wc                   0.775   0.779   0.778   0.778   0.778
re1.wc                   0.767   0.768   0.767   0.767   0.767
Average                  0.812   0.811   0.812   0.813   0.812

data set                 1,000s  2,000s  3,000s  4,000s  5,000s
CE                       0.606   0.603   0.599   0.607   0.600
DM                       0.545   0.555   0.556   0.531   0.530
MM                       0.573   0.553   0.559   0.563   0.564
SC                       0.473   0.466   0.478   0.463   0.469
DNA3                     0.694   0.689   0.677   0.685   0.685
DNA11                    0.515   0.524   0.511   0.529   0.544
PS                       0.831   0.833   0.832   0.837   0.839
chen-2002                0.923   0.926   0.931   0.924   0.932
chowdary-2006            0.985   0.985   0.988   0.983   0.987
nutt-2003-v2             0.865   0.884   0.871   0.871   0.897
singh-2002               0.866   0.876   0.871   0.878   0.881
west-2001                0.881   0.870   0.870   0.870   0.875
dbworld-bodies           0.759   0.779   0.782   0.799   0.815
dbworld-bodies-stemmed   0.820   0.876   0.887   0.880   0.873
oh0.wc                   0.860   0.887   0.886   0.884   0.882
oh5.wc                   0.879   0.880   0.880   0.880   0.879
oh10.wc                  0.831   0.829   0.831   0.834   0.833
oh15.wc                  0.868   0.869   0.871   0.869   0.870
re0.wc                   0.849   0.844   0.845   0.842   0.844
re1.wc                   0.848   0.854   0.855   0.858   0.856
Average                  0.773   0.779   0.779   0.779   0.783
Table A51: Results of predictive GMean values of Auto-WEKA-ALL for timeouts from 6,000s to 10,000s.

data set                  6,000s  7,000s  8,000s  9,000s  10,000s
CE                        0.597   0.594   0.599   0.600   0.605
DM                        0.546   0.552   0.554   0.543   0.532
MM                        0.556   0.556   0.557   0.571   0.563
SC                        0.456   0.452   0.468   0.465   0.463
DNA3                      0.673   0.695   0.668   0.694   0.707
DNA11                     0.525   0.528   0.529   0.524   0.516
PS                        0.836   0.835   0.835   0.836   0.837
chen-2002                 0.928   0.923   0.930   0.931   0.927
chowdary-2006             0.988   0.988   0.986   0.983   0.988
nutt-2003-v2              0.891   0.871   0.878   0.884   0.884
singh-2002                0.898   0.888   0.889   0.883   0.878
west-2001                 0.860   0.875   0.881   0.864   0.881
dbworld-bodies            0.814   0.806   0.806   0.806   0.810
dbworld-bodies-stemmed    0.877   0.876   0.883   0.876   0.880
oh0.wc                    0.881   0.882   0.881   0.881   0.881
oh5.wc                    0.878   0.880   0.880   0.879   0.881
oh10.wc                   0.833   0.836   0.834   0.834   0.836
oh15.wc                   0.867   0.867   0.867   0.866   0.867
re0.wc                    0.842   0.844   0.844   0.843   0.842
re1.wc                    0.860   0.860   0.859   0.860   0.860
Average                   0.780   0.780   0.781   0.781   0.782
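The GMean columns above are derived from the corresponding Specificity and Sensitivity tables: for each dataset and timeout, GMean is the geometric mean of the two rates. A minimal sketch of this computation (the helper name `g_mean` is ours; the example values are taken from the CE row at the 6,000s timeout in Tables A47 and A49):

```python
import math

def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of the true-positive and true-negative rates."""
    return math.sqrt(sensitivity * specificity)

# CE dataset at the 6,000s timeout:
sensitivity_ce = 0.642  # Table A49
specificity_ce = 0.556  # Table A47
print(round(g_mean(sensitivity_ce, specificity_ce), 3))  # → 0.597, as in Table A51
```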
data set                  Specificity       Sensitivity       GMean
                          1k      2k-10k    1k      2k-10k    1k      2k-10k
CE                        0.574   0.574     0.644   0.644     0.608   0.608
DM                        0.550   0.550     0.675   0.675     0.607   0.607
MM                        0.472   0.472     0.674   0.674     0.557   0.557
SC                        0.351   0.351     0.778   0.778     0.507   0.507
DNA3                      0.718   0.718     0.879   0.879     0.790   0.790
DNA11                     0.448   0.448     0.749   0.749     0.571   0.571
PS                        0.747   0.747     0.980   0.980     0.833   0.833
chen-2002                 0.844   0.844     0.859   0.859     0.851   0.851
chowdary-2006             0.937   0.937     0.932   0.932     0.934   0.934
nutt-2003-v2              0.933   0.933     0.867   0.867     0.898   0.898
singh-2002                0.757   0.757     0.763   0.763     0.760   0.760
west-2001                 0.905   0.905     0.895   0.895     0.900   0.900
dbworld-bodies            0.670   0.670     0.688   0.688     0.678   0.678
dbworld-bodies-stemmed    0.865   0.865     0.843   0.843     0.854   0.854
oh0.wc                    0.972   0.972     0.818   0.818     0.891   0.891
oh5.wc                    0.977   0.977     0.809   0.809     0.889   0.889
oh10.wc                   0.969   0.959     0.771   0.732     0.865   0.838
oh15.wc                   0.968   0.965     0.767   0.755     0.861   0.853
re0.wc                    0.919   0.919     0.755   0.755     0.833   0.833
re1.wc                    0.969   0.969     0.791   0.789     0.875   0.874
Average                   0.777   0.777     0.797   0.794     0.778   0.776
data set                  Specificity                      Sensitivity                      GMean
                          1k      2k      3k-6k   7k-10k   1k      2k      3k-6k   7k-10k   1k      2k      3k-6k   7k-10k
CE                        0.561   0.574   0.574   0.574    0.629   0.644   0.644   0.644    0.608   0.608   0.608   0.608
DM                        0.550   0.550   0.550   0.550    0.675   0.675   0.675   0.675    0.607   0.607   0.607   0.607
MM                        0.374   0.472   0.472   0.472    0.640   0.674   0.674   0.674    0.557   0.557   0.557   0.557
SC                        0.351   0.351   0.351   0.351    0.778   0.778   0.778   0.778    0.507   0.507   0.507   0.507
DNA3                      0.718   0.718   0.718   0.718    0.879   0.879   0.879   0.879    0.790   0.790   0.790   0.790
DNA11                     0.448   0.448   0.448   0.448    0.749   0.749   0.749   0.749    0.571   0.571   0.571   0.571
PS                        0.747   0.747   0.747   0.747    0.980   0.980   0.980   0.980    0.833   0.833   0.833   0.833
chen-2002                 0.855   0.844   0.844   0.844    0.876   0.859   0.859   0.859    0.851   0.851   0.851   0.851
chowdary-2006             0.937   0.937   0.937   0.937    0.932   0.932   0.932   0.932    0.934   0.934   0.934   0.934
nutt-2003-v2              0.950   0.933   0.933   0.933    0.900   0.867   0.867   0.867    0.898   0.898   0.898   0.898
singh-2002                0.757   0.757   0.757   0.757    0.763   0.763   0.763   0.763    0.760   0.760   0.760   0.760
west-2001                 0.905   0.905   0.905   0.905    0.895   0.895   0.895   0.895    0.900   0.900   0.900   0.900
dbworld-bodies            0.712   0.670   0.670   0.670    0.721   0.688   0.688   0.688    0.678   0.678   0.678   0.678
dbworld-bodies-stemmed    0.819   0.865   0.865   0.865    0.814   0.843   0.843   0.843    0.854   0.854   0.854   0.854
oh0.wc                    0.972   0.972   0.972   0.972    0.818   0.818   0.818   0.818    0.891   0.891   0.891   0.891
oh5.wc                    0.977   0.977   0.977   0.977    0.809   0.809   0.809   0.809    0.889   0.889   0.889   0.889
oh10.wc                   0.960   0.959   0.959   0.959    0.736   0.732   0.732   0.732    0.838   0.838   0.838   0.838
oh15.wc                   0.968   0.968   0.965   0.965    0.767   0.767   0.755   0.755    0.861   0.861   0.853   0.853
re0.wc                    0.920   0.920   0.920   0.919    0.749   0.749   0.749   0.755    0.830   0.830   0.830   0.833
re1.wc                    0.970   0.969   0.969   0.969    0.801   0.789   0.789   0.789    0.881   0.874   0.874   0.874
Average                   0.772   0.777   0.777   0.777    0.796   0.794   0.794   0.794    0.777   0.777   0.776   0.776

References

[MET, 2002] (2002). METAL: Meta-learning assistant for providing user support in machine learning and data mining.
[Barros et al., 2012a] Barros, R. C., Basgalupp, M. P., de Carvalho, A. C. P. L. F., and Freitas, A. A. (2012a). A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(3):291–312.
[Barros et al., 2013] Barros, R. C., Basgalupp, M. P., de Carvalho, A. C. P. L. F., and Freitas, A. A. (2013). Automatic design of decision-tree algorithms with evolutionary algorithms. Evolutionary Computation, 21(4):659–684.
[Barros et al., 2014] Barros, R. C., Basgalupp, M. P., Freitas, A. A., and de Carvalho, A. C. P. L. F. (2014). Evolutionary design of decision-tree algorithms tailored to microarray gene expression data sets. IEEE Transactions on Evolutionary Computation, 18(6):873–892.
[Barros et al., 2015] Barros, R. C., de Carvalho, A. C., and Freitas, A. A. (2015). Automatic Design of Decision-Tree Induction Algorithms. SpringerBriefs in Computer Science. Springer.
[Barros et al., 2012b] Barros, R. C., Winck, A. T., Machado, K. S., Basgalupp, M. P., de Carvalho, A. C. P. L. F., Ruiz, D. D., and de Souza, O. N. (2012b). Automatic design of decision-tree induction algorithms tailored to flexible-receptor docking data. BMC Bioinformatics, 13:310.
[Basgalupp et al., 2018] Basgalupp, M., Barros, R., Sá, A. G., Pappa, G. L., Mantovani, R., de Carvalho, A., and Freitas, A. (2018). Supplementary material for: An experimental evaluation of meta-learning methods for recommending classification algorithms. To be submitted to arXiv.
[Brazdil et al., 2008] Brazdil, P., Giraud-Carrier, C., Soares, C., and Vilalta, R. (2008). Metalearning: Applications to Data Mining. Springer, 1st edition.
[Cheng and Greiner, 1999] Cheng, J. and Greiner, R. (1999). Comparing Bayesian network classifiers. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 101–108. Morgan Kaufmann.
[Daly et al., 2011] Daly, R., Shen, Q., and Aitken, S. (2011). Learning Bayesian networks: Approaches and issues. The Knowledge Engineering Review, 26(2):99–157.
[das Dôres et al., 2018] das Dôres, S. C. N., Soares, C., and Ruiz, D. (2018). Bandit-based automated machine learning. In Proceedings of the Brazilian Conference on Intelligent Systems, BRACIS'18, pages 121–126, New York, NY, USA. IEEE.
[de Sá and Pappa, 2013] de Sá, A. G. C. and Pappa, G. L. (2013). Towards a method for automatically evolving Bayesian network classifiers. In Proceedings of the Annual Conference Companion on Genetic and Evolutionary Computation, pages 1505–1512. ACM.
[de Sá and Pappa, 2014] de Sá, A. G. C. and Pappa, G. L. (2014). A hyper-heuristic evolutionary algorithm for learning Bayesian network classifiers. In Proceedings of the Ibero-American Conference on Artificial Intelligence, pages 430–442. Springer.
[de Sá et al., 2017] de Sá, A. G. C., Pinto, W. J. G. S., Oliveira, L. O. V. B., and Pappa, G. L. (2017). RECIPE: A grammar-based framework for automatically evolving classification pipelines. In Proceedings of the European Conference on Genetic Programming (EuroGP), pages 246–261. Springer International Publishing.
[de Souto et al., 2008] de Souto, M., Costa, I., de Araujo, D., Ludermir, T., and Schliep, A. (2008). Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):497.
[Deb et al., 2002] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.
[Demšar, 2006] Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.
[Eiben and Smith, 2015] Eiben, A. E. and Smith, J. (2015). From evolutionary computation to the evolution of things. Nature, 521(7553):476–482.
[Elsken et al., 2019] Elsken, T., Metzen, J. H., and Hutter, F. (2019). Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21.
[Feurer et al., 2015a] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015a). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc.
[Feurer et al., 2015b] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015b). Methods for improving Bayesian optimization for AutoML. In ICML 2015 AutoML Workshop.
[Feurer et al., 2015c] Feurer, M., Springenberg, J. T., and Hutter, F. (2015c). Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 1128–1135.
[Freitas, 2008] Freitas, A. A. (2008). A review of evolutionary algorithms for data mining. In Soft Computing for Knowledge Discovery and Data Mining, pages 79–111. Springer US.
[Freitas et al., 2011] Freitas, A. A., Vasieva, O., and Magalhães, J. P. d. (2011). A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related. BMC Genomics, 12(1):1–11.
[Fusi et al., 2018] Fusi, N., Sheth, R., and Elibol, H. M. (2018). Probabilistic matrix factorization for automated machine learning. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS'18, pages 3348–3357, Red Hook, NY, USA. Curran Associates Inc.
[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18.
[Ho and Basu, 2002] Ho, T. and Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289–300.
[Ho et al., 2006] Ho, T., Basu, M., and Law, M. (2006). Measures of geometrical complexity in classification problems. In Data Complexity in Pattern Recognition. Springer London.
[Hutter et al., 2019] Hutter, F., Kotthoff, L., and Vanschoren, J., editors (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer, New York, NY, USA. Available at http://automl.org/book.
[Iman and Davenport, 1980] Iman, R. and Davenport, J. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595.
[Japkowicz and Shah, 2011] Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, NY, USA.
[Jin et al., 2019] Jin, H., Song, Q., and Hu, X. (2019). Auto-Keras: An efficient neural architecture search system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'19, pages 1946–1956, New York, NY, USA. ACM.
[Koch et al., 2018] Koch, P., Golovidov, O., Gardner, S., Wujek, B., Griffin, J., and Xu, Y. (2018). Autotune: A derivative-free optimization framework for hyperparameter tuning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'18, pages 443–452, New York, NY, USA. ACM.
[Kotthoff et al., 2017] Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25):1–5.
[Křen et al., 2017] Křen, T., Pilát, M., and Neruda, R. (2017). Automatic creation of machine learning workflows with strongly typed genetic programming. International Journal on Artificial Intelligence Tools, 26(05):1760020.
[Larcher and Barbosa, 2019] Larcher, C. H. N. and Barbosa, H. J. C. (2019). Auto-CVE: A coevolutionary approach to evolve ensembles in automated machine learning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO'19, pages 392–400, New York, NY, USA. ACM.
[Leite et al., 2012] Leite, R., Brazdil, P., and Vanschoren, J. (2012). Selecting Classification Algorithms with Active Testing, pages 117–131. Springer, Berlin.
[Li et al., 2018] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52.
[Mckay et al., 2010] Mckay, R., Hoai, N., Whigham, P., Shan, Y., and O'Neill, M. (2010). Grammar-based genetic programming: A survey. Genetic Programming and Evolvable Machines, 11(3):365–396.
[Michie et al., 1994] Michie, D., Spiegelhalter, D. J., Taylor, C. C., and Campbell, J., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
[Mohr et al., 2018] Mohr, F., Wever, M., and Hüllermeier, E. (2018). ML-Plan: Automated machine learning via hierarchical planning. Machine Learning, 107:1495–1515.
[Nyathi and Pillay, 2017] Nyathi, T. and Pillay, N. (2017). Automated design of genetic programming classification algorithms using a genetic algorithm. In EvoApplications (2), volume 10200 of Lecture Notes in Computer Science, pages 224–239.
[Olson et al., 2016a] Olson, R., Urbanowicz, R., Andrews, P., Lavender, N., Kidd, L., and Moore, J. H. (2016a). Automating biomedical data science through tree-based pipeline optimization. In Proceedings of the European Conference on the Applications of Evolutionary Computation, pages 123–137.
[Olson et al., 2016b] Olson, R. S., Bartley, N., Urbanowicz, R. J., and Moore, J. H. (2016b). Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 485–492. ACM.
[Pappa et al., 2005] Pappa, G. L., Baines, A. J., and Freitas, A. A. (2005). Predicting post-synaptic activity in proteins with data mining. Bioinformatics, 21(2):19–25.
[Pappa and Freitas, 2009] Pappa, G. L. and Freitas, A. (2009). Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach. Springer, 1st edition.
[Pappa et al., 2014] Pappa, G. L., Ochoa, G., Hyde, M. R., Freitas, A. A., Woodward, J., and Swan, J. (2014). Contrasting meta-learning and hyper-heuristic research: The role of evolutionary algorithms. Genetic Programming and Evolvable Machines, 15(1):3–35.
[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
[Scott and De Jong, 2016] Scott, E. O. and De Jong, K. A. (2016). Evaluation-time bias in quasi-generational and steady-state asynchronous evolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 845–852. ACM.
[Sohn et al., 2017] Sohn, A., Olson, R. S., and Moore, J. H. (2017). Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 489–496. ACM.
[Thornton et al., 2013] Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'13, pages 847–855. ACM.
[van Rijn et al., 2015] van Rijn, J. N., Abdulrahman, S. M., Brazdil, P., and Vanschoren, J. (2015). Fast algorithm selection using learning curves. In Advances in Intelligent Data Analysis XIV - 14th International Symposium, IDA 2015, Saint Etienne, France, October 22-24, pages 298–309.
[Vanschoren, 2018] Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
[Vanschoren et al., 2014] Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. (2014). OpenML: Networked science in machine learning. SIGKDD Explorations Newsletter, 15(2):49–60.
[Wan et al., 2015] Wan, C., Freitas, A., and de Magalhaes, J. (2015). Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2):262–275.
[Wilcoxon et al., 1970] Wilcoxon, F., Katti, S. K., and Wilcox, R. A. (1970). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables in Mathematical Statistics, 1:171–259.
[Witten et al., 2016] Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 4th edition.
[Zaki and Meira Jr, 2020] Zaki, M. J. and Meira Jr, W. (2020). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, Cambridge, UK, 2nd edition.