Cost-sensitive Hierarchical Clustering for Dynamic Classifier Selection
Meinolf Sellmann
GE Global Research [email protected]
Tapan Shah
GE Global Research [email protected]
December 21, 2020

Abstract
We consider the dynamic classifier selection (DCS) problem: Given an ensemble of classifiers, we are to choose which classifier to use depending on the particular input vector that we get to classify. The problem is a special case of the general algorithm selection problem where we have multiple different algorithms we can employ to process a given input. We investigate if a method developed for general algorithm selection named cost-sensitive hierarchical clustering (CSHC) is suited for DCS. We introduce some additions to the original CSHC method for the special case of choosing a classification algorithm and evaluate their impact on performance. We then compare with a number of state-of-the-art dynamic classifier selection methods. Our experimental results show that our modified CSHC algorithm compares favorably.
The idea of using more than one classifier to improve accuracy goes back to the basic theory of PAC learning [1] and boosting weak learners [2]. Often, we have multiple classifiers available to us, whereby these classifiers may be based on different concept classes or may themselves be ensembles. We could use cross-validation to determine the best classifier and deploy it. We may be able to do even better, though, if we choose dynamically, after seeing the feature input, which classifier to use. That is, rather than choosing one classifier and using it regardless of the input, we may use one classifier for one input and another for another. A method is needed to choose a classifier. This problem is known in the literature as dynamic classifier selection (DCS).

Of course, there are other ways to combine multiple classifiers, for example by using each classifier's support for each of the possible class labels and aggregating this information. This is the basic idea behind stacking [3]. Note that, in stacking, the final class label may not coincide with any of the classes chosen by any of the base classifiers in the ensemble, which gives this technique more flexibility and the potential to outperform dynamic classifier selection. However, the disadvantage of this more flexible aggregation method is that every classifier in the ensemble needs to score each possible class label. Another disadvantage of more elaborate aggregation schemes is that they make explaining the classification more challenging. When using dynamic classifier selection, we can inherit the explanation method from the base classifier. In this paper, we therefore limit ourselves to dynamic classifier selection.

A problem highly related to DCS has been identified in the satisfiability and optimization communities. It was found that different algorithmic approaches may solve the same problem instances in vastly different compute times.
The idea arose to choose which algorithm to employ only after the concrete instance to process is known [4]. These so-called "algorithm portfolios" have since led to massive improvements in our ability to solve extremely hard combinatorial satisfiability and optimization problems [5].

One method for selecting an algorithm out of a portfolio of algorithms was introduced in [6] and employs cost-sensitive multi-classification for algorithm selection. In this paper, we investigate whether this approach for general algorithm selection can be used effectively for DCS. We introduce several modifications to make the method more suited for classifier selection. Then, we compare the approach with state-of-the-art DCS methods.
State-of-the-art DCS methods work by estimating the competence of the base classifiers for a given query sample and then selecting the base classifier with the highest competence. The competence is commonly estimated as follows:

1. For a given sample, a region of competence, i.e., a local neighborhood of training samples, is computed using either k-nearest neighbors (k-NN) or clustering methods.
2. Then, the competence level of each classifier is computed on the neighborhood, based on varying criteria such as the accuracy of the base classifiers, ranking, etc.

Prominent examples that realize the framework above are Local Class Accuracy (LCA) [7], Overall Local Accuracy (OLA) [7], A Priori [8], A Posteriori [8], and Multiple Classifier Behavior [9]:

OLA: In this approach, the competence of a base classifier is defined as the overall accuracy of the classifier in the local neighborhood. The local neighborhood is extracted using k-NN, where k is a tunable parameter. As in all other approaches realizing this framework, the classifier with the highest competence in the local neighborhood is chosen.

LCA: This method is similar to OLA, with the difference that it uses a notion of class-specific accuracy where only the accuracy of the class predicted by a classifier is considered.
A Priori (APR) and A Posteriori (APO): Both these methods use the "soft" conditional class probabilities output by the base classifiers, instead of the "hard" class predictions used in LCA and OLA, to compute the probability of correct classification. Note that this implies that APR and APO are as costly as stacking techniques. Both APR and APO use k-NN to define the local neighborhood. The key difference between APR and APO is that APR computes the competence of a base classifier without knowledge of the class predicted by the base classifier on the query sample. On the other hand, if the base classifier predicts class C for a query sample, APO computes the competence by limiting the computation to those samples in the local neighborhood with actual class C.

MCB: A concept called Behavioral Knowledge Space (BKS) is used to refine the k-NN neighborhood: Only samples with similar "output profiles" are kept in the local neighborhood, whereby the output profile of an input feature vector is the vector of predictions of all the base classifiers. Class-specific accuracy is then used as the competence score.

Methods that deviate slightly from the framework above select a subset of classifiers that perform well in terms of competence criteria on the neighborhood. A majority vote by the subset of classifiers is then conducted for the final prediction. Since the majority class must have at least one classifier that voted for it, we can select any such classifier, which is why these methods can also be viewed as dynamic classifier selection methods.

Examples of methods that realize this modified framework are k-Nearest Oracle Eliminate (KNORA-E), k-Nearest Oracle Union (KNORA-U) [10], and META-DES [11].

KNORA-E (KE): Given a local neighborhood created using k-NN for a query sample, all base classifiers with less than 100% accuracy on the neighborhood training samples are eliminated, whereby k is reduced until at least one classifier remains.
As with all other methods realizing this framework, a majority vote among the remaining classifiers is taken to arrive at the final prediction.

KNORA-U (KU): This method is similar to KNORA-E, with the difference that all classifiers which are correct for at least one training sample in the neighborhood are retained. Moreover, voting is weighted: The weight of the vote by a base classifier is equal to the number of neighborhood samples that it classifies correctly.

META-DES (MD): This method trains a meta-classification algorithm on a set of meta-features, where the meta-classes are "competent" or "incompetent," to select the set of classifiers that vote by simple majority on the final class. The meta-features include class-specific and overall accuracy in the local neighborhood, classifier probability, classifier consensus, and some others. The local neighborhood is extracted using BKS. To train the meta-classifier, whenever a classifier labels a training input correctly, it is labeled competent, and incompetent otherwise.

In our experiments, we also compare with the simplest (static) method used in multiple classification systems, whereby all the base classifiers are pooled and the output is obtained using the majority voting (MV) rule.

For further details, we refer to the very thorough discussion of various dynamic classifier and ensemble selection methods in [12].
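The common two-step framework above (k-NN region of competence, then local accuracy) can be sketched in a few lines. The function names and the callable-classifier interface below are our own illustration, not DESLIB's API; the competence criterion shown is the OLA one.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_ola(query, X_train, y_train, classifiers, k=3):
    """Return the index of the most competent classifier for `query`."""
    # Step 1: region of competence = the k nearest training samples.
    order = sorted(range(len(X_train)), key=lambda i: euclidean(query, X_train[i]))
    region = order[:k]

    # Step 2: competence = overall accuracy on the region; pick the best.
    def local_accuracy(clf):
        return sum(clf(X_train[i]) == y_train[i] for i in region) / k

    return max(range(len(classifiers)), key=lambda a: local_accuracy(classifiers[a]))
```

LCA would differ only in step 2, restricting the accuracy computation to neighborhood samples of the class the classifier predicts for the query.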
(We found that APO performed consistently worse than APR and therefore do not include results for APO in Table 3. We found that KU significantly outperforms KE, which is why we do not include results for KE in Table 3.)

The main objective of this paper is to study the effectiveness of "cost-sensitive hierarchical clustering" (CSHC) for DCS. The idea behind CSHC is simple: Recursively split a cluster of input samples such that the inputs within a partition can agree on one algorithm that shall be used to process all inputs in the respective partition. In the original CSHC paper, the authors experimented with different ways to split a cluster. In the end, it was found effective and simple to consider recursively splitting clusters by selecting one feature and an associated splitting value, and to put all examples that have a respective feature value lower than or equal to the splitting value in one sub-cluster, and the others in the other. That is to say, the final version of CSHC essentially builds a decision tree. However, it does not use entropy to determine splitting features and values. Instead, CSHC considers the overall performance when using a different, optimal algorithm for each partition, rather than the same algorithm on all examples in the parent cluster. The split that results in the best performance gain is then selected.

Note that performance can be any metric desired, from running time (which is typically the target in search and optimization), to optimality gap within a fixed time frame (a typical metric when tuning local search heuristics), to some other metric of quality. For the purpose of classifier selection, we will simply use the number of input samples that a classifier labels correctly, i.e., the method's accuracy.

Three hyper-parameters guide when the recursive splitting of clusters stops.
The first is a simple depth limit, the second a minimum number of samples that must remain in each cluster, and the last a minimum improvement that is expected from splitting a cluster.

As is the case with decision trees, it has been found beneficial to build more than one hierarchical clustering. Identically to how random forests work, in CSHC, for each new clustering, only a subset of features is allowed to be used to split the inputs, and a sub-sample (with replacement) is built from the total set of inputs to be clustered. Three hyper-parameters guide this process of ensembling clusterings: how many clusterings (trees) to construct, how many features are randomly selected to be used for splitting the sample set, and how often we sample the training inputs with replacement.

This concludes the description of the training phase of CSHC. When using the clusterings to choose an algorithm for a new input, we require a process that resolves conflicts between the recommendations from different clusterings. Various methods have been described in the original CSHC paper [6]. Here, we will limit ourselves to the idea of using the algorithm that has the best cumulative rank over all clusterings. That is, when we are given a new input at test time, we determine which cluster the input falls into for each of the clusterings. Then, we rank all algorithms for each cluster. We select the algorithm that has the best cumulative rank when summing up the ranks over all clusters. For further details on CSHC, please see [6].

Note how CSHC differs from existing DCS methods. Superficially, one might think that CSHC also builds a neighborhood and then selects the best performing classifier on that neighborhood. However, the way that neighborhood is constructed and how performance is assessed is very different.
First, the multiple hierarchical clusterings, built by sub-sampling the training samples with repetition, create neighborhoods (the multi-set of examples in the clusters the target feature vector is assigned to) that give different weights to different training examples by including samples as many times as they appear in the target clusters. Second, the clusterings are constructed not by considering unsupervised metric regions in the feature space, or regions where the original machine learning problem favors the same class, but by considering regions which are handled well by the same classifier. And finally, the performance is assessed by ranking classifiers on multiple clusters and picking the best, which is unlike how any other existing method determines the final selection.
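The split criterion that distinguishes CSHC from entropy-based trees can be illustrated as follows. This is our own sketch, not the authors' code: for classifier selection, the score of a cluster is the number of samples its single best classifier labels correctly, and a split is scored by letting each child pick its own best classifier.

```python
def cluster_score(correct, indices):
    """Score of a cluster: samples its single best classifier gets right.
    correct[a][i] is True iff classifier a labels sample i correctly."""
    return max(sum(row[i] for i in indices) for row in correct)

def split_gain(correct, indices, feature_values, threshold):
    """Gain of splitting on one feature at `threshold`, letting each
    child cluster pick its own best classifier."""
    left = [i for i in indices if feature_values[i] <= threshold]
    right = [i for i in indices if feature_values[i] > threshold]
    if not left or not right:  # degenerate split: no gain
        return 0
    return (cluster_score(correct, left) + cluster_score(correct, right)
            - cluster_score(correct, indices))
```

CSHC would evaluate this gain over candidate features and thresholds and keep the best split, whereas a standard decision tree would instead maximize the information gain of the class labels.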
CSHC can be applied to any algorithm selection problem and is hence directly applicable to DCS as well. However, certain aspects make classifier selection a special case of general algorithm selection. In this section, we discuss these differences and propose some modifications to the vanilla CSHC methodology.
The first particularity of DCS is the way the training data is generated. When building an algorithm selector for an optimization problem, for example, we simply run the various algorithms on each training instance and thereby gather the cost data needed to train the clusterings with CSHC. That is to say that, in other applications, the training instances used to train the selector usually have no influence on the algorithms in the portfolio.

When using an algorithm selector for classifier selection, this is not so clear anymore. There is a certain amount of labeled (with the classes of the original machine learning problem) data available, and this data needs to be used for training the base classifiers as well as the classifier selector (whereby the labels are used to determine the associated cost of each classifier). Obviously, the selector could be over-confident in a classifier if it only had access to cases where the classifier labels samples that were used to train the respective classifier. To circumvent this issue, we conduct a three-fold cross-validation. In each fold, we use two thirds of the training data to train a classifier; then we evaluate the classifier on the remaining third of the data. The cost labels generated for CSHC are then exclusively derived from the validation performances. Note that, in this way, we can use the entire training data for the generation of clusterings.
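The three-fold procedure can be sketched as below. The `train_fn` interface (a training routine that returns a predict-callable) is an assumption for illustration; only held-out predictions are used to produce the cost labels.

```python
def cv_cost_labels(X, y, train_fn, n_folds=3):
    """Per-sample correctness labels derived only from held-out folds.
    `train_fn(X_sub, y_sub)` must return a callable predictor."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    correct = [None] * n
    for f in range(n_folds):
        held_out = set(folds[f])
        train_idx = [i for i in range(n) if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in folds[f]:
            correct[i] = (model(X[i]) == y[i])  # validation performance only
    return correct
```

Running this once per base classifier yields a cost label for every training sample, so the whole training set can feed the clustering step without leaking any classifier's own training data into its assessment.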
Another difference is that, in general algorithm selection, we cannot always run all algorithms. Imagine a case where we need to choose the best scheduler for a given scheduling instance. 'Best' in this case usually means 'fastest.' Running all schedulers is obviously not an option; we have to choose one before we see the algorithm output.

In the context of ensemble learning, the situation may be different. Of course, there may be scenarios where running all classifiers is too costly, for example because of latency requirements or because it is simply too cost-prohibitive to run them all. In this case, we can simply use the cumulative ranking procedure from CSHC. We will report on the performance of this method in the experimental results.

In other cases, however, our prime concern is classification performance rather than computational cost. Then, we may want to run all classifiers and use the classification results as well as the original features to select a classifier (and the associated class this classifier labels the input with). In the following, we present methods for using this information in the context of CSHC.

One way we can use the labels produced by the different classifiers is by voting. To this end, each classifier is assigned a certain weight, and the class it labels the input with gets this weight added as support. We select one of the classifiers that labels the input with the class that has the most support. Among all classifiers that lend this support, we select the one with the largest weight (with ties broken randomly).

The question is what weight to assign to each classifier. We utilize the ranks that CSHC provides for this purpose. Particularly, we assign the cumulative rank over all clusterings (with better classifiers having higher rank) as the weight for each classifier. Note that the clusters considered are input-specific. Therefore, the weight each classifier is assigned changes dynamically from test sample to test sample.
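The rank-weighted vote can be sketched as follows. Names are illustrative; ranks use the convention 1 = best, and the cumulative weight inverts them so that better classifiers weigh more.

```python
def rank_weighted_selection(ranks_per_cluster, labels):
    """ranks_per_cluster[c][a]: rank of classifier a in cluster c (1 = best);
    labels[a]: class classifier a predicts for the query."""
    n = len(labels)
    worst = max(max(r) for r in ranks_per_cluster)
    # Cumulative weight over all clusters: lower ranks contribute more.
    weights = [sum(worst + 1 - r[a] for r in ranks_per_cluster) for a in range(n)]
    # Support of a class = total weight of the classifiers voting for it.
    support = {}
    for a in range(n):
        support[labels[a]] = support.get(labels[a], 0) + weights[a]
    best_class = max(support, key=support.get)
    # Among the supporters of the winning class, return the heaviest classifier.
    supporters = [a for a in range(n) if labels[a] == best_class]
    return max(supporters, key=lambda a: weights[a])
```

For simplicity the sketch breaks weight ties by index rather than randomly.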
Rather than using cumulative ranks as weights, we can also employ a more labor-intensive method and compute a set of weights that optimizes the performance over the multi-set of samples over all clusters the given feature vector is assigned to by CSHC. We propose to set up a linear program (LP) for this purpose.

Assume we are given the number $n \in \mathbb{N}$ of different classifiers in the ensemble, the number $C \in \mathbb{N}$ of classes, the set $E = \{e_1, \dots, e_k\}$ of unique samples in the union of all clusters the given feature vector falls into, the correct labels $y_i \in \{1, \dots, C\}$, as well as numbers $m_i \in \mathbb{N}$ for $1 \le i \le k$ that determine how often example $e_i \in E$ appears in the multi-set of samples returned by CSHC for the given feature vector. Finally, assume that, for each classifier $a \in \{1, \dots, n\}$ and each sample $e_i \in E$, we are given the label $l_{ai} \in \{1, \dots, C\}$.

The LP we set up has three sets of variables. First, for each classifier $a \in \{1, \dots, n\}$, a weight $0 \le w_a \le 100$. Moreover, for each unique example $e_i \in E$, we introduce two penalty variables $g_i, f_i \ge 0$. We impose the following constraints: First, the weight variables must sum to 100: $\sum_{a \le n} w_a = 100$. Next, for each unique example $e_i \in E$ and each class $c \in \{1, \dots, C\}$ with $c \ne y_i$, we add two constraints:
$$g_i + \sum_{a:\, l_{ai} = y_i} w_a - \sum_{b:\, l_{bi} = c} w_b \ \ge\ \gamma \qquad\text{and}\qquad f_i + \sum_{a:\, l_{ai} = y_i} w_a - \sum_{b:\, l_{bi} = c} w_b \ \ge\ 0.$$
Then, we solve the LP to obtain weights and penalties that minimize the total penalty $\sum_{i \le k} m_i (g_i + 2 f_i)$.

The LP aims to find a weighting for the classifiers such that the support for the correct label is at least $\gamma\%$ more than the maximal support for any other label over the multi-set of examples that was returned by CSHC for the given feature vector. When that is not possible, the LP will strive to have at least the largest support for the correct label, or to get as close to the largest support as possible.
For each example for which the weighted aggregate results in a class label that is correct, there is no penalty. Otherwise, the penalty is two times the gap of the total support for the wrong label minus the support for the correct label, plus whatever is needed to bring the gap between the support for the correct class and any other class to at least $\gamma$.

As previously, based on the weights obtained, we compute the class that has the most aggregate support and ultimately select the classifier that has the maximum weight among all that label the input with that maximally supported class.
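The objective of the LP is easiest to interpret by noting that, for any fixed weight vector, the optimal slacks $g_i$ and $f_i$ have a closed form. The sketch below (our own illustration, not the solver used in the paper) evaluates that penalty for given weights:

```python
def lp_penalty(weights, labels, y, multiplicity, n_classes, gamma):
    """Objective value of the LP for a fixed weight vector.
    labels[a][i] is the class classifier a assigns to unique sample i (0-based)."""
    total = 0.0
    for i, (yi, mi) in enumerate(zip(y, multiplicity)):
        correct = sum(w for a, w in enumerate(weights) if labels[a][i] == yi)
        g = f = 0.0
        for c in range(n_classes):
            if c == yi:
                continue
            wrong = sum(w for a, w in enumerate(weights) if labels[a][i] == c)
            margin = correct - wrong
            g = max(g, gamma - margin)  # shortfall to a winning margin of gamma
            f = max(f, -margin)         # shortfall to merely winning at all
        total += mi * (g + 2 * f)       # f is penalized twice as heavily
    return total
```

In the actual method, the weights themselves are decision variables and the LP minimizes this quantity; the sketch only makes the penalty semantics concrete.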
We now have three different selection methods: the original CSHC, which chooses the classifier with the highest cumulative rank over all clusters; rank-weighted voting; and finally optimizing the aggregation weights via linear programming. We will investigate each one of these methods in the numerical results section.

A more robust selection mechanism, inspired by [13], may be obtained by employing a process that considers how confident each of these classifier selection methods actually is. To this end, for the two methods introduced above, we consider the ratio between the support for the class that has the second largest support and the class that has the largest support (based on the respective ways to compute the weights of each classifier, either by cumulative rank or by solving the linear program). The lower this ratio (we refer to this parameter as ρ), the higher our confidence that this selection is correct.

We propose to use rank regression first, since it is computationally much cheaper than solving a linear program for each test sample. If the confidence in the rank selection is high, we return the classifier selected. If the confidence does not exceed the given threshold, we next compute the classifier selected by the LP weighting scheme. We assess the confidence in this method as well. If confidence is high enough, we return the respective classifier.

If confidence is also low for the LP-based weighting method, then we proceed as follows: If the class labels of the classifiers chosen by rank regression and LP-based weighting are the same, then we return the classifier whose respective selection method has higher confidence (note that the class labels may be the same even when the two methods choose a different classifier). If the classes are not the same, we next check if the classifier returned by the original CSHC labels the input with the same class as one of the other two classifiers. If so, we return that classifier.
If this also fails, which implies that all three selection methods select a different classifier and all three classifiers label the input with a different class, then we finally compute the dominant class label in the multi-set of training samples returned by CSHC. If one of the three classifiers provided by the three methods labels the input with that dominant class, then we choose this classifier. Otherwise, we return the classifier chosen by the LP-based weighting scheme.

In this section, we describe the experiments to quantify the performance of our methods as well as to compare it against competing methods.
We use 40 data sets from OpenML [14, 15] for our experiments. The details of the OpenML datasets used for numericalexperiments are given in Table 1.
For each benchmark, we built a set of 5 base classifiers: Naive Bayes (NB), Support Vector Classification (SVC), Perceptron, k-Nearest Neighbors (k-NN), and Decision Tree Classifier (DTC).

We use the methods reviewed in the related work section for our comparison, with the exception of A Posteriori (APO) and KNORA-E (KE), which we found were significantly outperformed by their respective sister methods, APR and KU. Instead, we include a simple majority vote (MV) on the neighborhood instances. We use the Python library DESLIB [16], which implements all competing methods.

For each method, DESLIB uses 50% of the training data to train the base classifiers, and the remaining training data to train the dynamic classifier selection method. To make the comparison fair, we use only 50% of the training data to create the clusterings in CSHC as well. Note that we limit CSHC in this way purely to level the playing field with the competing algorithms. In practice, one will want to use 100% of the training data, labeled in one, or possibly multiple, cross-validation(s), to create the cost-sensitive clusters.
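For reference, the static MV baseline reduces to a one-liner over the pooled predictions; this is a generic sketch, not DESLIB's implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Static MV baseline: the most common prediction among the pooled
    base classifiers wins (ties resolved by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]
```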
For Naive Bayes, we use class frequencies as priors. For the Support Vector Classification, we use an RBF kernel with C = 1 and γ = . For k-NN, we use a simple one-nearest neighbor classification (k = 1). For the Decision Tree Classifier, finally, we use the Gini index as branching metric, impose no depth limit, and perform no pruning.
CSHC has six hyper-parameters. We generate 50 trees using the random number generator from [17]. For each tree, we sample with repetition from the training set until we obtain a multi-set of samples which amounts to 80% of the total training set of unique samples. To create each tree, we use two times the square root of the number of features, chosen uniformly at random. The last three hyper-parameters determine when we stop the hierarchical refinement of clusters. First, we enforce that at least 2 samples remain within each cluster. Second, we limit the depth of the trees to be at most 15. And finally, we stop refining the clusters when the improvement by an additional split drops below 2%.

name             #features  #classes  #train  #test
heart-h                 13         2     196     98
credit-g                20         2     670    330
tic-tac-toe              9         2     641    317
kr-vs-kp                36         2    2141   1055
qsar-biodeg             41         2     706    349
wdbc                    30         2     381    188
phoneme                  5         2    3620   1784
diabetes                 8         2     514    254
ozone-level-8hr         72         2    1697    837
hill-valley            100         2     812    400
kc2                     21         2     349    173
eeg-eye-state           14         2   10036   4944
spambase                57         2    3082   1519
kc1                     21         2    1413    696
ilpd                    10         2     390    193
pc1                     21         2     743    366
pc3                     37         2    1047    516
mozilla4                 5         2   10415   5130
scene                  299         2    1612    795
musk                   167         2    4420   2178
letter                  16        26   13400   6600
nomao                  118         2   23091  11374
gina_agnostic          970         2    2323   1145
bank-marketing          16         2   30291  14920
isolet                 617        26    5223   2574
Bioresponse           1776         2    2513   1238
mfeat-fourier           76        10    1340    660
mfeat-factors          216        10    1340    660
pendigits               16        10    7364   3628
optdigits               64        10    3765   1855
vehicle                 18         4     566    280
cnae-9                 856         9     723    357
breast-w                 9         2     468    231
balance-scale            4         3     418    207
SpeedDating            120         2    5613   2765
eucalyptus              19         5     493    243
vowel                   12        11     663    327
credit-approval         15         2     462    228
splice                  60         3    2137   1053
cmc                      9         3     986    487

Table 1: OpenML datasets used for evaluation and comparison. We give the number of features, the number of classes, and the sizes of the training and test sets.
Figure 1: Predictions of different base classifiers, (a) k-NN, (b) SVC, (c) DT, as well as (d) CSHC-LPR, for the mozilla4 dataset. The lighter round markers and darker star markers indicate correct and incorrect predictions, respectively. In (d), CSHC-LPR selects a base classifier dynamically for each point. We use the same color coding as in (a)-(c) to show which classifier is selected.

For the LP-based weighting scheme, we set γ = 80. The recourse threshold ρ is set to 0.5, which means that we only trust a classifier selection method outright when the support for the highest ranked class is at least twice that of the second most supported class. Note that all these parameters are set to the same values for all benchmarks we consider in the experiments. Naturally, these hyper-parameters could be tuned for each benchmark individually, for example by means of a cross-validation. To demonstrate the effectiveness of the method proposed, we leave all CSHC parameters and the parameters for the modifications we introduced at the same default values for all benchmarks.

For all other selection methods, we use the DESLIB library defaults for all hyper-parameters [16].

The experiments for CSHC and its variants were performed on a 16-CPU cluster of 8-core, 2.60GHz Intel(R) Xeon(R) CPU E5-2670 with a 20MB cache size. IBM ILOG CPLEX 12.6.3 was the solver used to solve the linear programs. The algorithms were coded in C++ using the GCC 4.8.5 compiler on a Red Hat 4.8.5-4 operating system. The numerical experiments with the competing algorithms were performed on a 6-core, 2.71GHz Intel(R) Xeon(R) CPU E-2176M with 64MB RAM running a Windows operating system. The Python (3.7.7) library DESLIB v0.3 [16] was used to implement the competing algorithms.

We illustrate the DCS concept in Figure 1. We plot the test cases of the mozilla4 benchmark set as projected onto the two most significant principal components.
We mark the error cases in bold for k-NN, SVC, and DT (we omitted NB and Perceptron to save space and because CSHC-LPR hardly ever chooses them on this benchmark). The CSHC-LPR tile shows in which region which classifier is selected.
                  CSHC    RR    LP   LPR  Oracle
balance-scale     90.3  90.3  90.3  90.3  93.2
bank-marketing    89.7  89.5  89.4  89.5  96.5
Bioresponse       74.4  75.4  75.8  75.8  93.7
breast-w          97.4  97.0  97.4  97.0  98.7
cmc               52.2  51.1  49.9  51.3  85.2
cnae-9            86.6  88.0  89.1  88.0  96.6
credit-approval   86.0  89.0  88.6  88.6  95.6
credit-g          75.2  74.5  74.5  75.2  94.8
diabetes          74.8  76.4  74.8  75.2  89.8
eeg-eye-state     93.0  90.1  93.1  93.1  99.9
eucalyptus        58.8  62.6  60.5  61.7  86.4
gina_agnostic     89.3  88.6  89.3  89.3  98.0
heart-h           81.6  80.6  80.6  80.6  89.8
hill-valley       51.5  52.5  52.2  53.8  97.2
ilpd              70.5  71.0  72.5  72.5  99.0
isolet            94.8  94.6  94.9  94.9  98.6
kc1               84.9  86.1  85.2  85.1  94.0
kc2               80.3  83.8  81.5  83.2  91.9
kr-vs-kp          98.4  98.0  98.5  98.4  100.0
letter            93.0  93.3  93.4  93.4  97.5
mfeat-factors     95.8  96.4  96.5  96.2  98.8
mfeat-fourier     79.5  80.6  80.9  80.3  93.9
mozilla4          92.4  90.1  92.2  92.3  98.3
musk              100   100   100   100   100
nomao             95.5  95.7  95.8  95.7  99.2
optdigits         98.2  98.2  98.3  98.3  99.6
ozone-level-8hr   92.6  93.1  93.1  92.7  98.6
pc1               95.6  95.6  95.4  95.6  97.8
pc3               90.1  89.5  89.0  89.5  95.3
pendigits         99.3  99.2  99.3  99.3  99.7
phoneme           84.4  86.0  85.2  85.4  97.8
qsar-biodeg       84.2  84.2  85.7  85.1  96.6
scene             95.5  95.5  84.5  96.0  99.5
spambase          93.7  94.1  95.7  94.1  99.3
SpeedDating       85.1  85.4  94.1  85.4  97.0
splice            91.6  93.3  93.2  93.0  98.5
tic-tac-toe       85.8  86.1  86.8  87.1  97.5
vehicle           73.6  77.1  76.8  77.1  92.1
vowel             83.5  82.6  85.6  84.4  94.5
wdbc              97.9  96.3  97.9  98.4  99.5

Table 2: Accuracy (in %) of vanilla CSHC, the rank regression (RR) and LP-based weighting (LP) variants, the recourse process (LPR), and the oracle.
We begin our experimentation by comparing vanilla CSHC with the rank regression scheme (CSHC-RR) we introduced. Recall that CSHC selects the classifier that has the highest average rank over all clusters the test sample falls into. The rank regression modification we introduced, on the other hand, uses these average ranks as weights for the support each classifier gives to its favorite class.

The performances of the two methods are depicted in columns two and three of Table 2. Using our five very basic classifiers, we build ensembles using vanilla CSHC and with our newly introduced rank regression. Please note that the objective of our experiments is not to create the best approach for each benchmark in absolute terms, but to compare the relative performance of different classifier selection methods. In fact, exactly because our base classifiers are crude and relatively weak, the classifier selection is more challenging, which is the setting we strive for when comparing different DCS methods. If all base classifiers returned mostly the correct labels anyway, it would be much harder to assess the effectiveness of DCS methods.

We observe that, out of the 40 head-to-head comparisons, CSHC-RR wins 21 and loses 14, while on 5 benchmarks both methods perform equally well. This confirms our initial speculation that using the actual classifications of each classifier to select the top classifier gives an advantage. However, note that this additional performance comes at the cost of having to run all classifiers first. CSHC, on the other hand, selects one classifier based on the original features, and thus only requires one base classifier to run.
Next, we compare our new rank regression scheme with the more elaborate LP-based weighting, which requires solving a linear program for each test sample, thereby making this method rather computationally expensive. We can infer from Table 2 that the LP-based method performs with higher accuracy on 21 benchmarks while performing worse on only 14. Moreover, the average accuracy is slightly higher as well.
In Table 2, we also show the performance of the confidence-assessment-and-recourse process we introduced (CSHC-LPR). Recall from the earlier description that we run the rank regression first and, provided confidence is high, we move on directly with the classifier selected. Only when confidence is low do we employ the LP-based weighting method. If its confidence is high, we use the selected classifier; otherwise we consider the classifier with the highest average rank and eventually the majority class in the neighborhood to break the tie.

The last four rows in the table compare every other method with this one. In the last row, we highlight the average rank over all benchmarks and datasets. In the second-to-last row, we show
$$\mathrm{MGI}(X) \leftarrow \mathrm{GM}\!\left(\frac{\mathrm{Accuracy}_{\mathrm{CSHC\text{-}LPR}}}{\mathrm{Accuracy}_X}\right) - 1,$$
in percent, where X is the method compared with, and GM(·) is the geometric mean of a vector. Note that MGI(X) is the same as the ratio of the geometric mean of the accuracy of CSHC-LPR over the geometric mean of the accuracy of method X, minus 1. Therefore, a value greater than zero indicates that method X has lower accuracy than CSHC-LPR on average.

We observe that using confidence assessments helps make the selection more robust. CSHC-LPR outperforms all other CSHC variants, both in terms of wins/losses over the 40 benchmarks, as well as in terms of the accuracy comparison as measured by MGI. At the same time, since most test samples can be handled by rank regression with high confidence, the method works considerably faster than CSHC-LP. The largest benchmark is bank-marketing with 16 features and close to 15,000 train and test samples each. Generating the 50 clusterings takes 9 sequential CPU seconds (KU 4s, MD 10s), which is manageable, especially since the training of different trees can be parallelized easily with linear speed-up. On hill-valley, which has 100 features and around 400 train and test samples, CSHC takes 0.47 seconds to train (KU 0.04s, MD 0.29s), but therefore requires only 1.1 milliseconds to classify each sample (KU 75ms, MD 219ms).

The reason why CSHC-LPR is so efficient is that most test samples are handled by the fast rank regression technique, while KU and MD need to compute Euclidean distances to run k-NN in a high-dimensional space. On bank-marketing, e.g., only for 621, or 4.16%, of the test samples is confidence in the initial rank regression too low and the LP-based weighting method invoked. This means that, given the default threshold ρ = 0.5, in over 95% of the test cases the support of the top class is at least twice as high as that of the class with the second largest support, which in turn implies that the support for the top class is at least 66%.
This illustrates how effective CSHC is at partitioning the feature space in such a way that the choice of classifier is clear.
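The confidence-and-recourse cascade described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' actual implementation: the function names (`support_ratio_confident`, `select_classifier`), the support dictionaries, and the default `rho = 0.5` are all assumptions made for this sketch.

```python
# Hypothetical sketch of the CSHC-LPR selection cascade: rank regression
# first, LP-based weighting as recourse, then the best-average-rank
# classifier with a neighborhood-majority tie break.

def support_ratio_confident(support, rho=0.5):
    """Trust a selection when the runner-up's support is at most rho
    times the top support. With rho = 0.5, the top class has at least
    twice the support of the second, i.e. at least 66% of the top-2 mass."""
    ranked = sorted(support.values(), reverse=True)
    if len(ranked) < 2:
        return True
    return ranked[1] <= rho * ranked[0]

def select_classifier(x, rank_regression, lp_weighting,
                      avg_rank_best, neighborhood_majority):
    # 1) Fast path: rank-regression-based support for each candidate.
    support = rank_regression(x)              # candidate -> support
    if support_ratio_confident(support):
        return max(support, key=support.get)
    # 2) Recourse: LP-based convex weighting of the clusterings.
    support = lp_weighting(x)
    if support_ratio_confident(support):
        return max(support, key=support.get)
    # 3) Fall back to the candidate with the best average rank,
    #    breaking ties with the majority class in the neighborhood.
    return avg_rank_best(x, tie_break=neighborhood_majority)
```

Because step 1 resolves most test samples (over 95% on bank-marketing), the expensive LP is only invoked for the residual low-confidence cases.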
            APR   MCB   OLA   MV    MD    CSHC  KU    CSHC-LPR
[per-dataset accuracy rows for balance-*, bank-ma*, breast-w*, eucalypt,
 heart-h, ilpd, isolet*, kc1, letter*, mfeat-fac*, nomao, pc3, phoneme,
 SpeedDat*, splice, vehicle*, and vowel omitted: garbled in extraction]
losses/LPR  36    35    33    33    24    27    26    0
wins/LPR     4     4     7     7    14     6    13    0
MGI [%]     2.8   2.6   1.3   1.1   1.0   0.8   0.4   0
rank        2.3   2.6   4.1   4.6   5.2   5.3   5.6   6.4

Table 3: Comparing the accuracy of CSHC and CSHC-LPR with other state-of-the-art methods. The last rows give the number of wins/losses over CSHC-LPR, the MGI, and the average ranks over all methods and benchmarks (the higher the rank, the higher the relative performance). The names of the competing methods are given in the related work section.
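The MGI measure reported in the tables can be computed as follows. This is a minimal sketch with toy accuracy vectors standing in for the real per-benchmark results; the function names are ours.

```python
import math

def geometric_mean(v):
    """Geometric mean of a vector of positive numbers."""
    return math.exp(sum(math.log(a) for a in v) / len(v))

def mgi(acc_ref, acc_x):
    """MGI of the reference method over method X, in percent:
    GM(acc_ref / acc_x) - 1, which equals GM(acc_ref)/GM(acc_x) - 1."""
    ratios = [r / x for r, x in zip(acc_ref, acc_x)]
    return 100.0 * (geometric_mean(ratios) - 1.0)

# Toy example: three benchmarks, reference method slightly ahead.
acc_ref = [0.90, 0.85, 0.95]
acc_x   = [0.88, 0.84, 0.93]
```

A positive value indicates that the reference method (CSHC-LPR in the tables) has higher accuracy than method X on geometric average.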
Most of the overrides result in no change: in 318 cases, both *-RR and *-LPR choose a classifier that classifies the input correctly, and in 263 cases both select a classifier that errs on the respective input. In 15 cases, the override worsens the outcome: *-RR would have chosen a good classifier, but *-LPR chooses one that favors the wrong class. However, in 25 cases the initial rank regression would have chosen a bad classifier, and the recourse corrected that mistake by choosing a classifier that labels the input correctly. In total, the rank regression method errs on 1,573 out of 14,919 test samples; 288, or 18.3%, of these mistakes happen on cases for which the recourse is invoked. That implies that the error rate is over 4.4 times higher on recourse cases than on cases where we trust our primary rank regression method. This shows that our support-ratio recourse indicator is quite effective at identifying problematic cases. However, there is clearly room to improve the actual recourse, which only gives us a net gain of 10 test samples.
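The bookkeeping above can be verified with a few lines of arithmetic; all counts are taken directly from the text, and the variable names are ours.

```python
# Sanity check of the recourse statistics on bank-marketing.
both_right, both_wrong = 318, 263        # override changes nothing
override_hurts, override_helps = 15, 25  # override worsens / fixes outcome

recourse_cases = both_right + both_wrong + override_hurts + override_helps
# 318 + 263 + 15 + 25 = 621, matching the number of LP invocations.

total_samples, rr_errors = 14919, 1573
rr_errors_on_recourse = both_wrong + override_helps   # 288 mistakes
share = rr_errors_on_recourse / rr_errors             # ~18.3% of mistakes

rate_recourse = rr_errors_on_recourse / recourse_cases
rate_trusted = (rr_errors - rr_errors_on_recourse) / (total_samples - recourse_cases)
net_gain = override_helps - override_hurts            # 10 test samples
```

The recourse-case error rate works out to roughly 46% against about 9% on trusted cases, comfortably above the "over 4.4 times" claim.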
In Table 3, we compare the original CSHC and CSHC-LPR with prominent DCS methods from the literature, whereby KNORA-U and META-DES are widely regarded as the current state of the art. As before, in the last rows we show wins/losses and MGI when compared with CSHC-LPR. In the last row we also give the average rank over all 40 benchmarks when ranking all eight methods for each benchmark individually. This data confirms that KNORA-U and META-DES are outstanding dynamic classifier selectors. Surprisingly, we find that simple majority voting is a close runner-up to these sophisticated selection methods.

Regarding CSHC-LPR, we see that it compares very favorably with all other methods. Even KNORA-U, the strongest competitor in terms of average accuracy, is outperformed on 26 out of 40 benchmarks and tied on one (namely heart-h; note that we round accuracy in the table, hence two methods may sometimes appear to have the same performance when they do not. On the letter benchmark, for example, CSHC-LPR actually outperforms KU). Running a paired Student's t-test based on this win/loss data results in a p-value of 3.56% for the null hypothesis that both methods perform equally well, which allows us to reject this hypothesis with statistical significance at the commonly applied significance level of 5%.

Furthermore, we can observe from this table that the original, unmodified CSHC method is almost as good as the best DCS methods to date. Compared with KNORA-U, it performs better on 18 benchmarks and worse on 20, with 2 benchmarks tied. In practice, this makes CSHC a very attractive choice, since it does not require running all base classifiers, but only the one it selects. This means that an ensemble of classifiers can be used effectively for boosting accuracy without having to pay a significantly higher computational cost, which is attractive for keeping the energy and CO2 burden low in mass applications.
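The paper reports p = 3.56% from a paired t-test on the accuracies. As a stdlib-only illustration of how such win/loss records can be assessed, an exact binomial sign test (a different, more conservative test than the paired t-test, swapped in here because it needs no external libraries) can be applied to the 26-13 record against KNORA-U, with the single tie dropped:

```python
from math import comb

def sign_test_p(wins, losses):
    """Exact two-sided binomial sign test: probability of a win/loss
    split at least as extreme as observed under a fair coin, ties
    excluded."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# CSHC-LPR vs. KNORA-U: 26 wins, 13 losses over 40 benchmarks (1 tie).
p = sign_test_p(26, 13)
```

This yields p ≈ 0.053, just shy of the 5% level; the reported paired t-test, which also uses the magnitudes of the accuracy differences, is sharper at 3.56%.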
We studied the use of cost-sensitive hierarchical clustering (CSHC) for the purpose of dynamic classifier selection (DCS) in ensemble learning. We introduced two modifications of CSHC, one based on a rank-based weighting of classifications, the other using an input-specific linear programming formulation to compute a convex combination of classifications. We also introduced a confidence-assessment-and-recourse process to decide which selection method to trust. Experimental results on 40 established machine learning benchmarks with fixed hyper-parameters showed that the modified CSHC works robustly and compares favorably with various other DCS methods from the literature.

Our project dictated that we choose one out of a set of classifiers so as to inherit explanations from the base classifiers. It is worth noting, though, that CSHC can also be adjusted to aggregate the scores of individual classes over multiple classifiers, so that it presents an alternative to traditional stacking techniques as well. This, as well as augmenting the feature set for cost-sensitive hierarchical clustering, is part of our future work.
References

[1] Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[3] David H. Wolpert. Stacked generalization. Neural Networks, pages 241–259, 1992.
[4] Kevin Leyton-Brown, Eugene Nudelman, Galen Andrew, Jim McFadden, and Yoav Shoham. A portfolio approach to algorithm selection. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI'03, pages 1542–1543, San Francisco, CA, USA, August 2003. Morgan Kaufmann Publishers Inc.
[5] Balint, Belov, Heule, and Järvisalo. Proceedings of SAT Competition 2013: Solver and benchmark descriptions. University of Helsinki, Helsinki, Finland, 2013.
[6] Yuri Malitsky, Ashish Sabharwal, Horst Samulowitz, and Meinolf Sellmann. Algorithm portfolios based on cost-sensitive hierarchical clustering. In Francesca Rossi, editor, IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 608–614, 2013.
[7] K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997.
[8] G. Giacinto and F. Roli. Methods for dynamic classifier selection. In Proceedings of the 10th International Conference on Image Analysis and Processing, September 1999.
[9] Giorgio Giacinto and Fabio Roli. Dynamic classifier selection based on multiple classifier behaviour. Pattern Recognition, 34, September 2001.
[10] Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto, Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5), May 2008.
[11] Rafael M. O. Cruz, Robert Sabourin, George D. C. Cavalcanti, and Tsang Ing Ren. META-DES: A dynamic ensemble selection framework using meta-learning. Pattern Recognition, 48(5), May 2015.
[12] Rafael M. O. Cruz, Robert Sabourin, and George D. C. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41, May 2018.
[13] Carlos Ansótegui, Meinolf Sellmann, and Kevin Tierney. Self-configuring cost-sensitive hierarchical clustering with recourse. In Principles and Practice of Constraint Programming - 24th International Conference, CP 2018, Lille, France, August 27-31, 2018, Proceedings, Lecture Notes in Computer Science, pages 524–534. Springer, 2018.
[14] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, and Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490, November 2019.
[15] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15, June 2014.
[16] Rafael M. O. Cruz, Luiz G. Hafemann, Robert Sabourin, and George D. C. Cavalcanti. DESlib: A dynamic ensemble selection library in Python. Journal of Machine Learning Research, 21(8):1–5, 2020.
[17] Agner Fog. Pseudo random number generators, uniform and non-uniform distributions.