Cost-sensitive Hierarchical Clustering for Dynamic Classifier Selection
Meinolf Sellmann
GE Global Research [email protected]
Tapan Shah
GE Global Research [email protected]
December 21, 2020

Abstract
We consider the dynamic classifier selection (DCS) problem: Given an ensemble of classifiers, we are to choose which classifier to use depending on the particular input vector that we get to classify. The problem is a special case of the general algorithm selection problem where we have multiple different algorithms we can employ to process a given input. We investigate if a method developed for general algorithm selection named cost-sensitive hierarchical clustering (CSHC) is suited for DCS. We introduce some additions to the original CSHC method for the special case of choosing a classification algorithm and evaluate their impact on performance. We then compare with a number of state-of-the-art dynamic classifier selection methods. Our experimental results show that our modified CSHC algorithm compares favorably.
The idea of using more than one classifier to improve accuracy goes back to the basic theory of PAC learning [1] and boosting weak learners [2]. Often, we have multiple classifiers available to us, whereby these classifiers may be based on different concept classes or may themselves be ensembles. We could use cross-validation to determine the best classifier and deploy it. We may be able to do even better, though, if we choose dynamically, after seeing the feature input, which classifier to use. That is, rather than choosing one classifier and using it regardless of the input, we may use one classifier for one input and another for another. A method is needed to choose a classifier. This problem is known in the literature as dynamic classifier selection (DCS).

Of course, there are other ways to combine multiple classifiers, for example by using each classifier's support for each of the possible class labels and aggregating this information. This is the basic idea behind stacking [3]. Note that, in stacking, the final class label may not coincide with any of the classes chosen by any of the base classifiers in the ensemble, which gives this technique more flexibility and the potential to outperform dynamic classifier selection. However, the disadvantage of this more flexible aggregation method is that every classifier in the ensemble needs to score each possible class label. Another disadvantage of more elaborate aggregation schemes is that they make explaining the classification more challenging. When using dynamic classifier selection, we can inherit the explanation method from the base classifier. In this paper, we therefore limit ourselves to dynamic classifier selection.

A problem highly related to DCS has been identified in the satisfiability and optimization communities. It was found that different algorithmic approaches may solve the same problem instances in vastly different compute times.
The idea arose to choose which algorithm to employ only after the concrete instance to process is known [4]. These so-called "algorithm portfolios" have since led to massive improvements in our ability to solve extremely hard combinatorial satisfiability and optimization problems [5].

One method for selecting an algorithm out of a portfolio of algorithms was introduced in [6] and employs cost-sensitive multi-classification for algorithm selection. In this paper, we investigate whether this approach for general algorithm selection can be used effectively for DCS. We introduce several modifications to make the method more suited for classifier selection. Then, we compare the approach with state-of-the-art DCS methods.
State-of-the-art DCS methods work by estimating the competence of the base classifiers for a given query sample and then selecting the base classifier with the highest competence. The competence is commonly estimated as follows:

1. For a given sample, a region of competence, i.e., a local neighborhood of training samples, is computed using either k-nearest neighbors (k-NN) or clustering methods.
2. Then, the competence level of each classifier is computed on the neighborhood, based on varying criteria such as the accuracy of the base classifiers, ranking, etc.

Prominent examples that realize the framework above are Local Class Accuracy (LCA) [7], Overall Local Accuracy (OLA) [7], A Priori [8], A Posteriori [8], and Multiple Classifier Behavior [9]:

OLA: In this approach, the competence of a base classifier is defined as the overall accuracy of the classifier in the local neighborhood. The local neighborhood is extracted using k-NN, where k is a tunable parameter. As in all other approaches realizing this framework, the classifier with the highest competence in the local neighborhood is chosen.

LCA: This method is similar to OLA, with the difference that it uses a notion of class-specific accuracy where only the accuracy of the class predicted by a classifier is considered.
A Priori (APR) and A Posteriori (APO): Both these methods use the "soft" conditional class probabilities output by the base classifiers, instead of the "hard" class predictions used in LCA and OLA, to compute the probability of correct classification. Note that this implies that APR and APO are as costly as stacking techniques. Both APR and APO use k-NN to define the local neighborhood. The key difference between APR and APO is that APR computes the competence of a base classifier without knowledge of the class predicted by the base classifier on the query sample. On the other hand, if the base classifier predicts class C for a query sample, APO computes the competence by limiting the computation to those samples in the local neighborhood with actual class C.

MCB: A concept called Behavioral Knowledge Space (BKS) is used to refine the k-NN neighborhood: Only samples with similar "output profiles" are kept in the local neighborhood, whereby the output profile of an input feature vector is the vector of predictions of all the base classifiers. Class-specific accuracy is then used as the competence score.

Methods that deviate slightly from the framework above select a subset of classifiers that perform well in terms of competence criteria on the neighborhood. A majority vote by the subset of classifiers is then conducted for the final prediction. Since the majority class must have at least one classifier that voted for it, we can select any such classifier, which is why these methods can also be viewed as dynamic classifier selection methods.

Examples of methods that realize this modified framework are k-Nearest Oracle Eliminate (KNORA-E), k-Nearest Oracle Union (KNORA-U) [10], and META-DES [11].

KNORA-E (KE): Given a local neighborhood created using k-NN for a query sample, all base classifiers with less than 100% accuracy on the neighborhood training samples are eliminated, whereby k is reduced until at least one classifier remains.
As with all other methods realizing this framework, a majority vote among the remaining classifiers is taken to arrive at the final prediction.

KNORA-U (KU): This method is similar to KNORA-E, with the difference that all classifiers which are correct for at least one training sample in the neighborhood are retained. Moreover, voting is weighted: The weight of the vote by a base classifier is equal to the number of neighborhood samples that it classifies correctly.

META-DES (MD): This method trains a meta-classification algorithm on a set of meta-features, where the meta-classes are "competent" or "incompetent," to select the set of classifiers that vote by simple majority on the final class. The meta-features include class-specific and overall accuracy in the local neighborhood, classifier probability, classifier consensus, and some others. The local neighborhood is extracted using BKS. To train the meta-classifier, whenever a classifier labels a training input correctly, it is labeled competent, and incompetent otherwise.

In our experiments, we also compare with the simplest (static) method used in multiple classification systems, whereby all the base classifiers are pooled and the output is obtained using the majority voting (MV) rule.

For further details, we refer to the very thorough discussion of various dynamic classifier and ensemble selection methods in [12].
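The common two-step framework above (k-NN region of competence, then local accuracy) can be sketched in a few lines. The function names and the callable-classifier interface below are our own illustration, not DESLIB's API; the competence criterion shown is the OLA one.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_ola(query, X_train, y_train, classifiers, k=3):
    """Return the index of the most competent classifier for `query`."""
    # Step 1: region of competence = the k nearest training samples.
    order = sorted(range(len(X_train)), key=lambda i: euclidean(query, X_train[i]))
    region = order[:k]

    # Step 2: competence = overall accuracy on the region; pick the best.
    def local_accuracy(clf):
        return sum(clf(X_train[i]) == y_train[i] for i in region) / k

    return max(range(len(classifiers)), key=lambda a: local_accuracy(classifiers[a]))
```

LCA would differ only in step 2, restricting the accuracy computation to neighborhood samples of the class the classifier predicts for the query.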
(We found that APO performed consistently worse than APR and therefore do not include results for APO in Table 3. We found that KU significantly outperforms KE, which is why we do not include results for KE in Table 3.)

The main objective of this paper is to study the effectiveness of "cost-sensitive hierarchical clustering" (CSHC) for DCS. The idea behind CSHC is simple: Recursively split a cluster of input samples such that the inputs within a partition can agree on one algorithm that shall be used to process all inputs in the respective partition. In the original CSHC paper, the authors experimented with different ways to split a cluster. In the end, it was found effective and simple to consider recursively splitting clusters by selecting one feature and an associated splitting value, and to put all examples that have a respective feature value lower than or equal to the splitting value in one sub-cluster, and the others in the other. That is to say, the final version of CSHC essentially builds a decision tree. However, it does not use entropy to determine splitting features and values. Instead, CSHC considers the overall performance when using a different, optimal algorithm for each partition, rather than the same algorithm on all examples in the parent cluster. The split that results in the best performance gain is then selected.

Note that performance can be any metric desired, from running time (which is typically the target in search and optimization), to optimality gap within a fixed time frame (a typical metric when tuning local search heuristics), to some other metric of quality. For the purpose of classifier selection, we will simply use the number of input samples that a classifier labels correctly, i.e., the method's accuracy.

Three hyper-parameters guide when the recursive splitting of clusters stops.
The first is a simple depth limit, the second a minimum number of samples that must remain in each cluster, and the last a minimum improvement that is expected from splitting a cluster.

As is the case with decision trees, it has been found beneficial to build more than one hierarchical clustering. Identically to how random forests work, in CSHC, for each new clustering, only a subset of features is allowed to be used to split the inputs, and a sub-sample (with replacement) is built from the total set of inputs to be clustered. Three hyper-parameters guide this process of ensembling clusterings: how many clusterings (trees) to construct, how many features are randomly selected to be used for splitting the sample set, and how often we sample the training inputs with replacement.

This concludes the description of the training phase of CSHC. When using the clusterings to choose an algorithm for a new input, we require a process that resolves conflicts between the recommendations from different clusterings. Various methods have been described in the original CSHC paper [6]. Here, we will limit ourselves to the idea of using the algorithm that has the best cumulative rank over all clusterings. That is, when we are given a new input at test time, we determine which cluster the input falls into for each of the clusterings. Then, we rank all algorithms for each cluster. We select the algorithm that has the best cumulative rank when summing up the ranks over all clusters. For further details on CSHC, please see [6].

Note how CSHC differs from existing DCS methods. Superficially, one might think that CSHC also builds a neighborhood and then selects the best performing classifier on that neighborhood. However, the way that neighborhood is constructed and how performance is assessed is very different.
First, the multiple hierarchical clusterings, built by sub-sampling the training samples with repetition, create neighborhoods (the multi-set of examples in the clusters the target feature vector is assigned to) that give different weights to different training examples by including samples as many times as they appear in the target clusters. Second, the clusterings are constructed not by considering unsupervised metric regions in the feature space, or regions where the original machine learning problem favors the same class, but by considering regions which are handled well by the same classifier. And finally, the performance is assessed by ranking classifiers on multiple clusters and picking the best, which is unlike how any other existing method determines the final selection.
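The split criterion that distinguishes CSHC from entropy-based trees can be illustrated as follows. This is our own sketch, not the authors' code: for classifier selection, the score of a cluster is the number of samples its single best classifier labels correctly, and a split is scored by letting each child pick its own best classifier.

```python
def cluster_score(correct, indices):
    """Score of a cluster: samples its single best classifier gets right.
    correct[a][i] is True iff classifier a labels sample i correctly."""
    return max(sum(row[i] for i in indices) for row in correct)

def split_gain(correct, indices, feature_values, threshold):
    """Gain of splitting on one feature at `threshold`, letting each
    child cluster pick its own best classifier."""
    left = [i for i in indices if feature_values[i] <= threshold]
    right = [i for i in indices if feature_values[i] > threshold]
    if not left or not right:  # degenerate split: no gain
        return 0
    return (cluster_score(correct, left) + cluster_score(correct, right)
            - cluster_score(correct, indices))
```

CSHC would evaluate this gain over candidate features and thresholds and keep the best split, whereas a standard decision tree would instead maximize the information gain of the class labels.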
CSHC can be applied to any algorithm selection problem and is hence directly applicable to DCS as well. However, certain aspects make classifier selection a special case of general algorithm selection. In this section, we discuss these differences and propose some modifications to the vanilla CSHC methodology.
The first particularity of DCS is the way the training data is generated. When building an algorithm selector for an optimization problem, for example, we simply run the various algorithms on each training instance and thereby gather the cost data needed to train the clusterings with CSHC. That is to say that, in other applications, the training instances used to train the selector usually have no influence on the algorithms in the portfolio.

When using an algorithm selector for classifier selection, this is not so clear anymore. There is a certain amount of labeled (with the classes of the original machine learning problem) data available, and this data needs to be used for training the base classifiers as well as the classifier selector (whereby the labels are used to determine the associated cost of each classifier). Obviously, the selector could be over-confident in a classifier if it only had access to cases where the classifier labels samples that were used to train the respective classifier. To circumvent this issue, we conduct a three-fold cross-validation. In each fold, we use two thirds of the training data to train a classifier; then we evaluate the classifier on the remaining third of the data. The cost labels generated for CSHC are then exclusively derived from the validation performances. Note that, in this way, we can use the entire training data for the generation of clusterings.
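The three-fold procedure can be sketched as below. The `train_fn` interface (a training routine that returns a predict-callable) is an assumption for illustration; only held-out predictions are used to produce the cost labels.

```python
def cv_cost_labels(X, y, train_fn, n_folds=3):
    """Per-sample correctness labels derived only from held-out folds.
    `train_fn(X_sub, y_sub)` must return a callable predictor."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    correct = [None] * n
    for f in range(n_folds):
        held_out = set(folds[f])
        train_idx = [i for i in range(n) if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in folds[f]:
            correct[i] = (model(X[i]) == y[i])  # validation performance only
    return correct
```

Running this once per base classifier yields a cost label for every training sample, so the whole training set can feed the clustering step without leaking any classifier's own training data into its assessment.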
Another difference is that, in general algorithm selection, we cannot always run all algorithms. Imagine a case where we need to choose the best scheduler for a given scheduling instance. 'Best' in this case usually means 'fastest.' Running all schedulers is obviously not an option; we have to choose one before we see the algorithm output.

In the context of ensemble learning, the situation may be different. Of course, there may be scenarios where running all classifiers is too costly, for example because of latency requirements or because it is simply too cost-prohibitive to run them all. In this case, we can simply use the cumulative ranking procedure from CSHC. We will report on the performance of this method in the experimental results.

In other cases, however, our prime concern is classification performance rather than computational cost. Then, we may want to run all classifiers and use the classification results as well as the original features to select a classifier (and the associated class this classifier labels the input with). In the following, we present methods for using this information in the context of CSHC.

One way we can use the labels produced by the different classifiers is by voting. To this end, each classifier is assigned a certain weight, and the class it labels the input with gets this weight added as support. We select one of the classifiers that labels the input with the class that has the most support. Among all classifiers that lend this support, we select the one with the largest weight (with ties broken randomly).

The question is what weight to assign to each classifier. We utilize the ranks that CSHC provides for this purpose. Particularly, we assign the cumulative rank over all clusterings (with better classifiers having higher rank) as the weight for each classifier. Note that the clusters considered are input-specific. Therefore, the weight each classifier is assigned changes dynamically from test sample to test sample.
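The rank-weighted vote can be sketched as follows. Names are illustrative; ranks use the convention 1 = best, and the cumulative weight inverts them so that better classifiers weigh more.

```python
def rank_weighted_selection(ranks_per_cluster, labels):
    """ranks_per_cluster[c][a]: rank of classifier a in cluster c (1 = best);
    labels[a]: class classifier a predicts for the query."""
    n = len(labels)
    worst = max(max(r) for r in ranks_per_cluster)
    # Cumulative weight over all clusters: lower ranks contribute more.
    weights = [sum(worst + 1 - r[a] for r in ranks_per_cluster) for a in range(n)]
    # Support of a class = total weight of the classifiers voting for it.
    support = {}
    for a in range(n):
        support[labels[a]] = support.get(labels[a], 0) + weights[a]
    best_class = max(support, key=support.get)
    # Among the supporters of the winning class, return the heaviest classifier.
    supporters = [a for a in range(n) if labels[a] == best_class]
    return max(supporters, key=lambda a: weights[a])
```

For simplicity the sketch breaks weight ties by index rather than randomly.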
Rather than using cumulative ranks as weights, we can also employ a more labor-intensive method and compute a set of weights that optimizes the performance over the multi-set of samples over all clusters the given feature vector is assigned to by CSHC. We propose to set up a linear program (LP) for this purpose.

Assume we are given the number $n \in \mathbb{N}$ of different classifiers in the ensemble, the number $C \in \mathbb{N}$ of classes, the set $E = \{e_1, \dots, e_k\}$ of unique samples in the union of all clusters the given feature vector falls into, the correct labels $y_i \in \{1, \dots, C\}$, as well as numbers $m_i \in \mathbb{N}$ for $1 \le i \le k$ that determine how often example $e_i \in E$ appears in the multi-set of samples returned by CSHC for the given feature vector. Finally, assume that, for each classifier $a \in \{1, \dots, n\}$ and each sample $e_i \in E$, we are given the label $l_{ai} \in \{1, \dots, C\}$.

The LP we set up has three sets of variables. First, for each classifier $a \in \{1, \dots, n\}$, a weight $0 \le w_a \le 100$. Moreover, for each unique example $e_i \in E$, we introduce two penalty variables $g_i, f_i \ge 0$. We impose the following constraints: First, the weight variables must sum to 100: $\sum_{a \le n} w_a = 100$. Next, for each unique example $e_i \in E$ and each class $c \in \{1, \dots, C\}$ with $c \ne y_i$, we add two constraints:
$$g_i + \sum_{a:\, l_{ai} = y_i} w_a - \sum_{b:\, l_{bi} = c} w_b \ \ge\ \gamma \qquad\text{and}\qquad f_i + \sum_{a:\, l_{ai} = y_i} w_a - \sum_{b:\, l_{bi} = c} w_b \ \ge\ 0.$$
Then, we solve the LP to obtain weights and penalties that minimize the total penalty $\sum_{i \le k} m_i (g_i + 2 f_i)$.

The LP aims to find a weighting for the classifiers such that the support for the correct label is at least $\gamma\%$ more than the maximal support for any other label over the multi-set of examples that was returned by CSHC for the given feature vector. When that is not possible, the LP will strive to have at least the largest support for the correct label, or to get as close to the largest support as possible.
For each example for which the weighted aggregate results in a class label that is correct, there is no penalty. Otherwise, the penalty is two times the gap of the total support for the wrong label minus the support for the correct label, plus whatever is needed to bring the gap between the support for the correct class and any other class to at least $\gamma$.

As previously, based on the weights obtained, we compute the class that has the most aggregate support and ultimately select the classifier that has the maximum weight among all that label the input with that maximally supported class.
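The objective of the LP is easiest to interpret by noting that, for any fixed weight vector, the optimal slacks $g_i$ and $f_i$ have a closed form. The sketch below (our own illustration, not the solver used in the paper) evaluates that penalty for given weights:

```python
def lp_penalty(weights, labels, y, multiplicity, n_classes, gamma):
    """Objective value of the LP for a fixed weight vector.
    labels[a][i] is the class classifier a assigns to unique sample i (0-based)."""
    total = 0.0
    for i, (yi, mi) in enumerate(zip(y, multiplicity)):
        correct = sum(w for a, w in enumerate(weights) if labels[a][i] == yi)
        g = f = 0.0
        for c in range(n_classes):
            if c == yi:
                continue
            wrong = sum(w for a, w in enumerate(weights) if labels[a][i] == c)
            margin = correct - wrong
            g = max(g, gamma - margin)  # shortfall to a winning margin of gamma
            f = max(f, -margin)         # shortfall to merely winning at all
        total += mi * (g + 2 * f)       # f is penalized twice as heavily
    return total
```

In the actual method, the weights themselves are decision variables and the LP minimizes this quantity; the sketch only makes the penalty semantics concrete.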
We now have three different selection methods: the original CSHC, which chooses the classifier with the highest cumulative rank over all clusters; rank-weighted voting; and finally optimizing the aggregation weights via linear programming. We will investigate each one of these methods in the numerical results section.

A more robust selection mechanism, inspired by [13], may be obtained by employing a process that considers how confident each of these classifier selection methods actually is. To this end, for the two methods introduced above, we consider the ratio between the support for the class that has the second largest support and the class that has the largest support (based on the respective ways to compute the weights of each classifier, either by cumulative rank or by solving the linear program). The lower this ratio (we refer to this parameter as ρ), the higher our confidence that this selection is correct.

We propose to use rank regression first, since it is computationally much cheaper than solving a linear program for each test sample. If the confidence in the rank selection is high, we return the classifier selected. If the confidence does not exceed the given threshold, we next compute the classifier selected by the LP weighting scheme. We assess the confidence in this method as well. If confidence is high enough, we return the respective classifier.

If confidence is also low for the LP-based weighting method, then we proceed as follows: If the class labels of the classifiers chosen by rank regression and LP-based weighting are the same, then we return the classifier whose respective selection method has higher confidence (note that the class labels may be the same even when the two methods choose a different classifier). If the classes are not the same, we next check if the classifier returned by the original CSHC labels the input with the same class as one of the other two classifiers. If so, we return that classifier.
If this also fails, which implies that all three selection methods select a different classifier and all three classifiers label the input with a different class, then we finally compute the dominant class label in the multi-set of training samples returned by CSHC. If one of the three classifiers provided by the three methods labels the input with that dominant class, then we choose this classifier. Otherwise, we return the classifier chosen by the LP-based weighting scheme.

In this section, we describe the experiments to quantify the performance of our methods as well as to compare it against competing methods.
We use 40 data sets from OpenML [14, 15] for our experiments. The details of the OpenML datasets used for numericalexperiments are given in Table 1.
For each benchmark, we built a set of 5 base classifiers: Naive Bayes (NB), Support Vector Classification (SVC), Perceptron, k-Nearest Neighbors (k-NN), and Decision Tree Classifier (DTC).

We use the methods reviewed in the related work section for our comparison, with the exception of A Posteriori (APO) and KNORA-E (KE), which we found were significantly outperformed by their respective sister methods, APR and KU. Instead, we include a simple majority vote (MV) on the neighborhood instances. We use the Python library DESLIB [16], which implements all competing methods.

For each method, DESLIB uses 50% of the training data to train the base classifiers, and the remaining training data to train the dynamic classifier selection method. To make the comparison fair, we use only 50% of the training data to create the clusterings in CSHC as well. Note that we limit CSHC in this way purely to level the playing field with the competing algorithms. In practice, one will want to use 100% of the training data, labeled in one, or possibly multiple, cross-validation(s), to create the cost-sensitive clusters.
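For reference, the static MV baseline reduces to a one-liner over the pooled predictions; this is a generic sketch, not DESLIB's implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Static MV baseline: the most common prediction among the pooled
    base classifiers wins (ties resolved by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]
```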
For Naive Bayes, we use class frequencies as priors. For the Support Vector Classification, we use an RBF kernel with C = 1 and γ = . For k-NN, we use a simple one-nearest neighbor classification (k = 1). For the Decision Tree Classifier, finally, we use the Gini index as branching metric, impose no depth limit, and perform no pruning.
CSHC has six hyper-parameters. We generate 50 trees using the random number generator from [17]. For each tree, we sample with repetition from the training set until we obtain a multi-set of samples which amounts to 80% of the total training set of unique samples. To create each tree, we use two times the square root of the number of features, chosen uniformly at random. The last three hyper-parameters determine when we stop the hierarchical refinement of clusters. First, we enforce that at least 2 samples remain within each cluster. Second, we limit the depth of the trees to be at most 15. And finally, we stop refining the clusters when the improvement by an additional split drops below 2%.

name             #features  #classes  #train  #test
heart-h                 13         2     196     98
credit-g                20         2     670    330
tic-tac-toe              9         2     641    317
kr-vs-kp                36         2    2141   1055
qsar-biodeg             41         2     706    349
wdbc                    30         2     381    188
phoneme                  5         2    3620   1784
diabetes                 8         2     514    254
ozone-level-8hr         72         2    1697    837
hill-valley            100         2     812    400
kc2                     21         2     349    173
eeg-eye-state           14         2   10036   4944
spambase                57         2    3082   1519
kc1                     21         2    1413    696
ilpd                    10         2     390    193
pc1                     21         2     743    366
pc3                     37         2    1047    516
mozilla4                 5         2   10415   5130
scene                  299         2    1612    795
musk                   167         2    4420   2178
letter                  16        26   13400   6600
nomao                  118         2   23091  11374
gina_agnostic          970         2    2323   1145
bank-marketing          16         2   30291  14920
isolet                 617        26    5223   2574
Bioresponse           1776         2    2513   1238
mfeat-fourier           76        10    1340    660
mfeat-factors          216        10    1340    660
pendigits               16        10    7364   3628
optdigits               64        10    3765   1855
vehicle                 18         4     566    280
cnae-9                 856         9     723    357
breast-w                 9         2     468    231
balance-scale            4         3     418    207
SpeedDating            120         2    5613   2765
eucalyptus              19         5     493    243
vowel                   12        11     663    327
credit-approval         15         2     462    228
splice                  60         3    2137   1053
cmc                      9         3     986    487

Table 1: OpenML datasets used for evaluation and comparison. We give the number of features, the number of classes, and the sizes of the training and test sets.
Figure 1: Predictions of different base classifiers, (a) k-NN, (b) SVC, (c) DT, as well as (d) CSHC-LPR, for the mozilla4 dataset. The lighter round markers and darker star markers indicate correct and incorrect predictions, respectively. In (d), CSHC-LPR selects a base classifier dynamically for each point. We use the same color coding as in (a)-(c) to show which classifier is selected.

For the LP-based weighting scheme, we set γ = 80. The recourse threshold ρ is set to 0.5, which means that we only trust a classifier selection method outright when the support for the highest ranked class is at least twice that of the second most supported class. Note that all these parameters are set to the same values for all benchmarks we consider in the experiments. Naturally, these hyper-parameters could be tuned for each benchmark individually, for example by means of a cross-validation. To demonstrate the effectiveness of the method proposed, we leave all CSHC parameters and the parameters for the modifications we introduced at the same default values for all benchmarks.

For all other selection methods, we use the DESLIB library defaults for all hyper-parameters [16].

The experiments for CSHC and its variants were performed on a 16-CPU cluster of 8-core, 2.60GHz Intel(R) Xeon(R) CPU E5-2670 with a 20MB cache size. IBM ILOG CPLEX 12.6.3 was the solver used to solve the linear programs. The algorithms were coded in C++ using the GCC 4.8.5 compiler on a Red Hat 4.8.5-4 operating system. The numerical experiments with the competing algorithms were performed on a 6-core, 2.71GHz Intel(R) Xeon(R) CPU E-2176M with 64MB RAM running a Windows operating system. The Python (3.7.7) library DESLIB v0.3 [16] was used to implement the competing algorithms.

We illustrate the DCS concept in Figure 1. We plot the test cases of the mozilla4 benchmark set as projected onto the two most significant principal components.
We mark the error cases in bold for k-NN, SVC, and DT (we omitted NB and Perceptron to save space and because CSHC-LPR hardly ever chooses them on this benchmark). The CSHC-LPR tile shows in which region which classifier is selected.
                  CSHC    RR    LP   LPR  Oracle
balance-scale     90.3  90.3  90.3  90.3  93.2
bank-marketing    89.7  89.5  89.4  89.5  96.5
Bioresponse       74.4  75.4  75.8  75.8  93.7
breast-w          97.4  97.0  97.4  97.0  98.7
cmc               52.2  51.1  49.9  51.3  85.2
cnae-9            86.6  88.0  89.1  88.0  96.6
credit-approval   86.0  89.0  88.6  88.6  95.6
credit-g          75.2  74.5  74.5  75.2  94.8
diabetes          74.8  76.4  74.8  75.2  89.8
eeg-eye-state     93.0  90.1  93.1  93.1  99.9
eucalyptus        58.8  62.6  60.5  61.7  86.4
gina_agnostic     89.3  88.6  89.3  89.3  98.0
heart-h           81.6  80.6  80.6  80.6  89.8
hill-valley       51.5  52.5  52.2  53.8  97.2
ilpd              70.5  71.0  72.5  72.5  99.0
isolet            94.8  94.6  94.9  94.9  98.6
kc1               84.9  86.1  85.2  85.1  94.0
kc2               80.3  83.8  81.5  83.2  91.9
kr-vs-kp          98.4  98.0  98.5  98.4  100.0
letter            93.0  93.3  93.4  93.4  97.5
mfeat-factors     95.8  96.4  96.5  96.2  98.8
mfeat-fourier     79.5  80.6  80.9  80.3  93.9
mozilla4          92.4  90.1  92.2  92.3  98.3
musk              100   100   100   100   100
nomao             95.5  95.7  95.8  95.7  99.2
optdigits         98.2  98.2  98.3  98.3  99.6
ozone-level-8hr   92.6  93.1  93.1  92.7  98.6
pc1               95.6  95.6  95.4  95.6  97.8
pc3               90.1  89.5  89.0  89.5  95.3
pendigits         99.3  99.2  99.3  99.3  99.7
phoneme           84.4  86.0  85.2  85.4  97.8
qsar-biodeg       84.2  84.2  85.7  85.1  96.6
scene             95.5  95.5  84.5  96.0  99.5
spambase          93.7  94.1  95.7  94.1  99.3
SpeedDating       85.1  85.4  94.1  85.4  97.0
splice            91.6  93.3  93.2  93.0  98.5
tic-tac-toe       85.8  86.1  86.8  87.1  97.5
vehicle           73.6  77.1  76.8  77.1  92.1
vowel             83.5  82.6  85.6  84.4  94.5
wdbc              97.9  96.3  97.9  98.4  99.5

Table 2: Accuracy (in %) of vanilla CSHC, the rank regression (RR) and LP-based weighting (LP) variants, the recourse process (LPR), and the oracle.
We begin our experimentation by comparing vanilla CSHC with the rank regression scheme (CSHC-RR) we introduced. Recall that CSHC selects the classifier that has the highest average rank over all clusters the test sample falls into. The rank regression modification we introduced, on the other hand, uses these average ranks as weights for the support each classifier gives to its favorite class.

The performances of the two methods are depicted in columns two and three of Table 2. Using our five very basic classifiers, we build ensembles using vanilla CSHC and with our newly introduced rank regression. Please note that the objective of our experiments is not to create the best approach for each benchmark in absolute terms, but to compare the relative performance of different classifier selection methods. In fact, exactly because our base classifiers are crude and relatively weak, the classifier selection is more challenging, which is the setting we strive for when comparing different DCS methods. If all base classifiers returned mostly the correct labels anyway, it would be much harder to assess the effectiveness of DCS methods.

We observe that, out of the 40 head-to-head comparisons, CSHC-RR wins 21 and loses 14, while on 5 benchmarks both methods perform equally well. This confirms our initial speculation that using the actual classifications of each classifier to select the top classifier gives an advantage. However, note that this additional performance comes at the cost of having to run all classifiers first. CSHC, on the other hand, selects one classifier based on the original features, and thus only requires one base classifier to run.
Next, we compare our new rank regression scheme with the more elaborate LP-based weighting, which requires solving a linear program for each test sample, thereby making this method rather computationally expensive. We can infer from Table 2 that the LP-based method performs with higher accuracy on 21 benchmarks while performing worse on only 14. Moreover, the average accuracy is slightly higher as well.
In Table 2, we also show the performance of the confidence-assessment-and-recourse process we introduced (CSHC-LPR). Recall from the earlier description that we run the rank regression first and, provided confidence is high, we move on directly with the classifier selected. Only when confidence is low do we employ the LP-based weighting method. If its confidence is high, we use the selected classifier; otherwise we consider the classifier with the highest average rank and eventually the majority class in the neighborhood to break the tie.

The last four rows in the table compare every other method with this one. In the last row, we highlight the average rank over all benchmarks and datasets. In the second-to-last row, we show
$$\mathrm{MGI}(X) \leftarrow \mathrm{GM}\!\left(\frac{\mathrm{Accuracy}_{\mathrm{CSHC\text{-}LPR}}}{\mathrm{Accuracy}_X}\right) - 1,$$
in percent, where X is the method compared with, and GM(·) is the geometric mean of a vector. Note that MGI(X) is the same as the ratio of the geometric mean of the accuracy of CSHC-LPR over the geometric mean of the accuracy of method X, minus 1. Therefore, a value greater than zero indicates that method X has lower accuracy than CSHC-LPR on average.

We observe that using confidence assessments helps make the selection more robust. CSHC-LPR outperforms all other CSHC variants, both in terms of wins/losses over the 40 benchmarks, as well as in terms of the accuracy comparison as measured by MGI. At the same time, since most test samples can be handled by rank regression with high confidence, the method works considerably faster than CSHC-LP. The largest benchmark is bank-marketing with 16 features and close to 15,000 train and test samples each. Generating the 50 clusterings takes 9 sequential CPU seconds (KU 4s, MD 10s), which is manageable, especially since the training of different trees can be parallelized easily with linear speed-up. On hill-valley, which has 100 features and around 400 train and test samples, CSHC takes 0.47 seconds to train (KU 0.04s, MD 0.29s), but therefore requires only 1.1 milliseconds to classify each sample (KU 75ms, MD 219ms).

The reason why CSHC-LPR is so efficient is that most test samples are handled by the fast rank regression technique, while KU and MD need to compute Euclidean distances to run k-NN in a high-dimensional space. On bank-marketing, e.g., only for 621, or 4.16%, of the test samples is confidence in the initial rank regression too low and the LP-based weighting method invoked. This means that, given the default threshold ρ = 0.5, in over 95% of the test cases the support of the top class is at least twice as high as that of the class with the second largest support, which in turn implies that the support for the top class is at least 66%.
This illustrates how effective CSHC is at partitioning the feature space in such a way that the choice of classifier is clear.
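The confidence-and-recourse cascade described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' actual implementation: the function names (`support_ratio_confident`, `select_classifier`), the support dictionaries, and the default `rho = 0.5` are all assumptions made for this sketch.

```python
# Hypothetical sketch of the CSHC-LPR selection cascade: rank regression
# first, LP-based weighting as recourse, then the best-average-rank
# classifier with a neighborhood-majority tie break.

def support_ratio_confident(support, rho=0.5):
    """Trust a selection when the runner-up's support is at most rho
    times the top support. With rho = 0.5, the top class has at least
    twice the support of the second, i.e. at least 66% of the top-2 mass."""
    ranked = sorted(support.values(), reverse=True)
    if len(ranked) < 2:
        return True
    return ranked[1] <= rho * ranked[0]

def select_classifier(x, rank_regression, lp_weighting,
                      avg_rank_best, neighborhood_majority):
    # 1) Fast path: rank-regression-based support for each candidate.
    support = rank_regression(x)              # candidate -> support
    if support_ratio_confident(support):
        return max(support, key=support.get)
    # 2) Recourse: LP-based convex weighting of the clusterings.
    support = lp_weighting(x)
    if support_ratio_confident(support):
        return max(support, key=support.get)
    # 3) Fall back to the candidate with the best average rank,
    #    breaking ties with the majority class in the neighborhood.
    return avg_rank_best(x, tie_break=neighborhood_majority)
```

Because step 1 resolves most test samples (over 95% on bank-marketing), the expensive LP is only invoked for the residual low-confidence cases.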
            APR   MCB   OLA   MV    MD    CSHC  KU    CSHC-LPR
[per-dataset accuracy rows for balance-*, bank-ma*, breast-w*, eucalypt,
 heart-h, ilpd, isolet*, kc1, letter*, mfeat-fac*, nomao, pc3, phoneme,
 SpeedDat*, splice, vehicle*, and vowel omitted: garbled in extraction]
losses/LPR  36    35    33    33    24    27    26    0
wins/LPR     4     4     7     7    14     6    13    0
MGI [%]     2.8   2.6   1.3   1.1   1.0   0.8   0.4   0
rank        2.3   2.6   4.1   4.6   5.2   5.3   5.6   6.4

Table 3: Comparing the accuracy of CSHC and CSHC-LPR with other state-of-the-art methods. The last rows give the number of wins/losses over CSHC-LPR, the MGI, and the average ranks over all methods and benchmarks (the higher the rank, the higher the relative performance). The names of the competing methods are given in the related work section.
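The MGI measure reported in the tables can be computed as follows. This is a minimal sketch with toy accuracy vectors standing in for the real per-benchmark results; the function names are ours.

```python
import math

def geometric_mean(v):
    """Geometric mean of a vector of positive numbers."""
    return math.exp(sum(math.log(a) for a in v) / len(v))

def mgi(acc_ref, acc_x):
    """MGI of the reference method over method X, in percent:
    GM(acc_ref / acc_x) - 1, which equals GM(acc_ref)/GM(acc_x) - 1."""
    ratios = [r / x for r, x in zip(acc_ref, acc_x)]
    return 100.0 * (geometric_mean(ratios) - 1.0)

# Toy example: three benchmarks, reference method slightly ahead.
acc_ref = [0.90, 0.85, 0.95]
acc_x   = [0.88, 0.84, 0.93]
```

A positive value indicates that the reference method (CSHC-LPR in the tables) has higher accuracy than method X on geometric average.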
Most of the overrides result in no change: in 318 cases, both *-RR and *-LPR choose a classifier that classifies the input correctly, and in 263 cases both select a classifier that errs on the respective input. In 15 cases, the override worsens the outcome: *-RR would have chosen a good classifier, but *-LPR chooses one that favors the wrong class. However, in 25 cases the initial rank regression would have chosen a bad classifier, and the recourse corrected that mistake by choosing a classifier that labels the input correctly. In total, the rank regression method errs on 1,573 out of 14,919 test samples; 288, or 18.3%, of these mistakes happen on cases for which the recourse is invoked. That implies that the error rate is over 4.4 times higher on recourse cases than on cases where we trust our primary rank regression method. This shows that our support-ratio recourse indicator is quite effective at identifying problematic cases. However, there is clearly room to improve the actual recourse, which only gives us a net gain of 10 test samples.
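The bookkeeping above can be verified with a few lines of arithmetic; all counts are taken directly from the text, and the variable names are ours.

```python
# Sanity check of the recourse statistics on bank-marketing.
both_right, both_wrong = 318, 263        # override changes nothing
override_hurts, override_helps = 15, 25  # override worsens / fixes outcome

recourse_cases = both_right + both_wrong + override_hurts + override_helps
# 318 + 263 + 15 + 25 = 621, matching the number of LP invocations.

total_samples, rr_errors = 14919, 1573
rr_errors_on_recourse = both_wrong + override_helps   # 288 mistakes
share = rr_errors_on_recourse / rr_errors             # ~18.3% of mistakes

rate_recourse = rr_errors_on_recourse / recourse_cases
rate_trusted = (rr_errors - rr_errors_on_recourse) / (total_samples - recourse_cases)
net_gain = override_helps - override_hurts            # 10 test samples
```

The recourse-case error rate works out to roughly 46% against about 9% on trusted cases, comfortably above the "over 4.4 times" claim.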
In Table 3, we compare the original CSHC and CSHC-LPR with prominent DCS methods from the literature, whereby KNORA-U and META-DES are widely regarded as the current state of the art. As before, in the last rows we show wins/losses and MGI when compared with CSHC-LPR. In the last row we also give the average rank over all 40 benchmarks when ranking all eight methods for each benchmark individually. This data confirms that KNORA-U and META-DES are outstanding dynamic classifier selectors. Surprisingly, we find that simple majority voting is a close runner-up to these sophisticated selection methods.

Regarding CSHC-LPR, we see that it compares very favorably with all other methods. Even KNORA-U, the strongest competitor in terms of average accuracy, is outperformed on 26 out of 40 benchmarks and tied on one (namely heart-h; note that we round accuracy in the table, hence two methods may sometimes appear to have the same performance when they do not. On the letter benchmark, for example, CSHC-LPR actually outperforms KU). Running a paired Student's t-test based on this win/loss data results in a p-value of 3.56% for the null hypothesis that both methods perform equally well, which allows us to reject this hypothesis with statistical significance at the commonly applied significance level of 5%.

Furthermore, we can observe from this table that the original, unmodified CSHC method is almost as good as the best DCS methods to date. Compared with KNORA-U, it performs better on 18 benchmarks and worse on 20, with 2 benchmarks tied. In practice, this makes CSHC a very attractive choice, since it does not require running all base classifiers, but only the one it selects. This means that an ensemble of classifiers can be used effectively for boosting accuracy without having to pay a significantly higher computational cost, which is attractive for keeping the energy and CO2 burden low in mass applications.
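The paper reports p = 3.56% from a paired t-test on the accuracies. As a stdlib-only illustration of how such win/loss records can be assessed, an exact binomial sign test (a different, more conservative test than the paired t-test, swapped in here because it needs no external libraries) can be applied to the 26-13 record against KNORA-U, with the single tie dropped:

```python
from math import comb

def sign_test_p(wins, losses):
    """Exact two-sided binomial sign test: probability of a win/loss
    split at least as extreme as observed under a fair coin, ties
    excluded."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# CSHC-LPR vs. KNORA-U: 26 wins, 13 losses over 40 benchmarks (1 tie).
p = sign_test_p(26, 13)
```

This yields p ≈ 0.053, just shy of the 5% level; the reported paired t-test, which also uses the magnitudes of the accuracy differences, is sharper at 3.56%.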
We studied the use of cost-sensitive hierarchical clustering (CSHC) for the purpose of dynamic classifier selection (DCS) in ensemble learning. We introduced two modifications of CSHC, one based on a rank-based weighting of classifications, the other using an input-specific linear programming formulation to compute a convex combination of classifications. We also introduced a confidence-assessment-and-recourse process to decide which selection method to trust. Experimental results on 40 established machine learning benchmarks with fixed hyper-parameters showed that the modified CSHC works robustly and compares favorably with various other DCS methods from the literature.

Our project dictated that we choose one out of a set of classifiers so as to inherit explanations from the base classifiers. It is worth noting, though, that CSHC can also be adjusted to aggregate the scores of individual classes over multiple classifiers, so that it presents an alternative to traditional stacking techniques as well. This, as well as augmenting the feature set for cost-sensitive hierarchical clustering, is part of our future work.
References

[1] Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[3] David H. Wolpert. Stacked generalization. Neural Networks, pages 241–259, 1992.
[4] Kevin Leyton-Brown, Eugene Nudelman, Galen Andrew, Jim McFadden, and Yoav Shoham. A portfolio approach to algorithm selection. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI'03, pages 1542–1543, San Francisco, CA, USA, August 2003. Morgan Kaufmann Publishers Inc.
[5] Balint, Belov, Heule, and Järvisalo. Proceedings of SAT Competition 2013: Solver and benchmark descriptions. University of Helsinki, Helsinki, Finland, 2013.
[6] Yuri Malitsky, Ashish Sabharwal, Horst Samulowitz, and Meinolf Sellmann. Algorithm portfolios based on cost-sensitive hierarchical clustering. In Francesca Rossi, editor, IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 608–614, 2013.
[7] K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997.
[8] G. Giacinto and F. Roli. Methods for dynamic classifier selection. In Proceedings of the 10th International Conference on Image Analysis and Processing, September 1999.
[9] Giorgio Giacinto and Fabio Roli. Dynamic classifier selection based on multiple classifier behaviour. Pattern Recognition, 34, September 2001.
[10] Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto, Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5), May 2008.
[11] Rafael M. O. Cruz, Robert Sabourin, George D. C. Cavalcanti, and Tsang Ing Ren. META-DES: A dynamic ensemble selection framework using meta-learning. Pattern Recognition, 48(5), May 2015.
[12] Rafael M. O. Cruz, Robert Sabourin, and George D. C. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41, May 2018.
[13] Carlos Ansótegui, Meinolf Sellmann, and Kevin Tierney. Self-configuring cost-sensitive hierarchical clustering with recourse. In Principles and Practice of Constraint Programming - 24th International Conference, CP 2018, Lille, France, August 27-31, 2018, Proceedings, Lecture Notes in Computer Science, pages 524–534. Springer, 2018.
[14] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, and Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490, November 2019.
[15] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15, June 2014.
[16] Rafael M. O. Cruz, Luiz G. Hafemann, Robert Sabourin, and George D. C. Cavalcanti. DESlib: A dynamic ensemble selection library in Python. Journal of Machine Learning Research, 21(8):1–5, 2020.
[17] Agner Fog. Pseudo random number generators, uniform and non-uniform distributions.