Predictive K-means with local models
Vincent Lemaire, Oumaima Alaoui Ismaili, Antoine Cornuéjols, Dominique Gay
Orange Labs, Lannion, France; AgroParisTech, Université Paris-Saclay, Paris, France; LIM-EA2525, Université de La Réunion
Abstract.
Supervised classification can be effective for prediction but sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful but there is no guarantee that they are useful for label prediction. Predictive clustering seeks to obtain the best of the two worlds. Starting from labeled data, it looks for clusters that are as pure as possible with regard to the class labels. One technique consists in tweaking a clustering algorithm so that data points sharing the same label tend to aggregate together. With distance-based algorithms, such as k-means, a solution is to modify the distance used by the algorithm so that it incorporates information about the labels of the data points. In this paper, we propose another method which relies on a change of representation guided by class densities and then carries out clustering in this new representation space. We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance with pure supervised classifiers while offering interpretability of the clusters discovered.
While the power of predictive classifiers can sometimes be impressive on given learning tasks, their actual usability might be severely limited by the lack of interpretability of the hypothesis learned. The opacity of many powerful supervised learning algorithms has indeed become a major issue in recent years. This is why, in addition to good predictive performance as the standard goal, many learning methods have been devised to provide readable decision rules [3], degrees of belief, or other easy to interpret visualizations. This paper presents a predictive technique which promotes interpretability, and explainability as well, in its core design.

The idea is to combine the predictive power brought by supervised learning with the interpretability that can come from the descriptions of categories and profiles discovered using unsupervised clustering. The resulting family of techniques is variously called supervised clustering or predictive clustering. In the literature, there are two categories of predictive clustering. The first family of algorithms aims at optimizing the trade-off between description and prediction, i.e., at detecting sub-groups in each target class. By contrast, the algorithms in the second category favor the prediction performance over the discovery of all underlying clusters, still using clusters as the basis of the decision function. The hope is that the predictive performance of predictive clustering methods can approximate the performance of supervised classifiers while their descriptive capability remains close to the one of pure clustering algorithms.

Several predictive clustering algorithms have been presented over the years, for instance [1,4,10,11,23]. However, the majority of these algorithms require (i) a considerable execution time, and (ii) that numerous user parameters be set. In addition, some algorithms are very sensitive to the presence of noisy data and consequently their outputs are not easily interpretable (see [5] for a survey). This paper presents a new predictive clustering algorithm. The underlying idea is to use any existing distance-based clustering algorithm, e.g. k-means, but on a redescription space where the target class is integrated. The resulting algorithm has several desirable properties: there are few parameters to set, its computational complexity is almost linear in m, the number of instances, it is robust to noise, its predictive performance is comparable to the one obtained with classical supervised classification techniques, and it tends to produce groups of data that are easy to interpret for the experts.

The remainder of this paper is organized as follows. Section 2 introduces the basis of the new algorithm, the computation of the clusters, the initialization step and the classification that is realized within each cluster. The main computation steps of the resulting predictive clustering algorithm are described in Algorithm 1. We then report experiments that deal with the predictive performance in Section 3. We focus on the supervised classification performance to assess whether predictive clustering can reach the performance of algorithms dedicated to supervised classification. Our algorithm is compared, on a variety of data sets, with powerful supervised classification algorithms in order to assess its value as a predictive technique, and an analysis of the results is carried out. Conclusion and perspectives are discussed in Section 4.
The k-means algorithm is one of the simplest yet most commonly used clustering algorithms. It seeks to partition m instances (X_1, ..., X_m) into K groups (B_1, ..., B_K) so that instances which are close are assigned to the same cluster while clusters are as dissimilar as possible. The objective function can be defined as:

$$G = \underset{B_i}{\operatorname{argmin}} \sum_{i=1}^{K} \sum_{X_j \in B_i} \| X_j - \mu_i \|^2 \quad (1)$$

where the μ_i are the centers of the clusters B_i and we consider the Euclidean distance.

Predictive clustering adds the constraint of maximizing cluster purity (i.e., instances in a cluster should share the same label). In addition, the goal is to provide results that are easy to interpret by the end users. The objective function of Equation (1) needs to be modified accordingly.

One approach is to modify the distance used in a conventional clustering algorithm in order to incorporate information about the class of the instances. This modified distance should make points labelled differently appear as more distant than in the original input space. Rather than modifying the distance, one can instead alter the input space. This is the approach taken in this paper, where the input space is partitioned according to class probabilities prior to the clustering step, thus favoring clusters of high purity. Besides the introduction of a technique for computing a new feature space, we propose as well an adapted initialization method for the modified k-means algorithm. We also show the advantage of using a specific classification method within each discovered cluster in order to improve the classification performance. The main steps of the resulting algorithm are described in Algorithm 1. In the remainder of this section, we show how each step of the usual K-means is modified to yield a predictive clustering algorithm.
Algorithm 1: Predictive K-means algorithm
Input:
- D: a data set which contains m instances. Each instance X_i, i ∈ {1, ..., m}, is described by d descriptive features and a label C_i ∈ {1, ..., J}.
- K: the number of clusters.
Start:
1) Supervised preprocessing of the data to represent each X_i as X̂_i in a new feature space Φ(X).
2) Supervised initialization of the centers.
repeat
3) Assignment: generate a new partition by assigning each instance X̂_i to the nearest cluster.
4) Representation: compute the centers of the new partition.
until convergence of the algorithm
5) Assignment of classes to the obtained clusters:
- method 1: majority vote.
- method 2: local models.
6) Prediction of the class of new instances in the deployment phase:
→ the class of the closest cluster (if method 1 is used).
→ local models (if method 2 is used).
End
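To make the overall flow of Algorithm 1 concrete, here is a minimal end-to-end sketch with class assignment by majority vote (method 1). It assumes the redescribed matrix `X_hat` (of shape m x (d*J), produced by step 1) is already available and that the labels `y` are integers in {0, ..., J-1}; all function names are illustrative and not part of the paper.

```python
# Minimal sketch of Algorithm 1 (method 1: majority vote).
import numpy as np
from sklearn.cluster import KMeans

def predictive_kmeans_mv(X_hat, y, K, rng=np.random.default_rng(0)):
    classes = np.unique(y)
    # Step 2: supervised initialization, one Rocchio-like center per class.
    # (The K - J remaining centers are drawn at random here; the paper
    # completes them with k-means++, see the initialization sketch below.)
    centers = np.array([X_hat[y == c].mean(axis=0) for c in classes])
    if K > len(classes):
        extra = rng.choice(len(X_hat), K - len(classes), replace=False)
        centers = np.vstack([centers, X_hat[extra]])

    # Steps 3-4: standard k-means iterations in the redescription space.
    km = KMeans(n_clusters=K, init=centers, n_init=1).fit(X_hat)

    # Step 5 (method 1): each cluster gets the majority class of its members.
    cluster_label = {}
    for k in range(K):
        members = y[km.labels_ == k]
        cluster_label[k] = np.bincount(members).argmax() if len(members) else classes[0]
    return km, cluster_label

def predict_mv(km, cluster_label, x_hat):
    # Step 6 (method 1): label of the nearest cluster.
    return cluster_label[km.predict(x_hat.reshape(1, -1))[0]]
```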
A modified input space for predictive clustering -
The principle of the proposed approach is to partition the input space according to the class probabilities P(C_j | X). More precisely, let the input space X be of dimension d, with numerical descriptors as well as categorical ones. An example X_i ∈ X (X_i = [X_i^(1), ..., X_i^(d)]^T) will be described in the new feature space Φ(X) by d × J components, with J being the number of classes. Each component X_i^(n) of X_i ∈ X will give J components X_i^(n,j), for j ∈ {1, ..., J}, of the new description X̂_i in Φ(X), where X̂_i^(n,j) = log P(X^(n) = X_i^(n) | C_j), i.e., the log-likelihood values. Therefore, an example X is redescribed according to the (log-)probabilities of observing the values of the original input variables given each of the J possible classes (see Figure 1). Below, we describe a method for computing these values. But first, we analyze one property of this redescription in Φ(X) and the distance it can provide.

Fig. 1. Φ redescription scheme from d variables to d × J variables, with log-likelihood values log P(X^(n) | C_j).

Property of the modified distance -
Let us denote dist_pB the new distance defined over Φ(X). For two recoded instances X̂_1 and X̂_2 ∈ R^(d×J), the formula of dist_pB is (in the following, we omit X̂ = X̂_i in the probability terms for notational simplicity):

$$\mathrm{dist}_{pB}(\hat{X}_1, \hat{X}_2) = \sum_{j=1}^{J} \big\| \log P(\hat{X}_1 \mid C_j) - \log P(\hat{X}_2 \mid C_j) \big\|_p \quad (2)$$

where ‖·‖_p is a Minkowski distance. Let us now denote Δ_p(X̂_1, X̂_2) the distance between the (log-)posterior probabilities of the two instances X̂_1 and X̂_2. The formula of this distance is as follows:

$$\Delta_p(\hat{X}_1, \hat{X}_2) = \sum_{j=1}^{J} \big\| \log P(C_j \mid \hat{X}_1) - \log P(C_j \mid \hat{X}_2) \big\|_p \quad (3)$$

where, for all i ∈ {1, ..., m}, P(C_j | X̂_i) = P(C_j) ∏_{n=1}^{d} P(X_i^(n) | C_j) / P(X̂_i) (using the hypothesis of feature independence conditionally on the target class). From the distance given in Equation (3), we obtain the following inequality:

$$\Delta_p(\hat{X}_1, \hat{X}_2) \le \mathrm{dist}_{pB}(\hat{X}_1, \hat{X}_2) + J \, \big\| \log P(\hat{X}_1) - \log P(\hat{X}_2) \big\|_p \quad (4)$$

Proof.

$$\begin{aligned}
\Delta_p &= \sum_{j=1}^{J} \big\| \log P(C_j \mid \hat{X}_1) - \log P(C_j \mid \hat{X}_2) \big\|_p \\
&= \sum_{j=1}^{J} \Big\| \log\frac{P(\hat{X}_1 \mid C_j)\, P(C_j)}{P(\hat{X}_1)} - \log\frac{P(\hat{X}_2 \mid C_j)\, P(C_j)}{P(\hat{X}_2)} \Big\|_p \\
&= \sum_{j=1}^{J} \big\| \log P(\hat{X}_1 \mid C_j) - \log P(\hat{X}_1) - \log P(\hat{X}_2 \mid C_j) + \log P(\hat{X}_2) \big\|_p \\
&\le \sum_{j=1}^{J} [A + B]
\end{aligned}$$

with Δ_p = Δ_p(X̂_1, X̂_2), A = ‖log P(X̂_1 | C_j) − log P(X̂_2 | C_j)‖_p and B = ‖log P(X̂_1) − log P(X̂_2)‖_p; hence Δ_p ≤ dist_pB(X̂_1, X̂_2) + J ‖log P(X̂_1) − log P(X̂_2)‖_p.

This inequality expresses that two instances that are close in terms of the distance dist_pB will also be close in terms of their probabilities of belonging to the same class. Note that the distance presented above can be integrated into any distance-based clustering algorithm.

Building the log-likelihood redescription Φ(X) - Many methods can estimate the new descriptors X_i^(n,j) = log P(X^(n) = X_i^(n) | C_j) from a set of examples. In our work, we use a supervised discretization method for numerical attributes and a supervised value-grouping method for categorical attributes to obtain respectively intervals and groups of values in which P(X^(n) = X_i^(n) | C_j) can be measured. The supervised discretization method is described in [8] and the grouping method in [7]; both have been compared through extensive experiments with the corresponding state-of-the-art algorithms. These methods compute univariate partitions of the input space using supervised information: they determine the partition of the input space that optimizes the prediction of the labels of the examples given the intervals in which they fall. The method finds the best partition (number of intervals and thresholds) using a Bayes estimate. An additional bonus of the method is that outliers are automatically eliminated and missing values can be imputed.
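The redescription step can be sketched as follows. For simplicity, equal-frequency binning with Laplace smoothing is used here as a stand-in for the MODL supervised discretization of [8] (the value grouping of [7] for categorical features is not covered); the function name, bin count and smoothing constant are our own assumptions.

```python
# Sketch of step 1: replace each numerical feature by its J log-likelihoods
# log P(X^(n) in bin | C_j), yielding a d*J-dimensional representation.
import numpy as np

def build_redescription(X, y, n_bins=10, alpha=1.0):
    m, d = X.shape
    classes = np.unique(y)
    J = len(classes)
    X_hat = np.empty((m, d * J))
    for n in range(d):
        # Equal-frequency bin edges for feature n, then the bin of each value.
        edges = np.quantile(X[:, n], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(X[:, n], edges)
        for j, c in enumerate(classes):
            # Smoothed estimate of P(bin | C_j), stored as a log-likelihood.
            counts = np.bincount(bins[y == c], minlength=n_bins) + alpha
            X_hat[:, n * J + j] = np.log(counts / counts.sum())[bins]
    return X_hat
```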
Initialisation of centers - Because clustering is an NP-hard problem, heuristics are needed to solve it, and the search procedure is often iterative, starting from an initialized set of prototypes. One foremost example of many such distance-based methods is the k-means algorithm. It is known that the initialization step can have a significant impact both on the number of iterations and, more importantly, on the results, which correspond to local minima of the optimization criterion (such as Equation 1 in [20]). However, by contrast to classical clustering methods, predictive clustering can use supervised information for the choice of the initial prototypes. In this study, we chose to use the
K++R method, described in [17]. It follows an "exploit and explore" strategy where the class labels are first exploited before the input distribution is used for exploration, in order to get the apparently best initial centers. The main idea of this method is to dedicate one center per class (comparable to a "Rocchio" [19] solution). Each center is defined as the average vector of the instances which have the same class label. If the predefined number of clusters (K) exceeds the number of classes (J), the initialization continues using the K-means++ algorithm [2] for the K − J remaining centers, in such a way as to add diversity. This method can only be used when K ≥ J, but this is fine since in the context of supervised clustering we do not look for clusterings where K < J. The complexity of this scheme is O(m + (K − J)m) < O(mK), where m is the number of examples. When K = J, this method is deterministic.
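A minimal sketch of this initialization, assuming the redescribed matrix and integer labels, is given below; the function name and the random generator handling are ours, not the paper's.

```python
# Sketch of the K++R initialization of [17]: one average (Rocchio-like)
# center per class, then k-means++ seeding for the K - J remaining centers.
import numpy as np

def kpp_r_init(X_hat, y, K, rng=np.random.default_rng(0)):
    classes = np.unique(y)
    centers = [X_hat[y == c].mean(axis=0) for c in classes]  # one center per class
    while len(centers) < K:
        # k-means++ step: sample a point with probability proportional to its
        # squared distance to the closest center chosen so far.
        d2 = np.min([((X_hat - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X_hat[rng.choice(len(X_hat), p=d2 / d2.sum())])
    return np.array(centers)
```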
Instance assignment and centers update - Considering the Euclidean distance and the original K-means procedure for updating the centers, at each iteration each instance is assigned to the nearest cluster using the ℓ_2 metric (p = 2) in the redescription space Φ(X). The K centers are then updated according to the K-means procedure. This choice of distance (Euclidean) in the adopted k-means strategy could have an influence on the (predictive) relevance of the clusters, but this has not been studied in this paper.
Label prediction in predictive clustering - Unlike classical clustering, which aims only at providing a description of the available data, predictive clustering can also be used in order to make predictions about new incoming examples that are unlabeled.

The commonest method used for prediction in predictive K-means is the majority vote. A new example is first assigned to the cluster of its nearest prototype, and the predicted label is the one shared by the majority of the examples of this cluster. This method is not optimal. Let us call P_M the frequency of the majority class in a given cluster. The true probability µ of this class obeys the Hoeffding inequality: P(|P_M − µ| ≥ ε) ≤ 2 exp(−2 m_k ε²), with m_k the number of instances assigned to the cluster k. If there are only 2 classes, the error rate is 1 − µ if P_M and µ are both > 0.5, but the error rate can even exceed 0.5 if P_M > 0.5 and µ < 0.5. The analysis is more complex in the case of more than two classes, and it is not the object of this paper to investigate this further. But it is apparent that the majority rule can often be improved upon, as is the case in classical supervised learning.

Another piece of evidence of the limits of the majority rule is provided by the examination of the ROC curve [12]. Using the majority vote to assign classes to the discovered clusters generates a ROC curve where instances are ranked depending on the clusters. Consequently, the ROC curve presents a sequence of steps. The area under the ROC curve is therefore suboptimal compared to a ROC curve that is obtained from a more refined ranking of the examples, e.g., when class probabilities are dependent upon each example rather than upon groups of examples.

One way to overcome these limits is to use local prediction models in each cluster, hoping to get better prediction rules than the majority one. However, it is necessary that these local models: 1) can be trained with few instances, 2) do not overfit, 3) ideally, do not imply any user parameters, to avoid the need for local cross-validation, 4) have a linear algorithmic complexity O(m) in the learning phase, where m is the number of examples, 5) are not used in the case where the information is insufficient and the majority rule is the best model we can hope for, 6) keep (or even improve) the initial interpretation qualities of the global model. Regarding item (1), a large study has been conducted in [22] in order to test the prediction performance of the most commonly used classifiers as a function of the number of training instances. One prominent finding was that the Naive Bayes (NB) classifier often reaches good prediction performance using only a few examples (Bouchard & Triggs's study [6] confirms this result). This fact remains valid even when features receive weights (e.g., Averaging Naive Bayes (ANB) and Selective Naive Bayes (SNB) [16]). We defer the discussion of the other items to Section 3 on experimental results.

In our experiments, we used the following procedure to label each incoming data point X: i) X is redescribed in the space Φ(X) using the method described in Section 2, ii) X is assigned to the cluster k corresponding to the nearest center, iii) if a local model l exists in this cluster, it is used to predict the class of X (and the probability memberships), the predicted class being the one maximizing P_SNB_l(C_j | X) over j ∈ {1, ..., J}; otherwise the majority vote is used (P_SNB_l(C_j | X) is described in the next section).
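A minimal sketch of this deployment procedure is given below. A Gaussian naive Bayes classifier is used here as a simple stand-in for the SNB local classifier; the `min_size` threshold for training a local model and the function names are our own assumptions, not the paper's rule for deciding when a local model beats the majority vote.

```python
# Sketch of steps (i)-(iii): nearest center, then local model if available,
# otherwise fall back to the cluster's majority vote.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def fit_local_models(X_hat, y, km, min_size=10):
    models, majority = {}, {}
    for k in range(km.n_clusters):
        mask = km.labels_ == k
        majority[k] = np.bincount(y[mask]).argmax() if mask.any() else np.unique(y)[0]
        # Train a local model only when the cluster is impure and large enough.
        if mask.sum() >= min_size and len(np.unique(y[mask])) > 1:
            models[k] = GaussianNB().fit(X_hat[mask], y[mask])
    return models, majority

def predict_one(x_hat, km, models, majority):
    k = km.predict(x_hat.reshape(1, -1))[0]
    return models[k].predict(x_hat.reshape(1, -1))[0] if k in models else majority[k]
```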
To test the ability of our algorithm to exhibit high predictive performance while at the same time being able to uncover interesting clusters in the different data sets, we have compared it with three powerful classifiers (in the spirit of, or close to, our algorithm) from the state of the art: Logistic Model Tree (LMT) [15], Naive Bayes Tree (NBT) [14] and Selective Naive Bayes (SNB) [9]. This section briefly describes these classifiers.

• Logistic Model Tree
(LMT) [15] combines logistic regression and decision trees. It seeks to improve the performance of decision trees: instead of associating each leaf of the tree with a single label and a single probability vector (piecewise constant model), a logistic regression model is trained on the instances assigned to each leaf to estimate an appropriate vector of probabilities for each test instance (piecewise linear regression model). The LogitBoost algorithm is used to fit a logistic regression model at each node, which is then partitioned using information gain as the impurity function.

• Naive Bayes Tree
(NBT) [14] is a hybrid algorithm, which deploys a naive Bayes classifier on each leaf of the built decision tree. NBT is a classifier which has often exhibited good performance compared to standard decision trees and the naive Bayes classifier.

• Selective Naive Bayes
(SNB) is a variant of NB. One way to average a large number of selective naive Bayes classifiers obtained with different subsets of features is to use one model only, but with feature weighting [9]. The Bayes formula under the hypothesis of feature independence conditionally on classes becomes:

$$P(j \mid X) = \frac{P(j) \prod_{f} P(X_f \mid j)^{W_f}}{\sum_{j'=1}^{J} \big[ P(j') \prod_{f} P(X_f \mid j')^{W_f} \big]}$$

where W_f represents the weight of the feature f, X_f is the component f of X, and j is a class label. The predicted class j is the one that maximizes the conditional probability P(j | X). The probabilities P(X_f | j) can be estimated by interval using a discretization for continuous features. For categorical features, this estimation can be done directly if the feature has few distinct modalities; otherwise a grouping into modalities is used. The resulting algorithm proves to be quite efficient on many real data sets [13].

• Predictive K-Means
(PKM_MV, PKM_SNB): (i) PKM_MV corresponds to the Predictive K-Means described in Algorithm 1 where prediction is done according to the majority vote; (ii) PKM_SNB corresponds to the Predictive K-Means described in Algorithm 1 where prediction is done according to a local classification model.

• Unsupervised K-Means (KM_MV) is the usual unsupervised K-Means with prediction done using the majority vote in each cluster. This classifier is given for comparison as a baseline method. The preprocessing is not supervised and the initialization used is k-means++ [2] (in this case, since the initialization is not deterministic, we run k-means 25 times and keep the best initialization according to the mean squared error). Among the existing unsupervised preprocessing approaches [21], depending on the nature of the features, continuous or categorical, we used:

– for numerical attributes: Rank Normalization (RN). The purpose of rank normalization is to rank continuous feature values and then scale the feature into [0, 1]: (i) rank the feature values from lowest to highest and divide the resulting vector into H intervals, where H is the number of intervals, (ii) assign to each interval a label r ∈ {1, ..., H} in increasing order, (iii) if X_iu belongs to the interval r, then X'_iu = r/H. In our experiments, we use H = 100 (see the sketch after this list).

– for categorical attributes: we chose to use a Basic Grouping Approach (BGB). It aims at transforming feature values into a vector of Boolean values. The different steps of this approach are: (i) group the feature values into g groups with frequencies as equal as possible, where g is a parameter given by the user, (ii) assign to each group a label r ∈ {1, ..., g}, (iii) use a full disjunctive coding. In our experiments, we use g = 10.
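The rank normalization used for the KM_MV baseline can be sketched as follows; the function name and the exact quantization via ceiling are our own reading of the three steps above.

```python
# Sketch of rank normalization (RN): values are ranked, mapped to one of
# H equal-frequency intervals and rescaled into (0, 1].  H = 100 as in the
# experiments.
import numpy as np

def rank_normalize(x, H=100):
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..m
    r = np.ceil(ranks / len(x) * H)            # interval label r in {1, ..., H}
    return r / H
```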
Fig. 2. Differences between the three types of “classification”.
In Figure 2 we suggest a two-axis figure to situate the algorithms described above: a vertical axis for their ability to predict the labels (from low to high) and a horizontal axis for their ability to describe (explain) the data (from low to high). In this case the selected classifiers exemplify various trade-offs between prediction performance and explanatory power: (i) KM_MV, more dedicated to description, would appear in the bottom right corner; (ii) LMT, NBT and SNB, dedicated to prediction, would go in the top left corner; and (iii) PKM_MV and PKM_SNB would lie in between. Ideally, our algorithm PKM_SNB should place itself in the top right quadrant of this kind of figure, with both good prediction and description performance.

Note that in the reported experiments, K = J (i.e., the number of clusters equals the number of classes). This choice, which biases the algorithm to find one cluster per class, is detrimental for predictive clustering, thus setting a lower bound on the performance that can be expected of such an approach.

The comparison of the algorithms has been performed on 8 different datasets of the UCI repository [18]. These datasets were chosen for their diversity in terms of classes, features (categorical and numerical) and number of instances (see Table 1).
Table 1. The used datasets; for each dataset, the table reports the number of instances, V_n (the number of numerical features) and V_c (the number of categorical features).

Evaluation of the performance:
In order to compare the performance of the algorithms presented above, the same train/test folds have been used for all of them. The results presented in Section 3.3 are those obtained in the test phase using a stratified 10×10-fold cross-validation. The predictive performance of the algorithms is evaluated using the AUC (area under the ROC curve). It is computed as follows:
$$\mathrm{AUC} = \sum_{i} P(C_i)\, \mathrm{AUC}(C_i)$$

where AUC(C_i) denotes the AUC value for the class i against all the other classes and P(C_i) denotes the prior of the class i (the frequency of elements in the class i). AUC(C_i) is calculated using the probability vector P(C_i | X) for every instance X.
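This class-prior-weighted one-vs-rest AUC can be computed as sketched below from the predicted probability vectors; scikit-learn's roc_auc_score with multi_class="ovr" and average="weighted" should give the same quantity, and the explicit loop mirrors the formula. Names are illustrative.

```python
# Sketch of the evaluation metric: AUC = sum_i P(C_i) * AUC(C_i).
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_ovr_auc(y_true, proba, classes):
    total = 0.0
    for i, c in enumerate(classes):
        prior = np.mean(y_true == c)                       # P(C_i)
        auc_i = roc_auc_score((y_true == c).astype(int),   # class i vs. the rest
                              proba[:, i])
        total += prior * auc_i
    return total
```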
Table 2. Mean performance and standard deviation on the TEST set using a 10x10-fold cross-validation process; the upper block reports the ACC and the lower block the AUC (in %) of KM_MV, PKM_MV, PKM_SNB, LMT, NBT and SNB on each dataset.

3.3 Results

Performance evaluation:
Table 2 presents the predictive performance of LMT, NBT, SNB, our algorithms PKM_MV and PKM_SNB, and the baseline KM_MV, using the ACC (accuracy) and AUC criteria (presented as a %). These results show the very good prediction performance of the PKM_SNB algorithm. Its performance is indeed comparable to those of LMT and SNB, which are the strongest ones. In addition, the use of local classifiers (algorithm PKM_SNB) provides a clear advantage over the use of the majority vote in each cluster as done in PKM_MV. Surprisingly, PKM_SNB exhibits slightly better results than SNB, while both use naive Bayes classifiers locally and PKM_SNB is hampered by the fact that K = J, the number of classes. Better performance is expected when K > J. Finally, PKM_SNB appears to be slightly superior to SNB particularly for the datasets which contain highly correlated features, for instance the PenDigits database.
Discussion about local models, complexity and other factors:
In Section 2, in the paragraph about label prediction in predictive clustering, we proposed a list of desirable properties for the local prediction models used in each cluster. We come back to these items, denoted from (i) to (vi), in discussing Tables 2 and 3:

i) The performance in prediction is good even for the dataset Glass, which contains only 214 instances (90% of them used for training in the 10x10 cross-validation, therefore 193 instances).

ii) The robustness (ratio between the performance in test and in training) is given in Table 3 for the accuracy (ACC) and the AUC. This ratio indicates that there is no significant overfitting. Moreover, by contrast to the methods described in [14,15] (about LMT and NBT), our algorithm does not require any cross-validation for setting parameters.

iii) The only user parameter is the number of clusters (in this paper K = J). This point is crucial to help a non-expert use the proposed method.

iv) The preprocessing complexity (step 1 of Algorithm 1) is O(d m log m), the k-means has the usual complexity O(d m J t), and the complexity for the creation of the local models is O(d m* log m*) + O(K(d m* log(d m*))), where d is the number of variables, m the number of instances in the training dataset, and m* the average number of instances belonging to a cluster. Therefore a fast training time is possible, as indicated in Table 3, with times given in seconds (for a PC with Windows 7 Enterprise and an Intel Core i7-6820HQ 2.70 GHz CPU).

v) Only the clusters where the information is sufficient to beat the majority vote contain a local model. Table 3 gives the percentage of pure clusters obtained at the end of the convergence of the K-means and the percentage of clusters with a local model (if not pure) when performing the 10x10 cross-validation (so over 100 results).

vi) Finally, the interpretation of the PKM_SNB model is based on a two-level analysis. The first analysis consists in analyzing the profile of each cluster using histograms. A visualisation of the average profile of the overall population (each bar representing the percentage of instances having a value in the corresponding interval) and the average profile of a given cluster allows one to understand why a given instance belongs to a cluster. Then, locally to a cluster, the variable importance of the local classifier (the weights W_f in the SNB classifier) gives a local interpretation.
Table 3. Elements for the discussion about local models; for each dataset: the robustness (ACC and AUC), the training time, (1) the percentage of pure clusters, (2) the percentage of non-pure clusters with a local model, and (3) the percentage of non-pure clusters without a local model.
The results of our experiments and the elements (i) to (vi) show that the algorithm PKM_SNB is interesting with regard to several aspects: (1) its predictive performance is comparable to that of the best competing supervised classification methods, (2) it doesn't require cross-validation, (3) it deals with missing values, (4) it operates a feature selection both in the clustering step and during the building of the local models. Finally, (5) it groups the categorical features into modalities, thus allowing one to avoid using a complete disjunctive coding, which involves the creation of large vectors; otherwise this disjunctive coding could complicate the interpretation of the obtained model.

The reader may find supplementary material here: https://bit.ly/2T4VhQw or here: https://bit.ly/3a7xmFF. It gives a detailed example of the interpretation of the results and some comparisons to other predictive clustering algorithms such as COBRA or MPCKmeans.
We have shown how to modify a distance-based clustering technique, such as k-means, into a predictive clustering algorithm. Moreover, the learned representation could be used by other clustering algorithms. The resulting algorithm PKM_SNB exhibits strong predictive performance, most of the time on a par with the state of the art, but with the benefit of not having any parameters to adjust and therefore no cross-validation to compute. The suggested algorithm is also a good support for interpretation of the data. Better performance can still be expected when the number of clusters is higher than the number of classes. One goal of a work in progress is to find a method that would automatically discover the optimal number of clusters. In addition, we are developing a tool to help visualize the results, allowing navigation between clusters in order to easily view the average profiles and the importance of the variables locally for each cluster.
References
1. Al-Harbi, S.H., Rayward-Smith, V.J.: Adapting k-means for supervised clustering. Applied Intelligence (3), 219–226 (2006)
2. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
3. Kim, B., Varshney, K.R., Weller, A.: Workshop on human interpretability in machine learning (WHI 2018). In: Proceedings of the 2018 ICML Workshop (2018)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the Twenty-first International Conference on Machine Learning (ICML) (2004)
5. Blockeel, H., Dzeroski, S., Struyf, J., Zenko, B.: Predictive Clustering. Springer-Verlag New York (2019)
6. Bouchard, G., Triggs, B.: The tradeoff between generative and discriminative classifiers. In: IASC International Symposium on Computational Statistics (COMPSTAT), pp. 721–728 (2004)
7. Boullé, M.: A Bayes optimal approach for partitioning the values of categorical attributes. Journal of Machine Learning Research, 1431–1452 (2005)
8. Boullé, M.: MODL: a Bayes optimal discretization method for continuous attributes. Machine Learning (1), 131–165 (2006)
9. Boullé, M.: Compression-based averaging of selective naive Bayes classifiers. Journal of Machine Learning Research, 1659–1685 (2007)
10. Cevikalp, H., Larlus, D., Jurie, F.: A supervised clustering algorithm for the initialization of RBF neural network classifiers. In: Signal Processing and Communication Applications Conference (June 2007), http://lear.inrialpes.fr/pubs/2007/CLJ07
11. Eick, C.F., Zeidat, N., Zhao, Z.: Supervised clustering - algorithms and benefits. In: International Conference on Tools with Artificial Intelligence, pp. 774–776 (2004)
12. Flach, P.: Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press (2012)
13. Hand, D.J., Yu, K.: Idiot's Bayes - not so stupid after all? International Statistical Review (3), 385–398 (2001)
14. Kohavi, R.: Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: International Conference on Data Mining, pp. 202–207. AAAI Press (1996)
15. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine Learning (1-2) (2005)
16. Langley, P., Sage, S.: Induction of selective Bayesian classifiers. In: Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, pp. 399–406. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1994)
17. Lemaire, V., Alaoui Ismaili, O., Cornuéjols, A.: An initialization scheme for supervized k-means. In: International Joint Conference on Neural Networks (2015)
18. Lichman, M.: UCI machine learning repository (2013)
19. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
20. Meilă, M., Heckerman, D.: An experimental comparison of several clustering and initialization methods. In: Conference on Uncertainty in Artificial Intelligence, pp. 386–395. Morgan Kaufmann Publishers Inc. (1998)
21. Milligan, G.W., Cooper, M.C.: A study of standardization of variables in cluster analysis. Journal of Classification 5