Truecluster matching
Jens Oehlschlägel    [email protected]
Abstract
Cluster matching by permuting cluster labels is important in many clustering contexts such as cluster validation and cluster ensemble techniques. The classic approach is to minimize the Euclidean distance between two cluster solutions, which induces inappropriate stability in certain settings. Therefore, we present the truematch algorithm, which introduces two improvements best explained in the crisp case. First, instead of maximizing the trace of the cluster crosstable, we propose to maximize a χ²-transformation of this crosstable. Thus, the trace will not be dominated by the cells with the largest counts but by the cells with the most non-random observations, taking into account the marginals. Second, we suggest a probabilistic component in order to break ties and to make the matching algorithm truly random on random data. The truematch algorithm is designed as a building block of the truecluster framework and scales in polynomial time. First simulation results confirm that the truematch algorithm gives more consistent truecluster results for unequal cluster sizes. Free R software is available.

Keywords: Hungarian method, truematch, truecluster, MMCC, CIC, Hornik (2005)
1. Introduction
Applying a cluster algorithm to a dataset results in fuzzy or crisp assignments of cases to anonymous clusters. In order to interpret these clusters, we often wish to compare them to other classifications, so some heuristic is needed to match one classification to another. With the advent of resampling and ensemble methods in clustering (Gordon and Vichi, 2001; Dimitriadou et al., 2002; Strehl and Ghosh, 2002), the task of matching cluster solutions has become even more important: we need reliable and scalable matching algorithms that do the task fully automated.

Consider, for example, the use of bootstrapping or cross-validation for cluster validation as suggested by many authors (Moreau and Jain, 1987; Jain and Moreau, 1988; Tibshirani et al., 2001; Roth et al., 2002; Ben-Hur et al., 2002; Dudoit and Fridlyand, 2002): many cluster solutions are created and agreement between them is evaluated. Some agreement indices do not need explicit cluster matching (Rand, 1971; Hubert and Arabie, 1985), but others can only be applied after cluster solutions have been matched, for example, Cohen's kappa (1960).

Recently, authors have suggested transferring the idea of bagging (Breiman, 1996) to clustering. Some approaches aggregate cluster centers (Leisch, 1999; Dolnicar and Leisch, 2000; Bakker and Heskes, 2001) or aggregate consensus between pairs of observations (Monti et al., 2003; Dudoit and Fridlyand, 2003, BagClust2 algorithm). Other approaches aggregate cluster assignments and, therefore, require cluster matching, for example, the crisp BagClust1 algorithm of Dudoit and Fridlyand (2003), the combination scheme for fuzzy clustering of Dimitriadou et al. (2002), or truecluster (Oehlschlägel, 2007b).

Truecluster is an algorithmic framework for robust scalable clustering with model selection that combines the idea of bagging with information-theoretic model selection along the lines of AIC (Akaike, 1973, 1974) and BIC (Schwarz, 1978). In order to calculate its cluster information criterion (CIC), truecluster requires a reliable cluster matching algorithm. The truematch algorithm presented here was designed to play that role. The organization of the paper is as follows: in Section 2, we show an undesirable feature of the standard approach to cluster matching. In Section 3, we present the truematch algorithm. In Section 4, we demonstrate the benefits of the truematch algorithm within the truecluster framework. In Section 5, we use simulation to compare truematch against standard trace maximization matching, and in Section 6, we discuss our results.
2. What’s wrong with trace maximization of the matching table
The standard approach to cluster matching is searching for that permutation of cluster labels that minimizes the Euclidean distance to a reference cluster solution. This criterion has been suggested for fuzzy consensus clustering (Gordon and Vichi, 2001; Dimitriadou et al., 2002) as well as for crisp consensus clustering (Strehl and Ghosh, 2002) or crisp cluster bagging (Dudoit and Fridlyand, 2003, BagClust1). In the crisp case, this criterion is simply trace maximization of matching table counts: cross-tabulating class memberships of two solutions and then permuting rows/columns of the matching table until the trace becomes maximal. To our knowledge, cluster publications and software differ in the algorithms used to obtain trace maximization, but do not question the Euclidean criterion per se.

For example, Dimitriadou et al. (2002) suggested a recursive heuristic to approximate trace maximization. It is known that trying all permutations has time complexity O(K!), where K denotes the number of clusters. The Hungarian method improves on this and achieves polynomial time complexity O(K³). Kuhn (1955) published a pencil and paper version, which was followed by J. R. Munkres' executable version (Munkres, 1957) and extended to non-square matrices by Bourgeois and Lassalle (1971). For a list of further algorithmic approaches to this so-called linear sum assignment problem or weighted bipartite matching, see Hornik (2005).

However, scalability is not the only quality aspect of a matching algorithm. An important statistical feature of a matching algorithm is the following: if we match two random partitions, the matching algorithm should not systematically align the two partitions. We now show that classic trace maximization does not generally possess this feature.

Assume a cluster algorithm that claims to identify an outlier in a sample of size N = 100 but which actually declares one case as 'outlying' at random. Now assume a procedure that draws two bootstrap samples and clusters them into 99% 'normal' cases and one 'outlier'. In 1% of such procedures, the outlier picked in the second sample will randomly match the outlier picked in the first sample. In such cases, trace maximization matching will lead to a matching table as shown in Table 1. In the other 99%, there will be no match, which, by trace maximization, gives a matching table like that shown in Table 2. The resulting expected matching table is shown in Table 3.

         a    b
    a   99    0
    b    0    1

    Table 1: Random matching (1%)

         a    b
    a   98    1
    b    1    0

    Table 2: Typical trace maximization matching (99%)

          a        b
    a   98.01%    0.99%
    b    0.99%    0.01%

    Table 3: Expected trace maximization matching

We can see that under random clustering, we expect 98.02% on the main diagonal, which at first glance looks like a strong (non-random) match. Only applying standard random correction (Cohen, 1960) confirms this to be a pure random match (Cohen's kappa = 0). However, in a clustering context we have two objections against relying on such random corrections. As far as evaluation of cluster agreement is concerned, random corrections such as Cohen's kappa or Hubert and Arabie's corrected rand index do not work properly, because spatial neighbors have an above-random chance of being clustered together in the absence of any cluster structure in the data. Therefore, agreement indices are too optimistic even with random correction. More importantly, in other contexts such as bagging there is no random correction available at all. If cluster sizes are (very) different, bagging cluster results will suffer because in standard trace maximization big randomly matched cells win over small cells representing non-random matches. Therefore, we are looking for a matching algorithm that does not systematically generate a strong diagonal under random conditions.
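The arithmetic behind Table 3 is just the probability-weighted mixture of the two outcomes; the following minimal R sketch reproduces it (the object names T1, T2, and E are ours, for illustration only):

    T1 <- matrix(c(99, 0,
                    0, 1), 2, 2, byrow = TRUE)  # random match, 1% of procedures (Table 1)
    T2 <- matrix(c(98, 1,
                    1, 0), 2, 2, byrow = TRUE)  # typical trace maximization match (Table 2)
    E  <- 0.01 * T1 + 0.99 * T2                 # expected matching table (Table 3)
    sum(diag(E)) / sum(E)                       # 0.9802: strong diagonal despite kappa = 0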
3. Truematch algorithm
The problems with standard trace maximization described in the previous section result from focusing on raw counts in a situation with unequal marginal (cluster) probabilities. From other contexts, we know that this is not a good idea. Take the χ²-test for statistical independence of two categorical variables: it is not based on raw counts. Instead, the matching table of raw counts is transformed to another unit taking the marginals into account. Let N denote the total number of observations, n_k the number of observations in one row, n_l the number of observations in one column and, finally, let n_{k,l} denote the number of observations in one cell of the K × K cluster crosstable. The first step in calculating χ² is to calculate for each cell the number of expected counts \hat{n}_{k,l} under the assumption of independence:

    \hat{n}_{k,l} = p_k \cdot p_l \cdot N = \frac{n_k \cdot n_l}{N}    (1)

Then, using the expected counts from Equation 1, we transform the matrix of raw counts into a matrix of normalized squared deviations d_{k,l} from the null model:

    d_{k,l} = \frac{(n_{k,l} - \hat{n}_{k,l})^2}{\hat{n}_{k,l}}    (2)

The χ²-value is defined as the sum of Equation 2 over all cells. If we restore the sign in Equation 2, we get:

    s_{k,l} = \operatorname{sign}(n_{k,l} - \hat{n}_{k,l}) \cdot d_{k,l}    (3)

In order to cope with unequal cluster sizes, we suggest basing cluster matching on maximizing the trace of s_{k,l} rather than on maximizing the trace of n_{k,l}. And in order to avoid any systematic effect not based on the data, we add a probabilistic component to the matching algorithm. Consequently, we define the truematch algorithm as:

1. Randomly permute rows and columns of the matching table
2. Transform the matching table counts n_{k,l} to signed normalized squared deviations s_{k,l} using Equation 3
3. Apply a trace maximization algorithm like the Hungarian method to maximize the trace (in fact, the Hungarian method minimizes -s_{k,l})
4. Order the resulting row/column pairs descending by s_{k,l}, breaking ties at random

If no trace maximization algorithm like the Hungarian method is available, the matching can easily be done using the truematch heuristic, similar to the heuristic suggested by Dimitriadou et al. (2002); a sketch in R is given at the end of this section:

1. Calculate signed normalized squared deviations s_{k,l} for all remaining cells of the matching table
2. Order all cells descending by s_{k,l} and by n_{k,l} (breaking ties at random) and denote the first cell as the target cell
3. Match the row of the target cell to the column of the target cell
4. Remove the row and the column of the target cell from the matching table
5. If both the number of remaining rows and columns is at least two, repeat from step 1

It is obvious that the truematch algorithm has runtime complexity O(K³) like the Hungarian method. The truematch heuristic also nicely translates into polynomial runtime. The number of residuals calculated to reduce the matching table from k to k−1 rows/columns is k², thus the total number of residuals calculated is K² + (K−1)² + (K−2)² + ... + 2² = K·(K+1)·(2K+1)/6 − 1, which gives runtime complexity O(K³) and memory complexity O(K²) if the recursive nature of the algorithm is realized using a while-loop.

R package truecluster (Oehlschlägel, 2007a) implements the truematch algorithm in matchindex(method = "truematch") and the truematch heuristic in matchindex(method = "tracemax") efficiently through underlying C code.

Applying the truematch algorithm and the truematch heuristic to the above example gives identical results: as in standard trace maximization matching, we find 1% random matches as in Table 1, but for the 99% non-random matching cases, truematch generates two versions of matching tables, see Table 4. Both versions have shifted the majority of counts off-diagonal. Due to the probabilistic component in the 2nd step, this leads to an expected matching (Table 5) that has a weak trace. Under truematch, only systematic, non-random matches will result in a strong diagonal.

         a    b          a    b
    a    1   98     a    1    0
    b    0    1     b   98    1

    Table 4: Typical truematch (49.5% + 49.5%)

          a        b
    a    1.98%   48.51%
    b   48.51%    1.00%

    Table 5: Expected truematch

We can quantify the benefit of truematch in this case by comparing expected values of certain agreement indices, cf. Table 6. The rand index (Rand, 1971) and its random corrected version crand (Hubert and Arabie, 1985) are invariant against row/column permutations and, thus, do not differ. There is also no difference for kappa (Cohen, 1960). However, the big difference is on the simple non-random-corrected diagonal fraction of observations: while trace maximization misleadingly results in an expected diagonal close to 1, truematch reduces the expectation of this non-random-corrected index close to zero. In the next two sections, we will explore the benefit of truematch in a bagging context, where the main diagonal defines the matching but no random correction is available.

                                fraction   diagonal   kappa    rand    crand
    Tracemax   RandomMatch        1.0%       1.00      1.00    1.000    1.00
    Tracemax   NonRandomMatch    99.0%       0.98     -0.01    0.960   -0.01
    Tracemax   Expected         100.0%       0.98      0.00    0.960    0.00
    Truematch  RandomMatch        1.0%       1.00      1.00    1.000    1.00
    Truematch  NonRandomMatch    99.0%       0.02     -0.01    0.960   -0.01
    Truematch  Expected         100.0%       0.03      0.00    0.960    0.00

    Table 6: Expected agreement indices for trace maximization vs. truematch
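As a concrete illustration, here is a minimal R sketch of the truematch heuristic under the definitions above (our own illustration, not the C implementation from package truecluster; the function name truematch_heuristic is ours):

    ## Minimal sketch of the truematch heuristic; 'tab' is a K x K matching
    ## table of counts, e.g. tab <- table(labels1, labels2).
    truematch_heuristic <- function(tab) {
      K <- nrow(tab)
      stopifnot(K == ncol(tab))
      rows <- 1:K
      cols <- 1:K
      res <- integer(K)                  # res[k] = column matched to row k
      while (length(rows) > 1) {
        n    <- tab[rows, cols, drop = FALSE]
        nhat <- outer(rowSums(n), colSums(n)) / sum(n)  # expected counts, Eq. (1)
        d    <- (n - nhat)^2 / nhat                     # squared deviations, Eq. (2)
        d[is.nan(d)] <- 0                               # guard against empty rows/columns
        s    <- sign(n - nhat) * d                      # signed deviations, Eq. (3)
        ## target cell: highest s, then highest n, remaining ties broken at random
        target <- order(-s, -n, sample(length(s)))[1]
        i <- (target - 1) %%  length(rows) + 1          # row of target cell
        j <- (target - 1) %/% length(rows) + 1          # column of target cell
        res[rows[i]] <- cols[j]
        rows <- rows[-i]
        cols <- cols[-j]
      }
      res[rows] <- cols                  # the last remaining pair matches itself
      res
    }

For the full truematch algorithm, step 3 could, for example, hand the (shifted, non-negative) s-matrix to a Hungarian-method solver such as solve_LSAP(s - min(s), maximum = TRUE) from package clue (Hornik, 2005).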
4. The role of truematch in truecluster
The truecluster concept (Oehlschlägel, 2007b) suggests a cluster information criterion (CIC) that evaluates for each cluster model (for each number of clusters) an N × K matrix P̂ that aggregates votes over many resamples. P̂ is created by the multiple match cluster count (MMCC) algorithm using the truematch algorithm as follows (see the sketch at the end of this section):

1. Create an N × K matrix C and initialize each cell C_{i,k} with zero
2. Take a resample (with replacement) of size N and use a base cluster algorithm to fit the K-cluster model c* to the resample. Then, use a suitable prediction method to determine cluster membership of the out-of-resample cases to get a complete cluster vector c′ with N elements c′_i
3. For each row in C, add one vote (add 1) to the column corresponding to the cluster membership in c′
4. Repeat step 2
5. Estimate cluster memberships ĉ by row-wise majority count in C (breaking ties at random), use the truematch algorithm or heuristic to align c′ with ĉ, and rename the clusters in c′ like the corresponding clusters in ĉ
6. For each row in C, add one vote (add 1) to the column corresponding to the cluster membership in c′
7. Repeat from step 4 until some reasonable convergence criterion is reached
8. Divide each cell in C by its row sum to get a matrix of estimated cluster membership probabilities P̂

Table 7 summarizes simulations with truecluster versus consensus clustering (100 cases, 10,000 replications; for details see MMCCconcensus.r in R package truecluster (Oehlschlägel, 2007a); the table is sorted and grouped by the magnitude of CIC values). For random data without cluster structure, we would expect a very 'fuzzy' P̂ without clear preferences for any cluster. Furthermore, we would expect CIC to increase for models with more true clusters and to decrease if models try to distinguish more clusters than justified by the data.

Table 7 shows that the MMCC algorithm using truematch delivers on this expectation: CIC increases for justified clusters and declines for unjustified ones, even if unjustified clusters in the model are small. This works because once cluster decisions are unjustified, the truematch algorithm starts distributing its votes randomly across indistinguishable columns of C and, thus, 'fuzzifies' P̂. Compare that to consensus clustering (Dimitriadou et al., 2002) based on trace maximization obtained with R package clue (Hornik and Boehm, 2007; Hornik, 2005). Models with unjustified small clusters get CIC values as high as models without the unjustified cluster. This is a consequence of the trace maximization matching, which adds inappropriate stability to the voting. Take, for example, the "random 99:1" model, which is as unjustified as the "random 50:50" model but receives a much higher CIC value. The stability induced by the trace maximization matching results in quite a crisp P̂: for each row, we find high probability for one cluster and low probability for the other. If we assign cases to clusters based on the maximum probability per row in P̂, all cases are assigned to the same cluster. Such a degenerate P̂ is not wrong but unfortunate. If we manually analyze P̂, we might detect that P̂ actually represents a one-cluster (K = 1) model. But if we are after automatic selection of models (number of clusters), it is misleading that P̂ does not represent K = 2 but K = 1. Analyzing a consensus cluster solution P̂_K for degeneracies does not really help: the estimated probabilities can be biased even before the matrix formally degenerates.

    MMCC                        true K   model K     H       RMC      I       CIC
    random 50:49:1                 1        3      1.578    0.020   0.044   -1.534
    random 99:1                    1        2      1.000    0.010   0.014   -0.985
    random 50:50                   1        2      0.995    0.010   0.059   -0.936
    single 100                     1        1      0.000    0.000   0.000    0.000
    justified 50 random 49:1       2        3      0.499    0.018   0.695    0.196
    justified 50:50                2        2      0.000    0.010   0.990    0.990

    consensus                   true K   model K     H       RMC      I       CIC
    random 50:49:1                 1        3      1.066    0.011   0.049   -1.016
    random 50:50                   1        2      0.995    0.010   0.048   -0.947
    random 99:1                    1        2      0.081    0.001   0.001   -0.080
    single 100                     1        1      0.000    0.000   0.000    0.000
    justified 50 random 49:1       2        3      0.071    0.011   0.965    0.895
    justified 50:50                2        2      0.000    0.010   0.990    0.990

    true K                    true number of clusters
    model K                   model number of clusters
    H                         model uncertainty
    RMC                       relative model complexity
    I                         model information
    CIC                       cluster information criterion (I − H)
    single 100                theoretical values for single group (no cluster)
    random 50:50              random clustering with 2 equal sized clusters
    random 99:1               random clustering with 2 unequal sized clusters
    random 50:49:1            random clustering with 3 unequal sized clusters
    justified 50:50           justified clustering with 2 equal sized clusters
    justified 50 random 49:1  2 justified clusters, one randomly split unequal sized

    Table 7: Consensus clustering vs. truecluster
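To make the voting scheme concrete, the following R sketch implements the MMCC loop under simplifying assumptions: kmeans as base cluster algorithm, nearest-center prediction for out-of-resample cases, and a fixed number of iterations B instead of a convergence criterion. The names mmcc_sketch and truematch_heuristic are ours (the latter is the sketch from Section 3), not the package API.

    ## Minimal sketch of the MMCC voting loop (steps 1-8 above).
    mmcc_sketch <- function(x, K, B = 1000) {
      x <- as.matrix(x)                          # assumes enough distinct points for kmeans
      N <- nrow(x)
      C <- matrix(0L, N, K)                      # step 1: N x K vote matrix
      for (b in seq_len(B)) {                    # steps 4/7: fixed B instead of convergence
        idx <- sample(N, N, replace = TRUE)      # step 2: resample with replacement
        fit <- kmeans(x[idx, , drop = FALSE], centers = K)
        ## predict all N cases by their nearest cluster center
        d2 <- as.matrix(dist(rbind(fit$centers, x)))[-(1:K), 1:K, drop = FALSE]
        cprime <- max.col(-d2)                   # complete cluster vector c'
        if (b > 1) {                             # step 5: align c' with current c-hat
          chat <- max.col(C, ties.method = "random")
          tab  <- table(factor(cprime, levels = 1:K), factor(chat, levels = 1:K))
          perm <- truematch_heuristic(tab)
          cprime <- perm[cprime]                 # rename clusters in c' like in c-hat
        }
        C[cbind(seq_len(N), cprime)] <- C[cbind(seq_len(N), cprime)] + 1L  # steps 3/6
      }
      C / rowSums(C)                             # step 8: estimated probabilities P-hat
    }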
5. Simulation results
In order to systematically investigate the consequences of the different features of truematch versus simple trace maximization matching, we have carried out extensive simulations within the truecluster framework: we assume two clusters, vary their relative size p and the reliability κ of a fictitious clustering algorithm, and compare the truecluster results gained via trace maximization versus truematch. We did two versions of the simulations: in the non-fixed version, p just determines sampling probabilities; in the fixed version, the fictitious clustering algorithm enforces the exact relative size p of the two clusters. Details of the simulation are given in Appendix A.

Figure 1 shows information, uncertainty, and their difference CIC for the non-fixed simulations. White areas denote simulation trials where the truecluster algorithm degenerated from a 2-cluster solution to a 1-cluster solution. The most notable difference is the big share of non-converged truecluster solutions using trace maximization, compared to the truematch algorithm. The estimated information, given reliability and skewness, is very similar and reasonable: information is highest for p = 0.5 and κ = 1 and declines when lowering κ and/or skewing p.

By contrast, compared for uncertainty and for the CIC, trace maximization and truematch differ dramatically. Using trace maximization, the uncertainty estimate does not only depend on κ but is also artificially lower for higher skewness. As a consequence, cluster models with unequal cluster sizes get better CIC values than cluster models with equal cluster sizes. Using the truematch algorithm almost avoids this undesirable pattern: the estimated uncertainty depends almost only on κ, not on p. The estimated CIC shows a very reasonable pattern: at high κ, the CIC is highest for equal sized clusters, conforming with the entropy principle; at low κ, the CIC is low, however skewed p is. Only at very extreme p is the CIC biased downwards: too small clusters cannot be detected with too small a sample size. Extreme models are non-identifiable and the uncertainty estimate has high variance. Keep in mind that 'extreme' p corresponds to very few cases at a sample size of N = 100. The fixed simulations gave similar results (Figure 2).

In summary, trace maximization fails to estimate uncertainty independent of skewness and tends to overestimate CIC for unequal cluster sizes, or fails to converge. This restricts its usefulness for cluster evaluation and bagging. By contrast, the truematch algorithm works at almost any combination of reliability and skewness (with the exception of non-identifiable models, given the sample size).

[Figure 1: Results of non-fixed simulations]

[Figure 2: Results of fixed simulations]
6. Discussion
We have shown that trace maximization matching fails to behave sufficiently neutrally when matching clusterings. The problem arises generally but is especially important in contexts where random correction is not applicable. As an alternative, we have presented the truematch algorithm and heuristic, which both probabilistically generate neutral expected matching tables and scale in polynomial time. Our simulations have confirmed that truematch avoids unjustified (expected) matchings induced by unequal cluster sizes. For the simulations done here, the truematch algorithm and the truematch heuristic behave identically. Since the truematch heuristic does not guarantee maximizing the χ²-criterion, we expect the truematch algorithm to be superior. However, there is a subtle difference: while the matching of the truematch algorithm depends solely on s_{k,l}, the truematch heuristic uses s_{k,l} and n_{k,l} to select the row/column matches. Therefore, a final decision about an optimal matching algorithm needs more investigation.

Truematch is central to the MMCC algorithm, which creates the basis for the CIC evaluation in the truecluster framework and, thus, contributes to solving the decades-old problem of choosing the optimal number of clusters. Beyond that, cluster bagging in general could benefit from using truematch: the resulting N × K matrix is fuzzified rather than degenerate for unjustified cluster splits. This allows for better automated processing of such results. It is an open question whether the truematch algorithm also has advantages for consensus clustering, or whether different usages of cluster ensembles require different matching algorithms.

Acknowledgments
We would like to thank Dr. Stefan Pilz for reviewing this paper and giving valuable hints for improvement.

Appendix A.
In this appendix, we give details concerning the simulations in Section 5. Assume a vector x of length 100 with 'true' sample group memberships, where p denotes the fraction of 1 and (1 − p) the fraction of 0. Let p_1 denote the matrix of joint probabilities for a case's true and clustered classification when the cluster algorithm perfectly separates 0 from 1 (at κ = 1):

    p_1 = \begin{pmatrix} 1-p & 0 \\ 0 & p \end{pmatrix}

Let p_0 denote the matrix of joint probabilities for a case's true and clustered classification when the cluster algorithm makes a random guess when separating 0 from 1 (at κ = 0):

    p_0 = \begin{pmatrix} (1-p)^2 & (1-p) \cdot p \\ (1-p) \cdot p & p^2 \end{pmatrix}

Then p_κ denotes the matrix of joint probabilities for a case's true and clustered classification when the cluster algorithm has reliability κ:

    p_\kappa = \kappa \cdot p_1 + (1-\kappa) \cdot p_0

The two conditional probabilities p_id that the clustering algorithm identifies the true class, given the true class, are:

    p_{id} = \kappa + (1-\kappa) \cdot \begin{pmatrix} 1-p \\ p \end{pmatrix}

For each value of p ∈ {1/100, 2/100, .., 99/100} and each value of κ ∈ {0.00, 0.01, 0.02, .., 1.00}, we simulate aggregation of 1000 bootstrap samples from x. For each bootstrap sample, our fictitious cluster algorithm assigns cases with probability p_id to the true class and with probability 1 − p_id to the other class. The resulting cluster memberships c* are matched versus the (current) estimated cluster memberships ĉ of the cases in the bootstrap sample. If c* or ĉ does not contain two classes, the bootstrap sample is dropped and replaced by another one. Differently from the MMCC algorithm in Section 4, we do not predict cluster memberships of the out-of-bag cases. We use c* directly instead of c′; consequently, the rows of C are not guaranteed to have aggregated an equal number of votes. For all combinations of p and κ, we calculate information, uncertainty, and CIC (Oehlschlägel, 2007b) for the resulting 99 × 101 truecluster models P̂. These values are visualized using color coding, and contour lines are added based on a loess smooth. To create the fixed version, the complete procedure is repeated, additionally enforcing a fixed fraction p by moving randomly selected observations in c* from the too big group to the too small one, analogous to a cluster algorithm that forces certain cluster sizes. The R code doing the simulation is available in truematch.r in package truecluster (Oehlschlägel, 2007a).
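As a minimal illustration of this setup, one bootstrap replicate of the fictitious clustering algorithm can be sketched in R as follows (p_id and simulate_cstar are our hypothetical helper names, not code from the package):

    ## Conditional probability of identifying the true class, given true
    ## class 0 or 1 (last equation above).
    p_id <- function(p, kappa) kappa + (1 - kappa) * c(1 - p, p)

    ## One bootstrap replicate: resample x (coded 0/1) and assign each case
    ## to its true class with probability p_id, to the other class otherwise.
    simulate_cstar <- function(x, p, kappa) {
      xb  <- sample(x, length(x), replace = TRUE)
      pid <- p_id(p, kappa)[xb + 1]
      hit <- rbinom(length(xb), 1, pid) == 1
      ifelse(hit, xb, 1 - xb)            # cluster memberships c*
    }

    ## Example: p = 0.3, N = 100, reliability kappa = 0.8
    x <- rep(0:1, times = c(70, 30))
    cstar <- simulate_cstar(x, p = 0.3, kappa = 0.8)

References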
H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csáki, editors, Second International Symposium on Information Theory, pages 267–281, Budapest, 1973. Akademiai Kiadó. Reprinted in Breakthroughs in Statistics, eds Kotz, S. and Johnson, N. L. (1992), volume I, pp. 599–624. New York: Springer.

H. Akaike. A new look at statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.

Bart Bakker and Tom Heskes. Model clustering and resampling, 2001. URL citeseer.ist.psu.edu/bakker00model.html.

A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing, 7:6–17, 2002.

François Bourgeois and Jean-Claude Lassalle. An extension of the Munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM, 14(12):802–804, 1971.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960.

E. Dimitriadou, A. Weingessel, and K. Hornik. A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16:901–912, 2002.

S. Dolnicar and F. Leisch. Behavioural market segmentation using the bagged clustering approach based on binary guest survey data: Exploring and visualizing unobserved heterogeneity. Tourism Analysis, 5(2–4):163–170, 2000.

S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7):research0036.1–0036.21, 2002.

S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9):1090–1099, 2003.

A. D. Gordon and M. Vichi. Fuzzy partition models for fitting a set of partitions. Psychometrika, 66:229–248, 2001.

Kurt Hornik. A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12), September 2005.

Kurt Hornik and Walter Boehm. clue: Cluster ensembles, 2007. R package version 0.3-11.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.

A. K. Jain and J. Moreau. Bootstrap techniques in cluster analysis. Pattern Recognition, 20:547–568, 1988.

H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:225–231, 1955.

Friedrich Leisch. Bagged clustering. Working Paper 51, SFB Adaptive Information Systems and Modelling in Economics and Management Science, Vienna University of Economics and Business Administration in cooperation with the University of Vienna and Vienna University of Technology, 1999.

Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.

J. V. Moreau and A. K. Jain. The bootstrap approach to clustering. In P. A. Devijver and J. Kittler, editors, Pattern Recognition: Theory and Applications, volume 30 of NATO ASI Series F, pages 63–71. Springer, 1987.

J. Munkres. Algorithms for the assignment and transportation problems. Journal of the SIAM, 5:32–38, 1957.

Jens Oehlschlägel. truecluster: An algorithmic framework for robust and scalable clustering, 2007a. R package version 0.3 (version 1.0 and higher will also be hosted at CRAN.R-project.org).

Jens Oehlschlägel. Truecluster: robust scalable clustering with model selection. Submitted to JMLR, 2007b.

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.

Volker Roth, Tilman Lange, Mikio Braun, and Joachim M. Buhmann. A resampling approach to cluster validation. In Wolfgang Härdle and Bernd Rönz, editors, Proceedings in Computational Statistics: 15th Symposium Held in Berlin (COMPSTAT 2002), pages 123–128, Heidelberg, 2002. Physica-Verlag.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

A. Strehl and J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.

Robert Tibshirani, Guenther Walther, David Botstein, and Patrick Brown. Cluster validation by prediction strength. Technical report, Stanford University, 2001.