Adjusting for Chance Clustering Comparison Measures
Simone Romano [email protected]
Nguyen Xuan Vinh [email protected]
James Bailey [email protected]
Karin Verspoor [email protected]
Dept. of Computing and Information Systems, The University of Melbourne, VIC, Australia.
Abstract
Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI), based on pair-counting, and the Adjusted Mutual Information (AMI), based on Shannon information theory, are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure, and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between the adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.
Keywords:
Clustering Comparisons, Adjustment for Chance, Generalized Information Theoretic Measures
1. Introduction
Clustering comparison measures are used to compare partitions/clusterings of the same data set. In the clustering community (Aggarwal and Reddy, 2013), they are extensively used for external validation when the ground truth clustering is available. A family of popular clustering comparison measures are measures based on pair-counting (Albatineh et al., 2006). This category comprises the well known similarity measures Rand Index (RI) (Rand, 1971) and the Jaccard coefficient (J) (Ben-Hur et al., 2001). Recently, information theoretic (IT) measures have also been extensively used to compare partitions (Strehl and Ghosh, 2003; Vinh et al., 2010). Given the variety of possible measures, it is very challenging to identify the best choice for a particular application scenario (Wu et al., 2009).

The picture becomes even more complex if adjusted for chance measures are also considered. Adjusted for chance measures are widely used external clustering validation techniques because they improve the interpretability of the results. Indeed, two important properties hold true for adjusted measures: they have a constant baseline value equal to 0 when the partitions are random and independent, and they are equal to 1 when the compared partitions are identical. Notable examples are the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and the Adjusted Mutual Information (AMI) (Vinh et al., 2009). It is common to see published research that validates clustering solutions against a reference ground truth clustering with the ARI or the AMI. Nonetheless there are still open problems: no guidelines for their best application scenarios have been given in the literature to date, and authors often resort to employing both, leaving the reader to interpret the results.

Moreover, some clustering comparison measures are susceptible to selection bias: when selecting the partition most similar to a given ground truth partition, clustering comparison measures are more likely to select partitions with many clusters (Romano et al., 2014). In Romano et al. (2014) it was shown that it is beneficial to apply statistical standardization to IT measures to correct for this bias. In particular, standardized IT measures help in decreasing this bias when the number of objects in the data set is small. Statistical standardization has not yet been applied to pair-counting measures in the literature. We solve this challenge in the current paper, and provide further results about the utility of measure adjustment by standardization.

In this work, we aim to bridge the gap between the adjustment of pair-counting measures and the adjustment of IT measures. In Furuichi (2006); Simovici (2007) it has been shown that generalized IT measures based on the Tsallis $q$-entropy (Tsallis et al., 2009) are a further generalization of IT measures and of some pair-counting measures such as RI. In this paper, we will exploit this useful idea to connect ARI and AMI. Furthermore, using the same idea, we can perform statistical adjustment by standardization of a broader class of measures, including pair-counting measures.

A key technical challenge is to analytically compute the expected value and variance of generalized IT measures when the clusterings are random. To solve this problem, we propose a technique applicable to a broad class of measures we name $\mathcal{L}_\phi$, which includes generalized IT measures as a special case.
This generalizes previous work which provided analytical adjustments for narrower classes: measures based on pair-counting from the family $\mathcal{L}$ (Albatineh et al., 2006), and measures based on Shannon's mutual information (Vinh et al., 2009, 2010). Moreover, we define a family of measures $\mathcal{N}_\phi$ which generalizes many clustering comparison measures. For measures belonging to this family, the expected value can be analytically approximated when the number of objects is large. Table 1 summarizes the development of this line of work over the past 30 years and positions our contribution. In summary, we make the following contributions:

• We define families of measures for which the expected value and variance can be computed analytically when the clusterings are random;

• We propose generalized adjusted measures to correct for the baseline property and for selection bias. This captures existing well known measures as special cases;

• We provide insights into the open problem of identifying the best application scenarios for clustering comparison measures, in particular the application scenarios for ARI and AMI.
[Figure 1: Families of clustering comparison measures, showing how the generalized information theoretic measures relate the Rand Index (RI), MI, NMI, VI, and the Jaccard coefficient (J).]

Table 1: Survey of analytical computation of measures.

| Year | Contribution | Reference |
|------|--------------|-----------|
| 1985 | Expectation of Rand Index (RI) | (Hubert and Arabie, 1985) |
| 2006 | Expectation and variance of $S \in \mathcal{L}$ | (Albatineh et al., 2006) |
| 2009 | Expectation of Shannon Mutual Information (MI) | (Vinh et al., 2009) |
| 2010 | Expectation of Normalized Shannon MI (NMI) | (Vinh et al., 2010) |
| 2014 | Variance of Shannon MI | (Romano et al., 2014) |
| 2015 | Expectation and variance of $S \in \mathcal{L}_\phi$; asymptotic expectation of $S \in \mathcal{N}_\phi$ | This work |
2. Comparing Partitions
Given two partitions (clusterings) $U$ and $V$ of the same data set of $N$ objects, let $\{u_1, \dots, u_r\}$ and $\{v_1, \dots, v_c\}$ be the disjoint sets (clusters) of $U$ and $V$ respectively. Let $|u_i| = a_i$ for $i = 1, \dots, r$ denote the number of objects in the set $u_i$, and $|v_j| = b_j$ for $j = 1, \dots, c$ denote the number of objects in $v_j$. Naturally, $\sum_{i=1}^{r} a_i = \sum_{j=1}^{c} b_j = N$. The overlap between the two partitions $U$ and $V$ can be represented in matrix form by an $r \times c$ contingency table $M$ where $n_{ij}$ is the number of objects in both $u_i$ and $v_j$, i.e. $n_{ij} = |u_i \cap v_j|$. Also, we refer to $a_i = \sum_{j=1}^{c} n_{ij}$ as the row marginals and to $b_j = \sum_{i=1}^{r} n_{ij}$ as the column marginals.

Table 2: The $r \times c$ contingency table $M$ relating two clusterings $U$ and $V$: its entries are $n_{ij} = |u_i \cap v_j|$, with row marginals $a_i = \sum_j n_{ij}$ and column marginals $b_j = \sum_i n_{ij}$.

Pair-counting measures between partitions, such as the Rand Index (RI) (Rand, 1971), can be defined using the following quantities: $k_{11}$, the number of pairs of objects in the same set in both $U$ and $V$; $k_{00}$, the number of pairs of objects not in the same set in $U$ and not in the same set in $V$; $k_{10}$, the number of pairs of objects in the same set in $U$ but not in the same set in $V$; and $k_{01}$, the number of pairs of objects not in the same set in $U$ but in the same set in $V$. All these quantities can be computed using the contingency table $M$, for example:

$$k_{11} = \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij}(n_{ij}-1), \qquad k_{00} = \frac{1}{2}\Bigl(N^2 + \sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij}^2 - \Bigl(\sum_{i=1}^{r} a_i^2 + \sum_{j=1}^{c} b_j^2\Bigr)\Bigr) \qquad (1)$$

Using $k_{11}$, $k_{00}$, $k_{10}$, and $k_{01}$ it is possible to compute similarity measures, e.g. RI, or distance measures, e.g. the Mirkin index $\mathrm{MK}(U,V) \triangleq \sum_i a_i^2 + \sum_j b_j^2 - 2\sum_{i,j} n_{ij}^2$, between partitions (Meilă, 2007):

$$\mathrm{RI}(U,V) \triangleq (k_{11} + k_{00}) \Big/ \binom{N}{2}, \qquad \mathrm{MK}(U,V) = 2(k_{10} + k_{01}) = N(N-1)\bigl(1 - \mathrm{RI}(U,V)\bigr) \qquad (2)$$

Information theoretic measures are instead defined for random variables, but they can also be used to compare partitions when we employ the empirical probability distributions associated to $U$, $V$, and the joint partition $(U,V)$. Let $\frac{a_i}{N}$, $\frac{b_j}{N}$, and $\frac{n_{ij}}{N}$ be the probabilities that an object falls in the set $u_i$, in $v_j$, and in $u_i \cap v_j$ respectively. We can therefore define the Shannon entropy with natural logarithms for a partition $V$ as follows: $H(V) \triangleq -\sum_j \frac{b_j}{N}\ln\frac{b_j}{N}$. Similarly, we can define the entropy $H(U)$ for the partition $U$, the joint entropy $H(U,V)$ for the joint partition $(U,V)$, and the conditional entropies $H(U|V)$ and $H(V|U)$. Shannon entropy can be used to define the well known mutual information (MI) and employ it to compute the similarity between partitions $U$ and $V$:

$$\mathrm{MI}(U,V) \triangleq H(U) - H(U|V) = H(V) - H(V|U) = H(U) + H(V) - H(U,V) \qquad (3)$$

On contingency tables, MI is linearly related to the $G$-statistic used for likelihood-ratio tests: $G = 2N \cdot \mathrm{MI}$. In Meilă (2007), using the Shannon entropy, it was shown that the following distance, namely the variation of information (VI), is a metric:

$$\mathrm{VI}(U,V) \triangleq 2H(U,V) - H(U) - H(V) = H(U|V) + H(V|U) = H(U) + H(V) - 2\,\mathrm{MI}(U,V) \qquad (4)$$
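To make the quantities above concrete, the following is a minimal sketch in Python (not the authors' released code; the function names are our own) that computes $k_{11}$, $k_{00}$, RI, MK, and the Shannon MI from a contingency table:

```python
import numpy as np

def pair_counts(M):
    """k11 and k00 of Eq. (1), computed from the contingency table M."""
    M = np.asarray(M, dtype=float)
    N = M.sum()
    a, b = M.sum(axis=1), M.sum(axis=0)          # row/column marginals
    k11 = 0.5 * (M * (M - 1)).sum()
    k00 = 0.5 * (N**2 + (M**2).sum() - ((a**2).sum() + (b**2).sum()))
    return k11, k00

def rand_index(M):
    """RI of Eq. (2)."""
    N = np.asarray(M).sum()
    k11, k00 = pair_counts(M)
    return (k11 + k00) / (N * (N - 1) / 2)

def mirkin(M):
    """Mirkin index MK = sum_i a_i^2 + sum_j b_j^2 - 2 sum_ij n_ij^2."""
    M = np.asarray(M, dtype=float)
    a, b = M.sum(axis=1), M.sum(axis=0)
    return (a**2).sum() + (b**2).sum() - 2 * (M**2).sum()

def shannon_mi(M):
    """MI of Eq. (3) in nats, via MI = H(U) + H(V) - H(U,V)."""
    M = np.asarray(M, dtype=float)
    N = M.sum()
    H = lambda w: -(w[w > 0] / N * np.log(w[w > 0] / N)).sum()
    return H(M.sum(axis=1)) + H(M.sum(axis=0)) - H(M.ravel())

M = np.array([[40, 10], [10, 40]])               # toy contingency table
N = M.sum()
assert np.isclose(mirkin(M), N * (N - 1) * (1 - rand_index(M)))  # Eq. (2)
```

The assertion checks the linear relation between MK and RI stated in Eq. (2).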
Generalized Information Theoretic (IT) measures based on the generalized Tsallis $q$-entropy (Tsallis, 1988) can be defined for random variables (Furuichi, 2006) and can also be applied to the task of comparing partitions (Simovici, 2007). Indeed, these measures have also seen recent application in the machine learning community. More specifically, it has been shown that they can act as proper kernels (Martins et al., 2009). Furthermore, empirical studies demonstrated that a careful choice of $q$ yields successful results when comparing the similarity between documents (Vila et al., 2011), in decision tree induction (Maszczyk and Duch, 2008), and in reverse engineering of biological networks (Lopes et al., 2011). It is important to note that the Tsallis $q$-entropy is equivalent to the Havrda-Charvát-Daróczy generalized entropy proposed in Havrda and Charvát (1967); Daróczy (1970). Results available in the literature about these generalized entropies are equally valid for all the proposed versions.

Given $q \in \mathbb{R}^+ \setminus \{1\}$, the generalized Tsallis $q$-entropy for a partition $V$ is defined as follows: $H_q(V) \triangleq \frac{1}{q-1}\bigl(1 - \sum_j (\frac{b_j}{N})^q\bigr)$. Similarly to the case of Shannon entropy, we have the joint $q$-entropy $H_q(U,V)$ and the conditional $q$-entropies $H_q(U|V)$ and $H_q(V|U)$. Conditional $q$-entropy is computed according to a weighted average parametrized in $q$. More specifically, the formula for $H_q(V|U)$ is:

$$H_q(V|U) \triangleq \sum_{i=1}^{r}\Bigl(\frac{a_i}{N}\Bigr)^q H_q(V|u_i) = \sum_{i=1}^{r}\Bigl(\frac{a_i}{N}\Bigr)^q \frac{1}{q-1}\Bigl(1 - \sum_{j=1}^{c}\Bigl(\frac{n_{ij}}{a_i}\Bigr)^q\Bigr) \qquad (5)$$

The $q$-entropy reduces to the Shannon entropy computed in nats for $q \to 1$. Since $H_q(U) \ge H_q(U|V)$ when $q > 1$, non-negative MI can be naturally generalized with the $q$-entropy when $q > 1$:

$$\mathrm{MI}_q(U,V) \triangleq H_q(U) - H_q(U|V) = H_q(V) - H_q(V|U) = H_q(U) + H_q(V) - H_q(U,V) \qquad (6)$$

However, $q$ values smaller than 1 are allowed if the assumption that $\mathrm{MI}_q(U,V)$ is always positive can be dropped. In addition, generalized entropic measures can be used to define the generalized variation of information distance ($\mathrm{VI}_q$), which tends to VI in Eq. (4) when $q \to 1$:

$$\mathrm{VI}_q(U,V) \triangleq H_q(U|V) + H_q(V|U) = 2H_q(U,V) - H_q(U) - H_q(V) = H_q(U) + H_q(V) - 2\,\mathrm{MI}_q(U,V) \qquad (7)$$

In Simovici (2007) it was shown that $\mathrm{VI}_q$ is a proper metric, and interesting links were identified between measures for comparing partitions $U$ and $V$. We state these links in Proposition 1 given that they set the fundamental motivation of our paper:

Proposition 1 (Simovici, 2007) When $q = 2$ the generalized variation of information, the Mirkin index, and the Rand index are linearly related:

$$\mathrm{VI}_2(U,V) = \frac{1}{N^2}\,\mathrm{MK}(U,V) = \frac{N-1}{N}\bigl(1 - \mathrm{RI}(U,V)\bigr).$$

Generalized IT measures are thus not only a generalization of IT measures in the Shannon sense, but also a generalization of pair-counting measures for particular values of $q$.

To allow a more interpretable range of variation, a clustering similarity measure should be normalized: it should achieve its maximum of 1 when $U = V$. An upper bound to the generalized mutual information $\mathrm{MI}_q$ is used to obtain a normalized measure. $\mathrm{MI}_q$ admits different possible upper bounds (Furuichi, 2006). Here, we choose to derive another possible upper bound using Eq. (7) at the minimum value $\mathrm{VI}_q = 0$: $\max \mathrm{MI}_q = \frac{1}{2}\bigl(H_q(U) + H_q(V)\bigr)$. This upper bound is valid for any $q \in \mathbb{R}^+ \setminus \{1\}$ and allows us to link different existing measures, as we will show in the next sections of the paper. The Normalized Mutual Information with $q$-entropy ($\mathrm{NMI}_q$) is defined as follows:

$$\mathrm{NMI}_q(U,V) \triangleq \frac{\mathrm{MI}_q(U,V)}{\max \mathrm{MI}_q(U,V)} = \frac{\mathrm{MI}_q(U,V)}{\frac{1}{2}\bigl(H_q(U) + H_q(V)\bigr)} = \frac{H_q(U) + H_q(V) - H_q(U,V)}{\frac{1}{2}\bigl(H_q(U) + H_q(V)\bigr)} \qquad (8)$$

Even if $\mathrm{NMI}_q(U,V)$ achieves its maximum of 1 when the partitions $U$ and $V$ are identical, $\mathrm{NMI}_q(U,V)$ is not a suitable clustering comparison measure. Indeed, it does not show a constant baseline value equal to 0 when partitions are random. We explore this through an experiment. Given a data set of $N = 100$ objects, we randomly generate uniform partitions $U$ with $r$ ranging from 2 to 10 sets, and $V$ with $c = 6$ sets, independently from each other. The average value of $\mathrm{NMI}_q$ over 1,000 simulations for different values of $q$ is shown in Figure 2. It is reasonable to expect that when the partitions are independent, the average value of $\mathrm{NMI}_q$ is constant irrespective of the number of sets $r$ of the partition $U$. This is not the case. This behaviour is unintuitive and misleading when comparing partitions.

[Figure 2: The baseline value of $\mathrm{NMI}_q$ between independent partitions is not constant.]

Computing the analytical expected value of generalized IT measures under the null hypothesis of random and independent $U$ and $V$ is therefore important; it can be subtracted from the measure itself to adjust its baseline for chance, such that its value is 0 when $U$ and $V$ are random. Given Proposition 1, this strategy also allows us to generalize adjusted for chance pair-counting and Shannon IT measures.
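The generalized quantities of Eqs. (5)-(8) are equally short to compute. A sketch under the same assumptions as the earlier snippet (our own helper names; $q \neq 1$):

```python
import numpy as np

def tsallis(p, q):
    """Tsallis q-entropy of a probability vector p, for q != 1."""
    return (1.0 - (p**q).sum()) / (q - 1.0)

def nmi_q(M, q):
    """NMI_q of Eq. (8): MI_q normalized by (H_q(U) + H_q(V)) / 2."""
    M = np.asarray(M, dtype=float)
    N = M.sum()
    hu = tsallis(M.sum(axis=1) / N, q)   # H_q(U) from row marginals
    hv = tsallis(M.sum(axis=0) / N, q)   # H_q(V) from column marginals
    huv = tsallis(M.ravel() / N, q)      # joint q-entropy H_q(U,V)
    return (hu + hv - huv) / (0.5 * (hu + hv))
```

Averaging `nmi_q` over contingency tables of independently shuffled labels reproduces the non-constant baseline of Figure 2.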
3. Baseline Adjustment
In order to adjust the baseline of a similarity measure $S(U,V)$, we have to compute its expected value under the null hypothesis of independent partitions $U$ and $V$. We adopt the assumption used for RI (Hubert and Arabie, 1985) and the Shannon MI (Vinh et al., 2009): partitions $U$ and $V$ are generated independently with a fixed number of points $N$ and fixed marginals $a_i$ and $b_j$; this is also denoted as the permutation or the hypergeometric model of randomness. We are able to compute the exact expected value for any similarity measure in the family $\mathcal{L}_\phi$:
Definition 1 Let $\mathcal{L}_\phi$ be the family of similarity measures $S(U,V) = \alpha + \beta \sum_{ij} \phi_{ij}(n_{ij})$, where $\alpha$ and $\beta$ do not depend on the entries $n_{ij}$ of the contingency table $M$ and the $\phi_{ij}(\cdot)$ are bounded real functions.

Intuitively, $\mathcal{L}_\phi$ represents the class of measures that can be written as a linear combination of $\phi_{ij}(n_{ij})$. A measure between partitions uniquely determines $\alpha$, $\beta$, and $\phi_{ij}$. However, not every choice of $\alpha$, $\beta$, and $\phi_{ij}$ yields a meaningful similarity measure. $\mathcal{L}_\phi$ is a superset of the family $\mathcal{L}$ defined in Albatineh et al. (2006) as the family of measures $S(U,V) = \alpha + \beta \sum_{ij} n_{ij}^2$, i.e. measures $S \in \mathcal{L}$ are special cases of measures in $\mathcal{L}_\phi$ with $\phi_{ij}(\cdot) = (\cdot)^2$. Figure 1 shows a diagram of the similarity measures discussed in Section 2.1 and their relationships.

Lemma 1 If $S(U,V) \in \mathcal{L}_\phi$, when partitions $U$ and $V$ are random:

$$E[S(U,V)] = \alpha + \beta \sum_{ij} E[\phi_{ij}(n_{ij})] \qquad (9)$$

where

$$E[\phi_{ij}(n_{ij})] = \sum_{n_{ij}=\max\{0,\, a_i+b_j-N\}}^{\min\{a_i,\, b_j\}} \phi_{ij}(n_{ij})\, \frac{a_i!\, b_j!\, (N-a_i)!\, (N-b_j)!}{N!\, n_{ij}!\, (a_i-n_{ij})!\, (b_j-n_{ij})!\, (N-a_i-b_j+n_{ij})!} \qquad (10)$$

Lemma 1 extends the results in Albatineh and Niewiadomska-Bugaj (2011) showing exact computation of the expected value of measures in the family $\mathcal{L}$. Given that generalized IT measures belong to $\mathcal{L}_\phi$, we can employ this result to adjust them. Using Lemma 1 it is possible to compute the exact expected values of $H_q(U,V)$, $\mathrm{VI}_q(U,V)$, and $\mathrm{MI}_q(U,V)$.
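Eq. (10) is a finite sum over the support of a hypergeometric distribution, so it can be evaluated directly. Below is a minimal sketch (not the authors' released code) using scipy's `hypergeom`, whose parameters follow the (population, successes, draws) convention:

```python
import numpy as np
from scipy.stats import hypergeom

def sum_expected_phi(a, b, N, phi):
    """sum_ij E[phi(n_ij)] under the permutation model, via Eq. (10)."""
    total = 0.0
    for ai in a:
        for bj in b:
            # support of n_ij ~ Hyp(a_i, b_j, N)
            n = np.arange(max(0, ai + bj - N), min(ai, bj) + 1)
            total += (phi(n) * hypergeom.pmf(n, N, bj, ai)).sum()
    return total

# sum_ij E[n_ij^q] for q = 2 with marginals a, b on N = 100 objects
print(sum_expected_phi([50, 50], [60, 40], 100, lambda n: n**2.0))
```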
Theorem 1 When the partitions $U$ and $V$ are random:
i) $E[H_q(U,V)] = \frac{1}{q-1}\bigl(1 - \frac{1}{N^q}\sum_{ij} E[n_{ij}^q]\bigr)$, with $E[n_{ij}^q]$ from Eq. (10) with $\phi_{ij}(n_{ij}) = n_{ij}^q$;
ii) $E[\mathrm{MI}_q(U,V)] = H_q(U) + H_q(V) - E[H_q(U,V)]$;
iii) $E[\mathrm{VI}_q(U,V)] = 2E[H_q(U,V)] - H_q(U) - H_q(V)$.

It is worth noting that this approach is valid for any $q \in \mathbb{R}^+ \setminus \{1\}$. We can use these expected values to adjust generalized IT measures for baseline. We use the method proposed in Hubert and Arabie (1985) to adjust similarity measures, such as $\mathrm{MI}_q$, and distance measures, such as $\mathrm{VI}_q$:

$$\mathrm{AMI}_q \triangleq \frac{\mathrm{MI}_q - E[\mathrm{MI}_q]}{\max \mathrm{MI}_q - E[\mathrm{MI}_q]}, \qquad \mathrm{AVI}_q \triangleq \frac{E[\mathrm{VI}_q] - \mathrm{VI}_q}{E[\mathrm{VI}_q] - \min \mathrm{VI}_q} \qquad (11)$$

$\mathrm{VI}_q$ is a distance measure, thus $\min \mathrm{VI}_q = 0$. For $\mathrm{MI}_q$ we use the upper bound $\max \mathrm{MI}_q = \frac{1}{2}\bigl(H_q(U) + H_q(V)\bigr)$, as for $\mathrm{NMI}_q$ in Eq. (8). An exhaustive list of adjusted versions of the Shannon MI can be found in Vinh et al. (2010); when the upper bound $\frac{1}{2}(H_q(U) + H_q(V))$ is used, the authors name the adjusted MI $\mathrm{AMI}_{\mathrm{sum}}$.

It is important to note that this type of adjustment turns distance measures into similarity measures, i.e., $\mathrm{AVI}_q$ is a similarity measure. It is also possible to maintain both the distance properties and the baseline adjustment using $\mathrm{NVI}_q \triangleq \mathrm{VI}_q / E[\mathrm{VI}_q]$, which can be seen as a normalization of $\mathrm{VI}_q$ with the stochastic upper bound $E[\mathrm{VI}_q]$ (Vinh et al., 2009). It is also easy to see that $\mathrm{AVI}_q = 1 - \mathrm{NVI}_q$. The adjustments in Eq. (11) also enable the measures to be normalized: $\mathrm{AMI}_q$ and $\mathrm{AVI}_q$ achieve their maximum of 1 when $U = V$, and their value is 0 in expectation when $U$ and $V$ are random partitions.

According to the chosen upper bound for $\mathrm{MI}_q$, we obtain the nice analytical form shown in Theorem 2. Our adjusted measures quantify the discrepancy between the values of the actual contingency table and their expected values, relative to the maximum discrepancy possible, i.e. the denominator in Eq. (12). It is also easy to see that all measures in $\mathcal{L}_\phi$ resemble this form when adjusted.
Theorem 2 Using $E[n_{ij}^q]$ in Eq. (10) with $\phi_{ij}(n_{ij}) = n_{ij}^q$, the adjustments for chance of $\mathrm{MI}_q(U,V)$ and $\mathrm{VI}_q(U,V)$ are:

$$\mathrm{AMI}_q(U,V) = \mathrm{AVI}_q(U,V) = \frac{\sum_{ij} n_{ij}^q - \sum_{ij} E[n_{ij}^q]}{\frac{1}{2}\bigl(\sum_i a_i^q + \sum_j b_j^q\bigr) - \sum_{ij} E[n_{ij}^q]} \qquad (12)$$

From now on we only discuss $\mathrm{AMI}_q$, given that it is identical to $\mathrm{AVI}_q$. There are notable special cases of our proposed adjusted generalized IT measures. In particular, the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) is equal to $\mathrm{AMI}_2$. ARI is a classic measure, heavily used for validation in the social sciences, and the most popular clustering validity index.
Corollary 1 It holds true that:
i) $\lim_{q \to 1} \mathrm{AMI}_q = \lim_{q \to 1} \mathrm{AVI}_q = \mathrm{AMI} = \mathrm{AVI}$ with Shannon entropy;
ii) $\mathrm{AMI}_2 = \mathrm{AVI}_2 = \mathrm{ARI}$.

Therefore, using the permutation model we can perform baseline adjustment of generalized IT measures. Our generalized adjusted IT measures are a further generalization of particular well known adjusted measures such as AMI and ARI. It is worth noting that ARI is equivalent to other well known measures for comparing partitions (Albatineh et al., 2006). Furthermore, there is also a strong connection between ARI and Cohen's $\kappa$ statistic used to quantify inter-rater agreement (Warrens, 2008).
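By Corollary 1, Eq. (12) at $q = 2$ must coincide with the ARI, which gives a convenient sanity check. The following sketch builds on the hypergeometric expectation above and cross-checks against scikit-learn (an assumed dependency; this is our own illustration, not the authors' code):

```python
import numpy as np
from scipy.stats import hypergeom
from sklearn.metrics import adjusted_rand_score

def ami_q(M, q):
    """AMI_q of Eq. (12) under the permutation model."""
    M = np.asarray(M, dtype=float)
    N = int(M.sum())
    a, b = M.sum(axis=1).astype(int), M.sum(axis=0).astype(int)
    e = 0.0                                  # sum_ij E[n_ij^q], Eq. (10)
    for ai in a:
        for bj in b:
            n = np.arange(max(0, ai + bj - N), min(ai, bj) + 1)
            e += (n**q * hypergeom.pmf(n, N, bj, ai)).sum()
    return ((M**q).sum() - e) / (0.5 * ((a**q).sum() + (b**q).sum()) - e)

u = [0, 0, 0, 1, 1, 1, 2, 2, 2]
v = [0, 0, 1, 1, 1, 2, 2, 2, 0]
M = np.zeros((3, 3))
for i, j in zip(u, v):
    M[i, j] += 1
print(ami_q(M, 2.0), adjusted_rand_score(u, v))   # the two should coincide
```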
Computational complexity: The computational complexity of $\mathrm{AMI}_q$ in Eq. (12) is dominated by the computation of the sum of the expected values of the cells.

Proposition 2 The computational complexity of $\mathrm{AMI}_q$ is $O(N \cdot \max\{r, c\})$.

If all the possible contingency tables $M$ obtained by permutations were generated, the computational complexity of the exact expected value would be $O(N!)$. However, this can be dramatically reduced using properties of the expected value.

Here we show that our adjusted generalized IT measures have a baseline value of 0 when comparing random partitions $U$ and $V$. In Figure 3 we show the behaviour of $\mathrm{AMI}_q$, ARI, and AMI on the same experiment proposed in Section 2.2. They are all close to 0 with negligible variation when the partitions are random and independent. Moreover, it is interesting to see the equivalence of $\mathrm{AMI}_2$ and ARI. On the other hand, the equivalence of $\mathrm{AMI}_q$ and AMI with Shannon's entropy is obtained only in the limit $q \to 1$.

[Figure 3: When varying the number of sets for the random partition $U$, the value of $\mathrm{AMI}_q(U,V)$ is always very close to 0 with negligible variation for any $q$.]

Moreover, $\mathrm{NMI}_q$ does not show a constant baseline when the relative size of the sets in $U$ varies while $U$ and $V$ are random. In Figure 4, we generate random partitions $V$ with $c = 6$ sets on $N = 100$ points, and random binary partitions $U$, independently. $\mathrm{NMI}_q(U,V)$ shows different behavior as the relative size of the biggest set in $U$ varies. This is unintuitive given that the partitions $U$ and $V$ are random and independent. We obtain the desired property of a baseline value of 0 with $\mathrm{AMI}_q$.

[Figure 4: When varying the relative size of one cluster for the random partition $U$, the value of $\mathrm{AMI}_q(U,V)$ is always very close to 0 with negligible variation for any $q$.]

In this section, we introduce a very general family of measures which includes $\mathcal{L}_\phi$. For measures belonging to this family, it is possible to find an approximation of their expected value when the number of objects $N$ is large. This allows us to identify approximations for the expected value of measures in $\mathcal{L}_\phi$ as well as of measures not in $\mathcal{L}_\phi$, such as the Jaccard coefficient, as shown in Figure 1. Let $\mathcal{N}_\phi$ be the family of measures which are non-linear combinations of $\phi_{ij}(n_{ij})$:
Definition 2 Let $\mathcal{N}_\phi$ be the family of similarity measures $S(U,V) = \phi\bigl(\frac{n_{11}}{N}, \dots, \frac{n_{ij}}{N}, \dots, \frac{n_{rc}}{N}\bigr)$, where $\phi$ is a bounded real function as $N$ reaches infinity.

Note that $\mathcal{N}_\phi$ is a generalization of $\mathcal{L}_\phi$. In the limit of a large number of objects $N$, it is possible to compute the expected value of measures in $\mathcal{N}_\phi$ under random partitions $U$ and $V$ using only the marginals of the contingency table $M$:

Lemma 2 If $S(U,V) \in \mathcal{N}_\phi$, then $\lim_{N \to +\infty} E[S(U,V)] = \phi\bigl(\frac{a_1}{N}\frac{b_1}{N}, \dots, \frac{a_i}{N}\frac{b_j}{N}, \dots, \frac{a_r}{N}\frac{b_c}{N}\bigr)$.

In Morey and Agresti (1984) the expected value of the RI was computed using an approximate value based on the multinomial distribution. It turns out that this approximate value is equal to what we obtain for RI using Lemma 2. The authors of Albatineh et al. (2006) noticed in empirical experiments that the difference between the approximation and the expected value obtained with the hypergeometric model is small when $N$ is large. We point out that this is a natural consequence of Lemma 2, given that $\mathrm{RI} \in \mathcal{L}_\phi \subseteq \mathcal{N}_\phi$. Moreover, the multinomial distribution was also used to compute the expected value of the Jaccard coefficient (J) in Albatineh and Niewiadomska-Bugaj (2011), obtaining good results in empirical experiments with many objects. Again, this is a natural consequence of Lemma 2, given that $\mathrm{J} \in \mathcal{N}_\phi$ but $\mathrm{J} \notin \mathcal{L}_\phi$.

Generalized IT measures belong to $\mathcal{L}_\phi \subseteq \mathcal{N}_\phi$, so we can employ Lemma 2. When the number of objects is large, the expected values under random partitions $U$ and $V$ of $H_q(U,V)$, $\mathrm{MI}_q(U,V)$, and $\mathrm{VI}_q(U,V)$ in Theorem 1 depend only on the entropies of the partitions $U$ and $V$:
Theorem 3 It holds true that:
i) $\lim_{N \to +\infty} E[H_q(U,V)] = H_q(U) + H_q(V) - (q-1)H_q(U)H_q(V)$;
ii) $\lim_{N \to +\infty} E[\mathrm{MI}_q(U,V)] = (q-1)H_q(U)H_q(V)$;
iii) $\lim_{N \to +\infty} E[\mathrm{VI}_q(U,V)] = H_q(U) + H_q(V) - 2(q-1)H_q(U)H_q(V)$.

Result i) recalls the non-additivity property that holds for random variables (Furuichi, 2006). Figure 5 shows the behaviour of $E[H_q(U,V)]$ when the partitions $U$ and $V$ are generated uniformly at random; $V$ has $c = 6$ sets and $U$ has $r$ sets. In this case, $H_q(U) + H_q(V) - (q-1)H_q(U)H_q(V)$ appears to be a good approximation already for $N = 1000$. In particular, the approximation is good when the number of records $N$ is big with regard to the number of cells of the contingency table in Table 2, i.e., when $\frac{N}{r \cdot c}$ is large enough.

[Figure 5: $E[H_q(U,V)]$ (solid) and the limit value $H_q(U) + H_q(V) - (q-1)H_q(U)H_q(V)$ (dashed), for $N = 50$ and $N = 1000$ objects. The solid line approximately coincides with the dashed one when $N = 1000$; the limit value is a good approximation for $E[H_q(U,V)]$ when $\frac{N}{r \cdot c}$ is large enough.]
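Theorem 3 i) can be checked numerically. The following sketch (our own helpers; uniform marginals are assumed purely for convenience) compares the exact expectation of Theorem 1 i) with the asymptotic value:

```python
import numpy as np
from scipy.stats import hypergeom

def Hq(p, q):
    """Tsallis q-entropy of a probability vector p."""
    return (1 - (p**q).sum()) / (q - 1)

def exact_E_Hq(a, b, N, q):
    """Exact E[H_q(U,V)] under the permutation model (Theorem 1 i))."""
    e = 0.0
    for ai in a:
        for bj in b:
            n = np.arange(max(0, ai + bj - N), min(ai, bj) + 1)
            e += (n**q * hypergeom.pmf(n, N, bj, ai)).sum()
    return (1 - e / N**q) / (q - 1)

q, r, c = 1.5, 5, 6
for N in (60, 600, 6000):
    a, b = np.full(r, N // r), np.full(c, N // c)
    hu, hv = Hq(a / N, q), Hq(b / N, q)
    print(N, exact_E_Hq(a, b, N, q), hu + hv - (q - 1) * hu * hv)
```

The two printed columns converge as $\frac{N}{r \cdot c}$ grows, matching the behaviour in Figure 5.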
4. Application Scenarios for AMI_q

In this section we aim to answer the question:
Given a reference ground truth clustering $V$, which is the best choice of $q$ in $\mathrm{AMI}_q(U,V)$ to validate a clustering solution $U$?

By answering this question, we implicitly identify the application scenarios for ARI and AMI, given the results in Corollary 1. This is particularly important for external clustering validation. Nonetheless, there are a number of other applications where the task is to find the most similar partition to a reference ground truth partition: e.g., categorical feature selection (Vinh et al., 2014), decision tree induction (Criminisi et al., 2012), generation of alternative or multi-view clusterings (Müller et al., 2013), or the exploration of the clustering space with the Meta-Clustering algorithm (Caruana et al., 2006; Lei et al., 2014), to list a few.

Different values of $q$ in $\mathrm{AMI}_q$ yield different biases. The source of these biases can be identified by analyzing the properties of the $q$-entropy. In Figure 6 we show the $q$-entropy of a binary partition as the relative size $p$ of one cluster varies. This can be computed analytically: $H_q(p) = \frac{1}{q-1}\bigl(1 - p^q - (1-p)^q\bigr)$. The range of variation of $H_q(p)$ is much bigger when $q$ is small. More specifically, when $q$ is small, the difference in entropy between an unbalanced partition and a balanced partition is big.

[Figure 6: Tsallis $q$-entropy $H_q(p)$ for a binary clustering where $p$ is the relative size of one cluster. When $q$ is small, the $q$-entropy varies in a bigger range.]

Let us focus on an example. Let $V$ be a reference clustering with 3 clusters of size 50 each, and let $U_1$ and $U_2$ be two clustering solutions with the same number of clusters and the same cluster sizes. The contingency tables for $U_1$ and $U_2$ are shown in Figure 7. Given that both contingency tables have the same marginals, the only difference between $\mathrm{AMI}_q(U_1,V)$ and $\mathrm{AMI}_q(U_2,V)$ according to Eq. (11) lies in $\mathrm{MI}_q$. Given that both solutions $U_1$ and $U_2$ are compared against $V$, the only term that varies in $\mathrm{MI}_q(U,V) = H_q(V) - H_q(V|U)$ is $H_q(V|U)$. In order to identify the clustering solution that maximizes $\mathrm{AMI}_q$ we have to analyze the solution that decreases $H_q(V|U)$ the most. $H_q(V|U)$ is a weighted average of the entropies $H_q(V|u_i)$ computed on the rows of the contingency table, as shown in Eq. (5), and this is sensitive to values equal to 0. Given the bigger range of variation of $H_q$ for small $q$, small $q$ implies higher sensitivity to row entropies of 0. Therefore, small values of $q$ decrease $H_q(V|U)$ much more if the clusters in the solution $U$ are pure, i.e., if clusters contain elements from only one cluster in the reference clustering $V$. In other words, $\mathrm{AMI}_q$ with small $q$ prefers pure clusters in the clustering solution.

[Figure 7: Contingency tables for two solutions against a reference $V$ with 3 clusters of size 50 each: $U_1$ has rows $(50,0,0)$, $(0,44,6)$, $(0,6,44)$ and $U_2$ has rows $(48,1,1)$, $(1,46,3)$, $(1,3,46)$. $\mathrm{AMI}_q$ with small $q$ prefers the solution $U_1$ because there exists one pure cluster, i.e., one cluster which contains elements from only one cluster in the reference clustering $V$.]

When the marginals in the contingency tables of two solutions are different, another important factor in the computation of $\mathrm{AMI}_q$ is the normalization coefficient $\frac{1}{2}(H_q(U) + H_q(V))$. Balanced solutions $U$ will be penalized more by $\mathrm{AMI}_q$ when $q$ is small. Therefore, $\mathrm{AMI}_q$ with small $q$ prefers unbalanced clustering solutions. To summarize, $\mathrm{AMI}_q$ with small $q$, such as the Shannon-entropy AMI ($q \to 1$) or smaller:

• is biased towards pure clusters in the clustering solution;

• prefers unbalanced clustering solutions.

By contrast, $\mathrm{AMI}_q$ with bigger $q$, such as $\mathrm{AMI}_2 = \mathrm{ARI}$ or larger:

• is less biased towards pure clusters in the clustering solution;

• prefers balanced clustering solutions.

Given a reference clustering $V$, these biases can guide the choice of $q$ in $\mathrm{AMI}_q$ to identify more suitable clustering solutions.

Use $\mathrm{AMI}_q$ with small $q$, such as the Shannon-entropy AMI, when the reference clustering is unbalanced and there exist small clusters.
If the reference clustering $V$ is unbalanced and presents small clusters, $\mathrm{AMI}_q$ with small $q$ might prefer more appropriate clustering solutions $U$. For example, in Figure 8 we show two contingency tables associated to two clustering solutions $U_1$ and $U_2$ for a reference clustering $V$ with 4 clusters of sizes $[10, 10, 10, 70]$ respectively. When there exist small clusters in the reference $V$, their identification has to be precise in the clustering solution. The solution $U_1$ looks arguably better than $U_2$ because it shows many pure clusters. In this scenario we advise the use of $\mathrm{AMI}_q$ with small $q$, such as $\mathrm{AMI}$ with Shannon's entropy, because it gives more weight to the clustering solution $U_1$.

[Figure 8: $\mathrm{AMI}_q$ with small $q$ prefers the solution $U_1$ because its clusters are pure; by contrast, $U_2$ has rows $(7,1,1,1)$, $(1,7,1,1)$, $(1,1,7,1)$, $(1,1,1,67)$ against the reference cluster sizes $(10,10,10,70)$. When the reference clustering has small clusters, their identification in the solution has to be precise. In this scenario we advise the use of $\mathrm{AMI}_q$ with small $q$, e.g. the Shannon AMI.]

Use $\mathrm{AMI}_q$ with big $q$, such as $\mathrm{AMI}_2 = \mathrm{ARI}$, when the reference clustering has big equal sized clusters.

If $V$ is a reference clustering with big equal sized clusters, it is less crucial to have precise clusters in the solution. Indeed, precise clusters in the solution penalize the recall of clusters in the reference. In this case, $\mathrm{AMI}_q$ with bigger $q$ might prefer more appropriate solutions. In Figure 9 we show two clustering solutions $U_1$ and $U_2$ for a reference clustering $V$ with 4 equal sized clusters of size 25. The solution $U_2$ looks better than $U_1$ because each of its clusters identifies more elements from a particular cluster in the reference. Moreover, $U_2$ is to be preferred to $U_1$ because it consists of 4 roughly equal sized clusters, like the reference clustering $V$. In this scenario we advise the use of $\mathrm{AMI}_q$ with big $q$, such as $\mathrm{AMI}_2 = \mathrm{ARI}$, because it gives more importance to the solution $U_2$.

[Figure 9: $\mathrm{AMI}_q$ with big $q$ prefers the solution $U_2$ because it is less biased towards pure clusters in the solution: $U_1$ has rows $(17,0,0,0)$, $(0,17,0,0)$, $(0,0,17,0)$, $(8,8,8,25)$, while $U_2$ has rows $(20,2,1,1)$, $(2,20,2,1)$, $(1,1,20,1)$, $(2,2,2,22)$. When the reference clustering has big equal sized clusters, their precise identification is less crucial. In this scenario we advise the use of $\mathrm{AMI}_2 = \mathrm{ARI}$.]
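The contrast between small and big $q$ can be reproduced numerically. The sketch below is our own implementation of Eq. (12); the tables are those recovered for Figure 9, and $q = 1.001$ stands in for the Shannon limit:

```python
import numpy as np
from scipy.stats import hypergeom

def ami_q(M, q):
    """AMI_q of Eq. (12) under the permutation model."""
    M = np.asarray(M, dtype=float)
    N = int(M.sum())
    a, b = M.sum(axis=1).astype(int), M.sum(axis=0).astype(int)
    e = 0.0
    for ai in a:
        for bj in b:
            n = np.arange(max(0, ai + bj - N), min(ai, bj) + 1)
            e += (n**q * hypergeom.pmf(n, N, bj, ai)).sum()
    return ((M**q).sum() - e) / (0.5 * ((a**q).sum() + (b**q).sum()) - e)

U1 = np.array([[17, 0, 0, 0], [0, 17, 0, 0],      # pure but unbalanced
               [0, 0, 17, 0], [8, 8, 8, 25]])
U2 = np.array([[20, 2, 1, 1], [2, 20, 2, 1],      # balanced, slightly noisy
               [1, 1, 20, 1], [2, 2, 2, 22]])
for q in (1.001, 2.0):
    print(q, ami_q(U1, q), ami_q(U2, q))
# q near 1 (AMI) ranks U1 higher; q = 2 (ARI) ranks U2 higher.
```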
5. Standardization of Clustering Comparison Measures
Selection of the most similar partition $U$ to a reference partition $V$ is biased according to the chosen similarity measure, the number of sets $r$ in $U$, and their relative sizes. This phenomenon is known as selection bias, and it has been extensively studied for decision trees (White and Liu, 1994). Researchers in this area agree that in order to achieve unbiased selection of partitions, distributional properties of similarity measures have to be taken into account (Dobra and Gehrke, 2001; Shih, 2004; Hothorn et al., 2006). Using the permutation model, we proposed in Romano et al. (2014) to analytically standardize the Shannon MI by subtraction of its expected value and division by its standard deviation. In this section, we discuss how to achieve analytical standardization of measures $S \in \mathcal{L}_\phi$.

In order to standardize measures $S(U,V)$ we must analytically compute their variance:
Lemma 3 If $S(U,V) \in \mathcal{L}_\phi$, when partitions $U$ and $V$ are random:

$$\mathrm{Var}(S(U,V)) = \beta^2\Bigl(E\Bigl[\Bigl(\sum_{ij}\phi_{ij}(n_{ij})\Bigr)^2\Bigr] - \Bigl(\sum_{ij}E[\phi_{ij}(n_{ij})]\Bigr)^2\Bigr)$$

where

$$E\Bigl[\Bigl(\sum_{ij}\phi_{ij}(n_{ij})\Bigr)^2\Bigr] = \sum_{ij}\sum_{n_{ij}}\phi_{ij}(n_{ij})P(n_{ij})\cdot\Biggl[\phi_{ij}(n_{ij}) + \sum_{i'\neq i}\sum_{\tilde{n}_{i'j}}\phi_{i'j}(\tilde{n}_{i'j})P(\tilde{n}_{i'j}) + \sum_{j'\neq j}\sum_{\tilde{n}_{ij'}}P(\tilde{n}_{ij'})\Bigl(\phi_{ij'}(\tilde{n}_{ij'}) + \sum_{i'\neq i}\sum_{\tilde{\tilde{n}}_{i'j'}}\phi_{i'j'}(\tilde{\tilde{n}}_{i'j'})P(\tilde{\tilde{n}}_{i'j'})\Bigr)\Biggr] \qquad (13)$$

where $n_{ij} \sim \mathrm{Hyp}(a_i, b_j, N)$, $\tilde{n}_{i'j} \sim \mathrm{Hyp}(b_j - n_{ij}, a_{i'}, N - a_i)$, $\tilde{n}_{ij'} \sim \mathrm{Hyp}(a_i - n_{ij}, b_{j'}, N - b_j)$, and $\tilde{\tilde{n}}_{i'j'} \sim \mathrm{Hyp}(a_{i'}, b_{j'} - \tilde{n}_{ij'}, N - a_i)$ are hypergeometric random variables.

We can use this result to standardize measures $S \in \mathcal{L}_\phi$, such as generalized IT measures. The variance of generalized IT measures under the permutation model is:

Theorem 4 Using Eqs. (10) and (13) with $\phi_{ij}(\cdot) = (\cdot)^q$, when the partitions $U$ and $V$ are random:
i) $\mathrm{Var}(H_q(U,V)) = \frac{1}{(q-1)^2 N^{2q}}\bigl(E[(\sum_{ij} n_{ij}^q)^2] - (\sum_{ij} E[n_{ij}^q])^2\bigr)$;
ii) $\mathrm{Var}(\mathrm{MI}_q(U,V)) = \mathrm{Var}(H_q(U,V))$;
iii) $\mathrm{Var}(\mathrm{VI}_q(U,V)) = 4\,\mathrm{Var}(H_q(U,V))$.

We define the standardized version of the similarity measure $\mathrm{MI}_q$ ($\mathrm{SMI}_q$) and the standardized version of the distance measure $\mathrm{VI}_q$ ($\mathrm{SVI}_q$) as follows:

$$\mathrm{SMI}_q \triangleq \frac{\mathrm{MI}_q - E[\mathrm{MI}_q]}{\sqrt{\mathrm{Var}(\mathrm{MI}_q)}}, \qquad \mathrm{SVI}_q \triangleq \frac{E[\mathrm{VI}_q] - \mathrm{VI}_q}{\sqrt{\mathrm{Var}(\mathrm{VI}_q)}} \qquad (14)$$

As for the case of $\mathrm{AMI}_q$ and $\mathrm{AVI}_q$, it turns out that $\mathrm{SMI}_q$ is equal to $\mathrm{SVI}_q$:
Theorem 5 Using Eqs. (10) and (13) with $\phi_{ij}(\cdot) = (\cdot)^q$, the standardized $\mathrm{MI}_q(U,V)$ and the standardized $\mathrm{VI}_q(U,V)$ are:

$$\mathrm{SMI}_q(U,V) = \mathrm{SVI}_q(U,V) = \frac{\sum_{ij} n_{ij}^q - \sum_{ij} E[n_{ij}^q]}{\sqrt{E\bigl[(\sum_{ij} n_{ij}^q)^2\bigr] - \bigl(\sum_{ij} E[n_{ij}^q]\bigr)^2}} \qquad (15)$$

This formula shows that we are interested in maximizing the difference between the sum of the cells of the actual contingency table and the sum of the expected cells under randomness. However, standardized measures differ from their adjusted counterparts because of the denominator, i.e. the standard deviation of the sum of the cells. Indeed, $\mathrm{SMI}_q$ and $\mathrm{SVI}_q$ measure the number of standard deviations $\mathrm{MI}_q$ and $\mathrm{VI}_q$ lie from their mean.

There are notable special cases for particular choices of $q$. Indeed, our generalized standardization of IT measures also generalizes the standardization of pair-counting measures such as the Rand Index. To see this, let us define the Standardized Rand Index (SRI), $\mathrm{SRI} \triangleq \frac{\mathrm{RI} - E[\mathrm{RI}]}{\sqrt{\mathrm{Var}(\mathrm{RI})}}$, and recall that the standardized $G$-statistic is defined as $S_G \triangleq \frac{G - E[G]}{\sqrt{\mathrm{Var}(G)}}$ (Romano et al., 2014).
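Because $\mathrm{SMI}_q$ is a linear function of $\sum_{ij} n_{ij}^q$, Eq. (15) can be sanity-checked with a Monte Carlo permutation estimate, replacing the exact moments of Eqs. (10) and (13) by sample moments. A minimal sketch (our own helper names):

```python
import numpy as np

def smi_q_mc(u, v, q, n_perm=5000, seed=0):
    """Monte Carlo estimate of SMI_q (Eq. 15) under the permutation model."""
    rng = np.random.default_rng(seed)
    u, v = np.asarray(u), np.asarray(v)
    r, c = u.max() + 1, v.max() + 1
    def stat(vv):                   # sum_ij n_ij^q of the induced table
        M = np.zeros((r, c))
        np.add.at(M, (u, vv), 1)
        return (M**q).sum()
    null = np.array([stat(rng.permutation(v)) for _ in range(n_perm)])
    return (stat(v) - null.mean()) / null.std()

u = [0]*5 + [1]*5 + [2]*5
v = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0]
print(smi_q_mc(u, v, q=2.0))
```

The exact formula of Eq. (15) avoids the $n_{\mathrm{perm}}$-sized sampling error of this estimate.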
Corollary 2 It holds true that:
i) $\lim_{q \to 1} \mathrm{SMI}_q = \lim_{q \to 1} \mathrm{SVI}_q = \mathrm{SMI} = \mathrm{SVI} = S_G$ with Shannon entropy;
ii) $\mathrm{SMI}_2 = \mathrm{SVI}_2 = \mathrm{SRI}$.
Computational complexity: The computational complexity of $\mathrm{SVI}_q$ is dominated by the computation of the second moment of the sum of the cells defined in Eq. (13):

Proposition 3 The computational complexity of $\mathrm{SVI}_q$ is $O(N^2 c \cdot \max\{c, r\})$.

Note that the complexity is quadratic in $c$ and linear in $r$. This happens because of the way we chose to condition the probabilities in Eq. (13) in the proof of Lemma 3. With different conditioning, it is possible to obtain a formula symmetric to Eq. (13) with complexity $O(N^2 r \cdot \max\{r, c\})$ (Romano et al., 2014).
Statistical inference: All IT measures computed on partitions can be seen as estimators of their true values computed on the random variables associated to the partitions $U$ and $V$. Therefore, $\mathrm{SMI}_q$ can be seen as a non-parametric test for independence based on $\mathrm{MI}_q$. We formalize this with the following proposition:
Proposition 4 The $p$-value associated to the test for independence between $U$ and $V$ using $\mathrm{MI}_q(U,V)$ is smaller than $\frac{1}{1 + \mathrm{SMI}_q(U,V)^2}$.

For example, if $\mathrm{SMI}_q$ is equal to 4.46, the associated $p$-value is smaller than 0.05. Neural time series data is often analyzed making use of the Shannon MI (e.g. see Chapter 29 in Cohen (2014)). It is common practice to test the independence of two time series by computing SMI via Monte Carlo permutations, sampling from a space of $N!$ cardinality. Our $\mathrm{SMI}_q$ can be effectively and efficiently used in this application because it is exact and obtains $O(N^2 r \cdot \max\{r, c\})$ complexity.

In this section, we evaluate the performance of standardized measures for selection bias correction when partitions $U$ are generated at random and independently from the reference partition $V$. This hypothesis has been employed in previously published research to study selection bias (White and Liu, 1994; Frank and Witten, 1998; Dobra and Gehrke, 2001; Shih, 2004; Hothorn et al., 2006; Romano et al., 2014). In particular, we experimentally demonstrate that $\mathrm{NMI}_q$ is biased towards the selection of partitions $U$ with more clusters at any $q$; therefore, in this scenario it is beneficial to perform standardization, although the choice of whether to perform standardization or not depends on the application (Romano et al., 2015). For example, it has been argued that in some cases the selection of clustering solutions should be biased towards solutions with the same number of clusters as in the reference (Amelio and Pizzuti, 2015). In this section we aim to show the effects of selection bias when clusterings are independent, and that standardization helps in reducing it. Moreover, we will see in Section 5.3 that it is particularly important to correct for selection bias when the number of records $N$ is small.

Given a reference partition $V$ on $N = 100$ objects with $c = 4$ sets, we generate a pool of random partitions $U$ with $r$ ranging from 2 to 10 sets. Then, we use $\mathrm{NMI}_q(U,V)$ to select the closest partition to the reference $V$. The plot at the bottom of Figure 10 shows the probability of selecting a partition $U$ with $r$ sets using $\mathrm{NMI}_q$, computed over 5000 simulations. We do not expect any partition to be the best, given that they are all generated at random; i.e., the plot is expected to be flat if a measure is unbiased. Nonetheless, we see that there is a clear bias towards partitions with 10 sets if we use $\mathrm{NMI}_q$ with $q$ respectively equal to 1.001, 2, or 3. The use of adjusted measures such as $\mathrm{AMI}_q$ helps in decreasing this bias, in particular when $q = 2$. In this experiment, when $q = 2$, baseline adjustment seems to be effective in decreasing the selection bias because the variance of $\mathrm{AMI}_2 = \mathrm{ARI}$ is almost constant. However, for all $q$, using $\mathrm{SMI}_q$ we obtain a close to uniform probability of selection of each random partition $U$.

[Figure 10: Selection bias towards partitions with different $r$ when compared to a reference $V$, for $q = 1.001$, $q = 2$, and $q = 3$. The probability of selection should be uniform when partitions are random; using $\mathrm{SMI}_q$ we achieve a close to uniform probability of selection.]

It is likely that the variance of generalized IT measures decreases when partitions are generated on a large number of objects $N$.
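A minimal sketch of this selection-bias experiment (our own simplified version: a fixed random reference and $\mathrm{NMI}_q$ as the selection criterion) shows the skew towards large $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, trials = 100, 4, 2000
v = rng.integers(0, c, N)                 # random reference partition

def table(u, r):
    M = np.zeros((r, c))
    np.add.at(M, (u, v), 1)
    return M

def nmi_q(M, q):
    N = M.sum()
    h = lambda w: (1 - ((w / N)**q).sum()) / (q - 1)
    hu, hv = h(M.sum(axis=1)), h(M.sum(axis=0))
    return (hu + hv - h(M.ravel())) / (0.5 * (hu + hv))

wins = np.zeros(11, dtype=int)
for _ in range(trials):
    pool = [(r, rng.integers(0, r, N)) for r in range(2, 11)]
    best = max(pool, key=lambda p: nmi_q(table(p[1], p[0]), q=2.0))
    wins[best[0]] += 1
print(wins[2:])   # counts skew heavily towards r = 10 for NMI_q
```

Swapping the selection criterion for a standardized score flattens the counts, which is the effect shown in Figure 10.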
Here we prove a general result about measures of the family $\mathcal{N}_\phi$:

Lemma 4 If $S(U,V) \in \mathcal{N}_\phi$, then $\lim_{N \to +\infty} \mathrm{Var}(S(U,V)) = 0$.

Given that generalized IT measures belong to the family $\mathcal{N}_\phi$, we can prove the following:

Theorem 6 It holds true that:

$$\lim_{N \to +\infty} \mathrm{Var}(H_q(U,V)) = \lim_{N \to +\infty} \mathrm{Var}(\mathrm{MI}_q(U,V)) = \lim_{N \to +\infty} \mathrm{Var}(\mathrm{VI}_q(U,V)) = 0 \qquad (16)$$

Therefore, $\mathrm{SMI}_q$ attains very large values when $N$ is large. In practice, of course, $N$ is finite, so the use of $\mathrm{SMI}_q$ is beneficial. However, it is less important to correct for selection bias if the number of objects $N$ is big with regard to the number of cells in the contingency table in Table 2, i.e., when $\frac{N}{r \cdot c}$ is large. Indeed, when the number of objects is large, $\mathrm{AMI}_q$ might be sufficient to avoid selection bias and any test for independence between partitions has high power. In this scenario, $\mathrm{SMI}_q$ is not needed and $\mathrm{AMI}_q$ might be preferred as it can be computed more efficiently.
6. Conclusion
In this paper, we computed the exact expected value and variance of measures of the family $\mathcal{L}_\phi$, which contains generalized IT measures. We also showed how the expected value of measures $S \in \mathcal{N}_\phi$ can be computed for large $N$. Using these statistics, we proposed $\mathrm{AMI}_q$ and $\mathrm{SMI}_q$ to adjust generalized IT measures both for baseline and for selection bias. $\mathrm{AMI}_q$ is a further generalization of well known measures for clustering comparisons such as ARI and AMI. This analysis allowed us to provide guidelines for their best application in different scenarios. In particular, ARI should be used as an external validation index when the reference clustering has big equal sized clusters, and AMI should be used when the reference clustering is unbalanced and there exist small clusters. The standardized $\mathrm{SMI}_q$ can instead be used to correct for selection bias among many possible candidate clustering solutions when the number of objects is small. Furthermore, it can also be used to test the independence between two partitions. All code has been made available online at https://sites.google.com/site/adjgenit/.

Appendix A. Theorem Proofs
Proposition 1 (Simovici, 2007) When $q = 2$ the generalized variation of information, the Mirkin index, and the Rand index are linearly related: $\mathrm{VI}_2(U,V) = \frac{1}{N^2}\mathrm{MK}(U,V) = \frac{N-1}{N}\bigl(1 - \mathrm{RI}(U,V)\bigr)$.

Proof

$$\mathrm{VI}_q(U,V) = 2H_q(U,V) - H_q(U) - H_q(V) = \frac{2}{q-1}\Bigl(1 - \sum_{i=1}^{r}\sum_{j=1}^{c}\Bigl(\frac{n_{ij}}{N}\Bigr)^q\Bigr) - \frac{1}{q-1}\Bigl(1 - \sum_{i=1}^{r}\Bigl(\frac{a_i}{N}\Bigr)^q\Bigr) - \frac{1}{q-1}\Bigl(1 - \sum_{j=1}^{c}\Bigl(\frac{b_j}{N}\Bigr)^q\Bigr) = \frac{1}{(q-1)N^q}\Bigl(\sum_{i=1}^{r} a_i^q + \sum_{j=1}^{c} b_j^q - 2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij}^q\Bigr)$$

When $q = 2$, $\mathrm{VI}_2(U,V) = \frac{1}{N^2}\bigl(\sum_i a_i^2 + \sum_j b_j^2 - 2\sum_{i,j} n_{ij}^2\bigr) = \frac{1}{N^2}\mathrm{MK}(U,V) = \frac{N-1}{N}\bigl(1 - \mathrm{RI}(U,V)\bigr)$.
Lemma 1 If $S(U,V) \in \mathcal{L}_\phi$, when partitions $U$ and $V$ are random, $E[S(U,V)] = \alpha + \beta \sum_{ij} E[\phi_{ij}(n_{ij})]$, with $E[\phi_{ij}(n_{ij})]$ as in Eq. (10).

Proof The expected value of $S(U,V)$ according to the hypergeometric model of randomness is $E[S(U,V)] = \sum_M S(M)P(M)$, where $M$ is a contingency table generated via permutations. This reduces to $E[S(U,V)] = \sum_M \bigl(\alpha + \beta\sum_{ij}\phi_{ij}(n_{ij})\bigr)P(M) = \alpha + \beta\sum_M\sum_{ij}\phi_{ij}(n_{ij})P(M)$. Because of the linearity of the expected value, it is possible to swap the summation over $M$ and the one over cells, obtaining $\alpha + \beta\sum_{ij}\sum_{n_{ij}}\phi_{ij}(n_{ij})P(n_{ij}) = \alpha + \beta\sum_{ij}E[\phi_{ij}(n_{ij})]$, where $n_{ij}$ follows a hypergeometric distribution with the marginals $a_i$, $b_j$, and $N$ as parameters, i.e. $n_{ij} \sim \mathrm{Hyp}(a_i, b_j, N)$.
Theorem 1 When the partitions $U$ and $V$ are random:
i) $E[H_q(U,V)] = \frac{1}{q-1}\bigl(1 - \frac{1}{N^q}\sum_{ij}E[n_{ij}^q]\bigr)$, with $E[n_{ij}^q]$ from Eq. (10) with $\phi_{ij}(n_{ij}) = n_{ij}^q$;
ii) $E[\mathrm{MI}_q(U,V)] = H_q(U) + H_q(V) - E[H_q(U,V)]$;
iii) $E[\mathrm{VI}_q(U,V)] = 2E[H_q(U,V)] - H_q(U) - H_q(V)$.

Proof The results easily follow from Lemma 1 and the hypothesis of fixed marginals.
Theorem 2 Using $E[n_{ij}^q]$ in Eq. (10) with $\phi_{ij}(n_{ij}) = n_{ij}^q$, the adjustments for chance of $\mathrm{MI}_q(U,V)$ and $\mathrm{VI}_q(U,V)$ are given by Eq. (12).

Proof Using the upper bound $\frac{1}{2}(H_q(U) + H_q(V))$ for $\mathrm{MI}_q$, $\mathrm{AMI}_q$ and $\mathrm{AVI}_q$ are equivalent; therefore we compute $\mathrm{AVI}_q$. The denominator is equal to $E[\mathrm{VI}_q] = \frac{1}{(q-1)N^q}\bigl(\sum_i a_i^q + \sum_j b_j^q - 2\sum_{i,j}E[n_{ij}^q]\bigr)$. The numerator is instead $E[\mathrm{VI}_q] - \mathrm{VI}_q = \frac{2}{(q-1)N^q}\bigl(\sum_{ij} n_{ij}^q - \sum_{i,j}E[n_{ij}^q]\bigr)$. Taking the ratio yields Eq. (12).
Corollary 1 It holds true that:
i) $\lim_{q \to 1} \mathrm{AMI}_q = \lim_{q \to 1} \mathrm{AVI}_q = \mathrm{AMI} = \mathrm{AVI}$ with Shannon entropy;
ii) $\mathrm{AMI}_2 = \mathrm{AVI}_2 = \mathrm{ARI}$.

Proof Point i) follows from the limit of the $q$-entropy when $q \to 1$. Point ii) follows from:

$$\mathrm{AVI}_2 = \frac{E[\mathrm{VI}_2] - \mathrm{VI}_2}{E[\mathrm{VI}_2] - \min \mathrm{VI}_2} = \frac{\frac{N-1}{N}(\mathrm{RI} - E[\mathrm{RI}])}{\frac{N-1}{N}(\max \mathrm{RI} - E[\mathrm{RI}])} = \mathrm{ARI}$$
Proposition 2 The computational complexity of $\mathrm{AMI}_q$ is $O(N \cdot \max\{r, c\})$.

Proof The computation of $P(n_{ij})$, where $n_{ij}$ follows the hypergeometric distribution $\mathrm{Hyp}(a_i, b_j, N)$, is linear in $N$. However, the computation of the expected value $E[n_{ij}^q] = \sum_{n_{ij}} n_{ij}^q P(n_{ij})$ can exploit the fact that the $P(n_{ij})$ can be computed iteratively:

$$P(n_{ij}+1) = P(n_{ij})\,\frac{(a_i - n_{ij})(b_j - n_{ij})}{(n_{ij}+1)(N - a_i - b_j + n_{ij} + 1)}$$

We compute $P(n_{ij})$ explicitly only for $n_{ij} = \max\{0, a_i + b_j - N\}$; this can be done in $O(\max\{a_i, b_j\})$. All the other probabilities can then be computed iteratively, as shown above, in constant time. Therefore:

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\Bigl(O(\max\{a_i, b_j\}) + \sum_{n_{ij}}O(1)\Bigr) = \sum_{i=1}^{r}\sum_{j=1}^{c}O(\max\{a_i, b_j\}) = \sum_{i=1}^{r}O(\max\{c\,a_i, N\}) = O(\max\{cN, rN\}) = O(N \cdot \max\{c, r\})$$

Lemma 2 If $S(U,V) \in \mathcal{N}_\phi$, then $\lim_{N \to +\infty} E[S(U,V)] = \phi\bigl(\frac{a_1}{N}\frac{b_1}{N}, \dots, \frac{a_i}{N}\frac{b_j}{N}, \dots, \frac{a_r}{N}\frac{b_c}{N}\bigr)$.

Proof $S(U,V)$ can be written as $\phi\bigl(\frac{n_{11}}{N}, \dots, \frac{n_{ij}}{N}, \dots, \frac{n_{rc}}{N}\bigr)$. Let $X = (X_1, \dots, X_{rc}) = \bigl(\frac{n_{11}}{N}, \dots, \frac{n_{ij}}{N}, \dots, \frac{n_{rc}}{N}\bigr)$ be a vector of $rc$ random variables where $n_{ij}$ follows a hypergeometric distribution with the marginals $a_i$, $b_j$, and $N$ as parameters. The expected value of $\frac{n_{ij}}{N}$ is $E[\frac{n_{ij}}{N}] = \frac{1}{N}\frac{a_i b_j}{N}$. Let $\mu = (\mu_1, \dots, \mu_{rc}) = (E[X_1], \dots, E[X_{rc}]) = \bigl(\frac{a_1}{N}\frac{b_1}{N}, \dots, \frac{a_i}{N}\frac{b_j}{N}, \dots, \frac{a_r}{N}\frac{b_c}{N}\bigr)$ be the vector of the expected values. The Taylor approximation of $S(U,V) = \phi(X)$ around $\mu$ is:

$$\phi(X) \simeq \phi(\mu) + \sum_{t=1}^{rc}(X_t - \mu_t)\frac{\partial\phi}{\partial X_t} + \frac{1}{2}\sum_{t=1}^{rc}\sum_{s=1}^{rc}(X_t - \mu_t)(X_s - \mu_s)\frac{\partial^2\phi}{\partial X_t \partial X_s} + \dots$$

Its expected value is (see Section 4.3 of Ang and Tang (2006)):

$$E[\phi(X)] \simeq \phi(\mu) + \frac{1}{2}\sum_{t=1}^{rc}\sum_{s=1}^{rc}\mathrm{Cov}(X_t, X_s)\frac{\partial^2\phi}{\partial X_t \partial X_s} + \dots$$

We just analyse the second order remainder, given that it dominates the higher order ones. Using the Cauchy-Schwarz inequality we have that $|\mathrm{Cov}(X_t, X_s)| \le \sqrt{\mathrm{Var}(X_t)\mathrm{Var}(X_s)}$. Each $X_t$ and $X_s$ is equal to $\frac{n_{ij}}{N}$ for some indexes $i$ and $j$. When the number of records is large, the marginals also increase: $N \to +\infty$ implies $a_i \to +\infty$ and $b_j \to +\infty$ for all $i, j$. However, because of the permutation model, all the fractions $\frac{a_i}{N}$ and $\frac{b_j}{N}$ stay constant for all $i, j$; therefore, $\mu$ is also constant. In the limit of large $N$, the variance of each $\frac{n_{ij}}{N}$ tends to 0:

$$\mathrm{Var}\Bigl(\frac{n_{ij}}{N}\Bigr) = \frac{1}{N}\,\frac{a_i}{N}\,\frac{b_j}{N}\Bigl(1 - \frac{a_i}{N}\Bigr)\Bigl(\frac{N - b_j}{N - 1}\Bigr) \to 0$$

Therefore, at large $N$: $E[\phi(X)] \simeq \phi(\mu) = \phi\bigl(\frac{a_1}{N}\frac{b_1}{N}, \dots, \frac{a_i}{N}\frac{b_j}{N}, \dots, \frac{a_r}{N}\frac{b_c}{N}\bigr)$.
Theorem 3 It holds true that:
i) $\lim_{N \to +\infty} E[H_q(U,V)] = H_q(U) + H_q(V) - (q-1)H_q(U)H_q(V)$;
ii) $\lim_{N \to +\infty} E[\mathrm{MI}_q(U,V)] = (q-1)H_q(U)H_q(V)$;
iii) $\lim_{N \to +\infty} E[\mathrm{VI}_q(U,V)] = H_q(U) + H_q(V) - 2(q-1)H_q(U)H_q(V)$.

Proof $E[H_q(U,V)] = \frac{1}{q-1}\bigl(1 - \sum_{ij}E[(\frac{n_{ij}}{N})^q]\bigr)$, and according to Lemma 2, for large $N$:

$$E[H_q(U,V)] \simeq \frac{1}{q-1}\Bigl(1 - \sum_{ij}\Bigl(\frac{a_i}{N}\frac{b_j}{N}\Bigr)^q\Bigr) = \frac{1}{q-1}\Bigl(1 - \sum_i\Bigl(\frac{a_i}{N}\Bigr)^q\sum_j\Bigl(\frac{b_j}{N}\Bigr)^q\Bigr)$$

If we add and subtract $\sum_i(\frac{a_i}{N})^q + \sum_j(\frac{b_j}{N})^q$ inside the parentheses:

$$E[H_q(U,V)] \simeq \frac{1}{q-1}\Bigl(1 - \sum_i\Bigl(\frac{a_i}{N}\Bigr)^q\Bigr) + \frac{1}{q-1}\Bigl(1 - \sum_j\Bigl(\frac{b_j}{N}\Bigr)^q\Bigr) + \frac{1}{q-1}\Bigl(-1 - \sum_i\Bigl(\frac{a_i}{N}\Bigr)^q\sum_j\Bigl(\frac{b_j}{N}\Bigr)^q + \sum_i\Bigl(\frac{a_i}{N}\Bigr)^q + \sum_j\Bigl(\frac{b_j}{N}\Bigr)^q\Bigr)$$
$$= H_q(U) + H_q(V) - \frac{1}{q-1}\Bigl(1 - \sum_i\Bigl(\frac{a_i}{N}\Bigr)^q\Bigr)\Bigl(1 - \sum_j\Bigl(\frac{b_j}{N}\Bigr)^q\Bigr) = H_q(U) + H_q(V) - (q-1)H_q(U)H_q(V)$$

Points ii) and iii) follow from Equations (6) and (7).

Lemma 3 If $S(U,V) \in \mathcal{L}_\phi$, when partitions $U$ and $V$ are random, $\mathrm{Var}(S(U,V))$ is given by Eq. (13).

Proof
The proof follows the proof of Theorem 1 in Romano et al. (2014). Using the properties of the variance we can show that $\mathrm{Var}(S(U,V)) = \beta^2\,\mathrm{Var}\bigl(\sum_{ij}\phi_{ij}(n_{ij})\bigr) = \beta^2\bigl(E[(\sum_{ij}\phi_{ij}(n_{ij}))^2] - (\sum_{ij}E[\phi_{ij}(n_{ij})])^2\bigr)$. The term $(\sum_{ij}E[\phi_{ij}(n_{ij})])^2$ can be computed using Eq. (10). The first term is instead:

$$E\Bigl[\Bigl(\sum_{ij}\phi_{ij}(n_{ij})\Bigr)^2\Bigr] = \sum_{ij}\sum_{i'j'}E[\phi_{ij}(n_{ij})\,\phi_{i'j'}(n_{i'j'})] = \sum_{ij}\sum_{i'j'}\sum_{n_{ij}}\sum_{n_{i'j'}}\phi_{ij}(n_{ij})\,\phi_{i'j'}(n_{i'j'})\,P(n_{ij}, n_{i'j'})$$

We cannot find the exact form of the joint probability $P(n_{ij}, n_{i'j'})$, thus we rewrite it as $P(n_{ij})P(n_{i'j'} \mid n_{ij}) = P(n_{ij})P(\tilde{n}_{i'j'})$. The random variable $n_{ij}$ follows a hypergeometric distribution that simulates the experiment of sampling without replacement the $a_i$ objects of the set $u_i$ from a total of $N$ objects, where sampling one of the $b_j$ objects of $v_j$ counts as a success: $n_{ij} \sim \mathrm{Hyp}(a_i, b_j, N)$. The random variable $\tilde{n}_{i'j'}$ has a different distribution depending on the possible combinations of indexes $i, i', j, j'$. Thus $E[(\sum_{ij}\phi_{ij}(n_{ij}))^2]$ is equal to:

$$\sum_{ij}\sum_{n_{ij}}\phi_{ij}(n_{ij})P(n_{ij})\sum_{i'j'}\sum_{\tilde{n}_{i'j'}}\phi_{i'j'}(\tilde{n}_{i'j'})P(\tilde{n}_{i'j'})$$

which, by taking care of all the possible combinations of $i, i', j, j'$, is equal to:

$$\sum_{ij}\sum_{n_{ij}}\phi_{ij}(n_{ij})P(n_{ij})\cdot\Biggl[\sum_{i'=i,\,j'=j}\sum_{\tilde{n}_{ij}}\phi_{ij}(\tilde{n}_{ij})P(\tilde{n}_{ij}) + \sum_{i'\neq i,\,j'=j}\sum_{\tilde{n}_{i'j}}\phi_{i'j}(\tilde{n}_{i'j})P(\tilde{n}_{i'j}) \qquad (17)$$
$$+ \sum_{i'=i,\,j'\neq j}\sum_{\tilde{n}_{ij'}}\phi_{ij'}(\tilde{n}_{ij'})P(\tilde{n}_{ij'}) + \sum_{i'\neq i,\,j'\neq j}\sum_{\tilde{n}_{i'j'}}\phi_{i'j'}(\tilde{n}_{i'j'})P(\tilde{n}_{i'j'})\Biggr] \qquad (18)$$

Case 1: $i' = i \wedge j' = j$. Here $P(\tilde{n}_{ij}) = 1$ if and only if $\tilde{n}_{ij} = n_{ij}$, and 0 otherwise. This case produces the first term $\phi_{ij}(n_{ij})$ enclosed in the square brackets of Eq. (13).

Case 2: $i' = i \wedge j' \neq j$. In this case, the possible successes are the objects from the set $v_{j'}$. We have already sampled $n_{ij}$ objects and we are sampling from the whole set of objects excluding the set $v_j$. Thus, $\tilde{n}_{ij'} \sim \mathrm{Hyp}(a_i - n_{ij}, b_{j'}, N - b_j)$.

Case 3: $i' \neq i \wedge j' = j$. This case is symmetric to the previous one, where $a_{i'}$ is now the possible number of successes. Therefore $\tilde{n}_{i'j} \sim \mathrm{Hyp}(b_j - n_{ij}, a_{i'}, N - a_i)$.

Case 4: $i' \neq i \wedge j' \neq j$. In order to compute $P(\tilde{n}_{i'j'})$, we have to impose a further condition:

$$P(\tilde{n}_{i'j'}) = \sum_{\tilde{n}_{ij'}}P(\tilde{n}_{i'j'} \mid \tilde{n}_{ij'})P(\tilde{n}_{ij'}) = \sum_{\tilde{n}_{ij'}}P(\tilde{\tilde{n}}_{i'j'})P(\tilde{n}_{ij'})$$

We are considering sampling the $a_{i'}$ objects of $u_{i'}$ from the whole set of objects excluding the $a_i$ objects of $u_i$. Just knowing that $n_{ij}$ objects have already been sampled from $u_i$ does not allow us to know how many objects from $v_{j'}$ have also been sampled. If we know that $\tilde{n}_{ij'}$ objects have been sampled from $v_{j'}$, we know there are $b_{j'} - \tilde{n}_{ij'}$ possible successes, and thus $\tilde{n}_{i'j'} \mid \tilde{n}_{ij'} = \tilde{\tilde{n}}_{i'j'} \sim \mathrm{Hyp}(a_{i'}, b_{j'} - \tilde{n}_{ij'}, N - a_i)$. The last two terms in Eq. (18) can therefore be put together:

$$\sum_{i'=i,\,j'\neq j}\sum_{\tilde{n}_{ij'}}\phi_{ij'}(\tilde{n}_{ij'})P(\tilde{n}_{ij'}) + \sum_{i'\neq i,\,j'\neq j}\sum_{\tilde{n}_{i'j'}}\phi_{i'j'}(\tilde{n}_{i'j'})P(\tilde{n}_{i'j'}) = \sum_{j'\neq j}\sum_{\tilde{n}_{ij'}}P(\tilde{n}_{ij'})\Bigl(\phi_{ij'}(\tilde{n}_{ij'}) + \sum_{i'\neq i}\sum_{\tilde{\tilde{n}}_{i'j'}}\phi_{i'j'}(\tilde{\tilde{n}}_{i'j'})P(\tilde{\tilde{n}}_{i'j'})\Bigr)$$

Putting everything together yields Eq. (13).
Theorem 4

Using Eqs. (10) and (13) with $\phi_{ij}(\cdot) = (\cdot)^q$, when the partitions $U$ and $V$ are random:

i) $\mathrm{Var}(H_q(U,V)) = \frac{1}{(q-1)^2 N^{2q}}\Big(E\big[\big(\sum_{ij} n_{ij}^q\big)^2\big] - \big(\sum_{ij} E[n_{ij}^q]\big)^2\Big)$;

ii) $\mathrm{Var}(\mathrm{MI}_q(U,V)) = \mathrm{Var}(H_q(U,V))$;

iii) $\mathrm{Var}(\mathrm{VI}_q(U,V)) = 4\,\mathrm{Var}(H_q(U,V))$.

Proof
The results follow from Lemma 3, the hypothesis of fixed marginals, and the properties of the variance. Indeed, with fixed marginals $H_q(U)$ and $H_q(V)$ are constants, so $\mathrm{MI}_q(U,V) = H_q(U) + H_q(V) - H_q(U,V)$ and $\mathrm{VI}_q(U,V) = 2H_q(U,V) - H_q(U) - H_q(V)$ differ from $-H_q(U,V)$ and $2H_q(U,V)$, respectively, only by additive constants; the variance is thus unchanged in the first case and scaled by 4 in the second.
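Points ii) and iii) lend themselves to a quick Monte Carlo sanity check. The sketch below (illustrative only; the partition sizes, $q$, and replicate count are arbitrary assumptions, and $\mathrm{MI}_q$ and $\mathrm{VI}_q$ are computed from the identities recalled in the proof) estimates the three variances under the permutation model:

```python
import numpy as np

rng = np.random.default_rng(1)
q = 2.0
U = np.repeat([0, 1, 2], [30, 30, 40])        # assumed partition U (r = 3)
V = np.repeat([0, 1], [50, 50])               # assumed partition V (c = 2)
N = len(U)

def tsallis(p, q):
    # Tsallis q-entropy of a probability vector
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

H_u = tsallis(np.bincount(U) / N, q)          # constant: marginals are fixed
H_v = tsallis(np.bincount(V) / N, q)

H, MI, VI = [], [], []
for _ in range(20000):
    Vp = rng.permutation(V)                   # permutation model
    cont = np.zeros((3, 2))
    np.add.at(cont, (U, Vp), 1)               # contingency table of U and Vp
    H_uv = tsallis(cont.ravel() / N, q)
    H.append(H_uv)
    MI.append(H_u + H_v - H_uv)
    VI.append(2 * H_uv - H_u - H_v)

print(np.var(H), np.var(MI), np.var(VI) / 4)  # the three values should agree
```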
Theorem 5
Using Eqs. (10) and (13) with $\phi_{ij}(\cdot) = (\cdot)^q$, the standardized $\mathrm{MI}_q(U,V)$ and the standardized $\mathrm{VI}_q(U,V)$ are:

$$\mathrm{SMI}_q(U,V) = \mathrm{SVI}_q(U,V) = \frac{\sum_{ij} n_{ij}^q - \sum_{ij} E[n_{ij}^q]}{\sqrt{E\big[\big(\sum_{ij} n_{ij}^q\big)^2\big] - \big(\sum_{ij} E[n_{ij}^q]\big)^2}} \qquad (15)$$

Proof
For $\mathrm{SMI}_q$: the numerator is equal to $E[H_q(U,V)] - H_q(U,V) = \frac{1}{(q-1)N^q}\big(\sum_{ij} n_{ij}^q - \sum_{ij} E[n_{ij}^q]\big)$. According to Theorem 4, the denominator is instead:

$$\sqrt{\mathrm{Var}(\mathrm{MI}_q(U,V))} = \sqrt{\mathrm{Var}(H_q(U,V))} = \frac{1}{(q-1)N^q}\sqrt{E\Big[\Big(\sum_{ij} n_{ij}^q\Big)^2\Big] - \Big(E\Big[\sum_{ij} n_{ij}^q\Big]\Big)^2}.$$

For $\mathrm{SVI}_q$: the numerator is equal to $2\big(E[H_q(U,V)] - H_q(U,V)\big) = \frac{2}{(q-1)N^q}\big(\sum_{ij} n_{ij}^q - \sum_{ij} E[n_{ij}^q]\big)$. According to Theorem 4, the denominator is instead:

$$\sqrt{\mathrm{Var}(\mathrm{VI}_q(U,V))} = 2\sqrt{\mathrm{Var}(H_q(U,V))} = \frac{2}{(q-1)N^q}\sqrt{E\Big[\Big(\sum_{ij} n_{ij}^q\Big)^2\Big] - \Big(E\Big[\sum_{ij} n_{ij}^q\Big]\Big)^2}.$$

In both ratios the common factor cancels, which yields Eq. (15). Therefore, $\mathrm{SMI}_q$ and $\mathrm{SVI}_q$ are equivalent.
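Theorem 5 also suggests a simple approximation of $\mathrm{SMI}_q$ when the exact moments of Eqs. (10) and (13) are not at hand: estimate the two moments of $\sum_{ij} n_{ij}^q$ by simulation. The sketch below (a minimal illustration with arbitrary partitions and replicate counts, not the paper's exact-moment computation) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(2)
q = 2.0
U = np.repeat([0, 1], [40, 60])              # assumed partition U
V = np.repeat([0, 1], [50, 50])              # assumed partition V

def cell_power_sum(U, V, q):
    # S = sum_ij n_ij^q computed from the contingency table of U and V
    cont = np.zeros((U.max() + 1, V.max() + 1))
    np.add.at(cont, (U, V), 1)
    return np.sum(cont ** q)

S_obs = cell_power_sum(U, V, q)
S_null = np.array([cell_power_sum(U, rng.permutation(V), q)
                   for _ in range(20000)])

# Eq. (15) with the exact moments replaced by Monte Carlo estimates
smi_q = (S_obs - S_null.mean()) / S_null.std()
print(smi_q)                                 # equals SVI_q by Theorem 5
```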
Corollary 2

It holds true that:

i) $\lim_{q \to 1} \mathrm{SMI}_q = \lim_{q \to 1} \mathrm{SVI}_q = \mathrm{SMI} = \mathrm{SVI} = \mathrm{S}G$ with Shannon entropy;

ii) $\mathrm{SMI}_2 = \mathrm{SVI}_2 = \mathrm{SRI}$.

Proof
Point i) follows from the limit of the $q$-entropy when $q \to 1$, which is the Shannon entropy, and the linear relation of the $G$-statistic to MI: $G = 2N\,\mathrm{MI}$. Point ii) follows from the linear relation $\mathrm{VI}_2 = \frac{N-1}{N}(1 - \mathrm{RI})$, which gives:

$$\mathrm{SVI}_2 = \frac{E[\mathrm{VI}_2] - \mathrm{VI}_2}{\sqrt{\mathrm{Var}(\mathrm{VI}_2)}} = \frac{\frac{N-1}{N}\big(\mathrm{RI} - E[\mathrm{RI}]\big)}{\frac{N-1}{N}\sqrt{\mathrm{Var}(\mathrm{RI})}} = \mathrm{SRI}$$
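The key step in point ii) is the linear relation between $\mathrm{VI}_2$ and RI. A small numerical check (illustrative partitions; the helper names are ours) confirms it on a random contingency table:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N = 60
U = rng.integers(0, 3, N)                    # assumed partition with r = 3
V = rng.integers(0, 4, N)                    # assumed partition with c = 4

# Rand Index by direct pair counting: pairs on which U and V agree
pairs = list(combinations(range(N), 2))
agree = sum((U[s] == U[t]) == (V[s] == V[t]) for s, t in pairs)
RI = agree / len(pairs)

def gini(p):
    # Tsallis q-entropy with q = 2 (the Gini index)
    return 1.0 - np.sum(p ** 2)

cont = np.zeros((3, 4))
np.add.at(cont, (U, V), 1)
VI2 = (2 * gini(cont.ravel() / N)
       - gini(np.bincount(U, minlength=3) / N)
       - gini(np.bincount(V, minlength=4) / N))

print(VI2, (N - 1) / N * (1 - RI))           # the two values coincide
```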
Proposition 3
The computational complexity of $\mathrm{SVI}_q$ is $O(N^3 c \cdot \max\{c, r\})$.

Proof
Each summation in Eq. (13) can be bounded above by the maximum value of the cell marginals, and each term of a sum can be computed in constant time. The last summation in Eq. (13) is:

$$\sum_{j'=1}^{c} \sum_{\tilde{n}_{ij'}=0}^{\max\{a_i, b_{j'}\}} \sum_{i'=1}^{r} \sum_{\tilde{\tilde{n}}_{i'j'}=0}^{\max\{a_{i'}, b_{j'}\}} O(1) = \sum_{j'=1}^{c} \sum_{\tilde{n}_{ij'}=0}^{\max\{a_i, b_{j'}\}} O(\max\{N,\, r b_{j'}\}) = \sum_{j'=1}^{c} O(\max\{a_i N,\, a_i r b_{j'},\, b_{j'} N,\, r b_{j'}^2\}) = O(\max\{c\, a_i N,\, a_i r N,\, r N^2\})$$

The above term is thus the computational complexity of the inner loop. Using the same machinery one can prove that:

$$\sum_{j=1}^{c} \sum_{i=1}^{r} \sum_{n_{ij}=0}^{\max\{a_i, b_j\}} O(\max\{c\, a_i N,\, a_i r N,\, r N^2\}) = O(\max\{c^2 N^3,\, r c N^3\}) = O(N^3 c \cdot \max\{c, r\})$$
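The counting argument can be made concrete by mirroring the summations as nested loops. The skeleton below (an illustration of the bound, not the paper's implementation; it merely counts constant-time operations) reproduces the nesting used above:

```python
def inner_work(a, b, N, r, c, i, j):
    # Innermost summation of Eq. (13): O(max{c*a_i*N, a_i*r*N, r*N^2}) terms
    ops = 0
    for jp in range(c):                                  # sum over j' != j
        if jp == j:
            continue
        for n_ijp in range(max(a[i], b[jp]) + 1):        # sum over n~_ij'
            for ip in range(r):                          # sum over i' != i
                if ip == i:
                    continue
                for nn in range(max(a[ip], b[jp]) + 1):  # sum over n~~_i'j'
                    ops += 1                             # each term is O(1)
    return ops

def total_work(a, b, N, r, c):
    # Full summation: O(N^3 * c * max{c, r}) constant-time operations
    ops = 0
    for j in range(c):
        for i in range(r):
            for n_ij in range(max(a[i], b[j]) + 1):
                ops += inner_work(a, b, N, r, c, i, j)
    return ops

print(total_work(a=[5, 5], b=[4, 6], N=10, r=2, c=2))    # tiny illustrative run
```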
Proposition 4

The $p$-value associated to the test for independence between $U$ and $V$ using $\mathrm{MI}_q(U,V)$ is smaller than $\frac{1}{1 + \mathrm{SMI}_q(U,V)^2}$.

Proof
Let $\mathrm{MI}_q$ be the random variable under the null hypothesis of independence between partitions associated to the test statistic $\mathrm{MI}_q(U,V)$. The $p$-value is defined as:

$$p\text{-value} = P\big(\mathrm{MI}_q \geq \mathrm{MI}_q(U,V)\big) = P\big(\mathrm{MI}_q - E[\mathrm{MI}_q(U,V)] \geq \mathrm{MI}_q(U,V) - E[\mathrm{MI}_q(U,V)]\big)$$
$$= P\Bigg(\frac{\mathrm{MI}_q - E[\mathrm{MI}_q(U,V)]}{\sqrt{\mathrm{Var}(\mathrm{MI}_q(U,V))}} \geq \frac{\mathrm{MI}_q(U,V) - E[\mathrm{MI}_q(U,V)]}{\sqrt{\mathrm{Var}(\mathrm{MI}_q(U,V))}}\Bigg) = P\Bigg(\frac{\mathrm{MI}_q - E[\mathrm{MI}_q(U,V)]}{\sqrt{\mathrm{Var}(\mathrm{MI}_q(U,V))}} \geq \mathrm{SMI}_q(U,V)\Bigg)$$

Let $Z$ be the standardized random variable $\frac{\mathrm{MI}_q - E[\mathrm{MI}_q(U,V)]}{\sqrt{\mathrm{Var}(\mathrm{MI}_q(U,V))}}$; then, using the one-sided Chebyshev inequality, also known as Cantelli's inequality (Ross, 2012):

$$p\text{-value} = P\big(Z \geq \mathrm{SMI}_q(U,V)\big) < \frac{1}{1 + \big(\mathrm{SMI}_q(U,V)\big)^2}$$
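As a purely illustrative reading of the bound: an observed $\mathrm{SMI}_q(U,V) = 3$ certifies a $p$-value below $\frac{1}{1+3^2} = 0.1$, while $\mathrm{SMI}_q(U,V) = 10$ certifies a $p$-value below $\frac{1}{1+10^2} \approx 0.0099$. Note that Cantelli's bound is informative only for positive values of $\mathrm{SMI}_q(U,V)$.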
Lemma 4 If $S(U,V) \in \mathcal{N}_\phi$, then $\lim_{N \to +\infty} \mathrm{Var}(S(U,V)) = 0$.

Proof
Let $X = (X_1, \ldots, X_{rc}) = \big(\frac{n_{11}}{N}, \ldots, \frac{n_{ij}}{N}, \ldots, \frac{n_{rc}}{N}\big)$ be a vector of $rc$ random variables, where $n_{ij}$ follows a hypergeometric distribution with the marginals as parameters: $a_i$, $b_j$, and $N$. Using the Taylor approximation (Ang and Tang, 2006) of $S(U,V) = \phi(X)$, it is possible to show that:

$$\mathrm{Var}(\phi(X)) \simeq \sum_{t=1}^{rc} \sum_{s=1}^{rc} \mathrm{Cov}(X_t, X_s)\,\frac{\partial \phi}{\partial X_t}\,\frac{\partial \phi}{\partial X_s} + \ldots$$

Using the Cauchy–Schwarz inequality we have that $|\mathrm{Cov}(X_t, X_s)| \leq \sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_s)}$. Each $X_t$ and $X_s$ is equal to $\frac{n_{ij}}{N}$ for some indexes $i$ and $j$. The variance of each $X_t$ and $X_s$ is therefore equal to $\mathrm{Var}\big(\frac{n_{ij}}{N}\big) = \frac{1}{N} \frac{a_i}{N} \frac{b_j}{N} \frac{N - a_i}{N} \frac{N - b_j}{N-1}$. When the number of records is large, the marginals increase as well: $N \to +\infty \Rightarrow a_i \to +\infty$ and $b_j \to +\infty$ $\forall i, j$. However, because of the permutation model, all the fractions $\frac{a_i}{N}$ and $\frac{b_j}{N}$ stay constant $\forall i, j$. Therefore, in the limit of large $N$, the variance of $\frac{n_{ij}}{N}$ tends to 0:

$$\mathrm{Var}\Big(\frac{n_{ij}}{N}\Big) = \frac{1}{N} \cdot \frac{a_i}{N} \cdot \frac{b_j}{N} \Big(1 - \frac{a_i}{N}\Big) \Big(\frac{N - b_j}{N-1}\Big) \to 0$$

and thus every covariance, and in turn $\mathrm{Var}(\phi(X))$, tends to 0.
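The closing limit is easy to visualize numerically. The snippet below (arbitrary fractions, for illustration only) evaluates the closed-form variance at increasing $N$ with $\frac{a_i}{N}$ and $\frac{b_j}{N}$ held fixed, showing the $1/N$ decay:

```python
# Fractions a_i/N and b_j/N are held constant while N grows; Var(n_ij/N)
# then decays like 1/N, as in the closing step of the proof.
pa, pb = 0.4, 0.25                           # assumed constant fractions
for N in (10**2, 10**4, 10**6):
    b = pb * N
    var = (1 / N) * pa * pb * (1 - pa) * ((N - b) / (N - 1))
    print(N, var)
```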
Theorem 6

It holds true that:

$$\lim_{N \to +\infty} \mathrm{Var}(H_q(U,V)) = \lim_{N \to +\infty} \mathrm{Var}(\mathrm{MI}_q(U,V)) = \lim_{N \to +\infty} \mathrm{Var}(\mathrm{VI}_q(U,V)) = 0 \qquad (16)$$
Proof
Trivially follows from Lemma 4.

References
Charu C. Aggarwal and Chandan K. Reddy. Data Clustering: Algorithms and Applications. CRC Press, 2013.

Ahmed N. Albatineh and Magdalena Niewiadomska-Bugaj. Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Advances in Data Analysis and Classification, 5(3):179–200, 2011.

Ahmed N. Albatineh, Magdalena Niewiadomska-Bugaj, and Daniel Mihalko. On similarity indices and correction for chance agreement. Journal of Classification, 23(2):301–313, 2006.

Alessia Amelio and Clara Pizzuti. Is normalized mutual information a fair measure for comparing community detection methods? In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1584–1585. ACM, 2015.

Alfredo H-S. Ang and Wilson H. Tang. Probability Concepts in Engineering: Emphasis on Applications to Civil and Environmental Engineering. John Wiley and Sons, 2006.

Asa Ben-Hur, André Elisseeff, and Isabelle Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, volume 7, pages 6–17, 2001.

Rich Caruana, M. Elhawary, Nam Nguyen, and Casey Smith. Meta clustering. In Sixth International Conference on Data Mining (ICDM'06), pages 107–118. IEEE, 2006.

Mike X Cohen. Analyzing Neural Time Series Data: Theory and Practice. MIT Press, 2014.

Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2–3):81–227, 2012.

Zoltán Daróczy. Generalized information functions. Information and Control, 16(1):36–51, 1970.

Alin Dobra and Johannes Gehrke. Bias correction in classification tree construction. In Proceedings of the International Conference on Machine Learning, pages 90–97, 2001.

Eibe Frank and Ian H. Witten. Using a permutation test for attribute selection in decision trees. In Proceedings of the International Conference on Machine Learning, pages 152–160, 1998.

Shigeru Furuichi. Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics, 47(2):023302, 2006.

Jan Havrda and František Charvát. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika, 3(1):30–35, 1967.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.

Yang Lei, Nguyen Xuan Vinh, Jeffrey Chan, and James Bailey. Filta: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer, 2014.

Fabrício M. Lopes, Evaldo A. de Oliveira, and Roberto M. Cesar. Inference of gene regulatory networks from time series by Tsallis entropy. BMC Systems Biology, 5(1):61, 2011.

André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. Nonextensive information theoretic kernels on measures. The Journal of Machine Learning Research, 10:935–975, 2009.

Tomasz Maszczyk and Włodzisław Duch. Comparison of Shannon, Renyi and Tsallis entropy used in decision trees. In Artificial Intelligence and Soft Computing – ICAISC 2008, pages 643–651. Springer, 2008.

Marina Meilă. Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.

Leslie C. Morey and Alan Agresti. The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33–37, 1984.

Emmanuel Müller, Stephan Günnemann, Ines Färber, and Thomas Seidl. Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML, 2013. URL http://dme.rwth-aachen.de/en/DMCS.

William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

Simone Romano, James Bailey, Vinh Nguyen, and Karin Verspoor. Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1143–1151, 2014.

Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. A framework to adjust dependency measure estimates for chance. arXiv preprint arXiv:1510.07786, 2015.

Sheldon Ross. A First Course in Probability. Pearson, 2012.

Y-S. Shih. A note on split selection bias in classification trees. Computational Statistics & Data Analysis, 45(3):457–466, 2004.

Dan Simovici. On generalized entropy and entropic metrics. Journal of Multiple Valued Logic and Soft Computing, 13(4/6):295, 2007.

Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, 2003.

Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1–2):479–487, 1988.

Constantino Tsallis et al. Introduction to Nonextensive Statistical Mechanics. Springer, 2009.

Marius Vila, Anton Bardera, Miquel Feixas, and Mateu Sbert. Tsallis mutual information for document classification. Entropy, 13(9):1694–1707, 2011.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of the International Conference on Machine Learning, pages 1073–1080. ACM, 2009.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010.

Nguyen Xuan Vinh, Jeffrey Chan, Simone Romano, and James Bailey. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 512–521. ACM, 2014.

Matthijs J. Warrens. On the equivalence of Cohen's kappa and the Hubert-Arabie adjusted Rand index. Journal of Classification, 25(2):177–183, 2008.

Allan P. White and Wei Zhong Liu. Bias in information-based measures in decision tree induction. Machine Learning, pages 321–329, 1994.

Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In Knowledge Discovery and Data Mining, pages 877–886, 2009.