Explicit agreement extremes for a 2×2 table with given marginals
arXiv [stat.ML]

José E. Chacón∗
22 January 2020
Abstract
The problem of maximizing (or minimizing) the agreement between clusterings, subject to given marginals, can be formally posed under a common framework for several agreement measures. Until now, it was possible to find its solution only through numerical algorithms. Here, an explicit solution is shown for the case where the two clusterings have two clusters each.

∗ Departamento de Matemáticas, Universidad de Extremadura, E-06006 Badajoz, Spain. E-mail: [email protected]

Introduction

Given two different clusterings of a data set, many measures have been proposed to quantify their degree of concordance. A recent review of a representative number of them can be found in Meilă (2016). These measures are usually categorized into three classes: those based on inspecting the assignments of data pairs in both clusterings, those involving some cluster matching between the two clusterings, and those relying on information-theoretic criteria. This paper concerns the first of these classes. In fact, some of the most popular and widely used similarity measures, such as the Rand index, the Jaccard index, or the Fowlkes-Mallows index, belong to this class of pair-based similarities, but it should be noted that there is a plethora of them, as explored in Albatineh, Niewiadomska-Bugaj and Mihalko (2006), Warrens (2008) or Warrens and van der Hoef (2019).

Precisely, when studying the Rand index, Morey and Agresti (1984) noted that this statistic does not take into account the possibility of agreement by chance. Hubert and Arabie (1985) suggested a general formulation to correct any of these indices for chance, which consists in subtracting from the index its expected value when the clustering labels are assigned at random (with the constraint that the number of clusters and their sizes are fixed), followed by a normalization that ensures that the resulting corrected index still attains a value of 1 for identical clusterings.
Namely, the adjusted version of any index is given by

$$\frac{\text{index} - E[\text{index}]}{\text{maximum index} - E[\text{index}]},$$

where "maximum index" is usually taken to be equal to 1, since that is the most general upper bound for these indices.

However, Hubert and Arabie (1985) also noted that another (perhaps more adequate) bound that could be used is the maximum of the index given the fixed marginals (i.e., the cluster sizes of each of the two clusterings). Unfortunately, they refer to the problem of finding the maximum index value subject to the marginal constraint as "a very difficult problem of combinatorial optimization." Nevertheless, some progress has been made since then. Messatfa (1992) pointed out that, for some of the pair-based similarities, the problem is equivalent to finding the maximum of the sum of the squares of the entries of the contingency table. And Brusco and Steinley (2008), and later Steinley, Hendrickson and Brusco (2015), proposed a binary integer program and a heuristic algorithm, respectively, to obtain an exact solution numerically.

Here, on the contrary, the focus is on finding explicit expressions for the confusion matrix configurations that maximize the agreement between two clusterings, given the marginals. Due to the aforementioned difficulty of the problem, this study is restricted to the simplest case of 2 × 2 confusion matrices, that is, to two clusterings with two clusters each.

Given two clusterings $\mathcal{C} = \{C_1, \dots, C_r\}$ and $\mathcal{D} = \{D_1, \dots, D_s\}$ of a data set $X = \{x_1, \dots, x_n\}$, let us denote by $n_{ij} = |C_i \cap D_j|$ the number of observations that are assigned to cluster $C_i$ in $\mathcal{C}$ and to cluster $D_j$ in $\mathcal{D}$. All the information about the concordance between $\mathcal{C}$ and $\mathcal{D}$ is collected in the $r \times s$ confusion matrix $N = (n_{ij})$, also known as the contingency table. The row and column marginals of $N$ are determined, respectively, by the vectors $(n_{1+}, \dots, n_{r+})$ and $(n_{+1}, \dots, n_{+s})$, where $n_{i+} = \sum_{j=1}^s n_{ij} = |C_i|$ and $n_{+j} = \sum_{i=1}^r n_{ij} = |D_j|$.

For any pair of observations $x_k, x_\ell$ with $k \neq \ell$, there are four possibilities regarding their group assignments according to $\mathcal{C}$ and $\mathcal{D}$: a) they are in the same cluster both in $\mathcal{C}$ and $\mathcal{D}$; b) they are in the same cluster in $\mathcal{C}$ but in different clusters in $\mathcal{D}$; c) they are in different clusters in $\mathcal{C}$ but in the same cluster in $\mathcal{D}$; and d) they are in different clusters both in $\mathcal{C}$ and $\mathcal{D}$. Following Brusco and Steinley (2008), the number of pairs in each of these classes can be computed from $N$, respectively, as

$$a = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r \sum_{j=1}^s n_{ij}^2 - n\Big),$$
$$b = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r n_{i+}^2 - \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2\Big),$$
$$c = \tfrac{1}{2}\Big(\textstyle\sum_{j=1}^s n_{+j}^2 - \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2\Big),$$
$$d = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r \sum_{j=1}^s n_{ij}^2 + n^2 - \sum_{i=1}^r n_{i+}^2 - \sum_{j=1}^s n_{+j}^2\Big).$$

Many popular agreement indices are defined in terms of $a, b, c, d$, as for instance the Rand index, the adjusted Rand index, the Jaccard index or the Fowlkes-Mallows index (see, e.g., Steinley, Hendrickson and Brusco, 2015, for details). Moreover, Messatfa (1992) noted that, if the marginals are given and fixed, then all these quantities depend only on $Q = \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2$, in such a way that $a$ and $d$ are maximized and $b$ and $c$ are minimized when $Q$ is maximized.
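As a quick numerical illustration (not part of the original paper), the pair counts $a, b, c, d$ and the quantity $Q$ can be computed directly from a confusion matrix; a minimal Python sketch, with illustrative names:

```python
import numpy as np

def pair_counts(N):
    """Pair counts a, b, c, d of Brusco and Steinley (2008) from a confusion matrix N."""
    N = np.asarray(N, dtype=np.int64)
    n = N.sum()                          # total number of observations
    Q = (N ** 2).sum()                   # sum of squared entries of N
    R = (N.sum(axis=1) ** 2).sum()       # sum of squared row marginals |C_i|^2
    S = (N.sum(axis=0) ** 2).sum()       # sum of squared column marginals |D_j|^2
    a = (Q - n) // 2                     # pairs together in both clusterings
    b = (R - Q) // 2                     # together in C, apart in D
    c = (S - Q) // 2                     # apart in C, together in D
    d = (Q + n ** 2 - R - S) // 2        # apart in both clusterings
    return a, b, c, d

# Example: N = [[3, 1], [1, 5]] with n = 10, so there are n(n-1)/2 = 45 pairs
a, b, c, d = pair_counts([[3, 1], [1, 5]])
rand_index = (a + d) / (a + b + c + d)   # the Rand index in terms of a, b, c, d
```

Note that the four counts always sum to $n(n-1)/2$, the total number of pairs, which provides a built-in sanity check.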
As a consequence, Brusco and Steinley (2008) noted that the problem of finding the extrema of the aforementioned indices, and also of many others included in Albatineh, Niewiadomska-Bugaj and Mihalko (2006), reduces to that of finding the extrema of $Q$, subject to the given marginals.

This paper deals with a simplified version of the problem, namely the case $r = s = 2$. In such a context, the class $\mathcal{N}(x, y, n)$ of possible contingency tables given the marginals $n_{1+} = x$ and $n_{+1} = y$ becomes uniparametric, with entries and marginals

            D_1          D_2                Total
  C_1       k            x − k              x
  C_2       y − k        n + k − x − y      n − x
  Total     y            n − y              n

The only free parameter in the previous table is $k$, which must be a non-negative integer. Moreover, all the entries in the table must be non-negative, resulting in the condition

$$\max\{0, x + y - n\} \le k \le \min\{x, y\}. \quad (1)$$

Thus, the problem reduces to finding the extrema of $Q$ over $\mathcal{N}(x, y, n)$; that is, finding the minimizer(s) and maximizer(s) of

$$Q(k) = k^2 + (x - k)^2 + (y - k)^2 + (n + k - x - y)^2 \quad (2)$$

subject to (1).

It is quite useful to consider some reductions that greatly simplify the problem. Notice that the codification of the cluster labels is arbitrary, for in cluster analysis what matters is the group membership, and not the group denomination. This means that we can assume that $x \ge n - x$ since, if that is not the case, it suffices to interchange the labels of clusters $C_1$ and $C_2$. Similarly, we can also assume that $y \ge n - y$. Finally, by interchanging the clusterings $\mathcal{C}$ and $\mathcal{D}$, if necessary, it is possible to assume that $x \le y$. The three reductions can be summarized in the condition $n/2 \le x \le y$ and, under such a condition, the range (1) of possible values of $k$ simplifies to

$$x + y - n \le k \le x. \quad (3)$$

Notice also that, in order to have two clusters in each clustering (i.e., to avoid the possibility of a degenerate, empty cluster), it must be $\max\{x, y\} < n$.
With this background, we are ready to state our main result.

Denote by $\lfloor z \rfloor$ and $\lceil z \rceil$ the floor and ceiling of a real number $z$, respectively; that is, $\lfloor z \rfloor$ is the greatest integer less than or equal to $z$, and $\lceil z \rceil$ is the least integer greater than or equal to $z$. Similarly, denote by $\{z\} = z - \lfloor z \rfloor$ the fractional (or decimal) part of $z$. With this notation, the closest integer to $z$ is $\lfloor z + 1/2 \rfloor$, assuming a round-up tie-breaking rule for those numbers with $\{z\} = 1/2$.

Theorem 1.
The maximum and minimum values of $Q(k)$, given $n$, $n_{1+} = x$ and $n_{+1} = y$, with $n/2 \le x \le y$, are attained as follows:

a) If $x > n/2$, the maximum is attained for $k = x$. If $x = n/2$, the maximum is attained both for $k = x + y - n$ and for $k = x$.

b) If $x + y > 3n/2$, the minimum is attained for $k = x + y - n$. If $x + y \le 3n/2$, the minimum is attained for $k = \lfloor v + 1/2 \rfloor$ if $\{v\} \neq 1/2$, and for both $k = \lfloor v \rfloor$ and $k = \lceil v \rceil$ if $\{v\} = 1/2$, where $v = (2x + 2y - n)/4$.

Proof. The function $Q(k)$ in (2) is quadratic in $k$. In fact, it can be alternatively expressed as

$$Q(k) = 4k^2 - 2(2x + 2y - n)k + (n - x)^2 + (n - y)^2 + (x + y)^2 - n^2,$$

so it is a convex parabola with minimum at $v = (2x + 2y - n)/4$. The different cases correspond to the location of $v$ with respect to the lower and upper bounds for $k$ in (3). It can never be $v \ge x$, since that would entail $y \ge x + n/2 \ge n$, which is not possible. So only two possibilities need to be studied: either $v < x + y - n$ or $x + y - n \le v < x$.

The condition $v < x + y - n$ is equivalent to $x + y > 3n/2$. In that case, since $Q(k)$ is a convex parabola with minimum to the left of the possible range of $k$ values, its minimum over (3) is attained for the leftmost feasible value of $k$, that is, $k = x + y - n$, and its maximum for the rightmost one, $k = x$.

When $x + y - n \le v < x$, which is equivalent to $x + y \le 3n/2$, the location of the maximum is determined by comparing $v$ with the midpoint $(2x + y - n)/2$ of the range (take into account that $Q(k)$ is symmetric with respect to $v$). But $x \ge n/2$ implies $v \le (2x + y - n)/2$, so $v$ is always less than or equal to the midpoint of the range of $k$ values. If the inequality is strict, then the maximum of $Q(k)$ is attained for the rightmost value $k = x$, and if $x = n/2$, then $Q(k)$ attains the same maximum value at the two range bounds.

Regarding the minimum, since in this case $v$ is sandwiched between two integer values, the minimum of $Q(k)$ over (3) is attained for the integer that is closest to $v$, or for the two closest integers when $\{v\} = 1/2$, as announced in the statement of the theorem.
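Theorem 1 lends itself to an exhaustive numerical verification; the following Python sketch (not from the paper, names illustrative) compares the sets of minimizers and maximizers predicted by the theorem with those found by direct enumeration of the range (3):

```python
from math import floor

def extremes_theorem(n, x, y):
    """(Minimizers, maximizers) of Q(k) per Theorem 1; assumes n/2 <= x <= y < n."""
    argmax = {x} if 2 * x > n else {x + y - n, x}      # case a)
    if 2 * (x + y) > 3 * n:                            # case b), x + y > 3n/2
        argmin = {x + y - n}
    else:                                              # case b), x + y <= 3n/2
        v = (2 * x + 2 * y - n) / 4                    # vertex of the parabola
        argmin = ({floor(v), floor(v) + 1} if v % 1 == 0.5
                  else {floor(v + 0.5)})               # closest integer(s) to v
    return argmin, argmax

def extremes_brute(n, x, y):
    """(Minimizers, maximizers) of Q(k) by enumerating the feasible range (3)."""
    Q = lambda k: k**2 + (x - k)**2 + (y - k)**2 + (n + k - x - y)**2
    ks = list(range(x + y - n, x + 1))
    qmin, qmax = min(map(Q, ks)), max(map(Q, ks))
    return {k for k in ks if Q(k) == qmin}, {k for k in ks if Q(k) == qmax}

# Exhaustive agreement over all admissible marginals n/2 <= x <= y < n, n up to 40
ok = all(extremes_theorem(n, x, y) == extremes_brute(n, x, y)
         for n in range(2, 41)
         for x in range((n + 1) // 2, n)
         for y in range(x, n))
```

The quarters appearing in $v$ are exactly representable in binary floating point, so the tie check `v % 1 == 0.5` is safe here.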
It is instructive to visualize the configurations of the confusion matrices for which maximum and minimum agreement is attained, as learned from Theorem 1. Recall that our problem reductions imply that $n/2 \le x \le y$.

Figure 1: A configuration with maximum agreement. The square represents the whole data set $X$. Clustering $\mathcal{C}$ has boundary and cluster tags in black, while clustering $\mathcal{D}$ has them in grey.

Under the given conditions, the maximum agreement between clusterings is always attained for $k = x$, that is, for the confusion matrix

$$\begin{pmatrix} x & 0 \\ y - x & n - y \end{pmatrix}.$$

This corresponds to a situation where the biggest cluster of $\mathcal{C}$ is completely contained in the biggest cluster of $\mathcal{D}$, so that the smallest cluster of $\mathcal{D}$ only contains elements from the smallest cluster of $\mathcal{C}$, as depicted in Figure 1. When $x = n/2$, the two clusters in $\mathcal{C}$ have the same size, so the maximum agreement is attained when the biggest cluster of $\mathcal{D}$ completely contains either $C_1$ or $C_2$.

On the other hand, the situation of minimum agreement is not so straightforward to describe. The condition $x + y > 3n/2$ can be rewritten as $y > (n - x) + n/2$, so that the biggest cluster in $\mathcal{D}$ is big enough to contain the smallest cluster in $\mathcal{C}$ plus a considerable number of observations from the other cluster in $\mathcal{C}$ (more than half the total sample size). Then, the minimum agreement is attained for the confusion matrix

$$\begin{pmatrix} x + y - n & n - y \\ n - x & 0 \end{pmatrix}.$$

This means that the smallest cluster in $\mathcal{C}$ is completely contained in the biggest cluster of $\mathcal{D}$, which thus contains the most heterogeneous possible mixture of members of the two clusters of $\mathcal{C}$. This is represented graphically in Figure 2.

To describe the situation for $x + y \le 3n/2$, let us assume for simplicity that $v = (2x + 2y - n)/4$ is an integer, so that the minimum is attained for $k = v$. The confusion matrix for this case is

$$\begin{pmatrix} (2x + 2y - n)/4 & (n + 2x - 2y)/4 \\ (n - 2x + 2y)/4 & (3n - 2x - 2y)/4 \end{pmatrix},$$

but it does not seem easy to find an intuitive description for this situation.

Figure 2: A configuration with minimum agreement. The square represents the whole data set $X$. Clustering $\mathcal{C}$ has boundary and cluster tags in black, while clustering $\mathcal{D}$ has them in grey.

This paper provides an explicit solution for the maximization and minimization of a certain class of agreement measures between two clusterings, given the sizes of their clusters. This problem was posed 35 years ago by Hubert and Arabie (1985), and until now it was possible to solve it only through numerical algorithms (Steinley, Hendrickson and Brusco, 2015). Here, the focus is on the simplest case where each of the two clusterings has only two clusters. Although, unfortunately, an explicit solution is not yet available in its greatest generality, it is hoped that the explicit forms revealed here for the configurations attaining the agreement extremes could be inspiring for tackling the problem with clusterings of arbitrary sizes; a natural first step would be additional inspection of the results from exhaustive computation of all possible 3 × 3 tables.

References
Albatineh, A. N., Niewiadomska-Bugaj, M. and Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23, 301–313.

Brusco, M. J. and Steinley, D. (2008). A binary integer program to maximize the agreement between partitions. Journal of Classification, 25, 185–193.

Chacón, J. E. (2019). A close-up comparison of the misclassification error distance and the adjusted Rand index for external clustering evaluation. arXiv:1907.11505.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.

Meilă, M. (2016). Criteria for comparing clusterings. In C. Hennig, M. Meilă, F. Murtagh and R. Rocci (Eds.), Handbook of Cluster Analysis, 619–635. CRC Press, Boca Raton.

Messatfa, H. (1992). An algorithm to maximize the agreement between partitions. Journal of Classification, 9, 5–15.

Morey, L. C. and Agresti, A. (1984). The measurement of classification agreement: an adjustment of the Rand statistic for chance agreement. Educational and Psychological Measurement, 44, 33–37.

Steinley, D., Hendrickson, G. and Brusco, M. J. (2015). A note on maximizing the agreement between partitions: a stepwise optimal algorithm and some properties. Journal of Classification, 32, 114–126.

Warrens, M. J. (2008). On similarity coefficients for 2 × 2 tables and correction for chance. Psychometrika, 73, 487–502.

Warrens, M. J. and van der Hoef, H. (2019). Understanding partition comparison indices based on counting object pairs. arXiv:1901.01777.