Explicit agreement extremes for a 2×2 table with given marginals
arXiv [stat.ML]

José E. Chacón∗
22 January 2020
Abstract
The problem of maximizing (or minimizing) the agreement between clusterings, subject to given marginals, can be formally posed under a common framework for several agreement measures. Until now, it was possible to find its solution only through numerical algorithms. Here, an explicit solution is shown for the case where the two clusterings have two clusters each.

∗ Departamento de Matemáticas, Universidad de Extremadura, E-06006 Badajoz, Spain. E-mail: [email protected]

Introduction

Given two different clusterings of a data set, many measures have been proposed to quantify their degree of concordance. A recent review of a representative number of them can be found in Meilă (2016). These measures are usually categorized into three classes: those based on inspecting the assignments of data pairs in both clusterings, those involving some cluster matching between the two clusterings, and those relying on information-theoretic criteria. This paper concerns the first of these classes. In fact, some of the most popular and widely used similarity measures, such as the Rand index, the Jaccard index, or the Fowlkes-Mallows index, belong to this class of pair-based similarities, but it should be noted that there is a plethora of them, as explored in Albatineh, Niewiadomska-Bugaj and Mihalko (2006), Warrens (2008) or Warrens and van der Hoef (2019).

Precisely, when studying the Rand index, Morey and Agresti (1984) noted that this statistic does not take into account the possibility of agreement by chance. Hubert and Arabie (1985) suggested a general formulation to correct any of these indices for chance, which consists in subtracting from the index its expected value when the clustering labels are assigned at random (with the constraint that the number of clusters and their sizes are fixed), followed by a normalization that ensures that the resulting corrected index still attains a value of 1 for identical clusterings.
Namely, the adjusted version of any index is given by

$$\frac{\text{index} - E[\text{index}]}{\text{maximum index} - E[\text{index}]},$$

where "maximum index" is usually taken to be equal to 1, since that is the most general upper bound for these indices.

However, Hubert and Arabie (1985) also noted that another (perhaps more adequate) bound that could be used is the maximum of the index given the fixed marginals (i.e., the cluster sizes of each of the two clusterings). Unfortunately, they refer to the problem of finding the maximum index value subject to the marginal constraint as "a very difficult problem of combinatorial optimization." Nevertheless, some progress has been made since then. Messatfa (1992) pointed out that, for some of the pair-based similarities, the problem is equivalent to finding the maximum of the sum of the squares of the entries of the contingency table. And Brusco and Steinley (2008), and later Steinley, Hendrickson and Brusco (2015), proposed a binary integer program and a heuristic algorithm, respectively, to obtain an exact solution numerically.

Here, on the contrary, the focus is on finding explicit expressions for the confusion matrix configurations that maximize the agreement between two clusterings, given the marginals. Due to the aforementioned difficulty of the problem, this study is restricted to the simplest case of 2 × 2 confusion matrices, that is, to two clusterings with two clusters each.

Given two clusterings $\mathcal{C} = \{C_1, \dots, C_r\}$ and $\mathcal{D} = \{D_1, \dots, D_s\}$ of a data set $X = \{x_1, \dots, x_n\}$, let us denote by $n_{ij} = |C_i \cap D_j|$ the number of observations that are assigned to cluster $C_i$ in $\mathcal{C}$ and to cluster $D_j$ in $\mathcal{D}$. All the information about the concordance between $\mathcal{C}$ and $\mathcal{D}$ is collected in the $r \times s$ confusion matrix $N = (n_{ij})$, also known as the contingency table. The row and column marginals of $N$ are determined, respectively, by the vectors $(n_{1+}, \dots, n_{r+})$ and $(n_{+1}, \dots, n_{+s})$, where $n_{i+} = \sum_{j=1}^s n_{ij} = |C_i|$ and $n_{+j} = \sum_{i=1}^r n_{ij} = |D_j|$.

For any pair of observations $x_k, x_\ell$ with $k \neq \ell$, there are four possibilities regarding their group assignments according to $\mathcal{C}$ and $\mathcal{D}$: a) they are in the same cluster both in $\mathcal{C}$ and $\mathcal{D}$; b) they are in the same cluster in $\mathcal{C}$ but in different clusters in $\mathcal{D}$; c) they are in different clusters in $\mathcal{C}$ but in the same cluster in $\mathcal{D}$; and d) they are in different clusters both in $\mathcal{C}$ and $\mathcal{D}$. Following Brusco and Steinley (2008), the number of pairs in each of these classes can be computed from $N$, respectively, as

$$a = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r \sum_{j=1}^s n_{ij}^2 - n\Big),$$
$$b = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r n_{i+}^2 - \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2\Big),$$
$$c = \tfrac{1}{2}\Big(\textstyle\sum_{j=1}^s n_{+j}^2 - \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2\Big),$$
$$d = \tfrac{1}{2}\Big(\textstyle\sum_{i=1}^r \sum_{j=1}^s n_{ij}^2 + n^2 - \sum_{i=1}^r n_{i+}^2 - \sum_{j=1}^s n_{+j}^2\Big).$$

Many popular agreement indices are defined in terms of $a, b, c, d$, as for instance the Rand index, the adjusted Rand index, the Jaccard index or the Fowlkes-Mallows index (see, e.g., Steinley, Hendrickson and Brusco, 2015, for details). Moreover, Messatfa (1992) noted that, if the marginals are given and fixed, then all these quantities depend only on $Q = \sum_{i=1}^r \sum_{j=1}^s n_{ij}^2$, in such a way that $a$ and $d$ are maximized and $b$ and $c$ are minimized when $Q$ is maximized.
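As a quick numerical illustration (not part of the original paper), the pair counts $a, b, c, d$ and the quantity $Q$ can be computed directly from a confusion matrix; a minimal Python sketch, with illustrative names:

```python
import numpy as np

def pair_counts(N):
    """Pair counts a, b, c, d of Brusco and Steinley (2008) from a confusion matrix N."""
    N = np.asarray(N, dtype=np.int64)
    n = N.sum()                          # total number of observations
    Q = (N ** 2).sum()                   # sum of squared entries of N
    R = (N.sum(axis=1) ** 2).sum()       # sum of squared row marginals |C_i|^2
    S = (N.sum(axis=0) ** 2).sum()       # sum of squared column marginals |D_j|^2
    a = (Q - n) // 2                     # pairs together in both clusterings
    b = (R - Q) // 2                     # together in C, apart in D
    c = (S - Q) // 2                     # apart in C, together in D
    d = (Q + n ** 2 - R - S) // 2        # apart in both clusterings
    return a, b, c, d

# Example: N = [[3, 1], [1, 5]] with n = 10, so there are n(n-1)/2 = 45 pairs
a, b, c, d = pair_counts([[3, 1], [1, 5]])
rand_index = (a + d) / (a + b + c + d)   # the Rand index in terms of a, b, c, d
```

Note that the four counts always sum to $n(n-1)/2$, the total number of pairs, which provides a built-in sanity check.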
As a consequence, Brusco and Steinley (2008) noted that the problem of finding the extrema of the aforementioned indices, and also of many others included in Albatineh, Niewiadomska-Bugaj and Mihalko (2006), reduces to that of finding the extrema of $Q$, subject to the given marginals.

This paper deals with a simplified version of the problem, namely the case $r = s = 2$. In such a context, the class $\mathcal{N}(x, y, n)$ of possible contingency tables given the marginals $n_{1+} = x$ and $n_{+1} = y$ becomes uniparametric, with entries and marginals

            D_1          D_2                Total
  C_1       k            x − k              x
  C_2       y − k        n + k − x − y      n − x
  Total     y            n − y              n

The only free parameter in the previous table is $k$, which must be a non-negative integer. Moreover, all the entries in the table must be non-negative, resulting in the condition

$$\max\{0, x + y - n\} \le k \le \min\{x, y\}. \quad (1)$$

Thus, the problem reduces to finding the extrema of $Q$ over $\mathcal{N}(x, y, n)$; that is, finding the minimizer(s) and maximizer(s) of

$$Q(k) = k^2 + (x - k)^2 + (y - k)^2 + (n + k - x - y)^2 \quad (2)$$

subject to (1).

It is quite useful to consider some reductions that greatly simplify the problem. Notice that the codification of the cluster labels is arbitrary, for in cluster analysis what matters is the group membership, and not the group denomination. This means that we can assume that $x \ge n - x$ since, if that is not the case, it suffices to interchange the labels of clusters $C_1$ and $C_2$. Similarly, we can also assume that $y \ge n - y$. Finally, by interchanging the clusterings $\mathcal{C}$ and $\mathcal{D}$, if necessary, it is possible to assume that $x \le y$. The three reductions can be summarized in the condition $n/2 \le x \le y$ and, under such a condition, the range (1) of possible values of $k$ simplifies to

$$x + y - n \le k \le x. \quad (3)$$

Notice also that, in order to have two clusters in each clustering (i.e., to avoid the possibility of a degenerate, empty cluster), it must be $\max\{x, y\} < n$.
With this background, we are ready to state our main result.

Denote by $\lfloor z \rfloor$ and $\lceil z \rceil$ the floor and ceiling of a real number $z$, respectively; that is, $\lfloor z \rfloor$ is the greatest integer less than or equal to $z$, and $\lceil z \rceil$ is the least integer greater than or equal to $z$. Similarly, denote by $\{z\} = z - \lfloor z \rfloor$ the fractional (or decimal) part of $z$. With this notation, the closest integer to $z$ is $\lfloor z + 1/2 \rfloor$, assuming a round-up tie-breaking rule for those numbers with $\{z\} = 1/2$.

Theorem 1.
The maximum and minimum values of $Q(k)$, given $n$, $n_{1+} = x$ and $n_{+1} = y$, with $n/2 \le x \le y$, are attained as follows:

a) If $x > n/2$, the maximum is attained for $k = x$. If $x = n/2$, the maximum is attained both for $k = x + y - n$ and for $k = x$.

b) If $x + y > 3n/2$, the minimum is attained for $k = x + y - n$. If $x + y \le 3n/2$, the minimum is attained for $k = \lfloor v + 1/2 \rfloor$ if $\{v\} \neq 1/2$, and for both $k = \lfloor v \rfloor$ and $k = \lceil v \rceil$ if $\{v\} = 1/2$, where $v = (2x + 2y - n)/4$.

Proof. The function $Q(k)$ in (2) is quadratic in $k$. In fact, it can be alternatively expressed as

$$Q(k) = 4k^2 - 2(2x + 2y - n)k + (n - x)^2 + (n - y)^2 + (x + y)^2 - n^2,$$

so it is a convex parabola with minimum at $v = (2x + 2y - n)/4$. The different cases correspond to the location of $v$ with respect to the lower and upper bounds for $k$ in (3). It can never be $v \ge x$, since that would entail $y \ge x + n/2 \ge n$, which is not possible. So only two possibilities need to be studied: either $v < x + y - n$ or $x + y - n \le v < x$.

The condition $v < x + y - n$ is equivalent to $x + y > 3n/2$. In that case, since $Q(k)$ is a convex parabola with minimum to the left of the possible range of $k$ values, its minimum over (3) is attained for the leftmost feasible value of $k$, that is, $k = x + y - n$, and its maximum for the rightmost one, $k = x$.

When $x + y - n \le v < x$, which is equivalent to $x + y \le 3n/2$, the location of the maximum is determined by comparing $v$ with the midpoint $(2x + y - n)/2$ of the range (take into account that $Q(k)$ is symmetric with respect to $v$). But $x \ge n/2$ implies $v \le (2x + y - n)/2$, so $v$ is always less than or equal to the midpoint of the range of $k$ values. If the inequality is strict, then the maximum of $Q(k)$ is attained for the rightmost value $k = x$, and if $x = n/2$, then $Q(k)$ attains the same maximum value at the two range bounds.

Regarding the minimum, since in this case $v$ is sandwiched between two integer values, the minimum of $Q(k)$ over (3) is attained for the integer that is closest to $v$, or for the two closest integers when $\{v\} = 1/2$, as announced in the statement of the theorem.
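Theorem 1 lends itself to an exhaustive numerical verification; the following Python sketch (not from the paper, names illustrative) compares the sets of minimizers and maximizers predicted by the theorem with those found by direct enumeration of the range (3):

```python
from math import floor

def extremes_theorem(n, x, y):
    """(Minimizers, maximizers) of Q(k) per Theorem 1; assumes n/2 <= x <= y < n."""
    argmax = {x} if 2 * x > n else {x + y - n, x}      # case a)
    if 2 * (x + y) > 3 * n:                            # case b), x + y > 3n/2
        argmin = {x + y - n}
    else:                                              # case b), x + y <= 3n/2
        v = (2 * x + 2 * y - n) / 4                    # vertex of the parabola
        argmin = ({floor(v), floor(v) + 1} if v % 1 == 0.5
                  else {floor(v + 0.5)})               # closest integer(s) to v
    return argmin, argmax

def extremes_brute(n, x, y):
    """(Minimizers, maximizers) of Q(k) by enumerating the feasible range (3)."""
    Q = lambda k: k**2 + (x - k)**2 + (y - k)**2 + (n + k - x - y)**2
    ks = list(range(x + y - n, x + 1))
    qmin, qmax = min(map(Q, ks)), max(map(Q, ks))
    return {k for k in ks if Q(k) == qmin}, {k for k in ks if Q(k) == qmax}

# Exhaustive agreement over all admissible marginals n/2 <= x <= y < n, n up to 40
ok = all(extremes_theorem(n, x, y) == extremes_brute(n, x, y)
         for n in range(2, 41)
         for x in range((n + 1) // 2, n)
         for y in range(x, n))
```

The quarters appearing in $v$ are exactly representable in binary floating point, so the tie check `v % 1 == 0.5` is safe here.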
It is instructive to visualize the configurations of the confusion matrices for which maximum and minimum agreement is attained, as learned from Theorem 1. Recall that our problem reductions imply that $n/2 \le x \le y$.

Figure 1: A configuration with maximum agreement. The square represents the whole data set $X$. Clustering $\mathcal{C}$ has boundary and cluster tags in black, while clustering $\mathcal{D}$ has them in grey.

Under the given conditions, the maximum agreement between clusterings is always attained for $k = x$, that is, for the confusion matrix

$$\begin{pmatrix} x & 0 \\ y - x & n - y \end{pmatrix}.$$

This corresponds to a situation where the biggest cluster of $\mathcal{C}$ is completely contained in the biggest cluster of $\mathcal{D}$, so that the smallest cluster of $\mathcal{D}$ only contains elements from the smallest cluster of $\mathcal{C}$, as depicted in Figure 1. When $x = n/2$, the two clusters in $\mathcal{C}$ have the same size, so the maximum agreement is attained when the biggest cluster of $\mathcal{D}$ completely contains either $C_1$ or $C_2$.

On the other hand, the situation of minimum agreement is not so straightforward to describe. The condition $x + y > 3n/2$ can be rewritten as $y > (n - x) + n/2$, so that the biggest cluster in $\mathcal{D}$ is big enough to contain the smallest cluster in $\mathcal{C}$ plus a considerable number of observations from the other cluster in $\mathcal{C}$ (more than half the total sample size). Then, the minimum agreement is attained for the confusion matrix

$$\begin{pmatrix} x + y - n & n - y \\ n - x & 0 \end{pmatrix}.$$

This means that the smallest cluster in $\mathcal{C}$ is completely contained in the biggest cluster of $\mathcal{D}$, which thus contains the most heterogeneous possible mixture of members of the two clusters of $\mathcal{C}$. This is represented graphically in Figure 2.

To describe the situation for $x + y \le 3n/2$, let us assume for simplicity that $v = (2x + 2y - n)/4$ is an integer, so that the minimum is attained for $k = v$. The confusion matrix for this case is

$$\begin{pmatrix} (2x + 2y - n)/4 & (n + 2x - 2y)/4 \\ (n - 2x + 2y)/4 & (3n - 2x - 2y)/4 \end{pmatrix},$$

but it does not seem easy to find an intuitive description for this situation.

Figure 2: A configuration with minimum agreement. The square represents the whole data set $X$. Clustering $\mathcal{C}$ has boundary and cluster tags in black, while clustering $\mathcal{D}$ has them in grey.

This paper provides an explicit solution for the maximization and minimization of a certain class of agreement measures between two clusterings, given the sizes of their clusters. This problem was posed 35 years ago by Hubert and Arabie (1985), and until now it was possible to solve it only through numerical algorithms (Steinley, Hendrickson and Brusco, 2015). Here, the focus is on the simplest case where each of the two clusterings has only two clusters. Although, unfortunately, an explicit solution is not yet available in its greatest generality, it is hoped that the explicit forms revealed here for the configurations attaining the agreement extremes could be inspiring for tackling the problem with clusterings of arbitrary sizes; a natural first step would be additional inspection of the results from exhaustive computation of all possible 3 × 3 tables.

References
Albatineh, A. N., Niewiadomska-Bugaj, M. and Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23, 301–313.

Brusco, M. J. and Steinley, D. (2008). A binary integer program to maximize the agreement between partitions. Journal of Classification, 25, 185–193.

Chacón, J. E. (2019). A close-up comparison of the misclassification error distance and the adjusted Rand index for external clustering evaluation. arXiv:1907.11505.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.

Meilă, M. (2016). Criteria for comparing clusterings. In C. Hennig, M. Meilă, F. Murtagh and R. Rocci (Eds.), Handbook of Cluster Analysis, 619–635. CRC Press, Boca Raton.

Messatfa, H. (1992). An algorithm to maximize the agreement between partitions. Journal of Classification, 9, 5–15.

Morey, L. C. and Agresti, A. (1984). The measurement of classification agreement: an adjustment of the Rand statistic for chance agreement. Educational and Psychological Measurement, 44, 33–37.

Steinley, D., Hendrickson, G. and Brusco, M. J. (2015). A note on maximizing the agreement between partitions: a stepwise optimal algorithm and some properties. Journal of Classification, 32, 114–126.

Warrens, M. J. (2008). On similarity coefficients for 2 × 2 tables and correction for chance. Psychometrika, 73, 487–502.

Warrens, M. J. and van der Hoef, H. (2019). Understanding partition comparison indices based on counting object pairs. arXiv:1901.01777.