Graph-sensitive Indices for Comparing Clusterings

Technical Report
Zaeem Hussain
Department of Applied Mathematics, University of Washington
[email protected]

Marina Meilă
Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA
[email protected]
Abstract
This report discusses two new indices for comparing clusterings of a set of points. The motivation for looking at new ways of comparing clusterings stems from the fact that existing clustering indices are based on set cardinality alone and do not consider the positions of data points. The new indices, namely the Random Walk index (RWI) and Variation of Information with Neighbors (VIN), are both inspired by the clustering metric Variation of Information (VI). VI possesses some interesting theoretical properties which are also desirable in a metric for comparing clusterings. We define our indices, discuss those of their properties we have explored which appear relevant for a clustering index, and include the results of these indices on clusterings of some example data sets.
1. Introduction
The problem of comparing clusterings seeks to quantify the similarity or dissimilarity between two clusterings, or partitions, of a dataset. This comparison is needed in a variety of situations. For example, suppose we have a desired or correct clustering of a dataset and an algorithm that also outputs a clustering of the same data. An index that compares two clusterings is then required to determine whether the output of the algorithm is close to the correct solution. Such an index is also required when we have the results of two such algorithms and want to decide which one outputs a solution closer to the correct clustering. This is just one of many cases where indices for comparing clusterings are needed.

Clusterings can be compared based on different properties, and multiple indices have been developed by focusing on those different properties. However, all clustering comparison criteria are based on the
confusion matrix, or contingency table, of the clusterings being compared [Meilă, 2007]. Formally, let the clusterings being compared be denoted by $C = \{C_1, C_2, \dots, C_K\}$ and $C' = \{C'_1, C'_2, \dots, C'_{K'}\}$, so that the numbers of clusters are $K$ and $K'$ respectively. The clusters $C_1, \dots, C_K$ are mutually disjoint subsets of the data, and so are the clusters in $C'$. The confusion matrix is a $K \times K'$ matrix whose $kk'$-th element is the number of points in the intersection of the clusters $C_k$ and $C'_{k'}$, where $k \in \{1, 2, \dots, K\}$ and $k' \in \{1, 2, \dots, K'\}$. Since all clustering indices can be defined using the confusion matrix, they are based on the counts of clusters and their intersections alone, and ignore any other relationships the points may have. In other words, when comparing two clusterings, these indices depend only on the number of points that go into each cluster and not on the distances between points within a cluster or across clusters. Consider, for example, the different clusterings of a dataset in Figure 1. Any existing index, if calculated between the first two clusterings, would yield the same answer as the index calculated between the first and last clustering. However, as can be seen clearly from the figure, the second and third clusterings are not the same and should not be judged to be at the same 'distance' from the first partitioning. The problem, then, is to define an index with some desirable properties that compares two clusterings while also incorporating spatial relationships between points in the dataset, if there are any.

[Figure 1: Three different clusterings of the same data. In (b), some points from group 2 in (a) are put in group 1. In (c), the same number of points that changed labels in (b) again change from label 2 to 1, but the points that change are different from the points that changed in (b).]
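To make the starting point concrete, the following is a minimal sketch (our own, in Python with NumPy, not from the original report) of how the confusion matrix is built from two label vectors; the function name and the zero-based integer label encoding are illustrative choices.

```python
import numpy as np

def confusion_matrix(u, v):
    """Build the K x K' confusion matrix: entry (k, k') counts the points
    lying in cluster k of the first clustering and cluster k' of the second.
    Labels are assumed to be integers 0..K-1 and 0..K'-1."""
    u, v = np.asarray(u), np.asarray(v)
    M = np.zeros((u.max() + 1, v.max() + 1), dtype=int)
    for k, kp in zip(u, v):
        M[k, kp] += 1          # one point in the intersection of C_k and C'_{k'}
    return M

# 6 points: clusters {0,1,2},{3,4,5} against {0,1},{2,3,4,5}
print(confusion_matrix([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
# [[2 1]
#  [0 3]]
```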
2. Variation of Information
The Variation of Information (VI) is a clustering metric that also suffers from the above-mentioned limitation in comparing clusterings. However, it possesses some interesting theoretical properties which are desirable in any index for comparing clusterings. VI is calculated by assigning to each cluster the probability that a point picked at random falls into that cluster. Precisely, let the total number of points in the dataset be $n$. Call one clustering $C$, with a total of $K$ clusters, and the other clustering $C'$, with $K'$ clusters. Denote the number of points in cluster $C_k$ of $C$ by $n_k$ and the number of points in $C'_{k'}$ of $C'$ by $n'_{k'}$. The probability that a point chosen at random falls in cluster $C_k$ is

$$P(k) = \frac{n_k}{n} \qquad (1)$$

and similarly the joint probability that a point chosen at random falls in cluster $C_k$ in $C$ and in cluster $C'_{k'}$ in $C'$ is

$$P(k, k') = \frac{|C_k \cap C'_{k'}|}{n} \qquad (2)$$

This probability can be calculated using the confusion matrix defined in the previous section. Let the $kk'$-th element of the confusion matrix be denoted by $n_{kk'}$. By the definition of the confusion matrix, $n_{kk'} = |C_k \cap C'_{k'}|$, so the probability can be calculated as

$$P(k, k') = \frac{n_{kk'}}{n} \qquad (3)$$

Using these probabilities, we can define the entropy of clustering $C$ as

$$H(C) = -\sum_{k=1}^{K} P(k) \log P(k) \qquad (4)$$

and similarly for $C'$. The mutual information between $C$ and $C'$ is defined as

$$I(C, C') = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k, k')}{P(k) P(k')} \qquad (5)$$

Using these quantities, the variation of information between two clusterings is then defined as

$$VI(C, C') = H(C) + H(C') - 2\, I(C, C') \qquad (6)$$

which, using simple arithmetic on entropies, can be shown to be of the form
$$VI(C, C') = H(C \mid C') + H(C' \mid C) \qquad (7)$$

As mentioned before, VI possesses certain desirable properties. Among the most important are that it is bounded and that it is a metric, thus introducing a notion of distance on the space of clusterings. Some of its theoretical properties will be discussed in detail in the subsequent sections, where they are compared with the properties of our proposed indices.
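As a reference point for the experiments later in the report, here is a minimal sketch of VI computed from equations (3)-(7). This is our own illustrative code, not the authors' implementation; the helper name `vi` is hypothetical and is reused in the later sketches.

```python
from collections import Counter
from math import log

def vi(u, v):
    """Variation of Information between two label sequences (eqs. 3-7).
    Natural logarithms; returns 0.0 for identical clusterings."""
    n = len(u)
    h = lambda counts: -sum(c / n * log(c / n) for c in counts)
    h_uv = h(Counter(zip(u, v)).values())   # joint entropy H(C, C')
    h_u = h(Counter(u).values())            # H(C)
    h_v = h(Counter(v).values())            # H(C')
    # VI = H(C|C') + H(C'|C) = 2 H(C,C') - H(C) - H(C')
    return 2 * h_uv - h_u - h_v

# Splitting one 4-point cluster into two halves: VI = ln 2 ~ 0.693
print(vi([0, 0, 0, 0], [0, 0, 1, 1]))
```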
3. Proposed Indices
A natural way to view the points in a dataset, with a distance defined between any pair of points, is as a weighted undirected graph. Such a graph can be described by the $n \times n$ similarity matrix $S$, whose $ij$-th element represents the similarity, or edge weight, between points $i$ and $j$. A special case of this is a graph with no weights on the edges, which may be represented by the binary adjacency matrix $A$, whose $ij$-th entry is 1 if there is an edge between points $i$ and $j$ and 0 otherwise. Hence, with the points represented as nodes in a weighted undirected graph, one may define indices which also look at the labels of the neighbors of a point, instead of comparing clusterings by just comparing labels of each point independently of its neighbors.

Based on the idea of considering the data as a graph, we can define an index using the random walks view of the graph. First, let us review some concepts on transition probabilities from Markov random walk theory [Meilă and Shi, 2001]. The similarities in $S$ are nonnegative and symmetric: $S_{ij} = S_{ji} \ge 0$ for all nodes $i$ and $j$ of the graph. The degree of a node $i$ is defined as $d_i = \sum_{j=1}^{n} S_{ij}$, and the matrix $D$ is the diagonal matrix formed from the degrees of the nodes. The stochastic matrix $T$ is obtained by "normalizing" the similarity matrix as

$$T = D^{-1} S \qquad (8)$$

The entry $T_{ij}$ of $T$ represents the probability of going from node $i$ to node $j$ given that we are at node $i$. The stationary distribution of the Markov chain, denoted $\pi^\infty_i$, is defined as

$$\pi^\infty_i = \frac{d_i}{\sum_{j=1}^{n} d_j} \qquad (9)$$

Let $k_t$ represent the label, in the first clustering, of the point traversed at the current time step $t$ of the random walk, and let $k'_t$ represent its label in the second clustering. Therefore, $k_t \in \{1, 2, \dots, K\}$ and $k'_t \in \{1, 2, \dots, K'\}$. Now, assuming the random walk starts in the stationary distribution, the probability of going from a point in cluster $C_k$ to a point in $C_l$ in one step, given that we are in cluster $C_k$ at the previous time $t-1$, is $P_{C_k C_l} = \Pr(C_k \to C_l \mid C_k) = \Pr(k_{t-1} = k, k_t = l \mid k_{t-1} = k)$, defined as

$$P_{C_k C_l} = \frac{\sum_{i \in C_k, j \in C_l} \pi^\infty_i T_{ij}}{\pi^\infty(C_k)} = \frac{\sum_{i \in C_k} \pi^\infty_i \sum_{j \in C_l} T_{ij}}{\pi^\infty(C_k)}, \qquad \text{where } \pi^\infty(C_k) = \sum_{i \in C_k} \pi^\infty_i \qquad (10)$$

Similarly, the probability $\Pr(k_t \mid k'_t, k_{t-1})$ can be obtained from

$$\Pr(k_t = l \mid k'_t = m, k_{t-1} = k) = \Pr(C_k \to C_l, C_k \to C'_m \mid C_k \to C'_m) = \frac{\Pr(C_k \to (C_l \cap C'_m))}{\Pr(C_k \to C'_m)} \qquad (11)$$

Calculating the transition probabilities this way allows us to condition the label of a point in one clustering on its label in the other clustering as well as on the label of its neighbor. Using these probabilities we can define the first of our indices, the Random Walk index (RWI), as

$$RWI(C, C') = H(k_t \mid k'_t, k_{t-1}) + H(k'_t \mid k_t, k'_{t-1}) \qquad (12)$$

This index sums up the uncertainty in the label of a point given the label of the point visited in the previous step and the label of the same point in the other clustering. By including information about the label of the just-traversed point, we are adding the label information of one neighbor of each point. Comparing equation (12) with (7), we see that the difference between the two indices is the inclusion of the label of a neighbor in the calculation of (12), whereas VI does not incorporate any such information. Since we are adding another conditioning variable in the entropy terms, this index is never larger than VI.
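Before turning to complexity, here is a minimal sketch of how RWI could be computed from equations (8), (9), and (12). It is our own hedged reconstruction, not the authors' code; the helper names `rwi` and `_cond_entropy` are hypothetical. It enumerates the joint distribution of (previous label, conditioning label, current label) directly, which is adequate for the small examples in this report.

```python
import numpy as np
from collections import defaultdict
from math import log

def _cond_entropy(pi, T, prev, cond, cur):
    """H(cur_t | cond_t, prev_{t-1}) under a stationary one-step walk:
    `prev` labels the departure node, `cond` and `cur` label the arrival."""
    joint = defaultdict(float)                 # P(prev, cond, cur)
    n = len(pi)
    for i in range(n):
        for j in range(n):
            p = pi[i] * T[i, j]
            if p > 0:
                joint[(prev[i], cond[j], cur[j])] += p
    marg = defaultdict(float)                  # P(prev, cond)
    for (a, b, _), p in joint.items():
        marg[(a, b)] += p
    return -sum(p * log(p / marg[(a, b)]) for (a, b, _), p in joint.items())

def rwi(S, u, v):
    """Random Walk index of eq. (12) for similarity matrix S and label
    vectors u (first clustering) and v (second clustering)."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)                          # node degrees
    T = S / d[:, None]                         # T = D^{-1} S        (eq. 8)
    pi = d / d.sum()                           # stationary distrib. (eq. 9)
    return _cond_entropy(pi, T, u, v, u) + _cond_entropy(pi, T, v, u, v)
```

Note that for identical clusterings the conditioning label determines the current label, so both entropy terms vanish and the sketch returns zero, as the definition requires.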
One point to note is that the time step $t$ itself does not matter: only one-step transition probabilities are needed, so $t$ does not figure in the calculations of this index. Since the number of probability entries to be calculated is $O(K^3)$, where $K$ is the maximum number of clusters in either of the two clusterings, the running time of the algorithm for this index is cubic in the number of clusters. More precisely, if the number of clusters in the first clustering is $K$ and in the second clustering is $K'$, the running time to calculate this index is $O(KK'(K + K')N)$, where $N$ is the total number of points in the data set.

The other index we propose as a comparison metric for clusterings also takes into account the labels of the neighbors of a point. However, rather than quantifying how much information the labels of neighbors give about the point, this metric measures the amount of information the labels of a neighborhood of points in one clustering give about the labels of the same neighborhood in the other clustering. Formally, the metric is defined as
$$VIN(C, C') = H(N(X) \mid N(Y)) + H(N(Y) \mid N(X)) \qquad (13)$$

where again $C$ and $C'$ are the two clusterings being compared. $X$ is the label of a point in the first clustering, and $N(X)$ represents the labels of the neighbors of that point, and of the point itself, in the first clustering. Similarly, $Y$ is the label of the same point in the second clustering, and $N(Y)$ represents the labels of this point and its neighbors in the second labeling.

Whereas the previous proposed index was based on probabilities calculated from the similarity matrix of the graph using Markov random walk theory, the probabilities here are simply based on counting the different kinds of neighborhoods in the clustering. In other words, for a set of $n$ points there are $n$ neighborhoods, one for each point. Each neighborhood is characterized by the label of its 'central' point and the numbers of this point's neighbors that belong to each of the clusters in the labeling. Two neighborhoods are considered equal, or belonging to the same class, when the following three conditions are met:

1. The labels of both 'central' points are the same.

2. The numbers of neighboring points of each of the 'central' points are equal.
3. The number of neighboring points belonging to each cluster is the same in both neighborhoods.

As an example, consider a graph of 100 nodes where each node is given one of 3 different labels: $a$, $b$ and $c$. One node $i$ is labeled $a$ and has 5 neighbors, 2 of which are labeled $a$, another two labeled $b$, and one labeled $c$. Another node $j$ is also labeled $a$ and has 5 neighbors with the same number of nodes of each label as the neighbors of $i$. The neighborhoods of $i$ and $j$ would therefore be regarded as equal, or belonging to the same category. On the other hand, a node $k$ which also has 5 neighbors with the same distribution of labels, but is itself labeled $c$, would fall into a different category. Using this classification of neighborhoods, we can calculate the conditional probabilities $P(N(X) \mid N(Y))$ by counting the number of neighborhoods belonging to a category in the second labeling and then looking at the categories of those neighborhoods in the first clustering. More precisely, to calculate $P(N(X) = S_X \mid N(Y) = S_Y)$, where $S_X$ and $S_Y$ are sets of point labels in the first and second clustering respectively, we first count the number of points whose neighborhoods in the second clustering equal $S_Y$ in the sense described above. Among those points, the number of points whose neighborhoods in the first clustering equal $S_X$ is noted. The probability is then simply the ratio of these two counts. The following is the algorithm to compute VIN:

Data: Row vectors $u$ and $v$ of size $1 \times n$ representing labels of points in $C$ and $C'$; $n \times n$ adjacency matrix $A$
Result: Variation of Information with Neighbors
  % Stack both row vectors vertically
  $U \leftarrow$ repmat($u$, $n$, 1);  $V \leftarrow$ repmat($v$, $n$, 1)
  % Compute element-by-element product
  $A_u \leftarrow U \,.\!*\, A$;  $A_v \leftarrow V \,.\!*\, A$
  $B_u \leftarrow$ sortrwd($A_u$);  $B_v \leftarrow$ sortrwd($A_v$)
  $u' \leftarrow$ finduniquerows($B_u$);  $v' \leftarrow$ finduniquerows($B_v$)
  $VIN \leftarrow$ VI($u'$, $v'$)

Algorithm 1: Algorithm to compute VIN
The inputs to the function computing this index are two arrays of the same size containing the point labels in both clusterings, together with the $n \times n$ adjacency matrix of the graph. The adjacency matrix is used to obtain two $n \times n$ matrices $A_u$ and $A_v$ whose rows represent the labels of the neighborhoods of the points. The entry $A_u(i, j)$ equals the label of point $j$ in the first clustering if $j$ is connected to $i$; if $i$ and $j$ are not connected, $A_u(i, j) = 0$. The diagonal entry $A_u(i, i)$ represents the label of point $i$ in the first clustering. $A_v$ similarly represents the labels of points in the second clustering. Now, if the neighborhoods of two points in the first clustering are equal, the diagonal entries of their corresponding rows are the same, and the remaining elements of one row are a permutation of the remaining elements of the other row. Hence, neighborhoods can be compared by comparing the rows of these matrices. The sortrwd operation in the algorithm first shifts the diagonal entry of each row to the beginning of the row and then sorts the rest of the row, for all rows in the matrix. The operation finduniquerows indexes the unique rows of the matrix, such that identical rows get the same label, and returns labels for the rows. Finally, the function VI simply calculates the variation of information between the two clusterings represented by its input vectors.

If the maximum degree of any node in the graph is $K$, the maximum number of nonzero elements in any row is $K + 1$. Finding the different kinds of neighborhoods can then be done by first sorting the rows, which takes $O(K \log K)$ per row, and then comparing the rows with each other. Again, since the maximum length of any row is $K + 1$, the row comparisons can be done in $O(NK)$ using a radix tree [Morrison, 1968]. Thus, the running time of such an algorithm is $O(NK \log K)$.
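As a concrete rendering of Algorithm 1, here is a minimal Python sketch, assuming the hypothetical `vi` helper from the sketch in Section 2. This is our own illustration, not the authors' implementation; it canonicalizes each neighborhood as (own label, sorted neighbor labels), mirroring sortrwd and finduniquerows.

```python
import numpy as np

def vin(A, u, v):
    """Variation of Information with Neighbors via eq. (21):
    VIN(C, C') = VI(D, D'), where D and D' relabel every point by its
    canonicalized neighborhood (own label first, neighbor labels sorted)."""
    A = np.asarray(A)
    n = len(u)
    def refine(labels):
        keys = []
        for i in range(n):
            # sortrwd: own label first, then the sorted neighbor labels
            nbrs = sorted(labels[j] for j in range(n) if j != i and A[i, j])
            keys.append((labels[i], tuple(nbrs)))
        # finduniquerows: identical neighborhood keys get the same new label
        ids = {k: t for t, k in enumerate(dict.fromkeys(keys))}
        return [ids[k] for k in keys]
    return vi(refine(u), refine(v))            # vi() as sketched in Section 2
```

A design note: `refine()` only merges points whose original labels already agree, so the relabelings it produces are refinements of the original clusterings, which is exactly the property used in Section 4 to establish equation (21).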
4. Some properties of RWI and VIN
As mentioned before for the Variation of Information, a desirable property of a clustering comparison index is that it should be a metric. As can be seen from the definition of the first proposed index in Equation (12), the index is symmetric as well as nonnegative, being zero only for identical clusterings. Thus, if this index satisfied the triangle inequality, it would be a metric and would give a measure of the closeness of two clusterings while also taking into account the associations between data points. However, when the proposed Random Walk index was computed for a simple example of 4 points, shown in Figure 2, it was found that

$$RWI(A, B) + RWI(B, C) < RWI(A, C) \qquad (14)$$

which implies that the Random Walk index is not a metric on the space of clusterings.

The other proposed index, Variation of Information with Neighbors, does satisfy the triangle inequality for arbitrary clusterings $A$, $B$ and $C$ of a set of points:

$$VIN(A, B) + VIN(B, C) \ge VIN(A, C) \qquad (15)$$

This index satisfies the other two properties (nonnegativity and symmetry) and is also zero for identical clusterings. Thus, it is a metric on the space of clusterings. It also satisfies the following relation for any clustering $C$:
$$VIN(\hat{1}, C) + VIN(C, \hat{0}) = VIN(\hat{1}, \hat{0}) \qquad (16)$$

where $\hat{1}$ is the labeling in which each point of the data set is given a different label and $\hat{0}$ is the other extreme, where each point is given the same label. This relation is also satisfied by VI but not by the Random Walk index.

[Figure 2: Three different clusterings of a set of 4 points. The number with each node represents its label.]

VI with Neighbors can be considered a generalization of VI, in the sense that if the graph is completely connected or not connected at all, VIN as defined above reduces to VI. RWI also reduces to VI when the graph is not connected at all, but not when the graph is completely connected.

There is an interesting property that holds for the Variation of Information and makes it a more intuitive distance over the space of clusterings. If a clustering $C'$ is obtained from $C$ by splitting $C_k$ into a number of clusters, the VI between $C$ and $C'$ equals the probability of cluster $C_k$ times the entropy of the clusters obtained from $C_k$. Formally, assume $C'$ is obtained from $C$ by splitting $C_k$ into clusters $C'_{k_1}, \dots, C'_{k_m}$. The cluster probabilities are

$$P'(k') = \begin{cases} P(k'), & \text{if } C'_{k'} \in C \\ P(k' \mid k)\, P(k), & \text{if } C'_{k'} \subseteq C_k \in C \end{cases} \qquad (17)$$

where $P(k' \mid k)$ for $k' \in \{k_1, \dots, k_m\}$ is

$$P(k_l \mid k) = \frac{|C'_{k_l}|}{|C_k|} \qquad (18)$$

and its entropy, which represents the uncertainty associated with splitting $C_k$, is

$$H_{|k} = -\sum_{l=1}^{m} P(k_l \mid k) \log P(k_l \mid k) \qquad (19)$$

Then [Meilă, 2007],

$$VI(C, C') = P(k)\, H_{|k} \qquad (20)$$

This property also induces the additivity of composition for VI, which says that if two clusterings are obtained by further segmenting the same clustering, the VI between the two clusterings is a weighted sum of the VIs between the partitions of each cluster in the bigger clustering.

The VI with Neighbors satisfies a weak form of additivity of composition. To see this, we first observe that VIN also essentially computes the VI between two partitions. However, these partitions are obtained from the original clusterings being compared and are refinements of the original clusterings. A refinement $D$ of a clustering $C$ is a partitioning which preserves the boundaries in $C$ but in which some of the clusters of $C$ are further split. The reason the clusterings compared for VI with Neighbors are refinements of the originals is that each point is relabeled based on the labels of its neighbors, and no two points that have different labels in the original labeling are reclassified as belonging to the same category when the neighborhoods are compared. Hence, VI with Neighbors can be written as
$$VIN(C, C') = VI(D, D') \qquad (21)$$

where $D$ and $D'$ are refinements of $C$ and $C'$ respectively, based on the neighborhoods of the points. In VI, the additivity of composition holds because the distance between two clusterings depends only on the clusters that vary between the two partitions. For VI with Neighbors, however, the distance is computed between clusterings which are refinements of the originals, and these refinements are obtained by looking at the directly connected neighbors of the points. So even if two clusterings differ only in one cluster, their refinements as dictated by VIN will be influenced by points outside that cluster which are directly connected to points inside it. Hence, VIN can be considered as satisfying weak additivity of composition, in the sense that the splitting relation above holds for VIN only when the cluster that is split is not connected to the rest of the points in the graph. If the points in that cluster have edges to other points in the graph, the relation does not hold for VIN.
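As a quick numerical illustration of the splitting property (20) (our own worked example, not from the original report): take $n = 10$ points, let $C$ consist of clusters of sizes 4 and 6, and let $C'$ be obtained by splitting the 4-point cluster into two clusters of 2 points each. Then

$$P(k) = \frac{4}{10}, \qquad H_{|k} = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1 \text{ bit}, \qquad VI(C, C') = 0.4 \times 1 = 0.4 \text{ bits},$$

which matches a direct evaluation of (6)-(7) on the confusion matrix $\begin{pmatrix} 2 & 2 & 0 \\ 0 & 0 & 6 \end{pmatrix}$: here $H(C \mid C') = 0$ and $H(C' \mid C) = 0.4$ bits.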
5. Experiments and Results
Since the indices in this report were proposed to address the limitation of VI that it does not take the neighborhoods of data points into account, the example clusterings considered here are those for which VI does not provide a satisfactory answer. The three kinds of graphs that will be considered are: a chain with evenly spaced points; Gaussian data with similarities between pairs of points encoded by weights on edges; and images, where the similarities between pixels depend on their spatial distances from each other. In all the examples we have three clusterings, where the last two are different modifications of the first clustering and are judged by VI to be at the same distance from the first one.

[Figure 3: Three different clusterings of a set of 10 points.]
The first case we consider is that of a chain with all points belonging to the same cluster. Two new clusterings are obtained from this clustering by relabeling one of the points of the chain with a different label: in one clustering, a point from the middle is chosen, and in the second, a point at one of the ends is chosen for relabeling. An instance of this case is shown in Figure 3, where the total number of points is 10. For the Random Walk index, all the weights on the edges are uniformly set to 1. Intuitively, we would expect clustering $C$ to be closer to $A$ than $B$ is. This is indeed what is observed with both of the proposed indices: the values of the Random Walk index and of VI with Neighbors between $A$ and $B$ are greater than the corresponding values between $A$ and $C$. It must be remembered that VI would judge both $B$ and $C$ to be at the same distance from $A$.

Next we test the two proposed indices on a scenario based on the situation shown in Figure 1. In this case, the original labeling has two clusters, where one half of the line belongs to one cluster and the other half to the other cluster. Two new clusterings are obtained from this by taking a certain number of points from one cluster and relabeling them as belonging to the other cluster. One clustering is obtained by relabeling the points closest to the boundary between the two clusters, and the other is obtained by relabeling the points at the end of the line. An example of this, again with 10 points, is shown in Figure 4. Once again, the Variation of Information judges the second clustering to be at the same distance from the first one as the third clustering, although, based on the location of the points, the second and third clusterings should be at different distances from the first one. This is indeed what is observed when the two proposed indices are used to compare these clusterings. However, the results seem contrary to intuition, because both indices judge the third clustering to be closer to the first one than the second clustering. The Random Walk index was computed based on two different similarity matrices. The first one only included information about the two adjoining neighbors of each point in the chain. The other similarity matrix included edges between all the points in the chain, with the edge weights inversely related to the distances between the points. Still, in both cases the index judged the third clustering closer to the first segmentation than the second one.

[Figure 4: Three different clusterings of a set of 10 points.]
[Figure 5: Three different clusterings of a set of 100 points obtained from a 2D Gaussian distribution.]
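To make the chain setup reproducible, the following snippet (our own, reusing the hypothetical `rwi` and `vin` sketches above) builds the 10-point chain of Figure 3 and compares the three labelings; by the report's account, both indices should return a larger distance for the mid-chain relabeling $B$ than for the end relabeling $C$.

```python
import numpy as np

n = 10
A = np.zeros((n, n))                 # chain: unit-weight consecutive edges
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

a = [0] * n                          # A: all points in one cluster
b = [0] * n; b[n // 2] = 1           # B: relabel a middle point
c = [0] * n; c[-1] = 1               # C: relabel an end point

print(rwi(A, a, b), rwi(A, a, c))    # expect d(A,B) > d(A,C)
print(vin(A, a, b), vin(A, a, c))    # expect d(A,B) > d(A,C)
```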
Next we consider Gaussian data, where the edge weights between points represent the similarity between them. If the distance between points $i$ and $j$ is $d_{ij}$, the similarity $s_{ij}$ between the points is calculated as $s_{ij} = e^{-d_{ij}}$. For VI with Neighbors, the adjacency matrix is obtained by setting a threshold: all edges with weights above the threshold are kept and the rest are dropped. If the weights are obtained from spatial distances between the points, this is equivalent to taking an $\epsilon$-neighborhood around each point (a small construction sketch appears at the end of this section). The clusterings we use here are similar to the first case of the chain example discussed above. Initially, all the points have the same label. One clustering is then generated by relabeling the point that is farthest from the mean, and the other clustering is generated by relabeling the point closest to the mean. An example of two-dimensional Gaussian data with 100 points is shown in Figure 5.

A total of 100 simulations of this scenario were run, with the covariance between the $x$ and $y$ coordinates set to 0 and the variances of both $x$ and $y$ set to 1. In the majority of the simulations, both the Random Walk index and VI with Neighbors judge the clustering in which the farthest point from the mean is relabeled to be closer to the original clustering than the one in which the point closest to the mean is relabeled. The results are summarized in the table below. The two middle columns list the means of the corresponding indices computed first on clusterings $A$ and $B$ and then on clusterings $A$ and $C$. The last column lists the number of mistakes made by each index over the 100 trials. In the context of the example in Figure 5, the Random Walk index judged $B$ to be closer to $A$ than $C$
96 times, while VI with Neighbors did not make a single mistake over the 100 simulations in declaring $B$ to be closer to $A$ than $C$.

Indices   Mean of d(A,B)   Mean of d(A,C)   Errors
VI        0.0243           0.0243           100
RWI       0.0355           0.1362           4
VIN       0.0142           0.0782           0

[Figure 6: An image with true segmentation boundaries.]

Finally, we compute the indices between segmentations of the image in Figure 6, which also shows the true boundaries of the image. The edge weights between pixels for the Random Walk index are negative exponentials of the squares of the spatial distances between them. For simplicity and memory constraints, only a 5 × [...] was used. The two perturbed segmentations, shown in Figure 7, each relabel a 10 ×
10 square of the pixels, whose boundaries are shown in red. Both indices, however, judged $C$ to be closer to $A$ than $B$. The results for this example are tabulated as follows:

Indices   d(A,B)   d(A,C)
RWI       0.0184   0.0172
VIN       0.0064   0.0062

[Figure 7: True segmentation of the image and two perturbations.]

The second case considered is shown in Figure 8. Here, instead of relabeling a 10 ×
10 square of pixels, the 100 pixels just along the boundary between land and water are relabeled to obtain $B$, and a horizontal line of 100 pixels, which is quite far from the original 'dirt' segment pixels, is relabeled as belonging to the 'dirt' cluster on land to obtain the perturbed clustering $C$. For this example, VI with Neighbors judged $B$ to be closer to $A$ than $C$; the Random Walk index, however, again judged $C$ to be closer to $A$ than $B$. The results are in the following table:

Indices   d(A,B)   d(A,C)
RWI       0.0335   0.0248
VIN       0.0087   0.0094

[Figure 8: True segmentation of the image and two perturbations.]
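For completeness, here is the small construction sketch referenced above for the Gaussian experiments. It builds the similarity matrix $s_{ij} = e^{-d_{ij}}$ and the $\epsilon$-neighborhood adjacency used for VIN (thresholding $S$ at $e^{-\epsilon}$ is the same as keeping edges with $d_{ij} < \epsilon$); the value of $\epsilon$ is an illustrative assumption, as the report does not state the one actually used, and `rwi`/`vin` are the hypothetical helpers sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))    # 100 points from a 2D standard Gaussian

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise d_ij
S = np.exp(-D)                       # similarities s_ij = e^{-d_ij}
np.fill_diagonal(S, 0.0)             # no self-loops in the walk

eps = 1.0                            # assumed threshold; not given in the report
A = (D < eps).astype(float)          # epsilon-neighborhood adjacency for VIN
np.fill_diagonal(A, 0.0)

norms = np.linalg.norm(X - X.mean(axis=0), axis=1)
a = [0] * 100
b = list(a); b[int(np.argmax(norms))] = 1   # B: relabel farthest from the mean
c = list(a); c[int(np.argmin(norms))] = 1   # C: relabel closest to the mean

print(rwi(S, a, b) < rwi(S, a, c))   # True in most runs, per the report
print(vin(A, a, b) < vin(A, a, c))
```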
6. Conclusion
We presented two criteria for comparing clusterings of a data set that take into account the similarity between the points. The first index, which we call the Random Walk index, is based on probabilities calculated from the similarity matrix of the graph using Markov random walk theory. The second proposed index, the Variation of Information with Neighbors, counts different neighborhoods based on the labels of the points in those neighborhoods. Both criteria were tested on examples to see whether they possess the properties desirable in a clustering comparison criterion. The Random Walk index was found not to satisfy the triangle inequality and is thus not a metric on the space of clusterings. VI with Neighbors, however, can be shown to satisfy the triangle inequality and the other conditions for a distance, and so is a metric on the space of clusterings. Both indices were observed to judge clusterings differently based on the locations of the points in the data. Some of the results agreed with how one might expect the distance between clusterings to behave, but the results on other examples apparently indicate otherwise. Further exploration along this avenue, with more examples, might give a clearer picture of how these indices fit with human intuition about the similarity between clusterings of a set of points. The property of splitting a cluster, where the comparison metric should not depend on points whose labels remain the same across the clusterings being compared, was also checked for the proposed indices. The Random Walk index did not always satisfy this property and was instead observed to depend on the similarity matrix. VI with Neighbors, on the other hand, satisfies a weak form of this property, in which the distance between two clusterings depends on the points directly connected to the cluster that varies, even if the labels of those points remain unchanged across the clusterings being compared.

This report is meant as an introductory document on ideas for comparing clusterings of a set of points while incorporating information about the distances between points. Some of the properties that are theoretically interesting for a comparison index were checked on basic examples, but a detailed analysis with larger examples would be required to establish these indices as standard comparison criteria. There are other interesting theoretical avenues for exploration with these indices as well, such as comparison with the meet of the clusterings and identifying nearest neighbors of clusterings according to these indices.
7. Appendix
Proofs to follow.
8. References
1. Meilă, Marina. "Comparing clusterings - an information based distance." Journal of Multivariate Analysis 98.5 (2007): 873-895.

2. Meilă, Marina, and Jianbo Shi. "A random walks view of spectral segmentation." (2001).

3. Morrison, Donald R. "PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric." Journal of the ACM 15.4 (1968): 514-534.