Graph Distances and Clustering

Pierre Miasnikof*, Alexander Y. Shestopaloff, Leonidas Pitsoulis, and Yuri Lawryshyn

University of Toronto, Toronto, ON, Canada
The Alan Turing Institute, London, United Kingdom
Aristotle University of Thessaloniki, Thessaloniki, Greece

*Corresponding author: [email protected]
Abstract
With a view on graph clustering, we present a definition of vertex-to-vertex distance which is based on shared connectivity. We argue that vertices sharing more connections are closer to each other than vertices sharing fewer connections. Our thesis is centered on the widely accepted notion that strong clusters are formed by high levels of induced subgraph density, where subgraphs represent clusters. We argue these clusters are formed by grouping vertices deemed to be similar in their connectivity. At the cluster level (induced subgraph level), our thesis translates into low mean intra-cluster distances. Our definition differs from the usual shortest-path geodesic distance. In this article, we compare three distance measures from the literature. Our benchmark is the accuracy of each measure's reflection of intra-cluster density, when aggregated (averaged) at the cluster level. We conduct our tests on synthetic graphs generated using the planted partition model, where clusters and intra-cluster density are known in advance. We examine correlations between mean intra-cluster distances and intra-cluster densities. Our numerical experiments show that Jaccard and Otsuka-Ochiai offer very accurate measures of density, when averaged over vertex pairs within clusters.
Introduction

When clustering graphs, we seek to group nodes into clusters of nodes that are similar to each other. We posit that similarity is reflected in the number of shared connections. On the basis of this shared connectivity, we establish node-to-node distances. These distances are inversely related to similarity, to shared connectivity.

Although a formal definition of vertex clusters or node communities remains a topic of debate (e.g., [6]), virtually all authors agree a cluster (or community) is a subset of vertices that exhibit a high level of interconnection between themselves and a low level of connection to vertices in the rest of the graph [5, 22, 18, 19] (we quote these authors, but their definition is very common across the literature). Consequently, strongly interconnected sets of vertices also form dense induced subgraphs. In line with this virtually universal agreement, we compare the accuracy of various node-to-node distance measures in reflecting intra-cluster density. The choice of intra-cluster density as a benchmark is consistent with this widely accepted definition of clusters.

As mentioned previously, clusters are defined as subsets of vertices that are considered somehow similar. This similarity is captured by the number of shared connections and translated into distance. In our model, vertices sharing a greater number of connections are closer to each other than to vertices with which they share fewer connections. It is important to note here that, in our definition, distance measures similarity, not geodesic (shortest-path) distance. For example, two vertices that share an edge but no other connection have a geodesic distance of one, but they are arguably dissimilar.

At the cluster level, this distance takes the form of subsets of densely connected vertices. The link between clustering and density has been discussed in depth, recently [14, 15, 13, 16].
In this article, our ultimate goal is to transform a graph's adjacency matrix into a |V| × |V| similarity or distance matrix D = [d_ij], where the distance between each pair of vertices is given by the element d_ij. This transformation allows us to use the quadratic formulation of the clustering problem proposed by Fan and Pardalos [4, 3]. Such a formulation can then be further modified into a QUBO formulation [8], which can be implemented on purpose-built hardware, like Fujitsu's Digital Annealer [1, 20], for example. This purpose-built architecture allows us to circumvent the NP-hardness of the clustering problem [7, 21, 5, 12, 1].

To illustrate our definition of distance, we examine the graph shown in Figure 1. The graph in that figure is arguably composed of two clusters (triangles), a red cluster of three vertices and a cyan cluster of three vertices, joined by an inter-cluster edge.

[Figure 1: Graph with Two Clusters]

We observe that each cluster forms a dense induced subgraph (clique). We also note that the geodesic distance separating two vertices within the same triangle is equal to the geodesic distance separating the two endpoints of the inter-cluster edge. Nevertheless, in the context of clustering, we argue that a vertex is closer, more similar, to the members of its own triangle than to its neighbor across the inter-cluster edge. The ultimate goal of this study is to identify a distance measure that accurately measures this similarity in connectivity.

We compare three different distance measurements from the literature and examine how faithfully they reflect connectivity patterns. We argue that mean node-to-node distance within a cluster should offer an accurate reflection of intra-cluster density. Intra-cluster density is defined as

K_intra^(k) = |E_kk| / (0.5 × n_k × (n_k − 1)).
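As an illustration, the intra-cluster density above can be computed directly from a graph's induced subgraphs; a minimal sketch using NetworkX (function and variable names are ours, and the small graph mimics the two-triangle example of Figure 1):

```python
import networkx as nx

def intra_cluster_density(G, cluster):
    """K_intra for one cluster: fraction of possible edges present among its vertices."""
    n_k = len(cluster)
    if n_k < 2:
        return 0.0
    e_kk = G.subgraph(cluster).number_of_edges()  # |E_kk|
    return e_kk / (0.5 * n_k * (n_k - 1))

# Two triangles joined by a single inter-cluster edge, as in Figure 1.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(intra_cluster_density(G, [0, 1, 2]))  # a triangle is a clique: density 1.0
```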
In this definition, |E_kk| is the cardinality of the set of edges connecting two vertices within the same cluster 'k' and n_k = |V_k| is the number of vertices in that same cluster.

We then examine the relationship between mean Jaccard [10], Otsuka-Ochiai [17] and Burt's distances [2], on one hand, and intra-cluster density [14, 13, 15, 16] within each cluster, on the other. Because these distances are pairwise measures, we compare their mean value for a given cluster to the cluster's internal density.

Jaccard Distance

The Jaccard distance separating two vertices 'i' and 'j' is defined as

ζ_ij = 1 − |c_i ∩ c_j| / |c_i ∪ c_j| ∈ [0, 1].

Here, c_i (c_j) represents the set of all vertices with which vertex 'i' ('j') shares an edge.

At the cluster level, we compute the mean distance separating all pairs of vertices within the cluster, which we denote as J. For an arbitrary cluster 'k' with n_k vertices, we have

J_k = (1 / (0.5 × n_k × (n_k − 1))) × Σ_{i, j>i} ζ_ij.

Otsuka-Ochiai Distance

The Otsuka-Ochiai (OtOc) distance separating two vertices 'i' and 'j' is defined as

o_ij = 1 − |c_i ∩ c_j| / √(|c_i| × |c_j|) ∈ [0, 1].

Here again, to obtain a cluster-level measure of similarity, we take the mean over each pair of nodes within a cluster. We denote this mean as O. Again, for an arbitrary cluster 'k' with n_k vertices, we have

O_k = (1 / (0.5 × n_k × (n_k − 1))) × Σ_{i, j>i} o_ij.

Burt's Distance

Burt's distance between two vertices 'i' and 'j' is computed as

b_ij = √( Σ_{k ≠ i,j} (A_ik − A_jk)² ).

At the cluster level, we denote the mean Burt distance as B. For an arbitrary cluster 'k' with n_k vertices, it is computed as

B_k = (1 / (0.5 × n_k × (n_k − 1))) × Σ_{i, j>i} b_ij.

Table 1: Intra- and Inter-Cluster Edge Probabilities and Cluster Sizes

Graph  Intra Pr  Inter Pr  n_k  |V|
G1     1         0         45   2,250
G2     0.9       0.1       37   1,850
G3     0.9       0.15      42   2,100
G4     0.9       0.2       50   2,500
G5     0.8       0.1       53   2,650
G6     0.8       0.15      38   1,900
G7     0.8       0.2       44   2,200
G8     0.7       0.1       39   1,950
G9     0.7       0.15      46   2,300
G10    0.7       0.2       53   2,650
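For illustration, the three pairwise distances can be written directly from their definitions; a sketch assuming a simple NetworkX graph with no self-loops (all function names are ours):

```python
import math
import networkx as nx

def jaccard_dist(G, i, j):
    ci, cj = set(G[i]), set(G[j])
    return 1.0 - len(ci & cj) / len(ci | cj)

def otoc_dist(G, i, j):
    ci, cj = set(G[i]), set(G[j])
    return 1.0 - len(ci & cj) / math.sqrt(len(ci) * len(cj))

def burt_dist(G, i, j):
    # For a 0/1 adjacency matrix, (A_ik - A_jk)^2 = 1 exactly when k is a
    # neighbour of one vertex but not the other; the sum excludes k = i, j.
    ci, cj = set(G[i]), set(G[j])
    return math.sqrt(len((ci ^ cj) - {i, j}))

def cluster_mean(G, cluster, dist):
    """Mean pairwise distance over all vertex pairs within one cluster."""
    nodes = sorted(cluster)
    pairs = [(u, v) for a, u in enumerate(nodes) for v in nodes[a + 1:]]
    return sum(dist(G, u, v) for u, v in pairs) / len(pairs)
```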
To compare the distance measures, we generate synthetic graphs with known cluster membership, using the planted partition model. Then, for each of our test graphs, we compute our three vertex-to-vertex distances. We then compute mean distances between nodes in each cluster and intra-cluster density.

To assess the accuracy of each measure as a reflection of intra-cluster density, we examine the (Pearson) correlation between each distance measure and intra-cluster density. We examine correlations for each graph and for the set of all graphs. We also record correlations between each mean distance measure and the probability of inter-cluster connection used to generate these graphs.
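This procedure can be sketched end to end on a small stand-in graph (the parameters below are illustrative, not those of Table 1; all names are ours):

```python
import numpy as np
import networkx as nx

def mean_jaccard(G, cluster):
    """Mean pairwise Jaccard distance within one cluster."""
    nodes = sorted(cluster)
    dists = [1 - len(set(G[u]) & set(G[v])) / len(set(G[u]) | set(G[v]))
             for a, u in enumerate(nodes) for v in nodes[a + 1:]]
    return float(np.mean(dists))

def intra_density(G, cluster):
    n_k = len(cluster)
    return G.subgraph(cluster).number_of_edges() / (0.5 * n_k * (n_k - 1))

# A small planted partition graph: 10 clusters of 30 vertices each.
G = nx.planted_partition_graph(l=10, k=30, p_in=0.8, p_out=0.1, seed=1)
clusters = G.graph["partition"]          # ground-truth clusters

J = [mean_jaccard(G, c) for c in clusters]
D = [intra_density(G, c) for c in clusters]

# Pearson correlation between mean intra-cluster distance and density.
rho = np.corrcoef(J, D)[0, 1]
```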
We use the planted partition model to generate the 10 graphs described in Table 1. All graphs consist of 50 clusters. We vary cluster sizes across graphs, but these sizes are kept constant within each graph. We also vary the edge probability within clusters and between vertices in different clusters, but keep them constant across clusters and cluster pairs, as per the planted partition model. These graphs were generated using the Python NetworkX library [9]. Intra- and inter-cluster edge probabilities as well as cluster sizes are shown in Table 1.

Table 2: Mean Distances in Disconnected Cliques

Graph  Intra Pr  Inter Pr  J     B  O
G1     1         0         0.04  0  0.02
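The test graphs can be reproduced with NetworkX's planted partition generator; a sketch for G1, whose parameters come from Table 1 (the seed is our own choice):

```python
import networkx as nx

# G1 from Table 1: 50 clusters of 45 vertices, intra-cluster edge
# probability 1, inter-cluster edge probability 0: disconnected cliques.
G1 = nx.planted_partition_graph(l=50, k=45, p_in=1.0, p_out=0.0, seed=42)

print(G1.number_of_nodes())                # 2250 vertices, as in Table 1
print(nx.number_connected_components(G1))  # 50 disconnected cliques
```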
As expected, in the case of graph G1, where the intra-cluster edge probability is one and the inter-cluster edge probability is zero (the case of a graph composed of disconnected cliques), all three distances are constant across all clusters. In this case, all vertices within each cluster have exactly the same connectivity pattern (same neighbors). They are all separated by the same distances. Therefore, correlation to intra-cluster density is meaningless. These distances are recorded in Table 2.

While the case of graph G1 is predictable, it is important to note that Jaccard and OtOc distances are not zero in the case of a complete (sub)graph with no self-loops. This difference is due to the numerators of these quantities. In the case of complete graphs with no self-loops, a node 'v_i' is connected to node 'v_j' but not to itself. As a result, we have the following inequality of cardinalities: |c_i ∩ c_j| < |c_i ∪ c_j|, for any pair of vertices in a complete (sub)graph.

For our other graphs (G2-G10), the exact numbers of intra- and inter-cluster edges are probabilistic. As a result, all distances and densities are random variables. This randomness allows a comparison of (Pearson) correlations between the distances and intra-(inter-)cluster densities. However, before performing these comparisons, we examine the relationship between distances and intra-(inter-)cluster densities graphically.

Upon examining Figure 2, we immediately note that the distances J and O have a negative linear relationship with intra-cluster density. This strongly linear relationship justifies the use of Pearson correlation coefficients.
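The nonzero values in Table 2 follow directly from this inequality: in a complete graph K_n with no self-loops, any two vertices share n − 2 neighbors while |c_i ∪ c_j| = n, so the Jaccard distance is 2/n and the OtOc distance is 1/(n − 1), both strictly positive. A quick check (variable names ours):

```python
import networkx as nx

n = 45                       # cluster size of G1 (Table 1)
K = nx.complete_graph(n)     # no self-loops
ci, cj = set(K[0]), set(K[1])

jaccard = 1 - len(ci & cj) / len(ci | cj)             # 2 / n
otoc = 1 - len(ci & cj) / (len(ci) * len(cj)) ** 0.5  # 1 / (n - 1)

print(round(jaccard, 3), round(otoc, 3))  # → 0.044 0.023
```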
Meanwhile, Burt's distance (B) seems only loosely related to intra-cluster density, at best.

It is also interesting to note that mean Jaccard and OtOc distances across the graph decrease with increases in inter-cluster edge probability. In contrast, Burt's distances increase with inter-cluster edge probability. These trends can be observed for distances within clusters in Figure 2. In Table 3, we also observe the same phenomenon for all distances across the graph, regardless of cluster membership.

Intuitively, Burt's distance increases with the probability of inter-cluster connection: as this probability increases, nodes share a smaller proportion of their connections.

[Figure 2: Mean Intra-Cluster Distances as a Function of Inter- and Intra-Cluster Densities. (a) Mean Intra-Cluster Jaccard Distances (J); (b) Mean Intra-Cluster OtOc Distances (O); (c) Mean Intra-Cluster Burt Distances (B).]

[Table 3: Mean Distance, Cluster Membership Not Considered (columns: Inter-Clust Pr, J, B, O)]

[Table 4: Correlation (ρ) of Mean Distances to Intra-Cluster Density (columns: J, B, O); surviving row: (Inter Pr ≠ 0) −0.563, −0.116, −0.565]

Meanwhile, the trends observed in J and O are direct consequences of their mathematical definitions:

ζ_ij = 1 − |c_i ∩ c_j| / |c_i ∪ c_j|,   o_ij = 1 − |c_i ∩ c_j| / √(|c_i| × |c_j|).

In both cases, the numerator is the number of shared connections. The denominator is proportional to all connections of either vertex 'i' or 'j' and increases at a much higher rate. For example, with all else equal, as the probability of inter-cluster connection increases from 0, the total number of connections (degree) of both vertices 'i' and 'j' increases sharply, at a mean rate of the order of 2 × P_inter × (|V| − n_k). However, the numerators, which correspond to the number of shared connections, increase at a much lower mean rate: a vertex outside the cluster is a shared neighbor of both 'i' and 'j' with probability P_inter², so they increase at a mean rate of P_inter² × (|V| − n_k).

Nevertheless, we note a clear linear inverse relation between intra-cluster density and both J and O. This relationship is also observed in the correlations, shown in Table 4.

Our Chosen Distance
Both Jaccard and OtOc distances are very accurate reflections of intra-cluster density. When averaged over all vertices within clusters, they exhibit an almost perfect inverse correlation to intra-cluster density.

However, the Jaccard similarity and its complement, the Jaccard distance, are widely used in a variety of different fields. Because of this widespread use and the availability of pre-built computational functions, we recommend the Jaccard distance as a vertex-to-vertex distance measure. For example, we use the NetworkX Jaccard coefficient function in our own work [9].
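For instance, the NetworkX function nx.jaccard_coefficient returns the Jaccard *similarity* for vertex pairs; the distance used in this article is its complement. A sketch on the two-triangle graph of Figure 1 (vertex labels are ours):

```python
import networkx as nx

# Two triangles (0-1-2 and 3-4-5) joined by the bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# The intra-cluster pair (0, 1) comes out closer than the
# inter-cluster pair (0, 3), as argued in the text.
for u, v, sim in nx.jaccard_coefficient(G, [(0, 1), (0, 3)]):
    print(u, v, 1.0 - sim)
```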
A metric space is a set of points with a distance function g. This function must have the following three properties:

g(x, y) = 0 ⇔ x = y,  (1)
g(x, y) = g(y, x),  (2)
g(x, z) ≤ g(x, y) + g(y, z).  (3)

In the case of the Jaccard distance, the first two properties are immediately apparent. They are direct consequences of the definitions of set operations. The third property, the triangle inequality, was shown to hold by Levandowsky and Winter [11].

Conclusion

We have shown that Jaccard and Otsuka-Ochiai distances, when averaged over clusters, provide very accurate estimates of (inverse) intra-cluster density. The Pearson correlation coefficients between these distances and intra-cluster density are almost perfectly inverse (≈ −1).
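The first two properties follow from the set operations; the triangle inequality can also be spot-checked numerically. A brute-force sketch on a small planted partition graph (names and parameters ours):

```python
import itertools
import networkx as nx

def jaccard_dist(G, u, v):
    cu, cv = set(G[u]), set(G[v])
    return 1.0 - len(cu & cv) / len(cu | cv) if cu | cv else 0.0

# A small test graph: 4 clusters of 8 vertices.
G = nx.planted_partition_graph(l=4, k=8, p_in=0.9, p_out=0.2, seed=3)

# Symmetry and identity are immediate from the set operations; here we
# verify the triangle inequality (Levandowsky and Winter) exhaustively
# over all ordered vertex triples.
ok = all(
    jaccard_dist(G, x, z) <= jaccard_dist(G, x, y) + jaccard_dist(G, y, z) + 1e-12
    for x, y, z in itertools.permutations(G.nodes, 3)
)
print(ok)  # True
```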
References

[1] M. Aramon, G. Rosenberg, E. Valiante, T. Miyazawa, H. Tamura, and H.G. Katzgraber. Physics-Inspired Optimization for Quadratic Unconstrained Problems Using a Digital Annealer. Frontiers in Physics, 7, Apr 2019.
[2] R.S. Burt. Positions in Networks. Social Forces, 55(1):93–122, 09 1976.
[3] N. Fan and P.M. Pardalos. Linear and Quadratic Programming Approaches for the General Graph Partitioning Problem. J. of Global Optimization, 48(1):57–71, September 2010.
[4] N. Fan and P.M. Pardalos. Robust Optimization of Graph Partitioning and Critical Node Detection in Analyzing Networks. In Proceedings of the 4th International Conference on Combinatorial Optimization and Applications - Volume Part I, COCOA'10, pages 170–183, Berlin, Heidelberg, 2010. Springer-Verlag.
[5] S. Fortunato. Community detection in graphs. Physics Reports, 486:75–174, February 2010.
[6] S. Fortunato and D. Hric. Community detection in networks: A user guide. ArXiv e-prints, November 2016.
[7] Y. Fu and P.W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimisation. Journal of Physics A: Mathematical and General, 19(9):1605–1620, June 1986.
[8] F. Glover, G. Kochenberger, and Y. Du. A Tutorial on Formulating and Using QUBO Models. arXiv e-prints, page arXiv:1811.11538, June 2018.
[9] A.A. Hagberg, D.A. Schult, and P.J. Swart. Exploring Network Structure, Dynamics, and Function using NetworkX. In G. Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA, 2008.
[10] P. Jaccard. Étude de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 01 1901.
[11] M. Levandowsky and D. Winter. Distance between Sets. Nature, 234, November 1971.
[12] A. Lucas. Ising formulations of many NP problems. Frontiers in Physics, 2:5, Feb 2014.
[13] P. Miasnikof, L. Pitsoulis, A.J. Bonner, Y. Lawryshyn, and P.M. Pardalos. Graph clustering via intra-cluster density maximization. In I. Bychkov, V.A. Kalyagin, P.M. Pardalos, and O. Prokopyev, editors, Network Algorithms, Data Mining, and Applications, pages 37–48. Springer International Publishing, 2020.
[14] P. Miasnikof, A.Y. Shestopaloff, A.J. Bonner, and Y. Lawryshyn. A Statistical Performance Analysis of Graph Clustering Algorithms, chapter 11. Lecture Notes in Computer Science. Springer Nature, 6 2018.
[15] P. Miasnikof, A.Y. Shestopaloff, A.J. Bonner, Y. Lawryshyn, and P.M. Pardalos. A Statistical Density-Based Analysis of Graph Clustering Algorithm Performance. CoRR, abs/1906.02366v3, 2019.
[16] P. Miasnikof, A.Y. Shestopaloff, A.J. Bonner, Y. Lawryshyn, and P.M. Pardalos. A Statistical Density-Based Analysis of Graph Clustering Algorithm Performance. Journal of Complex Systems (accepted), 2020.
[17] A. Ochiai. Zoogeographical studies on the soleoid fishes found in Japan and its neighbouring regions-I. NIPPON SUISAN GAKKAISHI, 22(9):522–525, 1957.
[18] L. Ostroumova Prokhorenkova, P. Prałat, and A. Raigorodskii. Modularity of Complex Networks Models. In A. Bonato, F.C. Graham, and P. Prałat, editors, Algorithms and Models for the Web Graph, pages 115–126, Cham, 2016. Springer International Publishing.
[19] L. Ostroumova Prokhorenkova, P. Prałat, and A. Raigorodskii. Modularity in several random graph models. Electronic Notes in Discrete Mathematics, 61:947–953, 2017. The European Conference on Combinatorics, Graph Theory and Applications (EUROCOMB'17).
[20] M. Sao, H. Watanabe, Y. Musha, and A. Utsunomiya. Application of Digital Annealer for Faster Combinatorial Optimization. Fujitsu Scientific and Technical Journal, 55(2):45–51, 2019.
[21] S.E. Schaeffer. Survey: Graph clustering. Comput. Sci. Rev., 1(1):27–64, August 2007.
[22] J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth.