Optimizing an Organized Modularity Measure for Topographic Graph Clustering: a Deterministic Annealing Approach
Fabrice Rossi^a, Nathalie Villa-Vialaneix^{b,c}

a BILab, Télécom ParisTech, LTCI - UMR CNRS 5141, France
b Institut de Mathématiques de Toulouse, Université de Toulouse, France
c IUT STID (Carcassonne), Université de Perpignan Via Domitia, France
Abstract
This paper proposes an organized generalization of Newman and Girvan's modularity measure for graph clustering. Optimized via a deterministic annealing scheme, this measure produces topologically ordered graph clusterings that lead to faithful and readable graph representations based on clustering induced graphs. Topographic graph clustering provides an alternative to more classical solutions in which a standard graph clustering method is applied to build a simpler graph that is then represented with a graph layout algorithm. A comparative study on four real world graphs ranging from 34 to 1 133 vertices shows the interest of the proposed approach with respect to classical solutions and to self-organizing maps for graphs.
Key words:
Graph; Modularity; Self-organizing maps; Social network; Deterministic annealing; Clustering
1. Introduction
Large and complex graphs are natural ways of describing real world systems that involve interactions between objects: persons and/or organizations in social networks, articles in citation networks, web sites on the world wide web, proteins in regulatory networks, etc. [30]. However, the complexity of real world graphs limits the possibilities of exploratory analysis: while many graph drawing and graph visualization methods have been proposed [9, 21], displaying a graph with a few hundred vertices in a meaningful way remains difficult.

A way to tackle this scalability issue is to simplify a graph prior to its drawing. This can be done by finding clusters of vertices via a graph clustering method [42]: rather than representing the original graph, the visualization is restricted to the clusters themselves. More precisely, the graph induced by the clustering is used: each cluster forms a vertex, while edges between clusters are induced by edges between the vertices they contain. As illustrated in the survey paper [21], numerous implementations of this simple idea can be done, namely all the
Email addresses:
[email protected] (Fabrice Rossi), [email protected] (Nathalie Villa-Vialaneix)
Preprint submitted to Neurocomputing, October 26, 2018.

pairwise combinations between graph clustering methods and graph visualization methods (see also, e.g., [11] for recent work on planar clustered graphs and [5] for weighted graphs).

Rather than relying on a generic graph clustering algorithm, we propose in this paper to build a clustering of a graph that is adapted to a visualization of the graph induced by the clustering. We follow the general principle of topographic mapping initiated by Prof. Kohonen with the Self Organizing Map (SOM, [26]): a SOM builds a clustering of a dataset in homogeneous and separated clusters that are in addition arranged on a geometrical structure chosen a priori; points assigned to close clusters in the prior structure are close in the original space.

Our method is built on an organized generalization of the popular modularity measure [34] for graph clustering. Following the SOM rationale, the organized modularity uses a prior structure to build topologically aware clusters: nodes assigned to close clusters in the prior structure are more likely to be connected in the original graph than nodes assigned to distant clusters. As for the SOM [45, 46], using a two dimensional regular grid as a prior structure enables straightforward visualization of the graph induced by the clustering.

While the modularity has interesting properties over other graph cut measures [29, 31, 32, 33], especially in the visualization context [36], it has two drawbacks shared with its organized generalization. Firstly, maximizing the (organized) modularity is a combinatorial problem and there is no simple prototype based alternating optimization scheme as available for the SOM and its non Euclidean variants [4, 20, 47, 48]. Following [29] for the modularity, we rely on a deterministic annealing approach [41] to solve this problem.
Secondly, maximizing the (organized) modularity measure tends to miss small clusters, i.e., to aggregate them in large clusters [13]. To limit this effect, we also propose to use the intermediate states of the deterministic annealing algorithm to build finer clusterings than the one with maximal organized modularity. Combined with the prior structure, those clusterings give refined views of the graph induced by the clustering. In summary, our contribution is twofold: we introduce an organized modularity measure and we use deterministic annealing to build faithful and simplified visualizations of a graph, that are based on a clustering of the vertices.

The rest of this paper is organized as follows: Section 2 details the clustering approach to graph visualization, recalls the modularity definition and introduces its topographic generalization (the organized modularity). Section 3 presents the deterministic annealing scheme used to optimize the organized modularity. Section 4 describes the proposed graph visualization methodology and shows in particular how to leverage the intermediate results of the optimization algorithm to produce fuzzy layouts that limit the resolution effect induced by modularity maximization. Section 5 analyses in detail the behavior of the proposed method on a simple graph, Zachary's Karate club social network [54]. Finally, Section 6 demonstrates the practical interest of the method on three larger real world graphs. Technical derivations are gathered in an appendix.
2. Topographic graph clustering
Cognitive scalability can be brought to graph visualization methods [21] via graph clustering [42]: the rationale is to draw the smaller and hopefully simpler graph induced by the clustering instead of the original graph. Let us define this idea more precisely.

We consider given a non oriented graph $G$ with $N$ vertices (or nodes), $V(G) = \{1, \ldots, N\}$, and $A$ weighted edges ($A(G)$ is the set of edges). The (symmetric) weight matrix is denoted by $W$, and $W_{ij}$ is the non negative weight between vertex $i$ and vertex $j$ ($W_{ij} = 0$ when $(i,j) \notin A(G)$). A clustering of $V(G)$ into $C$ clusters $(C_k)_{1 \le k \le C}$ induces a new non oriented graph $G_C$ defined as follows: each cluster is associated to a vertex in $G_C$, i.e., $V(G_C) = \{1, \ldots, C\}$, and a pair $(i,j)$ belongs to the edge set $A(G_C)$ if and only if there is $(k,l) \in A(G)$ such that $k \in C_i$ and $l \in C_j$. $G_C$ is weighted and the corresponding weight matrix is denoted $W^C$. The weights are defined by $W^C_{ij} = \sum_{k \in C_i, l \in C_j} W_{kl}$ (for $i \neq j$).

The present paper focuses on the following graph visualization strategy: a clustering $\mathcal{C}$ of the original graph $G$ is constructed and a graph visualization method is applied to $G_C$. In general, the visual representation is completely unaware of the way $G_C$ has been produced: this is a simpler setting than the one of, e.g., [5, 11], in which the proposed algorithms aim at representing a so-called clustered graph. In this latter context, the ultimate goal is to visualize the complete original graph in a way that respects the (hierarchical) clustering structure. The clustering is used in this case both as a way to circumvent algorithmic difficulties (e.g., the cost of some force directed methods) and as a structural organization principle.
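To make the construction of the clustering induced graph concrete, the following is a minimal NumPy sketch (the function name and the dense-matrix representation are illustrative choices, not the paper's implementation):

```python
import numpy as np

def induced_graph(W, labels, C):
    """Collapse a weighted adjacency matrix according to a clustering.

    labels[i] is the cluster index of vertex i; the induced weight between
    clusters k and l is the total weight of the edges joining them.
    Hypothetical helper for illustration only.
    """
    N = W.shape[0]
    M = np.zeros((N, C))
    M[np.arange(N), labels] = 1.0   # assignment matrix: M[i, k] = 1 iff i is in C_k
    return M.T @ W @ M              # (W_C)_{kl} = sum_{i in C_k, j in C_l} W_ij
```

With a sparse weight matrix the same triple product can be computed without ever forming a dense $N \times N$ array.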
In the present paper, the clustering method makes "a meaningful coarse graining of the graph" (to quote [34]) and produces a new graph which will be the main target of the visualization method.

The solution chosen in this paper is also related to, but quite different from, the clustered layout strategy [21]. In this framework, one aims at producing a visualization of a graph in which vertices that belong to a given cluster are close, while distinct clusters remain separated. Clusters are generally a by-product of those algorithms rather than given beforehand (see, e.g., [35] for a recent example of such a method).

In practice, we target visualization methods in which each vertex of the new graph $G_C$ is represented by a glyph (e.g., a disk) while each edge is drawn as a straight segment between the corresponding vertices' glyphs (we do not consider more sophisticated drawings, for instance with bends). Rendering hints can be used to convey information about the original graph to the user: the surface occupied by a glyph can be proportional to the size of the corresponding cluster, while the thickness of a segment can encode the weight associated to the corresponding edge.

In order to be useful, a graph visualization produced via the chosen clustering framework must be both readable and faithful. The readability issue is common to all graph visualization algorithms and has been a matter of research and debate in the graph visualization community, with a particular focus on aesthetic criteria such as bend minimization, symmetry, etc. [9, 21, 38]. Experimental studies show that a minimal number of edge crossings is one of the most important quality criteria for readability (see e.g. [50]). Please note however that there is no consensus on what other readability criteria should be used to evaluate the quality of a graph representation beyond direct task based user studies.
In particular, neighborhood rank preservation measures used to assess the quality of nonlinear projection methods [28] are likely to be irrelevant in the graph context: as shown in [12], displaying the edges of a graph has a major influence on the way proximities between nodes are inferred by users from the visualization.

The faithfulness issue is more specific to the cluster based visualization: as the final representation of the graph hides most of its structure, inference on the drawing can be misleading. More precisely, there are two potential problems induced respectively by the vertices and the edges of the $G_C$ graph. The first difficulty is that the internal connectivity structure of clusters is hidden by the glyph based representation: while color and/or shape hints could be used to give an idea of the density of connections between the vertices that have been put in a given cluster, the actual connection pattern is lost. A similar problem happens for connection patterns between vertices of distinct clusters: as the associated edges are summarized by a single edge between the induced vertices, there is no way to infer how individual vertices are connected in the original graph.

Therefore, building a faithful clustering of a graph amounts to balancing two criteria that are somewhat contradictory: on the one hand, the density of each cluster should be maximized, but on the other hand the connectivity pattern between two vertices of distinct clusters should be independent of the vertices. Intuitively, a dense cluster with a low outside connectivity is likely to have a rather complex outside connectivity pattern: if all vertices in a cluster were connected to, e.g., a single outside vertex, then it might make sense to assign this vertex to the cluster.
In addition, enforcing a simple between-cluster connectivity has no reason to reduce edge crossings, while within-cluster density maximization should on the contrary reduce the number of between-cluster connections and therefore improve the readability of the graph.

It seems therefore natural to focus on within-cluster density maximization, as this solution should lead to a readable graph with a controlled impact on the faithfulness of the representation. Among the clustering quality criteria that favor within-cluster density (see [42]), we focus on the modularity introduced by Newman and Girvan in [34]. Let us denote $k_i = \sum_j W_{ij}$ the degree of vertex $i$ and $2m = \sum_{i,j} W_{ij}$ twice the total weight of the graph. Then the modularity of the clustering $(C_k)_{1 \le k \le C}$ of $G$ is

$$Q((C_k)_{1 \le k \le C}) = \frac{1}{2m} \sum_{k=1}^{C} \sum_{i,j \in C_k} (W_{ij} - P_{ij}), \qquad (1)$$

where $P$ is an $N \times N$ symmetric matrix given by $P_{ij} = \frac{k_i k_j}{2m}$.

The rationale of the measure is to compare the weight of a link between two vertices in a cluster, $W_{ij}$, to a simple random model, $P_{ij}$, in which the weights are proportional to the degrees of the vertices and independent of the clusters. A good clustering tends to cluster vertices that are more connected than one expects based solely on the degrees of the vertices: this corresponds to $W_{ij} > P_{ij}$ and to higher values of $Q$. Maximizing the modularity tends therefore to produce dense clusters, but only when they are meaningful as measured by a deviation from the simple random model.

The modularity is a rather successful quality measure for graph clustering, especially compared to other graph cut like measures [29, 31, 32, 33]. In addition, [36] has recently shown that there is a strong link between clusterings induced by modularity maximization and those obtained with some versions of the clustered layout strategy described previously. The modularity seems therefore to be a good candidate for the topographic extension studied in this paper.
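Equation (1) translates directly into a few lines of NumPy (a sketch, not the paper's R implementation; the function name is ours):

```python
import numpy as np

def modularity(W, labels):
    """Newman-Girvan modularity of a clustering, in the style of equation (1).

    W: symmetric non-negative weight matrix; labels: cluster index per vertex.
    """
    k = W.sum(axis=1)            # vertex degrees k_i
    two_m = W.sum()              # 2m, twice the total weight
    P = np.outer(k, k) / two_m   # null model P_ij = k_i k_j / (2m)
    same = labels[:, None] == labels[None, :]   # pairs in the same cluster
    return ((W - P) * same).sum() / two_m
```

On a graph made of two disconnected edges, the clustering that puts each edge in its own cluster reaches the well known value $Q = 0.5$, while mixing the two components gives a negative modularity.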
However, this measure suffers from a resolution problem [13]: maximizing the modularity may fail to identify small but meaningful clusters of nodes. We will tackle this problem by leveraging both the chosen optimization algorithm and the prior structure introduced below (see Section 4.2 for details). It should be noted in addition that the main concepts and techniques used in this paper are quite independent from the actual quality measure. In particular, most graph cut based measures [42] could be handled in a similar way as we proceed with modularity.

The main limitation of graph clustering for visualization is that the two phases of the methodology are generally completely independent: the clustering method is not explicitly designed to help the subsequent visualization method. We propose in this paper to address this limitation via a specific quality measure for the clustering phase.

Following the SOM rationale, we derive an organized version of the modularity. We assume given a prior structure (in $\mathbb{R}^2$) which is represented by a symmetric $C$ by $C$ matrix $S$ of prior similarities between clusters such that $S_{kk} = 1$ for all $k = 1, \ldots, C$. For instance $S_{kl} = \exp(-\sigma \|x_k - x_l\|)$, where $x_k$ is the prior position of cluster $C_k$ in $\mathbb{R}^2$. Then the organized modularity of the clustering $(C_k)_{1 \le k \le C}$ of $G$ is

$$O((C_k)_{1 \le k \le C}) = \frac{1}{2m} \sum_{i,j} S_{c(i)c(j)} (W_{ij} - P_{ij}), \qquad (2)$$

where $c(i)$ is the index of the cluster to which the vertex $i$ is assigned.

In the standard modularity, the term $W_{ij} - P_{ij}$ is taken into account only when $i$ and $j$ belong to the same cluster. In the organized version, this term is always taken into account, but with a weight $S_{c(i)c(j)}$ equal to the prior similarity between $C_{c(i)}$ and $C_{c(j)}$. This favors proximity in the prior structure for connected clusters. If there are indeed significant connections between vertices in two clusters $C_k$ and $C_l$ (i.e., if $W_{ij} - P_{ij} > 0$ for those pairs), $O$ will be higher if $S_{kl}$ is high than if it is low.
This is similar to the SOM principle, in which a prototype has to be close to observations assigned to its unit but also, to a lesser extent, to observations assigned to neighboring units in the prior structure.

Maximizing $O((C_k)_{1 \le k \le C})$ gives a graph clustering that can be used to coarsen the original graph prior to a standard visualization algorithm, as explained in Section 2.1. In addition, the prior structure gives a natural layout for the clustered graph: nodes of this new graph can be drawn at their corresponding positions on the grid.
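Equation (2) differs from equation (1) only in that the pairwise term is weighted by the prior similarity of the two clusters. A minimal NumPy sketch (hypothetical helper, mirroring the `modularity` convention above):

```python
import numpy as np

def organized_modularity(W, labels, S):
    """Organized modularity (equation (2)): every pair (i, j) contributes
    (W_ij - P_ij), weighted by the prior similarity of their clusters."""
    k = W.sum(axis=1)
    two_m = W.sum()
    P = np.outer(k, k) / two_m
    # S[labels][:, labels] has entry (i, j) equal to S_{c(i) c(j)}
    return (S[labels][:, labels] * (W - P)).sum() / two_m
```

Taking $S$ equal to the identity matrix recovers the standard modularity, while a constant all-ones $S$ makes the measure vanish, since $\sum_{i,j}(W_{ij} - P_{ij}) = 0$ by construction of the null model.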
3. Organized modularity maximization
The main difficulty with (organized) modularity maximization is that it is a discrete optimization problem for which there is no simple alternate minimization scheme (contrarily to the SOM or the k-means, for instance). Deriving fast modularity maximization algorithms that produce acceptable solutions has been an important research topic since the introduction of this quality measure (this is generally the case for all graph clustering measures). The best compromise between computational load and quality seems to be currently achieved by some type of heuristic algorithms that coarsen the graph in an iterative and multi-level way [3, 37].

In the present paper, we use a quite different approach which is more adapted to the organized modularity defined in the previous Section. Following [29], we propose to maximize the (organized) modularity via a deterministic annealing (DA) approach [41]. As pointed out in [29], the main advantage of DA over simulated annealing (as used by, e.g., [40] in the graph clustering context) is the speed of the former: deterministic annealing can use aggressive annealing schedules with a relatively small number of iterations compared to simulated annealing. In addition, intermediate results of DA can be used to limit the resolution effect induced by modularity maximization, as will be shown in Section 4.2.

To ease the derivation of the DA algorithm for modularity maximization, we use an assignment matrix notation for clusterings. An assignment matrix $M$ for a clustering of $\{1, \ldots, N\}$ into $C$ clusters is an $N \times C$ matrix with entries in $\{0, 1\}$ such that $\sum_{k=1}^{C} M_{ik} = 1$ for all $i$. In other words, $M_{ik}$ is equal to 1 if and only if $i$ is assigned to cluster $C_k$. We denote $\mathcal{M}$ the set of all valid assignment matrices for a clustering of $\{1, \ldots, N\}$ into $C$ clusters ($N$ and $C$ will be given by the context).
Please note that we do not constrain an assignment matrix to have non empty clusters.

The modularity and the organized modularity of an assignment matrix $M$ are given respectively by:

$$Q(M) = \frac{1}{2m} \sum_{i,j} \sum_{k} M_{ik} M_{jk} (W_{ij} - P_{ij}) \qquad (3)$$

$$O(M) = \frac{1}{2m} \sum_{i,j} \sum_{k,l} M_{ik} S_{kl} M_{jl} (W_{ij} - P_{ij}) \qquad (4)$$

If $B$ is the $N \times N$ symmetric matrix defined by

$$B_{ij} = \begin{cases} 0 & \text{if } i = j \\ \frac{1}{2m} (W_{ij} - P_{ij}) & \text{if } i \neq j, \end{cases} \qquad (5)$$

then maximizing $O(M)$ is equivalent to maximizing

$$F(M) = \sum_{i,j} \sum_{k,l} M_{ik} S_{kl} M_{jl} B_{ij}, \qquad (6)$$

as shown in Section A.1. Finally, it should be noted that by using the identity matrix for $S$, one recovers the modularity: the algorithm derived for the maximization of $F(M)$ will therefore apply to both the standard modularity and the organized version.

Deterministic annealing tries to solve the complex combinatorial problem of maximizing a function $F$ defined on a finite (but large) space $\mathcal{M}$ via an analysis of the Gibbs distribution obtained as the asymptotic regime of a classical simulated annealing [7, 24], or via the principle of maximum entropy [41]. In our case, the Gibbs distribution at inverse temperature $\beta$ is

$$P(M) = \frac{1}{Z_P} \exp(\beta F(M)), \qquad (7)$$

where the normalization constant $Z_P$ is given by $Z_P = \sum_{M \in \mathcal{M}} \exp(\beta F(M))$ (the sum ranges over the full space $\mathcal{M}$).

The main idea of DA is to compute the expectations of the assignment matrix at a fixed temperature with respect to the Gibbs distribution, i.e., the $E(M_{ik})$, and then to decrease the temperature while tracking the evolution of the expectations (in simulated annealing, the Gibbs distribution is sampled rather than studied via expectation).

Unfortunately, $F$ is not linear with respect to $M$ and computing $Z_P$ and $P$ is therefore difficult: one should resort to an exhaustive exploration of $\mathcal{M}$, which is computationally infeasible. Following previous work on similar topics, we approximate $P$ by a distribution that factorizes (see e.g., [18, 22]).
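Equations (5) and (6) can be sketched as follows (function names are ours; note that $F$ differs from $O$ only by the constant diagonal term $\frac{1}{2m}\sum_i (W_{ii} - P_{ii})$, which is why the two maximizations are equivalent):

```python
import numpy as np

def B_matrix(W):
    """Matrix B of equation (5): (W_ij - P_ij) / (2m) off the diagonal, 0 on it."""
    k = W.sum(axis=1)
    two_m = W.sum()
    B = (W - np.outer(k, k) / two_m) / two_m
    np.fill_diagonal(B, 0.0)
    return B

def F(M, S, B):
    """Objective of equation (6): F(M) = sum_{i,j,k,l} M_ik S_kl M_jl B_ij."""
    return (S * (M.T @ B @ M)).sum()
```

On the two-disconnected-edges toy graph, $Q = 0.5$ while $F = 0.75$ for the same clustering; the difference, $0.25$, is exactly the (clustering independent) diagonal term dropped in $B$.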
This corresponds to approximating the interaction between, say, $M_{ik}$ and all the other variables via a mean field $E_{ik}$. Then we compute the expectation of the assignment matrix with respect to the approximating distribution.

More precisely, we consider the bi-linear cost function $U(M, E) = \sum_i \sum_k M_{ik} E_{ik}$, where $E$ is the mean field, an $N$ by $C$ matrix of partial assignment costs. For a fixed $\beta$, we look for a mean field $E$ that gives a distribution $R(M, E) = \frac{1}{Z_R(E)} \exp(\beta U(M, E))$ close to $P(M)$, in the sense that the Kullback-Leibler divergence $KL(R|P)$ between $R$ and $P$ is minimal, with $KL(R|P) = \sum_M R(M, E) \ln \frac{R(M, E)}{P(M)}$ (this corresponds to a variational approximation of $P$ [23]).

At a minimum, the gradient of $KL(R|P)$ with respect to $E$ is zero. This leads to the following classical mean field equations (see Appendix A.2 for details):

$$\frac{\partial E_R(F(M))}{\partial E_{jl}} = \sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} E_{jk}, \quad \forall j, l. \qquad (8)$$

They are obtained using the main consequence of the mean field approximation, namely the independence between $M_{ik}$ and $M_{jl}$ for $i \neq j$ under the distribution $R$, i.e., the fact that $E_R(M_{ik} M_{jl}) = E_R(M_{ik}) E_R(M_{jl})$ for $i \neq j$.

To solve the mean field equations, we use an EM-like approach. We consider the $E_R(M_{ik})$ fixed and solve the equations for $E_{jl}$ (maximization phase). Then we compute the new values of the $E_R(M_{ik})$ (expectation phase). This latter phase leads to the very simple standard deterministic annealing update rule:

$$E_R(M_{ik}) = \frac{\exp(\beta E_{ik})}{\sum_l \exp(\beta E_{il})}. \qquad (9)$$

Moreover, the independence property recalled above gives

$$E_R(F(M)) = \sum_{i \neq j} \sum_{k,l} E_R(M_{ik}) S_{kl} E_R(M_{jl}) B_{ij}, \quad \text{and then} \quad E_{jk} = 2 \sum_{i \neq j} \sum_l S_{kl} E_R(M_{il}) B_{ij}, \qquad (10)$$

or, in matrix notation, $E = 2 B\, E_R(M)\, S$, using the symmetry of $B$ and $S$.
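One full EM-like pass therefore amounts to a row-wise softmax followed by two matrix products. A sketch, assuming dense NumPy matrices and the matrix form of equation (10) reconstructed above:

```python
import numpy as np

def em_step(E, B, S, beta):
    """One EM-like pass: expectation step (the softmax of equation (9)),
    then the maximization step E = 2 B E_R(M) S of equation (10).
    Illustrative sketch, not the paper's implementation."""
    A = beta * E
    A = A - A.max(axis=1, keepdims=True)          # stabilize the softmax
    ERM = np.exp(A)
    ERM = ERM / ERM.sum(axis=1, keepdims=True)    # rows of E_R(M) sum to 1
    return 2.0 * B @ ERM @ S, ERM
```

At $\beta = 0$ the expectation step returns the uniform matrix $E_R(M_{ik}) = 1/C$, which is the starting point of the annealing analysis below.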
It is well known that deterministic annealing goes through several phase transitions when the temperature is decreased [17, 22, 41]. In order to choose an adapted annealing schedule, i.e., an increasing series $(\beta_l)_{1 \le l \le L}$ that is used to track the evolution of $E_R(M)$, it is important to find at least the first phase transition.

At the limit of infinite temperature ($\beta = 0$), it can be shown (see Appendix A.4) that the following mean field

$$E^0_{jk} = \frac{2}{C} \sum_{i \neq j} B_{ij} \sum_l S_{kl}, \qquad (11)$$

is a fixed point of the EM-like scheme (i.e., of equations (9) and (10)). The corresponding assignment expectations are uniform, i.e.,

$$E_R(M_{ik}) = \frac{1}{C}. \qquad (12)$$

An analysis of the stability of the EM-like scheme (conducted in Appendix A.4) shows that the fixed point $E^0$ is stable when the temperature is higher than the critical temperature $T_c = \frac{2 \lambda_B \lambda_S}{C}$, where $\lambda_B$ and $\lambda_S$ are the spectral radii, i.e., the largest eigenvalues in absolute value, respectively of $B$ and $S$. The first phase transition will therefore happen when the temperature becomes lower than this limit. It should be noted that the critical temperature is a very conservative estimation of the temperature of the first phase transition, as it corresponds to a worst case analysis of the stability of the fixed point.

Algorithm 1
Deterministic annealing for organized modularity maximization

    initialize E to E^0 from equation (11)
    T_c ← 2 λ_B λ_S / C    {critical temperature}
    T ← α T_c
    for l = 1 to L do    {annealing loop}
        E ← E + ε    {noise injection}
        repeat    {EM-like phase}
            compute E_R(M_ik) using equation (9) with β = 1/T
            compute E using equation (10)
        until convergence of E
        T ← γ T
    end for
    threshold E_R(M) into an assignment matrix

To derive the final Algorithm 1, we follow [41] rather than [29]. We use in particular the perturbation idea of [41]: each time the temperature is lowered, some noise is injected in the mean field before running the EM-like scheme. This method favors the phase transitions and symmetry breaks that are needed for a proper convergence.

Deterministic annealing is very robust and quite insensitive to the parameters that appear in Algorithm 1:

• L is the number of outer iterations, i.e., the number of temperatures considered during the annealing process (a typical value is N, the size of the graph);
• α > 1 sets the initial temperature slightly above the critical temperature;
• γ is the damping factor of the temperature (typically chosen so that the final temperature is a small fraction of T_c);
• for the noise ε, we used a small random multiplicative perturbation of E;
• the EM-like phase is considered to be stable when the mean squared difference between the components of two successive values of E is below the square root of the machine precision (or after 500 iterations).
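Algorithm 1 can be sketched end to end in NumPy. This is an illustrative reconstruction: the parameter values (`alpha`, `gamma`, `L`), the noise interval, and the tolerances are placeholders, not the paper's settings, and the critical temperature is taken as $2\lambda_B\lambda_S/C$:

```python
import numpy as np

def deterministic_annealing(B, S, L=100, alpha=1.1, gamma=0.95, seed=0):
    """Sketch of Algorithm 1 with illustrative parameter values.

    Anneals from slightly above the critical temperature and returns a hard
    assignment obtained by thresholding E_R(M)."""
    rng = np.random.default_rng(seed)
    N, C = B.shape[0], S.shape[0]
    lam_B = np.max(np.abs(np.linalg.eigvalsh(B)))   # spectral radii of B and S
    lam_S = np.max(np.abs(np.linalg.eigvalsh(S)))
    T = alpha * 2.0 * lam_B * lam_S / C             # start above T_c
    # fixed point E^0 of equation (11) (B has a zero diagonal)
    E = (2.0 / C) * np.outer(B.sum(axis=1), S.sum(axis=0))
    for _ in range(L):                              # annealing loop
        E = E * rng.uniform(0.95, 1.05, size=E.shape)   # noise injection
        for _ in range(500):                        # EM-like phase
            ERM = np.exp((E - E.max(axis=1, keepdims=True)) / T)
            ERM = ERM / ERM.sum(axis=1, keepdims=True)  # equation (9), beta = 1/T
            E_new = 2.0 * B @ ERM @ S                   # equation (10)
            converged = np.mean((E_new - E) ** 2) < 1e-12
            E = E_new
            if converged:
                break
        T *= gamma
    return ERM.argmax(axis=1)        # threshold E_R(M) into an assignment
```

With $S$ equal to the identity, this reduces to plain modularity maximization by deterministic annealing, as noted in Section 3.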
Then computing the product of $B\, E_R(M)$ by $S$ is an $O(NC^2)$ operation, while applying equation (9) costs $O(NC)$. The total cost of one iteration of Algorithm 1 is therefore $O(CA + C^2 N)$. Computing the critical temperature can be done quite efficiently via the Lanczos method [16], but in fact only a very rough estimate of the spectral radius is needed. Therefore, a few iterations of a power method should be enough to give a reasonable initial temperature. In practice, the computational cost of the method makes it suitable for graphs with a few thousand vertices, as long as they are not too dense.
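The fast multiplication by $B$ exploits the fact that $B$ is a sparse matrix plus a rank-one null-model term, minus its own diagonal. A sketch (with `W` stored as a sparse matrix, `W @ X` would cost $O(AC)$; dense NumPy is used here only to keep the example self-contained):

```python
import numpy as np

def fast_BX(W, X):
    """Multiply by B (equation (5)) without forming B explicitly.

    The sparse part W @ X costs O(A C); the rank-one null-model term
    k (k^T X) / (2m) costs O(N C); the whole product avoids the O(N^2 C)
    dense multiplication."""
    k = W.sum(axis=1)
    two_m = k.sum()
    BX = (W @ X - np.outer(k, k @ X) / two_m) / two_m
    # B's diagonal is forced to 0: remove the diagonal contribution row by row
    diag = (np.diag(W) - k * k / two_m) / two_m
    return BX - diag[:, None] * X
```

The result matches the dense product `B_dense @ X` to machine precision.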
4. Graph visualization methodology
In principle, the proposed graph visualization methodology is rather straightforward and strongly related to, e.g., exploratory data analysis with the SOM. We choose a regular grid as a prior structure and use Algorithm 1 to find an optimal clustering with respect to the organized modularity defined by equation (4). Then, as explained in Section 2.1, we display the clustering induced graph: we do not need a graph layout algorithm in this phase, as each cluster has a dedicated position in the grid.

In practice, some parameters need to be carefully chosen to provide a meaningful visualization. In addition, (organized) modularity maximization faces a problem of limited resolution and sometimes leads to an oversimplified clustering induced graph. We describe in this Section the proposed solutions to those issues.

4.1. Parameter tuning
The internal parameters of Algorithm 1 described in Section 3 do not require any particular tuning, and the guidelines provided there should in general give satisfactory results. On the contrary, the quality of the visual representation strongly depends on the prior structure, both in terms of the size of the grid itself and in terms of the $S$ matrix. Those parameters should be optimized as automatically as possible. The general principle consists first in defining a finite set of acceptable prior structures and then in selecting the best one via some quality criterion.

Our practical goal is to end up with readable graphs: we consider therefore small prior structures, for instance square grids with an absolute maximum of 10 nodes per dimension (i.e., up to 100 clusters). The entries of the matrix $S$ are calculated via a SOM like equation

$$S_{ij} = H(\lambda \|x_i - x_j\|), \qquad (13)$$

where $x_i$ is the position of cluster $i$ in the prior structure and $H$ is either $H(t) = \exp(-t)$ (exponential decrease) or $H(t) = \max(0, 1 - t)$ (linear decrease). The scaling parameter $\lambda$ is used to tune the neighborhood influence and can be chosen so as to include from zero neighbors to a complete influence of all clusters on each other.

Unfortunately, as recalled in Section 2.2, there is no universally accepted quality criterion for graph visualization. Moreover, the case of clustered visualization naturally corresponds to a trade-off between internal and external connectivity uniformity. To handle this trade-off, we propose to rely on dual objective optimization principles. More precisely, we use the standard modularity measure to assess clustering quality and the number of edge crossings to assess visual quality.
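Building the prior similarity matrix of equation (13) for a rectangular grid takes a few lines (the function name and the `kernel` switch are ours):

```python
import numpy as np

def grid_similarity(rows, cols, lam=1.0, kernel="exp"):
    """Prior similarity matrix of equation (13) for a rows x cols grid.

    kernel='exp' uses H(t) = exp(-t) (exponential decrease);
    kernel='lin' uses H(t) = max(0, 1 - t) (linear decrease)."""
    x = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # grid distances
    if kernel == "exp":
        return np.exp(-lam * d)
    return np.maximum(0.0, 1.0 - lam * d)
```

Both kernels yield a symmetric matrix with a unit diagonal, as required by the definition of $S$ in Section 2; `lam` plays the role of the scaling parameter $\lambda$.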
Rather than combining the measures to select one value of the parameters of the prior structure, we consider Pareto optimal prior structures: each selected structure is better than all others for at least one of the two criteria.

In practice, this means that the parameter tuning process is only semi-automatic: it selects some interesting visualizations from the explored set of parameters. Those visualizations can then be presented to the user for selection. As explained in Section 2.2, while the modularity is a clustering quality criterion, it favors both dense clusters and low connectivity between clusters. It is therefore reasonable to sort Pareto optima in decreasing modularity order prior to user analysis. However, limiting the results to the best prior structure according to modularity will generally lead to oversimplified graphs because of the limited resolution of this measure [13]: this justifies presenting to the user Pareto optima with sub-optimal modularity but with more non empty clusters and fewer edge crossings.

In addition, one can leverage the annealing process to produce intermediate results which can be considered as a compromise between the main strategy and the clustered layout strategy [35]. Indeed, as will be shown in Section 5, the expectation of the assignment matrix $E_R(M)$ does not contain 0/1 values, even after some phase transitions. The fuzzy layout strategy introduced in the present section takes advantage of this fact.

Let us denote $(x_k)_{1 \le k \le C}$ the positions associated to the $C$ clusters in the prior structure. The position of vertex $i$ in the prior structure based layout associated to the assignment matrix $M$ is given by

$$p_i = \sum_k M_{ik} x_k,$$

and therefore, the expected position is

$$E_R(p_i) = \sum_k E_R(M_{ik}) x_k. \qquad (14)$$

At the limit of zero temperature, $E_R(M)$ contains only 0 and 1, and the expected positions are exactly positions in the prior structure.
However, during annealing, vertices that are difficult to classify will have in-between positions, reflecting non peaked values of $E_R(M_{ik})$.

As a by-product of the annealing scheme, one can therefore provide an animated rendering of the evolution of $E_R(M)$ by displaying, for different values of the temperature, the $N$ points $(E_R(p_i))_{1 \le i \le N}$ (the display is somewhat similar in principle to the posterior mean projection used in the Generative Topographic Mapping [2]). To avoid cluttering the layout with overlapping vertices, we rely on an elementary simplification scheme: a complete linkage hierarchical clustering is applied to the $E_R(p_i)$ and the dendrogram is cut at an appropriate level (for instance 5% of the minimal distance between the points of the grid). This leads to a finer clustering than the final one, with associated positions for the clusters. Many of those clusters are positioned near or on cluster positions on the grid, especially when the annealing scheme nears completion. This generally induces some overlapping between edges. To limit this effect, we use the positions computed above as initial positions of a force directed placement algorithm (such as the Fruchterman-Reingold algorithm [15]). The algorithm is used only for a few iterations that move the clusters slightly and reduce overlapping.

In practice, the user first browses through the Pareto optimal points sorted in decreasing modularity order to select some interesting visualizations. Then he/she can request an animated rendering of the most interesting graphs to assess whether some sub-structure can be found in the clusters.
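The expected positions of equation (14) are a single matrix product (a trivial but central step of the fuzzy layout; the function name is ours):

```python
import numpy as np

def expected_positions(ERM, grid_positions):
    """Expected layout positions of equation (14): E_R(p_i) = sum_k E_R(M_ik) x_k.

    ERM: N x C matrix of assignment expectations (rows sum to 1);
    grid_positions: C x 2 array of cluster coordinates in the prior structure."""
    return ERM @ grid_positions
```

A hard assignment matrix returns the grid positions exactly, while fuzzy rows land between grid points; the complete linkage simplification and the few Fruchterman-Reingold iterations described above would then be applied to these points.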
5. Detailed analysis of a small graph
This section provides a detailed analysis of the behavior of the proposed method on a small graph. The method has been implemented in R [39], using the igraph package [8] complemented by the network package [6].

We study Zachary's Karate club social network [54]. The graph represents the friendship social network between the 34 members of a Karate club at a US university in the 70s. The graph contains 78 unweighted edges (its global density, the fraction of connected pairs of nodes, is thus equal to 13.9%) and is represented on Figure 1, obtained with the Fruchterman-Reingold force directed algorithm [15] as implemented in igraph [8]. The transitivity of the graph, that is a measure of the probability that the adjacent vertices of a vertex are connected (see e.g., [51]), is equal to 25.6%. The gap between the global density and the transitivity is a good indication of a larger local density and thus a relevant clustering.
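The statistics quoted above are easy to reproduce; here is a check with networkx (rather than the paper's R/igraph code), which ships with Zachary's Karate club graph:

```python
import networkx as nx

# Zachary's Karate club: 34 members, 78 friendship edges
G = nx.karate_club_graph()
n_nodes = G.number_of_nodes()     # 34
n_edges = G.number_of_edges()     # 78
density = nx.density(G)           # fraction of connected pairs, ~13.9%
transitivity = nx.transitivity(G) # ~25.6%
```

The gap between `density` and `transitivity` is exactly the local-density signal discussed above.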
Figure 1: Zachary's Karate club social network.

5.1. Modularity maximization

It is well known from previous work on this graph that, in terms of modularity, the optimal number of clusters is four (see e.g., [10]). This is a consequence of the structure of the modularity, which tends to peak at a certain graph specific number of clusters and then decreases (this is another manifestation of the limited resolution of this measure [13]). For this graph, the social analysis conducted by Zachary leads to the definition of two clusters which correspond to the split between members of the club during the course of the study. Both clusters are split into two sub-clusters by modularity maximization (one ground truth cluster corresponds to clusters 1 and 4 in Figure 6 and the other to clusters 2 and 3). We have investigated the behavior of the deterministic annealing algorithm with C ranging from 2 to 8. The parameters of the algorithm were the following ones: α = 1; γ is chosen so that the final temperature is T_c/10 after L = 151 steps (this relatively large value was used to obtain convergence for each EM-like phase; a faster annealing introduced instabilities on this graph).

Figure 2: Maximal modularity achieved by DA as a function of C
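The EM-like phases under a cooling schedule can be sketched as a mean-field fixed-point iteration (an illustrative reconstruction, not the paper's R code; parameter names, the geometric schedule and the initial temperature are assumptions):

```python
import numpy as np

def soft_assignments(W, C, T0=None, gamma=0.95, n_steps=150, n_inner=50, seed=0):
    """Mean-field deterministic annealing sketch for soft modularity
    maximization: R[i, k] approximates E_R(M_ik)."""
    rng = np.random.default_rng(seed)
    k = W.sum(axis=1)
    m = W.sum() / 2
    B = (W - np.outer(k, k) / (2 * m)) / (2 * m)   # modularity matrix
    np.fill_diagonal(B, 0.0)                        # keep only i != j terms
    N = W.shape[0]
    R = rng.dirichlet(np.ones(C), size=N)           # random soft initialization
    T = T0 if T0 is not None else np.abs(B).max()
    for _ in range(n_steps):                        # geometric cooling
        for _ in range(n_inner):                    # EM-like fixed point at fixed T
            F = 2.0 * (B @ R)                       # mean-field gain for R[i, k]
            F = (F - F.max(axis=1, keepdims=True)) / T
            R = np.exp(F)
            R /= R.sum(axis=1, keepdims=True)       # softmax over clusters
        T *= gamma
    return R
```

At high temperature the rows of R stay nearly uniform; as T decreases, phase transitions progressively sharpen the assignments.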
Figure 3: Evolution of the modularity during annealing (the solid line corresponds to the theoretical temperature for the first phase transition, dashed lines correspond to detected transitions)
As shown on Figure 2, the maximal modularity achieved by deterministic annealing (without a prior structure) reaches its maximum at C = 4 clusters and then remains constant. More precisely, we compute the expectation of the modularity with respect to the mean field distribution:

E_R(Q(M)) = \sum_{i \neq j} \sum_k E_R(M_{ik}) E_R(M_{jk}) B_{ij} + \frac{1}{2m} \sum_i (W_{ii} - P_{ii})    (15)

The critical temperature appears, as expected, as a very conservative estimate of the temperature of the first phase transition. This first phase transition corresponds to the detection of two well identified clusters (the ones discovered by Zachary in [54]), while the second phase transition introduces two additional clusters. The evolution of the expectation of the assignment matrix E_R(M) is represented by Figures 4 and 5: each figure includes four copies of the graph. In copy number k, the gray level of node i encodes the value of E_R(M_ik), i.e., of the probability for node i to belong to cluster k according to the approximating distribution R. After the first phase transition, clusters 1 and 3 start to take (partial) ownership of some of the vertices (the black ones on Figure 4), even if only 6 vertices out of 34 have a maximal assignment probability higher than 0.75. More than 40% of the vertices (14 out of 34) have rather fuzzy assignment probabilities (i.e., less than 0.5 for the maximal value). After the second phase transition (shown on Figure 5), the algorithm has identified four distinct clusters accounting for 23 vertices (which are all assigned to a cluster with more than 0.75 probability). Four vertices still have a maximal assignment probability below 0.5. If the annealing is brought to the limit of a very low temperature (below the threshold of T_c/10 chosen here), the expectation of the assignment matrix is almost a 0/1 matrix, but such effort is not needed: in the case of the Karate club graph, all vertices have a maximal assignment probability higher than 0.5 at T = T_c/10 and a winner-take-all approach leads to a well defined clustering of the graph.

Figure 6 compares the original layout numbered via the clustering to the layout of the clustering induced graph (obtained by the Fruchterman-Reingold force directed algorithm [15]). The induced graph clearly emphasizes the relations between the clusters: cluster 2 is not directly connected to clusters 1 and 4, while the connection between 3 and 4 is less dense than the ones between 1 and 3 or 1 and 4. The summary is therefore quite informative, even if, as expected, most of the structure is lost. For instance, cluster 1 exhibits a star shape which is of course lost in the glyph based representation.
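Equation (15) can be evaluated directly from a soft assignment matrix; a numpy sketch, assuming B_ij = (W_ij − P_ij)/(2m) with P_ij = k_i k_j/(2m) (our reading of the paper's definition of B):

```python
import numpy as np

def soft_modularity(W, R):
    """Mean-field expectation of the modularity, as in equation (15):
    sum_{i != j} sum_k E(M_ik) E(M_jk) B_ij + (1/2m) sum_i (W_ii - P_ii)."""
    k = W.sum(axis=1)
    m = W.sum() / 2
    P = np.outer(k, k) / (2 * m)
    B = (W - P) / (2 * m)
    S = R @ R.T                                   # S_ij = sum_k R_ik R_jk
    off = (S * B).sum() - (np.diag(S) * np.diag(B)).sum()   # i != j terms
    return off + np.trace(B)                      # (1/2m) sum_i (W_ii - P_ii)
```

For a hard (0/1) assignment matrix this reduces exactly to the ordinary modularity of the induced partition.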
For the organized modularity, we use the simplest setting compatible with the four clusters identified in the previous section: the prior structure consists
Figure 4: Assignment probabilities after the first phase transition: black corresponds to probabilities close to one, white to probabilities close to zero
Figure 5: Assignment probabilities after the second phase transition: black corresponds to probabilities close to one, white to probabilities close to zero
Figure 6: Left: original layout numbered according to the clustering; right: clustering induced graph
in a square in which each vertex is associated to a cluster. We use the following influence matrix:

S = \begin{pmatrix} 1 & \lambda & \lambda & 0 \\ \lambda & 1 & 0 & \lambda \\ \lambda & 0 & 1 & \lambda \\ 0 & \lambda & \lambda & 1 \end{pmatrix}

where λ specifies the amount of local influence (there is no diagonal influence). Interestingly, the value of λ has an impact on the number of non empty clusters obtained by deterministic annealing. We use the same parameters for the algorithm as in the previous section, while varying λ between 0 and 0.2.
2. As shownon Figure 7, a large influence tends to reduce the number of effective clustersproduced by the algorithm. This can be explained by an analysis of the opti- ll non e m p t y c l u s t e r s Figure 7: Number of non empty clusters as a function of the influence parameter λ mal four clusters clustering obtained before. As shown on Figure 6, there is nodirect connection between cluster 2 and clusters 1 and 4. As a consequence, thecorresponding entries B ij are negative: the only way to prevent the organizedmodularity to be smaller than the standard one is to put cluster 2 far awayfrom clusters 1 and 4 in the prior structure. This is not strictly feasible as thereare only two pairs of cluster positions with no direct influence in S (they corre-sponds to the diagonal of the square). As a consequence, one might put cluster1 far away from cluster 1 or from cluster 4, but not from both. Therefore, themodularity will decrease by e.g., − m (cid:88) i ∈ C ,j ∈ C λP ij , if cluster 2 has no influence on cluster 1. Then, when λ is large enough, thereduction of the four clusters organized modularity will be high enough to bringits value below the one of the two or three clusters standard modularity. Thoselatter configurations are easier to arrange on the prior structure in a way thatminimizes negative contributions: they will be preferred by the algorithm overthe four clusters solution.In fact, the behavior of the algorithm is comparable to the one of a forcedirected method: when there is no connection between two vertices, they only17
Figure 8: Clustering induced graph displayed on the prior structure

repel each other and the algorithm tries to maximize their distances. As shown on Figure 8, the representation of the graph obtained via the prior structure is quite similar to the one provided by Figure 6. The underlying clusters are completely identical and thus use the same numbering to ease comparison between the figures (this is also the case when the value of λ induces a clustering with two or three clusters: the obtained clusterings are the ones that maximize the standard modularity for C = 2 or C = 3).

It might seem from Figure 8, and more generally from the discussion above, that the organized modularity offers no particular interest compared to a two phases approach with a maximal modularity clustering followed by a force directed layout. The behavior of the DA algorithm on the Karate graph emphasizes the fact that the modularity does not increase with the number of clusters: it tends to peak at an optimal graph specific number. As explained before, there is a complex interaction between the prior structure and the values of the modularity for different numbers of non empty clusters. In some situations, the prior structure introduces too strong a coupling between clusters and thus leads to a reduced number of non empty clusters. This might lead to a better visualization of the graph, to an oversimplification or to nothing more than the two phases approach (Section 6 will illustrate this further).

The first way to address this problem is to test several prior structures and to select optimal graphs with respect to different quality measures as proposed in Section 4. A second (and complementary) approach consists in leveraging the annealing process to limit the impact of the peaking behavior of the modularity by means of the fuzzy layout strategy described in Section 4.2. An example of such layouts is given by Figure 9.
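In our reading, the coupling through S corresponds to an organized modularity of the form Σ_{i≠j} Σ_{k,l} E_R(M_ik) S_kl E_R(M_jl) B_ij plus the diagonal term of equation (15); a hedged numpy sketch (with S equal to the identity this reduces to the soft modularity):

```python
import numpy as np

def organized_modularity(W, R, S):
    """Sketch of an organized (topographic) soft modularity: cluster pairs
    (k, l) are coupled through the prior-structure influence matrix S.
    This is our reading of the paper's measure, not its exact code."""
    k = W.sum(axis=1)
    m = W.sum() / 2
    B = (W - np.outer(k, k) / (2 * m)) / (2 * m)
    G = R @ S @ R.T               # G_ij = sum_{k,l} R_ik S_kl R_jl
    off = (G * B).sum() - (np.diag(G) * np.diag(B)).sum()   # i != j terms
    return off + np.trace(B)      # diagonal term as in equation (15)
```

On a partition where two coupled clusters share few edges (negative B entries between them), increasing the off-diagonal influence lowers the measure, which is the mechanism discussed above.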
The representations give a more complete picture of the graph than Figure 6 (or Figure 8). It appears for instance quite clearly that two vertices are not easy to assign to the final four clusters (they are at the boundary of cluster 1 in Figure 8). If we study carefully the original representation on Figure 1 with the added knowledge provided by Figures 6 and 9, those vertices (numbers 10 and 24) appear quite clearly in between two clusters. However, it would be almost impossible to obtain this information directly from the original representation.
Figure 9: Evolution of the fuzzy layout during annealing (panels: initial layout; after the first phase transition; just before the second phase transition; just after the second phase transition; during the final cooling phase; final configuration). The Fruchterman-Reingold algorithm was used to reduce edge overlapping.
As the proposed method is based on the SOM rationale, it seems natural to compare it to SOM variants that can handle graph nodes, mainly the kernel SOM with adapted kernels [47] and the spectral SOM [48]. Details on those methods and on parameter optimization are postponed to Section 6.2 in order to keep the present section focused on a detailed analysis of the Karate graph.

Following the methodology described in Section 6.2, we have built Pareto optimal points with respect to the modularity of the clustering and the number of edge crossings in the induced layout. The only global Pareto optimal point among SOM variants has been obtained by the heat kernel (see [27]). It has zero edge crossings and a modularity of 0.4188. The obtained layout is identical to the one of Figure 8, up to a rotation. However, the underlying clustering is slightly different, as shown by the reduced modularity. In fact, node number 10 on Figure 1 is assigned to the same cluster as node number 3, rather than to node number 34's cluster as in the optimal clustering (in terms of modularity). It turns out that node number 10 should be assigned to node number 34's cluster according to Zachary's analysis. Therefore, the best SOM variant fails to recover exactly the ground truth clustering, while modularity optimization does recover a finer clustering.

Other variants make similar or worse mistakes. For instance, the best result obtained by the modularity kernel SOM has a modularity of 0.409, which corresponds to two differences with the optimal modularity clustering. Node number 10 is misclassified compared to Zachary's two clusters ground truth, while node number 24 is assigned to the correct Zachary cluster but to a different cluster than in the highest modularity clustering.

The best solution for the Laplacian's inverse kernel SOM has an even worse modularity of 0.391. In this case, the differences between the proposed clustering and the ground truth are more important.
Figure 10 gives the clustering obtained by the Laplacian's inverse kernel SOM on the full layout of the Karate graph. Node number 3 is assigned to a wrong cluster according to the ground truth. While the mismatch corresponds to only one node, the status of this node is much more obvious than the one of node 10: it is connected to node number 1, the central actor of one of the clusters identified sociologically. In addition, cluster number 4 seems to consist of boundary nodes rather than of a dense subset of nodes. While this SOM variant manages to reach a fair approximation of the ground truth clustering, it splits one of those clusters in a misleading way: cluster 4 is not dense and its members are more connected to outside nodes than between themselves.

In summary, the SOM variants are unable to recover exactly the ground truth clustering obtained by Zachary on this graph. In addition, the SOMs give an example of a clustering with reduced modularity (compared to optimal ones) with a possibly misleading cluster: the subgraph is not dense and nodes have more connections outside of the cluster than inside. Limitations of graph SOMs compared to our proposal will be confirmed in the next section.
Figure 10: Original layout numbered according to the clustering obtained by the Laplacian's inverse kernel SOM
6. Results on larger graphs
In order to confirm the conclusions reached in the previous section, the following experiments provide a comparison between the proposed method, the classical two phases approach and SOM variants on three larger graphs. More precisely, the studied graphs are:

• a coappearance network of characters in chapters of the novel Les Misérables (Victor Hugo), introduced by [25] and available at . This graph is undirected and weighted by the number of chapters in which each pair of characters appear together. It is larger than the graph described in Section 5, with 77 nodes representing the characters of the novel. The global density of this graph is equal to 8.7% and its transitivity is equal to 49.9%. This indicates that the local connectivity of the graph is very high compared to its global one and thus that this graph is likely to be clustered into dense subgroups. This graph is represented in Figure 11 (this layout has been obtained via the Fruchterman-Reingold force directed algorithm [15] as implemented in igraph);

• a directed, weighted network representing the neural network of the worm "C. Elegans", introduced by [52] and available at http://cdg.columbia.edu/cdg/datasets . The graph has been used as an undirected weighted graph: the weights (W_ij) of the undirected graph are simply defined from the weights (W→_ij) of the directed graph by W_ij = W_ji = W→_ij + W→_ji. This graph is connected and contains 453 nodes. Its density is equal to 2.0% and its transitivity to 12.4%. Compared to the graph from "Les Misérables", this network has a much smaller local connectivity, indicating that the clustering task should be harder. Hence, a smaller modularity is expected for the resulting clustering. This graph is represented in Figure 12 (left).
While the graph from "Les Misérables" was readable, the "C. Elegans" graph is impossible to decipher when rendered by the standard Fruchterman-Reingold algorithm;

• a network representing the e-mail exchanges between members of the University Rovira i Virgili (Tarragona), introduced by [19] and available at http://deim.urv.cat/~aarenas/data/xarxes/email.zip . This graph is also connected and contains 1 133 nodes. Its global density is equal to 0.9% and its transitivity to 16.6%. As in the previous case, the transitivity of the graph is not very high, but the gap between the global density and the transitivity is much larger and a larger modularity than for "C. Elegans" can then be expected. This graph is represented in Figure 12 (right) and is as undecipherable as the "C. Elegans" one when rendered with the Fruchterman-Reingold algorithm.
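The symmetrization used for "C. Elegans" is a one-liner on the adjacency matrix (numpy sketch, function name is ours):

```python
import numpy as np

def symmetrize(W_dir):
    """Undirected weights from a directed weighted adjacency matrix, as done
    for the "C. Elegans" network: W_ij = W_ji = W->_ij + W->_ji."""
    return W_dir + W_dir.T
```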
Figure 11: Coappearance network from "Les Misérables" (left: with characters' names; right: without labels)
The experiments compare the proposed method with two other approaches:

1. as explained in Section 2, the proposed approach tries to improve the standard two phases approach by building topographically ordered clusters. The reference approach consists therefore in the two phases method already used in Section 5.1: we build a good clustering via deterministic annealing maximization of the modularity and the graph induced by the clustering is rendered via a force directed placement algorithm (the Fruchterman-Reingold algorithm as implemented in igraph);
Figure 12: Two real-world networks: neural network of the worm C. Elegans (left) and e-mail network between the members of the University Rovira i Virgili (right)
2. an alternative organized approach is the one described in [47]: Kohonen's Self Organizing Map algorithm is used to build a topographically ordered clustering of the graph (on a prior grid) via a well chosen graph kernel. Several popular kernels are defined for graphs, including regularized versions of the Laplacian (see, e.g., [44]), the generalized inverse of the Laplacian [14] and the modularity kernel [55]. The latter is defined from the positive part of the matrix B involved in the definition of the modularity (see equation (5)).

A slightly different way to use the Self Organizing Map for clustering the vertices of a graph is what is called "spectral SOM" in [48]. This approach is inspired by spectral clustering (see e.g., [49]) but uses a SOM instead of the usual k-means algorithm. More precisely, the vertices of the graph are represented by vectors of R^C (where C is the initial number of clusters of the prior grid) that are the coordinates of the eigenvectors associated to the C smallest non zero eigenvalues of the Laplacian of the graph. Then, a usual vector SOM is applied to these vectors.

Those four SOM variants were used in Section 5.4 for the Karate graph.

The parameter optimization procedure outlined in Section 4.1 is applied to the tunable parameters of the reference methods (e.g., the number of clusters, the kernel and its parameters, etc.). Random initialization is included in the procedure for methods that are not deterministic.

In the specific case of the two phases approach, we rely on a two phases optimization: the best clustering is selected by maximization of the modularity over the number of clusters and then the layout with the lowest number of crossings is selected.

We recall that the Laplacian of a graph is the N × N matrix L defined by

L_{ij} = \begin{cases} -W_{ij} & \text{if } i \neq j \\ k_i & \text{if } i = j \end{cases}

• the number of clusters for the clustering algorithm was optimally chosen in { , . . .
, } (please note that all methods can produce empty clusters);

• the prior structure was encoded via a square grid selected among 3 sizes: 3 × 3, 4 × 4 and 5 × 5;

• the entries of the matrix S were calculated via equation (13). We compared exponential decrease and linear decrease, and used at least four different scaling values, chosen according to the induced radius of influence in the prior structure;

• in the case of the SOM (kernel version or "spectral SOM"), the optimal configuration (size of the prior structure and neighborhood function) was selected in the same set of configurations as for the organized modularity;

• as the batch SOM used in this paper exhibits some dependence on the initial configuration, we used 5 different initializations for each run of the algorithm. Among them, 4 were random ones and the last one was a PCA based initialization in which the prior structure is positioned on the plane spanned by the first two principal components (this is done with kernel PCA [43] in the case of kernel SOM and standard PCA in R^C for the spectral SOM);

• as explained above, we used three different kernels. Among them, only the heat kernel has a parameter: it is defined as K = e^{−βL}, where L is the Laplacian of the graph and β a temperature parameter. Four values of β were used.

We used the three kernels only for the analysis of "Karate" and "Les Misérables". For the two other graphs (C. Elegans and E-mail), we restricted the analysis to the generalized inverse, which achieved almost the best results in the first experiment (comparable with the best ones obtained for the heat kernel) and does not require the tuning of an additional parameter.

We first compare the proposed method to the SOM variants with respect to the chosen quality criteria: the modularity of the clustering and the number of edge crossings in the associated representation.
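The Laplacian recalled above and the heat kernel (written here with the usual sign convention K = e^{−βL}, so that K stays bounded for the positive semi-definite L) can be sketched via an eigendecomposition (illustrative numpy code, not the igraph-based implementation):

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian as recalled in the text:
    L_ij = -W_ij for i != j, and L_ii = k_i (the degree of vertex i)."""
    return np.diag(W.sum(axis=1)) - W

def heat_kernel(W, beta):
    """Diffusion (heat) kernel K = exp(-beta * L), computed through the
    eigendecomposition of the symmetric Laplacian."""
    vals, vecs = np.linalg.eigh(laplacian(W))
    return vecs @ np.diag(np.exp(-beta * vals)) @ vecs.T
```

Since the constant vector is in the null space of L, each row of K sums to one for any β, and β = 0 gives the identity.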
As explained in Sections 2 and 4.1, those quality criteria have been chosen in order to obtain a good compromise between a readable clustering based representation of the graph (with a small number of edge crossings) and the fairness of this representation (dense clusters, i.e., high modularity). Figures 13 and 14 give the values of the quality criteria for all the experiments made on the dataset "Les Misérables", while Tables 1, 2 and 3 give the pairs of values associated to Pareto points obtained for each graph. We first focus on "Les Misérables" and then give more general comments. Please note that comments and conclusions about "Les Misérables" apply to the Karate graph, with the only exception that the modularity kernel SOM performs in an acceptable way on the Karate graph. Detailed results are not included to avoid lengthening the article.
(Figure 13 comprises four panels — Generalized inverse, Heat kernel, Modularity kernel and Spectral SOM — each plotting the number of crossing edges against the modularity for grid sizes 3, 4 and 5.)
Figure 13: Quality criteria for the kernel and spectral SOM for "Les Misérables": the x axis is the modularity and the y axis is the number of edge crossings. The gray level of the points gives the initial size of the square grid and Pareto points are indicated by a black star. Solutions with a modularity lower than 0 or a number of edge crossings greater than 500 are omitted on this figure.

As shown in Figures 13 and 14, the clusterings obtained on "Les Misérables" by optimizing the soft modularity have generally larger modularity values than those produced by SOM variants. In addition, only a limited subset of the parameter space leads to high modularity with SOM variants. Both outcomes were expected for kernel SOMs using the heat kernel and the generalized inverse of the Laplacian. Indeed, those kernels induce feature spaces and associated clustering objectives that have no simple relation with the modularity criterion. As shown in e.g. [4, 53], both kernels lead to interesting clustering results, but it
(Figure 14 comprises two panels — Linear prior dissimilarity and Gaussian prior dissimilarity — each plotting the number of cut edges against the modularity for grid sizes 3, 4 and 5.)
Figure 14: Quality criteria for the optimization of the organized modularity by deterministic annealing for "Les Misérables". The gray level of the points gives the initial size of the square grid and Pareto points are indicated by a black star.

is clear from the present results that one cannot expect to get high modularity clusterings from them in general.

The poor results for the spectral SOM are slightly less expected, as this method is related to some form of graph cut measure optimization, using the relaxation proposed in spectral clustering [49]. However, as recalled in Section 2.2, graph cut measures are quite different from the modularity, which probably explains the low modularity clusterings obtained by this method.

Finally, the results obtained by the modularity kernel are quite disappointing. Indeed, this kernel is strongly related to the modularity measure [55] and should lead to some form of modularity maximization. In practice, it seems that taking only the positive part of the B matrix does not capture all the complexity of the modularity, at least in the SOM context.

Of course, as our method aims at maximizing an organized version of the modularity, the higher quality of the results in terms of a closely related measure (the classical modularity) is rather natural. However, the figures show that it also leads to a low number of edge crossings and therefore to more readable layouts. Figure 13 shows in particular that SOM variants generally fail to balance modularity and edge crossings: high modularity clusterings frequently have a significant number of edge crossings (see also Table 1).

Combining both criteria leads to the search of Pareto points.
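Extracting the Pareto optimal points for the pair (modularity to maximize, edge crossings to minimize) can be sketched as follows (illustrative Python, names are ours):

```python
def pareto_points(solutions):
    """Pareto-optimal solutions among (modularity, crossings) pairs, where
    modularity is to be maximized and crossings minimized: a point is kept
    unless some other point is at least as good on both criteria and
    strictly better on one."""
    def dominates(a, b):
        return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]
```

Applied to the pooled results of all runs, this yields the global Pareto points reported in the tables below.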
The case of "Les Misérables" is summarized by Table 1: when considered all together, the SOM approaches (kernels and spectral) lead to 4 Pareto points which are of lesser quality than the ones obtained by the proposed approach: none of the Pareto points for kernel SOM is a global Pareto point for the dataset, mainly because of low modularity values. In addition, and for similar reasons, the spectral SOM and the modularity kernel SOM do not produce any Pareto point. In fact, the only realistic competitive method in the SOM family is the kernel SOM based on the Laplacian's inverse. Indeed, it gives comparable results to the ones obtained with the heat kernel, but without the need for a time consuming kernel parameter tuning. The two larger datasets, "C. Elegans" and "E-mail", were therefore investigated only with this kernel SOM, excluding all other variants. Pareto points are described in Tables 2 and 3.

Les Misérables
Method                     | Number of clusters | Modularity | Nb of pairs of cut edges | Id
SOM (heat kernel)          | 5 (22)             | 0.5327     | 37  | M1
                           | 3 (8)              | 0.5276     | 2   | M2
SOM (Laplacian's inverse)  | 3 (9)              | 0.5212     | 1   | M3
                           | 3 (8)              | 0.5089     | 0   | M4
Organized mod. (linear)    | 4 (7)              | 0.5638     | 1   | M5
Organized mod. (Gaussian)  | 5 (7)              | 0.5652     | 3   | M6
                           | (6)                | 0.5472     | 0   | M7
Modularity optimization    | 8 (5)              | 0.5472     | 0   | M8

Table 1: Description of the Pareto points for each of the three methods (SOM, optimization of organized modularity and optimization of ordinary modularity) applied to the dataset "Les Misérables". The number of clusters gives the number of clusters C used by the algorithm and, in parentheses, the number of non empty clusters finally obtained. The column "Id" is used to identify the corresponding solution in the text; Pareto points within all solutions have bold Id.

C. Elegans
Method                     | Number of clusters | Modularity | Nb of pairs of cut edges | Id
SOM (Laplacian's inverse)  | 3 (9)              | 0.3228     | 14  | CE1
                           | (9)                | 0.3000     | 7   | CE2
                           | (8)                | 0.2936     | 1   | CE3
Organized mod. (linear)    | 4 (8)              | 0.4373     | 41  | CE4
                           | 3 (7)              | 0.4348     | 27  | CE5
                           | 3 (7)              | 0.4321     | 19  | CE6
Organized mod. (Gaussian)  | 3 (8)              | 0.4063     | 15  | CE7
Modularity optimization    | 18 (8)             | 0.4383     | 27  | CE8

Table 2: Description of the Pareto points for each of the three methods (SOM, optimization of organized modularity and optimization of ordinary modularity) applied to the dataset "C. Elegans". The number of clusters gives the number of clusters C used by the algorithm and, in parentheses, the number of non empty clusters finally obtained. The column "Id" is used to identify the corresponding solution in the text; Pareto points within all solutions have bold Id.

Results from "C. Elegans" and "E-mail" generally agree with those obtained on "Les Misérables": the kernel SOM does not produce clusterings with high modularity, but it sometimes manages to reach a low number of edge intersections in the induced layout. That said, the kernel SOM does not produce global Pareto points for "Les Misérables" and "E-mail": on those datasets, it is therefore strictly less efficient than the proposed method from the dual objectives point of view (this was also the case for the Karate graph). The best situation for the kernel SOM is "C. Elegans", where the very low numbers of edge crossings of configurations CE1, CE2 and CE3 make them Pareto points within the full set of solutions. However, those points have very low modularities compared to other solutions.

E-mail
Method                     | Number of clusters | Modularity | Nb of pairs of cut edges | Id
SOM (Laplacian's inverse)  | 4 (16)             | 0.4652     | 218 | E1
                           | 3 (9)              | 0.4566     | 49  | E2
                           | 3 (9)              | 0.4540     | 24  | E3
                           | 3 (9)              | 0.4420     | 22  | E4
                           | 3 (9)              | 0.4242     | 19  | E5
Organized mod. (Gaussian)  | 3 (8)              | 0.5694     | 47  | E6
                           | (8)                | 0.5693     | 44  | E7
                           | (7)                | 0.5554     | 25  | E8
                           | (7)                | 0.5456     | 23  | E9
                           | (6)                | 0.5401     | 11  | E10
Modularity optimization    | 11 (8)             | 0.5736     | 56  | E11

Table 3: Description of the Pareto points for each of the three methods (SOM, optimization of organized modularity and optimization of ordinary modularity) applied to the dataset "E-mail". 11 Pareto points have been found for the kernel SOM but 6 of them are omitted in the table because of a very low modularity (less than 0.2). The number of clusters gives the number of clusters C used by the algorithm and, in parentheses, the number of non empty clusters finally obtained. The column "Id" is used to identify the corresponding solution in the text; Pareto points within all solutions have bold Id.

In summary, those experiments show that the spectral SOM and the modularity kernel SOM should be avoided, as they cannot produce satisfactory results in terms of the chosen quality criteria. In addition, while the heat kernel can produce interesting results, its additional kernel parameter induces a large computational cost: as shown on Figure 13, most of the computing efforts are wasted as they lead to poor solutions on both criteria. The only competitive method is the Laplacian's inverse kernel SOM. But even if it seems to be the best SOM based method, it still produces sub-optimal solutions on both criteria for three out of four datasets.

The comparison between the classical two phases approach and our method does not lead to a clear winner as far as the chosen quality criteria are concerned. As expected, the modularities of the clusterings obtained with the two phases approach are among the highest. Interestingly, maximizing the organized modularity can lead to a higher modularity than a more direct approach (in a similar way as the SOM, which can outperform the k-means in terms of within cluster variance): this is the case for "Les Misérables" (see Table 1). However, in general, the two phases approach gives the clusterings with the highest modularity. In terms of edge crossings, the two phases approach gives
In terms of edge crossings, the two phases approach also gives satisfactory results: this confirms the analysis from Section 2.2, where we argued that looking for dense clusters should reduce the number of edges between clusters. On complex graphs, however, our method leads to lower numbers of cut edges: again, this confirms the somewhat contradictory nature of the two quality criteria explained in Section 2.2.

By construction, the two phases approach should give a Pareto optimal result, up to sub-optimal optimization results caused by the combinatorial nature of both optimization problems. In practice, we obtained Pareto points on all graphs. In general, those points have a higher modularity and a higher number of edge crossings than the ones produced by our method, as expected: our clusters are somewhat adapted to the clustering induced layout. There is therefore no obvious superiority of one method over the other. In fact, we are more interested in the final layout of the graphs than in the exact numerical results. It is therefore interesting to study the visual representations obtained with the different approaches, as long as they correspond to Pareto optimal points. This is done in the next subsection.

The aim of the proposed method is to provide a simplified but relevant representation of a large graph through the graph induced by the clustering. The present section analyses the visual representations obtained via the proposed method and the reference methods. We first start with a detailed analysis of “Les Misérables”: Figure 15 gives the representation of the graph of clusters for clusterings M5 and M8 (see Table 1).
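As a reminder of the construction used in all the figures below, the clustering induced graph collapses each cluster into a single vertex and accumulates the weights of the original edges joining two clusters. A minimal sketch, with a hypothetical toy graph (the function name and data are ours, not from the paper):

```python
from collections import defaultdict

def induced_graph(edges, cluster):
    """Build the clustering induced graph: each cluster becomes one vertex;
    an edge (k, l) accumulates the weights of original edges joining a
    vertex of cluster k to a vertex of cluster l.

    edges:   iterable of (u, v, weight) on the original graph
    cluster: dict mapping each vertex to its cluster index
    """
    agg = defaultdict(float)
    for u, v, w in edges:
        k, l = sorted((cluster[u], cluster[v]))
        agg[(k, l)] += w
    return dict(agg)

# Toy graph: two dense groups {0, 1, 2} and {3, 4}, one inter-group edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (3, 4, 1.0), (2, 3, 0.5)]
cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
print(induced_graph(edges, cluster))
# {(0, 0): 3.0, (1, 1): 1.0, (0, 1): 0.5}
```

Diagonal entries such as (0, 0) give the internal weight of a cluster; in the layouts below, edge widths reflect the off-diagonal weights.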
Figure 15: Clustering induced graphs. Left: displayed on the prior structure (for M5, obtained by optimizing the organized modularity); Right: displayed by the Fruchterman and Reingold algorithm (for M8, obtained by optimizing the ordinary modularity).
As the graph is small enough, those layouts can be compared to the original layout numbered according to the clustering: this is provided by Figure 16.

Figure 16: Original layout of “Les Misérables” labeled according to the clustering; left: M5 (optimization of the organized modularity) and right: M8 (optimization of the ordinary modularity). The number on the vertices is the cluster's number.

The density of the graph limits the possibilities of analysis, but it appears clearly that cluster 1a from M5 corresponds to a quite isolated character (Mother Plutarch, who is only connected to Mabeuf), while, up to a single character (Sister Simplice), the union of clusters 2a and 2b in M5 gives cluster 2 in M8. In both cases, the summary of the graph given by the clustering therefore seems reasonable, and a more detailed analysis is needed.

It turns out that both representations give a clear understanding of the story of the novel. It is based on a central group of characters (clusters 2a and 2b for M5 and 2 for M8) which includes Valjean, Cosette and Marius (among others). Several sub-stories are narrated in the novel: the story of the Bishop Myriel, who is the spiritual guide of Valjean (cluster 5); the story of the street child Gavroche (clusters 1 in M5 and 1b in M8); and finally the story of Fantine, poor
Cosette's mother (cluster 4). But representation M5 gives additional information by separating the main characters into the ones related to Valjean (Cosette, Marius, Javert, etc., in cluster 2b) and the ones related to the Thénardier family (Mr. and Mrs. Thénardier, their daughter Éponine, etc., in cluster 2a). In addition, cluster 2b of M5 (the main characters) has more connections to cluster 1b than cluster 2a (the Thénardier family) has to cluster 1b. Therefore, the central position of cluster 2b remains clearly emphasized in M5, even if the representation is arguably slightly less readable than representation M8 (it should be noted that the graph induced by the clustering M5 is planar, but because the organized layout is not optimized by a graph drawing algorithm, the actual representation has one edge crossing). Hence, with a small (and avoidable) increase in the number of edge crossings but with a higher quality of clustering, the information given by clustering M5 provides a more complete picture of the original network.

Additional insights can be gained with the help of the fuzzy layout methodology described in Section 4.2. Figure 17 gives an example of such a layout for the final configuration of the deterministic annealing. At first, this representation might seem very different from M5 on Figure 15. In fact, most of the differences can be explained by two phenomena. Firstly, as on Figure 9 for the Karate graph, some nodes are difficult to assign to a given cluster and therefore appear in intermediate positions. This explains the presence of some small clusters in between larger ones: this is the case, for example, of the small cluster at the extreme right of the figure, which contains Javert, the policeman who is pursuing Valjean (one of the characters of the larger cluster just above the small isolated one). But this character is also closely related to the Thénardier family (the larger cluster below the small isolated one) because they capture and imprison him at the end of the novel.
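The intermediate positions mentioned here follow from the fuzzy layout principle of Section 4.2: a vertex can be drawn at the barycenter of the cluster positions in the prior structure, weighted by its assignment probabilities E_R(M_ik). The following sketch illustrates this rule under our own assumptions (a hypothetical one-dimensional prior structure and made-up probabilities, not values from the paper):

```python
import numpy as np

def fuzzy_positions(assign_probs, cluster_pos):
    """Place each vertex at the barycenter of the cluster positions,
    weighted by its soft assignment probabilities E_R(M_ik).

    assign_probs: (n_vertices, n_clusters), each row summing to 1
    cluster_pos:  (n_clusters, 2) coordinates of the prior structure
    """
    return assign_probs @ cluster_pos

# Illustrative 1x3 prior structure (a line of three clusters):
grid = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
probs = np.array([
    [1.0, 0.0, 0.0],   # clearly assigned vertex -> sits on its cluster
    [0.5, 0.5, 0.0],   # ambiguous between clusters 1 and 2 -> midway
    [1/3, 1/3, 1/3],   # almost flat probabilities -> barycentric position
])
print(fuzzy_positions(probs, grid))
```

The second and third rows illustrate exactly the two situations discussed in the text: an ambiguous vertex lands between its candidate clusters, while a vertex with flat probabilities lands at the barycenter of the whole structure.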
Figure 17: Fuzzy layout of the final configuration of “Les Misérables” (the Fruchterman-Reingold algorithm was used to reduce edge overlapping).

The small cluster situated between the large central cluster and the cluster at the bottom left part of the map is also interesting. It contains 3 characters that act as connections between several secondary characters and the rest of the graph: these characters belong to cluster number 5 in clusterings
M5 and M8.

The second source of differences between M5 and the fuzzy layout is that several characters are extremely difficult to assign to a cluster and therefore have almost flat assignment probabilities E_R(M_ik), even at low temperature. The corresponding nodes then assume an almost barycentric position in the fuzzy layout: this explains the presence of a central node in Figure 17. The characters assigned to this central node are not connected and this cluster should not be seen as a meaningful one. On the contrary, it contains very isolated characters who belong to clusters 5, 2 or even 1 in clustering M8. All these characters have in common that they share only a single (low weight) connection with another character of the network. They are very secondary characters in the novel. None of those ambiguities could have been detected with a standard two phases approach or with the SOM variants.

In addition, the SOM variants obtain poor results on this graph. As shown in Table 1, they do not provide any global Pareto optimal points. It is therefore not very surprising to obtain unsatisfactory visual representations from those methods. Figure 18 represents the results associated to clustering M1 in Table 1 (obtained with the heat kernel SOM). The resulting clustering has numerous non-empty clusters and leads to a quite cluttered visual representation. In addition, some clusters seem completely arbitrary: for instance, clusters number 5, 10, 11, 13 and 16 are a seemingly random clustering of the characters related only to the Myriel character. The global impression is that the heat kernel SOM suffers from a tendency to build too fine clusterings on a somewhat arbitrary basis.

The clusters and layout obtained by the Laplacian's inverse kernel SOM are given on Figure 19. Results are quite different from those of the heat kernel SOM. Indeed, the clustered layout is very easy to read as there is no edge crossing. However, the rather low modularity of M4 (see Table 1) is an indication of the limited quality of the clustering.
Figure 18: Representation of the M1 clustering obtained with the heat kernel SOM on the original layout (left) and via the prior structure (right).
Figure 19: Representation of the M4 clustering obtained with the Laplacian's inverse kernel SOM on the original layout (left) and via the prior structure (right).

For the “C. Elegans” dataset, Figure 20 compares four solutions:

• CE1 is the layout obtained by the kernel SOM with the highest modularity (chosen to avoid misleading clusterings);

• CE8 is the layout obtained by the two phases approach and is induced by the clustering with the highest modularity;

• CE6 and CE7 are the two Pareto points obtained by our method: CE7 has the smallest modularity but also the smaller number of edge crossings of these two solutions. Both solutions have a high modularity (larger than 0.4) and the modularity of CE6 is very close to that of CE8, but with a smaller number of edge crossings.

Despite a very small number of edge crossings, CE1 is poorly informative: three clusters (3, 6 and 8) contain more than 80% of the vertices of the graph and the other clusters are thus very small compared to the three largest ones. The small clusters are then not very informative and their existence probably explains the poor modularity of the clustering. In fact, all solutions provided by the kernel SOM for this dataset share the same problem, with some large clusters and some very small ones. This type of behavior is neither specific to the chosen graph, nor to the Laplacian's inverse kernel, as shown on Figure 19 and in [4] for the heat kernel.

CE8 shows another type of layout problem: the graph induced by the clustering is almost a complete graph (up to cluster 10, the graph is complete). The only information conveyed by this layout is that nodes in cluster 10 are only connected to nodes in cluster 9. The width of the edges can be used to infer that some clusters are only loosely connected (e.g., 12 and 14), but grasping the general organization of the graph is quite difficult with this layout.

CE6 and CE7 both have a slightly smaller modularity than CE8, but they are easier to read because of a smaller number of edge crossings.
Figure 20: Clustering induced graphs for the “C. Elegans” dataset. Top left: displayed on the prior structure (for CE1, obtained by kernel SOM); Top right: displayed by the Fruchterman and Reingold algorithm (for CE8, obtained by maximization of the modularity); Bottom: displayed on the prior structure (for CE6 and CE7, obtained by maximization of the organized modularity).

Figure 21: Fuzzy layout during the final cooling phase for the solution CE6 of “C. Elegans”.
In addition, fuzzy layouts such as the one provided in Figure 21 can be used to analyze how clusters are built during annealing. In the particular case of solution CE6, the layout shows that, e.g., cluster 8 is formed at the end of the annealing. The analyst can investigate first the associated nodes and their connection pattern to outside clusters, or, on the contrary, focus his/her attention on well established clusters.

Finally, Figure 22 provides two layouts of the graph “E-mail”. We chose to represent the solution obtained by the two phases approach (E11, see Table 3) and the Pareto point obtained by our method with the smallest number of edge crossings (E10).

As in the previous example, the solution provided by the optimization of the organized modularity, despite a slightly worse clustering quality, gives a more understandable simplification of the graph than the two phases approach. In fact, the graph associated to E11 is complete and one has to rely on the width of the edges to try to infer the importance of the relations between clusters. The layout associated to E10 therefore seems easier to grasp. In addition, the associated fuzzy layout (see Figure 23) shows that most of the clusters are well defined and identifies a few nodes that are difficult to assign to clusters. Further investigations could target those nodes.
Figure 22: Clustering induced graphs for the “E-mail” dataset. Left: displayed on the prior structure (E10, obtained by maximization of the organized modularity); Right: displayed by the Fruchterman and Reingold algorithm (E11, obtained by maximization of the modularity).
Several conclusions can be drawn from the experiments conducted on the four real world graphs. Firstly, maximizing the organized modularity leads to good solutions with a high modularity and an acceptable number of edge crossings. In some situations, maximizing the organized modularity actually gives a clustering with a higher modularity, in a similar way to what can be observed for the SOM versus the k-means. However, the main gain is a better trade-off between modularity and readability of the clustering induced graph: we trade a small reduction in modularity for either a larger number of clusters (limiting the general tendency of the modularity to select a small number of clusters) or a reduced number of edge crossings (or both). All in all, organized modularity maximization therefore brings simplified representations of graphs that seem more informative than the standard two phases approach. In addition, as the modularity is almost optimal, the representations provided by maximization of the organized modularity are reasonably faithful, especially compared to the low quality results generally obtained by graph adapted SOMs. While the final layouts obtained with the organized modularity are not perfect (it is quite clear, for instance, that the left graph on Figure 15 is planar and could be rendered with no edge crossing), their ordered and regular nature makes them very readable. In conclusion, we consider that a good practice consists in combining our method with the two phases approach to get different and complementary views of the same graph. A future implementation challenge is to provide linked multi-views [1] of a graph that would give the analyst a visual comparison method between those views.

In addition, relying on deterministic annealing provides a path in the clustering space that is worth exploring via intermediate fuzzy layouts. The main advantage of those layouts is to display the progressive increase in the number of clusters and to provide hints on the final assignment of vertices.
Atypical vertices are clearly pinpointed with this approach and can be further studied by the analyst. Our use of the fuzzy layout was quite limited in this paper, as we believe it should be provided in an interactive environment in which the analyst can navigate in the algorithm results. Further studies are needed to validate the interest of the concept. In addition, as those layouts are closer to clustered layouts than the rest of our work, comparisons with the latter methods (e.g., [35]) would be interesting.

Figure 23: Fuzzy layout during the final cooling phase for the solution E10 of “E-mail”.

Those results show that the combination of modularity maximization and edge crossings minimization is very well adapted to graph visualization. It is then quite natural to observe that SOM variants are not adapted to the targeted application. The analysis of the Karate graph has shown that they are not able to recover the ground truth clustering. The graph “Les Misérables” showed quite strong limitations of the clusterings obtained by SOM variants (with meaningless clusters) without any particular gain in terms of readability of the clustering induced graph. This was confirmed on “C. Elegans” in terms of layout and on “E-mail” via the quality criteria.
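To summarize the quantities compared in this section, the sketch below computes Newman and Girvan's modularity and a version of the organized measure in which the cluster indicator δ(c(i), c(j)) is replaced by a similarity S between the clusters of the prior structure, with S_kk = 1. The function names, the toy graph and the normalization details are our own illustration of the idea, not the paper's exact implementation:

```python
import numpy as np

def organized_modularity(W, clusters, S):
    """Organized modularity: Newman-Girvan's modularity with the cluster
    indicator delta(c_i, c_j) replaced by a similarity S[k, l] between
    clusters k and l of the prior structure (S[k, k] = 1).
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)              # (weighted) degrees
    m2 = d.sum()                   # 2m, twice the total edge weight
    B = W - np.outer(d, d) / m2    # modularity matrix W_ij - d_i d_j / 2m
    return (S[np.ix_(clusters, clusters)] * B).sum() / m2

def modularity(W, clusters):
    """Ordinary modularity: special case where S is the identity."""
    k = max(clusters) + 1
    return organized_modularity(W, clusters, np.eye(k))

# Two triangles joined by one edge: the natural 2-clustering scores well.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
c = [0, 0, 0, 1, 1, 1]
print(round(modularity(W, c), 4))  # → 0.3571
```

A smooth S (e.g., a Gaussian kernel on the grid distances of the prior structure) rewards clusterings whose between-cluster edges connect neighboring cells of the grid, which is what makes the resulting layouts ordered.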
7. Conclusion
We have proposed in this paper a new organized modularity quality measure for graph clustering, inspired by the topographic mapping paradigm initiated by Kohonen's SOM. The organized modularity aims at producing a clustering of a graph that respects constraints coming from a prior structure. This prior structure is then used to display the clusters in an ordered way. A deterministic annealing scheme has been derived to maximize the organized modularity efficiently. Its iterative nature can be leveraged to provide intermediate layouts of the graph that emphasize the progressive construction of the final clustering, pinpointing sub-clusters, atypical vertices, etc.

An experimental study conducted on four real world graphs ranging from 34 to 1 133 vertices has shown that the proposed method outperforms similar approaches based on adaptations of the SOM to graph data, both in terms of clustering quality and in terms of readability of the clustering induced graphs. The proposed method also gives better or complementary results compared to a two phases approach in which one first builds a graph clustering that maximizes the standard modularity and then uses a graph visualization algorithm to display the clustering induced graph. Finally, the computational cost of the whole approach (which includes the optimization of an influence parameter in the prior structure) remains acceptable and is compatible with graphs with a few thousand nodes.
Acknowledgment
The authors thank the anonymous referees for their valuable comments that helped improve this paper.
A. Derivations of the deterministic annealing algorithm
A.1. Equivalence between O and F

We have
\[
O(M) = F(M) + \frac{1}{2m} \sum_i \sum_{k,l} M_{ik} S_{kl} M_{il} (W_{ii} - P_{ii}).
\]
Then, as $M$ is an assignment matrix, $M_{ik}$ is non zero (and equal to one) only when $k = c(i)$, and therefore
\[
O(M) = F(M) + \frac{1}{2m} \sum_i S_{c(i)c(i)} (W_{ii} - P_{ii}) = F(M) + \frac{1}{2m} \sum_i (W_{ii} - P_{ii}),
\]
as $S_{kk} = 1$ for all $k$. Therefore $O(M) - F(M)$ is independent of $M$, and maximizing $F$ is equivalent to maximizing $O$.

A.2. Mean field equations
Let us denote $KL(E) = KL(R\|P) = \sum_M R(M,E) \ln \frac{R(M,E)}{P(M)}$. At a minimum of $KL$, the partial derivatives $\frac{\partial KL}{\partial E_{jl}}$ must be equal to zero. Those derivatives can be computed easily from the definition of $R(M,E)$. We first note that
\[
KL(E) = \sum_M R(M,E) \ln R(M,E) - \beta \sum_M R(M,E) (F(M) - \log Z_F) = \sum_M R(M,E) \ln R(M,E) - \beta E_R(F(M)) + \beta \log Z_F.
\]
We therefore need to compute $\frac{\partial}{\partial E_{jl}} \left( \sum_M R(M,E) \ln R(M,E) \right)$. We have
\[
\frac{\partial U(M,E)}{\partial E_{jl}} = M_{jl}, \qquad \frac{\partial \exp(\beta U(M,E))}{\partial E_{jl}} = \beta M_{jl} \exp(\beta U(M,E)).
\]
As $Z_R(E) = \sum_M \exp(\beta U(M,E))$, we have
\[
\frac{\partial Z_R(E)}{\partial E_{jl}} = \beta \sum_M M_{jl} \exp(\beta U(M,E)) = \beta Z_R(E) E_R(M_{jl}).
\]
Therefore, recalling $R(M,E) = \frac{\exp(\beta U(M,E))}{Z_R(E)}$, we have
\[
\frac{\partial R(M,E)}{\partial E_{jl}} = -\frac{\exp(\beta U(M,E))}{Z_R(E)^2} \frac{\partial Z_R(E)}{\partial E_{jl}} + \frac{\beta M_{jl} \exp(\beta U(M,E))}{Z_R(E)} = \beta \left( M_{jl} - E_R(M_{jl}) \right) R(M,E).
\]
Then
\[
\sum_M \ln Z_R(E) \frac{\partial R(M,E)}{\partial E_{jl}} = \beta \ln Z_R(E) \sum_M \left( M_{jl} - E_R(M_{jl}) \right) R(M,E) = 0,
\]
because $Z_R(E)$ does not depend on $M$ and by definition of $E_R(M_{jl})$. Moreover,
\[
\sum_M \frac{\partial \ln Z_R(E)}{\partial E_{jl}} R(M,E) = \sum_M \frac{\beta Z_R(E) E_R(M_{jl})}{Z_R(E)} R(M,E) = \beta E_R(M_{jl}),
\]
and therefore
\[
\frac{\partial}{\partial E_{jl}} \left( \sum_M \ln Z_R(E) \, R(M,E) \right) = \beta E_R(M_{jl}),
\]
which leads to
\[
\frac{\partial}{\partial E_{jl}} \left( \sum_M R(M,E) \ln R(M,E) \right) = \frac{\partial}{\partial E_{jl}} \left( \beta \sum_M U(M,E) R(M,E) \right) - \beta E_R(M_{jl}).
\]
By definition of $U(M,E)$, we have
\[
\sum_M U(M,E) R(M,E) = \sum_M \sum_{ik} E_{ik} M_{ik} R(M,E) = \sum_{ik} E_{ik} E_R(M_{ik}).
\]
Then, by independence, $\frac{\partial E_R(M_{ik})}{\partial E_{jl}} = 0$ when $j \neq i$. Indeed, we have
\[
\frac{\partial E_R(M_{ik})}{\partial E_{jl}} = \sum_M M_{ik} \frac{\partial R(M,E)}{\partial E_{jl}} = \beta \sum_M M_{ik} \left( M_{jl} - E_R(M_{jl}) \right) R(M,E) = \beta \left( E_R(M_{ik} M_{jl}) - E_R(M_{ik}) E_R(M_{jl}) \right) = 0,
\]
by independence of $M_{ik}$ and $M_{jl}$ under $R$ when $i \neq j$ (this simplification is the motivation for using a bilinear cost function $U$ and therefore a distribution $R$ that factorizes). Finally, we have
\[
\frac{\partial}{\partial E_{jl}} \left( \sum_M U(M,E) R(M,E) \right) = E_R(M_{jl}) + \sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} E_{jk},
\]
and therefore
\[
\frac{\partial KL(E)}{\partial E_{jl}} = \beta \left( \sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} E_{jk} - \frac{\partial E_R(F(M))}{\partial E_{jl}} \right),
\]
from which we obtain the mean field equations (8).

A.3. Expectation minimization scheme
We need to compute $\frac{\partial E_R(F(M))}{\partial E_{jl}}$ starting from
\[
E_R(F(M)) = \sum_{u \neq v} \sum_{k,t} E_R(M_{uk}) S_{kt} E_R(M_{vt}) B_{uv}.
\]
As shown above, $\frac{\partial E_R(M_{ik})}{\partial E_{jl}} = 0$ when $j \neq i$, and therefore, when $u \neq j$ and $v \neq j$,
\[
\frac{\partial \left( E_R(M_{uk}) E_R(M_{vt}) \right)}{\partial E_{jl}} = 0.
\]
Then
\[
\frac{\partial E_R(F(M))}{\partial E_{jl}} = \sum_{u \neq v} B_{uv} \sum_{k,t} S_{kt} \frac{\partial \left( E_R(M_{uk}) E_R(M_{vt}) \right)}{\partial E_{jl}} = \sum_{u \neq j} B_{uj} \sum_{k,t} S_{kt} E_R(M_{uk}) \frac{\partial E_R(M_{jt})}{\partial E_{jl}} + \sum_{v \neq j} B_{jv} \sum_{k,t} S_{kt} E_R(M_{vt}) \frac{\partial E_R(M_{jk})}{\partial E_{jl}} = 2 \sum_{u \neq j} B_{uj} \sum_{k,t} S_{kt} E_R(M_{uk}) \frac{\partial E_R(M_{jt})}{\partial E_{jl}},
\]
using the symmetry of $B$ and $S$ for the last equality. Then, if for all $j$ and $k$ we set the values of $E_{jk}$ to
\[
E_{jk} = 2 \sum_{i \neq j} \sum_l E_R(M_{il}) S_{kl} B_{ij},
\]
we obviously have
\[
\sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} E_{jk} = 2 \sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} \sum_{i \neq j} \sum_t E_R(M_{it}) S_{kt} B_{ij} = 2 \sum_{i \neq j} B_{ij} \sum_{k,t} S_{kt} E_R(M_{it}) \frac{\partial E_R(M_{jk})}{\partial E_{jl}} = \frac{\partial E_R(F(M))}{\partial E_{jl}},
\]
and the mean field equations are fulfilled.

A.4. Fixed point analysis

Finding the mean field $E$ via the EM-like scheme given by equations (9) and (10) corresponds to looking for a fixed point of the following matrix valued function:
\[
G_{jk}(E) = 2 \sum_{i \neq j} B_{ij} \sum_l S_{kl} \frac{\exp(\beta E_{il})}{\sum_p \exp(\beta E_{ip})}. \quad (16)
\]
The stability of a fixed point $\overline{E}$ can be analyzed with a first order Taylor expansion (where $\|.\|_F$ is the Frobenius norm)
\[
G_{jk}(E) = G_{jk}(\overline{E}) + \sum_u \sum_t \frac{\partial G_{jk}}{\partial E_{ut}}(\overline{E}) \left( E_{ut} - \overline{E}_{ut} \right) + o\left( \|E - \overline{E}\|_F \right),
\]
which recalls that the stability is governed by the eigenvalues of the $NC \times NC$
Jacobian matrix $\left( \frac{\partial G_{jk}}{\partial E_{ut}} \right)_{(j,k),(u,t)}$.

Obviously, $\frac{\partial G_{jk}}{\partial E_{jt}} = 0$. When $u \neq j$, we have
\[
\frac{\partial G_{jk}}{\partial E_{ut}} = 2 B_{uj} \sum_l S_{kl} \frac{\partial}{\partial E_{ut}} \left( \frac{\exp(\beta E_{ul})}{\sum_p \exp(\beta E_{up})} \right) = \frac{2 B_{uj} \beta \exp(\beta E_{ut})}{\left( \sum_p \exp(\beta E_{up}) \right)^2} \left( S_{kt} \sum_p \exp(\beta E_{up}) - \sum_l S_{kl} \exp(\beta E_{ul}) \right) = 2 \beta B_{uj} E_R(M_{ut}) \left( S_{kt} - \sum_l S_{kl} E_R(M_{ul}) \right).
\]
At the limit of infinite temperature (when $\beta = 0$), equation (9) leads to $E_R(M_{jk}) = \frac{1}{C}$, and therefore, for the high temperature fixed point $\overline{E}$,
\[
\frac{\partial G_{jk}}{\partial E_{ut}}(\overline{E}) = \frac{2 B_{uj} \beta}{C} \left( S_{kt} - \frac{1}{C} \sum_l S_{kl} \right),
\]
while equation (10) gives in addition
\[
\overline{E}_{jk} = \frac{2}{C} \sum_{i \neq j} B_{ij} \sum_l S_{kl}.
\]
Let us denote $H = S \left( I - \frac{1}{C} \mathbb{1} \right)$ the centered similarity matrix ($I$ is the $C \times C$ identity matrix and $\mathbb{1}$ is the $C \times C$ matrix with all terms equal to 1). Using the symmetry of $S$, we have
\[
\frac{\partial G_{jk}}{\partial E_{ut}}(\overline{E}) = \frac{2 B_{uj} \beta}{C} H_{tk}.
\]
Then, using the symmetry of $B$, this leads to
\[
\sum_u \sum_t \frac{\partial G_{jk}}{\partial E_{ut}}(\overline{E}) \Delta_{ut} = \frac{2\beta}{C} (B \Delta H)_{jk},
\]
for any $N \times C$ matrix $(\Delta_{ut})_{ut}$. Let $\lambda_S$ denote $S$'s largest eigenvalue (in absolute value if $S$ is not positive) and let $\lambda_B$ denote $B$'s largest eigenvalue (also in absolute value). Let $\delta$ be a vector of $\mathbb{R}^C$; then $\|\delta^T S\| \leq \lambda_S \|\delta\|$ (where $\|.\|$ is the Euclidean norm). As $I - \frac{1}{C}\mathbb{1}$ has eigenvalues 1 and 0, $\|\delta^T H\| \leq \lambda_S \|\delta\|$. In addition, if $\mu$ is a vector in $\mathbb{R}^N$, $\|B\mu\| \leq \lambda_B \|\mu\|$. Then $\|B \Delta H\|_F \leq \lambda_S \lambda_B \|\Delta\|_F$, and therefore
\[
\|G(E) - G(\overline{E})\|_F \leq \frac{2 \beta \lambda_S \lambda_B}{C} \|E - \overline{E}\|_F + o\left( \|E - \overline{E}\|_F \right).
\]
Therefore, when the temperature $1/\beta$ is higher than $\frac{2 \lambda_B \lambda_S}{C}$, $\|G(E) - G(\overline{E})\|_F < \|E - \overline{E}\|_F$ and the fixed point $\overline{E}$ is stable: small perturbations vanish.

References

[1] A.
Becker and S. Cleveland. Brushing scatterplots. Technometrics, 29(2):127–142, 1987.

[2] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215–234, 1998.

[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[4] R. Boulet, B. Jouve, F. Rossi, and N. Villa. Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing, 71(7–9):1257–1273, March 2008.

[5] R. Bourqui, D. Auber, and P. Mary. How to draw clustered weighted graphs using a multilevel force-directed graph drawing algorithm. In Proceedings of the 11th International Conference Information Visualization (IV'07), pages 757–764, July 2007. doi:10.1109/IV.2007.65.

[6] C. T. Butts. network: a package for managing relational data in R. Journal of Statistical Software, 24(2), 2008.

[7] V. Cerny. A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, January 1985.

[8] G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. http://igraph.sf.net.

[9] G. Di Battista, P. Eades, R. Tamassia, and I. G. Tollis. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, 1999.

[10] J. Duch and A. Arenas. Community identification using extremal optimization. Physical Review E, 72(027104), 2005.

[11] P. Eades, Q.-W. Feng, X. Lin, and H. Nagamochi. Straight-line drawing algorithms for hierarchical graphs and clustered graphs. Algorithmica, 44(1):1–32, 2006.

[12] S. Fabrikant, D. Montello, M. Ruocco, and R. Middleton. The distance-similarity metaphor in network-display spatializations. Cartography and Geographic Information Science, 31(4):237–252, 2004.

[13] S. Fortunato and M. Barthélemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007. doi:10.1073/pnas.0605965104.

[14] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, pages 355–369, March 2007.

[15] T. M. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software – Practice and Experience, 21(11):1129–1164, 1991.

[16] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[17] T. Graepel, M. Burger, and K. Obermayer. Phase transitions in stochastic self-organizing maps. Physical Review E, 56(4):3876–3890, 1997.

[18] T. Graepel, M. Burger, and K. Obermayer. Self-organizing maps: generalizations and new optimization techniques. Neurocomputing, 21:173–190, November 1998.

[19] R. Guimera, L. Danon, A. Diaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Physical Review E, 68(065103(R)), 2003.

[20] B. Hammer, A. Hasenfuss, F. Rossi, and M. Strickert. Topographic processing of relational data. In Proceedings of the 6th International Workshop on Self-Organizing Maps (WSOM 07), Bielefeld (Germany), September 2007.

[21] I. Herman, G. Melançon, and M. Scott Marshall. Graph visualization and navigation in information visualisation. IEEE Transactions on Visualization and Computer Graphics, 6(1):24–43, 2000.

[22] T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.

[23] T. Jaakkola. Tutorial on variational approximation methods. In M. Opper and D. Saad, editors, Advanced Mean Field Methods: Theory and Practice, pages 129–159. MIT Press, 2000.

[24] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, May 1983.

[25] D. Knuth. The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading, MA, 1993.

[26] T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, third edition, 1995. Last edition published in 2001.

[27] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, pages 315–322, 2002.

[28] J. A. Lee and M. Verleysen. Quality assessment of dimensionality reduction: rank-based criteria. Neurocomputing, 72(7–9):1431–1443, 2009. doi:10.1016/j.neucom.2008.12.017.

[29] S. Lehmann and L. K. Hansen. Deterministic modularity optimization. European Physical Journal B, 60(1):83–88, November 2007. doi:10.1140/epjb/e2007-00313-2.

[30] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167–256, 2003.

[31] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3), 2006. doi:10.1103/PhysRevE.74.036104.

[32] M. E. J. Newman. Detecting community structure in networks. European Physical Journal B, 38(2):321–330, March 2004. doi:10.1140/epjb/e2004-00124-y.

[33] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006. doi:10.1073/pnas.0601602103.

[34] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004. doi:10.1103/PhysRevE.69.026113.

[35] A. Noack. Energy models for graph clustering. Journal of Graph Algorithms and Applications, 11(2):453–480, 2007.

[36] A. Noack. Modularity clustering is force-directed layout. Physical Review E, 79(026102), February 2009.

[37] A. Noack and R. Rotta. Multi-level algorithms for modularity clustering. Preprint, Brandenburg University of Technology at Cottbus, 2008. http://arxiv.org/abs/0812.4073.

[38] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501–516, 2002.

[39] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.

[40] J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(016110), 2006.

[41] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, November 1998. doi:10.1109/5.726788.

[42] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, August 2007.

[43] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[44] A. Smola and R. Kondor. Kernels and regularization on graphs. In M. Warmuth and B. Schölkopf, editors, Proceedings of the Conference on Learning Theory (COLT) and Kernel Workshop, 2003.

[45] J. Vesanto. SOM-based data visualization methods. Intelligent Data Analysis, 3(2):111–126, 1999.

[46] J. Vesanto. Data Exploration Process Based on the Self-Organizing Map. PhD thesis, Helsinki University of Technology, Espoo (Finland), May 2002. Acta Polytechnica Scandinavica, Mathematics and Computing Series No. 115.

[47] N. Villa and F. Rossi. A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. In Proceedings of the 6th International Workshop on Self-Organizing Maps (WSOM 07), Bielefeld (Germany), September 2007.

[48] N. Villa, F. Rossi, and Q.-D. Truong. Mining a medieval social network by kernel SOM and related methods. In Proceedings of MASHS 2008 (Modèles et Apprentissage en Sciences Humaines et Sociales), Créteil, France, June 2008.

[49] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[50] C. Ware, H. Purchase, L. Colpoys, and M. McGill. Cognitive measurements of graph aesthetics. Information Visualization, 1(2):103–110, 2002.

[51] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.

[52] D. Watts and S. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, April 1998.

[53] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens. Graph nodes clustering with the sigmoid commute-time kernel: a comparative study. Data & Knowledge Engineering, 68(3):338–361, 2009. doi:10.1016/j.datak.2008.10.006.

[54] W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

[55] D. Zhang and R. Mao. A new kernel for classification of networked entities. In Proceedings of 6th International Workshop on Mining and Learning with Graphs, Helsinki, Finland, 2008.