Spectral redemption: clustering sparse networks
Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, Pan Zhang
ESPCI and CNRS UMR 7083, 10 rue Vauquelin, Paris 75005
Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA
University of California, Berkeley
Institut de Physique Théorique, CEA Saclay and URA 2306, CNRS, 91191 Gif-sur-Yvette, France
(Dated: August 26, 2013)

Spectral algorithms are classic approaches to clustering and community detection in networks. However, for sparse networks the standard versions of these algorithms are suboptimal, in some cases completely failing to detect communities even when other algorithms such as belief propagation can do so. Here we introduce a new class of spectral algorithms based on a non-backtracking walk on the directed edges of the graph. The spectrum of this operator is much better-behaved than that of the adjacency matrix or other commonly used matrices, maintaining a strong separation between the bulk eigenvalues and the eigenvalues relevant to community structure even in the sparse case. We show that our algorithm is optimal for graphs generated by the stochastic block model, detecting communities all the way down to the theoretical limit. We also show the spectrum of the non-backtracking operator for some real-world networks, illustrating its advantages over traditional spectral clustering.
Detecting communities or modules is a central task in the study of social, biological, and technological networks. Two of the most popular approaches are statistical inference, where we fit a generative model such as the stochastic block model to the network [1, 2]; and spectral methods, where we classify vertices according to the eigenvectors of a matrix associated with the network, such as its adjacency matrix or Laplacian [3].

Both statistical inference and spectral methods have been shown to work well in networks that are sufficiently dense, or when the graph is regular [4–8]. However, for sparse networks with widely varying degrees, the community detection problem is harder. Indeed, it was recently shown [9–11] that there is a phase transition below which communities present in the underlying block model are impossible for any algorithm to detect. While standard spectral algorithms succeed down to this transition when the network is sufficiently dense, with an average degree growing as a function of network size [8], in the case where the average degree is constant these methods fail significantly above the transition [12]. Thus there is a large regime in which statistical inference succeeds in detecting communities, but where current spectral algorithms fail.

It was conjectured in [11] that this gap is artificial, and that there exists a spectral algorithm that succeeds all the way down to the detectability transition even in the sparse case. Here we propose an algorithm based on a linear operator considerably different from the adjacency matrix or its variants: namely, a matrix that represents a walk on the directed edges of the network, with backtracking prohibited. We give strong evidence that this algorithm indeed closes the gap.

The fact that this operator has better spectral properties than, for instance, the standard random walk operator has been used in the past in the context of random matrices and random graphs [13–15].
In the theory of zeta functions of graphs, it is known as the edge adjacency operator, or the Hashimoto matrix [16]. It has been used to show fast mixing for the non-backtracking random walk [17], and arises in connection with belief propagation [18, 19], in particular in rigorous analyses of the behavior of belief propagation for clustering problems on regular graphs [5]. It has also been used as a feature vector to classify graphs [20]. However, using this operator as a foundation for spectral clustering and community detection appears to be novel.

We show that the resulting spectral algorithms are optimal for networks generated by the stochastic block model, finding communities all the way down to the detectability transition. That is, at any point above this transition, there is a gap between the eigenvalues related to the community structure and the bulk distribution of eigenvalues coming from the random graph structure, allowing us to find a labeling correlated with the true communities. In addition to our analytic results on stochastic block models, we also illustrate the advantages of the non-backtracking operator over existing approaches for some real networks.

FIG. 1: The spectrum of the adjacency matrix of a sparse network generated by the block model (excluding the zero eigenvalues). Here n = 4000, c_in = 5, and c_out = 1, and we average over many realizations. Even though the eigenvalue λ_c = 3.5 given by (2) satisfies the threshold condition (1) and lies outside the semicircle of radius 2√c ≈ 3.46, deviations from the semicircle law cause it to get lost in the bulk, and the eigenvector of the second largest eigenvalue is uncorrelated with the community structure. As a result, spectral algorithms based on A are unable to identify the communities in this case.

I. SPECTRAL CLUSTERING AND SPARSE NETWORKS
In order to study the effectiveness of spectral algorithms in a specific ensemble of graphs, suppose that a graph G is generated by the stochastic block model [1]. There are q groups of vertices, and each vertex v has a group label g_v ∈ {1, ..., q}. Edges are generated independently according to a q × q matrix p of probabilities, with Pr[A_{u,v} = 1] = p_{g_u, g_v}. In the sparse case, we have p_{ab} = c_{ab}/n, where the affinity matrix c_{ab} stays constant in the limit n → ∞. For simplicity we first discuss the commonly studied case where c has two distinct entries, c_{ab} = c_in if a = b and c_out if a ≠ b. We take q = 2 with two groups of equal size, and assume that the network is assortative, i.e., c_in > c_out. We summarize the general case of more groups, arbitrary degree distributions, and so on in subsequent sections below.

The group labels are hidden from us, and our goal is to infer them from the graph. Let c = (c_in + c_out)/2 denote the average degree. The detectability threshold [9–11] states that in the limit n → ∞, unless

c_in − c_out > 2√c,   (1)

the randomness in the graph washes out the block structure to the extent that no algorithm can label the vertices better than chance. Moreover, [11] proved that below this threshold it is impossible to identify the parameters c_in and c_out, while above the threshold these parameters are easily identifiable.

The adjacency matrix is defined as the n × n matrix with A_{u,v} = 1 if (u, v) ∈ E and 0 otherwise. A typical spectral algorithm assigns each vertex a k-dimensional vector according to its entries in the first k eigenvectors of A for some k, and clusters these vectors according to a heuristic such as the k-means algorithm (often after normalizing or weighting them in some way).
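To make the model concrete, here is a minimal sketch of a block-model sampler for the two-group case described above. The function name and interface are ours, not part of the paper; it simply draws each edge independently with probability c_in/n within groups and c_out/n across groups.

```python
import numpy as np

def sample_sbm(n, c_in, c_out, rng=None):
    """Sample a two-group stochastic block model with equal group sizes.

    Edge probabilities are c_in/n within groups and c_out/n across groups,
    so the average degree is c = (c_in + c_out)/2.  Returns the symmetric
    adjacency matrix A and the true group labels sigma (+1 / -1).
    """
    rng = np.random.default_rng(rng)
    sigma = np.ones(n, dtype=int)
    sigma[n // 2:] = -1                            # two equal-sized groups
    same = np.equal.outer(sigma, sigma)            # True where g_u == g_v
    p = np.where(same, c_in / n, c_out / n)        # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < p, k=1)   # independent edges, u < v
    A = (upper | upper.T).astype(int)              # symmetrize, no self-loops
    return A, sigma
```

For the parameters of Fig. 1 (c_in = 5, c_out = 1), the empirical mean degree of a sample concentrates around c = 3.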
In the case q = 2, we can simply label the vertices according to the sign of the second eigenvector.

As shown in [8], spectral algorithms succeed all the way down to the threshold (1) if the graph is sufficiently dense. In that case A's spectrum has a discrete part and a continuous part in the limit n → ∞. Its first eigenvector essentially sorts vertices according to their degree, while the second eigenvector is correlated with the communities. The second eigenvalue is given by

λ_c = (c_in − c_out)/2 + (c_in + c_out)/(c_in − c_out).   (2)

The question is when this eigenvalue gets lost in the continuous bulk of eigenvalues coming from the randomness in the graph. This part of the spectrum, like that of a sufficiently dense Erdős–Rényi random graph, is asymptotically distributed according to Wigner's semicircle law [21],

P(λ) = (1/(2πc)) √(4c − λ²).

FIG. 2: The spectrum of the non-backtracking matrix B for a network generated by the block model with the same parameters as in Fig. 1. The leading eigenvalue is at c = 3, the second eigenvalue is close to µ_c = (c_in − c_out)/2 = 2, and the bulk of the spectrum is confined to the disk of radius √c = √3. Since µ_c is outside the bulk, a spectral algorithm that labels vertices according to the sign of B's second eigenvector (summed over the incoming edges at each vertex) labels the majority of vertices correctly.

Thus the bulk of A's spectrum lies in the interval [−2√c, 2√c]. If λ_c > 2√c, which is equivalent to (1), the spectral algorithm can find the corresponding eigenvector, and it is correlated with the true community structure.

However, in the sparse case where c is constant while n is large, this picture breaks down for a number of reasons. Most importantly, the leading eigenvalues of A are dictated by the vertices of highest degree, and the corresponding eigenvectors are localized around these vertices [22].
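As a quick numerical companion to (2), the following sketch (function names ours) evaluates the second eigenvalue and the edge 2√c of the semicircle bulk; formula (2) is meaningful above the threshold (1). For the parameters of Fig. 1, λ_c = 3.5 sits just outside the bulk edge 2√3 ≈ 3.46.

```python
import numpy as np

def lambda_c(c_in, c_out):
    """Second eigenvalue of A in the dense regime, Eq. (2)."""
    return (c_in - c_out) / 2.0 + (c_in + c_out) / (c_in - c_out)

def semicircle_edge(c_in, c_out):
    """Edge 2*sqrt(c) of the semicircle bulk, with c the average degree."""
    return 2.0 * np.sqrt((c_in + c_out) / 2.0)
```

In the sparse case the separation is tiny, which is exactly why deviations from the semicircle law swallow λ_c.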
As n grows, these eigenvalues exceed λ_c, swamping the community-correlated eigenvector, if any, with a bulk of uninformative eigenvectors. As a result, spectral algorithms based on A fail a significant distance above the threshold given by (1). Moreover, this gap grows as n increases: for instance, the largest eigenvalue grows as the square root of the largest degree, which is roughly proportional to log n / log log n for Erdős–Rényi graphs. To illustrate this problem, the spectrum of A for a large graph generated by the block model is depicted in Fig. 1.

Other popular operators for spectral clustering include the Laplacian L = D − A, where D_{uv} = d_u δ_{u,v} is the diagonal matrix of vertex degrees, the random walk matrix Q_{uv} = A_{uv}/d_u, and the modularity matrix M_{uv} = A_{uv} − d_u d_v/(2m). However, all of these experience qualitatively the same difficulties as A in the sparse case. Another simple heuristic is to remove the high-degree vertices (e.g. [6]), but this throws away a significant amount of information; in the sparse case it can even destroy the giant component, causing the graph to fall apart into disconnected pieces [23].

II. THE NON-BACKTRACKING OPERATOR
The main contribution of this paper is to show how to redeem the performance of spectral algorithms in sparse networks by using a different linear operator. The non-backtracking matrix B is a 2m × 2m matrix, defined on the directed edges of the graph, where m is the number of edges. Specifically,

B_{(u→v),(w→x)} = 1 if v = w and u ≠ x, and 0 otherwise.

Using B rather than A addresses the problem described above. The spectrum of B is not sensitive to high-degree vertices, since a walk starting at v cannot turn around and return to it immediately. Another convenient property of B is that any tree dangling off the graph, or disconnected from it, simply contributes zero eigenvalues to the spectrum, since a non-backtracking walk is eventually forced to a leaf of the tree, where it has nowhere to go. Similarly, one can show that unicyclic components yield eigenvalues that are either 0 or ±1.

As a result, B has the following spectral properties in the limit n → ∞ in the ensemble of graphs generated by the block model. The leading eigenvalue is the average degree c = (c_in + c_out)/2. At any point above the detectability threshold (1), the second eigenvalue is associated with the block structure and reads

µ_c = (c_in − c_out)/2.   (3)

Moreover, the bulk of B's spectrum is confined to the disk in the complex plane of radius √c, as shown in Fig. 2. As a result, the second eigenvalue is well separated from the top of the bulk, i.e., from the third largest eigenvalue in absolute value, as shown in Fig. 3.

The eigenvector corresponding to µ_c is strongly correlated with the community structure. Since B is defined on directed edges, at each vertex we sum this eigenvector over all its incoming edges. If we label vertices according to the sign of this sum, then the majority of vertices are labeled correctly (up to a global change of sign, which switches the two communities). Thus a spectral algorithm based on B succeeds when µ_c > √c, i.e.,
when (1) holds, but unlike standard spectral algorithms, this criterion now holds even in the sparse case. We present arguments for these claims in the next section.

III. RECONSTRUCTION AND A COMMUNITY-CORRELATED EIGENVECTOR
In this section we sketch justifications of the claims in the previous section regarding B's spectral properties, showing that its second eigenvector is correlated with the communities whenever (1) holds. Let us start by recalling how to generalize equation (2) for the adjacency matrix A of sparse graphs. We follow [11], who derived a similar result in the case of random regular graphs.

With µ = µ_c defined as in (3), for a given integer r, consider the vector

f^(r)_v = µ^{−r} Σ_{u : d(u,v)=r} σ_u,   (4)

where σ_u = ±1 denotes u's community. By the theory of the reconstruction problem on trees [24, 25], if (1) holds then the correlation ⟨f^(r), σ⟩/n is bounded away from zero in the limit n → ∞.

We will show that if r is large, but small compared to the diameter of the graph, then f^(r) is closely related to the second eigenvector of B. Thus if we label vertices according to the sign of this second eigenvector (summed over all incoming edges at each vertex), we obtain the true communities with significant accuracy.

First we show that f^(r) approximately obeys an eigenvalue equation that generalizes (2). As long as the radius-r neighborhood of v is a tree, we have

(A f^(r))_v = µ^{−r} [ Σ_{u : d(u,v)=r+1} σ_u + (d_v − 1) Σ_{u : d(u,v)=r−1} σ_u ],

so

(A f^(r))_v = µ f^(r+1)_v + (d_v − 1) µ^{−1} f^(r−1)_v.   (5)

Summing over v's neighborhood gives the expectation

E[ Σ_{u ∈ N(v)} σ_u ] = µ σ_v,

and summing the fluctuations over the (in expectation) c^r vertices at distance r gives

| f^(r)_v − f^(r±1)_v | = O( c^{r/2} µ^{−r} ).

If µ = µ_c and (1) holds, so that µ_c > √c, these fluctuations tend to zero for large r. In that case we can identify f^(r) with f^(r±1), and (5) becomes

A f = µ f + (D − 1) µ^{−1} f.   (6)

In particular, in the dense case we can recover (2) by approximating D with c, or equivalently pretending that the graph is c-regular.
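The vector (4) is easy to compute directly by breadth-first search, which gives a hands-on check of the reconstruction claim. The following is a sketch with our own function name; it sums the true spins σ_u over vertices at distance exactly r and rescales by µ^{−r}.

```python
import numpy as np
from collections import deque

def f_r(A, sigma, mu, r):
    """The vector f^(r) of Eq. (4): f_v = mu^{-r} * sum of sigma_u over
    vertices u at graph distance exactly r from v (computed by BFS)."""
    n = A.shape[0]
    nbrs = [np.nonzero(A[v])[0] for v in range(n)]
    f = np.zeros(n)
    for v in range(n):
        dist = {v: 0}
        queue = deque([v])
        while queue:
            w = queue.popleft()
            if dist[w] == r:
                f[v] += sigma[w]       # vertex at distance exactly r
                continue               # do not expand past distance r
            for x in nbrs[w]:
                if x not in dist:
                    dist[x] = dist[w] + 1
                    queue.append(x)
    return f * mu ** (-r)
```

On a block-model graph above the threshold (1), the correlation ⟨f^(r), σ⟩/n computed this way stays bounded away from zero, as the theory of tree reconstruction predicts.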
Then f is an eigenvector of A with eigenvalue λ = µ + (c − 1) µ^{−1}.

We define an analogous approximate eigenvector of B,

g^(r)_{u→v} = µ^{−r} Σ_{(w,x) : d(u→v, w→x)=r} σ_x,

where now d refers to the number of steps in the graph of directed edges. We have in expectation B g^(r) = µ g^(r+1), and as before |g^(r) − g^(r+1)| tends to zero as r increases. Identifying them gives an approximate eigenvector g with eigenvalue µ:

B g = µ g.   (7)

Furthermore, summing over all incoming edges gives

Σ_{u ∈ N(v)} g_{u→v} = f_v,

whose signs are correlated with the true community memberships σ_v.

We note that the relation between the eigenvalue equation (7) for B and the quadratic eigenvalue equation (6) is exact and well known in the theory of zeta functions of graphs [16, 26, 27]. More generally, all eigenvalues µ of B that are not ±1 are roots of the equation

det[ µ² − µA + (D − 1) ] = 0.   (8)

This equation hence describes 2n of B's eigenvalues; they are the eigenvalues of the 2n × 2n matrix

B′ = ( 0   D − 1 )
     ( −1  A     ).   (9)

The left eigenvectors of B′ are of the form (f, −µf), where f obeys (6). Thus we can find f by dealing with a 2n × 2n matrix rather than a 2m × 2m one, which considerably reduces the computational complexity of our algorithm.

Next, we argue that the bulk of B's spectrum is confined to the disk of radius √c. First note that for any matrix B,

Σ_{i=1}^{2m} |µ_i|^{2r} ≤ tr B^r (B^r)^T.

On the other hand, for any fixed r, since G is locally treelike in the limit n → ∞, each diagonal entry (u→v, u→v) of B^r (B^r)^T is equal to the number of vertices exactly r steps from v, other than those reached via u. In expectation this is c^r, so by linearity of expectation E tr B^r (B^r)^T = 2m c^r. It follows that the spectral measure satisfies E(|µ|^{2r}) ≤ c^r. Since this holds for any fixed r, we conclude that almost all of B's eigenvalues obey |µ| ≤ √c.
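The 2n × 2n reduction (9) leads directly to a compact implementation of the algorithm. The following sketch (with a dense eigensolver for readability; sparse linear algebra would be used in practice, and the function name is ours) builds B′, takes the eigenvector of its second-largest real eigenvalue, and labels vertices by the sign of its last n components, which play the role of f in (6).

```python
import numpy as np

def nb_spectral_labels(A):
    """Two-group spectral clustering via the 2n x 2n matrix B' of Eq. (9).

    The lower n components of the eigenvector of B' for the second-largest
    real eigenvalue satisfy the quadratic eigenvalue equation (6), i.e.
    they play the role of f; vertices are labeled by the sign of f.
    """
    n = A.shape[0]
    D = np.diag(A.sum(axis=1).astype(float))
    I = np.eye(n)
    Bp = np.block([[np.zeros((n, n)), D - I],
                   [-I, A.astype(float)]])
    vals, vecs = np.linalg.eig(Bp)
    real_idx = np.where(np.abs(vals.imag) < 1e-8)[0]
    real_idx = real_idx[np.argsort(-vals[real_idx].real)]  # descending order
    f = vecs[n:, real_idx[1]].real       # skip the leading eigenvalue ~ c
    return np.sign(f)
```

On block-model graphs above the threshold, the resulting labels agree with the true groups on the large majority of vertices, up to a global sign.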
Proving rigorously that all the eigenvalues in the bulk are asymptotically confined to this disk requires a more precise argument, which we leave for future work.

As a side remark, we note that (8) yields B's spectrum for d-regular graphs [27]. There are n pairs of eigenvalues µ_± such that

µ_± = ( λ ± √(λ² − 4(d − 1)) ) / 2,   (10)

where λ ranges over the (real) eigenvalues of A. These are related by µ_+ µ_− = d − 1, so all the non-real eigenvalues of B are conjugate pairs on the circle of radius √(d − 1). The other eigenvalues are ±1. For random regular graphs, the asymptotic spectral density of B follows straightforwardly from the well-known result of [13] for the spectral density of the adjacency matrix.

Finally, the singular values of B are easy to derive for any simple graph, i.e., one without self-loops or multiple edges. Namely, B B^T is block-diagonal: for each vertex v, it has a block of size d_v, a rank-one perturbation of the identity, that connects the edges at v to each other. As a consequence, B has n singular values d_v − 1, one for each vertex v, and its other 2m − n singular values are 1. However, since B is not symmetric, its eigenvalues and its singular values are different: while its singular values are controlled by the vertex degrees, its eigenvalues are not. This is precisely why its spectral properties are better than those of A and related operators.

FIG. 3: The first, second, and third largest eigenvalues µ_1, µ_2, and |µ_3| of B as functions of c_in − c_out. The third eigenvalue is complex, so we plot its modulus. Values are averaged over networks of size n = 10^5 and average degree c = 3. The green line represents µ_c = (c_in − c_out)/2, and the horizontal lines are c and √c respectively. The second eigenvalue µ_2 is well separated from the bulk throughout the detectable regime.

IV. MORE THAN TWO GROUPS AND GENERAL DEGREE DISTRIBUTIONS
The arguments given above regarding B's spectral properties generalize straightforwardly to other graph ensembles. First, consider block models with q groups, where for 1 ≤ a ≤ q group a has fractional size n_a. The average degree of group a is c_a = Σ_b c_{ab} n_b. The hardest case is where c_a = c is the same for all a, so that we cannot simply label vertices according to their degree.

The leading eigenvalue of B is again c, and the bulk of B's spectrum is again confined to the disk of radius √c. Now B has q − 1 linearly independent eigenvectors with real eigenvalues that are correlated with the true group assignment. If these real eigenvalues lie outside the bulk, we can identify the groups by assigning a vector in R^{q−1} to each vertex, and applying a clustering technique such as k-means. These eigenvalues are of the form µ = cν, where ν is a nonzero eigenvalue of the q × q matrix

T_{ab} = n_a ( c_{ab}/c − 1 ).   (11)

In particular, if n_a = 1/q for all a, and c_{ab} = c_in for a = b and c_out for a ≠ b, we have µ_c = (c_in − c_out)/q. The detectability threshold is again µ_c > √c, or

|c_in − c_out| > q√c.   (12)

More generally, if the community-correlated eigenvectors have distinct eigenvalues, we can have multiple transitions, where some of them can be detected by a spectral algorithm while others cannot.

There is an important difference between the general case and q = 2. While for q = 2 it is literally impossible for any algorithm to distinguish the communities below this transition, for larger q the situation is more complicated. In general (for q ≥ 5 in the assortative case, and q ≥ 3 in the disassortative one) the threshold (12) marks a transition from an "easily detectable" regime to a "hard detectable" one.
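The eigenvalues of the small matrix T in (11) determine which community structure is visible, so they are worth computing explicitly. A minimal sketch (function name ours) follows; it assumes, as in the text, that every group has the same average degree c.

```python
import numpy as np

def community_eigenvalues(c_ab, n_a):
    """Candidate community eigenvalues of B for a q-group block model.

    c_ab : q x q affinity matrix (edge probabilities c_ab / n)
    n_a  : fractional group sizes; we assume every group has the same
           average degree c = sum_b c_ab * n_b.
    Returns (c, mu), where mu = c * nu for each nonzero eigenvalue nu of
    T_ab = n_a (c_ab / c - 1) of Eq. (11); entries with |mu| > sqrt(c)
    are expected to lie outside B's bulk and be detectable.
    """
    c_ab = np.asarray(c_ab, dtype=float)
    n_a = np.asarray(n_a, dtype=float)
    c = float(c_ab[0] @ n_a)                  # common average degree
    T = n_a[:, None] * (c_ab / c - 1.0)       # Eq. (11)
    nu = np.linalg.eigvals(T)
    mu = c * nu[~np.isclose(np.abs(nu), 0.0)]
    return c, mu
```

The same quantity governs the linearized belief propagation discussed in Section V: with ν = µ/c, the instability criterion νµ > 1 is equivalent to µ > √c, i.e., to the threshold (12) for equal-sized groups.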
In the hard detectable regime, it is theoretically possible to find the communities, but it is conjectured that any algorithm that does so takes exponential time [9, 10]. In particular, we have found experimentally that none of B's eigenvectors are correlated with the groups in the hard regime. Nonetheless, our arguments suggest that spectral algorithms based on B are optimal in the sense that they succeed all the way down to this easy/hard transition.

Since a major drawback of the stochastic block model is that its degree distribution is Poisson, we can also consider random graphs with specified degree distributions. Again, the hardest case is where the groups have the same degree distribution. Let a_k denote the fraction of vertices of degree k. The average branching ratio of a branching process that explores the neighborhood of a vertex, i.e., the average number of new edges leaving a vertex v that we arrive at when following a random edge, is

˜c = Σ_k k(k − 1) a_k / Σ_k k a_k = ⟨k²⟩/⟨k⟩ − 1.

We assume here that the degree distribution has a bounded second moment, so that this process is not dominated by a few high-degree vertices. The leading eigenvalue of B is ˜c, and the bulk of its spectrum is confined to the disk of radius √˜c, even in the sparse case where ˜c does not grow with the size of the graph. If q = 2 and the average numbers of new edges linking v to its own group and to the other group are ˜c_in/2 and ˜c_out/2 respectively, then the approximate eigenvector described in the previous section has eigenvalue µ = (˜c_in − ˜c_out)/2. The detectability threshold (1) then becomes µ > √˜c, or ˜c_in − ˜c_out > 2√˜c. The threshold (12) for q groups generalizes similarly.

V. DERIVING B BY LINEARIZING BELIEF PROPAGATION
The matrix B also appears naturally as a linearization of the update equations for belief propagation (BP). This linearization was used previously to investigate phase transitions in the performance of the BP algorithm [5, 9, 10, 28]. We recall that BP is an algorithm that iteratively updates messages η_{v→w}, where (v, w) are directed edges. These messages represent the marginal probability that vertex v belongs to a given community, assuming that the vertex w is absent from the network. Each such message is updated according to the messages η_{u→v} that v receives from its other neighbors u ≠ w. The update rule depends on the parameters c_in and c_out of the block model, as well as the expected size of each community. For the simplest case of two equally sized groups, the BP update [9, 10] can be written as

η^+_{v→w} / η^−_{v→w} := e^{−h} Π_{u ∈ N(v)∖w} ( η^+_{u→v} c_in + η^−_{u→v} c_out ) / Π_{u ∈ N(v)∖w} ( η^+_{u→v} c_out + η^−_{u→v} c_in ).   (13)

Here + and − denote the two communities. The term e^{−h}, where h = (c_in − c_out)(n^BP_+ − n^BP_−) and n^BP_± are the current estimates of the fractions of vertices in the two groups, represents messages from the non-neighbors of v. In the assortative case, it prevents BP from converging to a fixed point where every vertex is in the same community.

The update (13) has a trivial fixed point η_{v→w} = 1/2, where every vertex is equally likely to be in either community. Writing η^±_{u→v} = 1/2 ± δ_{u→v} and linearizing around this fixed point gives the following update rule for δ:

δ_{v→w} := (c_in − c_out)/(c_in + c_out) Σ_{u ∈ N(v)∖w} δ_{u→v},

or equivalently

δ := (c_in − c_out)/(c_in + c_out) B δ.
(14)

More generally, in a block model with q communities, an affinity matrix c_{ab}, and an expected fraction n_a of vertices in each community a, linearizing around the trivial fixed point and defining η^a_{u→v} = n_a + δ^a_{u→v} gives a tensor product operator,

δ := (T ⊗ B) δ,   (15)

where T is the q × q matrix defined in (11).

We can also describe the linearization of BP in terms of the 2n × 2n matrix B′ defined in (9). Specifically, if we define δ^in and δ^out as the qn-dimensional vectors where δ^{in,a}_v = Σ_{u ∈ N(v)} δ^a_{u→v} and δ^{out,a}_v = Σ_{u ∈ N(v)} δ^a_{v→u} are the sums of δ over v's incoming and outgoing edges respectively, then

( δ^out, δ^in ) := (T ⊗ B′) ( δ^out, δ^in ).   (16)

Thus we can analyze BP to first order around the trivial fixed point by keeping track of just 2qn variables rather than 2qm of them.

This shows that the spectral properties of the non-backtracking matrix are closely related to belief propagation. Specifically, the trivial fixed point is unstable, leading to a fixed point that is correlated with the community structure, exactly when T ⊗ B has an eigenvalue greater than 1. However, by avoiding the fixed point where all the vertices belong to the same group, we suppress B's leading eigenvalue; thus the criterion for instability is νµ > 1, where ν is T's leading eigenvalue and µ is B's second eigenvalue. This is equivalent to (12) in the case where the groups are of equal size.

In general, the BP algorithm achieves slightly better agreement with the actual group assignment, since it approximates Bayes-optimal inference of the block model. On the other hand, the BP update rule depends on the parameters of the block model, and if these parameters are unknown they need to be learned, which presents additional difficulties [12]. In contrast, our spectral algorithm does not depend on the parameters of the block model, giving it an advantage over BP in addition to its computational efficiency.

VI.
EXPERIMENTAL RESULTS AND DISCUSSION
FIG. 4: The accuracy of spectral algorithms based on different linear operators, and of belief propagation, for two groups of equal size. On the left, we vary c_in − c_out while fixing the average degree c = 3; the detectability transition given by (1) occurs at c_in − c_out = 2√3 ≈ 3.46. On the right, we fix the ratio c_out/c_in and vary the average degree c; the detectability transition again follows from (1). Each point is averaged over many instances with n = 10^5. Our spectral algorithm based on the non-backtracking matrix B achieves an accuracy close to that of BP, and both remain large all the way down to the transition. Standard spectral algorithms based on the adjacency matrix, modularity matrix, Laplacian, and random walk matrix fail well above the transition, doing no better than chance.

FIG. 5: Clustering in the case of three groups of equal size, with c = 3 and an assortative affinity matrix. On the left, a scatter plot of the second and third eigenvectors (x and y axes respectively) of the non-backtracking matrix B, with colors indicating the true group assignment. On the right, the analogous plot for the adjacency matrix A. Applying k-means gives a substantially larger overlap using B than using A.

In Fig. 4, we compare the spectral algorithm based on the non-backtracking matrix B with those based on various classical operators: the adjacency matrix A, the modularity matrix M, the Laplacian L, and the random walk matrix Q. We see that there is a regime where standard spectral algorithms do no better than chance, while the one based on B achieves a strong correlation with the true group assignment all the way down to the detectability threshold. We also show the performance of belief propagation, which is believed to be asymptotically optimal [9, 10].

We measure the performance as the overlap, defined as

( (1/n) Σ_u δ_{g_u, ˜g_u} − 1/q ) / ( 1 − 1/q ).
(17)

Here g_u is the true group label of vertex u, and ˜g_u is the label found by the algorithm. We break symmetry by maximizing over all q! permutations of the groups. The overlap is normalized so that it is 1 for the true labeling and 0 for a uniformly random labeling.

[FIG. 6 panels: Football (q = 12, overlap 0.9163); Polblogs (q = 2, overlap 0.8533); Adjnoun (q = 2, overlap 0.6250); Dolphins (q = 2, overlap 0.7419); Polbooks (q = 3, overlap 0.7571); Karate (q = 2, overlap 1).]
FIG. 6: The spectrum of the non-backtracking matrix in the complex plane for some commonly used benchmarks for community detection in real networks, taken from [29–34]. The radius of the circle is the square root of the largest eigenvalue, which is a heuristic estimate of the boundary of the bulk of the spectrum. The overlap is computed using the signs of the second eigenvector for the networks with two communities, and using k-means for those with three or more communities. The non-backtracking operator detects communities in all these networks, with an overlap comparable to the performance of other spectral methods. As in the case of synthetic networks generated by the stochastic block model, the number of real eigenvalues outside the bulk appears to be a good indicator of the number q of communities.

In Fig. 5 we illustrate clustering in the case q = 3. As described above, in the detectable regime we expect to see q − 1 = 2 eigenvectors with real eigenvalues that are correlated with the true group assignment. Indeed, B's second and third eigenvectors are strongly correlated with the true clustering, and applying k-means in R² gives a large overlap. In contrast, the second and third eigenvectors of the adjacency matrix are essentially uncorrelated with the true clustering, and similarly for the other traditional operators.

Finally, we turn to real networks to illustrate the advantages of spectral clustering based on the non-backtracking matrix in practical applications. In Fig. 6 we show B's spectrum for several networks commonly used as benchmarks for community detection. In each case we plot a circle whose radius is the square root of the largest eigenvalue. Even though these networks were not generated by the stochastic block model, their spectra look qualitatively similar to the picture discussed above (Fig. 2). This leads to several very convenient properties.
For each of these networks, we observed that only the eigenvectors with real eigenvalues are correlated with the group assignment given by the ground truth. Moreover, the real eigenvalues that lie outside of the circle are clearly identifiable. This is very unlike the situation for the operators used in standard spectral clustering algorithms, where one must decide which eigenvalues are in the bulk and which are outside.

In particular, the number of real eigenvalues outside of the circle seems to be a natural indicator of the true number q of clusters present in the network, just as for networks generated by the stochastic block model. This suggests that in the network of political books there might in fact be 4 groups rather than 3, in the blog network there might be more than two groups, and in the NCAA football network there might be 10 groups rather than 12. However, we also note that in some networks large real eigenvalues may correspond to small cliques in the graph; it is a philosophical question whether or not to count these as communities.

Note also that clustering based on the non-backtracking matrix works not only for assortative networks, but also for disassortative ones, such as word adjacency networks [31], where the important real eigenvalue is negative, without our being told which is the case.

A Matlab implementation with demos that can be used to reproduce our numerical results can be found at [35].

VII. CONCLUSION
While recent advances have made statistical inference of network models for community detection far more scalable than in the past (e.g. [9, 36–38]), spectral algorithms remain highly competitive because of the computational efficiency of sparse linear algebra. However, for sparse networks there is a large regime in which statistical inference methods such as belief propagation can detect communities, while standard spectral algorithms cannot.

We closed this gap by using the non-backtracking matrix B as a new starting point for spectral algorithms. We showed that for sparse networks generated by the stochastic block model, B's spectral properties are much better than those of the adjacency matrix and its relatives. In fact, B is asymptotically optimal in the sense that it allows us to detect communities all the way down to the detectability transition. We also computed B's spectrum for some common benchmarks for community detection in real-world networks, showing that the real eigenvalues are a good guide to the number of communities and the correct labeling of the vertices.

Our approach can be straightforwardly generalized to spectral clustering for other types of sparse data, such as real-valued similarities between objects. The definition of B extends to

B_{(u→v),(w→x)} = s(u, v) if v = w and u ≠ x, and 0 otherwise,

where s(u, v) is the similarity index between u and v. As in the case of graphs, we cluster the vertices by computing the top eigenvectors of B, projecting the rows of B onto the space spanned by these eigenvectors, and using a low-dimensional clustering algorithm such as k-means to cluster the projected rows [3]. We believe that, as for sparse graphs, there will be important regimes in which using B succeeds where standard clustering algorithms fail.
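The weighted generalization just defined is a one-line change to the construction of B. A minimal sketch (function name ours; zero similarities are treated as non-edges) follows:

```python
import numpy as np

def weighted_nb_matrix(S):
    """Weighted non-backtracking matrix for a sparse similarity matrix S.

    B[(u->v), (w->x)] = s(u, v) if v == w and u != x, and 0 otherwise,
    where s(u, v) = S[u, v] is the similarity between u and v.
    Returns B and the list of directed edges indexing its rows/columns.
    """
    n = S.shape[0]
    edges = [(u, v) for u in range(n) for v in range(n) if S[u, v] != 0]
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v), i in index.items():
        for x in np.nonzero(S[v])[0]:      # continuations (v -> x)
            if x != u:                     # forbid backtracking
                B[i, index[(v, int(x))]] = S[u, v]
    return B, edges
```

With a 0/1 similarity matrix this reduces to the unweighted non-backtracking matrix of Section II.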
Given the wide use of spectral clustering throughout the sciences, we expect that the non-backtracking matrix and its generalizations will have a significant impact on data analysis.

Acknowledgments
We are grateful to Noga Alon, Brian Karrer, Mark Newman, Nati Linial, and Xiaoran Yan for helpful discussions. C.M. and P.Z. are supported by AFOSR and DARPA under grant FA9550-12-1-0432. F.K. and P.Z. have been supported in part by the ERC under the European Union's 7th Framework Programme Grant Agreement 307087-SPARCS. E.M. and J.N. were supported by NSF DMS grant number 1106999 and DOD ONR grant N000141110140.

[1] Holland P W, Laskey K B, Leinhardt S (1983). Stochastic blockmodels: First steps. Social Networks 5(2):109–137.
[5] Coja-Oghlan A, Mossel E, Vilenchik D (2009). A spectral approach to analyzing belief propagation for 3-coloring. Combinatorics, Probability and Computing 18:881–912.
[6] Coja-Oghlan A (2010). Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing 19(2):227–284.
[7] McSherry F (2001). Spectral partitioning of random graphs. Proc. 42nd IEEE Symposium on Foundations of Computer Science, 529–537.
[8] Nadakuditi R R, Newman M E J (2012). Graph spectra and the detectability of community structure in networks. Phys. Rev. Lett. 108:188701.
[9] Decelle A, Krzakala F, Moore C, Zdeborová L (2011). Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107:065701.
[10] Decelle A, Krzakala F, Moore C, Zdeborová L (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84:066106.
[18] Watanabe Y, Fukumizu K (2010). Graph zeta function in the Bethe free energy and loopy belief propagation. arXiv:1002.3307.
[19] Vontobel P O (2010). Connecting the Bethe entropy and the edge zeta function of a cycle code. Proc. IEEE International Symposium on Information Theory (ISIT), 704–708.
[20] Ren P, Wilson R C, Hancock E R (2011). Graph characterization via Ihara coefficients. IEEE Transactions on Neural Networks 22(2):233–245.
[21] Wigner E P (1958). On the distribution of the roots of certain symmetric matrices. Ann. Math. 67(2):325–327.
[22] Krivelevich M, Sudakov B (2003). The largest eigenvalue of sparse random graphs. Combinatorics, Probability and Computing 12(1):61–72.
[23] Bollobás B, Janson S, Riordan O (2007). The phase transition in inhomogeneous random graphs. Random Structures & Algorithms 31(1):3–122.
[24] Kesten H, Stigum B P (1966). Additional limit theorems for indecomposable multidimensional Galton-Watson processes. Ann. Math. Statist. 37:1463–1481.
[29] Adamic L A, Glance N (2005). The political blogosphere and the 2004 U.S. election: divided they blog. Proc. 3rd Intl. Workshop on Link Discovery, 36–43.
[30] Zachary W W (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33(4):452–473.