Efficient algorithm based on non-backtracking matrix for community detection in signed networks

Abstract

Community detection or clustering is a crucial task for understanding the structure of complex systems. In some networks, nodes are permitted to be linked by either "positive" or "negative" edges; such networks are called signed networks. Discovering communities in signed networks is more challenging than that in unsigned networks. In this study, we innovatively develop a non-backtracking matrix of signed networks, theoretically derive a detectability threshold for this matrix, and demonstrate the feasibility of using the matrix for community detection. We further improve the developed matrix by considering the balanced paths in the network (referred to as a balanced non-backtracking matrix). Simulation results demonstrate that the algorithm based on the balanced nonbacktracking matrix significantly outperforms those based on the adjacency matrix and the signed non-backtracking matrix. The proposed (improved) matrix shows great potential for detecting communities with or without overlap.


arXiv [physics.soc-ph]
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Non-backtracking Operators for Community Detection in Signed Networks

Zhaoyue Zhong †, Xiangrong Wang ‡§, Cunquan Qu ∗†, Guanghui Wang ∗†
∗ Data Science Institute, Shandong University, China
† School of Mathematics, Shandong University, China
‡ Institute of Future Networks, Southern University of Science and Technology, Shenzhen, China
§ Research Center of Networks and Communications, Peng Cheng Laboratory, Shenzhen, China

Abstract: Community detection, or clustering, is crucial for understanding the structure of complex systems. In some networks, nodes may be linked by either 'positive' or 'negative' edges; such networks are called signed networks. Discovering communities in signed networks is more challenging than in unsigned ones. In this article, we propose a non-backtracking matrix for signed networks, theoretically derive a detectability threshold, and prove the feasibility of the matrix for community detection. Furthermore, we improve the operator by considering the balanced paths in the network (denoted the balanced non-backtracking operator). Simulation results demonstrate that the approach based on the balanced non-backtracking matrix significantly outperforms the algorithms based on the adjacency matrix and on the signed non-backtracking matrix. It shows great potential for detecting communities with or without overlap.

Index Terms —Community detection, signed networks, non-backtracking operator, spectral analysis, detectability threshold.

I. INTRODUCTION

Communities, also known as clusters or modules, are groups of nodes that may share common attributes or have similar properties in the graph. Community detection groups together similar nodes, or nodes with a large number of (positive, heavily weighted) connections, and thereby offers a possible way to control the network. Since nodes joined by many (positive) edges often have similar properties, community detection can also be viewed, in graph terms, as a process of finding cut edges: if removing a few edges splits the network into several connected components that are not connected with each other, then this division is to some extent equivalent to a community partition.

Community detection is widely applied in biology, computer science, engineering, economics, political science, sociology and other fields [1]. For example, protein-protein interaction networks are a research hotspot in biology and bioinformatics [2]. The interaction between proteins is the basis of every process in the cell. Each interaction is observed experimentally and recorded as a connection. Proteins with the same or similar functions are grouped into one module, and we expect them to participate in the same process. In this setting, the community structure is associated with most of the immunohistochemistry as well as with tumor and metastasis. This is a classical application that abstracts the actual situation as an unsigned network.

Social networks are another typical class of networks with community structure. Most studies regard connections as positive, such as fans, likes and forwarding. However, social networks often contain many negative connections. Some websites, such as epinions.com and slashdot.com, allow users to identify friends and enemies [3]. The signed network introduced in this paper represents this situation, for example opposite opinions on the same topic [3], or blacklisting and reporting among users.

Dating back to the 1940s [4], Heider introduced the concept of signed networks and the well-known structural balance theory, which states that 'the friends of my friends, as well as the enemies of my enemies, are my friends' (see Fig. 1). As one of the most popular theories in social science, structural balance theory has received increasing attention recently. One topic is the design of algorithms for computing the structural balance of large-scale datasets [5], [6], [7]. Another is the impact of structural balance on concrete applications, such as recommender systems [8], dynamic processes [9], and so forth.

Fig. 1.

Triangles in signed networks. Two of the four triangle types are balanced and relatively stable; the other two are unbalanced and hence liable to break apart.

In social networks, user communities allow websites to provide better services, such as friend recommendation. Clustering of web pages can be used to rank web pages and provide more relevant search results [1]. Furthermore, applying community detection to social media can better explain the observed phenomena and provide benchmarks for social mechanisms [10].

In general, existing community detection methods can be classified as follows: (1) traditional algorithms, such as graph partitioning [11], hierarchical clustering [12], partition clustering [13] and spectral clustering [14]; (2) modularity-based methods [15]; (3) dynamic algorithms [16], [17]; (4) methods based on statistical inference [18], to name a few. In this paper, a spectral method is used to detect the community structure. In unsigned networks, the adjacency matrix [19], the Laplacian matrix [20], the non-backtracking matrix [21] and other structure-related matrices have been employed for this problem; in signed networks, the adjacency matrix has been used for community detection [22].

The paper is organized as follows. In Sec. II, we give the necessary notation and the definition of the non-backtracking matrix. Taking full advantage of structural balance theory, we propose the signed non-backtracking (SNBT) matrix for signed networks. We also derive the same matrix from another point of view, belief propagation, which supports the rationality of this definition. In Sec. III, we obtain a theoretical detectability threshold $\mu_c > \sqrt{c}$ (where $\mu_c$ is the community-correlated eigenvalue and $c$ is the average degree of the network) and theoretically prove the feasibility of the SNBT matrix for community detection above this threshold. An improved non-backtracking form, the balanced non-backtracking (BNBT) operator, is also investigated. In Sec. IV, we carry out numerical simulations to demonstrate the effectiveness of the SNBT matrix in community detection compared with the adjacency matrix. Experiments on the signed stochastic block model show that the BNBT matrix-based approach performs best among the three matrices. Finally, we conclude in Sec. V.

Corresponding author: C. Qu.

II. NON-BACKTRACKING OPERATOR IN SIGNED NETWORKS

A. Signed stochastic block model

Signed networks consist of interacting individuals with both positive and negative relationships. Each individual in the network corresponds to a node in the graph, and the connection between a pair of individuals is the edge between the corresponding node pair. Positive and negative relationships are represented as positive and negative edges. For simplicity, the weights of these edges are defined as $1$ and $-1$, respectively.

First, we give the signed stochastic block model [21], [22], [23]. Given an undirected network with $N$ nodes (suppose $N$ is even), we divide the node set into two groups, $A$ and $B$, each with $N/2$ nodes. Nodes in group $A$ are indexed from $1$ to $N/2$ and those in group $B$ from $N/2+1$ to $N$. A signed network can be represented by an adjacency matrix $A$ whose entries take values in $\{1, -1, 0\}$, with $0$ signifying the absence of an edge and $\pm 1$ denoting a positive or negative relationship. The adjacency matrix is symmetric because the network is undirected.

For any pair of nodes $(i, j)$, the model has the following parameters. The probability that an edge forms between a given in-group (out-group) node pair is $d_{in}$ ($d_{out}$). The expected edge density of the whole network is $d = d_{in} + d_{out}$. Given the presence of an edge between in-group members, the conditional probability that it is positive is denoted $p^{+}_{in}$. Analogously, $p^{+}_{out}$, $p^{-}_{in}$ and $p^{-}_{out}$ denote the conditional probabilities of a positive edge between out-group members, a negative edge between in-group members, and a negative edge between out-group members, respectively. Thus, the conditional probabilities satisfy $p^{+}_{in} + p^{-}_{in} = 1$ and $p^{+}_{out} + p^{-}_{out} = 1$.

Generally speaking, a network has a community structure when at least one of the following conditions holds:

Case 1: Positive links are more likely within communities than between them, i.e., $p^{+}_{in} > p^{+}_{out}$ and $p^{-}_{in} < p^{-}_{out}$.
Case 2: Intra-community links are denser than inter-community links, i.e., $d_{in} > d_{out}$.

In the first case the community structure is sign-sensitive, and we call it a relationship-dependent community. In the second case it is link-density-sensitive, and we call it a density-dependent community.

B. Definition of non-backtracking matrix
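As a concrete reference for the definitions that follow, the signed stochastic block model of Sec. II-A can be sampled in a few lines. This is only a minimal sketch under stated assumptions: the function name `sample_signed_sbm` and its argument names are hypothetical, nodes are indexed from 0 rather than 1, and each node pair is drawn independently.

```python
import random


def sample_signed_sbm(n, d_in, d_out, p_in_pos, p_out_pos, seed=0):
    """Sample a two-group signed stochastic block model.

    Returns (labels, edges): labels[u] is +1 for group A (first n/2 nodes)
    and -1 for group B; edges maps an undirected pair (u, v), u < v,
    to its sign (+1 or -1).
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    rng = random.Random(seed)
    labels = [1 if u < n // 2 else -1 for u in range(n)]
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            same_group = labels[u] == labels[v]
            density = d_in if same_group else d_out
            if rng.random() >= density:
                continue  # no edge between u and v
            p_pos = p_in_pos if same_group else p_out_pos
            edges[(u, v)] = 1 if rng.random() < p_pos else -1
    return labels, edges
```

Here `p_in_pos` and `p_out_pos` play the roles of $p^{+}_{in}$ and $p^{+}_{out}$; the negative-edge probabilities follow from $p^{+} + p^{-} = 1$.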

One of the main contributions of this work is the definition of non-backtracking operators for signed networks, which show great potential for community detection. Although the non-backtracking matrix is well defined on unsigned networks, a proper definition for signed networks is far from trivial, as the following sections show. For completeness, we first present the definition for unsigned (general) networks before the formal definition for signed networks.

The non-backtracking matrix $\tilde{H}$, often called the Hashimoto matrix in mathematics, is defined as

$$\tilde{H}_{e,f} = \tilde{A}_{e}\,\tilde{A}_{f}\;\mathbb{1}(e_2 = f_1)\;\mathbb{1}(e_1 \neq f_2), \tag{1}$$

where $\tilde{A}$ is the adjacency matrix of the unsigned network, and $e = (e_1, e_2)$ and $f = (f_1, f_2)$ are two directed edges. If the network is undirected, we treat each undirected edge as two directed edges. Hence $\tilde{H}$ has dimension $2m \times 2m$, where $m$ is the number of edges in the network.

Equivalently, $\tilde{H}_{2m \times 2m}$ can be written as

$$\tilde{H}_{(e_1 \to e_2),(f_1 \to f_2)} = \begin{cases} 1 & \text{if } e_2 = f_1 \text{ and } e_1 \neq f_2, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

By analogy with Eq. (1), the non-backtracking matrix $H$ of a signed network follows directly as

$$H_{e,f} = A_{e}\,A_{f}\;\mathbb{1}(e_2 = f_1)\;\mathbb{1}(e_1 \neq f_2), \tag{3}$$

that is,

$$H_{(e_1 \to e_2),(f_1 \to f_2)} = \begin{cases} 1 & \text{if } e_2 = f_1,\ e_1 \neq f_2 \text{ and } \operatorname{sign}(e_1 \to e_2) = \operatorname{sign}(f_1 \to f_2), \\ -1 & \text{if } e_2 = f_1,\ e_1 \neq f_2 \text{ and } \operatorname{sign}(e_1 \to e_2) \neq \operatorname{sign}(f_1 \to f_2), \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\operatorname{sign}(e_1 \to e_2)$ denotes the sign of the directed edge $e_1 \to e_2$ and takes the value $1$ or $-1$. The significance of $H$ is that true information is transferred between edge pairs with identical signs, while false information is transferred between edges with different signs, which encodes the theory of structural balance (a triple with either one or three negative signs is unstable).
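The case definition in Eq. (4) translates directly into code. The following is a minimal sketch, assuming the signed network is given as a dictionary from undirected pairs `(u, v)` to signs; the names `directed_edges` and `signed_nbt` are hypothetical, and the dense list-of-lists storage is chosen for readability rather than efficiency.

```python
def directed_edges(edges):
    """Expand each undirected signed edge into its two directed copies."""
    out = []
    for (u, v), sign in sorted(edges.items()):
        out.append((u, v, sign))
        out.append((v, u, sign))
    return out


def signed_nbt(edges):
    """Signed non-backtracking matrix following the case form of Eq. (4).

    Returns (de, H), where de lists the directed edges (tail, head, sign)
    in the order used to index the rows and columns of H.
    """
    de = directed_edges(edges)
    size = len(de)
    H = [[0] * size for _ in range(size)]
    for i, (e1, e2, sign_e) in enumerate(de):
        for j, (f1, f2, sign_f) in enumerate(de):
            # f must start where e ends, and must not return to e's tail
            if e2 == f1 and e1 != f2:
                H[i][j] = sign_e * sign_f  # +1 for equal signs, -1 otherwise
    return de, H
```

For a triangle with one negative edge, every directed edge has exactly one non-backtracking continuation, so `H` has six nonzero entries, and those joining edges of different signs are $-1$.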

C. Alternative definition of non-backtracking matrix based on linearized belief propagation

Belief propagation (BP) is a message-passing algorithm that computes the exact marginal distribution of each vertex when the network is acyclic. Although BP is designed to work correctly on trees, it is usually applied to general graphs that are sparse and may contain loops [21], [24], [25].

The algorithm starts from an appropriate initial assignment and iterates on a set of "messages". Specifically, for each edge $(v, w)$ in a graph $G = (V, E)$, the message $\eta^{a}_{v \to w}$ indicates the conditional probability that $v$ belongs to community $a$ when $w$ is absent from the network, and the message $\eta^{a}_{w \to v}$ indicates the probability that $w$ belongs to community $a$ when $v$ is absent. In general $\eta^{a}_{v \to w} \neq \eta^{a}_{w \to v}$. Thus, although the original graph is undirected, the messages are passed along directed edges, and each message lies between 0 and 1. The messages are computed iteratively by passing information along the edges.

BP agrees well with the actual group assignment because it approximates Bayes-optimal inference for the block model. Applying BP to spectral clustering is also a research direction [26], [27].

In this section, we show that, in signed networks, $H$ appears in the linearization of the BP update equation. Because of the edge signs, we generalize the existing BP update equation to the following form, for $u \in N(v)$,

$$\frac{\eta^{+}_{v \to w}}{\eta^{-}_{v \to w}} := e^{-h} \times \frac{\prod_{\operatorname{sign}(u \to v) = \operatorname{sign}(v \to w)} \left( \eta^{+}_{u \to v} c_{in} + \eta^{-}_{u \to v} c_{out} \right)}{\prod_{\operatorname{sign}(u \to v) = \operatorname{sign}(v \to w)} \left( \eta^{+}_{u \to v} c_{out} + \eta^{-}_{u \to v} c_{in} \right)} \times \frac{\prod_{\operatorname{sign}(u \to v) \neq \operatorname{sign}(v \to w)} \left( (1 - \eta^{+}_{u \to v}) c_{in} + (1 - \eta^{-}_{u \to v}) c_{out} \right)}{\prod_{\operatorname{sign}(u \to v) \neq \operatorname{sign}(v \to w)} \left( (1 - \eta^{+}_{u \to v}) c_{out} + (1 - \eta^{-}_{u \to v}) c_{in} \right)}, \tag{5}$$

where $\eta^{\pm}_{v \to w}$ represents the probability that $v$ belongs to a community when $w$ is absent from the network, and $\pm$ labels the two communities. The factor $e^{-h}$ accounts for the information passed in from non-edges (nodes not adjacent to $v$), where $h = (c_{in} - c_{out})(n^{BP}_{+} - n^{BP}_{-})$, and $n^{BP}_{\pm}$ is the fraction of nodes currently assigned to each of the two communities, as estimated by the BP algorithm.

It should be noted that when $u \to v$ and $v \to w$ have different signs, the information passed into $v \to w$ is not $\eta^{\pm}_{u \to v}$ but $(1 - \eta^{\pm}_{u \to v})$. That is, if the two edges have different signs, false information is passed into $v \to w$. The trivial fixed point of the update equation is still $\eta_{v \to w} = 1/2$, that is, each vertex is equally likely to belong to either community.

Next, we consider the update equation near the trivial fixed point. Writing $\eta^{\pm}_{u \to v} = 1/2 \pm \delta_{u \to v}$ and linearizing around this fixed point (for more details, see Appendix A), we obtain the update rule for $\delta$,

$$\delta := \frac{c_{in} - c_{out}}{c_{in} + c_{out}}\, H^{T} \delta. \tag{6}$$

That is, $H$ can also be obtained from the BP algorithm. In other words, we define the non-backtracking matrix from two different perspectives: we deduce its role in community detection theoretically, and we establish its feasibility in the basic stochastic block model through the linearization of the BP update equation around the fixed point.

III. COMMUNITY DETECTION

A. Analytical community detection threshold and detection vector

To demonstrate the applicability of the signed non-backtracking matrix in community detection, we derive the community detection threshold and a detection vector for an arbitrary signed network. Generalizing the results for unsigned networks [21], [24], [26], [28], we define $g^{out}$ and $g^{in}$ as the $N$-dimensional vectors

$$g^{out}_{u} = \sum_{v \in N(u)} g_{u \to v} \cdot \operatorname{sign}(u \to v), \qquad g^{in}_{u} = \sum_{v \in N(u)} g_{v \to u} \cdot \operatorname{sign}(v \to u),$$

where $N(u)$ is the neighbor set of node $u$ and $g$ is a given vector indexed by the directed edges. Unlike in the unsigned network, we not only sum over incoming and outgoing edges but also take the signs of the edges into account.

Applying $H$ to $g$, we get

$$(Hg)^{out}_{u} = \sum_{v \in N(u)} g^{out}_{v} - g^{in}_{u}, \qquad (Hg)^{in}_{u} = (d_u - 1)\, g^{out}_{u},$$

where $d_u$ is the degree of node $u$ (regardless of the signs of the edges). Rewriting these two equations in matrix form gives

$$\begin{pmatrix} (Hg)^{in} \\ (Hg)^{out} \end{pmatrix} = H' \begin{pmatrix} g^{in} \\ g^{out} \end{pmatrix}, \qquad H' = \begin{pmatrix} 0 & D - I \\ -I & \tilde{A} \end{pmatrix}, \tag{7}$$

where $I$ is the identity matrix, $D$ is the diagonal matrix of vertex degrees, and $\tilde{A}$ is the adjacency matrix of the underlying unsigned structure of the signed network.

Suppose that $Hg = \mu g$. Then

$$\mu \begin{pmatrix} g^{in} \\ g^{out} \end{pmatrix} = H' \begin{pmatrix} g^{in} \\ g^{out} \end{pmatrix}.$$

If $g^{in}$ and $g^{out}$ are nonzero, then $(g^{in}, g^{out})^{T}$ is an eigenvector of $H'$ with the same eigenvalue $\mu$. Hence

$$\mu g^{out} = \tilde{A} g^{out} - g^{in} = \left[ \tilde{A} - \mu^{-1}(D - I) \right] g^{out},$$

so $\mu$ is a root of the quadratic eigenvalue equation

$$\det\left[ \mu^{2} I - \mu \tilde{A} + (D - I) \right] = 0. \tag{8}$$

Compared with the original non-backtracking matrix $H$, the cost of computing the eigenvalues of $H'$ is greatly reduced. This equation is well known in the theory of graph zeta functions [28]. It accounts for $2n$ of $H$'s eigenvalues, and the other $2m - 2n$ are $\pm 1$.

In fact, the spectrum of $H$ is the same as that of $\tilde{H}$, which treats the network as unsigned. This can be shown directly, because $H$ is obtained from $\tilde{H}$ by multiplying all the elements of some rows and of the corresponding columns by $-1$, that is,

$$\det(\lambda I - H) = (-1)^{2|\epsilon^{-}|} \cdot \det(\lambda I - \tilde{H}) = \det(\lambda I - \tilde{H}), \tag{9}$$

where $|\epsilon^{-}|$ is the number of negative edges in the network. Therefore, according to the previous conclusion, the bulk of the spectrum of $H$ is also confined to the disk of radius $\sqrt{c}$ in signed networks. Note that $c = d \cdot n$ is the average degree of the network; $c_{in}$ and $c_{out}$ are defined analogously.

Further, the first and second eigenvalues of $H$ are

$$\mu_1 \approx c, \qquad \mu_2 \approx \mu_c = \frac{c_{in} - c_{out}}{2} = \frac{d_{in} - d_{out}}{2}\, n. \tag{10}$$

In the unsigned network, the second eigenvector of the non-backtracking matrix is a community-correlated eigenvector. If the second eigenvalue of $\tilde{H}$ is separated from the bulk of the spectrum, the corresponding eigenvector can be used for community detection (vertices are labeled according to the sign of the sum over all incoming edges at each vertex) [29]. Similar conclusions for signed networks are verified in the following.

We first construct a vector $g$ that is correlated with the communities and is an approximate eigenvector with eigenvalue $\mu_c$. We assume that $c = O(1)$, so the graph is sparse and locally tree-like. For any positive integer $r$ and any directed edge $(u, v)$, we define

$$g^{(r)}_{u \to v} = \mu_c^{-r} \sum_{(w,x):\, d(u \to v,\, w \to x) = r} \sigma_x \cdot \sigma_{u \to v},$$

where $\sigma_x = \pm 1$ denotes $x$'s community, $\sigma_{u \to v} = \pm 1$ denotes the sign of edge $(u, v)$, and $d(u \to v, w \to x)$ denotes the number of steps required to go from $u \to v$ to $w \to x$ in the graph of directed edges, as shown in Fig. 2.

Fig. 2.

An illustration for calculating $d(u \to v, w \to x)$. Going from edge $x \to y$ to edge $z \to w$ requires traversing two edges, $y \to z$ and $z \to w$. Thus $d(x \to y, z \to w) = 2$.

Applying $H$ to $g^{(r)}$, we have

$$(Hg^{(r)})_{u \to v} = \mu_c^{-r} \sum_{(w,x):\, d(u \to v,\, w \to x) = r+1} \sigma_x \cdot \sigma_{u \to v},$$

which can be simplified as

$$(Hg^{(r)})_{u \to v} = \mu_c \cdot g^{(r+1)}_{u \to v}.$$

We may write $g^{(r)}_{u \to v} - g^{(r+1)}_{u \to v}$ as

$$\mu_c^{-r} \cdot \sigma_{u \to v} \cdot \sum_{(w,x):\, d(u \to v,\, w \to x) = r} \Big( \sigma_x - \mu_c^{-1} \sum_{y \in N(x) \setminus w} \sigma_y \Big).$$

There are (in expectation) $c^{r}$ terms in this sum, each of which, conditioned on the $\sigma_x$, has expected value zero and constant variance. Hence

$$E\left[ \big( g^{(r)}_{u \to v} - g^{(r+1)}_{u \to v} \big)^{2} \right] = O\left( c^{r} \mu_c^{-2r} \right).$$

Summing over all the edges, we have

$$E\left[ \big| g^{(r)} - g^{(r+1)} \big|^{2} \right] = O\left( c^{r} \mu_c^{-2r} |E| \right).$$

Therefore, when the community-correlated eigenvalue (the second eigenvalue) satisfies

$$\mu_c > \sqrt{c}, \tag{11}$$

that is, when it is separated from the bulk of the spectrum, the error is small and approaches zero for large $r$.

According to the results for unsigned networks [30], [31], it can be inferred that, above the threshold and as $n \to \infty$, for every $u \to v$,

$$\langle g^{(r)}_{u \to v},\ \sigma_u \cdot \sigma_{u \to v} \rangle \neq 0.$$

Thus we conclude that

$$\left| Hg^{(r)} - \mu_c g^{(r)} \right| = o(1).$$

So $g^{(r)}$ is indeed an approximate eigenvector of $H$ with eigenvalue $\mu_c$, which may be used to detect the community structure of signed networks. Ineq. (11) is the detection threshold derived in this paper, and it agrees with the threshold for unsigned networks.

B. Beyond two communities

Fig. 3.

Community detection results when the number of communities is greater than 2. $q = 3$, $N = 120$, $d_{in} = 0.\ldots$, $d_{out} = 0.\ldots$, $p^{+}_{in} = p^{-}_{out} = 0.\ldots$ (a) Non-backtracking matrix, Ovl = 1; (b) adjacency matrix, Ovl = 0.86.

The above analysis was carried out on stochastic block models with two communities ($q = 2$). Following the same derivation, the non-backtracking matrix also applies to models with more than two communities ($q > 2$). Its detectability threshold should be similar to that of the unsigned network; that is, the community-correlated eigenvalue satisfies Ineq. (11), $\mu_c > \sqrt{c}$. In this case, the second eigenvalue is

$$\mu_2 \approx \frac{c_{in} - c_{out}}{q} = \frac{d_{in} - d_{out}}{q}\, n.$$

In general, we can take the first $k$ eigenvectors and use the k-means algorithm to determine the grouping of nodes.

For the case $q = 3$, $N = 120$, $d_{in} = 0.\ldots$, $d_{out} = 0.\ldots$, $p^{+}_{in} = p^{-}_{out} = 0.\ldots$, we apply the k-means algorithm to partition the network. In Fig. 3, different colors represent different groups; detection with the non-backtracking matrix and with the adjacency matrix yields overlaps of 1 and 0.86, respectively. The algorithm based on the non-backtracking matrix therefore still performs well. However, when the number of communities is greater than two, the detectability threshold of the adjacency-matrix method is not yet known, so we do not carry out a deeper comparative evaluation.

C. An improved non-backtracking operator for community detection in signed networks
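The threshold of Ineq. (11) involves only the edge densities, which is the starting point of this subsection. A small sketch makes this explicit: it evaluates the community-correlated eigenvalue $(d_{in} - d_{out})\,n/q$ and compares it with $\sqrt{c}$. The function names are hypothetical, and the average degree `c` is passed in explicitly rather than derived from the densities; both are assumptions of this sketch.

```python
import math


def community_eigenvalue(d_in, d_out, n, q=2):
    """mu_c = (c_in - c_out) / q, with c_in = d_in * n and c_out = d_out * n."""
    return (d_in - d_out) * n / q


def is_detectable(d_in, d_out, n, c, q=2):
    """Check Ineq. (11), mu_c > sqrt(c); no sign probability enters."""
    return community_eigenvalue(d_in, d_out, n, q) > math.sqrt(c)
```

Because neither function takes $p^{+}_{in}$ or $p^{-}_{out}$ as an argument, changing the edge signs cannot move a network across this threshold.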

According to the analysis above, the proposed non-backtracking operator is density-sensitive rather than sign-sensitive. Even when negative edges dominate the inter-community connections and positive edges dominate the intra-community connections, the results are not ideal as long as the densities of edges within and between groups do not meet the threshold we derived. This insensitivity to edge signs runs against our original intention of exploring the community structure of signed networks.

Conversely, according to the threshold of the adjacency-matrix operator given below, that operator is only sign-sensitive and also does not meet our expectations for community detection. The ideal tool should perform well in both respects.

In constructing the non-backtracking matrix, we considered both structural balance theory and belief propagation. Why, then, does this sign insensitivity occur? As seen in Sec. III-A, the approximation of the second eigenvector depends only on the $k$-order neighbors of a node $x$: it is a simple sum of their community labels, which does not consider whether the path between two nodes is balanced, that is, whether the relationship along the path is friendly or hostile.

Hence, we improve the operator under the following assumption: if two vertices $u$ and $v$ belong to the same community, we expect a larger number of balanced (stable) paths of a given length $k$ between them. The improved matrix, denoted $H^{b}$, is defined as

$$H^{b}_{e,f} = \mathbb{1}(e_2 = f_1)\;\mathbb{1}(e_1 \neq f_2)\;\mathbb{1}(A_e \cdot A_f = 1), \tag{12}$$

or equivalently

$$H^{b}_{(e_1 \to e_2),(f_1 \to f_2)} = \begin{cases} 1 & \text{if } e_2 = f_1,\ e_1 \neq f_2 \text{ and } \operatorname{sign}(e_1 \to e_2) = \operatorname{sign}(f_1 \to f_2), \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$

Correspondingly, in the BP algorithm, we only pass true information along edges with the same sign during propagation, and do not pass false messages along edges with different signs. For simplicity, we call this matrix the balanced non-backtracking matrix. $H^{b}$ is an approximation of $H$, and its detection threshold for community detection is still unclear. Nevertheless, the experiments show that it performs well in community detection.

In the following section, we compare the performance of these three matrices in community detection.

IV. RESULTS
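Before the simulations, it may help to see how the balanced matrix of Eq. (13) can be assembled. The sketch below assumes the same input format as a dictionary from undirected pairs to signs; the name `balanced_nbt` is hypothetical, and the dense storage is only meant for small illustrative networks.

```python
def balanced_nbt(edges):
    """Balanced non-backtracking matrix following Eq. (13).

    Only non-backtracking pairs of directed edges with equal signs get a 1;
    all other entries, including pairs with different signs, are 0.
    """
    de = []
    for (u, v), sign in sorted(edges.items()):
        de.extend([(u, v, sign), (v, u, sign)])  # two directed copies
    Hb = [[0] * len(de) for _ in de]
    for i, (e1, e2, sign_e) in enumerate(de):
        for j, (f1, f2, sign_f) in enumerate(de):
            if e2 == f1 and e1 != f2 and sign_e == sign_f:
                Hb[i][j] = 1
    return de, Hb
```

Compared with the signed matrix of Eq. (4), the only change is that the $-1$ entries are dropped, so walks that cross edges of different signs no longer contribute.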

To evaluate the performance of the signed non-backtracking matrix ($H$) and the balanced non-backtracking matrix ($H^{b}$) in community detection, we carry out extensive simulations on signed networks. The accuracy of community detection is quantified by the overlap [21], defined as the proportion of correctly predicted nodes among all nodes:

$$\mathrm{ovl} = \frac{1}{N} \sum_{u} \delta_{g_u, \tilde{g}_u},$$

where $g_u$ is the true group label of vertex $u$ and $\tilde{g}_u$ is the label found by the algorithm. When $g_u = \tilde{g}_u$ for every node $u$, we have $\mathrm{ovl} = 1$ and the detection accuracy reaches 100%.

We break the symmetry by maximizing over all $q!$ permutations of the groups, where the nodes are divided into $q$ groups. The prediction is exact when the overlap equals $1$, and under this definition the minimum value of the overlap can be taken as $1/q$. The overlap is normalized as

$$\mathrm{ovl} = \left( \frac{1}{N} \sum_{u} \delta_{g_u, \tilde{g}_u} - \frac{1}{q} \right) \Big/ \left( 1 - \frac{1}{q} \right).$$

The normalized overlap ranges from 0 to 1, where 0 means that the prediction is no better than random grouping. For ease of visualization, we still use the unnormalized overlap in the following numerical simulations.
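The two overlap definitions above can be sketched as follows. This is a minimal illustration, assuming group labels are integers in `0 .. q-1`; the function names are hypothetical, and the brute-force search over all $q!$ relabelings is only practical for small $q$.

```python
from itertools import permutations


def overlap(true_labels, found_labels, q):
    """Unnormalized overlap: best fraction of matches over all q! relabelings."""
    n = len(true_labels)
    best = 0
    for perm in permutations(range(q)):
        # perm[f] is the true-group name assigned to found group f
        hits = sum(1 for g, f in zip(true_labels, found_labels) if g == perm[f])
        best = max(best, hits)
    return best / n


def normalized_overlap(true_labels, found_labels, q):
    """Rescale so that the chance level 1/q maps to 0 and a perfect match to 1."""
    return (overlap(true_labels, found_labels, q) - 1 / q) / (1 - 1 / q)
```

A labeling that is correct up to a renaming of the groups, such as `[1, 1, 0, 0]` against `[0, 0, 1, 1]`, therefore has overlap 1.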

A. Analysis of detection accuracy

Fig. 4.

Examples of detectable networks. $N = 1000$, $c = 10$, $d_{in} = 0.\ldots$, $d_{out} = 0.\ldots$, $p^{+}_{in} = p^{-}_{out} = 0.\ldots$ (a) Spectrum of the signed non-backtracking matrix; (b) spectrum of the balanced non-backtracking matrix.

The numerical simulations lead to the following conclusions.

First, it is feasible and accurate to use the signed non-backtracking matrix $H$ to detect communities in signed networks: if the bulk of the spectrum lies within radius $\sqrt{c}$ and the second real eigenvalue exceeds $\sqrt{c}$, the overlap obtained from the corresponding eigenvector is close to (sometimes equal to) 1. Fig. 4(a) shows a typical example of accurate detection.

Second, for the balanced non-backtracking matrix $H^{b}$, although its intrinsic principle has not been studied yet, the bulk of its eigenvalues still lies within a certain radius. As Fig. 4(b) shows, unlike before, there are three real eigenvalues outside the circle. BNBT also leads to satisfactory results in community detection: the calculated overlap is close to 1, which means that the BNBT algorithm can indeed be used to detect community structure. Its performance is described and analyzed in detail in the following sections.

In addition, the threshold of community detection using the non-backtracking matrix is closely related to $d_{in}$ and $d_{out}$, but not to $p^{+}_{in}$ and $p^{+}_{out}$. Meanwhile, under certain conditions the community structure cannot be detected near the threshold. As seen in Sec. III, since $r$ does not really tend to infinity (so the error does not become infinitely small), $g^{(r)}$ cannot simply be regarded as an approximation of the second eigenvector, and it is then only marginally effective for detecting community structure. For instance, when $N = 100$, $p^{+}_{in} = p^{-}_{out} = 0.\ldots$, $d_{out} = 0.\ldots$ and $d_{in} = 0.\ldots$ (or 0.35), the threshold condition is met in theory, but only one real eigenvalue lies outside the bulk of the spectrum and the overlap equals $0.\ldots$ (or $0.\ldots$).

B. Comparison with adjacency-matrix-based detection
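As a reference point for this comparison, the adjacency-matrix baseline labels each node by the sign of its entry in the leading eigenvector. The sketch below is one possible way to do this without external libraries; the shifted power iteration, the start vector and the name `leading_eigvec_labels` are assumptions of this sketch, not the procedure used in the paper.

```python
def leading_eigvec_labels(A, iters=200):
    """Label nodes by the sign of the leading eigenvector of a symmetric matrix A.

    Power iteration is run on A + shift * I. By Gershgorin's theorem the shift
    makes the matrix positive semidefinite, so the iteration converges towards
    the eigenvector of the largest (not largest-magnitude) eigenvalue of A.
    """
    n = len(A)
    shift = max(sum(abs(x) for x in row) for row in A)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-uniform start vector
    for _ in range(iters):
        w = [shift * v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # A is the zero matrix; nothing to detect
        v = [x / norm for x in w]
    return [1 if x >= 0 else -1 for x in v]
```

The overall sign of an eigenvector is arbitrary, so the two groups are recovered only up to a swap of the labels $+1$ and $-1$.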

The leading eigenvector of the adjacency matrix already reveals the community structure, and the number of eigenvalues beyond the threshold no longer equals the number of communities (it should be the number of communities minus 1).

When there are only two communities, the sign of the leading eigenvector can be used to detect communities as long as the following condition is met:

$$p^{-}_{out} > -\frac{d_{in}}{d_{out}} \left( p^{+}_{in} - \ldots \right) + \frac{1}{d_{out}} \sqrt{\frac{d_{in} + d_{out} - d_{in} \left( p^{+}_{in} - \ldots \right)}{N}}. \tag{14}$$

Considering the average degree $c$ of the network, we use the signed non-backtracking matrix as long as the following inequalities are satisfied:

$$\begin{cases} \dfrac{1}{2} < p^{+}_{in} < \dfrac{1}{2} + \dfrac{1}{2} \sqrt{\dfrac{c}{c + N d_{in}}}, \\[2ex] d_{in} > \dfrac{c + \sqrt{c}}{N}. \end{cases} \tag{15}$$

When applying the NBT matrix, we can get better results in some cases than with the adjacency matrix. Here, we give a simple but general example to compare the two methods (see Fig. 5). In all the comparisons, we take $p^{+}_{in} = p^{-}_{out}$ for convenience. We only consider the case $p^{+}_{in} > 0.5$ here, because when $p^{+}_{in} < 0.5$, the community structure can be represented by $u_N$ (the last eigenvector of the adjacency matrix).

Fig. 5.

Detection threshold based on non-backtracking matrix. $p^{+}_{in} = p^{-}_{out}$ and $N = 100$. The red surface indicates the boundary that can be detected by the algorithm proposed in this paper (points in regions 1 and 2).

However, considering the dynamic evolution of the network, it is only the driving role played by the leading eigenvector of the initial network in the structural balance evolution that gives rise to a dynamical manifestation of the detectability transition [22].

(1) When $(d_{in}, d_{out}, p^{+}_{in})$ belongs to region 3 or region 4, it does not meet the threshold of Ineq. (11) and the non-backtracking matrix method fails, so the adjacency matrix method should be considered.
(2) When $(d_{in}, d_{out}, p^{+}_{in})$ belongs to region 1 or region 2, the method based on the non-backtracking matrix depends less on $p^{+}_{in}$ and $p^{+}_{out}$ than the method based on the adjacency matrix. In terms of computation, the adjacency matrix method needs the first eigenvalue and eigenvector of an $n \times n$ matrix, and the non-backtracking matrix method needs the second eigenvalue and eigenvector of a $2n \times 2n$ matrix (in the actual numerical simulations, the program computes several eigenvalues for comparison).
    (i) When $(d_{in}, d_{out}, p^{+}_{in})$ is in region 1, i.e., the threshold of the adjacency-matrix algorithm in Ineq. (14) is met, the adjacency matrix method should be used for community detection in view of the computational complexity.
    (ii) When $(d_{in}, d_{out}, p^{+}_{in})$ is in region 2, the non-backtracking matrix can be considered. That is, our algorithm gives a better indication for clustering when the algorithm based on the adjacency matrix does not meet its threshold.

In fact, Fig. 5 is an abstract representation of the community-related parameters in a real network. Region 4 represents the situation $d_{in} < d_{out}$, $p^{+}_{in} \approx p^{-}_{in}$ and $p^{+}_{out} \approx p^{-}_{out}$. In other words, the random blocks generated by the parameters in region 4 are not the two communities mentioned above. When the parameters fall in region 2 (or region 3), we actually hope that the algorithm is not sign-sensitive (or density-sensitive), and this is where the balanced non-backtracking operator works.

It is worth mentioning that, since the theoretical detection threshold of the balanced non-backtracking matrix is still unknown, we do not discuss it further.

Fig. 6.

The performance in community detection when applying different matrices. $N = 10000$, $c = 10$, $d_{in} + d_{out} = 1$ and $p^{+}_{in} = p^{-}_{out}$. $d_{in}$ and $p^{+}_{in}$ each vary from … to … with an interval of ….

C. Comprehensive comparison among three operators in community detection

Fig. 7.

Comparison of three matrices with fixed $d_{in}$. $N = 10000$, $c = 10$, and $d_{in} = 0.50$, $0.80$, $0.75$, $0.70$, respectively. $p_{in}^{+}$ is varied in equal steps.

In the following, we give examples to show the better performance of the improved algorithm. The results are shown in Fig. 6. The $x$-axis represents $p_{in}^{+}$, the $y$-axis represents $d_{in}$, and the $z$-axis, as well as the color, represents the overlap. The experiments are performed on signed networks of size 10000 with average degree 10. We set $p_{in}^{+} = p_{out}^{-}$ for convenience and vary it in equal steps. Note that the values of $d_{in}$ in the figure are normalized, that is, $d_{in} + d_{out} = 1$, and $d_{in}$ is likewise varied in equal steps. As we can see, the performance of the signed non-backtracking matrix is sensitive only to the link density of inter- and intra-community connections, while the adjacency-based algorithm is more sensitive to the link signs. The balanced non-backtracking matrix-based approach combines the advantages of both matrices, and its performance depends on both the link density and the link signs. Moreover, its undetectable region (in which the overlap is about 0.5) is the smallest, which indicates the best performance of this operator.

In Fig. 7, we show four concrete examples by taking $d_{in} \in \{0.50, 0.70, 0.75, 0.80\}$ and varying $p_{in}^{+}$. Each line in this figure is a slice of Fig. 6 taken at the corresponding value of $d_{in}$. The black line indicates the results detected using the adjacency matrix, the red line the results detected using the non-backtracking matrix, and the blue line the results detected using the balanced non-backtracking matrix. For each case, we conduct 10 experiments and connect the averages with the black/red/blue broken lines. Except for the first graph, which is the case of $d_{in} = 0.5$, we can draw the following conclusion.
The overlap is close to 0.5 (which means the nodes are labelled almost randomly) when the detection threshold based on the adjacency matrix is not satisfied, while the method based on the non-backtracking matrix still performs well. As discussed above, its overlap does not change as $p_{in}^{+}$ increases. As $p_{in}^{+}$ increases, the adjacency matrix-based algorithm outperforms the signed non-backtracking matrix-based one. However, the balanced non-backtracking-based method always performs better than the adjacency matrix-based approach. Together, these observations lead to a satisfactory conclusion. The first graph is the only exception. In this case, the density is useless for dividing the community structure ($d_{in} = d_{out} = 0.5$), so the signed non-backtracking matrix-based algorithm is invalid, and the performance of the balanced non-backtracking-based method is weaker than that of the sign-sensitive algorithm.

In Fig. 8, we discuss the results further from another perspective: we show three concrete examples by taking $p_{in}^{+} \in \{0.50, 0.60, 0.70\}$ and varying $d_{out}$. These are slices of Fig. 6 taken from another angle. The performance of the adjacency matrix-based approach is not completely independent of $d_{out}$. For one of the plotted values of $p_{in}^{+}$, the overlap increases when $d_{out}$ is small. However, after the threshold is satisfied, the performance does not increase as $d_{out}$ decreases. This further shows that the original non-backtracking operator $H$ is sign insensitive. As before, in the case of $p_{in}^{+} = 0.5$, density sensitivity is the only factor we need to consider, so $H$ is superior, while in other cases $H_b$ shows its advantages.

[Figure 8: three panels of overlap versus $d_{out}$ for $A$, $H$, and $H_b$, with $p_{in}^{+} = 0.50$, $0.60$, and $0.70$.]

Fig. 8. Comparison of three matrices with fixed $p_{in}^{+}$. $N = 10000$, $c = 10$, and $p_{in}^{+} = 0.50$, $0.60$, $0.70$, respectively. $d_{out}$ is varied in equal steps.

In addition, we know that the community structure may be difficult to detect when the graph is sparse. For example, in unsigned networks, when $c$ is constant and $n$ is large, spectral detection breaks down for several reasons. Most importantly, the leading eigenvalues of $A$ are dominated by the vertices of the highest degree, and the corresponding eigenvectors are localized around these vertices [21], [32]. At the same time, the non-backtracking matrix performs better in the sparse case. We expect the same conclusion to hold in signed networks. If the right side of the first inequality of (15) is regarded as a function of $d_{in}$, we get a lower bound from the monotonicity of the function, as below,

< p + in < 12 + 12 r cd in > c + √ cN . (16)

Therefore, we show theoretically that, when detection with the non-backtracking matrix is feasible, the smaller $c$ is, the better the result based on the non-backtracking matrices is compared with the adjacency matrix. Note that this better performance in the sparse case is relative. In fact, as the network becomes sparser, our detection accuracy certainly decreases correspondingly. The result is also confirmed by numerical simulations performed on stochastic block models of fixed size $N$ with $p_{in}^{+} = 0.70$ and $d_{in} = 0.75$, in which the average degree $c$ is varied in equal steps. As shown in Fig. 9, the performance of the non-backtracking matrix-based algorithm is more stable and robust than that of the adjacency matrix with respect to the average degree.

[Figure 9: overlap versus average degree $c$ (axis ticks from 10 to 60) for $A$, $H$, and $H_b$, with $p_{in}^{+} = 0.70$ and $d_{in} = 0.75$.]

Fig. 9. The performance of three matrices on graphs with different sparsity. $p_{in}^{+} = 0.70$, $d_{in} = 0.75$.

Based on the above comparison, we are also interested in the newly proposed $H_b$, but these analyses are based only on numerical simulation, and the underlying theory remains to be studied. However, we can conclude that $H_b$ is sensitive to both the sign of the edges and the connection density. In most cases, it performs well. In contrast, $A$ and $H$ have their own limitations. Although they may be the best choice in some very special situations, generally speaking, the balanced non-backtracking-based algorithm is the most reliable and effective method.

V. CONCLUSIONS AND DISCUSSION

This paper investigates efficient community detection in signed networks by demonstrating the feasibility of defining a non-backtracking matrix for signed networks. We provide the definition of a proper non-backtracking matrix from the perspectives of both structural balance theory and belief propagation. Based on the proposed non-backtracking matrix, we analytically determine the community detectability and propose the most efficient operator, the balanced non-backtracking operator $H_b$, which significantly outperforms the adjacency matrix-based detection algorithms.

It is worth noting that we would like the algorithm to be sensitive and adaptable to both sign and density. Our algorithm (as well as the previous algorithms) is effective under the most standard community partition, but for communities with different characteristics and for more complex networks, it is advisable to gain a basic understanding of the network before selecting the appropriate algorithm. In this respect, the balanced non-backtracking matrix is the most universal. The exception is that, when the communities in the actual network are neither relationship dependent nor density dependent, the above algorithms may not achieve satisfactory results.
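As a concrete illustration of the sign insensitivity of $H$ discussed in the comparison above, the following is a minimal Python sketch, not the authors' implementation. It assumes, reading the convention off Eq. (25) in the Appendix, that the signed non-backtracking matrix has entry $+1$ for consecutive directed edges $u \to v$, $v \to w$ ($w \neq u$) with equal signs and $-1$ for opposite signs. The toy edge list and the helper names `directed_edges` and `non_backtracking` are hypothetical. The balanced operator $H_b$ is not sketched, since its construction is not restated in this section.

```python
import numpy as np

def directed_edges(signed_edges):
    """Expand undirected signed edges (u, v, s) into directed edges and their signs."""
    arcs, signs = [], []
    for u, v, s in signed_edges:
        arcs.extend([(u, v), (v, u)])
        signs.extend([s, s])
    return arcs, np.array(signs, dtype=float)

def non_backtracking(arcs, signs=None):
    """Non-backtracking matrix indexed by directed edges.

    Entry [(u->v), (v->w)] is nonzero only when w != u. Unsigned case: 1.
    Signed case (assumed convention): signs[u->v] * signs[v->w], i.e. +1
    when the two edges share a sign and -1 otherwise.
    """
    m = len(arcs)
    M = np.zeros((m, m))
    for e, (u, v) in enumerate(arcs):
        for f, (x, w) in enumerate(arcs):
            if x == v and w != u:
                M[e, f] = 1.0 if signs is None else signs[e] * signs[f]
    return M

# Hypothetical toy signed graph: (node, node, sign)
signed_edges = [(0, 1, +1), (1, 2, +1), (2, 0, -1), (2, 3, -1),
                (3, 4, +1), (4, 2, +1), (0, 3, -1)]
arcs, signs = directed_edges(signed_edges)

B = non_backtracking(arcs)          # unsigned non-backtracking matrix
H = non_backtracking(arcs, signs)   # signed version under the assumed convention
D = np.diag(signs)                  # diagonal matrix of edge signs, D @ D = I

# Under this convention H = D B D, a similarity transform of B,
# so both matrices have the same eigenvalues.
is_similar = np.allclose(H, D @ B @ D)
rho_B = np.max(np.abs(np.linalg.eigvals(B)))
rho_H = np.max(np.abs(np.linalg.eigvals(H)))
print(is_similar, np.isclose(rho_B, rho_H))
```

Under this assumed convention, $H = DBD$ with $D$ the diagonal matrix of edge signs, so $H$ and the unsigned matrix $B$ share one spectrum. This is consistent with the observation that $H$ responds to the link density but not to the link signs.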

The proposed framework shows great potential to detect communities with or without overlap and paves the way to understanding the collective behaviors of systems where positive and negative relationships coexist.

APPENDIX
THE DERIVATION PROCESS OF THE UPDATING RULES OF δ

In order to simplify the updating equation of $\delta$, let us consider Eq. (5) carefully. First of all, we notice that $n^{BP}_{+} = n^{BP}_{-}$ holds near the trivial fixed point, that is, $e^{-h}$ can be written as a constant. Introducing an unknown constant $Z$, we split Eq. (5) into two equations,

$$\eta^{+}_{v\to w} = Z \cdot \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \left( \eta^{+}_{u\to v}\, c_{in} + \eta^{-}_{u\to v}\, c_{out} \right) \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \left[ \left(1 - \eta^{+}_{u\to v}\right) c_{in} + \left(1 - \eta^{-}_{u\to v}\right) c_{out} \right], \tag{17}$$

$$\eta^{-}_{v\to w} = Z \cdot \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \left( \eta^{+}_{u\to v}\, c_{out} + \eta^{-}_{u\to v}\, c_{in} \right) \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \left[ \left(1 - \eta^{+}_{u\to v}\right) c_{out} + \left(1 - \eta^{-}_{u\to v}\right) c_{in} \right]. \tag{18}$$

Rewriting Eq. (17) near the trivial fixed point $\eta^{\pm}_{u\to v} = 1/2 \pm \delta_{u\to v}$,

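$$\frac{1}{2} + \delta_{v\to w} = Z \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \left[ \left(\frac{1}{2} + \delta_{u\to v}\right) c_{in} + \left(\frac{1}{2} - \delta_{u\to v}\right) c_{out} \right] \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \left[ \left(\frac{1}{2} - \delta_{u\to v}\right) c_{in} + \left(\frac{1}{2} + \delta_{u\to v}\right) c_{out} \right]. \tag{19}$$

Merging like terms, we get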

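$$\frac{1}{2} + \delta_{v\to w} = Z \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) + \delta_{u\to v}\left(c_{in} - c_{out}\right) \right] \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) - \delta_{u\to v}\left(c_{in} - c_{out}\right) \right], \tag{20}$$

and Eq. (18) simplifies to

$$\frac{1}{2} - \delta_{v\to w} = Z \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) - \delta_{u\to v}\left(c_{in} - c_{out}\right) \right] \times \prod_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) + \delta_{u\to v}\left(c_{in} - c_{out}\right) \right]. \tag{21}$$

Linearizing Eq. (20) and Eq. (21), it follows that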

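$$\frac{1}{2} + \delta_{v\to w} \approx Z \cdot \Bigg\{ \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|} + \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \delta_{u\to v}\left(c_{in} - c_{out}\right) \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|-1} - \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \delta_{u\to v}\left(c_{in} - c_{out}\right) \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|-1} \Bigg\}, \tag{22}$$

$$\frac{1}{2} - \delta_{v\to w} \approx Z \cdot \Bigg\{ \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|} - \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} \delta_{u\to v}\left(c_{in} - c_{out}\right) \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|-1} + \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \delta_{u\to v}\left(c_{in} - c_{out}\right) \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|-1} \Bigg\}. \tag{23}$$

In order to eliminate the constant $Z$, we calculate Eq. (22) plus Eq. (23) and Eq. (22) minus Eq. (23), respectively,

$$1 = 2Z \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|}, \qquad 2\delta_{v\to w} = 2Z \left[ \frac{1}{2}\left(c_{in} + c_{out}\right) \right]^{|N(v)|-1} \times \left(c_{in} - c_{out}\right) \times \Bigg( \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} - \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \Bigg) \delta_{u\to v}. \tag{24}$$

After eliminating the constant $Z$, we have

$$\delta_{v\to w} = \frac{c_{in} - c_{out}}{c_{in} + c_{out}} \times \Bigg( \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) = \operatorname{sign}(v\to w)}} - \sum_{\substack{u \in N(v) \\ \operatorname{sign}(u\to v) \neq \operatorname{sign}(v\to w)}} \Bigg) \delta_{u\to v}. \tag{25}$$

Thus, we get the updating rule of $\delta$ in signed networks,

$$\delta := \frac{c_{in} - c_{out}}{c_{in} + c_{out}}\, H^{T} \delta. \tag{26}$$

ACKNOWLEDGMENT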

Z-YZ, C-QQ, and G-HW were supported in part by the National Natural Science Foundation of China under Grant 11631014 and Grant 11871311, in part by the China Postdoctoral Science Foundation under Grant 2019TQ0188 and Grant 2019M662315, and in part by the Shandong University multidisciplinary research and innovation team of young scholars under Grant 2020QNQT017. X-RW acknowledges the partial support of the project "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (LZC0019)".

REFERENCES

[1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
[2] P. F. Jonsson, T. Cavanna, D. Zicha, and P. A. Bates, “Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis,” BMC Bioinformatics, vol. 7, no. 1, p. 2, 2006.
[3] P. Anchuri and M. Magdon-Ismail, “Communities and balance in signed networks: A spectral approach,” IEEE, 2012, pp. 235–242.
[4] F. Heider, “Attitudes and cognitive organization,” The Journal of Psychology, vol. 21, no. 1, pp. 107–112, 1946.
[5] G. Facchetti, G. Iacono, and C. Altafini, “Computing global structural balance in large-scale signed social networks,” Proceedings of the National Academy of Sciences, vol. 108, no. 52, pp. 20 953–20 958, 2011.
[6] S. A. Marvel, J. Kleinberg, R. D. Kleinberg, and S. H. Strogatz, “Continuous-time model of structural balance,” Proceedings of the National Academy of Sciences, vol. 108, no. 5, pp. 1771–1776, 2011.
[7] A. Kirkley, G. T. Cantwell, and M. Newman, “Balance in signed networks,” Physical Review E, vol. 99, no. 1, p. 012320, 2019.
[8] N. Monika, G. Lavanya, K. Vaishnavi, M. Kavya, and V. Kanaiya, “Structural balance theory based recommendation,” International Journal of Advanced Research in Computer Science, vol. 9, no. Special Issue 3, p. 45, 2018.
[9] C. Qu and H. Wang, “Impact of structural balance on self-avoiding pruning walk,” Physica A: Statistical Mechanics and its Applications, vol. 524, pp. 362–374, 2019.
[10] J. Leskovec, D. Huttenlocher, and J. Kleinberg, “Signed networks in social media,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1361–1370.
[11] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” The Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.
[12] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001.
[13] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Oakland, CA, USA, 1967, pp. 281–297.
[14] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[15] L. Yang, X. Cao, D. He, C. Wang, X. Wang, and W. Zhang, “Modularity based community detection with deep learning,” in IJCAI, vol. 16, 2016, pp. 2252–2258.
[16] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
[17] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[18] D. J. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[19] S. Chauhan, M. Girvan, and E. Ott, “Spectral properties of networks with community structure,” Physical Review E, vol. 80, no. 5, p. 056114, 2009.
[20] A. Pothen, “Graph partitioning algorithms with applications to scientific computing,” in Parallel Numerical Algorithms. Springer, 1997, pp. 323–368.
[21] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences, vol. 110, no. 52, pp. 20 935–20 940, 2013.
[22] M. Morrison and M. Gabbay, “Community detectability and structural balance dynamics in signed networks,” arXiv preprint arXiv:1912.07772, 2019.
[23] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications,” Physical Review E, vol. 84, no. 6, p. 066106, 2011.
[24] P. Ren, R. C. Wilson, and E. R. Hancock, “Graph characterization via Ihara coefficients,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 233–245, 2010.
[25] A. Coja-Oghlan, E. Mossel, and D. Vilenchik, “A spectral approach to analysing belief propagation for 3-colouring,” Combinatorics, Probability and Computing, vol. 18, no. 6, pp. 881–912, 2009.
[26] O. Angel, J. Friedman, and S. Hoory, “The non-backtracking spectrum of the universal cover of a graph,” Transactions of the American Mathematical Society, vol. 367, no. 6, pp. 4287–4318, 2015.
[27] A. Mellor and A. Grusovin, “Graph comparison via the nonbacktracking spectrum,” Physical Review E, vol. 99, no. 5, p. 052309, 2019.
[28] M. Kotani and T. Sunada, “Zeta functions of finite graphs,” Journal of Mathematical Sciences-University of Tokyo, vol. 7, no. 1, pp. 7–26, 2000.
[29] S. Janson, E. Mossel et al., “Robust reconstruction on trees is determined by the second eigenvalue,” The Annals of Probability, vol. 32, no. 3B, pp. 2630–2649, 2004.
[30] E. Mossel, Y. Peres et al., “Information flow on trees,” The Annals of Applied Probability, vol. 13, no. 3, pp. 817–844, 2003.
[31] H. Kesten and B. P. Stigum, “Additional limit theorems for indecomposable multidimensional Galton-Watson processes,” The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1463–1481, 1966.
[32] M. Krivelevich and B. Sudakov, “The largest eigenvalue of sparse random graphs,” Combinatorics, Probability and Computing, vol. 12, no. 1, pp. 61–72, 2003.

Zhaoyue Zhong received the B.S. degree from the School of Mathematics, Shandong University, China, in 2020. She is currently pursuing the M.S. degree in the School of Mathematical Sciences, Fudan University, China. Her current research interests include complex networks and mathematical methods in neural networks.

Xiangrong Wang is currently a research assistant professor at Southern University of Science and Technology, Shenzhen, China. She received her Ph.D. degree from the Delft University of Technology, the Netherlands, in 2016. Before joining Shenzhen, she was a postdoctoral researcher at the Delft University of Technology from 2017 to 2018. In 2017 and 2018, she was a visiting scholar at Zaragoza University, Zaragoza, Spain, and ISI Foundation, Torino, Italy. Her research focuses on modeling and analysis of complex networks, nonlinear dynamics, and graph spectral analysis.


Cunquan Qu is currently working as a postdoctoral researcher at the Data Science Institute, Shandong University, Jinan, China. He did a joint Ph.D. project in the Multimedia Computing Group at the Delft University of Technology for two years. He received his Ph.D. degree from Shandong University in 2019. His research focuses on analyzing network structure, modeling dynamic processes, and graph neural networks.