
IDENTIFYING RUMOR SOURCES USING DOMINANT EIGENVALUE OF NONBACKTRACKING MATRIX

Jiachun Pan, Wenyi Zhang

University of Science and Technology of China, Hefei, China

ABSTRACT

We consider the problem of identifying rumor sources in a network, in which rumor spreading obeys a time-slotted susceptible-infected model. Unlike existing approaches, our proposed algorithm identifies as sources those nodes, which when set as sources, result in the smallest dominant eigenvalue of the corresponding reduced nonbacktracking matrix deduced from message passing equations. We also propose a reduced-complexity algorithm derived from the previous algorithm through a perturbation approximation. Numerical experiments on synthesized and real-world networks suggest that these proposed algorithms generally have higher accuracy compared with representative existing algorithms.

Index Terms: Dominant eigenvalue, message passing equations, multiple rumor sources, nonbacktracking matrix, susceptible-infected model

1. INTRODUCTION

Nowadays, facilitated by the development of the Internet and smart devices, online social networks such as Twitter, Facebook, and Weibo have become important enablers and primary conduits of rumor-like information. It is thus of considerable interest and importance to accurately identify rumor sources in networks.

Several works have investigated the problem of identifying rumor sources given a snapshot observation of the infected nodes in a network. For the basic case of a single rumor source, a network centrality metric called rumor center was initially developed in [1], which turns out to be the maximum likelihood estimate of the rumor source for regular tree networks under the susceptible-infected (SI) rumor spreading model. This has inspired a large number of works. For example, in [2], [3] another network centrality metric called Jordan center was used to detect a single rumor source under the susceptible-infected-recovered (SIR) rumor spreading model; in [4] the problem of rumor source identification was solved using Monte Carlo estimators, for an arbitrary network structure; in [5] the case where multiple snapshot observations are available was studied and a joint estimator was developed.

An important extension of the basic case is the case where multiple rumor sources exist. In [6] a method based on rumor centrality was developed, with computational complexity $O(N^{|S|})$, where $N$ is the number of infected nodes and $|S|$ is the number of rumor sources. In [7] a method based on Jordan center was developed under the SIR model, and it was proved that for regular trees the identified sources are within a constant distance from the actual sources with a high probability. In [8] a K-center method was developed, which first transforms the network into a distance network using an effective distance metric, then adaptively partitions the distance network and finally performs source identification. The method has a complexity of $O(MN \log \alpha)$, where $\alpha$ is a slowly growing inverse-Ackermann function related to the number of nodes $N$ and the number of edges $M$.

We note two facts: first, none of the existing methods, for either the single source or the multiple source case, is optimal when networks contain loops; second, all the existing methods for the multiple source case need to partition the infected network and then identify a source in each partitioned part, regardless of the reality that the infected nodes from different sources may substantially overlap.

In this paper, we investigate the problem of identifying multiple rumor sources in a general network which may be highly loopy, and our main contribution is the proposal of a novel heuristic method based on the dominant eigenvalue of a matrix obtained from the Hashimoto or nonbacktracking matrix [9] of the infected network. We develop the method via an approximate analysis of message passing equations of the rumor spreading process, combined with some empirical observations. Unlike existing methods, our method neither needs to convert a loopy network into a tree, nor needs to partition the infected network into non-overlapping parts. Numerical experiments on synthesized and real-world networks suggest that our method generally has higher accuracy compared with representative existing algorithms, especially for highly loopy networks.

The rest of the paper is organized as follows. Section 2 describes the problem formulation, message passing equations, and the proposed approach. Section 3 presents the algorithms. Section 4 shows the numerical results. Section 5 concludes the paper.

This work was supported in part by the National Natural Science Foundation of China under Grant 61722114.

2. PROBLEM FORMULATION, MESSAGE PASSING EQUATIONS, AND PROPOSED APPROACH

In this section, we describe the problem setup, develop the message passing equations, and motivate our approach.

We assume that rumors spread in an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. We adopt a time-slotted susceptible-infected (SI) model. In this model, each node $v \in V$ has two possible states: susceptible and infected. A source node is infected from the beginning; a non-source node is said to be susceptible if it has not received the rumor, and infected if it has received the rumor from one of its neighbors. We assume that there are $|S|$ rumor sources, and that they start spreading their rumors at the same time, termed time zero. At each time step, an infected node infects a neighboring susceptible node with probability $p \ll 1$, and such infections of different neighboring susceptible nodes are mutually independent. As time grows without bound, eventually all connected nodes will be infected for any non-zero value of $p$. Given a snapshot observation of the infected nodes, we want to identify those rumor sources.

Denote by $P_i^{(t)}$ the probability that node $i$ has not been infected by time $t$, and by $v_{i \to j}^{(t)}$ the probability that node $i$ has not passed the rumor to a neighboring node $j$ by time $t$. Note that by assumption the rumor spreading starts at time $t = 0$. Assign to each node an indicator $n_i$ to indicate whether node $i$ is a source: $n_i = 0$ if node $i$ is a source, and $n_i = 1$ otherwise.

Denote by $\partial i$ the set of neighbors of node $i$, and by $\partial i \setminus j$ the set of neighbors of node $i$ excluding node $j$. For a general network with loops, the probabilities $\{ v_{k \to i}^{(t)} \mid k \in \partial i \setminus j \}$ are to some extent correlated and this fact makes a rigorous analysis intractable. Therefore, in subsequent development we make a key approximation that these probabilities are mutually independent; see, e.g., [10] for a similar treatment.
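The time-slotted SI model described above can be illustrated with a short simulation. This is only a minimal sketch under stated assumptions: the function name `simulate_si`, the adjacency-dictionary representation, and the toy path graph are hypothetical and not taken from the paper.

```python
import random


def simulate_si(adj, sources, p, steps, seed=0):
    """Run a time-slotted SI process and return the infected set after `steps` steps.

    adj     : dict mapping each node to a list of its neighbors (undirected graph)
    sources : iterable of nodes infected at time zero
    p       : per-step probability that an infected node infects a given
              susceptible neighbor
    """
    rng = random.Random(seed)
    infected = set(sources)
    for _ in range(steps):
        newly_infected = set()
        for i in infected:
            for j in adj[i]:
                # Each (infected node, susceptible neighbor) pair is an
                # independent trial with success probability p.
                if j not in infected and rng.random() < p:
                    newly_infected.add(j)
        # Nodes infected in this step start spreading only from the next step.
        infected |= newly_infected
    return infected


# Hypothetical toy graph: a path 0-1-2-3-4 with node 2 as the single source.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate_si(path, sources=[2], p=0.3, steps=5)
```

The returned set plays the role of the snapshot observation: it records which nodes are infected, but not when or by whom.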
Hence we can deduce the following message passing equations:

$$ v_{i \to j}^{(t)} = 1 - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ 1 - n_i \prod_{k \in \partial i \setminus j} v_{k \to i}^{(t-\tau)} \right]. \tag{1} $$

To see the validity of (1), note that if node $i$ is a source, i.e., $n_i = 0$, then (1) is simply the probability that node $i$ has not infected node $j$ throughout $t$ time steps, each step being an independent Bernoulli trial with success probability $p$; otherwise, if $n_i = 1$, the term in the square brackets of (1) is the probability that node $i$ has been infected by time $t - \tau$ (under the approximation of mutual independence), the term $(1-p)^{\tau-1} p$ is the probability that node $i$ then takes $\tau$ time steps to infect node $j$, and (1) then follows from the law of total probability.

We can also deduce the probabilities $P_i^{(t)}$ as:

$$ P_i^{(t)} = n_i \prod_{j \in \partial i} v_{j \to i}^{(t)}. \tag{2} $$

Taking the limit $t \to \infty$ in (1), and noting that $(1-p)^{\tau-1} p \to 0$ as $\tau \to \infty$, we get

$$ v_{i \to j}^{(\infty)} = n_i \prod_{k \in \partial i \setminus j} v_{k \to i}^{(\infty)}. \tag{3} $$

This confirms that once a node is infected, its neighbors will eventually be infected as time grows without bound.

A snapshot observation of the infected nodes is a graph with $N$ nodes and $M$ edges. For a given snapshot observation, the approximate message passing equations in (1) constitute a collection of nonlinear equations of $v_{\to}^{(\tau)} = ( \cdots, v_{i \to j}^{(\tau)}, \cdots )$ for all the $2M$ directed links $i \to j$ and all time steps $\tau \in \{0, 1, \ldots, t\}$, which can be collectively written as a discrete-time dynamical system with initial condition $v_{\to}^{(0)} = e$:

$$ v_{\to}^{(t)} = e - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ e - f\left( v_{\to}^{(t-\tau)}, n_{\to} \right) \right], \tag{4} $$

where $e$ is a $1 \times 2M$ all-one vector, $n_{\to} = ( \cdots, n_{i \to j}, \cdots )$ in which $n_{i \to j} = n_i$, and $f = ( \cdots, f_{i \to j}, \cdots )$ in which $f_{i \to j}$ is the nonlinear function of $v_{\to}$ for $i \to j$ given in (1), i.e., $f_{i \to j}(v_{\to}, n_{\to}) = n_i \prod_{k \in \partial i \setminus j} v_{k \to i}$.

To obtain some insights, we linearly approximate $f_{i \to j}(v_{\to}, n_{\to})$ at $v_{\to} = e$:

$$ f_{i \to j}(v_{\to}, n_{\to}) \approx f_{i \to j}(e, n_{\to}) + f'_{i \to j}(e, n_{\to}) (v_{\to} - e)^{T}, \tag{5} $$

where $f'_{i \to j}(v_{\to}, n_{\to}) = \left( \cdots, \frac{\partial f_{i \to j}}{\partial v_{k \to l}}, \cdots \right)$. By (1), we have that for $l \neq i$,

$$ \left. \frac{\partial f_{i \to j}}{\partial v_{k \to l}} \right|_{v_{\to} = e} = 0, $$

and for $l = i$ and $k \neq j$,

$$ \left. \frac{\partial f_{i \to j}}{\partial v_{k \to l}} \right|_{v_{\to} = e} = n_i. $$

So we get the linear approximation of $v_{i \to j}^{(t)}$ as

$$ v_{i \to j}^{(t)} = 1 - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ 1 - n_i - n_i \sum_{k \in \partial i \setminus j} \left( v_{k \to i}^{(t-\tau)} - 1 \right) \right], \tag{6} $$

which can be collectively written in matrix form as

$$ v_{\to}^{(t)} = e - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ e - n_{\to} + \left( e - v_{\to}^{(t-\tau)} \right) R \right], \tag{7} $$

where $R$ is a $2M \times 2M$ matrix:

$$ R_{k \to l, i \to j} = n_i B_{k \to l, i \to j}, \qquad B_{k \to l, i \to j} = \begin{cases} 1 & \text{if } l = i \text{ and } j \neq k, \\ 0 & \text{otherwise}. \end{cases} \tag{8} $$

The matrix $B$ is known as the Hashimoto or nonbacktracking matrix of a graph [9], which is also closely associated with the spreading capability [11]. We thus call $R$ a reduced nonbacktracking matrix since it is obtained from $B$ by setting to zero the entries corresponding to $n_i = 0$.

We can further rewrite (7) in a recursive form so that $v_{\to}^{(t+1)}$ is only related to $v_{\to}^{(t)}$. Writing $u_{\to}^{(t)} = e - v_{\to}^{(t)}$, splitting off the $\tau = 1$ term of the sum in (7), and recognizing the remaining terms as $(1-p) u_{\to}^{(t)}$, we obtain

$$ u_{\to}^{(t+1)} = p (e - n_{\to}) + u_{\to}^{(t)} \left[ (1-p) I + p R \right], \tag{9} $$

where $I$ is the identity matrix, and $u_{i \to j}^{(t)} = 1 - v_{i \to j}^{(t)}$ is the probability that node $i$ has passed the rumor to a neighboring node $j$ by time $t$.
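As an illustration of definition (8) and of the linearized recursion for $u_{\to}^{(t)}$, the sketch below builds $B$ and $R$ for a small graph and iterates $u \leftarrow p(e - n_{\to}) + u[(1-p)I + pR]$ from $u^{(0)} = 0$. The function names, the dense NumPy representation, and the toy graph are assumptions made for this example only; a practical implementation would likely use sparse matrices.

```python
import numpy as np


def nonbacktracking(edges):
    """Return (B, links): B is 2M x 2M with B[k->l, i->j] = 1 iff l == i and j != k."""
    links = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    B = np.zeros((len(links), len(links)))
    for a, (k, l) in enumerate(links):
        for b, (i, j) in enumerate(links):
            if l == i and j != k:
                B[a, b] = 1.0
    return B, links


def reduce_matrix(B, links, sources):
    """Return (R, n_arrow): column i->j of B is scaled by n_i (0 for sources, 1 otherwise)."""
    n_arrow = np.array([0.0 if i in sources else 1.0 for i, _ in links])
    return B * n_arrow[np.newaxis, :], n_arrow


def linear_u(R, n_arrow, p, t):
    """Iterate u(t+1) = p (e - n) + u(t) [(1 - p) I + p R], starting from u(0) = 0."""
    A = (1.0 - p) * np.eye(R.shape[0]) + p * R
    u = np.zeros(R.shape[0])
    for _ in range(t):
        u = p * (1.0 - n_arrow) + u @ A
    return u


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
B, links = nonbacktracking(edges)
R, n_arrow = reduce_matrix(B, links, sources={2})
u5 = linear_u(R, n_arrow, p=0.1, t=5)
```

Here `u` is treated as a row vector, so `u @ A` matches the left multiplication used in (7). Because this is the linear approximation, entries of `u` are only meaningful as probabilities for small `t`.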
The task of identifying rumor sources is to determine a $\{0, 1\}$-vector $n$ subject to $\sum_{i=1}^{N} n_i = N - |S|$. Different choices of $n$ lead to different evolution processes of $v_{\to}^{(t)}$ (or $u_{\to}^{(t)}$ equivalently). In our study, we have numerically computed such evolution processes, according to both (4) and its linear approximation (7), for different types of networks. Fig. 1 displays a representative scenario, where for a snapshot observation of the infected graph in a small-world network containing a single rumor source (i.e., $|S| = 1$), we compute and draw $\| u_{\to}^{(t)} \|$ under different choices of $n$. (Here the number of infected nodes is 400 and $p$ is 0.1.) The dark star curve is the evolution process of $\| u_{\to}^{(t)} \|$ when $n$ coincides with the actual rumor source, the grey solid curves are the evolution processes of $\| u_{\to}^{(t)} \|$ when $n$ randomly indicates a rumor source, and the dashed curves are the linear approximations of $u_{\to}^{(t)}$.

From Fig. 1, we have the following empirical observations:

• The evolution processes of $\| u_{\to}^{(t)} \|$ computed according to (4) eventually approach the stable state where every connected node is infected with probability one, and when $n$ coincides with the actual rumor source, the evolution process approaches this stable state the most quickly.
• The linear approximation of $u_{\to}^{(t)}$ computed according to (7) is accurate for small values of $t$, so that when $n$ coincides with the actual rumor source, the linearly approximated evolution process of $\| u_{\to}^{(t)} \|$ grows the most quickly.

[Fig. 1. Evolution processes of $\| u_{\to} \|$ in a small-world network. Horizontal axis: $t$; vertical axis: $\| u_{\to} \|$. Legend: infection begins at the actual source; infection begins at other nodes; evolution process (4); linear approximation (7).]

Iteratively applying (9) leads to:

$$ u_{\to}^{(t)} = p (e - n_{\to}) \left( I + [(1-p) I + p R] + [(1-p) I + p R]^{2} + \cdots + [(1-p) I + p R]^{t-1} \right). \tag{10} $$

So in the linear approximation, the growth rate of $\| u_{\to}^{(t)} \|$ is determined by $(e - n_{\to}) [(1-p) I + p R]^{t} \approx (e - n_{\to}) [(1-p) I + p B]^{t}$, since $p \ll 1$ and $R$ differs from $B$ only at the few entries indicated by $n_{\to}$. In light of the two empirical observations, now the task is to choose $n_{\to}$ such that the non-zero entries of $(e - n_{\to})$ select the rows of $[(1-p) I + p B]^{t}$ whose sum vector yields the largest norm. With a binomial expansion, this norm is determined by the numbers of nonbacktracking paths of lengths up to $t$ starting from edges like $s \to i$, where $s$ is a source node indicated by $n$ and $i$ is one of its neighbors. On the other hand, note that choosing $n_{\to}$ also determines the reduced nonbacktracking matrix $R$, and that when the resulting $R$ has the smallest dominant eigenvalue, the corresponding sources have the largest number of nonbacktracking paths passing through them [11].
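This observation provides a reasonable heuristic for the choice of $n_{\to}$, and motivates us to propose the following minimax criterion:

$$ \min_{n} \max \lambda(R), \tag{11} $$

where $\max \lambda(R)$ denotes the dominant eigenvalue of the reduced nonbacktracking matrix $R$ determined by $n$.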
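The nonlinear evolution processes discussed in this section can be computed by iterating the message passing equations (1) directly. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses the fact that, because the weights $(1-p)^{\tau-1}p$ are geometric, the sum in (1) satisfies the one-step update $u^{(t+1)} = (1-p)\,u^{(t)} + p\,[1 - n_i \prod_k v^{(t)}_{k \to i}]$ with $u = 1 - v$. The function name, the adjacency dictionary, and the toy graph are hypothetical.

```python
import numpy as np


def message_passing(adj, sources, p, t_max):
    """Return (links, U) where U[t] holds u(t)_{i->j} = 1 - v(t)_{i->j} for t = 0..t_max."""
    links = [(i, j) for i in adj for j in adj[i]]
    index = {link: a for a, link in enumerate(links)}
    u = np.zeros(len(links))  # v(0) = e, so u(0) = 0
    U = [u.copy()]
    for _ in range(t_max):
        v = 1.0 - u
        g = np.empty(len(links))
        for a, (i, j) in enumerate(links):
            if i in sources:
                g[a] = 1.0  # n_i = 0: the bracket in (1) equals 1
            else:
                prod = 1.0
                for k in adj[i]:
                    if k != j:
                        prod *= v[index[(k, i)]]
                g[a] = 1.0 - prod  # n_i = 1: probability that i is already infected
        # One-step form of the geometric sum in (1).
        u = (1.0 - p) * u + p * g
        U.append(u.copy())
    return links, np.array(U)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
links, U = message_passing(adj, sources={0}, p=0.1, t_max=30)
norms = np.linalg.norm(U, axis=1)  # one value of ||u(t)|| per time step
```

Calling `message_passing` once per candidate source and comparing the resulting `norms` curves gives the kind of comparison shown in Fig. 1, although this toy graph is far smaller than the 400-node snapshot used there.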

3. ALGORITHMS

In this section, we describe algorithms to identify rumor sources based on the reasoning in the previous section.

According to (11), we need to find the nodes which, when set as sources, result in the minimum dominant eigenvalue of the corresponding reduced nonbacktracking matrix. (This heuristic has been verified with extensive numerical experiments in our study.) We can use the power iteration to compute the dominant eigenvalue in a time that scales as $O(M)$, and repeat this for each of the $\binom{N}{|S|}$ configurations of sources. (In our simulations in Section 4, the number of power iterations is fixed.) Thus, the complexity of the MSI algorithm is $O(M N^{|S|})$. The procedure of the MSI algorithm is given in Algorithm 1.

Algorithm 1 Multiple Source Identification (MSI)
Input: Nonbacktracking matrix $B$ of an infected graph of $N$ nodes and $M$ edges; number of rumor sources $|S|$;
Output: Identified rumor sources $\hat{S}$;
for $i = 1, \ldots, \binom{N}{|S|}$ do
    Enumerate a set of potential source nodes as $\hat{S}_i = \{ i_1, i_2, \ldots, i_{|S|} \}$;
    Form $R_i$ by setting to zero the entries in $B$ corresponding to $n_{i_j} = 0$, $i_j \in \hat{S}_i$;
    Use the power iteration to compute the dominant eigenvalue $\lambda_{i,\max}$ of $R_i$;
end for
Declare the set $\hat{S}_i$ with the minimal $\lambda_{i,\max}$ as rumor sources $\hat{S}$.

Now we derive an approximation of $\Delta\lambda = \lambda_{\max}(B) - \lambda_{\max}(R)$ by applying a method similar to that in [12]. Denote $R$ by $B - \Delta B$, the dominant eigenvalue of $R$ by $\lambda_{\max} - \Delta\lambda$, and its corresponding right eigenvector by $u - \Delta u$. We have

$$ (B - \Delta B)(u - \Delta u) = (\lambda_{\max} - \Delta\lambda)(u - \Delta u), \tag{12} $$

where $\lambda_{\max}$ is the dominant eigenvalue of $B$ and $u$ is its corresponding right eigenvector. Then we multiply both sides of (12) by the left eigenvector $v^{T}$ of $B$, use $v^{T} B = \lambda_{\max} v^{T}$, and get

$$ \Delta\lambda = \frac{v^{T} \Delta B\, u - v^{T} \Delta B\, \Delta u}{v^{T} u - v^{T} \Delta u}. \tag{13} $$

We apply a perturbation analysis on $\Delta u$. When we set entries in $B$ to zero according to the source set $S$, the entries $i \to s$ ($s \in S$, $i \in \partial s$) of the right eigenvector will be zero, and other entries will be perturbed slightly. So we write $\Delta u = u_{\to s} - \delta u$, where $u_{\to s}$ is a vector in which we only keep the entries $i \to s$ ($s \in S$, $i \in \partial s$) of $u$ and set others to zero, and $\delta u$ is small. So by neglecting the second order terms $v^{T} \Delta B\, \delta u$ and $\Delta\lambda\, v^{T} \delta u$, we obtain

$$ \Delta\lambda \approx \frac{v^{T} \Delta B\, u - v^{T} \Delta B\, u_{\to s}}{v^{T} u - v^{T} u_{\to s}}. \tag{14} $$

According to the definition of $\Delta B$, we obtain

$$ \Delta\lambda \approx \frac{\sum_{i \in \partial s} \sum_{k \in \partial s \setminus i} v_{i \to s}\, u_{s \to k}}{v^{T} u - v^{T} u_{\to s}}. \tag{15} $$

With this approximation, we only need to compute the dominant eigenvalue and associated eigenvectors of the nonbacktracking matrix $B$, rather than the dominant eigenvalues of all the reduced nonbacktracking matrices, as in the MSI algorithm. So the complexity is reduced from $O(M N^{|S|})$ to $O(N^{|S|})$. The procedure of the reduced-complexity PMSI algorithm is given in Algorithm 2.
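Algorithm 1 can be sketched as a brute-force search over candidate source sets. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses dense matrices, and it runs the power iteration on $R + I$ instead of $R$ (a design choice made here because $R$ can be nilpotent or periodic, in which case plain power iteration breaks down or oscillates; the shift moves every eigenvalue by one and is undone at the end). All names and the toy "bowtie" graph are hypothetical.

```python
import itertools
import numpy as np


def nonbacktracking(adj):
    """Return (B, links) with B[k->l, i->j] = 1 iff l == i and j != k."""
    links = [(i, j) for i in adj for j in adj[i]]
    B = np.zeros((len(links), len(links)))
    for a, (k, l) in enumerate(links):
        for b, (i, j) in enumerate(links):
            if l == i and j != k:
                B[a, b] = 1.0
    return B, links


def dominant_eigenvalue(R, iters=200):
    """Estimate the dominant eigenvalue of a nonnegative matrix R by power iteration on R + I."""
    A = R + np.eye(R.shape[0])
    x = np.ones(R.shape[0])
    lam = 1.0
    for _ in range(iters):
        y = A @ x
        lam = np.linalg.norm(y) / np.linalg.norm(x)
        x = y / np.linalg.norm(y)
    return lam - 1.0  # undo the shift


def msi(adj, num_sources, iters=200):
    """Return the candidate set with the smallest estimated dominant eigenvalue of R."""
    B, links = nonbacktracking(adj)
    best_set, best_lam = None, np.inf
    for candidate in itertools.combinations(sorted(adj), num_sources):
        # n_i = 0 for candidate sources: zero the columns i->j of B.
        n_arrow = np.array([0.0 if i in candidate else 1.0 for i, _ in links])
        lam = dominant_eigenvalue(B * n_arrow[np.newaxis, :], iters)
        if lam < best_lam:
            best_set, best_lam = candidate, lam
    return best_set, best_lam


# Hypothetical toy graph: two triangles (0-1-2 and 2-3-4) sharing node 2.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
best_set, best_lam = msi(bowtie, num_sources=1)
```

On this toy graph the search selects node 2: removing it leaves no cycles, so the reduced matrix is nilpotent and its dominant eigenvalue is zero. The power iteration converges only slowly in that nilpotent case, so the returned estimate is small but not exactly zero.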
Algorithm 2 Perturbation-based Multiple Source Identification (PMSI)
Input: Nonbacktracking matrix $B$ of an infected graph of $N$ nodes and $M$ edges; number of rumor sources $|S|$;
Output: Identified rumor sources $\hat{S}$;
for $i = 1, \ldots, \binom{N}{|S|}$ do
    Enumerate a set of potential source nodes as $\hat{S}_i = \{ i_1, i_2, \ldots, i_{|S|} \}$;
    Get $u_{\to i_j}$ ($i_j \in \hat{S}_i$) from $u$ and calculate $\Delta\lambda_i$ according to (15);
end for
Declare the set $\hat{S}_i$ with the maximal $\Delta\lambda_i$ as rumor sources $\hat{S}$.
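Algorithm 2 can be sketched in the same style. The code below is a hedged illustration: it obtains the dominant right and left eigenvectors of $B$ with `numpy.linalg.eig` instead of power iteration, and for $|S| > 1$ it sums the numerator of (15) and the term $v^{T} u_{\to s}$ over all candidate sources, which is an assumption of this sketch about how (15) extends to several sources. All function names and the toy graph are hypothetical.

```python
import itertools
import numpy as np


def nonbacktracking(adj):
    """Return (B, links) with B[k->l, i->j] = 1 iff l == i and j != k."""
    links = [(i, j) for i in adj for j in adj[i]]
    B = np.zeros((len(links), len(links)))
    for a, (k, l) in enumerate(links):
        for b, (i, j) in enumerate(links):
            if l == i and j != k:
                B[a, b] = 1.0
    return B, links


def perron_vectors(B):
    """Right and left eigenvectors of B for its eigenvalue with the largest real part."""
    w, V = np.linalg.eig(B)
    u = np.real(V[:, np.argmax(w.real)])
    w_left, V_left = np.linalg.eig(B.T)
    v = np.real(V_left[:, np.argmax(w_left.real)])
    return u, v


def delta_lambda(adj, index, u, v, candidate):
    """Approximate eigenvalue drop (15) for a candidate source set."""
    numerator = 0.0
    overlap = 0.0  # v^T u_{->s}: only entries i->s with s a candidate source
    for s in candidate:
        for i in adj[s]:
            overlap += v[index[(i, s)]] * u[index[(i, s)]]
            for k in adj[s]:
                if k != i:
                    numerator += v[index[(i, s)]] * u[index[(s, k)]]
    return numerator / (v @ u - overlap)


def pmsi(adj, num_sources):
    """Return the candidate set with the largest approximate eigenvalue drop."""
    B, links = nonbacktracking(adj)
    index = {link: a for a, link in enumerate(links)}
    u, v = perron_vectors(B)
    candidates = itertools.combinations(sorted(adj), num_sources)
    return max(candidates, key=lambda c: delta_lambda(adj, index, u, v, c))


# Hypothetical toy graph: two triangles (0-1-2 and 2-3-4) sharing node 2.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
identified = pmsi(bowtie, num_sources=1)
```

The ratio in (15) is unchanged when $u$ or $v$ is rescaled or flips sign, so no normalization of the eigenvectors is needed. The eigen-decomposition of $B$ is computed once and reused for every candidate set, which is where the complexity reduction relative to Algorithm 1 comes from.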

4. SIMULATIONS

In this section, we evaluate the performance of our proposed algorithms on different synthesized and real-world networks, including small-world networks, power grids, Facebook networks, and regular lattices.

In the single source case, we compare our algorithms with two representative algorithms, the Jordan center (JC) [2] and the rumor center (RC) combined with a breadth-first-search (BFS) tree heuristic [1]. Note that for loopy networks all these algorithms are heuristic in nature.

We evaluate the performance using three metrics: (1) Accuracy: the probability that the identified source node is the actual source. (2) One-hop accuracy: the probability that the distance between the identified source node and the actual source is no more than one hop. (3) Average error distance: the average number of hops between the identified source node and the actual source.

In simulating the rumor spreading process we keep $p$ small (recall $p \ll 1$) so that the infected nodes are sufficiently spread. We consider four kinds of networks: synthesized small-world networks, the western states power grid network of the United States, a fraction of the Facebook network with 4039 nodes, and regular lattices. (Data source: http://snap.stanford.edu/data/index.html) Note that these networks are all loopy, especially the latter three kinds. We generate 500 instances of 400-node infected graphs for each network. The average diameters of infected graphs for these networks are 15.5 (small-world networks), 19.5 (power grids), 10.9 (Facebook networks) and 36.8 hops (regular lattices), respectively. Table 1 shows the simulation results. We see that the MSI algorithm generally outperforms both JC and RC-BFS, and the PMSI algorithm also performs quite well; sometimes it even outperforms the original MSI algorithm. The performance advantage is the most evident for highly loopy networks, e.g., Facebook networks and regular lattices.

Table 1. Simulation results in single source case

(a) Accuracy
Network            JC       RC-BFS   MSI      PMSI
Small-world        18.2%    %        %        %
Power grids        2.6%     %        %        %
Facebook           1.8%     %        %        %
Regular lattices   6.8%     %        %        %

(b) One-hop accuracy
Network            JC       RC-BFS   MSI      PMSI
Small-world        77.8%    %        %        %
Power grids        17.2%    %        %        %
Facebook           17.6%    %        %        %
Regular lattices   28.2%    %        %        %

(c) Average error distance
Network            JC       RC-BFS   MSI      PMSI
Small-world        1.06     1.40     1.05     1.06
Power grids        3.17     3.45     3.43     3.77
Facebook           2.37     2.35     1.96     2.13
Regular lattices   2.36     4.36     2.39     1.73

In the multiple source case, we generate 500 instances of 100-node infected graphs for small-world and Facebook networks. The sources are randomly picked within each network. The average diameters of infected graphs are 16 (small-world networks) and 8.9 hops (Facebook networks), respectively. With multiple sources, we modify the performance metrics in the following way: we associate the identified sources $\hat{S}$ with the actual sources $S$ so that the normalized total error distance between $\hat{S}$ and $S$, i.e., $\triangle = \frac{1}{|S|} \sum_{i=1}^{|S|} d(\hat{s}_i, s_i)$, where $d(\hat{s}_i, s_i)$ is the number of hops between the actual source $s_i$ and its associated identified source $\hat{s}_i$, is minimized. We then define the accuracy as the probability that $\hat{S} = S$, the one-hop accuracy as the probability that $d(\hat{s}_i, s_i) \leq 1$, $\forall i = 1, \ldots, |S|$, and the average error distance as the average of the minimum $\triangle$.

Table 2. Simulation results in multiple source case
Network   |S|   Accuracy   One-hop accuracy   $\triangle$
Small-world 2 . 8% 26 . 0% 22 . . 0% 12 .
$\triangle$: average error distance

Table 2 shows the simulation results when $|S|$ is two or three using the MSI algorithm. Although the accuracy and the one-hop accuracy drastically degrade compared with the single source case, the average error distance is usually less than two hops. Furthermore, it is interesting to notice that the average error distance decreases as the number of sources increases. This may be because, with more sources, even though it is challenging to identify all of them accurately, some of them are likely to be identified accurately, which lowers the average of the error distances.
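The association step behind the multiple source metrics can be sketched as follows. This is a minimal illustration, assuming a connected graph and a small $|S|$ so that trying every permutation is affordable; the function names and the toy path graph are hypothetical. The one-hop check is evaluated under the association that minimizes the total error distance, which is one reading of the definition above.

```python
import itertools
from collections import deque


def hop_distances(adj, start):
    """Breadth-first search: number of hops from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


def match_sources(adj, identified, actual):
    """Return (error, one_hop, exact) under the association minimizing the error distance."""
    identified, actual = list(identified), list(actual)
    dist = {s: hop_distances(adj, s) for s in actual}
    best = None
    for perm in itertools.permutations(identified):
        d = [dist[s][s_hat] for s, s_hat in zip(actual, perm)]
        if best is None or sum(d) < sum(best):
            best = d
    error = sum(best) / len(actual)           # normalized total error distance
    one_hop = all(x <= 1 for x in best)       # every source within one hop
    exact = set(identified) == set(actual)    # identified set equals actual set
    return error, one_hop, exact


# Hypothetical toy graph: a path 0-1-2-3-4 with actual sources {0, 4}.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = match_sources(path, identified=[4, 1], actual=[0, 4])
```

Averaging the three returned values over many simulated instances would give the average error distance, the one-hop accuracy, and the accuracy, respectively.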

5. CONCLUSION

We proposed a novel heuristic source identification method for general loopy networks with multiple rumor sources, motivated by deducing and analyzing the behavior of message passing equations of the rumor spreading process, combined with some empirical observations. Numerical experiments show that for several representative kinds of general networks, the proposed method is competitive with existing methods. In future research, it is desirable to deepen our understanding of the proposed heuristic method, and to provide a solid theoretical foundation that explains its effectiveness.

6. REFERENCES

[1] D. Shah and T. Zaman, “Rumors in a network: Who’s the culprit?,” IEEE Trans. Inf. Theor., vol. 57, no. 8, pp. 5163–5181, Aug. 2011.
[2] K. Zhu and L. Ying, “Information source detection in the SIR model: A sample-path-based approach,” IEEE/ACM Transactions on Networking, vol. 24, no. 1, pp. 408–421, Feb. 2016.
[3] W. Luo, W. P. Tay, and M. Leng, “How to identify an infection source with limited observations,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 4, pp. 586–597, Aug. 2014.
[4] Nino Antulov-Fantulin, Alen Lančić, Tomislav Šmuc, Hrvoje Štefančić, and Mile Šikić, “Identification of patient zero in static and temporal networks: Robustness and limitations,” Phys. Rev. Lett., vol. 114, pp. 248701, Jun. 2015.
[5] Zhaoxu Wang, Wenxiang Dong, Wenyi Zhang, and Chee Wei Tan, “Rumor source detection with multiple observations: fundamental limits and algorithms,” in ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’14, Austin, TX, USA - June 16 - 20, 2014, 2014, pp. 1–13.
[6] W. Luo, W. P. Tay, and M. Leng, “Identifying infection sources and regions in large networks,” IEEE Transactions on Signal Processing, vol. 61, no. 11, pp. 2850–2865, June 2013.
[7] Z. Chen, K. Zhu, and L. Ying, “Detecting multiple information sources in networks under the SIR model,” IEEE Transactions on Network Science and Engineering, vol. 3, no. 1, pp. 17–31, Jan. 2016.
[8] J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou, “K-center: An approach on the multi-source identification of information diffusion,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 12, pp. 2616–2626, Dec. 2015.
[9] Ki-ichiro Hashimoto, “On Brandt matrices associated with the positive definite quaternion Hermitian forms,” J. Fac. Sci. Univ. Tokyo Sect. IA Math, vol. 27, no. 1, pp. 227–245, 1980.
[10] Brian Karrer and M. E. J. Newman, “Message passing approach for general epidemic models,” Phys. Rev. E, vol. 82, pp. 016101, Jul. 2010.
[11] Flaviano Morone and Hernán A. Makse, “Influence maximization in complex networks through optimal percolation,” Nature, vol. 524, July 2015.
[12] Juan G. Restrepo, Edward Ott, and Brian R. Hunt, “Characterizing the dynamical importance of network nodes and links,”