Collective Influence of Multiple Spreaders Evaluated by Tracing Real Information Flow in Large-Scale Social Networks

Abstract

Identifying the most influential spreaders that maximize information flow is a central question in network theory. Recently, a scalable method called "Collective Influence (CI)" has been put forward through collective influence maximization. In contrast to heuristic methods that evaluate nodes' significance separately, the CI method inspects the collective influence of multiple spreaders. Although CI applies to the influence maximization problem in the percolation model, it is still important to examine its efficacy in realistic information spreading. Here, we examine real-world information flow in various social and scientific platforms including American Physical Society, Facebook, Twitter and LiveJournal. Since empirical data cannot be directly mapped to ideal multi-source spreading, we leverage the behavioral patterns of users extracted from data to construct "virtual" information spreading processes. Our results demonstrate that the set of spreaders selected by CI can induce a larger scale of information propagation. Moreover, local measures such as the number of connections or citations are not necessarily the determining factors of nodes' importance in realistic information spreading. This result has significance for ranking scientists in scientific networks like the APS, where the commonly used number of citations can be a poor indicator of the collective influence of authors in the community.


arXiv preprint [physics.soc-ph]

Xian Teng, Sen Pei, Flaviano Morone, and Hernán A. Makse

Levich Institute and Physics Department, City College of New York, New York, NY 10031, USA
Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, NY 10032, USA

Introduction

Identification of the most influential nodes in social networks has broad applications in a variety of network dynamics. For example, in viral marketing, advertising to a small group of influential customers so that they adopt a new product can inexpensively trigger a large scale of further adoption; in epidemic control, the immunization of structurally important persons can efficiently halt global epidemic outbreaks in contact networks; and in biological systems like brain networks, some significant nodes are responsible for broadcasting information, and therefore locating and protecting them is crucial for the whole information processing system. Given its practical significance, the problem of finding the optimal set of influencers in a given network has attracted much attention in network science.

For a long time, researchers have developed numerous heuristic measures as predictors of nodes' importance in information spreading. Among the most frequently used topological properties are the number of connections (degree), betweenness and eigenvector centralities, PageRank, k-core, etc. All of them are established in the non-interacting setting, where nodes' significance is evaluated by taking them as isolated agents. As a result, these ad-hoc approaches, designed for finding single superspreaders, fail to provide the optimal solution for the general case of multiple influencers. To address this many-body issue, a rigorous theoretical framework based on collective influence (CI) theory has recently been presented. With a broader notion of influence, namely collective influence, the CI method pursues the goal of maximizing the overall influence of multiple spreaders. Such an explicit optimization objective enables CI to give the minimal set of spreaders.

Although CI exhibits good performance with scalability in the optimal percolation model, more validation work regarding its efficacy in real-world information spreading still needs to be done. Previously, the lack of real data on information diffusion has led to the mainstream adoption of artificial spreading models to simulate spreading dynamics. However, over-simplified spreading models usually neglect such important factors as activity frequency, connection strength and behavioral preferences, and thus fail to reproduce some observed characteristics of real information spreading. More importantly, different models may produce model-dependent, contradictory results. Therefore, it is necessary to evaluate CI's performance empirically through realistic information diffusion before applying it to real-world applications like marketing and advertising.

Here, we address this problem by tracking and analyzing the real-world information flows in a wide range of social media: journals of the American Physical Society (APS), the online social network Facebook.com (Facebook), the microblogging service Twitter.com (Twitter) and the blog website LiveJournal.com (LiveJournal). Rather than tracking the spreading range of single spreaders, we intend to investigate the overall spreading range, i.e., the collective influence, of multiple spreaders. To achieve this, the most straightforward idea is to extract and examine the real instances of information diffusion that are triggered by multiple spreaders. Unfortunately, such ideal multi-source spreading instances, in which spreaders send out the same message at the same time, rarely exist in reality. Even when we can find such instances, the initial spreaders are hardly ever the same as the set of nodes selected by CI or other heuristic strategies, making the comparison between those methods impossible.

To overcome the aforementioned difficulties, we construct "virtual" multi-source spreading processes by following users' behavioral patterns in the data. In particular, under the assumption that users will maintain their personal preferences in spreading processes, we measure the strength of directed social ties shown in historical diffusion records to represent the strength of the influence that one user exerts on another. For a node under the influence of several spreaders, the overall influence on it is defined as the highest influence strength. In this way, we are able to quantify the collective influence exerted on the entire network, corresponding to the collective spreading range of virtual processes initiated by any given set of seeds. Through comparisons with competing heuristic methods, including high degree (HD), adaptive high degree (HDA), PageRank (PR) and k-core, we find that the set of spreaders selected by CI can exert larger collective influence on the population with the same number of initial seeds. This provides a direct empirical validation of CI's good performance in real information spreading. In addition, some individual properties such as the number of connections and citations, which were previously regarded as reliable predictors of influence, are found to be invalid in the context of collective influence. This in turn shows that it is the interplay between spreaders, rather than individual features, that determines the collective influence.

Results

Introduction of Datasets

In the following empirical study, four datasets are examined: the journals of the American Physical Society (APS), the online social network Facebook.com (Facebook), the microblogging service Twitter.com (Twitter), and the blog website LiveJournal.com (LiveJournal). All datasets are available at kcore-analytics.com. During the period of data collection, people not only maintain social relations with their friends but also interact with others to spread and receive information. Naturally, social relations and interactions manifest differently on distinct platforms. For instance, in the academic data of APS, authors show their social relations, i.e. coauthorship, through jointly publishing articles, and they reveal their interactions and information transmission by citing others' papers. In online social media like Facebook, Twitter and LiveJournal, by contrast, users reflect their social relations by becoming "cyber friends", and they interact with each other by creating, receiving, and transmitting messages. With the collection of such information, we can obtain the full network structure as well as the empirical information flows. Details about these data are explained as follows.

• The American Physical Society (APS) is the world's largest organization of physicists. The APS data contains the information of all the scientific papers published in APS journals until 2005, including Physical Review A, B, C, D, E and Physical Review Letters. From the author lists and references of scientific publications, we can obtain the information about collaborations and citations. In total, there are 299,996 articles and 230,521 authors in the data, along with 2,356,525 records of citations. We construct the underlying collaboration network according to coauthorship. If two authors have published one article together, one undirected edge is built between them. Beyond that, we trace the information diffusion based on the reference flows. If a scientist i cites one paper written by j, then we say that information spreads from j to i.

• Facebook is an online social networking service. In Facebook, each registered user maintains a friend list, which is a good representation of actual social relationships. Users can exchange messages, post status updates and photos, share videos, and browse the posts published by their friends. The Facebook data contains the friend lists and the entire records of wall posts from the New Orleans regional network, over a period of two years from September 26th, 2006 to January 22nd, 2009. This data contains 63,731 users and 838,092 wall posts in total. The social network is extracted from the friend lists. If user j is in user i's friend list (or i is in j's friend list), we assume that they are friends and build an undirected edge between them. From the wall posts, we can infer the information diffusion flows. If user i makes comments on user j's page, we presume that i has gained information from j that motivated him/her to write comments.

• Twitter is a microblogging service that enables users to send and read short word-limited messages called "Tweets". In the 2016 election year, Donald Trump, who is the presumptive nominee of the Republican Party for President of the United States, has become one of the most popular topics being discussed on Twitter. From February 10, 2016 to March 14, 2016, we collected approximately 670,000 Tweets that contain the keyword "Donald Trump" or "Trump". From the collection of Tweets, we extract four kinds of Tweets: mentions, replies, retweets and quotes. A mention is a Tweet that contains another user's @username anywhere in the body of the Tweet. A reply is a response to another user's Tweet that begins with the @username of the person being replied to. Replies are also considered as mentions. A retweet is a re-posting of someone else's Tweet, in which the characters RT@username appear at the beginning to indicate that the user is re-posting others' content. A quote is a special form of retweet in which users can write their own comments when they are re-posting. We consider the mention (and also reply) relationship as representative of strong social ties and use it to construct the network structure. Meanwhile, we use retweets (and quotes) to obtain information flows. If user i retweets a Tweet from user j, we assume information diffuses from user j to user i.

• LiveJournal is a blog-sharing website where users can maintain friend lists and keep a blog, journal or diary. Our data contains the friend lists for all users and their blog posts published from February 14th, 2010 to November 21st, 2011, which involves 9,573,127 users and 3,462,504 records of blog references. Similar to Facebook, we depend on the friend list to build the underlying network topology. More importantly, LiveJournal users usually add URL links pointing to other relevant blogs when they refer to them. As a result, we can use the URL references to trace the information diffusion among users.

The originally constructed network is denoted by $\bar{G} = \{\bar{V}, \bar{E}\}$, in which $\bar{V}$ stands for the set of nodes and $\bar{E}$ the set of edges. In the raw datasets of online social platforms including Facebook and LiveJournal, we find many inactive users who neither spread nor receive messages in the network. They simply registered an account but did nothing during the period of data collection. Considering that those inactive nodes make no contribution to the information diffusion process, we exclude them from the original networks $\bar{G}$ and construct an active network $\bar{G}_A = \{\bar{V}_A, \bar{E}_A\}$. Unlike the online social platforms, APS has no such inactive nodes, as all the authors have to publish papers and cite others' work. However, APS data contains a minority of articles (∼ […] $G = \{V, E\}$. Properties of the original and truncated networks are provided in Table 1.

Construction of Virtual Information Spreading

In order to decide which strategy to use to locate the most influential nodes in networks, we intend to evaluate the collective influence exerted by the same number of influencers. The strategy that achieves the largest collective influence would be our first choice. To this end, the most straightforward idea is to compare the spreading range of multi-source spreading processes triggered by a fixed number of seeds selected by different methods. However, multi-source spreading is an ideal process. In the ideal setting, multiple sources should be activated by the same message at the same time. In reality, such an ideal situation rarely exists because of the intrinsic properties of real data. Users are interested in a wide range of topics, and they receive and deliver a great variety of messages from time to time. It is unlikely that we can find enough real instances in which the spreaders happen to send out the same message at the same time. Therefore, rather than forcing real data to match the ideal expectation, we propose an alternative: constructing a virtual multi-source spreading process.

The main idea behind the virtual multi-source spreading processes is that users are expected to follow the behavioral patterns expressed in real data. For user $i$ with $k_i$ neighbors who have chances to access information from $i$, the closely tied neighbors interested in user $i$'s publications or posts would be more likely to inherit messages from $i$. On the contrary, weakly tied friends would only occasionally be influenced by the information released from $i$. To reflect this effect, we propose a notion named the strength of directed ties $r$. For a directed link from $i$ to $j$, the strength $r(i, j)$ is defined as the number of messages, e.g., publications or posts, passed from $i$ to $j$. By definition, the strength of the directed tie $r(i, j)$ from $i$ to $j$ is not generally equal to $r(j, i)$ from $j$ to $i$. Figure 1a reveals that the strength of directed ties follows a power-law distribution. We assume that, in the virtual processes, people would continue to maintain such behavioral patterns. In this way, we can approximate the multi-source information diffusion and obtain the collective influence as follows.

In the virtual processes, suppose a $q$-percentage of initial spreaders are activated at the beginning, denoted by $S = \{ s_i \mid i = 1, 2, \ldots, n;\ n = N \cdot q \}$. We introduce a quantity $I_u(s) \in [0, 1]$ to represent the single influence strength with which node $u$ is affected by spreader $s$. Correspondingly, we employ $I_u$ to indicate the collective influence strength exerted by all seeds $S$. Both calculations rely on the above-mentioned strength of directed ties (shown in Figure 1b). For an arbitrary spreader $s$, the influence strength $I_{g_1}(s)$ from $s$ to its neighbor $g_1$ depends on the strength of the directed tie $r(s, g_1)$, or in other words, on the tendency of $g_1$ to receive information from $s$. Assume that, during one period of time, $s$ has sent out $r(s)$ messages in total and $g_1$ has accepted $r(s, g_1)$ of them [$r(s, g_1) \le r(s)$]. The proportion of acceptance $r(s, g_1) / r(s)$ can be viewed as a proxy of the influence strength from $s$ to $g_1$, i.e. $I_{g_1}(s) = r(s, g_1) / r(s)$. Next, $g_1$ might affect its neighbor $g_2 \neq s$ in the same way. We then follow the spreading paths and multiply the proportions together to obtain the influence strength that $s$ exerts on its $l$-step neighbor $g_l$:

$$I_{g_l}(s) = \prod_{k=1}^{l} \frac{r(g_{k-1}, g_k)}{r(g_{k-1})}, \tag{1}$$

where $g_0 = s$. Figure 1b gives an example with $l = 2$. As no message can spread infinitely, we set a number $L$ as the maximum layer of spreading, so that the influence range, denoted by $R_s$, can be approximated by a ball around $s$ with radius $L$ (shown in Figure 1c). Within each $R_s$, we have $I_{g_0}(s) = 1$ at the source $g_0 = s$; the value then decreases as $l$ becomes larger, and $I_{g_l}(s) = 0$ ($l > L$) for any external node. The schematic diagram of the distribution of influence strength within $R_s$ can be seen in Figure 1c. For APS and LiveJournal data, where we know more information about references, the detailed calculation of influence strength is shown in Methods.

To obtain the collective influence $I_u$ for node $u$, we apply

$$I_u = \max_{i = 1, \ldots, n} I_u(s_i). \tag{2}$$

Referring to Figure 1b,c, it is straightforward to see that when node $u$ does not belong to any influence range, $I_u(s_i) = 0$ for all $i$, in which case the collective influence is zero. When node $u$ is influenced by only one spreader, for example $I_u(s_i) > 0$ and $I_u(s_j) = 0$ for all $j \neq i$, the collective influence is the positive (largest) one, $I_u = I_u(s_i)$. More generally, if node $u$ lies within the overlapping area of more than one influence range, i.e. it is affected by more than one source, we choose the largest potential influence to be its collective influence during the virtual spreading process. Finally, we sum up all the $\{ I_u \mid u = 1, 2, \ldots, N \}$ to obtain the collective influence that spreaders exert on the entire system through

$$Q(q) = \frac{1}{N} \sum_{u=1}^{N} I_u. \tag{3}$$

Since $0 \le I_u \le 1$, we have $0 \le Q(q) \le 1$, which corresponds to the collective spreading range for the virtual process (see Figure 1c).

In general, the virtual process of multi-source spreading constructed here is an approximation of real information diffusion. We take advantage of real data to extract users' behavioral patterns, based on which we can calculate the single influence and collective influence that spreaders exert on each node. Given that, we can finally compute the collective influence exerted by all influencers on the entire network.
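As an illustration of Eqs. (1)-(3), the following Python sketch computes the single influence strength, the max-based collective influence, and the spreading range $Q(q)$ on a toy example. The data layout (`ties`, `sent`), the function names, and the rule of keeping the largest product when several paths reach the same node within $L$ steps are assumptions made for this sketch, since the text does not specify them; the toy counts are invented and do not come from any of the datasets.

```python
from collections import defaultdict


def single_influence(seed, ties, sent, max_layer):
    """Return {node: I_node(seed)} for nodes within max_layer steps (Eq. 1).

    ties[i][j] is r(i, j), the number of messages passed from i to j;
    sent[i] is r(i), the total number of messages sent out by i.
    Assumes r(i, j) <= r(i), so every value stays in [0, 1].
    """
    influence = {seed: 1.0}          # the source itself has strength 1
    frontier = {seed: 1.0}
    for _ in range(max_layer):
        next_frontier = {}
        for node, strength in frontier.items():
            total = sent.get(node, 0)
            if total == 0:
                continue             # node never sent anything: path stops
            for neighbor, count in ties.get(node, {}).items():
                value = strength * count / total
                # Assumption: keep the strongest path reaching a node.
                if value > influence.get(neighbor, 0.0):
                    influence[neighbor] = value
                    next_frontier[neighbor] = value
        frontier = next_frontier
    return influence


def collective_influence(seeds, ties, sent, n_nodes, max_layer):
    """Return Q(q): the mean over all nodes of max_i I_u(s_i) (Eqs. 2-3)."""
    best = defaultdict(float)        # nodes outside every range stay at 0
    for seed in seeds:
        for node, value in single_influence(seed, ties, sent, max_layer).items():
            best[node] = max(best[node], value)
    return sum(best.values()) / n_nodes


# Toy network with five nodes a, b, c, d, e (e receives nothing).
ties = {"a": {"b": 2, "c": 1}, "b": {"d": 1}}
sent = {"a": 4, "b": 2}

print(collective_influence(["a"], ties, sent, n_nodes=5, max_layer=2))       # 0.4
print(collective_influence(["a", "b"], ties, sent, n_nodes=5, max_layer=2))  # 0.55
```

With seed `a` alone, node `d` is reached through `b` with strength 0.5 × 0.5 = 0.25; adding `b` as a second seed raises `d` to 0.5 because the larger of the two strengths is kept, as in Eq. (2).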

Comparison of Different Methods

In this section, we compare the CI algorithm with four other widely used heuristic measures: adaptive high-degree (HDA), high-degree (HD), PageRank (PR) and k-core (details about the methods are given in Methods). Recall that our first step is to identify the q-percentage of initial spreaders according to the different methods. Secondly, we construct a virtual multi-source spreading process. Finally, we compare the virtual spreading range Q(q), i.e. the collective influence of those initial influencers.

Figure 2a,c,e,g show the virtual collective influence scores obtained by CI, HDA, HD, PR and k-core for the four networks: APS, Facebook, Twitter and LiveJournal. It can be seen that, for a given value of q, the set of nodes selected by CI can diffuse the information to a larger population than the sets obtained by other methods. CI's good performance is more prominent for APS and Facebook data, as their diffusion instances are relatively abundant. To clearly distinguish the performances of different methods, we also present the ratios between CI's collective influence score and those of the other approaches (Figure 2b,d,f,h). The ratios are always larger than one (indicated by the baseline at 1) for all datasets. Besides, the ratio is relatively large when q is small. As q increases, it declines accordingly, suggesting that if we select a larger number of influencers, the collective influence scores obtained by all methods become similar. Among the competing heuristic methods, HDA can be viewed as a special case of CI with the calculation radius being zero (see Methods). However, HDA's capability in locating influencers is limited by the lack of knowledge of the surrounding nodes, so it is a strategy obtained from the non-interacting point of view. The k-core method, although a good predictor for locating single "superspreaders", fails to identify multiple spreaders in the multi-source spreading process. This is because the selected influential nodes tend to cluster together in the core shells, which induces large overlaps of their influence areas.

We also investigate the characteristics of the influencers that CI has identified. Figure 3a shows the degree comparison of nodes ranked by CI and HD (from the most influential to the least). Unlike HD, which finds influencers relying on degree alone, CI's most important nodes contain not only hubs but also many weakly connected nodes. Besides, some of the most connected nodes turn out to be moderate influencers. This confirms the former conclusion that collective influence is determined by the interplay of all the influencers. Under certain circumstances, some low-degree nodes surrounded by hierarchical coronas of hubs contribute more to collective influence than high-degree nodes connecting to peripheral leaves. In addition, we have examined the correlation between CI ranking and the number of citations in Figure 4. The number of citations for each user is defined as how many times other people have accepted or inherited information from him/her directly. We acquire such information by checking the citations (APS), comments (Facebook), retweets (Twitter) and URL references (LiveJournal). Except for Twitter, the datasets show that the most influential nodes are not necessarily those with the largest number of citations. The uniqueness of Twitter might be explained by considering the mechanism of network formation and the way the data were collected. The Twitter platform lets users follow others arbitrarily, making it possible for super hubs with millions of followers to emerge and hold significant influence. Besides, the Twitter data were gathered by focusing on a popular topic, "Donald Trump", and such topic-based data might easily detect those extremely popular users who also play an important role in spreading. Therefore, the phenomenon shown in APS, Facebook and LiveJournal suggests practical implications for academic rankings. When evaluating a researcher's scientific impact within a field, his/her number of citations is not the determining factor. It also reminds us that influence is an emergent property arising from interactions rather than something that can be evaluated by viewing nodes individually.

Discussion

It is important to search for the most influential nodes in social networks. For a long time, heuristic approaches have been widely used to find superspreaders, yet without an ultimate solution for finding multiple influencers. Recently, a rigorous framework called collective influence (CI), along with a scalable algorithm, has been put forward to resolve the many-body problem. Even though CI has been shown to be effective in the percolation model, we still need to verify its performance in the real case of information diffusion. To achieve this, we collect data from four social media: APS journals, Facebook, Twitter and LiveJournal. Unlike the situation of finding single superspreaders, where we check each node's spreading range, in the case of multiple spreaders we should examine the collective spreading range. Given that ideal multi-source spreading processes triggered by the same messages at the same time are scarce in real-world diffusion, we propose a virtual multi-source spreading process, built on users' behavioral patterns, to approximate the ideal process. Finally, by comparing the collective influence, i.e. the spreading ranges in the virtual process, we find that CI is effective in finding multiple influencers.

Moreover, our finding indicates that quantities from a non-interacting viewpoint, such as degree and the number of citations, are not reliable in measuring nodes' importance in collective influence. Our investigation of influencers' properties confirms that influence is an effect of cooperation in multi-source spreading. Our results can be transformed into an effective way to rank scientists in academic communities according to their collective influence rather than according to the commonly used local connectivity metrics, like the number of citations or collaborations in the H-index (Hirsch number). Using the number of citations, as shown in Fig. 4, can be a poor indicator of the collective influence of a researcher on other researchers in the community. A global quantity like the Collective Influence, which takes into account the optimization of the influence of all researchers at once, provides a meaningful ranking of researchers according to the maximization of their influence. More studies will follow to elaborate on this particular point.

Methods

Collective Influence Method

Collective Influence (CI) Algorithm. CI is an optimization algorithm that aims to find the minimal set of nodes that could fragment the network in optimal percolation. In percolation theory, if we remove nodes randomly, the network would undergo a structural collapse at a critical fraction where the probability that the giant connected component exists is $G = 0$. Optimal percolation seeks the minimal fraction of nodes $q_c$ to achieve the result $G(q_c) = 0$. Let the vector $\mathbf{n} = (n_1, n_2, \ldots, n_N)$ represent whether a node is removed ($n_i = 0$) or not ($n_i = 1$), and let the vector $\mathbf{v} = (v_1, v_2, \ldots, v_N)$ represent whether a node belongs to the giant connected component ($v_i = 1$) or not ($v_i = 0$). The relationship between $\mathbf{n}$ and $\mathbf{v}$ can be derived in locally tree-like networks using the message passing (MP) approach:

$$v_{i \to j} = n_i \Big[ 1 - \prod_{k \in \partial i \setminus j} (1 - v_{k \to i}) \Big], \tag{4}$$

where $v_{i \to j}$ indicates the probability of $i$ being in the giant component when $j$ is absent, and $\partial i \setminus j$ is the set of neighbors of $i$ other than $j$. The solution $v_{i \to j} = 0$ for all $i \to j$ is associated with the special situation where the giant connected component is absent; therefore, to obtain $G(q) = 0$, the stability of this solution must be guaranteed. As a matter of fact, the stability of $\{ v_{i \to j} = 0 \}$ is controlled by the largest eigenvalue $\lambda(\mathbf{n}; q)$ of the linear operator $\hat{M}$, which is defined on the directed edges of the network as

$$M_{k \to l,\, i \to j} \equiv \left. \frac{\partial v_{i \to j}}{\partial v_{k \to l}} \right|_{\{ v_{i \to j} = 0 \}}. \tag{5}$$

It can be expressed as

$$M_{k \to l,\, i \to j} = n_i B_{k \to l,\, i \to j}, \tag{6}$$

where $B_{k \to l,\, i \to j}$ is the non-backtracking matrix of the network. $B$ stores the topological interconnections of the network; its element $B_{k \to l,\, i \to j} = 1$ if $l = i$ and $j \neq k$, and 0 otherwise. So far, the original optimal percolation problem has been rephrased as a mathematical statement: finding the optimal configuration $\mathbf{n}^*$ with size $q_c$ that achieves the critical threshold

$$\lambda(\mathbf{n}^*; q_c) = 1. \tag{7}$$

The eigenvalue $\lambda(\mathbf{n}; q)$ can be calculated according to the power method:

$$\lambda(\mathbf{n}) = \lim_{l \to \infty} \left[ \frac{|\mathbf{w}_l(\mathbf{n})|}{|\mathbf{w}_0|} \right]^{1/l}. \tag{8}$$

At a finite $l$, $|\mathbf{w}_l(\mathbf{n})|$ is the cost energy function of influence that needs to be minimized. Taking Equation 8 as a starting point, the problem of finding the optimal set of influencers can be solved by minimizing the following cost function:

$$E_l(\mathbf{n}) = \sum_{i=1}^{N} (k_i - 1) \sum_{j \in \partial \mathrm{Ball}(i, l)} \Bigg( \prod_{k \in P_l(i, j)} n_k \Bigg) (k_j - 1), \tag{9}$$

where $\mathrm{Ball}(i, l)$ is the set of nodes inside the ball of radius $l$ around the central node $i$, $\partial \mathrm{Ball}(i, l)$ is its frontier, and $P_l(i, j)$ is the shortest path of length $l$ connecting $i$ and $j$. To minimize the energy function of a many-body system, an adaptive method, the CI algorithm, is developed with the main idea of removing the nodes that cause the biggest drop in the energy function. In general, the CI algorithm can be stated as follows. Firstly, it considers the nodes at the frontier $j \in \partial \mathrm{Ball}(i, l)$ and assigns to node $i$ a collective influence value at the level of $l$ as

$$\mathrm{CI}_l(i) = (k_i - 1) \sum_{j \in \partial \mathrm{Ball}(i, l)} (k_j - 1). \tag{10}$$

Starting with the node with the highest $\mathrm{CI}_l$, CI adaptively removes nodes and, after each removal, recalculates $\mathrm{CI}_l$ for all the remaining nodes in the system.
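The adaptive procedure built on Eq. (10) can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the scalable implementation: the adjacency-dictionary layout, the function names, and the recomputation of every score after each removal are choices made here for brevity, and ties between equal scores are broken by dictionary order.

```python
from collections import deque


def degree(adj, removed, node):
    """Number of neighbors of `node` that have not been removed."""
    return sum(1 for nb in adj[node] if nb not in removed)


def ball_frontier(adj, removed, center, radius):
    """Nodes at shortest-path distance exactly `radius` from `center`."""
    dist = {center: 0}
    queue = deque([center])
    frontier = []
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            frontier.append(node)
            continue                      # do not expand beyond the frontier
        for nb in adj[node]:
            if nb not in removed and nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return frontier


def ci_score(adj, removed, node, radius):
    """CI_l(i) = (k_i - 1) * sum over frontier nodes j of (k_j - 1)."""
    frontier = ball_frontier(adj, removed, node, radius)
    return (degree(adj, removed, node) - 1) * sum(
        degree(adj, removed, j) - 1 for j in frontier
    )


def ci_seeds(adj, n_seeds, radius):
    """Adaptively pick the node with the highest CI_l, remove it, and repeat."""
    removed, seeds = set(), []
    for _ in range(n_seeds):
        candidates = [v for v in adj if v not in removed]
        if not candidates:
            break
        best = max(candidates, key=lambda v: ci_score(adj, removed, v, radius))
        seeds.append(best)
        removed.add(best)
    return seeds


# Toy graph: a degree-2 node "a" bridges the hubs "h" and "x".
adj = {
    "h": ["a", "b", "c", "d"], "a": ["h", "x"], "x": ["a", "y", "z"],
    "b": ["h"], "c": ["h"], "d": ["h"], "y": ["x"], "z": ["x"],
}
print(ci_score(adj, set(), "a", 1))   # 5
print(ci_seeds(adj, 1, radius=1))     # ['a']
print(ci_seeds(adj, 2, radius=0))     # ['h', 'x']
```

On this toy graph, radius 1 ranks the low-degree bridge `a` first because both of its neighbors are hubs, whereas radius 0 depends on degree only and picks the hubs.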
From the calculation we know that CI has richer topological content and its performance improves as $l$ increases, but $l$ should not exceed the network diameter because this case amounts to random identification. In our analysis, we adopt the parameter $L =$ […]. For $l = 0$, we have $\mathrm{CI}_0(i) = (k_i - 1)^2$. In this situation, the CI algorithm reduces to the high-degree adaptive (HDA) method. For $l \ge 1$, CI also considers the surrounding neighborhoods and the interactions among nodes; meanwhile, it is an easily implemented algorithm, as it only needs the local topological structure within the ball of radius $l$ instead of the whole network structure. More importantly, its computational complexity is $O(N \log N)$, which guarantees its applicability to large real networks.

Heuristic Methods

k-core. In the k-core method, nodes are ranked based on their $k_S$ values, which are calculated during the process of k-shell decomposition. In k-shell decomposition, nodes are removed iteratively. Firstly, nodes with $k = 1$ are removed repeatedly until no such nodes remain, and the removed nodes are assigned $k_S = 1$. Similarly, the next k-shells with index $k_S > 1$ are peeled off in turn until all nodes have been assigned $k_S$ values. In this way, k-shell decomposition divides all the nodes into different shells according to their relative locations in the network. Compared with the peripheral nodes, the core nodes have higher probabilities of causing large-scale diffusion. This method has been shown to perform well in searching for single spreaders who can yield large influence areas. However, it performs poorly when used to optimize the collective spreading caused by multiple spreaders, because k-core selects a group of nodes within or near the network core, whose influence areas overlap heavily and produce a poor collective outcome.

PageRank (PR). The PageRank algorithm was first proposed by S. Brin and L. Page and used by Google to rank websites. It extends the idea from academic citation that the number of citations or backlinks gives some approximation of a page's importance, by not counting links equally but normalizing by the number of links on a page. Its calculation is as follows: if page $A$ is cited by pages $T_1, \ldots, T_N$ with associated PageRank values $\mathrm{PR}(T_1), \ldots, \mathrm{PR}(T_N)$, then the PageRank of $A$ is given by

$$\mathrm{PR}(A) = (1 - d) + d \left( \frac{\mathrm{PR}(T_1)}{C(T_1)} + \cdots + \frac{\mathrm{PR}(T_N)}{C(T_N)} \right), \tag{11}$$

in which $C(A)$ is defined as the number of links going out of page $A$ and $d$ is the damping factor. PageRank outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. The higher the probability, the higher the PR value of this page. In practice, PageRank can be calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized link matrix of the web network.

High-Degree (HD). The HD method ranks nodes directly according to the number of connections. Compared with other methods requiring global network structure, like k-core and PageRank, HD only needs local information and is easily implemented.
However, it cannot handle the case in which hubs form a tight community, so that their spreading areas heavily overlap.
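As a rough illustration of the two global heuristics described above, the sketch below implements $k$-shell decomposition and the PageRank iteration of Eq. (11) in plain Python. The function names `k_shell` and `pagerank`, the dictionary-based graph representation, the damping value `d = 0.85` and the fixed iteration count are assumptions made for this example, not details taken from the paper. In this sketch, pages without outgoing links simply pass on no rank.

```python
def k_shell(adj):
    """Assign a k-shell index k_S to every node of an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input
    format); every neighbour must also appear as a key.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    while remaining:
        # The current shell index is the smallest degree still present.
        k = min(degree[v] for v in remaining)
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue  # already peeled off via another neighbour
            remaining.remove(v)
            shell[v] = k
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        stack.append(w)
    return shell


def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    out_links maps each page to the list of pages it links to.
    d = 0.85 is an assumed damping factor, not a value from the paper.
    """
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    pr = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new = {v: 1.0 - d for v in nodes}
        for src, targets in out_links.items():
            if targets:
                share = d * pr[src] / len(targets)  # PR(T) / C(T), damped
                for t in targets:
                    new[t] += share
        pr = new
    return pr


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(k_shell(toy))
print(pagerank({"a": ["b"], "c": ["b"], "b": []}))
```

In the toy graph, the pendant node falls in the 1-shell and the triangle in the 2-shell, which mirrors the periphery-versus-core picture described in the text.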

High-Degree Adaptive (HDA). HDA is the refined, adaptive version of the HD method. To mitigate the situation mentioned above, HDA recalculates the degrees after each removal. It can also be viewed as a special case of the CI algorithm at $l = 0$.

Data Processing
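The influence strengths used in this section can be illustrated with a small sketch: for a source $s$, $I_u(s)$ is the fraction of the papers of $s$ that receiver $u$ has cited, and the collective strength $I_u$ is the maximum of $I_u(s)$ over all sources. The function names, the dictionaries `papers_of` (author to papers) and `citers_of` (paper to citing authors), and the toy data are hypothetical and chosen only for illustration; they are not the authors' code or data format.

```python
from collections import defaultdict


def influence_strength(papers_of, citers_of, source):
    """Return {u: I_u(s)} for one source s.

    I_u(s) is the number of papers of s cited by u, divided by n_s,
    the total number of papers of s.
    """
    papers = papers_of[source]
    hits = defaultdict(int)
    for paper in papers:
        # R_s(J_i): everyone who cited this paper; set() counts each receiver once.
        for receiver in set(citers_of.get(paper, ())):
            hits[receiver] += 1
    return {u: count / len(papers) for u, count in hits.items()}


def collective_influence_strength(papers_of, citers_of, sources):
    """Return {u: I_u}, the maximum of I_u(s) over all sources s."""
    collective = {}
    for s in sources:
        for u, strength in influence_strength(papers_of, citers_of, s).items():
            collective[u] = max(collective.get(u, 0.0), strength)
    return collective


# Hypothetical toy data: two sources, three receivers.
papers_of = {"s1": ["p1", "p2"], "s2": ["q1", "q2", "q3", "q4"]}
citers_of = {"p1": {"u", "v"}, "p2": {"u"}, "q1": {"v"}, "q2": {"v"}, "q3": {"v"}}

print(influence_strength(papers_of, citers_of, "s1"))
print(collective_influence_strength(papers_of, citers_of, ["s1", "s2"]))
```

Here receiver `v` has cited one of the two papers of `s1` and three of the four papers of `s2`, so its collective strength is the larger of the two values, 0.75.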

Analyzing APS and LiveJournal. For APS, we know the specific article pairs $(a, b)$, meaning that paper $a$ cites paper $b$. In other words, the authors $A_b$ of $b$ spread their scientific discoveries to the authors $A_a$ of $a$. Therefore, for an arbitrary author $s$, we know his or her set of papers $J(s) = \{J_i \mid i = 1, 2, \ldots, n_s\}$, in which $J_i$ denotes each paper and $n_s$ is the number of papers published by $s$. By tracking the spreading of each paper $J_i$ through citation flows, we can determine its influence range $R_s(J_i)$, which contains all people who have cited this paper. For each receiver $u \in R_s = \{R_s(J_i) \mid i = 1, 2, \ldots, n_s\}$, we calculate the individual influence strength by $I_u(s) = \left( \sum_{i=1}^{n_s} \delta_{u \in R_s(J_i)} \right) / n_s$, where $\delta_{u \in R_s(J_i)} = 1$ if $u \in R_s(J_i)$ and 0 otherwise. A large value of $I_u(s)$ means that $u$ is more likely to cite the work of $s$ than other peers are. Next, the collective influence strength from all sources is obtained as $I_u = \max_{i = 1, \ldots, n} I_u(s_i)$. In LiveJournal, we know the information about blog references, so we can process the LiveJournal data with a method similar to that used for APS.

References

Valente, T. W. & Davis, R. L. Accelerating the diffusion of innovations using opinion leaders. Ann. Am. Acad. Polit. Soc. Sci.

Domingos, P. & Richardson, M. Mining knowledge-sharing sites for viral marketing. In Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 61-70 (ACM, 2002).

Van den Bulte, C. & Joshi, Y. V. New product diffusion with influentials and imitators. Market. Sci.

Iyengar, R., Van den Bulte, C. & Valente, T. W. Opinion leadership and social contagion in new product diffusion. Market. Sci.

Watts, D. J. A simple model of global cascades on random networks. Proc. Natl. Acad. Sci. USA

Watts, D. J. & Dodds, P. S. Influentials, networks, and public opinion formation. J. Cons. Res.

Albert, R., Jeong, H. & Barabási, A. Error and attack tolerance of complex networks. Nature

Pastor-Satorras, R. & Vespignani, A. Epidemic spreading in scale-free networks. Phys. Rev. Lett.

Yan, S., Tang, S., Pei, S., Jiang, S. & Zheng, Z. Dynamical immunization strategy for seasonal epidemics. Phys. Rev. E.

Yan, S., Tang, S., Fang, W., Pei, S. & Zheng, Z. Global and local targeted immunization in networks with community structure. J. Stat. Mech. P08010 (2015).

Newman, M. E. J. Spread of epidemic disease on networks. Phys. Rev. E.

Morone, F., Roth, K., Min, B., Stanley, H. E. & Makse, H. A. A model of brain activation predicts the collective influence map of the human brain. arXiv:1602.06238 (2016).

Leskovec, J. et al. Cost-effective outbreak detection in networks. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 420-429 (ACM, 2007).

Kempe, D., Kleinberg, J. & Tardos, É. Maximizing the spread of influence through a social network. In Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 137-143 (ACM, 2003).

Altarelli, F., Braunstein, A., Dall'Asta, L. & Zecchina, R. Optimizing spread dynamics on graphs by message passing. J. Stat. Mech. P09011 (2013).

Freeman, L. C. A set of measures of centrality based on betweenness. Sociometry

Freeman, L. C. Centrality in social networks: conceptual clarification. Soc. Networks

Brin, S. & Page, L. The anatomy of a large-scale hypertextual web search engine. Computer Networks

Dorogovtsev, S. N., Goltsev, A. V. & Mendes, J. F. F. K-core organization of complex networks. Phys. Rev. Lett.

Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y. & Shir, E. A model of Internet topology using k-shell decomposition. Proc. Natl. Acad. Sci. USA

Kitsak, M. et al. Identification of influential spreaders in complex networks. Nature Phys.

Pei, S., Muchnik, L., Andrade, J. S. Jr, Zheng, Z. & Makse, H. A. Searching for superspreaders of information in real-world social media. Sci. Rep.

Pei, S. & Makse, H. A. Spreading dynamics in complex networks. J. Stat. Mech. P12002 (2013).

Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature

Pei, S., Teng, X., Shaman, J., Morone, F. & Makse, H. A. Collective influence maximization in threshold models of information cascading with first-order transitions. arXiv:1606.02739 (2016).

Muchnik, L. et al. Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Sci. Rep.

Goel, S., Watts, D. J. & Goldstein, D. G. The structure of online diffusion networks. In Proc. 13th ACM Conf. on Electronic Commerce, 623-638 (ACM, 2012).

Viswanath, B., Mislove, A., Cha, M. & Gummadi, K. P. On the evolution of user interaction in Facebook. In Proc. 2nd ACM SIGCOMM Workshop on Social Networks, 37-42 (ACM, 2009).

Cheng, J., Adamic, L., Dow, P. A., Kleinberg, J. M. & Leskovec, J. Can cascades be predicted? In Proc. 23rd Int. Conf. on World Wide Web, 925-936 (ACM, 2014).

Pei, S., Muchnik, L., Tang, S., Zheng, Z. & Makse, H. A. Exploring the complex pattern of information spreading in online blog communities. PloS One e0126894 (2015).

Wang, D., Song, C. & Barabási, A. L. Quantifying long-term scientific impact. Science

Radicchi, F., Fortunato, S., Markines, B. & Vespignani, A. Diffusion of scientific credits and the ranking of scientists. Phys. Rev. E.

Bollobás, B. & Riordan, O. Percolation (Cambridge Univ. Press, 2006).

Bianconi, G. & Dorogovtsev, S. N. Multiple percolation transitions in a configuration model of network of networks. Phys. Rev. E.

Karrer, B., Newman, M. E. J. & Zdeborová, L. Percolation on sparse networks. Phys. Rev. Lett.

Hashimoto, K. Zeta functions of finite graphs and representations of p-adic groups. Adv. Stud. Pure Math.

Angel, O., Friedman, J. & Hoory, S. The non-backtracking spectrum of the universal cover of a graph. Trans. Amer. Math. Soc.

Bhatia, N. P. & Szegö, G. P. Stability theory of dynamical systems (Springer-Verlag, Berlin Heidelberg, 2002).

Wasserman, S. & Faust, K. Social Network Analysis (Cambridge Univ. Press, Cambridge, 1994).

Colizza, V., Flammini, A., Serrano, M. A. & Vespignani, A. Detecting rich-club ordering in complex networks. Nature Phys.

Morone, F., Min, B., Bo, L., Mari, R. & Makse, H. A. Collective Influence Algorithm to find influencers via optimal percolation in massively large social media. Sci. Rep.

Acknowledgements

This work was supported by NIH-NIGMS 1R21GM107641, NSF-PoLS PHY-1305476 and ARL Cooperative Agreement Number W911NF-09-2-0053, the ARL Network Science CTA. We thank Lev Muchnik for providing the data on LiveJournal.

Author contributions statement

H.A.M. designed research; X.T., S.P., F.M. and H.A.M. analyzed data, prepared figures and wrote the main manuscript text. All authors reviewed the manuscript.

Additional information

Competing financial interests

The authors declare no competing financial interests.

Figure 1. Construction of virtual spreading based on people's interactions. a, Distribution of directed tie strength for real networks. The power-law distribution demonstrates the heterogeneity of interactions between nodes. b, Calculation of influence strength. Nodes $s_1$ and $s_2$ are two distinct spreaders, and the maximum spreading layer is set to $L = 2$. Node $u$ is influenced by the two seeds with strengths $I_u(s_1)$ and $I_u(s_2)$. We select the largest value to indicate the collective influence exerted on it. c, An illustration of the single influence strength $I_u(s)$ along with the collective influence strength $I_u$. The three circle-like areas represent the influence ranges $R_{s_1}$, $R_{s_2}$, $R_{s_3}$ of the distinct spreaders $s_1$, $s_2$, $s_3$, and the contour lines indicate the levels of influence strength $I_u$. Projecting this onto 2-dimensional space gives the corresponding distribution. The collective outcome $I_u$ (indicated by the gray curve) is obtained by combining the single influence strengths of all the spreaders.

Networks     $\bar{N}$   $\bar{M}$     $\bar{N}_A$  $\bar{M}_A$  $N$       $M$          $\langle k \rangle$  $\langle k_d \rangle$  $q_c$
APS          230,521     1,607,305     230,521      1,607,305    190,161   1,582,710    16.4                 37.4                   20%
Facebook     63,731      817,090       45,746       703,924      45,459    703,803      31.0                 18.8                   45%
Twitter      311,334     151,654       311,334      151,654      29,463    143,220      9.7                  5.1                    6%
LiveJournal  9,573,126   188,240,039   304,858      19,785,460   290,362   19,783,730   136.3                7.7                    46%

Table 1. Properties of the original and processed networks $\bar{G}$, $\bar{G}_A$ and $G$ in this article. In the table, $\bar{N}$ ($\bar{M}$) is the number of nodes (edges) in the original network $\bar{G}$, $\bar{N}_A$ ($\bar{M}_A$) is the number of nodes (edges) in the active network $\bar{G}_A$, and $N$ ($M$) is the number of nodes (edges) in the network $G$. $\langle k \rangle$ is the average degree of network $G$. $\langle k_d \rangle$ denotes the average out-degree of the diffusion graph, i.e. the average number of messages that have been sent out. In addition, $q_c$ is the minimal fraction of influencers that CI needs to fragment the network in optimal percolation. All datasets are available at kcore-analytics.com.

[Figure 2 panels: spreading range versus $q$ (curves: CI, HDA, HD, PR, kcore) and ratio versus $q$ (curves: baseline, CI/HDA, CI/HD, CI/PR, CI/kcore) for APS, Facebook, Twitter and LiveJournal.]

Figure 2. Performance of CI in large-scale real social networks. The datasets are APS (a, b), Facebook (c, d), Twitter (e, f) and LiveJournal (g, h). We compare the virtual spreading ranges of the different methods in a, c, e, g. With a fixed fraction $q$ of seeds, CI's virtual spreading range is larger than those of all the heuristic approaches. We also show the ratios of spreading ranges between CI and the other methods in b, d, f, h. The ratios are always larger than 1 (higher than the baseline), implying that CI is an effective strategy for locating multiple spreaders. We set $L =$ … $\langle k_d \rangle$, and $L =$ … $\langle k_d \rangle$. We are interested in the results for small $q$, so we restrict $q$ to small values. As $q$ increases, the performances of all the strategies become similar.

Figure 3. Degree versus ranking. We show the degrees of nodes ranked (from highest to lowest) by CI and HD for APS (a), Facebook (b), Twitter (c) and LiveJournal (d). CI brings previously neglected weak nodes into the set of most significant influencers. Meanwhile, some of the most connected nodes are ranked as moderate influencers by CI, indicating that this weak-node effect is a consequence of collective influence in the case of multiple spreaders. This result has important consequences for the ranking of researchers in scientific networks.

Figure 4. The number of citations versus CI ranking. We present the number of citations (comments, reposts or references) of nodes ranked by the CI strategy for APS (a), Facebook (b), Twitter (c) and LiveJournal (d). Although in the Twitter data the most influential user is exactly the one with the largest number of citations, the overall results still show that a large number of citations is not necessarily a reliable measure for identifying top-ranking influencers. This is especially relevant for academic rankings of physicists in a community like APS. CI takes into account the maximization of influence of each scientist in the whole network rather than just the local information given by the number of citations. Thus a highly cited author may not have a large impact in the community if he or she is isolated in the periphery. An optimal measure such as CI should rank such a scientist lower in the scientific community. This result calls for a revision of rankings based solely on local information rather than on the collective influence in the entire network community. We elaborate more on this problem in subsequent publications.