A novel method based on node correlation to evaluate the important nodes in complex networks

Abstract

Finding the important nodes in complex networks by topological structure is of great significance to network invulnerability. Several centrality measures have been proposed recently to evaluate the performance of nodes based on their correlation, showing that the interaction between nodes has an influence on the importance of nodes. In this paper, a novel method based on node distribution and global influence in complex networks is proposed. Our main idea is that the importance of a node is linked not only to its relative position in the network but also to its correlations with the other nodes. The nodes in a complex network are classified according to the distance matrix, and the correlation coefficient between each pair of nodes is then calculated. From the perspective of the whole network, the global similarity centrality (GSC) is proposed based on the relevance and shortest distance between any two nodes. The efficiency, accuracy and monotonicity of the proposed method are analyzed on two artificial datasets and eight real datasets of different sizes. Experimental results show that the GSC method outperforms current state-of-the-art algorithms.


A novel method based on node's correlation to evaluate the important nodes in complex networks∗

Pengli Lu†, Chen Dong and Yuhong Guo

1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou, 730050, Gansu, China
2. School of Mathematics and Statistics, Hexi University, Zhangye, 734000, Gansu, China


Keywords: Node importance, Network topology, Global similarity centrality (GSC), Distribution vector, Susceptible-Infected-Recovered (SIR) model

∗ Supported by the National Natural Science Foundation of China (No. 11361033) and the National Natural Science Foundation of China (No. 11861045).
† Corresponding author. E-mail addresses: [email protected] (P. Lu), [email protected] (C. Dong), [email protected] (Y. Guo).
arXiv preprint, [cs.SI].

A complex system can be modeled or mapped as a complex network consisting of nodes and edges, where every node represents an entity and every edge denotes the relationship between a pair of entities. The identification of influential nodes in large and complex networks, including social networks, protein networks, transportation networks, information networks and next generation networks, has attracted many researchers. If the influential nodes in a traffic network or protein network fail, the entire network may suffer a catastrophic failure. In social, information and communication networks, messages can be spread easily and quickly throughout the network by influential nodes [1, 2]. The variety of users' needs leads to differences in information transmission efficiency, so it is impossible for all the information to spread in time. Users at the periphery of the network always receive messages relatively late, at which point the messages are meaningless to them [3–5].

In complex networks, finding the influential nodes which are willing to spread information is of great significance. News spreading starts from one or a few users, and the information diffuses to friends who are closely related to or interested in it; these friends then transmit the news to their own friendship networks. An organization in a social network corresponds to a group of individuals with the same or similar backgrounds [6–8]. Take a gymnasium as an example. If the management philosophy of the business remains the same, a change of owner will only affect the employees of the gym, not the members. Therefore, this news will generate a strong response among the employees but will hardly cause waves among the customers.

Influential users play an important role in information spreading, and ranking them according to their influence capability has received much attention in recent years. In order to find key nodes, researchers have proposed a number of centrality measures from different perspectives. The most common are degree centrality, which only considers the node's own topological structure [9], betweenness centrality and closeness centrality, which are based on the shortest distance between nodes [10, 11], and k-core decomposition centrality, which concerns the relative position of nodes in the network [12]. However, degree centrality lacks accuracy, betweenness centrality and closeness centrality are not applicable to large-scale networks, and k-core decomposition centrality tends to assign nodes with different spreading capability to the same k-shell index. Therefore, these existing methods have been shown not to meet current needs [13]. Local dimension centrality (LD) [14] broke through the traditional global dimension thought pattern by combining the characteristics of the power law distribution of the BA scale-free network with each node's attributes. The main idea behind the method is that the distribution concentration of the remaining nodes is related to the position of the initial node. However, LD centrality considers the node's influence range but neglects the correlation between pairs of nodes.

Motivated by LD centrality, we propose our method. The nodes in the network are classified by the distance matrix, and the pertinence between any two nodes is calculated by the Pearson correlation coefficient. From the global perspective of the network, the influence of the shortest distance and of the correlation between any pair of nodes on the importance of nodes is analyzed, and the global similarity centrality (GSC) is proposed. In this paper, we apply the proposed method to networks of different sizes and compare it with state-of-the-art algorithms. Experimental results show that the proposed method performs better in efficiency, accuracy and monotonicity than other popular measures.

The rest of the paper is organized as follows. Section 2 analyzes the existing methods of node importance research. In Section 3, the GSC algorithm is introduced. Experimental results and discussions are included in Section 4. Finally, the conclusion of the paper is given in Section 5.

In this section, we briefly introduce the current progress in identifying important nodes in complex networks. A series of classic centrality measures have been proposed to evaluate the spreading capability of nodes. Degree centrality is a simple and straightforward way to measure the importance of nodes by counting the number of neighbors [9]. However, this measure has a major flaw: simply counting the neighboring nodes while ignoring the importance of those neighbors may result in nodes with smaller degree being more vital than nodes with larger degree. In addition, the relative position of nodes in complex networks is also significant. Compared with nodes of larger degree, nodes of smaller degree may be more likely to occupy a key position for news spreading and play an important role in the whole network. This phenomenon is also the starting point of betweenness centrality and closeness centrality [10, 11]. Based on the definition of the h-index and the degree of each node's neighboring nodes, T. Zhou et al. proposed a more feasible evaluation measure of node importance than degree centrality [15]. Considering that the performance of neighboring nodes can improve the accuracy of identifying important nodes, Q. Liu et al. proposed the local H-index centrality to improve the reliability of the measure [16]. P.L. Lu et al. also proposed an extended H-index centrality based on local H-index centrality and clustering coefficient [17].

Besides, it is also an important topic to measure the importance of nodes by decomposing the network. Kitsak et al. proposed the k-core decomposition centrality (KS) to determine the importance of nodes based on their relative positions in the network [12]. First, set the KS value of all nodes in the network to 1, then find all nodes with degree 1 and remove these nodes together with their edges.
Afterwards, recalculate the degree of the nodes in the network, and delete the nodes with degree 1 and their edges until there are no nodes with degree 1 left in the network. At this point, the KS value of the remaining nodes is set to 2, and the above operation is repeated until there are no nodes with degree 2 in the network, and so on until the network is completely decomposed or only isolated nodes remain. The larger the KS value of a node, the closer it is to the center of the network. Considering the influence of neighboring nodes, J. Wang et al. proposed the neighborhood coreness centrality (cn), which reflects the relative distance between neighboring nodes and the network center [18]. In k-core decomposition centrality, the number of nodes deleted during each step can also reflect the performance of nodes. Mixed degree decomposition (MDD) considers the variation of the network topology in each decomposition step [19]. Qi et al. applied the Laplacian matrix and the quasi-Laplacian matrix to the study of node centrality in complex networks using graph theory; the importance of nodes was represented by the change of spectral energy under node deletion, which greatly improved the practicability of the method [20, 21]. Newton's classical mechanics was combined, for the first time, with the topological structure of complex networks to propose the Newton gravity centrality (G), in which the degree of a node corresponds to the mass of a planet and the shortest distance between nodes corresponds to the radius [22]. Wang et al. proposed an improved Newton gravity centrality (IGC), which replaced the degree of the node with its k-core value [23]. A. Namtirtha et al. further improved the Newton gravity centrality and put forward a new idea, which combined the degree and core of nodes to evaluate node importance [24]. A. Dutta et al. analyzed the networks to which degree centrality and k-core decomposition centrality are applicable, then combined these two measures and proposed a new method that is applicable to different networks [25].

In addition to considering the spreading capability of a single node, evaluating the importance of nodes from the global perspective of the network is also widely used. On the basis of Kirchhoff polynomials, Z. Dai et al. proposed a spanning tree centrality method to determine important nodes and extended the evaluation of node importance from simple networks to weighted networks [26]. On this basis, a near-linear time algorithm based on the Kirchhoff index was proposed to measure the edge centrality of weighted networks, which further broadens the application range of the algorithm [27]. In combination with the basic concept of fractal dimension in physics, Silva et al. proposed local dimension centrality to explore the nature of networks. Since each node in the network has a different sphere of influence, the local dimension also changes with the choice of the central node, which has an impact on the feasibility of the method. Therefore, Y. Deng et al. improved the local dimension centrality to make the method more practicable [14]. Our method is based on the shortest distance and the correlation between nodes, in order to identify the importance of nodes more accurately.

The distance matrix indicates the shortest distance between node pairs in the network, and it reflects the relative position of nodes. Core nodes are located at the center of the network, and the shortest paths between many node pairs go through these nodes; therefore the shortest distance between these nodes and the other nodes is relatively small. Common nodes are located at the periphery of the network, where the surrounding nodes are dispersed, so the lengths of their shortest paths are relatively large.
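The contrast between core and peripheral nodes can be made concrete with a small sketch. The code below is only an illustration under our own assumptions: the toy graph, the adjacency-list representation and the helper name `mean_shortest_distance` are hypothetical and do not come from the paper, and breadth-first search is used here simply because the graph is unweighted.

```python
from collections import deque


def mean_shortest_distance(adj):
    """Average shortest distance from every node to all other nodes.

    `adj` maps each node to a list of neighbours; the graph is assumed
    to be undirected, unweighted and connected.
    """
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = sum(dist.values()) / (len(adj) - 1)
    return result


# Hypothetical toy graph: node 0 is a hub, node 4 hangs off node 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
avg = mean_shortest_distance(adj)
for node in sorted(avg, key=avg.get):
    print(node, avg[node])
```

In this toy graph the hub has the smallest average distance and the pendant node the largest, which is the behaviour the paragraph above describes for core and common nodes.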
Local dimension centrality (LD) combines the characteristics of the distance matrix with the power law distribution, matching the importance of each node with the scale of its locality. A lower LD means a higher importance. In other words, the distance between a node and the core of the network also affects the importance of the node, and nodes in dense locations are often more important than nodes in sparse locations. However, the local dimension centrality only considers the distribution of nodes and does not take the properties of the vertices as an evaluation criterion. Therefore, an accurate algorithm considering the node's properties is needed.

Let $G = (V, E)$ be an unweighted network with vertex set $V(G) = \{1, 2, 3, \ldots, N\}$ and edge set $E(G)$. We define the weighted matrix $W(G)$ of size $N \times N$ as follows:

$$W(G)_{i,j} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \text{ and } j \text{ are adjacent} \\ \infty, & \text{if } i \text{ and } j \text{ are not adjacent} \end{cases} \tag{2.1}$$

The distance between two nodes $i, j \in V(G)$, denoted by $d_{i,j}$, is the length of the shortest path from node $i$ to node $j$. The distance matrix of $G$, denoted by $D(G)$, is an $N \times N$ matrix whose $(i, j)$-th entry is $d_{i,j}$:

$$D(G) = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,N} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N,1} & d_{N,2} & \cdots & d_{N,N} \end{pmatrix} \tag{2.2}$$

The distance matrix of the network can be obtained from $W(G)$ by computing the shortest distance between every pair of nodes with the Floyd-Warshall algorithm.

The maximum distance from node $i$ to the other nodes, which represents the surrounding size of node $i$, is denoted as

$$D_i = \max_{j \in V} d_{i,j} \tag{2.3}$$

and the diameter $D$ of the network is

$$D = \max_{i \in V} D_i \tag{2.4}$$

After calculating the relative distance between nodes, the node distribution vector and the distance vector are defined based on the location of each node.

Fig. 1: A simple graph. (Taking node 13 as the initial node, the nodes are divided into four parts by their distance from node 13, each shown in a different color.)

Definition 2.1. (Node Distribution Vector and Distance Vector) The node distribution vector $NDV_i$ and the distance vector $DV_i$ for node $i$ are defined as follows, where $|V_i^k|$ represents the number of nodes in the network whose shortest distance from node $i$ is $k$.

$$NDV_i = (|V_i^1|, |V_i^2|, |V_i^3|, \ldots, |V_i^D|) \tag{2.5}$$

$$DV_i = (|V_i^1|, 2|V_i^2|, 3|V_i^3|, \ldots, D|V_i^D|) \tag{2.6}$$

As shown in Fig. 1, we take node 13 as the initial node and divide the other nodes in the network into four levels. The distance from each of nodes 10, 11 and 12 to node 13 is 1, the distance from each of nodes 5, 7 and 8 to node 13 is 2, the distance from each of nodes 1, 4, 6 and 9 to node 13 is 3, the distance from each of nodes 2 and 3 to node 13 is 4, and $D$ is 4. We can therefore represent the distribution vector of node 13 as $NDV_{13} = (3, 3, 4, 2)$ and its distance vector as $DV_{13} = (3, 6, 12, 8)$; the vectors of the other nodes are obtained in the same way.

Two correlation coefficients between nodes $i$ and $j$ are then defined, denoted by $P_{i,j}$ and $D_{i,j}$, respectively. $P_{i,j}$ describes the similarity between nodes in terms of node distribution and is calculated using the traditional Pearson correlation coefficient formula, while $D_{i,j}$ adapts the Pearson correlation coefficient to the distance distribution of nodes: it calculates the correlation between the distance distribution and the average shortest distance of nodes. By counting the number of nodes at each distance and calculating the difference between each distance and the average shortest distance, the similarity of topological structure between pairs of nodes is reflected and the relative position of nodes in the network can be expressed.
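The distance matrix, diameter, node distribution vector and distance vector defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the five-node toy graph are our own assumptions (it is not the graph of Fig. 1), nodes are numbered from 0, and the graph is assumed to be connected.

```python
import math


def floyd_warshall(n, edges):
    """Distance matrix D(G) of an unweighted, undirected graph on nodes 0..n-1."""
    # Start from W(G): 0 on the diagonal, 1 for adjacent nodes, infinity otherwise.
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        d[u][v] = d[v][u] = 1
    # Relax every pair (i, j) through every intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def diameter(d):
    """D = max_i D_i, where D_i = max_j d[i][j] (connected graph assumed)."""
    return max(max(row) for row in d)


def ndv_dv(d, i, diam):
    """Node distribution vector NDV_i and distance vector DV_i of node i."""
    ndv = [sum(1 for x in d[i] if x == k) for k in range(1, diam + 1)]
    dv = [k * count for k, count in zip(range(1, diam + 1), ndv)]
    return ndv, dv


# Hypothetical toy graph: a path 0-1-2-3 with an extra leaf 4 attached to node 1.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
d = floyd_warshall(5, edges)
diam = diameter(d)
print(diam)
print(ndv_dv(d, 1, diam))
print(ndv_dv(d, 3, diam))
```

For node 13 of Fig. 1, the same construction turns the level sizes (3, 3, 4, 2) into the distance vector (3, 6, 12, 8), since the k-th entry of the distance vector is k times the k-th entry of the node distribution vector.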
The specific formulas are as follows:

$$P_{i,j} = \frac{\sum_{k=1}^{D} \left(NDV_i^k - \overline{NDV_i}\right) \cdot \left(NDV_j^k - \overline{NDV_j}\right)}{\sqrt{\sum_{k=1}^{D} \left(NDV_i^k - \overline{NDV_i}\right)^2} \cdot \sqrt{\sum_{k=1}^{D} \left(NDV_j^k - \overline{NDV_j}\right)^2}} \tag{2.7}$$

$$D_{i,j} = \frac{\sum_{k=1}^{D} \left[ NDV_i^k \times \left( \frac{DV_i^k}{NDV_i^k} - \frac{\overline{DV_i} \times D}{N-1} \right) \right] \cdot \left[ NDV_j^k \times \left( \frac{DV_j^k}{NDV_j^k} - \frac{\overline{DV_j} \times D}{N-1} \right) \right]}{\sqrt{\sum_{k=1}^{D} \left[ NDV_i^k \times \left( \frac{DV_i^k}{NDV_i^k} - \frac{\overline{DV_i} \times D}{N-1} \right) \right]^2} \cdot \sqrt{\sum_{k=1}^{D} \left[ NDV_j^k \times \left( \frac{DV_j^k}{NDV_j^k} - \frac{\overline{DV_j} \times D}{N-1} \right) \right]^2}} \tag{2.8}$$

where $NDV_i^k$ denotes the $k$-th element of the vector $NDV_i$ and $\overline{NDV_i}$ is the mean value of the vector $NDV_i$; $DV_i^k$ and $\overline{DV_i}$ are the $k$-th element and the mean value of the vector $DV_i$, respectively. The results of Eq. (2.7) and Eq. (2.8) are between $-1$ and $1$. Taking Fig. 1 as an example, substituting the vectors of node 7 and node 8 into Eq. (2.7) and Eq. (2.8) gives $P_{7,8} \cong 0.\ldots$ and $D_{7,8} \cong 0.\ldots$, while the correlation coefficients between node 7 and node 13 are $P_{7,13} = 0$ and $D_{7,13} \cong 0.\ldots$; thus, node 8 plays a more active role than node 13 in the news spreading of node 7 in the network.

Definition 2.2. (Global Similarity Centrality)

The global similarity centrality consists of two parts, defined in Eq. (2.9) and Eq. (2.10); the complete ranking procedure is listed in the following algorithm.

Algorithm: Ranking nodes on the basis of cumulative centrality
01 Input: G = (V, E)
02 Output: A ranking list of nodes' importance
03 Begin Algorithm
04   Floyd-Warshall algorithm is used to calculate the shortest distance between nodes and the diameter of the graph G
05   For i = 1 to |V|
06     Calculate NDV_i and DV_i using Eq. (2.5) and Eq. (2.6)
07   End for
08   For i = 1 to |V|
09     Set GSC_i = 0
10     For j = 1 to |V|
11       Calculate P_{i,j} and D_{i,j} using Eq. (2.7) and Eq. (2.8)
12       According to the value of P_{i,j} and D_{i,j}, use Eq. (2.9) to calculate NC_{i,j}
13       GSC_i = GSC_i + NC_{i,j}
14     End for
15   End for
16   Sort the nodes in descending order based on GSC values to obtain the ranking list
17 End Algorithm

$$NC_{i,j} = \begin{cases} \dfrac{1 - P_{i,j}}{d_{i,j}} + \left(1 + \dfrac{D_{i,j}}{d_{i,j}}\right), & P_{i,j} > 0 \\[2ex] \dfrac{1 + P_{i,j}}{d_{i,j}} + \left(1 + \dfrac{D_{i,j}}{d_{i,j}}\right), & P_{i,j} < 0 \\[2ex] 1 + \dfrac{D_{i,j}}{d_{i,j}}, & P_{i,j} = 0 \end{cases} \tag{2.9}$$

$$GSC_i = \sum_{v_j \in V} NC_{i,j} \tag{2.10}$$

The formula in this section consists of two parts: the distance clustering coefficient of nodes, and the global correlation of node $i$. When the coefficient $P_{i,j}$ is negative, node $j$ has a negative effect on the spreading ability of node $i$, which affects the propagation of node $i$ in the network. Therefore, $1 + P_{i,j}$ is used to accurately calculate the clustering coefficient between node $i$ and node $j$. At the same time, considering the different influence capability between nodes, the coefficients $d_{i,j}$ and $D_{i,j}$ also have a positive effect on the whole algorithm, and $1 + \frac{D_{i,j}}{d_{i,j}}$ is used to control the influence of the distance between nodes on the proposed method.

The algorithm gives the specific calculation details of each step of the proposed method. The Floyd-Warshall algorithm is used in line 4 to calculate the distance matrix and the diameter of graph $G$; lines 5-7 use Eq. (2.5) and Eq. (2.6) to calculate the node distribution vector and the distance vector of each node; lines 8-15 use Eq. (2.7) and Eq. (2.8) to calculate the correlation coefficients between node $i$ and the other nodes in the network, and then compute each node's GSC. Finally, the nodes are sorted by their GSC values. The time complexity of the Floyd-Warshall algorithm is $O(|V|^3)$, and that of the rest of the proposed measure is $O(|V| + |E|)$.

When evaluating the importance of nodes, the proposed method first defines the distribution vector and distance vector of nodes according to the structure of the network, then calculates the degree of similarity between pairs of nodes with the Pearson correlation coefficient, so that the importance of nodes is based on the nodes' correlation. Compared with the existing global clustering coefficient algorithm, the proposed algorithm improves the clustering method: we consider the network structure and, based on the degree of similarity between nodes, re-divide the nodes from the perspective of propagation. The measure can determine a node's spreading capability more accurately, which makes up for the shortcoming of the global clustering coefficient algorithm of considering only a single parameter.

In this section, to evaluate the proposed method, we compare it with a series of currently popular algorithms, including K-shell decomposition centrality (KS) [12], neighborhood coreness centrality (cn) [20], H-index centrality (H) [15], local H-index centrality (LH) [16], Newton's gravity centrality (G) [24], improved Newton's gravity centrality (IGC) [25], the K-shell hybrid method (Ksh) [26], weighted k-shell degree neighborhood centrality (Ksd) [27], betweenness centrality (BC) [10], closeness centrality (CC) [11], eigenvector centrality (EC) [48] and Pagerank centrality (PA) [49]. These methods are applied to eight real-world datasets and two artificial datasets. The networks used in this paper are all undirected, and the algorithms are not tested on directed networks. The real-world datasets include a network of mutual relations between club employees and customers (Karate) [28], Lusseau's Bottlenose Dolphins social network (Dolphins) [29], the network of political books about the presidential election sold on Amazon during 2004 (Polbooks) [30], the schedule network of major league soccer clubs (Football) [31], a network of collaborative relationships among jazz musicians (Jazz) [32], the American airlines flight route network (USAir) [33], the Rovira Virgili university e-mail network between teachers and students (Email) [34], and a network of interrelationships between proteins (Yeast) [35]. The artificial datasets include a Small-World network (WS) [36] and a Lancichinetti-Fortunato-Radicchi network (LFR-2000) [37], both of which are generated with the software Gephi. The specific parameters of the datasets are shown in Table 1.

Table 1: Specific parameters of the datasets.

Network    |N|    |E|    Average degree   Maximum degree   β_th     β      Assortativity
Karate     34     78     4.588            17               0.129    0.13   -0.4756
Dolphins   62     159    5.129            12               0.147    0.15   -0.0436
Polbooks   105    441    8.400            25               0.0838   0.09   -0.1279
Football   115    613    10.661           12               0.0932   0.10   0.1624
Jazz       198    2742   27.967           100              0.026    0.03   0.0202
USAir      332    2126   12.81            139              0.0225   0.03   -0.2079
Email      1133   5451   9.622            71               0.0535   0.06   0.0782
WS         2000   6012   6.021            11               0.1559   0.16   -0.0563
LFR-2000   2000   4997   9.988            39               0.0477   0.05   -0.0032
Yeast      2361   7181   6.083            65               0.0600   0.07   -0.0489

Discrimination capability

In this experiment, we study the discriminating ability of the ranking lists generated by the involved measures from the aspects of monotonicity and resolution [38, 39]. In order to better evaluate the performance of nodes and the capability of different measures to distinguish the importance of nodes, researchers have applied monotonicity to assess how well different measures differentiate the spreading efficiency of nodes in social networks. The formula for monotonicity is as follows:

$$M(A) = \left(1 - \frac{\sum_{a \in A} |X|_a \times (|X|_a - 1)}{|X| \times (|X| - 1)}\right)^2 \tag{3.1}$$

where $A$ is the ranking list of one measure, $|X|$ is the total number of nodes in $A$, and $|X|_a$ is the number of nodes in level $a$, i.e. the nodes sharing the same rank $a$. The range of monotonicity is [0, 1]. The better the measure's discrimination ability, the larger the value of monotonicity. Experimental results are shown in Table 2. The involved methods are applied to different networks for comparison. The results show that the measures which consider the performance of neighboring nodes (cn, LH) discriminate nodes better than those which consider only a single node (KS, H), and the proposed method GSC shows the best performance, while the existing algorithms Ksd, BC and EC also perform well.

In order to further compare the ability of different methods to distinguish node importance, the second part of the experiment uses the cumulative distribution function (CDF) curve to represent the resolution of these methods. $A$ represents the ranking list generated by one measure, while the CDF of $A$ represents the probability that an element of $A$ is less than or equal to a given value. In other words, the slower the curve rises, the higher the resolution of the method, and the better it distinguishes the importance of nodes. Fig. 2 compares the CDF curves of the ranking lists generated by different algorithms including GSC. Experimental results show that the proposed method has the best performance in distinguishing node importance.

Table 2: The M value of the ranking lists generated by different measures in different networks.

Network    M(KS)   M(cn)   M(H)    M(LH)   M(G)    M(IGC)  M(Ksh)  M(Ksd)  M(BC)   M(CC)   M(EC)   M(PA)   M(GSC)
Karate     0.4958  0.8526  0.5766  0.8925  0.9334  0.9577  0.9334  0.9542  0.7754  0.8993  . . . .
Polbooks   0.4949  0.9641  0.7067  0.9821  0.9982  0.9993  0.9993  . . .
Football   0.0003  0.4218  0.2349  0.9190  0.8626  0.9903  0.8626  0.9994  . . .
Jazz       0.7944  0.9982  0.9383  0.9982  0.9995  0.9995  . .
USAir      0.8114  0.9628  0.8335  0.9856  0.9942  0.9949  0.9943  . .
Email      0.8089  0.9839  0.8584  0.9899  0.9996  0.9998  . . .
WS         0.0002  0.6085  0.2904  0.9155  0.9757  0.9982  0.9799  0.9998  . . .
LFR-2000   0.0385  0.9789  0.7184  0.9927  0.9997  0.9998  0.9998  . . .
Yeast      0.6643  0.9458  0.6873  0.9686  0.9959  0.9964  0.9963  0.9964  0.7012  0.9964  0.7210  0.9916  .

In this experiment, we compare the accuracy between the ranking lists obtained by different measures and the real spreading capability of nodes. In order to acquire the real performance of nodes, we simulate the spreading process of nodes in a traditional epidemic spreading model, then calculate the correlation between the simulation results and the ranking lists obtained by different algorithms. The Susceptible-Infected-Recovered (SIR) model has become the most popular epidemic spreading model because of its simple principle and wide range of applications, and it has been used in many studies [40–43].

Fig. 2: The CDF curves of all measures on the Dolphins, Football, USAir and WS networks.

In the standard SIR model, every node is in one of three states: susceptible (S), infected (I) or recovered (R). In order to obtain the spreading capability of each node, we set only one node to the infected state at the beginning of the experiment, while all the remaining nodes are set to the susceptible state. In each time step, the infected nodes infect each of the susceptible nodes connected to them with probability $\alpha$, and the infected nodes recover with probability $\beta$. After the experiment, the number of nodes in the recovered state is defined as the real spreading capability of the initial node. The experiment is repeated 1000 times for every node of the network, and the average value is taken as the final result. The threshold of the network is defined as $\beta_{th} = \langle d \rangle / \langle d^2 \rangle$, where $\langle d \rangle$ is the average degree of the nodes and $\langle d^2 \rangle$ is the second-order average degree (the mean of the squared degrees). The threshold $\beta_{th}$ and the corresponding $\beta$ are shown in Table 1.

After obtaining the spreading capability of nodes, we use the Kendall correlation coefficient to evaluate the correlation between the ranking lists obtained by different algorithms and the real spreading capability [44–46]. Let $X$ and $Y$ be two ranking sequences, and let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be the set of ranking pairs. Two data pairs $(x_i, y_i)$ and $(x_j, y_j)$ are considered concordant if ($x_i > x_j$ and $y_i > y_j$) or ($x_i < x_j$ and $y_i < y_j$), and discordant if ($x_i > x_j$ and $y_i < y_j$) or ($x_i < x_j$ and $y_i > y_j$). The Kendall correlation coefficient is defined as follows:

$$\tau = \frac{2(R_a - R_b)}{n(n - 1)} \tag{3.2}$$

where $R_a$ and $R_b$ are the numbers of concordant and discordant pairs, and $n$ is the number of ranking pairs $(x_i, y_i)$.

Table 3 shows the correlation, at a given $\beta$, between the nodes' real spreading capability and the ranking lists generated by the involved algorithms. The proposed measure GSC has the best performance in 9 of the 10 experimental datasets, while the cn algorithm has the best performance in the Football network, and KS, BC and PA show the worst effect in all networks due to the limitations of these algorithms. These results reflect the superiority of the proposed method over the other state-of-the-art algorithms.

Fig. 3: The influence of the change of infection rate on the accuracy of different methods in four datasets: Jazz, Email, LFR-2000 and Yeast.

Table 3: The Kendall τ value of each method in 10 networks with a given β value.

Network    KS      cn      H       LH      G       IGC     Ksh     Ksd     BC      CC      EC      PA      GSC
Karate     0.5799  0.6789  0.6219  0.7079  0.7580  0.7838  0.7472  0.7972  0.5433  0.6626  0.8245  0.3535  .
Dolphins   0.7363  0.8275  0.8420  0.8678  0.7499  0.8091  0.5810  0.7984  0.5900  0.6175  0.6132  0.5948  .
Polbooks   0.7196  0.8143  0.7946  0.8507  0.7505  0.7713  0.6196  0.7628  0.3646  0.3715  0.5818  0.4516  .
Football   0.1320  0.4931  0.3897  . .
USAir      0.7550  0.8462  0.7580  0.8478  0.7532  0.7782  0.4633  0.8232  0.5590  0.7805  0.8361  0.3710  .
Email      0.8218  0.8631  0.8401  0.8840  0.8359  0.8533  0.6854  0.8161  0.8210  0.8190  0.8517  0.5747  .
WS         0.1239  0.6701  0.5227  0.6515  0.6255  0.6384  0.4932  0.6373  0.6052  0.5872  0.6235  0.4657  .
LFR-2000   0.4049  0.7004  0.6795  0.7065  0.6614  0.6571  0.5360  0.6811  0.6843  0.7033  0.7157  0.6278  .
Yeast      0.7553  0.8231  0.7604  0.8492  0.7983  0.8108  0.5835  0.7703  0.6301  0.5653  0.7270  0.3046  .

In addition, we study the accuracy of the algorithm in the SIR model under different infection rates. Taking four networks of different sizes as examples, Fig. 3 shows the correlation curves between the ranking lists and the real spreading ability of nodes. In the experimental networks, as $\beta$ increases, the proposed method is more accurate than the other methods; especially near the threshold $\beta_{th}$, the accuracy reaches its peak. The performance of the existing algorithms is equal to that of the GSC measure in the comparison of discriminating ability, while these measures are far less accurate than GSC.

Table 4: Top-10 nodes ranked by different centrality methods in five real-world networks and the simple graph network.

Rank   Karate                      Dolphins
       KS   cn   H    LH   GSC     KS   cn   H    LH   GSC
1      34   1    34   34   1       60   15   52   15   38
2      33   34   33   1    34      58   46   51   46   15
3      31   3    14   3    3       55   38   46   38   46
4      14   33   3    33   33      53   34   41   34   34
5      9    2    1    2    9       52   21   38   21   51
6      8    4    31   4    14      51   30   34   30   41
7      4    32   24   14   32      48   41   30   52   22
8      3    14   9    9    2       46   52   25   51   19
9      2    9    8    32   4       44   58   22   41   30
10     1    24   4    24   31      43   2    21   19   17

Rank   Polbooks                    Football
       KS   cn   H    LH   GSC     KS   cn   H    LH   GSC
1      101  9    74   9    9       115  105  84   68   68
2      100  13   85   85   13      114  89   74   54   8
3      92   85   74   13   85      113  68   68   89   3
4      87   4    83   74   74      112  54   54   16   54
5      85   73   77   31   31      111  16   50   3    89
6      84   74   76   73   4       110  8    48   8    16
7      83   31   75   4    67      109  7    47   7    105
8      80   67   73   67   73      108  6    33   105  7
9      77   48   67   76   12      107  4    16   2    1
10     76   41   48   75   75      106  3    8    1    4

Rank   Jazz                        Simple graph (Fig. 1)
       KS   cn   H    LH   GSC     KS   cn   H    LH   GSC
1      172  100  100  100  100     13   7    13   7    7
2      168  8    8    8    8       12   4    12   13   4
3      158  4    4    4    4       11   13   11   11   12
4      131  131  131  131  131     10   12   10   12   11
5      130  80   129  80   80      8    11   7    10   10
6      129  129  80   129  194     7    10   8    4    13
7      106  5    53   5    129     6    5    6    8    8
8      105  32   5    194  5       5    8    5    5    6
9      104  194  194  53   53      4    6    4    6    1
10     103  84   69   69   69      3    3    3    3    5
11     102  69   130  32   162     2    2    2    2    3
12     100  85   85   162  32      1    1    1    1    2
13     98   53   84   77   59      9    9    9    9    9

Similarity

In the last experiment, we compare the ranking lists themselves. Different measures generate different ranking lists because they consider different aspects of the network topology, so we use the number of identical top-ranked nodes in each list to determine the similarity between the methods [47]. A larger number of shared nodes increases the credibility of the measure, while the nodes unique to the GSC list will bring significant changes to the spreading process. Experimental results are shown in Table 4. In the Karate network, the KS, cn, H and LH algorithms match the GSC measure closely: the numbers of shared nodes are 9, 9, 8 and 9, respectively. Among the other small-scale networks, the numbers are 2, 6, 7 and 8 in the Dolphins network, 1, 8, 6 and 9 in the Polbooks network, 0, 9, 4 and 9 in the Football network, and 3, 10, 10 and 12 in the Jazz network. The agreement of the KS algorithm with GSC gradually weakens as the network size increases, while the other three algorithms remain similar to GSC. For the simple graph in Fig. 1, the proposed measure distinguishes the importance of nodes in finer detail than the other four algorithms and better reflects the performance of nodes in the network.
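The shared-node count used in this comparison can be reproduced in a few lines. The following is a minimal sketch, not the authors' implementation: the helper name `shared_nodes` is hypothetical, and the lists are the Karate columns copied from Table 4. It assumes that similarity is simply the number of distinct nodes two ranking lists have in common, regardless of their positions.

```python
def shared_nodes(list_a, list_b):
    """Count the distinct nodes that appear in both ranking lists.

    Position in the list is ignored; only membership matters.
    (Hypothetical helper, assumed reading of the similarity count.)
    """
    return len(set(list_a) & set(list_b))


# Top-10 lists for the Karate network, copied from Table 4.
karate = {
    "KS":  [34, 33, 31, 14, 9, 8, 4, 3, 2, 1],
    "cn":  [1, 34, 3, 33, 2, 4, 32, 14, 9, 24],
    "H":   [34, 33, 14, 3, 1, 31, 24, 9, 8, 4],
    "LH":  [34, 1, 3, 33, 2, 4, 14, 9, 32, 24],
    "GSC": [1, 34, 3, 33, 9, 14, 32, 2, 4, 31],
}

# Compare every baseline list against the GSC list.
overlap = {
    method: shared_nodes(ranking, karate["GSC"])
    for method, ranking in karate.items()
    if method != "GSC"
}
print(overlap)  # {'KS': 9, 'cn': 9, 'H': 8, 'LH': 9}
```

The result agrees with the counts 9, 9, 8 and 9 reported for the Karate network. Using sets keeps the measure independent of rank order, so it captures only which nodes are selected, not how they are ordered within the list.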

Identifying and selecting users who can spread information efficiently has become one of the most actively studied research topics, and finding the influential nodes is a widely used approach to this goal. In this paper, a new method is proposed to evaluate the importance of nodes in complex networks: nodes are classified based on the distance matrix, the correlation between nodes is incorporated, and the global clustering coefficient of the network is then applied to the study of node importance. Through extensive experiments on both artificial and real-world networks, in which our algorithm is compared with currently popular algorithms, we demonstrate that the proposed method performs better in accuracy, similarity, discrimination capability and other aspects, which makes it valuable for further research.

References

[1] J. Heidemann, M. Klier, F. Probst, Online social networks: A survey of a global phenomenon, Comput. Netw, 56(18): 3866-3878, 2012.
[2] A. Bozorgi, H. Haghighi, M.S. Zahedi, M. Rezvani, Incim: A community-based algorithm for influence maximization problem under the linear threshold model, Inf. Process. Manage, 52(6): 1188-1199, 2016.
[3] W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in large-scale social networks, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1029-1038, 2010.
[4] Z. Yu, C. Wang, J. Bu, X. Wang, Y. Wu, C. Chen, Friend recommendation with content spread enhancement in social networks, Inform. Sci, 309: 102-118, 2015.
[5] A. Sheikhahmadi, M.A. Nematbakhsh, A. Shokrollahi, Improving detection of influential nodes in complex networks, Physica A, 436: 833-845, 2015.
[6] A. Sheikhahmadi, M.A. Nematbakhsh, A. Zareie, Identification of influential users by neighbors in online social networks, Physica A, 486: 517-534, 2017.
[7] R.M. Bond, et al., A 61-million-person experiment in social influence and political mobilization, Nature, 489(7415): 295, 2012.
[8] M.-E.G. Rossi, F.D. Malliaros, M. Vazirgiannis, Spread it good, spread it fast: Identification of influential nodes in social networks, in: Proceedings of the 24th International Conference on World Wide Web, ACM, pp. 101-102, 2015.
[9] L.C. Freeman, Centrality in social networks conceptual clarification, Soc. Netw, 1(3): 215-239, 1978.
[10] L.C. Freeman, A set of measures of centrality based on betweenness, Sociometry, 40(1): 35-41, 1977.

[43] W.R. Knight, A computer method for calculating Kendall's tau with ungrouped data, J. Amer. Statist. Assoc, 61(314): 436-439, 1966.
[44] M. Jalili, M. Perc, Information cascades in complex networks, J. Complex Netw, 5(5): 665-693, 2017.
[45] A. Buscarino, L. Fortuna, M. Frasca, V. Latora, Disease spreading in populations of moving agents, Europhys. Lett, 82(3): 38002, 2008.
[46] R. Pastor-Satorras, A. Vespignani, Epidemic dynamics and endemic states in complex networks, Phys. Rev. E, 63(6): 066117, 2001.
[47] J. Zhao, Y.C. Wang, Y. Deng, Identifying influential nodes in complex networks from global perspective, Chaos, Solitons and Fractals, 133, 109637, 2020.
[48] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst, 30(1-7): 107-117, 1998.
[49] X. Zhang, J. Zhu, Q. Wang, H. Zhao, Identifying influential nodes in complex networks with community structure, Knowl. Base Syst, 42: 74-84, 2013.