Improving personalized link prediction by hybrid diffusion
Jin-Hu Liu (a), Yu-Xiao Zhu (a,*), Tao Zhou (a,b)

(a) Complex Lab, Web Sciences Center, University of Electronic Science and Technology of China, Chengdu 611731, China
(b) Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract
Inspired by traditional link prediction and aiming at the problem of recommending friends in social networks, we introduce personalized link prediction, in which each individual receives an equal number of diversified predictions. The performances of many classical algorithms are not satisfactory under this framework, so new algorithms are urgently needed. Motivated by previous research in other fields, we generalize the heat conduction process to the framework of personalized link prediction and find that this method outperforms many classical similarity-based algorithms, especially in diversity. In addition, we demonstrate that adding one ground node, which is supposed to connect all the nodes in the system, greatly benefits the performance of heat conduction. Finally, we propose improved hybrid algorithms composed of local random walk and heat conduction. Numerical results show that the hybrid algorithms can outperform the other algorithms simultaneously in all four adopted metrics: AUC, precision, recall and Hamming distance. In a word, this work may shed some light on the in-depth understanding of the effect of physical processes in personalized link prediction.

Key words:
Personalized link prediction, heat conduction, ground node
1. Introduction
Networks are useful models to describe many social, biological and information systems, where nodes represent individuals and links reflect the relations or interactions between nodes [1, 2, 3]. Networks have been widely studied in many different fields, and one of the fundamental problems of network analysis is link prediction, which aims to estimate the likelihood of the existence of a link between two nodes based on observed links and the attributes of nodes [4, 5]. For example, in many biological networks, such as protein-protein interaction networks and metabolic networks, the existence of a link must be verified by costly experiments; if the predictions are accurate enough, the experimental cost can be sharply reduced compared with blind checking. The missing-data problem also exists in social networks, where link prediction is likewise a useful tool. In addition, link prediction algorithms can be applied to identify spurious links [6, 7, 8]. Such algorithms can be used not only to recover missing data but also to predict the links that may appear in the future of evolving networks. For example, in online social networks, very likely but not yet existent links can be recommended as promising friendships, which helps users find new friends and thus enhances their loyalty to the websites.

(* Corresponding author: [email protected] (Yu-Xiao Zhu). Preprint submitted to Physica A, December 18, 2015.)

In traditional link prediction, all nonexistent links are sorted in descending order according to their prediction scores, and the top-ranked links are considered most likely to exist. Clearly, in this case the prediction list is generated from a global perspective, in which some nodes may have a large number of promising links while others may have very few or even none. This straightforward and standard method may lead to some bias.
On the one hand, links connecting low-degree nodes may be casually ignored in this case, although this kind of information can be very important and meaningful [9]; moreover, some research has revealed that low-degree users may have a big influence in the future [10]. On the other hand, the imbalance of the prediction list may be unsatisfactory for some individuals and thus affect the experience of the whole system. For example, in social networks it is useful and meaningful to accurately predict a certain number of potential friends or acquaintances for each registered user. In this setting no real distinction should be made between low-degree and high-degree users, and global link prediction does not apply decently. However, this phenomenon has been largely neglected in traditional link prediction over the past decades. To solve these problems, we propose personalized link prediction, in which every node gets an equal number of possible links based on its own past link records.

One challenge deserving special attention, the so-called low-diversity problem, has plagued almost all recommendation systems: many recommender systems always recommend very similar items to different users, which narrows users' views [11]. Subsequently, some physical dynamics, such as the heat conduction process (HC), have been applied to design recommender systems and can improve the diversity of recommendation. Motivated by this, we generalize the heat conduction process to the framework of personalized link prediction and find that it outperforms other methods in diversity but does not perform very satisfactorily in accuracy. To solve this dilemma, a ground node, which is supposed to connect all the nodes in the system, is incorporated to improve the prediction accuracy.
Finally, we generalize one superior hybrid algorithm (LH) and propose another, better hybrid algorithm (LGH) composed of local random walk (LRW) [12] and ground heat conduction (GHC), which performs well not only on accuracy but also on diversity.

This article is organized as follows. In the next section we define the problem of personalized link prediction and describe the standard metrics for evaluation. We then review several state-of-the-art similarity indices and introduce the algorithms HC, GHC, LH and LGH in Section 3. Data description and experimental results for the existing prediction algorithms and the proposed methods are presented in Section 4. Finally, we summarize our results in Section 5.
2. Problem and Metrics
Consider an undirected network G(V, E), in which V and E are the sets of nodes and links, respectively. The universal set of all |V|(|V| − 1)/2 possible links is denoted by U, where |V| denotes the number of elements in set V (multiple links and self-connections are not allowed). Clearly, the set of nonexistent links is U \ E, which contains both missing links (i.e., existing yet unknown links) and promising links (i.e., very likely but not yet existent links). The task of link prediction is to uncover these links. Each node pair x and y is assigned a score s_xy according to a given prediction algorithm; the higher the score, the higher the existence likelihood of this link. For each node x, we denote the set of its relevant nonexistent links (nonexistent links that connect x) as (U \ E)_x; all links in (U \ E)_x are sorted in descending order according to their scores, and the top-ranked links are considered most likely to exist.

To test the performance of a given algorithm, we divide the observed links E into two sets: the training set E^T (considered as known information) and the test set E^P (used for testing; no information therein is allowed to be used for prediction). Clearly, E = E^T ∪ E^P and E^T ∩ E^P = ∅. For each node x, the relevant test set (links in E^P that connect x) is denoted by E^P_x. We then introduce four popular evaluation metrics.

(i) AUC, short for the area under the receiver operating characteristic curve, is a standard metric to quantify the accuracy of prediction [13].
Specifically, for each node x, this metric can be interpreted as the probability that a randomly chosen relevant missing link (a link in E^P_x) has a higher score than a randomly chosen relevant nonexistent link (a link in (U \ E)_x). In the implementation, among n independent comparisons, if there are n′ times that the missing link has the higher score and n″ times that the missing link and the nonexistent link have the same score, the AUC value is defined as

AUC = (n′ + 0.5 n″) / n.  (1)

The AUC of the whole system is the average value over all nodes. If all the scores were generated from an independent and identical distribution, the accuracy would be about 0.5; therefore, the extent to which the AUC exceeds 0.5 indicates how much better the algorithm performs than pure chance.

(ii) Precision and Recall [4]. Given the ranking of the non-observed links, precision is defined as the ratio of relevant items selected to the number of items selected. Denote by L the length of the prediction list (i.e., the number of nodes recommended to each individual). For each individual x, if we take the top-L links as the predicted ones, among which L_x links are right (i.e., there are L_x links in the test set E^P_x), then the precision equals P_x = L_x / L. Recall is defined as the ratio of relevant items selected to the number of relevant items in the test set, that is, R_x = L_x / N_x, where N_x denotes the number of node x's links in its test set E^P_x. Clearly, higher precision and higher recall mean higher prediction accuracy. The precision (recall) of the whole system is the average value over all individuals.

(iii) Hamming distance [14, 15]. This is one of the well-known metrics that quantify the diversity of the prediction system. For individuals x and y, if the number of overlapping nodes in their prediction lists is Q_xy, their Hamming distance is defined as

H_xy = 1 − Q_xy / L.  (2)

Generally speaking, a more diverse set of prediction lists has larger Hamming distances, which means recommending appropriately rather than merely popularly. Accordingly, we use the mean Hamming distance,

H = (1 / (|V|(|V| − 1))) Σ_{x ≠ y} H_xy,  (3)

averaged over all node pairs, to measure the diversity of predictions.

3. Algorithms

In the traditional link prediction problem, similarity-based algorithms are the mainstream due to their simplicity. Considering this, we adopt the simplest local similarity indices as benchmarks in the framework of personalized link prediction. For a similarity-based algorithm, the aforementioned score s_xy is directly defined as the similarity between nodes x and y [16]. For each node x, we rank all relevant links (links connecting x and other nodes) in the relevant non-observed set (U \ E)_x by their scores; links with higher scores are supposed to have higher existence likelihoods and constitute the personalized prediction list of length L.

(1) Common Neighbors (CN). Two nodes x and y are more likely to have a link if they have many common neighbors [17]. The simplest measure of this neighborhood overlap is the direct count, namely

s^CN_xy = |Γ_x ∩ Γ_y|,  (4)

in which Γ_x denotes the set of neighbors of x.

(2) Salton index [18]. It is defined as

s^Salton_xy = |Γ_x ∩ Γ_y| / √(k_x k_y),  (5)

where k_x denotes the degree of node x. The Salton index is also called the cosine similarity in the literature.

(3) Jaccard index [19]. This index was proposed by Jaccard over one hundred years ago, defined as

s^Jaccard_xy = |Γ_x ∩ Γ_y| / |Γ_x ∪ Γ_y|.  (6)

(4) Adamic-Adar (AA) index [20]. This index refines the simple counting of common neighbors by assigning more weight to the less-connected neighbors, and is defined as

s^AA_xy = Σ_{z ∈ Γ_x ∩ Γ_y} 1 / log k_z.  (7)

(5) Preferential Attachment (PA) index [21].
The mechanism of preferential attachment is widely used to generate evolving scale-free networks, where the probability that a new link connects to node x is proportional to k_x. The corresponding similarity index can be defined as

s^PA_xy = k_x × k_y.  (8)

(6) Resource Allocation (RA) index [22]. This index is motivated by the resource-allocation dynamics on complex networks, and is defined as

s^RA_xy = Σ_{z ∈ Γ_x ∩ Γ_y} 1 / k_z,  (9)

where z runs over all common neighbors of x and y.

(7) Local Random Walk (LRW) index [12]. To measure the similarity between nodes x and y, a random walker is initially put on node x, so the initial density vector is π_x(0) = e_x. This density vector evolves as π_x(t + 1) = P^T π_x(t) for t ≥ 0, where P is the transition matrix with P_xy = a_xy / k_x. The LRW index at time step t is thus defined as

s^LRW_xy(t) = q_x π_xy(t) + q_y π_yx(t),  (10)

where q is the initial configuration function.
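As a minimal illustration (not the authors' code), the local indices above can be sketched on a neighbor-set representation of a graph; the toy data and function names here are hypothetical:

```python
import math

# Toy undirected graph as neighbor sets (hypothetical example data).
neighbors = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def cn(x, y):          # Common Neighbors, Eq. (4)
    return len(neighbors[x] & neighbors[y])

def salton(x, y):      # Salton (cosine) index, Eq. (5)
    return cn(x, y) / math.sqrt(len(neighbors[x]) * len(neighbors[y]))

def jaccard(x, y):     # Jaccard index, Eq. (6)
    return cn(x, y) / len(neighbors[x] | neighbors[y])

def aa(x, y):          # Adamic-Adar, Eq. (7): each common neighbor z weighs 1/log k_z
    return sum(1.0 / math.log(len(neighbors[z]))
               for z in neighbors[x] & neighbors[y])

def pa(x, y):          # Preferential Attachment, Eq. (8)
    return len(neighbors[x]) * len(neighbors[y])

def ra(x, y):          # Resource Allocation, Eq. (9): weight 1/k_z
    return sum(1.0 / len(neighbors[z])
               for z in neighbors[x] & neighbors[y])
```

For personalized prediction, each node x would rank its non-neighbors by one of these scores and keep the top L.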
Figure 1: Schematic illustration of LRW and HC from step 0 to step 2. Nodes are represented by circles and lines represent the existing links between them. The numerical value over each node indicates its temperature (resource) at this step. The red node is the target node. After step 2, the high-temperature nodes that have no links with the target node will be recommended.

Motivated by previous studies [11], we generalize the heat conduction algorithm to the framework of personalized link prediction. Basically, the heat conduction algorithm (HC) recommends promising links to an individual node through a process motivated by heat diffusion. Let A denote the adjacency matrix, whose element a_xy = 1 if node x is connected with node y and a_xy = 0 otherwise. For a given target node x, its temperature is initialized to 1, while the temperatures of the remaining nodes are set to 0. In each iteration, HC redistributes the temperature via a nearest-neighbor averaging process; that is, the temperature of node x at step t + 1 is

T_x(t + 1) = (1 / k_x) Σ_{y=1}^{M} a_xy T_y(t),  (11)

where M denotes the number of nodes and k_x, the degree of node x, is the number of nodes connected with x. The final temperatures after the diffusion are taken as the corresponding scores, and the non-connected nodes are ranked by these scores in descending order. To describe the diffusion process visually, an illustration of the first two steps of the HC and LRW processes is shown in Figure 1.

Furthermore, we propose a new algorithm by adding a ground node that connects all the nodes in the network [23]. The iterative temperature of the GHC algorithm (short for HC with a ground node) is written as

T_x(t + 1) = (1 / (k_x + 1)) ( Σ_{y=1}^{M} a_xy T_y(t) + T_g(t) ),  (12)

where T_g denotes the temperature of the ground node, which provides an additional path between any two nodes even when they are not directly connected.

In addition, motivated by the literature [11, 24], we generalize one superior hybrid algorithm (LH) composed of LRW and HC:

T_x(t + 1) = α Σ_{y=1}^{M} a_xy T_y(t) / k_y + ((1 − α) / k_x) Σ_{y=1}^{M} a_xy T_y(t),  (13)

where α is an adjustable parameter ranging from 0 to 1. Obviously, LH turns into HC when α = 0 and degenerates to LRW when α = 1. This method evaluates potential nodes from two aspects: the strength of the connection and the personalization. Furthermore, we propose another novel hybrid algorithm (LGH) by combining LRW and GHC:

T_x(t + 1) = α Σ_{y=1}^{M} a_xy T_y(t) / k_y + ((1 − α) / (k_x + 1)) ( Σ_{y=1}^{M} a_xy T_y(t) + T_g(t) ),  (14)

where α again ranges from 0 to 1. Obviously, LGH turns into GHC when α = 0 and degenerates to LRW when α = 1.
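The update rules (11)-(14) can be sketched compactly in matrix form. The following is an illustration under stated assumptions, not the authors' implementation: the network is assumed to have no isolated nodes, and the ground-node update (taking the mean temperature, i.e., the HC rule applied to a node of degree M) is our reading of the construction:

```python
import numpy as np

def diffuse(A, x, steps=2, alpha=0.0, ground=False):
    """Iterate the hybrid rule for one target node x.

    alpha=0 gives HC (or GHC if ground=True), alpha=1 gives the
    LRW-style mass-diffusion term; intermediate alpha gives LH/LGH.
    """
    M = A.shape[0]
    k = A.sum(axis=1)                      # node degrees (assumed > 0)
    T = np.zeros(M)
    T[x] = 1.0                             # unit temperature on the target node
    Tg = 0.0                               # ground-node temperature, Eq. (12)/(14)
    for _ in range(steps):
        incoming = A @ T                   # sum_y a_xy T_y(t)
        walk = A @ (T / k)                 # sum_y a_xy T_y(t) / k_y
        if ground:
            new_Tg = T.mean()              # assumed ground update: average over nodes
            T = alpha * walk + (1 - alpha) * (incoming + Tg) / (k + 1)
            Tg = new_Tg
        else:
            T = alpha * walk + (1 - alpha) * incoming / k
    return T
```

After the final step, the non-neighbors of x with the highest temperatures form its personalized prediction list.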
4. Data and Experiments
We consider four representative networks drawn from disparate fields: (i) USAir, the network of the US air transportation system, which contains 332 airports and 2126 airlines [25]; (ii) C.elegans (CE), the neural network of the nematode worm C.elegans, in which an edge joins two neurons if they are connected by either a synapse or a gap junction [26] (this network contains 297 neurons and 2148 links); (iii) Political Blogs (PB), the network of US political blogs [27], whose original links are directed but are treated here as undirected; (iv) Food Webs (FW), the food-chain network of Florida Bay, containing 128 living beings and 2106 preying relations [28]. Table 1 summarizes the basic topological features of these networks; brief definitions of the monitored topological measures can be found in the table caption.

Table 1: The basic topological features of the giant components of the four example networks. CE, PB and FW are the abbreviations for the C.elegans, Political Blogs and Food Webs networks, respectively. N = |V| and M = |E| are the total numbers of nodes and links, respectively. C is the clustering coefficient, defined as the average ratio of the number of connected pairs of a node's neighbors to the possible maximum [29]. r is the assortativity coefficient [30], the Pearson correlation coefficient of the degrees of connected node pairs; r lies between −1 and 1, with r > 0 indicating assortative mixing and r < 0 disassortative mixing. ⟨k⟩ is the average degree of the network, and ⟨d⟩ is the average shortest distance between node pairs. H denotes the degree heterogeneity, defined as H = ⟨k²⟩ / ⟨k⟩².

Datasets   N     M      C      r        ⟨k⟩     ⟨d⟩    H
USAir      322   2126   0.749  -0.208   12.81   2.46   3.46
CE         297   2148   0.308  -0.163   14.46   2.46   1.80
PB
FW         128   2106   0.335  -0.104   32.90   1.77   1.23
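The measures reported in Table 1 are all standard; as a self-contained sketch (on a hypothetical toy graph, not the paper's datasets), the clustering coefficient, average degree and degree heterogeneity can be computed as follows:

```python
# Toy undirected graph as neighbor sets (hypothetical example data).
neighbors = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

N = len(neighbors)
M = sum(len(v) for v in neighbors.values()) // 2   # each link counted twice
degs = {x: len(v) for x, v in neighbors.items()}

def local_clustering(x):
    """Ratio of connected neighbor pairs to the possible maximum [29]."""
    k = degs[x]
    if k < 2:
        return 0.0
    links = sum(1 for y in neighbors[x] for z in neighbors[x]
                if y < z and z in neighbors[y])
    return links / (k * (k - 1) / 2)

C = sum(local_clustering(x) for x in neighbors) / N           # clustering coefficient
k_mean = sum(degs.values()) / N                               # <k>
H = (sum(k * k for k in degs.values()) / N) / k_mean ** 2     # <k^2> / <k>^2
```

The assortativity coefficient r and the average shortest distance ⟨d⟩ would require a Pearson correlation over link endpoints and a breadth-first search, respectively, and are omitted here for brevity.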
To test the performances of the algorithms, we randomly divide each data set into a training set and a test set with a ratio of 9:1; that is, the test set contains 10% of the links. The prediction list for each individual is generated from the training set, and the test set is used for evaluation. The four metrics AUC, precision, recall and Hamming distance are adopted to give quantitative measurements of the methods. All the results below are averaged over 100 independent runs with different data divisions.

Table 2: Algorithmic accuracies on the four datasets, measured by AUC. Each value is obtained by averaging over 100 implementations with independently random divisions of training and test sets. CE, PB and FW are the abbreviations for the C.elegans, Political Blogs and Food Webs networks, respectively. The number in brackets indicates the step at which the corresponding algorithm reaches its optimal value.

Datasets   USAir       CE          PB          FW
CN
Salton
Jaccard
AA
PA
RA
LRW
HC
GHC
LH
LGH        0.9539(4)   0.8973(3)   0.9204(3)   0.9091(3)
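The evaluation protocol (9:1 random split, then per-node metrics as in Eqs. (1) and (2)) can be sketched as follows; all function names and the sampling scheme are illustrative, not the authors' code:

```python
import random

def split_edges(edges, frac_test=0.1, seed=0):
    """Random division of the observed links into E^T and E^P (9:1 by default)."""
    rng = random.Random(seed)
    edges = sorted(edges)
    rng.shuffle(edges)
    n_test = int(len(edges) * frac_test)
    return set(edges[n_test:]), set(edges[:n_test])   # (training, test)

def node_auc(score, x, test_x, nonexistent_x, n=1000, seed=0):
    """Eq. (1): compare scores of random missing vs. nonexistent links of node x."""
    rng = random.Random(seed)
    n1 = n2 = 0
    for _ in range(n):
        s_miss = score(x, rng.choice(list(test_x)))
        s_non = score(x, rng.choice(list(nonexistent_x)))
        if s_miss > s_non:
            n1 += 1
        elif s_miss == s_non:
            n2 += 1
    return (n1 + 0.5 * n2) / n

def precision_recall(top_L, test_x):
    """P_x = L_x / L and R_x = L_x / |E^P_x| for one node's top-L list."""
    hits = len(set(top_L) & set(test_x))
    return hits / len(top_L), hits / len(test_x)

def hamming(list_x, list_y, L):
    """Eq. (2): H_xy = 1 - Q_xy / L for two prediction lists."""
    return 1 - len(set(list_x) & set(list_y)) / L
```

System-level values are then obtained by averaging these per-node quantities over all nodes (or node pairs, for the Hamming distance).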
The performances of the different methods measured by AUC on all data sets are shown in Table 2, with the highest AUC value in each column emphasized in black. The optimal steps of iteration corresponding to the best AUC are shown in the brackets. GHC is the abbreviation of the HC method with a ground node; LH refers to the hybrid method that combines the LRW and HC algorithms, while LGH refers to the hybrid method that combines LRW and GHC. Clearly, the HC algorithm outperforms the classical benchmark algorithms (CN, Salton, Jaccard, AA and PA) in most networks, while it is slightly inferior to LRW. In order to improve the prediction accuracy of HC, we propose the new method GHC by adding one ground node to the system. Comparing the results of HC and GHC, we see that the prediction accuracy is improved to some extent in this way; for example, the AUC increases from 0.8450 to 0.8697 for the USAir data set. Previous studies have shown that the original HC algorithm favors small-degree nodes [11]. Adding a ground node to the system practically amounts to adding an additional transition probability from the ground node to every other node. Thus, in every iteration each node receives the same heat from the ground node and averages it with the heat from other sources, so the temperatures of big-degree nodes are enhanced. This improvement of the predictions on big-degree nodes leads to the improvement of the prediction accuracy of the whole system. Moreover, diffusion-based algorithms are restricted by the network connectedness; the ground node makes the whole network much more strongly connected, with the shortest path between any two nodes less than 3. That is why GHC performs better than HC.

Figure 2: The performance of LH and LGH measured by four metrics (AUC, precision (P), recall (R) and Hamming distance (H)) as a function of α on the representative data sets. The blue and black curves represent the results of LH and LGH at the step giving the optimal AUC, respectively, and the red solid line represents the result of LRW at the step giving the optimal AUC. All the numerical results are obtained by averaging over 100 independent runs with different data divisions. We set L = .

Figure 3: The performance of LH and LGH measured by four metrics (AUC, precision (P), recall (R) and Hamming distance (H)) as a function of α on the representative data sets. The blue and black curves represent the results of LH and LGH at the step giving the optimal AUC, respectively, and the red solid line represents the result of LRW at the step giving the optimal AUC. All the numerical results are obtained by averaging over 100 independent runs with data divisions identical to the case shown in Figure 2. We set L = 20 here.
Figure 4: The correlation between AUC and the diffusion step. The black squares, red circles, blue triangles, cyan triangles and wine stars represent the AUC of the five algorithms (LRW, HC, GHC, LH, LGH) at each diffusion step, respectively. The AUC values for LH and LGH correspond to the optimal values at each step.

Furthermore, in Figure 2 the black and blue curves show the results of LGH and LH with the corresponding α, respectively; for comparison, the results of LRW (α = 1) are displayed as the red solid line. Obviously, the hybrid algorithms greatly improve the prediction results. By combining LRW and GHC, the hybrid algorithm LGH performs the best in the accuracy of
prediction measured by AUC. Besides, Figure 2 displays the performances of LGH measured by recall and precision as functions of the free parameter α, although its diversity is a little lower than that of LH. The advantage of the hybrid algorithms comes from two aspects. First, LRW keeps the popular nodes with a certain weight in the prediction list. Second, and more importantly, GHC or HC enhances the weights of low-degree nodes and simultaneously reduces the weights of high-degree nodes, which makes the prediction lists much more personalized and accurate. Over a certain range of the free parameter, the hybrid algorithms not only improve the accuracy (AUC, precision, recall) but also benefit the diversity of prediction. These improvements are independent of the prediction-list length L: Figure 3 shows the corresponding results for L = 20, which display the same tendency.

For all diffusion-based methods, one key issue is the number of diffusion steps. Liu et al. have proven a positive correlation between the optimal step and the average shortest distance [12]. In our algorithms, the average shortest distance of the network becomes smaller after adding the ground node, so the optimal value is reached with fewer diffusion steps. From Table 2 and Figure 4, it is clear that LH and LGH both obtain their best results within four steps on all four data sets, and after the fifth step the results on all data sets worsen as the diffusion step increases.
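The step dependence in Figure 4 corresponds to scanning the iteration count and keeping the step with the best AUC. A schematic loop, where `run_step` and `evaluate_auc` are hypothetical placeholders for running the diffusion for t steps and scoring the result via Eq. (1):

```python
def best_step(run_step, evaluate_auc, max_steps=8):
    """Return (step, auc) maximizing AUC over diffusion steps 1..max_steps.

    run_step(t) should return the prediction scores after t diffusion steps;
    evaluate_auc(scores) is a stand-in for the AUC evaluation on the test set.
    """
    best = (None, -1.0)
    for t in range(1, max_steps + 1):
        auc = evaluate_auc(run_step(t))
        if auc > best[1]:
            best = (t, auc)
    return best
```

Because the ground node shortens the average distance, in practice the scan can stop after a few steps, consistent with the optima within four steps reported above.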
5. Conclusion and Discussion
In the past several decades the Internet has flourished as never before. In particular, as freely registered platforms, social networks provide us with abundant information, countless items and numerous opportunities to make new friends around the world. Meanwhile, we are in deep trouble when facing information overload: it is extremely hard to dig out interesting information, suitable items to purchase, new friends with similar abilities and interests, and old acquaintances we have lost touch with. Fortunately, personalized recommendation and prediction give us considerable assistance: based on historical behaviors, potential interests and links can be found automatically via appropriate algorithms. Thus, an accurate and effective algorithm is vital and extremely valuable. To our knowledge, however, previous studies on link prediction sort all nonexistent links in descending order of their prediction scores and take the top-ranked links as most likely to exist. Clearly, in this case the top-ranked list is generated from a global perspective, in which some nodes may have a large number of promising links while others may have very few or even none. Considering this bias, we propose the personalized link prediction problem, in which all nodes get an equal number of possible links. Motivated by previous works, we generalized a physical process, heat conduction, to the framework of personalized link prediction and found that it outperforms most existing similarity-based algorithms. In addition, we demonstrated that adding one ground node, which is supposed to connect all the nodes in the system, benefits the performance of HC. Finally, we introduced two hybrid algorithms that perform well not only on accuracy but also on diversity. One small defect of our algorithms is that they are relatively complicated and time-consuming; fortunately, this weakness can be alleviated by parallel computation techniques.
With the improvement of computing techniques, an algorithm with better performance should be the top choice. However, how to provide better personalized predictions is still a long-standing challenge in modern information science, and a satisfying answer to this question may benefit our society, economy and lifestyle in the near future. As a starting point, we give a simple method and a preliminary analysis, which is of course far from a complete answer; still, we believe the current work provides some insights into this issue.
6. Acknowledgments
This work was partially supported by the National Natural Science Foundation of Chinaunder grant nos. 61370150 and 61433014, and the Program of Outstanding PhD Candidate inAcademic Research by UESTC: y-bxszc20131035. The funders had no role in study design, datacollection and analysis, decision to publish, or preparation of the manuscript.
References

[1] M. E. J. Newman, The structure and function of complex networks, SIAM Rev. 45 (2) (2003) 167–256.
[2] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, D.-U. Hwang, Complex networks: structure and dynamics, Phys. Rep. 424 (4) (2006) 175–308.
[3] R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function, Cambridge University Press, 2010.
[4] L. Lü, T. Zhou, Link prediction in complex networks: a survey, Physica A 390 (6) (2011) 1150–1170.
[5] L. Getoor, C. P. Diehl, Link mining: a survey, ACM SIGKDD Explorations Newsletter 7 (3) (2005) 3–12.
[6] G. Kossinets, Effects of missing data in social networks, Social Networks 28 (2006) 247–268.
[7] R. Guimerà, M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks, Proc. Natl. Acad. Sci. U.S.A. 106 (52) (2009) 22073–22078.
[8] A. Zeng, G. Cimini, Removing spurious interactions in complex networks, Phys. Rev. E 85 (3) (2012) 036101.
[9] Y.-X. Zhu, L. Lü, Q.-M. Zhang, T. Zhou, Uncovering missing links with cold ends, Physica A 391 (22) (2012) 5769–5778.
[10] J.-H. Liu, T. Zhou, Z.-K. Zhang, Z. Yang, C. Liu, W.-M. Li, Promoting cold-start items in recommender systems, PLoS One 9 (12) (2014) e113457.
[11] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, Y.-C. Zhang, Solving the apparent diversity-accuracy dilemma of recommender systems, Proc. Natl. Acad. Sci. U.S.A. 107 (10) (2010) 4511–4515.
[12] W. Liu, L. Lü, Link prediction based on local random walk, EPL 89 (5) (2010) 58007.
[13] J. A. Hanley, B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1) (1982) 29–36.
[14] T. Zhou, L.-L. Jiang, R.-Q. Su, Y.-C. Zhang, Effect of initial configuration on network-based recommendation, EPL 81 (5) (2008) 58004.
[15] T. Zhou, R.-Q. Su, R.-R. Liu, L.-L. Jiang, B.-H. Wang, Y.-C. Zhang, Accurate and diverse recommendations via eliminating redundant correlations, New J. Phys. 11 (12) (2009) 123008.
[16] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Am. Soc. Inform. Sci. Technol. 58 (7) (2007) 1019–1031.
[17] G. Kossinets, Effects of missing data in social networks, Social Networks 28 (2006) 247–268.
[18] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Auckland, 1983.
[19] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901) 547–579.
[20] L. A. Adamic, E. Adar, Friends and neighbors on the web, Social Networks 25 (2003) 211–230.
[21] A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509–512.
[22] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (2009) 623–630.
[23] L. Lü, Y.-C. Zhang, C. H. Yeung, T. Zhou, Leaders in social networks, the Delicious case, PLoS One 6 (6) (2011) e21202.
[24] R.-R. Liu, J.-G. Liu, C.-X. Jia, B.-H. Wang, Personal recommendation via unequal resource allocation on bipartite networks, Physica A 389 (16) (2010) 3282–3289.
[25] V. Batagelj, A. Mrvar, Pajek datasets, http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm.
[26] J. G. White, E. Southgate, I. N. Thomson, S. Brenner, Philos. Trans. R. Soc. B 314 (1).
[27] R. Ackland, Mapping the US political blogosphere: are conservative bloggers more prominent? (2005). URL http://incsub.org/blogtalk/images/robertackland.pdf
[28] R. E. Ulanowicz, C. Bondavalli, M. S. Egnotovich, Technical report, CBL 98-123 (1998).
[29] D. J. Watts, S. H. Strogatz, Collective dynamics of small-world networks, Nature 393 (1998) 440–442.
[30] M. E. J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002) 208701.