Progresses and Challenges in Link Prediction
Tao Zhou, CompleX Lab, University of Electronic Science and Technology of China, Chengdu 611731, People's Republic of China.
Abstract. Link prediction is a paradigmatic problem in network science, which aims at estimating the existence likelihoods of nonobserved links based on the known topology. After a brief introduction of the standard problem and metrics of link prediction, this Perspective will summarize representative progresses on local similarity indices, link predictability, network embedding, matrix completion, ensemble learning and others, mainly extracted from thousands of related publications in the last decade. Finally, this Perspective will outline some long-standing challenges for future studies. PACS: 89.75.Hc, 89.20.Ff, 89.65.-s.
Introduction. – Network is a natural and powerful tool to characterize a huge number of social, biological and information systems that consist of interacting elements, and network science is currently one of the most active interdisciplinary research domains [1][2]. Link prediction is a paradigmatic problem in network science that attempts to uncover missing links or predict future links [3]. It has already found many theoretical and practical applications, such as the reconstruction of networks [4][5][6], the evaluation of evolving models [7][8][9], the inference of biological interactions [10][11][12], the online recommendation of friends and products [13][14], and so on. Consider a simple network
$G(V, E)$, where $V$ and $E$ are the sets of nodes and links, the directionalities and weights of links are ignored, and multiple links and self-connections are not allowed. We assume that there are some missing links or future links in the set of nonobserved links $U \setminus E$, where $U$ is the universal set containing all $|V|(|V|-1)/2$ potential links. The task of link prediction is then to find out those missing or future links. To test an algorithm's accuracy, the set of observed links $E$ is divided into two parts: the training set $E^T$ is treated as known information, while the probe set $E^P$ is used for algorithm evaluation, and no information in $E^P$ is allowed to be used for prediction. The majority of known studies applied random division, namely $E^P$ is randomly drawn from $E$. In the case of predicting future links, temporal division is usually adopted, where $E^P$ contains the most recently appeared links [15]. In some real networks, missing links have different topological features from observed links. For example, missing links are more likely to be associated with low-degree nodes, since interactions between hubs are probably already well known. In such situations, we may apply biased division, where $E^P$ contains links likely to be similar to missing links [16].

Performance evaluation metrics can be roughly divided into two categories: fixed-threshold accuracies and areas under threshold curves [17][18]. Precision and Recall are the two most widely used metrics in the former category. Precision is defined as the ratio of relevant items selected to the number of items selected; that is, if we take the top-$L$ links as the predicted ones, among which $L_r$ links are correctly predicted, then Precision equals $L_r/L$. Recall is defined as the ratio of relevant items selected to the total number of relevant items, say $L_r/|E^P|$. An obvious drawback of fixed-threshold accuracies is that we generally do not have a reasonable way to determine the threshold, like the number of predicted links $L$ or the threshold score for the existence of links. A lazy, unreasonable but widely adopted way is setting $L = |E^P|$, at which Precision = Recall. However, $|E^P|$ is generally unknown, and even if it is known, this single value cannot reflect the performance of a predictor. Therefore, robust evaluation based on fixed-threshold accuracies should cover a range of thresholds (e.g., by varying $L$), which is actually close to the consideration of threshold curves. The Precision-Recall (PR) curve [17] and the receiver operating characteristic (ROC) curve [19] are frequently considered in the literature. The former shows Precision with respect to Recall at all thresholds, and the latter represents the performance trade-off between true positives and false positives at different thresholds. We say algorithm X is strictly better than algorithm Y only if X's threshold curve completely dominates Y's (the domination of the PR curve is equivalent to the domination of the ROC curve [20]). Such a requirement is too rigid, and thus we usually use areas under threshold curves as evaluation metrics. The area under the PR curve (AUPR) is less interpretable, while the area under the ROC curve (AUC) can be interpreted as the probability that a randomly chosen link in $E^P$ is assigned a higher existence likelihood than a randomly chosen link in $U \setminus E$. If all likelihoods are generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much better the algorithm performs than pure chance. Different metrics have specific advantages. For example, link prediction is a typical imbalanced learning task since most real networks are sparse, and AUC is non-parametric and very suitable for imbalanced learning; Precision is probably closer to real needs, because practical applications usually only account for the top-$L$ predictions rather than the whole ranking. In real applications, we should carefully choose suitable metrics according to the specific problems and requirements.
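As a concrete illustration of the metrics above, the following Python sketch computes Precision, Recall at top-$L$ and the sampling-based AUC; the function and variable names (e.g., `scores`, `probe_set`) are illustrative rather than taken from any library.

```python
import random

def precision_recall_at_L(scores, probe_set, L):
    """Precision = L_r / L and Recall = L_r / |E^P| for the top-L predicted links.
    scores: dict mapping each nonobserved link (u, v) to its likelihood score.
    probe_set: the held-out probe links E^P."""
    top_L = sorted(scores, key=scores.get, reverse=True)[:L]
    L_r = sum(1 for link in top_L if link in probe_set)
    return L_r / L, L_r / len(probe_set)

def auc(scores, probe_set, nonexistent_links, n_samples=10000):
    """AUC estimated by sampling: the probability that a random probe link
    gets a higher score than a random nonexistent link (ties count 0.5)."""
    probe, absent = list(probe_set), list(nonexistent_links)
    hits = 0.0
    for _ in range(n_samples):
        s1, s2 = scores[random.choice(probe)], scores[random.choice(absent)]
        hits += 1.0 if s1 > s2 else (0.5 if s1 == s2 else 0.0)
    return hits / n_samples
```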
Thanks to a few pioneering works [21][22][23][24], link prediction has become one of the most active research domains in network science. Early contributions were already summarized by a well-known survey article [3], and this Perspective will introduce the most representative achievements in the last decade (mostly published after [3]) and discuss limitations of existing studies as well as open challenges for future studies.

Local Similarity Indices. – A similarity-based algorithm assigns a similarity score to each nonobserved link, and a link with a higher score is of a larger likelihood to be a missing link. Liben-Nowell and Kleinberg [16] indicated that a very simple index named common neighbor (CN), say $s_{xy}^{CN} = |\Gamma_x \cap \Gamma_y|$, (1) with $\Gamma_x$ and $\Gamma_y$ being the sets of neighbors of nodes $x$ and $y$, performs very well in link prediction for social networks. Zhou et al. [18] proposed the resource allocation (RA) index via weakening the weights of large-degree common neighbors, namely $s_{xy}^{RA} = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{k_z}$, (2) where $k_z$ is the degree of node $z$. The simplicity, elegance and good performance of CN, RA and some other alternatives [21][23][25][26][27] lead to increasing attention on local similarity indices. In the recent decade, probably the most impressive achievement on local similarity indices is the proposal of the local community paradigm [28], which suggests that two nodes are more likely to link together if their common neighbors are densely connected. Accordingly, Cannistraci et al. [28] proposed the CAR index, where the CN index is multiplied by the number of observed links between common neighbors, as $s_{xy}^{CAR} = s_{xy}^{CN} \cdot \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{|\gamma_z|}{2}$, (3) where $\gamma_z$ is the subset of $z$'s neighbors that are also common neighbors of $x$ and $y$. Analogously, the RA index can be improved by accounting for the local community paradigm, as $s_{xy}^{CRA} = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{|\gamma_z|}{k_z}$. (4) By integrating the idea of the Hebbian learning rule [29], the above index is further extended and renamed as the Cannistraci-Hebb (CH) index [30], $s_{xy}^{CH} = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{k_z^{(i)}}{k_z^{(e)}}$, (5) where $k_z^{(i)}$ is the internal degree of $z$, say the number of $z$'s neighbors that are also in $\Gamma_x \cup \Gamma_y$, and $k_z^{(e)}$ is the external degree of $z$, say the number of $z$'s neighbors that are not in $\Gamma_x \cup \Gamma_y \cup \{x, y\}$. The core idea of the CH index is to consider the negative impacts of external local-community-links (see [31] for more CH indices according to this core idea). Extensive empirical analyses [31][32] indicated that the introduction of the local community paradigm and the Hebbian learning rule could considerably improve the performance of routine local similarity indices.
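A minimal Python sketch of Eqs. (1), (2) and (5) for a candidate node pair, using networkx; the `+1` offset in the CH denominator is our assumption to guard against common neighbors with no external links, and is only one of several variants discussed in [30][31].

```python
import networkx as nx

def cn_ra_ch(G, x, y):
    """CN, RA and a CH-style score for the nonobserved pair (x, y)
    of an undirected simple graph G (a sketch, not optimized)."""
    common = set(G[x]) & set(G[y])
    union = set(G[x]) | set(G[y])
    s_cn = len(common)
    s_ra = sum(1.0 / G.degree(z) for z in common)
    s_ch = 0.0
    for z in common:
        k_int = sum(1 for w in G[z] if w in union)                         # internal degree
        k_ext = sum(1 for w in G[z] if w not in union and w not in (x, y))  # external degree
        s_ch += k_int / (1.0 + k_ext)   # +1 is an assumed guard against k_ext = 0
    return s_cn, s_ra, s_ch
```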
In most known studies, the presence of many 2-hop paths between a pair of nodes is considered to be the strongest evidence indicating the existence of a corresponding missing link or future link. Although longer paths are taken into account in the local path index [33] and the Katz index [34], they are considered to be less significant than 2-hop paths. Surprisingly, some recent works have argued that 3-hop-based similarity indices perform better than 2-hop-based indices. Pech et al. [35] assumed that the existence likelihood of a link is a linear sum of the contributions of all its neighbors. After some algebra, they obtained a global similarity index called the linear optimization (LO) index, as $S^{LO} = \alpha A(\alpha A^T A + I)^{-1} A^T A = \alpha A^3 - \alpha^2 A^5 + \alpha^3 A^7 - \cdots$, (6) where $A$ and $I$ are the adjacency matrix and the identity matrix. Clearly, the number of 3-hop paths, $A^3$, can be interpreted as a degenerated index of LO. At the same time, Kovács et al. [36] independently proposed a degree-normalized index (called the L3 index) based on 3-hop paths, as $s_{xy}^{L3} = \sum_{u,v} \frac{a_{xu} a_{uv} a_{vy}}{\sqrt{k_u k_v}}$, (7) and showed its advantage over 2-hop-based indices in predicting protein–protein interactions. Muscoloni et al. [30] further proposed a theory that generalizes 2-hop-based indices to $n$-hop-based indices with $n > 2$, and demonstrated the superiority of 3-hop-based indices over 2-hop-based indices on protein–protein interaction networks, world trade networks and food webs. For example, in their framework [30], the $n$-hop-based RA index reads $s_{xy}^{RA(n)} = \sum_{z_1, z_2, \cdots, z_{n-1} \in p(n)} \frac{1}{k_{z_1} k_{z_2} \cdots k_{z_{n-1}}}$, (8) where $p(n)$ is the set of all $n$-hop simple paths connecting $x$ and $y$, and $z_1, z_2, \cdots, z_{n-1}$ are the intermediate nodes on the considered path.
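The L3 index of Eq. (7) reduces to three matrix products, since $S^{L3} = A(D^{-1/2} A D^{-1/2})A$ with $D$ the diagonal degree matrix. A dense numpy sketch, assuming $A$ is a binary symmetric array:

```python
import numpy as np

def l3_scores(A):
    """L3 scores (Eq. (7)) for all pairs: s_xy = sum_{u,v} a_xu a_uv a_vy / sqrt(k_u k_v)."""
    k = A.sum(axis=1)
    inv_sqrt = np.where(k > 0, 1.0 / np.sqrt(k), 0.0)
    B = A * inv_sqrt[:, None] * inv_sqrt[None, :]   # B_uv = a_uv / sqrt(k_u k_v)
    return A @ B @ A   # entry (x, y) sums the degree-normalized 3-hop paths
```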
Link Predictability. – Quantifying the link predictability of a network allows us to evaluate link prediction algorithms for this network, and to see whether there is still a large space to improve the current prediction accuracy. Lü et al. [37] raised the hypothesis that missing links are difficult to predict if their addition causes huge structural changes, so that a network is highly predictable if the removal or addition of a set of randomly selected links does not significantly change its structural features. Denote by $A$ the adjacency matrix of a simple network $G(V, E)$, and by $\Delta A$ the adjacency matrix corresponding to a set of randomly selected links $\Delta E \subset E$. After the removal of $\Delta E$, the remaining network $G^R$ is also a simple network, so that the corresponding adjacency matrix, $A^R = A - \Delta A$, can be diagonalized as $A^R = \sum_{k=1}^{N} \lambda_k x_k x_k^T$, (9) where $N = |V|$, and $\lambda_k$ and $x_k$ are the $k$th eigenvalue and the corresponding orthogonal and normalized eigenvector of $A^R$. Considering $\Delta E$ as a perturbation to $A^R$, which results in an updated eigenvalue $\lambda_k + \Delta\lambda_k$ and a corresponding eigenvector $x_k + \Delta x_k$, we have $(A^R + \Delta A)(x_k + \Delta x_k) = (\lambda_k + \Delta\lambda_k)(x_k + \Delta x_k)$. (10) Similar to the process of getting the expectation value of the first-order perturbation Hamiltonian, we neglect the second-order small terms and the changes of the eigenvectors, and then obtain $\Delta\lambda_k \approx \frac{x_k^T \Delta A x_k}{x_k^T x_k}$, (11) as well as the perturbed matrix $\tilde{A} = \sum_{k=1}^{N} (\lambda_k + \Delta\lambda_k) x_k x_k^T$, (12) which can be considered as a linear approximation of $A$ if the expansion is only based on $A^R$. If the perturbation does not significantly change the structural features, the eigenvectors of $A^R$ and those of $A$ should be almost the same, and thus $\tilde{A}$ should be very close to $A$ according to Eq. (12). We rank all links in $U \setminus E^R$ in descending order according to their values in $\tilde{A}$ and select the top-$L$ links to form the set $E^L$, where $L = |\Delta E|$. Links in $E^R$ and $E^L$ constitute the perturbed network, and if this network is close to $G$ (because $\tilde{A}$ is close to $A$), $E^L$ should be close to $\Delta E$. Therefore, Lü et al. [37] finally proposed an index called structural consistency to measure the inherent difficulty of link prediction, as $\sigma_c = \frac{|E^L \cap \Delta E|}{|\Delta E|}$. (13) The above perturbation method can also be applied to predict missing links, and the resulting structural perturbation method (SPM) is still one of the most accurate methods so far. Very recently, a similar but enhanced method was used to analyze the link predictability of bipartite networks [38].
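A compact numpy sketch of the structural consistency index defined by Eqs. (9)-(13); `frac` is the fraction of links placed in $\Delta E$, and 0.1 is an arbitrary illustrative choice. The sketch assumes a non-degenerate spectrum for simplicity.

```python
import numpy as np

def structural_consistency(A, frac=0.1, seed=0):
    """Sigma_c of Eqs. (9)-(13): remove a random link set Delta E, perturb the
    eigenvalues of the remaining matrix A^R by Eq. (11), and measure how many
    removed links reappear among the top-L entries of the perturbed matrix."""
    rng = np.random.default_rng(seed)
    N = len(A)
    links = np.transpose(np.triu(A, 1).nonzero())        # each row (i, j), i < j
    pick = rng.choice(len(links), size=int(frac * len(links)), replace=False)
    dA = np.zeros((N, N))
    for i, j in links[pick]:
        dA[i, j] = dA[j, i] = 1.0
    AR = A - dA
    lam, X = np.linalg.eigh(AR)                 # eigenvectors are orthonormal
    dlam = np.einsum('ik,ij,jk->k', X, dA, X)   # Eq. (11) with x_k^T x_k = 1
    A_tilde = (X * (lam + dlam)) @ X.T          # Eq. (12)
    # rank node pairs that are unobserved in the remaining network A^R
    candidates = [(A_tilde[i, j], i, j) for i in range(N)
                  for j in range(i + 1, N) if AR[i, j] == 0]
    candidates.sort(reverse=True)
    L = len(pick)
    top = {(i, j) for _, i, j in candidates[:L]}
    removed = {(i, j) for i, j in links[pick]}
    return len(top & removed) / L
```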
Koutra et al. [39] found that the major part of a seemingly complicated real network can be represented by a few elemental substructures like cliques, stars, chains, bipartite cores, and so on. Inspired by this study, Xian et al. [40] claimed that a network is more regular, and thus more predictable, if it can be well represented by a small number of subnetworks. To reduce the tremendous complexity caused by countless candidate subnetworks, they further set a strong restriction that the candidate subnetworks are the ego networks of all nodes. Obviously, the ego network of node $i$ can be represented by the $i$th row or $i$th column of the adjacency matrix $A$, and if a network can be perfectly represented by all ego networks, there exists a matrix $Z \in \mathbb{R}^{N \times N}$ such that $A = AZ$. Intuitively, if a network $G$ is very regular, the corresponding representation $Z$ should have three properties: (i) $G$ can be well represented by its ego networks, so that $AZ$ is close to $A$; (ii) $G$ can be well represented by a small number of ego networks, so that $Z$ is of low rank since the redundant ego networks correspond to zero rows in $Z$; (iii) each ego network of $G$ can be represented by very few other ego networks, so that $Z$ is sparse. Accordingly, the best representation matrix $Z^*$ can be obtained by solving the following optimization problem $\min_Z \; \mathrm{rank}(Z) + \alpha\|Z\|_0 + \beta\|A - AZ\|_F$, (14) where $\alpha$ and $\beta$ are tradeoff parameters. Based on $Z^*$, Xian et al. [40] proposed an ad hoc index named the structural regularity index, as $\sigma_r = p_{ego} \cdot d_{ze}$, (15) where $r$ is the rank of $Z^*$, $q$ is the number of zero entries in $Z^*$, $p_{ego}$ denotes the proportion of identical ego networks, and $d_{ze}$ characterizes the density of zero entries of the reduced echelon form of $Z^*$. Clearly, a lower $r$ and a larger $q$ will result in a smaller $\sigma_r$, corresponding to a more predictable network. Xian et al. [40] suggested that structural regularity corresponds to redundant information in the adjacency matrix, which can be characterized by a low-rank and sparse representation matrix. Sun et al. [41] proposed a more direct method to measure such redundancy. Their train of thought is that a more predictable network contains more structural redundancy, and thus can be compressed into a shorter binary string. As the shortest possible compression length can be calculated by a lossless compression algorithm [42], they used the obtained normalized shortest compression length of a network to quantify its structural predictability. To validate their methods, Lü et al. [37] and Xian et al. [40] tested on many real networks whether $\sigma_c$ and $\sigma_r$ are strongly correlated with the prediction accuracies of a few well-known algorithms. This is rough because no single algorithm can stand for the theoretically best algorithm. Sun et al. [41] adopted an improved method that uses the best performance among a number of known algorithms for each tested network to approximate the performance of the theoretically best predicting algorithm. Garcia-Perez et al. [43] analyzed an ensemble of simple networks, where each network is constructed by generating a link between any node pair $i$ and $j$ with a known linking probability $p_{ij}$. For such a theoretical benchmark, the best-possible algorithm is to rank the unobserved links with the largest linking probabilities in the top positions, and the theoretical limit of Precision can be easily obtained. They showed that if the size of $\Delta A$ is too small compared with $A$, the evaluation of predictability by $\sigma_c$ is less accurate. Structural consistency, structural regularity and compression length are all ad hoc methods: they can be used to probe the intrinsic difficulty of link prediction, but cannot mathematically formulate the limit of prediction. Mathematically speaking, there could be a God algorithm that correctly predicts all missing links, except for those indistinguishable from nonexistent links. A link $(i, j)$ is indistinguishable from another link $(u, v)$ if and only if there is a certain automorphism $f$ such that $f(i) = u$ and $f(j) = v$, or $f(i) = v$ and $f(j) = u$. This extremely rigid definition from automorphism-based symmetry makes virtually all real networks have predictability very close to 1, which is indeed meaningless in practice.
Using synthetic networks with known prediction limits is a potentially promising way to evaluate predictors as well as indices for predictability [43][44], but the results may be irrelevant to real-world networks. All the above studies target static networks, while a considerable fraction of real networks are time-varying (named temporal networks) [45]. Temporal networks are usually more predictable, since one can utilize both topological and temporal patterns. Ignoring topological correlations, the randomness, and thus the predictability, of a time series can be quantified by its entropy rate [46]. Tang et al. [47] listed the weights of all possible links as an expanded vector of dimension $N^2$ (self-connections are allowed, and directionalities are considered), so that the evolution of a temporal network can be fully described by a matrix $M \in \mathbb{R}^{N^2 \times T}$, where $T$ is the number of snapshots under consideration. In terms of $M$, the evolution of a temporal network can be treated as a stochastic vector process, and how to measure the predictability of temporal networks is transformed into a solved problem based on the generalized Lempel-Ziv algorithm [48]. An obvious defect is that the vector dimension is too large, resulting in huge computational complexity. Tang et al. [47] thus replaced $M$ by a much smaller matrix where only links occurring in at least 10% of the snapshots are taken into consideration. An intrinsic weakness of the Lempel-Ziv algorithm is that it tends to overestimate the predictability, and thus in many situations the estimated values are very close to 1 [46][49]. Tang et al. [47] proposed a clever method that compares the predictability of the target network with that of a corresponding null network, so that the normalized predictability is able to characterize the topological-temporal regularity beyond the least predictable (null) counterpart.
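For intuition, the entropy-rate estimator underlying these analyses can be sketched as follows. This is one common variant of the Lempel-Ziv estimator (matches restricted to substrings that end before the current position, entropy in nats), applied to a single symbol sequence such as the on/off activity of one link across snapshots; the estimate can then be converted into an upper bound on predictability via Fano's inequality, as in the predictability literature [46].

```python
import math

def lz_entropy_rate(seq):
    """Lempel-Ziv-type estimate of the entropy rate of a sequence:
    S ~ n * ln(n) / sum_i Lambda_i, where Lambda_i is the length of the
    shortest substring starting at i that never appears within seq[:i].
    A simple O(n^2) scan, adequate for short sequences."""
    n = len(seq)
    total = 0
    for i in range(n):
        l = 1
        while i + l <= n and any(seq[j:j + l] == seq[i:i + l]
                                 for j in range(max(0, i - l + 1))):
            l += 1
        total += l
    return n * math.log(n) / total
```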
Network Embedding. – A network embedding algorithm produces a function $f: V \to \mathbb{R}^d$ with $d \ll N$, so that every node is represented by a low-dimensional vector [50][51][52][53]. Then, the existence likelihood of a nonobserved link $(u, v)$ can be estimated by the inner product or the cosine similarity of the two learned vectors $f(u)$ and $f(v)$. Early methods cannot handle large-scale networks because they usually rely on solving for the leading eigenvectors [54][55][56]. Mikolov et al. [57][58] proposed a language embedding algorithm named SkipGram that represents every word in a given vocabulary by a low-dimensional vector. Such a representation can be obtained by maximizing the co-occurrence probability among words appearing within a window $t$ in a sentence, via some stochastic gradient descent method. Based on SkipGram, Perozzi et al. [59] proposed the so-called DeepWalk algorithm, where nodes and truncated random walks are treated as words and sentences. Grover and Leskovec [60] proposed the node2vec algorithm, which learns the low-dimensional representation by maximizing the likelihood of preserving network neighborhoods of nodes. Grover and Leskovec argued that the choice of neighborhoods plays a critical role in determining the quality of the representation. Therefore, instead of simple definitions of the neighborhood of an arbitrary node $u$, such as nodes with distance no more than a threshold to $u$ (like breadth-first search) or nodes sampled from a random walk starting from $u$ (like depth-first search), they utilized a flexible neighborhood sampling strategy by biased random walks, which smoothly interpolates between breadth-first search and depth-first search. Considering a random walk that has just traversed the link $(z, v)$ and now resides at node $v$, the transition probability from $v$ to any immediate neighbor $x$ of $v$ is $P_{vx} = \pi_{vx} / \sum_{y \in \Gamma_v} \pi_{vy}$. The node2vec algorithm sets the unnormalized probability as $\pi_{vx} = \alpha_{pq}(z, x) \cdot w_{vx}$, (16) where $w_{vx}$ is the weight of the link $(v, x)$, and $\alpha_{pq}(z, x) = \begin{cases} 1/p, & d_{zx} = 0 \\ 1, & d_{zx} = 1 \\ 1/q, & d_{zx} = 2 \end{cases}$, (17) with $d_{zx}$ the distance between $z$ and $x$. Obviously, the sampling strategy in DeepWalk is a special case of node2vec with $p = 1$ and $q = 1$. By tuning $p$ and $q$, node2vec can achieve better performance than DeepWalk in link prediction.
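A sketch of the biased sampling of Eqs. (16)-(17); `node2vec_step` is an illustrative name, and a full implementation would feed the generated walks into a SkipGram model (e.g., gensim's Word2Vec) to learn the node vectors.

```python
import random

def node2vec_step(G, z, v, p=1.0, q=1.0):
    """One step of the node2vec second-order walk: the walk arrived at v
    from z; sample the next node x with unnormalized probability
    alpha_pq(z, x) * w_vx (Eqs. (16)-(17)). G is a networkx graph."""
    neighbors = list(G[v])
    weights = []
    for x in neighbors:
        w = G[v][x].get('weight', 1.0)
        if x == z:                 # d_zx = 0: step back to the previous node
            weights.append(w / p)
        elif G.has_edge(z, x):     # d_zx = 1: x is also a neighbor of z
            weights.append(w)
        else:                      # d_zx = 2
            weights.append(w / q)
    return random.choices(neighbors, weights=weights, k=1)[0]
```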
Tang et al. [61] argued that DeepWalk lacks a clear objective function tailored for network embedding, and proposed the LINE algorithm, which learns node representations on the basis of a carefully designed objective function that preserves both the first-order and the second-order proximity. The first-order proximity is captured by the observed links, and thus can be formulated as $O_1 = -\sum_{(u,v) \in E} w_{uv} \log p_1(u, v)$, (18) where $w_{uv}$ is the weight of the observed link $(u, v)$ and $p_1(u, v) = \frac{1}{1 + \exp[-f(u) \cdot f(v)]}$ (19) describes the likelihood of the existence of $(u, v)$ given the embedding $f$. Of course, one can adopt other alternatives to Eq. (19). The second-order proximity assumes that nodes sharing many connections to other nodes are similar to each other. Accordingly, each node is also treated as a specific context, and nodes with similar distributions over contexts are assumed to be similar. Then, the second-order proximity can be characterized by the objective function $O_2 = -\sum_{(u,v) \in E} w_{uv} \log p_2(u|v)$, (20) where $p_2(u|v)$ denotes the probability that node $v$ will generate the context $u$, namely $p_2(u|v) = \frac{\exp[f'(u) \cdot f(v)]}{\sum_{z \in V} \exp[f'(z) \cdot f(v)]}$, (21) with $f'(u)$ being the context representation of $u$. Clearly, $O_2$ is naturally suitable for directed networks. By minimizing $O_1$ and $O_2$, LINE learns two kinds of node representations that respectively preserve the first-order and the second-order proximity, and takes their concatenation as the final representation. In addition to DeepWalk, LINE and node2vec, other well-known network embedding algorithms that have been applied to link prediction include DNGR [62], SDNE [63], HOPE [64], GraphGAN [65], and so on. On the one hand, embedding is currently a very hot topic in network science and is thought to be a promising method for link prediction. On the other hand, some very recent empirical studies [31][66][67] involving more than a thousand networks showed negative evidence, namely that the network embedding algorithms perform worse than some elaborately designed mechanistic algorithms. This is not bad news, because one can expect that some link prediction algorithms will enlighten and energize researchers in network embedding, and thus make contributions to other aspects of network analysis like community detection, classification and visualization. Another notable embedding method is based on the hyperbolic network model [68][69], where each node is represented by only two coordinates (i.e., $d = 2$) in a hyperbolic disk. The hyperbolic network model is very simple yet can reproduce many topological characteristics of real networks, such as sparsity, scale-free degree distribution, clustering, the small-world property, community structure, self-similarity, and so on. Kitsak et al. [70] proposed an embedding algorithm according to the hyperbolic network model, where the smaller the latent distance between two nodes, the higher the probability of a link between them. Despite the extremely small embedding dimension, their algorithm is often competitive with other well-known link predictors, and it is in particular good at predicting missing links that are really hard to predict.

Matrix Completion. – Matrix completion aims to reconstruct a target matrix, given a subset of known entries. Since links can be fully conveyed by the adjacency matrix $A$, it is natural to regard link prediction as a matrix completion task. Denote by $E^K$ the set of node pairs corresponding to known entries in $A$ that can be utilized in the matrix completion task. In most studies $E^K = E^T$, while we should be aware that $E^K$ can also contain some known absent links. The matrix completion problem can be formulated in line with supervised learning, as $\min_\theta \frac{1}{|E^K|} \sum_{(i,j) \in E^K} \ell[a_{ij}, \hat{a}_{ij}(\theta)] + \Omega(\theta)$, (22) where $\theta$ is the parameter vector, $\hat{a}_{ij}$ is the predicted value of the model, $\ell(\cdot,\cdot)$ is a loss function, and $\Omega$ is a regularization term preventing overfitting. The most frequently used loss functions are the squared loss $\ell(a, b) = (a - b)^2$ and the logistic loss $\ell(a, b) = \log(1 + e^{-ab})$. Matrix factorization is a very popular method for matrix completion, which has already achieved great success in a closely related domain, the design of recommender systems [71]. We consider a simple network with symmetric $A$, and assume that $\tilde{A}$ can be approximately factorized as $\tilde{A} \approx UU^T$ with $U \in \mathbb{R}^{N \times d}$ and $d \ll N$; then we need to solve the following optimization problem
$\min_U \frac{1}{|E^K|} \sum_{(i,j) \in E^K} \ell(a_{ij}, u_i^T u_j) + \Omega(U)$, (23) where $u_i$ and $u_j$ are the $i$th and $j$th rows of $U$. Notice that $u_i^T$ is the transpose of $u_i$, not the $i$th row of $U^T$. Though without a topological interpretation, $u_i$ can be treated as a low-dimensional representation of node $i$, and matrix factorization can also be considered a kind of matrix embedding algorithm [72]. If we adopt the squared loss function and the Frobenius norm for $\Omega$, then we get the specific optimization problem $\min_U \frac{1}{|E^K|} \sum_{(i,j) \in E^K} (a_{ij} - u_i^T u_j)^2 + \lambda\|U\|_F^2$, (24) where $\lambda$ is a tradeoff parameter. As mentioned above, AUC is suitable for imbalanced learning, and link prediction is a typical imbalanced classification problem. Menon and Elkan [73] suggested that one can directly optimize AUC on the training set, given some known absent links. Accordingly, the objective function Eq. (23) can be rewritten in terms of AUC as $\min_U \frac{1}{|E^{K+}||E^{K-}|} \sum_{(i,j) \in E^{K+}, (x,y) \in E^{K-}} \ell(1, u_i^T u_j - u_x^T u_y) + \Omega(U)$, (25) where $E^{K+}$ and $E^{K-}$ are the sets of known present and known absent links, respectively. Clearly, $E^{K+} \cap E^{K-} = \emptyset$ and $E^{K+} \cup E^{K-} = E^K$. Menon and Elkan [73] showed that the usage of the AUC-based loss function can improve the AUC value by around 10% compared with a routine loss function like Eq. (24). The factorization in Eq. (24) is easily extended to directed networks [73], bipartite networks [74], temporal networks [75], and so on. For example, if $A$ is asymmetric, we can replace $\tilde{A} \approx UU^T$ by $\tilde{A} \approx U\Lambda U^T$ with $\Lambda \in \mathbb{R}^{d \times d}$, and thus Eq. (24) can be extended to
$\min_{U,\Lambda} \frac{1}{|E^K|} \sum_{(i,j) \in E^K} (a_{ij} - u_i^T \Lambda u_j)^2 + \lambda_U\|U\|_F^2 + \lambda_\Lambda\|\Lambda\|_F^2$. (26) Analogously, for bipartite networks like gene-disease associations [74] and user-product purchases [76], if $A \in \mathbb{R}^{N \times M}$, then we can replace $\tilde{A} \approx UU^T$ by $\tilde{A} \approx WH^T$, where $W \in \mathbb{R}^{N \times d}$ and $H \in \mathbb{R}^{M \times d}$. Accordingly, we get the optimization problem for bipartite networks as
$\min_{W,H} \frac{1}{|E^K|} \sum_{(i,j) \in E^K} (a_{ij} - w_i^T h_j)^2 + \lambda_W\|W\|_F^2 + \lambda_H\|H\|_F^2$, (27) where $w_i$ and $h_j$ are the $i$th and $j$th rows of $W$ and $H$, respectively. More details can be found in Refs. [73][74][75]. The explicit features of nodes, such as tags associated with users and products [77], can also be incorporated into the matrix factorization framework. Menon and Elkan [73] suggested a direct combination of explicit features and latent features learned from the observed topology. Denoting by $x_i \in \mathbb{R}^s$ the vector of explicit features of node $i$, the predicted values $\hat{a}_{ij}$ in Eq. (22) are then replaced by $\hat{a}_{ij} = u_i^T u_j + v^T x_i + v^T x_j$, (28) where $v \in \mathbb{R}^s$ is a vector of parameters. Experiments showed that this incorporation can considerably improve the prediction accuracy [73]. Jain and Dhillon [78] proposed a so-called inductive matrix completion (IMC) algorithm that uses explicit features to reduce the computational complexity. In IMC, the predicted value can be expressed as $\hat{a}_{ij} = x_i^T QQ^T x_j$, where $x_i \in \mathbb{R}^s$ is the vector of $i$'s explicit features, and $Q \in \mathbb{R}^{s \times t}$ is a low-rank matrix with small $t$, which describes the latent relationships between explicit features and topological structure. $Q$ can be learned from the observed links via the following optimization problem $\min_Q \frac{1}{|E^K|} \sum_{(i,j) \in E^K} (a_{ij} - x_i^T QQ^T x_j)^2 + \lambda\|Q\|_F^2$. (29) Notice that in Eq. (24) the number of parameters to be learned is $Nd$, while in Eq. (29) we only need to learn $st$ parameters. The numbers of latent features ($d$ and $t$) could be more or less the same, as they are largely dependent on the topological structure, while the number of nodes $N$ is usually much larger than the number of explicit features $s$. Therefore, the computational complexity of IMC should be much lower than that of the routine factorization method. The original IMC was proposed for bipartite networks [78], and it has already found successful applications in the design of recommender systems [78] and the prediction of gene- and RNA-disease associations [74][79][80]. Pech et al. [81] argued that low rank is the most critical property in matrix completion. They assumed that the observed network can be decomposed into two parts, $A = A^B + A^E$, where $A^B$ is called the backbone, preserving the network organization pattern, and $A^E$ is a noise matrix, in which positive and negative entries are spurious and missing links, respectively. Pech et al. [81] considered only two simple properties: the low rank of $A^B$ and the sparsity of $A^E$. Accordingly, $A^B$ and $A^E$ can be determined by solving the following optimization problem $\min_{A^B, A^E} \mathrm{rank}(A^B) + \lambda\|A^E\|_0$ subject to $A = A^B + A^E$, (30) where $\|A^E\|_0$ is the $\ell_0$-norm counting the number of nonzero entries of $A^E$. The predicted links can be obtained by sorting the entries of $A^B$ that correspond to zero entries in $A$. This method is straightforwardly named the low rank (LR) algorithm. Although simple, LR performs better than well-known similarity-based algorithms [23][28], the hierarchical structure model [22] and the stochastic block model [24], while slightly worse than LOOP [82] and the structural perturbation method [37].
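As a simplified illustration of the low-rank idea, the sketch below replaces the rank/$\ell_0$ program of Eq. (30), which in [81] is solved after convex relaxation, with a plain truncated SVD of the observed adjacency matrix; `rank=10` is an arbitrary illustrative choice.

```python
import numpy as np

def low_rank_scores(A, rank=10):
    """Score unobserved links by a rank-r reconstruction of A: keep the top
    singular components as a stand-in for the backbone A^B, then rank the
    entries corresponding to zero entries of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_B = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]  # best rank-r approximation of A
    return np.where(A == 0, A_B, -np.inf)          # only unobserved pairs are ranked
```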
Ensemble Learning. – In an early survey [3], we noticed the low stability of individual link predictors and thus suggested ensemble learning as a powerful tool to integrate them. Ensemble learning is a popular method in machine learning, which constructs and integrates a number of individual predictors to achieve better algorithmic performance [83]. Roughly speaking, ensemble learning techniques can be divided into two classes: the parallel ensemble, where individual predictors do not strongly depend on each other and can be implemented simultaneously (e.g., bagging [84] and random forests [85]), and the sequential ensemble, where the integration of individual predictors has to be implemented in a sequential way (e.g., boosting [86] and stacking [87]). In the following, we will respectively introduce how parallel and sequential ensembles are applied in link prediction. Given an observed network, an individual link predictor will produce a ranking of all unobserved links. Pujari and Kanawati [88] proposed an aggregation approach on the rankings resulting from individual algorithms. If there are $R$ rankings produced by $R$ individual predictors, an unobserved link $e$'s Borda count $B_k(e)$ in the $k$th ranking can be defined as the number of links ranked ahead of $e$ (there are many variants of the Borda count, and here we use the simplest one, see [89] for example). Pujari and Kanawati used a weighted aggregation to obtain the final score of any unobserved link $e$, as $B(e) = \sum_{k=1}^{R} w_k B_k(e)$, (31) where $w_k$ is set to be proportional to the precision of the $k$th predictor trained on the observed network. Clearly, a smaller $B(e)$ indicates a higher existence likelihood. In addition to rank aggregation, a similar weighting technique can also be applied to integrating likelihood scores. If every unobserved link is assigned a score (a higher score indicates a higher existence likelihood) by each predictor, then the final score of any unobserved link $e$ can be defined in a weighted form as $s(e) = \sum_{k=1}^{R} w_k s_k(e)$, (32) where $s_k(e)$ is the score from the $k$th predictor. Different from rank aggregation, $s_k(\cdot)$ ($k = 1, 2, \cdots, R$) should be normalized before the weighted sum to ensure that scores from different predictors are comparable. An alternative aggregation method is ordered weighted averaging (OWA) [90], where the $R$ predictors are ordered according to their importance to the final prediction, as $s^{(1)}, s^{(2)}, \cdots, s^{(R)}$, and then the final score of any unobserved link $e$ is $s(e) = \sum_{k=1}^{R} w_k s^{(k)}(e)$, (33) where $\sum_{k=1}^{R} w_k = 1$ and $w_1 \geq w_2 \geq \cdots \geq w_R \geq 0$. Without prior information, the most usual way to determine the weights is the maximum entropy method, which maximizes $-\sum_{k=1}^{R} w_k \ln w_k$ subject to $\sum_{k=1}^{R} w_k = 1$ and $\eta = \frac{1}{R-1}\sum_{k=1}^{R} (R-k)w_k$, where $\eta$ is a tunable parameter measuring the extent to which the ensemble (33) is like an OR operation. If $\eta$ is very large, then $w_1 \approx 1$ and $w_k \approx 0$ ($k \geq 2$), that is to say, only the first predictor works. He et al. [91] applied OWA to aggregate nine local similarity indices, with the indices ordered according to their normalized values, which is essentially unreasonable. Therefore, although He et al. [91] reported considerable improvement, later experiments [92][93] indicated that the method in [91] does not work well, because the position of a predictor is irrelevant to its quality. In contrast, if the order is relevant to the predictors' qualities (e.g., according to their precisions trained on the target network), OWA will bring remarkable improvement compared with individual predictors [92].
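A small sketch of the weighted Borda aggregation of Eq. (31); the weights would typically be the training-set precisions of the individual predictors, and all names here are illustrative.

```python
def borda_aggregate(ranks, weights):
    """Weighted Borda aggregation (Eq. (31)). ranks: R lists, each ordering
    all unobserved links from most to least likely; weights: one weight per
    predictor. A smaller aggregate count means a higher predicted likelihood."""
    B = {}
    for rank, w in zip(ranks, weights):
        for position, link in enumerate(rank):
            # the Borda count of a link is the number of links ranked ahead of it
            B[link] = B.get(link, 0.0) + w * position
    return sorted(B, key=B.get)   # most promising candidates first
```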
As some link prediction algorithms scale worse than $O(N)$, Duan et al. [94] argued that solving smaller problems multiple times is more efficient than solving a single large problem. They considered a latent factor model (similar to the one described by Eq. (23), with complexity $O(Nd)$), and developed several ways of bagging decomposition, such as bagging with random nodes together with their immediate neighbors, and bagging preferring dense components. They showed that those bagging techniques can largely reduce the computational complexity without sacrificing prediction accuracy. Considering the family of stochastic block models [24], Valles-Catala et al. [95] showed that the integration (via MCMC sampling according to Bayesian rules) of individually less plausible models can result in higher predictive performance than the single most plausible model. Boosting is a typical sequential ensemble algorithm that trains a base learner from the initial training set and adjusts the weights of instances in the training set (the wrongly predicted instances will be enhanced, while the easy-to-predict instances will lose weight) to train the next learner. Such operations continue until some preset conditions are reached. The most representative boosting algorithm is AdaBoost [86], which was originally designed for binary classification and thus can be directly applied to link prediction. AdaBoost is an additive model, $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$, (34) where $h_t$ is the $t$th base learner, $\alpha_t$ is a scalar coefficient, and $T$ is a preset terminal time. $H$ aims to minimize the expected value of an exponential loss function $\ell(H, W) = \mathbb{E}_{x \sim W}[e^{-c(x)H(x)}]$, (35) where $c(x)$ denotes the true class of the instance $x$ and $W$ is the original weight distribution. Without specific requirements, we usually set $W(x) = 1/m$ for every instance $x$, where $m$ is the number of instances. We set $W_1 = W$ and learn $h_1$ from $W_1$, and then $\alpha_1$ is determined by minimizing $\ell(\alpha_1 h_1, W_1)$. The weight of an instance $x$ in the second step is updated as $W_2(x) = \begin{cases} \frac{W_1(x)}{Z_1} e^{-\alpha_1}, & h_1(x) = c(x) \\ \frac{W_1(x)}{Z_1} e^{\alpha_1}, & h_1(x) \neq c(x) \end{cases}$, (36) where $Z_1$ is the normalization factor. Obviously, if the instance $x$ is correctly classified, its weight will decrease; otherwise its weight will increase. This process iterates until the terminal time $T$ is reached. When applying AdaBoost to link prediction, we need to be aware of the following three issues. (i) The base learner should be sensitive to $W_t$, so that we cannot use unsupervised algorithms or supervised algorithms insensitive to $W_t$. (ii) In addition to positive instances (observed links), negative instances should be sampled from unobserved links. Though this introduces some noise, the influence is negligible if the network is sparse. (iii) The negative instances should be undersampled to keep the data balanced. Comar et al. [96] proposed the so-called LinkBoost algorithm, which is an extension of AdaBoost to link prediction with a typical matrix factorization model as the base learner. Instead of undersampling negative instances, they suggested a cost-sensitive loss function that penalizes misclassifying links as nonlinks about $N$ times more heavily than misclassifying nonlinks as links. They further considered a degree-sensitive loss function that penalizes the misclassification of links between low-degree nodes more than that between high-degree nodes.
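One boosting round, following Eqs. (34)-(36), can be sketched as below; `learner` is a placeholder for any weight-sensitive base learner (issue (i) above), and the closed-form $\alpha = \frac{1}{2}\ln[(1-\epsilon)/\epsilon]$ is the standard minimizer of the exponential loss.

```python
import math

def adaboost_round(instances, labels, weights, learner):
    """One AdaBoost round for link prediction: instances are node pairs,
    labels are +1 (observed link) or -1 (sampled nonlink), weights is the
    current distribution W_t. Returns alpha_t and the updated distribution."""
    h = learner(instances, labels, weights)       # fit the base learner under W_t
    eps = sum(w for x, c, w in zip(instances, labels, weights) if h(x) != c)
    eps = min(max(eps, 1e-10), 1.0 - 1e-10)       # guard the logarithm
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    new_w = [w * math.exp(-alpha if h(x) == c else alpha)
             for x, c, w in zip(instances, labels, weights)]
    Z = sum(new_w)                                # normalization factor of Eq. (36)
    return alpha, [w / Z for w in new_w]
```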
Stacking [87] is another powerful approach in sequential ensembles. It trains a group of primary learners from the initial training set and uses the outputs of the primary learners as input features to train a secondary learner that provides the final prediction. If both the primary learners (i.e., the input features) and the training instances are directly generated from the same training set, the risk of overfitting will be very high. Therefore, the original training set $D$, usually containing similar numbers of positive and negative instances for data balance, is divided into $J$ subsets of the same size, $D_1, D_2, \cdots, D_J$. Denoting by $h_r^{(j)}$ the primary learner using the $r$th algorithm and trained on the $j$th fold of the training set $\bar{D}_j = D \setminus D_j$, for each instance $x_i \in D_j$, its feature vector is $z_i = (z_{i1}, z_{i2}, \cdots, z_{iR})$, where $z_{ir} = h_r^{(j)}(x_i)$ and $R$ is the number of primary algorithms. This $J$-fold division ensures that all features of any instance $x$ are obtained by primary learners trained without $x$. Some scientists have already used similar techniques (e.g., using various regressions to integrate results from primary predictors and other features [93][97][98]), but they were not aware of the stacking model and did not employ any measures to avoid overfitting. Li et al. [99] proposed a stacking model for link prediction, which uses logistic regression and XGBoost [100] to combine 4 similarity indices. Their method is inspiring, but they only considered 4 primary predictors and tested on two very small networks, with some experimental results (e.g., the AUC values of the CN index) far different from well-known results, and thus the reported results and conclusions are still questionable. Ghasemian et al. [67] proposed a stacking model that considers 203 primary link predictors on 550 disparate networks. Using a standard supervised random forest algorithm [85] as the secondary learner, the stacking model is remarkably superior to individual predictors for real networks, and can approach the theoretical optima for synthetic networks with known highest prediction accuracies. Their work provides clear and solid evidence of the power of the ensemble approach in link prediction. In addition, they showed that social networks are more predictable than biological and technological networks. Wu et al. [92] proposed an alternative sequential ensemble strategy called network reconstruction, which reconstructs the network via one link prediction algorithm and then predicts missing links by another prediction algorithm.
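The $J$-fold feature construction at the heart of stacking can be sketched as follows; the factory interface `make_learner(train_idx)` and the random-forest secondary learner (as in [67]) are illustrative choices, not the exact pipelines of Refs. [67][99].

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

def stacked_link_predictor(instances, labels, primary_learners, n_folds=5):
    """Out-of-fold stacking: each primary learner is retrained on D \\ D_j and
    scores only the held-out fold D_j, so no instance is ever scored by a
    learner that saw it in training; a random forest is the secondary learner."""
    n, R = len(instances), len(primary_learners)
    Z = np.zeros((n, R))                          # stacked feature matrix
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(instances):
        for r, make_learner in enumerate(primary_learners):
            score = make_learner(train_idx)       # primary learner trained on D \ D_j
            for i in test_idx:
                Z[i, r] = score(instances[i])     # feature z_ir = h_r^(j)(x_i)
    secondary = RandomForestClassifier(n_estimators=200).fit(Z, labels)
    return secondary, Z
```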
Discussion. – In this Perspective, to improve readability, we have classified representative works of the last decade into five groups. Of course, some novel and interesting methods, such as the evolutionary algorithm [101], the ant colony approach [102] and structural Hamiltonian analysis [82], do not belong to any of the above groups, and readers are encouraged to read other recent surveys [103][104][105] as complements to the present Perspective. Very recently, a notable issue is the application of neural networks in link prediction, which may be partially facilitated by the dramatic advances of deep learning techniques. Zhang and Chen [106] trained a fully-connected neural network on the adjacency matrices of enclosing subgraphs (with a fixed size) of target links. They applied a variant of the Weisfeiler-Lehman algorithm [107] to determine the order of nodes in each adjacency matrix, ensuring that nodes with closer distances to the target link are ranked in higher positions. Zhang and Chen [108] further proposed a novel framework based on graph neural networks [109], which can learn multiple types of information, including general structural features and latent and explicit node features. In this framework, a node's order in the enclosing subgraph can be determined only by its closeness to the target link, and the subgraph size can be flexible. Wang et al. [110] directly represented the adjacency matrix of a network as an image and then learned hierarchical feature representations by training generative adversarial networks [111]. Some preliminary experimental results suggested that the performance of those methods [106][108][110] is highly competitive with many other state-of-the-art algorithms. Despite the promising results, at present, features and models are simply pieced together without intrinsic connections. The above pioneering works [106][108][110] provide a good start, but we still need in-depth and comprehensive analyses to push related studies forward. Although most link prediction algorithms only account for structural information, attributes of nodes (e.g., expression levels of genes [74] and tags in citation and social networks [112]) can be utilized to improve the prediction performance. It is easy to treat attributes as independent information additional to structural features and work out a method that directly combines the two, while what is lacking but valuable is to uncover nontrivial relationships between attributes and structural roles and then design more meaningful algorithms. Beyond explicit attributes, we should also pay attention to dynamical information. It is known that limited time series obtained from some dynamical processes can be used to reconstruct network topology [113], while even a small fraction of missing links in modeling dynamical processes can lead to remarkable biases [114]; however, studies about how to make use of the correlations between topology and dynamics to predict missing links, or how to take advantage of link prediction algorithms to improve estimates of dynamical parameters, are rare. By an elaborately designed model, Gu et al. [115] showed that there is no ground truth in ranking influential spreaders even with a given dynamics. Peel et al. [116] proved that there is no ground truth and no free lunch for community detection; the latter implies that no detection algorithm can be optimal on all inputs. Fortunately, we have ground truth in link prediction; however, extensive experiments [67] also indicate that no known link predictor performs best or worst across all inputs. If link prediction is a no-free-lunch problem, then no single algorithm performs better or worse than any other when applied to all possible inputs. This raises the question of whether the study of prediction algorithms is valuable. The answer is of course YES [117], because we actually have free lunches, as what we are interested in, the real networks, have far different statistics from those of all possible networks. A related question is, given the usually superior performance of ensemble models [67][95], whether the study of individual algorithms is valuable. The answer is still YES, because an individual algorithm can be highly cost-effective for its competitive performance and low complexity in time and space. Above all, individual algorithms, especially the mechanistic algorithms, may provide significant insights about network organization and evolution. In some real applications like friend recommendation, predictions with explanations are more acceptable [118], and these cannot be obtained by ensemble learning. In addition, an alogical reason is that some elegant individual models (e.g., HSM [22], SBM [23], SPM [37], HYPERLINK [70], etc.) bring us an inimitable aesthetic perception that cannot be experienced elsewhere.
Along with the fruitful algorithms proposed recently, the design of novel and effective algorithms for general networks becomes increasingly hard. We expect that a larger fraction of algorithms in future studies will be designed for networks of particular types (e.g., directed networks [119][120], weighted networks [121][122][123], multilayer networks [124][125][126], temporal networks [127][128][129], hypergraphs and bipartite networks [130][131], networks with negative links [132][133][134], etc.) and networks with domain knowledge (e.g., drug-target interactions [135][136][137], disease-associated relations [138][139][140], protein-protein interactions [36][141], criminal networks [142], citation networks [143], academic social networks [144], knowledge graphs [145], etc.). We should take serious consideration of the properties and requirements of the target networks and domains in the algorithm design, instead of making straightforward (and thus less valuable) extensions of general algorithms. For example, if we attempt to recommend friends in an online social network based on link prediction [13], we need to consider how to explain our recommendations to improve the acceptance rate [118], how to use the acceptance/rejection information to promote the prediction accuracy [146], and how to avoid recommending bots to real users [147]. These considerations will bring fresh challenges to link prediction. Early studies often compared very few algorithms on several small networks according to one or two metrics. Recent large-scale experiments [31][32][66][67] indicated that such a methodology may result in misleading conclusions. Future studies ought to implement systematic analyses involving more synthetic and real networks, benchmarks, state-of-the-art algorithms and metrics. If the relevant results cannot be published in an article with limited space, they should be made public (better together with data and codes) on some accessible website like GitHub. Lastly, we would like to emphasize that the soul of a network lies in its links instead of its nodes, otherwise we should pay more attention to set theory rather than graph theory. Therefore, in network science, link prediction is a paradigmatic and fundamental problem with lasting attractivity and vitality. Beyond an algorithm predicting missing and future links, link prediction is also a powerful analysis tool, which has already been utilized in evaluating and inferring network evolving mechanisms [7][8][9][97], testing privacy-protection algorithms (as an attacking method) [148], evaluating and designing network embedding algorithms [149][150], and so on. Though the last decade has witnessed plentiful and substantial achievements, the study of link prediction is just unfolding, and more efforts are required towards a full picture of how links emerge and vanish.
References
[1] A.-L. Barabási, Network Science (Cambridge University Press, 2016).