SI-spreading-based network embedding in static and temporal networks
Xiu-Xiu Zhan, Ziyu Li, Naoki Masuda, Petter Holme, Huijuan Wang
April 16, 2020

Affiliations: Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands. Department of Mathematics, University at Buffalo, State University of New York, Buffalo, NY 14260-2900, USA. Computational and Data-Enabled Science and Engineering Program, University at Buffalo, State University of New York, Buffalo, NY 14260-2900, USA. Tokyo Tech World Research Hub Initiative (WRHI), Institute of Innovative Research, Tokyo Institute of Technology, Yokohama 226-8503, Japan.
Abstract
Link prediction can be used to extract missing information, identify spurious interactions, and forecast network evolution. Network embedding is a methodology that assigns coordinates to nodes in a low-dimensional vector space. By embedding nodes into vectors, the link prediction problem is converted into a similarity comparison task: nodes with similar embedding vectors are more likely to be connected. Classic network embedding algorithms are random-walk-based. They sample trajectory paths via random walks and generate node pairs from these paths. The node pair set is then used as the input for a Skip-Gram model, a representative language model that embeds nodes (regarded as words) into vectors. In the present study, we propose to replace the random walk process by a spreading process, namely the susceptible-infected (SI) model, to sample paths. Specifically, we propose two SI-spreading-based algorithms, SINE and TSINE, to embed static and temporal networks, respectively. The performance of our algorithms is evaluated by the missing link prediction task in comparison with state-of-the-art static and temporal network embedding algorithms. Results show that SINE and TSINE outperform the baselines across all six empirical datasets. We further find that SINE mostly performs better than TSINE, suggesting that temporal information does not necessarily improve the embedding for missing link prediction. Moreover, we study the effect of the sampling size, quantified as the total length of the trajectory paths, on the performance of the embedding algorithms. SINE and TSINE reach their better performance with a smaller sampling size than the baseline algorithms. Hence, SI-spreading-based embedding tends to be more applicable to large-scale networks.

Introduction
Real-world systems can be represented as networks, with nodes representing the components and links representing the connections between them [1, 2]. The study of complex networks pervades different fields [3]. For example, with biological or chemical networks, scientists study interactions between proteins or chemicals to discover new drugs [4, 5]. With social networks, researchers tend to classify or cluster users into groups or communities, which is useful for many tasks, such as advertising, search, and recommendation [6, 7]. With communication networks, learning the network structure can help understand how information spreads over the networks [2]. These are only a few examples of the important role of analyzing networks. For all these examples, the data may be incomplete. If so, it could be important to be able to predict the links most likely to be missing. If the network is evolving, it could be crucial to forecast the next link to be added. Both of these applications call for link prediction [8, 9, 10, 11, 12].

In link prediction, one estimates the likelihood that two nodes are adjacent to each other based on the observed network structure [13]. Methods using similarity-based metrics, maximum-likelihood algorithms, and probabilistic models are the major families of link prediction methods [14]. Recently, network embedding, which embeds nodes into a low-dimensional vector space, has attracted much attention for solving the link prediction problem [14, 15]. The similarity between the embedding vectors of two nodes is used to evaluate whether they would be connected or not. Different algorithms have been proposed to obtain network embedding vectors. The simplest embedding method is to take the row or column vector of the adjacency matrix, called the adjacency vector of the corresponding node, as the embedding vector. The representation space is then N-dimensional, where N is the number of nodes. As real-world networks are mostly large and sparse, the adjacency vector of a node is sparse and high-dimensional. In addition, the adjacency matrix only contains first-order neighborhood information, so the adjacency vector neglects higher-order structure of the network, such as paths longer than one edge. These factors limit the precision of network embedding based on the adjacency vector in link prediction tasks. Work in the early 2000s attempted to embed nodes into a low-dimensional space using dimension reduction techniques [16, 17, 18]. Isomap [16], locally linear embedding (LLE) [17], and Laplacian eigenmaps [18] are algorithms based on the k-nearest graph, where nodes i and j are connected by a link in the k-nearest graph if the length of the shortest path between i and j is within the k-th shortest among the lengths of all the shortest paths from i to any other node. Matrix factorization algorithms decompose the adjacency matrix into the product of two low-dimensional rectangular matrices, whose columns are the embedding vectors of the nodes. Singular value decomposition (SVD) [19] is a commonly used and simple matrix factorization. However, the computational complexity of most of the aforementioned algorithms is at least quadratic in N, limiting their applicability to large networks with millions of nodes.

Random-walk-based network embedding is a promising family of computationally efficient algorithms. These algorithms exploit truncated random walks to capture the proximity between nodes [20, 21, 22], generally via the following three steps [23, 24, 25]: (1) Sample the network by running random walks to generate trajectory paths. (2) Generate a node pair set from the trajectory paths: each node on a trajectory path is viewed as a center node, and the nearby nodes within a given distance are considered its neighboring nodes. A node pair in the node pair set is formed by a center node and each of its neighboring nodes.
(3) Apply a word embedding model such as Skip-Gram to learn the embedding vector of each node by using the node pair set as input. Skip-Gram assumes that nodes that are similar in topology or content tend to have similar representations [22]. Algorithms have been designed using different random walks to capture higher-order structure of networks. For example, DeepWalk [20] and Node2Vec [23] adopt uniform and biased random walks, respectively, to sample the network structure. In addition, random-walk-based embedding methods have also been developed for temporal networks, signed networks, and multilayer networks [26, 27, 28, 29].

In contrast to random-walk-based embedding, here we propose SI-spreading-based network embedding algorithms for static and temporal networks. We deploy the susceptible-infected (SI) spreading process on the given network, either static or temporal, and use the corresponding spreading trajectories to generate the node pair set, which is fed to the Skip-Gram model to derive the embedding vectors. The trajectory of an SI spreading process captures the tree-like sub-network centered at the seed node, whereas a random walk explores long walks that possibly revisit the same node. We evaluate our static network embedding algorithm, which we refer to as SINE, and our temporal network embedding algorithm, TSINE, via a missing link prediction task in six real-world social networks. We compare our algorithms with state-of-the-art static and temporal network embedding methods. We show that SINE and TSINE outperform the other static and temporal network embedding algorithms, respectively. In most cases, the static network embedding, SINE, performs better than TSINE, which additionally uses temporal network information. In addition, we evaluate the efficiency of SI-spreading-based network embedding by exploring the sampling size for the Skip-Gram model, quantified as the sum of the lengths of the trajectory paths, in relation to the performance on the link prediction task. We show that the high performance of the SI-spreading-based network embedding algorithms requires a significantly smaller sampling size than the random-walk-based embeddings. We further explore what kind of links can be better predicted, to explain why our proposed algorithms perform better than the baselines.

The rest of the paper is organized as follows. We propose our method in Section 2. In Section 2.1, we propose our SI-spreading-based sampling method for static networks and the generation of the node pair set from the trajectory paths. The Skip-Gram model is introduced in Section 2.2. We introduce an SI-spreading-based sampling method for temporal networks in Section 2.3. In Section 3, our embedding algorithms are evaluated on a missing link prediction task on real-world static and temporal social networks. The paper is concluded in Section 4.
This section introduces the SI-spreading-based network embedding methods. First, we illustrate our SI-spreading-based network embedding method for static networks in Sections 2.1 and 2.2. Section 2.3 generalizes the method to temporal network embedding.

Because we propose network embedding methods for both static and temporal networks, we start with the notation for temporal networks, of which static networks are special cases. A temporal network is represented as G = (N, L), where N is the node set and L = {l(i, j, t), t ∈ [0, T], i, j ∈ N} is the set of time-stamped contacts. The element l(i, j, t) in L represents a bidirectional contact between nodes i and j at time t. We consider discrete time and assume that all contacts have a duration of one discrete time step. We use [0, T] to represent the observation time window, and N = |N| is the number of nodes. The aggregated static network G = (N, E) is derived from a temporal network G: two nodes are connected in G if there is at least one contact between them in G, and E is the edge set of G. The network embedding problem is formulated as follows: given a network G = (N, E), static network embedding aims to learn a low-dimensional representation for each node i ∈ N. The node embedding matrix for all the nodes is given by U ∈ R^{d×N}, where d is the dimension of the embedding vector (d < N). The i-th column of U, i.e., u_i ∈ R^{d×1}, represents the embedding vector of node i.

2.1 SI-spreading-based static network sampling

2.1.1 Trajectory path generation

The SI spreading process on a static network is defined as follows: each node is in one of two states at any time step, i.e., susceptible (S) or infected (I); initially, one seed node is infected; at each time step, an infected node independently infects each of its susceptible neighbors with an infection probability β; the process stops when no node can be infected further. To derive the node pair set as the input for Skip-Gram, we carry out the following steps (Algorithm 1).
Algorithm 1: Generation of trajectory paths from SI spreading

Input: G = (N, E), sampling size B, maximum path length L_max, infection probability β, number of paths per seed m_i
Output: node trajectory path set D

  Initialize the number of sampled context windows C = 0
  Initialize the node trajectory path set D = ∅
  while B − C > 0 do
    Randomly choose a node i as the seed and start the SI spreading
    Generate the spreading trajectory tree T_i(β)
    Randomly choose m_i trajectory paths D_{g_i} (g_i = 1, ..., m_i) from T_i(β)
    for g_i = 1, ..., m_i do
      if |D_{g_i}| > L_max then
        Choose the first L_max nodes of D_{g_i} to form D*_{g_i}
        Add the trajectory D*_{g_i} to D; C = C + |D*_{g_i}|
      else
        Add the trajectory D_{g_i} to D; C = C + |D_{g_i}|
      end if
    end for
  end while
  return D
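Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function and variable names are ours, and the sampling-budget loop over seeds is omitted so that one call handles a single seed.

```python
import random
from collections import defaultdict

def si_trajectory_paths(edges, seed, beta, m_i, L_max=20, rng=None):
    """Run one SI spreading process from `seed` on a static graph given as an
    edge list, then extract up to m_i root-to-leaf trajectory paths of at most
    L_max nodes each from the spreading trajectory tree T_i(beta)."""
    rng = rng or random.Random()
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    infected = {seed}
    parent = {seed: None}          # infection tree: child -> parent
    frontier = [seed]
    while frontier:                # one discrete time step per iteration
        new_frontier = []
        for node in frontier:
            for nb in adj[node]:
                if nb not in infected and rng.random() < beta:
                    infected.add(nb)
                    parent[nb] = node
                    new_frontier.append(nb)
        frontier = new_frontier
    # leaves of the infection tree = infected nodes that infected nobody
    parents = set(p for p in parent.values() if p is not None)
    leaves = [n for n in infected if n not in parents] or [seed]
    paths = []
    for leaf in rng.sample(leaves, min(m_i, len(leaves))):
        path = []
        while leaf is not None:    # walk back from the leaf to the root
            path.append(leaf)
            leaf = parent[leaf]
        paths.append(path[::-1][:L_max])   # truncate to the first L_max nodes
    return paths
```

With β = 1 every neighbor is infected, so on a path graph the single extracted trajectory is the graph itself.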
Figure 1: Generating node pairs from a trajectory path. The window size is ω = 2, and only the first four nodes of the path (1, 3, 6, and 8) are illustrated as center nodes.

In each run of the SI spreading process, a node i is selected uniformly at random as the seed, and the SI spreading process starting from i is performed. The spreading trajectory T_i(β) is the union of all the nodes that finally get infected, together with all the links that have transmitted infection between node pairs. From each spreading trajectory T_i(β), we construct m_i trajectory paths, each of which is the path between the root node i and a randomly selected leaf node in T_i(β).
The number m_i of trajectory paths to be extracted from T_i(β) is assumed to be given by

m_i = max{ 1, m_max · K(i) / Σ_{j∈N} K(j) },

where m_max is a control parameter and K(i) is the degree of the root node i in the static network (or aggregated network). The trajectory paths may have different lengths (i.e., numbers of nodes in the path). For a trajectory path whose length is larger than L_max = 20, we only take the first L_max nodes on the path. For a randomly chosen seed node i, we can generate m_i trajectory paths from T_i(β). We stop running the SI spreading process when the sum of the lengths of the trajectory paths reaches the sampling size B = NX, where X is a control parameter for which we consider twelve values. We compare different algorithms using the same B for a fair comparison [26] to understand the influence of the sampling size. We show how to sample the trajectory paths in Algorithm 1.

2.1.2 Node pair set generation

We illustrate how to generate the node pairs, the input of the Skip-Gram model, from a trajectory path in Figure 1. Consider a trajectory path starting from node 1 and ending at node 5. We set each node, e.g., node 3, as the center node, and the neighboring nodes of the center node are defined as the nodes within ω = 2 hops on the path. The neighboring nodes of node 3 are 1, 6, and 8, so we obtain the ordered node pairs (3, 1), (3, 6), and (3, 8). We use the union of the node pairs centered at each node of each trajectory path as the input to the Skip-Gram model.

2.2 Skip-Gram model

We now illustrate how the Skip-Gram model derives the embedding vector of each node based on the input node pair set. We denote by N_SI(i) the neighboring set of a node i derived from the SI spreading process. A neighboring node j of i may appear multiple times in N_SI(i) if (i, j) appears multiple times in the node pair set. Let p(j|i) be the probability of observing neighboring node j given node i.
We model the conditional probability p(j|i) as the softmax unit parametrized by the product of the embedding vectors u_i and u_j:

p(j|i) = exp(u_i · u_j^T) / Σ_{k∈N} exp(u_i · u_k^T).    (1)

Skip-Gram derives the set of N embedding vectors that maximizes the log-probability of observing every neighboring node in N_SI(i) for each i. Therefore, one maximizes

max O = Σ_{i∈N} Σ_{j∈N_SI(i)} log p(j|i).    (2)

Equation (2) can be further simplified to

max O = Σ_{i∈N} ( −log Z_i + Σ_{j∈N_SI(i)} u_i · u_j^T ),    (3)

where

Z_i = Σ_{k∈N} exp(u_i · u_k^T).    (4)

To compute Z_i for a given i, we need to traverse the entire node set N, which is computationally costly. To solve this problem, we introduce negative sampling [22], which randomly selects a certain number of nodes k from N to approximate Z_i. To obtain the embedding vectors, we use stochastic gradient ascent to optimize Eq. (3). The static network embedding algorithm proposed above, combining SI-spreading-based static network sampling and the Skip-Gram model, is named SINE.
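A single negative-sampling update of the Skip-Gram objective can be sketched as below. This is a minimal illustration with our own names, assuming one (center, context) pair per update and separate input/output embedding matrices; it is not the authors' implementation.

```python
import numpy as np

def sgns_step(U, V, center, context, negatives, lr=0.025):
    """One stochastic-gradient step of Skip-Gram with negative sampling.

    U, V are (N, d) input/output embedding matrices, (center, context) is one
    node pair from the sampled pair set, and `negatives` are node ids drawn at
    random to approximate the normalizer Z_i."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    u = U[center]
    grad_u = np.zeros_like(u)
    # positive pair: increase sigma(u_center . v_context)
    g = sigmoid(u @ V[context]) - 1.0        # d(-log sigma(x))/dx at x = u.v
    grad_u += g * V[context]
    V[context] = V[context] - lr * g * u
    # negative samples: decrease sigma(u_center . v_k)
    for k in negatives:
        g = sigmoid(u @ V[k])
        grad_u += g * V[k]
        V[k] = V[k] - lr * g * u
    U[center] = u - lr * grad_u
    return U, V
```

Each update pushes the center and context vectors together and pushes the center vector away from the randomly drawn negative nodes.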
2.3 SI-spreading-based temporal network sampling

We generalize SINE to SI-spreading-based temporal network embedding, named TSINE, by deploying SI spreading processes on the given temporal network. For a temporal network G = (N, L), the SI spreading follows the time steps of the contacts in G. Initially, node i is chosen as the seed of the spreading process. At every time step t ∈ [0, T], an infected node infects each of its susceptible neighbors in the snapshot through the contact between them with probability β. The process stops at time T. We construct the spreading trajectory starting from node i as T_i(β), which records the union of the nodes that get infected together with the contacts through which these nodes get infected. We propose two protocols to select the seed node of the SI spreading. In the first protocol, we select uniformly at random a node i as the seed and then select uniformly at random a time step from all the times of contacts made by node i as the starting point of the spreading process, i.e., the time when i gets initially infected. We refer to this protocol as TSINE1. In the second protocol, we choose a node i uniformly at random as the seed and start the spreading at the time when node i has its first contact. We refer to this protocol as TSINE2.

Both TSINE1 and TSINE2 generate the node pair set from the spreading trajectory T_i(β) in the same way as described in Section 2.1. The node pairs from the node pair set are the input of the Skip-Gram model for calculating the embedding vector of each node. The SI-spreading-based temporal network embedding thus uses the time stamps of contacts in addition to the information used by the static network embedding.

For the link prediction task in a static network, we remove a certain fraction of links from the given network and predict these missing links based on the remaining links. We apply our static network embedding algorithm to the remaining static network to derive the embedding vectors of the nodes, which are used for link prediction. For a temporal network, we select a fraction of node pairs that have at least one contact and remove all the contacts between the selected node pairs from the given temporal network. Then, we attempt to predict whether the selected node pairs have at least one contact or not based on the remaining temporal network. We use the area under the curve (AUC) score to evaluate the performance of the algorithms on the link prediction task. The AUC quantifies the probability that a random node pair that is connected (or has at least one contact) is ranked higher than a random node pair that is not connected (or has no contact).
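Returning to the temporal sampling of Section 2.3, the SI spreading over a contact list with the TSINE2 seeding protocol (start the spreading at the seed's first contact) can be sketched as follows. Names and data layout are illustrative assumptions, not the authors' code.

```python
import random
from collections import defaultdict

def temporal_si_trajectory(contacts, seed, beta, rng=None):
    """SI spreading over a temporal contact list [(i, j, t), ...], started at
    the seed's first contact (TSINE2 protocol). Returns the infection tree as
    a child -> (parent, infection_time) map."""
    rng = rng or random.Random()
    by_time = defaultdict(list)
    for i, j, t in contacts:
        by_time[t].append((i, j))
    start = min(t for i, j, t in contacts if seed in (i, j))
    infected = {seed}
    tree = {seed: (None, start)}
    for t in sorted(by_time):
        if t < start:
            continue                       # spreading begins at the seed's first contact
        frozen = set(infected)             # infection state at the start of step t
        for i, j in by_time[t]:
            for src, dst in ((i, j), (j, i)):   # contacts are bidirectional
                if src in frozen and dst not in infected and rng.random() < beta:
                    infected.add(dst)
                    tree[dst] = (src, t)
    return tree
```

Freezing the infected set at the start of each time step enforces the snapshot semantics: a node infected at time t cannot pass the infection on until a later time step.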
We consider temporal networks, each of which records the contacts between every node pair and their corresponding time stamps. For each temporal network G, one can obtain the corresponding static network G by aggregating the contacts between each node pair over time. In other words, two nodes are connected in the static network G if there is at least one contact between them in G. The static network G derived from G is unweighted by definition. We consider the following temporal social network data sets.

• HT2009 [30] is a network of face-to-face contacts between the attendees of the ACM Hypertext 2009 conference.
• Manufacturing Email (ME) [31] is an email contact network between employees in a mid-sized manufacturing company.
• Haggle [32] records the physical contacts between individuals via wireless devices.
• Fb-forum [33] captures the contacts between students at the University of California, Irvine, in a Facebook-like online forum.
• DNC [34] is an email contact network in the 2016 Democratic National Committee email leak.
• CollegeMsg [35] records messages between the users of an online community of students at the University of California, Irvine.

Table 1 provides some properties of the empirical temporal networks. The first three columns show the properties of the temporal networks, i.e., the number of nodes (N), timestamps (T), and contacts (|L|). The remaining columns show the properties of the corresponding aggregated static networks, including the number of links (|E|), link density, average degree, and clustering coefficient. The temporal networks differ considerably in size, which ranges from hundreds to thousands of nodes, as well as in network density and clustering coefficient. Choosing networks with different properties allows us to investigate whether the performance of our algorithms is consistent across networks.

Table 1: Properties of the empirical temporal networks.
The number of nodes (N), timestamps (T), and contacts (|L|) are shown. In addition, the number of links (|E|), link density, average degree, and clustering coefficient of the corresponding static network are shown.

Dataset      N      T       |L|     |E|     Link density  Average degree  Clustering coefficient
HT2009       113    5,246   20,818  2,196   0.3470        38.87           0.5348
ME           167    57,842  82,927  3,251   0.2345        38.93           0.5919
Haggle       274    15,662  28,244  2,124   0.0568        15.50           0.6327
Fb-forum     899    33,515  33,720  7,046   0.0175        15.68           0.0637
DNC          1,891  19,383  39,264  4,465   0.0025        4.72            0.2091
CollegeMsg   1,899  58,911  59,835  13,838  0.0077        14.57           0.1094
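The aggregated-network statistics reported in Table 1 can be recomputed from a raw contact list with a short plain-Python routine. This is a sketch under the assumption of an undirected simple aggregated graph; the function names are ours.

```python
from collections import defaultdict

def aggregate(contacts):
    """Aggregate a temporal contact list [(i, j, t), ...] into a static edge set."""
    return {tuple(sorted((i, j))) for i, j, t in contacts if i != j}

def static_properties(edges):
    """Link density, average degree, and average clustering coefficient of an
    undirected simple graph, as reported in Table 1."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    density = 2 * len(edges) / (n * (n - 1))
    avg_degree = 2 * len(edges) / n
    cc = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc.append(0.0)
            continue
        # count links among the neighbors of u
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        cc.append(2 * links / (k * (k - 1)))
    return density, avg_degree, sum(cc) / n
```

On a triangle, for instance, all three quantities equal their maxima (density 1.0, average degree 2.0, clustering 1.0), which is a quick consistency check for the formulas.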
We consider three state-of-the-art network embedding algorithms based on Skip-Gram. These baseline algorithms and the algorithms that we propose differ only in the method used to sample trajectory paths, from which the node pair set, i.e., the input to the Skip-Gram model, is derived. DeepWalk [20] and Node2Vec [23] are static network embedding algorithms based on random walks; CTDNE [26] is a temporal network embedding algorithm based on random walks.

• DeepWalk [20] deploys classic random walks on a given static network.
• Node2Vec [23] deploys biased random walks on a given static network. The biased random walk trades off breadth-first-like and depth-first-like sampling of the neighborhood, controlled via two hyper-parameters p and q. We use a grid search over six values of p and q to obtain the embeddings that achieve the largest AUC value for link prediction.
• CTDNE [26] is a temporal network embedding algorithm based on temporal random walks. The main idea is that the timestamp of the next temporal contact on the walk should be larger than the timestamps of the previously traversed contacts. Given a temporal network G = (N, L), the starting contact of the temporal random walk is selected uniformly at random; thus, every contact has probability 1/|L| of being selected as the starting contact. Assume that a random walker visits node i at time step t. We define Γ_t(i) as the set of nodes that have contacted node i after time t, allowing duplicated elements. A node may appear multiple times in Γ_t(i) because it may have multiple contacts with node i over the course of time. The next node to walk to is selected uniformly from Γ_t(i), i.e., every node in Γ_t(i) is chosen with probability 1/|Γ_t(i)|. Nguyen et al. [26] generalized the starting contact and the successor node of a temporal walk to distributions other than the uniform distribution illustrated here. When comparing the performance of the algorithms on link prediction, we report the embeddings that give the largest AUC value for CTDNE, taking into account all the possible generalizations proposed by Nguyen et al.

In our SI-spreading-based algorithms for both static and temporal networks, we tune the infection probability β over a grid of twelve values. We use ω = 10 and embedding dimension d = 128 for our algorithms and the baseline algorithms.

In this section, we illustrate how to generate the training and test sets for the link prediction task in temporal and static networks.
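The evaluation machinery used throughout this section, a random split of the links and the AUC score over positive and negative pairs, can be sketched as follows. This is a simplified illustration with our own names; the sampling of non-contact node pairs as negatives, described below, is omitted for brevity.

```python
import random

def split_links(edges, test_fraction, rng=None):
    """Random train/test split of the links, mirroring the protocol of this
    section: a fraction of links is held out and predicted from the rest."""
    rng = rng or random.Random()
    edges = list(edges)
    rng.shuffle(edges)
    cut = int(len(edges) * (1 - test_fraction))
    return edges[:cut], edges[cut:]

def auc_score(pos_scores, neg_scores):
    """AUC as used in the paper: the probability that a random positive pair is
    scored above a random negative pair, counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

In practice the scores fed to `auc_score` would be the dot products of the embedding vectors of the two end nodes of each test pair.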
We run the network embedding algorithms on the corresponding training set to obtain the embedding vector of each node, and use the AUC to evaluate the link prediction performance on the test set. Given a temporal network G, we select uniformly at random a fraction of the node pairs that have at least one contact between them in G as the training set for the temporal embedding algorithms, including all the contacts and their timestamps. The training set for the static network embedding algorithms is the aggregation of the training set for the temporal embedding algorithms: for every node pair, there is a link between the two nodes in the training set for static network embedding if and only if they have at least one contact in the training set for temporal embedding algorithms. We use the remaining node pairs that have at least one contact in G as the positive links in the test set and label them 1. Then, we sample uniformly at random an equal number of node pairs of G that have no contact between them. These node pairs are used as the negative links in the test set and are labeled 0. The same test set is used for the link prediction task in both temporal and static networks.

For each temporal network data set, we randomly split the network into training and test sets according to the procedure given above five times. Both random walks and SI spreading processes are stochastic. For each split, we run each algorithm on the training set and perform link prediction on the test set for ten realizations. We thus obtain ten AUC scores for each split of the data into training and test sets, evening out the randomness stemming from the stochasticity of the random walk or SI spreading processes. We obtain the AUC score of each algorithm with a given parameter set as an average over the 50 realizations in total.

Table 2: AUC scores for link prediction. All the results shown are averages over 50 realizations. Bold indicates the optimal AUC among the embedding algorithms; ∗ indicates the optimal AUC among all the algorithms. L2, L3, and L4 are short for the link prediction metrics that count the number of l = 2, 3, and 4 paths, respectively.

Dataset   DeepWalk  Node2Vec  CTDNE   TSINE1  TSINE2  SINE  L2  L3  L4
HT2009    0.5209    0.5572    0.6038  0.6740

We summarize the overall performance of the algorithms on missing link prediction in Table 2. For each algorithm, we tune the parameters and show the optimal average AUC score. Among the static network embedding algorithms,
SINE significantly outperforms DeepWalk and Node2Vec. The improvement in the AUC score is up to 30% on the CollegeMsg dataset. The embedding algorithms CTDNE, TSINE1, and TSINE2 are for temporal networks. The SI-spreading-based algorithms (i.e., TSINE1 and TSINE2) also show better performance than the random-walk-based one (CTDNE). Additionally, TSINE2 is slightly better than TSINE1 on all data sets; therefore, we focus on TSINE2 in the following analysis. In fact, SINE shows better performance than the temporal network embedding methods, including TSINE2, on all data sets except HT2009. It has been shown that temporal information is important for learning embeddings [26, 36, 37]. However, up to our numerical efforts, SINE outperforms the temporal network algorithms even though SINE deliberately neglects temporal information.

To get insight into the different performance of the embedding algorithms, we further investigate the distribution of the dot product of the node embedding vectors. Given a link (i, j) in the test set, we compute the dot product of the embedding vectors of its two end nodes, i.e., u_i · u_j^T. We show the dot product distributions for the positive links and the negative links in the test set separately. For each embedding algorithm, we consider only the parameter set that maximizes the AUC, i.e., the parameter values for which the results are shown in Table 2. We show the distribution of the dot product for Haggle in Figure 2 and for the other data sets in Figures S1–S5 in the Appendix. Compared to the random-walk-based algorithms, TSINE2 and SINE yield more distinguishable distributions between the positive (grey) and the negative links (pink).

Figure 2: The dot product distribution of the embedding vectors of the two end nodes for the positive and negative links in the test set. We show the result for the Haggle data set. For each algorithm, we use the same parameter settings as in Table 2 to obtain the embeddings. Dot products of positive links are shown in grey; negative links are shown in pink. The results are shown for (a) DeepWalk; (b) Node2Vec; (c) CTDNE; (d) TSINE2; and (e) SINE.

This result supports the better performance of the SI-spreading-based embeddings compared with the random-walk-based ones. The embedding algorithms differ only in the sampling method used to generate the node pair set; they use the same Skip-Gram architecture, which takes the node pair set as input, to deduce the embedding vector of each node. We therefore explore further how the algorithms differ in the node pair sets that they sample. The objective is to discover the relation between the properties of the sampled node pairs and the performance of an embedding method. We represent the node pair set generated by an embedding method as a network
(cid:20)(cid:19)(cid:19)(cid:19)(cid:19)(cid:17)(cid:19)(cid:19)(cid:17)(cid:21)(cid:19)(cid:17)(cid:23)(cid:19)(cid:17)(cid:25)(cid:19)(cid:17)(cid:27)(cid:20)(cid:17)(cid:19) (cid:11)(cid:68)(cid:12) (cid:43)(cid:55)(cid:21)(cid:19)(cid:19)(cid:28) (cid:11)(cid:69)(cid:12) (cid:48)(cid:40)(cid:43)(cid:68)(cid:74)(cid:74)(cid:79)(cid:72)(cid:11)(cid:70)(cid:12) (cid:55)(cid:85)(cid:68)(cid:76)(cid:81)(cid:76)(cid:81)(cid:74)(cid:66)(cid:86)(cid:72)(cid:87) (cid:39)(cid:72)(cid:72)(cid:83)(cid:58)(cid:68)(cid:79)(cid:78) (cid:49)(cid:82)(cid:71)(cid:72)(cid:21)(cid:57)(cid:72)(cid:70)(cid:38)(cid:55)(cid:39)(cid:49)(cid:40) (cid:55)(cid:54)(cid:44)(cid:49)(cid:40)(cid:21) (cid:54)(cid:44)(cid:49)(cid:40) (cid:41)(cid:69)(cid:16)(cid:73)(cid:82)(cid:85)(cid:88)(cid:80)(cid:11)(cid:71)(cid:12)(cid:11)(cid:72)(cid:12) (cid:39)(cid:49)(cid:38) (cid:39)(cid:72)(cid:74)(cid:85)(cid:72)(cid:72) (cid:38) (cid:88) (cid:80) (cid:88) (cid:79) (cid:68) (cid:87) (cid:76) (cid:89) (cid:72)(cid:71)(cid:72)(cid:74) (cid:85) (cid:72)(cid:72)(cid:71) (cid:76) (cid:86) (cid:87) (cid:85) (cid:76) (cid:69)(cid:88) (cid:87) (cid:76) (cid:82)(cid:81) (cid:11)(cid:72)(cid:12) (cid:38)(cid:82)(cid:79)(cid:79)(cid:72)(cid:74)(cid:72)(cid:48)(cid:86)(cid:74)
Figure 3: Cumulative degree distribution of the static network derived from the training set and that of the sampled networks G_S from different algorithms. We show the results for (a) HT2009; (b) ME; (c) Haggle; (d) Fb-forum; (e) DNC; (f) CollegeMsg.

G_S = (N, E_S), the so-called sampled network. Two nodes are connected in G_S if they form a node pair in the node pair set. Note that G_S is an unweighted network. For each algorithm, with the parameter set that maximizes the AUC, we show the cumulative degree distribution of its sampled network G_S in Figure 3. The cumulative degree distribution of the training set for the static network is also given. Compared to the cumulative degree distribution of the training set, the sampled networks tend to have a higher node degree. Zhang et al. and Gao et al. [25, 38] have shown that when the degree distribution of G_S is closer to that of the training set, the prediction performance of a random-walk-based algorithm tends to be better. Even though the SI-spreading-based algorithms perform the best across the data sets, we have not found a direct relation between the performance of an embedding algorithm and the similarity between the degree distribution of the sampled network and that of the training set.

Figure 4: Illustration of l paths between a pair of nodes i and j. Here we show l = 2, 3, 4.

Similarity-based methods such as the number of l = 2, 3, 4 paths have been used for the link prediction problem [9]. An l path between two nodes refers to a path that contains l links. We show examples of l = 2, 3, 4 paths between a node pair i and j in Figure 4. Kovács et al. [39] have shown that l paths (l = 3, 4) outperform existing link prediction methods in predicting protein interactions. Cao et al.
[40] found that network embedding algorithms based on random walks sometimes perform worse in link prediction than the number of l = 2 paths, or equivalently the number of common neighbors. This result suggests a limit of random-walk-based embedding in identifying links between node pairs that have many common neighbors. Therefore, we explore further whether our SI-spreading-based algorithms can overcome this limitation, which could possibly explain their outperformance.

We investigate what kind of network structure surrounding links makes them easier to predict. For every positive link in the test set, we study the topological properties of its two end nodes (i.e., the number of l = 2, l = 3 and l = 4 paths) and the dot product of the embedding vectors of its two end nodes. Given a network, the parameters of each embedding algorithm are tuned to maximize the AUC, as given in Table 2. We take the data set Haggle as an example. Figure 5 shows the relation between the dot product of the embedding vectors and the number of l = 2, 3, 4 paths of the two end nodes of a positive link in the test set for all the embedding methods. The Pearson correlation coefficient (PCC) between the two variables for all the networks and algorithms is given in Table S1 in the Appendix. Figure 5 and Table S1 together show that the dot product of the embedding vectors constructed from TSINE2 and SINE is more strongly correlated with the number of l paths, where l = 2, 3 or 4, than that of the random-walk-based embeddings. This result suggests that SI-spreading-based algorithms may better predict links whose two end nodes have many l paths, thus overcoming the limit of random-walk-based embedding algorithms.

The number of l = 2, 3 paths has been used to predict links in [9, 39, 40]. This observation and the limit of random-walk-based embedding algorithms motivate us to use the number of l = 2, 3, 4 paths between a node pair to predict the missing links. Take l = 2 paths as an example. For every link in the test set, the number of l = 2 paths between the two end nodes in the training set is used to estimate the likelihood of connection between them. In the networks we considered, the two end nodes of a link tend to be connected by l = 2, l = 3 and l = 4 paths (see Figure 5). Table 2 (L2, L3 and L4 in the table correspond to the methods using the number of l = 2, 3, 4 paths for link prediction, respectively) shows that in such networks, the similarity-based methods do not evidently outperform the SI-spreading-based embedding. Actually, the SI-spreading-based embedding performs better in two out of six networks.

Next, we study the effect of the sampling size, B, on the performance of each algorithm. The sampling size is quantified as the total length of the trajectory paths, as defined in Section 2.1. Given a network, we set B = NX, where N is the size of the network and X ∈ { , , , , , , , }. We evaluate our SI-spreading-based embedding algorithms SINE and
TSINE2, and one random-walk-based embedding algorithm, CTDNE, because CTDNE mostly performs the best among the random-walk-based algorithms. The result is shown in Figure 6. For each X, we tune the other parameters to show the optimal AUC in the figure. Both SINE and TSINE2 perform better than CTDNE and are relatively insensitive to the sampling size. This means that they achieve a good performance even when the sampling size is small, even with X = 1. The random-walk-based algorithm, CTDNE, however, requires a relatively large sampling size to achieve a performance comparable with SINE and TSINE2.

Finally, the AUC as a function of the infection probability, β, is shown in Figure 7. For each β, we tune the other parameters to show the optimal AUC. The SI-spreading-based algorithms achieve high performance with a small infection probability ( . ≤ β ≤ . ) for all the data sets. The high performance of the SI-spreading-based embedding algorithms with small values of X and β across different networks motivates the further study of whether one can optimize the performance by searching a smaller range of parameter values.

In this paper, we proposed network embedding algorithms based on SI spreading processes, in contrast to the previously proposed embedding algorithms based on random walks [41, 42]. We further evaluated the embedding algorithms on the missing link prediction task. The key point of an embedding algorithm is how to design a strategy to sample trajectories from which to obtain embedding vectors for nodes. We used the SI model to this end. The algorithms that we proposed are
SINE and TSINE, which use static and temporal networks, respectively.

On six empirical data sets, the SI-spreading-based network embedding algorithm on the static network, i.e., SINE, substantially outperforms state-of-the-art random-walk-based network embedding algorithms across all the data sets. The SI-spreading-based network embedding algorithms on the temporal network, TSINE1 and TSINE2, also show better performance than the temporal random-walk-based algorithm. Temporal information provides additional information that may be useful for constructing embedding vectors [26, 36, 37]. However, we find that SINE outperforms TSINE, which uses timestamps of the contacts. This result suggests that temporal information does not necessarily improve the embedding for missing link prediction. Moreover, even when the sampling size of the Skip-Gram is small, the performance of the SI-spreading-based embedding algorithms is still high. Sampling trajectory paths takes time, especially for large-scale networks. Therefore, our observation that the SI-spreading-based algorithms require fewer samples than other algorithms supports the applicability of the SI-spreading-based algorithms to larger networks than the random-walk-based algorithms. Finally, we provide insight into why the SI-spreading-based embedding algorithms perform the best by investigating what kind of links are likely to be predicted.

We deem the following future work important. We have already applied the susceptible-infected-susceptible (SIS) model and evaluated the SIS-spreading-based embedding. However, this generalization has not improved the performance in the link prediction task. Therefore, one may explore whether or not sampling the network information via other spreading processes, such as the susceptible-infected-recovered (SIR) model, further improves the embedding. It is also interesting to explore the performance of the SI-spreading-based algorithms in other tasks such as classification and visualization. Moreover, the SI-spreading-based sampling strategies can be generalized to other types of networks, e.g., directed networks, signed networks, and multilayer networks.
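The SI-based trajectory sampling that SINE builds on can be illustrated with a minimal sketch. The code below is ours, not the authors' implementation: it assumes an unweighted static network stored as adjacency sets, runs a discrete-time SI process with infection probability beta from a seed node, records nodes in the order of infection, and generates Skip-Gram node pairs with a sliding window. The function names and the exact windowing scheme are illustrative assumptions.

```python
import random

def si_trajectory(adj, seed_node, beta, rng):
    """Discrete-time SI spreading from seed_node on a static network.

    At every time step, each infected node infects each of its
    susceptible neighbors independently with probability beta.
    Returns nodes in the order in which they became infected.
    """
    order = [seed_node]
    infected = {seed_node}
    new = [seed_node]
    while new:
        frontier = []
        for u in new:
            for v in sorted(adj[u]):  # sorted for reproducibility
                if v not in infected and rng.random() < beta:
                    infected.add(v)
                    order.append(v)
                    frontier.append(v)
        new = frontier
    return order

def skipgram_pairs(path, window):
    """Node pairs within a sliding window over a trajectory path."""
    return [(path[i], path[j])
            for i in range(len(path))
            for j in range(i + 1, min(i + window + 1, len(path)))]

# Toy network (hypothetical): a small static graph as adjacency sets.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
rng = random.Random(42)
path = si_trajectory(adj, seed_node=0, beta=1.0, rng=rng)
print(path)  # → [0, 1, 2, 3, 4]: with beta = 1.0 the whole component is reached
print(skipgram_pairs(path, 2))  # node pairs fed to the Skip-Gram model
```

In the full algorithm, many such trajectories would be sampled up to the total length budget B, and the resulting node pairs would be passed to the Skip-Gram model to learn the embedding vectors.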
The authors declare that they have no competing interests.
All authors planned the study; X.Z. and Z.L. performed the experiments, analyzed the data and prepared the figures. All authors wrote the manuscript.

Figure 5: Dot product of the embedding vectors versus the number of l = 2, 3, 4 paths between the two end nodes of the positive links in the test set for the Haggle data set. (a1–a5), (b1–b5) and (c1–c5) are the results for the number of l = 2, 3, 4 paths, respectively.

Figure 6: Influence of the sampling size B = NX on the link prediction performance, i.e., the AUC score. The error bar shows the standard deviation of the AUC score calculated on the basis of 50 realizations. We show the results for (a) HT2009; (b) ME; (c) Haggle; (d) Fb-forum; (e) DNC; (f) CollegeMsg.

Figure 7: AUC as a function of β. We show the results for (a) HT2009; (b) ME; (c) Haggle; (d) Fb-forum; (e) DNC; (f) CollegeMsg.

References

[1] Newman, M. E. The structure and function of complex networks.
SIAM Review, 167–256 (2003).

[2] Zhang, Z.-K. et al. Dynamics of information diffusion and its applications on complex networks. Physics Reports, 1–34 (2016).

[3] Costa, L. d. F. et al. Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Advances in Physics, 329–412 (2011).

[4] Qi, Y., Bar-Joseph, Z. & Klein-Seetharaman, J. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics, 490–500 (2006).

[5] Girvan, M. & Newman, M. E. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 7821–7826 (2002).

[6] Jacob, Y., Denoyer, L. & Gallinari, P. Learning latent representations of nodes for classifying in heterogeneous social networks. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 373–382 (ACM, 2014).

[7] Traud, A. L., Mucha, P. J. & Porter, M. A. Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 4165–4180 (2012).

[8] Liben-Nowell, D. & Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 1019–1031 (2007).

[9] Lü, L. & Zhou, T. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 1150–1170 (2011).

[10] Lü, L. et al. Recommender systems. Physics Reports, 1–49 (2012).

[11] Martínez, V., Berzal, F. & Cubero, J.-C. A survey of link prediction in complex networks. ACM Computing Surveys (CSUR), 69 (2017).

[12] Liu, C. et al. Computational network biology: Data, model, and applications. Physics Reports (2019).

[13] Getoor, L. & Diehl, C. P. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 3–12 (2005).

[14] Cui, P., Wang, X., Pei, J. & Zhu, W. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering (2018).

[15] Wang, X. et al. Community preserving network embedding. In Thirty-First AAAI Conference on Artificial Intelligence (2017).

[16] Tenenbaum, J. B., De Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 2319–2323 (2000).

[17] Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2323–2326 (2000).

[18] Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 585–591 (2002).

[19] Golub, G. H. & Reinsch, C. Singular value decomposition and least squares solutions. 134–151 (1971).

[20] Perozzi, B., Al-Rfou, R. & Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710 (ACM, 2014).

[21] Tang, J. et al. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, 1067–1077 (International World Wide Web Conferences Steering Committee, 2015).

[22] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119 (2013).

[23] Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 855–864 (ACM, 2016).

[24] Cao, Z., Wang, L. & de Melo, G. Link prediction via subgraph embedding-based convex matrix completion. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI Press, 2018).

[25] Zhang, Y., Shi, Z., Feng, D. & Zhan, X.-X. Degree-biased random walk for large-scale network embedding. Future Generation Computer Systems, 198–209 (2019).

[26] Nguyen, G. H. et al. Continuous-time dynamic network embeddings. In Companion of The Web Conference 2018, 969–976 (International World Wide Web Conferences Steering Committee, 2018).

[27] Yuan, S., Wu, X. & Xiang, Y. SNE: signed network embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 183–195 (Springer, 2017).

[28] Bagavathi, A. & Krishnan, S. Multi-Net: a scalable multiplex network embedding framework. In International Conference on Complex Networks and their Applications, 119–131 (Springer, 2018).

[29] Qu, C., Zhan, X.-X., Wang, G., Wu, J. & Zhang, Z.-K. Temporal information gathering process for node ranking in time-varying networks. Chaos: An Interdisciplinary Journal of Nonlinear Science, 033116 (2019).

[30] Isella, L. et al. What's in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology, 166–180 (2011).

[31] Michalski, R., Palus, S. & Kazienko, P. Matching organizational structure and social network extracted from email communication. In International Conference on Business Information Systems, 87, 197–206 (Springer, 2011).

[32] Chaintreau, A. et al. Impact of human mobility on opportunistic forwarding algorithms. IEEE Transactions on Mobile Computing.

[33] Social Networks, 159–167 (2013).

[34] DNC emails network dataset – KONECT (2017). URL http://konect.uni-koblenz.de/networks/dnc-temporalGraph.

[35] Opsahl, T. & Panzarasa, P. Clustering in weighted networks. Social Networks, 155–163 (2009).

[36] Zuo, Y. et al. Embedding temporal network via neighborhood formation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2857–2866 (ACM, 2018).

[37] Zhou, L., Yang, Y., Ren, X., Wu, F. & Zhuang, Y. Dynamic network embedding by modeling triadic closure process. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).

[38] Gao, M., Chen, L., He, X. & Zhou, A. BiNE: Bipartite network embedding. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 715–724 (2018).

[39] Kovács, I. A. et al. Network-based prediction of protein interactions. Nature Communications, 1–8 (2019).

[40] Cao, R.-M., Liu, S.-Y. & Xu, X.-K. Network embedding for link prediction: The pitfall and improvement. Chaos: An Interdisciplinary Journal of Nonlinear Science, 103102 (2019).

[41] Zhan, X.-X., Hanjalic, A. & Wang, H. Information diffusion backbones in temporal networks. Scientific Reports, 6798 (2019).

[42] Zhan, X.-X. et al. Coupling dynamics of epidemic spreading and information diffusion on complex networks. Applied Mathematics and Computation, 332.