NodeSim: Node Similarity based Network Embedding for Diverse Link Prediction

Abstract

In real-world complex networks, understanding the dynamics of their evolution has been of great interest to the scientific community. Predicting future links is an essential task of social network analysis as the addition or removal of the links over time leads to the network evolution. In a network, links can be categorized as intra-community links if both end nodes of the link belong to the same community, otherwise inter-community links. The existing link-prediction methods have mainly focused on achieving high accuracy for intra-community link prediction. In this work, we propose a network embedding method, called NodeSim, which captures both similarities between the nodes and the community structure while learning the low-dimensional representation of the network. The embedding is learned using the proposed NodeSim random walk, which efficiently explores the diverse neighborhood while keeping the more similar nodes closer in the context of the node. We verify the efficacy of the proposed embedding method over state-of-the-art methods using diverse link prediction. We propose a machine learning model for link prediction that considers both the nodes' embedding and their community information to predict the link between two given nodes. Extensive experimental results on several real-world networks demonstrate the effectiveness of the proposed framework for both inter and intra-community link prediction.


Akrati Saxena · George Fletcher · Mykola Pechenizkiy


Keywords

Network Embedding · Link Recommendation · Feature Learning

Akrati Saxena, George Fletcher, Mykola Pechenizkiy
Department of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands
arXiv [cs.SI]

In online social networks (OSNs), nodes are organized into communities, where a community represents a group of nodes having similar characteristics, such as similar interests, opinions, or beliefs [1]. The links between nodes belonging to the same community are referred to as intra-community links, and the links between nodes belonging to different communities are referred to as inter-community links. In social networks, intra-community links are driven by the effect of homophily [2], as similar nodes prefer to connect with each other. The formation of inter-community links is still not well explored in the literature; however, it can be explained by different complex phenomena, such as triadic closure and weak ties [3]. In real-world networks, it is observed that the number of intra-community links is larger than the number of inter-community links [4]. The evolution of social networks is regulated by the formation of new links in the network.

In OSNs, probable but not yet existing links are recommended as promising connections to help users make new friends, and a user having more friends will be more loyal towards the website [5, 6]. However, forming the right kind of links is very important, as the opinion of a user is highly influenced by the opinions of its neighbors [7]. In recent years, scientists have focused on increasing the diversity in the network so that users receive information on a topic from different viewpoints before forming their opinion [8]. It is crucial that a user receives information from other users having different perspectives to mitigate the negative impact of fake propaganda, false information, or fake news spreading on the network [9]. Hence, a user should have a diverse neighborhood, with connections to different communities.
In social networks, more inter-community links should be promoted to increase diversity. The link recommendation system plays an important role in forming new links and shaping the network evolution.

Initially, researchers proposed link prediction methods based on the similarity of the nodes [10]. These methods compute the similarity of a pair of nodes based on network structure, and more similar nodes are more likely to form a link. These methods are also often referred to as classic or heuristic link prediction methods. Well-known classic methods include the Jaccard coefficient [11], Adamic-Adar index [12], resource allocation index [10], preferential attachment index [13], and so on. These methods were extended to include community structure to improve the link prediction accuracy; however, most of them improved the total accuracy by improving intra-community link prediction accuracy [14, 15].

In recent works, network characteristics have been studied using network embedding, where the network is represented in a low-dimensional latent space [16]. In network embedding techniques, the aim is to embed similar nodes closer to each other. Most of the existing network embedding methods focus on embedding the nodes closely if they belong to the same community and therefore have high accuracy for the node classification task and intra-community link prediction.

In our work, we propose a network embedding method, called NodeSim embedding, which considers both the nodes' similarity and their community information while generating the network embedding. In the learned embedding, the nodes belonging to the same community will be embedded closely, and the nodes belonging to different communities will be embedded closer based on their neighborhood similarity. Therefore, the generated embedding preserves the structural properties of the network and is efficient in predicting diverse promising links.
Next, we propose a link prediction method that trains a logistic regression model using node pair embedding and their community information to predict both the inter and intra-community links with high accuracy. This is the first work that uses community information for learning the link prediction model and achieves higher accuracy for both types of links. The experiments are performed to show the accuracy and efficiency of the proposed method on real-world networks. The results show that the proposed method outperforms the state-of-the-art methods on all the datasets.

The paper is structured as follows. In Section 2, we discuss the state-of-the-art literature on link prediction by focusing on network embedding techniques. In Section 3, we discuss the proposed methods, including (i) the NodeSim network embedding method and (ii) the link prediction method. In Section 4, we discuss experimental results on real-world networks, including the performance, sensitivity, scalability, and robustness analysis of the proposed method. The paper is concluded in Section 5 with future directions.

Link prediction is a very well-known problem in network science and has been applied to predict missing links in different types of networks, such as friendship networks, collaboration networks, and chemical networks. Initially, researchers proposed heuristic methods, also known as classic methods, which consider the similarity of two nodes to predict the link between them. Well-known classic methods include the Jaccard coefficient, Adamic-Adar index, resource allocation index, preferential attachment index, and so on. The initially proposed methods only considered the neighborhood information of the nodes for link prediction and did not consider the network topology. Researchers then extended these methods to also consider network structure properties, such as community structure, to predict the links [15, 17, 14].
However, most of these methods improved the overall accuracy of link prediction by improving the accuracy of intra-community link prediction. The main benefit of using classic methods is that these methods do not need any training and are comparatively faster.

Another class of link prediction methods uses machine learning models, such as probabilistic graphical models [18, 19], matrix factorization [20, 21], supervised learning methods [22, 23, 24], and semi-supervised learning methods [25, 26]. These machine learning methods provide good accuracy, though they suffer from the class imbalance problem, as the number of existing links in a network is significantly smaller than the number of non-existing links.

In recent years, network embedding techniques have been used to study networks and to propose solutions for various network analysis problems. The network embedding methods can be categorized into three categories based on the structural proximity considered while generating the embedding: (i) microscopic structure embedding, which considers local proximity of nodes, such as first-order [27, 28], second-order [27], or high-order proximity [29, 16, 30], (ii) mesoscopic structure embedding, which captures hierarchical and community structural proximity [31, 32, 33], and (iii) network properties preserved embedding, which captures global network properties, such as network transitivity or structural balance [34, 35].

In the existing mesoscopic network embedding, the main focus has been either on the hierarchical embedding, where the users belonging to the same hierarchy should be embedded together [31], or on the intra-community proximity, where the nodes belonging to one community should be embedded closely [32, 33]. In hierarchical or structural role proximity, the nodes playing the same roles are embedded closely; for example, the nodes having a similar degree or similar influential power should be embedded closer [36]. In this work, we propose the NodeSim network embedding method, which considers both (i) high-order proximity by the similarity of the nodes and (ii) mesoscopic structure by the network communities while generating the embedding. In NodeSim embedding, the nodes belonging to one community are clustered together, and the similar nodes belonging to different communities are embedded closer. The proposed embedding captures a richer, more diverse neighborhood of the nodes, which is further verified using link prediction.

In this section, we first discuss the required network properties for our work. Next, we discuss our proposed NodeSim embedding method to learn the feature representation of the nodes and the proposed link prediction method.

3.1 Community Structure

In real-world complex networks, nodes connect with each other if they have similar properties. A group of nodes that are densely connected with each other is referred to as a community [37]. The community label of a node $u$ is denoted by $C_u$. If both end nodes of a link $(u, v)$ belong to the same community, it is referred to as an intra-community link, and $C_{(u,v)} = 1$. If the end nodes belong to different communities, then the link $(u, v)$ is referred to as an inter-community link and $C_{(u,v)} = 0$.

In most real-world networks, the ground truth community information is not available. Several community detection methods have been proposed in the literature to identify communities using the network structure in that case. In this work, we apply the widely used Louvain method [38] to identify the communities if the ground truth information is not known.

The Louvain method [38] uses two-step greedy optimization to optimize the modularity of a community partition of the network. First, the method optimizes the modularity locally to find small communities. In the second step, it merges all nodes belonging to the same community and creates an aggregated network where each node represents a community. These steps are performed iteratively until the maximum modularity is achieved, and the obtained communities are returned.

3.2 Node-Pair Similarity

In a network, two nodes connect with each other if they have some common interest or characteristics, and therefore, a link between a pair of nodes is the first indication that they are similar.
However, these binary/unweighted connections cannot capture the complete information of the system, as each connection is not equally important. A better way of representing the network is with weighted edges, where the edge weight denotes the strength of the connection [39]. For example, in a friendship network, the weight of an edge can be computed based on the intimacy of the relationship or the frequency of communication [40]. The similarity of a node pair $(u, v)$ is denoted as $Sim(u, v)$.

In most real-world networks, the edge-weight data is not available, as it is not feasible to collect all the required information for computing the strength of each connection. In network science, methods have been proposed to compute the similarity of a node pair based on their neighborhood connectivity in the network structure. Some of the well-known methods are the number of common neighbors [41], Jaccard coefficient [11], Adamic-Adar [12], resource allocation [10], hub promoted index [42], and so on, which compute a node-pair similarity based on their local-neighborhood proximity.

In this work, we use the Jaccard coefficient to compute a node pair's similarity in unweighted networks. The Jaccard coefficient for a node pair $(u, v)$ is defined as

$$JC(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|},$$

where $\Gamma(u)$ is the set of neighbors of node $u$.

3.3 NodeSim Network Embedding

For a given graph $G(V, E)$, the network embedding method learns the mapping $\Phi : V \to \mathbb{R}^d$, where $d$ is the dimension of the embedding space. In recent works, the Skip-gram model has been used to generate the network embedding by representing the network as a document where the nodes correspond to the words [29]. In a network, a sampled sequence of nodes is treated the same as an ordered sequence of words in a document. The simplest way to generate the ordered sequence of nodes is by using random walks.
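As an illustration of the Jaccard coefficient defined in Section 3.2, the following minimal sketch computes $JC(u, v)$ from neighbor sets. The function name `jaccard`, the adjacency-set representation, the toy graph, and the convention of returning 0 when both neighbor sets are empty are assumptions made for this example and are not taken from the paper.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient |N(u) & N(v)| / |N(u) | N(v)| on an undirected graph."""
    union = adj[u] | adj[v]
    if not union:
        # Assumption: two nodes without any neighbors get similarity 0.
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


# Hypothetical toy graph stored as adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

# a and c share the neighbors {b, d}; the union of their neighbor sets is {a, b, c, d}.
print(jaccard(adj, "a", "c"))  # 0.5
```

Note that the two end nodes of an edge appear in each other's neighbor sets, so they enter the union but not the intersection.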

In the random walk [43], if the random walker is at node $u$, the probability that the random walker will move to node $v$ is defined as

$$P_{uv} = \begin{cases} 1/\deg(u), & \text{if } (u, v) \in E \\ 0, & \text{otherwise} \end{cases}$$

The random walk method does not consider the network structure properties while sampling the nodes. In recent works, different sampling methodologies have been explored to sample the network to learn feature representations of the network [16, 44]. However, the proposed methods do not consider the meso-scale properties, such as community structure, while exploring the network. In this work, we propose a random walk based sampling method, called NodeSim Random Walk, that captures the neighborhood of the node by considering both the nodes' similarity and the meso-scale community structure of the network.

In network embedding, the focus is to embed similar nodes closer. The simplest way to capture the node similarity during the random walk would be to bias the edge probability based on the similarity of its end nodes. However, this would ignore the meso-scale property of the network that is captured through the community structure. In the NodeSim random walk, the edge probabilities are assigned based on both the similarity of the nodes and the community structure.

In the NodeSim random walk, the unnormalized probability $p_{uv}$ to move from node $u$ to node $v$ is defined as

$$p_{uv} = \begin{cases} \alpha \cdot (Sim(u, v) + 1/\deg(u)), & \text{if } (u, v) \in E \text{ and } C_{(u,v)} = 1 \\ \beta \cdot (Sim(u, v) + 1/\deg(u)), & \text{if } (u, v) \in E \text{ and } C_{(u,v)} = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

The probabilities are normalized for each node $u$ with respect to all of its neighbors. So, the probability to move from node $u$ to node $v$ is computed as $P_{uv} = p_{uv} \cdot w_u$, where $w_u$ is the normalizing factor for node $u$.

In this work, the similarity of the nodes is computed using the Jaccard coefficient. Figure 1 explains the edge probabilities for the NodeSim random walk, where the network has two communities shown by red and blue nodes, and the edges $(u, v)$ and $(u, w)$ are inter and intra-community edges, respectively, which are labeled with $p_{uv}$ and $p_{uw}$, respectively.

Intuitively, parameters $\alpha$ and $\beta$ control how the random walker explores the neighborhood. A higher value of $\alpha$ means that the walker will prefer to sample more similar nodes from the same community, and a higher value of $\beta$ means that the walker will put a higher weight on exploring the inter-community neighborhood of the node.

Fig. 1: NodeSim Random Walk probabilities for inter and intra community nodes.
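The transition rule of Equation 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `transition_probs` and `nodesim_walk`, the adjacency-set and community-dictionary representation, and the toy graph are all hypothetical, and the sketch assumes $\alpha, \beta > 0$ so that the normalization is well defined.

```python
import random


def jaccard(adj, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def transition_probs(adj, community, u, alpha, beta):
    """Normalized probabilities P_uv over the neighbors v of u (Equation 1)."""
    weights = {}
    for v in adj[u]:
        # alpha scales intra-community edges, beta scales inter-community edges.
        scale = alpha if community[u] == community[v] else beta
        weights[v] = scale * (jaccard(adj, u, v) + 1.0 / len(adj[u]))
    total = sum(weights.values())  # equals 1 / w_u in the notation of the text
    return {v: w / total for v, w in weights.items()}


def nodesim_walk(adj, community, start, length, alpha, beta, rng):
    """Sample one walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        probs = transition_probs(adj, community, walk[-1], alpha, beta)
        if not probs:  # isolated node: nowhere to move
            break
        nodes = sorted(probs)
        walk.append(rng.choices(nodes, weights=[probs[v] for v in nodes])[0])
    return walk


# Hypothetical toy graph: triangle {0, 1, 2} in community "A", path 2-3-4 into "B".
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}

probs = transition_probs(adj, community, 2, alpha=1.0, beta=2.0)
walk = nodesim_walk(adj, community, 0, 10, 1.0, 2.0, random.Random(0))
print(probs)
print(walk)
```

In this toy graph, node 2 shares no neighbors with node 3, so with $\alpha = \beta$ the inter-community edge (2, 3) is the least likely move; raising $\beta$ to 2 makes it the most likely one, which is the exploration effect the parameters are meant to control.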

Once the ordered sequences of nodes are generated using the NodeSim random walk, the network embedding is learned using the Skip-gram model [45]. The network embedding method learns a mapping for each node $u \in V$ to a $d$-dimensional embedding space that represents the $d$-dimensional feature representation of node $u$ based on its structural role. The network embedding is denoted as $\Phi : u \in V \to \mathbb{R}^{|V| \times d}$, where $\Phi$ can be considered a matrix of size $|V| \times d$ that is learned by solving a maximum likelihood optimization problem.

In the skip-gram model, given the corpus, the neighborhood of a word is defined using a sliding window over the consecutive words. In networks, we generate the ordered sequence of nodes using sampling methods. For example, if the NodeSim random walker visits the nodes $\{u_1, u_2, \cdots, u_i, \cdots, u_l\}$, they will be referred to as an ordered sequence of nodes. The neighborhood of a node $u_i$ will be defined by considering k − nodes visited before and after node $u_i$ during the sampling, where $k$ is the window size or context of the node. For every node $u_i \in V$, $N_{NS}(u_i) \subset V$ denotes the neighborhood of node $u_i$ in the network that is generated through the NodeSim sampling method with the given context $k$.

In the skip-gram model, the network embedding is learned based on the likelihood of a node $u_i$ co-occurring with other neighborhood nodes within the context $k$ in the NodeSim random walk. We, therefore, optimize the following objective, which maximizes the probability of observing a node in the neighborhood of node $u_i$, given its feature representation $\Phi(u_i)$:

$$\max_{\Phi} \sum_{u_i \in V} \log Pr(N_{NS}(u_i) \mid \Phi(u_i)) \qquad (2)$$

The optimization problem is solved using two assumptions. The first assumption is conditional independence: the probability of observing a node in the neighborhood of the source node is independent of observing any other node in its neighborhood, given the feature representation of the source node, so

$$Pr(N_{NS}(u_i) \mid \Phi(u_i)) = \prod_{u_j \in N_{NS}(u_i)} Pr(u_j \mid \Phi(u_i)) \qquad (3)$$

The second assumption is symmetry, which considers the pairwise similarity of a source node and its neighborhood node in the feature space. Therefore, we estimate the probability of a node $u_j$ co-occurring with node $u_i$ using the softmax function,

$$Pr(u_j \mid \Phi(u_i)) = \frac{\exp(\Phi(u_j) \cdot \Phi(u_i))}{\sum_{v \in V} \exp(\Phi(v) \cdot \Phi(u_i))} \qquad (4)$$

Finally, using both assumptions, the objective function given in Equation 2 is computed as

$$\max_{\Phi} \sum_{u_i \in V} \left[ -\log Z_{u_i} + \sum_{u_j \in N_{NS}(u_i)} \Phi(u_j) \cdot \Phi(u_i) \right] \qquad (5)$$

where $Z_{u_i} = \sum_{v \in V} \exp(\Phi(u_i) \cdot \Phi(v))$ is expensive to compute for large-scale networks, and it is approximated using the negative sampling method [46]. Equation 5 is optimized using SGA (stochastic gradient ascent) over the features $\Phi$ [16].

The complexity of the proposed network embedding method depends on two major steps: (i) identifying the communities and (ii) learning the NodeSim embedding using the Skip-gram model. The complexity of the community detection method and the Skip-gram model is well defined in the literature, so we only briefly discuss the complexity of our method. In our implementation, we have used the Louvain community detection method, which has complexity $O(n \cdot \log n)$. Once the community structure is identified, the complexity to generate the probability distribution for the NodeSim random walk is $O(m)$. The complexity for learning the embedding using the skip-gram model with negative sampling is $O(n k l \gamma (d + d \log(n)))$, where $d$ denotes the number of dimensions, $l$ denotes the walk length, $k$ denotes the window size, and $\gamma$ denotes the number of random walks. So, the overall complexity is $O(n \log n + m + n k l \gamma (d + d \log(n)))$.

3.4 Link-Prediction Method

The link prediction method first generates the feature representation of given node pairs and then trains a logistic regression model using the feature representation of node pairs and their community information.
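The first step of this pipeline, building the input vector of one node pair, can be sketched as follows. The sketch applies one of the four binary operators of this section (Average, Weighted-L1, Weighted-L2, Hadamard) feature by feature and appends the community indicator $C_{(u,v)}$. The function name `pair_features` and the list-based vectors are hypothetical, Weighted-L2 is assumed to be the squared absolute difference, and training the logistic regression model on these vectors is left out.

```python
def pair_features(phi_u, phi_v, same_community, operator="hadamard"):
    """Return f(u, v) = (e(u, v) || C_(u,v)) for one node pair.

    phi_u, phi_v: d-dimensional embeddings given as lists of floats.
    same_community: True if u and v carry the same community label.
    """
    operators = {
        "average": lambda a, b: (a + b) / 2.0,
        "weighted_l1": lambda a, b: abs(a - b),
        # Assumption: Weighted-L2 is the squared absolute difference.
        "weighted_l2": lambda a, b: abs(a - b) ** 2,
        "hadamard": lambda a, b: a * b,
    }
    op = operators[operator]
    e = [op(a, b) for a, b in zip(phi_u, phi_v)]
    # Concatenate the community indicator: 1 for intra, 0 for inter-community pairs.
    return e + [1.0 if same_community else 0.0]


# Hypothetical 2-dimensional embeddings of two nodes from the same community.
phi_u, phi_v = [1.0, 2.0], [3.0, -1.0]
print(pair_features(phi_u, phi_v, True, "hadamard"))  # [3.0, -2.0, 1.0]
print(pair_features(phi_u, phi_v, True, "average"))   # [2.0, 0.5, 1.0]
```

Each node pair therefore yields a vector of length $d + 1$, which would serve as one training row for the classifier.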
The feature representation of a pair of nodes $(u, v)$ is generated by applying a binary operator on the feature representations of nodes $u$ and $v$. The most common operators are listed below.

1. Average: $e_i(u, v) = \frac{\Phi_i(u) + \Phi_i(v)}{2}$

2. Weighted-L1: $e_i(u, v) = |\Phi_i(u) - \Phi_i(v)|$

3. Weighted-L2: $e_i(u, v) = |\Phi_i(u) - \Phi_i(v)|^2$

4. Hadamard: $e_i(u, v) = \Phi_i(u) * \Phi_i(v)$

$\Phi_i(u)$ denotes the $i$-th feature of node $u$, and $e_i(u, v)$ denotes the $i$-th feature of a node pair $(u, v)$. In this way, a $d$-dimensional feature vector is generated for each node pair using the $d$-dimensional feature representation of the corresponding nodes.

For link prediction, a logistic regression model is trained using the features of the node pair and their community information, with the output indicating whether a link exists between the given node pair. The input features for a node pair $(u, v)$ are generated as $f(u, v) = (e(u, v) \,\|\, C_{(u,v)})$, where $\|$ is the concatenation operator and $C_{(u,v)}$ is 1 if both nodes $u$ and $v$ belong to the same community, otherwise 0. The output is 1 or 0 if there exists a link between the given pair of nodes or not, respectively. We have shown results for all four operators applied to compute $e(u, v)$.

In this section, we discuss baseline methods, datasets, and experimental results.

4.1 Baseline Methods

The proposed method is compared with both types of link prediction methods: (i) similarity-based heuristic methods and (ii) network embedding based methods. We compare with the following three heuristic methods based on network structure.

1. Jaccard Coefficient (JC) [11]: $JC(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$

2. Adamic Adar (AA) [12]: $AA(u, v) = \sum_{w \in (\Gamma(u) \cap \Gamma(v))} \frac{1}{\log |\Gamma(w)|}$

3. Resource Allocation (RA) [10]: $RA(u, v) = \sum_{w \in (\Gamma(u) \cap \Gamma(v))} \frac{1}{|\Gamma(w)|}$

We compare our method with the following network embedding based link-prediction methods.

4. DeepWalk [29]: The DeepWalk method learns the network embedding using the skip-gram model on the ordered sequence of nodes generated using a random walk.

5. Node2Vec [16]: Node2Vec is an extension of DeepWalk where the walker has different probabilities for moving to its neighbors, and the probability to move to the next node depends on its distance from the previously visited node. Once the nodes are sampled, the network embedding is learned using the skip-gram model. We have used the code provided by the authors at https://github.com/aditya-grover/node2vec .

6. NECS [33]: Network Embedding with Community Structural information (NECS) uses nonnegative matrix factorization to generate the nodes' embedding, which preserves the high-order proximity. The final network embedding is learned by jointly optimizing the consensus relationship between the nodes' representation and the community structure. We have used the implementation provided by the authors at https://github.com/liyu1990/necs . For the DeepWalk, Node2Vec, and NECS methods, the node-pair embedding is generated using the Hadamard operator, and then the logistic regression model is trained for link prediction as mentioned in these works.

7. Splitter [47]: This network embedding method learns multiple embeddings of each node based on the principled decomposition of the ego-network. These multiple representations of a node denote its embedding with respect to the local communities it belongs to. The implementation is provided by the authors at https://github.com/google-research/google-research/tree/master/graph_embedding/persona . For link prediction, we used the method discussed in their paper. For each node pair $(u, v)$, the similarity score is computed using the dot product of their embeddings. In the persona graph, each node has multiple embeddings, so we compute the similarity score for each combination of their embeddings, and the maximum score is returned as the final similarity score.
Once the final similarity score is computed, the link prediction is performed using the same method as for similarity-based heuristic methods.

4.2 Datasets

We perform experiments on real-world networks, and their details are given in Table 1. Facebook is a small subgraph extracted from the Facebook social networking website. GrQc, Hep-th, Hep-ph, and Astro-ph are co-authorship networks extracted from ArXiv for the General Relativity and Quantum Cosmology, High Energy Physics Theory, High Energy Physics Phenomenology, and Astro Physics research areas, respectively. In all the networks, the communities are detected using the Louvain method, and a community label is assigned to each node based on the community it belongs to. A node pair is referred to as an intra-community node pair if both nodes belong to the same community; otherwise, it is referred to as an inter-community node pair.

Table 1: Datasets
Network: GrQc, Hep-th, Hep-ph, Astro-ph

To generate the training and testing data, we follow the same methodology as used in [47, 16]; however, we maintain the ratio of inter and intra-community links, which is not considered in previous studies. First, we remove of inter-community and of intra-community edges from $E$ uniformly at random and put them in the set $E_{lp}$ that will be used for link prediction. While removing the edges, it is ensured that the network remains connected. The remaining edges are referred to as $E_{ne}$, and $G(V, E_{ne})$ will be used to generate the network embedding.

For the link prediction task, the same number of inter and intra-community node pairs for non-existent links are chosen uniformly at random as we have in $E_{lp}$. These sampled links work as negative cases and are added to the set $E_{lp}$. If a link is formed between a given node pair, then it is referred to as a positive case; otherwise, it is referred to as a negative case. To create train and test data, the node pairs in $E_{lp}$ are split into $E_{train}$ and $E_{test}$, and while splitting, we ensure that the ratio of intra and inter-community node pairs is maintained for both positive and negative cases. The default train and test ratio is ( . . ) if it is not mentioned explicitly. In the heuristic and Splitter link prediction methods, a node pair is predicted positive if the similarity score for this pair is higher than the similarity score of positive train cases.

4.3 Performance Study

First, we compare the NodeSim method with the baselines, and the ROC-AUC value is computed for all test cases, intra-community test cases, and inter-community test cases, as shown in Table 2. The table shows the best results observed for the different parameter settings used in the different methods, and each experiment is repeated five times to compute the average. The dimension of the network embedding is $d = 128$. The results show that the proposed NodeSim method with the Hadamard operator for node pair embedding outperforms all baseline methods for both types of links.
The NECS method for the Astro-ph network did not complete within 48 hours on the server, so its values are not reported.

We further study the performance of our method by varying the ratio of train and test set. The results are shown in Figure 2 for the Hep-ph and Astro-ph networks. Results show that the performance of the proposed method is better compared to the baselines, even if the training ratio is . ; however, the best results are achieved when the ratio of training size is at least . and . for the Hep-ph and Astro-ph networks, respectively.

Table 2: ROC-AUC for link prediction

Method \ Datasets               Facebook  GrQc    Hep-th  Hep-ph  Astro-ph
JC                      Total
                        Intra
                        Inter
AA                      Total
                        Intra
                        Inter
RA                      Total
                        Intra
                        Inter
DeepWalk                Total
                        Intra
                        Inter
Node2Vec                Total
                        Intra
                        Inter
Splitter                Total
                        Intra
                        Inter
NECS                    Total
                        Intra
                        Inter
NodeSim (Average)       Total
                        Intra
                        Inter
NodeSim (Weighted-L1)   Total
                        Intra
                        Inter
NodeSim (Weighted-L2)   Total
                        Intra
                        Inter
NodeSim (Hadamard)      Total   0.857     0.864   0.849   0.924   0.883
                        Intra   0.862     0.874   0.884   0.937   0.907
                        Inter   0.736     0.706   0.749   0.872   0.828

(a) Hep-ph (b) Astro-ph
Fig. 2: Vary train size.

4.4 Parameter sensitivity

The NodeSim embedding method depends on a number of parameters, and we examine the impact of different parameters on the performance of link prediction. Table 3 shows the default values of the different parameters and the ranges that we have considered. The results are shown on the two largest networks, Hep-ph and Astro-ph.

Figure 3 shows the impact of varying $\alpha$ on inter and intra-community link prediction. The results show that α ∼ . achieves the best results. In Figure 4, the results show that β ∼ . − achieves the best results. The results confirm that the inter-community edges should be weighted higher than the intra-community edges during the sampling to predict inter-community links with high accuracy, as expected.

(a) Hep-ph (b) Astro-ph
Fig. 3: Impact of varying α.

Next, we analyze the impact of embedding parameters on link prediction accuracy. Figure 5 shows that the performance of link prediction methods improves with the embedding dimension. In Figure 6, we observe that the performance reduces with the window size, as a larger window size considers distant nodes while generating the local context of the nodes, and these nodes might not be similar. In real-world networks, most of the new links are driven by the triadic closure phenomenon, and it is less probable that a node will be connected to a distant node.

Table 3: Default and varied range values of different network embedding parameters.

Parameter             Default  Range
α
β
Dimension (d)         128      4, 8, 16, 32, 64, 128, 256
Context (k)           5        5, 7, 9, 11, 13, 15
Number of Walks (γ)   10       6, 8, 10, 12, 14, 16, 18, 20
Walk Length (l)       80       40, 50, 60, 70, 80, 90, 100

Fig. 4: Impact of varying β.

(a) Hep-ph (b) Astro-ph
Fig. 5: Impact of varying Dimension (d).

(a) Hep-ph (b) Astro-ph

Fig. 6: Impact of varying context k. (a) Hep-ph (b) Astro-ph

Figures 7 and 8 show the results for varying the number of walks and the walk length. As observed in Figure 7, the intra-community results are less affected by the number of walks than the inter-community results, because the ratio of inter-community context pairs decreases as the number of walks grows, as we expected. Similarly, the inter-community accuracy decreases with the walk length even if the total accuracy improves, as shown in Figure 8 (b).
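To make the notion of inter-community context pairs concrete, the following is a minimal sketch of how their share could be measured for a given window size, number of walks, and walk length. It assumes skip-gram style windowing over each walk and uses a plain uniform random walk as a stand-in; it is not the NodeSim walk, and the helper names and the toy graph are hypothetical.

```python
import random


def uniform_random_walk(adj, start, length, rng):
    """Plain uniform random walk (a stand-in, NOT the NodeSim walk)."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk, window):
    """Yield (center, context) pairs at most `window` steps apart on the walk."""
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]


def inter_community_fraction(adj, community, num_walks, walk_length, window, seed=0):
    """Share of context pairs whose two nodes lie in different communities."""
    rng = random.Random(seed)
    inter = total = 0
    for _ in range(num_walks):
        for node in adj:
            walk = uniform_random_walk(adj, node, walk_length, rng)
            for u, v in context_pairs(walk, window):
                total += 1
                inter += community[u] != community[v]
    return inter / total if total else 0.0


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

frac = inter_community_fraction(adj, community, num_walks=10, walk_length=20, window=5)
print(f"inter-community share of context pairs: {frac:.3f}")
```

Calling `inter_community_fraction` with different `num_walks`, `walk_length`, or `window` values gives a quick way to inspect how the mix of context pairs shifts, although the trend under the NodeSim walk may differ from this uniform stand-in.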

Fig. 7: Impact of varying Number of Walks. (a) Hep-ph (b) Astro-ph
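The sensitivity study of Section 4.4 can be organized as a one-parameter-at-a-time sweep. The sketch below assumes that each parameter is varied over its Table 3 range while the others stay at their defaults, which the per-parameter figures suggest but the text does not state explicitly. The dictionary keys and the function name are hypothetical, and α and β are left out because their values are not legible in this copy of Table 3.

```python
# Defaults and ranges copied from Table 3 (alpha and beta omitted, see above).
DEFAULTS = {"dimension": 128, "context": 5, "num_walks": 10, "walk_length": 80}
RANGES = {
    "dimension": [4, 8, 16, 32, 64, 128, 256],
    "context": [5, 7, 9, 11, 13, 15],
    "num_walks": [6, 8, 10, 12, 14, 16, 18, 20],
    "walk_length": [40, 50, 60, 70, 80, 90, 100],
}


def one_at_a_time(defaults, ranges):
    """Yield (varied_parameter, config), changing one parameter per config."""
    for name, values in ranges.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            yield name, config


configs = list(one_at_a_time(DEFAULTS, RANGES))
for name, config in configs[:3]:
    print(name, config)
```

Each yielded configuration would then be passed to the embedding and link prediction pipeline, once per network.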

Fig. 8: Impact of varying Walk-length.

Fig. 9: Running time for different embedding methods versus network size

4.5 Scalability

We compare the running time of different network embedding based methods on synthetic networks generated using the SCCP (Scale-free networks with Community and Core-Periphery) model [50, 51]. The network generator first creates a seed graph, i.e., a complete graph of m nodes for each community, where m is the average degree of nodes. Next, in each iteration, a new node is added to each community, and the added node builds m connections using the preferential attachment law [13] while ensuring the intra and inter-community edge ratio. The running time is compared on synthetic networks so that the ratio of intra and inter-community edges is maintained as we increase the network size. In our experiments, the ratio is (intra : inter= . 75 : . ), and the average degree of the network is . The total number of communities is in the network having and nodes and in the network having and nodes. All communities in a network are of the same size.

Figure 9 shows the running time of different methods. All experiments are performed on a server with 384GB RAM and 2x Intel Xeon 4110 @ 2.1GHz CPUs. For nodes network, the Splitter code did not finish within 48 hours, and the NECS code was killed due to a memory error on the server. The results show that the proposed method executes faster than all the baselines except DeepWalk as the network size grows. The DeepWalk method is the fastest as it creates node context using a simple random walk and does not consider the structural properties of the network.

4.6 Robustness for Identified Communities

Several community detection methods have been proposed in the literature that consider different network properties while identifying the communities. Therefore, the communities identified by different methods might vary. For some methods, such as Louvain or the greedy method, the returned community structure might differ each time the same method is applied.

We therefore study the performance of the NodeSim embedding method corresponding to different community detection methods. We apply five different community detection methods: Louvain and the four methods described below.

1. Asynchronous Label Propagation [52]: In this method, each node is initialized with a unique community label. In every iteration, each node updates its community label based on its neighbors’ community labels at that time. Thus, the nodes belonging to a strongly connected group are assigned the same community label by consensus through this iterative process.
2. Semi-synchronous Label Propagation [53]: This method is similar to the asynchronous label propagation method, and it combines the advantages of both the synchronous and asynchronous methods. In this method, each node is initially assigned a community label, and at each iteration, a node updates its community label based on the label most used by its neighbors. Ties are broken randomly, and the method is stopped when no node changes its label.
3. Fluid Communities Algorithm [54]: Fluid communities are based on the idea of fluids interacting with each other, such as expanding or pushing each other in an environment. In this method, first, each of the initial c communities is initialized by a random node in the network. Then, in each iteration, each node’s community label is updated based on its community and the community of its neighbors. Once no node changes its community in an iteration, the method is stopped. In our implementation, we set the number of communities approximately close to the number of communities identified by the Louvain method.
4. Greedy Modularity Maximization [1]: This is a well-known method to identify communities by maximizing the modularity in the network. In this method, each node is assigned a community label, and in each step, the two communities whose merger most increases the modularity are merged. The method is stopped when the modularity cannot be further increased by merging two communities.

Fig. 10: ROC-AUC for link prediction corresponding to different community detection methods for Hep-ph network.

After identifying the communities using different methods, the training and testing data is created, as discussed in Section 1. Next, we generate network embeddings by applying different embedding methods and apply the link prediction method. Each method is executed five times, and the average ROC-AUC value for the Hep-ph network is shown in Figure 10.

The results show that the performance of different methods is relatively maintained irrespective of the community detection method.
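As an illustration of the first method in the list above, here is a minimal sketch of asynchronous label propagation in plain Python. It follows the verbal description only (unique initial labels, in-place updates to a most frequent neighbor label, stop when no label changes). The function name, the `max_sweeps` safety cap, the random tie-breaking rule, and the toy graph are assumptions for illustration, not the implementation used in the experiments.

```python
import random
from collections import Counter


def async_label_propagation(adj, seed=0, max_sweeps=10_000):
    """Return a node -> label map; nodes sharing a label form one community."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):  # safety cap (assumption), rarely reached
        changed = False
        rng.shuffle(nodes)  # visit nodes in a fresh random order each sweep
        for node in nodes:
            if not adj[node]:
                continue  # isolated nodes keep their own label
            # Asynchronous: neighbor labels already updated in this sweep are used.
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = sorted(lab for lab, c in counts.items() if c == top)
            if labels[node] not in best:
                labels[node] = rng.choice(best)  # ties broken at random
                changed = True
        if not changed:
            break  # every node already holds a most frequent neighbor label
    return labels


# Hypothetical toy graph: two 4-cliques joined by the single bridge edge 3-4.
toy = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
labels = async_label_propagation(toy, seed=1)
print(labels)
```

On this toy graph every stable labelling assigns one label per clique, although the two cliques may end up sharing a label; which labels survive depends on the random visiting order.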
The NodeSim method outperforms the other methods in all cases, as it considers both the similarity of nodes and their communities while generating the network embedding.

4.7 Case Study

For visualization, we show the NodeSim embedding of the Zachary Karate Network [55] in 2-dimensional space. The network and its embedding are shown in Figure 11, where the nodes having the same color belong to one community. The embedding shows that the nodes belonging to different communities are well separated, while more similar nodes are embedded closer. For example, node 12 is more probable to form inter-community links with node 5 and node 4, so, as observed, they are embedded closer but still well separated.

Fig. 11: (a) Zachary Karate Network with three communities and (b) 2-dimension embedding of Zachary Karate Network using NodeSim Method.
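The synthetic network growth used for the scalability experiments in Section 4.5 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SCCP generator of [50, 51]: each new edge is intra-community with probability `intra_fraction`, so the edge ratio holds only in expectation, and any core-periphery mechanism is omitted. The function name, its parameters, and the example values are hypothetical, and the sketch needs m ≥ 2 so that every seed node has a positive degree.

```python
import random
from itertools import combinations


def grow_community_network(num_communities, nodes_per_community, m, intra_fraction, seed=0):
    """Grow equal-sized communities with degree-proportional attachment (m >= 2)."""
    rng = random.Random(seed)
    edges, degree, community = set(), {}, {}
    members = [[] for _ in range(num_communities)]
    next_id = 0

    def add_edge(u, v):
        edges.add((min(u, v), max(u, v)))
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Seed graph: one complete graph on m nodes per community.
    for c in range(num_communities):
        seed_nodes = list(range(next_id, next_id + m))
        next_id += m
        for v in seed_nodes:
            community[v] = c
        members[c].extend(seed_nodes)
        for u, v in combinations(seed_nodes, 2):
            add_edge(u, v)

    # Growth: in each iteration, one new node joins every community.
    for _ in range(nodes_per_community - m):
        for c in range(num_communities):
            new = next_id
            next_id += 1
            targets = set()
            while len(targets) < m:
                intra = num_communities == 1 or rng.random() < intra_fraction
                if intra:
                    pool = [v for v in members[c] if v not in targets]
                else:
                    pool = [v for k in range(num_communities) if k != c
                            for v in members[k] if v not in targets]
                # Preferential attachment: pick a target proportionally to degree.
                weights = [degree[v] for v in pool]
                targets.add(rng.choices(pool, weights=weights, k=1)[0])
            community[new] = c
            members[c].append(new)
            for t in targets:
                add_edge(new, t)
    return edges, community


# Example values are arbitrary and not taken from the experiments.
edges, community = grow_community_network(3, 20, 4, intra_fraction=0.8, seed=1)
intra_edges = sum(community[u] == community[v] for u, v in edges)
print(len(community), len(edges), intra_edges / len(edges))
```

Rebuilding the candidate pool for every edge keeps the sketch short but makes it quadratic in the number of nodes, so a generator meant for large networks would need a more efficient sampling structure.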

In this work, we have proposed the NodeSim network embedding method, which considers both the nodes’ similarity and their community membership while learning the feature representation of the nodes. The NodeSim embedding method efficiently learns the embedding of diverse nodes, which is further verified using link prediction. We proposed a link prediction method that trains a logistic regression model using nodes’ features and their community information. The results showed that the proposed link-prediction method outperforms baseline methods for both intra-community and inter-community link prediction. We further studied the impact of different parameters and showed that a higher value of β provides higher inter-community link prediction accuracy, as the NodeSim method embeds the more similar diverse nodes closer than the others. The proposed method can be directly applied to weighted networks. We will further extend the proposed method to generate embeddings of dynamic networks to predict inter and intra-community links with high accuracy.

References

1. Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, 2004.
2. Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.

3. Mark Granovetter. The strength of weak ties: A network theory revisited. Sociological theory, pages 201–233, 1983.
4. Akrati Saxena and SRS Iyengar. Evolving models for meso-scale structures. In , pages 1–8. IEEE, 2016.
5. Fabrício Benevenuto, Tiago Rodrigues, Meeyoung Cha, and Virgílio Almeida. Characterizing user behavior in online social networks. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 49–62, 2009.
6. Christo Wilson, Bryce Boe, Alessandra Sala, Krishna PN Puttaswamy, and Ben Y Zhao. User interactions in social networks and their implications. In Proceedings of the 4th ACM European conference on Computer systems, pages 205–218, 2009.
7. Akrati Saxena, Wynne Hsu, Mong Li Lee, Hai Leong Chieu, Lynette Ng, and Loo Nin Teow. Mitigating misinformation in online social network with top-k debunkers and evolving user opinions. In Companion Proceedings of the Web Conference 2020, pages 363–370, 2020.
8. Farzan Masrour, Tyler Wilson, Heng Yan, Pang-Ning Tan, and Abdol Esfahanian. Bursting the filter bubble: Fairness-aware network link prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 841–848, 2020.
9. Cigdem Aslay, Antonis Matakos, Esther Galbrun, and Aristides Gionis. Maximizing the diversity of exposure in a social network. In , pages 863–868. IEEE, 2018.
10. Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.
11. David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.
12. Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.
13. Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.
14. Jorge Valverde-Rebaza and Alneu de Andrade Lopes. Exploiting behaviors of communities of twitter users for link prediction. Social Network Analysis and Mining, 3(4):1063–1074, 2013.
15. Hyoungjun Jeon and Taewhan Kim. Community-adaptive link prediction. In Proceedings of the 2017 International Conference on Data Mining, Communications and Information Technology, pages 1–5, 2017.
16. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
17. Jorge Valverde-Rebaza and Alneu de Andrade Lopes. Structural link prediction using community information on twitter. In , pages
Nature, 453(7191):98–101, 2008.
19. Chao Wang, Venu Satuluri, and Srinivasan Parthasarathy. Local probabilistic models for link prediction. In Seventh IEEE international conference on data mining (ICDM 2007), pages 322–331. IEEE, 2007.
20. Jerry Scripps, Pang-Ning Tan, Feilong Chen, and Abdol-Hossein Esfahanian. A matrix alignment approach for link prediction. In , pages 1–4. IEEE, 2008.
21. Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.
22. Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. Link prediction using supervised learning. In SDM06: workshop on link analysis, counter-terrorism and security, volume 30, pages 798–805, 2006.
23. Zhengdong Lu, Berkant Savas, Wei Tang, and Inderjit S Dhillon. Supervised link prediction using multiple sources. In , pages 923–928. IEEE, 2010.
24. Nesserine Benchettara, Rushed Kanawati, and Céline Rouveirol. A supervised machine learning link prediction approach for academic collaboration recommendation. In Proceedings of the fourth ACM conference on Recommender systems, pages 253–256, 2010.
25. Hisashi Kashima, Tsuyoshi Kato, Yoshihiro Yamanishi, Masashi Sugiyama, and Koji Tsuda. Link propagation: A fast semi-supervised learning algorithm for link prediction. In Proceedings of the 2009 SIAM international conference on data mining, pages 1100–1111. SIAM, 2009.
26. Huan Hu, Chunyu Zhu, Haixin Ai, Li Zhang, Jian Zhao, Qi Zhao, and Hongsheng Liu. Lpi-etslp: lncrna–protein interaction prediction using eigenvalue transformation-based semi-supervised link prediction. Molecular BioSystems, 13(9):1781–1787, 2017.
27. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077, 2015.
28. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1225–1234, 2016.
29. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
30. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, pages 891–900, 2015.

31. Lun Du, Zhicong Lu, Yun Wang, Guojie Song, Yiming Wang, and Wei Chen. Galaxy network embedding: A hierarchical community structure preserving approach. In IJCAI, pages 2079–2085, 2018.
32. Mohammad Mehdi Keikha, Maseud Rahgozar, and Masoud Asadpour. Community aware random walk for network embedding. Knowledge-Based Systems, 148:47–54, 2018.
33. Yu Li, Ying Wang, Tingting Zhang, Jiawei Zhang, and Yi Chang. Learning network embedding with community structural information. In IJCAI, pages 2937–2943, 2019.
34. Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105–1114, 2016.
35. Mingdong Ou, Peng Cui, Fei Wang, Jun Wang, and Wenwu Zhu. Non-transitive hashing with latent similarity components. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 895–904, 2015.
36. Tianshu Lyu, Yuan Zhang, and Yan Zhang. Enhancing the network embedding quality with structural similarity. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 147–156, 2017.
37. Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.
38. Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
39. Akrati Saxena. A survey of evolving models for weighted complex networks based on their dynamics and evolution. arXiv preprint arXiv:2012.08166, 2020.
40. Jukka-Pekka Onnela, Jari Saramäki, Jörkki Hyvönen, Gábor Szabó, M Argollo De Menezes, Kimmo Kaski, Albert-László Barabási, and János Kertész. Analysis of a large-scale weighted network of one-to-one human communication. New Journal of Physics, 9(6):179, 2007.
41. Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001.
42. Erzsébet Ravasz, Anna Lisa Somera, Dale A Mongru, Zoltán N Oltvai, and A-L Barabási. Hierarchical organization of modularity in metabolic networks. science, 297(5586):1551–1555, 2002.
43. László Lovász et al. Random walks on graphs: A survey. Combinatorics, Paul erdos is eighty, 2(1):1–46, 1993.
44. Sam De Winter, Tim Decuypere, Sandra Mitrović, Bart Baesens, and Jochen De Weerdt. Combining temporal aspects of dynamic networks with node2vec for a more efficient dynamic link prediction. In , pages 1234–1241. IEEE, 2018.

45. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
46. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
47. Alessandro Epasto and Bryan Perozzi. Is a single embedding enough? learning node representations that capture multiple social contexts. In The World Wide Web Conference, pages 394–404, 2019.
48. Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in neural information processing systems, pages 539–547, 2012.
49. Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.
50. Akrati Saxena, SRS Iyengar, and Yayati Gupta. Understanding spreading patterns on social networks based on network topology. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 1616–1617, 2015.
51. Yayati Gupta, Akrati Saxena, Debarati Das, and SRS Iyengar. Modeling memetics using edge diversity. In Complex Networks VII, pages 187–198. Springer, 2016.
52. Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3):036106, 2007.
53. Gennaro Cordasco and Luisa Gargano. Community detection via semi-synchronous label propagation algorithms. In , pages 1–8. IEEE, 2010.
54. Ferran Parés, Dario Garcia Gasulla, Armand Vilalta, Jonatan Moreno, Eduard Ayguadé, Jesús Labarta, Ulises Cortés, and Toyotaro Suzumura. Fluid communities: A competitive, scalable and diverse community detection algorithm. In International Conference on Complex Networks and their Applications, pages 229–240. Springer, 2017.
55. Wayne W Zachary. An information flow model for conflict and fission in small groups.